<a href="https://colab.research.google.com/github/rtajeong/ChatGPT_for_Management/blob/main/7_RAG_examples_with_assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

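# RAG (Retrieval-Augmented Generation)
- A technique that lets a language model such as GPT consult external data when generating its output.
- RAG consists of two steps:
  - Retrieval: search an external data source for relevant information.
  - Generation: produce a natural response based on the retrieved information.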
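
The two steps above can be sketched in a few lines of plain Python. This is only an illustration under simple assumptions: the retriever ranks documents by the number of words they share with the query, the `knowledge_base` list is made up for this example, and the generation step is represented only by the prompt that would be handed to a language model.

```python
import re

def tokenize(text):
    """Lowercase a string and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(query, documents, top_k=2):
    """Retrieval step: rank documents by how many words they share with the query."""
    query_words = tokenize(query)
    scored = [(len(query_words & tokenize(doc)), doc) for doc in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    # Keep only the best top_k documents that share at least one word
    return [doc for score, doc in scored[:top_k] if score > 0]

def build_prompt(query, retrieved_docs):
    """Generation step (input side): place the retrieved context before the question."""
    context = "\n".join(f"- {doc}" for doc in retrieved_docs)
    return f"Context:\n{context}\n\nQuestion: {query}"

# Hypothetical knowledge base, made up for this illustration
knowledge_base = [
    "The latest software update improves performance significantly.",
    "The camera struggles in low light.",
    "Shipping usually takes three business days.",
]

query = "How is the camera in low light?"
docs = retrieve(query, knowledge_base, top_k=1)
prompt = build_prompt(query, docs)
print(prompt)
```

In a real system the word-overlap scoring would usually be replaced by an embedding-based or keyword search index, and `prompt` would be sent to the model.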

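## RAG Example 1
- Comparing the RAG and non-RAG approaches
  - Without RAG:
    - Summarizes using internal data only.
    - Has no access to external data, so it lacks up-to-date information and additional context.
  - With RAG:
    - Combines internal and external data to provide richer context.
    - The external data allows a deeper analysis.

- facebook/bart-large-cnn model:
  - A pre-trained language model based on BART, in a version specialized for **document summarization**.
  - It uses the BART architecture developed by Facebook AI and was trained on the CNN/Daily Mail news dataset, so it is known to be effective at condensing long documents.
  - Trained mainly on English text.
- KoBART:
  - A Korean version of the BART model, released by SKT.
  - Suited to Korean text summarization and generation.
  - Trained on Korean news articles, social media data, and similar sources.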

In [None]:
from transformers import pipeline

# 1. Summarization model alone (no RAG)
def summarize_without_rag(reviews):
    # device=0 places the pipeline on the first GPU
    summarizer = pipeline("summarization", model="facebook/bart-large-cnn", device=0)
    joined_reviews = " ".join(reviews)
    summary = summarizer(joined_reviews, max_length=40, min_length=10, do_sample=False)
    return summary[0]['summary_text']

# 2. With RAG
def summarize_with_rag(reviews, external_data):
    summarizer = pipeline("summarization", model="facebook/bart-large-cnn", device=0)
    # The external data is assumed to have been retrieved already; here it is simply appended to the reviews
    combined_reviews = " ".join(reviews) + " " + " ".join(external_data)
    summary = summarizer(combined_reviews, max_length=40, min_length=10, do_sample=False)
    return summary[0]['summary_text']

# Sample data
internal_reviews = [
    "The battery life is acceptable, but the performance is not very impressive.",
    "The product is lightweight, but the camera quality is below average.",
]

# External data (the additional data retrieved when RAG is used)
external_reviews = [
    "Online reviews highlight that the latest software update improves performance significantly.",
    "Many users suggest the camera can take excellent pictures in bright lighting conditions but struggles in low light."
]

# Run
print("Without RAG:")
print(summarize_without_rag(internal_reviews))

print("\nWith RAG:")
print(summarize_with_rag(internal_reviews, external_reviews))


Without RAG:


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



Device set to use cuda:0
Your max_length is set to 40, but your input_length is only 29. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=14)


The battery life is acceptable, but the performance is not very impressive. The product is lightweight, but camera quality is below average.

With RAG:


Device set to use cuda:0


The camera can take excellent pictures in bright lighting conditions but struggles in low light. The battery life is acceptable, but the performance is not very impressive.


In [None]:
!pip install transformers sentencepiece



In [None]:
from transformers import BartForConditionalGeneration, PreTrainedTokenizerFast

# Load the KoBART model and tokenizer
def load_kobart():
    model = BartForConditionalGeneration.from_pretrained('gogamza/kobart-summarization')
    tokenizer = PreTrainedTokenizerFast.from_pretrained('gogamza/kobart-summarization')
    return model, tokenizer

# Summarize text with KoBART
def summarize_with_kobart(model, tokenizer, text):
    input_ids = tokenizer.encode(text, return_tensors="pt")
    summary_ids = model.generate(input_ids, max_length=50, min_length=10, num_beams=4, early_stopping=True)
    return tokenizer.decode(summary_ids[0], skip_special_tokens=True)

# Without RAG
# (the _ko suffix keeps these from overwriting the pipeline-based functions defined earlier, which Example 2 calls again)
def summarize_without_rag_ko(model, tokenizer, reviews):
    # Join the reviews into a single text
    combined_reviews = " ".join(reviews)
    return summarize_with_kobart(model, tokenizer, combined_reviews)

# With RAG
def summarize_with_rag_ko(model, tokenizer, reviews, external_data):
    # Combine the reviews with the external data
    combined_reviews = " ".join(reviews) + " " + " ".join(external_data)
    return summarize_with_kobart(model, tokenizer, combined_reviews)

# Sample data (kept in Korean because KoBART is a Korean model)
internal_reviews = [
    "배터리 수명은 보통이지만, 성능이 그다지 뛰어나지 않습니다.",
    "제품은 가볍지만, 카메라 화질이 평균 이하입니다."
]

external_reviews = [
    "최신 소프트웨어 업데이트로 성능이 크게 개선되었다는 리뷰가 많습니다.",
    "카메라가 밝은 조명에서는 훌륭한 사진을 찍지만, 어두운 환경에서는 부진하다는 의견이 있습니다."
]

# Load the KoBART model
model, tokenizer = load_kobart()

# Run
print("RAG를 사용하지 않은 경우:")
print(summarize_without_rag_ko(model, tokenizer, internal_reviews))

print("\nRAG를 사용한 경우:")
print(summarize_with_rag_ko(model, tokenizer, internal_reviews, external_reviews))


You passed along `num_labels=3` with an incompatible id to label map: {'0': 'NEGATIVE', '1': 'POSITIVE'}. The number of labels wil be overwritten to 2.
You passed along `num_labels=3` with an incompatible id to label map: {'0': 'NEGATIVE', '1': 'POSITIVE'}. The number of labels wil be overwritten to 2.
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'BartTokenizer'. 
The class this function is called from is 'PreTrainedTokenizerFast'.


RAG를 사용하지 않은 경우:
제품은 가볍지만, 가볍지만, 카메라 화질이 평균 이하인 제품은 가볍지만, 카메라 화질이 평균 이하인 카메라 화질이 평균 이하인 카메라 화질이 평균 이하인 카메라 화질이 평균 이하입니다.

RAG를 사용한 경우:
카메라가 밝은 조명에서는 훌륭한 사진을 찍지만, 어두운 환경에서는 부진하다는 의견이 있습니다.


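- With RAG:
  - Because the external data is included, the summary reflects the latest performance improvements and a specific usage scenario (camera performance in bright lighting).
  - As a result, it provides richer context and more practical information.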

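## RAG Example 2: Improving Customer Support
- Problem definition:
  - To raise customer satisfaction, a company wants to analyze customer inquiries together with past support records and derive improvement measures.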

In [None]:
# Customer inquiry data (input data): messages recently submitted by customers
internal_reviews = [
    "My order arrived late, and the packaging was damaged.",
    "I’m not happy with the battery life of the product.",
    "Customer support is too slow to respond to my queries."
]

# Internal database data (additional data): historical records extracted from the customer support system
external_reviews = [
    "Delivery issues were resolved faster after increasing the logistics team size in Q2.",
    "Battery complaints decreased by 15% after the firmware update in Q3.",
    "Customer support response times improved by 30% after hiring additional staff in Q4."
]

# Run
print("Without RAG:")
print(summarize_without_rag(internal_reviews))

print("\nWith RAG:")
print(summarize_with_rag(internal_reviews, external_reviews))


Without RAG:


Device set to use cuda:0
Your max_length is set to 40, but your input_length is only 38. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=19)


Customer support is too slow to respond to my queries. My order arrived late, and the packaging was damaged. I’m not happy with the battery life of the product.

With RAG:


Device set to use cuda:0


Customer support response times improved by 30% after hiring additional staff in Q4. Delivery issues were resolved faster after increasing the logistics team size in Q2. Battery complaints decreased by 15%


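- With RAG:
  - The summary reflects the past records retrieved from the internal database, giving a deeper analysis.
  - Responses to customer complaints can be derived from the internal data.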

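## Example 3: Analyzing Customer Complaints and Proposing Improvements with RAG
- Goals:
  - Summarize the customer inquiry data to identify the main problems.
  - Combine the company's internal data with external data (real-time APIs, files, and so on) to propose solutions.
  - Compare the RAG-based analysis with a plain summary.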

In [None]:
import pandas as pd
from transformers import BartForConditionalGeneration, BartTokenizer

# Load the BART model (trained on English text)
def load_bart():
    model = BartForConditionalGeneration.from_pretrained('facebook/bart-large-cnn')
    tokenizer = BartTokenizer.from_pretrained('facebook/bart-large-cnn')
    return model, tokenizer

# Summarization function
def summarize_text(model, tokenizer, text, max_length=100):
    input_ids = tokenizer.encode(text, return_tensors="pt", truncation=True, max_length=1024)
    summary_ids = model.generate(input_ids, max_length=max_length, min_length=30, num_beams=4, early_stopping=True)
    return tokenizer.decode(summary_ids[0], skip_special_tokens=True)

# Data loading function
def load_data():
    # Customer inquiry data
    customer_complaints = [
        "The delivery was late, and the package was damaged when it arrived.",
        "The software crashes frequently during use, causing inconvenience.",
        "Customer service took three days to respond to my issue.",
        "The product's screen has a persistent flickering issue.",
        "The payment system failed multiple times during checkout.",
        "Shipping tracking information was inaccurate, causing confusion.",
        "The latest software update caused more bugs than it fixed."
    ]

    # Company internal database
    internal_data = [
        "Delivery delays were reduced by 40% in Q3 after implementing route optimization.",
        "Software stability improved by 30% after the latest patch in September.",
        "Customer service response time improved by 25% in Q4 after introducing an AI chatbot.",
        "Screen flickering issues were addressed by replacing defective panels in Q2.",
        "Payment failures decreased by 50% after server upgrade in Q3.",
        "Shipping tracking accuracy improved by 35% with new tracking software."
    ]

    # External data (CSV and JSON formats)
    #   - real-time API or file data
    external_shipping_issues = pd.DataFrame({
        "Issue": ["Package lost", "Delivery delayed", "Damaged items", "Shipping address errors"],
        "Impact": ["High", "Medium", "High", "Low"],
        "Date": ["2024-01-01", "2024-01-02", "2024-01-03", "2024-01-04"]
    })

    #   - software review data (hard-coded here; in practice fetched from an API)
    external_software_reviews = [
        {"review": "The latest update reduced crashes significantly.", "date": "2024-01-01"},
        {"review": "Some bugs still persist, but performance has improved overall.", "date": "2024-01-02"},
        {"review": "Software is now more stable than before.", "date": "2024-01-03"},
        {"review": "Critical bugs remain unresolved, causing frustration.", "date": "2024-01-04"}
    ]

    return customer_complaints, internal_data, external_shipping_issues, external_software_reviews

# RAG implementation
def rag_analysis(model, tokenizer, complaints, internal_data, external_shipping, external_software):
    # Merge the internal data
    internal_summary = " ".join(internal_data)

    # Merge the external data
    external_shipping_summary = external_shipping.to_csv(index=False)
    external_software_summary = " ".join([item["review"] for item in external_software])

    # Combine with the customer data
    customer_summary = " ".join(complaints)
    combined_text = f"Customer Complaints: {customer_summary} Internal Data: {internal_summary} External Shipping Data: {external_shipping_summary} External Software Reviews: {external_software_summary}"

    # Run the summarization
    return summarize_text(model, tokenizer, combined_text)

# Run
if __name__ == "__main__":
    # Load the data
    customer_complaints, internal_data, external_shipping_issues, external_software_reviews = load_data()

    # Load BART
    model, tokenizer = load_bart()

    # Run the RAG analysis
    summary = rag_analysis(model, tokenizer, customer_complaints, internal_data, external_shipping_issues, external_software_reviews)

    print("=== Final Summary ===")
    print(summary)


=== Final Summary ===
Software crashes frequently during use, causing inconvenience. Payment system failed multiple times during checkout. Shipping tracking information was inaccurate, causing confusion. Customer service took three days to respond to my issue.


In [None]:
summary

'Software crashes frequently during use, causing inconvenience. Payment system failed multiple times during checkout. Shipping tracking information was inaccurate, causing confusion. Customer service took three days to respond to my issue.'

```
Software crashes frequently during use, causing inconvenience. Payment system
failed multiple times during checkout. Shipping tracking information was
inaccurate, causing confusion. Customer service took three days to respond to
my issue.
```

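- The result above looks as if the model merely listed the customer complaints, or summarized only part of the input. This may be because "facebook/bart-large-cnn" does not always condense its input fully and can drop some of the information.
- The result is not necessarily "reasonable"; in particular, the integrated analysis of internal and external data, which is the core of RAG, is barely reflected.
- Let's revise the approach to improve this.
  - facebook/bart-large-cnn is strong at document summarization, but it can miss important information when the input is long or complex, so it likely failed to reflect the effect of combining the data sources.
  - When the text is too long or mixes several sections, the model may not summarize it properly.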
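
One common way to cope with over-long input is to split it into smaller chunks, summarize each chunk, and join the partial summaries. The sketch below illustrates that idea under stated assumptions: chunk size is measured in words (only a rough proxy for model tokens), and `first_sentence` is a hypothetical stand-in for a real summarizer so that the example runs without downloading a model.

```python
def chunk_sentences(sentences, max_words=40):
    """Group sentences into chunks whose word count stays at or below max_words."""
    chunks, current, current_len = [], [], 0
    for sentence in sentences:
        n_words = len(sentence.split())
        # Start a new chunk when adding this sentence would exceed the budget
        if current and current_len + n_words > max_words:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sentence)
        current_len += n_words
    if current:
        chunks.append(" ".join(current))
    return chunks

def summarize_in_chunks(sentences, summarize_fn, max_words=40):
    """Summarize each chunk separately, then join the partial summaries."""
    return " ".join(summarize_fn(chunk) for chunk in chunk_sentences(sentences, max_words))

def first_sentence(text):
    """Stand-in summarizer (hypothetical): keep only the first sentence of a chunk."""
    return text.split(". ")[0].rstrip(".") + "."

sentences = [
    "The delivery was late.",
    "The package was damaged.",
    "The software crashes frequently.",
    "Customer service was slow.",
]
chunks = chunk_sentences(sentences, max_words=8)
combined = summarize_in_chunks(sentences, first_sentence, max_words=8)
print(chunks)
print(combined)
```

With the notebook's model loaded, `summarize_fn` could be `lambda text: summarize_text(model, tokenizer, text)` in place of the stand-in.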

In [None]:
def structured_rag_analysis(model, tokenizer, complaints, internal_data, external_shipping, external_software):
    # Build the text for each section
    customer_section = "Customer Complaints: " + " ".join(complaints)
    internal_section = "Internal Data: " + " ".join(internal_data)
    external_shipping_section = "External Shipping Data: " + external_shipping.to_csv(index=False)
    external_software_section = "External Software Reviews: " + " ".join([item["review"] for item in external_software])

    # Summarize each section separately
    customer_summary = summarize_text(model, tokenizer, customer_section)
    internal_summary = summarize_text(model, tokenizer, internal_section)
    external_shipping_summary = summarize_text(model, tokenizer, external_shipping_section)
    external_software_summary = summarize_text(model, tokenizer, external_software_section)

    # Merge the section summaries
    combined_summary = f"{customer_summary} {internal_summary} {external_shipping_summary} {external_software_summary}"
    return combined_summary

# Run
summary = structured_rag_analysis(
    model, tokenizer,
    customer_complaints, internal_data, external_shipping_issues, external_software_reviews
)
print("=== Final Structured Summary ===")
print(summary)


=== Final Structured Summary ===
The delivery was late, and the package was damaged when it arrived. The payment system failed multiple times during checkout. The latest software update caused more bugs than it fixed. Internal Data: Delivery delays were reduced by 40% in Q3 after implementing route optimization. Software stability improved by 30% after the latest patch in September. Customer service response time improved by 25% after introducing an AI chatbot. Package lost,High,2024-01-01.Delivery delayed,Medium, 2024-1-02.Damaged items, high,1924-02-03.Shipping address errors, low,1923-03-04. Shipping date errors,Low,2023-04-05. The latest update reduced crashes significantly. Some bugs still persist, but performance has improved overall. Software is now more stable than before. Critical bugs remain unresolved, causing frustration.


In [None]:
summary

'The delivery was late, and the package was damaged when it arrived. The payment system failed multiple times during checkout. The latest software update caused more bugs than it fixed. Internal Data: Delivery delays were reduced by 40% in Q3 after implementing route optimization. Software stability improved by 30% after the latest patch in September. Customer service response time improved by 25% after introducing an AI chatbot. Package lost,High,2024-01-01.Delivery delayed,Medium, 2024-1-02.Damaged items, high,1924-02-03.Shipping address errors, low,1923-03-04. Shipping date errors,Low,2023-04-05. The latest update reduced crashes significantly. Some bugs still persist, but performance has improved overall. Software is now more stable than before. Critical bugs remain unresolved, causing frustration.'

```
The delivery was late, and the package was damaged when it arrived. The payment
system failed multiple times during checkout. The latest software update
caused more bugs than it fixed. Internal Data: Delivery delays were reduced by
40% in Q3 after implementing route optimization. Software stability improved
by 30% after the latest patch in September. Customer service response time
improved by 25% after introducing an AI chatbot. Package lost,High,
2024-01-01.Delivery delayed,Medium, 2024-1-02.Damaged items, high,1924-02-03.
Shipping address errors, low,1923-03-04. Shipping date errors,Low,
2023-04-05. The latest update reduced crashes significantly. Some bugs still
persist, but performance has improved overall. Software is now more stable
than before. Critical bugs remain unresolved, causing frustration.
```
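
In the structured summary above, the shipping rows that were passed as raw CSV come back with altered dates (for example 1924-02-03) and an extra "Shipping date errors" row that is not in the input. One possible mitigation, not verified against this model, is to turn each table row into an ordinary sentence before summarizing, since the summarizer was trained on news prose rather than CSV. The sketch below assumes the rows are available as a list of dicts, the shape that `external_shipping_issues.to_dict("records")` would produce; the sentence template is an assumption made for this example.

```python
def rows_to_sentences(records, template):
    """Render each record (a dict) as one sentence using a str.format template."""
    return [template.format(**record) for record in records]

# Same shape as external_shipping_issues.to_dict("records") would give
shipping_records = [
    {"Issue": "Package lost", "Impact": "High", "Date": "2024-01-01"},
    {"Issue": "Delivery delayed", "Impact": "Medium", "Date": "2024-01-02"},
]

# The wording of this template is an assumption, not part of the original notebook
template = "On {Date}, a shipping issue was reported: {Issue} (impact: {Impact})."
sentences = rows_to_sentences(shipping_records, template)
print(" ".join(sentences))
```

The joined sentences could then replace `external_shipping.to_csv(index=False)` when building the shipping section.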

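## RAG Example 4: Generating Improvement Plans
- After summarizing the customer complaints, combine internal and external data to derive an improvement plan for each complaint.
- If the results fall short of expectations, a more capable generative model (GPT-3.5, GPT-4) is needed.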

In [None]:
from transformers import BartForConditionalGeneration, BartTokenizer
import pandas as pd

# Load the BART model
def load_bart():
    model = BartForConditionalGeneration.from_pretrained('facebook/bart-large-cnn')
    tokenizer = BartTokenizer.from_pretrained('facebook/bart-large-cnn')
    return model, tokenizer

# Summarization function
def summarize_text(model, tokenizer, text, max_length=100):
    input_ids = tokenizer.encode(text, return_tensors="pt", truncation=True, max_length=1024)
    summary_ids = model.generate(input_ids, max_length=max_length, min_length=30, num_beams=4, early_stopping=True)
    return tokenizer.decode(summary_ids[0], skip_special_tokens=True)

# Improvement-plan function (basic prompt; the enhanced version below is the one actually called)
def generate_improvement_plan(model, tokenizer, complaint, internal_data, external_data):
    prompt = (
        f"Customer Complaint: {complaint}\n"
        f"Internal Data: {internal_data}\n"
        f"External Data: {external_data}\n"
        "Based on the above, suggest specific improvement plans to address the complaint."
    )
    return summarize_text(model, tokenizer, prompt)

# Load the data
def load_data():
    # Customer inquiry data
    customer_complaints = [
        "The delivery was late, and the package was damaged when it arrived.",
        "The software crashes frequently during use, causing inconvenience.",
        "Customer service took three days to respond to my issue.",
    ]

    # Internal database
    internal_data = [
        "Delivery delays were reduced by 40% in Q3 after implementing route optimization.",
        "Software stability improved by 30% after the latest patch in September.",
        "Customer service response time improved by 25% in Q4 after introducing an AI chatbot."
    ]

    # External data
    external_shipping_issues = pd.DataFrame({
        "Issue": ["Package lost", "Delivery delayed", "Damaged items"],
        "Impact": ["High", "Medium", "High"],
        "Date": ["2024-01-01", "2024-01-02", "2024-01-03"]
    })

    external_software_reviews = [
        {"review": "The latest update reduced crashes significantly.", "date": "2024-01-01"},
        {"review": "Some bugs still persist, but performance has improved overall.", "date": "2024-01-02"},
    ]

    return customer_complaints, internal_data, external_shipping_issues, external_software_reviews

# Improvement-plan function with a more explicit task prompt
def generate_improvement_plan_enhanced(model, tokenizer, complaint, internal_data, external_data):
    prompt = (
        f"Customer Complaint: {complaint}\n"
        f"Internal Data: {internal_data}\n"
        f"External Data: {external_data}\n"
        f"Task: Based on the above data, propose a specific and actionable improvement plan to address the customer complaint. "
        f"Make sure the plan is detailed and directly solves the mentioned issues."
    )
    return summarize_text(model, tokenizer, prompt)

# Run the improvement-plan analysis for every complaint
def improvement_plan_analysis_enhanced(model, tokenizer, complaints, internal_data, external_shipping, external_software):
    plans = []
    for complaint in complaints:
        # Attach internal and external data (all records are passed for every complaint; nothing is filtered by relevance)
        related_internal_data = " ".join(internal_data)
        related_external_data = (
            external_shipping.to_csv(index=False) + " " +
            " ".join([item["review"] for item in external_software])
        )

        # Generate the improvement plan
        plan = generate_improvement_plan_enhanced(model, tokenizer, complaint, related_internal_data, related_external_data)
        plans.append({"Complaint": complaint, "Improvement Plan": plan})

    return plans

# Run
if __name__ == "__main__":
    # Load the data
    customer_complaints, internal_data, external_shipping_issues, external_software_reviews = load_data()

    # Load the BART model
    model, tokenizer = load_bart()

    # Run the improvement-plan analysis
    improvement_plans = improvement_plan_analysis_enhanced(
        model, tokenizer,
        customer_complaints, internal_data, external_shipping_issues, external_software_reviews
    )

    # Print the results
    print("=== Enhanced Improvement Plans ===")
    for plan in improvement_plans:
        print(f"Complaint: {plan['Complaint']}")
        print(f"Improvement Plan: {plan['Improvement Plan']}\n")


=== Enhanced Improvement Plans ===
Complaint: The delivery was late, and the package was damaged when it arrived.
Improvement Plan: Software stability improved by 30% after the latest patch in September. Customer service response time improved by 25% in Q4 after introducing an AI chatbot. The latest update reduced crashes significantly. Some bugs still persist, but performance has improved overall.

Complaint: The software crashes frequently during use, causing inconvenience.
Improvement Plan: Software stability improved by 30% after the latest patch in September. Customer service response time improved by 25% in Q4 after introducing an AI chatbot.

Complaint: Customer service took three days to respond to my issue.
Improvement Plan: Customer service took three days to respond to my issue. Software stability improved by 30% after the latest patch in September. Customer service response time improved by 25% in Q4 after introducing an AI chatbot.



```
Complaint: The delivery was late, and the package was
damaged when it arrived.
Improvement Plan: Software stability improved by 30%
after the latest patch in September. Customer service
response time improved by 25% in Q4 after introducing
an AI chatbot. The latest update reduced crashes
significantly. Some bugs still persist, but performance
has improved overall.

Complaint: The software crashes frequently during use,
causing inconvenience.
Improvement Plan: Software stability improved by 30%
after the latest patch in September. Customer service
response time improved by 25% in Q4 after introducing
an AI chatbot.

Complaint: Customer service took three days to respond
to my issue.
Improvement Plan: Customer service took three days to
respond to my issue. Software stability improved by 30%
after the latest patch in September. Customer service
response time improved by 25% in Q4 after introducing
an AI chatbot.
```
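
In the output above, the plan produced for the delivery complaint talks about software stability and the chatbot, because every internal record is passed along with every complaint. A retrieval step that selects only the records related to each complaint would narrow the context. The following is a minimal sketch of such a step using a hand-rolled TF-IDF cosine similarity; the stopword list and the helper names (`tfidf_vectors`, `retrieve_related`) are assumptions made for this illustration, and a library implementation or embedding model would normally be used instead.

```python
import math
import re
from collections import Counter

# Small illustrative stopword list (an assumption; real systems use a fuller list)
STOPWORDS = {"the", "a", "an", "and", "was", "were", "it", "when", "to", "my",
             "by", "in", "after", "during"}

def tokenize(text):
    """Lowercase, keep alphabetic words, drop stopwords."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]

def tfidf_vectors(texts):
    """Build one {word: tf * idf} dict per text, using a smoothed IDF."""
    token_lists = [tokenize(t) for t in texts]
    n_docs = len(token_lists)
    doc_freq = Counter(w for tokens in token_lists for w in set(tokens))
    idf = {w: math.log((1 + n_docs) / (1 + df)) + 1 for w, df in doc_freq.items()}
    return [{w: tf * idf[w] for w, tf in Counter(tokens).items()} for tokens in token_lists]

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve_related(complaint, records, top_k=1):
    """Return the top_k records most similar to the complaint."""
    vectors = tfidf_vectors([complaint] + records)
    query_vec, record_vecs = vectors[0], vectors[1:]
    ranked = sorted(zip(records, record_vecs),
                    key=lambda pair: cosine(query_vec, pair[1]), reverse=True)
    return [record for record, _ in ranked[:top_k]]

# Data copied from load_data() in this example
customer_complaints = [
    "The delivery was late, and the package was damaged when it arrived.",
    "The software crashes frequently during use, causing inconvenience.",
    "Customer service took three days to respond to my issue.",
]
internal_data = [
    "Delivery delays were reduced by 40% in Q3 after implementing route optimization.",
    "Software stability improved by 30% after the latest patch in September.",
    "Customer service response time improved by 25% in Q4 after introducing an AI chatbot.",
]

for complaint in customer_complaints:
    print(complaint, "->", retrieve_related(complaint, internal_data)[0])
```

The list returned by `retrieve_related` could be joined and passed as `related_internal_data` in place of the full `internal_data`.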

In [None]:
from google.colab import drive
import openai
import os

drive.mount('/content/drive')

api_key_file = '/content/drive/My Drive/Colab Notebooks/api_key.txt'

with open(api_key_file, 'r') as f:
    api_key = f.read().strip()

os.environ['OPENAI_API_KEY'] = api_key

Mounted at /content/drive


In [None]:
import openai
from openai import OpenAI

# Set the OpenAI API key (OpenAI() below also reads OPENAI_API_KEY from the environment)
openai.api_key = api_key

# Generate an improvement plan with GPT
def generate_improvement_plan_gpt(complaint, internal_data, external_data):

    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "developer",
             "content": ("Based on the content you are provided, propose a specific and actionable "
                         "improvement plan to address the complaint. Ensure the plan is detailed and directly "
                         "solves the mentioned issues.")
            },
            {"role": "user",
             "content": f"Customer Complaint: {complaint}\n"
                        f"Internal Data: {internal_data}\n"
                        f"External Data: {external_data}\n"
            }
        ],
        max_tokens=300,  # replies longer than 300 tokens are cut off
        temperature=0.7,
        top_p=1
    )
    return response.choices[0].message.content

# Run
if __name__ == "__main__":
    # Load the data
    customer_complaints, internal_data, external_shipping_issues, external_software_reviews = load_data()

    # Generate an improvement plan for each complaint
    for complaint in customer_complaints:
        internal_summary = " ".join(internal_data)
        external_summary = external_shipping_issues.to_csv(index=False) + " " + " ".join([item["review"] for item in external_software_reviews])
        improvement_plan = generate_improvement_plan_gpt(complaint, internal_summary, external_summary)

        print(f"Complaint: {complaint}")
        print(f"Improvement Plan: {improvement_plan}\n")


Complaint: The delivery was late, and the package was damaged when it arrived.
Improvement Plan: To address the complaint about late delivery and damaged packages, we propose the following specific and actionable improvement plan. This plan targets both the logistics and customer service aspects to ensure customer satisfaction and reduce future incidents.

### Improvement Plan

#### 1. Enhance Packaging and Handling Protocols
   - **Conduct a Root Cause Analysis**: Investigate recent incidents of damaged packages to identify common factors contributing to the damage. Focus on packaging materials, handling processes, and transportation conditions.
   - **Upgrade Packaging Materials**: Transition to more durable and protective packaging materials for fragile and high-value items. Implement a standardized packaging guideline based on item type and fragility.
   - **Employee Training**: Conduct mandatory training sessions for warehouse and delivery personnel focused on proper handling tech

- With direct conversation (the same data pasted into a chat prompt):

Prompt:
```
RAG example prompt: Based on the content you are
provided, propose a specific and actionable improvement
plan to address the complaint in English and Korean.
Ensure the plan is detailed and directly solves the
mentioned issues. "customer complaints": 1. The
delivery was late, and the package was damaged when it
arrived. 2. The software crashes frequently during use,
causing inconvenience., 3. Customer service took three
days to respond to my issue. "internal database": 1.
Delivery delays were reduced by 40% in Q3 after
implementing route optimization. 2. "Software stability
improved by 30% after the latest patch in September. 3.
Customer service response time improved by 25% in Q4
after introducing an AI chatbot. "external software
reviews": [{"review": "The latest update reduced
crashes significantly.", "date": "2024-01-01"},
{"review": "Some bugs still persist, but performance
has improved overall.", "date": "2024-01-02"}],
"external_shipping_issues = pd.DataFrame({"Issue":
["Package lost", "Delivery delayed", "Damaged items"],
"Impact": ["High", "Medium", "High"], "Date":
["2024-01-01", "2024-01-02", "2024-01-03"] }).

```

```
Proposed Improvement Plan (English):

Addressing Delivery Issues:

Action Plan: Implement additional quality checks for
package handling and delivery by introducing advanced
packaging standards and monitoring tools. Collaborate
with logistics partners to enforce best practices for
fragile items. Leverage the route optimization system
already in place to further reduce delays, and expand
its scope to include real-time tracking for customers.

Timeline: Begin within two weeks by introducing
packaging guidelines and monitoring pilot programs.

Expected Outcome: A 20% further reduction in damaged
items and delivery delays by Q2.

Improving Software Stability:

Action Plan: Expedite an additional patch based on
feedback received after the September update. Focus on
resolving the remaining critical bugs mentioned in
external reviews. Establish a dedicated task force to
address software stability, incorporating user-reported
crash scenarios.

Timeline: Complete patch rollout within four weeks.

Expected Outcome: A reduction in crash incidents by an additional 20% by Q3.

Enhancing Customer Service Response Time:

Action Plan: Increase the deployment of the AI chatbot
that improved response times in Q4, enhancing its
capabilities to handle more complex issues. Train
customer service agents to integrate chatbot
resolutions quickly for cases requiring human
intervention.

Timeline: Expand chatbot functionalities and train
staff within one month.

Expected Outcome: Achieve a 50% response time
improvement by Q3, reducing delays to same-day
resolutions for most cases.

```

```
개선 계획 (Korean):

배송 문제 해결:

실행 계획: 포장 기준과 모니터링 도구를 통해 포장 상태를
추가적으로 점검하며 품질을 강화합니다. 물류 파트너와
협력하여 깨지기 쉬운 품목에 대한 최선의 취급 방식을
시행합니다. 현재 운영 중인 경로 최적화 시스템을 실시간 고객
추적 기능으로 확대합니다.

실행 일정: 2주 내로 포장 지침 도입 및 모니터링 파일럿
프로그램 시작.

기대 효과: Q2까지 파손 및 배송 지연 문제를 추가적으로 20% 감소.

소프트웨어 안정성 향상:

실행 계획: 9월 업데이트 이후 받은 피드백을 기반으로 추가
패치를 신속히 배포합니다. 외부 리뷰에서 언급된 남은 주요
버그를 해결하는 데 중점을 둡니다. 사용자 보고 사례를
통합하여 소프트웨어 안정성을 강화하는 전담 팀을 설립합니다.

실행 일정: 4주 이내에 패치 배포 완료.

기대 효과: Q3까지 크래시 발생률을 추가로 20% 감소.

고객 서비스 응답 시간 개선:

실행 계획: Q4 응답 시간을 개선한 AI 챗봇을 확대 배치하며,
복잡한 문제를 처리할 수 있는 기능을 강화합니다. 고객 서비스
요원이 챗봇 해결책을 신속히 통합할 수 있도록 교육합니다.

실행 일정: 1개월 내 챗봇 기능 강화 및 직원 교육 완료.

기대 효과: Q3까지 응답 시간을 50% 개선하여 대부분의
사례에서 당일 해결 달성.

```

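## Example 5: Hiding Sensitive Information
- (The Hugging Face Transformers models did not give satisfactory results here, so this example uses the OpenAI API.)
- The user's prompt does not need to contain sensitive information; the application looks up the relevant details in a database and uses them to generate a detailed response.
- The aim is a safe, accurate answer that does not require the user to disclose sensitive data. Note that in the code below the retrieved values are still placed in the request sent to the model.

- Scenario:
  - Situation:
    - A company's customer support FAQ system.
    - Sensitive information (customer account details) is stored in a database, and the user's question contains only general information.
    - Additional information is looked up in the database to generate a detailed response.
  - Example:
    - User prompt: "I forgot my password. What should I do?"
    - Database lookup:
      - Customer account status: "Account is locked due to multiple failed login attempts."
      - Customer email: "user@example.com"
    - Final response: "Your account is locked due to multiple failed login attempts. Please reset your password using the link sent to user@example.com."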
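
If the email address itself should never leave the application, one option is to send a placeholder to the model and substitute the real value into the reply locally. The sketch below illustrates that idea only; the `{{EMAIL}}` placeholder format, the helper names `mask_sensitive` and `fill_placeholders`, and the canned `llm_reply` string are all assumptions made for this example, and no API call is made.

```python
def mask_sensitive(record, sensitive_keys=("email",)):
    """Replace sensitive values with placeholders such as {{EMAIL}} before prompting."""
    masked = dict(record)
    for key in sensitive_keys:
        if key in masked:
            masked[key] = "{{" + key.upper() + "}}"
    return masked

def fill_placeholders(text, record, sensitive_keys=("email",)):
    """Substitute the real values back into the model's reply, locally."""
    for key in sensitive_keys:
        if key in record:
            text = text.replace("{{" + key.upper() + "}}", record[key])
    return text

record = {
    "email": "user@example.com",
    "status": "Account is locked due to multiple failed login attempts.",
}

masked = mask_sensitive(record)
prompt = (
    "User Query: I forgot my password. What should I do?\n"
    f"Account Status: {masked['status']}\n"
    f"User Email: {masked['email']}\n"
)

# Stand-in for the model's reply (hypothetical); a real call would go to the API here
llm_reply = "Your account is locked. Please use the reset link sent to {{EMAIL}}."
final_reply = fill_placeholders(llm_reply, record)
print(final_reply)
```

The model would need to be instructed to keep the placeholder unchanged in its answer for the substitution step to work.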

In [None]:
import openai

# Set the OpenAI API key
openai.api_key = api_key  # OpenAI API key

# Sample database holding sensitive information
customer_data = {
    "user_001": {
        "email": "user001@example.com",
        "status": "Account is locked due to multiple failed login attempts."
    },
    "user_002": {
        "email": "user002@example.com",
        "status": "Account is active."
    }
}

def retrieve_sensitive_data(user_id, database):
    """
    Look up the sensitive data for the given user ID in the database.
    """
    if user_id in database:
        return database[user_id]
    return {"email": "N/A", "status": "User not found."}

def generate_response_with_openai(user_query, sensitive_info):
    """
    Generate a response with the OpenAI API from the user's question and the sensitive information.
    """
    client = openai.OpenAI()
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "developer",
             "content": "Provide a clear and helpful response to assist the user in resolving their issue."
            },
            {"role": "user",
             "content": f"User Query: {user_query}\n"
                        f"Account Status: {sensitive_info['status']}\n"
                        f"User Email: {sensitive_info['email']}\n"
            }
        ],
        max_tokens=150,
        temperature=0.7,
        top_p=1
    )
    return response.choices[0].message.content

def main():
    # User input
    user_id = "user_001"  # example user ID
    user_query = "I forgot my password. What should I do?"

    # Look up the sensitive information in the database
    sensitive_info = retrieve_sensitive_data(user_id, customer_data)

    # Generate the response with OpenAI
    response = generate_response_with_openai(user_query, sensitive_info)

    # Print the result
    print("=== User Query ===")
    print(user_query)
    print("\n=== Generated Response ===")
    print(response)

if __name__ == "__main__":
    main()


=== User Query ===
I forgot my password. What should I do?

=== Generated Response ===
To reset your password and unlock your account, you can typically follow these steps:

1. Visit the login page of the website or service for which you need to reset your password.
2. Look for the "Forgot Password" or "Reset Password" option on the login page.
3. Enter your email address (user001@example.com) when prompted.
4. You may receive an email with instructions on how to reset your password. Check your email inbox and spam folder.
5. Follow the link provided in the email to reset your password.
6. Create a new strong password and confirm it.
7. Log in with your new password.

If you are unable to reset your password online or encounter any issues, you may need to reach out to
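
The generated response above stops mid-sentence, which is consistent with the `max_tokens=150` limit being reached. A minimal sketch of how that could be detected: chat completion choices carry a `finish_reason` field, and the value `"length"` indicates the token limit was hit. The helper name `was_truncated` and the mock object are assumptions for illustration only; no API call is made here.

```python
from types import SimpleNamespace

def was_truncated(response):
    """Return True if the first choice stopped because it hit the token limit."""
    return response.choices[0].finish_reason == "length"

# Stand-in object shaped like a chat completion response (hypothetical data)
mock_response = SimpleNamespace(
    choices=[
        SimpleNamespace(
            finish_reason="length",
            message=SimpleNamespace(content="...you may need to reach out to"),
        )
    ]
)

if was_truncated(mock_response):
    print("Response was cut off; consider raising max_tokens.")
```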


# Assignment: Implementing an FAQ System

## Implementing a Business Administration FAQ System: A Simple Example

In [None]:
import pandas as pd

# Sample business administration FAQ data
faq_data = {
    "Question": [
        "What is the time value of money?",
        "How do I calculate NPV?",
        "What are the 4Ps in marketing?",
        "What is SWOT analysis?",
        "How do I conduct a competitive analysis?"
    ],
    "Answer": [
        "The time value of money is the concept that money available today is worth more than the same amount in the future due to its earning capacity.",
        "NPV is calculated by taking the present value of future cash flows and subtracting the initial investment.",
        "The 4Ps in marketing are Product, Price, Place, and Promotion.",
        "SWOT analysis identifies Strengths, Weaknesses, Opportunities, and Threats of an organization.",
        "A competitive analysis involves assessing competitors' strengths, weaknesses, and market positioning."
    ]
}

# Create the DataFrame
faq_df = pd.DataFrame(faq_data)


In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Build the TF-IDF model
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(faq_df["Question"])

def find_best_match(user_question, faq_df, tfidf_matrix):
    # Convert the user's question to a TF-IDF vector (uses the fitted global vectorizer)
    user_tfidf = vectorizer.transform([user_question])

    # Compute cosine similarity against every FAQ question
    similarities = cosine_similarity(user_tfidf, tfidf_matrix).flatten()

    # Find the most similar questions
    best_match_idx = similarities.argmax()
    similar_questions_idx = similarities.argsort()[-3:][::-1]  # top 3 similar questions

    # Return the results
    best_match = faq_df.iloc[best_match_idx]
    similar_questions = faq_df.iloc[similar_questions_idx]

    return best_match, similar_questions

# User's question
user_question = "What is SWOT in business?"

# Search and recommend
best_match, similar_questions = find_best_match(user_question, faq_df, tfidf_matrix)

# Print the results
print("=== Best Match ===")
print(f"Q: {best_match['Question']}")
print(f"A: {best_match['Answer']}")

print("\n=== Similar Questions ===")
for idx, row in similar_questions.iterrows():
    print(f"- {row['Question']}")


=== Best Match ===
Q: What is SWOT analysis?
A: SWOT analysis identifies Strengths, Weaknesses, Opportunities, and Threats of an organization.

=== Similar Questions ===
- What is SWOT analysis?
- What are the 4Ps in marketing?
- What is the time value of money?
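
To make the matching step less of a black box, here is a pure-Python sketch of TF-IDF weighting and cosine similarity over the same five sample questions. It is an approximation written for illustration, not scikit-learn's implementation: it assumes lowercase tokens of two or more word characters, the smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2-normalized vectors. All function names here are hypothetical.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Lowercase and keep tokens of two or more word characters
    return re.findall(r"\b\w\w+\b", text.lower())

def build_idf(docs):
    # Smoothed inverse document frequency for every term seen in the corpus
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(tokenize(doc)))
    return {term: math.log((1 + n) / (1 + count)) + 1 for term, count in df.items()}

def tfidf_vector(text, idf):
    # Term count times IDF, ignoring unseen terms, then L2-normalized
    counts = Counter(t for t in tokenize(text) if t in idf)
    vec = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {t: v / norm for t, v in vec.items()} if norm else {}

def cosine(a, b):
    # Both vectors are unit length, so the dot product is the cosine similarity
    return sum(w * b.get(t, 0.0) for t, w in a.items())

questions = [
    "What is the time value of money?",
    "How do I calculate NPV?",
    "What are the 4Ps in marketing?",
    "What is SWOT analysis?",
    "How do I conduct a competitive analysis?",
]

idf = build_idf(questions)
doc_vecs = [tfidf_vector(q, idf) for q in questions]
query_vec = tfidf_vector("What is SWOT in business?", idf)

scores = [cosine(query_vec, v) for v in doc_vecs]
ranking = sorted(range(len(questions)), key=lambda i: scores[i], reverse=True)

for i in ranking[:3]:
    print(f"{scores[i]:.3f}  {questions[i]}")
```

The rare term "swot" carries the highest weight among the shared terms, which is why the SWOT question ranks first even though several questions share "what" and "is" with the query.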


## Assignment

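1. Title: "Building a Business Administration FAQ System: Question Matching and Recommendation"
2. Objectives:
  - Collect frequently asked questions and answers on key business administration topics and build a database from them.
  - Develop a system that returns the question-answer pairs most similar to the question a user enters.
  - Each student extends the data by applying their own business administration knowledge.

3. Requirements
  - Implement a system like the one in the preceding example.
  - Data collection and cleanup:
    - Collect data from a given dataset (e.g., the FAQ sample) or an external API (e.g., StackExchange, Wikipedia).
  - Question search and recommendation:
    - Return the question-answer pair most similar to the user's question.
    - Recommend 3 to 5 similar questions.
  - Report:
    - Describe the design process of the business administration FAQ system and analyze the results.
  - Web interface (optional):
    - Implement a simple web-based FAQ system with Flask or Streamlit.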

In [None]:
!pip install wikipedia-api

Collecting wikipedia-api
  Downloading wikipedia_api-0.7.3.tar.gz (17 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: wikipedia-api
  Building wheel for wikipedia-api (setup.py) ... done
  Created wheel for wikipedia-api: filename=Wikipedia_API-0.7.3-py3-none-any.whl size=14325 sha256=22462f4c467473cee48a8ce2586c392d037afe0d4c2e0bfc8b62c32aa2705dab
  Stored in directory: /root/.cache/pip/wheels/39/c8/cd/ba9a385ad2f6e457226fe2a7dceaeb0c17a19346d2e78ff4be
Successfully built wikipedia-api
Installing collected packages: wikipedia-api
Successfully installed wikipedia-api-0.7.3


In [None]:
# Collect data from Wikipedia

import wikipediaapi
import pandas as pd

topics = ["SWOT analysis",
          "Marketing strategy",
          "Time value of money",
          "Data Science",
          "Artificial Intelligence",
          "Asset",
          "Portfolio Management"]

# Define a descriptive user agent string  (use your email address !)
user_agent = "MyWikipediaApp/1.0 (rtajeong@gmail.com)"

# Initialize the Wikipedia object (user_agent is the first parameter, language the second)
wiki = wikipediaapi.Wikipedia(user_agent=user_agent, language='en', extract_format=wikipediaapi.ExtractFormat.HTML)



data = []
for topic in topics:
    page = wiki.page(topic)  # fetch the page for this topic (exposes summary, text, title)
    if page.exists():
        data.append({"Question": f"What is {topic}?", "Answer": page.summary[:300] + "..."})

# Create the DataFrame
faq_df = pd.DataFrame(data)
print(faq_df)


                           Question  \
0            What is SWOT analysis?   
1       What is Marketing strategy?   
2      What is Time value of money?   
3             What is Data Science?   
4  What is Artificial Intelligence?   
5                    What is Asset?   
6     What is Portfolio Management?   

                                              Answer  
0  <p class="mw-empty-elt">\n</p>\n\n\n<p>In stra...  
1  <p><b>Marketing strategy</b> refers to efforts...  
2  <p>The <b>time value of money</b> refers to th...  
3  <p class="mw-empty-elt">\n</p>\n\n<p><b>Data s...  
4  <p class="mw-empty-elt">\n</p>\n\n<p><b>Artifi...  
5  <p>In financial accounting, an <b>asset</b> is...  
6  <p><b>Portfolio management</b> may refer to:\n...  


In [None]:
from bs4 import BeautifulSoup

def clean_and_format_html(raw_html):
    """
    Strip HTML tags and collapse unnecessary whitespace.
    """
    # Remove HTML tags
    soup = BeautifulSoup(raw_html, "html.parser")
    text = soup.get_text()

    # Collapse runs of whitespace
    formatted_text = " ".join(text.split())
    return formatted_text

faq_df["Answer"] = faq_df["Answer"].apply(clean_and_format_html)
faq_df

| | Question | Answer |
|---|---|---|
| 0 | What is SWOT analysis? | In strategic planning and strategic management... |
| 1 | What is Marketing strategy? | Marketing strategy refers to efforts undertake... |
| 2 | What is Time value of money? | The time value of money refers to the fact tha... |
| 3 | What is Data Science? | Data science is an interdisciplinary academic ... |
| 4 | What is Artificial Intelligence? | Artificial intelligence (AI), in its broadest ... |
| 5 | What is Asset? | In financial accounting, an asset is any resou... |
| 6 | What is Portfolio Management? | Portfolio management may refer to: ... |
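
If BeautifulSoup is not installed, a similar cleanup can be sketched with the standard library's `html.parser`. This is an alternative shown for illustration, not the notebook's method; `strip_html` and `_TextExtractor` are hypothetical names, and the sample string is a shortened stand-in for a Wikipedia summary.

```python
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collect only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def strip_html(raw_html):
    """Remove tags and collapse runs of whitespace into single spaces."""
    parser = _TextExtractor()
    parser.feed(raw_html)
    parser.close()
    return " ".join("".join(parser.parts).split())

sample = (
    '<p class="mw-empty-elt">\n</p>\n\n'
    "<p><b>Data science</b> is an interdisciplinary academic field...</p>"
)
print(strip_html(sample))
```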
