# Naive Bayesian Classifier
### Q1. Explain Bayes' rule and describe how a Naive Bayes classifier computes the posterior probability.

- Bayes' rule:   
$P(w_i|x) = \frac{P(x|w_i) P(w_i)}{P(x)} = \frac{P(x|w_i) P(w_i)}{\sum_j P(x|w_j)P(w_j)}$
  - $P(w_i|x)$: posterior
  - $P(x|w_i)$: likelihood
  - $P(w_i)$: prior
  - $P(x)$: evidence

A1.
-
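A Naive Bayes classifier uses Bayes' rule to compute the probability that an input belongs to a given class. It is "naive" in that it assumes the input features are independent of one another given the class, so the likelihood factors into a product of per-feature terms.
1. Compute the prior ($P(w_i)$).
2. Compute the likelihood ($P(x|w_i)$).
3. Compute the evidence ($P(x)$).
4. Compute the posterior ($P(w_i|x)$).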
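
The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not the notebook's implementation: the function name `posterior`, the class labels `w1`/`w2`, and all numbers are hypothetical values chosen only to show the arithmetic.

```python
from math import prod

def posterior(priors, feature_likelihoods):
    """Return the normalized posterior P(w_i | x) for every class.

    priors: dict mapping class -> P(w_i)                      (step 1)
    feature_likelihoods: dict mapping class -> list of P(x_k | w_i)
    """
    # Step 2: under the naive independence assumption, P(x | w_i) is the
    # product of the per-feature probabilities.
    joint = {c: priors[c] * prod(feature_likelihoods[c]) for c in priors}
    # Step 3: evidence P(x) = sum_j P(x | w_j) P(w_j)
    evidence = sum(joint.values())
    # Step 4: posterior = likelihood * prior / evidence
    return {c: joint[c] / evidence for c in joint}

# Hypothetical two-class example with two features
priors = {"w1": 0.5, "w2": 0.5}
feature_likelihoods = {"w1": [0.8, 0.5], "w2": [0.2, 0.5]}
result = posterior(priors, feature_likelihoods)
print(result)
```

Note that the division in step 4 fails when every class has a joint probability of 0, because the evidence is then 0 as well.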


### Q2. Use Naive Bayes classification to perform sentiment analysis on the generated review data below.

In [1]:
import pandas as pd
import numpy as np
import re
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

In [2]:
# Create the review data
data = {
    'review': [
        'I love this great product! It exceeded my expectations.',
        'The Worst purchase I have ever made. Completely useless.',
        'It is an average product, nothing special but not terrible either.',
        'Great service and who can help but love this design? Highly recommend!',
        'Terrible experience, I will never buy from this poor brand again.',
        'It’s acceptable, but I expected better service, not just an acceptable one.',
        'Absolutely wonderful! I am very satisfied with this great service.',
        'The quality is poor and it broke after one use. Terrible enough!',
        'Acceptable product for the price, but there are better options out there.',
        'Great quality and fast shipping with wonderful service! I love it'
    ],
    'sentiment': [
        'positive', 'negative', 'neutral', 'positive', 'negative',
        'neutral', 'positive', 'negative', 'neutral', 'positive',
    ]
}
df = pd.DataFrame(data)
df.head()

Unnamed: 0,review,sentiment
0,I love this great product! It exceeded my expe...,positive
1,The Worst purchase I have ever made. Completel...,negative
2,"It is an average product, nothing special but ...",neutral
3,Great service and who can help but love this d...,positive
4,"Terrible experience, I will never buy from thi...",negative


In [3]:
# Define the stopword list
stopwords = ['i', 'my', 'am', 'this', 'it', 'its', 'an', 'a', 'the', 'is', 'are', 'and', 'product', 'service']

In [4]:
# Define the text preprocessing function
def preprocess_text(text):
    # Convert to lowercase
    text = text.lower()
    # Remove every character that is not a lowercase letter or whitespace
    text = re.sub(r'[^a-z\s]', '', text)
    # Remove stopwords
    words = text.split()
    filtered_words = [word for word in words if word not in stopwords]
    return ' '.join(filtered_words)

# Preprocess every review
df['review'] = df['review'].apply(preprocess_text)

In [5]:
df['review']

Unnamed: 0,review
0,love great exceeded expectations
1,worst purchase have ever made completely useless
2,average nothing special but not terrible either
3,great who can help but love design highly reco...
4,terrible experience will never buy from poor b...
5,acceptable but expected better not just accept...
6,absolutely wonderful very satisfied with great
7,quality poor broke after one use terrible enough
8,acceptable for price but there better options ...
9,great quality fast shipping with wonderful love
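
To see what each preprocessing step does, the sketch below applies the same logic to review 5 without pandas. The stopword list and function body are copied from the cells above; the variable names `raw` and `cleaned` are only for illustration.

```python
import re

stopwords = ['i', 'my', 'am', 'this', 'it', 'its', 'an', 'a', 'the',
             'is', 'are', 'and', 'product', 'service']

def preprocess_text(text):
    text = text.lower()                    # lowercase
    text = re.sub(r'[^a-z\s]', '', text)   # drop punctuation, incl. the curly apostrophe
    words = text.split()
    return ' '.join(word for word in words if word not in stopwords)

raw = 'It’s acceptable, but I expected better service, not just an acceptable one.'
cleaned = preprocess_text(raw)
print(cleaned)  # acceptable but expected better not just acceptable one
```

Removing the apostrophe turns "It’s" into "its", which is then dropped as a stopword.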


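Basic data preprocessing is now complete!
From here on, let's carry out Naive Bayes classification by hand.  
We want to classify two sentences.  
Assuming they have already been preprocessed,   
the first sentence is **'love, great, awesome'**,  
and the second sentence is **'terrible, not, never'**.

First, compute the prior probabilities $P(positive), P(negative), P(neutral)$.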

In [6]:
# Compute the prior probabilities
total_reviews = len(df) # total number of reviews
prior_positive = df['sentiment'].value_counts()['positive'] / total_reviews
prior_negative = df['sentiment'].value_counts()['negative'] / total_reviews
prior_neutral = df['sentiment'].value_counts()['neutral'] / total_reviews

print(f"P(positive) = {prior_positive}")
print(f"P(negative) = {prior_negative}")
print(f"P(neutral) = {prior_neutral}")


P(positive) = 0.4
P(negative) = 0.3
P(neutral) = 0.3


Next, compute the probabilities needed for the likelihood.  
Example: to classify the first sentence, compute $P(love|positive), P(great|positive), P(awesome|positive)\\
P(love|negative), P(great|negative), P(awesome|negative)\\
P(love|neutral), P(great|neutral), P(awesome|neutral)$.

The word-count vectors produced by CountVectorizer make these probabilities easy to compute.  
Reference: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
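
As a rough picture of what the count vectors contain, here is a simplified pure-Python sketch that builds an alphabetically sorted vocabulary and a count matrix for two of the preprocessed reviews. It is not how `CountVectorizer` is implemented: the real class also lowercases its input and, with default settings, drops single-character tokens. The names `docs`, `vocabulary`, and `count_matrix` are illustrative.

```python
# Two preprocessed reviews from the notebook (rows 0 and 9)
docs = [
    "love great exceeded expectations",
    "great quality fast shipping with wonderful love",
]

# Vocabulary: every distinct word, sorted alphabetically
vocabulary = sorted({word for doc in docs for word in doc.split()})

# Count matrix: one row per document, one column per vocabulary word
count_matrix = [[doc.split().count(word) for word in vocabulary] for doc in docs]

print(vocabulary)
for row in count_matrix:
    print(row)
```

Each column index corresponds to one vocabulary word, which is the same layout as `review_array` and `vectorizer.vocabulary_` in the cells that follow.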

In [7]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
review_array = vectorizer.fit_transform(df['review']).toarray()
review_array

array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0,
        0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
        0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
        0, 0, 0, 1, 0, 0, 0, 0, 0, 1],
       [0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
        1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
        0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
       [0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
        1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
        1, 0, 0, 0, 0, 0, 1, 0, 0, 0],
       [0, 2, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 

In [8]:
feature_names = vectorizer.get_feature_names_out()
feature_names

array(['absolutely', 'acceptable', 'after', 'again', 'average', 'better',
       'brand', 'broke', 'but', 'buy', 'can', 'completely', 'design',
       'either', 'enough', 'ever', 'exceeded', 'expectations', 'expected',
       'experience', 'fast', 'for', 'from', 'great', 'have', 'help',
       'highly', 'just', 'love', 'made', 'never', 'not', 'nothing', 'one',
       'options', 'out', 'poor', 'price', 'purchase', 'quality',
       'recommend', 'satisfied', 'shipping', 'special', 'terrible',
       'there', 'use', 'useless', 'very', 'who', 'will', 'with',
       'wonderful', 'worst'], dtype=object)

In [9]:
vectorizer.vocabulary_

{'love': 28,
 'great': 23,
 'exceeded': 16,
 'expectations': 17,
 'worst': 53,
 'purchase': 38,
 'have': 24,
 'ever': 15,
 'made': 29,
 'completely': 11,
 'useless': 47,
 'average': 4,
 'nothing': 32,
 'special': 43,
 'but': 8,
 'not': 31,
 'terrible': 44,
 'either': 13,
 'who': 49,
 'can': 10,
 'help': 25,
 'design': 12,
 'highly': 26,
 'recommend': 40,
 'experience': 19,
 'will': 50,
 'never': 30,
 'buy': 9,
 'from': 22,
 'poor': 36,
 'brand': 6,
 'again': 3,
 'acceptable': 1,
 'expected': 18,
 'better': 5,
 'just': 27,
 'one': 33,
 'absolutely': 0,
 'wonderful': 52,
 'very': 48,
 'satisfied': 41,
 'with': 51,
 'quality': 39,
 'broke': 7,
 'after': 2,
 'use': 46,
 'enough': 14,
 'for': 21,
 'price': 37,
 'there': 45,
 'options': 34,
 'out': 35,
 'fast': 20,
 'shipping': 42}

In [10]:
frequency_matrix = pd.DataFrame(review_array, columns = vectorizer.get_feature_names_out())
frequency_matrix = pd.concat([df['sentiment'], frequency_matrix], axis=1) # add the sentiment column
frequency_matrix

Unnamed: 0,sentiment,absolutely,acceptable,after,again,average,better,brand,broke,but,...,terrible,there,use,useless,very,who,will,with,wonderful,worst
0,positive,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,negative,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,1
2,neutral,0,0,0,0,1,0,0,0,1,...,1,0,0,0,0,0,0,0,0,0
3,positive,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,1,0,0,0,0
4,negative,0,0,0,1,0,0,1,0,0,...,1,0,0,0,0,0,1,0,0,0
5,neutral,0,2,0,0,0,1,0,0,1,...,0,0,0,0,0,0,0,0,0,0
6,positive,1,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,1,1,0
7,negative,0,0,1,0,0,0,0,1,0,...,1,0,1,0,0,0,0,0,0,0
8,neutral,0,1,0,0,0,1,0,0,1,...,0,2,0,0,0,0,0,0,0,0
9,positive,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,1,0


In [11]:
# Write code that computes the conditional probabilities described above

# ['love', 'great', 'awesome']
# Per-class word frequency function
def calculate_conditional_probability(word, sentiment_class):
    # Total count of the given word within sentiment_class
    word_count_in_class = frequency_matrix[frequency_matrix['sentiment'] == sentiment_class][word].sum()
    # Total count of all words within sentiment_class
    total_words_in_class = frequency_matrix[frequency_matrix['sentiment'] == sentiment_class].iloc[:, 1:].sum().sum()
    # Conditional probability P(word|sentiment_class)
    return word_count_in_class / total_words_in_class

# Target sentence and sentiment classes
target_review1 = ['love', 'great', 'awesome']
sentiment_classes = ['positive', 'negative', 'neutral']

# Store the conditional probabilities
conditional_probabilities = {}

for sentiment in sentiment_classes:
    for word in target_review1:
        if word in feature_names:
            conditional_probabilities[f"P({word}|{sentiment})"] = calculate_conditional_probability(word, sentiment)
        else:
            # If the word is not in the vocabulary, set its probability to 0
            conditional_probabilities[f"P({word}|{sentiment})"] = 0

# Print the results
for key, value in conditional_probabilities.items():
    print(f"{key} = {value}")

P(love|positive) = 0.11538461538461539
P(great|positive) = 0.15384615384615385
P(awesome|positive) = 0
P(love|negative) = 0.0
P(great|negative) = 0.0
P(awesome|negative) = 0
P(love|neutral) = 0.0
P(great|neutral) = 0.0
P(awesome|neutral) = 0
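
The nonzero values can be checked by hand. Each conditional probability is the word's count in the class divided by the total number of word tokens in that class. Assuming the counts implied by the preprocessed reviews (26 tokens across the four positive reviews, with `love` appearing 3 times and `great` 4 times):

$$P(love|positive) = \frac{3}{26} \approx 0.1154, \qquad P(great|positive) = \frac{4}{26} \approx 0.1538$$

`awesome` is not in the vocabulary at all, so the code assigns it probability 0 for every class.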


In [12]:
# Conditional probabilities for the second sentence: ['terrible', 'not', 'never']
def calculate_conditional_probability(word, sentiment_class):
    word_count_in_class = frequency_matrix[frequency_matrix['sentiment'] == sentiment_class][word].sum()
    total_words_in_class = frequency_matrix[frequency_matrix['sentiment'] == sentiment_class].iloc[:, 1:].sum().sum()
    return word_count_in_class / total_words_in_class


target_review2 = ['terrible', 'not', 'never']
sentiment_classes = ['positive', 'negative', 'neutral']


conditional_probabilities = {}

for sentiment in sentiment_classes:
    for word in target_review2:
        if word in feature_names:
            conditional_probabilities[f"P({word}|{sentiment})"] = calculate_conditional_probability(word, sentiment)
        else:
            conditional_probabilities[f"P({word}|{sentiment})"] = 0

for key, value in conditional_probabilities.items():
    print(f"{key} = {value}")

P(terrible|positive) = 0.0
P(not|positive) = 0.0
P(never|positive) = 0.0
P(terrible|negative) = 0.08333333333333333
P(not|negative) = 0.0
P(never|negative) = 0.041666666666666664
P(terrible|neutral) = 0.041666666666666664
P(not|neutral) = 0.08333333333333333
P(never|neutral) = 0.0


Use the independence assumption to compute the likelihood.  
Example for the first sentence: $P(love, great, awesome|positive), P(love, great, awesome|negative), P(love, great, awesome|neutral)$
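
Written out, the independence assumption turns the joint likelihood of a sentence into a product of per-word conditional probabilities. For the positive class:

$$P(love, great, awesome|positive) = P(love|positive)\,P(great|positive)\,P(awesome|positive)$$

With the values printed above, this is approximately $0.1154 \times 0.1538 \times 0 = 0$: a single zero factor is enough to make the whole product zero.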

In [13]:
# Compute the likelihood terms for a given sentence
# ['love', 'great', 'awesome']
def calculate_likelihood(words_to_check, sentiment_class):
    # Returns the list of P(word|class) values; their product is the sentence likelihood
    likelihoods = []
    for word in words_to_check:
        if word in feature_names:
            likelihoods.append(calculate_conditional_probability(word, sentiment_class))
        else:
            # Words outside the vocabulary get probability 0
            likelihoods.append(0)
    return likelihoods

# First sentence
target_review1 = ['love', 'great', 'awesome']
# Second sentence
target_review2 = ['terrible', 'not', 'never']

# Compute and print the likelihood terms for each sentiment class
sentiment_classes = ['positive', 'negative', 'neutral']
print("target_review1")
for sentiment in sentiment_classes:
    likelihoods = calculate_likelihood(target_review1, sentiment)
    print(f"Likelihoods for '{sentiment}': {likelihoods}")

print("target_review2")
for sentiment in sentiment_classes:
    likelihoods = calculate_likelihood(target_review2, sentiment)
    print(f"Likelihoods for '{sentiment}': {likelihoods}")

target_review1
Likelihoods for 'positive': [0.11538461538461539, 0.15384615384615385, 0]
Likelihoods for 'negative': [0.0, 0.0, 0]
Likelihoods for 'neutral': [0.0, 0.0, 0]
target_review2
Likelihoods for 'positive': [0.0, 0.0, 0.0]
Likelihoods for 'negative': [0.08333333333333333, 0.0, 0.041666666666666664]
Likelihoods for 'neutral': [0.041666666666666664, 0.08333333333333333, 0.0]


Using the priors and likelihoods computed above, compute the probability that each target sentence is positive, negative, or neutral, and determine its final sentiment.
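
Because the evidence $P(x)$ is identical for every class, comparing classes only requires the numerator of Bayes' rule. For a sentence made of words $x_1, \dots, x_n$, the decision rule can be written as:

$$\hat{w} = \arg\max_{w_i} \; P(w_i) \prod_{k=1}^{n} P(x_k|w_i)$$

The values computed this way are unnormalized scores rather than true posteriors; they would need to be divided by their sum to add up to 1.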

In [14]:
# Compute the prior probability
def calculate_prior_probability(sentiment_class):
    return len(df[df['sentiment'] == sentiment_class]) / len(df)

# Compute the (unnormalized) posterior: prior times the product of the per-word likelihoods
def calculate_posterior_probability(target_review, sentiment_class):
    prior_prob = calculate_prior_probability(sentiment_class)
    likelihood = calculate_likelihood(target_review, sentiment_class)
    posterior_prob = prior_prob
    for word_prob in likelihood:
        posterior_prob *= word_prob
    return posterior_prob

posterior_probs_target_review1 = {sentiment: calculate_posterior_probability(target_review1, sentiment) for sentiment in df['sentiment'].unique()}
posterior_probs_target_review2 = {sentiment: calculate_posterior_probability(target_review2, sentiment) for sentiment in df['sentiment'].unique()}


# Print the results
print("For target review 1 (['love', 'great', 'awesome']):")
print(f"P(positive|target_review1): {posterior_probs_target_review1.get('positive', 0)}")
print(f"P(negative|target_review1): {posterior_probs_target_review1.get('negative', 0)}")
print(f"P(neutral|target_review1): {posterior_probs_target_review1.get('neutral', 0)}")

print("\nFor target review 2 (['terrible', 'not', 'never']):")
print(f"P(positive|target_review2): {posterior_probs_target_review2.get('positive', 0)}")
print(f"P(negative|target_review2): {posterior_probs_target_review2.get('negative', 0)}")
print(f"P(neutral|target_review2): {posterior_probs_target_review2.get('neutral', 0)}")


For target review 1 (['love', 'great', 'awesome']):
P(positive|target_review1): 0.0
P(negative|target_review1): 0.0
P(neutral|target_review1): 0.0

For target review 2 (['terrible', 'not', 'never']):
P(positive|target_review2): 0.0
P(negative|target_review2): 0.0
P(neutral|target_review2): 0.0


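A2-1.   
- Target review 1: the per-word likelihoods point toward positive, but the result is inconclusive because the computed posterior is 0 for every class.
- Target review 2: the per-word likelihoods point toward negative, but the result is inconclusive because the computed posterior is 0 for every class.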

Q2-2. What problem did you notice while computing the Naive Bayes probabilities? Briefly research and describe a way to solve it. (Hint: Laplace smoothing)

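A2-2.
The Naive Bayes classifier ran into a sparsity (zero-frequency) problem: if any word in the sentence never appears in a class, its conditional probability is 0, and the whole product for that class becomes 0. Here every class is missing at least one word of each target sentence, so all posteriors are 0.
Laplace smoothing can be used to solve this problem.
Laplace (add-one) smoothing adds 1 to each word's count, and the vocabulary size to the denominator, so that no conditional probability is 0 even when a word never appears in a class.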
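
A minimal sketch of add-one smoothing on this data follows. It does not use the notebook's DataFrame. Instead it hard-codes counts consistent with the printed probabilities (for example $P(love|positive) = 3/26$) and the 54-word vocabulary. The helper names `smoothed_probability` and `smoothed_scores` are hypothetical, and treating the out-of-vocabulary word `awesome` as a count of 0 while keeping the vocabulary size at 54 is an assumption of this sketch.

```python
# Class statistics consistent with the notebook's outputs
priors = {"positive": 0.4, "negative": 0.3, "neutral": 0.3}
total_words = {"positive": 26, "negative": 24, "neutral": 24}
vocab_size = 54  # number of entries in feature_names

# Nonzero counts of the target words per class; anything missing counts as 0
word_counts = {
    "positive": {"love": 3, "great": 4},
    "negative": {"terrible": 2, "never": 1},
    "neutral": {"terrible": 1, "not": 2},
}

def smoothed_probability(word, sentiment, alpha=1):
    """P(word|sentiment) with add-alpha smoothing (alpha=1 is Laplace)."""
    count = word_counts[sentiment].get(word, 0)
    return (count + alpha) / (total_words[sentiment] + alpha * vocab_size)

def smoothed_scores(words):
    """Unnormalized posterior (prior * product of smoothed likelihoods) per class."""
    scores = {}
    for sentiment, prior in priors.items():
        score = prior
        for word in words:
            score *= smoothed_probability(word, sentiment)
        scores[sentiment] = score
    return scores

scores1 = smoothed_scores(["love", "great", "awesome"])
scores2 = smoothed_scores(["terrible", "not", "never"])
print(max(scores1, key=scores1.get))  # positive
print(scores2)
```

With smoothing, every score is strictly positive, and the first sentence is classified as positive. For the second sentence, the negative and neutral scores come out equal under these counts (both above positive), which suggests that three words and ten reviews are too little evidence to separate those two classes.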