# Naive Bayesian Classifier
### Q1. Explain Bayes' rule and describe how the Naive Bayes classifier computes the posterior probability.

- Bayes' rule:   
$P(w_i|x) = \frac{P(x|w_i) P(w_i)}{P(x)} = \frac{P(x|w_i) P(w_i)}{\sum_j P(x|w_j)P(w_j)}$
  - $P(w_i|x)\text{: posterior}\\
P(x|w_i) \text{: likelihood}\\
P(w_i) \text{: prior}\\
P(x) \text{: evidence}$

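A1. Naive Bayes assumes that the features are conditionally independent given the class. 
It first computes the prior of each class and then the likelihood of the input under each class. The posterior of a class is its likelihood times its prior (the numerator) divided by the sum of likelihood times prior over all classes (the denominator).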
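
Written out for a feature vector $x = (x_1, \dots, x_n)$, the procedure in A1 can be sketched as follows. The symbols $x_k$, $n$, and $\hat{w}$ are introduced here for illustration only, and the first line is exactly the conditional-independence assumption:

$$
\begin{aligned}
P(x|w_i) &= \prod_{k=1}^{n} P(x_k|w_i) \\
P(w_i|x) &= \frac{P(w_i)\prod_{k=1}^{n} P(x_k|w_i)}{\sum_j P(w_j)\prod_{k=1}^{n} P(x_k|w_j)} \\
\hat{w} &= \arg\max_i \; P(w_i)\prod_{k=1}^{n} P(x_k|w_i)
\end{aligned}
$$

The denominator is the same for every class, so comparing the numerators is enough to choose the predicted class.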

### Q2. Use Naive Bayes classification to run sentiment analysis on the generated review data below.

In [1]:
# pip install pandas
import pandas as pd
import re

In [2]:
# Create the review data
data = {
    'review': [
        'I love this great product! It exceeded my expectations.',
        'The Worst purchase I have ever made. Completely useless.',
        'It is an average product, nothing special but not terrible either.',
        'Great service and who can help but love this design? Highly recommend!',
        'Terrible experience, I will never buy from this poor brand again.',
        'It’s acceptable, but I expected better service, not just an acceptable one.',
        'Absolutely wonderful! I am very satisfied with this great service.',
        'The quality is poor and it broke after one use. Terrible enough!',
        'Acceptable product for the price, but there are better options out there.',
        'Great quality and fast shipping with wonderful service! I love it'
    ],
    'sentiment': [
        'positive', 'negative', 'neutral', 'positive', 'negative',
        'neutral', 'positive', 'negative', 'neutral', 'positive',
    ]
}
df = pd.DataFrame(data)
df.head()

| | review | sentiment |
|---|---|---|
| 0 | I love this great product! It exceeded my expe... | positive |
| 1 | The Worst purchase I have ever made. Completel... | negative |
| 2 | It is an average product, nothing special but ... | neutral |
| 3 | Great service and who can help but love this d... | positive |
| 4 | Terrible experience, I will never buy from thi... | negative |


In [5]:
# Define the stopword list
stopwords = ['i', 'my', 'am', 'this', 'it', 'its', 'an', 'a', 'the', 'is', 'are', 'and', 'product', 'service']

In [7]:
# Define the text preprocessing function
def preprocess_text(text):
    # Convert to lowercase
    text = text.lower()
    # Remove special characters (anything other than a-z and whitespace)
    text = re.sub(r'[^a-z\s]', '', text)
    # Remove stopwords
    words = text.split()
    filtered_words = [word for word in words if word not in stopwords]
    return ' '.join(filtered_words)

# Preprocess every review
df['review'] = df['review'].apply(preprocess_text)

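Basic preprocessing is now complete!
Next, let's carry out Naive Bayes classification by hand.  
There are two sentences to classify.  
Assuming they have already been preprocessed,   
the first sentence is **'love, great, awesome'** and  
the second sentence is **'terrible, not, never'**. 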

Compute the prior probabilities $P(positive), P(negative), P(neutral)$. 

In [9]:
# Write the code that computes the prior probabilities.

sentiment_counts=df['sentiment'].value_counts()
prior_prob=sentiment_counts/len(df)

print(prior_prob)



sentiment
positive    0.4
negative    0.3
neutral     0.3
Name: count, dtype: float64


In [26]:
df

| | review | sentiment |
|---|---|---|
| 0 | love great exceeded expectations | positive |
| 1 | worst purchase have ever made completely useless | negative |
| 2 | average nothing special but not terrible either | neutral |
| 3 | great who can help but love design highly reco... | positive |
| 4 | terrible experience will never buy from poor b... | negative |
| 5 | acceptable but expected better not just accept... | neutral |
| 6 | absolutely wonderful very satisfied with great | positive |
| 7 | quality poor broke after one use terrible enough | negative |
| 8 | acceptable for price but there better options ... | neutral |
| 9 | great quality fast shipping with wonderful love | positive |


Compute the probabilities needed for the likelihood.  
Example: to classify the first sentence, compute $P(love|positive), P(great|positive), P(awesome|positive)\\
P(love|negative), P(great|negative), P(awesome|negative)\\
P(love|neutral), P(great|neutral), P(awesome|neutral)$.

The word-count vectors produced by CountVectorizer make these probabilities easy to compute.  
Reference: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
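
To make the idea of a count vector concrete, here is a minimal standard-library sketch of what a bag-of-words matrix looks like. It is not CountVectorizer itself: the helper `bag_of_words` is a hypothetical name, and it tokenizes on whitespace only, whereas CountVectorizer by default also lowercases the text and keeps only tokens of two or more alphanumeric characters.

```python
from collections import Counter

def bag_of_words(documents):
    """Return (sorted vocabulary, count matrix) for whitespace-tokenized documents."""
    tokenized = [doc.split() for doc in documents]
    vocabulary = sorted({word for words in tokenized for word in words})
    matrix = []
    for words in tokenized:
        counts = Counter(words)
        # One row per document, one column per vocabulary word.
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix

# Two of the preprocessed reviews from the data above.
docs = [
    'love great exceeded expectations',
    'average nothing special but not terrible either',
]
vocabulary, matrix = bag_of_words(docs)
print(vocabulary)
print(matrix)
```

Each row is one review and each column one vocabulary word, so summing the rows that belong to a class gives the per-class word counts needed for the conditional probabilities.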

In [11]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
review_array = vectorizer.fit_transform(df['review']).toarray()
review_array

array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0,
        0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
        0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
        0, 0, 0, 1, 0, 0, 0, 0, 0, 1],
       [0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
        1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
        0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
       [0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
        1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
        1, 0, 0, 0, 0, 0, 1, 0, 0, 0],
       [0, 2, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 

In [13]:
vectorizer.get_feature_names_out()

array(['absolutely', 'acceptable', 'after', 'again', 'average', 'better',
       'brand', 'broke', 'but', 'buy', 'can', 'completely', 'design',
       'either', 'enough', 'ever', 'exceeded', 'expectations', 'expected',
       'experience', 'fast', 'for', 'from', 'great', 'have', 'help',
       'highly', 'just', 'love', 'made', 'never', 'not', 'nothing', 'one',
       'options', 'out', 'poor', 'price', 'purchase', 'quality',
       'recommend', 'satisfied', 'shipping', 'special', 'terrible',
       'there', 'use', 'useless', 'very', 'who', 'will', 'with',
       'wonderful', 'worst'], dtype=object)

In [15]:
vectorizer.vocabulary_

{'love': 28,
 'great': 23,
 'exceeded': 16,
 'expectations': 17,
 'worst': 53,
 'purchase': 38,
 'have': 24,
 'ever': 15,
 'made': 29,
 'completely': 11,
 'useless': 47,
 'average': 4,
 'nothing': 32,
 'special': 43,
 'but': 8,
 'not': 31,
 'terrible': 44,
 'either': 13,
 'who': 49,
 'can': 10,
 'help': 25,
 'design': 12,
 'highly': 26,
 'recommend': 40,
 'experience': 19,
 'will': 50,
 'never': 30,
 'buy': 9,
 'from': 22,
 'poor': 36,
 'brand': 6,
 'again': 3,
 'acceptable': 1,
 'expected': 18,
 'better': 5,
 'just': 27,
 'one': 33,
 'absolutely': 0,
 'wonderful': 52,
 'very': 48,
 'satisfied': 41,
 'with': 51,
 'quality': 39,
 'broke': 7,
 'after': 2,
 'use': 46,
 'enough': 14,
 'for': 21,
 'price': 37,
 'there': 45,
 'options': 34,
 'out': 35,
 'fast': 20,
 'shipping': 42}

In [17]:
frequency_matrix = pd.DataFrame(review_array, columns = vectorizer.get_feature_names_out())
frequency_matrix = pd.concat([df['sentiment'], frequency_matrix], axis=1)
frequency_matrix

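| | sentiment | absolutely | acceptable | after | again | average | better | brand | broke | but | ... | terrible | there | use | useless | very | who | will | with | wonderful | worst |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | positive | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | negative | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| 2 | neutral | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | positive | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 4 | negative | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 5 | neutral | 0 | 2 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 6 | positive | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 |
| 7 | negative | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 8 | neutral | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | ... | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 9 | positive | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |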


In [30]:
def compute_conditional_probabilities(frequency_matrix):
    word_probabilities = {}
    sentiments = frequency_matrix['sentiment'].unique()

    for sentiment in sentiments:
        sentiment_data = frequency_matrix[frequency_matrix['sentiment'] == sentiment]
        total_words = sentiment_data.iloc[:, 1:].sum().sum()  # total number of words in this sentiment
        word_counts = sentiment_data.iloc[:, 1:].sum()  # frequency of each word
        conditional_probabilities = (word_counts + 1) / (total_words + len(word_counts))  # apply add-1 smoothing
        word_probabilities[sentiment] = conditional_probabilities

    return word_probabilities

word_probabilities = compute_conditional_probabilities(frequency_matrix)

# Print the results
for sentiment, probs in word_probabilities.items():
    print(f"Sentiment: {sentiment}")
    print(probs)
    print()



Sentiment: positive
absolutely      0.0250
acceptable      0.0125
after           0.0125
again           0.0125
average         0.0125
better          0.0125
brand           0.0125
broke           0.0125
but             0.0250
buy             0.0125
can             0.0250
completely      0.0125
design          0.0250
either          0.0125
enough          0.0125
ever            0.0125
exceeded        0.0250
expectations    0.0250
expected        0.0125
experience      0.0125
fast            0.0250
for             0.0125
from            0.0125
great           0.0625
have            0.0125
help            0.0250
highly          0.0250
just            0.0125
love            0.0500
made            0.0125
never           0.0125
not             0.0125
nothing         0.0125
one             0.0125
options         0.0125
out             0.0125
poor            0.0125
price           0.0125
purchase        0.0125
quality         0.0250
recommend       0.0250
satisfied       0.0250
shipping      

Use the independence assumption to compute the likelihood.  
Example for the first sentence: $P(love, great, awesome|positive), P(love, great, awesome|negative), P(love, great, awesome|neutral)$
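
As a worked instance of the factorization, take the smoothed values printed above, $P(love|positive) = 0.0500$ and $P(great|positive) = 0.0625$. How to treat *awesome*, which does not occur in the training vocabulary, is an assumption here: it is simply skipped, which matches how CountVectorizer's `transform` ignores words outside the fitted vocabulary.

$$P(love, great|positive) = P(love|positive)\,P(great|positive) = 0.0500 \times 0.0625 = 0.003125$$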

In [38]:
# Write the code that computes the likelihoods.

def compute_likelihood(review, word_probabilities):
    review_words = vectorizer.transform([review]).toarray().flatten()
    likelihoods = {}

    for sentiment, probs in word_probabilities.items():
        likelihood = 1
        for word, count in zip(vectorizer.get_feature_names_out(), review_words):
            likelihood *= probs.get(word, 1) ** count
        likelihoods[sentiment] = likelihood

    return likelihoods

# Compute the likelihoods for each review
for review_test in df['review']:
    likelihoods = compute_likelihood(review_test, word_probabilities)
    print(f"Likelihoods for the review '{review_test}':")
    print(likelihoods)


Likelihoods for the review 'love great exceeded expectations':
{'positive': 1.9531250000000005e-06, 'negative': 2.7016033691803677e-08, 'neutral': 2.7016033691803677e-08}
Likelihoods for the review 'worst purchase have ever made completely useless':
{'positive': 4.768371582031252e-14, 'negative': 7.286982907143727e-12, 'neutral': 5.692955396206037e-14}
Likelihoods for the review 'average nothing special but not terrible either':
{'positive': 9.536743164062505e-14, 'negative': 1.7078866188618113e-13, 'neutral': 2.1860948721431184e-11}
Likelihoods for the review 'great who can help but love design highly recommend':
{'positive': 1.907348632812501e-14, 'negative': 9.357257390213735e-18, 'neutral': 3.742902956085494e-17}
Likelihoods for the review 'terrible experience will never buy from poor brand again':
{'positive': 7.450580596923832e-18, 'negative': 1.0779560513526225e-14, 'neutral': 1.871451478042747e-17}
Likelihoods for the review 'acceptable but expected better not just acceptable o
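
The likelihoods above are already as small as 1e-18 for nine-word reviews, so a direct product can underflow for longer texts. A common workaround is to sum log-probabilities instead. The sketch below is illustrative only: `log_likelihood` is a hypothetical helper, and the 200-word document is made up to show the underflow.

```python
import math

def log_likelihood(word_probs):
    """Sum of log-probabilities, equal to the log of their product."""
    return sum(math.log(p) for p in word_probs)

# Smoothed P(word|positive) for 'love great exceeded expectations',
# taken from the probabilities printed earlier.
probs = [0.05, 0.0625, 0.025, 0.025]
print(math.exp(log_likelihood(probs)))  # agrees with the direct product up to rounding

# Hypothetical long document: 200 words, each with probability 0.0125.
long_probs = [0.0125] * 200
print(math.prod(long_probs))            # the direct product underflows to 0.0
print(log_likelihood(long_probs))       # the log-likelihood stays finite
```

Because the logarithm is monotonic, comparing log-likelihood plus log-prior across classes selects the same class as comparing the products.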

Using the priors and likelihoods computed above, compute the probability that each target sentence is positive, negative, or neutral, and decide which sentiment it finally belongs to.

In [40]:
import numpy as np
# Write the code that computes the final probabilities.
# First sentence
# P(positive|target_review1)

# P(negative|target_review1)

# P(neutral|target_review1)

# Second sentence
# P(positive|target_review2)

# P(negative|target_review2)

# P(neutral|target_review2)


# Compute the prior probabilities
sentiment_counts = df['sentiment'].value_counts()
prior_prob = sentiment_counts / len(df)

def compute_final_probabilities(review, word_probabilities, prior_prob):
    likelihoods = compute_likelihood(review, word_probabilities)
    final_probabilities = {}

    # Compute the posterior probability for each sentiment
    total_likelihood = sum(likelihoods[sentiment] * prior_prob[sentiment] for sentiment in likelihoods)
    for sentiment in likelihoods:
        posterior_prob = (likelihoods[sentiment] * prior_prob[sentiment]) / total_likelihood
        final_probabilities[sentiment] = posterior_prob

    return final_probabilities

# Final probabilities for the first sentence (here, the first review in df)
target_review1 = df['review'][0]
final_probabilities1 = compute_final_probabilities(target_review1, word_probabilities, prior_prob)

print(f"Final probabilities for the first review '{target_review1}':")
for sentiment, prob in final_probabilities1.items():
    print(f"P({sentiment}|review1) = {prob}")

print("\n")

# Final probabilities for the second sentence (here, the second review in df)
target_review2 = df['review'][1]
final_probabilities2 = compute_final_probabilities(target_review2, word_probabilities, prior_prob)

print(f"Final probabilities for the second review '{target_review2}':")
for sentiment, prob in final_probabilities2.items():
    print(f"P({sentiment}|review2) = {prob}")


Final probabilities for the first review 'love great exceeded expectations':
P(positive|review1) = 0.9796734282160766
P(negative|review1) = 0.010163285891961726
P(neutral|review1) = 0.010163285891961726


Final probabilities for the second review 'worst purchase have ever made completely useless':
P(positive|review2) = 0.008582972279803123
P(negative|review2) = 0.9837316244045365
P(neutral|review2) = 0.007685403315660442


A2-1.   
Classification result for target review 1: positive  
Classification result for target review 2: negative
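
The cell above scores the first two training reviews. As a cross-check, the same procedure can be applied to the two target sentences stated in the task, **'love, great, awesome'** and **'terrible, not, never'**. The following is a standard-library sketch, not the notebook's own code: the `posterior` helper is hypothetical, and skipping out-of-vocabulary words such as *awesome* is an assumption chosen to mirror how `vectorizer.transform` ignores unseen words.

```python
import re
from collections import Counter

# Training data and stopwords copied from the cells above.
reviews = [
    'I love this great product! It exceeded my expectations.',
    'The Worst purchase I have ever made. Completely useless.',
    'It is an average product, nothing special but not terrible either.',
    'Great service and who can help but love this design? Highly recommend!',
    'Terrible experience, I will never buy from this poor brand again.',
    'It’s acceptable, but I expected better service, not just an acceptable one.',
    'Absolutely wonderful! I am very satisfied with this great service.',
    'The quality is poor and it broke after one use. Terrible enough!',
    'Acceptable product for the price, but there are better options out there.',
    'Great quality and fast shipping with wonderful service! I love it',
]
labels = ['positive', 'negative', 'neutral', 'positive', 'negative',
          'neutral', 'positive', 'negative', 'neutral', 'positive']
stopwords = {'i', 'my', 'am', 'this', 'it', 'its', 'an', 'a', 'the',
             'is', 'are', 'and', 'product', 'service'}

def preprocess_text(text):
    """Lowercase, drop non-letters, remove stopwords; return a word list."""
    text = re.sub(r'[^a-z\s]', '', text.lower())
    return [word for word in text.split() if word not in stopwords]

# Per-class word counts, shared vocabulary, and class priors.
class_counts = {}
for review, label in zip(reviews, labels):
    class_counts.setdefault(label, Counter()).update(preprocess_text(review))
vocabulary = set().union(*class_counts.values())
priors = {c: labels.count(c) / len(labels) for c in class_counts}

def posterior(words):
    """Posterior over classes with add-1 smoothing."""
    scores = {}
    for c, counts in class_counts.items():
        total = sum(counts.values())
        score = priors[c]
        for word in words:
            if word in vocabulary:  # assumption: skip out-of-vocabulary words
                score *= (counts[word] + 1) / (total + len(vocabulary))
        scores[c] = score
    evidence = sum(scores.values())
    return {c: s / evidence for c, s in scores.items()}

post1 = posterior(['love', 'great', 'awesome'])
post2 = posterior(['terrible', 'not', 'never'])
print(post1)
print(post2)
```

Under these assumptions the first sentence comes out positive, with a posterior of about 0.93. For the second sentence, negative and neutral are tied at about 0.45 each, because the smoothed word probabilities multiply to the same value in both classes, so this tiny training set cannot separate them.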

Q2-2. What problem did you notice while computing the Naive Bayes probabilities? Briefly research and describe a way to solve it. (Hint: Laplace smoothing)

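A2-2. The zero-probability problem. It arises when a word never appears in the training data for a given class. Naive Bayes then estimates that word's conditional probability as 0, so the product for the whole class becomes 0 regardless of the other words, which severely degrades prediction. Laplace smoothing addresses this by adding 1 to every word count, as if each word had been seen at least once in every class.  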
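
The effect of smoothing can be checked with a small sketch. The function name `smoothed_probability` and the `alpha` parameter are hypothetical, introduced for illustration. The counts for the positive class (26 words in total, 54 vocabulary words) are inferred from the probabilities printed earlier, where an unseen word gets 0.0125 = 1/80.

```python
def smoothed_probability(word_count, total_words, vocab_size, alpha=1.0):
    """Additive estimate of P(word | class); alpha=1 is Laplace (add-1) smoothing."""
    return (word_count + alpha) / (total_words + alpha * vocab_size)

# Positive class in this notebook: 26 words in total, 54 words in the vocabulary.
p_great = smoothed_probability(4, 26, 54)               # 'great' occurs 4 times
p_unseen = smoothed_probability(0, 26, 54)              # word never seen in positive reviews
p_unsmoothed = smoothed_probability(0, 26, 54, alpha=0.0)  # no smoothing

print(p_great)       # 0.0625
print(p_unseen)      # 0.0125
print(p_unsmoothed)  # 0.0
```

With `alpha=0.0` a single unseen word drives the whole class likelihood to zero, while add-1 smoothing keeps it small but positive.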