# Naive Bayes
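- A representative probability-based machine learning classification algorithm
- Naive Bayes naively (simply) assumes that the features of the data are mutually independent events
- It plugs these independent events into Bayes' theorem and assigns the label with the highest probability
- In short, Naive Bayes is Bayes' theorem applied under this simplifying independence assumption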

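### Bayes' Theorem
$P(A|B) = P(B|A) * P(A) \div P(B)$

P(A|B) : probability that A occurs given that event B has occurred   
P(B|A) : probability that B occurs given that event A has occurred   
P(A) : probability that event A occurs   
P(B) : probability that event B occurs   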

![](../data/%EB%B2%A0%EC%9D%B4%EC%A6%88%EC%9D%B4%EB%A1%A0.png)

> Total number of outcomes : 12   
> Outcomes in event A : 10   
> Outcomes in event B : 5   
> Outcomes in both A and B : 3   
> $P(A) = 10 / 12$   
> $P(B) = 5 / 12$   
> $P(A|B) = P(A \cap B) / P(B) = (3 / 12) / (5 / 12) = 3 / 5 = 0.6$   
> $P(B|A) = P(A \cap B) / P(A) = (3 / 12) / (10 / 12) = 3 / 10 = 0.3$
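
The counts above can be checked against Bayes' theorem directly. The following is a minimal sketch using exact fractions; the variable names are illustrative and not part of the original notebook.

```python
from fractions import Fraction

# Counts taken from the example above
total, n_a, n_b, n_ab = 12, 10, 5, 3

p_a = Fraction(n_a, total)    # P(A)
p_b = Fraction(n_b, total)    # P(B)
p_ab = Fraction(n_ab, total)  # P(A and B)

p_a_given_b = p_ab / p_b      # P(A|B)
p_b_given_a = p_ab / p_a      # P(B|A)

# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
bayes_rhs = p_b_given_a * p_a / p_b

print(float(p_a_given_b), float(p_b_given_a))
print(bayes_rhs == p_a_given_b)
```

Both routes give the same conditional probability, which is what the theorem states.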

### Applying Naive Bayes to Machine Learning
$P(Label|Feature) = P(Feature|Label) * P(Label) / P(Feature) = P(Feature \cap Label) / P(Feature)$
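
With several features, the naive independence assumption described at the top lets the likelihood factor into one term per feature. The notation $F_1, \dots, F_n$ for the individual features is introduced here for illustration only. Since $P(F_1, \dots, F_n)$ is the same for every label, it can be dropped when comparing labels:

$P(Label|F_1, \dots, F_n) \propto P(Label) \prod_{i=1}^{n} P(F_i|Label)$

The predicted label is the one that maximizes the right-hand side.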

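### Probability that an evening customer at a fried chicken restaurant orders beer
![](../data/%EC%B9%98%ED%82%A8%EC%A7%91_%EB%A7%A5%EC%A3%BC.png)

> $P(\text{beer order}|\text{evening}) = P(\text{beer order} \cap \text{evening}) / P(\text{evening}) = 3 / 5 = 0.6$   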

---
### Iris Classification with Gaussian Naive Bayes

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
import numpy as np

np.random.seed(5)

In [2]:
df = pd.read_csv('../data/iris.csv')
df.head()

,SepalLength,SepalWidth,PetalLength,PetalWidth,Name
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa


In [3]:
csv_data = df.drop('Name', axis=1)
csv_label = df.Name

In [4]:
X_train, X_test, y_train, y_test = train_test_split(csv_data, csv_label, test_size=0.2)

In [5]:
X_train.shape

(120, 4)

In [8]:
y_train.shape

(120,)

In [10]:
model = GaussianNB()
model.fit(X_train, y_train)

GaussianNB()

In [12]:
model.score(X_test, y_test)

0.9
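
To give a feel for what the Gaussian variant estimates, the sketch below fits a normal distribution to one feature of one class, using only the five Iris-setosa petal lengths shown by `df.head()`. This is an illustration with the standard library, not scikit-learn's implementation. `GaussianNB` does this per class and per feature on the full training set, and also adds a small smoothing term to the variance. The query value 4.5 is a hypothetical petal length chosen to be far from the setosa mean.

```python
from statistics import NormalDist, fmean, pvariance

# PetalLength values of the five Iris-setosa rows shown above
setosa_petal_length = [1.4, 1.4, 1.3, 1.5, 1.4]

# Per-class, per-feature mean and (population) variance
mu = fmean(setosa_petal_length)
var = pvariance(setosa_petal_length, mu)

# Gaussian likelihood P(PetalLength | setosa)
likelihood = NormalDist(mu, var ** 0.5)

near = likelihood.pdf(1.4)  # density at the class mean
far = likelihood.pdf(4.5)   # hypothetical value far from the class mean
print(mu, var)
print(near > far)
```

A sample is assigned to the class for which the product of these per-feature densities and the class prior is largest.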

---
### Spam Email Classification with Bernoulli Naive Bayes

In [13]:
# library
from sklearn.naive_bayes import BernoulliNB
from sklearn.feature_extraction.text import CountVectorizer # vectorization (word counts)

In [14]:
train = pd.read_csv('../data/email_train.csv')
test = pd.read_csv('../data/email_test.csv')

In [15]:
train.head()

,email title,spam
0,free game only today,True
1,cheapest flight deak,True
2,limited time offer only today only today,True
3,today meeting schedule,False
4,your flight schedule attached,False


### Data Preparation
sklearn's BernoulliNB classifier works on numeric data, so the True/False values in `spam` are mapped to 1 and 0.

In [17]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 2 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   email title  6 non-null      object
 1   spam         6 non-null      bool  
dtypes: bool(1), object(1)
memory usage: 182.0+ bytes


In [18]:
train['label'] = train.spam.map({True:1, False:0})
train

,email title,spam,label
0,free game only today,True,1
1,cheapest flight deak,True,1
2,limited time offer only today only today,True,1
3,today meeting schedule,False,0
4,your flight schedule attached,False,0
5,your credit card statement,False,0


In [19]:
# Split the training data into inputs and labels
df_x = train['email title']
df_y = train['label']

In [20]:
df_x

0                        free game only today
1                        cheapest flight deak
2    limited time offer only today only today
3                      today meeting schedule
4               your flight schedule attached
5                  your credit card statement
Name: email title, dtype: object

In [21]:
df_y

0    1
1    1
2    1
3    0
4    0
5    0
Name: label, dtype: int64

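BernoulliNB expects its input as fixed-size vectors whose entries are 0 or 1.   
sklearn's CountVectorizer (with `binary=True`) makes this easy.   
CountVectorizer builds a vector with one entry per distinct word that appears in the input data (the 6 emails),   
and represents each email as a vector of that fixed size.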
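
The encoding described above can be sketched in plain Python. This is an illustration of the idea rather than CountVectorizer's actual code. The regular expression is an assumption that mirrors its default behavior of lowercasing and keeping tokens of two or more word characters. The helper names `tokenize` and `encode_binary` are hypothetical.

```python
import re

# The six training email titles from the table above
titles = [
    "free game only today",
    "cheapest flight deak",
    "limited time offer only today only today",
    "today meeting schedule",
    "your flight schedule attached",
    "your credit card statement",
]

def tokenize(text):
    # Lowercase, then keep tokens made of two or more word characters
    return re.findall(r"\b\w\w+\b", text.lower())

# One vector position per distinct word, in alphabetical order
vocabulary = sorted({tok for title in titles for tok in tokenize(title)})
index = {word: i for i, word in enumerate(vocabulary)}

def encode_binary(text):
    # 1 if the word occurs at least once, 0 otherwise; unknown words are ignored
    row = [0] * len(vocabulary)
    for tok in tokenize(text):
        if tok in index:
            row[index[tok]] = 1
    return row

encoded = [encode_binary(title) for title in titles]
print(len(vocabulary))
print(encoded[0])
```

Repeated words such as "only today only today" still produce a single 1 per word, which is the behavior `binary=True` requests.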


In [22]:
cv = CountVectorizer(binary=True)
x_traincv = cv.fit_transform(df_x)

In [29]:
x_traincv

<6x17 sparse matrix of type '<class 'numpy.int64'>'
	with 23 stored elements in Compressed Sparse Row format>

In [52]:
encoded_input = x_traincv.toarray()
encoded_input

array([[0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0],
       [0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0],
       [1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1],
       [0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1]])

In [53]:
encoded_input.shape

(6, 17)

In [32]:
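# The words corresponding to the 17 vector indices
# (get_feature_names() was removed in scikit-learn 1.2; newer versions use cv.get_feature_names_out())
print(cv.get_feature_names())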

['attached', 'card', 'cheapest', 'credit', 'deak', 'flight', 'free', 'game', 'limited', 'meeting', 'offer', 'only', 'schedule', 'statement', 'time', 'today', 'your']


In [33]:
# Words contained in the first email
cv.inverse_transform(encoded_input[0].reshape(1, -1)) # reshape(1, -1): 1 row, column count inferred

[array(['free', 'game', 'only', 'today'], dtype='<U9')]

In [34]:
bnb = BernoulliNB()
y_train = df_y.astype('int')
bnb.fit(x_traincv, y_train)

BernoulliNB()

In [36]:
# Prepare the test data
test

,email title,spam
0,free flight offer,True
1,hey traveler free flight deal,True
2,limited free game iffer,True
3,today flight schedule,False
4,your credit card attached,False
5,free credit card offer only today,False


In [37]:
test['label'] = test['spam'].map({True:1, False:0})
test_x = test['email title']
test_y = test['label']
x_testcv = cv.transform(test_x)

In [39]:
bnb.score(x_testcv, test_y)

0.8333333333333334
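
For intuition about what the Bernoulli model computes, here is a from-scratch sketch trained on the same six titles. It is not scikit-learn's implementation. It assumes Laplace smoothing with `ALPHA = 1.0` (the same value as BernoulliNB's default `alpha`) and a tokenizer that approximates CountVectorizer's default. The helpers `tokens`, `fit`, and `predict` are hypothetical names.

```python
import math
import re

# Training titles and labels (1 = spam, 0 = not spam) from the table above
train = [
    ("free game only today", 1),
    ("cheapest flight deak", 1),
    ("limited time offer only today only today", 1),
    ("today meeting schedule", 0),
    ("your flight schedule attached", 0),
    ("your credit card statement", 0),
]

ALPHA = 1.0  # Laplace smoothing (assumed)

def tokens(text):
    # Set of distinct words: Bernoulli NB only cares about presence/absence
    return set(re.findall(r"\b\w\w+\b", text.lower()))

vocabulary = sorted(set().union(*(tokens(text) for text, _ in train)))

def fit(rows):
    params = {}
    for label in {y for _, y in rows}:
        docs = [tokens(text) for text, y in rows if y == label]
        prior = len(docs) / len(rows)
        # P(word present | label), smoothed over the two outcomes present/absent
        probs = {
            word: (sum(word in doc for doc in docs) + ALPHA) / (len(docs) + 2 * ALPHA)
            for word in vocabulary
        }
        params[label] = (prior, probs)
    return params

def predict(params, text):
    present = tokens(text)
    scores = {}
    for label, (prior, probs) in params.items():
        score = math.log(prior)
        for word in vocabulary:
            # Absent words also contribute, through log(1 - p)
            p = probs[word]
            score += math.log(p) if word in present else math.log(1 - p)
        scores[label] = score
    return max(scores, key=scores.get)

model = fit(train)
print(predict(model, "free flight offer"))
print(predict(model, "today flight schedule"))
```

Unlike the multinomial variant, every vocabulary word contributes to the score, including the ones missing from the title.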

---
### Movie Review Sentiment Classification with Multinomial Naive Bayes

In [41]:
movie = pd.read_csv('../data/naive_movie.csv')
movie

,movie_review,type
0,this is great great movie. I will watch again,positive
1,I like this movie,positive
2,amazing movie in this year,positive
3,cool my boyfriend also said the movie is cool,positive
4,awesome of the awesome movie ever,positive
5,shame I wasted money and time,negative
6,regret on this move. I will never never what m...,negative
7,I do not like this movie,negative
8,I do not like actors in this movie,negative
9,boring boring sleeping movie,negative


In [42]:
movie.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 2 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   movie_review  10 non-null     object
 1   type          10 non-null     object
dtypes: object(2)
memory usage: 288.0+ bytes


In [43]:
# target : type (positive : 1, negative : 0)
movie['label'] = movie.type.map({'positive':1, 'negative':0})
movie

,movie_review,type,label
0,this is great great movie. I will watch again,positive,1
1,I like this movie,positive,1
2,amazing movie in this year,positive,1
3,cool my boyfriend also said the movie is cool,positive,1
4,awesome of the awesome movie ever,positive,1
5,shame I wasted money and time,negative,0
6,regret on this move. I will never never what m...,negative,0
7,I do not like this movie,negative,0
8,I do not like actors in this movie,negative,0
9,boring boring sleeping movie,negative,0


In [44]:
# Split the training data into inputs and labels
df_x = movie['movie_review']
df_y = movie['label']

In [45]:
cv = CountVectorizer()
x_moviecv = cv.fit_transform(df_x)

In [55]:
encoded_input = x_moviecv.toarray()
encoded_input

array([[0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 1, 0, 0, 0, 1, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1],
       [0, 0, 1, 0, 0, 0, 0, 1, 2, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0,
        0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
        0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 2,
        0, 0, 1, 1, 0, 0, 0, 0, 2, 0, 0, 0, 1, 1, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0,
        1, 0, 0, 0, 0, 0, 0, 0

In [56]:
encoded_input.shape

(10, 37)

In [57]:
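# The words corresponding to the 37 vector indices
# (get_feature_names() was removed in scikit-learn 1.2; newer versions use cv.get_feature_names_out())
print(cv.get_feature_names())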

['actors', 'again', 'also', 'amazing', 'and', 'awesome', 'boring', 'boyfriend', 'cool', 'director', 'do', 'ever', 'from', 'great', 'in', 'is', 'like', 'money', 'move', 'movie', 'my', 'never', 'not', 'of', 'on', 'regret', 'said', 'shame', 'sleeping', 'the', 'this', 'time', 'wasted', 'watch', 'what', 'will', 'year']


In [58]:
# Words contained in the first review
cv.inverse_transform(encoded_input[0].reshape(1, -1)) # reshape(1, -1): 1 row, column count inferred

[array(['again', 'great', 'is', 'movie', 'this', 'watch', 'will'],
       dtype='<U9')]
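
Here `CountVectorizer()` is used without `binary=True`, so repeated words keep their counts. The sketch below contrasts the two encodings for the first review in plain Python. The regular expression is an assumption approximating CountVectorizer's default tokenization, which is also why the single-character token "I" does not appear.

```python
import re
from collections import Counter

review = "this is great great movie. I will watch again"

# Lowercase, then keep tokens of two or more word characters
tokens = re.findall(r"\b\w\w+\b", review.lower())

counts = Counter(tokens)                # count encoding: occurrences per word
binary = {word: 1 for word in counts}   # binary encoding: presence only

print(counts["great"], binary["great"])
print(sorted(counts))
```

The count of 2 for "great" corresponds to the 2 in the first row of the encoded array above.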

In [59]:
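# Note: the heading names multinomial Naive Bayes, but this cell fits BernoulliNB,
# which binarizes the word counts (binarize=0.0 by default).
# sklearn.naive_bayes.MultinomialNB would use the counts themselves.
bnb = BernoulliNB()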
y_train = df_y.astype('int')
bnb.fit(x_moviecv, y_train)

BernoulliNB()

In [60]:
# Prepare the test data
movie_test = pd.read_csv('../data/naive_movie_test.csv')
movie_test

,movie_review,type
0,great great great movie ever,positive
1,I like this amazing movie,positive
2,my boyfriend said great movie ever,positive
3,cool cool cool,positive
4,awesome boyfriend said cool movie ever,positive
5,shame shame shame,negative
6,awesome director shame movie boring movie,negative
7,do not like this movie,negative
8,I do not like this boring movie,negative
9,aweful terrible boring movie,negative


In [62]:
movie_test['label'] = movie_test['type'].map({'positive':1, 'negative':0})
test_x = movie_test['movie_review']
test_y = movie_test['label']
x_testcv = cv.transform(test_x)

In [63]:
bnb.score(x_testcv, test_y)

1.0
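
The heading of this section names the multinomial variant, while the cells above fit `BernoulliNB`. As a rough illustration of how a multinomial model would use the word counts, here is a from-scratch sketch. It is not scikit-learn's `MultinomialNB`. It assumes Laplace smoothing with `ALPHA = 1.0`, a tokenizer approximating CountVectorizer's default, and it leaves out training row 6 because its text is truncated in the output above. The helper names are hypothetical.

```python
import math
import re
from collections import Counter

# Training reviews from the table above (1 = positive, 0 = negative);
# row 6 is omitted because its text is truncated in the notebook output
train = [
    ("this is great great movie. I will watch again", 1),
    ("I like this movie", 1),
    ("amazing movie in this year", 1),
    ("cool my boyfriend also said the movie is cool", 1),
    ("awesome of the awesome movie ever", 1),
    ("shame I wasted money and time", 0),
    ("I do not like this movie", 0),
    ("I do not like actors in this movie", 0),
    ("boring boring sleeping movie", 0),
]

ALPHA = 1.0  # Laplace smoothing (assumed)

def tokenize(text):
    return re.findall(r"\b\w\w+\b", text.lower())

# Per-class word counts and per-class document counts
word_counts = {0: Counter(), 1: Counter()}
doc_counts = Counter()
for text, label in train:
    word_counts[label].update(tokenize(text))
    doc_counts[label] += 1
vocabulary = set(word_counts[0]) | set(word_counts[1])

def log_score(text, label):
    total = sum(word_counts[label].values())
    score = math.log(doc_counts[label] / len(train))  # log prior
    for word in tokenize(text):
        if word in vocabulary:  # words unseen in training are ignored
            theta = (word_counts[label][word] + ALPHA) / (total + ALPHA * len(vocabulary))
            score += math.log(theta)  # added once per occurrence
    return score

def predict(text):
    return max((0, 1), key=lambda label: log_score(text, label))

print(predict("great great great movie ever"))
print(predict("do not like this movie"))
```

Each occurrence of a word adds its log-probability again, so "great great great" counts three times. That repetition is what separates this variant from the Bernoulli model.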