# Naive Bayes with Mushrooms

Build a model that uses 11 attributes of a mushroom to decide whether it is edible or poisonous.

In [1]:
import pandas as pd
# Read the CSV file into a DataFrame
data = pd.read_csv('mushrooms.csv')  

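## info()
Mainly shows how many rows there are, the data type of each column (int, float, ...), whether any null values exist, and how much memory the data uses.
## describe()
Mainly shows the mean and distribution of the data and whether the data is skewed. For object (categorical) columns such as the ones here, it reports count, unique, top, and freq instead.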

In [2]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8124 entries, 0 to 8123
Data columns (total 12 columns):
type                      8124 non-null object
cap_shape                 8124 non-null object
cap_surface               8124 non-null object
cap_color                 8124 non-null object
odor                      8124 non-null object
stalk_shape               8124 non-null object
stalk_color_above_ring    8124 non-null object
stalk_color_below_ring    8124 non-null object
ring_number               8124 non-null object
ring_type                 8124 non-null object
population                8124 non-null object
habitat                   8124 non-null object
dtypes: object(12)
memory usage: 761.7+ KB


In [3]:
data

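| | type | cap_shape | cap_surface | cap_color | odor | stalk_shape | stalk_color_above_ring | stalk_color_below_ring | ring_number | ring_type | population | habitat |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | p | x | s | n | p | e | w | w | o | p | s | u |
| 1 | e | x | s | y | a | e | w | w | o | p | n | g |
| 2 | e | b | s | w | l | e | w | w | o | p | n | m |
| 3 | p | x | y | w | p | e | w | w | o | p | s | u |
| 4 | e | x | s | g | n | t | w | w | o | e | a | g |
| 5 | e | x | y | y | a | e | w | w | o | p | n | g |
| 6 | e | b | s | w | a | e | w | w | o | p | n | m |
| 7 | e | b | y | w | l | e | w | w | o | p | s | m |
| 8 | p | x | y | w | p | e | w | w | o | p | v | g |
| 9 | e | b | s | y | a | e | w | w | o | p | s | m |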


In [4]:
data.describe()

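| | type | cap_shape | cap_surface | cap_color | odor | stalk_shape | stalk_color_above_ring | stalk_color_below_ring | ring_number | ring_type | population | habitat |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 8124 | 8124 | 8124 | 8124 | 8124 | 8124 | 8124 | 8124 | 8124 | 8124 | 8124 | 8124 |
| unique | 2 | 6 | 4 | 10 | 9 | 2 | 9 | 9 | 3 | 5 | 6 | 7 |
| top | e | x | y | n | n | t | w | w | o | p | v | d |
| freq | 4208 | 3656 | 3244 | 2284 | 3528 | 4608 | 4464 | 4384 | 7488 | 3968 | 4040 | 3148 |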


## Split input and output
Split `data` into two DataFrames: the input (the 11 attribute columns) and the output (the `type` column).

In [5]:
#x:input
x = data.iloc[:,1:12]

#y:output
y = data.loc[:,['type']]

## sklearn: Naive Bayes Classifier

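Naive Bayes needs nominal data converted to numeric data. `le.fit_transform()` finds the distinct labels in a list, sorts them, and maps each label to a code from 0 to n-1, where n is the number of distinct labels. `zip()` iterates over several arrays at once, taking one element from each argument and packing them into a tuple.

- LabelEncoder official documentation: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html
- zip() examples: https://www.w3schools.com/python/ref_func_zip.asp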

The final input is the `features` array, and the label is `Y_type_label`.
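As a minimal sketch of how `fit_transform()` and `zip()` behave, the example below encodes two toy columns. The values are hypothetical and not read from `mushrooms.csv`; `.tolist()` is used only so the results are plain Python values.

```python
from sklearn.preprocessing import LabelEncoder

# Toy nominal columns (hypothetical values, not taken from mushrooms.csv)
cap_shape = ["x", "b", "x", "f"]
odor = ["p", "a", "l", "a"]

le = LabelEncoder()

# fit_transform() sorts the distinct labels and maps them to 0..n-1
shape_codes = le.fit_transform(cap_shape).tolist()
shape_classes = le.classes_.tolist()  # sorted labels seen in cap_shape

# Refitting the same encoder on another column replaces the learned classes
odor_codes = le.fit_transform(odor).tolist()

# zip() pairs the i-th element of each column into one row tuple
rows = list(zip(shape_codes, odor_codes))

print(shape_classes)
print(rows)
```

Because a single encoder is refitted for every column, only the mapping of the last fitted column remains available afterwards.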

In [6]:
from sklearn import preprocessing

# Convert each attribute to numeric labels with le.fit_transform()
le = preprocessing.LabelEncoder()

columns = []
for c in x:
    encoded = le.fit_transform(data[c])
    columns.append(encoded)

# Convert the type column to numeric labels
# type: e -> 0, p -> 1 (LabelEncoder sorts the classes)
Y_type_label=le.fit_transform(y.type)

# Combine the attributes: zip() iterates over all columns at once
# and packs them into a list of 8124 rows
# aaa = [i.tolist() for i in columns] # Is there a faster way???
feature=list(zip(columns[0], columns[1], columns[2], columns[3], 
                 columns[4], columns[5], columns[6], columns[7], 
                 columns[8], columns[9], columns[10]))


# Convert the list to an array
import numpy as np
features=np.asarray(feature)

In [7]:
features

array([[5, 2, 4, ..., 4, 3, 5],
       [5, 2, 9, ..., 4, 2, 1],
       [0, 2, 8, ..., 4, 2, 3],
       ...,
       [2, 2, 4, ..., 4, 1, 2],
       [3, 3, 4, ..., 0, 4, 2],
       [5, 2, 4, ..., 4, 1, 2]], dtype=int64)
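The comment in In [6] asks whether there is a faster way to assemble the rows. One possible answer, offered here as a sketch and not as the notebook's method, is `np.column_stack`, which builds the same 2-D array without going through a Python list of tuples. The small `columns` list below is a hypothetical stand-in for the encoded columns.

```python
import numpy as np

# Hypothetical encoded columns standing in for the `columns` list of In [6]
columns = [np.array([5, 5, 0]), np.array([2, 2, 2]), np.array([4, 9, 8])]

# zip-based assembly, equivalent to the notebook's approach
via_zip = np.asarray(list(zip(*columns)))

# column_stack places each 1-D array as one column of a 2-D array
via_stack = np.column_stack(columns)

print(via_stack.shape)
print(np.array_equal(via_zip, via_stack))
```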

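## Train the model: training set
Gaussian naive Bayes assumes that, within each class, every feature follows a normal (Gaussian) distribution. The model is trained with the `fit()` method.

- Gaussian Naive Bayes official documentation: https://scikit-learn.org/stable/modules/naive_bayes.html#gaussian-naive-bayes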
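To illustrate what `fit()` estimates, here is a small sketch on toy numeric data (hypothetical values, not the mushroom features). It checks that the fitted `theta_` attribute matches the per-class feature means, which are the centers of the Gaussians the model assumes.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Toy numeric data: two features, two classes (hypothetical values)
X = np.array([[1.0, 10.0], [2.0, 12.0], [3.0, 11.0],
              [7.0, 20.0], [8.0, 22.0], [9.0, 21.0]])
y = np.array([0, 0, 0, 1, 1, 1])

model = GaussianNB().fit(X, y)

# theta_ holds the per-class mean of every feature
class_means = np.array([X[y == c].mean(axis=0) for c in model.classes_])
print(np.allclose(model.theta_, class_means))

# class_prior_ holds the class frequencies seen during fit
print(model.class_prior_)
```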

In [8]:
# Import the Gaussian Naive Bayes model
from sklearn.naive_bayes import GaussianNB
model = GaussianNB()

# Train the model on the training set
model.fit(features, Y_type_label)

GaussianNB(priors=None)

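## Test the model: test set

To test the model, feed data to the `predict()` method. Here the training data is reused as the testing data, so `features` is passed in. The resulting accuracy is about 0.83, lower than the 0.98 obtained with Weka. The lower score suggests the model fits the training data less tightly, but because this evaluation reuses the training data, it cannot by itself confirm or rule out overfitting.
- classification_report official documentation: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html
- classification_report explanation (in Chinese): https://www.cnblogs.com/178mz/p/8558435.html
- confusion_matrix official documentation: https://scikit-learn.org/stable/modules/model_evaluation.html#confusion-matrix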
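Since the evaluation here reuses the training data, a held-out split is one common way to get a less optimistic estimate. The sketch below is an assumption about how that could look, not part of the original notebook. It uses `train_test_split` on synthetic data, where `feats` and `labels` are hypothetical stand-ins for `features` and `Y_type_label`.

```python
import numpy as np
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for `features` and `Y_type_label` (hypothetical data)
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=200)
# Three integer-coded features loosely tied to the label
feats = np.column_stack(
    [labels * 2 + rng.integers(0, 3, size=200) for _ in range(3)])

# Hold out 30% of the rows; stratify keeps the class ratio similar in both parts
X_train, X_test, y_train, y_test = train_test_split(
    feats, labels, test_size=0.3, random_state=0, stratify=labels)

model = GaussianNB().fit(X_train, y_train)

train_acc = metrics.accuracy_score(y_train, model.predict(X_train))
test_acc = metrics.accuracy_score(y_test, model.predict(X_test))
print("train accuracy:", train_acc)
print("test accuracy: ", test_acc)
```

A large gap between the two scores would hint at overfitting; similar scores would suggest the model generalizes about as well as it fits.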

In [9]:
expected = Y_type_label
predicted = model.predict(features)

from sklearn import metrics
print(metrics.classification_report(expected, predicted))

             precision    recall  f1-score   support

          0       0.87      0.78      0.82      4208
          1       0.79      0.87      0.83      3916

avg / total       0.83      0.83      0.83      8124



In [10]:
print(metrics.confusion_matrix(expected, predicted))

[[3296  912]
 [ 493 3423]]
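The figures in the classification report can be recomputed from the confusion matrix printed above. The short sketch below does so, assuming scikit-learn's convention that rows are true classes and columns are predicted classes.

```python
import numpy as np

# Confusion matrix printed above: rows = true class, columns = predicted class
cm = np.array([[3296, 912],
               [493, 3423]])

accuracy = np.trace(cm) / cm.sum()        # correct predictions / all samples
precision = np.diag(cm) / cm.sum(axis=0)  # per predicted class (column sums)
recall = np.diag(cm) / cm.sum(axis=1)     # per true class (row sums)

print(round(float(accuracy), 4))
print(precision.round(2), recall.round(2))
```

The accuracy works out to 6719 / 8124, roughly 0.83, which agrees with the averages in the report.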


In [11]:
# Feed in one sample and see whether the mushroom is edible
predicted = model.predict([[0, 2, 9, 0, 0, 7, 7, 1, 4, 3, 3]])  # attributes copied from features[9]
print("Predicted Value:", predicted)

Predicted Value: [1]
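To read the numeric prediction as a label again, `inverse_transform()` can map it back. The sketch below re-creates the encoding of the `type` column on a tiny hypothetical list, assuming `e` stands for edible and `p` for poisonous.

```python
from sklearn.preprocessing import LabelEncoder

# Re-create the label encoding of the type column (assumed: e = edible, p = poisonous)
le = LabelEncoder()
le.fit(["p", "e", "e", "p"])

# Classes are sorted, so e -> 0 and p -> 1
classes = le.classes_.tolist()

# Map a numeric prediction such as [1] back to its original letter
decoded = le.inverse_transform([1]).tolist()
print(classes, decoded)
```

Under this mapping, the prediction `[1]` corresponds to `p`. Row 9 is listed with type `e` in the table output of In [3], so this training sample appears to be misclassified by the model.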
