## Classifying Mushrooms
We first consider the following small (artificial) dataset:

$$
\begin{array}{ccc}
\text{EdibleOrPoisonous} & \text{RedColor} & \text{CapSurface}\\
e & y & s\\
e & y & s\\
e & y & y\\
p & n & f\\
p & y & s\\
p & n & f\\
\end{array}
$$

We first write it down in Python as a 2D NumPy array. We then convert it into a pandas DataFrame, which is another type of data container.

In [1]:
import pandas as pd
import numpy as np
#e->1,p->0
#y->1,n->0
#s->0,y->1,f->2
data=np.array([
    [1 , 1 , 0],
    [1 , 1 , 0],
    [1 , 1 , 1],
    [0 , 0 , 2],
    [0 , 1 , 0],
    [0 , 0 , 2]])

#we convert it into a pandas Dataframe
df=pd.DataFrame(data,columns=['EdibleOrPoisonous','RedColor','CapSurface'])
df

Unnamed: 0,EdibleOrPoisonous,RedColor,CapSurface
0,1,1,0
1,1,1,0
2,1,1,1
3,0,0,2
4,0,1,0
5,0,0,2


The first column (0, 1, 2, ...) is the index; it is followed by the three columns 'EdibleOrPoisonous', 'RedColor' and 'CapSurface'.

In [2]:
#We one-hot encode the features (one 0/1 column per feature value), which is the format needed by MultinomialNB
train_df=df.copy()
col=np.array(['RedColor','CapSurface'])
for f in range(1,df.shape[1]) :
    for elem in df.iloc[:,f].unique():
        train_df[col[f-1]+'_'+str(elem)] = (train_df.iloc[:,f]==elem)+0.0

#we drop the original columns 'RedColor','CapSurface'
train_df=train_df.drop(columns=col)
train_df

Unnamed: 0,EdibleOrPoisonous,RedColor_1,RedColor_0,CapSurface_0,CapSurface_1,CapSurface_2
0,1,1.0,0.0,1.0,0.0,0.0
1,1,1.0,0.0,1.0,0.0,0.0
2,1,1.0,0.0,0.0,1.0,0.0
3,0,0.0,1.0,0.0,0.0,1.0
4,0,1.0,0.0,1.0,0.0,0.0
5,0,0.0,1.0,0.0,0.0,1.0
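
The manual loop above is a one-hot encoding. As a hedged alternative, pandas provides `pd.get_dummies`, which builds the same 0/1 columns. Note that it orders the new columns by sorted value (`RedColor_0` before `RedColor_1`), so any query vector passed to the classifier would have to follow that order instead. The name `encoded` is introduced here only for illustration.

```python
import numpy as np
import pandas as pd

# same toy dataset as above (e->1, p->0; y->1, n->0; s->0, y->1, f->2)
data = np.array([
    [1, 1, 0],
    [1, 1, 0],
    [1, 1, 1],
    [0, 0, 2],
    [0, 1, 0],
    [0, 0, 2]])
df = pd.DataFrame(data, columns=['EdibleOrPoisonous', 'RedColor', 'CapSurface'])

# one-hot encode the two feature columns; dtype=float gives 0.0/1.0 values
encoded = pd.get_dummies(df, columns=['RedColor', 'CapSurface'], dtype=float)
print(encoded.columns.tolist())
```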


In [3]:
#we prepare the input data (features) and output data (class)
X=train_df.iloc[:,1:].values #.values converts pandas back to numpy
y=train_df.iloc[:,0].values

In [4]:
from sklearn.naive_bayes import MultinomialNB
clf=MultinomialNB(alpha=0) #note that we do not use any regularisation (smoothing)
clf.fit(X,y)
clf.predict_proba(np.array([[1,0,1,0,0]]))



array([[0.14285714, 0.85714286]])

In [6]:
clf=MultinomialNB(alpha=0)
clf.fit(X,y)
clf.predict_proba(np.array([[1,0,0,1,0]]))



array([[3.33333333e-11, 1.00000000e+00]])

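Note the sharp probability: the predicted probability of edible is numerically 1. No poisonous mushroom in the training set has CapSurface=1, so without smoothing the estimated likelihood of this input under the poisonous class is zero. The residual 3.3e-11 is consistent with scikit-learn replacing alpha=0 by a tiny positive value (1e-10) for numerical stability.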

In [8]:
from sklearn.naive_bayes import MultinomialNB
clf=MultinomialNB(alpha=1) #we add the regularisation (Laplace smoothing) in this case
clf.fit(X,y)
clf.predict_proba(np.array([[1,0,0,1,0]]))

array([[0.2, 0.8]])

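Note that the probability is no longer sharp: smoothing gives a small non-zero probability to feature values that were never observed together with a class.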
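
To make the effect of `alpha` concrete, the two posteriors can be recomputed by hand. The sketch below assumes the smoothed estimate documented for multinomial naive Bayes, $\theta_{ci} = (N_{ci}+\alpha)/(N_c+\alpha n)$, where $N_{ci}$ is the count of feature $i$ in class $c$, $N_c$ the total count in class $c$ and $n$ the number of features. The helper `posterior` and the dictionaries `counts` and `prior` are illustrative names, not part of scikit-learn.

```python
import numpy as np

# per-class feature counts read off the one-hot table above, columns ordered as
# RedColor_1, RedColor_0, CapSurface_0, CapSurface_1, CapSurface_2
counts = {
    0: np.array([1., 2., 1., 0., 2.]),  # poisonous
    1: np.array([3., 0., 2., 1., 0.]),  # edible
}
prior = {0: 0.5, 1: 0.5}        # three mushrooms in each class
x = np.array([1, 0, 0, 1, 0])   # query: RedColor=1, CapSurface=1

def posterior(alpha):
    """Return {class: P(class | x)} for the smoothing parameter alpha."""
    scores = {}
    for c, n in counts.items():
        theta = (n + alpha) / (n.sum() + alpha * len(n))
        scores[c] = prior[c] * np.prod(theta ** x)
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}

print(posterior(0.0))  # no smoothing
print(posterior(1.0))  # Laplace smoothing
```

With `alpha=1` this reproduces the 0.2 / 0.8 split returned by `predict_proba`, and with `alpha=0` the poisonous class receives probability exactly zero.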

## Real example
We use a UCI dataset https://archive.ics.uci.edu/ml/datasets/mushroom

In [9]:
df = pd.read_csv('mushrooms.csv') #put here the path to the folder that includes your data
from sklearn.utils import shuffle
#we shuffle the dataset
df = shuffle(df, random_state=42)
df

Unnamed: 0,class,cap-shape,cap-surface,cap-color,bruises,odor,gill-attachment,gill-spacing,gill-size,gill-color,...,stalk-surface-below-ring,stalk-color-above-ring,stalk-color-below-ring,veil-type,veil-color,ring-number,ring-type,spore-print-color,population,habitat
1971,e,f,f,n,f,n,f,w,b,h,...,f,w,w,p,w,o,e,n,s,g
6654,p,f,s,e,f,y,f,c,n,b,...,s,p,p,p,w,o,e,w,v,l
5606,p,x,y,n,f,f,f,c,n,b,...,s,w,p,p,w,o,e,w,v,l
3332,e,f,y,g,t,n,f,c,b,n,...,s,g,p,p,w,o,p,n,y,d
6988,p,f,s,e,f,s,f,c,n,b,...,s,p,p,p,w,o,e,w,v,l
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5226,p,x,y,n,f,f,f,c,n,b,...,s,p,w,p,w,o,e,w,v,p
5390,e,k,y,e,t,n,f,c,b,w,...,s,w,e,p,w,t,e,w,c,w
860,e,f,y,n,t,l,f,c,b,w,...,y,w,w,p,w,o,p,n,y,p
7603,p,k,s,e,f,f,f,c,n,b,...,s,p,p,p,w,o,e,w,v,p


In [19]:
#we split the dataset into a training set and a test set
train_df = df[:7000]
test_df = df[7000:]
print(test_df.shape)
train_df

(1124, 23)


Unnamed: 0,class,cap-shape,cap-surface,cap-color,bruises,odor,gill-attachment,gill-spacing,gill-size,gill-color,...,stalk-surface-below-ring,stalk-color-above-ring,stalk-color-below-ring,veil-type,veil-color,ring-number,ring-type,spore-print-color,population,habitat
1971,e,f,f,n,f,n,f,w,b,h,...,f,w,w,p,w,o,e,n,s,g
6654,p,f,s,e,f,y,f,c,n,b,...,s,p,p,p,w,o,e,w,v,l
5606,p,x,y,n,f,f,f,c,n,b,...,s,w,p,p,w,o,e,w,v,l
3332,e,f,y,g,t,n,f,c,b,n,...,s,g,p,p,w,o,p,n,y,d
6988,p,f,s,e,f,s,f,c,n,b,...,s,p,p,p,w,o,e,w,v,l
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4143,p,f,s,g,t,f,f,c,b,p,...,f,w,w,p,w,o,p,h,v,u
2878,e,f,f,g,t,n,f,c,b,p,...,s,g,g,p,w,o,p,n,y,d
4202,p,x,f,y,f,f,f,c,b,g,...,k,b,b,p,w,o,l,h,v,g
6493,p,f,s,e,f,s,f,c,n,b,...,k,p,w,p,w,o,e,w,v,l


In [20]:
#from this we can derive the accuracy of the majority class classifier (about 0.52)
print(train_df['class'].value_counts(normalize=True))

#we now convert the class variable 'e', 'p', 'u' into numerical values: 0,1,2
#and we save the result in the list target
target=[]
for i in range(len(train_df['class'].values)):
    if train_df['class'].values[i]=='e':
        target.append(0)
    if train_df['class'].values[i]=='p':
        target.append(1)
    if train_df['class'].values[i]=='u':
        target.append(2)
        
#we convert the list into an array
target=np.array(target)
print(train_df)

e    0.515714
p    0.483143
u    0.001143
Name: class, dtype: float64
     class cap-shape cap-surface cap-color bruises odor gill-attachment  \
1971     e         f           f         n       f    n               f   
6654     p         f           s         e       f    y               f   
5606     p         x           y         n       f    f               f   
3332     e         f           y         g       t    n               f   
6988     p         f           s         e       f    s               f   
...    ...       ...         ...       ...     ...  ...             ...   
4143     p         f           s         g       t    f               f   
2878     e         f           f         g       t    n               f   
4202     p         x           f         y       f    f               f   
6493     p         f           s         e       f    s               f   
5175     e         x           s         p       t    n               f   

     gill-spacing gill-size g
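
The label conversion loop above can also be written more compactly. The following is a minimal sketch using `Series.map` on a few made-up labels; the names `labels` and `codes` are assumptions for illustration. One difference to keep in mind: a label missing from the dictionary becomes `NaN` with `map`, whereas the loop silently skips it.

```python
import pandas as pd

# toy labels standing in for train_df['class']
labels = pd.Series(['e', 'p', 'u', 'e'], name='class')

# mapping from class letter to integer code
codes = {'e': 0, 'p': 1, 'u': 2}

# map each label to its code and convert the result to a NumPy array
target = labels.map(codes).to_numpy()
print(target)
```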

In [21]:
#we delete the class column because now it is stored in target
del train_df['class']

#we transform inputs for multinomialNB
cols = list(train_df)
for f in cols :
    for elem in df[f].unique():
        train_df[f+'_'+str(elem)] = (train_df[f]==elem)
#we delete old columns
for f in cols:
    del train_df[f]
train_df.head()
train_df=train_df+0.0 #this converts Boolean to 0 and 1
train_df

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  train_df[f+'_'+str(elem)] = (train_df[f]==elem)

This message is pandas' `SettingWithCopyWarning`: `train_df` was created as a slice of `df`, so pandas cannot tell whether the assignment is also meant to modify `df`. The encoding still works as intended; creating the slice with `df[:7000].copy()` avoids the warning.


Unnamed: 0,cap-shape_f,cap-shape_x,cap-shape_b,cap-shape_k,cap-shape_s,cap-shape_c,cap-surface_f,cap-surface_s,cap-surface_y,cap-surface_g,...,population_n,population_a,population_c,habitat_g,habitat_l,habitat_d,habitat_u,habitat_p,habitat_m,habitat_w
1971,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
6654,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
5606,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
3332,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
6988,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4143,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
2878,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
4202,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
6493,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0


In [22]:
from sklearn.naive_bayes import MultinomialNB

#we train the model
clf = MultinomialNB() #by default alpha=1.0
train_x = train_df.values
clf.fit(train_x,target)

MultinomialNB()

In [23]:
#this is the accuracy in the training dataset
from sklearn.metrics import accuracy_score 
y_pred=clf.predict(train_x)
accuracy_score(y_pred,target)

0.9478571428571428
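
To put this training accuracy in context, it can be compared with the majority-class baseline mentioned earlier. A minimal sketch with scikit-learn's `DummyClassifier` follows; the toy labels (52 of class 0, 48 of class 1) are made up to roughly mimic the class balance printed above, and the feature matrix is a placeholder because this baseline ignores the inputs.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# toy labels: 52 examples of class 0 and 48 of class 1 (made-up numbers)
y = np.array([0] * 52 + [1] * 48)
X = np.zeros((len(y), 1))  # placeholder features, ignored by the baseline

# always predict the most frequent class seen during fit
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X, y)

# accuracy equals the relative frequency of the majority class
baseline_acc = baseline.score(X, y)
print(baseline_acc)
```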

In [24]:
#we transform the test dataset
test_y = test_df['class']
del test_df['class']
for f in cols :
    for elem in df[f].unique():
        test_df[f+'_'+str(elem)] = (test_df[f]==elem)
for f in cols:
    del test_df[f]
test_df

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  test_df[f+'_'+str(elem)] = (test_df[f]==elem)


Unnamed: 0,cap-shape_f,cap-shape_x,cap-shape_b,cap-shape_k,cap-shape_s,cap-shape_c,cap-surface_f,cap-surface_s,cap-surface_y,cap-surface_g,...,population_n,population_a,population_c,habitat_g,habitat_l,habitat_d,habitat_u,habitat_p,habitat_m,habitat_w
2286,True,False,False,False,False,False,False,False,True,False,...,False,False,False,False,False,True,False,False,False,False
2261,False,True,False,False,False,False,True,False,False,False,...,False,False,False,False,False,True,False,False,False,False
5240,True,False,False,False,False,False,False,True,False,False,...,False,False,False,True,False,False,False,False,False,False
7228,True,False,False,False,False,False,False,True,False,False,...,False,False,False,False,True,False,False,False,False,False
2296,False,True,False,False,False,False,False,False,True,False,...,False,False,False,False,False,True,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5226,False,True,False,False,False,False,False,False,True,False,...,False,False,False,False,False,False,False,True,False,False
5390,False,False,False,True,False,False,False,False,True,False,...,False,False,True,False,False,False,False,False,False,True
860,True,False,False,False,False,False,False,False,True,False,...,False,False,False,False,False,False,False,True,False,False
7603,False,False,False,True,False,False,False,True,False,False,...,False,False,False,False,False,False,False,True,False,False
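
The test set must be encoded with exactly the same columns, in the same order, as the training set; the cells above achieve this by looping over the unique values of the full `df`. As a hedged alternative, scikit-learn's `OneHotEncoder` can be fitted on the training data only and then reused for the test data. The two small frames below are made up for illustration, and `handle_unknown='ignore'` encodes a value never seen in training as all zeros.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# made-up miniature train/test frames with two categorical features
train = pd.DataFrame({'cap-shape': ['f', 'x', 'b'], 'odor': ['n', 'y', 'n']})
test = pd.DataFrame({'cap-shape': ['x', 'k'], 'odor': ['y', 'n']})  # 'k' is unseen

# learn the categories from the training data only
enc = OneHotEncoder(handle_unknown='ignore')
train_x = enc.fit_transform(train).toarray()

# reuse the fitted encoder so the test columns line up with the training columns
test_x = enc.transform(test).toarray()
print(enc.categories_)
print(test_x)
```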


Now we check the accuracy on unseen data: we make predictions for the test dataset and then compute the accuracy.

In [25]:
test_x = test_df.values
#we convert the class 'e','p','u' into 0,1,2
test_y1=[]
for i in range(len(test_y)):
    if test_y.values[i]=='e':
        test_y1.append(0)
    if test_y.values[i]=='p':
        test_y1.append(1)
    if test_y.values[i]=='u':
        test_y1.append(2)
        
#we convert the list into an array
test_y1=np.array(test_y1)

#we compute the prediction
y_pred=clf.predict(test_x)
#we compute the accuracy 
accuracy_score(y_pred,test_y1)

0.949288256227758

The classifier performs very well. Let's look at the confusion matrix.

In [26]:
from sklearn.metrics import confusion_matrix
confusion_matrix(test_y1,y_pred)

array([[594,   4,   0],
       [ 45, 473,   8],
       [  0,   0,   0]])

Rows are the true classes and columns are the predicted classes. The test set contains no unknown ('u') mushrooms, yet the classifier labels 8 poisonous mushrooms as unknown; this concerns only a few cases. The real problem is that there are 45 cases where a poisonous mushroom is classified as edible, which is dangerous. Here the confusion matrix gives us more information than the accuracy, because the errors do not have the same cost: classifying a poisonous mushroom as edible is much worse than the reverse.
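
The quantities discussed here can be read directly off the matrix. The sketch below copies the matrix printed above and assumes scikit-learn's convention that rows are true classes and columns are predicted classes; the variable names are illustrative.

```python
import numpy as np

# confusion matrix copied from the output above (rows: true, columns: predicted)
cm = np.array([[594,   4,   0],
               [ 45, 473,   8],
               [  0,   0,   0]])

# overall accuracy: correct predictions (diagonal) over all predictions
accuracy = np.trace(cm) / cm.sum()

# dangerous errors: truly poisonous (row 1) predicted as edible (column 0)
dangerous = cm[1, 0]
n_poisonous = cm[1].sum()
dangerous_rate = dangerous / n_poisonous

print(accuracy, dangerous, n_poisonous, dangerous_rate)
```

The accuracy matches the value computed with `accuracy_score`, while the last number shows what fraction of the poisonous test mushrooms would have been declared edible.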

## Question 
Your goal here is to reduce the dangerous misclassifications to zero by changing the prediction rule. To do that, we can use a threshold. For instance, consider these probabilities for the classes 0 (edible), 1 (poisonous) and 2 (unknown):


In [28]:
clf.predict_proba(test_x[2:3,:])

array([[1.33688508e-01, 8.66309957e-01, 1.53425038e-06]])

The MultinomialNB classifier in this case returns the class "1" (poisonous), because it has the greatest probability.

In [29]:
clf.predict(test_x[2:3,:])

array([1])

We could make decisions only if the probability is greater than a large threshold. 
For instance, the threshold could be 0.999.

Given the probabilities for all the test inputs

In [30]:
np.set_printoptions(suppress=True) #this suppresses scientific notation when displaying the probabilities
proba = clf.predict_proba(test_x)
proba

array([[0.99999999, 0.00000001, 0.        ],
       [0.99999491, 0.00000509, 0.        ],
       [0.13368851, 0.86630996, 0.00000153],
       ...,
       [0.99999983, 0.00000002, 0.00000015],
       [0.        , 0.99958562, 0.00041438],
       [0.99996303, 0.00003652, 0.00000045]])

write down a function that implements the above decision rule.
More precisely, if $[p_1,p_2,p_3]$ is one row of the above matrix of probabilities, the decision is

```
if p1 > threshold:
    prediction = 0
elif p2 > threshold:
    prediction = 1
elif p3 > threshold:
    prediction = 2
else:
    prediction = -1  # "-1" means none of the above cases
```

In other words, you need to write a function that applies this decision rule to every row of `proba` and then computes the accuracy over the cases where the classifier made a decision (that is, where it returned something other than -1).

You should then verify that, with threshold=1/3, you get the same accuracy as the standard prediction:

In [31]:
#we compute the prediction
y_pred=clf.predict(test_x)
#we compute the accuracy 
accuracy_score(y_pred,test_y1)

0.949288256227758
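
One possible sketch of the thresholded decision rule described above is given below. It is an illustration under stated assumptions, not the only way to solve the exercise: the function names `threshold_predict` and `accuracy_on_decided` are hypothetical, and the small arrays `proba_demo` and `y_demo` are made-up stand-ins for `proba` and `test_y1` (three of the rows are taken from the probabilities printed above, the labels are invented).

```python
import numpy as np

def threshold_predict(proba, threshold):
    """Apply the if/elif rule row by row; -1 means 'no decision'."""
    proba = np.asarray(proba)
    pred = np.full(proba.shape[0], -1)
    # go through the classes in reverse so that, if several classes exceed
    # the threshold, the lowest class index wins, as in the if/elif chain
    for k in range(proba.shape[1] - 1, -1, -1):
        pred[proba[:, k] > threshold] = k
    return pred

def accuracy_on_decided(pred, y_true):
    """Accuracy over the rows with a decision, plus the fraction decided."""
    pred = np.asarray(pred)
    y_true = np.asarray(y_true)
    decided = pred != -1
    if not decided.any():
        return float('nan'), 0.0
    return float(np.mean(pred[decided] == y_true[decided])), float(decided.mean())

# made-up stand-ins for proba and test_y1
proba_demo = np.array([[0.99999999, 0.00000001, 0.0],
                       [0.13368851, 0.86630996, 0.00000153],
                       [0.0,        0.99958562, 0.00041438]])
y_demo = np.array([0, 1, 1])

pred_demo = threshold_predict(proba_demo, 0.999)
acc_demo, coverage_demo = accuracy_on_decided(pred_demo, y_demo)
print(pred_demo, acc_demo, coverage_demo)
```

For any threshold of at least 0.5, at most one class can exceed it, so the rule agrees with `clf.predict` whenever it makes a decision. For lower thresholds such as 1/3, more than one class may exceed the threshold and the order of the checks then determines the prediction.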

In [32]:
import pandas as pd
data = pd.read_csv('spam.csv', encoding='latin-1')
data

Unnamed: 0,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,ham,"Go until jurong point, crazy.. Available only ...",,,
1,ham,Ok lar... Joking wif u oni...,,,
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,,,
3,ham,U dun say so early hor... U c already then say...,,,
4,ham,"Nah I don't think he goes to usf, he lives aro...",,,
...,...,...,...,...,...
5567,spam,This is the 2nd time we have tried 2 contact u...,,,
5568,ham,Will Ì_ b going to esplanade fr home?,,,
5569,ham,"Pity, * was in mood for that. So...any other s...",,,
5570,ham,The guy did some bitching but I acted like i'd...,,,
