# CS 595 Data 2

### Student: Sarthak Anand (A20389087)

### Dataset: Mushroom Dataset from UCI

This dataset includes descriptions of hypothetical samples corresponding to 23 species of gilled mushrooms in the Agaricus and Lepiota family, drawn from The Audubon Society Field Guide to North American Mushrooms (1981). Each species is identified as definitely edible, definitely poisonous, or of unknown edibility and not recommended. The last class was combined with the poisonous one.

The objective is to classify mushrooms as edible or poisonous based on the attributes given below.

**Target -> class**: edible=e, poisonous=p

#### Attribute information:
1. **cap-shape**:                bell=b, conical=c, convex=x, flat=f, knobbed=k, sunken=s
2. **cap-surface**:              fibrous=f, grooves=g, scaly=y, smooth=s
3. **cap-color**:                brown=n, buff=b, cinnamon=c, gray=g, green=r, pink=p, purple=u, red=e, white=w, yellow=y
4. **bruises**:                  bruises=t, no=f
5. **odor**:                     almond=a, anise=l, creosote=c, fishy=y, foul=f, musty=m, none=n, pungent=p, spicy=s
6. **gill-attachment**:          attached=a, descending=d, free=f, notched=n
7. **gill-spacing**:             close=c, crowded=w, distant=d
8. **gill-size**:                broad=b, narrow=n
9. **gill-color**:               black=k, brown=n, buff=b, chocolate=h, gray=g, green=r, orange=o, pink=p, purple=u, red=e, white=w, yellow=y
10. **stalk-shape**:              enlarging=e, tapering=t
11. **stalk-root**:               bulbous=b, club=c, cup=u, equal=e, rhizomorphs=z, rooted=r, missing=?
12. **stalk-surface-above-ring**: fibrous=f, scaly=y, silky=k, smooth=s
13. **stalk-surface-below-ring**: fibrous=f, scaly=y, silky=k, smooth=s
14. **stalk-color-above-ring**:   brown=n, buff=b, cinnamon=c, gray=g, orange=o, pink=p, red=e, white=w, yellow=y
15. **stalk-color-below-ring**:   brown=n, buff=b, cinnamon=c, gray=g, orange=o, pink=p, red=e, white=w, yellow=y
16. **veil-type**:                partial=p, universal=u
17. **veil-color**:               brown=n, orange=o, white=w, yellow=y
18. **ring-number**:              none=n, one=o, two=t
19. **ring-type**:                cobwebby=c, evanescent=e, flaring=f, large=l, none=n, pendant=p, sheathing=s, zone=z
20. **spore-print-color**:        black=k, brown=n, buff=b, chocolate=h, green=r, orange=o, purple=u, white=w, yellow=y
21. **population**:               abundant=a, clustered=c, numerous=n, scattered=s, several=v, solitary=y
22. **habitat**:                  grasses=g, leaves=l, meadows=m, paths=p, urban=u, waste=w, woods=d


In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB 
from sklearn import metrics

In [2]:
df = pd.read_csv("mushrooms.csv")
print("Original Data Shape = ",df.shape)
df = df[df['stalk-root'] != '?']  # stalk-root marks missing values with '?'; drop those rows
print("Data Shape after dropping objects with missing values = ",df.shape)
df.head(2)

Original Data Shape =  (8124, 23)
Data Shape after dropping objects with missing values =  (5644, 23)


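|   | class | cap-shape | cap-surface | cap-color | bruises | odor | gill-attachment | gill-spacing | gill-size | gill-color | ... | stalk-surface-below-ring | stalk-color-above-ring | stalk-color-below-ring | veil-type | veil-color | ring-number | ring-type | spore-print-color | population | habitat |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | p | x | s | n | t | p | f | c | n | k | ... | s | w | w | p | w | o | p | k | s | u |
| 1 | e | x | s | y | t | a | f | c | b | k | ... | s | w | w | p | w | o | p | n | n | g |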
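
Before dropping rows, it can help to confirm which columns actually contain the `?` marker. The following is a minimal sketch on a made-up frame (`toy` is hypothetical and is not the mushroom data); it counts the marker per column and then applies the same filter used above.

```python
import pandas as pd

# Hypothetical toy frame standing in for mushrooms.csv
toy = pd.DataFrame({
    "class": ["p", "e", "e", "p"],
    "stalk-root": ["b", "?", "e", "?"],
    "odor": ["f", "n", "a", "f"],
})

# Count the '?' marker in every column
missing_per_column = (toy == "?").sum()
print(missing_per_column)

# Keep only rows whose stalk-root is known
cleaned = toy[toy["stalk-root"] != "?"]
print(cleaned.shape)
```

On the real data, the same count would show whether `stalk-root` is the only affected column.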


In [3]:
# The features are categorical, so each of the 22 features is one-hot encoded.
# pandas' get_dummies performs the one-hot encoding.

for col in df.drop('class',axis=1).columns:
    one_hot = pd.get_dummies(df[col],prefix=col) # prefix = original column name, so new columns read <column>_<value>
    df.drop(col,axis=1,inplace=True) # drop the column that has been one-hot encoded
    df = df.join(one_hot) # add the new indicator columns to the frame
    
# Mapping the values of target class from {p,e} to {0,1}
df['class'] = df['class'].map(lambda x:0 if x == 'p' else 1)
print("Data Shape after One-hot encoding = ",df.shape)
print("\nData after One-hot Encoding:")
df.head(2)

Data Shape after One-hot encoding =  (5644, 99)

Data after One-hot Encoding:


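|   | class | cap-shape_b | cap-shape_c | cap-shape_f | cap-shape_k | cap-shape_s | cap-shape_x | cap-surface_f | cap-surface_g | cap-surface_s | ... | population_n | population_s | population_v | population_y | habitat_d | habitat_g | habitat_l | habitat_m | habitat_p | habitat_u |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | ... | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |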
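
The column-by-column encoding can also be expressed as a single `pd.get_dummies` call over all feature columns. The sketch below uses a tiny made-up frame (`toy`, hypothetical) to show the resulting `<column>_<value>` naming. Passing `dtype=int` is a choice made for this sketch so the indicators are 0/1 integers.

```python
import pandas as pd

# Hypothetical toy frame standing in for the mushroom data
toy = pd.DataFrame({
    "class": ["p", "e", "e"],
    "odor": ["f", "n", "a"],
    "bruises": ["t", "f", "t"],
})

features = toy.drop("class", axis=1)

# One call encodes every listed column; new names are <column>_<value>
encoded = pd.get_dummies(features, columns=list(features.columns), dtype=int)

# Map the target from {p, e} to {0, 1} and put it first
encoded.insert(0, "class", toy["class"].map({"p": 0, "e": 1}))
print(encoded.columns.tolist())
```

The indicator columns appear grouped by source column, with the values of each column in sorted order.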


In [4]:
# Create a train/test split with roughly 2/3 train and 1/3 test (test_size=0.33)
X_train, X_test, y_train, y_test = train_test_split(df.drop('class',axis=1), df['class'], test_size=0.33, random_state=43)

In [5]:
# Train a BernoulliNB classifier with default parameters (fit_prior=True is the default)
clf = BernoulliNB(fit_prior=True)
clf.fit(X_train, y_train)
print("Test Accuracy =",metrics.accuracy_score(y_test,clf.predict(X_test)))
print("Train Accuracy = ",metrics.accuracy_score(y_train,clf.predict(X_train)))

Test Accuracy = 0.939345142244
Train Accuracy =  0.9370536895


In [6]:
# Class probabilities of the test objects (column 0 = poisonous, column 1 = edible)
probabilities = clf.predict_proba(X_test)
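
For reference, the probabilities of a Bernoulli naive Bayes model can be related to per-feature log ratios through the log-odds. The notation here is introduced only for illustration and does not appear in the notebook: $\theta_{cj}$ stands for $P(x_j = 1 \mid y = c)$ and $\pi_c$ for the prior of class $c$.

$$
\log\frac{P(y=1\mid x)}{P(y=0\mid x)}
= \log\frac{\pi_1}{\pi_0}
+ \sum_{j}\left[\, x_j \log\frac{\theta_{1j}}{\theta_{0j}}
+ (1 - x_j)\log\frac{1-\theta_{1j}}{1-\theta_{0j}} \right]
$$

In scikit-learn, `clf.feature_log_prob_[c, j]` stores $\log\theta_{cj}$ and `clf.class_log_prior_[c]` stores $\log\pi_c$. Each term in the sum is positive when it favors class 1 (edible) and negative when it favors class 0 (poisonous).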

In [49]:
# Compute the total positive and total negative log evidence for every test object

# feature_log_prob_[c, j] = log P(x_j = 1 | class c); this gives log P(x_j = 0 | class c)
reverse_log_prob = np.log(1 - np.exp(clf.feature_log_prob_))

# 1 where a feature is absent and 0 where it is present, so absent features contribute evidence too
reverse_X_test = 1 - X_test

# Log likelihood ratios, class 1 (edible) over class 0 (poisonous)
log_ratios = clf.feature_log_prob_[1] - clf.feature_log_prob_[0] # feature present
reverse_log_ratios = reverse_log_prob[1] - reverse_log_prob[0] # feature absent

# Split each ratio vector by sign: a log ratio > 0 (ratio > 1) is positive evidence,
# a log ratio < 0 (ratio < 1) is negative evidence; all other entries stay 0
pos_log_ratios = np.maximum(log_ratios, 0)
neg_log_ratios = np.minimum(log_ratios, 0)
reverse_pos_log_ratios = np.maximum(reverse_log_ratios, 0)
reverse_neg_log_ratios = np.minimum(reverse_log_ratios, 0)

total_positive_evidence = np.dot(X_test,pos_log_ratios) + np.dot(reverse_X_test,reverse_pos_log_ratios) # sum of all positive log ratios
total_negative_evidence = np.dot(X_test,neg_log_ratios) + np.dot(reverse_X_test,reverse_neg_log_ratios) # sum of all negative log ratios

# To account for class imbalance, the difference of the class log priors is added to the
# positive evidence if it is greater than 0, or to the negative evidence otherwise.
# Note: this difference is oriented class 0 minus class 1, i.e. log P(poisonous) - log P(edible),
# whereas the feature log ratios above are oriented class 1 minus class 0.

if clf.class_log_prior_[0] - clf.class_log_prior_[1] >0:
    total_positive_evidence+=(clf.class_log_prior_[0] - clf.class_log_prior_[1])
else:
    total_negative_evidence+=(clf.class_log_prior_[0] - clf.class_log_prior_[1])
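
One way to sanity-check an evidence decomposition of this kind is to confirm that positive plus negative evidence equals the model's log-odds. The sketch below does this on synthetic binary data (`X` and `y` are hypothetical, not the mushroom data). It orients the prior term as class 1 minus class 0, matching the feature log ratios; that orientation is an assumption of this sketch.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 6))   # hypothetical binary features
y = (X[:, 0] | X[:, 1]).astype(int)     # hypothetical labels; both classes occur

clf = BernoulliNB().fit(X, y)

# Log ratios (class 1 over class 0) for present and absent features
present = clf.feature_log_prob_[1] - clf.feature_log_prob_[0]
absent_log_prob = np.log1p(-np.exp(clf.feature_log_prob_))
absent = absent_log_prob[1] - absent_log_prob[0]
prior = clf.class_log_prior_[1] - clf.class_log_prior_[0]

# Per-object, per-feature evidence, then split by sign
contrib = X * present + (1 - X) * absent
pos = np.clip(contrib, 0, None).sum(axis=1) + max(prior, 0.0)
neg = np.clip(contrib, None, 0).sum(axis=1) + min(prior, 0.0)

# Compare against the log-odds reported by the model
log_proba = clf.predict_log_proba(X)
log_odds = log_proba[:, 1] - log_proba[:, 0]
print(np.allclose(pos + neg, log_odds))
```

If the check fails on real data, the usual suspects are the orientation of the prior term or a mismatch between present and absent ratios.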

In [40]:
def get_top_features(row, clf):
    reverse_row = 1 - row # 1 where the feature is absent
    
    # np.multiply keeps one evidence term per present feature.
    # np.dot over the absent features returns a single scalar (their summed evidence),
    # which np.add then broadcasts onto every per-feature term.
    total_positive_evidence_row = np.add(np.multiply(row,pos_log_ratios),np.dot(reverse_row,reverse_pos_log_ratios))
    total_negative_evidence_row = np.add(np.multiply(row,neg_log_ratios),np.dot(reverse_row,reverse_neg_log_ratios))
    
    # Three largest positive values and three most negative values
    top3_pos_ft_ids = total_positive_evidence_row.sort_values()[::-1][:3]
    top3_neg_ft_ids = total_negative_evidence_row.sort_values()[:3]
    
    print("\nd) Top 3 feature values that contribute to positive evidence: ")
    for index,val in top3_pos_ft_ids.items():
        print("Feature: %s = %i                Feature Evidence: %f"%(index, row[index], val))
    print("\ne) Top 3 feature values that contribute to negative evidence: ")
    for index,val in top3_neg_ft_ids.items():
        print("Feature: %s = %i                Feature Evidence: %f"%(index, row[index], val))
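
A per-feature breakdown can also be computed with an elementwise product for both the present and the absent terms, so that the individual contributions sum to the object's total feature evidence. The following is a minimal sketch with made-up log ratios (`present_ratio`, `absent_ratio`, and `row` are hypothetical values, not taken from the fitted model).

```python
import pandas as pd

names = ["odor_f", "odor_n", "gill-size_b", "gill-size_n"]

# Hypothetical log ratios (class 1 over class 0) for present / absent features
present_ratio = pd.Series([-3.0, 2.0, 0.5, -1.5], index=names)
absent_ratio = pd.Series([0.4, -0.6, -0.2, 0.3], index=names)

# One hypothetical one-hot encoded object
row = pd.Series([1, 0, 1, 0], index=names)

# Each feature contributes its present ratio if set, otherwise its absent ratio
contrib = row * present_ratio + (1 - row) * absent_ratio

top_pos = contrib[contrib > 0].nlargest(2)   # strongest evidence for class 1
top_neg = contrib[contrib < 0].nsmallest(2)  # strongest evidence for class 0
print(top_pos)
print(top_neg)
```

With this form, an absent feature such as `odor_n = 0` can appear among the top contributors in its own right.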

### 1. Most Positive object in terms of probability

In [41]:
id_most_pos = np.argmax(probabilities[:,1])
print("a) Total Positive log Evidence = ",total_positive_evidence[id_most_pos])
print("b) Total Negative log Evidence = ", total_negative_evidence[id_most_pos])
print("c) Probability Distribution = ", probabilities[id_most_pos])
get_top_features(X_test.iloc[id_most_pos],clf)

a) Total Positive log Evidence =  37.4324336361
b) Total Negative log Evidence =  -5.3688829504
c) Probability Distribution =  [  4.75526809e-15   1.00000000e+00]

d) Top 3 feature values that contribute to positive evidence: 
Feature: stalk-color-below-ring_g = 1                Feature Evidence: 16.923246
Feature: stalk-color-above-ring_g = 1                Feature Evidence: 16.915423
Feature: cap-color_e = 1                Feature Evidence: 14.958986

e) Top 3 feature values that contribute to negative evidence: 
Feature: stalk-root_b = 1                Feature Evidence: -4.556393
Feature: gill-spacing_c = 1                Feature Evidence: -4.344146
Feature: cap-shape_f = 1                Feature Evidence: -4.216045


### 2. Most Negative object in terms of probability

In [30]:
id_most_neg = np.argmax(probabilities[:,0])

print("a) Total Positive log Evidence = ",total_positive_evidence[id_most_neg])
print("b) Total Negative log Evidence = ", total_negative_evidence[id_most_neg])
print("c) Probability Distribution = ", probabilities[id_most_neg])
get_top_features(X_test.iloc[id_most_neg],clf)

a) Total Positive log Evidence =  2.54953949578
b) Total Negative log Evidence =  -64.0885646645
c) Probability Distribution =  [  1.00000000e+00   4.69615917e-27]

d) Top 3 feature values that contribute to positive evidence: 
Feature: gill-size_b = 1    Feature Evidence: 2.363646
Feature: cap-surface_f = 1    Feature Evidence: 2.327093
Feature: ring-number_o = 1    Feature Evidence: 2.214022

e) Top 3 feature values that contribute to negative evidence: 
Feature: odor_f = 1    Feature Evidence: -20.740597
Feature: spore-print-color_h = 1    Feature Evidence: -20.740597
Feature: stalk-surface-above-ring_k = 1    Feature Evidence: -20.558647


### 3. Object that has highest Positive evidence

In [31]:
id_most_pos_evi = np.argmax(total_positive_evidence)

print("a) Total Positive log Evidence = ",total_positive_evidence[id_most_pos_evi])
print("b) Total Negative log Evidence = ", total_negative_evidence[id_most_pos_evi])
print("c) Probability Distribution = ", probabilities[id_most_pos_evi])
get_top_features(X_test.iloc[id_most_pos_evi],clf)

a) Total Positive log Evidence =  38.1977740213
b) Total Negative log Evidence =  -5.40564703881
c) Probability Distribution =  [  2.29487133e-15   1.00000000e+00]

d) Top 3 feature values that contribute to positive evidence: 
Feature: stalk-color-below-ring_g = 1    Feature Evidence: 16.923246
Feature: stalk-color-above-ring_g = 1    Feature Evidence: 16.915423
Feature: cap-color_e = 1    Feature Evidence: 14.958986

e) Top 3 feature values that contribute to negative evidence: 
Feature: stalk-root_b = 1    Feature Evidence: -4.593157
Feature: gill-spacing_c = 1    Feature Evidence: -4.380910
Feature: cap-shape_f = 1    Feature Evidence: -4.252809


### 4. Object that has highest Negative evidence

In [32]:
id_most_neg_evi = np.argmin(total_negative_evidence)

print("a) Total Positive log Evidence = ",total_positive_evidence[id_most_neg_evi])
print("b) Total Negative log Evidence = ", total_negative_evidence[id_most_neg_evi])
print("c) Probability Distribution = ", probabilities[id_most_neg_evi])
get_top_features(X_test.iloc[id_most_neg_evi],clf)

a) Total Positive log Evidence =  2.33107629003
b) Total Negative log Evidence =  -70.2487735903
c) Probability Distribution =  [  1.00000000e+00   7.97114482e-30]

d) Top 3 feature values that contribute to positive evidence: 
Feature: gill-size_b = 1    Feature Evidence: 2.145183
Feature: cap-surface_f = 1    Feature Evidence: 2.108630
Feature: ring-number_o = 1    Feature Evidence: 1.995559

e) Top 3 feature values that contribute to negative evidence: 
Feature: odor_f = 1    Feature Evidence: -20.740597
Feature: spore-print-color_h = 1    Feature Evidence: -20.740597
Feature: stalk-surface-above-ring_k = 1    Feature Evidence: -20.558647


### 5. Most Uncertain object

In [33]:
id_most_neutral = np.argmin(np.absolute(np.subtract(probabilities[:,0], probabilities[:,1])))

print("a) Total Positive log Evidence = ",total_positive_evidence[id_most_neutral])
print("b) Total Negative log Evidence = ", total_negative_evidence[id_most_neutral])
print("c) Probability Distribution = ", probabilities[id_most_neutral])
get_top_features(X_test.iloc[id_most_neutral],clf)

a) Total Positive log Evidence =  17.5677025156
b) Total Negative log Evidence =  -18.489436257
c) Probability Distribution =  [ 0.50144027  0.49855973]

d) Top 3 feature values that contribute to positive evidence: 
Feature: spore-print-color_n = 1    Feature Evidence: 11.839002
Feature: stalk-surface-above-ring_s = 1    Feature Evidence: 11.383251
Feature: stalk-surface-below-ring_s = 1    Feature Evidence: 11.329455

e) Top 3 feature values that contribute to negative evidence: 
Feature: odor_p = 1    Feature Evidence: -13.190534
Feature: habitat_u = 1    Feature Evidence: -9.147483
Feature: gill-size_n = 1    Feature Evidence: -8.770101
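
The "most uncertain" selection can equivalently be phrased in terms of log-odds: for a two-class model, the object whose two probabilities are closest is also the one whose log-odds is closest to zero. A minimal sketch with made-up probabilities (`probs` is hypothetical):

```python
import numpy as np

# Hypothetical predict_proba output: column 0 = poisonous, column 1 = edible
probs = np.array([
    [0.90, 0.10],
    [0.48, 0.52],
    [0.30, 0.70],
])

# Smallest gap between the two class probabilities
by_gap = np.argmin(np.abs(probs[:, 0] - probs[:, 1]))

# Log-odds closest to zero
log_odds = np.log(probs[:, 1]) - np.log(probs[:, 0])
by_log_odds = np.argmin(np.abs(log_odds))

print(by_gap, by_log_odds)
```

Both criteria pick the same object, which is consistent with the positive and negative evidence of the most uncertain object nearly cancelling out.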
