# classification

## dataset information:
This dataset includes descriptions of hypothetical samples corresponding to 23 species of gilled mushrooms in the Agaricus and Lepiota families, drawn from The Audubon Society Field Guide to North American Mushrooms (1981). Each species is identified as definitely edible, definitely poisonous, or of unknown edibility and not recommended. This last class was combined with the poisonous one. The Guide clearly states that there is no simple rule for determining the edibility of a mushroom.

The objective of this project is to find the best machine learning model to classify a mushroom as edible or poisonous from features such as cap-shape, cap-color, gill-color, and odor.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
df = pd.read_csv("mushrooms.csv")
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8124 entries, 0 to 8123
Data columns (total 23 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   class                     8124 non-null   object
 1   cap-shape                 8124 non-null   object
 2   cap-surface               8124 non-null   object
 3   cap-color                 8124 non-null   object
 4   bruises                   8124 non-null   object
 5   odor                      8124 non-null   object
 6   gill-attachment           8124 non-null   object
 7   gill-spacing              8124 non-null   object
 8   gill-size                 8124 non-null   object
 9   gill-color                8124 non-null   object
 10  stalk-shape               8124 non-null   object
 11  stalk-root                8124 non-null   object
 12  stalk-surface-above-ring  8124 non-null   object
 13  stalk-surface-below-ring  8124 non-null   object
 14  stalk-color-above-ring  

The dataset has 23 columns and 8124 rows, with no NA values,
so we now inject NA values into 0 to 10% of the rows of 6 randomly chosen columns.

In [3]:
np.random.seed(6)
for i in range(6):
    col = np.random.randint(22)
    n = np.random.uniform(0, 0.1)
    df.loc[df.sample(frac=n).index, df.columns[col]] = np.nan
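The injection step can be tried on a small table first. The following is a minimal sketch, not the notebook's exact procedure: the toy DataFrame, its values, and the choice to exclude the target column and to pick distinct columns are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the mushroom table (hypothetical values, 100 rows).
toy = pd.DataFrame({
    "class": list("epep") * 25,
    "odor": list("nfna") * 25,
    "bruises": list("tfft") * 25,
    "gill-size": list("bnbb") * 25,
})

rng = np.random.default_rng(6)
feature_cols = toy.columns.drop("class").to_numpy()   # assumption: never blank the target

# Pick 2 distinct feature columns and blank 0 to 10% of the rows in each.
for col in rng.choice(feature_cols, size=2, replace=False):
    frac = rng.uniform(0, 0.1)
    rows = toy.sample(frac=frac, random_state=int(rng.integers(1_000_000))).index
    toy.loc[rows, col] = np.nan

print(toy.isna().sum())
```

Sampling columns without replacement guarantees that the requested number of distinct columns is affected, which a plain `randint` draw does not.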

### dealing with NA values

In [4]:
df = df[df.isnull().sum(axis=1) < 2]

Dropped all rows containing more than one NA value.
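As an illustration of the row filter, here is a small sketch on made-up data (the column names `a`, `b`, `c` are hypothetical). It counts the missing values in each row and keeps the rows with fewer than two.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "a": ["x", np.nan, np.nan, "y"],
    "b": ["u", "v", np.nan, np.nan],
    "c": ["p", "q", "r", np.nan],
})

na_per_row = toy.isnull().sum(axis=1)   # number of missing values in each row
kept = toy[na_per_row < 2]              # keep rows with at most one NaN

print(na_per_row.tolist())
print(kept.index.tolist())
```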

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7739 entries, 0 to 8123
Data columns (total 23 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   class                     7739 non-null   object
 1   cap-shape                 7576 non-null   object
 2   cap-surface               7152 non-null   object
 3   cap-color                 7739 non-null   object
 4   bruises                   7556 non-null   object
 5   odor                      7739 non-null   object
 6   gill-attachment           7196 non-null   object
 7   gill-spacing              7739 non-null   object
 8   gill-size                 7739 non-null   object
 9   gill-color                7739 non-null   object
 10  stalk-shape               7173 non-null   object
 11  stalk-root                7739 non-null   object
 12  stalk-surface-above-ring  7739 non-null   object
 13  stalk-surface-below-ring  7739 non-null   object
 14  stalk-color-above-ring  

In [6]:
df[['cap-shape','cap-surface','bruises','gill-attachment','stalk-shape','veil-color']].describe()

| | cap-shape | cap-surface | bruises | gill-attachment | stalk-shape | veil-color |
|---|---|---|---|---|---|---|
| count | 7576 | 7152 | 7556 | 7196 | 7173 | 7623 |
| unique | 6 | 4 | 2 | 2 | 2 | 4 |
| top | x | y | f | f | t | w |
| freq | 3410 | 2834 | 4430 | 7007 | 4068 | 7435 |


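Replace NA values with the most frequent value in gill-attachment and veil-color, where a single value dominates the column (7007 of 7196 and 7435 of 7623 non-null entries), then drop all remaining rows that contain an NA value.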

In [7]:
# fill each column with its most frequent value; assigning the result back
# avoids the chained inplace call on a filtered frame
df = df.fillna({'gill-attachment': 'f', 'veil-color': 'w'})
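The fill values above are hard-coded from the `describe()` output. A minimal sketch of deriving them from the data instead, on a hypothetical toy frame:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "gill-attachment": ["f", "f", np.nan, "a", "f"],
    "veil-color": ["w", np.nan, "w", "n", "w"],
})

modes = toy.mode().iloc[0]    # most frequent value of each column
filled = toy.fillna(modes)    # a Series fills column by column, matched on column name

print(filled)
```

Computing the mode keeps the imputation correct if the random seed, and therefore the set of affected rows, changes.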

In [8]:
df_cleaned = df.dropna()
df_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6240 entries, 0 to 8123
Data columns (total 23 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   class                     6240 non-null   object
 1   cap-shape                 6240 non-null   object
 2   cap-surface               6240 non-null   object
 3   cap-color                 6240 non-null   object
 4   bruises                   6240 non-null   object
 5   odor                      6240 non-null   object
 6   gill-attachment           6240 non-null   object
 7   gill-spacing              6240 non-null   object
 8   gill-size                 6240 non-null   object
 9   gill-color                6240 non-null   object
 10  stalk-shape               6240 non-null   object
 11  stalk-root                6240 non-null   object
 12  stalk-surface-above-ring  6240 non-null   object
 13  stalk-surface-below-ring  6240 non-null   object
 14  stalk-color-above-ring  

### data preprocessing

Create dummy (one-hot) variables for all columns except the target column.

In [9]:
df_class = df_cleaned['class']
del df_cleaned['class']

In [10]:
dummies = pd.get_dummies(df_cleaned)
final_df = dummies.join(df_class)

In [11]:
final_df['class']= final_df['class'].map({'e':0, 'p':1})
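To make the encoding concrete, here is a small sketch on a hypothetical three-column frame. Each categorical feature becomes one indicator column per category, and the target is mapped to 0 (edible) and 1 (poisonous) as in the cells above.

```python
import pandas as pd

toy = pd.DataFrame({
    "odor": ["n", "f", "n", "a"],
    "bruises": ["t", "f", "f", "t"],
    "class": ["e", "p", "e", "e"],
})

target = toy["class"].map({"e": 0, "p": 1})             # 0 = edible, 1 = poisonous
features = pd.get_dummies(toy.drop(columns="class"))    # one column per (feature, category)

print(features.columns.tolist())
print(target.tolist())
```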

separating target variable from predictors

In [12]:
y = final_df['class']
x = final_df.drop('class', axis=1)

splitting dataset into training and testing subsets

In [13]:
# splitting dataset
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.33, random_state=1)
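The split above does not stratify on the target. As an optional variation (an assumption, not something the notebook does), passing `stratify=y` keeps the class proportions nearly equal in both subsets. A sketch on synthetic arrays:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 samples, 40% positives.
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 60 + [1] * 40)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.33, random_state=1, stratify=y
)

print(X_tr.shape, X_te.shape)
print(y_tr.mean(), y_te.mean())   # share of positives in each subset
```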

## KNN

In [14]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import confusion_matrix
knn = KNeighborsClassifier()
param_grid = {'n_neighbors':[1, 5, 10, 15, 20]}

grid_knn = GridSearchCV(knn, param_grid=param_grid, cv = 5, scoring='roc_auc')
grid_knn.fit(x_train, y_train)
grid_knn.score(x_train, y_train)

0.9999998854124557

In [15]:
grid_knn.score(x_test, y_test)

1.0
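Because the search was created with `scoring='roc_auc'`, `grid_knn.score` reports ROC AUC rather than accuracy. The sketch below illustrates this on synthetic data from `make_classification` (the data and the smaller parameter grid are assumptions for illustration).

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.33, random_state=1)

grid = GridSearchCV(KNeighborsClassifier(), {"n_neighbors": [1, 5, 15]},
                    cv=5, scoring="roc_auc")
grid.fit(X_tr, y_tr)

auc_from_score = grid.score(X_te, y_te)                           # uses the scoring metric
auc_manual = roc_auc_score(y_te, grid.predict_proba(X_te)[:, 1])  # same quantity, by hand
acc = accuracy_score(y_te, grid.predict(X_te))                    # accuracy is separate

print(auc_from_score, auc_manual, acc)
```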

In [16]:
grid_knn.best_params_

{'n_neighbors': 15}

In [17]:
cv_results = pd.DataFrame.from_dict(grid_knn.cv_results_)
cv_results[['param_n_neighbors','mean_test_score']]

| | param_n_neighbors | mean_test_score |
|---|---|---|
| 0 | 1 | 0.999505 |
| 1 | 5 | 0.999752 |
| 2 | 10 | 0.999752 |
| 3 | 15 | 0.999998 |
| 4 | 20 | 0.999998 |


In [18]:
from sklearn.metrics import roc_auc_score
y_knn_predict = grid_knn.predict(x_test)
y_knn_train_predict = grid_knn.predict(x_train)
# roc_auc_score expects the true labels first, then the predictions
print('Train roc_auc_score: %.2f'%roc_auc_score(y_train, y_knn_train_predict))
print('Test roc_auc_score: %.2f '%roc_auc_score(y_test, y_knn_predict))

Train roc_auc_score: 1.00
Test roc_auc_score: 1.00 
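`roc_auc_score` takes the true labels as its first argument and the scores as its second, and the result is not symmetric in the two. A small sketch with made-up labels shows that swapping them changes the value, and that continuous scores usually carry more information than hard labels:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 0, 0, 1, 1])
y_pred = np.array([0, 0, 0, 1, 1, 1])                # hard class labels
y_prob = np.array([0.1, 0.2, 0.3, 0.6, 0.7, 0.9])    # hypothetical probabilities

auc_correct = roc_auc_score(y_true, y_pred)   # true labels first
auc_swapped = roc_auc_score(y_pred, y_true)   # arguments reversed: a different number
auc_probs = roc_auc_score(y_true, y_prob)     # ranking by probability

print(auc_correct, auc_swapped, auc_probs)
```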


In [19]:
knn_cmatrix = confusion_matrix(y_test, y_knn_predict)
print (knn_cmatrix)

[[1061    0]
 [   5  994]]
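In scikit-learn's convention the rows of the confusion matrix are the true classes and the columns the predicted ones, so with 0 = edible and 1 = poisonous the 5 in the lower-left corner counts poisonous mushrooms predicted as edible. A short sketch of the arithmetic, using the matrix printed above:

```python
import numpy as np

cm = np.array([[1061, 0],
               [5, 994]])           # KNN test-set confusion matrix from the output above
tn, fp, fn, tp = cm.ravel()         # row-major order: tn, fp, fn, tp

accuracy = (tn + tp) / cm.sum()
recall_poisonous = tp / (tp + fn)   # share of poisonous mushrooms that were caught
precision_poisonous = tp / (tp + fp)

print(f"accuracy={accuracy:.4f} recall={recall_poisonous:.4f} precision={precision_poisonous:.4f}")
```

For this problem the false negatives (`fn`) are the costly errors, so recall on the poisonous class is worth reporting next to the overall score.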


In [20]:
report_table =[['knn', 'k = 15', grid_knn.score(x_train, y_train), grid_knn.score(x_test, y_test), roc_auc_score(y_knn_train_predict, y_train), roc_auc_score(y_knn_predict, y_test) ]]

## logistic regression

In [21]:
from sklearn.linear_model import LogisticRegression

param_grid = {'C':[0.01, 0.1, 1, 10, 100]}

logis_reg = LogisticRegression(solver='liblinear',multi_class='auto')

grid_log = GridSearchCV(logis_reg, param_grid = param_grid, cv = 5)
grid_log.fit(x_train, y_train)

GridSearchCV(cv=5, error_score=nan,
             estimator=LogisticRegression(C=1.0, class_weight=None, dual=False,
                                          fit_intercept=True,
                                          intercept_scaling=1, l1_ratio=None,
                                          max_iter=100, multi_class='auto',
                                          n_jobs=None, penalty='l2',
                                          random_state=None, solver='liblinear',
                                          tol=0.0001, verbose=0,
                                          warm_start=False),
             iid='deprecated', n_jobs=None,
             param_grid={'C': [0.01, 0.1, 1, 10, 100]}, pre_dispatch='2*n_jobs',
             refit=True, return_train_score=False, scoring=None, verbose=0)

In [22]:
cv_results = pd.DataFrame.from_dict(grid_log.cv_results_)
cv_results[['params','mean_test_score']]

| | params | mean_test_score |
|---|---|---|
| 0 | {'C': 0.01} | 0.9811 |
| 1 | {'C': 0.1} | 0.996651 |
| 2 | {'C': 1} | 0.999282 |
| 3 | {'C': 10} | 0.999282 |
| 4 | {'C': 100} | 0.999282 |


In [23]:
from sklearn.metrics import roc_auc_score
y_log_predict = grid_log.predict(x_test)
y_log_train_predict = grid_log.predict(x_train)
print('Train roc_auc_score: %.2f'%roc_auc_score(y_log_train_predict, y_train))
print('Test roc_auc_score: %.2f '%roc_auc_score(y_log_predict, y_test))

Train roc_auc_score: 1.00
Test roc_auc_score: 1.00 


In [24]:
log_cmatrix = confusion_matrix(y_test, y_log_predict)
print (log_cmatrix)

[[1061    0]
 [   2  997]]


In [25]:
report_table = report_table+[['Logistic', 'C = 1', grid_log.score(x_train, y_train), grid_log.score(x_test, y_test), roc_auc_score(y_log_train_predict, y_train), roc_auc_score(y_log_predict, y_test) ]]

## svm: kernel (linear, rbf)

In [26]:
from sklearn import svm
from sklearn.model_selection import GridSearchCV
grid_svc = GridSearchCV(svm.SVC(gamma='auto'), {
    'C': [0.01,0.1,1,10,100],
    'kernel': ['linear','rbf']
}, cv=5, return_train_score=True)
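# NOTE: this fits on the full dataset (x, y), not only on the training split,
# so the rows of x_test are seen during fitting
grid_svc.fit(x,y)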

GridSearchCV(cv=5, error_score=nan,
             estimator=SVC(C=1.0, break_ties=False, cache_size=200,
                           class_weight=None, coef0=0.0,
                           decision_function_shape='ovr', degree=3,
                           gamma='auto', kernel='rbf', max_iter=-1,
                           probability=False, random_state=None, shrinking=True,
                           tol=0.001, verbose=False),
             iid='deprecated', n_jobs=None,
             param_grid={'C': [0.01, 0.1, 1, 10, 100],
                         'kernel': ['linear', 'rbf']},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
             scoring=None, verbose=0)
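The search above is fitted on `x` and `y`, which include the test rows, so the test-set results that follow are not measured on unseen data. A minimal sketch of the alternative, fitting only on the training split and scoring on the held-out split, using synthetic data and a smaller grid (both assumptions for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.33, random_state=1)

grid = GridSearchCV(SVC(gamma="auto"),
                    {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]},
                    cv=5)
grid.fit(X_tr, y_tr)                      # the test rows never enter the search

held_out_acc = grid.score(X_te, y_te)     # accuracy, since no scoring was given
print(grid.best_params_, held_out_acc)
```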

In [27]:
df = pd.DataFrame(grid_svc.cv_results_)
df[['param_C','param_kernel','mean_train_score','mean_test_score']]

| | param_C | param_kernel | mean_train_score | mean_test_score |
|---|---|---|---|---|
| 0 | 0.01 | linear | 0.993269 | 0.855609 |
| 1 | 0.01 | rbf | 0.895112 | 0.887981 |
| 2 | 0.1 | linear | 0.999239 | 0.939103 |
| 3 | 0.1 | rbf | 0.977444 | 0.879808 |
| 4 | 1.0 | linear | 1.0 | 0.959615 |
| 5 | 1.0 | rbf | 0.997676 | 0.861058 |
| 6 | 10.0 | linear | 1.0 | 0.959615 |
| 7 | 10.0 | rbf | 0.99996 | 0.943269 |
| 8 | 100.0 | linear | 1.0 | 0.959615 |
| 9 | 100.0 | rbf | 1.0 | 0.945513 |


In [28]:
y_svc_predict_train = grid_svc.predict(x_train)
y_svc_predict = grid_svc.predict(x_test)

In [29]:
svc_cmatrix = confusion_matrix(y_test, y_svc_predict)
print (svc_cmatrix)

[[1061    0]
 [   0  999]]


In [30]:
report_table = report_table + [['SVC linear Kernalized', 'C = 1', grid_svc.score(x_train, y_train), grid_svc.score(x_test, y_test), roc_auc_score(y_svc_predict_train, y_train), roc_auc_score(y_svc_predict, y_test)]]

## svm poly

In [31]:
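# NOTE: no kernel is passed, so SVC uses its default 'rbf' kernel (see the repr below);
# degree only has an effect when kernel='poly'
svc_poly = svm.SVC(degree=3)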
param_grid = {'C':[0.01, 0.1, 1, 10]}

grid_svcPoly = GridSearchCV(svc_poly, param_grid = param_grid, cv = 5, n_jobs = -1, scoring='roc_auc')
grid_svcPoly.fit(x_train, y_train)

GridSearchCV(cv=5, error_score=nan,
             estimator=SVC(C=1.0, break_ties=False, cache_size=200,
                           class_weight=None, coef0=0.0,
                           decision_function_shape='ovr', degree=3,
                           gamma='scale', kernel='rbf', max_iter=-1,
                           probability=False, random_state=None, shrinking=True,
                           tol=0.001, verbose=False),
             iid='deprecated', n_jobs=-1, param_grid={'C': [0.01, 0.1, 1, 10]},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring='roc_auc', verbose=0)
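The repr above shows `kernel='rbf'`: `degree` alone does not select a polynomial kernel. A short sketch of the difference (no fitting involved):

```python
from sklearn.svm import SVC

default_svc = SVC(degree=3)                 # kernel not given: falls back to 'rbf'
poly_svc = SVC(kernel="poly", degree=3)     # explicit polynomial kernel of degree 3

print(default_svc.kernel)
print(poly_svc.kernel)
```

If a polynomial kernel is what is wanted, `kernel='poly'` has to be passed explicitly. The scores it would give on this dataset are not shown in this notebook.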

In [32]:
print("Best cross-validation score: {:.2f}".format(grid_svcPoly.best_score_))
print('Best parameters term:',grid_svcPoly.best_params_)
print("Training Score: {:.4f}".format(grid_svcPoly.score(x_train, y_train)))
print("Testing Score: {:.4f}".format(grid_svcPoly.score(x_test, y_test)))

Best cross-validation score: 1.00
Best parameters term: {'C': 1}
Training Score: 1.0000
Testing Score: 1.0000


In [33]:
df = pd.DataFrame(grid_svcPoly.cv_results_)
df[['param_C','mean_test_score']]

| | param_C | mean_test_score |
|---|---|---|
| 0 | 0.01 | 0.993561 |
| 1 | 0.1 | 0.99994 |
| 2 | 1.0 | 1.0 |
| 3 | 10.0 | 1.0 |


In [34]:
y_svcPoly_predict_train = grid_svcPoly.predict(x_train)
y_svcPoly_predict = grid_svcPoly.predict(x_test)

In [35]:
svcPoly_cmatrix = confusion_matrix(y_test, y_svcPoly_predict)
print (svcPoly_cmatrix)

[[1061    0]
 [   0  999]]


In [36]:
report_table = report_table + [['SVC Poly', 'C = 1', grid_svcPoly.score(x_train, y_train), grid_svcPoly.score(x_test, y_test), roc_auc_score(y_svcPoly_predict_train, y_train), roc_auc_score(y_svcPoly_predict, y_test)]]

## linear svm

In [37]:
from sklearn.svm import LinearSVC

svc_lin = LinearSVC()
param_grid = {'C':[0.001, 0.01, 0.1, 1, 10, 100]}

grid_svc_lin = GridSearchCV(svc_lin, param_grid, cv = 5, scoring='roc_auc', return_train_score=True)

In [38]:
grid_svc_lin.fit(x_train, y_train)



GridSearchCV(cv=5, error_score=nan,
             estimator=LinearSVC(C=1.0, class_weight=None, dual=True,
                                 fit_intercept=True, intercept_scaling=1,
                                 loss='squared_hinge', max_iter=1000,
                                 multi_class='ovr', penalty='l2',
                                 random_state=None, tol=0.0001, verbose=0),
             iid='deprecated', n_jobs=None,
             param_grid={'C': [0.001, 0.01, 0.1, 1, 10, 100]},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
             scoring='roc_auc', verbose=0)

In [39]:
y_svc_lin_predict_train = grid_svc_lin.predict(x_train)
y_svc_lin_predict = grid_svc_lin.predict(x_test)

In [40]:
df = pd.DataFrame(grid_svc_lin.cv_results_)
df[['param_C','mean_test_score']]

| | param_C | mean_test_score |
|---|---|---|
| 0 | 0.001 | 0.998101 |
| 1 | 0.01 | 0.999534 |
| 2 | 0.1 | 0.99929 |
| 3 | 1.0 | 0.99921 |
| 4 | 10.0 | 0.999204 |
| 5 | 100.0 | 0.999204 |


In [41]:
svc_lin_cmatrix = confusion_matrix(y_test, y_svc_lin_predict)
print (svc_lin_cmatrix)

[[1061    0]
 [   8  991]]


In [42]:
report_table = report_table + [['LinearSVC', 'C = 0.01', grid_svc_lin.score(x_train, y_train), grid_svc_lin.score(x_test, y_test), roc_auc_score(y_svc_lin_predict_train, y_train), roc_auc_score(y_svc_lin_predict, y_test)]]

## decision tree classifier

In [43]:
from sklearn.tree import DecisionTreeClassifier

param_grid = {'max_depth':[1, 2, 3, 4, 5, 6]}
dtree = DecisionTreeClassifier()

grid_tree = GridSearchCV(dtree, param_grid, cv = 5, scoring='roc_auc', return_train_score=True)
grid_tree.fit(x_train, y_train)

GridSearchCV(cv=5, error_score=nan,
             estimator=DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None,
                                              criterion='gini', max_depth=None,
                                              max_features=None,
                                              max_leaf_nodes=None,
                                              min_impurity_decrease=0.0,
                                              min_impurity_split=None,
                                              min_samples_leaf=1,
                                              min_samples_split=2,
                                              min_weight_fraction_leaf=0.0,
                                              presort='deprecated',
                                              random_state=None,
                                              splitter='best'),
             iid='deprecated', n_jobs=None,
             param_grid={'max_depth': [1, 2, 3, 4, 5, 6]},
             pre_di

In [44]:
df = pd.DataFrame(grid_tree.cv_results_)
df[['params','mean_test_score']]

| | params | mean_test_score |
|---|---|---|
| 0 | {'max_depth': 1} | 0.888549 |
| 1 | {'max_depth': 2} | 0.964457 |
| 2 | {'max_depth': 3} | 0.987669 |
| 3 | {'max_depth': 4} | 0.994689 |
| 4 | {'max_depth': 5} | 0.998858 |
| 5 | {'max_depth': 6} | 0.999235 |


In [45]:
y_dtree_predict_train = grid_tree.predict(x_train)
y_dtree_predict = grid_tree.predict(x_test)

In [46]:
dtree_cmatrix = confusion_matrix(y_test, y_dtree_predict)
print (dtree_cmatrix)

[[1061    0]
 [   7  992]]


In [47]:
report_table = report_table + [['Decision Tree', 'max_depth = 6', grid_tree.score(x_train, y_train), grid_tree.score(x_test, y_test), roc_auc_score(y_dtree_predict_train, y_train), roc_auc_score(y_dtree_predict, y_test)]]

In [48]:
report = pd.DataFrame(report_table,columns = ['Model name', 'Model parameter', 'Train accuracy', 'Test accuracy', 'Train auc score', 'Test auc score'])

In [49]:
report

| | Model name | Model parameter | Train accuracy | Test accuracy | Train auc score | Test auc score |
|---|---|---|---|---|---|---|
| 0 | knn | k = 15 | 1.0 | 1.0 | 0.999537 | 0.997655 |
| 1 | Logistic | C = 1 | 0.999761 | 0.999029 | 0.999768 | 0.999059 |
| 2 | SVC linear Kernalized | C = 1 | 1.0 | 1.0 | 1.0 | 1.0 |
| 3 | SVC Poly | C = 1 | 1.0 | 1.0 | 1.0 | 1.0 |
| 4 | LinearSVC | C = 0.01 | 0.99994 | 0.99987 | 0.998383 | 0.996258 |
| 5 | Decision Tree | max_depth = 6 | 0.999597 | 0.998337 | 0.999075 | 0.996723 |


In [50]:
report.to_csv (r'C:\Users\ray19\Downloads\export_dataframe.csv', index = False, header=True)

As the table above shows, all of the models score very high on both the training and test sets.
Among them, the two kernelized SVC models (the 'linear' kernel and the one labeled 'poly', both with C = 1) have the highest scores, with 1.0 in every column.
We therefore choose a kernelized SVC model for this classification problem.

# Final SVM model

In [51]:
import numpy as np
import pylab as pl
from sklearn import svm
from sklearn.utils import shuffle
from sklearn.metrics import roc_curve, auc

In [52]:
# linear-kernel SVC with C = 1; probability=True enables predict_proba
clf = svm.SVC(kernel='linear', C=1.0, probability=True)
clf.fit(x_train,y_train)

SVC(C=1.0, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='scale', kernel='linear',
    max_iter=-1, probability=True, random_state=None, shrinking=True, tol=0.001,
    verbose=False)

In [53]:
clf.score(x_test,y_test)

1.0

In [54]:
y_pred = clf.predict(x_test)
clf_cmatrix = confusion_matrix(y_test, y_pred)
print (clf_cmatrix)

[[1061    0]
 [   0  999]]
