# Project 1 of the Machine Learning Basic Course, Framgia Training

### The dataset used in this project is the Mushroom Database: https://archive.ics.uci.edu/ml/datasets/mushroom

### The purpose of this project is to use the algorithms studied in the course (KNN, Decision Tree, Naive Bayes, SVM) to predict whether a mushroom is poisonous or edible.
--------------

### 1. Import necessary libraries:

In [92]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

%matplotlib inline

### 2. Read the CSV data file into a pandas DataFrame:

In [93]:
df = pd.read_csv('data/agaricus-lepiota.csv')

X = df.drop('class', axis=1).values

y = df.loc[:, 'class'].values

columns = df.drop('class', axis=1).columns

In [94]:
columns

Index(['cap-shape', 'cap-surface', 'cap-color', 'bruises', 'odor',
       'gill-attachment', 'gill-spacing', 'gill-size', 'gill-color',
       'stalk-shape', 'stalk-root', 'stalk-surface-above-ring',
       'stalk-surface-below-ring', 'stalk-color-above-ring',
       'stalk-color-below-ring', 'veil-type', 'veil-color', 'ring-number',
       'ring-type', 'spore-print-color', 'population', 'habitat'],
      dtype='object')
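Before splitting, it can help to look at the class balance and at missing values. The sketch below uses a hypothetical three-row frame (`toy_df`) instead of the real CSV, and it assumes the file keeps the UCI convention of marking missing values with `'?'`.

```python
import pandas as pd

# Hypothetical three-row frame standing in for the real CSV.
toy_df = pd.DataFrame({
    'class': ['e', 'p', 'e'],
    'stalk-root': ['b', '?', 'c'],
})

# Class balance: number of edible ('e') and poisonous ('p') rows.
class_counts = toy_df['class'].value_counts()

# Number of '?' placeholders in each column.
missing_counts = (toy_df == '?').sum()

print(class_counts.to_dict())
print(missing_counts.to_dict())
```

The same two expressions should work on `df` once it is loaded.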

In [95]:
# Split training vs testing dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

### 3. Pre-processing

Convert each categorical string feature to integer codes. This is needed because scikit-learn's `DecisionTreeClassifier` requires numeric input.

In [96]:
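def fit_categories(X):
    # Learn each column's sorted category values from the training data only.
    return [np.unique(X[:, d]) for d in range(X.shape[1])]

def convert_feature(X, categories):
    # Replace each string category with its index in the training categories,
    # so the same value gets the same code in both splits.
    converted = np.empty(X.shape, dtype=int)
    for d, column_categories in enumerate(categories):
        lookup = {value: index for index, value in enumerate(column_categories)}
        # A value never seen in the training data is encoded as -1.
        converted[:, d] = [lookup.get(value, -1) for value in X[:, d]]
    return converted

In [97]:
categories = fit_categories(X_train)
X_train = convert_feature(X_train, categories)
X_test = convert_feature(X_test, categories)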
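The category-to-index mapping should be learned once and shared by both splits. Here is a toy sketch of what goes wrong otherwise. The `encode` helper and the two small columns are hypothetical, made up only for this illustration.

```python
import numpy as np

def encode(column, categories):
    # Map each value to its position in `categories`; unseen values become -1.
    lookup = {value: index for index, value in enumerate(categories)}
    return np.array([lookup.get(value, -1) for value in column])

# Hypothetical toy column: the test split happens to lack the category 'b'.
train_col = np.array(['b', 'f', 'x', 'x'])
test_col = np.array(['f', 'x'])

# Encoding each split with its own categories gives 'x' two different codes.
separate_train = encode(train_col, np.unique(train_col))
separate_test = encode(test_col, np.unique(test_col))

# Reusing the training categories keeps the codes consistent.
shared_test = encode(test_col, np.unique(train_col))

print(separate_train, separate_test, shared_test)
```

In the separate encoding `'x'` is 2 in the training split but 1 in the test split, so a model would read the test rows with the wrong meaning. With the shared categories `'x'` stays 2.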

### 4. Decision Tree algorithm

Since the features in this dataset are all categorical, I think the most appropriate algorithm for it is Decision Tree.

In [98]:
from sklearn.tree import DecisionTreeClassifier

dt = DecisionTreeClassifier()

Fit the Decision Tree on the training dataset:

In [99]:
dt.fit(X_train, y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')
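One thing worth knowing: scikit-learn's tree treats the integer codes as plain numbers and splits them with thresholds. The sketch below is a toy illustration with hypothetical data (`X_toy`, `y_toy`), not part of the mushroom experiment. It shows that isolating a middle category takes two threshold splits.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy column: integer codes 0, 1, 2 for three categories,
# where only the middle category is poisonous.
X_toy = np.array([[0], [1], [2]])
y_toy = np.array(['e', 'p', 'e'])

toy_tree = DecisionTreeClassifier(random_state=0)
toy_tree.fit(X_toy, y_toy)

# Isolating the middle code takes two threshold splits (at 0.5 and at 1.5).
print("depth:", toy_tree.get_depth())
print("predictions:", toy_tree.predict(X_toy))
```

So the tree can still separate any category, it just may need more than one split to do it.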

In [100]:
pred_train = dt.predict(X_train)

In [101]:
print("Training accuracy: %.2f percent" % (len(y_train[pred_train == y_train])/len(y_train)*100))

Training accuracy: 100.00 percent
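The accuracy expression above can also be written as the mean of a boolean array. A minimal sketch, using hypothetical labels and a hypothetical helper named `accuracy_percent`:

```python
import numpy as np

# Hypothetical labels and predictions, not taken from the mushroom data.
y_true = np.array(['e', 'p', 'p', 'e'])
y_hat = np.array(['e', 'p', 'e', 'e'])

def accuracy_percent(y_true, y_hat):
    # The mean of a boolean array is the fraction of True entries.
    return np.mean(y_true == y_hat) * 100

print("Accuracy: %.2f percent" % accuracy_percent(y_true, y_hat))
```

`sklearn.metrics.accuracy_score` returns the same fraction (without the multiplication by 100), so either form can be used for the cells below.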


No surprise: we are predicting on the same X_train the tree was fitted on, and a tree with no depth limit can memorize it, so the accuracy is 100%.
Now let's predict on X_test.

In [102]:
pred_test = dt.predict(X_test)

In [103]:
print("Test accuracy: %.2f percent" % (len(y_test[pred_test == y_test])/len(y_test)*100))

Test accuracy: 100.00 percent


100% accuracy on the held-out test set is an excellent result for Decision Tree. Now let's compare Decision Tree with other algorithms: KNN, Naive Bayes, SVM.

### 5. Compare Decision Tree with other algorithms
#### 5.1 KNN:

In [104]:
from sklearn.neighbors import KNeighborsClassifier

In [105]:
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=None, n_neighbors=5, p=2,
           weights='uniform')

In [106]:
knn_pred = knn.predict(X_test)

In [107]:
print("KNN accuracy: %.2f percent" % (len(y_test[knn_pred == y_test])/len(y_test)*100))

KNN accuracy: 99.84 percent
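KNN measures distance on the integer codes, so code 0 looks "closer" to code 1 than to code 5 even though the categories have no order. One-hot encoding is a common alternative for distance-based models. The sketch below uses hypothetical toy data and is not the encoding used for the result above.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy data with two categorical string features.
X_toy_train = np.array([['x', 's'], ['b', 'y'], ['x', 'y'], ['b', 's']])
y_toy_train = np.array(['e', 'p', 'e', 'p'])
X_toy_test = np.array([['x', 's'], ['b', 'y']])

# Fit the encoder on the training split only, then reuse it for the test split.
encoder = OneHotEncoder(handle_unknown='ignore')
X_train_hot = encoder.fit_transform(X_toy_train).toarray()
X_test_hot = encoder.transform(X_toy_test).toarray()

knn_toy = KNeighborsClassifier(n_neighbors=1)
knn_toy.fit(X_train_hot, y_toy_train)

print(X_train_hot.shape)
print(knn_toy.predict(X_test_hot))
```

Each category becomes its own 0/1 column, so all distinct categories are equally far apart.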


#### 5.2 Naive Bayes

In [108]:
from sklearn.naive_bayes import GaussianNB

In [109]:
gaussian_nb = GaussianNB()

In [110]:
gaussian_nb.fit(X_train, y_train)

GaussianNB(priors=None, var_smoothing=1e-09)

In [111]:
nb_pred = gaussian_nb.predict(X_test)

In [112]:
print("Naive Bayes accuracy: %.2f percent" % (len(y_test[nb_pred == y_test])/len(y_test)*100))

Naive Bayes accuracy: 92.17 percent
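`GaussianNB` assumes each feature is a continuous, normally distributed number, which does not really match integer category codes. A possible alternative is `CategoricalNB`, which models each feature as a categorical distribution. This is only a sketch on hypothetical toy data, and it assumes scikit-learn 0.22 or newer, where `CategoricalNB` is available. I have not run it on the mushroom data.

```python
import numpy as np
from sklearn.naive_bayes import CategoricalNB

# Hypothetical toy data: non-negative integer codes, as CategoricalNB expects.
# The first feature decides the class, the second one carries no information.
X_toy = np.array([[0, 1], [0, 0], [1, 1], [1, 0]])
y_toy = np.array(['e', 'e', 'p', 'p'])

cat_nb = CategoricalNB()
cat_nb.fit(X_toy, y_toy)

toy_pred = cat_nb.predict(np.array([[0, 1], [1, 0]]))
print(toy_pred)
```

Note that `CategoricalNB` needs non-negative codes, so a -1 placeholder for unseen categories would have to be handled before calling it.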


#### 5.3 SVM

In [113]:
from sklearn.svm import SVC

In [114]:
svm = SVC()

In [115]:
svm.fit(X_train, y_train)



SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
  kernel='rbf', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False)

In [116]:
svm_pred = svm.predict(X_test)

In [117]:
print("SVM accuracy: %.2f percent" % (len(y_test[svm_pred == y_test])/len(y_test)*100))

SVM accuracy: 100.00 percent


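These experiments show that all four algorithms work well on the Mushroom dataset, perhaps because the dataset is clean and its classes are easy to separate. `Decision Tree` and `SVM` work best, both reaching 100% test accuracy, followed by KNN (99.84%) and Naive Bayes (92.17%).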