## Solving the problem with different algorithms

#### Using different algorithms in classification and finding the one that performs the best

#### Tags:
    Data: labeled data, Kaggle competition
    Technologies: python, pandas, scikit-learn
    Techniques: running different algorithms on the same data
    
#### Resources:
[Kaggle competition data](https://www.kaggle.com/uciml/mushroom-classification)



## Using different algorithms

Every problem and every data set is different, so knowing how to apply a range of statistical approaches and machine learning algorithms helps identify the method, or set of methods, that performs best on a given data set.

With the data set at hand we will perform a binary classification, predicting one of two classes: e (edible) or p (poisonous). The main idea is to show how several algorithms can be compared to identify the best one. All the algorithms are used out of the box with no parameter tuning (this will be looked at in detail in a separate project). We will use AUC as the metric to decide which model performs best.

We will be using 3 different classifiers for the task:
    1. Logistic Regression - an extension of the multiple linear regression as a Generalized Linear Model
    2. Random Forest - an ensemble of Decision Trees
    3. K-Nearest Neighbours - another non-parametric approach that classifies an observation by majority vote among its k nearest neighbours, measured by Euclidean distance
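
To make the K-Nearest Neighbours description more concrete, here is a minimal from-scratch sketch, assuming numeric features and a plain majority vote. The function name `knn_predict` and the toy arrays are illustrative only; the notebook itself relies on scikit-learn's `KNeighborsClassifier`.

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    """Predict the label of x_new by majority vote among its k nearest training points."""
    # Euclidean distance from x_new to every training observation
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Majority vote over the neighbours' labels
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

# Toy data (hypothetical): two well-separated clusters
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

pred_low = knn_predict(X_toy, y_toy, np.array([0.05, 0.05]))
pred_high = knn_predict(X_toy, y_toy, np.array([5.0, 5.1]))
print(pred_low, pred_high)
```

Because the distance is computed on raw numbers, integer-coded categorical features make the result depend on how the categories happen to be numbered.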


In [97]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import roc_auc_score, confusion_matrix, classification_report
%matplotlib inline

In [98]:
# import the relevant dataset
df = pd.read_csv('../data/mushrooms.csv')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8124 entries, 0 to 8123
Data columns (total 23 columns):
class                       8124 non-null object
cap-shape                   8124 non-null object
cap-surface                 8124 non-null object
cap-color                   8124 non-null object
bruises                     8124 non-null object
odor                        8124 non-null object
gill-attachment             8124 non-null object
gill-spacing                8124 non-null object
gill-size                   8124 non-null object
gill-color                  8124 non-null object
stalk-shape                 8124 non-null object
stalk-root                  8124 non-null object
stalk-surface-above-ring    8124 non-null object
stalk-surface-below-ring    8124 non-null object
stalk-color-above-ring      8124 non-null object
stalk-color-below-ring      8124 non-null object
veil-type                   8124 non-null object
veil-color                  8124 non-null object
ring-number

##### There are 8124 observations and 23 columns 

In [99]:
# Data preparation
X = df.drop(['class'],axis=1)
y = df['class']

# Encode the categorical data as integer codes, one column per feature.
# One-hot encoding (X = pd.get_dummies(X)) was deliberately not used here:
# it creates a large matrix and all the models then reach an AUC score of 1.
X = X.apply(LabelEncoder().fit_transform)

encoder = LabelEncoder()
y = encoder.fit_transform(y)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test  = train_test_split(X, y, test_size=0.2, random_state=42)
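
The difference between the two encodings mentioned in the comments can be seen on a small example. This is a sketch on a made-up frame (the values below are hypothetical and not read from `mushrooms.csv`): label encoding keeps one integer column per feature, while one-hot encoding creates one indicator column per category.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame (hypothetical values, not taken from mushrooms.csv)
toy = pd.DataFrame({
    "odor": ["almond", "foul", "none", "foul"],
    "bruises": ["t", "f", "f", "t"],
})

# Label encoding: one integer column per feature; codes follow sorted category order
label_encoded = toy.apply(LabelEncoder().fit_transform)

# One-hot encoding: one indicator column per (feature, category) pair
one_hot = pd.get_dummies(toy)

print(label_encoded)
print(one_hot.columns.tolist())
```

Label encoding is compact, but it imposes an arbitrary order on the categories, which distance-based and linear models will treat as meaningful.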


In [110]:
# Train the different models

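def train_predict(classifiers, X_train, y_train, X_test, y_test):
    '''
    Fit each classifier on the training data and evaluate it on the test data.

    Returns two dicts keyed by model name: the AUC scores and the confusion matrices.
    '''

    scores = {}
    class_report = {}  # holds the confusion matrices, despite the name
    for name, model in classifiers.items():
        model.fit(X_train, y_train)
        y_hat = model.predict(X_test)

        # Note: AUC is computed from hard class predictions, not probabilities
        scores[name] = roc_auc_score(y_test, y_hat)
        class_report[name] = confusion_matrix(y_test, y_hat)

    return scores, class_report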

classifiers = {}
classifiers['Logistic Regression'] = LogisticRegression(random_state=42)
classifiers['Random Forest'] = RandomForestClassifier(random_state=42)
classifiers['KNN'] = KNeighborsClassifier(n_neighbors=3)

scores, class_report = train_predict(classifiers, X_train, y_train, X_test, y_test)

In [111]:
scores

{'Logistic Regression': 0.9464553885920761,
 'Random Forest': 1.0,
 'KNN': 0.9982206405693951}

In [112]:
for i in class_report:
    print(class_report[i])

[[798  45]
 [ 42 740]]
[[843   0]
 [  0 782]]
[[840   3]
 [  0 782]]


### Selecting the best model

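By inspecting the AUC scores we found that all of the models perform extremely well: Random Forest achieves a perfect score of 1.0 on the test set, followed by KNN (0.998) and Logistic Regression (0.946). Note that these scores were computed from the hard class predictions returned by `predict`, not from predicted probabilities. For this data set, separating edible from poisonous mushrooms appears to be an easy task.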
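
AUC is normally computed from scores or probabilities, not from hard class labels. The sketch below illustrates the difference on synthetic data from `make_classification` (an assumption made to keep the example self-contained; it is not the mushroom data set, and the variable names are illustrative). With binary hard predictions the ROC curve has a single operating point, so the resulting AUC equals balanced accuracy.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: not the mushroom data set)
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# AUC from hard 0/1 predictions: a single operating point on the ROC curve
auc_from_labels = roc_auc_score(y_te, model.predict(X_te))
# With binary hard predictions this equals balanced accuracy
bal_acc = balanced_accuracy_score(y_te, model.predict(X_te))

# AUC from predicted probabilities of the positive class: uses the full ranking
auc_from_proba = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])

print(auc_from_labels, bal_acc, auc_from_proba)
```

Passing `predict_proba(...)[:, 1]` to `roc_auc_score` is the more usual choice when the goal is to compare how well models rank observations.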

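The confusion matrices show the true negatives, false positives, false negatives and true positives for each model. scikit-learn puts the true classes in rows and the predicted classes in columns, and `LabelEncoder` maps e to 0 and p to 1, so each matrix reads [[TN, FP], [FN, TP]] with poisonous as the positive class. Logistic Regression labelled 42 poisonous mushrooms as edible and 45 edible mushrooms as poisonous, KNN flagged 3 edible mushrooms as poisonous, and Random Forest made no errors on the test set.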
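
As a worked example, the cells of the Logistic Regression matrix printed above can be unpacked and turned into rates. This sketch assumes the scikit-learn layout (rows are true classes, columns are predicted classes) and the encoding 0 = edible, 1 = poisonous; the variable names are illustrative.

```python
import numpy as np

# Logistic Regression confusion matrix as printed in the output above
# rows = true class, columns = predicted class; 0 = edible, 1 = poisonous
cm = np.array([[798, 45],
               [42, 740]])

# ravel() flattens row by row, giving TN, FP, FN, TP for a binary problem
tn, fp, fn, tp = cm.ravel()

sensitivity = tp / (tp + fn)   # share of poisonous mushrooms caught
specificity = tn / (tn + fp)   # share of edible mushrooms recognised
balanced = (sensitivity + specificity) / 2

print(fn, round(sensitivity, 4), round(specificity, 4), round(balanced, 4))
```

The mean of sensitivity and specificity reproduces the Logistic Regression score reported earlier, which is consistent with the AUC having been computed from hard predictions. For this problem the false negatives (poisonous predicted as edible) are the costlier error.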