### Training Classification Models using sklearn

### Step 1: Import your libraries
Import the following first:
1. pandas

In [2]:
# Step 1: Import the library
import pandas as pd

### Step 2: Read the CSV from Part III
We will now read the CSV that we exported from Part III containing the dummified values.

In [3]:
# Step 2: Read the CSV from Part III
df = pd.read_csv('dummified_df.csv', index_col = 0)
df

Unnamed: 0,class,cap_shape_c,cap_shape_f,cap_shape_k,cap_shape_s,cap_shape_x,cap_surface_g,cap_surface_s,cap_surface_y,cap_color_c,...,population_n,population_s,population_v,population_y,habitat_g,habitat_l,habitat_m,habitat_p,habitat_u,habitat_w
0,1,0,0,0,0,1,0,1,0,0,...,0,1,0,0,0,0,0,0,1,0
1,0,0,0,0,0,1,0,1,0,0,...,1,0,0,0,1,0,0,0,0,0
2,0,0,0,0,0,0,0,1,0,0,...,1,0,0,0,0,0,1,0,0,0
3,1,0,0,0,0,1,0,0,1,0,...,0,1,0,0,0,0,0,0,1,0
4,0,0,0,0,0,1,0,1,0,0,...,0,0,0,0,1,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8119,0,0,0,1,0,0,0,1,0,0,...,0,0,0,0,0,1,0,0,0,0
8120,0,0,0,0,0,1,0,1,0,0,...,0,0,1,0,0,1,0,0,0,0
8121,0,0,1,0,0,0,0,1,0,0,...,0,0,0,0,0,1,0,0,0,0
8122,1,0,0,1,0,0,0,0,1,0,...,0,0,1,0,0,1,0,0,0,0
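
As a side note on the index_col = 0 argument, here is a minimal sketch of what it does. It assumes the CSV in Part III was written with the default DataFrame.to_csv behaviour, which saves the index as an unnamed first column; the toy frame and its values are made up for illustration.

```python
import io
import pandas as pd

# Toy frame standing in for the dummified data (values are made up).
toy = pd.DataFrame({"class": [1, 0, 0], "odor_n": [0, 1, 1]})

# By default, to_csv writes the index as an unnamed first column.
buffer = io.StringIO()
toy.to_csv(buffer)
csv_text = buffer.getvalue()

# Without index_col, that column is read back as data named 'Unnamed: 0'.
with_extra = pd.read_csv(io.StringIO(csv_text))

# With index_col=0, it is restored as the index instead.
restored = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(with_extra.columns.to_list())
print(restored.columns.to_list())
```

Reading without index_col would leave an extra 'Unnamed: 0' column among the features, which is why the argument matters before the data is passed to a model.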


### Step 3: Import the machine learning libraries
1. train_test_split from sklearn.model_selection
2. LogisticRegression from sklearn.linear_model
3. DecisionTreeClassifier from sklearn.tree
4. f1_score from sklearn.metrics
5. confusion_matrix from sklearn.metrics

In [4]:
# Step 3: Import the next set of libraries
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import f1_score
from sklearn.metrics import confusion_matrix

### Step 4: Preparing the independent and dependent variables
With the data loaded and the libraries imported, let's prepare our independent variables (the dummified physical characteristics) and the dependent variable (the binary edible/poisonous label in the class column).

In [5]:
# Step 4: Prepare your independent and dependent variables
# all columns except 'class'
independent_variables = df.columns[1:].to_list()

# independent variables
X = df.drop(labels = ['class'], axis = 'columns')

# dependent variable
y = df.drop(labels = independent_variables, axis = 'columns')
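
Dropping every feature column leaves y as a one-column DataFrame rather than a 1-D Series. scikit-learn accepts both, but it warns when it has to flatten a column-vector y. The sketch below, on a made-up toy frame, shows the difference in shape between the two ways of selecting the label; it is an alternative to consider, not a change to the cell above.

```python
import pandas as pd

# Toy frame (values are made up); 'class' is the first column, as in df.
toy = pd.DataFrame({"class": [1, 0, 1], "odor_n": [0, 1, 0], "habitat_g": [1, 0, 0]})

# Dropping all feature columns keeps a 2-D, one-column DataFrame.
features = toy.columns[1:].to_list()
y_frame = toy.drop(labels=features, axis="columns")

# Selecting the column by name gives a 1-D Series.
y_series = toy["class"]

print(y_frame.shape)   # (3, 1)
print(y_series.shape)  # (3,)
```

Both objects hold the same labels; only the shape differs.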


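### Step 5: Split independent and dependent variables into train and test sets
We'll split the data into train and test sets using the train_test_split function, stratified by y. Note that the call below does not set test_size, so it uses the default of 0.25 and produces a 75/25 split rather than 80/20 (the 2,031 test rows in the confusion matrices later on are that 25%). Pass test_size = 0.2 if you want an 80/20 split.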

In [9]:
# Step 5: Split your data into train and test set
train_X, test_X, train_y, test_y = train_test_split(X, y, stratify = y)
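
To see what the split size and stratify do, here is a small sketch on made-up data with 100 rows and a 40/60 class balance. The random_state argument is added only to make the sketch repeatable; it is not used in the cell above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data (made up): 100 rows, 40 positives and 60 negatives.
X = pd.DataFrame({"feature": np.arange(100)})
y = pd.Series([1] * 40 + [0] * 60, name="class")

# With no test_size or train_size given, test_size defaults to 0.25.
train_X, test_X, train_y, test_y = train_test_split(
    X, y, stratify=y, random_state=0
)

# An explicit 80/20 split.
train_X80, test_X20, train_y80, test_y20 = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Stratifying keeps the 40% positive share in every subset.
print(len(test_X), test_y.mean())
print(len(test_X20), test_y20.mean())
```

Stratifying by y keeps the class proportions the same in the train and test sets, so neither set ends up with a skewed share of one class by chance.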

### Step 6: Training a Logistic Regression model
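1. Declare a variable and store a LogisticRegression model in it (don't forget the brackets, which instantiate the model)
2. Fit the instantiated model to your training data (train_X, train_y)
3. Declare a variable that contains predictions from the model you just trained, using the test dataset (test_X)
4. Print the f1_score and the confusion matrix between the test labels (test_y) and the predictions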


In [11]:
# Step 6a: Declare a variable to store the LogisticRegression model
LogReg = LogisticRegression()

# Step 6b: Fit train dataset
LogReg.fit(train_X, train_y)

# Step 6c: Declare a variable and store predictions made with the model using X test data
predictions = LogReg.predict(test_X)

# Step 6d: Print the f1_score between the y test and prediction
print('f1 score:', f1_score(test_y, predictions))
print('')

# Step 6e: Print the confusion matrix using the y test and prediction
print(confusion_matrix(test_y, predictions))

f1 score: 1.0

[[1052    0]
 [   0  979]]


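Note: this cell also prints the tail of a scikit-learn DataConversionWarning (y = column_or_1d(y, warn=True)). It appears because y is a one-column DataFrame rather than a 1-D array; the model still trains normally.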


The Logistic Regression model classified every test sample correctly: both off-diagonal cells of the confusion matrix are zero. While this is great, a perfect score deserves a second look, so we will train a different model and see whether it reaches the same result.
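
For reference, this is how the F1 score relates to the confusion matrix. The sketch uses made-up labels with a couple of mistakes so the arithmetic is visible; it is not the mushroom data.

```python
from sklearn.metrics import confusion_matrix, f1_score

# Toy labels (made up), where 1 is the positive class.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 0]

# Rows are true labels, columns are predicted labels: [[tn, fp], [fn, tp]].
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# F1 is the harmonic mean of precision and recall.
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1_by_hand = 2 * precision * recall / (precision + recall)

print(tn, fp, fn, tp)
print(f1_by_hand, f1_score(y_true, y_pred))
```

When the false positive and false negative counts are both zero, as in the output above, precision and recall are both 1, so the F1 score is 1.0.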

### Step 7: Train a DecisionTree model


In [16]:
# Step 7a: Declare a variable to store the DecisionTreeClassifier model
DecTreeCls = DecisionTreeClassifier()

# Step 7b: Fit your train dataset
DecTreeCls.fit(train_X, train_y)

# Step 7c: Declare a variable and store your predictions that you make with your model using X test data
predictions = DecTreeCls.predict(test_X)

# Step 7d: Print the f1_score between the y test and prediction
print('f1 score:', f1_score(test_y, predictions))
print('')

# Step 7e: Print the confusion matrix using the y test and prediction
print(confusion_matrix(test_y, predictions))

f1 score: 1.0

[[1052    0]
 [   0  979]]


The decision tree model also classifies every test sample correctly.

### Step 8: Get feature_importances of the DecisionTree model
Let's take a look under the hood and see what's driving the 'decisions' in the DecisionTreeClassifier. 

Create a DataFrame using the train data's columns, and the .feature_importances_ attribute in the model. This will show us a table containing the feature names and their importance in the model. 

In [18]:
# Step 8: Get feature importances of the DecisionTree model
feature_list = X.columns.to_list()
DecTreeModel_df = pd.DataFrame({'feature': feature_list, 'importance': DecTreeCls.feature_importances_})
DecTreeModel_df

Unnamed: 0,feature,importance
0,cap_shape_c,0.000656
1,cap_shape_f,0.000000
2,cap_shape_k,0.000000
3,cap_shape_s,0.000000
4,cap_shape_x,0.000000
...,...,...
90,habitat_l,0.000000
91,habitat_m,0.000000
92,habitat_p,0.000000
93,habitat_u,0.000000
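
To get a feel for what these numbers mean, here is a minimal sketch on made-up data in which one feature (named signal here, a hypothetical name) fully determines the label and another (noise) carries no information. In scikit-learn, a tree's feature_importances_ are impurity-based and normalized to sum to 1.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Toy data (made up): 'signal' fully determines the label, 'noise' does not.
X = pd.DataFrame({
    "signal": [0, 0, 0, 0, 1, 1, 1, 1],
    "noise":  [0, 1, 0, 1, 0, 1, 0, 1],
})
y = X["signal"]

# random_state is set only to make the sketch repeatable.
tree = DecisionTreeClassifier(random_state=0).fit(X, y)

importances = pd.DataFrame(
    {"feature": X.columns, "importance": tree.feature_importances_}
)
print(importances)
print(importances["importance"].sum())
```

A feature the tree never splits on gets an importance of exactly 0, which is why most rows in the table above are zero.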


### Step 9: Sort your feature importance DataFrame
Sort the DataFrame in descending order of importance and take the first 20 rows to identify the top features used.

In [19]:
# Step 9: Sort DataFrame by feature importances

DecTreeModel_df.sort_values('importance', ascending = False).head(20)

Unnamed: 0,feature,importance
22,odor_n,0.610568
42,stalk_root_c,0.179056
44,stalk_root_r,0.086905
80,spore_print_color_r,0.033042
20,odor_l,0.022798
46,stalk_surface_above_ring_s,0.017747
50,stalk_surface_below_ring_y,0.016341
81,spore_print_color_u,0.011617
40,stalk_shape_t,0.008966
88,population_y,0.006231


The interesting observation we made earlier, that odorless mushrooms have a high chance of being edible, is reflected here: odor_n has an importance of about 0.61, far ahead of the next feature, stalk_root_c, at about 0.18.

This makes odor_n the most useful feature in building the tree, which suggests that odor is the most informative characteristic for predicting whether or not a mushroom is edible.