![](https://i.insider.com/5c3fad1a5241471eae53d303?width=1100&format=jpeg&auto=webp)

**Problem Statement**: Mushroom hunting, mushrooming, mushroom picking, mushroom foraging, and similar terms describe the activity of gathering mushrooms in the wild, typically for culinary use. This practice is popular throughout most of Europe, Australia, Japan, Korea, parts of the Middle East, and the Indian subcontinent, as well as the temperate regions of Canada and the United States.

In this kernel we'll explore an ensemble-based model called Random Forest and dig into the following: <br>
*  Random Forest with Tuning
* Identifying ways to estimate Feature Importance
    * 1. Built-in
    * 2. SHAP values
* Try 3 different types of Encoding (Categorical data -> Numeric data)
    * 1. Label Encoding
    * 2. One Hot Encoding
    * 3. Target Encoding

[Reference 1](https://medium.com/analytics-vidhya/target-encoding-vs-one-hot-encoding-with-simple-examples-276a7e7b3e64)
[Reference 2](https://www.analyticsvidhya.com/blog/2020/08/types-of-categorical-data-encoding/)

In [None]:
# Import the necessary packages used in this notebook
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objects as go
import plotly.express as px
import plotly
import os

from sklearn.preprocessing import LabelEncoder
print(os.listdir("../input"))
import warnings
# Remove any warning messages
warnings.filterwarnings("ignore")
# Any results you write to the current directory are saved as output.

### Attribute Information:
![](https://i.pinimg.com/originals/ab/0d/d4/ab0dd459bfeeb16857224738e7919da1.gif)
* **classes**: edible=e, poisonous=p
* **cap-shape**: bell=b,conical=c,convex=x,flat=f, knobbed=k,sunken=s
* **cap-surface**: fibrous=f,grooves=g,scaly=y,smooth=s
* **cap-color**: brown=n,buff=b,cinnamon=c,gray=g,green=r,pink=p,purple=u,red=e,white=w,yellow=y
* **bruises**: bruises=t,no=f
* **odor**: almond=a,anise=l,creosote=c,fishy=y,foul=f,musty=m,none=n,pungent=p,spicy=s
* **gill-attachment**: attached=a,descending=d,free=f,notched=n
* **gill-spacing**: close=c,crowded=w,distant=d
* **gill-size**: broad=b,narrow=n
* **gill-color**: black=k,brown=n,buff=b,chocolate=h,gray=g, green=r,orange=o,pink=p,purple=u,red=e,white=w,yellow=y
* **stalk-shape**: enlarging=e,tapering=t
* **stalk-root**: bulbous=b,club=c,cup=u,equal=e,rhizomorphs=z,rooted=r,missing=?
* **stalk-surface-above-ring**: fibrous=f,scaly=y,silky=k,smooth=s
* **stalk-surface-below-ring**: fibrous=f,scaly=y,silky=k,smooth=s
* **stalk-color-above-ring**: brown=n,buff=b,cinnamon=c,gray=g,orange=o,pink=p,red=e,white=w,yellow=y
* **stalk-color-below-ring**: brown=n,buff=b,cinnamon=c,gray=g,orange=o,pink=p,red=e,white=w,yellow=y
* **veil-type**: partial=p,universal=u
* **veil-color**: brown=n,orange=o,white=w,yellow=y
* **ring-number**: none=n,one=o,two=t
* **ring-type**: cobwebby=c,evanescent=e,flaring=f,large=l,none=n,pendant=p,sheathing=s,zone=z
* **spore-print-color**: black=k,brown=n,buff=b,chocolate=h,green=r,orange=o,purple=u,white=w,yellow=y
* **population**: abundant=a,clustered=c,numerous=n,scattered=s,several=v,solitary=y
* **habitat**: grasses=g,leaves=l,meadows=m,paths=p,urban=u,waste=w,woods=d

In [None]:
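# error_bad_lines was removed in pandas 2.0; on_bad_lines='skip' (pandas >= 1.3) skips malformed rows
datafr = pd.read_csv("../input/mushroom-classification/mushrooms.csv", on_bad_lines='skip')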

In [None]:
datafr.shape

In [None]:
datafr.head()

In [None]:
# Replace ? with NaN
datafr = datafr.replace(r'^\?$', np.nan, regex=True)

In [None]:
# Check the missing values in the column
missing_data = datafr.isnull().sum().sort_values(ascending=False)

In [None]:
# Check for Data Duplication
duplicateRowsDF = datafr[datafr.duplicated()]
duplicateRowsDF

In [None]:
missing_data = missing_data.reset_index(drop=False)
missing_data = missing_data.rename(columns={"index": "Columns", 0: "Value"})
missing_data['Proportion'] = (missing_data['Value']/len(datafr))*100

In [None]:
sample = missing_data[missing_data['Proportion']>10]
fig = px.pie(sample, names='Columns', values='Proportion',
             color_discrete_sequence=px.colors.sequential.Viridis_r,
             title='Columns with a percentage of Missing values over 10%')
fig.update_traces(textposition='inside', textinfo='label')
fig.update_layout(paper_bgcolor='rgba(0,0,0,0)',
                  plot_bgcolor='rgba(0,0,0,0)',
                  font=dict(family='Cambria, monospace', size=12, color='#000000'))
fig.show()

In [None]:
# Fill missing values by carrying the previous row's value forward
# (fillna(method='ffill') is deprecated in recent pandas versions)
datafr = datafr.ffill()

### Check the proportion of data for each class

In [None]:
fig = px.pie(datafr, names='class',
             color_discrete_sequence=px.colors.sequential.Viridis_r,
             title='Proportion of data for Class column')
fig.update_traces(textposition='inside', textinfo='percent+label')
fig.update_layout(paper_bgcolor='rgba(0,0,0,0)',
                  plot_bgcolor='rgba(0,0,0,0)',
                  font=dict(family='Cambria, monospace', size=12, color='#000000'))
fig.show()

### Exploratory Data Analysis

In [None]:
datafr.columns

In [None]:
fig = px.pie(datafr, names='cap-shape',
             color_discrete_sequence=px.colors.sequential.Viridis_r,
             title='Proportion of data for Cap-Shape column ')
fig.update_traces(textposition='inside', textinfo='percent+label')
fig.update_layout(paper_bgcolor='rgba(0,0,0,0)',
                  plot_bgcolor='rgba(0,0,0,0)',
                  font=dict(family='Cambria, monospace', size=12, color='#000000'))
fig.show()

In [None]:
fig = px.pie(datafr, names='cap-color',
             color_discrete_sequence=px.colors.sequential.Viridis_r,
             title='Proportion of data for Cap-Color column ')
fig.update_traces(textposition='inside', textinfo='percent+label')
fig.update_layout(paper_bgcolor='rgba(0,0,0,0)',
                  plot_bgcolor='rgba(0,0,0,0)',
                  font=dict(family='Cambria, monospace', size=12, color='#000000'))
fig.show()

In [None]:
fig = px.pie(datafr, names='bruises',
             color_discrete_sequence=px.colors.sequential.Viridis_r,
             title='Proportion of data for Bruises column ')
fig.update_traces(textposition='inside', textinfo='percent+label')
fig.update_layout(paper_bgcolor='rgba(0,0,0,0)',
                  plot_bgcolor='rgba(0,0,0,0)',
                  font=dict(family='Cambria, monospace', size=12, color='#000000'))
fig.show()

In [None]:
fig = px.pie(datafr, names='gill-attachment',
             color_discrete_sequence=px.colors.sequential.Viridis_r,
             title='Proportion of data for Gill Attachment column ')
fig.update_traces(textposition='inside', textinfo='percent+label')
fig.update_layout(paper_bgcolor='rgba(0,0,0,0)',
                  plot_bgcolor='rgba(0,0,0,0)',
                  font=dict(family='Cambria, monospace', size=12, color='#000000'))
fig.show()

In [None]:
fig = px.pie(datafr, names='gill-size',
             color_discrete_sequence=px.colors.sequential.Viridis_r,
             title='Proportion of data for Gill Size column ')
fig.update_traces(textposition='inside', textinfo='percent+label')
fig.update_layout(paper_bgcolor='rgba(0,0,0,0)',
                  plot_bgcolor='rgba(0,0,0,0)',
                  font=dict(family='Cambria, monospace', size=12, color='#000000'))
fig.show()

In [None]:
fig = px.pie(datafr, names='population',
             color_discrete_sequence=px.colors.sequential.Viridis_r,
             title='Proportion of data for Population column ')
fig.update_traces(textposition='inside', textinfo='percent+label')
fig.update_layout(paper_bgcolor='rgba(0,0,0,0)',
                  plot_bgcolor='rgba(0,0,0,0)',
                  font=dict(family='Cambria, monospace', size=12, color='#000000'))
fig.show()

### Define Cross Validation Function

In [None]:
from sklearn.metrics import roc_curve
from sklearn.metrics import roc_auc_score

from sklearn.model_selection import cross_val_score
# Using 10 folds cross-validation
def CrossVal(trainX,trainY,model):
    accuracy=cross_val_score(model,trainX , trainY, cv=10, scoring='accuracy')
    return(accuracy)

### Hyperparameter Tuning for Random Forest

In [None]:
'''
from sklearn.ensemble import RandomForestClassifier
forest = RandomForestClassifier(random_state = 1)
modelF = forest.fit(X_train, Y_train)
y_predF = modelF.predict(X_test)

from sklearn.model_selection import GridSearchCV
n_estimators = [100, 200, 300, 400, 500]
max_depth = [5, 8, 15, 25, 30]
min_samples_split = [2, 5, 10, 15, 100]
min_samples_leaf = [1, 2, 5, 10] 

hyperF = dict(n_estimators = n_estimators, max_depth = max_depth,  
              min_samples_split = min_samples_split, 
             min_samples_leaf = min_samples_leaf)

gridF = GridSearchCV(forest, hyperF, cv = 3, verbose = 1, 
                      n_jobs = -1)
bestF = gridF.fit(X_train, Y_train)
'''

### Define Random Forest Model Function

In [None]:
from sklearn.ensemble import RandomForestClassifier
def random_forest(X_train,Y_train, X_test):
    # Next we take Random Forest Model (Ensemble) for Binary Classification
    rf = RandomForestClassifier(n_estimators = 200,random_state = 40)
    # Fit the model on the X_train and Y_train data
    rf.fit(X_train,Y_train)
    # predict probabilities
    probs = rf.predict_proba(X_test)
    # keep probabilities for the positive outcome only
    probs = probs[:, 1]
    return rf, probs

## 1. Label Encoding
In this encoding technique, each category is assigned an integer; scikit-learn's `LabelEncoder` uses 0 to N-1, where N is the number of distinct categories in the column. This kind of encoding suits ordinal data, where the categories have a natural order. Codes are assigned in ascending or descending order, and once the direction is chosen it stays fixed for every value in the column.
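To see what the encoder actually produces, here's a minimal sketch on a toy column. The values are three of the cap-shape codes from the attribute list; the toy list itself is made up for illustration. Note that scikit-learn's `LabelEncoder` sorts the categories and numbers them from 0, so the codes reflect alphabetical order rather than anything meaningful about the mushrooms.

```python
from sklearn.preprocessing import LabelEncoder

# Toy column (made up) using three of the cap-shape codes: x=convex, b=bell, f=flat
cap_shape = ["x", "b", "f", "x", "b"]

encoder = LabelEncoder()
codes = encoder.fit_transform(cap_shape)

# classes_ holds the sorted categories; a category's code is its position in that array
mapping = {str(label): int(code) for code, label in enumerate(encoder.classes_)}
print(mapping)         # which integer each category received
print(codes.tolist())  # the encoded column
```

Here `b` gets 0, `f` gets 1 and `x` gets 2, purely because of how the letters sort.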

In [None]:
sample_1 = datafr.copy()

label = LabelEncoder()
for col in sample_1.columns:
    sample_1[col] = label.fit_transform(sample_1[col])
# Print Updated Data
sample_1.head(10)

In [None]:
# Splitting the dataset
# Predictor variables
X = sample_1.drop('class',axis=1)
# Target or Class variable
Y = sample_1['class']

In [None]:
# Let's use scikit-learn to split our dataset
from sklearn.model_selection import train_test_split
# Using 70:30 ratio for train:test
X_train,X_test,Y_train,Y_test = train_test_split(X,Y,test_size=.3,random_state=400)

In [None]:
X_train.shape

In [None]:
X_test.shape

In [None]:
# Call Random Forest Classifier
rf1, probs = random_forest(X_train,Y_train, X_test)

### Random Forest Feature Importance

#### 1) Built In Feature Importance

In [None]:
sorted_idx = rf1.feature_importances_.argsort()
plt.barh(X.columns[sorted_idx], rf1.feature_importances_[sorted_idx])
plt.xlabel("Random Forest Feature Importance")

#### 2) Feature Importance computed with SHAP values
SHAP can also be used to compute feature importances from the Random Forest (the SHAP framework itself is model-agnostic). It uses Shapley values from game theory to estimate how much each feature contributes to a prediction.
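To make the Shapley idea concrete, here's a minimal sketch that computes exact Shapley values for a toy two-feature "game" by averaging each feature's marginal contribution over every ordering of the features. The payoff table in `toy_value` is made up for illustration, and this brute-force approach is not what `shap.TreeExplainer` does internally (it uses a much faster tree-specific algorithm); it only shows what quantity is being estimated.

```python
from itertools import permutations

def toy_value(coalition):
    """Hypothetical payoff for a set of features (numbers made up for illustration)."""
    payoffs = {
        frozenset(): 0.0,
        frozenset({"odor"}): 0.6,
        frozenset({"gill-size"}): 0.2,
        frozenset({"odor", "gill-size"}): 1.0,
    }
    return payoffs[frozenset(coalition)]

def shapley_values(features, value):
    """Average each feature's marginal contribution over every ordering."""
    totals = {f: 0.0 for f in features}
    orderings = list(permutations(features))
    for order in orderings:
        seen = set()
        for f in order:
            before = value(seen)
            seen.add(f)
            # Marginal contribution of f given the features added before it
            totals[f] += value(seen) - before
    return {f: total / len(orderings) for f, total in totals.items()}

phi = shapley_values(["odor", "gill-size"], toy_value)
print(phi)
```

With these payoffs, `odor` ends up with roughly 0.7 and `gill-size` with roughly 0.3, and the two values add up to the payoff of the full feature set.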

In [None]:
import shap
explainer = shap.TreeExplainer(rf1)
shap_values = explainer.shap_values(X_test)
shap.summary_plot(shap_values, X_test, plot_type="bar")

In [None]:
# Predict the target labels for X_test with the fitted model
predict1 = rf1.predict(X_test)
# Cross-validate a fresh forest with the same settings to check whether the model overfits or underfits
# (a separate name keeps the fitted rf1 from being overwritten)
rf1_cv = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=40)
score_rf = CrossVal(X_train,Y_train,rf1_cv)
print('Cross-Validation accuracy is {:.2f}%'.format(score_rf.mean()*100))

### Display Results

In [None]:
# Compare the predicted target labels with Y_test
from sklearn.metrics import accuracy_score,confusion_matrix, f1_score
print("Accuracy using Random Forest Model: {:.2f}%".format(accuracy_score(Y_test,predict1)*100))
# assign cnf_matrix with result of confusion_matrix array
cnf_matrix = confusion_matrix(Y_test,predict1)

# calculate AUC
auc_rf = roc_auc_score(Y_test, probs)
#print('AUC: %.3f' % auc)
# calculate roc curve
fpr, tpr, thresholds = roc_curve(Y_test, probs)
# plot no skill
plt.plot([0, 1], [0, 1], linestyle='--')
# plot the roc curve for the model
plt.plot(fpr, tpr, marker='.')
plt.title("ROC Curve for Random Forest with AUC Score: {:.3f}".format(auc_rf))
# show the plot
plt.show()

#create a heat map
sns.heatmap(pd.DataFrame(cnf_matrix), annot = True, cmap = 'Purples', fmt = 'd')
rf_f1=f1_score(Y_test,predict1)
plt.title('F1 Score for Random Forest model is {:.2f}'.format(rf_f1))

**Cons:** <br>
The problem with using numbers is that they introduce an order between the categories. There is no such relation between, for example, the different cap shapes, yet the algorithm might treat the codes as a hierarchy: 0 < 1 < 2 < … < N-1.

## 2. One Hot Encoding (with PCA)
One-hot encoding is conceptually easier to understand. This type of encoding simply “produces one feature per category, each binary.” For example, a pet column with the values cat, dog, and hamster becomes three new features, one per animal. In the cat column, a 1 means the row is a cat and a 0 means it is not.
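The cat, dog, and hamster example can be sketched with `pd.get_dummies`, the same function used on the mushroom columns below. The toy `pets` frame is made up for illustration.

```python
import pandas as pd

# Toy column (made up) with the three categories from the example
pets = pd.DataFrame({"pet": ["cat", "dog", "hamster", "cat"]})

# One binary column per category; dtype=int gives 0/1 instead of booleans
one_hot = pd.get_dummies(pets["pet"], prefix="pet", dtype=int)
print(one_hot)
```

Each row has exactly one 1, in the column that matches its original category. With 22 categorical features the mushroom data expands into many such columns, which is why this section follows the encoding with PCA.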

In [None]:
sample_2 = datafr.copy()
le = LabelEncoder()
sample_2['Class Encoded'] = le.fit_transform(sample_2['class'])
# Predictor variables
X = sample_2.drop(['class', 'Class Encoded'],axis=1)

new_df = pd.DataFrame()
for col in X.columns:
    y = pd.get_dummies(X[col], prefix=col)
    new_df = pd.concat([new_df, y], axis=1)
# Print Updated Data
new_df.head(10)

In [None]:
# Principal Component Analysis
from sklearn.decomposition import PCA
pca = PCA(n_components=5)
principal_component = pca.fit_transform(new_df)
principalDf = pd.DataFrame(data = principal_component,
                           columns = ['principal component 1', 'principal component 2',
                                     'principal component 3', 'principal component 4',
                                     'principal component 5'])

In [None]:
principalDf.head(10)

In [None]:
# Splitting the dataset
# Predictor variables
X = principalDf
# Target or Class variable
Y = sample_2['Class Encoded']

In [None]:
# Let's use scikit-learn to split our dataset
from sklearn.model_selection import train_test_split
# Using 70:30 ratio for train:test
X_train,X_test,Y_train,Y_test = train_test_split(X,Y,test_size=.3,random_state=400)

In [None]:
X_train.shape

In [None]:
X_test.shape

In [None]:
# Call Random Forest Classifier
rf2, probs = random_forest(X_train,Y_train, X_test)

### Random Forest Feature Importance

#### 1) Built In Feature Importance

In [None]:
sorted_idx = rf2.feature_importances_.argsort()
plt.barh(X.columns[sorted_idx], rf2.feature_importances_[sorted_idx])
plt.xlabel("Random Forest Feature Importance")

#### 2) Feature Importance computed with SHAP values
SHAP can also be used to compute feature importances from the Random Forest (the SHAP framework itself is model-agnostic). It uses Shapley values from game theory to estimate how much each feature contributes to a prediction.

In [None]:
import shap
explainer = shap.TreeExplainer(rf2)
shap_values = explainer.shap_values(X_test)
shap.summary_plot(shap_values, X_test, plot_type="bar")

In [None]:
# Predict the target labels for X_test with the fitted model
predict2 = rf2.predict(X_test)
# Cross-validate a fresh forest with the same settings to check whether the model overfits or underfits
# (a separate name keeps the fitted rf2 from being overwritten)
rf2_cv = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=40)
score_rf = CrossVal(X_train,Y_train,rf2_cv)
print('Cross-Validation accuracy is {:.2f}%'.format(score_rf.mean()*100))

### Display Results

In [None]:
# Compare the predicted target labels with Y_test
from sklearn.metrics import accuracy_score,confusion_matrix, f1_score
print("Accuracy using Random Forest Model: {:.2f}%".format(accuracy_score(Y_test,predict2)*100))
# assign cnf_matrix with result of confusion_matrix array
cnf_matrix = confusion_matrix(Y_test,predict2)

# calculate AUC
auc_rf = roc_auc_score(Y_test, probs)
#print('AUC: %.3f' % auc)
# calculate roc curve
fpr, tpr, thresholds = roc_curve(Y_test, probs)
# plot no skill
plt.plot([0, 1], [0, 1], linestyle='--')
# plot the roc curve for the model
plt.plot(fpr, tpr, marker='.')
plt.title("ROC Curve for Random Forest with AUC Score: {:.3f}".format(auc_rf))
# show the plot
plt.show()

#create a heat map
sns.heatmap(pd.DataFrame(cnf_matrix), annot = True, cmap = 'Purples', fmt = 'd')
rf2_f1 = f1_score(Y_test,predict2)
plt.title('F1 Score for Random Forest model is {:.2f}'.format(rf2_f1))

## 3. Target Encoding

“features are replaced with a blend of posterior probability of the target given particular categorical value and the prior probability of the target over all the training data.”
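One common way to form that blend is additive smoothing: for a category seen `n` times, the encoded value is `(n * category_mean + m * global_mean) / (n + m)`, where `m` is a smoothing weight. This is the same formula the `calc_smooth_mean` function below uses. Here's a minimal sketch on a toy frame; the data and the choice `weight = 2` are made up for illustration.

```python
import pandas as pd

# Toy data (made up): an odor code and a 0/1 target where 1 = poisonous
toy = pd.DataFrame({
    "odor":   ["f", "f", "f", "n", "n", "a"],
    "target": [1,   1,   1,   0,   1,   0],
})
weight = 2  # smoothing strength m: higher values pull small groups toward the prior

prior = toy["target"].mean()  # prior probability of the target over all rows
stats = toy.groupby("odor")["target"].agg(["count", "mean"])

# Blend each category's own mean (posterior) with the global mean (prior)
smooth = (stats["count"] * stats["mean"] + weight * prior) / (stats["count"] + weight)

toy["odor_encoded"] = toy["odor"].map(smooth)
print(smooth)
```

The category `a` appears only once with target 0, yet its encoded value is pulled well above 0 toward the global mean, which is the point of the smoothing: rare categories are not trusted as much as frequent ones.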

In [None]:
sample_3 = datafr.copy()
le = LabelEncoder()
sample_3['Class Encoded'] = le.fit_transform(sample_3['class'])
sample_3 = sample_3.drop('class', axis=1)
sample_3.head()

In [None]:
# Source: https://maxhalford.github.io/blog/target-encoding-done-the-right-way/
def calc_smooth_mean(df1, df2, cat_name, target, weight):
    # All statistics are computed from df1 (not from a global dataframe)
    # Compute the global mean
    mean = df1[target].mean()

    # Compute the number of values and the mean of each group
    agg = df1.groupby(cat_name)[target].agg(['count', 'mean'])
    counts = agg['count']
    means = agg['mean']

    # Compute the "smoothed" means
    smooth = (counts * means + weight * mean) / (counts + weight)

    # Replace each value by the corresponding smoothed mean
    if df2 is None:
        return df1[cat_name].map(smooth)
    else:
        # df2 is encoded with the means learned from df1
        return df1[cat_name].map(smooth),df2[cat_name].map(smooth.to_dict())

In [None]:
# Target Encode all the columns in sample_3 dataset except Class variable
WEIGHT = 5
new_df = pd.DataFrame()
for col in sample_3.columns[:-1]:
    new_df[col] = calc_smooth_mean(df1=sample_3, df2=None, cat_name=col, target='Class Encoded', weight=WEIGHT)

In [None]:
# Target Encoded columns
new_df.head(10)

In [None]:
# Splitting the dataset
# Predictor variables
X = new_df
# Target or Class variable
Y = sample_3['Class Encoded']

In [None]:
# Let's use scikit-learn to split our dataset
from sklearn.model_selection import train_test_split
# Using 70:30 ratio for train:test
X_train,X_test,Y_train,Y_test = train_test_split(X,Y,test_size=.3,random_state=400)

In [None]:
X_train.shape

In [None]:
X_test.shape

In [None]:
# Call Random Forest Classifier
rf3, probs = random_forest(X_train,Y_train, X_test)

### Random Forest Feature Importance

#### 1) Built In Feature Importance

In [None]:
sorted_idx = rf3.feature_importances_.argsort()
plt.barh(X.columns[sorted_idx], rf3.feature_importances_[sorted_idx])
plt.xlabel("Random Forest Feature Importance")

#### 2) Feature Importance computed with SHAP values
SHAP can also be used to compute feature importances from the Random Forest (the SHAP framework itself is model-agnostic). It uses Shapley values from game theory to estimate how much each feature contributes to a prediction.

In [None]:
import shap
explainer = shap.TreeExplainer(rf3)
shap_values = explainer.shap_values(X_test)
shap.summary_plot(shap_values, X_test, plot_type="bar")

In [None]:
# Predict the target labels for X_test with the fitted model
predict3 = rf3.predict(X_test)
# Cross-validate a fresh forest with the same settings to check whether the model overfits or underfits
# (a separate name keeps the fitted rf3 from being overwritten)
rf3_cv = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=40)
score_rf = CrossVal(X_train,Y_train,rf3_cv)
print('Cross-Validation accuracy is {:.2f}%'.format(score_rf.mean()*100))

### Display Results

In [None]:
# Compare the predicted target labels with Y_test
from sklearn.metrics import accuracy_score,confusion_matrix, f1_score
print("Accuracy using Random Forest Model: {:.2f}%".format(accuracy_score(Y_test,predict3)*100))
# assign cnf_matrix with result of confusion_matrix array
cnf_matrix = confusion_matrix(Y_test,predict3)

# calculate AUC
auc_rf = roc_auc_score(Y_test, probs)
#print('AUC: %.3f' % auc)
# calculate roc curve
fpr, tpr, thresholds = roc_curve(Y_test, probs)
# plot no skill
plt.plot([0, 1], [0, 1], linestyle='--')
# plot the roc curve for the model
plt.plot(fpr, tpr, marker='.')
plt.title("ROC Curve for Random Forest with AUC Score: {:.3f}".format(auc_rf))
# show the plot
plt.show()

#create a heat map
sns.heatmap(pd.DataFrame(cnf_matrix), annot = True, cmap = 'Purples', fmt = 'd')
rf3_f1 = f1_score(Y_test,predict3)
plt.title('F1 Score for Random Forest model is {:.2f}'.format(rf3_f1))

In this kernel, we've explored two ways of estimating feature importance in a tree-based algorithm (Random Forest), and we've tried three different ways of encoding categorical data as numeric data.

I hope this kernel is helpful for those trying out these algorithms and beginning their journey in the data science domain.