![](http://www.woodlandtrust.org.uk/media/1997/fly-agaric-mushroom-close-up-alamy-dhfcm9-ivan-kmit.jpg?center=0.49618320610687022,0.63948497854077258&mode=crop&width=1110&height=624&rnd=132078488660000000)

In this notebook I would like to introduce three methods we can use to transform categorical columns into numerical form. This notebook is for people who have just started their journey. For those who have been coding for some time, it will probably be too simple.

In [None]:
import numpy as np 
import pandas as pd 
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, LabelEncoder

## Mushroom dataset

In [None]:
df = pd.read_csv("/kaggle/input/mushroom-classification/mushrooms.csv")
df.head()

All columns are categorical and they are not ordered, so we treat them as nominal. Therefore, we can use a one-hot encoder, create dummy variables, or use a label encoder. I will try all three techniques and find out which gives the best performance.
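
To make the difference between the three techniques concrete, here is a minimal sketch on a toy column. The `toy` DataFrame and its `color` values are made up for illustration and are not part of the mushroom data.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Toy data (hypothetical, not taken from the mushroom dataset)
toy = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# 1) One-hot encoding: one binary column per category (sparse by default)
onehot = OneHotEncoder().fit_transform(toy[["color"]]).toarray()

# 2) Dummy variables: the pandas counterpart, returned as a labelled DataFrame
dummies = pd.get_dummies(toy["color"])

# 3) Label encoding: one integer per category, assigned in sorted order
labels = LabelEncoder().fit_transform(toy["color"])

print(onehot)
print(dummies)
print(labels)
```

The one-hot matrix and the dummy DataFrame hold the same values; the main practical difference is that pandas keeps the column names.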

In [None]:
df.describe().T

In [None]:
df.isna().sum()

In [None]:
for col in df.columns:
    print(f"Column {col} unique values: {df[col].unique()}")

In [None]:
sns.countplot(x=df['class'])

Our prediction label is well balanced, so we don't have to worry about it. For a classification problem we could leave the label as a string, since some algorithms can cope with a categorical label, but for binary classification it is better to use binary values (0, 1). I will map edible as 0 and poisonous as 1.

In [None]:
df["class"] = df["class"].apply(lambda x: 1 if x == "p" else 0)  # poisonous -> 1, edible -> 0
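
The balance of the label and the mapping described above (edible as 0, poisonous as 1) can also be checked numerically. This is only a sketch on a made-up `toy_class` Series, using a dictionary-based `map` as an alternative to `apply`.

```python
import pandas as pd

# Toy labels (hypothetical); "e" = edible, "p" = poisonous
toy_class = pd.Series(["e", "p", "e", "p", "p", "e"])

# Share of each class; values near 0.5 indicate a balanced binary label
shares = toy_class.value_counts(normalize=True)
print(shares)

# Dictionary-based mapping: edible -> 0, poisonous -> 1
encoded = toy_class.map({"e": 0, "p": 1})
print(encoded.tolist())
```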

In [None]:
df.drop("class", axis=1).columns

In [None]:
fig = plt.figure(figsize=(16, 30))
for i, col in enumerate(df.columns):
    plt.subplot(12, 2, i + 1)
    sns.countplot(x=df[col])
plt.tight_layout()  # adjust spacing once, after all subplots are drawn
plt.show()

In [None]:
plt.figure(figsize=(12, 4))
sns.countplot(x=df["odor"], hue=df['class']);

It looks like most edible mushrooms have no odor. Fresh mushrooms should smell slightly sweet and earthy, but not foul. If they smell fishy or pungent, it's time to toss them.

### Separate our label from features

In [None]:
X = df.drop("class", axis=1)
y = df["class"].values

In [None]:
from sklearn.model_selection import train_test_split
Xtrain, Xvalid, ytrain, yvalid = train_test_split(X, y, test_size=0.2, random_state=101)

### One hot encoding

In [None]:
one_hot = OneHotEncoder()
Xtrain_onehot = one_hot.fit_transform(Xtrain)
Xvalid_onehot = one_hot.transform(Xvalid)

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, ConfusionMatrixDisplay

In [None]:
rfc_base = RandomForestClassifier()
rfc_base.fit(Xtrain_onehot, ytrain)
base_preds = rfc_base.predict(Xvalid_onehot)
acc = accuracy_score(yvalid, base_preds)
print(f"Random Forest accuracy: {acc}")

In [None]:
pd.DataFrame({"predictions":base_preds,
              "ytrue": yvalid})

In [None]:
print(classification_report(yvalid, base_preds))

In [None]:
# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay replaces it
ConfusionMatrixDisplay.from_estimator(rfc_base, Xvalid_onehot, yvalid)

As we can see, our base model achieved 100% accuracy, but in this case it is hard to find out which features were most important, because one-hot encoding produces a sparse matrix without column names.

### Using a pipeline to do the same task

In [None]:
pipe = Pipeline([
    ("onehot", OneHotEncoder()),
    ("rfc_base", RandomForestClassifier())
])

pipe.fit(Xtrain, ytrain)
pipe_preds = pipe.predict(Xvalid)
acc = accuracy_score(yvalid, pipe_preds)
print(f"Random Forest acc={acc}")

### Label Encoding

In [None]:
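# Copy the splits first to avoid pandas' SettingWithCopyWarning
Xtrain, Xvalid = Xtrain.copy(), Xvalid.copy()
for col in X.columns:
    # Fit a separate encoder per column on the training data only
    le = LabelEncoder()
    Xtrain[col] = le.fit_transform(Xtrain[col])
    Xvalid[col] = le.transform(Xvalid[col])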
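
`LabelEncoder` is documented by scikit-learn as a tool for target labels, so the loop above fits one encoder per column. As an alternative sketch (not what this notebook uses), `OrdinalEncoder` handles all feature columns at once and can assign a placeholder to categories it has not seen. The toy splits below are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy splits (hypothetical); "y" appears only in the validation data
toy_train = pd.DataFrame({"cap": ["b", "x", "f"], "odor": ["n", "a", "n"]})
toy_valid = pd.DataFrame({"cap": ["x", "y"], "odor": ["a", "n"]})

# Encode every column at once; unseen categories become -1 instead of raising
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
train_enc = encoder.fit_transform(toy_train)
valid_enc = encoder.transform(toy_valid)

print(train_enc)
print(valid_enc)
```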

In [None]:
Xtrain

In [None]:
rfc_le = RandomForestClassifier()
rfc_le.fit(Xtrain, ytrain)
le_preds = rfc_le.predict(Xvalid)
acc = accuracy_score(yvalid, le_preds)
print(f"Random Forest accuracy: {acc}")

In [None]:
print(classification_report(yvalid, le_preds))

In [None]:
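from sklearn.metrics import ConfusionMatrixDisplay  # replaces plot_confusion_matrix, removed in scikit-learn 1.2
ConfusionMatrixDisplay.from_estimator(rfc_le, Xvalid, yvalid)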

In [None]:
feat_imp = pd.DataFrame(rfc_le.feature_importances_, index=Xtrain.columns, columns=["feat_imp"])
feat_imp = feat_imp.sort_values("feat_imp", ascending=False)
feat_imp.style.background_gradient("Blues")

### Dummy variables

A method similar to one-hot encoding is creating dummy variables, but here we don't lose information about which features are important in our model. A couple of things we need to remember: first, the dummy variable trap (i.e. multicollinearity), and second, the curse of dimensionality (in huge datasets it is not that easy to use this method, in my opinion).
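
One more thing worth checking with dummy variables is that the training and validation sets end up with the same columns, because `pd.get_dummies` only creates columns for categories it actually sees. A minimal sketch on made-up toy splits, where `drop_first=True` handles the dummy variable trap and `reindex` aligns the columns:

```python
import pandas as pd

# Toy splits (hypothetical); category "c" never occurs in the validation data
toy_train = pd.DataFrame({"odor": ["a", "b", "c", "a"]})
toy_valid = pd.DataFrame({"odor": ["b", "a"]})

# drop_first=True removes one level per feature to avoid the dummy variable trap
train_d = pd.get_dummies(toy_train, drop_first=True)

# Encode the validation data fully, then keep exactly the training columns;
# columns missing from the validation data are filled with 0
valid_d = pd.get_dummies(toy_valid).reindex(columns=train_d.columns, fill_value=0)

print(list(train_d.columns))
print(list(valid_d.columns))
```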

In [None]:
X = df.drop("class", axis=1)
y = df["class"].values

Xtrain, Xvalid, ytrain, yvalid = train_test_split(X, y, test_size=0.2, random_state=101)

In [None]:
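dummy_Xtrain = pd.get_dummies(Xtrain, drop_first=True)
# Encode the validation set fully, then align it to the training columns
# so both matrices have identical features (missing columns are filled with 0)
dummy_Xvalid = pd.get_dummies(Xvalid).reindex(columns=dummy_Xtrain.columns, fill_value=0)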

In [None]:
rfc_d = RandomForestClassifier()
rfc_d.fit(dummy_Xtrain, ytrain)
d_preds = rfc_d.predict(dummy_Xvalid)
acc = accuracy_score(yvalid, d_preds)
print(f"Random Forest accuracy: {acc}")

In [None]:
feat_imp = pd.DataFrame(rfc_d.feature_importances_, index=dummy_Xtrain.columns, columns=["feat_imp"])
feat_imp = feat_imp.sort_values("feat_imp", ascending=False)[:20]
feat_imp.style.background_gradient("Blues")