<a href="https://www.kaggle.com/code/zjzhao1002/poisonous-mushroom-classification?scriptVersionId=194218576" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# 1. Basic Information of Data

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

In [None]:
df_train = pd.read_csv("/kaggle/input/playground-series-s4e8/train.csv")
print(df_train.head())

In [None]:
df_train.columns

In [None]:
df_train.isnull().sum()

In [None]:
df_train.shape

In [None]:
df_train.info()

# 2. Data Cleaning

We have seen that the data contain many missing values, so we have to clean the data before training our model.

In [None]:
num_columns = df_train.select_dtypes(include='float64').columns
cat_columns = df_train.select_dtypes(include='object').columns

For the numeric columns, we impute missing values with the median of the corresponding column.

In [None]:
for column in num_columns:
    df_train[column] = df_train[column].fillna(float(df_train[column].median()))

For the categorical columns, we have seen that some columns have too many missing values. If more than 60% of a column's values are missing, we assume that the column can be removed safely.

In [None]:
nan_count = df_train.isna().sum()
total_count = df_train.shape[0]
nan_ratio = nan_count / total_count
high_nan_ratio = nan_ratio[nan_ratio > 0.6]
print(high_nan_ratio)

In [None]:
df_train_new = df_train.drop(['id', 'stem-root', 'stem-surface', 'veil-type', 'veil-color', 'spore-print-color'], axis=1)
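
The column list passed to `drop` above is written out by hand. As a minimal sketch, the same list could be derived from the missing-value ratio instead. The toy frame, its column names, and the helper `columns_above_nan_ratio` are illustrative assumptions, not part of the competition data or the original notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df_train; the column names are made up.
toy = pd.DataFrame({
    "id": [0, 1, 2, 3, 4],
    "mostly_missing": [np.nan, np.nan, np.nan, np.nan, "a"],  # 80% missing
    "partly_missing": ["x", np.nan, "y", "x", "y"],           # 20% missing
    "complete": [1.0, 2.0, 3.0, 4.0, 5.0],
})

def columns_above_nan_ratio(df, threshold=0.6):
    """Return the names of columns whose missing-value ratio exceeds threshold."""
    nan_ratio = df.isna().mean()  # equivalent to isna().sum() / len(df)
    return nan_ratio[nan_ratio > threshold].index.tolist()

# 'id' carries no information about the target, so it is dropped as well.
cols_to_drop = ["id"] + columns_above_nan_ratio(toy)
toy_new = toy.drop(columns=cols_to_drop)

print(cols_to_drop)
print(toy_new.columns.tolist())
```

Deriving the list this way keeps the threshold in one place, so changing 0.6 to another value does not require editing the column names by hand.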

We fill the missing values in the remaining categorical columns with 'none'.

In [None]:
cat_columns = df_train_new.select_dtypes(include='object').columns

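for column in cat_columns:
    # Assign the result back; inplace=True on a selected column is deprecated in recent pandas.
    df_train_new[column] = df_train_new[column].fillna('none')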

Now we check whether any missing values remain in our data.

In [None]:
print(df_train_new.isna().sum())

Now we look at the number of unique values in the categorical columns.

In [None]:
df_train_new[cat_columns].nunique()

Clearly, there are too many distinct values in these columns, except for the 'class' and 'season' columns. We collect the values that appear fewer than 101 times in a column and treat them as 'noise'.

In [None]:
def remove_noise(df):
    cat_columns = df.select_dtypes(include='object').columns
    for column in cat_columns:
        count = df[column].value_counts()
        less_freq = count[count<101].index
        df[column] = df[column].apply(lambda x: 'noise' if x in less_freq else x)
    return df

In [None]:
df_train_new = remove_noise(df_train_new)

In [None]:
df_train_new[cat_columns].nunique()
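
To make the grouping step concrete, here is a small sketch of the same idea on a toy column. It uses a vectorized `isin`/`where` instead of `apply`, which is an alternative formulation rather than the code used in this notebook. The helper name `group_rare` and the toy values are assumptions for illustration.

```python
import pandas as pd

def group_rare(series, min_count=101, label="noise"):
    """Replace categories seen fewer than min_count times with label."""
    counts = series.value_counts()
    rare = counts[counts < min_count].index
    # Keep values that are not rare; replace all others in one vectorized step.
    return series.where(~series.isin(rare), label)

# Toy column: 'a' appears three times, 'b' twice, 'c' once.
toy = pd.Series(["a", "a", "a", "b", "b", "c"])
grouped = group_rare(toy, min_count=3)
print(grouped.tolist())
```

With `min_count=3`, only 'a' is frequent enough to survive, and 'b' and 'c' are merged into a single 'noise' category.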

The number of categories looks much more manageable now. We make some plots to see the distributions.

In [None]:
def make_barplot(df):
    cat_columns = df.select_dtypes(include='object').columns
    for column in cat_columns:
        plt.figure(figsize=(10, 5))
        uni_count = df[column].value_counts()
        sns.barplot(x=uni_count.index, y=uni_count.values)
        plt.ylabel(f"Count of '{column}'")
        plt.xlabel(f"{column}")
        plt.title(f"Count of unique categories in column '{column}'")
        plt.show()

In [None]:
make_barplot(df_train_new)

We can see that in most columns 'none' was rare enough to be absorbed into 'noise'. However, 'none' is the largest category in the 'cap-surface' column, which may introduce some uncertainty.

We can also look at the distributions of the numeric columns:

In [None]:
def make_hist(df):
    num_columns = df.select_dtypes(include='float64').columns
    for column in num_columns:
        pvalue = df[df['class']=='p'][column].to_numpy()
        evalue = df[df['class']=='e'][column].to_numpy()
        fig, ax = plt.subplots(figsize=(10,5))
        ax.hist(pvalue, bins=30, range=[0, 60], density=True, label='p')
        ax.hist(evalue, bins=30, range=[0, 60], density=True, label='e')
        ax.legend()
        ax.set_xlabel(f'{column}')
        ax.set_ylabel('Fraction of data')
        plt.show()

In [None]:
make_hist(df_train_new)

It seems that the poisonous mushrooms tend to have a smaller cap-diameter and stem-width than the edible mushrooms.
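
One way to back up this visual impression with numbers is to compare per-class medians. The sketch below shows the pattern on a toy frame. The values are invented for illustration and are not taken from the competition data, so only the `groupby` call carries over to `df_train_new`.

```python
import pandas as pd

# Invented values, only meant to show the shape of the computation.
toy = pd.DataFrame({
    "class": ["p", "p", "p", "e", "e", "e"],
    "cap-diameter": [3.0, 4.0, 5.0, 6.0, 8.0, 10.0],
    "stem-width": [5.0, 7.0, 9.0, 10.0, 14.0, 18.0],
})

# Median of each numeric column, computed separately for each class.
medians = toy.groupby("class")[["cap-diameter", "stem-width"]].median()
print(medians)
```

On the real data, the same call with `df_train_new` in place of `toy` would show whether the medians for 'p' are actually lower than those for 'e'.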

# 3. Encoding

The next step is encoding the data for training. 

In [None]:
X_train = df_train_new.drop(['class'], axis=1)
y_train = df_train_new['class']

We scale the numeric data with StandardScaler and encode the categorical data with OrdinalEncoder.

In [None]:
from sklearn.preprocessing import LabelEncoder, StandardScaler, OrdinalEncoder

def preprocess(df):
    cat_columns = df.select_dtypes(include='object').columns
    num_columns = df.select_dtypes(include='float64').columns
    
    scaler = StandardScaler()
    df[num_columns] = scaler.fit_transform(df[num_columns])
    
    encoder = OrdinalEncoder()
    df[cat_columns] = encoder.fit_transform(df[cat_columns].astype(str))
    
    return df

In [None]:
X_train = preprocess(X_train)

In [None]:
label_encoder = LabelEncoder()
y_train = label_encoder.fit_transform(y_train)
y_train
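
It can help to check which integer each class label received, since the submission step has to map the integers back to 'p' and 'e'. A minimal sketch on a toy label list (the list itself is an assumption, the encoder calls are standard scikit-learn):

```python
from sklearn.preprocessing import LabelEncoder

toy_encoder = LabelEncoder()
encoded = toy_encoder.fit_transform(["p", "e", "e", "p"])

# classes_ is sorted, and the position of a label in it is its integer code.
mapping = {str(label): code for code, label in enumerate(toy_encoder.classes_)}
print(mapping)

# inverse_transform recovers the original string labels from the codes.
decoded = toy_encoder.inverse_transform(encoded)
print(list(decoded))
```

Because the classes are sorted alphabetically, 'e' is encoded as 0 and 'p' as 1.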

# 4. Model

In [None]:
from sklearn.model_selection import train_test_split

X_train_new, X_val, y_train_new, y_val = train_test_split(
    X_train,
    y_train,
    test_size=0.20,
    random_state=1
)

We use the XGBoost model in this case.

In [None]:
from xgboost import XGBClassifier

model = XGBClassifier(
    gamma = 0.01,
    min_child_weight=1,
    subsample = 0.8, 
    colsample_bytree = 0.7,
    reg_alpha = 0.5,
    reg_lambda = 1.0, 
    learning_rate = 0.01,
    n_estimators = 2000, 
    max_depth = 10,
    random_state = 1,
    early_stopping_rounds = 10,
    device = "cuda"
)

In [None]:
history = model.fit(
    X_train_new, 
    y_train_new,
    eval_set = [(X_val, y_val)],
    verbose = True
)

In [None]:
from sklearn.metrics import matthews_corrcoef

y_pred = history.predict(X_val)
mcc = matthews_corrcoef(y_val, y_pred)
print(mcc)
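
For reference, the Matthews correlation coefficient printed above is computed from the four confusion-matrix counts as (TP·TN − FP·FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)). The sketch below works this out by hand for a small, made-up pair of label lists. The helper `mcc_binary` is a hypothetical name, and returning 0.0 for a zero denominator is a convention chosen here.

```python
import math

def mcc_binary(y_true, y_pred):
    """Matthews correlation coefficient for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention used in this sketch: an undefined ratio is reported as 0.0.
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Made-up labels: TP = 3, TN = 3, FP = 1, FN = 1.
y_true_toy = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred_toy = [1, 1, 0, 0, 0, 0, 1, 1]
print(mcc_binary(y_true_toy, y_pred_toy))  # (9 - 1) / sqrt(4 * 4 * 4 * 4) = 0.5
```

A value of 1 means perfect agreement, 0 means no better than chance, and -1 means total disagreement.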

# 5. Prediction

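We apply the same cleaning and encoding steps to the test data and make predictions. Note that `remove_noise` and `preprocess` are called again here, so the rare-value grouping, the scaler, and the encoder are recomputed on the test data rather than reused from the training data.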

In [None]:
df_test = pd.read_csv("/kaggle/input/playground-series-s4e8/test.csv")
num_columns = df_test.select_dtypes(include='float64').columns
cat_columns = df_test.select_dtypes(include='object').columns

for column in num_columns:
    df_test[column] = df_test[column].fillna(float(df_test[column].median()))

df_test_new = df_test.drop(['id', 'stem-root', 'stem-surface', 'veil-type', 'veil-color', 'spore-print-color'], axis=1)

cat_columns = df_test_new.select_dtypes(include='object').columns

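for column in cat_columns:
    # Assign the result back; inplace=True on a selected column is deprecated in recent pandas.
    df_test_new[column] = df_test_new[column].fillna('none')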

df_test_new = remove_noise(df_test_new)
X_test = df_test_new
X_test = preprocess(X_test)
X_test.head()
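
An alternative to refitting on the test set is to fit the scaler and the encoder once on the training data and only call `transform` on the test data, so that a given category always maps to the same integer. The following is a sketch of that pattern on toy frames, not the approach used in this notebook. The frames, the category letters, and the choice of `unknown_value=-1` for categories unseen in training are assumptions.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

# Toy stand-ins for the training and test features.
train = pd.DataFrame({
    "cap-diameter": [2.0, 4.0, 6.0, 8.0],
    "season": ["a", "w", "a", "s"],
})
test = pd.DataFrame({
    "cap-diameter": [4.0, 10.0],
    "season": ["w", "u"],  # 'u' never appears in the training frame
})

num_columns = ["cap-diameter"]
cat_columns = ["season"]

# Fit both transformers on the training data only.
scaler = StandardScaler().fit(train[num_columns])
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
encoder.fit(train[cat_columns])

def transform(df):
    """Apply the already-fitted scaler and encoder to a copy of df."""
    out = df.copy()
    out[num_columns] = scaler.transform(out[num_columns])
    out[cat_columns] = encoder.transform(out[cat_columns])
    return out

train_enc = transform(train)
test_enc = transform(test)
print(test_enc)
```

Here 'w' receives the same code in both frames, and the unseen category 'u' is encoded as -1 instead of raising an error.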

Finally, we generate the file for submission.

In [None]:
results = history.predict(X_test)
submission = pd.read_csv("/kaggle/input/playground-series-s4e8/sample_submission.csv")
submission['class'] = results
submission['class'] = submission['class'].replace({1: 'p', 0: 'e'})
submission.to_csv('submission.csv', index=False)

In [None]:
submission.head()