## **Assignment 3 (2023/2): ML1**
**Safe to eat or deadly poison?**



This homework is a classification task to identify whether a mushroom is edible or poisonous.

This dataset includes descriptions of hypothetical samples corresponding to 23 species of gilled mushrooms in the Agaricus and Lepiota family, drawn from The Audubon Society Field Guide to North American Mushrooms (1981).

Each species is identified as definitely edible, definitely poisonous, or of unknown edibility and not recommended. This last class was combined with the poisonous one. The Guide clearly states that there is no simple rule for determining the edibility of a mushroom; no rule like "leaflets three, let it be" for poison oak and ivy.


Step 1. Load 'mushroom2020_dataset.csv' from the “Attachment” (note: this dataset has already been partially prepared).

Step 2. Drop rows where the target (label) variable is missing.

Step 3. Drop the following variables:
'id','gill-attachment', 'gill-spacing', 'gill-size','gill-color-rate', 'stalk-root', 'stalk-surface-above-ring', 'stalk-surface-below-ring', 'stalk-color-above-ring-rate','stalk-color-below-ring-rate','veil-color-rate','veil-type'

Step 4. Examine the number of rows, the number of columns, and whether any values are missing.

Step 5. Fill missing values with the mean for numeric variables and the mode for nominal variables.

Step 6. Convert the label variable e (edible) to 1 and p (poisonous) to 0, then check the class counts (class0:class1).

Step 7. Convert the nominal variables to numeric using dummy coding with drop_first = True.

Step 8. Split train/test with 20% test, stratify, and seed = 2020.

Step 9. Create a Random Forest with GridSearch on the training data, using 5-fold CV and n_jobs=-1.
'criterion': ['gini','entropy']
'max_depth': [2,3]
'min_samples_leaf': [2,5]
'n_estimators': [100]
'random_state': 2020

Step 10. Predict on the testing data set and evaluate the result with classification_report.
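
Steps 5 to 7 can be sketched on a toy frame using pandas alone. The column values below are made up for illustration and are not rows from 'mushroom2020_dataset.csv'; this is one possible approach under those assumptions, not the required solution.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; not rows from the assignment's CSV.
toy = pd.DataFrame({
    "label": ["e", "p", "e", "p"],
    "odor": ["a", np.nan, "n", "a"],
    "cap-color-rate": [1.0, np.nan, 3.0, 2.0],
})

# Step 5: mean for numeric columns, mode for nominal columns.
numeric_cols = toy.select_dtypes(include="number").columns
nominal_cols = toy.columns.difference(numeric_cols).drop("label")
toy[numeric_cols] = toy[numeric_cols].fillna(toy[numeric_cols].mean())
toy[nominal_cols] = toy[nominal_cols].fillna(toy[nominal_cols].mode().iloc[0])

# Step 6: e (edible) -> 1, p (poisonous) -> 0, then count each class.
toy["label"] = toy["label"].map({"e": 1, "p": 0})
class_counts = toy["label"].value_counts().sort_index()

# Step 7: dummy-code the nominal columns, dropping the first level of each.
encoded = pd.get_dummies(toy, columns=list(nominal_cols), drop_first=True)
print(class_counts.to_dict())
print(list(encoded.columns))
```

Dropping the first level leaves one fewer dummy column per nominal variable, which avoids a redundant column that the others already determine.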


**Complete the class MushroomClassifier from the code template given below.**

In [73]:
#import your other libraries here
import pandas as pd

In [74]:
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer 
from sklearn.preprocessing import FunctionTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
class MushroomClassifier:
    def __init__(self, data_path): # DO NOT modify this line
        self.data_path = data_path
        self.df = pd.read_csv(data_path)

    def Q1(self): # DO NOT modify this line
        """
            1. (From step 1) Before doing the data prep., how many "na" are there in "gill-size" variables?
        """
        df = self.df
        return df['gill-size'].isna().sum()


    def Q2(self): # DO NOT modify this line
        """
            2. (From step 2-4) How many rows of data, how many variables?
            - Drop rows where the target (label) variable is missing.
            - Drop the following variables:
            'id','gill-attachment', 'gill-spacing', 'gill-size','gill-color-rate','stalk-root', 'stalk-surface-above-ring',
            'stalk-surface-below-ring', 'stalk-color-above-ring-rate','stalk-color-below-ring-rate','veil-color-rate','veil-type'
            - Examine the number of rows, the number of columns, and whether any values are missing.
        """
        columns_to_remove = [
            'id',
            'gill-attachment',
            'gill-spacing',
            'gill-size',
            'gill-color-rate',
            'stalk-root',
            'stalk-surface-above-ring',
            'stalk-surface-below-ring',
            'stalk-color-above-ring-rate',
            'stalk-color-below-ring-rate',
            'veil-color-rate',
            'veil-type',
        ]
        def drop_label_na(df):
            return df[df['label'].notna()]
        def remove_columns(df,columns_to_remove):
            return df.drop(columns=columns_to_remove)
        # Start from the frame loaded in __init__ instead of a global `df`.
        self.df = (self.df
               .pipe(drop_label_na)
               .pipe(remove_columns,columns_to_remove)
                   )
        return self.df.shape


    def Q3(self): # DO NOT modify this line
        """
            3. (From step 5-6) Answer the quantity class0:class1
            - Fill missing values by adding the mean for numeric variables and the mode for nominal variables.
            - Convert the label variable e (edible) to 1 and p (poisonous) to 0 and check the quantity. class0: class1
        """
        # Imports are repeated here so the method does not depend on the notebook cell above.
        from sklearn.impute import SimpleImputer
        from sklearn.pipeline import Pipeline
        from sklearn.compose import ColumnTransformer 
        from sklearn.preprocessing import FunctionTransformer
        from sklearn.preprocessing import OneHotEncoder
        from sklearn.model_selection import train_test_split
        def label_encoder(frame):
            # frame is the one-column DataFrame that ColumnTransformer passes in.
            encode_map = {'e':1,'p':0}
            return frame.apply(lambda col : col.map(encode_map))

        df = self.df
        label_column = ['label']
        categorial_columns = df.iloc[:,1:-1].columns
        numerical_columns = ['cap-color-rate']
        preprocessor = ColumnTransformer(
            transformers=[
                ('label_encoding',Pipeline(steps=[
                    ('label_encoder',FunctionTransformer(func=label_encoder, validate=False)),
                ]),label_column),
                ('categorial_imputation',Pipeline(steps=[
                    ('mode_imputer',SimpleImputer(strategy='most_frequent')),
                ]),categorial_columns),
                ('numerical-imputation',Pipeline(steps=[
                    ('mean_imputer',SimpleImputer(strategy='mean'))
                ]),numerical_columns),
            ],
            remainder='passthrough',
        )
        preprocessor.fit(df)
        preprocessed_dataframe = pd.DataFrame(preprocessor.transform(df),columns=df.columns)
        self.df = preprocessed_dataframe
        
        return preprocessed_dataframe.label.value_counts()


    def Q4(self): # DO NOT modify this line
        """
            4. (From step 7-8) How large are the training and testing sets?
            - Convert the nominal variable to numeric using a dummy code with drop_first = True.
            - Split train/test with 20% test, stratify, and seed = 2020.
        """
        from sklearn.preprocessing import OneHotEncoder
        from sklearn.model_selection import train_test_split
        preprocessed_dataframe = self.df
        categorial_columns = self.df.iloc[:,1:-1].columns
        one_hot_encode_pipeline = ColumnTransformer(
            transformers=[
                ('onehot',OneHotEncoder(drop='first',sparse_output=False),categorial_columns)
            ],
            remainder='passthrough',
            force_int_remainder_cols=False,
        )
        X_features = preprocessed_dataframe.drop('label',axis=1)
        y = preprocessed_dataframe['label']
        one_hot_encode_pipeline.fit(X_features)
        encoded_columns = one_hot_encode_pipeline.named_transformers_['onehot'].get_feature_names_out()
        X = one_hot_encode_pipeline.transform(X_features)
        
        X_train,X_test, y_train,y_test =train_test_split(X,y,stratify=y,test_size=0.2,random_state=2020)
        
        return X_train.shape, X_test.shape


    def Q5(self):
        """
            5. (From step 9) Best params after doing random forest grid search.
            Create a Random Forest with GridSearch on training data with 5 CV with n_jobs=-1.
            - 'criterion':['gini','entropy']
            - 'max_depth': [2,3]
            - 'min_samples_leaf':[2,5]
            - 'n_estimators':[100]
            - 'random_state': 2020
        """
        # remove pass and replace with your code
        pass


    def Q6(self):
        """
            6. (From step 10) What is the macro F1 score? (Mind the number of digits!)
            Predict the testing data set with confusion_matrix and classification_report,
            using scientific rounding (a fraction below 0.5 is dropped, above 0.5 is rounded up)
        """
        # remove pass and replace with your code
        pass

    def pipelining(self):
        df = self.df
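
Q5 and Q6 are left as stubs above. One way they could be filled in is sketched below on synthetic data. Two things here are assumptions and not part of the template: `make_classification` stands in for the encoded mushroom features, and `N_JOBS = 1` replaces the assignment's `n_jobs=-1` only to keep the sketch portable.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the encoded mushroom features (an assumption).
X, y = make_classification(n_samples=300, n_features=8, random_state=2020)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=2020
)

N_JOBS = 1  # the assignment asks for n_jobs=-1
param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [2, 3],
    "min_samples_leaf": [2, 5],
    "n_estimators": [100],
    "random_state": [2020],
}

# Q5: 5-fold grid search on the training split only.
grid_search = GridSearchCV(
    estimator=RandomForestClassifier(),
    param_grid=param_grid,
    cv=5,
    n_jobs=N_JOBS,
)
grid_search.fit(X_train, y_train)
best_params = grid_search.best_params_

# Q6: macro F1 on the held-out split, rounded to two digits.
macro_f1 = round(f1_score(y_test, grid_search.predict(X_test), average="macro"), 2)
print(best_params, macro_f1)
```

In the class itself, Q4 would need to keep its split (for example on `self`) so that Q5 and Q6 can reuse the same training and testing rows.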


In [233]:
df_original = pd.read_csv('mushroom2020_dataset.csv')

def dataframe_prep(df):
    # Work on the frame passed in rather than silently overriding it with the global df_original.
    columns_to_remove = [
            'id',
            'gill-attachment',
            'gill-spacing',
            'gill-size',
            'gill-color-rate',
            'stalk-root',
            'stalk-surface-above-ring',
            'stalk-surface-below-ring',
            'stalk-color-above-ring-rate',
            'stalk-color-below-ring-rate',
            'veil-color-rate',
            'veil-type',
    ]    

    drop_columns = Pipeline(steps=[
        ('drop_columns', FunctionTransformer(func=lambda df : df.drop(columns=columns_to_remove), validate=False))
    ])
    
    def drop_nan_rows(X, column_name):
        if isinstance(X, pd.DataFrame):
            return X.dropna(subset=[column_name])
        else:
            raise TypeError(f"Input must be a pandas DataFrame : input is {type(X)}")
    label_column = 'label'
    drop_target_missing = Pipeline(steps=[
        ('drop_na_label',FunctionTransformer(func=lambda df : drop_nan_rows(df,label_column),validate=False)),
    ])
    def encode_label_to_binary(df,column):
        encode_map = {'e':1, 'p':0}
        newdf = df.copy()
        newdf[column] = newdf[column].map(encode_map).astype(float)
        return newdf
    encode_label_to_binary_transformer = FunctionTransformer(func=lambda df : encode_label_to_binary(df,label_column),validate=False)
    label_transformer = Pipeline(steps=[
        ('encode_label_to_binary',encode_label_to_binary_transformer),
    ])
    col_transformer = ColumnTransformer(
        transformers=[
            ('label_transformer',label_transformer,[label_column])
        ],
        remainder='passthrough',
    )

    drop_columns_and_missing_row_pipe = Pipeline(steps=[
        ('drop_target_missing',drop_target_missing),
        ('drop_columns',drop_columns),
        ('col_transformer',col_transformer),
    ])
    result = drop_columns_and_missing_row_pipe.fit_transform(df)
    left_over_columns = [column for column in df.columns if column not in columns_to_remove]
    
    df = pd.DataFrame(result, columns=left_over_columns)
    
    # norminal_columns_loc = [dataframe.columns.get_loc(column) for column in dataframe.iloc[:,1:-1].columns]
    norminal_columns = ['cap-shape', 'cap-surface', 'bruises', 'odor', 'stalk-shape',
       'ring-number', 'ring-type', 'spore-print-color', 'population',
       'habitat']
    numeric_columns= ['cap-color-rate']
    
    fill_mean_pipe = Pipeline(steps=[
        ('fill_mean', SimpleImputer(strategy='mean')),
    ])
    fill_mode_pipe = Pipeline(steps=[
        ('fill_mode', SimpleImputer(strategy='most_frequent'))
    ])
    identity_transformer = FunctionTransformer(func=lambda X : X, validate=False)
    impute_columns = ColumnTransformer(
        transformers=[
            ('identity', identity_transformer,[label_column]),
            ('impute_mode_to_norminal_columns',fill_mode_pipe,norminal_columns),
            ('impute_mean_to_numeric_columns',fill_mean_pipe,numeric_columns),
        ],
        remainder='passthrough',
    )

    df = pd.DataFrame(impute_columns.fit_transform(df),columns=df.columns)

    ohe_pipe = ColumnTransformer(
        transformers=[
             ('label_identity',identity_transformer,[label_column]),
             # Use the local column list; `categorial_columns` only exists as a global from another cell.
             ('onehot',OneHotEncoder(drop='first',sparse_output=False),norminal_columns),
        ],
        remainder='passthrough',
    )

    ohe_pipe.fit(df)
    final_preprocessed = pd.DataFrame(ohe_pipe.transform(df),columns=['label',*ohe_pipe.named_transformers_['onehot'].get_feature_names_out(), 'cap-color-rate'])
    return final_preprocessed

In [293]:
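from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier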
def split(df):
    X = df.drop(columns=['label'])
    y = df['label'].astype(int)
    X_train,X_test,y_train,y_test= train_test_split(X,y,stratify=y,random_state=2020,test_size=0.2)
    return X_train,X_test,y_train,y_test

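def train(df):
    # Unfinished decision-tree variant: it builds a parameter grid but never fits or returns a model.
    X_train,X_test,y_train,y_test=split(df)
    clf = DecisionTreeClassifier()
    param_grid = {
        'criterion':['gini','entropy'],
        'max_depth':[2,3],
        'min_samples_leaf':[2,5],
    }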

def search_best_param(df):
    X_train,X_test,y_train,y_test=split(df)
    clf = RandomForestClassifier()
    param_grid = {
        'criterion':['gini','entropy'],
        'max_depth':[2,3],
        'min_samples_leaf':[2,5],
        'n_estimators':[100],
        'random_state': [2020],
    }
    grid_search = GridSearchCV(param_grid=param_grid,estimator=clf,n_jobs=-1,cv=5)
    print(X_train.shape,y_train.shape)
    grid_search.fit(X_train,y_train)
    return grid_search

dfx = dataframe_prep(df_original)
dfx['cap-color-rate'] = dfx['cap-color-rate'].astype(float)
X_train,X_test,y_train,y_test = split(dfx)
grid_search = search_best_param(dfx)

report = classification_report(y_test,grid_search.predict(X_test),digits=2)
cmatrix = confusion_matrix(y_test,grid_search.predict(X_test))
print(grid_search.best_params_)
print(report)
print(cmatrix)

(4611, 42) (4611,)
{'criterion': 'entropy', 'max_depth': 3, 'min_samples_leaf': 2, 'n_estimators': 100, 'random_state': 2020}
              precision    recall  f1-score   support

           0       0.99      0.97      0.98       732
           1       0.95      0.98      0.97       421

    accuracy                           0.98      1153
   macro avg       0.97      0.98      0.97      1153
weighted avg       0.98      0.98      0.98      1153

[[712  20]
 [  8 413]]
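
The macro average in the report can be re-derived by hand from the confusion matrix. The sketch below uses only the four counts printed above (rows are true classes, columns are predicted classes, as in scikit-learn's `confusion_matrix`); the helper name `f1_from_counts` is made up for illustration.

```python
# Counts copied from the confusion matrix printed above.
tn, fp = 712, 20   # true class 0: predicted 0, predicted 1
fn, tp = 8, 413    # true class 1: predicted 0, predicted 1

def f1_from_counts(hit, false_pos, false_neg):
    """F1 = 2*hit / (2*hit + false positives + false negatives)."""
    return 2 * hit / (2 * hit + false_pos + false_neg)

# For class 0, "positive" means predicted 0, so fn counts its false positives
# and fp counts its false negatives.
f1_class0 = f1_from_counts(tn, fn, fp)
f1_class1 = f1_from_counts(tp, fp, fn)

# Macro F1 is the unweighted mean of the per-class scores.
macro_f1 = (f1_class0 + f1_class1) / 2
print(round(f1_class0, 2), round(f1_class1, 2), round(macro_f1, 2))
```

Because the macro average weights both classes equally, it can differ from the weighted average whenever the classes are imbalanced, as they are here (732 versus 421 test rows).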


Run the code below only to check that your code works; there is no need to submit it to the grader.

In [90]:
def main():
    hw = MushroomClassifier('mushroom2020_dataset.csv')
    return (hw.Q1(),hw.Q2(),hw.Q3(),hw.Q4(),hw.Q5(),hw.Q6())

if __name__ == "__main__":
    df = pd.read_csv('mushroom2020_dataset.csv')
    for ans in main():
        print(ans)

121
(5764, 12)
label
0    3660
1    2104
Name: count, dtype: int64
((4611, 42), (1153, 42))
None
None


In [76]:
columns_to_remove = [
    'id',
    'gill-attachment',
    'gill-spacing',
    'gill-size',
    'gill-color-rate',
    'stalk-root',
    'stalk-surface-above-ring',
    'stalk-surface-below-ring',
    'stalk-color-above-ring-rate',
    'stalk-color-below-ring-rate',
    'veil-color-rate',
    'veil-type',
]
def flat(xs):
    return sum(xs,[])
def drop_label_na(df):
    return df[df.label.notna()]
def remove_columns(df,columns_to_remove):
    return df.drop(columns=columns_to_remove)
def fill_mean_for_numeric_types(df):
    # Fill missing values in each numeric column with that column's mean.
    columns = [column for column in df.columns if pd.api.types.is_numeric_dtype(df[column])]
    return df.fillna(df[columns].mean())
    
dff = (df
.pipe(drop_label_na)
.pipe(remove_columns,columns_to_remove) 
      )
numeric_df = dff.select_dtypes(include=['number'])

In [77]:
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer 
from sklearn.preprocessing import FunctionTransformer
def label_encoder(frame):
    # frame is the one-column DataFrame that ColumnTransformer passes in.
    encode_map = {'e':1.0,'p':0.0}
    return frame.apply(lambda col : col.map(encode_map))

label_column = ['label']
categorial_columns = dff.iloc[:,1:-1].columns
numerical_columns = ['cap-color-rate']
preprocessor = ColumnTransformer(
    transformers=[
        ('label_encoding',Pipeline(steps=[
            ('label_encoder',FunctionTransformer(func=label_encoder, validate=False)),
        ]),label_column),
        ('categorial_imputation',Pipeline(steps=[
            ('mode_imputer',SimpleImputer(strategy='most_frequent')),
        ]),categorial_columns),
        ('numerical-imputation',Pipeline(steps=[
            ('mean_imputer',SimpleImputer(strategy='mean'))
        ]),numerical_columns),
    ],
    remainder='passthrough',
)
preprocessor.fit(dff)
preprocessed_dataframe = pd.DataFrame(preprocessor.transform(dff),columns=dff.columns)
preprocessed_dataframe.head()

   label  cap-shape  cap-surface  bruises  odor  stalk-shape  ring-number  ring-type  spore-print-color  population  habitat  cap-color-rate
0    0.0          x            s        t     p            e            o          p                  k           s        u             1.0
1    1.0          x            s        t     a            e            o          p                  n           n        g             2.0
2    1.0          b            s        t     l            e            o          p                  n           n        m             3.0
3    0.0          x            y        t     p            e            o          p                  k           s        u             3.0
4    1.0          x            s        f     n            t            o          e                  n           a        g             4.0


In [78]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
one_hot_encode_pipeline = ColumnTransformer(
    transformers=[
        ('onehot',OneHotEncoder(drop='first',sparse_output=False),categorial_columns)
    ],
    remainder='passthrough',
    force_int_remainder_cols=False,
)
X_features = preprocessed_dataframe.drop('label',axis=1)
y = preprocessed_dataframe['label'].astype(float)
one_hot_encode_pipeline.fit(X_features)
encoded_columns = one_hot_encode_pipeline.named_transformers_['onehot'].get_feature_names_out()
X = one_hot_encode_pipeline.transform(X_features)

X_train,X_test, y_train,y_test =train_test_split(X,y,stratify=y,test_size=0.2,random_state=2020)
X_train.shape, X_test.shape

((4611, 42), (1153, 42))
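
The row counts above follow from the 5764 rows reported by Q2: with `test_size=0.2`, 20% of 5764 is 1152.8, and the test part ends up with 1153 rows. A quick check on placeholder data (all-zero features, labels built only from the 3660:2104 class counts) reproduces the row split without needing the CSV.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same class counts as the Q3 output (0: 3660, 1: 2104).
y = np.array([0] * 3660 + [1] * 2104)
X = np.zeros((len(y), 1))  # feature values are irrelevant to the split sizes

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=2020
)
print(X_train.shape[0], X_test.shape[0])
# Stratification keeps the class ratio close to the full data's in both parts.
print(np.bincount(y_train), np.bincount(y_test))
```

The 42 feature columns depend on how many distinct levels each nominal variable has in the real data, so that number cannot be reproduced from the class counts alone.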

In [88]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
# model_pipeline = Pipeline(
#     [
#         ('clf', RandomForestClassifier()),
#     ]
# )
# param_grid = {
#     'clf__criterion' : ['gini', 'entropy'],
#     'clf__max_depth' : [2,3],
#     'clf__min_samples_leaf' : [2,5],
#     'clf__n_estimators' : [100],
#     'clf__random_state' : [2020],
# }
# grid_search = GridSearchCV(param_grid=param_grid,estimator=model_pipeline,n_jobs=-1)
# grid_search.fit(X_train,y_train)
# print(f'Best parameters: {grid_search.best_params_}')
# print(f'Best score: {grid_search.best_score_:.2f}')
clf = RandomForestClassifier()
param_grid = {
    'criterion':['gini','entropy'],
    'max_depth':[2,3],
    'min_samples_leaf':[2,5],
    'n_estimators':[100],
    'random_state':[2020],
}
grid_search = GridSearchCV(
    n_jobs=-1,
    param_grid=param_grid,
    estimator=clf,
    cv=5,  # 5-fold CV as required by step 9 (also the scikit-learn default)
)
grid_search.fit(X_train,y_train)
print(f'Best parameters: {grid_search.best_params_}')
print(f'Best score: {grid_search.best_score_:.2f}')

Best parameters: {'criterion': 'entropy', 'max_depth': 3, 'min_samples_leaf': 2, 'n_estimators': 100, 'random_state': 2020}
Best score: 0.97
