### 03_ml_03: ML3 Clustering

This dataset includes descriptions of hypothetical samples corresponding to 23 species of gilled mushrooms in the Agaricus and Lepiota family, drawn from The Audubon Society Field Guide to North American Mushrooms (1981).

Each species is identified as definitely edible, definitely poisonous, or of unknown edibility and not recommended. The last class was combined with the poisonous one. The Guide clearly states that there is no simple rule for determining the edibility of a mushroom; no rule like "leaflets three, let it be" for poison oak and ivy.


**Complete the class `Clustering` from the code template below.**

In [None]:
!pip install scikit-learn

In [14]:
#import your other libraries here
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

Step 1. Load the ‘ModifiedEdibleMushroom.csv’ data from the “Attachment” (note: this dataset has been preliminarily prepared).

Step 2. Choose edible mushrooms only.

Step 3. Only the variables below have been selected to describe the distinctive characteristics of edible mushrooms:
'cap-color-rate','stalk-color-above-ring-rate'

Step 4. Provide a proper data preprocessing as follows:
- Fill missing with mean.
- Standardize variables with Standard Scaler.
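
The two preprocessing operations in Step 4 can be sketched on a small made-up frame. Only the two column names come from the assignment; the values are hypothetical and are not taken from `ModifiedEdibleMushroom.csv`.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical values) with one missing entry per column.
toy = pd.DataFrame({
    'cap-color-rate': [1.0, 2.0, np.nan, 3.0],
    'stalk-color-above-ring-rate': [4.0, np.nan, 6.0, 8.0],
})

# Fill each column's missing values with that column's mean.
filled = toy.fillna(toy.mean(numeric_only=True))

# Standardize: each column ends up with mean 0 and unit population variance.
scaler = StandardScaler()
scaled = scaler.fit_transform(filled)
```

`StandardScaler` divides by the population standard deviation (`ddof=0`), so the square root of `scaler.var_` is the scale that was applied.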

Step 5. K-means clustering with 5 clusters (n_clusters=5, random_state=0, n_init='auto').

Step 6. Show the maximum centroid value of each of the 2 features ('cap-color-rate' and 'stalk-color-above-ring-rate'), rounded to 2 decimal places.

Step 7. Convert the centroid values back to the original scale, and show the minimum centroid value of each of the 2 features, rounded to 2 decimal places.
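
Converting back to the original scale inverts the standardization: since z = (x - u) / s, it follows that x = z * s + u. A minimal NumPy sketch, with made-up data and made-up centroids, shows the round trip:

```python
import numpy as np

# Hypothetical original-scale data for two features.
X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])

mean = X.mean(axis=0)
std = X.std(axis=0)      # population standard deviation (ddof=0)
Z = (X - mean) / std     # standardized data, z = (x - u) / s

# Pretend these are centroids found in standardized space.
centroids_z = np.array([[-1.0, 0.5], [0.5, -0.5]])

# Invert the standardization: x = z * s + u.
centroids_x = centroids_z * std + mean

# Minimum per feature on the original scale, rounded to 2 decimals.
min_centroid = np.round(centroids_x.min(axis=0), 2)
```

Because each standard deviation is positive, the conversion preserves ordering, so taking the minimum before or after converting gives the same result. For a fitted `StandardScaler`, `scaler.inverse_transform` performs the same conversion.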


In [17]:
class Clustering:
    def __init__(self, file_path): # DO NOT modify this line
        #Add other parameters if needed
        self.file_path = file_path 
        self.df = None #parameter for loading csv

    def Q1(self): # DO NOT modify this line
        """
        Step1-4
            1. Load the CSV file.
            2. Choose edible mushrooms only.
            3. Only the variables below have been selected to describe the distinctive
               characteristics of edible mushrooms:
               'cap-color-rate','stalk-color-above-ring-rate'
            4. Provide a proper data preprocessing as follows:
                - Fill missing with mean
                - Standardize variables with Standard Scaler
        """
        # remove pass and replace with your code
        self.df = pd.read_csv(self.file_path)
        edible_df = self.df[self.df['label'] == 'e'][['cap-color-rate', 'stalk-color-above-ring-rate']]
        # assign the result instead of using inplace=True on a slice of self.df
        edible_df = edible_df.fillna(edible_df.mean(numeric_only=True))

        scaler = StandardScaler()
        scaler.fit(edible_df)
        self.df = scaler.transform(edible_df)
        
        # for Q3
        self.mean = scaler.mean_
        self.std = [e**0.5 for e in scaler.var_]

        return self.df.shape

    def Q2(self): # DO NOT modify this line
        """
        Step5-6
            5. K-means clustering with 5 clusters (n_clusters=5, random_state=0, n_init='auto')
            6. Show the maximum centroid value of each of the 2 features ('cap-color-rate' and 'stalk-color-above-ring-rate'), rounded to 2 decimal places.
        """
        # remove pass and replace with your code
        self.Q1()
        
        kmeans = KMeans(n_clusters=5, random_state=0, n_init='auto')
        kmeans.fit(self.df)

        centroid_coords = pd.DataFrame(kmeans.cluster_centers_, columns=['x', 'y'])
        max_x = round(centroid_coords['x'].max(), 2)
        max_y = round(centroid_coords['y'].max(), 2)
        # named max_centroid to avoid shadowing the built-in max()
        max_centroid = np.array([max_x, max_y])

        # for Q3
        self.min_x = centroid_coords['x'].min()
        self.min_y = centroid_coords['y'].min()

        return max_centroid
    

    def Q3(self): # DO NOT modify this line
        """
        Step7
            7. Convert the centroid values back to the original scale, and show the minimum centroid value of each of the 2 features, rounded to 2 decimal places.

        """
        # remove pass and replace with your code
        self.Q2()

        # since z = (x - u) / s, x = z * s + u
        min_x_true = round(self.min_x * self.std[0] + self.mean[0], 2)
        min_y_true = round(self.min_y * self.std[1] + self.mean[1], 2)
        min_true = np.array([min_x_true, min_y_true])
        return min_true

**Run the code below to only test that your code can work, and there is no need to submit it to the grader.**

In [18]:
def main(): # DO NOT modify this line
    hw = Clustering('ModifiedEdibleMushroom.csv')
    # exec(input().strip())
    print(hw.Q3())

if __name__ == "__main__": # DO NOT modify this line
    main()

[1.01 1.  ]
