# HW 3: Classification, Evaluation, and Deployment

In this homework, you will have a chance to experience a complete machine learning cycle. You will prepare a dataset, make a model, evaluate models to find the best fit, and deploy it to a simple web page. Our main objective is to make students *learn* classification and evaluation methods, so we will apply essential data preprocessing techniques but mainly focus on classification and evaluation.

We will use **Pima Indians Diabetes Database** that is publicly available and from UCI. However, we removed and changed some parts of the dataset for the homework evaluation, so **please use the one in the zip file provided in ilearn**.

This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on specific diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.

The datasets consist of several medical predictor (independent) variables and one target (dependent) variable, Outcome. Independent variables include the number of pregnancies the patient has had, their BMI, insulin level, age, and so on.

According to the information on the data, it has eight attributes and one binary class. The brief explanation of the attributes are as follows:

- Pregnancies: Number of times pregnant.

- Glucose: Plasma glucose concentration a 2 hours in an oral glucose tolerance test.

- BloodPressure: Diastolic blood pressure (mm Hg).

- SkinThickness: Triceps skin fold thickness (mm).

- Insulin: 2-Hour serum insulin (mu U/ml).

- BMI: Body mass index (weight in kg/(height in m)^2).

- DiabetesPedigreeFunction: Diabetes pedigree function.

- Age: Age (years).

and we have a binary class which can be 0 (healthy) or 1 (diabetes).

**NOTE**: Unlike the labs, each function you make here will be **graded**, so it is important to *strictly* follow input and output instruction stated in the skeleton code.

## Contents

0. Preparation
 - Load the dataset.
 - Task 1: Changing zero value into mean (not graded).
1. Classification
 - Task 2: Random forest (graded, 0.5 pt).
 - Task 3: SVM with diverse kernels (graded, 0.5 pt).
 - Task 4: Decision tree implementation (graded, **advanced**, 2 pt).
2. Evaluation
 - Task 5: Precision, Recall, F1-score (graded, 0.3 pt).
 - Task 6: AUC/AUPRC (graded, 0.2 pt).
 - Task 7: Apply them together with scikit-learn (graded, 0.5 pt).
 - Task 8: Task 5 implementation (graded, **advanced**, 1 pt).
3. Deployment
 - Save models into a file using pickle.

# 0. Preparation

##### Student information

Please provide your information for automatic grading.

In [1]:
STUD_SUID = 'qiwa1131'
STUD_NAME = 'Qiushi Wang'
STUD_EMAIL = 'qiwa1131@student.su.se'

#### Basic libraries

These libraries will be frequently used throughout the homework!

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import pickle
from HW3_helper import *
RANDOM_STATE = 12345 #Do not change it!
np.random.seed(RANDOM_STATE) #Do not change it!

#### Load the dataset

Use the **diabetes** dataset located ilearn, and load it here using pandas. 

In [3]:
diabetes = pd.read_csv("datasets/diabetes.csv")

Here you can find out some basic information by calling *info(), head()*, and *describe()*.

In [4]:
diabetes.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    int64  
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB


It seems like there is no null data. However, if you check zero values in the dataset, there are so many of them. Is it normal that people's BMI is zero? or not? You may want to change the zero values into another reasonable value, such as mean or median. The only thing that can have zero value is **pregnancies**. Let's first make a function changing zero values into the mean of the column.

#### Task 1: Changing zero value into mean (not graded)

In [5]:
def imputation(df, columns):
    
    """
     A function to change nan value (or zero value) to the mean of the attribute
        
        - Step 1: Get a part of dataframe using columns received as a parameter.
        - Step 2: Change the zero values in the columns to np.nan.
        - Step 3: Change the nan values to the mean of each attribute (column). 
                  You can use apply() or fillna() function.
        Input:
          df: A dataframe that needs to apply imputation
          columns: A list of columns that need to apply imputation
        Output:
          An imputed dataframe
    
    """
    #step 1 
    
    for col in columns:
        col_values = df[col]
        col_values = col_values.replace(0,np.nan)
        df [col] = col_values
  
    for col in columns:
        col_mean = df[col].mean()
        df [col].fillna(value = col_mean)
    
    return df

In [6]:
diabetes_test = imputation(diabetes, ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI", "DiabetesPedigreeFunction", "Age"])

After finishing a simple data processing, let's proceed to our main task, classification.

If you want to skip this part, you may want to use the imputation code from scikit-learn. If you are interested in imputation, you can find more information [here](https://scikit-learn.org/stable/modules/impute.html).
**NOTE**: The imputation function itself is not graded, but it is required for you to run imputation as it will affect other functions' results that are graded. 

<span style="color:blue"> **You HAVE TO run imputation even though it is not graded for the next tasks. It is your responsibliity to follow the instruction. If you want to skip this part, remove the comments below and run the code th perform imputation**

In [7]:
# Remove the comments and run the code below if you skip the parts above 

from sklearn.impute import SimpleImputer
columns = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI", "DiabetesPedigreeFunction", "Age"]
df_parts = diabetes.copy()[columns]
df_parts[df_parts==0] = np.nan
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
df_converted = pd.DataFrame(imp.fit_transform(df_parts), columns=columns)
diabetes[columns] = df_converted
diabetes.describe()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
count,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0
mean,3.845052,121.686763,72.405184,29.15342,155.548223,32.457464,0.471876,33.240885,0.348958
std,3.369578,30.435949,12.096346,8.790942,85.021108,6.875151,0.331329,11.760232,0.476951
min,0.0,44.0,24.0,7.0,14.0,18.2,0.078,21.0,0.0
25%,1.0,99.75,64.0,25.0,121.5,27.5,0.24375,24.0,0.0
50%,3.0,117.0,72.202592,29.15342,155.548223,32.4,0.3725,29.0,0.0
75%,6.0,140.25,80.0,32.0,155.548223,36.6,0.62625,41.0,1.0
max,17.0,199.0,122.0,99.0,846.0,67.1,2.42,81.0,1.0


Now, you can see that the columns' min values are changed except for pregnancies' one.

# 1. Classification

In this assignment, we will try to run random forest (RF), and support vector machine (SVM) with different kernels using scikit-learn. As an extra task, we also have a chance to understand the decision tree in detail by implementing it from scratch. We will continue to use **the pre-processed diabetes dataset**!

#### Task 2: Random forest (graded, 0.5 pt)

Here you will run the random forest algorithm using scikit-learn, together with cross-validation. Detailed information about the random forest in scikit-learn can be found [here](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html).

Your task is as follows:
 1. Create a random forest classifier with the random state stated above (RANDOM_STATE).
 2. Report an average cross-validation score with stratified k-fold with **k=5** into the variable called **rf_cross_val_score** (0.2 pt).
 3. Run grid search with a dictionary having two elements; `max_depth` from 1 to 10, and `min_samples_split` from 2 to 10. Report the best classifier (or the best estimator) into the variable called **rf_best_classifier**. Use the same random forest classifier instance with the random state. Set **k=5** for grid search cross-validation. Since grid search uses stratified k-fold inside, you should put a complete dataset, not a split training set (0.3 pt).
 
* For clarification, **Task 2** and **Task 3** are independent. You have to run stratified k-fold for **Task 2**, but it will not be used in **Task 3**.
* There is no further partial point for each subtask, so please be careful to read the instruction.

In [8]:
# Import required libraries if needed.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler 

In [9]:
rf = RandomForestClassifier(random_state=RANDOM_STATE) 
# CHANGE IT

In [10]:
#diabetes.columns = np.arange(9)
X = diabetes.drop('Outcome',axis=1)
y = diabetes.iloc[:, -1]

In [11]:
X.describe()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age
count,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0
mean,3.845052,121.686763,72.405184,29.15342,155.548223,32.457464,0.471876,33.240885
std,3.369578,30.435949,12.096346,8.790942,85.021108,6.875151,0.331329,11.760232
min,0.0,44.0,24.0,7.0,14.0,18.2,0.078,21.0
25%,1.0,99.75,64.0,25.0,121.5,27.5,0.24375,24.0
50%,3.0,117.0,72.202592,29.15342,155.548223,32.4,0.3725,29.0
75%,6.0,140.25,80.0,32.0,155.548223,36.6,0.62625,41.0
max,17.0,199.0,122.0,99.0,846.0,67.1,2.42,81.0


In [12]:
y

0      1
1      0
2      1
3      0
4      1
      ..
763    0
764    0
765    0
766    1
767    0
Name: Outcome, Length: 768, dtype: int64

In [13]:
skf = StratifiedKFold(n_splits=5)
scores_skf = cross_val_score(rf, X, y, cv=skf) 
rf_cross_val_score = np.mean(scores_skf) # CHANGE IT

In [14]:
rf_cross_val_score

0.7656735421441304

In [15]:
param_grid = [
  {'max_depth': [1,2,3,4,5,6,7,8,9,10],'min_samples_split': [2,3,4,5,6,7,8,9,10]},
 ]

search = GridSearchCV(rf, param_grid,cv=5)
search.fit(X, y)

rf_best_classifier = search.best_estimator_
# CHANGE IT

In [16]:
rf_best_classifier

RandomForestClassifier(max_depth=8, min_samples_split=6, random_state=12345)

#### Task 3: SVM with diverse kernels (graded, 0.5 pt)

We already tried a simple SVC with the RBF kernel before. Here you will rerun SVM, but trying different kernels, together with cross-validation. Detailed information about SVC in scikit-learn can be found [here](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC).

Your task is as follows:

  1. Create a standard SVC classifier without setting any parameter.
  2. You may want to re-scale the dataset so that all the attributes have the same range of values. Apply StandardScaler without changing any parameters. You can find information [here](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html). *Please do not apply Scaler to the label* (0.1 pt).
  3. Report test score of SVC model with **holdout test** with **test set ratio = 30%** into the variable called svm_ho_score. It means that you train the model using the training dataset and report the score using the test set. Since test_train_split function shuffles the dataset, do not forget to put the random state stated above (`RANDOM_STATE`) (0.2 pt).
  4. Run grid search with a dictionary stating kernels ['linear', 'poly', 'rbf'] and C = [1, 10, 100], and put the best classifier into the variable called svm_best_classifier. Set **k=10** for grid search cross-validation. Since grid search uses stratified k-fold inside, you should put the complete dataset, not split training set (0.2 pt).
  
  
  * For clarification, **Task 3** and **Task 4** are independent. Any produced work in **Task 3** will not be used in **Task 4**.
  * There is no further partial point for each subtask, so please be careful to read the instruction. Failing to apply StandardScaler affects the scores of part 3 and 4, as those are automatically graded, so be careful.

In [17]:
# Apply StandardScaler to change the datasets here
sd = StandardScaler()
X = sd.fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.30, random_state=RANDOM_STATE)


In [18]:
svc = SVC() # CHANGE IT
svc.fit(X_train,y_train)

SVC()

In [19]:
svm_ho_score = svc.score(X_test,y_test)
# CHANGE IT

In [20]:
svm_ho_score

0.8268398268398268

In [21]:
param_grid_ = [{'C': [1,10,100],'kernel':['linear','poly','rbf']}]

search_ = GridSearchCV(SVC(),param_grid_,cv=10)
search_.fit(X, y)

svm_best_classifier = search_.best_estimator_ # CHANGE IT

In [22]:
svm_best_classifier

SVC(C=1, kernel='linear')

#### Task 4: Decision tree implementation (graded, advanced, 2 pt)

This task is extra for those who want to get extra points! We will now implement a decision tree from scratch. Follow the instruction carefully so that you can return the correct result, which will be a criterion to grade. We will also offer a simple test function so you can validate your implementation.

We have two different grading options:

  - 4-1. Implement a decision tree without any constraints (1 pt)
  - 4-2. Allow two main parameters: max_depth (0.5 pt) and min_size (0.5 pt) (in total 1 pt)

Here you can see our structure. Unlike labs, since this task is graded, we did not offer you class structure since it can make additional confusion to some students. We have seven separate methods, and here is a brief description of each method:

- **dt_fit**: This function is first called with the dataset and creates the tree's root node. It also calls a recursive function to grow the tree.
- **dt_score**: A function returning the accuracy scores of the received dataset and labels.
- **dt_predict**: A recursive function that predict a row's label by going through the trained tree.
- **find_best_split**: This function examines the best split by trying to split based on each attribute and a specific value.
- **gini_index**: This function receives two groups (left, right) and calculates a Gini index of these two groups based on outcome distribution.
- **leaf_final_value**: This function receives one group and returns the most common label (outcome) so that the tree can terminate with its final decision.
- **split**: This function is a recursive function that calculates the best split and splits the node into two parts until specific criteria are met, such as minimum samples or max depth of the tree.



* Unfortunately, there is no further partial point for each subtask, so please be careful to read the instruction.
* Part 2 (4-2) is only counted when you successfully finish part 1 (4-1). So prioritize finishing 4-1 first to get scores.

In [23]:
#series = pd.Series(['g', 'e', 'e', 'k', 's','f', 'o', 'r', 'g', 'e', 'e', 'k', 's']) 
#print("Printing the Original Series:") 
#display(series) 
  
# counting the frequency of each element 
#freq = series.value_counts().index [0]
#print("Printing the frequency") 
#display(freq) 
#freq = y.value_counts().index [0]
#freq


In [24]:
def dt_fit(X, y, min_samples_split=1, max_depth=np.inf):
#def dt_fit(X, y, min_samples_split, max_depth):
    """
    Input:
      X: Training dataset.
      y: Training labels.
      min_samples_split: constraint. Minimum number of samples in the node that the algorithm stops splitting.
      max_depth: constraint. Maximum number of depth from the root that the algorithm stops splitting.
      * X and y should have the same size.
    Output:
      root: A root node having the whole information of the tree after completing recursion.
    """
    # the data structure of a node was designed as 
    # [chosen_column,chosen_value,[child_node_left,child_node_right]], and it is the only datastructure of a node, no matter the node is a root node or a child node.
    # the program always follow the rule: firstly use find_best_split() to a node, then use split() to its child nodes. 
    root_node = find_best_split(X,y)
    child_L = root_node [2][0] #[Left_X,Left_y]
    child_R = root_node [2][1] #[Right_X,Right_y]
    next_L = find_best_split(child_L[0],child_L[1])
    next_R = find_best_split(child_R[0],child_R[1])

    # current node, and two children.
    return [root_node[0],root_node[1],[split(next_L,0, min_samples_split, max_depth),split(next_R,0, min_samples_split, max_depth)]]
    #return



def find_best_split(X, y):

    '''
    - Step 1: Get possible unique labels of the current node
    - Step 2: Iterate each column and the possible unique values of each column (double loops),
              and try dividing a node into two parts by the specific value of the chosen column (the current two values of the loop).
    - Step 3: Calculate a gini index of the node with the separated parts by chosen column and value.
              Since we are dealing with continuous values, divide the datasets with the following criteria:
               - if the value of the chosen column is lower than the chosen value -> assign it to the left node
               - otherwise (higher or equal to) -> assign it to the right node.
               - Then we can call gini_index function with those two nodes' information.
    - Step 4: By calling gini_index function for every (column, value) pair,
              get the best gini index by iterating all the values and columns from the dataset that the node has.
    - Step 5: With chosen criteria, create a node structure with the following information: [column, value to split, children]
              It is up to you to specify the structure, but here is an example.
              {'index': A chosen column name having the best gini index score,
               'value': A chosen value in the index column having the best gini index score,
               'children': A list that contains splitted groups [left, right]}.
    '''
    # step 1: unique labels
    unique_labels = y.unique()
    col_gini = 1.1
    lowest_gini_of_columns = ''
    split_value = 0
    # loop for column, '1st loop in dobule loop'
    for col in X:
        # possible uniques values for current column
        possible_unique_values = X [col].unique()
        # I concat class label with corresponding value for convenience of calculating gini index
        col_with_label = pd.concat([X [col], y], axis=1)
        col_with_label.columns = ['col','label']
        gini_ = 1.1
        chosen_value = 0
        #loop possible unique values for current column, '2nd loop in double loop'
        for value in possible_unique_values:
            left_ = col_with_label.iloc[lambda x: x['col'].values < value]
            right_ = col_with_label.iloc[lambda x: x['col'].values >= value]
            g_i = gini_index([left_,right_], unique_labels)
            if g_i < gini_:
                gini_ = g_i
                chosen_value = value
        # if gini index from the latest column is less than previous lowest value, then update information.
        if gini_<col_gini:
            col_gini = gini_
            split_value = chosen_value
            lowest_gini_of_columns = col

   
    # prepare result
    Left_X = X.iloc[lambda x: x[lowest_gini_of_columns].values < split_value]
    Right_X = X.iloc[lambda x: x[lowest_gini_of_columns].values >= split_value]
    Left_y = y[Left_X.index]
    Right_y = y[Right_X.index]

    Left = [Left_X,Left_y]
    Right = [Right_X,Right_y]
    Children = [Left,Right]
    result = [lowest_gini_of_columns,split_value,Children]
    return result
    

def gini_index(children, classes):
    """
    Input
      children: A list that contains splitted groups [left, right]}.
      classes: Possible outcomes of the part of dataset.
    Output
      A gini index value.
    """ 
    # get size of children
    children_size = 0
    gini_index = 0
    # count the size of children
    for child in children:
        children_size += child.shape [0]

    # for each child
    for child in children:
        child_size = child.shape [0]
        if child_size == 0:
            continue;

        child_score = 0
        # for each label of a child, count child score
        for label in classes:
            #label_count = 0
            # iterate each row in child, to count label
            '''
            for index,line in child.iterrows():
                if line[-1] == label:
                    label_count += 1
            '''
            label_count = sum(child.iloc[lambda x: x['label'].values == label]['label'])
            
            proportion = label_count / child_size 
            sqaured_p = np.square(proportion)
            child_score += sqaured_p 

        child_gini = (1 - child_score) * (child_size/children_size)
        gini_index += child_gini
    
    # end for child in children
    return gini_index


def leaf_final_value(y):
    """
    A function that returns the most common label given labels in a specific node.
    Input
      y: A list of labels of the part of dataset.
    Output
      The most common label in the input series.
    """
    if y.shape [0] == 0:
        return
    else:
        return [y.value_counts().index [0]]


#def split(node, depth, min_samples_split, max_depth):
def split(node, depth, min_samples_split=1, max_depth=np.inf):
    """
    A recursive function to split the node into two parts based on [the result from find_best_split function].
    This function will create left and right children in the node structure (so the current node will have all its children).
    
    If you only developed part 4-1, just follow Step 1 and Step 4.

    - Step 1: Termination 1: Check whether the size of left and right child is zero.
              If so, call leaf_final_value for **both children** to finalize the node.
    - Step 2: Termination 2: If the depth of current node reaches the maximum depth parameter (max_depth),
              again call leaf_final_value to finalize the node.
    
    * Step 3-4 should be applied to each child separately.
    - Step 3: Termination 3: If the number of samples in the left or right node is smaller than our threshold (min_samples_split),again call leaf_final_value to finalize the corresponding child (left or right).

    """
    #print('\n depth: ',depth)
    # empty
    if len(node) == 0:
        return
    # a label
    if len(node) == 1:
        return node

    Children = node [2]
    Left = Children [0]
    Right = Children [1]
    Left_X = Left [0]
    Left_y = Left [1]
    Right_X = Right [0]
    Right_y = Right [1]

    #Termination 2: If the depth of current node reaches the maximum depth parameter (max_depth). again call leaf_final_value to finalize the node.
    
    if depth >= max_depth:
        #print('max_depth works here')
        y_ = Left_y.append(Right_y)
        return leaf_final_value(y_)
    
    #Termination 3: If the number of samples in the left or right node is smaller than our threshold (min_samples_split), again call leaf_final_value to finalize the corresponding child (left or right).
    
    if Left_X.shape[0]<min_samples_split:
        #print('min_samples_split works here on the left')
        child_node_L = leaf_final_value(Left_y)
    if Right_X.shape[0]<min_samples_split:
        #print('min_samples_split works here on the right')
        child_node_R = leaf_final_value(Right_y)
    
    #Termination 1: Check whether the size of left and right child is zero. if so, call leaf_final_value for **both children** to finalize the node.
    if Left_X.empty == True and Right_X.empty == True:
        return
    if Left_X.empty == True:
        child_node_L = []
        child_node_R = leaf_final_value(Right_y)
    elif Right_X.empty == True:
        child_node_R = []
        child_node_L = leaf_final_value(Left_y)
    # continue spliting    
    else:
        child_node_L = find_best_split(Left_X,Left_y)
        child_node_R = find_best_split(Right_X,Right_y)

        #depth = depth+1
    #return: [ (1)current_node_chosen_attribute_for_split, (2)current_node_chosen_value_for_split, (3)[split(Left_Child),split(Right_child)]]
    return [node[0],node[1],[split(child_node_L, depth+1, min_samples_split, max_depth),split(child_node_R, depth+1, min_samples_split, max_depth) ]]

def dt_score(tree, X, y):
    """
    Input:
      tree: A trained tree returned by fit function.
      X: A test dataset.
      y: Test labels.
    Output:
      An accuracy score.
    """
    from sklearn.metrics import accuracy_score

    length = y.shape[0]
    
    idx = np.arange(length)
    y_pred = []
    for i in idx:
        current_row = X.iloc[[i]]
        y_pred.append(dt_predict(tree,current_row))


    score = accuracy_score(y, y_pred)
    return score
    
def dt_predict(node, row):
    """
    A recursive function that predict a row's label by going through the trained tree.
    Input:
      node: A current node to check for splitting.
      row: A single row in a dataset.
    Output:
      A predicted label.
    """
    # predict a false label if node = None
    if node == None:
        return 0

    length = len(node)
    row = row.reset_index(drop=True)
    #next_node = NULL
    next_branch = 1 # default right branch
    if length == 3: # [A,S,[Left,Right]]
        attr = node [0]
        value = row [attr][0]
        
        if value < node [1]:
            next_branch = 0 # go left branch
        
        next_node = node [2][next_branch]  # next node
        # if a next node is empty, it means ending, like [leftNode,[]] or [[],rightNode]. so go to the other node. 
        if next_node == None:
            return dt_predict(node [2][(-1)*next_branch],row)

        return dt_predict(next_node,row)    
    elif length == 1: # give the label
        return node [0]


In [25]:
# make dataset for decision tree
from sklearn.impute import SimpleImputer
test_tree = pd.read_csv("datasets/diabetes.csv")
columns = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI", "DiabetesPedigreeFunction", "Age"]
df_parts = test_tree.copy()[columns]
df_parts[df_parts==0] = np.nan
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
df_converted = pd.DataFrame(imp.fit_transform(df_parts), columns=columns)
test_tree[columns] = df_converted
test_tree.describe()
test_tree.head()
#test_tree = test_tree.iloc[100:130,:]
tree_X = test_tree.drop(['Outcome'],axis=1)
tree_y = test_tree.iloc[:, -1]

#train data, test data
X_tree_train, X_tree_test, y_tree_train, y_tree_test = train_test_split(tree_X, tree_y, test_size = 0.3, random_state=RANDOM_STATE,shuffle=True)


In [26]:
# test my own tree
import time
before = time.asctime(time.localtime(time.time()))
print('before fit:',before,'\n')
Tree = dt_fit(X_tree_train, y_tree_train, min_samples_split=3, max_depth=6)
after = time.asctime(time.localtime(time.time()))
print('after fit: ',after,'\n')
#print('time comsumption: ',after - before)
score = dt_score(Tree,X_tree_test,y_tree_test)
score

before fit: Sun Oct 25 01:01:32 2020 

after fit:  Sun Oct 25 01:02:14 2020 



0.7445887445887446

In [27]:
Tree

['Glucose',
 155.0,
 [['Age',
   29.0,
   [['SkinThickness',
     29.0,
     [['Age',
       27.0,
       [['Pregnancies',
         4,
         [['Pregnancies',
           2,
           [['Pregnancies', 0, [None, [0]]], ['Pregnancies', 2, [None, [0]]]]],
          ['Pregnancies', 4, [None, [0]]]]],
        ['Insulin',
         190.0,
         [['DiabetesPedigreeFunction',
           0.867,
           [['Pregnancies', 4, [[0], [0]]],
            ['BloodPressure', 68.0, [[1], [0]]]]],
          ['Pregnancies', 3, [None, [1]]]]]]],
      ['BMI',
       52.9,
       [['Glucose',
         128.0,
         [['DiabetesPedigreeFunction',
           0.503,
           [['Glucose', 112.0, [[0], [0]]], ['Insulin', 100.0, [[1], [0]]]]],
          ['BloodPressure',
           75.0,
           [['BloodPressure', 66.0, [[0], [1]]],
            ['Pregnancies', 4, [[0], [1]]]]]]],
        ['Pregnancies', 0, [None, [1]]]]]]],
    ['BMI',
     26.5,
     [['Age',
       60.0,
       [['Pregnancies',
      

In [28]:
# test DecisionTreeClassifier()
from sklearn import tree
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X_tree_train, y_tree_train)

from sklearn.metrics import accuracy_score

length = y_tree_test.shape[0]
idx = np.arange(length)
y_pred = []
for i in idx:
    current_row = X_tree_test.iloc[[i]]
    y_pred.append(clf.predict(current_row))
    

#sklearn.metrics.accuracy_score(y_true, y_pred)
score_sklearntree = accuracy_score(y_tree_test, y_pred)
score_sklearntree


0.7359307359307359

In [29]:
#clf
#from sklearn import tree
#tree.plot_tree(clf)


# 2. Evaluation 

#### Task 5: Precision, Recall, F1-score (graded, 0.3 pt)

You will evaluate the random forest and the support vector machine classifier with various performance measures besides accuracy such as precision, recall, and F1-score, also using scikit-learn. Here we continue to use the Pima Indians Diabetes Database dataset.

Your task is as follows:

1. Scale the attributes in the dataset using *StandardScaler*. Please don't apply it to the labels.
2. Create an instance of the SVC classifier without setting any constraint.
3. Divide the dataset into two parts: a training set and a test set using the train_test_split method. Assign 30% of the dataset to the test set. 
  Please **turn off** shuffling the data.
4. Fit the model using the training set.
5. Report precision score (0.1 pt), recall score (0.1 pt), and F1-score (0.1 pt) using the test set, and save it into the variable called *recall_score_svc, precision_score_svc, f1_score_svc*. You can find out the information about the performance measures [here](https://scikit-learn.org/stable/modules/model_evaluation.html#precision-recall-f-measure-metrics). We require you to calculate the scores using the following functions: *precision_score, recall_score, f1_score*.


* Scaling the data can affect the results and also affect your scores. Please be careful to follow the instruction.
* There is no partial point except for the ones mentioned in part 5. There is also no any points if the result is incorrect, so you should correctly solve parts 1-4 as well.

In [30]:
from sklearn.metrics import recall_score
from sklearn.metrics import precision_score
from sklearn.metrics import f1_score

X = diabetes.drop('Outcome',axis=1)
y = diabetes.iloc[:, -1]
clf = SVC()
sd = StandardScaler()
X = sd.fit_transform(X)
# didn't mention random state in instruction
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.30,shuffle=False)
clf.fit(X_train,y_train)

y_pred = clf.predict(X_test) 

recall_score_svc = recall_score(y_test, y_pred) 
precision_score_svc = precision_score(y_test, y_pred)
f1_score_svc = f1_score(y_test, y_pred)

In [31]:
recall_score_svc

0.5569620253164557

In [32]:
precision_score_svc

0.7333333333333333

In [33]:
f1_score_svc

0.6330935251798561

#### Task 6: AUC / AUPRC (graded, 0.2 pt)

You will evaluate the random forest and the support vector machine classifier with various performance measures related to the ROC curve, such as the area under the ROC curve (AUC) and rea under the precision-recall curve (AUPRC).

Your task is as follows:

1. Create an instance of a random forest classifier without setting any constraint. Don't forget to set the random state to our value RANDOM_STATE.
2. Divide the dataset into two parts: a training set and a test set using the train_test_split method. Assign 30% of the dataset to the test set. As the method will shuffle the data, please again set the random state to our value RANDOM_STATE.
3. Fit the model using the training set. Please note that we no longer use scalaed dataset used in the previous task. Use the original dataset here.
4. Report AUC (0.1 pt) and AUPRC (0.1 pt) using the test set, and save it into the variable called *auc_rf, auprc_rf*. You can find out the information about the performance measures [here](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html#sklearn.metrics.roc_auc_score) for AUC score, and [here](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.average_precision_score.html) for AUPRC score. AUPRC has many names, and it is supported as *average precision score* in scikit-learn. We require you to calculate the scores using the following functions: *roc_auc_score, average_precision_score*.

In [34]:
from sklearn.metrics import roc_auc_score
from sklearn.metrics import average_precision_score
X = diabetes.drop('Outcome',axis=1)
y = diabetes.iloc[:, -1]
rf_clf = RandomForestClassifier(random_state=RANDOM_STATE) 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.30, random_state=RANDOM_STATE,shuffle=True)
rf_clf.fit(X_train,y_train)
pred_y = rf_clf.predict(X_test)

auc_rf =  roc_auc_score(y_test, pred_y) # CHANGE IT
auprc_rf = average_precision_score(y_test, pred_y) # CHANGE IT

In [35]:
auc_rf

0.7649557829027224

In [36]:
auprc_rf

0.5780967890556932

#### Task 7: Apply them together with scikit-learn (graded, 0.5 pt)

Here you will try to apply the grid search using the performance measures you have tried on Task 5 and Task 6, and pick the best performing model in terms of specific performance measures.

Our dataset is imbalanced, meaning that the healthy patient is dominant. Therefore, we can expect that the best model can be different, and we may also need to use AUPRC to get the most suitable model. 

Your task is as follows:

1. Create an instance of a kNN classifier without setting any constraint. 
2. Run grid search with a dictionary stating n_neighbors from 1 to 10, and use two different scoring measures: AUPRC (average_precision) and F1-score (f1). So you may want to run two different grid-search.
3. Put the best classifiers into the respective variable called *auprc_best_classifier* (0.25 pt) and *f1_best_classifier* (0.25 pt). Set cv=5 for grid search cross-validation. Since grid search uses stratified k-fold inside, you should put the complete dataset, not split training set.


* Unfortunately, there is no further partial point for each subtask, so please be careful to read the instruction.

In [37]:
X = diabetes.drop('Outcome',axis=1)
y = diabetes.iloc[:, -1]
#X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.30, random_state=RANDOM_STATE,shuffle=True)

from sklearn.neighbors import KNeighborsClassifier
clf_knn = KNeighborsClassifier()

param_grid_ = [{'n_neighbors': [1,2,3,4,5,6,7,8,9,10]}]

search_auprc = GridSearchCV(clf_knn,param_grid_,cv=5,scoring='average_precision')
search_auprc.fit(X, y)

search_f_one = GridSearchCV(clf_knn,param_grid_,cv=5,scoring='f1')
search_f_one.fit(X, y)

#svm_best_classifier = search_.best_estimator_ # CHANGE IT

auprc_best_classifier = search_auprc.best_estimator_
f1_best_classifier = search_f_one.best_estimator_

In [38]:
auprc_best_classifier

KNeighborsClassifier(n_neighbors=10)

In [39]:
f1_best_classifier

KNeighborsClassifier(n_neighbors=9)

#### Task 8: Task 5 implementation (graded, advanced, 1 pt)

This extra task requires you to implement the following performance measures:
 - Accuracy (0.25 pt)
 - Precision (0.25 pt)
 - Recall (0.25 pt)
 - F1-score (0.25 pt)
 
All inputs will be the NumPy arrays, so you can use any NumPy array methods to calculate the scores.

In [40]:
def accuracy_manual(predicted, truth):
    # Write a logic and return accuracy
    length = len(truth)
    tp = 0
    tn = 0
    fn = 0
    fp = 0
    
    index = np.arange(length)
    for i in index:
        if predicted [i]==1 and truth [i] == 1:
            tp += 1
        elif predicted [i]==0 and truth [i] == 0:
            tn += 1
        elif predicted [i]==1 and truth [i] == 0:
            fp += 1
        elif predicted [i]==0 and truth [i] == 1:
            fn += 1
    
                

    return (tp+tn)/(tp+tn+fp+fn)

In [41]:
def precision_manual(predicted, truth):
    # Write a logic and return precision
    length = len(truth)
    tp = 0
    tn = 0
    fn = 0
    fp = 0
    
    index = np.arange(length)
    for i in index:
        if predicted [i]==1 and truth [i] == 1:
            tp += 1
        elif predicted [i]==0 and truth [i] == 0:
            tn += 1
        elif predicted [i]==1 and truth [i] == 0:
            fp += 1
        elif predicted [i]==0 and truth [i] == 1:
            fn += 1

    return tp/(tp+fp)

In [42]:
def recall_manual(predicted, truth):
    # Write a logic and return recall
    length = len(truth)
    tp = 0
    tn = 0
    fn = 0
    fp = 0
    
    index = np.arange(length)
    for i in index:
        if predicted [i]==1 and truth [i] == 1:
            tp += 1
        elif predicted [i]==0 and truth [i] == 0:
            tn += 1
        elif predicted [i]==1 and truth [i] == 0:
            fp += 1
        elif predicted [i]==0 and truth [i] == 1:
            fn += 1
    return tp/(tp+fn)

In [43]:
def f1_score_manual(predicted, truth):
    # Write a logic and return f1 score
    length = len(truth)
    tp = 0
    tn = 0
    fn = 0
    fp = 0
    
    index = np.arange(length)
    for i in index:
        if predicted [i]==1 and truth [i] == 1:
            tp += 1
        elif predicted [i]==0 and truth [i] == 0:
            tn += 1
        elif predicted [i]==1 and truth [i] == 0:
            fp += 1
        elif predicted [i]==0 and truth [i] == 1:
            fn += 1
    return (2*tp)/(2*tp+fp+fn)

If you complete the method, you can run the following line to check whether your functions are correct or not. Note that we will evaluate your functions with different data, so please still be careful to implement them.

In [44]:
check_scores(accuracy_manual, precision_manual, recall_manual, f1_score_manual)

1. Accuracy test
Correct!
2. Precision test
Correct!
3. Recall test
Correct!
4. F1-score test
F1 test failed! You should have got 0.6666666666666667 but you got 0.6666666666666666


# 3. Deployment

You will learn how to pick the best model using cross-validation and deploy the best model as a file. This task will only be graded if you intend to do an extra task (HW 3.3), as loading the best model from this lab is one of the requirements of the next extra assignment. This part will be taught in Lab 5.

Your task is as follows:

1. Scale the attributes in the dataset using *StandardScaler*.
2. Create an instance of an SVC classifier without setting any constraint.
3. Run grid search with a dictionary stating a list of C values [1, 10, 100], and classifiers {'linear', 'poly', 'rbf'}. When examining 'poly' kernel, please also find the best classifier by testing degree = [2,3,4]. You may need to make more than one dictionary. Please use **AUPRC** as its scoring measure. Set cv=5 for grid search cross-validation. Since grid search uses stratified k-fold inside, you should put the complete dataset, not split training set.
4. Save the best classifier into the variable called *svm_best_classifier_2* and train this classifier using the whole dataset we have using the fit method inside the best classifier returned by grid search.
5. Save the trained model using pickle and use this model as your deployed model for the Dash visualization. Detailed instruction can be found in Lab 5.

Completing only this task will not be graded. To get one extra point, you need to use the **best model** from this task to show the **Dash** application. We will check the following points:

 1) Whether the student successfully finds out the best classifier by following the instruction correctly.
 
 2) Whether the student deploys the model successfully using the Dash framework with the given dataset.
 

- It is highly recommmended to finish Lab 5 first before starting this section. You need to modify the files provided in Lab 5 (dash_example_web, helper_dash_example) to be appropriate for this task (that does not explicitly require the knowledge on HTML/Web programming), to handle different dataset having different columns and target label. 
- You do not need to change all the appearance but it should work with the new dataset in this homework and the best model derived in this task. It means that the deployed website should classify the new user input.

In [45]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix
import pickle

X = diabetes.drop('Outcome',axis=1)
y = diabetes.iloc[:, -1]
clf = SVC()
# do not have to standard
#sd = StandardScaler()
#X = sd.fit_transform(X)

param_grid = [
  {'C': [1, 10, 100], 'kernel': ['linear','rbf']},
  {'C': [1, 10, 100], 'degree': [2,3,4], 'kernel': ['poly']},
 ]

search = GridSearchCV(clf,param_grid,cv=5,scoring='average_precision')
search.fit(X, y)

svm_best_classifier_2 = search.best_estimator_ # CHANGE IT
print(svm_best_classifier_2)
#pred = svm_best_classifier_2.predict(X)
#print(classification_report(y,pred))


FOLDER_PATH = "deployment/"
trained_model_filename = FOLDER_PATH + "model.pickle"

data_to_save = svm_best_classifier_2
file_path = trained_model_filename

with open(file_path, "wb") as writeFile:
    pickle.dump(data_to_save, writeFile)


SVC(C=100, degree=2, kernel='poly')
