<a href="https://colab.research.google.com/github/taegeonyu/HDS-5230-07/blob/main/Week10/retraining.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from imblearn.under_sampling import RandomUnderSampler
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score

In [2]:
# load dataset
# no dataset was specified, so this uses a public diabetes dataset rather than reusing the assignment data
data = pd.read_csv("https://raw.githubusercontent.com/plotly/datasets/master/diabetes.csv")
df = data.copy()

In [3]:
# separate predictors and target variable
X = df.drop('Outcome', axis=1)
y = df['Outcome']

In [4]:
# train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=0.2, stratify=y)

In [5]:
# function to find optimal k
def find_optimal_k(X_train, y_train):
    k_range = range(1, 11)
    scores = []

    for k in k_range:
        knn = KNeighborsClassifier(n_neighbors=k, weights='distance')
        score = cross_val_score(knn, X_train, y_train, cv=5, scoring='accuracy')
        print(f'The average accuracy score for k = {k} is {score.mean()}.')
        scores.append(score.mean())

    best_k = k_range[np.argmax(scores)]
    print('-' * 50)
    print(f'The optimal number of k is {best_k}')
    return best_k

find_optimal_k(X_train, y_train)

The average accuracy score for k = 1 is 0.6955751032920165.
The average accuracy score for k = 2 is 0.6955751032920165.
The average accuracy score for k = 3 is 0.7346394775423164.
The average accuracy score for k = 4 is 0.720005331200853.
The average accuracy score for k = 5 is 0.744368919099027.
The average accuracy score for k = 6 is 0.7411302145808343.
The average accuracy score for k = 7 is 0.7394908703185393.
The average accuracy score for k = 8 is 0.7444089031054245.
The average accuracy score for k = 9 is 0.7525256564041051.
The average accuracy score for k = 10 is 0.7525523124083698.
--------------------------------------------------
The optimal number of k is 10


10
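
The selection step can be checked by hand from the scores printed above. The sketch below copies those ten cross-validation means into a plain dictionary (the name `cv_means` is introduced here for illustration) and picks the k with the highest mean, using the same first-maximum rule as `np.argmax`. It also computes the gap between the top two scores, which is small enough that the choice of k = 10 over k = 9 is probably within fold-to-fold noise.

```python
# Mean 5-fold accuracies copied from the output above (k -> score).
cv_means = {
    1: 0.6955751032920165,
    2: 0.6955751032920165,
    3: 0.7346394775423164,
    4: 0.720005331200853,
    5: 0.744368919099027,
    6: 0.7411302145808343,
    7: 0.7394908703185393,
    8: 0.7444089031054245,
    9: 0.7525256564041051,
    10: 0.7525523124083698,
}

# Same rule as np.argmax: the highest score wins, the first one on ties.
best_k = max(cv_means, key=cv_means.get)

# Gap between the best and the second-best mean score.
ranked = sorted(cv_means.values(), reverse=True)
margin = ranked[0] - ranked[1]

print(f"best k = {best_k}, margin over runner-up = {margin:.6f}")
```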

In [6]:
# function to compute model performance metrics on a given dataset
def model_performance_checker(model, predictors, target):
    pred = model.predict(predictors)
    acc = accuracy_score(target, pred)
    precision = precision_score(target, pred)
    recall = recall_score(target, pred)
    f1 = f1_score(target, pred)

    perf_df = pd.DataFrame({
        'Accuracy': [acc],
        'Precision': [precision],
        'Recall': [recall],
        'F1': [f1]
    })
    return perf_df

In [7]:
# define and fit the KNN model with the optimal k found above (k = 10)
knn = KNeighborsClassifier(n_neighbors=10, weights='distance').fit(X_train, y_train)

In [8]:
# check model performance
model_performance_checker(knn, X_test, y_test)

   Accuracy  Precision  Recall        F1
0  0.694805   0.574468     0.5  0.534653


* Overall performance is weak: accuracy is about 0.69 and recall is 0.50, so the model misses half of the positive cases.
* Let's write a function that retrains the model to see whether it makes a difference.

In [9]:
# copy data to use for new training and test sets
df2 = df.copy()

In [10]:
# train_test_split
X2 = df2.drop('Outcome', axis=1)
y2 = df2['Outcome']

X_train2, X_test2, y_train2, y_test2 = train_test_split(X2, y2, random_state=42, test_size=0.2, stratify=y2)

In [11]:
# check for class imbalance in the training target (proportions, not counts)
y_train2.value_counts(normalize=True)

Outcome  proportion
0        0.651466
1        0.348534


In [13]:
# scale the variables: fit the scaler on the training set only, then apply it to the test set
scale_cols = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction']  # variables for scaling
scaler = StandardScaler()
X_train2[scale_cols] = scaler.fit_transform(X_train2[scale_cols])
X_test2[scale_cols] = scaler.transform(X_test2[scale_cols])
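
`StandardScaler` learns a mean and a standard deviation per column from the training rows and reuses them on the test rows, which is why the test set gets `transform` rather than `fit_transform`. A minimal pure-Python sketch of that idea follows. The glucose values are made up, and the helper names `fit_scaler` and `apply_scaler` are illustrative and not part of the notebook.

```python
from statistics import fmean, pstdev

def fit_scaler(values):
    """Return the mean and population standard deviation of the training values."""
    return fmean(values), pstdev(values)

def apply_scaler(values, mean, std):
    """Standardize values with statistics learned elsewhere: z = (x - mean) / std."""
    return [(v - mean) / std for v in values]

# Made-up glucose readings, for illustration only.
train_glucose = [85.0, 100.0, 115.0, 130.0, 145.0]
test_glucose = [90.0, 160.0]

mean, std = fit_scaler(train_glucose)                # learned from training data only
train_scaled = apply_scaler(train_glucose, mean, std)
test_scaled = apply_scaler(test_glucose, mean, std)  # reuses the training statistics

print(train_scaled)
print(test_scaled)
```

After scaling, the training column has mean 0 and standard deviation 1. The test column generally does not, because it is measured against the training statistics.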

In [14]:
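# undersample the majority class so both classes have the same number of training rows
# (no random_state is set, so the sampled rows can differ between runs)
under = RandomUnderSampler(sampling_strategy='auto')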
X_train2, y_train2 = under.fit_resample(X_train2, y_train2)

In [18]:
# check the class balance again after undersampling
y_train2.value_counts(normalize=True)

Outcome  proportion
0        0.5
1        0.5


* Now the training data is balanced.
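
Random undersampling keeps every minority-class row and draws an equally sized random subset of the majority class. The sketch below shows the idea in pure Python on made-up labels. It is not imblearn's implementation, and the helper `undersample` and its fixed seed are assumptions added for illustration.

```python
import random
from collections import Counter

def undersample(labels, seed=0):
    """Return sorted row indices with every class cut down to the minority-class size."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    n_min = min(len(rows) for rows in by_class.values())
    kept = []
    for rows in by_class.values():
        kept.extend(rng.sample(rows, n_min))  # sample without replacement
    return sorted(kept)

# Made-up labels with roughly the 65/35 split seen in the training set.
labels = [0] * 13 + [1] * 7
kept = undersample(labels)
print(Counter(labels[i] for i in kept))  # Counter({0: 7, 1: 7})
```

Fixing the seed makes the kept subset reproducible. The notebook's `RandomUnderSampler` call does not set `random_state`, so its sampled rows, and therefore the downstream scores, can vary between runs.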

In [15]:
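# function to retrain the KNN model, using GridSearchCV to pick the best hyperparameters
def retrain_knn(X_train, y_train):

    # note: 'euclidean' and 'manhattan' ignore p, and 'minkowski' with p=2 or p=1
    # gives those same two distances, so several grid entries are equivalent
    param_grid = {
        'n_neighbors': range(1, 11),
        'p': [1, 2],
        'metric': ['minkowski', 'euclidean', 'manhattan']
    }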

    knn = KNeighborsClassifier()

    grid_search = GridSearchCV(estimator = knn, param_grid = param_grid, cv = 5, scoring = 'accuracy', n_jobs = -1)

    grid_search.fit(X_train, y_train)
    best_knn = grid_search.best_estimator_


    print("Best hyperparameters found by GridSearchCV:")
    print(grid_search.best_params_)

    return best_knn

In [16]:
best_knn = retrain_knn(X_train2, y_train2)

Best hyperparameters found by GridSearchCV:
{'metric': 'minkowski', 'n_neighbors': 9, 'p': 2}


In [17]:
# check model performance
model_performance_checker(best_knn, X_test2, y_test2)

   Accuracy  Precision    Recall        F1
0  0.707792   0.569231  0.685185  0.621849


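* Retraining with scaling, undersampling, and a grid search improved recall (0.50 to 0.69) and F1 (0.53 to 0.62), but accuracy rose only slightly (0.69 to 0.71) and precision stayed at about 0.57, so the overall improvement is modest.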
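
The two result tables can be cross-checked by hand. The confusion-matrix counts below are not printed anywhere in the notebook. They are reconstructed on the assumption that the test set holds 154 rows with 54 positives, and they reproduce the reported metrics to six decimals. The helper name `metrics_from_counts` is introduced here for illustration.

```python
def metrics_from_counts(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"Accuracy": accuracy, "Precision": precision, "Recall": recall, "F1": f1}

# Reconstructed counts (an assumption, see the note above), not notebook output.
baseline = metrics_from_counts(tp=27, fp=20, fn=27, tn=80)
retrained = metrics_from_counts(tp=37, fp=28, fn=17, tn=72)

for name in baseline:
    change = retrained[name] - baseline[name]
    print(f"{name:<9} {baseline[name]:.6f} -> {retrained[name]:.6f} ({change:+.6f})")
```

Under these reconstructed counts, the retrained model finds 10 more true positives (37 vs. 27) at the cost of 8 more false positives (28 vs. 20), which is why recall rises while precision stays flat.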