The data is from the UCI Machine Learning Repository HCV dataset: https://archive.ics.uci.edu/ml/datasets/HCV+data

This notebook classifies subjects with K-Nearest Neighbours (KNN) and uses a random forest classifier for feature selection.

# Library Loading

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier as rf_clf
from sklearn.model_selection import RandomizedSearchCV as randomCV
from sklearn.model_selection import train_test_split
from sklearn.utils import class_weight
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.metrics import f1_score
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Data Loading

This section will load the data into Jupyter notebook and display the dataframe. 

In [None]:
hcv_df=pd.read_csv(r'../input/hcv-data-data-set/hcvdat0.csv')
hcv_df

The dataframe above has 615 instances and 14 columns.

# Data Cleansing

This section checks the data for missing values and inconsistent data types.

In [None]:
hcv_df_cp=hcv_df.copy()
np.unique(np.ravel(hcv_df[["Category"]]))

In [None]:
# Exact-match replacement (no regex), then cast the codes to integers
hcv_df["Category"]=hcv_df["Category"].replace(
    {'0=Blood Donor':0,
    '0s=suspect Blood Donor':1,
    '1=Hepatitis':2,
    '2=Fibrosis':3,
    '3=Cirrhosis':4}).astype(int)

hcv_df["Sex"]=hcv_df["Sex"].replace(
    {'m':0,
    'f':1}).astype(int)

To make the categorical data easier to work with, each categorical value is converted into a number so that it can be fed into a statistical model.
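
As a small illustration of the encoding step, the sketch below maps labels to integer codes with `Series.map` instead of `replace`. The frame `toy` and its rows are made up for the example; only the label strings and codes come from the cell above.

```python
import pandas as pd

# Toy frame using label strings from the dataset; the rows are illustrative.
toy = pd.DataFrame({
    "Category": ["0=Blood Donor", "0s=suspect Blood Donor", "3=Cirrhosis"],
    "Sex": ["m", "f", "m"],
})

category_codes = {
    "0=Blood Donor": 0,
    "0s=suspect Blood Donor": 1,
    "1=Hepatitis": 2,
    "2=Fibrosis": 3,
    "3=Cirrhosis": 4,
}
sex_codes = {"m": 0, "f": 1}

# map() does an exact lookup per value; labels missing from the dict become NaN,
# which makes unexpected labels easy to spot.
toy["Category"] = toy["Category"].map(category_codes)
toy["Sex"] = toy["Sex"].map(sex_codes)
print(toy)
```

Compared with `replace`, `map` never leaves an unmapped string in place, so a typo in a label shows up as a missing value rather than passing through silently.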

In [None]:
hcv_df.describe()

In [None]:
hcv_df.isna().sum()

Looking at the NA counts, ALP and CHOL have several missing values, while ALB, ALT and PROT have 1 each. Each missing value is therefore replaced with the median of that variable within the subject's category.

In [None]:
hcv_replace_val=\
hcv_df.loc[:,[ 'ALB','ALP', 'ALT','CHOL','PROT',"Category"]].groupby("Category").agg(["median"])
hcv_replace_val

In [None]:

for col in [ 'ALB','ALP', 'ALT','CHOL','PROT']:
    for cat in hcv_replace_val.index:
        # Rows of this category where the column is missing
        na_mask=hcv_df[col].isna() & (hcv_df["Category"]==cat)
        # Fill with the tabulated category median as-is (no integer truncation)
        hcv_df.loc[na_mask,col]=hcv_replace_val.loc[cat,(col,"median")]


Using the medians tabulated above, each missing value is replaced with the median of its category.
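
The same per-category median imputation can also be written without an explicit loop. The sketch below is one possible alternative on made-up data (the frame `toy` and its values are illustrative), using `groupby(...).transform("median")` together with `fillna`.

```python
import numpy as np
import pandas as pd

# Toy data: two categories, one missing ALP value in each (values are illustrative).
toy = pd.DataFrame({
    "Category": [0, 0, 0, 4, 4, 4],
    "ALP": [50.0, 70.0, np.nan, 100.0, np.nan, 140.0],
})

# transform("median") returns, for every row, the median of that row's category
# (missing values are skipped when the median is computed).
group_median = toy.groupby("Category")["ALP"].transform("median")

# fillna with a Series fills each missing entry from the matching index position.
toy["ALP"] = toy["ALP"].fillna(group_median)
print(toy)
```

Here the missing value in category 0 becomes 60.0 (the median of 50 and 70) and the one in category 4 becomes 120.0 (the median of 100 and 140).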

In [None]:
hcv_df[[ 'ALB','ALP', 'ALT','CHOL','PROT']].describe()

Replacing the NA values in ALP and CHOL with medians causes the mean values of ALP and CHOL to decrease slightly.

# Data Exploration

This section will do simple data visualisation such as histogram and correlation matrix to understand the data in terms of distribution and relationship between variables. 

In [None]:
hcv_df.hist(figsize=(10,10))
plt.show()

In [None]:
hcv_df.drop("Unnamed: 0",axis=1,inplace=True)

Looking at the histograms, class 0 has by far the largest count of the 5 classes. The ID column (Unnamed: 0) only identifies the subjects and carries no information, so it is dropped. ALP, ALT, AST, BIL, CREA and GGT are skewed to the right: they have long right tails and some extremely large values.
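
Right skew can also be checked numerically rather than by eye. The sketch below uses synthetic columns (a log-normal and a normal sample, both invented for the example) and `DataFrame.skew()`; on the notebook's data the same call could be applied to the laboratory columns.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(48)

# Synthetic stand-ins: a log-normal sample (long right tail) and a normal sample.
toy = pd.DataFrame({
    "right_skewed": rng.lognormal(mean=3.0, sigma=0.8, size=1000),
    "symmetric": rng.normal(loc=40.0, scale=5.0, size=1000),
})

# Sample skewness: clearly positive for a long right tail, near zero for symmetric data.
skewness = toy.skew()
print(skewness)
```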

In [None]:
cls_wg=dict(hcv_df["Category"].value_counts())
print(cls_wg)

Looking at the counts for each class, class 0 (Blood donor) has the highest count, followed by class 4 (Cirrhosis), class 2 (Hepatitis) and class 3 (Fibrosis). Class 1 (Suspect blood donor) has the lowest count.

In [None]:
corr = hcv_df.corr()
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True
f, ax = plt.subplots(figsize=(12, 10))
cmap = sns.diverging_palette(220, 10, as_cmap=True)
sns.heatmap(corr, mask=mask,cmap=cmap, vmax=.9, center=0, square=True, linewidths=.5, annot=True,cbar_kws={"shrink": .5})
plt.show()

Based on the correlation matrix above, a few pairs of variables are correlated, such as AST with GGT, ALB with PROT, and CHOL with CHE. However, the correlations are moderate enough to be acceptable.

# Data Split for Train and Test Sets

This section will split data into 2 sets: train set for model fitting and test set for model validation. 

In [None]:
hcv_X=hcv_df.copy()
hcv_X.drop(["Category"],axis=1,inplace=True)

hcv_Y=hcv_df[["Category"]].copy()

hcv_train_X,hcv_test_X,hcv_train_Y,hcv_test_Y=\
train_test_split(hcv_X,hcv_Y,test_size=0.20,random_state=48)

In [None]:
print(hcv_train_X.shape)
print(hcv_test_X.shape)

Train dataset has 492 instances while test dataset has 123 instances. 

# Hyperparameter Tuning for Random Forest Classifier Using RandomCV

This section uses randomised search with cross-validation (RandomizedSearchCV in scikit-learn) to find a good set of hyperparameters for a random forest classifier. Besides classifying, a random forest can support feature selection: it assigns each feature an importance weight based on how much the splits on that feature reduce impurity (Gini impurity or entropy), averaged over all trees.
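
To make the idea of importance weights concrete, here is a minimal sketch on synthetic data generated with `make_classification` (the data and the variable names are assumptions for illustration, not the HCV data). It fits a small forest and ranks the features by `feature_importances_`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: 6 features, of which 3 are informative (illustrative only).
X, y = make_classification(n_samples=300, n_features=6, n_informative=3,
                           n_redundant=0, n_classes=3, n_clusters_per_class=1,
                           random_state=48)

forest = RandomForestClassifier(n_estimators=100, max_depth=3, random_state=48)
forest.fit(X, y)

# Impurity-based importances: one non-negative weight per feature, normalised to sum to 1.
importances = forest.feature_importances_
ranking = np.argsort(importances)[::-1]  # feature indices, most important first
for idx in ranking:
    print(f"feature {idx}: {importances[idx]:.3f}")
```

Because the weights sum to 1, each one can be read as a share of the total impurity reduction, which is how the percentages later in this notebook are interpreted.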

In [None]:
def rf_classifier(min_sample_split_in,min_sample_leaf_in,no_trees,max_features_in,score_criteria):
    rf_grid={"min_samples_split":min_sample_split_in,"min_samples_leaf":min_sample_leaf_in,
            "n_estimators":no_trees,"max_features":max_features_in}
    clf = rf_clf(max_depth=3, random_state=48,criterion="gini")
    rf_clf_cv = randomCV(clf, rf_grid, random_state=48,scoring=score_criteria,cv=5,return_train_score=True)
    return rf_clf_cv

To limit overfitting, each tree in the random forest is restricted to a maximum depth of 3. Splits use Gini impurity, which is scikit-learn's default criterion. The target has 5 classes, and Gini impurity and entropy usually lead to similar trees, so this choice is unlikely to matter much.
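
For reference, the Gini impurity of a node is 1 minus the sum of the squared class proportions in that node. The helper below (`gini_impurity` is a name chosen for this example) is a small worked illustration, not scikit-learn's internal implementation.

```python
import numpy as np

def gini_impurity(labels):
    """Gini impurity 1 - sum(p_k^2) of a 1-D array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    proportions = counts / counts.sum()
    return 1.0 - np.sum(proportions ** 2)

pure_node = np.array([0, 0, 0, 0])      # one class only
mixed_node = np.array([0, 0, 4, 4])     # two classes, evenly split

print(gini_impurity(pure_node))   # 0.0
print(gini_impurity(mixed_node))  # 0.5
```

A split is preferred when it produces child nodes whose weighted impurity is lower than that of the parent node.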

In [None]:
min_split=np.arange(2,40,5)
min_leaf=np.arange(2,40,10)
n_trees=np.arange(100,350,50)
max_features=np.arange(3,11,1)

rf_clf_model=rf_classifier(min_sample_split_in=min_split,min_sample_leaf_in=min_leaf,max_features_in=max_features,
                           no_trees=n_trees,score_criteria="f1_weighted")

The hyperparameters searched are the minimum number of samples required to split a node, the minimum number of samples per leaf, the number of trees and the maximum number of features considered at each split. As some classes have low counts, the ranges for the minimum leaf size and the minimum split size start at 2.

Because the class sizes are imbalanced, the weighted F1 score is used: the F1 score is calculated for each class and then averaged, with each class weighted by its share of the instances.
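
The weighting can be verified by hand. The sketch below uses made-up labels (`y_true` and `y_pred` are illustrative) and compares a manual support-weighted average of the per-class F1 scores with `f1_score(..., average="weighted")`.

```python
import numpy as np
from sklearn.metrics import f1_score

# Illustrative labels with one dominant class.
y_true = np.array([0, 0, 0, 0, 0, 0, 2, 2, 4, 4])
y_pred = np.array([0, 0, 0, 0, 0, 2, 2, 0, 4, 4])

per_class_f1 = f1_score(y_true, y_pred, average=None)   # one F1 per class, sorted by label
support = np.bincount(y_true)[np.unique(y_true)]        # number of true instances per class

# Weighted F1 = sum over classes of (support / total) * F1 for that class.
manual_weighted = np.sum(per_class_f1 * support / support.sum())
sklearn_weighted = f1_score(y_true, y_pred, average="weighted")
print(manual_weighted, sklearn_weighted)
```

Note that the dominant class contributes most of the score, so a high weighted F1 can coexist with poor performance on the rare classes.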

In [None]:
rf_clf_model.fit(hcv_train_X,np.ravel(hcv_train_Y))

In [None]:
print("Best set of parameters:",rf_clf_model.best_params_)
print("Weighted F1 score for best set of parameters:",rf_clf_model.best_score_)

Based on the CV results, the random forest classifier achieves a weighted F1 score of about 91%, which is quite good. The best configuration found uses 150 trees, a minimum of 7 samples to split a node, a minimum of 2 samples per leaf, and 9 features considered at each split.

# Feature Selection Using Random Forest Classifier

This section refits the random forest classifier with the best parameter set and ranks the features by the importance weights that the classifier calculates.

In [None]:
rf_clf_best=rf_clf(n_estimators=150, min_samples_split=7, min_samples_leaf=2, max_features=9,max_depth=3, random_state=48,
                   criterion="gini")
rf_clf_best.fit(hcv_train_X,np.ravel(hcv_train_Y))

In [None]:
rf_train_y=rf_clf_best.predict(hcv_train_X)
rf_test_y=rf_clf_best.predict(hcv_test_X)

In [None]:
def cf_mat(data_Y_actual,data_Y_pred,title,f1_average):
    cm_grid=confusion_matrix(data_Y_actual,data_Y_pred)
    cm_grid_display=ConfusionMatrixDisplay(confusion_matrix=cm_grid)
    cm_grid_display.plot()
    plt.title(title)
    plt.show()
    print("Average F1 score for all classes:",f1_score(data_Y_actual,data_Y_pred,average=f1_average).mean())

The function above plots a confusion matrix and prints the F1 score below the plot.

In [None]:
cf_mat(hcv_train_Y,rf_train_y,title="Confusion Matrix Based on Train HCV Data",f1_average="weighted")

15 subjects are classified as blood donors even though they are suspect blood donors (status pending confirmation) or have a condition such as Hepatitis or Fibrosis.

In [None]:
cf_mat(hcv_test_Y,rf_test_y,title="Confusion Matrix Based on Test HCV Data",f1_average="weighted")

On the test set, the model is able to clearly separate subjects with Fibrosis and Cirrhosis from blood donors.

Based on the confusion matrices above, the weighted F1 score is around 95% on the train set and 89% on the test set. The gap suggests some overfitting, as the model predicts new data less well. However, the random forest is used here only to rank the features by importance, so the overfitting is tolerated.

In [None]:
feature_impt=dict(zip(list(hcv_train_X.columns),list(rf_clf_best.feature_importances_)))
feature_impt=dict(sorted(feature_impt.items(), key=lambda item: item[1],reverse=True))
feature_impt

Looking at the feature importances above, AST, ALP, ALT, CHE and ALB are the top 5 features; all other features have weights below 6%. Features with weights below 6% may contribute little to distinguishing whether subjects are suitable blood donors.

In [None]:
final_feature_list=list(feature_impt.keys())

All features, ordered by decreasing importance, are stored in a list for later use.

# Hyperparameter Tuning for K-Nearest Neighbors Classifier Using RandomCV

This section explores the K-Nearest Neighbours (KNN) classifier. Before fitting the model, randomised search with cross-validation is used to find a good set of hyperparameters. Neighbours are weighted by distance, meaning that closer neighbours get a larger weight in the vote (the weight is the inverse of the distance). Distance is measured with the Minkowski metric, which with scikit-learn's default p=2 is the Euclidean distance.
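
The effect of distance weighting can be shown with a hand-rolled vote. The function `distance_weighted_vote` below is a hypothetical helper written for this example (it is not scikit-learn's code) and assumes all distances are strictly positive.

```python
import numpy as np

def distance_weighted_vote(distances, labels):
    """Return the label with the largest sum of 1/distance weights.

    Assumes all distances are strictly positive.
    """
    weights = 1.0 / np.asarray(distances, dtype=float)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    totals = np.array([weights[labels == c].sum() for c in classes])
    return classes[np.argmax(totals)]

# Four neighbours: one very close neighbour of class 4, three farther ones of class 0.
distances = [0.5, 2.0, 2.5, 4.0]
labels = [4, 0, 0, 0]

print(distance_weighted_vote(distances, labels))  # 4
```

A plain majority vote would pick class 0 here (three neighbours against one), whereas the inverse-distance weights give class 4 a total of 2.0 against 1.15 for class 0.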

In [None]:
def knn_classifier(neighbors,leaf_size_in,score_criteria):
    knn_grid={"n_neighbors":neighbors,"leaf_size":leaf_size_in}
    clf = KNeighborsClassifier(weights="distance",algorithm="auto",metric="minkowski",n_jobs=-1)
    knn_clf_cv = randomCV(clf, knn_grid, random_state=48,scoring=score_criteria,cv=5,return_train_score=True)
    return knn_clf_cv

The function above wraps a KNN model in a randomised search with cross-validation to find a good set of hyperparameters. The hyperparameters searched are the number of neighbours and the leaf size of the search tree (leaf_size). The leaf size affects the speed and memory use of the neighbour search rather than the predictions themselves.
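
The role of `leaf_size` can be checked empirically. The sketch below uses synthetic data from `make_classification` (an assumption for illustration) and sets `algorithm="kd_tree"` explicitly so that the tree, and therefore the leaf size, is actually used; it then compares the predictions obtained with different leaf sizes.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, n_features=5, n_informative=3,
                           n_redundant=0, random_state=48)
X_fit, y_fit, X_new = X[:250], y[:250], X[250:]

predictions = []
for leaf_size in (1, 25, 39):
    # kd_tree is set explicitly so that leaf_size is actually used.
    knn = KNeighborsClassifier(n_neighbors=4, weights="distance",
                               algorithm="kd_tree", leaf_size=leaf_size)
    knn.fit(X_fit, y_fit)
    predictions.append(knn.predict(X_new))

# The neighbour search is exact, so the tree layout should not change the predictions.
same = all(np.array_equal(predictions[0], p) for p in predictions[1:])
print(same)
```

If the predictions agree, tuning `leaf_size` with a score-based search mostly compares identical models, so its selected value should be read as a speed setting rather than an accuracy setting.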

In [None]:
neighbors_knn_in=np.arange(2,30,2)
leaf_knn_in=np.arange(1,40,2)

knn_cv_model=knn_classifier(neighbors=neighbors_knn_in,leaf_size_in=leaf_knn_in,score_criteria="f1_weighted")


In [None]:
knn_cv_model.fit(hcv_train_X,np.ravel(hcv_train_Y))

In [None]:
print("Best set of parameters:",knn_cv_model.best_params_)
print("Weighted F1 score for best set of parameters:",knn_cv_model.best_score_)

Based on the CV results, the KNN classifier achieves a weighted F1 score of about 91%, which is quite good, although slightly below the random forest classifier. The best KNN uses 4 neighbours to determine the class of each data point and a leaf size of 25, which affects the speed of the neighbour search.

In [None]:
knn_clf=KNeighborsClassifier(n_neighbors= 4, leaf_size=25,
                             weights="distance",algorithm="auto",metric="minkowski",n_jobs=-1)
knn_clf.fit(hcv_train_X,np.ravel(hcv_train_Y))
knn_train_Y=knn_clf.predict(hcv_train_X)
knn_test_Y=knn_clf.predict(hcv_test_X)

In [None]:
cf_mat(hcv_train_Y,knn_train_Y,title="Confusion Matrix Based on Train HCV Data",f1_average="weighted")

In [None]:
cf_mat(hcv_test_Y,knn_test_Y,title="Confusion Matrix Based on Test HCV Data",f1_average="weighted")

Looking at the test set, KNN classified 3 subjects as blood donors even though they are suspect blood donors or have Hepatitis or Cirrhosis.

Furthermore, the large difference in weighted F1 score between the train and test sets suggests that KNN overfits more than the random forest classifier. Part of this gap is expected: with distance weighting, each training point is its own nearest neighbour at distance zero, so the train score is close to 100% by construction.
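
When reading the train score of a distance-weighted KNN, it helps to know that a training point always finds itself as a neighbour at distance zero. The sketch below illustrates this on synthetic, deliberately noisy data (`flip_y=0.2` randomly reassigns a share of the labels); the data and names are assumptions for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, n_features=5, n_informative=3,
                           n_redundant=0, flip_y=0.2, random_state=48)

knn_distance = KNeighborsClassifier(n_neighbors=4, weights="distance").fit(X, y)
knn_uniform = KNeighborsClassifier(n_neighbors=4, weights="uniform").fit(X, y)

# With distance weights, a training point finds itself at distance 0 and its own
# label decides the vote, so the training accuracy is 1.0 when all rows are distinct.
train_acc_distance = knn_distance.score(X, y)
train_acc_uniform = knn_uniform.score(X, y)
print(train_acc_distance, train_acc_uniform)
```

For this reason, cross-validated or test scores are a more informative basis for judging overfitting of a distance-weighted KNN than the train score.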

# KNN Classifier Model Fitting Using Forward Selection to Determine Maximum Number of Features

This section determines how many features KNN should use, keeping the best hyperparameters from the previous section. Features are added one at a time in order of decreasing random forest importance, starting with the most important feature, and the weighted F1 score on the train and test sets is recorded at each step.

In [None]:
f1_train_list=[]
f1_test_list=[]

for i in range(1,len(final_feature_list)+1):
    knn_clf_red_temp=KNeighborsClassifier(n_neighbors= 4, leaf_size=25,
                             weights="distance",algorithm="auto",metric="minkowski",n_jobs=-1)
    knn_clf_red_temp.fit(hcv_train_X.loc[:,final_feature_list[:i]],np.ravel(hcv_train_Y))
    knn_train_y_red_temp=knn_clf_red_temp.predict(hcv_train_X.loc[:,final_feature_list[:i]])
    knn_test_y_red_temp=knn_clf_red_temp.predict(hcv_test_X.loc[:,final_feature_list[:i]])
    f1_train_temp=f1_score(hcv_train_Y,knn_train_y_red_temp,average="weighted")
    f1_test_temp=f1_score(hcv_test_Y,knn_test_y_red_temp,average="weighted")
    f1_train_list.append(f1_train_temp)
    f1_test_list.append(f1_test_temp)

In [None]:
# One row per number of features used; the index starts at 1
f1_feature_sel=pd.DataFrame({"F1_train":f1_train_list,"F1_test":f1_test_list})
f1_feature_sel.index=f1_feature_sel.index+1

In [None]:
sns.lineplot(data=f1_feature_sel)\
.set_title("F1 Score Based on Train and Test Datasets\n Using Different Number of Features")
plt.show()

Looking at the graph above, the first 4 features are sufficient for KNN, although the large gap in weighted F1 score between the train and test sets indicates that the model is overfitting.

In [None]:
knn_clf_red=KNeighborsClassifier(n_neighbors= 4, leaf_size=25,
                             weights="distance",algorithm="auto",metric="minkowski",n_jobs=-1)
knn_clf_red.fit(hcv_train_X.loc[:,final_feature_list[:4]],np.ravel(hcv_train_Y))
knn_train_y_red=knn_clf_red.predict(hcv_train_X.loc[:,final_feature_list[:4]])
knn_test_y_red=knn_clf_red.predict(hcv_test_X.loc[:,final_feature_list[:4]])
cf_mat(hcv_test_Y,knn_test_y_red,title="Confusion Matrix Based on Test HCV Data",f1_average="weighted")

With the reduced feature set, 1 subject with Cirrhosis who was previously classified as a blood donor is now correctly classified as having Cirrhosis. The weighted F1 score also increases by approximately 1%.

# Findings Based on KNN After Feature Selection

In [None]:
hcv_train=hcv_train_X.loc[:,final_feature_list[:4]].copy()
# Add the predictions as a single new column
hcv_train["Pred Categories"]=knn_train_y_red

In [None]:
hcv_train.groupby("Pred Categories").agg(["median","mean","std"])

In [None]:
hcv_test=hcv_test_X.loc[:,final_feature_list[:4]].copy()
# Add the predictions as a single new column
hcv_test["Pred Categories"]=knn_test_y_red

In [None]:
hcv_test.groupby("Pred Categories").agg(["median","mean","std"])

Looking at the tables above for the train and test datasets, in terms of mean and median:
1. subjects with Cirrhosis and Fibrosis have higher AST than the other groups
2. subjects with Cirrhosis and suspect blood donors have higher ALP than the other groups
3. subjects with Cirrhosis have lower ALT than the other groups
4. subjects with Cirrhosis and suspect blood donors have lower CHE than the other groups
5. subjects with Fibrosis have higher ALT than the other groups
6. subjects with Hepatitis have lower ALP than the other groups

Therefore, subjects with Cirrhosis tend to have high AST and ALP but low CHE and ALT, while subjects with Fibrosis tend to have high AST and ALT. Subjects with Hepatitis tend to have low ALP.

In [None]:
fig,ax=plt.subplots(2,2,figsize=(10,10))
sns.scatterplot(data=hcv_train, x="AST", y="Pred Categories",ax=ax[0][0])
sns.scatterplot(data=hcv_train, x="ALP", y="Pred Categories",ax=ax[0][1])
sns.scatterplot(data=hcv_train, x="ALT", y="Pred Categories",ax=ax[1][0])
sns.scatterplot(data=hcv_train, x="CHE", y="Pred Categories",ax=ax[1][1])
plt.show()

Looking at the scatter plots above for the train dataset, subjects predicted to have Hepatitis, Fibrosis or Cirrhosis have higher AST than those predicted to be blood donors or suspect blood donors. Subjects predicted to have Cirrhosis have lower CHE.


In [None]:
fig,ax=plt.subplots(2,2,figsize=(10,10))
sns.scatterplot(data=hcv_test, x="AST", y="Pred Categories",ax=ax[0][0])
sns.scatterplot(data=hcv_test, x="ALP", y="Pred Categories",ax=ax[0][1])
sns.scatterplot(data=hcv_test, x="ALT", y="Pred Categories",ax=ax[1][0])
sns.scatterplot(data=hcv_test, x="CHE", y="Pred Categories",ax=ax[1][1])
plt.show()

The scatter plots above for the test dataset show patterns similar to the train dataset, except that no instances are predicted to have Hepatitis. In the actual data, however, there are 5 instances with Hepatitis.

The performance of KNN improves after feature selection with the random forest classifier, especially on the test dataset. This indicates that KNN generalises better after the feature selection.

# KNN Using Standardised Data

This section standardises the independent variables to determine whether standardisation can improve the performance of KNN even further after feature selection.
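
One common way to apply standardisation is to put the scaler and the classifier in a scikit-learn pipeline, so that the scaling parameters are learned from the train set only and then reused on the test set. The sketch below shows this on synthetic data; the data and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=4, n_informative=3,
                           n_redundant=0, random_state=48)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=48)

# The pipeline fits the scaler on the train data only and reuses those
# means and standard deviations when it transforms the test data.
scaled_knn = make_pipeline(StandardScaler(),
                           KNeighborsClassifier(n_neighbors=4, weights="distance"))
scaled_knn.fit(X_train, y_train)
test_predictions = scaled_knn.predict(X_test)

fitted_scaler = scaled_knn.named_steps["standardscaler"]
print(np.allclose(fitted_scaler.mean_, X_train.mean(axis=0)))
```

A pipeline like this can also be passed to a cross-validated search, in which case the scaler is refitted on the training part of every fold.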

In [None]:
std_scale=StandardScaler()
# Fit the scaler on the train set only, then apply the same scaling to the test set
hcv_train_X_s=pd.DataFrame(std_scale.fit_transform(hcv_train_X),columns=hcv_train_X.columns)
hcv_test_X_s=pd.DataFrame(std_scale.transform(hcv_test_X),columns=hcv_test_X.columns)

In [None]:
knn_clf_red.fit(hcv_train_X_s.loc[:,final_feature_list[:4]],np.ravel(hcv_train_Y))
knn_train_y_red_s=knn_clf_red.predict(hcv_train_X_s.loc[:,final_feature_list[:4]])
knn_test_y_red_s=knn_clf_red.predict(hcv_test_X_s.loc[:,final_feature_list[:4]])
cf_mat(hcv_test_Y,knn_test_y_red_s,title="Confusion Matrix Based on Test HCV Scaled Data",
       f1_average="weighted")

Based on the confusion matrix and weighted F1 score above, standardisation did not improve the performance of KNN in this run; instead, more instances were misclassified.

Therefore, standardisation does not appear to help here once feature selection has been applied, and it may be more useful when the model still performs badly after feature selection. The hyperparameters and the number of features were tuned on unscaled data, so this comparison is indicative rather than conclusive.