Task #1 Red Wine Quality Prediction Using KNN

Task Details

This task is suitable for beginners who have just stepped into the world of data science. It requires you to use the K Nearest Neighbours algorithm to make a prediction on the 'quality' column of the dataset. You are encouraged to explore the different parameters you can work with in your model and understand the importance of data understanding and feature selection.
(Note: Beginners can use the "KNeighborsClassifier" class available under "sklearn.neighbors".)

Expected Submission

The submission must be a Notebook containing the process of Exploratory Data Analysis and making of the model. You can split the data into testing and training data and are required to show how well your model does on the testing data using the 'accuracy' metric.

Evaluation

The aim is to understand the KNN algorithm and its parameters, and evaluation would be based on the accuracy of the model.

In [None]:
#main libraries
import os
import numpy as np
import pandas as pd

#ML libraries
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier

#metrics
from sklearn.metrics import roc_auc_score

#ignore warnings
import warnings
warnings.filterwarnings('ignore')

------------------------------------------------  Reading data  ----------------------------------------------------

In [None]:
#reading data from file
#creating a DataFrame that holds the red wine data

with open("../input/red-wine-quality-cortez-et-al-2009/winequality-red.csv") as red_wine_file:
    red_wine_data = pd.read_csv(red_wine_file, delimiter=',')

#reading data structure information

red_wine_data.info(verbose = True, show_counts = True)

#data example

red_wine_data.head()

---------------------------------- Statistical characteristics of variables  -----------------------------------------

In [None]:
#mean, median, min, max, standard deviation of the 11 input variables

agg_func_list = ['mean', 'median', 'min', 'max', 'std']
columns_agg_func_list = {column: agg_func_list for column in red_wine_data.columns[:11]}
red_wine_data.agg(columns_agg_func_list).round(2)

In [None]:
red_wine_data.hist('quality')

In this example, we are faced with an imbalanced classification problem, so plain accuracy will not reflect the model's true performance. Instead, we use the area under the receiver operating characteristic curve (ROC AUC).
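To make the point about imbalance concrete, here is a minimal sketch on made-up binary labels (90 negatives, 10 positives; these numbers are an assumption for illustration, not the wine data). The helper `binary_roc_auc` is a hypothetical hand-written function based on pairwise score comparison, not sklearn's `roc_auc_score`. A "model" that always predicts the majority class gets high accuracy but a ROC AUC no better than chance.

```python
import numpy as np

def binary_roc_auc(y_true, scores):
    """ROC AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    y_true = np.asarray(y_true, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true]
    neg = scores[~y_true]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

# Hypothetical imbalanced labels: 90 negatives, 10 positives.
y_true = np.array([0] * 90 + [1] * 10)

# A constant "model": always predicts class 0 with the same score.
constant_pred = np.zeros_like(y_true)
constant_score = np.zeros(len(y_true))

baseline_accuracy = (constant_pred == y_true).mean()
baseline_auc = binary_roc_auc(y_true, constant_score)
print(baseline_accuracy, baseline_auc)
```

The accuracy looks good only because one class dominates, while the AUC shows that the constant model cannot rank positives above negatives at all.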

------------------------------------------- Correlation of variables  ----------------------------------------------

In [None]:
#creating the correlation matrix
#'pearson' - standard correlation coefficient method

corr_matrix = red_wine_data.corr(method = 'pearson')
corr_matrix

If the absolute value of the correlation coefficient |r| is greater than 0.95, the relationship between the two variables is considered almost linear. If |r| is in the range from 0.8 to 0.95, it indicates a strong linear relationship. If 0.6 < |r| < 0.8, a linear relationship is said to be present. At |r| < 0.4, it is usually assumed that no linear relationship can be detected.
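The scale above can be written down as a small helper. This is only a sketch: the function name `describe_correlation` and the label strings are hypothetical, and because the text does not say how to treat the band from 0.4 to 0.6 (or the exact boundary values), the helper returns a placeholder label there instead of guessing.

```python
def describe_correlation(r):
    """Map a correlation coefficient to the verbal scale used in the text."""
    a = abs(r)
    if a > 0.95:
        return "almost linear"
    if a >= 0.8:
        return "strong linear"
    if a > 0.6:
        return "linear"
    if a < 0.4:
        return "not detected"
    # 0.4 <= |r| <= 0.6 is not covered by the scale (assumption: leave it unlabeled).
    return "not covered by the scale"

for value in (-0.97, 0.85, 0.7, 0.5, -0.1):
    print(value, describe_correlation(value))
```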

Only two input variables show a correlation with the target variable, and it is weak (at the very boundary of the scale above). I therefore assume that the model will need more input variables than just these two.

----------------------------------------------- Data preparation ---------------------------------------------------

Let's highlight the input variables that are correlated with each other. To do this, we filter the correlation matrix, keeping only the values with |r| of at least 0.6 and hiding the trivial self-correlations (|r| = 1).

In [None]:
#hide weak correlations (|r| < 0.6) and the trivial self-correlations (|r| = 1)
abs_corr = corr_matrix.abs()
hidden = (abs_corr < 0.6) | np.isclose(abs_corr, 1.0)
#cast to object so that numbers and the '---' placeholder can share a column
corr_matrix_copy = corr_matrix.round(2).astype(object).mask(hidden, '---')

corr_matrix_copy

Some input variables are significantly correlated with each other, so it makes sense to reduce the number of input variables when building the model. I will remove the following variables: "citric acid", "density", "pH", "total sulfur dioxide".
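As a complement to reading the filtered matrix by eye, the correlated pairs can also be listed directly. The sketch below uses a made-up three-column DataFrame (`toy`, with columns `a`, `b`, `c`; all hypothetical) rather than the wine data, and keeps only the upper triangle of the matrix so that each pair appears once.

```python
import numpy as np
import pandas as pd

# Toy data (assumption): "b" is almost a copy of "a", "c" is independent noise.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": a + 0.1 * rng.normal(size=200),
    "c": rng.normal(size=200),
})

corr = toy.corr(method="pearson")

# Keep only the strict upper triangle: no diagonal, no mirrored duplicates.
upper_mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper_mask).stack()

# Pairs at or above the 0.6 threshold used in the text (NaN entries never pass).
strong_pairs = pairs[pairs.abs() >= 0.6]
print(strong_pairs)
```

From each listed pair, one variable is a candidate for removal; which one to drop is still a judgment call.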

In [None]:
#extracting target variable, cleaning input data
#(the source DataFrame is left unchanged, so the cell can be re-run safely)
target = red_wine_data['quality'].to_numpy()
red_wine_data_cleaned = red_wine_data.drop(['quality', 'citric acid', 'density', 'pH', 'total sulfur dioxide'], axis = 1)

In [None]:
#splitting cleaned data (train/test = 80/20)
train_X, test_X, train_y, test_y = train_test_split(red_wine_data_cleaned, target, stratify = target,
                                                    test_size=0.2, shuffle = True, random_state=1)

In [None]:
#scaling features (the scaler is fitted on the training data only)
scaler = StandardScaler()
scaler.fit(train_X)
train_X_scaled = scaler.transform(train_X)
test_X_scaled = scaler.transform(test_X)
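What the scaling cell does can be reproduced by hand on a tiny made-up array (the numbers are an assumption for illustration). The mean and standard deviation come from the training rows only and are then reused for the test rows. `StandardScaler` uses the population standard deviation (`ddof=0`), which is also NumPy's default.

```python
import numpy as np

# Hypothetical train/test rows with two features on very different scales.
train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
test = np.array([[2.0, 40.0]])

# Statistics are estimated on the training data only.
mean = train.mean(axis=0)
std = train.std(axis=0)  # ddof=0, matching StandardScaler

train_scaled = (train - mean) / std
test_scaled = (test - mean) / std  # reuses the training statistics

print(train_scaled)
print(test_scaled)
```

After scaling, the training columns have zero mean and unit variance; the test row does not have to, because it was not used to estimate the statistics.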

Using PCA (principal component analysis), we can examine the cumulative explained variance to see how many principal components are needed to capture most of the variance in the data.

In [None]:
#PCA test (n_components=7, to see the explained variance of all generated components)
pca_test = PCA(n_components = 7)
pca_test.fit(train_X_scaled)
evr = pca_test.explained_variance_ratio_
cvr = np.cumsum(evr)
pca_df = pd.DataFrame({'Cumulative Variance Ratio': cvr, 'Explained Variance Ratio': evr})
display(pca_df)

With such a small number of input variables (we have 7 of them), reducing dimensionality with PCA probably does not make much sense. The table shows that more than 90% of the variance is explained only when 6 principal components are used.
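Reading the threshold off the table can also be done in code. The sketch below uses made-up explained variance ratios (the array `evr_example` is hypothetical and does not reproduce the table above) and finds the smallest number of components whose cumulative ratio reaches 90%.

```python
import numpy as np

# Hypothetical explained variance ratios for 7 components (they sum to 1).
evr_example = np.array([0.40, 0.25, 0.15, 0.12, 0.05, 0.02, 0.01])
cvr_example = np.cumsum(evr_example)

threshold = 0.90
# argmax returns the first position where the condition holds; +1 turns it into a count.
n_components_needed = int(np.argmax(cvr_example >= threshold)) + 1
print(n_components_needed)
```

With real data, `evr_example` would be replaced by `pca_test.explained_variance_ratio_`.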

In [None]:
#the number of principal components was chosen to maximize the 'roc_auc_score' metric; the best value turned out to be 5
pca = PCA(n_components = 5)
pca.fit(train_X_scaled)
train_X_scaled_pca = pca.transform(train_X_scaled)
test_X_scaled_pca = pca.transform(test_X_scaled)

------------------------------------ Creating the model (KNN classification) -----------------------------------------

In [None]:
#creating the model and selecting the parameter 'n_neighbors' to maximize the metric 'roc_auc_score'
metric_values = []
for n_neighbors in range(1, 200):
    knc = KNeighborsClassifier(n_neighbors = n_neighbors, weights = 'distance', p = 1)
    knc.fit(train_X_scaled_pca, train_y)
    #'roc_auc_score' metric (one-vs-rest, computed from the predicted class probabilities)
    probs = knc.predict_proba(test_X_scaled_pca)
    metric_values.append(roc_auc_score(test_y, probs, multi_class = 'ovr'))

#printing the max value of 'roc_auc_score' and the corresponding value of the parameter 'n_neighbors'
roc_auc_score_max = max(enumerate(metric_values, 1), key=lambda i : i[1])
print(f'The value of the parameter "n_neighbors" in the KNN Classifier: {roc_auc_score_max[0]}')
print(f'The max value of the metric "roc_auc_score" in the test sample: {round(roc_auc_score_max[1], 4)}')
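The loop above selects 'n_neighbors' by scoring on the test sample, so the reported maximum is likely to be somewhat optimistic. A possible alternative, sketched here under stated assumptions, is to select the parameter by cross-validation on the training data and score the test sample only once at the end. The data is a synthetic stand-in generated with `make_classification` (7 features, 3 classes), and the candidate grid is arbitrary; neither comes from the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the wine data (assumption: 7 features, 3 classes).
X, y = make_classification(n_samples=400, n_features=7, n_informative=5,
                           n_redundant=0, n_classes=3, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.2, random_state=1)

# Scaling and PCA sit inside the pipeline, so they are refitted on each CV fold.
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA(n_components=5)),
    ("knn", KNeighborsClassifier(weights="distance", p=1)),
])

candidate_neighbors = [1, 5, 15, 25, 50]  # arbitrary illustrative grid
search = GridSearchCV(pipeline, {"knn__n_neighbors": candidate_neighbors},
                      scoring="roc_auc_ovr", cv=5)
search.fit(X_train, y_train)

best_k = search.best_params_["knn__n_neighbors"]
test_auc = search.score(X_test, y_test)  # one-vs-rest ROC AUC on held-out data
print(best_k, round(search.best_score_, 4), round(test_auc, 4))
```

Applied to the wine data, the same pipeline would take `train_X` and `train_y` in `fit` and `test_X`, `test_y` in `score`.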