# 0. Table of contents

* [1. Load libraries](#load)
* [2. Introduction](#intro)
* [3. Dataset](#dataset)
* [4. Initial and exploratory data analysis](#eda)
    * [4.1. Label column](#label)
    * [4.2. Each feature](#features)
    * [4.3. Features vs label](#vslabel)
* [5. Modelling](#model)
    * [5.1. Using accuracy](#acc)
        * [5.1.1. Evaluation on the test set](#testacc)
    * [5.2. Using balanced accuracy score](#balacc)
        * [5.2.1. Evaluation on the test set](#testbalacc)
    * [5.3. With oversampling](#over)
        * [5.3.1. Evaluation on the test set](#testover)
* [6. Comparing the scores for all models](#compare)

<a id="load"></a>
# 1. Load libraries

In [None]:
# -- Data manipulation
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
# -- Data visualisation
import matplotlib.pyplot as plt
from matplotlib.ticker import FormatStrFormatter # to format decimal points on axis
plt.style.use('ggplot')
import seaborn as sns
sns.set_palette("pastel")

<a id="intro"></a>
# 2. Introduction

This is a quick little notebook with some basic EDA and a single estimator (KNN) with tuned hyperparameters, presented as a tutorial for beginners. I tune the model with two different metrics, mean accuracy and balanced accuracy, and discuss which works better. I also try oversampling with SMOTE as a pre-processing step to deal with minority classes.

<a id="dataset"></a>
# 3. Dataset

In [None]:
# -- Load the dataset
df = pd.read_csv('/kaggle/input/red-wine-quality-cortez-et-al-2009/winequality-red.csv')

# -- Have a look at the top of the dataframe
df.head()

In [None]:
# -- Get the shape of the dataframe
print(f'The dataset has {df.shape[0]} rows and {df.shape[1]} columns\n')

# -- See if the data types for each column make sense
df.info()

We have 12 columns: 11 features and 1 label (the last column). There are no missing values and each column has the right data type. The label is the quality of the wine as rated by tasters, on a scale from 0 (poorest) to 10 (best). Each row represents an individual wine, so we treat the observations as independent. Now let's see if there are duplicated rows in the dataset.

In [None]:
# -- Count the number of duplicated rows
print(f'There are {df.duplicated().sum()} duplicated rows in the dataframe.')

# -- Remove duplicated rows
df = df.drop_duplicates()

print(f'After removing duplicated rows there are {df.shape[0]} rows in the dataframe.')

Why should we remove the rows that are duplicated? Well, it is unlikely that different wines will have the exact same physicochemical composition, especially with features that have 4 decimal digits (density). Therefore, this must be an error or the wines are so similar that we can count them as one by removing the duplicates.

In [None]:
from sklearn.model_selection import train_test_split

# -- Separate features and label
# (a) drop target column
X = df.drop(columns=['quality'])
# (b) make an array with the target column
y = df['quality'].copy()

# -- Split the dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

When splitting the dataset into train and test sets, it's important to keep the proportions of classes in the label column consistent. We want roughly the same proportion of classes in our train and test sets as in the original set. To do this we used the *stratify* parameter.
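To see what stratification buys us, here is a minimal sketch on synthetic labels. The 80/15/5 class proportions and the `toy_` variable names are made up for illustration and are not taken from the wine data. It compares the class shares in a stratified test set against the full toy set.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy imbalanced labels: 80% class 2, 15% class 1, 5% class 3 (made-up proportions)
toy_y = pd.Series([2] * 160 + [1] * 30 + [3] * 10)
toy_X = pd.DataFrame({"feature": np.arange(len(toy_y))})

# Stratified split: each class keeps its share in both subsets
_, _, _, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.2, stratify=toy_y, random_state=42
)

# Compare class shares in the full set and in the stratified test set
full_share = toy_y.value_counts(normalize=True).sort_index()
test_share = toy_y_test.value_counts(normalize=True).sort_index()
print(pd.DataFrame({"full set": full_share, "stratified test": test_share}))
```

Without `stratify`, a small class like the 5% one here could easily end up over- or under-represented in the test set purely by chance.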

<a id="eda"></a>
# 4. Initial and exploratory data analysis

<a id="label"></a>
## 4.1. Label column

In [None]:
sns.countplot(x=y_train)
plt.gca().set(xlabel='Red wine quality', ylabel='Count', title='Distribution of quality categories')
plt.show()

In [None]:
# -- Get the actual count of wines of each category
y_train.value_counts()

We can see that the poor quality wines (3 and 4) and high quality wines (7 and 8) don't have a lot of instances in the dataset. It makes sense to group the wines into three groups to give the model something to work with. Let's say 3 and 4 wines are poor (3rd place), 5 and 6 wines are ok (2nd place), and 7 and 8 wines are great (1st place).

In [None]:
def rename_labels(labels):
    # -- Assign new labels for wine quality
    # (a) form an array with new labels
    new_labels_array = np.where( (labels == 3) | (labels == 4), 3, labels)
    new_labels_array = np.where( (labels == 5) | (labels == 6), 2, new_labels_array)
    new_labels_array = np.where( (labels == 7) | (labels == 8), 1, new_labels_array)
    # (b) make a pandas series out of the array to later assign correct indices
    new_labels = pd.Series(new_labels_array)
    # (c) assign correct indices
    new_labels.index = labels.index
    # (d) rename the series
    labels = new_labels
    return labels

# -- Rename labels for both train and test sets
y_train = rename_labels(y_train)
y_test = rename_labels(y_test)

In [None]:
# -- Have a look at the new labels
sns.countplot(x=y_train)
plt.gca().set(xlabel='Red wine quality', ylabel='Count', title='Distribution of quality categories after renaming')
plt.show()

<a id="features"></a>
## 4.2. Each feature

In [None]:
# -- Set the plot number for the first subplot
plot_number = 1

plt.figure(figsize=(25, 40)) # set the size of the whole set of plots
plt.subplots_adjust(hspace=0.9, wspace=0.15) # set the space between subplots

for col in X_train.columns:
    plt.subplot(12, 2, plot_number)
    sns.histplot(X_train[col], color="#9bd0b7")
    plt.gca().xaxis.set_major_formatter(FormatStrFormatter('%.2f'))
    plt.title(f'{col.capitalize()} histogram')
    plt.xlabel('')
    plt.ylabel('')

    plt.subplot(12, 2, plot_number+1)
    
    sns.boxplot(x=X_train[col], color='#badfda', width=0.7, linewidth=0.6)
    plt.gca().xaxis.set_major_formatter(FormatStrFormatter('%.2f'))
    plt.title(f'{col.capitalize()} boxplot')
    plt.xlabel('')
    plt.ylabel('')

    plot_number = plot_number+2 # set a new plot number for the next feature
    
plt.show()

Quite a few features are not normally distributed (e.g. total sulfur dioxide) and a few have a considerable number of outliers (e.g. sulphates). The shape of the distributions isn't important for KNN, as the algorithm is non-parametric. The outliers may not be a problem either, since the algorithm makes its decision by majority vote and the outliers are scarce. Still, it will be interesting to see if the model's accuracy improves when we deal with them.
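The points drawn beyond the whiskers in the box plots above follow the usual 1.5 × IQR convention. A rough sketch of counting them per feature is below. `count_iqr_outliers` is a hypothetical helper and the toy dataframe is made up, so in this notebook you would pass `X_train` instead.

```python
import pandas as pd

def count_iqr_outliers(frame: pd.DataFrame, whisker: float = 1.5) -> pd.Series:
    """Count values outside [Q1 - whisker*IQR, Q3 + whisker*IQR] for each column."""
    q1 = frame.quantile(0.25)
    q3 = frame.quantile(0.75)
    iqr = q3 - q1
    # Comparisons align the per-column bounds with the dataframe columns
    outside = (frame < q1 - whisker * iqr) | (frame > q3 + whisker * iqr)
    return outside.sum()

# Toy data (made up): one obvious outlier in column "a", none in "b"
toy = pd.DataFrame({"a": [1, 2, 3, 4, 100], "b": [10, 11, 12, 13, 14]})
outlier_counts = count_iqr_outliers(toy)
print(outlier_counts)
```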

<a id="vslabel"></a>
## 4.3. Features vs label

In [None]:
plot_number = 1

plt.figure(figsize=(25, 20))
plt.subplots_adjust(hspace=0.2, wspace=0.15)

for col in X_train.columns:
    plt.subplot(3, 4, plot_number)
    sns.boxplot(x=y_train, y=X_train[col])
    plt.title(f'{col.capitalize()}')
    plt.xlabel('')
    plt.ylabel('')
    
    plot_number = plot_number+1

plt.show()

It looks like there are some trends: the more fixed acidity, citric acid, and alcohol a wine has, the better it is rated. But the differences are rather small, and we must remember that categories 1 and 3 are sparse; they don't have many wines in them to begin with, so it's questionable whether we can draw such conclusions. If I am bored tomorrow, I will see if I can do some hypothesis tests to check whether these box plots hold any water.

<a id="model"></a>
# 5. Modelling

We will make a pipeline which will comprise two steps:
1. Scale the data
2. Run the KNN algorithm for classification.

We scale the data because KNN classifies a point by a majority vote among its nearest neighbours, so it assumes that points that are close together are likely to be of the same class. And this is why scaling is important: if features have very different scales, the distances between points are dominated by the features with the largest ranges and may be uninformative. For more info on this, have a look at this [post on StackExchange](https://stats.stackexchange.com/questions/287425/why-do-you-need-to-scale-data-in-knn).
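Here is a small sketch of that effect with plain NumPy. The three points are made-up values, with one feature on a 0 to 1 scale and the other in the hundreds. The manual standardisation mimics what `StandardScaler` does, but it is only an illustration, not the pipeline used below.

```python
import numpy as np

# Toy points (made-up values): column 0 spans ~0-1, column 1 spans hundreds
points = np.array([
    [0.1, 100.0],   # query point
    [0.9, 101.0],   # far in feature 0, close in feature 1
    [0.1, 130.0],   # identical in feature 0, far in feature 1
])

def distances_to_first(data):
    """Euclidean distance from the first row to every other row."""
    return np.linalg.norm(data[1:] - data[0], axis=1)

raw = distances_to_first(points)

# Standardise each column (zero mean, unit variance)
scaled_points = (points - points.mean(axis=0)) / points.std(axis=0)
scaled = distances_to_first(scaled_points)

print("raw distances:   ", raw)
print("scaled distances:", scaled)
```

On the raw data the third point looks more than twenty times farther away than the second, purely because of the large-scale feature. After standardising, the two points are about equally far from the query, so both features get a say in who counts as a neighbour.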

<a id="acc"></a>
## 5.1. Using accuracy

In [None]:
import optuna
from optuna.samplers import TPESampler
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import StratifiedKFold
# Which hyperparameters to tune: https://machinelearningmastery.com/hyperparameters-for-classification-machine-learning-algorithms/

def objective(trial):
    # -- Instantiate scaler
    scalers = trial.suggest_categorical("scalers", ['minmax', 'standard', 'robust'])

    if scalers == "minmax":
        scaler = MinMaxScaler()
    elif scalers == "standard":
        scaler = StandardScaler()
    else:
        scaler = RobustScaler()
                
    # -- Tune estimator algorithm
    n_neighbors = trial.suggest_int("n_neighbors", 1, 30)
    weights = trial.suggest_categorical("weights", ['uniform', 'distance'])
    # note: 'minkowski' with the default p=2 is the same distance as 'euclidean'
    metric = trial.suggest_categorical("metric", ['euclidean', 'manhattan', 'minkowski'])
    knn = KNeighborsClassifier(n_neighbors=n_neighbors, weights=weights, metric=metric)
        
    # -- Make a pipeline
    pipeline = make_pipeline(scaler, knn)

    # -- Cross-validate the pipeline on the training set
    kfold = StratifiedKFold(n_splits=10)
    score = cross_val_score(pipeline, X_train, y_train, scoring='accuracy', cv=kfold)
    score = score.mean()
    return score

sampler = TPESampler(seed=42) # create a seed for the sampler for reproducibility
study = optuna.create_study(direction="maximize", sampler=sampler)
study.optimize(objective, n_trials=300)

In [None]:
# -- Have a look at the best trial
print("Best trial out of 300 is:")
study.best_trial

It's interesting to see that the best performing scaler is the RobustScaler which is specifically designed to deal with outliers. Perhaps dealing with outliers manually will lead to even better results?

In [None]:
optuna.visualization.plot_optimization_history(study)

Here it looks like the tuning algorithm quickly found the pocket for the best parameters (in the first 50 trials) and after that the accuracy score was consistent.

In [None]:
optuna.visualization.plot_param_importances(study)

Here we can see that the single most important hyperparameter is the number of neighbours. All the other parameters and the scaling options have a negligible effect on the score.

<a id="testacc"></a>
### 5.1.1. Evaluation on the test set

In [None]:
# -- Have a look at the best parameters for the tuned model
print("Best parameters after tuning using mean accuracy:")
study.best_params

In [None]:
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import classification_report

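# -- Instantiate tuned model
best_params = study.best_params # make a dictionary of best parameters
# look up the scaler chosen by the study instead of hard-coding it
scaler_classes = {'minmax': MinMaxScaler, 'standard': StandardScaler, 'robust': RobustScaler}
scaler = scaler_classes[best_params.pop('scalers')]() # also removes the scaler option from the dictionary
knn = KNeighborsClassifier(**best_params)
pipeline = make_pipeline(scaler, knn)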

# -- Make a function to print out mean accuracy scores for both train and test sets
# and to get a dataframe for precision, recall, and f1 scores for each class
def get_scores(pipeline, tuning_method):
    # -- Get scores for training data
    kfold = StratifiedKFold(10)
    score = cross_val_score(pipeline, X_train, y_train, scoring='accuracy', cv=kfold)
    print("Training set: %0.2f mean accuracy with a standard deviation of %0.2f" % (score.mean(), score.std()))

    # -- Fit the tuned model
    pipeline.fit(X_train, y_train)

    # -- Evaluate on the test set
    # (a) mean accuracy
    test_score = pipeline.score(X_test, y_test)
    print("Testing set: %0.2f mean accuracy" % test_score)
    # (b) predict the labels for test data
    y_predicted = pipeline.predict(X_test)

    print()
    print("Classification report for the testing set:")
    print(classification_report(y_test, y_predicted))
    
    # -- Produce a dataframe of all metrics for all classes for the model
    # (a) Get scores for each metric for all classes
    scores = classification_report(y_test, y_predicted, output_dict=True)

    # Since the output is a dictionary of various metrics and we only need information on each class
    # we need to extract this information and store it as a dataframe for when we plot it
    # (b) Get a list of dataframes corresponding to each class from the scores dictionary
    list_of_dfs = []
    for class_number in range(1,4):
        # -- Define columns
        each_score = list(scores[str(class_number)].values())[:-1] # get all values into a list apart from the last value (which is support)
        metric = list(scores[str(class_number)].keys())[:-1]       # do the same for keys
        class_name = list(str(class_number)*3)                     # assign the class to which the three metrics belong to
        tuning = [tuning_method for i in range(3)]                      # mark which tuning method these metrics belong to

        data = list(zip(each_score, metric, class_name, tuning))

        # -- Make a dataframe for each class
        scores_df = pd.DataFrame(data=data, columns=['score', 'metric', 'class', 'tuning method'])

        # -- Make a list of all dataframes
        list_of_dfs.append(scores_df)

    # -- Concatenate all class dataframes
    class_scores = pd.concat(list_of_dfs)
    
    return class_scores

# -- Get a dataframe for scores for each class
mean_accuracy_trained_scores = get_scores(pipeline, "mean accuracy")

Judging by accuracy, the model isn't doing too badly, and there's no sign of overfitting, as the cross-validated score on the training set and the score on the test set are almost the same (0.01 difference). However, the classification report shows that the model essentially classifies all wines as category 2 (ok wines), which is the vast majority of all wines. The recall for category 2 is 0.95, meaning 95% of all category 2 wines were classified as such, and precision for category 2 is 0.85, meaning that out of all wines classified as category 2, 85% were true category 2 wines. No wine was labelled category 3 and only 36% of category 1 wines were labelled as such. So the model isn't great, since most of the time we want accurate predictions on minority classes. At the same time, the dataset is tiny, so accurate predictions for minority classes may be out of reach for any algorithm. Maybe there's an amazing neural network that can do the job? With a little over 1000 observations I doubt it.
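To make the precision and recall definitions above concrete, here is a tiny sketch that computes both from raw counts. The labels are made up (they are not the notebook's predictions) and `precision_recall` is a hypothetical helper. In practice `classification_report` does this for us.

```python
import numpy as np

# Toy true and predicted labels (made up): class 2 dominates, as in the wine data
y_true = np.array([2, 2, 2, 2, 2, 2, 1, 1, 3, 3])
y_pred = np.array([2, 2, 2, 2, 2, 1, 2, 1, 2, 2])

def precision_recall(y_true, y_pred, label):
    """Precision and recall for one class, computed from raw counts."""
    true_pos = np.sum((y_pred == label) & (y_true == label))
    predicted = np.sum(y_pred == label)   # everything labelled as this class
    actual = np.sum(y_true == label)      # everything truly in this class
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    return precision, recall

for label in (1, 2, 3):
    p, r = precision_recall(y_true, y_pred, label)
    print(f"class {label}: precision={p:.2f}, recall={r:.2f}")
```

Class 3 is never predicted in this toy example, so both its precision and recall come out as zero even though most predictions overall are correct.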

<a id="balacc"></a>
## 5.2. Using balanced accuracy score

Now let's see if the model performs better if we tune the pipeline relying on the balanced accuracy metric. Balanced accuracy is essentially an average of recalls for all classes, so perhaps the pipeline will be optimised to perform better with minority classes.
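A quick sketch of why this metric is harder to game. The labels are made up, and the "model" simply predicts the majority class every time. The hand-rolled calculation should agree with `sklearn.metrics.balanced_accuracy_score` called with its default arguments.

```python
import numpy as np

# Toy labels (made up): 8 wines of class 2, one each of classes 1 and 3
y_true = np.array([2] * 8 + [1, 3])
# A lazy "model" that always predicts the majority class
y_pred = np.full_like(y_true, 2)

# Plain accuracy: share of correct predictions overall
accuracy = np.mean(y_pred == y_true)

# Balanced accuracy: mean of the per-class recalls
recalls = [np.mean(y_pred[y_true == c] == c) for c in np.unique(y_true)]
balanced_accuracy = np.mean(recalls)

print(f"accuracy:          {accuracy:.2f}")
print(f"balanced accuracy: {balanced_accuracy:.2f}")
```

The lazy model gets most wines right, yet its balanced accuracy is only one third, because it scores a recall of zero on two of the three classes.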

In [None]:
def objective(trial):
    # -- Instantiate scaler
    scalers = trial.suggest_categorical("scalers", ['minmax', 'standard', 'robust'])

    if scalers == "minmax":
        scaler = MinMaxScaler()
    elif scalers == "standard":
        scaler = StandardScaler()
    else:
        scaler = RobustScaler()
        
    # -- Tune estimator algorithm
    n_neighbors = trial.suggest_int("n_neighbors", 1, 30)
    weights = trial.suggest_categorical("weights", ['uniform', 'distance'])
    metric = trial.suggest_categorical("metric", ['euclidean', 'manhattan', 'minkowski'])
    knn = KNeighborsClassifier(n_neighbors=n_neighbors, weights=weights, metric=metric)
        
    # -- Make a pipeline
    pipeline = make_pipeline(scaler, knn)

    # -- Cross-validate the pipeline on the training set
    kfold = StratifiedKFold(n_splits=10)
    score = cross_val_score(pipeline, X_train, y_train, scoring='balanced_accuracy', cv=kfold)
    score = score.mean()
    return score

sampler = TPESampler(seed=42) # create a seed for the sampler for reproducibility
study = optuna.create_study(direction="maximize", sampler=sampler)
study.optimize(objective, n_trials=300)

In [None]:
# -- Have a look at the best trial
print("Best trial out of 300 is:")
study.best_trial

In [None]:
optuna.visualization.plot_optimization_history(study)

In [None]:
optuna.visualization.plot_param_importances(study)

<a id="testbalacc"></a>
### 5.2.1. Evaluation on the test set

In [None]:
# -- Have a look at the best parameters for the tuned model
print("Best parameters after tuning using balanced accuracy:")
study.best_params

In [None]:
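# -- Instantiate tuned model
best_params = study.best_params # make a dictionary of best parameters
# look up the scaler chosen by the study instead of hard-coding it
scaler_classes = {'minmax': MinMaxScaler, 'standard': StandardScaler, 'robust': RobustScaler}
scaler = scaler_classes[best_params.pop('scalers')]() # also removes the scaler option from the dictionary
knn = KNeighborsClassifier(**best_params)
pipeline = make_pipeline(scaler, knn)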

# -- Get a dataframe for scores for each class
balanced_accuracy_trained_scores = get_scores(pipeline, "balanced accuracy")

And here it is: precision for class 2 is 87% and recall is 87% (compared to 95% recall with the model tuned using accuracy). We've also found 23% of all class 3 wines, and 33% of the wines labelled as category 3 were labelled correctly. Recall for category 1 wines is also better (up from 36% to 50%). We can clearly see this model doesn't label the overwhelming majority of wines as class 2 just because it's the most abundant category.

<a id="over"></a>
## 5.3. With oversampling

Let's try another method to deal with imbalanced classes.

We will make a pipeline with three steps:
1. Scale the data
2. Oversample the data
3. Run the KNN algorithm for classification.

We scale the data first and then oversample it since SMOTE uses KNN to generate new samples ([more details here](https://stats.stackexchange.com/questions/363312/normalization-standardization-should-one-do-this-before-oversampling-undersampl)).

Oversampling is what it sounds like - we will produce synthetic samples which will represent our minority classes (the very best and the very worst wines) and hopefully this will stop the model from ignoring the minority classes. For more info on SMOTE go to [Machine Learning Mastery](https://machinelearningmastery.com/smote-oversampling-for-imbalanced-classification/).
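The core idea behind SMOTE can be sketched in a few lines: pick a minority point, pick one of its nearest minority neighbours, and place a new point somewhere on the line between them. The code below is a simplified illustration of that interpolation step only. It is not how `imblearn` implements SMOTE, and `smote_like_sample` and the toy points are made up.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy minority-class points in 2D (made-up values)
minority = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]])

def smote_like_sample(points, k_neighbors, rng):
    """Create one synthetic point between a random point and one of its k nearest neighbours."""
    i = rng.integers(len(points))
    distances = np.linalg.norm(points - points[i], axis=1)
    # Sort by distance and skip position 0, which is the point itself
    neighbour_ids = np.argsort(distances)[1:k_neighbors + 1]
    j = rng.choice(neighbour_ids)
    gap = rng.random()  # interpolation factor in [0, 1)
    return points[i] + gap * (points[j] - points[i])

synthetic = np.array(
    [smote_like_sample(minority, k_neighbors=2, rng=rng) for _ in range(5)]
)
print(synthetic)
```

Because the neighbour search relies on distances, the same scaling argument as for KNN applies, which is why the scaler comes before the oversampler in the pipeline.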

In [None]:
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline

def objective(trial):
    # -- Instantiate oversampling
    k_neighbors = trial.suggest_int("k_neighbors", 1, 30)
    over = SMOTE(k_neighbors=k_neighbors, random_state=42)
    
    # -- Instantiate scaler
    scalers = trial.suggest_categorical("scalers", ['minmax', 'standard', 'robust'])

    if scalers == "minmax":
        scaler = MinMaxScaler()
    elif scalers == "standard":
        scaler = StandardScaler()
    else:
        scaler = RobustScaler()
        
    # -- Tune estimator algorithm
    n_neighbors = trial.suggest_int("n_neighbors", 1, 30)
    weights = trial.suggest_categorical("weights", ['uniform', 'distance'])
    metric = trial.suggest_categorical("metric", ['euclidean', 'manhattan', 'minkowski'])
    knn = KNeighborsClassifier(n_neighbors=n_neighbors, weights=weights, metric=metric)
        
    # -- Make a pipeline
    steps = [('scaler', scaler), ('over', over), ('estimator', knn)]
    pipeline = Pipeline(steps=steps)

    # -- Cross-validate the pipeline on the training set
    kfold = StratifiedKFold(n_splits=10)
    score = cross_val_score(pipeline, X_train, y_train, scoring='accuracy', cv=kfold)
    score = score.mean()
    return score

sampler = TPESampler(seed=42) # create a seed for the sampler for reproducibility
study = optuna.create_study(direction="maximize", sampler=sampler)
study.optimize(objective, n_trials=300)

In [None]:
# -- Have a look at the best trial
print("Best trial out of 300 is:")
study.best_trial

In [None]:
optuna.visualization.plot_optimization_history(study)

In [None]:
optuna.visualization.plot_param_importances(study)

<a id="testover"></a>
### 5.3.1. Evaluation on the test set

In [None]:
# -- Have a look at the best parameters for the tuned model
print("Best parameters after tuning using mean accuracy and oversampling:")
study.best_params

In [None]:
# -- Instantiate the best tuned model
best_params = study.best_params # make a dictionary of best parameters
# look up the scaler and SMOTE setting chosen by the study instead of hard-coding them
scaler_classes = {'minmax': MinMaxScaler, 'standard': StandardScaler, 'robust': RobustScaler}
scaler = scaler_classes[best_params.pop('scalers')]()

over = SMOTE(k_neighbors=best_params.pop('k_neighbors'), random_state=42)

# the remaining entries are the estimator parameters
knn = KNeighborsClassifier(**best_params)

steps = [('scaler', scaler), ('over', over), ('estimator', knn)]
pipeline = Pipeline(steps=steps)

# -- Get a dataframe for scores for each class
oversample_trained_scores = get_scores(pipeline, "with oversampling")

<a id="compare"></a>
# 6. Comparing the scores for all models

Now to compare all scores for all classes across our three models we can use the dataframes we made earlier.

In [None]:
# -- Concatenate all dataframes with scores for all tuned models
all_scores_list = [mean_accuracy_trained_scores, balanced_accuracy_trained_scores, oversample_trained_scores]
all_scores = pd.concat(all_scores_list)

all_scores.head(12)

In [None]:
# -- Make a plot comparing all the models
# errorbar=None needs seaborn >= 0.12; on older versions use ci=None instead
g = sns.catplot(data=all_scores, x="class", y="score", hue="metric", col="tuning method", errorbar=None, kind="bar")

(g.set_axis_labels("", "")
.set_xticklabels(["Class 1", "Class 2", "Class 3"])
.set(ylim=(0, 1),  yticks=np.arange(0, 1.1, 0.1).tolist()))

g.fig.suptitle('Comparison of models tuned with different methods', fontsize=18)
g.fig.subplots_adjust(top=0.8)

Tuning with the mean accuracy metric results in the weakest model. Tuning with balanced accuracy and tuning with mean accuracy on an oversampled dataset result in predictions that are roughly the same.

**Summary:**
* When working on imbalanced datasets we need to be careful which metric we use to make our evaluation since the model can ignore the minority class or classes and still result in high scores.
* We can also use other techniques like oversampling to deal with imbalances.
* Some hyperparameters are more influential than others and if time of training is important, we can decide to only tune the most important ones.