# Objective

The objective of this project is to predict the quality of wine using the concepts learned in DSA5842 Learning from Data: Support Vector Machines. The Wine Quality dataset consists of red wine samples. The inputs include objective tests (e.g. pH values) and the output is based on sensory data (median of at least 3 evaluations made by wine experts). Each expert graded the wine quality between 0 (very bad) and 10 (very excellent).

The dataset input variables (based on physicochemical tests) are:

1. fixed acidity (tartaric acid - g / dm^3)
2. volatile acidity (acetic acid - g / dm^3)
3. citric acid (g / dm^3)
4. residual sugar (g / dm^3)
5. chlorides (sodium chloride - g / dm^3)
6. free sulfur dioxide (mg / dm^3)
7. total sulfur dioxide (mg / dm^3)
8. density (g / cm^3)
9. pH
10. sulphates (potassium sulphate - g / dm^3)
11. alcohol (% by volume)


The output variable (based on sensory data) is:

12. quality (score between 0 and 10)

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
from matplotlib import rcParams
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split, GridSearchCV
from sklearn import metrics
from mlxtend.plotting import plot_decision_regions

plt.style.use('default')
rcParams["figure.figsize"] = [10, 8]

In [None]:
%matplotlib inline

# Loading Wine Quality Dataset

In [None]:
wine_df = pd.read_csv('../input/red-wine-quality-cortez-et-al-2009/winequality-red.csv')
wine_df.head(10)

# Exploratory Data Analysis

### Summary Statistics

In [None]:
wine_df.describe()

### Check for missing values

In [None]:
print(wine_df.isna().sum())

Since there are no missing values, we can proceed with the data as is.

### Correlation Heatmap

A heatmap of the pairwise correlation coefficients is plotted to check the variables for collinearity. As none of the off-diagonal correlation coefficients is close to +1 or -1, no variables need to be removed and we can proceed with the full set of inputs.

In [None]:
plt.figure(figsize=[15,15])
sb.heatmap(wine_df.corr(), annot=True)
plt.show()
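
To complement the heatmap, the strongest pairwise correlations can also be listed numerically. The following is a minimal sketch on a small synthetic DataFrame; `demo_df` and its column names are hypothetical stand-ins, not the wine data.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for wine_df; the column names and values are illustrative only.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
demo_df = pd.DataFrame({
    "a": a,
    "b": 0.9 * a + 0.1 * rng.normal(size=200),  # strongly correlated with "a"
    "c": rng.normal(size=200),                   # roughly independent of both
})

corr = demo_df.corr()
# Keep only the upper triangle (k=1 excludes the diagonal) so each pair appears once.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna()
# Rank the pairs by absolute correlation, strongest first.
top_pairs = pairs.sort_values(key=lambda s: s.abs(), ascending=False)
print(top_pairs.head(5))
```

Applied to the wine data, the same steps would use `wine_df.corr()` in place of `demo_df.corr()`.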

# Approach

The prediction of wine qualities will be treated as a multiclass classification problem as the wine quality can take on any value from 0 to 10, hence there are 11 possible classes. An SVM classifier will be used to solve this problem as well as linear discriminant analysis and logistic regression.

# Train/Test Split

The input variables are standardized (zero mean, unit variance) before the train/test split is performed, and a 70/30 train/test split is used.

In [None]:
X = wine_df.drop('quality', axis=1)
X = StandardScaler().fit_transform(X)
y = np.ravel(wine_df[['quality']])

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=892)
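
Because the scaler above is fitted on the full dataset, statistics from the test rows influence the scaling. One common alternative is to split first and let a pipeline fit the scaler on training data only. The sketch below illustrates this on synthetic data; the `*_demo` and `Xd_*` names are hypothetical and the scores it prints say nothing about the wine data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the wine features and labels (assumption for illustration).
X_demo, y_demo = make_classification(n_samples=300, n_features=11, n_informative=5,
                                     n_classes=3, random_state=892)

# Split first, so the test rows play no part in estimating the scaling parameters.
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=892)

# Inside cross_val_score, the pipeline refits the scaler on the training part of each fold.
pipe = make_pipeline(StandardScaler(), SVC(random_state=892))
cv_scores = cross_val_score(pipe, Xd_train, yd_train, cv=5)

pipe.fit(Xd_train, yd_train)
test_accuracy = pipe.score(Xd_test, yd_test)
print(cv_scores.mean(), test_accuracy)
```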

# Fitting the Model

## SVM

In [None]:
svm = SVC(random_state=892)
svm = svm.fit(X_train, y_train)

In [None]:
print("5-fold cross-validation error rate: {}%".format(round(100-100*np.mean(cross_val_score(svm, X_train, y_train, cv=5)),2)))
print("Out-of-sample error rate: {}%".format(round(100-100*metrics.accuracy_score(y_test, svm.predict(X_test)),2)))
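
The error rates printed above are computed as 100 minus 100 times the accuracy, which is the percentage of misclassified samples. As a small sketch of that arithmetic, the hypothetical helper `error_rate_pct` below computes the same quantity directly from toy labels.

```python
import numpy as np

def error_rate_pct(y_true, y_pred):
    """Percentage of predictions that differ from the true labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return round(100 * np.mean(y_true != y_pred), 2)

# Toy labels: 2 of the 8 predictions are wrong.
y_true_demo = [5, 6, 5, 7, 6, 5, 4, 6]
y_pred_demo = [5, 6, 6, 7, 6, 5, 5, 6]
print(error_rate_pct(y_true_demo, y_pred_demo))  # 25.0
```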

In [None]:
metrics.plot_confusion_matrix(svm, X_test, y_test, cmap=plt.cm.Blues)
plt.show()

## Linear Discriminant Analysis

In [None]:
lda = LinearDiscriminantAnalysis()
lda = lda.fit(X_train, y_train)

In [None]:
print("5-fold cross-validation error rate: {}%".format(round(100-100*np.mean(cross_val_score(lda, X_train, y_train, cv=5)),2)))
print("Out-of-sample error rate: {}%".format(round(100-100*metrics.accuracy_score(y_test, lda.predict(X_test)),2)))

In [None]:
metrics.plot_confusion_matrix(lda, X_test, y_test, cmap=plt.cm.Blues)
plt.show()

## Logistic Regression

In [None]:
logreg = LogisticRegression(solver='newton-cg', random_state=892)
logreg = logreg.fit(X_train, y_train)

In [None]:
print("5-fold cross-validation error rate: {}%".format(round(100-100*np.mean(cross_val_score(logreg, X_train, y_train, cv=5)),2)))
print("Out-of-sample error rate: {}%".format(round(100-100*metrics.accuracy_score(y_test, logreg.predict(X_test)),2)))

In [None]:
metrics.plot_confusion_matrix(logreg, X_test, y_test, cmap=plt.cm.Blues)
plt.show()

# Preliminary Analysis of Results

| Method                       | K-fold CV Error Rate | Out-of-sample Error Rate |
|------------------------------|----------------------|--------------------------|
| SVM                          | 39.59%               | 37.71%                   |
| Linear Discriminant Analysis | 43.16%               | 39.17%                   |
| Logistic Regression          | 42.00%               | 39.79%                   |

From the results of each model, it can be observed that all of them perform quite poorly as they have very high error rates.

In [None]:
for i in range(0,11):
    print("Number of entries with quality of score {}: {}".format(i, sum(wine_df['quality'] == i)))

We can see that the distribution of the data is not ideal, as there are significantly fewer entries with a quality score of 3, 4, 7, or 8 than with a score of 5 or 6. Additionally, there are no entries in the dataset with a quality score of 0, 1, 2, 9, or 10. This would lead to high misclassification rates, as the model does not have enough data points in the minority classes to learn from and make accurate predictions. Therefore, the data needs to be reclassified.
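
The same tally can be produced in one step with `value_counts`, which also makes it easy to show each score's share of the data. This is a minimal sketch on a toy series (`quality_demo` is hypothetical), not the wine data.

```python
import pandas as pd

# Toy quality column (illustrative values, not the wine data).
quality_demo = pd.Series([5, 5, 6, 5, 6, 7, 3, 6, 5, 8])

# Count each score and fill in zeros for scores that never occur.
counts = quality_demo.value_counts().reindex(range(0, 11), fill_value=0)
shares = (100 * counts / counts.sum()).round(1)
print(pd.DataFrame({"count": counts, "percent": shares}))
```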

# Transformation of Data

Before the data can be reclassified, the number of entries with a quality score of 5 or 6 needs to be reduced. These entries are over-represented and bias the results towards those two scores. As such, the two classes are undersampled so that only 30% of their original entries remain, bringing the number of entries for each score closer to the next highest class count, which is 199.

In [None]:
wine_df.drop(wine_df.query('quality == 5').sample(frac=0.7, random_state=892).index, inplace=True)
wine_df.drop(wine_df.query('quality == 6').sample(frac=0.7, random_state=892).index, inplace=True)
wine_df.reset_index(drop=True, inplace=True)
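
An equivalent way to express the undersampling is to keep a 30% sample of each majority class rather than dropping 70% of it. The sketch below shows this on toy data; `demo_df`, `keep_frac`, and `majority_scores` are hypothetical names and the class counts are made up.

```python
import pandas as pd

# Toy data: scores 5 and 6 dominate (illustrative counts, not the wine data).
demo_df = pd.DataFrame({"quality": [5] * 100 + [6] * 80 + [7] * 30 + [4] * 10})

keep_frac = 0.3            # fraction of each majority class to keep
majority_scores = [5, 6]   # classes treated as over-represented

is_majority = demo_df["quality"].isin(majority_scores)
# Sample 30% within each majority class, and keep every minority row.
kept_majority = (demo_df[is_majority]
                 .groupby("quality", group_keys=False)
                 .sample(frac=keep_frac, random_state=892))
balanced_df = pd.concat([kept_majority, demo_df[~is_majority]]).reset_index(drop=True)
print(balanced_df["quality"].value_counts().sort_index())
```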

In [None]:
print('Post-correction:')
for i in range(0,11):
    print("Number of entries with quality of score {}: {}".format(i, sum(wine_df['quality'] == i)))

After resolving the over-representation of the two majority categories, the wine quality scores will be reclassified into 3 categories:

- If the wine quality is 4 and below, it is assigned as *Poor* (denoted as 0).
- If the wine quality is 5 or 6, it is assigned as *Average* (denoted as 1).
- If the wine quality is 7 and above, it is assigned as *Good* (denoted as 2).

In [None]:
# Bin the scores: (-1, 4] -> 0 (Poor), (4, 6] -> 1 (Average), (6, 10] -> 2 (Good)
wine_df['quality'] = pd.cut(wine_df['quality'], bins=[-1, 4, 6, 10], labels=[0, 1, 2]).astype(int)

In [None]:
wine_df

# Re-fitting the Model

In [None]:
X = wine_df.drop('quality', axis=1)
X = StandardScaler().fit_transform(X)
y = np.ravel(wine_df[['quality']])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=892)

## SVM

In [None]:
svm = SVC(random_state=892)
svm = svm.fit(X_train, y_train)

In [None]:
print("5-fold cross-validation error rate: {}%".format(round(100-100*np.mean(cross_val_score(svm, X_train, y_train, cv=5)),2)))
print("Out-of-sample error rate: {}%".format(round(100-100*metrics.accuracy_score(y_test, svm.predict(X_test)),2)))

In [None]:
metrics.plot_confusion_matrix(svm, X_test, y_test, cmap=plt.cm.Blues)
plt.show()

## Linear Discriminant Analysis

In [None]:
lda = LinearDiscriminantAnalysis()
lda = lda.fit(X_train, y_train)

In [None]:
print("5-fold cross-validation error rate: {}%".format(round(100-100*np.mean(cross_val_score(lda, X_train, y_train, cv=5)),2)))
print("Out-of-sample error rate: {}%".format(round(100-100*metrics.accuracy_score(y_test, lda.predict(X_test)),2)))

In [None]:
metrics.plot_confusion_matrix(lda, X_test, y_test, cmap=plt.cm.Blues)
plt.show()

## Logistic Regression

In [None]:
logreg = LogisticRegression(solver='newton-cg', random_state=892)
logreg = logreg.fit(X_train, y_train)

In [None]:
print("5-fold cross-validation error rate: {}%".format(round(100-100*np.mean(cross_val_score(logreg, X_train, y_train, cv=5)),2)))
print("Out-of-sample error rate: {}%".format(round(100-100*metrics.accuracy_score(y_test, logreg.predict(X_test)),2)))

In [None]:
metrics.plot_confusion_matrix(logreg, X_test, y_test, cmap=plt.cm.Blues)
plt.show()

# Comparison of Results

| Method (pre-correction)      | K-fold CV Error Rate | Out-of-sample Error Rate |
|------------------------------|----------------------|--------------------------|
| SVM                          | 39.59%               | 37.71%                   |
| Linear Discriminant Analysis | 43.16%               | 39.17%                   |
| Logistic Regression          | 42.00%               | 39.79%                   |

| Method (post-correction)     | K-fold CV Error Rate | Out-of-sample Error Rate |
|------------------------------|----------------------|--------------------------|
| SVM                          | 23.95%               | 30.54%                   |
| Linear Discriminant Analysis | 27.98%               | 31.53%                   |
| Logistic Regression          | 29.46%               | 29.06%                   |

After making the corrections to the dataset to remove bias, we can see that there is an improvement in performance as the error rates have decreased across the board for all methods. The Logistic Regression method gives the lowest out-of-sample error rate and therefore we will select this model and tune its parameters to improve its performance.

# Tuning the Parameters

## Logistic Regression

For logistic regression, the parameters that can be tuned are:
- C
- solver

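C is the inverse of the regularization strength, so smaller values mean stronger regularization. The solver is the algorithm used to solve the optimization problem. The solvers considered here are the Newton conjugate gradient method (*newton-cg*), the limited-memory Broyden–Fletcher–Goldfarb–Shanno algorithm (*lbfgs*), the Library for Large Linear Classification (*liblinear*), Stochastic Average Gradient descent (*sag*), and its variant SAGA (*saga*).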

A grid search is performed and it iterates through the possible combinations of the parameters together with K-fold cross validation to determine the best combination of parameters.

In [None]:
parameters = {'C':[0.001, 0.01, 0.1, 1, 10, 100],
              'solver':('newton-cg', 'lbfgs', 'liblinear', 'saga', 'sag')}
logreg = GridSearchCV(LogisticRegression(max_iter=500, random_state=892), parameters, cv=5)
logreg = logreg.fit(X_train, y_train)
result = pd.DataFrame(logreg.cv_results_).sort_values('rank_test_score')[['params', 'mean_test_score', 'std_test_score', 'rank_test_score']]
result.head(10)
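
Besides sorting `cv_results_`, the fitted search object exposes the winning combination directly. The following sketch runs a reduced grid on synthetic data to show those attributes; the data, the grid values, and the names `search` and `tuned_model` are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic three-class data standing in for the reclassified wine data (assumption).
X_demo, y_demo = make_classification(n_samples=300, n_features=11, n_informative=5,
                                     n_classes=3, random_state=892)

# A reduced grid to keep the sketch quick; the values are illustrative.
param_grid = {"C": [0.01, 1, 100], "solver": ["newton-cg", "lbfgs"]}
search = GridSearchCV(LogisticRegression(max_iter=500), param_grid, cv=5)
search.fit(X_demo, y_demo)

print(search.best_params_)            # combination with the highest mean CV accuracy
print(round(search.best_score_, 4))   # mean CV accuracy of that combination
# With the default refit=True, best_estimator_ has been refitted on all data passed to fit.
tuned_model = search.best_estimator_
```

Using `best_estimator_` avoids retyping the chosen parameters by hand when building the tuned model.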

From the results, the parameters that give the best performance i.e. the highest score are $C=1$ and the *liblinear* solver.

## Evaluating the tuned Logistic Regression model

In [None]:
logreg_tuned = LogisticRegression(C=1, solver='liblinear', random_state=892)
logreg_tuned = logreg_tuned.fit(X_train, y_train)

In [None]:
print("5-fold cross-validation error rate: {}%".format(round(100-100*np.mean(cross_val_score(logreg_tuned, X_train, y_train, cv=5)),2)))
print("Out-of-sample error rate: {}%".format(round(100-100*metrics.accuracy_score(y_test, logreg_tuned.predict(X_test)),2)))

Here we can see that the K-fold cross-validation error rate decreases from 29.46% to 27.97% and the out-of-sample error rate decreases from 29.06% to 28.57%, indicating that tuning the model parameters has given a small improvement in performance.
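
How much weight the out-of-sample change deserves depends on the size of the test set. As a rough check, the sketch below treats each error rate as a binomial proportion and computes its standard error. The test-set size `n_test = 203` is an assumption: it is not stated in the text, but 59/203 and 58/203 reproduce the reported percentages.

```python
import math

# Assumed test-set size: not stated in the text, but 59/203 and 58/203 reproduce
# the reported 29.06% and 28.57%.
n_test = 203
err_before = 59 / n_test
err_after = 58 / n_test

def binomial_se(p, n):
    """Standard error of a proportion p estimated from n independent samples."""
    return math.sqrt(p * (1 - p) / n)

print(f"before: {100 * err_before:.2f}% +/- {100 * binomial_se(err_before, n_test):.2f}")
print(f"after:  {100 * err_after:.2f}% +/- {100 * binomial_se(err_after, n_test):.2f}")
print(f"difference: {100 * (err_before - err_after):.2f} percentage points")
```

Under this assumption the two rates differ by a single test sample, so the out-of-sample difference is best read alongside the cross-validation result rather than on its own.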

# Plotting the Decision Regions

To visualize the decision regions, the mlxtend library was used (https://rasbt.github.io/mlxtend/). As this dataset contains more than two predictor variables, it is not possible to visualize all of them in a single 2D plot. Instead, we take two predictor variables at a time and look at the decision region boundaries of a model fitted on that pair.

The model has a hard time trying to classify poor quality wine (labeled 0), and some pairwise plots do not even have a decision region for that class, because the number of data points for that class is still quite low compared to the other classes. Despite the undersampling and reclassification, the decision regions are still quite biased towards the average quality wines (labeled 1), as the majority of the plot areas are being classified as such.

In [None]:
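plt.figure(figsize=[10,15])
# Plot fixed acidity (column 0) against each of the other ten predictors in turn.
for i in range(0,10):
    plt.subplot(5,2,i+1)
    # Refit the tuned model on just the two selected (standardized) features.
    logreg_2d = LogisticRegression(C=1, solver='liblinear', random_state=892)
    logreg_2d = logreg_2d.fit(X_train[:,[i+1,0]], y_train)
    plot_decision_regions(X_test[:,[i+1,0]], y_test, clf=logreg_2d, legend=2)
    plt.xlabel(wine_df.columns[i+1])
    plt.ylabel(wine_df.columns[0])
# Adjust the subplot spacing once, after all panels are drawn.
plt.tight_layout()
plt.show()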

# Conclusion

Despite the efforts made to improve the quality of both the dataset and the model, a satisfactory performance could not be achieved. Looking at the pairwise plots of the variables, the different classes are clustered together and overlap with one another, making them difficult to separate, linearly or otherwise.

It is possible that there are other methods more suited to tackling this problem or perhaps it would be more appropriate to frame it as a regression problem instead of a classification problem, but it is outside the scope of this module and will be left for future consideration.