# Red wine quality - binary classification: finding a good wine

In this kernel we will try to find good wines! This is a continuation of https://www.kaggle.com/anarthal/red-wine-quality-linear-regression/, where we tried to predict the quality of red wine based on its physical properties.

We will try a different approach here: my dear uncle, a seasoned Basque wine lover, is coming to have dinner with us next week. I want to impress him but, well, my knowledge of wine is not the best. So let's build a model to help us!

This dataset contains different wines with their physical properties, together with a numerical quality ranging from 3 to 8, where 8 means an excellent wine and 3 means... not so excellent. I don't think my uncle is very interested in me saying "hey, try this wine, I bet it's a 5-quality wine!". I just want to be confident enough to say "try this *excellent* wine!". So we will transform the ordered categorical `quality` variable into a binary `high_quality` variable, and will try to predict that instead. Let's get started!

First of all, some imports...

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import precision_recall_fscore_support, make_scorer, fbeta_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
%matplotlib inline

Let's have a look at our data...

In [None]:
df = pd.read_csv('../input/red-wine-quality-cortez-et-al-2009/winequality-red.csv')
df.head()

In [None]:
# This dataset is really clean, let's double check
df.info()

In [None]:
sns.countplot(x='quality', data=df)

## 1. What is good wine?

Looking at the above plot, we see that the majority of the wines have quality 5 or 6. Wines with quality 7 or higher are rare. I want to impress my uncle, so let's say a good wine is one with quality 7 or higher (and, well, this is actually what the dataset description recommends as a threshold ;)).

Let's build the `high_quality` column and analyze what we have created:

In [None]:
df['high_quality'] = (df['quality'] >= 7).astype('int64')
df['high_quality'].value_counts() / df['high_quality'].count()

In [None]:
df['high_quality'].value_counts().plot.pie(explode=[0, 0.1], figsize=(7,7))

So just 13% of our wines are worthy. Looks like we have a little bit of imbalance here.

## 2. What makes good wine good?

Let's now see which features are most strongly correlated with `high_quality`:

In [None]:
corr = df.drop(columns='quality').corr()
idx = corr['high_quality'].abs().sort_values(ascending=False).index[:5]
idx_features = idx.drop('high_quality')
sns.heatmap(corr.loc[idx, idx])

In [None]:
_, ax = plt.subplots(1, 4, figsize=(20, 5))
for i, var in enumerate(idx_features):
    sns.boxplot(x='high_quality', y=var, data=df, ax=ax.flatten()[i])
sns.despine()

So we can see the following relationships:
- The more alcohol, the better the wine. The literature suggests the best alcohol values are around 14%. All our wines are below that, so this makes sense.
- The less volatile acidity, the better the wine. Volatile acids, like acetic acid, can make the wine taste sour and are considered a cause of wine fault. A negative correlation here also makes sense.
- The more citric acid, the better the wine. Apparently, citric acid gives wine a fresher flavour, but I couldn't find anything clear about this in the literature.
- The more sulphates, the better the wine.
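
The ranking above relies on the Pearson correlation between each feature and the 0/1 `high_quality` label. As a rough illustration of what that number measures, here is a minimal sketch on made-up values. The `alcohol` and `label` lists are hypothetical and are not taken from the dataset, and `pearson` is just a helper written for this example:

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical toy data: alcohol content and a 0/1 quality label
alcohol = [9.4, 9.8, 10.0, 10.5, 11.2, 12.0, 12.5, 13.0]
label = [0, 0, 0, 0, 0, 1, 1, 1]

r = pearson(alcohol, label)
print(f"correlation = {r:.3f}")  # positive: higher alcohol goes with label 1
```

A positive value means the feature tends to be higher for good wines, and a negative value means the opposite. Correlation only captures linear association, so it is a rough guide rather than a true measure of importance.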

## 3. Some preprocessing!

We will take two pre-processing steps for our data:
- *Scaling*. The ranges of the features are not wildly different, but they do differ. We are going to try several different classification algorithms, including SVM, which is sensitive to the scale of its features. We will use sklearn's `StandardScaler`, which applies a linear transformation to our features such that their mean is zero and their variance is one. We will apply scaling for all classifiers.
- *Polynomials*. We will be creating polynomial features for some of the linear classifiers. So, for any pair of features, like $alcohol$ and $sulphates$, we will create $alcohol^2$, $sulphates^2$, $alcohol * sulphates$ and so on.
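
To make the polynomial step concrete, here is a small hand-rolled sketch of a degree-2 expansion. It is not sklearn's implementation: `poly_terms` and `poly_expand` are hypothetical helpers written for this example, and they assume the bias column is included, as with the default `include_bias=True`.

```python
from itertools import combinations_with_replacement
from math import comb, prod

def poly_terms(names, degree=2):
    """List the monomials of total degree <= `degree`, bias term included."""
    terms = []
    for d in range(degree + 1):
        for combo in combinations_with_replacement(names, d):
            terms.append(" * ".join(combo) if combo else "1")
    return terms

def poly_expand(row, degree=2):
    """Evaluate the same monomials for one row of numeric features."""
    return [
        prod(combo)
        for d in range(degree + 1)
        for combo in combinations_with_replacement(row, d)
    ]

print(poly_terms(["alcohol", "sulphates"]))
print(poly_expand([2.0, 3.0]))

# Number of output columns for n features at degree 2: C(n + 2, 2)
for n in (4, 11):
    print(n, "features ->", comb(n + 2, 2), "columns")
```

The column count grows quickly: under these assumptions, 4 input features become 15 columns and 11 input features become 78, which is worth keeping in mind when thinking about overfitting.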

We will define these as re-usable functions:

In [None]:
def scale(X_train, X_test):
    # Note that we only use the training data to fit the scaler, so that
    # no information from the test set leaks into our model
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)
    return X_train, X_test

def polynomials(X_train, X_test, degree=2):
    pol = PolynomialFeatures(degree)
    return pol.fit_transform(X_train), pol.transform(X_test)

def split(X, y):
    return train_test_split(X, y, test_size=0.3, random_state=1)
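
The comment in `scale` about fitting only on the training data can be illustrated by hand. This is a minimal sketch on a single hypothetical feature (the `train` and `test` values are made up). The mean and standard deviation are computed from the training values only and then reused, unchanged, on the test values:

```python
from statistics import fmean, pstdev

# Hypothetical single-feature data
train = [9.0, 10.0, 11.0, 12.0]
test = [10.5, 14.0]

# "Fit": the statistics come from the training set only
mu = fmean(train)
sigma = pstdev(train)  # population standard deviation (ddof=0)

def standardize(values):
    """Apply the training-set mean and std to any set of values."""
    return [(v - mu) / sigma for v in values]

train_scaled = standardize(train)
test_scaled = standardize(test)

print(train_scaled)  # mean 0 and variance 1 by construction
print(test_scaled)   # not necessarily mean 0: test statistics were never used
```

The scaled test values do not need to have zero mean or unit variance. That is expected, and it is the price of keeping the test set unseen.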

## 4. Training and evaluation

The default scoring for classification is accuracy (correct classification count / total count). However, we are facing a classification problem with skewed classes, so accuracy may not be the best measure. We also have *precision*, *recall*, and *F1*, which are more suitable for this case. I will be reporting all these scores using sklearn's `precision_recall_fscore_support`.
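
To see why accuracy can mislead with skewed classes, consider a sketch with made-up labels (13 good wines out of 100, a hypothetical proportion chosen to resemble ours) and a useless "model" that never recommends any wine:

```python
# Hypothetical labels: 13 good wines out of 100
y_true = [1] * 13 + [0] * 87
# A useless "model" that never recommends any wine
y_pred = [0] * 100

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))

accuracy = (tp + tn) / len(y_true)
recall = tp / (tp + fn) if tp + fn else 0.0
# Precision is undefined when nothing is predicted positive; report it as 0
precision = tp / (tp + fp) if tp + fp else 0.0

print(f"accuracy={accuracy:.2f}, precision={precision:.2f}, recall={recall:.2f}")
# → accuracy=0.87, precision=0.00, recall=0.00
```

An accuracy of 87% looks respectable, yet this model never finds a single good wine. Precision and recall make that failure visible.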

However, it is good to have a single-number evaluation metric in order to compare different models objectively. In this case, what I want is to be confident that my wine will be a good one so I don't let my uncle down. In other words, I am more concerned about false positives than about false negatives, so I will prioritize precision over recall. I don't want to say "_no wine is worthy_" either, so I will use a weighted version of F1, `fbeta_score`. Beta is a parameter that controls how much weight recall gets relative to precision: beta=0 means 'consider only precision', beta=1 gives the usual F1, and beta=inf means 'consider only recall'.
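
For reference, the score is defined as $F_\beta = (1 + \beta^2) \cdot \frac{P \cdot R}{\beta^2 \cdot P + R}$, where $P$ is precision and $R$ is recall. The sketch below evaluates this formula directly for a hypothetical classifier with high precision and low recall. The `fbeta` helper and the values of `p` and `r` are illustrative assumptions, not results from this kernel:

```python
def fbeta(precision, recall, beta):
    """F-beta score: weighted harmonic mean of precision and recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Hypothetical classifier: precise, but with low recall
p, r = 0.8, 0.4
for beta in (0.25, 1.0, 4.0):
    print(f"beta={beta}: {fbeta(p, r, beta):.4f}")
# → beta=0.25: 0.7556
# → beta=1.0: 0.5333
# → beta=4.0: 0.4121
```

With a small beta the score stays close to the precision (0.8), while a large beta pulls it towards the recall (0.4). That is why a beta below 1 suits the goal of not disappointing my uncle.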

Let's define a function to do the training and evaluation:

In [None]:
f1_beta = 0.25

def _compute_all_scores(model, X, y):
    return precision_recall_fscore_support(y, model.predict(X), average='binary', beta=f1_beta)[:-1]

def train(X_train, X_test, y_train, y_test, model):
    model.fit(X_train, y_train)
    # The third score is F-beta with beta=f1_beta, not the plain F1
    prec, rec, fbeta = _compute_all_scores(model, X_train, y_train)
    print('TRAIN: prec={:.4f}, recall={:.4f}, fbeta={:.4f}'.format(prec, rec, fbeta))
    prec, rec, fbeta = _compute_all_scores(model, X_test, y_test)
    print('TEST : prec={:.4f}, recall={:.4f}, fbeta={:.4f}'.format(prec, rec, fbeta))

scorer = make_scorer(fbeta_score, beta=f1_beta)

In [None]:
# These are our features. X_subset are the ones that we identified as the most relevant.
X = df.drop(columns=['quality', 'high_quality'])
X_subset = X[idx_features].copy()
y = df['high_quality']

### 4.1. Logistic regression

Let's start with the simplest possible model: a linear one, no polynomial features, no hyperparameter tuning.

In [None]:
X_train, X_test, y_train, y_test = split(X, y)
X_train, X_test = scale(X_train, X_test)
train(X_train, X_test, y_train, y_test, LogisticRegression(random_state=0))

Not bad, but we can do better than this. As this is a linear model, it may be too simple, and it may benefit from polynomial features. Let's try them on the subset of most relevant features:

In [None]:
X_train, X_test, y_train, y_test = split(X_subset, y)
X_train, X_test = scale(X_train, X_test)
X_train, X_test = polynomials(X_train, X_test)
train(X_train, X_test, y_train, y_test, LogisticRegression(random_state=0))

That's a little bit better. Let's now try to automatically tune hyperparameters for better performance. Fortunately, sklearn has this built in for logistic regression as `LogisticRegressionCV`, which selects the regularization strength by cross-validation:

In [None]:
X_train, X_test, y_train, y_test = split(X, y)
X_train, X_test = scale(X_train, X_test)
X_train, X_test = polynomials(X_train, X_test)
train(X_train, X_test, y_train, y_test, LogisticRegressionCV(random_state=0, max_iter=500, scoring=scorer))

Much better on the training set, worse on the test set. It seems like we are overfitting the training set. Let's try simplifying the model a little bit by using only the subset of most relevant features:

In [None]:
X_train, X_test, y_train, y_test = split(X_subset, y)
X_train, X_test = scale(X_train, X_test)
X_train, X_test = polynomials(X_train, X_test)
train(X_train, X_test, y_train, y_test, LogisticRegressionCV(random_state=0, max_iter=500, scoring=scorer))

That's a little better, but let's try different classifiers and see how they perform.

### 4.2. Decision trees and random forest

Let's first start with a simple decision tree:

In [None]:
X_train, X_test, y_train, y_test = split(X, y)
X_train, X_test = scale(X_train, X_test)
train(X_train, X_test, y_train, y_test, DecisionTreeClassifier(random_state=0, max_leaf_nodes=30))

Promising: it did better than the best tuned logistic regression classifier. Let's now try the omnipresent random forest:

In [None]:
X_train, X_test, y_train, y_test = split(X, y)
X_train, X_test = scale(X_train, X_test)
train(X_train, X_test, y_train, y_test, RandomForestClassifier(random_state=0))

I can now understand why it is so popular. It seems to have completely overfit the training set, but it still performs well on the test set.

Let's try tuning hyperparameters. We will do an exhaustive grid search. Let's set `n_estimators` to values higher than the default, as increasing this tends to decrease overfitting; and let's try smaller values for `max_features` and `max_depth`, as bigger values for these make trees more complex, thus increasing the chance of overfitting:

In [None]:
X_train, X_test, y_train, y_test = split(X, y)
X_train, X_test = scale(X_train, X_test)
model = GridSearchCV(RandomForestClassifier(random_state=0), {
    'n_estimators': [100, 200, 300],
    'max_features': [1, 2, 3],
    'max_depth' : [4, 6, 8]
}, scoring=scorer)
train(X_train, X_test, y_train, y_test, model)
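
To get a feel for what "exhaustive" means here, the sketch below simply enumerates the grid used above. The fold count is an assumption: 5-fold cross-validation is the default `cv` in recent scikit-learn versions, and the kernel does not set it explicitly.

```python
from itertools import product

param_grid = {
    'n_estimators': [100, 200, 300],
    'max_features': [1, 2, 3],
    'max_depth': [4, 6, 8],
}

# Every combination of hyperparameter values that the search will try
combos = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]
print(len(combos), "candidate settings")

# Assumption: 5-fold cross-validation (default in recent scikit-learn versions)
n_folds = 5
print(len(combos) * n_folds, "cross-validation fits, plus one final refit")
```

Under that assumption the search trains 27 x 5 = 135 forests before refitting the best one on the full training set, so adding values to the grid gets expensive quickly.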

### 4.3. Support vector machines

Let's first try an SVM with an RBF kernel:

In [None]:
X_train, X_test, y_train, y_test = split(X, y)
X_train, X_test = scale(X_train, X_test)
train(X_train, X_test, y_train, y_test, SVC(random_state=0))

It outperforms logistic regression, but not the random forest classifier. Let's now try a linear kernel with some polynomial features instead:

In [None]:
X_train, X_test, y_train, y_test = split(X, y)
X_train, X_test = scale(X_train, X_test)
X_train, X_test = polynomials(X_train, X_test)
train(X_train, X_test, y_train, y_test, SVC(random_state=0, kernel='linear'))

## 5. Conclusion

After evaluating all these different classifiers, the best one appears to be the random forest with tuned hyperparameters. Hope my uncle enjoys the wine.

This takes us to the end of this kernel! Thank you for reading it. Any comments or suggestions are more than welcome. If you found it useful, please leave an upvote!