# Red wine quality - linear regression
In this kernel I will apply different flavours of linear regression to try to predict wine quality as a function of its physical properties.

Some of the ideas for the EDA in this kernel have been inspired by https://www.kaggle.com/pmarcelino/comprehensive-data-exploration-with-python.

In [None]:
# Let's start with some imports and loading our dataset

import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, learning_curve, cross_val_score
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.preprocessing import StandardScaler, PolynomialFeatures

df = pd.read_csv('../input/red-wine-quality-cortez-et-al-2009/winequality-red.csv')
df.head()

## Let's meet the dataset
Our target variable is `quality`. It looks like an ordered categorical variable, where a higher value means a better wine. Let's explore what it looks like:

In [None]:
df['quality'].describe()

In [None]:
sns.countplot(x='quality', data=df)

Quality scores in this dataset range from 3 to 8. Most wines are of average quality (5-6), some are really good (7-8), and some are really poor (3-4).

Let's now identify which physical properties are most strongly related to quality. We will use the correlation matrix and rank by absolute value, so that negative correlations are also taken into account: some physical properties, like excessive acidity, may affect wine quality negatively.

In [None]:
corr = df.corr()
idx = corr['quality'].abs().sort_values(ascending=False).index[:5]
idx_features = idx.drop('quality')
sns.heatmap(corr.loc[idx, idx])

According to the correlation matrix, the four most influential properties are:
- Alcohol. Seems like a positive correlation: the more alcohol, the better the wine. The literature suggests there is an optimal value for alcohol around 13.6-14.0.
- Volatile acidity. This represents the presence of certain volatile acids, like acetic acid. Too much acetic acid is considered a wine fault. We've got a negative correlation here, which makes sense.
- Sulphates. Weak positive correlation.
- Citric acid. May be added to wine to give a 'fresher' flavor. Weak positive correlation.
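
To make the ranking step concrete, here is a minimal sketch of sorting features by absolute correlation while keeping track of the sign. It runs on a synthetic frame, not on the wine data; the column names `pos`, `neg`, `noise` and `target` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the wine table (hypothetical columns, not real data)
rng = np.random.default_rng(0)
n = 500
toy = pd.DataFrame({
    'pos': rng.normal(size=n),    # will correlate positively with the target
    'neg': rng.normal(size=n),    # will correlate negatively with the target
    'noise': rng.normal(size=n),  # unrelated to the target
})
toy['target'] = 2.0 * toy['pos'] - 1.0 * toy['neg'] + rng.normal(scale=0.5, size=n)

corr_toy = toy.corr()
# Rank features by the strength of the correlation, ignoring its sign
ranked = corr_toy['target'].drop('target').abs().sort_values(ascending=False)
# Recover the sign afterwards, so negative relationships stay visible
signs = np.sign(corr_toy.loc[ranked.index, 'target'])
print(pd.DataFrame({'abs_corr': ranked, 'sign': signs}))
```

Sorting on `.abs()` is what lets a strongly negative feature such as volatile acidity rank alongside the positive ones.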

Let's see what these variables look like:

In [None]:
_, ax = plt.subplots(2, 2, figsize=(20, 10))
for var, axis in zip(idx_features, ax.flatten()):
    df[var].plot.hist(ax=axis)
    axis.set_xlabel(var)

Almost all wines are below the optimal alcohol content, which would explain the strong positive correlation. The ranges of these four variables are not very different, so we may not need feature scaling. Let's check this for the rest of the variables:

In [None]:
df.describe()

There are no drastic differences in scale so we will not do feature scaling at all.
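
As a side check on that decision, the sketch below compares fits with and without `StandardScaler` on synthetic data (not the wine table). It illustrates a general property: an unregularized linear regression gives the same predictions either way, while a penalized model such as `Ridge` can change when feature scales differ a lot. The feature scales and coefficients used here are arbitrary assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features on deliberately different scales (assumed values)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3)) * np.array([1.0, 10.0, 0.01])
y_toy = (X_toy[:, 0] + 0.1 * X_toy[:, 1] + 50.0 * X_toy[:, 2]
         + rng.normal(scale=0.1, size=200))

def fitted_predictions(estimator, scale):
    # Optionally standardize the features before fitting
    model = make_pipeline(StandardScaler(), estimator) if scale else estimator
    return model.fit(X_toy, y_toy).predict(X_toy)

# Largest difference in predictions between the scaled and unscaled fits
ols_gap = np.max(np.abs(fitted_predictions(LinearRegression(), False)
                        - fitted_predictions(LinearRegression(), True)))
ridge_gap = np.max(np.abs(fitted_predictions(Ridge(alpha=2.0), False)
                          - fitted_predictions(Ridge(alpha=2.0), True)))
print('OLS gap: {:.2e}, Ridge gap: {:.2e}'.format(ols_gap, ridge_gap))
```

If the regularized models later in the kernel seem sensitive to the choice of `alpha`, rerunning them inside such a pipeline would be one way to rule out scale effects.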

Let's now visualize the four most relevant variables, together with `quality`, and their interactions:

In [None]:
sns.pairplot(df, vars=idx)

Sulphates seem to have some positive correlation with citric acid. Volatile acidity is negatively correlated with citric acid; since citric acid is not a volatile acid, this is not that surprising. Other than that, most variables do not appear to be strongly correlated.

As quality is an ordered categorical variable, it is difficult to visualize relationships using scatter plots - let's do some box plots instead:

In [None]:
_, ax = plt.subplots(2, 2, figsize=(20, 10))
for i, var in enumerate(idx.drop('quality')):
    sns.boxplot(x='quality', y=var, data=df, ax=ax.flatten()[i])

This confirms our suspicions, more or less. Note that the quality-alcohol relationship only appears to be linear for values of quality between 5 and 8, so we may want to use a higher-order polynomial to model it. There are also a lot of outliers in sulphates, which we may want to treat somehow (TODO).
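
One possible starting point for that TODO is to count the outliers with the same 1.5 x IQR rule that box plots use by default for their whiskers. The sketch below uses a small made-up series rather than the real `sulphates` column, and `iqr_outlier_mask` is a hypothetical helper name.

```python
import pandas as pd

def iqr_outlier_mask(series, whis=1.5):
    """Flag values lying more than whis * IQR outside the quartiles."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (series < q1 - whis * iqr) | (series > q3 + whis * iqr)

# Made-up values for illustration only
toy = pd.Series([0.5, 0.55, 0.6, 0.62, 0.65, 0.7, 0.75, 2.0])
mask = iqr_outlier_mask(toy)
print('outliers:', toy[mask].tolist())
```

On the real data this would be called as `iqr_outlier_mask(df['sulphates'])`; whether to drop, clip or keep the flagged rows is a separate decision.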

Let's get into creating our model. We will try several different linear models, with and without regularization and polynomial features. We will also plot learning curves to see whether we have bias or variance problems. Let's define some functions to do that:

In [None]:
def plot_learning_curves(X, y, model):
    # learning_curve returns one row of scores per training size
    # and one column per cross-validation fold
    train_sizes, train_scores, cv_scores = learning_curve(model, X, y)
    # Average over the folds, skipping the smallest training size
    train_scores = np.mean(train_scores[1:], axis=1)
    cv_scores = np.mean(cv_scores[1:], axis=1)
    plt.figure(figsize=(10, 10))
    plt.plot(train_sizes[1:], train_scores, label='Train')
    plt.plot(train_sizes[1:], cv_scores, label='CV')
    plt.xlabel('Sample size')
    plt.ylabel('R2')
    plt.legend()

def train(X, y, model, poly_degree=None):
    if poly_degree is not None:
        # Expand the features up front so that the scores and the learning
        # curves are computed on the same inputs. PolynomialFeatures learns
        # nothing from the data values, so doing this before the split
        # does not leak information from the test set.
        pol = PolynomialFeatures(poly_degree, include_bias=False)
        X = pol.fit_transform(X)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
    model.fit(X_train, y_train)
    r2_train = model.score(X_train, y_train)
    r2_test = model.score(X_test, y_test)
    print('r2_train = {:.3f}, r2_test = {:.3f}'.format(r2_train, r2_test))
    plot_learning_curves(X, y, model)
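
To show what `learning_curve` returns, and why the scores are averaged over `axis=1`, here is a minimal sketch on synthetic regression data generated with `make_regression` (an assumption made for illustration; the wine data is not used). `cv=5` is passed explicitly here so that the shapes are predictable.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import learning_curve

# Synthetic regression problem, used only to inspect the output shapes
X_toy, y_toy = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

sizes, train_scores, cv_scores = learning_curve(LinearRegression(), X_toy, y_toy, cv=5)

# One row per training size, one column per cross-validation fold
print(sizes)
print(train_scores.shape, cv_scores.shape)

# Averaging over axis=1 collapses the folds into one score per training size
mean_cv = cv_scores.mean(axis=1)
```

A large, persistent gap between the train and CV curves points to variance (overfitting), while two curves that converge at a low score point to bias.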

## 1. Simple linear model

KISS! Let's start with the simplest possible model and feed all variables into a plain linear regression:

In [None]:
# Simple linear regression
features = df.drop(columns='quality')
X = features.copy()
y = df['quality']
train(X, y, LinearRegression())

Not great. Let's now try with polynomial features:

In [None]:
# Let's try adding some polynomial features
train(X, y, LinearRegression(), poly_degree=2)

It seems like we have a little bit of overfitting around here. Let's add some regularization to the equation:

In [None]:
train(X, y, Ridge(alpha=2.0), poly_degree=2)
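
The regularization strength above is fixed by hand. One way to compare several values, sketched here on synthetic data from `make_regression`, is to score each candidate with `cross_val_score`. The candidate list `alphas` is an arbitrary assumption, not a recommendation for the wine data.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data for illustration only
X_toy, y_toy = make_regression(n_samples=150, n_features=4, noise=20.0, random_state=0)

alphas = [0.1, 1.0, 2.0, 5.0, 10.0]  # hypothetical candidate values
mean_r2 = {}
for alpha in alphas:
    # The pipeline applies the polynomial expansion inside each CV fold
    model = make_pipeline(PolynomialFeatures(2, include_bias=False), Ridge(alpha=alpha))
    # For regressors the default score is R2
    mean_r2[alpha] = cross_val_score(model, X_toy, y_toy, cv=5).mean()

best_alpha = max(mean_r2, key=mean_r2.get)
print(mean_r2, best_alpha)
```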

We did a _little_ bit better than the simple linear model. How about selecting the top four features?

In [None]:
feature_subset = df[idx].drop(columns='quality')
train(feature_subset, y, Ridge(alpha=5.0), poly_degree=2)

That didn't help either. So far, the best model we have achieved is the regularized linear model with degree-two polynomial features - and it's not a great model.

This concludes my exploration of linear models for this dataset. In upcoming kernels, I will approach this problem differently - as a classification problem, trying to predict whether a wine has good quality (`quality` > threshold) or not.
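
As a rough sketch of how such a target could be built: the threshold value is not fixed anywhere in this kernel, so the `6` below is purely a placeholder, and the scores are made up.

```python
import pandas as pd

# Made-up quality scores for illustration
quality = pd.Series([3, 5, 6, 7, 8, 5])

threshold = 6  # hypothetical cut-off, not chosen in this kernel
# 1 for a "good" wine (quality strictly above the threshold), 0 otherwise
is_good = (quality > threshold).astype(int)
print(is_good.tolist())
```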

If you have any feedback or suggestions to improve this notebook, please let me know! And if you found it useful, please leave an upvote :)