# Using Logistic Regression to Predict Red Wine Quality

In this notebook, we will try to predict the quality of red wines using logistic regression. Logistic regression is a natural fit because the goal is a binary classification: labelling each wine as "Good" or "Bad" based on its chemical components.

In [None]:
'''
We need these imports
'''

import pandas as pd
import numpy as np
from sklearn import preprocessing
import matplotlib.pyplot as plt 
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
import seaborn as sns

plt.rc("font", size=14)
sns.set(style="white")
sns.set(style="whitegrid", color_codes=True)

%matplotlib inline

In [None]:
'''
Helper function that turns a quality score into a binary label.
Scores of 7 or higher are considered good (1).
Scores below 7 are considered bad (0).
(Quality in this dataset ranges from 3 to 8.)
'''

def good_to_one(col):
    quality = col.iloc[0]  # positional access; col[0] on a labelled Series is deprecated in recent pandas
    
    if quality >= 7:
        return 1
    else:
        return 0

In [None]:
df = pd.read_csv("../input/red-wine-quality-cortez-et-al-2009/winequality-red.csv")

df.info()

## Data Preparation:

Thankfully, there are no missing values in this dataset. We will replace the "quality" variable with a dummy variable. A dummy variable takes only two values, 0 and 1, which suits both logistic regression and the classification we are hoping to achieve. We will name our dummy variable "quality_1".

In [None]:
#We don't want to change the original set, as we might need it later.
df_good = df.copy()

df_good['quality'] = df_good[['quality']].apply(good_to_one, axis=1)

# dtype=int keeps the dummy as 0/1; recent pandas versions return booleans by default
quality = pd.get_dummies(df_good['quality'], drop_first=True, prefix="quality", dtype=int)
df_good.drop(['quality'],axis=1,inplace=True)

df_good = pd.concat([df_good, quality], axis=1)

df_good.head(10)
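
The same 0/1 encoding can also be produced with a single vectorized comparison. The snippet below is only a sketch on a hypothetical toy frame (`toy` is not part of this notebook), showing that the `>= 7` rule maps scores to the labels described above.

```python
import pandas as pd

# Hypothetical toy scores standing in for the real "quality" column
toy = pd.DataFrame({"quality": [3, 5, 6, 7, 8]})

# Vectorized equivalent of good_to_one: 1 when quality >= 7, else 0
toy["quality_1"] = (toy["quality"] >= 7).astype(int)

print(toy["quality_1"].tolist())  # [0, 0, 0, 1, 1]
```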

## Data Exploration:

In [None]:
# print explicitly: only the last expression of a cell is displayed automatically
print(df_good['quality_1'].value_counts())
sns.countplot(x='quality_1', data=df_good, palette='hls')
plt.show()

There are 217 good quality and 1382 bad quality wines. Our classes are not balanced: the majority is "bad" quality wines. If we train a model now, it will be biased toward predicting "bad" quality, because always guessing the majority class already scores well on accuracy. We can either under-sample the "bad" group down to 217 samples or over-sample to increase the size of the "good" group. In this notebook, we will use over-sampling, because randomly discarding most of one group throws away data and can give models that vary from run to run. Before balancing the groups, we can explore the data a little further.
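
To see why the imbalance matters, here is a small back-of-the-envelope sketch using only the two class counts quoted above. The "always bad" classifier is a hypothetical baseline, not a model trained in this notebook.

```python
# Class counts reported above for this dataset
n_good, n_bad = 217, 1382

# A hypothetical "model" that labels every wine as bad is right on all bad wines
baseline_accuracy = n_bad / (n_good + n_bad)

# ...but it never finds a good wine, so its recall on the good class is zero
baseline_recall_good = 0 / n_good

print(f"accuracy: {baseline_accuracy:.3f}, recall (good): {baseline_recall_good:.1f}")
```

Accuracy alone would make this useless baseline look respectable, which is why we balance the classes and later look at precision and recall too.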

In [None]:
df.groupby('quality').mean()

In [None]:
df_good.groupby('quality_1').mean()

- Bad quality wines have, on average, higher volatile acidity (acetic acid) but lower citric acid.
- Good quality wines have lower chlorides.
- Good quality wines have lower free and total sulfur dioxide but higher sulphates.
- Bad quality wines tend to be less alcoholic.

We will keep those in mind while creating our model.

## Over-sampling

We will use SMOTE (Synthetic Minority Oversampling Technique) to oversample our data.
SMOTE:

1. Creates synthetic samples from the minority class (good quality) instead of creating copies.
2. Picks a minority sample, randomly chooses one of its k nearest neighbors, and uses it to create a similar, but randomly tweaked, new observation.
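
The two steps above can be sketched in plain NumPy. This is a simplified illustration of the interpolation idea, not the `imblearn` implementation. The function name `synthesize`, its parameters, and the toy points are all hypothetical, and the sketch assumes the minority points are distinct.

```python
import numpy as np

def synthesize(minority, k=2, n_new=5, seed=0):
    """Create n_new synthetic rows by interpolating toward a random k-nearest neighbour."""
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)
    new_rows = []
    for _ in range(n_new):
        i = rng.integers(len(minority))
        # Euclidean distances from the chosen sample to every minority sample
        dists = np.linalg.norm(minority - minority[i], axis=1)
        # Skip position 0 of the sort: it is the sample itself (distance 0)
        neighbours = np.argsort(dists)[1:k + 1]
        j = rng.choice(neighbours)
        lam = rng.random()  # position along the segment, in [0, 1)
        new_rows.append(minority[i] + lam * (minority[j] - minority[i]))
    return np.array(new_rows)

# Hypothetical minority-class points with two features each
toy_minority = [[1.0, 1.0], [2.0, 1.5], [1.5, 3.0], [3.0, 2.0]]
synthetic = synthesize(toy_minority)
print(synthetic.shape)  # (5, 2)
```

Every synthetic row lies on a line segment between two real minority samples, which is why the new points look plausible rather than being exact copies.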

Source used for Over-sampling, RFE (next section) and ROC (last section) with logistic regression:
https://towardsdatascience.com/building-a-logistic-regression-in-python-step-by-step-becd4d56c9c8

(I suggest you give it a read, as it covers some techniques we did not need in this notebook thanks to our complete dataset.)

In [None]:
from imblearn.over_sampling import SMOTE

x = df_good.loc[:, df_good.columns != 'quality_1']
y = df_good.loc[:, df_good.columns == 'quality_1']

over_samp = SMOTE(random_state=0)

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=0)
columns = x_train.columns

overs_data_x, overs_data_y = over_samp.fit_resample(x_train, y_train)
overs_data_x = pd.DataFrame(data=overs_data_x,columns=columns )
overs_data_y= pd.DataFrame(data=overs_data_y,columns=['quality_1'])

# We can check the class counts in the oversampled data (quality_1: 1 = good, 0 = bad)
print("Length of oversampled data is ",len(overs_data_x))
print("Number of bad quality in oversampled data ",len(overs_data_y[overs_data_y['quality_1']==0]))
print("Number of good quality in oversampled data ",len(overs_data_y[overs_data_y['quality_1']==1]))
print("Proportion of bad quality data in oversampled data is ",len(overs_data_y[overs_data_y['quality_1']==0])/len(overs_data_x))
print("Proportion of good quality data in oversampled data is ",len(overs_data_y[overs_data_y['quality_1']==1])/len(overs_data_x))

## Recursive Feature Elimination

RFE lets us eliminate the worst performing features: it repeatedly fits a model and drops the weakest feature until the requested number remains. Removing these features can make our model more accurate, as it won't be distracted by non-significant features. Think of it like removing noise from a voice recording or an image.
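
As a quick illustration of the mechanism, the sketch below runs RFE on a synthetic dataset and asks for fewer features than are available. The data from `make_classification` and the names `X_toy`, `y_toy` and `selector` are assumptions for this example only and have nothing to do with the wine data.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Hypothetical toy problem: 6 features, only some of them informative
X_toy, y_toy = make_classification(n_samples=200, n_features=6,
                                   n_informative=3, random_state=0)

# Ask RFE to keep 3 of the 6 features, dropping one per iteration
selector = RFE(LogisticRegression(solver="liblinear"), n_features_to_select=3)
selector.fit(X_toy, y_toy)

print(selector.support_)   # True for the features that were kept
print(selector.ranking_)   # 1 = kept; larger numbers were eliminated earlier
```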

In [None]:
from sklearn.feature_selection import RFE

In [None]:
df_good_vars = df_good.columns.values.tolist()
y = ['quality_1']
x = [d for d in df_good_vars if d not in y]

logreg = LogisticRegression(solver='liblinear')
rfe = RFE(logreg, n_features_to_select=20)
rfe = rfe.fit(x_train, y_train.values.ravel())

print(rfe.support_)
print(rfe.ranking_)

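We will keep all features: because n_features_to_select=20 is larger than the 11 features available, RFE has nothing to eliminate and marks every feature as selected. But don't worry, in the next section we will eliminate some of them manually.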

In [None]:
cols=['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar', 
      'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density', 
      'pH', 'sulphates', 'alcohol']
x = overs_data_x[cols]
y = overs_data_y['quality_1']

## Model

In [None]:
import statsmodels.api as sm

In [None]:
logit_model = sm.Logit(y, x)
result = logit_model.fit()
print(result.summary2())

(Don't mind the warnings; they concern the top half of the summary, which we can compute in the cell below.)

A-ha! Take a look at the column named "P>|z|". We can remove variables with p-values higher than 0.05, since their coefficients are not statistically distinguishable from zero at the 5% level. Goodbye fixed acidity, citric acid, free sulfur dioxide and pH.
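
For intuition about where that column comes from: a two-sided p-value of this kind can be obtained from a z-statistic (coefficient divided by its standard error) under a normal approximation. The sketch below assumes exactly that. The helper name `wald_p_value` and the numbers fed into it are hypothetical and are not taken from the summary above.

```python
import math

def wald_p_value(coef, std_err):
    """Two-sided p-value for H0: coefficient == 0, using a normal approximation."""
    z = coef / std_err
    # P(|Z| > |z|) for a standard normal equals erfc(|z| / sqrt(2))
    return math.erfc(abs(z) / math.sqrt(2))

# Hypothetical coefficient / standard-error pairs
strong = wald_p_value(0.50, 0.10)  # z = 5.0, clearly significant
weak = wald_p_value(0.05, 0.10)    # z = 0.5, not significant at the 5% level

print(f"strong: {strong:.2e}, weak: {weak:.3f}")
```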

In [None]:
glm_model = sm.GLM(y, x, family=sm.families.Binomial())
result = glm_model.fit()
print(result.summary2())

In [None]:
cols = ['volatile acidity', 'residual sugar', 'chlorides', 'total sulfur dioxide', 
        'density', 'sulphates', 'alcohol']
x = overs_data_x[cols]
y = overs_data_y['quality_1']

logit_model = sm.Logit(y, x)
result = logit_model.fit()
print(result.summary2())

Notice how the p-values changed:

- residual sugar decreased from 0.0044 to 0.0013,
- chlorides decreased from 0.0006 to 0.0002,
- density decreased from 0.0002 to <0.0001.

As you can see, removing the non-significant features gets rid of some noise, and the coefficients of the remaining features are now estimated with more confidence.

In [None]:
cols = ['volatile acidity', 'residual sugar', 'chlorides', 'total sulfur dioxide', 
        'density', 'sulphates', 'alcohol']
x = overs_data_x[cols]
y = overs_data_y['quality_1']

glm_model = sm.GLM(y, x, family=sm.families.Binomial())
result = glm_model.fit()
print(result.summary2())

In [None]:
x_train, x_test, y_train, y_test = train_test_split(x,
                                                    y,
                                                    test_size=0.30,
                                                    random_state=0)
logreg = LogisticRegression(solver='liblinear')
logreg.fit(x_train, y_train)

In [None]:
y_pred = logreg.predict(x_test)
print('Accuracy of logistic regression classifier on test set: {:.2f}'.format(logreg.score(x_test, y_test)))

## Confusion Matrix

The confusion matrix is important. A medical test used for diagnosis with high accuracy might be rendered totally useless by having low recall. A quick summary of the values in the confusion matrix:

- Top left: True negative (tn)
- Top right: False positive (fp)
- Bottom left: False negative (fn)
- Bottom right: True positive (tp)

Essentially, you want the top-left and bottom-right numbers to be as high as possible while the top-right and bottom-left numbers stay as low as possible.
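
To make the layout concrete, here is a tiny hand-rolled count on hypothetical labels (`y_true_toy` and `y_pred_toy` are made up for this sketch), arranged in the same top-left to bottom-right order as the list above.

```python
# Hypothetical true labels and predictions (1 = positive, 0 = negative)
y_true_toy = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred_toy = [0, 1, 0, 0, 1, 0, 1, 0]

pairs = list(zip(y_true_toy, y_pred_toy))
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # correctly rejected
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarm
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # missed positive
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # correctly detected

# Rows are the true class, columns are the predicted class
matrix = [[tn, fp],
          [fn, tp]]
print(matrix)  # [[4, 1], [1, 2]]
```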

In [None]:
from sklearn.metrics import confusion_matrix

# Store the result under a different name so the imported function is not overwritten
cm = confusion_matrix(y_test, y_pred)
print(cm)
print(classification_report(y_test, y_pred))

You probably know accuracy, but what are those precision and recall values?

- precision: tp / (fp + tp)
- recall: tp / (fn + tp)
- f1-score: 2 * (precision * recall) / (precision + recall)

Remember the medical test example? If your precision is low, you will falsely diagnose healthy people with cancer. If your recall is low, you will falsely diagnose people with cancer as healthy. Our model looks fine.
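
The medical test example can be worked through with numbers. The counts below are entirely hypothetical (1000 patients, 10 of them sick) and are only meant to show how a high accuracy can hide a low recall.

```python
# Hypothetical screening results for 1000 patients, 10 of whom are sick
tp, fn = 2, 8      # sick patients: 2 detected, 8 missed
fp, tn = 3, 987    # healthy patients: 3 false alarms, 987 correctly cleared

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (fp + tp)
recall = tp / (fn + tp)
f1 = 2 * (precision * recall) / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
```

Here the accuracy is 0.989, yet the test misses 8 of the 10 sick patients (recall 0.20), which is exactly the situation the confusion matrix helps to spot.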

## ROC Curve

In [None]:
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve

# Use predicted probabilities (not hard labels) so the reported area matches the plotted curve
y_score = logreg.predict_proba(x_test)[:, 1]
logit_roc_auc = roc_auc_score(y_test, y_score)
fpr, tpr, thresholds = roc_curve(y_test, y_score)
plt.figure()
plt.plot(fpr, tpr, label='Logistic Regression (area = %0.2f)' % logit_roc_auc)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.show()

## Cross-validation

We are done with our model. One last thing is left to do: cross-validation.
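
If cross-validation is new to you, the idea is to split the rows into k folds and let each fold take one turn as the validation set while the rest is used for training. The sketch below shows only the index bookkeeping for a plain k-fold split on hypothetical sizes. Note that scikit-learn uses a stratified variant for classifiers when `cv` is an integer, so this is an illustration of the idea rather than what the cells below do internally.

```python
import numpy as np

n_samples, n_folds = 10, 5  # hypothetical toy sizes
indices = np.arange(n_samples)
folds = np.array_split(indices, n_folds)

splits = []
for i, val_idx in enumerate(folds):
    # Every fold except the i-th one is used for training
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    splits.append((train_idx, val_idx))
    print(f"fold {i}: train={train_idx.tolist()} validate={val_idx.tolist()}")
```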

In [None]:
from sklearn.model_selection import cross_validate, cross_val_score

In [None]:
cv_dict = cross_validate(logreg, x, y, cv=5, return_train_score=True)

In [None]:
cv_dict

In [None]:
scores = cross_val_score(logreg, x, y, cv=5)
print("Accuracy with cross-validation: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

Aaand we are done! Now we have a model that can predict red wines' quality based on their components with 0.82 accuracy. Hopefully you had fun or learned something new.