## BUILDING A MACHINE LEARNING MODEL USING LOGISTIC REGRESSION

#### PREDICTING THE QUALITY OF RED WINE BASED ON ITS PHYSICOCHEMICAL PROPERTIES

###### This dataset is related to red variants of the Portuguese "Vinho Verde" wine and has been obtained from Kaggle. For more details, see the reference Cortez et al., 2009. The data is also available in the UCI Machine Learning Repository. A logistic regression model is fitted to the data to study how certain physicochemical properties of wine relate to its quality.

###### The dataset has 1599 observations with 11 physicochemical properties and one target variable, the quality of the wine, which is based on sensory data.

###### We begin by importing the required libraries and the dataset.

In [None]:
#alias importing
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [None]:
red_wine = pd.read_csv("../input/red-wine-quality-cortez-et-al-2009/winequality-red.csv")

### Exploring the data

In [None]:
red_wine.head()

In [None]:
red_wine.tail()

###### Let us check the data type of the variables 

In [None]:
red_wine.info()

###### It can be seen that all the variables are of the float data type except the 'quality' variable, which contains integer quality ratings of the red wine. Since 'quality' is our target variable, we next look at its observations.

In [None]:
red_wine.quality.head()

In [None]:
red_wine.quality.value_counts()

###### The 'quality' variable is presently an integer rating, but we want to convert it to a binary categorical variable to distinguish between good and bad quality wine. Ratings are given on a scale of 0 to 10, of which only six unique values (3 to 8) occur in this dataset. Let us categorise a wine as 'good' if its rating is above 6.9 (that is, 7 or 8) and as 'bad' otherwise.

###### Using correlation, we can see how strongly each of the physicochemical properties of wine is linearly associated with its quality.

In [None]:
red_wine.corr()['quality'].sort_values(ascending=False).drop('quality')

In [None]:
red_wine.corr()['quality'].drop('quality').plot(kind='bar',color='magenta', title='Graph 1.1 - Pearson Correlation');

###### Alcohol has the strongest positive correlation with the quality of wine, whereas volatile acidity has the strongest negative correlation.

In [None]:
sns.heatmap(red_wine.corr())
plt.title('Graph 1.2 - Heatmap of Correlation');

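###### It can be seen that the alcohol content of the wine is the property most strongly correlated with quality. It has a positive Pearson correlation coefficient of 0.47, which indicates a moderate positive linear association: wines with more alcohol tend to be rated higher. The coefficient is unitless and is not a slope, so it does not say by how much quality changes per unit of alcohol.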
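
###### The difference between a correlation coefficient and a slope can be checked on made-up numbers. The sketch below uses synthetic data (not the wine data) and rescales the same response by a factor of 100: the correlation stays the same, while the fitted slope changes with the units.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
noise = rng.normal(size=1000)

# Same underlying relationship, with y expressed in two different units
y_small = 0.5 * x + noise
y_large = 100 * y_small  # rescaled copy of y_small

r_small = np.corrcoef(x, y_small)[0, 1]
r_large = np.corrcoef(x, y_large)[0, 1]
slope_small = np.polyfit(x, y_small, 1)[0]
slope_large = np.polyfit(x, y_large, 1)[0]

print(f"correlation: {r_small:.3f} vs {r_large:.3f}")         # unchanged by rescaling
print(f"slope:       {slope_small:.3f} vs {slope_large:.3f}")  # scales with the units
```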

###### Before proceeding to the binary classification of the target variable, let us first check for missing values using the 'isnull' method.

### Cleaning the data

###### We check for any missing values in our data and remove them.

In [None]:
red_wine.isnull().sum()

###### We see that there are no missing values. (P.S. You rarely encounter such datasets with zero missing values)

###### We next move on to the binary classification of the target variable into 'good' and 'bad'. I learned this approach from a Kaggle kernel on the same dataset.

### Pre-processing the data

In [None]:
#Binary classification of the target variable into 'good' and 'bad'
bins=(2,6.9,8)
group_names=['bad','good']
red_wine['quality']=pd.cut(red_wine['quality'],bins=bins,labels=group_names)

###### We create two bins (group names), namely 'bad' and 'good'. Any wine with a quality rating above 6.9 is categorised as 'good' and any wine rated 6.9 or below as 'bad'. To achieve this, we use the pandas 'cut' function, whose bins (2, 6.9] and (6.9, 8] cover all ratings present in the data.
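
###### To see how these bin edges behave, here is a small sketch on a hand-written Series of the six ratings (3 to 8) rather than the full dataset. It also shows that a rating outside the outer edges, such as a hypothetical 9, would become missing rather than 'good'.

```python
import pandas as pd

ratings = pd.Series([3, 4, 5, 6, 7, 8])  # toy input, one value per rating
labels = pd.cut(ratings, bins=(2, 6.9, 8), labels=["bad", "good"])
print(labels.tolist())

# A rating outside (2, 8] falls in no bin and is returned as NaN
outside = pd.cut(pd.Series([9]), bins=(2, 6.9, 8), labels=["bad", "good"])
print(outside.isna().tolist())
```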

In [None]:
red_wine.quality.head()

###### The 'quality' variable is now divided into two categories. Since most ML algorithms work with numeric data (float or integer), our next step is to encode these two categories as integers. This is done using the LabelEncoder class from the preprocessing sub-library of the Sklearn library. As a result, 'bad' is assigned 0 and 'good' is assigned 1.

###### Note that LabelEncoder assigns integer codes in the sorted (alphabetical) order of the labels and carries no notion of weight. Here 'bad' sorts before 'good', so the codes 0 and 1 happen to match the intended order.
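
###### The mapping chosen by LabelEncoder can be inspected through its 'classes_' attribute. A minimal sketch on a toy list of labels (not the wine data):

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
codes = encoder.fit_transform(["good", "bad", "bad", "good"])

# classes_ holds the labels in sorted order; a label's position is its code
print(list(encoder.classes_))
print(codes.tolist())
```

###### The position of each label in 'classes_' is the integer it is encoded as, so checking this attribute confirms which class is treated as 1.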

In [None]:
from sklearn.preprocessing import LabelEncoder

In [None]:
label_qual = LabelEncoder()

In [None]:
#Assigning labels - Bad becomes 0 and good becomes 1 
red_wine['quality'] = label_qual.fit_transform(red_wine['quality'])

In [None]:
red_wine.head()

###### Note, we converted the 'quality' variable from an integer value to a categorical variable and then again to an integer.

In [None]:
red_wine['quality'].value_counts()

###### This shows that our data has 1382 bad quality wines and 217 good quality wines, so the target class is imbalanced.
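
###### One consequence of the imbalance is that a model which always predicts the majority class already looks accurate. The sketch below rebuilds only the label counts quoted above (the feature column is a placeholder) and scores scikit-learn's DummyClassifier as a baseline.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Label counts taken from the text: 1382 'bad' (0) and 217 'good' (1)
y_counts = np.array([0] * 1382 + [1] * 217)
X_placeholder = np.zeros((len(y_counts), 1))  # ignored by this baseline

baseline = DummyClassifier(strategy="most_frequent").fit(X_placeholder, y_counts)
baseline_accuracy = baseline.score(X_placeholder, y_counts)
print(f"majority-class accuracy: {baseline_accuracy:.3f}")
```

###### Any accuracy figure reported later is best read against this baseline of roughly 86%.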

In [None]:
#Plotting
plt.style.use('fivethirtyeight')
red_wine['quality'].value_counts().plot(kind='bar', title='Graph 1.3 - Count of good and bad quality wine');

###### We move to defining our feature columns i.e. the (independent) variables that affect the quality (dependent variable) of wine.

In [None]:
red_wine.columns

In [None]:
feature_columns=['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
       'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
       'pH', 'sulphates', 'alcohol']

###### Assigning the feature / independent (X) and target / dependent (y) variables.

In [None]:
#feature variables and target variable
X=red_wine[feature_columns]
y=red_wine.quality

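###### Before moving on to building the machine learning model, let us scale our data so that no single variable dominates our results. For this, we use the StandardScaler class from the preprocessing sub-library of the Sklearn library. Note that the scaler is fitted here on the full dataset before the train/test split, so test-set statistics influence the scaled training features; fitting it on the training split alone would avoid this leakage.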

In [None]:
from sklearn.preprocessing import StandardScaler

In [None]:
scaler=StandardScaler()

In [None]:
X=scaler.fit_transform(X)
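
###### Scaling can also be fitted on the training rows only, so that the test set plays no part in computing the means and standard deviations. The following is a minimal sketch of that alternative using scikit-learn's 'make_pipeline'; the data is synthetic (generated with 'make_classification') and the names 'X_demo', 'y_demo' and 'model' are illustrative, not part of the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with 11 feature columns (hypothetical data)
X_demo, y_demo = make_classification(n_samples=500, n_features=11, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=13
)

# The scaler inside the pipeline is fitted on the training rows only
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_tr, y_tr)

# Training columns end up with mean 0 and standard deviation 1
scaled_train = model.named_steps["standardscaler"].transform(X_tr)
print(scaled_train.mean(axis=0).round(6))
print(scaled_train.std(axis=0).round(6))
test_accuracy = model.score(X_te, y_te)  # test rows are scaled with training statistics
print(test_accuracy)
```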

###### Now that our pre-processing part is done, we move on to building our Machine Learning model using the Logistic Regression algorithm.

### Model building - Logistic Regression

###### We begin by splitting our data into 80% training and 20% testing data. This is achieved using the 'train_test_split' function from the 'model_selection' sub-library of the Sklearn library.

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2,random_state=13)
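
###### With an imbalanced target, a purely random split can leave the test set with a somewhat different share of 'good' wines than the full data. As an optional variation (not what the cell above does), 'train_test_split' accepts a 'stratify' argument that keeps the class shares close to equal in both parts. The sketch uses only the label counts from the text, with placeholder features.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Labels with the class counts reported earlier; features are placeholders
y_all = np.array([0] * 1382 + [1] * 217)
X_all = np.arange(len(y_all)).reshape(-1, 1)

_, _, y_tr_s, y_te_s = train_test_split(
    X_all, y_all, test_size=0.2, random_state=13, stratify=y_all
)
print(f"share of 'good' overall: {y_all.mean():.3f}")
print(f"share of 'good' in test: {y_te_s.mean():.3f}")
```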

###### We now import the 'LogisticRegression' class from the 'linear_model' sub-library of the Sklearn library and create an instance of it.

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
redwinelogit=LogisticRegression()

###### We have defined the model as 'redwinelogit', which we now fit to the training data.

In [None]:
redwinelogit.fit(X_train,y_train)

###### Let us check the accuracy score

In [None]:
redwinelogit.score(X_train,y_train)

In [None]:
redwinelogit.score(X_test,y_test)

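###### The model reaches an accuracy of about 88% on the test set. Because roughly 86% of the wines are 'bad' (1382 of 1599), this is only slightly better than always predicting 'bad'.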

In [None]:
redwinelogit.predict(X_test)

In [None]:
print(redwinelogit.coef_)
print(redwinelogit.intercept_)

In [None]:
feature_columns

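###### It can be seen that 'alcohol' has a coefficient of positive 0.82, the largest in magnitude, which indicates that among the standardised features the alcohol level is most strongly associated with a wine being 'good'. Logistic regression coefficients are on the log-odds scale: an increase of one standard deviation in alcohol content (the features were standardised) raises the log-odds of 'good' by 0.82, which multiplies the odds by exp(0.82) ≈ 2.27, holding the other features fixed.

###### Similarly, a coefficient of about -0.52 for total sulphur dioxide means that an increase of one standard deviation multiplies the odds of 'good' by exp(-0.52) ≈ 0.59, a decrease of about 41% in the odds.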
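
###### Logistic regression coefficients live on the log-odds scale, so exponentiating them gives odds ratios. A small worked calculation with the two figures mentioned above: 0.82 for alcohol is quoted in the text, while -0.52 for total sulphur dioxide is an assumption read off the 52% figure.

```python
import numpy as np

coef_alcohol = 0.82  # value quoted in the text
coef_so2 = -0.52     # assumed from the 52% figure quoted in the text

# exp(coefficient) = factor by which the odds of 'good' change
# for a one-standard-deviation increase in the (standardised) feature
odds_ratio_alcohol = np.exp(coef_alcohol)
odds_ratio_so2 = np.exp(coef_so2)

print(f"alcohol: odds multiplied by {odds_ratio_alcohol:.2f}")
print(f"total sulphur dioxide: odds multiplied by {odds_ratio_so2:.2f}")
```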

###### Since our target class is imbalanced, we do not rely on the accuracy rate alone. We make use of the precision-recall and receiver operating characteristic (ROC) curves to check the reliability of our model.

In [None]:
from sklearn.metrics import classification_report, confusion_matrix, precision_recall_curve

In [None]:
confusion_matrix(y_test,redwinelogit.predict(X_test))

###### The above result tells us that 271 + 13 = 284 observations are predicted correctly (271 true negatives and 13 true positives) and 8 + 28 = 36 are predicted incorrectly (8 false positives and 28 false negatives).

In [None]:
print(classification_report(y_test,redwinelogit.predict(X_test)))

###### The 'support' column of the report shows that, of the 320 test observations (20% of the 1599 observations), 279 are actually class 0 i.e. bad quality wine and the remaining 41 are actually class 1 i.e. good quality wine. Thus, 87% of the testing data belongs to class 0 and 13% to class 1. Reading the confusion matrix above, the model predicts 271 + 28 = 299 observations as bad and only 8 + 13 = 21 as good, so it recovers just 13 of the 41 good wines (a recall of about 0.32). It is due to the imbalanced target class that most of the observations are predicted to be of bad quality.
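
###### The headline metrics can be recomputed by hand from the four confusion-matrix counts given above. The sketch assumes scikit-learn's layout (rows are actual classes, columns are predicted classes), which is consistent with the class totals of 279 and 41.

```python
import numpy as np

# Counts from the text; rows = actual (0, 1), columns = predicted (0, 1)
cm = np.array([[271, 8],
               [28, 13]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision_good = tp / (tp + fp)  # of wines predicted good, share actually good
recall_good = tp / (tp + fn)     # of wines actually good, share found by the model

print(f"accuracy:  {accuracy:.4f}")
print(f"precision: {precision_good:.3f}")
print(f"recall:    {recall_good:.3f}")
```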

###### In terms of our model,

###### False Positive (FP) is a wine that was predicted to be of good quality but is actually bad
###### False Negative (FN) is a wine that was predicted to be of bad quality but is actually good
###### So, if we take the cost of an FP to be greater than the cost of an FN, precision is given higher weightage than recall.
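
###### One common way to favour precision is to raise the probability threshold above the default of 0.5, so that a wine is called 'good' only when the model is more confident. The sketch below illustrates this on synthetic imbalanced data (not the wine data); the threshold of 0.7 is an arbitrary choice for illustration. Raising the threshold can never increase recall, and it usually, though not always, increases precision.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data with roughly 14% positives
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=11, weights=[0.86], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=13, stratify=y_demo
)

clf = LogisticRegression().fit(X_tr, y_tr)
proba_good = clf.predict_proba(X_te)[:, 1]  # probability of class 1

results = {}
for threshold in (0.5, 0.7):
    pred = (proba_good >= threshold).astype(int)
    p = precision_score(y_te, pred, zero_division=0)
    r = recall_score(y_te, pred, zero_division=0)
    results[threshold] = (p, r)
    print(f"threshold {threshold}: precision={p:.2f}, recall={r:.2f}")
```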

In [None]:
# Use predicted probabilities of class 1 so the curve spans all thresholds
precision, recall, _ = precision_recall_curve(y_test,redwinelogit.predict_proba(X_test)[:,1])
plt.plot(recall,precision)
plt.xlabel('Recall')
plt.ylabel("Precision")
plt.title("Graph 1.4 - Precision Recall Curve");

In [None]:
# plot_roc_curve raises an import error (it was removed in scikit-learn 1.2), so the curve is built manually. This method is taken from a Medium article; link given in the sources section at the end.
from sklearn.metrics import roc_auc_score, roc_curve

In [None]:
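logit_proba = redwinelogit.predict_proba(X_test)[:,1]  # predicted probability of class 1 ('good')
logit_roc_auc = roc_auc_score(y_test, logit_proba)
fpr, tpr, thresholds = roc_curve(y_test, logit_proba)
plt.figure()
plt.plot(fpr, tpr, label='Logistic Regression (area = %0.2f)' % logit_roc_auc)
plt.plot([0, 1], [0, 1],'r-')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Graph 1.5 - Receiver Operating Characteristic')
plt.legend(loc="lower right");

###### The ROC AUC is computed from the predicted probabilities rather than from the hard class labels. Passing the output of 'predict' to 'roc_auc_score' collapses the curve to a single threshold and returns the balanced accuracy, which for the confusion matrix above is (13/41 + 271/279)/2 ≈ 0.64. The AUC is not an accuracy rate: it is the probability that the model assigns a higher score to a randomly chosen good wine than to a randomly chosen bad one, with 0.5 corresponding to random ranking and 1.0 to perfect ranking.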
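
###### How much the ROC AUC depends on the kind of score passed to it can be checked on synthetic data (not the wine data). In the sketch below, 'roc_auc_score' applied to hard 0/1 predictions coincides with the balanced accuracy, while the value computed from predicted probabilities summarises the whole curve.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data with roughly 14% positives
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=11, weights=[0.86], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=13, stratify=y_demo
)

clf = LogisticRegression().fit(X_tr, y_tr)
hard_labels = clf.predict(X_te)
proba_good = clf.predict_proba(X_te)[:, 1]

auc_from_labels = roc_auc_score(y_te, hard_labels)  # single-threshold summary
auc_from_proba = roc_auc_score(y_te, proba_good)    # area under the full curve
balanced_acc = balanced_accuracy_score(y_te, hard_labels)

print(f"AUC from hard labels:   {auc_from_labels:.3f}")
print(f"balanced accuracy:      {balanced_acc:.3f}")
print(f"AUC from probabilities: {auc_from_proba:.3f}")
```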

###### Created by Elena Jose, under the guidance of Prof. Pitabas Mohanty (XLRI Jamshedpur)
###### Sources: https://www.kaggle.com/uciml/red-wine-quality-cortez-et-al-2009
###### https://www.kaggle.com/vishalyo990/prediction-of-quality-of-wine
###### https://towardsdatascience.com/building-a-logistic-regression-in-python-step-by-step-becd4d56c9c8