# Predicting Red Wine Quality

In this project we shall attempt to understand which chemical features of red wine contribute most to its quality.

The dataset used within this project is the 'Red Wine Quality' dataset uploaded to Kaggle by UCI Machine Learning.

Throughout the duration of the project, we shall focus on the following tasks:

0. Package and Data Imports
1. Exploratory Data Analysis and Visualisation
2. Data Preprocessing
3. Model Creation and Analysis
    1. 'Good' vs 'Bad' Wine predictions
    2. Predicting Quality Ratings
    
## 0: Package and Data Imports

Let us first import the data visualisation libraries that will be needed for the next section.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

We shall now import the data into a dataframe called 'df'.

In [None]:
df = pd.read_csv('../input/red-wine-quality-cortez-et-al-2009/winequality-red.csv')

## 1: Exploratory Data Analysis and Visualisation

In this section we shall attempt to understand the relationship between our data features and the target variable. We shall first check some basic information regarding the dataframe we have created.

In [None]:
df.head()

We can see that we have a total of 11 features that can be used to predict the value of the target variable "quality".

In [None]:
df.info()

We also notice that each feature is in a numerical format and that we have a total of 1599 different wines stored within our dataset.

### 1.1: Impact of Features on our Target Variable

In this section we shall attempt to understand how each of our features in turn impacts the quality rating of wine. 

#### 1.1.1: Fixed Acidity

In [None]:
df.groupby('quality').mean()['fixed acidity'].plot(kind='bar')
plt.ylabel('Mean Fixed Acidity')

The plot above shows the mean fixed acidity of the wines grouped by the quality rating they have been given. We can see that the value of this feature does not change significantly with the quality rating. As a result, this feature does not appear to be a useful predictor of the target variable.

#### 1.1.2: Volatile Acidity

In [None]:
sns.lmplot(x='volatile acidity',y='quality',data=df)

In [None]:
df.groupby('quality').mean()['volatile acidity'].plot(kind='bar')
plt.ylabel('Mean Volatile Acidity')

From the two plots shown above, we can clearly see a strong negative relationship between the quality of wine and its volatile acidity level. Hence, we can deduce that this feature is a significant factor in the prediction of our target variable.

#### 1.1.3: Citric Acid

In [None]:
df.groupby('quality').mean()['citric acid'].plot(kind='bar')
plt.ylabel('Mean Citric Acid Level')

We can clearly see a strong, positive relationship between the citric acid level and the wine's quality. This feature is an important factor in predicting the quality rating.

#### 1.1.4: Residual Sugar

In [None]:
df.groupby('quality').mean()['residual sugar'].plot(kind='bar')
plt.ylabel('Mean Residual Sugar Level')

The plot above shows no clear linear relationship between the quality of wine and the level of residual sugar contained within it.

#### 1.1.5: Chlorides

In [None]:
df.groupby('quality').mean()['chlorides'].plot(kind='bar')
plt.ylabel('Mean Chloride Level')

From the above plot, we can see that, on average, lower chloride levels are associated with higher quality wine. This feature is an important factor when predicting the quality of wine.

#### 1.1.6: Free Sulfur Dioxide

In [None]:
df.groupby('quality').mean()['free sulfur dioxide'].plot(kind='bar')
plt.ylabel('Mean Free Sulfur Dioxide')

In [None]:
sns.scatterplot(x='free sulfur dioxide',y='quality',data=df)

It is difficult to determine whether this feature has a significant impact on our target variable. For wines with a quality rating above 5, the quality increases as the level of free sulfur dioxide decreases. However, for wines rated 3 or 4, the reverse is true.

We shall leave this variable within our dataset, as it appears to carry some information that is not visible in these plots.

#### 1.1.7: Total Sulfur Dioxide

In [None]:
df.groupby('quality').mean()['total sulfur dioxide'].plot(kind='bar')
plt.ylabel('Mean Total Sulfur Dioxide Level')

As with free sulfur dioxide above, it is difficult to determine whether this feature is an important factor. Further investigation of the relationship between "free sulfur dioxide" and "total sulfur dioxide" shall be done in section 1.2.

#### 1.1.8: Density

In [None]:
df.groupby('quality').mean()['density'].plot(kind='bar')
plt.ylabel('Mean Density')
plt.ylim(0.99,1)

Although the graph shows a slight negative trend, there is no clear relationship between the density of the wine and its quality. The range of densities contained within our dataset is so small that any relationship would be insignificant. As a result, we shall assume for now that there is no relationship between these two variables.

#### 1.1.9: pH

In [None]:
df.groupby('quality').mean()['pH'].plot(kind='bar')
plt.ylabel('Mean pH Level')

As with density above, there is no significant relationship between the pH level of a wine and its quality.

#### 1.1.10: Sulphates

In [None]:
df.groupby('quality').mean()['sulphates'].plot(kind='bar')
plt.ylabel('Mean Sulphate Level')

The plot of the mean sulphate level by quality class shows a clear, positive relationship between the two variables. This feature is an important factor in predicting the quality of wine.

#### 1.1.11: Alcohol

In [None]:
df.groupby('quality').mean()['alcohol'].plot(kind='bar')
plt.ylabel('Mean Alcohol Level')

Once again, we can see a clear positive relationship between these variables.

In summary, we have found that there seem to be 5 main factors that influence the quality of wine. These are:

 - Alcohol
 - Sulphates
 - Chlorides
 - Citric Acid
 - Volatile Acidity
 
Let us produce a plot of the correlation values of all our features to our target variable to confirm these findings.

In [None]:
series = pd.Series(df.corr()['quality'])
series.drop('quality').plot(kind='bar')

We can clearly see that the 5 columns mentioned above seem to have a high correlation value, in absolute terms. However, it also seems that the density and total sulfur dioxide features are also significantly correlated to our target variable. 
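
To make the comparison "in absolute terms" explicit, the correlations can be sorted by their absolute value. The following is a minimal sketch on a synthetic stand-in dataframe; the name `toy` and the coefficients used to generate it are assumptions for illustration and are not taken from the wine data.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the wine dataframe (assumption: not the real data).
rng = np.random.default_rng(0)
n = 500
alcohol = rng.normal(10.4, 1.0, n)
volatile_acidity = rng.normal(0.53, 0.18, n)
ph = rng.normal(3.3, 0.15, n)

# Hypothetical quality score driven by alcohol (+) and volatile acidity (-).
quality = np.round(5.6 + 0.4 * (alcohol - 10.4)
                   - 2.0 * (volatile_acidity - 0.53)
                   + rng.normal(0, 0.6, n))

toy = pd.DataFrame({'alcohol': alcohol,
                    'volatile acidity': volatile_acidity,
                    'pH': ph,
                    'quality': quality})

# Correlation of every feature with the target, ordered by absolute size.
corr_with_quality = toy.corr()['quality'].drop('quality')
order = corr_with_quality.abs().sort_values(ascending=False).index
ranked = corr_with_quality.reindex(order)
print(ranked)
```

On the real dataframe, the same two lines at the end could be applied to `df.corr()['quality']` to list the features from most to least correlated with quality.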

### 1.2: Feature Relationships

Now that we have determined which features have an impact on our target variable, we shall investigate the relationships between our features. We shall do this by producing a correlation heatmap.

In [None]:
plt.figure(figsize=(12,8))
sns.heatmap(df.corr(),annot=True)

The heatmap above shows the correlation values between each of the features. We shall now investigate values that have a large absolute value.

#### 1.2.1: Fixed Acidity v Citric Acid

In [None]:
sns.scatterplot(x='fixed acidity',y='citric acid',hue='quality',data=df)

The first thing to note is that the level of fixed acidity within a wine is always greater than the level of citric acid. This is because citric acid is one of the types of acid considered in the fixed acidity value. We notice a clear positive relationship between these two features, which is why the value of 0.67 in the correlation matrix appears. 

Let us investigate whether the percentage of the citric acid has an effect on the quality of wine. 

In [None]:
df['citric acid percentage'] = df['citric acid'] / (df['citric acid'] + df['fixed acidity'])

In [None]:
df.groupby('quality').mean()['citric acid percentage'].plot(kind='bar')
plt.ylabel('Mean Citric Acid Percentage')

We can clearly see that as the average citric acid percentage of the wine increases, the quality score associated with the wine also increases. This may be because acids impart fundamental features of a wine's taste. We have found another useful predictor of our target variable.

#### 1.2.2: Fixed Acidity v Density

In [None]:
sns.scatterplot(x='fixed acidity',y='density',hue='quality',data=df)

Once again we can see a clear positive relationship between these two variables.

#### 1.2.3: Fixed Acidity v pH

In [None]:
sns.scatterplot(x='fixed acidity',y='pH',hue='quality',data=df)

Here, we see a clear negative relationship between these variables. This makes intuitive sense, since pH is a direct measure of how acidic or alkaline a substance is. The more acidic the wine is, the lower its pH value, because of the scale that is used when measuring pH levels. A substance with a pH of 0 is extremely acidic, a pH of 7 is neutral and a pH of 14 is extremely alkaline.
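
The inverse link between acidity and pH follows from the definition of pH as the negative base-10 logarithm of the hydrogen ion concentration. The short sketch below illustrates this; the function name `ph_from_hydrogen_ion` and the example concentrations are chosen for illustration only.

```python
import math

def ph_from_hydrogen_ion(concentration_mol_per_l):
    """Return pH = -log10([H+]) for a hydrogen ion concentration in mol/L."""
    return -math.log10(concentration_mol_per_l)

# Illustrative concentrations (not measurements from the dataset).
neutral = ph_from_hydrogen_ion(1e-7)
ten_times_more_acidic = ph_from_hydrogen_ion(1e-6)
print(neutral, ten_times_more_acidic)
```

A tenfold increase in hydrogen ion concentration lowers the pH by one unit, which is why more acidic wines sit lower on the pH axis.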

#### 1.2.4: Volatile Acidity v Citric Acid

In [None]:
plt.figure(figsize=(10,6))
sns.scatterplot(x='volatile acidity',y='citric acid',hue='quality',data=df)

The negative relationship between these variables is clear. Volatile acidity is commonly used as a measure of wine spoilage, and manufacturers aim to keep the level of volatile acidity negligible. This may be the reason why we see more high quality wines to the top left of the scatter plot above. In this region the citric acid levels are high, which contributes to the taste of the wine, and the volatile acidity levels are low, which indicates less spoilage.

#### 1.2.5: Citric Acid v pH

In [None]:
sns.scatterplot(x='citric acid',y='pH',hue='quality',data=df)

The relationship between these two variables is clearly negative for the same reasons as the relationship between fixed acidity and pH.

#### 1.2.6: Residual Sugar v Density

In [None]:
sns.scatterplot(x='residual sugar',y='density',hue='quality',data=df)

The positive relationship between these two variables is difficult to see, since the correlation between them is only 0.37.

#### 1.2.7: Chlorides v Sulphates

In [None]:
sns.scatterplot(x='chlorides',y='sulphates',hue='quality',data=df)

We can see a slightly positive linear relationship between these two variables. We notice that the majority of the high quality wines are plotted in the left hand part of this graph. This is due to the negative relationship that chlorides have with the wine quality.

#### 1.2.8: Free Sulfur Dioxide v Total Sulfur Dioxide

In [None]:
sns.scatterplot(x='free sulfur dioxide',y='total sulfur dioxide',hue='quality',data=df)
plt.plot(range(0,71),range(0,71))

The first thing to notice is that the level of free sulfur dioxide is always less than or equal to the total sulfur dioxide level, which makes intuitive sense. Let us investigate whether the percentage of free sulfur dioxide impacts the quality of wine.

In [None]:
df['free sulfur dioxide percentage'] = df['free sulfur dioxide'] / (df['free sulfur dioxide'] + df['total sulfur dioxide'])
df.groupby('quality').mean()['free sulfur dioxide percentage'].plot(kind='bar')
plt.ylabel('Mean Free Sulfur Dioxide Percentage')

We can see that each quality rating has a different average free sulfur dioxide percentage, but there does not appear to be a clear relationship between these variables. Let us check this by calculating the correlation between these features.

In [None]:
df.corr()['quality']

The correlation value is approximately 0.2, which suggests that there is enough of a relationship to consider this variable for predicting wine quality.
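
The percentage feature in this section divides by the sum of the free and total sulfur dioxide levels. Since the total level already includes the free portion, an alternative would be to divide by the total alone. The sketch below compares the two definitions on a few hypothetical measurements; the values and the names `notebook_ratio` and `share_of_total` are illustrative assumptions.

```python
import pandas as pd

# Hypothetical measurements (illustrative values, not from the dataset).
so2 = pd.DataFrame({'free sulfur dioxide': [10.0, 15.0, 30.0],
                    'total sulfur dioxide': [40.0, 60.0, 60.0]})

# Ratio as defined in this notebook: free / (free + total).
notebook_ratio = so2['free sulfur dioxide'] / (
    so2['free sulfur dioxide'] + so2['total sulfur dioxide'])

# Alternative: free as a share of total, which already includes the free part.
share_of_total = so2['free sulfur dioxide'] / so2['total sulfur dioxide']

print(pd.DataFrame({'notebook_ratio': notebook_ratio,
                    'share_of_total': share_of_total}))
```

The two quantities are linked by `notebook_ratio = share_of_total / (1 + share_of_total)`, so they rank wines in the same order, although their linear correlation with quality need not be identical.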

#### 1.2.9: Alcohol v Density

In [None]:
sns.scatterplot(x='alcohol',y='density',hue='quality',data=df)

We can see a clear negative relationship between these two variables.

## 2: Data Preprocessing

In this section we shall manipulate our dataset for use in the machine learning algorithms that we shall implement in section 3. 

From the analysis we performed in section 1, we identified 5 key variables in predicting wine quality. These were alcohol, sulphates, chlorides, citric acid and volatile acidity. We also created two new variables, citric acid percentage and free sulfur dioxide percentage, which we also found to be useful.

We must now remove any variables which we find are not useful, or which may cause multicollinearity issues. As a result, we shall remove the columns free sulfur dioxide and fixed acidity.
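
One simple way to look for possible multicollinearity is to list the feature pairs whose absolute correlation exceeds a chosen threshold. The following is a sketch only: the helper `correlated_pairs`, the threshold of 0.6 and the synthetic columns are assumptions made for illustration.

```python
from itertools import combinations

import numpy as np
import pandas as pd

def correlated_pairs(frame, threshold=0.6):
    """Return (feature_x, feature_y, r) for pairs with |r| above threshold."""
    corr = frame.corr()
    return [(x, y, corr.loc[x, y])
            for x, y in combinations(corr.columns, 2)
            if abs(corr.loc[x, y]) > threshold]

# Synthetic example: 'b' is almost a copy of 'a', 'c' is independent noise.
rng = np.random.default_rng(1)
base = rng.normal(size=300)
toy = pd.DataFrame({'a': base,
                    'b': base + rng.normal(scale=0.1, size=300),
                    'c': rng.normal(size=300)})

flagged = correlated_pairs(toy)
print(flagged)
```

Applied to the feature columns of `df`, a list like this could support the choice of which column of a highly correlated pair to drop.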

In [None]:
df = df.drop(['free sulfur dioxide','fixed acidity'],axis=1)

In [None]:
df.head()

Let us now check the distribution of our target class by producing a countplot.

In [None]:
sns.countplot(x='quality',data=df)

We can see that we have an extremely unbalanced dataset, which may lead to issues when implementing our algorithms. Let us see how unbalanced the dataset really is.

In [None]:
df.groupby('quality').count()['pH'] * 100 /len(df)

We see that over 80% of our data points belong to either quality rating 5 or 6. However, if we reclassify our points as "good" if they have a quality rating of at least 6 and "bad" otherwise, then we should achieve a much more balanced dataset. Let us create a new target variable called "new_rating" that contains a 1 if the wine is "good" and a 0 if the wine is "bad".

In [None]:
df['new_rating'] = df['quality'].apply(lambda x: 1 if x >= 6 else 0)

Let us check the split that this has produced.

In [None]:
sns.countplot(x='new_rating',data=df)

In [None]:
df['new_rating'].value_counts() * 100 / len(df)

We can now clearly see that we have a much more balanced dataset as a result of the new rating system we have implemented.

We are now able to split our data into training and test sets for use in our machine learning algorithms. 

In [None]:
X = df.drop(['quality','new_rating'],axis=1)
y = df['new_rating']
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)
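
A possible variation on the split above is to pass `stratify=y`, which asks `train_test_split` to keep the class proportions the same in the training and test sets. The sketch below uses synthetic labels with a 70/30 split; the names `X_demo` and `y_demo` are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with a 70/30 class split (illustrative only).
y_demo = np.array([0] * 70 + [1] * 30)
X_demo = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=101, stratify=y_demo)

# Share of class 1 in each split.
print(y_tr.mean(), y_te.mean())
```

With stratification, both splits contain the same share of each class as the full dataset.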

## 3: Model Creation and Analysis

In this section we shall create and analyse machine learning models in order to predict the quality of wine. 

### 3.1: "Good" vs "Bad" Predictions

In this section, we shall make use of the "new_rating" feature created above to predict whether a wine is "good" or "bad". We shall implement a range of machine learning algorithms and attempt to determine which method achieves the highest accuracy. In this case, we are considering a binary classification problem. Let us begin implementing and analysing the following algorithms:

1. Logistic Regression (75%)
2. Decision Tree (74%)
3. Random Forest (80%)
4. Support Vector Machines (77%)
5. XGBoost (78%)

#### 3.1.1: Logistic Regression

We first need to import the Logistic Regression model from scikit-learn. We will then fit the model using our training data, followed by creating predictions using our testing data.

In [None]:
from sklearn.linear_model import LogisticRegression
logmodel = LogisticRegression(max_iter=1500)
logmodel.fit(X_train, y_train)
logmodelpreds = logmodel.predict(X_test)

Let us now produce a confusion matrix and a classification report to determine the accuracy of our model.

In [None]:
from sklearn.metrics import confusion_matrix, classification_report
print("Confusion Matrix: ")
print(confusion_matrix(y_test,logmodelpreds))
print("-" * 50)
print("Classification Report: ")
print(classification_report(y_test,logmodelpreds))

Since we are attempting to solve a binary classification problem with reasonably balanced classes, the key metric we shall consider is accuracy. In this case, our logistic regression model achieved 75% accuracy when predicting wines from the testing set.
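
Accuracy can be read directly from a confusion matrix: it is the sum of the diagonal (the correct predictions) divided by the total number of predictions. The sketch below checks this on made-up labels; `y_true` and `y_pred` are hypothetical and are not the notebook's results.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels and predictions (not the notebook's results).
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 1, 1, 1, 1, 1, 0, 0])

cm = confusion_matrix(y_true, y_pred)
# Rows are true classes, columns are predicted classes;
# the diagonal counts the correct predictions.
accuracy_from_cm = np.trace(cm) / cm.sum()

print(cm)
print(accuracy_from_cm, accuracy_score(y_true, y_pred))
```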

The process we shall undertake for the implementation of each model type will be similar to the process we have just undertaken for our logistic regression model. Let us now work through each model in turn and find out its accuracy.
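
Because the same fit, predict and report steps are repeated for every model, they could be wrapped in a small helper. The following is a sketch under assumptions: the function name `evaluate` is hypothetical, and a synthetic dataset from `make_classification` stands in for the wine data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

def evaluate(model, X_train, X_test, y_train, y_test):
    """Fit model, print the confusion matrix and report, return test accuracy."""
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    print("Confusion Matrix: ")
    print(confusion_matrix(y_test, preds))
    print("-" * 50)
    print("Classification Report: ")
    print(classification_report(y_test, preds))
    return accuracy_score(y_test, preds)

# Synthetic binary problem standing in for the wine data.
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=0)
splits = train_test_split(X_demo, y_demo, test_size=0.3, random_state=101)

# train_test_split returns X_train, X_test, y_train, y_test in that order.
demo_accuracy = evaluate(LogisticRegression(max_iter=1500), *splits)
```

Any of the classifiers used in this section could be passed in place of the logistic regression model.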

#### 3.1.2: Decision Tree

In [None]:
from sklearn.tree import DecisionTreeClassifier
tree = DecisionTreeClassifier()
tree.fit(X_train, y_train)
treepreds = tree.predict(X_test)

print("Confusion Matrix: ")
print(confusion_matrix(y_test,treepreds))
print("-" * 50)
print("Classification Report: ")
print(classification_report(y_test,treepreds))

We can see that the decision tree model implemented achieved 74% accuracy.

#### 3.1.3: Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier
forest = RandomForestClassifier(n_estimators=100)
forest.fit(X_train, y_train)
forestpreds = forest.predict(X_test)

print("Confusion Matrix: ")
print(confusion_matrix(y_test,forestpreds))
print("-" * 50)
print("Classification Report: ")
print(classification_report(y_test,forestpreds))

Our random forest classifier achieved 80% accuracy.

#### 3.1.4: Support Vector Machines

In [None]:
from sklearn.svm import SVC
svc = SVC()
svc.fit(X_train, y_train)
svcpreds = svc.predict(X_test)

print("Confusion Matrix: ")
print(confusion_matrix(y_test,svcpreds))
print("-" * 50)
print("Classification Report: ")
print(classification_report(y_test,svcpreds))

Our support vector machine classifier only achieved 58% accuracy. However, we can use a grid search to attempt to alter the model parameters and achieve a higher accuracy.

In [None]:
param_grid = {'C': [0.1,1, 10, 100, 1000], 'gamma': [1,0.1,0.01,0.001,0.0001], 'kernel': ['rbf']} 
from sklearn.model_selection import GridSearchCV
grid = GridSearchCV(SVC(),param_grid,refit=True,verbose=3)
grid.fit(X_train,y_train)

In [None]:
# Show the best parameter combination found by the grid search.
print(grid.best_estimator_)
gridpreds = grid.predict(X_test)

print("Confusion Matrix: ")
print(confusion_matrix(y_test,gridpreds))
print("-" * 50)
print("Classification Report: ")
print(classification_report(y_test,gridpreds))

Despite tuning the parameters, we were only able to achieve a 77% accuracy using Support Vector Machines.
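
Support vector machines with an RBF kernel are sensitive to the scale of the input features, so one option that may help is to standardise the features before fitting. The sketch below combines `StandardScaler` and `SVC` in a pipeline and tunes it with a grid search. It uses synthetic data, and the reduced parameter grid is an assumption chosen to keep the example quick.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data standing in for the wine features.
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)
# Put one feature on a much larger scale than the others.
X_demo[:, 0] *= 1000

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=101)

# The scaler is refitted inside each cross-validation fold before the SVC.
pipe = make_pipeline(StandardScaler(), SVC(kernel='rbf'))
param_grid_demo = {'svc__C': [0.1, 1, 10], 'svc__gamma': [0.01, 0.1, 1]}

search = GridSearchCV(pipe, param_grid_demo, cv=5)
search.fit(X_tr, y_tr)

scaled_svc_accuracy = search.score(X_te, y_te)
print(search.best_params_, scaled_svc_accuracy)
```

The `svc__` prefix refers to the step name that `make_pipeline` derives from the class name. Whether scaling improves on the accuracy reported above for the wine data would need to be checked by running it.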

#### 3.1.5: XGBoost

In [None]:
from xgboost import XGBClassifier
xgb = XGBClassifier()
xgb.fit(X_train, y_train)
xgbpreds = xgb.predict(X_test)

print("Confusion Matrix: ")
print(confusion_matrix(y_test,xgbpreds))
print("-" * 50)
print("Classification Report: ")
print(classification_report(y_test,xgbpreds))

Our XGBoost model achieved 78% accuracy.

In summary, each of our models achieved at least 70% accuracy. This relatively low score may be due to the large number of wines located close to the boundary between "good" and "bad", since roughly 40% of our data points belong to each of the quality ratings 5 and 6. However, our random forest model managed to achieve a much better 80% accuracy.

### 3.2: Predicting Quality Rating

In this section we shall attempt to solve our original problem of predicting the quality rating of wine given its chemical features. Note that our dataset was extremely unbalanced, which was the major reason for considering the "good" v "bad" problem above. In order to predict exact quality ratings, we will need to balance our dataset. We can do this using a technique known as SMOTE, which will synthetically produce more samples of the under-represented classes. Let us apply this method and then check the number of wines within each quality rating.
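
The core idea behind SMOTE is to create each synthetic sample on the line segment between a minority-class point and one of its nearest minority-class neighbours. The sketch below illustrates this interpolation step with NumPy only. It is a simplified illustration and not the imbalanced-learn implementation; the function name `smote_like_samples` and the example points are assumptions.

```python
import numpy as np

def smote_like_samples(X_minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating towards nearest neighbours."""
    rng = np.random.default_rng(seed)
    X_minority = np.asarray(X_minority, dtype=float)

    # Pairwise Euclidean distances between minority samples.
    diffs = X_minority[:, None, :] - X_minority[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    synthetic = np.empty((n_new, X_minority.shape[1]))
    for i in range(n_new):
        base = rng.integers(len(X_minority))
        neighbour = neighbours[base, rng.integers(k)]
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic[i] = X_minority[base] + gap * (X_minority[neighbour] - X_minority[base])
    return synthetic

# Four hypothetical minority-class wines described by two features.
minority = np.array([[0.30, 9.5], [0.35, 10.0], [0.40, 10.5], [0.45, 11.0]])
new_points = smote_like_samples(minority, n_new=6)
print(new_points)
```

Every synthetic point lies between two existing minority samples, so the new samples stay within the range of the observed feature values.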

First, we shall remove the "new_rating" target column since it is no longer needed.

In [None]:
df = df.drop('new_rating',axis=1)
df.head()

Let us reorder the columns of our dataframe so that our target variable appears in the last column.

In [None]:
df = df[['volatile acidity','citric acid','residual sugar','chlorides','total sulfur dioxide','density','pH','sulphates','alcohol','citric acid percentage','free sulfur dioxide percentage','quality']]
df.head(2)

We can now implement the SMOTE technique to oversample our dataset which will result in perfectly balanced classes.

In [None]:
data = df.values
X = data[:, :-1]
y = data[:, -1]
X_columns = df.columns[:-1]
y_columns = df.columns[-1]

from imblearn.over_sampling import SMOTE
oversample = SMOTE()
# fit_resample replaces fit_sample, which was removed from recent imbalanced-learn releases.
X, y = oversample.fit_resample(X, y)

X_sampled = pd.DataFrame(X, columns=X_columns)
y_sampled = pd.DataFrame(y, columns=[y_columns])
df = pd.concat([X_sampled,y_sampled],axis=1)

Let us check that the technique worked and that we have balanced classes.

In [None]:
sns.countplot(x='quality',data=df)

In [None]:
df['quality'].value_counts()

The plot and series above show that we now possess 681 wines for each quality rating, meaning that we have a perfectly balanced dataset as required. Let us now move on to create training and testing sets for this data.

In [None]:
X = df.drop('quality',axis=1)
y = df['quality']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=101)

Now that we have oversampled and split our data into training and test sets, we are in a position to begin implementing machine learning models to predict the quality of wine.
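
A possible variation on the order of these steps is to split the data first and oversample only the training set, so that the test set contains real observations only. The sketch below shows that ordering on synthetic data. To keep it dependency-light, plain random oversampling with `sklearn.utils.resample` stands in for SMOTE here; the dataframe `demo` and its values are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic imbalanced data: 90 samples of class 5 and 10 of class 8.
rng = np.random.default_rng(0)
demo = pd.DataFrame({'alcohol': rng.normal(10, 1, 100),
                     'quality': [5] * 90 + [8] * 10})

# 1. Split first, so the test set contains only original observations.
train, test = train_test_split(demo, test_size=0.3, random_state=101,
                               stratify=demo['quality'])

# 2. Oversample each class up to the majority size, inside the training set only.
target_size = train['quality'].value_counts().max()
balanced_train = pd.concat([
    resample(group, replace=True, n_samples=target_size, random_state=0)
    for _, group in train.groupby('quality')
])

print(balanced_train['quality'].value_counts())
print(len(test))
```

With this ordering, the accuracy measured on `test` reflects performance on wines that played no part in the oversampling.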

#### 3.2.1: Logistic Regression

In [None]:
logmodel = LogisticRegression(max_iter=5000)
logmodel.fit(X_train, y_train)
logmodelpreds = logmodel.predict(X_test)

print("Confusion Matrix: ")
print(confusion_matrix(y_test,logmodelpreds))
print("-" * 50)
print("Classification Report: ")
print(classification_report(y_test,logmodelpreds))


Our logistic regression model achieved 56% accuracy.

#### 3.2.2: Decision Tree

In [None]:
tree = DecisionTreeClassifier()
tree.fit(X_train, y_train)
treepreds = tree.predict(X_test)

print("Confusion Matrix: ")
print(confusion_matrix(y_test,treepreds))
print("-" * 50)
print("Classification Report: ")
print(classification_report(y_test,treepreds))

We have managed to significantly improve the accuracy of predictions to 77% by using a decision tree.

#### 3.2.3: Random Forest

In [None]:
forest = RandomForestClassifier(n_estimators=100)
forest.fit(X_train, y_train)
forestpreds = forest.predict(X_test)

print("Confusion Matrix: ")
print(confusion_matrix(y_test,forestpreds))
print("-" * 50)
print("Classification Report: ")
print(classification_report(y_test,forestpreds))

Our random forest model has performed the best out of the models created so far, achieving 85% accuracy on our test data.

#### 3.2.4: Support Vector Machines

In [None]:
svc = SVC()
svc.fit(X_train, y_train)
svcpreds = svc.predict(X_test)

print("Confusion Matrix: ")
print(confusion_matrix(y_test,svcpreds))
print("-" * 50)
print("Classification Report: ")
print(classification_report(y_test,svcpreds))

Our support vector machine achieves only 37% accuracy. Let us use a grid search to attempt to improve this accuracy.

In [None]:
param_grid = {'C': [0.1,1, 10, 100, 1000], 'gamma': [1,0.1,0.01,0.001,0.0001], 'kernel': ['rbf']} 
from sklearn.model_selection import GridSearchCV
grid = GridSearchCV(SVC(),param_grid,refit=True,verbose=3)
grid.fit(X_train,y_train)

# Show the best parameter combination found by the grid search.
print(grid.best_estimator_)
gridpreds = grid.predict(X_test)

print("Confusion Matrix: ")
print(confusion_matrix(y_test,gridpreds))
print("-" * 50)
print("Classification Report: ")
print(classification_report(y_test,gridpreds))

Our grid search has increased the accuracy we were able to achieve from 37% to 82%.

#### 3.2.5: XGBoost

In [None]:
xgb = XGBClassifier()
xgb.fit(X_train, y_train)
xgbpreds = xgb.predict(X_test)

print("Confusion Matrix: ")
print(confusion_matrix(y_test,xgbpreds))
print("-" * 50)
print("Classification Report: ")
print(classification_report(y_test,xgbpreds))

Our XGBoost model achieved a 75% accuracy.