# Predicting the Quality of Red Wine
DH150: Jul 27th 2022
<br>
Christina Cha

## Report Agenda
<b>1</b> | Introduction
<br>
<b>2</b> | Understanding the Data
<br>
<b>3</b> | Modeling <i>(Logistic Regression, Random Forest, k Nearest Neighbors)</i>
<br>
<b>4</b> | Understanding Prior Works
<br>
<b>5</b> | Results & Comparison
<br>
<b>6</b> | Conclusion

## Introduction

Wine experts rely on taste and perception to judge wine quality. However, the chemical composition of a wine and the methods used to produce it play a significant role in how the wine tastes and, ultimately, in the level of quality it achieves. This project uses machine learning to predict the quality of red wine from its measured features.

## Understanding the Data

#### Data Introduction
For this project, I will be using the <a href="http://archive.ics.uci.edu/ml/datasets/Wine+Quality">Red Wine Quality Data Set</a> from the UCI Machine Learning Repository. The dataset covers red variants of the Portuguese "Vinho Verde" wine. It comprises 12 variables, each recorded for all 1,599 observations. Of the 12, the 11 input variables, based on physicochemical tests, are fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, and alcohol. The output variable, quality, is based on sensory data. To see which variables are likely to affect the quality of red wine the most, I ran a correlation analysis of the independent variables against the dependent variable, quality. This analysis produced a list of variables of interest that had the highest correlations with quality. With this dataset, we can build a variety of models to investigate how much each independent variable contributes to predicting quality.

In [1]:
import pandas as pd

# Import Red Wine Quality Data
df = pd.read_csv("winequality-red.csv")
df

FileNotFoundError: [Errno 2] No such file or directory: 'winequality-red.csv'
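The `FileNotFoundError` above means `winequality-red.csv` was not in the notebook's working directory when the cell ran. Below is a minimal sketch of a more defensive loader. `load_wine_csv` is a hypothetical helper, not part of the original notebook, and the semicolon fallback is an assumption based on the copy distributed by the UCI repository, which is semicolon-delimited. The demo writes a tiny made-up file so the block runs on its own.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_wine_csv(path):
    """Hypothetical helper: load the wine CSV with a clearer error message."""
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(
            f"Expected the dataset at {path.resolve()}; "
            "download it or adjust the path."
        )
    df = pd.read_csv(path)
    # Assumption: a single parsed column means the file is semicolon-delimited.
    if df.shape[1] == 1:
        df = pd.read_csv(path, sep=";")
    return df


# Demo on a tiny made-up file (values are illustrative, not real data).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "winequality-red.csv"
    demo_path.write_text("alcohol;pH;quality\n9.4;3.51;5\n10.5;3.2;7\n")
    demo_df = load_wine_csv(demo_path)

print(demo_df.columns.tolist())
```

In the notebook itself, the call would simply be `df = load_wine_csv("winequality-red.csv")` once the file is in place.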

In [None]:
#Count, Mean, STD, Min~Max
df.describe()

In [None]:
df.columns

In [None]:
df.index

In [None]:
df.shape

#### Data Breakdown
To find which attributes had the greatest impact on the quality of red wine, I ran a correlation test of the input variables against the output variable, using a heat map and a filter that selects highly correlated features.

In [None]:
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns

# Heat Map
plt.figure(figsize=(12,10))
cor = df.corr()
sns.heatmap(cor, annot=True, cmap=plt.cm.Blues)
plt.show()

In [None]:
#Correlation with output variable
cor_target = abs(cor["quality"])

#Selecting highly correlated features
relevant_features = cor_target[cor_target>0.2]
relevant_features

<b>Based on the heat map and the filter, the attributes ranked by their correlation with quality are as follows:</b>
<br>
<i>1 = highest correlation ... 11 = lowest correlation</i>
1. <b>Alcohol:</b> the percent alcohol content of the wine
2. <b>Volatile Acidity:</b> the amount of acetic acid in wine, which at high levels can lead to an unpleasant, vinegar taste
3. <b>Sulphates:</b> a wine additive which can contribute to sulfur dioxide gas (SO2) levels
4. <b>Citric Acid:</b> found in small quantities, citric acid can add 'freshness' and flavor to wines
5. <b>Total Sulfur Dioxide:</b> amount of free and bound forms of SO2
6. <b>Density:</b> the density of wine is close to that of water, depending on the percent alcohol and sugar content
7. <b>Chlorides:</b> the amount of salt in the wine
8. <b>Fixed Acidity:</b> most acids involved with wine are fixed or nonvolatile (do not evaporate readily)
9. <b>pH:</b> describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3 and 4
10. <b>Free Sulfur Dioxide:</b> the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion
11. <b>Residual Sugar:</b> the amount of sugar remaining after fermentation stops
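A ranking like the one above can be produced directly from the correlation matrix rather than read off the heat map. The following is a small sketch on made-up numbers: the column names mirror the dataset, but the values in `toy` are illustrative only and the variable names are my own.

```python
import pandas as pd

# Made-up toy values; not taken from the real dataset.
toy = pd.DataFrame({
    "alcohol": [9.0, 9.5, 10.0, 11.0, 12.0, 13.0],
    "volatile acidity": [0.90, 0.80, 0.70, 0.65, 0.50, 0.40],
    "residual sugar": [1.0, 0.0, 1.0, 0.0, 1.0, 0.0],
    "quality": [3, 4, 5, 6, 7, 8],
})

# Absolute Pearson correlation of each input with quality, strongest first.
ranking = (
    toy.corr()["quality"]
    .drop("quality")          # quality correlates perfectly with itself
    .abs()                    # rank by strength, ignoring the sign
    .sort_values(ascending=False)
)
print(ranking)
```

Taking the absolute value matters because a strongly negative correlation (as with volatile acidity) is just as informative as a strongly positive one.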

<b>Plots</b>
<br>
In addition to the correlation, I wanted to further see the specific relationship of each attribute against the "quality" variable. I created both density and box plots for all 11 attributes to understand the data distribution. 

In [None]:
# Box Plots
fig = plt.figure(figsize=[15,12])
cols = df.columns
cnt = 1
for col in cols:
    plt.subplot(4,3,cnt)
    sns.boxplot(x="quality", y=col, data=df, palette="Blues")
    cnt = cnt+1
plt.tight_layout()
plt.show()

<b>Box Plot Analysis</b>
<br>
Based on the box plots, higher-quality red wines tend to have higher levels of alcohol, sulphates, and citric acid. Lower-quality red wines, on the other hand, tend to have higher volatile acidity, density, and pH. Finally, attributes such as residual sugar, total sulfur dioxide, free sulfur dioxide, and chlorides show no clear relationship with the quality of red wine.
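One way to back up a visual reading of box plots is to compare the median of each attribute per quality score, since the median is the center line of each box. This is a sketch on made-up values; `toy` and `medians` are illustrative names, and the numbers are not from the dataset.

```python
import pandas as pd

# Made-up toy values chosen to mimic the trends described above.
toy = pd.DataFrame({
    "quality": [5, 5, 5, 6, 6, 6, 7, 7, 7],
    "alcohol": [9.2, 9.5, 9.8, 10.0, 10.4, 10.9, 11.2, 11.8, 12.1],
    "volatile acidity": [0.70, 0.62, 0.58, 0.55, 0.50, 0.47, 0.40, 0.38, 0.35],
})

# Median of every attribute within each quality score.
medians = toy.groupby("quality").median()
print(medians)
```

A column whose medians rise or fall steadily with quality supports the trend seen in the plot, while a flat column suggests no clear relationship.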

In [None]:
# Density Plots
fig = plt.figure(figsize=[15,12])
cols = df.columns
cnt = 1

for col in cols:
    plt.subplot(4,3,cnt)
    sns.kdeplot(df[col], color="red")
    cnt+=1
plt.tight_layout()
plt.show()

<b>Density Plot Analysis</b>
- The pH level is almost always between 3 and 4
- Chloride levels are concentrated around 0.08-0.1
- The most common wine quality ratings are 5 and 6.
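Observations like these can also be checked with a couple of one-liners instead of by eye. The sketch below uses made-up values; `ph_share` and `rating_share` are illustrative names.

```python
import pandas as pd

# Made-up toy values; not taken from the real dataset.
toy = pd.DataFrame({
    "pH": [3.51, 3.20, 3.26, 3.16, 3.39, 2.95, 3.30, 3.45],
    "quality": [5, 5, 6, 6, 5, 7, 6, 5],
})

# Fraction of wines whose pH lies between 3 and 4 (inclusive).
ph_share = toy["pH"].between(3, 4).mean()

# Relative frequency of each quality rating, most common first.
rating_share = toy["quality"].value_counts(normalize=True)

print(ph_share)
print(rating_share)
```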

## Modeling

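In this section, I will be creating three classification models (logistic regression, random forest, and k nearest neighbors) to predict whether a red wine is of low or high quality.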

In [None]:
# Classify the quality of wine into two classes: low and high
# Quality score between 3 and 6 = 0 (low quality wine)
# Quality score between 7 and 8 = 1 (high quality wine)

df['quality'] = df['quality'].replace([3, 4, 5, 6], 0)
df['quality'] = df['quality'].replace([7, 8], 1)
df.head()

In [None]:
# Number of wines per class
df['quality'].value_counts()
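The same two-class labeling can be written as a single threshold comparison, which avoids listing every score by hand. This is a sketch on made-up scores; the threshold of 7 comes from the cell comments above, while `HIGH_QUALITY_THRESHOLD`, `scores`, and `labels` are my own illustrative names.

```python
import pandas as pd

# Made-up quality scores for illustration.
scores = pd.Series([3, 4, 5, 6, 7, 8, 5, 6, 7], name="quality")

HIGH_QUALITY_THRESHOLD = 7  # scores of 7 and above count as high quality

# True/False comparison converted to 1/0 labels.
labels = (scores >= HIGH_QUALITY_THRESHOLD).astype(int)

# Share of each class, useful for spotting class imbalance before modeling.
class_share = labels.value_counts(normalize=True)
print(class_share)
```

Checking the class shares at this point is worthwhile because a heavily imbalanced target can make plain accuracy look better than the model really is.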

#### Train and Test Data Split

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, confusion_matrix, f1_score, accuracy_score, recall_score, mean_squared_error
from sklearn import metrics

In [None]:
# Features (x) and target (y)
x = df.drop(['quality'], axis=1)
y = df['quality']

In [None]:
# Split into 70% training data and 30% test data
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=42)
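Because the high quality class is much smaller than the low quality class, a purely random split can leave the test set with a different class ratio than the training set. One option, shown here as a sketch rather than what the notebook does, is the `stratify` argument of `train_test_split`. The arrays `X_demo` and `y_demo` are made up for the demonstration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up imbalanced labels: 90 low quality (0) and 10 high quality (1) wines.
y_demo = np.array([0] * 90 + [1] * 10)
X_demo = np.arange(200).reshape(100, 2)

# stratify=y_demo keeps the 90/10 class ratio in both partitions.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

print(y_tr.mean(), y_te.mean())
```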

### Logistic Regression Model

In [None]:
# Model Object
model1 = LogisticRegression(max_iter=4000)

In [None]:
# Train the model
model1.fit(x_train,y_train)

In [None]:
# Prediction on Test
y_pred1=model1.predict(x_test)

In [None]:
# Classification Report
print("Classification Report:\n", classification_report(y_test, y_pred1))

In [None]:
# Confusion Matrix
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred1))

In [None]:
# Recall & Precision
print("Recall:", metrics.recall_score(y_test, y_pred1, average="micro"))
print("Precision Score:", metrics.precision_score(y_test, y_pred1, average="micro"))
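A note on `average="micro"`: for a single-label problem like this one, micro-averaged recall and precision both reduce to overall accuracy, so they say little about how well the high quality class is detected. The sketch below contrasts them with the default binary averaging, which scores the positive class (label 1) only. The labels are made up for illustration.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Made-up labels: 6 low quality (0) and 4 high quality (1) wines.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 0, 0, 0]

accuracy = accuracy_score(y_true, y_pred)

# Micro averaging pools every prediction, so it matches accuracy here.
recall_micro = recall_score(y_true, y_pred, average="micro")
precision_micro = precision_score(y_true, y_pred, average="micro")

# Default average="binary" reports the scores for class 1 only.
recall_high = recall_score(y_true, y_pred)
precision_high = precision_score(y_true, y_pred)

print(accuracy, recall_micro, precision_micro)
print(recall_high, precision_high)
```

In this toy case the micro scores are all 0.6, yet only one of the four high quality wines is found, which the binary recall of 0.25 makes visible.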

In [None]:
# Calculate the Accuracy Score.
lr = accuracy_score(y_test, y_pred1)
print(lr)

### Random Forest

In [None]:
# Model Object
# random_state fixes the seed so the forest is reproducible between runs
model2 = RandomForestClassifier(n_estimators=100, random_state=42)

In [None]:
# Train the model
model2.fit(x_train,y_train)
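A fitted random forest exposes `feature_importances_`, which offers a second view on which attributes matter, alongside the correlation ranking from earlier. The sketch below runs on synthetic data from `make_classification`; the feature names and the `forest` variable are placeholders, not the wine attributes or the notebook's `model2`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data: 4 features, 2 of them informative.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=4, n_informative=2, n_redundant=0, random_state=0
)
feature_names = ["feature_a", "feature_b", "feature_c", "feature_d"]

forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_demo, y_demo)

# Impurity-based importances, normalized by scikit-learn to sum to 1.
importances = pd.Series(
    forest.feature_importances_, index=feature_names
).sort_values(ascending=False)
print(importances)
```

With the real data, the equivalent would be `pd.Series(model2.feature_importances_, index=x.columns)`, assuming `model2` has been fitted on `x_train`.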

In [None]:
# Prediction on Test
y_pred2=model2.predict(x_test)

In [None]:
# Classification Report
print("Classification Report:\n", classification_report(y_test, y_pred2))

In [None]:
# Confusion Matrix
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred2))

In [None]:
# Recall & Precision
print("Recall:", metrics.recall_score(y_test, y_pred2, average="micro"))
print("Precision Score:", metrics.precision_score(y_test, y_pred2, average="micro"))

In [None]:
# Calculate the Accuracy Score.
rf = accuracy_score(y_test, y_pred2)
print(rf)

### k Nearest Neighbors Model

In [None]:
# Model Object
model3 = KNeighborsClassifier(n_neighbors=9, leaf_size=20)

In [None]:
# Train the model
model3.fit(x_train,y_train)
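k nearest neighbors classifies by distance, so attributes measured on larger scales can dominate those on smaller ones (for example, chlorides around 0.08-0.1 versus pH between 3 and 4). A common remedy, sketched here as an option rather than what the notebook does, is to standardize the features inside a pipeline. The data is synthetic and `scaled_knn` is an illustrative name.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with one feature blown up to a much larger scale.
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)
X_demo[:, 0] *= 100

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

# The scaler is fitted on the training data only, then applied to the test data.
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=9))
scaled_knn.fit(X_tr, y_tr)

scaled_accuracy = scaled_knn.score(X_te, y_te)
print(scaled_accuracy)
```

Keeping the scaler inside the pipeline prevents information from the test set leaking into the scaling step.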

In [None]:
# Prediction on Test
y_pred3=model3.predict(x_test)

In [None]:
# Classification Report
print("Classification Report:\n", classification_report(y_test, y_pred3))

In [None]:
# Confusion Matrix
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred3))

In [None]:
# Recall & Precision
print("Recall:", metrics.recall_score(y_test, y_pred3, average="micro"))
print("Precision Score:", metrics.precision_score(y_test, y_pred3, average="micro"))

In [None]:
# Calculate the Accuracy Score.
knn = accuracy_score(y_test, y_pred3)
print(knn)

In [None]:
# Accuracy Scores based on Models
models = pd.DataFrame({
    'Model':['Logistic Regression', 'Random Forest', 'k Nearest Neighbors'],
    'Accuracy_score' :[lr, rf, knn]
})

models
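A single train/test split gives one accuracy number per model, which can shift with the random seed. As a possible complement, the sketch below compares the same three model types with 5-fold cross-validation. It runs on synthetic data, and `candidates` and `cv_table` are illustrative names, so the scores it prints say nothing about the wine data.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data for the demonstration.
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)

candidates = {
    "Logistic Regression": LogisticRegression(max_iter=4000),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "k Nearest Neighbors": KNeighborsClassifier(n_neighbors=9),
}

rows = []
for name, estimator in candidates.items():
    # Accuracy on each of the 5 held-out folds.
    scores = cross_val_score(estimator, X_demo, y_demo, cv=5, scoring="accuracy")
    rows.append({"Model": name, "Mean accuracy": scores.mean(), "Std": scores.std()})

# Best mean accuracy first.
cv_table = pd.DataFrame(rows).sort_values("Mean accuracy", ascending=False)
print(cv_table)
```

The standard deviation column indicates how much each model's accuracy varies between folds, which helps judge whether a gap between two models is meaningful.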