# Red Wine Quality

This notebook builds a simple classifier and compares the Decision Tree Classifier and Random Forest Classifier methods on a dataset that contains 11 features and one quality rating for every wine sample.


In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from collections import Counter
from sklearn.metrics import confusion_matrix, accuracy_score

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

df = pd.read_csv('../input/red-wine-quality-cortez-et-al-2009/winequality-red.csv')

In [None]:
# the table below shows the first five rows of the dataset

df.head()

In [None]:
# the following graph shows how many wines received each quality score
sns.countplot(x = "quality", data = df, palette = "Set3")
plt.title("Quality Count")
plt.show()

* To visualize the relationship between the target and each feature, we can draw "boxen" plots, which are similar to box plots but show more information about the distribution. Like box plots, they also show the outliers clearly.

In [None]:
sns.catplot(x = "quality", y = "fixed acidity", data = df, kind = "boxen")
plt.title("Fixed Acidity vs. Quality")
plt.show()

In [None]:
sns.catplot(x = "quality", y = "volatile acidity", data = df, kind = "boxen")
plt.title("Volatile Acidity vs. Quality")
plt.show()

In [None]:
sns.catplot(x = "quality", y = "citric acid", data = df, kind = "boxen")
plt.title("Citric Acid vs. Quality")
plt.show()

In [None]:
sns.catplot(x = "quality", y = "residual sugar", data = df, kind = "boxen")
plt.title("Residual Sugar vs. Quality")
plt.show()

In [None]:
sns.catplot(x = "quality", y = "chlorides", data = df, kind = "boxen")
plt.title("Chlorides vs. Quality")
plt.show()

In [None]:
sns.catplot(x = "quality", y = "free sulfur dioxide", data = df, kind = "boxen")
plt.title("Free Sulfur Dioxide vs. Quality")
plt.show()

In [None]:
sns.catplot(x = "quality", y = "total sulfur dioxide", data = df, kind = "boxen")
plt.title("Total Sulfur Dioxide vs. Quality")
plt.show()

In [None]:
sns.catplot(x = "quality", y = "density", data = df, kind = "boxen")
plt.title("Density vs. Quality")
plt.show()

In [None]:
sns.catplot(x = "quality", y = "pH", data = df, kind = "boxen")
plt.title("pH vs. Quality")
plt.show()

In [None]:
sns.catplot(x = "quality", y = "sulphates", data = df, kind = "boxen")
plt.title("Sulphates vs. Quality")
plt.show()

In [None]:
sns.catplot(x = "quality", y = "alcohol", data = df, kind = "boxen")
plt.title("Alcohol vs. Quality")
plt.show()
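
The plots above show the relationships visually; a numeric summary can complement them. The following is a minimal sketch on a small made-up table (the `toy` frame and its values are assumptions for illustration, not the wine data) that ranks features by their Pearson correlation with `quality`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the wine table (hypothetical values, not the real data):
# "alcohol" is built to rise with quality, "pH" is pure noise.
rng = np.random.default_rng(16)
quality = rng.integers(3, 9, size=200)  # integer scores from 3 to 8
toy = pd.DataFrame({
    "alcohol": 8.0 + 0.5 * quality + rng.normal(0, 0.5, size=200),
    "pH": rng.normal(3.3, 0.15, size=200),
    "quality": quality,
})

# Pearson correlation of every feature with the target, strongest first
corr_with_quality = (
    toy.corr()["quality"]
    .drop("quality")
    .sort_values(key=abs, ascending=False)
)
print(corr_with_quality)
```

Replacing `toy` with `df` should give the same kind of ranking for the 11 real features.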

After this analysis, I found that the prediction accuracy can be increased by re-labeling the target with a coarser range, such as:

* Quality 3-4 = Below Average = 1
* Quality 5-6-7 = Average = 2
* Quality 8 = Above Average = 3

I've added another column to the dataset, `rating`, which holds this re-labeled version of the quality.


In [None]:
# "rating" is the new column, derived from wine quality
rating = []
for i in df['quality']:
    if i >= 3 and i < 5:
        rating.append('1')
    elif i >= 5 and i < 8:
        rating.append('2')
    elif i == 8:
        rating.append('3')
df['rating'] = rating


# we can see the total count of elements in the new grading system below
Counter(df['rating'])
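
The same re-labeling can also be written without a loop. This is only a sketch of an alternative, assuming the bin edges below match the three groups listed above; the `quality` series here is a toy example, not the dataset column.

```python
import pandas as pd

# Toy quality scores covering every value seen in the dataset (3 to 8)
quality = pd.Series([3, 4, 5, 6, 7, 8])

# Bins are right-inclusive by default: (2, 4], (4, 7], (7, 8]
rating = pd.cut(quality, bins=[2, 4, 7, 8], labels=["1", "2", "3"]).astype(str)
print(rating.tolist())
```

With `df['quality']` in place of the toy series, the result could be assigned to `df['rating']` directly. A score outside the bins is marked as missing by `pd.cut`, so the new column always keeps the same length as the dataset.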

In [None]:
# define the target and feature arrays: 'rating' is the target and the first 11 columns are the features

x = df.iloc[:,:11]
y = df['rating']

print(x.head(5))
print("-----------------------------------------")
print(y.head(5))


* It is important to scale the data before processing it with PCA. This can be done with standardization, which rescales each feature so that it has a mean of zero and a standard deviation of one. Standardization does not change the shape of a feature's distribution, only its location and scale.

> More information can be found [here](https://scikit-learn.org/stable/auto_examples/preprocessing/plot_scaling_importance.html).

In [None]:
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
x_std = sc.fit_transform(x)
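
To make the rescaling concrete, here is a minimal sketch of the same calculation done by hand with NumPy on a tiny made-up matrix (`x_toy` is an assumption for illustration). `StandardScaler` uses the population standard deviation, which is also NumPy's default.

```python
import numpy as np

# Toy feature matrix (hypothetical values): two columns on very different scales
x_toy = np.array([[7.4, 34.0],
                  [7.8, 67.0],
                  [11.2, 60.0],
                  [7.9, 40.0]])

# z = (x - mean) / std, computed per column with the population std (ddof=0)
x_toy_std = (x_toy - x_toy.mean(axis=0)) / x_toy.std(axis=0)

print(x_toy_std.mean(axis=0))  # approximately 0 for each column
print(x_toy_std.std(axis=0))   # approximately 1 for each column
```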

* Principal component analysis (PCA) is a method that finds the linear combination of a set of variables with maximum variance, removes its effect, and repeats this for the remaining variance. The technique reduces the dimensionality of datasets, increasing interpretability while minimizing information loss. PCA is sensitive to scale: if one feature varies less than another only because of their respective units, PCA might decide that the direction of maximal variance corresponds more closely to the larger-scale feature. This is why the features are standardized first.

> More information can be found [here](https://royalsocietypublishing.org/doi/10.1098/rsta.2015.0202).

In [None]:
from sklearn.decomposition import PCA

pca = PCA()
x_pca = pca.fit_transform(x_std)
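
Note that `PCA()` without arguments keeps all components, so the dimensionality only drops once some are discarded. The sketch below, on made-up data (`x_toy` is an assumption, with one deliberately redundant column), shows how the explained variance can guide that choice.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical): three features, the third nearly duplicates the first
rng = np.random.default_rng(16)
a = rng.normal(size=300)
b = rng.normal(size=300)
x_toy = np.column_stack([a, b, a + 0.05 * rng.normal(size=300)])

x_toy_std = StandardScaler().fit_transform(x_toy)

# PCA() with no arguments keeps every component
pca_full = PCA().fit(x_toy_std)
print(pca_full.explained_variance_ratio_)             # share of variance per component
print(np.cumsum(pca_full.explained_variance_ratio_))  # running total, ends at 1.0

# A float n_components keeps the fewest components that reach that variance share
pca_95 = PCA(n_components=0.95).fit(x_toy_std)
print(pca_95.n_components_)
```

On the wine data, the same check on `pca.explained_variance_ratio_` would show how many of the 11 components are actually needed.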

* We can split our data into two subsets: a training set and a test set. The training set, which includes the known outputs, is given to the model to learn from. The test set is used later to measure how well the model predicts on data it has not seen.

> More information can be found: [Resource_1](https://towardsdatascience.com/train-test-split-and-cross-validation-in-python-80b61beca4b6), [Resource_2](https://developers.google.com/machine-learning/crash-course/training-and-test-sets/splitting-data), [Video_Tutorial](https://www.youtube.com/watch?v=fwY9Qv96DJY).


* In the code below, test_size = 0.3 means that 30% of the data goes into the test set and the remaining 70% is left for training.


![](https://miro.medium.com/max/700/1*-8_kogvwmL1H6ooN1A1tsQ.png)

In [None]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x_pca, y, test_size = 0.3,random_state = 16)
print("Training Feature Set:", x_train.shape,"\nTraining Output Set:", y_train.shape, "\n\nTest Feature Set:",x_test.shape, "\nTest Output Set", y_test.shape)
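
Because the re-labeled classes are very unevenly sized, it may help to keep their proportions equal in both subsets. This is a sketch of one option, the `stratify` argument of `train_test_split`, shown on a made-up imbalanced target (`x_toy` and `y_toy` are assumptions for illustration).

```python
import numpy as np
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy imbalanced target (hypothetical): 90 samples of class '2', 10 of class '1'
x_toy = np.arange(200).reshape(100, 2)
y_toy = np.array(["2"] * 90 + ["1"] * 10)

# stratify=y_toy keeps the class proportions the same in both subsets
x_tr, x_te, y_tr, y_te = train_test_split(
    x_toy, y_toy, test_size=0.3, random_state=16, stratify=y_toy
)

print(x_tr.shape, x_te.shape)  # (70, 2) (30, 2)
print(Counter(y_tr), Counter(y_te))
```

Without `stratify`, a rare class can end up with very few samples, or none, in the test set.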

* Now the model needs to be fitted on the training split before making predictions. I've chosen the Decision Tree Classifier as the model type.

In [None]:
from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier(random_state = 16)
model.fit(x_train,y_train)
prediction = model.predict(x_test)  # already 1-D, which is the shape accuracy_score expects
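
In this notebook the scaler and PCA are fitted on the whole dataset before the split, so the test rows influence the preprocessing. A possible alternative, sketched here on synthetic data (the `make_classification` data is an assumption standing in for the wine features), is to chain the steps in a pipeline so they are fitted on the training rows only.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the wine features (hypothetical data, 11 columns)
x_toy, y_toy = make_classification(
    n_samples=300, n_features=11, n_informative=5, random_state=16
)
x_tr, x_te, y_tr, y_te = train_test_split(
    x_toy, y_toy, test_size=0.3, random_state=16
)

# Scaler and PCA are fitted on the training rows only when the pipeline is fit
pipe = make_pipeline(StandardScaler(), PCA(), DecisionTreeClassifier(random_state=16))
pipe.fit(x_tr, y_tr)

score = pipe.score(x_te, y_te)
print(score)  # test accuracy as a fraction between 0 and 1
```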

* Finally, now that the model is trained on the subsets x_train and y_train, the x_test data is used to predict the corresponding outputs. The last step is to compare these predictions with the subset y_test. This comparison gives the accuracy of the model on the test set.

In [None]:
accuracy = accuracy_score(y_test, prediction)
print(f"The Decision Tree Classifier test is {accuracy * 100:.2f}% accurate.")

* Repeating the last steps with the Random Forest Classifier in order to compare its test accuracy with that of the Decision Tree Classifier.

In [None]:
from sklearn.ensemble import RandomForestClassifier

model_2 = RandomForestClassifier(n_estimators = 100, random_state = 16)
model_2.fit(x_train, y_train)
prediction_2 = model_2.predict(x_test)

accuracy_2 = accuracy_score(y_test, prediction_2)
print(f"The Random Forest Classifier test is {accuracy_2 * 100:.2f}% accurate.")
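
Accuracy alone can be misleading when one class dominates, so it is worth comparing it with a majority-class baseline and looking at the confusion matrix, which is already imported at the top. A minimal sketch with made-up labels (`y_true` and `y_pred` are assumptions for illustration):

```python
from collections import Counter
from sklearn.metrics import accuracy_score, confusion_matrix

# Toy labels (hypothetical): the 'Average' class '2' dominates
y_true = ["2"] * 8 + ["1", "3"]
y_pred = ["2"] * 10  # a "model" that always predicts the majority class

# Baseline: share of the most frequent class in the true labels
majority_share = Counter(y_true).most_common(1)[0][1] / len(y_true)
toy_accuracy = accuracy_score(y_true, y_pred)
print(majority_share)  # 0.8
print(toy_accuracy)    # 0.8, no better than the baseline

# Rows are true classes, columns are predicted classes, in the order of `labels`
cm = confusion_matrix(y_true, y_pred, labels=["1", "2", "3"])
print(cm)
```

Calling `confusion_matrix(y_test, prediction)` on the real predictions would show whether the rare classes 1 and 3 are ever predicted correctly.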

Reducing the target range from (3, 4, 5, 6, 7, 8) to (1, 2, 3) increases the accuracy and is a simple, straightforward approach, largely because most wines now fall into the single "Average" class. The trade-off is that some of the information in the original quality scores is lost. These methods can be improved further by selecting and tuning the input features and by preprocessing the data better.

Thank you for your time.