# Wine Quality Prediction!
Hey everyone! In this project we're going to look at the wine quality dataset! <br>
We'll train a supervised machine learning algorithm, Random Forest, and then try to improve its accuracy by tuning the hyperparameters! <br>
Happy Learning!

Let's start by importing the required libraries! <br>
A little about the libraries!
* numpy - for numpy arrays, useful for processing and scientific computing
* pandas - helpful for creating dataframes and storing data
* matplotlib.pyplot - useful for creating plots and charts
* seaborn - built on top of matplotlib, useful for statistical data visualization
* train_test_split - to split the data into training and test sets
* RandomForestClassifier - the ensemble model we'll train
* accuracy_score - to check the accuracy of our model
* confusion_matrix - to see how the predictions split across the actual classes

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

Now let's import our data!
We're gonna use the Red Wine Quality dataset (Cortez et al., 2009) from the UCI Machine Learning Repository!

In [1]:
wine_data = pd.read_csv('../input/red-wine-quality-cortez-et-al-2009/winequality-red.csv')
wine_data.shape

Let's explore the dataset further! <br>
Exploratory data analysis!

In [1]:
wine_data.head()

In [1]:
wine_data.describe()

In [1]:
wine_data.info()

Doesn't look like there are any missing values! Let's be sure! <br>
Checking for missing values

In [1]:
wine_data.isnull().sum()

As we can see we have a clean dataset without any missing values!

Data Analysis and Visualization

In [1]:
# Let's check the number of values under each quality classification
sns.catplot(x='quality', data=wine_data, kind='count')

Let's compare some features with the quality!

In [1]:
# volatile acidity vs quality
plot = plt.figure(figsize=(5,5))
plt.title('Volatile acidity vs Quality')
sns.barplot(x='quality', y='volatile acidity', data=wine_data)

In [1]:
# citric acid vs quality
plot = plt.figure(figsize=(5,5))
plt.title('Citric acid vs Quality')
sns.barplot(x='quality', y='citric acid', data=wine_data)

Let's find the correlation! <br>
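About the heatmap arguments!
* corr - the correlation matrix we just calculated
* cbar - show the color bar that maps colors to values
* square - make every cell square
* fmt - number format for the annotations; '.1f' means one decimal place
* annot - write the correlation value inside each cell
* annot_kws - extra settings for the annotations, here the font size
* cmap - the color map used for the heatmap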

In [1]:
corr = wine_data.corr()
# Let's create a heatmap!
plot = plt.figure(figsize=(10,10))
plt.title('Correlation heatmap!')
sns.heatmap(corr,cbar=True,square=True,fmt='.1f',annot=True,annot_kws={'size':8},cmap='Blues')
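
If the heatmap feels crowded, one option is to pull out just the `quality` column of the correlation matrix and sort it. Here's a minimal sketch of that idea. The `toy` DataFrame and its values are made up so the snippet runs on its own; with the real data you would use `wine_data` instead.

```python
import pandas as pd

# Made-up values, only to illustrate; swap in wine_data for the real thing
toy = pd.DataFrame({
    'alcohol':          [9.4, 9.8, 10.5, 11.2, 12.0, 12.8],
    'volatile acidity': [0.70, 0.88, 0.60, 0.40, 0.35, 0.30],
    'quality':          [5, 5, 6, 6, 7, 8],
})

# Correlation of every feature with quality, strongest positive first
quality_corr = (
    toy.corr()['quality']
    .drop('quality')              # quality vs itself is always 1, so skip it
    .sort_values(ascending=False)
)
print(quality_corr)
```

A sorted column like this is easier to scan than the full grid when all you want to know is which features move together with quality.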

Data Preprocessing! <br>
Let's separate the data and the labels!

In [1]:
X = wine_data.drop(columns='quality')  # columns= already says what to drop, so axis is not needed
X.head()

For the quality values, we're gonna binarize them: a quality of 7 or higher becomes good wine **1**, anything lower becomes bad wine **0**

In [1]:
y = wine_data['quality'].apply(lambda y_value: 1 if y_value>=7 else 0 )
y
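
After binarizing, it can be worth checking how many wines land in each class, because a lopsided split makes plain accuracy look better than it is. A minimal sketch, using a made-up `quality` Series in place of `wine_data['quality']`:

```python
import pandas as pd

# Made-up quality scores standing in for wine_data['quality']
quality = pd.Series([5, 6, 5, 7, 6, 5, 8, 6, 5, 6])

# Same rule as above: 7 or higher counts as good wine (1), the rest as bad (0)
labels = (quality >= 7).astype(int)

# Share of each class in the labels
class_share = labels.value_counts(normalize=True)
print(class_share)
```

On the real labels the equivalent call would be `y.value_counts(normalize=True)`.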

Now that we have our data and labels, let's split the data into training and test sets!

In [1]:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2)
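
The split above is random, so the numbers will change a little on every run. If you want a repeatable split that also keeps the good/bad ratio the same in both parts, `train_test_split` accepts `random_state` and `stratify`. This is a sketch on made-up data (`X_demo`, `y_demo`), not what the notebook itself ran:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 rows, 3 features, 20% positive labels
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = np.array([1] * 20 + [0] * 80)

# stratify keeps the class ratio the same in both parts;
# random_state makes the split repeatable from run to run
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)
print("Positive share in train:", y_tr.mean())
print("Positive share in test:", y_te.mean())
```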

Now let's train our model using random forest classifier!

In [1]:
classifier = RandomForestClassifier()
classifier.fit(X_train,y_train)

Model evaluation using K-fold cross validation! <br>
Why k-fold cross validation? To be sure that we didn't just get lucky with one particular train/test split!
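
To make the idea concrete, here is roughly what a helper like `cross_val_score` does for a classifier: cut the training data into k folds, train on k-1 of them, score on the one left out, and repeat. This is a hand-rolled sketch on synthetic data (`X_demo`, `y_demo`) with a smaller forest so it runs quickly; it is an illustration, not the library's actual implementation.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold

# Synthetic data standing in for X_train, y_train
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# 5 folds, each keeping roughly the same class ratio
folds = StratifiedKFold(n_splits=5)

scores = []
for train_idx, val_idx in folds.split(X_demo, y_demo):
    # A fresh model per fold, so nothing leaks from one fold to the next
    model = RandomForestClassifier(n_estimators=25, random_state=0)
    model.fit(X_demo[train_idx], y_demo[train_idx])
    preds = model.predict(X_demo[val_idx])
    scores.append(accuracy_score(y_demo[val_idx], preds))

print("Fold accuracies:", scores)
print("Mean accuracy: {:.2f} %".format(100 * sum(scores) / len(scores)))
```

Every row gets used for validation exactly once, which is why the mean over the folds is a steadier estimate than a single split.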

In [1]:
y_pred_train = classifier.predict(X_train)
accuracy = accuracy_score(y_train, y_pred_train)  # true labels first, then predictions
print("Accuracy of the model on training data is:", accuracy)
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator = classifier,X = X_train,y= y_train , cv = 10)
print("Accuracy: {:.2f} %".format(accuracies.mean()*100))
print("Standard Deviation: {:.2f} %".format(accuracies.std()*100))

As we can see, that's a pretty good accuracy! <br>
Let's check our model performance on the test set!

In [1]:
y_pred_test = classifier.predict(X_test)
accuracy = accuracy_score(y_test, y_pred_test)
print("Accuracy: {:.2f} %".format(accuracy*100))
# true labels go first, so rows are the actual classes and columns are the predicted classes
cf_matrix = confusion_matrix(y_test, y_pred_test)

In [1]:
# Visualizing the confusion matrix!
sns.heatmap(cf_matrix/np.sum(cf_matrix), annot=True, 
            fmt='.2%', cmap='Reds')
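
Since only wines rated 7 or higher count as good, the two classes are unlikely to be the same size, so it can help to read precision and recall for the good class off the confusion matrix as well. A minimal sketch with made-up labels (`y_true`, `y_pred`), not the notebook's real predictions:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels: 7 bad wines (0) and 3 good wines (1)
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 0])

# With true labels first, ravel() gives the counts in this order for binary labels
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)  # of the wines predicted good, how many really are
recall = tp / (tp + fn)     # of the wines that are good, how many we found

print("Precision: {:.2f}".format(precision))
print("Recall: {:.2f}".format(recall))

# The built-in helpers compute the same numbers
assert np.isclose(precision, precision_score(y_true, y_pred))
assert np.isclose(recall, recall_score(y_true, y_pred))
```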

Now let's check whether there are better hyperparameter values that improve the overall accuracy. We use **Grid Search CV** here

In [1]:
from sklearn.model_selection import GridSearchCV
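# 'auto' is no longer accepted for max_features in newer scikit-learn releases (for classifiers it meant the same as 'sqrt')
parameters = [{'bootstrap': [True], 'max_depth': [10, 20], 'max_features': ['sqrt', 'log2'], 'min_samples_leaf': [1, 2, 4], 'min_samples_split': [2, 5, 10], 'n_estimators': [100,200]}]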
grid_search = GridSearchCV(estimator=classifier,
                          param_grid=parameters,
                          scoring='accuracy',
                          cv=10)
grid_search.fit(X_train,y_train)
print("Best Accuracy: {:.2f} %".format(grid_search.best_score_*100))
print("Best Parameters: ", grid_search.best_params_)
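
The best score printed by the grid search is a cross-validated score on the training data. To see how the tuned model does on data it has never seen, you can take `best_estimator_` (refit on the whole training set by default) and score it on the test set. A sketch on synthetic data with a deliberately tiny grid (`small_grid` is made up for speed), so it finishes fast:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data standing in for the wine features and labels
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# A tiny grid, only to keep the example quick
small_grid = {'max_depth': [3, 5], 'n_estimators': [20, 40]}
search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_grid=small_grid,
    scoring='accuracy',
    cv=3,
)
search.fit(X_tr, y_tr)

# best_estimator_ is the winning model, already refit on all of X_tr
best_model = search.best_estimator_
test_accuracy = accuracy_score(y_te, best_model.predict(X_te))

print("Best Parameters:", search.best_params_)
print("Test accuracy: {:.2f} %".format(test_accuracy * 100))
```

With the notebook's own objects, the same two lines would use `grid_search.best_estimator_`, `X_test` and `y_test`.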

Now that we've come to an end let's look back upon what we did in this project! <br>
* Imported the required libraries!
* Read our data from the **Red Wine Quality** dataset!
* Checked for any missing values!
* Found some useful relationships between different features using plots and graphs!
* Made a heatmap to find the correlation between different features!
* Split the data into training and test sets!
* Trained our model using a supervised learning algorithm - Random Forest classification!
* Used K-fold cross validation to get the accuracy!
* With a little bit of hyperparameter tuning we were able to get a good accuracy score of **92**%!
<br>
Hope you all enjoyed this notebook! <br>
Happy Learning!!