# Red Wine Quality

In this project we analyse the Wine Quality dataset to find out which physicochemical properties indicate wine quality. To understand the data, we first get a description of the dataset and then check for null values.

This dataset is publicly available for research. The details are described in [Cortez et al., 2009].
P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis.
Modeling wine preferences by data mining from physicochemical properties.
In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.

Available at: [@Elsevier] http://dx.doi.org/10.1016/j.dss.2009.05.016
[Pre-press (pdf)] http://www3.dsi.uminho.pt/pcortez/winequality09.pdf
[bib] http://www3.dsi.uminho.pt/pcortez/dss09.bib

Title: Wine Quality

Sources
Created by: Paulo Cortez (Univ. Minho), Antonio Cerdeira, Fernando Almeida, Telmo Matos and Jose Reis (CVRVV) @ 2009

In [None]:
#Import the libraries that we are going to use:
import pandas as pd
from matplotlib import pyplot as plt
import scipy
import numpy as np
import seaborn as sns


In [None]:
# Load every column; index_col=0 would turn the first column ('fixed acidity') into the index
df = pd.read_csv('../input/red-wine-quality-cortez-et-al-2009/winequality-red.csv')
df.head()

In [None]:
stats_df = df[['volatile acidity', 'citric acid', 'residual sugar']]
stats_df.head()

In [None]:
df.describe()

In [None]:
df['quality'] = pd.Categorical(df.quality)
df.dtypes

In [None]:
df.isnull().sum()

It looks like the data is clean. There are 1599 entries and no null values. 
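Null counts are only one aspect of cleanliness; duplicated rows are another check worth running. The following is a minimal sketch, not part of the original analysis: `summarize_cleanliness` is a hypothetical helper, and the toy frame holds made-up values that stand in for `df`.

```python
import numpy as np
import pandas as pd


def summarize_cleanliness(frame: pd.DataFrame) -> dict:
    """Return basic data-quality counts for a DataFrame."""
    return {
        "rows": len(frame),
        "null_cells": int(frame.isnull().sum().sum()),
        "duplicate_rows": int(frame.duplicated().sum()),
    }


# Toy stand-in for the wine data (values are made up for illustration).
toy = pd.DataFrame(
    {
        "alcohol": [9.4, 9.8, 9.8, np.nan],
        "pH": [3.51, 3.20, 3.20, 3.16],
        "quality": [5, 5, 5, 6],
    }
)

summary = summarize_cleanliness(toy)
print(summary)
```

In the notebook, calling `summarize_cleanliness(df)` would report the same three counts for the real data.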

Let us list the unique values of the 'quality' variable.

In [None]:
df.quality.unique()
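`unique()` lists the distinct quality levels but not how often each one occurs. A small sketch of how the counts could be obtained follows; `quality_counts` is a hypothetical helper and the toy frame is made up for illustration.

```python
import pandas as pd


def quality_counts(frame: pd.DataFrame, column: str = "quality") -> pd.Series:
    """Count rows per distinct value of `column`, ordered by the value itself."""
    return frame[column].value_counts().sort_index()


# Toy stand-in for df (made-up scores).
toy = pd.DataFrame({"quality": [5, 6, 5, 7, 6, 5]})

counts = quality_counts(toy)
n_levels = toy["quality"].nunique()  # number of distinct quality levels

print(counts)
print("distinct levels:", n_levels)
```

Applied to the notebook's data, `quality_counts(df)` would give the same numbers that the histogram below displays as bars.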

In [None]:
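# Quality is an integer score, so center one bin on each value;
# a fixed bins=5 merges two scores into one bar when there are more than five levels
quality_scores = df.quality.astype(int)
bin_edges = np.arange(quality_scores.min() - 0.5, quality_scores.max() + 1.5)
plt.figure(figsize=[9,5])
plt.hist(quality_scores, bins=bin_edges, ec='white', facecolor='g', alpha=0.5)
plt.xlabel('Quality')
plt.ylabel('Count')
plt.title('Quality Count')
plt.grid(True)
plt.show()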


In the next step we visualize the relationships between 'quality' and some of the other variables:

In [None]:
plt.figure(figsize=(9, 5))
sns.boxplot(x='quality',y='volatile acidity',data=df, palette='tab10')
plt.title("Quality by Volatile Acidity")
plt.xlabel('Quality')
plt.ylabel('Volatile Acidity')
plt.show()


In [None]:
plt.figure(figsize=(9,5))
sns.boxplot(x="quality", y="pH", data=df, palette = 'tab10')
plt.title('Quality by pH Level')
plt.xlabel('Quality')
plt.ylabel('pH')

In [None]:
plt.figure(figsize=(9, 5))
sns.boxplot(x= 'quality', y= 'citric acid', data = df, palette = 'tab10')
plt.ylim(0, 1)
plt.title('Quality by Citric Acid Level')
plt.xlabel('Quality')
plt.ylabel('Citric Acid')

In [None]:
plt.figure(figsize=(9, 6))
sns.boxplot(x= 'quality', y= 'residual sugar', data = df, palette = 'tab10')
plt.ylim(1, 9)
plt.title('Quality vs Residual Sugar')
plt.xlabel('Quality')
plt.ylabel('Residual Sugar')


In [None]:
plt.figure(figsize=(9, 6))
sns.boxplot(x= 'quality', y= 'free sulfur dioxide', data = df, palette = 'tab10')
plt.title('Quality vs Free Sulfur Dioxide')
plt.xlabel('Quality')
plt.ylabel('Free Sulfur Dioxide')

In [None]:
plt.figure(figsize=(9, 5))
sns.boxplot(x= 'quality', y= 'total sulfur dioxide', data = df, palette = 'tab10')
plt.ylim(0, 175)
plt.title('Quality vs Total Sulfur Dioxide')
plt.xlabel('Quality')
plt.ylabel('Total Sulfur Dioxide')


In [None]:
plt.figure(figsize=(9, 5))
sns.boxplot(x= 'quality', y= 'density', data = df, palette = 'tab10')
#plt.ylim(0, 10)
plt.title('Quality vs Density')
plt.xlabel('Quality')
plt.ylabel('Density')


In [None]:
plt.figure(figsize=(10, 7))
sns.boxplot(x= 'quality', y= 'sulphates', data = df, palette = 'tab10')
plt.ylim(0.3, 2)
plt.title('Quality vs Sulphates')
plt.xlabel('Quality')
plt.ylabel('Sulphates')


In [None]:

plt.figure(figsize=(10, 5))
sns.boxplot(x= 'quality', y= 'alcohol', data = df, palette = 'tab10')
plt.ylim(8, 15)
plt.title('Quality vs Alcohol')
plt.xlabel('Quality')
plt.ylabel('Alcohol')
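The boxplots above can be complemented with a numeric summary, since the median of each variable per quality level is the center line of each box. This is a minimal sketch under assumptions: `median_by_quality` is a hypothetical helper and the toy values are made up, not taken from the dataset.

```python
import pandas as pd


def median_by_quality(frame: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Median of each listed column within every quality level."""
    # observed=True keeps only levels that actually occur if 'quality' is categorical
    return frame.groupby("quality", observed=True)[columns].median()


# Toy stand-in for df (made-up measurements).
toy = pd.DataFrame(
    {
        "quality": [5, 5, 6, 6, 7, 7],
        "alcohol": [9.4, 9.8, 10.5, 10.9, 11.8, 12.2],
        "volatile acidity": [0.70, 0.66, 0.56, 0.50, 0.40, 0.36],
    }
)

medians = median_by_quality(toy, ["alcohol", "volatile acidity"])
print(medians)
```

With the real data, a column whose medians rise or fall steadily across quality levels is a candidate for a strong relationship with quality.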


In the next step we are going to build a correlation matrix to measure the statistical relationship between variables:

In [None]:
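# 'quality' was converted to a categorical earlier; cast it back to int so it appears in the matrix
corr = df.astype({'quality': int}).corr()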
mask = np.triu(np.ones_like(corr, dtype=bool))
f, ax = plt.subplots(figsize=(11, 9))
cmap = sns.diverging_palette(250, 20, as_cmap=True)

sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5}, annot = True)

From the correlation matrix we can see that the variables most strongly related to wine quality are volatile acidity, alcohol and sulphates. Quality is negatively correlated with volatile acidity and positively correlated with sulphates and alcohol. In other words, wines with higher sulphates and alcohol levels and lower volatile acidity tend to be rated higher.
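Reading the strongest relationships off a heatmap can be error-prone, so the same information can be extracted as a sorted list. The sketch below is illustrative only: `rank_by_correlation` is a hypothetical helper, and the synthetic data is generated from made-up coefficients purely to exercise it.

```python
import numpy as np
import pandas as pd


def rank_by_correlation(frame: pd.DataFrame, target: str = "quality") -> pd.Series:
    """Pearson correlation of every column with `target`, strongest first."""
    numeric = frame.astype(float)  # also converts a categorical target of ints
    corr_with_target = numeric.corr()[target].drop(target)
    order = corr_with_target.abs().sort_values(ascending=False).index
    return corr_with_target.loc[order]


# Synthetic data: made-up numbers, only to exercise the helper.
rng = np.random.default_rng(0)
alcohol = rng.normal(10.0, 1.0, 200)
acidity = rng.normal(0.5, 0.1, 200)
noise = rng.normal(0.0, 1.0, 200)
toy = pd.DataFrame(
    {
        "alcohol": alcohol,
        "volatile acidity": acidity,
        "unrelated": noise,
        "quality": np.round(5 + 1.0 * (alcohol - 10) - 5.0 * (acidity - 0.5)),
    }
)

ranking = rank_by_correlation(toy)
print(ranking)
```

Sorting by absolute value keeps strong negative correlations, such as the one with volatile acidity, near the top instead of at the bottom of the list.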


In the final step we are building a pairplot with the seaborn library to summarize and visualize the pairwise relationships in our dataset. 

In [None]:
df1 = df[['quality', 'volatile acidity', 'alcohol', 'sulphates', 'citric acid', 'total sulfur dioxide']]
sns.pairplot(df1, hue = 'quality',  plot_kws = {'alpha': 0.4}, palette = 'YlGnBu')