# Red wine data analysis

**Citation**
P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis.
Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009.

In [None]:
import pandas as pd
from pandas import DataFrame
from pandas.plotting import scatter_matrix
import matplotlib.pyplot as plt
from matplotlib import rcParams
import plotly.graph_objects as go
import plotly.express as px
from plotly.colors import n_colors
import numpy as np
import seaborn as sns
import pandas_profiling
%matplotlib inline
from matplotlib import rc
import scipy.stats
from scipy.stats import skew
from scipy.stats import pearsonr

In [None]:
#reading the file
red_wine = pd.read_csv("../input/wine-quality-selection/winequality-red.csv",delimiter=',')

In [None]:
red_wine

**Column description.**

1) Fixed Acidity : Amount of Tartaric acid present

2) Volatile Acidity : Amount of Acetic acid present

3) Citric Acid : Amount of Citric acid present which contributes to crispness

4) Residual sugar: Amount of sugar present in wine after fermentation

5) Chlorides : Amount of Sodium Chloride present

6) Free sulfur dioxide : Amount of free sulfur dioxide present

7) Total sulfur dioxide : Total of free and bound sulfur dioxide present in the wine. Sulfur dioxide acts as an antimicrobial and an antioxidant, but too much of it can give the wine a pungent smell

8) Density : Density of the wine

9) pH : pH value of the wine on a scale from 0 to 14, where 0 is highly acidic and 14 is highly basic

10) Sulphates : Amount of Potassium sulphate present

11) Alcohol : Alcohol content in wine

12) Quality : Quality score on a scale of 0 to 10

**We will try to answer some initial questions.**

Which factor or factors affect the quality of red/white wine?

Do the factors affecting quality vary according to wine type?

What trends appear in the other columns besides quality?

In [None]:
red_wine.columns

In [None]:
red_wine.info()

**None of the columns contains null values.**
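
The `info()` output reports non-null counts per column; an explicit count of missing values makes the same check easier to read. The following is a minimal sketch: `count_missing` and the toy frame are hypothetical names introduced only for illustration.

```python
import pandas as pd


def count_missing(df: pd.DataFrame) -> pd.Series:
    """Return the number of missing values in each column."""
    return df.isnull().sum()


# Hypothetical toy frame, used only to demonstrate the helper.
toy = pd.DataFrame({"alcohol": [9.4, None, 10.5], "quality": [5, 6, 5]})
print(count_missing(toy))
```

In the notebook, `count_missing(red_wine)` would be expected to show zero for every column if the statement above holds.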

In [None]:
red_wine.describe()

**The descriptive statistics show no obviously odd values.**
**Our main focus is the quality column. Its minimum value is 3 and its maximum value is 8. It has a mean of 5.6 with a standard deviation of 0.80.**
**Although the quality column holds numerical values, it can be treated as an ordinal categorical variable, because scores fall on a fixed 0 to 10 scale where a higher value means better quality.**
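
One way to make the ordinal treatment explicit is to convert the scores to an ordered categorical. This is a sketch, not something the analysis below depends on: `to_ordinal` is a hypothetical helper, and the scale bounds are passed in as arguments because they are an assumption about the scoring scale.

```python
import pandas as pd


def to_ordinal(scores: pd.Series, low: int, high: int) -> pd.Series:
    """Convert integer scores to an ordered categorical covering low..high."""
    levels = list(range(low, high + 1))
    dtype = pd.CategoricalDtype(categories=levels, ordered=True)
    return scores.astype(dtype)


# Hypothetical toy scores; the bounds 0 and 10 are an assumed scale.
toy = pd.Series([5, 6, 3, 8], name="quality")
ordinal = to_ordinal(toy, 0, 10)
print(ordinal.dtype)
print((ordinal > 5).sum())  # ordered categories support comparisons
```

Because the categories are ordered, comparisons such as `ordinal > 5` and `ordinal.min()` remain meaningful, while arithmetic such as a mean is no longer applied by accident.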

In [None]:
sns.set(rc={'figure.figsize':(8,8)})
sns.countplot(x='quality', data=red_wine)

**We can see that the majority of the wines are of medium quality, that is 5 or 6. Very few wines are of low quality, and only a few are of very good quality.**
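
The count plot can be backed by numbers. The sketch below computes the share of samples at each score; `quality_share` and the toy scores are hypothetical and only illustrate the calculation.

```python
import pandas as pd


def quality_share(scores: pd.Series) -> pd.Series:
    """Fraction of samples at each quality score, sorted by score."""
    return scores.value_counts(normalize=True).sort_index()


# Hypothetical toy scores, not the real data.
toy = pd.Series([5, 5, 6, 6, 6, 3, 7, 8], name="quality")
share = quality_share(toy)
print(share)
print("share scored 5 or 6:", share.loc[[5, 6]].sum())
```

Applied as `quality_share(red_wine['quality'])`, this would show how large the medium-quality majority actually is.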

**Let's check relation of quality with other variables.**

In [None]:
sns.pairplot(red_wine)

**We have scatterplots for every possible pair of columns.**

**Now let's put numbers on these correlations with the help of a heat map.**

In [None]:
sns.heatmap(red_wine.corr(), annot =True, fmt ='.2f', linewidths = 2)

**Some insights from the two graphs above:**

**Free and total sulfur dioxide have some positive correlation with residual sugar. One possible reason is that the quantity of sulfur dioxide depends on the sugar content.**

**Density also has some positive correlation with residual sugar, which is expected because dissolved sugar raises the density of the wine.**

**Density has a negative relation with alcohol and pH.**

**Fixed acidity seems to have some positive correlation with density and citric acid, and a negative correlation with pH.**

**pH has a negative correlation with fixed acidity and citric acid, and a positive correlation with volatile acidity.**

**Since we are focusing on quality:**
**1) quality has a positive correlation with alcohol.**
**2) quality has a weak positive correlation with sulphates, citric acid, and fixed acidity.**
**3) quality has a weak negative correlation with volatile acidity, density, and total sulfur dioxide.**
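
Reading the quality row off a heat map is error-prone, so it can help to list the correlations with quality in order of strength. A minimal sketch follows; `rank_correlations` and the toy frame are hypothetical, and the toy values are invented for illustration only.

```python
import pandas as pd


def rank_correlations(df: pd.DataFrame, target: str = "quality") -> pd.Series:
    """Pearson correlation of every other column with `target`,
    ordered from strongest to weakest by absolute value."""
    corr = df.corr()[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)


# Hypothetical toy frame with invented values.
toy = pd.DataFrame({
    "alcohol": [9.0, 10.0, 11.0, 12.0],
    "volatile acidity": [0.8, 0.6, 0.5, 0.3],
    "pH": [3.3, 3.1, 3.4, 3.2],
    "quality": [4, 5, 6, 7],
})
ranked = rank_correlations(toy)
print(ranked)
```

Calling `rank_correlations(red_wine)` would give the same numbers as the quality row of the heat map, with the sign preserved and the strongest relations first.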

**There are many variables we could work on, but we will consider only some important ones.**

In [None]:
sns.histplot(red_wine['alcohol'], kde=True)

In [None]:
skew(red_wine['alcohol'])

**We can see that the skewness value is positive and fairly high, meaning the distribution of alcohol content is positively (right) skewed.**
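
For reference, the skewness reported here is the third central moment divided by the 1.5 power of the second central moment, which is the biased definition that `scipy.stats.skew` uses by default. The sketch below computes it by hand; `sample_skewness` and the toy values are hypothetical.

```python
import numpy as np


def sample_skewness(values) -> float:
    """Biased sample skewness: m3 / m2**1.5, using central moments."""
    x = np.asarray(values, dtype=float)
    centered = x - x.mean()
    m2 = np.mean(centered ** 2)  # second central moment (variance)
    m3 = np.mean(centered ** 3)  # third central moment
    return m3 / m2 ** 1.5


# Hypothetical toy values with a long right tail.
toy = [9, 9, 10, 10, 11, 14]
print(sample_skewness(toy))
print(np.mean(toy), np.median(toy))
```

In a right-skewed sample like this one the mean typically sits above the median, which is the comparison the next two cells make for alcohol.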

In [None]:
np.mean(red_wine['alcohol'])

In [None]:
np.median(red_wine['alcohol'])

**Let's check how alcohol and quality relate to each other.**

In [None]:
sns.boxplot(x='quality',y='alcohol', data = red_wine, showfliers = False)

**We have seen that density and alcohol share a negative relation. So we can say that as quality increases, alcohol content tends to increase and density tends to decrease. This makes sense because a primary reason for drinking wine is to enjoy some light alcohol.**

In [None]:
joint_plt = sns.jointplot(x='alcohol', y ='pH', data = red_wine, kind = 'reg')

In [None]:
def get_corr(var1,var2,df):
    pearson_coefficient, p_value = pearsonr(df[var1], df[var2])
    print('Pearsonr correlation between {} and {} is {}'.format(var1,var2,pearson_coefficient))
    print("P value of this correlation is {}".format(p_value))

In [None]:
get_corr('alcohol','pH',red_wine)

In [None]:
joint_plt = sns.jointplot(x='alcohol', y ='density', data = red_wine, kind = 'reg')

In [None]:
get_corr('alcohol','density',red_wine)

In [None]:
a = sns.FacetGrid(red_wine, col = 'quality')
a = a.map(sns.regplot,'density', 'alcohol')

**As the value of quality increases, the negative relation between alcohol and density becomes stronger.**
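
The facet plots suggest this visually; a per-group correlation table with sample sizes makes the comparison concrete. This is a sketch under stated assumptions: `corr_by_group` is a hypothetical helper and the toy values are invented.

```python
import pandas as pd


def corr_by_group(df: pd.DataFrame, x: str, y: str, by: str = "quality") -> pd.DataFrame:
    """Pearson correlation between x and y within each level of `by`,
    together with the number of samples in that level."""
    rows = []
    for level, group in df.groupby(by):
        rows.append({by: level, "n": len(group), "r": group[x].corr(group[y])})
    return pd.DataFrame(rows).set_index(by)


# Hypothetical toy frame with invented values.
toy = pd.DataFrame({
    "quality": [5, 5, 5, 6, 6, 6, 6],
    "alcohol": [9.0, 10.0, 11.0, 10.0, 11.0, 12.0, 13.0],
    "density": [0.998, 0.997, 0.996, 0.997, 0.996, 0.997, 0.995],
})
per_group = corr_by_group(toy, "alcohol", "density")
print(per_group)
```

Running `corr_by_group(red_wine, 'alcohol', 'density')` would show whether the coefficient really grows more negative with quality, and the `n` column flags groups too small to trust.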

**Let's move to sulphates.**

In [None]:
sns.boxplot(x='quality',y='sulphates', data = red_wine, showfliers = False)

**Sulphates are used to create sulfur dioxide in wine, which acts as an antioxidant and improves the taste of the wine. This may be why the quantity of sulphates increases as quality increases.**

In [None]:
sns.boxplot(x='quality',y='total sulfur dioxide', data = red_wine, showfliers = False)

In [None]:
sns.boxplot(x='quality',y='free sulfur dioxide', data = red_wine, showfliers = False)

**Both follow a similar trend since they are highly correlated. Sulphates form sulfur dioxide in wine, which in excess leads to a pungent smell. As both graphs show, sulfur dioxide increases with quality over the first three quality values, but after that it decreases as quality increases.**

**At the same time, a wine with too little sulfur dioxide lacks a preserving and antimicrobial agent.**
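
The rise-then-fall pattern read from the box plots can be checked against the group medians, which are the centre lines of those boxes. A minimal sketch, with `median_by_quality` as a hypothetical helper and invented toy values:

```python
import pandas as pd


def median_by_quality(df: pd.DataFrame, columns, by: str = "quality") -> pd.DataFrame:
    """Median of each listed column within each quality level."""
    return df.groupby(by)[columns].median()


# Hypothetical toy frame with invented values.
toy = pd.DataFrame({
    "quality": [3, 3, 5, 5, 7, 7],
    "total sulfur dioxide": [20, 30, 50, 60, 30, 40],
    "free sulfur dioxide": [8, 10, 15, 17, 12, 14],
})
medians = median_by_quality(toy, ["total sulfur dioxide", "free sulfur dioxide"])
print(medians)
print(medians.idxmax())  # quality level at which each median peaks
```

On the real data, `medians.idxmax()` would report the quality score at which each sulfur dioxide measure peaks.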


**Let's move to fixed acidity, volatile acidity, and citric acid.**

In [None]:
sns.boxplot(x='quality',y='fixed acidity', data = red_wine, showfliers = False)

In [None]:
sns.boxplot(x='quality',y='citric acid', data = red_wine, showfliers = False)

**The more citric acid, the better the quality. One reason is that citric acid adds crispness or freshness and brings a distinctive taste to the wine.**

In [None]:
sns.boxplot(x='quality',y='volatile acidity', data = red_wine, showfliers = False)

**Volatile acidity (acetic acid) contributes to the smell of vinegar. So it makes sense that you don't want the taste of vinegar in your wine.**

In [None]:
red_wine['total acidity'] = (red_wine['fixed acidity']+red_wine['citric acid']+red_wine['volatile acidity'])

In [None]:
sns.boxplot(x='quality',y='total acidity', data = red_wine, showfliers = False)

In [None]:
sns.regplot(x='pH', y= 'total acidity', data = red_wine)

In [None]:
b  = sns.FacetGrid(red_wine, col = 'quality')
b = b.map(sns.regplot,'total acidity', 'pH')

**Among the high-quality wines the negative relation is quite strong, which is expected because more acid means a lower pH. The weaker relation among lower-quality wines may simply be due to the smaller number of samples in those groups.**
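
The sample-size caveat can be made quantitative with an approximate confidence interval for a correlation coefficient. The sketch below uses the Fisher z-transformation, a technique that is not part of the original analysis and is brought in here as an assumption; `pearson_ci` is a hypothetical helper and the inputs are invented numbers.

```python
import math


def pearson_ci(r: float, n: int, z_crit: float = 1.96) -> tuple:
    """Approximate 95% confidence interval for a Pearson correlation r
    estimated from n samples, via the Fisher z-transformation."""
    if n <= 3:
        raise ValueError("need more than 3 samples")
    z = math.atanh(r)              # transform r to the z scale
    se = 1.0 / math.sqrt(n - 3)    # standard error on the z scale
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)


# Hypothetical inputs: the same r estimated from a small and a large group.
small = pearson_ci(-0.7, 10)
large = pearson_ci(-0.7, 600)
print("n=10: ", small)
print("n=600:", large)
```

The interval for the small group is much wider, which is why a correlation seen in a quality level with only a handful of wines should be read with caution.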

In [None]:
get_corr('total acidity','pH',red_wine)