<h1>**Red Wine Data Analysis**</h1>

In [None]:
# importing required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [None]:
# loading wine dataset
red_wine_df = pd.read_csv("../input/red-wine-quality-cortez-et-al-2009/winequality-red.csv")

In [None]:
# Checking Top 5 records
red_wine_df.head()

In [None]:
# Checking columns
red_wine_df.columns

> <h1>**Questions I try to answer**</h1>
* Which factor or combination of factors affects the quality of red wine?
* Are there any interesting trends in the other columns besides quality?


<h2>**Descriptive Statistics**</h2>

In [None]:
# Displaying the Details of the dataset
red_wine_df.info()

In [None]:
red_wine_df.describe()

In [None]:
# Showing the Quality of Wine
red_wine_df['quality']

<h3>**Since there are no null entries, we don't need to deal with missing values.**</h3>
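The statement above can also be checked explicitly instead of being read off the `info()` output. A minimal sketch, assuming a small hypothetical frame `demo_df` in place of `red_wine_df` (the values, including the deliberately missing one, are made up for illustration, and `missing_per_column` is a helper name introduced here):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for red_wine_df; values are illustrative only.
demo_df = pd.DataFrame({
    "alcohol": [9.4, 9.8, np.nan, 10.5],
    "pH": [3.51, 3.20, 3.26, 3.16],
    "quality": [5, 5, 6, 7],
})

def missing_per_column(df):
    """Return the number of null entries in each column."""
    return df.isnull().sum()

missing = missing_per_column(demo_df)
print(missing)
# True only when no column contains a null entry.
print("No missing values:", bool((missing == 0).all()))
```

Calling `missing_per_column(red_wine_df)` on the real data should show zeros in every column if the claim holds.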

<h1>**Analysis of Red Wine**</h1>

<h2>**Let's first check the Quality Column**</h2>

In [None]:
sns.set(rc={'figure.figsize':(7,6)})
# Pass the column by keyword; recent seaborn versions no longer accept it positionally as x
sns.countplot(x='quality', data=red_wine_df)

<h3>**Let's check which of the other columns are highly correlated with Quality**</h3>

In [None]:
sns.pairplot(red_wine_df)

In [None]:
sns.set(rc={'figure.figsize':(12,10)})
sns.heatmap(red_wine_df.corr(), annot=True, fmt='.2f', linewidths=2)

* Free sulphur dioxide and total sulphur dioxide have some positive correlation with residual sugar. On further inspection, I found that the quantity of SO2 used depends on sugar content. Reference: http://thewinehub.com/home/2013/01/09/the-use-or-not-of-sulfur-dioxide-in-winemaking-trick-or-treat/ . More specifically, the linked article states that "the lower the Residual Sugar , the less SO2 needed".
* Density has a positive correlation with fixed acidity and residual sugar.
* Density has a negative correlation with alcohol and pH.
* Quality has a positive correlation with alcohol, citric acid and sulphates, and a negative correlation with volatile acidity. We need to explore this further.
* Fixed acidity has a high positive correlation with citric acid and density, and a negative correlation with pH.
* Residual sugar has a positive correlation with citric acid.
* pH has a negative correlation with fixed acidity and citric acid, but a positive correlation with volatile acidity.
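The observations above are read off the heatmap by eye; the same ranking can be produced numerically. A minimal sketch, assuming a hypothetical frame `demo_df` with made-up values in place of `red_wine_df`, and a helper name `rank_by_correlation` introduced only for illustration:

```python
import pandas as pd

# Hypothetical stand-in for red_wine_df; values are illustrative only.
demo_df = pd.DataFrame({
    "alcohol": [9.0, 9.5, 10.0, 11.0, 12.0, 12.5],
    "volatile acidity": [0.90, 0.70, 0.60, 0.50, 0.35, 0.30],
    "quality": [3, 4, 5, 6, 7, 8],
})

def rank_by_correlation(df, target="quality"):
    """Pearson correlation of every other column with `target`,
    ordered from strongest to weakest absolute value."""
    corr = df.corr()[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)

ranked = rank_by_correlation(demo_df)
print(ranked)
```

Sorting by absolute value keeps strong negative relationships near the top, where a plain descending sort would push them to the bottom.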

In [None]:
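# distplot is deprecated in recent seaborn releases; histplot with a KDE overlay is the closest replacement
sns.histplot(red_wine_df['alcohol'], kde=True, stat='density')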

In [None]:
from scipy.stats import skew
skew(red_wine_df['alcohol'])

<h2>**Alcohol content is positively skewed**</h2>
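For reference, the skewness value behind this conclusion is the third central moment divided by the variance raised to the power 1.5, which is what `scipy.stats.skew` computes with its default settings. A minimal sketch of that formula, using made-up sample values and a helper name `sample_skewness` introduced here for illustration:

```python
import numpy as np

def sample_skewness(values):
    """Biased sample skewness: third central moment / variance ** 1.5."""
    x = np.asarray(values, dtype=float)
    centered = x - x.mean()
    m2 = np.mean(centered ** 2)  # second central moment (variance)
    m3 = np.mean(centered ** 3)  # third central moment
    return m3 / m2 ** 1.5

# Illustrative values only: one sample with a long right tail, one symmetric.
right_tailed = [9.0, 9.2, 9.5, 9.5, 10.0, 14.0]
symmetric = [9.0, 10.0, 11.0]

print(sample_skewness(right_tailed))  # positive: tail extends to the right
print(sample_skewness(symmetric))     # zero: no asymmetry
```

A positive value means the bulk of the wines sit at lower alcohol levels with a tail toward higher ones.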

In [None]:
def draw_hist(temp_df, bin_size=15):
    # Histogram with a KDE overlay; bin_size sets the number of bins
    ax = sns.histplot(temp_df, bins=bin_size, kde=True, stat='density')
    plt.tight_layout()
    plt.locator_params(axis='y', nbins=6)
    plt.show()
    # Summary statistics (skew is imported from scipy.stats in the cell above)
    print("Skewness is {}".format(skew(temp_df)))
    print("Mean is {}".format(np.mean(temp_df)))
    print("Median is {}".format(np.median(temp_df)))

In [None]:
draw_hist(red_wine_df['alcohol'])

<h2>**Let's see how alcohol varies w.r.t. quality** </h2>

In [None]:
sns.boxplot(x='quality', y='alcohol', data=red_wine_df)
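The box plot can be complemented with a numeric summary per quality level. A minimal sketch, assuming a hypothetical frame `demo_df` with made-up values in place of `red_wine_df`; `summarize_by_quality` is a helper name introduced for illustration:

```python
import pandas as pd

# Hypothetical stand-in for red_wine_df; values are illustrative only.
demo_df = pd.DataFrame({
    "quality": [5, 5, 6, 6, 7, 7],
    "alcohol": [9.4, 9.8, 10.2, 10.8, 11.5, 12.1],
})

def summarize_by_quality(df, column):
    """Count, median and mean of `column` for each quality level."""
    return df.groupby("quality")[column].agg(["count", "median", "mean"])

summary = summarize_by_quality(demo_df, "alcohol")
print(summary)
```

The `count` column is worth keeping next to the medians, since quality levels with few wines give less reliable box plots.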

In [None]:
joint_plt = sns.jointplot(x='alcohol', y='pH', data=red_wine_df,
                        kind='reg')

In [None]:
from scipy.stats import pearsonr
def get_corr(col1, col2, temp_df):
    pearson_corr, p_value = pearsonr(temp_df[col1], temp_df[col2])
    print("Correlation between {} and {} is {}".format(col1, col2, pearson_corr))
    print("P-value of this correlation is {}".format(p_value))

In [None]:
get_corr('alcohol', 'pH', red_wine_df)
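The coefficient printed by `get_corr` is the covariance of the two columns scaled by both standard deviations. A minimal sketch of that calculation with NumPy, using made-up values; `pearson_r` is a helper name introduced here, not part of the notebook:

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation: sum of co-deviations divided by the
    square root of the product of the summed squared deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    return np.sum(xc * yc) / np.sqrt(np.sum(xc ** 2) * np.sum(yc ** 2))

# Illustrative values only, not taken from the wine dataset.
alcohol = [9.4, 9.8, 10.5, 11.2, 12.0]
ph = [3.20, 3.51, 3.26, 3.40, 3.45]

r = pearson_r(alcohol, ph)
print("r = {:.3f}".format(r))
```

The p-value reported alongside it tests the null hypothesis of zero correlation, so with a large sample even a weak coefficient can come out as statistically significant.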

In [None]:
joint_plt = sns.jointplot(x='alcohol', y='density', data=red_wine_df,
                        kind='reg')

In [None]:
get_corr('alcohol', 'density', red_wine_df)

In [None]:
g = sns.FacetGrid(red_wine_df, col="quality")
g = g.map(sns.regplot, "density", "alcohol")

<h2>**Let's analyze sulphates and quality**</h2>

In [None]:
sns.boxplot(x='quality', y='sulphates', data=red_wine_df)

In [None]:
sns.boxplot(x='quality', y='total sulfur dioxide', data=red_wine_df)

In [None]:
sns.boxplot(x='quality', y='free sulfur dioxide', data=red_wine_df)

In [None]:
red_wine_df.columns

<h2>**Let's move on to fixed acidity, volatile acidity and citric acid**</h2>

In [None]:
sns.boxplot(x='quality', y='fixed acidity', data=red_wine_df)

In [None]:
sns.boxplot(x='quality', y='citric acid', data=red_wine_df)

In [None]:
sns.boxplot(x='quality', y='volatile acidity', data=red_wine_df)

<h2>**Trends between other columns**</h2>

In [None]:
red_wine_df.columns

In [None]:
get_corr('pH', 'citric acid', red_wine_df)

<h2>**Create a new Column Total Acidity**</h2>

In [None]:
red_wine_df['total acidity'] = (red_wine_df['fixed acidity'] + red_wine_df['citric acid'] + red_wine_df['volatile acidity'])
sns.boxplot(x='quality', y='total acidity', data=red_wine_df,
           showfliers=False)
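The derived column can also be built with a row-wise sum over a list of columns, which keeps the original frame untouched. A minimal sketch, assuming a hypothetical frame `demo_df` with made-up values; `ACID_COLUMNS` and `add_total_acidity` are names introduced for illustration:

```python
import pandas as pd

# Columns summed to form the derived feature.
ACID_COLUMNS = ["fixed acidity", "volatile acidity", "citric acid"]

def add_total_acidity(df, columns=ACID_COLUMNS):
    """Return a copy of df with a 'total acidity' column
    holding the row-wise sum of `columns`."""
    out = df.copy()
    out["total acidity"] = out[columns].sum(axis=1)
    return out

# Hypothetical stand-in for red_wine_df; values are illustrative only.
demo_df = pd.DataFrame({
    "fixed acidity": [7.4, 7.8],
    "volatile acidity": [0.70, 0.88],
    "citric acid": [0.00, 0.00],
})

result = add_total_acidity(demo_df)
print(result)
```

Note that `sum(axis=1)` skips nulls by default, whereas adding the columns with `+` propagates them; the two agree here because the dataset has no missing values.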

In [None]:
sns.regplot(x='pH', y='total acidity', data=red_wine_df)

In [None]:
g = sns.FacetGrid(red_wine_df, col="quality")
g = g.map(sns.regplot, "total acidity", "pH")

In [None]:
get_corr('total acidity', 'pH', red_wine_df)

In [None]:
g = sns.FacetGrid(red_wine_df, col="quality")
g = g.map(sns.regplot, "free sulfur dioxide", "pH")