In [None]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# <center> Red Wine Quality: Statistical Exploration </center>


<center> <img src="https://i.pinimg.com/originals/af/d9/70/afd970fc2aae41b1b34647cea95497c7.jpg" alt="wines" style="width: 200px;"/> </center>

<br>
<br>
<br>

<center> Let's explore the dataset. It describes red wines through physicochemical features (the inputs) and a sensory feature (the output). They are: </center>


<table>
<thead>
<tr><th>Feature</th><th>Description</th></tr>
</thead>
<tbody>
<tr><td>fixed acidity</td><td> most acids involved with wine are fixed or nonvolatile (do not evaporate readily) </td></tr>
<tr><td>volatile acidity</td><td> the amount of acetic acid in wine, which at too high levels can lead to an unpleasant, vinegar taste </td></tr>
<tr><td>citric acid</td><td> found in small quantities, citric acid can add 'freshness' and flavor to wines </td></tr>
<tr><td>residual sugar</td><td> the amount of sugar remaining after fermentation stops, it's rare to find wines with less than 1 gram/liter and wines with greater than 45 grams/liter are considered sweet </td></tr>
<tr><td>chlorides</td><td> the amount of salt in the wine </td></tr>
<tr><td>free sulfur dioxide</td><td> the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine </td></tr>
<tr><td>total sulfur dioxide</td><td> amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine </td></tr>
<tr><td>density</td><td> the density of wine is close to that of water, depending on the percent alcohol and sugar content </td></tr>
<tr><td>pH</td><td> describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale </td></tr>
<tr><td>sulphates</td><td> a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant </td></tr>
<tr><td>alcohol</td><td> the percent alcohol content of the wine </td></tr>
<tr bgcolor='lightpink'><td>quality (target value)</td><td> score between 0 and 10 </td></tr>
</tbody>
</table>

# <center> How will we explore? </center>

First, we do some [exploratory data analysis](#EDA:).

Second, we explore the data using two different statistical approaches:

- [Descriptive Statistics](#Descriptive-Statistics),

- [Inferential Statistics](#Inferential-Statistics).


# EDA:

In [None]:
data = pd.read_csv('/kaggle/input/red-wine-quality-cortez-et-al-2009/winequality-red.csv')
data.head()

Explore our target value (quality):

In [None]:
# Count wines per quality score (a Series indexed by quality)
quality = data['quality'].value_counts().sort_values(ascending=False)
plt.figure(figsize=(10, 5))
sns.set(style="ticks", palette="pastel")
ax = sns.barplot(x=quality.index, y=quality.values)
ax.set(xlabel='quality', ylabel='count')
ax.tick_params(axis='x', rotation=90)

Most wines have a middling quality score, with 5 being the most common. Only a few wines are rated as bad or good.

It looks like we are working with imbalanced data.
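
One way to put a number on the imbalance is to look at the share of each class instead of the raw counts. The sketch below uses a small made-up list of labels (`toy_quality` is hypothetical, not the wine data); on the real data the equivalent call would be `data['quality'].value_counts(normalize=True)`.

```python
import pandas as pd

# Hypothetical toy labels, NOT the real wine data
toy_quality = pd.Series([5, 5, 5, 6, 6, 6, 5, 7, 4, 5], name="quality")

# normalize=True returns proportions instead of counts
proportions = toy_quality.value_counts(normalize=True).sort_index()
print(proportions)

# Ratio between the most and least frequent class as a rough imbalance measure
imbalance_ratio = proportions.max() / proportions.min()
print("Imbalance ratio:", imbalance_ratio)
```

A ratio far above 1 suggests that accuracy alone would be a misleading metric if we later train a classifier on this target.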

Do we have any NaNs?

In [None]:
# count NaN values in every column

cmap = sns.diverging_palette(220, 10, as_cmap=True)
table_nan = data.isna().sum().to_frame().style.background_gradient(cmap=cmap)
table_nan

Nice, we have no missing values. Let's describe the data:

In [None]:
data.describe()

Now, do some statistical analysis.
* Descriptive Statistics;
* Inferential Statistics.

# Descriptive Statistics

<br>
In this notebook, we focus on three key elements of descriptive statistics:
<br>
<br>


* Measures Of Central Tendency
   - Mean
   - Median
   - Mode

<br>

* Measures Of Spread
   - Outliers
   - Interquartile Range
   
<br>

* Dependence
   - Correlation


* The mean is not always a good summary, because it is often affected by outliers.

* More generally, if the distribution is skewed (for example, pulled to one side by outliers), we prefer the median over the mean.
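
To make the point about outliers concrete, here is a minimal sketch on made-up numbers (the values are hypothetical and not taken from the wine data). Adding a single extreme value moves the mean noticeably, while the median stays where it was.

```python
import numpy as np

# Hypothetical toy values, not taken from the wine data
values = np.array([3.1, 3.2, 3.3, 3.3, 3.4])
with_outlier = np.append(values, 9.0)  # add one extreme value

print("mean:  ", values.mean(), "->", with_outlier.mean())
print("median:", np.median(values), "->", np.median(with_outlier))
```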

A boxplot is a good way to visualize the median.

In [None]:
sns.set(style="darkgrid", palette="pastel")

f, axes = plt.subplots(3, 4, figsize=(25, 15))
sns.despine(left=True)

# One horizontal boxplot per column, filling the 3x4 grid row by row
for ax, column in zip(axes.flat, data.columns):
    sns.boxplot(x=data[column], ax=ax)

We can see that a lot of features have outliers (they are drawn as individual points on the plots).
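
With the default settings, the whiskers of these boxplots extend at most 1.5 × IQR beyond the quartiles, and anything further out is drawn as an individual point. The same rule can be sketched in a few lines; `iqr_outliers` is a hypothetical helper name and the toy values are made up.

```python
import numpy as np

def iqr_outliers(values, k=1.5):
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return values[(values < lower) | (values > upper)]

# Hypothetical toy values with one obviously extreme point
toy = [3.0, 3.1, 3.2, 3.3, 3.3, 3.4, 3.5, 4.9]
print(iqr_outliers(toy))
```

On the real data, something like `iqr_outliers(data['pH'])` should list the points that appear outside the whiskers.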

Let's take a closer look at 'pH' column, for example.

Comparing median and mean:

In [None]:
median = np.median(data['pH'])
mean = np.mean(data['pH'])
mode = data['pH'].mode()[0]
print('pH median: ', median)
print('pH mean: ', mean)
print('pH mode: ', mode)

q1 = data['pH'].quantile(0.25) # lower quartile 
q3 = data['pH'].quantile(0.75) # upper quartile       
print("Q1:", q1)
print("Q3:", q3)
print("IQR:", q3 - q1)

The median and the mean are very close. Maybe pH is normally distributed?

In [None]:
f, (ax_box, ax_hist) = plt.subplots(2, sharex=True, gridspec_kw={"height_ratios": (0.2, 1)})

sns.boxplot(x=data["pH"], ax=ax_box)
ax_box.axvline(mean, color='r', linestyle='--')
ax_box.axvline(median, color='g', linestyle='-')
ax_box.axvline(mode, color='b', linestyle='-')

# histplot with kde=True replaces the deprecated distplot
sns.histplot(x=data["pH"], kde=True, stat="density", ax=ax_hist)
ax_hist.axvline(mean, color='r', linestyle='--', label='Mean')
ax_hist.axvline(median, color='g', linestyle='-', label='Median')
ax_hist.axvline(mode, color='b', linestyle='-', label='Mode')

# Label the lines explicitly so the legend entries match them
ax_hist.legend()

ax_box.set(xlabel='')

The histogram looks approximately normal (no obvious skew). Let's test this formally.

$H_0$: data['pH'] comes from a normal distribution.

<br>

$H_1$: $H_0$ is false.

<br>

<center> If the $p$-value $< 0.05$, $H_0$ can be rejected; otherwise the null hypothesis cannot be rejected. </center>

In [None]:
from scipy import stats

normal = stats.normaltest(data['pH'])
normal

The $p$-value $< 0.05$, so we reject $H_0$: pH does not come from a normal distribution.
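
To get a feeling for what such a test does, here is a sketch of a related but simpler test, Jarque-Bera. It is not the D'Agostino-Pearson test that `stats.normaltest` implements, but it is also built on skewness and kurtosis. The function name `jarque_bera` and the synthetic exponential sample are assumptions made for illustration, and the p-value relies on the asymptotic chi-square approximation.

```python
import numpy as np

def jarque_bera(x):
    """Jarque-Bera statistic and asymptotic p-value (chi-square, 2 d.o.f.)."""
    x = np.asarray(x, dtype=float)
    n = x.size
    centered = x - x.mean()
    m2 = np.mean(centered ** 2)
    skew = np.mean(centered ** 3) / m2 ** 1.5
    kurt = np.mean(centered ** 4) / m2 ** 2
    # A normal distribution has skewness 0 and kurtosis 3
    jb = n / 6.0 * (skew ** 2 + (kurt - 3.0) ** 2 / 4.0)
    # Survival function of chi-square with 2 d.o.f. is exp(-x / 2)
    p_value = np.exp(-jb / 2.0)
    return jb, p_value

rng = np.random.default_rng(0)
skewed = rng.exponential(size=2000)  # synthetic, clearly non-normal sample
jb_stat, jb_p = jarque_bera(skewed)
print("JB statistic:", jb_stat, "p-value:", jb_p)
```

The larger the departure of skewness from 0 or kurtosis from 3, the larger the statistic and the smaller the p-value.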

Another way to check this is a QQ-plot:

In [None]:
import statsmodels.api as sm

# Compare against a fitted normal distribution, since normality is what we are checking
fig = sm.qqplot(data['pH'], stats.norm, fit=True, line="45")

Most points lie close to the red line, so pH is close to a normal distribution. However, some points (mostly in the tails) drift away from the line, which shows that it is not exactly normal.

The normality test, the QQ-plot and the distribution plot are simple, commonly used ways to check for a normal distribution.
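
The idea behind a QQ-plot can be sketched without any plotting: sort the sample, pair it with the quantiles a normal distribution would give at the same positions, and see how close the pairs are to a straight line. The helper `qq_points`, the plotting positions `(i - 0.5) / n` and the synthetic samples below are assumptions made for illustration; statsmodels may use slightly different conventions.

```python
import numpy as np
from statistics import NormalDist

def qq_points(sample):
    """Theoretical normal quantiles paired with the sorted, standardized sample."""
    sample = np.sort(np.asarray(sample, dtype=float))
    n = sample.size
    # Plotting positions (i - 0.5) / n stay strictly inside (0, 1)
    probs = (np.arange(1, n + 1) - 0.5) / n
    theoretical = np.array([NormalDist().inv_cdf(p) for p in probs])
    standardized = (sample - sample.mean()) / sample.std(ddof=1)
    return theoretical, standardized

rng = np.random.default_rng(0)
samples = {"normal": rng.normal(size=1000),
           "exponential": rng.exponential(size=1000)}

# Correlation of the QQ points: close to 1 means the points follow a line
qq_corr = {}
for name, sample in samples.items():
    theo, obs = qq_points(sample)
    qq_corr[name] = np.corrcoef(theo, obs)[0, 1]
    print(name, "QQ correlation:", qq_corr[name])
```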

Now let's take a closer look at the 'sulphates' column and repeat the same steps.

In [None]:
median = np.median(data['sulphates'])
mean = np.mean(data['sulphates'])
mode = data['sulphates'].mode()[0]
print('sulphates median: ', median)
print('sulphates mean: ', mean)
print('sulphates mode: ', mode)

q1 = data['sulphates'].quantile(0.25) # lower quartile (note: the method is .quantile() with an 'n',
q3 = data['sulphates'].quantile(0.75) # upper quartile  not .quartile() with an 'r')
print("Q1:", q1)
print("Q3:", q3)
print("IQR:", q3 - q1)

In [None]:

f, (ax_box, ax_hist) = plt.subplots(2, sharex=True, gridspec_kw={"height_ratios": (0.2, 1)})

sns.boxplot(x=data["sulphates"], ax=ax_box)
ax_box.axvline(mean, color='r', linestyle='--')
ax_box.axvline(median, color='g', linestyle='-')
ax_box.axvline(mode, color='b', linestyle='-')

# histplot with kde=True replaces the deprecated distplot
sns.histplot(x=data["sulphates"], kde=True, stat="density", ax=ax_hist)
ax_hist.axvline(mean, color='r', linestyle='--', label='Mean')
ax_hist.axvline(median, color='g', linestyle='-', label='Median')
ax_hist.axvline(mode, color='b', linestyle='-', label='Mode')

# Label the lines explicitly so the legend entries match them
ax_hist.legend()

ax_box.set(xlabel='')

Thus, we see that this histogram is positively skewed (it has a long right tail).

In [None]:
normal = stats.normaltest(data["sulphates"])
normal

The $p$-value $< 0.05$, so sulphates does not come from a normal distribution.

**Correlation**

Now, what if we want to know how quality is related to the other columns/features in our dataset?
This is what correlation measures.
The most popular coefficient is the Pearson correlation, but we need to keep in mind that it is very sensitive to outliers.

An alternative coefficient is the Spearman correlation: it is the Pearson correlation computed on the ranks of the values, which makes it much less sensitive to outliers.
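
The link between the two coefficients can be checked directly: the Spearman correlation is the Pearson correlation of the ranks. The sketch below uses made-up data and a hypothetical `ranks` helper that assumes there are no ties (real columns with ties need average ranks, which pandas handles for us).

```python
import numpy as np

def ranks(values):
    """Ranks 1..n; assumes no ties, which keeps the sketch short."""
    order = np.argsort(values)
    result = np.empty(len(values), dtype=float)
    result[order] = np.arange(1, len(values) + 1)
    return result

# Made-up data: y grows with x, but not linearly
x = np.arange(1, 11, dtype=float)
y = x ** 3

pearson = np.corrcoef(x, y)[0, 1]
spearman = np.corrcoef(ranks(x), ranks(y))[0, 1]
print("Pearson: ", pearson)
print("Spearman:", spearman)
```

Because the relationship is perfectly monotonic, the rank-based coefficient reaches 1 while Pearson stays below it.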

In [None]:
corr = data.corr(method="spearman")
# Mask the upper triangle (including the diagonal) so each pair is shown once
mask = np.zeros_like(corr, dtype=bool)  # np.bool was removed from NumPy
mask[np.triu_indices_from(mask)] = True
f, ax = plt.subplots(figsize=(15, 15))
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0, square=True, linewidths=.5, cbar_kws={"shrink": .5}, annot=True)

For example, the correlation between chlorides and free sulfur dioxide is very low, which means there is almost no monotonic relationship between these features. If the coefficient is right, the points on the next plot should form a shapeless cloud.

In [None]:
g = sns.lmplot(x="free sulfur dioxide", y="chlorides",
               height=5, data=data, line_kws={'color': 'red'})

g.set_axis_labels("free sulfur dioxide", "chlorides")

In contrast, the correlation between fixed acidity and pH is strong (and negative: more acid means a lower pH), so these features are close to linearly related. If the coefficient is right, the points on the next plot should lie close to a line.

In [None]:
g = sns.lmplot(x="fixed acidity", y="pH",
               height=5, data=data, line_kws={'color': 'red'})

g.set_axis_labels("fixed acidity", "pH")

# Inferential Statistics

* Sample Mean & Population Mean
* Confidence Intervals
   - Calculating Confidence Intervals
* Hypothesis Testing
   - p-value and t-test

In [None]:
data = pd.read_csv('/kaggle/input/red-wine-quality-cortez-et-al-2009/winequality-red.csv')
data.head()

In [None]:
print('Shape: \n', data.shape )


**Sample mean and Population mean**

An example using the alcohol feature:

In [None]:
np.random.seed(0)
sample = np.random.choice(a=data['alcohol'], size=500) 
print("Sample mean:", sample.mean() )                       
print("Population mean:", data['alcohol'].mean())

**Confidence Intervals**

In [None]:
sample_size = 200
sample = np.random.choice(a= data['alcohol'], size = sample_size)
sample_mean = sample.mean()
pop_stdev = data['alcohol'].std()

z_critical = stats.norm.ppf(q = 0.975)  # two-sided 95% interval: 2.5% in each tail
print("z-critical value: ",z_critical)

In [None]:
from statsmodels.stats.weightstats import _zconfint_generic, _tconfint_generic

# if we know the population std: pass the standard error of the mean, std / sqrt(n)
z_conf = _zconfint_generic(sample_mean,
                          pop_stdev / np.sqrt(sample_size),
                          0.05, 'two-sided')
print("95% confidence interval", z_conf)
# if we only know the sample std
t_conf = _tconfint_generic(sample_mean, sample.std(ddof=1) / np.sqrt(sample_size),
                           sample_size - 1,
                           0.05, 'two-sided')
print("95% confidence interval", t_conf)

In [None]:
print("True mean: {}".format(data['alcohol'].mean()))


The confidence interval should include the true mean: at a 95% confidence level, about 95% of the intervals built this way do.
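
The z-based interval is simply the sample mean plus or minus the critical value times the standard error, $\bar{x} \pm z_{1-\alpha/2} \cdot \sigma / \sqrt{n}$. A minimal sketch on a synthetic stand-in population (the `population` array and its parameters are arbitrary assumptions, not the wine data):

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(0)
# Synthetic stand-in population with arbitrary parameters
population = rng.normal(loc=10.4, scale=1.0, size=1600)
sample = rng.choice(population, size=200)  # drawn with replacement

confidence = 0.95
# Two-sided interval: put (1 - confidence) / 2 in each tail
z_critical = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
standard_error = population.std() / np.sqrt(sample.size)
margin = z_critical * standard_error

ci = (sample.mean() - margin, sample.mean() + margin)
print("z-critical:", z_critical)
print("95% CI:", ci)
print("Population mean:", population.mean())
```

Note that the margin shrinks with $\sqrt{n}$: four times as many observations give an interval half as wide.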

**Hypothesis Testing**

We saw some examples of hypothesis tests above; now let's apply the same idea again.

$H_0$: the mean alcohol content of wines with quality '5' is equal to the mean alcohol content of all wines

$H_1$: $H_0$ is not correct

In [None]:
from statsmodels.stats.weightstats import ztest

z_statistic, p_value = ztest(x1=data[data['quality'] == 5]['alcohol'], value=data['alcohol'].mean())
print('Z-statistic is :{}'.format(z_statistic))
print('P-value is :{}'.format(p_value))

$H_0$ is rejected, because the $p$-value $< 0.05$.
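
For reference, a one-sample z-test boils down to $z = (\bar{x} - \mu_0) / (s / \sqrt{n})$ and a two-sided p-value from the standard normal distribution. The sketch below is a hand-rolled version on made-up numbers; `one_sample_ztest` is a hypothetical helper, not the statsmodels implementation, and the value `10.4` is an arbitrary reference mean.

```python
import numpy as np
from statistics import NormalDist

def one_sample_ztest(sample, mu0):
    """Two-sided one-sample z-test; returns (z, p_value)."""
    sample = np.asarray(sample, dtype=float)
    # Standard error of the mean, using the sample std (ddof=1)
    standard_error = sample.std(ddof=1) / np.sqrt(sample.size)
    z = (sample.mean() - mu0) / standard_error
    # Probability of a value at least this extreme in either tail
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical toy values, not taken from the wine data
toy = [9.4, 9.8, 9.5, 10.0, 9.6, 9.7, 9.9, 9.5]
z_toy, p_toy = one_sample_ztest(toy, mu0=10.4)
print("z:", z_toy, "p-value:", p_toy)
```

A sample whose mean sits far below the reference value gives a large negative z and a tiny p-value, which is the pattern we see for the quality '5' wines.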

Another way to test: **Gosset's (Student's) t-test**

Now, let's also check whether the mean fixed acidity of wines with $quality = 5$ differs from the mean fixed acidity of all wines.

In [None]:
stats.ttest_1samp(a= data[data['quality'] == 5 ]['fixed acidity'],           
                 popmean= data['fixed acidity'].mean())

The p-value is below 0.05, so we can reject $H_0$ again.