In [None]:
!pip install seaborn==0.11.0
!pip install scikit-learn==0.24.0

# Wine Quality Analysis: EDA

<img style="margin-left:0" src="https://thumbor.forbes.com/thumbor/fit-in/1200x0/filters%3Aformat%28jpg%29/https%3A%2F%2Fspecials-images.forbesimg.com%2Fdam%2Fimageserve%2F1133888244%2F0x0.jpg%3Ffit%3Dscale" width="600px" />

This notebook analyses a dataset of **red** and **white** variants of the Portuguese "Vinho Verde" wine, based on **physicochemical test results** and the quality scores that experts assigned to each wine sample.

- **If you find this notebook interesting and useful, or if it helped you pick a good bottle of wine 🍷, please feel free to upvote it 💫**
- Modelling Part of the Analysis: https://www.kaggle.com/glushko/wine-quality-modelling-part-ii
- Github: https://github.com/roma-glushko/kaggle-wine-quality

In [None]:
import numpy as np
from scipy import stats
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
RANDOM_SEED = 42

plt.rcParams['figure.figsize'] = (12, 8)
sns.set_theme(style='whitegrid')

wineBarColors = ['#ecf0f1', '#e74c3c']
wineBinColors = ['#f39c12', '#c0392b']

wineBarPalette = sns.color_palette(wineBarColors)
wineBinPalette = sns.color_palette(wineBinColors)

In [None]:
raw_df = pd.read_csv('../input/wine-quality/winequalityN.csv')

full_df = raw_df.copy()

# Database Overview 🔍

Let's get a feel for our dataset:

In [None]:
full_df.info()

In [None]:
full_df.head()

In [None]:
full_df.describe().transpose()

We have **12 features**. Most of them are **continuous numbers**. In total, the dataset contains about **6.5k samples**. At first glance, that is a fairly large number compared to the number of features. Let's see if this is enough for us to explain the variance in the observations.

For comparison, the <a href="https://www.kaggle.com/glushko/house-prices-domain-driven-eda-part-i">Ames housing dataset</a> has around 3k samples and 80 features.

## Missing Values

The dataset contains some missing values. However, their share is small:

In [None]:
missing_df = full_df.isnull().sum()
missing_df = missing_df.drop(missing_df[missing_df == 0].index).sort_values(ascending=False)

missing_df = pd.DataFrame({'missing_count': missing_df})
missing_df['missing_rate(%)'] = (missing_df['missing_count'] / len(full_df)) * 100

missing_df

Most of the features with missing values are continuous, although **pH** can resemble a category as well. We will impute it with the most frequent value per **wine type**.

Other features could differ per wine type, so we group the rest of the imputations by **wine type**, too. **pH** could help impute **acidity-related features** more precisely:
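
The per-type imputation described above can be illustrated on a tiny frame. This is only a sketch: `toy_df` and its values are invented for the example and are not taken from the wine dataset.

```python
import numpy as np
import pandas as pd

# Toy data, invented purely for illustration
toy_df = pd.DataFrame({
    'type': ['red', 'red', 'red', 'white', 'white', 'white'],
    'chlorides': [0.08, np.nan, 0.10, 0.04, 0.05, np.nan],
})

# Fill each gap with the median of its own wine type
toy_df['chlorides_imputed'] = (
    toy_df.groupby('type')['chlorides']
    .transform(lambda s: s.fillna(s.median()))
)

print(toy_df)
```

Each missing value is replaced by the median of its own group, so a red sample never borrows a value from white samples.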

In [None]:
full_df['pH'] = full_df.groupby(['type'])['pH'].transform(lambda x: x.fillna(x.mode()[0]))

for feature in ['chlorides', 'sulphates', 'residual sugar']:
    full_df[feature] = full_df.groupby(['type'])[feature].transform(lambda x: x.fillna(x.median()))

for feature in ['fixed acidity', 'volatile acidity', 'citric acid']:
    full_df[feature] = full_df.groupby(['type', 'pH'])[feature].transform(lambda x: x.fillna(x.median()))

Just in case, let's also encode our single categorical feature **type**:

In [None]:
type_onehot = pd.get_dummies(full_df['type'], prefix='type')
full_df = pd.concat([full_df, type_onehot], axis=1)

In [None]:
full_df.isnull().sum()

Having a list of features by type is also frequently useful:

In [None]:
num_features = [f for f in full_df.columns if full_df.dtypes[f] != 'object']
num_features.remove('quality')

num_features

## Wine Quality Distribution

**Wine quality** ranges from **3 to 9** in the observed samples. **Quality scores 5, 6 and 7** are the most frequent:

In [None]:
sns.displot(data=full_df, x='quality', hue='type', kind='kde');

In [None]:
full_df['quality'].value_counts(), full_df['quality'].value_counts(normalize=True) * 100

The dataset is **imbalanced**. We only have a few records of the worst and the best wine samples. Depending on our classification objectives, this may be a problem, as we **would not have enough samples** to classify every quality score.

Given this, we would like to simplify the quality analysis by adding **low**, **average** and **high** quality groups:

In [None]:
def impute_quality_group(quality):
    if quality <= 5:
        return 0 # low
    if quality > 5 and quality < 7:
        return 1 # average
    if quality >= 7:
        return 2 # high

for dataset in [raw_df, full_df]:
    dataset['quality_group'] = dataset['quality'].apply(impute_quality_group)

In [None]:
sns.displot(data=full_df, x='quality_group', hue='type', kind='kde');

By reducing the number of classes, we end up with a **more balanced** dataset:

In [None]:
full_df['quality_group'].value_counts(), full_df['quality_group'].value_counts(normalize=True) * 100 

## Data Transformation

Looking at the dataset, we can see a lot of **right-skewed distributions** (fixed and volatile acidity, residual sugar, chlorides, free and total SO2, density). Only the **pH distribution** resembles a bell shape:

In [None]:
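_, ax = plt.subplots(4, 3, figsize=(15, 15))

# the one-hot wine type columns are binary, so they are skipped here
features_to_plot = [f for f in num_features if not f.startswith('type_')]

# walk the 4x3 grid cell by cell, one feature per subplot
for subplot, feature in zip(ax.flat, features_to_plot):
    sns.histplot(data=full_df, x=feature, hue='quality_group', kde=True, palette='tab10', ax=subplot)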

Let's estimate each feature distribution **skewness and kurtosis** on the original dataset and after applying the most common data transformations (Box-Cox, log and principal square root):

In [None]:
skewness_rows = []

transformed_df = full_df.copy()

for feature in num_features:
    skewness = transformed_df[feature].skew()
    kurtosis = transformed_df[feature].kurtosis()

    # Box-Cox needs strictly positive input, hence the shift by 1;
    # the second return value is the fitted lambda
    transformed_df[feature + '_transformed'], boxcox_test = stats.boxcox(1 + full_df[feature])
    boxcox_skewness = transformed_df[feature + '_transformed'].skew()
    boxcox_kurtosis = transformed_df[feature + '_transformed'].kurtosis()

    transformed_df[feature + '_transformed'] = np.log1p(transformed_df[feature])
    log_skewness = transformed_df[feature + '_transformed'].skew()
    log_kurtosis = transformed_df[feature + '_transformed'].kurtosis()

    transformed_df[feature + '_transformed'] = np.sqrt(transformed_df[feature])
    sqrt_skewness = transformed_df[feature + '_transformed'].skew()
    sqrt_kurtosis = transformed_df[feature + '_transformed'].kurtosis()

    # collect rows in a list: DataFrame.append was removed in pandas 2.0
    skewness_rows.append({
        'feature': feature,
        'skewness': skewness,
        'kurtosis': kurtosis,
        'boxcox_test': boxcox_test,
        'boxcox_skewness': boxcox_skewness,
        'boxcox_kurtosis': boxcox_kurtosis,
        'log_skewness': log_skewness,
        'log_kurtosis': log_kurtosis,
        'sqrt_skewness': sqrt_skewness,
        'sqrt_kurtosis': sqrt_kurtosis,
    })

skewness_df = pd.DataFrame(skewness_rows)

skewness_df

Kurtosis measures the heaviness of a distribution's tails, which is largely driven by outliers. **Chlorides**, **sulphates**, **free SO2**, **fixed acidity** and **residual sugar** have large kurtosis, so we want to take a look at their outliers.
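
To make the two statistics concrete, here is a small sketch on synthetic data (the samples are randomly generated, not wine measurements). It compares a normal sample with a right-skewed, heavy-tailed lognormal one, using the same pandas `skew()` and `kurtosis()` calls as the cell above. Note that pandas reports excess kurtosis, so a normal distribution scores close to 0.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Synthetic samples, generated only for this illustration
samples = pd.DataFrame({
    'normal': rng.normal(size=10_000),
    'lognormal': rng.lognormal(size=10_000),
})

summary = pd.DataFrame({
    'skewness': samples.skew(),
    'kurtosis': samples.kurtosis(),  # excess kurtosis (normal is ~0)
})

print(summary)
```

The lognormal column should show clearly positive skewness and a much larger kurtosis than the normal column.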

**Box-Cox transformation** performs well for most of the features.
**Density** is an exception. It has a small deviation (~0.002) and all values are ~0.99. This is probably why the Box-Cox transformation replaces the density values with a constant.
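
For readers who want to see what `stats.boxcox` computes, here is a minimal sketch on synthetic, strictly positive data (not the wine features). It fits lambda with SciPy and reproduces the transform by hand, assuming the usual definition $(x^{\lambda} - 1) / \lambda$ for $\lambda \neq 0$ and $\log x$ for $\lambda = 0$.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic right-skewed sample, strictly positive as Box-Cox requires
x = rng.lognormal(mean=0.0, sigma=0.5, size=1_000)

# With no lambda given, SciPy returns the transformed data and the fitted lambda
transformed, lmbda = stats.boxcox(x)

# The same transform written out by hand
if lmbda != 0:
    manual = (x ** lmbda - 1) / lmbda
else:
    manual = np.log(x)

print('fitted lambda:', lmbda)
print('skewness before:', stats.skew(x))
print('skewness after:', stats.skew(transformed))
```

The notebook shifts every feature by 1 before calling `stats.boxcox` so that zero values stay valid input.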

Let's skip **density** and use Box-Cox transformation for the rest of the features:

In [None]:
features_to_transform = list(num_features)
features_to_transform.remove('density')
features_to_transform.remove('type_red')
features_to_transform.remove('type_white')

for feature in features_to_transform:
    full_df[feature], _ = stats.boxcox(1 + full_df[feature])

In [None]:
for feature in features_to_transform:
    fig, ax = plt.subplots(2, 2, constrained_layout=True)
    
    # plot original distribution and its normal probability plot (missing values dropped)
    sns.histplot(data=raw_df, x=feature, hue='quality_group', kde=True, palette='tab10', ax=ax[0][0])
    stats.probplot(raw_df[feature].dropna(), dist=stats.norm, plot=ax[1][0])

    # plot transformed distribution and its normal probability plot
    sns.histplot(data=full_df, x=feature, hue='quality_group', kde=True, palette='tab10', ax=ax[0][1])
    stats.probplot(full_df[feature], dist=stats.norm, plot=ax[1][1])

The probability plots show that the **Box-Cox** transformation truly helped us reach a better degree of normality. In addition, the transformation helped to reveal the feature distributions more clearly. For instance, now we clearly see that **residual sugar** and **chlorides** have bimodal distributions, which was not obvious before.

## Outliers

Let's use boxplots to get an overview of the outliers in our dataset:

In [None]:
small_scale_features = list(set(num_features) - set(['total sulfur dioxide', 'free sulfur dioxide', 'type_white', 'type_red']))
large_scale_features = ['total sulfur dioxide', 'free sulfur dioxide']

_, (ax0, ax1) = plt.subplots(1, 2, figsize=(22, 10))

sns.boxplot(data=full_df[small_scale_features], palette='Set3', ax=ax0)
sns.boxplot(data=full_df[large_scale_features], palette='Set3', ax=ax1);

**Total SO2 and citric acid** have many outliers that lie far beyond ±1.5 IQR. Most of the remaining features also have outliers, but they are closer to the ±1.5 IQR bounds.
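
The 1.5 IQR rule that the boxplot whiskers are based on can be sketched as a small helper. The function name `iqr_outlier_mask` and the toy series are made up for this example.

```python
import pandas as pd

def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values below Q1 - k*IQR or above Q3 + k*IQR."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Toy values, invented for illustration: one obvious outlier at the end
toy = pd.Series([10, 12, 11, 13, 12, 11, 50])
mask = iqr_outlier_mask(toy)

print(toy[mask])
```

On this toy series only the last value falls outside the fences.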

We can review the anomalies we found:

In [None]:
full_df[full_df['total sulfur dioxide'] > 170][['type', 'alcohol', 'free sulfur dioxide', 'total sulfur dioxide', 'quality', 'quality_group']]

In [None]:
full_df[full_df['free sulfur dioxide'] > 20][['type', 'alcohol', 'free sulfur dioxide', 'total sulfur dioxide', 'quality', 'quality_group']]

In [None]:
full_df[full_df['citric acid'] > 0.75][['type', 'alcohol', 'free sulfur dioxide', 'total sulfur dioxide', 'quality', 'quality_group']]

Now we are aware of the outliers, but we will not take any action on them until we can check how our models work with and without them.

# Wine Physicochemical Analysis 🧪

## Wine Type

Our wine collection consists of **red** and **white** wine types. Red and white wines differ in more than grape variety and color. The production technology for these two types of wine is quite different:

- **Red wines** are fermented with the **grape skins** and **seeds** and white wines are **not**. This is because all the color in red wine comes from the **skins and seeds of the grapes**.

- The largest difference between **red winemaking** and **white winemaking** is oxidation, which causes wines to lose their floral and fruit notes in exchange for rich, nutty flavors and more smoothness.

More Info:
- https://winefolly.com/tips/red-wine-vs-white-wine-the-real-differences/
- https://www.healthline.com/nutrition/red-vs-white-wine

The dataset is **imbalanced** with respect to wine types: there are many more white wines than red ones (75% vs 25%):

In [None]:
full_df['type'].value_counts()

**Wine quality** doesn't seem to be dependent on the wine type:

In [None]:
sns.catplot(data=full_df, x='type', y='quality', kind='box', palette=wineBarPalette);

Physicochemical test results for "good" and "bad" wines may differ between **wine types**. So it may be helpful to **separate** our dataset **by wine type** for further investigation:

In [None]:
# copy the slices so that columns can be added to them later without warnings
red_wine_df = full_df[full_df['type'] == 'red'].copy()
white_wine_df = full_df[full_df['type'] == 'white'].copy()

Having wine type dataframes, we can see how **quality** of each wine type **correlates** with the rest of wine characteristics:

In [None]:
correlation_features = list(num_features)
correlation_features.append('quality')
correlation_features.remove('type_red')
correlation_features.remove('type_white')

sns.heatmap(full_df[correlation_features].corr(), annot=True, square=True, cmap='Greens', cbar=False);

In [None]:
fig, (ax0, ax1) = plt.subplots(1, 2, constrained_layout=True, figsize=(18, 18))

sns.heatmap(red_wine_df[correlation_features].corr(), annot=True, square=True, cmap='Greens', cbar=False, ax=ax0)
ax0.set_title('Red Wine')

sns.heatmap(white_wine_df[correlation_features].corr(), annot=True, square=True, cmap='Greens', cbar=False, ax=ax1)
ax1.set_title('White Wine');

**Red Wine Quality Insights**:
- **Alcohol** has a positive correlation with **Quality** (0.48)
- **Sulphates** has a positive correlation with **Quality** (0.25)
- **Citric Acid** has a positive correlation with **Quality** (0.23)
- **Fixed Acidity** has a positive correlation with **Quality** (0.12)
- **Chlorides** has a negative correlation with **Quality** (-0.13)
- **Density** has a negative correlation with **Quality** (-0.17)
- **Total Sulfur Dioxide** has a negative correlation with **Quality** (-0.19)
- **Volatile Acidity** has a negative correlation with **Quality** (-0.39)

**White Wine Quality Insights**:
- **Alcohol** has a positive correlation with **Quality** (0.44)
- **Fixed Acidity** has a negative correlation with **Quality** (-0.11)
- **Total Sulfur Dioxide** has a negative correlation with **Quality** (-0.17)
- **Volatile Acidity** has a negative correlation with **Quality** (-0.19)
- **Chlorides** has a negative correlation with **Quality** (-0.21)
- **Density** has a negative correlation with **Quality** (-0.31)

The overall wine correlations are close to the **white wine** correlations. This is because the dataset contains more **white wines** than **red wines**.
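
The effect of group size on a pooled correlation can be reproduced with synthetic data. In this sketch the two groups, their sizes and their slopes are invented: the larger group has a negative relationship between `x` and `y`, the smaller one a positive relationship.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

def make_group(name, n, slope):
    """Synthetic group where y depends linearly on x plus unit noise."""
    x = rng.normal(size=n)
    y = slope * x + rng.normal(size=n)
    return pd.DataFrame({'group': name, 'x': x, 'y': y})

toy_df = pd.concat([
    make_group('majority', 3000, slope=-1.0),
    make_group('minority', 1000, slope=1.0),
], ignore_index=True)

# Correlation over all rows vs. correlation inside each group
pooled_corr = toy_df['x'].corr(toy_df['y'])
group_corr = {name: g['x'].corr(g['y']) for name, g in toy_df.groupby('group')}

print(f'pooled: {pooled_corr:.2f}')
print(group_corr)
```

The pooled value takes the sign of the larger group and sits closer to it, which is why looking at each wine type separately is worthwhile.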

## Acidity and pH

<img style="margin-left:0" src="https://upload.wikimedia.org/wikipedia/commons/thumb/7/70/Wine_grape_diagram_en.svg/2560px-Wine_grape_diagram_en.svg.png" width="600px" />

The **acids** in wine are an important component in both winemaking and the finished product of wine. They are present in both **grapes and wine**, having direct influences on the **color, balance and taste** of the wine as well as the **growth and vitality of yeast** during fermentation and **protecting the wine from bacteria**. 

Here is acidity information we have been given:
- **fixed acidity** (tartaric acid - $g/dm^3$) - most acids involved with wine are fixed or nonvolatile (do not evaporate readily)
- **volatile acidity** (acetic acid - $g/dm^3$) - the amount of acetic acid in wine, which at too high levels can lead to an unpleasant, vinegar taste
- **citric acid** ($g/dm^3$) - found in small quantities, citric acid can add "freshness" and flavor to wines
- **pH** - describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale

More Info:
- https://waterhouse.ucdavis.edu/whats-in-wine/fixed-acidity
- http://www.oiv.int/public/medias/3732/oiv-ma-as313-02.pdf
- https://waterhouse.ucdavis.edu/whats-in-wine/volatile-acidity
- https://www.decanter.com/learn/volatile-acidity-va-45532/
- https://extension.psu.edu/volatile-acidity-in-wine

Let's take a quick look at the pairwise relations among the acidity features:

In [None]:
g = sns.PairGrid(full_df[['fixed acidity', 'volatile acidity', 'citric acid', 'pH', 'type']], hue='type', diag_sharey=False, palette=wineBinPalette)

g.map_upper(sns.scatterplot, s=15)
g.map_lower(sns.kdeplot)
g.map_diag(sns.kdeplot, lw=2);

Here is what we can see immediately from the contour plots:
- **pH** correlates negatively with **fixed acidity** and **citric acid**. A lower **pH** means a more tart taste.
- **Volatile acidity** correlates positively with **pH** in **red wine samples**
- **Citric acid** correlates negatively with **volatile acidity**

### Fixed Acidity

<img style="margin-left: 0" src="https://pubchem.ncbi.nlm.nih.gov/image/imgsrv.fcgi?cid=875&t=l" width="600px" />

**Tartaric acid** is, from a winemaking perspective, the most important acid in wine due to the prominent role it plays in **maintaining the chemical stability of the wine and its color, and finally in influencing the taste** of the finished wine.

Most of the **acids** involved with wine are **fixed acids** with the notable exception of **acetic acid**, mostly found in vinegar, which is volatile and can contribute to the wine fault known as **volatile acidity**.

**Fixed acidity** affects **red and white wine quality** differently:
- **Red wine** quality improves with increasing **fixed acidity**
- Conversely, **white wine** quality degrades with increasing **fixed acidity**

In [None]:
sns.lmplot(data=full_df, x='fixed acidity', y='quality', hue='type', palette=wineBinPalette, scatter_kws={'alpha': 0.1});

In [None]:
sns.catplot(data=full_df, x='quality_group', y='fixed acidity', kind='box', palette='Set3')
sns.catplot(data=full_df, x='quality_group', y='fixed acidity', hue='type', kind='box', palette=wineBarPalette);

### Volatile Acidity

<img style="margin-left:0" src="https://i.insider.com/5ef6191e5af6cc6d7604ae44?width=2000&format=jpeg&auto=webp" width="600px" />

**Volatile acidity** is mostly caused by **bacteria** in the wine creating **acetic acid** (the acid that gives vinegar its characteristic flavor and aroma) and its byproduct, ethyl acetate. It is only beneficial when it is introduced intentionally during production and in small doses. Increasing **volatile acidity** hurts **wine quality**.

This is what we see on the plot: **wine quality** goes up when **volatile acidity** decreases:

In [None]:
sns.lmplot(data=full_df, x='volatile acidity', y='quality', hue='type', palette=wineBinPalette, scatter_kws={'alpha': 0.1})

sns.catplot(data=full_df, x='quality_group', y='volatile acidity', kind='box', palette='Set3')
sns.catplot(data=full_df, x='quality_group', y='volatile acidity', hue='type', kind='box', palette=wineBarPalette);

**Volatile acidity** has fairly strong negative correlations with **citric acid** and with **free and total SO2** content. These components might protect wine from the bacterial activity that produces **acetic acid**.

We can combine **citric acid** and **total SO2** to create an **antibacterial_componets** variable (scaling **total SO2** to g/L):

In [None]:
full_df['antibacterial_componets'] = full_df['citric acid'] + full_df['total sulfur dioxide'] * 0.001

sns.lmplot(data=full_df, x='antibacterial_componets', y='volatile acidity', hue='type', palette=wineBinPalette, scatter_kws={'alpha': 0.1});

### Citric Acid

<img style="margin-left:0" src="https://images-na.ssl-images-amazon.com/images/I/41VHzHjkVvL.jpg" width="600px" />

The **citric acid** is found in very small quantities in wine grapes. It can be used by winemakers in acidification to boost the **wine's total acidity** and prevent **ferric hazes**. It is used less frequently than **tartaric and malic** due to the aggressive **citric flavors** it can add to the wine. When **citric acid** is added, it is always done after primary alcohol fermentation has been completed due to the tendency of yeast to convert citric into **acetic acid**. 

More Info:
- https://wineserver.ucdavis.edu/industry-info/enology/methods-and-techniques/common-chemical-reagents/citric-acid

**Citric acid** improves **red wine quality** and has a slightly negative impact on **white wine quality**:

In [None]:
sns.lmplot(data=full_df, x='citric acid', y='quality', hue='type', palette=wineBinPalette, scatter_kws={'alpha': 0.1});

In [None]:
sns.catplot(data=full_df, x='quality_group', y='citric acid', kind='box', palette='Set3')
sns.catplot(data=full_df, x='quality_group', y='citric acid', hue='type', kind='box', palette=wineBarPalette);

So we have 3 features that contribute to wine acidity. Let's try to create a **total acidity** feature and see its effect on the wine quality:

In [None]:
full_df['total_acidity'] = full_df['fixed acidity'] + full_df['volatile acidity'] + full_df['citric acid']

sns.catplot(data=full_df, x='quality_group', y='total_acidity', kind='box', palette='Set3')
sns.catplot(data=full_df, x='quality_group', y='total_acidity', hue='type', kind='box', palette=wineBarPalette);

### pH test

<img style="margin-left:0" src="https://253qv1sx4ey389p9wtpp9sj0-wpengine.netdna-ssl.com/wp-content/uploads/2019/06/PH_level_6.jpg" width="600px" />

**pH** stands for **power of hydrogen**, which is a measurement of the **hydrogen ion concentration** in the solution.

The **pH** scale runs from **0 to 14**, from **very acidic to very basic, or alkaline**. 
**pH** affects almost every aspect of the wine: flavor, aroma, color, and potentially even quality.

Lower **pH** numbers mean higher amounts of hydrogen ions, and therefore a **more acidic wine**. For comparison, lemon juice usually has a pH between 2 and 3, while orange juice and wine are generally between 3 and 4, with some wines reaching slightly beyond that, to the high **2s or low 4s**. The scale is logarithmic: each full-point decrease in pH makes the solution 10 times more acidic, so the difference between a pH of 3 and a pH of 4 is very significant.
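
Because pH is the negative base-10 logarithm of the hydrogen ion concentration, a one-point difference in pH is a tenfold difference in concentration. A minimal sketch (the helper name `hydrogen_ion_concentration` is made up for this example):

```python
def hydrogen_ion_concentration(ph: float) -> float:
    """Hydrogen ion concentration in mol/L for a given pH."""
    return 10 ** (-ph)

# Compare a wine at pH 3 with a wine at pH 4
ratio = hydrogen_ion_concentration(3.0) / hydrogen_ion_concentration(4.0)

print(ratio)  # approximately 10
```

So a wine at pH 3 has roughly ten times the hydrogen ion concentration of a wine at pH 4.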

More Info:
- https://daily.sevenfifty.com/how-winemakers-analyze-ph-and-its-impact-on-wine/
- https://rstudio-pubs-static.s3.amazonaws.com/277109_1f7df6787d344858af90b5a5d6c1ef65.html

In [None]:
sns.lmplot(data=full_df, x='pH', y='quality', hue='type', palette=wineBinPalette, scatter_kws={'alpha': 0.1});

In [None]:
# labels describe acidity, so the lowest pH bin is the 'high' acidity group
pH_labels = ['high', 'mod high', 'medium', 'low']
pH_bins = [2.5, 3.2, 3.3, 3.4, 4.1]

full_df['pH_groups'] = pd.cut(raw_df['pH'], bins=pH_bins, labels=pH_labels) 

sns.catplot(data=full_df, x='pH_groups', hue='quality_group', kind='count', palette='Set2')
sns.catplot(data=full_df, x='pH_groups', hue='type', kind='count', palette=wineBinPalette);

**pH** shows the same picture as **fixed acidity**. **Red wine quality** increases with higher acidity (lower pH value), while **white wine quality** increases as acidity decreases.

## Sugar, Alcohol and Density

<img style="margin-left:0" src="https://upload.wikimedia.org/wikipedia/commons/d/d5/Mthomebrew_must.JPG" width="600px" />

The process of **fermentation** in winemaking turns grape juice into an alcoholic beverage. During fermentation, **yeasts** transform sugars present in the juice into **alcohol** and carbon dioxide.

Here is a list of wine properties related to **fermentation** process:

- **residual sugar** ($g/dm^3$) - the amount of **sugar** remaining after **fermentation stops**, it's rare to find wines with less than 1 g/L and wines with greater than 45 g/L are considered sweet
- **alcohol** (% by volume) - the percent alcohol content of the wine
- **density** ($g/cm^3$) - the density of wine is close to that of water, depending on the percent alcohol and sugar content

In [None]:
g = sns.PairGrid(full_df[['alcohol', 'residual sugar', 'density', 'type']], hue='type', diag_sharey=False, palette=wineBinPalette)

g.map_upper(sns.scatterplot, s=15)
g.map_lower(sns.kdeplot)
g.map_diag(sns.kdeplot, lw=2);

### Residual Sugar

<img style="margin-left:0" src="https://media.winefolly.com/Residual-Sugar-wine.png" width="600px" />

**Residual sugar** comes from natural grape sugars left over in a wine after alcoholic fermentation finishes. It defines the wine's **sweetness**.

More Info:
- https://winefolly.com/deep-dive/what-is-residual-sugar-in-wine/

Here are quick ranges of residual sugar by wine type:

**White Wine**:
- Total: 0.6-65.8 g/dm^3
- In Low Tier: 0.7-17.55 g/dm^3
- In High Tier: 0.8-14.8 g/dm^3

**Red Wine**:
- Total: 0.9-15.5 g/dm^3
- In Low Tier: 1.2-12.9 g/dm^3
- In High Tier: 1.4-6.4 g/dm^3

**Residual sugar** has a slightly positive correlation with **red wine quality** and a slightly negative correlation with **white wine quality**.

In [None]:
sns.catplot(data=full_df, x='quality_group', y='residual sugar', hue='type', kind='box', palette=wineBarPalette);

The **residual sugar** distribution is bimodal. **Wine type** explains only part of the left hump:

In [None]:
sns.displot(data=full_df, x='residual sugar', hue='type', kind='hist', kde=True, palette=wineBinPalette);

Let's see if **wine sweetness** can help to explain the rest.

**Residual sugar** can be useful to group wine by **sweetness**. **Sweetness** is one of the five key wine characteristics:

In [None]:
def impute_sweetness(residual_sugar):
    if residual_sugar < 1:
        return 'bone-dry'
    if residual_sugar >= 1 and residual_sugar < 9:
        return 'dry'
    if residual_sugar >= 9 and residual_sugar < 18:
        return 'off-dry'
    if residual_sugar >= 18 and residual_sugar < 50:
        return 'semi-sweet'
    if residual_sugar >= 50 and residual_sugar < 120:
        return 'mid-sweet'
    if residual_sugar >= 120:
        return 'sweet'

full_df['sweetness'] = raw_df['residual sugar'].apply(impute_sweetness)
# reuse the labels computed from raw values: 'residual sugar' in white_wine_df is already Box-Cox transformed
white_wine_df = white_wine_df.assign(sweetness=full_df['sweetness'])

sns.catplot(data=full_df, x='sweetness', y='quality', kind='box', palette='Set3');

**Wine sweetness** doesn't help to discriminate wine quality, but it explains the **residual sugar distribution peaks** well:

In [None]:
_, (ax0, ax1) = plt.subplots(1, 2)

ax0.set_title('All Wine Sweetness')
sns.histplot(data=full_df, x='residual sugar', hue='sweetness', kde=True, ax=ax0)

ax1.set_title('White Wine Sweetness')
sns.histplot(data=white_wine_df, x='residual sugar', hue='sweetness', kde=True, ax=ax1);

**Acidity and sweetness** are two of the five principal characteristics of wine. We can combine them in a **sugar to acidity ratio**, since sugar can soften the sour taste of wine.

More Info:
- https://winefolly.com/deep-dive/understanding-acidity-in-wine/

In [None]:
full_df['sugar_acidity_ratio'] = full_df['residual sugar'] / full_df['total_acidity']

sns.lmplot(data=full_df, x='quality', y='sugar_acidity_ratio', hue='type', palette=wineBinPalette, scatter_kws={'alpha': 0.1});

With increasing **sugar to acidity ratio**, the **wine quality** starts to **decrease**.

### Percentage of Alcohol

<img style="margin-left:0" src="https://media.winefolly.com/wine-serving-size-based-on-alcohol-content1.png" width="600px" />

**Percentage of alcohol** has a strong positive correlation with **wine quality**:

In [None]:
raw_df['alcohol'].describe()

In [None]:
sns.lmplot(data=full_df, x='alcohol', y='quality', hue='type', palette=wineBinPalette, scatter_kws={'alpha': 0.1})
sns.catplot(data=full_df, x='quality_group', y='alcohol', hue='type', kind='box', palette=wineBarPalette);

In [None]:
sns.histplot(data=full_df, x='alcohol', hue='type', kde=True, palette=wineBinPalette);

**Alcohol** content is created from **sugar**. Generally, the **more alcohol wine** contains, the **less residual sugar** it has (**-0.54** correlation):

In [None]:
sns.lmplot(data=full_df, x='alcohol', y='residual sugar', hue='quality_group', scatter_kws={'alpha': 0.1});

Generally, wines contain **9-11% alcohol**. Let's group our samples into 4 **alcohol tiers**:

In [None]:
alcohol_labels = ['low', 'medium', 'mod high', 'high']
alcohol_bins = [0, 9.5, 11.5, 13.5, 20]

full_df['alcohol_groups'] = pd.cut(raw_df['alcohol'], bins=alcohol_bins, labels=alcohol_labels) 

_, (ax0, ax1) = plt.subplots(1, 2)
sns.countplot(data=full_df, x='quality_group', hue='alcohol_groups', palette='Set2', ax=ax0)
sns.histplot(data=full_df, x='alcohol', hue='alcohol_groups', kde=True, ax=ax1);

### Density

**Density** is generally used as a measure of the **conversion of sugar to alcohol**. Prior to fermentation, wine will contain **sugars** which will make the liquid more **dense**. When the **wine** is undergoing fermentation the **sugars** in the liquid are converted by the yeast into **alcohol**. 

**Alcohol in water** is **less dense** than sugar in water. The density of water is **0.99984** $g/cm^3$.

As we can see, the **wine density increases** when **wine quality** goes down:

In [None]:
sns.lmplot(data=full_df, x='density', y='quality', hue='type', palette=wineBinPalette, scatter_kws={'alpha': 0.1})
sns.catplot(data=full_df, x='quality_group', y='density', hue='type', kind='box', palette=wineBarPalette);

**Residual sugar** is what makes wine **dense**. Their correlation is fairly strong (**0.54**). The scatter plot shows **two outliers** with a high amount of sugar and high density. The relationship between **residual sugar** and **density** is **heterogeneous**:

In [None]:
sns.jointplot(data=full_df, x='density', y='residual sugar', hue='type', kind='kde', palette=wineBinPalette);

## Sulfur Dioxide and Sulphates

<img style="margin-left:0" src="https://grapesandwine.cals.cornell.edu/sites/grapesandwine.cals.cornell.edu/files/shared/images/101Fig1300.jpg" width="600px" />

**Sulfur dioxide** is thought to help rid the wine of a wide variety of bacteria (good and bad); this seems to **lower the wine quality as well** because it dulls the wine's fermentation process.

- **free sulfur dioxide** ($mg/dm^3$) - the free form of $SO_{2}$ exists in equilibrium between molecular $SO_{2}$ (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine
- **total sulfur dioxide** ($mg/dm^3$) - amount of free and bound forms of $SO_{2}$; in low concentrations, $SO_{2}$ is mostly undetectable in wine, but at free $SO_{2}$ concentrations over 50 ppm, $SO_{2}$ becomes evident in the nose and taste of wine
- **sulphates** (potassium sulphate - $g/dm^3$) - a wine additive which can contribute to sulfur dioxide gas ($SO_{2}$) levels, which acts as an antimicrobial and antioxidant

More Info:
- https://www.decanter.com/learn/wine-terminology/sulfites-in-wine-friend-or-foe-295931/
- https://link.springer.com/chapter/10.1007/978-1-4757-6255-6_12
- https://winefolly.com/deep-dive/sulfites-in-wine/
- https://www.healthline.com/nutrition/sulfites-in-wine

In [None]:
g = sns.PairGrid(full_df[['free sulfur dioxide', 'total sulfur dioxide', 'sulphates', 'type']], hue='type', diag_sharey=False, palette=wineBinPalette)

g.map_upper(sns.scatterplot, s=15)
g.map_lower(sns.kdeplot)
g.map_diag(sns.kdeplot, lw=2);

The **free SO2** amount doesn't seem to have a strong correlation with **wine quality**. **Total SO2** has a slight negative correlation with **red wine quality**:

In [None]:
sns.catplot(data=full_df, x='quality_group', y='free sulfur dioxide', hue='type', kind='box', palette=wineBarPalette)
sns.catplot(data=full_df, x='quality_group', y='total sulfur dioxide', hue='type', kind='box', palette=wineBarPalette);

**Total SO2** consists of **free SO2** and **bound SO2** (combined with compounds such as phenols, acetaldehyde and sugar). This is the reason why they correlate strongly:

In [None]:
sns.jointplot(data=full_df, x='total sulfur dioxide', y='free sulfur dioxide', hue='type', kind='kde', palette=wineBinPalette);

**White wine** contains more **bound SO2** than **red wine**. Both types of wine tend to have **less bound SO2** as **sample quality** goes up:

In [None]:
full_df['bound_sulfur_dioxid'] = full_df['total sulfur dioxide'] - full_df['free sulfur dioxide']

sns.catplot(data=full_df, x='quality_group', y='bound_sulfur_dioxid', hue='type', kind='box', palette=wineBarPalette);

**Free and total SO2** can be combined as a ratio. Better-tasting wines have a larger **free SO2** ratio:

In [None]:
full_df['free_total_so2_rate'] = full_df['free sulfur dioxide'] / full_df['total sulfur dioxide']

sns.catplot(data=full_df, x='quality_group', y='free_total_so2_rate', hue='type', kind='box', palette=wineBarPalette);

**Free SO2** consists of **molecular SO2**, **bisulfite ions ($HSO_3^-$)** and **sulfite ions ($SO_3^{2-}$)**. **Molecular SO2** is directly active in preventing oxidation and spoilage. Bisulfites and sulfites are much less reactive components.

The effectiveness of $SO_2$ in protecting the wine from oxidation and microbial spoilage is **dependent upon the pH** of the wine.
Therefore, the **higher the pH** of a wine, the **less effective the SO2**. Wines with higher pH levels, red or white, may require too high a total SO2 level to achieve the desired free $SO_2$ levels.

Rather than have excessive **bound SO2** (which may give a "chemical" taste), it is best to rely on a combination of factors, including susceptibility to spoilage. Some pH problems can be relieved by adjusting the pH downward with tartaric acid.

More Info:
- http://www.santarosa.edu/~jhenderson/Sulfur%20Dioxide.pdf
- http://srjcstaff.santarosa.edu/~jhenderson/SO2.pdf
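
The pH dependence described above can be made concrete with a small worked example. The sketch below uses the same formula as the next cell, molecular = free / (1 + 10^(pH - 1.8)); the function name `molecular_so2` and the free SO2 level of 30 are assumptions chosen for illustration.

```python
def molecular_so2(free_so2, ph, pka=1.8):
    """Molecular SO2 (same unit as free_so2) for a given pH.

    molecular = free / (1 + 10 ** (pH - pKa)); pka=1.8 mirrors the
    constant used in this notebook.
    """
    return free_so2 / (1 + 10 ** (ph - pka))


free = 30.0  # hypothetical free SO2 level, for illustration only

# The same amount of free SO2 yields less molecular SO2 as pH rises
for ph in (3.0, 3.3, 3.6):
    print(f'pH {ph}: {molecular_so2(free, ph):.2f} molecular SO2')
```

Because the pH term sits in an exponent, a shift of a few tenths of a pH unit changes the molecular fraction noticeably, which is why pH and free SO2 are considered together.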

In [None]:
# Molecular SO2 estimated from free SO2 and pH
full_df['molecular_sulfur_dioxid'] = full_df['free sulfur dioxide'] / (1 + 10 ** (full_df['pH'] - 1.8))
# Remaining free SO2: bisulfite and sulfite ions
full_df['sulfites'] = full_df['free sulfur dioxide'] - full_df['molecular_sulfur_dioxid']
# Total SO2 relative to pH
full_df['sulfur_dioxid_pH_ratio'] = full_df['total sulfur dioxide'] / full_df['pH']

sns.catplot(data=full_df, x='quality_group', y='molecular_sulfur_dioxid', hue='type', kind='box', palette=wineBarPalette)
sns.catplot(data=full_df, x='quality_group', y='sulfites', hue='type', kind='box', palette=wineBarPalette)
sns.catplot(data=full_df, x='quality_group', y='sulfur_dioxid_pH_ratio', hue='type', kind='box', palette=wineBarPalette);

**Molecular SO2** and **sulfites** tend to be lower in **higher-quality wines**.

### Sulphates

**Sulphates** are salts of sulfuric acid (that's why they have a **negative correlation** with the **SO2** amount). Sulfur dioxide reacts with other chemical compounds (metabolites) in wine to create sulphite compounds, which heavily characterise the **chemical profile of the oldest wines**. Sulphates also add a bit of a "sharp" taste:

In [None]:
sns.catplot(data=full_df, x='quality_group', y='sulphates', hue='type', kind='box', palette=wineBarPalette);

**Red wine** tends to have a higher **sulphate** content than **white wine**. **Sulphates** correlate slightly with **wine quality**.

## Chlorides

<img style="margin-left:0" src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRLeecYvHp8BIW_cEbXZEhbiwIaTxRuImVT0g&usqp=CAU" width="600px" />

**Chloride** concentration in wine is influenced by **geography**: the highest levels are found in wines from countries where irrigation is carried out using **salty water** or in areas with brackish terrain.

**Chlorides** (**sodium chloride**, $g/dm^3$): the amount of **salt** in the wine.

In [None]:
sns.lmplot(data=full_df, x='chlorides', y='quality', hue='type', palette=wineBinPalette, scatter_kws={'alpha': 0.1});

In [None]:
sns.catplot(data=full_df, x='quality_group', y='chlorides', hue='type', kind='box', palette=wineBarPalette);

**Chloride** content **decreases** as **wine quality goes up**.

# Principal Component Analysis 📈

Principal component analysis will help us draw all observations in 2D space to see whether there are any visual patterns we need to know about.

But first, let's check the cumulative explained variance plot for the dataset:

In [None]:
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

x_processed = StandardScaler().fit_transform(full_df[num_features])

In [None]:
pc_analyser = PCA(
    random_state=RANDOM_SEED
)

pc_analyser.fit(x_processed)

plt.plot(np.cumsum(pc_analyser.explained_variance_ratio_))
plt.xlabel('number of components')
plt.ylabel('cumulative explained variance');

As we can see from the plot, we can use **6 components** to explain **90% of the sample variance**.
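
The number of components can also be read off programmatically rather than from the plot. The sketch below is a minimal illustration: `components_for_variance` is a hypothetical helper and `toy_ratios` is a made-up array; in the notebook, `pc_analyser.explained_variance_ratio_` would be passed instead.

```python
import numpy as np


def components_for_variance(explained_variance_ratio, threshold=0.9):
    """Smallest number of leading components whose cumulative ratio reaches threshold."""
    cumulative = np.cumsum(explained_variance_ratio)
    # First position where the cumulative ratio is >= threshold
    first_idx = np.searchsorted(cumulative, threshold)
    # Clamp in case rounding keeps the total slightly below the threshold
    return int(min(first_idx + 1, len(cumulative)))


# Made-up ratios for illustration only
toy_ratios = np.array([0.5, 0.25, 0.125, 0.0625, 0.0625])

n_components = components_for_variance(toy_ratios, threshold=0.9)
print(n_components)
```

scikit-learn can make a similar selection itself when `PCA` is given a float between 0 and 1 as `n_components`, for example `PCA(n_components=0.9, svd_solver='full')`.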

When we plot all observations, we see two dense groups in the figure. These are the **two types of wine**, and they are discriminated well.

Regarding wine quality, there is no such clear discrimination (at least in our low-dimensional 2D space). The color intensity of the groups increases from top to bottom, which corresponds to increasing wine quality (according to the legend), but the pattern is not strong:

In [None]:
pca2d = PCA(
    n_components=2,
    random_state=RANDOM_SEED
)

x2d = pca2d.fit_transform(x_processed)

pca_df = pd.DataFrame(data=x2d, columns=['pc1', 'pc2'])
# Reset the index so the labels align row by row with the PCA output
pca_df = pd.concat([pca_df, full_df[['type', 'quality_group', 'volatile acidity']].reset_index(drop=True)], axis=1)

_, (ax0, ax1) = plt.subplots(2, 1)

sns.scatterplot(data=pca_df, x='pc1', y='pc2', hue='type', alpha=0.3, ax=ax0, palette=wineBinPalette)
sns.scatterplot(data=pca_df, x='pc1', y='pc2', hue='quality_group', alpha=0.3, ax=ax1);

In [None]:
pca_features = num_features.copy()
pca_features.remove('type_red')
pca_features.remove('type_white')

x_processed = StandardScaler().fit_transform(full_df[pca_features])
y_processed = full_df[['quality_group']].copy()
y_processed = y_processed.reset_index(drop=True)

**Manifold learning** approaches don't help to discriminate quality groups either. However, we do see the same pattern of wine quality decreasing from top to bottom:

In [None]:
from sklearn.manifold import Isomap

X_iso = Isomap(n_neighbors=20, n_components=2).fit_transform(x_processed)

pca_df = pd.DataFrame(data=X_iso, columns=['discr1', 'discr2'])
pca_df = pd.concat([pca_df, y_processed], axis=1)

sns.scatterplot(data=pca_df, x='discr1', y='discr2', hue='quality_group', alpha=0.3);

In [None]:
from sklearn.neighbors import NeighborhoodComponentsAnalysis

nca = NeighborhoodComponentsAnalysis(init='random', n_components=2, random_state=RANDOM_SEED)
# Pass the labels as a 1D Series rather than a single-column DataFrame
X_nca = nca.fit_transform(x_processed, y_processed['quality_group'])

pca_df = pd.DataFrame(data=X_nca, columns=['comp1', 'comp2'])
pca_df = pd.concat([pca_df, y_processed], axis=1)

sns.scatterplot(data=pca_df, x='comp1', y='comp2', hue='quality_group', alpha=0.3);

# Feature Importance

Feature importance was reviewed in <a href="https://www.kaggle.com/glushko/wine-quality-modelling-part-ii#Model-Inspection-%F0%9F%94%8E">the second part of the research</a>.

# Summary

In this notebook, we tried to recall school chemistry and analyzed wine test results and their relation to wine quality and type.

**If you found this notebook interesting and useful, or it helped you select a good bottle of wine 🍷, please feel free to upvote it 💫**

Github: https://github.com/roma-glushko/kaggle-wine-quality

## Credits

- https://bibinmjose.github.io/RedWineDataAnalysis/
- https://s3.amazonaws.com/udacity-hosted-downloads/ud651/wineQualityInfo.txt
- https://towardsdatascience.com/what-makes-a-wine-good-ea370601a8e4
- https://www.kaggle.com/mgmarques/wines-type-and-quality-classification-exercises

# Want More? 💫

- <a href="https://www.kaggle.com/glushko/house-prices-domain-driven-eda-part-i">Ames House Prices: Domain-Driven EDA (Part I)</a>
- <a href="https://www.kaggle.com/glushko/house-prices-regression-modelling-part-ii">Ames House Prices: Modelling (Part II)</a>