In [None]:


# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

## Getting the Data

In [None]:
#import required library

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
df = pd.read_csv('/kaggle/input/red-wine-quality-cortez-et-al-2009/winequality-red.csv',index_col=False)

In [None]:
df

In [None]:
#check if there are null values
df.info() #output showing no nulls so that's good

## EDA

## Initial observation: 

1) all the columns are numerical continuous values

2) the columns are on very different scales, so we may have to scale them before training our models, since some models such as SVM are sensitive to feature scales
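As a minimal sketch of what scaling does, the snippet below applies min-max scaling by hand to two made-up columns on very different scales. The values and the `features` array are assumptions for illustration only, not rows from the dataset.

```python
import numpy as np

# Made-up values for two features on very different scales
# (illustrative only, not rows from the dataset).
features = np.array([
    [0.05, 30.0],
    [0.08, 60.0],
    [0.10, 150.0],
    [0.07, 90.0],
])

# Min-max scaling: (x - min) / (max - min), computed per column.
col_min = features.min(axis=0)
col_max = features.max(axis=0)
scaled = (features - col_min) / (col_max - col_min)

print(scaled.min(axis=0))  # smallest value of each column after scaling
print(scaled.max(axis=0))  # largest value of each column after scaling
```

After scaling, both columns span the same range, so neither dominates a distance-based model simply because of its units.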

## Column info:

**fixed acidity** - most acids involved with wine are fixed or nonvolatile (do not evaporate readily)

**volatile acidity** - the amount of acetic acid in wine, which at too high of levels can lead to an unpleasant, vinegar taste

**citric acid** - found in small quantities, citric acid can add 'freshness' and flavor to wines

**residual sugar** - the amount of sugar remaining after fermentation stops, it's rare to find  wines with less than 1 gram/liter and wines with greater than 45 grams/liter are considered sweet 

**chlorides** - the amount of salt in the wine

**free sulfur dioxide** - the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine

**total sulfur dioxide** - amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine

**density** - the density of wine is close to that of water depending on the percent alcohol and sugar content

**pH** - describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale

**sulphates** - a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant

**alcohol** - the percent alcohol content of the wine

## Some initial thoughts:

1) Some of the columns are pretty chemical in nature and would probably require some domain knowledge to understand them fully

2) Some columns provide us info with how to better explore our data

    a) For example, wines with residual sugar above 45 grams / liter are considered sweet.  With that in mind, we may be able to get some insights into how sweetness affects quality
    
    b) For pH, lower values mean a more acidic wine and higher values a more basic one, and most wines sit between 3 and 4, which gives us a natural range to split on.

In [None]:
##An initial pairplot to see how the columns relate to each other

sns.set_style('darkgrid')
sns.pairplot(df)
plt.show()

## Pairplot observation

- Citric acid and fixed acidity seem positively correlated
- density and fixed acidity seem positively correlated
- pH and fixed acidity seem negatively correlated

All of which makes sense from a chemical perspective

### For quality:

- there does not seem to be a single factor that makes or breaks a wine, since the points do not form a strong trend or concentration in the quality column

### Moving forward:

- I want to dig deeper into each feature and see if our observation is indeed the case

## Acidity & pH

In [None]:
df[['fixed acidity', 'citric acid', 'pH','volatile acidity']].corr()

In [None]:
df[['fixed acidity', 'citric acid', 'pH','quality']].corr()


This confirms our observation that fixed acidity, citric acid and pH are strongly correlated with each other, which may cause multicollinearity issues down the line
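One common way to put a number on multicollinearity is the variance inflation factor (VIF). The sketch below computes it from the inverse of the correlation matrix on synthetic data; the arrays `a`, `b`, `c` and `X_demo` are made up for illustration and are not the wine features.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for three features (not the real data):
# b is built almost entirely from a, c is independent noise.
a = rng.normal(size=500)
b = 0.9 * a + 0.1 * rng.normal(size=500)
c = rng.normal(size=500)
X_demo = np.column_stack([a, b, c])

# VIF_j = 1 / (1 - R_j^2), which equals the j-th diagonal entry
# of the inverse of the feature correlation matrix.
corr = np.corrcoef(X_demo, rowvar=False)
vif = np.diag(np.linalg.inv(corr))

for name, value in zip(['a', 'b', 'c'], vif):
    print(f'{name}: VIF = {value:.2f}')
```

A VIF close to 1 means a feature is nearly independent of the others; large values (a common rule of thumb is above 5 or 10) suggest the feature is largely redundant.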

In [None]:
sns.displot(data=df,x='pH',kde=True)
plt.title('pH distribution')
plt.show()

In [None]:
sns.displot(data=df,x='fixed acidity',kde=True)
plt.title('fixed acidity distribution')
plt.show()

In [None]:
sns.displot(data=df,x='citric acid',kde=True)
plt.title('citric acid distribution')
plt.show()

In [None]:
## seeing how pH affects quality

def pH_split(x):
    #lower pH means more acidic, so flag the wines below 3.5
    if x < 3.5:
        return 1
    else:
        return 0
#basically if pH < 3.5, we say the wine is on the more acidic side

df['Acidic or not'] = df['pH'].apply(pH_split)


In [None]:
df.groupby('Acidic or not').mean()['quality']

Acidity and pH level do not seem to affect wine quality much.  Given how correlated these factors are with each other, it's probably enough to conclude that acidity by itself does not affect quality that much

In [None]:
#volatile acidity - the amount of acetic acid in wine, which at too high of levels can lead to an unpleasant, vinegar taste

df[['volatile acidity','quality']].corr()

#a negative correlation makes sense: the higher the volatile acidity, the more likely the wine tastes unpleasant.  However, most wine makers probably know this and avoid it, so few wines with high volatile acidity are produced in the first place

## Water, sugar, salt and alcohol content

In [None]:
df['residual sugar'].describe()

## More on residual sugar

According to https://www.arkansasonline.com/news/2019/may/15/wine-s-residual-sugar-determines-sweetn/,

- residual sugar is measured in grams / litre
- 0-9 g/L is considered "dry"
- 10-19 g/L is considered "off-dry"

Since our data only goes as far as 15.5 g/L, let's separate the wine into dry and off-dry to see if there are insights about wine quality
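A dry / off-dry flag like this can also be computed with a vectorised comparison on the whole column. The sketch below uses a small made-up frame called `demo`; the values are hypothetical and only chosen to sit on both sides of the 9 g/L cut-off.

```python
import pandas as pd

# Hypothetical residual sugar values in g/L (not taken from the dataset).
demo = pd.DataFrame({'residual sugar': [1.9, 2.6, 9.0, 9.5, 15.5]})

# A boolean comparison on the whole column, cast to int:
# 1 means off-dry (> 9 g/L), 0 means dry.
demo['off dry or not'] = (demo['residual sugar'] > 9).astype(int)

print(demo)
```

On a column of this size the speed difference is negligible; the main benefit is that the threshold sits on one readable line.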

In [None]:
def wine_cat(x):
    if x>9:
        return 1 #1 meaning off-dry
    else:
        return 0 #0 meaning dry

In [None]:
df['off dry or not'] = df['residual sugar'].apply(wine_cat)
df #off dry column added

In [None]:
df.groupby('off dry or not').mean()['quality']

In [None]:
df.groupby('off dry or not').median()['quality']

Similar case as before.  Residual sugar does not seem to be a big indicator of wine quality: there is a small difference in the mean, but the median is the same

In [None]:
sns.displot(data=df,x='chlorides',kde=True)
plt.show()

Chlorides seem to be centered around 0.1

In [None]:
sns.displot(data=df,x='density',kde=True)
plt.show()

Density seems to be roughly normally distributed, with only very small differences between the wines

In [None]:
sns.displot(data=df,x='alcohol',kde=True)
plt.show()

Most are around 9-10% alcohol content

In [None]:
df[['alcohol','chlorides','density','residual sugar','quality']].corr()

Among these factors, quality seems to have the strongest correlation with alcohol.  However, even then it's still only 0.476 or so

In [None]:
df['alcohol'].describe()

## More on alcohol - low, moderate and high alcohol wine

According to https://www.masterclass.com/articles/learn-about-alcohol-content-in-wine-highest-to-lowest-abv-wines#which-wines-have-low-alcohol-content,

- Low alcohol usually refers to around 7.9% to less than 11.5%
- Moderate alcohol usually refers to around 11% - 13%
- High alcohol usually refers to something up to 15%

Let's try to categorize each wine into the 3 categories, with the adjustment that any alcohol% > 13% will be considered higher in alcohol content
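As a sketch of this three-way split, `pd.cut` can assign the grades in one call. The `alcohol` series below holds hypothetical percentages, and the bin edges (11 and 13) are the cut-offs chosen for this notebook rather than an official standard.

```python
import numpy as np
import pandas as pd

# Hypothetical alcohol percentages (not rows from the dataset).
alcohol = pd.Series([8.4, 10.9, 11.0, 12.5, 13.0, 14.9])

# right=False makes each bin closed on the left:
# [-inf, 11), [11, 13), [13, inf)
grade = pd.cut(
    alcohol,
    bins=[-np.inf, 11, 13, np.inf],
    labels=[0, 1, 2],  # 0 = low, 1 = moderate, 2 = high
    right=False,
).astype(int)

print(grade.tolist())
```

Setting `right=False` matters at the boundaries: a wine at exactly 11% lands in the moderate bin and one at exactly 13% in the high bin.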

In [None]:
def alcohol_level(x):
    if x < 11:
        return 0 #low alcohol
    elif x < 13:
        return 1 #moderate
    else:
        return 2 #high

In [None]:
df['alcohol grade'] = df['alcohol'].apply(alcohol_level)
df

In [None]:
df.groupby('alcohol grade').mean()['quality']

In [None]:
df.groupby('alcohol grade').median()['quality']

In [None]:
df.groupby('alcohol grade').count()['quality']

Alcohol thus far seems to yield the most prominent difference in wine quality scores.  However, most of the wines fall into the low alcohol group, so the averages for the moderate and high groups rest on far fewer samples and may not be representative of wine in general.
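One way to keep the group sizes in view is to compute the count, mean and median in a single table. The sketch below does this on a deliberately unbalanced, made-up frame called `demo`; the numbers are not from the dataset.

```python
import pandas as pd

# Made-up grades and scores, deliberately unbalanced.
demo = pd.DataFrame({
    'alcohol grade': [0, 0, 0, 0, 0, 0, 1, 1, 1, 2],
    'quality':       [5, 5, 6, 5, 6, 5, 6, 7, 6, 7],
})

# Selecting the column before aggregating avoids averaging unrelated columns.
summary = demo.groupby('alcohol grade')['quality'].agg(['count', 'mean', 'median'])
print(summary)
```

Reading the mean next to the count makes it obvious when a group average is based on only a handful of wines.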

## Sulfur

In [None]:
df[['free sulfur dioxide','total sulfur dioxide']]

In [None]:
df[['free sulfur dioxide','total sulfur dioxide', 'quality']].corr()

## Chemical Intuition

It makes sense that free and total sulfur dioxide are highly correlated with each other, given that total sulfur dioxide includes the free form.  Interestingly, recall the column description:

*total sulfur dioxide - amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine*

And it does show: total sulfur dioxide has a negative correlation with quality, albeit a small one.

What I want to do is similar to what I did for acidity: categorize the wines as above or below 50 ppm and see whether the change in taste affects the quality score.  Note that the 50 ppm threshold in the description refers to free SO2, while the split below is applied to total SO2, so it should be read as a rough proxy.

In [None]:
df['total sulfur dioxide'].describe()

In [None]:
#assuming the above is indeed in the 'ppm' unit

def ppm(x):
    if x>50:
        return 1 # 'becomes evident in the nose and taste of wine'
    else:
        return 0

In [None]:
df['ppm above 50'] = df['total sulfur dioxide'].apply(ppm)
df

In [None]:
df.groupby('ppm above 50').mean()['quality']

In [None]:
df.groupby('ppm above 50').median()['quality']

In [None]:
df.groupby('ppm above 50').count()['quality']

There does seem to be some difference in quality between the two groups, but not a large one.  Given the counts, low sulfur dioxide wine seems to dominate the dataset

## Sulphates

*a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant*

Thus far, I can't really find any external sources saying how to interpret this one feature

In [None]:
df[['sulphates','quality']].corr()

At first glance, there seems to be *some* positive correlation between sulphates and quality

In [None]:
sns.displot(data=df,x='sulphates',kde=True)
plt.show()

In [None]:
df.groupby('quality').mean()['sulphates']

In [None]:
df.groupby('quality').median()['sulphates']

A case could be made that the higher the sulphate content, the more likely it will be rated higher quality

In [None]:
df.groupby('quality').count()['sulphates']

However, a simple count of the outcome (quality) again shows that there are significantly fewer high quality wines than middle quality wines, so these group averages should be read with caution

## EDA summary

### Our major findings are as follows:

1) Alcohol seems to be most explanatory when it comes to quality ratings

2) Less significant features are total sulfur dioxide and sulphates

3) With help of some external sources, we managed to further analyze the alcohol, dryness, acidity of a wine and their relations to the quality rating

4) There does not seem to be a clear trend or strong correlation between any single feature and quality rating

5) Perhaps most unfortunately, the dataset does not have branding and price available.  Given that wine is a luxury, one can hypothesize that branding is probably important when it comes to wine.

However, we should still work with what we've got and see if we can come up with a model to predict the quality scores

## Classification or Regression

### Potential Problem with Classification
Assume you're working for a wine dealership with this dataset.

Think back to how the quality scores are distributed: there are far more 5s and 6s than anything else.  If you're simply classifying, you can boost your accuracy by labelling most of the wine as 5s or 6s.  The point is that accuracy is probably not a good metric for this business problem.  One can imagine that the biggest margin comes from premium, high quality wine, so the cost of misclassifying such wine as a 5 or 6 is far greater.

If this were a binary problem, we could simply use recall to account for this error.  However, given that this is a multi-class classification problem, that is much more difficult.
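To make the accuracy argument concrete, the sketch below scores a lazy "model" that always predicts the most common label. The label counts in `y_true` are made up to mimic the imbalance and are not the dataset's actual counts.

```python
import numpy as np

# Made-up labels that mimic the imbalance: mostly 5s and 6s, few 7s and 8s.
y_true = np.array([5] * 45 + [6] * 40 + [7] * 12 + [8] * 3)

# A lazy "model" that always predicts the most common score.
y_lazy = np.full_like(y_true, 5)

accuracy = (y_true == y_lazy).mean()
print(f'accuracy: {accuracy:.2f}')

# Per-class recall: of the wines that truly have score k, how many were found?
recall = {}
for k in np.unique(y_true):
    mask = y_true == k
    recall[int(k)] = float((y_lazy[mask] == k).mean())
    print(f'recall for quality {k}: {recall[int(k)]:.2f}')
```

The overall accuracy looks tolerable while the recall for the premium scores is zero, which is exactly the failure that matters most for the business case described above.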

### Why Regression may be more appropriate
With regression, the output is a continuous float.  We can then round up or round down depending on how aggressive we want to be with our quality scoring.  In my opinion, that inherently addresses the aforementioned problem without being too aggressive.

In fact, the fractional part of the prediction acts as a rough indicator (not a true confidence interval) of whether to predict up or down
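The rounding choices can be sketched as follows. The predictions in `raw` and the cut-off `threshold = 0.3` are hypothetical; the threshold is only one possible way of expressing "how aggressive" the rounding should be.

```python
import numpy as np

# Hypothetical regression outputs.
raw = np.array([5.2, 5.5, 5.8, 6.49, 6.51])

conservative = np.floor(raw).astype(int)  # always round down
nearest = np.rint(raw).astype(int)        # round to the nearest integer
aggressive = np.ceil(raw).astype(int)     # always round up

# A tunable middle ground: round up once the fractional part
# reaches a chosen threshold (0.3 here is an arbitrary example).
threshold = 0.3
fraction = raw - np.floor(raw)
custom = np.where(fraction >= threshold, np.ceil(raw), np.floor(raw)).astype(int)

print(conservative, nearest, aggressive, custom)
```

Lowering the threshold pushes more borderline wines into the higher score, which trades extra false "premium" labels for fewer missed ones.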

While from a purely data science POV this is a classification problem, I think regression is more appropriate from a business standpoint.  I encourage any comments and discussion about this in the comment section.

## Feature Selection

Given how correlated the acidity features are, I will simply use pH as an aggregate for the acidity in the wine (with the exception of volatile acidity, since that is something that actually affects the taste and has a lower correlation with pH).

Given how correlated free and total sulfur dioxide are, I will simply use total as an aggregate for the two of them.

All the other features are used, except the ones I engineered for the EDA.

In [None]:
#features selected
features = ['pH','volatile acidity', 'residual sugar','chlorides','density','alcohol','total sulfur dioxide','sulphates']
target = 'quality'

X = df[features]
y = df[target]

## Model Selection

There are mainly 2 models I want to try.

**Gradient boosted regression trees.**  As far as I know, and to put it simply, "the main idea behind gradient boosting is to combine many simple models.  Each tree can only provide good predictions on part of the data, and so more and more trees are added to iteratively improve performance" (Introduction to Machine Learning).  In addition, this builds trees in a serial manner, with each tree trying to correct the mistakes of the previous one.  I think this also best mimics the real-life wine tasting profession.  According to Wikipedia:

*A wine rating is a score assigned by one or more wine critics to a wine tasted as a summary of that critic's evaluation of that wine. A wine rating is therefore a subjective quality score, typically of a numerical nature, given to a specific bottle of wine. In most cases, wine ratings are set by a single wine critic, but in some cases a rating is derived by input from several critics tasting the same wine at the same time.*

And each wine critic probably learns from his or her previous wine tasting.
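The serial "each tree corrects the previous ones" idea can be sketched from scratch with one-split trees (stumps) and squared error. This is a toy illustration on synthetic 1-D data, not scikit-learn's implementation; the helpers `fit_stump` and `predict_stump`, the data, and the settings (50 rounds, learning rate 0.1) are all assumptions made for the example.

```python
import numpy as np

def fit_stump(x, residual):
    """Find the single split on x that best fits the residual (squared error)."""
    best = None
    for threshold in np.unique(x)[:-1]:  # skip the max so the right side is never empty
        left = residual[x <= threshold]
        right = residual[x > threshold]
        error = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or error < best[0]:
            best = (error, threshold, left.mean(), right.mean())
    _, threshold, left_value, right_value = best
    return threshold, left_value, right_value

def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return np.where(x <= threshold, left_value, right_value)

# Synthetic 1-D data (not the wine dataset): a noisy curve.
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 6, size=200))
y = np.sin(x) + 0.1 * rng.normal(size=200)

learning_rate = 0.1
prediction = np.full_like(y, y.mean())  # start from a constant guess
rmse_history = [np.sqrt(np.mean((y - prediction) ** 2))]

for _ in range(50):
    residual = y - prediction        # what the ensemble still gets wrong
    stump = fit_stump(x, residual)   # a weak tree fitted to those mistakes
    prediction += learning_rate * predict_stump(stump, x)
    rmse_history.append(np.sqrt(np.mean((y - prediction) ** 2)))

print(f'RMSE with 0 trees:  {rmse_history[0]:.3f}')
print(f'RMSE with 50 trees: {rmse_history[-1]:.3f}')
```

Each stump on its own is a poor model, but because every new one is fitted to the current residuals, the training error shrinks step by step. The learning rate controls how much of each correction is applied.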

**SVM**.  From our EDA, we concluded that there is no single strong linear trend between the various features and wine quality.  However, a relationship may still exist beyond the linear, which is what SVM can capture for us with Gaussian and polynomial kernels.  (The code below uses LinearSVR, which is linear only, because the kernel version was taking too long to train.)

In [None]:
#import required library

from sklearn.ensemble import GradientBoostingRegressor
from sklearn.svm import LinearSVR #regular SVR was taking too long, maybe in a future version
from sklearn.preprocessing import MinMaxScaler #SVR is pretty sensitive to feature scales from what I've read
from sklearn.model_selection import GridSearchCV #hyperparameter tuning
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.metrics import accuracy_score #just to test accuracy after rounding the predictions

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.25,random_state=42)

## Gradient Boosting Model

In [None]:
gbr = GradientBoostingRegressor() #instantiation

In [None]:
param = {'learning_rate':[0.1,1,10,100],
        'n_estimators':[10,50,100],
        'max_features':[8,4]}

cv_gbr = GridSearchCV(estimator=gbr, param_grid = param ,
                      scoring = 'neg_root_mean_squared_error',
                     cv = 5, verbose = 4, n_jobs=-1)

In [None]:
cv_gbr.fit(X_train, y_train)

In [None]:
print(cv_gbr.best_params_)
print(-cv_gbr.best_score_)

In [None]:
gbr_tuned = GradientBoostingRegressor(learning_rate = 0.1, 
                                      max_features = 4, 
                                      n_estimators = 50)

In [None]:
gbr_tuned.fit(X_train, y_train)
y_pred = gbr_tuned.predict(X_test)
#trying the hyperparameter tuned model on the test set

In [None]:
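#rounded down (truncation, which equals floor for positive predictions)
y_pred_down = y_pred.astype(int)

#rounded to the nearest integer (np.rint does not always round up)
y_pred_up = np.rint(y_pred).astype(int)

In [None]:
print(f'Rounding to nearest RMSE: {np.sqrt(mean_squared_error(y_test,y_pred_up))}')
print(f'Rounding down RMSE: {np.sqrt(mean_squared_error(y_test,y_pred_down))}')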

In [None]:
df['quality'].describe()

## When we round to the nearest integer, the RMSE is about 85% of the standard deviation of quality
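The comparison with the standard deviation is useful because the standard deviation is the RMSE you would get by always predicting the mean score. A minimal sketch of that ratio, using made-up arrays `y_true` and `y_hat` rather than the notebook's actual results:

```python
import numpy as np

# Made-up true scores and rounded predictions (not results from the notebook).
y_true = np.array([5, 6, 5, 7, 6, 5, 8, 6, 5, 6])
y_hat = np.array([5, 6, 6, 6, 6, 5, 7, 5, 5, 6])

rmse = np.sqrt(np.mean((y_true - y_hat) ** 2))

# Population standard deviation = RMSE of always predicting the mean.
# (pandas' describe() reports the sample version, which is slightly larger.)
baseline = y_true.std()

ratio = rmse / baseline
print(f'RMSE: {rmse:.3f}, baseline: {baseline:.3f}, ratio: {ratio:.2f}')
```

A ratio near 1 means the model is barely better than guessing the average; the further below 1, the more of the variation in quality the model explains.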

In [None]:
print(f'Rounding to nearest accuracy: {accuracy_score(y_test,y_pred_up)}')
print(f'Rounding down accuracy: {accuracy_score(y_test,y_pred_down)}')

One possible reason for the less than stellar performance is that trees are only as good as what they saw in the training set: with so few 7s and 8s to learn from, the model is likely to miss them

## SVM

In [None]:
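#preprocessing: split first, then fit the scaler on the training set only,
#so that no information from the test set leaks into the scaling

X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.25,random_state=42)

In [None]:
scaler = MinMaxScaler()

X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
X_train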

In [None]:
svRegressor = LinearSVR()

param = {
    'C':[0.1,1,100,1000]
}

cv_svr = GridSearchCV(estimator=svRegressor, param_grid=param,
                      scoring = 'neg_root_mean_squared_error',
                      cv=5, verbose = 4, n_jobs = -1
                     )
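Scaling can also be folded into the cross-validation itself by wrapping the scaler and the model in a `Pipeline`, so that each fold fits the scaler only on its own training part. The sketch below uses synthetic data; `X_demo`, `y_demo`, the step names `scale` and `svr`, the smaller `C` grid and `max_iter=10000` are assumptions made for the example.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import LinearSVR

# Synthetic regression data standing in for the wine features.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 100, size=(200, 3))
y_demo = 0.05 * X_demo[:, 0] - 0.02 * X_demo[:, 1] + rng.normal(scale=0.1, size=200)

pipe = Pipeline([
    ('scale', MinMaxScaler()),
    ('svr', LinearSVR(max_iter=10000, random_state=0)),
])

# Step parameters are addressed as <step name>__<parameter>.
search = GridSearchCV(
    pipe,
    param_grid={'svr__C': [0.1, 1, 10]},
    scoring='neg_root_mean_squared_error',
    cv=5,
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(-search.best_score_)
```

With this setup, calling `search.predict` on raw, unscaled test features applies the fitted scaler automatically before the regressor.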

In [None]:
cv_svr.fit(X_train, y_train)

In [None]:
cv_svr.best_params_

In [None]:
-cv_svr.best_score_

In [None]:
y_pred = cv_svr.predict(X_test)
y_pred

In [None]:
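#rounded down (truncation, which equals floor for positive predictions)
y_pred_down = y_pred.astype(int)

#rounded to the nearest integer (np.rint does not always round up)
y_pred_up = np.rint(y_pred).astype(int)

In [None]:
print(f'Rounding to nearest RMSE: {np.sqrt(mean_squared_error(y_test,y_pred_up))}')
print(f'Rounding down RMSE: {np.sqrt(mean_squared_error(y_test,y_pred_down))}')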

In [None]:
print(f'Rounding to nearest accuracy: {accuracy_score(y_test,y_pred_up)}')
print(f'Rounding down accuracy: {accuracy_score(y_test,y_pred_down)}')

## Model conclusion

It seems that both models perform on par with each other.  In a future version, if I decide to continue working on this, maybe something could be done with classification, like having multiple models that each predict a one-versus-all probability for a wine quality score, then selecting the class with the highest probability as the label
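A minimal sketch of that one-versus-all idea, using scikit-learn's `OneVsRestClassifier` around a logistic regression. The data is synthetic: `X_demo`, `y_demo` and the three class centres are made up, and logistic regression is just one possible choice of base model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Synthetic three-class data (labels 5, 6, 7 chosen to look like quality scores).
rng = np.random.default_rng(0)
centers = {5: [-2.0, 0.0], 6: [0.0, 0.0], 7: [2.0, 0.0]}
X_demo = np.vstack([rng.normal(loc=c, scale=0.7, size=(60, 2)) for c in centers.values()])
y_demo = np.repeat(list(centers.keys()), 60)

# One binary "this score vs. everything else" model per class.
ovr = OneVsRestClassifier(LogisticRegression())
ovr.fit(X_demo, y_demo)

proba = ovr.predict_proba(X_demo[:5])            # one column per class
labels = ovr.classes_[np.argmax(proba, axis=1)]  # pick the most probable class

print(proba.round(2))
print(labels)
```

Having the per-class probabilities also makes it possible to apply a lower cut-off for the rare high quality classes instead of always taking the maximum.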