In this notebook, I train both classification and regression models with different techniques and compare them. Will they be effective or not?

# 1. Data description

**Features:**
1. fixed acidity - most acids involved with wine are fixed or nonvolatile (do not evaporate readily)
2. volatile acidity - the amount of acetic acid in wine, which at too high of levels can lead to an unpleasant, vinegar taste
3. citric acid - found in small quantities, citric acid can add 'freshness' and flavor to wines
4. residual sugar - the amount of sugar remaining after fermentation stops, it's rare to find wines with less than 1 gram/liter and wines with greater than 45 grams/liter are considered sweet
5. chlorides - the amount of salt in the wine
6. free sulfur dioxide - the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine
7. total sulfur dioxide - amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine
8. density - the density of wine is close to that of water depending on the percent alcohol and sugar content
9. pH - describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale
10. sulphates - a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant
11. alcohol - the percent alcohol content of the wine
12. quality - output variable (based on sensory data, score between 0 and 10)

**Tips by author**

Aside from regression modelling, an interesting thing to do is to set an arbitrary cutoff for your dependent variable (wine quality), e.g. classifying 7 or higher as 'good/1' and the remainder as 'not good/0'.
This allows you to practice hyperparameter tuning on e.g. decision tree algorithms, looking at the ROC curve and the AUC value.
Without doing any kind of feature engineering or overfitting you should be able to get an AUC of .88 (without even using random forest algorithm).

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
import xgboost as xgb
import lightgbm as lgb
from sklearn.metrics import f1_score, roc_auc_score, mean_squared_error, mean_absolute_error
from imblearn.over_sampling import RandomOverSampler

import warnings
warnings.filterwarnings('ignore')

In [None]:
df = pd.read_csv('../input/red-wine-quality-cortez-et-al-2009/winequality-red.csv')

df.head(3)

Since I will modify the data, I first copy the original dataset.

In [None]:
df_copy = df.copy()

# 2. Classification task

In [None]:
# Binarize in one step: 1 for quality >= 7 ('good'), 0 otherwise ('not good').
df['quality'] = (df['quality'] >= 7).astype(int)

# 2.1 EDA

In [None]:
sns.pairplot(df, hue='quality');

In the pairplot the two classes overlap heavily in every pair of features, so a linear model is **unlikely** to separate them well.
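
A pairplot only shows two features at a time, so one way to test this impression is to actually score a simple linear baseline. Below is a minimal sketch, assuming a standardised logistic regression as the linear model. The helper name `linear_baseline_auc` and the synthetic stand-in data are my own illustration, not part of the original analysis.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def linear_baseline_auc(features, target, folds=4):
    """Mean cross-validated ROC AUC of a standardised logistic regression."""
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    scores = cross_val_score(model, features, target, cv=folds, scoring='roc_auc')
    return float(np.mean(scores))


# Synthetic stand-in (assumption): 11 features, roughly 14% positives.
demo_x, demo_y = make_classification(
    n_samples=600, n_features=11, n_informative=5,
    n_clusters_per_class=1, weights=[0.86], random_state=9,
)
demo_auc = linear_baseline_auc(demo_x, demo_y)
print(f'Linear baseline ROC AUC on synthetic data: {demo_auc:.2f}')
```

In the notebook, the same helper could be called on the wine features and the binarized target to see how far a linear model really is from the tree ensembles.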

In [None]:
plt.title('Class distribution')
sns.countplot(x='quality', data=df);

In [None]:
plt.figure(figsize=(10, 10))
plt.xticks(rotation='vertical')
sns.boxplot(data=df);

# 2.2 Baseline

I'd like to try these 3 models, since they tend to score well on tabular data:
1. Random Forest Classifier
2. XGB Classifier
3. LGBM Classifier

In [None]:
SEED = 9
DECIMALS = 2

In [None]:
x = df.drop('quality', axis=1).values
y = df.quality.values

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=.2, random_state=SEED, stratify=y)

In [None]:
def get_f1_rocauc(model, dec=DECIMALS):
    # F1 needs hard labels; ROC AUC is a ranking metric, so score it on probabilities.
    preds = model.predict(x_test)
    probs = model.predict_proba(x_test)[:, 1]
    f1 = f1_score(y_test, preds)
    roc_auc = roc_auc_score(y_test, probs)
    f1_round = np.round(f1, dec)
    roc_auc_round = np.round(roc_auc, dec)
    
    print(f'F1 score: {f1_round}, ROC AUC: {roc_auc_round}')
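
ROC AUC measures how well a model ranks positives above negatives, so its value depends on whether it receives predicted probabilities or hard 0/1 labels. Here is a small sketch on made-up numbers (the labels and probabilities are hypothetical, chosen only to make the arithmetic easy to follow):

```python
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 0, 1, 1]
probs = [0.1, 0.2, 0.6, 0.4, 0.9]         # hypothetical positive-class probabilities
labels = [int(p >= 0.5) for p in probs]   # hard labels at a 0.5 threshold

# 5 of the 6 positive/negative pairs are ranked correctly -> 5/6.
auc_from_probs = roc_auc_score(y_true, probs)
# With hard labels the ranking collapses into ties -> 7/12.
auc_from_labels = roc_auc_score(y_true, labels)
print(round(auc_from_probs, 3), round(auc_from_labels, 3))
```

With hard labels the result equals balanced accuracy at that single threshold, which is usually lower than the AUC computed from probabilities.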

In [None]:
rfc_base = RandomForestClassifier(n_jobs=-1, random_state=SEED)
rfc_base.fit(x_train, y_train)

get_f1_rocauc(rfc_base)

In [None]:
xgbc_base = xgb.XGBClassifier(n_jobs=-1, random_state=SEED)
xgbc_base.fit(x_train, y_train)

get_f1_rocauc(xgbc_base)

In [None]:
lgbmc_base = lgb.LGBMClassifier(n_jobs=-1, random_state=SEED)
lgbmc_base.fit(x_train, y_train)

get_f1_rocauc(lgbmc_base)

As we can see **LGBM Classifier** is the **best** here.

# 2.3 Random oversampling

Since the dataset is imbalanced, you can oversample the minority class to make the classes equal in size by using **RandomOverSampler**!

In [None]:
ros = RandomOverSampler(random_state=SEED)
x_resampled, y_resampled = ros.fit_resample(x_train, y_train)
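
To show what random oversampling does conceptually, here is a NumPy-only sketch. It is not imbalanced-learn's implementation, and the function name `random_oversample` and the toy arrays are hypothetical. The idea is simply to keep every original row and duplicate randomly drawn minority rows until the classes match.

```python
import numpy as np


def random_oversample(features, target, seed=9):
    """Duplicate randomly chosen rows until every class matches the largest one."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(target, return_counts=True)
    largest = counts.max()
    keep = [np.arange(len(target))]  # always keep every original row
    for cls, count in zip(classes, counts):
        if count < largest:
            candidates = np.flatnonzero(target == cls)
            # Draw the missing rows with replacement from this class.
            keep.append(rng.choice(candidates, size=largest - count, replace=True))
    idx = np.concatenate(keep)
    return features[idx], target[idx]


toy_x = np.arange(20).reshape(10, 2)
toy_y = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])  # 8 negatives, 2 positives
bal_x, bal_y = random_oversample(toy_x, toy_y)
print(np.bincount(bal_y))  # [8 8]
```

As in the cell above, the resampling should only touch the training split, so that duplicated rows never leak into the test set.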

In [None]:
rfc_os = RandomForestClassifier(n_jobs=-1, random_state=SEED)
rfc_os.fit(x_resampled, y_resampled)

get_f1_rocauc(rfc_os)

In [None]:
xgbc_os = xgb.XGBClassifier(n_jobs=-1, random_state=SEED)
xgbc_os.fit(x_resampled, y_resampled)

get_f1_rocauc(xgbc_os)

In [None]:
lgbmc_os = lgb.LGBMClassifier(n_jobs=-1, random_state=SEED)
lgbmc_os.fit(x_resampled, y_resampled)

get_f1_rocauc(lgbmc_os)

Well, the oversampling technique achieved a decent score compared to the baseline models.

# 2.4 Tuning

As the next step, I tune each model's hyperparameters.

The original dataset is not that large, so I use 4-fold CV instead of the common 5-fold; the larger validation folds may help the score.

In [None]:
skf = StratifiedKFold(4, shuffle=True, random_state=SEED)
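
Stratification matters here because the positive class is small: each fold should keep roughly the same class ratio as the whole dataset. A minimal sketch on a hypothetical toy target (20 positives out of 100 rows, not the wine data):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

toy_y = np.array([1] * 20 + [0] * 80)   # hypothetical 20% positive class
toy_x = np.zeros((len(toy_y), 1))       # features are irrelevant for splitting

demo_skf = StratifiedKFold(4, shuffle=True, random_state=9)
splits = list(demo_skf.split(toy_x, toy_y))

# Count the positives and the total rows in every validation fold.
positives_per_fold = [int(toy_y[test_idx].sum()) for _, test_idx in splits]
fold_sizes = [len(test_idx) for _, test_idx in splits]
print(positives_per_fold, fold_sizes)
```

Every validation fold ends up with the same share of positives, which keeps per-fold F1 scores comparable.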

In [None]:
def train_gscv_model(estimator, param_grid, task, skf=skf):
    # The search is fitted on the global x and y, i.e. the full dataset.
    if task == 'class':
        model = GridSearchCV(estimator, param_grid, scoring='f1', n_jobs=-1, cv=skf)
        model.fit(x, y)
        print('Best f1 score: ', np.round(model.best_score_, DECIMALS))
        print('Best params: ', model.best_params_)
    elif task == 'reg':
        model = GridSearchCV(estimator, param_grid, scoring='neg_mean_squared_error', n_jobs=-1, cv=skf)
        model.fit(x, y)
        # best_score_ holds the negated MSE, so flip the sign before printing.
        print('Best MSE: ', np.round(-model.best_score_, DECIMALS))
        print('Best params: ', model.best_params_)
    else:
        raise ValueError(f"Unknown task {task!r}; expected 'class' or 'reg'")
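
A note on the regression branch: scikit-learn scorers follow a higher-is-better convention, so `neg_mean_squared_error` reports the MSE with a flipped sign. The sketch below shows this on synthetic data. `Ridge`, the `alpha` grid and the plain `KFold` splitter are my assumptions for a small, fast example and are not the models used in this notebook.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold

rng = np.random.default_rng(9)
toy_x = rng.normal(size=(80, 3))
toy_y = toy_x @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.3, size=80)

search = GridSearchCV(
    Ridge(),
    {'alpha': [0.1, 1.0, 10.0]},
    scoring='neg_mean_squared_error',
    cv=KFold(4, shuffle=True, random_state=9),
)
search.fit(toy_x, toy_y)

best_mse = -search.best_score_   # flip the sign back to a regular MSE
print(f'best_score_: {search.best_score_:.3f}, MSE: {best_mse:.3f}')
```

`best_score_` is negative here, and negating it gives the usual non-negative MSE.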

In [None]:
param_rfc = {
    'n_estimators': [100, 200, 300],
    'criterion': ['gini', 'entropy'],
    'max_depth': [5, None],
    'class_weight': ['balanced', 'balanced_subsample', None]
}

train_gscv_model(rfc_base, param_rfc, 'class')

In [None]:
param_xgbc = {
    'n_estimators': [100, 200, 300]
}

train_gscv_model(xgbc_base, param_xgbc, 'class')

In [None]:
param_lgbmc = {
    'boosting_type': ['gbdt', 'dart', 'goss', 'rf'],
    'max_depth': [5, -1],
    'learning_rate': [.1, .001, .0001],
    'n_estimators': [100, 200, 300],
    'class_weight': ['balanced', None]
}

train_gscv_model(lgbmc_base, param_lgbmc, 'class')

**Summary:** the best algorithm here is *LGBM Classifier*. Hyperparameter tuning doesn't improve the score compared to the oversampling technique.

# 3. Regression task

For this task the dataset doesn't need to be modified, so I use the original copy.

In [None]:
DECIMALS = 3

In [None]:
def get_reg_scores(model, dec=DECIMALS):
    preds = model.predict(x_test)
    mae = mean_absolute_error(y_test, preds)
    mse = mean_squared_error(y_test, preds)
    mae_round = np.round(mae, dec)
    mse_round = np.round(mse, dec)
    
    print(f'MAE: {mae_round}, MSE: {mse_round}')
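
To make the two metrics concrete, here is the arithmetic behind them on three hypothetical predictions (the numbers are made up for illustration):

```python
import numpy as np

actual = np.array([5.0, 6.0, 7.0])      # hypothetical quality scores
predicted = np.array([5.5, 6.0, 6.0])   # hypothetical model outputs

errors = actual - predicted              # [-0.5, 0.0, 1.0]
mae = np.mean(np.abs(errors))            # (0.5 + 0 + 1) / 3 = 0.5
mse = np.mean(errors ** 2)               # (0.25 + 0 + 1) / 3
rmse = np.sqrt(mse)                      # back in quality-point units
print(mae, round(mse, 3), round(rmse, 3))
```

MSE squares each error, so the single 1-point miss dominates it, while MAE weights all misses linearly.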

In [None]:
x = df_copy.drop('quality', axis=1).values
y = df_copy.quality.values

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=.2, random_state=SEED)

Similar to the classification task, I'm going to try these 3 models:
1. Random Forest Regressor
2. XGB Regressor
3. LGBM Regressor

# 3.1 Baseline

In [None]:
rfr_base = RandomForestRegressor(n_jobs=-1, random_state=SEED)
rfr_base.fit(x_train, y_train)

get_reg_scores(rfr_base)

In [None]:
xgbr_base = xgb.XGBRegressor(n_jobs=-1, random_state=SEED)
xgbr_base.fit(x_train, y_train)

get_reg_scores(xgbr_base)

In [None]:
lgbmr_base = lgb.LGBMRegressor(n_jobs=-1, random_state=SEED)
lgbmr_base.fit(x_train, y_train)

get_reg_scores(lgbmr_base)

Well, surprisingly, Random Forest Regressor is better than the other models!

# 3.2 Random oversampling

In [None]:
x_resampled, y_resampled = ros.fit_resample(x_train, y_train)

In [None]:
rfr_os = RandomForestRegressor(n_jobs=-1, random_state=SEED)
rfr_os.fit(x_resampled, y_resampled)

get_reg_scores(rfr_os)

In [None]:
xgbr_os = xgb.XGBRegressor(n_jobs=-1, random_state=SEED)
xgbr_os.fit(x_resampled, y_resampled)

get_reg_scores(xgbr_os)

In [None]:
lgbmr_os = lgb.LGBMRegressor(n_jobs=-1, random_state=SEED)
lgbmr_os.fit(x_resampled, y_resampled)

get_reg_scores(lgbmr_os)

Oversampling didn't improve the baseline models.

# 3.3 Tuning

In [None]:
param_rfr = {
    'n_estimators': [100, 200, 300],
    'max_depth': [5, None]
}

train_gscv_model(rfr_base, param_rfr, 'reg')

In [None]:
param_xgbr = {
    'n_estimators': [100, 200, 300]
}

train_gscv_model(xgbr_base, param_xgbr, 'reg')

In [None]:
param_lgbmr = {
    'boosting_type': ['gbdt', 'dart', 'goss', 'rf'],
    'max_depth': [5, -1],
    'learning_rate': [.1, .001, .0001],
    'n_estimators': [100, 200, 300],
    'class_weight': ['balanced', None]
}

train_gscv_model(lgbmr_base, param_lgbmr, 'reg')

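Somehow the tuned models scored even worse. Keep in mind that the grid-search MSE is a 4-fold average over the whole dataset, while the baseline MSE comes from a single hold-out split, so the two numbers are not strictly comparable.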
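
One way to put tuned and baseline models on the same footing is to run the search on the training split only and then score both on the same held-out rows. Here is a rough sketch, assuming synthetic regression data and a deliberately small grid. The variable names are hypothetical and the printed numbers say nothing about the wine dataset.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, KFold, train_test_split

demo_x, demo_y = make_regression(n_samples=200, n_features=11, noise=10.0, random_state=9)
tr_x, te_x, tr_y, te_y = train_test_split(demo_x, demo_y, test_size=0.2, random_state=9)

# Untuned reference model, fitted on the training split only.
base = RandomForestRegressor(n_estimators=50, random_state=9).fit(tr_x, tr_y)

search = GridSearchCV(
    RandomForestRegressor(random_state=9),
    {'n_estimators': [50, 100], 'max_depth': [5, None]},
    scoring='neg_mean_squared_error',
    cv=KFold(4, shuffle=True, random_state=9),
)
search.fit(tr_x, tr_y)   # the test rows never enter the search

# Both models are now evaluated on exactly the same held-out rows.
base_mse = mean_squared_error(te_y, base.predict(te_x))
tuned_mse = mean_squared_error(te_y, search.best_estimator_.predict(te_x))
print(f'baseline test MSE: {base_mse:.1f}, tuned test MSE: {tuned_mse:.1f}')
```

With `refit=True` (the default), `best_estimator_` is already retrained on the whole training split, so it can be passed straight to a scoring helper.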

**Summary:** for the regression task, neither oversampling nor tuning improved the baseline score at all! Furthermore, *Random Forest Regressor* got the best score, ahead of the gradient boosting models. That surprised me, to be honest.

# 4. What is next?

As you can see, without much work or a deep dive into the dataset I could improve the classification score.
The same tricks didn't work for the regression score, possibly because of the many outliers in the features.
So how can you raise the score?
1. Feature engineering (create new features)
2. Try different models
3. More hyperparameters tuning
4. Preprocessing (in case of regression)
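
The outlier remark above can be checked before any preprocessing. Here is a minimal sketch using the common 1.5 × IQR rule (my choice of rule, it is not applied anywhere in this notebook). The helper name `iqr_outlier_counts` and the toy values are hypothetical.

```python
import pandas as pd


def iqr_outlier_counts(frame, k=1.5):
    """Count, per column, the values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = frame.quantile(0.25)
    q3 = frame.quantile(0.75)
    iqr = q3 - q1
    # Comparisons align the per-column bounds with the DataFrame columns.
    outside = (frame < q1 - k * iqr) | (frame > q3 + k * iqr)
    return outside.sum()


toy = pd.DataFrame({
    'chlorides': [1, 2, 3, 4, 100],    # hypothetical values, one extreme point
    'alcohol': [10, 11, 12, 13, 14],   # hypothetical values, no extremes
})
counts = iqr_outlier_counts(toy)
print(counts.to_dict())
```

In the notebook it could be run on `df_copy.drop('quality', axis=1)` to see which features actually carry the most extreme values.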

Thanks for reading! I hope you found something useful here.

Feel free to comment on this notebook.