The red wine quality dataset is a very simple dataset to work with: all the features are numerical and there is no missing data! Could it be any more convenient? Honestly, such datasets are a bit boring for me; however, as a wine lover, I am very excited to work on it! :D

This is a classification problem (it can also be treated as regression, but I prefer to do classification!)

In [None]:
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
from sklearn.metrics import log_loss
from sklearn.model_selection import StratifiedKFold
from xgboost import XGBClassifier
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()



First let's have a look at the data:

In [None]:
df=pd.read_csv('/kaggle/input/red-wine-quality-cortez-et-al-2009/winequality-red.csv')
print(df.info())
features = df.drop(columns='quality').columns

As we can see, there is no missing data and all the features are of numerical type.
This is a small dataset and all features are numerical, so SVM can be an option since it works well on small datasets. KNN maybe... but with this many dimensions it can be slow. Logistic regression, naive Bayes and a decision tree might also be good options for this problem! But I still need more information to make a decision. Let's look at the statistical summary of the dataset:


In [None]:
#print(df.describe())
plt.figure(figsize=(20,14))
sns.boxplot(data=df)
plt.show()

Features have very different variances, and features with higher variance dominate distance-based classification methods like KNN. So to use KNN, the features should be scaled first. Maybe that is a good excuse for not using KNN! Hahaha...
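If I did want to give KNN a fair chance, a scaler and the classifier can be chained in one pipeline. Below is a minimal sketch on made-up toy data, not on the wine data: the names `X_toy`, `y_toy`, `knn_raw` and `knn_scaled` are hypothetical, and the toy label is deliberately tied to the small-scale feature so that the effect of scaling is visible.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy data (made up): feature 0 is on a scale of hundreds, feature 1 is below 1.
rng = np.random.default_rng(0)
X_toy = np.column_stack([rng.normal(100, 50, 200), rng.normal(0.5, 0.1, 200)])
# The label depends only on the small-scale feature.
y_toy = (X_toy[:, 1] > 0.5).astype(int)

# Plain KNN versus KNN behind a StandardScaler.
# Inside the pipeline the scaler is fitted on the training rows only.
knn_raw = KNeighborsClassifier(n_neighbors=5)
knn_scaled = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))

knn_raw.fit(X_toy[:150], y_toy[:150])
knn_scaled.fit(X_toy[:150], y_toy[:150])

acc_raw = knn_raw.score(X_toy[150:], y_toy[150:])
acc_scaled = knn_scaled.score(X_toy[150:], y_toy[150:])
print('accuracy without scaling:', acc_raw)
print('accuracy with scaling:   ', acc_scaled)
```

Without scaling, the distances are driven almost entirely by the large-scale feature, which carries no information about the label in this toy setup.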
We still need more information, so let's see how the wines are distributed across the quality classes:

In [None]:
plt.figure(figsize=(12,8))
# countplot draws one bar per quality value (sns.distplot is deprecated in newer seaborn releases)
sns.countplot(x='quality', data=df)
plt.ylabel('The number of wines')
plt.xlabel('Wine quality')
plt.title('The number of wines per quality')
plt.show()

So, it is an unbalanced dataset... As a rule of thumb, decision tree-based models often perform best on unbalanced data, so I will definitely try that; KNN is vulnerable to class imbalance.
Accuracy is not a good measure for evaluating performance on an unbalanced dataset, and AUC is mostly used for binary classification problems, so my choice is log loss, which is a common metric for multi-class problems with unbalanced classes! One decision is made, that is progress!
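To make the difference between accuracy and log loss concrete, here is a small hand-rolled sketch. The function `multiclass_log_loss` and the three sets of predicted probabilities are made up for illustration; in the rest of the notebook I use `sklearn.metrics.log_loss`.

```python
import math

def multiclass_log_loss(y_true, proba, classes):
    """Mean negative log of the probability assigned to the true class."""
    eps = 1e-15  # avoid log(0) for over-confident predictions
    total = 0.0
    for label, row in zip(y_true, proba):
        p = row[classes.index(label)]
        total += -math.log(min(max(p, eps), 1.0))
    return total / len(y_true)

classes = [5, 6, 7]  # hypothetical quality labels
y_true = [5, 6, 7]

# Three made-up sets of predicted probabilities (one row per wine, one column per class).
confident_right = [[0.90, 0.05, 0.05], [0.05, 0.90, 0.05], [0.05, 0.05, 0.90]]
hedged = [[0.4, 0.3, 0.3], [0.3, 0.4, 0.3], [0.3, 0.3, 0.4]]
confident_wrong = [[0.05, 0.90, 0.05], [0.90, 0.05, 0.05], [0.05, 0.90, 0.05]]

loss_right = multiclass_log_loss(y_true, confident_right, classes)
loss_hedged = multiclass_log_loss(y_true, hedged, classes)
loss_wrong = multiclass_log_loss(y_true, confident_wrong, classes)
print(loss_right, loss_hedged, loss_wrong)
```

The first two sets of predictions pick the right class every time, so their accuracy is identical, but log loss rewards the confident one and punishes the confidently wrong one the most.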
Now I want to see the correlation between the different features, and between each feature and the quality of the wine!

In [None]:
plt.figure(figsize=(12,8))
corr_mat = df.corr(method='pearson')
sns.heatmap(corr_mat, annot=True) 
plt.xticks(rotation=30)
plt.show()

I love heatmaps, they are beautiful and give you a lot of information! It turns out that there is quite a remarkable correlation between wine quality and alcohol content, which is not surprising at all!
Density and citric acid have high correlation?!! I wonder why...
If I were going to do feature reduction, I would keep only one of citric acid and fixed acidity, but for now I don't think it is necessary, so let's keep all the features.
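Reading the strongest pairs off a heatmap by eye is error-prone, so here is a sketch of how they could be listed directly. The helper `top_correlated_pairs` and the toy frame are hypothetical; on the wine data one would pass `df.drop(columns='quality')` instead.

```python
import numpy as np
import pandas as pd

def top_correlated_pairs(frame, n=3):
    """Return the n feature pairs with the largest absolute Pearson correlation."""
    corr = frame.corr(method='pearson').abs()
    # Keep only the upper triangle so each pair appears once and the diagonal is dropped.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    return pairs.sort_values(ascending=False).head(n)

# Made-up data: 'b' is almost a copy of 'a', while 'c' is independent noise.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
toy = pd.DataFrame({
    'a': a,
    'b': a + rng.normal(scale=0.1, size=500),
    'c': rng.normal(size=500),
})

pairs = top_correlated_pairs(toy)
print(pairs)
```

A pair that shows up at the top of such a list is a candidate for dropping one of the two features.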
Now it is time to build an ML model. Based on the observations above, I apply SVM, logistic regression, naive Bayes and decision tree models to the data!
First, without hyperparameter tuning, I want a rough comparison of their performance. I hold out a subset of the data as a validation set, to use later as data unseen by the model!


In [None]:
# Split-out validation df
array = df.values
X = array[:,0:11]
y = array[:,11]

X_train, X_validation, Y_train, Y_validation = train_test_split(X, y, test_size=0.20, random_state=1, shuffle=True)

models = []
models.append(('Decision Tree', DecisionTreeClassifier()))
models.append(('LR', LogisticRegression(solver='liblinear', multi_class='ovr')))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC(gamma='auto', probability=True))) 
results = []
names = []
for name, model in models:
    kfold = StratifiedKFold(n_splits=4, shuffle=True, random_state=1)
    cv_results = cross_val_score(model, X_train, Y_train, cv=kfold, scoring='neg_log_loss')
    results.append(cv_results)
    names.append(name)

plt.figure(figsize=(12,10))
plt.boxplot(results, labels=names)
plt.title('Performance comparison of different algorithms')
plt.show()
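Since the classes are unbalanced, one option worth considering for the hold-out split is the `stratify` argument of `train_test_split`, which keeps the class proportions the same in both parts. A minimal sketch on made-up labels (`X_toy` and `y_toy` are hypothetical, not the wine data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up unbalanced labels: 90% class 0, 10% class 1.
y_toy = np.array([0] * 180 + [1] * 20)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

# stratify=y_toy asks for the same class proportions in both parts of the split.
X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, test_size=0.20, random_state=1, shuffle=True, stratify=y_toy
)

train_share = y_tr.mean()  # share of class 1 in the training part
valid_share = y_va.mean()  # share of class 1 in the validation part
print(train_share, valid_share)
```

Note that stratification needs at least two samples of every class, so it can fail on a dataset with a class that appears only once.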

Apparently logistic regression and SVM perform better than the other algorithms, so I narrow down my choices to these two. I compare the performance of these two on the validation dataset and then make my final decision:

In [None]:
models = []
models.append(('LR', LogisticRegression(solver='liblinear', multi_class='ovr')))
models.append(('SVM', SVC(gamma='auto', probability=True))) 
results = []
names = []
for name, model in models:
    model.fit(X_train, Y_train)
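    predictions = model.predict_proba(X_validation)
    # Pass the labels explicitly in case a rare quality class is missing from the validation split
    score = log_loss(Y_validation, predictions, labels=model.classes_)
    results.append(score)  # store this model's validation score, not the leftover cv_results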
    names.append(name)
    print('logloss score of {} on unseen data is: '.format(name) + str(score))
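A log-loss number is hard to judge on its own, so a simple baseline helps: a "model" that ignores the features and always predicts the class frequencies of the training set. The sketch below uses made-up labels, not the real wine counts, and the helper `prior_baseline_log_loss` is hypothetical.

```python
import math
from collections import Counter

def prior_baseline_log_loss(y_train, y_eval):
    """Log loss of always predicting the training class frequencies.

    Assumes every label in y_eval also occurs in y_train.
    """
    counts = Counter(y_train)
    n = len(y_train)
    priors = {label: c / n for label, c in counts.items()}
    return -sum(math.log(priors[label]) for label in y_eval) / len(y_eval)

# Made-up labels: 50 wines of quality 5, 40 of quality 6, 10 of quality 7.
y_toy = [5] * 50 + [6] * 40 + [7] * 10
baseline = prior_baseline_log_loss(y_toy, y_toy)
print('baseline log loss:', baseline)
```

A classifier is only learning something useful from the features if its log loss is clearly below this baseline.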


So SVM is out! Now I should optimize the hyperparameters of the logistic regression model:

In [None]:
# Note: the default solver 'lbfgs' only supports L2 (Ridge) penalties; 'liblinear' supports both 'l1' and 'l2'.
# Usually Ridge is better than Lasso; Lasso is good for feature reduction.
model = LogisticRegression(solver='liblinear', multi_class='ovr')
# 15 log-spaced values of C from 1e-5 to 1e8 (the extra positional arguments 10, 12 would have set endpoint and base=12)
parameters = {'C': np.logspace(-5, 8, 15)}

# Create grid search using 5-fold cross validation, scored with log loss like the rest of the notebook
lr_cv = GridSearchCV(model, parameters, cv=5, scoring='neg_log_loss', verbose=0)
best_model = lr_cv.fit(X_train, Y_train)

# View best parameters
print('Best C:', best_model.best_estimator_.get_params()['C'])

# Evaluate predictions
predictions = lr_cv.predict_proba(X_validation)
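# Pass the labels explicitly in case a rare quality class is missing from the validation split
print('logloss score on unseen data is: ' + str(log_loss(Y_validation, predictions, labels=lr_cv.classes_)))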
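The values of `C` tried in the grid search are log-spaced. Because `np.logspace` reads its positional arguments as start, stop, num, endpoint and base, it is worth printing the grid once to check that it is the one you meant. A quick sketch (the variable names `grid` and `base2` are just for illustration):

```python
import numpy as np

# Positional order of np.logspace is (start, stop, num, endpoint, base).
grid = np.logspace(-5, 8, num=15)  # base defaults to 10
print(grid[0], grid[-1], len(grid))

# Extra positional arguments silently set endpoint and base:
base2 = np.logspace(0, 3, 4, True, 2)  # same as num=4, endpoint=True, base=2
print(base2)
```

Passing `num` and `base` as keywords makes the intent obvious to the next reader.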



I got only a tiny improvement from hyperparameter tuning... Let's try the XGBoost classifier. I usually try it, it is fast and powerful!

In [None]:
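# Recent xgboost versions expect class labels 0..n_classes-1, so encode the quality scores first
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder().fit(Y_train)
xgb = XGBClassifier(max_depth=35, random_state=42, n_estimators=1500, learning_rate=0.005, booster='gbtree', objective='multi:softprob', min_child_weight=0.1, n_jobs=10)
xgb.fit(X_train, le.transform(Y_train))
y_pred = xgb.predict_proba(X_validation)

# The probability columns follow le.classes_, i.e. the sorted original quality scores
print("logloss score XGBoost: {}".format(log_loss(Y_validation, y_pred, labels=le.classes_)))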

Here the performance of XGBoost is better than that of logistic regression, but not by much! I would like to see which features had the most influence on this result:

In [None]:
plt.figure(figsize=(12,10))
feat_imp = pd.Series(xgb.feature_importances_, index=features).sort_values(ascending=True)
feat_imp.plot(kind='barh', title='Feature Importances XGBoost')
plt.show()