[This dataset](https://www.kaggle.com/uciml/red-wine-quality-cortez-et-al-2009) covers red variants of the Portuguese "Vinho Verde" wine. For more details, consult the reference [Cortez et al., 2009]. Due to privacy and logistical constraints, only physicochemical (input) and sensory (output) variables are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.).

P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009.

# Import Libraries
First, we import the libraries we need:

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set()

# Import The Data

In [None]:
red_wine = pd.read_csv('/kaggle/input/red-wine-quality-cortez-et-al-2009/winequality-red.csv')

# Quick Look at The Data

In [None]:
red_wine.head()

- Dataset's info

In [None]:
red_wine.info()

- Dataset's descriptive statistics

In [None]:
red_wine.describe(include='all')

- Check for missing data

In [None]:
red_wine.isnull().sum()

# Exploratory Data Analysis

- Squared correlation of each feature with the target variable `quality`, sorted from strongest to weakest

In [None]:
(red_wine.corr()**2)['quality'].sort_values(ascending = False)[1:]
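
Squaring the correlation ranks features by strength regardless of sign, so a strongly negative feature sorts next to a strongly positive one. The sketch below illustrates this on a small made-up frame; the column names `a`, `b`, `c` and `target` are hypothetical and are not part of the wine data. It also uses `.drop('target')` instead of the positional `[1:]` slice, which is one way to avoid relying on the sort order.

```python
import pandas as pd

# Hypothetical toy data: 'a' rises with the target, 'b' falls with it,
# and 'c' is uncorrelated with it.
toy = pd.DataFrame({
    'a': [1, 2, 3, 4, 5],
    'b': [5, 4, 3, 2, 1],
    'c': [2, 1, 2, 1, 2],
    'target': [1, 2, 3, 4, 5],
})

# Signed correlation with the target, excluding the target itself.
corr = toy.corr()['target'].drop('target')

# Squared correlation: the sign disappears, only the strength remains.
r_squared = (corr ** 2).sort_values(ascending=False)

print(corr)
print(r_squared)
```

Here `a` and `b` have opposite signs in `corr` but the same value in `r_squared`, while `c` lands at the bottom of the ranking.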

Plot the most strongly correlated features against quality.

In [None]:
# Recent seaborn versions require x and y to be passed as keyword arguments
sns.boxplot(x='quality', y='alcohol', data=red_wine);

In [None]:
sns.boxplot(x='quality', y='volatile acidity', data=red_wine);

Next, let's plot the distribution of the `quality` scores.

In [None]:
sns.countplot(x='quality', data=red_wine);

In the real world, people often keep it simple and classify red wine into just two qualities, good and bad. I will try the same approach by transforming the target into binary labels: a wine with quality > 6 is good and everything else is bad.

In [None]:
labels = ['bad', 'good']
bins = [2, 6, 8]
red_wine['quality'] = pd.cut(red_wine['quality'], bins=bins, labels=labels)
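
By default `pd.cut` builds intervals that are open on the left and closed on the right, so `bins = [2, 6, 8]` yields `(2, 6]` and `(6, 8]`. A minimal sketch on a hypothetical series of scores (not the real data) shows where each value lands and what happens to values outside the bins:

```python
import pandas as pd

# Hypothetical quality scores covering the range used by the bins.
scores = pd.Series([3, 4, 5, 6, 7, 8])

# (2, 6] -> 'bad', (6, 8] -> 'good'
binned = pd.cut(scores, bins=[2, 6, 8], labels=['bad', 'good'])
print(binned.tolist())

# Values outside the bins (including the open left edge, 2) become NaN.
outside = pd.cut(pd.Series([2, 9]), bins=[2, 6, 8], labels=['bad', 'good'])
print(outside.isna().tolist())
```

A score of 6 is labelled bad and a score of 7 is labelled good, which matches the "quality > 6 is good" rule. If the data ever contained scores of 2 or below, or above 8, they would turn into missing values, so it is worth checking for NaN after binning.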

In [None]:
red_wine = pd.get_dummies(red_wine, drop_first=True)
red_wine

# Creating A Model
We begin by splitting the data into two subsets: a training set and a test set.

In [None]:
from sklearn.model_selection import train_test_split

y = red_wine['quality_good']
X = red_wine.drop(['quality_good'], axis = 1)

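# Hold out 10% of the rows for testing; a fixed random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.1, random_state = 42)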
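
The split above draws the test rows at random, so with an imbalanced target the share of good wines in a 10% test set can drift from the overall share. One option, shown here as a sketch on made-up labels rather than as part of the original workflow, is the `stratify` argument of `train_test_split`, which keeps the class proportions equal in both subsets. The names `X_toy` and `y_toy` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 90 negatives and 10 positives.
y_toy = np.array([0] * 90 + [1] * 10)
X_toy = np.arange(100).reshape(-1, 1)

# stratify=y_toy preserves the 90/10 class ratio in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.1, stratify=y_toy, random_state=0
)

print('train positive share:', y_tr.mean())
print('test positive share :', y_te.mean())
```

For the wine data the equivalent call would pass `stratify=y`.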

Then, we standardize the training and test sets. The scaler is fitted on the training set only and then applied to both.

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

scaler.fit(X_train)

X_train = scaler.transform(X_train)
print('X_train_scaled mean : ', X_train.mean(axis=0))
print('X_train_scaled std  : ', X_train.std(axis=0))

X_test = scaler.transform(X_test)
print('')
print('X_test_scaled mean : ', X_test.mean(axis=0))
print('X_test_scaled std  : ', X_test.std(axis=0))
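
Since the scaler learns its mean and standard deviation from the training set, the scaled training columns have mean 0 and standard deviation 1, while the scaled test columns are only approximately standardized. The sketch below reproduces this on synthetic data (the arrays `train` and `test` are made up) and checks that `StandardScaler` matches the manual formula `(x - train_mean) / train_std`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for the training and test features.
rng = np.random.default_rng(0)
train = rng.normal(loc=10.0, scale=2.0, size=(200, 3))
test = rng.normal(loc=10.0, scale=2.0, size=(20, 3))

# Fit on the training data only, then apply the same transform to both sets.
demo_scaler = StandardScaler().fit(train)
train_scaled = demo_scaler.transform(train)
test_scaled = demo_scaler.transform(test)

# Manual equivalent: subtract the training mean, divide by the training std.
manual = (test - train.mean(axis=0)) / train.std(axis=0)

print('train mean:', train_scaled.mean(axis=0).round(6))
print('train std :', train_scaled.std(axis=0).round(6))
print('test mean :', test_scaled.mean(axis=0).round(3))
print('matches manual formula:', np.allclose(test_scaled, manual))
```

Fitting on the training set alone keeps information about the test rows out of the preprocessing step.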

Model training: Random Forest Classifier

In [None]:
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier()

In [None]:
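#grid search for the best hyperparameters
from sklearn.model_selection import GridSearchCV

#'auto' is no longer accepted for max_features in recent scikit-learn versions;
#for classifiers it was equivalent to 'sqrt', so it is dropped here
param_grid = {'n_estimators' : [100,1000],
              'max_depth': [None,10,100],
              'max_features': ['sqrt','log2']}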

grid = GridSearchCV(model, param_grid, cv=5)

grid.fit(X_train, y_train)

print(grid.best_params_)
print(grid.best_score_)
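
Beyond `best_params_` and `best_score_`, the fitted search keeps the cross-validated score of every parameter combination in `cv_results_`. The following sketch runs a deliberately small search on synthetic data from `make_classification` (not the wine data, and with a reduced hypothetical grid so it runs quickly) and prints the combinations ranked by mean score.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic binary classification problem standing in for the wine data.
X_demo, y_demo = make_classification(n_samples=120, n_features=6, random_state=0)

# A small illustrative grid: 2 x 2 = 4 combinations, 3 folds each.
demo_grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    {'n_estimators': [10, 25], 'max_features': ['sqrt', 'log2']},
    cv=3,
)
demo_grid.fit(X_demo, y_demo)

# One row per parameter combination, best rank first.
results = pd.DataFrame(demo_grid.cv_results_)
summary = results[['params', 'mean_test_score', 'std_test_score', 'rank_test_score']]
print(summary.sort_values('rank_test_score'))
```

Comparing `mean_test_score` with `std_test_score` helps judge whether the best combination is clearly better than the runners-up or within fold-to-fold noise.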

In [None]:
from sklearn.metrics import mean_squared_error, classification_report

#use the best model
model = grid.best_estimator_

#make a prediction
y_predict = model.predict(X_test)

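#calculate Mean Squared Error and classification report
#(labels are cast to int so the MSE is well defined; for 0/1 labels it equals the misclassification rate)
print('MSE : ', mean_squared_error(y_test.astype(int), y_predict.astype(int)))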
print(classification_report(y_test,y_predict))
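
Because the labels are binary, the MSE reported above is simply the fraction of misclassified wines, i.e. 1 minus the accuracy. The sketch below checks this on hypothetical label vectors and adds a confusion matrix, which shows how the errors split between the two classes; none of these numbers come from the wine model.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, mean_squared_error

# Hypothetical true and predicted labels (0 = bad, 1 = good).
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 1, 1, 1, 0, 0])

mse = mean_squared_error(y_true, y_pred)
acc = accuracy_score(y_true, y_pred)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)

print('MSE         :', mse)
print('1 - accuracy:', 1 - acc)
print(cm)
```

With an imbalanced target, the per-class precision and recall in the classification report and the confusion matrix are usually more informative than a single error rate.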