# Wine Classification
In this kernel we will use the data from the [Wine Varieties Dataset](https://www.kaggle.com/brynja/wineuci) to build a simple classifier that predicts the wine class.

# 1. Loading libraries and dataset

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sn

We start by loading the dataset with the `pd.read_csv()` function:

In [None]:
data = pd.read_csv('../input/Wine.csv')

In [None]:
data.head(3)

We can see that the columns are labelled with arbitrary numbers. The real column names are listed on the dataset page, so we assign them here:

In [None]:
data.columns = ['class','alcohol','malicAcid','ash','ashalcalinity','magnesium','totalPhenols','flavanoids','nonFlavanoidPhenols','proanthocyanins','colorIntensity','hue','od280_od315','proline']

In [None]:
data.head(3)

# 2. Missing values

First, we check whether the data contains any missing values:

In [None]:
print('There are %d missing values in total.' % data.isna().sum().sum())

The data is clean and has no missing values, so no further processing is needed.

# 3. Data Analysis

## 3.1. Count of different wine classes

In [None]:
sn.countplot(x='class', data=data, palette='Blues_d');

## 3.2. Variables correlation

In [None]:
corr = data.corr()
fig, ax = plt.subplots(figsize=(10,10))
sn.heatmap(corr,ax=ax, cmap=sn.diverging_palette(20, 220, n=200), square=True, annot=True, cbar_kws={'shrink': .8})
ax.set_xticklabels(data.columns, rotation=45, horizontalalignment='right');

The variable least correlated with the target variable (class) is **ash**. We could drop it now, but we will leave that decision to the feature elimination step.
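To make this comparison explicit, the correlations with the target can be ranked by absolute value. The sketch below is self-contained, so it uses scikit-learn's bundled copy of the UCI wine data (`load_wine`) as a stand-in for `data`. That substitution, the column names it brings along, and the helper name `rank_by_target_correlation` are assumptions of this example rather than part of the kernel.

```python
import pandas as pd
from sklearn.datasets import load_wine

def rank_by_target_correlation(frame, target):
    """Return absolute Pearson correlations with `target`, weakest first."""
    corr = frame.corr()[target].drop(target)
    return corr.abs().sort_values()

# Stand-in for `data`: scikit-learn's bundled copy of the UCI wine data
wine = load_wine()
demo = pd.DataFrame(wine.data, columns=wine.feature_names)
demo['class'] = wine.target

ranking = rank_by_target_correlation(demo, 'class')
print(ranking.head(3))
```

With the kernel's own frame, the equivalent call would be `rank_by_target_correlation(data, 'class')`.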

# 4. Split train/test data

First, we need to separate the variables and the target from the original dataset as follows:

In [None]:
X = data.drop(['class'], axis=1)
Y = data['class']

Then we will use `train_test_split()` from `sklearn.model_selection` to split it further into training and testing subsets.

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
random_state = 2
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=random_state, shuffle=True)

The `train_test_split()` function takes the following arguments:
* `X`: the variables, the whole dataset, without the target variable (wine class)
* `Y`: the target variable, which is the wine class
* `test_size`: represents the proportion of the original data to be used as testing set (here I chose 30%)
* `shuffle`: since the original dataset is grouped by the wine class, it is preferable to rearrange everything randomly, so we set `shuffle` to `True`
* `random_state`: a fixed seed, so that the shuffled split is the same on every run
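The effect of `shuffle` on class-grouped data is easy to see on a toy example. The sketch below uses made-up arrays (`X_demo`, `y_demo`, 50 samples per class) rather than the wine data, and compares the classes that end up in the test set with and without shuffling.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data grouped by class, like the wine file: 50 samples each of classes 1, 2, 3
y_demo = np.repeat([1, 2, 3], 50)
X_demo = np.arange(150).reshape(-1, 1)

# Without shuffling, the test set is simply the last 30% of the rows
_, _, _, y_test_ordered = train_test_split(
    X_demo, y_demo, test_size=0.3, shuffle=False
)

# With shuffling (and a fixed seed), the test set draws from every class
_, _, _, y_test_shuffled = train_test_split(
    X_demo, y_demo, test_size=0.3, shuffle=True, random_state=2
)

print('Classes in unshuffled test set:', np.unique(y_test_ordered))
print('Classes in shuffled test set:', np.unique(y_test_shuffled))
```

An unshuffled split of grouped data would test the model only on the last class, which is why shuffling matters here.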

# 5. Feature Elimination

Some variables may not be predictive of the target wine class. We will use feature elimination to try to remove them and so improve the quality of the data we feed into the model.

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold

In [None]:
estimator = LogisticRegression(solver='liblinear', multi_class='auto')
selector = RFECV(estimator, step=1, cv = StratifiedKFold(10));
selector.fit(X, Y);

In [None]:
plt.figure()
plt.xlabel('Number of Features')
plt.ylabel('Cross Validation Score')
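# `grid_scores_` was removed in scikit-learn 1.2; newer versions expose the mean scores in `cv_results_`
scores = selector.cv_results_['mean_test_score'] if hasattr(selector, 'cv_results_') else selector.grid_scores_
grid_scores = plt.plot(range(1, len(scores) + 1), scores, zorder = 3);
best_number = plt.scatter(selector.n_features_, np.max(scores), color='red', zorder = 5);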
plt.legend([best_number],['Optimal Number of Features'], loc='lower right');

Recursive Feature Elimination didn't eliminate any feature, so apparently all features contribute to the classification.
We will keep all of them.
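Beyond the plot, a fitted `RFECV` object reports which features it kept through its `support_` and `ranking_` attributes. The sketch below shows one way to tabulate them. It is not the kernel's exact setup: it assumes scikit-learn's bundled `load_wine` data as a stand-in, standardises the features, and uses `LogisticRegression` with its default solver and 5 folds, so the number of features it keeps may differ from the result above.

```python
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler

# Stand-in data: scikit-learn's bundled copy of the UCI wine data
wine = load_wine()
X_demo = pd.DataFrame(StandardScaler().fit_transform(wine.data),
                      columns=wine.feature_names)
y_demo = wine.target

# Assumption: default solver on standardised features, not the kernel's liblinear setup
demo_selector = RFECV(LogisticRegression(max_iter=1000), step=1,
                      cv=StratifiedKFold(5))
demo_selector.fit(X_demo, y_demo)

# support_ is True for kept features; ranking_ is 1 for kept features
# and grows for features that were eliminated earlier
summary = pd.DataFrame({
    'feature': X_demo.columns,
    'kept': demo_selector.support_,
    'rank': demo_selector.ranking_,
})
print('Features kept: %d of %d' % (demo_selector.n_features_, X_demo.shape[1]))
print(summary.sort_values('rank'))
```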

# 6. Building the models

It is always a good idea to try several classifiers, compare their results, and pick the one with the best accuracy, because different algorithms may perform differently on different datasets.
We will try the following models:
* Logistic Regression
* Support Vector Classifier
* Naive Bayes
* K-Nearest Neighbours
* Decision Trees
* Multi-Layer Perceptron
* XGBoost Classifier

We need to import the mentioned classifiers:

In [None]:
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from xgboost import XGBClassifier

We will build a list of tuples containing the name of the classifier and the classifier itself:

In [None]:
classifiers = []
classifiers.append(('Logistic Regression', LogisticRegression(solver='liblinear', multi_class='auto')))
classifiers.append(('Support Vector Classifier', SVC(kernel='linear')))
classifiers.append(('GaussianNB', GaussianNB()))
classifiers.append(('K-Nearest Neighbors',KNeighborsClassifier(n_neighbors=3)))
classifiers.append(('Decision Tree', DecisionTreeClassifier()))
classifiers.append(('Multi-Layer Perceptron', MLPClassifier(hidden_layer_sizes=(15,),solver='sgd',learning_rate_init=0.01,max_iter=500)))
classifiers.append(('eXtreme Gradient Boosting', XGBClassifier()))

# 7. Models ranking

To evaluate the performance of our models, we will use the `cross_val_score()` function:

In [None]:
from sklearn.model_selection import StratifiedKFold, cross_val_score

In [None]:
# No shuffling, so the folds are deterministic; `random_state` is only valid together with `shuffle=True`
kfold = StratifiedKFold(n_splits=10)
cv_results = []
for name, classifier in classifiers:
    result = cross_val_score(classifier, X, Y, cv=kfold)
    cv_results.append((name, result))

In [None]:
results = pd.DataFrame(cv_results, columns=['classifier','cvscore'])
results['cvscore'] = [np.mean(i) for i in results['cvscore']]
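Averaging the fold scores hides how much they vary from fold to fold. As a hedged illustration, the sketch below prints the ten individual accuracies together with their mean and standard deviation for a single classifier. It uses scikit-learn's bundled `load_wine` data as a stand-in for `X` and `Y`, which is an assumption of this example.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

# Stand-in for X and Y: scikit-learn's bundled copy of the UCI wine data
X_demo, y_demo = load_wine(return_X_y=True)

# One accuracy value per fold: each fold is held out once
# while the remaining nine folds train the model
fold_scores = cross_val_score(GaussianNB(), X_demo, y_demo,
                              cv=StratifiedKFold(n_splits=10))

print('Fold accuracies:', np.round(fold_scores, 2))
print('Mean: %.3f, std: %.3f' % (fold_scores.mean(), fold_scores.std()))
```

A small standard deviation suggests the ranking between two models is stable; a large one means a difference of a point or two in the mean is not very informative.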

In [None]:
sn.set_style('whitegrid')
ax = sn.barplot(x='cvscore',y='classifier', data=results.sort_values('cvscore'), palette='Blues_d')
ax.set(xlabel='Cross Validation Score', ylabel='');

In [None]:
best = results.loc[results['cvscore'].idxmax()]
print('The best performing model is: %s\nWith Cross-Validation Score of: %.2f' % (best['classifier'], best['cvscore']))

We notice that **GaussianNB** scored the highest, so this is the model that we will pick.

We use the training split we created earlier in order to train our model, then we will use it to predict the class of testing samples:

In [None]:
estimator = GaussianNB()
estimator.fit(X_train, Y_train)
Y_predict = estimator.predict(X_test)

To evaluate the accuracy of our predictions, we will use `accuracy_score()`:

In [None]:
from sklearn.metrics import accuracy_score
print('Prediction accuracy is: %.2f%%' % (100*accuracy_score(Y_test, Y_predict)))

Let's look at the test samples that our model misclassified:

In [None]:
X_test[Y_predict != Y_test]
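A confusion matrix complements the list of misclassified rows by showing which classes get confused with which. The sketch below uses `confusion_matrix()` from `sklearn.metrics` on six hypothetical labels, invented for illustration and not taken from this kernel's results.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels for six test samples (not results from this kernel)
y_true_demo = [1, 1, 2, 2, 3, 3]
y_pred_demo = [1, 2, 2, 2, 3, 1]

# Rows are true classes, columns are predicted classes,
# both in the order given by `labels`
cm = confusion_matrix(y_true_demo, y_pred_demo, labels=[1, 2, 3])
print(cm)
```

In this kernel the equivalent call would be `confusion_matrix(Y_test, Y_predict)`; the off-diagonal entries count the misclassified samples.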

# 8. Conclusion

The dataset was clean: it had no missing values and needed no real preprocessing.
The variables were also highly predictive of the target variable: many models scored very high (above 90%), and the best model scored 96.30%.