# Objectives

* Predict the colors of wines (Red or White)
* Create a Neural Network with Keras
* Create a Random Forest
* Compare both models
* Obtain Feature Importance

# Libraries

In [None]:
import numpy as np                # linear algebra
import pandas as pd               # data frames
import seaborn as sns             # visualizations
import matplotlib.pyplot as plt   # visualizations
%matplotlib inline
import scipy.stats                # statistics
from sklearn import preprocessing

import os
print(os.listdir("../input"))

# Reading the Data

In [None]:
# Read the data and store them in two objects

red = pd.read_csv('../input/winequality-red.csv')
white = pd.read_csv('../input/winequality-white.csv')

We have two separate datasets, **Red Wines** and **White Wines**. These are their dimensions:

In [None]:
# Explore dimension of the datasets

print("There are {} red wines with {} attributes in red dataset. \n".format(red.shape[0],red.shape[1]))
print("There are {} white wines with {} attributes in white dataset. \n".format(white.shape[0],white.shape[1]))

Let's append both datasets (**Red** and **White**). Before doing so, we create a new variable called **"red"** that tells us whether the wine is **red (1)** or **white (0)**.

In [None]:
# Create a new variable that tells us whether the wine is red (1) or white (0)
red['red']=1
white['red']=0

# Union of both datasets
wines = pd.concat([red,white])

In [None]:
wines.head()

In [None]:
wines.shape
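One detail worth keeping in mind: `pd.concat` keeps each frame's original row labels by default, so the combined index contains repeated labels. The sketch below uses made-up toy frames to show this. Passing `ignore_index=True` is one way to get a fresh index; it is shown as an option, not as something the cells above do.

```python
import pandas as pd

# Toy stand-ins for the red and white tables (values are made up)
red_toy = pd.DataFrame({"alcohol": [9.4, 9.8], "red": 1})
white_toy = pd.DataFrame({"alcohol": [8.8, 9.5, 10.1], "red": 0})

# Default behaviour: each frame keeps its own labels, so 0 and 1 repeat
stacked = pd.concat([red_toy, white_toy])
print(stacked.index.tolist())
print(stacked.index.is_unique)

# Optional: rebuild the index so every row has a unique label
stacked_fresh = pd.concat([red_toy, white_toy], ignore_index=True)
print(stacked_fresh.index.tolist())
```

Repeated labels matter later on, because label-based operations such as `drop` act on every row that shares a label.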

# Cleaning the Data

There are **1177 duplicated** records:

In [None]:
# Let's see if we have duplicated records

twice = wines[wines.duplicated()]
twice.shape

In [None]:
twice.head()

They are distributed like this:

In [None]:
sns.countplot(x="red", data=twice, palette="husl")

In [None]:
pd.DataFrame(twice['red'].value_counts(dropna=False)).head()

In [None]:
pd.DataFrame(wines['red'].value_counts(dropna=False)).head()

## Remove Duplicates

Let's get rid of the duplicates. After removing them, the dataset has these dimensions:

In [None]:
wine = wines.drop_duplicates(keep='first')
wine.shape
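To see what `duplicated()` and `drop_duplicates(keep='first')` do, here is a minimal sketch on a made-up frame (the values are illustrative, not taken from the wine data).

```python
import pandas as pd

# Made-up rows: the second and fourth rows repeat the first
toy = pd.DataFrame({
    "pH": [3.51, 3.51, 3.20, 3.51],
    "alcohol": [9.4, 9.4, 9.8, 9.4],
    "red": [1, 1, 1, 1],
})

# duplicated() flags every repeat after the first occurrence
flags = toy.duplicated()
print(flags.tolist())

# drop_duplicates(keep='first') keeps one copy of each distinct row
deduped = toy.drop_duplicates(keep="first")
print(deduped.shape)
```

The number of rows flagged by `duplicated()` is exactly the number of rows that `drop_duplicates(keep='first')` removes.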

## Balance of the Target Variable

In [None]:
sns.countplot(x="red", data=wine, palette="RdPu")

# Explore the Data

These are the main statistics of the variables:

In [None]:
wine.describe()

Let's take a look at their dispersion.

In [None]:
plt.figure(figsize=(15,10))
sns.boxplot(data=wine.drop(columns=['red']), orient='horizontal', palette='RdPu')

We can see from the boxplot above that the range of values varies widely from one variable to another, so we need to scale the variables to make the data easier to work with.

# Scale the data

The method to scale the data will be **Standardization**. Standardization of a dataset is a common requirement for many machine learning estimators: they might behave badly if the individual features do not more or less look like standard normally distributed data (e.g. **Gaussian with 0 mean and unit variance**).

In [None]:
# Scaling the continuous variables
wine_scale = wine.copy()
scaler = preprocessing.StandardScaler()
columns = wine.columns[0:12]
wine_scale[columns] = scaler.fit_transform(wine_scale[columns])
wine_scale.head()
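As a rough illustration of what standardization does to a single column, the sketch below applies the z-score formula by hand with NumPy. The values are made up. `ddof=0` is used because `StandardScaler` divides by the population standard deviation.

```python
import numpy as np

# Made-up values for one feature
x = np.array([7.4, 7.8, 11.2, 6.3, 8.1])

# Standardization: subtract the mean, divide by the standard deviation.
# ddof=0 (population standard deviation) matches StandardScaler.
z = (x - x.mean()) / x.std(ddof=0)

print(z.round(3))
print("mean:", round(float(z.mean()), 6), "std:", round(float(z.std()), 6))
```

After the transformation the column has mean 0 and unit variance, whatever its original units were.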

In [None]:
plt.figure(figsize=(15,10))
sns.boxplot(data=wine_scale.drop(columns=['red']), orient='horizontal', palette='RdPu')

# Correlation between features

### Scatter Plot

In [None]:
g = sns.PairGrid(wine_scale.iloc[:,1:13], hue="red", palette="RdPu")
g.map(plt.scatter);

### Correlation Matrix

In [None]:
# Compute the correlation matrix
corr=wine_scale.iloc[:,1:13].corr()

# Generate a mask for the upper triangle
mask = np.zeros_like(corr, dtype=bool)  # the np.bool alias was removed in NumPy 1.24
mask[np.triu_indices_from(mask)] = True

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(11, 9))

# Generate a custom diverging colormap
cmap = sns.diverging_palette(220, 10, as_cmap=True)

# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, annot=True, cmap='RdPu', center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})

* The variables "free sulfur dioxide" and "total sulfur dioxide" are strongly correlated with each other.

* We can see that the features most related to the wine's color are **volatile acidity**, **total sulfur dioxide** and **chlorides**.

In [None]:
sns.jointplot(x="total sulfur dioxide", y="free sulfur dioxide", data=wine, color='c')

The relationship between "**free sulfur dioxide**" and "**total sulfur dioxide**" shows heteroscedasticity, and the one more strongly related to our target variable is "**total sulfur dioxide**".

I'm going to keep both variables anyway, hoping that the Neural Network and the Random Forest can cope with this.

# Training and Test Datasets

We split the data into **80%** for **training** and **20%** for **test**. Below is the distribution of the target variable in the **Original Dataset**, the **Training Dataset** and the **Test Dataset**:

In [None]:
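# Shuffle row positions (not index labels): the labels repeat after pd.concat,
# so mixing label-based drop() with position-based iloc would select the wrong rows
positions = np.random.permutation(len(wine_scale))
n_train = int(len(wine_scale)*0.8)
train_data, test_data = wine_scale.iloc[positions[:n_train]], wine_scale.iloc[positions[n_train:]]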

print("Number of training samples is", len(train_data))
print("Number of testing samples is", len(test_data))
print(train_data[:5])
print(test_data[:5])

In [None]:
f, axes = plt.subplots(1, 3, figsize=(18, 5), sharex=True)
sns.countplot(x="red", data=wine_scale, palette="RdPu", ax=axes[0])
sns.countplot(x="red", data=train_data, palette="RdPu", ax=axes[1])
sns.countplot(x="red", data=test_data, palette="RdPu", ax=axes[2])

We can see here that the training dataset is the most balanced of the three, which is good, and the test dataset is the most unbalanced, so it will be challenging for our models.
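If the class balance should be the same in both splits, one option (not what this notebook does) is a stratified split. The sketch below assumes scikit-learn's `train_test_split`. The toy frame and `random_state=0` are illustrative choices.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data: 25% red, 75% white
toy = pd.DataFrame({"alcohol": range(100), "red": [1] * 25 + [0] * 75})

# stratify keeps the red/white proportion the same in both splits
train_toy, test_toy = train_test_split(
    toy, test_size=0.2, stratify=toy["red"], random_state=0
)

print(len(train_toy), len(test_toy))
print(train_toy["red"].mean(), test_toy["red"].mean())
```

With stratification, the share of red wines is the same in the training and test parts, so neither split is harder than the other purely because of class balance.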

In [None]:
features = train_data.drop('red', axis=1)
targets = train_data['red']
features_test = test_data.drop('red', axis=1)
targets_test = test_data['red']

# Neural Networks With Keras

Our workflow will be as follows: first we will present our neural network with the training data, **features** and **targets**. The network will then learn to associate **features** and **targets**. Finally, we will ask the network to produce predictions for **features_test**, and we will verify whether these predictions match the labels from **targets_test**.

Let's build our network:

Here our network consists of a sequence (**Sequential**) of **two Dense layers**, which are densely-connected or fully-connected neural layers. The first layer has 40 units. The second (and last) layer is a single-unit "sigmoid" layer, which means it returns one score between 0 and 1 for each wine: the probability that the wine is red. The probability that it is white is 1 minus that score.

In [None]:
from keras import models
from keras import layers

# Building the model
Nnetwork = models.Sequential()
Nnetwork.add(layers.Dense(40, activation='sigmoid', input_shape=(12,)))
Nnetwork.add(layers.Dense(1, activation='sigmoid'))
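To make the output shape concrete, here is a NumPy sketch of what a 12 → 40 → 1 network with sigmoid activations computes. The weights are random placeholders (an assumption for illustration), not the trained Keras weights, and the input batch is made up.

```python
import numpy as np

def sigmoid(t):
    # Squashes any real number into the interval (0, 1)
    return 1.0 / (1.0 + np.exp(-t))

rng = np.random.default_rng(0)

# Placeholder weights with the same shapes as the two Dense layers
W1, b1 = rng.normal(size=(12, 40)), np.zeros(40)
W2, b2 = rng.normal(size=(40, 1)), np.zeros(1)

# A batch of 5 made-up, already standardized wines with 12 features each
X = rng.normal(size=(5, 12))

hidden = sigmoid(X @ W1 + b1)      # shape (5, 40)
p_red = sigmoid(hidden @ W2 + b2)  # shape (5, 1): one probability per wine

print(p_red.shape)
print(p_red.ravel().round(3))
```

Each wine gets a single number, read as the probability of being red. Thresholding it at 0.5 gives the predicted class.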

## Compilation

To make our network ready for training, we need to pick three more things, as part of the "compilation" step:

* A **loss function**: this is how the network measures how good a job it is doing on its training data, and thus how it steers itself in the right direction.

* An **optimizer**: this is the mechanism through which the network updates itself based on the data it sees and its loss function. Here we use `sgd` (stochastic gradient descent).

* Some **metrics** to monitor during training and testing. Here we only care about accuracy (the fraction of the wines that were correctly classified).

In [None]:
# Compiling the model
Nnetwork.compile(loss = 'binary_crossentropy',
                 optimizer='sgd',
                 metrics=['accuracy'])
Nnetwork.summary()
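As a rough illustration of the loss and the optimizer named in the compilation step, the sketch below computes binary cross-entropy by hand and takes one gradient-descent step on a one-weight logistic model, using all four samples as a single batch. The data, the learning rate of 0.1 and the helper names are illustrative assumptions. Keras handles all of this internally.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def binary_crossentropy(y_true, p):
    # Mean of -[y*log(p) + (1-y)*log(1-p)]; clip to avoid log(0)
    p = np.clip(p, 1e-7, 1 - 1e-7)
    return float(-np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p)))

# Made-up feature and labels (1 = red, 0 = white)
x = np.array([-1.5, -0.5, 0.5, 1.5])
y = np.array([0.0, 0.0, 1.0, 1.0])

w = 0.0                          # start with an uninformative weight
loss_before = binary_crossentropy(y, sigmoid(w * x))

# Gradient of the loss with respect to w for a sigmoid output
grad = np.mean((sigmoid(w * x) - y) * x)
w = w - 0.1 * grad               # one gradient-descent step

loss_after = binary_crossentropy(y, sigmoid(w * x))
print(loss_before, loss_after)
```

The loss drops after the update, which is the "steering" described above. Training repeats this step over many batches and epochs.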

We are now ready to train our network, which in Keras is done via a call to the fit method of the network: we "fit" the model to its training data.

In [None]:
# Training the model
Nnetwork.fit(features, targets, epochs=10, batch_size=100, verbose=0)

## Let's see the performance in Train and Test to evaluate the possibility of Overfitting

This is the accuracy of our Neural Network in test data:

In [None]:
test_loss, test_acc = Nnetwork.evaluate(features_test, targets_test)

In [None]:
print('test_acc:', test_acc, '\ntest_loss:', test_loss)

This is the accuracy of our Neural Network on the training data:

In [None]:
train_loss, train_acc = Nnetwork.evaluate(features, targets)

In [None]:
print('train_acc:', train_acc, '\ntrain_loss:', train_loss)

Our **test set accuracy** turns out to be higher than the **training set accuracy**, which is a very good sign. **Overfitting** is the tendency of machine learning models to perform worse on new data than on their training data. Because the test accuracy here is not lower than the training accuracy, it seems that we are **avoiding overfitting**.

*We won't explore hyperparameter tuning here, so we keep this result.*

# Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier

Some relevant hyper-parameters for the random forest:

* **max_depth**: the maximum depth of each tree in the forest. The deeper the tree, the more splits it has and the more information about the data it captures.

* **n_estimators**: the number of decision trees used.

* **max_features**: the number of features (predictor variables) that the model randomly considers when looking for the best split.

In [None]:
Rforest = RandomForestClassifier(max_depth=4, n_estimators=10, max_features=2)

In [None]:
Rforest.fit(features, targets)
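To get a feel for these hyper-parameters, the sketch below fits the same forest configuration on synthetic data and inspects the fitted trees. `make_classification` and `random_state=0` are assumptions made for reproducibility. The notebook itself uses the wine features.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the wine data: 12 features, 2 classes
X, y = make_classification(n_samples=500, n_features=12, random_state=0)

forest = RandomForestClassifier(
    max_depth=4, n_estimators=10, max_features=2, random_state=0
)
forest.fit(X, y)

# n_estimators controls how many trees are built
print(len(forest.estimators_))

# max_depth caps how deep any single tree can grow
print(max(tree.get_depth() for tree in forest.estimators_))
```

Shallow trees such as these are less likely to memorize the training data, at the cost of capturing less detail per tree.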

This is the accuracy of our Random Forest in test data:

In [None]:
# Accuracy
score = Rforest.score(features_test, targets_test)

In [None]:
print(score)

This is the accuracy of our Random Forest in training data:

In [None]:
# Accuracy
score_train = Rforest.score(features, targets)

In [None]:
print(score_train)

Again, our **test set accuracy** turns out to be a bit higher than the **training set accuracy**. This gap between training accuracy and test accuracy is good and it seems that we are **avoiding "overfitting"**.

## More Performance Metrics from Random Forest

In [None]:
### Predictions
y_pred_rf = Rforest.predict(features_test)

In [None]:
### Probabilities
y_prob_rf = Rforest.predict_proba(features_test)
y_prob_rf = y_prob_rf.T[1]

In [None]:
from sklearn import metrics
# measure confusion matrix
cm_rf = metrics.confusion_matrix(targets_test, y_pred_rf, labels=[0, 1])
cm_rf = cm_rf.astype('float')
cm_rf_norm = cm_rf / cm_rf.sum(axis=1)[:, np.newaxis]
print("True Positive (rate): ", cm_rf[1,1], "({0:0.4f})".format(cm_rf_norm[1,1]))
print("True Negative (rate): ", cm_rf[0,0], "({0:0.4f})".format(cm_rf_norm[0,0]))
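# Rows are true labels and columns are predictions:
# [0,1] is a false positive (white predicted red), [1,0] a false negative
print("False Positive (rate):", cm_rf[0,1], "({0:0.4f})".format(cm_rf_norm[0,1]))
print("False Negative (rate):", cm_rf[1,0], "({0:0.4f})".format(cm_rf_norm[1,0]))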
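The layout of scikit-learn's confusion matrix is easy to mix up, so here is a small sketch with made-up labels. Rows correspond to the true class and columns to the predicted class, in the order given by `labels`.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels: 1 = red, 0 = white
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 1, 1, 0, 1, 0]

cm = confusion_matrix(y_true, y_pred, labels=[0, 1])

# Rows are true classes, columns are predicted classes
tn, fp, fn, tp = cm.ravel()
print(cm)
print("TN:", tn, "FP:", fp, "FN:", fn, "TP:", tp)
```

Two white wines are predicted red (false positives, position `[0, 1]`) and one red wine is predicted white (false negative, position `[1, 0]`).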

In [None]:
np.shape(y_prob_rf)

In [None]:
# Use the predicted probabilities so the curve covers many thresholds
fpr, tpr, thresholds = metrics.roc_curve(targets_test, y_prob_rf)

In [None]:
# measure Area Under Curve (AUC)
auc_rf = metrics.roc_auc_score(targets_test, y_prob_rf)  # scores, not hard labels
print()
print("AUC:", auc_rf)
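ROC curves and AUC are designed for continuous scores such as predicted probabilities. The sketch below, on made-up labels and scores, compares the AUC obtained from scores with the AUC obtained from hard 0/1 predictions thresholded at 0.5.

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Made-up labels and predicted probabilities of the positive class
y_true = [0, 0, 1, 1, 1]
y_score = [0.2, 0.6, 0.55, 0.7, 0.9]

# Hard labels obtained by thresholding the scores at 0.5
y_label = [int(s >= 0.5) for s in y_score]

auc_from_scores = roc_auc_score(y_true, y_score)
auc_from_labels = roc_auc_score(y_true, y_label)
print(auc_from_scores, auc_from_labels)

# With scores, the curve has a point for several thresholds;
# with hard labels, only one non-trivial operating point remains
fpr_s, tpr_s, _ = roc_curve(y_true, y_score)
fpr_l, tpr_l, _ = roc_curve(y_true, y_label)
print(len(fpr_s), len(fpr_l))
```

The two numbers differ because hard labels discard the ranking information that the probabilities carry.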

In [None]:
# ------------------------------------------------------------------------------
# Plot: Receiver Operating Characteristic (ROC) curve
# ------------------------------------------------------------------------------

fig, axis1 = plt.subplots(figsize=(8,8))
plt.plot(fpr, tpr, 'r-', label='ROC')
plt.plot([0,1], [0,1], 'k--', label="1-1")
plt.title("Receiver Operating Characteristic (ROC)")
plt.xlabel("False positive rate (1 - Specificity)")
plt.ylabel("True positive rate (Sensitivity)")
plt.legend(loc='lower right')
plt.tight_layout()

The metrics of the Random Forest are almost perfect!

# Accuracy of Neural Networks vs Random Forest

Accuracy of the Neural Network:

In [None]:
# Accuracy of the Neural Network
test_loss, test_acc_nn = Nnetwork.evaluate(features_test, targets_test)
print('Accuracy of NN in test data:', test_acc_nn, '\ntest_loss:', test_loss)

Accuracy of the Random Forest:

In [None]:
# Accuracy of Random Forest
score = Rforest.score(features_test, targets_test)
print(score)

Both models performed very well, but we can see that the **accuracy of the Random Forest is almost 3% higher** than the accuracy of the Neural Network.

# *And the winner is... RANDOM FOREST!!!*

An interesting thing the Random Forest lets us do is find out which features are most important for predicting the target variable. Let's take a look:

# Feature importance

We can rank the features according to how much each one contributed to the splits made while training. This is a measure of their importance, i.e., of how much each feature contributes to isolating pure partitions.

In [None]:
importances = Rforest.feature_importances_
std = np.std([tree.feature_importances_ for tree in Rforest.estimators_],
             axis=0)
indices = np.argsort(importances)[::-1]
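The same quantities can be inspected on a small synthetic problem. In the sketch below the data come from `make_classification`, and the feature names `f0` to `f4` and `random_state=0` are hypothetical choices for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with hypothetical feature names
X, y = make_classification(n_samples=500, n_features=5, n_informative=2,
                           n_redundant=0, random_state=0)
names = np.array(["f0", "f1", "f2", "f3", "f4"])

forest = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

importances = forest.feature_importances_
order = np.argsort(importances)[::-1]   # most important first

for name, value in zip(names[order], importances[order]):
    print(name, round(float(value), 3))

# Impurity-based importances are normalized to sum to 1
print(importances.sum())
```

Because the importances sum to 1, each value can be read as that feature's share of the total impurity reduction achieved by the forest.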

In [None]:
# Print the feature ranking, most important first
print("Feature ranking:")

for i in indices:
    print(features.columns[i], importances[i])

# Plot the feature importances of the forest
plt.figure()
plt.title("Feature importances")
plt.bar(range(features.shape[1]), importances[indices],
       color="r", yerr=std[indices], align="center")
plt.xticks(range(features.shape[1]), features.columns[indices], rotation=90)  # label bars with feature names
plt.xlim([-1, features.shape[1]])
plt.show()

This plot uses the forest of trees to evaluate the importance of each feature for our wine classification task. The red bars are the feature importances of the forest, and the error bars show their variability across trees.

As expected from the correlation matrix, the plot suggests that the 3 most informative features are:

* **chlorides**
* **total sulfur dioxide**
* **volatile acidity**

Those three features explain an important part of the wine's color.