## Dataset
[Wine Quality Repository](https://archive.ics.uci.edu/ml/datasets/wine+quality)

In [1]:
%matplotlib inline 

import numpy as np
import pandas as pd
import scipy.stats as stats
import matplotlib.pyplot as plt
import sklearn
import statsmodels.api as sm

import seaborn as sns
sns.set_style("whitegrid")
sns.set_context("poster")

# special matplotlib argument for improved plots
from matplotlib import rcParams

The dataset can be loaded directly from scikit-learn (`sklearn`). Note that `load_wine` returns the UCI wine recognition dataset (three classes), not the Wine Quality data linked above.

The loaded object exposes, among others, the following attributes:
* `DESCR`, which contains the description of the dataset itself;
* `feature_names`, which contains the names of the columns;
* `data`, which contains the feature values;
* `target`, which contains the class labels to be predicted.

A brief description of the dataset's characteristics can be found in the `DESCR` attribute.
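
As a quick check of the attributes listed above, the following sketch prints the shapes and names stored on the loaded object. It only uses `load_wine` and the attribute names already mentioned, plus `target_names`, which is an additional attribute of the same object.

```python
from sklearn.datasets import load_wine

wine = load_wine()

# The returned object is a dict-like container with attribute access.
print(sorted(wine.keys()))

# data holds one row per sample and one column per feature.
print(wine.data.shape)

# target holds one integer class label per sample.
print(wine.target.shape)

# feature_names and target_names give human-readable labels.
print(wine.feature_names[:3])
print(list(wine.target_names))
```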

In [2]:
from sklearn.datasets import load_wine
wine = load_wine()
print(wine.DESCR)

.. _wine_dataset:

Wine recognition dataset
------------------------

**Data Set Characteristics:**

    :Number of Instances: 178 (50 in each of three classes)
    :Number of Attributes: 13 numeric, predictive attributes and the class
    :Attribute Information:
 		- Alcohol
 		- Malic acid
 		- Ash
		- Alcalinity of ash  
 		- Magnesium
		- Total phenols
 		- Flavanoids
 		- Nonflavanoid phenols
 		- Proanthocyanins
		- Color intensity
 		- Hue
 		- OD280/OD315 of diluted wines
 		- Proline

    - class:
            - class_0
            - class_1
            - class_2
		
    :Summary Statistics:
    
                                   Min   Max   Mean     SD
    Alcohol:                      11.0  14.8    13.0   0.8
    Malic Acid:                   0.74  5.80    2.34  1.12
    Ash:                          1.36  3.23    2.36  0.27
    Alcalinity of Ash:            10.6  30.0    19.5   3.3
    Magnesium:                    70.0 162.0    99.7  14.3
    Total Phenols:                0

Now let's create a pandas DataFrame from the data.

We can do that by passing `wine.data` to `pd.DataFrame()`.

Let's then check the first 5 rows using the `head()` method.



In [3]:
df = pd.DataFrame(wine.data)
print(df.head())

      0     1     2     3      4     5     6     7     8     9     10    11  \
0  14.23  1.71  2.43  15.6  127.0  2.80  3.06  0.28  2.29  5.64  1.04  3.92   
1  13.20  1.78  2.14  11.2  100.0  2.65  2.76  0.26  1.28  4.38  1.05  3.40   
2  13.16  2.36  2.67  18.6  101.0  2.80  3.24  0.30  2.81  5.68  1.03  3.17   
3  14.37  1.95  2.50  16.8  113.0  3.85  3.49  0.24  2.18  7.80  0.86  3.45   
4  13.24  2.59  2.87  21.0  118.0  2.80  2.69  0.39  1.82  4.32  1.04  2.93   

       12  
0  1065.0  
1  1050.0  
2  1185.0  
3  1480.0  
4   735.0  


Let's now replace the default integer column labels with the names in `feature_names`.

In [4]:
df.columns = wine.feature_names
print(df.head())

   alcohol  malic_acid   ash  alcalinity_of_ash  magnesium  total_phenols  \
0    14.23        1.71  2.43               15.6      127.0           2.80   
1    13.20        1.78  2.14               11.2      100.0           2.65   
2    13.16        2.36  2.67               18.6      101.0           2.80   
3    14.37        1.95  2.50               16.8      113.0           3.85   
4    13.24        2.59  2.87               21.0      118.0           2.80   

   flavanoids  nonflavanoid_phenols  proanthocyanins  color_intensity   hue  \
0        3.06                  0.28             2.29             5.64  1.04   
1        2.76                  0.26             1.28             4.38  1.05   
2        3.24                  0.30             2.81             5.68  1.03   
3        3.49                  0.24             2.18             7.80  0.86   
4        2.69                  0.39             1.82             4.32  1.04   

   od280/od315_of_diluted_wines  proline  
0                  
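
The two steps above (building the DataFrame, then assigning column names) can also be done in one call. This is a minimal sketch, assuming scikit-learn 0.23 or newer, where `load_wine` accepts `as_frame=True`; the names `wine_frame` and `df_alt` are illustrative and not part of the notebook.

```python
from sklearn.datasets import load_wine

# as_frame=True asks scikit-learn to return pandas objects
# (assumption: scikit-learn >= 0.23 is installed).
wine_frame = load_wine(as_frame=True)

# .frame contains the 13 named feature columns plus a 'target' column.
df_alt = wine_frame.frame

print(df_alt.shape)
print(df_alt.columns[:3].tolist())
```

With this variant the class labels are already present in the `target` column, so they do not need to be added by hand.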

Let's now add the target column to the DataFrame (named `quality` here, although it holds the class label) and split the data into training and test sets.

In [5]:
from sklearn.model_selection import train_test_split

df['quality'] = wine.target
X = df.drop('quality', axis=1)
Y = df.quality

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25)
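
Because the split above is random, each run produces a different test set. The sketch below shows one way to make the split reproducible and keep the class proportions similar in both parts. The `random_state=0` value and the `_s` suffixed names are arbitrary choices for illustration, not settings used in the notebook.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

# Load features and labels as pandas objects (assumes scikit-learn >= 0.23).
X_s, Y_s = load_wine(return_X_y=True, as_frame=True)

# stratify keeps the class proportions roughly equal in train and test;
# random_state fixes the shuffle so the split is the same on every run.
X_train_s, X_test_s, Y_train_s, Y_test_s = train_test_split(
    X_s, Y_s, test_size=0.25, stratify=Y_s, random_state=0
)

print(X_train_s.shape, X_test_s.shape)
print(Y_test_s.value_counts().sort_index())
```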

Finally, let's fit a decision tree classifier.

In [6]:
from sklearn import tree

tm = tree.DecisionTreeClassifier()
tm.fit(X_train, Y_train)
Y_pred = tm.predict(X_test)
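
To see what a fitted tree has learned, its decision rules can be printed as text. This is a small sketch using `export_text` from `sklearn.tree`; the tree is limited to `max_depth=2` and fitted on the full dataset purely to keep the printout short, which differs from the unrestricted tree trained in the cell above.

```python
from sklearn.datasets import load_wine
from sklearn.tree import DecisionTreeClassifier, export_text

wine = load_wine()

# A shallow tree, fitted only for illustration (max_depth=2 is an assumption).
small_tree = DecisionTreeClassifier(max_depth=2, random_state=0)
small_tree.fit(wine.data, wine.target)

# export_text renders the split thresholds and leaf classes as indented text.
rules = export_text(small_tree, feature_names=list(wine.feature_names))
print(rules)
```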

In [11]:
from sklearn.metrics import mean_squared_error, r2_score, confusion_matrix

# Note: MSE and R2 are regression metrics. Here they treat the class labels
# 0, 1, 2 as numbers, so the confusion matrix is the more meaningful summary.
mse = mean_squared_error(Y_test, Y_pred)
r2 = r2_score(Y_test, Y_pred)
cm = confusion_matrix(Y_test, Y_pred)

print("mean squared error: {}".format(mse))
print("root mean squared error: {}".format(pow(mse,1/2)))
print("r2 metric: {}".format(r2))
print("Confusion matrix:\n{}".format(cm))

mean squared error: 0.08888888888888889
root mean squared error: 0.29814239699997197
r2 metric: 0.8456260720411664
Confusion matrix:
[[13  1  0]
 [ 2 16  1]
 [ 0  0 12]]
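
The confusion matrix above has 41 of its 45 predictions on the diagonal, which corresponds to an accuracy of about 0.91. Since this is a classification task, accuracy and per-class precision and recall are usually easier to interpret than MSE or R2. The sketch below computes them on a fresh, seeded split, so its numbers will not match the output above; the variable names are illustrative.

```python
from sklearn.datasets import load_wine
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_c, y_c = load_wine(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_c, y_c, test_size=0.25, stratify=y_c, random_state=0
)

clf = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
y_hat = clf.predict(X_te)

# Fraction of test samples whose predicted class equals the true class.
acc = accuracy_score(y_te, y_hat)
print("accuracy: {:.3f}".format(acc))

# Precision, recall and F1 for each of the three classes.
print(classification_report(y_te, y_hat))
```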


Now let's use a random forest.

In [18]:
from sklearn.ensemble import RandomForestClassifier

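# train_test_split is called again without a fixed random_state, so this
# test set differs from the one used for the decision tree above.
X = df.drop('quality', axis=1)
Y = df.quality
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25)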

rfc = RandomForestClassifier(n_estimators=3000)
model = rfc.fit(X_train, Y_train)
Y_pred = rfc.predict(X_test)
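
A fitted random forest also reports how much each feature contributed to its splits. The sketch below ranks the features by `feature_importances_`; it fits a smaller forest (`n_estimators=200`, an arbitrary choice) on the full dataset only to illustrate the attribute, not to reproduce the model above.

```python
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier

wine = load_wine()

forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(wine.data, wine.target)

# Impurity-based importances: non-negative values that sum to 1.
importances = pd.Series(
    forest.feature_importances_, index=wine.feature_names
).sort_values(ascending=False)

print(importances.head())
```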

In [19]:
from sklearn.metrics import mean_squared_error, r2_score, confusion_matrix

mse = mean_squared_error(Y_test, Y_pred)
r2 = r2_score(Y_test, Y_pred)
cm = confusion_matrix(Y_test, Y_pred)

print("mean squared error: {}".format(mse))
print("root mean squared error: {}".format(pow(mse,1/2)))
print("r2 metric: {}".format(r2))
print("Confusion matrix:\n{}".format(cm))

mean squared error: 0.0
root mean squared error: 0.0
r2 metric: 1.0
Confusion matrix:
[[15  0  0]
 [ 0 18  0]
 [ 0  0 12]]
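
A perfect score on a single 45-sample test set can partly reflect a favourable split, and the two models above were evaluated on different splits. One way to compare them more fairly is cross-validation on identical folds. The following is a sketch under assumed settings (5 stratified folds, `random_state=0`, 100 trees instead of 3000 to keep it fast), not the evaluation used in the notebook.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X_cv, y_cv = load_wine(return_X_y=True)

# The same folds are used for both models so the scores are comparable.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

models = {
    "decision tree": DecisionTreeClassifier(random_state=0),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

scores = {}
for name, model in models.items():
    # One accuracy value per fold.
    scores[name] = cross_val_score(model, X_cv, y_cv, cv=cv, scoring="accuracy")
    print("{}: mean={:.3f} std={:.3f}".format(
        name, scores[name].mean(), scores[name].std()))
```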
