# Predicting Wine Types with Decision Trees & Random Forest

## Introduction

This notebook is a simple demonstration of how to use scikit-learn to build a Decision Tree model and Random Forest model for classification. It uses a dataset of 178 wines and their various attributes. There are three different classes of wine in the data and the goal is to predict the wine class based upon the attributes.

## The Data

The data has been taken from [UCI Machine Learning Repository](http://archive.ics.uci.edu/ml) and can be found [here](https://archive.ics.uci.edu/ml/datasets/Wine). 

Information on the data can be found [here](https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.names). The main thing we are going to focus on is the attribute names, as the raw data doesn't have a header. The first column is the class, which is what we wish to predict, and we will use the remaining attributes as features:
1. Alcohol
2. Malic acid
3. Ash
4. Alcalinity of ash  
5. Magnesium
6. Total phenols
7. Flavanoids
8. Nonflavanoid phenols
9. Proanthocyanins
10. Color intensity
11. Hue
12. OD280/OD315 of diluted wines
13. Proline   

We import the Python libraries we need

In [1]:
import pandas as pd
import numpy as np

We read in the data we've saved, passing the column names

In [2]:
wine = pd.read_csv("data/wine.csv", names=[
    "class", "alcohol", "malic_acid", "ash", "alcalinity_of_ash", "magnesium", "total_phenols",
    "flavanoids", "nonflavanoid_phenols", "proanthocyanins", "color_intensity", "hue",
    "od280_od315_of_diluted_wines", "proline",
])

Let's check out the first few rows of data

In [3]:
wine.head()

Unnamed: 0,class,alcohol,malic_acid,ash,alcalinity_of_ash,magnesium,total_phenols,flavanoids,nonflavanoid_phenols,proanthocyanins,color_intensity,hue,od280_od315_of_diluted_wines,proline
0,1,14.23,1.71,2.43,15.6,127,2.8,3.06,0.28,2.29,5.64,1.04,3.92,1065
1,1,13.2,1.78,2.14,11.2,100,2.65,2.76,0.26,1.28,4.38,1.05,3.4,1050
2,1,13.16,2.36,2.67,18.6,101,2.8,3.24,0.3,2.81,5.68,1.03,3.17,1185
3,1,14.37,1.95,2.5,16.8,113,3.85,3.49,0.24,2.18,7.8,0.86,3.45,1480
4,1,13.24,2.59,2.87,21.0,118,2.8,2.69,0.39,1.82,4.32,1.04,2.93,735


## Train & Test Data

The purpose of splitting the data is to be able to assess the quality of a predictive model when it is used on unseen data. When training, you will try to build a model that fits to the data as closely as possible, to be able to most accurately make a prediction. However, without a test set you run the risk of overfitting - the model works very well for the data it has seen but not for new data.

The split ratio is often debated and in practice you might split your data into three sets: train, validation and test. You would use the training data to understand which classifier you wish to use; the validation set to test on whilst tweaking parameters; and the test set to get an understanding of how your final model would work in practice. Furthermore, there are techniques such as K-Fold cross validation that also help to reduce bias.
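
As a rough illustration of the K-Fold idea mentioned above, the sketch below scores a decision tree on five folds. It assumes that scikit-learn's bundled `load_wine()` dataset is an acceptable stand-in for the CSV used in this notebook (same 178 rows and 13 features, but with classes labelled 0 to 2); the names `X_demo`, `y_demo`, `cv` and `scores` are illustrative only.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Assumption: load_wine() stands in for data/wine.csv (labels are 0-2 here, not 1-3).
X_demo, y_demo = load_wine(return_X_y=True)

# Five folds: each fold is held out once while the model trains on the other four.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=123)
scores = cross_val_score(DecisionTreeClassifier(random_state=123), X_demo, y_demo, cv=cv)

print("Accuracy per fold:", scores)
print("Mean accuracy:", scores.mean())
```

Averaging over several folds gives a less split-dependent estimate than a single train/test split, at the cost of fitting the model once per fold.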

For the purpose of this demonstration, we will only be randomly splitting our data into train and test sets, with an 80/20 split.

We import the required library from scikit-learn, [train_test_split](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)

In [4]:
from sklearn.model_selection import train_test_split

We wish for all features to be used for training, therefore we are taking all columns except "class"

In [5]:
X = wine.drop(["class"], axis=1)

The column "class" is our target variable, so we set y to this column

In [6]:
y = wine["class"]

We use the *train_test_split* function to create the train and test data for our features ("X_train" and "X_test" respectively) and our target ("y_train" and "y_test"). We specify that the test data should be 20% of the total data. We also provide a seed so that this split can be reproduced

In [7]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)

We can check the number of examples we have in each of our train and test data sets using "shape"

In [8]:
X_train.shape

(142, 13)

In [9]:
X_test.shape

(36, 13)
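
The shapes above follow from the split arithmetic: 20% of 178 is 35.6, which scikit-learn rounds up to 36 test rows, leaving 142 for training. A minimal sketch that checks this on placeholder arrays (the zero-filled `X_dummy` and `y_dummy` are assumptions standing in for the wine data):

```python
import math

import numpy as np
from sklearn.model_selection import train_test_split

n_samples = 178
X_dummy = np.zeros((n_samples, 13))  # placeholder features with the same shape as the wine data
y_dummy = np.zeros(n_samples)        # placeholder target

X_tr, X_te, y_tr, y_te = train_test_split(X_dummy, y_dummy, test_size=0.2, random_state=123)

# The test set size is the requested fraction of the rows, rounded up.
expected_test = math.ceil(0.2 * n_samples)
print(expected_test, X_tr.shape, X_te.shape)
```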

## Standardisation

All features are numeric, so we do not need to worry about converting categorical data with techniques such as one-hot encoding. However, we will demonstrate how to standardise our data. Standardisation rescales each attribute so that it has a mean of 0 and a standard deviation of 1. It does not strictly require the data to be Gaussian, but it tends to work better when the distribution is roughly Gaussian. Alternatively, normalisation can be used to rescale each attribute to the range 0 to 1
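
To make the two rescaling options concrete, here is a small sketch on made-up values (not taken from the wine data). It computes standardisation and normalisation by hand and compares them with scikit-learn's `StandardScaler` and `MinMaxScaler`; the variable names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Made-up values for a single feature, used only for illustration.
x = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])

# Standardisation: subtract the mean, then divide by the (population) standard deviation.
z_manual = (x - x.mean(axis=0)) / x.std(axis=0)
z_sklearn = StandardScaler().fit_transform(x)

# Normalisation (min-max scaling): map the minimum to 0 and the maximum to 1.
n_manual = (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))
n_sklearn = MinMaxScaler().fit_transform(x)

print(z_sklearn.ravel())
print(n_sklearn.ravel())
```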

We use scikit-learn's [StandardScaler](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html)

In [10]:
from sklearn.preprocessing import StandardScaler

We create the scaler, leaving parameters as default

In [11]:
scaler = StandardScaler()

We fit the scaler on the training data and, in the same call, transform that data, storing the result in a variable named "train_scaled"

In [12]:
train_scaled = scaler.fit_transform(X_train)

We then transform our test data with the same fitted scaler

In [13]:
test_scaled = scaler.transform(X_test)
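
The scaler is fitted on the training data only, so no information from the test set leaks into the preprocessing. One way to keep that guarantee without managing the scaler by hand is a scikit-learn pipeline. The sketch below is an alternative to the cells above rather than what this notebook does, and it assumes `load_wine()` as a stand-in for the CSV; `pipe` and the other names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Assumption: load_wine() stands in for data/wine.csv.
X_demo, y_demo = load_wine(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=123)

# fit() learns the scaling statistics from X_tr only, then trains the tree on the scaled data.
pipe = make_pipeline(StandardScaler(), DecisionTreeClassifier(random_state=123))
pipe.fit(X_tr, y_tr)

# The fitted scaler's means are the training means, not those of the full data set.
fitted_scaler = pipe.named_steps["standardscaler"]
print(np.allclose(fitted_scaler.mean_, X_tr.mean(axis=0)))

# score() applies the same fitted scaler to the test data before predicting.
test_accuracy = pipe.score(X_te, y_te)
print("Test accuracy:", test_accuracy)
```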

## Decision Trees & Random Forest

Decision trees learn how to best split the dataset into separate branches, allowing them to learn non-linear relationships.

Random forests (RF) combine predictions from many individual decision trees. As they use a collection of results to make a final decision, they are referred to as 'Ensemble techniques'.
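
To see the ensemble idea in practice, the sketch below fits a small forest and checks that its predicted class probabilities are the average of the probabilities from its individual trees, which is how scikit-learn combines them. It assumes `load_wine()` as a stand-in for the CSV; `forest`, `per_tree` and `averaged` are illustrative names.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier

# Assumption: load_wine() stands in for data/wine.csv.
X_demo, y_demo = load_wine(return_X_y=True)
forest = RandomForestClassifier(n_estimators=10, random_state=123).fit(X_demo, y_demo)

# Each fitted tree is available in forest.estimators_.
per_tree = np.array([tree.predict_proba(X_demo) for tree in forest.estimators_])
averaged = per_tree.mean(axis=0)

print(per_tree.shape)  # (number of trees, number of samples, number of classes)
print(np.allclose(averaged, forest.predict_proba(X_demo)))
```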

We are using scikit-learn's [Decision Tree Classifier](http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html) and [Random Forest Classifier](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)

In [14]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

We create a Decision Tree model & a Random Forest model

In [15]:
tree_model = DecisionTreeClassifier()
rf_model = RandomForestClassifier()

We train them with our scaled training data and target values

In [16]:
tree_model.fit(train_scaled, y_train)
rf_model.fit(train_scaled, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

## Model Evaluation

We wish to understand how good our models are; one simple metric to use for evaluation is accuracy. This returns the fraction of correct predictions, a value between 0 and 1 (so 1.0 means 100% correct).

We import [scikit-learn's accuracy_score](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html)

In [17]:
from sklearn.metrics import accuracy_score
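
To make the metric concrete, here is a small sketch that computes accuracy by hand and compares it with `accuracy_score`. The labels are made up for illustration and are not results from the models in this notebook.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up labels for illustration only.
y_true = np.array([1, 1, 2, 2, 3, 3, 3, 1])
y_pred = np.array([1, 1, 2, 3, 3, 3, 2, 1])

# Accuracy is the fraction of predictions that match the true labels (6 of 8 here).
manual = np.mean(y_true == y_pred)
print(manual, accuracy_score(y_true, y_pred))
```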

We calculate accuracy for our training data

In [19]:
print("Decision Tree training accuracy", accuracy_score(y_train, tree_model.predict(train_scaled)))
print("Random Forest training accuracy", accuracy_score(y_train, rf_model.predict(train_scaled)))

Decision Tree training accuracy 1.0
Random Forest training accuracy 1.0


Both result in training accuracy of 100%! More importantly, we should check if we find good results on unseen data (to ensure we haven't overfit). We calculate the accuracy for our test data

In [20]:
print("Decision Tree test accuracy", accuracy_score(y_test, tree_model.predict(test_scaled)))
print("Random Forest test accuracy", accuracy_score(y_test, rf_model.predict(test_scaled)))

Decision Tree test accuracy 0.8888888888888888
Random Forest test accuracy 0.9722222222222222


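So it looks like our Decision Tree overfit: it scored 100% on the training data but only 88.9% (32 of 36) on the test data. Our Random Forest model looks much better, with a test accuracy of 97.2% (35 of 36). Bear in mind that with only 36 test examples, each misclassified wine changes the accuracy by about 2.8 percentage points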

## Decision Trees & Random Forest Parameters

More information on tree algorithms can be found in the scikit-learn documentation [here](http://scikit-learn.org/stable/modules/tree.html) and ensembles [here](http://scikit-learn.org/stable/modules/ensemble.html)

There are a number of parameters that can be tuned when trying to improve Decision Tree and Random Forest models. A common approach is to test many different parameter values, building multiple models and comparing their accuracy to find the best combination.

### Decision Trees
For Decision Trees, the [scikit-learn documentation](http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier) lists the parameters that can be passed by the user; changing these is likely to have an impact on the performance of the model.

Here is high-level information on the parameters, the documentation has more details:
- criterion : default=“gini”
    - The function to measure the quality of a split. Supported criteria are “gini” for the Gini impurity and “entropy” for the information gain.

- splitter : default=“best”
    - The strategy used to choose the split at each node. Supported strategies are “best” to choose the best split and “random” to choose the best random split.

- max_depth : default=None
    - The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.

- min_samples_split : default=2
    - The minimum number of samples required to split an internal node.

- min_samples_leaf : default=1
    - The minimum number of samples required to be at a leaf node.

- min_weight_fraction_leaf : default=0.
    - The minimum weighted fraction of the sum total of weights (of all the input samples) required to be at a leaf node. Samples have equal weight when sample_weight is not provided.

- max_features : default=None
    - The number of features to consider when looking for the best split. None means that all features are considered.

- random_state : default=None
    - If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.

- max_leaf_nodes : default=None
    - Grow a tree with max_leaf_nodes in best-first fashion. Best nodes are defined as relative reduction in impurity. If None then unlimited number of leaf nodes.

- min_impurity_decrease : default=0.
    - A node will be split if this split induces a decrease of the impurity greater than or equal to this value.

- class_weight : default=None
    - Weights associated with classes in the form {class_label: weight}. If not given, all classes are supposed to have weight one. For multi-output problems, a list of dicts can be provided in the same order as the columns of y.

- presort : default=False
    - Whether to presort the data to speed up the finding of best splits in fitting. For the default settings of a decision tree on large datasets, setting this to true may slow down the training process. When using either a smaller dataset or a restricted depth, this may speed up the training. Note that this parameter was deprecated and later removed in newer scikit-learn releases.
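
As a rough sketch of how these parameters are passed in practice, the example below varies `max_depth` and records the training and test accuracy for each setting. It assumes `load_wine()` as a stand-in for the CSV used in this notebook, and the depth values are arbitrary choices for illustration rather than recommendations.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: load_wine() stands in for data/wine.csv.
X_demo, y_demo = load_wine(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=123)

results = {}
for depth in (1, 2, 3, None):  # None lets the tree grow until its leaves are pure
    tree = DecisionTreeClassifier(max_depth=depth, min_samples_leaf=1, random_state=123)
    tree.fit(X_tr, y_tr)
    # Store (training accuracy, test accuracy, actual depth reached).
    results[depth] = (tree.score(X_tr, y_tr), tree.score(X_te, y_te), tree.get_depth())
    print(depth, results[depth])
```

A large gap between training and test accuracy for a given setting is a hint that the tree is overfitting.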

### Random Forests
Similarly, for Random Forests, the [scikit-learn documentation](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) lists the parameters that can be passed by the user; changing these is likely to have an impact on the performance of the model.

Here is high-level information on the parameters, the documentation has more details:
- n_estimators : default=10 (100 in scikit-learn 0.22 and later)
    - The number of trees in the forest.

- criterion : default=“gini”
    - The function to measure the quality of a split. Supported criteria are “gini” for the Gini impurity and “entropy” for the information gain. Note: this parameter is tree-specific.

- max_features : default=“auto”
    - The number of features to consider when looking for the best split. With “auto”, the square root of the number of features is used.

- max_depth : default=None
    - The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.

- min_samples_split : default=2
    - The minimum number of samples required to split an internal node.

- min_samples_leaf : default=1
    - The minimum number of samples required to be at a leaf node.

- min_weight_fraction_leaf : default=0.
    - The minimum weighted fraction of the sum total of weights (of all the input samples) required to be at a leaf node. Samples have equal weight when sample_weight is not provided.

- max_leaf_nodes : default=None
    - Grow trees with max_leaf_nodes in best-first fashion. Best nodes are defined as relative reduction in impurity. If None then unlimited number of leaf nodes.

- min_impurity_decrease : default=0.
    - A node will be split if this split induces a decrease of the impurity greater than or equal to this value.

- bootstrap : default=True
    - Whether bootstrap samples are used when building trees.

- oob_score : default=False
    - Whether to use out-of-bag samples to estimate the generalization accuracy.

- n_jobs : default=1
    - The number of jobs to run in parallel for both fit and predict. If -1, then the number of jobs is set to the number of cores.

- random_state : default=None
    - If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.

- verbose : default=0
    - Controls the verbosity of the tree building process.

- warm_start : default=False
    - When set to True, reuse the solution of the previous call to fit and add more estimators to the ensemble, otherwise, just fit a whole new forest.

- class_weight : default=None
    - Weights associated with classes in the form {class_label: weight}. The strings “balanced” and “balanced_subsample” are also accepted. If not given, all classes are supposed to have weight one. For multi-output problems, a list of dicts can be provided in the same order as the columns of y.
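
A minimal sketch of setting a few of these parameters explicitly is shown below. It uses more trees than the default listed above and turns on out-of-bag scoring, which estimates accuracy from the samples each tree did not see during bootstrapping. It assumes `load_wine()` as a stand-in for the CSV; the parameter values are illustrative rather than tuned.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier

# Assumption: load_wine() stands in for data/wine.csv.
X_demo, y_demo = load_wine(return_X_y=True)

forest = RandomForestClassifier(
    n_estimators=100,     # number of trees in the forest
    max_features="sqrt",  # features considered at each split
    bootstrap=True,       # required for out-of-bag scoring
    oob_score=True,       # estimate accuracy from out-of-bag samples
    random_state=123,     # fixed seed for reproducibility
)
forest.fit(X_demo, y_demo)

print("Number of trees:", len(forest.estimators_))
print("Out-of-bag accuracy estimate:", forest.oob_score_)
```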
    
### Grid Search

To search for the best hyper-parameters for your algorithm and data, grid search cross validation is commonly used. The [scikit-learn documentation](http://scikit-learn.org/stable/modules/grid_search.html) provides more thorough information on how to use this. 
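
As a hedged sketch of what such a search might look like, the example below uses `GridSearchCV` with a Random Forest. It assumes `load_wine()` as a stand-in for the CSV used in this notebook, and the grid in `param_grid` is a hypothetical choice made for illustration; sensible ranges depend on the data.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Assumption: load_wine() stands in for data/wine.csv.
X_demo, y_demo = load_wine(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=123)

# Hypothetical grid for illustration: 2 x 2 x 2 = 8 combinations.
param_grid = {
    "n_estimators": [10, 50],
    "max_depth": [None, 3],
    "min_samples_leaf": [1, 3],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=123),
    param_grid,
    cv=5,                # 5-fold cross validation for every combination
    scoring="accuracy",
)
search.fit(X_tr, y_tr)   # only the training data is used during the search

print("Best parameters:", search.best_params_)
print("Best cross-validated accuracy:", search.best_score_)
print("Held-out test accuracy:", search.score(X_te, y_te))
```

Keeping the test set out of the search means the final test accuracy remains an estimate on data the tuning process never saw.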

#### Data Citation

Dua, D. and Karra Taniskidou, E. (2017). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science. 