# Decision Tree & Random Forest: quality of red wine

We study UCI Machine Learning's dataset on the quality of Portuguese red wine using a Decision Tree and a Random Forest.

<img src="https://i.imgur.com/nD0qMyY.jpg" width="50%">

## Libraries and data

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import tree
import graphviz 
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

In [None]:
wine = pd.read_csv("../input/winequality-red.csv")

## Exploring the data

In [None]:
wine.head()

The column `quality` states the quality of the wine; a higher value means higher quality. We study the distribution of `quality`.

In [None]:
quality_dist = wine['quality'].value_counts()
plt.bar(quality_dist.index, quality_dist)
plt.xlabel('quality')
plt.ylabel('frequency')
plt.show()

The quality scores of the red wines lie in the set $\{3,4,5,6,7,8\}$. We want to classify each wine as either 'poor' or 'good'. We study summary statistics of the dataset to decide where to draw the line.

In [None]:
wine['quality'].describe()

In [None]:
values, base = np.histogram(wine['quality'], bins=20)
kumulativ = np.cumsum(values/wine.shape[0])
plt.plot(base[:-1], kumulativ, c='blue')
plt.xlabel('quality')
plt.ylabel('cumulative frequency')
plt.show()

Since the median quality is $6$ and the mean quality is $\sim 5.6$, we classify a red wine as 'poor' if `quality` is less than or equal to $6$; otherwise we classify it as 'good'. We replace the values in the column `quality` with $0$ for 'poor' wine and $1$ for 'good' wine.
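The thresholding rule can also be written as a single vectorized comparison. The following is a minimal sketch on a small hypothetical Series (`toy_quality` and `toy_label` are illustrative names, not part of the notebook):

```python
import pandas as pd

# Hypothetical quality scores covering the observed range 3..8.
toy_quality = pd.Series([3, 4, 5, 6, 7, 8])

# 1 for 'good' (quality > 6), 0 for 'poor' (quality <= 6).
toy_label = (toy_quality > 6).astype(int)
print(toy_label.tolist())  # [0, 0, 0, 0, 1, 1]
```

Only the scores $7$ and $8$ map to 'good', so the 'good' class is expected to be the smaller one.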

In [None]:
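# Row labels of 'poor' (quality <= 6) and 'good' (quality > 6) wines,
# computed before any value is overwritten.
indeksDaarlig = wine.loc[wine['quality'] <= 6].index
indeksGod = wine.loc[wine['quality'] > 6].index
# .loc selects by label, which matches the labels stored in the index objects.
wine.loc[indeksDaarlig, 'quality'] = 0
wine.loc[indeksGod, 'quality'] = 1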

In [None]:
wine['quality'].value_counts()

## Decision Tree

We make a decision tree to try to determine the characteristics of so-called 'good' wine in our classification.

In [None]:
x = wine.drop('quality',axis=1)
y = wine['quality']

In [None]:
# Holding out 40% of the data as test data; the remaining 60% is training data.
xTrain, xTest, yTrain, yTest = train_test_split(x, y, test_size = 0.40, random_state = 42)

In [None]:
# Making a decision tree with two levels.
clfTre = tree.DecisionTreeClassifier(max_depth=2)
clfTre.fit(xTrain, yTrain)

In [None]:
#Visualizing the decision tree
dot_data = tree.export_graphviz(clfTre, out_file=None, max_depth=2, feature_names=list(x.columns.values), filled=True, rounded=True)
valgTre = graphviz.Source(dot_data) 
valgTre

We observe that the model splits on the alcohol level first. If the alcohol level is high and the sulphates level is high, then the wine is most likely 'good'. On the other hand, if the alcohol level is low and the volatile acidity is high, then the wine is most likely 'poor'.
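The same kind of reading can be done from a plain-text rendering of the tree. Below is a minimal sketch using `sklearn.tree.export_text` on synthetic stand-in data. The data, the names `X_demo`, `y_demo`, and `clf_demo`, and the labeling rule are assumptions for illustration, not the wine dataset.

```python
import numpy as np
from sklearn import tree

# Synthetic stand-in data (an assumption, not the wine dataset):
# the label is 1 exactly when the hypothetical 'alcohol' feature exceeds 11.
rng = np.random.RandomState(0)
X_demo = np.column_stack([
    rng.uniform(8, 15, size=200),     # stand-in for 'alcohol'
    rng.uniform(0.3, 2.0, size=200),  # stand-in for 'sulphates'
])
y_demo = (X_demo[:, 0] > 11).astype(int)

clf_demo = tree.DecisionTreeClassifier(max_depth=2, random_state=0)
clf_demo.fit(X_demo, y_demo)

# export_text renders the fitted splits as indented if/else rules.
rules = tree.export_text(clf_demo, feature_names=['alcohol', 'sulphates'])
print(rules)
```

Each indented line is one split condition, and each leaf line reports the predicted class, so a path from the root to a leaf reads as a rule such as "alcohol high, then class 1".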

We examine the prediction accuracy of the decision tree.

In [None]:
# Share of test samples whose predicted label matches the true label.
utfall = (clfTre.predict(xTest) == yTest).mean()
print("The decision tree correctly predicts the test data in", utfall*100, "% of the cases.")
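Because the 'good' class contains only wines with quality above the median, the two classes are likely imbalanced, and an accuracy figure is easier to interpret next to a majority-class baseline. The sketch below uses hypothetical labels with an assumed 80/20 split (not the wine data) and scikit-learn's `DummyClassifier`:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical imbalanced labels: 80 'poor' (0) and 20 'good' (1).
y_demo = np.array([0] * 80 + [1] * 20)
X_demo = np.zeros((100, 1))  # features are ignored by the dummy model

# Always predicts the most frequent class seen during fit.
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_demo, y_demo)

baseline_accuracy = accuracy_score(y_demo, baseline.predict(X_demo))
print(baseline_accuracy)  # 0.8
```

A classifier is only informative to the extent that its accuracy exceeds this baseline.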

## Random Forest

We use a Random Forest, an ensemble of decision trees each fitted on a random bootstrap sample of the training data, and allow the individual trees to grow deeper than before.
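To make the ensemble structure concrete, the following minimal sketch fits a small forest on synthetic stand-in data and inspects its individual trees. The data and the names `X_demo`, `y_demo`, and `rf_demo` are assumptions for illustration, not the wine dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (an assumption, not the wine dataset).
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(300, 4))
y_demo = (X_demo[:, 0] + 0.5 * X_demo[:, 1] > 0).astype(int)

rf_demo = RandomForestClassifier(n_estimators=25, max_depth=10, random_state=0)
rf_demo.fit(X_demo, y_demo)

# The fitted forest holds one decision tree per estimator.
n_trees = len(rf_demo.estimators_)
# Depth actually reached by the deepest tree; bounded by max_depth.
deepest = max(t.get_depth() for t in rf_demo.estimators_)
# Impurity-based feature importances, normalized to sum to 1.
importances = rf_demo.feature_importances_

print(n_trees, deepest)
print(importances)
```

The forest's prediction aggregates the predictions of these trees, and `feature_importances_` gives a rough view of which features the trees split on most.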

In [None]:
rf = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=0)
rf.fit(xTrain, yTrain)

In [None]:
# Share of test samples whose predicted label matches the true label.
utfall = (rf.predict(xTest) == yTest).mean()
print("The random forest correctly predicts the test data in", utfall*100, "% of the cases.")