# Wine quality analysis with decision trees

The file `wine_quality.csv` contains the chemical properties of a set of wines. Let's see whether what we have learned so far can help us predict, from these properties, whether a wine is good.

## Load, examine, clean, prepare

In [None]:
# Read and parse the wine_quality.csv file.

# The file is in CSV format. The pandas library is well
# suited to read and parse each field.
import pandas
data = pandas.read_csv("wine_quality.csv")

# We can take a look at the dataset to see what it contains
data.head()

In [None]:
# How many rows and columns does the dataset have?

n_rows, n_cols = data.shape
print("The dataset has {} rows and {} columns.".format(
    n_rows, n_cols))

In [None]:
# List all chemical properties of this dataset.

print("The chemical properties of each wine are:")
for col_name in data.columns:
    print("  -", col_name)

In [None]:
# What kinds of wine are present in this dataset?

# From the previous question, we can see that the column
# `type` indicates the kind of each wine.
print("The different kinds of wine in this dataset are:")
for kind in data["type"].unique():
    print("  -", kind)

In [None]:
# Find the right method to get the average/minimum/maximum value
# of each column (and only these 3 statistics per column).

# The describe() method of the DataFrame object `data` gives
# a lot of information, including the average/minimum/maximum
# but also other statistics such as the count, the standard
# deviation and the quartiles.
# We only need to select the rows we are interested in.
data.describe().loc[["mean", "min", "max"]]

In [None]:
# Does this dataset have any missing information?

data.isna().head(20)
# Yes, this dataset has some missing values. For example, we
# can see a value "True" in the 17th row in the table below.

In [None]:
# How many missing values are there?

# .sum() computes the sum for each column. It gives the
# number of missing values for each column. We need a second
# .sum() to get the number of missing values in the entire
# dataframe.
print("There are {} missing values in the dataframe.".
     format(data.isna().sum().sum()))

In [None]:
# Which column has the most missing values?

print("The column with the most missing values is '{}'.".
     format(data.isna().sum().idxmax()))

In [None]:
# Remove the rows that have at least 1 missing value.
# How many rows have been removed?
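
One possible approach is `DataFrame.dropna()`, which by default drops every row containing at least one missing value. The sketch below runs on a small made-up frame (`toy` and its values are hypothetical, not taken from `wine_quality.csv`); the same calls should apply to `data`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for `data`; the values are made up.
toy = pd.DataFrame({
    "type": ["red", "white", "white", "red"],
    "pH": [3.5, np.nan, 3.2, 3.3],
    "alcohol": [9.4, 10.1, np.nan, 11.0],
    "quality": [5, 6, 7, 8],
})

n_before = len(toy)
# With its default arguments, dropna() removes every row that has
# at least one missing value and returns a new DataFrame.
toy_clean = toy.dropna()
n_removed = n_before - len(toy_clean)
print("{} rows have been removed.".format(n_removed))
```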

In [None]:
# Use a histogram to see the distribution of
# the wine quality.

In [None]:
# Let's consider that a wine is good if its quality is
# at least 7. Replace the values in the "quality" column
# with "good" if quality >= 7 and with "not good" otherwise.
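
A minimal sketch of this replacement, assuming `numpy.where` is an acceptable tool here (other options such as `Series.map` would work too). The `toy` frame and its scores are hypothetical stand-ins for `data`.

```python
import numpy as np
import pandas as pd

# Hypothetical quality scores standing in for data["quality"].
toy = pd.DataFrame({"quality": [5, 7, 6, 8, 3]})

# np.where returns "good" where the condition is True
# and "not good" everywhere else.
toy["quality"] = np.where(toy["quality"] >= 7, "good", "not good")
print(toy["quality"].value_counts())
```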

In [None]:
# Create the input data (i.e. the properties) and the
# label (i.e. the quality of wine) and assign them
# to 2 different variables X and y. Our machine learning
# algorithm needs to have both input and output data.

In [None]:
# Separate your data into a training and a test set
# with 80% for the training set.
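
The two steps above (building `X` and `y`, then splitting them) could be sketched as follows. The `toy` frame is hypothetical, and `random_state=0` is an arbitrary seed chosen only to make the split reproducible.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame standing in for the cleaned `data`.
toy = pd.DataFrame({
    "type": ["red", "white"] * 5,
    "alcohol": [9.4, 10.1, 11.2, 9.8, 12.0, 10.5, 9.0, 13.1, 11.7, 10.9],
    "quality": ["not good", "good"] * 5,
})

# Inputs: every column except the label. Output: the label column.
X = toy.drop(columns="quality")
y = toy["quality"]

# test_size=0.2 keeps 80% of the rows for the training set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
print(X_train.shape, X_test.shape)
```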

## Predicting wine quality with a decision tree

In [None]:
# Is this a classification or a regression problem?
# Import the appropriate version of DecisionTree, then
# train it with your training data.

In [None]:
# Oops, it seems that there is a problem! Indeed, most
# machine learning algorithms only work with numerical vectors,
# and our current training data still contains some string values
# (like the type or the quality). We need to transform them before
# training our model.

# sklearn comes with tools to transform non-numerical values.
# In our case, we are going to use a LabelEncoder. Look at the
# documentation to learn what it does.

from sklearn.preprocessing import LabelEncoder

# Now create two encoders: one for the `type` in X, the other
# for the `quality` in y. Use the trained encoders to transform
# X_train, X_test, y_train and y_test.
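
A sketch of the two encoders on made-up splits. The small frames below are hypothetical stand-ins for your own `X_train`, `X_test`, `y_train` and `y_test`, and the names `type_encoder` and `quality_encoder` are only illustrative.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy splits standing in for the real ones.
X_train = pd.DataFrame({"type": ["red", "white", "white"],
                        "alcohol": [9.4, 10.1, 11.2]})
X_test = pd.DataFrame({"type": ["white", "red"],
                       "alcohol": [12.0, 9.8]})
y_train = pd.Series(["not good", "good", "good"])
y_test = pd.Series(["good", "not good"])

# Fit each encoder on the training data only, then reuse the
# fitted encoder to transform the test data.
type_encoder = LabelEncoder()
X_train["type"] = type_encoder.fit_transform(X_train["type"])
X_test["type"] = type_encoder.transform(X_test["type"])

quality_encoder = LabelEncoder()
y_train = quality_encoder.fit_transform(y_train)
y_test = quality_encoder.transform(y_test)

# classes_ lists the original strings; the string at position i
# is encoded as the integer i.
print(list(type_encoder.classes_), list(quality_encoder.classes_))
```

`inverse_transform` on a fitted encoder maps the integers back to the original strings, which helps when reading predictions.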

In [None]:
# Now train your decision tree again.

In [None]:
# What is the accuracy of your model (on both the training
# and test sets)? Do you think we are underfitting? Overfitting?
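
One way to compare the two accuracies is the `score()` method of the fitted classifier. The sketch below uses synthetic data from `make_classification` as an assumption in place of the encoded wine data, so the numbers it prints say nothing about the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (assumption): stands in for the encoded wine features.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

tree = DecisionTreeClassifier(random_state=0)
tree.fit(X_train, y_train)

# score() returns the mean accuracy on the given inputs and labels.
train_acc = tree.score(X_train, y_train)
test_acc = tree.score(X_test, y_test)
print("train accuracy: {:.3f}".format(train_acc))
print("test accuracy:  {:.3f}".format(test_acc))
```

A training accuracy far above the test accuracy points towards overfitting; low accuracy on both sets points towards underfitting.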

In [None]:
# Look at the documentation of your DecisionTree model
# and try to tune the hyperparameters: create other models
# with different values for max_depth, min_samples_split, max_features...
# Train them and evaluate their accuracy. What is the best accuracy
# you obtain?
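
A simple way to explore hyperparameters is a pair of nested loops. This is only a sketch on synthetic data (an assumption, not the wine dataset), and the candidate values are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (assumption): stands in for the encoded wine features.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Map each (max_depth, min_samples_split) pair to its test accuracy.
results = {}
for max_depth in [2, 4, 8, None]:
    for min_samples_split in [2, 10, 50]:
        model = DecisionTreeClassifier(
            max_depth=max_depth,
            min_samples_split=min_samples_split,
            random_state=0)
        model.fit(X_train, y_train)
        results[(max_depth, min_samples_split)] = model.score(X_test, y_test)

# Pick the pair with the highest test accuracy.
best_params = max(results, key=results.get)
print("best (max_depth, min_samples_split):", best_params)
print("best test accuracy: {:.3f}".format(results[best_params]))
```

Note that choosing hyperparameters on the test set makes the reported accuracy optimistic; a separate validation set or cross-validation is the usual remedy.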

In [None]:
# Use the feature_importances_ attribute of your best model. What are
# the three most important features to evaluate the quality of a wine?
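
Pairing `feature_importances_` with the column names makes the ranking easier to read. The sketch below assumes synthetic data and hypothetical column names (`feature_0` to `feature_5`); with the wine data, `X.columns` would hold the real property names.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with hypothetical column names (assumption).
X_values, y = make_classification(
    n_samples=500, n_features=6, n_informative=3, random_state=0)
feature_names = ["feature_{}".format(i) for i in range(6)]
X = pd.DataFrame(X_values, columns=feature_names)

model = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)

# One importance value per input column, in column order.
importances = pd.Series(model.feature_importances_, index=X.columns)
top3 = importances.sort_values(ascending=False).head(3)
print(top3)
```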

## Predicting wine quality with random forests

We saw in the course (and in this example) that decision trees can easily overfit. To reduce this risk, we can use a random forest instead. A random forest is a collection of decision trees, where each tree is trained differently. The prediction of the forest is then the average prediction of all its trees (for regression) or their most frequent prediction (for classification).
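
To make the aggregation concrete, the sketch below rebuilds a forest's prediction from its individual trees. It assumes scikit-learn's `RandomForestClassifier`, which is documented to average the class probabilities of its trees rather than count hard votes, and it uses synthetic data in place of the wine dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data (assumption): stands in for the encoded wine features.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

forest = RandomForestClassifier(n_estimators=20, random_state=0).fit(X, y)

# Average the class probabilities predicted by each individual tree.
tree_probas = np.array([t.predict_proba(X) for t in forest.estimators_])
mean_proba = tree_probas.mean(axis=0)

# The predicted class is the one with the highest averaged probability.
manual_pred = forest.classes_[mean_proba.argmax(axis=1)]

# Compare the manual aggregation with the forest's own output.
same_proba = np.allclose(mean_proba, forest.predict_proba(X))
same_pred = (manual_pred == forest.predict(X)).mean()
print("probabilities match:", same_proba)
print("fraction of identical predictions:", same_pred)
```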

In [None]:
# Use a RandomForest composed of 20 decision trees and
# train it on your data. Evaluate its accuracy. Do you see
# an improvement?
from sklearn.ensemble import RandomForestClassifier
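
A minimal sketch of training and scoring a forest of 20 trees, again on synthetic data (an assumption); with your own splits, only the `fit` and `score` calls are needed.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data (assumption): stands in for the encoded wine features.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# n_estimators sets the number of decision trees in the forest.
forest = RandomForestClassifier(n_estimators=20, random_state=0)
forest.fit(X_train, y_train)

forest_train_acc = forest.score(X_train, y_train)
forest_test_acc = forest.score(X_test, y_test)
print("train accuracy: {:.3f}".format(forest_train_acc))
print("test accuracy:  {:.3f}".format(forest_test_acc))
```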

In [None]:
# Train other random forest classifiers with different
# hyperparameters (n_estimators, max_features). Can you beat
# the best accuracy you obtained with a single decision tree?