## Decision Tree Classifier on Wine Dataset using CART Algorithm
This project is one of my Machine Learning mini projects. It uses a [wine](https://github.com/richardcsuwandi/datasets/blob/master/wine.csv) dataset that records the quality of each wine along with the properties that affect its quality. The columns in the dataset are:

1. fixed acidity: most acids involved with wine are fixed or nonvolatile (do not evaporate readily)
2. volatile acidity: the amount of acetic acid in wine, which at high levels can lead to an unpleasant, vinegar taste
3. citric acid: found in small quantities, citric acid can add 'freshness' and flavor to wines
4. residual sugar: the amount of sugar remaining after fermentation stops; it's rare to find wines with less than 1 gram/liter, and wines with more than 45 grams/liter are considered sweet
5. chlorides: the amount of salt in the wine
6. free sulfur dioxide: the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine
7. total sulfur dioxide: the amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine
8. density: the density of wine is close to that of water, depending on the percent alcohol and sugar content
9. pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale
10. sulphates: a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant
11. alcohol: the percent alcohol content of the wine
12. quality: output variable (based on sensory data, score between 0 and 10)

The goal of this project is to create a Decision Tree Classifier model to classify whether the quality of the wine is above 6 or not using the CART algorithm.
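For intuition about what CART does at each node, here is a minimal sketch of a Gini-based split search on a single feature. The helper names `gini` and `best_threshold` and the toy values are made up for illustration; scikit-learn's `DecisionTreeClassifier` runs an optimized version of this idea across all features.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum(p_k^2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_threshold(values, labels):
    """Return (threshold, weighted_gini) for the best binary split on one feature."""
    pairs = sorted(zip(values, labels))
    best = (None, gini(labels))  # start from the impurity of the unsplit node
    for i in range(1, len(pairs)):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # no threshold can separate identical values
        threshold = (lo + hi) / 2  # midpoint between neighbouring values
        left = [y for x, y in pairs if x <= threshold]
        right = [y for x, y in pairs if x > threshold]
        # Impurity of the split: size-weighted average of the two children
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if score < best[1]:
            best = (threshold, score)
    return best

# Toy, made-up data: alcohol content and a binary "quality above 6" label
alcohol = [9.2, 9.5, 10.1, 11.3, 11.7, 11.8, 12.1]
good = [0, 0, 0, 0, 1, 1, 1]

threshold, impurity = best_threshold(alcohol, good)
print(threshold, impurity)
```

In this toy example a single threshold separates the two classes perfectly, so the weighted impurity of the best split drops to zero. On real data CART applies the same search recursively to each child node.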

In [1]:
# Importing the libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

## Loading the Data

In [2]:
wine = pd.read_csv('wine.csv', sep=', ', engine='python')  # The data is separated by ', '
wine.head()

|   | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 6.2 | 0.56 | 0.09 | 1.7 | 0.053 | 24.0 | 32.0 | 0.99402 | 3.54 | 0.6 | 11.3 | 5 |
| 1 | 11.3 | 0.34 | 0.45 | 2.0 | 0.082 | 6.0 | 15.0 | 0.9988 | 2.94 | 0.66 | 9.2 | 6 |
| 2 | 8.6 | 0.42 | 0.39 | 1.8 | 0.068 | 6.0 | 12.0 | 0.99516 | 3.35 | 0.69 | 11.7 | 8 |
| 3 | 8.5 | 0.28 | 0.35 | 1.7 | 0.061 | 6.0 | 15.0 | 0.99524 | 3.3 | 0.74 | 11.8 | 7 |
| 4 | 7.7 | 0.23 | 0.37 | 1.8 | 0.046 | 23.0 | 60.0 | 0.9971 | 3.41 | 0.71 | 12.1 | 6 |


## Preprocessing the Data
Since we only care whether the quality of a wine is above 6 or not, we can convert quality scores above 6 to 1 and all other scores to 0.

In [21]:
wine['quality'] = wine['quality'].apply(lambda x: 1 if x > 6 else 0)
wine['quality'].head()

0    0
1    0
2    1
3    1
4    0
Name: quality, dtype: int64
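As an aside, the same binarization can be written as a vectorized comparison, and it is worth checking how many samples end up in each class. This is a small sketch using only the five quality scores shown in the `wine.head()` output, not the full dataset; the names `binary` and `counts` are illustrative.

```python
import pandas as pd

# The five quality scores from wine.head() above (not the full dataset)
quality = pd.Series([5, 6, 8, 7, 6], name='quality')

# Vectorized equivalent of .apply(lambda x: 1 if x > 6 else 0)
binary = (quality > 6).astype(int)

# Number of samples per class, useful for spotting class imbalance
counts = binary.value_counts().sort_index()

print(binary.tolist())
print(counts.to_dict())
```

On the full dataset, `value_counts()` gives a quick view of how unevenly the two classes are represented, which matters when reading the accuracy scores later on.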

## Building the Model
Now we can build our Decision Tree Classifier model from the wine dataset.

In [22]:
# Declaring the features and the label
features = wine.drop('quality', axis=1)
label = wine['quality']

In [23]:
# Split the data into training and test sets, in an 80:20 ratio
X_train, X_test, y_train, y_test = train_test_split(features, label, test_size=0.2, random_state=1)
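When one class is much rarer than the other, a plain random split can leave the two sets with different class ratios. The sketch below shows the `stratify` argument of `train_test_split`, which keeps the ratio the same in both sets. The data here is synthetic (an assumption for illustration), and the notebook itself does not use stratification.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1000 samples, 20% of them labelled 1
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = np.array([1] * 200 + [0] * 800)

# stratify=y preserves the 80/20 class ratio in both the train and test sets
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y)

print(len(y_tr), int(y_tr.sum()))
print(len(y_te), int(y_te.sum()))
```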

In [24]:
# Build and fit the model
clf = DecisionTreeClassifier(random_state=1)
clf.fit(X_train, y_train)

DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=None, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=1, splitter='best')

## Predictions and Evaluations

In [25]:
# Making predictions
pred = clf.predict(X_test)

In [26]:
# Measuring the accuracy of the model
acc = accuracy_score(y_test, pred)
acc

0.875

In [27]:
# Create a Confusion Matrix
matrix = pd.DataFrame(
        confusion_matrix(y_test, pred),
        columns=['Predicted 0', 'Predicted 1'],
        index=['Actual 0', 'Actual 1'])
matrix

|          | Predicted 0 | Predicted 1 |
|----------|-------------|-------------|
| Actual 0 | 170 | 12 |
| Actual 1 | 16 | 26 |


In [28]:
# Create a classification report
print(classification_report(y_test,pred))

              precision    recall  f1-score   support

           0       0.91      0.93      0.92       182
           1       0.68      0.62      0.65        42

    accuracy                           0.88       224
   macro avg       0.80      0.78      0.79       224
weighted avg       0.87      0.88      0.87       224
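The class 1 row of the report can be reproduced by hand from the four counts in the confusion matrix above. The variable names below are just for illustration.

```python
# Counts taken from the confusion matrix above
tn, fp = 170, 12   # actual 0: predicted 0, predicted 1
fn, tp = 16, 26    # actual 1: predicted 0, predicted 1

total = tn + fp + fn + tp

accuracy = (tp + tn) / total        # share of all predictions that are correct
precision = tp / (tp + fp)          # of the wines predicted 1, how many really are 1
recall = tp / (tp + fn)             # of the wines that really are 1, how many were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"accuracy={accuracy:.3f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
```

The accuracy is dominated by the 182 wines in class 0, so the lower precision and recall for class 1 are easy to miss if accuracy is the only number you look at.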



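Based on the results above, the model reaches an accuracy of 87.5%, but it does noticeably worse on the wines rated above 6 (class 1), with a recall of only 0.62. A tree with no depth limit (`max_depth=None`) also carries a risk of overfitting. Hence, we can try to improve the results by pruning the tree and tuning the parameters of the classifier.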
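One simple way to look for overfitting is to compare the accuracy on the training set with the accuracy on the test set. The sketch below does this on synthetic data generated with `make_classification`, an assumed stand-in with 11 features and roughly 20% positive labels, so its numbers say nothing about the wine dataset itself.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the wine data (an assumption, not the real dataset)
X, y = make_classification(n_samples=1000, n_features=11, n_informative=5,
                           weights=[0.8, 0.2], flip_y=0.05, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1)

# With no depth limit, the tree keeps splitting until its training leaves are pure
full_tree = DecisionTreeClassifier(random_state=1).fit(X_train, y_train)

train_acc = full_tree.score(X_train, y_train)
test_acc = full_tree.score(X_test, y_test)

print(f"depth={full_tree.get_depth()} train={train_acc:.3f} test={test_acc:.3f}")
```

A large gap between the two scores suggests the tree has memorized the training data, which is the usual motivation for limiting its depth.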

## Finding the Maximum Depth
Here, we are going to prune our tree by tuning the maximum depth parameter. We will try every depth from 1 to 29 and keep the one that gives the highest accuracy on the test set.

In [29]:
# Define a helper function to find the optimal depth

def get_max_depth():
    """Returns the smallest depth that reaches the highest test accuracy"""
    best_depth = None
    best_acc = 0
    max_range = 30

    # Fit the model with each depth from 1 to max_range - 1
    for depth in range(1, max_range):
        clf = DecisionTreeClassifier(max_depth=depth, random_state=42)
        clf.fit(X_train, y_train)
        pred = clf.predict(X_test)
        acc = accuracy_score(y_test, pred)
        # Only a strict improvement replaces the current best depth
        if acc > best_acc:
            best_acc = acc
            best_depth = depth

    return best_depth

In [30]:
best_depth = get_max_depth()
best_depth

12
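The helper above scores each candidate depth on the test set, so the test accuracy reported afterwards is no longer an independent estimate. A common alternative is to choose the depth by cross-validation on the training data only and to use the test set once at the end. The sketch below uses `GridSearchCV` for this on synthetic data (an assumed stand-in for the wine data); the name `cv_best_depth` is illustrative, and this is not the method used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the wine data (an assumption, not the real dataset)
X, y = make_classification(n_samples=600, n_features=11, n_informative=5,
                           weights=[0.8, 0.2], random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y)

# 5-fold cross-validation over depths 1..29, using the training data only
search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid={'max_depth': list(range(1, 30))},
    cv=5,
    scoring='accuracy',
)
search.fit(X_train, y_train)

cv_best_depth = search.best_params_['max_depth']
# The test set is touched once, after the depth has been chosen
test_acc = search.score(X_test, y_test)

print(cv_best_depth, round(test_acc, 3))
```

By default `GridSearchCV` refits the best model on the whole training set, so `search` can be used directly for prediction afterwards.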

## New Predictions and Results

In [31]:
# Build and fit the model based on the tuned parameters
clf = DecisionTreeClassifier(max_depth=best_depth, random_state=42)
clf.fit(X_train, y_train)

DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=12, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=42, splitter='best')

In [32]:
# Making predictions
new_pred = clf.predict(X_test)

In [33]:
# Measuring the accuracy of the model
acc = accuracy_score(y_test, new_pred)
acc

0.8928571428571429

In [34]:
# Create a Confusion Matrix
matrix = pd.DataFrame(
        confusion_matrix(y_test, new_pred),
        columns=['Predicted 0', 'Predicted 1'],
        index=['Actual 0', 'Actual 1'])
matrix

|          | Predicted 0 | Predicted 1 |
|----------|-------------|-------------|
| Actual 0 | 172 | 10 |
| Actual 1 | 14 | 28 |


In [35]:
# Create a classification report
print(classification_report(y_test,new_pred))

              precision    recall  f1-score   support

           0       0.92      0.95      0.93       182
           1       0.74      0.67      0.70        42

    accuracy                           0.89       224
   macro avg       0.83      0.81      0.82       224
weighted avg       0.89      0.89      0.89       224



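Based on the new results above, limiting the tree to a depth of 12 has made the model a little better: the accuracy rose from 0.875 to about 0.893, and the recall for wines rated above 6 rose from 0.62 to 0.67. Since the depth was chosen by its accuracy on the same test set, the improvement may be somewhat optimistic.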