# Objective

The objective of this project is to predict the quality of wine using the concepts learned in DSA5841 Learning from Data: Decision Trees. The Wine Quality dataset consists of red wine samples. The inputs include objective tests (e.g. pH values) and the output is based on sensory data (median of at least 3 evaluations made by wine experts). Each expert graded the wine quality between 0 (very bad) and 10 (very excellent).

The dataset input variables (based on physicochemical tests) are:

1. fixed acidity (tartaric acid - g / dm^3)
2. volatile acidity (acetic acid - g / dm^3)
3. citric acid (g / dm^3)
4. residual sugar (g / dm^3)
5. chlorides (sodium chloride - g / dm^3)
6. free sulfur dioxide (mg / dm^3)
7. total sulfur dioxide (mg / dm^3)
8. density (g / cm^3)
9. pH
10. sulphates (potassium sulphate - g / dm^3)
11. alcohol (% by volume)


The output variable (based on sensory data) is:

12. quality (score between 0 and 10)

In [None]:
%pip install pydotplus

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import rcParams
import seaborn as sb

from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor, GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn import metrics

from sklearn.tree import export_graphviz
from io import StringIO  # standard library; avoids the extra dependency on six
from IPython.display import Image  
import pydotplus

# The 'seaborn-darkgrid' style was renamed 'seaborn-v0_8-darkgrid' in Matplotlib 3.6
plt.style.use('seaborn-v0_8-darkgrid')

In [None]:
%matplotlib inline

# Loading Wine Quality Dataset

In [None]:
wine_df = pd.read_csv('../input/red-wine-quality-cortez-et-al-2009/winequality-red.csv')
wine_df.head(10)

# Exploratory Data Analysis

### Summary Statistics

In [None]:
wine_df.describe()

### Check for missing values

In [None]:
print(wine_df.isna().sum())

We check every column for missing values to see whether any records need to be removed or filled in. Since there are none, we proceed with the full dataset.
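As a small illustration of what this check reports when values are missing, the sketch below runs the same call on a made-up three-row frame. The frame `toy_df` and its values are hypothetical and are not taken from the wine dataset; the two remedies shown are common options rather than steps this project needed.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing pH value (not real data)
toy_df = pd.DataFrame({
    "pH": [3.51, np.nan, 3.26],
    "alcohol": [9.4, 9.8, 9.8],
})

# Count of missing entries per column
missing_per_column = toy_df.isna().sum()
print(missing_per_column)

# Two common remedies: drop incomplete rows, or fill gaps with the column median
dropped = toy_df.dropna()
filled = toy_df.fillna(toy_df.median())
```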

### Distribution of wine quality

In [None]:
rcParams["figure.figsize"] = [10, 8]
plt.hist(wine_df['quality'], bins=6, edgecolor='black')
plt.xlabel('quality', fontsize=20)
plt.ylabel('count', fontsize=20)
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.show()

We plot a histogram of the wine quality scores to check how they are distributed. The distribution is roughly bell-shaped, similar to a normal distribution, and is not heavily skewed to either side, so we proceed to use the data as is.
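The visual impression from a histogram can be backed up with numbers. The sketch below is a minimal example on hypothetical scores (the `scores` series is invented for illustration): it counts samples per score and computes the sample skewness, where values close to 0 suggest a roughly symmetric distribution.

```python
import pandas as pd

# Hypothetical quality scores, chosen only to illustrate the calls
scores = pd.Series([5, 5, 6, 6, 6, 5, 7, 4, 6, 5, 7, 8, 3, 5, 6])

# Number of samples per quality score, ordered by score
counts = scores.value_counts().sort_index()
print(counts)

# Sample skewness: values near 0 indicate a roughly symmetric distribution
skewness = scores.skew()
print(f"skewness = {skewness:.3f}")
```

On the real data, the same two calls could be applied to `wine_df['quality']`.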

# Approach

The qualities of the wines are scored on a scale of 0 to 10, which means the target consists of discrete values.

One way to frame it as a classification problem is to convert the wine quality into a binary variable. For example, wines with a quality score of 7 or more would be classified as good quality wine and wines with a quality score of less than 7 would be classified as bad quality wine. However, this approach is problematic because it does not differentiate between a wine with a quality score of 3 and a wine with a quality score of 6, when in reality the difference is noticeable to someone who tastes them.
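The loss of information from binarising can be seen in a few lines. This is a minimal sketch on hypothetical scores, using the threshold of 7 from the example above; the array `quality` is invented for illustration.

```python
import numpy as np

# Hypothetical quality scores for five wines
quality = np.array([3, 6, 7, 8, 5])

# Binary label using the threshold from the text: 1 = good (score >= 7), 0 = bad
is_good = (quality >= 7).astype(int)
print(is_good)  # [0 0 1 1 0]

# Scores 3 and 6 end up with the same label even though they differ by 3 points
same_label = is_good[0] == is_good[1]
```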

Framing it as a regression problem would mean the predictions made by the model are floating point numbers and not discrete values.

Hence, the approach taken in this project is to frame it as a regression problem and round each prediction to the nearest integer in order to obtain discrete values. The rounded predictions are then compared against the actual scores in the test set to obtain the accuracy.
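The round-then-compare step can be sketched as follows. The arrays `y_true` and `y_raw` are hypothetical values made up for the example. Note that `np.rint`, like Python's built-in `round`, sends exact .5 ties to the nearest even integer.

```python
import numpy as np

# Hypothetical true scores and raw regression outputs
y_true = np.array([5, 6, 7, 5, 6])
y_raw = np.array([5.2, 5.6, 6.4, 4.5, 6.5])

# Round to the nearest integer; exact .5 ties go to the nearest even integer
y_rounded = np.rint(y_raw).astype(int)
print(y_rounded)  # [5 6 6 4 6]

# Accuracy = fraction of rounded predictions that equal the true score
accuracy = np.mean(y_rounded == y_true)
print(accuracy)  # 0.6
```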

# Train/Test Split

The wine quality data is split into 70% for the training set and 30% for the test set.
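The notebook standardises the features before splitting. A common variation, sketched here on synthetic data, is to split first and fit the scaler on the training rows only, so that no statistics from the test set enter preprocessing. The names `X_demo`, `y_demo`, `X_tr` and `X_te` are placeholders for this example, and the data is randomly generated rather than taken from the wine dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix and quality scores (not the wine data)
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=10.0, scale=2.0, size=(100, 3))
y_demo = rng.integers(3, 9, size=100)

# Step 1: 70/30 split
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=200
)

# Step 2: learn the scaling parameters from the training rows only
scaler = StandardScaler().fit(X_tr)
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)

print(X_tr.shape, X_te.shape)  # (70, 3) (30, 3)
```

For tree-based models the difference is mostly one of good practice, since their splits do not depend on the scale of the features.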

In [None]:
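# Separate the 11 physicochemical features from the target
X = wine_df.drop('quality', axis=1).values
# Standardise the features (tree-based models do not depend on feature scale, so this is optional)
X = StandardScaler().fit_transform(X)
y = wine_df['quality'].values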

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=200)

# Fitting the Decision Tree

## Decision Tree Regressor

In [None]:
reg = DecisionTreeRegressor(random_state=200)
reg = reg.fit(X_train, y_train)
y_pred = reg.predict(X_test)
# Round each prediction to the nearest integer quality score
y_pred = np.rint(y_pred).astype(int)

In [None]:
# Evaluating the Model
print('Accuracy:', metrics.accuracy_score(y_test, y_pred))

In [None]:
dot_data = StringIO()
export_graphviz(reg, out_file=dot_data,  
                filled=True, rounded=True,
                special_characters=True,
                feature_names = wine_df.drop('quality', axis=1).columns)
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())  
graph.write_png('wine_quality.png')
Image(graph.create_png())

## Bagging

In [None]:
bag_reg = BaggingRegressor(random_state=200)
bag_reg = bag_reg.fit(X_train, y_train)
y_pred_bag = bag_reg.predict(X_test)
# Round each prediction to the nearest integer quality score
y_pred_bag = np.rint(y_pred_bag).astype(int)

In [None]:
# Evaluating the Model
print('Accuracy:', metrics.accuracy_score(y_test, y_pred_bag))

## Boosting

In [None]:
boost_reg = GradientBoostingRegressor(random_state=200)
boost_reg = boost_reg.fit(X_train, y_train)
y_pred_boost = boost_reg.predict(X_test)
# Round each prediction to the nearest integer quality score
y_pred_boost = np.rint(y_pred_boost).astype(int)

In [None]:
# Evaluating the Model
print('Accuracy:', metrics.accuracy_score(y_test, y_pred_boost))

## Random Forest

In [None]:
rf_reg = RandomForestRegressor(random_state=200)
rf_reg = rf_reg.fit(X_train, y_train)
y_pred_rf = rf_reg.predict(X_test)
# Round each prediction to the nearest integer quality score
y_pred_rf = np.rint(y_pred_rf).astype(int)

In [None]:
# Evaluating the Model
print('Accuracy:', metrics.accuracy_score(y_test, y_pred_rf))

# Evaluation of Results

| Model         | Accuracy          |
|---------------|-------------------|
| Decision Tree | 62.71%            |
| Bagging       | 64.79%            |
| Boosting      | 66.04%            |
| Random Forest | 71.25%            |

From the results above, at the baseline with no parameter tuning, Random Forest gives the highest accuracy (71.25%). The Random Forest model is therefore selected, and its parameters are tuned in the following sections.
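A comparison like this one can also be produced in a single loop. The sketch below is an illustration on synthetic data, not a re-run of the experiment: `X_demo` and `y_demo` are generated placeholders, and the mean absolute error is reported as an additional, assumed metric because, unlike exact-match accuracy, it distinguishes a prediction that is off by one point from one that is off by three.

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor, GradientBoostingRegressor, RandomForestRegressor
from sklearn.metrics import accuracy_score, mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic integer-valued target standing in for the quality score (not the wine data)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 4))
y_demo = np.clip(np.rint(6 + X_demo[:, 0] - 0.5 * X_demo[:, 1]), 3, 8).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=200)

models = {
    "Decision Tree": DecisionTreeRegressor(random_state=200),
    "Bagging": BaggingRegressor(random_state=200),
    "Boosting": GradientBoostingRegressor(random_state=200),
    "Random Forest": RandomForestRegressor(random_state=200),
}

results = {}
for name, model in models.items():
    # Fit, predict, and round to the nearest integer score
    pred = np.rint(model.fit(X_tr, y_tr).predict(X_te)).astype(int)
    results[name] = {
        "accuracy": accuracy_score(y_te, pred),
        "mae": mean_absolute_error(y_te, pred),
    }
    print(f"{name:<14} accuracy={results[name]['accuracy']:.2%}  MAE={results[name]['mae']:.3f}")
```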

# Feature Importances

In [None]:
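# Use the feature columns only; 'quality' is the target and has no importance value
keys = wine_df.drop('quality', axis=1).columns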
values = rf_reg.feature_importances_
var_imp = dict(zip(keys, values))
var_imp = dict(sorted(var_imp.items(), key=lambda x: x[1]))

rcParams["figure.figsize"] = [10, 8]
plt.title('Feature Importances', fontsize=20)
plt.barh(list(var_imp.keys()), list(var_imp.values()))
plt.xlabel('Relative Importance', fontsize=20)
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.show()

From the Feature Importances plot above, no feature has a relative importance low enough to justify discarding it. Hence, all features are kept from this point forward.
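For reference, here is a minimal sketch of how importances pair up with feature names, using synthetic data with hypothetical column names (`a`, `b`, `noise`). The impurity-based importances reported by scikit-learn sum to 1, so each value reads as a share of the total.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic data with hypothetical column names; only 'a' and 'b' drive the target
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(200, 3)), columns=["a", "b", "noise"])
y_demo = 3 * X_demo["a"] + X_demo["b"] + rng.normal(scale=0.1, size=200)

forest = RandomForestRegressor(n_estimators=50, random_state=200).fit(X_demo, y_demo)

# Pair each importance with its feature name and sort in ascending order
importances = pd.Series(forest.feature_importances_, index=X_demo.columns).sort_values()
print(importances)
```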

# Tuning Parameters

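The parameters that will be tuned for the Random Forest model are as follows:
- Ntree (`n_estimators` in scikit-learn): Number of trees to grow.
- Mtry (`max_features` in scikit-learn): Number of variables randomly sampled as candidates at each split.

For each round of tuning, the Out-of-Bag (OOB) error is calculated to determine the best value for the parameter. Here it is taken as 1 minus the OOB score, which for a regressor is the $R^2$ computed on the out-of-bag samples.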
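To make the OOB quantity concrete, the sketch below fits one forest on synthetic data (the arrays `X_demo` and `y_demo` are generated placeholders) and reads off the score. For `RandomForestRegressor`, `oob_score_` is an $R^2$ value, so one minus it is the fraction of variance left unexplained on the out-of-bag samples rather than a misclassification rate.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data (not the wine data)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = 2 * X_demo[:, 0] + X_demo[:, 1] + rng.normal(scale=0.5, size=200)

# oob_score=True scores each training row using only the trees
# whose bootstrap sample left that row out
forest = RandomForestRegressor(n_estimators=100, oob_score=True, random_state=200)
forest.fit(X_demo, y_demo)

# For a regressor, oob_score_ is an R^2 value
oob_error = 1 - forest.oob_score_
print(f"OOB R^2 = {forest.oob_score_:.3f}, OOB error = {oob_error:.3f}")
```

Because the OOB samples act as a built-in validation set, this tuning does not need to touch the test set.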

## Number of trees

In [None]:
oob_error_ntrees = []

for i in range(50,401):
    rf_reg_ntrees = RandomForestRegressor(n_estimators=i, oob_score=True, random_state=200)
    rf_reg_ntrees.fit(X_train, y_train)
    oob_error_ntrees.append(1 - rf_reg_ntrees.oob_score_)

In [None]:
rcParams["figure.figsize"] = [10, 8]
plt.title('Tuning number of trees', fontsize=22)
plt.plot(range(50, 401), oob_error_ntrees)
plt.xlabel('No. of trees', fontsize=20)
plt.ylabel('OOB Error', fontsize=20)
plt.xticks(range(50, 401, 50), fontsize=15)
plt.yticks(fontsize=15)
plt.show()

In [None]:
#Finding number of trees for minimum OOB Error
ntrees = oob_error_ntrees.index((min(oob_error_ntrees))) + 50
print("Number of trees for min OOB Error:", ntrees)

## Number of variables randomly sampled

In [None]:
oob_error_mtry = []

for j in range(1,12):
    rf_reg_mtry = RandomForestRegressor(max_features=j, oob_score=True, random_state=200)
    rf_reg_mtry.fit(X_train, y_train)
    oob_error_mtry.append(1 - rf_reg_mtry.oob_score_)

In [None]:
rcParams["figure.figsize"] = [10, 8]
plt.title('Tuning number of variables', fontsize=22)
plt.plot(range(1, 12), oob_error_mtry, marker='o')
plt.xlabel('No. of variables', fontsize=20)
plt.ylabel('OOB Error', fontsize=20)
plt.xticks(range(1, 12), fontsize=15)
plt.yticks(fontsize=15)
plt.show()

In [None]:
# Finding number of variables for minimum OOB Error
mtry = oob_error_mtry.index((min(oob_error_mtry))) + 1
print("Number of variables for min OOB Error:", mtry)

# Final Model

From the testing done above, we obtain the optimum parameter values of $ntrees=376$ and $mtry=5$, which are inserted into the final model as follows.

In [None]:
rf_reg_final = RandomForestRegressor(n_estimators=ntrees, max_features=mtry, random_state=200)
rf_reg_final = rf_reg_final.fit(X_train, y_train)
y_pred_final = rf_reg_final.predict(X_test)
# Round each prediction to the nearest integer quality score
y_pred_final = np.rint(y_pred_final).astype(int)

In [None]:
# Evaluating the Model
print('Accuracy:', metrics.accuracy_score(y_test, y_pred_final))

After tuning the parameters, accuracy increases from 71.25% to 72.50%, a gain of 1.25 percentage points. This can be considered quite good given the non-standard approach of scoring a regressor's rounded predictions as exact matches.

# Conclusion

In this particular case, the Random Forest ensemble method performed the best among all the models considered. After further tuning, it predicted 72.50% of the test set correctly.

To obtain an even higher accuracy, other parameters of the model could be tuned, such as the maximum depth of each tree (`max_depth`) or the maximum number of leaf nodes (`max_leaf_nodes`).
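As a rough sketch of what such further tuning might look like, the loop below applies the same OOB-error approach to the maximum tree depth. It runs on synthetic data, and the candidate depths are arbitrary illustrative values, not recommendations for the wine dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data (not the wine data)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = 2 * X_demo[:, 0] + X_demo[:, 1] ** 2 + rng.normal(scale=0.5, size=200)

# Candidate depths are arbitrary illustrative values; None means unlimited depth
candidate_depths = [2, 4, 8, None]
oob_error_depth = {}
for depth in candidate_depths:
    forest = RandomForestRegressor(
        n_estimators=100, max_depth=depth, oob_score=True, random_state=200
    )
    forest.fit(X_demo, y_demo)
    oob_error_depth[depth] = 1 - forest.oob_score_

# Pick the depth with the lowest OOB error
best_depth = min(oob_error_depth, key=oob_error_depth.get)
print("max_depth with the lowest OOB error:", best_depth)
```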