In [None]:
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

This notebook works with the red wine quality dataset.

In [None]:
#read the data, check if there are missing values
red_wine = pd.read_csv('../input/red-wine-quality-cortez-et-al-2009/winequality-red.csv')
red_wine.isnull().sum()

***Inspection of the data***

There is *no missing data* in this dataset. 

No obvious pair of features is very highly correlated. Because the scale of the data does not affect the performance of a random forest, no scaling is performed.

The graph below is a pair plot of the first 5 variables in the data. Fixed acidity and citric acid appear to have a positive linear relationship. Citric acid and volatile acidity appear to have a negative linear trend.

In [None]:
sns.pairplot(red_wine.iloc[:,0:5])

No pair of features has a very high correlation.

In [None]:
plt.subplots(figsize=(8,8))
sns.heatmap(red_wine.corr(), cmap='BrBG', annot=True)

Split the data into a training set and a test set, holding out 15% of the rows for testing.

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(red_wine.loc[:,red_wine.columns != 'quality'], 
                                                    red_wine['quality'], test_size=0.15, random_state=42)
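As a quick illustration of what `test_size=0.15` does, here is a minimal sketch on a synthetic 100-row array. The names `X_demo` and `y_demo` are hypothetical stand-ins, not the wine data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the wine data: 100 rows, 3 features (hypothetical values).
X_demo = np.arange(300).reshape(100, 3)
y_demo = np.arange(100)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.15, random_state=42
)

# With test_size=0.15, 15 of the 100 rows are held out and 85 remain for training.
print(X_tr.shape, X_te.shape)
```

Fixing `random_state` makes the split reproducible, so repeated runs compare models on the same held-out rows.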

**Hyperparameters in the RF Regressor: Try to avoid overfitting**

***max_depth*** is set to 20 to **avoid overfitting** in each tree. Capping the depth limits how many splits a tree can make on the training data; too many splits can lead to a model that does well on the training set but not on the test set.

***n_estimators*** is set to 80, lower than scikit-learn's current default of 100. The intent is again to **avoid overfitting**, although in a random forest adding trees mainly increases training time and tends to stabilise the predictions rather than overfit them.

***n_jobs*** is set to -1, which uses all processors.
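One way to check whether a depth limit matters is to compare the training and test MAE at several depths. The sketch below does this on synthetic data from `make_regression`, not on the wine data; the helper `mae_by_depth` and the candidate depths are assumptions made for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic regression data; NOT the wine dataset.
X_syn, y_syn = make_regression(n_samples=400, n_features=8, noise=20.0, random_state=0)
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_syn, y_syn, test_size=0.15, random_state=42
)

def mae_by_depth(depths):
    """Return {max_depth: (train MAE, test MAE)} for each candidate depth."""
    results = {}
    for depth in depths:
        forest = RandomForestRegressor(n_estimators=80, max_depth=depth, random_state=0)
        forest.fit(Xs_train, ys_train)
        results[depth] = (
            mean_absolute_error(ys_train, forest.predict(Xs_train)),
            mean_absolute_error(ys_test, forest.predict(Xs_test)),
        )
    return results

depth_results = mae_by_depth([2, 5, 20])
for depth, (train_mae, test_mae) in depth_results.items():
    print(f"max_depth={depth}: train MAE={train_mae:.2f}, test MAE={test_mae:.2f}")
```

A training MAE that keeps falling while the test MAE stalls or rises is the usual sign that the extra depth is fitting noise.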


***No feature scaling***

No feature scaling is applied, because the scale of the data does not affect the splits or the performance of the decision trees in the forest.
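The scale-invariance claim can be checked directly. The sketch below fits two forests with the same seed, one on raw features and one on rescaled features, and compares their predictions. The data is synthetic and the variable names are hypothetical; integer-valued features and power-of-two scale factors are used so the floating-point arithmetic stays exact.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
# Integer-valued synthetic features (hypothetical data, not the wine dataset).
X_raw = rng.integers(0, 100, size=(200, 3)).astype(float)
y_raw = 2.0 * X_raw[:, 0] - X_raw[:, 1] + rng.normal(0, 1, size=200)

# Rescale each column by a different positive constant.
X_scaled = X_raw * np.array([1024.0, 0.5, 8.0])

forest_raw = RandomForestRegressor(n_estimators=20, random_state=0).fit(X_raw, y_raw)
forest_scaled = RandomForestRegressor(n_estimators=20, random_state=0).fit(X_scaled, y_raw)

# Tree splits depend only on the ordering of each feature, so both forests
# should make the same predictions on their respective inputs.
same_predictions = np.allclose(forest_raw.predict(X_raw), forest_scaled.predict(X_scaled))
print(same_predictions)
```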

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
#Build a random forest on the given data and return the model, the rounded
#predictions for the test and training sets, and the MAE on each set
def build_forest(X_train,y_train,X_test,y_test):
    model = RandomForestRegressor(n_estimators= 80,max_depth=20, random_state=0,n_jobs = -1)
    model.fit(X_train,y_train)
    #test predictions, rounded to the nearest integer quality score
    rfr_test_pred = np.around(model.predict(X_test))
    #train predictions, rounded the same way
    rfr_train_pred = np.around(model.predict(X_train))
    #MAE calculation
    mae_test = mean_absolute_error(y_test,rfr_test_pred)
    mae_train = mean_absolute_error(y_train,rfr_train_pred)
    
    return model, rfr_test_pred,rfr_train_pred,mae_test,mae_train

rfr, rfr_test_pred,rfr_train_pred,mae_test,mae_train = build_forest(X_train,y_train,X_test,y_test)

print('The MAE for the training set is ',round(mae_train,3))
print('The MAE for the test set is ',round(mae_test,3))
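Because `build_forest` rounds the predictions before scoring, the reported MAE is measured on whole quality scores. A small worked example with made-up numbers (the arrays below are hypothetical, not model output) shows how rounding changes the MAE.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Hypothetical actual quality scores and raw (unrounded) forest outputs.
y_actual = np.array([5, 6, 7, 5])
raw_pred = np.array([5.4, 5.6, 6.2, 4.4])

rounded_pred = np.around(raw_pred)  # -> [5., 6., 6., 4.]
mae_rounded = mean_absolute_error(y_actual, rounded_pred)
mae_raw = mean_absolute_error(y_actual, raw_pred)

print(mae_rounded)  # absolute errors are 0, 0, 1, 1, so the mean is 0.5
print(mae_raw)      # absolute errors are 0.4, 0.4, 0.8, 0.6, so the mean is about 0.55

# np.around rounds exact halves to the nearest even integer.
print(np.around([4.5, 5.5, 6.5]))  # -> [4. 6. 6.]
```

The rounded MAE can be lower or higher than the unrounded one, so the two should not be compared directly across notebooks.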

**Explanation of the graphs below: (how good the model is)**

The plots below show the actual wine quality against the predicted values. The closer the data points are to the straight line, the more accurate the prediction is.

The model does better on the training set than on the test set.

Because the predictions are rounded to whole quality scores, the points line up in bands: many wines receive the same predicted value.

The predictions on the test set are not as good as on the training set, since the data points lie farther from the middle line. There are also only 3 unique predicted values in the test set.

To sum up, the model does fairly well: the predicted values are close to the actual values.

In [None]:
#Actual vs predicted values
def fitted_vs_actual(model,X_test,X_train,y_true_test,y_true_train,titles):
    #Make predictions from the fitted model, rounded to integers
    y_pred_test = np.around(model.predict(X_test))
    y_pred_train = np.around(model.predict(X_train))
    #plotting
    fig, axes = plt.subplots(1, 2,figsize=(15,5))
    colors = ['green','#01B6B7']
    y_pred = [y_pred_test,y_pred_train]
    y_true = [y_true_test,y_true_train]
    for i in range(2):
        sns.regplot(x = y_pred[i],y = y_true[i], color=colors[i],ax = axes[i])
        axes[i].title.set_text(titles[i])
        axes[i].set(ylabel='actual values',xlabel='predicted values')

In [None]:
fitted_vs_actual(rfr,X_test,X_train,y_test,y_train,['Test Data','Training Data'])

The distributions of the raw error are similar in the training set and the test set.

In [None]:
def dis_error(model,X_test,X_train,y_test,y_train):
    fig, axes = plt.subplots(1, 2,figsize=(18,5))

    #Rounded predictions; the raw error is prediction minus actual
    y_pred_test = np.around(model.predict(X_test))
    y_pred_train = np.around(model.predict(X_train))
    
    sns.countplot(x = (y_pred_test-y_test).to_numpy(),ax=axes[0])
    axes[0].title.set_text('Test set raw error')
    axes[0].set_xlabel('Raw error')
    axes[0].set_ylabel('Frequency')

    sns.countplot(x=(y_pred_train-y_train).to_numpy(),ax=axes[1])
    axes[1].title.set_text('Training set raw error')
    axes[1].set_xlabel('Raw error')
    axes[1].set_ylabel('Frequency')

In [None]:
dis_error(rfr,X_test,X_train,y_test,y_train)
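The counts behind the bar charts can also be read as a table. The sketch below tallies the raw error (prediction minus actual) for a few made-up values; `y_actual_demo` and `y_rounded_demo` are hypothetical and not taken from the model.

```python
import numpy as np
import pandas as pd

# Hypothetical actual scores and rounded predictions.
y_actual_demo = pd.Series([5, 6, 7, 5, 6, 4])
y_rounded_demo = np.array([5.0, 6.0, 6.0, 6.0, 6.0, 5.0])

# Raw error = prediction - actual; a positive value is an overestimate.
raw_error = (y_rounded_demo - y_actual_demo).astype(int)
error_counts = raw_error.value_counts().sort_index()
print(error_counts)
```

Dividing the counts by `len(raw_error)` gives proportions, which are easier to compare between a large training set and a small test set.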

# Feature importances

Retain only those features which have importance values above 5%.

The random forest model has an attribute that stores the feature importances. These are impurity-based feature importances.

The higher the value, the more important the feature.

The importance of a feature is computed as the (normalized) total reduction of the criterion brought by that feature. It is also known as the Gini importance.

SelectFromModel in sklearn can do this task. Given the estimator and a threshold value for feature selection as parameters, it reports whether each individual feature's importance reaches the threshold.
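To make the thresholding rule concrete, here is a minimal sketch on synthetic data that applies the same cut-off by hand. The column names and the data are assumptions for illustration; only the two `signal_*` columns drive the target.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
# Synthetic data with hypothetical column names.
X_imp = pd.DataFrame(
    rng.normal(size=(300, 4)),
    columns=["signal_a", "signal_b", "noise_a", "noise_b"],
)
y_imp = 3.0 * X_imp["signal_a"] + 2.0 * X_imp["signal_b"] + rng.normal(0, 0.1, size=300)

forest_imp = RandomForestRegressor(n_estimators=80, random_state=0).fit(X_imp, y_imp)

# Impurity-based importances, labelled by column and sorted for reading.
importances = pd.Series(forest_imp.feature_importances_, index=X_imp.columns)
print(importances.sort_values(ascending=False))

# Same rule SelectFromModel applies with threshold=0.05:
# keep a feature when its importance is greater than or equal to 0.05.
kept = importances[importances >= 0.05].index.tolist()
print(kept)
```

Printing the sorted importances alongside the kept list shows how close each dropped feature was to the threshold.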

In [None]:
from sklearn.feature_selection import SelectFromModel
#prefit is True since we have fitted the model
selector = SelectFromModel(estimator=rfr,prefit=True,threshold=0.05)
#Get the columns name with feature importance above threshold
feature_selected = np.array(X_train.columns)[selector.get_support()]
#Get the columns name with feature importance below threshold
removed_feature= np.array(X_train.columns)[~selector.get_support()]

print('The features with importance values higher than 0.05 are:')
print('\n')
print(', '.join([str(feature ) for feature in feature_selected]))
print('\n')
print('The total number of features that have been removed is', len(X_train.columns)-len(feature_selected),'.')
print('\n')
print('The features removed are',', '.join([str(feature ) for feature in removed_feature]),'.')
print('\n')
print('The total feature importance value that is retained after the dimension reduction step is',round(np.sum(selector.estimator.feature_importances_[selector.get_support()]),3))


Citric acid does not provide much information for splitting the data: its feature importance is about 0.044. We will drop it in the following section.

# Refit the model with selected features

Use the selected features to build a new random forest regressor. The hyperparameters are the same as before, so the two models can be compared with the training data as the only difference.

Subset the training and test sets so that only the features with importance values higher than 0.05 are kept.

In [None]:
#subset the features
reduced_X_train = X_train.loc[:,feature_selected]
reduced_X_test = X_test.loc[:,feature_selected]

rfr, rfr_test_pred,rfr_train_pred,mae_test,mae_train = build_forest(reduced_X_train,y_train,reduced_X_test,y_test)

print('The MAE for the training set on the model with feature selection is ',round(mae_train,3))
print('The MAE for the test set on the model with feature selection is ',round(mae_test,3))

The distributions of the raw error are similar in the training set and the test set. The model predicts the quality exactly in most cases. However, the model does better on the training set (graph on the right), especially for raw errors of -1 and 1, compared with the test set. There are also some errors of 2 and -2 in the test set but not in the training set.

In general, the model performs better in the training set, which is expected.

After refitting, no feature has an importance value smaller than 0.05, so no further feature needs to be dropped. Because the importances of a fitted forest are normalized to sum to 1, the total retained feature importance is 1.

In [None]:
selector = SelectFromModel(estimator=rfr,prefit=True,threshold=0.05)
feature_selected = np.array(reduced_X_train.columns)[selector.get_support()]
removed_feature= np.array(reduced_X_train.columns)[~selector.get_support()]

print('These are the features with importance values higher than 0.05')
print('\n')
print(', '.join([str(feature) for feature in feature_selected]))
print('\n')
print('The total number of features that have been removed is', len(reduced_X_train.columns)-len(feature_selected),'.')
print('\n')
print('The total feature importance value that is retained after the dimension reduction step is',np.sum(selector.estimator.feature_importances_[selector.get_support()]))

**Explanation of the graphs below: (how good the model is)**

The predictions are fairly good on the training set: the data points are close to the line in the middle, which means they are close to the true values. In contrast, the predictions on the test set are not as good, since the data points lie farther from the middle line. There are also only 3 unique predicted values in the test set.

In [None]:
fitted_vs_actual(rfr,reduced_X_test,reduced_X_train,y_test,y_train,['Test Data','Training Data'])

In [None]:
dis_error(rfr,reduced_X_test,reduced_X_train,y_test,y_train)

The distributions of the raw error are similar in the training set and the test set. The model predicts the quality exactly in most cases. However, the model does better on the training set (graph on the right): the number of raw errors of -1 and 1 in the training set is much lower than in the test set.

Additionally, there are some errors of 2 and -2 (overestimates and underestimates) in the test set but not in the training set.

In general, the model performs better in the training set, which is expected.