## GradientBoostingClassifier from SAS® Viya® on Wine Quality 

### Source
This example has been adapted from [Cross Validation and Grid Search for Model Selection in Python](https://stackabuse.com/cross-validation-and-grid-search-for-model-selection-in-python/) by Usman Malik.

### About the [Red Wine Dataset](https://archive.ics.uci.edu/dataset/186/wine+quality)

The red wine dataset is part of the publicly available Wine Quality dataset, which is used for research purposes. It was created by Paulo Cortez, Antonio Cerdeira, Fernando Almeida, Telmo Matos, and Jose Reis in 2009. The full dataset consists of physicochemical properties and sensory data of red and white variants of Portuguese "Vinho Verde" wine; this example uses only the red wine data.

In [None]:
import os
import pandas as pd
import numpy as np

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.metrics import classification_report, confusion_matrix

from sasviya.ml.tree import GradientBoostingClassifier

import matplotlib.pyplot as plt

import warnings
from sklearn.exceptions import UndefinedMetricWarning
# Suppress the warning
warnings.filterwarnings("ignore", category=UndefinedMetricWarning)

In [None]:
workspace=f'{os.path.abspath("")}/../data/'
dataset=pd.read_csv(workspace+'wineQualityReds.csv')
print(dataset.info())
dataset.head()

### Data Preprocessing

Extract the feature data (the first 11 columns) and the target data (the 12th column) from the dataset.

In [None]:
X = dataset.iloc[:, 0:11].values
y = dataset.iloc[:, 11].values

Use the `train_test_split` function to split the dataset into 80% training data and 20% test data.

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

### Scaling the Data

If you examine the dataset, you will notice that it is not well-scaled. For example, the "volatile acidity" and "citric acid" columns have values between 0 and 1, whereas most of the other columns have higher values. Hence, before training the algorithm, we will need to scale our data down.

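Here, we will use `StandardScaler` to standardize the features by centering them around the mean and scaling them to unit variance. The scaler is fit on the training data only, and the same transformation is then applied to the test data.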

In [None]:
feature_scaler = StandardScaler()
X_train = feature_scaler.fit_transform(X_train)
X_test = feature_scaler.transform(X_test)
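
To make the standardization step concrete, the following is a minimal pure-Python sketch of the same idea for a single column. The helper names `fit_scaler` and `transform` are hypothetical and the values are toy numbers, not wine data; the sketch assumes the population standard deviation (dividing by n), which is what `StandardScaler` uses.

```python
import statistics

def fit_scaler(column):
    """Learn the mean and population standard deviation of one column."""
    return statistics.fmean(column), statistics.pstdev(column)

def transform(column, mean, std):
    """Center each value on the learned mean and scale by the learned std."""
    return [(value - mean) / std for value in column]

# Toy values for illustration only (not taken from the wine dataset).
train_column = [2.0, 4.0, 6.0, 8.0]
test_column = [5.0, 10.0]

# Statistics are learned from the training column only...
mean, std = fit_scaler(train_column)
train_scaled = transform(train_column, mean, std)
# ...and reused unchanged for the test column.
test_scaled = transform(test_column, mean, std)

print(train_scaled)
print(test_scaled)
```

Reusing the training statistics for the test column mirrors the `fit_transform` / `transform` pair in the cell above and keeps information about the test set out of the preprocessing step.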

### Training and Cross Validation

For details about using the `GradientBoostingClassifier` class, see the [GradientBoostingClassifier documentation](https://documentation.sas.com/?cdcId=workbenchcdc&cdcVersion=default&docsetId=explore&docsetTarget=n1kiea90s0276wn1xr0ig0hvkix6.htm).

In [None]:
classifier = GradientBoostingClassifier(n_estimators=50,
                                        max_depth=5,
                                        random_state=1)
classifier.fit(X_train, y_train)

To implement cross-validation, use the `cross_val_score` function from the `sklearn.model_selection` module. This function returns the accuracy for each fold. Four parameters are passed to `cross_val_score`. The first, `estimator`, specifies the model to cross-validate. The second and third, `X` and `y`, contain the features and labels. Finally, `cv` specifies the number of folds. Because the data was split earlier, the training and test sets are first recombined so that cross-validation runs on all samples, as demonstrated in the following code snippets:

In [None]:
X = np.concatenate((X_train, X_test), axis=0)
y = np.concatenate((y_train, y_test), axis=0)

In [None]:
all_accuracies = cross_val_score(estimator=classifier, X=X, y=y, cv=5)
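
As a rough illustration of what "five folds" means, the sketch below partitions sample indices into contiguous folds in plain Python. The function `k_fold_indices` is hypothetical, and the sketch is a plain, unshuffled k-fold; scikit-learn may apply a stratified variant for classifiers, and whether it does so for this estimator is not stated in the source.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_indices, test_indices) for each fold, without shuffling."""
    # Spread the remainder over the first folds so fold sizes differ by at most one.
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)
        test = list(range(start, start + size))
        # Every sample outside the current test fold is used for training.
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size

# Toy example: 10 samples, 5 folds.
for train_idx, test_idx in k_fold_indices(n_samples=10, n_folds=5):
    print("test:", test_idx, "train:", train_idx)
```

Each sample appears in exactly one test fold, so every observation is used once for evaluation and in the remaining folds for training.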

Once this has run, print `all_accuracies` to see the accuracy returned by `cross_val_score` for each of the five folds.

In [None]:
print(all_accuracies)

To calculate the average of all the accuracies, simply use the mean() method of the object returned by the cross_val_score method as shown below.

In [None]:
print('{:.4f}'.format(all_accuracies.mean()))

The mean value is 0.6636, or 66.36%.

Finally, let's calculate the standard deviation of the fold accuracies to assess how much the model's results vary from fold to fold. To do this, call the `std()` method on the `all_accuracies` object.

In [None]:
print('{:.4f}'.format(all_accuracies.std()))

The result is 0.0351, or 3.51 percentage points. This is a small spread relative to the mean, indicating that the model performs consistently across the different folds. This consistency is favorable because it suggests that the mean accuracy is stable rather than the product of one particularly favorable split.
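
One detail worth keeping in mind is which standard deviation is reported. NumPy's `std()` defaults to the population form (`ddof=0`), which divides by the number of folds. The sketch below shows the difference using the standard library; the scores are hypothetical values chosen for illustration and are not the results from this notebook.

```python
import statistics

# Hypothetical fold accuracies for illustration only; NOT the notebook's results.
fold_scores = [0.60, 0.65, 0.70, 0.65, 0.70]

mean_score = statistics.fmean(fold_scores)
# Population standard deviation (divides by n), matching NumPy's default ddof=0.
population_std = statistics.pstdev(fold_scores)
# Sample standard deviation (divides by n - 1), slightly larger for few folds.
sample_std = statistics.stdev(fold_scores)

print(f"mean={mean_score:.4f}")
print(f"population std={population_std:.4f}")
print(f"sample std={sample_std:.4f}")
```

With only five folds the two forms differ noticeably, so it helps to state which one is being quoted when comparing models.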

### GridSearchCV for Gradient Boosting Hyperparameter Tuning

Can we get improved performance over the initial model? 

Here we perform a simplified grid search over the `max_depth` parameter while holding `n_estimators` and `random_state` fixed. To do so, we define a dictionary of parameters and their candidate values for the gradient boosting algorithm.

In [None]:
grid_param = {
    'max_depth': [5, 10],
    'n_estimators': [50],
    'random_state': [1],
}

gb_clf = GridSearchCV(estimator=classifier,
                      param_grid=grid_param,
                      scoring='accuracy',
                      cv=5,
                      n_jobs=-1)

In [None]:
gb_clf.fit(X_train, y_train)
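
To see how much work this search involves, the sketch below expands the same parameter grid into its individual candidate settings using only the standard library. The helper `expand_grid` is hypothetical and only approximates how a grid is enumerated; it is not scikit-learn's implementation.

```python
from itertools import product

# Same grid as defined in the notebook cell above.
grid_param = {
    'max_depth': [5, 10],
    'n_estimators': [50],
    'random_state': [1],
}

def expand_grid(grid):
    """Return one dict per combination of parameter values in the grid."""
    names = sorted(grid)
    value_lists = [grid[name] for name in names]
    return [dict(zip(names, values)) for values in product(*value_lists)]

candidates = expand_grid(grid_param)
n_folds = 5  # matches cv=5 in the GridSearchCV call

for candidate in candidates:
    print(candidate)
print("cross-validation fits:", len(candidates) * n_folds)
```

With two candidates and five folds, the search trains ten models during cross-validation. With the default `refit=True`, `GridSearchCV` then trains one more model on the full training set using the best setting.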

Check the parameters that yield the highest accuracy.


In [None]:
best_parameters = gb_clf.best_params_
print(best_parameters)

The result indicates that the highest cross-validated accuracy is attained when `max_depth` is **10**. This represents an improvement over our original gradient boosting model, which used a `max_depth` of 5.

The last step of the Grid Search algorithm is to determine the accuracy obtained using the best parameters.


In [None]:
print('Model train accuracy is:', '{:.4f}'.format(gb_clf.score(X_train, y_train)))

In [None]:
print('Model test accuracy is:', '{:.4f}'.format(gb_clf.score(X_test, y_test)))

### Calculating and Displaying the Confusion Matrix

In [None]:
y_pred = gb_clf.best_estimator_.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
print(cm)
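
For readers new to confusion matrices, the following pure-Python sketch shows how the counts are tallied: rows correspond to actual classes and columns to predicted classes, with labels in sorted order. The function `confusion_counts` is hypothetical and the labels are toy values, not predictions from this model.

```python
def confusion_counts(y_true, y_pred):
    """Count (actual, predicted) pairs; rows are actual, columns are predicted."""
    labels = sorted(set(y_true) | set(y_pred))
    index = {label: position for position, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for actual, predicted in zip(y_true, y_pred):
        matrix[index[actual]][index[predicted]] += 1
    return labels, matrix

# Toy labels for illustration only.
toy_true = [5, 5, 6, 6, 7]
toy_pred = [5, 6, 6, 6, 5]

labels, matrix = confusion_counts(toy_true, toy_pred)
print(labels)
for row in matrix:
    print(row)
```

Entries on the diagonal are correct predictions; everything off the diagonal is a misclassification, so a good model concentrates its counts along the diagonal.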

In [None]:
fig, ax = plt.subplots(figsize=(6, 6))
ax.imshow(cm)
ax.grid(False)
ax.set_xlabel('Predicted outputs', fontsize=12, color='black')
ax.set_ylabel('Actual outputs', fontsize=12, color='black')
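# Derive the number of classes from the matrix instead of hard-coding it
n_classes = cm.shape[0]
ax.xaxis.set(ticks=range(n_classes))
ax.yaxis.set(ticks=range(n_classes))
ax.set_ylim(n_classes - 0.5, -0.5)
# Annotate each cell with its count
for i in range(n_classes):
    for j in range(n_classes):
        ax.text(j, i, cm[i, j], ha='center', va='center', color='white')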
plt.show()

In [None]:
print(classification_report(y_test, y_pred))
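
The report lists precision, recall, and F1 for each class. The sketch below computes these for one class from toy labels, under the assumption that undefined ratios are reported as 0.0. That is scikit-learn's default behavior, and it is the situation that triggers the `UndefinedMetricWarning` suppressed at the top of this notebook. The function `per_class_metrics` is hypothetical.

```python
def per_class_metrics(y_true, y_pred, label):
    """Return (precision, recall, f1) for one class, using 0.0 when undefined."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    # Precision is undefined when the class is never predicted (tp + fp == 0).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall is undefined when the class never occurs in y_true (tp + fn == 0).
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1

# Toy labels for illustration only.
toy_true = [5, 5, 6, 6, 7]
toy_pred = [5, 6, 6, 6, 5]

for label in sorted(set(toy_true)):
    precision, recall, f1 = per_class_metrics(toy_true, toy_pred, label)
    print(label, round(precision, 2), round(recall, 2), round(f1, 2))
```

In the toy data, class 7 is never predicted, so its precision falls back to 0.0. Rare quality grades in the wine data can behave the same way, which is worth remembering when reading the per-class rows of the report.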