## ForestClassifier from SAS® Viya® on Handwritten Digits

### Source
This example is adapted from [Example: Random Forest for Classifying Digits](https://colab.research.google.com/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/05.08-Random-Forests.ipynb#scrollTo=YxsqDJXxpcZI) by Jacob VanderPlas and [Logistic Regression in Python: Handwriting Recognition](https://realpython.com/logistic-regression-python/#logistic-regression-in-python-handwriting-recognition) by Mirko Stojiljković.

### Data Preparation
#### About the data set
This data set contains 1797 images of handwritten digits. Each image was originally a 32 x 32 pixel bitmap that was divided into non-overlapping 4 x 4 pixel blocks. The number of "on" pixels in each block is counted, which produces an 8 x 8 input matrix in which each element is an integer between 0 and 16. Each image is labeled with the digit, an integer between 0 and 9, that it represents. This example uses this input format to train a model that classifies held-out images as the correct digit.
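
The block-counting step described above can be sketched in plain Python. This is only an illustration of the preprocessing idea, not the code that produced the data set; the function name `block_counts` and the toy bitmap are hypothetical.

```python
def block_counts(bitmap, block=4):
    """Reduce a square 0/1 bitmap to per-block counts of 'on' pixels."""
    size = len(bitmap)
    cells = size // block
    counts = [[0] * cells for _ in range(cells)]
    for r in range(size):
        for c in range(size):
            # each pixel contributes to the block that contains it
            counts[r // block][c // block] += bitmap[r][c]
    return counts


# Hypothetical 32 x 32 bitmap: a filled square covering rows and columns 8..23
bitmap = [[1 if 8 <= r < 24 and 8 <= c < 24 else 0 for c in range(32)]
          for r in range(32)]

features = block_counts(bitmap)
flat = [value for row in features for value in row]  # 64 inputs per image

print(len(features), len(features[0]))  # 8 8
print(min(flat), max(flat))             # 0 16
```

Flattening the 8 x 8 matrix gives the 64 input values per image that the model is trained on.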

#### Importing the data
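scikit-learn includes a copy of this data, which can be loaded with `load_digits()`. By default, it returns a dictionary-like `Bunch` object whose keys include `data`, `target`, and `images`. With `return_X_y=True`, it returns a tuple of the inputs and outputs instead.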

In [None]:
from sklearn.datasets import load_digits
digits = load_digits()
digits.keys()

### Examining the data
Because these are images, viewing them can help you understand the data more clearly.

In [None]:
import matplotlib.pyplot as plt

# set up the figure
fig = plt.figure(figsize=(6, 6))  # figure size in inches
fig.subplots_adjust(left=0, right=1, bottom=0, top=1, hspace=0.05, wspace=0.05)

# plot the digits: each block's count is drawn as one grayscale pixel
for i in range(64):
    ax = fig.add_subplot(8, 8, i + 1, xticks=[], yticks=[])
    ax.imshow(digits.images[i], cmap=plt.cm.binary, interpolation='nearest')

    # label the image with the target value
    ax.text(0, 7, str(digits.target[i]))

In [None]:
x, y = load_digits(return_X_y=True)

In [None]:
x

In [None]:
y

### Partitioning the Data
To train a model and test its accuracy, we will randomly partition the data into two subsets. We will use the training set to create the model and the test set to evaluate how well the model performs on data it has not seen.

scikit-learn provides `train_test_split()` to make this partitioning easy. In addition to the input and outcome data, two other important parameters are `test_size`, which controls the size of the test set, and `random_state`, which seeds the pseudo-random number generator used to split the data. The function returns four arrays: the training and test inputs, followed by the training and test outcomes.

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
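
To make the roles of `test_size` and `random_state` concrete, here is a minimal, dependency-free sketch of a shuffled split. It is not scikit-learn's implementation: the helper name `simple_train_test_split`, the toy data, and the choice to round the test count up are all assumptions made for illustration.

```python
import math
import random


def simple_train_test_split(inputs, outcomes, test_size=0.2, random_state=0):
    """Shuffle row indices with a seeded generator, then slice off a test set."""
    n = len(inputs)
    n_test = math.ceil(test_size * n)  # assumption of this sketch: round up
    indices = list(range(n))
    random.Random(random_state).shuffle(indices)  # same seed, same shuffle
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(seq, idx):
        return [seq[i] for i in idx]

    return (pick(inputs, train_idx), pick(inputs, test_idx),
            pick(outcomes, train_idx), pick(outcomes, test_idx))


# Hypothetical toy data: 10 rows with 2 features each
toy_x = [[i, i * i] for i in range(10)]
toy_y = [i % 2 for i in range(10)]

a_train, a_test, b_train, b_test = simple_train_test_split(toy_x, toy_y)
print(len(a_train), len(a_test))  # 8 2
```

Calling the helper twice with the same `random_state` produces the same partition, which is why fixing the seed makes the rest of the example reproducible.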

### Training a ForestClassifier model with SAS® Viya®
To fit the model, create an instance of `ForestClassifier` and call `.fit()` with the `X_train` and `y_train` data.

For details about using the `ForestClassifier` class in the `sasviya` package, see the [ForestClassifier documentation](https://documentation.sas.com/?cdcId=workbenchcdc&cdcVersion=default&docsetId=explore&docsetTarget=p04zhxjh60eutqn1t40f0104gw42.htm).

In [None]:
from sasviya.ml.tree import ForestClassifier

rf = ForestClassifier(
    n_estimators=100,
    max_depth=5,
    min_samples_leaf=1,
    max_features=None,
    criterion='gini',
    random_state=0
)
rf.fit(X_train, y_train)
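
The `criterion='gini'` argument above names the Gini impurity. As a rough sketch, the standard textbook definition for a set of labels is one minus the sum of the squared class proportions. How `ForestClassifier` applies the criterion internally is not shown in this example, so treat the snippet as an illustration of the formula only; the function name `gini_impurity` is hypothetical.

```python
from collections import Counter


def gini_impurity(labels):
    """Textbook Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


print(gini_impurity([3, 3, 3, 3]))  # 0.0, a pure node
print(gini_impurity([0, 0, 1, 1]))  # 0.5, two equally mixed classes
```

A lower value means the labels in a node are more homogeneous, which is what a split on this criterion tries to achieve.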

### Examining the results
#### Model parameters
Since `.fit()` returns the model, we can view the parameters used to train the model with `.get_params()`.

In [None]:
rf.get_params()

#### Predicting results
We can run the model on the test data through `.predict()` and view the results.

In [None]:
y_pred = rf.predict(X_test)

In [None]:
y_pred

#### Calculating accuracy
You can obtain the accuracy of the model with `.score()` on the training and test data. It can be helpful to compare the two, as a much higher accuracy score for the training data can indicate overfitting. 

In [None]:
print(f'Training accuracy: {rf.score(X_train, y_train):.2f}')

In [None]:
print(f'Test accuracy: {rf.score(X_test, y_test):.2f}')
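
For reference, classification accuracy is the fraction of predictions that match the actual labels. The sketch below computes it by hand on hypothetical toy labels, assuming `.score()` reports plain accuracy as described above; the function name `accuracy` is illustrative.

```python
def accuracy(actual, predicted):
    """Fraction of predictions that equal the actual label."""
    if len(actual) != len(predicted):
        raise ValueError('actual and predicted must have the same length')
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / len(actual)


# Hypothetical labels for eight samples
toy_actual = [0, 1, 2, 2, 1, 0, 2, 1]
toy_predicted = [0, 1, 2, 1, 1, 0, 2, 2]

print(f'Accuracy: {accuracy(toy_actual, toy_predicted):.2f}')  # Accuracy: 0.75
```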

#### Viewing the confusion matrix
Although the confusion matrix can be obtained with `confusion_matrix`, a heatmap is often easier to read than a table of numbers. In the heatmap, which uses matplotlib's default colormap, low counts (about 2 or less in this example) appear purple, while high counts (about 10 or above) appear green and yellow.

In [None]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)

fig, ax = plt.subplots(figsize=(8, 8))
ax.imshow(cm)
ax.grid(False)
ax.set_xlabel('Predicted outputs', fontsize=12, color='black')
ax.set_ylabel('Actual outputs', fontsize=12, color='black')
ax.xaxis.set(ticks=range(10))
ax.yaxis.set(ticks=range(10))
ax.set_ylim(9.5, -0.5)
for i in range(10):
    for j in range(10):
        ax.text(j, i, cm[i, j], ha='center', va='center', color='white')
plt.show()
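
To show what each cell of the matrix counts, here is a small hand-rolled sketch on hypothetical toy labels. It follows the same layout as the plot above, with rows for actual labels and columns for predicted labels; the function name `confusion_counts` is illustrative and is not part of scikit-learn.

```python
def confusion_counts(actual, predicted, n_classes):
    """Count (actual, predicted) pairs: rows are actual, columns are predicted."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for a, p in zip(actual, predicted):
        matrix[a][p] += 1
    return matrix


# Hypothetical labels for eight samples across three classes
toy_actual = [0, 1, 2, 2, 1, 0, 2, 1]
toy_predicted = [0, 1, 2, 1, 1, 0, 2, 2]

toy_cm = confusion_counts(toy_actual, toy_predicted, n_classes=3)
for row in toy_cm:
    print(row)
# [2, 0, 0]
# [0, 2, 1]
# [0, 1, 2]
```

Correct predictions fall on the diagonal, so the off-diagonal cells show which digits are confused with each other.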

#### Obtaining the classification report
scikit-learn produces a report of the classification results by comparing the actual labels of the test data in `y_test` with the predicted values in `y_pred`. The report lists the precision, recall, F1-score, and support for each class.


In [None]:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))
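
The per-class columns of the report can be reproduced by hand. The following sketch uses the standard definitions of precision, recall, F1-score, and support on hypothetical toy labels; the function name `per_class_metrics` is illustrative, and returning 0.0 when a denominator is zero is a choice made for this sketch.

```python
def per_class_metrics(actual, predicted, label):
    """Return (precision, recall, f1, support) for one class label."""
    true_pos = sum(1 for a, p in zip(actual, predicted)
                   if a == label and p == label)
    n_predicted = sum(1 for p in predicted if p == label)
    support = sum(1 for a in actual if a == label)  # actual count of the class

    precision = true_pos / n_predicted if n_predicted else 0.0
    recall = true_pos / support if support else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1, support


# Hypothetical labels for eight samples across three classes
toy_actual = [0, 1, 2, 2, 1, 0, 2, 1]
toy_predicted = [0, 1, 2, 1, 1, 0, 2, 2]

for label in sorted(set(toy_actual)):
    precision, recall, f1, support = per_class_metrics(
        toy_actual, toy_predicted, label)
    print(f'{label}: precision={precision:.2f} recall={recall:.2f} '
          f'f1={f1:.2f} support={support}')
# 0: precision=1.00 recall=1.00 f1=1.00 support=2
# 1: precision=0.67 recall=0.67 f1=0.67 support=3
# 2: precision=0.67 recall=0.67 f1=0.67 support=3
```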