# Imports

In [15]:
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestClassifier

# Multiclass Classifiers
In this assignment you will load a dataset, train two models to perform multiclass classification, and compare their results. The dataset is the **digits** dataset available from sklearn's *datasets* library. It contains 1797 samples of handwritten digits. The goal is to correctly identify each digit from 0 to 9.

## Load the data

1. import the *load_digits* function from the *sklearn.datasets* library
2. invoke *load_digits* with the *return_X_y* parameter set to *True* and store the returned data and targets in the variables **X** and **y**.

In [8]:
X, y = load_digits(return_X_y=True)

## Exploratory Data Analysis
Perform a few exploratory steps, including:

1. Display the number of rows of data returned
2. Display the number of features in the dataset
3. Use Numpy's **bincount** to display how many samples belong to each class. Is this a balanced dataset?

In [9]:
print('The number of rows in the dataset is {:d}'.format(X.shape[0]))
print('The number of features in the dataset is {:d}'.format(X.shape[1]))
np.bincount(y)

The number of rows in the dataset is 1797
The number of features in the dataset is 64


array([178, 182, 177, 183, 181, 182, 181, 179, 174, 180], dtype=int64)

> This is a balanced dataset: each class has between 174 and 183 samples.
---
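The balance claim above can also be checked numerically. The following is a minimal sketch that compares the largest and smallest class counts; the names `counts` and `imbalance_ratio` are illustrative and not part of the original notebook.

```python
import numpy as np
from sklearn.datasets import load_digits

X, y = load_digits(return_X_y=True)

# Number of samples in each of the 10 digit classes
counts = np.bincount(y)

# Ratio of the largest class to the smallest (hypothetical helper metric);
# a value close to 1 indicates a balanced dataset
imbalance_ratio = counts.max() / counts.min()

print('Samples per class:', counts)
print('Largest/smallest class ratio: {:.2f}'.format(imbalance_ratio))
```

A ratio close to 1 supports treating plain accuracy as a reasonable metric for the models trained later.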

## Prepare training and testing data
1. Use *train_test_split* to split the dataset into a training set and a test set. Set the proportion of test data to 20%. Set a random state value so that the results will be repeatable.

In [10]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=7)
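
As a sanity check on the split, the sketch below prints the shapes of the two sets. It also passes `stratify=y`, which is an assumption added here and not part of the original cell; it keeps the class proportions roughly equal in both sets.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)

# Same 80/20 split and random_state as above;
# stratify=y is an optional addition (assumption, not in the original)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=7, stratify=y
)

print('Training set shape:', X_train.shape)
print('Test set shape:', X_test.shape)
```

With 1797 samples and a 20% test proportion, 1437 rows go to training and 360 to testing.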

## Cross validation with Logistic Regression
In this step you will create a LogisticRegression classifier and evaluate it with 5-fold cross validation.

1. import *LogisticRegression* classifier from sklearn
2. instantiate a LogisticRegression classifier with the 'lbfgs' solver and 'ovr' multiclass strategy. You may have to set the maximum number of iterations to 1000.
3. perform cross validation on the model
4. print the cross validation scores and the mean of the cross validation scores.

In [11]:
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

In [16]:
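# One-vs-rest logistic regression; max_iter is raised so that lbfgs can converge
lr_model = LogisticRegression(solver='lbfgs', multi_class='ovr', max_iter=1000)
# 5-fold cross validation on the training set (accuracy is the default score)
lr_model_cv = cross_val_score(lr_model, X_train, y_train, cv=5)
cv_mean = np.mean(lr_model_cv)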

print('Accuracy scores for the 5 folds: ', lr_model_cv)
print('Mean cross validation score: {:.3f}'.format(cv_mean))

Accuracy scores for the 5 folds:  [0.95833333 0.95138889 0.96515679 0.94773519 0.96864111]
Mean cross validation score: 0.958
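
The notebook imports *StandardScaler* but never uses it. The sketch below shows one way it could be combined with the classifier in a pipeline, so that scaling is fitted inside each fold. It also uses *OneVsRestClassifier* in place of the `multi_class='ovr'` argument, since newer scikit-learn releases deprecate that argument. Both choices are assumptions made for illustration, and the names `ovr_pipeline` and `ovr_scores` are hypothetical.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=7)

# Scale the features, then fit one binary logistic regression per digit class.
# The scaler is re-fitted on the training part of every fold, so no
# information from the validation part leaks into the preprocessing.
ovr_pipeline = make_pipeline(
    StandardScaler(),
    OneVsRestClassifier(LogisticRegression(solver='lbfgs', max_iter=1000)),
)

ovr_scores = cross_val_score(ovr_pipeline, X_train, y_train, cv=5)

print('Accuracy scores for the 5 folds: ', ovr_scores)
print('Mean cross validation score: {:.3f}'.format(np.mean(ovr_scores)))
```

The scores from this variant are not expected to match the output above exactly, because the features are scaled before fitting.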


## Cross validation with RandomForest
Perform the same steps as above but this time with a RandomForestClassifier.

In [17]:
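# RandomForestClassifier is imported in the first cell.
# No random_state is set, so the scores vary slightly from run to run.
rf_model = RandomForestClassifier(n_estimators=24)
rf_model_cv = cross_val_score(rf_model, X_train, y_train, cv=5)
cv_mean = np.mean(rf_model_cv)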

print('Accuracy scores for the 5 folds: ', rf_model_cv)
print('Mean cross validation score: {:.3f}'.format(cv_mean))

Accuracy scores for the 5 folds:  [0.95486111 0.96180556 0.97909408 0.96515679 0.97212544]
Mean cross validation score: 0.967
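
In the runs above, the random forest reaches a mean cross validation score of 0.967 against 0.958 for logistic regression. The held-out test set is not used in the notebook; the sketch below shows one possible way to finish the comparison by fitting both models on the training set and scoring them on the test set. The `random_state` on the forest, the use of *OneVsRestClassifier*, and the names `models` and `test_scores` are assumptions added for illustration.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=7)

# The two candidate models; random_state on the forest is an assumption
# added here so that the comparison is repeatable
models = {
    'LogisticRegression (one-vs-rest)': OneVsRestClassifier(
        LogisticRegression(solver='lbfgs', max_iter=1000)
    ),
    'RandomForest': RandomForestClassifier(n_estimators=24, random_state=7),
}

# Fit each model on the full training set and record its test accuracy
test_scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    test_scores[name] = model.score(X_test, y_test)
    print('{}: test accuracy {:.3f}'.format(name, test_scores[name]))
```

Because the test set holds only 360 samples, a small gap between the two accuracies should be read with caution.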
