# Multiclass Classifiers
In this assignment you will load a dataset, train two models to perform multiclass classification, and compare their results. The dataset is the **digits** dataset available from sklearn's *datasets* library. It contains 1797 samples of handwritten digits. The goal is to correctly identify the digits 0 to 9.
## Load the data

1. Import the *load_digits* function from the *sklearn.datasets* library.
2. Invoke *load_digits* with the *return_X_y* parameter set to *True* and store the returned data in the variables **X** and **y**.

In [1]:
# Set up the environment.
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_val_score

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns
import itertools

In [2]:
# load_digits was imported from sklearn.datasets in the cell above.
# Invoke it with return_X_y set to True and store the features in X and the labels in y.
X, y = load_digits(return_X_y = True)

## Exploratory Data Analysis
Perform a few exploratory steps, including:

1. Display the number of rows of data returned.
2. Display the number of features in the dataset.
3. Use NumPy's **bincount** to display how many samples belong to each class. Is this a balanced dataset?

In [3]:
# Number of rows of data returned.
num_rows = X.shape[0]
print('The number of rows in the dataset is {:d}'.format(num_rows))

# Number of features in the dataset.
num_feat = X.shape[1]
print('The number of features in the dataset is {:d}'.format(num_feat))

# Use NumPy's bincount to display how many samples belong to each class.
class_counts = np.bincount(y)
print("Samples per class (digits 0 to 9): {}".format(class_counts))
print("The dataset is approximately balanced.")

The number of rows in the dataset is 1797
The number of features in the dataset is 64
Samples per class (digits 0 to 9): [178 182 177 183 181 182 181 179 174 180]
The dataset is approximately balanced.
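
The balance claim can be checked with a little arithmetic rather than by eye. The sketch below is only an illustration: it copies the per-class counts from the output above into a plain list (the names `class_counts` and `imbalance_ratio` are chosen here, not taken from the assignment) and compares the largest class with the smallest.

```python
# Per-class counts copied from the bincount output above (digits 0 through 9).
class_counts = [178, 182, 177, 183, 181, 182, 181, 179, 174, 180]

total = sum(class_counts)
smallest, largest = min(class_counts), max(class_counts)

# Ratio of the largest class to the smallest; 1.0 would be a perfectly balanced dataset.
imbalance_ratio = largest / smallest

print("Total samples:", total)
print("Largest/smallest class ratio: {:.3f}".format(imbalance_ratio))

# Share of the dataset held by each digit.
for digit, count in enumerate(class_counts):
    print("Digit {}: {:.1%} of the data".format(digit, count / total))
```

A ratio close to 1 means no class dominates, which is why plain accuracy is a reasonable metric in the steps that follow.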


## Prepare training and testing data
1. Use *train_test_split* to split the dataset into a training set and a test set. Set the proportion of test data to 20%. Set a random state value so that the results will be repeatable.

In [4]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .2, random_state = 30)
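
To see what a 20% split means for this dataset, the sizes can be worked out by hand. This is a rough sketch that assumes a fractional test size is rounded up to a whole number of samples, with the remainder going to the training set; the variable names are illustrative, and the shapes of *X_train* and *X_test* are the authoritative check.

```python
import math

n_samples = 1797      # number of rows reported in the exploratory step
test_fraction = 0.2   # the test_size passed to train_test_split

# Assumption: the fractional test size is rounded up to a whole sample count.
n_test = math.ceil(test_fraction * n_samples)
# Whatever is not held out for testing is used for training.
n_train = n_samples - n_test

print("Expected test samples: ", n_test)
print("Expected train samples:", n_train)
```

In the notebook, `X_train.shape[0]` and `X_test.shape[0]` should confirm these numbers.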

## Cross validation with Logistic Regression
In this step you will create a LogisticRegression classifier and use 5-fold cross validation on the training set to estimate how well the model generalizes.

1. Import the *LogisticRegression* classifier from *sklearn.linear_model*.
2. Instantiate a LogisticRegression classifier with the 'lbfgs' solver and the 'ovr' (one-vs-rest) multiclass strategy. You may have to set the maximum number of iterations to 1000 so that the solver converges.
3. Perform cross validation on the model.
4. Print the cross validation scores and the mean of the cross validation scores.
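
The 'ovr' strategy named in step 2 fits one binary classifier per digit ("digit k versus everything else") and predicts the digit whose classifier is most confident. The following is a minimal, library-free sketch of that decision rule only; the function name and the score values are hypothetical and are not produced by any model in this notebook. Note also that scikit-learn 1.5 and later deprecate the *multi_class* parameter, so a deprecation warning in step 2 is likely on newer installations.

```python
def predict_one_vs_rest(class_scores):
    """Return the index of the class whose binary classifier gives the highest score."""
    return max(range(len(class_scores)), key=lambda k: class_scores[k])


# Hypothetical confidence scores from ten "digit k vs. the rest" classifiers
# for a single image; index k holds the score for digit k.
hypothetical_scores = [0.02, 0.01, 0.10, 0.85, 0.03, 0.40, 0.01, 0.05, 0.30, 0.07]

predicted_digit = predict_one_vs_rest(hypothetical_scores)
print("Predicted digit:", predicted_digit)
```

The scores do not need to sum to 1, because each binary classifier is trained independently of the others.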

In [5]:
# Instantiate a LogisticRegression classifier with the 'lbfgs' solver and 'ovr' multiclass strategy.
# max_iter is raised to 1000 to give the solver enough iterations to converge.
lr_clf = LogisticRegression(solver = 'lbfgs', multi_class = 'ovr', max_iter = 1000)

# Perform cross validation on the model.
lr_cv_scores = cross_val_score(lr_clf, X_train, y_train, cv = 5)

# Print the cross validation scores and the mean of the cross validation scores.
print("Accuracy scores for the five folds: ", lr_cv_scores)
print("Mean cross-validation score: {:.3f}".format(np.mean(lr_cv_scores)))

Accuracy scores for the five folds:  [0.97569444 0.95138889 0.96167247 0.97560976 0.94425087]
Mean cross-validation score: 0.962
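
Each fold score is simply the fraction of that fold's validation samples that were classified correctly. As a sanity check, the sketch below converts the scores above back into counts. It assumes a training set of 1437 samples (1797 minus a 360-sample test set) split as evenly as possible into five folds; the actual fold assignment is made by scikit-learn, so treat the fold sizes as an assumption.

```python
n_train = 1437   # assumed training-set size: 1797 samples minus 360 held out for testing
n_splits = 5

# Even split: every fold gets n_train // n_splits samples,
# and the first n_train % n_splits folds get one extra sample.
fold_sizes = [
    n_train // n_splits + (1 if i < n_train % n_splits else 0)
    for i in range(n_splits)
]

# Fold accuracies copied from the output above.
lr_cv_scores = [0.97569444, 0.95138889, 0.96167247, 0.97560976, 0.94425087]

# Accuracy times fold size gives the number of correct predictions in that fold.
correct_per_fold = [round(score * size) for score, size in zip(lr_cv_scores, fold_sizes)]

for size, correct in zip(fold_sizes, correct_per_fold):
    print("{} of {} validation samples correct".format(correct, size))
```

With folds of this size, a single misclassified digit moves a fold score by roughly 0.0035, which is useful context when comparing models.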


## Cross validation with RandomForest
Perform the same steps as above but this time with a RandomForestClassifier.

In [6]:
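# Instantiate a RandomForestClassifier with 24 trees.
# No random_state is set, so the scores can differ slightly from run to run.
rf_clf = RandomForestClassifier(n_estimators = 24)

# Perform 5-fold cross validation on the training set.
rf_cv_scores = cross_val_score(rf_clf, X_train, y_train, cv = 5)

# Print the cross validation scores and the mean of the cross validation scores.
print("Accuracy scores for the five folds: ", rf_cv_scores)
print("Mean cross-validation score: {:.3f}".format(np.mean(rf_cv_scores)))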

Accuracy scores for the five folds:  [0.96527778 0.97569444 0.97212544 0.97212544 0.94773519]
Mean cross-validation score: 0.967
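
To compare the two models, it helps to look at the spread of the fold scores as well as their means. The sketch below uses only the standard library and the scores copied from the two outputs above; the variable names are illustrative.

```python
import statistics

# Fold accuracies copied from the two cross-validation outputs above.
lr_scores = [0.97569444, 0.95138889, 0.96167247, 0.97560976, 0.94425087]
rf_scores = [0.96527778, 0.97569444, 0.97212544, 0.97212544, 0.94773519]

lr_mean, rf_mean = statistics.mean(lr_scores), statistics.mean(rf_scores)
# Sample standard deviation across the five folds.
lr_std, rf_std = statistics.stdev(lr_scores), statistics.stdev(rf_scores)

print("LogisticRegression: mean {:.3f}, std {:.3f}".format(lr_mean, lr_std))
print("RandomForest:       mean {:.3f}, std {:.3f}".format(rf_mean, rf_std))

# Gap between the two mean scores, for comparison with the fold-to-fold spread.
gap = rf_mean - lr_mean
print("Difference in means: {:.4f}".format(gap))
```

The difference between the means is smaller than the fold-to-fold standard deviation of either model, so on these five folds alone the two classifiers perform about equally well.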
