# Multiclass Classification
Binary classification techniques work well when the data observations belong to one of two classes or categories, such as "True" or "False". When the data can be categorized into more than two classes, you must use a multiclass classification algorithm.

Fortunately, in most machine learning frameworks, including scikit-learn, implementing a multiclass classifier is not significantly more complex than binary classification - and in many cases, the estimator classes used for binary classification implicitly support multiclass classification.

## A multiclass dataset
Let's start by examining a dataset that contains observations of multiple classes. We'll use one of the most commonly used examples in machine learning - the Iris dataset, in which characteristics of iris flowers are recorded along with the specific species of iris.

This dataset is so commonly used in machine learning examples that it's available directly from the scikit-learn library. Run the following cell to load it:

In [None]:
from sklearn import datasets

iris = datasets.load_iris()
iris

The dataset in scikit-learn consists of:
* A description of the dataset
* An array named **data** containing the feature values
* An array named **feature_names** containing the names of the features (*sepal length*, *sepal width*, *petal length*, and *petal width*).
* An array named **target** containing the corresponding labels
* An array named **target_names** containing the species names that correspond to each possible label value (*setosa*, *versicolor*, and *virginica*).
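
As a quick check of the structure listed above, the following sketch prints the array shapes and counts the observations per species. The name `counts` is introduced here only for illustration, and `np.bincount` is used on the assumption that the labels are small non-negative integers, which is the case for this dataset.

```python
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()

# Feature matrix: one row per flower, one column per feature
print("data shape:", iris.data.shape)
# Label vector: one integer label per flower
print("target shape:", iris.target.shape)

# Count how many observations carry each label (illustrative helper variable)
counts = np.bincount(iris.target)
for name, count in zip(iris.target_names, counts):
    print(name, count)
```

The dataset is balanced, with the same number of observations for each species, which makes overall accuracy a reasonable first metric later on.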

Let's combine the features and label values into a dataframe so we can see them more clearly:

In [None]:
import numpy as np
import pandas as pd

features = pd.DataFrame(data = np.c_[iris.data,iris.target], columns = iris.feature_names + ['label'])
features

The labels in the dataset are 0, 1, and 2. Let's see what those labels correspond to in terms of species names:

In [None]:
print(iris.target_names)

## Splitting the data
As with binary classification, we need to split the multiclass data into a set of features and labels for training, and a second set of features and labels for testing the trained model. The dataset from scikit-learn already includes separated features (**iris.data**) and labels (**iris.target**), so we just need to separate these into training and test sets:

In [None]:
from sklearn.model_selection import train_test_split

# Split data 70%-30% into training set and test set
X_train, X_test, Y_train, Y_test = train_test_split(iris.data, iris.target, test_size=0.30, random_state=0)

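# Report the number of observations (rows) in each set; .size would count every individual value
print('Training Set: %d, Test Set: %d \n' % (X_train.shape[0], X_test.shape[0]))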

print("Sample of features and labels:")
# Take a look at the first 10 training features and corresponding labels
for n in range(10):
    print(X_train[n], Y_train[n], '(' + iris.target_names[Y_train[n]] + ')')
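
The split above is purely random, so the three species are not guaranteed to appear in the same proportions in the training and test sets. One possible refinement, which is not used in the rest of this notebook, is the `stratify` argument of `train_test_split`. The sketch below uses separate variable names (the `_s` suffix is only for illustration) so it doesn't overwrite the split created above.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()

# Same 70%-30% split, but stratified on the labels so each class keeps its share
X_train_s, X_test_s, Y_train_s, Y_test_s = train_test_split(
    iris.data, iris.target, test_size=0.30, random_state=0, stratify=iris.target
)

# Count the observations of each class in both sets
train_counts = np.bincount(Y_train_s)
test_counts = np.bincount(Y_test_s)
print("Training class counts:", train_counts)
print("Test class counts:", test_counts)
```

Stratification matters more for small or imbalanced datasets, where a random split can leave a class under-represented in the test set.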

## Training a multiclass classification model
Now that we have a set of training features and corresponding training labels, we can fit a multiclass classification algorithm to the data to create a model. The **sklearn.linear_model.LogisticRegression** algorithm inherently supports multiclass classification, so we can use that with some tweaks to the parameters:

In [None]:
# Train the model
from sklearn.linear_model import LogisticRegression

# Set regularization rate
reg = 0.01

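# Train a logistic regression model on the training set.
# With the lbfgs solver and more than two classes, current scikit-learn versions fit a
# multinomial model by default, so the multi_class argument (deprecated in recent releases) is omitted.
clf = LogisticRegression(C=1/reg, solver='lbfgs').fit(X_train, Y_train)
print(clf)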
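
To see what "multiclass" means inside the fitted model, the sketch below inspects a separately trained classifier (named `demo_clf` here for illustration). It assumes a scikit-learn version in which the lbfgs solver fits a multinomial model for three classes, and it sets `max_iter=1000`, a value chosen here only to avoid convergence warnings. Under those assumptions the model holds one row of coefficients per class, and the class probabilities are the softmax of the per-class decision scores.

```python
import numpy as np
from scipy.special import softmax
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X_train, X_test, Y_train, Y_test = train_test_split(
    iris.data, iris.target, test_size=0.30, random_state=0
)

# Illustrative model; max_iter is an assumption, not a setting from the notebook
demo_clf = LogisticRegression(C=100, solver='lbfgs', max_iter=1000).fit(X_train, Y_train)

# One row of coefficients and one intercept per class
print("classes:", demo_clf.classes_)
print("coef_ shape:", demo_clf.coef_.shape)
print("intercept_ shape:", demo_clf.intercept_.shape)

# Per-class decision scores for a few test observations
scores = demo_clf.decision_function(X_test[:5])

# Applying softmax to each row of scores reproduces the model's probabilities
manual_probs = softmax(scores, axis=1)
model_probs = demo_clf.predict_proba(X_test[:5])
print(np.round(model_probs, 3))
```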

## Evaluating the classifier
Let's start by predicting the labels for the test features, and comparing the predicted labels to the actual labels:

In [None]:
predictions = clf.predict(X_test)
print('Predicted labels: ', predictions)
print('Actual labels:    ', Y_test)

Looks pretty good. What's the overall *accuracy* of the model when used with the test dataset?

In [None]:
from sklearn.metrics import accuracy_score

print('Accuracy: ', accuracy_score(Y_test, predictions))

OK, how about some other metrics? We can calculate the *precision*, *recall*, and *f1-score* for each class:

In [None]:
from sklearn.metrics import classification_report

print(classification_report(Y_test, predictions))
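
The report above labels each row with the numeric class value. As a sketch of one way to make it easier to read, `classification_report` accepts a `target_names` argument for the row labels and an `output_dict` flag that returns the same numbers as a dictionary. The model is retrained here so the block runs on its own; `demo_clf`, `report`, and `max_iter=1000` are illustrative choices rather than part of the notebook's code.

```python
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X_train, X_test, Y_train, Y_test = train_test_split(
    iris.data, iris.target, test_size=0.30, random_state=0
)
demo_clf = LogisticRegression(C=100, solver='lbfgs', max_iter=1000).fit(X_train, Y_train)
predictions = demo_clf.predict(X_test)

# Text report with species names instead of 0, 1, 2
print(classification_report(Y_test, predictions, target_names=iris.target_names))

# The same metrics as a nested dictionary, convenient for further processing
report = classification_report(
    Y_test, predictions, target_names=iris.target_names, output_dict=True
)
for species in iris.target_names:
    print(species, "recall:", report[species]['recall'])

# Overall accuracy, computed separately for comparison with the report
overall_accuracy = accuracy_score(Y_test, predictions)
print("accuracy:", overall_accuracy)
```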

Now let's look at the confusion matrix for our model:

In [None]:
from sklearn.metrics import confusion_matrix

# Print the confusion matrix
cm = confusion_matrix(Y_test, predictions)
print(cm)

Note that the confusion matrix for multiclass classification is larger than that for binary classification: it has one row and one column for each class. In scikit-learn's output, each row corresponds to an actual class and each column to a predicted class, so the diagonal from top-left to bottom-right holds the number of correct predictions for each class, and every off-diagonal cell counts one particular kind of misclassification.
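
The metrics reported earlier can be recovered from the confusion matrix itself. The sketch below retrains a model so that it runs on its own (`demo_clf` and `max_iter=1000` are illustrative choices) and assumes every class is predicted at least once, so no column sum is zero. It derives accuracy from the diagonal, recall from the row sums, and precision from the column sums.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X_train, X_test, Y_train, Y_test = train_test_split(
    iris.data, iris.target, test_size=0.30, random_state=0
)
demo_clf = LogisticRegression(C=100, solver='lbfgs', max_iter=1000).fit(X_train, Y_train)
predictions = demo_clf.predict(X_test)

# Rows are actual classes, columns are predicted classes
cm = confusion_matrix(Y_test, predictions)
correct = np.diag(cm)

# Accuracy: correct predictions divided by all predictions
accuracy = correct.sum() / cm.sum()
# Recall per class: correct predictions divided by the actual count (row sum)
recall = correct / cm.sum(axis=1)
# Precision per class: correct predictions divided by the predicted count (column sum)
precision = correct / cm.sum(axis=0)

print("accuracy:", accuracy)
print("recall:", recall)
print("precision:", precision)

# Reference values from scikit-learn for comparison
ref_accuracy = accuracy_score(Y_test, predictions)
ref_recall = recall_score(Y_test, predictions, average=None)
ref_precision = precision_score(Y_test, predictions, average=None)
```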

It's generally more intuitive to visualize this as a heat map, like this:

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

plt.imshow(cm, interpolation="nearest", cmap=plt.cm.Blues)
plt.colorbar()
tick_marks = np.arange(len(iris.target_names))
plt.xticks(tick_marks, iris.target_names, rotation=45)
plt.yticks(tick_marks, iris.target_names)
plt.xlabel("Predicted Species")
plt.ylabel("True Species")
plt.show()

## Using the model with new data observations
OK, so now we have a trained model. Let's use it to predict the class of a new iris observation:

In [None]:
X_new = [[6.6,3.2,5.8,2.4]]
print ('New sample: {}'.format(X_new))

pred = clf.predict(X_new)
# predict returns an array with one label per input row; take the first (and only) one
print('Predicted class is', iris.target_names[pred[0]])
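
A single predicted label hides how confident the model is. As a sketch, `predict_proba` returns one probability per class for the new observation. The model is retrained here so the block is self-contained; `demo_clf`, `new_probs`, `best`, and `max_iter=1000` are illustrative names and settings, not part of the notebook's code.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X_train, X_test, Y_train, Y_test = train_test_split(
    iris.data, iris.target, test_size=0.30, random_state=0
)
demo_clf = LogisticRegression(C=100, solver='lbfgs', max_iter=1000).fit(X_train, Y_train)

# The same new observation used above
X_new = [[6.6, 3.2, 5.8, 2.4]]

# One probability per class for the single input row
new_probs = demo_clf.predict_proba(X_new)[0]
for name, p in zip(iris.target_names, new_probs):
    print(f"{name}: {p:.3f}")

# The predicted class is the one with the highest probability
best = iris.target_names[np.argmax(new_probs)]
print("Most likely species:", best)
```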

## Learning More

Check out the scikit-learn documentation at http://scikit-learn.org/stable/documentation.html.