# Multiclass Classification

In the last notebook, we looked at binary classification, which works well when each observation belongs to one of two classes or categories, such as "True" or "False". When the data can be categorized into more than two classes, you must use a multiclass classification algorithm.

Multiclass classification can be thought of as a combination of multiple binary classifiers. There are two ways to approach the problem:

- **One vs Rest (OVR)**, in which a classifier is created for each possible class value, with a positive outcome for cases where the prediction is *this* class, and negative predictions for cases where the prediction is any other class. A classification problem with four possible shape classes (*square*, *circle*, *triangle*, *hexagon*) would require four classifiers that predict:
    - *square* or not
    - *circle* or not
    - *triangle* or not
    - *hexagon* or not
    
- **One vs One (OVO)**, in which a classifier for each possible pair of classes is created. The classification problem with four shape classes would require the following binary classifiers:
    - *square* or *circle*
    - *square* or *triangle*
    - *square* or *hexagon*
    - *circle* or *triangle*
    - *circle* or *hexagon*
    - *triangle* or *hexagon*

In both approaches, the overall model that combines the classifiers generates a vector of predictions in which the probabilities generated from the individual binary classifiers are used to determine which class to predict.
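As a rough illustration of the two strategies, the sketch below enumerates the binary classifiers each one needs for the four shape classes and then combines some made-up classifier outputs into a single prediction. The helper names (`predict_ovr`, `predict_ovo`), the scores, and the simple majority vote used for OVO are assumptions for illustration only; real libraries may combine the individual results differently.

```python
from itertools import combinations

classes = ["square", "circle", "triangle", "hexagon"]

# One vs Rest: one binary classifier per class ("this class" vs "everything else")
ovr_tasks = [(c, "rest") for c in classes]

# One vs One: one binary classifier per unordered pair of classes
ovo_tasks = list(combinations(classes, 2))

print(len(ovr_tasks))  # n classifiers for n classes
print(len(ovo_tasks))  # n * (n - 1) / 2 classifiers for n classes


def predict_ovr(scores):
    """Pick the class whose 'this class or not' classifier is most confident."""
    return max(scores, key=scores.get)


def predict_ovo(pair_winners):
    """Pick the class that wins the most pairwise contests (simple voting)."""
    votes = {}
    for winner in pair_winners.values():
        votes[winner] = votes.get(winner, 0) + 1
    return max(votes, key=votes.get)


# Hypothetical outputs for a single observation (illustrative values only)
ovr_scores = {"square": 0.10, "circle": 0.70, "triangle": 0.15, "hexagon": 0.25}
ovo_winners = {
    ("square", "circle"): "circle",
    ("square", "triangle"): "triangle",
    ("square", "hexagon"): "hexagon",
    ("circle", "triangle"): "circle",
    ("circle", "hexagon"): "circle",
    ("triangle", "hexagon"): "hexagon",
}

print(predict_ovr(ovr_scores))
print(predict_ovo(ovo_winners))
```

Note that OVO needs more classifiers as the number of classes grows (six for four classes, ten for five), while OVR always needs exactly one per class.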

Fortunately, in most machine learning frameworks, including scikit-learn, implementing a multiclass classification model is not significantly more complex than binary classification - and in most cases, the estimators used for binary classification implicitly support multiclass classification by abstracting an OVR algorithm, an OVO algorithm, or by allowing a choice of either.
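To make this concrete, here is a minimal sketch using scikit-learn's explicit `OneVsRestClassifier` and `OneVsOneClassifier` wrappers alongside an estimator that is given multiclass labels directly. The synthetic four-class data from `make_blobs` and the choice of `LogisticRegression` are assumptions made for illustration, not part of this notebook's exercise; how an estimator handles multiple classes internally depends on the estimator and the library version.

```python
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier

# Synthetic four-class data (a stand-in, not the penguins dataset)
X, y = make_blobs(n_samples=200, centers=4, n_features=2, random_state=0)

# Explicit One vs Rest: fits one binary classifier per class
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)

# Explicit One vs One: fits one binary classifier per pair of classes
ovo = OneVsOneClassifier(LogisticRegression(max_iter=1000)).fit(X, y)

# The same estimator also accepts multiclass labels directly
direct = LogisticRegression(max_iter=1000).fit(X, y)

print(len(ovr.estimators_))  # number of fitted OVR classifiers
print(len(ovo.estimators_))  # number of fitted OVO classifiers
print(direct.classes_)       # class labels learned by the direct model
```

In all three cases the calling code is the same `fit` and `predict` pattern used for binary classification; only the wrapper changes.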

> **More Information**: To learn more about estimator support for multiclass classification in scikit-learn, see the [scikit-learn documentation](https://scikit-learn.org/stable/modules/multiclass.html).

### Explore the data

Let's start by examining a dataset that contains observations of multiple classes. We'll use a dataset that contains observations of three different species of penguin.

> **Citation**: The penguins dataset used in this exercise is a subset of data collected and made available by [Dr. Kristen Gorman](https://www.uaf.edu/cfos/people/faculty/detail/kristen-gorman.php) and the [Palmer Station, Antarctica LTER](https://pal.lternet.edu/), a member of the [Long Term Ecological Research Network](https://lternet.edu/).

In [None]:
import pandas as pd

# load the training dataset
!wget https://raw.githubusercontent.com/MicrosoftDocs/mslearn-introduction-to-machine-learning/main/Data/ml-basics/penguins.csv
penguins = pd.read_csv('penguins.csv')

# Display a random sample of 10 observations
sample = penguins.sample(10)
sample

         CulmenLength  CulmenDepth  FlipperLength  BodyMass  Species
    247          50.8         15.7          226.0    5200.0        1
    81           42.9         17.6          196.0    4700.0        0
    310          49.7         18.6          195.0    3600.0        2
    207          45.0         15.4          220.0    5050.0        1
    69           41.8         19.4          198.0    4450.0        0
    197          43.6         13.9          217.0    4900.0        1
    32           39.5         17.8          188.0    3300.0        0
    221          50.7         15.0          223.0    5550.0        1
    204          45.1         14.4          210.0    4400.0        1
    334          50.2         18.8          202.0    3800.0        2

The dataset contains the following columns:
* **CulmenLength**: The length in mm of the penguin's culmen (bill).
* **CulmenDepth**: The depth in mm of the penguin's culmen.
* **FlipperLength**: The length in mm of the penguin's flipper.
* **BodyMass**: The body mass of the penguin in grams.
* **Species**: An integer value that represents the species of the penguin.

The **Species** column is the label we want to train a model to predict. The dataset includes three possible species, which are encoded as 0, 1, and 2. The code below prints a random sample of rows alongside the species name that each code represents:

In [None]:
penguin_classes = ['Adelie', 'Gentoo', 'Chinstrap']
print(sample.columns[0:5].values, 'SpeciesName')
for index, row in penguins.sample(10).iterrows():
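    # Use .iloc for positional access; row[0] on a label-indexed Series is deprecated in recent pandas
    print('[', row.iloc[0], row.iloc[1], row.iloc[2], row.iloc[3], int(row.iloc[4]), ']', penguin_classes[int(row.iloc[4])])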

    ['CulmenLength' 'CulmenDepth' 'FlipperLength' 'BodyMass' 'Species'] SpeciesName
    [ 48.4 16.3 220.0 5400.0 1 ] Gentoo
    [ 34.0 17.1 185.0 3400.0 0 ] Adelie
    [ 51.3 14.2 218.0 5300.0 1 ] Gentoo
    [ 51.7 20.3 194.0 3775.0 2 ] Chinstrap
    [ 41.1 19.0 182.0 3425.0 0 ] Adelie
    [ 44.1 18.0 210.0 4000.0 0 ] Adelie
    [ 49.5 16.1 224.0 5650.0 1 ] Gentoo
    [ 41.7 14.7 210.0 4700.0 1 ] Gentoo
    [ 49.2 15.2 221.0 6300.0 1 ] Gentoo
    [ 38.8 20.0 190.0 3950.0 0 ] Adelie
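The loop above is fine for printing a few rows. If you want the names stored as a column instead, one possible approach is to map the integer codes with pandas, as in the sketch below. The small `demo` frame and the `SpeciesName` column are hypothetical stand-ins (the values are copied from the sample output above), and the class order is the one given in `penguin_classes`.

```python
import pandas as pd

# Class names in code order, as listed in the notebook: 0, 1, 2
penguin_classes = ['Adelie', 'Gentoo', 'Chinstrap']

# A tiny hand-made frame standing in for the real data
demo = pd.DataFrame({
    'CulmenLength': [48.4, 34.0, 51.7],
    'Species': [1, 0, 2],
})

# Map each integer code to its name without looping over rows
code_to_name = dict(enumerate(penguin_classes))
demo['SpeciesName'] = demo['Species'].map(code_to_name)
print(demo)
```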

Now that we know what the features and labels in the data represent, let's explore the dataset. First, let's see if there are any missing (null) values.

In [None]:
# Count the number of null values for each column
penguins.isnull().sum()

    CulmenLength     2
    CulmenDepth      2
    FlipperLength    2
    BodyMass         2
    Species          0
    dtype: int64

In [None]:
# Show rows containing nulls
penguins[penguins.isnull().any(axis=1)]

         CulmenLength  CulmenDepth  FlipperLength  BodyMass  Species
    3             NaN          NaN            NaN       NaN        0
    271           NaN          NaN            NaN       NaN        1

There are two rows that contain no feature values at all (NaN stands for "not a number"), so these won't be useful in training a model. Let's discard them from the dataset.

In [None]:
# Drop rows containing NaN values
penguins = penguins.dropna()
# Confirm there are now no nulls
penguins.isnull().sum()
penguins.isnull().sum()


    CulmenLength     0
    CulmenDepth      0
    FlipperLength    0
    BodyMass         0
    Species          0
    dtype: int64
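Calling `dropna()` with its defaults removes every row that has at least one null, which is what we want here because the two affected rows have no feature values at all. In a dataset where some rows are only partly incomplete, you might prefer to drop just the rows with no features. The sketch below compares the two behaviors on a small hypothetical frame; `demo`, `feature_cols`, and the values are made up for illustration.

```python
import numpy as np
import pandas as pd

feature_cols = ['CulmenLength', 'CulmenDepth', 'FlipperLength', 'BodyMass']

# Hypothetical rows: one complete, one with no features, one missing a single value
demo = pd.DataFrame({
    'CulmenLength': [39.5, np.nan, 45.1],
    'CulmenDepth': [17.8, np.nan, 14.4],
    'FlipperLength': [188.0, np.nan, np.nan],
    'BodyMass': [3300.0, np.nan, 4400.0],
    'Species': [0, 0, 1],
})

# dropna() with defaults removes any row that has at least one null
strict = demo.dropna()

# how='all' with subset= removes only rows where every listed feature is null
lenient = demo.dropna(subset=feature_cols, how='all')

print(len(strict))
print(len(lenient))
```

Rows kept by the lenient version still contain nulls, so they would need further handling (for example, imputation) before training a model.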