<center>
    <img src="https://www.clearlyrated.com/brand-logo/talent-path" width="300" alt="Talent Path logo"  />
</center>

Instructor and author: [_Dr. Junaid Qazi_](https://www.linkedin.com/in/jqazi/)

<a id='L7-Logistic-Regression-Multiclass-Classification'></a>

# L7: Logistic Regression -- Multiclass Classification{-}

* [(L7-1) The dataset, EDA and preprocessing](#(L7-1)-The-dataset-EDA-and-preprocessing)
* [(L7-2) One vs rest](#(L7-2)-One-vs-rest)
* [(L7-3) Multinomial](#(L7-3)-multinomial)
* [(L7-4) Predicted probabilities](#(L7-4)-Predicted-probabilities)
* [(L7-5) Readings](#(L7-5)-Readings)

Logistic regression is one of the most popular and widely used classification algorithms. In its basic form it is limited to binary classification problems. However, it can also be used for multi-class classification through extensions such as one-vs-rest (`ovr`) and multinomial.

* In one-vs-rest, the problem is first transformed into multiple binary classification problems, and under the hood, a separate binary classifier is trained for each class.

* In the multinomial approach, the solver learns a true multinomial logistic regression model ([4.3.4: Pattern Recognition and Machine Learning by Christopher M. Bishop](http://users.isr.ist.utl.pt/~wurmd/Livros/school/Bishop%20-%20Pattern%20Recognition%20And%20Machine%20Learning%20-%20Springer%20%202006.pdf)), and in this case the probability estimates should be better calibrated than with one-vs-rest. The [cross-entropy error/loss function](https://en.wikipedia.org/wiki/Cross_entropy) (eq. 4.108 in 4.3.4), which follows from maximum likelihood estimation, natively supports multi-class classification problems.
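To make the multinomial idea concrete, here is a minimal sketch of the softmax function and the cross-entropy loss for a single observation. The scores are hypothetical numbers chosen for illustration, not values from any fitted model, and the helper names `softmax` and `cross_entropy` are our own.

```python
import math

def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    m = max(scores)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, true_class):
    """Negative log-probability assigned to the correct class."""
    return -math.log(probs[true_class])

# hypothetical linear scores (w_k . x + b_k) for three classes
scores = [2.0, 1.0, 0.1]
probs = softmax(scores)
loss = cross_entropy(probs, true_class=0)

print("probabilities:", probs)
print("cross-entropy loss:", loss)
```

Fitting the multinomial model amounts to choosing the weights that minimize this loss summed over all training observations, which is the same as maximizing the likelihood.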

Let's work with a multi-class classification problem using both of the above mentioned extensions in logistic regression. 

In [1]:
import pandas as pd; import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid') # just optional!
%matplotlib inline

#Setting display format to retina in matplotlib to see better quality images.
from IPython.display import set_matplotlib_formats
set_matplotlib_formats('retina')

# Lines below are just to ignore warnings
import warnings; warnings.filterwarnings('ignore')

We will be working with the very famous [iris dataset](https://archive.ics.uci.edu/ml/datasets/iris). The task is to classify three flower species based on the given features.

<a id='(L7-1)-The-dataset-EDA-and-preprocessing'></a>

## (L7-1) The dataset, EDA and preprocessing {-}

In [2]:
iris_df = pd.read_csv("https://raw.githubusercontent.com/junaidqazi/\
DataSets_Practice_ScienceAcademy/master/Iris.csv")

In [3]:
iris_df.head(2)

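|   | Id | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm | Species |
|---|----|---------------|--------------|---------------|--------------|---------|
| 0 | 1 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 2 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |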


The `Id` column is redundant, so we can drop it. We can also convert the `Species` column to categorical codes. Let's do this!

In [4]:
iris_df.drop('Id', axis=1, inplace=True) # dropping Id column
iris_df['target']=iris_df.Species.astype('category').cat.codes # creating codes for categories in target column
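It helps to know which integer code stands for which species. The sketch below mimics in plain Python what `cat.codes` does for string labels: the distinct labels are sorted and numbered from 0. The short `species` list is a made-up sample for illustration.

```python
# a small made-up sample of labels, deliberately out of order
species = ["Iris-virginica", "Iris-setosa", "Iris-versicolor", "Iris-setosa"]

# string categories are ordered lexically, then numbered from 0
categories = sorted(set(species))
code_of = {name: code for code, name in enumerate(categories)}
codes = [code_of[name] for name in species]

print(code_of)
print(codes)

# With the DataFrame from this notebook, the same mapping can be read off with:
# dict(enumerate(iris_df.Species.astype('category').cat.categories))
```

So `0`, `1` and `2` in the `target` column correspond to setosa, versicolor and virginica, in that order.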

In [5]:
iris_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   SepalLengthCm  150 non-null    float64
 1   SepalWidthCm   150 non-null    float64
 2   PetalLengthCm  150 non-null    float64
 3   PetalWidthCm   150 non-null    float64
 4   Species        150 non-null    object 
 5   target         150 non-null    int8   
dtypes: float64(4), int8(1), object(1)
memory usage: 6.1+ KB


In [6]:
iris_df.target.value_counts()

2    50
1    50
0    50
Name: target, dtype: int64

All good, there is no missing data. The `Species` and `target` columns hold the same information in different types (text labels and integer codes). We need only one of them, and we will handle this when separating the features from the target.

The target column has balanced class distribution, 50 observations from each class!

#### Some required imports from scikit-learn {-}

In [7]:
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler,MinMaxScaler
from sklearn import metrics

#### Separating features and the target {-}

The classes are completely balanced, so we use the full dataset as it is and simply separate the predictors from the target.

In [8]:
# Species and target encode the same labels, so neither belongs in the features
X = iris_df.drop(['Species', 'target'], axis=1) # features or predictors
y = iris_df.target # target

#### Feature scaling {-}

In [9]:
#scaler = StandardScaler()
scaler = MinMaxScaler()
X = scaler.fit_transform(X)
# You can try StandardScaler instead and re-run the model!
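For intuition, min-max scaling with the default range maps each feature to [0, 1] using `(x - min) / (max - min)`, computed column by column. A minimal sketch of that formula on a few illustrative values (the function name `min_max_scale` is our own, and a constant column would need special handling to avoid dividing by zero):

```python
def min_max_scale(column):
    """Rescale a list of numbers to the range [0, 1]."""
    lo, hi = min(column), max(column)
    return [(value - lo) / (hi - lo) for value in column]

# a few sepal lengths in cm, for illustration only
sepal_length = [5.1, 4.9, 7.0, 6.3]
scaled = min_max_scale(sepal_length)

print(scaled)
```

The smallest value becomes 0, the largest becomes 1, and everything else lands in between while keeping its relative position.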

#### Machine learning {-}

As we know, this is a multi-class classification problem, so let's explicitly compare the `ovr` (one-vs-rest) and `multinomial` approaches.

*Use `<shift-tab>` to explore the documentation and see what happens when `multi_class='auto'` and the problem is multi-class.*

[Go to: L7: Logistic Regression -- Multiclass Classification](#L7-Logistic-Regression-Multiclass-Classification)

<a id='(L7-2)-One-vs-rest'></a>

## (L7-2) One-vs-rest {-}

In [10]:
# Creating model instances
logR_ovr = LogisticRegression(multi_class='ovr') # one-vs-rest
# fitting the model
logR_ovr.fit(X,y)
# Accuracy Score 
print("Score when multi_class='ovr':",logR_ovr.score(X,y))

Score when multi_class='ovr': 0.8933333333333333


Let's look at the confusion matrix and the classification report to explore the model's performance a little more.

In [11]:
# predictions 
pred_ovr = logR_ovr.predict(X)
# Confusion matrix and classification report
print("multi_class='ovr'\n")
print(metrics.confusion_matrix(y,pred_ovr,labels=[0, 1, 2]))
print(metrics.classification_report(y,pred_ovr,labels=[0, 1, 2]))

multi_class='ovr'

[[50  0  0]
 [ 0 37 13]
 [ 0  3 47]]
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        50
           1       0.93      0.74      0.82        50
           2       0.78      0.94      0.85        50

    accuracy                           0.89       150
   macro avg       0.90      0.89      0.89       150
weighted avg       0.90      0.89      0.89       150
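The numbers in the classification report can be recomputed by hand from the confusion matrix, which is a useful check on how to read it. A minimal sketch using the matrix printed above (rows are true classes, columns are predicted classes):

```python
# confusion matrix from the one-vs-rest model above
cm = [[50, 0, 0],
      [0, 37, 13],
      [0, 3, 47]]

n = len(cm)
precision, recall, f1 = {}, {}, {}
for k in range(n):
    tp = cm[k][k]                                  # correctly predicted as class k
    predicted_k = sum(cm[i][k] for i in range(n))  # column sum: all predicted as k
    actual_k = sum(cm[k])                          # row sum: all truly class k
    precision[k] = tp / predicted_k
    recall[k] = tp / actual_k
    f1[k] = 2 * precision[k] * recall[k] / (precision[k] + recall[k])
    print(f"class {k}: precision={precision[k]:.3f} "
          f"recall={recall[k]:.3f} f1={f1[k]:.3f}")

# accuracy is the diagonal total divided by the number of observations
accuracy = sum(cm[k][k] for k in range(n)) / sum(sum(row) for row in cm)
print(f"accuracy={accuracy:.4f}")
```

Reading the matrix this way shows where the errors are: 13 observations of class 1 are predicted as class 2, which is what pulls the recall of class 1 down to 0.74.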



[Go to: L7: Logistic Regression -- Multiclass Classification](#L7-Logistic-Regression-Multiclass-Classification)

<a id='(L7-3)-multinomial'></a>

## (L7-3) Multinomial {-}

In [12]:
# Creating model instances
logR_mul = LogisticRegression(multi_class='multinomial') # multinomial
# fitting the model
logR_mul.fit(X,y)
# Accuracy score 
print("Score when multi_class='multinomial':",logR_mul.score(X,y))

Score when multi_class='multinomial': 0.94


Well, the `multinomial` model gives the better accuracy here (0.94 vs. 0.89)! Keep in mind that both scores are computed on the same data the models were fitted on.

Let's look at the confusion matrix and the classification report to explore the model's performance a little more.

In [13]:
# predictions 
pred_mul = logR_mul.predict(X)

# Confusion matrix and classification report
print("multi_class='multinomial'\n")
print(metrics.confusion_matrix(y,pred_mul,labels=[0, 1, 2]))
print(metrics.classification_report(y,pred_mul,labels=[0, 1, 2]))

multi_class='multinomial'

[[50  0  0]
 [ 0 45  5]
 [ 0  4 46]]
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        50
           1       0.92      0.90      0.91        50
           2       0.90      0.92      0.91        50

    accuracy                           0.94       150
   macro avg       0.94      0.94      0.94       150
weighted avg       0.94      0.94      0.94       150
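The scores and reports so far are computed on the same observations that were used for fitting, so they say little about unseen data. Below is a hedged sketch of the same comparison on a held-out split. It is not part of the original notebook and makes several assumptions: it loads the copy of iris bundled with scikit-learn instead of the CSV, it uses a 70/30 stratified split with an arbitrary `random_state`, and it builds one-vs-rest with the `OneVsRestClassifier` wrapper rather than the `multi_class` argument, because recent scikit-learn releases deprecate that argument. A plain `LogisticRegression()` with the default solver fits the multinomial model for a multi-class target.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MinMaxScaler

# bundled copy of the iris data (assumption: used here instead of the CSV)
X_all, y_all = load_iris(return_X_y=True)

# hold out 30% of the rows, keeping the class proportions equal in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X_all, y_all, test_size=0.3, stratify=y_all, random_state=42)

# fit the scaler on the training part only, then apply it to both parts
scaler = MinMaxScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

# one-vs-rest: one binary logistic regression per class
ovr_model = OneVsRestClassifier(LogisticRegression()).fit(X_train_s, y_train)
# multinomial: a single softmax model over all classes
mul_model = LogisticRegression().fit(X_train_s, y_train)

ovr_score = ovr_model.score(X_test_s, y_test)
mul_score = mul_model.score(X_test_s, y_test)
print("held-out accuracy, one-vs-rest:", ovr_score)
print("held-out accuracy, multinomial:", mul_score)
```

With only 45 held-out rows the two scores can move by a few points from one split to another, so a single split should not be read as a firm ranking of the two approaches.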



**A little more on multiclass logistic regression (optional read)**


Multiclass logistic regression is also known as polytomous logistic regression, multinomial logistic regression, softmax regression, multinomial logit (mlogit), the maximum entropy (MaxEnt) classifier, and the conditional maximum entropy model.

Multinomial logistic regression assumes that the odds of preferring one class over another do not depend on the presence or absence of other, irrelevant alternatives. The model choices are therefore [independent of irrelevant alternatives](https://en.wikipedia.org/wiki/Independence_of_irrelevant_alternatives), which is a core hypothesis in rational choice theory. A typical example comes from predicting animals in images: the relative probabilities of predicting a horse or a cat do not change if a lion is added as another possible class.

* _if class_0 is preferred over class_1 in the choice set {class_0, class_1}, then adding another choice, class_2, must not make class_1 preferable to class_0 in the new choice set {class_0, class_1, class_2}._

This allows the choice among K alternatives to be modeled as a set of K-1 independent binary choices, in which one alternative is chosen as a "pivot" and the other K-1 are compared against it, one at a time.
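Both properties can be checked numerically with the softmax function. The sketch below uses made-up scores for the three animals: the horse-to-cat probability ratio stays the same after a lion is added, and subtracting the pivot's score from every score leaves the probabilities unchanged.

```python
import math

def softmax(scores):
    """Map a dict of class scores to a dict of class probabilities."""
    exps = {name: math.exp(value) for name, value in scores.items()}
    total = sum(exps.values())
    return {name: e / total for name, e in exps.items()}

# hypothetical scores, chosen only for illustration
two_classes = softmax({"horse": 1.2, "cat": 0.4})
three_classes = softmax({"horse": 1.2, "cat": 0.4, "lion": 2.0})

# independence of irrelevant alternatives: the ratio ignores the lion
ratio_two = two_classes["horse"] / two_classes["cat"]
ratio_three = three_classes["horse"] / three_classes["cat"]
print(ratio_two, ratio_three)

# pivot view: measure every score relative to the pivot class ("cat")
scores = {"horse": 1.2, "cat": 0.4, "lion": 2.0}
relative = {name: value - scores["cat"] for name, value in scores.items()}
pivot_probs = softmax(relative)
print(pivot_probs)
```

The pivot's own relative score is always 0, which is why only K-1 sets of coefficients are needed to describe K classes.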

[Go to: L7: Logistic Regression -- Multiclass Classification](#L7-Logistic-Regression-Multiclass-Classification)

<a id='(L7-4)-Predicted-probabilities'></a>

## (L7-4) Predicted probabilities {-}

We can also get the predicted probability of each class for every observation with `predict_proba`.

In [14]:
# probabilities when using one-vs-rest
print(logR_ovr.predict_proba(X)[:2])

[[0.81748599 0.17029348 0.01222053]
 [0.73774929 0.25155208 0.01069863]]


In [15]:
# probabilities when using multinomial
print(logR_mul.predict_proba(X)[:2])

[[0.89388834 0.10251261 0.00359905]
 [0.83386212 0.16201348 0.0041244 ]]
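The two models turn per-class linear scores into probabilities in different ways, which is why the rows above differ. As a rough sketch: one-vs-rest passes each class score through a sigmoid independently and then rescales the row to sum to 1, while the multinomial model applies a softmax across all scores at once. The scores below are hypothetical, and in practice the two fitted models would not share the same scores either.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def ovr_probabilities(scores):
    """Independent sigmoid per class, then rescale so the row sums to 1."""
    raw = [sigmoid(s) for s in scores]
    total = sum(raw)
    return [r / total for r in raw]

def softmax_probabilities(scores):
    """Joint normalization across all class scores."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# hypothetical decision scores of one observation for classes 0, 1, 2
scores = [1.5, -0.2, -3.0]
p_ovr = ovr_probabilities(scores)
p_mul = softmax_probabilities(scores)

print("one-vs-rest:", p_ovr)
print("multinomial:", p_mul)

# the predicted class is the one with the highest probability
predicted_ovr = p_ovr.index(max(p_ovr))
predicted_mul = p_mul.index(max(p_mul))
print(predicted_ovr, predicted_mul)
```

Both rows sum to 1, but only the softmax row comes from a single jointly fitted probability model.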


We can also create ROC curves for our multi-class problem. A ROC curve is defined for a binary problem, so we treat each class in turn as the positive class against the rest, which gives one curve per class. The readings may be useful if you want to create ROC curves for a multi-class classification problem.
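A minimal sketch of the one-vs-rest idea behind multi-class ROC curves: for each class, take its probability column as the score, treat that class as positive and the others as negative, and compute the false and true positive rates at a few thresholds. The labels and probabilities below are made up for illustration, and `roc_point` is our own helper. In practice `sklearn.metrics.roc_curve` together with `sklearn.preprocessing.label_binarize` does this over all thresholds.

```python
def roc_point(y_true, scores, positive_class, threshold):
    """Return (FPR, TPR) with one class as positive and the rest as negative."""
    tp = fp = fn = tn = 0
    for label, score in zip(y_true, scores):
        actual_pos = (label == positive_class)
        predicted_pos = (score >= threshold)
        if actual_pos and predicted_pos:
            tp += 1
        elif actual_pos:
            fn += 1
        elif predicted_pos:
            fp += 1
        else:
            tn += 1
    return fp / (fp + tn), tp / (tp + fn)

# hypothetical true labels and predicted probabilities (one row per observation)
y_true = [0, 0, 1, 1, 2, 2]
proba = [[0.80, 0.15, 0.05],
         [0.60, 0.30, 0.10],
         [0.20, 0.50, 0.30],
         [0.10, 0.45, 0.45],
         [0.05, 0.35, 0.60],
         [0.10, 0.20, 0.70]]

curves = {}
for k in range(3):
    scores_k = [row[k] for row in proba]  # probability column of class k
    curves[k] = [roc_point(y_true, scores_k, k, t) for t in (0.25, 0.5, 0.75)]
    print(f"class {k} vs rest (FPR, TPR):", curves[k])
```

Plotting the points of each class, FPR on the x-axis and TPR on the y-axis, gives the three curves.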

[Go to: L7: Logistic Regression -- Multiclass Classification](#L7-Logistic-Regression-Multiclass-Classification)

<a id='(L7-5)-Readings'></a>

<a id='(L7-6)-Code-examples'></a>

## (L7-5) Extra reading and resources {-}

* [**Data Science from Scratch -- Part 1: Advance Analytics**](https://leanpub.com/data-science-from-scratch)

* [**Data Science from Scratch -- Part 2: Business Machine Learning**](https://leanpub.com/datascience-from-scratch-p2-business-machine-learning/c/r1W4Bml3Zqr6)


## License

Author: [___Dr. Junaid Qazi___](https://www.linkedin.com/in/jqazi/)<br>
Twitter: [***@JunaidSQazi***](https://twitter.com/JunaidSQazi)

Copyright 2021

Licensed under the [Apache License, Version 2.0](http://www.apache.org/licenses/LICENSE-2.0) (the "License").<br>you may not use this file except in compliance with the License.
You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0

*Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. Please see the License for the specific language governing permissions and limitations under the License.*


*This is not an official product but sample code provided for an educational purpose.*

***Acknowledgement is requested***