# Machine Learning - Classifiers - logisticRegression - Multiclass - Iris.csv

---

This notebook practices several ML algorithms on a well-known dataset, iris.csv.
- Note that the dataset is an asset in the Watson project "Machine learning exercices" and is loaded into the notebook from its location.
- Note that this is a multiclass classification problem.

We will practice using:
- Logistic regression --> we will use different partitions of the dataset to get the best possible metrics and model
- Decision tree
- SVM
- KNN

We will calculate the following metrics:
- jaccard similarity score
- f1_score
- confusion matrix

### 0. Loading, first analysis, pre-processing and wrangling of the data

The following code is a "template" generated by Watson Studio when asked to load a dataset (in this case iris.csv) that is stored as an asset in the project. It loads the data into a dataframe that we can later inspect to understand it better.

In [161]:
# The code was removed by Watson Studio for sharing.

Unnamed: 0,sepal.length,sepal.width,petal.length,petal.width,variety
0,5.1,3.5,1.4,0.2,Setosa
1,4.9,3.0,1.4,0.2,Setosa
2,4.7,3.2,1.3,0.2,Setosa
3,4.6,3.1,1.5,0.2,Setosa
4,5.0,3.6,1.4,0.2,Setosa
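
Since the Watson Studio loading code was removed, the following is only a minimal sketch of an equivalent load with pandas. It assumes a CSV with the same column layout as the output above; the in-memory text (`csv_text`, a hypothetical stand-in built from the five rows shown) replaces the project asset so that the snippet runs on its own.

```python
import io

import pandas as pd

# Hypothetical stand-in for the project asset: the five rows shown above,
# in the same column layout as iris.csv
csv_text = """sepal.length,sepal.width,petal.length,petal.width,variety
5.1,3.5,1.4,0.2,Setosa
4.9,3.0,1.4,0.2,Setosa
4.7,3.2,1.3,0.2,Setosa
4.6,3.1,1.5,0.2,Setosa
5.0,3.6,1.4,0.2,Setosa
"""

# With a local copy of the file, this would be pd.read_csv("iris.csv")
df_iris = pd.read_csv(io.StringIO(csv_text))
print(df_iris.head())
```

With the real asset the dataframe would hold all 150 rows instead of the five used here.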


In [162]:
# Check how big our dataset is (rows, columns)
df_iris.shape

(150, 5)

In [163]:
# Check how the samples are distributed across the classes
df_iris["variety"].value_counts() 

Setosa        50
Versicolor    50
Virginica     50
Name: variety, dtype: int64

In [164]:
# Check the type in which each variable is stored in the dataset
df_iris.dtypes

sepal.length    float64
sepal.width     float64
petal.length    float64
petal.width     float64
variety          object
dtype: object

In [165]:
# It is convenient to have the label to predict, "variety", as a numeric value (int)
df_iris.replace({'Setosa':0,'Versicolor':1,'Virginica':2},inplace=True)
df_iris.head()

Unnamed: 0,sepal.length,sepal.width,petal.length,petal.width,variety
0,5.1,3.5,1.4,0.2,0
1,4.9,3.0,1.4,0.2,0
2,4.7,3.2,1.3,0.2,0
3,4.6,3.1,1.5,0.2,0
4,5.0,3.6,1.4,0.2,0


In [166]:
# Now "variety" is stored as an int
df_iris.dtypes

sepal.length    float64
sepal.width     float64
petal.length    float64
petal.width     float64
variety           int64
dtype: object
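
As a possible alternative to a dataframe-wide `replace`, the encoding can be restricted to the label column and paired with an inverse mapping, which helps when predictions have to be turned back into variety names. This is only a sketch on a toy dataframe; the names `label_to_code` and `code_to_label` and the toy feature values are illustrative and not part of the original notebook.

```python
import pandas as pd

# Toy dataframe with the same label column as df_iris (feature values are illustrative)
df = pd.DataFrame({
    "petal.width": [0.2, 1.4, 2.5],
    "variety": ["Setosa", "Versicolor", "Virginica"],
})

# Same encoding as in the notebook, kept in a dictionary so it can be inverted
label_to_code = {"Setosa": 0, "Versicolor": 1, "Virginica": 2}
code_to_label = {code: label for label, code in label_to_code.items()}

# Encode only the label column, leaving the feature columns untouched
df["variety"] = df["variety"].map(label_to_code)

# Decode numeric labels (e.g. model predictions) back into names
decoded = df["variety"].map(code_to_label)
print(df["variety"].tolist(), decoded.tolist())
```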

Once we have done this first analysis, pre-processing and wrangling of the dataset, we can build a model using different ML algorithms.

### 1. Logistic regression - Multiclass

In [167]:
# We import the modules, classes and functions that will be used
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
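# Note: jaccard_similarity_score was deprecated in scikit-learn 0.21 and removed in 0.23 (newer versions provide jaccard_score)
from sklearn.metrics import jaccard_similarity_score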
from sklearn.metrics import f1_score
from sklearn.metrics import confusion_matrix

In [168]:
# We convert the dataframe into a NumPy array
iris = df_iris.to_numpy()

In [171]:
# X holds the four feature columns, standardized to zero mean and unit variance
X = iris[:,0:4]
X = StandardScaler().fit(X).transform(X)

# The dependent variable to predict is stored in y
y = iris[:,4]

In [172]:
# We partition X and y into training and test data:
# 20% of the samples are held out for testing and are picked randomly (shuffle=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=True)
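
A possible refinement of the split, sketched here under assumptions: `stratify` keeps the three classes balanced in the test set, `random_state` makes the partition repeatable, and the scaler is fitted on the training rows only so that no statistics from the test rows reach the model. The copy of the iris data bundled with scikit-learn (`load_iris`) stands in for the project asset, and the value 42 is arbitrary.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Stand-in for the notebook's data: the iris copy bundled with scikit-learn
X_raw, y_demo = load_iris(return_X_y=True)

# Stratified, repeatable 80/20 split (random_state=42 is an arbitrary choice)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_raw, y_demo, test_size=0.2, shuffle=True, stratify=y_demo, random_state=42
)

# Fit the scaler on the training rows only, then apply it to both partitions
scaler = StandardScaler().fit(X_tr)
X_tr_s = scaler.transform(X_tr)
X_te_s = scaler.transform(X_te)

# Number of test samples per class
print(np.bincount(y_te))
```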

In [173]:
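# We create the logistic regression model for this multiclass case and train it on the training partition only,
# so that the test samples used for the metrics are not seen during training
clf = LogisticRegression(multi_class='multinomial', solver='lbfgs').fit(X_train, y_train)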

In [174]:
# We use the model to predict the labels (y_hat) of the test set
y_hat = clf.predict(X_test)
y_hat

array([2., 0., 1., 2., 2., 1., 2., 0., 1., 1., 2., 2., 2., 1., 2., 0., 2.,
       1., 0., 1., 0., 0., 2., 2., 1., 1., 2., 2., 0., 2.])

In [175]:
# We calculate metrics: jaccard similarity score
j_score = jaccard_similarity_score(y_test, y_hat)
j_score

1.0
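
For multiclass labels, the legacy `jaccard_similarity_score` was documented as equivalent to plain accuracy, so it does not show which class is misclassified. The sketch below, written with NumPy only and on made-up toy labels (not the notebook's data), contrasts accuracy with a per-class Jaccard index, TP / (TP + FP + FN). The helper `jaccard_per_class` is a hypothetical name, and it assumes every label occurs at least once in `y_true` or `y_pred`.

```python
import numpy as np

# Toy labels for illustration only (not the notebook's data)
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 0, 1, 2, 2, 2])

# Fraction of samples predicted correctly
accuracy = np.mean(y_true == y_pred)

def jaccard_per_class(y_true, y_pred, labels):
    """Jaccard index TP / (TP + FP + FN), computed separately for each label."""
    scores = []
    for c in labels:
        tp = np.sum((y_true == c) & (y_pred == c))
        fp = np.sum((y_true != c) & (y_pred == c))
        fn = np.sum((y_true == c) & (y_pred != c))
        scores.append(tp / (tp + fp + fn))
    return np.array(scores)

per_class = jaccard_per_class(y_true, y_pred, labels=[0, 1, 2])
print(accuracy, per_class)
```

Here one Versicolor sample (1) is predicted as Virginica (2), which lowers the index of both classes while class 0 stays at 1.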

In [176]:
# We calculate metrics: f1_score
f1score = f1_score(y_test,y_hat, average = 'macro')
f1score

1.0
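
With `average = 'macro'`, the F1 score is computed per class and then averaged without weighting, so each variety counts equally regardless of how many test samples it has. A minimal NumPy sketch of that calculation on made-up toy labels follows; `macro_f1` is a hypothetical helper, not a scikit-learn function.

```python
import numpy as np

def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of the per-class F1 scores, F1 = 2TP / (2TP + FP + FN)."""
    f1_scores = []
    for c in labels:
        tp = np.sum((y_true == c) & (y_pred == c))
        fp = np.sum((y_true != c) & (y_pred == c))
        fn = np.sum((y_true == c) & (y_pred != c))
        denom = 2 * tp + fp + fn
        # A class that never appears in y_true or y_pred gets an F1 of 0 here
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return float(np.mean(f1_scores))

# Toy labels for illustration only (not the notebook's data)
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 0, 1, 2, 2, 2])

score = macro_f1(y_true, y_pred, labels=[0, 1, 2])
print(score)
```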

In [177]:
# We display the confusion matrix to better understand where the model works better or worse
# Remember that 0 = Setosa / 1 = Versicolor / 2 = Virginica
# Also remember that a) the diagonal holds the correctly classified cases and b) rows (Y axis) are true labels and columns (X axis) are predicted labels
cm = confusion_matrix(y_test, y_hat)
cm

array([[ 7,  0,  0],
       [ 0,  9,  0],
       [ 0,  0, 14]])
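
The matrix above can be read per class: dividing the diagonal by the row sums gives recall, and dividing it by the column sums gives precision. The following sketch does this with NumPy, re-entering the matrix shown above by hand.

```python
import numpy as np

# Confusion matrix from the output above (rows: true labels, columns: predicted labels)
cm = np.array([[7, 0, 0],
               [0, 9, 0],
               [0, 0, 14]])

support = cm.sum(axis=1)                   # true samples per class in the test set
recall = np.diag(cm) / cm.sum(axis=1)      # correct / all samples that truly belong to the class
precision = np.diag(cm) / cm.sum(axis=0)   # correct / all samples predicted as the class

print(support, recall, precision)
```

The row sums also show that this particular test partition is unbalanced: 7 Setosa, 9 Versicolor and 14 Virginica samples out of 30.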

Note that the partition of the data is random on every run, because no random_state is set. The training and test sets therefore change from run to run, and the metrics may differ slightly each time.
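
One common way to reduce the dependence on a single random partition is to average the metric over several stratified folds. The sketch below is an illustration rather than the notebook's method: it uses the iris copy bundled with scikit-learn as a stand-in for the project asset, a pipeline so that scaling is refitted inside each fold, and arbitrary choices for `n_splits`, `random_state` and `max_iter`.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in for the notebook's data: the iris copy bundled with scikit-learn
X_raw, y_demo = load_iris(return_X_y=True)

# Scaling and classifier in one pipeline, so the scaler is refitted on each training fold
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Five stratified folds with a fixed seed (both values are arbitrary choices)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X_raw, y_demo, cv=cv, scoring="f1_macro")

print(scores.mean(), scores.std())
```

The mean gives a steadier estimate than one 80/20 split, and the standard deviation indicates how much the score moves between partitions.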

In [178]:
# Note that by default, predict returns the label with the highest probability.
# By extracting the probabilities calculated by the model, we can also see in which cases the model is more or less confident.
y_hat_prob = clf.predict_proba(X_test)
print(y_hat_prob)

[[1.41647421e-04 6.02683696e-02 9.39589983e-01]
 [9.76448379e-01 2.35514987e-02 1.22736265e-07]
 [2.25878373e-02 9.51238700e-01 2.61734630e-02]
 [6.90105962e-07 1.16454554e-02 9.88353855e-01]
 [4.40354143e-03 3.97926093e-01 5.97670365e-01]
 [2.53123857e-02 8.95737342e-01 7.89502719e-02]
 [1.31544485e-03 1.51679739e-01 8.47004816e-01]
 [9.69164345e-01 3.08355526e-02 1.02282145e-07]
 [4.83959162e-03 7.81835561e-01 2.13324848e-01]
 [1.22109102e-03 8.58436334e-01 1.40342575e-01]
 [1.31721990e-04 3.94707201e-02 9.60397558e-01]
 [1.49158507e-05 6.22489357e-03 9.93760191e-01]
 [2.34402968e-04 1.03370428e-01 8.96395169e-01]
 [5.65067303e-04 6.53307954e-01 3.46126978e-01]
 [4.89817314e-05 6.10155547e-02 9.38935464e-01]
 [9.90904919e-01 9.09501238e-03 6.83771356e-08]
 [3.02762672e-03 3.65794534e-01 6.31177839e-01]
 [7.01345191e-03 8.84189806e-01 1.08796742e-01]
 [9.55910743e-01 4.40889390e-02 3.18318808e-07]
 [2.67063814e-02 9.59447225e-01 1.38463938e-02]
 [9.73723179e-01 2.62767049e-02 1.163004