<a href="https://colab.research.google.com/github/twisha-k/Python_notes/blob/main/80_coding.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lesson 80: Logistic Regression - Multiclass Classification I

### Teacher-Student Activities

So far you have learnt to build a logistic regression model that distinguishes between only two labels. In some problems, however, a sample can belong to one of more than two labels; classifying such samples is called multiclass classification. To practise it, we are going to solve another problem statement in which we have to classify different types of glass based on their chemical and physical composition. Let's call this project glass-type classification.

Also, in this class we will learn to create graphs with Plotly.

**Dataset Description:**

The dataset used in this problem statement involves the classification of samples of different glasses based on their physical and chemical properties. The properties recorded for each sample are as follows:

1. **RI:** Refractive Index

2. **Na:** Sodium

3. **Mg:** Magnesium

4. **Al:** Aluminum

5. **Si:** Silicon

6. **K:** Potassium

7. **Ca:** Calcium

8. **Ba:** Barium

9. **Fe:** Iron

The chemical compositions are measured as the weight per cent in their corresponding oxides such as $\text{Na}_2\text{O}$, $\text{Al}_2\text{O}_3$, $\text{Si}\text{O}_2$ etc.

There are seven types (classes or labels) of glass listed; they are:

* **Class 1:** used for making building windows (float processed)

* **Class 2:** used for making building windows (non-float processed)

* **Class 3:** used for making vehicle windows (float processed)

* **Class 4:** used for making vehicle windows (non-float processed)

* **Class 5:** used for making containers

* **Class 6:** used for making tableware

* **Class 7:** used for making headlamps

Float glass is named after the process used to make it: the molten glass is poured onto a bath of molten tin, on which it floats and spreads out into a flat sheet. These glasses are used to absorb heat and UV rays.

**Dataset Credits:** https://archive.ics.uci.edu/ml/datasets/Glass+Identification


**Citation:** Dua, D., & Graff, C. (2017). UCI Machine Learning Repository.


---

#### Activity 1: Data Loading

So let's go through the routine steps before we build a logistic regression model and explore the dataset.

Link to the dataset: https://s3-student-datasets-bucket.whjr.online/whitehat-ds-datasets/glass-types.csv

In [None]:
# S1.1: Load the dataset.
# Import the necessary libraries.
import pandas as pd

# Load the dataset.
glass_type_df = pd.read_csv('https://s3-student-datasets-bucket.whjr.online/whitehat-ds-datasets/glass-types.csv')
glass_type_df

Unnamed: 0,1,1.52101,13.64,4.49,1.10,71.78,0.06,8.75,0.00,0.00.1,1.1
0,2,1.51761,13.89,3.60,1.36,72.73,0.48,7.83,0.00,0.00,1
1,3,1.51618,13.53,3.55,1.54,72.99,0.39,7.78,0.00,0.00,1
2,4,1.51766,13.21,3.69,1.29,72.61,0.57,8.22,0.00,0.00,1
3,5,1.51742,13.27,3.62,1.24,73.08,0.55,8.07,0.00,0.00,1
4,6,1.51596,12.79,3.61,1.62,72.97,0.64,8.07,0.00,0.26,1
...,...,...,...,...,...,...,...,...,...,...,...
208,210,1.51623,14.14,0.00,2.88,72.61,0.08,9.18,1.06,0.00,7
209,211,1.51685,14.92,0.00,1.99,73.06,0.00,8.40,1.59,0.00,7
210,212,1.52065,14.36,0.00,2.02,73.42,0.00,8.44,1.64,0.00,7
211,213,1.51651,14.38,0.00,1.94,73.61,0.00,8.48,1.57,0.00,7


As you can see from the output, the data columns have strange headers (or titles). This happens because the CSV file has no header row, so `read_csv()` treated the first row of data as the column headers. Let's load the dataset again without the column headers. For this, you can pass a parameter called `header` to the `read_csv()` function of the `pandas` module and set its value to `None`.

**Syntax:** `pd.read_csv(file_path, header = None)`
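To see the effect of the `header` parameter in isolation, here is a minimal sketch that reads a tiny in-memory CSV instead of the real file. The two rows are copied from the output above (first three columns only), and `io.StringIO` is just a stand-in for the file path.

```python
import io
import pandas as pd

# Two rows copied from the output above; the text has no header line.
csv_text = "1,1.52101,13.64\n2,1.51761,13.89\n"

# Default behaviour: the first line is consumed as the column headers.
with_header = pd.read_csv(io.StringIO(csv_text))

# header=None: every line is data and the columns are numbered 0, 1, 2, ...
without_header = pd.read_csv(io.StringIO(csv_text), header=None)

print(with_header.shape)             # only one data row is left
print(without_header.shape)          # both rows are kept
print(list(without_header.columns))  # integer column labels
```

With the default setting one sample is silently lost to the header, which is why the first load above has one row fewer than the dataset actually contains.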

In [None]:
# S1.2: Load the dataset again without the column headers.
glass_type=pd.read_csv('https://s3-student-datasets-bucket.whjr.online/whitehat-ds-datasets/glass-types.csv',header=None)
glass_type

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10
0,1,1.52101,13.64,4.49,1.10,71.78,0.06,8.75,0.00,0.0,1
1,2,1.51761,13.89,3.60,1.36,72.73,0.48,7.83,0.00,0.0,1
2,3,1.51618,13.53,3.55,1.54,72.99,0.39,7.78,0.00,0.0,1
3,4,1.51766,13.21,3.69,1.29,72.61,0.57,8.22,0.00,0.0,1
4,5,1.51742,13.27,3.62,1.24,73.08,0.55,8.07,0.00,0.0,1
...,...,...,...,...,...,...,...,...,...,...,...
209,210,1.51623,14.14,0.00,2.88,72.61,0.08,9.18,1.06,0.0,7
210,211,1.51685,14.92,0.00,1.99,73.06,0.00,8.40,1.59,0.0,7
211,212,1.52065,14.36,0.00,2.02,73.42,0.00,8.44,1.64,0.0,7
212,213,1.51651,14.38,0.00,1.94,73.61,0.00,8.48,1.57,0.0,7


It seems like the first column might contain the serial numbers for the samples of glasses collected. Let's display the last 10 rows of the first column (indicated by 0) of the dataset.

In [None]:
# S1.3: Display the last 10 rows of the first column (indicated by 0) of the dataset.
glass_type[0].tail(10)

204    205
205    206
206    207
207    208
208    209
209    210
210    211
211    212
212    213
213    214
Name: 0, dtype: int64

So our suspicion was correct. Let's drop this column because we don't need it to build a logistic regression model later.

In [None]:
# S1.4: Drop the 0th column as it contains only the serial numbers.
glass_type.drop(columns=0, inplace=True)

# Get an array of the new set of columns.
glass_type.columns


---

#### Activity 2: Renaming Column Headers

Now let's provide suitable column headers to the dataset so that we know the values of each independent variable for each glass sample. For this, we need to

- Create a Python list containing the suitable column headers as string values. The desired column headers are `'RI', 'Na', 'Mg', 'Al', 'Si', 'K', 'Ca', 'Ba', 'Fe', 'GlassType'` in the same order.

- Create a Python dictionary containing the current column headers and the desired column headers as key-value pairs.

- Change the column headers by calling the `rename()` function on the `pandas` data frame object. The **syntax** to apply the `rename()` function is

  `data_frame_object.rename(python_dictionary, axis = 1)`

  where `python_dictionary` contains the elements as described in the second point and `axis = 1` tells `rename()` to relabel the columns rather than the rows.
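As a quick illustration of the three steps, here is a minimal sketch on a hypothetical three-column toy frame (the names `toy`, `mapping` and `renamed` are made up for this example and are not part of the lesson's dataset).

```python
import pandas as pd

# Toy frame with integer column labels, like the ones produced by header=None.
toy = pd.DataFrame([[1.52, 13.6, 1], [1.51, 13.9, 2]], columns=[1, 2, 3])

# Map each current column label to the desired header.
mapping = {1: "RI", 2: "Na", 3: "GlassType"}

# axis=1 (or, equivalently, columns=mapping) targets the column labels;
# without it, rename() would try to relabel the row index instead.
renamed = toy.rename(mapping, axis=1)

print(list(renamed.columns))
print(list(toy.columns))  # the original frame is unchanged without inplace=True
```

Note that `rename()` returns a new data frame by default; pass `inplace=True` if you want to modify the existing object.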





In [None]:
# S2.1: Create a Python list containing the suitable column headers as string values. Also, create a Python dictionary as described above.
column_headers = ['RI', 'Na', 'Mg', 'Al', 'Si', 'K', 'Ca', 'Ba', 'Fe', 'GlassType']

# Create the required Python dictionary. The remaining columns are labelled 1 to 10, so label 'i' maps to 'column_headers[i - 1]'.
col_dict = {}
for i in glass_type.columns:
  col_dict[i] = column_headers[i - 1]

# Rename the columns in place and display the first five rows of the data frame.
glass_type.rename(col_dict, axis=1, inplace=True)
glass_type.head()

Unnamed: 0,RI,Na,Mg,Al,Si,K,Ca,Ba,Fe,GlassType
0,1.52101,13.64,4.49,1.1,71.78,0.06,8.75,0.0,0.0,1
1,1.51761,13.89,3.6,1.36,72.73,0.48,7.83,0.0,0.0,1
2,1.51618,13.53,3.55,1.54,72.99,0.39,7.78,0.0,0.0,1
3,1.51766,13.21,3.69,1.29,72.61,0.57,8.22,0.0,0.0,1
4,1.51742,13.27,3.62,1.24,73.08,0.55,8.07,0.0,0.0,1


In [None]:
# S2.2: The 'rename()' call above has already renamed the columns. Verify the new headers with the 'info()' function.
glass_type.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 214 entries, 0 to 213
Data columns (total 10 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   RI         214 non-null    float64
 1   Na         214 non-null    float64
 2   Mg         214 non-null    float64
 3   Al         214 non-null    float64
 4   Si         214 non-null    float64
 5   K          214 non-null    float64
 6   Ca         214 non-null    float64
 7   Ba         214 non-null    float64
 8   Fe         214 non-null    float64
 9   GlassType  214 non-null    int64  
dtypes: float64(9), int64(1)
memory usage: 16.8 KB


As you can see, all the column headers are renamed as required.

---

#### Activity 3: Dataset Inspection

Let's look at the kind of values each column holds, the number of rows and columns in the dataset, and whether the dataset has any missing values.

In [None]:
# S3.1: Get the information about the dataset.
glass_type.info()

The output is the same `info()` summary shown at the end of Activity 2. Except for the last column, all the columns have floating-point values. There are 214 rows and 10 columns, and there are no missing values in the dataset because all the columns contain 214 non-null values.

Now let's get the count of samples of each glass type in the dataset.

In [None]:
# S3.2: Get the count of samples of each glass type in the dataset.
glass_type['GlassType'].value_counts()

2    76
1    70
7    29
3    17
5    13
6     9
Name: GlassType, dtype: int64

Notice that there is no count for glass-type `4`. This means the dataset does not have any sample of glass-type `4`.

Also, glass types `2` and `1` are the most common among all the samples and glass-type `6` is the least common. This suggests that the dataset is imbalanced and biased in favour of types `1` and `2`. Let's also calculate the percentage of these values.

In [None]:
# S3.3: Get the percentage of count of each glass-type samples in the dataset.
(glass_type['GlassType'].value_counts())*100/glass_type.shape[0]

2    35.514019
1    32.710280
7    13.551402
3     7.943925
5     6.074766
6     4.205607
Name: GlassType, dtype: float64

Through percentages, we can clearly see the imbalance in the dataset.
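As an aside, `value_counts()` can compute the proportions directly through its `normalize` parameter. The sketch below rebuilds a label series from the class counts shown above (the series `labels` is a stand-in for `glass_type['GlassType']`, not the real column).

```python
import pandas as pd

# Class counts copied from the value_counts() output above.
labels = pd.Series([2] * 76 + [1] * 70 + [7] * 29 + [3] * 17 + [5] * 13 + [6] * 9)

# normalize=True returns fractions that sum to 1; multiply by 100 for percentages.
percentages = labels.value_counts(normalize=True) * 100

print(percentages.round(2))
```

This gives the same values as dividing the counts by the number of rows, without having to refer to `shape[0]` explicitly.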

---

#### Activity 4: Data Visualisation using Plotly

Plotly is another Python library used for data visualisation. We can create various kinds of graphs such as line plots, pie plots and scatter plots using Plotly as well.

**So why should we use Plotly over matplotlib or seaborn?** The reasons are:

- Its hover tool can be used to inspect individual data points, which helps in spotting anomalies among a large number of data points.

- It offers extensive customisation for making interactive visualisations, which can be displayed in Colab/Jupyter notebooks or saved as standalone HTML files.


Let's start by creating a count plot using Plotly.

**Steps:**

1. Import the `plotly.express` module for Plotly features.

2. Group the DataFrame `glass_type` by the column `GlassType` without making it the index column (`as_index=False`) and save the grouping object in a variable.

3. Compute the size of each group with the `size()` function.

>`glass_group_df = glass_group.size()`

> where `glass_group` is the grouping object and `size()` returns a DataFrame containing the number of records in each unique group, saved as `glass_group_df`.

4.  Create the count plot with the `bar()` function of the plotly library. The syntax for the `bar()` function is:

> **Syntax:**  `plotly.express.bar(data_frame, x, y, color)`

> where

  - `data_frame` : parameter requires the name of the dataframe with the distribution of values
  - `x` : parameter requires a column name / pandas series name / array name from where the values are used to position marks along the x axis.

  - `y` : parameter requires a column name / pandas series name / array name from where the values are used to position marks along the y axis.

  - `color` : parameter requires a column name / pandas series name / array name from where the values are used to assign color to marks.

5. Display the graph using the `show()` function.
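Steps 2 and 3 can be tried out without Plotly. The sketch below uses a hypothetical six-row toy frame (`toy` and its values are invented for illustration) to show what the grouped `size()` result looks like before it is passed to `bar()`.

```python
import pandas as pd

# Hypothetical toy data: six samples spread over three glass types.
toy = pd.DataFrame({
    "GlassType": [1, 1, 2, 2, 2, 7],
    "Fe": [0.0, 0.1, 0.0, 0.2, 0.0, 0.0],
})

# as_index=False keeps 'GlassType' as an ordinary column instead of the index.
group = toy.groupby(by="GlassType", as_index=False)

# size() then returns a DataFrame with the columns 'GlassType' and 'size'.
counts = group.size()

print(counts)
```

Because `GlassType` stays a regular column, both `x = "GlassType"` and `y = "size"` can be referred to by name in the `bar()` call.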



Let's create a count plot with plotly to observe the distribution of types of glasses in the dataset.

In [None]:
# T4.1: Create the count plot to observe the distribution of glass types using Plotly.
import plotly.express as px

# Group by 'GlassType' (kept as a regular column) and count the records in each group.
glass_gr = glass_type.groupby(by='GlassType', as_index=False)
glass_gr_df = glass_gr.size()
print(glass_gr_df.head())

# Create and display the count plot.
fig = px.bar(data_frame = glass_gr_df, x = "GlassType", y = "size", color = "GlassType")
fig.show()

   GlassType  size
0          1    70
1          2    76
2          3    17
3          5    13
4          6     9


**Note:** The `bar()` function accepts more parameters that can be used to create more customised plots. You may refer to the following document:

https://plotly.com/python-api-reference/generated/plotly.express.bar.html

As can be observed, the count plot is created using Plotly. Also, if you hover the mouse over the bars, a pop-up appears with the `GlassType` and its size information.

We can also convert the plot to html with the `write_html()` function.

In [None]:
# S4.1: Convert the plot to html file.
fig.write_html("Glass Distribution.html")

Check the file explorer on the left-hand side to verify that a new `.html` file has been created. We can download the graph file from the explorer.

Now, let's move ahead and create a scatter plot with Plotly using dummy data. The `plotly.express` module has the `scatter()` function to create a scatter plot. The syntax of the `scatter()` function is:

> **Syntax:**  `plotly.express.scatter(data_frame, x, y, color, size, hover_data, title)`

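> where

  - `data_frame`, `x`, `y` and `color` : have the same meaning as in the `bar()` function.

  - `size` : parameter requires a column name / pandas series name / array name from where the values are used to assign sizes to marks.

  - `hover_data` : parameter requires a list of column names whose values are shown as extra information in the hover pop-up.

  - `title` : parameter requires a string that is used as the title of the plot.


Create a scatter plot and show the distribution across labels using the steps below: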

**Steps:**

1. Create two NumPy arrays `x` and `y`, each with 10 random integers, from the ranges 1-10 and 1-100 respectively.

2. Create a NumPy array `labels` that randomly assigns one of three labels - `1`, `2`, `3` - to each of the above data points.

3. Create the scatter plot between `x` and `y` and show the distribution of the data points across labels by passing the `labels` array to the `color` parameter.

4. Display the plot.


In [None]:
# S4.2: Create scatter plot between 'x' and 'y' and show the distribution using 'labels' array
import numpy as np

# 10 random integers in the ranges 1-10 and 1-100 (the upper bound of 'randint()' is exclusive).
x = np.random.randint(1, 11, 10)
y = np.random.randint(1, 101, 10)

# Randomly assign one of the three labels 1, 2 or 3 to each point.
labels = np.random.randint(1, 4, 10)

scat_plot = px.scatter(x=x, y=y, color=labels)
scat_plot.show()

As can be observed, the scatter plot is created with the dummy data points, and different classes are assigned different colors. Also, when you hover over a data point, the pop-up shows three pieces of information: `x`, `y` and `color`, where `color` refers to the class of the data point.

We can also observe that the data points are really small. So, we can include the `size` parameter, which should be an array of the same length as `x`. Like `color`, the `size` parameter can also be used to distinguish between different labels.

Now, let's create a scatter plot of the column `Fe` using Plotly to understand how the iron (Fe) content is distributed across the types of glass, following the guidelines below:

- `data_frame` will be `glass_type`

- `x` will be a NumPy array of size `glass_type.shape[0]` with evenly spaced values from the minimum value of the column `Fe` to the maximum value of the column `Fe` + 1.

- `y` will be the values in the column `Fe`

- `size` will be values in the column `GlassType` such that the size of points change with the glass types

- `color` will also be the values in the column `GlassType` such that the color of points change with the glass types.

- `title` will be string representing the plot e.g. "Scatter plot between Fe and Glass Type"

- `color_continuous_scale` will be `px.colors.sequential.Viridis`. This parameter is used to create list of continuous color scale values when the column denoted by `color` contains numeric data.
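Note that the `x` array in these guidelines carries no information about the samples; it only spreads the points out horizontally so that they do not overlap. A minimal sketch with a hypothetical five-value stand-in for the `Fe` column (the array `fe` is invented for illustration):

```python
import numpy as np

# Hypothetical stand-in for the 'Fe' column.
fe = np.array([0.0, 0.26, 0.0, 0.11, 0.51])

# One evenly spaced x position per sample, from min(fe) to max(fe) + 1.
x = np.linspace(fe.min(), fe.max() + 1, fe.shape[0])

print(x)
```

Each sample simply gets the next position along the x axis, in the order in which it appears in the data frame.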

In [None]:
# S4.3: Create the scatter plot for the column 'Fe' values and display the distribution of glass types over the column values.
x = np.linspace(glass_type["Fe"].min(), glass_type["Fe"].max() + 1, glass_type.shape[0])
scat_plot = px.scatter(data_frame=glass_type, x=x, y="Fe",
                       size="GlassType", color="GlassType",
                       title="Scatter plot between Fe and Glass Type",
                       color_continuous_scale=px.colors.sequential.Viridis)
scat_plot.show()

As can be observed, the scatter plot is created for the column `Fe`. We can distinguish the data points of different glass types by color, using the color bar on the right, or even by size (the smallest points represent label `1` and the largest represent label `7`).

The above scatter plot shows that glass of label `5` can have the highest amount of `Fe`.

**Note:** The different color scales can be observed in the `colors` sub modules of Plotly like `plotly.express.colors.sequential`, `plotly.express.colors.diverging` and `plotly.express.colors.cyclical`.

Let's create the Plotly scatter plot for all the columns to check the distribution of glass types.

In [None]:
# S4.4: Create the scatter plot for all the feature columns to observe the distribution of glass types.
for col in glass_type.columns[:-1]:
  x = np.linspace(glass_type[col].min(), glass_type[col].max() + 1, glass_type.shape[0])
  scat_plot = px.scatter(data_frame=glass_type, x=x, y=col, color="GlassType",
                         title=f"Scatter plot between {col} and Glass Type")
  scat_plot.show()

We can observe the scatter plots above to deduce various facts, for example that the highest refractive index (`RI`) belongs to a glass sample of label `2`.



---

#### Activity 5: Model Building

Let's build a logistic regression model first without balancing the dataset. If the model evaluation parameters suggest that the model is not classifying the labels correctly, then we will first deal with the imbalance and then build a logistic regression model again.

In [None]:
# S5.1: Create separate data frames for training and testing the model.
from sklearn.model_selection import train_test_split

# Creating the features data frame holding all the columns except the last column.
X = glass_type.drop(columns = 'GlassType')

# Creating the target series that holds the last column 'GlassType'.
y = glass_type['GlassType']

# Splitting the train and test sets using the 'train_test_split()' function.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 42)


In [None]:
# S5.2: Print the shape of all the four variables i.e. 'X_train', 'X_test', 'y_train' and 'y_test'
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

(149, 9) (65, 9) (149,) (65,)

In [None]:
# S5.3 Build a logistic regression model using the 'sklearn' module.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix

# 1. First, call the 'LogisticRegression' module and store it in the 'lgcf1' variable.
lgcf1 = LogisticRegression()

# 2. Call the 'fit()' function with 'X_train' and 'y_train' as inputs.
lgcf1.fit(X_train, y_train)

# 3. Call the 'score()' function with 'X_train' and 'y_train' as inputs to check the accuracy score of the model.
print(lgcf1.score(X_train, y_train))

# Predict the target values for the train set.
y_train_pred = lgcf1.predict(X_train)
print(y_train_pred[:10])
print(confusion_matrix(y_train, y_train_pred))
print(classification_report(y_train, y_train_pred))


0.6174496644295302
[2 1 6 6 1 2 1 1 2 1]
[[36 15  0  0  0  0]
 [16 36  0  0  1  0]
 [10  3  0  0  0  0]
 [ 0  4  0  2  0  1]
 [ 0  3  0  0  2  1]
 [ 0  2  0  1  0 16]]
              precision    recall  f1-score   support

           1       0.58      0.71      0.64        51
           2       0.57      0.68      0.62        53
           3       0.00      0.00      0.00        13
           5       0.67      0.29      0.40         7
           6       0.67      0.33      0.44         6
           7       0.89      0.84      0.86        19

    accuracy                           0.62       149
   macro avg       0.56      0.47      0.49       149
weighted avg       0.57      0.62      0.59       149




lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.





**Note:** This is a preliminary model building step. Hence, we can ignore `ConvergenceWarning` completely.  
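If you later want to remove the warning instead of ignoring it, the warning message itself suggests two options: scale the data or increase `max_iter`. The sketch below applies both to synthetic stand-in data from `make_classification()`; the data, the variable names and the value `max_iter=1000` are assumptions for illustration only, and this is not the approach this lesson follows.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 214 samples, 9 features, 3 classes.
X_demo, y_demo = make_classification(
    n_samples=214, n_features=9, n_informative=5, n_classes=3, random_state=42
)

# Scale the features first, then allow more iterations than the default of 100.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_demo, y_demo)

# n_iter_ holds the number of iterations the solver actually used.
print(model[-1].n_iter_)
```

If the printed iteration count is below `max_iter`, the solver stopped because it converged, not because it ran out of iterations.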

So the accuracy score is 61.75% which is not a good score.

Now, in the case of binary classification, we generally create a confusion matrix and print the precision, recall and f1-score values. But in the case of multiclass classification, it is best to first check which labels the classification model actually predicted. For this, you can use either the `unique()` function or the `value_counts()` function on the predicted values.

In [None]:
# S5.4: Get the target values predicted by the logistic regression model on the test set, along with the confusion matrix and the classification report.
y_test_pred = lgcf1.predict(X_test)
print(y_test_pred[:10])
print(confusion_matrix(y_test, y_test_pred))
print(classification_report(y_test, y_test_pred))


[1 7 1 7 2 2 1 2 2 2]
[[15  4  0  0  0  0]
 [ 5 17  0  0  0  1]
 [ 2  2  0  0  0  0]
 [ 0  5  0  1  0  0]
 [ 0  1  0  0  0  2]
 [ 0  0  0  0  0 10]]
              precision    recall  f1-score   support

           1       0.68      0.79      0.73        19
           2       0.59      0.74      0.65        23
           3       0.00      0.00      0.00         4
           5       1.00      0.17      0.29         6
           6       0.00      0.00      0.00         3
           7       0.77      1.00      0.87        10

    accuracy                           0.66        65
   macro avg       0.51      0.45      0.42        65
weighted avg       0.62      0.66      0.61        65




Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.





As you can see from the reports above, the logistic regression model failed to identify glass-type `3` on the train set: it never predicts that label, so its precision, recall and f1-score are all reported as 0.

Consequently, it does not make sense to interpret a confusion matrix here because the actual target set has all the labels but the predicted target set misses one label (glass-type `3`) among the available labels (the whole dataset does not have any glass-type `4` sample).

Hence, **in the case of multiclass classification, before creating a confusion matrix, always first check whether the predicted target set has all the labels**.
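One way to carry out this check is to compare the set of actual labels with the set of predicted labels. The sketch below uses two short hypothetical arrays (`y_true_demo` and `y_pred_demo` are invented for illustration) in which label `3` is never predicted.

```python
import numpy as np

# Hypothetical actual and predicted labels; label 3 is never predicted.
y_true_demo = np.array([1, 2, 3, 5, 6, 7, 1, 2, 3])
y_pred_demo = np.array([1, 2, 1, 5, 6, 7, 1, 2, 2])

# Labels present in the actual targets but absent from the predictions.
actual_labels = set(np.unique(y_true_demo).tolist())
predicted_labels = set(np.unique(y_pred_demo).tolist())
missing = sorted(actual_labels - predicted_labels)

print(missing)
```

An empty list means every actual label was predicted at least once; any label that shows up here was never predicted by the model.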

Let's count the labels predicted on the train set and on the test set to find out all the classes identified by the logistic regression model.

In [None]:
# S5.5: Get the count of each target value predicted by the logistic regression model on the train set and on the test set.
pd.Series(y_train_pred).value_counts(), pd.Series(y_test_pred).value_counts()

(2    63
 1    62
 7    18
 6     3
 5     3
 dtype: int64, 2    29
 1    22
 7    13
 5     1
 dtype: int64)

On the test set, the logistic regression model failed to identify labels `3` and `6`. This is clearly a very bad classification model.

Let's stop here. In the next class, we will try to build a logistic regression model again so that it can identify all the different labels before we can evaluate its performance further using confusion matrix, precision, recall and f1-score values.

---

### **Project**
You can now attempt the **Applied Tech Project 80 Multiclass Classification I** on your own.

**Applied Tech Project 80 Multiclass Classification I**: https://colab.research.google.com/drive/1UvGgfNYlK8p2fE1umQNl2hEeTH3hx2_X

---