# Logistic Regression

## Table of Contents

1. [**Introduction**](#Intro)  
    
2. [**Model Development Procedure**](#ModelDev)
   
3. [**Binary Classification**](#BinaryClass)

    3.1 [**Model Development**](#BinModelDev)
  
    3.2 [**Model Evaluation**](#BinModelEval)

4. [**Multiclass Classification**](#MultiClass)

    4.1 [**Model Development**](#MultModelDev)

    4.2 [**Model Evaluation**](#MultModelEval)

5. [**Final Comments**](#FinalComments)

## 1 Introduction <a name="Intro"></a>

Regression ideas can be applied to classification problems by encoding the class labels of the training examples as numbers and modelling the probability of each class. To estimate the class of a data point, we need a measure of the <u>most probable class</u> for that data point. For this, we use __Logistic Regression__.





## 2 Model Development Procedure <a name="ModelDev"></a>

Here are the steps to implement logistic regression in Python using the <font color='blue'>scikit-learn</font> library:

__1.__ Import the `LogisticRegression` and `MinMaxScaler` classes and the `train_test_split` function from the scikit-learn library, along with the `numpy` library:
```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler  # For normalization
import numpy as np
```

__2.__ Define the dependent variable (target) and the independent variables (features) from the data set:
```python
x_data=np.array(df[['feature1','feature2',...]])
y_data=df['target variable']
```

__3.__ Normalize your data using <font color='blue'>MinMaxScaler</font> (Optional but advised)
```python
MinMaxscaler = MinMaxScaler()  # define min max scaler
x_data_scaled = MinMaxscaler.fit_transform(x_data)  # transform data
```

__4.__ Split the data into train and test sets: `x_train,x_test,y_train,y_test=train_test_split(x_data_scaled,y_data)`



__5.__ Create a logistic regression object using the constructor: `lr = LogisticRegression() `


__6.__ Use the fit function to fit the model to the training data: `lr.fit(x_train,y_train)`

__7.__ Then, make predictions using the test data and the training data:
```python
yhatTest=lr.predict(x_test)
yhatTrain=lr.predict(x_train)
```

__8.__ (Optional) To see the probability calculated for each class, use `predict_proba`:
```python
yhatTest_prob = lr.predict_proba(x_test)
yhatTrain_prob = lr.predict_proba(x_train)
```
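
The steps above can be tried end to end without a data file. The following is a minimal sketch, assuming a synthetic dataset from `make_classification` as a stand-in for `df`; the sample count, feature count, and `random_state` values are illustrative choices, not part of the original procedure.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Step 2: synthetic features and binary target (assumption: replaces df)
x_data, y_data = make_classification(n_samples=200, n_features=4,
                                     n_informative=3, n_redundant=0,
                                     random_state=0)

# Step 3: rescale every feature to the range [0, 1]
scaler = MinMaxScaler()
x_data_scaled = scaler.fit_transform(x_data)

# Step 4: hold out 30% of the samples for testing
x_train, x_test, y_train, y_test = train_test_split(
    x_data_scaled, y_data, test_size=0.3, random_state=0)

# Steps 5 and 6: create the model and fit it to the training data
lr = LogisticRegression()
lr.fit(x_train, y_train)

# Steps 7 and 8: predicted labels and per-class probabilities
yhatTest = lr.predict(x_test)
yhatTest_prob = lr.predict_proba(x_test)

print(yhatTest[:5])
print(yhatTest_prob[:5])
```

Each row of `yhatTest_prob` holds one probability per class and sums to 1.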

## 3 Binary Classification <a name="BinaryClass"></a>

The steel plate faults dataset was provided by Semeion, Research of Sciences of Communication, Via Sersale 117, 00128, Rome, Italy. In this dataset, steel plate faults are classified into 7 types. Since it was donated on October 26, 2010, the dataset has been widely used in machine learning for automatic pattern recognition. The fault types and the corresponding numbers of samples are shown in the table below.

<img src="https://docs.google.com/uc?export=download&id=1pw1oJ7plDsTASg_ntI_QSVivQ-tMhlqq" width="500">


The number of samples varies considerably from one category to another. Fault 7 is a special class because it collects every fault that does not belong to the first six types, so samples in class 7 may have no obvious common characteristics. For every sample, 27 features are recorded, providing evidence for its fault class. All attributes are integers or real numbers. Detailed information about these 27 independent variables is listed in the following table.

<img src="https://docs.google.com/uc?export=download&id=1lAV-mPa2seL9VWkezbaCicnZVwOup2c6" width="500">


Ref: https://www.sciencedirect.com/science/article/pii/S0925231214012193?casa_token=8ZvcrfiUELkAAAAA:Vt2ShomuyzpagA6Su9nSQHzImgti_HHvtK5zuGqgC01It_Xn9UsccPB-5HVtzBonmsYCibDgYQ



Ref for the dataset: https://archive.ics.uci.edu/ml/datasets/Steel+Plates+Faults


In [None]:
import pandas as pd

url = ('https://raw.githubusercontent.com/MasoudMiM/ME_364/main/Steel_Plates_Faults/Data.csv')
df = pd.read_csv(url,names=['X_Minimum', 'X_Maximum', 'Y_Minimum', 'Y_Maximum', 'Pixels_Areas', 'X_Perimeter',
                            'Y_Perimeter', 'Sum_of_Luminosity', 'Minimum_of_Luminosity', 'Maximum_of_Luminosity',
                            'Length_of_Conveyer', 'TypeOfSteel_A300', 'TypeOfSteel_A400', 'Steel_Plate_Thickness',
                            'Edges_Index', 'Empty_Index', 'Square_Index', 'Outside_X_Index', 'Edges_X_Index',
                            'Edges_Y_Index', 'Outside_Global_Index', 'LogOfAreas', 'Log_X_Index', 'Log_Y_Index',
                            'Orientation_Index', 'Luminosity_Index', 'SigmoidOfAreas', 'Pastry', 'Z_Scratch',
                            'K_Scratch', 'Stains', 'Dirtiness', 'Bumps', 'Other_Faults'])           
df.head()

In [None]:
# Check to see if there are missing values in the dataset
df.isnull().sum().sum()

### 3.1 Model Development <a name="BinModelDev"></a>

Step __1__, importing the required libraries

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler  # For normalization
import numpy as np

Step __2__, defining dependent and independent variables

In [None]:
x_data=np.array(df[['X_Minimum', 'X_Maximum', 'Y_Minimum', 'Y_Maximum', 'Pixels_Areas', 'X_Perimeter',
             'Y_Perimeter', 'Sum_of_Luminosity', 'Minimum_of_Luminosity', 'Maximum_of_Luminosity',
             'TypeOfSteel_A300', 'TypeOfSteel_A400', 'Steel_Plate_Thickness']])
y_data=df['K_Scratch']

In [None]:
df['K_Scratch'].unique()

In [None]:
print('Target variable distribution:')
print( df['K_Scratch'].value_counts() )

df['K_Scratch'].value_counts().plot(kind='barh')

Step __3__, normalization using MinMaxScaler

In [None]:
MinMaxscaler = MinMaxScaler()  # define min max scaler
x_data_scaled = MinMaxscaler.fit_transform(x_data)  # transform data

Step __4__, splitting the data

In [None]:
x_train,x_test,y_train,y_test=train_test_split(x_data_scaled,y_data,test_size=0.3)

Step __5__, creating a logistic regression object

In [None]:
lr = LogisticRegression(max_iter=500)

Step __6__, fitting the model (training the model)

In [None]:
lr.fit(x_train,y_train)

Step __7__, making predictions

In [None]:
yhatTest=lr.predict(x_test)
yhatTrain=lr.predict(x_train)

Step __8__, probability for each class

`predict_proba` returns probability estimates for all classes, with the columns ordered as in `lr.classes_`.

In [None]:
lr.classes_

In [None]:
print('Order of the classes',lr.classes_)
yhatTest_prob = lr.predict_proba(x_test)
yhatTest_prob
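
To make the link between `classes_`, `predict_proba`, and `predict` concrete, here is a minimal sketch on a tiny hypothetical one-feature dataset (the arrays `X` and `y` are made up for illustration). It checks that the predicted label is the entry of `classes_` whose probability column is largest.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny hypothetical dataset: small feature values are class 0, large are class 1
X = np.array([[0.0], [0.1], [0.2], [0.3], [0.7], [0.8], [0.9], [1.0]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

clf = LogisticRegression().fit(X, y)
proba = clf.predict_proba(X)

# Column j of proba belongs to clf.classes_[j]
labels_from_proba = clf.classes_[np.argmax(proba, axis=1)]

print('Order of the classes', clf.classes_)
print('Same as predict():', np.array_equal(labels_from_proba, clf.predict(X)))
```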

### 3.2 Model Evaluation <a name="BinModelEval"></a>

To evaluate our logistic regression model, we use log loss along with the previously introduced evaluation metrics.


In [None]:
from sklearn.metrics import accuracy_score 
from sklearn.metrics import jaccard_score
from sklearn.metrics import f1_score
from sklearn.metrics import log_loss

from sklearn.metrics import confusion_matrix
from sklearn.metrics import ConfusionMatrixDisplay

In [None]:
acc_scoreTrain = accuracy_score(y_train,yhatTrain)
acc_scoreTest = accuracy_score(y_test,yhatTest)
print(f'The accuracy for training data is {acc_scoreTrain:0.3f}')
print(f'The accuracy for the test data is {acc_scoreTest:0.3f}')

In [None]:
J_scoreTrain = jaccard_score(y_train,yhatTrain)
J_scoreTest = jaccard_score(y_test,yhatTest)
print(f'Jaccard index for training data is {J_scoreTrain:0.3f}')
print(f'Jaccard index for the test data is {J_scoreTest:0.3f}')

In [None]:
F_scoreTrain = f1_score(y_train,yhatTrain)
F_scoreTest = f1_score(y_test,yhatTest)
print(f'F-score for training data is {F_scoreTrain:0.3f}')
print(f'F-score for the test data is {F_scoreTest:0.3f}')

In [None]:
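# log loss is computed from predicted probabilities, not from hard class labels
LogLossTrain = log_loss(y_train,lr.predict_proba(x_train))
LogLossTest = log_loss(y_test,lr.predict_proba(x_test))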
print(f'Log Loss for training data is {LogLossTrain:0.3f}')
print(f'Log loss for the test data is {LogLossTest:0.3f}')
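
Log loss is the average negative log-likelihood of the true labels under the predicted probabilities, so it should be fed the output of `predict_proba` rather than hard labels. The sketch below computes it by hand for binary labels; the function name `binary_log_loss` and the sample values are hypothetical, and the clipping constant is an illustrative choice.

```python
import math

def binary_log_loss(y_true, p_pred, eps=1e-15):
    """Average negative log-likelihood for labels in {0, 1}.

    p_pred holds the predicted probability of class 1 for each sample.
    """
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

y_true = [1, 0, 1, 0]             # hypothetical true labels
p_soft = [0.9, 0.2, 0.6, 0.4]     # predicted probabilities of class 1
p_hard = [1, 0, 1, 1]             # hard labels used as if they were probabilities

print(f'Log loss with probabilities: {binary_log_loss(y_true, p_soft):0.3f}')
print(f'Log loss with hard labels:   {binary_log_loss(y_true, p_hard):0.3f}')
```

With hard labels, a single wrong prediction is treated as a fully confident mistake, which dominates the average and hides how well calibrated the model is.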

In [None]:
print('Confusion matrix for training data')
CM_scoreTrain = confusion_matrix(y_train,yhatTrain)   # possible option normalize='true'
print(CM_scoreTrain)

print(40*'-')

print('Confusion matrix for test data')
CM_scoreTest = confusion_matrix(y_test,yhatTest)   # possible option normalize='true'
print(CM_scoreTest)

In [None]:
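# Class 0 means "not a K_Scratch fault" (the plate has another fault type), class 1 means K_Scratch
dispTr = ConfusionMatrixDisplay(CM_scoreTrain,display_labels=['Not K_Scratch','K_Scratch'])
dispTr.plot()

dispTs = ConfusionMatrixDisplay(CM_scoreTest,display_labels=['Not K_Scratch','K_Scratch'])
dispTs.plot()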
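
For a binary problem, accuracy, the Jaccard index, and the F-score can all be read off the confusion matrix. The following sketch uses hypothetical counts (not results from the steel plate data), laid out as `confusion_matrix` returns them: rows are true classes and columns are predicted classes.

```python
# Hypothetical confusion matrix: rows = true class, columns = predicted class
cm = [[50, 5],    # true 0: 50 true negatives, 5 false positives
      [10, 35]]   # true 1: 10 false negatives, 35 true positives
tn, fp = cm[0]
fn, tp = cm[1]

accuracy = (tp + tn) / (tp + tn + fp + fn)   # fraction of correct predictions
precision = tp / (tp + fp)                   # correct among predicted positives
recall = tp / (tp + fn)                      # correct among actual positives
f1 = 2 * precision * recall / (precision + recall)
jaccard = tp / (tp + fp + fn)                # overlap of predicted and true positives

print(f'accuracy={accuracy:0.3f}, F-score={f1:0.3f}, Jaccard={jaccard:0.3f}')
```

Note that the Jaccard index and the F-score ignore true negatives, so they are more informative than accuracy when the positive class is rare.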

## 4 Multiclass Classification <a name="MultiClass"></a>

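In its basic form, logistic regression is a binary classifier: it cannot be applied directly to classification tasks that have more than two class labels, i.e. multi-class classification, and must be adapted to support them. scikit-learn's `LogisticRegression` handles this adaptation internally when the target has more than two classes.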

One popular approach for adapting logistic regression to multi-class classification problems is to split the multi-class classification problem into multiple binary classification problems and fit a standard logistic regression model on each subproblem.

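Instead of $y \in \{0, 1\}$, the target now takes values $y \in \{0, 1, \dots, n\}$. In this one-vs-rest scheme, we re-run binary classification once for each class, treating that class as positive and all other classes as negative, and then predict the class whose model gives the highest probability.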
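
The idea of fitting one binary model per class can be sketched with scikit-learn's `OneVsRestClassifier` wrapper. This is an illustration only: it assumes the bundled Iris dataset (three classes) rather than the data used in this notebook, and it is not necessarily the strategy that `LogisticRegression` applies on its own.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Iris has three classes, labelled 0, 1 and 2 (assumption: stand-in dataset)
X, y = load_iris(return_X_y=True)

# One binary logistic regression is fitted per class (that class vs. the rest)
ovr = OneVsRestClassifier(LogisticRegression(max_iter=500))
ovr.fit(X, y)

print('Classes:', ovr.classes_)
print('Number of binary models:', len(ovr.estimators_))
print('Predictions for the first three samples:', ovr.predict(X[:3]))
```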

### 4.1 Model Development <a name="MultModelDev"></a> 

We are going to use part of the Fuel Economy Dataset, which is produced by the Office of Energy Efficiency and Renewable Energy of the U.S. Department of Energy. Fuel economy data are the result of vehicle testing done at the Environmental Protection Agency's National Vehicle and Fuel Emissions Laboratory in Ann Arbor, Michigan, and by vehicle manufacturers with oversight by EPA. This dataset can be accessed from here: https://github.com/MasoudMiM/ME_364/blob/main/EPA_Green_Vehicle_Guide/EPA_2020_Fuel_Economy.csv and a description of the data is provided at https://www.fueleconomy.gov/feg/EPAGreenGuide/GreenVehicleGuideDocumentation.pdf

In [None]:
import pandas as pd

url = ('https://raw.githubusercontent.com/MasoudMiM/ME_364/main/EPA_Green_Vehicle_Guide/EPA_2020_Fuel_Economy.csv')
dfEPA = pd.read_csv(url)           

dfEPA.drop(columns='Unnamed: 0',inplace=True)
dfEPA.head()

Let's develop the model using a target variable that has more than two possible values. In this case, we use SmartWay, which can be 'Yes', 'No', or 'Elite'. This is a multi-class classification.

In [None]:
dfEPA['SmartWay'].unique()

In [None]:
print('Target variable distribution:')
print( dfEPA['SmartWay'].value_counts() )

dfEPA['SmartWay'].value_counts().plot(kind='barh')

Defining dependent and independent variables

In [None]:
x_dataM=np.array(dfEPA[['City MPG','Hwy MPG']])
y_dataM=dfEPA['SmartWay']

Normalizing the data

In [None]:
MinMaxscalerM = MinMaxScaler()  # define min max scaler
x_data_scaledM = MinMaxscalerM.fit_transform(x_dataM)  # transform data with the scaler defined above

Splitting the data

In [None]:
x_trainM,x_testM,y_trainM,y_testM=train_test_split(x_data_scaledM,y_dataM,test_size=0.25)

Creating a logistic regression object

In [None]:
lrM = LogisticRegression(max_iter=500)

Training the model

In [None]:
lrM.fit(x_trainM,y_trainM)

Making predictions

In [None]:
yhatTestM=lrM.predict(x_testM)
yhatTrainM=lrM.predict(x_trainM)

In [None]:
yhatTestM[:2]

And finding the probability for each class (optional)

In [None]:
print('Order of the classes',lrM.classes_)
yhatTestM_prob = lrM.predict_proba(x_testM)
yhatTestM_prob[:2]

### 4.2 Model Evaluation <a name="MultModelEval"></a>

We can use some of the same evaluation metrics to assess the performance of our model. Here, let's just look at the accuracy and the confusion matrix.

In [None]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import ConfusionMatrixDisplay

In [None]:
acc_scoreTrainM = accuracy_score(y_trainM,yhatTrainM)
acc_scoreTestM = accuracy_score(y_testM,yhatTestM)
print(f'The accuracy for training data is {acc_scoreTrainM:0.3f}')
print(f'The accuracy for the test data is {acc_scoreTestM:0.3f}')

In [None]:
CM_scoreTrainM = confusion_matrix(y_trainM,yhatTrainM)   # possible option normalize='true'
CM_scoreTestM = confusion_matrix(y_testM,yhatTestM)   # possible option normalize='true'

dispTrM=ConfusionMatrixDisplay(CM_scoreTrainM, display_labels=lrM.classes_)
dispTrM.plot()
dispTsM=ConfusionMatrixDisplay(CM_scoreTestM, display_labels=lrM.classes_) 
dispTsM.plot()
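
Metrics such as the F-score need an averaging rule when there are more than two classes; `f1_score` with its default `average='binary'` raises an error on multi-class targets. The sketch below uses made-up labels (not model output from this notebook) to show per-class and macro-averaged F-scores.

```python
from sklearn.metrics import f1_score

# Hypothetical true and predicted SmartWay labels
y_true = ['No', 'No', 'Yes', 'Yes', 'Elite', 'Elite']
y_pred = ['No', 'Yes', 'Yes', 'Yes', 'Elite', 'No']

# One F-score per class, in the order given by labels
per_class = f1_score(y_true, y_pred, average=None, labels=['Elite', 'No', 'Yes'])

# Macro average: unweighted mean of the per-class F-scores
macro = f1_score(y_true, y_pred, average='macro')

print('Per-class F-scores:', per_class)
print(f'Macro-averaged F-score: {macro:0.3f}')
```

The macro average gives every class the same weight, which matters when a class such as 'Elite' has few samples.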

## 5 Final Comments <a name="FinalComments"></a>

- Classification is the problem of predicting the right label for a given input. It differs from regression, where the targets are continuous variables.

- The right way to think about classification is as carving feature space into regions, so that all the points within any given region are destined to be assigned the same label.

- The sigmoid function takes a real-valued input $-\infty < x < \infty$ and produces a value in the interval $(0,1)$, which can be interpreted as a probability.
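
As an illustration of the last point, the sigmoid (logistic) function $\sigma(z) = 1/(1+e^{-z})$ can be written in a few lines. This is a minimal sketch; the function name `sigmoid` and the two-branch form used to avoid overflow are illustrative choices.

```python
import math

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z)), written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)          # safe here because z < 0
    return ez / (1.0 + ez)

# Large negative inputs approach 0, large positive inputs approach 1
for z in (-5.0, 0.0, 5.0):
    print(f'sigmoid({z:+.1f}) = {sigmoid(z):0.4f}')
```

An input of 0 maps to exactly 0.5, which is why 0.5 is the natural threshold between the two classes in binary logistic regression.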
