# Supervised Machine Learning Models

### Table of Contents

* [Intro](#chapter1)
    * [Introduction to Supervised Machine Learning](#Section_1_1)
        * [Machine Learning Process Flow](#section_1_1_1)
* [PART 1](#chapter2)  
    * [Regression Analysis](#section_2_1)
         * [Linear Regression](#section_2_1_1)
         * [Ridge Regression](#section_2_1_2)
         * [Lasso Regression](#section_2_1_3)
    * [Feature Importance](#section_2_2)
        * [Feature Importance](#section_2_2_1)
    * [Evaluation Metrics](#section_2_3)
        * [MAE,MSE,RMSE](#section_2_3_1)
        * [R^2, Adjusted R^2](#section_2_3_2)
* [PART 2](#chapter3)
    * [Data Balancing](#section_3_1)
    * [Classification](#section_3_2)
        * [Logistic Regression](#section_3_2_1)
        * [KNN](#section_3_2_2)
        * [Decision Tree](#section_3_2_3)
    * [Evaluation Metrics](#section_3_4)
        * [Confusion Matrix](#section_3_4_1)
        * [Accuracy, Precision, Recall, F1 Score, ROC, AUC](#section_3_4_2)

## Intro <a class="anchor" id="chapter1"></a>

What is machine learning?
What are its real-world use cases?


## Introduction to Supervised Machine Learning <a class="anchor" id="Section_1_1"></a>

Supervised machine learning is a family of algorithms that learn from labeled training data in order to predict outcomes for new, unseen data.

#### Machine Learning Process Flow <a class="anchor" id="section_1_1_1"></a>

![MACHINE%20LEARNING%20PROCESS.png](attachment:MACHINE%20LEARNING%20PROCESS.png)

## PART 1 <a class="anchor" id="chapter2"></a>

## Regression Analysis <a class="anchor" id="section_2_1"></a>

### Linear Regression <a class="anchor" id="section_2_1_1"></a>

In [None]:
# Importing and instantiating the Linear Regression model
# from sklearn.linear_model import LinearRegression
# lrmodel = LinearRegression()

### Ridge Regression <a class="anchor" id="section_2_1_2"></a>

In [None]:
# Importing and instantiating the Ridge Regression model
# from sklearn.linear_model import Ridge
# ridge_reg = Ridge()

### Lasso Regression <a class="anchor" id="section_2_1_3"></a>

In [None]:
# Importing and instantiating the LASSO Regression model
# from sklearn.linear_model import Lasso
# lasso_reg = Lasso()

### Decision Trees <a class="anchor" id="section_2_1_4"></a>

In [None]:
# Importing and instantiating the Decision Tree Regressor
# from sklearn.tree import DecisionTreeRegressor
# dtr = DecisionTreeRegressor()

## Feature Importance <a class="anchor" id="section_2_2"></a>

In regression analysis, the magnitude of a coefficient is not necessarily related to its importance, because it depends on the scale of the corresponding variable. The most common criterion for judging the independent variables in a regression is the p-value. A small p-value indicates that the variable's relationship with the target is statistically significant, whereas a large p-value means there is not enough evidence that the variable contributes to the model.
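
As a rough illustration of the point above, the sketch below fits an ordinary least squares regression and computes the coefficient p-values by hand with NumPy and SciPy. The synthetic data and all variable names are assumptions made purely for this example; they are not part of the course datasets.

```python
import numpy as np
from scipy import stats

# Synthetic data (hypothetical): x1 drives y, x2 is unrelated but on a much larger scale
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n) * 100
y = 3.0 * x1 + rng.normal(size=n)

# Design matrix with an intercept column
X = np.column_stack([np.ones(n), x1, x2])

# Ordinary least squares estimate of the coefficients
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# Standard errors from the residual variance and (X'X)^-1
residuals = y - X @ beta
dof = n - X.shape[1]
sigma2 = residuals @ residuals / dof
std_err = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))

# Two-sided p-values from the t distribution
t_stats = beta / std_err
p_values = 2 * stats.t.sf(np.abs(t_stats), dof)

for name, b, p in zip(["intercept", "x1", "x2"], beta, p_values):
    print(f"{name:>9}: coefficient={b: .4f}  p-value={p:.4g}")
```

The coefficient of `x2` is small mainly because of its scale, which is why the p-value (or a standardized coefficient) is a safer guide than the raw magnitude.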

## Evaluation Metrics <a class="anchor" id="section_2_3"></a>

## PART 2 - Classification <a class="anchor" id="chapter3"></a>

**Definition:** Classification is the process of predicting the class of given data points. Classes are also called targets, labels, or categories. <br>
Classification uses machine learning algorithms that learn how to assign a class label to examples from the problem domain. 

**Types of Classification:**

<img src="https://www.researchgate.net/profile/Emerson-Nithiyaraj/publication/342987800/figure/fig1/AS:913942624870401@1594912310090/Binary-vs-Multiclass-classification.jpg" width="400"/>

1. <font color='Brown'>**Binary Classifier:**</font>  If a classification problem has only two possible outcomes (for example 0 or 1), the model is called a Binary Classifier.<br>
        Examples: Email spam detection (spam or not), fraud detection (fraudulent or not)

2. <font color='Brown'>**Multi-class Classifier:**</font> If a classification problem has more than two possible outcomes, the model is called a Multi-class Classifier.<br>
        Examples: Classifying a set of images of fruits (oranges, apples, or pears), classifying types of music.

### Data Balancing <a class="anchor" id="section_3_1"></a>

**Need for a balanced dataset:** Imagine classifying credit card transactions as fraudulent or not, where <br>

    - Total No. of transactions = 1 Million
    - No. of fraudulent transactions = 5 
 
In this case, a model that simply predicts negative for every transaction will be 99.9995% accurate! 
A model trained on such data will most likely learn to “predict negative” no matter what the input is, which makes it useless for detecting fraud. 
To combat this problem, the training data should be balanced so that it contains similar numbers of positive and negative examples.

**Reason:** If the dataset is biased towards one class, an algorithm trained on it will tend to be biased towards the same class.
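
The accuracy figure quoted above can be checked with a short sketch. It builds the label array described in the example (1 million transactions, 5 of them fraudulent) and scores a hypothetical "always predict negative" model; the array names are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# 1,000,000 transactions, of which 5 are fraudulent (label 1)
y_true = np.zeros(1_000_000, dtype=int)
y_true[:5] = 1

# A "model" that predicts negative (0) for every transaction
y_pred_all_negative = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_pred_all_negative)
rec = recall_score(y_true, y_pred_all_negative)

print(f"Accuracy: {acc:.6f}")  # very high, because almost everything is negative
print(f"Recall:   {rec:.6f}")  # zero, because no fraudulent case is caught
```

High accuracy together with zero recall is the reason accuracy alone is a poor metric on imbalanced data.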

<img src="https://www.researchgate.net/profile/Amer-Abdulrahman-2/publication/349154432/figure/fig1/AS:989404227182593@1612903758442/Balanced-and-Imbalanced-datasets.jpg" width="400"/>

**Tactics To Combat Imbalanced Training Data:**

- Collect More Data
- Resampling the Dataset
    - Over-Sampling : Add copies of instances from the minority class
    - Under-Sampling : Delete instances from the majority class
    - Generate Synthetic Samples : Create synthetic samples for minority class
- Try Different Algorithms : Like Random forest, Decision trees
- Changing Performance Metric <br>

**Resampling Techniques:**

NOTE: Apply resampling only to the training dataset; the test dataset should keep its original class distribution.

<font color='Darkblue'>A. Under Sampling:</font>

1. **Random Under-Sampling:** It randomly removes samples of the majority label(s) until their count matches that of the minority label.   
    * from imblearn.under_sampling import RandomUnderSampler -->   rus =  RandomUnderSampler() <br>
<img src="https://e476rzxxeua.exactdn.com/wp-content/uploads/2020/03/Undersampling.jpg" width="300"/> <br>

2. **Tomek Link:** A Tomek link is a pair of samples that belong to different classes and are each other’s nearest neighbors. This method removes samples that form Tomek links (typically the majority-class sample of each pair) from the data.
    * from imblearn.under_sampling import TomekLinks    -->     tk = TomekLinks()<br> <br>
<img src="https://editor.analyticsvidhya.com/uploads/85598tomek.png" width="300"/> <br>

<font color='Darkblue'>B. Over Sampling:</font>

3. **Random Over-Sampling:** It randomly selects samples of the minority label(s) and duplicates them until their count matches that of the majority label.
     * from imblearn.over_sampling import RandomOverSampler   -->   ros =  RandomOverSampler() <br><br>
<img src="https://dataaspirant.com/wp-content/uploads/2020/08/10-oversampling.png" width="300"/> <br><br>

4. **Synthetic Minority Over-sampling Technique (SMOTE):** Based on the distance between each minority class sample and its nearest minority class neighbors (usually Euclidean distance), SMOTE generates synthetic examples that are neither exact copies of the original minority class samples nor too different from them. 

    *Example:* Let (x1, y1) and (x2, y2) be two observations from the minority class. As a first step, a random number r between 0 and 1 is drawn. The synthetic point is then (x1 + r\*(x2 - x1), y1 + r\*(y2 - y1)), which lies on the line segment between the two observations. The figures below illustrate this.
    
     * from imblearn.over_sampling import SMOTE    -->    sm = SMOTE()  <br>
     
    <p float="left">
    <img src="https://www.researchgate.net/publication/349985270/figure/fig1/AS:1080299945496636@1634574986644/The-schematic-of-SMOTE-algorithm.jpg" width="200"/>
    <img src="https://miro.medium.com/max/1090/1*TIniOUSnxwmX-EnwRoQaNQ.png" width="250"/>
    </p>

5. **Adaptive Synthetic sampling (ADASYN):** An alternative to SMOTE that uses a weighted distribution over the minority class samples: the number of synthetic samples generated for each minority sample is proportional to the number of its nearby samples that do not belong to the same class.
     * from imblearn.over_sampling import ADASYN   -->   adasyn =  ADASYN()<br><br>
<img src="https://miro.medium.com/max/1400/1*iXHQaRrdIJLRsxxjEnr67Q.jpeg" width="300"/> <br><br>

<font color='Darkblue'>C. Hybrid Sampling:</font>
        
6. **SMOTE + Tomek Link:** It performs over-sampling (SMOTE) to create new synthetic minority samples to get a balanced distribution and then under-sampling (Tomek Link) to remove the samples close to the boundary of the two classes, to increase the separation between the two classes.
      * from imblearn.combine import SMOTETomek    -->      smtom =  SMOTETomek()
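
The interpolation step from the SMOTE example above can be sketched with NumPy alone. This shows only how one synthetic point is placed between two minority samples; it is not the full SMOTE algorithm (which also selects nearest neighbors), and the two points used here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)

# Two hypothetical observations from the minority class
p1 = np.array([1.0, 2.0])   # (x1, y1)
p2 = np.array([3.0, 6.0])   # (x2, y2)

# Random number r between 0 and 1
r = rng.random()

# Synthetic point: (x1 + r*(x2 - x1), y1 + r*(y2 - y1))
synthetic = p1 + r * (p2 - p1)

print("r =", round(r, 3))
print("synthetic point =", synthetic)
```

Because r is between 0 and 1, the synthetic point always falls on the line segment joining the two original samples.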

In [None]:
# Read the training & test datasets generated from Preprocessed classification Solution 
# Link --> content/04_data_preprocessing_&_feature_engineering/Solution_Classification_preprocessing.ipynb

import pandas as pd
import warnings
warnings.filterwarnings('ignore')

x_train=pd.read_csv('../datasets/classification/processed/X_train.csv', index_col=0)
x_test=pd.read_csv('../datasets/classification/processed/X_test.csv', index_col=0)

y_train=pd.read_csv('../datasets/classification/processed/y_train.csv', index_col=0)
y_test=pd.read_csv('../datasets/classification/processed/y_test.csv', index_col=0)

print(x_train.shape,x_test.shape,y_train.shape,y_test.shape)

In [None]:
# Check if the dataset needs Class balancing for Target variable
print("Original unbalanced dataset distribution: ", [y_train.value_counts(), y_train.value_counts(normalize=True)])

In [None]:
#importing SMOTETomek to handle class imbalance
from imblearn.combine import SMOTETomek

balanced_data = SMOTETomek(random_state=42)

# fit predictor and target variable
x_smote, y_smote = balanced_data.fit_resample(x_train, y_train)

In [None]:
# check the distribution of balanced sample
print("Resampled balanced dataset distribution: ", [y_smote.value_counts(), y_smote.value_counts(normalize=True)])

### Classification Algorithms <a class="anchor" id="section_3_2"></a>

A model will use the training dataset and will calculate how to best map examples of input data to specific class labels. As such, the training dataset must be sufficiently representative of the problem and have many examples of each class label.

**Types of algorithms:**

1. <font color='Green'>**Lazy Learners:**</font> A lazy learner first stores the training dataset and waits until it receives the test dataset. Classification is then done based on the most closely related data stored in the training dataset. It takes less time in training but more time for predictions. <br>
        Example: K-NN algorithm

2. <font color='Green'>**Eager Learners:**</font> Eager learners build a classification model from the training dataset before receiving the test dataset. In contrast to lazy learners, they take more time to train and less time to predict. <br>
        Example: Decision Trees, Naïve Bayes

1. <font color='Brown'>**Parametric algorithms:**</font> Learning models that summarize data with a set of parameters of fixed size (independent of the number of training examples).<br>
        Example: Logistic Regression, Naive Bayes

2. <font color='Brown'>**NonParametric algorithms:**</font> Algorithms that do not make strong assumptions about the form of the mapping function.<br>
        Example: K-NN algorithm, Decision Trees, Support Vector Machines
   
1. <font color='DarkBlue'>**Linear algorithms:**</font> Algorithms that assign data points to a discrete class based on a linear combination of their explanatory variables.<br>
        Example: Logistic Regression, Support Vector Machines

2. <font color='DarkBlue'>**Non-Linear algorithms:**</font> Algorithms that can categorize instances that are not linearly separable.<br>
        Example: K Nearest Neighbor, Naive Bayes, Decision Trees, Support Vector Machines

### 1. Logistic Regression <a class="anchor" id="section_3_2_1"></a>

<font color='Red'>**Definition:**</font> 
Logistic regression models the probabilities for classification problems with two possible outcomes.

<font color='Red'>**Characteristics:**</font>
In Logistic Regression, we don’t directly fit a straight line to our data like in linear regression. Instead, we fit an S-shaped curve, called the sigmoid, to the observations. <br>
Why the name Regression? - Logistic regression uses the same basic linear formula as linear regression, but it regresses for the probability of a categorical outcome.

**Sigmoid (logistic) function**: An S-shaped curve that takes any real-valued number and maps it to a value between 0 and 1. (The logit function is its inverse: it maps a probability to the log-odds.) 
    
<img src="Images/LR_2.png" width="500"/>

<p float="left">
    <img src="https://qph.cf2.quoracdn.net/main-qimg-c53c1a29966c2117a4f23ceb35644fcd" width="300"/>
    <img src="https://christophm.github.io/interpretable-ml-book/images/logistic-class-threshold-1.png" width="300"/>
</p>
    
*Decision boundary*:  A decision boundary is the threshold used to convert the predicted probabilities of logistic regression into discrete classes (0.5 is the usual default).<br>
    y = class 0 if predicted probability < 0.5        
    y = class 1 if predicted probability >= 0.5
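
A minimal sketch of the sigmoid function and the 0.5 threshold rule described above. The input values `z` are hypothetical outputs of the linear part of the model, chosen only for illustration.

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued number to a value between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical outputs of the linear equation for five observations
z = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])

proba = sigmoid(z)

# Apply the decision threshold: class 1 if probability >= 0.5, else class 0
labels = (proba >= 0.5).astype(int)

print("probabilities:", np.round(proba, 3))
print("class labels: ", labels)
```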

<font color='Red'>**How it works:**</font>
Instead of fitting a straight line or hyperplane, the logistic regression model uses the logistic function to squeeze the output of a linear equation between 0 and 1.

<img src="Images/LR_1.PNG" width="400"/>     
    
**Maximum-likelihood estimation**: The coefficients (beta values) of the logistic regression model are estimated from the training data by maximum likelihood, which is computed with an iterative optimization procedure.
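
For reference, one standard way to write the model and the quantity that maximum likelihood maximizes is given below. The notation ($\beta_j$ for the coefficients, $x_{ij}$ for feature $j$ of observation $i$, $y_i \in \{0, 1\}$ for the labels, $n$ for the number of observations) is assumed here and may differ from the figures in this notebook.

$$
p_i = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_{i1} + \dots + \beta_k x_{ik})}}
$$

$$
\ell(\beta) = \sum_{i=1}^{n} \left[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \right]
$$

The log-likelihood $\ell(\beta)$ has no closed-form maximizer, which is why the coefficients are found iteratively.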
    

<p float="left">
<img src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRrKH7n4La1Tu-wvRGPv8U_QcRO9XdhUQS4Q8jR7vLvwMzenC8kMxW4uIoS-jCyjnMvxmc&usqp=CAU" width="250"/>
</p>

In [None]:
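# Fit a logistic regression model on the balanced training data

from sklearn.linear_model import LogisticRegression  

logmodel = LogisticRegression()
# ravel() flattens the single-column target into the 1-D array scikit-learn expects
logmodel.fit(x_smote, y_smote.values.ravel())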

In [None]:
# Predict labels for the test set and for the resampled training set
y_pred = logmodel.predict(x_test)
y_train_pred = logmodel.predict(x_smote)

### Classification Evaluation Metrics <a class="anchor" id="section_3_4"></a>

#### Confusion matrix <a class="anchor" id="section_3_4_1"></a>

<font color='Red'>**Confusion matrix:**</font>
A confusion matrix is a table that is often used to describe the performance of a classification model (or "classifier") on a set of test data for which the true values are known.


<img src="https://rapidminer.com/wp-content/uploads/2022/06/Confusion-Matrix-1.jpeg" width="450"/>


#### Accuracy, Precision, Recall, F1 Score, ROC, AUC <a class="anchor" id="section_3_4_2"></a>


**1. Accuracy**
Ratio of the number of correct predictions to the total number of predictions.
It lies in the range [0, 1]. Higher accuracy generally means a better model, although it can be misleading on imbalanced data.

**2. Precision**
Precision is the proportion of positive class predictions that actually belong to the positive class: TP / (TP + FP).

**3. Recall / Sensitivity**
Recall is the proportion of actual positive samples that are correctly classified as positive: TP / (TP + FN).

**4. F1 Score**
F1 score is the harmonic mean of precision and recall.
F1 = (2 * Precision * Recall) / (Precision + Recall) 

**5. ROC AUC**
AUC (Area Under the Curve) - ROC (Receiver Operating Characteristic)
The ROC curve plots sensitivity (true positive rate) against 1 - specificity (false positive rate) across classification thresholds. ROC AUC indicates how well the predicted probabilities separate the positive class from the negative class.
The greater the AUC (range 0 to 1), the better the performance of the model; 0.5 corresponds to random guessing.
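
As a small illustration of ROC AUC, the sketch below uses scikit-learn's `roc_curve` and `roc_auc_score` on four hand-made observations. The labels and scores are hypothetical; note that the functions take predicted scores or probabilities, not hard class labels.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Hypothetical true labels and predicted probabilities of the positive class
y_true_demo = np.array([0, 0, 1, 1])
y_score_demo = np.array([0.1, 0.4, 0.35, 0.8])

# False positive rate and true positive rate at each threshold
fpr, tpr, thresholds = roc_curve(y_true_demo, y_score_demo)

# Area under the ROC curve
auc_value = roc_auc_score(y_true_demo, y_score_demo)

print("FPR:", fpr)
print("TPR:", tpr)
print("ROC AUC:", auc_value)  # 0.75: three of the four positive/negative pairs are ranked correctly
```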

<img src="Images/EM_1.PNG" width="550"/>



<img src="https://preview.redd.it/1rxo44rmhec41.png?auto=webp&s=edf666bea11f07b018289a94dd8f1c431a500b91" width="250"/>

<font color='Green'>**Example:**</font> Let's consider 10,000 manually classified transactions, with 300 fraudulent transactions and 9,700 non-fraudulent transactions. You run your classifier on every transaction, predict the class label (fraudulent or non-fraudulent), and summarise the results in the following confusion matrix:

<img src="https://miro.medium.com/max/875/0*7-zkSjvB-QOqqD8u.png" width="400"/>

<img src="https://miro.medium.com/max/774/1*W7l6mq-CxFuQbOamuXJYIQ.png" width="350"/>  

    
What percentage of the positive (fraudulent) cases were identified?
        
<img src="https://miro.medium.com/max/735/1*NluETFtk71xnWPStB9-9tA.png" width="300"/>
The classifier caught 33.3% of the fraudulent transactions.

<img src="https://miro.medium.com/max/501/1*2_Znl36tIqA_KRBDKLWn6A.png" width="200"/>
When the classifier predicts that a transaction is fraudulent, it is correct only 12.5% of the time.
    
F1 is usually more useful than Accuracy, especially if you have an uneven class distribution.<br>
    
<img src="https://miro.medium.com/max/604/1*oo1-C74CC1i5wYkkzJv2Dw.png" width="250"/>
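
The figures in the example above can be reproduced from confusion-matrix counts. The counts below are an assumption: they are inferred from the stated totals (300 fraudulent, 9,700 non-fraudulent), the 33.3% recall, and the 12.5% precision, rather than read from the image.

```python
# Confusion-matrix counts inferred from the text (assumption, see note above)
tp = 100    # fraudulent, predicted fraudulent
fn = 200    # fraudulent, predicted non-fraudulent
fp = 700    # non-fraudulent, predicted fraudulent
tn = 9000   # non-fraudulent, predicted non-fraudulent

total = tp + fn + fp + tn

accuracy_demo = (tp + tn) / total
recall_demo = tp / (tp + fn)
precision_demo = tp / (tp + fp)
f1_demo = 2 * precision_demo * recall_demo / (precision_demo + recall_demo)

print(f"Accuracy:  {accuracy_demo:.3f}")
print(f"Recall:    {recall_demo:.3f}")
print(f"Precision: {precision_demo:.3f}")
print(f"F1 score:  {f1_demo:.3f}")
```

Under these counts the accuracy is high while the F1 score is low, which is the contrast the example is making.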

In [None]:
# Import evaluation metrics
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import roc_auc_score, roc_curve, auc

In [None]:
# Compute all scores for Logistic Regression
# Note: scikit-learn metrics expect the true labels first, then the predictions

# Confusion Matrix
print("Confusion matrix: \n", confusion_matrix(y_test, y_pred))

training_accuracy = round(accuracy_score(y_smote, y_train_pred), 3)
print("Train accuracy: ", training_accuracy)

accuracy = round(accuracy_score(y_test, y_pred), 3)
print("Test accuracy: ", accuracy)

precision = round(precision_score(y_test, y_pred), 3)
print("Precision: ", precision)

recall = round(recall_score(y_test, y_pred), 3)
print("Recall: ", recall)

# Use a new variable name so the imported f1_score function is not overwritten
f1 = round(f1_score(y_test, y_pred), 3)
print("F1 score: ", f1)

# ROC AUC is computed from the predicted probability of the positive class
y_pred_proba = logmodel.predict_proba(x_test)[:, 1]
roc_auc = round(roc_auc_score(y_test, y_pred_proba), 3)
print("ROC AUC score: ", roc_auc)

### 2. K Nearest Neighbours (KNN)  <a class="anchor" id="section_3_2_2"></a>

<font color='Red'>**Definition:**</font> 
K-Nearest Neighbors is a classification and regression algorithm that assigns a class (or value) to a new data point based on its distance to the labeled data points in the training set.  

<font color='Red'>**How it works:**</font> 
1. Determine the distance between the new observation and all data points in the training set.
2. Sort the distances.
3. Identify the K closest neighbors.
4. Vote for the most frequent label among them (in the case of classification) or average their labels (in the case of regression).
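
The four steps above can be sketched from scratch with NumPy. This is a minimal illustration for classification with Euclidean distance, not the scikit-learn implementation; the function name `knn_predict` and the toy data are hypothetical.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Predict the label of x_new by majority vote among its k nearest neighbors."""
    # Step 1: distance between the new observation and every training point
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Steps 2 and 3: sort the distances and keep the indices of the k closest points
    nearest = np.argsort(distances)[:k]
    # Step 4: vote for the most frequent label among the neighbors
    votes = Counter(y_train[nearest])
    return str(votes.most_common(1)[0][0])

# Toy training data: two well-separated groups
X_toy = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0],
                  [6.0, 6.0], [7.0, 7.5], [6.5, 8.0]])
y_toy = np.array(["A", "A", "A", "B", "B", "B"])

pred_near_a = knn_predict(X_toy, y_toy, np.array([1.8, 1.5]), k=3)
pred_near_b = knn_predict(X_toy, y_toy, np.array([6.4, 7.0]), k=3)
print(pred_near_a, pred_near_b)
```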

<img src="https://miro.medium.com/max/1400/1*R9P-psALmaTA8r0s9dNECQ.gif" width="300" align="center">


**Distance Metrics:** A distance metric measures the distance between the feature values of the training data and a new test input. Euclidean distance is the usual choice; Manhattan distance is a common alternative.
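
A short sketch comparing the two distance metrics on a pair of hypothetical points:

```python
import numpy as np

# Two hypothetical points in a 2-D feature space
a = np.array([1.0, 2.0])
b = np.array([4.0, 6.0])

# Euclidean distance: square root of the sum of squared differences
euclidean = np.sqrt(np.sum((a - b) ** 2))

# Manhattan distance: sum of absolute differences
manhattan = np.sum(np.abs(a - b))

print("Euclidean:", euclidean)  # 5.0
print("Manhattan:", manhattan)  # 7.0
```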

<img src="https://miro.medium.com/max/1068/1*lTYhxn9o3H8g9twdEoyiRA.png" width="200"/>

**K Value:**  K value indicates the count of the nearest neighbors. 
   - In scikit-learn's `KNeighborsClassifier`, the default value of K (`n_neighbors`) is 5.
   - There is no structured way to find the best value of K; however, one can iterate through a range of values and pick the one that performs best on validation data.
   - K should be an odd number for binary (two-class) classification, to avoid tied votes.
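
One possible way to iterate through a range of K values is cross-validation, sketched below. The synthetic dataset from `make_classification`, the candidate values, and the 5-fold setting are assumptions for illustration; with the notebook's data, `x_smote` and `y_smote` would take the place of `X_demo` and `y_demo`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary classification data (hypothetical stand-in for the real dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)

# Odd candidate values of K, to avoid tied votes in binary classification
candidate_k = [1, 3, 5, 7, 9, 11]

# Mean 5-fold cross-validated accuracy for each K
cv_scores = {}
for k in candidate_k:
    knn_demo = KNeighborsClassifier(n_neighbors=k)
    cv_scores[k] = cross_val_score(knn_demo, X_demo, y_demo, cv=5).mean()

best_k = max(cv_scores, key=cv_scores.get)
print("Mean CV accuracy per K:", {k: round(s, 3) for k, s in cv_scores.items()})
print("Best K:", best_k)
```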
   
   
*Example:* Consider K = 3 when identifying whether an infection is viral or bacterial.

<img src="Images/KNN_2.PNG" width="450"/>

Reference - https://people.revoledu.com/kardi/tutorial/KNN/KNN_Numerical-example.html 

In [None]:
# Python library & function to implement KNN

from sklearn.neighbors import KNeighborsClassifier     
knn = KNeighborsClassifier()

### 3. Decision Trees <a class="anchor" id="section_3_2_3"></a>

<font color='Red'>**Definition:**</font> 
The Decision Tree algorithm builds a model that predicts the class or value of the target variable by learning simple decision rules inferred from prior data (training data). A decision tree is built by asking a yes/no question and splitting on the answer, which leads to the next decision.

<img src="Images/DT_1.PNG" width="450"/>

<font color='Red'>**Important Terminologies:**</font> 

    Root Node: It represents the entire population or sample, which gets divided into two or more homogeneous sets.
    Splitting: It is a process of dividing a node into two or more sub-nodes.
    Decision Node: A sub-node that splits into further sub-nodes
    Leaf / Terminal Node: Nodes that do not split
    Branch / Sub-Tree: A subsection of the entire tree is called branch or sub-tree.
 
 
<font color='Red'>**How it works:**</font> 
It uses if-then rules that are mutually exclusive and exhaustive for classification. The process keeps breaking the data down into smaller subsets while the associated decision tree is built incrementally. 

Steps:
1. Begin with the original set S as the root node.
2. Select the best attribute in the dataset using an Attribute Selection Measure (smallest entropy after the split, that is, largest information gain). 
3. Divide S into subsets, one for each possible value of the best attribute.
4. Recurse on each subset, considering only the attributes that have not been used before.
5. Continue until a stage is reached where the nodes cannot be split further; each such final node is called a leaf node.

**Need for choosing "best" attribute:**
The algorithm searches the space of possible decision trees from simple to complex and stops at the first tree that classifies the training examples correctly. Because this search is greedy, choosing the best attribute at each step tends to keep the resulting tree small.
Example:

<img src="Images/DT_2.PNG" width="500"/>

<font color='Red'>**Attribute Selection Measures:**</font>

- **Gini Index:** It indicates how good a split is by measuring the probability that a randomly selected element of a node would be classified incorrectly if it were labeled at random according to the class distribution in that node.
    It measures the impurity of a node (value ranges from 0 up to, but not including, 1)
    A feature with a lower Gini index is chosen for a split
    **0** = all the elements belong to only one class, 
    values close to **1** = the elements are spread evenly across many classes (for two classes the maximum is 0.5)

<img src="Images/DT_3.PNG" width="500"/>

- **Entropy:** Entropy is a measure of the randomness in the information being processed. The higher the entropy, the harder it is to draw any conclusions from that information.
If all the observations in a dataset belong to one class, the dataset is homogeneous: there is no impurity or randomness, and its entropy is zero.<br>

<img src="https://www.saedsayad.com/images/Entropy_3.png" width="250"/>

- **Information Gain:** It builds on entropy and measures how much a split decreases it; the tree aims to reduce entropy from the root node towards the leaf nodes.
       Information Gain = Entropy before splitting - Entropy after splitting (weighted average over the sub-nodes)
The feature with the highest information gain is chosen as the best feature to split on.
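
The three attribute selection measures above can be computed by hand, as in the following sketch. The helper names and the toy labels (a parent node of 8 observations split into two sub-nodes of 4) are hypothetical.

```python
import numpy as np

def class_proportions(labels):
    """Return the proportion of each class in a node."""
    _, counts = np.unique(labels, return_counts=True)
    return counts / counts.sum()

def entropy(labels):
    p = class_proportions(labels)
    return float(-np.sum(p * np.log2(p)))

def gini(labels):
    p = class_proportions(labels)
    return float(1.0 - np.sum(p ** 2))

def information_gain(parent, children):
    """Entropy before splitting minus the weighted entropy after splitting."""
    weighted_after = sum(len(c) / len(parent) * entropy(c) for c in children)
    return entropy(parent) - weighted_after

# Toy node: 4 observations of class 1 and 4 of class 0
parent = np.array([1, 1, 1, 1, 0, 0, 0, 0])
left = np.array([1, 1, 1, 0])    # mostly class 1
right = np.array([1, 0, 0, 0])   # mostly class 0

print("Parent entropy:  ", entropy(parent))                               # 1.0
print("Parent Gini:     ", gini(parent))                                  # 0.5
print("Left Gini:       ", gini(left))                                    # 0.375
print("Information gain:", round(information_gain(parent, [left, right]), 4))  # 0.1887
```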


<font color='Red'>**Pruning:**</font> Removing sub-nodes of a decision node that make use of features with low importance.

A tree that is too large increases the risk of overfitting, while a tree that is too small may not capture all the important features of the dataset. Pruning is a technique that decreases the size of the learned tree without substantially reducing its accuracy.
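
As one concrete way to prune, scikit-learn's `DecisionTreeClassifier` supports cost-complexity pruning through its `ccp_alpha` parameter. The sketch below compares an unpruned and a pruned tree; the synthetic data and the value `ccp_alpha=0.01` are arbitrary assumptions for illustration, and other approaches (such as limiting `max_depth`) are equally valid.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, slightly noisy data (hypothetical stand-in for the real dataset)
X_demo, y_demo = make_classification(n_samples=500, n_features=10,
                                     flip_y=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

# Fully grown tree (no pruning)
full_tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

# Same tree with cost-complexity pruning applied
pruned_tree = DecisionTreeClassifier(random_state=0, ccp_alpha=0.01).fit(X_tr, y_tr)

for name, tree in [("full", full_tree), ("pruned", pruned_tree)]:
    print(f"{name:>6}: nodes={tree.tree_.node_count}, "
          f"leaves={tree.get_n_leaves()}, "
          f"train acc={tree.score(X_tr, y_tr):.3f}, "
          f"test acc={tree.score(X_te, y_te):.3f}")
```

Comparing train and test accuracy for the two trees gives a quick indication of whether the unpruned tree is overfitting.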

In [None]:
# Python library & function to implement Decision trees

from sklearn.tree import DecisionTreeClassifier   
dtc = DecisionTreeClassifier()

**Other Common ML Algorithms**

4. Naive Bayes: Naive Bayes method is a supervised learning algorithm based on applying Bayes’ theorem with the “naive” assumption of conditional independence between every pair of features given the value of the class variable.
    - from sklearn.naive_bayes import GaussianNB
    - nb = GaussianNB() 
    
5. Support Vector Machines (SVM): SVM is a discriminative classifier formally defined by a separating hyperplane.
    - from sklearn.svm import SVC                             
    - svc = SVC(kernel='rbf')   # or kernel='linear'

#### **End notes:** 

Part 2 of this session will cover more classification algorithms and explain how their performance can be improved.