<h1> Day 15 - Class </h1>

## Steps involved in solving an AI/ML problem

- Problem definition
- EDA
- Preprocessing
- Model building
- Model validation
- Feature engineering
- Model deployment

### Model building

Machine learning models

- Supervised
- Unsupervised

In supervised we have
- Regression
    - linear regression
    - lasso regression
    - ridge regression
    - elastic net regression
    - random forest regression
    - decision tree regression
    - SVM regression
    
- Classification
    - Logistic regression
    - Random forest classification
    - decision tree classification
    - bagging classification
    - XGBoost (extreme gradient boosting)
    - AdaBoost (adaptive boosting)
    - gradient boosting
    - catboost
    
 In unsupervised we have
 - Clustering
     - Kmeans
     - Agglomerative
     - K Prototype
     - Kmodes
 - PCA (principal component analysis)
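 - LDA (linear discriminant analysis; it uses class labels, so strictly it is a supervised technique, listed here for dimensionality reduction)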
 
 - KNN
 - Tensorflow, keras, OpenCV
 - ANN
 - CNN
 - NLP

### Model Validation (Regression)
- The first rule of thumb for a regression model is that 'y' should be a continuous variable
- Metrics such as Mean Absolute Error measure how good a regression model is

<b> Mean Absolute Error (MAE) </b>

<img src='img/mae-01.png' />

<b> Mean Squared Error (MSE) </b>
<img src='img/mse-01.gif' />


<b> Root Mean Squared Error (RMSE) </b>
<img src='img/rmse-01.png' />


<b> Mean Absolute Percentage Error (MAPE) </b>

<img src='img/mape-01.jpeg' />

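For a single observation, Ypredicted - Yactual gives the loss (the error).

Aggregating the losses over all observations gives the cost.

While calculating the cost (i.e. while summing), use the absolute value (as in MAE) or the square (as in MSE) so that positive and negative errors do not cancel each other out.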

MSE penalizes larger errors more heavily, while errors smaller than 1 shrink when squared, e.g.

8^2 = 64

0.8^2 = 0.64 (a small error of 0.8 is reduced further)
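The four regression metrics above can be sketched with NumPy as follows. This is a minimal illustration that assumes the standard definitions of MAE, MSE, RMSE and MAPE; the arrays `y_actual` and `y_predicted` are made-up values, not data from the class.

```python
import numpy as np

# Hypothetical actual and predicted values, made up for illustration
y_actual = np.array([100.0, 200.0, 300.0, 400.0])
y_predicted = np.array([110.0, 190.0, 330.0, 400.0])

# Per-row loss: Ypredicted - Yactual
errors = y_predicted - y_actual

mae = np.mean(np.abs(errors))                     # Mean Absolute Error
mse = np.mean(errors ** 2)                        # Mean Squared Error
rmse = np.sqrt(mse)                               # Root Mean Squared Error
mape = np.mean(np.abs(errors / y_actual)) * 100   # Mean Absolute Percentage Error, in percent

print("MAE :", mae)
print("MSE :", mse)
print("RMSE:", rmse)
print("MAPE:", mape)
```

Note that MAPE divides by the actual values, so it is undefined when any actual value is 0.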

### Model Validation (Classification)

y variable here will be a categorical variable.

We cannot use validation metrics such as MAE in this case. Instead, we use a confusion matrix.

<img src='img/confusion-matrix-01.png' />

The sum of TP, TN, FP and FN is the total number of observations.

When the y variable has 2 unique values, it's called binary classification. More than 2 unique values means we are doing multiclass classification.

There are other metrics in classification problems, such as
- recall (sensitivity)
- precision (positive predictive value)
- specificity (true negative rate)
- F1-score
- ROC curve and AUC

We shall see them in detail when we cover classification algorithms.
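As a preview, here is a minimal sketch of how the confusion matrix counts and the metrics listed above relate to each other for a binary problem. The label lists are hypothetical, and the formulas are the standard textbook definitions rather than anything specific to this class.

```python
# Hypothetical binary labels (1 = positive class, 0 = negative class)
y_actual    = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_predicted = [1, 0, 0, 1, 0, 1, 1, 1, 1, 0]

pairs = list(zip(y_actual, y_predicted))

# The four cells of the confusion matrix
tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # true positives
tn = sum(1 for a, p in pairs if a == 0 and p == 0)  # true negatives
fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # false positives
fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # false negatives

total = tp + tn + fp + fn          # equals the number of observations

recall = tp / (tp + fn)            # sensitivity
precision = tp / (tp + fp)         # positive predictive value
specificity = tn / (tn + fp)       # true negative rate
f1 = 2 * precision * recall / (precision + recall)

print("TP, TN, FP, FN:", tp, tn, fp, fn)
print("recall:", recall, "precision:", precision)
print("specificity:", specificity, "f1:", f1)
```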

In [1]:
import pandas as pd
import numpy as np
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity='all'
import seaborn as sns

## Algorithms

### Clustering - Unsupervised

In unsupervised algorithms there is no 'y' variable. A good cluster is such that,

1) The distance between each point in the cluster and its centroid is minimum

2) The distance between the different clusters formed is maximum

Different ways of finding distances are,

1) Manhattan distance

2) Minkowski distance

3) Euclidean distance
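The three distances are related: Minkowski distance with p = 1 is the Manhattan distance, and with p = 2 it is the Euclidean distance. A small sketch, using two made-up points and helper names chosen only for this example:

```python
import numpy as np

def minkowski(a, b, p):
    """Minkowski distance of order p between two points."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return np.sum(np.abs(a - b) ** p) ** (1.0 / p)

def manhattan(a, b):
    # Minkowski with p = 1: sum of absolute differences
    return minkowski(a, b, 1)

def euclidean(a, b):
    # Minkowski with p = 2: straight-line distance
    return minkowski(a, b, 2)

# Hypothetical points, made up for illustration
point_a = [1, 2]
point_b = [4, 6]

print("Manhattan:", manhattan(point_a, point_b))
print("Euclidean:", euclidean(point_a, point_b))
print("Minkowski (p=3):", minkowski(point_a, point_b, 3))
```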

Entropy is a measure of how homogeneous or heterogeneous the data is within a cluster.
The lower the entropy, the better the cluster. For homogeneous data, entropy will be low.
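One common way to put a number on this is Shannon entropy over the category values inside a cluster. The sketch below assumes that definition (the notes do not say which formula is meant), and the two example clusters are made up.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of the category values in one cluster."""
    counts = Counter(labels)
    total = len(labels)
    return sum(-(c / total) * log2(c / total) for c in counts.values())

# Hypothetical clusters, made up for illustration
homogeneous_cluster = ["A", "A", "A", "A"]   # all members alike
mixed_cluster = ["A", "A", "B", "B"]         # evenly split

print("homogeneous:", entropy(homogeneous_cluster))
print("mixed      :", entropy(mixed_cluster))
```

The homogeneous cluster has the lower entropy, which matches the statement above.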

Different clustering algorithms,

1) K-Means

2) K-Modes

3) Agglomerative

When we group the data by rows, it's called CLUSTERING or SEGMENTATION

When we group the data by columns, it's called DIMENSIONALITY REDUCTION or FACTOR ANALYSIS

Steps involved in implementing cluster analysis in practice

1) Identify the clusters

2) Study the behaviour of the clusters

3) Design the strategies according to the cluster behaviour

Which algorithm to use when?
- K-Means - When all the columns are continuous variables
- K-Modes - When all the columns are categorical (class) variables
- K-Prototypes - When the data is a mix of continuous and categorical variables


In all clustering algorithms, if a data row is at the same distance from two or more of the random points, assign the row to the cluster with the least (minimum) variance.

After forming the clusters, target the members within each cluster, e.g. find all rows within a cluster whose value is below the mean value of that cluster, and start targeting those users to improve sales.
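The targeting idea can be sketched with pandas. The data frame, its column names (`customer`, `cluster`, `spend`) and the values are all hypothetical; the `cluster` column stands in for the labels a clustering algorithm would have produced.

```python
import pandas as pd

# Hypothetical customers with cluster labels already assigned
df = pd.DataFrame({
    "customer": ["c1", "c2", "c3", "c4", "c5", "c6"],
    "cluster":  [0, 0, 0, 1, 1, 1],
    "spend":    [100.0, 250.0, 400.0, 20.0, 30.0, 70.0],
})

# Mean spend of each row's own cluster, aligned row by row
cluster_mean = df.groupby("cluster")["spend"].transform("mean")

# Rows that fall below the mean of their cluster are the ones to target
targets = df[df["spend"] < cluster_mean]
print(targets)
```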

### K-Means

How do we choose the columns to be considered for K-Means?

- Pick continuous variables
- One way is to check how many unique values a column has
- A better way is to check the variance of the columns. The higher the variance, the stronger the case for including the column
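Both checks can be sketched with pandas. The data frame below is made up, and the rule used here (keep numeric columns with non-zero variance, highest variance first) is only one possible reading of the guideline above.

```python
import pandas as pd

# Hypothetical customer data, made up for illustration
df = pd.DataFrame({
    "age":          [25, 32, 47, 51, 38],
    "annual_spend": [1200.0, 300.0, 5400.0, 150.0, 2600.0],
    "store_id":     [1, 1, 1, 1, 1],              # constant: no information
    "segment":      ["a", "b", "a", "c", "b"],    # categorical: not for K-Means
})

# Keep only the numeric columns
numeric = df.select_dtypes(include="number")

# Check 1: number of unique values per column
unique_counts = numeric.nunique()

# Check 2: variance per column, highest first, dropping zero-variance columns
variances = numeric.var()
candidates = variances[variances > 0].sort_values(ascending=False).index.tolist()

print(unique_counts)
print(variances)
print("candidate columns:", candidates)
```

Variance depends on the unit of each column, so columns on very different scales are usually standardized before their variances are compared.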

Once the clusters are formed, plot them and see how they look. Also look at the min/max/mean of each cluster to check that the clusters are formed without any overlaps

Steps

1) Identify the number of clusters. There is a statistical way to find the number of clusters. For now, let's assume the number of clusters to be 3

2) Pick 3 random rows from the data set (3 because we are going to create 3 clusters)

3) Find the distance between every row and each of these 3 random points. Each row will be associated with 3 distances, and the row is assigned to the cluster with the least distance. Once we do this for all the rows in the dataset, every row will have been assigned to cluster1, cluster2 or cluster3

4) Find the centroid of each cluster. The centroid is calculated by finding the mean of each column, i.e. centroid = (mean of column1, mean of column2, mean of column3 ....). Since we have 3 clusters, there will be 3 centroids now

5) Repeat steps 3 - 4, using the centroids in place of the random points. After one such iteration the clusters may have changed

6) If the clusters don't change between the 'n'th and (n+1)th iteration, then the final clusters have been formed
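The steps above can be sketched in NumPy as follows. This is a minimal illustration, not a production implementation: it assumes Euclidean distance, the function name `k_means` and its parameters are chosen for this example, and the data points are made up. Because the starting rows are random, different seeds can give different final clusters.

```python
import numpy as np

def k_means(data, k=3, max_iter=100, seed=0):
    """Basic K-Means following the steps above; returns (labels, centroids)."""
    data = np.asarray(data, dtype=float)
    rng = np.random.default_rng(seed)
    # Step 2: pick k random rows as the starting points
    centroids = data[rng.choice(len(data), size=k, replace=False)]
    labels = None
    for _ in range(max_iter):
        # Step 3: distance of every row to every point, assign the nearest
        distances = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = distances.argmin(axis=1)
        # Step 6: stop when the assignments no longer change
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 4: centroid = column-wise mean of each cluster
        for j in range(k):
            if np.any(labels == j):   # keep the old point if a cluster is empty
                centroids[j] = data[labels == j].mean(axis=0)
        # Step 5: the loop repeats with the centroids as the new points
    return labels, centroids

# Hypothetical 2-column data, made up for illustration
data = np.array([
    [1.0, 1.0], [1.2, 0.8], [0.8, 1.1],
    [5.0, 5.0], [5.1, 4.9], [4.9, 5.2],
    [9.0, 1.0], [9.2, 0.9], [8.8, 1.1],
])

labels, centroids = k_means(data, k=3)
print("labels:", labels)
print("centroids:\n", centroids)
```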

The distance calculation in the above steps is similar to KNN; the difference is that in K-Means there is no 'y' variable, whereas in KNN there is.

The statistical way of finding the number of clusters is called K-Elbow

What is K-Elbow?

- Divide the entire dataset into 1 cluster, find the variance
- Divide the entire dataset into 2 clusters, find the variance of each cluster and sum it
- Divide the entire dataset into 3 clusters, find the variance of each cluster and sum it
- and so on...
- The sum of variance drops sharply at first, and after a particular point it doesn't vary much. We can consider the number of clusters at the point where the sharp drop ends as the number of clusters.

If we plot the number of clusters on the x axis and the sum of variance on the y axis, we can see an 'Elbow' like shape at the point where the sum of variance stops dropping sharply
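A rough sketch of the elbow calculation is shown below. It uses the within-cluster sum of squared distances as the "sum of variance", which is an assumption about what the notes mean. The helper `within_cluster_ss`, the generated data and the choice of trying five random starts per k are all illustrative.

```python
import numpy as np

def within_cluster_ss(data, k, seed=0, max_iter=100):
    """Run a basic K-Means and return the summed squared distance to the centroids."""
    rng = np.random.default_rng(seed)
    centroids = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(max_iter):
        dist = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        new_centroids = np.array([
            data[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return float(((data - centroids[labels]) ** 2).sum())

# Hypothetical data: three groups of points around made-up centers
rng = np.random.default_rng(42)
data = np.vstack([
    rng.normal(loc=center, scale=0.5, size=(30, 2))
    for center in ([0, 0], [6, 6], [12, 0])
])

# For each k, keep the best of a few random starts to reduce the effect of a bad start
ks = list(range(1, 7))
sums = [min(within_cluster_ss(data, k, seed=s) for s in range(5)) for k in ks]

for k, s in zip(ks, sums):
    print(k, round(s, 2))
```

Plotting `ks` against `sums` (for example with a seaborn line plot) should show the curve flattening after the elbow.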

### K-Modes

In [None]:
# Steps
# 1) Identify the number of clusters. There is a way to find the number of clusters, yet to be covered.
#    The method name is 'Silhouette'
# 2) Pick 3 random rows from the data set (3 because we are going to create 3 clusters)
# 3) Form a dissimilarity index for each row against each random point: compare the column1 value of
#    the row with the column1 value of the random point and put 0 if it's a match and 1 if it's not,
#    and do the same for every column. The lower the sum of the elements in the dissimilarity index,
#    the closer the row is to that random point. Assign the row to the closest random point.
# 4) After 1 iteration we would've assigned the initial clusters. Find the centroid of each cluster.
#    The centroid is calculated by finding the mode of each column, i.e. centroid = (mode of column1,
#    mode of column2, mode of column3 ....). Since we have 3 clusters, there
#    will be 3 centroids now
# 5) Repeat steps 3 - 4, using the centroids in place of the random points. After one such iteration
#    the clusters may have changed
# 6) If the clusters don't change between the 'n'th and (n+1)th iteration, then the final clusters
#    have been formed
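The K-Modes steps can be sketched in plain Python as below. The function names, the tie handling (first match wins) and the example rows are assumptions made for this illustration.

```python
import random
from collections import Counter

def dissimilarity(row, point):
    """Number of columns where the two rows differ (0 = match, 1 = no match)."""
    return sum(1 for a, b in zip(row, point) if a != b)

def k_modes(rows, k=3, max_iter=100, seed=0):
    """Basic K-Modes following the steps above; returns (labels, modes)."""
    rng = random.Random(seed)
    modes = rng.sample(rows, k)          # step 2: k random rows
    labels = None
    for _ in range(max_iter):
        # Step 3: assign each row to the point with the lowest dissimilarity
        new_labels = [
            min(range(k), key=lambda j: dissimilarity(row, modes[j]))
            for row in rows
        ]
        if new_labels == labels:         # step 6: nothing changed, stop
            break
        labels = new_labels
        # Step 4: new centroid = mode of each column within the cluster
        for j in range(k):
            members = [row for row, lab in zip(rows, labels) if lab == j]
            if members:                  # keep the old point if a cluster is empty
                modes[j] = tuple(
                    Counter(col).most_common(1)[0][0] for col in zip(*members)
                )
        # Step 5: the loop repeats with the modes as the new points
    return labels, modes

# Hypothetical categorical rows, made up for illustration
rows = [
    ("red", "small", "round"), ("red", "small", "oval"), ("red", "medium", "round"),
    ("blue", "large", "square"), ("blue", "large", "round"), ("blue", "medium", "square"),
    ("green", "small", "oval"), ("green", "large", "oval"), ("green", "small", "square"),
]

labels, modes = k_modes(rows, k=3)
print("labels:", labels)
print("modes:", modes)
```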

### Agglomerative

In [1]:
# This is not a popular clustering method because of the large number of computations involved:
# if there is a 100 row data set, then we form a 100*100 matrix.
# The positive is that the visualization created is very self-explanatory

# 5 types of link functions
# 1) Single link (also known as Min link)
# 2) Complete link
# 3) Average Link
# 4) Centroid link
# 5) Ward's method (also known as Min Variance)

# For the data rows r1,r2,r3....,rn find the distance between each row and every other row. This is where
# an n*n matrix will be formed
#      r1   r2   r3   r4
# r1   0    d1   d2   d3
# r2   d1   0    d4   d5  
# r3   d2   d4   0    d6          
# r4   d3   d5   d6   0         
#
# In this matrix we have the distance between every row and every other row.
# Find the minimum distance; let's assume d5 is the minimum. Then in the next iteration we consider r2,r4 as
# a combined data point
#        r1   r2r4   r3
# r1     0    d7     d2
# r2r4   d7   0      d8
# r3     d2   d8     0
# Here d7 (r1 to r2r4) and d8 (r3 to r2r4) are new distances, computed using one of the link functions.
#
# We have to repeat the above steps till we get to 1 cluster. This records the full order of merges,
# so the hierarchy can later be cut at any level to get the desired number of clusters.
# When we find the distance between r1 and r2r4, i.e. r1 - r2r4:
# 1) if we take the minimum of (r1-r2) and (r1-r4), that's called Single link
# 2) if we take the maximum of (r1-r2) and (r1-r4), that's called Complete link
# 3) if we take the average of (r1-r2) and (r1-r4), that's called Average link
# 4) if we find the centroids of (r1) and (r2 r4) and take the distance between the centroids, it's
#    called Centroid link
# 5) Ward's link method: merge the pair of clusters that gives the smallest total from the steps below
#    Step 1: Find the mean of each cluster.
#    Step 2: Calculate the distance between each object in a particular cluster, and that cluster’s mean.
#    Step 3: Square the differences from Step 2.
#    Step 4: Sum (add up) the squared values from Step 3.
#    Step 5: Add up all the sums of squares from Step 4.
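The merge procedure and the first three link functions can be sketched with NumPy as below. This is a simplified illustration that recomputes cluster distances from the original n*n matrix on every pass; the function name `agglomerative` is chosen for this example, and Centroid and Ward's links are left out. The four 1-column rows are made up so that, as in the walkthrough above, r2 and r4 are the closest pair (d5 is the minimum).

```python
import numpy as np
from itertools import combinations

def agglomerative(data, linkage="single"):
    """Merge the two closest clusters repeatedly; return the merge history."""
    data = np.asarray(data, dtype=float)
    # n*n matrix of Euclidean distances between every pair of rows
    dist = np.linalg.norm(data[:, None, :] - data[None, :, :], axis=2)
    # Link function applied to all row-to-row distances between two clusters
    combine = {"single": np.min, "complete": np.max, "average": np.mean}[linkage]
    clusters = [[i] for i in range(len(data))]    # every row starts on its own
    history = []
    while len(clusters) > 1:
        # Pick the pair of clusters with the smallest linked distance
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda ab: combine(dist[np.ix_(clusters[ab[0]], clusters[ab[1]])]),
        )
        d = float(combine(dist[np.ix_(clusters[a], clusters[b])]))
        history.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)] + [merged]
    return history

# Hypothetical rows r1..r4 (row indices 0..3), made up for illustration
data = [[0.0], [4.0], [10.0], [5.0]]

for link in ("single", "complete", "average"):
    print(link)
    for left, right, d in agglomerative(data, link):
        print("  merge", left, "+", right, "at distance", d)
```

The list of merge distances returned here is the same information a dendrogram draws as the heights of its joins.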


In [None]:
# A dendrogram is the visual representation of how clusters are formed in agglomerative clustering