# Introduction - SVM 

SVM (Support Vector Machine) is a supervised machine learning algorithm used for both classification and regression, although it is mostly preferred for classification.

Given a dataset, the algorithm tries to divide the data using hyperplanes and then makes predictions. SVM is a non-probabilistic linear classifier. Many other classifiers predict the probability that a data point belongs to one group or another; SVM directly assigns the data point to a group without any probability calculation.
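To make the "non-probabilistic" point concrete, here is a minimal sketch using scikit-learn's `SVC` on a made-up four-point dataset (the data and variable names are illustrative assumptions, not part of the loan example used later). `predict` returns a class label directly, while `decision_function` returns a signed score relative to the hyperplane rather than a probability.

```python
import numpy as np
from sklearn.svm import SVC

# Toy, linearly separable data (made up for illustration).
X_toy = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 3.0], [4.0, 4.0]])
y_toy = np.array([0, 0, 1, 1])

toy_model = SVC(kernel="linear")
toy_model.fit(X_toy, y_toy)

new_points = np.array([[0.5, 0.5], [3.5, 3.5]])
labels = toy_model.predict(new_points)                   # hard class labels
signed_scores = toy_model.decision_function(new_points)  # signed score, not a probability

print(labels)
print(signed_scores)
```

A negative score places the point on the side of class 0 and a positive score on the side of class 1.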



How does it work?
- SVM constructs the best line or decision boundary, called a **hyperplane**, which can be used for classification, regression, or outlier detection. The dimension of the hyperplane depends on the number of features. If the number of input features is 2, the hyperplane is just a line. If the number of input features is 3, the hyperplane becomes a two-dimensional plane.

- This hyperplane has 2 margin lines parallel to it, one on each side, at some distance so that it can distinctly classify the data points. The distance between the 2 margin lines is called the **marginal distance**.

- These 2 margin lines pass through the nearest +ve points and the nearest -ve points. The points through which the margin lines pass are called **support vectors**. Support vectors are important because they determine the position of the margin lines and therefore the maximum marginal distance.
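The three ideas above (hyperplane, margin lines, support vectors) can be inspected on a fitted model. The sketch below assumes scikit-learn's `SVC` with a linear kernel and a tiny hand-made dataset; the large `C` is an assumption chosen to approximate a hard margin.

```python
import numpy as np
from sklearn.svm import SVC

# Hand-made 2-feature data: two -ve points and two +ve points (illustrative only).
X_demo = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 3.0], [4.0, 4.0]])
y_demo = np.array([-1, -1, 1, 1])

# A large C approximates a hard margin on separable data.
linear_svm = SVC(kernel="linear", C=1000.0)
linear_svm.fit(X_demo, y_demo)

w = linear_svm.coef_[0]       # normal vector of the hyperplane w.x + b = 0
b = linear_svm.intercept_[0]
marginal_distance = 2.0 / np.linalg.norm(w)  # distance between the two margin lines

print("support vectors:\n", linear_svm.support_vectors_)
print("w =", w, "b =", b)
print("marginal distance =", marginal_distance)
```

On this data the nearest point of each class, (1, 1) and (3, 3), define the margin, so the marginal distance comes out close to the distance between them (about 2.83).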
 

## Understanding the Mathematics involved
Let’s take the example of the following dataset and see how we can divide the data into appropriate groups.
<img src='SVM_intution.PNG'  width="300">

We can see that there are two groups of data. The question is how to divide these points into two groups. It can be done using any of the three lines. In fact, an infinite number of straight lines can divide these points into two classes. So which line should we choose?
SVM solves this problem using the maximum margin, as shown below.
<img src='SVM_hyperplane.PNG' width="400">


The black line in the middle is the optimum classifier. This line is drawn to maximise the distance of the classifier line from the nearest points in the two classes. In SVM terms it is called a __hyperplane__.
A _hyperplane_ is an (n-1)-dimensional subspace that divides data of n dimensions. Here the data is 2-D, so the hyperplane is one-dimensional, i.e. a line.
The two points (highlighted with circles) that lie on the yellow lines are called the __support vectors__. In this 2-D figure they are points; in a multi-dimensional space they are vectors, hence the name support vector machine: the algorithm creates the optimum classification line by maximising its distance from the support vectors.

##### Significance of marginal distance:
![SVM1%20%282%29.png](attachment:SVM1%20%282%29.png)


Which of these lines suits our problem? In Fig (i) and (ii), the line is close to the data points, so any minor change in a data point might cause it to fall into the other class. In Fig (iii), the line looks more stable: it is far from the data points on either side and not susceptible to small changes in them. Hence the hyperplane with the maximum marginal distance is considered the best hyperplane.

All three lines classify the training data correctly. But the 3rd line is likely to work well on new data too, while the other two might make many errors.

 - Hence, the optimal hyperplane is the one that is farthest from our training data points, i.e. the line with the maximum marginal distance.
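The comparison of the three lines can be sketched numerically. The points and the three candidate lines below are made-up values standing in for Fig (i) to (iii); for each line `w.x + b = 0` the code computes the distance to the closest training point and picks the line where that distance is largest.

```python
import numpy as np

# Made-up training points: the first two belong to class -1, the last two to class +1.
points = np.array([[1.0, 1.0], [2.0, 1.0], [4.0, 3.0], [5.0, 3.0]])
classes = np.array([-1, -1, 1, 1])

# Three hypothetical separating lines, each given as (w, b) for w.x + b = 0.
candidate_lines = {
    "line_1": (np.array([1.0, 0.0]), -2.5),  # x1 = 2.5
    "line_2": (np.array([1.0, 0.0]), -3.0),  # x1 = 3
    "line_3": (np.array([1.0, 1.0]), -5.0),  # x1 + x2 = 5
}

def closest_distance(w, b):
    """Smallest perpendicular distance from the line to any training point."""
    return np.min(np.abs(points @ w + b) / np.linalg.norm(w))

def separates(w, b):
    """True if every point lies on the side of the line matching its class."""
    return bool(np.all(np.sign(points @ w + b) == classes))

distances = {name: closest_distance(w, b) for name, (w, b) in candidate_lines.items()}
best_line = max(distances, key=distances.get)

for name, (w, b) in candidate_lines.items():
    print(name, "separates:", separates(w, b),
          "closest distance:", round(float(distances[name]), 3))
print("largest margin:", best_line)
```

All three lines separate the toy classes, but only one of them keeps the largest distance to its closest point, which is the criterion SVM optimises.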
 
When the data is not linearly separable, the SVM algorithm needs to work in a higher-dimensional space to find a hyperplane that separates the data into different groups. But introducing new dimensions makes the computations more intensive, which impacts the algorithm's performance. To address this, mathematicians came up with kernel methods.
Kernel methods use kernel functions. The key property of a kernel function is that it gives the result of a computation in a higher-dimensional space without calculating the new coordinates in that space. It operates on the existing points in a way that mimics the higher-dimensional computation, so the cost of explicitly mapping every point and then calculating distances between the mapped points is avoided. This is called the kernel trick.
<img src= "SVM_3D_Hyperplane.PNG" width="300">
Image: bogotobogo.com


In the left diagram above, the data has a non-linear distribution, so we cannot classify it using a linear equation. To solve this problem, we can project the points into a 3-dimensional space and then derive a plane that divides the data into two parts. In theory, that’s what a kernel function does, without computing the additional coordinates for the higher dimension.
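A small numeric check can make the kernel trick less abstract. The sketch below uses the degree-2 polynomial kernel `K(x, z) = (x.z)^2` as an assumed example (the text above does not name a specific kernel) and shows that it gives the same number as explicitly mapping 2-D points into 3-D and then taking a dot product.

```python
import numpy as np

def explicit_map(p):
    """Map a 2-D point (p1, p2) to 3-D: (p1^2, sqrt(2)*p1*p2, p2^2)."""
    return np.array([p[0] ** 2, np.sqrt(2) * p[0] * p[1], p[1] ** 2])

def poly2_kernel(x, z):
    """Degree-2 polynomial kernel, computed entirely in the original 2-D space."""
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, 4.0])

via_mapping = np.dot(explicit_map(x), explicit_map(z))  # computes the 3-D coordinates first
via_kernel = poly2_kernel(x, z)                         # never leaves 2-D

print(via_mapping, via_kernel)
```

Both routes give the same value, but the kernel route never builds the 3-D coordinates, which is the saving the kernel trick relies on.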


## Python Implementation

### Business Case: With the given features, we need to predict whether a loan will be approved or not.


In [None]:
## Supervised learning with classification task(2 classes)

In [None]:
##importing the libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
## loading the data
data=pd.read_csv('loan_approved.csv')

In [None]:
## Getting the first few rows of data
data.head()

## Basic Checks

In [None]:
# Quick summary of data
data.info()

In [None]:
# Statistical information of the data
data.describe()

In [None]:
data.shape

## Exploratory Data Analysis

## Data Preprocessing Pipeline

In [None]:
## Checking missing value
data.isnull().sum()

In [None]:
## Getting the rows where the Gender feature is missing
data.loc[data['Gender'].isnull()]

In [None]:
## Checking the distribution across both labels
data.Gender.value_counts()

In [None]:
## Imputing the missing values with mode
data.loc[data['Gender'].isnull()==True,'Gender']='Male'

In [None]:
data.loc[data['Gender'].isnull()==True]

In [None]:
data.Dependents.value_counts()

In [None]:
data.loc[data['Dependents'].isnull()==True,'Dependents']='0'

In [None]:
data.loc[data['Dependents'].isnull()==True]

In [None]:
## getting the counts
data.Married.value_counts()

In [None]:
## Imputing with mode
data.loc[data['Married'].isnull()==True,'Married']='Yes'

In [None]:
data.Self_Employed.value_counts()

In [None]:
# Replace the nan values with mode
data.loc[data['Self_Employed'].isnull()==True,'Self_Employed']='No'

In [None]:
# Credit_History
data.Credit_History.value_counts()

In [None]:
data.loc[data['Credit_History'].isnull()==True,'Credit_History']=1.0

In [None]:
# check for null values
data.isnull().sum()

In [None]:
data

In [None]:
## Histogram since it has numerical value
data.LoanAmount.hist()

Since the data is skewed, we can use the median to replace the NaN values. The mean is recommended only when the data distribution is symmetric.
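To see why the median is the safer choice here, consider a small made-up column with one very large value (the numbers are illustrative and are not taken from `loan_approved.csv`).

```python
import numpy as np
import pandas as pd

# Right-skewed toy values with one missing entry.
amounts = pd.Series([100.0, 110.0, 120.0, 130.0, 700.0, np.nan])

mean_value = amounts.mean()      # pulled upwards by the single large value
median_value = amounts.median()  # stays near the bulk of the data

# Fill the missing entry with the median (pandas skips NaN when computing it).
amounts_filled = amounts.fillna(median_value)

print("mean:", mean_value, "median:", median_value)
print(amounts_filled.tolist())
```

The mean lands well above four of the five observed values, so imputing with it would insert an untypical amount, whereas the median stays close to most of the data.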

In [None]:
np.median(data.LoanAmount.dropna(axis=0))

In [None]:
# Replace the nan values in LoanAmount column with median value
data.loc[data['LoanAmount'].isnull()==True,'LoanAmount']= np.median(data.LoanAmount.dropna(axis=0))

In [None]:
data.Loan_Amount_Term.hist()

In [None]:
# replace the nan values in Loan_Amount_Term with the median value
data.loc[data['Loan_Amount_Term'].isnull()==True,'Loan_Amount_Term']=np.median(data.Loan_Amount_Term.dropna(axis=0))

In [None]:
data.isnull().sum()

In [None]:
## renaming the target column
data.rename(columns={"Loan_Status (Approved)":'Loan_Status'},inplace=True)

In [None]:
## Step 2 Handling the categorical data
data.info()

In [None]:
## Using label encoder to convert the categorical data to numerical data

# Ordinal data

from sklearn.preprocessing import LabelEncoder
lc=LabelEncoder()

data.Married=lc.fit_transform(data.Married)
data.Education=lc.fit_transform(data.Education)
data.Property_Area=lc.fit_transform(data.Property_Area)
data.Loan_Status=lc.fit_transform(data.Loan_Status)
data.Dependents=lc.fit_transform(data.Dependents)  
data.Self_Employed=lc.fit_transform(data.Self_Employed)

In [None]:
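# Nominal Data
# get_dummies returns a DataFrame; keep its single remaining column as 0/1 integers
data['Gender']=pd.get_dummies(data['Gender'],drop_first=True).iloc[:,0].astype(int)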

In [None]:
data.Gender

In [None]:
data

In [None]:
## scaling data
from sklearn.preprocessing import MinMaxScaler
scale=MinMaxScaler()
data[['ApplicantIncome','CoapplicantIncome','LoanAmount','Loan_Amount_Term']]=scale.fit_transform(data[['ApplicantIncome','CoapplicantIncome','LoanAmount','Loan_Amount_Term']])

In [None]:
data.head()

## Feature Selection

In [None]:
## checking correlation
corr_data=data[['ApplicantIncome','CoapplicantIncome','LoanAmount','Loan_Amount_Term']]

In [None]:
sns.heatmap(corr_data.corr(),annot=True)

In [None]:
## There is no relationship among the numerical features

In [None]:
corr_data.describe() ## no constant features

In [None]:
## checking the duplicate rows
data.duplicated().sum()

## Model Creation

In [None]:
## defining X and y
X=data.iloc[:,1:-1]
y=data.Loan_Status

In [None]:
## creating training and testing data
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X, y,random_state=3)

In [None]:
y.value_counts()

In [None]:
## balancing the data

# Install the imbalanced-learn package (imported as imblearn): pip install imbalanced-learn
from imblearn.over_sampling import SMOTE
smote = SMOTE()

In [None]:
X_smote, y_smote = smote.fit_resample(X_train,y_train)

**Counter** is a container that keeps track of how many times equivalent values are added. The Counter class is part of Python's collections module.

In [None]:
from collections import Counter
print("Actual Classes",Counter(y_train))
print("SMOTE Classes",Counter(y_smote))

In [None]:
# Support Vector Classifier Model

from sklearn.svm import SVC
svclassifier = SVC() ## base model with default parameters
svclassifier.fit(X_smote, y_smote)

In [None]:
# Predict output for X_test

y_hat=svclassifier.predict(X_test)

In [None]:
## evaluating the model created
from sklearn.metrics import accuracy_score,classification_report,f1_score
acc=accuracy_score(y_test,y_hat)
acc

In [None]:
# Classification report measures the quality of predictions. True Positives, False Positives, True Negatives and False Negatives
# are used to compute the metrics of a classification report

print(classification_report(y_test,y_hat))

In [None]:
cm1=pd.crosstab(y_test,y_hat)
cm1

In [None]:
# F1 score considers both Precision and Recall for evaluating a model
f1=f1_score(y_test,y_hat)
f1
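As a sketch of what the F1 score combines, the snippet below computes precision, recall and their harmonic mean by hand on made-up labels and compares the result with `f1_score`. The label lists are illustrative and unrelated to the loan data.

```python
from sklearn.metrics import f1_score

# Made-up true and predicted labels (1 = positive class).
y_true_demo = [1, 1, 1, 0, 0, 1]
y_pred_demo = [1, 0, 1, 0, 1, 1]

tp = sum(t == 1 and p == 1 for t, p in zip(y_true_demo, y_pred_demo))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true_demo, y_pred_demo))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true_demo, y_pred_demo))

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1_manual = 2 * precision * recall / (precision + recall)  # harmonic mean

print(precision, recall, f1_manual)
print(f1_score(y_true_demo, y_pred_demo))
```

The hand-computed value and the library value should agree, since F1 is defined as the harmonic mean of precision and recall.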

In [None]:
## checking cross validation score
from sklearn.model_selection import cross_val_score

scores = cross_val_score(svclassifier,X,y,cv=3,scoring='f1')
print(scores)
print("Cross validation Score:",scores.mean())
print("Std :",scores.std())
#std of < 0.05 is good. 

## What is a Model Hyperparameter?

A model hyperparameter is a configuration that is external to the model and whose value cannot be estimated from data.
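In scikit-learn terms, the distinction looks roughly like this: hyperparameters such as `C` and `kernel` are passed to the constructor before training, whereas quantities like the support vectors only exist after `fit` has estimated them from data. The toy data below is an assumption for illustration.

```python
import numpy as np
from sklearn.svm import SVC

# Hyperparameters are chosen by us, before the model sees any data.
hp_model = SVC(C=1.0, kernel="linear")
print(hp_model.get_params()["C"], hp_model.get_params()["kernel"])

fitted_before = hasattr(hp_model, "support_vectors_")  # nothing learned yet

# Tiny made-up dataset, used only to trigger training.
X_hp = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 3.0], [4.0, 4.0]])
y_hp = np.array([0, 0, 1, 1])
hp_model.fit(X_hp, y_hp)

fitted_after = hasattr(hp_model, "support_vectors_")   # estimated from the data
print(fitted_before, fitted_after)
```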



## Hyperparameters of Support Vector Machine

#### SVM separates data points that belong to different classes with a decision boundary. When determining the decision boundary, a soft margin SVM (soft margin means allowing some data points to be misclassified) tries to solve an optimization problem with the following goals:

#### 1) Increase the distance of the decision boundary to the classes (or support vectors)
#### 2) Maximize the number of points that are correctly classified in the training set

### There is obviously a trade-off between these two goals, and it is controlled by C, which adds a penalty for each misclassified data point.

### If C is small, the penalty for misclassified points is low, so a decision boundary with a large margin is chosen at the expense of a greater number of misclassifications.

### If C is large, SVM tries to minimize the number of misclassified examples due to the high penalty which results in a decision boundary with a smaller margin. The penalty is not the same for all misclassified examples. It is directly proportional to the distance to the decision boundary.

<img src='1_XFtyzSNjexMecQ4wmqBfgA.PNG'  width="300">

<img src='1_k4wh7vzjDbQWXx7wKyH0kg.PNG'  width="600">
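The effect of `C` on the margin can be checked with a short experiment. This is a sketch on randomly generated, overlapping data (the data, the seed and the two `C` values are assumptions); with a linear kernel the margin width is `2 / ||w||`.

```python
import numpy as np
from sklearn.svm import SVC

# Two overlapping 2-D clusters, generated with a fixed seed for repeatability.
rng = np.random.default_rng(0)
X_c = np.vstack([rng.normal(0.0, 1.0, size=(40, 2)),
                 rng.normal(2.0, 1.0, size=(40, 2))])
y_c = np.array([0] * 40 + [1] * 40)

margin_width = {}
for c_value in (0.01, 100.0):
    model_c = SVC(kernel="linear", C=c_value).fit(X_c, y_c)
    margin_width[c_value] = 2.0 / np.linalg.norm(model_c.coef_[0])
    print("C =", c_value,
          "| margin width =", round(float(margin_width[c_value]), 3),
          "| support vectors =", int(model_c.n_support_.sum()))
```

The smaller `C` should give the wider margin, at the cost of more points falling inside it.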



### Gamma is a hyperparameter used with non-linear SVM. One of the most commonly used non-linear kernels is the radial basis function (RBF). The gamma parameter of the RBF kernel controls how far the influence of a single training point reaches.

### Low values of gamma indicate a large similarity radius which results in more points being grouped together. 

### For high values of gamma, the points need to be very close to each other in order to be considered in the same group (or class). Therefore, models with very large gamma values tend to overfit.

![1*JDSwT-svWnAu69fy9oguBw.png](attachment:1*JDSwT-svWnAu69fy9oguBw.png)

![1*faj7x1I0uFwfU6mkLfUwvg.png](attachment:1*faj7x1I0uFwfU6mkLfUwvg.png)
![1*5DtPKUzLI1e-FIjC-odFiw.png](attachment:1*5DtPKUzLI1e-FIjC-odFiw.png)
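The role of gamma follows from the RBF kernel formula `K(x, z) = exp(-gamma * ||x - z||^2)`. The sketch below evaluates it with scikit-learn's `rbf_kernel` for two points one unit apart; the points and the gamma values are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

# Two points at distance 1 from each other.
point_a = np.array([[0.0, 0.0]])
point_b = np.array([[1.0, 0.0]])

similarity = {}
for gamma_value in (0.01, 1.0, 100.0):
    # rbf_kernel returns a 1x1 matrix here; take its single entry.
    similarity[gamma_value] = rbf_kernel(point_a, point_b, gamma=gamma_value)[0, 0]
    print("gamma =", gamma_value, "| similarity =", similarity[gamma_value])
```

With a low gamma the two points still look similar (value near 1), while with a high gamma the similarity drops to almost 0, so each training point only influences its immediate neighbourhood.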

## GridSearchCV

#### GridSearchCV performs hyperparameter tuning to determine the optimal hyperparameter values for a given model. As mentioned above, the performance of a model depends significantly on the values of its hyperparameters.

#### Doing this manually could take a considerable amount of time and resources and thus we use GridSearchCV to automate the tuning of hyperparameters.

#### GridSearchCV tries all the combinations of the values passed in the dictionary and evaluates the model for each combination using the Cross-Validation method. Hence after using this function we get accuracy/loss for every combination of hyperparameters and we can choose the one with the best performance.
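To get a feel for how much work "all the combinations" implies, the combinations can be enumerated with scikit-learn's `ParameterGrid`. The grid below mirrors the `C` and `gamma` values searched in this notebook, and `cv_folds = 3` matches its `cv=3` setting; the variable names are illustrative.

```python
from sklearn.model_selection import ParameterGrid

# Same candidate values as the grid searched in this notebook.
demo_grid = {'C': [0.1, 1, 10, 100, 1000],
             'gamma': [1, 0.1, 0.01, 0.001, 0.0001]}
cv_folds = 3

combinations = list(ParameterGrid(demo_grid))
total_fits = len(combinations) * cv_folds  # one fit per combination per fold

print(len(combinations), "combinations,", total_fits, "model fits")
print(combinations[0])
```

By default GridSearchCV also refits the best combination once on the full data it was given, so the cost grows quickly as more values or more hyperparameters are added to the grid.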

In [None]:
from sklearn.model_selection import GridSearchCV
  
# defining parameter range
param_grid = {'C': [0.1, 1, 10, 100, 1000], 
              'gamma': [1, 0.1, 0.01, 0.001, 0.0001]} 

grid = GridSearchCV(SVC(random_state=42), param_grid, verbose =1,scoring='f1',cv=3)
  
# fitting the model for grid search  
grid.fit(X_smote, y_smote)

In [None]:
# print best parameter after tuning
print(grid.best_params_)
  
# print how our model looks after hyper-parameter tuning
print(grid.best_estimator_)

In [None]:
# creating a model with the optimal values reported by the grid search
clf=SVC(C=10, gamma=0.001, random_state=42) 

In [None]:
clf.fit(X_smote, y_smote)

In [None]:
y_clf=clf.predict(X_test)

In [None]:
print(classification_report(y_test,y_clf))

In [None]:
cm=pd.crosstab(y_test,y_clf)
cm

In [None]:
f1=f1_score(y_test,y_clf)
f1

In [None]:
scores_after = cross_val_score(clf,X,y,cv=3,scoring='f1')
print(scores_after)
print("Cross validation Score:",scores_after.mean())
print("Std :",scores_after.std())
#std of < 0.05 is good. 