## Support Vector Machine

A Support Vector Machine (SVM) is a supervised machine learning algorithm that can be used for either classification or regression, although it is mostly used for classification. Each data item is plotted as a point in n-dimensional space (where n is the number of features), with the value of each feature being the value of a particular coordinate. Classification is then performed by finding the hyperplane that best separates the two classes.
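To make the geometry concrete, here is a minimal sketch of how a hyperplane assigns a class to a point. The weights `w` and bias `b` are hypothetical values picked by hand for illustration; a trained SVM would learn them from the data.

```python
# Hypothetical hyperplane w . x + b = 0 in a 2-D feature space (hand-picked, not learned)
w = [1.0, -1.0]
b = 0.5

def decision_value(x):
    """Signed score: positive on one side of the hyperplane, negative on the other."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def predict(x):
    """Assign class 1 if the point lies on the positive side, otherwise class 0."""
    return 1 if decision_value(x) >= 0 else 0

points = [(2.0, 1.0), (0.0, 3.0), (1.0, 2.0)]
labels = [predict(p) for p in points]
print(labels)
```

The sign of `decision_value` is what separates the two classes; its magnitude grows with the point's distance from the hyperplane.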

In [1]:
#Loading Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings; warnings.simplefilter('ignore')
%matplotlib inline

In [2]:
#Gathering Data
credit = pd.read_csv(r"E:/Github/Python/Data-Sets/Risk.txt",sep=",",index_col=False)

In [5]:
#Searching for unique values of string columns
print('MARITAL column unique values:',credit['MARITAL'].unique())
print('HOWPAID column unique values:',credit['HOWPAID'].unique())
print('MORTGAGE column unique values:',credit['MORTGAGE'].unique())
print('GENDER column unique values:',credit['GENDER'].unique())
print('RISK column unique values:',credit['RISK'].unique())

MARITAL column unique values: ['married  ' 'single   ' 'divsepwid']
HOWPAID column unique values: ['monthly' 'weekly ']
MORTGAGE column unique values: ['y' 'n']
GENDER column unique values: ['m' 'f']
RISK column unique values: ['good risk ' 'bad loss  ' 'bad profit']


In [6]:
#Before modelling we need to compute new columns for string data
credit['MARITAL_new'] = credit.MARITAL.map({'married  ':0, 'single   ':1,'divsepwid':2})
credit['GENDER_new'] = credit.GENDER.map({'m':0, 'f':1})
credit['MORTGAGE_new'] = credit.MORTGAGE.map({'y':0, 'n':1})
credit['RISK_new'] = np.where(credit['RISK'].str.contains('good'), 1,0)
credit['HOWPAID_new'] = credit.HOWPAID.map({'monthly':0, 'weekly ':1})

In [7]:
#Top 5 rows of dataframe
print(credit.head())

       ID  AGE  INCOME GENDER    MARITAL  NUMKIDS  NUMCARDS  HOWPAID MORTGAGE  \
0  100756   44   59944      m  married          1         2  monthly        y   
1  100668   35   59692      m  married          1         1  monthly        y   
2  100418   34   59508      m  married          1         1  monthly        y   
3  100416   34   59463      m  married          0         2  monthly        y   
4  100590   39   59393      f  married          0         2  monthly        y   

   STORECAR  LOANS        RISK  MARITAL_new  GENDER_new  MORTGAGE_new  \
0         2      0  good risk             0           0             0   
1         1      0  bad loss              0           0             0   
2         2      1  good risk             0           0             0   
3         1      1  bad loss              0           0             0   
4         1      0  good risk             0           1             0   

   RISK_new  HOWPAID_new  
0         1            0  
1         0         

In [8]:
#Null column check
credit.isnull().any()

ID              False
AGE             False
INCOME          False
GENDER          False
MARITAL         False
NUMKIDS         False
NUMCARDS        False
HOWPAID         False
MORTGAGE        False
STORECAR        False
LOANS           False
RISK            False
MARITAL_new     False
GENDER_new      False
MORTGAGE_new    False
RISK_new        False
HOWPAID_new     False
dtype: bool

In [9]:
#Selecting X and Y variables 
X = credit[['AGE','INCOME','GENDER_new','MARITAL_new','NUMKIDS','NUMCARDS','HOWPAID_new','MORTGAGE_new','STORECAR','LOANS']]
y = credit['RISK_new']

In [10]:
#Loading train test split library
from sklearn.model_selection import train_test_split

#Creating X and y train and test splits
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=101)

In [11]:
#Loading Library
from sklearn import svm
svc = svm.SVC()

In [12]:
#Fitting the model on the training data
svc.fit(X_train, y_train)

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [13]:
#Making predictions
y_train_predictions = svc.predict(X_train)
y_test_predictions = svc.predict(X_test)

## Model Evaluation

### Classification Report:

Scikit-learn provides a convenience report for classification problems that gives a quick overview of a model's performance across several measures.

The classification_report() function displays the precision, recall, f1-score and support for each class.
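As a rough check on how these measures are derived, the sketch below computes them from raw counts. The helper `precision_recall_f1` is a hypothetical name, and returning 0.0 for an undefined ratio is an assumption of this sketch. The counts are those for class 0 on the test split, read off the test confusion matrix shown later in this notebook.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from raw counts (0.0 when a ratio is undefined)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Class 0 treated as the class of interest:
# 651 actual-0 rows predicted 0, 173 actual-1 rows predicted 0, no actual-0 rows predicted 1.
p, r, f1 = precision_recall_f1(tp=651, fp=173, fn=0)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

The precision and recall obtained this way should agree with the class 0 row of the report.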

In [24]:
from sklearn.metrics import classification_report
report = classification_report(y_test, y_test_predictions)
print('The precision is the ratio tp / (tp + fp) where tp is the number of true positives and fp the number of false positives. The precision is intuitively the ability of the classifier not to label as positive a sample that is negative. The best value is 1 and the worst value is 0.')
print('')
print('The recall is the ratio tp / (tp + fn) where tp is the number of true positives and fn the number of false negatives. The recall is intuitively the ability of the classifier to find all the positive samples. The best value is 1 and the worst value is 0.')
print('')
print('The F1 score is the harmonic mean of the precision and recall, where an F1 score reaches its best value at 1 and worst score at 0. The relative contribution of precision and recall to the F1 score is equal. The formula for the F1 score is: F1 = 2 * (precision * recall) / (precision + recall).')
print('')
print('The support is the number of occurrences of each class in y_true.')
print('')
print(report)

The precision is the ratio tp / (tp + fp) where tp is the number of true positives and fp the number of false positives. The precision is intuitively the ability of the classifier not to label as positive a sample that is negative. The best value is 1 and the worst value is 0.

The recall is the ratio tp / (tp + fn) where tp is the number of true positives and fn the number of false negatives. The recall is intuitively the ability of the classifier to find all the positive samples. The best value is 1 and the worst value is 0.

The F1 score is the harmonic mean of the precision and recall, where an F1 score reaches its best value at 1 and worst score at 0. The relative contribution of precision and recall to the F1 score is equal. The formula for the F1 score is: F1 = 2 * (precision * recall) / (precision + recall).

The support is the number of occurrences of each class in y_true.

             precision    recall  f1-score   support

          0       0.79      1.0

### Confusion Matrix:

The confusion matrix is a handy presentation of the accuracy of a model with two or more classes.

In scikit-learn's layout, each row of the table corresponds to an actual class and each column to a predicted class. Each cell holds the number of predictions that fall into that combination of actual and predicted class.

For example, a machine learning algorithm can predict 0 or 1 and each prediction may actually have been a 0 or 1. Predictions for 0 that were actually 0 appear in the cell for prediction=0 and actual=0, whereas predictions for 0 that were actually 1 appear in the cell for prediction = 0 and actual=1. And so on.
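The counting described above can be sketched in a few lines of plain Python. The helper `confusion_counts` and the labels are made up for illustration; the layout (rows for the actual class, columns for the predicted class) follows the same convention as scikit-learn's `confusion_matrix`.

```python
def confusion_counts(y_true, y_pred, n_classes=2):
    """Build a matrix where rows index the actual class and columns the predicted class."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for actual, predicted in zip(y_true, y_pred):
        matrix[actual][predicted] += 1
    return matrix

# Made-up labels for illustration only
y_true = [0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 0, 1]

matrix = confusion_counts(y_true, y_pred)
for row in matrix:
    print(row)
```

Here `matrix[0][1]` counts samples that were actually 0 but predicted as 1, and `matrix[1][0]` counts samples that were actually 1 but predicted as 0.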

In [37]:
from sklearn.metrics import confusion_matrix
print('Train:\n')
print(confusion_matrix(y_train, y_train_predictions))
print('')
print('Test:\n')
print(confusion_matrix(y_test, y_test_predictions))
print('')

Train:

[[2662    0]
 [  16  615]]

Test:

[[651   0]
 [173   0]]



### Classification Accuracy:

Classification accuracy is the number of correct predictions made as a ratio of all predictions made.

This is the most common evaluation metric for classification problems, and also the most misused. It is really only suitable when there are roughly equal numbers of observations in each class (which is rarely the case) and when all predictions and prediction errors are equally important (which is often not the case).
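The test confusion matrix from this notebook illustrates the point. The sketch below recomputes accuracy from those counts and compares it with the recall of each class; it assumes the scikit-learn layout of rows for actual classes and columns for predicted classes.

```python
# Test confusion matrix copied from this notebook (rows: actual, columns: predicted)
test_matrix = [[651, 0],
               [173, 0]]

total = sum(sum(row) for row in test_matrix)
correct = sum(test_matrix[i][i] for i in range(len(test_matrix)))
accuracy = correct / total

# Recall per class: correct predictions for the class divided by its actual count
recall_per_class = [row[i] / sum(row) for i, row in enumerate(test_matrix)]

print(f"accuracy={accuracy:.2f}")
print(f"recall per class={recall_per_class}")
```

Accuracy comes out at about 0.79 even though not a single class 1 sample is identified, because class 0 makes up most of the test set.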

In [39]:
from sklearn import model_selection
from sklearn.svm import SVC
#random_state is omitted: KFold does not shuffle by default, so a seed has no effect
#(recent scikit-learn versions raise an error if it is set without shuffle=True)
kfold = model_selection.KFold(n_splits=10)
model = SVC()
scoring = 'accuracy'
results = model_selection.cross_val_score(model, X, y, cv=kfold, scoring=scoring)
print("Accuracy: %0.2f (+/- %0.2f)" % (results.mean(), results.std()))

Accuracy: 0.80 (+/- 0.21)


### Area Under ROC Curve:

Area under ROC Curve (or AUC for short) is a performance metric for binary classification problems.

The AUC represents a model’s ability to discriminate between positive and negative classes. An area of 1.0 represents a model that made all predictions perfectly. An area of 0.5 represents a model as good as random. 
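One way to read the AUC is as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The sketch below computes it that way by comparing every positive/negative pair; `pairwise_auc` is a hypothetical helper and the scores are made up for illustration.

```python
def pairwise_auc(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

# Made-up classifier scores for illustration only
pos_scores = [0.9, 0.8, 0.4]
neg_scores = [0.7, 0.3, 0.2]

auc = pairwise_auc(pos_scores, neg_scores)
print(round(auc, 3))
```

Eight of the nine pairs are ordered correctly here, so the result is 8/9. A model that ranks every positive above every negative would score 1.0.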

ROC can be broken down into sensitivity and specificity. A binary classification problem is really a trade-off between sensitivity and specificity.

Sensitivity is the true positive rate, also called recall. It is the proportion of instances from the positive class that were predicted correctly.

Specificity is also called the true negative rate. It is the proportion of instances from the negative class that were predicted correctly.
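Both rates can be read directly off a confusion matrix. The sketch below uses the training confusion matrix from this notebook and assumes class 1 is treated as the positive class, with rows for actual classes and columns for predicted classes.

```python
# Training confusion matrix copied from this notebook (rows: actual, columns: predicted)
train_matrix = [[2662, 0],
                [16, 615]]

# Unpack under the assumption that class 1 is the positive class
tn, fp = train_matrix[0]
fn, tp = train_matrix[1]

sensitivity = tp / (tp + fn)  # true positive rate
specificity = tn / (tn + fp)  # true negative rate

print(f"sensitivity={sensitivity:.3f} specificity={specificity:.3f}")
```

Each point on the ROC curve pairs a sensitivity with one minus the specificity at a particular decision threshold.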

In [None]:
from sklearn import model_selection
from sklearn.svm import SVC
#random_state is omitted: KFold does not shuffle by default, so a seed has no effect
#(recent scikit-learn versions raise an error if it is set without shuffle=True)
kfold = model_selection.KFold(n_splits=10)
model = SVC()
scoring = 'roc_auc'
results = model_selection.cross_val_score(model, X, y, cv=kfold, scoring=scoring)
print("AUC: %0.2f (+/- %0.2f)" % (results.mean(), results.std()))