# Topics Covered:
* Logistic Regression
* SVM
* KNN
* Naive Bayes

## Logistic Regression

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
sns.set(context="notebook", palette="Spectral", style = 'darkgrid' ,font_scale = 1.5, color_codes=True)
import warnings
warnings.filterwarnings('ignore')
import os
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.metrics import confusion_matrix, precision_score, recall_score, accuracy_score, f1_score

# Classifying the breast cancer data into benign and malignant classes

1) ID number <br>
2) Diagnosis (M = malignant, B = benign)<br>
3) radius (mean of distances from center to points on the perimeter)<br>
4) texture (standard deviation of gray-scale values)<br>
5) perimeter<br>
6) area<br>
7) smoothness (local variation in radius lengths)<br>
8) compactness (perimeter^2 / area - 1.0)<br>
9) concavity (severity of concave portions of the contour)<br>
10) concave points (number of concave portions of the contour)<br>
11) symmetry<br>
12) fractal dimension ("coastline approximation" - 1)<br>
13) to 32) The standard error and worst (largest) value of each of the ten features above; together with the means, this gives 30 feature columns

In [2]:
from sklearn import datasets
# Load the breast cancer dataset bundled with scikit-learn
data = datasets.load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
# Target encoding in scikit-learn: 0 = malignant, 1 = benign
y = data.target
df.head()

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst radius,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension
0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,...,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678


In [3]:
y

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0,
       1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0,
       1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1,
       1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0,
       0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1,
       1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0,
       0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0,
       1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1,
       1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0,

In [4]:
#Splitting the Data-Set into Training Set and Test Set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df, y, test_size=0.2, random_state=123)

In [5]:
#Normalize the data
#Min-max scaler: fit and transform on the training set, transform only on the test set
from sklearn.preprocessing import MinMaxScaler
n_scaler = MinMaxScaler()
# np.float was removed in NumPy 1.24; the built-in float is equivalent
X_train_scaled = n_scaler.fit_transform(X_train.astype(float))
X_test_scaled = n_scaler.transform(X_test.astype(float))
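
The scaling step can be checked by hand. The sketch below is a self-contained illustration, not part of the original notebook: it reloads the data with `return_X_y=True` (an assumption made only to keep the block standalone) and confirms that `transform` on the test set reuses the minimum and maximum learned from the training set, so no test-set statistics leak into preprocessing.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Same split settings as the notebook, loaded as plain arrays for brevity.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=123
)

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learns per-feature min/max from train
X_test_scaled = scaler.transform(X_test)        # reuses the train min/max

# Reproduce the test-set scaling manually from training statistics only.
train_min = X_train.min(axis=0)
train_range = X_train.max(axis=0) - train_min
manual_test_scaled = (X_test - train_min) / train_range

print("train range:", X_train_scaled.min(), X_train_scaled.max())
print("test range: ", X_test_scaled.min(), X_test_scaled.max())
print("manual == scaler:", np.allclose(manual_test_scaled, X_test_scaled))
```

The training features span the unit interval by construction, while test features can fall slightly outside it because their extremes were never seen during fitting.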

In [6]:
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
logreg= LogisticRegression()
logreg.fit(X_train_scaled, y_train)
y_pred_logreg = logreg.predict(X_test_scaled)

In [7]:
print('Precision: %.3f' % precision_score(y_test, y_pred_logreg, average='weighted'))
print('Recall: %.3f' % recall_score(y_test, y_pred_logreg, average='weighted'))
print('Accuracy: %.3f' % accuracy_score(y_test, y_pred_logreg))
print('F1 Score: %.3f' % f1_score(y_test, y_pred_logreg, average='weighted'))

Precision: 0.967
Recall: 0.965
Accuracy: 0.965
F1 Score: 0.964
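
The weighted recall and the accuracy printed above are identical, and this is not a coincidence: recall averaged with class-support weights reduces to the fraction of correct predictions. The following sketch, which refits the same kind of model in a standalone way (variable names such as `clf` and `y_hat` are illustrative), prints the confusion matrix and the per-class recall, which are more informative for a medical dataset than a single weighted number.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=123
)

scaler = MinMaxScaler().fit(X_train)
clf = LogisticRegression().fit(scaler.transform(X_train), y_train)
y_hat = clf.predict(scaler.transform(X_test))

# Rows are true labels, columns are predictions (0 = malignant, 1 = benign).
cm = confusion_matrix(y_test, y_hat, labels=[0, 1])
print(cm)

# Recall for each class separately, then the support-weighted average.
per_class_recall = recall_score(y_test, y_hat, average=None)
weighted_recall = recall_score(y_test, y_hat, average="weighted")
accuracy = accuracy_score(y_test, y_hat)

print("recall per class:", per_class_recall)
print("weighted recall: ", weighted_recall)
print("accuracy:        ", accuracy)
```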


## SVM

***SVM is a supervised machine learning algorithm that can perform classification, regression, and even outlier detection. As a classifier, it separates the classes with the hyperplane that has the largest margin, that is, the greatest distance to the nearest training points of each class, and uses that hyperplane to classify new data points.***

C: The regularization hyperparameter that controls how heavily misclassified training points are penalized, which directly affects the position of the hyperplane.<br>
Gamma: The kernel coefficient that controls how far the influence of a single training point reaches. In other words, changing the value of gamma changes the shape of the decision boundary.<br>
Kernel function: If the data is not linearly separable, we can apply the “kernel trick”, which implicitly maps the nonlinear data to a higher-dimensional space where a separating hyperplane may exist.
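
To make gamma concrete, here is a minimal sketch of the RBF kernel, the default kernel of scikit-learn's `SVC`. It computes `exp(-gamma * ||a - b||^2)` by hand on a small random matrix (the data and the value `gamma = 0.5` are arbitrary choices for illustration) and compares the result with `sklearn.metrics.pairwise.rbf_kernel`.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 3))  # 4 illustrative points with 3 features each
gamma = 0.5                  # arbitrary value chosen for the demonstration

# Squared Euclidean distance between every pair of rows.
sq_dists = ((A[:, None, :] - A[None, :, :]) ** 2).sum(axis=-1)

# RBF kernel: similarity decays exponentially with squared distance.
K_manual = np.exp(-gamma * sq_dists)
K_sklearn = rbf_kernel(A, A, gamma=gamma)

print(np.round(K_manual, 3))
print("matches sklearn:", np.allclose(K_manual, K_sklearn))
```

A larger gamma makes the similarity drop off faster with distance, so each training point influences only its immediate neighborhood.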


In [8]:
#Support Vector Classification model when data is not normalized
from sklearn.svm import SVC
svc_model = SVC()
svc_model.fit(X_train, y_train)

y_pred = svc_model.predict(X_test)

#Measure the accuracy
from sklearn import metrics
from sklearn.metrics import confusion_matrix, precision_score, recall_score, accuracy_score, f1_score
print("Accuracy score %.3f" % metrics.accuracy_score(y_test, y_pred))

Accuracy score 0.930


In [9]:
#Support Vector Classification model when data is normalized
svc_model.fit(X_train_scaled, y_train)
from sklearn.metrics import classification_report, confusion_matrix
y_pred_scaled = svc_model.predict(X_test_scaled)

#Measure the performance
print('Precision: %.3f' % precision_score(y_test, y_pred_scaled, average='weighted'))
print('Recall: %.3f' % recall_score(y_test, y_pred_scaled, average='weighted'))
print('Accuracy: %.3f' % accuracy_score(y_test, y_pred_scaled))
print('F1 Score: %.3f' % f1_score(y_test, y_pred_scaled, average='weighted'))

Precision: 0.983
Recall: 0.982
Accuracy: 0.982
F1 Score: 0.982


### Effects of SVM parameters
**Smaller C**: Lower variance but higher bias (soft margin); misclassified training points are penalized less.<br>
**Larger C**: Lower bias but higher variance (approaches a hard margin); misclassified training points are penalized more strictly.<br>
**Smaller Gamma**: Lower variance and higher bias; each training point has a far reach, giving a smoother, more generalized solution.<br>
**Larger Gamma**: Higher variance and lower bias; each training point has a close reach, so closer data points carry a higher weight.<br>
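
One way to observe these trade-offs is to sweep both parameters and compare training accuracy with test accuracy. The sketch below assumes an RBF kernel and an arbitrary grid of three values per parameter; the grid and the `results` list are illustrative choices, not settings from the original notebook. A large gap between training and test accuracy suggests high variance, while low accuracy on both suggests high bias.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=123
)
scaler = MinMaxScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

results = []
for C in (0.01, 1.0, 100.0):          # illustrative grid, not tuned
    for gamma in (0.01, 1.0, 100.0):  # illustrative grid, not tuned
        model = SVC(kernel="rbf", C=C, gamma=gamma).fit(X_train_s, y_train)
        results.append({
            "C": C,
            "gamma": gamma,
            "train_acc": model.score(X_train_s, y_train),
            "test_acc": model.score(X_test_s, y_test),
            "n_support": int(model.n_support_.sum()),
        })

for r in results:
    print("C=%-6g gamma=%-6g train=%.3f test=%.3f support vectors=%d"
          % (r["C"], r["gamma"], r["train_acc"], r["test_acc"], r["n_support"]))
```

For a systematic search, `sklearn.model_selection.GridSearchCV` performs the same sweep with cross-validation instead of a single held-out split.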

In [10]:
#Support Vector Classification model when data is scaled and kernel function is linear
svc_model = SVC(C=1.0,kernel='linear')
svc_model.fit(X_train_scaled, y_train)
y_pred_new_C=svc_model.predict(X_test_scaled)

#Measure the performance
print('Precision: %.3f' % precision_score(y_test, y_pred_new_C, average='weighted'))
print('Recall: %.3f' % recall_score(y_test, y_pred_new_C, average='weighted'))
print('Accuracy: %.3f' % accuracy_score(y_test, y_pred_new_C))
print('F1 Score: %.3f' % f1_score(y_test, y_pred_new_C, average='weighted'))

Precision: 0.983
Recall: 0.982
Accuracy: 0.982
F1 Score: 0.982


## KNN
The KNN classifier does not fit an explicit model. It stores the training data, and when a new data point comes in, it finds the k closest training points and assigns the class that most of them share. The decision boundary is therefore implicit in the training data.
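
The voting rule is short enough to write out directly. This is a minimal sketch using Euclidean distance and a toy dataset made up for illustration; `knn_predict` is a hypothetical helper, not a scikit-learn function, and it ignores refinements such as distance weighting or efficient neighbor search.

```python
import numpy as np

def knn_predict(X_train, y_train, X_new, k=3):
    """Classify each row of X_new by majority vote among its k nearest training points."""
    # Pairwise Euclidean distances, shape (n_new, n_train).
    dists = np.linalg.norm(X_new[:, None, :] - X_train[None, :, :], axis=2)
    # Indices of the k closest training points for every new point.
    nearest = np.argsort(dists, axis=1)[:, :k]
    # Majority vote over the neighbors' labels (labels must be non-negative ints).
    return np.array([np.bincount(votes).argmax() for votes in y_train[nearest]])

# Toy data: one cluster near the origin (class 0), one near (1, 1) (class 1).
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]])
y_toy = np.array([0, 0, 0, 1, 1, 1])
X_new = np.array([[0.05, 0.05], [0.95, 1.0]])

toy_pred = knn_predict(X_toy, y_toy, X_new, k=3)
print(toy_pred)  # prints [0 1]
```

Because the prediction depends on raw distances, features on larger scales dominate the vote, which is why the scaled data is used for KNN.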

In [11]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_scaled, y_train)
y_pred_knn = knn.predict(X_test_scaled)

#Measure the performance
print('Precision: %.3f' % precision_score(y_test, y_pred_knn, average='weighted'))
print('Recall: %.3f' % recall_score(y_test, y_pred_knn, average='weighted'))
print('Accuracy: %.3f' % accuracy_score(y_test, y_pred_knn))
print('F1 Score: %.3f' % f1_score(y_test, y_pred_knn, average='weighted'))

Precision: 0.983
Recall: 0.982
Accuracy: 0.982
F1 Score: 0.982


## Naive Bayes
Naive Bayes is a family of classification algorithms for binary (two-class) and multiclass classification problems. It applies Bayes' theorem under the "naive" assumption that the features are conditionally independent given the class. The easiest naive Bayes classifier to understand is Gaussian Naive Bayes, which assumes that, within each class, every feature is drawn from a simple Gaussian distribution.

In [12]:
# Importing the model:
from sklearn.naive_bayes import GaussianNB

# Initializing and fitting the model without scaling the data:
nb = GaussianNB()
nb.fit(X_train, y_train)
# Predicting the Test set results
y_pred = nb.predict(X_test)

#Measure the performance
print('Precision: %.3f' % precision_score(y_test, y_pred, average='weighted'))
print('Recall: %.3f' % recall_score(y_test, y_pred, average='weighted'))
print('Accuracy: %.3f' % accuracy_score(y_test, y_pred))
print('F1 Score: %.3f' % f1_score(y_test, y_pred, average='weighted'))

Precision: 0.959
Recall: 0.956
Accuracy: 0.956
F1 Score: 0.955


In [13]:
# Refitting the model on the scaled data
nb.fit(X_train_scaled, y_train)
# Predicting the Test set results
y_pred_scaled = nb.predict(X_test_scaled)

#Measure the performance
print('Precision: %.3f' % precision_score(y_test, y_pred_scaled, average='weighted'))
print('Recall: %.3f' % recall_score(y_test, y_pred_scaled, average='weighted'))
# Compare against the true labels (y_test), not the unscaled predictions (y_pred)
print('Accuracy: %.3f' % accuracy_score(y_test, y_pred_scaled))
print('F1 Score: %.3f' % f1_score(y_test, y_pred_scaled, average='weighted'))

Precision: 0.959
Recall: 0.956
Accuracy: 0.956
F1 Score: 0.955
