Syed Sajjad Askar

2139484

Machine Learning

Lab-2


## Import Libraries
* pandas: used for data manipulation and analysis
* numpy: the core library for scientific computing in Python; used for working with arrays and matrices
* matplotlib: a plotting library, used here for data visualization
* model_selection: provides model_selection.train_test_split(), used here to split the data
* svm: scikit-learn's support vector machine module

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import model_selection
from sklearn import svm

## Load Data
* We use the ‘admission_basedon_exam_scores.csv’ file
* The file contains three columns: Exam 1 marks, Exam 2 marks, and Admission status

In [2]:
df = pd.read_csv('https://raw.githubusercontent.com/satishgunjal/datasets/master/admission_basedon_exam_scores.csv')
print('Shape of data= ', df.shape)
df.head()

## Data Understanding
* There are 100 training examples in total (m = 100, i.e. 100 rows)
* There are two features: Exam 1 marks and Exam 2 marks
* The label column contains the admission status, where ‘1’ means admitted and ‘0’ means not admitted
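The points above can be checked directly on a data frame. The sketch below uses a small hand-made frame with the same column names as a stand-in (the rows are hypothetical, not taken from the CSV), so it runs without a download; on the real `df` the same calls should apply.

```python
import pandas as pd

# Hypothetical stand-in rows; the notebook itself loads these columns from the CSV.
sample = pd.DataFrame({
    'Exam 1 marks': [34.6, 30.3, 35.8, 60.2, 79.0],
    'Exam 2 marks': [78.0, 43.9, 72.9, 86.3, 75.3],
    'Admission status': [0, 0, 0, 1, 1],
})

# Number of training examples (rows) and columns
n_rows, n_cols = sample.shape

# How many examples fall in each class (0 = not admitted, 1 = admitted)
class_counts = sample['Admission status'].value_counts().sort_index()

print('rows =', n_rows, 'columns =', n_cols)
print(class_counts)
```

Checking the class counts early is useful because a strongly imbalanced label column would make plain accuracy a misleading score later on.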

### Data Visualization
To plot the admitted and not-admitted applicants, we first create a separate data frame for each class (admitted / not admitted).

In [3]:
df_admitted = df[df['Admission status'] == 1]
print('Training examples with admission status 1 are = ', df_admitted.shape[0])
df_admitted.head(3)

In [4]:
df_notadmitted = df[df['Admission status'] == 0]
print('Training examples with admission status 0 are = ', df_notadmitted.shape[0])
df_notadmitted.head(3)

Now let's draw the scatter plot for admitted and not-admitted students.

In [5]:
def plot_data(title):    
    plt.figure(figsize=(10,6))
    plt.scatter(df_admitted['Exam 1 marks'], df_admitted['Exam 2 marks'], color= 'green', label= 'Admitted Applicants')
    plt.scatter(df_notadmitted['Exam 1 marks'], df_notadmitted['Exam 2 marks'], color= 'red', label= 'Not Admitted Applicants')
    plt.xlabel('Exam 1 Marks')
    plt.ylabel('Exam 2 Marks')
    plt.title(title)
    plt.legend()
 
plot_data(title = 'Admitted Vs Not Admitted Applicants')

## Build Machine Learning Model

In [6]:
# Create feature matrix X and label vector y
X = df[['Exam 1 marks', 'Exam 2 marks']]
y = df['Admission status']

print('Shape of X= ', X.shape)
print('Shape of y= ', y.shape)

### Create Test And Train Dataset
* We will split the dataset, so that we can use one set of data for training the model and one set of data for testing the model
* We will keep 20% of data for testing and 80% of data for training the model
* If you want to learn more about it, please refer [Train Test Split tutorial](https://satishgunjal.com/train_test_split/)

In [7]:
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.2, random_state= 1)

print('X_train dimension= ', X_train.shape)
print('X_test dimension= ', X_test.shape)
print('y_train dimension= ', y_train.shape)
print('y_test dimension= ', y_test.shape)
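As an optional variation that this notebook does not use, `train_test_split` accepts a `stratify` argument that keeps the class proportions the same in both parts. The sketch below uses hypothetical random data (`X_demo`, `y_demo` are illustrative names) with the same 100-row, 80/20 layout.

```python
import numpy as np
from sklearn import model_selection

# Hypothetical data: 100 rows, 2 features, 40 negatives and 60 positives.
rng = np.random.default_rng(0)
X_demo = rng.uniform(30, 100, size=(100, 2))
y_demo = np.array([0] * 40 + [1] * 60)

# Same 80/20 split as in the notebook, plus stratification on the labels
X_tr, X_te, y_tr, y_te = model_selection.train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1, stratify=y_demo)

print('train shape =', X_tr.shape, 'test shape =', X_te.shape)
print('share of class 1 in test set =', y_te.mean())
```

With only 20 test rows, an unstratified split can drift away from the overall class ratio, which is one reason to consider stratifying on small datasets.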

Now let's train the model using the SVM classifier.

In [8]:
# Note here we are using default SVC parameters
clf = svm.SVC()
clf.fit(X_train, y_train)
print('Model score using default parameters is = ', clf.score(X_test, y_test))
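To see what "default parameters" means here, the values can be read back from an unfitted `SVC` object with `get_params()`. This is a minimal sketch; the exact defaults depend on the installed scikit-learn version.

```python
from sklearn import svm

# An SVC created without arguments, as in the cell above
default_clf = svm.SVC()
params = default_clf.get_params()

# Show only the parameters discussed in this notebook
shown = {name: params[name] for name in ('C', 'kernel', 'gamma', 'degree')}
print(shown)
```

In recent scikit-learn releases the default kernel is `rbf`, so the model trained above is already non-linear.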

To visualize the results better, let's create a function that plots the SVM classifier's decision boundary with its margin.

In [9]:
def plot_support_vector(classifier):
    """
    To plot decision boundary and margin. Code taken from Sklearn documentation.

    I/P
    ----------
    classifier : SVC object for each type of kernel

    O/P
    -------
    Plot
    
    """
    clf =classifier
    # plot the decision function
    ax = plt.gca()
    xlim = ax.get_xlim()
    ylim = ax.get_ylim()

    # create grid to evaluate model
    xx = np.linspace(xlim[0], xlim[1], 30)
    yy = np.linspace(ylim[0], ylim[1], 30)
    YY, XX = np.meshgrid(yy, xx)
    xy = np.vstack([XX.ravel(), YY.ravel()]).T
    Z = clf.decision_function(xy).reshape(XX.shape)

    # plot decision boundary and margins
    ax.contour(XX, YY, Z, colors='k', levels=[-1, 0, 1], alpha=0.5,
               linestyles=['--', '-', '--'])
    # plot support vectors
    ax.scatter(clf.support_vectors_[:, 0], clf.support_vectors_[:, 1], s=100,
               linewidth=1, facecolors='none', edgecolors='k')  

In [10]:
plot_data(title = 'SVM Classifier With Default Parameters')
plot_support_vector(clf)  

## SVM Parameters
* Gamma: with a high gamma, the decision boundary depends mainly on the observations close to it; with a low gamma, the SVM also takes far-away points into account when placing the boundary. Gamma is used by the rbf and poly kernels and is ignored by the linear kernel.
* Regularization parameter (C): a large C penalizes misclassified training points heavily, which tends toward overfitting (lower bias, higher variance). A small C tolerates more margin violations, which tends toward underfitting (higher bias, lower variance). For more details, please refer to [Underfitting & Overfitting](https://satishgunjal.github.io/underfitting_overfitting/)
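The effect of gamma can be made concrete with the RBF kernel formula used by scikit-learn, exp(-gamma * ||a - b||^2). The sketch below is illustrative only: `rbf_similarity` and the three points are hypothetical names and values, not part of the notebook.

```python
import numpy as np

def rbf_similarity(a, b, gamma):
    """RBF kernel value exp(-gamma * ||a - b||^2) for two points."""
    diff = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    return float(np.exp(-gamma * np.dot(diff, diff)))

point = [50.0, 50.0]
near = [51.0, 50.0]   # squared distance 1 from `point`
far = [60.0, 50.0]    # squared distance 100 from `point`

for gamma in (0.01, 1.0):
    print('gamma =', gamma,
          '| near:', rbf_similarity(point, near, gamma),
          '| far:', rbf_similarity(point, far, gamma))
```

With the small gamma the far point still has a noticeable kernel value, while with the large gamma it is effectively zero, so only nearby observations influence the boundary.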

The regularization parameter C and gamma both play an important role in finding the best-fit model. Let's create a function that tries multiple values and returns the best C and gamma for each kernel. At the end we will plot the decision boundary with margin using the best SVM parameters for each type of kernel.

In [11]:
def svm_params(X_train, y_train, X_test, y_test):
    """
    Finds the best choice of Regularization parameter (C) and gamma for given choice of kernel and returns the SVC object for each type of kernel

    I/P
    ----------
    X_train : ndarray
        Training samples
    y_train : ndarray
        Labels for training set
    X_test : ndarray
        Test data samples
    y_test : ndarray
        Labels for test set.

    O/P
    -------
    classifiers : SVC object for each type of kernel
    
    """
    C_values = [0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30, 40]
    gamma_values = [0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30, 40]
    kernel_types = ['linear', 'poly', 'rbf']
    classifiers = {}
    for kernel in kernel_types:
        # Reset the best-so-far trackers for every kernel; otherwise a kernel that
        # never beats an earlier kernel's score gets no entry in `classifiers`.
        max_score = -1
        C_final = -1
        gamma_final = -1
        for C in C_values:
            for gamma in gamma_values:
                clf = svm.SVC(C=C, kernel=kernel, gamma=gamma)
                clf.fit(X_train, y_train)
                score = clf.score(X_test, y_test)
                #print('C = %s, gamma= %s, score= %s' %(C, gamma, score))
                if score > max_score:
                    max_score = score
                    C_final = C
                    gamma_final = gamma
                    classifiers[kernel] = clf
        print('kernel = %s, C = %s, gamma = %s, score = %s' %(kernel, C_final, gamma_final, max_score))
    return classifiers

Let's call the svm_params() function to get the best parameters for each type of kernel.

In [12]:
classifiers = svm_params(X_train, y_train, X_test, y_test)
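Note that `svm_params()` picks C and gamma by their score on the test set, so the reported score may be optimistic. A commonly used alternative, shown here only as a sketch and not as the method of this lab, is scikit-learn's `GridSearchCV`, which selects parameters by cross-validation on the training data alone. The data below is synthetic and the small grid is an assumption chosen to keep the example fast.

```python
import numpy as np
from sklearn import svm
from sklearn.model_selection import GridSearchCV

# Hypothetical synthetic training data: 80 rows, 2 features, binary labels.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(80, 2))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)

# A reduced grid (assumed values) for the rbf kernel
param_grid = {'C': [0.1, 1, 10], 'gamma': [0.01, 0.1, 1]}

# 5-fold cross-validation over every (C, gamma) pair
search = GridSearchCV(svm.SVC(kernel='rbf'), param_grid, cv=5)
search.fit(X_demo, y_demo)

print('best parameters =', search.best_params_)
print('mean cross-validated accuracy =', search.best_score_)
```

After such a search, the held-out test set is scored once with `search.best_estimator_`, which keeps that score an honest estimate.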

In [13]:
plot_data(title = 'SVM Classifier With Parameters ' + str(classifiers['linear']))
plot_support_vector(classifiers['linear'])

In [14]:
plot_data(title = 'SVM Classifier With Parameters ' + str(classifiers['rbf']))
plot_support_vector(classifiers['rbf'])

In [15]:
plot_data(title = 'SVM Classifier With Parameters ' + str(classifiers['poly']))
plot_support_vector(classifiers['poly'])

## Conclusion
Remember that our data is 2D, so the separating hyperplane is a line. A close look at the data shows no clear linear separation between the classes, which is why a straight line is not a good fit, as the plots above show. Although the accuracy of the poly kernel is lower than that of rbf, it is still the best choice for our data.

