## Using SVM with different kernels to predict whether a candidate will attend an interview

For interviewers, few things are more frustrating than a candidate who fails to show up just as the interview is about to begin. Interviewers' time is precious, so can we find a way to predict whether a candidate will attend an interview?

Here, we use a raw dataset from Kaggle that presents exactly this problem.

https://www.kaggle.com/vishnusraghavan/the-interview-attendance-problem/data

In this tutorial, I will introduce some basic data exploration and preprocessing techniques and show you how to use a support vector machine (SVM) with different kernels to make this binary prediction.

This tutorial is divided into the following parts:

(1) Data exploration

(2) Data preprocessing

(3) Split dataset

(4) Using SVM to make the prediction

(5) Cross validation

## 1. Data exploration

First, let us download the data and see the first five rows and last five rows of the data to gain some basic understanding.

In [39]:
import pandas as pd
df = pd.read_csv('./Interview.csv')
df.head()

Unnamed: 0,Date of Interview,Client name,Industry,Location,Position to be closed,Nature of Skillset,Interview Type,Name(Cand ID),Gender,Candidate Current Location,...,Are you clear with the venue details and the landmark.,Has the call letter been shared,Expected Attendance,Observed Attendance,Marital Status,Unnamed: 23,Unnamed: 24,Unnamed: 25,Unnamed: 26,Unnamed: 27
0,13.02.2015,Hospira,Pharmaceuticals,Chennai,Production- Sterile,Routine,Scheduled Walkin,Candidate 1,Male,Chennai,...,Yes,Yes,Yes,No,Single,,,,,
1,13.02.2015,Hospira,Pharmaceuticals,Chennai,Production- Sterile,Routine,Scheduled Walkin,Candidate 2,Male,Chennai,...,Yes,Yes,Yes,No,Single,,,,,
2,13.02.2015,Hospira,Pharmaceuticals,Chennai,Production- Sterile,Routine,Scheduled Walkin,Candidate 3,Male,Chennai,...,,,Uncertain,No,Single,,,,,
3,13.02.2015,Hospira,Pharmaceuticals,Chennai,Production- Sterile,Routine,Scheduled Walkin,Candidate 4,Male,Chennai,...,Yes,Yes,Uncertain,No,Single,,,,,
4,13.02.2015,Hospira,Pharmaceuticals,Chennai,Production- Sterile,Routine,Scheduled Walkin,Candidate 5,Male,Chennai,...,Yes,Yes,Uncertain,No,Married,,,,,


In [6]:
df.tail()

Unnamed: 0,Date of Interview,Client name,Industry,Location,Position to be closed,Nature of Skillset,Interview Type,Name(Cand ID),Gender,Candidate Current Location,...,Are you clear with the venue details and the landmark.,Has the call letter been shared,Expected Attendance,Observed Attendance,Marital Status,Unnamed: 23,Unnamed: 24,Unnamed: 25,Unnamed: 26,Unnamed: 27
1229,07.05.2016,Pfizer,Pharmaceuticals,Chennai,Niche,Biosimiliars,Scheduled,Candidate 1230,Male,Chennai,...,Yes,Yes,Yes,Yes,Single,,,,,
1230,06.05.2016,Pfizer,Pharmaceuticals,Chennai,Niche,Biosimiliars,Scheduled,Candidate 1231,Male,Chennai,...,Yes,Yes,Yes,Yes,Married,,,,,
1231,06.05.2016,Pfizer,Pharmaceuticals,Chennai,Niche,generic drugs – RA,Scheduled,Candidate 1232,Male,Chennai,...,Yes,Yes,Yes,Yes,Single,,,,,
1232,06.05.2016,Pfizer,Pharmaceuticals,Chennai,Niche,generic drugs – RA,Scheduled,Candidate 1233,Female,Chennai,...,,,Uncertain,Yes,Single,,,,,
1233,,﻿﻿,,,,,,,,,...,,,,,,,,,,


Basic exploration shows that some rows and columns are not useful, so we need to remove them. For example, the last row contains no real values, so we remove it. There are also several columns named "Unnamed: xx" that hold NaN values, so we drop them too. In addition, the candidate names carry no information, so we drop that column as well.

In [40]:
df = df[:-1]
df = df.drop(['Unnamed: 23', 'Unnamed: 24', 'Unnamed: 25', 'Unnamed: 26', 'Unnamed: 27', 'Name(Cand ID)'], axis=1)
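As a more general alternative to listing the empty row and columns by hand, pandas can drop any row or column whose values are all missing. The following is a minimal sketch on a toy frame with made-up values, not on the real CSV. Note that a cell holding invisible characters is not NaN, so such a row would survive this filter; the explicit slice used above does not have that limitation.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the raw CSV (hypothetical values)
toy = pd.DataFrame({
    "Gender": ["Male", "Female", np.nan],
    "Location": ["Chennai", "Chennai", np.nan],
    "Unnamed: 23": [np.nan, np.nan, np.nan],
})

# Drop rows in which every value is missing,
# then columns in which every value is missing
cleaned = toy.dropna(axis=0, how="all").dropna(axis=1, how="all")

print(cleaned.shape)
print(list(cleaned.columns))
```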

Then, we use describe() to get summary statistics for each column.

In [41]:
df.describe()

Unnamed: 0,Date of Interview,Client name,Industry,Location,Position to be closed,Nature of Skillset,Interview Type,Gender,Candidate Current Location,Candidate Job Location,...,Have you obtained the necessary permission to start at the required time,Hope there will be no unscheduled meetings,Can I Call you three hours before the interview and follow up on your attendance for the interview,Can I have an alternative number/ desk number. I assure you that I will not trouble you too much,Have you taken a printout of your updated resume. Have you read the JD and understood the same,Are you clear with the venue details and the landmark.,Has the call letter been shared,Expected Attendance,Observed Attendance,Marital Status
count,1233,1233,1233,1233,1233,1233,1233,1233,1233,1233,...,1029,986,986,986,985,985,988,1228,1233,1233
unique,96,15,7,11,7,92,6,2,10,7,...,7,7,5,6,8,7,12,7,8,2
top,06.02.2016,Standard Chartered Bank,BFSI,Chennai,Routine,JAVA/J2EE/Struts/Hibernate,Scheduled Walk In,Male,Chennai,Chennai,...,Yes,Yes,Yes,Yes,Yes,Yes,Yes,Yes,Yes,Single
freq,220,904,949,754,1023,220,456,965,754,893,...,917,949,951,936,940,946,932,882,701,767


Then, let's see all the columns and their unique values.

In [42]:
for col in df.columns:
    print("Column Name: " + str(col))
    print("Unique Numbers: " + str(len(df[col].unique())))
    print(df[col].unique())

Column Name: Date of Interview
Unique Numbers: 96
['13.02.2015' '19.06.2015' '23.06.2015' '29.06.2015' '25.06.2015'
 '25.05.16' '25.5.2016' '25-05-2016' '25.05.2016' '25-5-2016' '04/12/16'
 '13.04.2016' '27.02.2016' '07.05.2016' '5.5.16' '4.5.16' '21.4.16'
 '22.4.16' '23.4.16' '15 Apr 16' '19 Apr 16' '20 Apr 16' '21-Apr -16'
 '22 -Apr -16' '25 \xe2\x80\x93 Apr-16' '25 Apr 16' '18 Apr 16' '11.5.16'
 '10.5.16' '11.05.16' '12.04.2016' '12.04.2017' '12.04.2018' '12.04.2019'
 '12.04.2020' '12.04.2021' '12.04.2022' '12.04.2023' '8.5.16' '7.5.16'
 '19.03.16' '24.05.2016' '05/11/2016' '26/05/2016' '10.05.2016'
 '28.08.2016 & 09.00 AM' '28.08.2016 & 9.30 AM' '28.8.2016 & 12.00 PM'
 '28.08.2016 & 09.30 AM' '28.8.2016 & 10.30 AM' '28.8.2016 & 09.30 AM'
 '28.8.2016 & 04.00 PM' '28.08.2016 & 11.30 AM' '28.08.2016 & 11.00 AM'
 '28.08.2016 & 10.30 AM' '28.8.2016 & 03.00 PM' '28.08.2016 & 10.00 AM'
 '28.8.2016 & 02.00 PM' '28.8.2016 & 11.00 AM' '13.06.2016' '02.09.2016'
 '02.12.2015' '23.02.2016' '22.

## 2. Data preprocessing

The data exploration above shows that the data is not clean, so we need to take some steps to clean it.

First, let's parse the date in the "Date of Interview" column and convert it to the day of the week. Prior knowledge matters in machine learning: everyday experience suggests that the weekday may affect attendance, and building that kind of assumption into the features can help the model.

In [43]:
from datetime import datetime
import re

df = df.rename(columns={'Date of Interview':'weekday'})

'''
['13.02.2015' '19.06.2015' '23.06.2015' '29.06.2015' '25.06.2015'
 '25.05.16' '25.5.2016' '25-05-2016' '25.05.2016' '25-5-2016' '04/12/16'
 '13.04.2016' '27.02.2016' '07.05.2016' '5.5.16' '4.5.16' '21.4.16'
 '22.4.16' '23.4.16' '15 Apr 16' '19 Apr 16' '20 Apr 16' '21-Apr -16'
 '22 -Apr -16' '25 \xe2\x80\x93 Apr-16' '25 Apr 16' '18 Apr 16' '11.5.16'
 '10.5.16' '11.05.16' '12.04.2016' '12.04.2017' '12.04.2018' '12.04.2019'
 '12.04.2020' '12.04.2021' '12.04.2022' '12.04.2023' '8.5.16' '7.5.16'
 '19.03.16' '24.05.2016' '05/11/2016' '26/05/2016' '10.05.2016'
 '28.08.2016 & 09.00 AM' '28.08.2016 & 9.30 AM' '28.8.2016 & 12.00 PM'
 '28.08.2016 & 09.30 AM' '28.8.2016 & 10.30 AM' '28.8.2016 & 09.30 AM'
 '28.8.2016 & 04.00 PM' '28.08.2016 & 11.30 AM' '28.08.2016 & 11.00 AM'
 '28.08.2016 & 10.30 AM' '28.8.2016 & 03.00 PM' '28.08.2016 & 10.00 AM'
 '28.8.2016 & 02.00 PM' '28.8.2016 & 11.00 AM' '13.06.2016' '02.09.2016'
 '02.12.2015' '23.02.2016' '22.03.2016' '26.02.2016' '06.02.2016'
 '21.4.2016' '21/04/16' '21.4.15' '22.01.2016' '3.6.16' '03/06/16'
 '09.01.2016' '09-01-2016' '03.04.2015' '13/03/2015' '17/03/2015'
 '17.03.2015' '18.03.2014' '4.04.15' '16.04.2015' '17.04.2015' '9.04.2015'
 '05/02/15' '30.05.2016' '07.06.2016' '20.08.2016' '14.01.2016' '30.1.16 '
 '30.01.2016' '30/01/16' '30.1.16' '30.1.2016' '30.01.16' '30-1-2016'
 '06.05.2016']
'''

# Split a raw date string into its first three alphanumeric tokens.
# The code assumes dates are written day-first, so the tokens are
# [day, month, year]; the month may be a number or a name such as "Apr",
# and anything after the year (for example a time of day) is ignored.
def get_formated_date(date_str):
    return re.findall('[A-Za-z0-9]+', date_str)[:3]

# Convert a raw date string to the name of its weekday
def convert_date(date):
    day, month, year = get_formated_date(date)

    # Expand two-digit years such as "16" to "2016"
    if len(year) == 2:
        year = "20" + year

    if month.isdigit():
        month = int(month)
    else:
        # Month given as an abbreviated name, e.g. "Apr"
        month = datetime.strptime(month, "%b").month

    formated_date = datetime(int(year), month, int(day))

    return formated_date.strftime('%A')

df.weekday = df.weekday.apply(convert_date)
print(df['weekday'])

0         Friday
1         Friday
2         Friday
3         Friday
4         Friday
5         Friday
6         Friday
7         Friday
8         Friday
9         Friday
10        Friday
11        Friday
12        Friday
13        Friday
14        Friday
15        Friday
16        Friday
17        Friday
18        Friday
19        Friday
20        Friday
21        Friday
22        Friday
23        Friday
24        Friday
25        Friday
26        Friday
27        Friday
28       Tuesday
29       Tuesday
          ...   
1203    Saturday
1204    Saturday
1205    Saturday
1206    Saturday
1207    Saturday
1208    Saturday
1209    Saturday
1210    Saturday
1211    Saturday
1212    Saturday
1213    Saturday
1214    Saturday
1215    Saturday
1216    Saturday
1217    Saturday
1218    Saturday
1219    Saturday
1220    Saturday
1221    Saturday
1222    Saturday
1223    Saturday
1224    Saturday
1225    Saturday
1226    Saturday
1227    Saturday
1228    Saturday
1229    Saturday
1230      Frid

Next, values such as "No" and "NO" appear in several columns. They mean the same thing but would be treated as different categories, so we strip surrounding whitespace and convert all values to lower case.

In [44]:
for col in df.columns:
    df[col] = df[col].str.strip()
    df[col] = df[col].str.lower()
df.head()

Unnamed: 0,weekday,Client name,Industry,Location,Position to be closed,Nature of Skillset,Interview Type,Gender,Candidate Current Location,Candidate Job Location,...,Have you obtained the necessary permission to start at the required time,Hope there will be no unscheduled meetings,Can I Call you three hours before the interview and follow up on your attendance for the interview,Can I have an alternative number/ desk number. I assure you that I will not trouble you too much,Have you taken a printout of your updated resume. Have you read the JD and understood the same,Are you clear with the venue details and the landmark.,Has the call letter been shared,Expected Attendance,Observed Attendance,Marital Status
0,friday,hospira,pharmaceuticals,chennai,production- sterile,routine,scheduled walkin,male,chennai,hosur,...,yes,yes,yes,yes,yes,yes,yes,yes,no,single
1,friday,hospira,pharmaceuticals,chennai,production- sterile,routine,scheduled walkin,male,chennai,bangalore,...,yes,yes,yes,yes,yes,yes,yes,yes,no,single
2,friday,hospira,pharmaceuticals,chennai,production- sterile,routine,scheduled walkin,male,chennai,chennai,...,,na,,,,,,uncertain,no,single
3,friday,hospira,pharmaceuticals,chennai,production- sterile,routine,scheduled walkin,male,chennai,chennai,...,yes,yes,no,yes,no,yes,yes,uncertain,no,single
4,friday,hospira,pharmaceuticals,chennai,production- sterile,routine,scheduled walkin,male,chennai,bangalore,...,yes,yes,yes,no,yes,yes,yes,uncertain,no,married


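Third, since we want to use an SVM, we need to transform the data into numbers. Fortunately, all the columns contain discrete values, so we can encode each category as an integer with scikit-learn's LabelEncoder. Strictly speaking this is label encoding rather than dummy (one-hot) coding: it assigns integers in sorted order, which imposes an arbitrary ordering on the categories.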

In [45]:
from sklearn.preprocessing import LabelEncoder

dummy_encoder = LabelEncoder()

for col in df.columns :
    dummy_encoder.fit(df[col])
    df[col] = dummy_encoder.transform(df[col])
df.head()

Unnamed: 0,weekday,Client name,Industry,Location,Position to be closed,Nature of Skillset,Interview Type,Gender,Candidate Current Location,Candidate Job Location,...,Have you obtained the necessary permission to start at the required time,Hope there will be no unscheduled meetings,Can I Call you three hours before the interview and follow up on your attendance for the interview,Can I have an alternative number/ desk number. I assure you that I will not trouble you too much,Have you taken a printout of your updated resume. Have you read the JD and understood the same,Are you clear with the venue details and the landmark.,Has the call letter been shared,Expected Attendance,Observed Attendance,Marital Status
0,0,7,5,2,3,63,3,1,2,4,...,4,5,4,4,5,4,7,5,0,1
1,0,7,5,2,3,63,3,1,2,1,...,4,5,4,4,5,4,7,5,0,1
2,0,7,5,2,3,63,3,1,2,2,...,0,2,0,0,0,0,0,4,0,1
3,0,7,5,2,3,63,3,1,2,2,...,4,5,2,4,2,4,7,4,0,1
4,0,7,5,2,3,63,3,1,2,1,...,4,5,4,2,5,4,7,4,0,0
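For comparison, true dummy (one-hot) coding creates one 0/1 column per category instead of a single integer column, so no ordering is implied. The sketch below uses pandas' get_dummies on a small made-up frame; the column names and values are illustrative only, and this is not the encoding used in the rest of this tutorial.

```python
import pandas as pd

# Toy categorical frame (hypothetical values)
toy = pd.DataFrame({
    "gender": ["male", "female", "male"],
    "marital status": ["single", "married", "single"],
})

# One indicator column per (column, category) pair
one_hot = pd.get_dummies(toy, dtype=int)

print(one_hot.columns.tolist())
print(one_hot)
```

The trade-off is width: a column with many distinct values, such as "Nature of Skillset", would expand into many indicator columns.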


## 3. Split dataset

After preprocessing the raw data, it's time to start training. However, we cannot train our model blindly: we need to hold out part of the data to evaluate its performance. So we split the dataset into a training set, a validation set and a test set. We use the training set to fit the model and the validation set to compare candidate models. The accuracy on the test set then gives an estimate of how well the model performs on unseen data.

In [46]:
# Split to get the label
y = df.pop("Observed Attendance")
df.head()
print(y)

0       0
1       0
2       0
3       0
4       0
5       1
6       1
7       1
8       1
9       0
10      1
11      0
12      1
13      1
14      1
15      1
16      1
17      0
18      1
19      0
20      0
21      1
22      1
23      1
24      0
25      0
26      0
27      0
28      1
29      1
       ..
1203    1
1204    1
1205    1
1206    1
1207    0
1208    1
1209    1
1210    1
1211    1
1212    0
1213    0
1214    1
1215    1
1216    0
1217    1
1218    1
1219    1
1220    1
1221    1
1222    1
1223    1
1224    1
1225    1
1226    1
1227    1
1228    1
1229    1
1230    1
1231    1
1232    1
Name: Observed Attendance, Length: 1233, dtype: int64


In [47]:
# Get train, validation and test data
# train_test_split lives in sklearn.model_selection
# (the old sklearn.cross_validation module was removed in scikit-learn 0.20)
from sklearn.model_selection import train_test_split

def generate_dataset(df, y):
    # Hold out 30% of the data as the test set
    X_train, X_test, y_train, y_test = train_test_split(df, y, test_size = 0.3)

    # Here test_size is the share of the remaining data used as the validation set
    X_train, X_validation, y_train, y_validation = train_test_split(X_train, y_train, test_size = 0.2)
    return X_train, X_validation, y_train, y_validation, X_test, y_test

X_train, X_validation, y_train, y_validation, X_test, y_test = generate_dataset(df, y)
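The split above is random and changes on every run. If reproducible, class-balanced splits are wanted, train_test_split accepts random_state and stratify arguments. The sketch below shows the same 70/30 then 80/20 scheme on synthetic data; the arrays and the seed are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the features and labels (60% positive class)
rng = np.random.RandomState(0)
X = rng.rand(100, 3)
y = np.array([1] * 60 + [0] * 40)

# 30% test set, keeping the class ratio the same in both parts
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# 20% of the remainder becomes the validation set
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.2, stratify=y_rest, random_state=0)

print(len(y_train), len(y_val), len(y_test))
```

Fixing the seed makes a single run repeatable, but it does not remove the dependence of the result on that one split.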

## 4. Using SVM to train a model and make predictions

The first step is to train an SVM. With a linear kernel, an SVM is a simple and effective linear classifier.

The key difference between a linear SVM and logistic regression is the loss function. Both are used for classification, but the SVM uses the hinge loss, which means the decision boundary is determined only by the points near it (the so-called support vectors); that is how the support vector machine got its name. In addition, SVMs are usually trained with a soft margin: slack variables, weighted by a penalty parameter C, allow some points to fall on the wrong side of the margin, while an L2 penalty on the weights keeps the margin wide.

Nonlinear kernels implicitly map the data into a higher-dimensional feature space, which lets an SVM handle nonlinear classification as well. Commonly used kernels include (1) the Fisher kernel, (2) graph kernels, (3) the kernel smoother, (4) the polynomial kernel, (5) the radial basis function (RBF) kernel and (6) string kernels. Here we will train models with different kernel functions and compare their predictions.
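To make the difference in loss functions concrete, the sketch below evaluates both losses as a function of the margin m = y * f(x), with labels y in {-1, +1}. The function names are chosen for this illustration. The hinge loss is exactly zero once the margin reaches 1, so points far from the boundary stop influencing the fit, whereas the logistic loss stays positive everywhere.

```python
import numpy as np

def hinge_loss(margin):
    # max(0, 1 - m): zero for points classified correctly with margin >= 1
    return np.maximum(0.0, 1.0 - margin)

def logistic_loss(margin):
    # log(1 + exp(-m)): small but never exactly zero
    return np.log1p(np.exp(-margin))

margins = np.array([-1.0, 0.0, 1.0, 3.0])
print(hinge_loss(margins))
print(logistic_loss(margins))
```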
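As an illustration of what a kernel computes, the RBF kernel between two points x and z is exp(-gamma * ||x - z||^2). The sketch below builds the kernel matrix by hand for three made-up points and checks it against scikit-learn's rbf_kernel; the helper name rbf_kernel_manual and the value of gamma are assumptions for this example.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def rbf_kernel_manual(A, B, gamma):
    # Squared Euclidean distance between every row of A and every row of B
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * sq_dists)

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
K = rbf_kernel_manual(X, X, gamma=0.5)

print(K)
print(np.allclose(K, rbf_kernel(X, X, gamma=0.5)))
```

Each entry measures similarity: it equals 1 when the two points coincide and decays toward 0 as they move apart, at a rate set by gamma.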

In [48]:
from sklearn import svm
from sklearn.metrics import accuracy_score

# SVM with linear kernel function
linear_clf = svm.SVC(kernel='linear')
linear_clf.fit(X_train, y_train)
linear_pre_validation = linear_clf.predict(X_validation)
linear_pre_test = linear_clf.predict(X_test)
# print the result (accuracy_score expects the true labels first)
print("Validation:")
print(accuracy_score(y_validation, linear_pre_validation))
print("Test:")
print(accuracy_score(y_test, linear_pre_test))

Validation:
0.687861271676
Test:
0.713513513514


In [49]:
# Train an SVM with the given kernel and return its validation and test accuracy
def kernel_SVM(X_train, X_validation, y_train, y_validation, X_test, y_test, kernel):
    clf = svm.SVC(kernel=kernel)
    clf.fit(X_train, y_train)
    pre_validation = clf.predict(X_validation)
    pre_test = clf.predict(X_test)
    validation_acc = accuracy_score(y_validation, pre_validation)
    test_acc = accuracy_score(y_test, pre_test)
    return validation_acc, test_acc

validation_acc, test_acc = kernel_SVM(X_train, X_validation, y_train, y_validation, X_test, y_test, kernel='rbf')
print("Validation:")
print(validation_acc)
print("Test:")
print(test_acc)

Validation:
0.670520231214
Test:
0.67027027027


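From the results above, we can see that on this particular split the SVM with the linear kernel does better than the SVM with the RBF kernel on both the validation set (0.688 vs. 0.671) and the test set (0.714 vs. 0.670).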

## 5. Cross validation

In the previous part, we saw how to use an SVM to train a model and make predictions, and on that split the linear kernel beat the 'rbf' kernel. But does this hold in general? There we divided the data only once, so the comparison depends on a single random split and does not use the data efficiently. In this part, we randomly regenerate the training, validation and test sets several times, train and evaluate both models on each split, and average the accuracies to see which model is better for this problem. (Strictly speaking, this is repeated random sub-sampling rather than k-fold cross-validation.)

In [50]:
generate_time = 10

average_rbf_validation_acc = 0
average_rbf_test_acc = 0

average_linear_validation_acc = 0
average_linear_test_acc = 0

for i in range(generate_time):
    X_train, X_validation, y_train, y_validation, X_test, y_test = generate_dataset(df, y)
    
    validation_acc, test_acc = kernel_SVM(X_train, X_validation, y_train, y_validation, X_test, y_test, kernel='linear')
    average_linear_validation_acc += validation_acc
    average_linear_test_acc += test_acc
    
    validation_acc, test_acc = kernel_SVM(X_train, X_validation, y_train, y_validation, X_test, y_test, kernel='rbf')
    average_rbf_validation_acc += validation_acc
    average_rbf_test_acc += test_acc

print("RBF validation:")
print(average_rbf_validation_acc / generate_time)
print("RBF test:")
print(average_rbf_test_acc / generate_time)
print("Linear validation:")
print(average_linear_validation_acc / generate_time)
print("Linear test:")
print(average_linear_test_acc / generate_time)

RBF validation:
0.657225433526
RBF test:
0.649189189189
Linear validation:
0.683815028902
Linear test:
0.697837837838


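From the results above, with the features we selected, the plain linear SVM works better on average than the SVM with the RBF kernel. One plausible explanation is that the relationship between the features and the outcome is close to linear, in which case the linear kernel is simpler and, by Occam's razor, preferable. Note, however, that neither model's hyperparameters were tuned, so this comparison holds only for scikit-learn's default settings.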
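For reference, scikit-learn also provides k-fold cross-validation directly, in which every sample is used for testing exactly once. The sketch below compares the two kernels with cross_val_score on synthetic data from make_classification; the dataset, fold count and seed are assumptions for illustration, so the printed numbers say nothing about the interview dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

# Synthetic binary classification data standing in for the real features
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

# Five stratified folds, shuffled with a fixed seed for repeatability
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

results = {}
for kernel in ("linear", "rbf"):
    scores = cross_val_score(SVC(kernel=kernel), X_demo, y_demo,
                             cv=cv, scoring="accuracy")
    results[kernel] = scores
    print(kernel, "mean accuracy:", scores.mean(), "std:", scores.std())
```

Reporting the standard deviation alongside the mean helps judge whether a gap between two kernels is larger than the variation between folds.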

## References

[1] http://scikit-learn.org/stable/modules/svm.html

[2] http://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html

[3] https://www.kaggle.com/vishnusraghavan/the-interview-attendance-problem/data