<p style="font-family: Arial; font-size:2.75em;color:purple; font-style:bold"><br>
SVM using Python (sklearn):</p><br>
<p style="font-family: Arial; font-size:2.25em;color:green; font-style:bold"><br>
Kumar Rahul</p><br>

### We will be using HR data in this exercise. Refer to Exhibit 1 to understand the feature list. Use the HR data to answer the questions below.

> 1. Load the dataset in Jupyter Notebook using pandas.
2. Get the unique labels and frequency count of categorical features.
3. In the line of business feature, combine the EAS, Healthcare and MMS labels into a new label named 'Others'.
4. Create a custom function to compute the mean of numeric features with respect to each categorical feature in the data.
5. Create a custom function to visualize the data created in step 4. Use the seaborn package for visualization.
6. Create a new data frame with the numeric features and with the categorical features coded as dummy variables. Which features will you include for model building and why?
7. Split the data into a training set and a test set. Use 80% of the data for model training and 20% for model testing.
8. Build a model using the sklearn package (`from sklearn.svm import LinearSVC`) to predict whether a candidate will not join (LinearSVC returns class labels, not probabilities). Use the concept of a pipeline to scale the data (using StandardScaler) and apply LinearSVC.
9. Refine the hyperparameters of the model created in step 8 using GridSearchCV from the model_selection module.
10. Report the performance of the model on the test set.
11. Build a model using the sklearn package (`from sklearn.svm import SVC`) with the radial (RBF) kernel. Fine-tune the two parameters C and gamma for the radial kernel. Use the concepts of pipeline and grid search as applied in steps 8 and 9.
12. Report the performance of the model on the test set. Compare the performance with that obtained in step 10 and document your findings.


Participants may skip questions 4 and 5 and start with model building. Links that may be helpful for implementing questions 8 to 12:

* SVM implementation in sklearn here: https://scikit-learn.org/stable/modules/svm.html
* Tips for SVM: https://scikit-learn.org/stable/modules/svm.html#tips-on-practical-use 
* Pipelines: https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html


The code below does not implement questions 11 and 12. Participants are expected to implement these on their own as a home assignment.
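
As a possible starting point for questions 11 and 12, here is a minimal sketch of a scaled RBF-kernel `SVC` tuned over `C` and `gamma`. It runs on synthetic data from `make_classification` rather than the HR data, and the names `X_demo`, `rbf_pipeline`, `rbf_grid` and `rbf_search`, as well as the grid values, are assumptions chosen for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the encoded HR features (assumption, not the real data)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Scale first, then fit the RBF-kernel SVM
rbf_pipeline = Pipeline([("scaler", StandardScaler()), ("sv", SVC(kernel="rbf"))])

# Illustrative grid; the step name 'sv' prefixes each SVC parameter
rbf_grid = {"sv__C": [0.1, 1.0, 10.0], "sv__gamma": ["scale", 0.01, 0.1]}

rbf_search = GridSearchCV(rbf_pipeline, param_grid=rbf_grid,
                          scoring="balanced_accuracy", cv=3)
rbf_search.fit(X_tr, y_tr)

print(rbf_search.best_params_)
print("test balanced accuracy: %.2f" % rbf_search.score(X_te, y_te))
```

On the HR data, `X_tr`, `X_te`, `y_tr` and `y_te` would be replaced by the train and test splits created later in this notebook.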


**PS: Not all the questions are being answered as a part of the same notebook. You are encouraged to answer the questions if you find them missing.**

**Exhibit 1**


|Sl.No.|Name of Variable|Variable Description|
|:-------|----------------|:--------------------|
|1	|Candidate reference number|	Unique number to identify the candidate|
|2	|DOJ extended|Binary variable identifying whether candidate asked for date of joining extension (Yes/No)|
|3	|Duration to accept the offer|	Number of days taken by the candidate to accept the offer (continuous variable)|
|4	|Notice period|	Notice period to be served in the parting company before candidate can join this company (continuous variable)|
|5	|Offered band|	Band offered to the candidate based on experience and performance in interview rounds (categorical variable labelled C0/C1/C2/C3/C4/C5/C6)|
|6	|Percentage hike (CTC) expected|	Percentage hike expected by the candidate (continuous variable)|
|7	|Percentage hike offered (CTC)| Percentage hike offered by the company (continuous variable)|
|8	|Percent difference CTC|	Percentage difference between offered and expected CTC (continuous variable)|
|9	|Joining bonus|	Binary variable indicating if joining bonus was given or not (Yes/No)|
|10	|Gender|	Gender of the candidate (Male/Female)|
|11	|Candidate source|	Source from which resume of the candidate was obtained (categorical variables with categories  Employee referral/Agency/Direct)|
|12	|REX (in years)|	Relevant years of experience of the candidate for the position offered (continuous variable)|
|13	|LOB|	Line of business for which offer was rolled out (categorical variable)|
|14	|DOB|	Date of birth of the candidate|
|15	|Joining location|	Company location for which offer was rolled out for candidate to join (categorical variable)|
|16	|Candidate relocation status|	Binary variable indicating whether candidate has to relocate from one city to another city for joining (Yes/No)|
|17 |HR status|	Final joining status of candidate (Joined/Not-Joined)|

***


# Code starts here

To find out which environment the Python kernel is running in



In [None]:
import sys, os

sys.executable

Suppress the warnings

In [None]:
import warnings
warnings.filterwarnings("ignore")

We are going to use the libraries mentioned below for **data import, processing and visualization**. As we progress, we will use other specific libraries for model building and evaluation.

In [None]:
import pandas as pd 
import numpy as np
import seaborn as sn # visualization library based on matplotlib
import matplotlib.pyplot as plt

#the output of plotting commands is displayed inline within Jupyter notebook
%matplotlib inline 


## Data Import and Manipulation

### 1. Importing a data set

_Give the correct path to the data_



Modify the `ast_node_interactivity` option of the IPython interactive shell to see the value of multiple statements at once.

In [None]:
import os

os.getcwd()

#os.chdir()

In [None]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [None]:
raw_df = pd.read_csv( "../HR_case/data/HR_Data_No_Missing_Value.csv", 
                        sep = ',', na_values = ['', ' '])

raw_df.columns = raw_df.columns.str.lower().str.replace(' ', '_')
raw_df.head()

In [None]:
#?pd.read_csv

Dropping SLNo and Candidate.Ref as these will not be used for any analysis or model building.

In [None]:
#?raw_df.drop()

In [None]:
if set(['slno','candidate_ref']).issubset(raw_df.columns):
    raw_df.drop(['slno','candidate_ref'],axis=1, inplace=True)
    
raw_df.head()


### 2. Structure of the dataset



In [None]:
raw_df.info()

In [None]:
raw_df.status.value_counts()
#raw_df.describe(include='all').transpose()
raw_df.describe().transpose()

To get help on the attributes and methods of an object

In [None]:
#?raw_df.status.value_counts()

### 2. Summarizing the dataset
Create a new data frame and store a copy of the raw data in it. This keeps the raw data intact in case further manipulation is needed. The *dropna()* function is used for row-wise deletion of missing values. The axis = 0 means rows are dropped, axis = 1 means columns are dropped.


In [None]:
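# how='any' drops a row if any of its values is missing.
# Recent pandas versions raise an error when both 'how' and 'thresh' are passed, so 'thresh' is omitted.
# .copy() gives an independent frame that can be modified later without warnings.
filter_df = raw_df.dropna(axis=0, how='any').copy()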

list(filter_df.columns )
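
To make the effect of the `axis` argument concrete, here is a small sketch on a made-up frame. The frame `demo_df` and its columns are hypothetical and are not part of the HR data.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value in column 'a'
demo_df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": ["x", "y", "z"]})

# axis=0: drop every row that contains a missing value
rows_kept = demo_df.dropna(axis=0, how="any")

# axis=1: drop every column that contains a missing value
cols_kept = demo_df.dropna(axis=1, how="any")

print(rows_kept.shape)  # two rows remain, both columns kept
print(cols_kept.shape)  # all three rows remain, only column 'b' kept
```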

We will first start by printing the unique labels in categorical features

In [None]:
numerical_features = [x for x in filter_df.select_dtypes(include = np.number)]

# np.object was removed from recent NumPy versions; the string 'object' selects the same columns
categorical_features = [x for x in filter_df.select_dtypes(include = 'object')]

for f in categorical_features:
    print("\nThe unique labels in {} are {}\n".format(f, filter_df[f].unique()))
    print("The value counts in {} are \n{}\n".format(f,  filter_df[f].value_counts()))


Looking at the feature **line of business**, it seems that *EAS, Healthcare and MMS* do not have enough observations and may be clubbed together

In [None]:
# Relabel the three sparse categories as 'Others' in a single pass
filter_df['lob'] = np.where(filter_df['lob'].isin(['EAS', 'Healthcare', 'MMS']), 'Others', filter_df['lob'])
filter_df.lob.value_counts()
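
The clubbing of rare labels can also be sketched on a standalone series. The series `lob_demo` below is made up for illustration; only the three labels being merged come from this exercise, the remaining labels are placeholders.

```python
import pandas as pd

# Hypothetical line-of-business values (placeholders 'ERS' and 'INFRA' are assumptions)
lob_demo = pd.Series(["ERS", "EAS", "INFRA", "MMS", "Healthcare", "ERS"])
rare_labels = ["EAS", "Healthcare", "MMS"]

# Series.where keeps values where the condition is True and replaces the rest
lob_clubbed = lob_demo.where(~lob_demo.isin(rare_labels), "Others")

print(lob_clubbed.value_counts())
```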

We will use **groupby** function of pandas to get deeper insights of the behaviour of people **Joining** or **Not Joining** the company. We will write a generic function to report the mean by any categorical variable.

In [None]:
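def group_by (categorical_features):
    # numeric_only=True restricts the mean to numeric columns;
    # recent pandas versions raise an error on text columns otherwise
    return filter_df.groupby(categorical_features).mean(numeric_only=True)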



In [None]:
for c in categorical_features:
    # results computed inside a loop are not shown automatically, so display them explicitly
    display(group_by(c))
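
A small self-contained sketch of the same group-wise mean, assuming a made-up frame `toy_df` whose column names only mimic the HR data:

```python
import pandas as pd

# Hypothetical toy frame; the values are made up for illustration
toy_df = pd.DataFrame({
    "status": ["Joined", "Not Joined", "Joined", "Not Joined"],
    "gender": ["Male", "Female", "Female", "Male"],
    "notice_period": [30, 60, 30, 90],
    "rex_in_yrs": [2, 5, 4, 7],
})

def mean_by(df, by):
    # numeric_only=True skips text columns such as 'gender'
    return df.groupby(by).mean(numeric_only=True)

status_means = mean_by(toy_df, "status")
print(status_means)
```

Passing the data frame as an argument, rather than reading a global such as `filter_df`, makes the helper easier to reuse and test.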

### 3. Visualizing the Data

Plots can be drawn using the plotting functions of any of the following:

>1. pandas library (http://pandas.pydata.org/pandas-docs/stable/visualization.html)
2. matplotlib library (https://matplotlib.org/) or
3. seaborn library (https://seaborn.pydata.org/) which is based on matplotlib and provides interface for drawing attractive statistical graphics.

#### 3a. Visualizing the Data using seaborn

In [None]:
filter_df[numerical_features].info()
filter_df[categorical_features].info()

In [None]:
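def bar_plot(xlabel, ylabel, ax):
    # Bar height is the mean of the numeric feature (ylabel) for each label of the categorical feature (xlabel)
    sn.barplot(x = xlabel, y = ylabel, data = filter_df, ax = ax)

In [None]:
# One subplot per (categorical, numeric) pair, laid out four per row
n_cols = 4
pairs = [(c, n) for c in categorical_features for n in numerical_features]
n_rows = int(np.ceil(len(pairs) / n_cols))

fig, axes = plt.subplots(n_rows, n_cols, figsize=(40,55), squeeze=False)
fig.subplots_adjust(hspace = 1, wspace=.5)

for ax, (c, n) in zip(axes.ravel(), pairs):
    bar_plot(c, n, ax)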

## Model Building: 

### Dummy Variable coding

Remove the response variable, and the features that will not be used for modelling, from the feature list


In [None]:
X_features = list(filter_df.columns)
X_features.remove('status')
X_features.remove('pecent_hike_expected_in_ctc')
X_features.remove('percent_hike_offered_in_ctc')
X_features.remove('candidate_relocate_actual')

In [None]:
X_features

In [None]:
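# dtype=int keeps the dummies as 0/1 integers (newer pandas versions default to True/False)
encoded_X_df = pd.get_dummies( filter_df[X_features], drop_first = False, dtype = int )
encoded_Y_df = pd.get_dummies( filter_df['status'], drop_first=False, dtype = int )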

In [None]:
#?pd.get_dummies

In [None]:
pd.options.display.max_columns = None
encoded_X_df.info()

In [None]:
Y = encoded_Y_df.filter(['Joined'], axis =1)
X = encoded_X_df
Y.info()
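
To illustrate what dummy variable coding produces, and what `drop_first` changes, here is a minimal sketch on a made-up frame. The frame `toy_df` and its values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical frame with one categorical and one numeric column
toy_df = pd.DataFrame({
    "gender": ["Male", "Female", "Male"],
    "notice_period": [30, 60, 45],
})

# One 0/1 column per label
full_dummies = pd.get_dummies(toy_df, drop_first=False, dtype=int)

# One label per feature is dropped and acts as the reference level
reduced_dummies = pd.get_dummies(toy_df, drop_first=True, dtype=int)

print(list(full_dummies.columns))
print(list(reduced_dummies.columns))
```

With `drop_first=False`, as used in this notebook, the dummy columns of each feature sum to one and are therefore perfectly collinear. A regularised model such as `LinearSVC` can still be fitted on them, but dropping a reference level is a common alternative.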

### Train and test data split using Python

The train and test split is done using `train_test_split` from the **sklearn** `model_selection` module

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split( X, Y, test_size = 0.2, random_state = 42)
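
The split above does not pass `stratify`, so the share of each class can differ between the training and test sets. As a hedged illustration on made-up data (the arrays `X_demo` and `y_demo` are assumptions), the `stratify` argument preserves the class proportions in both parts:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced target: 80 ones and 20 zeros
y_demo = np.array([1] * 80 + [0] * 20)
X_demo = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Share of the positive class in each part
print(y_tr.mean(), y_te.mean())
```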

## Model Building: Using the **sklearn** 



In [None]:
from sklearn import svm
#dir(svm)

In [None]:
svm.LinearSVC?
svm.SVC?

### Creating Pipeline

Creating pipeline for LinearSVC()

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

seq_steps = [('scaler', StandardScaler()), ('sv', svm.LinearSVC())]
pipeline = Pipeline(seq_steps)
print(pipeline)
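
One reason to put the scaler inside the pipeline is that its mean and standard deviation are then learned only from the data passed to `fit`, and the same transformation is reused at prediction time. The sketch below checks this on synthetic data; `X_demo`, `demo_pipe` and the 150/50 split are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic data standing in for the encoded HR features
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
X_fit, X_new = X_demo[:150], X_demo[150:]

demo_pipe = Pipeline([("scaler", StandardScaler()), ("sv", LinearSVC())])
demo_pipe.fit(X_fit, y_demo[:150])

# The scaler inside the pipeline has seen only the rows used for fitting
fitted_mean = demo_pipe.named_steps["scaler"].mean_
print(np.allclose(fitted_mean, X_fit.mean(axis=0)))

# New rows are scaled with the stored statistics before prediction
new_predictions = demo_pipe.predict(X_new)
print(new_predictions.shape)
```

The same property holds inside `GridSearchCV`: the scaler is refitted on the training part of every fold, so the validation part does not leak into the scaling.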

In [None]:
C = [0.8,0.9,1.0,1.5,2.0]
# Create the grid
random_grid = {'sv__C': C}
random_grid

### Model with Grid Search

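To optimise for a selected KPI, use `sklearn.metrics.get_scorer_names()` to get the list of all the available metrics and pass the relevant one as `scoring` in `RandomizedSearchCV` or `GridSearchCV`. (Older code used `sklearn.metrics.SCORERS.keys()`, which has been removed from recent scikit-learn releases.)

In [None]:
from sklearn.metrics import get_scorer_names

get_scorer_names()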

In [None]:
# Use the parameter grid to search for the best hyperparameters
from sklearn.model_selection import GridSearchCV
from tqdm.notebook import tqdm  # tqdm_notebook is deprecated

# Repeat the grid search with 3-, 4- and 5-fold cross-validation
for cv in tqdm(range(3,6)):
    svm_best_model = GridSearchCV(estimator = pipeline, param_grid = random_grid,scoring='balanced_accuracy',
                                  cv = cv)
    # Fit the grid search model
    svm_best_model.fit(X_train, y_train.values.ravel())
    # score() reports the 'balanced_accuracy' scorer passed above
    print("performance for %d fold CV = %2.2f" %(cv, svm_best_model.score(X_test,y_test)))
    print("best parameters for %d fold CV" %(cv))
    print(svm_best_model.best_params_)

### Report the parameter

The best model, taken from the last grid search run above (5-fold CV), has the following parameters

In [None]:
svm_best_model.best_params_

svm_best_model.best_estimator_

#cv_result = pd.DataFrame(svm_best_model.cv_results_)
#cv_result

svm_best_model.best_score_

#svm_best_model.best_index_

## Model Evaluation


### 1. The prediction on train data.

To predict the outcome on the **train set**
> * Use **predict** function of the model object 


In [None]:
# Make predictions using the training set
#pd.options.display.max_rows = None

predict_class_train_df = pd.DataFrame(svm_best_model.predict(X_train))
predict_class_train_df.head()

# LinearSVC has no predict_proba; decision_function gives the signed distance from the hyperplane
#decision_train_df = pd.DataFrame(svm_best_model.decision_function(X_train))
#decision_train_df.iloc[:,:].head()

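The output above shows the predicted class (1 = Joined, 0 = Not Joined) for each training observation. Unlike logistic regression, `LinearSVC` does not compute class probabilities: the predicted class is determined by the sign of the decision function, that is, by the side of the separating hyperplane on which the observation falls.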
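
`LinearSVC` has no `predict_proba` method; what it exposes is `decision_function`, the signed distance from the separating hyperplane. A minimal sketch on synthetic data (the names `X_demo`, `clf`, `scores` and `labels` are assumptions for illustration) shows how the predicted class follows from the sign of that score:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# Synthetic binary problem with classes 0 and 1
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
clf = LinearSVC().fit(X_demo, y_demo)

scores = clf.decision_function(X_demo)  # signed distance from the hyperplane
labels = clf.predict(X_demo)            # predicted class labels

# LinearSVC does not offer probability estimates
print(hasattr(clf, "predict_proba"))

# A positive score maps to class 1, a non-positive score to class 0
print(np.array_equal(labels, (scores > 0).astype(int)))
```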

### 2. The prediction on test data.

The prediction can be carried out by **defining functions** as well. Below is one such instance wherein a function is defined and is used for prediction

In [None]:
def get_predictions ( test_class, model, test_data ):
    predicted_df = pd.DataFrame(model.predict(test_data))
    y_pred_df = pd.concat([test_class.reset_index(drop=True), predicted_df.iloc[:,0:]], axis =1)
    return y_pred_df

Label the Y column of the test set using a Python dictionary. This is needed because the model was built using dummy variable coding; the labels will be used later to generate the confusion matrix

In [None]:
test_series = y_test
train_series = y_train

status_dict = {1:"Joined", 0:"Not Joined"}
class_test_df = test_series.replace(dict(Joined=status_dict))
class_test_df.rename({'Joined': 'status'}, axis='columns', inplace=True )

class_train_df = train_series.replace(dict(Joined=status_dict))
class_train_df.rename({'Joined': 'status'}, axis='columns', inplace=True )

#class_test_df.info()
#class_train_df.info()

In [None]:
predict_test_df = pd.DataFrame(get_predictions(class_test_df.status, svm_best_model, X_test))
predict_test_df.rename(columns = {0:'predicted_class'}, inplace=True)
predict_test_df.head()

In [None]:
predict_test_df['predicted'] = predict_test_df.predicted_class.map(lambda x: 'Joined' if x ==1 else 'Not Joined')
predict_test_df[0:10]

### 3. Confusion Matrix

We will build a classification matrix using the **metrics** module of the **sklearn** package. We will also write a custom function to plot the classification matrix and use it for reporting the performance measures.

To understand the concept of micro average and macro average:

https://datascience.stackexchange.com/questions/15989/micro-average-vs-macro-average-performance-in-a-multiclass-classification-settin
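
The difference between the two averages can be seen on a small made-up example. The labels below are hypothetical and deliberately imbalanced; `recall_score` with `average="macro"` averages the per-class recalls, while `average="micro"` pools all observations before computing the ratio.

```python
from sklearn.metrics import recall_score

# Hypothetical imbalanced labels: 8 'Joined' and 2 'Not Joined'
y_true_demo = ["Joined"] * 8 + ["Not Joined"] * 2
# All 'Joined' are predicted correctly, one 'Not Joined' is misclassified
y_pred_demo = ["Joined"] * 8 + ["Joined", "Not Joined"]

# Per-class recall: Joined = 8/8, Not Joined = 1/2
macro_recall = recall_score(y_true_demo, y_pred_demo, average="macro")

# Pooled recall: 9 correct out of 10 observations
micro_recall = recall_score(y_true_demo, y_pred_demo, average="micro")

print(macro_recall)  # 0.75
print(micro_recall)  # 0.9
```

The macro average gives both classes equal weight, so it is pulled down by the poorly predicted minority class; the micro average is dominated by the majority class.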

#### 3a. Confusion Matrix using sklearn

In [None]:
from sklearn import metrics
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

In [None]:
print("The model with dummy variable coding output: ")
confusion_matrix(class_test_df.status, predict_test_df.predicted)
# Pass the 'status' column (a 1-d series) rather than the whole data frame
svm_report = classification_report(class_test_df.status, predict_test_df.predicted)
print(svm_report)


#### 3b Confusion Matrix using generic function

In [None]:
def draw_cm( actual, predicted ):
    plt.figure(figsize=(9,9))
    cm = metrics.confusion_matrix( actual, predicted )
    sn.heatmap(cm, annot=True,  fmt='.0f', xticklabels = ["Joined", "Not Joined"] , 
               yticklabels = ["Joined", "Not Joined"],cmap = 'Blues_r')
    plt.ylabel('Actual')
    plt.xlabel('Predicted')
    plt.title('Classification Matrix Plot', size = 15);
    plt.show()

The classification matrix plot as reported with dummy variable coding is:

In [None]:
draw_cm( predict_test_df.status, predict_test_df.predicted )

### 4. Performance Measure on the test set


In [None]:
def measure_performance (clasf_matrix):
    measure = pd.DataFrame({
                        'sensitivity': [round(clasf_matrix[0,0]/(clasf_matrix[0,0]+clasf_matrix[0,1]),2)], 
                        'specificity': [round(clasf_matrix[1,1]/(clasf_matrix[1,0]+clasf_matrix[1,1]),2)],
                        'recall': [round(clasf_matrix[0,0]/(clasf_matrix[0,0]+clasf_matrix[0,1]),2)],
                        'precision': [round(clasf_matrix[0,0]/(clasf_matrix[0,0]+clasf_matrix[1,0]),2)],
                        'overall_acc': [round((clasf_matrix[0,0]+clasf_matrix[1,1])/
                                              (clasf_matrix[0,0]+clasf_matrix[0,1]+clasf_matrix[1,0]+clasf_matrix[1,1]),2)]
                       })
    return measure
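
The function above treats row and column 0 of the matrix as the positive class, which is "Joined" because sklearn orders string labels alphabetically. As a sanity check on made-up labels (the lists `actual_demo` and `predicted_demo` are assumptions), the same formulas can be compared with sklearn's own metric functions:

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels for illustration only
actual_demo = ["Joined", "Joined", "Joined", "Not Joined", "Not Joined", "Joined"]
predicted_demo = ["Joined", "Not Joined", "Joined", "Not Joined", "Joined", "Joined"]

# Fix the label order so that index 0 is 'Joined' and index 1 is 'Not Joined'
cm_demo = confusion_matrix(actual_demo, predicted_demo, labels=["Joined", "Not Joined"])

# Same formulas as in measure_performance, with 'Joined' as the positive class
sensitivity = cm_demo[0, 0] / (cm_demo[0, 0] + cm_demo[0, 1])
specificity = cm_demo[1, 1] / (cm_demo[1, 0] + cm_demo[1, 1])
precision = cm_demo[0, 0] / (cm_demo[0, 0] + cm_demo[1, 0])

# sklearn equivalents: specificity is the recall of the negative class
sk_sensitivity = recall_score(actual_demo, predicted_demo, pos_label="Joined")
sk_specificity = recall_score(actual_demo, predicted_demo, pos_label="Not Joined")
sk_precision = precision_score(actual_demo, predicted_demo, pos_label="Joined")

print(sensitivity, sk_sensitivity)
print(specificity, sk_specificity)
print(precision, sk_precision)
```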

In [None]:
cm = metrics.confusion_matrix(predict_test_df.status, predict_test_df.predicted)

# measure_performance already returns a data frame
svm_metrics_df = measure_performance(cm)
svm_metrics_df

print( 'Total Accuracy sklearn: ',np.round( metrics.accuracy_score( class_test_df.status, predict_test_df.predicted ), 2 ))




#### End of Document

***
***
