# Campus Recruitment with SVM and grid search
Author: Tino Merl

__Table of Contents:__
* [Motivation & Goal](#motiv_goal)
* [CRISP-DM](#crisp_dm)
* [Business Understanding](#buss_und)
* [Data Understanding](#dat_und)
* [Data Preparation](#dat_prep)
* [Modeling](#model)
* [Evaluation](#eval)

## Motivation & Goal<a class="anchor" id="motiv_goal"></a>
First and foremost, this is a homework assignment to deepen the understanding and usage of Python's scikit-learn library, specifically its support vector machines (SVMs). The assignment uses the Campus Recruitment dataset from Kaggle.[[1]](#kaggle_dataset) The goal is to use SVMs to answer the following questions posed for the dataset.

1. Which factor influenced a candidate in getting placed?
2. Does percentage matters for one to get placed?
3. Which degree specialization is much demanded by corporate?

## CRISP-DM<a class="anchor" id="crisp_dm"></a>
CRISP-DM stands for Cross Industry Standard Process for Data Mining. It is a widely used process model that standardizes how data mining projects are carried out, and it consists of six steps.

1. Business Understanding
2. Data Understanding
3. Data Preparation
4. Modeling
5. Evaluation
6. Deployment

Although the list presents them as sequential, there is often some back and forth between the steps. The following illustration shows this more clearly.

<figure>
    <img src="img/crisp-dm_diagramm.png"/>
    <figcaption>CRISP-DM diagram by statistik-dresden.de[<a href="#crisp-dm_diagramm">2</a>]</figcaption>
</figure>

For this homework we leave out the last step, since the model will not be deployed to production. It remains experimental, and the process ends after the fifth step.

## Business Understanding<a class="anchor" id="buss_und"></a>

Since there is no real business use case, we will instead translate the assignment and the questions listed in the first section. I will list them here again and elaborate on each one.

### Which factor influenced a candidate in getting placed?
To find this out we will use a support vector machine and a grid search. In this step we want to find out which of our features had the most impact on the model.
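As a rough illustration of this idea, the following sketch tunes an SVM with a grid search and then ranks the features by the absolute size of their weights. It is not the final model of this notebook: the data is synthetic (generated with `make_classification`), the parameter grid is a hypothetical choice, and the linear kernel is assumed because only it exposes per-feature weights through `coef_`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the prepared features and the placement target
X, y = make_classification(
    n_samples=200, n_features=5, n_informative=3, n_redundant=0, random_state=28
)

# Scale the features first so that the learned weights are comparable
pipeline = make_pipeline(StandardScaler(), SVC(kernel="linear"))

# Hypothetical grid: only the regularization strength C is tuned here
param_grid = {"svc__C": [0.1, 1, 10]}
search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X, y)

# coef_ exists only for the linear kernel; a larger absolute weight means
# the feature shifts the decision boundary more
weights = search.best_estimator_.named_steps["svc"].coef_.ravel()
ranking = np.argsort(-np.abs(weights))

print("best parameters:", search.best_params_)
print("feature indices ranked by absolute weight:", ranking)
```

With a non-linear kernel such as RBF there is no `coef_`, so a different measure of feature impact would be needed.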

### Does percentage matters for one to get placed?
This question will hopefully be answered by the same procedure used for the previous question. Looking at the features and their impact should make clear whether or not the percentages mattered.

### Which degree specialization is much demanded by corporate?
To find that out we will use one-hot encoding, which will hopefully tell us something about this question.
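One-hot encoding, as it is presumably meant here, turns a categorical column into one indicator column per category. The sketch below shows this with `pandas.get_dummies` on a hypothetical four-row frame; the category values are taken from the dataset's `specialisation` column, but the rows themselves are made up for illustration.

```python
import pandas as pd

# Hypothetical mini-sample, not actual rows from the dataset
sample = pd.DataFrame(
    {
        "specialisation": ["Mkt&HR", "Mkt&Fin", "Mkt&HR", "Mkt&Fin"],
        "status": ["Placed", "Placed", "Not Placed", "Placed"],
    }
)

# Replace `specialisation` with one 0/1 indicator column per category
encoded = pd.get_dummies(sample, columns=["specialisation"], dtype=int)
print(encoded)
```

Each specialization then becomes its own feature, so its influence on the model can be inspected separately.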

## Data Understanding<a class="anchor" id="dat_und"></a>
For the data understanding step we will first look at the official Kaggle documentation and its description of the columns.
* `sl_no` {integer} -- Serial number.
* `gender` {string} -- Gender: Male='M', Female='F'.
* `ssc_p` {float} -- Secondary education percentage, 10th grade.
* `ssc_b` {string} -- Board of education.
* `hsc_p` {float} -- Higher secondary education percentage, 12th grade.
* `hsc_b` {string} -- Board of education: Central/Others.
* `hsc_s` {string} -- Specialization in higher secondary education.
* `degree_p` {float} -- Degree percentage.
* `degree_t` {string} -- Undergraduate degree type: field of degree education.
* `workex` {bool} -- Work experience.
* `etest_p` {float} -- Employability test percentage (conducted by college).
* `specialisation` {string} -- Postgraduate (MBA) specialization.
* `mba_p` {float} -- MBA percentage.
* `status` {string} -- Status of placement: Placed/Not Placed.
* `salary` {integer} -- Salary offered by corporate to candidates.

Now that we have had a first look at the column names and their expected data types, it is time to load the data and check for anomalies.

In [8]:
import os

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn import svm 

In [9]:
np.random.seed(28)
name = "Placement_Data_Full_Class.csv"
df = pd.read_csv(os.path.join("data", name))
df.sample(10)

| | sl_no | gender | ssc_p | ssc_b | hsc_p | hsc_b | hsc_s | degree_p | degree_t | workex | etest_p | specialisation | mba_p | status | salary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 137 | 138 | M | 67.0 | Others | 63.0 | Central | Commerce | 72.0 | Comm&Mgmt | No | 56.0 | Mkt&HR | 60.41 | Placed | 225000.0 |
| 183 | 184 | M | 65.0 | Central | 77.0 | Central | Commerce | 69.0 | Comm&Mgmt | No | 60.0 | Mkt&HR | 61.82 | Placed | 276000.0 |
| 153 | 154 | M | 49.0 | Others | 59.0 | Others | Science | 65.0 | Sci&Tech | Yes | 86.0 | Mkt&Fin | 62.48 | Placed | 340000.0 |
| 164 | 165 | F | 67.16 | Central | 72.5 | Central | Commerce | 63.35 | Comm&Mgmt | No | 53.04 | Mkt&Fin | 65.52 | Placed | 250000.0 |
| 128 | 129 | M | 80.4 | Central | 73.4 | Central | Science | 77.72 | Sci&Tech | Yes | 81.2 | Mkt&HR | 76.26 | Placed | 400000.0 |
| 53 | 54 | M | 80.0 | Others | 70.0 | Others | Science | 72.0 | Sci&Tech | No | 87.0 | Mkt&HR | 71.04 | Placed | 450000.0 |
| 76 | 77 | F | 66.5 | Others | 70.4 | Central | Arts | 71.93 | Comm&Mgmt | No | 61.0 | Mkt&Fin | 64.27 | Placed | 230000.0 |
| 24 | 25 | M | 76.5 | Others | 97.7 | Others | Science | 78.86 | Sci&Tech | No | 97.4 | Mkt&Fin | 74.01 | Placed | 360000.0 |
| 51 | 52 | M | 54.4 | Central | 61.12 | Central | Commerce | 56.2 | Comm&Mgmt | No | 67.0 | Mkt&HR | 62.65 | Not Placed | NaN |
| 65 | 66 | M | 54.0 | Others | 47.0 | Others | Science | 57.0 | Comm&Mgmt | No | 89.69 | Mkt&HR | 57.1 | Not Placed | NaN |


Next we check whether the data types were inferred correctly, so that any issues can be noted for the data preparation step.

In [7]:
for col in df.columns:
    print(f"{col} was casted as {df[col].dtypes}")

sl_no was casted as int64
gender was casted as object
ssc_p was casted as float64
ssc_b was casted as object
hsc_p was casted as float64
hsc_b was casted as object
hsc_s was casted as object
degree_p was casted as float64
degree_t was casted as object
workex was casted as object
etest_p was casted as float64
specialisation was casted as object
mba_p was casted as float64
status was casted as object
salary was casted as float64


The first anomalies are already visible. `workex` was not read as bool, and `salary` has the data type float instead of integer. The first is because work experience is coded with the strings `Yes` and `No` rather than `True` and `False`, so pandas stores the column as object. `salary` was not read as integer because of the NaNs: NumPy's integer dtypes cannot represent NaN, so pandas falls back to float for this column. This can be fixed by replacing the NaNs later on.
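A quick way to confirm both observations is to count the missing values per column and to list the distinct values of `workex`. The sketch below does this on a small stand-in frame that copies three rows (selected columns only) from the sample shown earlier; on the full data, `df` would take the place of the hypothetical `sample`.

```python
import numpy as np
import pandas as pd

# Stand-in for the loaded frame: rows 137, 153 and 51, selected columns only
sample = pd.DataFrame(
    {
        "workex": ["No", "Yes", "No"],
        "status": ["Placed", "Placed", "Not Placed"],
        "salary": [225000.0, 340000.0, np.nan],
    }
)

# Number of missing values per column
missing = sample.isna().sum()
print(missing)

# Distinct values of the column that was expected to be boolean
print(sample["workex"].unique())
```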

## Data Preparation<a class="anchor" id="dat_prep"></a>

__Required steps:__

1. Replace the NaNs in `salary`
2. Cast `workex` to bool
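The two steps above could be sketched as follows. This is only one possible approach: filling missing salaries with `0` is an assumption made for illustration, and the frame is a hypothetical stand-in for `df` with just the two affected columns.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the two affected columns of the loaded frame
sample = pd.DataFrame(
    {
        "workex": ["No", "Yes", "No"],
        "salary": [225000.0, 340000.0, np.nan],
    }
)

# Step 1: replace missing salaries (0 is an assumed placeholder),
# after which the column can be stored as integer
sample["salary"] = sample["salary"].fillna(0).astype("int64")

# Step 2: map the Yes/No strings to booleans
sample["workex"] = sample["workex"].map({"Yes": True, "No": False}).astype(bool)

print(sample.dtypes)
```

Note that `map` yields NaN for any value other than `Yes` or `No`, which `astype(bool)` would then silently turn into `True`, so it is worth checking the distinct values beforehand.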

## Footnotes
[1]<a class="anchor" id="kaggle_dataset"></a> Ben Roshan D (2020). Campus Recruitment, Academic and Employability Factors influencing placement, Version 1. Retrieved 2020-05-10 from https://www.kaggle.com/benroshan/factors-affecting-campus-placement.

[2]<a class="anchor" id="crisp-dm_diagramm"></a> Wolf Riepel (2012). CRISP-DM: Ein Standard-Prozess-Modell für Data Mining. Retrieved 2020-05-10 from https://statistik-dresden.de/archives/1128