## Placement Result Prediction

#### Overview 

In this project, we'll look at placement records and build a classifier to predict whether a given profile is likely to get placed.

#### Data used

Placement_Data_Full_Class.csv - Download from Kaggle Datasets (https://www.kaggle.com/sevdanurgenc/placement-data-full-class)

In [1]:
# Let's import the required python packages

# Importing pandas for dataframe usage
import pandas as pd

# Importing necessary scikit-learn modules for transforming the data
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split

# Importing matplotlib for data visualization
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
# Reading the 'Placement_Data_Full_Class.csv' file into a pandas dataframe
placement_record_df = pd.read_csv('./Data/Placement_Data_Full_Class.csv')

In [3]:
# Printing the dataframe to understand the structure of the data
placement_record_df.head() # head() returns the top 5 rows by default

Unnamed: 0,sl_no,gender,ssc_p,ssc_b,hsc_p,hsc_b,hsc_s,degree_p,degree_t,workex,etest_p,specialisation,mba_p,status,salary
0,1,M,67.0,Others,91.0,Others,Commerce,58.0,Sci&Tech,No,55.0,Mkt&HR,58.8,Placed,270000.0
1,2,M,79.33,Central,78.33,Others,Science,77.48,Sci&Tech,Yes,86.5,Mkt&Fin,66.28,Placed,200000.0
2,3,M,65.0,Central,68.0,Central,Arts,64.0,Comm&Mgmt,No,75.0,Mkt&Fin,57.8,Placed,250000.0
3,4,M,56.0,Central,52.0,Central,Science,52.0,Sci&Tech,No,66.0,Mkt&HR,59.43,Not Placed,
4,5,M,85.8,Central,73.6,Central,Commerce,73.3,Comm&Mgmt,No,96.8,Mkt&Fin,55.5,Placed,425000.0


In [4]:
# Printing the shape of the dataframe
placement_record_df.shape 

(215, 15)

In [5]:
# Getting the information about the columns in dataframe
placement_record_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 215 entries, 0 to 214
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   sl_no           215 non-null    int64  
 1   gender          215 non-null    object 
 2   ssc_p           215 non-null    float64
 3   ssc_b           215 non-null    object 
 4   hsc_p           215 non-null    float64
 5   hsc_b           215 non-null    object 
 6   hsc_s           215 non-null    object 
 7   degree_p        215 non-null    float64
 8   degree_t        215 non-null    object 
 9   workex          215 non-null    object 
 10  etest_p         215 non-null    float64
 11  specialisation  215 non-null    object 
 12  mba_p           215 non-null    float64
 13  status          215 non-null    object 
 14  salary          148 non-null    float64
dtypes: float64(6), int64(1), object(8)
memory usage: 25.3+ KB


In [6]:
# Checking to see if the data has any missing values
placement_record_df.isna().sum()

sl_no              0
gender             0
ssc_p              0
ssc_b              0
hsc_p              0
hsc_b              0
hsc_s              0
degree_p           0
degree_t           0
workex             0
etest_p            0
specialisation     0
mba_p              0
status             0
salary            67
dtype: int64

In [7]:
# Getting the datatype of each dataframe column
placement_record_df.dtypes

sl_no               int64
gender             object
ssc_p             float64
ssc_b              object
hsc_p             float64
hsc_b              object
hsc_s              object
degree_p          float64
degree_t           object
workex             object
etest_p           float64
specialisation     object
mba_p             float64
status             object
salary            float64
dtype: object

In [8]:
# The only missing values are in 'salary' (67 rows), and that column is dropped below,
# so there is nothing left to impute

# Dropping the unwanted columns from the dataframe
placement_record_df.drop(['sl_no', 'salary'], axis=1,  inplace=True)
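
In the five rows printed earlier, `salary` is empty only for the "Not Placed" profile, which suggests the column would leak the target if kept as a feature. A quick sketch of how one might check this before dropping the column, using just those five rows; whether the pattern holds across all 215 rows is an assumption to verify on the full dataframe, and the name `toy_records` is illustrative.

```python
import pandas as pd

# The status and salary values from the five rows shown by head() above
toy_records = pd.DataFrame({
    "status": ["Placed", "Placed", "Placed", "Not Placed", "Placed"],
    "salary": [270000.0, 200000.0, 250000.0, None, 425000.0],
})

# Fraction of missing salaries within each status group
missing_by_status = toy_records["salary"].isna().groupby(toy_records["status"]).mean()
print(missing_by_status)
```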

In [9]:
placement_record_df.head()

Unnamed: 0,gender,ssc_p,ssc_b,hsc_p,hsc_b,hsc_s,degree_p,degree_t,workex,etest_p,specialisation,mba_p,status
0,M,67.0,Others,91.0,Others,Commerce,58.0,Sci&Tech,No,55.0,Mkt&HR,58.8,Placed
1,M,79.33,Central,78.33,Others,Science,77.48,Sci&Tech,Yes,86.5,Mkt&Fin,66.28,Placed
2,M,65.0,Central,68.0,Central,Arts,64.0,Comm&Mgmt,No,75.0,Mkt&Fin,57.8,Placed
3,M,56.0,Central,52.0,Central,Science,52.0,Sci&Tech,No,66.0,Mkt&HR,59.43,Not Placed
4,M,85.8,Central,73.6,Central,Commerce,73.3,Comm&Mgmt,No,96.8,Mkt&Fin,55.5,Placed


In [10]:
# Step for setting up the data preprocessor for model training

# Making a list of categorical columns
# (note: 'ssc_b' is also categorical but is not included in this list)
categorical_columns = ["gender", "hsc_s", "hsc_b", "degree_t", "workex", "specialisation"]

# Building transformers pipeline for transforming columns to a format used for training
categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown="ignore"))
])

# Building a preprocessing pipeline for applying the transformations
preprocesser = ColumnTransformer(transformers=[
    ('categorical_transformer', categorical_transformer, categorical_columns)
])
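
Note that `ColumnTransformer` drops every column not assigned to a transformer (its default is `remainder="drop"`), so the preprocessor above passes only the one-hot encoded categorical columns to the model. A minimal sketch, assuming one also wanted to feed the numeric percentage columns, could add a second transformer that scales them. The toy dataframe and the name `preprocessor_with_numeric` are illustrative and not part of the original notebook.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy rows shaped like the placement data (values are illustrative only)
toy_df = pd.DataFrame({
    "gender": ["M", "F", "M", "F"],
    "workex": ["No", "Yes", "No", "Yes"],
    "ssc_p": [67.0, 79.33, 65.0, 56.0],
    "mba_p": [58.8, 66.28, 57.8, 59.43],
})

toy_categorical = ["gender", "workex"]
toy_numeric = ["ssc_p", "mba_p"]

# One-hot encode the categorical columns and standardize the numeric ones
preprocessor_with_numeric = ColumnTransformer(transformers=[
    ("categorical", OneHotEncoder(handle_unknown="ignore"), toy_categorical),
    ("numeric", StandardScaler(), toy_numeric),
])

transformed = preprocessor_with_numeric.fit_transform(toy_df)
print(transformed.shape)  # rows x (one-hot columns + scaled numeric columns)
```

Scaling matters most for distance-based or margin-based estimators such as `KNeighborsClassifier` and `LinearSVC`; tree ensembles are largely insensitive to it.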


### Using Sklearn to train a machine learning model to predict whether a person with a profile will get placed or not

To select a suitable model (estimator) for the task, we can use scikit-learn's estimator-selection map (`ml_map`).

!["sklearn_ml_map"](https://scikit-learn.org/stable/_static/ml_map.png "Sklearn's ML Map")

Since our problem is a classification task, we can follow the classification path of the map.

In [11]:
# Importing the metrics that can be used to evaluate the Classification models
from sklearn.metrics import classification_report

#### 1. `START` -> `>50 samples` -> `predicting a category` -> `labeled data` -> `<100K` -> `Linear SVC`

In [12]:
# Importing the LinearSVC (Support Vector Classification) module from sklearn
from sklearn.svm import LinearSVC

In [13]:
# Initializing the LinearSVC model
model_LinearSVC = Pipeline(steps=[
    ("preprocessor", preprocesser),
    ("model_LinearSVC", LinearSVC())
])

In [14]:
# Splitting the data into dependent and independent variables
X = placement_record_df.drop("status", axis=1)
y = placement_record_df['status']

In [15]:
# Splitting X and y into training and testing data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15)
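
Because `train_test_split` shuffles randomly, each run of the cell above yields a different test set, so the scores that follow will vary between runs. A minimal sketch, assuming a fixed seed and a stratified split are wanted; the seed value `42` and the toy data are arbitrary choices for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Imbalanced toy labels standing in for the 'status' column
toy_X = pd.DataFrame({"feature": range(100)})
toy_y = pd.Series(["Placed"] * 70 + ["Not Placed"] * 30)

split_a = train_test_split(toy_X, toy_y, test_size=0.15, random_state=42, stratify=toy_y)
split_b = train_test_split(toy_X, toy_y, test_size=0.15, random_state=42, stratify=toy_y)

# Same seed -> identical test rows on every run
same_rows = split_a[1].index.equals(split_b[1].index)

# Stratification keeps the class ratio in the test set close to the overall ratio
test_counts = split_a[3].value_counts()
print(same_rows, test_counts.to_dict())
```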

In [16]:
# Training/fitting the training data to the LinearSVC model
model_LinearSVC.fit(X_train, y_train);

In [17]:
# Evaluating the trained model on the test data
model_LinearSVC.score(X_test, y_test)

0.6666666666666666

In [18]:
# Predicting the results on test data
y_preds = model_LinearSVC.predict(X_test)

In [19]:
# Looking at the classification report for evaluating the model on more metrics
print(classification_report(y_test, y_preds))

              precision    recall  f1-score   support

  Not Placed       0.60      0.25      0.35        12
      Placed       0.68      0.90      0.78        21

    accuracy                           0.67        33
   macro avg       0.64      0.58      0.56        33
weighted avg       0.65      0.67      0.62        33
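
The report above can also be read as raw counts. A small sketch, assuming the confusion-matrix counts below; they are inferred from the rounded precision, recall, and support values in the report, not taken from the notebook's actual predictions.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Labels rebuilt from inferred counts: 12 "Not Placed" rows (3 caught, 9 missed)
# and 21 "Placed" rows (19 caught, 2 missed)
y_true_demo = ["Not Placed"] * 12 + ["Placed"] * 21
y_pred_demo = (["Not Placed"] * 3 + ["Placed"] * 9
               + ["Placed"] * 19 + ["Not Placed"] * 2)

labels = ["Not Placed", "Placed"]
cm = confusion_matrix(y_true_demo, y_pred_demo, labels=labels)
print(cm)  # rows = true class, columns = predicted class

# Recall for "Not Placed" = correctly flagged / all truly "Not Placed"
recall_not_placed = cm[0, 0] / cm[0].sum()
# Precision for "Not Placed" = correctly flagged / all predicted "Not Placed"
precision_not_placed = cm[0, 0] / cm[:, 0].sum()
demo_accuracy = accuracy_score(y_true_demo, y_pred_demo)
print(recall_not_placed, precision_not_placed, demo_accuracy)
```

With these counts the model misses 9 of the 12 "Not Placed" profiles, which the overall accuracy of 0.67 does not reveal on its own.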



#### 2. `START` -> `>50 samples` -> `predicting a category` -> `labeled data` -> `<100K` -> `KNeighborsClassifier`

In [20]:
# Importing the KNeighborsClassifier module from sklearn
from sklearn.neighbors import KNeighborsClassifier

In [21]:
# Initializing the KNeighborsClassifier model
model_KNeighbors_classifier = Pipeline(steps=[
    ("preprocessor", preprocesser),
    ("model_KNeighbors", KNeighborsClassifier())
])

In [22]:
# Splitting the data into dependent and independent variables
X = placement_record_df.drop("status", axis=1)
y = placement_record_df['status']

In [23]:
# Splitting X and y into training and testing data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15)

In [24]:
# Training/fitting the training data to the KNeighborsClassifier model
model_KNeighbors_classifier.fit(X_train, y_train);

In [25]:
# Evaluating the trained model on the test data
model_KNeighbors_classifier.score(X_test, y_test)

0.7575757575757576

In [26]:
# Predicting the results on test data
y_preds = model_KNeighbors_classifier.predict(X_test)

In [27]:
# Looking at the classification report for evaluating the model on more metrics
print(classification_report(y_test, y_preds))

              precision    recall  f1-score   support

  Not Placed       0.60      0.33      0.43         9
      Placed       0.79      0.92      0.85        24

    accuracy                           0.76        33
   macro avg       0.69      0.62      0.64        33
weighted avg       0.74      0.76      0.73        33



#### 3. `START` -> `>50 samples` -> `predicting a category` -> `labeled data` -> `<100K` -> `RandomForestClassifier`

In [28]:
# Importing the RandomForestClassifier module from sklearn
from sklearn.ensemble import RandomForestClassifier

In [29]:
# Initializing the RandomForestClassifier model
model_random_forest_classifier = Pipeline(steps=[
    ("preprocessor", preprocesser),
    ("model_RandomForest", RandomForestClassifier())
])

In [30]:
# Splitting the data into dependent and independent variables
X = placement_record_df.drop("status", axis=1)
y = placement_record_df['status']

In [31]:
# Splitting X and y into training and testing data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15)

In [32]:
# Training/fitting the training data to the RandomForestClassifier model
model_random_forest_classifier.fit(X_train, y_train);

In [33]:
# Evaluating the trained model on the test data
model_random_forest_classifier.score(X_test, y_test)

0.6666666666666666

#### Looking at the Cross Validation Score on the data

In [34]:
# Importing the cross_val_score from sklearn
from sklearn.model_selection import cross_val_score

In [35]:
# Perform a five fold cross validation on the LinearSVC Model
cross_val_score(model_LinearSVC, X, y, cv=5).mean()

0.6139534883720931

In [36]:
# Perform a five fold cross validation on the KNeighborsClassifier Model
cross_val_score(model_KNeighbors_classifier, X, y, cv=5).mean()

0.586046511627907

In [37]:
# Perform a five fold cross validation on the RandomForestClassifier Model
cross_val_score(model_random_forest_classifier, X, y, cv=5).mean()

0.6372093023255815
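
For a classifier, passing `cv=5` makes `cross_val_score` use stratified folds without shuffling. A hedged sketch of comparing the three estimators on identical shuffled folds and reporting the spread alongside the mean; the synthetic data from `make_classification`, the seed, and the names `folds` and `cv_summary` are assumptions standing in for the notebook's `X` and `y`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC

# Synthetic stand-in for the placement features and labels
demo_X, demo_y = make_classification(n_samples=215, n_features=8, random_state=0)

# One splitter with a fixed seed, so every model is scored on the same folds
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

candidates = {
    "LinearSVC": LinearSVC(),
    "KNeighborsClassifier": KNeighborsClassifier(),
    "RandomForestClassifier": RandomForestClassifier(random_state=0),
}

cv_summary = {}
for name, estimator in candidates.items():
    scores = cross_val_score(estimator, demo_X, demo_y, cv=folds)
    cv_summary[name] = (scores.mean(), scores.std())
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

With only about 43 rows per fold, the standard deviation gives a sense of whether a gap of a few points between models is meaningful.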

#### Comparing the cross-validation scores, the RandomForestClassifier gives the best average performance on this data (0.637, versus 0.614 for LinearSVC and 0.586 for KNeighborsClassifier).

In [38]:
# Saving the RandomForestClassifier Model to disk
import pickle

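# Using a context manager so the file handle is closed after writing
with open("random_forest_classifier_model_v1.pkl", "wb") as model_file:
    pickle.dump(model_random_forest_classifier, model_file)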
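
Loading a saved model back follows the same pattern. A minimal sketch, assuming a small stand-in pipeline and a temporary file rather than the notebook's actual model file; the names `demo_model` and `restored_model` are illustrative.

```python
import os
import pickle
import tempfile

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Tiny stand-in dataset (values are illustrative only)
demo_df = pd.DataFrame({
    "workex": ["No", "Yes", "No", "Yes", "No", "Yes"],
    "specialisation": ["Mkt&HR", "Mkt&Fin", "Mkt&Fin", "Mkt&HR", "Mkt&HR", "Mkt&Fin"],
})
demo_status = ["Not Placed", "Placed", "Placed", "Placed", "Not Placed", "Placed"]

demo_model = Pipeline(steps=[
    ("preprocessor", ColumnTransformer(transformers=[
        ("onehot", OneHotEncoder(handle_unknown="ignore"), ["workex", "specialisation"]),
    ])),
    ("classifier", RandomForestClassifier(n_estimators=10, random_state=0)),
])
demo_model.fit(demo_df, demo_status)

# Write the fitted pipeline to disk, then read it back
model_path = os.path.join(tempfile.mkdtemp(), "demo_model.pkl")
with open(model_path, "wb") as model_file:
    pickle.dump(demo_model, model_file)
with open(model_path, "rb") as model_file:
    restored_model = pickle.load(model_file)

# The restored pipeline should reproduce the original predictions
predictions_match = list(restored_model.predict(demo_df)) == list(demo_model.predict(demo_df))
print(predictions_match)
```

Only unpickle files from a trusted source, since `pickle.load` can execute arbitrary code, and load the model with the same scikit-learn version that was used to save it.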