<a id="part0"></a>
# 0.Campus Recruitment

On this dataset we can reach 100% accuracy with just a few lines of code. I am going to apply XGBClassifier to the dataset and tune it with a grid search. Let's dive into the notebook.

![](https://gregsavage.com.au/wp-content/uploads/2017/05/worstrecruiter-600x340.png)

**Table of Contents**

* [0.Campus Recruitment:](#part0)
* [1.Quickpeek to Data:](#part1)
* [2.Encoding:](#part2)
* [3.Scaling:](#part3)
* [4.Feature Selection:](#part4)
* [5.GridSearch and Training:](#part5)
* [6.Evaluation:](#part6)

**DATA DICTIONARY:**

1. sl_no : Serial Number (we don't need this for training)

2. gender : Male='M', Female='F'

3. ssc_p : Secondary Education percentage - 10th Grade

4. ssc_b : Board of Education (10th Grade) - Central/Others

5. hsc_p : Higher Secondary Education percentage - 12th Grade

6. hsc_b : Board of Education (12th Grade) - Central/Others

7. hsc_s : Specialization in Higher Secondary Education

8. degree_p : Degree Percentage

9. degree_t : Under Graduation (Degree type) - Field of degree education

10. workex : Work Experience

11. etest_p : Entrance Test Percentage

12. specialisation : MBA specialization

13. mba_p : MBA Percentage

14. status : Placed or not

15. salary : Salary offered


<a id="part1"></a>
# 1.Quickpeek to Data:

In [None]:
import numpy as np
import pandas as pd
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
from sklearn.preprocessing import LabelEncoder, MinMaxScaler
from sklearn.model_selection import train_test_split, GridSearchCV
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
data = pd.read_csv("/kaggle/input/factors-affecting-campus-placement/Placement_Data_Full_Class.csv")
data.head()

In [None]:
data.shape

In [None]:
data.isnull().sum()

The dataset has 215 rows, but the salary column has 67 missing values. Let's set them to zero, because we need every feature to be numeric.

In [None]:
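# Assign the result back; inplace=True on a selected column may not update the DataFrame in newer pandas
data["salary"] = data["salary"].fillna(0)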
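Filling with zero is only reasonable if the missing salaries belong to students who were not placed. A minimal sketch of that check, using a small made-up frame in place of the real CSV (the `toy` values and the `missing_by_status` name are illustrative assumptions, not rows or names from the notebook):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real data: salary is missing for unplaced students
toy = pd.DataFrame({
    "status": ["Placed", "Not Placed", "Placed", "Not Placed"],
    "salary": [270000.0, np.nan, 250000.0, np.nan],
})

# Count the missing salaries separately for each placement status
missing_by_status = toy["salary"].isnull().groupby(toy["status"]).sum()
print(missing_by_status)
```

Running the same two lines on `data` before the fill would show whether all 67 gaps fall in one status group.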

<a id="part2"></a>
# 2.Encoding:

In [None]:
# Count the distinct values in each column to decide how to encode it
for col in data.columns:
    print(f"column: {col} values: {data[col].nunique()}")

The gender, ssc_b, hsc_b, workex, specialisation and status columns each have 2 distinct values, so I am going to label encode them. The hsc_s and degree_t columns have 3 distinct values each, so I am going to one-hot encode those.

In [None]:
le = LabelEncoder()
for i in ["gender","ssc_b","hsc_b","workex","specialisation","status"]: #label encoding
    data[i] = le.fit_transform(data[i])
data

In [None]:
for i in ["hsc_s", "degree_t"]: # Onehot encoding
    temp = pd.get_dummies(data[i])
    data = pd.concat((data,temp),axis = 1)
    data.drop(columns = [i],inplace = True)
data
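To make the difference between the two encodings concrete, here is a small sketch on a made-up frame. The `toy` values, the `prefix` argument and the `dtype=int` argument are my own choices for illustration; the notebook itself calls `pd.get_dummies` without them.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical three-row sample, not taken from the real dataset
toy = pd.DataFrame({
    "gender": ["M", "F", "M"],
    "hsc_s": ["Commerce", "Science", "Arts"],
})

# LabelEncoder assigns integer codes in sorted label order: F -> 0, M -> 1
le = LabelEncoder()
toy["gender"] = le.fit_transform(toy["gender"])

# One-hot encoding creates one indicator column per category
dummies = pd.get_dummies(toy["hsc_s"], prefix="hsc_s", dtype=int)
toy = pd.concat([toy.drop(columns=["hsc_s"]), dummies], axis=1)
print(toy)
```

A prefix keeps the indicator columns traceable to their source column, which helps when two columns share a category name.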

<a id="part3"></a>
# 3.Scaling:

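We scale the numeric columns to the [0, 1] range. Tree-based models such as XGBoost are largely insensitive to feature scaling, so this step is optional here, but it keeps all features on a common scale.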

In [None]:
listofunscaledcolumns = ["ssc_p","hsc_p","degree_p","etest_p","mba_p","salary"]
mms = MinMaxScaler()
# MinMaxScaler scales each column independently, so all columns can be transformed in one call
data[listofunscaledcolumns] = mms.fit_transform(data[listofunscaledcolumns])
data
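For reference, min-max scaling maps each value to `(x - min) / (max - min)`. The sketch below checks that formula against `MinMaxScaler` on three made-up percentages (the `values` array is an illustrative assumption):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up percentages as a single column
values = np.array([[40.0], [60.0], [90.0]])

# Manual min-max scaling: (x - min) / (max - min)
manual = (values - values.min()) / (values.max() - values.min())

# The same transformation through scikit-learn
scaled = MinMaxScaler().fit_transform(values)
print(scaled.ravel())
```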

<a id="part4"></a>
# 4.Feature Selection:

Some columns can hurt training. In this section we check how each column correlates with status and drop the ones that contribute little.

In [None]:
data.corr()["status"]

In [None]:
plt.figure(figsize=(10,8))
statusCorr = data.corr()["status"].abs() # absolute correlation of every column with status
sns.barplot(y=statusCorr.index,x=statusCorr,palette='CMRmap')

It seems that the ssc_p, hsc_p, degree_p, workex, specialisation and salary columns are the most correlated with the status column. Let's pick these columns for training.

In [None]:
listofpickedcolumns = ["ssc_p","hsc_p","degree_p","workex","specialisation","salary","status"]
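The list above was picked by eye from the bar plot. As a rough alternative, the same selection can be sketched with a correlation cut-off. The `toy` frame and the `threshold` value are hypothetical; the notebook does not state a numeric cut-off.

```python
import pandas as pd

# Made-up data: ssc_p tracks status closely, etest_p does not
toy = pd.DataFrame({
    "ssc_p":   [0.9, 0.8, 0.3, 0.2, 0.7, 0.1],
    "etest_p": [0.5, 0.4, 0.5, 0.4, 0.6, 0.6],
    "status":  [1, 1, 0, 0, 1, 0],
})

threshold = 0.3  # hypothetical cut-off, chosen only for this illustration

# Absolute correlation of each feature with the target
corr = toy.corr()["status"].drop("status").abs()

# Keep the features above the cut-off
picked = corr[corr > threshold].index.tolist()
print(picked)
```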

<a id="part5"></a>
# 5.GridSearch and Training:

To find good hyperparameters for XGBClassifier I am going to apply a grid search.

In [None]:
data = data[listofpickedcolumns]
X = data.drop(columns = ["status"])
y = data.status

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X,y,random_state = 44, test_size = 0.2)

In [None]:
from xgboost import XGBClassifier
XGBC = XGBClassifier()
modelParams = {"max_depth": [2,5,15,30],
              "subsample": [0.5,0.75,1],
              "colsample_bytree":[0.5,0.75,1],
              "colsample_bylevel":[0.5,0.75,1],
              "min_child_weight": [1,5,25,100],
              "n_estimators": [10,50,100,250,500],
              "learning_rate":[0.01,0.1,0.25]} 
XGBGridSearch = GridSearchCV(XGBC, modelParams,verbose = 2,n_jobs = -1,cv = 10) #n_jobs = -1 means use all cores for training
XGBGridSearch.fit(X_train,y_train)
XGBGridSearch.best_params_
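Before launching a search this size, it helps to know how many models it will train. The sketch below counts the combinations with scikit-learn's `ParameterGrid`, repeating the grid from the cell above; the names `n_candidates` and `n_fits` are mine.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the search above
modelParams = {"max_depth": [2, 5, 15, 30],
               "subsample": [0.5, 0.75, 1],
               "colsample_bytree": [0.5, 0.75, 1],
               "colsample_bylevel": [0.5, 0.75, 1],
               "min_child_weight": [1, 5, 25, 100],
               "n_estimators": [10, 50, 100, 250, 500],
               "learning_rate": [0.01, 0.1, 0.25]}

n_candidates = len(ParameterGrid(modelParams))  # every combination of the values above
n_fits = n_candidates * 10                      # one fit per candidate per fold (cv = 10)
print(n_candidates, n_fits)  # 6480 64800
```

With 64,800 fits, trimming the grid or lowering `cv` is a reasonable trade-off if the run time becomes a problem.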

In [None]:
# Train a classifier with fixed hyperparameters (every value here appears in the grid above)
XGBR2 = XGBClassifier(colsample_bylevel=0.5, colsample_bytree=0.5,
                      learning_rate=0.25, max_depth=2, min_child_weight=1,
                      n_estimators=100, subsample=0.75)
XGBR2.fit(X_train,y_train)
y_pred = XGBR2.predict(X_test)
y_predtrain = XGBR2.predict(X_train)
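Instead of copying hyperparameters by hand, a fitted `GridSearchCV` also exposes the refitted winner as `best_estimator_` (with the default `refit=True`). The sketch below shows the pattern on synthetic data, with a `DecisionTreeClassifier` swapped in for XGBClassifier so that it runs without xgboost. The data, the tiny grid and all variable names are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the placement data
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=44)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.2, random_state=44)

# Deliberately tiny grid; a decision tree replaces XGBClassifier here
search = GridSearchCV(DecisionTreeClassifier(random_state=44),
                      {"max_depth": [2, 5]}, cv=5)
search.fit(Xtr, ytr)

# best_estimator_ is already refitted on the whole training set with best_params_
best_model = search.best_estimator_
y_pred_demo = best_model.predict(Xte)
print(search.best_params_)
```

In this notebook the equivalent would be `XGBGridSearch.best_estimator_`, which avoids typing the parameters a second time.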

<a id="part6"></a>
# 6.Evaluation:

Let's see how the model performed.

In [None]:
from sklearn.metrics import confusion_matrix, accuracy_score
print(f"train set confusion matrix:{confusion_matrix(y_train, y_predtrain)} train set accuracy:{accuracy_score(y_train,y_predtrain)}")
print(f"test set confusion matrix:{confusion_matrix(y_test, y_pred)} test set accuracy:{accuracy_score(y_test,y_pred)}")
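As a reading aid for the output above: in scikit-learn's `confusion_matrix`, rows are the true classes and columns are the predicted classes. A tiny sketch with made-up labels (not the notebook's predictions):

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels: one unplaced student (0) is wrongly predicted as placed (1)
y_true_demo = [0, 0, 1, 1, 1]
y_pred_demo = [0, 1, 1, 1, 1]

cm = confusion_matrix(y_true_demo, y_pred_demo)
acc = accuracy_score(y_true_demo, y_pred_demo)
print(cm)   # row 0: true class 0, row 1: true class 1
print(acc)
```

A perfect prediction puts every count on the diagonal, which is what a 100% accuracy corresponds to.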

Now you know exactly what recruiters want. We predicted both sets (training and test) correctly. If you have any questions about the notebook, I am here to answer them. Thanks for your time.
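A perfect score on both sets is worth a sanity check. Salary is only known once a student has been placed, so if the 67 missing salaries that were filled with 0 all belong to unplaced students, the `salary` feature gives away `status` on its own. The sketch below illustrates the effect on a made-up frame (the `toy` values and the `rule_pred` rule are illustrative assumptions, not results from the real data):

```python
import numpy as np
import pandas as pd

# Hypothetical sample: 1 = placed, 0 = not placed; salary is missing when not placed
toy = pd.DataFrame({
    "status": [1, 0, 1, 0, 1],
    "salary": [270000.0, np.nan, 250000.0, np.nan, 300000.0],
})
toy["salary"] = toy["salary"].fillna(0)

# A one-line rule that never looks at grades or work experience
rule_pred = (toy["salary"] > 0).astype(int)
rule_accuracy = (rule_pred == toy["status"]).mean()
print(rule_accuracy)
```

If the same rule also scores 1.0 on the real data, dropping `salary` from `listofpickedcolumns` would give a more honest estimate of how well the remaining features predict placement.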

![](https://blog-c7ff.kxcdn.com/blog/wp-content/uploads/2013/09/top_5_recruit-01.jpg)