# $$\textrm{Campus Recruitment}$$

_Academic and Employability Factors influencing placement_

$$\textrm{If you like the work please upvote :-), Comments are Welcome}$$

# Table of contents

* [Import Libraries](#T1)
* [Import Dataset](#T2)
* [Data Exploration](#T3)
* [Data Visualization](#T4)
    * [Correlation between features](#T41)
    * [Lineplot](#T42)
    * [Barplot: No. of students from different boards](#T43)
    * [Catplot: Higher secondary % gender wise](#T44)
    * [Boxplot: Finding salary outliers](#T45)
    * [Pie chart: Most preferred stream](#T46)
* [Data Distribution](#T5)
* [Encoding categorical data](#T6)
* [Classification of Placement Status](#T7)
    * [Preprocessing Data](#T71)
    * [Logistic Regression Model](#T72)
    * [Random Forest Classifier](#T73)
    * [Accuracy/Confusion Matrix](#T74)
* [Build a regression model(Salary predictor)](#T8)
    * [Preprocessing Data](#T81)
    * [OLS model summary](#T82)
    * [Estimation by Multiple regressor](#T83)
    * [Estimation by Random Forest regressor](#T84)
    * [Regressor coefficient and intercept](#T85)

<a id='T1'></a>
# Import libraries

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
import os
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor


%matplotlib inline
import warnings
warnings.filterwarnings("ignore")

for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

<a id='T2'></a>
# Import dataset

In [None]:
dataframe = pd.read_csv('../input/factors-affecting-campus-placement/Placement_Data_Full_Class.csv')
dataframe.head()

<a id='T3'></a>
# Data Exploration

In [None]:
# Note: newer releases of this package are published as `ydata_profiling`
from pandas_profiling import ProfileReport
profile = ProfileReport(dataframe)

In [None]:
profile.to_widgets()

***Average percentage required to get placed***

In [None]:
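# Drop the serial number and the two board columns
df = dataframe.drop(columns=["sl_no", "ssc_b", "hsc_b"])
# Mean of each numeric column per placement status
# (numeric_only avoids errors on string columns in recent pandas versions)
df_new = df.groupby(by="status").mean(numeric_only=True)
df_new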

<a id='T4'></a>
# Data visualisation

<a id='T41'></a>
**Correlation between features**

In [None]:
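# Correlate numeric columns only (string columns raise an error in recent pandas; needs pandas >= 1.5)
matrix = dataframe.corr(numeric_only=True)
plt.figure(figsize=(8,6))
# Plot the heat map
g = sns.heatmap(matrix, annot=True, cmap="YlGn_r")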

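The higher secondary percentage (`hsc_p`) shows a higher correlation with salary than the other features. Note that salary is recorded only for placed students, so this reflects pay among placed students rather than the likelihood of being placed.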
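
Because `salary` is only available for placed students, a correlation with salary cannot say much about placement itself. A minimal sketch, on made-up numbers rather than the real dataset, of how one might instead correlate each percentage with a 0/1 placement flag. The `placed` column, the `"Placed"`/`"Not Placed"` labels, and all values below are assumptions for illustration:

```python
import pandas as pd

# Toy data (hypothetical values, not taken from the real dataset)
toy = pd.DataFrame({
    "ssc_p":  [55.0, 62.0, 70.0, 48.0, 81.0, 66.0],
    "hsc_p":  [58.0, 65.0, 72.0, 50.0, 78.0, 60.0],
    "status": ["Not Placed", "Placed", "Placed", "Not Placed", "Placed", "Placed"],
})

# Hypothetical helper column: 1 if placed, 0 otherwise
toy["placed"] = (toy["status"] == "Placed").astype(int)

# Pearson correlation of each percentage column with the 0/1 flag
placement_corr = toy[["ssc_p", "hsc_p"]].corrwith(toy["placed"])
print(placement_corr)
```

On the real data the same idea would apply to `dataframe`, with the other percentage columns added to the list.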

<a id='T42'></a>
**Which marks matter more for the salary offered?**

In [None]:
plt.figure(figsize=(12,8))
plt.ylim([200000,450000])
# One regression line per percentage column; label each so the legend matches
sns.regplot(x="ssc_p", y="salary", data=dataframe, label="ssc percentage")
sns.regplot(x="hsc_p", y="salary", data=dataframe, label="hsc percentage")
sns.regplot(x="mba_p", y="salary", data=dataframe, label="MBA")
sns.regplot(x="etest_p", y="salary", data=dataframe, label="E-test")
plt.legend()
# The y-axis shows salary, not a percentage
plt.ylabel("Salary")
plt.xlabel("Percentage %")
plt.show()

<a id='T43'></a>
**Number of students from central and other boards, and in each higher secondary stream**

In [None]:
sns.catplot(x="ssc_b", hue="gender", data=dataframe, kind="count");
plt.ylabel("No. of students");
plt.xlabel("secondary board");
sns.catplot(x="hsc_b", hue="gender", data=dataframe, kind="count");
plt.ylabel("No. of students");
plt.xlabel("higher secondary board");
sns.catplot(x="hsc_s", hue="gender", data=dataframe, kind="count");
plt.ylabel("No. of students");
plt.xlabel("preferred subjects");

<a id='T44'></a>
**Secondary percentage by board and gender**

In [None]:
sns.catplot(x="ssc_b",y="ssc_p",hue="gender",data=dataframe,kind="boxen");
plt.ylabel("percentage");
plt.xlabel("boards");

In both boards, girls tend to score a higher percentage than boys.

**Work experience for students in different degrees**

In [None]:
sns.catplot(x="workex", hue="degree_t", data=dataframe, kind="count");
plt.ylabel("No. of students");
plt.xlabel("work experience, split by degree type");
sns.catplot(x="degree_t", hue="workex", data=dataframe, kind="count");
plt.ylabel("No. of students");
plt.xlabel("degree type, split by work experience");

<a id='T45'></a>
**Finding the salary outliers**

In [None]:
sns.catplot(y="salary",x="gender",data=dataframe, kind="box", hue="specialisation" );

<a id='T46'></a>
***Which stream do students prefer the most?***

In [None]:
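# Share of students per degree type, plotted directly from the Series
# (avoids relying on the column name, which differs between pandas versions)
df1 = dataframe['degree_t'].value_counts(normalize=True)
plot = df1.plot.pie(autopct='%1.1f%%', figsize=(5, 5))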

***Percentage of students in each specialisation***

In [None]:
# Share of students per specialisation
df2 = dataframe['specialisation'].value_counts(normalize=True)
plot = df2.plot.pie(autopct='%1.1f%%', figsize=(5, 5))

***Students that got placement***

In [None]:
# Share of placed vs. not placed students
df3 = dataframe['status'].value_counts(normalize=True)
plot = df3.plot.pie(autopct='%1.1f%%', figsize=(5, 5))

<a id='T5'></a>
# Data Distribution

In [None]:
dataframe.hist(bins = 30, figsize=(10,10), color= 'orange');

<a id='T51'></a>
**Placement % of MBA students in each specialisation by gender**

In [None]:
import plotly.express as px
# Count students per (gender, specialisation, status) combination
dfc = (dataframe.groupby(['gender','specialisation','status'])['sl_no']
       .count()
       .rename('no. of students')
       .reset_index())

fig = px.sunburst(dfc, path=['gender','status','specialisation'], values='no. of students')
fig.update_layout(title="Placement % of MBA in each specialisation by gender", title_x=0.5)
fig.show()

<a id='T6'></a>
# Encoding the categorical data

In [None]:
# Convert each categorical column to integer codes
categorical_cols = ["degree_t", "workex", "specialisation", "status", "gender", "hsc_s"]
for col in categorical_cols:
    df[col] = df[col].astype('category').cat.codes
df.tail()
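
It helps to know which integer each label ends up with. A small sketch of how pandas assigns category codes to string labels, using a toy series; the `"Placed"`/`"Not Placed"` labels here are assumed for illustration and are not read from the dataset:

```python
import pandas as pd

# Toy column (hypothetical labels) converted to a categorical type
status_example = pd.Series(["Placed", "Not Placed", "Placed"]).astype("category")

# String categories are sorted, and each code is the position in that order
code_to_label = dict(enumerate(status_example.cat.categories))
codes = status_example.cat.codes.tolist()

print(code_to_label)
print(codes)
```

Building such a mapping for each encoded column before overwriting it makes the later model coefficients easier to interpret.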

<a id='T7'></a>
# Classification of placement status

<a id='T71'></a>
## Preprocessing the data

In [None]:
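df_class = df.copy()
# Features: every column except the last two (status, salary)
X = df_class.iloc[:,0:-2].values
# Target: the encoded placement status (second-to-last column)
y = df_class.iloc[:,-2].values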

In [None]:
#Split the dataset for training
#from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.15, random_state=0)

<a id='T72'></a>
## Train Logistic Regression Model

In [None]:
#Train the model
#from sklearn.linear_model import LogisticRegression
lg_classifier = LogisticRegression(random_state=0,max_iter=1000)
lg_classifier.fit(X_train, y_train)

#Predict the test cases
y_pred_lgclass = lg_classifier.predict(X_test)

<a id='T73'></a>
## Train Random Forest Classifier

In [None]:
#Train the model
#from sklearn.ensemble import RandomForestClassifier
# random_state fixed so that the forest is reproducible across runs
rf_classifier = RandomForestClassifier(n_estimators=1000, criterion="entropy", random_state=0)
rf_classifier.fit(X_train, y_train)

#Predict the test cases
y_pred_rfclass = rf_classifier.predict(X_test)

<a id='T74'></a>
## Accuracy/Confusion matrix

In [None]:
#from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(y_test, y_pred_rfclass)
print(cm)
print("Random forest accuracy: {:.2f}%".format(accuracy_score(y_test, y_pred_rfclass) * 100))

In [None]:
cm = confusion_matrix(y_test, y_pred_lgclass)
print(cm)
print("Logistic regression accuracy: {:.2f}%".format(accuracy_score(y_test, y_pred_lgclass) * 100))
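
Accuracy alone can hide how the errors are distributed between the two classes. A minimal sketch of reading precision and recall off a 2x2 confusion matrix in scikit-learn's layout (rows are true classes, columns are predicted classes). The matrix values below are hypothetical and are not results from this notebook:

```python
import numpy as np

# Hypothetical confusion matrix: rows = true class, columns = predicted class
cm_example = np.array([[8, 4],
                       [2, 19]])

# For a binary problem, ravel() yields tn, fp, fn, tp in that order
tn, fp, fn, tp = cm_example.ravel()

accuracy = (tp + tn) / cm_example.sum()
precision = tp / (tp + fp)  # share of predicted positives that were correct
recall = tp / (tp + fn)     # share of actual positives that were found

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
```

The same unpacking could be applied to the `cm` arrays printed above to compare the two classifiers class by class.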

<a id='T8'></a>
# Build a regression model

<a id='T81'></a>
## Preprocessing data

In [None]:
df_reg = df.copy()

In [None]:
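# Rows without a salary cannot be used to train the salary regressor
df_reg.dropna(inplace=True)
# Keep salaries below 350,000 to limit the influence of high-salary outliers
df_reg = df_reg[df_reg["salary"]<350000.0]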

**Skewness of salary plot**

In [None]:
#PDF of Salary
sns.kdeplot(df["salary"])
plt.legend(["before"])
plt.show()

The density plot is right-skewed.

In [None]:
sns.kdeplot(df_reg["salary"])
plt.legend(["after"])

In [None]:
#select the features of regression model
X = df_reg.iloc[:,:-2].values
y = df_reg.iloc[:,-1].values

#splitting into training and test set
#from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.2, random_state=0)

<a id='T82'></a>
## OLS model summary

In [None]:
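from statsmodels.api import OLS
# Note: OLS does not add an intercept on its own, so as written the model is fit without a constant term
summ = OLS(y_train, X_train).fit()
summ.summary()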

**Drop insignificant features**

In [None]:
df_reg = df_reg.drop(columns=["degree_p", "ssc_p", "specialisation", "workex"])

#select the features of regression model
X = df_reg.iloc[:,:-2].values
y = df_reg.iloc[:,-1].values

#splitting into training and test set
#from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.2, random_state=0)

summ=OLS(y_train,X_train).fit()
summ.summary()

<a id='T83'></a>
## Estimation by multiple regressor

In [None]:
#from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)

#Predict the salary
y_pred_m = regressor.predict(X_test)

<a id='T84'></a>
## Estimation by Random forest regressor

In [None]:
#from sklearn.ensemble import RandomForestRegressor
rfregressor = RandomForestRegressor(n_estimators = 10, random_state = 0)
rfregressor.fit(X_train, y_train)

#Predict the salary
y_pred_r = rfregressor.predict(X_test)

In [None]:
from sklearn.metrics import r2_score
print("R2 score")
print("multiple regressor " + str(r2_score(y_test, y_pred_m)))
print("random forest "+ str(r2_score(y_test, y_pred_r)))

In [None]:
from sklearn.metrics import mean_absolute_error
print("Mean Absolute error")
MAE = mean_absolute_error(y_test, y_pred_m)
print("Multiple linear regressor "+str(MAE))
MAE = mean_absolute_error(y_test, y_pred_r)
print("Random forest regressor "+ str(MAE))
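
For reference, both metrics can be computed by hand, which makes clear what they measure. A minimal sketch with NumPy on made-up values (the arrays below are hypothetical, not predictions from the models above):

```python
import numpy as np

# Hypothetical true and predicted salaries
y_true_example = np.array([250000.0, 300000.0, 270000.0, 240000.0])
y_pred_example = np.array([260000.0, 280000.0, 275000.0, 250000.0])

# Mean absolute error: average size of the prediction error
mae_example = np.mean(np.abs(y_true_example - y_pred_example))

# R2: 1 minus the ratio of residual to total sum of squares
ss_res = np.sum((y_true_example - y_pred_example) ** 2)
ss_tot = np.sum((y_true_example - y_true_example.mean()) ** 2)
r2_example = 1.0 - ss_res / ss_tot

print(mae_example, r2_example)
```

MAE is in the same unit as salary, while R2 is unitless and compares the model against always predicting the mean.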

> This is the best feature combination I found. Comment with the feature combination that gives you the lowest mean absolute error.

<a id='T85'></a>
## Regression coefficients and intercept

In [None]:
print("regression coeff:" + str(regressor.coef_))
print("regression intercept: " + str(regressor.intercept_))

Therefore, the equation of our multiple linear regression model is:

$$\textrm{salary} = 13587.74 \times \textrm{gender} - 162 \times \textrm{hsc_p} - 9251.35 \times \textrm{hsc_s} + 7127 \times \textrm{degree_t} + 163.43 \times \textrm{etest_p} + 77.79 \times \textrm{mba_p} + 251890.32$$
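
As a worked example, the equation can be evaluated for a single student. The coefficients are copied from the equation above and hold only for that fitted run; the student's encoded feature values and the `predict_salary` helper are hypothetical:

```python
# Coefficients and intercept copied from the equation above
coefficients = {
    "gender": 13587.74,
    "hsc_p": -162.0,
    "hsc_s": -9251.35,
    "degree_t": 7127.0,
    "etest_p": 163.43,
    "mba_p": 77.79,
}
intercept = 251890.32


def predict_salary(features):
    """Evaluate the linear equation for one student's encoded features."""
    return intercept + sum(coef * features[name] for name, coef in coefficients.items())


# Hypothetical student (categorical features given as integer codes)
student = {"gender": 1, "hsc_p": 70.0, "hsc_s": 1, "degree_t": 0, "etest_p": 80.0, "mba_p": 60.0}
predicted = predict_salary(student)
print(round(predicted, 2))
```

This should agree with `regressor.predict` on the same feature order, up to the rounding of the printed coefficients.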