# Factors affecting Campus Placements

## Campus placements
A campus placement is a recruitment programme, usually run at higher-education institutions, that aims to place students in graduate jobs at companies. These positions are typically competitive because their numbers are limited, so career-oriented students may want to understand which choices maximise their chances of landing a first job.

## Dataset
Kindly provided by Ben Roshan D (MBA in Business Analytics at Jain University Bangalore), this dataset consists of placement data for students on one campus. It includes secondary and higher secondary school percentages and specialisations, as well as degree specialisation, degree type, work experience and the salary offered to placed students.

## Questions
We aim to answer the set of questions included with the dataset.
1. Which factor influenced a candidate in getting placed?
2. Does percentage matter for one to get placed?
3. Which degree specialization is most in demand by corporates?
4. Play with the data conducting all statistical tests.


In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# 1. Exploratory Data Analysis

In [None]:
# read csv file
path = '/kaggle/input/factors-affecting-campus-placement/Placement_Data_Full_Class.csv'

df = pd.read_csv(path)
print('The shape of our data is:' , df.shape)
df.head()


In [None]:
print('Numeric columns')
df.describe(include = np.number)

In [None]:
print('Categorical columns')
# np.object was removed from recent NumPy releases; the builtin object works everywhere
df.describe(include = object)

In [None]:
# processing
column_names = list(df.columns)

# extract features and the target
data = df.iloc[:, 1 :-2]
target = df.iloc[:,-2:]

# separate between categorical and numeric columns
numeric_columns = data.select_dtypes(include=['int64' , 'float64'])
categorical_columns = data.select_dtypes(exclude=['int64' , 'float64'])

## Exploratory Analysis
We want to distinguish between those who got placed and those who did not, depending on the various factors at play. A pairplot coloured by placement status lets us look for such patterns. From the plots we see a general positive correlation between each student's percentages across the different stages of education, and those who gained placements generally score higher.

One exception is with the MBA percentages, seeing as they do

In [None]:
# data visualisation
import seaborn as sns
import matplotlib.pyplot as plt
sns.set_style('darkgrid')

# seaborn pairplot
sns.pairplot(df.iloc[:, 1:], hue="status")
plt.show()

# correlation of the numeric columns only (recent pandas versions raise an error on text columns otherwise)
corr_matrix = df.corr(numeric_only=True)
print('Correlation matrix:')
corr_matrix["salary"].sort_values()

Here are some subplots indicating how each categorical variable relates to whether an individual gets placed. What matters is not the absolute number of people placed in a category but the **relative fraction** of that category who are placed.

By observation, we can see that those who are female, have work experience or study Marketing and Finance do indeed have an advantage over their counterparts. However, we would like to know whether this advantage is significant or not. 

In [None]:
# Bar charts
fig, ax = plt.subplots(4,2,figsize=(15,15))

# one count plot per categorical column, split by placement status
for i, col in enumerate(categorical_columns.columns):
    sns.countplot(x = categorical_columns[col],
                  hue = target['status'],
                  ax = ax[i//2, i%2])
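
The relative fractions discussed above can also be tabulated directly rather than read off the bar heights. The sketch below is a minimal illustration that assumes pandas is available; the `toy` frame is made-up data for demonstration only and is not taken from the placement dataset.

```python
import pandas as pd

# Toy data for illustration only; these rows are NOT from the placement dataset.
toy = pd.DataFrame({
    "workex": ["Yes", "Yes", "No", "No", "No", "Yes", "No", "No"],
    "status": ["Placed", "Placed", "Placed", "Not Placed",
               "Not Placed", "Placed", "Placed", "Not Placed"],
})

# normalize="index" divides each row by its row total, so every row shows
# the fraction of that category that was placed or not placed.
placement_rate = pd.crosstab(toy["workex"], toy["status"], normalize="index")
print(placement_rate)
```

Applied to the real data, the same call with `categorical_columns[col]` and `target['status']` should give the per-category placement rates for any column `col`.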

# 2. Modelling

We will now employ a Random Forest Classifier to understand the different factors at play which influence whether a candidate gets placed. 

### 2.1 Some more processing for categorical data

In [None]:
# label encode the binary columns, so each becomes a single 0/1 feature
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()

# get column names with binary data i.e. two classes
is_binary = categorical_columns.nunique() == 2
binary_cats = is_binary[is_binary].index

label_categorical_columns = categorical_columns.copy()
label_data = data.copy()

for col in binary_cats:
    label_categorical_columns[col] = label_encoder.fit_transform(categorical_columns[col])
    label_data[col] = label_encoder.fit_transform(categorical_columns[col])

# make sure to keep track which are the positive and negative classes
print(label_categorical_columns.head())
print('---')
print(categorical_columns.head())
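
To keep track of which class becomes 0 and which becomes 1, it helps to remember that scikit-learn's `LabelEncoder` stores its classes in sorted order (exposed as `classes_`). The sketch below mimics that ordering in plain Python on a made-up list of labels; it is an illustration of the mapping, not the encoder itself.

```python
# Toy labels for illustration only.
statuses = ["Placed", "Not Placed", "Placed", "Placed", "Not Placed"]

# Sorted unique labels get the integer codes 0, 1, ... in that order.
classes = sorted(set(statuses))
mapping = {label: code for code, label in enumerate(classes)}
encoded = [mapping[s] for s in statuses]

print(mapping)
print(encoded)
```

Under this ordering, 'Not Placed' sorts before 'Placed', so placed students end up as the positive class.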

In [None]:
# label encode the status column from target data
label_target = target.copy()
label_placed = label_encoder.fit_transform(target['status'])
label_target['status'] = label_placed

# one hot encode the categorical columns, except those with binary values
categorical_columns_onehot = pd.get_dummies(label_categorical_columns)

#t = label_encoder.fit_transform(data['gender'])
data_onehot = pd.get_dummies(label_data)
data_onehot

### 2.2 Model training

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X = data_onehot
y = label_placed

train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)

forest_model = RandomForestClassifier(random_state = 1)
forest_model.fit(train_X, train_y)

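# evaluate model: .score() returns mean accuracy, so the F1 score is computed separately
from sklearn.metrics import f1_score
print('Accuracy:', forest_model.score(val_X, val_y))
print('F1 score:', f1_score(val_y, forest_model.predict(val_X)))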
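
Note that a classifier's `.score()` method in scikit-learn reports mean accuracy, which is not the same quantity as the F1 score. The following sketch computes both by hand on made-up labels (1 = Placed, 0 = Not Placed) to show how they can differ; the labels are illustrative and not model output.

```python
# Toy labels for illustration only (1 = Placed, 0 = Not Placed).
y_true = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # true positives
fp = sum(t == 0 and p == 1 for t, p in pairs)  # false positives
fn = sum(t == 1 and p == 0 for t, p in pairs)  # false negatives

accuracy = sum(t == p for t, p in pairs) / len(pairs)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy = {accuracy:.3f}, F1 = {f1:.3f}")
```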


### 2.3 Feature importance ranking

In [None]:
# evaluate each feature's importance
importances = forest_model.feature_importances_

# sort by descending order of importances
indices = np.argsort(importances)[::-1]

#create sorted dictionary
sorted_importances = {}

print("Feature ranking:")
for f in range(X.shape[1]):
    sorted_importances[X.columns[indices[f]]] = importances[indices[f]]
    print("%d. %s (%f)" % (f + 1, X.columns[indices[f]], importances[indices[f]]))

### What factor influenced a candidate in getting placed?

Thus, we notice that the most useful predictors are the secondary education percentage (0.343) and the degree percentage (0.168), followed by the higher secondary (0.157) and MBA (0.110) percentages.

The least influential predictors are the high school specialisations and undergraduate degree types.

### Does percentage matter for one to get placed?

Evidently, percentage matters to a large extent in determining whether a particular student gets placed: the four percentage features together account for the majority (0.779) of the random forest importance score. One possible explanation is that percentages are correlated with general ability, which would come into play in interviews, if they are held, or would stand out to employers on paper.
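
As a quick arithmetic check, the combined share can be recomputed from the four rounded importances quoted above. The dictionary keys are descriptive labels chosen here, not the dataset's column names, and the small gap to the quoted 0.779 presumably comes from rounding the individual values.

```python
# Importances as quoted in the text, rounded to three decimals.
reported = {
    "secondary": 0.343,
    "degree": 0.168,
    "higher secondary": 0.157,
    "MBA": 0.110,
}

total = round(sum(reported.values()), 3)
print(total)  # 0.778
```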

---
# 3. Statistical tests

If you're unconvinced about the influence of percentages, we can perform a hypothesis test of whether those who are placed have higher average percentages than those who are not.
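
For reference, one standard form of this comparison is the pooled two-sample t statistic, which assumes the two groups have similar variances. The symbols are introduced here for illustration: $\bar{x}_i$, $s_i^2$ and $n_i$ denote the mean, sample variance and size of group $i$ (1 = placed, 2 = not placed).

```latex
H_0:\ \mu_1 - \mu_2 = 0, \qquad H_1:\ \mu_1 - \mu_2 > 0

s_p = \sqrt{\frac{(n_1 - 1)\,s_1^2 + (n_2 - 1)\,s_2^2}{n_1 + n_2 - 2}},
\qquad
t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}}
```

Under $H_0$, $t$ follows a t distribution with $n_1 + n_2 - 2$ degrees of freedom, and the one-tailed p-value is the upper-tail probability beyond the observed $t$.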

In [None]:
# Obtain average percentages for the DataFrame
percent_df = pd.DataFrame(numeric_columns.agg('mean', axis = 1)).join(df['status'])

percent = percent_df.groupby(['status']).agg(['mean','var', 'count'])
percent.columns = percent.columns.droplevel()
percent

In [None]:
# one-tailed, pooled-variance (Student's) t-test
# H0: mu_placed - mu_notplaced = 0
# H1: mu_placed - mu_notplaced > 0
from scipy.stats import t

# unequal sample sizes, assumed similar variances

# extract variables from table
placed_mu, notplaced_mu = percent['mean']['Placed'], percent['mean']['Not Placed']
placed_var, notplaced_var = percent['var']['Placed'], percent['var']['Not Placed']
n1, n2 = percent['count']['Placed'], percent['count']['Not Placed']

# calculate the pooled standard deviation and the t statistic
sp = np.sqrt(((n1 - 1) * placed_var + (n2 - 1) * notplaced_var) / (n1 + n2 - 2))
t_stat = (placed_mu - notplaced_mu) / (sp * np.sqrt(1 / n1 + 1 / n2))

print('The t statistic is', t_stat)
# upper-tail p-value, matching H1 (sf = 1 - cdf)
print('The p-value is', t.sf(t_stat, df = n1 + n2 - 2))


### Does percentage matter for one to get placed (part 2)?
Using the t-test, we can thus conclude that **those with placements tend to have higher percentages**, because the p-value is extremely small.

In [None]:
data_joined = data.join(df['status'])

# obtain table for placed students in each specialisation
p = data_joined.groupby(['specialisation'])['status'].agg([lambda z: np.mean(z=='Placed'), "size"])
p.columns = ["Placed", 'Total']
print(p)


In [None]:
# We want to test whether finance students find it easier to get placed
# H0: pfin - phr = 0
# H1: pfin - phr > 0 (one-tailed test)

# calculate pool proportion
p_us = len(df[df['status']=='Placed']) / len(df)

# obtaining individual proportions and total counts from table above
pfin, phr = p['Placed']['Mkt&Fin'], p['Placed']['Mkt&HR']
n1, n2 = p['Total']['Mkt&Fin'], p['Total']['Mkt&HR']

# calculate standard error
se = np.sqrt(p_us*(1- p_us)*(1/n1 + 1/n2))

# Calculate the best estimate of the proportion distribution
be = pfin - phr

# Calculate the hypothesized estimate, which is no difference
he = 0

#Calculate the test statistic
test_statistic = (be - he)/se

# Obtain the one-tailed (upper-tail) p-value, matching H1: pfin - phr > 0
from scipy.stats import norm
pvalue = norm.sf(test_statistic)

print('The p-value is {0:.6f}'.format(pvalue))

### Which degree specialization is most in demand by corporates?

Though degree specialisation is not among the most important factors in whether a student gets placed, we can still test whether there is a statistically significant difference between the two groups. Notice that the proportion of Marketing and Finance students who get placed is higher than that of Marketing and Human Resources students (0.792 vs 0.558). Our one-tailed test for a difference in proportions lets us judge whether this difference could plausibly be due to chance.

The p-value of the test was 0.000119: if the two specialisations truly had the same placement rate, a difference at least this large would arise by chance only about 0.01% of the time. Thus, we can conclude that those in the finance specialisation are placed at a significantly higher rate than those in HR.

However, the correlation matrix below shows that those in Finance also tend to have higher percentages, which may be the main reason for this difference.
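
The calculation can be sketched with the standard library alone. The counts passed in below (95 of 120 and 53 of 95) are assumptions chosen to match the reported proportions of 0.792 and 0.558; they are not stated in this notebook, and `two_proportion_z_test` is a helper name introduced here for illustration.

```python
import math

def two_proportion_z_test(x1, n1, x2, n2):
    """One-tailed pooled z-test for H1: p1 > p2.

    x1, x2 are the numbers of successes; n1, n2 are the group sizes.
    """
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)  # common proportion under H0
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Upper-tail probability of the standard normal distribution.
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p_value

# Assumed counts, chosen to be consistent with the reported proportions.
z, p_value = two_proportion_z_test(95, 120, 53, 95)
print(f"z = {z:.3f}, one-tailed p-value = {p_value:.6f}")
```

With these assumed counts the p-value comes out close to the 0.000119 reported above.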

In [None]:
onehot_target = data_onehot.join(label_target)
corr_matrix = onehot_target.corr()
fig = plt.figure(figsize = (10,10))
sns.heatmap(corr_matrix, square=True, cmap="YlGnBu")

# 4. Conclusion


A Random Forest classifier was used to model the factors responsible for successful campus placements of MBA students, achieving a validation accuracy of about 90%. Using this classifier, the secondary school percentage was found to be the best predictor of whether a person got a placement. As an aside, this seems odd, but it offers some insight into the question of whether an individual's success is already shaped during childhood. The worst predictors were degree types.

Using a t-test, we found that percentage indeed matters, with a high degree of significance. A proportion z-test indicated that those who study the Marketing and Finance specialisation are more in demand by corporates than those who study Marketing and HR, with a p-value of about 0.01%. However, we also spot a slight correlation: those who study the Finance specialisation tend to have higher overall percentages. This may therefore be the underlying cause of both findings.