# Campus Recruitment (Academic and Employability Factors Influencing Placement)


This data set consists of placement data for students on a college campus. It includes secondary and higher secondary school percentages and specializations, as well as degree specialization, degree type, work experience, and the salary offered to placed students.

Before we start our analysis, let us import the libraries and start working!

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load


import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import scipy.stats as stats


# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

#### Done! Let us now import the data set and see how it looks!

In [None]:
data = pd.read_csv('../input/factors-affecting-campus-placement/Placement_Data_Full_Class.csv') #Importing the data
data.head() #First five observations of the data

**Now let us see what the variables mean:**

sl_no : Serial Number

gender : Gender

ssc_p : Secondary Education percentage- 10th Grade

ssc_b: Board of Education- Central/ Others 10th Grade

hsc_p: Higher Secondary Education percentage- 12th Grade

hsc_b : Board of Education- Central/ Others- 12th Grade

hsc_s: Specialization in Higher Secondary Education

degree_p : Degree Percentage

degree_t: Under Graduation(Degree type)- Field of degree education

workex : Work Experience

etest_p: Employability test percentage (conducted by college)

specialisation: Post Graduation(MBA)- Specialization

mba_p: MBA percentage

status: Status of placement- Placed/Not placed

salary: Salary offered by corporate to candidates

#### Great! Now let us see some more information about our variables


In [None]:
print(data.info())
print(data.shape)

This tells us that there are 215 entries and 15 features in total. One odd thing is that salary has only 148 non-null values, which means many of the values are missing. We will investigate that soon enough. Let us first see some descriptive statistics for both the categorical and the quantitative variables.

In [None]:
data.describe(include=['O']) #Categorical Variables

In [None]:
data.describe() #Numerical Variables

Everything looks fine so far, and there is a lot we can observe from this. One thing is that the mean salary of placed students is 2,88,655 INR, with a large standard deviation of 93,457 INR. Another observation is that the majority of students are male, and most students studied under the Central board and in the Commerce stream. Before we start doing further analysis, we should make a copy of our data and clean it!

## Data Preprocessing

Let us start by making a copy of the data!

In [None]:
my_data = data.copy() #copying data to keep original intact

Great! Now let's handle the missing values that we saw earlier

In [None]:
my_data.isnull().sum() #Number of missing values for each column

Voila! 67 missing values in the salary column. These are the students who haven't been placed yet, so they have no salary offer. For this first pass we remove these entries. Keep in mind that after this step `my_data` contains only placed students, so the counts in the next section describe who makes up the placed group.

In [None]:
my_data.dropna(inplace = True) #drop rows with missing values
my_data.isnull().sum() 
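
Before dropping rows, it can be worth confirming that the missing salaries really do line up with the unplaced students. The snippet below is a minimal sketch of that check on a small made-up frame (`toy` and its values are illustrative assumptions, not the real data); on the notebook's data the same two expressions could be run against `data`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the placement data (values are made up for illustration)
toy = pd.DataFrame({
    "status": ["Placed", "Not Placed", "Placed", "Not Placed", "Placed"],
    "salary": [270000.0, np.nan, 200000.0, np.nan, 250000.0],
})

# Status values among the rows whose salary is missing
missing_salary_status = toy.loc[toy["salary"].isnull(), "status"].value_counts()
print(missing_salary_status)

# True when "salary is missing" and "status is Not Placed" mark exactly the same rows
same_rows = bool((toy["salary"].isnull() == (toy["status"] == "Not Placed")).all())
print(same_rows)
```

If `same_rows` comes out `False` on the real data, some missing salaries would need a different explanation than "not placed".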

Well Done! Now we're ready to do our analysis. Let us answer these questions:

1. Which factor influenced a candidate in getting placed?
2. Does percentage matter for one to get placed?
3. Which degree specialization is most in demand by corporates?

## 1. Which factor influenced a candidate in getting placed?

There are many methods of figuring out which variables are important for placement. Let us first see visually if variables have any effect. 

In [None]:
#Gender vs Status
gpb_gender = my_data[["gender", "status"]].groupby(['gender'], as_index=False).count()
print(gpb_gender)

sns.set(style="whitegrid")
ax = sns.barplot(x="gender", y="status", data=gpb_gender)

So we can see that 100 males were placed, which is about twice the number of females placed. That hints that gender may play a role, but remember that there are also more male students in the class overall, so raw counts alone cannot settle it.
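
Counts of placed students depend on how many students each group had to begin with, so a placement rate per group is often easier to compare. Here is a rough sketch using `pd.crosstab` with `normalize="index"`. The helper name `placement_rate` and the toy frame are assumptions made for illustration; on the notebook's data it would need the original `data` (which still has the "Not Placed" rows), not `my_data`.

```python
import pandas as pd

def placement_rate(df, column):
    """Share of placed students within each level of `column` (hypothetical helper)."""
    # normalize="index" turns each row of the crosstab into proportions that sum to 1
    table = pd.crosstab(df[column], df["status"], normalize="index")
    return table["Placed"]

# Toy data (made up): three placed males, one placed female
toy = pd.DataFrame({
    "gender": ["M", "M", "M", "M", "F", "F"],
    "status": ["Placed", "Placed", "Placed", "Not Placed", "Placed", "Not Placed"],
})

rates = placement_rate(toy, "gender")
print(rates)
```

In this toy example males have three times as many placements as females, yet the rates are 0.75 versus 0.5, which is a much smaller gap. The same helper could be reused for the other categorical columns, for example `placement_rate(data, "workex")`.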

In [None]:
#  Board of Education- Central/ Others 10th Grade vs Status

ssc_b_gb = my_data[["ssc_b", "status"]].groupby(['ssc_b'], as_index=False).count()
print(ssc_b_gb)

sns.set(style="whitegrid")
ax = sns.barplot(x="ssc_b", y="status", data=ssc_b_gb)

So, here we see that more Central board students were placed than students from other boards, but the difference is not large. Let's move on to the next variable!

In [None]:
#  Board of Education- Central/ Others 12th Grade vs Status

hsc_b_gb = my_data[["hsc_b", "status"]].groupby(['hsc_b'], as_index=False).count()
print(hsc_b_gb)

sns.set(style="whitegrid")
ax = sns.barplot(x="hsc_b", y="status", data=hsc_b_gb)

Hey! Now we see a drastic change: students from other boards account for 61% of all placed students, far more than Central. Onto the next!

In [None]:
# Specialization in Higher Secondary Education vs Status


hsc_s_gb = my_data[["hsc_s", "status"]].groupby(['hsc_s'], as_index=False).count()
print(hsc_s_gb)

sns.set(style="whitegrid")
ax = sns.barplot(x="hsc_s", y="status", data=hsc_s_gb)

So here, people with science and commerce backgrounds do considerably well, as they make up most of the placed students, while people with an arts background form only 4% of placed students. This makes the stream of specialization look like an important factor. NEXT!

In [None]:
#Under Graduation(Degree type)- Field of degree education vs Status

degree_t_gb = my_data[["degree_t", "status"]].groupby(['degree_t'], as_index=False).count()
print(degree_t_gb)

sns.set(style="whitegrid")
ax = sns.barplot(x="degree_t", y="status", data=degree_t_gb)

Another important factor, as seen from the chart, is the undergraduate field, which seems to have an impact. The majority of placed students chose Commerce and Management as their major, whereas people who chose other majors make up barely 3.3%. Let's investigate the next variable, which seems to be very prevalent in today's world.

In [None]:
#Work Experience vs Status


wex_gb = my_data[["workex", "status"]].groupby(['workex'], as_index=False).count()
print(wex_gb)

sns.set(style="whitegrid")
ax = sns.barplot(x="workex", y="status", data=wex_gb)

Surprisingly, more of the placed students had no work experience. This is unlike what we expect in the real world, where employers tend to choose candidates with at least a little work exposure. Onto the next!

In [None]:
# specialisation MBA vs status

spec_gb = my_data[["specialisation", "status"]].groupby(['specialisation'], as_index=False).count()
print(spec_gb)

sns.set(style="whitegrid")
ax = sns.barplot(x="specialisation", y="status", data=spec_gb)

Here, we see that the majority of placed students have a specialization in Finance, which is quite interesting.
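
The charts above are one way to look for influential variables. Another option is a chi-square test of independence between a categorical variable and `status`. The sketch below uses `scipy.stats.chi2_contingency` on a made-up table (the `toy` frame and its counts are assumptions for illustration only). On the notebook's data, the crosstab would be built from the original `data`, since `my_data` no longer has unplaced students.

```python
import pandas as pd
from scipy import stats

# Toy data (made up): 20 students, split by work experience and placement status
toy = pd.DataFrame({
    "workex": ["Yes"] * 10 + ["No"] * 10,
    "status": ["Placed"] * 9 + ["Not Placed"] * 1 + ["Placed"] * 4 + ["Not Placed"] * 6,
})

# Contingency table of observed counts: rows = workex, columns = status
observed = pd.crosstab(toy["workex"], toy["status"])

# chi2_contingency returns the statistic, the p-value, the degrees of freedom
# and the table of counts expected if the two variables were independent
chi2, p_value, dof, expected = stats.chi2_contingency(observed)

print(observed)
print(f"chi2={chi2:.2f}, p={p_value:.4f}, dof={dof}")
```

A small p-value would suggest that placement status is not independent of the variable, though with samples this small the result should be read with caution.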

## 2. Does percentage matter for one to get placed?

Now let us look at the correlation of the quantitative variables with status, to identify which features are important predictors of status.

In [None]:
# Let us make another copy of the data (the original still includes unplaced students)
from sklearn.preprocessing import LabelEncoder

my_data1 = data.copy()

# Converting status to numbers: 1 - Placed, 0 - Not Placed
enc = LabelEncoder()
my_data1['status'] = enc.fit_transform(my_data1['status'])

# Code for drawing a heatmap using seaborn
# numeric_only=True (pandas 1.5+) skips the text columns, which newer pandas versions refuse to correlate
plt.figure(figsize=(20,10))
heat_map = my_data1.corr(numeric_only=True)
sns.heatmap(heat_map, annot=True)
heat_map


From the heatmap, we can see that ssc_p, hsc_p and degree_p have the strongest positive correlations with status! Hence, percentage does matter when it comes to being placed.
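
If only the correlations with status are of interest, the full heatmap can be narrowed down to a single sorted column. Below is a small sketch using `DataFrame.corrwith` on made-up numbers (the `toy` frame is an assumption for illustration); with the notebook's data the same idea would apply to the percentage columns of `my_data1`.

```python
import pandas as pd

# Toy data (made up): two percentage columns and a placement status
toy = pd.DataFrame({
    "ssc_p": [85.0, 78.0, 62.0, 55.0, 70.0, 48.0],
    "degree_p": [72.0, 68.0, 60.0, 58.0, 66.0, 52.0],
    "status": ["Placed", "Placed", "Not Placed", "Not Placed", "Placed", "Not Placed"],
})

# Encode status as 1 (Placed) / 0 (Not Placed)
placed = (toy["status"] == "Placed").astype(int)

# Pearson correlation of each numeric column with the 0/1 status, strongest first
corr_with_status = toy[["ssc_p", "degree_p"]].corrwith(placed).sort_values(ascending=False)
print(corr_with_status)
```

Pearson correlation against a 0/1 variable is the point-biserial correlation, so the values read the same way as the `status` row of the heatmap.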

In [None]:
# Placed students: how many students share each degree percentage, highest percentage first
group_dp = my_data1[my_data1.status == 1]
group_dp = group_dp[["degree_p", "status"]].groupby(['degree_p'], as_index=False).count().sort_values(by='degree_p',ascending=False)
print(group_dp)

# Placed students: same counts, most common degree percentage first
group_dp = my_data1[my_data1.status == 1]
group_dp = group_dp[["degree_p", "status"]].groupby(['degree_p'], as_index=False).count().sort_values(by='status',ascending=False)
print(group_dp)

# Unplaced students: counts per degree percentage, highest percentage first
group_dp = my_data1[my_data1.status == 0]
group_dp = group_dp[["degree_p", "status"]].groupby(['degree_p'], as_index=False).count().sort_values(by='degree_p',ascending=False)
print(group_dp)

# Unplaced students: same counts, most common degree percentage first
group_dp = my_data1[my_data1.status == 0]
group_dp = group_dp[["degree_p", "status"]].groupby(['degree_p'], as_index=False).count().sort_values(by='status',ascending=False)
print(group_dp)

# Scatter of the last table: here "status" holds the number of unplaced students per degree percentage
sns.set(style="whitegrid")
ax = sns.scatterplot(x="degree_p", y="status", data=group_dp)

So, we see that 65% is the most common degree percentage among placed students. High scorers also tend to get placed: the highest degree percentage among students who haven't been placed is 79%. So percentage is an important variable!
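
Grouping by exact percentage gives many tiny groups, so it can help to bin the percentages into ranges and compare the placement rate per range. The sketch below does this with `pd.cut` on made-up values. The `toy` frame and the bin edges are assumptions chosen for illustration, not taken from the data set.

```python
import pandas as pd

# Toy data (made up): degree percentages and placement status
toy = pd.DataFrame({
    "degree_p": [52.0, 58.0, 61.0, 64.0, 66.0, 69.0, 72.0, 78.0],
    "status": ["Not Placed", "Not Placed", "Not Placed", "Placed",
               "Placed", "Not Placed", "Placed", "Placed"],
})

# Bin the percentages into bands (50, 60], (60, 70], (70, 80]
bins = [50, 60, 70, 80]
toy["degree_band"] = pd.cut(toy["degree_p"], bins=bins)

# Mean of a True/False column is the share of True, i.e. the placement rate per band
placed = toy["status"] == "Placed"
rate_by_band = placed.groupby(toy["degree_band"], observed=False).mean()
print(rate_by_band)
```

In this toy example the rate rises from 0.0 in the lowest band to 1.0 in the highest, which is the kind of pattern one would look for in the real data.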

## 3. Which degree specialization is most in demand by corporates?

In [None]:
spec_gb = my_data[["specialisation", "status"]].groupby(['specialisation'], as_index=False).count()
print(spec_gb)

sns.set(style="whitegrid")
ax = sns.barplot(x="specialisation", y="status", data=spec_gb)

Corporates prefer Marketing and Finance over Marketing and HR.

So in conclusion, this was just a basic analysis of a college recruitment data set! We could also use ML methods to predict who will get placed given the features, which would be an awesome task to try!
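
As a starting point for that idea, here is a very rough sketch of a placement classifier using logistic regression from scikit-learn. Everything about the data here is synthetic: the `toy` frame, the rule that generates `placed`, and the choice of two features are assumptions for illustration, so the accuracy it prints says nothing about the real data set.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data: two percentage columns drawn at random
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "ssc_p": rng.uniform(40, 90, n),
    "degree_p": rng.uniform(50, 90, n),
})

# Made-up rule: higher marks make placement more likely, plus some noise
score = 0.08 * (toy["ssc_p"] - 65) + 0.05 * (toy["degree_p"] - 70) + rng.normal(0, 1, n)
toy["placed"] = (score > 0).astype(int)

# Hold out a quarter of the rows to evaluate the model on unseen students
X_train, X_test, y_train, y_test = train_test_split(
    toy[["ssc_p", "degree_p"]], toy["placed"],
    test_size=0.25, random_state=0, stratify=toy["placed"],
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)
print(f"held-out accuracy: {accuracy:.2f}")
```

With the real data, the categorical columns would need encoding first, and `salary` should be left out of the features since it is only known for placed students.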