# 1. Importing data and the necessary packages


The first step is to import the libraries needed for the analysis and load the dataset into *data_campus*.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
path = '/kaggle/input/factors-affecting-campus-placement/Placement_Data_Full_Class.csv'
data_campus = pd.read_csv(path, index_col = 'sl_no')
data_campus.head()

# 2. Data Wrangling

We inspect the imported data carefully, as it may contain missing values.

In [None]:
data_campus.info()

The salary column contains NaN values for the students who were not placed. It is reasonable to treat their salary as 0, so we replace NaN with 0.

In [None]:
data_campus['salary']=data_campus['salary'].fillna(0)
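Filling with 0 is a modelling choice, and it affects any statistic later computed on the whole column. The sketch below uses a small made-up frame (`toy` is hypothetical, not part of the dataset) to show how the mean changes once the zeros are included. This is why the salary plots further down filter on `salary > 0`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two placed students and one unplaced student.
toy = pd.DataFrame({
    "status": ["Placed", "Not Placed", "Placed"],
    "salary": [200000.0, np.nan, 300000.0],
})

# mean() skips NaN by default, so this is the mean over placed students only.
mean_placed_only = toy["salary"].mean()

# After filling NaN with 0, the unplaced student pulls the mean down.
filled = toy["salary"].fillna(0)
mean_with_zeros = filled.mean()

print(mean_placed_only)
print(mean_with_zeros)
```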

In [None]:
# Inspect the categories present in each categorical column
for col in ['gender', 'ssc_b', 'hsc_b', 'degree_t', 'workex', 'specialisation', 'status']:
    print(data_campus[col].unique())

# 3. Data Visualization
After preparing the data, we can start with the visualization.

* **Gender vs Status**

We start with the relationship between gender and placement status.

In [None]:
#plt.figure(figsize=(15,10))
# Pass the column as x=...; recent seaborn versions only accept keyword arguments here
sns.catplot(x = 'gender', kind = 'count', hue = 'status', data = data_campus)
plt.title('Gender vs Status', fontsize=16)

At first glance, it seems that men are more likely to be placed.
However, raw counts can mislead when the groups differ in size, so let's normalise them within each gender.

In [None]:
gender_status = data_campus.groupby(['gender', 'status'])
group_gen_sts = gender_status.size()
group_gen_sts.name = 'total'
group_gen_sts = group_gen_sts.reset_index()
group_gen_sts


In [None]:
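def normal_total(group):
    # Share of each status within the group (the shares in a group sum to 1)
    group = group.copy()  # avoid modifying the grouped data in place
    group['normal_data'] = group.total/group.total.sum()
    return group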

gender_normal = group_gen_sts.groupby('gender').apply(normal_total)
gender_normal

In [None]:
sns.barplot(x = 'gender', y = 'normal_data', hue = 'status',data = gender_normal, order = ['M', 'F'], hue_order = ['Placed', 'Not Placed'])
plt.title('Gender vs Status (normalised data)', fontsize=16)

This graph shows that the difference in placement rate between men and women is smaller than the first graph suggested. The raw counts exaggerated it because there are fewer women than men in the dataset.
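The within-group share can also be computed without `apply`. The sketch below uses made-up counts (the `counts` frame and its values are hypothetical) shaped like `group_gen_sts`, and divides each count by the total of its own gender group using `transform`.

```python
import pandas as pd

# Hypothetical counts, shaped like group_gen_sts (the values are made up).
counts = pd.DataFrame({
    "gender": ["F", "F", "M", "M"],
    "status": ["Not Placed", "Placed", "Not Placed", "Placed"],
    "total": [20, 30, 40, 60],
})

# transform("sum") returns the group total aligned with every original row,
# so the division gives each row's share within its own gender.
group_totals = counts.groupby("gender")["total"].transform("sum")
counts["normal_data"] = counts["total"] / group_totals

print(counts)
```

In this toy example both genders have the same placement share even though the raw counts differ, which is the effect normalisation is meant to reveal. Starting from the raw rows, `pd.crosstab` with `normalize='index'` is another way to get the same shares.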

* **Work experience vs Status**

Let's see the relationship between previous work experience and the current status.

In [None]:
sns.set(style="ticks")
sns.catplot(x = 'workex', kind = 'count', hue = 'status', data = data_campus)
plt.title('Work experience vs Status', fontsize=16)

In [None]:
workex_status = data_campus.groupby(['workex', 'status'])
group_work_sts = workex_status.size()
group_work_sts.name = 'total'
group_work_sts = group_work_sts.reset_index()
group_work_sts

The counts show that most students with previous work experience are placed. The normalised data makes this clearer.

In [None]:
workex_normal = group_work_sts.groupby('workex').apply(normal_total)
workex_normal

In [None]:
sns.barplot(x = 'workex', y = 'normal_data', hue = 'status',data = workex_normal, hue_order = ['Placed', 'Not Placed'])
plt.title('Work experience vs Status (normalised data)', fontsize=16)

The normalised data show that 86% of the students with previous work experience are placed, compared with 60% of those without.

* **Analysing percentages**

Now we analyse the percentage scores.

In [None]:
g = sns.pairplot(data_campus, vars=['ssc_p', 'hsc_p', 'degree_p', 'etest_p', 'mba_p'], hue='status')
# Title the whole figure; plt.title would only label the last subplot of the grid
g.fig.suptitle('Relationships between the percentage scores', fontsize=16, y=1.02)

This grid of plots shows every pair of score columns.

All pairs show a positive trend: students with high marks at one stage tend to have high marks at the others. These trends are weaker for the *Placement Test* and *MBA* scores.

We can also see that placed students had better grades in *Secondary Education*, *Higher Secondary Education* and *Degree*, whereas the *MBA* and *Placement Test* scores are similar for placed and unplaced students.
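The trends read from the pair plot can be checked numerically. The sketch below uses a small made-up frame (`toy` and its values are hypothetical) with two of the score columns. It computes the Pearson correlation between them and the mean score per status. The same two calls should apply to the five score columns of `data_campus`.

```python
import pandas as pd

# Hypothetical toy scores; the real columns live in data_campus.
toy = pd.DataFrame({
    "ssc_p":    [60.0, 70.0, 80.0, 90.0],
    "degree_p": [55.0, 65.0, 70.0, 85.0],
    "status":   ["Not Placed", "Not Placed", "Placed", "Placed"],
})

# Pearson correlation between the score columns (values near 1 mean a strong
# positive linear relationship).
corr = toy[["ssc_p", "degree_p"]].corr()

# Mean score per placement status, to compare the two groups directly.
means = toy.groupby("status")[["ssc_p", "degree_p"]].mean()

print(corr)
print(means)
```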

* **Specialisation vs Status**

In this section we analyse the higher secondary stream, the degree type and the MBA specialisation against placement status.

Let's start with Higher secondary stream.

In [None]:
sns.catplot(x = 'hsc_s', kind = 'count', hue = 'status', data = data_campus)
plt.title('Higher secondary stream vs Status', fontsize=16)

In [None]:
hsc_status = data_campus.groupby(['hsc_s', 'status'])
group_hsc_sts = hsc_status.size()
group_hsc_sts.name = 'total'
group_hsc_sts = group_hsc_sts.reset_index()
group_hsc_sts

The counts show that most students chose Commerce as their higher secondary stream, while very few studied Arts.

As we have done before, we will analyze the normalized data.

In [None]:
hsc_normal = group_hsc_sts.groupby('hsc_s').apply(normal_total)
hsc_normal

In [None]:
sns.barplot(x = 'hsc_s', y = 'normal_data', hue = 'status',data = hsc_normal, order = ['Commerce', 'Science', 'Arts'], hue_order = ['Placed', 'Not Placed'])
plt.title('Higher secondary stream vs Status (normalised data)', fontsize=16)

Here it is clearer that students who studied Commerce or Science in higher secondary education are more likely to be placed (around 70% in both cases), while 55% of the students who studied Arts are placed.

Next, let's look at the degree type.

In [None]:
sns.catplot(x = 'degree_t', kind = 'count', hue = 'status', data = data_campus, order = ['Comm&Mgmt', 'Sci&Tech', 'Others'])
plt.title('Degree type vs Status', fontsize=16)

In [None]:
degree_status = data_campus.groupby(['degree_t', 'status'])
group_degree_sts = degree_status.size()
group_degree_sts.name = 'total'
group_degree_sts = group_degree_sts.reset_index()
group_degree_sts

As with the higher secondary stream, most students studied Commerce and Management. However, to compare the placement percentages by degree type, we will study the normalised data.

In [None]:
degree_normal = group_degree_sts.groupby('degree_t').apply(normal_total)
degree_normal

In [None]:
sns.barplot(x = 'degree_t', y = 'normal_data', hue = 'status',data = degree_normal, order = ['Comm&Mgmt', 'Sci&Tech', 'Others'], hue_order = ['Placed', 'Not Placed'])
plt.title('Degree type vs Status (normalised data)', fontsize=16)

Although there are fewer science students than commerce students, 70% are placed in both cases. Students with other degree types are less likely to be placed.

Next, we analyse the MBA specialisation.

In [None]:
sns.catplot(x = 'specialisation', kind = 'count', hue = 'status', data = data_campus)
plt.title('MBA specialisation vs Status', fontsize=16)

In [None]:
mba_status = data_campus.groupby(['specialisation', 'status'])
group_mba_sts = mba_status.size()
group_mba_sts.name = 'total'
group_mba_sts = group_mba_sts.reset_index()
group_mba_sts

Marketing and finance is the specialisation with the most students and also the most placements.
As before, we analyse the normalised data.

In [None]:
mba_normal = group_mba_sts.groupby('specialisation').apply(normal_total)
mba_normal

In [None]:
sns.barplot(x = 'specialisation', y = 'normal_data', hue = 'status',data = mba_normal, order = ['Mkt&HR', 'Mkt&Fin'], hue_order = ['Placed', 'Not Placed'])
plt.title('MBA specialisation vs Status (normalised data)', fontsize=16)

This graph shows that 80% of the students who studied marketing and finance are placed, compared with only 55% of those who chose marketing and HR.
Therefore, students of marketing and finance have a better chance of being placed.
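The placement percentage has now been derived for several columns with the same group, count and normalise steps. As a possible shortcut, the sketch below wraps the calculation in a small helper. `placement_rate` is a hypothetical name, not part of the original analysis, and the data is made up.

```python
import pandas as pd

def placement_rate(df, column):
    """Share of rows with status == 'Placed' for each value of `column`."""
    # Booleans average to the fraction of True values in each group.
    placed = df["status"].eq("Placed")
    return placed.groupby(df[column]).mean()

# Hypothetical toy data with the same column names as data_campus.
toy = pd.DataFrame({
    "specialisation": ["Mkt&Fin", "Mkt&Fin", "Mkt&Fin", "Mkt&Fin", "Mkt&HR", "Mkt&HR"],
    "status": ["Placed", "Placed", "Placed", "Not Placed", "Placed", "Not Placed"],
})

rates = placement_rate(toy, "specialisation")
print(rates)
```

Under these assumptions, calling `placement_rate(data_campus, 'workex')` or `placement_rate(data_campus, 'degree_t')` should return the "Placed" shares plotted in the normalised bar charts.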

* **Salary**

The salary distribution is shown below.

In [None]:
# Salaries of placed students only (unplaced students were filled with 0)
placed_salary = data_campus.salary[data_campus.salary > 0]
mean_salary = placed_salary.mean()
median_salary = placed_salary.median()
# histplot replaces distplot, which is deprecated (needs seaborn 0.11 or newer)
sns.histplot(x=placed_salary, kde=True, stat='density')
plt.axvline(mean_salary,color='red')      # red line: mean
plt.axvline(median_salary,color='green')  # green line: median
plt.title('Salary distribution \n Mean={0:.2f}   Median={1:.2f}'.format(mean_salary,median_salary))


Most salaries lie between 200,000 and 400,000, with a mean of 288655.41 and a median of 265000.
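A mean above the median is what one would expect when a few high salaries stretch the right tail of the distribution. The sketch below illustrates this with made-up numbers (`salaries` is hypothetical): a single large value raises the mean well above the median, and the sample skewness comes out positive.

```python
import pandas as pd

# Hypothetical salaries: five similar values and one much larger one.
salaries = pd.Series([200000, 240000, 250000, 260000, 300000, 900000])

mean_salary = salaries.mean()      # pulled upwards by the large value
median_salary = salaries.median()  # middle of the sorted values, robust to it
skewness = salaries.skew()         # positive for a longer right tail

print(mean_salary, median_salary, skewness)
```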

In [None]:
# Pass the values as y=... to draw a vertical boxplot
sns.boxplot(y=data_campus.salary[data_campus.salary > 0])
plt.title('Boxplot of salary', fontsize=16)

The boxplot is consistent with the salary distribution above.

Finally, we analyse the mean salary according to the MBA specialisation.

In [None]:
mba_salary = data_campus[data_campus.salary > 0].groupby('specialisation')[['salary']].mean()

mba_salary

Students who chose marketing and finance are paid 298852 on average, while those who studied marketing and HR earn 270377 on average.
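Since the mean is sensitive to a few large salaries, it can help to report the group size and the median next to it. The sketch below does this on a made-up frame (`toy` and its values are hypothetical) with `agg`, after dropping the zero salaries of unplaced students.

```python
import pandas as pd

# Hypothetical toy data; a salary of 0 marks an unplaced student.
toy = pd.DataFrame({
    "specialisation": ["Mkt&Fin", "Mkt&Fin", "Mkt&Fin", "Mkt&HR", "Mkt&HR", "Mkt&HR"],
    "salary": [250000, 350000, 0, 200000, 300000, 220000],
})

# Keep placed students only, then summarise salary per specialisation.
placed = toy[toy["salary"] > 0]
summary = placed.groupby("specialisation")["salary"].agg(["count", "mean", "median"])

print(summary)
```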