This is a notebook made as a test for a company interview.
The notebook will be divided into 3 main sections:
1. Data Analysis
    * Includes EDA, filling in missing values, etc.
2. Supervised Learning
    * Using a supervised learning algorithm to predict whether a student will get placed or not. (Classification)
3. Unsupervised Learning
    * Using an unsupervised learning algorithm.

I chose this dataset because, as a recent graduate, I too have struggled with getting placed. I hope this notebook offers some useful insight.

Let's start with **Data Analysis**

# Data Analysis

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import pandas_profiling
import seaborn as sns
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.decomposition import PCA

In [None]:
data = pd.read_csv('../input/factors-affecting-campus-placement/Placement_Data_Full_Class.csv')
#Removing 'sl_no' since it is basically just another index.
data.drop(['sl_no'], axis=1, inplace = True)
data.head()

'pandas_profiling' is a module which helps with EDA. 

It summarises a lot of information without having to code it manually while providing interactive reports.

In [None]:
data.profile_report(title='Campus Placement Data - Report', progress_bar=False)

Some insights from the report:
1. The only missing data is in the 'salary' column, probably because those students were not placed.
1. Based on the correlation data, only 'mba_p' and 'etest_p' seem to show a notable correlation, while the rest are negligible.
1. The data is not on a common scale. Salary has a much larger range, while the rest are percentages.
1. There are more male than female students.
1. Commerce students are the majority, followed by Science and then Arts students.
1. Most students don't have work experience.
1. A majority of students are placed in companies.

Let's deal with missing values first, then outliers, and move on to EDA later

In [None]:
# Check which status values occur among the students whose salary is null
data.loc[data['salary'].isnull(), 'status'].unique()

Every student with a missing salary is 'Not Placed'.

So let's just impute those values with 0.

In [None]:
# Assign the result back; in-place fillna on a column selection is unreliable in newer pandas
data['salary'] = data['salary'].fillna(0)
data.isnull().sum()  # Checking for null values
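
The check above only looks at the rows where salary is missing. As a minimal sketch of how the relationship could be verified in both directions, the example below uses a small made-up frame (the `toy` values are illustrative and are not taken from the dataset):

```python
import pandas as pd

# Toy stand-in for the placement data (values are made up for illustration)
toy = pd.DataFrame({
    "status": ["Placed", "Not Placed", "Placed", "Not Placed"],
    "salary": [270000.0, None, 250000.0, None],
})

# Cross-tabulate placement status against "salary is missing"
table = pd.crosstab(toy["status"], toy["salary"].isnull())
print(table)

# True only if "salary is missing" and "Not Placed" coincide on every row
both_directions = (toy["salary"].isnull() == (toy["status"] == "Not Placed")).all()
print(both_directions)
```

The same two expressions should work on `data` as long as they are run before the missing salaries are filled in.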

With no null values, let's move on to handling outliers

In [None]:
plt.figure(figsize = (15, 10))
plt.style.use('seaborn-white')  # renamed to 'seaborn-v0_8-white' in Matplotlib 3.6; the old name was removed in 3.8
ax=plt.subplot(221)
plt.boxplot(data['ssc_p'])
ax.set_title('Secondary school percentage')
ax=plt.subplot(222)
plt.boxplot(data['hsc_p'])
ax.set_title('Higher Secondary school percentage')
ax=plt.subplot(223)
plt.boxplot(data['degree_p'])
ax.set_title('UG Degree percentage')
ax=plt.subplot(224)
plt.boxplot(data['etest_p'])
ax.set_title('Employability percentage');

The majority of outliers are present in 'hsc_p'. Let's clear them up

In [None]:
Q1 = data['hsc_p'].quantile(0.25)
Q3 = data['hsc_p'].quantile(0.75)
IQR = Q3 - Q1    # IQR is the interquartile range

# Keep rows within 1.5 * IQR of the quartiles ('mask' avoids shadowing the built-in filter)
mask = (data['hsc_p'] >= Q1 - 1.5 * IQR) & (data['hsc_p'] <= Q3 + 1.5 * IQR)
filtered_data = data.loc[mask]
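
To make the 1.5 * IQR rule concrete, here is a small sketch on made-up percentages. The `scores` values are hypothetical and were chosen so that the fences come out as round numbers:

```python
import pandas as pd

# Made-up percentages with one very low and one very high score
scores = pd.Series([20.0, 60.0, 62.0, 65.0, 67.0, 70.0, 72.0, 75.0, 99.0])

q1 = scores.quantile(0.25)
q3 = scores.quantile(0.75)
iqr = q3 - q1

# Anything outside [lower, upper] is treated as an outlier
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
kept = scores[scores.between(lower, upper)]  # between() is inclusive on both ends

print(f"fences: [{lower}, {upper}]")
print(f"dropped: {sorted(set(scores) - set(kept))}")
```

Here the quartiles are 62 and 72, so the fences are 47 and 87, and only the scores 20 and 99 are dropped.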

Here is the comparison, side by side:

In [None]:
plt.figure(figsize = (15, 5))
plt.style.use('seaborn-white')
ax=plt.subplot(121)
plt.boxplot(data['hsc_p'])
ax.set_title('Before removing outliers(hsc_p)')
ax=plt.subplot(122)
plt.boxplot(filtered_data['hsc_p'])
ax.set_title('After removing outliers(hsc_p)');

Now that the outliers are handled, it's time for EDA

'pandas_profiling' did a lot of the EDA for us automatically. What remains is mainly:
* How does each variable affect a student's placement?

Let's start with the **Gender** variable

In [None]:
sns.countplot(x="gender", hue="status", data=data)
plt.show()

* Although there are more male students overall, the non-placed students are split almost equally between male and female students, which indicates that male students have a higher placement rate.
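
The count plot only suggests the difference in placement rate. A rough sketch of how the rate itself could be computed is shown below, on made-up counts (the `toy` frame is hypothetical and only mirrors the pattern described above):

```python
import pandas as pd

# Made-up counts: more male students, equal numbers of non-placed students
toy = pd.DataFrame({
    "gender": ["M"] * 6 + ["F"] * 4,
    "status": ["Placed"] * 4 + ["Not Placed"] * 2      # male rows
            + ["Placed"] * 2 + ["Not Placed"] * 2,     # female rows
})

# Share of placed students within each gender
placement_rate = (toy["status"] == "Placed").groupby(toy["gender"]).mean()
print(placement_rate)
```

Applied to the notebook, the same expression with `data` in place of `toy` should give the actual rates, provided it is run before 'status' is mapped to integers.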

In [None]:
plt.figure(figsize =(18,6))
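# Plot placed students only; unplaced students have a salary of 0 after the fill above
sns.boxplot(x="salary", y="gender", data=data[data["status"] == "Placed"])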
plt.show()

* Male students are placed with a higher salary compared to their female counterparts

Now that the gender variable is out of the way, the remaining variables can be handled in the same way, so we can combine them into subplots.

Let's start with the effect of boards and specialisations on placement status.

In [None]:
plt.figure(figsize=(20, 20))
# One count plot per categorical column, split by placement status
for i, col in enumerate(["ssc_b", "hsc_b", "degree_t", "specialisation"], start=1):
    plt.subplot(2, 3, i)
    sns.countplot(x=col, hue="status", data=data)
plt.show()

* Board of Education does not affect placement much, for either SSC or HSC.
* Science and Commerce degree students are placed at a ratio of roughly 2:1 (placed to not placed).
* Students specialising in Marketing and Finance have a relatively higher chance of getting placed.

Let's see the effect work experience has on placements.

In [None]:
plt.style.use('seaborn-white')
f,ax=plt.subplots(1,2,figsize=(18,8))
filtered_data['workex'].value_counts().plot.pie(explode=[0,0.05],autopct='%1.1f%%',ax=ax[0],shadow=True)
ax[0].set_title('Work experience')
sns.countplot(x='workex', hue="status", data=filtered_data, ax=ax[1])  # draw explicitly on the right-hand axes
ax[1].set_title('Influence of experience on placement')
plt.show()

* The majority of students who got placed had no work experience.
* However, having relevant work experience drastically reduced the chances of not getting placed.
* Looking at the plots, a person with work experience is about 6 times as likely to be placed as not, while a person with no experience is only about 1.6 times as likely.
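
Ratios like these can be computed from a cross-tabulation instead of being read off the plot. The sketch below uses made-up counts, chosen only so that the ratios match the ones quoted above; they are not the dataset's real counts:

```python
import pandas as pd

# Made-up counts (7 students with experience, 13 without)
toy = pd.DataFrame({
    "workex": ["Yes"] * 7 + ["No"] * 13,
    "status": ["Placed"] * 6 + ["Not Placed"] * 1      # with experience
            + ["Placed"] * 8 + ["Not Placed"] * 5,     # without experience
})

# Rows: work experience, columns: placement status
counts = pd.crosstab(toy["workex"], toy["status"])

# Placed students per non-placed student in each group
odds = counts["Placed"] / counts["Not Placed"]
print(odds)
```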

That's all the EDA we need.
Let's move on to predicting the target variable.

# Supervised Learning

I am going to use a tree-based model for the following reasons:
1. Scaling - Our data is not on a common scale, as explained above. Tree-based models split on thresholds within a single feature, so they do not need feature scaling.
1. Categorical variable encoding - Tree-based models split the data on one feature at a time, so label encoding is usually sufficient and one-hot encoding is not required.
1. Accuracy - Ensembles of trees often achieve higher accuracy than linear/logistic regression on data like this, because they combine the predictions of many classifiers.
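
The scaling point can be illustrated with a small sketch. It uses a single `DecisionTreeClassifier` on synthetic data as a stand-in for the forest. The data, the label rule and the factor of 1000 are all assumptions made for the illustration:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic integer "percentage" features and a label that depends on two of them
X_toy = rng.integers(40, 100, size=(200, 3)).astype(float)
y_toy = (X_toy[:, 0] + X_toy[:, 1] > 140).astype(int)

# Same data, but with the first column blown up to a salary-like range
X_rescaled = X_toy.copy()
X_rescaled[:, 0] *= 1000

tree_a = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_toy, y_toy)
tree_b = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_rescaled, y_toy)

# Rescaling a feature moves the split thresholds but not the resulting partitions
same = np.array_equal(tree_a.predict(X_toy), tree_b.predict(X_rescaled))
print(same)
```

A distance-based or gradient-based model would generally not give identical predictions after such a rescaling, which is why scaling matters more for those models.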

Before we do that, let's change categorical variables to int values

In [None]:
data["gender"] = data.gender.map({"M":0,"F":1})
data["hsc_s"] = data.hsc_s.map({"Commerce":0,"Science":1,"Arts":2})
data["degree_t"] = data.degree_t.map({"Comm&Mgmt":0,"Sci&Tech":1, "Others":2})
data["workex"] = data.workex.map({"No":0, "Yes":1})
data["status"] = data.status.map({"Not Placed":0, "Placed":1})
data["specialisation"] = data.specialisation.map({"Mkt&HR":0, "Mkt&Fin":1})
data['hsc_b'] = data.hsc_b.map({'Others':0, 'Central':1})
data['ssc_b'] = data.ssc_b.map({'Others':0, 'Central':1})
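
One thing to watch with `map`: any label missing from the dictionary silently becomes NaN. Below is a minimal sketch of a sanity check, using a toy column with a hypothetical label ("Diploma") that does not appear in the mapping:

```python
import pandas as pd

# Toy column with one label that the mapping does not cover
degree = pd.Series(["Comm&Mgmt", "Sci&Tech", "Others", "Diploma"])
mapping = {"Comm&Mgmt": 0, "Sci&Tech": 1, "Others": 2}

encoded = degree.map(mapping)

# Labels that were turned into NaN because they are missing from the mapping
unmapped = degree[encoded.isnull()].unique()
print(unmapped)
```

In the notebook, running `data.isnull().sum()` again after the mappings would serve the same purpose.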

In [None]:
#drop 'salary' column since it will lead to target leakage
y=data['status']
X = data.drop(['salary','status'], axis=1)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3, random_state=42)

In [None]:
model = RandomForestClassifier(n_estimators=100, random_state=42)  # fixed seed so the score is reproducible
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

In [None]:
accuracy_score(y_test, y_pred)

The accuracy of a simple Random Forest model is about 80% here. The exact figure varies slightly between runs unless the forest's `random_state` is fixed.
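
Since most students in this dataset are placed, it helps to read an accuracy figure against a majority-class baseline. The sketch below shows the idea on made-up labels. The arrays are hypothetical and are not the model's actual predictions:

```python
import numpy as np

# Made-up labels: 1 = placed, 0 = not placed
y_true = np.array([1, 1, 1, 0, 1, 0, 1, 1, 0, 1])
y_hat = np.array([1, 1, 0, 0, 1, 1, 1, 1, 0, 1])

# Accuracy is the fraction of predictions that match the true labels
accuracy = (y_true == y_hat).mean()

# Baseline: always predict the most common class in y_true
majority = np.bincount(y_true).argmax()
baseline = (y_true == majority).mean()

print(f"accuracy: {accuracy:.2f}, baseline: {baseline:.2f}")
```

With these toy labels the accuracy is 0.80 against a baseline of 0.70, so the margin over the baseline is smaller than the raw accuracy suggests.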

Tree based algorithms can be used to compute feature importance

In [None]:
# One row per feature, with importance expressed as a percentage
imp = pd.DataFrame({
    "Classifier": "RandomForest",
    "Feature": X.columns,
    "Importance": 100 * model.feature_importances_,
})

In [None]:
plt.figure(figsize=(15,5))
sns.barplot(x="Feature", y="Importance", data=imp)  # recent seaborn versions require keyword arguments here
plt.title("Computed Feature Importance")
plt.show()

As the plot shows, the percentage variables are more important than the earlier EDA suggested.

# Unsupervised Learning: PCA

PCA (Principal Component Analysis) is an unsupervised technique that reduces dimensionality by projecting the features onto a smaller number of uncorrelated components, ordered by how much variance they capture.

In [None]:
pca = PCA(n_components=2)
X_new = pca.fit_transform(X)
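
To show what the projection does under the hood, here is a rough NumPy-only sketch of PCA via the singular value decomposition, run on synthetic data. The standardisation step is an added assumption: scikit-learn's `PCA` centres the data but does not rescale it, so features with a larger range dominate the components unless they are standardised first.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 100 samples, 5 features on very different scales
X_toy = rng.normal(size=(100, 5)) * np.array([1.0, 5.0, 10.0, 50.0, 100.0])

# 1. Standardise so that no feature dominates just because of its units
Z = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

# 2. SVD of the standardised matrix; the rows of Vt are the principal directions
U, S, Vt = np.linalg.svd(Z, full_matrices=False)

# 3. Project onto the first two directions
Z_2d = Z @ Vt[:2].T

# Fraction of the total variance captured by each component
explained = S**2 / np.sum(S**2)
print(Z_2d.shape, explained[:2].sum())
```

In the notebook, `pca.explained_variance_ratio_` reports the same kind of fractions for the fitted components.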

Let's plot the data before and after PCA. The "before" plot shows only the first two columns of X.

In [None]:
fig, axes = plt.subplots(1,2)

axes[0].scatter(X.iloc[:,0], X.iloc[:,1], c=y)
axes[0].set_xlabel('x1')
axes[0].set_ylabel('x2')
axes[0].set_title('Before PCA')

axes[1].scatter(X_new[:,0], X_new[:,1], c=y)
axes[1].set_xlabel('PC1')
axes[1].set_ylabel('PC2')
axes[1].set_title('After PCA')

plt.show()

# References
1. https://www.kaggle.com/benroshan/you-re-hired-analysis-on-campus-recruitment-data
1. https://www.kaggle.com/atishadhikari/placement-dataanalysis-classification-regression
1. https://towardsdatascience.com/pca-clearly-explained-how-when-why-to-use-it-and-feature-importance-a-guide-in-python-7c274582c37e