# College Placement Dataset 

* This dataset covers the placement statistics of an MBA college.
* It is available on Kaggle: https://www.kaggle.com/benroshan/factors-affecting-campus-placement
* Thanks to Ben Roshan D for providing the dataset.
* Here we use machine learning to predict a student's chance of placement and the salary offered if placed.

## We can get started by importing the Dataset

In [None]:
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt

In [None]:
import warnings
warnings.simplefilter("ignore")

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
df = pd.read_csv('/kaggle/input/factors-affecting-campus-placement/Placement_Data_Full_Class.csv')
# Read the dataset from the Kaggle input directory (adjust the path if you run this elsewhere)
df.head() #prints the first 5 rows of the dataset

## Understanding the columns in the DataFrame
* <b>sl_no</b>          : The id no. of the student 
* <b>gender</b>         : The gender of the student
* <b>ssc_p</b>          : The percentage of marks obtained in SSC (Secondary School Certificate)
* <b>ssc_b</b>          : The board in which the student has studied SSC
* <b>hsc_p</b>          : The percentage of marks obtained in HSC (Higher Secondary Certificate)
* <b>hsc_b</b>          : The board in which the student has studied HSC 
* <b>hsc_s</b>          : The subject chosen for HSC
* <b>degree_p</b>       : The percentage of marks obtained in Degree
* <b>degree_t</b>       : The subject chosen for Degree
* <b>workex</b>         : Work Experience of the student 
* <b>etest_p</b>        : Employability Test Percentage
* <b>specialisation</b> : Specialization chosen in MBA
* <b>mba_p</b>          : Percentage of marks obtained in MBA
* <b>status</b>         : The placement status of the student
* <b>salary</b>         : The salary offered to the students who are placed 

In [None]:
df.info()

In [None]:
df.describe()

In [None]:
df.shape

In [None]:
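# pandas_profiling has been renamed to ydata-profiling; in newer environments use `from ydata_profiling import ProfileReport`
from pandas_profiling import ProfileReport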
report = ProfileReport(df,title='Summary Report of Student Placements')
report

## Takeaways from the Profile Report 
* <b>32.1%</b> of the values in salary are empty, i.e. 32.1% of the students have not been placed
* There are <b>7</b> Numerical Variables 
* There are <b>7</b> Categorical Variables
* There is <b>1</b> Boolean Variable
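
The missing-value share from the report can be cross-checked directly with pandas. The sketch below uses a small made-up frame (`toy` is a hypothetical stand-in, not the Kaggle data, and the label strings are assumed) to show that the share of empty salaries matches the share of students who were not placed.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the placement data (not the real dataset).
toy = pd.DataFrame({
    "status": ["Placed", "Not Placed", "Placed", "Placed", "Not Placed"],
    "salary": [270000.0, np.nan, 250000.0, 300000.0, np.nan],
})

# Share of missing salaries, as a percentage.
missing_pct = toy["salary"].isna().mean() * 100

# Share of students who were not placed, as a percentage.
not_placed_pct = (toy["status"] == "Not Placed").mean() * 100

print(f"Missing salary: {missing_pct:.1f}%")
print(f"Not placed:     {not_placed_pct:.1f}%")
```

On the real data the same two expressions can be applied to `df` in place of `toy`.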

## EXPLORATORY DATA ANALYSIS

* Plotting the salaries of those who scored more than 70% in SSC, HSC and MBA

In [None]:
greater_70 = (df.ssc_p > 70) & (df.hsc_p > 70) & (df.mba_p >70)

In [None]:
df_70 = df[greater_70]
df_70.shape

* This shows that only 12 students scored more than 70% in SSC, HSC and MBA

In [None]:
plt.hist(df_70.salary.dropna(),bins=20) # salary is NaN for students who were not placed
plt.show()

* Checking the number of students with and without work experience

In [None]:
df['workex'].value_counts() # Since it is a categorical variable, it does not require normalization

* Checking the types of specializations offered and students enrolled

In [None]:
df['specialisation'].value_counts()

* Checking the types of undergraduate degree

In [None]:
df['degree_t'].value_counts()

* Checking the subject background of the students in HSC (+1 and +2)

In [None]:
df['hsc_s'].value_counts()

* Looking at the number of placed and not placed students

In [None]:
df['status'].value_counts()

* Checking the gender difference in the batch

In [None]:
df['gender'].value_counts()

## Using Matplotlib to Visualize the Data 

In [None]:
import matplotlib.pyplot as plt 

## Plotting the range of Salaries Offered

In [None]:
plt.hist(df['salary'].dropna(),bins=20) # salary is NaN for students who were not placed
plt.show()

## Plot for SSC_P vs Salary 

In [None]:
plt.scatter(df['ssc_p'],df['salary'])
plt.xlabel('Percentage in SSC')
plt.ylabel('Salary Offered')
plt.title('Salary offered wrt SSC Percentage')
plt.show()

## Plot for HSC_P vs Salary

In [None]:
plt.scatter(df['hsc_p'],df['salary'])
plt.xlabel('Percentage in HSC')
plt.ylabel('Salary Offered')
plt.title('Salary offered wrt HSC Percentage')
plt.show()

## Plot for Degree_P vs Salary

In [None]:
plt.scatter(df['degree_p'],df['salary'])
plt.xlabel('Percentage in Degree')
plt.ylabel('Salary Offered')
plt.title('Salary offered wrt Degree Percentage')
plt.show()

## Plot for MBA_P vs Salary

In [None]:
plt.scatter(df['mba_p'],df['salary'])
plt.xlabel('Percentage in MBA')
plt.ylabel('Salary Offered')
plt.title('Salary offered wrt MBA Percentage')
plt.show()

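* Salaries above 5,00,000 are outliers and could keep a model from working equally well on new data
* The histogram below therefore shows the salary distribution with those salaries excluded

In [None]:
salary_capped = df.loc[df['salary'] <= 500000, 'salary'] # NaN salaries (not placed) are excluded as well
plt.hist(salary_capped,bins=20)
plt.show()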
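
One pitfall when removing salary outliers from the full table: comparisons with NaN evaluate to False, so a plain `salary <= 500000` filter would also drop every student who was not placed. The sketch below illustrates this on made-up values (`toy` and `cap` are hypothetical names, not part of the notebook).

```python
import numpy as np
import pandas as pd

# Hypothetical toy salaries; NaN marks a student who was not placed.
toy = pd.DataFrame({"salary": [250000.0, 940000.0, np.nan, 300000.0, np.nan, 650000.0]})

cap = 500000  # threshold taken from the text above

# Comparisons with NaN are False, so this mask also drops the unplaced students.
naive = toy[toy["salary"] <= cap]

# Keeping NaN rows explicitly removes only the high-salary outliers.
kept = toy[toy["salary"].isna() | (toy["salary"] <= cap)]

print(len(naive), len(kept))
```

The second form is the one to prefer if the filtered table is later used to predict placement status, since the unplaced students must stay in it.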

## DATA PREPROCESSING

## Checking Whether the Numerical Data is Normally Distributed

<b>Tests to check Normality:</b>
* The Shapiro-Wilk test
* The Anderson-Darling test
* The Kolmogorov-Smirnov test

<b>Visual measures to be implemented:</b>
* Box Plots
* QQ Plots

<b>Why is Normality Required:</b>
* It is a (slightly overstated) fact that formal normality tests almost always reject on the large sample sizes we work with today: as n grows, even the smallest deviation from perfect normality leads to a significant result. And since every dataset has some degree of randomness, no single dataset will be a perfectly normally distributed sample. In applied statistics the question is therefore not whether the data/residuals … are perfectly normal, but whether they are normal enough for the assumptions to hold.
* The code below runs the Shapiro-Wilk test on each numerical column. Where it rejects normality, an additional measure such as a box plot or a QQ plot can help decide whether the null hypothesis should indeed be rejected.

In [None]:
from scipy import stats

# Test the observed values directly; stats.norm.rvs would draw new random samples instead
numeric_cols = ['degree_p','ssc_p','hsc_p','mba_p','salary','etest_p']
for col in numeric_cols:
    # salary is NaN for students who were not placed, so drop missing values first
    stat, p_value = stats.shapiro(df[col].dropna())
    # p < 0.05 rejects the null hypothesis that the column is normally distributed
    verdict = 'Null Rejected' if p_value < 0.05 else 'Null Not Rejected'
    print(f"Stat for {col}: W={stat:.4f}, p={p_value:.4f} ({verdict})")
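
The visual measures listed earlier can also be summarized numerically. As a minimal sketch, `scipy.stats.probplot` computes the points of a QQ plot and the correlation of their straight-line fit; a value close to 1 means the sample quantiles track the normal quantiles. The two samples are synthetic (an assumption for illustration, not the placement data), and `qq_correlation` is a hypothetical helper name.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic samples for illustration only.
normal_sample = rng.normal(loc=65, scale=8, size=500)
skewed_sample = rng.exponential(scale=8, size=500)


def qq_correlation(sample):
    """Return the correlation between sample quantiles and normal quantiles."""
    # Without a plot argument, probplot only computes the QQ points and a least-squares fit.
    (_, _), (_, _, r) = stats.probplot(sample, dist="norm")
    return r


r_normal = qq_correlation(normal_sample)
r_skewed = qq_correlation(skewed_sample)

print(f"QQ correlation, normal sample: {r_normal:.3f}")
print(f"QQ correlation, skewed sample: {r_skewed:.3f}")
```

Passing `plot=plt` to `probplot` draws the QQ plot itself, which could be applied to a column such as `df['mba_p']`.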

* The tests above suggest that not all columns are normally distributed, so we scale the data to a common range
* This can be done using MinMaxScaler from scikit-learn

In [None]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
columns = df[['ssc_p','hsc_p','degree_p','mba_p','etest_p']]
x_scaled = pd.DataFrame(scaler.fit_transform(columns))
x_scaled.columns = ['ssc_p','hsc_p','degree_p','mba_p','etest_p']
x_scaled.reset_index(drop=True, inplace=True)
x_scaled
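
For reference, min-max scaling maps each column to the range [0, 1] via (x - min) / (max - min). The sketch below checks this formula against `MinMaxScaler` on a few made-up percentages (`marks` is a hypothetical array, not a column of the dataset).

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical percentages for a single column.
marks = np.array([[40.0], [55.0], [70.0], [85.0]])

# Manual min-max scaling: (x - min) / (max - min), computed per column.
col_min = marks.min(axis=0)
col_max = marks.max(axis=0)
manual = (marks - col_min) / (col_max - col_min)

# The same transformation with scikit-learn.
scaled = MinMaxScaler().fit_transform(marks)

print(manual.ravel())
print(scaled.ravel())
```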

### Label Encoding for all the Categorical Variables

In [None]:
x_cat = df[['gender','ssc_b','hsc_b','hsc_s','degree_t','specialisation']].copy() # copy so the assignments below do not act on a view of df
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
x_cat['gender'] = le.fit_transform(x_cat.gender)
x_cat['ssc_b'] = le.fit_transform(x_cat.ssc_b)
x_cat['hsc_b'] = le.fit_transform(x_cat.hsc_b)
x_cat['hsc_s'] = le.fit_transform(x_cat.hsc_s)
x_cat['degree_t'] = le.fit_transform(x_cat.degree_t)
x_cat['specialisation'] = le.fit_transform(x_cat.specialisation)
x_cat.reset_index(drop=True, inplace=True)
x_cat

# What are we predicting?
* We are trying to predict the chance of a person getting a placement 
* Therefore we have to make the training and testing sets accordingly 

In [None]:
x = pd.concat([x_cat,x_scaled],join='outer',axis=1)
x.isnull().sum()
x

In [None]:
y = le.fit_transform(df.status)
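
It helps to know which integer each class receives. `LabelEncoder` sorts the labels and numbers them in that order, which the sketch below shows on a short made-up list (the label strings are assumed to match the `status` column; `status_encoder` is a hypothetical name).

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical status values, assumed to match the labels in the dataset.
status = ["Placed", "Not Placed", "Placed", "Placed"]

status_encoder = LabelEncoder()
encoded = status_encoder.fit_transform(status)

# classes_ holds the sorted labels; the position of a label is its code.
mapping = {label: code for code, label in enumerate(status_encoder.classes_)}

print(mapping)
print(encoded.tolist())
```

Under this assumption a prediction of 1 means "Placed", which matters when reading the classification reports further down.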

### Splitting the Data into Training and Testing sets

In [None]:
from sklearn.model_selection import train_test_split as tts

x_train,x_test,y_train,y_test = tts(x,y,test_size=0.3,random_state=42)
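
Because the classes are not balanced, one option worth considering is a stratified split, which keeps the class ratio the same in the training and testing sets. This is a sketch on made-up labels (`X_toy` and `y_toy` are hypothetical), not the split used in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 70 positives and 30 negatives.
y_toy = np.array([1] * 70 + [0] * 30)
X_toy = np.arange(100).reshape(-1, 1)

# stratify=y_toy preserves the 70/30 class ratio in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)

print(f"Positive share in train: {y_tr.mean():.2f}")
print(f"Positive share in test:  {y_te.mean():.2f}")
```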

## Applying Classification Algorithms for prediction 

#### LOGISTIC REGRESSION

In [None]:
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(random_state=42)
lr.fit(x_train,y_train)
y_pred = lr.predict(x_test)
lrscore = lr.score(x_test,y_test)
lrscore

#### KNEIGHBORS CLASSIFIER

In [None]:
from sklearn.neighbors import KNeighborsClassifier
knc = KNeighborsClassifier()
knc.fit(x_train,y_train)
y_pred = knc.predict(x_test)
kncscore = knc.score(x_test,y_test)
kncscore

#### DECISION TREE CLASSIFIER

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics
dtr = DecisionTreeClassifier(random_state=42)
dtr.fit(x_train,y_train)
y_pred = dtr.predict(x_test)

dtcscore =  metrics.accuracy_score(y_test,y_pred)
print(f'Decision Tree Classification Score = {dtcscore*100:4.1f}%\n') # accuracy_score returns a fraction
print(f'Classification Report:\n {metrics.classification_report(y_test, y_pred)}\n')

#### RANDOM FOREST CLASSIFIER

In [None]:
from sklearn.ensemble import RandomForestClassifier

rfr = RandomForestClassifier(n_estimators=10,random_state=42)
rfr.fit(x_train,y_train)

from sklearn import metrics

predicted = rfr.predict(x_test)
rfcscore =  metrics.accuracy_score(y_test, predicted)
print(f'Random Forest Classification Score = {rfcscore*100:4.1f}%\n') # accuracy_score returns a fraction
print(f'Classification Report:\n {metrics.classification_report(y_test, predicted)}\n')

#### RIDGE CLASSIFIER

In [None]:
from sklearn.linear_model import RidgeClassifier
rc = RidgeClassifier(random_state=42)
rc.fit(x_train,y_train)
l_pred = rc.predict(x_test)
rcscore = rc.score(x_test,y_test)
rcscore

#### STOCHASTIC GRADIENT DESCENT CLASSIFIER

In [None]:
from sklearn.linear_model import SGDClassifier
SGDC = SGDClassifier(random_state=42)
SGDC.fit(x_train,y_train)
result = SGDC.predict(x_test)
sgdcscore = SGDC.score(x_test,y_test)
sgdcscore

#### PERCEPTRON

In [None]:
from sklearn.linear_model import Perceptron
p = Perceptron(random_state=42)
p.fit(x_train,y_train)
result = p.predict(x_test)
pscore = p.score(x_test,y_test)
pscore

#### PASSIVE AGGRESSIVE CLASSIFIER

In [None]:
from sklearn.linear_model import PassiveAggressiveClassifier
pac = PassiveAggressiveClassifier(random_state=42)
pac.fit(x_train,y_train)
result_pac = pac.predict(x_test)
pacscore = pac.score(x_test,y_test)
pacscore

#### SUPPORT VECTOR CLASSIFIER

In [None]:
from sklearn.svm import SVC
svc = SVC(random_state=42)
svc.fit(x_train,y_train)
y_pred = svc.predict(x_test)
svcscore = svc.score(x_test,y_test)
svcscore

#### BAGGING CLASSIFIER

In [None]:
from sklearn.ensemble import BaggingClassifier
bc = BaggingClassifier(random_state=43)
bc.fit(x_train,y_train)
y_pred = bc.predict(x_test)
bcscore = bc.score(x_test,y_test)
bcscore

In [None]:
d = {'Algorithms Used': ['Logistic Regression','K Neighbors Classifier','Decision Tree Classifier','Random Forest Classifier',
                         'Ridge Classifier','Stochastic Gradient Descent','Perceptron','Passive Aggressive Classifier',
                        'Support Vector Classifier','Bagging Classifier'],
    'Accuracy Achieved': [lrscore,kncscore,dtcscore,rfcscore,rcscore,sgdcscore,pscore,pacscore,svcscore,bcscore]}

In [None]:
Accuracy_df = pd.DataFrame(d)
Accuracy_df = Accuracy_df.sort_values(by=['Accuracy Achieved'],ascending=False)
Accuracy_df
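
With a small dataset, accuracy from a single train/test split can change noticeably with the random seed. As a hedged alternative, the sketch below compares a few of the classifiers with 5-fold cross-validation on synthetic data from `make_classification` (an assumption for illustration; the `models` dictionary and its contents are hypothetical choices).

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the prepared features and labels.
X_toy, y_toy = make_classification(n_samples=200, n_features=8, random_state=42)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000, random_state=42),
    "K Neighbors Classifier": KNeighborsClassifier(),
    "Decision Tree Classifier": DecisionTreeClassifier(random_state=42),
}

# Mean accuracy over 5 folds for each model.
cv_scores = {
    name: cross_val_score(model, X_toy, y_toy, cv=5).mean()
    for name, model in models.items()
}

for name, score in sorted(cv_scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<26} {score:.3f}")
```

The same loop could be run on `x` and `y` from this notebook to see whether the ranking of the algorithms is stable.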

* We have achieved the highest accuracy using <b>PASSIVE AGGRESSIVE CLASSIFIER</b>

## FEATURE SELECTION  (MODEL OPTIMIZATION)

* We will continue with the Passive Aggressive Classifier for the subsequent steps
* Feature selection shows which variables affect the result the most

### Wrapper Methods - RECURSIVE FEATURE ELIMINATION

In [None]:
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.feature_selection import RFE

pac = PassiveAggressiveClassifier(random_state=42)
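rfe = RFE(pac,n_features_to_select=1) # must be passed by keyword in recent scikit-learn versions
rfe.fit(x_train,y_train)
# Pair each ranking with its column name and print from most to least important
for rank, name in sorted(zip(rfe.ranking_,x.columns), key=lambda pair: pair[0]):
    print(f'{name:>18} rank = {rank}')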

## EVALUATION METRICS

* The major evaluation metrics used for a classification problem are:
* <b>Accuracy Score</b>
* <b>Classification Report</b>
* <b>Confusion Matrix</b>

In [None]:
from sklearn import metrics
import seaborn as sns

In [None]:
matrix = metrics.confusion_matrix(y_test,result_pac)
report = metrics.classification_report(y_test,result_pac)
print(f'Confusion Matrix:\n {matrix}\n')
print(f'Classification Report:\n {report}\n')
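
To make the confusion matrix easier to read, it can be wrapped in a labelled DataFrame: rows are the true classes and columns are the predicted classes. The sketch below uses made-up labels (`y_true` and `y_hat` are hypothetical, and the class names assume 0 = Not Placed, 1 = Placed) and shows how accuracy follows from the diagonal.

```python
import pandas as pd
from sklearn import metrics

# Hypothetical true and predicted labels (0 = Not Placed, 1 = Placed).
y_true = [0, 0, 1, 1, 1, 0]
y_hat = [0, 1, 1, 1, 0, 0]

cm = metrics.confusion_matrix(y_true, y_hat)

# Rows are true classes, columns are predicted classes.
cm_df = pd.DataFrame(
    cm,
    index=["true Not Placed", "true Placed"],
    columns=["pred Not Placed", "pred Placed"],
)
print(cm_df)

# Accuracy is the share of predictions on the diagonal.
accuracy = cm.trace() / cm.sum()
print(f"Accuracy = {accuracy:.3f}")
```

A frame like `cm_df` can also be passed to `sns.heatmap(cm_df, annot=True)` if a plot is preferred.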

## MODEL SELECTION (HYPERPARAMETER TUNING)

In [None]:
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import StratifiedKFold
from time import time
from sklearn.linear_model import PassiveAggressiveClassifier

# Start clock
start = time()

pac = PassiveAggressiveClassifier()
skf = StratifiedKFold(n_splits=10)

fit_intercept = [True]
validation_fraction = [0.1,0.2,0.3,0.4,0.5,0.6] # only used when early_stopping=True
loss = ['hinge','squared_hinge']
random_state = [42,33]
class_weight = ['balanced',None] # 'weight' is not a valid class_weight value

# Create a dictionary of hyperparameters and values
params = {'fit_intercept':fit_intercept, 'validation_fraction':validation_fraction,'loss':loss,'random_state':random_state,'class_weight':class_weight}

# Number of random parameter samples
num_samples = 20

# Run randomized search
rscv = RandomizedSearchCV(pac, param_distributions=params, n_iter=num_samples, cv=skf, random_state=23)

# Fit randomized search estimator and display results
rscv.fit(x_train, y_train)

print(f'Compute time = {time() - start:4.2f} seconds', end='')
print(f' for {num_samples} parameter combinations')
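
To put the number of random samples in context, `ParameterGrid` can count how many distinct combinations the search space contains. The grid below repeats the lists from the cell above, with `class_weight` restricted to the values scikit-learn accepts (`'balanced'` and `None`); `grid` is a hypothetical name.

```python
from sklearn.model_selection import ParameterGrid

# Search space as in the cell above, limited to valid class_weight values.
grid = {
    "fit_intercept": [True],
    "validation_fraction": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
    "loss": ["hinge", "squared_hinge"],
    "random_state": [42, 33],
    "class_weight": ["balanced", None],
}

# Total number of distinct combinations: 1 * 6 * 2 * 2 * 2.
n_combinations = len(ParameterGrid(grid))
print(n_combinations)
```

With 20 random samples the search therefore covers a bit under half of this space.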

In [None]:
# Get best estimator
be = rscv.best_estimator_

# Display parameter values
print(f'Best fit_intercept={be.get_params()["fit_intercept"]}')
print(f'Best validation_fraction={be.get_params()["validation_fraction"]}')
print(f'Best loss={be.get_params()["loss"]}')
print(f'Best Class_Weight={be.get_params()["class_weight"]}')

# Display best score
print(f'Best CV Score = {rscv.best_score_:4.3f}')