# Will You Get a Job or Not ?

# Welcome To This Notebook
1. I have tried my best to explain every small detail in this notebook.
2. I have compared several popular machine learning models in the data modelling section to find the best one :)
3. Don't forget to upvote this notebook if you really liked it :) That encourages me to upload more interesting notebooks like this in the future.

# Required Libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier, GradientBoostingClassifier, VotingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import StratifiedKFold, StratifiedShuffleSplit, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import auc, roc_curve, classification_report, confusion_matrix, roc_auc_score

### Optional Settings

In [None]:
# %matplotlib inline

In [None]:
# plt.rcdefaults()
# sns.set_style()

In [None]:
# plt.rc("figure", figsize=[9, 5])
# plt.style.use("seaborn")
# sns.set(rc={"figure.figsize": [9, 5]})

# Let's Look at the Data

In [None]:
data = pd.read_csv("../input/factors-affecting-campus-placement/Placement_Data_Full_Class.csv")
data.drop(labels=["sl_no"], axis=1, inplace=True)

### Take a Sneak Peek :)

In [None]:
data.head()

In [None]:
data.dtypes

###### The data types of the attributes are in the right form 

In [None]:
data.info()

###### As you can see, among the 14 attributes (columns) there is one attribute, salary, with 67 missing values (215 - 67 = 148 non-null values) :(

### Optional

In [None]:
data.isna().sum().sort_values(ascending=False)

In [None]:
data.loc[data["salary"].isna(), :]    # You can also use this code :) -->  data[data["salary"].isna()]
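A quick way to check whether the missing salaries belong only to students who were not placed is sketched below. The tiny DataFrame is made-up illustration data (an assumption), not the real file, so run the same comparison on `data` to confirm it for this dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real placement file
toy = pd.DataFrame({
    "status": ["Placed", "Not Placed", "Placed", "Not Placed"],
    "salary": [270000.0, np.nan, 250000.0, np.nan],
})

# True when salary is missing exactly for the "Not Placed" rows
missing_only_for_not_placed = (toy["salary"].isna() == (toy["status"] == "Not Placed")).all()
print(missing_only_for_not_placed)
```

If the same check holds on the real data, the missing values are not a data quality problem but simply mean "no salary because no placement".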

# Descriptive Analysis - Numeric Data

In [None]:
data.describe()

# Exploratory Data Analysis

### Class Imbalance Check !

In [None]:
fig = plt.figure()
ax = fig.add_subplot()

sns.countplot(x="status", data=data, ax=ax)
plt.title("Target Class Distribution")
plt.xlabel("Class Label")
plt.show()

###### It shows that 148 students (69%) have been placed and 67 students (31%) have not. So our dataset is imbalanced, but only moderately :(
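The percentages can be computed directly instead of being read off the chart. This is a small sketch that rebuilds a label column from the two counts quoted above; on the real data, `data["status"].value_counts(normalize=True)` should give the same proportions.

```python
import pandas as pd

# Class counts quoted in the text above: 148 placed, 67 not placed
status = pd.Series(["Placed"] * 148 + ["Not Placed"] * 67, name="status")

# normalize=True returns proportions instead of raw counts
proportions = status.value_counts(normalize=True).round(2)
print(proportions)
```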

### Does Gender Impact Placement ?

In [None]:
fig = plt.figure()
ax = fig.add_subplot()

sns.countplot(x="gender", hue="status", data=data, ax=ax)
plt.xlabel("Gender")
plt.title("Gender vs Placement")
plt.show()

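###### The answer looks like Yes, because more male students than female students have been placed. Among the students who were not placed, the difference is small. Keep in mind that these are raw counts, so the gap may partly reflect how many male and female students are in the data. If you think it is unfair, feel free to post your opinion in the comment section.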

### Does Specialisation Matter ?

In [None]:
fig = plt.figure()
ax = fig.add_subplot()

sns.countplot(x="specialisation", hue="status", data=data, ax=ax)
plt.xlabel("Specialisation")
plt.title("Specialisation vs Placement")
plt.show()

###### Students who took the "Mkt&Fin" specialisation have a better chance of getting placed than students with the "Mkt&HR" specialisation. But don't worry, 53 (36%) of the 148 placed students are from "Mkt&HR" :)

### Does Work Experience Help You Get Placed ?

In [None]:
fig = plt.figure()
ax = fig.add_subplot()

sns.countplot(x="workex", hue="status", data=data, ax=ax)
plt.xlabel("Work Experience")
plt.title("Work Experience vs Placement")
plt.show()

###### It really looks weird, but you have to accept it. If you are a student, you have probably heard from your professor or someone else that "work experience through an internship or a project really helps you to get placed". But here you can see that most of the students who have been placed do not have any work experience :( Note that this compares raw counts, so it may simply reflect that students without work experience are the larger group.
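Count plots compare group sizes, not chances. A placement rate per group is often more telling, and `pd.crosstab` with `normalize="index"` gives it in one line. The toy numbers here are invented for illustration (an assumption) and are not taken from the placement data.

```python
import pandas as pd

# Hypothetical toy data: most students have no work experience
toy = pd.DataFrame({
    "workex": ["No"] * 8 + ["Yes"] * 4,
    "status": ["Placed"] * 5 + ["Not Placed"] * 3 + ["Placed"] * 3 + ["Not Placed"] * 1,
})

# Raw counts per group
counts = pd.crosstab(toy["workex"], toy["status"])

# Row-wise proportions: share of placed / not placed within each group
rates = pd.crosstab(toy["workex"], toy["status"], normalize="index")

print(counts)
print(rates.round(2))
```

In this toy example the "No" group has more placed students in absolute numbers, yet the "Yes" group has the higher placement rate. The same two lines on `data["workex"]` and `data["status"]` would show which case applies to the real dataset.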

### Which Degree has More Placements ?

In [None]:
fig = plt.figure()
ax = fig.add_subplot()

sns.countplot(x="degree_t", hue="status", data=data, ax=ax)
plt.title("Degree Priority for Placement")
plt.xlabel("Degree")
plt.show()

###### Recruiters pick the most students from the "Comm&Mgmt" degree. They also recruit more "Sci&Tech" students than students from the "Others" category, which is a good sign for Science and Technology students :)

### Does Board of Education Matter ?

In [None]:
fig = plt.figure()
ax = fig.add_subplot()

sns.countplot(x="ssc_b", hue="status", data=data, ax=ax)
plt.title("Board of Education vs Placement")
plt.xlabel("Board of Education")
plt.show()

###### It looks like students from both boards of education have nearly equal (similar but not exactly the same) chances of getting placed :)

### Does Higher Secondary Group Matter ?

In [None]:
fig = plt.figure()
ax = fig.add_subplot()

sns.countplot(x="hsc_s", hue="status", data=data, ax=ax)
plt.xlabel("HSC Groups")
plt.title("HSC Groups vs Placement")
plt.show()

###### I guess the recruiters mostly choose "Commerce" students, which matches the "Degree Priority" chart above, where they mostly selected students with a "Comm&Mgmt" degree. But don't worry, they have also recruited many "Science" group students, which is a good sign for me :)

### Does the Employability Test Help in Getting a Job?

In [None]:
fig = plt.figure()
ax = fig.add_subplot()

sns.barplot(x="status", y="etest_p", data=data, ax=ax, ci=None)
plt.title("Employability Test vs Placement")
plt.xlabel("Status")
plt.ylabel("Employability Test")
plt.show()

###### Again, you can't be confident of getting placed just because you did well in the Employability Test :(

### Do High Percentage Holders Get Placed More Than Low Percentage Holders ?

In [None]:
fig = plt.figure(figsize=[11, 6])
ax = fig.add_subplot()

# Column names are enough when data= is given; ci has no effect on a scatter plot, so it is dropped
sns.scatterplot(x="ssc_p", y="hsc_p", hue="status", style="ssc_b", size="hsc_s", data=data, ax=ax)
plt.xlabel("Secondary School Percentage")
plt.ylabel("Higher Secondary School Percentage")
plt.show()

###### Now it is clear that students who scored more than 60% in both secondary and higher secondary school, and who chose commerce as their higher secondary group, have been placed more often than the other students :)

Note:
1. Here we are only concerned with predicting whether a student will get placed or not, which is a classification problem.
2. That's why I haven't included the salary attribute (column) in the EDA.
3. However, if you are interested in predicting the salary of placed students (only they have a salary value), you can take it as homework :)

# Feature Engineering

### Categorical Data Encoding

In [None]:
categorical_variables = data.select_dtypes(include="object").columns.tolist()

treat_not_as_same = ["degree_t", "hsc_s"]

treat_as_same = [var for var in categorical_variables if var not in treat_not_as_same]

###### Why am I treating "degree_t" and "hsc_s" differently ? The other categorical variables have only two categories each, so mapping them to 0 and 1 is enough. "degree_t" and "hsc_s" have three categories each with no natural order. Mapping them to 0, 1, 2 would tell the model that one category is bigger than another (for example Sci&Tech > Comm&Mgmt > Others), which is not meaningful, so they are one-hot encoded instead.
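The cells in this section use two encodings: an integer mapping and one-hot encoding via `pd.get_dummies`. Here is a small sketch of the difference on a made-up four-row column (the rows are hypothetical, the category names come from the dataset).

```python
import pandas as pd

# Hypothetical four-row sample of the degree_t column
toy = pd.DataFrame({"degree_t": ["Sci&Tech", "Comm&Mgmt", "Others", "Comm&Mgmt"]})

# Integer mapping: compact, but the codes 0 < 1 < 2 imply an order between categories
codes = {label: i for i, label in enumerate(toy["degree_t"].unique())}
as_integers = toy["degree_t"].map(codes)

# One-hot encoding: one 0/1 indicator column per category, no implied order
as_dummies = pd.get_dummies(toy["degree_t"], dtype=int)

print(as_integers.tolist())
print(as_dummies)
```

Each row of the one-hot result has exactly one 1, so no category is "larger" than another.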

In [None]:
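# Binary categorical columns: map each category to an integer code (0 or 1).
# The last entry, "status", is skipped here and mapped explicitly below.
for var in treat_as_same[:-1]:
    dict_to_map = {j: i for i, j in enumerate(data[var].unique())}
    data[var] = data[var].map(dict_to_map)

data["status"] = data["status"].map({"Not Placed": 0, "Placed": 1})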

In [None]:
for var in treat_not_as_same:
    data = pd.concat(objs=[data, pd.get_dummies(data=data[var])], axis=1)
    data.drop(labels=var, axis=1, inplace=True)
    
data.drop(labels="salary", axis=1, inplace=True)

### Feature Selection

In [None]:
fig = plt.figure(figsize=[11, 6])
ax = fig.add_subplot()

sns.heatmap(data=data.corr(), cmap="RdYlGn", annot=True, fmt=".2f", ax=ax)
plt.title("Correlation Among all Variables")
plt.show()

###### As you can see, there are some negative and positive correlations among the one-hot encoded variables. The negative ones are expected, because the dummy columns of one variable exclude each other.

In [None]:
# Cast to float so every column, including the dummy columns, is plainly numeric
X = data.drop(labels="status", axis=1).values.astype(float)
y = data["status"].values

# Model Cross Validation

Note:
1. I'm doing cross validation before building the final models because I want to know which algorithm gives good results.
2. Then I can put more effort into that model to get the best result I can :)

In [None]:
# The AdaBoost base estimator is passed positionally, because the keyword was
# renamed from base_estimator to estimator in scikit-learn 1.2
models = [("Logistic Regression", LogisticRegression(random_state=0, n_jobs=-1)),
         ("Linear SVM", SVC(kernel="linear", random_state=0)),
         ("RBF SVM", SVC(kernel="rbf", random_state=0)),
         ("Decision Tree", DecisionTreeClassifier(random_state=0)),
         ("Random Forest", RandomForestClassifier(n_jobs=-1, random_state=0)),
         ("Adaboost RF", AdaBoostClassifier(RandomForestClassifier(n_jobs=-1, random_state=0), random_state=0, learning_rate=0.1)),
         ("Adaboost DT", AdaBoostClassifier(DecisionTreeClassifier(random_state=0), random_state=0, learning_rate=0.1)),
         ("Gradient Boosting", GradientBoostingClassifier(random_state=0))]

In [None]:
stratified = StratifiedKFold()
model_details = {name: [] for name, _ in models}
needs_scaling = ["Logistic Regression", "Linear SVM", "RBF SVM"]

for train_index, test_index in stratified.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

    # Fit the scaler once per fold, on the training part only
    std = StandardScaler()
    scaled_train = std.fit_transform(X_train)
    scaled_test = std.transform(X_test)

    for name, model in models:
        # Scale-sensitive models get standardized features, tree-based models get the raw ones
        if name in needs_scaling:
            fold_train, fold_test = scaled_train, scaled_test
        else:
            fold_train, fold_test = X_train, X_test

        model.fit(fold_train, y_train)
        train_accuracy = model.score(fold_train, y_train)
        test_accuracy = model.score(fold_test, y_test)
        model_details[name].append((train_accuracy, test_accuracy))

In [None]:
summary_df = pd.DataFrame(index=["Train Score", "Test Score"])

for model, accuracy in model_details.items():
    train_scores = [train for train, _ in accuracy]
    test_scores = [test for _, test in accuracy]
    summary_df[model] = [np.mean(train_scores), np.mean(test_scores)]

In [None]:
summary_df.T

###### From the above table we can conclude that the Support Vector Machine with the "RBF" kernel is doing better than the other models. Note that we can't stop right here just because the accuracy is good; we need to evaluate our models using other metrics too.
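A more compact way to cross validate a model that needs scaling is to put the scaler and the model into one `Pipeline`, so the scaler is refit on the training part of every fold automatically. This is only a sketch on synthetic data from `make_classification` (an assumption, not the placement features); `X_demo`, `y_demo` and `svm_pipeline` are illustrative names.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the placement features (assumption, not the real data)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# The pipeline refits the scaler on the training part of every fold
svm_pipeline = make_pipeline(StandardScaler(), SVC(kernel="rbf", random_state=0))

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(svm_pipeline, X_demo, y_demo, cv=cv)

print("fold accuracies:", scores.round(3))
print("mean accuracy  :", scores.mean().round(3))
```

With the real data you would pass `X` and `y` instead of the synthetic arrays.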

## Logistic Regression

Note:
1. You can also use train_test_split here to get the train and test datasets.
2. But why am I using StratifiedShuffleSplit here ?
3. The answer is that our dataset is somewhat imbalanced, and stratification ensures that both the train and test sets keep roughly the same class proportions as the full data.

In [None]:
stratified_split = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=0)

# The loop overwrites the variables on every iteration, so only the last of the 5 splits is kept
for train_index, test_index in stratified_split.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
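To see what stratification buys us, the sketch below splits a label vector with the same class counts as our target (148 placed, 67 not placed) and compares the share of placed students in each part. The feature matrix is a placeholder of zeros, since only the labels matter for this check.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Labels with the class counts quoted earlier: 148 placed (1), 67 not placed (0)
y_demo = np.array([1] * 148 + [0] * 67)
X_demo = np.zeros((len(y_demo), 1))  # placeholder features, only the labels matter here

splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(X_demo, y_demo))

print("overall placed share:", round(y_demo.mean(), 3))
print("train placed share  :", round(y_demo[train_idx].mean(), 3))
print("test placed share   :", round(y_demo[test_idx].mean(), 3))
```

All three shares should come out close to each other, which is what keeps the test set representative of both classes.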

In [None]:
std = StandardScaler()
scaled_train = std.fit_transform(X_train)
scaled_test = std.transform(X_test)

In [None]:
logistic = LogisticRegression(random_state=0, n_jobs=-1)
logistic.fit(scaled_train, y_train)
print("Logistic Regression Test Score :", logistic.score(scaled_test, y_test))

### Confusion Matrix

In [None]:
con_mat = pd.DataFrame(confusion_matrix(y_test, logistic.predict(scaled_test)))
fig = plt.figure(figsize=[6, 4])
ax = fig.add_subplot()

sns.heatmap(con_mat, annot=True, fmt=".2f", cmap="RdYlGn", cbar=False, ax=ax)
plt.title("Logistic Regression Confusion Matrix")
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.show()

###### Logistic Regression has proved to be a good classifier. However, we have 4 False Positives (students who are not eligible for placement classified as eligible) and 4 False Negatives (students who are eligible for placement classified as not eligible). But don't worry, we will try to reduce these Type 1 and Type 2 errors :)

###### Note: In this problem we should reduce False Negatives more than False Positives, because we don't want to miss any student who is eligible for placement but is predicted as not eligible :(
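One simple lever for trading False Negatives against False Positives is the decision threshold applied to the predicted probability. This is not something the notebook does; it is a sketch on hypothetical labels and probabilities (both invented for illustration), and `fp_fn` is a helper name made up here.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels and predicted probabilities of class 1 ("Placed")
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
y_prob = np.array([0.10, 0.30, 0.45, 0.60, 0.35, 0.40, 0.55, 0.70, 0.80, 0.90])

def fp_fn(threshold):
    """Return (false positives, false negatives) at a given threshold."""
    y_pred = (y_prob >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return int(fp), int(fn)

print("threshold 0.5:", fp_fn(0.5))
print("threshold 0.3:", fp_fn(0.3))
```

On this toy input, lowering the threshold from 0.5 to 0.3 removes both False Negatives but raises the False Positives from 1 to 3, so the threshold should be chosen according to which error costs more.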

### ROC Curve

In [None]:
fpr, tpr, thresholds = roc_curve(y_test, logistic.predict_proba(scaled_test)[:, 1])
auc_score = auc(fpr, tpr)

fig = plt.figure(figsize=[9, 5])
ax = fig.add_subplot()

plt.plot(fpr, tpr, c="darkred", lw=2, label="AUC = {}".format(round(auc_score, 2)))
plt.plot([0, 1], [0, 1], c='black', lw=2, ls='--', label="AUC = {}".format(0.5))
plt.title("ROC Curve - Logistic Regression")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend(loc=4)
plt.show()

### Hyperparameter Tuning 

In [None]:
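C = [100, 10, 1.0, 0.1, 0.01]
penalty = ['l2', 'l1']
param = {"C": C, "penalty": penalty}

# The default "lbfgs" solver does not support the l1 penalty, so "liblinear" is used here
grid = GridSearchCV(estimator=LogisticRegression(solver="liblinear", random_state=0), param_grid=param, n_jobs=-1)
grid.fit(scaled_train, y_train)

# best_estimator_ is the model refit with the best parameters; grid.estimator is only the untuned template
grid_model = grid.best_estimator_
test_score = grid_model.score(scaled_test, y_test)
print("Best Parameters :", grid.best_params_)
print("Tuned Logistic Regression Test Score :", test_score)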

### Classification Report

In [None]:
print(classification_report(y_test, grid_model.predict(scaled_test)))

###### Compare this report with the untuned Logistic Regression score above. On a small dataset like this one, hyperparameter tuning may change the test accuracy very little, and it can even stay the same.
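After a grid search, the tuned model lives in `best_estimator_`, while `estimator` is still the untouched template that was passed in. The sketch below shows the difference on synthetic data (an assumption, not the placement data); `grid_demo` is an illustrative name.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption, not the placement dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

grid_demo = GridSearchCV(LogisticRegression(max_iter=1000), param_grid={"C": [0.01, 0.1, 1.0, 10]})
grid_demo.fit(X_demo, y_demo)

print("best parameters :", grid_demo.best_params_)
print("template C      :", grid_demo.estimator.C)        # still the default value
print("tuned model C   :", grid_demo.best_estimator_.C)  # the value chosen by the search
```

Refitting `grid_demo.estimator` would therefore reproduce the untuned baseline, which is why the score and the reports should be taken from `best_estimator_`.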

## Support Vector Machine

In [None]:
svc = SVC(kernel="rbf", random_state=0, probability=True)
svc.fit(scaled_train, y_train)
print("SVM Test Score :", svc.score(scaled_test, y_test))

### Confusion Matrix

In [None]:
con_mat = pd.DataFrame(confusion_matrix(y_test, svc.predict(scaled_test)))
fig = plt.figure(figsize=[6, 4])
ax = fig.add_subplot()

sns.heatmap(con_mat, annot=True, fmt=".2f", cmap="RdYlGn", cbar=False, ax=ax)
plt.title("SVM Confusion Matrix")
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.show()

###### Wow, our SVM model has reduced the False Positive count to 1 which is pretty amazing :)

### ROC Curve

In [None]:
fpr, tpr, thresholds = roc_curve(y_test, svc.predict_proba(scaled_test)[:, 1])
auc_score = auc(fpr, tpr)

fig = plt.figure(figsize=[9, 5])
ax = fig.add_subplot()

plt.plot(fpr, tpr, c="darkred", lw=2, label="AUC = {}".format(round(auc_score, 2)))
plt.plot([0, 1], [0, 1], c='black', lw=2, ls='--', label="AUC = {}".format(0.5))
plt.title("ROC Curve - SVM")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend(loc=4)
plt.show()

### Hyperparameter Tuning

In [None]:
C = [100, 10, 1.0, 0.1, 0.01]
kernel = ["linear", "rbf", "poly"]    # kernel names are lowercase in scikit-learn
param = {"C": C, "kernel": kernel}

grid = GridSearchCV(estimator=SVC(random_state=0), param_grid=param, n_jobs=-1)
grid.fit(scaled_train, y_train)

# Use the refit best model; grid.estimator is only the untuned template
grid_model = grid.best_estimator_
test_score = grid_model.score(scaled_test, y_test)
print("Best Parameters :", grid.best_params_)
print("Tuned SVM Test Score :", test_score)

### Classification Report

In [None]:
print(classification_report(y_test, grid_model.predict(scaled_test)))

## Decision Tree

In [None]:
dec = DecisionTreeClassifier(random_state=0)
dec.fit(X_train, y_train)
print("Decision Tree Classifier Test Score :", dec.score(X_test, y_test))

### Confusion Matrix

In [None]:
con_mat = pd.DataFrame(confusion_matrix(y_test, dec.predict(X_test)))
fig = plt.figure(figsize=[6, 4])
ax = fig.add_subplot()

sns.heatmap(con_mat, annot=True, fmt=".2f", cmap="RdYlGn", cbar=False, ax=ax)
plt.title("Decision Tree Confusion Matrix")
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.show()

###### More or less the same performance as Logistic Regression :( Decision Trees are more likely to overfit the training data, so we need to tune the tree's parameters.
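The overfitting tendency is easy to see by comparing train and test accuracy of an unrestricted tree with a depth-limited one. The sketch uses noisy synthetic data (an assumption, not the placement data), so the exact numbers say nothing about our dataset; only the pattern matters.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy stand-in data (assumption, not the placement dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, flip_y=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, stratify=y_demo, random_state=0)

# An unrestricted tree keeps splitting until every training sample is classified correctly
deep = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

# Limiting the depth forces the tree to learn only the broad patterns
shallow = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)

for name, tree in [("unrestricted", deep), ("max_depth=3", shallow)]:
    print(name, "| train:", round(tree.score(X_tr, y_tr), 3), "| test:", round(tree.score(X_te, y_te), 3))
```

A perfect train score next to a clearly lower test score is the typical sign of overfitting, and it is the gap that parameters such as `max_depth` or `min_samples_leaf` are meant to close.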

### ROC Curve

In [None]:
fpr, tpr, thresholds = roc_curve(y_test, dec.predict_proba(X_test)[:, 1])
auc_score = auc(fpr, tpr)

fig = plt.figure(figsize=[9, 5])
ax = fig.add_subplot()

plt.plot(fpr, tpr, c="darkred", lw=2, label="AUC = {}".format(round(auc_score, 2)))
plt.plot([0, 1], [0, 1], c='black', lw=2, ls='--', label="AUC = {}".format(0.5))
plt.title("ROC Curve - Decision Tree")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend(loc=4)
plt.show()

### Hyperparameter Tuning

In [None]:
depth = list(range(1, 11))
min_sample_split = np.arange(5, 30, 5)
min_leaf_sample = np.arange(3, 16, 3)
features = ["sqrt", "log2"]    # "auto" was an alias of "sqrt" and is rejected by recent scikit-learn versions
max_leaf_nodes = [4, 6, 8, 10]
param = {"max_depth": depth, "min_samples_split": min_sample_split, "min_samples_leaf": min_leaf_sample,
         "max_features": features, "max_leaf_nodes": max_leaf_nodes}

grid = GridSearchCV(estimator=DecisionTreeClassifier(random_state=0), param_grid=param, n_jobs=-1)
grid.fit(X_train, y_train)

# Use the refit best model; grid.estimator is only the untuned template
grid_model = grid.best_estimator_
test_score = grid_model.score(X_test, y_test)
print("Best Parameters :", grid.best_params_)
print("Tuned Decision Tree Test Score :", test_score)

###### Don't trust a single Decision Tree too much. Random Forest and Boosting models, which combine many trees, are usually more reliable.

### Classification Report

In [None]:
print(classification_report(y_test, grid_model.predict(X_test)))

## Random Forest

In [None]:
ran = RandomForestClassifier(random_state=0)
ran.fit(X_train, y_train)
print("Random Forest Classifier Test Score :", ran.score(X_test, y_test))

### Confusion Matrix

In [None]:
con_mat = pd.DataFrame(confusion_matrix(y_test, ran.predict(X_test)))
fig = plt.figure(figsize=[6, 4])
ax = fig.add_subplot()

sns.heatmap(con_mat, annot=True, fmt=".2f", cmap="RdYlGn", cbar=False, ax=ax)
plt.title("Random Forest Confusion Matrix")
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.show()

###### That's the power of Random Forest :) It proved to be a better classifier than SVM in our case. SVM has 7 False Positives but Random Forest has only 3, and its False Negatives are also good when compared overall :)

### ROC Curve

In [None]:
fpr, tpr, thresholds = roc_curve(y_test, ran.predict_proba(X_test)[:, 1])
auc_score = auc(fpr, tpr)

fig = plt.figure(figsize=[9, 5])
ax = fig.add_subplot()

plt.plot(fpr, tpr, c="darkred", lw=2, label="AUC = {}".format(round(auc_score, 2)))
plt.plot([0, 1], [0, 1], c='black', lw=2, ls='--', label="AUC = {}".format(0.5))
plt.title("ROC Curve - Random Forest")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend(loc=4)
plt.show()

###### Yes, the AUC for Random Forest is 0.95 :) That's great, isn't it !

### Hyperparameter Tuning

In [None]:
n_estimators = np.arange(100, 140, 10)
max_depth = np.arange(3, 10, 2)
# "auto" was an alias of "sqrt" and is rejected by recent scikit-learn versions; "log2" is tried as the second option instead
max_features = ["sqrt", "log2"]
max_leaf_nodes = np.arange(3, 10, 2)
param = {"n_estimators": n_estimators, "max_depth": max_depth, "max_features": max_features, "max_leaf_nodes": max_leaf_nodes}

grid = GridSearchCV(estimator=RandomForestClassifier(random_state=0, n_jobs=-1), param_grid=param, n_jobs=-1)
grid.fit(X_train, y_train)

# Use the refit best model; grid.estimator is only the untuned template
grid_model = grid.best_estimator_
test_score = grid_model.score(X_test, y_test)
print("Best Parameters :", grid.best_params_)
print("Tuned Random Forest Test Score :", test_score)

In [None]:
print("AUC after parameter tuning : ", roc_auc_score(y_test, grid_model.predict_proba(X_test)[:, 1]))

# Confusion matrix
con_mat = pd.DataFrame(confusion_matrix(y_test, grid_model.predict(X_test)))
fig = plt.figure(figsize=[6, 4])
ax = fig.add_subplot()

sns.heatmap(con_mat, annot=True, fmt=".2f", cmap="RdYlGn", cbar=False, ax=ax)
plt.title("Random Forest Confusion Matrix")
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.show()

### Classification Report

In [None]:
print(classification_report(y_test, grid_model.predict(X_test)))

###### Don't worry if it is doing about the same as before. Compare this report with the untuned Random Forest results above :)

## AdaBoost Classifier

In [None]:
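# Base estimator passed positionally: the keyword was renamed from base_estimator to estimator in scikit-learn 1.2
ada = AdaBoostClassifier(RandomForestClassifier(random_state=0, n_jobs=-1), learning_rate=0.1, random_state=0)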
ada.fit(X_train, y_train)
print("AdaBoost Classifier Test Score :", ada.score(X_test, y_test))

In [None]:
con_mat = pd.DataFrame(confusion_matrix(y_test, ada.predict(X_test)))
fig = plt.figure(figsize=[6, 4])
ax = fig.add_subplot()

sns.heatmap(con_mat, annot=True, fmt=".2f", cmap="RdYlGn", cbar=False, ax=ax)
plt.title("AdaBoost Confusion Matrix")
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.show()

In [None]:
fpr, tpr, thresholds = roc_curve(y_test, ada.predict_proba(X_test)[:, 1])
auc_score = auc(fpr, tpr)

fig = plt.figure(figsize=[9, 5])
ax = fig.add_subplot()

plt.plot(fpr, tpr, c="darkred", lw=2, label="AUC = {}".format(round(auc_score, 2)))
plt.plot([0, 1], [0, 1], c='black', lw=2, ls='--', label="AUC = {}".format(0.5))
plt.title("ROC Curve - AdaBoost")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend(loc=4)
plt.show()

###### Comparing Random Forest with AdaBoost, I recommend using Random Forest rather than AdaBoost :)

### Classification Report

In [None]:
print(classification_report(y_test, ada.predict(X_test)))

## GradientBoosting

In [None]:
grad = GradientBoostingClassifier(learning_rate=0.1, random_state=0)
grad.fit(X_train, y_train)
print("GradientBoost Classifier Test Score :", grad.score(X_test, y_test))

### Confusion Matrix

In [None]:
con_mat = pd.DataFrame(confusion_matrix(y_test, grad.predict(X_test)))
fig = plt.figure(figsize=[6, 4])
ax = fig.add_subplot()

sns.heatmap(con_mat, annot=True, fmt=".2f", cmap="RdYlGn", cbar=False, ax=ax)
plt.title("GradientBoost Confusion Matrix")
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.show()

###### Again I recommend Random Forest :)

### Classification Report

In [None]:
print(classification_report(y_test, grad.predict(X_test)))

## Voting Classifier

In [None]:
estimators = [("LR", LogisticRegression(random_state=0, n_jobs=-1)),
             ("SVC", SVC(random_state=0)),
             ("RF", RandomForestClassifier(random_state=0, n_jobs=-1)),
             ("Ada", AdaBoostClassifier(RandomForestClassifier(random_state=0, n_jobs=-1), learning_rate=0.1, random_state=0)),
             ("Dec", DecisionTreeClassifier(random_state=0))]
vot = VotingClassifier(estimators, n_jobs=-1)
vot.fit(X_train, y_train)
print("Voting Classifier Test Score :", vot.score(X_test, y_test))
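Logistic Regression and SVM are sensitive to feature scale, while tree-based models are not. One way to give each member of a voting ensemble the input it prefers is to wrap the scale-sensitive ones in a pipeline with their own scaler. This is a sketch of that idea on synthetic data (an assumption, not the placement data), with a reduced set of estimators; names ending in `_demo` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (assumption, not the placement dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, stratify=y_demo, random_state=0)

estimators_demo = [
    # Scale-sensitive models carry their own scaler
    ("LR", make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))),
    ("SVC", make_pipeline(StandardScaler(), SVC(random_state=0))),
    # Tree ensembles can use the raw features
    ("RF", RandomForestClassifier(random_state=0)),
]

vot_demo = VotingClassifier(estimators_demo)  # default is hard (majority) voting
vot_demo.fit(X_tr, y_tr)
demo_score = vot_demo.score(X_te, y_te)
print("Voting test accuracy:", round(demo_score, 3))
```

The raw `X_train` can then be passed to the ensemble directly, because each pipeline scales its own copy of the input.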

### Confusion Matrix

In [None]:
con_mat = pd.DataFrame(confusion_matrix(y_test, vot.predict(X_test)))
fig = plt.figure(figsize=[6, 4])
ax = fig.add_subplot()

sns.heatmap(con_mat, annot=True, fmt=".2f", cmap="RdYlGn", cbar=False, ax=ax)
plt.title("Voting Classifier Confusion Matrix")
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.show()

### Classification Report

In [None]:
print(classification_report(y_test, vot.predict(X_test)))

###### Again Random Forest is good

# Conclusion :)
1. From the models we have built, Random Forest is doing better than the other models.
2. You can also use the Voting Classifier, which predicts about as well as Random Forest :(
3. If you prefer a non-ensemble model, you can also use SVM.
4. It is possible to select important categorical features using the Chi-Square test, but I have not implemented it here.
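For anyone who wants to try the Chi-Square idea from the last point, here is a minimal sketch using `sklearn.feature_selection.chi2`. The two binary features are generated at random for illustration (an assumption): one mostly follows the target and one is pure noise. On the real data you would pass the encoded categorical columns and `y` instead.

```python
import numpy as np
from sklearn.feature_selection import chi2

rng = np.random.default_rng(0)
y_demo = rng.integers(0, 2, size=200)

# Hypothetical binary features: one that mostly follows the target, one that is pure noise
flips = rng.random(200) < 0.1
related = np.where(flips, 1 - y_demo, y_demo)
noise = rng.integers(0, 2, size=200)
X_demo = np.column_stack([related, noise])

# chi2 expects non-negative features such as counts or 0/1 indicators
scores, p_values = chi2(X_demo, y_demo)

for name, score, p in zip(["related", "noise"], scores, p_values):
    print(f"{name}: chi2={score:.2f}, p={p:.4f}")
```

A high score with a small p-value suggests the feature and the target are not independent, so such features are the ones worth keeping.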