Hello! My name is Laith. We are going to work through a machine learning (ML) project from beginning to end :) We will mostly be using Pandas and Sklearn, both of which are very useful for ML and data analysis. And since we are using Pandas and Sklearn, yes, you guessed it, we will be coding in Python!

Steps we will go through:

- Look at the big picture. What are we trying to accomplish?
- Get the data
- Play around with the data to get a better understanding of it
- Clean the data
- Select our models to train
- Evaluate our models' performance
- Present our solution

Okay, let's begin! 

We have been hired by a university that wants us to build a model to help predict whether a student will get a work placement or not. For the sake of simplicity, they don't give us any further information, so we are free to approach this in any way we want.

Since we will be dealing with labeled training examples, where every instance comes with its expected output, this will be a supervised learning task. It is also a classification task, since we are aiming to classify whether a student will get a placement or not.

Okay, let's get the data!

The data has been downloaded from Kaggle (https://www.kaggle.com/benroshan/factors-affecting-campus-placement?select=Placement_Data_Full_Class.csv). I have it in my directory so we will start by using Pandas to load the data. 

In [None]:
# ALL OF OUR IMPORTS WILL GO RIGHT HERE IN THIS CELL 
import os 
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score, confusion_matrix
from sklearn.model_selection import StratifiedShuffleSplit, StratifiedKFold
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

In [None]:
# FUNCTION TO LOAD OUR DATA 

def load_dataset():
    csv_path = os.path.join("../input/factors-affecting-campus-placement/Placement_Data_Full_Class.csv")
    return pd.read_csv(csv_path)

In [None]:
# LOADING THE DATASET 
placement = load_dataset()

OKAY! So, we have looked at the big picture and now we have the data.

Next step? Let's get a better understanding of the data. We will start off by looking at the dataset.

In [None]:
# looking at the placement dataset
placement

What are we looking at? 

This table is 215x15. Meaning that there are 215 students (instances) and 15 different columns (attributes). 
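
If you want to check the size yourself rather than read it off the display, .shape gives you (rows, columns). Here is a minimal sketch on a made-up toy frame (the values are invented; on the real data you would call placement.shape):

```python
import pandas as pd

# Toy frame standing in for the placement table (values are made up).
toy = pd.DataFrame({
    "sl_no": [1, 2, 3],
    "ssc_p": [67.0, 79.3, 65.0],
    "status": ["Placed", "Placed", "Not Placed"],
})

# .shape is a (rows, columns) tuple, i.e. (instances, attributes).
n_rows, n_cols = toy.shape
print(f"{n_rows} instances, {n_cols} attributes")
```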

The columns: 
- sl_no: serial number 
- gender: gender 
- ssc_p: secondary school percentage (grade 10)
- ssc_b: secondary school board 
- hsc_p: higher secondary school percentage (grade 11&12)
- hsc_b: higher secondary school board 
- hsc_s: specialization in higher secondary school 
- degree_p: degree percentage 
- degree_t: undergrad degree 
- workex: work experience 
- etest_p: employability test percentage
- specialisation: Postgrad degree 
- mba_p: MBA percentage 
- status: if they are placed or not 
- salary: salary 

We can get more info about the attributes by simply calling .info(), and a numerical summary of the data by calling .describe().

In [None]:
# dataset info
placement.info()

In [None]:
# dataset description
placement.describe()

What do .info() and .describe() tell us?

We can see from .info() that the salary attribute has only 148 non-null values, which is fewer than the 215 rows. That's not okay, so we will need to take care of it later. For now it is just something to notice.

.describe(), as you can see, gives us several summary statistics for each numeric attribute. We won't go through all of them as they are self-explanatory. 25%, 50% and 75% correspond to the percentiles. For example, 25% of students have a degree_p lower than 61.
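
To make the percentile rows concrete, here is a small sketch on made-up numbers (the series below is invented and only borrows the degree_p name). It shows that the "25%" row of .describe() is the same value you get from .quantile(0.25):

```python
import pandas as pd

# Made-up percentages, only for illustration.
degree_p = pd.Series([50, 55, 60, 65, 70, 75, 80, 85, 90])

q25 = degree_p.quantile(0.25)            # 25th percentile
from_describe = degree_p.describe()["25%"]  # same value, read from describe()
share_below = (degree_p < q25).mean()    # fraction of values strictly below it

print(q25, from_describe, share_below)
```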

Let's now use matplotlib to get a better understanding of the data.

In [None]:
# histograms of each numeric attribute 
placement.hist(bins=50, figsize=(20,15))

We can see that ssc_p, hsc_p, degree_p, and mba_p are fairly bell shaped, meaning that they are roughly normally distributed, with values about as likely to fall on one side of the average as on the other. The salary attribute is clearly right-skewed. This is expected, as most salaries tend to be close to one another and only a few end up being very high.
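
If you would rather put a number on "bell shaped" versus "right-skewed" than eyeball the histograms, pandas has a .skew() method. This is just a sketch on two invented series, not on the placement data: a symmetric one gives a skewness of about 0, and one with a long right tail gives a positive value (and a mean above the median).

```python
import pandas as pd

# Invented values: one symmetric series, one with a long right tail.
symmetric = pd.Series([1, 2, 3, 4, 5])
right_skewed = pd.Series([1, 1, 2, 2, 3, 20])

print("symmetric skew:", symmetric.skew())
print("right-skewed skew:", right_skewed.skew())

# In a right-skewed series the few large values pull the mean above the median.
print("mean:", right_skewed.mean(), "median:", right_skewed.median())
```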

In [None]:

# scatter plot of hsc_p against degree_p: marker size follows mba_p, colour follows status
c = (placement['status'] == "Placed").astype(int)  # 0 = Not Placed, 1 = Placed (shown on the colorbar)
placement.plot(kind='scatter', y='hsc_p', x='degree_p', s=placement['mba_p'], c=c, cmap=plt.get_cmap("jet"), colorbar=True)
plt.show()

In [None]:
# returns (precision, recall, f1) for a set of predictions
def score(y, y_pred):
    return precision_score(y, y_pred), recall_score(y, y_pred), f1_score(y, y_pred)
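
Here is a tiny worked example of what those three numbers mean, using made-up labels (1 is the positive class). With 2 true positives, 1 false positive and 1 false negative, precision is TP / (TP + FP) = 2/3, recall is TP / (TP + FN) = 2/3, and F1 is their harmonic mean.

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Made-up labels: 1 = positive class, 0 = negative class.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
# Counting by hand: TP = 2, FN = 1, FP = 1, TN = 4.

precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)                # harmonic mean of the two

print(precision, recall, f1)
```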

Now let's split the dataset into training and testing sets. There are multiple ways to do this.

We will use sklearn's StratifiedShuffleSplit, which keeps the proportion of placed and not placed students the same in both sets.

Documentation for StratifiedShuffleSplit -- https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedShuffleSplit.html 

In [None]:
strat = StratifiedShuffleSplit(n_splits=1, test_size=0.15, random_state=42)

In [None]:
for train_index, test_index in strat.split(placement, placement['status']):
    strat_train = placement.loc[train_index]
    strat_test = placement.loc[test_index]
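
To see what "stratified" buys us, here is a sketch on a made-up frame with 14 "Placed" and 6 "Not Placed" rows (the feature column and the 50/50 split size are assumptions chosen to keep the arithmetic easy). Both halves end up with the same 70/30 class mix as the full frame.

```python
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Made-up frame: 14 placed and 6 not placed students.
toy = pd.DataFrame({
    "feature": range(20),
    "status": ["Placed"] * 14 + ["Not Placed"] * 6,
})

splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.5, random_state=42)
train_idx, test_idx = next(splitter.split(toy, toy["status"]))

# split() yields positional indices, so .iloc is the matching indexer.
train, test = toy.iloc[train_idx], toy.iloc[test_idx]

print(train["status"].value_counts(normalize=True))
print(test["status"].value_counts(normalize=True))
```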

In [None]:
x_train = strat_train.drop("status", axis=1)
y_train = strat_train["status"]
x_test = strat_test.drop("status", axis=1)
y_test = strat_test["status"]

We are now going to use transformation pipelines, which come from sklearn.

We need to fill in the missing salary values (we saw this earlier) and convert all categorical attributes to numerical ones, since the models we will train expect numeric input.

Essentially, a pipeline chains transformation steps together, and a ColumnTransformer hands each attribute to the right transformation.

Documentation -- https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html 
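
Before applying this to the real columns, here is a minimal sketch of the idea on a made-up two-column frame (the values, and the names numeric and preprocess, are only for illustration). The numeric column gets its missing value imputed and is then scaled, while the categorical column is one-hot encoded.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up data: one numeric column with a gap, one categorical column.
toy = pd.DataFrame({
    "mba_p": [58.0, 66.0, np.nan, 62.0],
    "workex": ["No", "Yes", "No", "Yes"],
})

# Numeric steps run in order: fill gaps with the median, then standardise.
numeric = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])

# Route each column to the transformation that suits it.
preprocess = ColumnTransformer([
    ("num", numeric, ["mba_p"]),
    ("cat", OneHotEncoder(), ["workex"]),
])

out = preprocess.fit_transform(toy)
# One scaled numeric column plus one column per workex category.
print(out.shape)
```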

In [None]:
num_attributes = [
    "sl_no",
    "ssc_p",
    "hsc_p",
    "degree_p",
    "etest_p",
    "mba_p",
    "salary"
]

cat_attributes = [
    "gender",
    "ssc_b",
    "hsc_b",
    "hsc_s",
    "degree_t",
    "workex",
    "specialisation"
]

In [None]:
num_pipline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ('std_scalar', StandardScaler())
])

pipline = ColumnTransformer([
    ("num", num_pipline, num_attributes),
    ("cat", OneHotEncoder(), cat_attributes)
])

In [None]:
proccessed_train_x = pipline.fit_transform(x_train)
# only transform the test set, reusing what the pipeline learned from the training set
proccessed_test_x = pipline.transform(x_test)
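
A quick sketch of why it matters whether held-out data goes through fit_transform or transform, using a StandardScaler on made-up numbers. Fitting on the training rows and only transforming the test rows keeps both sets in the same units. Refitting on the test rows recentres them around their own mean, so the model would see values on a different scale from the one it was trained on.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up values: the test rows are clearly higher than the training rows.
train = np.array([[50.0], [60.0], [70.0]])
test = np.array([[80.0], [90.0]])

scaler = StandardScaler().fit(train)          # mean and std come from train only
consistent = scaler.transform(test)           # test expressed in train units
refit = StandardScaler().fit_transform(test)  # mean and std recomputed on test

print("transform only:", consistent.ravel())
print("refit on test: ", refit.ravel())
```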

In [None]:
y_text_to_num = {
    "status": {"Placed": 0, "Not Placed": 1}
}

y_train = y_train.to_frame()
y_test = y_test.to_frame()


proccessed_train_y = y_train.replace(y_text_to_num)
proccessed_test_y = y_test.replace(y_text_to_num)

In [None]:
proccessed_train_y = proccessed_train_y["status"].values
proccessed_test_y = proccessed_test_y["status"].values

In [None]:
FINAL_X = pipline.fit_transform(placement.drop("status", axis=1))
FINAL_Y = placement["status"].values

In [None]:
split = StratifiedKFold(n_splits=10)
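
The cells shown here do not go on to use this split object, so the following is only one guess at how it could be used: pass a StratifiedKFold as cv to cross_val_score and get one score per fold. The data below is synthetic, and the 5 folds and the "f1" scoring are assumptions made for the sketch.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data (not the placement dataset).
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))
y = (X[:, 0] + 0.5 * rng.normal(size=60) > 0).astype(int)

# Each fold keeps roughly the same class mix as the full label vector.
folds = StratifiedKFold(n_splits=5)
scores = cross_val_score(LogisticRegression(), X, y, cv=folds, scoring="f1")

print("F1 per fold:", scores)
print("mean F1:", scores.mean())
```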

The models we will use: 

- Logistic Regression 
- Decision Tree 
- Gaussian Naive Bayes
- Random Forest Classifier 
- K Nearest Neighbors Classifier
- Support Vector Machine Classifier

In [None]:
log_reg = LogisticRegression()
log_reg.fit(proccessed_train_x, proccessed_train_y)

In [None]:
lr_pred_y = log_reg.predict(proccessed_test_x)

In [None]:
tree_reg = DecisionTreeClassifier()
tree_reg.fit(proccessed_train_x, proccessed_train_y)

In [None]:
tr_pred_y = tree_reg.predict(proccessed_test_x)

In [None]:
gau_naiv_bay = GaussianNB()
gau_naiv_bay.fit(proccessed_train_x, proccessed_train_y)

In [None]:
gnb_pred_y = gau_naiv_bay.predict(proccessed_test_x)

In [None]:
ran_for_cla = RandomForestClassifier()
ran_for_cla.fit(proccessed_train_x, proccessed_train_y)

In [None]:
rfc_pred_y = ran_for_cla.predict(proccessed_test_x)

In [None]:
k_near_nei = KNeighborsClassifier()
k_near_nei.fit(proccessed_train_x, proccessed_train_y)

In [None]:
knn_pred_y = k_near_nei.predict(proccessed_test_x)

In [None]:
sup_vec_mac = SVC()
sup_vec_mac.fit(proccessed_train_x, proccessed_train_y)

In [None]:
svm_pred_y = sup_vec_mac.predict(proccessed_test_x)

Let's show the scores for each model using the score() function we built earlier.

In [None]:
models = [
    ("Logistic Regression", lr_pred_y),
    ("Decision Tree Classifier", tr_pred_y),
    ("Gaussian Naive Bayes", gnb_pred_y),
    ("Random Forest Classifier", rfc_pred_y),
    ("K Nearest Neighbors Classifier", knn_pred_y),
    ("Support Vector Machine Classifier", svm_pred_y)
]

for info in models:
    model = info[0]
    y_pred = info[1]
    print(model)
    print("These are the predicted values from the model: ", y_pred)
    print("These are the correct output values:           ", proccessed_test_y)
    print("The score for this model (precision, recall, f1_score): ", score(proccessed_test_y, y_pred))
    print("\n\n")

This part is just to show how a confusion_matrix works.

Now, for our data science metrics, we will be using the confusion_matrix. It does take a bit of time to wrap your head around it, so make sure you visit this link (https://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html) to get a better understanding. Or just YouTube it. YouTube + Google is everything!

Essentially, the confusion_matrix tells us the number of true positives, false positives, true negatives, and false negatives.
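
Here is a small sketch of how to read it, on made-up labels. For binary 0/1 labels, sklearn puts the true classes on the rows and the predicted classes on the columns, so .ravel() returns the counts in the order TN, FP, FN, TP. Keep in mind that with the encoding used in this notebook (Placed = 0, Not Placed = 1), the "positive" class is Not Placed.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels: 1 = positive class, 0 = negative class.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]

cm = confusion_matrix(y_true, y_pred)
# Rows are true classes, columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
tn, fp, fn, tp = cm.ravel()

print(cm)
print("TN:", tn, "FP:", fp, "FN:", fn, "TP:", tp)
```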

In [None]:
models = [
    ("Logistic Regression", lr_pred_y),
    ("Decision Tree Classifier", tr_pred_y),
    ("Gaussian Naive Bayes", gnb_pred_y),
    ("Random Forest Classifier", rfc_pred_y),
    ("K Nearest Neighbors Classifier", knn_pred_y),
    ("Support Vector Machine Classifier", svm_pred_y)
]

for info in models:
    model = info[0]
    y_pred = info[1]
    print(model)
    print("The confusion matrix is:  \n", confusion_matrix(proccessed_test_y, y_pred))
    print("\n")

Take the Random Forest Classifier, for example. Its confusion matrix has the fewest false positives and false negatives, and the most true positives and true negatives. Hence, when you look at the scoring, RFC has the highest score at about 0.95.