In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt # data visualisation
import seaborn as sns # data visualisation
sns.set_style('darkgrid')

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
path = "/kaggle/input/factors-affecting-campus-placement/Placement_Data_Full_Class.csv"

In [None]:
data = pd.read_csv(path, index_col = "sl_no")
data.head()

In [None]:
data.describe()

# Exploratory Data Analysis

In this section, I'll create some visualizations to check whether there are relationships between different features. Since I don't have much experience with EDA, this should be good practice for me.

## Features
Let's take a look at the given definitions of the features.

> * `sl_no` : Serial number
> * `gender` : Gender (F: Female, M: Male)
> * `ssc_p` : Secondary education percentage (to 10th grade)
> * `ssc_b` : Board of education for secondary education 
> * `hsc_p` : Higher secondary education percentage (10th to 12th grade)
> * `hsc_b` : Board of education for higher secondary education
> * `hsc_s` : Specialization in higher secondary education
> * `degree_p` : Degree percentage
> * `degree_t` : Under-grad field of degree education
> * `workex` : Work experience
> * `etest_p` : Employability test percentage (test conducted by college)
> * `specialisation` : Post-grad (MBA) specialization
> * `mba_p` : MBA percentage
> * `status` : Status of placement
> * `salary` : Salary offered by corporate to candidates


## Looking for null values
If there's a null value, we need to do something about it. But first, let's check that!

In [None]:
for feature in data.columns:
    print("There are {} null values for {} feature ".format( sum(data[feature].isnull()), feature ))

We have `67` null values for the `salary` feature. These values are probably null because of the `status` feature: if a candidate wasn't placed at any company, no salary was offered, so `salary` is missing. That's just a hypothesis, so we need to check whether it's correct.

In [None]:
# Count the non-null values per column for the students who weren't placed.
# If `salary` shows 0 here, all of its null values come from the status feature and we don't need to replace them.
data[data['status'] == 'Not Placed'].count()
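
The same check can be written as an explicit row-by-row comparison. This is only a minimal sketch: `toy` is a hypothetical stand-in for `data` with made-up values, and the variable names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for `data`; the values are invented for illustration.
toy = pd.DataFrame({
    "status": ["Placed", "Not Placed", "Placed", "Not Placed"],
    "salary": [270000.0, np.nan, 200000.0, np.nan],
})

# True where the salary is missing
null_salary = toy["salary"].isnull()
# True where the student wasn't placed
not_placed = toy["status"] == "Not Placed"

# If both masks agree on every row, the nulls are fully explained by `status`
nulls_match_status = bool((null_salary == not_placed).all())
print(nulls_match_status)
```

Running the same comparison on `data` would confirm the hypothesis in a single boolean instead of reading it off the `count()` table.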

## Visualization
Data visualization is an important part of exploratory data analysis since it's a very helpful process for getting insights about the data we have. We try to understand the data by using different plots.

In [None]:
# First of all, is the dataset balanced with respect to placement?
sns.countplot(x = 'status', hue = 'status', data = data)
plt.show()

Well, the data we have is pretty unbalanced. We should consider this while building machine learning models.
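
To put a number on the imbalance, here is a small sketch that only uses counts reported in this notebook: `67` students are not placed (the null salaries), and there are `141 + 74 = 215` students in total (the work experience counts). The variable names are mine.

```python
# Counts taken from this notebook: 67 students are not placed,
# and 141 + 74 = 215 students in total (without / with work experience).
total_students = 141 + 74
not_placed = 67
placed = total_students - not_placed

placed_share = placed / total_students
not_placed_share = not_placed / total_students

# A model that always answers with the larger class would be right this often on the full dataset
majority_baseline_accuracy = max(placed_share, not_placed_share)

print("Placed: {} ({:.1%})".format(placed, placed_share))
print("Not placed: {} ({:.1%})".format(not_placed, not_placed_share))
print("Majority-class accuracy: {:.3f}".format(majority_baseline_accuracy))
```

Any model we build later should be compared against this kind of trivial baseline rather than against 50%.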

In [None]:
# Is there a relationship between score percentages and placement?
score_p_cols = ['ssc_p', 'hsc_p', 'mba_p', 'degree_p']
score_cols_descs = ['Secondary School', 'Higher Secondary School', 'Masters of Business Administration', 'Under-grad Degree']
plt.figure(figsize = (12,12))

for s in range(len(score_p_cols)):
    plt.subplot(2,2, s + 1)
    sns.boxplot(x = 'status', y = score_p_cols[s], data = data)
    # Number the graphs from 1.1 to 1.4 so they match the references in the conclusion
    plt.title("Graph 1.{}: ".format(s + 1) + score_cols_descs[s])
    plt.ylabel("")
plt.show()

#### We can see a clear difference in secondary school and under-grad degree scores between placed and not placed students. MBA percentages, on the other hand, don't differ that much.

In [None]:
# Is there any relationship between employment test scores and placement?
sns.boxplot(x = 'status', y = 'etest_p', data = data)
sns.swarmplot(x = 'status', y = 'etest_p', data = data, color = ".2")
plt.title('Graph 2: Employment Test Score Distribution by Status')
plt.ylabel('')
plt.show()

print("Placed students average score on employment test: {:.2f}".format(data[data['status'] == 'Placed'].etest_p.mean()))
print("Placed students standard deviation on employment test: {:.2f}".format(data[data['status'] == 'Placed'].etest_p.std()))
print("Not placed students average score on employment test: {:.2f}".format(data[data['status'] == 'Not Placed'].etest_p.mean()))
print("Not placed students standard deviation on employment test: {:.2f}".format(data[data['status'] == 'Not Placed'].etest_p.std()))

The score difference between placed and not placed students is not that large, but placed students generally have higher scores than not placed students.

In [None]:
sns.countplot(x = 'status', hue = 'workex', data = data)
plt.title('Graph 3: Is work experience a strong factor for recruitment?')
plt.show()

# Let's look at some numbers
# Look the counts up by label instead of by position, so the order returned by value_counts() doesn't matter
w_workex = data[data['workex'] == 'Yes'].status.value_counts()
wo_workex = data[data['workex'] == 'No'].status.value_counts()
print("Students with work experience:\n Total: {} \n Placed: {} \n Not Placed: {}".format(w_workex.sum(), w_workex['Placed'], w_workex['Not Placed']))
print("{:.2f}% of students with work experience are placed while {:.2f}% of them couldn't get placed".format(w_workex['Placed']/w_workex.sum() * 100, w_workex['Not Placed']/w_workex.sum() * 100))
print("\n Students without work experience:\n Total: {} \n Placed: {} \n Not Placed: {}".format(wo_workex.sum(), wo_workex['Placed'], wo_workex['Not Placed']))
print("{:.2f}% of students without work experience are placed while {:.2f}% of them couldn't get placed".format(wo_workex['Placed']/wo_workex.sum() * 100, wo_workex['Not Placed']/wo_workex.sum() * 100))

I think we can say that having work experience is a big advantage, but it isn't necessary for being placed. Also notice that most of the students don't have work experience: `141` students don't have it while `74` students do. `~86%` of the students with work experience are placed while only `~59%` of inexperienced students are placed. So it's fair to say that having work experience goes together with a higher chance of being placed.
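
Placement rates per group can also be read from a normalized cross-tabulation. The sketch below assumes a hypothetical frame `toy` with invented rows; on the real data the same call would take `data['workex']` and `data['status']`.

```python
import pandas as pd

# Hypothetical stand-in for `data`; the rows are invented for illustration.
toy = pd.DataFrame({
    "workex": ["Yes", "Yes", "Yes", "Yes", "No", "No", "No", "No"],
    "status": ["Placed", "Placed", "Placed", "Not Placed",
               "Placed", "Placed", "Not Placed", "Not Placed"],
})

# normalize="index" turns each row of counts into shares that sum to 1
placement_rates = pd.crosstab(toy["workex"], toy["status"], normalize="index")
print(placement_rates)
```

Because the table is indexed by label, `placement_rates.loc["Yes", "Placed"]` gives the placement rate of experienced students directly.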

In [None]:
# Is the undergrad degree field a factor for recruitment?
sns.countplot(x = 'degree_t', hue = 'status', data = data)
plt.title('Graph 4.0')

We can see that students with a degree in the communication and management field make up most of the placed students. Since this is also the largest group, the placement rate within each field tells us more than the raw counts.

In [None]:
sns.countplot(x = 'hsc_s', hue = 'status', data = data)
plt.title('Graph 4.1')

In [None]:
sns.countplot(x = 'specialisation', hue = 'status', data = data)
plt.title('Graph 4.2')

There are many more students who specialized in finance than in HR, and most of them are placed. So we can say that if a student specializes in marketing and finance, he or she will have a higher chance of being placed.

In [None]:
# Is there any correlation between secondary and higher secondary education?
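# Pass the columns as keywords; recent seaborn versions no longer accept x and y positionally
sns.scatterplot(x = 'ssc_p', y = 'hsc_p', data = data)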
plt.title('Graph 5.0')

In [None]:
# Let's look at the best fitting line
sns.lmplot(x = 'ssc_p', y = 'hsc_p', data = data)
plt.title('Graph 5.1')
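
The plots show the trend visually. A single number for the strength of a linear relationship is the Pearson correlation coefficient. Here is a minimal sketch of it in plain Python; the helper `pearson_r` and the toy percentages are assumptions for illustration, not values from the dataset.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and sums of squared deviations from the means
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical percentages, invented for illustration
ssc_toy = [50.0, 60.0, 70.0, 80.0]
hsc_toy = [55.0, 62.0, 75.0, 78.0]

r = pearson_r(ssc_toy, hsc_toy)
print("Pearson r: {:.3f}".format(r))
```

On the real data, `data['ssc_p'].corr(data['hsc_p'])` should give the same kind of coefficient, since pandas uses Pearson correlation by default.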

## Conclusion
> from graphs 1.1, 1.2, 1.3, 1.4
* `secondary school` , `higher secondary school` and `under-grad degree` are important features for being placed
* MBA score difference between placed and not placed students is very small so I think there's not much correlation between MBA scores and `status`

> from graph 2
* Employment test scores differ only slightly (roughly 73 for placed and 69 for not placed students)

> from graph 3
* Work experience is a powerful advantage to have but it's not required to get hired. Having work experience increases the chance of being placed. Statistically speaking, 86.4% of students with work experience are placed while only 60% of students without it got placed. So we can say that having work experience is an opportunity worth chasing.  

> from graph 4.0
* If a student gets a degree in `Sci&Tech` or `Comm&Mgmt` rather than another field, the student will have a higher chance of being recruited. Students from these two fields have a roughly 70% rate of being recruited while other fields have roughly 30%

> from graph 4.1
* 70% of the students who studied Commerce or Science in higher secondary school are placed while only 55% of the Arts students are placed. So studying Commerce or Science goes together with a higher chance of being placed

> from graph 4.2
* The students who specialized in Finance have an almost 80% rate of being recruited while HR students have roughly 55%. Therefore, specializing in Finance along with Marketing is a good idea for increasing the chance of being recruited

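> from graphs 5.0, 5.1
* These graphs relate secondary scores to higher secondary scores rather than to placement directly. Combined with graphs 1.1 and 1.2, students who are academically successful in secondary education are more likely to be recruited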

# Preparing the data for Classification
We need to encode our categorical data and normalize/scale the continuous features. I'll start with encoding categorical data by using one-hot-encoding.
Also, we can drop the features we don't need (salary, gender, ssc_b, hsc_b)

In [None]:
# drop the unnecessary features
classification_data = data.drop(['salary', 'gender', 'ssc_b', 'hsc_b'], axis = 1)
# classification_data = data.drop(['salary'], axis = 1)
classification_data.head(10)

In [None]:
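# One-hot-encode the categorical data and drop the first column of each feature to avoid the dummy variable trap.
# dtype = int keeps the dummy columns numeric (recent pandas versions return booleans by default)
classification_data = pd.get_dummies(classification_data, drop_first = True, dtype = int)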
classification_data.shape
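
To see what the encoding step produces, here is a small sketch on a hypothetical three-row frame (`toy` and its values are invented). It shows which dummy columns survive `drop_first = True` and how `status` ends up as a single 0/1 column.

```python
import pandas as pd

# Hypothetical stand-in for `classification_data`; the values are invented.
toy = pd.DataFrame({
    "mba_p": [58.8, 66.3, 57.8],
    "workex": ["No", "Yes", "No"],
    "status": ["Placed", "Placed", "Not Placed"],
})

# Numeric columns are kept as they are. Each categorical column becomes one
# dummy column per category, minus the first category in sorted order.
encoded = pd.get_dummies(toy, drop_first=True, dtype=int)
print(encoded)
```

Since "Not Placed" sorts before "Placed", the remaining column is `status_Placed`, so `1` means placed and `0` means not placed.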

In [None]:
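# We should split the dataset into feature/target subsets, then we'll split it into training and test sets

# features: every column except the last one
X = classification_data.values[:,:-1]
# target: the encoded "status" column, which get_dummies places last (1 = Placed, 0 = Not Placed)
y = classification_data.values[:,-1]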

# Split the dataset into train and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .25, shuffle = True, random_state = 42)

In [None]:
# Scale the data using StandardScaler 
# I'll just apply scaling to continuous variables.
from sklearn.preprocessing import StandardScaler
continuous_vars_train = X_train[:, :5]
continuous_vars_test = X_test[:, :5]

ss = StandardScaler()

continuous_vars_train = ss.fit_transform(continuous_vars_train)
continuous_vars_test = ss.transform(continuous_vars_test)

X_train[:, :5] = continuous_vars_train
X_test[:, :5] = continuous_vars_test
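
The important detail in the scaling step is that the statistics are learned from the training set only and then reused on the test set. The sketch below reproduces that idea by hand with NumPy on invented scores; the array names are assumptions for illustration.

```python
import numpy as np

# Hypothetical scores, invented for illustration
train_scores = np.array([[50.0], [60.0], [70.0], [80.0]])
test_scores = np.array([[65.0], [90.0]])

# Statistics come from the training data only
train_mean = train_scores.mean(axis=0)
train_std = train_scores.std(axis=0)  # population std (ddof=0), which is what StandardScaler uses

train_scaled = (train_scores - train_mean) / train_std
# The test data reuses the training statistics; it is never used to fit them
test_scaled = (test_scores - train_mean) / train_std

print(train_scaled.ravel())
print(test_scaled.ravel())
```

The scaled training column has zero mean and unit variance, while the test column generally doesn't, and that is expected.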

In [None]:
pd.DataFrame(X_train).describe()

# Machine Learning - Applying Classification Algorithm

We processed and tried to understand the data so far. Now it's time to model it using machine learning algorithms!

I believe there is more than one task we could use machine learning for, such as:
- Finding out if the student is placed or not
- Predicting the offered salary for placed students

I'll try to create a model to predict if a student is going to be placed or not. I'm going to compare different classification algorithms using k-fold cross-validation.

Recall that the data was unbalanced. So we should consider using different metrics to evaluate our models. 

I'll use the following metrics:
### Accuracy
Accuracy is the most common metric for binary classification problems. 
It's a reasonable default, but when you have an imbalanced dataset like ours, above-average accuracy can be obtained with trivial approaches such as always picking the majority class.

### Fbeta Measure
Fbeta score provides a way to combine recall and precision into a single measure that captures both of them. It's also widely used when working with imbalanced datasets. The mathematical formula for the Fbeta measure is

$ F_{\beta} = \frac{(1 + \beta^2) \times \text{Precision} \times \text{Recall}}{(\beta^2 \times \text{Precision}) + \text{Recall}} $

I think we can say that the Fbeta measure is just a generalized version of the F1 measure. The coefficient __beta__ controls how much weight recall gets relative to precision in the harmonic mean: values above `1` favor recall and values below `1` favor precision. When beta is equal to `1`, we're using the F1 score.
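
As a quick worked example of the formula, the sketch below evaluates it for a hypothetical classifier with precision `0.8` and recall `0.5` (both numbers are invented). The helper name `fbeta` is mine and is not the scikit-learn function.

```python
def fbeta(precision, recall, beta):
    """F-beta score computed directly from precision and recall."""
    beta_sq = beta ** 2
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)

# Hypothetical classifier: precise, but it misses half of the positives
precision, recall = 0.8, 0.5

f_half = fbeta(precision, recall, beta=0.5)  # leans towards precision
f_one = fbeta(precision, recall, beta=1)     # plain F1 score
f_two = fbeta(precision, recall, beta=2)     # leans towards recall

print("F0.5: {:.4f} | F1: {:.4f} | F2: {:.4f}".format(f_half, f_one, f_two))
```

Because recall is the weaker of the two here, the score drops as beta grows and recall gets more weight.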

### G-Mean
G-Mean is the geometric mean of Recall and True Negative Rate. So it also provides a way to combine two different metrics into a single measure just like Fbeta measure.
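
To show why G-Mean is useful on imbalanced data, the sketch below compares it with accuracy on two hypothetical confusion matrices (all counts are invented): a reasonable model, and a "model" that always predicts the positive class.

```python
import math

def accuracy_and_g_mean(tn, fp, fn, tp):
    """Return (accuracy, G-Mean) for the given confusion matrix counts."""
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    sensitivity = tp / (tp + fn)  # recall, true positive rate
    specificity = tn / (tn + fp)  # true negative rate
    return accuracy, math.sqrt(sensitivity * specificity)

# Hypothetical test set with 50 positives and 20 negatives
model_acc, model_g_mean = accuracy_and_g_mean(tn=10, fp=10, fn=5, tp=45)
# Always predicting the positive (majority) class
majority_acc, majority_g_mean = accuracy_and_g_mean(tn=0, fp=20, fn=0, tp=50)

print("Model    -> accuracy: {:.3f}, G-Mean: {:.3f}".format(model_acc, model_g_mean))
print("Majority -> accuracy: {:.3f}, G-Mean: {:.3f}".format(majority_acc, majority_g_mean))
```

The majority strategy still reaches a decent accuracy, but its G-Mean collapses to zero because it never recognizes a negative example.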


### ROC AUC Score
To understand the ROC AUC score, we should first understand what ROC is. The ROC curve summarizes the model's behavior using TPR (True Positive Rate) and FPR (False Positive Rate). When we plot these two against each other for different decision thresholds, we get the ROC curve. The area under this curve is the metric known as ROC AUC (Area Under the Receiver Operating Characteristic Curve). 
In simple terms, the ROC curve describes how good our model is at discriminating between the target classes. The area under the curve summarizes this as a single scalar value.
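
Another way to read ROC AUC: it equals the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one, with ties counted as half. The sketch below computes it that way in plain Python. The helper name and the toy scores are assumptions for illustration, and the pairwise loop is a swapped-in technique for small examples, not how scikit-learn computes the metric.

```python
def roc_auc_pairwise(y_true, scores):
    """ROC AUC as the share of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # the positive is ranked above the negative
            elif p == n:
                wins += 0.5   # a tie counts as half a win
    return wins / (len(positives) * len(negatives))

# Hypothetical labels and predicted probabilities
y_toy = [0, 0, 1, 1]
scores_toy = [0.1, 0.4, 0.35, 0.8]

auc = roc_auc_pairwise(y_toy, scores_toy)
# Giving every example the same score can't separate the classes at all
constant_auc = roc_auc_pairwise(y_toy, [1.0, 1.0, 1.0, 1.0])

print("AUC:", auc)
print("Constant-score AUC:", constant_auc)
```

This also explains why a majority-class baseline ends up with a ROC AUC of `0.5`: every pair is a tie.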

In [None]:
# Import the classification models 
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
# I'll be testing the models using different metrics.
from sklearn.metrics import accuracy_score, fbeta_score, average_precision_score, roc_auc_score, roc_curve
from sklearn.metrics import confusion_matrix
from sklearn.metrics import make_scorer
# In order to do a proper test on the models, we should use KFold cross-validation
from sklearn.model_selection import StratifiedKFold, KFold
# After testing different algorithms, we need to optimize the algorithm further.
# I'll try to do this by using Grid Search
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
# Evaluation with cross validation offers more generalized results
from sklearn.model_selection import cross_val_score

In [None]:
# the most basic approach would be picking the majority class all the time. 
simple_baseline = np.ones(len(y_test))
accuracy_score(y_test, simple_baseline)

In [None]:
# Utility function to calculate the geometric mean of sensitivity (recall) and specificity (TNR)
def g_mean(y_true, y_preds):
    cm = confusion_matrix(y_true, y_preds)
    tn, fp, fn, tp = cm.ravel()
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    g_mean_score = np.sqrt(sensitivity * specificity)
    return g_mean_score
g_mean_scorer = make_scorer(g_mean)

In [None]:
# I'll store the models in baseline_models list.
baseline_models = [LogisticRegression(), SVC(probability = True), 
                   KNeighborsClassifier(), GaussianNB(), 
                   DecisionTreeClassifier(), RandomForestClassifier()]

baseline_model_names = ['Logistic Regression', 'Support Vector Classifier', 
                        'K Nearest Neighbors', 'Gaussian Naive Bayes', 
                        'Decision Tree', 'Random Forest']

metric_names = ['Accuracy', 'F1 Score', 'G-Mean', 'ROC AUC Score']

# I'll apply 5 fold cross-validation to get a better inspection about the models
n_splits = 5

# Store the results so we can compare the models
results = np.zeros((len(baseline_models), len(metric_names)))

for i in range(len(baseline_models)):
    # Initialize an array to store the fold results
    model_results = np.zeros((n_splits, len(metric_names)))
    
    # Initialize the StratifiedKFold object with 5 splits so that each fold keeps the class ratio.
    skf = StratifiedKFold(n_splits)
    
    # Get the model
    model = baseline_models[i]
    
    for fold_iter, (train_index, test_index) in enumerate(skf.split(X_train,y_train)):
        # Get the training and test folds
        X_train_fold, X_test_fold = X_train[train_index], X_train[test_index]
        y_train_fold, y_test_fold = y_train[train_index], y_train[test_index]
        
        # Fit the data to the model 
        model.fit(X_train_fold, y_train_fold)
        # Evaluate the test fold
        model_predictions = model.predict(X_test_fold)
        model_pred_probs = model.predict_proba(X_test_fold)
        # Get the evaluation scores 
        acc = accuracy_score(y_test_fold, model_predictions)
        f1_score = fbeta_score(y_test_fold, model_predictions, beta = 1)
        g_mean_score = g_mean(y_test_fold, model_predictions)
        roc_auc = roc_auc_score(y_test_fold, model_pred_probs[:,1])
        # Store the results in the model_results array
        model_results[fold_iter] = [acc, f1_score, g_mean_score, roc_auc]
        
    # The final result for the model is going to be the average of stratified cross-validation results
    final_result = model_results.mean(axis = 0)
    # store the results along with the model name in "results" list
    results[i] = final_result

# print out the results as a dataframe
skf_results_df = pd.DataFrame(results, 
                              columns = ["Average %s for %d Folds"%(metric, n_splits) for metric in metric_names],
                              index = baseline_model_names)
skf_results_df
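
The comments in the loop above talk about stratified folds. As a sketch of what stratification changes compared with plain `KFold`, the example below splits a hypothetical, deliberately sorted label array and reports the share of positives in each test fold. `X_toy`, `y_toy` and the helper function are invented for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Hypothetical labels: 8 placed (1) followed by 4 not placed (0)
y_toy = np.array([1] * 8 + [0] * 4)
X_toy = np.zeros((len(y_toy), 1))  # the feature values don't matter for the split itself

def positive_share_per_fold(splitter):
    """Share of positive labels in the test part of each fold."""
    return [float(y_toy[test_idx].mean()) for _, test_idx in splitter.split(X_toy, y_toy)]

kfold_shares = positive_share_per_fold(KFold(n_splits=4))
stratified_shares = positive_share_per_fold(StratifiedKFold(n_splits=4))

print("KFold:          ", kfold_shares)
print("StratifiedKFold:", stratified_shares)
```

With plain `KFold` some folds contain only one class, which makes metrics such as ROC AUC or G-Mean undefined or misleading for that fold. The stratified version keeps the overall class ratio in every fold.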

Overall, Logistic Regression beats the other models. So I'm going to try to fine-tune the Logistic Regression model by using Random Search.

In [None]:
# estimator = LogisticRegression()
# parameters = {
#     'penalty': ['l1', 'l2', 'elasticnet', None],
#     'solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'],
#     'tol': [0.5, 0.3, 0.1, 1e-2, 1e-3, 1e-4, 1e-5, 1e-6],
#     'C': np.logspace(-4, 4, 32),
#     'max_iter': np.linspace(50, 2000, 40).astype(int),
# }

# best_estimator = RandomizedSearchCV(estimator, parameters, n_iter = 3000, scoring = 'accuracy', n_jobs = -1)
# best_estimator.fit(X_train, y_train)
# print('Best parameters: ', best_estimator.best_params_)
# print('Best score: ', best_estimator.best_score_)


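# I've found these parameters while optimizing hyper-parameters with the Random Search algorithm.
# You can also try different parameter options to apply random search, just uncomment the code above.
# Note: max_iter has to be an integer, and recent scikit-learn versions expect penalty = None
# instead of the string 'none' (older versions need the string). C is ignored when no penalty is applied.
fine_tuned_params = {'tol': 0.3, 'solver': 'sag', 'penalty': None, 'max_iter': 1850, 'C': 0.0019512934226359622}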
# fine_tuned_params = best_estimator.best_params_

tuned_estimator = LogisticRegression(**fine_tuned_params) # pick the best estimator
tuned_estimator.fit(X_train, y_train);

We can finally use our model to evaluate the test dataset. Let's look at the results!

In [None]:
# make predictions
final_preds = tuned_estimator.predict(X_test)
final_acc_cv = cross_val_score(LogisticRegression(**fine_tuned_params),
                            X_test, y_test, cv = 5,
                            scoring = 'accuracy')
final_acc = accuracy_score(y_test, final_preds)
final_probas = tuned_estimator.predict_proba(X_test)
print('Final accuracy: %.4f | Final average accuracy (5 fold): %.4f'%(final_acc, final_acc_cv.mean()))

# Get the confusion matrix
cm = confusion_matrix(y_test, final_preds)
tn, fp, fn, tp = cm.ravel()
# plot it
sns.heatmap([[tn, fp], [fn, tp]], cmap = 'Blues', annot = True)
plt.xlabel('predictions')
plt.ylabel('actual values')
plt.show()

# My final visualization for the results is the ROC curve. 
# As a "no-skill" baseline, I'll use the majority-class predictions.
final_roc_auc_score = cross_val_score(LogisticRegression(**fine_tuned_params),
                                    X_test, y_test, 
                                    scoring = 'roc_auc')
baseline_roc_auc_score = roc_auc_score(y_test, simple_baseline)
print("Average ROC AUC Score:", final_roc_auc_score.mean())
print("Baseline ROC AUC Score:", baseline_roc_auc_score)


ns_fpr, ns_tpr, _ = roc_curve(y_test, simple_baseline)
lr_fpr, lr_tpr, _ = roc_curve(y_test, final_probas[:, 1])
# roc_curve returns points that are already ordered, so plain line plots draw the curves as they are
plt.plot(ns_fpr, ns_tpr, label = 'Baseline (no-skill)')
plt.plot(lr_fpr, lr_tpr, label = 'Logistic Regression')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title("ROC Curve")
plt.legend()
plt.show()

We can see that the LR model did pretty well! Of course there's still room for improvement, but these results are great!

# Planned Improvements for Notebook
- It would be better if the methods used were explained with simple descriptions. 
- I should try different classification algorithms and compare their performance on the data. (DONE)
- I can use more generalization techniques to get a higher accuracy score on the test data (e.g. Ensembling)
- It'd be great if I used more statistical methods to understand the correlations in the data.

## Final words
Hello everyone! This is my first notebook where I tried to do everything by myself (without relying on other notebooks too much :) ). If you read and liked my work, please consider upvoting it. It'd also be awesome if you shared your ideas and opinions about this simple project. Thank you for reading all of this, I appreciate it :)