# Graduates Admission - Prediction Model

In this notebook, we will build a model that predicts the probability of a student being admitted to a particular university.

We want to build a dashboard to help students with their graduate admissions. A student would use the dashboard to gauge the probability of being admitted to a particular university.

The dashboard would be powered by an ML model that predicts this probability. Let's say that we reached out to various universities to gather the required training data.

The data that we have collated so far has the following columns:

- GRE Scores (out of 340)
- TOEFL Scores (out of 120)
- University Rating (out of 5)
- Statement of Purpose (SOP) Strength (out of 5)
- Letter of Recommendation (LOR) Strength (out of 5)
- Undergraduate GPA (CGPA) (out of 10)
- Research Experience (either 0 or 1)
- Gender (either M or F)
- Chance of Admit (ranging from 0 to 1)

Let's work on the model.

## Imports

In [None]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns

## Read data

In [None]:
raw_data = pd.read_csv('data/admission_data-v2.csv')
raw_data.head()


### Convert categorical columns to numeric columns

In [None]:
data_gender_processed = raw_data.copy()
# Encode 'F' as 0 and 'M' as 1 so the column can be used in numeric EDA
data_gender_processed['Gender'] = data_gender_processed['Gender'].map({'F': 0, 'M': 1})
data_gender_processed.head()


## EDA & Feature Engineering

### Basic stats for columns

In [None]:
data_for_eda = data_gender_processed.copy()
data_for_eda.describe().T


### Check missing values

In [None]:
data_for_eda.isnull().sum()


### Pair plots

In [None]:
sns.pairplot(data_for_eda, kind='reg', diag_kind='kde')
plt.show()


In [None]:
data_for_eda.corr()


## Check for Outliers

Let's see if our data has any outliers by drawing a box plot for each non-binary feature column.

In [None]:
# Skip the binary features and the target column
skip_cols = ['Research', 'Gender', 'Chance of Admit']

for col in data_for_eda.columns:
    if col in skip_cols:
        continue
    plt.figure(figsize=(17, 1))
    sns.boxplot(data=data_for_eda, x=col)
    plt.show()


Let's conclude that there are no obvious outliers.
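
The visual check can be complemented with a numeric one. The sketch below applies the common 1.5 × IQR rule, which is also what seaborn's box plot whiskers use by default. The helper `iqr_outlier_counts` and the toy values are made up for illustration and are not part of the admission data.

```python
import pandas as pd


def iqr_outlier_counts(df, columns, k=1.5):
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for each column."""
    counts = {}
    for col in columns:
        q1 = df[col].quantile(0.25)
        q3 = df[col].quantile(0.75)
        iqr = q3 - q1
        lower, upper = q1 - k * iqr, q3 + k * iqr
        counts[col] = int(((df[col] < lower) | (df[col] > upper)).sum())
    return pd.Series(counts, name='outlier_count')


# Toy data (hypothetical); the 260 in 'GRE Score' is a planted outlier
toy = pd.DataFrame({
    'GRE Score': [310, 315, 320, 325, 330, 260],
    'CGPA': [8.0, 8.2, 8.5, 8.7, 9.0, 8.4],
})
outlier_counts = iqr_outlier_counts(toy, ['GRE Score', 'CGPA'])
print(outlier_counts)
```

On the real data, the same helper could be called with `data_for_eda` and the list of non-binary feature columns.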

## Train Regression Models

When the model is deployed, we will receive data in the original format shown below, without the target column 'Chance of Admit'.

In [None]:
raw_data.drop(columns=['Chance of Admit']).head()


Hence, we need to build a pipeline to transform this data before passing it to the model.

### Data Preprocessing Pipeline

In [None]:
from sklearn.preprocessing import MinMaxScaler, FunctionTransformer
from sklearn.compose import ColumnTransformer


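def encode_gender(gender_df):
    # Encode 'M' as 1 and any other value as 0, keeping the column name
    return pd.DataFrame(gender_df['Gender'].apply(lambda x: 1 if x == 'M' else 0), columns=['Gender'])

# Create data pre-processing step
# Note: any column not listed below (e.g. University Rating, Research Experience)
# is dropped, because ColumnTransformer defaults to remainder='drop'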
data_preprocessor = ColumnTransformer(
    [
        ('gre_transformer', MinMaxScaler(), ['GRE Score']),
        ('toefl_transformer', MinMaxScaler(), ['TOEFL Score']),
        ('sop_transformer', MinMaxScaler(), ['SOP']),
        ('lor_transformer', MinMaxScaler(), ['LOR']),
        ('cgpa_transformer', MinMaxScaler(), ['CGPA']),
        ('gender_encoder', FunctionTransformer(encode_gender), ['Gender'])
    ]
)

data_preprocessor
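
To make the behaviour of the preprocessing step concrete, here is a minimal sketch on a three-row toy frame. The values and the name `toy_preprocessor` are made up for illustration. It shows that `MinMaxScaler` maps the fitted minimum and maximum to 0 and 1, that the gender encoder produces 0/1, and that a column not listed in the transformer (here `Research`) does not appear in the output.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, FunctionTransformer


def encode_gender(gender_df):
    # Encode 'M' as 1 and any other value as 0
    return pd.DataFrame(gender_df['Gender'].apply(lambda x: 1 if x == 'M' else 0), columns=['Gender'])


# Toy input (hypothetical values)
toy = pd.DataFrame({
    'GRE Score': [300, 320, 340],
    'Research': [0, 1, 1],
    'Gender': ['F', 'M', 'F'],
})

toy_preprocessor = ColumnTransformer(
    [
        ('gre_transformer', MinMaxScaler(), ['GRE Score']),
        ('gender_encoder', FunctionTransformer(encode_gender), ['Gender']),
    ]
)

# Output columns follow the transformer order: scaled GRE, encoded gender
transformed = toy_preprocessor.fit_transform(toy)
print(transformed)
```

If the dropped columns should reach the model unchanged, `remainder='passthrough'` is the `ColumnTransformer` option for that; whether they belong in this model is a modelling decision.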


### Train-Test Split

In [None]:
from sklearn.model_selection import train_test_split

X = raw_data.drop(columns=['Chance of Admit'])
y = raw_data['Chance of Admit']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
print(f'Train size - {X_train.shape[0]}, Test size - {X_test.shape[0]}')


### Linear Regression

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline


# Create Pipeline
estimator_lr = Pipeline(
    [
        ('Preprocessor', data_preprocessor),
        ('Estimator', LinearRegression())
    ]
)

estimator_lr


#### Prediction

In [None]:
# Fit training data
estimator_lr.fit(X_train, y_train)

# Predict on testing data
y_lr = estimator_lr.predict(X_test)
y_lr


In [None]:
test_with_prediction_lr = X_test.copy()
test_with_prediction_lr['actual'] = y_test
test_with_prediction_lr['prediction'] = y_lr
test_with_prediction_lr['residual'] = y_test - y_lr
test_with_prediction_lr.head()


#### Residual Plots

In [None]:
residuals_lr_test = y_test - y_lr
plt.figure(figsize=(5, 3))
residuals_lr_test.plot.kde(label='Residual')
plt.legend()
plt.show()


In [None]:
plt.figure(figsize=(5, 5))
# Use actual - predicted, the same sign convention as the test residuals
residuals_lr_train = y_train - estimator_lr.predict(X_train)
plt.scatter(x=y_train, y=residuals_lr_train, color='green', label='Train')
plt.scatter(x=y_test, y=residuals_lr_test, color='orange', label='Test')
plt.legend()
plt.xlabel('Actual')
plt.ylabel('Residual')
plt.axhline(y=0, color='blue')
plt.show()


#### Performance Metrics

In [None]:
from sklearn.metrics import mean_absolute_error, r2_score

mae_lr = mean_absolute_error(y_test, y_lr)
r2_lr = r2_score(y_test, y_lr)
print(f'MAE: {mae_lr}')
print(f'R2: {r2_lr}')
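
For reference, both metrics are simple to compute by hand. The sketch below uses three made-up values (not taken from the admission data) and checks the manual results against scikit-learn: MAE is the mean of the absolute residuals, and R2 is one minus the ratio of the residual sum of squares to the total sum of squares.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# Toy values (hypothetical)
y_true = np.array([0.50, 0.70, 0.90])
y_pred = np.array([0.60, 0.70, 0.80])

# MAE: average size of the residuals, in the units of the target
mae_manual = np.mean(np.abs(y_true - y_pred))

# R2: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(f'MAE: {mae_manual:.4f} (sklearn: {mean_absolute_error(y_true, y_pred):.4f})')
print(f'R2: {r2_manual:.4f} (sklearn: {r2_score(y_true, y_pred):.4f})')
```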


### Random Forest Regressor

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline


# Create Pipeline
estimator_rf = Pipeline(
    [
        ('Preprocessor', data_preprocessor),
        # Fixed seed so that the results are reproducible across runs
        ('Estimator', RandomForestRegressor(random_state=1))
    ]
)

estimator_rf


#### Prediction

In [None]:
# Fit training data
estimator_rf.fit(X_train, y_train)

# Predict on testing data
y_rf = estimator_rf.predict(X_test)
y_rf


In [None]:
test_with_prediction_rf = X_test.copy()
test_with_prediction_rf['actual'] = y_test
test_with_prediction_rf['prediction'] = y_rf
test_with_prediction_rf['residual'] = y_test - y_rf
test_with_prediction_rf.head()


#### Residual Plot

In [None]:
residuals_rf_test = y_test - y_rf
plt.figure(figsize=(5, 3))
residuals_rf_test.plot.kde(label='Residual')
plt.legend()
plt.show()


In [None]:
plt.figure(figsize=(5, 5))
# Use actual - predicted, the same sign convention as the test residuals
residuals_rf_train = y_train - estimator_rf.predict(X_train)
plt.scatter(x=y_train, y=residuals_rf_train, color='green', label='Train')
plt.scatter(x=y_test, y=residuals_rf_test, color='orange', label='Test')
plt.legend()
plt.xlabel('Actual')
plt.ylabel('Residual')
plt.axhline(y=0, color='blue')
plt.show()


#### Performance Metrics

In [None]:
from sklearn.metrics import mean_absolute_error, r2_score

mae_rf = mean_absolute_error(y_test, y_rf)
r2_rf = r2_score(y_test, y_rf)
print(f'MAE: {mae_rf}')
print(f'R2: {r2_rf}')


### Compare two models

In [None]:
pd.DataFrame([[mae_lr, mae_rf, 'MAE'], [r2_lr, r2_rf, 'R2']],
             columns=['Linear Regression', 'Random Forest Regressor', 'Metrics']).set_index('Metrics')


Let's finalize the Linear Regression model and save it for deployment!

### Save the final model

In [None]:
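import os
import pickle

model_location = './models/lr_v1'

# Make sure the target directory exists before writing
os.makedirs(os.path.dirname(model_location), exist_ok=True)

# pickle.dump writes to the file and returns None, so there is nothing to assign
with open(model_location, 'wb') as fp:
    pickle.dump(estimator_lr, fp)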

print(f'Saved model at - {model_location}')
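
Before sharing a saved model, it is worth checking that it loads and predicts the same values. The sketch below shows the round trip with a stand-in `LinearRegression` trained on made-up numbers and a temporary file, so it does not touch `./models/lr_v1`. One caveat for the real pipeline: pickle stores functions by reference, so the code that loads the model must be able to import or define `encode_gender` under the same module name.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.linear_model import LinearRegression

# Stand-in model trained on toy data (hypothetical values)
X_toy = np.array([[0.0], [1.0], [2.0]])
y_toy = np.array([0.2, 0.5, 0.8])
model = LinearRegression().fit(X_toy, y_toy)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / 'toy_model'

    # Save the fitted model
    with open(model_path, 'wb') as fp:
        pickle.dump(model, fp)

    # Load it back, as the serving code would
    with open(model_path, 'rb') as fp:
        loaded_model = pickle.load(fp)

# The loaded model should reproduce the original predictions
predictions_match = np.allclose(model.predict(X_toy), loaded_model.predict(X_toy))
print(predictions_match)
```

Pickle files should only be loaded from trusted sources and with the same library versions used for training.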


## Artifacts to be shared

- Model & Performance Analysis
- Library versions
- Train, Test dataset
- Code used to train the model
- ...
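
For the library versions item, one possible way to record them is sketched below. The selection of libraries and the `name==version` output format are assumptions; extend the dictionary with whatever else the notebook imports.

```python
import sys

import numpy as np
import pandas as pd
import sklearn

# Collect the versions of the interpreter and the main libraries used for training
library_versions = {
    'python': sys.version.split()[0],
    'numpy': np.__version__,
    'pandas': pd.__version__,
    'scikit-learn': sklearn.__version__,
}

for name, version in library_versions.items():
    print(f'{name}=={version}')
```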
