# Graduate Admission Prediction using Machine Learning

In this notebook we will:

- Read the data, originally downloaded from [this link](https://www.kaggle.com/datasets/mohansacharya/graduate-admissions?select=Admission_Predict_Ver1.1.csv).

- Preprocess the data (scale the feature columns).

- Build a linear regression model.

- Save the model for future use.

## Import necessary libraries

In [1]:
# for data i/o and manipulation
import pandas as pd
import numpy as np

# for model building, training and testing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error

# for saving the preprocessing data and model
import yaml
import pickle

# regular expression
import re

## Read the data

In [2]:
data = pd.read_csv("./data/Admission_Prediction.csv")

In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 9 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Serial No.         500 non-null    int64  
 1   GRE Score          500 non-null    int64  
 2   TOEFL Score        500 non-null    int64  
 3   University Rating  500 non-null    int64  
 4   SOP                500 non-null    float64
 5   LOR                500 non-null    float64
 6   CGPA               500 non-null    float64
 7   Research           500 non-null    int64  
 8   Chance of Admit    500 non-null    float64
dtypes: float64(4), int64(5)
memory usage: 35.3 KB


### Remove spaces from the name of the columns

In [4]:
columns = data.columns.to_list()

In [5]:
rename_columns = []
for x in columns:
    rename_columns.append(re.sub(' ','_',x.strip()))

In [6]:
data.columns = rename_columns

In [7]:
data.head()

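| | Serial_No. | GRE_Score | TOEFL_Score | University_Rating | SOP | LOR | CGPA | Research | Chance_of_Admit |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 337 | 118 | 4 | 4.5 | 4.5 | 9.65 | 1 | 0.92 |
| 1 | 2 | 324 | 107 | 4 | 4.0 | 4.5 | 8.87 | 1 | 0.76 |
| 2 | 3 | 316 | 104 | 3 | 3.0 | 3.5 | 8.0 | 1 | 0.72 |
| 3 | 4 | 322 | 110 | 3 | 3.5 | 2.5 | 8.67 | 1 | 0.8 |
| 4 | 5 | 314 | 103 | 2 | 2.0 | 3.0 | 8.21 | 0 | 0.65 |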


## Data Pre-processing

In [8]:
preprocessing_dictionary = dict()
scaled_feature_list = []

In [9]:
def min_max_scaling(feature_value, max_value, min_value):
    scaled_feature = ((feature_value - min_value)/(max_value - min_value))
    return scaled_feature
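
As a quick check of the formula, the sketch below applies the same min-max rule to a few raw values. The range of 290 to 340 is an assumed, illustrative range and not the minimum and maximum of this dataset.

```python
def min_max_scaling(feature_value, max_value, min_value):
    """Map feature_value linearly so that min_value -> 0 and max_value -> 1."""
    return (feature_value - min_value) / (max_value - min_value)

# Hypothetical bounds, chosen only for illustration
assumed_min, assumed_max = 290, 340

scaled = [min_max_scaling(v, max_value=assumed_max, min_value=assumed_min)
          for v in (290, 315, 340)]
print(scaled)  # [0.0, 0.5, 1.0]
```

A value outside the stored range, such as a score above the training maximum, maps outside [0, 1]; the formula does not clip.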

In [10]:
def feature_scaling(data, column):
    # cast to built-in float so that yaml.dump writes plain numbers that SafeLoader can read back
    min_value_column = float(data[column].min())
    max_value_column = float(data[column].max())
    preprocessing_dictionary[column+'_max'] = max_value_column
    preprocessing_dictionary[column+'_min'] = min_value_column
    data['scaled_'+column] = data[column].apply(lambda x: min_max_scaling(x, max_value=max_value_column, min_value=min_value_column))
    scaled_feature_list.append('scaled_'+column)

In [11]:
feature_scaling(data, 'GRE_Score')

In [12]:
feature_scaling(data, 'TOEFL_Score')

In [13]:
feature_scaling(data, 'University_Rating')

In [14]:
feature_scaling(data, 'CGPA')

In [15]:
preprocessing_dictionary['feature_list'] = scaled_feature_list

## Storing the preprocessing dictionary for future use

In [16]:
with open('preprocessing.yml', 'w') as outfile:
    yaml.dump(preprocessing_dictionary, outfile)

To load the YAML file, use the following code. The `with` block closes the file automatically.

```python
with open('preprocessing.yml', 'r') as readfile:
    values = yaml.load(readfile, Loader=yaml.SafeLoader)
```
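
Once loaded, the stored minimum and maximum values can be used to scale a new applicant's raw scores in the same way as the training data. The following is a minimal sketch: `scale_applicant` is a hypothetical helper, and the `values` dictionary is written out by hand with illustrative numbers so that the block runs without the YAML file. In practice `values` would come from the loading code above.

```python
def scale_applicant(raw_scores, values):
    """Return scaled features in the order given by values['feature_list']."""
    scaled_row = []
    for scaled_name in values['feature_list']:
        column = scaled_name[len('scaled_'):]  # 'scaled_CGPA' -> 'CGPA'
        min_value = values[column + '_min']
        max_value = values[column + '_max']
        scaled_row.append((raw_scores[column] - min_value) / (max_value - min_value))
    return scaled_row

# Stand-in for the loaded YAML content; the numbers are illustrative assumptions
values = {
    'feature_list': ['scaled_GRE_Score', 'scaled_CGPA'],
    'GRE_Score_min': 290.0, 'GRE_Score_max': 340.0,
    'CGPA_min': 6.0, 'CGPA_max': 10.0,
}

applicant = {'GRE_Score': 315, 'CGPA': 8.0}
print(scale_applicant(applicant, values))  # [0.5, 0.5]
```

Iterating over `feature_list` keeps the scaled values in the column order the model was trained on.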

## Building, Training and Testing the model

In [17]:
X = data[scaled_feature_list].values
y = data['Chance_of_Admit'].values

### Splitting the data into train and test

In [18]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=100, shuffle=True)

### Instantiate the Linear Regression model

In [19]:
lm = LinearRegression()

### Fitting the Linear Regression model

In [20]:
lm.fit(X=X_train, y=y_train)

LinearRegression()

### Generating predictions on holdout test samples

In [21]:
y_pred_test = lm.predict(X_test)

In [23]:
pd.DataFrame(zip(y_test, y_pred_test, np.abs(y_test-y_pred_test)), columns=['Actual', 'Predicted', 'Absolute Difference']).head(10)

| | Actual | Predicted | Absolute Difference |
|---|---|---|---|
| 0 | 0.78 | 0.855078 | 0.075078 |
| 1 | 0.54 | 0.494761 | 0.045239 |
| 2 | 0.64 | 0.627153 | 0.012847 |
| 3 | 0.47 | 0.46036 | 0.00964 |
| 4 | 0.7 | 0.659339 | 0.040661 |
| 5 | 0.88 | 0.837491 | 0.042509 |
| 6 | 0.57 | 0.59451 | 0.02451 |
| 7 | 0.72 | 0.667064 | 0.052936 |
| 8 | 0.84 | 0.818327 | 0.021673 |
| 9 | 0.64 | 0.672571 | 0.032571 |


### Evaluating model performance on train dataset

In [24]:
y_pred_train = lm.predict(X_train)

In [25]:
print(f"R2 Score obtained on train dataset: {r2_score(y_true=y_train, y_pred=y_pred_train)}")

R2 Score obtained on train dataset: 0.8013131377463758


In [26]:
print(f"Root Mean Squared Error on train dataset: {np.sqrt(mean_squared_error(y_true=y_train, y_pred=y_pred_train))}")

Root Mean Squared Error on train dataset: 0.0627857204350084


### Evaluating model performance on test dataset

In [27]:
print(f"R2 Score obtained on test dataset: {r2_score(y_true=y_test, y_pred=y_pred_test)}")

R2 Score obtained on test dataset: 0.8314685359590734


In [28]:
print(f"Root Mean Squared Error on test dataset: {np.sqrt(mean_squared_error(y_true=y_test, y_pred=y_pred_test))}")

Root Mean Squared Error on test dataset: 0.05715444164431594
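
For reference, both metrics can be computed by hand. The sketch below uses the standard definitions, R2 = 1 - SS_res / SS_tot and RMSE = the square root of the mean squared residual, on three made-up values. `r2_by_hand` and `rmse_by_hand` are illustrative names and not part of scikit-learn.

```python
import math

def rmse_by_hand(y_true, y_pred):
    """Square root of the mean squared residual."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

def r2_by_hand(y_true, y_pred):
    """1 - (residual sum of squares) / (total sum of squares around the mean)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Made-up values, only to exercise the formulas
y_true_demo = [1.0, 2.0, 3.0]
y_pred_demo = [1.0, 2.0, 4.0]

print(r2_by_hand(y_true_demo, y_pred_demo))    # 0.5
print(rmse_by_hand(y_true_demo, y_pred_demo))
```

Read this way, the test R2 of about 0.83 means the model accounts for roughly 83% of the variance in `Chance_of_Admit` on the held-out samples, and the RMSE of about 0.057 is expressed in the same 0 to 1 units as the target.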


## Retrain the model on entire dataset

In [29]:
final_model = LinearRegression()

In [30]:
final_model.fit(X, y)

LinearRegression()

In [32]:
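# the with block makes sure the file handle is closed after writing
with open('./model/linear_reg_model.pkl', 'wb') as model_file:
    pickle.dump(final_model, model_file)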
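
To reuse the saved model later, it can be loaded with `pickle.load` and given scaled features in the same column order as `scaled_feature_list`. The sketch below shows the round trip on a tiny synthetic dataset, serialising in memory with `pickle.dumps` instead of writing to `./model/linear_reg_model.pkl`. The data and variable names are assumptions made for illustration.

```python
import pickle

import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the scaled features and target (not the admissions data)
X_demo = np.array([[0.0, 0.0], [0.5, 0.2], [1.0, 0.4], [0.2, 1.0], [0.8, 0.6]])
y_demo = 0.3 + 0.4 * X_demo[:, 0] + 0.2 * X_demo[:, 1]

model = LinearRegression().fit(X_demo, y_demo)

# Serialise and restore; with a file, use pickle.dump / pickle.load on an open handle
restored = pickle.loads(pickle.dumps(model))

new_row = np.array([[0.6, 0.5]])  # one already-scaled sample, shape (1, n_features)
print(restored.predict(new_row))
```

Pickle files should only be loaded from a trusted source, ideally with the same scikit-learn version that wrote them.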