<p align="center">
  <img src="./src/images/Icon_TensorFlow.png" alt="Icon TensorFlow" width="50" align="left"> 
</p>

# Deep Learning Regression Model for Predicting an Applicant's Chances of Admission

---


- Author: [Stefanus Bernard Melkisedek](https://www.github.com/stefansphtr)
- Email: [stefanussipahutar@gmail.com](mailto:stefanussipahutar@gmail.com)
- Date: 2024-02-04

## Project Description

This project builds a regression model that predicts an applicant's chance of being admitted to a university based on their exam scores (GRE and TOEFL) and other profile attributes such as CGPA and research experience. The dataset contains nine columns:

1. Serial No. (index)
2. GRE Score (int)
3. TOEFL Score (int)
4. University Rating (int) 
5. SOP (Statement of Purpose) (float)
6. LOR (Letter of Recommendation) (float)
7. CGPA (Cumulative Grade Point Average) (float)
8. Research (int)
9. Chance of Admit (float)

The model is a neural network with two hidden layers. It is trained on the training set and evaluated on the test set, using the mean squared error (MSE) and the R-squared score as performance metrics.
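
As a rough illustration of the two metrics named above, the following sketch computes MSE and R-squared by hand in plain Python. The function names and the predicted values are hypothetical and chosen only for illustration; they are not the model's actual outputs. The `r2_score` function from scikit-learn, which the notebook imports, should give the same R-squared value for the same inputs.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared residuals; 0 means a perfect fit."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """1 - SS_res / SS_tot: share of the target variance explained by the predictions."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# First five "Chance of Admit" values in the dataset
y_true = [0.92, 0.76, 0.72, 0.80, 0.65]
# Hypothetical predictions, made up purely for illustration
y_pred = [0.90, 0.78, 0.70, 0.79, 0.68]

mse = mean_squared_error(y_true, y_pred)
r2 = r_squared(y_true, y_pred)
print(f"MSE: {mse:.5f}, R^2: {r2:.3f}")
```

An R-squared of 1 means the predictions match the targets exactly, while a model that always predicts the mean of the targets scores 0.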

## Prepare the libraries

In [2]:
# Importing libraries for data manipulation and analysis
import numpy as np  # For numerical operations
import pandas as pd  # For data manipulation and analysis
import matplotlib.pyplot as plt  # For data visualization

# Importing TensorFlow and Keras for creating and training the neural network model
import tensorflow as tf  # For machine learning and numerical computation
from tensorflow import keras  # High-level API to build and train models in TensorFlow
from keras.models import Sequential  # For linear stacking of layers
from keras.callbacks import EarlyStopping  # To stop training when a monitored metric has stopped improving
from keras.layers import Dense  # For fully connected layers

# Importing Scikit-learn libraries for data preprocessing and performance metrics
from sklearn.model_selection import train_test_split  # For splitting the data into train and test sets
from sklearn.preprocessing import StandardScaler  # For standardization of features
from sklearn.preprocessing import Normalizer  # For rescaling each sample (row) to unit norm
from sklearn.metrics import r2_score  # For regression performance metrics

## Data Wrangling

**Data Wrangling**: The process of **gathering, selecting, and transforming data** to answer an analytical question. Also known as **data munging**, it **involves cleaning and unifying messy and complex data sets** for easy access and analysis, which can include handling missing values, outliers, or inconsistent data.

### Gathering the dataset

The Graduate Admission 2 dataset was obtained from [Kaggle](https://www.kaggle.com/datasets/mohansacharya/graduate-admissions) and is stored in the `data` directory.

Read and store the dataset in a DataFrame.

In [3]:
df_admissions = pd.read_csv('./data/admissions_data.csv')
df_admissions.head()

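|   | Serial No. | GRE Score | TOEFL Score | University Rating | SOP | LOR | CGPA | Research | Chance of Admit |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 337 | 118 | 4 | 4.5 | 4.5 | 9.65 | 1 | 0.92 |
| 1 | 2 | 324 | 107 | 4 | 4.0 | 4.5 | 8.87 | 1 | 0.76 |
| 2 | 3 | 316 | 104 | 3 | 3.0 | 3.5 | 8.0 | 1 | 0.72 |
| 3 | 4 | 322 | 110 | 3 | 3.5 | 2.5 | 8.67 | 1 | 0.8 |
| 4 | 5 | 314 | 103 | 2 | 2.0 | 3.0 | 8.21 | 0 | 0.65 |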


### Assessing the dataset

In [4]:
# Check the summary of the dataset and its columns
df_admissions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 9 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Serial No.         500 non-null    int64  
 1   GRE Score          500 non-null    int64  
 2   TOEFL Score        500 non-null    int64  
 3   University Rating  500 non-null    int64  
 4   SOP                500 non-null    float64
 5   LOR                500 non-null    float64
 6   CGPA               500 non-null    float64
 7   Research           500 non-null    int64  
 8   Chance of Admit    500 non-null    float64
dtypes: float64(4), int64(5)
memory usage: 35.3 KB


> The dataset has 9 columns and 500 rows. It is in good shape, and every column has an appropriate data type.

In [6]:
# Count the missing values in each column of df_admissions
df_admissions.isna().sum()

Serial No.           0
GRE Score            0
TOEFL Score          0
University Rating    0
SOP                  0
LOR                  0
CGPA                 0
Research             0
Chance of Admit      0
dtype: int64

> There is no missing data in the dataset. It looks ready to use, but for completeness we will also check for duplicates, outliers, and the distribution of the data.

In [9]:
# Count the duplicate rows in df_admissions
total_duplicate_values = df_admissions.duplicated().sum()
print(f"Total duplicate values: {total_duplicate_values}")

Total duplicate values: 0


In [11]:
# Check the summary statistics of df_admissions
df_admissions.describe()

|   | Serial No. | GRE Score | TOEFL Score | University Rating | SOP | LOR | CGPA | Research | Chance of Admit |
|---|---|---|---|---|---|---|---|---|---|
| count | 500.0 | 500.0 | 500.0 | 500.0 | 500.0 | 500.0 | 500.0 | 500.0 | 500.0 |
| mean | 250.5 | 316.472 | 107.192 | 3.114 | 3.374 | 3.484 | 8.57644 | 0.56 | 0.72174 |
| std | 144.481833 | 11.295148 | 6.081868 | 1.143512 | 0.991004 | 0.92545 | 0.604813 | 0.496884 | 0.14114 |
| min | 1.0 | 290.0 | 92.0 | 1.0 | 1.0 | 1.0 | 6.8 | 0.0 | 0.34 |
| 25% | 125.75 | 308.0 | 103.0 | 2.0 | 2.5 | 3.0 | 8.1275 | 0.0 | 0.63 |
| 50% | 250.5 | 317.0 | 107.0 | 3.0 | 3.5 | 3.5 | 8.56 | 1.0 | 0.72 |
| 75% | 375.25 | 325.0 | 112.0 | 4.0 | 4.0 | 4.0 | 9.04 | 1.0 | 0.82 |
| max | 500.0 | 340.0 | 120.0 | 5.0 | 5.0 | 5.0 | 9.92 | 1.0 | 0.97 |
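
The quartiles in the summary statistics above are enough for a quick first pass at the outlier check mentioned earlier. The sketch below applies the common 1.5 × IQR rule (Tukey's fences) to the reported minimum and maximum of each column. The helper name `iqr_fences` and the `summary` dictionary are hypothetical, with values copied by hand from the `describe()` output. `Serial No.` is left out on the assumption that it is only an identifier.

```python
# Quartiles and extremes copied from the describe() output above
summary = {
    "GRE Score":         {"min": 290.0, "q1": 308.0,  "q3": 325.0, "max": 340.0},
    "TOEFL Score":       {"min": 92.0,  "q1": 103.0,  "q3": 112.0, "max": 120.0},
    "University Rating": {"min": 1.0,   "q1": 2.0,    "q3": 4.0,   "max": 5.0},
    "SOP":               {"min": 1.0,   "q1": 2.5,    "q3": 4.0,   "max": 5.0},
    "LOR":               {"min": 1.0,   "q1": 3.0,    "q3": 4.0,   "max": 5.0},
    "CGPA":              {"min": 6.8,   "q1": 8.1275, "q3": 9.04,  "max": 9.92},
    "Research":          {"min": 0.0,   "q1": 0.0,    "q3": 1.0,   "max": 1.0},
    "Chance of Admit":   {"min": 0.34,  "q1": 0.63,   "q3": 0.82,  "max": 0.97},
}


def iqr_fences(q1, q3, k=1.5):
    """Return the (lower, upper) fences: q1 - k * IQR and q3 + k * IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Flag a column when its minimum or maximum falls outside the fences
flagged = {}
for column, stats in summary.items():
    lower, upper = iqr_fences(stats["q1"], stats["q3"])
    below = stats["min"] < lower
    above = stats["max"] > upper
    if below or above:
        flagged[column] = {"lower": lower, "upper": upper, "below": below, "above": above}

for column, info in flagged.items():
    side = "below the lower fence" if info["below"] else "above the upper fence"
    print(f"{column}: at least one value {side}")
```

A flagged column only shows that at least one value lies outside the fences, not how many; counting them requires the raw data. A flagged value is also not necessarily an error: a low `LOR` rating, for example, is still a valid score on the 1 to 5 scale.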


## Data Preprocessing

**Data Preprocessing**: The process of **converting raw data into a clean, model-ready format**. Real-world data is often incomplete or inconsistent and may not be usable in its original form. **Steps in data preprocessing** might include **normalization, standardization, and encoding categorical variables**. This step is crucial because it can **affect the performance of the model**.
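
To make standardization and normalization concrete, here is a minimal sketch in plain Python applied to the first five GRE scores in the dataset. The function names are hypothetical. Min-max scaling is shown as one common meaning of "normalization", which is an assumption about what is meant here; scikit-learn's `Normalizer` does something different and rescales each row to unit norm. The standardization uses the population standard deviation, which is also what `StandardScaler` uses.

```python
from statistics import fmean, pstdev


def standardize(values):
    """Z-score: subtract the mean, then divide by the population standard deviation."""
    mu = fmean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]


def min_max_scale(values):
    """Rescale linearly so the smallest value maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


# First five GRE scores in the dataset
gre = [337, 324, 316, 322, 314]

gre_z = standardize(gre)      # mean 0, standard deviation 1
gre_01 = min_max_scale(gre)   # values between 0 and 1

print([round(z, 3) for z in gre_z])
print([round(v, 3) for v in gre_01])
```

In a train/test workflow, the mean, standard deviation, minimum, and maximum are normally computed on the training set only and then reused to transform the test set, so that no information from the test set leaks into training.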