# Predict Student Chances of Admission

This dataset contains the main parameters that affect a student's chances of being admitted to a master's degree program. My aim is to use linear regression to predict a student's chance of admission from the other parameters (GRE Score, TOEFL Score, University Rating, SOP, LOR, CGPA, Research).

Data source: Kaggle

The parameters included are:

1. GRE Scores (out of 340)
2. TOEFL Scores (out of 120)
3. University Rating (out of 5)
4. Statement of Purpose and Letter of Recommendation Strength (out of 5)
5. Undergraduate GPA (out of 10)
6. Research Experience (either 0 or 1)
7. Chance of Admit (ranging from 0 to 1)

# Acknowledgements
This dataset is inspired by the UCLA Graduate Dataset. The test scores and GPA are in the older format. The dataset is owned by Mohan S Acharya.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt




admission = pd.read_csv('Admission_Predict.csv')
admission.head()

Unnamed: 0,Serial No.,GRE Score,TOEFL Score,University Rating,SOP,LOR,CGPA,Research,Chance of Admit
0,1,337,118,4,4.5,4.5,9.65,1,0.92
1,2,324,107,4,4.0,4.5,8.87,1,0.76
2,3,316,104,3,3.0,3.5,8.0,1,0.72
3,4,322,110,3,3.5,2.5,8.67,1,0.8
4,5,314,103,2,2.0,3.0,8.21,0,0.65


In [2]:
admission.tail()

Unnamed: 0,Serial No.,GRE Score,TOEFL Score,University Rating,SOP,LOR,CGPA,Research,Chance of Admit
395,396,324,110,3,3.5,3.5,9.04,1,0.82
396,397,325,107,3,3.0,3.5,9.11,1,0.84
397,398,330,116,4,5.0,4.5,9.45,1,0.91
398,399,312,103,3,3.5,4.0,8.78,0,0.67
399,400,333,117,4,5.0,4.0,9.66,1,0.95


In [3]:
#check the dtypes of the columns of our dataframes

print(admission.dtypes)

Serial No.             int64
GRE Score              int64
TOEFL Score            int64
University Rating      int64
SOP                  float64
LOR                  float64
CGPA                 float64
Research               int64
Chance of Admit      float64
dtype: object


In [4]:
# All columns are integers or floats, so scikit-learn can use them without any encoding.

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

print(len(admission))

400


In [5]:
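# Next we split the data: the first 75% of the rows for training, the remaining 25% for testing.

train_end = (75/100)*len(admission)
print(train_end)
# Use the computed boundary instead of a hard-coded 300 (slice positions must be integers)
train = admission[:int(train_end)]
test = admission[int(train_end):]
target = 'Chance of Admit '  # the column name in the CSV ends with a space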

train.head()

300.0


Unnamed: 0,Serial No.,GRE Score,TOEFL Score,University Rating,SOP,LOR,CGPA,Research,Chance of Admit
0,1,337,118,4,4.5,4.5,9.65,1,0.92
1,2,324,107,4,4.0,4.5,8.87,1,0.76
2,3,316,104,3,3.0,3.5,8.0,1,0.72
3,4,322,110,3,3.5,2.5,8.67,1,0.8
4,5,314,103,2,2.0,3.0,8.21,0,0.65
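
The split above takes the first 300 rows in file order. If the rows happen to be ordered in some way, a shuffled split may give a more representative test set. Below is a minimal sketch of that alternative, assuming scikit-learn's `train_test_split`. The frame `demo` and the names `train_demo` and `test_demo` are illustrative, and the random values stand in for the Kaggle file.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same number of rows as the dataset (not the real data)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    'Serial No.': np.arange(1, 401),
    'GRE Score': rng.integers(290, 341, size=400),
    'Chance of Admit ': rng.uniform(0.3, 1.0, size=400),
})

# Shuffle the rows, then hold out 25% of them for testing;
# random_state makes the shuffle reproducible
train_demo, test_demo = train_test_split(demo, test_size=0.25, random_state=42)

print(len(train_demo), len(test_demo))
```

The RMSE values reported in this notebook come from the ordered split, so a shuffled split would give different numbers.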


In [6]:
# Set the feature columns: every column except the target and 'Serial No.'
feature = train.columns.drop(['Chance of Admit ', 'Serial No.'])
# 'Serial No.' is only a row identifier, so it carries no information about admission
print(feature)

Index(['GRE Score', 'TOEFL Score', 'University Rating', 'SOP', 'LOR ', 'CGPA',
       'Research'],
      dtype='object')


In [7]:
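# Fit on the training rows, then predict on the held-out test rows
lr = LinearRegression()
lr.fit(train[feature], train['Chance of Admit '])
predictions = lr.predict(test[feature])

# mean_squared_error expects (y_true, y_pred)
mse = mean_squared_error(test['Chance of Admit '], predictions)
rmse = np.sqrt(mse)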

In [8]:
print(rmse)

0.05996213907279751
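
RMSE is expressed in the same units as the target, so a value near 0.06 means the predicted chance of admission is typically off by about 0.06 on the 0 to 1 scale. The sketch below works through the calculation by hand with NumPy. The `y_true` values are the first five `Chance of Admit` values shown earlier, while `y_pred` holds hypothetical predictions invented for illustration.

```python
import numpy as np

y_true = np.array([0.92, 0.76, 0.72, 0.80, 0.65])  # first five targets from the dataset
y_pred = np.array([0.90, 0.80, 0.70, 0.78, 0.70])  # hypothetical predictions

errors = y_pred - y_true           # per-row prediction error
mse_demo = np.mean(errors ** 2)    # mean of the squared errors
rmse_demo = np.sqrt(mse_demo)      # square root brings it back to target units

print(mse_demo, rmse_demo)
```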


# Next Steps

What can be done next to reduce the root mean square error?

In [9]:
#check for the presence of null values!

empty = admission.isnull().sum()
print(empty)

Serial No.           0
GRE Score            0
TOEFL Score          0
University Rating    0
SOP                  0
LOR                  0
CGPA                 0
Research             0
Chance of Admit      0
dtype: int64


In [10]:
# Next, min-max scale GRE Score and TOEFL Score to the 0-1 range so their values are similar in size to the other features.
# Note that the test set is scaled with its own min and max here, not the training set's.
normalize = ['GRE Score','TOEFL Score']
train_normalized = (train[normalize] - train[normalize].min())/(train[normalize].max() - train[normalize].min())
test_normalized = (test[normalize] - test[normalize].min())/(test[normalize].max() - test[normalize].min())

In [11]:
new_train = train.copy()
new_test = test.copy()
new_train[normalize] = train_normalized
new_test[normalize] = test_normalized

# We may now use new_train and new_test for our linear regression

new_train.head()
new_test.head()

Unnamed: 0,Serial No.,GRE Score,TOEFL Score,University Rating,SOP,LOR,CGPA,Research,Chance of Admit
300,301,0.38,0.518519,2,2.5,2.5,8.0,0,0.62
301,302,0.58,0.592593,2,2.5,3.0,8.76,0,0.66
302,303,0.64,0.481481,2,3.0,3.0,8.45,1,0.65
303,304,0.66,0.555556,3,3.5,3.5,8.55,1,0.73
304,305,0.46,0.518519,2,2.5,2.0,8.43,0,0.62


In [12]:
lr.fit(new_train[feature],new_train['Chance of Admit '])
predictions_new = lr.predict(new_test[feature])

mse1 = mean_squared_error(new_test['Chance of Admit '], predictions_new)
rmse1 = np.sqrt(mse1)

In [13]:
print(rmse1)

0.05920481590325798


In [14]:
difference = rmse1 - rmse
print(difference)

-0.0007573231695395255
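
A plain least-squares fit gives the same predictions when the identical min-max transform is applied to both the training and the test features. The small change in RMSE above therefore most likely comes from scaling the test rows with their own min and max. The sketch below scales both sets with statistics from the training rows only and checks that the predictions match the unscaled fit. It runs on synthetic stand-in data, and `demo`, `col_min`, `col_range` and the other names are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data (not the Kaggle file)
rng = np.random.default_rng(0)
n = 400
demo = pd.DataFrame({
    'GRE Score': rng.integers(290, 341, size=n).astype(float),
    'TOEFL Score': rng.integers(92, 121, size=n).astype(float),
    'CGPA': rng.uniform(6.8, 9.9, size=n),
})
target_col = 'Chance of Admit '
demo[target_col] = (
    0.005 * (demo['GRE Score'] - 290)
    + 0.005 * (demo['TOEFL Score'] - 92)
    + 0.1 * (demo['CGPA'] - 6.8)
    + rng.normal(0, 0.05, size=n)
)

cols = ['GRE Score', 'TOEFL Score', 'CGPA']
scale_cols = ['GRE Score', 'TOEFL Score']
train_demo, test_demo = demo.iloc[:300].copy(), demo.iloc[300:].copy()

# Min and range are computed from the training rows only
col_min = train_demo[scale_cols].min()
col_range = train_demo[scale_cols].max() - col_min

train_scaled, test_scaled = train_demo.copy(), test_demo.copy()
train_scaled[scale_cols] = (train_demo[scale_cols] - col_min) / col_range
test_scaled[scale_cols] = (test_demo[scale_cols] - col_min) / col_range  # same transform

raw_pred = LinearRegression().fit(
    train_demo[cols], train_demo[target_col]).predict(test_demo[cols])
scaled_pred = LinearRegression().fit(
    train_scaled[cols], train_scaled[target_col]).predict(test_scaled[cols])

# With a shared transform the two sets of predictions agree up to rounding
print(np.allclose(raw_pred, scaled_pred))
```

Scaling matters more for models that are sensitive to feature magnitude, such as regularized or distance-based ones. For ordinary linear regression it mainly changes the size of the coefficients.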


In [15]:
# Check each feature's correlation with the target to see whether any feature could be dropped

corr = admission[feature].corrwith(admission['Chance of Admit '])
print(corr)

GRE Score            0.802610
TOEFL Score          0.791594
University Rating    0.711250
SOP                  0.675732
LOR                  0.669889
CGPA                 0.873289
Research             0.553202
dtype: float64


From the above, we may remove any feature whose correlation with Chance of Admit is below 0.6.
Only Research (0.553) falls below that threshold, so we drop it, retrain the model, and see whether the RMSE improves.

In [16]:
new_feature = corr[corr > 0.6].index

In [17]:
# Retrain the model using only the selected features

lr.fit(new_train[new_feature],new_train['Chance of Admit '])
predictions_newest = lr.predict(new_test[new_feature])

# mean_squared_error expects (y_true, y_pred)
mse2 = mean_squared_error(new_test['Chance of Admit '], predictions_newest)
rmse2 = np.sqrt(mse2)

print(rmse2)

0.060644763833446864


In [18]:
new_diff = rmse2 - rmse1
print(new_diff)

0.0014399479301888812


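Unfortunately, this feature selection does not improve the model: dropping Research raises the RMSE from about 0.0592 to about 0.0606, which is also slightly worse than the original 0.0600.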
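
The fit, predict and score steps are repeated for every experiment in this notebook. A small helper can make such comparisons less error-prone. The sketch below is one possible shape for it. `holdout_rmse` is a hypothetical helper (not part of scikit-learn), and the data is a synthetic stand-in with only two features.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error


def holdout_rmse(train_df, test_df, columns, target_col):
    """Fit a linear regression on train_df and return its RMSE on test_df."""
    model = LinearRegression()
    model.fit(train_df[columns], train_df[target_col])
    preds = model.predict(test_df[columns])
    return float(np.sqrt(mean_squared_error(test_df[target_col], preds)))


# Synthetic stand-in data (not the Kaggle file), only to make the sketch runnable
rng = np.random.default_rng(1)
n = 400
demo = pd.DataFrame({
    'CGPA': rng.uniform(6.8, 9.9, size=n),
    'Research': rng.integers(0, 2, size=n),
})
demo['Chance of Admit '] = (
    0.3 + 0.15 * (demo['CGPA'] - 6.8) + 0.05 * demo['Research']
    + rng.normal(0, 0.05, size=n)
)
train_demo, test_demo = demo.iloc[:300], demo.iloc[300:]

# Each entry is one candidate feature set to evaluate on the same split
feature_sets = {
    'all features': ['CGPA', 'Research'],
    'without Research': ['CGPA'],
}
results = {
    name: holdout_rmse(train_demo, test_demo, cols, 'Chance of Admit ')
    for name, cols in feature_sets.items()
}
for name, value in results.items():
    print(f'{name}: {value:.4f}')
```

With the real data, the same call could take `new_train`, `new_test` and either `feature` or `new_feature` as arguments.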