# Predicting Admissions

The data below come from Mohan S. Acharya and concern the prediction of graduate admissions from an Indian perspective. The dataset holds 9 attributes: Serial No., GRE Score, TOEFL Score, University Rating, Statement of Purpose Strength, Letter of Recommendation Strength, Undergraduate GPA, Research Experience, and Chance of Admit. There are 400 rows, each representing a different applicant.

I am interested in fitting a linear regression to this dataset to determine how well a set of candidate predictor variables (such as GRE Score, Statement of Purpose Strength, Undergraduate GPA, and Research Experience) models the outcome variable, Chance of Admit.

_Source information:_

_Context_

_This dataset is created for prediction of Graduate Admissions from an Indian perspective._

_Content_

_The dataset contains several parameters which are considered important during the application for Masters Programs. The parameters included are:_

1. _GRE Scores (out of 340)_
2. _TOEFL Scores (out of 120)_
3. _University Rating (out of 5)_
4. _Statement of Purpose and Letter of Recommendation Strength (out of 5)_
5. _Undergraduate GPA (out of 10)_
6. _Research Experience (either 0 or 1)_
7. _Chance of Admit (ranging from 0 to 1)_

_Acknowledgements_

_This dataset is inspired by the UCLA Graduate Dataset. The test scores and GPA are in the older format. The dataset is owned by Mohan S Acharya._

_Inspiration_

_This dataset was built with the purpose of helping students in shortlisting universities with their profiles. The predicted output gives them a fair idea about their chances for a particular university._


In [92]:
import pandas as pd
import numpy as np
import statsmodels.api as sm
from patsy import dmatrices

In [105]:
df = pd.read_csv("Admission_Predict_New_New.csv")

Below we can see the main attributes of the dataset, from Serial No., GRE Score, and TOEFL Score through to the outcome, Admit.

In [106]:
df.head()

Unnamed: 0,Serial No.,GRE Score,TOEFL Score,University Rating,SOP,LOR_,CGPA,Research,Admit
0,1,337,118,4,4.5,4.5,9.65,1,0.92
1,2,324,107,4,4.0,4.5,8.87,1,0.76
2,3,316,104,3,3.0,3.5,8.0,1,0.72
3,4,322,110,3,3.5,2.5,8.67,1,0.8
4,5,314,103,2,2.0,3.0,8.21,0,0.65


In [107]:
import matplotlib.pyplot as plt
%matplotlib inline
import matplotlib.gridspec as gridspec

In this dataset there are 400 rows and 9 columns. All columns are numeric: test scores (GRE Score, TOEFL Score), ratings (University Rating, SOP, LOR_), CGPA, a binary Research indicator (0 or 1), and the outcome Admit.

In [108]:
df.shape

(400, 9)

In [111]:
df['LOR_']

0      4.5
1      4.5
2      3.5
3      2.5
4      3.0
5      3.0
6      4.0
7      4.0
8      1.5
9      3.0
10     4.0
11     4.5
12     4.5
13     3.0
14     2.0
15     2.5
16     3.0
17     3.0
18     3.0
19     3.0
20     2.0
21     2.0
22     5.0
23     4.5
24     3.5
25     4.5
26     3.5
27     2.5
28     2.0
29     2.0
      ... 
370    2.5
371    3.0
372    4.0
373    3.0
374    2.5
375    2.5
376    2.0
377    2.0
378    2.5
379    3.0
380    4.0
381    3.5
382    4.0
383    3.5
384    5.0
385    5.0
386    3.5
387    3.5
388    2.0
389    4.0
390    2.5
391    3.0
392    3.5
393    3.0
394    4.0
395    3.5
396    3.5
397    4.5
398    4.0
399    4.0
Name: LOR_, Length: 400, dtype: float64

In [112]:
df['GRE Score'].mean()

#Applicants have a mean GRE Score of 316.8 out of 340

316.8075

In [113]:
#vars = ['GRE Score', 'SOP', 'CGPA', 'Research']

df['Admit'].mean()


0.7243499999999996

In [114]:
#Choosing Predictor Variables

data1 = df[['GRE Score', 'Admit']]

data1.corr(method='pearson')


Unnamed: 0,GRE Score,Admit
GRE Score,1.0,0.80261
Admit,0.80261,1.0


In [115]:
#Choosing Predictor Variables

data2 = df[['SOP', 'Admit']]

data2.corr(method='pearson')

Unnamed: 0,SOP,Admit
SOP,1.0,0.675732
Admit,0.675732,1.0


In [116]:
#Choosing Predictor Variables

data3 = df[['Research', 'Admit']]

data3.corr(method='pearson')

Unnamed: 0,Research,Admit
Research,1.0,0.553202
Admit,0.553202,1.0


In [117]:
#Choosing Predictor Variables

data4 = df[['TOEFL Score', 'Admit']]

data4.corr(method='pearson')

Unnamed: 0,TOEFL Score,Admit
TOEFL Score,1.0,0.791594
Admit,0.791594,1.0


In [118]:
#Choosing Predictor Variables

data5 = df[['LOR_', 'Admit']]

data5.corr(method='pearson')

#LOR_ is the one I pick as a predictor (r = 0.67 with Admit)

Unnamed: 0,LOR_,Admit
LOR_,1.0,0.669889
Admit,0.669889,1.0
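
The pairwise checks above can also be collapsed into a single call. The sketch below is only an illustration: it rebuilds a small frame from the five rows printed by `df.head()` (the variable names `sample` and `corr_with_admit` are my own), so the numbers it prints will not match the full 400-row results. On the real data, the same expression would be applied to `df`.

```python
import pandas as pd

# First five rows as printed by df.head() above; the full CSV is not reproduced here.
sample = pd.DataFrame({
    "GRE Score": [337, 324, 316, 322, 314],
    "TOEFL Score": [118, 107, 104, 110, 103],
    "SOP": [4.5, 4.0, 3.0, 3.5, 2.0],
    "LOR_": [4.5, 4.5, 3.5, 2.5, 3.0],
    "Research": [1, 1, 1, 1, 0],
    "Admit": [0.92, 0.76, 0.72, 0.80, 0.65],
})

# Pearson correlation of every column with the outcome, strongest first.
corr_with_admit = (
    sample.corr(method="pearson")["Admit"]
    .drop("Admit")
    .sort_values(ascending=False)
)
print(corr_with_admit)
```

Sorting the result puts the strongest candidate predictors at the top, which makes the comparison across columns easier than reading separate 2x2 tables.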



In [119]:
#Choosing Predictor Variables

data6 = df[['LOR_', 'SOP']]

data6.corr(method='pearson')

#LOR_ and SOP are correlated with each other (r = 0.73), so including both risks multicollinearity




Unnamed: 0,LOR_,SOP
LOR_,1.0,0.729593
SOP,0.729593,1.0


In [120]:
#Choosing Predictor Variables

data8 = df[['GRE Score', 'TOEFL Score']]

data8.corr(method='pearson')

#GRE Score and TOEFL Score are strongly correlated (r = 0.84), so including both risks multicollinearity

#Let's keep GRE Score only, then

Unnamed: 0,GRE Score,TOEFL Score
GRE Score,1.0,0.835977
TOEFL Score,0.835977,1.0
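
Pairwise correlations only catch multicollinearity between two predictors at a time. A common complement is the variance inflation factor (VIF), which regresses each predictor on all the others. The sketch below is a minimal NumPy version on synthetic stand-in data, not the admissions file; the helper name `vif` and the variables `a`, `b`, `c` are hypothetical. statsmodels also ships a `variance_inflation_factor` function in `statsmodels.stats.outliers_influence`, which could be used on the real design matrix instead.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of a 2-D array X."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    out = []
    for j in range(k):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        # Regress column j on the remaining columns plus an intercept.
        design = np.column_stack([np.ones(n), others])
        beta, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ beta
        r2 = 1.0 - resid.var() / target.var()
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

# Synthetic stand-in data (NOT the admissions file): a and b are strongly
# correlated with each other, c is independent of both.
rng = np.random.default_rng(0)
a = rng.normal(size=400)
b = 0.85 * a + np.sqrt(1 - 0.85**2) * rng.normal(size=400)
c = rng.normal(size=400)

vifs = vif(np.column_stack([a, b, c]))
print(vifs)
```

A VIF near 1 means a predictor is nearly uncorrelated with the rest. Larger values signal redundancy, and cutoffs of roughly 5 or 10 are often quoted as rules of thumb.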



In [122]:
X = df[['GRE Score', 'LOR_', 'Research']]

In [123]:
y = df['Admit']

In [125]:
X = sm.add_constant(X)



In [126]:
model = sm.OLS(y, X).fit()

In [127]:
predictions = model.predict(X)

In [128]:
model.summary()

0,1,2,3
Dep. Variable:,Admit,R-squared:,0.722
Model:,OLS,Adj. R-squared:,0.72
Method:,Least Squares,F-statistic:,343.1
Date:,"Mon, 22 Apr 2019",Prob (F-statistic):,1.01e-109
Time:,17:05:07,Log-Likelihood:,468.12
No. Observations:,400,AIC:,-928.2
Df Residuals:,396,BIC:,-912.3
Df Model:,3,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,-1.7144,0.133,-12.871,0.000,-1.976,-1.453
GRE Score,0.0071,0.000,15.811,0.000,0.006,0.008
LOR_,0.0496,0.005,9.732,0.000,0.040,0.060
Research,0.0278,0.009,2.972,0.003,0.009,0.046

0,1,2,3
Omnibus:,60.148,Durbin-Watson:,0.86
Prob(Omnibus):,0.0,Jarque-Bera (JB):,97.421
Skew:,-0.915,Prob(JB):,7e-22
Kurtosis:,4.58,Cond. No.,11200.0
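
Reading the coefficient table as an equation may make it easier to interpret. Using the rounded coefficients printed in the summary above (so results are approximate), the fitted model is:

```latex
\widehat{\text{Admit}} = -1.7144 + 0.0071\,\text{GRE Score} + 0.0496\,\text{LOR} + 0.0278\,\text{Research}
```

As a worked example, take a hypothetical applicant (not a row from the dataset) with a GRE Score of 320, an LOR rating of 4.0, and research experience:

```latex
\widehat{\text{Admit}} = -1.7144 + 0.0071(320) + 0.0496(4.0) + 0.0278(1)
                       = -1.7144 + 2.2720 + 0.1984 + 0.0278
                       \approx 0.78
```

Each coefficient is the expected change in Chance of Admit for a one-unit increase in that predictor with the others held fixed. One additional GRE point is associated with an increase of about 0.007, and having research experience with an increase of about 0.028.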


