# Graduate Admissions
## Data Scrubbing / Data Cleaning

In [1]:
import pandas as pd
import numpy as np
import sklearn
import seaborn as sns
import matplotlib as mpl
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.formula.api import ols

In [2]:
grad = pd.read_csv('Admission_Predict_Ver1.1.csv')
grad.head()

Unnamed: 0,Serial No.,GRE Score,TOEFL Score,University Rating,SOP,LOR,CGPA,Research,Chance of Admit
0,1,337,118,4,4.5,4.5,9.65,1,0.92
1,2,324,107,4,4.0,4.5,8.87,1,0.76
2,3,316,104,3,3.0,3.5,8.0,1,0.72
3,4,322,110,3,3.5,2.5,8.67,1,0.8
4,5,314,103,2,2.0,3.0,8.21,0,0.65


For a description of the dataset and the purpose of this project, please see Pt. 1 EDA.

In [3]:
grad.shape

(500, 9)

---

## NULL Values

In [4]:
grad.isnull().sum()

Serial No.           0
GRE Score            0
TOEFL Score          0
University Rating    0
SOP                  0
LOR                  0
CGPA                 0
Research             0
Chance of Admit      0
dtype: int64

Since there are no missing values, nothing needs to be handled here. If there were any, the affected rows could be removed with:

In [5]:
grad.dropna(inplace = True)
grad.reset_index(drop = True, inplace = True)
grad.shape

(500, 9)

The shape of the dataset has not changed, which is expected because no rows contained missing values.
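
To see what these calls do on data that actually contains gaps, here is a minimal sketch on a small made-up frame. The `toy` DataFrame and its missing entries are hypothetical and are not taken from the admissions file.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two missing values (not from the admissions file)
toy = pd.DataFrame({
    "GRE Score": [337, np.nan, 316, 322],
    "CGPA": [9.65, 8.87, np.nan, 8.67],
})

# Count missing values per column, as done above for `grad`
missing_per_column = toy.isnull().sum()

# Drop every row with at least one missing value, then renumber the index
cleaned = toy.dropna().reset_index(drop=True)

print(missing_per_column)
print(cleaned.shape)
```

Only the rows without any missing entry survive, and `reset_index(drop=True)` renumbers them from zero so the index has no holes.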

---

## Dropping Variables

In [6]:
grad.head()

Unnamed: 0,Serial No.,GRE Score,TOEFL Score,University Rating,SOP,LOR,CGPA,Research,Chance of Admit
0,1,337,118,4,4.5,4.5,9.65,1,0.92
1,2,324,107,4,4.0,4.5,8.87,1,0.76
2,3,316,104,3,3.0,3.5,8.0,1,0.72
3,4,322,110,3,3.5,2.5,8.67,1,0.8
4,5,314,103,2,2.0,3.0,8.21,0,0.65


Looking at the dataset, the variable 'Serial No.' is redundant with the indices provided by the dataframe itself. I will therefore remove it so that it is not considered when building the model.

In [7]:
grad.drop('Serial No.', axis = 1, inplace = True)
grad.head()

Unnamed: 0,GRE Score,TOEFL Score,University Rating,SOP,LOR,CGPA,Research,Chance of Admit
0,337,118,4,4.5,4.5,9.65,1,0.92
1,324,107,4,4.0,4.5,8.87,1,0.76
2,316,104,3,3.0,3.5,8.0,1,0.72
3,322,110,3,3.5,2.5,8.67,1,0.8
4,314,103,2,2.0,3.0,8.21,0,0.65


---

## Variable Names

In [8]:
grad.rename({'LOR ' : 'LOR', 'Chance of Admit ': 'Chance of Admit'}, axis = 1, inplace = True)
grad.head()

Unnamed: 0,GRE Score,TOEFL Score,University Rating,SOP,LOR,CGPA,Research,Chance of Admit
0,337,118,4,4.5,4.5,9.65,1,0.92
1,324,107,4,4.0,4.5,8.87,1,0.76
2,316,104,3,3.0,3.5,8.0,1,0.72
3,322,110,3,3.5,2.5,8.67,1,0.8
4,314,103,2,2.0,3.0,8.21,0,0.65


In [9]:
grad.columns

Index(['GRE Score', 'TOEFL Score', 'University Rating', 'SOP', 'LOR', 'CGPA',
       'Research', 'Chance of Admit'],
      dtype='object')

I have removed the trailing spaces from the column names 'LOR ' and 'Chance of Admit ' so they are easier to use.
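
Renaming columns one by one works when only a couple are affected. A more general sketch, assuming the only problem is leading or trailing whitespace, strips every column name at once. The `raw` frame below is a hypothetical stand-in for the file as loaded.

```python
import pandas as pd

# Hypothetical stand-in with the same trailing-space problem
raw = pd.DataFrame({
    "LOR ": [4.5, 4.5],
    "Chance of Admit ": [0.92, 0.76],
    "CGPA": [9.65, 8.87],
})

# Strip leading/trailing whitespace from every column name
raw.columns = raw.columns.str.strip()

print(list(raw.columns))
```

This avoids having to know in advance which names carry stray spaces.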

---

# Column names that contain spaces must be wrapped in Q("...") inside the formula
result = ols(formula='Q("Chance of Admit") ~ Q("GRE Score") + Q("TOEFL Score") + SOP * LOR', data=grad).fit()
print(result.summary())

## Standardize Variables

Let us standardize the independent variables to bring them to the same scale and reduce multicollinearity.

In [10]:
from sklearn.preprocessing import StandardScaler

cols = ['GRE Score', 'TOEFL Score', 'University Rating', 'SOP', 'LOR', 'CGPA']
scale = StandardScaler()
grad[cols] = scale.fit_transform(grad[cols])
grad.head()

Unnamed: 0,GRE Score,TOEFL Score,University Rating,SOP,LOR,CGPA,Research,Chance of Admit
0,1.819238,1.778865,0.775582,1.13736,1.098944,1.776806,1,0.92
1,0.667148,-0.031601,0.775582,0.632315,1.098944,0.485859,1,0.76
2,-0.04183,-0.525364,-0.099793,-0.377773,0.017306,-0.954043,1,0.72
3,0.489904,0.462163,-0.099793,0.127271,-1.064332,0.154847,1,0.8
4,-0.219074,-0.689952,-0.975168,-1.387862,-0.523513,-0.60648,0,0.65


By standardizing instead of normalizing, I center each variable, which reduces the multicollinearity measured by VIF, while keeping information about how many standard deviations each observation is from the mean.

I did not standardize the dependent variable, Chance of Admit, for ease of interpretation.
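
For reference, the transformation applied to each column is z = (x - mean) / std. The sketch below reproduces it with plain NumPy on the five GRE scores shown in the first table. Because the mean and standard deviation here come from only five rows, the resulting values will not match the notebook output, which uses all 500 rows.

```python
import numpy as np

# GRE scores from the first five rows shown earlier
gre = np.array([337, 324, 316, 322, 314], dtype=float)

# z = (x - mean) / std, using the population standard deviation (ddof=0),
# which is what StandardScaler uses by default
z = (gre - gre.mean()) / gre.std(ddof=0)

# After standardization the column has mean 0 and standard deviation 1
print(z.mean(), z.std(ddof=0))
```

Each standardized value can be read directly as "this many standard deviations above or below the mean".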

Let us get VIF results one more time to see if standardization has made a difference:

In [11]:
from statsmodels.stats.outliers_influence import variance_inflation_factor


# the independent variables:
X = grad[['GRE Score', 'TOEFL Score', 'University Rating', 'SOP', 'LOR', 'CGPA', 'Research']]

# VIF dataframe
vif = pd.DataFrame()
vif["VIF Factor"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif["Features"] = X.columns

print(vif)

   VIF Factor           Features
0    4.225203          GRE Score
1    3.898373        TOEFL Score
2    2.615772  University Rating
3    2.834454                SOP
4    2.029486                LOR
5    4.776161               CGPA
6    1.170262           Research


The VIF values have dropped significantly, to well below the commonly used threshold of 10. This is good: multicollinearity among the variables is low enough to proceed with various regression models without being extremely cautious about the meaning of the model coefficients.
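
The VIF of predictor j is 1 / (1 - R_j²), where R_j² comes from regressing that predictor on all the others. The sketch below computes it with plain NumPy on synthetic data. The helper `vif_with_intercept` and the variables `x1`, `x2`, `x3` are hypothetical and not part of the notebook. The helper always includes an intercept in the auxiliary regression, which is an assumption of this sketch. As far as I understand, statsmodels' `variance_inflation_factor` uses the matrix exactly as passed, so its results on uncentered data can differ.

```python
import numpy as np

def vif_with_intercept(X):
    """VIF_j = 1 / (1 - R_j^2), regressing column j on the others plus an intercept."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    vifs = []
    for j in range(k):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])  # add intercept column
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        centered = y - y.mean()
        r2 = 1.0 - (resid @ resid) / (centered @ centered)
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)

# Synthetic data (assumption for illustration): x2 is strongly tied to x1, x3 is independent
rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = x1 + 0.3 * rng.normal(size=500)
x3 = rng.normal(size=500)
X_demo = np.column_stack([x1, x2, x3])

vif_raw = vif_with_intercept(X_demo)
# Shift and rescale every column, then recompute
vif_shifted = vif_with_intercept(X_demo * 10 + 300)

print(vif_raw)
print(vif_shifted)
```

With an intercept included, shifting or rescaling the columns leaves the VIFs unchanged. This suggests that the drop reported above comes from centering the data rather than from a change in how strongly the predictors are related to one another.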

In [12]:
grad.head()

Unnamed: 0,GRE Score,TOEFL Score,University Rating,SOP,LOR,CGPA,Research,Chance of Admit
0,1.819238,1.778865,0.775582,1.13736,1.098944,1.776806,1,0.92
1,0.667148,-0.031601,0.775582,0.632315,1.098944,0.485859,1,0.76
2,-0.04183,-0.525364,-0.099793,-0.377773,0.017306,-0.954043,1,0.72
3,0.489904,0.462163,-0.099793,0.127271,-1.064332,0.154847,1,0.8
4,-0.219074,-0.689952,-0.975168,-1.387862,-0.523513,-0.60648,0,0.65


---

Let us export the new dataframe as a csv file to be used for the third part of the project.

In [13]:
grad.to_csv('graduation.csv')
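
One detail worth knowing for the next part: by default `to_csv` also writes the row index as an unnamed first column. The sketch below shows the round trip on a hypothetical two-row frame, using an in-memory buffer instead of a real file.

```python
import io
import pandas as pd

# Hypothetical two-row frame standing in for `grad`
df = pd.DataFrame({"GRE Score": [1.82, 0.67], "Chance of Admit": [0.92, 0.76]})

# Default export: the row index is written as an unnamed first column
buffer = io.StringIO()
df.to_csv(buffer)

# Reading it back without options turns that column into 'Unnamed: 0'
buffer.seek(0)
naive = pd.read_csv(buffer)

# Passing index_col=0 restores it as the index instead
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)

print(list(naive.columns))
print(list(restored.columns))
```

When loading the exported file, either pass `index_col=0` to `pd.read_csv`, or export with `index=False` in the first place if the index carries no information.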

---

# Let us now move on to Data Analysis (Pt. 3 Data Analysis)