 # Graduate Admission - Linear Regression

### Business Problem:

- To **understand the factors** that matter in graduate admissions and how these factors are **interrelated**, so that educational institutions can **predict an applicant's chance of admission** from the remaining variables.

### Column Profiling:
- Serial No. (Unique row ID)
- GRE Scores (out of 340)
- TOEFL Scores (out of 120)
- University Rating (out of 5)
- Statement of Purpose and Letter of Recommendation Strength (out of 5)
- Undergraduate GPA (out of 10)
- Research Experience (either 0 or 1)
- Chance of Admit (ranging from 0 to 1)

# Overview of the notebook:
EDA
- Loading and inspecting the Dataset
- Checking Shape of the Dataset, Meaningful Column names
- Validating Duplicate Records, Checking Missing values
- Unique values (counts & names) for each Feature
- Data & Datatype validation

Univariate & Bivariate Analysis
- Numerical Variables
- Categorical variables
- Correlation Analysis
- Handling Multicollinearity

Model Building
- Handling Categorical variables using dummies
- Test & Train Split
- Rescaling features
- Train Model

Validate Linear Regression Assumptions
- Multicollinearity check
- Mean of residuals
- Linearity of variables
- Test for Homoscedasticity
- Normality of residuals
- Model Performance Evaluation
- Metrics checked - MAE,RMSE,R2,Adj R2
- Train and Test performances are checked
- Comments on performance measures
- Summary of final recommendations


# Exploratory data analysis:

#### Importing required packages:

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style='whitegrid')
from numpy import nan  # the NaN/NAN aliases are not available in newer NumPy releases
from scipy import stats
import statsmodels.api as sm
import warnings
warnings.filterwarnings("ignore")

# Train & Test data split
from sklearn.model_selection import train_test_split

# Feature scaling
from sklearn.preprocessing import StandardScaler

# Statsmodel linear regression
import statsmodels.api as sm

# Check for the VIF values of the feature variables. 
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Importing RFE and LinearRegression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

from sklearn.metrics import mean_absolute_error,r2_score,mean_squared_error,mean_absolute_percentage_error

#### Loading data into Dataframe:

In [None]:
grad_adm_data = pd.read_csv('/kaggle/input/graduate-admissions/Admission_Predict_Ver1.1.csv')
grad_adm_data.head()

In [None]:
#Dropping the unique row Identifier - which is Serial No.

grad_adm_data = grad_adm_data.drop('Serial No.', axis = 1)
grad_adm_data.head()

#### Identification of variables and data types:

In [None]:
grad_adm_data.shape

In [None]:
grad_adm_data.info()

#### Analysing the basic metrics:

In [None]:
grad_adm_data.describe()

In [None]:
def missingValue(df):
    #Identifying Missing data.
    total_null = df.isnull().sum().sort_values(ascending = False)
    percent = ((df.isnull().sum()/len(df))*100).sort_values(ascending = False)
    print(f"Total records in our data =  {df.shape[0]} where missing values are as follows:")

    missing_data = pd.concat([total_null,percent.round(2)],axis=1,keys=['Total Missing','In Percent'])
    return missing_data

In [None]:
missingValue(grad_adm_data)

__Summary__:
-  No missing values present in the dataset

In [None]:
numerical_cols = ['GRE Score', 'TOEFL Score', 'University Rating', 'SOP', 'LOR ', 'CGPA', 'Research', 'Chance of Admit ']
for i in numerical_cols:
    
    print(f" Unique value count in {i} is {grad_adm_data[i].nunique()}")

In [None]:
characteristics_catg = ['University Rating', 'SOP', 'LOR ','Research']
for i in characteristics_catg:
    print(f" Unique values in {i} are {grad_adm_data[i].unique()}")

In [None]:
for i in characteristics_catg:
    grad_adm_data[i] = grad_adm_data[i].astype("category")
grad_adm_data.info()

In [None]:
print(f"Columns with category datatypes (Categorical Features) are : \
{list(grad_adm_data.select_dtypes('category').columns)}")
print(f"Columns with integer and float datatypes (Numerical Features) are: \
{list(grad_adm_data.select_dtypes(['int64','float64']).columns)}")

# Univariate Analysis:

In [None]:
def outlier_detect(df,colname,nrows=2,mcols=2,width=20,height=15):
    # squeeze=False keeps `ax` two-dimensional even when a single column is plotted
    fig , ax = plt.subplots(nrows,mcols,figsize=(width,height),squeeze=False)
    fig.set_facecolor("peachpuff")
    rows = 0
    for var in colname:
        # Left panel: boxplot to flag potential outliers
        sns.boxplot(y = df[var],color='crimson',ax=ax[rows][0])
        ax[rows][0].set_title("Boxplot for Outlier Detection ", fontweight="bold")
        ax[rows][0].set_ylabel(var, fontsize=12)

        # Right panel: histogram with KDE (histplot replaces the deprecated distplot)
        sns.histplot(df[var],kde=True,stat="density",color='purple',ax=ax[rows][1])
        ax[rows][1].axvline(df[var].mean(), color='r', linestyle='--', label="Mean")
        ax[rows][1].axvline(df[var].median(), color='m', linestyle='-', label="Median")
        ax[rows][1].axvline(df[var].mode()[0], color='royalblue', linestyle='-', label="Mode")
        ax[rows][1].set_title("Outlier Detection ", fontweight="bold")
        # The labels set on the axvline calls above populate the legend
        ax[rows][1].legend()
        rows += 1
    plt.show()

In [None]:
numerical_cols = ['GRE Score', 'TOEFL Score', 'CGPA', 'Chance of Admit ']

In [None]:
outlier_detect(grad_adm_data,numerical_cols,len(numerical_cols),2,14,30)

- The data for 'GRE Score', 'TOEFL Score' and 'CGPA' is approximately normally distributed, with no outliers present.
- The data for 'Chance of Admit ' is slightly left-skewed, with a negligible number of outliers.
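
The boxplots above flag points that lie more than 1.5 times the interquartile range beyond the quartiles (the default whisker length in seaborn). As a rough cross-check, the same rule can be applied numerically. This is a minimal sketch: the helper `iqr_outliers` and the sample values are hypothetical and are not taken from the admissions data.

```python
import numpy as np

def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return values[(values < lower) | (values > upper)]

# Hypothetical chance-of-admit style values, for illustration only
sample = np.array([0.34, 0.62, 0.65, 0.70, 0.72, 0.75, 0.78, 0.80, 0.85, 0.90])
outliers = iqr_outliers(sample)
print(outliers)
```

In the notebook this could be called as `iqr_outliers(grad_adm_data['Chance of Admit '])` to count the flagged rows.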

In [None]:
# Frequency of each feature in percentage.
def cat_analysis(df, colnames, nrows=2,mcols=2,width=20,height=30, sortbyindex=False):
    fig , ax = plt.subplots(nrows,mcols,figsize=(width,height))  
    fig.set_facecolor(color = 'peachpuff')
    string = "Frequency of "
    rows = 0                          
    for colname in colnames:
        count = (df[colname].value_counts(normalize=True)*100)
        string += colname + ' in (%)'
        if sortbyindex:
                count = count.sort_index()
        count.plot.bar(color=sns.color_palette("flare"),ax=ax[rows][0])
        ax[rows][0].set_ylabel(string, fontsize=14)
        ax[rows][0].set_xlabel(colname, fontsize=14)
        
        count.plot.pie(colors = sns.color_palette("flare"),autopct='%0.0f%%',
                       textprops={'fontsize': 14},shadow = True, ax=ax[rows][1])#explode=[0.2 if colname[i] == min(colname) else 0])        
        ax[rows][0].set_title("Frequency wise " + colname, fontweight="bold")
        string = "Frequency of "
        rows += 1 

In [None]:
categorical_cols = ['University Rating', 'SOP', 'LOR ', 'Research']

In [None]:
cat_analysis(grad_adm_data,categorical_cols,len(categorical_cols),2,14,30)

# Data Preparation

In [None]:
grad_adm_data.info()

In [None]:
grad_adm_data['GRE Score'].sort_values().head()

In [None]:
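grad_adm_data_new = grad_adm_data.copy()
bins = [290,300,310,320,330,340]
labels =["290-300","300-310","310-320","320-330","330-340"]
# include_lowest=True keeps a score of exactly 290 in the first bin instead of turning it into NaN
grad_adm_data_new['GRE Score bins'] = pd.cut(grad_adm_data_new['GRE Score'], bins,labels=labels,include_lowest=True)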

In [None]:
grad_adm_data_new['TOEFL Score'].sort_values().head()

In [None]:

bins = [90,100,110,120]
labels =['90-100','100-110','110-120']
grad_adm_data_new['TOEFL Score bins'] = pd.cut(grad_adm_data_new['TOEFL Score'], bins,labels=labels)

In [None]:
grad_adm_data_new['CGPA'].sort_values().head()

In [None]:
bins = [6.5,7.0,7.5,8.0,8.5,9.0,9.5,10.0]
labels =['6.5-7.0','7.0-7.5','7.5-8.0','8.0-8.5','8.5-9.0','9.0-9.5','Above 9.5']
grad_adm_data_new['CGPA bins'] = pd.cut(grad_adm_data_new['CGPA'], bins,labels=labels)

In [None]:
grad_adm_data_new.head()

In [None]:
grad_adm_data_new.info()

In [None]:
# sns.lineplot(x='GRE Score bins',
#     hue='University Rating',
#     data=grad_adm_data_new,
#     palette="rocket")

In [None]:
characteristics_catg = ['University Rating', 'SOP', 'LOR ','Research','GRE Score bins','CGPA bins']

# Bi-Variate Analysis with Research

Categorical variables

In [None]:
def cat_bi_analysis(df,colname,depend_var,nrows=2,mcols=2,width=20,height=15):
    fig , ax = plt.subplots(nrows,mcols,figsize=(width,height))
    sns.set(style='white')
    fig.set_facecolor("peachpuff")
    rows = 0
    string = " based Distribution"
    for var in colname:
        string = var + string
        sns.countplot(data=df,x=depend_var, hue=var, palette="hls",ax=ax[rows][0])
        sns.countplot(data=df, x=var, hue=depend_var, palette="husl",ax=ax[rows][1])
        ax[rows][0].set_title(string, fontweight="bold",fontsize=14)
        ax[rows][1].set_title(string, fontweight="bold",fontsize=14)
        ax[rows][0].set_ylabel('count', fontweight="bold",fontsize=14)
        ax[rows][0].set_xlabel(var,fontweight="bold", fontsize=14)  
        ax[rows][1].set_ylabel('count', fontweight="bold",fontsize=14)
        ax[rows][1].set_xlabel(var,fontweight="bold", fontsize=14) 
        rows += 1
        string = " based Distribution"
    plt.show()

In [None]:
col_names = ['University Rating', 'SOP', 'LOR ','GRE Score bins','TOEFL Score bins','CGPA bins']
cat_bi_analysis(grad_adm_data_new,col_names,'Research',6,2,20,36)

The Research criterion is particularly informative for the following reasons:
- Students with research papers are more often found in universities with top ratings (4 & 5).
- Students with higher LOR and SOP ratings are the students with the most research publications.
- Unsurprisingly, the students with higher academic scores (GRE, TOEFL and CGPA) are the ones who are actively publishing, or have published, research papers.

# Multivariate Analysis:

Categorical variables and Numerical variables

In [None]:
def num_bi_analysis(df,colname,category,groupby,nrows=1,mcols=2,width=20,height=8):
    fig , ax = plt.subplots(nrows,mcols,figsize=(width,height),squeeze=False)
    sns.set(style='white')
    fig.set_facecolor("peachpuff")
    rows = 0
    for var in colname:
        sns.boxplot(x = category,y = var, data = df,ax=ax[rows][0])
        sns.lineplot(x=df[category],y=df[var],ax=ax[rows][1],hue=df[groupby]) 
        ax[rows][0].set_ylabel(var, fontweight="bold",fontsize=14)
        ax[rows][0].set_xlabel(category,fontweight="bold", fontsize=14)  
        ax[rows][1].set_ylabel(var, fontweight="bold",fontsize=14)
        ax[rows][1].set_xlabel(category,fontweight="bold", fontsize=14) 
        rows += 1
    plt.show()

In [None]:
col_names = ['University Rating', 'SOP', 'LOR ','GRE Score bins','TOEFL Score bins','CGPA bins']

In [None]:
grad_adm_data_new.info()

In [None]:
grad_adm_data_new.columns

In [None]:
grad_adm_data['LOR'] = grad_adm_data['LOR ']
grad_adm_data['Chance of Admit'] = grad_adm_data['Chance of Admit ']

grad_adm_data_new['LOR'] = grad_adm_data_new['LOR ']
grad_adm_data_new['Chance of Admit'] = grad_adm_data_new['Chance of Admit ']

In [None]:
col_num = [ 'Chance of Admit']
num_bi_analysis(grad_adm_data_new,col_num,"University Rating",'Research')

col_num = [ 'Chance of Admit']
num_bi_analysis(grad_adm_data_new,col_num,"SOP",'CGPA bins')

col_num = [ 'Chance of Admit']
num_bi_analysis(grad_adm_data_new,col_num,"LOR",'GRE Score bins')

col_num = [ 'Chance of Admit']
num_bi_analysis(grad_adm_data_new,col_num,"LOR",'TOEFL Score bins')

col_num = [ 'Chance of Admit']
num_bi_analysis(grad_adm_data_new,col_num,'Research',"CGPA bins")

In [None]:
grad_adm_data.columns

In [None]:
grad_adm_data = grad_adm_data.drop('Chance of Admit ', axis = 1)

In [None]:
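# Correlation between numerical variables
# numeric_only=True restricts the matrix to numeric columns (category columns are skipped)

plt.figure(figsize = (10, 5))
sns.heatmap(grad_adm_data.corr(method="pearson", numeric_only=True),annot = True)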
plt.yticks(rotation = 360)
plt.xticks(rotation = 45)
plt.show()

- As we can see, Chance of Admit is highly correlated with GRE Score, TOEFL Score and CGPA.
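
For reference, each cell of the heatmap is a Pearson correlation coefficient. A minimal sketch of how one such value is computed is given here; the helper `pearson_r` and the five data points are hypothetical, chosen only to illustrate the formula.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation: covariance scaled by both standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))

# Hypothetical values, not rows from the dataset
gre = [300, 310, 320, 330, 340]
chance = [0.50, 0.62, 0.71, 0.85, 0.93]

r = pearson_r(gre, chance)
print(round(r, 3))
```

The result agrees with `np.corrcoef(gre, chance)[0, 1]`, which is a convenient way to confirm the hand-written version.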

In [None]:
sns.pairplot(grad_adm_data, hue="Research")

In [None]:
categorical_cols_int = ['University Rating','Research']
categorical_cols_float = ['SOP', 'LOR']
for i in categorical_cols_int:
    grad_adm_data[i] = grad_adm_data[i].astype("int64")
for i in categorical_cols_float:
    grad_adm_data[i] = grad_adm_data[i].astype("float64")
grad_adm_data.info()

In [None]:
grad_adm_data = grad_adm_data.drop('LOR ', axis = 1)

In [None]:
grad_adm_data.info()

In [None]:
# sns.regplot(x="TOEFL Score",y="Chance of Admit",data=grad_adm_data,color='orange')
sns.scatterplot(x="GRE Score",y="Chance of Admit",data=grad_adm_data,color='orange')
# sns.regplot(x="University Rating",y="Chance of Admit",data=grad_adm_data,color='orange')
# sns.regplot(x="SOP",y="Chance of Admit",data=grad_adm_data,color='orange')
# sns.regplot(x="LOR",y="Chance of Admit",data=grad_adm_data,color='orange')
# sns.regplot(x="CGPA",y="Chance of Admit",data=grad_adm_data,color='orange')


In [None]:
fig = plt.figure(figsize=(7, 5))
fig.set_facecolor(color = 'peachpuff')
sns.regplot(x='GRE Score',y='Chance of Admit',color="g",data=grad_adm_data);

fig = plt.figure(figsize=(7, 5))
fig.set_facecolor(color = 'peachpuff')
sns.regplot(x='TOEFL Score',y='Chance of Admit',color="y",data=grad_adm_data);

fig = plt.figure(figsize=(7, 5))
fig.set_facecolor(color = 'peachpuff')
sns.regplot(x='CGPA',y='Chance of Admit',color="r",data=grad_adm_data);

fig = plt.figure(figsize=(7, 5))
fig.set_facecolor(color = 'peachpuff')
sns.regplot(x='University Rating',y='Chance of Admit',color="b",data=grad_adm_data);

fig = plt.figure(figsize=(7, 5))
fig.set_facecolor(color = 'peachpuff')
sns.regplot(x='Research',y='Chance of Admit',color="y",data=grad_adm_data);

# EDA specific Observations and Inferences :
- From the distribution of Chance of Admit, most students have a chance of admission between 0.6 and 1.0.
- From the distribution of Research, most students have research experience (Research = 1).
- From the distribution of LOR, most letter of recommendation ratings fall between 2.5 and 4.5.
- From the distribution of SOP, most statement of purpose ratings fall between 2.5 and 4.5.
- From the distribution of University Rating, the most common ratings are 2 and 3.
- From the distribution of TOEFL Score, the most frequent scores are 110 and 105, and most students score between 99 and 115.
- From the distribution of GRE Score, the most frequent scores are 312 and 324, and most students score between 304 and 330.
- There is a strong positive relationship between GRE Score and Chance of Admit.
- There is a strong positive relationship between TOEFL Score and Chance of Admit.
- There is a strong positive relationship between CGPA and Chance of Admit.

- We can't see any clear relationship between SOP and Chance of Admit.
- We can't see any clear relationship between LOR and Chance of Admit.
- Students with research experience have a higher chance of getting an admit.
- There's no strong relationship between University Rating and Chance of Admit, but students from universities with higher ratings tend to have a higher chance of admit.


# Building Model with Linear Regression:

### Assumptions made for Simple Linear Regression:

- **Linearity**: There needs to be a linear relationship between the dependent variable and the independent variable(s).
- **Independence of residuals**: The error terms should not depend on one another (as they do in time-series data, where the next value depends on the previous one). Correlation between the residual terms is known as autocorrelation, and it should be absent: there should not be any visible pattern in the error terms.
- **Normal distribution of residuals**: The residuals should follow a normal distribution with a mean equal to zero or close to zero. Checking this helps confirm that the selected line is actually the line of best fit. If the error terms are not normally distributed, this suggests that there are a few unusual data points that must be studied closely to build a better model.
- **Equal variance of residuals**: The error terms must have constant variance. This property is known as homoscedasticity. Non-constant variance in the error terms is referred to as heteroscedasticity, and it generally arises in the presence of outliers or extreme leverage values.
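
The residual assumptions above can be checked numerically as well as visually. The sketch below uses synthetic data (an assumption, not the admissions dataset) and plain NumPy. It computes the residual mean, the Durbin-Watson statistic for independence, and a crude ratio of residual spread between low and high fitted values as a stand-in for a formal homoscedasticity test.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data for illustration: y = 0.2 + 0.5*x + noise
n = 200
x = rng.uniform(0, 1, size=n)
y = 0.2 + 0.5 * x + rng.normal(0, 0.05, size=n)

# Ordinary least squares with an explicit intercept column
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta
residuals = y - fitted

# Mean of residuals: zero up to floating-point error when an intercept is fitted
resid_mean = residuals.mean()

# Durbin-Watson statistic: values near 2 suggest no first-order autocorrelation
durbin_watson = np.sum(np.diff(residuals) ** 2) / np.sum(residuals ** 2)

# Crude homoscedasticity check: residual spread for low vs high fitted values
order = np.argsort(fitted)
low_half = residuals[order[: n // 2]]
high_half = residuals[order[n // 2:]]
spread_ratio = high_half.std() / low_half.std()

print(resid_mean, durbin_watson, spread_ratio)
```

A spread ratio far from 1 would hint at heteroscedasticity; a formal test is a better choice when a decision depends on it.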

### Considerations of Multiple Linear Regression:
All the four assumptions made for Simple Linear Regression still hold true for Multiple Linear Regression along with a few new additional assumptions.
- **Linear Relationship** should be present between input variables and target variables
    - We have already checked this in EDA
- **Multicollinearity**: It is the phenomenon where some of the independent variables in a model are interrelated.
    - **No Multicollinearity** should be present among input variables. As GRE Score, TOEFL Score and CGPA are all highly correlated with Chance of Admit, we will use VIF to cross-check which of them to keep.
- **Normal Distribution** of target variables.
    - Checked this in EDA.
- **Overfitting**: When more and more variables are added to a model, the model may become far too complex and usually ends up memorizing all the data points in the training set. This phenomenon is known as the overfitting of a model. This usually leads to high training accuracy and very low test accuracy.
- **Feature Selection**: With more variables present, selecting the optimal set of predictors from the pool of given features (many of which might be redundant) becomes an important task for building a relevant and better model.
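
To make the multicollinearity point concrete: the variance inflation factor of a feature is 1 / (1 - R²), where R² comes from regressing that feature on all the others. The sketch below implements this with NumPy on synthetic columns. The function `vif` and the data are illustrative assumptions, and this version adds an intercept to each auxiliary regression, so its numbers need not match other implementations exactly.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X (features only, no constant)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    factors = []
    for j in range(p):
        target = X[:, j]
        # Regress column j on an intercept plus every other column
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1.0 - resid.var() / target.var()
        factors.append(1.0 / (1.0 - r2))
    return np.array(factors)

# Synthetic features: b is nearly a copy of a, c is independent
rng = np.random.default_rng(42)
a = rng.normal(size=300)
b = a + rng.normal(scale=0.1, size=300)
c = rng.normal(size=300)

scores = vif(np.column_stack([a, b, c]))
print(scores.round(2))
```

The two near-duplicate columns receive large factors while the independent column stays close to 1, which is the pattern the VIF < 5 rule of thumb is meant to catch.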

### Hypothesis in Linear Regression 
Once you have fitted a straight line to the data, you need to ask, "Is this straight line a significant fit for the data?" or "Does the beta coefficient explain the variance in the data plotted?" This is where hypothesis testing on the beta coefficient comes in. The null and alternative hypotheses in this case are:

H0: B1 = 0

HA: B1 ≠ 0
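
The test of H0: B1 = 0 rests on the t statistic, the estimated coefficient divided by its standard error. Here is a minimal NumPy sketch on synthetic data (an assumption for illustration). For large samples, an absolute t value above roughly 2 corresponds to a p-value below 0.05.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data: the true slope is 0.3, so H0 should be rejected
n = 100
x = rng.uniform(0, 10, size=n)
y = 1.0 + 0.3 * x + rng.normal(0, 1.0, size=n)

X = np.column_stack([np.ones(n), x])          # intercept + slope
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ beta

dof = n - X.shape[1]                          # residual degrees of freedom
sigma2 = residuals @ residuals / dof          # estimated error variance
cov_beta = sigma2 * np.linalg.inv(X.T @ X)    # covariance of the estimates
std_err = np.sqrt(np.diag(cov_beta))
t_stats = beta / std_err

print("coefficients:", beta)
print("t statistics:", t_stats)
```

These are the same quantities that appear in the `coef`, `std err` and `t` columns of a statsmodels OLS summary.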

### Assessing the model fit
Some other parameters to assess a model are:
- **t statistic**: It is used to determine the p-value and hence helps in determining whether the coefficient is significant or not.
- **F statistic**: It is used to assess whether the overall model fit is significant or not. Generally, the higher the value of the F-statistic, the more significant a model turns out to be.
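
For a model with an intercept, the F statistic can be written in terms of R², the number of observations n and the number of features k: F = (R² / k) / ((1 - R²) / (n - k - 1)). The sketch below evaluates it; the function name `f_statistic` and the input values are hypothetical.

```python
def f_statistic(r_squared, n_obs, n_features):
    """Overall F statistic of an OLS model with an intercept."""
    explained = r_squared / n_features
    unexplained = (1.0 - r_squared) / (n_obs - n_features - 1)
    return explained / unexplained

# Hypothetical model: R^2 = 0.80 from 100 observations and 4 features
f_value = f_statistic(0.80, 100, 4)
print(f_value)
```

For a fixed n and k, a larger R² always yields a larger F, which is why a high F statistic goes together with a very small p-value.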

In [None]:
grad_adm_data.info()

# Model 1

In [None]:
df_1 = grad_adm_data.copy()

In [None]:
df_1.head()

In [None]:

df_1.columns

#### Performing Linear Regression

In [None]:
# Assigning the features as X and the target as Y

X= df_1.drop(["Chance of Admit"],axis =1)
Y= df_1["Chance of Admit"]
X_train_org, X_test_org, y_train_org, y_test_org = train_test_split(X, Y,test_size=0.20, random_state=100)

In [None]:
X_train_org.shape, X_test_org.shape, y_train_org.shape, y_test_org.shape

In [None]:
print(X.shape)
print(Y.shape)
X.head()

In [None]:
import statsmodels.api as sm
# Adding a constant to get an intercept
X_train_sm = sm.add_constant(X_train_org)

# Fitting the regression line using 'OLS'
lr = sm.OLS(y_train_org, X_train_sm).fit() #statsmodels.regression.linear_model

In [None]:
lr.summary()

In [None]:
#sklearn.linear_model -- just another way of getting the R2 value
final_model = LinearRegression() 
final_model.fit(X_train_org,y_train_org)
final_model.score(X_train_org,y_train_org)

#### Performing predictions on the test set

In [None]:
# Add a constant to X_test
X_test_sm = sm.add_constant(X_test_org)

# Predict the y values corresponding to X_test_sm using the statsmodels-based approach
y_pred = lr.predict(X_test_sm) 

In [None]:
type(lr), type(final_model)

#### Observations:

- Passing X_test with the added constant to **final_model (sklearn)** raises an **error, because it has 8 columns while the model was trained on the original 7**. sklearn's LinearRegression fits the intercept itself and expects no constant column, hence we use **lr (statsmodels)**, which was fitted with the constant, to predict y_pred.
- Linear regression does not require the input variables to be normally distributed, so no normality check on the inputs is needed before fitting with statsmodels.
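
The column-count mismatch described above can be reproduced without either library. This NumPy sketch uses synthetic data (an assumption). It fits 8 coefficients on a design matrix with a constant column, then shows that predicting works only when the test matrix has the same 8 columns.

```python
import numpy as np

rng = np.random.default_rng(7)
n_train, n_test, n_features = 50, 10, 7      # 7 features, as in this notebook

# Synthetic training and test data, for illustration only
X_train = rng.normal(size=(n_train, n_features))
y_train = 0.7 + X_train @ rng.normal(size=n_features) + rng.normal(0, 0.1, size=n_train)
X_test = rng.normal(size=(n_test, n_features))

# Fit with an explicit constant column: 1 intercept + 7 slopes = 8 coefficients
X_train_const = np.column_stack([np.ones(n_train), X_train])
coef, *_ = np.linalg.lstsq(X_train_const, y_train, rcond=None)

# Prediction needs the same layout, so the constant must be added to the test set too
X_test_const = np.column_stack([np.ones(n_test), X_test])
y_pred = X_test_const @ coef

# Mixing layouts (7 columns against 8 coefficients) fails with a shape error
try:
    X_test @ coef
    mismatch = False
except ValueError:
    mismatch = True

print(coef.shape, y_pred.shape, mismatch)
```

The rule of thumb is that the matrix used for prediction must have the same columns, in the same order, as the matrix used for fitting.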

### Testing the assumptions of the linear regression model:

#### 1. Multicollinearity check by VIF score :

In [None]:
# Check for the VIF values of the feature variables. 
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Calculate the VIFs for the new model
def getVIF(X_train):
    vif = pd.DataFrame()
    X = X_train
    vif['Features'] = X.columns
    vif['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
    vif['VIF'] = round(vif['VIF'], 2)
    vif = vif.sort_values(by = "VIF", ascending = False)
    return(vif)

In [None]:
getVIF(X_train_sm)

- **Observations from Multicollinearity check:**
    - All features have VIF < 5 
    - The catch is that some categorical variables are still being treated as numerical. We will deal with this in the next model.

#### Residuals Analysis

In [None]:
from sklearn.metrics import mean_absolute_error,r2_score,mean_squared_error,mean_absolute_percentage_error
#R-squared value
print("R2 score of the model is ",r2_score(y_test_org,y_pred))

#MAE value
print("mean_absolute_error  of the model is ",mean_absolute_error(y_test_org,y_pred))

#RMSE value
print( "Root mean squared error of the model is ",np.sqrt( mean_squared_error( y_test_org, y_pred ) ))

#MAPE value
print("Mean absolute percentage error of the model is ", mean_absolute_percentage_error(y_test_org,y_pred))


#### Final Predictions using original test data and calculating residuals

In [None]:
y_preds = lr.predict(X_test_sm)
errors = y_test_org - y_preds

**2.The mean of residuals is nearly zero**

In [None]:
np.mean(errors)

**3.Linearity of variables**  
- No pattern in the residual plot

In [None]:
# Plotting y_test and y_pred to understand the spread.
fig = plt.figure()
plt.scatter(y_test_org,y_pred)
fig.suptitle('y_test_org vs y_pred', fontsize=20)              # Plot heading 
plt.xlabel('y_test_org', fontsize=18)                          # X-label
plt.ylabel('y_pred', fontsize=16)   

**4.Test for Homoscedasticity**

In [None]:
fig = plt.figure(figsize=(12,8))
fig = sm.graphics.plot_regress_exog(lr, 'CGPA', fig=fig)

In [None]:
sns.scatterplot(x=y_preds, y=errors)  # recent seaborn versions require x and y as keyword arguments
plt.xlabel("Predicted chances of admit")
plt.ylabel("Residuals")
plt.title("Predicted values vs Residuals")

**5.Normality of residuals**

- Left skewed distribution
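
The direction of the skew can also be read from the sign of the sample skewness (third central moment divided by the cube of the standard deviation): negative means a longer left tail. This is a minimal sketch, with a hypothetical helper `sample_skewness` and made-up residual values.

```python
import numpy as np

def sample_skewness(values):
    """Population-style skewness: m3 / m2**1.5."""
    values = np.asarray(values, dtype=float)
    centered = values - values.mean()
    m2 = np.mean(centered ** 2)
    m3 = np.mean(centered ** 3)
    return float(m3 / m2 ** 1.5)

# Hypothetical residuals with a long left tail, for illustration only
residuals = np.array([-0.30, -0.12, -0.02, 0.00, 0.01, 0.03, 0.04, 0.05, 0.06, 0.07])
skew = sample_skewness(residuals)
print(skew)
```

In the notebook, `sample_skewness(errors)` would give the corresponding value for the model residuals.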

In [None]:
sns.histplot(errors, kde = True, color = 'orange') 

In [None]:
sm.qqplot(errors, line = 's')
plt.show()

# Observations for Model 1:
Here are some key statistics from the summary:

- The coefficient for TOEFL Score is 0.0032, with a very low p-value (0.002). The coefficient is statistically significant, so the association is unlikely to be purely by chance. Along with TOEFL Score, the other significant features are GRE Score, Research and CGPA.
- R-squared is 0.83, meaning that 83.0% of the variance in chance of admit is explained by all the input variables (**'GRE Score', 'TOEFL Score', 'University Rating', 'SOP', 'CGPA','Research', 'LOR'**). This is a decent R-squared value, but the problem here is that we have included all features (both numerical and categorical) as they are, which is not ideal. We will deal with this in further models.
- As we have **not normalized the data, we have used the statsmodels**-based approach to predict chance of admit and to calculate errors. In further models, we will use the sklearn-based approach, where we will normalize the data.
- The F-statistic has a very low p-value (2.27e-140, practically zero), meaning that the model fit is statistically significant and the explained variance isn't purely by chance.
- The summary flags possible strong multicollinearity or other numerical problems, so we use VIF to detect and address this.
- **Observations from Multicollinearity check:**
    - All features have VIF < 5 
    - The catch is that some categorical variables are still being treated as numerical. We will deal with this in the next model.
- **Observations from Residual mean check:**
    - The mean of residuals is nearly zero (0.01)
- **Observations from Linearity of variables check:**
    - There is a clear linear relationship between the predicted and actual values of chance of admit, so the two sets of values have a similar spread.
- **Observations from test for Homoscedasticity check:**
    - No pattern in the residual plot
- **Observations from Normality of residuals check:**
    - A slightly left-skewed distribution.
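
The overview lists adjusted R² among the metrics. It penalises R² for the number of features, and it can be derived from R², the number of observations n and the number of features k. In the sketch below, R² = 0.83 and the 7 features come from the observations above, while the 400 training rows are an assumption (an 80% split of 500 records).

```python
def adjusted_r2(r_squared, n_obs, n_features):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1)."""
    return 1.0 - (1.0 - r_squared) * (n_obs - 1) / (n_obs - n_features - 1)

# n_obs=400 is an assumed training size, not a value printed by the notebook
adj = adjusted_r2(0.83, 400, 7)
print(round(adj, 4))
```

Because the penalty grows with k, adding dummy variables that explain little can lower adjusted R² even when plain R² creeps up.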


# Model 2

In [None]:
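# One hot encoding to convert categorical features to numerical features.
# dtype=int keeps the dummy columns numeric; newer pandas versions return bool columns by default,
# which can lead statsmodels to reject the design matrix as object dtype.

df_2 = pd.get_dummies(grad_adm_data, columns = ['SOP', 'LOR', 'University Rating', 'Research'],drop_first = True, dtype=int)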

In [None]:
df_2.columns

In [None]:
df_train, df_test = train_test_split(df_2, train_size = 0.8, random_state = 100)

In [None]:
df_train.shape, df_test.shape

OBS : We have converted all the unique values in categorical columns to one hot encoded values.
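
To illustrate what one-hot encoding with a dropped first level produces, here is a small pure-Python sketch. The helper `one_hot_drop_first` and the sample ratings are hypothetical. The first (lowest) level becomes the baseline and gets no column of its own, which avoids a perfectly collinear set of dummies.

```python
def one_hot_drop_first(values, prefix):
    """One-hot encode `values`, omitting the first sorted level as the baseline."""
    levels = sorted(set(values))
    kept = levels[1:]  # the first level is represented by all-zero rows
    return {
        f"{prefix}_{level}": [1 if v == level else 0 for v in values]
        for level in kept
    }

# Hypothetical University Rating values, for illustration only
ratings = [1, 3, 2, 3, 1]
encoded = one_hot_drop_first(ratings, "University Rating")
for name, column in encoded.items():
    print(name, column)
```

A row with zeros in every dummy column therefore stands for the baseline level, here a rating of 1.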

#### Performing Linear Regression

In [None]:
# Model Corrections - 2.1

In [None]:
X_train = df_train
y_train = df_train.pop('Chance of Admit')


In [None]:
X_test = df_test
y_test = df_test.pop('Chance of Admit')

In [None]:
print( X_train.shape )
print( X_test.shape )
print( y_train.shape )
print( y_test.shape )

In [None]:
import statsmodels.api as sm
# Adding a constant to get an intercept
X_train_sm = sm.add_constant(X_train)

# Fitting the regression line using 'OLS'
lr = sm.OLS(y_train, X_train_sm).fit() #statsmodels.regression.linear_model

# Printing the parameters,i.e. intercept and slope of the regression line obtained
lr.params

In [None]:
#Performing a summary operation lists out all different parameters of the regression line fitted
print(lr.summary())

#### Performing predictions on the test set

In [None]:
# Adding a constant to X_test
X_test_sm = sm.add_constant(X_test)

# Predicting the y values corresponding to X_test_sm
y_pred = lr.predict(X_test_sm)

y_pred.head()

####  Multicollinearity check by VIF score :

In [None]:
getVIF(X_train_sm)

# Observations for Model 2.1:
Here are some key statistics from the regression summary:

- R-squared is 0.822, meaning that 82.2% of the variance in chance of admit is explained by all the input variables. This is a decent R-squared value.
- The F-statistic has a very low p-value (practically zero), meaning that the model fit is statistically significant and the explained variance isn't purely by chance.
- No significant drop in adjusted R-squared compared to the previous model.
- Strong multicollinearity exists.
- Features with p-value > 0.05 and VIF > 5 are:
    - 'SOP_1.5','SOP_2.0', 'SOP_2.5', 'SOP_3.0', 'SOP_3.5', 'SOP_4.0', 'SOP_4.5','SOP_5.0', 'LOR_1.5', 'LOR_2.0', 'LOR_2.5', 'LOR_3.0', 'LOR_3.5','LOR_4.0', 'LOR_4.5', 'LOR_5.0','University Rating_3', 'University Rating_4', 'University Rating_5'.
    - Multicollinearity has been checked with the VIF score, and these variables are dropped one by one until none has VIF > 5.
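
The "drop one by one until none has VIF > 5" procedure can be sketched as a loop. This is an illustrative NumPy version on synthetic columns, not the exact steps used in this notebook. The helpers `vif` and `drop_high_vif` are hypothetical, and each auxiliary regression includes an intercept.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X (features only)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    factors = []
    for j in range(p):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        factors.append(1.0 / (resid.var() / target.var()))  # 1 / (1 - R^2)
    return np.array(factors)

def drop_high_vif(X, names, threshold=5.0):
    """Repeatedly drop the feature with the largest VIF until all are <= threshold."""
    X = np.asarray(X, dtype=float)
    names = list(names)
    while X.shape[1] > 1:
        scores = vif(X)
        worst = int(np.argmax(scores))
        if scores[worst] <= threshold:
            break
        X = np.delete(X, worst, axis=1)
        names.pop(worst)
    return X, names

# Synthetic features: b nearly duplicates a, c is independent
rng = np.random.default_rng(42)
a = rng.normal(size=300)
b = a + rng.normal(scale=0.1, size=300)
c = rng.normal(size=300)

X_kept, kept_names = drop_high_vif(np.column_stack([a, b, c]), ["a", "b", "c"])
print(kept_names, vif(X_kept).round(2))
```

Recomputing the factors after every removal matters, because dropping one of two collinear columns usually brings the other back below the threshold.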

In [None]:
# Model Corrections - 2.2

In [None]:
df_2.columns

In [None]:
# Dropping 'GRE Score' as there's a strong correlation between 'GRE Score', 'TOEFL Score' and 'CGPA'.
# Dropping all features with p-value > 0.05 and VIF > 5

In [None]:
X_train1 = X_train[['TOEFL Score', 'CGPA', 'Research_1','University Rating_2']]

In [None]:
import statsmodels.api as sm
# Adding a constant to get an intercept
X_train_sm = sm.add_constant(X_train1)

# Fitting the regression line using 'OLS'
lr = sm.OLS(y_train, X_train_sm).fit() #statsmodels.regression.linear_model

# Printing the parameters,i.e. intercept and slope of the regression line obtained
lr.params

In [None]:
#Performing a summary operation lists out all different parameters of the regression line fitted
print(lr.summary())

#### Performing predictions on the test set

In [None]:
X_test.columns

In [None]:
X_train1.shape

In [None]:
# X_test_sm[X_train1.columns]

In [None]:
# Adding a constant to X_test
X_test_sm = sm.add_constant(X_test)

X_test_new = X_test_sm[X_train_sm.columns]
# Predicting the y values corresponding to X_test_sm
y_pred = lr.predict(X_test_new)

y_pred.head()

### Testing the assumptions of the linear regression model:

#### 1. Multicollinearity check by VIF score :

In [None]:
getVIF(X_train_sm)

- **Observations from Multicollinearity check:**
    - All features have VIF < 5 

#### Residuals Analysis

In [None]:
from sklearn.metrics import mean_absolute_error,r2_score,mean_squared_error,mean_absolute_percentage_error
#R-squared value
print("R2 score of the model is ",r2_score(y_test,y_pred))

#MAE value
print("mean_absolute_error  of the model is ",mean_absolute_error(y_test,y_pred))

#RMSE value
print( "Root mean squared error of the model is ",np.sqrt( mean_squared_error( y_test, y_pred ) ))

#MAPE value
print("Mean absolute percentage error of the model is ", mean_absolute_percentage_error(y_test,y_pred))
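As a cross-check on what these four numbers mean, the sketch below recomputes them by hand with NumPy on a tiny made-up example (`y_true_demo` and `y_hat_demo` are hypothetical values, not model output). Note that scikit-learn's `mean_absolute_percentage_error` returns a fraction, so 0.08 means 8%.

```python
import numpy as np

# Hypothetical actual and predicted chances of admit
y_true_demo = np.array([0.50, 0.60, 0.70, 0.80])
y_hat_demo = np.array([0.55, 0.55, 0.75, 0.75])

resid_demo = y_true_demo - y_hat_demo

mae = np.mean(np.abs(resid_demo))                  # mean absolute error
rmse = np.sqrt(np.mean(resid_demo ** 2))           # root mean squared error
mape = np.mean(np.abs(resid_demo / y_true_demo))   # as a fraction, not a percent

# R2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum(resid_demo ** 2)
ss_tot = np.sum((y_true_demo - y_true_demo.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print(f"MAE={mae:.3f}  RMSE={rmse:.3f}  MAPE={mape:.4f}  R2={r2:.3f}")
```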


#### Final Predictions using original test data and calculating residuals


In [None]:
y_pred = lr.predict(X_test_new)
errors = y_test - y_pred


**2. The mean of residuals is nearly zero**



In [None]:
np.mean(errors)
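One point worth keeping in mind when reading this number: for an OLS fit with an intercept, the residuals on the training data average to zero by construction, while residuals on held-out data only do so approximately. The sketch below illustrates this with a hand-rolled NumPy least-squares fit on made-up numbers (all names are hypothetical).

```python
import numpy as np

# Hypothetical training and test data for a one-feature model
x_train_demo = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_train_demo = np.array([1.2, 1.9, 3.2, 3.8, 5.1])
x_test_demo = np.array([6.0, 7.0])
y_test_demo = np.array([6.5, 7.6])

# Least-squares fit with an intercept column
design_train = np.column_stack([np.ones_like(x_train_demo), x_train_demo])
coef, *_ = np.linalg.lstsq(design_train, y_train_demo, rcond=None)

train_resid = y_train_demo - design_train @ coef
design_test = np.column_stack([np.ones_like(x_test_demo), x_test_demo])
test_resid = y_test_demo - design_test @ coef

print("mean training residual:", train_resid.mean())  # zero up to rounding error
print("mean test residual:    ", test_resid.mean())   # not forced to be zero
```

A test-set mean close to zero therefore suggests the model is not systematically over- or under-predicting on unseen data.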

**3. Linearity of variables**


In [None]:
# Plotting y_test and y_pred to understand the spread.
fig = plt.figure()
plt.scatter(y_test,y_pred)
fig.suptitle('y_test vs y_pred', fontsize=20)              # Plot heading 
plt.xlabel('y_test', fontsize=18)                          # X-label
plt.ylabel('y_pred', fontsize=16)

**4. Test for Homoscedasticity**




In [None]:
fig = plt.figure(figsize=(12,8))
fig = sm.graphics.plot_regress_exog(lr, 'CGPA', fig=fig)

In [None]:
# Residuals in test-set order; the horizontal line marks their mean
sns.scatterplot(x=np.arange(1, len(errors) + 1), y=errors)
plt.axhline(errors.mean(), color='red')

In [None]:
sns.scatterplot(x=y_pred, y=errors)
plt.xlabel("Predicted chances of admit")
plt.ylabel("Residuals")
plt.title("Predicted values vs Residuals")

**5. Normality of residuals**




In [None]:
sns.histplot(errors, kde = True, color = 'orange') 

In [None]:
sm.qqplot(errors, line = 's')
plt.show()

The residuals look normally distributed.
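The visual impression from the histogram and Q-Q plot can be complemented with a few numbers. The sketch below computes sample skewness, excess kurtosis and the Jarque-Bera statistic JB = n/6 · (S² + K²/4) with NumPy only; `skew_kurt_jb` is a hypothetical helper and the demo values are made up. For normally distributed residuals, skewness and excess kurtosis should both be near 0.

```python
import numpy as np

def skew_kurt_jb(resid):
    """Return (skewness, excess kurtosis, Jarque-Bera statistic)."""
    r = np.asarray(resid, dtype=float)
    n = r.size
    centered = r - r.mean()
    m2 = np.mean(centered ** 2)
    skew = np.mean(centered ** 3) / m2 ** 1.5
    excess_kurt = np.mean(centered ** 4) / m2 ** 2 - 3.0
    jb = n / 6.0 * (skew ** 2 + excess_kurt ** 2 / 4.0)
    return skew, excess_kurt, jb

# Tiny symmetric demo sample: skewness is 0, the tails are lighter than normal
skew, excess_kurt, jb = skew_kurt_jb([-2.0, -1.0, 0.0, 1.0, 2.0])
print(skew, excess_kurt, jb)
```

On the notebook's data the call would be `skew_kurt_jb(errors)`. Under normality JB is approximately chi-square distributed with 2 degrees of freedom, so values well above roughly 6 would cast doubt on the assumption at the 5% level.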

# Observations for Model 2.2:
The `lr.summary()` output gives a brief summary of the linear regression. Here are some key statistics from it:

- R-squared is 0.805, meaning that 80.5% of the variance in chance of admit is explained by the input variables. This is a decent R-squared value.
- The F-statistic has a very low p-value (practically zero), meaning that the model fit is statistically significant and the explained variance isn’t purely by chance.
- No significant drop in adjusted R squared as compared to previous model.
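- TOEFL Score and CGPA are still strongly correlated with each other, so some multicollinearity remains even though the VIF values are below 5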
- No features with p-value > 0.05 and VIF > 5
- **Observations from Multicollinearity check:**
    - All features have VIF < 5 
- **Observations from Residual mean check:**
    - The mean of residuals is nearly zero (0.01)
- **Observations from Linearity of variables check:**
    - There is a clear linear relationship between the predicted and the actual values of chance of admit, which suggests that the two have a similar spread
- **Observations from test for Homoscedasticity check:**
    - No pattern in the residual plot
- **Observations from Normality of residuals check:**
    - The residuals look nearly normally distributed
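Since the observations above compare adjusted R-squared across models, a short sketch of the formula may help: adjusted R² = 1 − (1 − R²)(n − 1)/(n − p − 1), where n is the number of training rows and p the number of features. The helper name `adjusted_r2` is hypothetical, and n = 400 is an assumption made only for this example; the actual training-set size should be substituted.

```python
def adjusted_r2(r2, n_obs, n_features):
    """Adjusted R-squared: penalises R-squared for the number of features used."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_features - 1)

# R2 = 0.805 with 4 features, as in Model 2.2; n_obs = 400 is an assumed value
print(round(adjusted_r2(0.805, n_obs=400, n_features=4), 3))  # 0.803
```

Because the penalty grows with the number of features, adjusted R-squared is the fairer number to compare when a model correction drops variables.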

In [None]:
# Model Corrections - 2.3

In [None]:
X_train2 = X_train[['CGPA','Research_1']]

In [None]:
import statsmodels.api as sm
# Adding a constant to get an intercept
X_train_sm = sm.add_constant(X_train2)

# Fitting the regression line using 'OLS'
lr = sm.OLS(y_train, X_train_sm).fit() #statsmodels.regression.linear_model

# Printing the parameters,i.e. intercept and slope of the regression line obtained
lr.params

In [None]:
#Performing a summary operation lists out all different parameters of the regression line fitted
print(lr.summary())

In [None]:
X_test.columns

In [None]:
# Adding a constant to X_test
X_test_sm = sm.add_constant(X_test)
X_test_new1 = X_test_sm[X_train_sm.columns]
# Predicting the y values corresponding to X_test_sm
y_pred = lr.predict(X_test_new1)

y_pred.head()

# Testing the assumptions of the linear regression model (2.3):

#### 1. Multicollinearity check by VIF score :

In [None]:
# Check for the VIF values of the feature variables. 
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Calculate the VIFs for the new model
def getVIF(X_train):
    vif = pd.DataFrame()
    X = X_train
    vif['Features'] = X.columns
    vif['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
    vif['VIF'] = round(vif['VIF'], 2)
    vif = vif.sort_values(by = "VIF", ascending = False)
    return(vif)

In [None]:
getVIF(X_train_sm)

- **Observations from Multicollinearity check:**
    - All features have VIF < 5 

### Model performance evaluation and Residuals Analysis



In [None]:
from sklearn.metrics import mean_absolute_error,r2_score,mean_squared_error,mean_absolute_percentage_error
#R-squared value
print("R2 score of the model is ",r2_score(y_test,y_pred))

#MAE value
print("mean_absolute_error  of the model is ",mean_absolute_error(y_test,y_pred))

#RMSE value
print( "Root mean squared error of the model is ",np.sqrt( mean_squared_error( y_test, y_pred ) ))

#MAPE value
print("Mean absolute percentage error of the model is ", mean_absolute_percentage_error(y_test,y_pred))


#### Final Predictions using original test data and calculating residuals


In [None]:
y_pred = lr.predict(X_test_new1)
errors = y_test - y_pred


**2. The mean of residuals is nearly zero**



In [None]:
np.mean(errors)

**3. Test for Homoscedasticity**  
- No pattern in the residual plot



In [None]:
fig = plt.figure(figsize=(12,8))
fig = sm.graphics.plot_regress_exog(lr, 'Research_1', fig=fig)

In [None]:
fig = plt.figure(figsize=(12,8))
fig = sm.graphics.plot_regress_exog(lr, 'CGPA', fig=fig)

In [None]:
# Residuals in test-set order; the horizontal line marks their mean
sns.scatterplot(x=np.arange(1, len(errors) + 1), y=errors)
plt.axhline(errors.mean(), color='red')

In [None]:
sns.scatterplot(x=y_pred, y=errors)
plt.xlabel("Predicted chances of admit")
plt.ylabel("Residuals")
plt.title("Predicted values vs Residuals")
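Reading "no pattern" off a scatter plot is subjective, so a numeric check can help. The sketch below implements the studentized (Koenker) form of the Breusch-Pagan statistic by hand: regress the squared residuals on the regressors and take LM = n · R². The helper `breusch_pagan_lm` and the demo arrays are hypothetical; statsmodels also provides `het_breuschpagan` in `statsmodels.stats.diagnostic`, which would be the natural choice in practice.

```python
import numpy as np

def breusch_pagan_lm(resid, X):
    """LM = n * R^2 from regressing squared residuals on X plus an intercept."""
    resid = np.asarray(resid, dtype=float)
    X = np.asarray(X, dtype=float)
    if X.ndim == 1:
        X = X[:, None]
    n = resid.size
    design = np.column_stack([np.ones(n), X])
    target = resid ** 2
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    fitted = design @ coef
    ss_res = np.sum((target - fitted) ** 2)
    ss_tot = np.sum((target - target.mean()) ** 2)
    return n * (1.0 - ss_res / ss_tot)

x_demo = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])

# Residual spread grows with x: heteroscedastic, large LM
lm_growing = breusch_pagan_lm([0.1, -0.2, 0.3, -0.4, 0.5, -0.6], x_demo)

# Residual spread unrelated to x: LM is (numerically) zero
lm_flat = breusch_pagan_lm([0.1, -0.3, 0.2, -0.2, 0.3, -0.1], x_demo)

print(lm_growing, lm_flat)
```

Under homoscedasticity the statistic is approximately chi-square distributed with as many degrees of freedom as there are regressors (excluding the intercept), so a small value supports the "no pattern" reading.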

**4. Linearity of variables**




- There is a clear linear relationship between the predicted and the actual values of chance of admit, which suggests that the two have a similar spread

In [None]:
# Plotting y_test and y_pred to understand the spread.
fig = plt.figure()
plt.scatter(y_test,y_pred)
fig.suptitle('y_test vs y_pred', fontsize=20)              # Plot heading 
plt.xlabel('y_test', fontsize=18)                          # X-label
plt.ylabel('y_pred', fontsize=16)  

**5. Normality of residuals**



In [None]:
sns.histplot(errors, kde = True, color = 'orange') 

In [None]:
sm.qqplot(errors, line = 's')
plt.show()

# Observations for Model 2.3

- If only GRE Score is used out of GRE Score, TOEFL Score and CGPA, we get a lower R2 value (0.65).
- If only TOEFL Score is used out of GRE Score, TOEFL Score and CGPA, we get an R2 value of 0.666.
- If only CGPA is used out of GRE Score, TOEFL Score and CGPA, we get the highest R2 value (0.791) and an adjusted R2 of 0.790, which shows that **CGPA is the best fit of the three highly correlated features**.
- If 'University Rating_2' is left out, R2 does not drop at all (it stays at 0.790), so we remove it from the input variables.
- If 'Research_1' is left out, R2 drops to 0.75, so we keep it as an input variable.
- The F-statistic has a very low p-value (practically zero), meaning that the model fit is statistically significant and the explained variance isn’t purely by chance.
- No significant drop in adjusted R squared as compared to previous model.
- There's hardly any difference between the **R2 (0.790) and adjusted R2 (0.789)**, meaning that 79% of the variance in chance of admit is explained by the input variables (Research and CGPA). This is a decent R-squared value.
- **Observations from Multicollinearity check:**
    - All features have VIF < 5 
- **Observations from Residual mean check:**
    - The mean of residuals is nearly zero (0.02)
- **Observations from Linearity of variables check:**
    - There is a clear linear relationship between the predicted and the actual values of chance of admit, which suggests that the two have a similar spread
- **Observations from test for Homoscedasticity check:**
    - No pattern in the residual plot
- **Observations from Normality of residuals check:**
    - The distribution looks normal
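The feature comparison described above (fit one candidate at a time and compare R2 and adjusted R2) can be sketched as a small loop. The helper `ols_r2` and the synthetic columns `x1` and `x2` are hypothetical and stand in for the real features; the fit uses NumPy least squares rather than statsmodels to keep the example self-contained.

```python
import numpy as np

def ols_r2(X, y):
    """Return (R2, adjusted R2) of an OLS fit of y on X with an intercept."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    if X.ndim == 1:
        X = X[:, None]
    n, p = X.shape
    design = np.column_stack([np.ones(n), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    r2 = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
    adj = 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)
    return r2, adj

# Synthetic stand-ins for correlated candidate features
x1 = np.arange(10, dtype=float)
x2 = x1 ** 2              # strongly correlated with x1, but not linearly
y_demo = 1.0 + 2.0 * x1   # target depends linearly on x1 only

candidates = {"x1": x1, "x2": x2}
results = {name: ols_r2(col, y_demo) for name, col in candidates.items()}
for name, (r2, adj) in results.items():
    print(f"{name}: R2={r2:.3f}, adjusted R2={adj:.3f}")

# Keep the candidate with the highest adjusted R2
best = max(results, key=lambda name: results[name][1])
print("best single feature:", best)
```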

# Actionable Insights & Recommendations:

- Although GRE Score, TOEFL Score, CGPA, University Rating, Research publications, and Statement of Purpose and Letter of Recommendation Strength all help in predicting the chance of admit, the most important factors in graduate admissions are **CGPA and Research Publications**.
- As there's a strong correlation between GRE Score, TOEFL Score and CGPA, any one of these three can be used to give similar predictions along with the Research criterion.
- The Research criterion is particularly useful for the following reasons:  
    - Students with research papers have a better chance of getting into universities with top ratings (4 & 5).
    - Students with higher ratings in LOR and SOP are the students with the most research paper publications.
    - It shouldn't be surprising that the **students with higher academic scores (GRE, TOEFL and CGPA) are the ones who are actively publishing** or have published research papers in the past.
    
- Everything students do in high school can affect their admission outcomes, and grades matter a lot. However, during the model-building phase I noticed that **LOR (letter of recommendation) is also a strong feature, which can be linked to a student's behaviour and extracurricular activities**. There are also factors outside a student's control that affect their chances, as colleges and universities build each freshman class to include a diverse array of students, which means selecting for diverse racial, economic, and personal backgrounds. These factors can also play a part in getting **good LOR ratings, which increase the chances of admission** given the rest of the variables.
 