Multiple tests were conducted for a class, and the students' IQ scores were recorded separately. Use the data in MLR_Q17_IQScore to answer the following: https://drive.google.com/drive/folders/1rRbSnLml_iqwC8EeFOrEsetoov2yyHrF

1) Are the test scores correlated?<br>
2) Build an MLR model using the relevant variables.<br>
3) Extract the principal components (PCs) from the test dataset. How many PCs have an eigenvalue greater than 1?<br>
4) Build an MLR with the PCs whose eigenvalue is > 1. How does the Adj R-Squared compare to the one in step 2?<br>
5) Build an MLR with all the PCs. What is the Adj R-Squared of this model?

In [1]:
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.decomposition import PCA
from sklearn.preprocessing import scale

import statsmodels.api as sm

In [2]:
# load data
iqscore = pd.read_csv('C:/Users/Karthik.Iyer/Downloads/AccelerateAI/Regression-Models-main/MLR_Q17_IQScore.csv')
iqscore.head()

Unnamed: 0,Test1,Test2,Test3,Test4,Test5,IQ
0,83,34,65,63,64,106
1,73,19,73,48,82,92
2,54,81,82,65,73,102
3,96,72,91,88,94,121
4,84,53,72,68,82,102


In [3]:
# Create X and Y
Y = iqscore['IQ']
X = iqscore.drop('IQ', axis=1)

In [4]:
# Check correlation
X.corr()

Unnamed: 0,Test1,Test2,Test3,Test4,Test5
Test1,1.0,0.100018,-0.260801,0.753937,0.013967
Test2,0.100018,1.0,0.057232,0.719623,-0.281449
Test3,-0.260801,0.057232,1.0,-0.140941,0.347335
Test4,0.753937,0.719623,-0.140941,1.0,-0.172864
Test5,0.013967,-0.281449,0.347335,-0.172864,1.0


**Test4 is strongly correlated with Test1 (0.75) and Test2 (0.72); all other pairs are only weakly correlated**
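To make the reading of the correlation matrix explicit, here is a minimal sketch that lists every pair whose absolute correlation reaches a cutoff. The matrix is copied from the `X.corr()` output above; the helper `strong_pairs` and the 0.7 cutoff are illustrative assumptions, not part of the original analysis.

```python
import numpy as np

# Correlation matrix copied from the X.corr() output above.
names = ["Test1", "Test2", "Test3", "Test4", "Test5"]
corr = np.array([
    [1.0, 0.100018, -0.260801, 0.753937, 0.013967],
    [0.100018, 1.0, 0.057232, 0.719623, -0.281449],
    [-0.260801, 0.057232, 1.0, -0.140941, 0.347335],
    [0.753937, 0.719623, -0.140941, 1.0, -0.172864],
    [0.013967, -0.281449, 0.347335, -0.172864, 1.0],
])


def strong_pairs(corr, names, threshold=0.7):
    """Return (name_a, name_b, r) for each pair with |r| >= threshold.

    Hypothetical helper; the 0.7 cutoff is an assumption.
    """
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(corr[i, j]) >= threshold:
                pairs.append((names[i], names[j], float(corr[i, j])))
    return pairs


pairs = strong_pairs(corr, names)
for a, b, r in pairs:
    print(f"{a} - {b}: r = {r:.3f}")
# Test1 - Test4: r = 0.754
# Test2 - Test4: r = 0.720
```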

In [5]:
# Check correlation with Y
X.corrwith(Y)

Test1    0.225648
Test2    0.240651
Test3    0.074070
Test4    0.371404
Test5   -0.058064
dtype: float64

**Test4 has a slightly higher correlation with IQ (0.37) than the other tests**

In [6]:
# Check multicollinearity (no constant is added to X here, so these are uncentered VIFs)
from statsmodels.stats.outliers_influence import variance_inflation_factor
pd.Series([variance_inflation_factor(X.values, i) for i in range(X.shape[1])], index=X.columns)

Test1     668.590033
Test2     407.394006
Test3      26.167438
Test4    1982.284931
Test5      23.540886
dtype: float64
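One possible cross-check, sketched under assumptions: `variance_inflation_factor` does not add an intercept itself, so on raw scores that sit far from zero the values above are uncentered VIFs and can look inflated for every column. The VIF from a regression with an intercept equals the corresponding diagonal entry of the inverse correlation matrix, so it can be approximated from the rounded `X.corr()` output above. The name `centered_vif` is illustrative.

```python
import numpy as np

# Correlation matrix copied (rounded to 6 decimals) from the X.corr() output.
names = ["Test1", "Test2", "Test3", "Test4", "Test5"]
corr = np.array([
    [1.0, 0.100018, -0.260801, 0.753937, 0.013967],
    [0.100018, 1.0, 0.057232, 0.719623, -0.281449],
    [-0.260801, 0.057232, 1.0, -0.140941, 0.347335],
    [0.753937, 0.719623, -0.140941, 1.0, -0.172864],
    [0.013967, -0.281449, 0.347335, -0.172864, 1.0],
])

# VIF_j with an intercept = j-th diagonal entry of the inverse correlation matrix.
centered_vif = dict(zip(names, np.diag(np.linalg.inv(corr))))
for name, value in centered_vif.items():
    print(f"{name}: {value:.1f}")
```

On these rounded values the collinearity appears concentrated in Test1, Test2 and Test4, while Test3 and Test5 stay close to 1. This is a different diagnostic from the uncentered one used in the notebook, so the two sets of numbers are not directly comparable.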

In [7]:
# Drop Test4
X = iqscore.drop(['IQ','Test4'], axis=1)
pd.Series([variance_inflation_factor(X.values, i) for i in range(X.shape[1])], index=X.columns)

Test1    13.754610
Test2    11.499783
Test3    26.151200
Test5    23.523445
dtype: float64

In [8]:
# Drop Test3
X = iqscore.drop(['IQ','Test4','Test3'], axis=1)
pd.Series([variance_inflation_factor(X.values, i) for i in range(X.shape[1])], index=X.columns)

Test1    13.713022
Test2     8.562830
Test5    10.591489
dtype: float64

In [9]:
# Drop Test1
X = iqscore.drop(['IQ','Test4','Test3','Test1'], axis=1)
pd.Series([variance_inflation_factor(X.values, i) for i in range(X.shape[1])], index=X.columns)

Test2    6.184685
Test5    6.184685
dtype: float64

In [10]:
# Check correlation with Y
X.corrwith(Y)

Test2    0.240651
Test5   -0.058064
dtype: float64

In [11]:
# Check X
X.head()

Unnamed: 0,Test2,Test5
0,34,64
1,19,82
2,81,73
3,72,94
4,53,82


In [12]:
# Train the model
X = sm.add_constant(X)

reg_model = sm.OLS(Y,X).fit()
reg_model.summary()

Dep. Variable:,IQ,R-squared:,0.058
Model:,OLS,Adj. R-squared:,-0.099
Method:,Least Squares,F-statistic:,0.3695
Date:,"Mon, 23 May 2022",Prob (F-statistic):,0.699
Time:,18:07:50,Log-Likelihood:,-56.31
No. Observations:,15,AIC:,118.6
Df Residuals:,12,BIC:,120.7
Df Model:,2,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
const,95.3371,19.656,4.850,0.000,52.510,138.164
Test2,0.1384,0.166,0.834,0.420,-0.223,0.500
Test5,0.0072,0.199,0.036,0.972,-0.427,0.441

Omnibus:,4.503,Durbin-Watson:,1.331
Prob(Omnibus):,0.105,Jarque-Bera (JB):,2.576
Skew:,1.01,Prob(JB):,0.276
Kurtosis:,3.197,Cond. No.,621.0


**Regression Eq:**<br>
IQ = 95.3371 + 0.1384 * Test2 + 0.0072 * Test5
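As a quick worked example, the equation can be evaluated for the first row shown by `iqscore.head()` (Test2 = 34, Test5 = 64, observed IQ = 106). The function name `predict_iq` is illustrative only.

```python
def predict_iq(test2, test5):
    """Evaluate the fitted equation from the OLS summary above."""
    return 95.3371 + 0.1384 * test2 + 0.0072 * test5


# First row of iqscore.head(): Test2 = 34, Test5 = 64, observed IQ = 106
pred = predict_iq(34, 64)
residual = 106 - pred
print(round(pred, 2), round(residual, 2))
# 100.5 5.5
```

With an R-squared of 0.058 the predictions stay close to the sample mean of IQ, so the residuals are expected to be about as large as the raw deviations from that mean.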

In [13]:
# Create X
X_vars = iqscore.drop('IQ', axis=1)

In [14]:
# Fit PCA and get transformed data
pca = PCA().fit(scale(X_vars))

In [15]:
# The principal axes and eigenvalues
for i, (comp, var) in enumerate(zip(pca.components_, pca.explained_variance_)):
    print('Eigenvalue:', var, '\nAxis', i, ':', comp)

Eigenvalue: 2.373166939984251 
Axis 0 : [-0.48660571 -0.47250683  0.23207994 -0.64604806  0.2621316 ]
Eigenvalue: 1.3154021139702725 
Axis 1 : [0.12647863 0.22905202 0.70351061 0.23976434 0.61573063]
Eigenvalue: 1.185230822561089 
Axis 2 : [-0.59473493  0.57320793  0.33093311 -0.03692857 -0.4547995 ]
Eigenvalue: 0.47638231034348977 
Axis 3 : [-0.3763513   0.41383805 -0.58454478  0.00924243  0.58763913]
Eigenvalue: 0.006960670283750621 
Axis 4 : [ 5.01864619e-01  4.73754372e-01  1.54555595e-03 -7.23661398e-01
  7.00521977e-04]


**The first 3 principal components (PCs) have eigenvalues greater than 1**
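The eigenvalue-greater-than-1 rule can be checked directly from the printed values. One detail worth noting: `scale` standardizes with the population standard deviation, while `explained_variance_` divides by n - 1, so the printed eigenvalues sum to 5 × 15/14 rather than 5. The sketch below counts the PCs above 1 both as printed and after rescaling by (n - 1)/n. The variable names are illustrative.

```python
# Eigenvalues copied from the pca.explained_variance_ output above.
eigenvalues = [
    2.373166939984251,
    1.3154021139702725,
    1.185230822561089,
    0.47638231034348977,
    0.006960670283750621,
]
n_obs = 15  # number of observations reported in the OLS summaries

# Number of PCs with eigenvalue greater than 1, as printed.
n_kept = sum(ev > 1 for ev in eigenvalues)

# Share of total variance carried by the retained PCs.
total = sum(eigenvalues)
cumulative = sum(eigenvalues[:n_kept]) / total

# Rescale to correlation-matrix eigenvalues, which sum to the number of tests.
rescaled = [ev * (n_obs - 1) / n_obs for ev in eigenvalues]
n_kept_rescaled = sum(ev > 1 for ev in rescaled)

print(n_kept, n_kept_rescaled, round(sum(rescaled), 6), round(cumulative, 2))
```

Either way three components pass the cutoff, and together they carry roughly 91% of the total variance.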

In [16]:
X_transform = pca.transform(scale(X_vars))

In [17]:
X_transform

array([[ 0.15672499, -0.86309491, -1.36203168, -0.86657678, -0.10774732],
       [ 1.97621313, -0.28291317, -1.76440475, -0.63484355,  0.02128374],
       [ 0.14398747,  0.72395438,  1.22952344,  0.46307102,  0.10916046],
       [-1.44821522,  2.61256861, -0.99816025, -0.23945064, -0.09670003],
       [-0.17445329,  0.52177055, -1.19179838, -0.07560538,  0.13560936],
       [-1.80530162, -0.46650014, -0.19219548, -0.26508763,  0.08306683],
       [-0.65041402, -0.5162498 , -0.2974852 , -0.04626892,  0.06883195],
       [ 1.1227968 , -1.39187081, -0.94095613,  1.76324413, -0.00568583],
       [ 2.98676446,  0.12384555,  1.00175472, -0.46146844,  0.03753432],
       [ 2.38326323,  1.08772103,  0.90900483, -0.07441367, -0.01725664],
       [-1.09602054, -2.00277365,  0.27225775, -0.24784533, -0.05751862],
       [-0.05578459,  1.12768834, -0.03016279,  1.11734648, -0.12115262],
       [-1.08882299,  0.26505991,  1.22492937, -0.58151382,  0.01949723],
       [-2.2764078 ,  0.05707461,  0.3

In [18]:
# Train the model with the PCs whose eigenvalue is greater than 1
X = X_transform[:,0:3]
X = sm.add_constant(X)

Y = iqscore['IQ']

pca_model = sm.OLS(Y,X).fit()
pca_model.summary()

Dep. Variable:,IQ,R-squared:,0.127
Model:,OLS,Adj. R-squared:,-0.111
Method:,Least Squares,F-statistic:,0.5324
Date:,"Mon, 23 May 2022",Prob (F-statistic):,0.669
Time:,18:07:50,Log-Likelihood:,-55.742
No. Observations:,15,AIC:,119.5
Df Residuals:,11,BIC:,122.3
Df Model:,3,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
const,104.3333,2.999,34.791,0.000,97.733,110.934
x1,-2.2176,2.015,-1.101,0.295,-6.653,2.217
x2,1.6391,2.707,0.606,0.557,-4.318,7.596
x3,0.3940,2.851,0.138,0.893,-5.882,6.670

Omnibus:,6.618,Durbin-Watson:,1.214
Prob(Omnibus):,0.037,Jarque-Bera (JB):,3.596
Skew:,1.119,Prob(JB):,0.166
Kurtosis:,3.862,Cond. No.,1.49


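**Adjusted R-Squared was -0.099 for the step 2 model and is -0.111 with the first three PCs, so it is slightly worse; neither model is statistically significant (Prob (F-statistic) of 0.699 and 0.669)**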
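The adjusted R-squared values can be reproduced from the reported R-squared, the number of observations and the number of predictors, using the standard formula 1 - (1 - R²)(n - 1)/(n - p - 1). The sketch below applies it to the two summaries so far. The inputs are the rounded R-squared values, so agreement is only expected to about three decimals.

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R-squared for a model with an intercept."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


# Step 2 model: R-squared 0.058, 15 observations, 2 predictors (Test2, Test5)
adj_step2 = adjusted_r2(0.058, 15, 2)
# PCA model: R-squared 0.127, 15 observations, 3 principal components
adj_pcs = adjusted_r2(0.127, 15, 3)

print(round(adj_step2, 3), round(adj_pcs, 3))
# -0.099 -0.111
```

The PCA model has the higher R-squared (0.127 against 0.058), but the penalty for the third predictor outweighs the gain with only 15 observations.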

In [19]:
# Train the model with all PCs
X = X_transform
X = sm.add_constant(X)

Y = iqscore['IQ']

pca_model = sm.OLS(Y,X).fit()
pca_model.summary()

Dep. Variable:,IQ,R-squared:,0.399
Model:,OLS,Adj. R-squared:,0.065
Method:,Least Squares,F-statistic:,1.195
Date:,"Mon, 23 May 2022",Prob (F-statistic):,0.384
Time:,18:07:50,Log-Likelihood:,-52.939
No. Observations:,15,AIC:,117.9
Df Residuals:,9,BIC:,122.1
Df Model:,5,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
const,104.3333,2.750,37.935,0.000,98.112,110.555
x1,-2.2176,1.848,-1.200,0.261,-6.398,1.963
x2,1.6391,2.482,0.660,0.526,-3.976,7.254
x3,0.3940,2.615,0.151,0.884,-5.521,6.309
x4,-1.4200,4.125,-0.344,0.739,-10.751,7.911
x5,-67.8981,34.123,-1.990,0.078,-145.089,9.292

Omnibus:,6.629,Durbin-Watson:,1.001
Prob(Omnibus):,0.036,Jarque-Bera (JB):,3.517
Skew:,1.078,Prob(JB):,0.172
Kurtosis:,3.989,Cond. No.,18.5


**Adjusted R-Squared is now 0.065, higher than with only the first three PCs (-0.111), although the model is still not statistically significant (Prob (F-statistic) = 0.384)**
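A point that may help interpret this result: the full set of PCs is only a rotation of the standardized tests, so a regression on all five PCs fits exactly as well as a regression on all five original tests. The sketch below illustrates this on synthetic stand-in data (not the IQ dataset), using plain NumPy instead of scikit-learn and statsmodels. All names in it are assumptions.

```python
import numpy as np


def r_squared(X, y):
    """R-squared of an OLS fit of y on X plus an intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)


# Synthetic stand-in data with the same shape as the test scores (15 x 5).
rng = np.random.default_rng(42)
X_demo = rng.normal(60, 15, size=(15, 5))
y_demo = 100 + 0.2 * X_demo[:, 0] + rng.normal(0, 10, size=15)

# Standardize, then rotate onto the principal axes via SVD.
Z = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)
_, _, Vt = np.linalg.svd(Z, full_matrices=False)
scores = Z @ Vt.T

r2_original = r_squared(X_demo, y_demo)
r2_all_pcs = r_squared(scores, y_demo)
r2_first3 = r_squared(scores[:, :3], y_demo)

print(np.isclose(r2_original, r2_all_pcs))  # same fit with all PCs
print(r2_first3 <= r2_all_pcs + 1e-12)      # fewer PCs can never fit better
```

Read this way, the 0.065 is effectively the adjusted R-squared of a model on all five tests, and the jump from the three-PC model comes from the components that were discarded by the eigenvalue rule.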

1) Are the test scores correlated?<br>

Test4 is strongly correlated with Test1 (0.75) and Test2 (0.72); all other pairs are only weakly correlated

2) Build an MLR model using the relevant variables.<br>

IQ = 95.3371 + 0.1384 * Test2 + 0.0072 * Test5

3) Extract the principal components (PCs) from the test dataset. How many PCs have an eigenvalue greater than 1?<br>

The first 3 principal components (PCs) have eigenvalues greater than 1

4) Build an MLR with the PCs whose eigenvalue is > 1. How does the Adj R-Squared compare to the one in step 2?<br>

Adjusted R-Squared was -0.099 for the step 2 model and is -0.111 with the first three PCs

5) Build an MLR with all the PCs. What is the Adj R-Squared of this model?

Adjusted R-Squared is 0.065, higher than with only the first three PCs (-0.111)