# "Please have a PCA refresher"

- author: "<a href='https://www.linkedin.com/in/aneeshdata/'>Aneesh R</a>"
- toc: false
- comments: true
- categories: [Linear Algebra Tutorial]
- badges: false

### Section 0

> Code and data loading

In [None]:
import pandas as pd
import numpy as np
import statsmodels.api as sm 
import tabulate as tb # to draw tables


class udregression(object):
  """User-defined helper class that fits an OLS regression with statsmodels."""
  def __init__(self,X,Y,intercept=True):
    self.X=X
    self.Y=Y
    self.intercept=intercept
  
  def fit_model(self):
    # Prepend a column of ones when an intercept is requested
    if self.intercept:
      design=np.column_stack([np.ones(self.X.shape[0]),self.X])
    else:
      design=self.X
    return sm.OLS(self.Y,design).fit()
  
  def print_result(self):
    fitted_model=self.fit_model()
    # Row labels (const, x1, ...) come from the coefficient table, skipping its header row
    var=[row[0] for row in fitted_model.summary().tables[1].data[1:]]
    conf_int=fitted_model.conf_int() # column 0: lower limit, column 1: upper limit
    prtmp1={'var':var,'coef':fitted_model.params,
          'st error':fitted_model.bse,'t values':fitted_model.tvalues,
          'p values':fitted_model.pvalues,'LL':conf_int[:,0],
          'RL':conf_int[:,1]}
    
    prtmp2={'rsquared':[fitted_model.rsquared],
          'rsquared_Adj':[fitted_model.rsquared_adj],
          'fvalue':[fitted_model.fvalue]}
    
    print(tb.tabulate(prtmp1,headers="keys"))
    print()
    print(tb.tabulate(prtmp2,headers="keys"))

  def __repr__(self): return "a custom regression runner has been initialised"

df=pd.read_csv('https://raw.githubusercontent.com/varadan13/Data_Bank/main/Econ%20data/MULTICOLLINEARITYPROB.csv')    

### Section 1

Suppose we have data of the form $(y,x_1,x_2)$. [For now, let's think of the observations as points in a 3D space.]

Let's assume $x_1$ and $x_2$ are linearly related.   

Our current problem can be roughly summarised as follows:   
predict y **<-** (by using information in $x_1$) & (by using information in $x_2$)  

(1.1) But since we assumed that $x_1$ and $x_2$ are linearly related, there may not be much information left in $x_2$ once we have regressed y on $x_1$, and likewise there may not be much information left in $x_1$ once we have regressed y on $x_2$. 

Let's take an example to understand this.

Below, we have stored a data set in 'data'.

In [154]:
data=df[['Hours','NEIN','Assets']] # NEIN and Assets are the independent variables
                                   # Hours is the dependent variable
                                   # So we are regressing Hours on NEIN and Assets

# Let's check the correlation between NEIN and Assets 

In [156]:
data[['NEIN','Assets']].corr()

|        | NEIN    | Assets  |
|--------|---------|---------|
| NEIN   | 1.0     | 0.98751 |
| Assets | 0.98751 | 1.0     |


As we can see, NEIN and Assets are highly correlated (correlation of about 0.988).
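
As a reminder of what `.corr()` reports by default, here is a minimal sketch of the Pearson correlation computed from its definition. The arrays `x1_demo` and `x2_demo` are hypothetical synthetic stand-ins, not the NEIN and Assets columns, and `pearson_corr` is a helper name introduced only for this illustration.

```python
import numpy as np

def pearson_corr(a, b):
    """Pearson correlation: covariance divided by the product of the standard deviations."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a_c = a - a.mean()  # centre both variables
    b_c = b - b.mean()
    return float((a_c @ b_c) / np.sqrt((a_c @ a_c) * (b_c @ b_c)))

# Hypothetical data: x2_demo is almost an exact linear function of x1_demo
rng = np.random.default_rng(0)
x1_demo = rng.normal(size=200)
x2_demo = 20.0 * x1_demo + rng.normal(scale=2.0, size=200)

r_demo = pearson_corr(x1_demo, x2_demo)
print(round(r_demo, 4))
```

A value close to 1 means one variable is nearly a linear function of the other, which is the situation (1.1) describes.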

What are we going to do to understand (1.1)?

- Let's first regress Hours on NEIN.
- Then see if the slope is significant.
- If it is significant, let's compute the residual.
- Then let's regress the residual on Assets.
- Check that slope for significance.

What we are hoping to see is that the residual-Assets slope will be insignificant, because very little information remains in Assets that we can use to explain Hours with a linear model. Why? Because Assets and NEIN are highly correlated. 

In [157]:
Y=df.Hours.to_numpy()
X1=df.NEIN.to_numpy()
X2=df.Assets.to_numpy()
regression=udregression
regression(X1,Y).print_result()

var           coef    st error    t values     p values           LL         RL
-----  -----------  ----------  ----------  -----------  -----------  ---------
const  2033.84      21.0139       96.7856   4.35155e-42  1991.09      2076.59
x1        0.318879   0.0599264     5.32117  7.15058e-06     0.196958     0.4408

  rsquared    rsquared_Adj    fvalue
----------  --------------  --------
  0.461794        0.445485   28.3149


From the above results we can observe that the slope on NEIN is significant (p-value of about 7.2e-06). Now let's compute the residuals and finish the above procedure.

In [158]:
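# Note: this is fitted minus observed, i.e. the negative of the usual residual (observed minus fitted).
# The sign flip changes the sign of the coefficients below but not their significance.
Residual=regression(X1,Y).fit_model().predict()-Y
# Regress the residual on Assets and check the results
regression(X2,Residual).print_result()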

var           coef     st error    t values    p values            LL           RL
-----  -----------  -----------  ----------  ----------  ------------  -----------
const   6.56741     19.0888        0.344045    0.732995  -32.269       45.4039
x1     -0.00108075   0.00284811   -0.379463    0.706775   -0.00687527   0.00471377

  rsquared    rsquared_Adj    fvalue
----------  --------------  --------
0.00434444      -0.0258269  0.143992


Now we can see that the slope on Assets is insignificant (p-value of about 0.71). One can also swap X1 and X2 and reach the same conclusion. Now the question is: what can we do about this problem?
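
The two-step procedure, run in both orders, can be sketched on synthetic data as follows. This is only an illustration under stated assumptions: `x1_demo`, `x2_demo` and `y_demo` are made-up stand-ins for NEIN, Assets and Hours, the helpers `simple_ols` and `two_step` are hypothetical names, and plain NumPy least squares is used in place of the `udregression` class.

```python
import numpy as np

def simple_ols(x, y):
    """Fit y = b0 + b1*x by least squares; return (b0, b1, t-statistic of b1)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    design = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    sigma2 = resid @ resid / (len(y) - 2)              # residual variance estimate
    cov_beta = sigma2 * np.linalg.inv(design.T @ design)
    return beta[0], beta[1], beta[1] / np.sqrt(cov_beta[1, 1])

def two_step(first, second, y):
    """Regress y on `first`, then regress the residual on `second`."""
    b0, b1, t_first = simple_ols(first, y)
    resid = y - (b0 + b1 * first)                      # observed minus fitted
    _, _, t_second = simple_ols(second, resid)
    return t_first, t_second

# Hypothetical data: x2_demo is nearly a multiple of x1_demo
rng = np.random.default_rng(1)
x1_demo = rng.normal(size=100)
x2_demo = 2.0 * x1_demo + rng.normal(scale=0.1, size=100)
y_demo = 3.0 + 1.5 * x1_demo + rng.normal(size=100)

t_a1, t_a2 = two_step(x1_demo, x2_demo, y_demo)        # x1 first, then x2
t_b1, t_b2 = two_step(x2_demo, x1_demo, y_demo)        # x2 first, then x1
print(f"x1 then x2: t = {t_a1:.2f}, then {t_a2:.2f}")
print(f"x2 then x1: t = {t_b1:.2f}, then {t_b2:.2f}")
```

In both orders the first t-statistic should be large and the second much smaller, because whichever variable enters first absorbs most of the shared information.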

### Section 2

Suppose we can create $z_1$ and $z_2$ such that    
- $z_1=a_{11}x_1+a_{12}x_2$ and $z_2=a_{21}x_1+a_{22}x_2$
- $V(z_{1})>V(z_{2})$ 
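- the weight vectors $(a_{11},a_{12})$ and $(a_{21},a_{22})$ are orthonormal, and $z_1$ and $z_2$ are uncorrelated (orthogonal once centred)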
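
One standard way to meet these conditions, written in matrix form, is sketched below. The symbols $x$, $\Sigma$, $a_i$ and $\lambda_i$ are notation introduced here for illustration: $x=(x_1,x_2)^\top$, $\Sigma$ is the covariance matrix of $x$, and $a_i=(a_{i1},a_{i2})^\top$.

$$
z_i = a_i^\top x,\qquad V(z_i) = a_i^\top \Sigma\, a_i,\qquad \mathrm{Cov}(z_1,z_2) = a_1^\top \Sigma\, a_2 .
$$

If $a_1$ and $a_2$ are chosen as the unit-length eigenvectors of $\Sigma$, with eigenvalues $\lambda_1>\lambda_2$, then

$$
V(z_1)=\lambda_1,\qquad V(z_2)=\lambda_2,\qquad \mathrm{Cov}(z_1,z_2)=0,\qquad V(z_1)+V(z_2)=V(x_1)+V(x_2).
$$

So the total variance is preserved and simply redistributed between two uncorrelated variables.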
 

**How to interpret the above conditions?**    
We must interpret (1.1) and the accompanying example in a practical way. (1.1) is essentially telling us that we can do our analysis with just one variable instead of two. But that does not mean we can drop $x_1$ or $x_2$ without incurring any error, because both may be in our model for empirical reasons. So when we extract $z_1$ and $z_2$, we partition the total variance between two orthogonal directions that share no linear information with one another. This can be seen more clearly when we generalise the derivation to a p-variate problem.    
Now if the variance of $z_2$ is very small, we can omit it and do our analysis with just $z_1$.
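
The construction of $z_1$ and $z_2$ can be sketched numerically as below. This is a minimal illustration, assuming the weights are taken from the eigenvectors of the sample covariance matrix (the usual PCA choice). The data are synthetic, and the names `X_demo`, `A` and `Z` are hypothetical; they are not the NEIN and Assets columns.

```python
import numpy as np

# Hypothetical data: two strongly correlated predictors
rng = np.random.default_rng(42)
n = 500
x1_demo = rng.normal(size=n)
x2_demo = 2.0 * x1_demo + rng.normal(scale=0.2, size=n)
X_demo = np.column_stack([x1_demo, x2_demo])
Xc = X_demo - X_demo.mean(axis=0)            # centre each column

cov = np.cov(Xc, rowvar=False)               # 2x2 sample covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)       # eigh returns eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]            # reorder so the largest comes first
eigvals = eigvals[order]
A = eigvecs[:, order].T                      # rows are (a11, a12) and (a21, a22)

Z = Xc @ A.T                                 # columns are z1 and z2
var_z = Z.var(axis=0, ddof=1)                # V(z1), V(z2)
share_z1 = var_z[0] / var_z.sum()            # fraction of total variance carried by z1

print("weights A:\n", A)
print("V(z1), V(z2):", var_z)
print("share of variance in z1:", round(share_z1, 4))
```

When the share of variance in $z_1$ is close to 1, dropping $z_2$ loses very little, which is the situation described above.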