# 0. PCA ANALYSIS OF DOW JONES STOCKS

This notebook is based on Nathan Thomas's notebook, published at
https://towardsdatascience.com/applying-pca-to-the-yield-curve-4d2023e555b3
which we have commented and extended.

## 1. Import and clean data

First we import the daily stock prices and convert them to daily returns.

In [1]:
!pip install openpyxl
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler

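# Import daily prices from CSV, indexed by date
df = pd.read_csv("indu_dly.csv", index_col="Date")
# Convert prices to one-day percentage returns and drop rows with missing values
df = df.pct_change(1).dropna(how="any")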



## 2. Compute the eigenvalues & eigenvectors

In [2]:
from sklearn.decomposition import PCA
pca = PCA()
pca.fit(df)  # fit PCA on the daily returns (PCA centers the data internally)
eigenVectors = pca.components_  # one eigenvector per row, ordered by decreasing eigenvalue
eigenValues = pca.explained_variance_  # 1-D array of eigenvalues in decreasing order
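
As a sanity check on what the two attributes contain, the sketch below verifies on synthetic data that each row of components_ is an eigenvector of the sample covariance matrix, with the matching entry of explained_variance_ as its eigenvalue. The array X and the name pca_demo are illustrative assumptions, not part of the Dow Jones data set.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic "returns": 500 days x 4 assets with correlated columns (illustrative only)
X = rng.normal(size=(500, 4)) @ rng.normal(size=(4, 4))

pca_demo = PCA().fit(X)
cov = np.cov(X, rowvar=False)  # sample covariance, divides by n - 1

# Put the eigenvectors in columns so that cov @ V can be compared with V * lambda
V = pca_demo.components_.T
lhs = cov @ V
rhs = V * pca_demo.explained_variance_  # scales column j by eigenvalue j
print(np.allclose(lhs, rhs))
```

The comparison relies on both np.cov and scikit-learn's PCA using the n - 1 denominator.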

## 3. PCA projections 

We now calculate the PCA projections (what we have been calling
the transformed "Z" features in 4.PCAInMoreDepth.pptx, slides 25 to 32).
These are "latent" or hidden features (as per slide 37) that
drive the movement of the stock returns as a whole.
pc1 is the most important latent feature: the one that captures the most variance.

In [3]:
principal_component_projections = pca.transform(df)  # center df and project it onto the eigenvectors
pc1_proj = principal_component_projections[:, 0]  # projection onto the first principal component
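
To make the projection step concrete, here is a minimal sketch on synthetic data showing that transform amounts to centering the data and multiplying by the transposed eigenvector matrix, and that the resulting Z columns are uncorrelated. The names X, pca_demo, Z and Z_manual are assumptions introduced for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic data standing in for the returns matrix (illustrative only)
X = rng.normal(size=(500, 4)) @ rng.normal(size=(4, 4))

pca_demo = PCA().fit(X)

# Projection as computed by scikit-learn
Z = pca_demo.transform(X)
# The same projection by hand: subtract the column means, then rotate
Z_manual = (X - pca_demo.mean_) @ pca_demo.components_.T
print(np.allclose(Z, Z_manual))

# The covariance of the projections is diagonal, with the eigenvalues on the diagonal
cov_Z = np.cov(Z, rowvar=False)
print(np.allclose(cov_Z, np.diag(pca_demo.explained_variance_)))
```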

## 4. Comparison with Dow Jones Index 

In [4]:
df_indu_index = pd.read_csv("indu_index_dly.csv", index_col="Date")
df_indu_index_ret = df_indu_index.pct_change(1).dropna(how="any")

The correlation between the daily up and down movements of the pc1 projection and those of the Dow Jones Index is very high:

In [5]:
np.corrcoef(pc1_proj, df_indu_index_ret.to_numpy().ravel())  # flatten the one-column frame to 1-D

array([[1.        , 0.98062392],
       [0.98062392, 1.        ]])
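
One way to see why such a high correlation is plausible is a toy simulation. The sketch below assumes a hypothetical one-factor model in which every stock loads on a common market factor, and uses an equal-weighted average as a stand-in for the index (the real Dow Jones is price-weighted, so this is only an approximation). All names and parameter values are illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
n_days, n_stocks = 1000, 10

# Hypothetical one-factor model: each stock return = beta * market + idiosyncratic noise
market = rng.normal(scale=0.01, size=n_days)
betas = rng.uniform(0.5, 1.5, size=n_stocks)
noise = rng.normal(scale=0.005, size=(n_days, n_stocks))
returns = np.outer(market, betas) + noise

# Assumption: an equal-weighted average stands in for the index
index_ret = returns.mean(axis=1)

pc1_demo = PCA().fit_transform(returns)[:, 0]
corr = np.corrcoef(pc1_demo, index_ret)[0, 1]
# The sign of a principal component is arbitrary, so compare magnitudes
print(abs(corr))
```

In this setup pc1 picks up the common factor, which is also what dominates the averaged index, so the two series move together.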

## 5. Variance

The eigenvalues are the variances of the principal components; the first eigenvalue is the variance of pc1:

In [8]:
eigenValues

array([4.92612167e-03, 5.80237478e-04, 3.77788282e-04, 2.93192100e-04,
       2.77254324e-04, 2.46680330e-04, 2.26268416e-04, 2.02725897e-04,
       1.91269163e-04, 1.73064175e-04, 1.61000595e-04, 1.58570212e-04,
       1.43140263e-04, 1.41227046e-04, 1.31196420e-04, 1.21579674e-04,
       1.10462631e-04, 1.03344298e-04, 9.17621252e-05, 8.72417626e-05,
       8.22964067e-05, 8.18769287e-05, 7.64138039e-05, 6.87908321e-05,
       5.64239219e-05, 5.08670658e-05, 3.78548443e-05, 3.29897875e-05])

In [6]:
# variance of the first principal component (first eigenvalue)
eigenValues[0]

0.004926121670113644
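
An eigenvalue is often easier to read as a share of the total variance. The sketch below, on synthetic data with illustrative names, computes that share by hand and checks it against scikit-learn's explained_variance_ratio_ attribute.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic data standing in for the returns matrix (illustrative only)
X = rng.normal(size=(500, 4)) @ rng.normal(size=(4, 4))

pca_demo = PCA().fit(X)
eig_demo = pca_demo.explained_variance_

# Share of total variance captured by each component, in percent
pct_by_hand = 100 * eig_demo / eig_demo.sum()
pct_sklearn = 100 * pca_demo.explained_variance_ratio_
print(pct_by_hand[0], pct_sklearn[0])
```

With all components kept, the shares add up to 100%.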

The variance of the first principal component projection:

In [7]:
# variance of the pc1 projection (np.var divides by n by default)
np.var(pc1_proj)

0.004923947741750837

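The two values agree to about three significant digits. The small gap arises because explained_variance_ divides by n - 1, while np.var divides by n by default; np.var(pc1_proj, ddof=1) uses the same n - 1 convention as the eigenvalue.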
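
The two printed numbers differ slightly in the later digits. A minimal sketch on synthetic data (names are illustrative) shows how the choice of denominator in np.var can account for a gap of this kind:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic data standing in for the returns matrix (illustrative only)
X = rng.normal(size=(500, 4)) @ rng.normal(size=(4, 4))
n = X.shape[0]

pca_demo = PCA().fit(X)
z1 = pca_demo.transform(X)[:, 0]
lam1 = pca_demo.explained_variance_[0]

var_n = np.var(z1)            # population variance, divides by n
var_n1 = np.var(z1, ddof=1)   # sample variance, divides by n - 1

print(lam1, var_n, var_n1)
# Rescaling the population variance by n / (n - 1) recovers the eigenvalue
print(np.isclose(var_n * n / (n - 1), lam1))
```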

## 6. Betas 

Calculate the betas by regressing each stock's returns on the pc1 projection:

In [9]:
from sklearn.linear_model import LinearRegression
betas_by_regression = []
for column in df.columns.values.tolist():
    reg = LinearRegression().fit(pc1_proj.reshape(-1, 1), df[column])
    #reg = LinearRegression().fit(df_indu_index_ret.iloc[:,0].values.reshape(-1,1), df[column])
    betas_by_regression.append(reg.coef_)

In [10]:
betas_by_regression = pd.DataFrame(betas_by_regression, columns=["Betas"], index=df.columns)
betas_by_regression.head(50)

| | Betas |
|---|---|
| CSCO | 0.203719 |
| DIS | 0.206642 |
| XOM | 0.173782 |
| BA | 0.195282 |
| UNH | 0.195802 |
| MMM | 0.163117 |
| HD | 0.1865 |
| VZ | 0.132984 |
| TRV | 0.205211 |
| JNJ | 0.106099 |


Read the betas directly from the first eigenvector (pc1):

In [11]:
betas_by_pc1_eigenvector = eigenVectors[0]  # the first row holds the pc1 loading of each stock
betas_by_pc1_eigenvector = pd.DataFrame(betas_by_pc1_eigenvector, columns=["Betas"], index=df.columns)
betas_by_pc1_eigenvector.head(50)

| | Betas |
|---|---|
| CSCO | 0.203719 |
| DIS | 0.206642 |
| XOM | 0.173782 |
| BA | 0.195282 |
| UNH | 0.195802 |
| MMM | 0.163117 |
| HD | 0.1865 |
| VZ | 0.132984 |
| TRV | 0.205211 |
| JNJ | 0.106099 |


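The two sets of betas are the same: the slope of each stock's regression on the pc1 projection equals that stock's loading in the first eigenvector.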
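
The match is not a coincidence. The covariance between a stock's returns and the pc1 projection equals the first eigenvalue times that stock's loading, and the variance of the projection equals the first eigenvalue, so the regression slope reduces to the loading itself. A minimal sketch on synthetic data (names are illustrative) checks this numerically:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
# Synthetic data standing in for the returns matrix (illustrative only)
X = rng.normal(size=(500, 4)) @ rng.normal(size=(4, 4))

pca_demo = PCA().fit(X)
z1 = pca_demo.transform(X)[:, [0]]  # pc1 projection as a single-column 2-D array

# Regress each column on the pc1 projection and keep the slope
slopes = np.array(
    [LinearRegression().fit(z1, X[:, j]).coef_[0] for j in range(X.shape[1])]
)
loadings = pca_demo.components_[0]
print(np.allclose(slopes, loadings))
```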

## 7. Using np.linalg.eig

In [12]:
# Same analysis with np.linalg.eig instead of scikit-learn
df_mean = df.mean()
df_ctr = df - df_mean  # center each column
cov_matrix_array = np.cov(df_ctr, rowvar=False)  # columns are the variables
eigenValues, eigenVectors = np.linalg.eig(cov_matrix_array)
idx = eigenValues.argsort()[::-1]  # indices that sort the eigenvalues in decreasing order
eigenValues_ordered = eigenValues[idx]
eigenVectors_ordered = eigenVectors[:, idx]  # one eigenvector per column, ordered from left to right
principal_component_projections = df_ctr.values @ eigenVectors_ordered  # project the centered returns
pc1 = principal_component_projections[:, 0]
np.corrcoef(pc1, df_indu_index_ret.iloc[:, 0].values)

array([[1.        , 0.98062392],
       [0.98062392, 1.        ]])
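
Because a covariance matrix is symmetric, np.linalg.eigh is a possible alternative to np.linalg.eig: it returns real eigenvalues in ascending order. The sketch below, on synthetic data with illustrative names, compares its output with scikit-learn's PCA. Eigenvectors are only defined up to sign, so the columns are sign-aligned before the comparison.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic data standing in for the returns matrix (illustrative only)
X = rng.normal(size=(500, 4)) @ rng.normal(size=(4, 4))

pca_demo = PCA().fit(X)
cov = np.cov(X, rowvar=False)

# eigh is meant for symmetric matrices and returns eigenvalues in ascending order
vals, vecs = np.linalg.eigh(cov)
order = np.argsort(vals)[::-1]
vals, vecs = vals[order], vecs[:, order]

# Flip each column so that it points the same way as scikit-learn's component
signs = np.sign(np.sum(vecs * pca_demo.components_.T, axis=0))
vecs_aligned = vecs * signs

print(np.allclose(vals, pca_demo.explained_variance_))
print(np.allclose(vecs_aligned, pca_demo.components_.T))
```

The sign ambiguity also means that a correlation computed from a hand-rolled pc1 can come out negative, with the same magnitude.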