# Solution 6
## Data Analysis
### FINM August Review 

Mark Hendricks

hendricks@uchicago.edu

In [32]:
import pandas as pd
import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression
from sklearn.decomposition import PCA
from sklearn.cross_decomposition import PLSRegression
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
import warnings
warnings.filterwarnings("ignore")

## 1 Principal Components

Use the single-name equity data from the data set mentioned above. (That is, leave out SPY and SHV.)

1. Calculate the principal components of the return series.


2. Report the eigenvalues associated with these principal components. Report each eigenvalue as a percentage of the sum of all the eigenvalues. This is the share of total variation each PC explains.


3. How many PCs are needed to explain 75% of the variation?


4. Calculate the correlation between the first (largest eigenvalue) principal component and each of the 22 single-name equities. Which correlation is highest?


5. Calculate the correlation between SPY and each of the first three principal components.

In [33]:
data = pd.read_excel("single_name_return_data.xlsx", sheet_name="total returns").set_index("Date")
equities = data.drop(columns=['SPY', 'SHV'])
spy = data[['SPY']]
rf = data[['SHV']]

##### 1.

In [34]:
# Steps to calculate PC of the return series:

pca = PCA(svd_solver='full')
pca.fit(equities.values)

pca_factors = pd.DataFrame(pca.transform(equities.values), 
                           columns=['PC {}'.format(i+1) for i in range(pca.n_components_)], 
                           index = pd.to_datetime(equities.index))
display(pca_factors.head().style.format('{:,.4f}'))

Date,PC 1,PC 2,PC 3,PC 4,PC 5,PC 6,PC 7,PC 8,PC 9,PC 10,PC 11,PC 12,PC 13,PC 14,PC 15,PC 16,PC 17,PC 18,PC 19,PC 20,PC 21,PC 22
2012-07-31 00:00:00,0.1003,0.1912,-0.1277,0.2245,0.108,-0.0551,0.0173,0.089,-0.0275,0.1663,0.0389,-0.0072,0.0563,0.0053,0.0247,-0.0021,-0.0084,0.0252,0.0375,0.0475,0.0136,-0.0139
2012-08-31 00:00:00,-0.0932,0.1407,-0.1413,0.0109,0.0933,-0.1108,0.0283,0.0199,0.0831,0.0109,-0.0512,0.0198,0.0101,0.0395,-0.0022,-0.0114,0.0298,0.0207,-0.0262,-0.0179,0.0036,0.0172
2012-09-30 00:00:00,-0.1595,-0.0298,0.0984,-0.1686,0.0176,0.0307,-0.008,0.0574,-0.009,0.0185,0.0312,-0.0351,0.0044,-0.0304,0.023,0.0227,0.0304,-0.0234,-0.024,-0.0009,0.0152,-0.0074
2012-10-31 00:00:00,-0.029,0.1649,0.0122,-0.0805,-0.0149,0.0972,0.0187,-0.0043,-0.0163,0.0476,0.003,0.0045,0.0378,0.0317,0.0738,-0.0628,-0.0489,0.0192,-0.0525,0.0093,-0.001,-0.0026
2012-11-30 00:00:00,0.014,-0.2055,0.2172,-0.084,-0.1072,0.0006,0.0691,-0.0344,-0.0177,0.0345,0.0197,-0.0242,0.0571,-0.0272,-0.0064,0.0641,-0.0335,-0.0011,0.0265,-0.0279,0.0264,-0.0201


##### 2.

In [35]:
explained_var = pd.DataFrame(data = pca.explained_variance_ratio_,
                                 index = pca_factors.columns, 
                                 columns = ['Explained Variance'])
explained_var['Cumulative Explained Variance'] = explained_var['Explained Variance'].cumsum()

display(explained_var.style.format('{:,.2%}'))

Component,Explained Variance,Cumulative Explained Variance
PC 1,45.27%,45.27%
PC 2,11.17%,56.44%
PC 3,6.97%,63.41%
PC 4,6.37%,69.78%
PC 5,5.72%,75.50%
PC 6,4.60%,80.10%
PC 7,3.44%,83.54%
PC 8,2.77%,86.31%
PC 9,2.28%,88.59%
PC 10,1.77%,90.36%


##### 3.

Looking at the cumulative explained variance, the first 5 principal components are enough to explain at least 75% of the variation in the single-name equities (75.50% cumulatively, versus 69.78% for the first 4).
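
A minimal sketch of how this count can be read off programmatically. It uses the explained-variance ratios reported in the table above, typed in by hand and rounded to four decimals; the variable names are illustrative and not part of the original solution.

```python
import numpy as np

# Explained-variance ratios of the first ten PCs, copied from the table above.
ratios = np.array([0.4527, 0.1117, 0.0697, 0.0637, 0.0572,
                   0.0460, 0.0344, 0.0277, 0.0228, 0.0177])
cumulative = np.cumsum(ratios)

threshold = 0.75
# Position of the first cumulative value that reaches the threshold, plus one
# to convert a 0-based position into a count of components.
n_needed = int(np.searchsorted(cumulative, threshold) + 1)
print(n_needed)  # 5
```

With the fitted model, the same expression should apply to `np.cumsum(pca.explained_variance_ratio_)`.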

##### 4.

In [48]:
equities_corr = equities.copy()
equities_corr.insert(0, 'PC 1', pca_factors['PC 1'])

corr_table = pd.DataFrame(equities_corr.corr().iloc[0, 1:]).sort_values('PC 1').style.format('{:,.2%}')
display(corr_table)

Ticker,PC 1
C,-91.02%
MS,-86.24%
GS,-84.24%
JPM,-84.22%
BAC,-82.66%
HON,-80.71%
EOG,-79.82%
CVX,-78.87%
XOM,-78.20%
BA,-66.06%
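
The sign of a principal component is arbitrary, so the negative values above reflect the sign convention of the decomposition rather than an inverse relationship. It is therefore safer to rank by absolute correlation; among the rows displayed, C is the largest in magnitude at -91.02%. The sketch below illustrates the point on synthetic data (the panel, loadings, and helper `corr_with` are assumptions made for illustration, not the course data).

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for a return panel: 120 periods, 5 hypothetical assets
# driven by one common factor plus idiosyncratic noise.
common = rng.normal(size=(120, 1))
loadings = np.array([[1.0, 0.8, 0.6, 0.4, 0.2]])
returns = common @ loadings + 0.5 * rng.normal(size=(120, 5))

# First principal component from the SVD of the centered data.
centered = returns - returns.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
pc1 = centered @ vt[0]

def corr_with(factor, panel):
    """Correlation of one factor series with every column of the panel."""
    return np.array([np.corrcoef(factor, panel[:, j])[0, 1]
                     for j in range(panel.shape[1])])

corr = corr_with(pc1, returns)
corr_flipped = corr_with(-pc1, returns)  # same factor, opposite sign convention

# Ranking by absolute value gives the same answer under either convention.
strongest = int(np.argmax(np.abs(corr)))
print("correlations:", np.round(corr, 3))
print("index of strongest correlation:", strongest)
```

On the notebook's own objects, the analogous step would be to take the absolute value of the correlation column before sorting.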


##### 5. 

In [52]:
# Copy the slice so that adding a column does not write into a view of pca_factors
factors = pca_factors[['PC 1', 'PC 2', 'PC 3']].copy()
factors['SPY'] = spy['SPY']
print('Correlation with SPY:')
display(factors.corr().iloc[3, :3])

Correlation with SPY:


PC 1   -0.910369
PC 2   -0.272989
PC 3   -0.135476
Name: SPY, dtype: float64

## 2 PCR and PLS

1. Principal Component Regression (PCR) refers to using PCA for dimension reduction, and then utilizing the principal components in a regression. Try this by regressing SPY on the first 3 PCs calculated in the previous section. Report the r-squared.


2. Calculate the Partial Least Squares estimation of SPY on the 22 single-name equities. Model it for 3 factors. Report the r-squared.


3. Compare the results between these two approaches and against penalized regression seen in the past homework.

##### 1.

In [54]:
X_PCR = factors[['PC 1', 'PC 2', 'PC 3']]

model_PCR = LinearRegression().fit(X_PCR, spy)
print('PCR R-squared: ' + str(round(model_PCR.score(X_PCR, spy),3)))

PCR R-squared: 0.922
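
As a consistency check: the principal components are mutually uncorrelated in-sample, so the $R^2$ of a regression with intercept on them should equal the sum of their squared correlations with SPY. A small sketch using the three correlations printed in part 5 of the previous section, copied by hand:

```python
# Correlations of SPY with PC 1, PC 2, PC 3, as reported in Section 1, part 5.
corr_spy = [-0.910369, -0.272989, -0.135476]

# With mutually uncorrelated regressors, R-squared is the sum of squared correlations.
r_squared = sum(c ** 2 for c in corr_spy)
print(round(r_squared, 3))  # 0.922
```

This matches the PCR $R^2$ above, and shows that PC 1 alone accounts for roughly 0.83 of it.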


##### 2.

In [55]:
X_PLS = equities
y_PLS = data['SPY']

model_PLS = PLSRegression(n_components=3).fit(X_PLS, y_PLS)

print('PLS R-squared: ' + str(round(model_PLS.score(X_PLS, y_PLS),3)))

PLS R-squared: 0.961


##### 3. 

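PCR and PLS both reduce the 22 equities to 3 factors, and both achieve a high in-sample $R^2$ (0.922 and 0.961). They differ in how the factors are built: PCA chooses factors that explain the most variation in the equities without looking at SPY, while PLS chooses factors by how strongly they covary with SPY, which is why its in-sample $R^2$ is higher here. LASSO and Ridge instead keep the original regressors and penalize the size of the coefficients. This makes in-sample $R^2$ lower, as we saw in Homework #5, but may make OOS predictions more robust.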
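
For reference, the objectives behind these methods can be sketched as follows. These are the standard textbook formulations rather than anything taken from the course notes; $X$ denotes the centered equity returns, $y$ the centered SPY returns, $w$ a unit-length weight vector, $\beta$ the regression coefficients, and $\lambda \ge 0$ a penalty parameter.

$$
\begin{aligned}
\text{PCA (first factor):}\quad & \max_{\|w\|_2 = 1} \operatorname{Var}(Xw) \\
\text{PLS (first factor):}\quad & \max_{\|w\|_2 = 1} \operatorname{Cov}(Xw,\, y)^2 \\
\text{Ridge:}\quad & \min_{\beta} \; \|y - X\beta\|_2^2 + \lambda \|\beta\|_2^2 \\
\text{LASSO:}\quad & \min_{\beta} \; \|y - X\beta\|_2^2 + \lambda \|\beta\|_1
\end{aligned}
$$

Only the PCA objective ignores $y$. Later PCA and PLS factors solve the same problems subject to being uncorrelated with the earlier factors.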

##### Footnotes

For those interested in how PCA is implemented, here is a simple version in Python:

In [66]:
# Steps to calculate PC of the return series:

### 1. Center the return series (mean = 0)

equities_centered = equities - np.mean(equities, axis = 0)

### 2. Calculate the Covariance Matrix

cov = np.cov(equities_centered , rowvar = False)

### 3. Calculating Eigenvalues and Eigenvectors of the covariance matrix

eigen_values , eigen_vectors = np.linalg.eigh(cov)

### 4. Sort the eigenvalues and corresponding eigenvectors in descending order

sorted_index = np.argsort(eigen_values)[::-1]
sorted_eigenvalue = eigen_values[sorted_index]
sorted_eigenvectors = eigen_vectors[:,sorted_index]

### 5. Transform the Data

# Project the centered returns onto the sorted eigenvectors (rows = dates, columns = PCs)
PCA_factors = equities_centered.to_numpy() @ sorted_eigenvectors
display(pd.DataFrame(PCA_factors))

Row,0,1,2,3,4,5,6,7,8,9,...,12,13,14,15,16,17,18,19,20,21
0,-0.100321,0.191194,0.127657,0.224495,0.107962,0.055109,0.017283,0.088956,0.027479,0.166297,...,0.056321,-0.005329,-0.024710,-0.002121,0.008429,0.025162,-0.037482,0.047507,-0.013577,0.013919
1,0.093152,0.140681,0.141271,0.010891,0.093285,0.110789,0.028318,0.019877,-0.083100,0.010927,...,0.010125,-0.039513,0.002229,-0.011445,-0.029783,0.020677,0.026223,-0.017920,-0.003575,-0.017197
2,0.159468,-0.029826,-0.098390,-0.168551,0.017551,-0.030655,-0.007980,0.057431,0.009034,0.018465,...,0.004353,0.030354,-0.022978,0.022748,-0.030354,-0.023352,0.023956,-0.000859,-0.015238,0.007404
3,0.028983,0.164929,-0.012199,-0.080515,-0.014877,-0.097195,0.018747,-0.004250,0.016261,0.047642,...,0.037821,-0.031651,-0.073847,-0.062830,0.048856,0.019210,0.052455,0.009320,0.000979,0.002572
4,-0.014028,-0.205469,-0.217231,-0.084005,-0.107154,-0.000600,0.069108,-0.034436,0.017693,0.034530,...,0.057076,0.027170,0.006416,0.064056,0.033502,-0.001101,-0.026544,-0.027866,-0.026400,0.020122
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
103,0.372838,0.335383,-0.139549,-0.026264,0.052281,0.121068,0.025709,-0.086073,0.055353,-0.030766,...,-0.019088,0.017686,-0.042579,0.073524,0.019790,-0.011698,-0.009586,-0.016131,-0.003191,0.004194
104,0.228194,-0.014366,-0.116946,0.046203,-0.081276,-0.144510,-0.044792,-0.029363,-0.072187,-0.056817,...,0.026751,0.015641,0.014089,-0.013145,0.025525,0.055686,-0.001700,-0.012403,0.035163,0.000218
105,0.093542,-0.162483,0.066972,-0.046543,0.099481,-0.021565,-0.008280,-0.136889,0.037869,0.014937,...,-0.005659,-0.025855,0.033333,0.016342,0.019541,0.024882,0.021935,-0.013440,0.042620,-0.003327
106,0.116208,0.160077,-0.040744,-0.052939,-0.001236,-0.045846,-0.016074,-0.050314,-0.004106,0.027177,...,-0.012503,0.014700,0.004411,-0.069712,0.028361,-0.014058,0.005816,-0.028731,0.010623,0.011653
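
Some columns above have the opposite sign to the `PCA` output in Section 1 (for example, the first row of the first component is -0.100321 here versus 0.1003 there). This is expected: eigenvectors are only defined up to sign. The sketch below, on synthetic data with illustrative variable names, checks that the eigen-decomposition route and an SVD route agree once signs are aligned.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data with correlated columns (a stand-in for the return panel).
X = rng.normal(size=(60, 4)) @ rng.normal(size=(4, 4))
n = X.shape[0]
Xc = X - X.mean(axis=0)

# Route 1: eigen-decomposition of the sample covariance matrix.
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
scores_eig = Xc @ eigvecs

# Route 2: singular value decomposition of the centered data.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
scores_svd = U * s

# Eigenvalues of the covariance matrix are squared singular values over (n - 1).
print(np.allclose(eigvals, s ** 2 / (n - 1)))

# Per-column sign that aligns the SVD scores with the eigen-decomposition scores.
signs = np.sign(np.sum(scores_eig * scores_svd, axis=0))
print(np.allclose(scores_eig, scores_svd * signs))
```

Both checks should print `True`, which is why comparisons between PCA implementations are usually made on absolute values or after aligning signs.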
