In [9]:
!pip install yfinance pandas_datareader



**Pick two stocks (X1, X2) that are in the S&P 500, and use the S&P 500 index itself as Y**


    1) Build 4 models: Y by itself, Y vs X1, Y vs X2, and Y vs X1 and X2.
    2) Which model has the lowest sum of squared residuals?
    3) If you include statistical significance, which model is best?


**1) Build 4 models: Y by itself, Y vs X1, Y vs X2, and Y vs X1 and X2**

In [10]:
import yfinance as yf
from pandas_datareader import data as pdr
import pandas as pd
from sklearn.linear_model import LinearRegression
import numpy as np
import statsmodels.api as sm
# Override the default pandas_datareader method to use yfinance
yf.pdr_override()

# Define the ticker symbol for the S&P 500 Index
sAndp_ticker = "^GSPC"
Nvidia_ticker = "NVDA"
Microsoft_ticker = "MSFT"

# Define the period for which you want the data. Here, we want data from the past 6 months.
start_date = "2023-09-01"  # Adjust this date based on today's date
end_date = "2024-03-01"    # Adjust this date based on today's date

# Fetch the historical data for the S&P 500
sp500_data = pdr.get_data_yahoo(sAndp_ticker, start=start_date, end=end_date)
Nvidia_data = pdr.get_data_yahoo(Nvidia_ticker, start=start_date, end=end_date)
Microsoft_data = pdr.get_data_yahoo(Microsoft_ticker, start=start_date, end=end_date)
sp500_close_prices= sp500_data['Close']
Nvidia_close_prices = Nvidia_data['Close']
Microsoft_close_prices = Microsoft_data['Close']
df_merged = pd.merge(Microsoft_close_prices, Nvidia_close_prices, on='Date')
df_merged = pd.merge(df_merged, sp500_close_prices, on='Date')
df = df_merged.rename(columns={'Close_x': 'Microsoft', 'Close_y': 'Nvidia', 'Close': 'sp500'})
df.head()



Date,Microsoft,Nvidia,sp500
2023-09-01,328.660004,485.089996,4515.77002
2023-09-05,333.549988,485.480011,4496.830078
2023-09-06,332.880005,470.609985,4465.47998
2023-09-07,329.910004,462.410004,4451.140137
2023-09-08,334.269989,455.720001,4457.490234


In [11]:
len(df)

124

**Model 1**

In [12]:
Y = df['sp500'].values
predictions = df['sp500'].mean()
# Model 1 has no predictors: the prediction for every observation is the mean of Y
residuals = Y - predictions # broadcasting

# Squaring the residuals and summing them up to get the Sum of Squared Residuals
SSR1 = np.sum(residuals ** 2)
print("Sum of Squared Residuals (SSR) for model 1:", SSR1)

Sum of Squared Residuals (SSR) for model 1: 8722324.755506506
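
Model 1's SSR is the total sum of squares that the other three models are measured against. As a minimal sketch on made-up numbers (the array `y_demo` and the helper `ssr_constant` are hypothetical, not part of the notebook), the mean is the constant prediction with the smallest SSR, and the SSR around the mean equals (n - 1) times the sample variance:

```python
import numpy as np

# Hypothetical toy data, not the S&P 500 series
y_demo = np.array([1.0, 2.0, 3.0, 4.0, 10.0])

def ssr_constant(y, c):
    """Sum of squared residuals when every observation is predicted as c."""
    return np.sum((y - c) ** 2)

ssr_mean = ssr_constant(y_demo, y_demo.mean())
ssr_median = ssr_constant(y_demo, np.median(y_demo))

print("SSR around the mean:", ssr_mean)
print("SSR around the median:", ssr_median)
print("(n - 1) * sample variance:", (len(y_demo) - 1) * y_demo.var(ddof=1))
```

Any other constant, such as the median here, gives a larger SSR than the mean.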


**Model 2**

In [13]:
X = df['Microsoft'].values.reshape(-1, 1)  # Reshaping is required as scikit-learn expects a 2D array for the independent variables
Y = df['sp500'].values

model = LinearRegression()

# Train the model
model.fit(X, Y)
predictions = model.predict(X)

print("Coefficient:", model.coef_)
print("Intercept:", model.intercept_)

residuals = Y - predictions

SSR2 = np.sum(residuals ** 2)

print("Sum of Squared Residuals (SSR2):", SSR2)

Coefficient: [7.7709895]
Intercept: 1762.5587030278057
Sum of Squared Residuals (SSR2): 1437594.5608485627


**Model 3**

In [14]:
X = df['Nvidia'].values.reshape(-1, 1)
Y = df['sp500'].values

model = LinearRegression()

# Train the model
model.fit(X, Y)
predictions = model.predict(X)
print("Coefficient:", model.coef_)
print("Intercept:", model.intercept_)

residuals = Y - predictions

# Squaring the residuals and summing them up to get the Sum of Squared Residuals
SSR3 = np.sum(residuals ** 2)

print("Sum of Squared Residuals (SSR3):", SSR3)

Coefficient: [2.26886685]
Intercept: 3423.5808388150667
Sum of Squared Residuals (SSR3): 1753173.723704831


**Model 4**

In [15]:
X = df[['Microsoft', 'Nvidia']].values  # Selecting two columns already yields the 2D array scikit-learn expects, so no reshape is needed
Y = df['sp500'].values

model = LinearRegression()

# Train the model
model.fit(X, Y)
predictions = model.predict(X)

print("Coefficient:", model.coef_)
print("Intercept:", model.intercept_)

residuals = Y - predictions

SSR4 = np.sum(residuals ** 2)

print("Sum of Squared Residuals (SSR4):", SSR4)

Coefficient: [4.71223841 1.08317211]
Intercept: 2317.6397162223043
Sum of Squared Residuals (SSR4): 977829.3796783305


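**2. Which model has the lowest sum of squared residuals?**

-> Model 4, which uses both Microsoft and Nvidia as independent variables to predict the S&P 500 index value with linear regression, has the lowest sum of squared residuals (about 977,829, versus 1,437,595 for Model 2, 1,753,174 for Model 3, and 8,722,325 for Model 1). This is expected: adding a predictor to a least-squares model can never increase the in-sample SSR, so a lower SSR alone does not show that the extra predictor is useful.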

**3. If you include statistical significance, which model is best?**

Model 4 (both predictors) has a far lower SSR than Model 1 (mean only). Of the two one-predictor models, Model 2 (Microsoft only) is better than Model 3 (Nvidia only) because its sum of squared residuals is lower. The remaining question is whether Model 4 is a statistically significant improvement over Model 2.

Because Model 2 is nested in Model 4 (it is Model 4 with the Nvidia coefficient fixed at zero), we can compare the two with an F statistic:

$$F = \frac{(RSS_{reduced} - RSS_{full})/p}{RSS_{full}/(n-k-1)}$$

where p = difference in the number of predictors between the two models,
n = total number of observations,
k = number of predictors in the full model



In [16]:
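# In our case Model 2 (Microsoft only) is the reduced model and Model 4 is the full model
RSSred = SSR2    # 1437594.5608485627
RSSfull = SSR4   # 977829.3796783305
p = 1  # Model 4 has one more predictor than Model 2
k = 2  # number of predictors in the full model
n = len(df)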

Fobs = ((RSSred - RSSfull)/p)/(RSSfull/(n-k-1))

In [17]:
print(Fobs)

56.89293866369491


In [18]:
from scipy.stats import f

# Degrees of freedom
df1 = p
df2 = n - k - 1

# Significance level
alpha = 0.05

# Calculate critical F-value
f_critical = f.ppf(1 - alpha, df1, df2)

print(f"Critical F-value: {f_critical}")

Critical F-value: 3.919464555329504


The observed F statistic (56.89) is far greater than the critical value (3.92) at the 5% significance level, so the two-predictor Model 4 is a statistically significant improvement over the one-predictor Model 2. The same test against Model 3 gives an even larger F, because SSR3 is larger than SSR2. Hence Model 4, with both Microsoft and Nvidia as predictors, is the best of the four models.
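
Comparing against a critical value gives a yes/no answer at one significance level; the p-value reports how extreme the observed statistic is. A minimal sketch, assuming SciPy is available and reusing only the numbers printed above (the variable names are illustrative):

```python
from scipy.stats import f

# SSR values and sample size copied from the outputs above
rss_reduced = 1437594.5608485627  # Model 2 (Microsoft only)
rss_full = 977829.3796783305      # Model 4 (Microsoft + Nvidia)
n_obs = 124                       # len(df)
n_extra = 1                       # extra predictors in the full model
k_full = 2                        # predictors in the full model

f_obs = ((rss_reduced - rss_full) / n_extra) / (rss_full / (n_obs - k_full - 1))

# Survival function: P(F >= f_obs) if the extra coefficient were really zero
p_value = f.sf(f_obs, n_extra, n_obs - k_full - 1)

print(f"F = {f_obs:.2f}, p-value = {p_value:.3g}")
```

A p-value below the chosen alpha (0.05 here) leads to the same conclusion as the critical-value comparison.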

**1. Build a covariance matrix with X1 and X2,  do a PCA decomposition.**



In [19]:
from sklearn.decomposition import PCA
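# Keep only the two predictors; Y (sp500) was already extracted in an earlier cell.
# Note that this overwrites df, so the sp500 column is no longer available from it.
df = df[['Microsoft', 'Nvidia']]
# Calculate the covariance matrix of the two predictors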
cov_matrix = df.cov()
print("Covariance Matrix:\n", cov_matrix)

print('\n')
# Perform PCA decomposition
pca = PCA(n_components=2)  # n_components specifies the number of components to keep
pca.fit(df)

# The components_ attribute represents the principal axes in feature space
components = pca.components_
print("PCA Components:\n", components)

Covariance Matrix:
              Microsoft        Nvidia
Microsoft   980.744125   2769.506478
Nvidia     2769.506478  11006.690295


PCA Components:
 [[ 0.24969762  0.96832386]
 [-0.96832386  0.24969762]]


**2. Verify your two eigenvectors produce two new predictors that are independent**

In [20]:
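# The principal axes (eigenvectors of the covariance matrix) are orthogonal if their dot product is zero.
# Projections of the data onto these eigenvectors are uncorrelated with each other.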
np.dot(components[0], components[1])

#verified

0.0
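
A zero dot product shows that the two axes are orthogonal. To check the new predictors themselves, one option is to look at their covariance: projecting onto the eigenvectors V turns the covariance matrix C into V^T C V, which should be diagonal. The sketch below uses only the covariance matrix printed above and NumPy's `eigh`; the variable names are illustrative:

```python
import numpy as np

# Covariance matrix copied from the output above (order: Microsoft, Nvidia)
cov = np.array([[980.744125, 2769.506478],
                [2769.506478, 11006.690295]])

# eigh returns eigenvalues in ascending order; columns of eigvecs are eigenvectors
eigvals, eigvecs = np.linalg.eigh(cov)

# Covariance of the projected predictors: V^T C V
cov_projected = eigvecs.T @ cov @ eigvecs

print("Eigenvalues:", eigvals)
print("Covariance of projected predictors:\n", np.round(cov_projected, 6))
```

The off-diagonal entries are zero up to rounding, so the two projected predictors are uncorrelated. The eigenvector for the largest eigenvalue should match the first PCA component up to sign.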

**3. Verify that your two factor regression can be built by combining two 1-factor regressions with your two eigenvectors**

In [21]:
# model with both pca components
X = pca.fit_transform(df)

model = LinearRegression()

# Train the model
model.fit(X, Y)
predictions = model.predict(X)

print("Coefficient:", model.coef_)
print("Intercept:", model.intercept_)

Coefficient: [ 2.22549611 -4.29250739]
Intercept: 4601.650244928175


In [22]:
# model with only 1st pca component
model = LinearRegression()

# Train the model
model.fit(X[:, 0].reshape(-1, 1), Y)

print("Coefficient:", model.coef_)
print("Intercept:", model.intercept_)


Coefficient: [2.22549611]
Intercept: 4601.650244928175


In [23]:
# model with only 2nd pca component
model = LinearRegression()

# Train the model
model.fit(X[:, 1].reshape(-1, 1), Y)

print("Coefficient:", model.coef_)
print("Intercept:", model.intercept_)

Coefficient: [-4.29250739]
Intercept: 4601.650244928175


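As we can see, the coefficient for the 1st PCA component (2.2255) is the same in the two-component model and in the model with only the 1st component. The same holds for the coefficient of the 2nd component (-4.2925), and all three models share the same intercept (4601.65), which is the mean of Y. This works because the PCA scores are centered and uncorrelated, so each coefficient can be estimated without reference to the other. Hence the two-factor regression can be built by combining the two one-factor regressions on the eigenvector projections.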
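
As a further check, the two one-factor coefficients can be mapped back to the original predictors and compared with Model 4. The sketch below reuses only numbers printed above; the variable names are illustrative:

```python
import numpy as np

# Values copied from the outputs above
components = np.array([[0.24969762, 0.96832386],
                       [-0.96832386, 0.24969762]])  # rows = principal axes
coef_pc1 = 2.22549611    # one-factor regression on the 1st component
coef_pc2 = -4.29250739   # one-factor regression on the 2nd component

# Each score is (x - mean) @ axis, so the implied weights on
# (Microsoft, Nvidia) are the coefficient-weighted sum of the axes.
coef_original = coef_pc1 * components[0] + coef_pc2 * components[1]

print("Implied coefficients on (Microsoft, Nvidia):", coef_original)
print("Model 4 coefficients reported above:        ", [4.71223841, 1.08317211])
```

The two lines agree to the printed precision, which ties the combined one-factor regressions back to the original two-predictor fit.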

In [24]:
# verified