# **Homework 2 (HW2)**

## **Deadline:** Wednesday, the 19th, 11:59pm

### **General recommendations:** There is still no credible way to check for AI-generated solutions, so you are free to use LLM-based chatbots. Nevertheless, for many questions you are expected to make decisions on some steps of the analysis the way humans do. Also, I have already let ChatGPT try to solve this homework: what I got was not wrong, but it was rarely the solution I am expecting from you. I ask you not to use any out-of-the-box rolling-window functions for a specific reason: you often don't know what is under the hood. That is very dangerous; you should always be in full control.

The goals of this homework are to let you play :) with:
1. Preprocessing of large financial datasets
2. Training and testing out-of-the-box models
3. Comparing the results of the out-of-the-box models with your own functions
4. Carrying out a meaningful analysis of your results

In [2]:
!pip install wrds

Collecting wrds
  Downloading wrds-3.3.0-py3-none-any.whl.metadata (5.7 kB)
Collecting psycopg2-binary<2.10,>=2.9 (from wrds)
  Downloading psycopg2_binary-2.9.10-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.9 kB)
Downloading wrds-3.3.0-py3-none-any.whl (13 kB)
Downloading psycopg2_binary-2.9.10-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.0 MB)
Installing collected packages: psycopg2-binary, wrds
Successfully installed psycopg2-binary-2.9.10 wrds-3.3.0


## **1. Data Wrangling**

### **0. Preliminaries**

- Fix random seed for reproducibility and import the needed libraries.  

**Context:** We want to begin analysing the stock market! How to start? We should get our hands on a financial dataset with information on stock characteristics. Please go [here](https://jkpfactors.com/stock-char) and download the stock characteristics for the US for the last forty years. Remember: stock characteristics, not factors...  

- Visualize the first rows to get an idea of what kind of variables there are.

- Column names are quite mysterious. Merge them with the information at ([this website](https://github.com/bkelly-lab/ReplicationCrisis/tree/master)).  


In [3]:
import wrds
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import torch
import random


def set_seed(seed_value=42):
    """Set seed for reproducibility."""
    np.random.seed(seed_value)  # Set NumPy seed
    torch.manual_seed(seed_value)  # Set PyTorch seed
    random.seed(seed_value)  # Set Python random seed


set_seed(42)  # Apply the seed; defining the function alone fixes nothing


wrds_db = wrds.Connection(wrds_username="santiag0")


Enter your WRDS username [santiag0]:santiag0
Enter your password:··········
WRDS recommends setting up a .pgpass file.
Create .pgpass file now [y/n]?: y
Created .pgpass file successfully.
You can create this file yourself at any time with the create_pgpass_file() function.
Loading library list...
Done


In [None]:
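# Load the characteristic descriptions published with the replication code
chars = pd.read_excel('https://github.com/bkelly-lab/ReplicationCrisis/raw/master/GlobalFactors/Factor%20Details.xlsx')
# Keep only the characteristics that have a JKP abbreviation (the column names in the database)
chars_rel = chars[chars['abr_jkp'].notna()]['abr_jkp'].tolist()

print(chars.shape)

# Identifiers plus every characteristic, restricted to the main US common-stock observations
sql_query = f"""
    SELECT id, eom, excntry, gvkey, permno, size_grp, me, {', '.join(map(str, chars_rel))}
    FROM contrib.global_factor
    WHERE common=1 and exch_main=1 and primary_sec=1 and obs_main=1 and
          excntry='USA'
"""
data = wrds_db.raw_sql(sql_query)

print(data.shape)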

### **1. Data Cleaning & Transformation**  

- Are there any missing values? If so, which five columns have the most missing data?

- Are all variables numerical? If not, which non-numeric variables need transformation?  

- Are there duplicate rows or duplicate stock entries for the same time period?  

- Convert all categorical variables, if any, into numerical representations.  


### **2. Descriptive Statistics & Distributions**  

- What are the five variables with the highest variance?

- What are the five most skewed financial variables?

- Which five financial variables have the most extreme outliers (justify your reasoning)?  

- Compute the kurtosis for all financial variables and list the five closest to a normal distribution.  

### **3. Correlation & Multicollinearity**

- Compute pairwise correlations and identify the five most highly correlated variable pairs.

- Calculate variance inflation factors (VIFs) and list the five variables most affected by multicollinearity.
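As a reminder of what a VIF measures, here is a minimal sketch that regresses each column on all the others and reports $1 / (1 - R_j^2)$. The helper name `variance_inflation_factors` and the toy data are illustrative assumptions, not part of the assignment; on the real panel you would first have to decide how to handle missing values.

```python
import numpy as np


def variance_inflation_factors(X):
    """Return the VIF of each column of X (n_samples x n_features)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    vifs = np.empty(p)
    for j in range(p):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        # Auxiliary regression of column j on an intercept and all other columns
        A = np.column_stack([np.ones(n), others])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        ss_res = resid @ resid
        ss_tot = ((y - y.mean()) ** 2).sum()
        r2 = 1.0 - ss_res / ss_tot
        vifs[j] = np.inf if r2 >= 1.0 else 1.0 / (1.0 - r2)
    return vifs


# Toy data (assumption): column c is almost a copy of column a
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + 0.1 * rng.normal(size=500)
vif = variance_inflation_factors(np.column_stack([a, b, c]))
print(vif)
```

In this toy example the two nearly collinear columns get large VIFs, while the independent one stays close to 1.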

### **4. Time-Series Analysis**  

- Which five financial variables exhibit the highest autocorrelation in the pooled dataset?  

- Identify the five stocks with the most stable financial variables over the past 20 years. (Justify your assumptions)  

- Detect any structural breaks: Do certain financial variables exhibit significant shifts over time? (Justify your reasoning)  

- Compute rolling mean and rolling standard deviation for financial variables to check for trends. (Don't use the predefined rolling etc. functions. Be in total control of the rolling yourself!)  
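To make "being in control of the rolling" concrete, here is a minimal sketch with explicit loops over a single series. The function name `rolling_mean_std`, the trailing-window convention, and the use of the sample standard deviation (dividing by `window - 1`) are assumptions you are free to change; on the panel you would apply it stock by stock, with observations sorted by date.

```python
import math


def rolling_mean_std(values, window):
    """Trailing-window mean and sample std, computed with explicit loops.

    Position t uses values[t - window + 1 : t + 1]. Positions without a
    full window are reported as None instead of a partial estimate.
    """
    if window < 2:
        raise ValueError("window must be at least 2 to compute a sample std")
    means, stds = [], []
    for t in range(len(values)):
        if t + 1 < window:
            means.append(None)
            stds.append(None)
            continue
        chunk = values[t - window + 1 : t + 1]
        m = sum(chunk) / window
        var = sum((x - m) ** 2 for x in chunk) / (window - 1)
        means.append(m)
        stds.append(math.sqrt(var))
    return means, stds


means, stds = rolling_mean_std([1.0, 2.0, 3.0, 4.0, 5.0], window=3)
print(means)  # [None, None, 2.0, 3.0, 4.0]
print(stds)   # [None, None, 1.0, 1.0, 1.0]
```

Every choice that a library would make silently (alignment, minimum number of observations, degrees of freedom) is visible here and is yours to justify.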


### **5. Stock Performance**  

- Compute cumulative returns for each stock and identify the top and bottom performers over the past 10 and 20 years.

- Identify the five stocks with the most extreme positive and negative monthly returns in the last 10 years.  

- Compute downside beta for each stock and list the five stocks most vulnerable to market downturns. (You don't need to run any regression)  

- Identify the five stocks that recover the fastest after a major drawdown.  
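For the downside beta, one regression-free option is the ratio of a conditional covariance to a conditional variance, computed only over the periods in which the market return falls below a threshold. The sketch below assumes a threshold of 0 (the mean market return is another common choice) and uses the hypothetical helper name `downside_beta`; the toy returns are made up.

```python
def downside_beta(stock, market, threshold=0.0):
    """Cov(stock, market) / Var(market), using only periods with market < threshold."""
    pairs = [(s, m) for s, m in zip(stock, market) if m < threshold]
    if len(pairs) < 2:
        return float("nan")  # not enough down-market periods
    s_mean = sum(s for s, _ in pairs) / len(pairs)
    m_mean = sum(m for _, m in pairs) / len(pairs)
    cov = sum((s - s_mean) * (m - m_mean) for s, m in pairs)
    var = sum((m - m_mean) ** 2 for _, m in pairs)
    return cov / var if var > 0 else float("nan")


# Toy returns (assumption): the stock moves twice as much as the market
market = [-0.02, 0.01, -0.04, 0.03, -0.01]
stock = [2.0 * m for m in market]
beta_down = downside_beta(stock, market)
print(beta_down)
```

The normalising constants of the covariance and the variance cancel in the ratio, so they are omitted.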

### **6. Ranking & Extremes**  

- Identify the five companies with the highest/lowest earnings variability.

- Identify the five most extreme stock return events.  

- Find the five companies with the most extreme quarterly earnings surprises.  

- Identify the five financial variables with the highest standard deviation.  

- Rank companies based on financial persistence: Which five companies have the most consistent financial performance over time?  

- Find the five companies whose financial variables are closest to the median in terms of volatility.

### **7. Grouping & Aggregation**  

- Which industry has the highest average return variance?  

- Which five industries show the highest cross-sectional dispersion in financial variables?  

- Compute monthly averages for financial variables and detect seasonal patterns.  

## **2. Linear regressions**

### **0. Sanity checks**

- Now that we've played with stock characteristics, it is time to move on to other tasks! Import the dataset (take care: not everything that looks simple is necessarily easy :) )!

- Column names are again mysterious: be a modern Sherlock Holmes and refer to your previous knowledge to solve the case.

- Are there any duplicate entries for stocks in the same time period?  

- Do all numerical columns contain valid values (e.g., no strings or special characters)?  

- Are there any constant columns that provide no useful information?  

- Are date formats consistent across the dataset?  

- Are there extreme values that could affect regression results (e.g., highly skewed distributions)?  


### **1. The first (?) regression**

- Landmark finance research showed that Book-to-Market is useful for predicting returns. Let's dig in a bit: show a scatter plot of Book-to-Market and returns.

- Quite noisy, no? Let's try to run a linear regression! Is the coefficient significant?

- How do we know that `sklearn` is not fooling us? Let's check these results with matrix calculus and the direct OLS formulas.

- Can we make money with this strategy? (If you say yes, why are you taking this class? Run, run, go make that money!) Why not?

- Try to run a simple market timing strategy with your model on a portfolio of five stocks. Do you manage to beat the market?

### **2. Testing the results**

- Make the return predictions with `sklearn`.

- Compare them with the actual formula (it might involve a bit of matrix calculus..)

- Test now the accuracy of your model in an appropriate way.

- Visualize in an efficient way the two distributions of the target variable.

- How do the residuals behave? What can we infer?
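The "direct OLS formulas" mentioned above are $\hat\beta = (X^\top X)^{-1} X^\top y$ for the coefficients and $\widehat{\mathrm{Var}}(\hat\beta) = \hat\sigma^2 (X^\top X)^{-1}$ for their covariance under homoskedastic errors. Here is a minimal sketch on simulated data; the variables `x` and `y` are stand-ins (an assumption) for Book-to-Market and returns, and the result is cross-checked against NumPy's least-squares solver rather than against `sklearn`.

```python
import numpy as np

# Simulated stand-ins (assumption): x plays Book-to-Market, y plays returns
rng = np.random.default_rng(42)
n = 1000
x = rng.normal(size=n)
y = 0.5 + 0.2 * x + rng.normal(size=n)

# Design matrix with an explicit intercept column
X = np.column_stack([np.ones(n), x])

# Normal equations: solve (X'X) beta = X'y instead of inverting explicitly
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Classical (homoskedastic) standard errors and t-statistics
resid = y - X @ beta_hat
dof = n - X.shape[1]
sigma2 = resid @ resid / dof
se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
t_stats = beta_hat / se

# Cross-check with an independent solver
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat, t_stats, np.allclose(beta_hat, beta_lstsq))
```

With real returns the homoskedasticity assumption is questionable, so these standard errors are a starting point, not a final answer.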

## **3. Ridge regressions**

### **1. Basics**

Once we understand that linear regressions are just ridge regressions with a null ridge penalty, we can safely leave this "null penalty" space and put the penalty to use :)

- Now that we can better control overfitting, let's use the entire dataset to forecast returns with a ridge regression. What's the accuracy? Let's use MSE for now.

- Do we suspect that `sklearn` is fooling us? Let's check these results with matrix calculus and the direct ridge formulas.

- What's the distribution of coefficients with a given regularization? How does it change when you vary it?

- How does the accuracy change when you vary the shrinkage?

- We have been using MSE so far. It is well known that it has some issues. Which ones?

- Are there any changes when you standardize the predictors?

- Let's move to R2 as a metric to test our work. Repeat the analysis above, when appropriate, with this new metric.

- Is the matrix inversion trick really faster than the usual formula? Measure it!
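The direct ridge formula is $\hat\beta_\lambda = (X^\top X + \lambda I_p)^{-1} X^\top y$. Assuming the "matrix inversion trick" refers to the identity $(X^\top X + \lambda I_p)^{-1} X^\top = X^\top (X X^\top + \lambda I_n)^{-1}$, the sketch below computes both forms and times them. The function names, the toy dimensions, and the absence of an intercept (the data are treated as already centred) are assumptions.

```python
import time

import numpy as np


def ridge_primal(X, y, lam):
    """Solve the p x p system (X'X + lam I) beta = X'y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)


def ridge_dual(X, y, lam):
    """Solve the n x n system (XX' + lam I) alpha = y, then beta = X' alpha."""
    n = X.shape[0]
    return X.T @ np.linalg.solve(X @ X.T + lam * np.eye(n), y)


# Toy data (assumption): fewer observations than predictors
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 200))
y = rng.normal(size=50)
lam = 1.0

t0 = time.perf_counter()
b_primal = ridge_primal(X, y, lam)
t1 = time.perf_counter()
b_dual = ridge_dual(X, y, lam)
t2 = time.perf_counter()

print("max abs difference:", np.max(np.abs(b_primal - b_dual)))
print(f"primal: {t1 - t0:.6f}s, dual: {t2 - t1:.6f}s")
```

Which form is faster depends on whether the number of observations or the number of predictors is larger, so repeat the timing on your own data shapes before drawing conclusions.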

### **2. Portfolio construction**

(you can consider the last 25 years of data only)

- Build an equally weighted strategy using your predictions as portfolio weights.

- How does a change in the shrinkage impact the portfolio returns?

- Try to vary the number of predictors you use in the ridge model. What happens to the test metrics?

- Does ridge regression perform differently for low-volatility vs. high-volatility stocks?

### **3. Cross-validation**

- Try to tune the shrinkage in different (non-overlapping) rolling windows. Does it vary? (Don't use the predefined rolling etc. functions. Be in total control of the rolling yourself!)

- Does it vary when the rolling windows are overlapping?

- Implement a nested cross-validation pipeline to carefully tune the hyperparameters and estimate the generalization performance.
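One way to stay in control of the windows is to generate the train/test index sets yourself. The sketch below is a hypothetical helper (`rolling_splits` and its arguments are assumptions) that works on positions in the sorted list of unique months, not on rows of the panel. A step equal to `train_size + test_size` gives windows that share no months, while a smaller step gives overlapping ones.

```python
def rolling_splits(n_periods, train_size, test_size, step):
    """Return a list of (train_idx, test_idx) pairs of time positions.

    Each training block is immediately followed by its test block, so the
    model never sees the future. The window start advances by `step`.
    """
    splits = []
    start = 0
    while start + train_size + test_size <= n_periods:
        train = list(range(start, start + train_size))
        test = list(range(start + train_size, start + train_size + test_size))
        splits.append((train, test))
        start += step
    return splits


# 12 months of data, 4 months of training followed by 2 months of testing
non_overlap = rolling_splits(12, train_size=4, test_size=2, step=6)
overlap = rolling_splits(12, train_size=4, test_size=2, step=2)
print(len(non_overlap), len(overlap))
print(non_overlap[1])
```

For a nested scheme, the same helper could be applied again inside each training block to create inner splits for tuning the penalty, leaving the outer test block untouched until the final evaluation.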

## **4. Lasso regressions**

### **1. Basics**

Unlike ridge regressions, Lasso does not only shrink coefficients but effectively selects variables by setting coefficients for irrelevant variables to zero.
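A small worked case shows where the exact zeros come from. Assume an orthonormal design ($X^\top X = I$), the lasso objective $\tfrac12\lVert y - X\beta\rVert^2 + \lambda\lVert\beta\rVert_1$ and the ridge objective $\tfrac12\lVert y - X\beta\rVert^2 + \tfrac{\lambda}{2}\lVert\beta\rVert^2$. Then the lasso solution is the soft-thresholded OLS coefficient, while ridge only rescales it. The coefficient values below are made up for illustration, and penalty scaling conventions differ across libraries.

```python
def soft_threshold(z, lam):
    """Shrink z towards zero by lam, and set it to exactly zero if |z| <= lam."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


ols = [3.0, -0.4, 0.1, -2.0]  # made-up OLS coefficients under an orthonormal design
lam = 0.5

lasso = [soft_threshold(b, lam) for b in ols]
ridge = [b / (1.0 + lam) for b in ols]

print(lasso)  # [2.5, 0.0, 0.0, -1.5]
print(ridge)  # every coefficient shrinks, none becomes exactly zero
```

With correlated predictors there is no such closed form, but the same thresholding step sits at the core of coordinate-descent solvers.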

- What's the distribution of coefficients with a given regularization? How does it change when you vary it?

- How does the accuracy change when you vary the regularization?

- Are there any changes when you standardize the predictors?

- Let's move to R2 as a metric to test our work. Repeat the analysis above, when appropriate, with this new metric.

- Find the predictors that the model has set to zero. Do they vary much when you vary regularization?

- Do the predictors that the model decides to set to zero vary by industry?

### **2. Portfolio construction**

(you can consider the last 25 years of data only)

- Build an equally weighted strategy using your predictions as portfolio weights.

- How does a change in the regularization impact the portfolio returns?

- Try to vary the number of predictors you use in the lasso model. What happens to the test metrics?

- Does lasso regression perform differently for low-volatility vs. high-volatility stocks?

### **3. Cross-validation**

- Try to tune the regularization in different (non-overlapping) rolling windows. Does it vary? (Don't use the predefined rolling etc. functions. Be in total control of the rolling yourself!)

- Does it vary when the rolling windows are overlapping?

- Implement a nested cross-validation pipeline to carefully tune the hyperparameters and estimate the generalization performance.

## **5. Option Pricing**

### **1. Simulating the Black-Scholes Model**  

The **Black-Scholes model** (Black & Scholes, 1973) is a fundamental formula for pricing **European options**, assuming frictionless markets, no arbitrage, and constant volatility.

- Write a function able to simulate option prices for call options for a grid of different combinations of inputs (Do not forget to set the random seed and to add the idiosyncratic value...)

## **Black-Scholes Model Parameters**

| Parameter | Symbol | Range | Step Size |
|-----------|--------|--------|-----------|
| Stock Price | **S** | 40 to 69 | 1 |
| Strike Price | **K** | 25 to 90 | 1 |
| Risk-Free Rate | **r** | 0.00 to 0.05 | 0.01 |
| Time to Expiration | **T** | 0.2 to 2.0 | 0.1 |
| Volatility | **σ** | 0.23 to 0.75 | 0.1 |

- Split the data into training and test samples.

- Scale the input to the model. (Careful :) )
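For reference, the Black-Scholes price of a European call is $C = S\,\Phi(d_1) - K e^{-rT}\,\Phi(d_2)$ with $d_1 = \frac{\ln(S/K) + (r + \sigma^2/2)T}{\sigma\sqrt{T}}$ and $d_2 = d_1 - \sigma\sqrt{T}$. The sketch below prices a small grid and adds a noise term. Reading the "idiosyncratic value" as additive Gaussian noise is an assumption, as are the names `simulate_calls` and `noise_sd`, the noise level, and the tiny demo grid.

```python
import itertools
import math
import random


def norm_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def bs_call(S, K, r, T, sigma):
    """Black-Scholes price of a European call option."""
    d1 = (math.log(S / K) + (r + 0.5 * sigma**2) * T) / (sigma * math.sqrt(T))
    d2 = d1 - sigma * math.sqrt(T)
    return S * norm_cdf(d1) - K * math.exp(-r * T) * norm_cdf(d2)


def simulate_calls(S_grid, K_grid, r_grid, T_grid, sigma_grid, noise_sd=0.15, seed=42):
    """Price every grid combination and add Gaussian noise (an assumed noise model)."""
    rng = random.Random(seed)  # local generator, so results are reproducible
    rows = []
    for S, K, r, T, sigma in itertools.product(S_grid, K_grid, r_grid, T_grid, sigma_grid):
        noisy_price = bs_call(S, K, r, T, sigma) + rng.gauss(0.0, noise_sd)
        rows.append((S, K, r, T, sigma, noisy_price))
    return rows


# Tiny demo grid (assumption); the assignment's table defines the full one
rows = simulate_calls([40, 50], [45], [0.01], [0.5, 1.0], [0.23, 0.33])
print(len(rows), rows[0])
print(round(bs_call(100, 100, 0.05, 1.0, 0.2), 4))  # textbook check value
```

Note that additive noise can push deep out-of-the-money prices below zero, which is one of the modelling decisions you should address.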


### **2. Learning the Black-Scholes Model**  

- Start with a linear regression. What's its performance?

- Try to perform an analysis with a lasso model. Is it able to generalize? (Justify the use of the metric)

- Fit a ridge. How does it behave? (Justify the use of the metric)

- Compare the performance of the three previous models in terms of the moneyness of the option.

- Compare the coefficients for the previous three models. Are there differences between them?

- How could you make these models behave better?

- Once you discover the trick, repeat the previous analysis (i.e., fitting the three models and judging them).


### **3. Testing the Black-Scholes Model**  

- How do the residuals in the linear regression behave? What does it imply?

- How do the parameters in the ridge regression change when you vary the regularization? And the accuracy?

- How do the parameters in the lasso regression change when you vary the regularization? And the accuracy?

- Are they different from the parameters estimated by the linear regression?

- You are back to your DGP. We are living in unstable times, so you decide to adapt and simulate more unstable option prices. How do the accuracies of ridge and lasso react?
