<a href="https://colab.research.google.com/github/Jandsy/ml_finance_imperial/blob/main/Coursework/CourseWork.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **<center>Machine Learning and Finance </center>**


## <center> CourseWork 2024 - StatArb </center>



In this coursework, you will delve into and replicate selected elements of the research detailed in the paper **[End-to-End Policy Learning of a Statistical Arbitrage Autoencoder Architecture](https://arxiv.org/pdf/2402.08233.pdf)**. **However, we will not reproduce the entire study.**

## Overview

This study redefines Statistical Arbitrage (StatArb) by combining Autoencoder architectures and policy learning to generate trading strategies. Traditionally, StatArb involves finding the mean of a synthetic asset through classical or PCA-based methods before developing a mean reversion strategy. However, this paper proposes a data-driven approach using an Autoencoder trained on US stock returns, integrated into a neural network representing portfolio trading policies to output portfolio allocations directly.


## Coursework Goal

This coursework will replicate these results, providing hands-on experience in implementing and evaluating this innovative end-to-end policy learning Autoencoder within financial trading strategies.

## Outline

- [Data Preparation and Exploration](#Data-Preparation-and-Exploration)
- [Fama French Analysis](#Fama-French-Analysis)
- [PCA Analysis](#PCA-Analysis)
- [Ornstein Uhlenbeck](#Ornstein-Uhlenbeck)
- [Autoencoder Analysis](#Autoencoder-Analysis)



**Description:**
The coursework is graded on a 100-point scale and is divided into five parts. Below is the mark distribution for each question:

| **Problem**  | **Question**          | **Number of Marks** |
|--------------|-----------------------|---------------------|
| **Part A**   | Question 1            | 4                   |
|              | Question 2            | 1                   |
|              | Question 3            | 3                   |
|              | Question 4            | 3                   |
|              | Question 5            | 1                   |
|              | Question 6            | 3                   |
| **Part B**   | Question 7            | 1                   |
|              | Question 8            | 5                   |
|              | Question 9            | 4                   |
|              | Question 10           | 5                   |
|              | Question 11           | 2                   |
|              | Question 12           | 3                   |
| **Part C**   | Question 13           | 3                   |
|              | Question 14           | 1                   |
|              | Question 15           | 3                   |
|              | Question 16           | 2                   |
|              | Question 17           | 7                   |
|              | Question 18           | 6                   |
|              | Question 19           | 3                   |
| **Part D**   | Question 20           | 3                   |
|              | Question 21           | 5                   |
|              | Question 22           | 2                   |
| **Part E**   | Question 23           | 2                   |
|              | Question 24           | 1                   |
|              | Question 25           | 3                   |
|              | Question 26           | 10                  |
|              | Question 27           | 1                   |
|              | Question 28           | 3                   |
|              | Question 29           | 3                   |
|              | Question 30           | 7                   |




Please read the questions carefully and do your best. Good luck!

## Objectives



## 1. Data Preparation and Exploration
Collect, clean, and prepare US stock return data for analysis.

## 2. Fama French Analysis
Utilize Fama French Factors to isolate the idiosyncratic components of stock returns, differentiating them from market-wide effects. This analysis helps in understanding the unique characteristics of individual stocks relative to broader market trends.

## 3. PCA Analysis
Employ Principal Component Analysis (PCA) to identify hidden structures and reduce dimensionality in the data. This method helps in extracting significant patterns that might be obscured in high-dimensional datasets.

## 4. Ornstein-Uhlenbeck Process
Analyze mean-reverting behavior in stock prices using the Ornstein-Uhlenbeck process. This stochastic process is useful for modeling and forecasting based on the assumption that prices will revert to a long-term mean.

## 5. Building a Basic Autoencoder Model
Construct and train a standard Autoencoder to extract residual idiosyncratic risk.








# Data Preparation and Exploration


---
<font color=green>Q1: (4 Marks)</font>
<br><font color='green'>
Write a Python function that accepts a URL parameter and retrieves the NASDAQ-100 companies and their ticker symbols by scraping the relevant Wikipedia page using **[Requests](https://pypi.org/project/requests/)** and **[BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)**. Your function should return the data as a list of tuples, with each tuple containing the company name and its ticker symbol. Then, call your function with the appropriate Wikipedia page URL and print the data in a 'Company: Ticker' format.

</font>

---


In [3]:
import numpy as np
import matplotlib.pyplot as plt


In [4]:
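import requests
from bs4 import BeautifulSoup

def get_nasdaq100_companies(url):
    """Scrape (company, ticker) tuples from the NASDAQ-100 Wikipedia page."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # fail early on HTTP errors
    soup = BeautifulSoup(response.content, 'html.parser')
    # The constituents table is located through its HTML id.
    table = soup.find('table', {'id': 'constituents'})
    rows = table.find_all('tr')
    data = []
    for row in rows[1:]:  # skip the header row
        cols = row.find_all('td')
        if len(cols) < 2:
            continue  # ignore rows without a company and a ticker cell
        # Assumes the first two columns are company name and ticker, in that order.
        data.append((cols[0].text.strip(), cols[1].text.strip()))
    return data

url = 'https://en.wikipedia.org/wiki/NASDAQ-100'
data = get_nasdaq100_companies(url)

# Print the result in 'Company: Ticker' format.
for company, ticker in data:
    print(f'{company}: {ticker}')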

Adobe Inc.: ADBE
ADP: ADP
Airbnb: ABNB
Alphabet Inc. (Class A): GOOGL
Alphabet Inc. (Class C): GOOG
Amazon: AMZN
Advanced Micro Devices Inc.: AMD
American Electric Power: AEP
Amgen: AMGN
Analog Devices: ADI
Ansys: ANSS
Apple Inc.: AAPL
Applied Materials: AMAT
ASML Holding: ASML
AstraZeneca: AZN
Atlassian: TEAM
Autodesk: ADSK
Baker Hughes: BKR
Biogen: BIIB
Booking Holdings: BKNG
Broadcom Inc.: AVGO
Cadence Design Systems: CDNS
CDW Corporation: CDW
Charter Communications: CHTR
Cintas: CTAS
Cisco: CSCO
Coca-Cola Europacific Partners: CCEP
Cognizant: CTSH
Comcast: CMCSA
Constellation Energy: CEG
Copart: CPRT
CoStar Group: CSGP
Costco: COST
CrowdStrike: CRWD
CSX Corporation: CSX
Datadog: DDOG
DexCom: DXCM
Diamondback Energy: FANG
Dollar Tree: DLTR
DoorDash: DASH
Electronic Arts: EA
Exelon: EXC
Fastenal: FAST
Fortinet: FTNT
GE HealthCare: GEHC
Gilead Sciences: GILD
GlobalFoundries: GFS
Honeywell: HON
Idexx Laboratories: IDXX
Illumina, Inc.: ILMN
Intel: INTC
Intuit: INTU
Intuitive Surgical: I

---
<font color=green>Q2: (1 Mark)</font>
<br><font color='green'>
Given a list of tuples representing NASDAQ-100 companies (where each tuple contains a company name and its ticker symbol), write a Python script to extract all ticker symbols into a separate list called `tickers_list`.
</font>
---


In [None]:
## Insert your code here

---
<font color=green>Q3: (3 Marks)</font>
<br><font color='green'>
Using the **[yfinance](https://pypi.org/project/yfinance/)** library, write a Python script that accepts a list of stock ticker symbols. For each symbol, download the adjusted closing price data, store it in a dictionary with the ticker symbol as the key, and then convert the final dictionary into a Pandas DataFrame. Handle any errors encountered during data retrieval by printing a message indicating which symbol failed.
</font>

---

In [None]:
## Insert your code here

---
<font color=green>Q4: (3 Marks)</font>
<br><font color='green'>
Write a Python script to analyze stock data stored in a dictionary `stock_data` (where each key is a stock ticker symbol, and each value is a Pandas Series of adjusted closing prices). The script should:
1. Convert the dictionary into a DataFrame.
2. Calculate the daily returns for each stock.
3. Identify columns (ticker symbols) with at least 2000 non-NaN values in their daily returns.
4. Create a new DataFrame that only includes these filtered ticker symbols.
5. Remove any remaining rows with NaN values in this new DataFrame.
</font>

---

In [None]:
## Insert your code here

---
<font color=green>Q5: (1 Mark)</font>
<br><font color='green'>
Download the dataset named `df_filtered_nasdaq_100` from the GitHub repository of the course.
</font>

---

In [None]:
## Insert your code here

---
<font color=green>Q6: (3 Marks) </font>
<br><font color='green'>
Conduct an in-depth analysis of the `df_filtered_nasdaq_100` dataset from GitHub. Answer the following questions:
- Which stock had the best performance over the entire period?
- What is the average daily return of 'AAPL'?
- What is the worst daily return? Provide the stock name and the date it occurred.
</font>

---

In [None]:
## Insert your code here

# Fama French Analysis

The Fama-French five-factor model is an extension of the classic three-factor model used in finance to describe stock returns. It is designed to better capture the risk associated with stocks and explain differences in returns. This model includes the following factors:

1. **Market Risk (MKT)**: The excess return of the market over the risk-free rate. It captures the overall market's premium.
2. **Size (SMB, "Small Minus Big")**: The performance of small-cap stocks relative to large-cap stocks.
3. **Value (HML, "High Minus Low")**: The performance of stocks with high book-to-market values relative to those with low book-to-market values.
4. **Profitability (RMW, "Robust Minus Weak")**: The difference in returns between companies with robust (high) and weak (low) profitability.
5. **Investment (CMA, "Conservative Minus Aggressive")**: The difference in returns between companies that invest conservatively and those that invest aggressively.

## Additional Factor

6. **Momentum (MOM)**: This factor represents the tendency of stocks that have performed well in the past to continue performing well, and the reverse for stocks that have performed poorly.

### Mathematical Representation

The return of a stock $R_i^t$ at time $t$ can be modeled as follows:

$$
R_i^t - R_f^t = \alpha_i^t + \beta_{i,MKT}^t(R_M^t - R_f^t) + \beta_{i,SMB}^t \cdot SMB^t + \beta_{i,HML}^t \cdot HML^t + \beta_{i,RMW}^t \cdot RMW^t + \beta_{i,CMA}^t \cdot CMA^t + \beta_{i,MOM}^t \cdot MOM^t + \epsilon_i^t
$$

Where:
- $R_i^t$ is the return of stock $i$ at time $t$
- $R_f^t$ is the risk-free rate at time $t$
- $R_M^t$ is the market return at time $t$
- $\alpha_i^t$ is the abnormal return or alpha of stock $i$ at time $t$
- the $\beta^t$ coefficients represent the sensitivity of the stock returns to each factor at time $t$
- $\epsilon_i^t$ is the error term or idiosyncratic risk unique to stock $i$ at time $t$

This model is particularly useful for identifying which factors significantly impact stock returns and for constructing a diversified portfolio that is optimized for given risk preferences.
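
As a rough illustration of the regression above, the following sketch fits the six-factor model by ordinary least squares on synthetic data. The factor values, the "true" alpha and betas, and the noise level are all made-up assumptions, and `numpy.linalg.lstsq` is used here only to keep the example dependency-light; it is not the coursework solution.

```python
import numpy as np

rng = np.random.default_rng(0)
n_days = 252
factor_names = ["Mkt-RF", "SMB", "HML", "RMW", "CMA", "Mom"]

# Synthetic daily factor returns (hypothetical values, not real Fama-French data).
factors = rng.normal(0.0, 0.01, size=(n_days, len(factor_names)))

# Hypothetical "true" alpha and betas used to generate the excess returns.
true_alpha = 0.0002
true_betas = np.array([1.1, -0.2, -0.4, 0.3, -0.1, 0.05])
noise = rng.normal(0.0, 0.001, size=n_days)
excess_returns = true_alpha + factors @ true_betas + noise  # R_i - R_f

# Design matrix with an intercept column so that alpha is estimated too.
X = np.column_stack([np.ones(n_days), factors])
coef, *_ = np.linalg.lstsq(X, excess_returns, rcond=None)
alpha_hat, betas_hat = coef[0], coef[1:]

# The residuals are the estimated idiosyncratic component epsilon_i.
residuals = excess_returns - X @ coef
r_squared = 1.0 - residuals.var() / excess_returns.var()

for name, beta in zip(factor_names, betas_hat):
    print(f"{name}: {beta:.3f}")
print(f"alpha: {alpha_hat:.5f}, R^2: {r_squared:.3f}")
```

The residual series is the quantity of interest for statistical arbitrage: it is what remains of the stock's excess return once the factor exposures have been removed.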




---
<font color=green>Q7: (1 Mark) </font>
<br><font color='green'>
Download the `fama_french_dataset` from the course's GitHub account.
</font>

---

In [None]:
## Insert your code here

---
<font color=green>Q8: (5 Marks)</font>
<br><font color='green'>

Write a Python function called `get_sub_df_ticker(ticker, date, df_filtered, length_history)` that extracts a historical sub-dataframe for a given `ticker` from `df_filtered`. The function should use `length_history` to determine the number of trading days to include, ending at the specified `date`. Return the sub-dataframe for the specified `ticker`.
</font>

---


In [None]:
## Insert your code here

---
<font color=green>Q9: (4 Marks)</font>
<br><font color='green'>
Create a Python function named `df_ticker_with_fama_french(ticker, date, df_filtered, length_history, fama_french_data)` that uses `get_sub_df_ticker` to extract historical data for a specific `ticker`. Incorporate the Fama-French factors from `fama_french_data` into the extracted sub-dataframe. Adjust the ticker's returns by subtracting the risk-free rate ('RF') and add other relevant Fama-French factors ('Mkt-RF', 'SMB', 'HML', 'RMW', 'CMA', and 'Mom'). Return the resulting sub-dataframe.
</font>

---

In [None]:
## Insert your code here

---
<font color=green>Q10: (5 Marks) </font>
<br><font color='green'>
Write a Python function named `extract_beta_fama_french` to perform a rolling regression analysis for a given stock at a specific time point using the Fama-French model. The function should accept the following parameters:

- `ticker`: A string indicating the stock symbol.
- `date`: A string specifying the date for the analysis.
- `length_history`: An integer representing the number of days of historical data to include.
- `df_filtered`: A pandas DataFrame (assumed to be derived from question 5) containing filtered stock data.
- `fama_french_data`: A pandas DataFrame (assumed to be from question 7) that includes Fama-French factors.

Utilize the `statsmodels.api` library to conduct the regression.
</font>

---

In [None]:
## Insert your code here

---
<font color=green>Q11: (2 Marks) </font>
<br><font color='green'>
Apply the `extract_beta_fama_french` function to the stock symbol 'AAPL' for the date '2024-03-28', using a historical data length of 252 days. Ensure that the `df_filtered` and `fama_french_data` DataFrames are correctly prepared and available in your environment before executing this function. The parameters for the function call are set as follows:

- **Ticker**: 'AAPL'
- **Date**: '2024-03-28'
- **Length of History**: 252 days
</font>

---



In [None]:
## Insert your code here

---
<font color=green>Q12: (3 Marks)</font>
<br><font color='green'>
Once the `extract_beta_fama_french` function has been applied to 'AAPL' with the specified parameters, the next step is to analyze the regression summary to identify which Fama-French factor explains the most variance in 'AAPL' returns during the specified period.

Follow these steps to perform the analysis:

1. **Review the Summary**: Examine the regression output, focusing on the coefficients and their statistical significance (p-values).
2. **Identify Key Factor**: Determine which factor has the highest absolute coefficient value and is statistically significant (typically p < 0.05). This factor can be considered as having the strongest influence on 'AAPL' returns for the period.

</font>

---

**Write your answers here:**

# PCA Analysis


In the literature, another method exists for extracting residuals for each stock: it uses PCA to identify hidden factors in the data. Let's describe this method.

The return of a stock $R_i^t$ at time $t$ can be modeled as follows:

$$
R_i^t  = \sum_{j=1}^m\beta_{i,j}^t F_j^t  + \epsilon_i^t
$$

Where:
- $R_i^t$ is the return of stock $i$ at time $t$
- $m$ is the number of factors selected from PCA
- $F_j^t$ is the $j$-th hidden factor constructed from PCA at time $t$
- $\beta_{i,j}^t$ are the coefficients representing the sensitivity of the stock returns to each hidden factor
- $\epsilon_i^t$ is the residual term for stock $i$ at time $t$, representing the portion of the return not explained by the PCA factors

### Representation of Stock Return Data

Consider the return data for $N$ stocks over $T$ periods, represented by the matrix $R$ of size $T \times N$:

$$
R = \left[
\begin{array}{cccc}
R_1^T & R_2^T & \cdots & R_N^T \\
R_1^{T-1} & R_2^{T-1} & \cdots & R_N^{T-1} \\
\vdots & \vdots & \ddots & \vdots \\
R_1^1 & R_2^1 & \cdots & R_N^1 \\
\end{array}
\right]
$$

Each element $R_i^k$ of the matrix represents the return of stock $i$ at time $k$ and is defined as:

$$
R_i^k = \frac{S_{i,k} - S_{i, k-1}}{S_{i, k-1}}, \quad k=1,\cdots, T, \quad i=1,\cdots,N
$$

where $S_{i,k}$ denotes the adjusted close price of stock $i$ at time $k$.

### Standardization of Returns

To adjust for varying volatilities across stocks, we standardize the returns as follows:

$$
Z_i^t = \frac{R_i^t - \mu_i}{\sigma_i}
$$

where $\mu_i$ and $\sigma_i$ are the mean and standard deviation of returns for stock $i$ over the period $[t-T, t]$, respectively.
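
The two definitions above (simple returns, then standardization) can be sketched as follows. The prices are synthetic, the rows are stored in chronological order (the reverse of the matrix $R$ shown above), and the use of the sample standard deviation (`ddof=1`) is an assumption of this example.

```python
import numpy as np

rng = np.random.default_rng(1)
n_periods, n_stocks = 252, 5

# Synthetic adjusted close prices (hypothetical), one column per stock,
# with one extra row so that n_periods returns can be computed.
log_steps = rng.normal(0.0005, 0.02, size=(n_periods + 1, n_stocks))
prices = 100.0 * np.exp(np.cumsum(log_steps, axis=0))

# Simple returns: R_i^k = (S_{i,k} - S_{i,k-1}) / S_{i,k-1}
returns = prices[1:] / prices[:-1] - 1.0

# Standardize each stock (column) by its own mean and standard deviation.
mu = returns.mean(axis=0)
sigma = returns.std(axis=0, ddof=1)  # sample standard deviation (assumption)
Z = (returns - mu) / sigma

print(Z.shape)
print(Z.mean(axis=0).round(12), Z.std(axis=0, ddof=1).round(12))
```

After standardization every column has zero mean and unit sample standard deviation, so a volatile stock no longer dominates the correlation structure.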

### Empirical Correlation Matrix

The empirical correlation matrix $C$ is computed from the standardized returns:

$$
C = \frac{1}{T-1} Z^T Z
$$

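where $Z^T$ is the transpose of matrix $Z$. Note that the superscript $T$ denotes transposition here, whereas the $T$ in the factor $\frac{1}{T-1}$ is the number of periods.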
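
A small numerical check of this formula, on synthetic returns: when $Z$ is standardized with the sample standard deviation (`ddof=1`, an assumption of this sketch), the product $Z^T Z / (T-1)$ reproduces the Pearson correlation matrix given by `numpy.corrcoef`.

```python
import numpy as np

rng = np.random.default_rng(2)
n_periods, n_stocks = 252, 4

# Synthetic returns (hypothetical data), one column per stock.
returns = rng.normal(0.0, 0.02, size=(n_periods, n_stocks))

# Standardize column by column, then apply C = Z^T Z / (T - 1).
Z = (returns - returns.mean(axis=0)) / returns.std(axis=0, ddof=1)
C = Z.T @ Z / (n_periods - 1)

# Reference value: Pearson correlation between the columns of `returns`.
C_reference = np.corrcoef(returns, rowvar=False)

print(C.shape)
print(np.allclose(C, C_reference))
```

If the population standard deviation (`ddof=0`) were used for standardization instead, the diagonal of $C$ would be $T/(T-1)$ rather than exactly 1.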

### Singular Value Decomposition (SVD)

We apply Singular Value Decomposition to the correlation matrix $C$:

$$
C = U \Sigma V^T
$$

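Here, $U$ and $V$ are orthogonal matrices whose columns are the left and right singular vectors, respectively, and $\Sigma$ is a diagonal matrix containing the singular values. Because $C$ is symmetric and positive semi-definite, its singular values coincide with its eigenvalues; the square-root relation applies instead to the singular values of $Z/\sqrt{T-1}$, whose squares are the eigenvalues of $C$.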
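
The decomposition can be explored numerically with `numpy.linalg.svd`. The sketch below uses synthetic returns and checks two facts: for the symmetric positive semi-definite matrix $C$ the singular values equal the eigenvalues, and the singular values of $Z/\sqrt{T-1}$ are their square roots. The number of retained components `m` is an arbitrary choice for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
n_periods, n_stocks = 252, 6

# Synthetic standardized returns (hypothetical data).
returns = rng.normal(0.0, 0.02, size=(n_periods, n_stocks))
Z = (returns - returns.mean(axis=0)) / returns.std(axis=0, ddof=1)
C = Z.T @ Z / (n_periods - 1)

# SVD of C: singular values are returned in descending order.
U, s, Vt = np.linalg.svd(C)

# Eigenvalues of the symmetric matrix C (eigvalsh returns them ascending).
eigenvalues = np.linalg.eigvalsh(C)[::-1]

# Singular values of Z / sqrt(T - 1): their squares are the eigenvalues of C.
s_Z = np.linalg.svd(Z / np.sqrt(n_periods - 1), compute_uv=False)

# Share of total variance carried by the first m components.
m = 2
explained = s / s.sum()
print(U.shape, s.shape, Vt.shape)
print(f"variance explained by the first {m} components: {explained[:m].sum():.3f}")
```

Since the trace of a correlation matrix equals the number of stocks, the singular values sum to $N$, which makes the explained-variance ratio straightforward to interpret.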

### Construction of Hidden Factors

For each of the top $m$ components, we construct the selected hidden factors as follows:

$$
F_j^t = \sum_{i=1}^N \frac{\lambda_{i,j}}{\sigma_i} R_i^t
$$

where $\lambda_{i,j}$ is the $i$-th component of the $j$-th eigenvector (ranked by eigenvalue magnitude).
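
A minimal sketch of this construction on synthetic data is given below. The common "market" component, the number of stocks and the choice `m = 3` are assumptions made for illustration; the last step regresses one stock on the factors to obtain its residual, as in the factor model at the start of this section.

```python
import numpy as np

rng = np.random.default_rng(4)
n_periods, n_stocks, m = 252, 8, 3

# Synthetic returns sharing a common component (hypothetical data).
market = rng.normal(0.0, 0.01, size=(n_periods, 1))
returns = 0.8 * market + rng.normal(0.0, 0.01, size=(n_periods, n_stocks))

sigma = returns.std(axis=0, ddof=1)
Z = (returns - returns.mean(axis=0)) / sigma
C = Z.T @ Z / (n_periods - 1)

# Eigen-decomposition of C, ranked by decreasing eigenvalue.
eigenvalues, eigenvectors = np.linalg.eigh(C)
order = np.argsort(eigenvalues)[::-1]
top_values = eigenvalues[order[:m]]
top_vectors = eigenvectors[:, order[:m]]  # shape (N, m); column j is eigenvector j

# F_j^t = sum_i (lambda_{i,j} / sigma_i) * R_i^t
weights = top_vectors / sigma[:, None]
F = returns @ weights  # shape (T, m)

# Regress the first stock on the factors; the residual is its epsilon series.
X = np.column_stack([np.ones(n_periods), F])
beta, *_ = np.linalg.lstsq(X, returns[:, 0], rcond=None)
residual = returns[:, 0] - X @ beta

print(F.shape)
print(np.cov(F, rowvar=False).round(3))
```

The printed covariance matrix is diagonal, with the selected eigenvalues on the diagonal: the factors built this way are uncorrelated in sample.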


---
<font color=green>Q13: (3 Marks)</font>
<br><font color='green'>
For the specified period from March 29, 2023 ('2023-03-29'), to March 28, 2024 ('2024-03-28'), generate the matrix $Z$ by standardizing the stock returns using the DataFrame `df_filtered_new`.
</font>

---


In [None]:
## Insert your code here

---
<font color=green>Q14: (1 Mark) </font>
<br><font color='green'>
Download the `Z_matrix` matrix from the course's GitHub account.
</font>

---

In [None]:
## Insert your code here

---
<font color=green>Q15: (3 Marks) </font>
<br><font color='green'>
For the specified period from March 29, 2023 ('2023-03-29'), to March 28, 2024 ('2024-03-28'), compute the correlation matrix $C$ using the matrix `Z_matrix`.
</font>

---

In [None]:
## Insert your code here

---
<font color=green>Q16: (2 Marks) </font>
<br><font color='green'>
Recompute the correlation matrix for the period from March 29, 2023 ('2023-03-29'), to March 28, 2024 ('2024-03-28') using the pandas correlation method (`DataFrame.corr`).
</font>

---

In [None]:
## Insert your code here

---
<font color=green>Q17: (7 Marks) </font>
<br><font color='green'>
Conduct Singular Value Decomposition on the correlation matrix $C$. Follow these steps:


1. **Perform SVD**: Decompose the matrix $C$ into its singular values and vectors.
2. **Rank Eigenvalues**: Sort the resulting singular values in descending order (for the symmetric positive semi-definite matrix $C$, they coincide with its eigenvalues).
3. **Select Components**: Extract the first 20 components based on the largest singular values.
4. **Variance Explained**: Print the variance explained by the first 20 components and the dimensions of the different matrices that you created.

</font>

---

In [None]:
## Insert your code here

---
<font color=green>Q18: (6 Marks) </font>
<br><font color='green'>
Extract the 20 hidden factors into a matrix $F$. Check that the shape of $F$ is $(252, 20)$.
</font>


---

In [None]:
## Insert your code here

---
<font color=green>Q19: (3 Marks) </font>
<br><font color='green'>
Perform the regression analysis of 'AAPL' for the date '2024-03-28', with a historical data length of 252 days, using the matrix $F$ obtained previously. Compare the R-squared with the one obtained in Q11.
</font>


---

In [None]:
## Insert your code here

# Ornstein Uhlenbeck

The Ornstein-Uhlenbeck process is defined by the following stochastic differential equation (SDE):

$$ dX_t = \theta (\mu - X_t) dt + \sigma dW_t $$

where:

- **$X_t$**: The value of the process at time $t$.
- **$\mu$**: The long-term mean (equilibrium level) to which the process reverts.
- **$\theta$**: The speed of reversion or the rate at which the process returns to the mean.
- **$\sigma$**: The volatility (standard deviation), representing the magnitude of random fluctuations.
- **$W_t$**: A Wiener process or Brownian motion that adds stochastic (random) noise.

This equation describes a process where the variable $X_t$ moves towards the mean $\mu$ at a rate determined by $\theta$, with random noise added by $\sigma dW_t$.
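
To get a feel for these dynamics, the sketch below simulates one path using the exact Gaussian transition of the process over a step $\Delta t$, namely $X_{t+\Delta t} = \mu + (X_t - \mu)e^{-\theta \Delta t} + \sigma\sqrt{\frac{1 - e^{-2\theta \Delta t}}{2\theta}}\,\xi$ with $\xi \sim N(0,1)$. The function name `simulate_ou` and all parameter values are hypothetical choices for illustration.

```python
import numpy as np

def simulate_ou(theta, mu, sigma, x0, dt, n_steps, rng):
    """Simulate an Ornstein-Uhlenbeck path using its exact Gaussian transition."""
    decay = np.exp(-theta * dt)
    step_std = sigma * np.sqrt((1.0 - decay**2) / (2.0 * theta))
    path = np.empty(n_steps + 1)
    path[0] = x0
    for k in range(n_steps):
        # Pull towards mu, then add the Gaussian shock for this step.
        path[k + 1] = mu + (path[k] - mu) * decay + step_std * rng.normal()
    return path

rng = np.random.default_rng(5)
theta, mu, sigma = 5.0, 0.0, 0.5  # hypothetical parameters
path = simulate_ou(theta, mu, sigma, x0=1.0, dt=1 / 252, n_steps=50_000, rng=rng)

# Discard the start of the path so the initial value x0 no longer matters.
burn_in = 1_000
sample_mean = path[burn_in:].mean()
sample_var = path[burn_in:].var()

print(f"sample mean: {sample_mean:.4f} (long-term mean {mu})")
print(f"sample variance: {sample_var:.4f} (stationary value {sigma**2 / (2 * theta):.4f})")
```

On a long path the sample mean settles near $\mu$ and the sample variance near the stationary value $\sigma^2 / (2\theta)$; a larger $\theta$ means faster reversion and a tighter distribution around the mean.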

---
<font color=green>Q20: (3 Marks) </font>
<br><font color='green'>
In the context of mean reversion, which quantity should be modeled using an Ornstein-Uhlenbeck process?
</font>

---

**Write your answers here:**

---
<font color=green>Q21: (5 Marks) </font>
<br><font color='green'>
Explain how the parameters $\theta$ and $\sigma$ can be determined using the following equations. Also, detail the underlying assumptions:
$$ E[X] = \mu $$
$$ \text{Var}[X] = \frac{\sigma^2}{2\theta} $$
</font>

---

**Write your answers here:**

---
<font color=green>Q22: (2 Marks) </font>
<br><font color='green'>
Create a function named `extract_s_scores` that computes the s-score for the last element of a list of floating-point numbers. The function uses the formula $\text{s-score} = \frac{X_T - \mu}{\sigma}$, where `list_xi` is a list containing the sequence of floating-point numbers $(X_0, \cdots, X_T)$.

</font>

---

In [None]:
## Insert your code here

# Autoencoder Analysis

Autoencoders are neural networks used for unsupervised learning, particularly for dimensionality reduction and feature extraction. Training an autoencoder on the $Z_i$ matrix aims to identify hidden factors capturing the intrinsic structures in financial data.

### Architecture
- **Encoder**: Compresses input data into a smaller latent space representation.
  - *Input Layer*: Matches the number of features in the $Z_i$ matrix.
  - *Hidden Layers*: Compress data through progressively smaller layers.
  - *Latent Space*: Encodes the data into hidden factors.
- **Decoder**: Reconstructs input data from the latent space.
  - *Hidden Layers*: Gradually expand to the original dimension.
  - *Output Layer*: Matches the input layer to recreate the original matrix.

### Training
The autoencoder is trained by minimizing reconstruction loss, usually mean squared error (MSE), between the input $Z_i$ matrix and the decoder's output.

### Hidden Factors Extraction
After training, the encoder's latent space provides the most important underlying patterns in the stock returns.
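
The encoder, latent space and decoder roles described above can be illustrated with a deliberately tiny, dependency-free network: one `tanh` latent layer and a linear decoder, trained by full-batch gradient descent on the MSE. This is only a toy sketch on synthetic data, with hypothetical sizes and learning rate; it is not the architecture of the paper, which the questions below ask you to build.

```python
import numpy as np

rng = np.random.default_rng(6)
n_samples, n_features, n_latent = 200, 10, 3

# Synthetic standardized data driven by n_latent hidden factors (hypothetical).
hidden = rng.normal(size=(n_samples, n_latent))
mixing = rng.normal(size=(n_latent, n_features))
X = hidden @ mixing + 0.1 * rng.normal(size=(n_samples, n_features))
X = (X - X.mean(axis=0)) / X.std(axis=0)

# Encoder: tanh(X W1 + b1). Decoder: linear map back to the input dimension.
W1 = 0.1 * rng.normal(size=(n_features, n_latent))
b1 = np.zeros(n_latent)
W2 = 0.1 * rng.normal(size=(n_latent, n_features))
b2 = np.zeros(n_features)

learning_rate, n_epochs = 0.05, 3000
losses = []
for epoch in range(n_epochs):
    # Forward pass: compress to the latent space, then reconstruct.
    latent = np.tanh(X @ W1 + b1)
    X_hat = latent @ W2 + b2
    error = X_hat - X
    losses.append(float(np.mean(error**2)))  # reconstruction MSE

    # Backward pass: gradients of the MSE with respect to each parameter.
    grad_out = 2.0 * error / error.size
    grad_W2 = latent.T @ grad_out
    grad_b2 = grad_out.sum(axis=0)
    grad_pre = (grad_out @ W2.T) * (1.0 - latent**2)  # derivative of tanh
    grad_W1 = X.T @ grad_pre
    grad_b1 = grad_pre.sum(axis=0)

    # Plain gradient-descent update.
    W1 -= learning_rate * grad_W1
    b1 -= learning_rate * grad_b1
    W2 -= learning_rate * grad_W2
    b2 -= learning_rate * grad_b2

# Hidden factors are the encoder output; residuals are what the decoder misses.
hidden_factors = np.tanh(X @ W1 + b1)
residuals = X - (hidden_factors @ W2 + b2)

print(hidden_factors.shape, residuals.shape)
print(f"MSE at start: {losses[0]:.3f}, at end: {losses[-1]:.3f}")
```

The residual matrix plays the same role as the regression residuals in the Fama-French and PCA sections: it is the part of each standardized return that the learned factors do not reconstruct.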

---
<font color=green>Q23: (2 Marks) </font>
<br><font color='green'>
Modify the standardized returns matrix `Z_matrix` to reduce the influence of extreme outliers on model training by ensuring that no value lies more than 3 standard deviations from the mean. Specifically, cap the values to the interval $[-3, 3]$. Store the adjusted values in a new matrix, `Z_hat`.
</font>

----

In [None]:
## Insert your code here

---
<font color=green>Q24: (1 Mark) </font>
<br><font color='green'>
Fetch the `Z_hat` data from GitHub, and we'll proceed with it now.
</font>



In [None]:
## Insert your code here

---
<font color=green>Q25: (3 Marks) </font>
<br><font color='green'>
Segment the standardized and capped returns matrix $\hat{Z}$ into two subsets for model training and testing. Precisely, allocate 70% of the data in $\hat{Z}$ to the training set $\hat{Z}_{train}$ and the remaining 30% to the testing set $\hat{Z}_{test}$. Treat each stock within $\hat{Z}$ as an individual sample, flattening temporal dependencies.
</font>



In [None]:
## Insert your code here

---
<font color=green>Q26: (10 Marks) </font>
<br><font color='green'>
Please create an autoencoder following the instructions provided in **[End-to-End Policy Learning of a Statistical Arbitrage Autoencoder Architecture](https://arxiv.org/pdf/2402.08233.pdf)**. Use the model 'Variant 2' in Table 1.
</font>

---

In [None]:
## Insert your code here

---
<font color=green>Q27: (1 Mark)</font>
<br><font color='green'>
Display all the parameters of the deep neural network.
</font>

---

In [None]:
## Insert your code here

---
<font color=green>Q28: (3 Marks) </font>
<br><font color='green'>
Train your model using the Adam optimizer for 20 epochs, with a batch size of 8 and a validation split of 20%. Specify the loss function you have chosen.
</font>


In [None]:
## Insert your code here

---
<font color=green>Q29: (3 Marks) </font>
<br><font color='green'>
Predict on the testing set and, for the 'NVDA' stock, extract the residuals based on the methodology described in **[End-to-End Policy Learning of a Statistical Arbitrage Autoencoder Architecture](https://arxiv.org/pdf/2402.08233.pdf)**.
</font>

---

In [None]:
## Insert your code here

<font color=green>Q30: (7 Marks) </font>
<br><font color='green'>
After carefully reading the paper **[End-to-End Policy Learning of a Statistical Arbitrage Autoencoder Architecture](https://arxiv.org/pdf/2402.08233.pdf)**, answer the following questions:
1. **Summarize the Key Actions**: Highlight the main experiments and methodologies employed by the authors in Section 5.
2. **Reproduction Steps**: Detail the necessary steps required to replicate the authors' approach based on the descriptions provided in the paper.
3. **Proposed Improvement**: Suggest one enhancement to the methodology that could increase the effectiveness or efficiency of the model.
</font>



**Write your answers here:**








