# Unsupervised Learning Trading Strategy

1. Download/Load SP500 stocks prices data.
2. Calculate different features and indicators on each stock.
3. Aggregate on monthly level and filter top 150 most liquid stocks.
4. Calculate Monthly Returns for different time-horizons.
5. Download Fama-French Factors and Calculate Rolling Factor Betas.
6. For each month fit a K-Means Clustering Algorithm to group similar assets based on their features.
7. For each month select assets based on the cluster and form a portfolio based on Efficient Frontier max sharpe ratio optimization.
8. Visualize Portfolio returns and compare to SP500 returns.

In [3]:
from statsmodels.regression.rolling import RollingOLS
import pandas_datareader.data as web
import matplotlib.pyplot as plt
import statsmodels.api as sm
import pandas as pd
import numpy as np
import datetime as dt
import yfinance as yf
import pandas_ta
import warnings 

In [4]:
warnings.filterwarnings('ignore')

# 1. Download/Load SP500 stocks prices data.

In [6]:
sp500 = pd.read_html('https://en.wikipedia.org/wiki/List_of_S%26P_500_companies')[0]

In [10]:
# regex=False treats '.' as a literal dot; as a regex it would match every character.
sp500['Symbol'] = sp500['Symbol'].str.replace('.', '-', regex=False)

In [15]:
symbol_list = sp500['Symbol'].unique().tolist() # Contains Survivorship bias

Survivorship bias in the context of the S&P 500 stock list refers to the tendency to analyze or consider only those companies that have survived over a certain period and are currently included in the index, while ignoring those that were once part of the index but have since been removed. This can lead to skewed or overly optimistic conclusions about the performance and characteristics of the index because it fails to account for the companies that did not survive due to poor performance, bankruptcy, mergers, or other reasons.

### Key Points of Survivorship Bias in S&P 500

1. **Exclusion of Failed Companies**: When analyzing the S&P 500, the focus is often on the companies that are currently in the index. This excludes companies that were once part of the index but were removed because they underperformed, went bankrupt, or were acquired. This exclusion can make historical performance appear better than it actually was.

2. **Overestimation of Performance**: Companies that perform poorly tend to be dropped from the S&P 500 while successful ones are retained, so an analysis that applies today's constituent list to past data may overestimate the returns that were actually achievable. It does not reflect the losses from companies that failed or were removed from the index.

3. **Misleading Risk Assessment**: Survivorship bias can lead to an underestimation of the risks involved. Investors might believe that the S&P 500 is less risky than it actually is if they do not consider the companies that failed and were removed from the index.

4. **Inaccurate Backtesting**: When backtesting investment strategies using the current list of S&P 500 companies, the results can be misleading. The backtest does not account for the companies that did not survive, thus presenting an incomplete picture of potential outcomes.

### Example

Suppose an investor backtests a portfolio of today's S&P 500 constituents over the past 30 years and sees strong returns. That portfolio silently excludes companies that were part of the index at some point but were later removed due to poor performance, bankruptcy, or acquisition. Its return therefore overstates what an investor holding the actual index members at each date would have earned.
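
To make the effect concrete, here is a minimal sketch with made-up numbers. The tickers, returns, and the `still_in_index` flag are all hypothetical; the point is only that averaging over survivors gives a rosier figure than averaging over every past member.

```python
import pandas as pd

# Hypothetical universe: an annual return per stock and a flag for
# whether the stock is still an index member today.
universe = pd.DataFrame(
    {
        "ticker": ["AAA", "BBB", "CCC", "DDD"],
        "annual_return": [0.12, 0.08, -0.30, -0.10],
        "still_in_index": [True, True, False, False],
    }
)

# Average over every stock that was ever a member.
full_mean = universe["annual_return"].mean()

# Average over survivors only, which is what today's list implies.
survivor_mean = universe.loc[universe["still_in_index"], "annual_return"].mean()

print(f"All past members: {full_mean:.2%}")
print(f"Survivors only:   {survivor_mean:.2%}")
```

In this toy universe the survivors average about 10% while the full membership averages about -5%, so a backtest restricted to survivors would look far better than the history it claims to describe.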

### Mitigating Survivorship Bias

1. **Use Historical Constituents Data**: To get a more accurate picture, use historical data that includes all companies that have ever been part of the S&P 500, not just those that are currently included. This data should account for the companies that were added and removed over time.

2. **Consider Delisted Companies**: Include the performance of companies that were delisted, went bankrupt, or were acquired. This provides a more comprehensive view of the risks and returns associated with the index.

3. **Adjusted Indices**: Some financial data providers offer adjusted indices that attempt to account for survivorship bias by including data on companies that have been removed from the index.
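
The first mitigation can be sketched as a point-in-time membership lookup. The `membership` table, its column names, and the `constituents_on` helper are hypothetical; a real backtest would load the add and remove dates from a historical constituents database.

```python
import pandas as pd

# Hypothetical membership history: one row per ticker with the date it
# joined the index and the date it left (NaT if it is still a member).
membership = pd.DataFrame(
    {
        "ticker": ["AAA", "BBB", "CCC"],
        "added": pd.to_datetime(["2010-01-04", "2015-06-01", "2012-03-19"]),
        "removed": pd.to_datetime(["2018-09-28", None, None]),
    }
)


def constituents_on(date, membership=membership):
    """Return the sorted tickers that were index members on `date`."""
    date = pd.Timestamp(date)
    active = (membership["added"] <= date) & (
        membership["removed"].isna() | (membership["removed"] > date)
    )
    return sorted(membership.loc[active, "ticker"])


print(constituents_on("2016-01-01"))
print(constituents_on("2020-01-01"))
```

Calling such a helper once per rebalance date keeps removed companies in the universe for the periods when they were actually members.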

### Practical Approach

For academic research, financial analysis, or backtesting investment strategies, it is crucial to consider survivorship bias. Researchers and analysts often rely on databases that track historical constituents of indices like the S&P 500, including the companies that were removed. This helps in producing more accurate and realistic analyses and strategies.

In summary, survivorship bias in the S&P 500 stock list can significantly distort our understanding of the index's historical performance and risk. By recognizing and accounting for this bias, investors and analysts can make more informed decisions.

In [16]:
end_date = '2024-05-01'
start_date = pd.to_datetime(end_date) - pd.DateOffset(365*8)

In [18]:
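df = yf.download(tickers=symbol_list,
                 start=start_date,
                 end=end_date,
                 auto_adjust=False)  # keep the separate 'Adj Close' column used below
# Move tickers from the columns into the index: one row per (date, ticker).
df = df.stack()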

| Date | Ticker | Adj Close | Close | High | Low | Open | Volume |
|---|---|---|---|---|---|---|---|
| 2016-05-03 | A | 38.698872 | 41.240002 | 41.490002 | 40.910000 | 41.320000 | 1334000.0 |
| 2016-05-03 | AAL | 33.232197 | 34.580002 | 34.770000 | 33.849998 | 34.759998 | 10674200.0 |
| 2016-05-03 | AAPL | 21.723124 | 23.795000 | 23.934999 | 23.420000 | 23.549999 | 227325200.0 |
| 2016-05-03 | ABBV | 43.775429 | 61.770000 | 62.090000 | 60.720001 | 60.990002 | 9316600.0 |
| 2016-05-03 | ABT | 33.272667 | 38.549999 | 38.830002 | 38.209999 | 38.799999 | 18241600.0 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 2024-04-30 | XYL | 130.358917 | 130.699997 | 133.710007 | 130.580002 | 132.050003 | 1397800.0 |
| 2024-04-30 | YUM | 140.559265 | 141.250000 | 142.860001 | 139.750000 | 140.000000 | 4087300.0 |
| 2024-04-30 | ZBH | 120.279999 | 120.279999 | 121.410004 | 120.260002 | 120.930000 | 1429000.0 |
| 2024-04-30 | ZBRA | 314.559998 | 314.559998 | 322.950012 | 304.209991 | 320.000000 | 907700.0 |


In [22]:
df.index.names = ['date', 'ticker']
df.columns = df.columns.str.lower()

# 2. Calculate features and technical indicators for each stock.
- Garman-Klass Volatility
- RSI
- Bollinger Bands
- ATR
- MACD
- Dollar Volume

The Garman-Klass volatility is a method used to estimate historical volatility, primarily in the context of financial markets. It was proposed by Garman and Klass in their 1980 paper titled "On the Estimation of Security Price Volatilities from Historical Data."

### Formula for Garman-Klass Volatility

The Garman-Klass volatility is calculated using the following formula:

$$
\text{Volatility} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left[ \frac{1}{2} \left( \ln \frac{H_i}{L_i} \right)^2 - \left( 2 \ln 2 - 1 \right) \left( \ln \frac{C_i}{O_i} \right)^2 \right]}
$$

Where:
- $H_i$  is the high price of the asset on day $i$,
- $L_i$ is the low price of the asset on day $i$, 
- $C_i$ is the closing price of the asset on day $i$, 
- $O_i$ is the opening price of the asset on day $i$, and 
- $n$ is the number of days over which the volatility is being calculated. 

### Interpretation

The Garman-Klass volatility measures the historical volatility of an asset based on its price movements over a specific period. It considers both the range of price fluctuations (high and low prices) and the relationship between opening and closing prices. The higher the volatility, the greater the price variability of the asset.

### Advantages

- **Incorporates Price Range**: The Garman-Klass volatility considers the high and low prices of the asset, providing a more comprehensive view of price movements.
  
- **Accounts for Open-Close Price Relation**: It takes into account the relationship between opening and closing prices, which can provide additional insights into market dynamics.

### Limitations

- **Sensitive to Outliers**: Like other historical volatility estimators, the Garman-Klass volatility can be sensitive to extreme price movements or outliers, potentially leading to distorted volatility estimates.

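- **Assumes Driftless Log-Normal Prices**: The estimator is derived for prices that follow a geometric Brownian motion with zero drift and no overnight jumps (log-normal prices, normally distributed log returns). These assumptions may not hold in practice, especially during periods of market stress or for strongly trending assets.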

### Use Cases

The Garman-Klass volatility is commonly used in finance and investment for risk management, options pricing, and portfolio optimization. Traders and analysts use historical volatility measures like Garman-Klass to assess the riskiness of assets, adjust trading strategies, and make informed investment decisions.

### Summary

The Garman-Klass volatility is a widely used method for estimating historical volatility in financial markets. By incorporating both price range and the relationship between opening and closing prices, it provides a comprehensive measure of asset price variability over a given period. Despite its limitations, it remains a valuable tool for risk assessment and decision-making in the financial industry.

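The feature computed in this notebook is a per-day variance term, with no averaging over $n$ days and no square root, and it uses the adjusted close in place of the close. Because the open, high, and low are unadjusted, the open-to-close term is distorted for stocks with large cumulative dividend adjustments.

\begin{equation}
\text{Garman-Klass Volatility} = \frac{(\ln(\text{High}) - \ln(\text{Low}))^2}{2} - (2\ln(2) - 1)(\ln(\text{Adj Close}) - \ln(\text{Open}))^2
\end{equation}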

In [26]:
df['garman_klass_vol'] = ((np.log(df['high'])-np.log(df['low']))**2)/2-(2*np.log(2)-1)*((np.log(df['adj close'])- np.log(df['open']))**2)
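
The line above stores the per-day term only. As a hedged sketch of how that term could be turned into a volatility figure, the example below averages it over a trailing window and takes the square root. The helper names `gk_daily_term` and `rolling_gk_vol`, the 3-day window, and the sample prices are assumptions for illustration, and the sketch uses the unadjusted close.

```python
import numpy as np
import pandas as pd


def gk_daily_term(high, low, close, open_):
    """Per-day Garman-Klass variance term."""
    return (
        0.5 * np.log(high / low) ** 2
        - (2 * np.log(2) - 1) * np.log(close / open_) ** 2
    )


def rolling_gk_vol(ohlc: pd.DataFrame, window: int = 3) -> pd.Series:
    """Trailing Garman-Klass volatility for a single stock."""
    term = gk_daily_term(ohlc["high"], ohlc["low"], ohlc["close"], ohlc["open"])
    # Average the daily terms, floor at zero (a daily term can be
    # negative), then take the square root.
    return np.sqrt(term.rolling(window).mean().clip(lower=0))


# Made-up prices for a single stock.
ohlc = pd.DataFrame(
    {
        "open": [100.0, 101.0, 102.0, 101.5],
        "high": [102.0, 103.0, 103.5, 102.5],
        "low": [99.0, 100.0, 100.5, 100.0],
        "close": [101.0, 102.0, 101.0, 102.0],
    }
)
vol = rolling_gk_vol(ohlc)
print(vol)
```

On a frame indexed by date and ticker, the rolling step would need to run separately for each ticker, for example inside a `groupby` on the ticker level.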

The Relative Strength Index (RSI) is a technical indicator used to measure the magnitude of recent price changes to evaluate overbought or oversold conditions in a stock or market index like the S&P 500. It was developed by J. Welles Wilder Jr. and introduced in his 1978 book, "New Concepts in Technical Trading Systems."

### Concept of RSI

The RSI is calculated using the following formula:

$$
RSI = 100 - \left( \frac{100}{1 + \frac{\text{Average Gain}}{\text{Average Loss}}} \right)
$$

Where:
- Average Gain = Average of gains over a specified period (usually 14 days)
- Average Loss = Average of losses over the same period
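
As a sketch of the formula above, the function below computes RSI with pandas. It assumes the averages are taken with Wilder's smoothing (an exponential average with `alpha = 1 / length`), which is one common choice; the name `rsi_wilder` and the simulated prices are hypothetical, and library implementations may seed the first values differently.

```python
import numpy as np
import pandas as pd


def rsi_wilder(close: pd.Series, length: int = 14) -> pd.Series:
    """RSI using Wilder-style smoothing of gains and losses (a sketch)."""
    delta = close.diff()
    gain = delta.clip(lower=0)      # positive moves, zero otherwise
    loss = (-delta).clip(lower=0)   # size of negative moves, zero otherwise
    avg_gain = gain.ewm(alpha=1 / length, adjust=False, min_periods=length).mean()
    avg_loss = loss.ewm(alpha=1 / length, adjust=False, min_periods=length).mean()
    rs = avg_gain / avg_loss        # becomes inf when there are no losses
    return 100 - 100 / (1 + rs)


# Simulated random-walk prices, for illustration only.
rng = np.random.default_rng(0)
prices = pd.Series(100 + rng.normal(0, 1, 200).cumsum())
rsi = rsi_wilder(prices)
print(rsi.tail())
```

A series that only rises has no losses, so the ratio is infinite and the RSI sits at 100; a series that only falls gives 0.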

### Interpretation of RSI

- **Overbought Condition**: When the RSI value is above 70, it suggests that the stock or index may be overbought, meaning that it has risen too far too quickly, and a reversal or pullback may be imminent.

- **Oversold Condition**: When the RSI value is below 30, it suggests that the stock or index may be oversold, meaning that it has fallen too far too quickly, and a rebound or rally may be imminent.

### Application to S&P 500 Stocks

Traders and investors use the RSI to analyze individual stocks within the S&P 500 index to identify potential buying or selling opportunities. For example:

- **Short-term Trading**: Traders may use RSI signals to identify short-term trading opportunities based on overbought or oversold conditions.

- **Confirmation Tool**: RSI can be used in conjunction with other technical indicators or fundamental analysis to confirm potential trend reversals or continuation patterns in S&P 500 stocks.

- **Divergence Analysis**: Traders may look for divergence between the price action of an S&P 500 stock and its RSI reading to anticipate trend reversals or changes in momentum.

### Considerations

- **Period Length**: The period used to calculate the RSI (usually 14 days) can be adjusted based on the trader's preference or the specific characteristics of the stock being analyzed.

- **False Signals**: RSI signals should be used in conjunction with other technical indicators or analysis methods to avoid relying solely on RSI for trading decisions, as false signals can occur, especially in choppy or sideways markets.

### Summary

The Relative Strength Index (RSI) is a popular technical indicator used by traders and investors to analyze the strength and momentum of price movements in individual stocks, including those within the S&P 500 index. By identifying overbought and oversold conditions, RSI can help traders anticipate potential trend reversals or continuation patterns in S&P 500 stocks, providing valuable insights for trading decisions.

In [31]:
df['rsi'] = df.groupby(level=1)['adj close'].transform(lambda x: pandas_ta.rsi(close=x, length=20))

Bollinger Bands are a popular technical analysis tool developed by John Bollinger in the 1980s. They consist of a set of lines plotted at certain standard deviations (usually two) above and below a simple moving average (SMA) of a security's price. Bollinger Bands are used to measure market volatility and to identify potential overbought or oversold conditions.

### Components of Bollinger Bands

1. **Middle Band**: This is the simple moving average (SMA) of the price, typically calculated over 20 periods.
   
   $$
   \text{Middle Band} = \text{SMA}_{20}
   $$

2. **Upper Band**: This is the middle band plus two standard deviations.
   
   $$
   \text{Upper Band} = \text{SMA}_{20} + 2 \times \text{Standard Deviation}
   $$

3. **Lower Band**: This is the middle band minus two standard deviations.
   
   $$
   \text{Lower Band} = \text{SMA}_{20} - 2 \times \text{Standard Deviation}
   $$
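
The three bands can be sketched directly with pandas rolling windows. The function name `bollinger_bands` and the sample prices are hypothetical, and the use of the population standard deviation (`ddof=0`) is an assumption of this sketch; a library may use a different convention.

```python
import pandas as pd


def bollinger_bands(
    close: pd.Series, length: int = 20, num_std: float = 2.0
) -> pd.DataFrame:
    """Lower, middle, and upper Bollinger Bands for one price series."""
    mid = close.rolling(length).mean()
    # ddof=0 (population standard deviation) is an assumption here.
    std = close.rolling(length).std(ddof=0)
    return pd.DataFrame(
        {
            "bb_low": mid - num_std * std,
            "bb_mid": mid,
            "bb_high": mid + num_std * std,
        }
    )


# Made-up prices rising by 1 per day, for illustration only.
prices = pd.Series([float(p) for p in range(100, 130)])
bands = bollinger_bands(prices)
print(bands.tail(3))
```

The first `length - 1` rows are empty because the window is not yet full. Passing log prices instead of raw prices, as this notebook does with `np.log1p`, only changes the input series.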

### Interpretation

- **Volatility Measurement**: The width of the bands increases when the market is volatile and decreases during less volatile periods. Wider bands indicate higher volatility, while narrower bands indicate lower volatility.

- **Overbought and Oversold Conditions**:
  - When the price moves above the upper band, it may indicate that the asset is overbought.
  - When the price moves below the lower band, it may indicate that the asset is oversold.

- **Trend Signals**: Prices moving persistently close to the upper or lower band can signal the continuation of the current trend. Conversely, moves that start at one band and travel all the way to the opposite band can signal a trend reversal.

### Trading Strategies Using Bollinger Bands

1. **Mean Reversion**: Traders may buy when the price touches the lower band and sell when it touches the upper band, expecting the price to revert to the mean (middle band).

2. **Breakout Strategy**: Traders may look for breakouts above the upper band or below the lower band, expecting significant price movements to continue in the breakout direction.

3. **Bollinger Band Squeeze**: A squeeze occurs when the bands come close together, indicating low volatility. This is often followed by a period of high volatility and potential trading opportunities.
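
One way to make the squeeze idea operational is to track the band width relative to the middle band and flag days where it reaches a new low over a lookback window. This is a sketch under assumptions: the `bollinger_squeeze` name, the 60-day lookback, and the synthetic price series are all hypothetical choices, not a standard definition.

```python
import numpy as np
import pandas as pd


def bollinger_squeeze(close, length=20, num_std=2.0, lookback=60):
    """Return (bandwidth, squeeze flag) for one price series."""
    mid = close.rolling(length).mean()
    std = close.rolling(length).std(ddof=0)
    # Band width: (upper band - lower band) / middle band.
    bandwidth = (2 * num_std * std) / mid
    # Flag days where the band width is at its lowest value in the lookback.
    squeeze = bandwidth <= bandwidth.rolling(lookback).min()
    return bandwidth, squeeze


# Synthetic series: large swings for 100 days, then very small swings.
i = np.arange(140)
amplitude = np.where(i < 100, 5.0, 0.1)
close = pd.Series(100 + amplitude * np.sin(i))

bandwidth, squeeze = bollinger_squeeze(close)
print(squeeze[squeeze].index.tolist())
```

The flag only says that volatility is unusually low relative to the recent past; it carries no information about the direction of any move that follows.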


### Summary

Bollinger Bands are a versatile and widely-used technical analysis tool that helps traders and investors measure market volatility and identify potential overbought or oversold conditions. By using Bollinger Bands, traders can gain insights into market dynamics and make more informed trading decisions.

In [32]:
df['bb_low'] = df.groupby(level=1)['adj close'].transform(lambda x: pandas_ta.bbands(close=np.log1p(x), length=20).iloc[:, 0])
df['bb_mid'] = df.groupby(level=1)['adj close'].transform(lambda x: pandas_ta.bbands(close=np.log1p(x), length=20).iloc[:, 1])
df['bb_high'] = df.groupby(level=1)['adj close'].transform(lambda x: pandas_ta.bbands(close=np.log1p(x), length=20).iloc[:, 2])

The Average True Range (ATR) is a technical analysis indicator developed by J. Welles Wilder Jr. It measures market volatility by decomposing the entire range of an asset price for a given period. The ATR is primarily used to assess the degree of price volatility and is often employed to set stop-loss levels and inform trading decisions.

### Calculating ATR

The ATR is calculated using the following steps:

1. **True Range (TR)**: The true range for a given period is the greatest of the following three values:
   - The difference between the current high and the current low.
   - The absolute value of the difference between the current high and the previous close.
   - The absolute value of the difference between the current low and the previous close.
   
   Mathematically:
   $$
   TR = \max(\text{High} - \text{Low}, |\text{High} - \text{Previous Close}|, |\text{Low} - \text{Previous Close}|)
   $$

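2. **Average True Range (ATR)**: The ATR is then calculated as a moving average of the true ranges over a specified number of periods $n$. Wilder originally recommended a 14-day period. The first value is the simple average of the first $n$ true ranges:

   $$
   ATR = \frac{1}{n} \sum_{i=1}^{n} TR_i
   $$

   Wilder's original method then updates this value recursively, $ATR_t = \frac{(n-1)\,ATR_{t-1} + TR_t}{n}$, although a plain rolling average is a common simplification.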
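
The two steps above can be sketched with a plain rolling average of the true range. The helper names and the sample prices are hypothetical; library implementations often default to a recursive smoothing instead, so their values may differ from this sketch.

```python
import pandas as pd


def true_range(high: pd.Series, low: pd.Series, close: pd.Series) -> pd.Series:
    """Greatest of the three candidate ranges for each bar."""
    prev_close = close.shift(1)
    ranges = pd.concat(
        [high - low, (high - prev_close).abs(), (low - prev_close).abs()],
        axis=1,
    )
    # The first bar has no previous close, so max() falls back to high - low.
    return ranges.max(axis=1)


def average_true_range(high, low, close, length=14):
    """Simple rolling mean of the true range."""
    return true_range(high, low, close).rolling(length).mean()


# Made-up bars for a single stock.
high = pd.Series([10.0, 11.0, 12.0, 11.5])
low = pd.Series([9.0, 10.0, 10.5, 9.5])
close = pd.Series([9.5, 10.5, 11.0, 10.0])

tr = true_range(high, low, close)
atr = average_true_range(high, low, close, length=2)
print(tr.tolist())
print(atr.tolist())
```

For these bars the true ranges are 1.0, 1.5, 1.5, and 2.0. The second bar shows why the previous close matters: its high-low range is only 1.0, but the gap from the previous close of 9.5 to the high of 11.0 is 1.5.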

### Interpretation of ATR

- **Volatility Indicator**: ATR provides a measure of volatility. Higher ATR values indicate higher volatility, whereas lower ATR values indicate lower volatility.
- **No Trend Indication**: Unlike some other indicators, the ATR does not provide any indication of the direction of the price movement, just the degree of volatility.
- **Setting Stop-Loss Levels**: Traders often use ATR to set stop-loss levels. A common approach is to set the stop-loss a certain multiple of the ATR below (for long positions) or above (for short positions) the entry price.
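
The stop-loss rule in the last bullet reduces to one line of arithmetic. The function name `atr_stop_loss` and the default multiple of 2 are hypothetical choices for illustration, not a recommendation.

```python
def atr_stop_loss(
    entry_price: float, atr: float, multiple: float = 2.0, side: str = "long"
) -> float:
    """Stop level placed `multiple` ATRs away from the entry price."""
    if side == "long":
        # Long positions are stopped out below the entry.
        return entry_price - multiple * atr
    if side == "short":
        # Short positions are stopped out above the entry.
        return entry_price + multiple * atr
    raise ValueError("side must be 'long' or 'short'")


print(atr_stop_loss(100.0, 2.5))                # long entry at 100, ATR 2.5
print(atr_stop_loss(100.0, 2.5, side="short"))  # short entry at 100, ATR 2.5
```

With an entry at 100 and an ATR of 2.5, a multiple of 2 places the stop at 95 for a long position and at 105 for a short one, so the stop distance widens automatically when volatility rises.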

### Summary

The Average True Range (ATR) is a valuable indicator for measuring market volatility. It provides insights into the magnitude of price fluctuations and is widely used by traders to set stop-loss levels and assess the volatility of an asset. The ATR's primary advantage is its ability to capture the true range of price movements, making it a robust tool for volatility analysis.

In [33]:
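def calculate_atr(stock_data):
    # ATR needs the actual high, low, and close of each bar.
    atr = pandas_ta.atr(
        high=stock_data['high'],
        low=stock_data['low'],
        close=stock_data['close'],
        length=14)
    # Standardize per stock so the values are comparable across tickers.
    return atr.sub(atr.mean()).div(atr.std())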

In [None]:
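# Apply the per-stock ATR to each ticker group (index level 1).
df['atr'] = df.groupby(level=1, group_keys=False).apply(calculate_atr)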