In [172]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import glob
import json
import difflib
import yfinance as yf
from sklearn.linear_model import LinearRegression
from beta_values import *

days_in_quarter = 63

We describe our strategy in four parts:

(1)  **When to invest?**

The idea of our strategy is to invest after a day on which the ETF volume is statistically large and the ETF return is negative.  We compute these days in `high_volume_neg_return_days` below.  We focus on these days because the correlations among the stocks in the ETF are abnormally high on them.  These large selloffs are often a response to a large macro shock that affects many of the stocks within the ETF.  Our strategy rests on the theory that, when these large ETF selloffs happen, some stocks in the ETF are "dragged down" along with the ETF even though, from a fundamental perspective, they should be less exposed to the macro shock that drove the selloff.

(2)  **Which stocks to invest in?**

We want to invest in stocks we label *outsider* stocks.  Roughly speaking, these are stocks within an ETF that respond less strongly to the movements of the ETF.  The intuition is that these outsider stocks are generally less likely to be affected by the macro developments that affect the ETF, and so these are the stocks whose prices are most likely to be distorted during large selloffs of the ETF.

How do we precisely measure this notion of *outsider* stock?  

To identify the stocks that respond less strongly to the movements of the ETF, we compute the coefficient $\beta$ of the exponentially weighted linear regression of the returns of the stock against the returns of the ETF.  (We introduce this exponential weighting because we want to weight more recent returns more heavily than less recent ones.  We use the same weighting approach in our calculation of the ETF volume spikes.)  

This coefficient $\beta$ is called the *ETF beta* of the stock.  Note that the ETF beta changes from day to day, since we must update our regression model each day based on the new day's returns.  In `betas_per_day` (constructed in `beta_values.py`) we compile, for each ETF, a dataframe that contains these ETF betas for each stock and each day during the time period we are studying.
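
As a rough illustration of the ETF beta, the sketch below estimates the slope of an exponentially weighted regression as the exponentially weighted covariance divided by the exponentially weighted variance.  This is an assumption about the estimator: the actual computation in `beta_values.py` may differ.  The function name `ewm_beta` and the synthetic return series are hypothetical.

```python
import numpy as np
import pandas as pd

def ewm_beta(stock_returns, etf_returns, halflife=63):
    """Slope of an exponentially weighted regression of stock on ETF returns.

    Assumption: the slope is estimated as EW covariance / EW variance;
    the computation in beta_values.py may be implemented differently.
    """
    cov = stock_returns.ewm(halflife=halflife).cov(etf_returns)
    var = etf_returns.ewm(halflife=halflife).var()
    # Shift by one day so the beta dated D uses only returns from before D.
    return (cov / var).shift(1)

# Synthetic data: a stock that moves half as much as the ETF, plus small noise.
rng = np.random.default_rng(0)
dates = pd.bdate_range("2020-01-01", periods=500)
etf = pd.Series(rng.normal(0.0, 0.01, size=500), index=dates)
stock = 0.5 * etf + pd.Series(rng.normal(0.0, 0.001, size=500), index=dates)

beta = ewm_beta(stock, etf)
print(beta.iloc[-1])  # close to the true slope of 0.5
```

A stock like this one, with a beta well below 1, is the kind of stock that would tend to land among the outsiders.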

Once we have computed these ETF betas, we define the ETF's outsider stocks on a given day as the stocks whose ETF betas that day fall in the bottom 10 percent. In `outsider_stocks`, we assign a value of `True` to a stock on a given day if that stock is an outsider to its ETF on that day.

(3)  **How do we measure the performance of our choice of stock against the ETF?**

Once we choose a day $D$ and a stock $S$ within an ETF, we must determine the following: if we invested in the stock $S$ at the *close* of day $D$ and held it for 40 days, how does the stock's return $r_S$ compare with the return $r_{ETF}$ of the ETF over that same period?  Since we want to measure our portfolio's alpha compared with the ETF, we leverage our portfolio with the ETF beta value $\beta_S$ of the stock on day $D$.  Therefore, we define the alpha of the stock's 40-day return versus the ETF as $r_S/\beta_S - r_{ETF}$.  These alphas are computed in `stock_40_day_alpha`.

We emphasize that `stock_40_day_alpha` does not specify *which* stocks to invest in, or *when* to invest in these stocks.  It only tells us the alpha we would obtain *if* we invested in a given stock following a given trading day: that is, the degree to which the stock's leveraged 40-day return outperforms that of the ETF.
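
To make the formula $r_S/\beta_S - r_{ETF}$ concrete, here is a small worked example.  The helper `leveraged_alpha` and the numbers are hypothetical and serve only to illustrate the arithmetic.

```python
def leveraged_alpha(stock_return, etf_return, beta):
    """Alpha of a beta-leveraged stock position versus the ETF: r_S / beta_S - r_ETF."""
    return stock_return / beta - etf_return

# Hypothetical numbers: over the holding period the stock gains 3%,
# the ETF gains 4%, and the stock's ETF beta on the entry day is 0.6.
alpha = leveraged_alpha(stock_return=0.03, etf_return=0.04, beta=0.6)
print(round(alpha, 4))
```

Here the leveraged stock return is $0.03 / 0.6 = 0.05$, so the alpha is about one percentage point even though the unleveraged stock underperformed the ETF.  Note that dividing by a small $\beta_S$ amplifies both gains and losses.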

(4)  **Putting it all together.**

Now that we have determined which stocks to invest in and when, and how to measure the performance of our investment choice, we can combine these steps to evaluate the performance of our overall strategy.  In `alphas`, we mask each dataframe in `stock_40_day_alpha` with the corresponding dataframe in `outsider_stocks`, restricted to the days in `high_volume_neg_return_days`, to obtain the alphas associated with each of the investments dictated by our strategy.  In `alphas_mean`, we average over each of these dataframes to get, for each ETF, the average value of alpha for our investments in that ETF's stocks.

In [173]:
# List of ETFs we consider in this work.

etf_tickers = {
    'XLY',
    'XLP',
    'XLE',
    'XLF',
    'XLV',
    'XLI',
    'XLB',
    'XLK',
    'XLU',
    'XLRE',
    'XLC'
}

(1)  **When to invest?**  We calculate the days with statistically high ETF volume and negative ETF returns.  We store this information in `high_volume_neg_return_days`.

In [3]:
# This code retrieves the ETF beta values of the stocks in each ETF compiled in `beta_values.py`.
# Note that the ETF beta on a given day depends only on returns from previous days (not including that day).

betas_per_day = {}

for etf in etf_tickers:
    df = pd.read_csv(f"data/{etf}_betas_per_day.csv")
    df.set_index(df.columns[0],inplace=True)
    df.index.name = 'Date'
    df.index = pd.to_datetime(df.index)
    betas_per_day[etf] = df

In [4]:
# Function to calculate when the volume of a stock or ETF is "statistically large".

def exp_weighted_z_score(data,halflife):
    return (data - data.ewm(halflife=halflife).mean().shift(1)) / data.ewm(halflife=halflife).std().shift(1) 

def high_volume_negative_return(ticker, threshold=3, halflife=days_in_quarter):
  
    negative_returns = returns(ticker) < 0
    
    vol = yf.Ticker(ticker).history(period='max').Volume.tz_localize(None)    
    vol_z_scores = exp_weighted_z_score(vol,halflife=halflife)
    
    high_vol_neg_return = (vol_z_scores >= threshold) * negative_returns
    
    return high_vol_neg_return

high_volume_neg_return_days = {}

for etf in etf_tickers:
    high_volume_neg_return_days[etf] = high_volume_negative_return(etf)
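
To see how the exponentially weighted z-score flags a volume spike, the sketch below applies the same formula to a synthetic volume series with one artificial spike.  The name `z_score_sketch`, the synthetic data, and the spike location are hypothetical; no market data is downloaded.

```python
import numpy as np
import pandas as pd

def z_score_sketch(data, halflife):
    # Same formula as exp_weighted_z_score: compare each day's value with the
    # exponentially weighted mean and standard deviation of the previous days.
    mean = data.ewm(halflife=halflife).mean().shift(1)
    std = data.ewm(halflife=halflife).std().shift(1)
    return (data - mean) / std

# Synthetic daily volume around 1,000,000 shares with modest noise.
rng = np.random.default_rng(1)
dates = pd.bdate_range("2021-01-01", periods=300)
volume = pd.Series(rng.normal(1_000_000, 50_000, size=300), index=dates)
volume.iloc[250] = 2_000_000  # hypothetical volume spike

z = z_score_sketch(volume, halflife=63)
spike_days = z[z >= 3].index  # days whose volume is at least 3 weighted std devs above the mean
print(dates[250] in spike_days)
```

Because the mean and standard deviation are shifted by one day, the spike itself does not inflate the baseline it is compared against.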

(2)  **Which stocks to invest in?**  We calculate each day's "outsider stocks" in each ETF.  (Note that the outsider stocks for a given day are determined by information from *before* that day.)  We store this information in `outsider_stocks`.

In [5]:
# We define the outsider stocks in an ETF on a given day as those whose ETF betas that day 
# are in the bottom 10 percent.

outsider_stocks = {}

def is_in_lower_quantile(row,level=0.1):
    threshold = row.quantile(level)
    return row <= threshold

for etf in etf_tickers:
    outsider_stocks[etf] = betas_per_day[etf].apply(is_in_lower_quantile, axis=1)
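
As a small check of the bottom-decile rule, the sketch below builds a toy beta table with ten hypothetical tickers and two days, then flags the outsiders with a vectorized row-wise quantile.  This is one possible equivalent of the `apply`-based code above, not a replacement for it; the tickers and beta values are invented.

```python
import pandas as pd

tickers = [f"S{i}" for i in range(10)]  # hypothetical tickers
dates = pd.to_datetime(["2022-01-03", "2022-01-04"])

# Day 1: betas rise from 0.1 (S0) to 1.0 (S9); day 2: the ordering is reversed.
betas = pd.DataFrame(
    [[0.1 * (i + 1) for i in range(10)],
     [0.1 * (10 - i) for i in range(10)]],
    index=dates,
    columns=tickers,
)

# Row-wise 10% quantile, then flag every stock at or below that day's threshold.
threshold = betas.quantile(0.1, axis=1)
outsiders = betas.le(threshold, axis=0)
print(outsiders.sum(axis=1))
```

With ten stocks, the 10 percent quantile falls between the two lowest betas, so exactly one stock per day is flagged: `S0` on the first day and `S9` on the second.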

(3)  **How do we measure the performance of our choice of stock against the ETF?**  

We compute the alpha from investing in a stock at the close of the given day and holding for 40 days, leveraged against the ETF, compared to investing in the ETF for those 40 days.  We store the results in `stock_40_day_alpha`.

In [6]:
holding_period = 40 

# One can test other values to see what happens when you instead hold the stock for a different 
# number of days.  We are using the time horizon of 40 days based on the testing in Lynch et al.'s work.
# In future analysis, it would be interesting to explore varying this value based on developments 
# in the sector and in the overall market. 

In [150]:
stock_log_returns = {}
stock_40_day_returns = {}
stock_40_day_alpha = {}

for etf in etf_tickers:      

    # Compile tickers of both the ETF and all stocks in the ETF's holdings from the time period being considered
    tickers = []
    tickers.append(etf)
    stocks = holdings_per_day[etf].columns
    tickers.extend(stocks.values.tolist())

    # Compute the log returns of each of these tickers during the time period
    
    start_date = holdings_per_day[etf].index[0]
    end_date = holdings_per_day[etf].index[-1]
    
    etf_stock_log_returns = pd.DataFrame({ticker: log_returns(ticker,start_date=start_date,end_date=end_date) for ticker in tickers})
    
    # Compute, for each day, the returns from buying the stock AFTER that day and holding for 40 days
    
    etf_stock_40_day_returns = np.exp(sum(etf_stock_log_returns.shift(-i) for i in range(1,1 + holding_period))) - 1
    
    # Compute the difference between the stock's 40-day return that we just computed and the corresponding 40-day return of the ETF, leveraging by the stock's ETF beta
    
    df = etf_stock_40_day_returns[stocks].copy()

    for stock in stocks:
        df[stock] = (df[stock] / betas_per_day[etf][stock]) - etf_stock_40_day_returns[etf]
        
    df = df.dropna(how='all')
       
    etf_stock_40_day_alpha = df
       
    # Add all the information just computed to different dictionaries, indexing that information by the ETF ticker
    
    stock_log_returns[etf] = etf_stock_log_returns
    stock_40_day_returns[etf] = etf_stock_40_day_returns
    stock_40_day_alpha[etf] = etf_stock_40_day_alpha
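
The forward-return step above sums the log returns of the *following* days and exponentiates.  The sketch below checks, on a short hypothetical price series, that this equals the simple return from buying at one close and selling at the close `holding_period` days later.  A window of 3 days is used here only to keep the example small.

```python
import numpy as np
import pandas as pd

holding_period = 3  # short window for illustration; the notebook uses 40

# Hypothetical closing prices.
prices = pd.Series([100.0, 101.0, 99.0, 102.0, 105.0, 103.0])
log_returns = np.log(prices / prices.shift(1))

# Sum the log returns of days D+1, ..., D+holding_period, then convert to a simple return.
forward_log = sum(log_returns.shift(-i) for i in range(1, 1 + holding_period))
forward_returns = np.exp(forward_log) - 1

# Direct computation: buy at the close of day D, sell at the close of day D+holding_period.
direct = prices.shift(-holding_period) / prices - 1
print(np.allclose(forward_returns.dropna(), direct.dropna()))
```

The last `holding_period` rows are `NaN` because the full holding window extends past the end of the data.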

(4)  **Putting it all together.**  

In [168]:
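alphas = {}
alphas_mean = {}

for etf in etf_tickers:
    
    index = stock_40_day_alpha[etf].index 
    
    # Shift the outsider flags back one day: the betas dated D+1 use returns through day D,
    # which are known at the close of day D, when we invest.
    outsiders = outsider_stocks[etf].shift(-1).reindex(index)
    high_vol_neg_return = high_volume_neg_return_days[etf].reindex(index)
    
    # Keep outsider flags only on high-volume, negative-return days; all other entries become NaN.
    outsiders = outsiders.mul(high_vol_neg_return, axis=0).replace(False, np.nan)
    
    # Alphas of the positions the strategy takes (NaN where no position is taken).
    alphas[etf] = outsiders * stock_40_day_alpha[etf]
    
    alphas[etf] = alphas[etf].dropna(how='all')
    
    # Average across positions within each day, then across days.
    alphas_mean[etf] = alphas[etf].mean(axis=1).mean()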
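
The combination step can be illustrated on a toy example.  The sketch below uses invented alphas, outsider flags, and signal days for two hypothetical tickers, and expresses the masking with `where` and boolean row selection rather than multiplication.  It is meant to show the intent of the step, not to reproduce the code above line for line.

```python
import pandas as pd

dates = pd.bdate_range("2023-01-02", periods=3)

# Hypothetical 40-day alphas, outsider flags, and high-volume negative-return days.
alpha = pd.DataFrame({"AAA": [0.02, -0.01, 0.03], "BBB": [0.00, 0.04, -0.02]}, index=dates)
is_outsider = pd.DataFrame({"AAA": [True, False, True], "BBB": [False, True, False]}, index=dates)
is_signal_day = pd.Series([True, False, True], index=dates)

# Keep an alpha only where the stock is an outsider, and only on signal days.
selected = alpha.where(is_outsider).loc[is_signal_day].dropna(how="all")

# Average within each day first, then across days.
average_alpha = selected.mean(axis=1).mean()
print(average_alpha)
```

In this toy case only `AAA` is held, on the first and third days, so the average is the mean of 0.02 and 0.03.  Averaging within each day before averaging across days gives every signal day equal weight, regardless of how many outsider stocks it contains.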