<div style="background-color:#000;"><img src="pqn.png"></img></div>

## Download and prepare the data

We start by loading the S&P 500 tickers and downloading historical price data for each ticker. The data will be saved locally to avoid repeated downloads.

In [None]:
import time
from vectorbtpro import *
import pandas as pd
import scipy.stats as st
import statsmodels.tsa.stattools as ts  
import numpy as np
import warnings
warnings.filterwarnings("ignore")

In [None]:
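# Yahoo Finance writes share-class tickers with dashes (BRK-B), Wikipedia with dots (BRK.B)
sp500_tickers = pd.read_html('https://en.wikipedia.org/wiki/List_of_S%26P_500_companies')[0]['Symbol'].str.replace('.', '-', regex=False).tolist()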

In [None]:
COINT_FILE = "coint_pvalues.pickle"
POOL_FILE = "data_pool.h5"
START = "2015-01-01"
END = "2023-12-31"

In [None]:
if not vbt.file_exists(POOL_FILE):
    with vbt.ProgressBar(total=len(sp500_tickers)) as pbar:  
        collected = 0
        for symbol in sp500_tickers:
            try:
                data = vbt.YFData.pull(
                    symbol, 
                    start=START,
                    end=END,
                    silence_warnings=True,
                )
                data.to_hdf(POOL_FILE)  
                collected += 1
            except Exception:
                # Skip symbols that fail to download; a bare except would also swallow KeyboardInterrupt
                pass
            pbar.set_prefix(f"{symbol} ({collected})")  
            pbar.update()

In [None]:
data = vbt.HDFData.pull(
    POOL_FILE, 
    start=START, 
    end=END, 
    silence_warnings=True
)

In [None]:
data = data.select_symbols([
    k 
    for k, v in data.data.items() 
    if not v.isnull().any().any()
])

We read the S&P 500 constituents from Wikipedia and download each ticker's price history from Yahoo Finance for 2015 through 2023. Each symbol is written to a local HDF5 file, so the download runs only when that file does not exist yet. We then load the file and keep only the symbols that have no missing values.
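
The completeness filter can be illustrated without vectorbtpro. The sketch below assumes a plain dictionary of DataFrames keyed by symbol; the tickers `AAA`, `BBB`, and `CCC` and their prices are made up for illustration.

```python
import numpy as np
import pandas as pd

index = pd.date_range("2023-01-02", periods=4, freq="D")

# Hypothetical per-symbol frames; BBB has one missing close
frames = {
    "AAA": pd.DataFrame({"Close": [10.0, 10.5, 10.2, 10.8]}, index=index),
    "BBB": pd.DataFrame({"Close": [20.0, np.nan, 20.4, 20.1]}, index=index),
    "CCC": pd.DataFrame({"Close": [5.0, 5.1, 5.3, 5.2]}, index=index),
}

# Keep a symbol only if none of its cells is null
complete = [
    symbol
    for symbol, df in frames.items()
    if not df.isnull().any().any()
]
print(complete)  # ['AAA', 'CCC']
```

Dropping whole symbols rather than filling gaps keeps every remaining series aligned on the same dates, which the pairwise test in the next section relies on.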

## Perform cointegration test

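Next, we perform a cointegration test on the closing prices of the stocks. Two price series are cointegrated when some linear combination of them is stationary, so their spread tends to revert to a stable mean. This helps us identify pairs of stocks that are candidates for pairs trading.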

In [None]:
@vbt.parameterized(
    merge_func="concat", 
    engine="pathos",
    distribute="chunks",  
    n_chunks="auto"  
)
def coint_pvalue(close, s1, s2):
    return ts.coint(np.log(close[s1]), np.log(close[s2]))[1]

In [None]:
if not vbt.file_exists(COINT_FILE):
    coint_pvalues = coint_pvalue(  
        data.close,
        vbt.Param(data.symbols, condition="s1 != s2"),  
        vbt.Param(data.symbols)
    )
    vbt.save(coint_pvalues, COINT_FILE)
else:
    coint_pvalues = vbt.load(COINT_FILE)

In [None]:
coint_pvalues = coint_pvalues.sort_values()
coint_pvalues.head(20)

We define a function that runs the cointegration test on the log prices of two stocks and returns its p-value. The null hypothesis of the test is that the two series are not cointegrated, so a small p-value is evidence of cointegration; it is not the probability that the pair is cointegrated. We evaluate the function in parallel for every pair of distinct symbols and save the results to a file to avoid recalculating them. We then sort the p-values in ascending order so the strongest candidate pairs come first.

## Visualize the cointegrated pairs

We visualize the price movements of two highly cointegrated stocks and their log difference. This helps us understand their relationship.

In [None]:
S1, S2 = "WYNN", "DVN"

In [None]:
data.plot(column="Close", symbol=[S1, S2], base=1).show()

In [None]:
S1_log = np.log(data.get("Close", S1))  
S2_log = np.log(data.get("Close", S2))
log_diff = (S2_log - S1_log).rename("Log diff")
fig = log_diff.vbt.plot()
fig.add_hline(y=log_diff.mean(), line_color="yellow", line_dash="dot")
fig.show()

In [None]:
data = vbt.YFData.pull(
    [S1, S2], 
    start=START,
    end=END,
    silence_warnings=True,
)

We select two stocks from the most significant cointegrated pairs. We plot their closing prices, rebased to 1, to visually inspect their relationship. We then calculate the difference between their log prices and plot it with a horizontal line at its mean, which shows how far the pair drifts apart and how often it returns. Finally, we download the data again for only these two symbols, so `data` now holds just the pair we will backtest.
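
The log difference is the log of the price ratio, which is why it is a natural measure of how far the pair has drifted apart. A minimal sketch, assuming two short made-up price series named `s1` and `s2`:

```python
import numpy as np
import pandas as pd

# Hypothetical closing prices for two stocks
s1 = pd.Series([100.0, 102.0, 101.0, 105.0], name="S1")
s2 = pd.Series([50.0, 50.5, 51.5, 52.0], name="S2")

# Difference of logs equals the log of the ratio s2 / s1
log_diff = (np.log(s2) - np.log(s1)).rename("Log diff")

# Distance from the mean, i.e. from the dotted line in the plot
deviation = (log_diff - log_diff.mean()).rename("Deviation from mean")

print(pd.concat((log_diff, deviation), axis=1))
```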

## Build and backtest the trading strategy

We set up a simple pairs trading strategy based on the z-score of the spread between the two stocks. We then backtest the strategy using historical data.

In [None]:
UPPER = st.norm.ppf(1 - 0.05 / 2)  
LOWER = -st.norm.ppf(1 - 0.05 / 2)

In [None]:
S1_close = data.get("Close", S1)
S2_close = data.get("Close", S2)
ols = vbt.OLS.run(S1_close, S2_close, window=vbt.Default(21))
spread = ols.error.rename("Spread")
zscore = ols.zscore.rename("Z-score")
print(pd.concat((spread, zscore), axis=1))

In [None]:
upper_crossed = zscore.vbt.crossed_above(UPPER)
lower_crossed = zscore.vbt.crossed_below(LOWER)

In [None]:
fig = zscore.vbt.plot()
fig.add_hline(y=UPPER, line_color="orangered", line_dash="dot")
fig.add_hline(y=0, line_color="yellow", line_dash="dot")
fig.add_hline(y=LOWER, line_color="limegreen", line_dash="dot")
upper_crossed.vbt.signals.plot_as_exits(zscore, fig=fig)
lower_crossed.vbt.signals.plot_as_entries(zscore, fig=fig)
fig.show()

In [None]:
long_entries = data.symbol_wrapper.fill(False)
short_entries = data.symbol_wrapper.fill(False)

In [None]:
short_entries.loc[upper_crossed, S1] = True
long_entries.loc[upper_crossed, S2] = True
long_entries.loc[lower_crossed, S1] = True
short_entries.loc[lower_crossed, S2] = True

In [None]:
pf = vbt.Portfolio.from_signals(
    data,
    entries=long_entries,
    short_entries=short_entries,
    size=10,  
    size_type="valuepercent100",  
    group_by=True,  
    cash_sharing=True,
    call_seq="auto"
)

In [None]:
pf.stats()

We set the upper and lower z-score thresholds at the two-sided 5% tails of the standard normal distribution, which is about ±1.96. We calculate the spread and its z-score for the selected pair with a 21-period rolling OLS regression. When the z-score crosses above the upper threshold, we generate a short signal for the first stock and a long signal for the second stock. Conversely, when the z-score crosses below the lower threshold, we generate a long signal for the first stock and a short signal for the second stock. We then simulate a portfolio from these signals, treating the two stocks as one group with shared cash, and compute its performance statistics.
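
The signal logic can be sketched in plain pandas. This is an approximation for illustration, not vectorbtpro's own crossover implementation; the z-score values are made up, and the thresholds are rounded to ±1.96.

```python
import pandas as pd

UPPER, LOWER = 1.96, -1.96  # rounded two-sided 5% normal thresholds

# Hypothetical z-score path
zscore = pd.Series([0.0, 1.5, 2.2, 2.5, 1.0, -1.0, -2.3, -0.5], name="Z-score")
prev = zscore.shift(1)

# A crossing fires only on the bar where the threshold is breached
upper_crossed = (zscore > UPPER) & (prev <= UPPER)
lower_crossed = (zscore < LOWER) & (prev >= LOWER)

# One boolean column per stock, as in the notebook's entry arrays
long_entries = pd.DataFrame(False, index=zscore.index, columns=["S1", "S2"])
short_entries = long_entries.copy()

short_entries.loc[upper_crossed, "S1"] = True  # spread high: short S1
long_entries.loc[upper_crossed, "S2"] = True   # and long S2
long_entries.loc[lower_crossed, "S1"] = True   # spread low: long S1
short_entries.loc[lower_crossed, "S2"] = True  # and short S2

print(pd.concat((zscore, upper_crossed.rename("Upper"), lower_crossed.rename("Lower")), axis=1))
```

Because a signal fires only on the crossing bar, the z-score staying above the threshold on the following bar (2.5 here) does not trigger a second entry.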

## Your next steps

Try experimenting with different stock pairs or changing the window size used to calculate the z-score. You can also adjust the confidence level to see how it affects the trading signals: a higher level widens the thresholds and produces fewer signals. These changes can help you understand the impact of each parameter on the performance of the trading strategy.
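
As a starting point for that experiment, the sketch below shows how the thresholds move with the significance level. The set of `alpha` values is an arbitrary choice for illustration.

```python
from scipy.stats import norm

# Map each two-sided significance level to its upper z-score threshold
thresholds = {}
for alpha in (0.10, 0.05, 0.01):
    thresholds[alpha] = norm.ppf(1 - alpha / 2)
    print(f"alpha = {alpha:.2f} -> thresholds at ±{thresholds[alpha]:.3f}")
```

Smaller values of `alpha` push the thresholds further from zero, so the z-score has to move further before a trade is opened.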

<a href="https://pyquantnews.com/">PyQuant News</a> is where finance practitioners level up with Python for quant finance, algorithmic trading, and market data analysis. Looking to get started? Check out the fastest growing, top-selling course to <a href="https://gettingstartedwithpythonforquantfinance.com/">get started with Python for quant finance</a>. For educational purposes. Not investment advice. Use at your own risk.