<img src='http://hilpisch.com/taim_logo.png' width="350px" align="right">

# Reinforcement Learning

**OLS Regression & Efficient Markets**

&copy; Dr Yves J Hilpisch | The Python Quants GmbH

http://tpq.io | http://twitter.com/dyjh

<img src="https://hilpisch.com/aiif_cover_shadow.png" width="300px" align="left">

## Imports

In [None]:
import math
import cufflinks
import numpy as np
import pandas as pd
from pylab import plt
plt.style.use('seaborn-v0_8')
cufflinks.set_config_file(offline=True)

## Random Walks

Eugene F. Fama (1965): “Random Walks in Stock Market Prices”:

> “For many years, economists, statisticians, and teachers of finance have been interested in developing and testing models of stock price behavior. One important model that has evolved from this research is the theory of random walks. This theory casts serious doubt on many other methods for describing and predicting stock price behavior—methods that have considerable popularity outside the academic world. For example, we shall see later that, if the random-walk theory is an accurate description of reality, then the various “technical” or “chartist” procedures for predicting stock prices are completely without value.”

Michael Jensen (1978): “Some Anomalous Evidence Regarding Market Efficiency”:

> “A market is efficient with respect to an information set S if it is impossible to make economic profits by trading on the basis of information set S.”

If a stock price follows a (simple) random walk (no drift & normally distributed returns), then it rises and falls with the same probability of 50% (“toss of a coin”).

**In such a case, the best predictor of tomorrow’s stock price — in a least-squares sense — is today’s stock price.**
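
The statement can be illustrated with a small simulation. The sketch below is not part of the original analysis: it assumes a driftless random walk with standard normal price changes (seed, start level of 100, and sample size are arbitrary choices), checks how often the price rises, and compares the mean squared error of two naive predictors.

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed (assumption)
n = 10_000  # number of simulated days (assumption)

# driftless random walk: price changes are i.i.d. standard normal
steps = rng.standard_normal(n)
prices = 100 + np.cumsum(steps)  # start level of 100 is arbitrary

# share of days on which the price rises
frac_up = (steps > 0).mean()

# predictor 1: today's price; predictor 2: the price two days ago
mse_today = np.mean((prices[2:] - prices[1:-1]) ** 2)
mse_older = np.mean((prices[2:] - prices[:-2]) ** 2)

print(f'share of up moves:        {frac_up:.3f}')
print(f"MSE using today's price:  {mse_today:.3f}")
print(f'MSE using older price:    {mse_older:.3f}')
```

Under these assumptions the share of up moves should be close to 0.5, and the error should be clearly larger when the older price is used, since the most recent price already contains all the information in the earlier ones.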

### Retrieving Cross-Asset Data

In [None]:
url = 'https://certificate.tpq.io/findata.csv'

In [None]:
data = pd.read_csv(url, index_col=0, parse_dates=True)

In [None]:
data.dropna(inplace=True)

In [None]:
data.head()

In [None]:
data.tail()

In [None]:
data.info()

### Calculating the Log Returns

In [None]:
rets = np.log(data / data.shift(1))

In [None]:
rets.head()

In [None]:
rets.mean() * 252  # annualized average log returns (assuming 252 trading days per year)
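
Log returns are additive over time, which is what justifies scaling the daily mean by the number of trading days. A minimal sketch with a made-up price series (the numbers are purely illustrative and not taken from the data set):

```python
import numpy as np
import pandas as pd

# hypothetical closing prices for illustration only
prices = pd.Series([100.0, 102.0, 101.0, 105.0, 110.0])

# daily log returns, same formula as above; the first value is NaN
log_rets = np.log(prices / prices.shift(1))

# log returns add up to the log of the total gross return
total_from_sum = log_rets.sum()  # sum() skips the NaN
total_direct = np.log(prices.iloc[-1] / prices.iloc[0])

print(total_from_sum, total_direct)
```

Both numbers agree, so the sum of daily log returns over a period equals the log return of the whole period.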

### Plotting the Data

`pip install cufflinks`

In [None]:
data.normalize().iplot(kind='lines')

In [None]:
rets.iplot(kind='histogram', subplots=True)

In [None]:
rets.corr().iplot(kind='heatmap', colorscale='blues')

### Preparing Lagged Data

In [None]:
def add_lags(data, ric, lags):
    cols = []
    df = pd.DataFrame(data[ric])
    for lag in range(1, lags + 1):
        col = 'lag_{}'.format(lag)  # defines the column name
        df[col] = df[ric].shift(lag)  # creates the lagged data column
        cols.append(col)  # stores the column name
    df.dropna(inplace=True)  # gets rid of incomplete data rows
    return df, cols

In [None]:
lags = 7  # seven historical lags

In [None]:
dfs = {}
for sym in data.columns:
    df, cols = add_lags(data, sym, lags)
    dfs[sym] = df

In [None]:
cols  # the column names for the lags

In [None]:
dfs.keys()  # the keys of the dictionary

In [None]:
dfs['AAPL.O'].head(7)
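
To see what `add_lags` produces, the following sketch applies the same function to a tiny, made-up price series. The column name `'P'` and the values are hypothetical; they only serve to make the shifting visible.

```python
import pandas as pd

def add_lags(data, ric, lags):
    # same logic as the function defined above
    cols = []
    df = pd.DataFrame(data[ric])
    for lag in range(1, lags + 1):
        col = 'lag_{}'.format(lag)
        df[col] = df[ric].shift(lag)
        cols.append(col)
    df.dropna(inplace=True)
    return df, cols

# hypothetical toy data with a single column 'P'
toy = pd.DataFrame({'P': [10.0, 11.0, 12.0, 13.0, 14.0]})
toy_lagged, toy_cols = add_lags(toy, 'P', lags=2)

print(toy_cols)
print(toy_lagged)
```

Each remaining row pairs a price with the prices of the preceding days; the first `lags` rows are dropped because their lagged values are undefined.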

### Implementing OLS Regression

In [None]:
regs = {}
for sym in data.columns:
    df = dfs[sym]  # getting data for the RIC
    reg = np.linalg.lstsq(df[cols], df[sym], rcond=-1)[0]  # the OLS regression
    regs[sym] = reg  # storing the results

In [None]:
np.set_printoptions(suppress=True)

In [None]:
for sym in data.columns:
    print('{:10} | {}'.format(sym, regs[sym].round(4)))

In [None]:
rega = np.stack(tuple(regs.values()))  # combines the regression results

In [None]:
rega.mean(axis=0)  # mean values by column

In [None]:
regd = pd.DataFrame(rega, columns=cols, index=data.columns)  # converting the results to DataFrame

In [None]:
regd

In [None]:
regd.iplot(kind='bar')

In [None]:
regd.mean().iplot(kind='bar')
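
For comparison, the sketch below runs the same kind of regression on a simulated driftless random walk instead of market data. Everything in it (seed, sample size, start level, the helper name `lag_matrix`) is an illustrative assumption. For such a series one would expect a weight near 1 on the first lag and weights near 0 on the others, which gives a benchmark for reading the results above.

```python
import numpy as np

rng = np.random.default_rng(100)  # arbitrary seed (assumption)
n, lags = 5_000, 7

# simulated driftless random walk as a stand-in for a price series
prices = 100 + np.cumsum(rng.standard_normal(n))

def lag_matrix(x, lags):
    # hypothetical helper: column k holds the series lagged by k + 1 steps
    return np.column_stack([x[lags - k - 1:len(x) - k - 1] for k in range(lags)])

X = lag_matrix(prices, lags)  # regressors: lag_1, ..., lag_7
y = prices[lags:]             # target: the current price

# OLS via least squares, no intercept (as in the loop above)
beta = np.linalg.lstsq(X, y, rcond=None)[0]

# the same estimate from the normal equations (X'X) beta = X'y
beta_ne = np.linalg.solve(X.T @ X, X.T @ y)

print(beta.round(4))
print(beta.sum().round(4))
```

The two estimates coincide up to numerical precision, and the coefficients sum to roughly 1, so the fitted prediction is essentially a weighted average of recent prices dominated by the latest one.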

## Another Approach

In [None]:
import statsmodels.api as sm

In [None]:
x_ = sm.add_constant(df[cols], prepend=False)  # df and sym still refer to the last symbol of the loop above; adds an intercept column

In [None]:
y = df[sym]

In [None]:
mod = sm.OLS(y, x_)
reg = mod.fit()

In [None]:
reg.summary()

## Cross Check

In [None]:
sym = 'AAPL.O'

In [None]:
reg = regd.loc[sym].values

In [None]:
y_ = np.dot(dfs[sym][cols], reg)  # predictions

In [None]:
r = y_ - dfs[sym][sym]  # residuals

Checking the OLS assumptions:
* **linearity**: given
* **independence** (of the regressors): <b style="color: red;">not at all</b>
* **zero mean** (of the residuals): approximately
* **no correlation** (between residuals and regressors): given
* **homoscedasticity**: <b>given</b>
* **no autocorrelation** (of the residuals): given
* **stationarity**: <b style="color: red;">not given</b>

In [None]:
dfs[sym][cols].corr()  # lags highly correlated

In [None]:
r.mean()

In [None]:
np.corrcoef(r, dfs[sym]['lag_3'])

In [None]:
from scipy.stats import bartlett

In [None]:
split = len(r) // 2  # midpoint of the residual series (not of the dfs dictionary)

In [None]:
bartlett(r[:split], r[split:])

In [None]:
from statsmodels.graphics.tsaplots import plot_acf

In [None]:
fig, ax = plt.subplots(figsize=(10, 6))
plot_acf(r, ax=ax);

In [None]:
from statsmodels.tsa.stattools import adfuller

In [None]:
# adfuller?

In [None]:
adfuller(dfs[sym][sym])  # ADF statistic well above the 10% critical value (about -2.567) --> unit root not rejected, i.e. not stationary
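
Non-stationary price levels and strongly correlated lag columns are what one would expect from a random walk. A rough illustration on simulated data follows (seed, length, and start level are arbitrary assumptions; this is a sketch and does not replace the ADF test above):

```python
import numpy as np

rng = np.random.default_rng(7)  # arbitrary seed (assumption)
n = 5_000

# simulated driftless random walk and its first differences
levels = 100 + np.cumsum(rng.standard_normal(n))
diffs = np.diff(levels)

# lag-1 autocorrelation of levels versus differences
ac_levels = np.corrcoef(levels[1:], levels[:-1])[0, 1]
ac_diffs = np.corrcoef(diffs[1:], diffs[:-1])[0, 1]

print(f'lag-1 autocorrelation, levels:      {ac_levels:.4f}')
print(f'lag-1 autocorrelation, differences: {ac_diffs:.4f}')
```

In this simulation the levels are almost perfectly correlated with their own past, while the differences are close to uncorrelated, which is why returns rather than prices are typically used when stationarity matters.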

<img src="http://hilpisch.com/tpq_logo.png" alt="The Python Quants" width="35%" align="right" border="0"><br>

<a href="http://tpq.io" target="_blank">http://tpq.io</a> | <a href="http://twitter.com/dyjh" target="_blank">@dyjh</a> | <a href="mailto:training@tpq.io">training@tpq.io</a>