In [None]:
# import libs
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import random
from statsmodels.tsa import stattools as tsat

# Asset-wise Target Autocorrelation
- In this notebook, we calculate **the autocorrelation of each asset's return (`target`) time series** for lags of up to 10 periods.
- For each asset, the lag with the largest absolute autocorrelation is tabulated and its properties are examined.
- About autocorrelation: https://en.wikipedia.org/wiki/Autocorrelation
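- For reference, here is a minimal sketch of the sample autocorrelation used throughout, written with plain NumPy. The helper name `sample_acf` is hypothetical (it is not part of this notebook's code), and it assumes the usual biased estimator in which every lag is normalized by the lag-0 sum of squares.

```python
import numpy as np


def sample_acf(y, nlags):
    """Sample autocorrelation for lags 0..nlags (hypothetical helper).

    Assumes the series is not constant; otherwise the denominator is zero.
    """
    y = np.asarray(y, dtype=float)
    dev = y - y.mean()            # deviations from the sample mean
    denom = np.dot(dev, dev)      # lag-0 sum of squares
    acf = [1.0]                   # autocorrelation at lag 0 is 1 by definition
    for k in range(1, nlags + 1):
        # products of observations that are k periods apart
        acf.append(np.dot(dev[k:], dev[:-k]) / denom)
    return np.array(acf)


ac = sample_acf([1, 2, 3, 4, 5], nlags=2)
print(ac)  # lags 0, 1, 2 -> 1.0, 0.4, -0.1
```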

## Conclusion
- **Many assets have a lag-1 autocorrelation of about +0.1 to +0.2**.
- The precomputed statistics are available as a dataset: https://www.kaggle.com/farcii/ubiquantautocorr

# Details
- The statistics are calculated with code like the following.

In [None]:
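# Compute autocorrelation statistics for a small random sample of assets.
# Gaps in time_id are ignored: consecutive observations are treated as adjacent.

df_all = pd.read_parquet('../input/ubiquant-parquet/train_low_mem.parquet')
# Sample from the unique ids; sampling from the raw column can pick the same asset twice.
assets = df_all['investment_id'].unique().tolist()
assets = random.sample(assets, 5)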
RES = []

for asset in assets:
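    # Sort by time_id so that lags follow time order even if the rows are not already ordered.
    df = df_all[df_all['investment_id'] == asset].sort_values('time_id')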
    x = df['time_id'].values
    y = df['target'].values

    nlags = min(len(x)-1, 10)
    ac, qstats, pvals = tsat.acf(y, nlags=nlags, qstat=True, fft=False)
    pac = tsat.pacf_ols(y, nlags=nlags)

    time_id_min = np.min(x)
    time_id_max = np.max(x)
    time_id_span = time_id_max - time_id_min
    ac_max_lag = np.argmax(np.abs(ac[1:])) + 1
    ac_max = ac[ac_max_lag]
    LBtest_pval = pvals[-1]
    pac_max_lag = np.argmax(np.abs(pac[1:])) + 1
    pac_max = pac[pac_max_lag]

    res = np.array([asset, time_id_min, time_id_max, time_id_span, ac_max_lag, ac_max, LBtest_pval, pac_max_lag, pac_max])
    RES.append(res)

columns=['asset', 'time_id_min', 'time_id_max', 'time_id_span', 'ac_max_lag', 'ac_max', 'LBtest_pval', 'pac_max_lag', 'pac_max']
ex_res_df = pd.DataFrame(np.stack(RES), columns=columns)
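
- The code above ignores discontinuities in `time_id`. As a rough check of how much that matters for a given asset, the sketch below counts the gaps in its observations. The helper `time_gap_summary` is hypothetical, and it assumes that `time_id` is an integer that increases by exactly 1 between consecutive periods.

```python
import numpy as np


def time_gap_summary(time_ids):
    """Return (number of gaps, number of missing periods) for one asset.

    Hypothetical helper; assumes consecutive periods differ by exactly 1.
    """
    t = np.sort(np.asarray(time_ids))
    steps = np.diff(t)                    # distance between neighbouring observations
    gaps = steps[steps > 1]               # steps larger than 1 indicate skipped periods
    return int(gaps.size), int(np.sum(gaps - 1))


n_gaps, n_missing = time_gap_summary([0, 1, 2, 5, 6, 9])
print(n_gaps, n_missing)  # 2 gaps, 4 missing periods
```

- If `n_missing` is large relative to `time_id_span`, "lag 1" in the table mixes one-period and multi-period distances for that asset.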

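## Features in the result
- asset: The `investment_id`.
- ac_max_lag: The lag (1 or greater) at which the absolute value of the autocorrelation function is largest.
- ac_max: The signed autocorrelation at that lag, so it can be negative.
- LBtest_pval: P-value of the Ljung-Box Q-statistic at the largest lag computed (normally 10), which jointly tests all lags up to that one. If it is sufficiently small (for example, < 0.05), we can reject the hypothesis that there is no autocorrelation at any of those lags.
- pac_max_lag: The lag (1 or greater) at which the absolute value of the partial autocorrelation function is largest.
- pac_max: The signed partial autocorrelation at that lag.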
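
- To make `LBtest_pval` more concrete, here is a minimal sketch of the Ljung-Box statistic, $Q = n(n+2)\sum_{k=1}^{h} \hat\rho_k^2 / (n-k)$, compared against a chi-squared distribution with $h$ degrees of freedom. The helpers `sample_acf`, `ljung_box_q` and `chi2_sf_even` are hypothetical and are not the statsmodels implementation; the closed-form tail probability used here is valid only for an even number of degrees of freedom (which covers the 10 lags used in this notebook).

```python
import math

import numpy as np


def sample_acf(y, nlags):
    """Sample autocorrelation for lags 0..nlags (hypothetical helper)."""
    y = np.asarray(y, dtype=float)
    dev = y - y.mean()
    denom = np.dot(dev, dev)
    return np.array([1.0] + [np.dot(dev[k:], dev[:-k]) / denom
                             for k in range(1, nlags + 1)])


def ljung_box_q(y, nlags):
    """Ljung-Box Q-statistic over lags 1..nlags (hypothetical helper)."""
    n = len(y)
    rho = sample_acf(y, nlags)[1:]        # drop lag 0
    lags = np.arange(1, nlags + 1)
    return float(n * (n + 2) * np.sum(rho ** 2 / (n - lags)))


def chi2_sf_even(x, dof):
    """P(chi2_dof > x), using a closed form that holds only for even dof."""
    if dof % 2 != 0:
        raise ValueError("this sketch only supports an even number of degrees of freedom")
    half = x / 2.0
    return math.exp(-half) * sum(half ** j / math.factorial(j)
                                 for j in range(dof // 2))


q = ljung_box_q([1, 2, 3, 4, 5], nlags=2)
p = chi2_sf_even(q, dof=2)
print(q, p)  # a small p would indicate autocorrelation at some lag up to nlags
```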

In [None]:
ex_res_df.head()

- I computed the full `res_df` for all assets locally and uploaded it as a dataset.

In [None]:
res_df = pd.read_parquet('../input/ubiquantautocorr/autocorr_analysis.parquet')

- The absolute values of the largest autocorrelations are concentrated around 0.1. 
- It seems that there are about the same number of assets with positive autocorrelation and negative autocorrelation.

In [None]:
res_df['ac_max'].hist(bins=30, range=[-0.3,0.3])

- The lag with the largest absolute autocorrelation is **most often 1.**

In [None]:
res_df['ac_max_lag'].hist()

- Next, look at the distribution of `ac_max` for the assets whose largest absolute autocorrelation occurs at lag 1.
- Interestingly, **these assets** tend to have **positive autocorrelation.**

In [None]:
res_df[res_df['ac_max_lag']==1]['ac_max'].hist(bins=30, range=[-0.3,0.3])

- Similar results are obtained when restricting to the **assets for which the Ljung-Box test rejects the no-autocorrelation hypothesis at the 5% level**.

In [None]:
sgn_res_df = res_df[res_df['LBtest_pval'] <= 0.05]
sgn_res_df['ac_max'].hist(bins=30, range=[-0.3,0.3])

In [None]:
sgn_res_df[sgn_res_df['ac_max_lag'] == 1]['ac_max'].hist(bins=30, range=[-0.3,0.3])

- Partial autocorrelation has a similar property.
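- One reason the two agree: at lag 1 the partial autocorrelation equals the ordinary autocorrelation, and at lag 2 it is the lag-2 autocorrelation with the lag-1 effect removed. The sketch below shows this relationship (the first step of the Durbin-Levinson recursion). The helper `pacf_first_two` is hypothetical; `pacf_ols` estimates the same quantities by regression, so its numbers can differ slightly.

```python
def pacf_first_two(rho1, rho2):
    """Partial autocorrelations at lags 1 and 2 from autocorrelations rho1, rho2.

    Hypothetical helper based on the Yule-Walker / Durbin-Levinson relations.
    """
    phi11 = rho1                                      # lag 1: identical to the ACF
    phi22 = (rho2 - rho1 ** 2) / (1.0 - rho1 ** 2)    # lag 2: lag-1 effect removed
    return phi11, phi22


# Illustrative values only: an AR(1)-like pattern where rho2 = rho1 ** 2.
rho1 = 0.15
phi11, phi22 = pacf_first_two(rho1, rho1 ** 2)
print(phi11, phi22)  # the lag-2 partial autocorrelation vanishes in this case
```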

In [None]:
res_df['pac_max_lag'].hist()

In [None]:
res_df['pac_max'].hist(bins=30, range=[-0.3,0.3])

In [None]:
res_df[res_df['pac_max_lag']==1]['pac_max'].hist(bins=30, range=[-0.3,0.3])

### Please consider upvoting or commenting if you find this interesting :-)