# Chapter 1

### Time Series

- Correlation = square root of R-squared (in a simple linear regression), with the sign of the slope
- Correlation between time series:
    - if two different stocks are both trending, the correlation between their prices is high even when their movements are otherwise unrelated
    - Correct way : find the correlation between the stock returns instead (eg: correlation between daily percentage change of two stocks)
- Predicting future points using regression : dependent time series = independent time series * slope + intercept + error
- Auto-correlation:
    - correlation between a time series with a lagged version of itself
    - an "echo" that exists in all points in a time series with other points in the past
    - eg: 1,2,3,4,5,6,7 in this series second number = first number + 1, third number = second number + 1.. this exist for all points
    - Negative autocorrelation = mean reverting
        - stocks have historically negative autocorrelation over weeks
        - strategy to make money : buy after the price has gone down -> sell after it has gone up
    - Positive autocorrelation = momentum
        - commodities and currencies have historically positive autocorrelation over months
        - strategy to make money : buy after the price has gone up -> sell after it has gone down
    - An autocorrelation (ACF) graph
        - shows how many past points (lags) are useful for predicting the next point (lag 0 is the present point itself and always has autocorrelation 1)
        - helps choose a suitable model for prediction
- White Noise
    - constant mean over time
    - constant variance over time
    - 0 autocorrelation at all lags
    - Gaussian White Noise : white noise whose values follow a gaussian distribution (bell curve)
- Random Walk and White noise
    - Stock prices roughly follow a random walk, so the return (gain or percent change) is white noise (today's price - yesterday's price = noise)
    - You cannot forecast a random walk. The best guess : tomorrow's price is the same as today's price
    - random walk with drift = random walk + a constant mean step (drift)
    - So, although we cannot forecast a random walk, the sign of the drift tells us the expected direction of the walk
    - How do we check whether a series is a random walk?
        - Dickey Fuller Test : the null hypothesis is that the series is a random walk; a small p-value (eg: below 0.05) rejects it
        - Augmented Dickey Fuller Test : the same test with more than one lagged change added to the regression (the augmentation)
- Stationarity
    - Strong stationarity : Entire distribution of data is time invariant
    - Weak stationarity : mean, variance and autocorrelation of data are time invariant
    - stationary data is easier to model because it needs fewer parameters
    - non-stationary data is harder to model because it needs many parameters (potentially new parameters for each point in time)
    - eg: stock price (random walk) is non-stationary. reason : its variance grows over time, so the price 10 years from now is far more uncertain than the price tomorrow
    - eg: white noise is stationary. reason : mean, variance and auto-correlation of 100 data points are the same as those of 1000 data points
    - non-stationary to stationary : may require several transformations like:
        1. log transformation
        2. take the difference between current and a lagged version of itself (the right lag = look at acf graph)
    - Regression model:
        1. AR model : 
            - Theory : The next value should retain some information from the previous VALUE
            - todays value = mean + co-efficient * yesterday's value + noise (same form as y = mx + c)
            - co-efficient = phi. Negative phi = mean reversion, positive phi = momentum
            - -1 < phi < +1 for stationary series
            - phi = 1 for random walk (high autocorrelation), phi = 0 for white noise (no autocorrelation)
            - autocorrelation at lag k is phi^k, so it decays exponentially at a rate of phi
        2. MA model :
            - Theory: The next value should retain some information from the previous NOISE
            - todays value = mean + co-efficient * yesterday's noise + today's noise
            - co-efficient = theta. Negative theta = one-period mean reversion, positive theta = one-period momentum
            - stationary for all values of co-efficient, theta
            - theta = 0 for white noise (no autocorrelation). Unlike AR, no value of theta turns an MA(1) series into a random walk
            - autocorrelation of MA(1) is theta / (1 + theta^2) at lag 1 and zero at all higher lags (a sharp cut-off, not an exponential decay)

- Partial auto-correlation : 
    - incremental benefit of adding another lag
    - quantifies how significant adding the n-th lag is when lags 1 to (n-1) are already in the model
- Information Criteria : measure goodness of fit with a penalty on the number of parameters in the model. The best model has the lowest AIC and/or BIC among its peers.
- Cointegration model: 
    - two series can be random walk, however the distance between them may be mean reverting. 
    - Check P-cQ with Dickey Fuller to see if it is a random walk; if it is not, P and Q are cointegrated
    - eg: Owner and his dog with a leash. steps of owner or dog is separately random walk. but the distance between them is mean reverting
        - If dog falls too far behind, it gets pulled forward
        - If dog gets too far ahead, it gets pulled back
    - eg : Natural gas vs heating oil stock movement. or, economic substitutes or competing company stocks.
- Modeling flow:
    1. See if the series is stationary (Dickey Fuller test: a p-value above 0.05 means we cannot reject that the series is a random walk, i.e. non-stationary)
    2. take difference / percent change (to make it stationary)
    3. compute ACF and PACF
    4. Fit a few AR, MA and ARMA models
    5. compare AIC, BIC to find the best model
    6. Forecast

```
df['num_col'].autocorr() # autocorrelation value
# Plot ACF and PACF graph
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
plot_acf(df['num_col'], lags= 20, alpha=0.05) # alpha = 1 - confidence level (0.05 draws 95% confidence bands)
plot_pacf(df['num_col'], lags= 20, alpha=0.05)
from statsmodels.tsa.stattools import acf
acf(df['num_col']) # See acf values
# White noise
import numpy as np
noise = np.random.normal(loc=0, scale=1, size=500)
# Dickey Fuller test for random walk
from statsmodels.tsa.stattools import adfuller
adfuller(df['num_col'])
# Pure AR / MA series generation
import matplotlib.pyplot as plt
from statsmodels.tsa.arima_process import ArmaProcess
phi, theta = 0.9, 0.9 # AR coefficient and MA coefficient
ar_argen, ma_argen = np.array([1, -phi]), np.array([1]) # For pure AR series generation (note the sign flip on phi)
ar_magen, ma_magen = np.array([1]), np.array([1, theta]) # For pure MA series generation
AR_object = ArmaProcess(ar_argen, ma_argen) # Use ArmaProcess(ar_magen, ma_magen) for an MA series
simulated_data = AR_object.generate_sample(nsample=1000)
plt.plot(simulated_data)
# ARIMA modeling 
from statsmodels.tsa.arima.model import ARIMA
model = ARIMA(data, trend='t', order=(1,0,0)) # (AR, diff, MA), t for linear trend
result = model.fit()
forecast = result.get_forecast(steps=50) # Forecast next 50 values
print(result.summary())
print(result.params) # Returns the trend / constant term, the co-efficient phi and the noise variance sigma2
print(result.aic, result.bic) # AIC and BIC values of the model
# Plotting forecast
from statsmodels.graphics.tsaplots import plot_predict
fig, ax = plt.subplots()
data.plot(ax=ax)
plot_predict(result, start='2012-09-27', end='2012-10-06', alpha=0.05, ax=ax)
plt.show()
# Visualize best model : AR or MA values on X axis and AIC, BIC on Y axis
plt.plot(ar_values, aic_values, label='AIC', marker='o')
plt.plot(ar_values, bic_values, label='BIC', marker='o')
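# Co-integration model (Also check P-cQ with Dickey-Fuller)
from statsmodels.tsa.stattools import coint
coint(P,Q) # returns (test statistic, p-value, critical values)
# Regress BTC on ETH (y = mx+c); BTC and ETH are assumed to be DataFrames with a 'Price' column
import statsmodels.api as sm
ETH_with_const = sm.add_constant(ETH['Price']) # adds the intercept column
result = sm.OLS(BTC['Price'], ETH_with_const).fit()
b = result.params.iloc[1] # slope m
adf_stats = adfuller(BTC['Price'] - b*ETH['Price']) # Compute ADF
adf_p_stat =  adf_stats[1] # p-value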
### Running rate of return
row_gain_series = data.SP500.pct_change() # period return
real_val_after_gain = row_gain_series.add(1)
cumulative_return = real_val_after_gain.cumprod().sub(1)
percentage_cum_gain = cumulative_return.mul(100)
### Alternative : Use a function to apply change in each row of the column
def cumulative_return_func(row):
    return np.prod(row + 1) - 1
row_gain_series = time_df["close"].pct_change() # period return
cumulative_return = row_gain_series.rolling('30D').apply(cumulative_return_func)
percentage_cum_gain = cumulative_return.mul(100)
### Weighted index
# company_capital = no_shares * stock_price
# total_market_capital_worldwide = sum(company_capital)
# company_weight = company_capital / total_market_capital_worldwide
# company_index = company_weight * company_pct_change

```
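
The weighted-index recipe sketched in the comments above can be made concrete with a few rows of data. The three companies, their share counts, prices and percentage changes are hypothetical numbers chosen so the arithmetic is easy to follow by hand.

```python
import pandas as pd

# Hypothetical companies and numbers, for illustration only
companies = pd.DataFrame(
    {
        "shares": [100, 50, 200],
        "price": [10.0, 40.0, 5.0],
        "pct_change": [0.02, -0.01, 0.03],
    },
    index=["A", "B", "C"],
)

# company_capital = no_shares * stock_price
market_cap = companies["shares"] * companies["price"]

# company_weight = company_capital / total capital of all companies in the index
weights = market_cap / market_cap.sum()

# Each company's contribution to the index move, and the index move itself
contribution = weights * companies["pct_change"]
index_return = contribution.sum()

print(weights.to_dict())
print(round(index_return, 4))
```

Company B has half of the total capital, so its small loss cancels out A's gain and the index moves mostly because of C.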

### Python date

```
from datetime import date
from datetime import datetime
# Create date
d =  date(2017, 6, 21) # ISO format: YYYY-MM-DD
# Create a datetime
dt = datetime(year= 2017 , month= 10 , day= 1 , hour= 15 , minute= 23 , second= 25 , microsecond= 500000 )
# Change value of existing datetime
dt_changed = dt.replace(minute=0, second=0, microsecond=0)
# Sort date
dates_ordered = sorted(date_list)
# Parse datetime
dt = datetime.strptime("12/30/2017 15:19:13", "%m/%d/%Y %H:%M:%S")
d.isoformat() # Express the date in ISO 8601 format
print(d.strftime("%Y/%m/%d")) # Print date in Format: YYYY/MM/DD
print(dt.strftime("%Y-%m-%d %H:%M:%S")) # Print datetime in specific format
##### Date addition / subtraction
from datetime import timedelta
delta = d2 - d # Subtract two dates
delta.days # Elapsed time in days
delta.total_seconds() # Elapsed time in seconds
td = timedelta(days=29) # Create a 29 day timedelta
print(d + td) # Add delta with existing date
# timestamp value
ts = 1514665153.0
# Convert to datetime from timestamp and print
print(datetime.fromtimestamp(ts))
# Parsing date
import pandas as pd
df = pd.read_csv('filename.csv', parse_dates = ['date_col1', 'date_col2'], index_col='date_col3') # during import
df["date_col"] = pd.to_datetime(df["date_col"], format = "%Y-%m-%d %H:%M:%S", errors='coerce') # Parse strings with the given format, unparseable values become NaT
df["date_col"] = df["date_col"].dt.strftime("%d-%m-%Y") # Format datetimes as strings (the column is no longer datetime afterwards)
# Extract information
df["date_col"].dt.month # Extract month information
df["date_col"].dt.day_name() # Extract day name : Sunday, Monday etc
df["date_col"].dt.year # Extract year information
# Shift dates
df["date_col"].shift(periods=1) # Push values 1 row below, first value becomes null. LAG (each row sees the previous value)
df["date_col"].shift(periods=-1) # Pull values 1 row above, last value becomes null. LEAD (each row sees the next value)
# Numeric operations on numeric columns of a time-indexed dataframe
df["num_col1"].div(df["num_col2"]) # ratio of two columns (.sub(1).mul(100) turns it into a percentage change)
df["num_col1"].pct_change(periods=3) # percentage change of the same column against the value 3 rows earlier
df["num_col1"].diff() # Difference in value between 2 adjacent rows of same column
df["num_col1"].sub(1).mul(100) # subtracting 1 from the column, then multiply 100 with the column
# Creating date
time_stamp1 = pd.Timestamp(datetime(2017, 1, 1))
time_stamp2 = pd.Timestamp('2017-01-01')
# Creating period
period = pd.Period('2017-01') # default: month-end period
period + 2 # period after 2 unit ('2017-03' in this case)
period.asfreq('D') # convert to daily
period.to_timestamp() # Convert period to timestamp
time_stamp1.to_period('M') # Convert timestamp to period
# Add missing time values / change frequency (can be alternative to .asfreq)
monthly_dates = pd.date_range(start, end, freq="M") # month-end dates; pandas 2.2+ prefers freq="ME"
monthly = pd.Series(data=df["x"], index=monthly_dates)
weekly_dates = pd.date_range(start, end, freq="W")
monthly.reindex(weekly_dates)
# Create time series
t_series = pd.date_range(start='2017-1-1', periods=12, freq='M')
df.set_index('date_col', inplace=True) # setting the time series as index of dataframe
# Sampling date (Make sure the index of dataframe is time series), try to always use .resample
timed_df.resample('D').mean() # Down-sampling to daily frequency using .resample and mean aggregation
timed_df.resample('1H').interpolate(method='linear') # Up-sampling with .resample, and fill missing values linearly
timed_df.asfreq('1H', method='ffill') # Up-sampling with .asfreq and fill missing values with forward fill
timed_df.asfreq(freq='3H') # Down-sampling with .asfreq: keeps the value at every 3rd hour without aggregating (method only accepts ffill / bfill)
timed_df.resample('M', on = 'date_col')['col1'].mean() # Standard syntax
resampled_df.size() # Resampling count
# Normalization and comparison of time series data
first_row = time_df.iloc[0]
normalized = time_df.div(first_row).mul(100)
comparison_df = normalized.sub(df['normalized_benchmark_series'], axis=0)
# Add timezone in a datetime column
df['date_col'] = df['date_col'].dt.tz_localize('America/New_York', ambiguous='NaT')
# Convert to another timezone
df['date_col'] = df['date_col'].dt.tz_convert('Europe/London')
# Window functions:
time_df.rolling(window='30D').agg(['mean', 'std']) # moving range / rolling window
time_df.expanding().agg(['mean', 'sum']) # expanding range / cumulative expanding window
```
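
To see `.shift`, `.pct_change` and `.resample` from the block above acting on actual values, here is a small sketch. The six daily prices are hypothetical and exist only to make the outputs easy to check by eye.

```python
import pandas as pd

# Hypothetical daily prices indexed by date
idx = pd.date_range("2017-01-01", periods=6, freq="D")
prices = pd.Series([100.0, 102.0, 101.0, 105.0, 107.0, 106.0], index=idx)

lagged = prices.shift(periods=1)   # each row sees the previous day's value (lag)
led = prices.shift(periods=-1)     # each row sees the next day's value (lead)

# Daily return two ways: by hand with the lagged series, and with pct_change
manual_return = prices.div(lagged).sub(1)
builtin_return = prices.pct_change()

# Down-sample to 2-day buckets by averaging the values in each bucket
two_day_mean = prices.resample("2D").mean()

print(pd.DataFrame({"price": prices, "lag": lagged, "lead": led}))
print(two_day_mean)
```

The first row of `lagged` and the last row of `led` are null because there is no earlier or later observation to borrow from.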

### Correlation between values vs Correlation between percent changes


<center><img src="images/01.01.png"  style="width: 400px, height: 300px;"/></center>
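
A quick simulation shows why correlation should be computed on returns and not on prices. The two "stocks" below are hypothetical, independent random walks with the same upward drift; the drift of 0.5, the starting price of 100 and the seed are assumptions made for the sketch.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 1000

# Two independent, simulated stocks: each is a random walk with upward drift
stock_a = pd.Series(100 + np.cumsum(0.5 + rng.normal(size=n)))
stock_b = pd.Series(100 + np.cumsum(0.5 + rng.normal(size=n)))

# Correlation of price levels is inflated by the shared upward trend
corr_levels = stock_a.corr(stock_b)

# Correlation of daily percentage changes reflects the (absent) real relationship
corr_returns = stock_a.pct_change().corr(stock_b.pct_change())

print(f"levels: {corr_levels:.2f}, returns: {corr_returns:.2f}")
```

The two series share nothing but a trend, yet their levels look almost perfectly correlated, while their returns are close to uncorrelated.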


### Positive and Negative Autocorrelation

<center><img src="images/01.02.png"  style="width: 400px, height: 300px;"/></center>
<center><img src="images/01.03.png"  style="width: 400px, height: 300px;"/></center>
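
Momentum and mean reversion can be reproduced with a simulated AR(1) series. The helper `simulate_ar1` is a hypothetical function written for this sketch, and the values phi = 0.7 and phi = -0.7 are arbitrary choices.

```python
import numpy as np
import pandas as pd

def simulate_ar1(phi, n=2000, seed=0):
    """Simulate x[t] = phi * x[t-1] + noise[t] with standard normal noise."""
    rng = np.random.default_rng(seed)
    noise = rng.normal(size=n)
    x = np.zeros(n)
    for t in range(1, n):
        x[t] = phi * x[t - 1] + noise[t]
    return pd.Series(x)

momentum = simulate_ar1(phi=0.7)         # positive autocorrelation
mean_reverting = simulate_ar1(phi=-0.7)  # negative autocorrelation

momentum_ac = momentum.autocorr(lag=1)
mean_reverting_ac = mean_reverting.autocorr(lag=1)
print(f"momentum: {momentum_ac:.2f}, mean reverting: {mean_reverting_ac:.2f}")
```

The lag-1 autocorrelation should land near the phi used to generate each series: an up move tends to follow an up move in the first case, and a down move in the second.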


# Chapter 2

### Autocorrelation Examples

<center><img src="images/02.01.png"  style="width: 400px, height: 300px;"/></center>
<center><img src="images/02.02.png"  style="width: 400px, height: 300px;"/></center>


### White Noise : A perfect example of stationary time series

<center><img src="images/02.03.png"  style="width: 400px, height: 300px;"/></center>
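
The three defining properties of white noise can be checked numerically. This sketch assumes Gaussian white noise with mean 0 and standard deviation 1; the sample size and seed are arbitrary.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
noise = pd.Series(rng.normal(loc=0, scale=1, size=5000))

# Constant mean and variance: both halves of the sample should look alike
first_half, second_half = noise.iloc[:2500], noise.iloc[2500:]
print("means:", first_half.mean(), second_half.mean())
print("stds: ", first_half.std(), second_half.std())

# Zero autocorrelation: every lag should be close to 0
lag_autocorr = [noise.autocorr(lag=k) for k in range(1, 6)]
print("autocorrelation at lags 1-5:", np.round(lag_autocorr, 3))
```

Sample values never come out exactly equal or exactly zero, which is why ACF plots draw a confidence band around zero.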


### Random Walk and Dickey-Fuller Test

<center><img src="images/02.04.png"  style="width: 400px, height: 300px;"/></center>
<center><img src="images/02.05.png"  style="width: 400px, height: 300px;"/></center>


### Non-stationary time series

<center><img src="images/02.06.png"  style="width: 400px, height: 300px;"/></center>
<center><img src="images/02.07.png"  style="width: 400px, height: 300px;"/></center>
<center><img src="images/02.08.png"  style="width: 400px, height: 300px;"/></center>


### Transformation : non-stationary to stationary

<center><img src="images/02.09.png"  style="width: 400px, height: 300px;"/></center>
<center><img src="images/02.10.png"  style="width: 400px, height: 300px;"/></center>
<center><img src="images/02.11.png"  style="width: 400px, height: 300px;"/></center>
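
The two transformations, log and then difference, can be sketched on a simulated price that grows exponentially. The series is built from hypothetical stationary log-returns (mean 0.001, standard deviation 0.01), so the transformation should recover exactly those returns.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)

# Stationary building block: simulated daily log-returns
log_returns = rng.normal(loc=0.001, scale=0.01, size=500)

# Non-stationary price: exponential growth with a random-walk component
price = pd.Series(100 * np.exp(np.cumsum(log_returns)))

# Step 1: log transformation turns exponential growth into a linear trend
log_price = np.log(price)

# Step 2: difference with the lag-1 version removes the trend and the random walk
stationary = log_price.diff().dropna()

print(stationary.describe())
```

For seasonal data the same idea applies with a longer lag, for example `.diff(4)` on quarterly data, with the lag read from the ACF graph.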


# Chapter 3

### AR series with different phi

<center><img src="images/03.01.png"  style="width: 400px, height: 300px;"/></center>

### Effect of phi on Autocorrelation

<center><img src="images/03.02.png"  style="width: 400px, height: 300px;"/></center>
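
For an AR(1) series the autocorrelation at lag k is phi^k, which can be compared against a simulated series. The helper `sample_acf` is a hypothetical function written for this sketch, and phi = 0.9 is an arbitrary choice.

```python
import numpy as np

def sample_acf(x, max_lag):
    """Sample autocorrelation at lags 0..max_lag."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    denom = np.dot(x, x)
    return np.array(
        [np.dot(x[: len(x) - k], x[k:]) / denom for k in range(max_lag + 1)]
    )

rng = np.random.default_rng(5)
phi, n = 0.9, 10000
noise = rng.normal(size=n)

# Simulate x[t] = phi * x[t-1] + noise[t]
x = np.zeros(n)
for t in range(1, n):
    x[t] = phi * x[t - 1] + noise[t]

lags = np.arange(6)
theoretical = phi ** lags           # exponential decay at rate phi
estimated = sample_acf(x, max_lag=5)

for k in lags:
    print(f"lag {k}: theory {theoretical[k]:.3f}, sample {estimated[k]:.3f}")
```

With a negative phi the same formula makes the autocorrelation alternate in sign while shrinking in size.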


### AR model with multiple lags

<center><img src="images/03.03.png"  style="width: 400px, height: 300px;"/></center>


### PACF of AR with different lags

<center><img src="images/03.04.png"  style="width: 400px, height: 300px;"/></center>
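
One way to read the PACF is as a regression coefficient: the value at lag k is the coefficient on the k-th lag when the series is regressed on lags 1 to k. The sketch below uses that definition on a simulated AR(2) series. The helper `pacf_by_regression` and the coefficients 0.6 and 0.3 are assumptions for illustration; `statsmodels` offers its own `pacf` function with several estimation methods.

```python
import numpy as np

def pacf_by_regression(x, max_lag):
    """PACF at lag k = coefficient on x[t-k] when regressing x[t] on lags 1..k."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    values = []
    for k in range(1, max_lag + 1):
        # Columns are x[t-1], x[t-2], ..., x[t-k] for t = k..n-1
        lagged = np.column_stack([x[k - j : len(x) - j] for j in range(1, k + 1)])
        target = x[k:]
        coefs, *_ = np.linalg.lstsq(lagged, target, rcond=None)
        values.append(coefs[-1])
    return np.array(values)

rng = np.random.default_rng(4)
n = 5000
noise = rng.normal(size=n)

# Simulated AR(2): x[t] = 0.6 * x[t-1] + 0.3 * x[t-2] + noise[t]
x = np.zeros(n)
for t in range(2, n):
    x[t] = 0.6 * x[t - 1] + 0.3 * x[t - 2] + noise[t]

pacf_values = pacf_by_regression(x, max_lag=5)   # index 0 holds lag 1
print(np.round(pacf_values, 3))
```

The PACF is clearly non-zero at lags 1 and 2 and close to zero afterwards, which is how the plot suggests the order of an AR model.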


# Chapter 4

### MA with multiple lags

<center><img src="images/04.02.png"  style="width: 400px, height: 300px;"/></center>


### Effect of theta on Autocorrelation of MA(1) model

<center><img src="images/04.01.png"  style="width: 400px, height: 300px;"/></center>
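
For an MA(1) series the lag-1 autocorrelation is theta / (1 + theta^2) and every higher lag is zero. A simulated check, assuming theta = 0.5 and standard normal noise; the helper `autocorr` is hypothetical and written for this sketch.

```python
import numpy as np

rng = np.random.default_rng(11)
theta, n = 0.5, 20000
noise = rng.normal(size=n + 1)

# x[t] = noise[t] + theta * noise[t-1]
x = noise[1:] + theta * noise[:-1]

def autocorr(series, lag):
    """Sample autocorrelation at a single positive lag."""
    centered = series - series.mean()
    return np.dot(centered[:-lag], centered[lag:]) / np.dot(centered, centered)

theoretical_lag1 = theta / (1 + theta**2)
lag1, lag2, lag3 = autocorr(x, 1), autocorr(x, 2), autocorr(x, 3)

print(f"lag 1: theory {theoretical_lag1:.3f}, sample {lag1:.3f}")
print(f"lag 2: sample {lag2:.3f}, lag 3: sample {lag3:.3f}")
```

The sharp cut-off after lag 1 is the signature that separates an MA(1) series from an AR(1) series on an ACF plot.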

### AR model can be converted to MA model

<center><img src="images/04.04.png"  style="width: 400px, height: 300px;"/></center>
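
Substituting the AR(1) equation into itself repeatedly gives a sum of past noise terms with weights 1, phi, phi^2 and so on, which is an MA model with infinitely many lags. The sketch below checks this numerically with a truncated sum; phi = 0.5 and the cut-off of 30 weights are assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
phi, n, n_weights = 0.5, 300, 30
noise = rng.normal(size=n)

# AR(1) recursion, starting from x[0] = noise[0]
ar_series = np.zeros(n)
ar_series[0] = noise[0]
for t in range(1, n):
    ar_series[t] = phi * ar_series[t - 1] + noise[t]

# Truncated MA form: x[t] ~ sum of phi**k * noise[t-k] for k = 0..n_weights-1
ma_weights = phi ** np.arange(n_weights)
ma_series = np.convolve(noise, ma_weights)[:n]

max_gap = np.max(np.abs(ar_series - ma_series))
print(f"largest difference between the two forms: {max_gap:.2e}")
```

Because the weights shrink geometrically, a few dozen MA terms already reproduce the AR(1) series almost exactly when phi is well inside (-1, 1).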


### ARIMA Model : The combination of both AR and MA, with differencing of the series (the "integrated" part)

<center><img src="images/04.03.png"  style="width: 400px, height: 300px;"/></center>
