# Time Series Processing

Version: 2022-10-13

In this notebook, we will cover how to process time series data. 


### A. Data

First let us load the data. We will use Hang Seng Index data from a csv file, 
but in practice you will probably want to pull the data with
a library such as `yfinance`.


In [None]:
import pandas as pd
import numpy as np

# Import stock data from the CSV file
stock_data = pd.read_csv("../Data/hsi.csv")
stock_data.head(36)

We will keep only the date and adjusted closing price.
We will also drop any samples with missing values.
- To drop rows with missing values:
```python
df.dropna()
```
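As a minimal sketch of this step, the toy frame below stands in for the CSV file. It assumes the file has `Date` and `Adj Close` columns (the names used later in this notebook); the `Open` column and all values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the CSV file; column names and values are assumptions
raw = pd.DataFrame({
    "Date": ["2022-01-03", "2022-01-04", "2022-01-05", "2022-01-06"],
    "Open": [100.0, 101.0, 102.0, 103.0],
    "Adj Close": [101.0, np.nan, 103.0, 104.0],
})

# Keep only the two columns of interest, then drop rows with missing values
toy = raw[["Date", "Adj Close"]].dropna()
print(toy)
```

Note that `dropna()` returns a new dataframe rather than modifying the original, so the result has to be assigned to a variable.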

In [None]:
# Keep only two columns and drop missing


# Show the data
stock_data.head(10)

### B. Datetime Index

We now carry out several time-series-specific operations:
- Convert the date to pandas **datetime** format with ```pd.to_datetime()```.
  You can then extract individual date components with ```.dt.year```, ```.dt.month``` etc.
  For example, to extract the year from a column called *date*, you can write:
  ```python
  df['date'] = pd.to_datetime(df['date'])
  df['year'] = df['date'].dt.year
  ```
- Set the date as index. This allows the use of time-series-specific features.
```python
df.index = pd.DatetimeIndex(df['date_column'])
```
- Fill in missing dates:
```python
df.asfreq(freq,method)
```
where `freq` is the desired frequency and `method` is how the columns of newly inserted dates should be filled.

See [here](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases) for a list of valid frequency aliases.

The default `method` is `None`, which leaves the columns of newly inserted dates as missing values.
You can instead propagate the last valid observation forward (`method='pad'`) or use the next valid observation (`method='backfill'`).
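The three operations can be sketched together on a toy dataframe. The dates and prices below are made up, and the column names `Date` and `Adj Close` are assumed to match the CSV file.

```python
import pandas as pd

# Toy data with a weekend gap (2022-01-08 and 2022-01-09 are missing)
toy = pd.DataFrame({
    "Date": ["2022-01-06", "2022-01-07", "2022-01-10"],
    "Adj Close": [100.0, 102.0, 101.0],
})

# Convert the date strings to datetime and extract the year
toy["Date"] = pd.to_datetime(toy["Date"])
toy["year"] = toy["Date"].dt.year

# Use the date as the index of the dataframe
toy.index = pd.DatetimeIndex(toy["Date"])

# Insert the missing calendar days, carrying the last observation forward
toy_filled = toy[["Adj Close"]].asfreq("D", method="pad")
print(toy_filled)
```

With `method="pad"`, Saturday and Sunday take Friday's price. With the default `method=None` they would appear as rows of missing values instead.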
  


In [None]:
# Convert date to pandas datetime format


# Use date as the index of the dataframe


# Show the data
stock_data.head(10)

In [None]:
# Fill in missing dates


# Show the data
stock_filled.head(10)

### C. What Do You Want to Model?

Next we have to decide what we want to model. Finance research mostly works with returns rather than prices for a variety of reasons, chief among them that returns are usually close to stationary while prices usually are not.

More generally, when you model time series, you should consider whether you want to model:
- The original time series $x_t$
- First difference $x_t - x_{t-1}$
- Percentage change  $\frac{x_t - x_{t-1}}{x_{t-1}}$ 
- Direction of movement $\unicode{x1D7D9}\left[ x_t - x_{t-1} > 0 \right]$

This decision is important because it affects what models you can use and how well they might perform. For example, modelling direction of movement is a classification task, while the other three options are regression tasks.

The pandas technique we use here is ```.shift(x)```.
This method shifts all rows down by *x* rows, so each row lines up with the value from *x* rows earlier.
A negative *x* shifts rows up instead, which gives access to future values.
This lets you write things like
```python
stock_data["Price"]/stock_data.shift(1)["Price"] - 1
```
which gives you all daily returns in a single line.
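The four candidate targets listed earlier can all be built from `.shift()`. The sketch below uses made-up prices. The column names `change`, `daily_return` and `direction` follow the names used later in this notebook, while `future_return_2` is a hypothetical name for a two-row-ahead return.

```python
import numpy as np
import pandas as pd

prices = pd.DataFrame({"Price": [100.0, 110.0, 99.0, 108.9]})

# shift(1) moves values down one row, so each row sees the previous price
prices["change"] = prices["Price"] - prices["Price"].shift(1)
prices["daily_return"] = prices["Price"] / prices["Price"].shift(1) - 1

# Direction of movement: 1 if up, 0 otherwise, missing where change is missing
up = (prices["change"] > 0).astype(float)
prices["direction"] = up.where(prices["change"].notna())

# A negative shift looks ahead: return over the next two rows
prices["future_return_2"] = prices["Price"].shift(-2) / prices["Price"] - 1
print(prices)
```

The first row has no previous price and the last two rows have no price two rows ahead, so those entries are missing.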

In [None]:
# Change in index


# Direction of movement


# Return since the previous day


# 90-day future return


# Show the data
stock_filled.head(10)

Note that `.shift()` does not take into consideration the nature of the index. If you have gaps in your data, what you get might not be what you intend:

In [None]:
# Calculate return without first filling missing dates
# We get return since previous trading day


# Show the data
stock_filled.head(10)

In our case, we may actually want the return over a given number of trading days rather than calendar days. If so, we can fill in the missing dates after computing all the necessary variables.
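The difference between the two conventions shows up on a small example. The sketch below uses made-up prices for a Thursday, a Friday and the following Monday, and compares a two-row return before and after filling the weekend.

```python
import pandas as pd

# Thursday, Friday and Monday: the weekend is missing from the index
s = pd.Series(
    [100.0, 102.0, 103.0],
    index=pd.to_datetime(["2022-01-06", "2022-01-07", "2022-01-10"]),
)
monday = pd.Timestamp("2022-01-10")

# Without filling: two rows back from Monday is Thursday (two trading days)
two_trading_days = s / s.shift(2) - 1

# After filling: two rows back from Monday is Saturday, which holds Friday's price
filled = s.asfreq("D", method="pad")
two_calendar_days = filled / filled.shift(2) - 1

print("Trading-day return: ", two_trading_days[monday])
print("Calendar-day return:", two_calendar_days[monday])
```

Both calls are `.shift(2)`, but they answer different questions because the rows represent different things.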

### D. Lag Terms

In time series modelling we often include lag terms. Some models such as `statsmodels`' `ARIMA` will compute the lag terms for you, but others will not. If you need lag terms for your model, you can generate them with `.shift()`.
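A minimal sketch of lag generation on made-up data follows. The column names `ac_1` to `ac_4` match the ones used in the cross-validation code later in this notebook.

```python
import pandas as pd

toy = pd.DataFrame({"Adj Close": [10.0, 11.0, 12.0, 13.0, 14.0, 15.0]})

# One column per lag: ac_1 is the previous row's value, ac_2 the one before, etc.
for t in range(1, 5):
    toy["ac_" + str(t)] = toy["Adj Close"].shift(t)

# The first four rows have at least one missing lag
print(toy)
```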

In [None]:
# Generate four period lag terms


# Show the data
stock_filled.head(10)

Now let us try fitting some models. First, an ARIMA from `statsmodels`. We only need to provide the variable we want to model:

In [None]:
from statsmodels.tsa.arima.model import ARIMA
# ARIMA will complain if we do not set frequency
stock_filled = stock_data.asfreq('D')
arma = ARIMA(stock_filled["Adj Close"], order=(4, 0, 0)).fit()
arma.summary()

If we want to use a non-time-series-specific model such as lasso, we will need the manually created lag terms:
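As a rough sketch of what this looks like, the example below fits a lasso on a synthetic random walk rather than the Hang Seng data. The value of `alpha` is an arbitrary choice for illustration, not a recommendation.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso

# Synthetic random walk standing in for the index (an assumption)
rng = np.random.default_rng(0)
toy = pd.DataFrame({"Adj Close": 100 + rng.normal(size=200).cumsum()})
for t in range(1, 5):
    toy["ac_" + str(t)] = toy["Adj Close"].shift(t)

# Lasso cannot handle missing values, so drop rows with incomplete lags
clean = toy.dropna()
X = clean[["ac_1", "ac_2", "ac_3", "ac_4"]]
y = clean["Adj Close"]

# Fit the model; alpha is arbitrary here
model = Lasso(alpha=0.1).fit(X, y)
print("Coefficients:", model.coef_)
print("Intercept:", model.intercept_)
```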

In [None]:
from sklearn.linear_model import Lasso 

# Create a copy of data with no missing values

#Run a lasso regression


#Coefficients
print("Coefficients:",model.coef_)
print("Intercept:",model.intercept_)

### E. Changing Frequency and Rolling Window

You can change the frequency of the data with `df.resample(freq).ops()`, where `ops` is an aggregation such as `mean`. For example, to get the weekly average return:
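A small sketch on synthetic data, assuming a dataframe with a datetime index and a `daily_return` column (the values here are made up):

```python
import numpy as np
import pandas as pd

# Two weeks of synthetic daily returns, Monday 2022-01-03 to Sunday 2022-01-16
idx = pd.date_range("2022-01-03", periods=14, freq="D")
toy = pd.DataFrame({"daily_return": np.arange(14) / 100}, index=idx)

# "W" groups the rows into weeks ending on Sunday; mean() averages each week
weekly = toy["daily_return"].resample("W").mean()
print(weekly)
```

Each row of the result is labelled with the last day of its week.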

In [None]:
# Weekly average return



Another technique is `df.rolling(window).ops()`, which applies an operation to each sample across a rolling window:
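A minimal sketch with a window of three rows on made-up numbers:

```python
import pandas as pd

prices = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# Each value is the mean of the current row and the two rows before it;
# the first two rows lack a full window and are therefore missing
rolling_mean = prices.rolling(3).mean()
print(rolling_mean)
```

An integer window counts rows, not calendar time. On data that contains only trading days, `rolling(7)` is therefore a 7-trading-day window.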

In [None]:
# Rolling 7-trading-day average



### F. Walk Forward Split

When working with time series data we need to ensure the training data comes before the validation and test data. Instead of randomly splitting the data, what we want is this:

![walk-forward-split](https://i.stack.imgur.com/padg4.gif)

The defining features are:
1. Test data must come from a later date than training data.
2. Test data in each split do not overlap. 

Scikit-learn's `TimeSeriesSplit` can produce such splits:
```python
tscv = TimeSeriesSplit(n_splits=n_splits, max_train_size=max_train_size)
for train_index, test_index in tscv.split(merged_data):
    # do something
```
Options:
- `n_splits` controls the number of splits returned. The default is 5 splits. You probably want more if you have very long time series.
- `max_train_size` specifies the maximum number of training samples in a split. The default is `None`, which means there is no limit. This also means that by default each subsequent training set is longer than the previous one, so set this number if you want the training sets to have equal size.

**It is important to note that the walk-forward split as implemented by `TimeSeriesSplit` is *deterministic*: the same data and the same settings give the same split, every time.** This is the nature of the walk-forward split, and more generally of using historical data for backtesting. There is a real chance of overfitting, because there is no guarantee that history will repeat itself in exactly the same way.

Because `tscv.split()` returns *indexes*, you are responsible for fetching the data according to the indexes.
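To see what the returned indexes look like, the sketch below splits 20 placeholder observations into 3 splits with at most 5 training observations each. The sizes are arbitrary and chosen only to keep the output short.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# 20 placeholder observations in time order
toy = np.arange(20)

# 3 splits, each with at most 5 training observations
tscv = TimeSeriesSplit(n_splits=3, max_train_size=5)
splits = list(tscv.split(toy))

# Print only the first and last position of each index array
for train_index, test_index in splits:
    print("train", train_index[0], "-", train_index[-1],
          "| test", test_index[0], "-", test_index[-1])
```

Each test block starts right after its training block, and calling `tscv.split(toy)` again yields exactly the same positions.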

In [None]:
from sklearn.model_selection import TimeSeriesSplit

# 5 splits with 14 days of training data in each split


The lists are quite long, so they are hard to see. Let us just print their range:

In [None]:
from sklearn.model_selection import TimeSeriesSplit

# 5 splits with 14 days of training data in each split; print only the range of the indexes


To fetch the actual data, use `df.iloc[list_of_indexes]`:
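As a small sketch on a made-up dataframe with a datetime index:

```python
import pandas as pd
from sklearn.model_selection import TimeSeriesSplit

# Toy data: 12 consecutive days of made-up prices
toy = pd.DataFrame(
    {"Adj Close": range(100, 112)},
    index=pd.date_range("2022-01-03", periods=12, freq="D"),
)

tscv = TimeSeriesSplit(n_splits=2)
for train_index, test_index in tscv.split(toy):
    # iloc selects rows by position, which is what split() returns
    train = toy.iloc[train_index]
    test = toy.iloc[test_index]
    print("train ends", train.index.max().date(),
          "| test starts", test.index.min().date())
```

`.iloc` is the right accessor here because the indexes are row positions, not index labels.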

In [None]:
# Fetching the actual data


Finally let us put everything together for cross validation:

In [None]:
# Predict tomorrow's stock price with past four days of stock price
# Specify number of splits here
n_splits = 20

from sklearn.linear_model import Lasso
from sklearn.model_selection import TimeSeriesSplit
import numpy as np

# Drop any observation with missing values
data = stock_data.dropna()

# Data
y = data[["Adj Close"]]
X = data[["ac_1","ac_2","ac_3","ac_4"]]

# Setup models
lasso = Lasso(alpha=500)
tscv = TimeSeriesSplit(n_splits=n_splits)

# List to store scores and predictions
oos_score_list = []
prediction_list = []

print("Split  In-sample R^2  Out-of-Sample R^2")
print("-"*40)

# Loop through the splits. Run a Lasso Regression for each split.
for i, (train_index, test_index) in enumerate(tscv.split(X)):
    
    # Fetch data based on split
    X_train = X.iloc[train_index]
    y_train = y.iloc[train_index]
    X_test = X.iloc[test_index]
    y_test = y.iloc[test_index]
    
    # Fit the model
    lasso.fit(X_train,y_train)
    
    # Record score and prediction
    oos_score = lasso.score(X_test,y_test)
    oos_score_list.append(oos_score)
    prediction = lasso.predict(X_test)
    prediction_list.append(prediction)
    
    print(str(i).center(5),
          str(round(lasso.score(X_train,y_train),2)).center(13),
          str(round(oos_score,2)).center(13)
         )
    
print("-"*40)
print("Average out-of-sample score:",round(np.mean(oos_score_list),2))

Let us plot the actual index versus the predicted index:

In [None]:
# Predicted index, actual index and corresponding dates
# Since predicted index is shorter than the actual index, we have to cut the latter
prediction = np.asarray(prediction_list).flatten()
actual = y[-1*len(prediction):].to_numpy() 
dates = data["Date"].to_numpy()
dates = dates[-1*len(prediction):]

# Line Chart
import matplotlib.pyplot as plt
plt.rcParams["figure.figsize"] = (15,5) #Size; set before plotting so it applies to this figure
plt.plot(dates,prediction,label="predict") #First line
plt.plot(dates,actual,label="actual") #Second line
plt.legend() #Show legend
plt.show()

The two series look pretty close. Unfortunately, the chart is actually quite misleading. Let us see why in the next section.

### G. What Do You Want to Model? (Cont'd)

The out-of-sample score and the line chart from above might seem to suggest that our model of Hang Seng Index works quite well. We will now see why that is in fact not the case. 

Let us zoom into the line chart:

In [None]:
# Plot only the last 90 observations
prediction = np.asarray(prediction_list).flatten()
prediction = prediction[-90:]
actual = y[-1*len(prediction):].to_numpy()
dates = data["Date"].to_numpy()
dates = dates[-1*len(prediction):]

import matplotlib.pyplot as plt
plt.plot(dates,prediction,label="predict")
plt.plot(dates,actual,label="actual")
plt.rcParams["figure.figsize"] = (15,5)
plt.xticks(rotation=45)
plt.legend()
plt.show()

We now see the issue: the predicted index is basically the actual index lagged by one period. We can hardly call this a useful prediction. This is why using a non-stationary time series as the target is problematic.
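One way to check for this effect is to compare the model with a naive forecast that simply repeats the previous value. The sketch below does this on a synthetic random walk rather than the Hang Seng data, and it swaps in a plain `LinearRegression` on one lag instead of the lasso used above. The column names `price` and `lag_1` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic random walk: by construction, the next change is unpredictable
rng = np.random.default_rng(0)
toy = pd.DataFrame({"price": 1000 + rng.normal(size=1000).cumsum()})
toy["lag_1"] = toy["price"].shift(1)
toy = toy.dropna()

# Train on the first 800 rows and test on the rest
train, test = toy.iloc[:800], toy.iloc[800:]
model = LinearRegression().fit(train[["lag_1"]], train["price"])

# Score the fitted model and the naive forecast (previous price) on the test rows
model_r2 = r2_score(test["price"], model.predict(test[["lag_1"]]))
naive_r2 = r2_score(test["price"], test["lag_1"])
print("Lag coefficient:", round(model.coef_[0], 3))
print("Model R^2:", round(model_r2, 3))
print("Naive R^2:", round(naive_r2, 3))
```

On a random walk the fitted lag coefficient is typically close to 1, so the model is doing little more than repeating the previous value. A high out-of-sample R^2 for the price level is only informative if it clearly beats the naive forecast.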

Let us try modelling return, first difference and direction of movement instead. Exact same design, just different variables. You will see they all perform very poorly.

In [None]:
# Predict tomorrow's stock return with past four days of stock return
# Specify number of splits here
n_splits = 20

# Generate four period lag terms for return
for t in range(1,5):
    stock_data["dr_"+str(t)] = stock_data["daily_return"].shift(t)

from sklearn.linear_model import Lasso
from sklearn.model_selection import TimeSeriesSplit
import numpy as np

data = stock_data.dropna()

y = data[["daily_return"]]
X = data[["dr_1","dr_2","dr_3","dr_4"]]
lasso = Lasso(alpha=0.0001)
tscv = TimeSeriesSplit(n_splits=n_splits)
oos_score_list = []
prediction_list = []

# Loop through the splits. Run a Lasso Regression for each split.
for i, (train_index, test_index) in enumerate(tscv.split(X)):
    
    # Fetch data based on split
    X_train = X.iloc[train_index]
    y_train = y.iloc[train_index]
    X_test = X.iloc[test_index]
    y_test = y.iloc[test_index]
    
    # Fit the model
    lasso.fit(X_train,y_train)
    
    # Record score and prediction
    oos_score = lasso.score(X_test,y_test)
    oos_score_list.append(oos_score)
    prediction = lasso.predict(X_test)
    prediction_list.append(prediction)
    
print("Average out-of-sample score:",round(np.mean(oos_score_list),2))    

# Chart
prediction = np.asarray(prediction_list).flatten()
prediction = prediction[-90:]
actual = y[-1*len(prediction):].to_numpy()
dates = data["Date"].to_numpy()
dates = dates[-1*len(prediction):]

import matplotlib.pyplot as plt
plt.plot(dates,prediction,label="predict")
plt.plot(dates,actual,label="actual")
plt.rcParams["figure.figsize"] = (15,5)
plt.xticks(rotation=45)
plt.legend()
plt.show()

In [None]:
# Predict tomorrow's first difference with past four days' first difference
# Specify number of splits here
n_splits = 20

# Generate four period lag terms for first difference
stock_data["change"] = (stock_data["Adj Close"] 
                                     - stock_data.shift(1)["Adj Close"])
for t in range(1,5):
    stock_data["ch_"+str(t)] = stock_data["change"].shift(t)

from sklearn.linear_model import Lasso
from sklearn.model_selection import TimeSeriesSplit
import numpy as np

data = stock_data.dropna()

y = data[["change"]]
X = data[["ch_1","ch_2","ch_3","ch_4"]]
lasso = Lasso(alpha=0.0001)
tscv = TimeSeriesSplit(n_splits=n_splits)
oos_score_list = []
prediction_list = []

# Loop through the splits. Run a Lasso Regression for each split.
for i, (train_index, test_index) in enumerate(tscv.split(X)):
    
    # Fetch data based on split
    X_train = X.iloc[train_index]
    y_train = y.iloc[train_index]
    X_test = X.iloc[test_index]
    y_test = y.iloc[test_index]
    
    # Fit the model
    lasso.fit(X_train,y_train)
    
    # Record score and prediction
    oos_score = lasso.score(X_test,y_test)
    oos_score_list.append(oos_score)
    prediction = lasso.predict(X_test)
    prediction_list.append(prediction)
    
print("Average out-of-sample score:",round(np.mean(oos_score_list),2))    

# Chart
prediction = np.asarray(prediction_list).flatten()
prediction = prediction[-90:]
actual = y[-1*len(prediction):].to_numpy()
dates = data["Date"].to_numpy()
dates = dates[-1*len(prediction):]

import matplotlib.pyplot as plt
plt.plot(dates,prediction,label="predict")
plt.plot(dates,actual,label="actual")
plt.rcParams["figure.figsize"] = (15,5)
plt.xticks(rotation=45)
plt.legend()
plt.show()

In [None]:
# Predict tomorrow's direction of movement with past four days' direction of movement
# Specify number of splits here
n_splits = 20

# Generate four period lag terms for direction of movement
stock_data["direction"] = np.where(stock_data["change"]>0,1,0)
stock_data["direction"] = np.where(stock_data["change"].isna(),
                                          np.nan,
                                          stock_data["direction"])
for t in range(1,5):
    stock_data["d_"+str(t)] = stock_data["direction"].shift(t)

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import TimeSeriesSplit
import numpy as np

data = stock_data.dropna()

y = data["direction"]
X = data[["d_1","d_2","d_3","d_4"]]
model = LogisticRegression()
tscv = TimeSeriesSplit(n_splits=n_splits)
oos_score_list = []
prediction_list = []

# Loop through the splits. Run a Logistic Regression for each split.
for i, (train_index, test_index) in enumerate(tscv.split(X)):
    
    # Fetch data based on split
    X_train = X.iloc[train_index]
    y_train = y.iloc[train_index]
    X_test = X.iloc[test_index]
    y_test = y.iloc[test_index]
    
    # Fit the model
    model.fit(X_train,y_train)
    
    # Record score and prediction
    oos_score = model.score(X_test,y_test)
    oos_score_list.append(oos_score)
    prediction = model.predict(X_test)
    prediction_list.append(prediction)
    
print("Average out-of-sample score:",round(np.mean(oos_score_list),2))    

# Chart
prediction = np.asarray(prediction_list).flatten()
prediction = prediction[-90:]
actual = y[-1*len(prediction):].to_numpy()
dates = data["Date"].to_numpy()
dates = dates[-1*len(prediction):]

import matplotlib.pyplot as plt
plt.plot(dates,prediction,label="predict")
plt.plot(dates,actual,label="actual")
plt.rcParams["figure.figsize"] = (15,5)
plt.xticks(rotation=45)
plt.legend()
plt.show()