# Lecture 11 - Time Series Analysis

## 1. Overview

**Time series analysis** is crucial for financial data, as stock prices, economic indicators, and sales forecasts are often dependent on time. 

**What is Time Series Data?**

- A time series is a sequence of data points recorded at successive points in time, usually at equally spaced intervals (e.g., one observation per trading day).
    - Examples in Finance: Stock prices, interest rates, GDP growth, and exchange rates.

**Components of Time Series Data**
- **Trend:** Long-term increase or decrease in the data.
- **Seasonality:** Repeating patterns or cycles (e.g., sales increasing during the holiday season).
- **Noise/Residual:** Random fluctuations that are not explained by the model.
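
To make the three components concrete, the following minimal sketch builds a purely synthetic monthly series as the sum of a trend, a seasonal pattern, and noise. All numbers (level, slope, amplitude, noise scale) are hypothetical and chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical synthetic data: two years of monthly observations.
rng = np.random.default_rng(seed=0)
n_obs = 24
idx = pd.date_range("2020-01-01", periods=n_obs, freq="MS")  # month-start dates
t = np.arange(n_obs)

trend = 100 + 0.5 * t                         # steady long-term increase
seasonality = 5 * np.sin(2 * np.pi * t / 12)  # pattern repeating every 12 months
noise = rng.normal(scale=1.0, size=n_obs)     # unexplained random fluctuations

components = pd.DataFrame(
    {"trend": trend, "seasonality": seasonality, "noise": noise}, index=idx
)
series = components.sum(axis=1).rename("value")  # additive combination

print(components.head())
print(series.head())
```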

This notebook covers:

1. The **basics of time series** data and its components.
2. How to **manipulate** and **visualize** time series data with `pandas` and `matplotlib`.
3. How to apply **basic time series models** such as moving averages and correlations.

##### Setting the environment

In [None]:
import numpy as np 
import pandas as pd
from pylab import mpl, plt 
plt.style.use('seaborn-v0_8-dark') 
mpl.rcParams['font.family'] = 'serif' 
%matplotlib inline

## 2. Data inspection

The first part of the analysis is to **inspect** the data set containing the time series.

**Inspection steps:**
1. **Import** data
2. Generate **summary statistics**
3. Analyze **changes over time**
4. Adjust **frequency** (**resampling**)

### 2.1 Data import

For this part, we work with a standard `csv` file obtained from the **Thomson Reuters Eikon Data API**.
The data contains **end-of-day (EOD) price data** for a selection of instruments.

The following parameters apply:
```python
    file_path = 'Data/11/'
    file_name = 'tr_eikon_eod_data.csv'
```

##### Check file

In [None]:
# Data from the Thomson Reuters (TR) Eikon Data API
file_path = 'Data/11/'
file_name = 'tr_eikon_eod_data.csv'
file = open(file_path + file_name, 'r')

In [None]:
file.readlines()[:5]

In [None]:
file.close()

##### Import into `dataframe`

In [None]:
# index_col = 0: the first column shall be handled as an index.
# parse_dates = True: the index values are of type datetime.
data = pd.read_csv(file_path + file_name, index_col = 0, parse_dates = True)

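- `index_col = 0` uses the first column (the dates) as the row index, so **time becomes the label** of each observation.
- `parse_dates = True` tells `pandas` to interpret the index values as `datetime` objects.
    - From the documentation: `If True -> try parsing the index.`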
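
The effect of the two arguments can be checked without the data file. The sketch below reads a small in-memory CSV; its contents and the column names `AAA` and `BBB` are hypothetical stand-ins for the real instruments.

```python
import io
import pandas as pd

# Hypothetical three-row file: a date column followed by two price columns.
csv_text = """Date,AAA,BBB
2020-01-02,100.0,50.0
2020-01-03,101.5,49.5
2020-01-06,102.0,
"""

sample = pd.read_csv(io.StringIO(csv_text), index_col=0, parse_dates=True)

print(type(sample.index).__name__)  # DatetimeIndex
print(sample.index.dtype)           # a datetime64 dtype, not object/str
print(sample.isna().sum())          # the empty field is read as a missing value
```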


##### Inspect `dataframe`

In [None]:
data.info()

In [None]:
data.head()

In [None]:
data.tail()

##### Visualize time series

In [None]:
data.plot(figsize = (10,12), subplots = True);

##### Add labels

Labeling from *Reuters Instrument Codes* (RICs)

In [None]:
instruments = ['Apple Stock', 'Microsoft Stock',
               'Intel Stock', 'Amazon Stock', 'Goldman Sachs Stock',
               'SPDR S&P 500 ETF Trust', 'S&P 500 Index',
               'VIX Volatility Index', 'EUR/USD Exchange Rate',
               'Gold Price', 'VanEck Vectors Gold Miners ETF',
               'SPDR Gold Trust']

In [None]:
for ric, name in zip(data.columns, instruments):
    print('{:8s} | {}'.format(ric, name))

### 2.2 Summary statistics

##### Built-in tools

In [None]:
data.describe().round(2)

In [None]:
data.mean()

##### Customized statistics

In [None]:
data.aggregate(['min', 'mean', 'std', 'median', 'max']).round(2)

### 2.3 Changes over time

Statistical analysis methods are often based on **changes over time** and not the absolute values themselves. 

There are multiple options to calculate the changes in a time series over time:
- Absolute differences
- Percentage changes
- Logarithmic (log) returns.

#### **Absolute differences**

`.diff()`: subtracts the previous row's value from each row's value, i.e., it computes $P_t - P_{t-1}$.

- It reveals the exact change in values from one time step to the next.
- The method returns a `dataframe`
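
A minimal sketch with hypothetical prices shows what `.diff()` returns and that it matches subtracting a shifted copy of the series:

```python
import pandas as pd

prices = pd.Series([100.0, 102.0, 101.0, 105.0])  # hypothetical prices

step1 = prices.diff()              # P_t - P_{t-1}
step2 = prices.diff(periods=2)     # P_t - P_{t-2}
manual = prices - prices.shift(1)  # same as step1, written out explicitly

# The first row (first two rows for periods=2) has no predecessor and is NaN.
print(pd.DataFrame({"price": prices, "diff_1": step1, "diff_2": step2}))
```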

In [None]:
data.diff().head()

In [None]:
data.diff(periods=2)

In [None]:
data.diff().mean()

#### **Percentage changes**

`.pct_change()`: calculates the relative change between consecutive rows, $P_t / P_{t-1} - 1$, expressed as a fraction (a value of 0.05 means +5%).

- It reveals the relative change in values from one time step to the next.
- The method returns a `dataframe`
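
As an illustration with hypothetical prices, the sketch below reproduces `.pct_change()` by hand. It also shows that a gain of 10% followed by a loss of 10% does not bring the price back to its starting level, which is one reason why simple percentage changes cannot just be added up over time.

```python
import pandas as pd

prices = pd.Series([100.0, 110.0, 99.0])  # hypothetical prices

pct = prices.pct_change()              # P_t / P_{t-1} - 1
manual = prices / prices.shift(1) - 1  # the same calculation written out

print(pct)  # NaN, then roughly +0.10 and -0.10

# +10% followed by -10% ends below the starting price of 100.
print(prices.iloc[-1])
```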

In [None]:
data.pct_change().round(3).head()

In [None]:
data.pct_change(periods = 7).round(3).head(10)

In [None]:
data.pct_change().mean().plot(kind = 'bar', figsize = (10,6));

#### **Log Returns**

**Logarithmic (log) returns** of time series data are the standard means to analyze returns on investments over time. 

The formula is given by
$$
\text{Log Return} = \ln\left(\frac{P_t}{P_{t-1}}\right)
$$

In `pandas`, the denominator $P_{t-1}$ is obtained by shifting the data down by one row with the `.shift()` method.

In [None]:
rets = np.log(data / data.shift(1))

In [None]:
rets.head().round(2)

**Cumulative returns** over a period are obtained by summing the log returns of each interval and then exponentiating the result, which gives the growth factor $P_T / P_0$ over the whole period:
$$
\text{Cumulative Return} = e^{\sum \text{Log Returns}}
$$
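
The identity can be verified numerically. In this sketch (hypothetical prices), the exponentiated sum of the log returns equals the ratio of the last price to the first:

```python
import numpy as np
import pandas as pd

prices = pd.Series([100.0, 104.0, 102.0, 110.0])  # hypothetical prices

log_rets = np.log(prices / prices.shift(1)).dropna()

# Running growth factor P_t / P_0 for every date after the first.
cumulative = np.exp(log_rets.cumsum())

# Growth factor over the full period: the log returns telescope to ln(P_T / P_0).
total = np.exp(log_rets.sum())

print(total)
print(prices.iloc[-1] / prices.iloc[0])
```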

In [None]:
rets.cumsum().apply(np.exp).plot(figsize = (10,6));

### 2.4 Resampling

**Resampling** of financial time series data refers to the process of **converting the frequency of data points** in a time series.

The `resample()` method in `pandas` is used to change the frequency of time series data. 

```python 
    data.resample(rule, label='right', closed='right', kind='timestamp')
```

Parameters:

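1. `rule`: This is a required parameter and specifies the new frequency for resampling. Some common time-based frequency strings are:
    - `'D'`: Day
    - `'W'`: Week
    - `'ME'`: Month end (`'M'` before pandas 2.2)
    - `'QE'`: Quarter end (`'Q'` before pandas 2.2)
    - `'YE'`: Year end (`'A'` or `'Y'` before pandas 2.2)

    One can also specify multiples such as `'5min'`, `'15min'` (15 minutes), or `'3h'` (3 hours); the older spellings `'15T'` and `'3H'` are deprecated in recent pandas versions.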
2. `label`: Determines how the timestamp labels in the resulting data are aligned:
    - `'right'`: Assigns the label to the end of the resampling period (e.g., a week ending on Sunday will be labeled as Sunday).
    - `'left'`: Assigns the label to the beginning of the resampling period (e.g., the first day of the week).
3. `closed`: Specifies which side of each interval is closed:
    - `'right'`: The interval includes the right endpoint.
    - `'left'`: The interval includes the left endpoint.
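4. `kind`: Defines the type of index used (this keyword is deprecated since pandas 2.2, where the index should instead be converted explicitly, e.g., with `.to_period()`):
    - `'timestamp'`: Generates a DatetimeIndex.
    - `'period'`: Generates a PeriodIndex.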
    
**Aggregation functions**: After resampling, you can apply an aggregation method directly, like `mean()`, `sum()`, `last()`, `first()`, `count()`, etc. These specify how to aggregate data within each new time interval.
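
The effect of `label` is easiest to see on a tiny example. The sketch below uses a hypothetical daily series covering two full weeks (Monday 2024-01-01 to Sunday 2024-01-14) and resamples it to weekly frequency. The bins are identical in both calls; only the timestamp attached to each bin changes.

```python
import numpy as np
import pandas as pd

# Hypothetical daily values 1, 2, ..., 14 starting on a Monday.
daily = pd.Series(
    np.arange(1.0, 15.0),
    index=pd.date_range("2024-01-01", periods=14, freq="D"),
)

# 'W' means weeks ending on Sunday; each bin is closed on the right by default.
right = daily.resample("W", label="right").last()  # labeled with the Sunday that ends the week
left = daily.resample("W", label="left").last()    # labeled with the Sunday before the week starts
weekly_mean = daily.resample("W").mean()           # a different aggregation on the same bins

print(right)
print(left)
print(weekly_mean)
```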

In [None]:
data.resample('1W', label='right').last().head()

In [None]:
# Resample to quarterly data, labeling each bin with its left edge (the end of the previous quarter)
data.resample('QE', label='left').mean().head()

In [None]:
data.resample('1ME', label = 'right').last().head()

In [None]:
rets.cumsum().apply(np.exp).resample('1ME', label='right').last().plot(figsize=(10, 6));

## 3. Rolling statistics

A **rolling window** is a technique used to **apply a calculation to a specific, fixed-size subset** of data, which “rolls” or **moves across a dataset** as a window. 

The purpose of a rolling window is to compute statistics, like the mean or standard deviation, for consecutive subsets of data points, creating a dynamic, time-dependent **view of trends, averages, or variability**. 

This technique is commonly used in time series analysis, especially in finance, to **understand patterns over time while smoothing out short-term fluctuations**.

In `Python`, the `.rolling()` method in `pandas` is used to apply a rolling window to a `DataFrame` or `Series`. 

This method returns a **“rolling” object** that can apply various aggregation functions, like `.mean()`, `.std()`, `.min()`, etc., over the rolling window.

```python
    data.rolling(window=window_size).function()
```
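
A short sketch on a hypothetical five-point series shows the mechanics: with `window=3`, the first two rows have too few observations and are `NaN`, unless `min_periods` relaxes that requirement.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])  # hypothetical values

roll_mean = s.rolling(window=3).mean()                    # needs 3 observations per window
roll_partial = s.rolling(window=3, min_periods=1).mean()  # accepts shorter windows at the start
roll_max = s.rolling(window=3).max()

print(pd.DataFrame({"value": s, "mean_3": roll_mean,
                    "mean_3_partial": roll_partial, "max_3": roll_max}))
```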

In [None]:
# Let's focus on a single financial time series
sym = 'AAPL.O'
data = pd.DataFrame(data[sym]).dropna()
data.tail()

In [None]:
window = 20

**- Calculate rolling minimum (`min`) and maximum (`max`):** identify the range of prices over the past 20 days. 

In [None]:
data['min'] = data[sym].rolling(window=window).min()

In [None]:
data['max'] = data[sym].rolling(window=window).max()

**- Calculate rolling mean (`mean`) and standard deviation (`std`):** The rolling mean provides a smoothed version of the price series. It smooths out short-term fluctuations, highlighting the medium-term trend. The standard deviation statistic shows the volatility of the stock price over each 20-day period.

In [None]:
data['mean'] = data[sym].rolling(window=window).mean()

In [None]:
data['std'] = data[sym].rolling(window=window).std()

In [None]:
data['median'] = data[sym].rolling(window=window).median()

**- Calculate Exponentially Weighted Moving Average (`ewma`):** Unlike a simple moving average, which weights all points equally, the **EWMA** gives more importance to recent observations, allowing it to react faster to recent price changes. The `halflife` parameter controls how quickly the weights decay, with a shorter halflife emphasizing more recent data.
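
To see what `halflife` does, the sketch below recomputes the last EWMA value by hand on a hypothetical series. It assumes the `pandas` defaults (`adjust=True`), under which the decay factor is $\alpha = 1 - e^{-\ln 2 / \text{halflife}}$ and each observation's weight is $(1-\alpha)^i$, where $i$ is its age in rows.

```python
import numpy as np
import pandas as pd

s = pd.Series([10.0, 11.0, 12.0, 20.0])  # hypothetical prices
halflife = 2.0                            # weights halve every 2 rows

ewma = s.ewm(halflife=halflife).mean()    # adjust=True is the default

# Manual calculation of the last value: a weighted average with
# weights (1 - alpha)**i for the observation that is i rows old.
alpha = 1 - np.exp(-np.log(2) / halflife)
weights = (1 - alpha) ** np.arange(len(s))  # newest observation first
manual_last = np.sum(weights * s.values[::-1]) / weights.sum()

print(ewma.iloc[-1], manual_last)
print(weights / weights[0])  # relative weights: 1 for the newest, 0.5 two rows back
```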

In [None]:
data['ewma'] = data[sym].ewm(halflife=0.5, min_periods=window).mean()

In [None]:
data.head(25)

In [None]:
data.dropna().head()

**- Plotting the Rolling Statistics:**

In [None]:
ax = data[['min', 'mean', 'max']].iloc[-200:].plot(
    figsize = (10,6), style = ['g--', 'r--', 'g--'], lw = 0.8)
data[sym].iloc[-200:].plot(ax = ax, lw = 2.0) ;

#### Technical Analysis Example: SMAs 

A decades-old trading strategy based on technical analysis uses **two simple moving averages** (SMAs), one computed over a shorter window and one over a longer window.

**Trading strategy:**
- Go long on a stock (or financial instrument in general) when the shorter-term SMA is above the longer-term SMA 
- Go short when the opposite holds true. 

In [None]:
data['SMA1'] = data[sym].rolling(window=42).mean()
data['SMA2'] = data[sym].rolling(window=252).mean()

In [None]:
data[[sym, 'SMA1', 'SMA2']].tail()

In [None]:
data[[sym, 'SMA1', 'SMA2']].plot(figsize=(10, 6));

SMAs are then used to derive positions to implement a trading strategy. 

Denote
- a long position by a value of 1 
- a short position by a value of -1. 

The change in the position is triggered by a crossover of the two lines representing the SMA time series:

In [None]:
data.dropna(inplace = True)

In [None]:
data['positions'] = np.where(data['SMA1'] > data['SMA2'], 1, -1)

In [None]:
ax = data[[sym, 'SMA1', 'SMA2', 'positions']].plot(
    figsize = (10,6), secondary_y = 'positions')
ax.get_legend().set_bbox_to_anchor((0.25,0.85));
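
The crossover dates themselves can be recovered from a positions column by looking for rows where the position changes. The following is a minimal sketch on a hypothetical, purely synthetic price path (a sine wave); the windows of 10 and 50 rows are chosen only to keep the example small and are not the ones used above.

```python
import numpy as np
import pandas as pd

# Hypothetical price path: a smooth cycle around 100 (not real market data).
t = np.arange(300)
df = pd.DataFrame({"price": 100 + 10 * np.sin(2 * np.pi * t / 150)})

df["SMA1"] = df["price"].rolling(window=10).mean()  # shorter-term average
df["SMA2"] = df["price"].rolling(window=50).mean()  # longer-term average
df = df.dropna().copy()                             # keep rows where both SMAs exist

# Long (+1) while the short SMA is above the long SMA, short (-1) otherwise.
df["positions"] = np.where(df["SMA1"] > df["SMA2"], 1, -1)

# A crossover is any row whose position differs from the previous row.
switches = df["positions"].diff().fillna(0) != 0
print(df.loc[switches, ["SMA1", "SMA2", "positions"]])
```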

## 4. Correlation analysis

### 4.1 Inspection of two time series

Let us consider the correlation analysis between two financial time series: the **S&P 500 Index** (.SPX) and the **VIX volatility index** (.VIX). 
- The S&P 500 is a benchmark index for U.S. stocks.
- The VIX measures market volatility expectations.

Typically, these indices have an inverse relationship: when the S&P 500 falls, the VIX tends to rise, indicating higher market fear or uncertainty.

In [None]:
raw = pd.read_csv(file_path + file_name, index_col=0, parse_dates=True)

In [None]:
data = raw[['.SPX', '.VIX']].dropna()

In [None]:
data.tail()

##### Visual inspection

In [None]:
data.plot(subplots=True, figsize=(10, 6));

In [None]:
data.loc[:'2012-12-31'].plot(secondary_y='.VIX', figsize=(10, 6));

### 4.2 Logarithmic Returns

#### Producing and processing output

In [None]:
rets = np.log(data / data.shift(1))

In [None]:
rets.head()

In [None]:
rets.dropna(inplace=True)

#### Visual inspection

In [None]:
rets.plot(subplots=True, figsize=(10, 6));

The `pd.plotting.scatter_matrix()` function produces a grid of plots: a histogram of each time series on the diagonal and pairwise scatter plots off the diagonal.

In [None]:
pd.plotting.scatter_matrix(rets,
                           alpha=0.2,
                           diagonal='hist',
                           hist_kwds={'bins': 35},
                           figsize=(10, 6));

### 4.3 OLS Regression

**Ordinary least squares** (OLS) regression provides a formal way to quantify the linear relationship between two variables.

`np.polyfit()` is a `NumPy` function that fits a polynomial to a set of data points by least squares and returns its coefficients, ordered from the highest power to the lowest.

In other words, it finds the polynomial of a specified degree that minimizes the sum of squared errors between the fitted values and the actual data points.

```python
    np.polyfit(x, y, deg, rcond=None, full=False, w=None, cov=False)
```

Parameters:
- `x`: The x-coordinates (independent variable) of the data points.
- `y`: The y-coordinates (dependent variable) of the data points.
- `deg`: Degree of the polynomial to be fit to the data. For example:
    - `deg=1` fits a line (linear regression),
    - `deg=2` fits a quadratic curve, and so on.
- `cov` (optional): If `True`, the function also returns the covariance matrix of the polynomial coefficients.
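
For `deg=1`, the fitted coefficients coincide with the textbook OLS formulas, slope $= \text{Cov}(x, y) / \text{Var}(x)$ and intercept $= \bar{y} - \text{slope} \cdot \bar{x}$. The sketch below checks this on hypothetical simulated data (true slope -2 and intercept 0.5 are assumptions of the simulation, not estimates from the lecture data).

```python
import numpy as np

# Hypothetical data: a noisy line with slope -2.0 and intercept 0.5.
rng = np.random.default_rng(seed=42)
x = rng.normal(size=500)
y = -2.0 * x + 0.5 + rng.normal(scale=0.1, size=500)

# Coefficients come back in decreasing order of power: [slope, intercept].
slope, intercept = np.polyfit(x, y, deg=1)

# Closed-form OLS estimates for comparison.
slope_cf = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
intercept_cf = y.mean() - slope_cf * x.mean()

print(slope, slope_cf)
print(intercept, intercept_cf)
```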

In [None]:
reg, cov_matrix = np.polyfit(rets['.SPX'], rets['.VIX'], deg=1, cov=True)
print (f"The regression results in: VIX = {reg[0].round(2)} SPX + {reg[1].round(4)}")
# print (cov_matrix)

##### Visual inspection

In [None]:
ax = rets.plot(kind='scatter', x='.SPX', y='.VIX', figsize=(10, 6))
ax.plot(rets['.SPX'], np.polyval(reg, rets['.SPX']), 'r', lw=2);

where `np.polyval()` is a `NumPy` function that evaluates a polynomial at given points: given the polynomial's coefficients, `np.polyval()` computes the y-values for the corresponding x-values.

```python
    np.polyval(p, x)
```

- `p`: Array of polynomial coefficients in decreasing order of power. 
    - For example, for a polynomial equation of the form  $y = ax^2 + bx + c$ , the coefficients array should be [$a, b, c$].
- `x`: Value(s) at which to evaluate the polynomial. This can be a single number or an array of x-values.
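
As a quick check of the coefficient ordering, this sketch evaluates the hypothetical polynomial $y = 2x^2 - 3x + 1$ at three points and compares the result with the formula written out by hand:

```python
import numpy as np

p = [2.0, -3.0, 1.0]            # coefficients of 2x^2 - 3x + 1, highest power first
x = np.array([0.0, 1.0, 2.0])

y = np.polyval(p, x)
manual = 2 * x**2 - 3 * x + 1   # the same polynomial written out

print(y)  # [1. 0. 3.]
```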


### 4.4 Correlation

`.corr()` computes the **Pearson correlation** coefficient between pairs of columns in a `DataFrame`, a measure of the strength and direction of their linear relationship.
- Values range from -1 (perfect negative correlation) to +1 (perfect positive correlation). 
    - For .SPX and .VIX, a strong negative correlation is expected.
    
- The method can be
    - applied directly to a `DataFrame` to calculate correlations between each column pair.
    - used with one column to calculate correlation with another column, e.g., `df['col1'].corr(df['col2'])`.
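
Both call styles, and the definition behind them, can be compared on simulated data. In this sketch the two hypothetical columns `a` and `b` are constructed so that their theoretical correlation is -0.8; the Pearson coefficient is then recomputed as covariance divided by the product of the standard deviations.

```python
import numpy as np
import pandas as pd

# Hypothetical data with a theoretical correlation of -0.8.
rng = np.random.default_rng(seed=1)
a = rng.normal(size=1000)
b = -0.8 * a + 0.6 * rng.normal(size=1000)
df = pd.DataFrame({"a": a, "b": b})

matrix = df.corr()            # full correlation matrix
pair = df["a"].corr(df["b"])  # a single pair of columns

# Pearson correlation from its definition.
manual = df["a"].cov(df["b"]) / (df["a"].std() * df["b"].std())

print(matrix.round(3))
print(pair, manual)
```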


In [None]:
rets.corr()

In [None]:
ax = rets['.SPX'].rolling(window=252).corr(
    rets['.VIX']).plot(figsize=(10, 6))
ax.axhline(rets.corr().iloc[0, 1], c='r');

## 5. A glimpse into high-frequency data

**High-frequency data** in finance refers to data captured at very short time intervals, often seconds or milliseconds, typically related to trades, bids, and asks. 

It provides detailed insights into market activity but requires careful handling due to its high volume and potential noise. Such data is commonly used in trading, market analysis, and to identify short-term price movements or anomalies.

For this part, we’re loading **tick** data from a `csv` file for **EUR/USD**, which contains high-frequency information, like **bid** and **ask prices**, captured by the **FXCM broker**.

The following parameters apply:
```python
    file_path = 'Data/11/'
    file_name = 'fxcm_eur_usd_tick_data.csv'
```

#### Check file

In [None]:
file_path = 'Data/11/'
file_name = 'fxcm_eur_usd_tick_data.csv'
file = open(file_path + file_name, 'r')

In [None]:
file.readlines()[:10]

In [None]:
file.close()

#### Import and inspect data

In [None]:
tick = pd.read_csv(file_path + file_name,
                   index_col=0, parse_dates=True)

In [None]:
tick.head()

In [None]:
tick.info()

#### Compute mid-prices

$$
\text{Mid Price} = \frac{\text{Bid} + \text{Ask}}{2}
$$
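
A minimal sketch of the formula on three hypothetical quotes follows. The column names `Bid` and `Ask` and all values are assumptions for illustration, and the bid-ask spread is added only as a sanity check (it should be positive).

```python
import pandas as pd

# Hypothetical EUR/USD quotes at irregular, sub-second timestamps.
quotes = pd.DataFrame(
    {"Bid": [1.1000, 1.1002, 1.1001], "Ask": [1.1002, 1.1003, 1.1004]},
    index=pd.to_datetime([
        "2024-01-02 09:00:00.120",
        "2024-01-02 09:00:00.480",
        "2024-01-02 09:00:01.050",
    ]),
)

quotes["Mid"] = (quotes["Bid"] + quotes["Ask"]) / 2  # midpoint of the two quotes
quotes["Spread"] = quotes["Ask"] - quotes["Bid"]     # cost of crossing the book

print(quotes)
```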

In [None]:
# Row-wise mean of all columns; assumes the file holds only the bid and ask quotes
tick['Mid'] = tick.mean(axis = 1)

In [None]:
tick['Mid'].plot(figsize = (10,6));

#### Resampling to 5-minute intervals

In [None]:
tick_resam = tick.resample(rule='5min', label='right').last()

In [None]:
tick_resam.head()

In [None]:
tick_resam['Mid'].plot(figsize=(10, 6));
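
Because ticks arrive at irregular times, some 5-minute intervals may contain no observation at all. The sketch below, built on four hypothetical mid-prices, shows how `.last()` leaves such intervals as `NaN` and how they can be forward-filled with the most recent price if a gap-free series is needed.

```python
import pandas as pd

# Hypothetical mid-prices at irregular times; nothing arrives between 09:05 and 09:10.
ticks = pd.Series(
    [1.10, 1.11, 1.12, 1.13],
    index=pd.to_datetime([
        "2024-01-02 09:01:10",
        "2024-01-02 09:03:45",
        "2024-01-02 09:04:59",
        "2024-01-02 09:12:30",
    ]),
    name="Mid",
)

# Bins are [09:00, 09:05), [09:05, 09:10), [09:10, 09:15); each is labeled with its end time.
bars = ticks.resample("5min", label="right").last()
counts = ticks.resample("5min", label="right").count()  # number of ticks per bar
filled = bars.ffill()                                   # carry the last price into empty bars

print(pd.DataFrame({"last": bars, "n_ticks": counts, "last_ffill": filled}))
```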

---