Author: [Vronsky Wikramanayake](https://www.linkedin.com/in/vronskyw/)

General Notes
- Q1A was not attempted. All other questions are completed.
- I am relatively new to futures tick data engineering and analysis, so I would need more time to build the right metadata or find the correct date libraries to solve 1A. I have had to look at this area in my current role, in partnership with quant devs, working mainly on index futures. I have been involved in procuring, testing and piping the raw feeds for one-time history from various aggregators, and in appending ongoing data via Bloomberg B-PIPE.

Technical Notes
- I moved to pandas too early in this analysis; with more time, I would have liked to learn to work more directly inside the Hierarchical Data Format (HDF5) world.
- I assume from the review that trades_filter0vol contains all trade records with volume > 0. I have restricted my analysis to this dataset only.

References
- [Understanding futures instruments](https://www.cmegroup.com/month-codes.html): instrument codes are constructed as "Globex product code + month + year", e.g. ESH04 = S&P E-mini March 2004 (quarterly).
- [Understanding Futures Expiration & Contract Roll](https://www.cmegroup.com/education/courses/introduction-to-futures/understanding-futures-expiration-contract-roll.html)

Failed Installs
- hdf5: does not work; I had trouble installing it on my MacBook M1
- Tables & pytables: do not work; I had trouble installing these on my MacBook M1

In [1]:
# IMPORTING LIBRARIES REQUIRED

import h5py
import pandas as pd
import numpy as np
from datetime import datetime
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import jarque_bera

%matplotlib inline

### 0. Reviewing the data & Imports

In [2]:
# SET VIEWING OPTIONS FOR DATAFRAMES

pd.set_option("display.max_rows", 2000)
pd.set_option("display.max_columns", 100)
pd.set_option("display.width", 100)
pd.options.display.width = 0
pd.options.display.float_format = "{:,.2f}".format  # all floats forced to 2 decimal places

In [3]:
#SETTING THE ASSET FILE LOCATION

f = h5py.File('/Users/vronskywikramanayake/documents/assets/adia/ES.h5', 'r')
file_path = '/Users/vronskywikramanayake/documents/assets/adia/ES.h5'

In [4]:
# REVIEW THE H5 FILE STRUCTURE AND NESTED GROUPS

def print_hdf5_structure(item, indent=0):
    if isinstance(item, h5py.Group):
        print("  " * indent + f"Group: {item.name}")
        for key in item.keys():
            print_hdf5_structure(item[key], indent + 1)
    elif isinstance(item, h5py.Dataset):
        print("  " * indent + f"Dataset: {item.name}")

with f as file:
    print_hdf5_structure(file)

Group: /
  Group: /tick
    Dataset: /tick/trades
    Dataset: /tick/trades_filter0vol


In [5]:
# REOPEN THE ASSET FILE FOR FURTHER USE (THE `with` BLOCK ABOVE CLOSED IT)

f = h5py.File('/Users/vronskywikramanayake/documents/assets/adia/ES.h5', 'r')
file_path = '/Users/vronskywikramanayake/documents/assets/adia/ES.h5'

In [6]:
# PRINT THE FILTERED TRADES INFORMATION

tradesFilterd = f['/tick/trades_filter0vol']

print(tradesFilterd.dtype)
print(tradesFilterd.shape)
print(tradesFilterd[500:505])

[('Instrument', 'S5'), ('Price', '<f4'), ('Time', 'S17'), ('Volume', '<u4')]
(856183065,)
[(b'ESU03', 973.25, b'20030701030925000',  1)
 (b'ESU03', 973.25, b'20030701030925000',  1)
 (b'ESU03', 973.  , b'20030701030930000',  1)
 (b'ESU03', 973.  , b'20030701030931000',  1)
 (b'ESU03', 973.  , b'20030701030943000', 25)]


In [9]:
# LOAD DATA INTO PANDAS DF & DECODE FIELDS

"""
- Method to be able to operate on a personal machine: reduce the size of the data loaded into pandas by taking every 10,000th record/tick.
- Problem with this approach: we miss many trades/ticks between the 10k marks, and we also miss all the volume between these ticks.
- Possible addition: I could have built a method similar to VWAP (volume-weighted average price) that sums all the volume and takes a VWAP price for every 10k block, to represent the data more fairly.
"""

df = pd.DataFrame(tradesFilterd[::10000])

# decode Instrument & Time, reformat Time to a friendly structure.

df['Time'] = df['Time'].apply(lambda x: x.decode('utf-8'))
df['Time'] = pd.to_datetime(df['Time'], format='%Y%m%d%H%M%S%f')

df['Instrument'] = df['Instrument'].apply(lambda x: x.decode('utf-8'))
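
The VWAP idea mentioned in the notes above could look roughly like the sketch below. It is only an illustration on made-up ticks: `block_vwap` and `block_size` are hypothetical names, and the blocks here ignore instrument boundaries, so real use would need to apply it per instrument.

```python
import numpy as np
import pandas as pd


def block_vwap(ticks: pd.DataFrame, block_size: int) -> pd.DataFrame:
    """Collapse every `block_size` consecutive ticks into one row.

    Each row keeps the block's last timestamp, its total volume and its
    volume-weighted average price, so no traded volume is discarded.
    """
    block = np.arange(len(ticks)) // block_size  # 0,0,...,1,1,...
    dollar = ticks["Price"] * ticks["Volume"]
    out = pd.DataFrame({
        "Time": ticks["Time"].groupby(block).last(),
        "Volume": ticks["Volume"].groupby(block).sum(),
        "Dollar": dollar.groupby(block).sum(),
    })
    out["VWAP"] = out["Dollar"] / out["Volume"]
    return out.drop(columns="Dollar")


# Illustrative ticks only, not taken from ES.h5.
ticks = pd.DataFrame({
    "Price": [100.0, 101.0, 102.0, 103.0],
    "Volume": [1, 3, 2, 2],
    "Time": pd.to_datetime([
        "2003-07-01 09:00:00", "2003-07-01 09:00:05",
        "2003-07-01 09:00:09", "2003-07-01 09:00:12",
    ]),
})
sampled = block_vwap(ticks, block_size=2)
print(sampled)
```

Compared with taking every Nth tick, this keeps the volume of the skipped ticks, at the cost of reading the full dataset once.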

In [10]:
# VIEW AN ISOLATED FUTURES SERIES OVER TIME

df.loc[df['Instrument'] == 'ESH04'].head()


    Instrument    Price                 Time  Volume
836      ESH04  1062.75  2003-12-11 08:50:46       3
837      ESH04   1064.0  2003-12-11 09:21:45       5
838      ESH04   1066.5  2003-12-11 09:51:40      10
839      ESH04   1066.0  2003-12-11 10:54:14      20
840      ESH04   1065.5  2003-12-11 12:17:37      10


In [9]:
# VIEW THE MIN / MAX DATES FOR EACH FUTURES SERIES

print(df.groupby('Instrument')['Time'].agg(['min', 'max']).reset_index())

   Instrument                     min                     max
0       ESH04 2003-12-11 08:50:46.000 2004-03-10 15:05:57.000
1       ESH05 2004-12-09 08:51:29.000 2005-03-09 14:52:53.000
2       ESH06 2005-12-08 08:31:19.000 2006-03-08 14:16:23.000
3       ESH07 2006-12-07 06:46:30.000 2007-03-07 14:56:27.000
4       ESH08 2007-12-13 07:40:33.000 2008-03-12 14:51:17.000
5       ESH09 2008-12-10 17:01:29.000 2009-03-11 15:00:10.000
6       ESH10 2009-12-10 02:25:22.000 2010-03-10 14:59:59.000
7       ESH11 2010-12-09 07:53:19.000 2011-03-10 15:02:46.000
8       ESH12 2011-12-08 15:54:38.325 2012-03-08 14:57:05.287
9       ESH13 2012-12-13 19:59:09.234 2013-03-07 15:13:02.170
10      ESH14 2013-12-12 21:58:56.263 2014-03-13 15:34:46.805
11      ESH15 2014-12-11 18:35:52.936 2015-03-12 15:50:47.516
12      ESM04 2004-03-11 08:36:24.000 2004-06-09 14:54:29.000
13      ESM05 2005-03-10 08:38:14.000 2005-06-08 14:42:47.000
14      ESM06 2006-03-09 08:40:05.000 2006-06-07 15:00:22.000
15      

In [None]:
# PLOT EACH SERIES

"""
Plot each instrument series to check visually for data issues/anomalies (e.g. spikes and missing data). With more time, more robust systematic checks and alarms should be put in place. I am happy to
go through the data quality measures I have built per asset class / data family (nuance specific), which feed downstream real-time analytics, warning triggers (e.g. emails) and remediation steps.
"""

for instrument, group in df.groupby('Instrument'):
    plt.figure(figsize=(8, 4))  # Adjust the figure size as needed
    plt.plot(group['Time'], group['Price'])
    plt.title(f'Time Series Plot for Instrument {instrument}')
    plt.xlabel('Time')
    plt.ylabel('Price')
    plt.show()
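
As a rough illustration of the systematic checks mentioned in the cell above, the sketch below flags price spikes and time gaps for one instrument series. The function name `flag_anomalies` and the thresholds are assumptions chosen for the example, not tuned values.

```python
import pandas as pd


def flag_anomalies(series: pd.DataFrame, max_jump: float, max_gap: pd.Timedelta) -> pd.DataFrame:
    """Flag rows whose price move or time gap versus the previous row is too large."""
    out = series.sort_values("Time").copy()
    out["Jump"] = out["Price"].pct_change().abs()  # absolute simple return
    out["Gap"] = out["Time"].diff()                # time since previous row
    out["PriceSpike"] = out["Jump"] > max_jump     # first row compares as False
    out["TimeGap"] = out["Gap"] > max_gap
    return out


# Illustrative data only: one bad print at 1100 and a long pause before the last row.
sample = pd.DataFrame({
    "Price": [1000.0, 1001.0, 1100.0, 1002.0],
    "Time": pd.to_datetime([
        "2004-01-05 09:00", "2004-01-05 09:01",
        "2004-01-05 09:02", "2004-01-05 12:00",
    ]),
})
checked = flag_anomalies(sample, max_jump=0.05, max_gap=pd.Timedelta(hours=1))
print(checked[["Time", "Price", "PriceSpike", "TimeGap"]])
```

A single bad print is flagged twice, once on the way up and once on the way back, which is usually enough to locate it. On the real data the function would be run once per instrument so that contract changes are not mistaken for spikes.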

### A. Form a continuous price series, by adjusting for rolls
- Question 1A was not attempted; notes on the method are below.

In [118]:
"""
- With more time, I would have liked to create a method to build an expiry date column specific to S&P futures. In the past I have used the Copp Clark calendar service for various derivatives, including futures dates.
- I understand these contracts expire every quarter on the 3rd Friday of March, June, September and December, with each expiration month denoted by "H", "M", "U" and "Z" respectively.
- From there you could build a roll method that changes the front month as you reach expiry.
- The roll method could use a relative (ratio) or absolute (difference) adjustment.
"""

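
The notes above could be turned into code along the following lines. This is a minimal sketch, not the attempted solution: the helper names are hypothetical, the expiry rule is the third-Friday rule stated in the notes with no holiday adjustment, the century is assumed to be 2000, and the splice measures the price gap between the last front-contract price and the first back-contract price rather than at a common timestamp.

```python
import calendar
from datetime import date

import pandas as pd

# Quarterly month codes as listed in the notes above.
MONTH_CODES = {"H": 3, "M": 6, "U": 9, "Z": 12}


def third_friday(year: int, month: int) -> date:
    """Return the third Friday of the given month (no holiday adjustment)."""
    fridays = [
        d
        for d in calendar.Calendar().itermonthdates(year, month)
        if d.month == month and d.weekday() == calendar.FRIDAY
    ]
    return fridays[2]


def expiry_from_code(instrument: str, century: int = 2000) -> date:
    """Map a code such as 'ESH04' to its assumed expiry date."""
    month = MONTH_CODES[instrument[-3]]
    year = century + int(instrument[-2:])
    return third_friday(year, month)


def splice_absolute(front: pd.Series, back: pd.Series, roll_time: pd.Timestamp) -> pd.Series:
    """Join two contracts at roll_time, shifting the older one by the price gap."""
    old = front[front.index < roll_time]
    new = back[back.index >= roll_time]
    gap = new.iloc[0] - old.iloc[-1]  # assumption: gap measured across the splice point
    return pd.concat([old + gap, new])


# Illustrative prices only, not taken from ES.h5.
front = pd.Series(
    [1000.0, 1002.0],
    index=pd.to_datetime(["2004-03-10 10:00", "2004-03-10 15:00"]),
)
back = pd.Series(
    [1010.0, 1011.0],
    index=pd.to_datetime(["2004-03-11 09:00", "2004-03-11 10:00"]),
)
continuous = splice_absolute(front, back, pd.Timestamp("2004-03-11"))
expiry = expiry_from_code("ESH04")
print(expiry, continuous.tolist())
```

A relative (ratio) adjustment would multiply the older prices by `new.iloc[0] / old.iloc[-1]` instead of adding the gap. Note that in the sampled date ranges printed earlier, each contract's last tick falls roughly a week before the third Friday, so the roll date in this file may not coincide with expiry.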

### B. Sample observations by forming tick, volume, and dollar-traded bars

Steps
- All bars created will come in the form OHLC (Open, High, Low, Close)
- The time element in each new record is taken as the last time from the final record used to create the bar. 
- For tick bars: based on the number of ticks chosen
- For volume bars: based on a fixed volume chosen
- For dollar-traded bars: based on a fixed dollar amount chosen


In [11]:
# Set the number of ticks
ticks_per_bar = 5

# Calculate a temp bar number
df['BarNumber'] = (df.groupby('Instrument').cumcount() // ticks_per_bar) + 1

# Group by instrument and bar number, then aggregate
ohlc_data = df.groupby(['Instrument', 'BarNumber']).agg({
    'Price': ['first', 'max', 'min', 'last'],
    'Time': 'last',  # Snap to the last time
    'Volume': 'sum'
}).reset_index()

# Rename columns
ohlc_data.columns = ['Instrument', 'BarNumber', 'Open', 'High', 'Low', 'Close', 'LastTime', 'Volume']

# Drop the BarNumber column if not needed in the final result
ohlc_data = ohlc_data.drop(columns='BarNumber')

print(ohlc_data)

      Instrument     Open     High      Low    Close                LastTime  \
0          ESH04 1,062.75 1,066.50 1,062.75 1,065.50 2003-12-11 12:17:37.000   
1          ESH04 1,067.00 1,072.00 1,067.00 1,071.00 2003-12-12 08:31:07.000   
2          ESH04 1,067.00 1,069.50 1,067.00 1,069.50 2003-12-12 11:50:39.000   
3          ESH04 1,069.00 1,088.50 1,069.00 1,083.00 2003-12-15 08:31:39.000   
4          ESH04 1,080.75 1,080.75 1,077.00 1,078.25 2003-12-15 10:50:25.000   
...          ...      ...      ...      ...      ...                     ...   
17141      ESZ15 2,004.25 2,010.00 2,004.25 2,009.25 2015-10-12 14:46:04.715   
17142      ESZ15 2,012.00 2,012.00 1,998.50 1,998.50 2015-10-13 08:31:25.625   
17143      ESZ15 2,002.75 2,014.25 2,002.75 2,014.25 2015-10-13 10:13:00.268   
17144      ESZ15 2,009.75 2,009.75 2,000.50 2,002.00 2015-10-13 13:48:17.981   
17145      ESZ15 1,995.50 1,995.50 1,995.50 1,995.50 2015-10-13 15:11:04.885   

       Volume  
0          48  
1      

In [14]:
# Set the target volume to count
volume_per_bar = 20

# Group by instrument and calculate cumulative volume
df['CumVolume'] = df.groupby('Instrument')['Volume'].cumsum()

# Calculate the bar number based on the target volume
df['BarNumber'] = (df['CumVolume'] // volume_per_bar).astype(int) + 1

# Group by instrument and bar number, then aggregate
ohlc_data = df.groupby(['Instrument', 'BarNumber']).agg({
    'Price': ['first', 'max', 'min', 'last'],
    'Time': 'last',  # Snap to the last time
    'Volume': 'sum'
}).reset_index()

# Rename columns
ohlc_data.columns = ['Instrument', 'BarNumber', 'Open', 'High', 'Low', 'Close', 'LastTime', 'Volume']

# Drop the BarNumber column if not needed in the final result
ohlc_data = ohlc_data.drop(columns='BarNumber')

print(ohlc_data)

In [12]:
# Set the target dollar amount
dollar_per_bar = 100000

# Calculate cumulative dollar amount and determine bar numbers
df['DollarAmount'] = df['Price'] * df['Volume']
df['CumDollar'] = df.groupby('Instrument')['DollarAmount'].cumsum()
df['BarNumber'] = (df['CumDollar'] // dollar_per_bar).astype(int) + 1

# Group by instrument and bar number, then aggregate
ohlc_data = df.groupby(['Instrument', 'BarNumber']).agg({
    'Price': ['first', 'max', 'min', 'last'],
    'Time': 'last',  # Snap to the last time
    'Volume': 'sum'
}).reset_index()

# Rename columns
ohlc_data.columns = ['Instrument', 'BarNumber', 'Open', 'High', 'Low', 'Close', 'LastTime', 'Volume']

# Drop the BarNumber column if not needed in the final result
ohlc_data = ohlc_data.drop(columns=['BarNumber'])

print(ohlc_data)

     Instrument     Open     High      Low    Close                LastTime  \
0         ESH04 1,062.75 1,072.00 1,062.75 1,067.00 2003-12-12 08:54:23.000   
1         ESH04 1,068.00 1,088.50 1,068.00 1,077.00 2003-12-15 10:03:23.000   
2         ESH04 1,078.25 1,078.25 1,066.75 1,073.00 2003-12-16 15:00:28.000   
3         ESH04 1,071.50 1,075.00 1,071.00 1,075.00 2003-12-17 14:56:36.000   
4         ESH04 1,077.00 1,088.00 1,077.00 1,088.00 2003-12-19 09:53:20.000   
...         ...      ...      ...      ...      ...                     ...   
5706      ESZ15 2,006.75 2,011.25 2,000.00 2,007.50 2015-10-09 09:50:26.723   
5707      ESZ15 2,009.25 2,009.25 1,999.00 2,004.75 2015-10-09 14:00:24.874   
5708      ESZ15 2,007.25 2,012.00 1,998.50 2,006.50 2015-10-13 09:03:42.936   
5709      ESZ15 2,008.00 2,014.25 2,002.75 2,002.75 2015-10-13 12:14:02.340   
5710      ESZ15 2,000.50 2,002.00 1,995.50 1,995.50 2015-10-13 15:11:04.885   

      Volume  
0         90  
1         89  
2     

### C. Count the number of bars produced by tick, volume, and dollar bars on a weekly basis

Steps
- Group the bars by week and count the number of bars for each type.
- The bar type that produces the most stable weekly count is the one with the least variation in bar counts over time.
- Note: ESH05 2004-12-07/2004-12-13 looks a little dubious; such records should be isolated to understand why they sit more than a chosen number of standard deviations below or above the norm.
- Note: growth in ticks and traded volume (due to the instrument's growing popularity) and in dollar bars (given inflation and the general appreciation of the instrument's price) is expected.
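
One way to make the steps above concrete is sketched here on made-up bars. It dates each bar by its closing time, counts bars per calendar week, and uses the coefficient of variation as the stability measure. Both helper names and the choice of coefficient of variation are assumptions for illustration; `LastTime` mirrors the column name used in the bar tables.

```python
import pandas as pd


def weekly_bar_counts(bars: pd.DataFrame) -> pd.Series:
    """Count bars per calendar week, dating each bar by its closing time."""
    week = bars["LastTime"].dt.to_period("W")  # weeks run Monday to Sunday
    return bars.groupby(week).size()


def coefficient_of_variation(counts: pd.Series) -> float:
    """Sample standard deviation of the weekly counts divided by their mean."""
    return counts.std() / counts.mean()


# Illustrative bars only: three close in the week of 5 Jan 2004, two in the next.
bars = pd.DataFrame({
    "LastTime": pd.to_datetime([
        "2004-01-05 10:00", "2004-01-06 11:00", "2004-01-07 12:00",
        "2004-01-12 10:00", "2004-01-13 11:00",
    ])
})
counts = weekly_bar_counts(bars)
cv = coefficient_of_variation(counts)
print(counts)
print(f"coefficient of variation: {cv:.4f}")
```

Running this once per bar type gives three comparable numbers; the lowest coefficient of variation would indicate the most stable weekly count under this measure.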

In [15]:
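# Cumulative volume per instrument. Note: CumDollar below is Price multiplied by the running
# volume of the whole frame, not a per-instrument running sum of Price * Volume.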
df['CumVolume'] = df.groupby('Instrument')['Volume'].cumsum()
df['CumDollar'] = df['Price'] * df['Volume'].cumsum()

# Calculate tick, volume, and dollar bars
df['TickBarNumber'] = (df.groupby('Instrument').cumcount() // ticks_per_bar) + 1
df['VolumeBarNumber'] = (df['CumVolume'] // volume_per_bar).astype(int) + 1
df['DollarBarNumber'] = (df['CumDollar'] // dollar_per_bar).astype(int) + 1

# Group by instrument and week, then take the highest bar number reached in that week
# (a running bar index per instrument, so the values accumulate week over week)
weekly_counts = df.groupby(['Instrument', df['Time'].dt.to_period("W-Mon")]).agg({
    'TickBarNumber': 'max',
    'VolumeBarNumber': 'max',
    'DollarBarNumber': 'max'
}).reset_index()

# Rename columns
weekly_counts.columns = ['Instrument', 'Week', 'TickBars', 'VolumeBars', 'DollarBars']

print(weekly_counts)

    Instrument                   Week  TickBars  VolumeBars  DollarBars
0        ESH04  2003-12-09/2003-12-15         7          13          89
1        ESH04  2003-12-16/2003-12-22        15          26          94
2        ESH04  2003-12-23/2003-12-29        18          29          95
3        ESH04  2003-12-30/2004-01-05        24          36          98
4        ESH04  2004-01-06/2004-01-12        34          50         102
5        ESH04  2004-01-13/2004-01-19        43          61         105
6        ESH04  2004-01-20/2004-01-26        53          76         110
7        ESH04  2004-01-27/2004-02-02        65          96         114
8        ESH04  2004-02-03/2004-02-09        75         105         116
9        ESH04  2004-02-10/2004-02-16        83         118         120
10       ESH04  2004-02-17/2004-02-23        92         130         122
11       ESH04  2004-02-24/2004-03-01       102         160         130
12       ESH04  2004-03-02/2004-03-08       113         174     

### D. Compute the serial correlation of price-returns for the three bar types
- Definition: the relationship between a variable and a lagged version of itself. A serially correlated variable may not be random (i.e. future observations are affected by past values). Technical analysis (TA) uses it to validate potentially profitable patterns in a security or group of securities and to gauge the risk associated with investment opportunities.
- The cell below measures it as the lag-1 Pearson correlation of returns; the Durbin-Watson (DW) test is a related statistic. The correlation can be positive (a move tends to be followed by a move in the same direction) or negative (a move tends to be followed by a reversal).
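
For reference, the sketch below shows the lag-1 serial correlation computed on bar-to-bar returns, i.e. on the sequence of bar closing prices. `lag1_autocorr` is a hypothetical helper and the closes are invented to produce a perfectly mean-reverting series.

```python
import pandas as pd


def lag1_autocorr(close: pd.Series) -> float:
    """Lag-1 serial correlation of bar-to-bar simple returns."""
    returns = close.pct_change().dropna()
    return returns.autocorr(lag=1)


# Illustrative closes only: the price alternates, so every up move is
# followed by a down move and the correlation is strongly negative.
close = pd.Series([100.0, 101.0, 100.0, 101.0, 100.0, 101.0])
rho = lag1_autocorr(close)
print(f"lag-1 serial correlation: {rho:.4f}")
```

On real bars the function would be applied per instrument, so that the return across a contract change is not mixed into the series.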

In [17]:
# Tick-to-tick price returns within each bar (the first tick of every bar is NaN)
df['TickReturns'] = df.groupby(['Instrument', 'TickBarNumber'])['Price'].pct_change()
df['VolumeReturns'] = df.groupby(['Instrument', 'VolumeBarNumber'])['Price'].pct_change()
df['DollarReturns'] = df.groupby(['Instrument', 'DollarBarNumber'])['Price'].pct_change()

# Compute serial correlation for each bar type
tick_corr = df['TickReturns'].corr(df['TickReturns'].shift(1))
volume_corr = df['VolumeReturns'].corr(df['VolumeReturns'].shift(1))
dollar_corr = df['DollarReturns'].corr(df['DollarReturns'].shift(1))

print(f"Serial correlation for Tick bars: {tick_corr}")
print(f"Serial correlation for Volume bars: {volume_corr}")
print(f"Serial correlation for Dollar bars: {dollar_corr}")

Serial correlation for Tick bars: -0.014922704214979526
Serial correlation for Volume bars: -0.0004453579193719647
Serial correlation for Dollar bars: 0.3963988619185025


### E. Partition the bar series into monthly subsets. Compute the variance of returns for every subset of every bar type. Compute the variance of those variances. What method exhibits the smallest variance of variances?

- The method with the smallest variance of variances is considered more stable.
- Variance is used here as the measure of dispersion.
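
A compact sketch of the variance-of-variances calculation on a table of bars is given below. The data are invented, `variance_of_monthly_variances` is a hypothetical name, and the `Close` and `LastTime` columns are assumed to match the bar tables built earlier.

```python
import pandas as pd


def variance_of_monthly_variances(bars: pd.DataFrame) -> float:
    """Variance, across calendar months, of the variance of bar returns."""
    returns = bars["Close"].pct_change()          # bar-to-bar simple returns
    month = bars["LastTime"].dt.to_period("M")    # month each bar closed in
    monthly_var = returns.groupby(month).var()    # one sample variance per month
    return monthly_var.var()


# Illustrative bars only: a volatile January followed by a flat February.
bars = pd.DataFrame({
    "Close": [100.0, 110.0, 99.0, 99.0, 99.0, 99.0],
    "LastTime": pd.to_datetime([
        "2004-01-05", "2004-01-06", "2004-01-07",
        "2004-02-02", "2004-02-03", "2004-02-04",
    ]),
})
vov = variance_of_monthly_variances(bars)
print(f"variance of monthly variances: {vov:.6f}")
```

Here January's return variance is 0.02 and February's is 0, so the variance of the two is 0.0002. Running the function on each bar type's table would give the three numbers to compare.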

In [18]:
# Create monthly subsets
monthly_subsets = df.groupby([df['Time'].dt.to_period("M"), 'Instrument'])

# Compute the variance of returns for every subset of every bar type
tick_variances = monthly_subsets['TickReturns'].var()
volume_variances = monthly_subsets['VolumeReturns'].var()
dollar_variances = monthly_subsets['DollarReturns'].var()

# Compute the variance of those variances
tick_var_variance = tick_variances.var()
volume_var_variance = volume_variances.var()
dollar_var_variance = dollar_variances.var()

print(f"Variance for Tick bars: {tick_variances}")
print(f"Variance for Volume bars: {volume_variances}")
print(f"Variance for Dollar bars: {dollar_variances}")

print(f"Variance of variances for Tick bars: {tick_var_variance}")
print(f"Variance of variances for Volume bars: {volume_var_variance}")
print(f"Variance of variances for Dollar bars: {dollar_var_variance}")

Variance for Tick bars: Time     Instrument
2003-06  ESU03         NaN
2003-07  ESU03        0.00
2003-08  ESU03        0.00
2003-09  ESU03        0.00
         ESZ03        0.00
2003-10  ESZ03        0.00
2003-11  ESZ03        0.00
2003-12  ESH04        0.00
         ESZ03        0.00
2004-01  ESH04        0.00
2004-02  ESH04        0.00
2004-03  ESH04        0.00
         ESM04        0.00
2004-04  ESM04        0.00
2004-05  ESM04        0.00
2004-06  ESM04        0.00
         ESU04        0.00
2004-07  ESU04        0.00
2004-08  ESU04        0.00
2004-09  ESU04        0.00
         ESZ04        0.00
2004-10  ESZ04        0.00
2004-11  ESZ04        0.00
2004-12  ESH05        0.00
         ESZ04        0.00
2005-01  ESH05        0.00
2005-02  ESH05        0.00
2005-03  ESH05        0.00
         ESM05        0.00
2005-04  ESM05        0.00
2005-05  ESM05        0.00
2005-06  ESM05        0.00
         ESU05        0.00
2005-07  ESU05        0.00
2005-08  ESU05        0.00
2005-09  ES

In [20]:
monthly_subsets.head()

Unnamed: 0,Instrument,Price,Time,Volume,BarNumber,DollarAmount,CumDollar,CumVolume,TickBarNumber,VolumeBarNumber,DollarBarNumber,TickReturns,VolumeReturns,DollarReturns
0,ESU03,971.75,2003-06-30 23:00:01.000,1,1,971.75,971.75,1,1,1,1,,,
1,ESU03,968.0,2003-07-01 08:59:52.000,1,1,968.0,1936.0,2,1,1,1,-0.0,-0.0,-0.0
2,ESU03,963.25,2003-07-01 09:22:31.000,1,1,963.25,2889.75,3,1,1,1,-0.0,-0.0,-0.0
3,ESU03,965.0,2003-07-01 10:17:55.000,1,1,965.0,3860.0,4,1,1,1,0.0,0.0,0.0
4,ESU03,967.75,2003-07-01 12:03:37.000,2,1,1935.5,5806.5,6,1,1,1,0.0,0.0,0.0
5,ESU03,969.75,2003-07-01 13:04:56.000,5,1,4848.75,10667.25,11,2,1,1,,0.0,0.0
147,ESU03,985.25,2003-08-01 08:42:50.000,10,16,9852.5,1515314.5,1538,30,77,16,-0.0,-0.0,-0.0
148,ESU03,982.75,2003-08-01 09:07:18.000,2,16,1965.5,1513435.0,1540,30,78,16,-0.0,,-0.0
149,ESU03,978.75,2003-08-01 09:41:33.000,10,16,9787.5,1517062.5,1550,30,78,16,-0.0,-0.0,-0.0
150,ESU03,980.25,2003-08-01 10:45:59.000,10,16,9802.5,1529190.0,1560,31,79,16,,,0.0


### F. Apply the Jarque-Bera normality test on returns from the three bar types. What method achieves the lowest test statistic?

- I have not studied the Jarque–Bera test in depth and am not sure whether the output from the library function jarque_bera is correct. In general, I look for the lowest Jarque–Bera test statistic, which indicates that the distribution is closer to normal.
- The Jarque–Bera test is a goodness-of-fit test of whether sample data have skewness and kurtosis matching a normal distribution (reference: [Wikipedia, Jarque–Bera test](https://en.wikipedia.org/wiki/Jarque%E2%80%93Bera_test#:~:text=In%20statistics%2C%20the%20Jarque%E2%80%93Bera,test%20statistic%20is%20always%20nonnegative.)).
- Apply the Jarque–Bera test to assess the normality of returns.
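
To help sanity-check the library output, the statistic can also be computed by hand from its textbook definition, JB = n/6 · (S² + (K − 3)²/4), where S is the sample skewness and K the sample kurtosis. The sketch below uses NumPy only; `jarque_bera_statistic` is a hypothetical helper and the sample is invented.

```python
import numpy as np


def jarque_bera_statistic(x) -> float:
    """Jarque-Bera statistic from sample skewness and kurtosis (biased moments)."""
    x = np.asarray(x, dtype=float)
    n = x.size
    centered = x - x.mean()
    m2 = np.mean(centered**2)               # second central moment
    skew = np.mean(centered**3) / m2**1.5   # sample skewness S
    kurt = np.mean(centered**4) / m2**2     # sample kurtosis K (normal = 3)
    return n / 6.0 * (skew**2 + (kurt - 3.0) ** 2 / 4.0)


# Illustrative sample only: symmetric (S = 0) with kurtosis K = 2.
sample = [-1.0, 0.0, 0.0, 1.0]
jb = jarque_bera_statistic(sample)
print(f"Jarque-Bera statistic: {jb:.6f}")
```

The statistic is never negative and equals zero only when the sample skewness is 0 and the kurtosis is exactly 3. Because it scales with the sample size n, series with very different numbers of observations are not directly comparable on the raw statistic.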

In [19]:
# Apply the Jarque-Bera test to returns
tick_jb_test = jarque_bera(df['TickReturns'].dropna())
volume_jb_test = jarque_bera(df['VolumeReturns'].dropna())
dollar_jb_test = jarque_bera(df['DollarReturns'].dropna())

print(f"Jarque-Bera test statistic for Tick bars: {tick_jb_test.statistic}")
print(f"Jarque-Bera test statistic for Volume bars: {volume_jb_test.statistic}")
print(f"Jarque-Bera test statistic for Dollar bars: {dollar_jb_test.statistic}")

Jarque-Bera test statistic for Tick bars: 351640.8284988022
Jarque-Bera test statistic for Volume bars: 438130.255016064
Jarque-Bera test statistic for Dollar bars: 7667618.045951913
