### Betfair Historical Data large dataset creation
1. Follow the steps in `2_betfair_advanced_data_test.ipynb` to download historical data
2. Iterate the stream listener over a list of files to create a dataset containing multiple events and race days


To connect to the Betfair API through betfairlightweight, you must first set up your API certificates and save your login details in dictionary format in a file called `api_logins.json` in the project home directory (`/betfair_project`). The following cell connects and should return `<LoginResource>` if the login succeeds.

In [1]:
import betfairlightweight
from betfairlightweight import filters
import pandas as pd
import numpy as np
import os
import datetime
import json
from pathlib import Path, PurePath #To define open and save locations that are cross-compatible between Windows/Linux
from bz2 import BZ2File #To unzip the Betfair data from its downloaded format
import matplotlib.pyplot as plt

project_dir = Path.cwd().parents[1]
logins_dir = project_dir / 'api_logins.json'

with open(logins_dir) as f:
    login_dict =  json.load(f)
    
trading = betfairlightweight.APIClient(username=login_dict['my_username'],
                                       password=login_dict['my_password'],
                                       app_key=login_dict['my_app_key'],
                                       certs=login_dict['certs_path'])

trading.login()

<LoginResource>
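As a minimal sketch, `api_logins.json` could be checked before logging in. The key names are the ones read by the login cell above; the helper `missing_login_keys` and the placeholder values are hypothetical and only illustrate the expected shape of the file.

```python
import json

# Keys read by the login cell above; the values below are placeholders, not real credentials.
REQUIRED_KEYS = ("my_username", "my_password", "my_app_key", "certs_path")


def missing_login_keys(login_dict):
    """Return the required keys that are absent or empty in login_dict."""
    return [key for key in REQUIRED_KEYS if not login_dict.get(key)]


# Hypothetical contents of api_logins.json
example_logins = json.loads("""
{
    "my_username": "your_username",
    "my_password": "your_password",
    "my_app_key": "your_app_key",
    "certs_path": "/path/to/certs"
}
""")

# An empty list means every key needed by APIClient is present
print(missing_login_keys(example_logins))
```

Running a check like this before `trading.login()` turns a missing key into a readable message rather than a `KeyError` partway through the client setup.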

To perform the following, you must have purchased data from the [Betfair Historical data service](https://historicdata.betfair.com/#/home). Advanced data for all sports has been offered for free for Jan - May 2020. To understand more about what different packages of data include, look in the data dictionaries folder in this project. The following lists the data that you have purchased on your Betfair account. Since data is purchased by month, that is how it is represented here:

In [2]:
my_data = trading.historic.get_my_data()
for i in my_data:
    print(i)

{'sport': 'Horse Racing', 'plan': 'Advanced Plan', 'forDate': '2020-01-01T00:00:00', 'purchaseItemId': 42364}
{'sport': 'Horse Racing', 'plan': 'Advanced Plan', 'forDate': '2020-02-01T00:00:00', 'purchaseItemId': 42364}
{'sport': 'Horse Racing', 'plan': 'Advanced Plan', 'forDate': '2020-03-01T00:00:00', 'purchaseItemId': 42364}
{'sport': 'Horse Racing', 'plan': 'Advanced Plan', 'forDate': '2020-04-01T00:00:00', 'purchaseItemId': 41549}
{'sport': 'Horse Racing', 'plan': 'Advanced Plan', 'forDate': '2020-05-01T00:00:00', 'purchaseItemId': 41549}
{'sport': 'Horse Racing', 'plan': 'Basic Plan', 'forDate': '2016-01-01T00:00:00', 'purchaseItemId': 25202}
{'sport': 'Horse Racing', 'plan': 'Basic Plan', 'forDate': '2017-06-01T00:00:00', 'purchaseItemId': 770}
{'sport': 'Horse Racing', 'plan': 'Basic Plan', 'forDate': '2018-01-01T00:00:00', 'purchaseItemId': 24527}
{'sport': 'Horse Racing', 'plan': 'Basic Plan', 'forDate': '2018-02-01T00:00:00', 'purchaseItemId': 24527}
{'sport': 'Horse Racing'

The following selects just the data from 01/01/20 to 01/03/20. It returns a dictionary describing the contents of that selection, which we go on to use to download specific country, race and market types, and it reports the size of the entire selection:

In [3]:
collection_options = trading.historic.get_collection_options(
    "Horse Racing", "Advanced Plan", 1, 1, 2020, 1, 3, 2020
)

print(collection_options)

basket_size = trading.historic.get_data_size(
    "Horse Racing", "Advanced Plan", 1, 1, 2020, 1, 3, 2020
)
print(basket_size)

{'marketTypesCollection': [{'name': '', 'count': 1998}, {'name': 'ANTEPOST_WIN', 'count': 22}, {'name': 'DAILY_WIN_DIST', 'count': 1}, {'name': 'EACH_WAY', 'count': 1788}, {'name': 'FORECAST', 'count': 438}, {'name': 'MATCH_BET', 'count': 1692}, {'name': 'OTHER_PLACE', 'count': 4746}, {'name': 'PLACE', 'count': 9161}, {'name': 'RACE_WIN_DIST', 'count': 175}, {'name': 'REV_FORECAST', 'count': 784}, {'name': 'WIN', 'count': 11923}, {'name': 'WITHOUT_FAV', 'count': 306}], 'countriesCollection': [{'name': 'AE', 'count': 511}, {'name': 'AU', 'count': 11874}, {'name': 'ES', 'count': 6}, {'name': 'FR', 'count': 549}, {'name': 'GB', 'count': 10230}, {'name': 'HK', 'count': 1}, {'name': 'IE', 'count': 2147}, {'name': 'NZ', 'count': 2016}, {'name': 'SA', 'count': 44}, {'name': 'SG', 'count': 314}, {'name': 'US', 'count': 4038}, {'name': 'ZA', 'count': 1304}], 'fileTypeCollection': [{'name': 'E', 'count': 1998}, {'name': 'M', 'count': 31036}]}
{'totalSizeMB': 5731, 'fileCount': 33034}
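The nested lists in this output are easier to query once flattened into plain name-to-count dictionaries. The sketch below uses a trimmed copy of the output above (only a few entries are reproduced); the helper `counts_by_name` is a hypothetical convenience, not part of betfairlightweight.

```python
# Trimmed copy of the output above; only a few entries are reproduced here.
collection_options = {
    "marketTypesCollection": [
        {"name": "WIN", "count": 11923},
        {"name": "PLACE", "count": 9161},
        {"name": "EACH_WAY", "count": 1788},
    ],
    "countriesCollection": [
        {"name": "GB", "count": 10230},
        {"name": "AU", "count": 11874},
        {"name": "IE", "count": 2147},
    ],
}
basket_size = {"totalSizeMB": 5731, "fileCount": 33034}


def counts_by_name(collection):
    """Flatten a list of {'name': ..., 'count': ...} dicts into {name: count}."""
    return {item["name"]: item["count"] for item in collection}


market_counts = counts_by_name(collection_options["marketTypesCollection"])
country_counts = counts_by_name(collection_options["countriesCollection"])

# Countries ordered by number of files, largest first
print(sorted(country_counts.items(), key=lambda kv: kv[1], reverse=True))

# Rough average file size for the whole selection
avg_mb = basket_size["totalSizeMB"] / basket_size["fileCount"]
print(f"{avg_mb:.3f} MB per file on average")
```

The average file size gives a quick estimate of how much disk space a filtered download will need before any files are requested.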


To dig into particular dates, countries, race types and market types we use `.get_file_list()`, specifying our filtering parameters. For example, you can see above that within the sub-dictionary for `'marketTypesCollection'` we have `'WIN'`, among others. The following retrieves a list of the event files that match these parameters. We select only data for 01/02/20, for the `'WIN'` market only and in `'GB'`.

In [4]:
file_list = trading.historic.get_file_list(
    "Horse Racing",
    "Advanced Plan",
    from_day=1,
    from_month=2,
    from_year=2020,
    to_day=1,
    to_month=2,
    to_year=2020,
    market_types_collection=["WIN"],
    countries_collection=["GB"],
    file_type_collection=["M"],
)
print(file_list)

['/xds_nfs/edp_processed/ADVANCED/2020/Feb/1/29678694/1.168130661.bz2', '/xds_nfs/edp_processed/ADVANCED/2020/Feb/1/29677422/1.168096552.bz2', '/xds_nfs/edp_processed/ADVANCED/2020/Feb/1/29678828/1.168136214.bz2', '/xds_nfs/edp_processed/ADVANCED/2020/Feb/2/29678760/1.168133543.bz2', '/xds_nfs/edp_processed/ADVANCED/2020/Feb/1/29678694/1.168130666.bz2', '/xds_nfs/edp_processed/ADVANCED/2020/Feb/1/29677422/1.168096559.bz2', '/xds_nfs/edp_processed/ADVANCED/2020/Feb/1/29678828/1.168136219.bz2', '/xds_nfs/edp_processed/ADVANCED/2020/Feb/1/29678694/1.168130671.bz2', '/xds_nfs/edp_processed/ADVANCED/2020/Feb/2/29678760/1.168133548.bz2', '/xds_nfs/edp_processed/ADVANCED/2020/Feb/1/29677422/1.168096566.bz2', '/xds_nfs/edp_processed/ADVANCED/2020/Feb/1/29678828/1.168136224.bz2', '/xds_nfs/edp_processed/ADVANCED/2020/Feb/2/29678760/1.168133553.bz2', '/xds_nfs/edp_processed/ADVANCED/2020/Feb/1/29678694/1.168130676.bz2', '/xds_nfs/edp_processed/ADVANCED/2020/Feb/1/29677422/1.168130245.bz2', '/xds
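Each returned path appears to encode the plan, date, an event id and the market id. The sketch below splits a few of the paths printed above on that assumption; the meaning given to each segment is inferred from the path layout rather than taken from Betfair documentation, and `describe_path` is a hypothetical helper.

```python
from collections import Counter
from pathlib import PurePosixPath

# A few of the paths printed above
sample_paths = [
    "/xds_nfs/edp_processed/ADVANCED/2020/Feb/1/29678694/1.168130661.bz2",
    "/xds_nfs/edp_processed/ADVANCED/2020/Feb/1/29677422/1.168096552.bz2",
    "/xds_nfs/edp_processed/ADVANCED/2020/Feb/2/29678760/1.168133543.bz2",
]


def describe_path(path):
    """Split a remote path into its trailing segments (interpretation is an assumption)."""
    remote = PurePosixPath(path)
    plan, year, month, day, event_id, _filename = remote.parts[-6:]
    return {
        "plan": plan,
        "year": int(year),
        "month": month,
        "day": int(day),
        "event_id": event_id,
        "market_id": remote.stem,  # file name without the .bz2 suffix
    }


described = [describe_path(p) for p in sample_paths]
files_per_day = Counter(d["day"] for d in described)

print(described[0])
print(files_per_day)
```

Counting by the day segment is a quick way to spot that the list above contains paths under both `Feb/1` and `Feb/2`, even though the request covered a single date.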

To download a file, we use `download_file()`, passing the file's location on Betfair as `file_path` and the local directory to save it in as `store_directory`. The filename is kept as it is on Betfair. `download_file()` returns the local path of the downloaded file, which we assign to `download` and collect in `downloaded_files` for use in subsequent operations. Below, we iterate `.download_file()` over the paths in `file_list`.

In [5]:
data_dir = project_dir / 'data' / 'raw' / 'api'

available_files = file_list

print(available_files)

downloaded_files = [] #list of directories of each download

for file in available_files:
    download = trading.historic.download_file(file_path = file, store_directory = data_dir)
    print(download)
    downloaded_files.append(download)

['/xds_nfs/edp_processed/ADVANCED/2020/Feb/1/29678694/1.168130661.bz2', '/xds_nfs/edp_processed/ADVANCED/2020/Feb/1/29677422/1.168096552.bz2', '/xds_nfs/edp_processed/ADVANCED/2020/Feb/1/29678828/1.168136214.bz2', '/xds_nfs/edp_processed/ADVANCED/2020/Feb/2/29678760/1.168133543.bz2', '/xds_nfs/edp_processed/ADVANCED/2020/Feb/1/29678694/1.168130666.bz2', '/xds_nfs/edp_processed/ADVANCED/2020/Feb/1/29677422/1.168096559.bz2', '/xds_nfs/edp_processed/ADVANCED/2020/Feb/1/29678828/1.168136219.bz2', '/xds_nfs/edp_processed/ADVANCED/2020/Feb/1/29678694/1.168130671.bz2', '/xds_nfs/edp_processed/ADVANCED/2020/Feb/2/29678760/1.168133548.bz2', '/xds_nfs/edp_processed/ADVANCED/2020/Feb/1/29677422/1.168096566.bz2', '/xds_nfs/edp_processed/ADVANCED/2020/Feb/1/29678828/1.168136224.bz2', '/xds_nfs/edp_processed/ADVANCED/2020/Feb/2/29678760/1.168133553.bz2', '/xds_nfs/edp_processed/ADVANCED/2020/Feb/1/29678694/1.168130676.bz2', '/xds_nfs/edp_processed/ADVANCED/2020/Feb/1/29677422/1.168130245.bz2', '/xds
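When a long download is interrupted, it can help to skip files that are already on disk. The following is a minimal sketch that relies on the local file keeping the remote file name, as described above; `files_still_needed` is a hypothetical helper, and the temporary directory stands in for `data_dir`.

```python
import tempfile
from pathlib import Path, PurePosixPath


def files_still_needed(remote_paths, store_directory):
    """Return the remote paths whose file name is not yet present locally."""
    store_directory = Path(store_directory)
    return [
        remote
        for remote in remote_paths
        if not (store_directory / PurePosixPath(remote).name).exists()
    ]


remote = [
    "/xds_nfs/edp_processed/ADVANCED/2020/Feb/1/29678694/1.168130661.bz2",
    "/xds_nfs/edp_processed/ADVANCED/2020/Feb/1/29677422/1.168096552.bz2",
]

# Demonstration with a temporary directory standing in for data_dir
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "1.168130661.bz2").touch()  # pretend this file was already downloaded
    needed = files_still_needed(remote, tmp)

print(needed)
```

In the download loop, iterating over `files_still_needed(available_files, data_dir)` instead of `available_files` would then only request the missing files. Note that this checks for existence only, so a partially written file would also be skipped.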

Each downloaded file contains JSON-formatted text compressed with bzip2 (.bz2). The following extracts each file into the directory it was downloaded to. The extracted file is given no file extension, but it is plain text and can be opened in any text reader.

In [None]:
extracted_files = []

for file in downloaded_files:
    newfilepath = file.split('bz2')[0] # dropping the 'bz2' suffix (the trailing dot is kept) and saving without a filetype
    # context managers close both files even if decompression fails
    with BZ2File(file) as zipfile, open(newfilepath, 'wb') as extracted:
        extracted.write(zipfile.read()) # write the decompressed data
    extracted_files.append(newfilepath)
    
print(extracted_files)

['/Users/tombardrick/Documents/projects/betfair/betfair_project/data/raw/api/1.168130661.', '/Users/tombardrick/Documents/projects/betfair/betfair_project/data/raw/api/1.168096552.', '/Users/tombardrick/Documents/projects/betfair/betfair_project/data/raw/api/1.168136214.', '/Users/tombardrick/Documents/projects/betfair/betfair_project/data/raw/api/1.168133543.', '/Users/tombardrick/Documents/projects/betfair/betfair_project/data/raw/api/1.168130666.', '/Users/tombardrick/Documents/projects/betfair/betfair_project/data/raw/api/1.168096559.', '/Users/tombardrick/Documents/projects/betfair/betfair_project/data/raw/api/1.168136219.', '/Users/tombardrick/Documents/projects/betfair/betfair_project/data/raw/api/1.168130671.', '/Users/tombardrick/Documents/projects/betfair/betfair_project/data/raw/api/1.168133548.', '/Users/tombardrick/Documents/projects/betfair/betfair_project/data/raw/api/1.168096566.', '/Users/tombardrick/Documents/projects/betfair/betfair_project/data/raw/api/1.168136224.'
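For a quick look at a file's contents, the compressed file can also be read directly without writing an extracted copy. This sketch assumes the file holds one JSON object per line; the two messages written to the sample file are made-up placeholders, not real Betfair data, and `iter_messages` is a hypothetical helper.

```python
import bz2
import json
import tempfile
from pathlib import Path


def iter_messages(bz2_path):
    """Yield one parsed JSON object per non-empty line of a bzip2-compressed file."""
    with bz2.open(bz2_path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield json.loads(line)


# Build a tiny compressed sample so the sketch runs without a real download
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.bz2"
    with bz2.open(sample, "wt", encoding="utf-8") as handle:
        handle.write('{"op": "mcm", "pt": 1}\n{"op": "mcm", "pt": 2}\n')
    messages = list(iter_messages(sample))

print(len(messages), messages[0])
```

Because the generator reads line by line, only one message is held in memory at a time, which matters more as the number of files grows.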

## Using stream listener to read extracted data

We now have a readable raw text file in Betfair's dictionary format. To interpret the data with Python, betfairlightweight includes a stream listener which interprets historical data in the same way that it does live data. The code below is adapted from [this betfairlightweight example](https://github.com/liampauling/betfair/blob/104ff4cb8734038cb9351e74d16dc7bd018111bc/examples/examplestreaminghistorical.py).

It finds the chosen variables in the data file and writes them in CSV format to `output.txt`. Note: `output.txt` is ignored by git and is not uploaded to the repository.

In [None]:
from betfairlightweight import StreamListener
from betfairlightweight.streaming.stream import MarketStream

class HistoricalStream(MarketStream):
    # create custom listener and stream

    def __init__(self, listener):
        super(HistoricalStream, self).__init__(listener)

    def on_process(self, market_books):
        with open("output.txt", "a") as output:
            for market_book in market_books:
                for runner in market_book.runners:

                    # how to get runner details from the market definition
                    market_def = market_book.market_definition
                    runners_dict = {
                        (runner.selection_id, runner.handicap): runner
                        for runner in market_def.runners
                    }
                    runner_def = runners_dict.get(
                        (runner.selection_id, runner.handicap)
                    )

                    output.write(
                        "%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s\n"
                        % (
                            market_book.publish_time, #datetime.datetime
                            market_book.market_id, #str
                            market_book.status, #str
                            market_book.inplay, #bool
                            runner.selection_id, #int
                            runner.last_price_traded or "", #float
                            runner.total_matched or "", #float
                            runner.sp.actual_sp or "", #float
                            runner.adjustment_factor or "", #float
                            runner.handicap or "", #float
                            market_book.number_of_active_runners or "", #int
                            runner.status,
                            market_book.total_matched or "", #float
                        )
                    )


class HistoricalListener(StreamListener):
    def _add_stream(self, unique_id, stream_type):
        if stream_type == "marketSubscription":
            return HistoricalStream(self)
        
listener = HistoricalListener(max_latency=None)
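Several fields in the `on_process` method above are written as `value or ""`, which blanks out missing values but also any value that is falsy, such as a handicap of `0.0`. One possible alternative, sketched here with the standard-library `csv` module and a hypothetical `to_cell` helper, blanks only `None`. The row values are made up for illustration.

```python
import csv
import io


def to_cell(value):
    """Blank for None, otherwise the value unchanged (so 0.0 and False are kept)."""
    return "" if value is None else value


# Hypothetical row: market id, inplay flag, selection id, missing price, zero handicap
row = ["1.168130661", False, 12345, None, 0.0]

buffer = io.StringIO()  # stands in for the open output.txt handle
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow([to_cell(value) for value in row])

print(repr(buffer.getvalue()))
print(repr(0.0 or ""))  # the `or ""` idiom turns a zero into an empty string
```

`csv.writer` also quotes any field that happens to contain a comma, which plain `%s` formatting does not.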

The Historical Listener is now run over each file in `extracted_files`. For the full day of racing on 01/02/20, `output.txt` is 493 MB.

In [None]:
with open("output.txt", "w") as output:
    # adjacent string literals are joined, so no stray indentation ends up in the header
    output.write(
        "Time,MarketId,Status,Inplay,SelectionId,LastPriceTraded,TotalMatched,Sp,"
        "AdjFactor,Handicap,NumRunnersActive,RunnerStatus,MktTotalMatched\n"
    )

for file in extracted_files:
    stream = trading.streaming.create_historical_stream(
        directory=file,
        listener=listener,
    )
    stream.start()

Delete all downloaded files:

In [None]:
for file in downloaded_files:
    file = Path(file)
    file.unlink()
    
for file in extracted_files:
    file = Path(file)
    file.unlink()

We can now read output.txt with pandas.

In [None]:
df = pd.read_csv('output.txt')
df.head()

In [None]:
df.describe() #descriptive statistics for this data.
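With a file of this size, two `read_csv` options may be worth considering: `dtype` to keep `MarketId` as text, since parsing it as a float would drop a trailing zero, and `chunksize` to process the file in pieces. The sketch below uses a small in-memory CSV with made-up rows in place of `output.txt`.

```python
import io

import pandas as pd

# Made-up rows standing in for output.txt; note the market id ending in a zero
sample_csv = io.StringIO(
    "Time,MarketId,SelectionId,TotalMatched\n"
    "2020-02-01 12:00:00,1.168130670,101,10.0\n"
    "2020-02-01 12:00:01,1.168130670,101,15.0\n"
    "2020-02-01 12:00:02,1.168130661,202,7.5\n"
)

row_count = 0
market_ids = set()

# Read two rows at a time, keeping MarketId as a string
for chunk in pd.read_csv(sample_csv, dtype={"MarketId": str}, chunksize=2):
    row_count += len(chunk)
    market_ids.update(chunk["MarketId"])

print(row_count, sorted(market_ids))
```

For the real file, the chunk size would be much larger, and each chunk could be filtered or aggregated before being kept in memory.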

### Creating variables for better analysis

To understand the evolution of matches at each price, we can create `TradeSize` by looking at the change in `TotalMatched` grouped by `LastPriceTraded`.

In [None]:
df['TradeSize'] = df.groupby(['MarketId','SelectionId', 'LastPriceTraded'])['TotalMatched'].diff()
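To see what this grouped `diff()` produces, here is a toy example with made-up numbers for a single runner. The column names match those used above; the values are purely illustrative.

```python
import pandas as pd

# One runner whose last traded price moves between 3.0 and 3.1
toy = pd.DataFrame(
    {
        "MarketId": ["1.1"] * 5,
        "SelectionId": [101] * 5,
        "LastPriceTraded": [3.0, 3.0, 3.1, 3.0, 3.1],
        "TotalMatched": [10.0, 25.0, 30.0, 42.0, 50.0],
    }
)

# Change in TotalMatched since the previous row with the same last traded price
toy["TradeSize"] = toy.groupby(["MarketId", "SelectionId", "LastPriceTraded"])[
    "TotalMatched"
].diff()

print(toy)
```

The first row at each price has no earlier row to compare with, so its `TradeSize` is `NaN`. Note also that the fourth row gives 17.0 (42.0 minus 25.0), which includes the 5.0 matched while the last traded price was 3.1, because `TotalMatched` here is the runner's running total rather than a per-price total.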

To make race time series data comparable between events, we need an indicator of what stage the market is in. For example, betting behaviour is very different when the market is first created compared with the 5 minutes before the off, or compared with in play. A logical anchoring point is the moment at which the race starts, which is also the point at which BSP is defined.

`TimeIndex` is created with time-zero defined at the second where `Inplay == True` for the first time for each `MarketId`, counting positively and negatively in either direction **in seconds**. 

In [None]:
# df['InplayBool'] = (df['Inplay'] == True) * 1
# df['InplayBool'] = df['InplayBool'].diff()

# race_start = df.iloc[df.index[df['InplayBool'] == 1.0].tolist(), 0].tolist()
# race_start_index = df.index[df['InplayBool'] == 1.0].tolist()
# market_ids = df['MarketId'].unique().tolist()

# del df['InplayBool']

# print(race_start)
# print(race_start_index)
# print(market_ids)

In [None]:
# original

In [None]:
# %%timeit

# df2 = df.iloc[0:0]
# df2['TimeIndex'] = None

# for market, t_0 in zip(market_ids, race_start):
#     df_temp = df[df['MarketId'] == market].reset_index()
#     df_temp['TimeIndex'] = df_temp['Time'].apply(lambda x: (datetime.datetime.strptime(x[:19], "%Y-%m-%d %H:%M:%S") - datetime.datetime.strptime(t_0[:19], "%Y-%m-%d %H:%M:%S")))
#     df2 = pd.concat([df2, df_temp])

# df2['TimeIndex'] = df2['TimeIndex'].apply(lambda x: int(pd.Timedelta.total_seconds(x)))

# del df2['index']

In [None]:
# test

In [None]:
# del df2 # free up memory for test

In [None]:
%%time

# copying df
df_test = df.copy()

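# converting to datetime; keeping the first 19 characters drops any fractional seconds,
# which would otherwise fail to match the format and be coerced to NaT
df_test['Time'] = pd.to_datetime(df_test['Time'].str[:19], format="%Y-%m-%d %H:%M:%S", errors='coerce')

# first in-play timestamp for each market, broadcast to every row of that market
race_start = df_test['Time'].where(df_test['Inplay']).groupby(df_test['MarketId']).transform('min')

# seconds relative to the off: negative before the race starts, positive in play
df_test['TimeIndex'] = (df_test['Time'] - race_start).dt.total_seconds()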

In [None]:
# should definitely refine / remove needless time points before applying function
# time taken seems to be exponentially longer by row otherwise

In [None]:
# milliseconds important?