# 6.1 Downloading raw prices
Because notebook 1 already contains functions to download 1-minute data, this step is easy. Keep in mind that many stocks, such as SPACs, have almost no price data, so there are a lot of empty bars. Some stocks do not have a single trade for an entire day. However, I cannot remove them either: if a stock has a lot of activity for even one week, my trading systems may still trade it. That is the curse of the HTB (hard-to-borrow) small-cap space. I cannot simply filter on monthly liquidity.

In [1]:
###
from polygon.rest import RESTClient
from datetime import datetime, date, time, timedelta
from tickers import get_tickers
from data import download_m1_raw_data
import pandas as pd
from fastparquet import write

DATA_PATH = "../data/polygon/"

with open(DATA_PATH + "secret.txt") as f:
    KEY = next(f).strip()

client = RESTClient(api_key=KEY)

**Non-parallel approach**

Downloading outside of market hours is much faster.

Instead of CSV files we use <code>Parquet</code>. This saves a lot of disk space at the cost of human readability, although you can still inspect the files with [tad](https://www.tadviewer.com/) (fastparquet) or [ParquetViewer](https://github.com/mukunku/ParquetViewer/releases) (pyarrow). pandas also supports reading Parquet files directly.

In <code>06_fastparquet.ipynb</code> you can see a comparison between Parquet compression algorithms and CSV files. We save more than 50% in disk space, while writes are more than 7 times faster.

Downloading the data from 2019-01-01 to 2023-09-01 takes around 20 hours. However, this only has to be done once, after which you only update (append).
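
Since the download runs for many hours, the loop below prints an ETA. The idea is to extrapolate from the calendar days already covered to the calendar days still to go. A minimal standalone sketch of that calculation, with hypothetical helper names (`estimate_remaining` and `format_eta` are not part of the notebook):

```python
from datetime import timedelta


def estimate_remaining(elapsed: timedelta, done_days: timedelta, todo_days: timedelta) -> timedelta:
    """Extrapolate the remaining wall-clock time from the days downloaded so far."""
    if done_days <= timedelta(0):
        # Nothing downloaded yet, so there is no rate to extrapolate from
        raise ValueError("need at least one downloaded day")
    # timedelta / timedelta is a plain float: how many times more work is left
    return elapsed * (todo_days / done_days)


def format_eta(remaining: timedelta) -> str:
    """Format a timedelta the same way as the progress line in the loop."""
    hours, rest = divmod(int(remaining.total_seconds()), 3600)
    return f"{hours} hours and {rest // 60} minutes"


# 30 minutes spent on 1000 ticker-days, 9000 ticker-days left
eta = estimate_remaining(timedelta(minutes=30), timedelta(days=1000), timedelta(days=9000))
print(format_eta(eta))  # 4 hours and 30 minutes
```

The estimate assumes every calendar day costs about the same time, which is only roughly true because illiquid tickers return far fewer bars.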

In [None]:
tickers = get_tickers(v=3)
tickers = tickers[tickers["type"].isin(["CS", "ADRC", "ETF"])]
tickers.reset_index(inplace=True, drop=True)

# For timing
length = len(tickers)
start_time = datetime.now()
total_days_to_download = (tickers.end_date - tickers.start_date).sum()
downloaded_days = timedelta(0)

for index, row in tickers.iterrows():
    id = row["ID"]
    ticker = row["ticker"]

    start_date = row["start_date"]
    end_date = row["end_date"]

    m1 = download_m1_raw_data(ticker = ticker, from_ = start_date, to = end_date, columns = ["open", "high", "low", "close", "volume"], client=client)
    if m1 is None:
        continue

    m1.to_parquet(DATA_PATH + f"raw/m1/{id}.parquet", engine="fastparquet", compression="snappy", row_group_offsets=25000)

    # For timing (becomes accurate after 5.0%)
    passed_time = datetime.now() - start_time
    days_just_downloaded = end_date - start_date
    
    total_days_to_download -= days_just_downloaded
    downloaded_days += days_just_downloaded
    used_time_per_day = passed_time/downloaded_days

    remaining_time = used_time_per_day*total_days_to_download
    remaining_hours = int(remaining_time.total_seconds()/3600)
    remaining_minutes = int((remaining_time.total_seconds()%3600)/60)

    print(f"Progress: {round(index/length*100, 1)}% | ETA: {remaining_hours} hours and {remaining_minutes} minutes")

In [3]:
# pd.read_csv(DATA_PATH + f"raw/m1/A-2021-01-01.csv", index_col="datetime")
# pd.read_parquet(DATA_PATH + f"raw/m1/A-2021-01-01.parquet")

**Parallel approach (may not work now)**

The Polygon client uses the <code>requests</code> library, which does not support asynchronous requests. So I have to use <code>aiohttp</code> and work with raw requests. We also have to handle pagination ourselves because of the 50,000-result limit.

I used ChatGPT to convert the code above to work with <code>aiohttp</code>, but the result did not work. After manual debugging I got it working, and the file is now in <code>06_parallel.py</code>. You can specify the maximum number of parallel requests. Setting it to 10 should make downloading roughly 10 times faster. Be careful not to generate too many requests.
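
The two ingredients, following pagination links and capping the number of concurrent requests, can be sketched without any network access. This is not the code in <code>06_parallel.py</code>. The `fetch_json` stub and the `FAKE_PAGES` data are made up for illustration, and the `next_url` field name is an assumption about the response format. A real version would replace the stub with an `aiohttp` GET.

```python
import asyncio

# Hypothetical in-memory "API": each URL maps to one page of results plus an
# optional link to the next page.
FAKE_PAGES = {
    "AAA?page=1": {"results": [1, 2], "next_url": "AAA?page=2"},
    "AAA?page=2": {"results": [3], "next_url": None},
    "BBB?page=1": {"results": [10], "next_url": None},
}


async def fetch_json(url: str) -> dict:
    """Stand-in for an HTTP GET that returns parsed JSON."""
    await asyncio.sleep(0.01)  # simulate network latency
    return FAKE_PAGES[url]


async def fetch_all_pages(first_url: str, limiter: asyncio.Semaphore) -> list:
    """Follow next_url links until the last page and collect all results."""
    results, url = [], first_url
    while url:
        async with limiter:  # at most max_parallel requests in flight
            page = await fetch_json(url)
        results.extend(page.get("results", []))
        url = page.get("next_url")
    return results


async def download_many(first_urls: list, max_parallel: int = 10) -> list:
    limiter = asyncio.Semaphore(max_parallel)
    return await asyncio.gather(*(fetch_all_pages(u, limiter) for u in first_urls))


bars = asyncio.run(download_many(["AAA?page=1", "BBB?page=1"], max_parallel=2))
print(bars)  # [[1, 2, 3], [10]]
```

The semaphore is what keeps the request rate in check: raising `max_parallel` speeds things up only until the server starts throttling.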

By the way, it is possible to run multiple notebooks at the same time in VSCode. Simply copy this file a few times, filter each copy on a different range of <code>index</code> and run them all. This speeds things up because the bottleneck is not processing power but the latency of each request.
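
One simple way to give each notebook copy its own share of the work is to take every n-th row. A small sketch with a hypothetical `shard` helper, shown on a plain list:

```python
def shard(items, n_shards, shard_index):
    """Return every n_shards-th item, starting at shard_index."""
    return items[shard_index::n_shards]


ids = list(range(10))
parts = [shard(ids, 3, i) for i in range(3)]
print(parts)  # [[0, 3, 6, 9], [1, 4, 7], [2, 5, 8]]
```

For the tickers DataFrame the equivalent would be `tickers.iloc[shard_index::n_shards]`, with a different `shard_index` in each copy of the notebook.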

Since April 2024 it is possible to download flat files from Polygon, which is MUCH faster than sending a gigantic number of requests. However, I am too lazy to update the code above.

# 6.2 Updates
Run the first cell to import the modules and then run the cells below. Making a backup of the <code>raw/m1</code> folder is recommended! If something goes wrong in this step, you cannot go back without downloading everything again. Reminder: the only difference between tickers_v3 and tickers_v4 is that tickers_v4 removes the ghost tickers and has start/end dates for the data.

<code>old_END_DATE</code> is the date up to which we already have data.

Loop through <code>tickers_v3</code>:
- If the ID is in the (old) <code>tickers_v4</code> and <code>end_date</code> (v3) is later than <code>old_END_DATE</code> (v4), the ticker kept its listing. We can simply append the new data. This is the case for most stocks.
- If the ID is not in <code>tickers_v4</code> and <code>start_date</code> (v3) is later than <code>old_END_DATE</code>, this is a new listing and we need to create a new file. If <code>start_date</code> is on or before <code>old_END_DATE</code>, the ticker had no data in the first place, because the ghost tickers have already been removed from <code>tickers_v4</code>.

All other tickers have been delisted (or never had data) and need no updating.
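
The decision rules above can be written as one small function, which makes them easy to check in isolation. A sketch with a hypothetical helper name and made-up IDs and dates:

```python
from datetime import date


def classify(ticker_id, start_date, end_date, old_ids, old_end_date):
    """Return 'append', 'create' or 'skip' for one row of tickers_v3."""
    if ticker_id in old_ids and end_date > old_end_date:
        return "append"  # still listed: add the new bars to the existing file
    if ticker_id not in old_ids and start_date > old_end_date:
        return "create"  # new listing: write a new file
    return "skip"  # delisted, or a ghost ticker without data


old_end = date(2023, 9, 1)
old_ids = {"AAA-2019-01-01", "BBB-2019-01-01"}

still_listed = classify("AAA-2019-01-01", date(2019, 1, 1), date(2024, 1, 1), old_ids, old_end)
delisted = classify("BBB-2019-01-01", date(2019, 1, 1), date(2022, 5, 1), old_ids, old_end)
new_listing = classify("CCC-2023-10-02", date(2023, 10, 2), date(2024, 1, 1), old_ids, old_end)
ghost = classify("DDD-2020-01-01", date(2020, 1, 1), date(2024, 1, 1), old_ids, old_end)
print(still_listed, delisted, new_listing, ghost)  # append skip create skip
```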

In [2]:
tickers_v4 = get_tickers(v=4, types=["CS", "ADRC", "ETF"])
tickers_v4.to_csv("../data/tickers_v4_OLD.csv")

In [3]:
old_tickers_v4 = get_tickers(v=4, types=["CS", "ADRC", "ETF"])
old_END_DATE = old_tickers_v4['end_date'].max()
old_IDs = set(old_tickers_v4['ID'])  # set: fast membership tests in the loop below

In [None]:
tickers_v3 = get_tickers(v=3, types=["CS", "ADRC", "ETF"])

for index, row in tickers_v3.iterrows():
    id = row["ID"]
    ticker = row["ticker"]
    start_date = row["start_date"]
    end_date = row["end_date"]

    if id in old_IDs and end_date > old_END_DATE:
        update = download_m1_raw_data(ticker = ticker, from_ = old_END_DATE + timedelta(days=1), to = end_date, columns = ["open", "high", "low", "close", "volume"], client=client)
        if update is not None:
            write(DATA_PATH + f"raw/m1/{id}.parquet", update, append=True, compression="snappy", row_group_offsets=25000)
            print(f'Updating {id}')

    elif id not in old_IDs and start_date > old_END_DATE:
        data = download_m1_raw_data(ticker = ticker, from_ = start_date, to = end_date, columns = ["open", "high", "low", "close", "volume"], client=client)
        # download_m1_raw_data returns None when there is no data (same check as in 6.1)
        if data is not None:
            data.to_parquet(DATA_PATH + f"raw/m1/{id}.parquet", engine="fastparquet", compression="snappy", row_group_offsets=25000)
            print(f'Downloading {id}')

# Flat files
Goal: download the flatfiles and split them into individual ticker files.

In [2]:
import os
import pytz
import gzip
from pytz import timezone
from times import get_market_dates
import boto3
from botocore.config import Config

START_DATE = date(2003, 9, 10)
END_DATE = date(2024, 4, 19)

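# Read the flat-file credentials from environment variables instead of hard-coding them.
# The variable names are placeholders chosen for this notebook, not an official convention.
session = boto3.Session(
   aws_access_key_id=os.environ['POLYGON_S3_ACCESS_KEY_ID'],
   aws_secret_access_key=os.environ['POLYGON_S3_SECRET_ACCESS_KEY'],
)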
s3 = session.client(
   's3',
   endpoint_url='https://files.polygon.io',
   config=Config(signature_version='s3v4'),
)

### Initial download
Download everything
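
Each trading day is stored under a key of the form year/month/ISO date. The hypothetical helper below just isolates the key pattern used in the download cells so that it can be checked on its own:

```python
from datetime import date


def minute_aggs_key(day: date) -> str:
    """Build the object key of the minute-aggregates flat file for one day."""
    return f"us_stocks_sip/minute_aggs_v1/{day.year}/{day:%m}/{day.isoformat()}.csv.gz"


key = minute_aggs_key(date(2024, 4, 19))
print(key)  # us_stocks_sip/minute_aggs_v1/2024/04/2024-04-19.csv.gz
```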

In [94]:
# TODELETE
for day in get_market_dates(date(2004, 1, 1), END_DATE):
    destination = DATA_PATH + f'raw/flatfiles/{day.isoformat()}.csv.gz'
    s3.download_file('flatfiles', 
                f'us_stocks_sip/minute_aggs_v1/{day.year}/{day.strftime("%m")}/{day.isoformat()}.csv.gz', 
                destination)

Process year by year. This takes a very long time, but it only has to run once, so I did not go through the hassle of optimizing it.
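
The <code>window_start</code> column holds nanoseconds since the Unix epoch in UTC, and the processing code converts it to timezone-naive Eastern Time. A small check of that conversion on two hand-picked timestamps, one in winter and one in summer. I use `America/New_York` here, which is the same zone as `US/Eastern`:

```python
import pandas as pd

# 2024-01-05 14:30:00 UTC and 2024-04-19 13:30:00 UTC, in nanoseconds
window_start = [1704465000000000000, 1713533400000000000]

index = pd.to_datetime(window_start, unit="ns")  # naive, but actually UTC
index = index.tz_localize("UTC")  # make it UTC-aware so it can be converted
index = index.tz_convert("America/New_York")  # UTC -> Eastern Time
index = index.tz_localize(None)  # drop the timezone again

print(list(index))  # both are 09:30 ET, the regular market open
```

Both timestamps land on 09:30 even though the UTC offset differs between winter and summer, which is why a fixed offset would not be enough.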

In [3]:
for year in range(START_DATE.year, END_DATE.year + 1):
    files = []
    # Clamp each year to the date range covered by the flat files
    first_day = max(date(year, 1, 1), START_DATE)
    last_day = min(date(year, 12, 31), END_DATE)
    for day in get_market_dates(first_day, last_day):
        destination = DATA_PATH + f'raw/flatfiles/{day.isoformat()}.csv.gz'
        if not os.path.isfile(destination):
            continue  # Skip days that were not downloaded
        with gzip.open(destination) as f:
            all_bars = pd.read_csv(f)
            all_bars = all_bars[['window_start', 'ticker', 'open', 'high', 'low', 'close', 'volume']]
            all_bars = all_bars.rename(columns={'window_start': 'datetime'})
            all_bars = all_bars.set_index('datetime')
            all_bars.index = pd.to_datetime(all_bars.index, unit='ns') # Convert to datetime (UTC-naive)
            # Make UTC aware (in order to convert)
            # Convert UTC to ET
            # Make timezone naive
            all_bars.index = all_bars.index.tz_localize(pytz.UTC).tz_convert("US/Eastern").tz_localize(None)  
            files.append(all_bars)
            print(day)

    if not files:
        continue  # Nothing was downloaded for this year

    all_bars = pd.concat(files)
    all_bars = all_bars.reset_index()
    all_bars = all_bars.set_index('ticker')

    for ticker in list(set(all_bars.index.unique())):
        bars = all_bars.loc[ticker]
        if isinstance(bars, pd.Series):
            bars = all_bars.loc[[ticker]]
        bars = bars[['datetime', 'open', 'high', 'low', 'close', 'volume']]
        bars = bars.set_index('datetime')

        if os.path.isfile(DATA_PATH + f'raw/m1/{ticker}.parquet'):
            write(DATA_PATH + f"raw/m1/{ticker}.parquet", bars, append=True, compression="snappy", row_group_offsets=25000)
        else:
            bars.to_parquet(DATA_PATH + f"raw/m1/{ticker}.parquet", engine="fastparquet", compression="snappy", row_group_offsets=25000)

### Updates
Process day-by-day
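
Splitting one day of bars into per-ticker frames is done below with a boolean mask per ticker. An alternative that scans the data only once is a single `groupby`. The sketch below uses made-up tickers and prices and is not what the cells in this notebook do:

```python
import pandas as pd

all_bars = pd.DataFrame(
    {
        "ticker": ["AAA", "BBB", "AAA"],
        "open": [1.0, 5.0, 1.1],
        "volume": [100, 50, 200],
    },
    index=pd.to_datetime(["2024-04-19 09:30", "2024-04-19 09:30", "2024-04-19 09:31"]),
)
all_bars.index.name = "datetime"

# One pass over the data: each group is a DataFrame, even for a single row
per_ticker = {
    ticker: bars.drop(columns="ticker")
    for ticker, bars in all_bars.groupby("ticker", sort=False)
}

print(list(per_ticker))  # ['AAA', 'BBB']
print(len(per_ticker["AAA"]))  # 2
```

Because every group is a DataFrame, the special case for tickers with a single bar is not needed.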

In [None]:
def process_flatfile(local_file_path):
    """Unzip the flat file and split it into per-ticker files, appending to files that already exist.
    """
    with gzip.open(local_file_path) as f:
        all_bars = pd.read_csv(f)
        all_bars = all_bars[['window_start', 'ticker', 'open', 'high', 'low', 'close', 'volume']]
        all_bars = all_bars.rename(columns={'window_start': 'datetime'})
        all_bars = all_bars.set_index('datetime')
        all_bars.index = pd.to_datetime(all_bars.index, unit='ns') # Convert to datetime (UTC-naive)
        all_bars.index = all_bars.index.tz_localize(pytz.UTC)  # Make UTC aware (in order to convert)
        all_bars.index = all_bars.index.tz_convert("US/Eastern")  # Convert UTC to ET
        all_bars.index = all_bars.index.tz_localize(None)  # Make timezone naive
        
        for ticker in all_bars['ticker'].unique():
            bars = all_bars[all_bars['ticker'] == ticker]
            bars = bars[['open', 'high', 'low', 'close', 'volume']]

            if os.path.isfile(DATA_PATH + f'raw/m1/{ticker}.parquet'):
                write(DATA_PATH + f"raw/m1/{ticker}.parquet", bars, append=True, compression="snappy", row_group_offsets=25000)
            else:
                bars.to_parquet(DATA_PATH + f"raw/m1/{ticker}.parquet", engine="fastparquet", compression="snappy", row_group_offsets=25000)

In [None]:
for day in get_market_dates(START_DATE, END_DATE):
    destination = DATA_PATH + f'raw/{day.isoformat()}.csv.gz'
    s3.download_file('flatfiles', 
                 f'us_stocks_sip/minute_aggs_v1/{day.year}/{day.strftime("%m")}/{day.isoformat()}.csv.gz', 
                 destination)
    process_flatfile(destination)
    os.remove(destination)