# Downloading raw prices
Because we already wrote functions in notebook 1 to download 1-minute data, this part is easy. Keep in mind that many stocks, such as SPACs, have almost no price data, so there are a lot of empty bars. Some stocks do not trade even once in an entire day. However, I cannot remove them: if a stock has a lot of activity for just one week, my trading systems may still trade it. That is the curse of the hard-to-borrow (HTB) small-cap space. I cannot simply filter on monthly liquidity.
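
To illustrate why a monthly liquidity filter is too coarse, here is a minimal sketch with made-up volumes: a ticker that is silent for a month except for one busy week. The function name `has_active_week` and the thresholds are hypothetical and not part of these notebooks.

```python
import pandas as pd


def has_active_week(daily_volume: pd.Series, min_weekly_volume: float) -> bool:
    """Return True if any calendar week reaches min_weekly_volume."""
    weekly = daily_volume.resample("W").sum()
    return bool((weekly >= min_weekly_volume).any())


# Made-up data: 20 business days, no volume except during one week.
days = pd.bdate_range("2023-01-02", periods=20)
volume = pd.Series(0, index=days, dtype="int64")
volume.loc["2023-01-16":"2023-01-20"] = 2_000_000  # the one active week

print(volume.mean())  # 500000.0, which looks illiquid on a monthly basis
print(has_active_week(volume, min_weekly_volume=5_000_000))  # True
```

A filter on the monthly average would drop this ticker, even though a system that looks at recent activity could have traded it during that week.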

In [1]:
from polygon.rest import RESTClient
from datetime import datetime, date, time, timedelta
from utils import get_tickers, download_m1_raw_data, datetime_to_unix
import pytz
import requests
import pandas as pd
import numpy as np

DATA_PATH = "../data/polygon/"

with open(DATA_PATH + "secret.txt") as f:
    KEY = next(f).strip()

client = RESTClient(api_key=KEY)

**Non-parallel approach**

Downloading outside of market hours is much faster.

Instead of CSV files we use <code>Parquet</code>. This saves a lot of disk space at the cost of human readability, although you can still inspect the files with [tad](https://www.tadviewer.com/) (fastparquet) or [ParquetViewer](https://github.com/mukunku/ParquetViewer/releases) (pyarrow). Pandas also supports reading Parquet files directly.

In <code>06_fastparquet.ipynb</code> you can see a comparison between Parquet compression algorithms and CSV files. We save more than 50% in disk space, and writes are more than 7 times faster.

Downloading the data from 2019-01-01 to 2023-09-01 takes around 20 hours. However, this only has to be done once. After that you only need to update (append).

In [None]:
tickers = get_tickers(v=3)
tickers = tickers[tickers["type"].isin(["CS", "ADRC", "ETF", "ETN", "ETV"])]
tickers.reset_index(inplace=True, drop=True)

# For timing
length = len(tickers)
start_time = datetime.now()
total_days_to_download = (tickers.end_date - tickers.start_date).sum()
downloaded_days = timedelta(0)

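# Rows before resume_from were downloaded in an earlier run; set it to 0 to download everything
resume_from = 4385

for index, row in tickers.iterrows():
    if index < resume_from:
        # Skipped rows are not part of the remaining work, so leave them out of the ETA
        total_days_to_download -= row["end_date"] - row["start_date"]
        continue

    ticker_id = row["ID"]  # named ticker_id to avoid shadowing the built-in id()
    ticker = row["ticker"]
    start_date = row["start_date"]
    end_date = row["end_date"]

    download_m1_raw_data(
        ticker=ticker, from_=start_date, to=end_date,
        columns=["open", "high", "low", "close", "volume"],
        path=DATA_PATH + f"raw/m1/{ticker_id}.parquet",
        client=client, to_parquet=True,
    )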

    # For timing (becomes accurate after 5.0%)
    passed_time = datetime.now() - start_time
    days_just_downloaded = end_date - start_date
    
    total_days_to_download -= days_just_downloaded
    downloaded_days += days_just_downloaded
    used_time_per_day = passed_time/downloaded_days

    remaining_time = used_time_per_day*total_days_to_download
    remaining_hours = int(remaining_time.total_seconds()/3600)
    remaining_minutes = int((remaining_time.total_seconds()%3600)/60)

    print(f"Progress: {round(index/length*100, 1)}% | ETA: {remaining_hours} hours and {remaining_minutes} minutes")

In [3]:
# pd.read_csv(DATA_PATH + f"raw/m1/A-2021-01-01.csv", index_col="datetime")
# pd.read_parquet(DATA_PATH + f"raw/m1/A-2021-01-01.parquet")

**Parallel approach (may not work now)**

The Polygon Python client makes blocking HTTP requests and does not support asynchronous processing, so I have to use <code>aiohttp</code> and work with raw requests. We also have to handle pagination ourselves, because a single request returns at most 50,000 results.
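
A minimal sketch of the pagination loop, assuming each response carries its bars under `results` and, when more data is available, a `next_url` link to the following page. The HTTP call is replaced by a fake coroutine so the example runs without a network connection or API key. `fetch_all_pages`, `fake_fetch_page` and the page contents are made up for illustration and are not the code in <code>06_parallel.py</code>.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, List, Optional

Page = Dict[str, Any]


async def fetch_all_pages(
    first_url: str, fetch_page: Callable[[str], Awaitable[Page]]
) -> List[dict]:
    """Follow next_url links until a response no longer contains one."""
    results: List[dict] = []
    url: Optional[str] = first_url
    while url:
        page = await fetch_page(url)
        results.extend(page.get("results") or [])
        url = page.get("next_url")
    return results


# Stand-in for real responses: two pages, the second one is the last.
FAKE_PAGES: Dict[str, Page] = {
    "page-1": {"results": [{"t": 1}, {"t": 2}], "next_url": "page-2"},
    "page-2": {"results": [{"t": 3}]},
}


async def fake_fetch_page(url: str) -> Page:
    await asyncio.sleep(0)  # yield control, like a real request would
    return FAKE_PAGES[url]


bars = asyncio.run(fetch_all_pages("page-1", fake_fetch_page))
print(len(bars))  # 3
```

With aiohttp, `fetch_page` would wrap a `session.get(...)` call and return the parsed JSON. Inside a notebook, where an event loop is already running, use `await fetch_all_pages(...)` instead of `asyncio.run`.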

I used ChatGPT to convert the code above to aiohttp, but the result did not work. After manual debugging I got it working, and the file is now <code>06_parallel.py</code>. You can specify the maximum number of parallel requests. Setting it to 10 should make downloading up to roughly 10 times faster. Be careful not to send too many requests at once.
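
One common way to cap the number of parallel requests is an `asyncio.Semaphore`. The sketch below is not the code from <code>06_parallel.py</code>: `download_all` is a hypothetical name, and a short sleep stands in for the HTTP request. It also records the peak number of simultaneous downloads to show that the cap holds.

```python
import asyncio
from typing import List, Tuple


async def download_all(tickers: List[str], max_parallel: int) -> Tuple[List[str], int]:
    """Run one fake download per ticker, at most max_parallel at a time."""
    semaphore = asyncio.Semaphore(max_parallel)
    state = {"running": 0, "peak": 0}

    async def download_one(ticker: str) -> str:
        async with semaphore:  # waits here while max_parallel downloads are running
            state["running"] += 1
            state["peak"] = max(state["peak"], state["running"])
            await asyncio.sleep(0.01)  # stand-in for the HTTP request
            state["running"] -= 1
            return ticker

    done = await asyncio.gather(*(download_one(t) for t in tickers))
    return list(done), state["peak"]


done, peak = asyncio.run(download_all([f"T{i}" for i in range(25)], max_parallel=10))
print(len(done), peak <= 10)  # 25 True
```

The semaphore only limits how many requests are in flight. It does not limit requests per second, so a rate limit on the API side would still need separate handling.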

By the way, it is also possible to run multiple notebooks at the same time in VSCode. Simply copy this file several times, have each copy filter on a different range of <code>index</code>, and run them all. This speeds things up because the bottleneck is not processing power but the latency of each request.
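
To decide which range of <code>index</code> each copy should handle, a small helper like the one below could be used. The name `split_index_ranges` is made up for illustration.

```python
from typing import List, Tuple


def split_index_ranges(n_rows: int, n_parts: int) -> List[Tuple[int, int]]:
    """Split range(n_rows) into n_parts contiguous (start, stop) ranges of near-equal size."""
    base, extra = divmod(n_rows, n_parts)
    ranges = []
    start = 0
    for part in range(n_parts):
        # The first `extra` parts get one additional row
        stop = start + base + (1 if part < extra else 0)
        ranges.append((start, stop))
        start = stop
    return ranges


print(split_index_ranges(10, 3))  # [(0, 4), (4, 7), (7, 10)]
```

Each notebook copy would then only process rows with `start <= index < stop`. Splitting by row count does not balance the number of days per copy, so some copies may finish earlier than others.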

# Updates
Not implemented.