# Downloading raw prices
Because we already created functions in notebook 1 to download 1-minute data, this is easy. Keep in mind that many stocks, such as SPACs, have almost no price data, so there are a lot of empty bars. Some stocks don't even have a single trade for the entire day. However, I cannot remove them, because if such a stock has a lot of activity for one week, my trading systems may still trade it.

In [4]:
from polygon.rest import RESTClient
from datetime import datetime, date, time, timedelta
from utils import get_tickers, download_m1_raw_data, datetime_to_unix
import pytz
import requests
import pandas as pd
import numpy as np

DATA_PATH = "../../../data/polygon/"

with open(DATA_PATH + "secret.txt") as f:
    KEY = next(f).strip()

client = RESTClient(api_key=KEY)

**Non-parallel approach**

Downloading 100 stocks for 2021-01 to 2023-08 took 10 minutes and 538 MB of disk space. Downloading all stocks (assume 6000) would therefore take about 600 minutes (10 hours) and 32 GB. Downloading everything from 2004 up to now would take around 70 hours and 224 GB.
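The estimates above are plain linear scaling. A minimal sketch of the arithmetic, assuming download time and size grow proportionally with the number of tickers and the length of the period. The function name `extrapolate` is hypothetical, and the period factor of 7 is an assumption that roughly reproduces the 2004-to-now figures.

```python
def extrapolate(sample_minutes, sample_mb, ticker_factor, period_factor=1):
    """Linearly scale a measured download to a larger universe and period.

    Returns (hours, gigabytes), with 1 GB taken as 1000 MB.
    """
    scale = ticker_factor * period_factor
    hours = sample_minutes * scale / 60
    gigabytes = sample_mb * scale / 1000
    return hours, gigabytes


# Measured: 100 tickers, 2021-01 to 2023-08, 10 minutes, 538 MB.
# Assumed: 6000 tickers in total, so a ticker factor of 6000 / 100 = 60.
hours, gb = extrapolate(10, 538, ticker_factor=60)
print(f"2021-01 to 2023-08, all tickers: {hours:.0f} hours, {gb:.1f} GB")

# Assumed: 2004 to now is roughly 7 times as long as the sample period.
hours, gb = extrapolate(10, 538, ticker_factor=60, period_factor=7)
print(f"2004 to now, all tickers: {hours:.0f} hours, {gb:.0f} GB")
```

The last line gives about 226 GB rather than 224 GB, because the text rounds to 32 GB before multiplying by 7.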

So instead of CSV files we use <code>Parquet</code>. This saves a lot of disk space at the cost of human readability, although you can still inspect the files with [tad](https://www.tadviewer.com/) (does not work with brotli compression) or [ParquetViewer](https://github.com/mukunku/ParquetViewer/releases), and pandas already supports reading Parquet files. Based on a small sample, Parquet takes up only 37.6% of the CSV size. The Parquet database (default snappy compression) from 2021-01 to 2023-08 takes up 13.8 GB, which would have taken 36.7 GB as CSV files. To save even more disk space, I use the brotli compression algorithm, which produces files only 60% the size of those written with the standard snappy compression. However, it is 1.28 times slower. That seems like a good trade-off. Also, reading is very fast (~0.1 sec for almost 5 years of AAPL).

In the future, when I know SQL, I will build a proper database.

In [None]:
tickers = get_tickers(v=3)
tickers = tickers[tickers["type"].isin(["CS", "ADRC", "ETF", "ETN", "ETV"])]
tickers.reset_index(inplace=True, drop=True)

# For timing
length = len(tickers)
start_time = datetime.now()
total_days_to_download = (tickers.end_date - tickers.start_date).sum()
downloaded_days = timedelta(0)

for index, row in tickers.iterrows():
    ticker_id = row["ID"]  # avoid shadowing the built-in id()
    ticker = row["ticker"]
    start_date = row["start_date"]
    end_date = row["end_date"]

    download_m1_raw_data(
        ticker=ticker,
        from_=start_date,
        to=end_date,
        columns=["open", "high", "low", "close", "volume"],
        path=DATA_PATH + f"raw/m1/{ticker_id}.parquet",
        client=client,
        to_parquet=True,
    )

    # For timing (becomes accurate after 5.0%)
    passed_time = datetime.now() - start_time
    days_just_downloaded = end_date - start_date

    total_days_to_download -= days_just_downloaded
    downloaded_days += days_just_downloaded
    if downloaded_days == timedelta(0):
        continue  # nothing downloaded yet, so no ETA can be estimated
    used_time_per_day = passed_time / downloaded_days

    remaining_time = used_time_per_day * total_days_to_download
    remaining_hours = int(remaining_time.total_seconds() / 3600)
    remaining_minutes = int((remaining_time.total_seconds() % 3600) / 60)

    # index is 0-based, so index + 1 tickers are done at this point
    print(f"Progress: {round((index + 1) / length * 100, 1)}% | ETA: {remaining_hours} hours and {remaining_minutes} minutes")

In [19]:
# pd.read_csv(DATA_PATH + f"raw/m1/A-2021-01-01.csv", index_col="datetime")
# pd.read_parquet(DATA_PATH + f"raw/m1/A-2021-01-01.parquet")

**Parallel approach**

The Polygon client sends its HTTP requests synchronously, so it does not support asynchronous processing. I therefore have to use <code>aiohttp</code> and work with raw requests. We also have to handle pagination ourselves, because each request is limited to 50,000 results.
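The pagination part can be sketched independently of the HTTP library. This is a minimal sketch, not the code from the actual script: the function name `fetch_all_pages` and the injected `fetch_json` callable are hypothetical, and the `results` and `next_url` field names are assumptions about the shape of the paginated response that should be checked against the API documentation. A fake fetcher stands in for the network so the example runs on its own.

```python
import asyncio


async def fetch_all_pages(fetch_json, first_url):
    """Collect the "results" of every page, following "next_url" until it is absent.

    fetch_json is an async callable that takes a URL and returns the decoded
    JSON body as a dict.
    """
    results = []
    url = first_url
    while url is not None:
        page = await fetch_json(url)
        results.extend(page.get("results", []))
        url = page.get("next_url")  # assumed to be missing on the last page
    return results


# Stand-in for the network: two hypothetical pages keyed by URL
FAKE_PAGES = {
    "page-1": {"results": [{"c": 10.0}, {"c": 10.1}], "next_url": "page-2"},
    "page-2": {"results": [{"c": 10.2}]},
}


async def fake_fetch_json(url):
    await asyncio.sleep(0)  # yield control, as a real request would
    return FAKE_PAGES[url]


all_bars = asyncio.run(fetch_all_pages(fake_fetch_json, "page-1"))
print(len(all_bars), "bars collected")
```

With aiohttp, `fetch_json` would wrap a `session.get(url)` call and return `await response.json()`. Inside a notebook, `asyncio.run(...)` has to be replaced by a plain `await`, because the notebook already runs an event loop.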

I used ChatGPT to convert the code above to work with aiohttp, but the result did not work. After manual debugging I got it working, and the code is now in <code>3_get_prices_parallel.py</code>. You can specify the maximum number of parallel requests. Setting it to 10 should make downloading about 10 times faster. Be careful not to generate too many requests.
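One common way to cap the number of parallel requests is an `asyncio.Semaphore`. The sketch below is not taken from <code>3_get_prices_parallel.py</code>. The names `run_limited` and `fake_download` are hypothetical, and a short sleep stands in for the HTTP request.

```python
import asyncio


async def run_limited(jobs, max_parallel):
    """Run the coroutine functions in jobs with at most max_parallel at once."""
    semaphore = asyncio.Semaphore(max_parallel)

    async def guarded(job):
        async with semaphore:  # waits here while max_parallel jobs are running
            return await job()

    return await asyncio.gather(*(guarded(job) for job in jobs))


# Fake downloads that record how many of them run simultaneously
stats = {"running": 0, "peak": 0}


async def fake_download():
    stats["running"] += 1
    stats["peak"] = max(stats["peak"], stats["running"])
    await asyncio.sleep(0.01)  # stands in for the HTTP request
    stats["running"] -= 1
    return "done"


outcomes = asyncio.run(run_limited([fake_download] * 25, max_parallel=10))
print(f"{len(outcomes)} downloads finished, peak concurrency {stats['peak']}")
```

The semaphore only limits how many requests are in flight at once. It does not limit requests per second, so a rate limit on the API side may still need a separate delay.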

# Updates
Not implemented.