# Alpaca data import & processing
This notebook downloads tick data and converts them into 1-minute quotes.

My public and private keys are stored in ../data/alpaca/secret.txt, with the public key on the first line and the private key on the second.

First, it downloads tick data and saves them to ../data/alpaca/raw/tick/{stock}/{date}.csv

Second, it converts the tick data and saves the quotes to ../data/alpaca/processed/m1/quotes. The quotes contain the columns <code>["close_bid_size", "close_bid", "close_ask", "close_ask_size", "high_bid", "high_ask", "low_bid", "low_ask", "tradeable"]</code>. The column 'tradeable' is True if the [condition](https://alpaca.markets/docs/market-data/#uqdf) 'R' (regular trading) applies.

Third, it deletes the tick data.

Finally, it merges the 1-minute quotes with the 1-minute OHLC to get quotebars.
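
The final merge step is not shown in the cells below, so here is a minimal sketch of what it could look like. It assumes both frames share the same ET-naive minute index. The OHLC column names (`open`, `high`, `low`, `close`, `volume`) and all values are hypothetical.

```python
import pandas as pd

# Hypothetical 1-minute frames sharing an ET-naive minute index.
index = pd.date_range("2023-05-04 09:30", periods=3, freq="1min")

quotes = pd.DataFrame(
    {
        "close_bid": [10.00, 10.01, 10.02],
        "close_ask": [10.02, 10.03, 10.05],
        "tradeable": [True, True, False],
    },
    index=index,
)

# Assumed OHLC column names; minute 09:31 had no trades, so it has no bar.
ohlc = pd.DataFrame(
    {
        "open": [10.01, 10.03],
        "high": [10.02, 10.04],
        "low": [10.00, 10.02],
        "close": [10.01, 10.04],
        "volume": [500, 200],
    },
    index=index[[0, 2]],
)

# Quotes exist for every minute, so they form the left side of the join.
# Minutes without trades keep their quotes and get NaN in the OHLC columns.
quotebars = quotes.join(ohlc, how="left")
print(quotebars)
```

A left join on the quotes keeps one row per minute even when no trade occurred, which seems the natural choice because a quote stands until it is replaced.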

In [1]:
from alpaca.data import StockHistoricalDataClient
from alpaca.data.requests import StockQuotesRequest
from datetime import datetime, time
from pytz import timezone

import pandas as pd
import numpy as np
import os

**Step 1: download NBBO tick data.**

Downloading one day of SPY data took 30 minutes, and that covers only quotes from the IEX exchange.

In [2]:
SYMBOL_LIST = ["TOP"]
START_DATE = datetime(2023, 5, 4) #datetime(2015, 12, 1) is the first available date from Alpaca
END_DATE = datetime(2023, 5, 8)
#END_DATE = datetime(2023, 2, 4, hour=20)  # in ET

Requesting the whole range at once would take too much memory or too long, so we send one request for every trading day between START_DATE and END_DATE. To find the trading days in that range, we use the dates present in the SPY daily data.

In [None]:
SPY_df = pd.read_csv(
    "../data/alpaca/raw/d1/unadjusted/SPY.csv",
    index_col="datetime",
    parse_dates=True,
)
SPY_df.set_index(SPY_df.index.tz_convert("US/Eastern"), inplace=True) #Actually this isn't necessary, but we still do it for the sake of consistency
SPY_df.set_index(SPY_df.index.tz_localize(None), inplace=True)
dates = np.unique(pd.to_datetime(SPY_df.index).date)
trading_dates = dates[(dates >= START_DATE.date()) & (dates <= END_DATE.date())]
trading_dates

In [None]:
print(f"{datetime.utcnow()} | Starting download")
with open("../data/alpaca/secret.txt") as f:
    PUBLIC_KEY = next(f).strip()
    PRIVATE_KEY = next(f).strip()

stock_client = StockHistoricalDataClient(PUBLIC_KEY, PRIVATE_KEY)

for stock in SYMBOL_LIST:
    os.makedirs(f"../data/alpaca/raw/tick/{stock}", exist_ok=True) #Create folder if it does not exist
    for date in trading_dates:
        start_time = timezone("US/Eastern").localize(datetime.combine(date, time(hour=4, minute=0)))
        end_time = timezone("US/Eastern").localize(datetime.combine(date, time(hour=20, minute=0)))
        quote_request = StockQuotesRequest(
            symbol_or_symbols=stock, start=start_time, end=end_time
        )

        quotes = stock_client.get_stock_quotes(quote_request)

        quotes.df.to_csv(f"../data/alpaca/raw/tick/{stock}/{date.strftime('%Y-%m-%d')}.csv")
        print(f"{datetime.utcnow()} | Downloaded {stock} tick data for {date.strftime('%Y-%m-%d')}")
        del quotes #Delete from memory


**Step 2: Get tick data and convert to quotes.**


We cannot simply call resample followed by ohlc, because of halts and other special trading conditions. We also have to think about what each OHLC field means for quotes and which fields we actually want. To recap, the quotes contain the columns <code>["close_bid_size", "close_bid", "close_ask", "close_ask_size", "high_bid", "high_ask", "low_bid", "low_ask", "tradeable"]</code>.

For the bid or ask of the minute bar timestamped 10:00:00:

* close: we want the price at 10:00:59.999..., which is effectively the price at 10:01:00.000... This is the last quote received during the minute, because a quote stands until a new one replaces it. If the condition of the last available tick is not R ('regular trading'), the quote bar becomes untradeable ("tradeable" = <code>False</code>). For example, Y means a halt.
* open: the open is redundant, since it always equals the previous close. That is why the quotes have no open prices. If we did want an opening bid/ask, using ohlc() would be **incorrect**: its open takes the first value inside the bin, which introduces look-ahead bias. The open should instead be the last value available before the bin starts.
* high: the highest value during regular trading (non-regular ticks are excluded)
* low: the lowest value during regular trading (non-regular ticks are excluded)

*Regular trading means no halts or weird conditions. It has nothing to do with pre- or post-market.*
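
The look-ahead issue with the open can be illustrated with a small sketch. The tick values and timestamps are made up for illustration. It compares the open reported by `ohlc()` with the previous minute's close, which is the bid actually standing when the bar starts.

```python
import pandas as pd

# Hypothetical bid ticks: one just before 10:00 and two inside the 10:00 minute.
ticks = pd.Series(
    [10.00, 10.05, 10.03],
    index=pd.to_datetime(
        ["2023-05-04 09:59:30", "2023-05-04 10:00:40", "2023-05-04 10:00:50"]
    ),
    name="bid_price",
)

bars = ticks.resample("1min").ohlc()
bar_time = pd.Timestamp("2023-05-04 10:00")

# ohlc() reports the first tick inside the bin, which only arrived at 10:00:40.
ohlc_open = bars.loc[bar_time, "open"]

# The bid standing at 10:00:00 is the previous minute's close.
standing_open = bars["close"].shift(1).loc[bar_time]

print(ohlc_open, standing_open)
```

In this toy data the two values differ (10.05 versus 10.00), so a backtest that trades at the `ohlc()` open at 10:00:00 would be using a quote that did not exist yet.
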

The conversion code follows four steps:

* Step 1: Resample closes
* Step 2: Create a 'tradeable' column
* Step 3: Get the high and low bid/asks
* Step 4: Concatenate daily DataFrames
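
Steps 1 to 3 can be sketched on a handful of made-up ticks. The data is hypothetical and only the bid side is shown. The `conditions` values are strings because that is how a list column comes back from a CSV file. Filtering on regular ticks before taking the high and low is an assumption about how "excluding non-regular trading" would be implemented.

```python
import pandas as pd

# Hypothetical ticks; "conditions" is a string because it was read back from CSV.
ticks = pd.DataFrame(
    {
        "bid_price": [10.00, 10.02, 9.50, 10.01],
        "conditions": ["['R']", "['R']", "['Y']", "['R']"],
    },
    index=pd.to_datetime(
        [
            "2023-05-04 10:00:10",
            "2023-05-04 10:00:30",
            "2023-05-04 10:01:20",
            "2023-05-04 10:03:05",
        ]
    ),
)

# Step 1: the close is the last quote of each minute, carried forward
# through minutes in which no new quote arrived.
minute = ticks.resample("1min").last().ffill()
minute = minute.rename(columns={"bid_price": "close_bid"})

# Step 2: a bar is tradeable only if the standing quote carries condition 'R'.
minute["tradeable"] = minute["conditions"].str.contains("R")
minute = minute.drop(columns=["conditions"])

# Step 3: high/low over regular ticks only (assumed filter);
# minutes without regular ticks fall back to the close.
regular = ticks[ticks["conditions"].str.contains("R")]
high = regular["bid_price"].resample("1min").max().rename("high_bid")
low = regular["bid_price"].resample("1min").min().rename("low_bid")
minute = minute.join(high).join(low)
minute["high_bid"] = minute["high_bid"].fillna(minute["close_bid"])
minute["low_bid"] = minute["low_bid"].fillna(minute["close_bid"])

print(minute)
```

Minute 10:02 has no ticks at all, so it inherits both the close and the halted state of 10:01 and stays untradeable until the regular quote at 10:03:05 arrives.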

In [5]:
SYMBOL_LIST = ["TOP"]

In [None]:
for stock in SYMBOL_LIST:
    quote_dfs = [] #A collection of dataframes for every day
    for date in trading_dates:
        tick_day_df = pd.read_csv(
            f"../data/alpaca/raw/tick/{stock}/{date.strftime('%Y-%m-%d')}.csv",
            usecols=[
                "timestamp",
                "bid_size",
                "bid_price",
                "ask_price",
                "ask_size",
                "conditions",
            ],
            index_col="timestamp",
            parse_dates=True,
        )
        tick_day_df.index.names = ["datetime"]
        
        # Convert to ET-naive as always (the raw data is always UTC-aware)
        tick_day_df.set_index(tick_day_df.index.tz_convert("US/Eastern"), inplace=True)
        tick_day_df.set_index(tick_day_df.index.tz_localize(None), inplace=True)

        #Step 1: To get the close quotes and last trade condition for m1, resample and take last value
        minute_df = tick_day_df.resample("1Min").last()
        minute_df.rename(
            columns={
                "bid_size": "close_bid_size",
                "bid_price": "close_bid",
                "ask_price": "close_ask",
                "ask_size": "close_ask_size",
            },
            inplace=True,
        )
        #Due to resampling it also creates rows for weekends and non-trading hours. Hence we need to shrink it
        all_minutes_in_day = []
        for hour in range(4, 20):
            for minute in range(0, 60):
                all_minutes_in_day.append(
                    datetime.combine(date, time(hour=hour, minute=minute))
                )
        all_minutes_in_day = np.array(all_minutes_in_day)
        assert len(all_minutes_in_day) == 16 * 60

        minute_df = minute_df.reindex(all_minutes_in_day)

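        #Step 2: Create a 'tradeable' column which is False if there is no R in the conditions.
        #Minutes before the first quote of the day have NaN conditions (a float, not None) after reindexing, so we check for a string first.
        tradeable = np.vectorize(lambda conditions: isinstance(conditions, str) and "R" in conditions, otypes=[bool])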

        minute_df.ffill(inplace=True)
        minute_df["tradeable"] = tradeable(minute_df["conditions"])
        minute_df.drop(columns=["conditions"], inplace=True)

        #If there were no quotes for the first minutes of the day (very unlikely), we will forward fill them with the values of the previous day. We will do that later. 'tradeable' is already false in this case because there was no 'R' in the conditions.

        #Step 3: Get the high and low bid/asks. If the value is empty, it means that during the entire minute there were no new quotes. Then we must use the last quotes. Since we already filled the close_bid and close_ask with the latest quote, we can set it equal to that.
        high_df = tick_day_df.resample("1Min").max()[["bid_price", "ask_price"]]
        high_df.rename(
            columns={
                "bid_price": "high_bid",
                "ask_price": "high_ask",
            },
            inplace=True,
        )

        low_df = tick_day_df.resample("1Min").min()[["bid_price", "ask_price"]]
        low_df.rename(
            columns={
                "bid_price": "low_bid",
                "ask_price": "low_ask",
            },
            inplace=True,
        )
        minute_df = pd.merge(
            left=minute_df, right=low_df, how="left", left_index=True, right_index=True
        )
        minute_df = pd.merge(
            left=minute_df, right=high_df, how="left", left_index=True, right_index=True
        )

        # Fill high/low with close
        minute_df["low_bid"] = minute_df["low_bid"].fillna(minute_df["close_bid"])
        minute_df["low_ask"] = minute_df["low_ask"].fillna(minute_df["close_ask"])
        minute_df["high_bid"] = minute_df["high_bid"].fillna(minute_df["close_bid"])
        minute_df["high_ask"] = minute_df["high_ask"].fillna(minute_df["close_ask"])

        # Reorder columns
        minute_df = minute_df[
                [
                    "close_bid_size",
                    "close_bid",
                    "close_ask",
                    "close_ask_size",
                    "high_bid",
                    "high_ask",
                    "low_bid",
                    "low_ask",
                    "tradeable",
                ]
            ]

        # Append the df (which only contains the minutes of 1 day) to quote_dfs.
        quote_dfs.append(minute_df) #quote_dfs[0] contains the DataFrame from the first day; quote_dfs[1] the second day etc.

    # Step 4: Concatenate daily DataFrames
    quote_df = pd.concat(quote_dfs)

    # As noted in step 2, if there were no quotes for the first minutes of the day (very unlikely), we forward fill them with the values of the previous day. 'tradeable' is already False.
    if quote_df.isnull().any().any():
        amount_null = quote_df.isna().sum().sum()
        print(
            f"{stock} | WARNING: dataframe contains {amount_null} null values, which will be forward filled."
        )
        quote_df.ffill(inplace=True)

    quote_df.to_csv(f"../data/alpaca/processed/m1/quotes/{stock}.csv")
    print(f"{datetime.utcnow()} | {stock} | Processed tick data to quotes")