# Data update

To update the data, we should append to the existing data sets instead of downloading everything again. However, if there were splits or dividends in the meantime, all historical data must be readjusted. Alpaca and most other vendors provide both adjusted and unadjusted daily data, so we can calculate the adjustment factor from those. This is probably a crude way, but it makes me confident that the adjustments are correct. I could also use the dividend/split endpoint, but that would overcomplicate things.

For example, let's assume that we have (already adjusted) data from day 1 to day 10. After the close on day 15 we want to update the historical data. Let's assume that a 1-to-2 stock split took place somewhere from day 11 to day 15 (inclusive). Then the new adjustment factor for day 10 is 0.5, so all prices from day 1 to day 10 should be multiplied by 0.5 (volumes are divided by the same factor). The data from day 11 to day 15 is simply appended, given that it is already adjusted.
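The example above can be sketched in pandas. This is only an illustration: the prices and the names `old_adjusted` and `new_adjusted` are made up, and only three of the stored days are shown.

```python
import pandas as pd

# Hypothetical adjusted closes stored before the update (last stored day is day 10).
old_adjusted = pd.Series(
    [100.0, 101.0, 102.0], index=pd.Index([8, 9, 10], name="day"), name="close"
)

# Hypothetical fresh adjusted download that starts at day 10 (the overlap day).
# A 1-to-2 split after day 10 halves the adjusted close of day 10.
new_adjusted = pd.Series(
    [51.0, 51.5, 52.0], index=pd.Index([10, 11, 12], name="day"), name="close"
)

# Adjustment factor = new adjusted close / old adjusted close on the overlap day.
adjustment_factor = new_adjusted.loc[10] / old_adjusted.loc[10]

# Readjust the old data, then append the new data without the overlap day.
readjusted_old = old_adjusted * adjustment_factor
updated = pd.concat([readjusted_old, new_adjusted.loc[11:]])

print(adjustment_factor)
print(updated)
```

Downloading the overlap day is what makes the factor computable: it is the only day for which both the old and the new adjusted price are known.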

We will mostly follow the same steps as in <code>bars.ipynb</code> and <code>tick.ipynb</code>, but instead of downloading everything we simply append. We also have to do some checks to see whether the dates make sense.

* Step 1. Download and update all raw data.
* Step 2. Using raw data, update processed data.
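The date checks mentioned above could look roughly like the sketch below. The helper `check_new_dates` is hypothetical and not part of the notebooks; it only assumes that old and new data are indexed by a pandas `DatetimeIndex`.

```python
import pandas as pd


def check_new_dates(old_index, new_index, update_to):
    """Hypothetical helper: raise ValueError if the new rows cannot be appended safely."""
    if not new_index.is_monotonic_increasing or not new_index.is_unique:
        raise ValueError("New data is not sorted or contains duplicate timestamps.")
    if len(new_index) == 0:
        return  # Nothing to append.
    if new_index[0] <= old_index[-1]:
        raise ValueError("New data overlaps with the data we already have.")
    if new_index[-1].date() > update_to.date():
        raise ValueError("New data extends beyond the requested end date.")


old_index = pd.to_datetime(["2023-07-25", "2023-07-26"])
new_index = pd.to_datetime(["2023-07-27", "2023-07-28"])
check_new_dates(old_index, new_index, pd.Timestamp("2023-07-28"))  # Passes silently.
```

Running such a check right before each `pd.concat` would catch an accidental overlap day or a wrong date range before it is written to disk.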

In [2]:
from alpaca.data import StockHistoricalDataClient
from alpaca.data.requests import StockBarsRequest
from alpaca.data.timeframe import TimeFrame, TimeFrameUnit
from alpaca.data.enums import Adjustment

from datetime import datetime, time, timedelta
from pytz import timezone
import pandas as pd
import numpy as np

In [3]:
UPDATE_TO = datetime(2023, 7, 28) # Eastern Time (ET)
SYMBOL_LIST = ["SPY"]
MARKET_HOURS_ONLY = True #If True, then the processed m1 data only contains market hours.

**Step 1: Update all raw data.**

Raw data is everything in the <code>/raw</code> folder. This includes 1-minute adjusted bars, 1-day adjusted and unadjusted bars, and tick data.

Unadjusted 1-day bars and ticks can just be appended.
The adjusted 1-minute and 1-day bars can also be appended, but if there have been splits or dividends we must first readjust all previous data.

We only need to download the new data. For this we need a list of all trading dates. Just like before, we will use SPY to determine this, although this is a crude way. Because our SPY data may itself be out of date, we have to update SPY first. It does not matter whether we use adjusted or unadjusted data, so I will just update the unadjusted data because that is easier.

In [8]:
# This should be put in a function later. We should have a function for updating and for retrieving all trading dates and minutes.
with open("../../data/alpaca/secret.txt") as f:
    PUBLIC_KEY = next(f).strip()
    PRIVATE_KEY = next(f).strip()

# Retrieve old daily data
bars_old = pd.read_csv(
        f"../../data/alpaca/raw/d1/unadjusted/SPY.csv",
        index_col="datetime",
        parse_dates=True,
    )

last_available_date = bars_old.index[-1].tz_localize(None)
first_day_of_update = last_available_date + timedelta(days=1) # Day+1 may not be a trading day, but since Alpaca does not return non-trading days, this will work

# Get new bars
if first_day_of_update.date() > UPDATE_TO.date():
    print("We already have enough data for SPY")
    #continue
else:
    stock_client = StockHistoricalDataClient(PUBLIC_KEY, PRIVATE_KEY)
    spy_request = StockBarsRequest(
        symbol_or_symbols="SPY",
        # If we do not replace hour with 0, we do not get the first day.
        start=timezone("US/Eastern").localize(first_day_of_update.replace(hour=0)),
        end=timezone("US/Eastern").localize(UPDATE_TO.replace(hour=0)),
        timeframe=TimeFrame(1, TimeFrameUnit.Day),
        adjustment=Adjustment.RAW,
    )
    bars_new = stock_client.get_stock_bars(spy_request).df

    bars_new = bars_new.loc["SPY"][["close", "volume"]]
    bars_new.index.names = ["datetime"]

    # Combine all (inside the else branch: bars_new only exists if new data was downloaded)
    bars_all = pd.concat([bars_old, bars_new])
    bars_all.to_csv("../../data/alpaca/raw/d1/unadjusted/SPY.csv")
    print("Updated SPY d1 unadjusted")


In [9]:
# Retrieve all SPY data
SPY_df = pd.read_csv(
        f"../../data/alpaca/raw/d1/unadjusted/SPY.csv",
        index_col="datetime",
        parse_dates=True,
    )
SPY_df.set_index(SPY_df.index.tz_localize(None), inplace=True)

# Get a list of all trading days
all_trading_dates = pd.to_datetime(SPY_df.index).date

# Get a list of all trading minutes from 04:00 up to and including 19:59 (so the last minute of each day is 19:59)
all_trading_minutes = []
amount_of_days = len(all_trading_dates)
for date in all_trading_dates:
    for hour in range(4, 20):
        for minute in range(0, 60):
            all_trading_minutes.append(datetime.combine(date, time(hour=hour, minute=minute)))
all_trading_minutes = np.array(all_trading_minutes)

assert len(all_trading_minutes) == amount_of_days * 16 * 60
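As noted earlier, this logic should eventually live in a function. A possible vectorized sketch is shown below. The function name `get_trading_minutes` and its parameters are hypothetical, and unlike the loop above it returns a pandas `DatetimeIndex` rather than a NumPy array of `datetime` objects.

```python
from datetime import date

import numpy as np
import pandas as pd


def get_trading_minutes(trading_dates, start_hour=4, end_hour=20):
    """Hypothetical helper: every minute from start_hour:00 up to (not including) end_hour:00."""
    # Midnight of every trading day.
    days = pd.to_datetime(pd.Index(trading_dates))
    # Minute offsets within a day, e.g. 04:00 ... 19:59 for the defaults.
    offsets = pd.to_timedelta(np.arange(start_hour * 60, end_hour * 60), unit="min")
    # Broadcast: each day (rows) plus each minute offset (columns).
    minutes = days.values[:, None] + offsets.values[None, :]
    return pd.DatetimeIndex(minutes.ravel())


example_dates = [date(2023, 7, 27), date(2023, 7, 28)]
example_minutes = get_trading_minutes(example_dates)
print(example_minutes[0], example_minutes[-1], len(example_minutes))
```

Because the result is built row by row and then flattened, the minutes stay in chronological order as long as the input dates are sorted.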

Now we can finally download our data and update.

In [None]:
stock = "O"

# UNADJUSTED DAILY BARS
# Retrieve old daily data
bars_old = pd.read_csv(
        f"../../data/alpaca/raw/d1/unadjusted/{stock}.csv",
        index_col="datetime",
        parse_dates=True,
    )

last_available_date = bars_old.index[-1].tz_localize(None)
first_day_of_update = last_available_date + timedelta(days=1) # Day+1 may not be a trading day, but since Alpaca does not return non-trading days, this will work

# Get new bars
if first_day_of_update.date() > UPDATE_TO.date():
    print(f"We already have enough data for {stock}")
else:
    stock_client = StockHistoricalDataClient(PUBLIC_KEY, PRIVATE_KEY)
    stock_request = StockBarsRequest(
        symbol_or_symbols=stock,
        # If we do not replace hour with 0, we do not get the first day.
        start=timezone("US/Eastern").localize(first_day_of_update.replace(hour=0)),
        end=timezone("US/Eastern").localize(UPDATE_TO.replace(hour=0)),
        timeframe=TimeFrame(1, TimeFrameUnit.Day),
        adjustment=Adjustment.RAW,
    )
    bars_new = stock_client.get_stock_bars(stock_request).df

    bars_new = bars_new.loc[stock][["close", "volume"]]
    bars_new.index.names = ["datetime"]

    # Combine all
    bars_all = pd.concat([bars_old, bars_new])
    bars_all.to_csv(f"../../data/alpaca/raw/d1/unadjusted/{stock}.csv")

    print(f"{datetime.utcnow().replace(microsecond=0)} | Updated {stock} d1 unadjusted from {first_day_of_update.strftime('%Y-%m-%d')} to {UPDATE_TO.strftime('%Y-%m-%d')}")

# ADJUSTED DAILY BARS
# Retrieve old daily data
bars_old = pd.read_csv(
        f"../../data/alpaca/raw/d1/adjusted/{stock}.csv",
        index_col="datetime",
        parse_dates=True,
    )

last_available_date = bars_old.index[-1].tz_localize(None)
first_day_of_update = last_available_date + timedelta(days=1) # Day+1 may not be a trading day, but since Alpaca does not return non-trading days, this will work

# Get new bars
if first_day_of_update.date() > UPDATE_TO.date():
    print(f"We already have enough data for {stock}")
    #continue
else:
    stock_client = StockHistoricalDataClient(PUBLIC_KEY, PRIVATE_KEY)
    # WARNING: to calculate the adjustment factor we also need the bar at last_available_date.
    # If we have data from day 1 to day 10 that needs to be readjusted, we need the NEW
    # adjustment factor at day 10, not day 11. When combining, we must leave out
    # last_available_date, else there will be two entries for that date.
    stock_request = StockBarsRequest(
        symbol_or_symbols=stock,
        start=timezone("US/Eastern").localize(first_day_of_update.replace(hour=0) - timedelta(days=1)),
        end=timezone("US/Eastern").localize(UPDATE_TO.replace(hour=0)),
        timeframe=TimeFrame(1, TimeFrameUnit.Day),
        adjustment=Adjustment.ALL,
    )
    bars_new = stock_client.get_stock_bars(stock_request).df

    bars_new = bars_new.loc[stock][["close", "volume"]]
    bars_new.index.names = ["datetime"]

    # Readjust bars_old using the adjustment factor:
    # the ratio of the new to the old adjusted close on last_available_date
    last_close_old = bars_old['close'].iloc[-1]
    last_close_new = bars_new['close'].iloc[0]

    # If the old close is 50 and the new close is 100, the old prices should be multiplied by 2
    adjustment_factor = last_close_new / last_close_old
    bars_old['close'] = bars_old['close'] * adjustment_factor
    bars_old['volume'] = bars_old['volume'] / adjustment_factor # If prices are multiplied by 2, volume should be multiplied by 0.5

    # Strip first bar of bars_new, because we already have it
    bars_new = bars_new[1:]

    # Combine all
    bars_all = pd.concat([bars_old, bars_new])
    bars_all.to_csv(f"../../data/alpaca/raw/d1/adjusted/{stock}.csv")

    print(f"{datetime.utcnow().replace(microsecond=0)} | Updated {stock} d1 adjusted from {first_day_of_update.strftime('%Y-%m-%d')} to {UPDATE_TO.strftime('%Y-%m-%d')}")

# ADJUSTED MINUTE BARS
# Retrieve old minute data
bars_old = pd.read_csv(
        f"../../data/alpaca/raw/m1/{stock}.csv",
        index_col="datetime",
        parse_dates=True,
    )

last_available_date = bars_old.index[-1].tz_localize(None)
first_day_of_update = last_available_date + timedelta(days=1) # Day+1 may not be a trading day, but since Alpaca does not return non-trading days, this will work

# Get new bars
if first_day_of_update.date() > UPDATE_TO.date():
    print(f"We already have enough data for {stock}")
    #continue
else:
    stock_client = StockHistoricalDataClient(PUBLIC_KEY, PRIVATE_KEY)
    stock_request = StockBarsRequest(
        symbol_or_symbols=stock,
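        # Start at 04:00 on the first missing day. Unlike for the adjusted daily bars, no overlap day
        # is requested, because the adjustment factor is taken from the daily data and an overlap
        # would be written to the file twice.
        start=timezone("US/Eastern").localize(first_day_of_update.replace(hour=4, minute=0, second=0, microsecond=0)),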
        end=timezone("US/Eastern").localize(UPDATE_TO.replace(hour=20)),
        timeframe=TimeFrame(1, TimeFrameUnit.Minute),
        adjustment=Adjustment.ALL,
    )
    bars_new = stock_client.get_stock_bars(stock_request).df

    bars_new = bars_new.loc[stock][["open", "high", "low", "close", "volume"]]
    bars_new.index.names = ["datetime"]

    # Get adjustment factor from daily data
    stock_df_unadjusted = pd.read_csv(
        f"../../data/alpaca/raw/d1/unadjusted/{stock}.csv",
        index_col="datetime",
        parse_dates=True,
    )
    stock_df_adjusted = pd.read_csv(
        f"../../data/alpaca/raw/d1/adjusted/{stock}.csv",
        index_col="datetime",
        parse_dates=True,
    )
    if not stock_df_adjusted.index.equals(stock_df_unadjusted.index):
        raise Exception(
            "The indices in the adjusted and unadjusted DataFrames are not equal."
        )
    adjustment = stock_df_adjusted / stock_df_unadjusted
    adjustment.index = adjustment.index.date
    adjustment.rename(columns={"close": "adjustment"}, inplace=True)

    # Readjustment factor for the old minute bars: the current daily adjustment factor
    # (adjusted close / unadjusted close) on last_available_date, e.g. 0.5 after a 1-to-2 split.
    # Assumption: the old minute bars were fully adjusted as of last_available_date,
    # i.e. their own adjustment factor on that day was 1.
    difference_adjustment = adjustment.loc[last_available_date.date(), 'adjustment']

    bars_old['open'] = bars_old['open'] * difference_adjustment
    bars_old['high'] = bars_old['high'] * difference_adjustment
    bars_old['low'] = bars_old['low'] * difference_adjustment
    bars_old['close'] = bars_old['close'] * difference_adjustment
    bars_old['volume'] = bars_old['volume'] / difference_adjustment

    # Combine all
    bars_all = pd.concat([bars_old, bars_new])
    bars_all.to_csv(f"../../data/alpaca/raw/m1/{stock}.csv")

    print(f"{datetime.utcnow().replace(microsecond=0)} | Updated {stock} m1 adjusted from {first_day_of_update.strftime('%Y-%m-%d')} to {UPDATE_TO.strftime('%Y-%m-%d')}")

# TICK DATA
# Simply rerun step 1 in tick.ipynb, but only download the required data by changing the start and end dates
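The three bar updates above share the same pattern: readjust the old bars, drop any overlap, and append. A minimal sketch of how that could be factored into one function follows. The name `readjust_and_append` and the `PRICE_COLUMNS` constant are hypothetical, and the adjustment factor is assumed to be computed beforehand as in the cell above.

```python
import pandas as pd

PRICE_COLUMNS = ["open", "high", "low", "close"]  # Hypothetical constant for this sketch.


def readjust_and_append(bars_old, bars_new, adjustment_factor=1.0):
    """Hypothetical helper: readjust old bars and append new bars without duplicates."""
    bars_old = bars_old.copy()
    # Prices are multiplied by the factor, volume is divided by it.
    price_columns = [col for col in PRICE_COLUMNS if col in bars_old.columns]
    bars_old[price_columns] = bars_old[price_columns] * adjustment_factor
    if "volume" in bars_old.columns:
        bars_old["volume"] = bars_old["volume"] / adjustment_factor
    # Keep only bars that are strictly newer than what we already have.
    bars_new = bars_new[bars_new.index > bars_old.index[-1]]
    return pd.concat([bars_old, bars_new])


example_old = pd.DataFrame(
    {"close": [100.0, 102.0], "volume": [1000.0, 2000.0]},
    index=pd.to_datetime(["2023-07-26", "2023-07-27"]),
)
example_new = pd.DataFrame(
    {"close": [51.0, 52.0], "volume": [4000.0, 5000.0]},
    index=pd.to_datetime(["2023-07-27", "2023-07-28"]),  # First row is the overlap day.
)
example_all = readjust_and_append(example_old, example_new, adjustment_factor=0.5)
print(example_all)
```

Filtering on the index instead of slicing off the first row means the same function works whether or not an overlap day was downloaded.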

**Step 2: Using raw data, update processed data**

Now the data processing steps 2 and 4 in <code>bars.ipynb</code> can be run again. Those steps only use raw data, which is now updated and correct. It is fine to rerun them on everything: because all data is already on disk, that won't take long.

The same can be done for quotes and quotebars (steps 2 to 4), but it will take longer. Still, since all data is already downloaded, this is not a problem. The download is the most time-consuming part.

I could make a script that only updates the processed data, but that would mean extra code and potential bugs for no real benefit. We would then also have to take readjustments into account again, which is very annoying.

In live trading, the assets that are traded should be recorded. For these we should not update using historical data, so that we can properly compare 'backtest' results and real results.