[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/witchapong/build-ai-based-applications/blob/main/tabular/3_features_prep.ipynb)

# Stock Price Prediction Using an ML Model
In this session, we'll learn how to build an ML model that predicts the **next-day percentage change in price** for stocks in the SET index (Stock Exchange of Thailand). The idea is to use the predictions to buy stocks that are expected to rise the next day, make profits, and hopefully get rich!

This session is divided into the following 5 notebooks.
1. `1_collect_data.ipynb`
2. `2_eda.ipynb`
3. `3_features_prep.ipynb` (current notebook)
4. `4_make_prediction.ipynb`
5. `5_evaluation.ipynb`

In [1]:
# # mount Google Drive
# from google.colab import drive
# drive.mount('/content/drive', force_remount=True)



In [2]:
from pathlib import Path

# running on Colab
DATA_DIR = Path("/content/drive/MyDrive/build-ai-based-applications/")

# # running on local
# DATA_DIR = Path(".")

In [6]:
# install talib; for any issue, start from this thread https://stackoverflow.com/questions/49648391/how-to-install-ta-lib-in-google-colab
url = 'https://anaconda.org/conda-forge/libta-lib/0.4.0/download/linux-64/libta-lib-0.4.0-h166bdaf_1.tar.bz2'
!curl -L $url | tar xj -C /usr/lib/x86_64-linux-gnu/ lib --strip-components=1
!pip install conda-package-handling
!wget https://anaconda.org/conda-forge/ta-lib/0.5.1/download/linux-64/ta-lib-0.5.1-py311h9ecbd09_0.conda
!cph x ta-lib-0.5.1-py311h9ecbd09_0.conda
!mv ./ta-lib-0.5.1-py311h9ecbd09_0/lib/python3.11/site-packages/talib /usr/local/lib/python3.11/dist-packages/
import talib


# Feature Preparation
In this notebook, we will prepare features from the data we collected and explored in the previous steps. Feature preparation is a crucial step: it extracts the information and context an ML model needs to learn from and make correct predictions. This is particularly true for tabular data, where rich features often matter more than complex models. Although it is one of the most time-consuming steps in building an ML model, it is worth the effort, as good, insightful features can substantially boost model performance!

In [5]:
import datetime as dt
from dateutil.relativedelta import relativedelta

from glob import glob
from pathlib import Path
import os
import json

import numpy as np
import pandas as pd
pd.options.display.max_columns = 100
pd.options.display.max_rows = 100
import matplotlib.pyplot as plt
import seaborn as sns

from tqdm.notebook import tqdm

import talib
from talib import RSI, BBANDS, MACD, ATR

In [7]:
sns.set_style('darkgrid')
idx = pd.IndexSlice

# Price features
We'll extract some popular indicators for predicting stock prices from the price data. The feature set is far from complete, as there are many more indicators you could add to the model. You can find an extensive list of indicators on the `TA-Lib` documentation page [here](https://github.com/TA-Lib/ta-lib-python?tab=readme-ov-file#supported-indicators-and-functions).

In [8]:
MONTH = 20
YEAR = 12 * MONTH

In [9]:
price_df = pd.read_csv(DATA_DIR / "data/set_price.csv", parse_dates=["date"])

In [10]:
price_df.head()

Unnamed: 0,symbol,date,open,high,low,close,volume,dividends,stock splits,capital gains
0,24CS,2022-10-03,7.1,10.2,7.1,10.2,559465900,0.0,0.0,
1,24CS,2022-10-04,10.7,11.1,7.15,7.15,330707400,0.0,0.0,
2,24CS,2022-10-05,5.85,6.45,5.05,5.15,361028900,0.0,0.0,
3,24CS,2022-10-06,5.4,5.45,4.7,5.2,232679200,0.0,0.0,
4,24CS,2022-10-07,5.1,5.15,4.76,5.0,131778400,0.0,0.0,


In [11]:
# filter abnormal close price
price_df = price_df[price_df["close"] > 0]

# drop unused columns
price_df = price_df.drop(columns=["capital gains"])

In [12]:
# set index
price_df = price_df.set_index(["symbol","date"]).sort_index()

In [13]:
# remove stock with insufficient data
min_obs = 2 * YEAR
nobs = price_df.groupby('symbol').size()
keep = nobs[nobs > min_obs].index

price_df = price_df.loc[idx[keep, :], :]

## 21-day moving average of trading value
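We extract this feature to measure how "hot" (actively traded) a stock has been over the last 21 trading days. Each stock's 21-day average trading value is then ranked against all other stocks on the same date, so rank 1 is the stock with the highest average trading value.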

In [14]:
# calculate daily trading value
price_df["value"] = price_df["close"] * price_df["volume"]

In [15]:
price_df["value"] = price_df["value"].div(1e6)

val_ma = price_df["value"]\
    .unstack("symbol")\
    .rolling(window=21, min_periods=1)\
    .mean()

price_df["val_rank"] = val_ma\
                        .rank(axis=1, ascending=False)\
                        .stack("symbol")\
                        .swaplevel()
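To make the reshaping above easier to follow, here is a minimal sketch of the same pattern on toy data. The symbols `AAA` and `BBB`, the values, and the 2-day window are made up for illustration; only the sequence of pandas operations mirrors the cell above.

```python
import pandas as pd

# Toy data: two hypothetical symbols over three dates
dates = pd.to_datetime(["2024-01-02", "2024-01-03", "2024-01-04"])
index = pd.MultiIndex.from_product([["AAA", "BBB"], dates], names=["symbol", "date"])
toy = pd.DataFrame({"value": [10.0, 20.0, 30.0, 40.0, 5.0, 5.0]}, index=index)

# 1) pivot to a date x symbol table so rolling runs along time per symbol
wide = toy["value"].unstack("symbol")

# 2) rolling mean per symbol (window=2 here only to keep the toy small)
val_ma = wide.rolling(window=2, min_periods=1).mean()

# 3) rank symbols against each other on each date (1 = highest average value)
val_rank = val_ma.rank(axis=1, ascending=False)

# 4) back to long format, restoring the (symbol, date) index order
toy["val_rank"] = val_rank.stack().swaplevel()
print(toy)
```

The ranking is cross-sectional: it compares symbols with each other on a given date, which is why the data is pivoted to wide format first.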

## Relative Strength Index (RSI)
RSI is usually used to determine whether a stock is overbought or oversold, which may signal a good time to execute a trade.

In [16]:
price_df["rsi"] = price_df.groupby(level="symbol", group_keys=False).close.apply(RSI)
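For intuition about what the indicator computes, the following is a rough pandas-only sketch of RSI using Wilder-style exponential smoothing. The function name `rsi_sketch` is hypothetical, and the result is not guaranteed to match `talib.RSI` exactly, because TA-Lib seeds its averages differently during the warm-up period.

```python
import pandas as pd

def rsi_sketch(close: pd.Series, period: int = 14) -> pd.Series:
    """Approximate RSI; an illustrative sketch, not TA-Lib's implementation."""
    delta = close.diff()
    gain = delta.clip(lower=0)       # upward moves only
    loss = (-delta).clip(lower=0)    # downward moves as positive numbers
    # Wilder smoothing is an exponential average with alpha = 1 / period
    avg_gain = gain.ewm(alpha=1 / period, adjust=False, min_periods=period).mean()
    avg_loss = loss.ewm(alpha=1 / period, adjust=False, min_periods=period).mean()
    rs = avg_gain / avg_loss
    return 100 - 100 / (1 + rs)

# A steadily rising price has no losses, so RSI saturates at 100
rising = pd.Series(range(1, 31), dtype=float)
rsi_rising = rsi_sketch(rising)

# A price that zigzags stays strictly between 0 and 100
zigzag = pd.Series([10, 11, 10, 12, 11, 13, 12, 14, 13, 15] * 3, dtype=float)
rsi_zigzag = rsi_sketch(zigzag)
```

Values near 100 indicate mostly gains over the lookback window, values near 0 mostly losses.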

## Bollinger bands
This indicator measures the volatility of a stock and helps judge whether the stock is undervalued or overvalued relative to its recent price range.

![BB](https://github.com/witchapong/build-ai-based-applications/blob/main/tabular/img/BB.png?raw=1)

In [17]:
def compute_bb(close):
    high, mid, low = BBANDS(close, timeperiod=20)
    return pd.DataFrame({'bb_high': high, 'bb_low': low}, index=close.index)

In [18]:
price_df = (price_df.join(price_df
                      .groupby(level='symbol', group_keys=False)
                      .close
                      .apply(compute_bb)))

In [19]:
price_df['bb_high'] = price_df.bb_high.sub(price_df.close).div(price_df.bb_high).apply(np.log1p)
price_df['bb_low'] = price_df.close.sub(price_df.bb_low).div(price_df.close).apply(np.log1p)
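As a rough illustration of what the bands are, the sketch below rebuilds them with plain pandas: a rolling mean plus or minus a multiple of the rolling standard deviation. The name `bollinger_sketch`, the toy prices, and the 3-day window are assumptions for the example; using a population standard deviation (`ddof=0`) and 2 standard deviations is my understanding of TA-Lib's defaults, so treat any match with `BBANDS` as approximate.

```python
import numpy as np
import pandas as pd

def bollinger_sketch(close: pd.Series, window: int = 20, n_std: float = 2.0) -> pd.DataFrame:
    """Illustrative Bollinger bands: rolling mean +/- n_std rolling std."""
    mid = close.rolling(window).mean()
    std = close.rolling(window).std(ddof=0)
    return pd.DataFrame({"bb_high": mid + n_std * std,
                         "bb_mid": mid,
                         "bb_low": mid - n_std * std})

close = pd.Series([10.0, 11.0, 12.0, 11.0, 10.0])
bands = bollinger_sketch(close, window=3)

# Same transformation as in the notebook: relative distance of the close
# from each band, compressed with log1p
dist_high = np.log1p((bands["bb_high"] - close) / bands["bb_high"])
dist_low = np.log1p((close - bands["bb_low"]) / close)
```

Expressing the bands as relative distances from the close makes the feature comparable across stocks with very different price levels.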

## Average True Range (ATR)
This indicator measures the level of volatility of a stock over the evaluated period. We compute both the normalized version (NATR) and a per-symbol standardized ATR.

In [20]:
price_df['NATR'] = price_df.groupby(level='symbol',
                                group_keys=False).apply(lambda x:
                                                        talib.NATR(x.high, x.low, x.close))

In [21]:
def compute_atr(stock_data):
    df = ATR(stock_data.high, stock_data.low,
             stock_data.close, timeperiod=14)
    return df.sub(df.mean()).div(df.std())

In [22]:
price_df['ATR'] = (price_df.groupby('symbol', group_keys=False)
                 .apply(compute_atr))
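The following sketch shows, under simplifying assumptions, what ATR and NATR are built from. The helper names and the toy bars are hypothetical. The true range is the largest of three distances, ATR smooths it over time, and NATR expresses ATR as a percentage of the close. The smoothing here uses a plain exponential average, so the numbers will differ somewhat from `talib.ATR`, which seeds its average differently and leaves the first bar undefined.

```python
import pandas as pd

def true_range(high: pd.Series, low: pd.Series, close: pd.Series) -> pd.Series:
    """Largest of: high-low, |high - previous close|, |low - previous close|."""
    prev_close = close.shift(1)
    ranges = pd.concat([high - low,
                        (high - prev_close).abs(),
                        (low - prev_close).abs()], axis=1)
    # On the first bar there is no previous close, so only high-low is used
    return ranges.max(axis=1)

def atr_sketch(high, low, close, period: int = 14) -> pd.Series:
    """Illustrative ATR: Wilder-style exponential average of the true range."""
    tr = true_range(high, low, close)
    return tr.ewm(alpha=1 / period, adjust=False, min_periods=period).mean()

def natr_sketch(high, low, close, period: int = 14) -> pd.Series:
    """ATR as a percentage of the closing price."""
    return 100 * atr_sketch(high, low, close, period) / close

high = pd.Series([10.0, 11.0, 12.0])
low = pd.Series([9.0, 9.5, 11.5])
close = pd.Series([9.5, 10.5, 12.0])

tr = true_range(high, low, close)
natr = natr_sketch(high, low, close, period=2)
```

Because NATR is scaled by price, it is comparable across stocks, whereas raw ATR is in price units, which is presumably why the notebook standardizes ATR per symbol.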

## Moving Average Convergence/Divergence (MACD)
MACD is one of the most popular indicators for identifying price trends and measuring trend momentum.

In [23]:
# price_df["PPO"] = price_df.groupby(level="symbol", group_keys=False)["close"].apply(talib.PPO)

In [24]:
def compute_macd(close):
    macd = MACD(close)[0]
    return (macd - np.mean(macd)) / np.std(macd)

In [25]:
price_df["MACD"] = (price_df.
                    groupby("symbol", group_keys=False)
                    .close
                    .apply(compute_macd)
                    )
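To show what the MACD line represents, here is a small pandas-only sketch: the difference between a fast and a slow exponential moving average of the close, plus a signal line that smooths that difference. The function name `macd_sketch` is hypothetical; the 12/26/9 periods are the commonly used defaults. TA-Lib handles the warm-up period differently, so early values will not match `talib.MACD`.

```python
import pandas as pd

def macd_sketch(close: pd.Series, fast: int = 12, slow: int = 26, signal: int = 9) -> pd.DataFrame:
    """Illustrative MACD: fast EMA minus slow EMA, with a smoothed signal line."""
    ema_fast = close.ewm(span=fast, adjust=False).mean()
    ema_slow = close.ewm(span=slow, adjust=False).mean()
    macd_line = ema_fast - ema_slow
    signal_line = macd_line.ewm(span=signal, adjust=False).mean()
    return pd.DataFrame({"macd": macd_line,
                         "signal": signal_line,
                         "hist": macd_line - signal_line})

# A flat price has no trend, so the MACD line stays at zero
flat = macd_sketch(pd.Series([50.0] * 40))

# In a steady uptrend the fast EMA sits above the slow EMA, so MACD is positive
rising = macd_sketch(pd.Series(range(1, 61), dtype=float))
```

The notebook keeps only the MACD line and z-scores it per symbol, which puts stocks with different price levels on a comparable scale.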

In [26]:
price_df.head()

symbol,date,open,high,low,close,volume,dividends,stock splits,value,val_rank,rsi,bb_high,bb_low,NATR,ATR,MACD
24CS,2022-10-03,7.1,10.2,7.1,10.2,559465900,0.0,0.0,5706.552073,1.0,,,,,,
24CS,2022-10-04,10.7,11.1,7.15,7.15,330707400,0.0,0.0,2364.557942,1.0,,,,,,
24CS,2022-10-05,5.85,6.45,5.05,5.15,361028900,0.0,0.0,1859.298869,1.0,,,,,,
24CS,2022-10-06,5.4,5.45,4.7,5.2,232679200,0.0,0.0,1209.931796,2.0,,,,,,
24CS,2022-10-07,5.1,5.15,4.76,5.0,131778400,0.0,0.0,658.892,2.0,,,,,,


# Join company information

In [27]:
company_info_df = pd.read_csv(DATA_DIR / "data/set_company_info.csv")

In [28]:
company_info_df.head()

Unnamed: 0,symbol,industry,sector
0,24CS,Building Products & Equipment,Industrials
1,2S,Steel,Basic Materials
2,3BBIF,,
3,A,Real Estate - Development,Real Estate
4,A5,Real Estate - Development,Real Estate


In [29]:
company_info_df.set_index("symbol", inplace=True)

In [30]:
data_df = price_df.join(company_info_df, on="symbol")

In [31]:
data_df.head()

symbol,date,open,high,low,close,volume,dividends,stock splits,value,val_rank,rsi,bb_high,bb_low,NATR,ATR,MACD,industry,sector
24CS,2022-10-03,7.1,10.2,7.1,10.2,559465900,0.0,0.0,5706.552073,1.0,,,,,,,Building Products & Equipment,Industrials
24CS,2022-10-04,10.7,11.1,7.15,7.15,330707400,0.0,0.0,2364.557942,1.0,,,,,,,Building Products & Equipment,Industrials
24CS,2022-10-05,5.85,6.45,5.05,5.15,361028900,0.0,0.0,1859.298869,1.0,,,,,,,Building Products & Equipment,Industrials
24CS,2022-10-06,5.4,5.45,4.7,5.2,232679200,0.0,0.0,1209.931796,2.0,,,,,,,Building Products & Equipment,Industrials
24CS,2022-10-07,5.1,5.15,4.76,5.0,131778400,0.0,0.0,658.892,2.0,,,,,,,Building Products & Equipment,Industrials


# Compute return

## Historical returns
Returns and return ranking over different periods could also be extracted as features.

In [32]:
# Historical returns
T = [1, 5, 10, 21, 42, 63]
by_sym = data_df.groupby(level="symbol")["close"]
for t in T:
    data_df[f"r{t:02}"] = by_sym.pct_change(t)

In [33]:
# Daily historical return deciles (return ranking)
for t in T:
    data_df[f"r{t:02}dec"] = (data_df[f"r{t:02}"].dropna()
                              .groupby(level="date", group_keys=False)
                              .apply(lambda x: pd.qcut(x,
                                                       q=10,
                                                       labels=False,
                                                       duplicates="drop")))

In [34]:
# Daily sector return deciles (return ranking in a sector)
for t in T:
    data_df[f"r{t:02}q_sector"] = (data_df
                                   .groupby(["date","sector"])[f"r{t:02}"]
                                   .transform(lambda x: pd.qcut(x,
                                                                 q=5,
                                                                labels=False,
                                                                duplicates="drop")))
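The two building blocks used above, `pct_change` for returns and `pd.qcut` for ranking into buckets, can be seen in isolation in this small sketch. The prices and returns are made up for illustration.

```python
import pandas as pd

# t-period return: (price today / price t days ago) - 1
close = pd.Series([100.0, 110.0, 121.0])
r01 = close.pct_change(1)  # first value is NaN because there is no prior price

# Bucketing same-day returns of ten hypothetical stocks into quintiles:
# label 0 is the lowest-return fifth, label 4 the highest
same_day_returns = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]) / 100
quintile = pd.qcut(same_day_returns, q=5, labels=False, duplicates="drop")
```

Bucket labels are relative to the other stocks on the same date (or in the same sector), so they tell the model how a stock performed against its peers rather than in absolute terms. With `duplicates="drop"`, dates with many tied returns can end up with fewer buckets than requested.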

## Forward returns
Here we'll create forward returns over several periods, but for our application we'll only use the forward return over the next day as the prediction target.

In [35]:
for t in [1, 5, 21]:
    data_df[f'r{t:02}_fwd'] = data_df.groupby(level='symbol')[f'r{t:02}'].shift(-t)
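A toy example may help clarify what the shift does. The prices below are made up; the point is that shifting the 1-day return back by one row places tomorrow's return on today's row, and the last row has no target.

```python
import pandas as pd

close = pd.Series([100.0, 110.0, 121.0, 133.1],
                  index=pd.to_datetime(["2024-01-02", "2024-01-03",
                                        "2024-01-04", "2024-01-05"]))

r01 = close.pct_change(1)   # return realized ON each date
r01_fwd = r01.shift(-1)     # return realized on the NEXT date

frame = pd.DataFrame({"close": close, "r01": r01, "r01_fwd": r01_fwd})
print(frame)
```

Forward returns are targets, not features: they contain information from the future, so any `_fwd` column must be excluded from the model inputs.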

# Join income statement
In this step, we'll join annual income statement data to our features. For each row, we join the three most recent annual income statements available as of that row's date. Care is needed when joining data that updates less frequently than the table it is joined to (prices update daily, income statements annually): a statement must only be joined to dates on which it had actually been published. Otherwise the model would learn from information that was not yet available at prediction time. Below, we approximate the availability date (`realized_dt`) as the statement date plus 100 days.

In [36]:
incm_stmt_df = pd.read_csv(DATA_DIR / "data/set_incm_stmt.csv")

In [37]:
# parse date column
incm_stmt_df["date"] = pd.to_datetime(incm_stmt_df["date"])
incm_stmt_df["realized_dt"] = incm_stmt_df["date"] + pd.DateOffset(days=100)

In [38]:
# drop unused and mostly empty columns
unused_cols_income = ["date", ]
incm_stmt_df.drop(columns=unused_cols_income, inplace=True)

In [39]:
MAX_PCT_NULL = .95

income_null = incm_stmt_df.isna().mean()
drop_cols_income = income_null[income_null >= MAX_PCT_NULL].index
drop_cols_income

Index(['Otherunder Preferred Stock Dividend', 'Other Taxes',
       'Rent And Landing Fees', 'Restructuring And Mergern Acquisition',
       'Other Non Interest Expense', 'Average Dilution Earnings',
       'Amortization', 'Amortization Of Intangibles Income Statement',
       'Depreciation Income Statement', 'Insurance And Claims',
       'Loss Adjustment Expense', 'Net Policyholder Benefits And Claims',
       'Policyholder Benefits Gross', 'Policyholder Benefits Ceded',
       'Occupancy And Equipment', 'Preferred Stock Dividends',
       'Net Income From Tax Loss Carryforward', 'Research And Development',
       'Professional Expense And Contract Services Expense',
       'Earnings From Equity Interest Net Of Tax', 'Net Income Extraordinary',
       'Excise Taxes'],
      dtype='object')

In [40]:
incm_stmt_df.drop(columns=drop_cols_income, inplace=True)

In [41]:
# rearrange columns
incm_stmt_df = incm_stmt_df[["symbol", "realized_dt"] + [col for col in incm_stmt_df.columns if col not in ["symbol", "realized_dt"]]]

In [42]:
incm_stmt_df.head()

Unnamed: 0,symbol,realized_dt,Tax Effect Of Unusual Items,Tax Rate For Calcs,Normalized EBITDA,Total Unusual Items,Total Unusual Items Excluding Goodwill,Net Income From Continuing Operation Net Minority Interest,Reconciled Depreciation,Reconciled Cost Of Revenue,EBITDA,EBIT,Net Interest Income,Interest Expense,Interest Income,Normalized Income,Net Income From Continuing And Discontinued Operation,Total Expenses,Diluted Average Shares,Basic Average Shares,Diluted EPS,Basic EPS,Diluted NI Availto Com Stockholders,Net Income Common Stockholders,Net Income,Net Income Including Noncontrolling Interests,Net Income Continuous Operations,Tax Provision,Pretax Income,Other Income Expense,Other Non Operating Income Expenses,Special Income Charges,Gain On Sale Of Ppe,Other Special Charges,Gain On Sale Of Security,Net Non Operating Interest Income Expense,Interest Expense Non Operating,Interest Income Non Operating,Operating Income,Operating Expense,Selling General And Administration,Selling And Marketing Expense,General And Administrative Expense,Other Gand A,Salaries And Wages,Gross Profit,Cost Of Revenue,Total Revenue,Operating Revenue,Minority Interests,Impairment Of Capital Assets,Total Operating Income As Reported,Total Other Finance Cost,Other Operating Expenses,Depreciation Amortization Depletion Income Statement,Depreciation And Amortization In Income Statement,Earnings From Equity Interest,Provision For Doubtful Accounts,Rent Expense Supplemental,Write Off,Net Income Discontinuous Operations,Gain On Sale Of Business
0,24CS,2024-04-09,247811.9,0.195597,-44136736.0,1266954.0,1266954.0,-45071044.0,9617938.0,671432400.0,-42869782.0,-52487720.0,-2439536.0,3542679.0,1103143.0,-46090190.0,-45071044.0,742760300.0,430000000.0,430000000.0,-0.1,-0.1,-45071044.0,-45071044.0,-45071044.0,-45071044.0,-45071044.0,-10959355.0,-56030399.0,7674673.0,6407719.0,1266954.0,1214952.0,-52002.0,,-2439536.0,3542679.0,1103143.0,-61265536.0,71327944.0,71327944.0,9318398.0,62009546.0,62009546.0,,10062408.0,671432400.0,681494800.0,681494800.0,,,,,,,,,,,,,
1,24CS,2023-04-10,547957.8,0.223307,41876591.0,2453829.0,2453829.0,24494231.0,7726356.0,870063900.0,44330420.0,36604064.0,-4082461.0,5067487.0,985026.0,22588360.0,24494231.0,945659500.0,325513699.0,325513699.0,0.08,0.08,24494231.0,24494231.0,24494231.0,24494231.0,24494231.0,7042346.0,31536577.0,2710901.0,257072.0,1295046.0,1295046.0,,1158783.0,-4082461.0,5067487.0,985026.0,32908137.0,75595530.0,75595530.0,,,,,108503667.0,870063900.0,978567600.0,978567600.0,,,,,,,,,,,,,
2,24CS,2022-04-10,0.0,0.241495,35101165.0,0.0,0.0,19455578.0,6959244.0,563704500.0,35101165.0,28141921.0,-2013217.0,2492033.0,478816.0,19455580.0,19455578.0,613971700.0,430000000.0,430000000.0,0.045246,0.045246,19455578.0,19455578.0,19455578.0,19455578.0,19455578.0,6194310.0,25649888.0,161675.0,161675.0,0.0,0.0,,,-2013217.0,2492033.0,478816.0,27501430.0,50267186.0,50267186.0,19250000.0,31020000.0,2560000.0,28460000.0,77768616.0,563704500.0,641473100.0,641473100.0,,,,,,,,,,,,,
3,24CS,2021-04-10,0.0,0.352007,18310000.0,,,6940000.0,6310000.0,354560000.0,18310000.0,12000000.0,-1290000.0,1290000.0,,6940000.0,6940000.0,390560000.0,430000000.0,430000000.0,0.01614,0.01614,6940000.0,6940000.0,6940000.0,6940000.0,6940000.0,3770000.0,10710000.0,720000.0,720000.0,,,,,-1290000.0,1290000.0,,10880000.0,36000000.0,36000000.0,17830000.0,18170000.0,2100000.0,16070000.0,46880000.0,354560000.0,401440000.0,401440000.0,,,,,,,,,,,,,
4,2S,2024-04-09,-1027415.0,0.037799,248561000.0,-27181000.0,-27181000.0,160083000.0,55027000.0,6351661000.0,221380000.0,166353000.0,3354000.0,264000.0,3618000.0,186236600.0,160083000.0,6625824000.0,549996000.0,549996000.0,0.29,0.29,160083000.0,160083000.0,160083000.0,159811000.0,159811000.0,6278000.0,166089000.0,13214000.0,40395000.0,-15086000.0,,,-12095000.0,3354000.0,264000.0,3618000.0,149521000.0,274163000.0,274163000.0,163066000.0,111097000.0,111097000.0,,423684000.0,6351661000.0,6775345000.0,6775345000.0,272000.0,15086000.0,,,,,,,,,,,


In [43]:
data_df.reset_index(inplace=True)

In [44]:
symbol_col = "symbol"
date_col_feat = "date"
date_col_income = "realized_dt"

map_df = data_df[[symbol_col, date_col_feat]].merge(incm_stmt_df[[symbol_col, date_col_income]], on=symbol_col)
print(map_df.shape)

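# keep only statements already available before each price date;
# .copy() avoids pandas' SettingWithCopyWarning on the assignment below
map_df = map_df.query("realized_dt < date").copy()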

map_df["realized_dt_rank"] = map_df.groupby(["symbol", "date"])["realized_dt"].rank(ascending=False)

(3738294, 3)


In [45]:
map_df.head()

Unnamed: 0,symbol,date,realized_dt,realized_dt_rank
2,24CS,2022-10-03,2022-04-10,1.0
3,24CS,2022-10-03,2021-04-10,2.0
6,24CS,2022-10-04,2022-04-10,1.0
7,24CS,2022-10-04,2021-04-10,2.0
10,24CS,2022-10-05,2022-04-10,1.0


In [46]:
N_YEAR_LOOKBACK = 3

map_df = map_df.query(f"realized_dt_rank <= {N_YEAR_LOOKBACK}")
map_df = map_df.set_index(["symbol", "date", "realized_dt_rank"]).unstack(-1)

# flatten columns
map_df.columns = ["_".join(map(lambda x: str(x), col)).replace(".0", "") for col in map_df.columns]

In [47]:
map_df.head()

symbol,date,realized_dt_1,realized_dt_2,realized_dt_3
24CS,2022-10-03,2022-04-10,2021-04-10,NaT
24CS,2022-10-04,2022-04-10,2021-04-10,NaT
24CS,2022-10-05,2022-04-10,2021-04-10,NaT
24CS,2022-10-06,2022-04-10,2021-04-10,NaT
24CS,2022-10-07,2022-04-10,2021-04-10,NaT


In [48]:
data_df = data_df.set_index(["symbol", "date"])
data_df = data_df.join(map_df, how="left")

In [49]:
data_df[["realized_dt_1", "realized_dt_2", "realized_dt_3"]].isna().mean()

Unnamed: 0,0
realized_dt_1,0.243605
realized_dt_2,0.438295
realized_dt_3,0.642235


In [50]:
data_index = data_df.index

In [51]:
def lowercase_first_char(s):
    if len(s) > 0:
        return s[0].lower() + s[1:]
    return s

for i in range(1,4):

    # merge income df
    data_df = data_df.merge(incm_stmt_df, left_on=[data_df.index.get_level_values("symbol"), f"realized_dt_{i}"], right_on=["symbol", "realized_dt"], how="left")\
        .drop(columns=["symbol", "realized_dt"])

    data_df.index = data_index

    # rename columns from income df
    data_df.rename(columns={col: "income_stmt_" + lowercase_first_char(col.replace(" ","")) + f"_p{i}y" for col in data_df.columns.intersection(incm_stmt_df.columns)}, inplace=True)

In [52]:
data_df.drop(columns=["realized_dt_1", "realized_dt_2", "realized_dt_3"], inplace=True)
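If only the single most recent statement were needed, a point-in-time join could also be sketched with `pd.merge_asof`. This is an alternative to the rank-and-unstack approach used above, not what the notebook does, and it does not cover the three-year lookback. The symbol `AAA`, the dates, and the `net_income` column are made up for illustration.

```python
import pandas as pd

prices = pd.DataFrame({
    "symbol": ["AAA", "AAA", "AAA"],
    "date": pd.to_datetime(["2023-04-01", "2023-04-10", "2023-06-01"]),
})
statements = pd.DataFrame({
    "symbol": ["AAA", "AAA"],
    "realized_dt": pd.to_datetime(["2022-04-10", "2023-04-10"]),
    "net_income": [1.0, 2.0],
})

# For each price row, take the latest statement with realized_dt strictly
# before the price date (allow_exact_matches=False mirrors "realized_dt < date").
# merge_asof requires both frames to be sorted by their time keys.
latest = pd.merge_asof(
    prices.sort_values("date"),
    statements.sort_values("realized_dt"),
    left_on="date",
    right_on="realized_dt",
    by="symbol",
    direction="backward",
    allow_exact_matches=False,
)
print(latest)
```

The row dated 2023-04-10 still receives the 2022 statement, because the 2023 statement only counts as available from the following day.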

# Remove outliers
In this step, we remove every symbol whose 1-day return (`r01`) ever exceeds 100%. This threshold is arbitrary! You could use quantiles or percentiles to detect outliers instead, but we'll use this heuristic criterion (100%) for simplicity.

In [53]:
data_df[[f'r{t:02}' for t in T]].describe()

Unnamed: 0,r01,r05,r10,r21,r42,r63
count,941977.0,938669.0,934534.0,925437.0,908070.0,890703.0
mean,0.005072,0.02483,0.049802,0.10619,0.218177,0.337982
std,4.496026,10.07103,14.274028,20.786351,29.675778,36.697368
min,-0.933718,-0.934104,-0.935632,-0.937143,-0.937407,-0.940266
25%,-0.009434,-0.024096,-0.035714,-0.055046,-0.081421,-0.095652
50%,0.0,0.0,0.0,-0.005263,-0.007519,-0.00813
75%,0.008265,0.019512,0.028169,0.043479,0.066289,0.084615
max,4363.508545,4363.508545,4363.508545,4363.508545,4363.508545,4363.508545


In [54]:
outliers = data_df[data_df["r01"] > 1].index.get_level_values('symbol').unique()
outliers

Index(['AKS', 'JCKH', 'KC', 'KKC', 'NEWS', 'STHAI', 'STOWER'], dtype='object', name='symbol')

In [55]:
data_df = data_df.drop(outliers, level='symbol')
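As a sketch of the quantile-based alternative mentioned above, the threshold could be derived from the data rather than fixed at 100%. The toy returns, the symbols `AAA` and `BBB`, and the 0.9 quantile are made up for illustration; on real data a much more extreme quantile would likely be appropriate.

```python
import pandas as pd

toy = pd.DataFrame({
    "symbol": ["AAA"] * 4 + ["BBB"] * 4,
    "r01": [0.01, -0.02, 0.00, 0.03, 0.02, 5.00, -0.01, 0.01],
})

# Data-driven cutoff: anything above the chosen quantile counts as an outlier
upper = toy["r01"].quantile(0.9)

# As in the notebook, drop the whole symbol if any of its rows is flagged
flagged = toy.loc[toy["r01"] > upper, "symbol"].unique()
cleaned = toy[~toy["symbol"].isin(flagged)]
```

A quantile cutoff adapts to the distribution of returns, but it always flags some rows, so it can discard more symbols than a fixed threshold would.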

# Create time variables
We can extract time information such as year, month, and weekday (i.e. Mon, Tue, Wed, ...) as features, giving the ML model context about time and seasonality.

In [56]:
data_df['year'] = data_df.index.get_level_values('date').year
data_df['month'] = data_df.index.get_level_values('date').month
data_df['weekday'] = data_df.index.get_level_values('date').weekday

# Store model data

In [57]:
data_df.reset_index().to_csv(DATA_DIR / "data/model_data.csv",index=False)