# FinMLKit Quick Start Guide

In this notebook, we will demonstrate how to use `FinMLKit`. At the end of this notebook, you will be able to
1. Process raw trade data utilizing the `TradesData` class.
2. Save and load the preprocessed data to and from a hdf5 file.
3. Build intra-bar features from the preprocessed trades data.
4. Build inter-bar features (aka. indicators) from the bar data.
5. Build labels and finalize a dataset for ML model training.

This is a self-contained notebook, so I encourage you to run it for yourself and play around with the code.

## 1. Process Raw Trade Data

__Downloading Raw Trade Data:__

Fortunately, more and more centralized exchanges are providing raw trade data. In this example, we will download and process raw trade `BTC` data from __Binance__.

In [7]:
import numpy as np
# download 1 month of raw trades data from binance
! curl -s "https://data.binance.vision/data/futures/um/monthly/trades/BTCUSDT/BTCUSDT-trades-2025-07.zip" -o "BTCUSDT-trades-2025-07.zip"
# download the corresponding checksum
! curl -s "https://data.binance.vision/data/futures/um/monthly/trades/BTCUSDT/BTCUSDT-trades-2025-07.zip.CHECKSUM" -o "BTCUSDT-trades-2025-07.zip.CHECKSUM"
# verify the checksum (MacOS)
! shasum -a 256 -c "BTCUSDT-trades-2025-07.zip.CHECKSUM"
# verify the checksum (Linux)
# sha256sum -c "BTCUSDT-trades-2025-07.zip.CHECKSUM"

BTCUSDT-trades-2025-07.zip: OK


In [5]:
# unzip the downloaded file
! unzip -o "BTCUSDT-trades-2025-07.zip"

Archive:  BTCUSDT-trades-2025-07.zip
  inflating: BTCUSDT-trades-2025-07.csv  


In [3]:
! ls

BTCUSDT-trades-2025-07.csv TestFMK.ipynb
BTCUSDT-trades-2025-07.zip


Now we have the raw trades data in a file named `BTCUSDT-trades-2025-07.csv`. Next, install and import the necessary packages: `finmlkit` and `plotly` for visualization.

In [57]:
! pip install finmlkit plotly

Collecting scipy
  Downloading scipy-1.16.1-cp312-cp312-macosx_14_0_arm64.whl.metadata (61 kB)
Downloading scipy-1.16.1-cp312-cp312-macosx_14_0_arm64.whl (20.9 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m20.9/20.9 MB[0m [31m5.9 MB/s[0m  [33m0:00:03[0m eta [36m0:00:01[0m
[?25hInstalling collected packages: scipy
Successfully installed scipy-1.16.1


In [1]:
from finmlkit.bar.data_model import TradesData
import pandas as pd
import numpy as np

In [2]:
df = pd.read_csv("BTCUSDT-trades-2025-07.csv")
df.head()

Unnamed: 0,id,price,qty,quote_qty,time,is_buyer_maker
0,6440230568,107087.3,0.002,214.1746,1751328000018,True
1,6440230569,107087.3,1.391,148958.4343,1751328004439,True
2,6440230570,107087.3,3.45,369451.185,1751328004439,True
3,6440230571,107087.3,0.046,4926.0158,1751328004439,True
4,6440230572,107087.3,0.005,535.4365,1751328004439,True


In [3]:
trades = TradesData(df.time.values, df.price.values, df.qty.values,
                    id=df.id.values, is_buyer_maker=df.is_buyer_maker.values,
                    preprocess=True)

finmlkit.bar.data_model:416 | INFO | Inferred timestamp format: ms
finmlkit.bar.data_model:364 | INFO | Converting timestamp to nanoseconds units for processing...
finmlkit.bar.data_model:330 | INFO | Merging split trades (same timestamps) on same price level...
finmlkit.bar.data_model:191 | INFO | TradesData prepared successfully.


`TradesData` checks data integrity, here we can see that we jace around 15k missing trades in the data. It also checks for larger discontinuities in the data exceeding 1 minute. Fortunately, we have no such discontinuities in this data, it is quite clean.

`TradesData` with `preprocess=True` did the following:
- Infers timestamp unit from the data, in this case, it is `ms` (milliseconds).
- Converts the timestamp to nanoseconds as it is the pandas standard and also `finmlkit` numba functions require nanoseconds timestamp.
- validated the data integrity, if any, stores critical discontinuities in the `discontinuities` attribute.
- Merging split trades, i.e. merging fragmented trades with the same timestamp, price, and buyer maker status (if provided). This fragmentation is due to large market orders matching multiple limit orders.
- If buyer maker status is not provided, the side information is inferred from the price movement (tick rule method).


In [15]:
# We have no large discontinuities in the data exceeding 1 minute
trades.discontinuities

[]

In [17]:
trades.data

Unnamed: 0_level_0,timestamp,price,amount,side
datetime,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
2025-07-01 00:00:00.018,1751328000018000000,107087.3,0.002,-1
2025-07-01 00:00:04.439,1751328004439000000,107087.3,10.909,-1
2025-07-01 00:00:04.439,1751328004439000000,107087.2,0.004,-1
2025-07-01 00:00:04.439,1751328004439000000,107087.1,0.001,-1
2025-07-01 00:00:04.439,1751328004439000000,107086.7,0.002,-1
...,...,...,...,...
2025-07-31 23:59:59.676,1754006399676000000,115697.3,0.004,-1
2025-07-31 23:59:59.799,1754006399799000000,115697.4,0.001,1
2025-07-31 23:59:59.818,1754006399818000000,115697.4,0.080,1
2025-07-31 23:59:59.844,1754006399844000000,115697.4,0.001,1


Now we have validated, preprocessed and prepared the raw trades data for further processing. We can save this data to a hdf5 file for later use. This saves any discontinuity information also.

## 2. Save and Load data

In [22]:
trades.save_h5("BTCUSDT.h5")

finmlkit.bar.data_model:437 | INFO | Creating new data for 2025-07...


  check_attribute_name(name)


finmlkit.bar.data_model:486 | INFO | Successfully saved 39,171,929 records for 2025-07


  check_attribute_name(name)
your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->mixed-integer,key->values] [items->None]

  store.put(meta_key, meta, format="fixed")


'/trades/2025-07'

We get some harmless warnings but the data is saved successfully. We can now load this data from the hdf5 file. Note that we can put many months in this h5 file (do the same steps for the next month of data, then save it to the same h5 file `BTCUSDT.h5`), this way we can store our database in a single, compact file.

In [23]:
! ls

BTCUSDT-trades-2025-07.csv          BTCUSDT.h5
BTCUSDT-trades-2025-07.zip          TestFMK.ipynb
BTCUSDT-trades-2025-07.zip.CHECKSUM


There are some utility modules in the `io` module. It is often useful to have aggregated information from which we can access arbitrary frequency data quickly. Thus we can add time bars to the hdf5 file. This is done by the `AddTimeBarH5` class. It will add 1 second frequency bars from which then any other frequency can be derived (larger than 1 sec).

In [4]:
from finmlkit.bar.io import H5Inspector, AddTimeBarH5, TimeBarReader

In [28]:
# Inspect the h5 file
h5_info = H5Inspector("BTCUSDT.h5")
h5_info.list_keys()

['/trades/2025-07']

In [30]:
h5_info.get_integrity_summary()

finmlkit.bar.io:177 | INFO | All data passed integrity checks. No issues found.


In [31]:
# Now AddTimeBarH5 will add 1 second bars to the h5 file
AddTimeBarH5("BTCUSDT.h5").process_key('/trades/2025-07')

finmlkit.bar.io:266 | INFO | Loading trades data for 2025-07...
finmlkit.bar.data_model:549 | INFO | Loading trades from BTCUSDT.h5...
finmlkit.bar.data_model:633 | INFO | Loading 1 groups sequentially...
finmlkit.bar.data_model:655 | INFO | Concatenating 1 DataFrames...
finmlkit.bar.data_model:663 | INFO | Successfully loaded 39,171,929 trades from 1 monthly groups.
finmlkit.bar.data_model:324 | INFO | Inferred timestamp format: ns
finmlkit.bar.io:270 | INFO | Building 1-second time bars for 2025-07...
finmlkit.bar.kit:27 | INFO | Time bar builder initialized with interval: 1.0 seconds.
finmlkit.bar.base:70 | INFO | Calculating bar open tick indices and timestamps...
finmlkit.bar.base:107 | INFO | OHLCV bar calculated successfully.
finmlkit.bar.base:120 | INFO | OHLCV bar converted to DataFrame.
finmlkit.bar.io:275 | INFO | Saving time bars for 2025-07...
finmlkit.bar.io:300 | INFO | Successfully added time bars for 2025-07. Created 2678400 bars.


  check_attribute_name(name)
  check_attribute_name(name)
your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->mixed-integer,key->values] [items->None]

  store.put(meta_key, metadata, format='fixed')


True

In [33]:
# Now we can read timebars in arbitrary freq. for example 1 minute bars
TimeBarReader("BTCUSDT.h5").list_keys()

['/klines/2025-07']

In [39]:
tb1min = TimeBarReader("BTCUSDT.h5").read(start_time="2025-07-01", end_time="2025-07-02", timeframe="1min")
tb1min.head()

Unnamed: 0_level_0,open,high,low,close,volume,trades,vwap,median_trade_size
timestamp,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
2025-07-01 00:00:00,107087.3,107093.8,107063.5,107083.4,95.382996,739,107080.851562,0.005
2025-07-01 00:01:00,107083.4,107087.6,107061.7,107087.6,40.875,469,107071.140625,0.005
2025-07-01 00:02:00,107087.5,107099.8,107073.2,107099.7,46.274002,424,107084.835938,0.005
2025-07-01 00:03:00,107099.8,107114.1,107099.7,107114.1,19.092001,333,107104.921875,0.0065
2025-07-01 00:04:00,107114.1,107114.1,107065.0,107084.7,44.549999,541,107094.273438,0.005


In [38]:
tb1min.tail()

Unnamed: 0_level_0,open,high,low,close,volume,trades,vwap,median_trade_size
timestamp,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
2025-07-31 23:56:00,115596.4,115768.7,115588.6,115675.8,247.477005,3235,115676.578125,0.002
2025-07-31 23:57:00,115675.7,115712.4,115662.0,115712.4,73.338005,1148,115686.257812,0.002
2025-07-31 23:58:00,115712.4,115735.1,115668.1,115668.2,77.241005,1024,115713.929688,0.002
2025-07-31 23:59:00,115668.1,115712.7,115668.1,115697.3,55.620998,859,115696.132812,0.002
2025-08-01 00:00:00,115697.4,115697.4,115697.3,115697.3,0.305,9,115697.398438,0.002


Yes, we can read arbetrary frequency bars from the h5 file for a selected time range! The `TimeBarReader` class reads the bars from the h5 file, aggrefates to the required frequency, and returns a pandas DataFrame.


# 3. Build Intra-Bar Features
Intra-bar features are features that are computed from the trades data within a bar, e.g. OHLCV features, directional features like volume imbalance, and footprint information.

Now, that we preprocessed and saved the trades data, we can easily restore it every time we need it and spare the preprocessing steps:

In [40]:
trades = TradesData.load_trades_h5("BTCUSDT.h5")
trades.data.head()

finmlkit.bar.data_model:549 | INFO | Loading trades from BTCUSDT.h5...
finmlkit.bar.data_model:633 | INFO | Loading 1 groups sequentially...
finmlkit.bar.data_model:655 | INFO | Concatenating 1 DataFrames...
finmlkit.bar.data_model:663 | INFO | Successfully loaded 39,171,929 trades from 1 monthly groups.
finmlkit.bar.data_model:324 | INFO | Inferred timestamp format: ns


Unnamed: 0_level_0,timestamp,price,amount,id,side
datetime,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
2025-07-01 00:00:00.018,1751328000018000000,107087.3,0.002,,-1
2025-07-01 00:00:04.439,1751328004439000000,107087.3,10.909,,-1
2025-07-01 00:00:04.439,1751328004439000000,107087.2,0.004,,-1
2025-07-01 00:00:04.439,1751328004439000000,107087.1,0.001,,-1
2025-07-01 00:00:04.439,1751328004439000000,107086.7,0.002,,-1


In this example, we will build `time bars` and `volume bars`. To see all supported bar types, please refer to the [documentation](https://finmlkit.readthedocs.io/en/v0.1.6/api/finmlkit.bar.kit.html#module-finmlkit.bar.kit).

In [5]:
from finmlkit.bar.kit import TimeBarKit, VolumeBarKit

### A – Build Time Bars

In [5]:
tb5min_kit = TimeBarKit(trades, period=pd.Timedelta(minutes=5))
tb5min_klines = tb5min_kit.build_ohlcv()
tb5min_klines.head()

finmlkit.bar.kit:27 | INFO | Time bar builder initialized with interval: 300.0 seconds.
finmlkit.bar.base:106 | INFO | Calculating bar close tick indices and timestamps...
finmlkit.bar.base:146 | INFO | OHLCV bar calculated successfully.
finmlkit.bar.base:159 | INFO | OHLCV bar converted to DataFrame.


Unnamed: 0_level_0,open,high,low,close,volume,trades,median_trade_size,vwap
timestamp,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
2025-07-01 00:05:00,107087.3,107114.1,107061.7,107084.8,246.416,2511,0.005,107084.284019
2025-07-01 00:10:00,107084.8,107197.3,107036.7,107197.3,180.108994,2805,0.005,107109.481958
2025-07-01 00:15:00,107197.3,107314.3,107160.0,107277.5,336.639008,3168,0.008,107240.896874
2025-07-01 00:20:00,107277.5,107390.6,107274.5,107374.0,683.143982,3541,0.008,107353.698499
2025-07-01 00:25:00,107374.0,107408.2,107207.8,107207.8,451.330994,2762,0.01,107354.988931


In [46]:
tb5min_klines.columns

Index(['open', 'high', 'low', 'close', 'volume', 'trades', 'median_trade_size',
       'vwap'],
      dtype='object')

Now we built 5-minute candles from 1 month of raw trades data instantly. By the way, this is exactly how `AddTimeBarH5` builds the 1-second bars for the h5 file.

Can we produce more interesting features from the trades data? Yes, this is why `finmlkit` exists!

In [6]:
tb5min_directional = tb5min_kit.build_directional_features()
tb5min_directional.head()

finmlkit.bar.base:187 | INFO | Directional features calculated successfully.
finmlkit.bar.base:206 | INFO | Directional features converted to DataFrame.


Unnamed: 0_level_0,ticks_buy,ticks_sell,volume_buy,volume_sell,dollars_buy,dollars_sell,mean_spread,max_spread,cum_ticks_min,cum_ticks_max,cum_volume_min,cum_volume_max,cum_dollars_min,cum_dollars_max
timestamp,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
2025-07-01 00:05:00,1349,1162,124.296997,122.119003,13310200.0,13077081.0,0.048706,10.3,-206,288,-27.846003,6.903998,-2982230.0,739452.4
2025-07-01 00:10:00,1888,917,119.467003,60.641998,12796514.0,6494867.0,0.042638,9.1,-134,971,-11.552999,59.021,-1236879.0,6322655.0
2025-07-01 00:15:00,2007,1161,230.694,105.945,24738034.0,11363434.0,0.029104,5.0,1,1072,0.381,161.805008,40842.17,17350660.0
2025-07-01 00:20:00,2174,1367,376.598999,306.544983,40427128.0,32910906.0,0.048404,5.5,-30,834,-1.366,154.916,-146538.1,16627220.0
2025-07-01 00:25:00,776,1986,87.012001,364.319,9342337.0,39110296.0,0.024511,1.9,-1210,215,-279.545013,32.505001,-30007900.0,3490988.0


In [47]:
tb5min_directional.columns

Index(['ticks_buy', 'ticks_sell', 'volume_buy', 'volume_sell', 'dollars_buy',
       'dollars_sell', 'mean_spread', 'max_spread', 'cum_ticks_min',
       'cum_ticks_max', 'cum_volume_min', 'cum_volume_max', 'cum_dollars_min',
       'cum_dollars_max'],
      dtype='object')

We can do even more! If we have the average trade size, we can examine large prints within the bar. Moreover, size distribution features may be useful for some models...To determine the typical trade size, we will use the `TimeBarReader`, read daily bars, and compute the median trade size.

In [6]:
from finmlkit.bar.io import TimeBarReader

In [6]:
tbd = TimeBarReader("BTCUSDT.h5").read(timeframe="1d")
tbd.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 32 entries, 2025-07-01 to 2025-08-01
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   open               32 non-null     float64
 1   high               32 non-null     float64
 2   low                32 non-null     float64
 3   close              32 non-null     float64
 4   volume             32 non-null     float32
 5   trades             32 non-null     int64  
 6   vwap               32 non-null     float32
 7   median_trade_size  32 non-null     float32
dtypes: float32(3), float64(4), int64(1)
memory usage: 1.9 KB


In [7]:
typical_trade_size = tbd.median_trade_size.median()
typical_trade_size

np.float32(0.00475)

Now that we know the typical trade size, we can build size distribution features:

In [10]:
tb5min_sizedis = tb5min_kit.build_trade_size_features(theta=np.ones_like(tb5min_klines.close.values)*typical_trade_size)
tb5min_sizedis.head()

finmlkit.bar.base:230 | INFO | Trade size features calculated successfully.
finmlkit.bar.base:239 | INFO | Trade size features converted to DataFrame.


Unnamed: 0_level_0,mean_size_rel,size_95_rel,pct_block,size_gini
timestamp,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
2025-07-01 00:05:00,1.635492,2.88274,0.962838,0.990182
2025-07-01 00:10:00,1.309299,2.620388,0.943248,0.994339
2025-07-01 00:15:00,1.700047,3.005474,0.967217,0.994328
2025-07-01 00:20:00,2.210812,3.634535,0.983676,0.994886
2025-07-01 00:25:00,2.064367,3.356494,0.977998,0.993117


- `mean_size_rel`: Mean trade size relative to theta per bar: log1p(mean_size / theta)
- `size_95_rel`: 95th percentile of trade sizes per bar relative to theta: log1p(size_95 / theta)
- `pct_block`: Percentage of trades that are larger than theta per bar: SUM( size_i [ size_i>theta ] / volume )
- `size_gini`: Gini coefficient of trade sizes per bar

And we are still not done! We can build footprint features, e.g. volume profile, volume skew, and maximum run of signed volume.

In [11]:
tb5min_fp = tb5min_kit.build_footprints()

finmlkit.bar.base:264 | INFO | Price tick size is set to: 0.1
finmlkit.bar.base:277 | INFO | Footprint data calculated successfully.
finmlkit.bar.base:298 | INFO | Footprint data converted to FootprintData object.


In [65]:
print(tb5min_fp)

FootprintData:
  Number of Bars: 8928
  Price Tick: 0.1
  Date Range: 2025-07-01 00:05:00 to 2025-08-01 00:00:00
  Array Types: List
  Optional Attributes:
    COT Price Levels: present
    Sell Imbalances Sum: present
    Buy Imbalances Sum: present
  Total Memory Usage: 3.338 MB



Unlike the previous functions, this returns a `FootprintData` object which contains the footprints and some derived features. If we want to get the footprint DataFrame, we can use the `get_df()` method:

In [59]:
tb5min_fp.get_df()

Unnamed: 0_level_0,Unnamed: 1_level_0,price_level,sell_ticks,buy_ticks,sell_volume,buy_volume,sell_imbalance,buy_imbalance
bar_idx,bar_datetime_idx,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
0,2025-07-01 00:05:00,107114.1,0,91,0.000000,5.536999,False,False
0,2025-07-01 00:05:00,107114.0,27,0,5.765999,0.000000,False,False
0,2025-07-01 00:05:00,107113.9,1,0,0.215000,0.000000,True,False
0,2025-07-01 00:05:00,107113.8,1,0,0.003000,0.000000,True,False
0,2025-07-01 00:05:00,107113.7,0,0,0.000000,0.000000,False,False
...,...,...,...,...,...,...,...,...
8927,2025-08-01 00:00:00,115500.4,1,0,0.002000,0.000000,True,False
8927,2025-08-01 00:00:00,115500.3,1,0,0.007000,0.000000,True,False
8927,2025-08-01 00:00:00,115500.2,1,1,0.116000,0.003000,True,False
8927,2025-08-01 00:00:00,115500.1,1,4,1.014000,5.056000,True,False


This structures the footprint data into a multi-index DataFrame with the following columns:
- `bar_idx`: Index of the bar
- `bar_datetime_idx`: Datetime index corresponding to the bar
- `price`: Price level

and the tick/volume buy/sell values for each price level. Some derived footprint features can be accessed directly from the `FootprintData` object:
- cot_price_levels –  Commitment of Traders price levels.
- sell_imbalances_sum –  total sell imbalance counts per bar.
- buy_imbalances_sum –  total buy imbalance counts per bar.
- imb_max_run_signed –  longest signed imbalance run for each bar.
- vp_skew –  volume profile skew for each bar (positive = buy pressure above VWAP).
- vp_gini –  volume profile Gini coefficient for each bar (0 = concentrated, →1 = even).


In [60]:
tb5min_fp.vp_gini

array([0.96253387, 0.98415297, 0.98294061, ..., 0.99237972, 0.99188963,
       0.99283131], shape=(8928,))

Allright, thats it, we've built intra-bar features for the time bars. Lets do the same for the volume bars.
For this, we will decide the volume bar bucket size, as the median daily trade volume in the month.

### B – Build Volume Bars

In [62]:
# Remember that we've read the daily bars from the h5 file, it is in `tbd`
daily_volume_med = tbd.volume.median()
daily_volume_med

np.float32(133671.56)

In [72]:
# Set the bucket size such that we have approx. 2000 volume bars per day (on a typical day)
bucket_size = daily_volume_med / 2000

In [87]:
vb_kit = VolumeBarKit(trades, volume_ths=bucket_size)
vb_klines = vb_kit.build_ohlcv()
vb_klines.head()

finmlkit.bar.kit:87 | INFO | Volume bar builder initialized with volume: 66.83578491210938.
finmlkit.bar.base:70 | INFO | Calculating bar open tick indices and timestamps...
finmlkit.bar.base:107 | INFO | OHLCV bar calculated successfully.
finmlkit.bar.base:120 | INFO | OHLCV bar converted to DataFrame.


Unnamed: 0_level_0,open,high,low,close,volume,trades,median_trade_size,vwap
timestamp,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
2025-07-01 00:00:17.794,107087.3,107093.8,107072.5,107077.9,67.587997,349,0.005,107082.584286
2025-07-01 00:01:53.556,107077.9,107087.3,107061.7,107087.3,68.255997,836,0.005,107073.275079
2025-07-01 00:04:00.081,107087.2,107114.1,107073.2,107114.1,69.097,785,0.006,107091.811559
2025-07-01 00:06:05.306,107114.1,107114.1,107036.7,107050.2,66.856003,1059,0.005,107078.741443
2025-07-01 00:07:37.939,107050.8,107108.0,107039.4,107108.0,67.191002,1047,0.005,107080.972847


In [88]:
vb_directional = vb_kit.build_directional_features()
vb_directional.head()

finmlkit.bar.base:147 | INFO | Directional features calculated successfully.
finmlkit.bar.base:166 | INFO | Directional features converted to DataFrame.


Unnamed: 0_level_0,ticks_buy,ticks_sell,volume_buy,volume_sell,dollars_buy,dollars_sell,mean_spread,max_spread,cum_ticks_min,cum_ticks_max,cum_volume_min,cum_volume_max,cum_dollars_min,cum_dollars_max
timestamp,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
2025-07-01 00:00:17.794,106,243,25.643,41.945,2745893.75,4491604.0,0.156734,10.3,-156,-1,-19.799002,-9.498,-2120155.0,-1017056.0
2025-07-01 00:01:53.556,457,379,41.837002,26.418999,4479544.5,2828848.75,0.030742,0.3,-68,78,-7.899999,15.418,-845992.2,1650696.0
2025-07-01 00:04:00.081,556,229,38.320999,30.776001,4104080.5,3295642.5,0.032229,0.5,-46,327,-26.960001,7.544999,-2887002.0,808438.0
2025-07-01 00:06:05.306,452,607,27.755001,39.100998,2971843.25,4187013.0,0.030123,3.1,-214,21,-16.033998,0.245,-1716969.0,26242.98
2025-07-01 00:07:37.939,709,338,44.711002,22.48,4787596.5,2407281.25,0.071442,9.1,-1,372,0.002,22.814001,214.1016,2442632.0


In [89]:
vb_sizedis = vb_kit.build_trade_size_features(theta=np.ones_like(vb_klines.close.values)*typical_trade_size)
vb_sizedis.head()

finmlkit.bar.base:190 | INFO | Trade size features calculated successfully.
finmlkit.bar.base:199 | INFO | Trade size features converted to DataFrame.


Unnamed: 0_level_0,mean_size_rel,size_95_rel,pct_block,size_gini
timestamp,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
2025-07-01 00:00:17.794,2.214211,3.751854,0.985679,0.952057
2025-07-01 00:01:53.556,1.490142,2.612389,0.954951,0.975858
2025-07-01 00:04:00.081,1.548876,2.375053,0.954471,0.952826
2025-07-01 00:06:05.306,1.29696,2.548623,0.940951,0.987168
2025-07-01 00:07:37.939,1.308899,2.647294,0.945798,0.981521


In [90]:
vb_fp = vb_kit.build_footprints()

finmlkit.bar.base:223 | INFO | Price tick size is set to: 0.1
finmlkit.bar.base:236 | INFO | Footprint data calculated successfully.
finmlkit.bar.base:257 | INFO | Footprint data converted to FootprintData object.


Now, we have built volume and time bars with intra-bar features. We can now combine these features into a single DataFrame and use them as input for ML models.

Often, we want to produce more hand-crafted features from the intra-bar features, e.g. indicators like RSI, MACD, etc. This is done in the next step.

# 4. Build Inter-Bar (aka. bar-level) Features (Indicators)
Inter-bar features are features that are computed from the bar data. For this, we have to define transforms, features, and finally, a feature kit, or in other words a set of features that will be built from the source data (intra-bar features).

We have a special class of inter-bar features which is calculated from footprints, called volume profile. You can access and use it the following way:

In [7]:
from finmlkit.feature.core.volume import VolumePro

In [93]:
# calculate the 12-hour volume profile for the volume bars
vp12h = VolumePro(window_size=pd.Timedelta(minutes=12), n_bins= 41)
poc, hva, lva, pct_above_poc = vp12h.compute(vb_klines, vb_fp)

These are the point of control (POC) which is the price level with the highest volume, high volume area (HVA) and low value area (LVA) which mark the price levels around the POC which contributes around 70% of the volume, and finally, the percentage of volume above the POC. These are useful features to identify support and resistance levels.

Let's visualize this:

In [99]:
# Lets plot the POC, HVA, and LVA on the volume bars along with the close price for 1 day
import plotly.graph_objects as go
import plotly.express as px
from datetime import timedelta
# Select a single day for plotting
start_time = pd.Timestamp("2025-07-01 00:00:00")
end_time = start_time + timedelta(days=1)

vb_klines_vp = vb_klines.copy()
# Add the POC, HVA, and LVA to the volume bars DataFrame
vb_klines_vp["poc"] = poc
vb_klines_vp["hva"] = hva
vb_klines_vp["lva"] = lva
# Filter the volume bars for the selected day
vb_klines_vp_day = vb_klines_vp[(vb_klines.index >= start_time) & (vb_klines.index < end_time)]
# Create scatter plot for the volume bars with POC, HVA, and LVA along with the close price
fig = go.Figure()
# Add volume bars as a bar chart
fig.add_trace(go.Scatter(
    x=vb_klines_vp_day.index,
    y=vb_klines_vp_day["close"],
    mode='lines+markers',
    name='Close Price',
    line=dict(color='blue', width=2)))
# Add POC, HVA, and LVA as horizontal lines
fig.add_trace(go.Scatter(
    x=vb_klines_vp_day.index,
    y=vb_klines_vp_day["poc"],
    mode='lines',
    name='POC',
    line=dict(color='red', width=1.5)))
fig.add_trace(go.Scatter(
    x=vb_klines_vp_day.index,
    y=vb_klines_vp_day["hva"],
    mode='lines',
    name='HVA',
    line=dict(color='green', width=1.5, dash='dash')))
fig.add_trace(go.Scatter(
    x=vb_klines_vp_day.index,
    y=vb_klines_vp_day["lva"],
    mode='lines',
    name='LVA',
    line=dict(color='orange', width=1.5, dash='dash')))
# Update layout
fig.update_layout(
    title=f"Volume Bars with POC, HVA, and LVA for {start_time.date()}",
    xaxis_title="Time",
    yaxis_title="Price",
    legend=dict(x=0, y=1, traceorder='normal', orientation='h' ),
    # set height and width
    height=600,
    #width=1000,
)
# Show the plot
fig.show()


Next, move on to feature defined on more general intra-bar features, like OHLCV data. For this, import the necessary modules:

In [8]:
from finmlkit.feature.kit import Feature, Compose
from finmlkit.feature.transforms import EWMST, ReturnT

First, create a rolling standard volatility estimator for the volume bar. It is irregular, but these transforms work based on time window, not period, so we can use it on irregular data. We can compose a return and a exponential moving standard deviation transform to get the volatility estimator:

In [75]:
volatility_tfs = Compose(
    ReturnT(window=pd.Timedelta(hours=2)),
    EWMST(half_life=pd.Timedelta(hours=2))
)

The `Compose` module will chain the output of the previous transform to the next one. The `ReturnT` will compute the log return of the volume bar, and the `EWMST` will compute the exponentially weighted moving standard deviation of the log return with a half-life of 2 hours. We can apply this even on tick data:

In [78]:
volatility_tfs = Compose(
    ReturnT(window=pd.Timedelta(hours=2), input_col="price"),
    EWMST(half_life=pd.Timedelta(hours=2))
)

sigma = volatility_tfs(trades.data)
sigma.tail()

datetime
2025-07-31 23:59:59.676    0.002829
2025-07-31 23:59:59.799    0.002829
2025-07-31 23:59:59.818    0.002829
2025-07-31 23:59:59.844    0.002829
2025-07-31 23:59:59.978    0.002829
Name: price_ret7200.0s_ewms7200.0s, dtype: float64

The output column name is automatically generated based on the operations: `price_ret7200.0s_ewms7200.0s`.
When we are designing the features, we can rename them so that they are more readable (we will see this later).

There are some fundamental and some more exotic transforms available in `finmlkit`. You can find the full list in the [documentation](https://finmlkit.readthedocs.io/en/v0.1.6/api/finmlkit.feature.transforms.html). But I encourage you to build your own transforms for your specific use-case. It is easy to do, just inherit from the appropriate base transform class and implement the `_nb()` (numba implementation) or `_pd()` (pandas implementation) method. It enables the convince of fast prototyping with pandas (bars are coarser than raw trades, so performance is not critical here). You can pick a base class from these:

- `SISOTransform`: Single input, single output transform.
- `SIMOTransform`: Single input, multiple output transform.
- `MISOTransform`: Multiple input, single output transform.
- `MIMOTransform`: Multiple input, multiple output transform.

Let's show the process through an exaple.

In [79]:
from finmlkit.feature.base import SISOTransform  # 1. import the appropriate base class
from finmlkit.utils.log import get_logger        # you can use finmlkkit's logger to log messages (how cool is that?)
logger = get_logger(__name__)
from typing import Union
from scipy import stats

class TrendSlope(SISOTransform):
    """
    Computes the OLS slope of ln(close) over a specified window and converts it to an angle in degrees.

    This is useful as a trend indicator where the angle represents how steep the trend is.
    Positive angles indicate uptrend, negative angles indicate downtrend, and the magnitude
    represents the steepness of the trend.
    """
    def __init__(self, window: int = 24, input_col: str = "close"):
        """
        Compute the OLS slope of ln(close) over a specified window and convert to an angle in degrees.

        :param window: Window size for the rolling OLS calculation, default is 24
        :param input_col: Input column to compute slope on, default is "close"
        """
        super().__init__(input_col, f"trend_slope_{window}")  # 2. call the base class constructor, define the input and output column names.
        self.window = window

    def _pd(self, x):
        """Pandas implementation of trend slope calculation"""
        # Get the series to compute the trend slope on
        series = x[self.requires[0]]                          # 3. get the input series from the input DataFrame!

        # ----------- Implement the logic for the transform -----------
        log_series = np.log(series)
        # Initialize result series with NaN values
        result = pd.Series(np.nan, index=series.index, name=self.output_name)
        # Create x values (time indices) for the linear regression
        x_vals = np.arange(self.window)
        # Calculate rolling OLS slope and convert to angle in degrees
        for i in range(self.window - 1, len(log_series)):
            window_data = log_series.iloc[i - self.window + 1:i + 1]
            if window_data.isna().any():
                # Skip if there are any NaN values in the window
                continue
            # Calculate slope using OLS
            slope, _, _, _, _ = stats.linregress(x_vals, window_data.values)
            # Convert slope to angle in degrees
            angle = np.degrees(np.arctan(slope))
            # Store result
            result.iloc[i] = angle
        # ---------------------------------------------------------------

        result.name = self.output_name                         # 4. Ensure the output series name is set correctly!
        return result

    def _nb(self, x: Union[pd.DataFrame, pd.Series]) -> pd.Series:
        """Numba implementation would be more complex - falling back to pandas for now"""
        logger.info(f"Fall back to pandas for {self.__class__.__name__}")
        return self._pd(x)            # Falling back to pandas implementation for simplicity and fast prototyping

In [81]:
# Lets try this on the time bars
trend_slope_tfs = TrendSlope(window=24, input_col="close")  # here window is in periods, so 24 means 24 bars, in this case 24 * 5 minutes = 120 minutes = 2 hours
trend_slope_output = trend_slope_tfs(tb5min_klines)
trend_slope_output.tail(10)

__main__:52 | INFO | Fall back to pandas for TrendSlope


timestamp
2025-07-31 23:15:00   -0.015555
2025-07-31 23:20:00   -0.015266
2025-07-31 23:25:00   -0.014593
2025-07-31 23:30:00   -0.013604
2025-07-31 23:35:00   -0.012499
2025-07-31 23:40:00   -0.012103
2025-07-31 23:45:00   -0.011695
2025-07-31 23:50:00   -0.010958
2025-07-31 23:55:00   -0.012125
2025-08-01 00:00:00   -0.011576
Freq: 5min, Name: close_trend_slope_24, dtype: float64

Nice, we just made a cool feature transform.

Let's move on to
- create features
- arbitrary operations between features
- build a full feature feature

You can create features from transforms the following way:

In [82]:
trend_slope = Feature(trend_slope_tfs)
trend_slope.name

'close_trend_slope_24'

You may ask why we need the `Feature` class, why not just use the transform directly? The reason is that the `Feature` class allows us to define more complex features, e.g. combining multiple transforms or features, renaming the output, and so on. It also provides a convenient way to build a feature kit from a set of features.

In [83]:
trend_slope_derivative = trend_slope.rolling_mean(5).lag(1) # 5-period rolling mean and lagged by 1 period
trend_slope_derivative.name

'close_trend_slope_24_rmean5_lag1'

In [84]:
# you can give a custom name to the feature
trend_slope_derivative.name = "my_custom_feature"

Why these names important? Because we will use them to build a feature kit, which is a collection of features that can be built from the source data. The feature kit will take care of building the features and producing a DataFrame with the defined features.

Next, we will show the process of building a feature kit, starting from the bar construction.

In [9]:
# 1. Build intra-bar features + volume profile
tb5m = TimeBarKit(trades, pd.Timedelta(minutes=5))
tb5m_klines = tb5m.build_ohlcv()
tb5m_directional = tb5m.build_directional_features()
tb5m_fp = tb5m.build_footprints()

vp30 = VolumePro(window_size=pd.Timedelta(minutes=30), n_bins=41)
vp30_res = vp30.compute(tb5m_klines, tb5m_fp)
vp60 = VolumePro(window_size=pd.Timedelta(minutes=60), n_bins=41)
vp60_res = vp60.compute(tb5m_klines, tb5m_fp)
vp12h = VolumePro(window_size=pd.Timedelta(hours=12), n_bins=51)
vp12h_res = vp12h.compute(tb5m_klines, tb5m_fp)

full_tdf = tb5m_klines.join(tb5m_directional, validate="1:1")

full_tdf["cot"] = tb5m_fp.cot_price_levels * tb5m_fp.price_tick
full_tdf["poc_vp30m"] = vp30_res[0]
full_tdf["poc_vp30m_shift"] = (full_tdf["poc_vp30m"]-full_tdf["poc_vp30m"].shift(1)) / tb5m_fp.price_tick
full_tdf["poc_vp60m"] = vp60_res[0]
full_tdf["poc_vp60m_shift"] = (full_tdf["poc_vp60m"]-full_tdf["poc_vp60m"].shift(1)) / tb5m_fp.price_tick
full_tdf["poc_vp12h"] = vp12h_res[0]
full_tdf["hva_vp30m"] = vp30_res[1]
full_tdf["hva_vp60m"] = vp60_res[1]
full_tdf["hva_vp12h"] = vp12h_res[1]
full_tdf["lva_vp30m"] = vp30_res[2]
full_tdf["lva_vp60m"] = vp60_res[2]
full_tdf["lva_vp12h"] = vp12h_res[2]
full_tdf["pct_above_poc_vp30m"] = vp30_res[3]
full_tdf["pct_above_poc_vp60m"] = vp60_res[3]
full_tdf["pct_above_poc_vp12h"] = vp12h_res[3]

finmlkit.bar.kit:27 | INFO | Time bar builder initialized with interval: 300.0 seconds.
finmlkit.bar.base:106 | INFO | Calculating bar close tick indices and timestamps...
finmlkit.bar.base:146 | INFO | OHLCV bar calculated successfully.
finmlkit.bar.base:159 | INFO | OHLCV bar converted to DataFrame.
finmlkit.bar.base:187 | INFO | Directional features calculated successfully.
finmlkit.bar.base:206 | INFO | Directional features converted to DataFrame.
finmlkit.bar.base:264 | INFO | Price tick size is set to: 0.1
finmlkit.bar.base:277 | INFO | Footprint data calculated successfully.
finmlkit.bar.base:298 | INFO | Footprint data converted to FootprintData object.


In [10]:
# 2. Define features
import finmlkit.feature.kit as fk
import finmlkit.feature.transforms as tfs

In [20]:
# for one tfs we need this package:
!pip install antropy

Collecting antropy
  Using cached antropy-0.1.9-py3-none-any.whl.metadata (6.6 kB)
Collecting scikit-learn (from antropy)
  Using cached scikit_learn-1.7.1-cp312-cp312-macosx_12_0_arm64.whl.metadata (11 kB)
Collecting joblib>=1.2.0 (from scikit-learn->antropy)
  Using cached joblib-1.5.1-py3-none-any.whl.metadata (5.6 kB)
Collecting threadpoolctl>=3.1.0 (from scikit-learn->antropy)
  Using cached threadpoolctl-3.6.0-py3-none-any.whl.metadata (13 kB)
Using cached antropy-0.1.9-py3-none-any.whl (18 kB)
Using cached scikit_learn-1.7.1-cp312-cp312-macosx_12_0_arm64.whl (8.6 MB)
Using cached joblib-1.5.1-py3-none-any.whl (307 kB)
Using cached threadpoolctl-3.6.0-py3-none-any.whl (18 kB)
Installing collected packages: threadpoolctl, joblib, scikit-learn, antropy
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m4/4[0m [antropy]m2/4[0m [scikit-learn]
[1A[2KSuccessfully installed antropy-0.1.9 joblib-1.5.1 scikit-learn-1.7.1 threadpoolctl-3.6.0


In [11]:
full_feature_list = []

ret1_tfs = tfs.Return(input_col="close", is_log=True)

# Build realized volatility features
rv_3 = fk.Feature(Compose(ret1_tfs, tfs.RealizedVolatility(3)))
rv_12 = fk.Feature(Compose(ret1_tfs, tfs.RealizedVolatility(12)))
rv_24 = fk.Feature(Compose(ret1_tfs, tfs.RealizedVolatility(24)))
bv_12 = fk.Feature(Compose(ret1_tfs, tfs.BiPowerVariation(12)))
jump_var = (rv_12 - bv_12).clip(lower=0)
jump_var.name = "jump_var"
jump_prop = jump_var / (rv_12 + 1e-9)
jump_prop.name = "jump_prop"

full_feature_list.extend([rv_3, rv_12, rv_24, bv_12, jump_prop])  # Add realized volatility features to the list

# Structural break features
cusum_test_fast = fk.Feature(tfs.CUSUMTest(36, 24, 144))
cusum_test_medium = fk.Feature(tfs.CUSUMTest(144, 60, 144))
cusum_test_slow = fk.Feature(tfs.CUSUMTest(288, 120, 288))

full_feature_list.extend([cusum_test_fast, cusum_test_medium, cusum_test_slow])  # Add structural break features to the list

# Order imbalance
volume_buy = fk.Feature(tfs.Identity("volume_buy"))
volume_sell = fk.Feature(tfs.Identity("volume_sell"))
volume_imbalance_1 = (volume_buy - volume_sell) / (volume_buy + volume_sell + 1e-9)
volume_imbalance_1.name = "volume_imbalance"
volume_imbalance_3 = volume_imbalance_1.rolling_mean(3)
volume_imbalance_6 = volume_imbalance_1.rolling_mean(6)

full_feature_list.extend([volume_imbalance_1, volume_imbalance_3, volume_imbalance_6])  # Add order imbalance features to the list

# VP regime
# keep The followings from the source dataframe
# "poc_vp30m_shift"
# "poc_vp60m_shift"
# "pct_above_poc_vp30m"
# "pct_above_poc_vp60m"
# "pct_above_poc_vp12h"
poc12h = fk.Feature(tfs.Identity("poc_vp12h"))  # With the Identity transform, we can use the existing column as a feature
hva12h = fk.Feature(tfs.Identity("hva_vp12h"))  # This is useful if we want to combine them
lva12h = fk.Feature(tfs.Identity("lva_vp12h"))
vp_va = hva12h - lva12h
close = fk.Feature(tfs.Identity("close"))
vp_pctb = (close - poc12h) / (vp_va + 1e-9)
vp_pctb.name = "vp12h_pctb"
vp_dis_vah = (hva12h - close) / poc12h
vp_dis_vah.name = "vp12h_hva_distance"
vp_dis_val = (close - lva12h) / poc12h
vp_dis_val.name = "vp12h_lva_distance"
vp_range = vp_va / poc12h
vp_range.name = "vp12h_va_range"

full_feature_list.extend([vp_pctb, vp_dis_vah, vp_dis_val, vp_range])  # Add volume profile regime features to the list

# Momentum features
close = fk.Feature(tfs.Identity("close"))
ret3 = fk.Feature(tfs.Return(3, input_col="close", is_log=True))
ret6 = fk.Feature(tfs.Return(6, input_col="close", is_log=True))
close_ema_fast = fk.Feature(tfs.EWMA(6, input_col="close"))
close_ema_slow = fk.Feature(tfs.EWMA(24, input_col="close"))
close_ema_fast_dev = (close - close_ema_slow) / (close_ema_slow + 1e-9)
close_ema_fast_dev.name = "close_ema_fast_dev"
close_ema_slow_dev = (close - close_ema_fast) / (close_ema_fast + 1e-9)
close_ema_slow_dev.name = "close_ema_slow_dev"
rsi12 = fk.Feature(tfs.RSIWilder(12, input_col="close"))

full_feature_list.extend([ret3, ret6, close_ema_fast_dev, close_ema_slow_dev, rsi12])  # Add momentum features to the list

# Volatility ratio features
rv_ratio_3v12 = rv_3 / (rv_12 + 1e-9)
rv_ratio_3v12.name = "rv_ratio_3v12"
rv_6 = fk.Feature(Compose(ret1_tfs, tfs.RealizedVolatility(6)))
rv_ratio_6v24 = rv_6 / (rv_24 + 1e-9)
rv_ratio_6v24.name = "rv_ratio_6v24"

full_feature_list.extend([rv_ratio_3v12, rv_ratio_6v24])  # Add volatility ratio features to the list

# Slow features
hurst24 = fk.Feature(Compose(ret1_tfs, tfs.HurstExponent(24)))
apen24 = fk.Feature(Compose(ret1_tfs, tfs.ApproximateEntropy(24)))

full_feature_list.extend([hurst24, apen24])  # Add slow features to the list

[f.name for f in full_feature_list]  # Print all feature names

['close_ret1_rv3',
 'close_ret1_rv12',
 'close_ret1_rv24',
 'close_ret1_bv_12',
 'jump_prop',
 ['cumote_up36_score',
  'cumote_down36_score',
  'cumote_up36_flag',
  'cumote_down36_flag',
  'cumote_up36_age',
  'cumote_down36_age'],
 ['cumote_up144_score',
  'cumote_down144_score',
  'cumote_up144_flag',
  'cumote_down144_flag',
  'cumote_up144_age',
  'cumote_down144_age'],
 ['cumote_up288_score',
  'cumote_down288_score',
  'cumote_up288_flag',
  'cumote_down288_flag',
  'cumote_up288_age',
  'cumote_down288_age'],
 'volume_imbalance',
 'volume_imbalance_rmean3',
 'volume_imbalance_rmean6',
 'vp12h_pctb',
 'vp12h_hva_distance',
 'vp12h_lva_distance',
 'vp12h_va_range',
 'close_ret3',
 'close_ret6',
 'close_ema_fast_dev',
 'close_ema_slow_dev',
 'close_rsiw12',
 'rv_ratio_3v12',
 'rv_ratio_6v24',
 'close_ret1_hurst24',
 'close_ret1_apen24']

Now, that we defined the set of features, we can build the feature kit. The feature kit will take care of building the features from the source data and producing a DataFrame with the defined features. With the `retain` argument, we can specify which features to keep from the source DataFrame.

In [13]:
fkit = fk.FeatureKit(full_feature_list,
                     retain=["open", "high", "low", "close", "volume", "max_spread",
                             "poc_vp30m_shift", "poc_vp60m_shift", "pct_above_poc_vp30m","pct_above_poc_vp60m", "pct_above_poc_vp12h"]
                     )

In [16]:
feature_df = fkit.build(full_tdf)  # This will build the features on the `full_tdf` source DataFrame

finmlkit.feature.transforms:151 | INFO | Fall back to pandas for Return
finmlkit.feature.transforms:151 | INFO | Fall back to pandas for Return
finmlkit.feature.transforms:151 | INFO | Fall back to pandas for Return
finmlkit.feature.transforms:151 | INFO | Fall back to pandas for Return
finmlkit.feature.transforms:1601 | INFO | Fall back to pandas for BiPowerVariation
finmlkit.feature.transforms:151 | INFO | Fall back to pandas for Return
finmlkit.feature.transforms:151 | INFO | Fall back to pandas for Return
finmlkit.feature.transforms:151 | INFO | Fall back to pandas for Return
finmlkit.feature.transforms:151 | INFO | Fall back to pandas for Return
finmlkit.feature.transforms:1396 | INFO | Fall back to pandas for HurstExponent
finmlkit.feature.transforms:151 | INFO | Fall back to pandas for Return
finmlkit.feature.transforms:1456 | INFO | Fall back to pandas for ApproximateEntropy


In [18]:
fkit.save_config("my_fkit.json")
fkit = fk.FeatureKit.from_config("my_fkit.json")

ValueError: Unsupported unary op: rmean3

In [117]:
feature_df.tail(10)  # Show the first 10 rows of the feature DataFrame

Unnamed: 0_level_0,open,high,low,close,volume,max_spread,poc_vp30m_shift,poc_vp60m_shift,pct_above_poc_vp30m,pct_above_poc_vp60m,...,vp12h_va_range,close_ret3,close_ret6,close_ema_fast_dev,close_ema_slow_dev,close_rsiw12,rv_ratio_3v12,rv_ratio_6v24,close_ret1_hurst24,close_ret1_apen24
timestamp,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
2025-07-31 23:15:00,116083.0,116307.6,116083.0,116290.3,489.619995,10.3,-309.0,0.0,0.71177,0.590503,...,0.016855,0.003061,0.001167,8.5e-05,0.001359,50.888125,1.152636,1.17572,0.29804,0.123861
2025-07-31 23:20:00,116290.2,116290.3,116083.0,116092.8,455.520996,5.3,0.0,0.0,0.709125,0.598725,...,0.016855,0.000277,0.00064,-0.001484,-0.000244,43.720506,1.280399,1.217748,0.302293,0.171627
2025-07-31 23:25:00,116092.9,116208.0,116084.8,116085.9,242.332001,4.8,0.0,296.0,0.700613,0.453822,...,0.016855,2.4e-05,-0.000165,-0.00142,-0.000217,43.487059,1.283165,1.186179,0.326638,0.171627
2025-07-31 23:30:00,116085.8,116100.9,115980.0,115980.6,284.428009,10.3,0.0,0.0,0.678794,0.419896,...,0.016855,-0.002667,0.000394,-0.00214,-0.000803,39.936913,1.039499,1.102486,0.302068,0.123861
2025-07-31 23:35:00,115980.6,115983.6,115744.7,115882.2,1978.467041,39.5,-1960.0,-624.0,0.89693,0.524153,...,0.017498,-0.001816,-0.001539,-0.002749,-0.00118,36.868611,0.658276,1.060934,0.265468,0.123861
2025-07-31 23:40:00,115882.3,115893.4,115739.5,115750.2,503.731995,17.4,-43.0,267.0,0.845885,0.39358,...,0.017499,-0.002896,-0.002872,-0.003575,-0.001656,33.142341,0.877616,1.128553,0.19729,0.123861
2025-07-31 23:45:00,115750.1,115760.0,115373.9,115693.0,2994.667969,62.700001,-82.0,-314.0,0.471252,0.335855,...,0.018468,-0.002483,-0.00515,-0.003743,-0.001536,31.631071,0.791462,0.920752,0.119826,0.184135
2025-07-31 23:50:00,115693.0,115815.1,115580.1,115758.7,609.112976,3.9,149.0,0.0,0.362927,0.298324,...,0.018468,-0.001066,-0.002882,-0.002924,-0.000693,35.326325,0.74623,0.697184,0.10491,0.231901
2025-07-31 23:55:00,115758.8,115806.9,115508.8,115508.9,550.132019,7.0,-187.0,0.0,0.337423,0.266079,...,0.018468,-0.002087,-0.004983,-0.004671,-0.002037,28.857081,1.093324,1.0369,0.26064,0.231901
2025-08-01 00:00:00,115508.9,115768.7,115500.0,115697.3,609.695984,20.0,175.0,0.0,0.254183,0.225543,...,0.018468,3.7e-05,-0.002446,-0.002805,-0.000292,38.172713,1.302736,1.107776,0.248163,0.136369


In [118]:
feature_df.columns

Index(['open', 'high', 'low', 'close', 'volume', 'max_spread',
       'poc_vp30m_shift', 'poc_vp60m_shift', 'pct_above_poc_vp30m',
       'pct_above_poc_vp60m', 'pct_above_poc_vp12h', 'close_ret1_rv3',
       'close_ret1_rv12', 'close_ret1_rv24', 'close_ret1_bv_12', 'jump_prop',
       'cumote_up36_score', 'cumote_down36_score', 'cumote_up36_flag',
       'cumote_down36_flag', 'cumote_up36_age', 'cumote_down36_age',
       'cumote_up144_score', 'cumote_down144_score', 'cumote_up144_flag',
       'cumote_down144_flag', 'cumote_up144_age', 'cumote_down144_age',
       'cumote_up288_score', 'cumote_down288_score', 'cumote_up288_flag',
       'cumote_down288_flag', 'cumote_up288_age', 'cumote_down288_age',
       'volume_imbalance', 'volume_imbalance_rmean3',
       'volume_imbalance_rmean6', 'vp12h_pctb', 'vp12h_hva_distance',
       'vp12h_lva_distance', 'vp12h_va_range', 'close_ret3', 'close_ret6',
       'close_ema_fast_dev', 'close_ema_slow_dev', 'close_rsiw12',
       'rv_ratio_3v12'

We've just bult our first feature df! Congratulations!
The only thing left is to build labels for the features, so that we can train an ML model on them.

### New FeatureKit capabilities: Feature ops, caching-aware pipelines, topology, serialization
In this subsection, we showcase the latest FeatureKit capabilities you can leverage when crafting bar-level features:
- Arithmetic between features and constants (+, -, *, /), abs, clip, and min/max between features or with constants.
- Compose and Feature transforms are caching-aware and can reuse previously computed columns from your DataFrame.
- FeatureKit can resolve dependencies and execute in topological order; optional timing gives quick performance insights.
- Save and load FeatureKit configurations to JSON for reproducibility and deployment.

In [14]:
# Feature arithmetic and convenience ops
import finmlkit.feature.kit as fk
import finmlkit.feature.transforms as tfs
import numpy as np

# Define some base features
f_close = fk.Feature(tfs.Identity("close"))
f_sma3 = fk.Feature(tfs.SMA(3, input_col="close"))
f_ewma5 = fk.Feature(tfs.EWMA(5, input_col="close"))

# Arithmetic operations between features and constants
f_ratio = f_sma3 / (f_ewma5 + 1e-9)  # avoid division-by-zero
f_ratio.name = "sma3_over_ewma5"

f_shifted = f_close - 1000.0  # constant subtraction
f_abs = (f_close - f_sma3).abs()  # absolute distance from SMA
f_clipped = (f_close - f_ewma5).clip(lower=-100.0, upper=100.0)

# Min/Max operations (feature-feature and feature-constant)
f_min_fc = fk.Feature.min(f_close, f_sma3)
f_min_fc.name = "min_close_sma3"
f_max_fC = fk.Feature.max(f_close, 100.0)
f_max_fC.name = "max_close_100"

# Build a small kit with these ops
ops_kit = fk.FeatureKit([
    f_sma3, f_ewma5, f_ratio, f_shifted, f_abs, f_clipped, f_min_fc, f_max_fC
], retain=["close"]) 
ops_df = ops_kit.build(full_tdf, backend="pd", order="topo")
ops_df.tail()

Unnamed: 0_level_0,close,close_sma3,max_close_100,"sub(close,close_ewma5)_clip_-100.0_100.0","sub(close,1000.0)","abs(sub(close,close_sma3))",close_ewma5,min_close_sma3,sma3_over_ewma5
timestamp,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
2025-07-31 23:40:00,115750.2,115871.0,115750.2,-100.0,114750.2,120.8,115920.559876,115750.2,0.999572
2025-07-31 23:45:00,115693.0,115775.133333,115693.0,-100.0,114693.0,82.133333,115844.706584,115693.0,0.999399
2025-07-31 23:50:00,115758.7,115733.966667,115758.7,-57.337723,114758.7,24.733333,115816.037723,115733.966667,0.999291
2025-07-31 23:55:00,115508.9,115653.533333,115508.9,-100.0,114508.9,144.633333,115713.658482,115508.9,0.99948
2025-08-01 00:00:00,115697.3,115654.966667,115697.3,-10.905655,114697.3,42.333333,115708.205655,115654.966667,0.99954


#### Caching-aware Compose pipelines
You can chain transforms with Compose. If intermediate or final outputs already exist in your working DataFrame, Compose will short-circuit and reuse them.

In [15]:
from finmlkit.feature.kit import Compose

# Compose a 2-step pipeline: SMA(3) -> EWMA(5) on the SMA output
sma3_t = tfs.SMA(3, input_col="close")
ewma5_on_sma = tfs.EWMA(5, input_col=sma3_t.output_name)
comp = Compose(sma3_t, ewma5_on_sma)

# First run computes and returns the Series
comp_out_1 = comp(full_tdf, backend="pd")

# Prepare a copy with the final composed output cached under the final composed name
_df_cached = full_tdf.copy()
_df_cached[comp.output_name] = comp_out_1.values

# Second run: short-circuits (reuses cached final column)
comp_out_2 = comp(_df_cached, backend="pd")

# Validate same result
bool(np.allclose(comp_out_1.fillna(0).values, comp_out_2.fillna(0).values))

True

#### Execution order, topological sort and timing
FeatureKit can resolve dependencies between features and run in a valid topological order. Enable timeit for quick timing.

In [16]:
# Print inferred topological order among features of the main kit we created above
print("Topological order (subset shown):")
order = fkit.topological_order()
print(order[:10], "... (total:", len(order), "features)")

# Rebuild with timing enabled (console chart)
_ = fkit.build(full_tdf, backend="pd", order="topo", timeit=True)

Topological order (subset shown):
['close_ret1_hurst24', "['cumote_up144_score', 'cumote_down144_score', 'cumote_up144_flag', 'cumote_down144_flag', 'cumote_up144_age', 'cumote_down144_age']", 'close_ema_fast_dev', 'rv_ratio_6v24', 'close_ema_slow_dev', 'close_ret1_rv12', 'volume_imbalance', 'vp12h_hva_distance', 'vp12h_lva_distance', 'close_ret1_apen24'] ... (total: 24 features)
finmlkit.feature.transforms:666 | INFO | Fall back to numba for CUSUMTest
finmlkit.feature.transforms:666 | INFO | Fall back to numba for CUSUMTest
finmlkit.feature.transforms:666 | INFO | Fall back to numba for CUSUMTest

Feature Timing Analysis:
close_ret1_apen24              | ██████████████████████████████████████████████████ 1.2203s
rv_ratio_6v24                  | ██████████████████████████████████████████████ 1.1304s
close_ret1_hurst24             | ██████████████████████████████ 0.7375s
close_ret1_rv12                | ███████████████████████ 0.5619s
close_ret1_rv3                 | ███████████████████

#### Save/Load FeatureKit configuration (serialization)
You can serialize a kit to JSON and later reconstruct it for reproducibility.

In [17]:
# Save the small ops kit to JSON
cfg_path = "featurekit_ops_quickstart.json"
ops_kit.save_config(cfg_path)

# Load it back and rebuild
loaded_kit = fk.FeatureKit.from_config(cfg_path)
ops_df_loaded = loaded_kit.build(full_tdf, backend="pd", order="topo")

# Check that the columns and values match (up to NaNs)
print(set(ops_df.columns) == set(ops_df_loaded.columns))
for c in ops_df.columns:
    assert np.allclose(ops_df[c].fillna(0).values, ops_df_loaded[c].fillna(0).values)
print("Serialization round-trip OK")

True
Serialization round-trip OK


#### Visualize the computational graph
The FeatureKit can expose the computational graph that links raw inputs to derived features. This helps you understand dependencies and debug complex pipelines.

In [18]:
# Visualize the computational graph inferred from our feature kit
# Input nodes are prefixed with "input:"; edges indicate data dependencies
G = fkit.build_graph()
print(G.visualize())

ComputationGraph:
  ['cumote_up144_score', 'cumote_down144_score', 'cumote_up144_flag', 'cumote_down144_flag', 'cumote_up144_age', 'cumote_down144_age'] -> []
  ['cumote_up288_score', 'cumote_down288_score', 'cumote_up288_flag', 'cumote_down288_flag', 'cumote_up288_age', 'cumote_down288_age'] -> []
  ['cumote_up36_score', 'cumote_down36_score', 'cumote_up36_flag', 'cumote_down36_flag', 'cumote_up36_age', 'cumote_down36_age'] -> []
  close_ema_fast_dev -> []
  close_ema_slow_dev -> []
  close_ret1_apen24 -> []
  close_ret1_bv_12 -> []
  close_ret1_hurst24 -> []
  close_ret1_rv12 -> []
  close_ret1_rv24 -> []
  close_ret1_rv3 -> [rv_ratio_3v12]
  close_ret3 -> []
  close_ret6 -> []
  close_rsiw12 -> []
  input:close -> [['cumote_up144_score', 'cumote_down144_score', 'cumote_up144_flag', 'cumote_down144_flag', 'cumote_up144_age', 'cumote_down144_age'], ['cumote_up288_score', 'cumote_down288_score', 'cumote_up288_flag', 'cumote_down288_flag', 'cumote_up288_age', 'cumote_down288_age'], ['cu

#### Integrating external functions via ExternalFunction (TA-Lib)
You can integrate external Python functions (by object or import path) into your pipelines using ExternalFunction. Here we demonstrate using TA-Lib’s technical indicators. If installation fails, consult TA-Lib platform-specific install instructions or try `pip install talib-binary`.

In [19]:
# Attempt to install TA-Lib (may require platform-specific setup)
!pip install --quiet TA-Lib || echo "If TA-Lib install fails, try 'pip install talib-binary' or consult TA-Lib install docs."

In [22]:
import talib

In [23]:
import numpy as np
import finmlkit.feature.kit as fk
from finmlkit.feature.transforms import ExternalFunction

# Define TA-Lib external indicators on 'close' using numpy arrays for compatibility
ext_sma14 = ExternalFunction(talib.SMA, input_cols="close", output_cols="talib_sma14", args=[14], pass_numpy=True)
ext_rsi14 = ExternalFunction(talib.RSI, input_cols="close", output_cols="talib_rsi14", args=[14], pass_numpy=True)

f_ext_sma14 = fk.Feature(ext_sma14)
f_ext_rsi14 = fk.Feature(ext_rsi14)

# Build a small kit with external indicators
_talib_kit = fk.FeatureKit([f_ext_sma14, f_ext_rsi14], retain=["close"]) 
_talib_df = _talib_kit.build(full_tdf, backend="pd", order="topo")
_talib_df.tail()

Unnamed: 0_level_0,close,talib_sma14,talib_rsi14
timestamp,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2025-07-31 23:40:00,115750.2,116048.614286,33.89874
2025-07-31 23:45:00,115693.0,116014.3,32.567512
2025-07-31 23:50:00,115758.7,115992.185714,35.691375
2025-07-31 23:55:00,115508.9,115946.057143,30.000677
2025-08-01 00:00:00,115697.3,115923.114286,38.026367


Serialization also works for external functions:

In [24]:
# Save and load a TA-Lib-based kit and verify outputs match
cfg_path_talib = "featurekit_talib_quickstart.json"
_talib_kit.save_config(cfg_path_talib)

_talib_loaded = fk.FeatureKit.from_config(cfg_path_talib)
_talib_df_loaded = _talib_loaded.build(full_tdf, backend="pd", order="topo")

print(set(_talib_df.columns) == set(_talib_df_loaded.columns))
for c in _talib_df.columns:
    assert np.allclose(_talib_df[c].fillna(0).values, _talib_df_loaded[c].fillna(0).values)
print("ExternalFunction serialization round-trip OK")

True
ExternalFunction serialization round-trip OK


# 5. Build Labels

To build TBM labels (research and understand the triple barrier labeling method if you are not familiar with it) we need a target return. We can set this based on the volatility of the data. We've already calculated this on the trade data, quickly recap it:

In [119]:
vola_estim = Compose(
    tfs.ReturnT(window=pd.Timedelta(hours=1), is_log=True, input_col="price"),
    tfs.EWMST(pd.Timedelta(hours=1)),
)
sigma = vola_estim(trades.data)

In [123]:
# add this as a feature to the feature DataFrame
feature_df = pd.merge_asof(feature_df, sigma, right_index=True, left_index=True, direction="backward")
feature_df.rename(columns={sigma.name: "sigma"}, inplace=True)  # rename the column to "sigma"

Now, we can build the labels using the `TBMLabel` class. This class will take care of building the labels based on the target return and the feature DataFrame. It will also clean the DataFrame from any NaN values and return a clean DataFrame with the labels.

In [124]:
from finmlkit.label.kit import TBMLabel

In [125]:
MIN_RET = 0.001  # minimum target return, this is a hyperparameter, you can set it based on your data, we set this to 10 bps here.
vertical_window = pd.Timedelta(hours=2)  # vertical barrier, this is the maximum time to hold the position, we set it to 2 hour here
tbm_label = TBMLabel(feature_df, target_ret_col="sigma", min_ret = MIN_RET, horizontal_barriers=(1.5, 1.5), vertical_barrier=vertical_window)
fts, lbs = tbm_label.compute_labels(trades)  # the path and label is evaluated on the raw trades data in this example
lbs.head()

Unnamed: 0_level_0,touch_time,event_idx,touch_idx,labels,returns,vertical_touch_weights
timestamp,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
2025-07-01 12:05:00,2025-07-01 12:23:41.878,352269,360727,-1,-0.002296,1.0
2025-07-01 12:10:00,2025-07-01 12:25:41.443,354939,362666,-1,-0.00224,1.0
2025-07-01 12:15:00,2025-07-01 13:36:30.622,356884,404891,1,0.00222,1.0
2025-07-01 12:20:00,2025-07-01 13:33:30.767,358770,400697,1,0.002229,1.0
2025-07-01 12:25:00,2025-07-01 13:30:12.163,361825,395887,1,0.002155,1.0


We can also compute sample weights based on label concurrence and return attribution:

In [126]:
# Compute sample weights
info_weights = tbm_label.compute_weights(trades)
info_weights.head()

Unnamed: 0_level_0,avg_uniqueness,return_attribution
timestamp,Unnamed: 1_level_1,Unnamed: 2_level_1
2025-07-01 12:05:00,0.562793,0.001371
2025-07-01 12:10:00,0.345087,0.000705
2025-07-01 12:15:00,0.189739,0.000141
2025-07-01 12:20:00,0.159444,0.000142
2025-07-01 12:25:00,0.138479,0.000194


Finally, we can calculate a final weight, combining average uniqueness, return attribution, class imbalance, time decay etc...

In [127]:
from finmlkit.label.kit import SampleWeights

In [128]:
sample_weights = SampleWeights().compute_final_weights(info_weights.avg_uniqueness, time_decay_intercept=0.5, return_attribution=info_weights.return_attribution, labels=lbs.labels)
sample_weights.head()

Unnamed: 0_level_0,time_decay_weights,return_attribution,weights
timestamp,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2025-07-01 12:05:00,0.5003,5.06635,3.316089
2025-07-01 12:10:00,0.500485,2.605158,1.705788
2025-07-01 12:15:00,0.500586,0.520231,0.344857
2025-07-01 12:20:00,0.500671,0.523092,0.346813
2025-07-01 12:25:00,0.500745,0.716021,0.474796


In [129]:
sample_weights.tail()

Unnamed: 0_level_0,time_decay_weights,return_attribution,weights
timestamp,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2025-07-31 21:35:00,0.999715,0.67212,0.879069
2025-07-31 21:40:00,0.999739,0.710845,0.929741
2025-07-31 21:45:00,0.999766,0.706195,0.923684
2025-07-31 21:50:00,0.999801,0.871354,1.139746
2025-07-31 21:55:00,1.0,6.31232,8.258271


Now, we have everything we need to train a machine learning model. We have the features, labels, and sample weights.

Happy coding and tinkering with `FinMLKit`!