# Solutions to Riptable Exercises

This notebook contains the solutions to the [Riptable Exercises](RiptableExercises.ipynb).

Your solutions may be implemented slightly differently, but they should get the same essential results.

If you have any questions or comments, email RiptableDocumentation@sig.com.

In [1]:
# Generate notebook download link
from IPython.display import FileLink
print('To download this notebook, right click on the link and Save link as...')
FileLink('RiptableSolutions.ipynb')

To download this notebook, right click on the link and Save link as...


In [2]:
import riptable as rt
import numpy as np

## Introduction to the Riptable Dataset

**Datasets** are the core class of riptable. 

They are tables of data, consisting of a series of **columns** of the same length (sometimes referred to as **fields**).

Structurally, they behave like python dictionaries, and can be created directly from one.

We'll familiarize ourselves with Datasets by manually constructing one from fake sample data, generated with `np.random.default_rng().choice(...)` or similar.

In real life, Datasets will essentially always be generated from real-world data.

**First, create a python dictionary with two fields of the same length (>1000); one column of stock prices and one of symbols.**

**Make sure the symbols have duplicates, for later aggregation exercises.**

In [3]:
rng = np.random.default_rng()
dset_length = 5_000

In [4]:
my_dict = {'Price': rng.uniform(0, 1000, dset_length), 'Symbol': rng.choice(['GME', 'AMZN', 'TSLA', 'SPY'], dset_length)}

**Create a riptable dataset from this, using** `rt.Dataset(my_dict)`.

In [5]:
my_dset = rt.Dataset(my_dict)

You can easily append more columns to a dataset.

**Add a new column of integer trade size, using** `my_dset.Size = `.

In [6]:
my_dset.Size = rng.integers(1, 1000, dset_length)

Columns can also be referred to with brackets around a string name. This is typically used when the column name comes from a variable.

**Add a new column of booleans indicating whether you traded this trade, using**
`my_dset['MyTrade'] =`.

In [7]:
my_dset['MyTrade'] = rng.choice([True, False], dset_length)

**Add a new column of string "Buy" or "Sell" indicating the customer direction.**

In [8]:
my_dset.CustDirection = rng.choice(['Buy', 'Sell'], dset_length)

Riptable will convert these arrays to the riptable **FastArray** container and cast the data to an appropriate numpy datatype.

**View the datatypes with** `my_dset.dtypes`.

In [9]:
my_dset.dtypes

{'Price': dtype('float64'),
 'Symbol': dtype('S4'),
 'Size': dtype('int64'),
 'MyTrade': dtype('bool'),
 'CustDirection': dtype('S4')}

**View some sample rows of the dataset using** `.sample()`.

You should use this instead of `.head()` because the initial rows of a dataset are often unrepresentative.

In [12]:
my_dset.sample()

#,Price,Symbol,Size,MyTrade,CustDirection
0,110.11,AMZN,499,True,Buy
1,441.11,TSLA,135,False,Sell
2,16.95,AMZN,456,True,Sell
3,350.19,TSLA,458,False,Buy
4,965.67,TSLA,560,False,Sell
5,982.08,AMZN,292,True,Buy
6,770.71,TSLA,923,False,Sell
7,280.44,AMZN,147,True,Buy
8,350.86,SPY,928,True,Buy
9,947.27,TSLA,767,True,Buy


**View distributional stats of the numerical fields of your dataset with** `.describe()`.

You can call this on a single column as well.

In [13]:
my_dset.describe()

*Stats,Price,Size,MyTrade
Count,5000.0,5000.0,5000.0
Valid,5000.0,5000.0,5000.0
Nans,0.0,0.0,0.0
Mean,507.17,499.68,0.5
Std,289.56,284.99,0.5
Min,0.4,1.0,0.0
P10,96.65,107.0,0.0
P25,262.44,258.0,0.0
P50,509.31,496.0,0.0
P75,762.43,743.0,1.0


## Manipulating data

You can perform simple operations on riptable columns with normal python syntax. Riptable will apply them to the whole column at once, efficiently.

**Create a new column by performing scalar arithmetic on one of your numeric columns.**

In [14]:
my_dset.SharesOfStock = 100 * my_dset.Size

In [15]:
my_dset.sample()

#,Price,Symbol,Size,MyTrade,CustDirection,SharesOfStock
0,396.83,TSLA,402,True,Buy,40200
1,802.5,AMZN,926,True,Sell,92600
2,845.9,AMZN,305,True,Sell,30500
3,93.57,AMZN,316,False,Sell,31600
4,216.31,SPY,109,True,Buy,10900
5,546.9,TSLA,650,False,Buy,65000
6,380.3,GME,164,False,Buy,16400
7,131.45,AMZN,419,True,Buy,41900
8,885.55,SPY,862,True,Sell,86200
9,893.31,TSLA,323,True,Sell,32300


As long as the columns are the same size (as is guaranteed if they're in the same dataset) you can perform combining operations the same way.

**Create a new column of total price paid for the trade by multiplying two existing columns together.**

Riptable will automatically upcast types as necessary to preserve information.
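Since a FastArray is a subclass of the NumPy array, the upcasting can be illustrated with plain NumPy. This is a minimal sketch with made-up values; `price` and `size` are hypothetical stand-ins for the `Price` and `Size` columns.

```python
import numpy as np

# Hypothetical stand-ins for the Price (float) and Size (integer) columns.
price = np.array([10.5, 20.25, 3.0])
size = np.array([2, 4, 10], dtype=np.int64)

# Multiplying a float column by an integer column gives a float result,
# so no fractional part of the price is lost.
total_cash = price * size

print(size.dtype, price.dtype, total_cash.dtype)
print(total_cash)
```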

In [16]:
my_dset.TotalCash = my_dset.Price * my_dset.Size

In [17]:
my_dset.sample()

#,Price,Symbol,Size,MyTrade,CustDirection,SharesOfStock,TotalCash
0,486.83,TSLA,388,True,Sell,38800,188890.52
1,883.75,AMZN,91,False,Sell,9100,80421.64
2,216.25,GME,192,True,Buy,19200,41520.74
3,533.71,TSLA,29,True,Buy,2900,15477.53
4,991.15,SPY,954,False,Sell,95400,945554.54
5,208.46,TSLA,765,False,Buy,76500,159473.02
6,993.24,TSLA,396,False,Sell,39600,393323.84
7,785.89,SPY,430,True,Sell,43000,337932.91
8,197.52,SPY,357,True,Buy,35700,70516.33
9,963.39,GME,726,False,Buy,72600,699419.65


There are many built-in functions as well, which you call with either `my_dset.field.function()` or `rt.function(my_dset.field)` syntax.

**Find the unique Symbols in your dataset.**

In [18]:
my_dset.Symbol.unique()

FastArray([b'AMZN', b'GME', b'SPY', b'TSLA'], dtype='|S4')

## Date/Time

Riptable has three main date/time types: `Date`, `DateTimeNano`, and `TimeSpan`.

**Give each row of your dataset an** `rt.Date`.

**Make sure they're not all different, but still include days from multiple months.**

Note that due to Riptable idiosyncrasies you need to generate a list of yyyymmdd strings and pass it into the `rt.Date(...)` constructor, not construct Dates individually.

In [19]:
my_dset.Date = rt.Date(rng.choice(rt.Date.range('20220201', '20220430'), dset_length))

In [20]:
my_dset.sample()

#,Price,Symbol,Size,MyTrade,CustDirection,SharesOfStock,TotalCash,Date
0,800.34,TSLA,875,True,Sell,87500,700299.36,2022-02-08
1,837.0,TSLA,22,False,Sell,2200,18414.08,2022-03-19
2,739.75,AMZN,894,True,Buy,89400,661334.5,2022-03-13
3,333.02,GME,433,False,Sell,43300,144198.32,2022-04-23
4,154.95,AMZN,628,False,Sell,62800,97307.43,2022-02-23
5,995.3,AMZN,184,True,Buy,18400,183136.04,2022-03-11
6,734.84,GME,531,True,Sell,53100,390202.38,2022-02-06
7,238.48,SPY,500,True,Buy,50000,119241.98,2022-04-17
8,16.35,GME,97,True,Sell,9700,1586.28,2022-03-17
9,893.65,AMZN,545,False,Sell,54500,487041.87,2022-03-29


**Give each row a unique(ish)** `TimeSpan` **as a trade time.**

You can instantiate them using `rt.TimeSpan(hours_var, unit='h')`.

In [21]:
my_dset.TradeTime = rt.TimeSpan(rng.uniform(9.5, 16, dset_length), unit='h')

In [22]:
my_dset.sample()

#,Price,Symbol,Size,MyTrade,...,SharesOfStock,TotalCash,Date,TradeTime
0,735.74,SPY,513,False,...,51300,377432.28,2022-04-12,12:25:35.400236737
1,846.86,SPY,619,False,...,61900,524207.19,2022-02-26,14:23:07.768135867
2,36.61,AMZN,520,False,...,52000,19036.05,2022-02-12,14:21:19.051116920
3,313.17,SPY,230,False,...,23000,72028.38,2022-03-01,13:26:41.105253958
4,245.27,TSLA,897,True,...,89700,220008.96,2022-03-01,14:38:19.886975870
5,894.97,SPY,713,False,...,71300,638116.05,2022-04-03,12:12:35.683634970
6,279.53,AMZN,255,True,...,25500,71280.87,2022-02-20,15:14:44.946418434
7,710.2,TSLA,105,True,...,10500,74570.97,2022-03-03,14:23:16.690519947
8,211.5,TSLA,207,True,...,20700,43781.07,2022-04-20,14:34:06.527025035
9,526.5,TSLA,645,True,...,64500,339591.78,2022-04-22,13:32:00.731248375


**Create a DateTimeNano of the combined TradeDateTime by simple addition. Riptable knows how to sum the types.**

Be careful here: by default you'll get a GMT timezone. You can force NYC with `rt.DateTimeNano(..., from_tz='NYC')`.

In [23]:
my_dset.TradeDateTime = rt.DateTimeNano(my_dset.Date + my_dset.TradeTime, from_tz='NYC')

In [24]:
my_dset.sample()

#,Price,Symbol,Size,MyTrade,...,TotalCash,Date,TradeTime,TradeDateTime
0,980.77,GME,946,False,...,927804.65,2022-02-24,12:40:21.109339625,20220224 12:40:21.109339625
1,695.3,GME,549,True,...,381718.71,2022-04-30,10:04:09.135858889,20220430 10:04:09.135858889
2,88.44,GME,845,True,...,74731.74,2022-02-04,11:34:29.436742013,20220204 11:34:29.436742013
3,87.12,SPY,404,False,...,35195.26,2022-03-25,13:41:35.937013122,20220325 13:41:35.937013122
4,37.81,GME,479,False,...,18109.0,2022-03-26,13:09:25.299921592,20220326 13:09:25.299921592
5,312.32,TSLA,44,True,...,13742.07,2022-03-22,12:38:20.177905206,20220322 12:38:20.177905206
6,888.37,SPY,822,False,...,730237.26,2022-02-22,11:44:06.554816187,20220222 11:44:06.554816187
7,880.87,TSLA,897,False,...,790143.05,2022-02-01,12:52:50.335991030,20220201 12:52:50.335991030
8,721.04,AMZN,544,False,...,392246.76,2022-03-05,10:34:53.186864088,20220305 10:34:53.186864088
9,382.08,SPY,548,True,...,209379.04,2022-04-02,15:32:18.771448479,20220402 15:32:18.771448479


To reverse this operation and get out separate dates and times from a DateTimeNano, you can call `rt.Date(my_DateTimeNano)` and `my_DateTimeNano.time_since_midnight()`.
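The idea of splitting a timestamp into a date part and a time-since-midnight part can be sketched in plain NumPy. This is only an analogy: `datetime64` and `timedelta64` stand in for `DateTimeNano` and `TimeSpan`, they carry no timezone, and the sample timestamps are made up.

```python
import numpy as np

# Made-up timestamps standing in for a DateTimeNano column.
stamp = np.array(['2022-02-24T12:40:21', '2022-04-30T10:04:09'],
                 dtype='datetime64[ns]')

# Truncating to day resolution keeps only the date part.
date_part = stamp.astype('datetime64[D]')

# Subtracting the date leaves the time elapsed since midnight.
time_part = stamp - date_part

# Adding the two parts back together recovers the original timestamps.
rebuilt = date_part + time_part

print(date_part)
print(time_part)
```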

**Create a new month name column by using the** `.strftime` **function.**

In [25]:
my_dset.month_name = my_dset.Date.strftime('%b%y')

In [26]:
my_dset.sample()

#,Price,Symbol,Size,MyTrade,...,TradeTime,TradeDateTime,month_name
0,29.78,TSLA,770,False,...,11:47:11.463924743,20220422 11:47:11.463924743,Apr22
1,3.47,GME,673,False,...,11:45:18.212819501,20220214 11:45:18.212819501,Feb22
2,236.91,TSLA,125,True,...,15:11:16.258631766,20220408 15:11:16.258631766,Apr22
3,352.97,SPY,319,True,...,15:10:00.957743793,20220331 15:10:00.957743793,Mar22
4,974.6,GME,884,False,...,12:28:22.043173523,20220220 12:28:22.043173523,Feb22
5,670.58,AMZN,381,False,...,13:59:53.069688371,20220430 13:59:53.069688371,Apr22
6,361.65,TSLA,521,True,...,13:16:51.163380616,20220206 13:16:51.163380616,Feb22
7,567.18,GME,630,True,...,15:13:42.601042079,20220408 15:13:42.601042079,Apr22
8,396.56,AMZN,567,True,...,12:59:16.990516705,20220411 12:59:16.990516705,Apr22
9,126.72,TSLA,931,True,...,14:18:21.738492674,20220417 14:18:21.738492674,Apr22


**Create another new month column by using the** `.start_of_month` **attribute.**

This is nice for grouping because it will automatically sort correctly.
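The difference is easy to see with the standard library alone. In this small sketch (made-up dates, default C locale assumed for the month abbreviations), the formatted names sort alphabetically while the start-of-month dates sort chronologically.

```python
from datetime import date

# Made-up trade dates from three different months.
dates = [date(2022, 4, 12), date(2022, 2, 8), date(2022, 3, 19)]

# String labels, analogous to the month_name column.
month_names = [d.strftime('%b%y') for d in dates]

# First-of-month dates, analogous to the month column.
month_starts = [d.replace(day=1) for d in dates]

names_sorted = sorted(month_names)    # alphabetical order
starts_sorted = sorted(month_starts)  # chronological order

print(names_sorted)
print(starts_sorted)
```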

In [27]:
my_dset.month = my_dset.Date.start_of_month

In [28]:
my_dset.sample()

#,Price,Symbol,Size,MyTrade,...,TradeDateTime,month_name,month
0,725.67,GME,688,True,...,20220208 14:46:46.584027086,Feb22,2022-02-01
1,619.16,SPY,187,True,...,20220226 14:00:32.836528589,Feb22,2022-02-01
2,315.76,TSLA,660,False,...,20220206 12:58:29.632363604,Feb22,2022-02-01
3,356.52,SPY,369,True,...,20220426 13:34:36.699243210,Apr22,2022-04-01
4,116.84,TSLA,796,False,...,20220329 11:43:24.038580496,Mar22,2022-03-01
5,940.18,SPY,495,False,...,20220330 14:38:33.061508953,Mar22,2022-03-01
6,245.84,AMZN,105,False,...,20220409 15:10:18.174861038,Apr22,2022-04-01
7,62.33,SPY,3,False,...,20220402 10:14:02.656973901,Apr22,2022-04-01
8,724.23,TSLA,192,False,...,20220208 09:54:54.929626782,Feb22,2022-02-01
9,780.04,AMZN,453,True,...,20220427 14:12:24.850968828,Apr22,2022-04-01


## Sorting

Riptable has two sorts, `sort_copy` (which preserves the original dataset) and `sort_inplace`, which is faster and more memory-efficient if you don't need the original data order.
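The same copy versus in-place distinction exists for a single NumPy array, which may help make the trade-off concrete. This is a plain NumPy sketch with made-up values, not riptable code.

```python
import numpy as np

values = np.array([30, 10, 20])

# Copying sort: returns a new array and leaves the original order intact.
sorted_copy = np.sort(values)
original_after_copy = values.tolist()

# In-place sort: reorders the existing buffer and returns None,
# so the original order is gone afterwards.
result = values.sort()

print(sorted_copy, original_after_copy, values, result)
```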

**Sort your dataset by TradeDateTime.**

This is the natural ordering of a list of trades, so do it in-place.

In [29]:
my_dset = my_dset.sort_inplace('TradeDateTime')

In [30]:
my_dset.sample()

#,Price,Symbol,Size,MyTrade,...,TradeDateTime,month_name,month
0,251.35,AMZN,289,False,...,20220201 12:19:44.249569503,Feb22,2022-02-01
1,58.39,AMZN,682,False,...,20220210 12:49:15.747778700,Feb22,2022-02-01
2,341.83,AMZN,676,True,...,20220223 11:54:57.261881458,Feb22,2022-02-01
3,421.14,SPY,270,False,...,20220306 15:17:15.323909275,Mar22,2022-03-01
4,336.62,AMZN,401,True,...,20220407 14:28:21.294242194,Apr22,2022-04-01
5,581.11,TSLA,871,True,...,20220413 14:00:47.151287717,Apr22,2022-04-01
6,725.35,AMZN,996,False,...,20220418 14:08:27.959836126,Apr22,2022-04-01
7,986.29,GME,221,False,...,20220419 11:01:20.821549222,Apr22,2022-04-01
8,6.09,GME,890,False,...,20220421 14:23:44.655935672,Apr22,2022-04-01
9,721.65,GME,469,True,...,20220425 15:25:04.021192737,Apr22,2022-04-01


## Filtering

Filtering is the principal way to work with a subset of your data in riptable. It is commonly used for looking at a restricted set of trades matching some criterion you care about.

Except in rare instances, though, you should maintain your dataset in its full size, and only apply a filter when performing a final computation.

This will avoid unnecessary data duplication and improve speed & memory usage.

**Construct a filter of only your sales. (A filter is a column of Booleans which is true only for the rows you're interested in.)**

You can combine filters using `&` or `|`. Be careful to always wrap each comparison in parentheses: `&` binds more tightly than `>` or `<`, so the unparenthesized form triggers an extremely slow call into native python followed by a crash.

Always `(my_dset.field1 > 10) & (my_dset.field2 < 5)`, never `my_dset.field1 > 10 & my_dset.field2 < 5`.
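The parsing problem can be reproduced with plain NumPy arrays. In this sketch `field1` and `field2` are made-up arrays; the failure shown is NumPy's, and riptable may fail differently.

```python
import numpy as np

field1 = np.array([5, 20, 30])
field2 = np.array([1, 2, 9])

# Parenthesized: two boolean arrays combined element by element.
good = (field1 > 10) & (field2 < 5)
print(good)

# Unparenthesized: & binds tighter than the comparisons, so Python reads this
# as the chained comparison field1 > (10 & field2) < 5, which has to convert
# a whole array to a single True/False and fails.
error_message = None
try:
    bad = field1 > 10 & field2 < 5
except ValueError as exc:
    error_message = str(exc)
    print("ValueError:", error_message)
```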

In [31]:
f_my_sales = my_dset.MyTrade & (my_dset.CustDirection == 'Buy')

**Compute the total Trade Size, filtered for only your sales.**

For this and many other instances, you can & should pass your filter into the `filter` kwarg of the `.nansum(...)` call.

This allows riptable to perform the filtering during the nansum computation, rather than instantiating a new column and then summing it.
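NumPy offers a comparable choice, which may make the distinction clearer. In this sketch (made-up data, plain NumPy rather than riptable), boolean indexing builds a temporary filtered array before summing, while the `where` argument of `np.sum` skips the unwanted elements during the reduction.

```python
import numpy as np

size = np.array([100, 250, 40, 10])
keep = np.array([True, False, True, False])

# Filter first, then sum: allocates a new, smaller array.
total_via_copy = size[keep].sum()

# Sum with a mask: no intermediate filtered array is created.
total_via_where = np.sum(size, where=keep)

print(total_via_copy, total_via_where)
```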

In [32]:
my_dset.Size.nansum(filter=f_my_sales)

2498422

**Count how many times you sold each symbol.**

Here the `.count()` function doesn't accept a `filter` kwarg, so you must fall back to explicitly filtering the `Symbol` field before counting.

Be careful that you only filter down the `Symbol` field, not the entire dataset, otherwise you are wasting a lot of compute.

In [33]:
my_dset.Symbol[f_my_sales].count()

*Unique,Count
AMZN,318
GME,310
SPY,307
TSLA,315


## Categoricals

So far, we've been operating on your symbol column as a column of strings.

However, it's far more efficient when you have a large column with many repeats to use a categorical, which assigns each unique value a number, and stores the labels & numbers separately.

This is memory-efficient, and also computationally efficient, as riptable can perform operations on the unique values, then expand out to the full vector appropriately.
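The labels-plus-codes idea can be sketched with `np.unique`. This is an illustration in plain NumPy with made-up symbols, not riptable's implementation; note that NumPy's codes start at 0.

```python
import numpy as np

symbols = np.array(['AMZN', 'TSLA', 'AMZN', 'SPY', 'SPY', 'GME'])

# labels holds each distinct value once; codes holds one small integer
# per row pointing into labels.
labels, codes = np.unique(symbols, return_inverse=True)

# Indexing the labels by the codes expands back to the full column.
rebuilt = labels[codes]

print(labels)
print(codes)
```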

**Make a new column of your string column converted to a categorical, using** `rt.Cat(column)`.

In [34]:
my_dset.Symbol_cat = rt.Cat(my_dset.Symbol)
my_dset.Symbol_cat

Categorical([AMZN, TSLA, AMZN, SPY, SPY, ..., TSLA, SPY, AMZN, TSLA, TSLA]) Length: 5000
  FastArray([1, 4, 1, 3, 3, ..., 4, 3, 1, 4, 4], dtype=int8) Base Index: 1
  FastArray([b'AMZN', b'GME', b'SPY', b'TSLA'], dtype='|S4') Unique count: 4

**Perform the same filtered count from above, on the categorical.**

The categorical `.count()` admits a `filter` kwarg, which makes it simpler.

In [35]:
my_dset.Symbol_cat.count(filter=f_my_sales)

*Symbol_cat,Count
AMZN,318
GME,310
SPY,307
TSLA,315


Categoricals can be used as groupings. When you call a numeric function on a categorical and pass numeric columns in, riptable knows to do the calculation per-group.

**Compute the total number of contracts sold by customers in each symbol.**

In [36]:
my_dset.Symbol_cat.sum(my_dset.Size, filter=my_dset.CustDirection == 'Sell')

*Symbol_cat,Size
AMZN,317423
GME,315979
SPY,316665
TSLA,314398


The `transform=True` kwarg in a categorical operation performs the aggregation, then *transforms* it back up to the original shape of the categorical, giving each row the appropriate value from its group.
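The aggregate-then-expand step can be sketched in plain NumPy. The data is made up, and `np.bincount` here is a stand-in for whatever riptable does internally.

```python
import numpy as np

symbol = np.array(['AMZN', 'GME', 'AMZN', 'GME', 'SPY'])
price = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

# One integer code per row, pointing at its group label.
labels, codes = np.unique(symbol, return_inverse=True)

# Aggregate: one mean per group.
group_mean = np.bincount(codes, weights=price) / np.bincount(codes)

# Transform: give every row the mean of the group it belongs to.
mean_per_row = group_mean[codes]

print(dict(zip(labels.tolist(), group_mean.tolist())))
print(mean_per_row)
```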

**Make a new column which is the average trade price, per symbol.**

In [37]:
my_dset.average_trade_price = my_dset.Symbol_cat.mean(my_dset.Price, transform=True)

**Inspect with** `.sample()` **to confirm that this value is consistent for rows with matching symbol.**

In [38]:
my_dset.sample()

#,Price,Symbol,Size,MyTrade,...,month_name,month,Symbol_cat,average_trade_price
0,485.95,SPY,671,True,...,Feb22,2022-02-01,SPY,507.38
1,158.59,GME,463,False,...,Mar22,2022-03-01,GME,505.36
2,277.87,SPY,26,True,...,Mar22,2022-03-01,SPY,507.38
3,996.43,GME,596,False,...,Mar22,2022-03-01,GME,505.36
4,202.23,AMZN,928,False,...,Mar22,2022-03-01,AMZN,510.34
5,359.18,GME,778,False,...,Mar22,2022-03-01,GME,505.36
6,125.56,SPY,612,True,...,Mar22,2022-03-01,SPY,507.38
7,686.36,AMZN,39,True,...,Mar22,2022-03-01,AMZN,510.34
8,720.28,GME,661,True,...,Apr22,2022-04-01,GME,505.36
9,727.12,SPY,148,True,...,Apr22,2022-04-01,SPY,507.38


If you need to perform a custom operation on each categorical, you can pass in a function with `.apply_reduce` (which aggregates) or `.apply_nonreduce` (which is like `transform=True`).

Note that the custom function you pass needs to expect a FastArray, and output a scalar (`apply_reduce`) or same-length FastArray (`apply_nonreduce`).
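To make the two output shapes concrete, here is a plain NumPy sketch. `reduce_per_group` and `nonreduce_per_group` are hypothetical helpers written for this illustration, not riptable functions, and the data is made up.

```python
import numpy as np

def reduce_per_group(func, codes, values):
    """Call func on each group's values; func returns one scalar per group."""
    return np.array([func(values[codes == g]) for g in range(codes.max() + 1)])

def nonreduce_per_group(func, codes, values):
    """Call func on each group's values; func returns a same-length array."""
    out = np.empty(len(values), dtype=float)
    for g in range(codes.max() + 1):
        member = codes == g
        out[member] = func(values[member])
    return out

codes = np.array([0, 1, 0, 1, 0])   # group of each row
size = np.array([5, 7, 9, 11, 13])

# Scalar per group: the second value seen in each group.
second = reduce_per_group(lambda x: x[1], codes, size)

# Same-length result: each value minus its group's mean.
demeaned = nonreduce_per_group(lambda x: x - x.mean(), codes, size)

print(second)
print(demeaned)
```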

**Find, for each symbol, the trade size of the second trade occurring in the dataset.**

In [39]:
my_dset.Symbol_cat.apply_reduce(lambda x: x[1], my_dset.Size)

*Symbol_cat,Size
AMZN,360
GME,923
SPY,409
TSLA,446


Sometimes you want to aggregate based on multiple values. In these cases we use multi-key categoricals.

**Use a multi-key categorical to compute the average size per symbol-month pair.**

In [40]:
my_dset.Symbol_month_cat = rt.Cat([my_dset.Symbol, my_dset.month])

In [41]:
my_dset.Symbol_month_cat.nanmean(my_dset.Size).sort_inplace('Symbol')

*Symbol,*month,Size
AMZN,2022-02-01,491.85
.,2022-03-01,486.34
.,2022-04-01,517.41
GME,2022-02-01,484.2
.,2022-03-01,527.31
.,2022-04-01,513.59
SPY,2022-02-01,499.23
.,2022-03-01,507.26
.,2022-04-01,474.98
TSLA,2022-02-01,478.41


## Accumulating

Aggregating over two values for human viewing is often most conveniently done with an accum. 

**Use** `Accum2` **to compute the average size per symbol-month pair.**

In [42]:
rt.Accum2(my_dset.Symbol, my_dset.month).nanmean(my_dset.Size)

*Symbol,2022-02-01,2022-03-01,2022-04-01,Nanmean
AMZN,491.85,486.34,517.41,498.2
GME,484.2,527.31,513.59,508.62
SPY,499.23,507.26,474.98,493.75
TSLA,478.41,500.87,513.25,498.3
Nanmean,488.39,505.27,504.57,499.68


Average numbers can be meaningless. It is often better to consider relative percentage instead.

**Use** `accum_ratiop` **to compute the fraction of total volume done by each symbol-month pair.**

In [43]:
rt.accum_ratiop(my_dset.Symbol, my_dset.month, my_dset.Size, norm_by='R')

*Symbol,2022-02-01,2022-03-01,2022-04-01,TotalRatio,Total
AMZN,32.35,34.03,33.62,100.0,621754.0
GME,31.19,35.23,33.58,100.0,628657.0
SPY,30.86,36.18,32.96,100.0,619657.0
TSLA,29.54,34.36,36.1,100.0,628354.0
TotalRatio,30.98,34.95,34.07,100.0,
Total,774095.0,873112.0,851215.0,,2498422.0
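Judging by the output above, where each symbol's row sums to 100, `norm_by='R'` appears to express each cell as a percentage of its row total. A plain NumPy sketch of that row normalization, with made-up data and no claim about how riptable computes it:

```python
import numpy as np

symbol = np.array(['AMZN', 'AMZN', 'GME', 'GME', 'GME'])
month = np.array(['2022-02', '2022-03', '2022-02', '2022-02', '2022-03'])
size = np.array([100, 300, 50, 150, 200])

row_labels, row_codes = np.unique(symbol, return_inverse=True)
col_labels, col_codes = np.unique(month, return_inverse=True)

# Build the symbol-by-month table of summed sizes.
totals = np.zeros((len(row_labels), len(col_labels)))
np.add.at(totals, (row_codes, col_codes), size)

# Divide every row by its own total so that each row sums to 100.
row_pct = 100 * totals / totals.sum(axis=1, keepdims=True)

print(totals)
print(row_pct)
```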


## Merging

There are two main types of merges.

First is `merge_lookup`. This is used for enriching one (typically large) dataset with information from another (typically small) dataset.
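Conceptually this is a key lookup: every row of the large table fetches a value from the small table by its key. A minimal sketch with a Python dictionary and made-up data follows. It assumes the keys in the small table are unique, and it says nothing about how riptable treats unmatched keys.

```python
import numpy as np

# Large table: one symbol per trade.
trade_symbol = np.array(['SPY', 'GME', 'SPY', 'AMZN'])

# Small table: one row per symbol.
lookup_symbol = ['GME', 'SPY', 'AMZN']
lookup_trader = ['Nate', 'Josh', 'Dan']

# Map each key of the small table to its value.
trader_by_symbol = dict(zip(lookup_symbol, lookup_trader))

# Enrich the large table: one looked-up value per trade row.
trade_trader = np.array([trader_by_symbol[s] for s in trade_symbol.tolist()])

print(trade_trader)
```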

**Create a new dataset with one row per symbol from your dataset, and a second column of who trades each symbol.**

In [44]:
symbol_trader = rt.Dataset({'UnderlyingSymbol': ['GME', 'TSLA', 'SPY', 'AMZN'],
                           'Trader': ['Nate', 'Elon', 'Josh', 'Dan']})

In [45]:
symbol_trader

#,UnderlyingSymbol,Trader
0,GME,Nate
1,TSLA,Elon
2,SPY,Josh
3,AMZN,Dan


**Enrich the main dataset by putting the correct trader into each row.**

In [46]:
my_dset.Trader = my_dset.merge_lookup(symbol_trader, on=('Symbol', 'UnderlyingSymbol'), columns_left=[])['Trader']

In [47]:
my_dset.sample()

#,Price,Symbol,Size,MyTrade,...,Symbol_cat,average_trade_price,Symbol_month_cat,Trader
0,92.15,AMZN,420,False,...,AMZN,510.34,"(AMZN, 2022-02-01)",Dan
1,536.58,SPY,484,True,...,SPY,507.38,"(SPY, 2022-03-01)",Josh
2,356.18,TSLA,427,False,...,TSLA,505.6,"(TSLA, 2022-03-01)",Elon
3,417.53,TSLA,359,False,...,TSLA,505.6,"(TSLA, 2022-04-01)",Elon
4,4.38,SPY,594,True,...,SPY,507.38,"(SPY, 2022-04-01)",Josh
5,11.18,GME,630,True,...,GME,505.36,"(GME, 2022-04-01)",Nate
6,609.13,GME,802,False,...,GME,505.36,"(GME, 2022-04-01)",Nate
7,615.85,SPY,24,False,...,SPY,507.38,"(SPY, 2022-04-01)",Josh
8,334.67,TSLA,967,True,...,TSLA,505.6,"(TSLA, 2022-04-01)",Elon
9,906.37,AMZN,861,True,...,AMZN,510.34,"(AMZN, 2022-04-01)",Dan


The second type of merge is `merge_asof`, which is used for fuzzy alignment between two datasets, typically by time (though it can be by other variables).

**Create a new index price dataset with one price per minute, which covers all the Dates in your dataset.**

The index price doesn't need to be reasonable.

Each row should have a DateTimeNano as the datetime.

In [48]:
num_minutes = int((my_dset.TradeDateTime.max() - my_dset.TradeDateTime.min()).minutes[0])
start_datetime = rt.Date(my_dset.TradeDateTime.min())

In [49]:
index_price = rt.Dataset({'DateTime': start_datetime + rt.TimeSpan(range(num_minutes), unit='m'),
                          'IndexPrice': rng.uniform(3500, 4500, num_minutes)})

In [50]:
index_price.sample()

#,DateTime,IndexPrice
0,20220201 13:48:00.000000000,4380.32
1,20220214 05:40:00.000000000,3819.5
2,20220222 18:52:00.000000000,3594.59
3,20220312 10:34:00.000000000,3921.27
4,20220327 01:59:00.000000000,3943.76
5,20220328 02:20:00.000000000,4021.35
6,20220331 22:25:00.000000000,4206.64
7,20220403 14:16:00.000000000,3619.59
8,20220407 10:45:00.000000000,4048.61
9,20220415 12:03:00.000000000,4467.44


**Use** `merge_asof` **to get the most recent Index Price associated with each trade in your main dataset.**

Note both datasets need to be sorted for merge_asof.

The `on` kwarg is the numeric/time field that looks for close matches.

The `by` kwarg is not necessary here, but could constrain the match to a subset if, for example, you had multiple indices and a column of which one each row is associated with.

**Use** `direction='backward'` **to ensure you're not biasing your data by looking into the future!**
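A backward as-of match can be sketched with `np.searchsorted`: for each trade, take the last index row whose time is at or before the trade time. This is a plain NumPy illustration with made-up data. Treating an exact time match as valid is an assumption here, not a statement about riptable's behavior.

```python
import numpy as np

# Both time columns must already be sorted (times in seconds, made up).
index_time = np.array([0, 60, 120, 180])
index_price = np.array([4000.0, 4010.0, 3990.0, 4005.0])
trade_time = np.array([5, 60, 179, 200])

# Position of the last index row with index_time <= trade_time.
pos = np.searchsorted(index_time, trade_time, side='right') - 1

# A trade earlier than every index row has pos == -1; give it NaN
# instead of letting the negative index wrap around to the last row.
matched_price = np.where(pos >= 0, index_price[pos], np.nan)

print(pos)
print(matched_price)
```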

In [51]:
my_dset.IndexPrice = my_dset.merge_asof(index_price, on=('TradeDateTime', 'DateTime'), direction='backward', columns_left=[])['IndexPrice']

## Saving/Loading

The native riptable filetype is .sds. It's the fastest way to save & load your data.

**Save out your dataset to file using** `rt.save_sds`.

In [52]:
rt.save_sds('my_dset.sds', my_dset)

**Delete your dataset to free up memory using the native python** `del my_dset`.

Note that if there are references to the dataset in other objects you may not actually free up memory.
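The effect of a lingering reference can be seen with the standard library. In this sketch `Table` is a hypothetical stand-in for a large dataset object, and the immediate cleanup on the last `del` relies on CPython's reference counting.

```python
import weakref

class Table:
    """Hypothetical stand-in for a large dataset object."""

table = Table()
also_table = table          # a second reference to the same object
probe = weakref.ref(table)  # observes the object without keeping it alive

# del removes only the name; the object survives through also_table.
del table
alive_after_first_del = probe() is not None

# Once the last reference is gone, the object can be reclaimed.
del also_table
alive_after_second_del = probe() is not None

print(alive_after_first_del, alive_after_second_del)
```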

In [53]:
del my_dset

**Reload your saved dataset from disk with** `rt.load_sds`.

In [54]:
my_dset = rt.load_sds('my_dset.sds')

In [55]:
my_dset.sample()

#,Price,Symbol,Size,MyTrade,...,average_trade_price,Symbol_month_cat,Trader,IndexPrice
0,553.08,TSLA,274,True,...,505.6,"(TSLA, 2022-02-01)",Elon,3599.69
1,215.5,TSLA,957,False,...,505.6,"(TSLA, 2022-02-01)",Elon,3623.25
2,649.53,AMZN,764,True,...,510.34,"(AMZN, 2022-02-01)",Dan,3760.78
3,257.19,GME,62,False,...,505.36,"(GME, 2022-02-01)",Nate,3974.95
4,263.29,SPY,922,True,...,507.38,"(SPY, 2022-02-01)",Josh,4341.34
5,632.26,AMZN,567,True,...,510.34,"(AMZN, 2022-02-01)",Dan,4293.08
6,631.12,SPY,988,False,...,507.38,"(SPY, 2022-03-01)",Josh,3568.03
7,773.3,GME,82,False,...,505.36,"(GME, 2022-04-01)",Nate,3677.05
8,182.55,GME,358,True,...,505.36,"(GME, 2022-04-01)",Nate,4128.69
9,408.96,TSLA,808,True,...,505.6,"(TSLA, 2022-04-01)",Elon,4209.58


To load from h5 files (a common file type at SIG), use `rt.load_h5(file)`.

To load from csv files, use the slow but robust pandas loader, with `rt.Dataset.from_pandas(pd.read_csv(file))`.