# Solutions to Riptable Exercises

This notebook contains the solutions to the [Riptable Exercises](RiptableExercises.ipynb).

Your solutions may be implemented slightly differently, but they should get the same essential results.

If you have any questions or comments, reach out to the Riptable Documentation team.

In [1]:
import riptable as rt
import numpy as np

## Introduction to the Riptable Dataset

**Datasets** are the core class of riptable. 

They are tables of data, consisting of a series of **columns** of the same length (sometimes referred to as **fields**).

Structurally, they behave like Python dictionaries, and can be created directly from one.

We'll familiarize ourselves with Datasets by constructing one manually from fake sample data, generated with `np.random.default_rng().choice(...)` or similar.

In real life they will essentially always be generated from real-world data.

**First, create a python dictionary with two fields of the same length (>1000); one column of stock prices and one of symbols.**

**Make sure the symbols have duplicates, for later aggregation exercises.**

In [2]:
rng = np.random.default_rng()
dset_length = 5_000

In [3]:
my_dict = {'Price': rng.uniform(0, 1000, dset_length), 'Symbol': rng.choice(['GME', 'AMZN', 'TSLA', 'SPY'], dset_length)}

**Create a riptable dataset from this, using** `rt.Dataset(my_dict)`.

In [4]:
my_dset = rt.Dataset(my_dict)

You can easily append more columns to a dataset.

**Add a new column of integer trade size, using** `my_dset.Size = `.

In [5]:
my_dset.Size = rng.integers(1, 1000, dset_length)

Columns can also be referred to with brackets around a string name. This is typically used when the column name comes from a variable.

**Add a new column of booleans indicating whether you traded this trade, using**
`my_dset['MyTrade'] =`.

In [6]:
my_dset['MyTrade'] = rng.choice([True, False], dset_length)

**Add a new column of string "Buy" or "Sell" indicating the customer direction.**

In [7]:
my_dset.CustDirection = rng.choice(['Buy', 'Sell'], dset_length)

Riptable will convert these arrays to the riptable **FastArray** container and cast the data to an appropriate numpy datatype.

**View the datatypes with** `my_dset.dtypes`.

In [8]:
my_dset.dtypes

{'Price': dtype('float64'),
 'Symbol': dtype('S4'),
 'Size': dtype('int64'),
 'MyTrade': dtype('bool'),
 'CustDirection': dtype('S4')}
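
The `S4` entries above are fixed-width byte strings: the width is set by the longest value in the column (`AMZN`, `TSLA`, and `Sell` all have four characters). The following plain NumPy sketch reproduces that sizing rule. It is an analogy using `astype('S')`, not a description of riptable's internal conversion.

```python
import numpy as np

# Illustrative values matching the columns built above.
symbols = np.array(['GME', 'AMZN', 'TSLA', 'SPY'])
directions = np.array(['Buy', 'Sell'])

# Converting to byte strings picks a width equal to the longest entry.
symbols_bytes = symbols.astype('S')
directions_bytes = directions.astype('S')

print(symbols_bytes.dtype, directions_bytes.dtype)
```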

**View some sample rows of the dataset using** `.sample()`.

You should use this instead of `.head()` because the initial rows of a dataset are often unrepresentative.

In [9]:
my_dset.sample()

#,Price,Symbol,Size,MyTrade,CustDirection
0,294.84,SPY,18,False,Buy
1,80.15,TSLA,939,True,Buy
2,189.83,AMZN,919,True,Buy
3,795.83,SPY,324,True,Sell
4,111.57,AMZN,53,True,Buy
5,695.09,AMZN,173,True,Sell
6,109.21,AMZN,810,True,Buy
7,83.23,SPY,674,True,Sell
8,388.8,AMZN,872,False,Sell
9,744.19,SPY,164,False,Buy


**View distributional stats of the numerical fields of your dataset with** `.describe()`.

You can call this on a single column as well.

In [10]:
my_dset.describe()

*Stats,Price,Size,MyTrade
Count,5000.0,5000.0,5000.0
Valid,5000.0,5000.0,5000.0
Nans,0.0,0.0,0.0
Mean,495.85,496.97,0.5
Std,292.47,288.05,0.5
Min,0.24,1.0,0.0
P10,92.07,97.0,0.0
P25,240.33,247.0,0.0
P50,493.73,499.0,0.0
P75,750.31,747.0,1.0


## Manipulating data

You can perform simple operations on riptable columns with normal Python syntax. Riptable applies them to the whole column at once, efficiently.

**Create a new column by performing scalar arithmetic on one of your numeric columns.**

In [11]:
my_dset.SharesOfStock = 100 * my_dset.Size

In [12]:
my_dset.sample()

#,Price,Symbol,Size,MyTrade,CustDirection,SharesOfStock
0,594.55,GME,428,False,Sell,42800
1,575.25,SPY,218,False,Buy,21800
2,88.07,TSLA,655,False,Sell,65500
3,616.62,SPY,941,True,Buy,94100
4,589.26,TSLA,235,False,Buy,23500
5,560.63,SPY,99,True,Sell,9900
6,461.22,SPY,982,True,Buy,98200
7,636.12,AMZN,96,False,Sell,9600
8,632.51,GME,253,True,Sell,25300
9,293.35,AMZN,104,False,Buy,10400


As long as the columns are the same size (as is guaranteed if they're in the same dataset) you can perform combining operations the same way.

**Create a new column of total price paid for the trade by multiplying two existing columns together.**

Riptable will automatically upcast types as necessary to preserve information.

In [13]:
my_dset.TotalCash = my_dset.Price * my_dset.Size
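
As a rough illustration of the upcasting mentioned above, here is the same kind of product in plain NumPy. The two-element arrays are made-up stand-ins for the `Price` and `Size` columns, and the sketch assumes riptable follows NumPy's promotion rule for this float-times-integer case.

```python
import numpy as np

# Hypothetical stand-ins for the Price (float) and Size (integer) columns.
price = np.array([263.41, 6.94], dtype=np.float64)
size = np.array([533, 111], dtype=np.int64)

# Multiplying a float column by an integer column promotes the result
# to float, so the fractional part of the price is not lost.
total_cash = price * size

print(total_cash.dtype)
```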

In [14]:
my_dset.sample()

#,Price,Symbol,Size,MyTrade,CustDirection,SharesOfStock,TotalCash
0,263.41,AMZN,533,False,Buy,53300,140395.27
1,6.94,SPY,111,True,Buy,11100,770.02
2,238.68,SPY,409,False,Sell,40900,97620.54
3,420.92,SPY,463,False,Buy,46300,194887.29
4,435.0,TSLA,676,False,Buy,67600,294057.25
5,947.84,TSLA,334,False,Buy,33400,316579.05
6,439.43,AMZN,364,False,Buy,36400,159952.95
7,44.15,AMZN,222,False,Buy,22200,9800.34
8,972.57,GME,243,False,Sell,24300,236334.91
9,873.75,TSLA,599,False,Sell,59900,523377.81


There are many built-in functions as well, which you call with either `my_dset.field.function()` or `rt.function(my_dset.field)` syntax.

**Find the unique Symbols in your dataset.**

In [15]:
my_dset.Symbol.unique()

FastArray([b'AMZN', b'GME', b'SPY', b'TSLA'], dtype='|S4')

## Date/Time

Riptable has three main date/time types: `Date`, `DateTimeNano`, and `TimeSpan`.

**Give each row of your dataset an** `rt.Date`.

**Make sure they're not all different, but still include days from multiple months.**

Note that due to Riptable idiosyncrasies you need to generate a list of yyyymmdd strings and pass it into the `rt.Date(...)` constructor, rather than constructing Dates individually.

In [16]:
my_dset.Date = rt.Date(rng.choice(rt.Date.range('20220201', '20220430'), dset_length))

In [17]:
my_dset.sample()

#,Price,Symbol,Size,MyTrade,CustDirection,SharesOfStock,TotalCash,Date
0,28.77,SPY,452,False,Sell,45200,13004.17,2022-02-18
1,705.58,SPY,892,False,Sell,89200,629378.88,2022-02-17
2,225.44,SPY,221,True,Buy,22100,49821.42,2022-04-18
3,647.78,TSLA,488,True,Sell,48800,316118.91,2022-03-10
4,926.05,GME,426,False,Sell,42600,394497.53,2022-04-14
5,226.91,SPY,705,True,Sell,70500,159972.93,2022-04-27
6,816.29,SPY,400,True,Sell,40000,326516.95,2022-03-12
7,177.41,GME,257,True,Buy,25700,45594.88,2022-03-10
8,414.71,AMZN,232,False,Sell,23200,96213.22,2022-04-19
9,471.17,AMZN,576,False,Buy,57600,271393.67,2022-04-25


**Give each row a unique(ish)** `TimeSpan` **as a trade time.**

You can instantiate them using `rt.TimeSpan(hours_var, unit='h')`.

In [18]:
my_dset.TradeTime = rt.TimeSpan(rng.uniform(9.5, 16, dset_length), unit='h')

In [19]:
my_dset.sample()

#,Price,Symbol,Size,MyTrade,...,SharesOfStock,TotalCash,Date,TradeTime
0,189.25,SPY,779,False,...,77900,147423.2,2022-02-19,09:47:31.067694548
1,405.3,TSLA,575,True,...,57500,233044.97,2022-02-28,13:47:22.412012377
2,306.29,SPY,12,True,...,1200,3675.43,2022-03-26,12:56:47.572327046
3,740.9,SPY,872,False,...,87200,646063.08,2022-02-20,11:16:31.322328844
4,148.72,AMZN,469,True,...,46900,69751.19,2022-02-22,13:45:17.960767118
5,687.63,AMZN,900,False,...,90000,618863.96,2022-02-16,13:03:38.246123166
6,395.53,AMZN,171,False,...,17100,67635.54,2022-04-23,15:47:05.167722885
7,620.26,TSLA,777,False,...,77700,481941.71,2022-04-14,11:40:32.776661126
8,266.59,GME,765,True,...,76500,203941.39,2022-04-12,14:37:57.503811705
9,286.83,AMZN,613,False,...,61300,175824.15,2022-03-03,14:45:09.378948295


**Create a DateTimeNano of the combined TradeTime + Date by simple addition. Riptable knows how to sum the types.**

Be careful here: by default you'll get a GMT timezone. You can force NYC with `rt.DateTimeNano(..., from_tz='NYC')`.

In [20]:
my_dset.TradeDateTime = rt.DateTimeNano(my_dset.Date + my_dset.TradeTime, from_tz='NYC')

In [21]:
my_dset.sample()

#,Price,Symbol,Size,MyTrade,...,TotalCash,Date,TradeTime,TradeDateTime
0,455.87,TSLA,397,True,...,180979.3,2022-03-19,13:27:13.903568326,20220319 13:27:13.903568326
1,73.24,TSLA,887,False,...,64965.26,2022-03-06,11:30:56.002432889,20220306 11:30:56.002432889
2,263.36,SPY,807,True,...,212535.23,2022-02-19,13:34:55.264039702,20220219 13:34:55.264039702
3,88.49,GME,382,True,...,33802.09,2022-03-01,14:31:17.048883242,20220301 14:31:17.048883242
4,889.71,AMZN,554,False,...,492899.07,2022-03-08,11:12:06.558229570,20220308 11:12:06.558229570
5,466.92,TSLA,688,True,...,321237.96,2022-04-26,09:59:53.558187626,20220426 09:59:53.558187626
6,165.87,AMZN,272,False,...,45116.56,2022-03-10,13:41:53.830767087,20220310 13:41:53.830767087
7,740.9,SPY,872,False,...,646063.08,2022-02-20,11:16:31.322328844,20220220 11:16:31.322328844
8,146.13,TSLA,896,True,...,130931.03,2022-02-24,12:18:33.660870308,20220224 12:18:33.660870308
9,240.26,AMZN,587,True,...,141032.79,2022-03-02,10:43:44.856459110,20220302 10:43:44.856459110


To reverse this operation and get out separate dates and times from a DateTimeNano, you can call `rt.Date(my_DateTimeNano)` and `my_DateTimeNano.time_since_midnight()`.
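
The same date-plus-offset arithmetic, and its reversal, can be sketched with NumPy's `datetime64` and `timedelta64` types. This is only an analogy for how `Date`, `TimeSpan`, and `DateTimeNano` relate. It uses no riptable calls and ignores time zones.

```python
import numpy as np

# One made-up trade: a calendar date and a time of day in seconds.
date = np.array(['2022-03-19'], dtype='datetime64[D]')
since_midnight = np.array([13 * 3600 + 27 * 60 + 13], dtype='timedelta64[s]')

# Date + time-of-day gives a full timestamp.
stamp = date + since_midnight

# Reverse: truncate to the day, then subtract to recover the time of day.
date_again = stamp.astype('datetime64[D]')
time_again = stamp - date_again

print(stamp[0], date_again[0], time_again[0])
```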

**Create a new month name column by using the** `.strftime` **function.**

In [22]:
my_dset.month_name = my_dset.Date.strftime('%b%y')

In [23]:
my_dset.sample()

#,Price,Symbol,Size,MyTrade,...,TradeTime,TradeDateTime,month_name
0,731.57,TSLA,687,True,...,13:18:00.786792038,20220426 13:18:00.786792038,Apr22
1,491.23,TSLA,472,True,...,12:13:02.001061236,20220313 12:13:02.001061236,Mar22
2,321.02,SPY,739,True,...,10:05:47.468199835,20220428 10:05:47.468199835,Apr22
3,728.75,TSLA,191,True,...,13:33:27.072606771,20220204 13:33:27.072606771,Feb22
4,414.23,AMZN,955,False,...,12:21:10.042617283,20220209 12:21:10.042617283,Feb22
5,399.06,TSLA,608,True,...,10:22:12.201005950,20220430 10:22:12.201005950,Apr22
6,918.26,AMZN,93,False,...,12:32:03.015948824,20220227 12:32:03.015948824,Feb22
7,787.75,GME,408,True,...,15:00:34.274163745,20220420 15:00:34.274163745,Apr22
8,478.57,TSLA,595,False,...,13:17:26.624255585,20220211 13:17:26.624255585,Feb22
9,488.22,SPY,807,False,...,13:45:28.888911507,20220228 13:45:28.888911507,Feb22


**Create another new month column by using the** `.start_of_month` **attribute.**

This is nice for grouping because it will automatically sort correctly.

In [24]:
my_dset.month = my_dset.Date.start_of_month

In [25]:
my_dset.sample()

#,Price,Symbol,Size,MyTrade,...,TradeDateTime,month_name,month
0,272.34,SPY,593,False,...,20220421 15:56:06.447740285,Apr22,2022-04-01
1,758.55,SPY,210,True,...,20220425 10:01:29.981701403,Apr22,2022-04-01
2,654.72,SPY,613,True,...,20220330 09:38:12.459630380,Mar22,2022-03-01
3,417.83,TSLA,737,True,...,20220426 15:29:01.508398022,Apr22,2022-04-01
4,280.31,AMZN,958,True,...,20220327 11:43:36.813136697,Mar22,2022-03-01
5,139.19,AMZN,52,False,...,20220418 13:50:10.245021957,Apr22,2022-04-01
6,915.93,TSLA,18,True,...,20220420 13:55:37.175809652,Apr22,2022-04-01
7,877.0,TSLA,928,False,...,20220223 11:51:05.923929754,Feb22,2022-02-01
8,61.4,AMZN,472,True,...,20220311 09:40:49.323285896,Mar22,2022-03-01
9,807.3,SPY,863,True,...,20220325 11:48:40.251197776,Mar22,2022-03-01


## Sorting

Riptable has two sorts, `sort_copy` (which preserves the original dataset) and `sort_inplace`, which is faster and more memory-efficient if you don't need the original data order.

**Sort your dataset by TradeDateTime.**

This is the natural ordering of a list of trades, so do it in-place.

In [26]:
my_dset = my_dset.sort_inplace('TradeDateTime')

In [27]:
my_dset.sample()

#,Price,Symbol,Size,MyTrade,...,TradeDateTime,month_name,month
0,379.21,AMZN,385,False,...,20220203 10:58:19.673425054,Feb22,2022-02-01
1,617.5,SPY,413,True,...,20220214 14:02:42.426523660,Feb22,2022-02-01
2,555.63,SPY,617,False,...,20220308 13:25:17.282540250,Mar22,2022-03-01
3,718.59,AMZN,34,True,...,20220324 11:57:46.925665938,Mar22,2022-03-01
4,838.81,GME,958,False,...,20220330 10:09:37.013952547,Mar22,2022-03-01
5,498.17,AMZN,135,False,...,20220403 12:36:29.924505469,Apr22,2022-04-01
6,123.59,AMZN,927,True,...,20220414 12:51:26.312197854,Apr22,2022-04-01
7,339.86,AMZN,805,True,...,20220417 13:05:59.168005100,Apr22,2022-04-01
8,158.91,SPY,211,False,...,20220428 09:30:22.411907771,Apr22,2022-04-01
9,936.81,TSLA,81,False,...,20220430 15:06:53.166066606,Apr22,2022-04-01
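
Conceptually, sorting a dataset by one column means computing a single row order from that column and applying it to every column. The NumPy sketch below shows that idea with made-up arrays standing in for dataset columns. Riptable's own implementation may differ.

```python
import numpy as np

# Hypothetical columns of a three-row table.
trade_time = np.array([30, 10, 20])
price = np.array([3.0, 1.0, 2.0])

# One ordering, derived from the sort key, is shared by all columns.
order = np.argsort(trade_time, kind='stable')

# Copy-style sort: new arrays are produced, the originals are untouched.
time_sorted = trade_time[order]
price_sorted = price[order]

# In-place-style sort: the existing arrays are overwritten.
trade_time[:] = time_sorted
price[:] = price_sorted
```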


## Filtering

Filtering is the principal way to work with a subset of your data in riptable. It is commonly used for looking at a restricted set of trades matching some criterion you care about.

Except in rare instances, though, you should maintain your dataset in its full size, and only apply a filter when performing a final computation.

This will avoid unnecessary data duplication and improve speed & memory usage.

**Construct a filter of only your sales. (A filter is a column of Booleans which is true only for the rows you're interested in.)**

You can combine filters using & or |. Be careful to always wrap expressions in parentheses to avoid an extremely slow call into native python followed by a crash.

Always `(my_dset.field1 > 10) & (my_dset.field2 < 5)`, never `my_dset.field1 > 10 & my_dset.field2 < 5`.

In [28]:
f_my_sales = my_dset.MyTrade & (my_dset.CustDirection == 'Buy')
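
The reason for the parentheses is operator precedence: in Python, `&` binds more tightly than `>` or `<`, so the unparenthesized form is read as a chained comparison around `10 & field2`. The sketch below shows this with plain NumPy arrays. `field1` and `field2` are made-up names, and the exact failure mode on riptable columns may differ from the `ValueError` NumPy raises here.

```python
import numpy as np

field1 = np.array([5, 20, 30])
field2 = np.array([1, 9, 2])

# Parenthesized: two boolean masks combined element by element.
good = (field1 > 10) & (field2 < 5)

# Unparenthesized: parsed as field1 > (10 & field2) < 5, a chained
# comparison that needs a single True/False from a whole array.
failed = False
try:
    bad = field1 > 10 & field2 < 5
except ValueError:
    failed = True

print(good, failed)
```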

**Compute the total Trade Size, filtered for only your sales.**

For this and many other instances, you can & should pass your filter into the `filter` kwarg of the `.nansum(...)` call.

This allows riptable to perform the filtering during the nansum computation, rather than instantiating a new column and then summing it.

In [29]:
my_dset.Size.nansum(filter=f_my_sales)

621241

**Count how many times you sold each symbol.**

Here the `.count()` function doesn't accept a `filter` kwarg, so you must fall back to explicitly filtering the `Symbol` field before counting.

Be careful that you only filter down the `Symbol` field, not the entire dataset, otherwise you are wasting a lot of compute.

In [30]:
my_dset.Symbol[f_my_sales].count()

*Unique,Count
AMZN,301
GME,306
SPY,282
TSLA,340


## Categoricals

So far, we've been operating on your symbol column as a column of strings.

However, it's far more efficient when you have a large column with many repeats to use a categorical, which assigns each unique value a number, and stores the labels & numbers separately.

This is memory-efficient, and also computationally efficient, as riptable can perform operations on the unique values, then expand out to the full vector appropriately.

**Make a new column of your string column converted to a categorical, using** `rt.Cat(column)`.

In [31]:
my_dset.Symbol_cat = rt.Cat(my_dset.Symbol)
my_dset.Symbol_cat

Categorical([AMZN, SPY, SPY, SPY, SPY, ..., TSLA, GME, SPY, AMZN, SPY]) Length: 5000
  FastArray([1, 3, 3, 3, 3, ..., 4, 2, 3, 1, 3], dtype=int8) Base Index: 1
  FastArray([b'AMZN', b'GME', b'SPY', b'TSLA'], dtype='|S4') Unique count: 4
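
The two arrays in the output above (integer codes and unique labels) can be mimicked in plain NumPy with `np.unique`. This is a sketch of the storage idea only, using a made-up five-row column. It is not how `rt.Cat` is implemented.

```python
import numpy as np

# A short, made-up symbol column.
symbols = np.array([b'AMZN', b'SPY', b'SPY', b'TSLA', b'GME'])

# Sorted unique labels, plus one integer per row pointing at its label.
uniques, codes = np.unique(symbols, return_inverse=True)
codes = codes + 1  # shift to base index 1, as in the output above

# The original column can be rebuilt from the two small pieces.
rebuilt = uniques[codes - 1]

print(uniques, codes)
```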

**Perform the same filtered count from above, on the categorical.**

The categorical `.count()` admits a `filter` kwarg, which makes it simpler.

In [32]:
my_dset.Symbol_cat.count(filter=f_my_sales)

*Symbol_cat,Count
AMZN,301
GME,306
SPY,282
TSLA,340


Categoricals can be used as groupings. When you call a numeric function on a categorical and pass numeric columns in, riptable knows to do the calculation per-group.

**Compute the total number of contracts sold by customers in each symbol.**

In [33]:
my_dset.Symbol_cat.sum(my_dset.Size, filter=my_dset.CustDirection == 'Sell')

*Symbol_cat,Size
AMZN,303513
GME,290964
SPY,337699
TSLA,304961


The `transform=True` kwarg in a categorical operation performs the aggregation, then *transforms* it back up to the original shape of the categorical, giving each row the appropriate value from its group.

**Make a new column which is the average trade price, per symbol.**

In [34]:
my_dset.average_trade_price = my_dset.Symbol_cat.mean(my_dset.Price, transform=True)

**Inspect with** `.sample()` **to confirm that this value is consistent for rows with matching symbol.**

In [35]:
my_dset.sample()

#,Price,Symbol,Size,MyTrade,...,month_name,month,Symbol_cat,average_trade_price
0,612.27,AMZN,356,False,...,Feb22,2022-02-01,AMZN,497.66
1,5.42,AMZN,610,False,...,Feb22,2022-02-01,AMZN,497.66
2,877.96,AMZN,58,False,...,Feb22,2022-02-01,AMZN,497.66
3,340.75,AMZN,802,True,...,Mar22,2022-03-01,AMZN,497.66
4,564.53,GME,486,True,...,Apr22,2022-04-01,GME,495.29
5,46.86,TSLA,414,True,...,Apr22,2022-04-01,TSLA,499.91
6,850.28,SPY,723,True,...,Apr22,2022-04-01,SPY,490.44
7,895.93,SPY,967,False,...,Apr22,2022-04-01,SPY,490.44
8,267.21,AMZN,98,True,...,Apr22,2022-04-01,AMZN,497.66
9,279.97,GME,887,True,...,Apr22,2022-04-01,GME,495.29
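
To make the aggregate-then-expand behavior of `transform=True` concrete, here is a small NumPy sketch. The `codes` and `price` arrays are made up, and `np.bincount` is used as a stand-in for the grouped mean. Riptable's internals are not shown.

```python
import numpy as np

# Group index for each row (three groups) and a value column.
codes = np.array([0, 1, 0, 1, 2])
price = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

# Aggregate: one mean per group.
group_mean = np.bincount(codes, weights=price) / np.bincount(codes)

# Transform: index the per-group result by each row's group, so every
# row receives the mean of the group it belongs to.
per_row_mean = group_mean[codes]

print(group_mean, per_row_mean)
```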


If you need to perform a custom operation on each categorical, you can pass in a function with `.apply_reduce` (which aggregates) or `.apply_nonreduce` (which is like `transform=True`).

Note that the custom function you pass needs to expect a FastArray, and output a scalar (`apply_reduce`) or same-length FastArray (`apply_nonreduce`).

**Find, for each symbol, the trade size of the second trade occurring in the dataset.**

In [36]:
my_dset.Symbol_cat.apply_reduce(lambda x: x[1], my_dset.Size)

*Symbol_cat,Size
AMZN,700
GME,42
SPY,492
TSLA,536


Sometimes you want to aggregate based on multiple values. In these cases we use multi-key categoricals.

**Use a multi-key categorical to compute the average size per symbol-month pair.**

In [37]:
my_dset.Symbol_month_cat = rt.Cat([my_dset.Symbol, my_dset.month])

In [38]:
my_dset.Symbol_month_cat.nanmean(my_dset.Size).sort_inplace('Symbol')

*Symbol,*month,Size
AMZN,2022-02-01,473.99
.,2022-03-01,495.32
.,2022-04-01,493.56
GME,2022-02-01,508.95
.,2022-03-01,506.99
.,2022-04-01,479.18
SPY,2022-02-01,509.76
.,2022-03-01,529.28
.,2022-04-01,479.58
TSLA,2022-02-01,501.33


## Accumulating

Aggregating over two values for human viewing is often most conveniently done with an accum. 

**Use** `Accum2` **to compute the average size per symbol-month pair.**

In [39]:
rt.Accum2(my_dset.Symbol, my_dset.month).nanmean(my_dset.Size)

*Symbol,2022-02-01,2022-03-01,2022-04-01,Nanmean
AMZN,473.99,495.32,493.56,487.57
GME,508.95,506.99,479.18,498.34
SPY,509.76,529.28,479.58,506.63
TSLA,501.33,469.43,517.81,495.67
Nanmean,497.85,499.98,492.81,496.97


Average numbers can be meaningless. It is often better to consider relative percentage instead.

**Use** `accum_ratiop` **to compute the fraction of total volume done by each symbol-month pair.**

In [40]:
rt.accum_ratiop(my_dset.Symbol, my_dset.month, my_dset.Size, norm_by='R')

*Symbol,2022-02-01,2022-03-01,2022-04-01,TotalRatio,Total
AMZN,32.93,36.7,30.37,100.0,628959.0
GME,31.96,36.02,32.02,100.0,594021.0
SPY,32.93,36.1,30.98,100.0,636328.0
TSLA,31.9,33.17,34.93,100.0,625533.0
TotalRatio,32.44,35.49,32.07,100.0,
Total,806012.0,881963.0,796866.0,,2484841.0
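
In the output above each row sums to 100, which is what normalizing by row produces: every cell is that symbol's volume in the month as a percentage of the symbol's total. A NumPy sketch of that arithmetic, on a made-up 2x3 volume table:

```python
import numpy as np

# Rows are symbols, columns are months (illustrative numbers only).
volume = np.array([[10.0, 30.0, 60.0],
                   [50.0, 25.0, 25.0]])

# Divide each cell by its row total and express it as a percentage.
row_pct = 100 * volume / volume.sum(axis=1, keepdims=True)

print(row_pct)
```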


## Merging

There are two main types of merges.

First is `merge_lookup`. This is used for enriching one (typically large) dataset with information from another (typically small) dataset.

**Create a new dataset with one row per symbol from your dataset, and a second column of who trades each symbol.**

In [41]:
symbol_trader = rt.Dataset({'UnderlyingSymbol': ['GME', 'TSLA', 'SPY', 'AMZN'],
                           'Trader': ['Nate', 'Elon', 'Josh', 'Dan']})

In [42]:
symbol_trader

#,UnderlyingSymbol,Trader
0,GME,Nate
1,TSLA,Elon
2,SPY,Josh
3,AMZN,Dan


**Enrich the main dataset by putting the correct trader into each row.**

In [43]:
my_dset.Trader = my_dset.merge_lookup(symbol_trader, on=('Symbol', 'UnderlyingSymbol'), columns_left=[])['Trader']

In [44]:
my_dset.sample()

#,Price,Symbol,Size,MyTrade,...,Symbol_cat,average_trade_price,Symbol_month_cat,Trader
0,702.44,SPY,739,True,...,SPY,490.44,"(SPY, 2022-02-01)",Josh
1,926.83,TSLA,591,True,...,TSLA,499.91,"(TSLA, 2022-02-01)",Elon
2,450.66,SPY,459,False,...,SPY,490.44,"(SPY, 2022-02-01)",Josh
3,664.47,SPY,846,True,...,SPY,490.44,"(SPY, 2022-03-01)",Josh
4,464.46,AMZN,379,True,...,AMZN,497.66,"(AMZN, 2022-03-01)",Dan
5,508.8,GME,907,False,...,GME,495.29,"(GME, 2022-03-01)",Nate
6,145.61,TSLA,289,True,...,TSLA,499.91,"(TSLA, 2022-04-01)",Elon
7,729.66,GME,148,False,...,GME,495.29,"(GME, 2022-04-01)",Nate
8,20.5,TSLA,769,True,...,TSLA,499.91,"(TSLA, 2022-04-01)",Elon
9,768.59,GME,957,True,...,GME,495.29,"(GME, 2022-04-01)",Nate
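
In spirit, `merge_lookup` behaves like a dictionary lookup applied to every row of the large dataset: each `Symbol` is matched to the one row of the small table that has the same key. A plain-Python sketch of that idea, using the table above and a made-up list of symbols (it does not show how riptable performs the join):

```python
# The small table, keyed by symbol.
trader_by_symbol = {'GME': 'Nate', 'TSLA': 'Elon', 'SPY': 'Josh', 'AMZN': 'Dan'}

# A few rows of the large table's key column.
symbols = ['SPY', 'TSLA', 'SPY', 'AMZN', 'GME']

# One lookup per row; repeated symbols reuse the same small-table row.
traders = [trader_by_symbol[s] for s in symbols]

print(traders)
```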


The second type of merge is `merge_asof`, which is used for fuzzy alignment between two datasets, typically by time (though sometimes by other variables).

**Create a new index price dataset with one price per minute, which covers all the Dates in your dataset.**

The index price doesn't need to be reasonable.

Each row should have a DateTimeNano as the datetime.

In [45]:
num_minutes = int((my_dset.TradeDateTime.max() - my_dset.TradeDateTime.min()).minutes[0])
start_datetime = rt.Date(my_dset.TradeDateTime.min())

In [46]:
index_price = rt.Dataset({'DateTime': start_datetime + rt.TimeSpan(range(num_minutes), unit='m'),
                          'IndexPrice': rng.uniform(3500, 4500, num_minutes)})

In [47]:
index_price.sample()

#,DateTime,IndexPrice
0,20220217 07:25:00.000000000,3742.56
1,20220218 12:24:00.000000000,4439.16
2,20220225 16:41:00.000000000,3833.25
3,20220303 13:44:00.000000000,4341.4
4,20220326 08:00:00.000000000,4356.62
5,20220402 02:58:00.000000000,3796.68
6,20220403 15:55:00.000000000,3645.95
7,20220416 10:01:00.000000000,4469.1
8,20220423 03:30:00.000000000,4284.35
9,20220427 08:09:00.000000000,4347.81


**Use** `merge_asof` **to get the most recent Index Price associated with each trade in your main dataset.**

Note both datasets need to be sorted for `merge_asof`.

The `on` kwarg is the numeric/time field that looks for close matches.

The `by` kwarg is not necessary here, but could constrain the match to a subset if, for example, you had multiple indices and a column of which one each row is associated with.

**Use** `direction='backward'` **to ensure you're not biasing your data by looking into the future!**

In [48]:
my_dset.IndexPrice = my_dset.merge_asof(index_price, on=('TradeDateTime', 'DateTime'), direction='backward', columns_left=[])['IndexPrice']
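
The backward as-of match can be sketched in NumPy with `np.searchsorted`: for each trade, take the last quote whose time is at or before the trade time. The arrays below are made up, times are plain minute counts, and the sketch assumes every trade occurs at or after the first quote. Treating an exact time match as valid is also an assumption of this sketch.

```python
import numpy as np

# Sorted quote times (minutes) and the index price at each one.
quote_minutes = np.array([0, 1, 2, 3, 4])
quote_price = np.array([100.0, 101.0, 102.0, 103.0, 104.0])

# Sorted trade times that fall between or on the quote times.
trade_minutes = np.array([0.5, 2.0, 3.9])

# Position of the last quote at or before each trade time.
idx = np.searchsorted(quote_minutes, trade_minutes, side='right') - 1
matched_price = quote_price[idx]

print(matched_price)
```

Because only earlier or simultaneous quotes are eligible, no trade is matched to a price that was published after it.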

## Saving/Loading

The native riptable filetype is .sds. It's the fastest way to save & load your data.

**Save out your dataset to file using** `rt.save_sds`.

In [49]:
rt.save_sds('my_dset.sds', my_dset)

**Delete your dataset to free up memory using the native python** `del my_dset`.

Note that if there are references to the dataset in other objects you may not actually free up memory.

In [50]:
del my_dset
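
The caveat about other references can be seen with a small pure-Python sketch. `Table` is a hypothetical stand-in for a large object, and the immediate release shown at the end relies on CPython's reference counting.

```python
import weakref

class Table:
    """Hypothetical stand-in for a large dataset object."""

table = Table()
holder = [table]            # a second reference held by another object
probe = weakref.ref(table)  # lets us check whether the object still exists

# `del` removes the name, but the list still keeps the object alive.
del table
still_alive = probe() is not None

# Once the last reference is dropped, the object can be released.
holder.clear()
freed = probe() is None

print(still_alive, freed)
```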

**Reload your saved dataset from disk with** `rt.load_sds`.

In [51]:
my_dset = rt.load_sds('my_dset.sds')

In [52]:
my_dset.sample()

#,Price,Symbol,Size,MyTrade,...,average_trade_price,Symbol_month_cat,Trader,IndexPrice
0,569.1,TSLA,719,True,...,499.91,"(TSLA, 2022-02-01)",Elon,3945.79
1,915.31,GME,4,False,...,495.29,"(GME, 2022-02-01)",Nate,3756.71
2,173.42,AMZN,166,False,...,497.66,"(AMZN, 2022-03-01)",Dan,3972.01
3,410.27,GME,722,True,...,495.29,"(GME, 2022-03-01)",Nate,4458.17
4,606.42,SPY,995,True,...,490.44,"(SPY, 2022-03-01)",Josh,3954.01
5,910.27,AMZN,219,True,...,497.66,"(AMZN, 2022-03-01)",Dan,4108.6
6,609.56,TSLA,459,False,...,499.91,"(TSLA, 2022-04-01)",Elon,4384.97
7,466.54,TSLA,400,False,...,499.91,"(TSLA, 2022-04-01)",Elon,4221.02
8,225.37,AMZN,150,False,...,497.66,"(AMZN, 2022-04-01)",Dan,3688.96
9,912.44,AMZN,615,True,...,497.66,"(AMZN, 2022-04-01)",Dan,3527.44


To load from h5 files (a common file type at SIG), use `rt.load_h5(file)`.

To load from csv files, use the slow but robust pandas loader, with `rt.Dataset.from_pandas(pd.read_csv(file))` (this requires `import pandas as pd`).