This notebook demonstrates how to use Spice AI's Python SDK to query Ethereum data and train a machine learning model on it. For this use case, we use the XGBoost model from the Forust-ml package, a pure Rust implementation of the popular XGBoost algorithm with Python bindings. The advantage of this package is that an XGBoost model trained in Python is compatible with the RISC Zero zkVM and can be used in zkML applications. The workflow is straightforward: query data with Spice AI's querying API, train the model, and export the model for the RISC Zero zkVM.

You can learn more about Spice AI's Python SDK here: https://docs.spice.xyz/sdks/python-sdk/streaming

Do NOT use `pip install spicepy` to install the SDK as this will install a different package.

Instead use `pip install git+https://github.com/spiceai/spicepy`

You can learn more about the Forust package and its implementation of XGBoost here:  https://pypi.org/project/forust/

Use `pip install forust` to install the package.

As a demonstration, we will train a very basic gas fee prediction model that predicts the base fee per gas, in gwei, for the next Ethereum block.

Be sure you have the following packages installed: pandas, NumPy, Forust, and spicepy.

In [1]:
import pandas
import numpy
import forust
import spicepy

Using Spice AI's Python SDK, we can run SQL queries and read the results into a pandas DataFrame.

First, define your SQL query.

In [2]:
sql_query = """
WITH counts AS (
    SELECT block_number, count(1) as "count"
    FROM eth.recent_transactions
    GROUP BY block_number
)
SELECT number as "block number",
       CAST(b.base_fee_per_gas / 1000000000.0 AS DOUBLE) as "base gwei",
       CAST(c."count" AS DOUBLE) as "txns"
FROM eth.recent_blocks b
INNER JOIN counts c ON b.number = c.block_number
WHERE b.base_fee_per_gas IS NOT NULL
ORDER BY block_number DESC
LIMIT 500
"""

You can change the SQL query to obtain your data.  
A list of datasets available from SpiceAI can be found here:  https://docs.spice.xyz/getting-started/datasets

Next, use your SpiceAI API key to obtain the data.

In [3]:
API_KEY = 'YOUR_SPICEAI_API_KEY'

client = spicepy.Client(API_KEY)
gas_fee_data_df = client.query(sql_query).read_pandas()
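
Hard-coding the key in a notebook makes it easy to leak when the notebook is shared. One alternative, sketched below, is to read the key from an environment variable. The variable name `SPICE_API_KEY` and the helper `load_api_key` are assumptions made for illustration; they are not part of the SDK.

```python
import os


def load_api_key(env_var: str = "SPICE_API_KEY") -> str:
    """Return the API key stored in the given environment variable.

    The variable name is a hypothetical choice; use whatever name you export.
    """
    api_key = os.environ.get(env_var, "").strip()
    if not api_key:
        raise RuntimeError(
            f"Set the {env_var} environment variable to your Spice AI API key."
        )
    return api_key


# Usage, mirroring the cell above (not executed here):
# client = spicepy.Client(load_api_key())
```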

We can inspect the first few rows of the DataFrame.

In [4]:
gas_fee_data_df.head()

   block number  base gwei   txns
0      18511304  30.117385  117.0
1      18511303  30.993204  149.0
2      18511302  32.192796  131.0
3      18511301  32.189241  128.0
4      18511300  33.820501  195.0


Next, we define our X (features) and y (target) variables.

In [5]:
X = gas_fee_data_df.drop(columns=['base gwei'])
y = gas_fee_data_df['base gwei']
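
Note that `y` above is the base fee of the same block whose features appear in `X`. If the aim is to predict the fee of the next block, one option is to shift the target by one block before separating features and target. The sketch below illustrates this on the five rows shown earlier. The helper `add_next_block_target` and the column name `next base gwei` are assumptions for illustration, and the approach assumes the rows cover consecutive blocks with no gaps.

```python
import pandas as pd


def add_next_block_target(df: pd.DataFrame) -> pd.DataFrame:
    """Attach the following block's base fee to each row as the target."""
    # Sort oldest-first so that shift(-1) looks one block ahead.
    ordered = df.sort_values("block number").reset_index(drop=True)
    ordered["next base gwei"] = ordered["base gwei"].shift(-1)
    # The newest block has no successor yet, so drop it.
    return ordered.dropna(subset=["next base gwei"])


# The five sample rows displayed earlier in this notebook.
sample = pd.DataFrame({
    "block number": [18511304, 18511303, 18511302, 18511301, 18511300],
    "base gwei": [30.117385, 30.993204, 32.192796, 32.189241, 33.820501],
    "txns": [117.0, 149.0, 131.0, 128.0, 195.0],
})

shifted = add_next_block_target(sample)
X_next = shifted.drop(columns=["next base gwei"])
y_next = shifted["next base gwei"]
```

With this layout, each row's features describe block N and the target is the base fee of block N + 1.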

We are now ready to train the model.  Do NOT change the value for max_size!  Otherwise, the model will cause an error inside the zkVM.

Since we are using XGBoost as a regressor, we select "SquaredLoss" as the objective type.

Note that we set the number of trees (iterations) at 10.  The default value is set at 100.  Fewer trees will result in a smaller model which will execute faster in the zkVM but at the expense of model accuracy.  For purposes of this demonstration, we want developers to be able to run this example without needing to use Bonsai, RISC Zero's AWS proving service. In practice, the iterations value should be set higher than 10.

For a full list of parameters that can be modified, please refer to the Forust machine learning API documentation: https://jinlow.github.io/forust/

In [6]:
# max_leaves must be set to the value below for compatibility with the zkVM
max_size = 2**32 - 1
model = forust.GradientBooster(objective_type="SquaredLoss", max_leaves=max_size, iterations=10)
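
For reference, `2**32 - 1` is the largest value that fits in an unsigned 32-bit integer. Presumably the requirement relates to 32-bit integer widths inside the zkVM, but that reading is an assumption; the short check below only verifies the arithmetic fact using the standard library.

```python
import struct

max_size = 2**32 - 1

# The value equals the maximum of an unsigned 32-bit integer.
assert max_size == 0xFFFFFFFF

# It packs into exactly 4 bytes as a little-endian unsigned 32-bit integer ...
packed = struct.pack("<I", max_size)

# ... while the next larger integer no longer fits in 32 bits.
try:
    struct.pack("<I", max_size + 1)
    next_value_fits = True
except struct.error:
    next_value_fits = False
```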

In [7]:
model.fit(X, y)

We can quickly check that the model works by predicting on a sample row from the training data.

In [8]:
model.predict(X.head(1))

array([30.52804254])
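
Predicting on a training row only confirms that the model runs. To get a rough idea of accuracy, one could hold out the most recent blocks and compute an error metric on them. The sketch below is one way to do that: `chronological_split` and `rmse` are hypothetical helpers, and the 20% holdout fraction is an arbitrary choice.

```python
import numpy as np
import pandas as pd


def chronological_split(df: pd.DataFrame, holdout_fraction: float = 0.2):
    """Split rows into older blocks (train) and the newest blocks (test)."""
    ordered = df.sort_values("block number").reset_index(drop=True)
    # Hold out at least one row, taken from the newest end.
    n_test = max(1, round(len(ordered) * holdout_fraction))
    cutoff = len(ordered) - n_test
    return ordered.iloc[:cutoff], ordered.iloc[cutoff:]


def rmse(y_true, y_pred) -> float:
    """Root mean squared error, in the same units as the target (gwei)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Usage with the objects defined in this notebook (not executed here):
# train_df, test_df = chronological_split(gas_fee_data_df)
# model.fit(train_df.drop(columns=["base gwei"]), train_df["base gwei"])
# preds = model.predict(test_df.drop(columns=["base gwei"]))
# print(rmse(test_df["base gwei"], preds))
```

Splitting by block order rather than at random keeps the test blocks strictly newer than the training blocks, which matches how the model would be used.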

After confirming that the model is working as expected, we can export its parameters as a JSON file using the `save_booster` method.

In [11]:
model.save_booster("res/trained_model.json")
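
The call above writes into a `res/` directory. If that directory does not exist yet, the write may fail, so it can help to create it first. A minimal sketch using only the standard library follows; `ensure_parent_dir` is a hypothetical helper name.

```python
from pathlib import Path


def ensure_parent_dir(path: str) -> Path:
    """Create the parent directory of `path` if needed and return the path."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    return target


# Usage with the model trained above (not executed here):
# output_path = ensure_parent_dir("res/trained_model.json")
# model.save_booster(str(output_path))
```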

We are now ready to use the trained model in the RISC Zero zkVM.