Copyright 2024 RISC Zero, Inc.

 Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

 Unless required by applicable law or agreed to in writing, software
 distributed under the License is distributed on an "AS IS" BASIS,
 WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License.

This notebook is a guide to training classifiers and regression models with the Smartcore crate.  Before training the classifier in Rust, preprocess the data in Python and export the input data and the class labels as separate CSV files.

Start by importing the Smartcore and Polars crates as dependencies.  Outside of a Jupyter notebook environment, you can add these to your Cargo.toml file or run `cargo add <CRATE-NAME>` on the command line.

Be sure to enable the `serde` feature for the smartcore crate; otherwise, the Smartcore CSV readers will not work.

In [35]:
:dep smartcore = {version = "0.3.2", features = ["serde"]}
:dep polars = "*"
:dep serde_json = "1.0"
:dep rmp-serde = "1.1.2"

In [36]:
use smartcore::linalg::basic::matrix::DenseMatrix;
use smartcore::tree::decision_tree_classifier::*;
use smartcore::readers;

use std::fs::File;
use std::io::{Read, Write};
use polars::prelude::*;
use serde_json;
use rmp_serde;

We use Smartcore's CSV reader to import the input data for our classifier.  The reader loads the data directly into a Smartcore DenseMatrix, the format required both for training the classifier and for performing inference.

In [37]:
let input = readers::csv::matrix_from_csv_source::<f64, Vec<_>, DenseMatrix<_>>(
    File::open("iris_input_data.csv").unwrap(),
    readers::csv::CSVDefinition::default()
).unwrap();

In [38]:
input

DenseMatrix { ncols: 4, nrows: 150, values: [5.1, 3.5, 1.4, 0.2, 4.9, 3.0, 1.4, 0.2, 4.7, 3.2, 1.3, 0.2, 4.6, 3.1, 1.5, 0.2, 5.0, 3.6, 1.4, 0.2, 5.4, 3.9, 1.7, 0.4, 4.6, 3.4, 1.4, 0.3, 5.0, 3.4, 1.5, 0.2, 4.4, 2.9, 1.4, 0.2, 4.9, 3.1, 1.5, 0.1, 5.4, 3.7, 1.5, 0.2, 4.8, 3.4, 1.6, 0.2, 4.8, 3.0, 1.4, 0.1, 4.3, 3.0, 1.1, 0.1, 5.8, 4.0, 1.2, 0.2, 5.7, 4.4, 1.5, 0.4, 5.4, 3.9, 1.3, 0.4, 5.1, 3.5, 1.4, 0.3, 5.7, 3.8, 1.7, 0.3, 5.1, 3.8, 1.5, 0.3, 5.4, 3.4, 1.7, 0.2, 5.1, 3.7, 1.5, 0.4, 4.6, 3.6, 1.0, 0.2, 5.1, 3.3, 1.7, 0.5, 4.8, 3.4, 1.9, 0.2, 5.0, 3.0, 1.6, 0.2, 5.0, 3.4, 1.6, 0.4, 5.2, 3.5, 1.5, 0.2, 5.2, 3.4, 1.4, 0.2, 4.7, 3.2, 1.6, 0.2, 4.8, 3.1, 1.6, 0.2, 5.4, 3.4, 1.5, 0.4, 5.2, 4.1, 1.5, 0.1, 5.5, 4.2, 1.4, 0.2, 4.9, 3.1, 1.5, 0.2, 5.0, 3.2, 1.2, 0.2, 5.5, 3.5, 1.3, 0.2, 4.9, 3.6, 1.4, 0.1, 4.4, 3.0, 1.3, 0.2, 5.1, 3.4, 1.5, 0.2, 5.0, 3.5, 1.3, 0.3, 4.5, 2.3, 1.3, 0.3, 4.4, 3.2, 1.3, 0.2, 5.0, 3.5, 1.6, 0.6, 5.1, 3.8, 1.9, 0.4, 4.8, 3.0, 1.4, 0.3, 5.1, 3.8, 1.6, 0.2, 4.6, 3.2, 1.4, 

We import the classes from a separate CSV file using Polars.  We extract the `variety` column from the DataFrame as a Series and convert it to a `Vec<i64>`.  We then cast from `Vec<i64>` to `Vec<u32>`, which is the label format this notebook passes to the Smartcore classifier.

In [39]:
let filepath_iris_classes = "iris_classes.csv";

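// Read the CSV into a DataFrame, pull out the "variety" column as i64 values,
// then cast each label to u32 for the classifier.
let y_u32s: Vec<u32> = CsvReader::from_path(filepath_iris_classes).unwrap().finish().unwrap()
                .column("variety").unwrap().clone()
                .i64().unwrap().into_no_null_iter().collect::<Vec<i64>>()
                .into_iter().map(|x| x as u32).collect::<Vec<u32>>();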

In [40]:
y_u32s

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]
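
The `as u32` cast used above silently wraps negative values (for example, `-1i64 as u32` becomes `u32::MAX`), so a stray negative label in the CSV would turn into a very large class index.  The sketch below shows one way to make the conversion checked instead.  `labels_to_u32` is a hypothetical helper written for this guide, not part of Smartcore or Polars, and it uses only the standard library.

```rust
use std::convert::TryFrom;

/// Hypothetical helper (not a Smartcore or Polars API): convert i64 labels
/// to u32, returning an error for the first value that does not fit
/// instead of wrapping it the way `as u32` would.
fn labels_to_u32(labels: &[i64]) -> Result<Vec<u32>, String> {
    labels
        .iter()
        .enumerate()
        .map(|(row, &label)| {
            u32::try_from(label)
                .map_err(|_| format!("label {} at row {} does not fit in u32", label, row))
        })
        .collect()
}
```

In the notebook, this could replace the final `map(|x| x as u32)` step by collecting the `Vec<i64>` first and then calling `labels_to_u32(&labels).unwrap()`.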

Now, we can train the model using our desired classifier.  

In [41]:
let model = DecisionTreeClassifier::fit(&input, &y_u32s, Default::default()).unwrap();

We call predict() on the model in order to perform inference.

In [42]:
model.predict(&input).unwrap()

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]
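
Comparing two 150-element vectors by eye is error-prone, so it can help to reduce the predictions to a single accuracy figure.  The sketch below is a standard-library-only helper; `accuracy` is a name chosen for this guide rather than a Smartcore function.

```rust
/// Fraction of predictions that equal the corresponding true label.
/// Returns None when the slices are empty or differ in length.
fn accuracy(y_true: &[u32], y_pred: &[u32]) -> Option<f64> {
    if y_true.is_empty() || y_true.len() != y_pred.len() {
        return None;
    }
    // Count the positions where the true label and the prediction agree.
    let correct = y_true.iter().zip(y_pred).filter(|(t, p)| t == p).count();
    Some(correct as f64 / y_true.len() as f64)
}
```

In this notebook it could be called as `accuracy(&y_u32s, &model.predict(&input).unwrap())`.  Because the model is evaluated on the same rows it was trained on, the result is training accuracy only and says little about how the model generalizes.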

For Decision Trees, Smartcore gives you the option to specify the following parameters:
- split criterion (Gini or Entropy)
- maximum tree depth
- minimum samples per leaf (the minimum number of samples required to be at a leaf node)
- minimum samples to split (the minimum number of samples required to split an internal node)
- seed (controls the randomness of the estimator)
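
To make the two split criteria concrete, the sketch below computes Gini impurity and entropy from per-class sample counts at a node.  These are the textbook definitions written with the standard library only; the base-2 logarithm is an assumption about convention, and the functions are illustrative rather than Smartcore's internal implementation.

```rust
/// Gini impurity for per-class sample counts: 1 - sum(p_i^2).
fn gini(counts: &[usize]) -> f64 {
    let total: usize = counts.iter().sum();
    if total == 0 {
        return 0.0;
    }
    let sum_sq: f64 = counts
        .iter()
        .map(|&c| {
            let p = c as f64 / total as f64;
            p * p
        })
        .sum();
    1.0 - sum_sq
}

/// Shannon entropy in bits: -sum(p_i * log2(p_i)), skipping empty classes.
fn entropy(counts: &[usize]) -> f64 {
    let total: usize = counts.iter().sum();
    if total == 0 {
        return 0.0;
    }
    counts
        .iter()
        .filter(|&&c| c > 0)
        .map(|&c| {
            let p = c as f64 / total as f64;
            -p * p.log2()
        })
        .sum()
}
```

Both measures are zero for a node that contains a single class and are largest when the classes are evenly mixed, as at the root of a tree over three equally sized classes.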

In [43]:
let model_with_custom_params = DecisionTreeClassifier::fit(
    &input,
    &y_u32s,
    DecisionTreeClassifierParameters {
        criterion: SplitCriterion::Entropy,
        max_depth: Some(3),
        min_samples_leaf: 1,
        min_samples_split: 2,
        seed: None,
    },
)
.unwrap();

In [44]:
model_with_custom_params.predict(&input).unwrap()

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]
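
With the depth limited to 3, the predictions above differ from the training labels in a handful of positions.  A confusion matrix shows which classes are being mixed up.  The sketch below is a standard-library-only helper; `confusion_matrix` and its `n_classes` parameter are names chosen for this guide, not Smartcore APIs.

```rust
/// Build an n_classes x n_classes table in which entry [t][p] counts the
/// samples whose true class is t and whose predicted class is p.
fn confusion_matrix(y_true: &[u32], y_pred: &[u32], n_classes: usize) -> Vec<Vec<usize>> {
    let mut table = vec![vec![0usize; n_classes]; n_classes];
    for (&t, &p) in y_true.iter().zip(y_pred) {
        let (t, p) = (t as usize, p as usize);
        // Labels outside 0..n_classes are skipped rather than causing a panic.
        if t < n_classes && p < n_classes {
            table[t][p] += 1;
        }
    }
    table
}
```

In this notebook it could be called as `confusion_matrix(&y_u32s, &model_with_custom_params.predict(&input).unwrap(), 3)`.  Off-diagonal entries are the misclassified samples.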

Model training can be performed in the host code, but you can also import a serialized pre-trained model from a JSON, YAML, or ProtoBuf file.

You can also export the trained model and the input data as serialized JSON files (for example with `serde_json`), which can then be imported into the host.

For use in the zkVM, serializing the model and input data as byte arrays is ideal.  The code below uses `rmp_serde` to encode the trained model and the input data as MessagePack byte arrays and writes them to binary (`.bin`) files.

In [48]:

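// Encode the trained model and the input matrix as MessagePack byte arrays.
let model_bytes = rmp_serde::to_vec(&model).unwrap();
let data_bytes = rmp_serde::to_vec(&input).unwrap();

// Write the model bytes; the res/ml-model directory must already exist.
let mut f = File::create("res/ml-model/tree_model_bytes.bin").expect("Unable to create file");
f.write_all(&model_bytes).expect("Unable to write data");

// Write the input data bytes; the res/input-data directory must already exist.
let mut f1 = File::create("res/input-data/tree_model_data_bytes.bin").expect("Unable to create file");
f1.write_all(&data_bytes).expect("Unable to write data");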
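
`File::create` fails if the parent directory does not exist, so the export above depends on `res/ml-model` and `res/input-data` already being present.  The sketch below shows one way to create the directories on demand and to read the bytes back.  `write_bytes` and `read_bytes` are hypothetical helpers for this guide that use only the standard library.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical helper: write `bytes` to `path`, creating any missing
/// parent directories first.
fn write_bytes(path: &Path, bytes: &[u8]) -> io::Result<()> {
    if let Some(parent) = path.parent() {
        fs::create_dir_all(parent)?;
    }
    fs::write(path, bytes)
}

/// Hypothetical helper: read the whole file back as a byte vector.
fn read_bytes(path: &Path) -> io::Result<Vec<u8>> {
    fs::read(path)
}
```

In the notebook, `write_bytes(Path::new("res/ml-model/tree_model_bytes.bin"), &model_bytes)` could stand in for the `File::create` and `write_all` pair.  On the host side, the bytes returned by `read_bytes` would then be decoded with a MessagePack deserializer such as `rmp_serde::from_slice`, assuming the host uses the same Smartcore version and types.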