Copyright 2023 RISC Zero, Inc.

 Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

 Unless required by applicable law or agreed to in writing, software
 distributed under the License is distributed on an "AS IS" BASIS,
 WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License.

This notebook is a guide to training classifiers and regression models with the Smartcore crate.  Before training the classifier in Rust, process the data in Python and export the data and the classes as separate CSV files.

Start by importing the Smartcore and Polars crates as dependencies.  Outside of a Jupyter notebook environment, you can add these to your Cargo.toml file or run `cargo add <crate-name>` on the command line.

Be sure to enable the `serde` feature for the smartcore crate; otherwise, the Smartcore CSV readers will not work.

In [None]:
:dep smartcore = {version = "0.3.2", features = ["serde"]}
:dep polars = "*"
:dep serde_json = "1.0"
:dep rmp-serde = "1.1.2"

In [None]:
use smartcore::linalg::basic::matrix::DenseMatrix;
use smartcore::ensemble::random_forest_classifier::*;
use smartcore::readers;

use std::fs::File;
use std::io::{Read, Write};
use polars::prelude::*;
use serde_json;
use rmp_serde;

We use Smartcore's CSV reader to import the input data for our classifier.  It loads the data directly into a Smartcore DenseMatrix, the format required for training the classifier and performing inference.

In [None]:
let input = readers::csv::matrix_from_csv_source::<f64, Vec<_>, DenseMatrix<_>>(
    File::open("iris_input_data.csv").unwrap(),
    readers::csv::CSVDefinition::default()
).unwrap();

In [None]:
input
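
Before handing a file to the reader, it can help to confirm that the exported CSV is purely numeric and rectangular. The sketch below is a standard-library-only check and does not use Smartcore; `parse_numeric_csv` is a hypothetical helper, and it assumes a single header line and comma-separated fields with no quoting.

```rust
/// Hypothetical helper (not part of Smartcore): parse comma-separated text
/// with one header line into rows of f64, rejecting ragged rows.
fn parse_numeric_csv(text: &str) -> Result<Vec<Vec<f64>>, String> {
    let mut rows: Vec<Vec<f64>> = Vec::new();
    // Skip the header line (assumption: exactly one), then ignore blank lines.
    for (idx, line) in text.lines().enumerate().skip(1) {
        if line.trim().is_empty() {
            continue;
        }
        // Parse every field on the line as f64, reporting the 1-based line number on failure.
        let row = line
            .split(',')
            .map(|field| {
                field
                    .trim()
                    .parse::<f64>()
                    .map_err(|e| format!("line {}: {:?}: {}", idx + 1, field, e))
            })
            .collect::<Result<Vec<f64>, String>>()?;
        // Every row must have as many columns as the first data row.
        if let Some(first) = rows.first() {
            if first.len() != row.len() {
                return Err(format!(
                    "line {}: expected {} columns, found {}",
                    idx + 1,
                    first.len(),
                    row.len()
                ));
            }
        }
        rows.push(row);
    }
    Ok(rows)
}
```

Calling `parse_numeric_csv(&std::fs::read_to_string("iris_input_data.csv").unwrap())` returns an error that names the first offending line, which is usually easier to act on than a failed `unwrap()` during import.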

We import the classes from a separate CSV file using Polars.  We select the `variety` column of the DataFrame as a Series and convert it to a `Vec<i64>`.  We then cast from `Vec<i64>` to `Vec<u8>`, which is the required format for the Smartcore random forest classifier.

In [None]:
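let filepath_iris_classes = "iris_classes.csv";

// Read the CSV, take the "variety" column as i64 values, then cast each label to u8.
// Note: `as u8` wraps silently for values outside 0..=255.
let y_u8s: Vec<u8> = CsvReader::from_path(filepath_iris_classes).unwrap().finish().unwrap()
    .column("variety").unwrap().clone()
    .i64()?.into_no_null_iter().collect::<Vec<i64>>()
    .into_iter().map(|x| x as u8).collect::<Vec<u8>>();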

In [None]:
y_u8s
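
The `as u8` cast truncates silently: a label of 256 becomes 0 and -1 becomes 255. If the class labels are not guaranteed to lie in 0..=255, a checked conversion is safer. The following is a minimal sketch using only the standard library; `labels_to_u8` is a hypothetical helper name, not part of Smartcore or Polars.

```rust
use std::convert::TryFrom;

/// Hypothetical helper: convert i64 class labels to u8, failing on the
/// first label outside 0..=255 instead of wrapping like `as u8` does.
fn labels_to_u8(labels: &[i64]) -> Result<Vec<u8>, String> {
    labels
        .iter()
        .map(|&label| {
            u8::try_from(label).map_err(|_| format!("label {} does not fit in u8", label))
        })
        .collect()
}
```

With the labels collected as a `Vec<i64>`, `labels_to_u8(&labels)` could stand in for the `.map(|x| x as u8)` step when out-of-range values should be reported rather than wrapped.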

Now we can train the model using our desired classifier, here a random forest with default parameters.

In [None]:
let model = RandomForestClassifier::fit(&input, &y_u8s, Default::default()).unwrap();

We call `predict()` on the model to perform inference.

In [None]:
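// Build a one-row DenseMatrix holding the first sample of the input data.
// Note: this shadows the `input` matrix loaded from the CSV file above.
let input = DenseMatrix::from_2d_array(
    &[
        &[5.1, 3.5, 1.4, 0.2],
    ]
);

// Returns one predicted class label per row of `input`.
model.predict(
    &input
).unwrap()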
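
When predictions are made for many rows at once, a quick sanity check is to compare them with the known labels. The helper below is a standard-library-only sketch, not a Smartcore function; `accuracy` is a hypothetical name, and it assumes both slices hold `u8` class labels in the same row order. Accuracy measured on the training data tends to overstate how well the model generalizes.

```rust
/// Hypothetical helper: fraction of predictions that match the true labels.
/// Returns None when the slices are empty or differ in length.
fn accuracy(predicted: &[u8], expected: &[u8]) -> Option<f64> {
    if predicted.is_empty() || predicted.len() != expected.len() {
        return None;
    }
    // Count the positions where the predicted and expected labels agree.
    let correct = predicted
        .iter()
        .zip(expected)
        .filter(|(p, e)| p == e)
        .count();
    Some(correct as f64 / predicted.len() as f64)
}
```

For example, if `predictions` holds the result of calling `predict()` on the full training matrix, `accuracy(&predictions, &y_u8s)` gives the share of training rows classified correctly.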

Model training can be performed in the host code, but you can also import a serialized pre-trained model from a JSON, YAML, or ProtoBuf file.  

The code below lets you export the trained model and the input data as serialized JSON files, which can be imported into the host.

For use in the zkVM, serializing the model and input data as byte arrays is ideal.  The code below therefore serializes the trained model and input data to MessagePack byte arrays with rmp_serde, then writes each byte array to a JSON file.

In [None]:
// Serialize the model and the input data to MessagePack byte vectors.
// `input` here is the one-row matrix defined in the inference cell above.
let model_bytes = rmp_serde::to_vec(&model).unwrap();
let data_bytes = rmp_serde::to_vec(&input).unwrap();

// Encode each byte vector as a JSON array of numbers.
let model_json = serde_json::to_string(&model_bytes)?;
let x_json = serde_json::to_string(&data_bytes)?;

// Write the JSON strings to disk.
let mut f = File::create("../../res/ml-model/random_forest_model_bytes.json").expect("Unable to create file");
f.write_all(model_json.as_bytes()).expect("Unable to write data");

let mut f1 = File::create("../../res/input-data/random_forest_data_bytes.json").expect("Unable to create file");
f1.write_all(x_json.as_bytes()).expect("Unable to write data");
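
Because a `Vec<u8>` is serialized by serde_json as a plain array of numbers, each exported file contains text of the form `[147,1,255]`. The sketch below illustrates that layout and its round trip using only the standard library; both helper names are hypothetical. In real code, `serde_json::from_str::<Vec<u8>>` would normally recover the bytes, and `rmp_serde::from_slice` would then rebuild the model or matrix from them.

```rust
/// Hypothetical helper: encode bytes as a compact JSON array of numbers.
fn bytes_to_json_array(bytes: &[u8]) -> String {
    let items: Vec<String> = bytes.iter().map(|b| b.to_string()).collect();
    format!("[{}]", items.join(","))
}

/// Hypothetical helper: decode a JSON array of numbers in 0..=255 back into bytes.
/// Handles only a flat array of unsigned integers, which is all this layout needs.
fn json_array_to_bytes(json: &str) -> Result<Vec<u8>, String> {
    // Strip the surrounding brackets; anything else is not a JSON array.
    let inner = json
        .trim()
        .strip_prefix('[')
        .and_then(|rest| rest.strip_suffix(']'))
        .ok_or_else(|| "expected a JSON array".to_string())?;
    if inner.trim().is_empty() {
        return Ok(Vec::new());
    }
    // Parse each comma-separated item as a u8; values above 255 are rejected.
    inner
        .split(',')
        .map(|item| {
            item.trim()
                .parse::<u8>()
                .map_err(|e| format!("invalid byte {:?}: {}", item, e))
        })
        .collect()
}
```

Storing bytes as JSON numbers is easy to inspect but takes several characters per byte, so the JSON file is noticeably larger than the MessagePack payload it wraps.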