# RustKit Learn
### CIS 1905 Rust Final Project
#### Marwan Achi, Rose Wang

For this project, we implemented the following components in Rust, modeled on the commonly used Python library scikit-learn (`sklearn`):
- Preprocessing:
    - Scaler
    - Imputer
- Supervised:
    - Ridge Regression
    - With the following Regression Metrics:
        - R^2
        - MSE
- Unsupervised:
    - KMeans
    - PCA
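
For reference, the two regression metrics listed above have standard textbook definitions, sketched here in NumPy. This is an illustrative sketch only: the function names `mse` and `r2` are hypothetical and are not taken from the `rustkit` API, and the Rust implementation may differ in details such as how it handles a constant target.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: the average of the squared residuals."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot.

    Assumes y_true is not constant; otherwise SS_tot is zero and the
    ratio is undefined.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    return float(1.0 - ss_res / ss_tot)

y_true = [1.0, 2.0, 3.0]
y_pred = [1.0, 2.0, 4.0]
print("MSE:", mse(y_true, y_pred))
print("R^2:", r2(y_true, y_pred))
```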

After implementing the methods in Rust, we created Python bindings using `maturin` and `PyO3` to use these methods and classes as a library in Python, called `rustkit`. To do so, we implemented converter functions that converted `numpy` matrices and vectors into `nalgebra` matrices and vectors, handling generic types and null values.

In [1]:
# imports
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import rustkit
import unit_tests as tests

### 1. Benchmarking
We tested our implementation's performance in comparison to `sklearn`'s performance by measuring runtime across increasing input size in the following ways:
- `sklearn` method runtime: wall-clock time measured in Python, from the function call until it returns
- `rustkit` method runtime: wall-clock time measured in Python, from the function call until it returns, including the full cost of converting between Python and Rust objects
- `rustkit` Rust internal runtime: wall-clock time measured inside Rust, from the function call until it returns, excluding all Python interoperability overhead

Each Python benchmark ran for 100 iterations. Input matrices ranged from 10 x 10 to 10000 x 10000, with the size multiplied by 10 at each step. All KMeans benchmarks used 10 features and 3 clusters.
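
The Python-side measurement described above can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the project's actual benchmarking script: the helper name `benchmark` is hypothetical, a NumPy column mean stands in for a real `fit`/`transform` call, and the printed `function,nrows,ncols,avg_time` layout is assumed from the column names used when the CSV logs are read in.

```python
import time
import numpy as np

def benchmark(fn, X, n_iter=100):
    """Return the mean wall-clock time in seconds of fn(X) over n_iter calls."""
    start = time.perf_counter()
    for _ in range(n_iter):
        fn(X)
    return (time.perf_counter() - start) / n_iter

rng = np.random.default_rng(0)
for n in (10, 100, 1000):
    X = rng.standard_normal((n, 10))
    # Stand-in workload; a real run would call the sklearn or rustkit method here.
    avg = benchmark(lambda a: a.mean(axis=0), X)
    print(f"column_mean,{n},10,{avg:.3e}")
```

`time.perf_counter` is a monotonic, high-resolution clock, which makes it a common choice for this kind of interval timing.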

In [None]:
# read in benchmark logs
rustkit_python = pd.read_csv("rustkit_benchmarking.csv", sep=",", header=None, names=["function", "nrows", "ncols", "avg_time"])
rustkit_rust = pd.read_csv("timing_log.csv", sep=",", header=None, names=["function", "nrows", "ncols", "time"])
sklearn_python = pd.read_csv("sklearn_benchmarking.csv", sep=",", header=None, names=["function", "nrows", "ncols", "avg_time"])

In [None]:
# aggregate rustkit_rust
rustkit_rust = rustkit_rust.groupby(["function", "nrows", "ncols"]).agg({"time": "mean"}).reset_index()
rustkit_rust["avg_time"] = rustkit_rust["time"]
rustkit_rust = rustkit_rust.drop(columns="time")

# remove rustkit rust benchmarks that are not in rustkit python
# .copy() so that adding columns later does not trigger SettingWithCopyWarning
rustkit_rust_clean = rustkit_rust[rustkit_rust["function"].isin(rustkit_python["function"].unique())].copy()

In [None]:
# normalize across function
rustkit_rust_clean["avg_time_norm"] = rustkit_rust_clean.groupby("function")["avg_time"].transform(lambda x: x / x.min())
rustkit_python["avg_time_norm"] = rustkit_python.groupby("function")["avg_time"].transform(lambda x: x / x.min())
sklearn_python["avg_time_norm"] = sklearn_python.groupby("function")["avg_time"].transform(lambda x: x / x.min())

# Plot all benchmarks; color encodes the function, line style encodes the library
all_python = pd.concat([rustkit_python.assign(library="rustkit"), sklearn_python.assign(library="sklearn")], ignore_index=True)
fig, ax = plt.subplots(1, 1, figsize=(10, 5))
sns.lineplot(data=all_python, x="nrows", y="avg_time", hue="function", style="library", ax=ax)
ax.set_xscale("log")
plt.show()

Knowing that some of the difference in runtime can be attributed to the Python bindings themselves, let's see how the Python interop impacted performance by comparing rustkit's runtime from Python vs. Rust.
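
One way to quantify that interop cost is to join the two timing tables on their shared keys and subtract, taking overhead as Python-side time minus Rust-internal time so that a positive value is the cost of the bindings. The sketch below uses made-up placeholder numbers purely to show the mechanics; they are not measurements from this project, and the column names `overhead` and `overhead_frac` are hypothetical.

```python
import pandas as pd

keys = ["function", "nrows", "ncols"]

# Placeholder timings for illustration only (not real benchmark results).
python_side = pd.DataFrame(
    {"function": ["f", "f"], "nrows": [10, 100], "ncols": [10, 10], "avg_time": [2.0, 5.0]}
)
rust_side = pd.DataFrame(
    {"function": ["f", "f"], "nrows": [10, 100], "ncols": [10, 10], "avg_time": [1.5, 4.0]}
)

# Inner join keeps only the (function, nrows, ncols) combinations present in both logs.
merged = python_side.merge(rust_side, on=keys, suffixes=("_python", "_rust"))

# Absolute overhead, and overhead as a fraction of the total Python-side time.
merged["overhead"] = merged["avg_time_python"] - merged["avg_time_rust"]
merged["overhead_frac"] = merged["overhead"] / merged["avg_time_python"]
print(merged[keys + ["overhead", "overhead_frac"]])
```

Reporting the fraction alongside the absolute difference helps show whether the conversion cost dominates on small inputs and fades on large ones, assuming both logs use the same time unit.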

In [None]:
# Difference between rustkit python and rustkit rust
rustkit_diff = rustkit_python.merge(rustkit_rust_clean, on=["function", "nrows", "ncols"], suffixes=("_python", "_rust"))
# interop overhead: Python-side time minus Rust-internal time (positive = cost of the bindings)
rustkit_diff["diff"] = rustkit_diff["avg_time_python"] - rustkit_diff["avg_time_rust"]

# heatmap of difference
fig, ax = plt.subplots(1, 1, figsize=(10, 5))
# pivot takes keyword-only arguments in pandas >= 2.0
sns.heatmap(rustkit_diff.pivot(index="function", columns="nrows", values="diff"), annot=True, ax=ax)
plt.show()

# Plot rustkit python vs rustkit rust
fig, ax = plt.subplots(1, 1, figsize=(10, 5))
sns.lineplot(data=rustkit_python, x="nrows", y="avg_time", hue="function", style="function", ax=ax)
sns.lineplot(data=rustkit_rust_clean, x="nrows", y="avg_time", hue="function", style="function", ax=ax)
plt.show()


Now, let's assess runtime differences between `sklearn` and `rustkit`'s underlying Rust runtime.

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(10, 5))
sns.lineplot(data=rustkit_rust_clean, x="nrows", y="avg_time", hue="function", style="function", ax=ax)
sns.lineplot(data=sklearn_python, x="nrows", y="avg_time", hue="function", style="function", ax=ax)
# ax.set_xscale("log")
plt.show()

### 2. Unit Tests
We tested our implementation for **correctness** by comparing `rustkit` outputs to `sklearn` outputs.
Specifically, we assessed correctness on the following inputs:
- Inputs of size 1
- Square inputs
- Rectangular inputs
- Large inputs
- Inputs with only negative values
- Inputs with only positive values
- Inputs with both positive and negative values
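
Because two floating-point implementations rarely agree bit for bit, comparisons like these are usually made within a tolerance. The sketch below shows one way such a check could look. It is an assumption-laden illustration, not the contents of `unit_tests`: the helpers `assert_close` and `standardize_reference` are hypothetical, and a NumPy reference stands in for both libraries so the example runs on its own.

```python
import numpy as np

def assert_close(actual, expected, rtol=1e-6, atol=1e-8):
    """Fail unless the two arrays have the same shape and match within tolerance."""
    actual = np.asarray(actual, dtype=float)
    expected = np.asarray(expected, dtype=float)
    assert actual.shape == expected.shape, f"shape mismatch: {actual.shape} vs {expected.shape}"
    assert np.allclose(actual, expected, rtol=rtol, atol=atol, equal_nan=True), \
        "values differ beyond tolerance"

def standardize_reference(X):
    """Column-wise standardization using the population standard deviation (ddof=0)."""
    X = np.asarray(X, dtype=float)
    std = X.std(axis=0)
    std[std == 0.0] = 1.0  # leave constant columns unscaled instead of dividing by zero
    return (X - X.mean(axis=0)) / std

# Rectangular input with both positive and negative values.
X = np.array([[-1.0, 2.0], [3.0, -4.0], [5.0, 6.0]])
expected = standardize_reference(X)
actual = standardize_reference(X.copy())  # stand-in for the implementation under test
assert_close(actual, expected)
print("outputs match within tolerance")
```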

In [None]:
tests.test_single_input()
print("\n\n")
tests.test_square_input()
print("\n\n")
tests.test_large_input()
print("\n\n")
tests.test_negative_input()
print("\n\n")
tests.test_mixed_input()

We tested our **Python bindings** for correctness by comparing the outputs of `main.rs` with those of `test.py`, checking that each method called from Python matches the expected Rust behavior. We also tested the Python/Rust conversion in isolation with round-trip tests that pass a `numpy` array into Rust and back and check that it is unchanged.

In [None]:
def test_converter_vector():
    input_vector = np.array([1.0, 2.0, 3.0, 4.0])
    result = rustkit.converter_vector_test(input_vector)
    
    result_vector = np.array(result)
    
    print("VECTOR TEST")
    print("Input vector:")
    print(input_vector)
    print("Result vector:")
    print(result_vector)
    assert np.array_equal(input_vector, result_vector), "Test failed! Input and output vectors are not equal."
    print("Vector test passed!")

def test_converter_matrix():
    input_matrix = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
    
    result = rustkit.converter_matrix_test(input_matrix)
    
    result_matrix = np.array(result)
    
    print("MATRIX TEST")
    print("Input matrix:")
    print(input_matrix)
    print("Result matrix:")
    print(result_matrix)
    assert np.array_equal(input_matrix, result_matrix), "Test failed! Input and output matrices are not equal."
    print("Matrix test passed!")

In [None]:
test_converter_matrix()
test_converter_vector()