# C-Transpiled Model vs. Joblib Model: Comparison

A quick notebook to verify that the transpiler works as intended, by comparing the predictions of each transpiled C model with those of the original Joblib-loaded model. We will work with:
* `LinearRegression`
* `LogisticRegression` (binary)
* `DecisionTreeClassifier` (binary)
* `DecisionTreeRegressor`
  
On the following datasets:
* `Houses` dataset (regression)
* `Breast Cancer` dataset (classification)

In [1]:
import pandas as pd
import numpy as np
import joblib
import subprocess
import json
from transpile_simple_model import LinearModelTranspiler
from sklearn.metrics import r2_score, accuracy_score
from sklearn.base import BaseEstimator

HOUSE_DATA_PATH = "data/houses.csv"
CANCER_DATA_PATH = "data/breast-cancer.csv" # https://www.kaggle.com/datasets/yasserh/breast-cancer-dataset?resource=download

LIN_REG_PATH = "data/regression.joblib"
LOG_REG_PATH = "data/logistic_regression.joblib"

DECISION_TREE_CLF_PATH = "data/tree_clf.joblib"
DECISION_TREE_REG_PATH = "data/tree_reg.joblib"

## Utility Functions:

In [2]:
def build_c_dataset_string(X_df, n_samples=None):
    """
    Convert pandas DataFrame to C 2D array string format.
    
    Args:
        X_df: pandas DataFrame with features
        n_samples: number of samples to include (None = all)
    
    Returns:
        C code string for 2D array declaration
    """
    if n_samples is None:
        n_samples = len(X_df)
    
    X_subset = X_df.iloc[:n_samples]
    n_features = X_df.shape[1]
    
    # Start building the C array
    c_array = f"float dataset[{n_samples}][{n_features}] = {{\n"
    
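    # enumerate gives a positional counter, so this also works when the index is not 0..n-1
    for pos, (_, row) in enumerate(X_subset.iterrows()):
        # float() forces a decimal point, so integer columns yield valid C literals such as "3.0f"
        values = ", ".join([f"{float(val)}f" for val in row.values])
        c_array += f"    {{{values}}}"
        if pos < n_samples - 1:
            c_array += ","
        c_array += "\n"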
    
    c_array += "};"
    return c_array
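
The array builder above relies on turning each Python value into a C `float` literal. The helper below is a minimal sketch of that one step; `to_c_float_literal` is a hypothetical name that does not exist in the notebook, and rejecting NaN and infinity is an assumption about what the generated C file should accept.

```python
import math


def to_c_float_literal(value):
    """Format a number as a C float literal (hypothetical helper, not part of the notebook)."""
    number = float(value)  # forces a decimal point or exponent: 3 -> 3.0
    if not math.isfinite(number):
        # "nanf" and "inff" are not valid C literals, so fail early instead
        raise ValueError(f"cannot write {value!r} as a C float literal")
    return f"{number!r}f"


print(to_c_float_literal(3))     # integer input still gets a decimal point
print(to_c_float_literal(0.25))
```

Without the `float()` conversion, an integer such as `3` would be emitted as `3f`, which a C compiler rejects.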



In [3]:
def build_c_main_function(X_df, y_series, n_samples=None):
    """
    Build a complete C main function that computes predictions with the transpiled model.
    
    Args:
        X_df: pandas DataFrame with features
        y_series: pandas Series with true target values
        n_samples: number of samples to test (None = all)
    
    Returns:
        Complete C main function as a string
    """
    if n_samples is None:
        n_samples = len(X_df)
    
    n_features = X_df.shape[1]
    
    # Build the main function
    main_func = f"""
int main() {{
    // Dataset with {n_samples} samples and {n_features} features
    {build_c_dataset_string(X_df, n_samples)}
    
    // True targets, kept for reference (not used by the loop below)
    float y_true[] = {{{", ".join([f"{float(y_series.iloc[i])}f" for i in range(n_samples)])}}};
    
    // Test predictions for each sample
    printf("[%f", prediction(dataset[0]));
    for (int i = 1; i < {n_samples}; i++) {{
        float y_pred = prediction(dataset[i]);
        printf(", %f", y_pred);
    }}
    printf("]");
    
    return 0;
}}
"""
    return main_func
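
The generated `main` prints the predictions as a bracketed, comma-separated list, which happens to be valid JSON. The sketch below shows how such output can be parsed back in Python; `parse_c_predictions` is a hypothetical helper, and the sample string is made up for illustration rather than taken from a real run.

```python
import json


def parse_c_predictions(stdout_text):
    """Parse the '[a, b, c]' text printed by the generated main (hypothetical helper)."""
    predictions = json.loads(stdout_text)
    if not isinstance(predictions, list):
        raise ValueError("expected a JSON list of predictions")
    return [float(p) for p in predictions]


# Illustrative text in the "%f" format used by printf (six decimals), not real model output
sample_stdout = "[1.500000, 0.000000, 2.250000]"
print(parse_c_predictions(sample_stdout))
```

One caveat: if a prediction is NaN or infinite, `printf` writes `nan` or `inf`, which `json.loads` rejects, so the parse fails loudly rather than silently.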



In [4]:
def create_complete_c_file(original_c_file, output_c_file, X_df, y_series, n_samples=5):
    """
    Create a complete C file with the original model code + main function.
    
    Args:
        original_c_file: path to the original transpiled C file
        output_c_file: path for the new complete C file
        X_df: pandas DataFrame with features
        y_series: pandas Series with true target values
        n_samples: number of samples to test
    """
    # Read the original C file
    with open(original_c_file, 'r') as f:
        original_content = f.read()
    
    # Generate the main function
    main_function = build_c_main_function(X_df, y_series, n_samples)
    
    # Combine them
    complete_content = original_content + "\n" + main_function
    
    # Write to new file
    with open(output_c_file, 'w') as f:
        f.write(complete_content)
    
    print(f"* Created complete C file: {output_c_file}")
    print(f"  - Original code from: {original_c_file}")
    print(f"  - Added main function with {n_samples} test samples")
    
    return complete_content




In [5]:
def predict_and_eval(model: BaseEstimator | str, X: pd.DataFrame, y: pd.Series, metric=None, verbose=True):
    """
    Predict on a dataset using either a C transpiled model or a Python model, optionally evaluating a metric.

    Args:
        model: either the sklearn BaseEstimator object or the path to the C executable
        X: dataset to predict on (used by the Python model only; the C executable embeds its own copy)
        y: labels or targets for dataset X
        metric: sklearn's metric function (None = skip evaluation)
        verbose: if True, print the first five true/predicted pairs

    Returns:
        The predictions (numpy array for a Python model, list of floats for a C executable)
    """
    if isinstance(model, BaseEstimator):  # Use Python
        backend = "Python"
        y_pred = model.predict(X)
    elif isinstance(model, str):  # Use transpiled C
        backend = "C"
        # Run the executable passed in `model`; check=True raises if it exits with an error
        result = subprocess.run([model], capture_output=True, text=True, check=True)
        # The generated main() prints the predictions as a JSON-compatible list
        y_pred = json.loads(result.stdout)
    else:
        raise ValueError("model must be either a BaseEstimator or a string path to the C executable")

    if metric is not None:
        m = metric(y_true=y.to_numpy(), y_pred=y_pred)
        print(f"model's {metric.__name__} ({backend}): {m}")

    if verbose:
        # Guard against datasets with fewer than five samples
        for i in range(min(5, len(y_pred))):
            print(f"Sample {i+1}: True = {y.iloc[i]:.2f}, Predicted = {y_pred[i]:.2f}")

    return y_pred
        

        




## Linear Regression Comparison:

In [6]:
data = pd.read_csv(HOUSE_DATA_PATH)
X = data.drop(columns=["price", "orientation"])
y = data["price"]

### C:

In [7]:
transpiler = LinearModelTranspiler(LIN_REG_PATH, output_c_file="linear_model.c")
transpiler.transpile()

# Create the complete C file
complete_c_content = create_complete_c_file(
    original_c_file="linear_model.c",
    output_c_file="linear_model_with_main.c",
    X_df=X,
    y_series=y,
    n_samples=len(X)
)

Loading model from data/regression.joblib...

C code generated and saved to: linear_model.c
* Created complete C file: linear_model_with_main.c
  - Original code from: linear_model.c
  - Added main function with 40 test samples


In [8]:
!gcc linear_model_with_main.c -o test_model
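
A failed `!gcc` call does not raise in a notebook, so a stale `test_model` binary could be executed by the next cell. The sketch below compiles through `subprocess` instead, so a compiler error stops execution. `build_gcc_command` and `compile_c_file` are hypothetical helpers, and the `-lm` flag is an assumption: it matters only if the generated code calls `math.h` functions on a platform that links them separately.

```python
import shutil
import subprocess


def build_gcc_command(source, output, link_math=True):
    """Return the gcc argument list (hypothetical helper, not part of the notebook)."""
    command = ["gcc", source, "-o", output]
    if link_math:
        command.append("-lm")  # assumption: harmless if math.h is not used
    return command


def compile_c_file(source, output, link_math=True):
    """Compile `source` and raise if gcc is missing or the compilation fails."""
    if shutil.which("gcc") is None:
        raise RuntimeError("gcc was not found on PATH")
    # check=True raises CalledProcessError on a non-zero exit code
    subprocess.run(build_gcc_command(source, output, link_math),
                   check=True, capture_output=True, text=True)
    return output


print(build_gcc_command("linear_model_with_main.c", "test_model"))
```

Calling `compile_c_file("linear_model_with_main.c", "test_model")` would then stand in for the shell command above.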


In [9]:
c_predictions = predict_and_eval(model="./test_model", X=X, y=y, metric=r2_score)

model's r2_score (C): 0.15715596845899216
Sample 1: True = 260972.16, Predicted = 213250.08
Sample 2: True = 256534.25, Predicted = 199306.58
Sample 3: True = 282674.29, Predicted = 264473.56
Sample 4: True = 266555.38, Predicted = 226825.94
Sample 5: True = 319158.42, Predicted = 283106.50


### Python:

In [10]:
model = joblib.load(LIN_REG_PATH)
py_predictions = predict_and_eval(model=model,X=X, y=y, metric=r2_score)

model's r2_score (Python): 0.15715598448124224
Sample 1: True = 260972.16, Predicted = 213250.08
Sample 2: True = 256534.25, Predicted = 199306.58
Sample 3: True = 282674.29, Predicted = 264473.54
Sample 4: True = 266555.38, Predicted = 226825.94
Sample 5: True = 319158.42, Predicted = 283106.49


### Comparison

In [11]:
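# np.allclose also applies its default rtol=1e-5; for values this large, that relative term dominates atol
are_same = np.allclose(c_predictions, py_predictions, atol=1e-15)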
print(f"Are C predictions and Python predictions the same? {are_same}")

Are C predictions and Python predictions the same? True
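
The check above returns `True` even though some printed predictions differ in the second decimal, because `np.allclose` tests `|a - b| <= atol + rtol * |b|` and keeps its default `rtol=1e-5` unless told otherwise. The sketch below makes the size of the gap explicit; `compare_predictions` is a hypothetical helper, and the two lists are the five sample predictions printed in the cells above.

```python
import numpy as np


def compare_predictions(c_pred, py_pred, rtol=1e-5, atol=1e-8):
    """Report how far two prediction vectors are apart (hypothetical helper)."""
    c_arr = np.asarray(c_pred, dtype=np.float64)
    py_arr = np.asarray(py_pred, dtype=np.float64)
    abs_diff = np.abs(c_arr - py_arr)
    return {
        "max_abs_diff": float(abs_diff.max()),
        "allclose": bool(np.allclose(c_arr, py_arr, rtol=rtol, atol=atol)),
    }


# First five predictions as printed above (C, then Python)
c_sample = [213250.08, 199306.58, 264473.56, 226825.94, 283106.50]
py_sample = [213250.08, 199306.58, 264473.54, 226825.94, 283106.49]

print(compare_predictions(c_sample, py_sample))
# With the relative term switched off, a 1e-15 absolute tolerance is far too strict
print(np.allclose(c_sample, py_sample, rtol=0, atol=1e-15))
```

Reporting the maximum absolute difference alongside the boolean makes it clear that the agreement is "within float rounding" rather than bit-for-bit.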


**WE OBSERVE**:
* **R^2 scores that differ by only about 1.6e-8**
* **For the first 5 samples, predicted prices that agree to within 0.02 (visual confirmation)**

The `np.allclose` check in the previous cell confirms that the predictions agree within its default relative tolerance: Linear Regression was **successfully transpiled**

## Logistic Regression Comparison

In [12]:
data = pd.read_csv(CANCER_DATA_PATH) 
X = data.drop(columns=["diagnosis"])
y = data["diagnosis"].map({'M':1., 'B':0.})

### C:

In [13]:
transpiler = LinearModelTranspiler(LOG_REG_PATH, output_c_file="logistic_reg_model.c")
transpiler.transpile()

# Create the complete C file
complete_c_content = create_complete_c_file(
    original_c_file="logistic_reg_model.c",
    output_c_file="logistic_reg_model_with_main.c",
    X_df=X,
    y_series=y,
    n_samples=len(X)
)

Loading model from data/logistic_regression.joblib...

C code generated and saved to: logistic_reg_model.c
* Created complete C file: logistic_reg_model_with_main.c
  - Original code from: logistic_reg_model.c
  - Added main function with 569 test samples


In [14]:
!gcc logistic_reg_model_with_main.c -o test_model


In [15]:
c_predictions = predict_and_eval(model="./test_model", X=X,y=y, metric=accuracy_score)

model's accuracy_score (C): 0.9033391915641477
Sample 1: True = 1.00, Predicted = 1.00
Sample 2: True = 1.00, Predicted = 1.00
Sample 3: True = 1.00, Predicted = 1.00
Sample 4: True = 1.00, Predicted = 1.00
Sample 5: True = 1.00, Predicted = 0.00


### Python:

In [16]:
model = joblib.load(LOG_REG_PATH)
py_predictions = predict_and_eval(model=model,X=X,y=y, metric=accuracy_score)    

model's accuracy_score (Python): 0.9033391915641477
Sample 1: True = 1.00, Predicted = 1.00
Sample 2: True = 1.00, Predicted = 1.00
Sample 3: True = 1.00, Predicted = 1.00
Sample 4: True = 1.00, Predicted = 1.00
Sample 5: True = 1.00, Predicted = 0.00


### Comparison

In [17]:
are_same = np.allclose(c_predictions, py_predictions[:len(c_predictions)], atol=1e-15)
print(f"Are C predictions and Python predictions the same? {are_same}")


Are C predictions and Python predictions the same? True
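
For class labels, a tolerance-based check is not strictly needed, since the predictions are either equal or not. The sketch below compares labels exactly and returns the positions of any disagreement; `find_label_mismatches` is a hypothetical helper, and the inputs are toy values rather than notebook output.

```python
import numpy as np


def find_label_mismatches(c_pred, py_pred):
    """Return the indices where two label vectors disagree (hypothetical helper)."""
    # The C program prints labels as floats such as 1.000000, so round before casting
    c_labels = np.rint(np.asarray(c_pred, dtype=float)).astype(int)
    py_labels = np.rint(np.asarray(py_pred, dtype=float)).astype(int)
    if c_labels.shape != py_labels.shape:
        raise ValueError("prediction vectors must have the same length")
    return np.flatnonzero(c_labels != py_labels)


# Toy labels for illustration only
mismatches = find_label_mismatches([1.0, 0.0, 1.0], [1.0, 1.0, 1.0])
print(f"{mismatches.size} mismatch(es) at positions {mismatches.tolist()}")
```

An empty result means every sample received the same label from both models, which is a stronger statement than matching accuracy scores.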


**AGAIN WE OBSERVE**:
* **The same accuracy score**
* **For the first 5 samples, the same predicted labels (visual confirmation)**

The previous cell confirms that the predictions match: Logistic Regression was **successfully transpiled**

## Decision Tree Classifier Comparison

In [18]:
df = pd.read_csv(CANCER_DATA_PATH)
X = df.drop(columns=["diagnosis"])
y = df["diagnosis"].map({'M':1., 'B':0.})

### C:

In [19]:
transpiler = LinearModelTranspiler(model_path=DECISION_TREE_CLF_PATH, output_c_file="decision_tree_clf.c")
transpiler.transpile()

complete_c_content = create_complete_c_file(
    original_c_file="decision_tree_clf.c",
    output_c_file="decision_tree_clf_with_main.c",
    X_df=X,
    y_series=y,
    n_samples=len(X),
)

Loading model from data/tree_clf.joblib...

C code generated and saved to: decision_tree_clf.c
* Created complete C file: decision_tree_clf_with_main.c
  - Original code from: decision_tree_clf.c
  - Added main function with 569 test samples


In [20]:
!gcc decision_tree_clf_with_main.c -o test_model

In [21]:
c_predictions = predict_and_eval(model="./test_model", X=X, y=y, metric=accuracy_score)

model's accuracy_score (C): 1.0
Sample 1: True = 1.00, Predicted = 1.00
Sample 2: True = 1.00, Predicted = 1.00
Sample 3: True = 1.00, Predicted = 1.00
Sample 4: True = 1.00, Predicted = 1.00
Sample 5: True = 1.00, Predicted = 1.00


### Python:

In [22]:
model = joblib.load(DECISION_TREE_CLF_PATH)

py_predictions = predict_and_eval(model=model, X=X, y=y, metric=accuracy_score)

model's accuracy_score (Python): 1.0
Sample 1: True = 1.00, Predicted = 1.00
Sample 2: True = 1.00, Predicted = 1.00
Sample 3: True = 1.00, Predicted = 1.00
Sample 4: True = 1.00, Predicted = 1.00
Sample 5: True = 1.00, Predicted = 1.00


### Comparison

In [23]:
are_same = np.allclose(c_predictions, py_predictions[:len(c_predictions)], atol=1e-15)
print(f"Are C predictions and Python predictions the same? {are_same}")

Are C predictions and Python predictions the same? True


## Decision Tree Regressor Comparison

In [24]:
df = pd.read_csv(HOUSE_DATA_PATH)
X = df.drop(columns=["price", "orientation"])
y = df["price"]

### C:

In [25]:
transpiler = LinearModelTranspiler(DECISION_TREE_REG_PATH, output_c_file="decision_tree_reg.c")
transpiler.transpile()

complete_c_content = create_complete_c_file(
    original_c_file="decision_tree_reg.c",
    output_c_file="decision_tree_reg_with_main.c",
    X_df=X,
    y_series=y,
    n_samples=len(X),
)

Loading model from data/tree_reg.joblib...

C code generated and saved to: decision_tree_reg.c
* Created complete C file: decision_tree_reg_with_main.c
  - Original code from: decision_tree_reg.c
  - Added main function with 40 test samples


In [26]:
!gcc decision_tree_reg_with_main.c -o test_model

In [27]:
c_predictions = predict_and_eval(model="./test_model", X=X, y=y, metric=r2_score)

model's r2_score (C): 0.9999999999999961
Sample 1: True = 260972.16, Predicted = 260972.17
Sample 2: True = 256534.25, Predicted = 256534.25
Sample 3: True = 282674.29, Predicted = 282674.28
Sample 4: True = 266555.38, Predicted = 266555.38
Sample 5: True = 319158.42, Predicted = 319158.41


### Python:

In [28]:
model = joblib.load(DECISION_TREE_REG_PATH)
py_predictions = predict_and_eval(model=model, X=X, y=y, metric=r2_score)

model's r2_score (Python): 1.0
Sample 1: True = 260972.16, Predicted = 260972.16
Sample 2: True = 256534.25, Predicted = 256534.25
Sample 3: True = 282674.29, Predicted = 282674.29
Sample 4: True = 266555.38, Predicted = 266555.38
Sample 5: True = 319158.42, Predicted = 319158.42


### Comparison

In [29]:
are_same = np.allclose(c_predictions, py_predictions[:len(c_predictions)], atol=1e-15)
print(f"Are C predictions and Python predictions the same? {are_same}")

Are C predictions and Python predictions the same? True
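
The small gap between the two R^2 scores (0.9999999999999961 in C, 1.0 in Python) is consistent with the generated C code storing values as `float`. The sketch below estimates how coarse that type is at house-price magnitudes, assuming C `float` is IEEE 754 single precision (which `np.float32` models); the price is the first target value as printed above.

```python
import numpy as np

price = 260972.16                 # first target value, as printed (rounded to 2 decimals)
as_float32 = np.float32(price)    # what a C float can actually hold

# Spacing between adjacent float32 values at this magnitude
ulp = float(np.spacing(as_float32))
rounding_error = abs(float(as_float32) - price)

print(f"float32 spacing near {price}: {ulp}")
print(f"rounding error for this value: {rounding_error:.6f}")
```

Differences of about a cent in the printed predictions are therefore expected from the storage type alone and do not indicate a transpilation bug.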



**As before, both the classifier and the regressor reproduce the Python predictions (up to `float` rounding for the regressor), so we conclude that the transpiler works as intended for simple Decision Trees**

## Conclusion:

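The transpiler works as intended: for all four models, the predictions of the transpiled C code match those of the Joblib-loaded model within the tolerance of `np.allclose`.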