# Introduction

causalkit is a Rust implementation of a set of causal inference algorithms. It currently provides two: `RandomForestClassifier` and `RandomForestRegressor`.

In this notebook, we use synthetic data generated by the causalml package to demonstrate how to use causalkit.

In [None]:
import numpy as np
import pandas as pd

from causalml.dataset import make_uplift_classification
from causalml.metrics import plot_gain

from sklearn.model_selection import train_test_split

In [None]:
from causalkit import CausalModel

### Generate synthetic data with causalml

To train the model, we need to specify a treatment column that indicates which treatment each record received. In the synthetic data this column is `treatment_group_key`, which contains string values such as `control`, `treatment1`, etc.

causalkit only accepts dataframes with numeric values. In particular, the treatment column must contain the values 0, 1, ..., K, where 0 represents the control group.

So we need to create a new numeric treatment column, `action`, and delete the string column `treatment_group_key`.

In [None]:
df, x_names = make_uplift_classification()

In [None]:
df.head()

In [None]:
# Look at the conversion rate and sample size in each group
df.pivot_table(values='conversion',
               index='treatment_group_key',
               aggfunc=[np.mean, np.size],
               margins=True)

# Hold out 20% of the rows for testing; the fixed seed makes the split reproducible
df_train, df_test = train_test_split(df, test_size=0.2, random_state=111)

In [None]:
# Integer codes for the treatment groups (0 is the control group)
mapping = {
    "control": 0,
    "treatment1": 1,
    "treatment2": 2,
    "treatment3": 3,
}

In [None]:
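# Encode the treatment group as an integer column and drop the string column.
# assign/drop return new frames, which avoids pandas' SettingWithCopyWarning.
df_train = df_train.assign(action=df_train["treatment_group_key"].map(mapping))
df_train = df_train.drop(columns="treatment_group_key")

df_test = df_test.assign(action=df_test["treatment_group_key"].map(mapping))
df_test = df_test.drop(columns="treatment_group_key")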
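
As a sanity check on the input requirements described above, the sketch below verifies that every column is numeric and that the treatment column holds exactly the codes 0..K. `check_causalkit_input` is a hypothetical helper written for this notebook, not part of causalkit, and the toy frame stands in for `df_train` so the example is self-contained.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype


def check_causalkit_input(df, treatment_col):
    """Hypothetical helper (not part of causalkit): validate the input frame."""
    # Every column must be numeric, since the data is passed on as a numpy array.
    non_numeric = [c for c in df.columns if not is_numeric_dtype(df[c])]
    if non_numeric:
        raise ValueError(f"non-numeric columns: {non_numeric}")

    # Treatment codes must be exactly 0, 1, ..., K, with 0 as the control group.
    codes = np.sort(df[treatment_col].unique())
    expected = np.arange(len(codes))
    if not np.array_equal(codes, expected):
        raise ValueError(f"treatment codes {codes.tolist()} are not 0..K")
    return len(codes) - 1  # number of treatment groups, K


# Toy data with made-up values: one feature, the response, and the treatment.
toy = pd.DataFrame({
    "x1": [0.1, 0.4, 0.3, 0.9],
    "conversion": [0, 1, 0, 1],
    "action": [0, 1, 2, 1],
})
n_treatments = check_causalkit_input(toy, "action")
```

In the notebook itself, the same call could be applied to `df_train` and `df_test` with `"action"` as the treatment column.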

### Create a random forest classifier with causalkit

The entry point for creating a model is `CausalModel`. For training, `feature`, `treatment`, and `y` are required in `params`; for prediction, you may provide all three, but only `feature` is required.

`CausalModel(model_type, params)`

- model_type (str): `RandomForestClassifier` or `RandomForestRegressor`
- params (dict):
    - feature (List[str]): list of input features; must be in the same order for training and testing
    - cat (List[str] = []): features in this list are treated as categorical
    - treatment (List[str] = []): list of treatment columns; only one treatment column is supported for now
    - y (str = ""): the response column
    - weight (str = ""): weight column; if empty, all samples in the dataset have equal weight
    - n_bin (int = 30): number of bins for each feature
    - min_samples_leaf (int = 100): minimum number of samples in each leaf node
    - min_samples_treatment (int = 10): minimum number of treated samples in each leaf node
    - max_features (int = 10): maximum number of features to consider when splitting a node
    - max_depth (int = 6): maximum depth of a tree
    - n_tree (int = 100): number of trees in the random forest
    - n_reg (int = 10): regularization parameter
    - alpha (float = 0.9): regularization parameter
    - normalization (bool = True): regularization parameter
    - subsample (float = 1.0): fraction of the dataset sampled to build each tree
    - n_thread (int = 1): number of threads used to build trees in parallel
    - seed (int = None): random seed
    
Model functions
- fit(columns, array):
    - columns: all column names corresponding to the array
    - array: numpy array of the data
- predict(columns, array):
    - columns: all column names corresponding to the array
    - array: numpy array of the data
    - return:
        - score: N x T numpy matrix, where N is the number of records and T is the number of treatments. For example, with two groups (control and treatment), T = 1. Each score is the uplift Prob(Y | treatment) - Prob(Y | control).
- load(model_type, path):
    - model_type (str): RandomForestClassifier / RandomForestRegressor
    - path (str): disk location of the model file
- save(path):
    - path (str): disk location to save the model file to
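
To make the shape and meaning of the score matrix concrete, here is a small numpy sketch of the uplift definition given for `predict`. The probabilities are made-up numbers, not causalkit output; the sketch only illustrates how an N x T matrix of Prob(Y | treatment) - Prob(Y | control) values is laid out.

```python
import numpy as np

# Toy per-record conversion probabilities (made-up numbers, not model output).
# Column 0 is the control group; columns 1..T are the treatment groups.
prob = np.array([
    [0.10, 0.15, 0.30],
    [0.20, 0.18, 0.25],
    [0.05, 0.05, 0.02],
])

# Uplift of each treatment relative to control: shape (N, T), here (3, 2).
uplift = prob[:, 1:] - prob[:, [0]]
```

A negative entry means the treatment is expected to do worse than control for that record, and a zero entry means no expected difference.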

In [None]:
params = {"feature": x_names, "y": "conversion", "treatment": ["action"]}
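
The cell above relies on the defaults for everything except the three required keys. As a hedged sketch, a `params` dict that also overrides a few of the documented options might look like the one below; the values are illustrative rather than recommendations, and the feature names are placeholders. `missing_params` is a hypothetical helper, not part of causalkit, that checks a dict against the required keys listed earlier before it is handed to `CausalModel`.

```python
# Keys the text above lists as required for each phase.
REQUIRED_FOR_TRAINING = ("feature", "treatment", "y")
REQUIRED_FOR_PREDICTION = ("feature",)


def missing_params(params, training=True):
    """Hypothetical helper (not part of causalkit): required keys that are absent or empty."""
    required = REQUIRED_FOR_TRAINING if training else REQUIRED_FOR_PREDICTION
    return [key for key in required if not params.get(key)]


# Illustrative values only; the feature names are placeholders.
tuned_params = {
    "feature": ["x1", "x2"],
    "y": "conversion",
    "treatment": ["action"],
    "n_tree": 200,
    "max_depth": 8,
    "seed": 42,
}
incomplete_params = {"feature": ["x1", "x2"], "y": "conversion"}

print(missing_params(tuned_params))                       # []
print(missing_params(incomplete_params))                  # ['treatment']
print(missing_params(incomplete_params, training=False))  # []
```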

In [None]:
# The parameters for regression are the same;
# simply use CausalModel("RandomForestRegressor", params) instead

model = CausalModel("RandomForestClassifier", params)

In [None]:
model.fit(df_train.columns.tolist(), df_train.values)

In [None]:
score = model.predict(df_test.columns.tolist(), df_test.values)

In [None]:
# There are 3 treatments in the dataset, so score is an N x 3 matrix
result = pd.DataFrame(score, columns=["treatment1", "treatment2", "treatment3"])
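
One common way to use a table like `result` is to pick, for each record, the treatment with the largest predicted uplift. The sketch below does this on a toy score matrix with made-up values. Falling back to `control` when no treatment has a positive uplift is an assumption made for illustration, not something causalkit prescribes.

```python
import numpy as np
import pandas as pd

# Toy uplift scores standing in for `score`; the numbers are made up.
toy_score = np.array([
    [0.02, 0.10, -0.01],
    [-0.03, -0.01, -0.02],
    [0.05, 0.01, 0.07],
])
toy_result = pd.DataFrame(toy_score, columns=["treatment1", "treatment2", "treatment3"])

# Treatment with the largest predicted uplift for each record.
best = toy_result.idxmax(axis=1)
# Fall back to control when no treatment has a positive predicted uplift.
best = best.where(toy_result.max(axis=1) > 0, "control")

print(best.tolist())  # ['treatment2', 'control', 'treatment3']
```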