# balance Quickstart (raking): Analyzing and adjusting the bias on a simulated toy dataset

The raking method is an advanced technique that extends post-stratification. It is well suited to situations where we know the marginal distributions of multiple covariates but not their joint distribution. Raking works by post-stratifying the data on the first covariate, then using the resulting weights as input for the adjustment on the second covariate, and so forth. Once all covariates have been used, the process is repeated until a specified level of convergence is attained.
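The loop described above can be sketched in plain Python. This is a minimal illustration on toy data, not the implementation used by balance: the `rake` and `weighted_share` helpers, the toy rows, and the `max_iter` and `tol` parameters are all assumptions made for this example.

```python
from collections import defaultdict


def weighted_share(rows, weights, covar):
    """Return the weighted proportion of each category of `covar`."""
    totals = defaultdict(float)
    for row, w in zip(rows, weights):
        totals[row[covar]] += w
    grand_total = sum(weights)
    return {cat: total / grand_total for cat, total in totals.items()}


def rake(rows, targets, max_iter=100, tol=1e-9):
    """Post-stratify on each covariate in turn until the weights stop changing.

    rows:    list of dicts, one per respondent
    targets: {covariate: {category: target proportion}}
    """
    weights = [1.0] * len(rows)
    for _ in range(max_iter):
        max_change = 0.0
        for covar, target_props in targets.items():
            current = weighted_share(rows, weights, covar)
            # Post-stratification step: rescale each weight so that the
            # weighted share of its category matches the target share.
            for i, row in enumerate(rows):
                factor = target_props[row[covar]] / current[row[covar]]
                max_change = max(max_change, abs(factor - 1.0))
                weights[i] *= factor
        if max_change < tol:  # all marginals already match: converged
            break
    return weights


# Hypothetical toy sample and target marginals (the joint is unknown).
rows = [
    {"gender": "F", "age": "young"},
    {"gender": "F", "age": "old"},
    {"gender": "M", "age": "young"},
    {"gender": "M", "age": "old"},
    {"gender": "M", "age": "old"},
]
targets = {
    "gender": {"F": 0.5, "M": 0.5},
    "age": {"young": 0.4, "old": 0.6},
}

weights = rake(rows, targets)
print(weighted_share(rows, weights, "gender"))
print(weighted_share(rows, weights, "age"))
```

Each pass fixes one marginal exactly and may slightly disturb the others, which is why the passes are repeated until the correction factors are all close to 1.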

One of the main advantages of raking is its ability to work with user-level data while also utilizing marginal distributions that lack user-level granularity. Another benefit is its capacity to closely fit these distributions, depending on the convergence achieved. This is in contrast to techniques such as inverse probability weighting (IPW) and covariate balancing propensity score (CBPS), which may only approximate the data and potentially fail to fit them even at marginal levels.

This notebook demonstrates how to use the raking method and showcases the high degree of fit it can provide.

## Load the data

In [None]:
from balance import load_data

In [None]:
target_df, sample_df = load_data()

print("target_df: \n", target_df.head())
print("sample_df: \n", sample_df.head())

In [None]:
from balance import Sample

Raking can work with numerical variables, since they are automatically bucketed. For simplicity, however, we'll focus only on age group and gender.

In [None]:
sample = Sample.from_frame(sample_df[['id', 'gender', 'age_group', 'happiness']], outcome_columns=["happiness"])
target = Sample.from_frame(target_df[['id', 'gender', 'age_group', 'happiness']], outcome_columns=["happiness"])
sample_with_target = sample.set_target(target)

## Fit models using ipw and rake


Fit an ipw model:

In [None]:
adjusted_ipw = sample_with_target.adjust(method = "ipw")

Fit a raking model (on the user level data as input):

In [None]:
adjusted_rake = sample_with_target.adjust(method = "rake")

When comparing the results of ipw and rake, we can see that rake has a larger design effect, and that it provides a perfect fit. In contrast, ipw gives only a partial fit.

We can see it in the ASMD and also the bar plots.
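For readers unfamiliar with the two diagnostics, here is a rough sketch of how they can be computed by hand. It assumes Kish's formula for the design effect and an ASMD standardized by the target's standard deviation, which is one common convention; the helper names and toy numbers are hypothetical and are not taken from balance.

```python
import math


def kish_design_effect(weights):
    """Kish's design effect: n * sum(w^2) / (sum(w))^2. Equals 1 for equal weights."""
    n = len(weights)
    return n * sum(w * w for w in weights) / sum(weights) ** 2


def weighted_mean(values, weights):
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def asmd(sample_x, sample_w, target_x, target_w):
    """Absolute standardized mean difference between sample and target.

    Assumption: the difference is divided by the target's standard deviation.
    """
    target_mean = weighted_mean(target_x, target_w)
    target_var = weighted_mean([(x - target_mean) ** 2 for x in target_x], target_w)
    sample_mean = weighted_mean(sample_x, sample_w)
    return abs(sample_mean - target_mean) / math.sqrt(target_var)


# Toy indicator variable (1 = Female): 75% of the sample vs. 50% of the target.
sample_x = [1, 1, 1, 0]
target_x = [1, 0, 1, 0]
equal_w = [1, 1, 1, 1]
adjusted_w = [1, 1, 1, 3]  # hypothetical weights that up-weight the last unit

unweighted = asmd(sample_x, equal_w, target_x, equal_w)
reweighted = asmd(sample_x, adjusted_w, target_x, equal_w)
deff = kish_design_effect(adjusted_w)
print(unweighted, reweighted, deff)
```

The toy weights remove the gap in means but are less uniform than equal weights, so the design effect rises above 1. This is the same trade-off between fit and design effect described above.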

In [None]:
print(adjusted_ipw.summary())

In [None]:
print(adjusted_rake.summary())

In [None]:
adjusted_ipw.covars().plot()

In [None]:
adjusted_rake.covars().plot()

## Using marginal distributions with rake

The benefit of rake is that we can define a target population from a marginal distribution, and fit towards it.
The function to use for this purpose is `prepare_marginal_dist_for_raking`.

To demonstrate this, let us assume we have another target population in mind, with different proportions. Since its marginal distributions are known, we can build a data frame that represents this target population from a dict of marginal distributions using the `prepare_marginal_dist_for_raking` function.

In [None]:
from balance.weighting_methods.rake import prepare_marginal_dist_for_raking
# import pandas as pd
import numpy as np

a_dict_with_marginal_distributions = {
    "gender": {"Female": 0.1, "Male": 0.85, np.nan: 0.05},
    "age_group": {"18-24": 0.25, "25-34": 0.25, "35-44": 0.25, "45+": 0.25},
}

target_df_from_marginals = prepare_marginal_dist_for_raking(a_dict_with_marginal_distributions)

In [None]:
target_df_from_marginals

In [None]:
target_df_from_marginals.info()
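To build intuition for what such a data frame contains, the following sketch expands a dict of marginal proportions into columns of rows. It is a hypothetical illustration, not the library's code: `realize_proportions` and `realize_marginals` are made-up helper names, `None` stands in for the missing-value category, and `math.lcm` with several arguments requires Python 3.9 or later.

```python
from fractions import Fraction
from math import lcm


def realize_proportions(props):
    """Expand {category: proportion} into the shortest list with those shares."""
    fracs = {cat: Fraction(str(p)) for cat, p in props.items()}
    n = lcm(*(f.denominator for f in fracs.values()))
    values = []
    for cat, f in fracs.items():
        values.extend([cat] * int(f * n))
    return values


def realize_marginals(marginals):
    """Realize every covariate, then repeat each column to a common length."""
    columns = {name: realize_proportions(p) for name, p in marginals.items()}
    n_rows = lcm(*(len(col) for col in columns.values()))
    return {name: col * (n_rows // len(col)) for name, col in columns.items()}


marginals = {
    "gender": {"Female": 0.1, "Male": 0.85, None: 0.05},
    "age_group": {"18-24": 0.25, "25-34": 0.25, "35-44": 0.25, "45+": 0.25},
}
table = realize_marginals(marginals)
print(len(table["gender"]), table["gender"].count("Male"))
```

Each column reproduces its own marginal shares, but the way values are paired across columns within a row is arbitrary, so the table carries no information about the joint distribution.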

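With the new `target_df_from_marginals` object ready, we can use it as a target. Note that this makes sense ONLY for the raking method, because the data frame reproduces the marginal distributions but not the joint distribution of the covariates. It should NOT be used with any other method.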

In [None]:
target_from_marginals = Sample.from_frame(target_df_from_marginals)
sample_with_target_2 = sample.set_target(target_from_marginals)

And fit a raking model:

In [None]:
adjusted_rake_2 = sample_with_target_2.adjust(method = "rake")

As the following output shows, the adjusted sample fits the marginal distributions defined for age group and gender.

In [None]:
print(adjusted_rake_2.summary())

In [None]:
adjusted_rake_2.covars().plot()