# ReciPies Basics

## Imports

In [None]:
import numpy as np
import polars as pl
from src.recipies import Recipe
from src.recipies.ingredients import Ingredients
from datetime import datetime, MINYEAR

## Creating our data as a Polars DataFrame
We will create a simple dataset to demonstrate the functionality of ReciPies. It contains several data types and a temporal component. We also add some missing values, as this is common.

In [None]:
rand_state = np.random.RandomState(42)
timecolumn = pl.concat(
    [
        pl.datetime_range(datetime(MINYEAR, 1, 1, 0), datetime(MINYEAR, 1, 1, 5), "1h", eager=True),
        pl.datetime_range(datetime(MINYEAR, 1, 1, 0), datetime(MINYEAR, 1, 1, 3), "1h", eager=True),
    ]
)
df = pl.DataFrame(
    {
        "id": [1] * 6 + [2] * 4,
        "time": timecolumn,
        "y": rand_state.normal(size=(10,)),
        "x1": rand_state.normal(loc=10, scale=5, size=(10,)),
        "x2": rand_state.binomial(n=1, p=0.3, size=(10,)),
        "x3": pl.Series(["a", "b", "c", "a", "c", "b", "c", "a", "b", "c"], dtype=pl.Categorical),
        "x4": pl.Series(["x", "y", "y", "x", "y", "y", "x", "x", "y", "x"], dtype=pl.Categorical),
    }
)
df[[1, 2, 4, 7], "x1"] = None

In [None]:
df

## Creating Ingredients
To get started, we need to create an ingredients object. This object will be used to create a recipe.

In [None]:
ing = Ingredients(df)

This ingredients object should contain the roles of the columns. The roles determine how we can process the data. For example, the column "y" can be defined as an outcome column, which we can use later to define what we want to do with this type of column:

In [None]:
roles = {"y": ["outcome"]}
ing = Ingredients(df, copy=False, roles=roles)

## Creating a recipe
We can also create a recipe directly and specify the roles as arguments at instantiation. A recipe always needs an ingredients object; optionally, it also takes the target column, the feature columns, the group columns, and the sequential or time column.

In [None]:
ing = Ingredients(df)
rec = Recipe(ing, outcomes=["y"], predictors=["x1", "x2", "x3", "x4"], groups=["id"], sequences=["time"])

In [None]:
rec

We see that the operations are not yet defined. We have to add steps to our recipe to define what we want to do with the data. First, however, we need a way to select which columns each step should prepare.


## Selectors

In [None]:
from src.recipies.selector import all_numeric_predictors

all_numeric_predictors()
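
Conceptually, a selector describes a set of columns by role and data type rather than by name. The sketch below is plain Python and is not the ReciPies implementation: `select_numeric_predictors` and the `dtypes` and `roles` mappings are hypothetical names that only mirror the example DataFrame and the roles given to the recipe above.

```python
# Hypothetical column metadata mirroring the example DataFrame (not a ReciPies API).
dtypes = {
    "id": "int",
    "time": "datetime",
    "y": "float",
    "x1": "float",
    "x2": "int",
    "x3": "categorical",
    "x4": "categorical",
}
roles = {
    "id": ["group"],
    "time": ["sequence"],
    "y": ["outcome"],
    "x1": ["predictor"],
    "x2": ["predictor"],
    "x3": ["predictor"],
    "x4": ["predictor"],
}

NUMERIC_DTYPES = {"int", "float"}


def select_numeric_predictors(roles, dtypes):
    """Return the columns that have the 'predictor' role and a numeric dtype."""
    return [
        col
        for col, dtype in dtypes.items()
        if dtype in NUMERIC_DTYPES and "predictor" in roles.get(col, [])
    ]


selected = select_numeric_predictors(roles, dtypes)
print(selected)
```

In this sketch only "x1" and "x2" qualify: "x3" and "x4" are predictors but categorical, and "y" is numeric but an outcome.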

## Adding steps
Let's preprocess our data! First: we know that there is some missing data in our predictors. We can easily add a step to fill in the missing values with the mean of the column.

In [None]:
from src.recipies.selector import all_numeric_predictors
from src.recipies.step import StepImputeFill

rec = rec.add_step(StepImputeFill(sel=all_numeric_predictors(), strategy="mean"))
print(rec)

# rec = rec.add_step(StepImputeFastZeroFill(sel=all_numeric_predictors()))

## Prepping the recipe
Let's prep the recipe. This "trains" the steps we added to the recipe on the data, in the order they were added. The result is a recipe object that is ready to bake any data with the same schema as the data used for prep. This is useful, for example, when we want to apply the same preprocessing steps to a test set or to new data.

In [None]:
rec.prep(df)
print(rec)
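
To illustrate what training steps "in order" can mean, here is a minimal plain-Python sketch. It is not the ReciPies implementation: `TinyRecipe`, `MeanFill`, and `Center` are hypothetical names, and the sketch assumes that each step is fitted on the output of the steps before it.

```python
from statistics import mean


class MeanFill:
    """Hypothetical step: learn the mean of the observed values, then fill gaps with it."""

    def prep(self, values):
        self.fill_value = mean(v for v in values if v is not None)

    def bake(self, values):
        return [self.fill_value if v is None else v for v in values]


class Center:
    """Hypothetical step: learn the mean, then subtract it from every value."""

    def prep(self, values):
        self.center = mean(values)

    def bake(self, values):
        return [v - self.center for v in values]


class TinyRecipe:
    """Hypothetical recipe: an ordered list of steps with prep and bake."""

    def __init__(self):
        self.steps = []

    def add_step(self, step):
        self.steps.append(step)
        return self

    def prep(self, values):
        # Assumption: each step is fitted on the data as transformed by earlier steps.
        for step in self.steps:
            step.prep(values)
            values = step.bake(values)
        return self

    def bake(self, values):
        # Apply the already-fitted steps in the same order, without refitting.
        for step in self.steps:
            values = step.bake(values)
        return values


rec_sketch = TinyRecipe().add_step(MeanFill()).add_step(Center())
rec_sketch.prep([1.0, None, 3.0])
baked = rec_sketch.bake([None, 5.0])
print(baked)
```

`Center` can only be fitted after `MeanFill` has removed the missing value, which is why the order of the steps matters in this sketch.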

## Baking the recipe
We now bake the recipe. This will apply the steps we added to the recipe to the data in order. The result will be a new DataFrame with the preprocessed data.


In [None]:
baked_df = rec.bake(data=df)
print(baked_df)

Let's bake the recipe with a different DataFrame that has the same schema but more missing values in the "x1" column. The recipe should fill in the missing values with the column mean it learned during prep:


In [None]:
df2 = df.clone()
df2[list(range(1, 9)), "x1"] = None
baked_df2 = rec.bake(data=df2)
print(baked_df2)

This is useful when we want to apply the same preprocessing steps to a test set, for example, to prevent data leakage.
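
The difference between reusing a trained statistic and recomputing it can be shown with a small plain-Python sketch. The values and the helper names `observed_mean` and `fill` are made up for illustration and are not part of ReciPies.

```python
from statistics import mean

# Made-up values standing in for a training and a test column.
train_x1 = [10.0, None, 14.0, 12.0]
test_x1 = [None, 30.0, None]


def observed_mean(values):
    """Mean of the non-missing values."""
    return mean(v for v in values if v is not None)


def fill(values, fill_value):
    """Replace missing values with a fixed fill value."""
    return [fill_value if v is None else v for v in values]


# Learned once on the training data (the analogue of prep).
train_mean = observed_mean(train_x1)

# Reusing the training mean on the test data (the analogue of bake).
no_leak = fill(test_x1, train_mean)

# Recomputing the mean on the test data lets test information shape the preprocessing.
leaky = fill(test_x1, observed_mean(test_x1))

print(no_leak)
print(leaky)
```

In the first result the fill value comes only from the training data, so the test set stays unseen during preprocessing.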