# ReciPys Tutorial

## Imports

In [1]:
import pytest
import numpy as np
import polars as pl
from recipys.recipe import Recipe
from recipys.ingredients import Ingredients
from datetime import datetime, MINYEAR
import pandas as pd

## Creating our data as a Polars DataFrame
We will create a simple dataset to demonstrate the functionality of ReciPys. It contains several datatypes and a temporal aspect: two groups (`id`) with hourly timestamps. We also add some missing values, as this is common in real data.

In [16]:
rand_state = np.random.RandomState(42)
timecolumn = pl.concat([pl.datetime_range(datetime(MINYEAR, 1, 1,0), datetime(MINYEAR, 1, 1,5), "1h", eager=True),
              pl.datetime_range(datetime(MINYEAR, 1, 1,0), datetime(MINYEAR, 1, 1,3), "1h", eager=True)])
df = pl.DataFrame(
{
    "id": [1] * 6 + [2] * 4,
    "time": timecolumn,
    "y": rand_state.normal(size=(10,)),
    "x1": rand_state.normal(loc=10, scale=5, size=(10,)),
    "x2": rand_state.binomial(n=1, p=0.3, size=(10,)),
    "x3": pl.Series(["a", "b", "c", "a", "c", "b", "c", "a", "b", "c"],dtype=pl.Categorical),
    "x4": pl.Series(["x", "y", "y", "x", "y", "y", "x", "x", "y", "x"],dtype=pl.Categorical),
}
)
df[[1, 2, 4, 7], "x1"] = np.nan

In [17]:
df

| id | time | y | x1 | x2 | x3 | x4 |
| --- | --- | --- | --- | --- | --- | --- |
| i64 | datetime[μs] | f64 | f64 | i32 | cat | cat |
| 1 | 0001-01-01 00:00:00 | 0.496714 | 7.682912 | 0 | "a" | "x" |
| 1 | 0001-01-01 01:00:00 | -0.138264 | | 1 | "b" | "y" |
| 1 | 0001-01-01 02:00:00 | 0.647689 | | 0 | "c" | "y" |
| 1 | 0001-01-01 03:00:00 | 1.52303 | 0.433599 | 0 | "a" | "x" |
| 1 | 0001-01-01 04:00:00 | -0.234153 | | 0 | "c" | "y" |
| 1 | 0001-01-01 05:00:00 | -0.234137 | 7.188562 | 0 | "b" | "y" |
| 2 | 0001-01-01 00:00:00 | 1.579213 | 4.935844 | 0 | "c" | "x" |
| 2 | 0001-01-01 01:00:00 | 0.767435 | | 0 | "a" | "x" |
| 2 | 0001-01-01 02:00:00 | -0.469474 | 5.45988 | 0 | "b" | "y" |
| 2 | 0001-01-01 03:00:00 | 0.54256 | 2.938481 | 1 | "c" | "x" |


## Creating Ingredients
To get started, we need to create an ingredients object. This object will be used to create a recipe.

In [18]:
ing = Ingredients(df)

The ingredients object should also record the roles of the columns. Roles determine how each column may be processed. For example, the column "y" can be given the role of outcome, which lets us refer later to all outcome columns at once when deciding what to do with them:

In [19]:
roles = {"y": ["outcome"]}
ing = Ingredients(df, copy=False, roles=roles)
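To make the idea of roles concrete, here is a minimal sketch in plain Python that does not depend on ReciPys. The `roles` dictionary extends the one above with further columns of the example data; `columns_with_role` is a hypothetical helper written only for illustration and is not part of the ReciPys API.

```python
# Illustrative only: a plain-Python model of column roles, not ReciPys internals.
roles = {
    "y": ["outcome"],
    "x1": ["predictor"],
    "x2": ["predictor"],
    "id": ["group"],
    "time": ["sequence"],
}


def columns_with_role(roles, role):
    """Return the column names tagged with `role`, in insertion order."""
    return [col for col, col_roles in roles.items() if role in col_roles]


outcome_cols = columns_with_role(roles, "outcome")
predictor_cols = columns_with_role(roles, "predictor")
print(outcome_cols)
print(predictor_cols)
```

Storing the roles as a list per column allows a column to carry more than one role, which is presumably why the tutorial writes `["outcome"]` rather than a bare string.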

## Creating a recipe
We can also create a recipe directly and specify the roles as arguments at instantiation. A recipe always needs an ingredients object. Optionally, it also takes the outcome (target) columns, the predictor (feature) columns, the group columns, and the sequence (time) column.

In [20]:
ing = Ingredients(df)
rec = Recipe(ing, outcomes=["y"], predictors=["x1", "x2", "x3", "x4"], groups=["id"], sequences=["time"])

In [21]:
rec

Recipe

Inputs:

shape: (4, 2)
┌───────────┬────────────┐
│ role      ┆ #variables │
│ ---       ┆ ---        │
│ str       ┆ i64        │
╞═══════════╪════════════╡
│ outcome   ┆ 1          │
│ predictor ┆ 4          │
│ group     ┆ 1          │
│ sequence  ┆ 1          │
└───────────┴────────────┘

Operations:


We see that no operations are defined yet. We have to add steps to our recipe to define what we want to do with the data. First, however, we need a way to select which columns each step should apply to.


## Selectors

In [26]:
from recipys.selector import all_numeric_predictors
all_numeric_predictors()

all numeric predictors
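Conceptually, a selector such as this one picks out the columns that satisfy a condition on both role and datatype. The following is a rough sketch in plain Python, not the ReciPys implementation: the `dtypes` and `roles` dictionaries mirror the example DataFrame and recipe above, `select_numeric_predictors` is a hypothetical helper, and it assumes that integer columns count as numeric.

```python
# Illustrative only: a plain-Python model of a selector, not ReciPys internals.
dtypes = {
    "id": "i64", "time": "datetime", "y": "f64",
    "x1": "f64", "x2": "i32", "x3": "cat", "x4": "cat",
}
roles = {
    "y": "outcome", "id": "group", "time": "sequence",
    "x1": "predictor", "x2": "predictor", "x3": "predictor", "x4": "predictor",
}
# Assumption: integer and float dtypes are both treated as numeric.
NUMERIC_DTYPES = {"i32", "i64", "f64"}


def select_numeric_predictors(dtypes, roles):
    """Return columns that are predictors and have a numeric dtype."""
    return [
        col
        for col, dtype in dtypes.items()
        if roles.get(col) == "predictor" and dtype in NUMERIC_DTYPES
    ]


selected = select_numeric_predictors(dtypes, roles)
print(selected)
```

Under these assumptions, the categorical predictors `x3` and `x4` are excluded, as is the numeric column `y`, because it is an outcome rather than a predictor.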

## Adding steps
Let's preprocess our data! First: we know that there is some missing data in our predictors. We can easily add a step to fill in the missing values with the mean of the column.

In [None]:
from recipys.selector import all_numeric_predictors
from recipys.step import StepImputeFill

rec.add_step(StepImputeFill(sel=all_numeric_predictors(), strategy="mean"))
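As a rough illustration of what mean imputation does, the sketch below fills the missing entries of `x1` with the mean of the observed values, using plain Python. The values are copied from the DataFrame output above. `impute_mean` is a hypothetical helper, and the sketch uses a single mean over the whole column, which is an assumption: how `StepImputeFill` handles the groups defined in the recipe is not shown in this tutorial.

```python
from statistics import mean

# x1 as shown in the DataFrame output above; None marks a missing value.
x1 = [7.682912, None, None, 0.433599, None,
      7.188562, 4.935844, None, 5.45988, 2.938481]


def impute_mean(values):
    """Replace each None with the mean of the non-missing values."""
    observed = [v for v in values if v is not None]
    fill_value = mean(observed)
    return [fill_value if v is None else v for v in values]


x1_filled = impute_mean(x1)
print(x1_filled)
```

Observed values are left unchanged, and every missing entry receives the same fill value.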
