## Linear models with Polars-ols

The Polars-ols plugin lets you fit linear models using Polars expressions. You can install it (along with the `patsy` package for formula support) by uncommenting and running the following cell.

In [None]:
# %pip install polars-ols patsy

We begin by importing polars and polars_ols.



In [None]:
import polars as pl
import polars_ols as pls  # noqa: F401 (imported to register the least_squares namespace)

Polars-ols is a Polars *plugin*. When we import a plugin, it registers its *namespace* with Polars. A namespace is a set of related expressions grouped under a common attribute name. We have already met built-in namespaces such as `dt` for date and time expressions and `str` for string expressions. Polars-ols registers the `least_squares` namespace.

We create a `DataFrame` with a target column `y` that we regress against two predictor columns, `x1` and `x2`.

In [None]:
df = pl.DataFrame(
    {
        "y": [1.16, -2.16, -1.57, 0.21, 0.22, 1.6, -2.11, -2.92, -0.86, 0.47],
        "x1": [0.72, -2.43, -0.63, 0.05, -0.07, 0.65, -0.02, -1.64, -0.92, -0.27],
        "x2": [0.24, 0.18, -0.95, 0.23, 0.44, 1.01, -2.08, -1.36, 0.01, 0.75],
    }
)
df.head()

We start by fitting an ordinary least squares (i.e. vanilla linear regression) model. We specify:
- the target column as `pl.col("y")`
- an ordinary least squares model with the `least_squares.ols` expression
- the predictors as expressions passed as arguments to `least_squares.ols`
- the name of the output column of predictions with `alias`

In [None]:
ols_expr = (
    pl.col("y")
    .least_squares.ols(
        pl.col("x1"), pl.col("x2")
    )
    .alias("ols")
)

We can then add a column with the predictions by passing the expression to `with_columns`

In [None]:
(
    df
    .with_columns(
        ols_expr
    )
)       

### Coefficients
If we want the regression coefficients instead of the predictions, we can set `mode="coefficients"` in the expression. Here we also fit an intercept by setting `add_intercept=True`.

In [None]:
ols_coeff_expr = (
    pl.col("y")
    .least_squares.ols(
        pl.col("x1"), 
        pl.col("x2"),
        mode="coefficients",
        add_intercept=True
    ).alias("ols_intercept")
)

We then get the coefficients as a `pl.Struct` column

In [None]:
(
    df
    .select(
        ols_coeff_expr
    )
)       

The coefficients are ordered as `x1`, `x2` and then the intercept. We can get each coefficient in its own named column if we `unnest` the struct.

In [None]:
(
    df
    .select(
        ols_coeff_expr
    )
    .unnest("ols_intercept")
)       
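As a sanity check, the same fit can be sketched with NumPy. This is an independent cross-check that assumes the model is plain unregularised least squares with an intercept; it does not call Polars-ols, and the variable names are chosen for illustration.

```python
import numpy as np

# Same data as the DataFrame above
y = np.array([1.16, -2.16, -1.57, 0.21, 0.22, 1.6, -2.11, -2.92, -0.86, 0.47])
x1 = np.array([0.72, -2.43, -0.63, 0.05, -0.07, 0.65, -0.02, -1.64, -0.92, -0.27])
x2 = np.array([0.24, 0.18, -0.95, 0.23, 0.44, 1.01, -2.08, -1.36, 0.01, 0.75])

# Design matrix with columns x1, x2 and a column of ones for the intercept
design = np.column_stack([x1, x2, np.ones_like(x1)])

# Solve min ||design @ beta - y||^2
beta, *_ = np.linalg.lstsq(design, y, rcond=None)

# At the least squares solution the residuals are orthogonal to every column
residuals = y - design @ beta
print(dict(zip(["x1", "x2", "intercept"], beta.round(4))))
```

If the assumption holds, the printed values should agree with the unnested coefficients up to numerical precision.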

### Regularised regression

For practical applications of linear regression we often want to apply regularisation to damp the effect of noisy data.

We can do that with:
- Lasso regression (that uses an L1 norm for the regularisation)
- Ridge regression (that uses an L2 norm for the regularisation)
- Elastic net regression (that uses both L1 and L2 norms for the regularisation)

In [None]:
lasso_expr = pl.col("y").least_squares.lasso(pl.col("x1"), pl.col("x2"), alpha=0.0001, add_intercept=True)
ridge_expr = pl.col("y").least_squares.ridge(pl.col("x1"), pl.col("x2"), alpha=0.0001, add_intercept=True)
elastic_expr = pl.col("y").least_squares.elastic_net(pl.col("x1"), pl.col("x2"), alpha=0.0001, l1_ratio=0.5, add_intercept=True)

See the scikit-learn documentation for `Lasso`, `Ridge` and `ElasticNet` for more background on these modelling methods.
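To illustrate what the L2 penalty does, here is a minimal NumPy sketch of the textbook closed-form ridge solution, `(XᵀX + αI)⁻¹ Xᵀy`, without an intercept. The helper `ridge_coefficients` is hypothetical, and libraries can scale `alpha` differently, so the numbers need not match Polars-ols or scikit-learn exactly.

```python
import numpy as np


def ridge_coefficients(X: np.ndarray, y: np.ndarray, alpha: float) -> np.ndarray:
    # Textbook ridge solution: solve (X'X + alpha * I) beta = X'y
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)


y = np.array([1.16, -2.16, -1.57, 0.21, 0.22, 1.6, -2.11, -2.92, -0.86, 0.47])
X = np.column_stack(
    [
        [0.72, -2.43, -0.63, 0.05, -0.07, 0.65, -0.02, -1.64, -0.92, -0.27],
        [0.24, 0.18, -0.95, 0.23, 0.44, 1.01, -2.08, -1.36, 0.01, 0.75],
    ]
)

beta_ols = ridge_coefficients(X, y, alpha=0.0)  # no penalty: plain least squares
beta_ridge = ridge_coefficients(X, y, alpha=10.0)  # strong penalty: shrunk coefficients
print(beta_ols.round(4), beta_ridge.round(4))
```

With `alpha=0.0` the result reduces to ordinary least squares, and increasing `alpha` shrinks the coefficients towards zero. This is why a very small value such as `alpha=0.0001` gives predictions close to the unregularised fit.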

We now make predictions with these models

In [None]:
(
    df
    .with_columns(
        lasso_expr.round(3).alias("predictions_lasso"),
        ridge_expr.round(3).alias("predictions_ridge"),
        elastic_expr.round(3).alias("predictions_elastic"),
    )
)


> I've compared the results of the polars-ols Elastic Net model with the results from the Scikit-learn library in my production pipelines and they closely match.


## Fitting models by groups
We may want to fit a different model for different subgroups of the data. First we make a new `DataFrame` with a `groups` column

In [None]:
df_groups = pl.DataFrame(
    {
        "y": [1.16, -2.16, -1.57, 0.21, 0.22, 1.6, -2.11, -2.92, -0.86, 0.47],
        "x1": [0.72, -2.43, -0.63, 0.05, -0.07, 0.65, -0.02, -1.64, -0.92, -0.27],
        "x2": [0.24, 0.18, -0.95, 0.23, 0.44, 1.01, -2.08, -1.36, 0.01, 0.75],
        "groups":[0]*5 + [1]*5
    }
)
df_groups.head()

We can then fit a separate model for each group by adding `over` to the expression.

In [None]:
ols_groups_expr = (
    pl.col("y")
    .least_squares.ols(
        pl.col("x1"), 
        pl.col("x2")
    )
    .over("groups")
    .alias("ols")
)

In [None]:
(
    df_groups
    .with_columns(
        ols_groups_expr
    )
)       

## Making predictions on new data
In the examples above we made predictions on the same data that we used to train the model. We see here how we can fit a model on one set of data and make predictions on another.

First we need to fit a model to get the coefficients. We use the basic `ols` model again

In [None]:
ols_coef_expr = (
    pl.col("y")
    .least_squares.ols(pl.col("x1"), pl.col("x2"), mode="coefficients")
    .alias("coef")
)


We can use this to make a `DataFrame` of variable names and coefficients.

In [None]:
(
    df
    .select(
        ols_coef_expr
    )
    .unnest("coef")
    .melt()
)

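Now we can add a column with predictions made from the coefficient `DataFrame`. The general flow is that we start with the original `DataFrame`, reshape the predictors into long format, join the coefficients on the variable name, multiply each predictor by its coefficient and sum the products for each row.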

In [None]:
(
    df
    .with_row_index()
    .pipe(
        lambda df: (
            df
            .join(
                df
                .select("index","x1","x2")
                .melt(id_vars="index")
                .join(
                    (
                        df
                        .select(
                            ols_coef_expr
                        )
                        .unnest("coef")
                        .melt()
                    ),
                    on="variable",
                )
                .with_columns(
                    # After the join the coefficients are in "value_right"
                    pred = (pl.col("value")*pl.col("value_right"))
                )
                .group_by("index")
                .agg(pl.col("pred").sum()),
                on="index"
            )
        )
    )
)

Now we can make a class to fit the model and then make predictions on new data. The general flow is:
- initialise the class with the model fit expression
- fit the model in `fit` to get the coefficients
- make predictions by:
    - adding a row index to keep track of which row data came from
    - piping the output to a function that joins the predictions
    - in the join select the predictor columns along with the row index
    - `melt` the predictor columns
    - join the coefficients
    - multiply the predictors by the coefficients
    - `group_by` to gather the predictors back up into rows
    - `agg` to get the sum of the predictors for the total prediction for each row

In [None]:
class LinearRegressor:
    def __init__(
        self,
        model_expr: pl.Expr = pl.col("y")
        .least_squares.ols(pl.col("x1"), pl.col("x2"), mode="coefficients")
        .alias("coef"),
    ):
        self.model_expr = model_expr

    def fit(self, X: pl.DataFrame):
        # Fit the model on X and save the coefficients in long format,
        # with one row per predictor in the columns "variable" and "coef"
        self._coef = (
            X.select(self.model_expr.alias("coef"))
            .unnest("coef")
            .melt(value_name="coef")
        )
        return self

    def transform(self, X: pl.DataFrame):
        # The predictor names come from the fitted coefficients
        predictors = self._coef["variable"].to_list()
        # Make predictions using the saved coefficients
        return (
            X
            # Add a row index
            .with_row_index()
            .pipe(
                # Join the predictions
                lambda X: X.join(
                    # Select the predictor columns
                    X.select("index", *predictors)
                    # Melt (so we can join the coefficients)
                    .melt(id_vars="index", value_name="predictor")
                    # Join the coefficients saved in fit
                    .join(self._coef, on="variable")
                    # Multiply the predictors by the coefficients
                    .with_columns(pred=(pl.col("predictor") * pl.col("coef")))
                    # Gather back up into rows
                    .group_by("index")
                    .agg(pl.col("pred").sum()),
                    on="index",
                )
            )
            .sort("index")
        )


Now we make train and test `DataFrames`

In [None]:
df_train = df[:7]
df_test = df[7:]

We then: 
- instantiate the model
- `fit` the model on `df_train`
- make predictions on `df_test`

In [None]:
linear_regressor = LinearRegressor()
linear_regressor.fit(X=df_train)
linear_regressor.transform(X=df_test)

More material to come on this excellent new package!

See the repo page for more: https://github.com/azmyrajab/polars_ols/