# Kaggle Example: Store Item Demand Forecasting Challenge

Follow this example notebook to see how easily you can boost your ML tasks with Upgini. We will enrich a dataset with relevant features and build a better model on them.

If you don't have our library yet, you can install it now. You can also install CatBoost for the last part of this demonstration.

In [None]:
%pip install -Uq upgini catboost

## Prepare the input data

For this demo we will use the train dataset from the [Store Item Demand Forecasting Challenge](https://www.kaggle.com/c/demand-forecasting-kernels-only). You can download it from [here](https://www.kaggle.com/c/demand-forecasting-kernels-only/data?select=train.csv) or get it from [our repo](https://github.com/upgini/upgini/raw/main/notebooks/train.csv.zip).

To speed up the search, let's take a random sample of 7,000 rows.

In [1]:
from os.path import exists
import pandas as pd

df_path = "train.csv.zip" if exists("train.csv.zip") else "https://github.com/upgini/upgini/raw/main/notebooks/train.csv.zip"
df = pd.read_csv(df_path)
df = df.sample(n=7_000, random_state=0)
df["date"] = pd.to_datetime(df["date"])
df.head()

Unnamed: 0,date,store,item,sales
335813,2017-07-14,4,19,56
630838,2015-05-19,6,35,45
365685,2014-05-01,1,21,48
322781,2016-11-06,7,18,85
151590,2013-02-02,4,9,46


This dataset contains 5 years of records from 2013 to 2017. Let's split it into the train (2013–2016) and the evaluation (2017) parts.

In [2]:
train = df[df["date"] < "2017-01-01"]
test = df[df["date"] >= "2017-01-01"]

Let's also separate features from targets for future use.

In [3]:
train_features = train.drop(columns=["sales"])
train_target = train["sales"]
test_features = test.drop(columns=["sales"])
test_target = test["sales"]
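Because the split is by date rather than random, it can be worth confirming that the two parts do not overlap in time. The following is a minimal sketch on a small hypothetical frame; the rows and the names `toy`, `toy_train`, `toy_test` and `cutoff` are made up for illustration and are not taken from `train.csv`.

```python
import pandas as pd

# Hypothetical toy data standing in for the sampled `df` above.
toy = pd.DataFrame(
    {
        "date": pd.to_datetime(
            ["2016-12-30", "2016-12-31", "2017-01-01", "2017-01-02"]
        ),
        "store": [1, 2, 1, 2],
        "item": [5, 5, 7, 7],
        "sales": [10, 12, 9, 14],
    }
)

# Same cutoff as in the notebook: everything before 2017 is train.
cutoff = pd.Timestamp("2017-01-01")
toy_train = toy[toy["date"] < cutoff]
toy_test = toy[toy["date"] >= cutoff]

# Every row lands in exactly one part, and train ends before test begins.
assert len(toy_train) + len(toy_test) == len(toy)
assert toy_train["date"].max() < toy_test["date"].min()
```

The same two assertions can be run against `train` and `test` from the cell above.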

## Search for relevant features with FeaturesEnricher

Next, we will use FeaturesEnricher on the train dataset to find the features best suited for predicting this particular target. To do this we need to specify the column containing dates and provide the target to predict. We can also pass any number of additional datasets on which to evaluate the features. We will use our test dataset to get the evaluation metrics.

In [5]:
from upgini import FeaturesEnricher, SearchKey

enricher = FeaturesEnricher(
    search_keys={"date": SearchKey.DATE},
    keep_input=True,
)
enricher.fit(train_features, train_target, eval_set=[(test_features, test_target)])

Unnamed: 0,Key,Status,Description
0,date,All valid,All values in this column are good to go
1,target,All valid,All values in this column are good to go


Running f7fd28ca-ae0a-4335-abbc-4108a957d6d5 search request.
We'll email you once it's completed. Please wait a few minutes.
Quality metrics


Unnamed: 0,match rate,rmse,uplift
train,100.0,10.488501,5.47725
eval 1,100.0,13.620399,5.202253



Following features was used for accuracy uplift estimation: store, item


In our case the task is auto-detected as a regression. Hence the metric to optimize is auto-selected as RMSE.

In the output you see RMSE values for the train dataset (using cross-validation) and for every evaluation dataset we have provided. There are also match rate values (a percent share of rows enriched with features) and uplift values (a relative improvement in RMSE for the enriched dataset over the initial dataset).
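To make these quantities concrete, here is a small sketch of RMSE and of one common way to express a relative improvement between two error values. The helper names `rmse` and `relative_improvement_pct` and the toy numbers are hypothetical, and whether the `uplift` column is computed with exactly this formula is an assumption, not something this notebook confirms.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error over two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def relative_improvement_pct(baseline_error, enriched_error):
    """Percent reduction of the error relative to the baseline.

    Assumed formula for illustration only; the library may define
    its uplift differently.
    """
    return 100.0 * (baseline_error - enriched_error) / baseline_error


# Made-up predictions from a baseline model and an enriched model.
actual = [10.0, 20.0, 30.0]
baseline_preds = [14.0, 16.0, 34.0]  # each prediction is off by 4
enriched_preds = [12.0, 18.0, 32.0]  # each prediction is off by 2

baseline_rmse = rmse(actual, baseline_preds)
enriched_rmse = rmse(actual, enriched_preds)
improvement = relative_improvement_pct(baseline_rmse, enriched_rmse)
```

With these toy values the baseline RMSE is 4, the enriched RMSE is 2, and the relative improvement is 50%.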

Here we can see an uplift both on the cross-validation (about 5.48) and on the out-of-time validation dataset (about 5.20).

## Get the features and test them locally

Finally, we can enrich our datasets with the features found and use them in our own ML pipelines. Let's enrich both the train and the test datasets.

In [6]:
enriched_train_features = enricher.transform(train_features)
enriched_test_features = enricher.transform(test_features)
enriched_train_features.head()

Unnamed: 0,Key,Status,Description
0,date,All valid,All values in this column are good to go


Running d4820c74-21e2-4ce6-bc78-b3809b55179f search request.
We'll email you once it's completed. Please wait a few minutes.
Executing transform step...


Unnamed: 0,Key,Status,Description
0,date,All valid,All values in this column are good to go


Running be396cf8-e41b-43f4-88cb-15173ddfe054 search request.
We'll email you once it's completed. Please wait a few minutes.
Executing transform step...


Unnamed: 0,date,store,item,f_b2a0ab34eba8a027,f_aa56f3d319a74c78,f_ef51816499755030,f_6d3bb2c253012f04,f_3337640c6521dd13,f_60091eb7636849ce,f_dcc56f96529a93a4,...,f_4a1d42a953f35713,f_497dd3e95bca8ed1,f_25e8d342b2bdfd7e,f_ef819c0cc6941e1c,f_483435d688c2ba47,f_15f177b331e0a943,f_c934ce6d892f4739,f_2a5a3453dda1e638,f_cb04ccb306995c33,f_1d868ee9e3a2c974
630838,2015-05-19,6,35,0.319779,0.570318,0.030312,19,0.395049,5,0.378082,...,1.059475,1.17051,0.88291,1.003429,1.076434,1.207199,12.85,1.008182,0.879191,1.023048
365685,2014-05-01,1,21,0.369156,0.536414,0.030364,1,0.331484,5,0.328767,...,0.982994,0.968942,0.72102,0.997813,0.973129,0.942934,13.25,0.960941,0.950444,1.003639
322781,2016-11-06,7,18,0.572214,0.178637,0.028914,6,0.251106,11,0.846995,...,1.009348,0.99514,0.9007,0.996151,1.003539,0.989861,22.51,1.090073,1.250052,1.403671
151590,2013-02-02,4,9,0.444276,0.180064,0.029806,2,0.265788,2,0.087671,...,0.985821,1.005008,0.7347,0.992474,0.953622,0.972042,12.9,0.958904,0.783288,0.727013
572011,2014-04-19,4,32,0.439849,0.337163,0.028215,19,0.343071,4,0.29589,...,0.98274,0.96806,0.724,1.00132,0.971869,0.946664,13.36,0.907873,1.013456,0.98678


Here, we've got several dozen extra features in addition to our initial columns. They should improve the quality of our model.

First, we will fit a CatBoost model on the initial train dataset and evaluate the SMAPE metric on the corresponding test dataset.

In [7]:
from catboost import CatBoost
from catboost.utils import eval_metric

model = CatBoost({"cat_features": ["store", "item"], "verbose": False, "allow_writing_files": False})
model.fit(train_features, train_target)
preds = model.predict(test_features)
eval_metric(test_target.values, preds, "SMAPE")

[35.52217301530647]
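For readers unfamiliar with SMAPE (symmetric mean absolute percentage error), the sketch below implements its common definition: the mean of `|a - p| / ((|a| + |p|) / 2)`, expressed in percent. The function name `smape` is hypothetical, and it is assumed, not verified here, that this matches what `eval_metric(..., "SMAPE")` computes; treating a zero denominator as zero error is a choice made in this sketch.

```python
def smape(actual, predicted):
    """Symmetric mean absolute percentage error, in percent."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    terms = []
    for a, p in zip(actual, predicted):
        denom = (abs(a) + abs(p)) / 2
        # Sketch-level choice: when both values are zero, count no error.
        terms.append(0.0 if denom == 0 else abs(a - p) / denom)
    return 100.0 * sum(terms) / len(terms)


# Made-up values: predictions off by 10 and by 20.
example_score = smape([100, 200], [110, 180])
perfect_score = smape([1, 2, 3], [1, 2, 3])
```

Lower is better: a perfect prediction scores 0, and the metric is bounded above by 200.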

Next, we will fit the same model on the enriched train dataset and evaluate the SMAPE metric on the corresponding test dataset.

In [8]:
enriched_model = CatBoost({"cat_features": ["store", "item"], "verbose": False, "allow_writing_files": False})
enriched_model.fit(enriched_train_features, train_target)
enriched_preds = enriched_model.predict(enriched_test_features)
eval_metric(test_target.values, enriched_preds, "SMAPE")

[14.305751461690585]

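You see a much better result after the enrichment: SMAPE on the test dataset drops from about 35.5 to about 14.3, a relative reduction of roughly 60%. That's the benefit of enriching the data with our library.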