# Quick Start guide: Kaggle Store Item Demand Forecasting Challenge

This guide shows how to search for new relevant features with the Upgini low-code library. We will enrich a dataset with new features and significantly improve model accuracy, all in 4 simple steps.

First, let's install the latest version of the Upgini library. We'll also need CatBoost for the last part of this guide.

In [1]:
%pip install -Uq upgini catboost


## 1. Prepare the input data

For this demo we will use the train dataset from the [Store Item Demand Forecasting Challenge](https://www.kaggle.com/c/demand-forecasting-kernels-only). You can download it from [Kaggle](https://www.kaggle.com/c/demand-forecasting-kernels-only/data?select=train.csv) or get it from [our repo](https://github.com/upgini/upgini/raw/main/notebooks/train.csv.zip).

To speed up the search, let's take a random sample of 10,000 rows.

In [2]:
from os.path import exists
import pandas as pd

df_path = "train.csv.zip" if exists("train.csv.zip") else "https://github.com/upgini/upgini/raw/main/notebooks/train.csv.zip"
df = pd.read_csv(df_path)
df = df.sample(n=10_000, random_state=0)
df["store"] = df["store"].astype(str)
df["item"] = df["item"].astype(str)
df["date"] = pd.to_datetime(df["date"])
df.sort_values("date", inplace=True)
df.reset_index(inplace=True, drop=True)
df.head()

|   | date | store | item | sales |
|---|---|---|---|---|
| 0 | 2013-01-01 | 10 | 21 | 33 |
| 1 | 2013-01-01 | 5 | 24 | 26 |
| 2 | 2013-01-01 | 3 | 27 | 11 |
| 3 | 2013-01-02 | 9 | 7 | 24 |
| 4 | 2013-01-02 | 6 | 40 | 9 |


This dataset contains 5 years of records from 2013 to 2017. Let's split it into the train (2013–2016) and the evaluation (2017) parts.

In [3]:
train = df[df["date"] < "2017-01-01"]
test = df[df["date"] >= "2017-01-01"]

Let's also separate features from targets for future use.

In [4]:
train_features = train.drop(columns=["sales"])
train_target = train["sales"]
test_features = test.drop(columns=["sales"])
test_target = test["sales"]

## 2. Search relevant features with FeaturesEnricher

Next, we will use `FeaturesEnricher` on the train dataset to find the features best suited to predicting this particular target.  
To do this, we need to specify the column(s) containing the **search key(s)**, in this case `date`, and provide the target to predict.  
We can also pass any number of additional out-of-time validation datasets to evaluate the robustness of the features. We will use our test dataset to get the evaluation metrics.  
The search step will take around *2 minutes*.

In [18]:
from upgini import FeaturesEnricher, SearchKey
from upgini.metadata import CVType

enricher = FeaturesEnricher(
    search_keys={
        "date": SearchKey.DATE,
    },
    keep_input=True,
    cv=CVType.time_series,
)
enricher.fit(train_features, train_target, eval_set=[(test_features, test_target)])

Detected task type: ModelTaskType.REGRESSION


|   | Column name | Status | Description |
|---|---|---|---|
| 0 | target | All valid | All values in this column are good to go |
| 1 | date | All valid | All values in this column are good to go |


Running search request with search_id=a4cec6b1-e5e4-4e2e-9664-1f08cd43b6a8
We'll send email notification once it's completed, just use your personal api_key from profile.upgini.com
Done

We found 12 useful feature(s) for you by search keys: ['date', 'country_iso_code']


|   | feature_name | shap_value | coverage % | type |
|---|---|---|---|---|
| 0 | item | 0.434272 | 100.0 | CHARACTER |
| 1 | store | 0.16374 | 100.0 | CHARACTER |
| 2 | f_weather_pca_0_94efd18d | 0.097509 | 100.0 | NUMERIC |
| 3 | f_year_cos1_cd165f8c | 0.016779 | 100.0 | NUMERIC |
| 4 | f_payment_fraud_score_3cae9c42 | 0.015332 | 100.0 | NUMERIC |
| 5 | f_week_sin1_a71d22f6 | 0.014975 | 100.0 | NUMERIC |
| 6 | f_week_cos1_d3d56d7f | 0.012109 | 100.0 | NUMERIC |
| 7 | f_c2c_fraud_score_5028232e | 0.010888 | 100.0 | NUMERIC |
| 8 | f_cpi_pca_2_3c36cd6c | 0.010555 | 100.0 | NUMERIC |
| 9 | f_finance_umap_0_ad818bcb | 0.008208 | 100.0 | NUMERIC |


This search task is auto-detected as regression.  
Since this is a typical time series prediction task (daily sales as the target variable), we pass the time-series-specific cross-validation parameter `CVType.time_series`, so the search algorithm knows it is working with a time series.  
We've got **14 relevant features (including 2 initial features)**, which are expected to improve the accuracy of the model. They are ranked by [SHAP values](https://en.wikipedia.org/wiki/Shapley_value).
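
To make the idea of time series cross-validation concrete, here is a minimal pure-Python sketch of expanding-window splits, where each fold trains on earlier rows and validates on the rows that follow. The helper `expanding_window_splits` is hypothetical and written only for illustration. It is modeled on the splitting rule of scikit-learn's `TimeSeriesSplit` with default settings; this guide does not say how Upgini implements `CVType.time_series` internally.

```python
def expanding_window_splits(n_samples, n_splits=3):
    """Yield (train_indices, test_indices) pairs for rows already sorted by date."""
    if n_splits < 1 or n_samples <= n_splits:
        raise ValueError("need at least one split and more samples than splits")
    # Every validation block has the same size; the training window grows.
    test_size = n_samples // (n_splits + 1)
    first_test_start = n_samples - n_splits * test_size
    for test_start in range(first_test_start, n_samples, test_size):
        train_idx = list(range(0, test_start))
        test_idx = list(range(test_start, test_start + test_size))
        yield train_idx, test_idx


# Illustrative run on 12 rows (a stand-in for date-sorted sales records).
for train_idx, test_idx in expanding_window_splits(12, n_splits=3):
    print(f"train rows 0..{train_idx[-1]} -> validate rows {test_idx[0]}..{test_idx[-1]}")
```

The point of this scheme is that a fold never validates on rows that come before its training rows, which is what makes it suitable for data ordered by date.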

Initial features from the search dataset are checked for relevance as well, so you don't need an extra feature selection step.

## 3. Calculate SMAPE metric and uplift from relevant features
You can use any model estimator with a scikit-learn compatible interface.
Let's take the CatBoost regressor and use [scikit-learn make_scorer](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.make_scorer.html) to define the [SMAPE](https://en.wikipedia.org/wiki/Symmetric_mean_absolute_percentage_error) metric function (the evaluation metric in the [Kaggle competition](https://www.kaggle.com/c/demand-forecasting-kernels-only)).

In [16]:
from catboost import CatBoostRegressor
from catboost.utils import eval_metric
from sklearn.metrics import make_scorer

# model for sales prediction
model = CatBoostRegressor(verbose=False, allow_writing_files=False, random_state=0)

# custom SMAPE scorer function
smape_scorer = make_scorer(
    lambda y_true, y_pred: eval_metric(y_true.values, y_pred, "SMAPE")[0], 
    greater_is_better=False
)
smape_scorer.__name__ = "SMAPE"

# calculate metrics before and after feature enrichment
enricher.calculate_metrics(
    train_features, train_target, 
    eval_set=[(test_features, test_target)],
    estimator=model,
    scoring=smape_scorer
)

|   | match_rate | baseline SMAPE | enriched SMAPE | uplift |
|---|---|---|---|---|
| train | 100.0 | -26.577871 | -16.24334 | 10.334531 |
| eval 1 | 100.0 | -25.713841 | -14.510655 | 11.203186 |


SMAPE for both the train and validation datasets is calculated with the same cross-validation strategy as in `FeaturesEnricher.fit()`, in this example [time series CV](https://scikit-learn.org/stable/modules/cross_validation.html#time-series-split). Don't be confused by the negative SMAPE values: scikit-learn negates the score of any scorer created with `greater_is_better=False`, so that higher is always better.  
We see a strong metric uplift both on cross-validation (*train*) and on the out-of-time validation dataset (*eval 1*) **after enrichment**.
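
As a rough illustration of where these numbers come from, the sketch below computes SMAPE in plain Python using one common definition (the mean of 2·|actual − predicted| / (|actual| + |predicted|), scaled to percent). It then shows the sign flip introduced by `greater_is_better=False` and how the uplift column follows from the two scores. The function `smape` and its handling of zero denominators are assumptions made for this example; CatBoost's own implementation may treat edge cases differently.

```python
def smape(y_true, y_pred):
    """SMAPE in percent, range 0..200 (one common definition)."""
    total = 0.0
    for actual, predicted in zip(y_true, y_pred):
        denominator = abs(actual) + abs(predicted)
        # Assumption: a pair where both values are zero contributes no error.
        if denominator > 0:
            total += 2.0 * abs(actual - predicted) / denominator
    return 100.0 * total / len(y_true)


# Toy values, not taken from the dataset.
error = smape([100, 200], [110, 180])

# A scorer built with greater_is_better=False reports the negated error,
# so that "higher is better" holds for every scorer.
score = -error

# Uplift in the metrics table is the enriched score minus the baseline score
# (train row of the table above).
baseline_score, enriched_score = -26.577871, -16.24334
uplift = enriched_score - baseline_score
print(round(error, 4), round(score, 4), round(uplift, 6))
```

Because both scores are negated errors, a positive uplift means the error went down after enrichment.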

## 4. Enrich datasets with the new features and retrain model

Now we can enrich our datasets with the features found and use them in our own ML pipelines. Let's enrich both the train and the test datasets.  
The enrichment step will take around *2.5 minutes*.

In [8]:
enriched_train_features = enricher.transform(train_features)
enriched_test_features = enricher.transform(test_features)
enriched_train_features.head()

74.43151% of the rows are fully duplicated


|   | Column name | Status | Description |
|---|---|---|---|
| 0 | date | All valid | All values in this column are good to go |


Running search request with search_id=e0865d01-5a92-492c-a16a-4c2c343672df
We'll send email notification once it's completed, just use your personal api_key from profile.upgini.com
Done

Executing transform step
Done
74.55830% of the rows are fully duplicated


|   | Column name | Status | Description |
|---|---|---|---|
| 0 | date | All valid | All values in this column are good to go |


Running search request with search_id=50736c3b-0f04-46d9-87a8-765efb1db0a9
We'll send email notification once it's completed, just use your personal api_key from profile.upgini.com
Done

Executing transform step
Done


|   | date | store | item | f_weather_pca_0_94efd18d | f_year_cos1_cd165f8c | f_payment_fraud_score_3cae9c42 | f_week_sin1_a71d22f6 | f_week_cos1_d3d56d7f | f_c2c_fraud_score_5028232e | f_cpi_pca_2_3c36cd6c | f_finance_umap_0_ad818bcb | f_credit_default_score_05229fa7 | f_italy_match_cnt_fdb09b71 | f_finance_umap_1_15890450 | f_weather_umap_30_98fa4f7d |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2013-01-01 | 10 | 21 | 28.661328 | 0.98522 | 0.232837 | 0.781831 | 0.62349 | 0.369604 | -33.814365 | 10.026849 | 0.118754 | 0 | 9.95028 | 3.547175 |
| 1 | 2013-01-01 | 5 | 24 | 28.661328 | 0.98522 | 0.232837 | 0.781831 | 0.62349 | 0.369604 | -33.814365 | 10.026849 | 0.118754 | 0 | 9.95028 | 3.547175 |
| 2 | 2013-01-01 | 3 | 27 | 28.661328 | 0.98522 | 0.232837 | 0.781831 | 0.62349 | 0.369604 | -33.814365 | 10.026849 | 0.118754 | 0 | 9.95028 | 3.547175 |
| 3 | 2013-01-02 | 9 | 7 | 28.79589 | 0.982126 | 0.115787 | 0.974928 | -0.222521 | 0.277366 | -33.814365 | 10.075461 | 0.050849 | 0 | 9.880929 | 3.400228 |
| 4 | 2013-01-02 | 6 | 40 | 28.79589 | 0.982126 | 0.115787 | 0.974928 | -0.222521 | 0.277366 | -33.814365 | 10.075461 | 0.050849 | 0 | 9.880929 | 3.400228 |
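
The `f_week_sin1_...` and `f_week_cos1_...` values in the sample rows above look like a cyclical encoding of the day of the week. This is an inference from the printed numbers, not documented Upgini behaviour. The helper `week_cycle` below is hypothetical and assumes Monday is day 0 and a period of 7 days.

```python
import math
from datetime import date


def week_cycle(day):
    """Map a date to (sin, cos) of its position in the 7-day week."""
    angle = 2.0 * math.pi * day.weekday() / 7.0  # Monday -> 0, Sunday -> 6
    return math.sin(angle), math.cos(angle)


# The two dates that appear in the sample rows.
for day in (date(2013, 1, 1), date(2013, 1, 2)):
    sin_value, cos_value = week_cycle(day)
    print(day, round(sin_value, 6), round(cos_value, 6))
```

Under these assumptions the two dates reproduce the sin/cos pairs shown in the table (0.781831, 0.62349 and 0.974928, -0.222521). An encoding like this keeps Sunday and Monday close together, which a plain 0 to 6 integer does not.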


We've got the new features and are ready to retrain the model.

In [12]:
model.fit(enriched_train_features, train_target)
enriched_preds = model.predict(enriched_test_features)
eval_metric(test_target.values, enriched_preds, "SMAPE")

[15.569061045902828]

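After feature search and enrichment, SMAPE on the 2017 test set is about 15.57, well below the baseline of about 25.7 reported for the evaluation set in step 3. That is a much better result, achieved in 4 simple steps.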