# Quick Start guide: Search for new features for the Kaggle Store Item Demand Forecasting Challenge

In this guide you'll learn how to search for new relevant features with the Upgini low-code library. We will enrich a dataset with new features and improve model accuracy: in this example, SMAPE on the 2017 hold-out data drops from about 37.7 to 14.6. All in 4 simple steps.  
Time needed: 15 minutes.  

First, let's install the latest version of the Upgini library. We'll also need CatBoost for the last part of this guide.

In [None]:
%pip install -Uq upgini catboost

## 1️⃣ Prepare input data

For this guide we'll use the train dataset from [Store Item Demand Forecasting Challenge](https://www.kaggle.com/c/demand-forecasting-kernels-only). You can download it from [here](https://www.kaggle.com/c/demand-forecasting-kernels-only/data?select=train.csv).  
To speed up the search we'll take a random subsample of 19,000 rows.  
⚠️ All date/datetime columns in the input dataset should be converted to pandas datetime objects so that dates are represented correctly.

In [2]:
from os.path import exists
import pandas as pd

df_path = "train.csv.zip" if exists("train.csv.zip") else "https://github.com/upgini/upgini/raw/main/notebooks/train.csv.zip"
df = pd.read_csv(df_path)
df = df.sample(n=19_000, random_state=0)
df["store"] = df["store"].astype(str)
df["item"] = df["item"].astype(str)

# Convert date column to datetime pandas object
df["date"] = pd.to_datetime(df["date"])

df.sort_values("date", inplace=True)
df.reset_index(inplace=True, drop=True)
df.head()

Unnamed: 0,date,store,item,sales
0,2013-01-01,7,5,5
1,2013-01-01,4,9,19
2,2013-01-01,1,33,37
3,2013-01-01,3,41,14
4,2013-01-01,5,24,26
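
The datetime conversion mentioned above can hide malformed values if it is done without checks. The following is a minimal sketch on toy data (the `raw` frame and its values are made up for illustration) that parses with an explicit format and counts the values that could not be parsed:

```python
import pandas as pd

# Toy data, not the Kaggle dataset: one value is deliberately malformed
raw = pd.DataFrame(
    {"date": ["2013-01-01", "2013-01-02", "not a date"], "sales": [5, 19, 37]}
)

# errors="coerce" turns unparseable strings into NaT instead of raising
raw["date"] = pd.to_datetime(raw["date"], format="%Y-%m-%d", errors="coerce")

# Count the rows that failed to parse so they can be inspected or dropped
n_bad = int(raw["date"].isna().sum())
print(raw["date"].dtype, n_bad)
```

Whether to drop or repair the unparsed rows depends on your data; the point is only to check the conversion before running a search keyed on the date column.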


This dataset contains 5 years of records from 2013 to 2017. Let's split it into the train (2013–2016) and the evaluation (2017) parts.

In [3]:
train = df[df["date"] < "2017-01-01"]
test = df[df["date"] >= "2017-01-01"]
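
A date-based split is only useful as an out-of-time check if no evaluation date leaks into the train part. Here is a small sketch of that sanity check on toy data (the `toy` frame and the `toy_train`/`toy_test` names are illustrative, not part of the notebook):

```python
import pandas as pd

# Toy frame around the same cutoff date used in the guide
toy = pd.DataFrame(
    {
        "date": pd.to_datetime(
            ["2016-12-30", "2016-12-31", "2017-01-01", "2017-01-02"]
        ),
        "sales": [10, 12, 11, 13],
    }
)

cutoff = pd.Timestamp("2017-01-01")
toy_train = toy[toy["date"] < cutoff]
toy_test = toy[toy["date"] >= cutoff]

# Every train date must precede every test date, and no row may be lost
no_overlap = toy_train["date"].max() < toy_test["date"].min()
all_rows_kept = len(toy_train) + len(toy_test) == len(toy)
print(no_overlap, all_rows_kept)
```

The same two checks can be run on the real `train` and `test` frames right after the split.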

Let's also separate features from targets for future use.

In [4]:
train_features = train.drop(columns=["sales"])
train_target = train["sales"]
test_features = test.drop(columns=["sales"])
test_target = test["sales"]

## 2️⃣ Search new relevant features with FeaturesEnricher

Next, we will use FeaturesEnricher on the train dataset to find new features relevant to this particular target prediction.  
* To do this, we need to specify the column(s) containing [**search key(s)**](https://github.com/upgini/upgini#-search-key-types-we-support-more-is-coming) (in this case, `date`) and provide the target to predict.  
* We can also specify any number of additional out-of-time validation datasets to evaluate the robustness of the new features.  
* The search task will be auto-detected as regression. Because this is a time series prediction (daily sales is the target variable), we have to pass the [**time series specific cross-validation parameter**](https://github.com/upgini/upgini#-time-series-prediction-support) `CVType.time_series`. The search algorithm then knows that we are working on a time series prediction task rather than a plain regression, and will use [time series CV](https://scikit-learn.org/stable/modules/cross_validation.html#time-series-split) for the new features search.  
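
To make the idea of time series CV concrete, here is a minimal sketch using scikit-learn's `TimeSeriesSplit`, the scheme the link above describes. It illustrates the general technique only; how Upgini configures its folds internally is not shown in this guide, so the fold count and sizes below are assumptions.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# 12 toy rows, assumed to be already sorted by date
X = np.arange(12).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=3)
folds = list(tscv.split(X))

for train_idx, valid_idx in folds:
    # Each fold trains on the past and validates on the rows that follow
    print(
        f"train rows {train_idx[0]}..{train_idx[-1]} -> "
        f"validate rows {valid_idx[0]}..{valid_idx[-1]}"
    )
```

Unlike shuffled K-fold, no fold ever validates on rows that come before its training rows, which is why the input needs to be sorted by date first.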

The search step will take around *2.5 minutes*.

In [5]:
from upgini import FeaturesEnricher, SearchKey
from upgini.metadata import CVType

enricher = FeaturesEnricher(
    search_keys = {
      "date": SearchKey.DATE,
    },
    keep_input = True,
    cv = CVType.time_series
)
enricher.fit(train_features, train_target, eval_set=[(test_features, test_target)])

Detected task type: ModelTaskType.REGRESSION


Unnamed: 0,Column name,Status,Description
0,date,All valid,All values in this column are good to go
1,target,All valid,All values in this column are good to go


Running search request with search_id=cf28ebf5-b046-4f39-a7d6-6f4238bdbc3d
We'll send email notification once it's completed, just use your personal api_key from profile.upgini.com
Done

We found 26 useful feature(s) for you by search keys: ['date']


Unnamed: 0,feature_name,shap_value,coverage %,type
0,item,0.488321,100.0,CHARACTER
1,store,0.172208,100.0,CHARACTER
2,f_weather_pca_0_94efd18d,0.104175,100.0,NUMERIC
3,f_week_cos1_d3d56d7f,0.027857,100.0,NUMERIC
4,f_year_cos1_cd165f8c,0.021098,100.0,NUMERIC
5,f_week_sin1_a71d22f6,0.020424,100.0,NUMERIC
6,f_payment_fraud_score_3cae9c42,0.016814,100.0,NUMERIC
7,f_c2c_fraud_score_5028232e,0.015534,100.0,NUMERIC
8,f_dow_jones_89547e1d,0.009432,100.0,NUMERIC
9,f_silver_d4264cf9,0.007962,100.0,NUMERIC


We've got **10+ relevant features (including 2 initial features)**, which are expected to improve the accuracy of the model. They are ranked by [SHAP values](https://en.wikipedia.org/wiki/Shapley_value).

Initial features from the search dataset are checked for relevance as well, so you don't need an extra feature selection step.

## 3️⃣ Calculate model metrics and uplift from new relevant features
You can use any model estimator with a scikit-learn compatible interface. Let's take the CatBoost regressor.  
There are two options for the evaluation metric:
* Predefined functions from the [*Upgini library*](https://github.com/upgini/upgini#-accuracy-and-uplift-metrics-calculations), like `RMSLE` for Root Mean Squared Logarithmic Error

* A custom evaluation function defined with [scikit-learn make_scorer](https://scikit-learn.org/0.15/modules/model_evaluation.html#defining-your-scoring-strategy-from-score-functions), for example [SMAPE](https://en.wikipedia.org/wiki/Symmetric_mean_absolute_percentage_error)

The model evaluation metric for both the train and validation datasets is calculated with the same cross-validation strategy as in `FeaturesEnricher.fit()`, in this example [time series CV](https://github.com/upgini/upgini#-time-series-prediction-support). 
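
For the second option, a custom SMAPE scorer could look like the sketch below. The `smape` function and the `smape_scorer` name are illustrative; the formula is the variant bounded between 0 and 200, and treating a 0/0 term as zero error is an assumption of this sketch. Only the scikit-learn side is shown here; passing the scorer as `scoring` to `calculate_metrics` follows the description above and is not verified in this example.

```python
import numpy as np
from sklearn.metrics import make_scorer


def smape(y_true, y_pred):
    """Symmetric MAPE in percent (0..200); a 0/0 term counts as zero error."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    numerator = 2.0 * np.abs(y_pred - y_true)
    denominator = np.abs(y_true) + np.abs(y_pred)
    # Divide only where the denominator is non-zero; other terms stay 0
    ratio = np.divide(
        numerator, denominator, out=np.zeros_like(numerator), where=denominator != 0
    )
    return 100.0 * ratio.mean()


# SMAPE is an error, so lower is better; scikit-learn negates it internally
smape_scorer = make_scorer(smape, greater_is_better=False)

example_score = smape([100, 200], [110, 180])
print(example_score)
```

Because `greater_is_better=False`, values returned by the scorer itself are negative; call `smape` directly when you want the plain percentage.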

In [None]:
from catboost import CatBoostRegressor
from catboost.utils import eval_metric
model = CatBoostRegressor(verbose=False, allow_writing_files=False, random_state=0)

# Calculate metrics before and after feature enrichment
enricher.calculate_metrics(
    train_features, train_target, 
    eval_set = [(test_features, test_target)],
    estimator = model,
    scoring = "mean_absolute_percentage_error"
)

Unnamed: 0,match_rate,baseline mean_absolute_percentage_error,enriched mean_absolute_percentage_error,uplift
train,100.0,0.255844,0.170667,0.085177
eval 1,100.0,0.243877,0.132441,0.111436


We've got a strong metric uplift **after enrichment**, both on the cross-validation (*train*) and on the out-of-time validation dataset (*eval 1*): MAPE drops from 0.256 to 0.171 on train and from 0.244 to 0.132 on eval 1.
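
The `uplift` column in the table above equals the baseline error minus the enriched error. The sketch below reproduces it from the reported numbers and adds the relative error reduction, which is a derived figure not shown in the original output:

```python
# (baseline MAPE, enriched MAPE) copied from the metrics table above
metrics = {
    "train": (0.255844, 0.170667),
    "eval 1": (0.243877, 0.132441),
}

# Absolute uplift: how much the error dropped after enrichment
uplift = {name: round(base - enr, 6) for name, (base, enr) in metrics.items()}

# Relative reduction: the same drop as a share of the baseline error
relative = {name: (base - enr) / base for name, (base, enr) in metrics.items()}

for name in metrics:
    print(f"{name}: uplift={uplift[name]:.6f}, relative reduction={relative[name]:.1%}")
```

The relative figure is often easier to compare across datasets because it does not depend on the scale of the baseline error.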

## 4️⃣ Enrich datasets with the new features and retrain model

Now we can enrich our datasets with the features found and use them in our own ML pipelines. Let's enrich both the train and the test datasets.  
The enrichment step for the two datasets will take around *2.5 minutes*.

In [None]:
enriched_train_features = enricher.transform(train_features)
enriched_test_features = enricher.transform(test_features)
enriched_train_features.head()

90.39637% of the rows are fully duplicated


Unnamed: 0,Column name,Status,Description
0,date,All valid,All values in this column are good to go


Running search request with search_id=9c21daba-40f9-4467-b16b-c766e619422d
We'll send email notification once it's completed, just use your personal api_key from profile.upgini.com
Done

Executing transform step
Done
90.36176% of the rows are fully duplicated


Unnamed: 0,Column name,Status,Description
0,date,All valid,All values in this column are good to go


Running search request with search_id=1bed3926-f77d-4bcf-9a42-29389baec24d
We'll send email notification once it's completed, just use your personal api_key from profile.upgini.com
Done

Executing transform step
Done


Unnamed: 0,date,store,item,f_weather_pca_0_94efd18d,f_week_cos1_d3d56d7f,f_year_cos1_cd165f8c,f_week_sin1_a71d22f6,f_payment_fraud_score_3cae9c42,f_c2c_fraud_score_5028232e,f_dow_jones_89547e1d,...,f_italy_match_cnt_fdb09b71,f_nasdaq_d309709a,f_silver_7d_to_7d_1y_shift_ccbd2abf,f_weather_umap_11_c213a9d7,f_dow_jones_7d_to_7d_1y_shift_9628c89b,f_finance_umap_1_15890450,f_usd_7d_to_7d_1y_shift_497dd3e9,f_cpi_pca_0_1a4b6212,f_cbpol_pca_0_516fff50,f_finance_pca_4_e139d2da
0,2013-01-01,7,5,28.661328,0.62349,0.98522,0.781831,0.232837,0.369604,13104.139648,...,0,3019.51001,1.08019,7.594507,1.065812,9.95028,0.993552,-31.479368,-2.336017,0.685219
1,2013-01-01,4,9,28.661328,0.62349,0.98522,0.781831,0.232837,0.369604,13104.139648,...,0,3019.51001,1.08019,7.594507,1.065812,9.95028,0.993552,-31.479368,-2.336017,0.685219
2,2013-01-01,1,33,28.661328,0.62349,0.98522,0.781831,0.232837,0.369604,13104.139648,...,0,3019.51001,1.08019,7.594507,1.065812,9.95028,0.993552,-31.479368,-2.336017,0.685219
3,2013-01-01,3,41,28.661328,0.62349,0.98522,0.781831,0.232837,0.369604,13104.139648,...,0,3019.51001,1.08019,7.594507,1.065812,9.95028,0.993552,-31.479368,-2.336017,0.685219
4,2013-01-01,5,24,28.661328,0.62349,0.98522,0.781831,0.232837,0.369604,13104.139648,...,0,3019.51001,1.08019,7.594507,1.065812,9.95028,0.993552,-31.479368,-2.336017,0.685219


We now have the new features and are ready to retrain the model.  
**BEFORE** enrichment with the new features:

In [None]:
# Baseline: fit and evaluate on the original (non-enriched) features
model.fit(train_features, train_target)
baseline_preds = model.predict(test_features)
eval_metric(test_target.values, baseline_preds, "SMAPE")

[37.65141857448004]

**AFTER** enrichment:

In [None]:
model.fit(enriched_train_features, train_target)
enriched_preds = model.predict(enriched_test_features)
eval_metric(test_target.values, enriched_preds, "SMAPE")

[14.62789832842992]