![Upgini logo](https://cdn.prod.website-files.com/65d5721664bea140c05f5301/65e354e4b9ddb1c6aaa7d7b1_upgini_logo%20gradient.svg)  
## [Intelligent data search & enrichment engine for Machine Learning](https://upgini.com)
# Quick Start guide: Search for new relevant external features for store item demand forecasting
_________________

Following this guide, you'll learn how to **search for new relevant features with the Upgini library**. We will enrich a dataset with new features and significantly improve model accuracy, all in 3 simple steps.  
The goal is to predict future sales of different goods in stores based on a 5-year history of sales. The evaluation metric is MAPE.  
⏱ Time needed: *10 minutes.*  

Download this notebook: [GitHub Link](https://github.com/upgini/upgini/blob/main/notebooks/kaggle_example.ipynb)
_________________

First, let's install the latest version of the Upgini library. We'll also need CatBoost for the last part of this guide.

In [1]:
%pip install -Uq upgini catboost

[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m48.6/48.6 kB[0m [31m642.4 kB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m913.9/913.9 kB[0m [31m14.7 MB/s[0m eta [36m0:00:00[0m
[?25h  Preparing metadata (setup.py) ... [?25l[?25hdone
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m264.1/264.1 kB[0m [31m12.0 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m98.7/98.7 MB[0m [31m8.7 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.8/1.8 MB[0m [31m13.1 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m139.8/139.8 kB[0m [31m8.7 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m162.7/162.7 kB[0m [31m8.3 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m125.0/125.0 kB[0m [31m7.4 MB/s[0m e

## 1️⃣ Prepare input data

For this guide we'll use the train dataset from [Store Item Demand Forecasting Challenge](https://www.kaggle.com/c/demand-forecasting-kernels-only). You can download it from [here](https://www.kaggle.com/c/demand-forecasting-kernels-only/data?select=train.csv).  
To speed up the search we'll take a subsample.  
⚠️ All date/datetime columns in the input dataset should be converted to pandas datetime objects so that dates are represented correctly.

In [2]:
from os.path import exists
import pandas as pd

df_path = "train.csv.zip" if exists("train.csv.zip") else "https://github.com/upgini/upgini/raw/main/notebooks/train.csv.zip"
df = pd.read_csv(df_path).sample(n=19_000, random_state=0)
df["store"] = df["store"].astype(str)
df["item"] = df["item"].astype(str)

# Convert date column to datetime pandas object
df["date"] = pd.to_datetime(df["date"])

df.sort_values("date", inplace=True)
df.reset_index(inplace=True, drop=True)
df.head()

Unnamed: 0,date,store,item,sales
0,2013-01-01,7,5,5
1,2013-01-01,4,9,19
2,2013-01-01,1,33,37
3,2013-01-01,3,41,14
4,2013-01-01,5,24,26
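
As a minimal sketch of the date conversion on a toy frame (the sample values are made up for illustration), `pd.to_datetime` turns string dates into datetime values, and `errors="coerce"` marks unparseable entries as `NaT` so they can be inspected before the search:

```python
import pandas as pd

# Toy data, made up for illustration: one malformed date on purpose.
toy = pd.DataFrame({
    "date": ["2013-01-01", "2013-01-02", "not a date"],
    "sales": [5, 19, 37],
})

# Parse strings to datetimes; unparseable values become NaT instead of raising.
toy["date"] = pd.to_datetime(toy["date"], format="%Y-%m-%d", errors="coerce")

# Confirm the column now has a datetime dtype and count rows that failed to parse.
is_datetime = pd.api.types.is_datetime64_any_dtype(toy["date"])
n_bad_dates = int(toy["date"].isna().sum())
print(is_datetime, n_bad_dates)
```

Rows with `NaT` dates can then be fixed or dropped before they are passed on as a search key.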


This dataset contains 5 years of records from 2013 to 2017. Let's split it into the train (2013–2016) and the evaluation (2017) parts.

In [3]:
train = df[df["date"] < "2017-01-01"]
test = df[df["date"] >= "2017-01-01"]
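
The same cutoff logic can be sanity-checked on a toy frame. This is a sketch with made-up values that confirms every row lands in exactly one part and that the train part strictly precedes the evaluation part:

```python
import pandas as pd

# Toy frame, made up for illustration, spanning the 2016/2017 boundary.
toy = pd.DataFrame({
    "date": pd.to_datetime(["2016-12-30", "2016-12-31", "2017-01-01", "2017-01-02"]),
    "sales": [10, 12, 11, 13],
})

cutoff = pd.Timestamp("2017-01-01")
toy_train = toy[toy["date"] < cutoff]
toy_test = toy[toy["date"] >= cutoff]

# Every row lands in exactly one part, and train strictly precedes test.
assert len(toy_train) + len(toy_test) == len(toy)
assert toy_train["date"].max() < toy_test["date"].min()
```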

Let's also separate features from targets in *a scikit-learn style* (X and y).

In [4]:
train_features = train.drop(columns=["sales"])
train_target = train["sales"]
test_features = test.drop(columns=["sales"])
test_target = test["sales"]

## 2️⃣ Search new relevant features with FeaturesEnricher

Next, we will use **`FeaturesEnricher`** on the train dataset to find new features relevant for this target prediction.  
* To do this, we need to specify the column(s) containing [**search key(s)**](https://github.com/upgini/upgini#-search-key-types-we-support-more-to-come), in this case `date`, and provide the target to predict.  
* We can also specify any number of additional out-of-time validation datasets to evaluate the robustness of the new features.  
* This search task will be auto-detected as regression. Since this is a time series prediction (daily sales as the target variable), we have to pass the [**time series specific cross-validation split**](https://github.com/upgini/upgini#-time-series-prediction-support) **`CVType.time_series`**. The search algorithm then knows that this is a time series prediction task rather than a plain regression and will use [time series CV](https://scikit-learn.org/stable/modules/cross_validation.html#time-series-split) for the new features search.  
* For multivariate time series, specify **`id_columns`**, which contains the `id` of each univariate series; in this example it is the combination of `store` and `item`.
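
To illustrate what a time series CV split looks like, here is a small sketch using scikit-learn's `TimeSeriesSplit`, the splitter described in the scikit-learn page linked above. Whether Upgini uses this exact class internally is not stated in this guide, so treat it as an illustration of the idea only:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Ten time-ordered observations (the indices stand in for dates).
X = np.arange(10).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=3)
folds = list(tscv.split(X))
for train_idx, valid_idx in folds:
    # Each fold trains only on observations that precede the validation block.
    print("train:", train_idx, "valid:", valid_idx)
```

Unlike a shuffled K-fold, no fold validates on data older than its training data, which avoids leaking the future into the past.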

The search step takes around *2.5 minutes*.

In [6]:
from upgini import FeaturesEnricher, SearchKey
from upgini.metadata import CVType

enricher = FeaturesEnricher(
    search_keys = {
      "date": SearchKey.DATE,
    },
    cv = CVType.time_series,
    id_columns = ["store","item"],
)
enricher.fit(train_features,
             train_target,
             eval_set=[(test_features, test_target)]
)


Try to add other keys like the COUNTRY, POSTAL_CODE, PHONE NUMBER, EMAIL/HEM, IP to your training dataset
for search through all the available data sources.
See docs https://github.com/upgini/upgini#-total-239-countries-and-up-to-41-years-of-history





Detected task type: ModelTaskType.REGRESSION. Reason: date search key is present, treating as regression
You can set task type manually with argument `model_task_type` of FeaturesEnricher constructor if task type detected incorrectly



Column name,Status,Errors
date,All valid,-
target,All valid,-
item,All valid,-
store,All valid,-







Running search request, search_id=37ebb20f-a069-405d-9cd6-a4316109c88d
We'll send email notification once it's completed, just use your personal api_key from profile.upgini.com

[92m[1m
35 relevant feature(s) found with the search keys: ['date', 'store', 'item'][0m


Feature name,SHAP value,Coverage %,Value preview,Provider,Source,Updates
f_events_date_year_cos1_9014a856,4.002,100.0,"0.2093, -0.7325, -0.9231",Upgini,Calendar data,Daily
f_autofe_lag_7d_44ccb1e1,1.7691,99.4807,"0.9905, -0.2345, -0.7624","Training dataset,Upgini","AutoFE: features from Training dataset,Calendar data",Daily
f_events_date_week_sin1_847b5db1,1.0779,100.0,"-0.9749, -0.7818, 0.0",Upgini,Calendar data,Daily
f_financial_date_gold_bf71e733,0.9815,100.0,"1501.0, 1214.2, 1306.3",Upgini,Markets data,Daily
f_autofe_lag_2d_1947d061,0.942,99.8554,"0.6235, -0.2225, 0.6235","Training dataset,Upgini","AutoFE: features from Training dataset,Calendar data",Daily
f_autofe_lag_3d_d168eb3e,0.792,99.7831,"-0.9749, 0.7818, -0.4339","Training dataset,Upgini","AutoFE: features from Training dataset,Calendar data",Daily
f_autofe_lag_3d_5dba02da,0.6874,99.7831,"341.48, 339.14, 346.42","Training dataset,Upgini","AutoFE: features from Training dataset,Markets data",Daily
f_autofe_lag_7d_6354b8dc,0.6632,99.4807,"377.68, 320.97, 342.92","Training dataset,Upgini","AutoFE: features from Training dataset,Markets data",Daily
f_autofe_roll_7d_norm_mean_b8cbc4e4,0.4932,42.937,"5.6806442808608536e+16, 5.6806...","Training dataset,Upgini","AutoFE: features from Training dataset,Calendar data",Daily
f_autofe_lag_7d_24084e88,0.4916,99.4807,"-0.4339, -0.9749, 0.4339","Training dataset,Upgini","AutoFE: features from Training dataset,Calendar data",Daily


Provider,Source,All features SHAP,Number of relevant features
"Training dataset,Upgini","AutoFE: features from Training dataset,Calendar data",5.8843,11
Upgini,Calendar data,5.0799,2
"Training dataset,Upgini","AutoFE: features from Training dataset,Markets data",3.0873,19
Upgini,Markets data,1.3511,3


Sources,Feature name,Feature 1,Function
"Training dataset,Calendar data",f_autofe_lag_7d_44ccb1e1,f_events_date_year_cos1_9014a856,lag_7d
"Training dataset,Calendar data",f_autofe_lag_2d_1947d061,f_events_date_week_cos1_f6a8c1fc,lag_2d
"Training dataset,Calendar data",f_autofe_lag_3d_d168eb3e,f_events_date_week_sin1_847b5db1,lag_3d
"Training dataset,Markets data",f_autofe_lag_3d_5dba02da,f_financial_date_stoxx_043cbcd4,lag_3d
"Training dataset,Markets data",f_autofe_lag_7d_6354b8dc,f_financial_date_stoxx_043cbcd4,lag_7d
"Training dataset,Calendar data",f_autofe_roll_7d_norm_mean_b8cbc4e4,f_events_date_week_cos2_b0a07cfc,roll_7d_norm_mean
"Training dataset,Calendar data",f_autofe_lag_7d_24084e88,f_events_date_week_sin1_847b5db1,lag_7d
"Training dataset,Markets data",f_autofe_lag_3d_7d332726,f_financial_date_gold_bf71e733,lag_3d
"Training dataset,Calendar data",f_autofe_roll_7d_norm_mean_f4c75f3e,f_events_date_year_cos1_9014a856,roll_7d_norm_mean
"Training dataset,Calendar data",f_autofe_roll_3d_median_67a4ca4c,f_events_date_week_cos3_7525fe31,roll_3d_median


We detected 113 outliers in your sample.
Examples of outliers with maximum value of target:
84    205
47    196
38    187
Name: target, dtype: int64
Outliers will be excluded during the metrics calculation.
Calculating accuracy uplift after enrichment...
y distributions from the training sample and eval_set differ according to the Kolmogorov-Smirnov test,
which makes metrics between the train and eval_set incomparable.


Dataset type,Rows,Mean target,Baseline mean_squared_error,Enriched mean_squared_error,Uplift
Train,15213,50.3977,305.220 ± 103.118,201.473 ± 62.127,103.7467
Eval 1,3787,59.2424,490.798 ± 71.251,336.804 ± 76.049,153.9939
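
The warning above refers to the two-sample Kolmogorov-Smirnov test. As a minimal sketch of such a check, `scipy.stats.ks_2samp` can compare two target samples. The synthetic normal data and the 0.05 threshold are assumptions for illustration, not what the library uses:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# Synthetic targets: the second sample is shifted upward (mean 59 vs 50).
y_train = rng.normal(loc=50, scale=20, size=2000)
y_eval = rng.normal(loc=59, scale=20, size=2000)

result = ks_2samp(y_train, y_eval)
print(f"KS statistic={result.statistic:.3f}, p-value={result.pvalue:.3g}")

# A small p-value suggests the two samples come from different distributions.
distributions_differ = result.pvalue < 0.05
```

When the distributions differ, scale-dependent metrics such as MSE are hard to compare between the train and evaluation parts, so the uplift within each row is the more meaningful number.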


We've got **35 new relevant features** from [different sources such as weather data, calendar data, financial data](https://github.com/upgini/upgini#-connected-data-sources-and-coverage), which are expected to improve the accuracy of the model. They are ranked by [SHAP values](https://en.wikipedia.org/wiki/Shapley_value).

Initial features from the training dataset will be checked for relevancy as well, so you don't need an extra feature selection step.

## 3️⃣ Calculate uplift from new relevant features using optimized custom estimator and metric
You can use any model estimator with a scikit-learn compatible interface. Let's take the CatBoost regressor.  
For the evaluation metric there are two options:
* A predefined evaluation function alias from the [*Upgini library*](https://github.com/upgini/upgini#-accuracy-and-uplift-metrics-calculations), like **`MAPE`** for Mean Absolute Percentage Error

* A custom evaluation function defined with [scikit-learn make_scorer](https://scikit-learn.org/0.15/modules/model_evaluation.html#defining-your-scoring-strategy-from-score-functions), for example [SMAPE](https://en.wikipedia.org/wiki/Symmetric_mean_absolute_percentage_error)

The model evaluation metric for both the train and validation datasets will be calculated with the same cross-validation strategy as for **`FeaturesEnricher.fit()`**, in this example [time series CV](https://github.com/upgini/upgini#-time-series-prediction-support).
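
To make the two metric options concrete, here is a minimal sketch that computes MAPE with scikit-learn and wraps a hand-written SMAPE in `make_scorer`. The `smape` function and the toy values are written for illustration and are not part of the Upgini library:

```python
import numpy as np
from sklearn.metrics import make_scorer, mean_absolute_percentage_error


def smape(y_true, y_pred):
    """Symmetric MAPE as a fraction: mean of 2*|y - yhat| / (|y| + |yhat|)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    denom = np.abs(y_true) + np.abs(y_pred)
    # Treat 0/0 terms as zero error to avoid division by zero.
    ratio = np.divide(
        2.0 * np.abs(y_true - y_pred),
        denom,
        out=np.zeros_like(denom),
        where=denom != 0,
    )
    return float(np.mean(ratio))


# Lower SMAPE is better, so tell scikit-learn not to maximize it.
smape_scorer = make_scorer(smape, greater_is_better=False)

# Toy values, made up for illustration.
y_true = [100, 200]
y_pred = [110, 180]
mape_value = mean_absolute_percentage_error(y_true, y_pred)
smape_value = smape(y_true, y_pred)
print(mape_value, smape_value)
```

Following the second option above, a scorer like `smape_scorer` would be passed as the `scoring` argument in place of a metric alias. Treat the exact call as an assumption and check the library documentation for the accepted types.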

In [None]:
from catboost import CatBoostRegressor

model = CatBoostRegressor(verbose=False, allow_writing_files=False, random_state=0)

# Calculate metrics before and after enrichment with the new relevant features
enricher.calculate_metrics(
    estimator = model,
    scoring = "mean_absolute_percentage_error"
)

Calculating accuracy uplift after enrichment...
y distributions from the training sample and eval_set differ according to the Kolmogorov-Smirnov test,
which makes metrics between the train and eval_set incomparable.


Feature name,SHAP value,Coverage %,Value preview,Provider,Source,Updates
f_events_date_year_cos1_9014a856,4.0524,100.0,"-0.2512, -0.133, -0.6793",Upgini,Calendar data,Daily
f_autofe_lag_7d_44ccb1e1,2.1382,99.4807,"-0.8945, -0.9093, -0.5864","Training dataset,Upgini","AutoFE: features from Training dataset,Calendar data",Daily
f_events_date_week_sin1_847b5db1,1.4226,100.0,"0.9749, 0.9749, 0.9749",Upgini,Calendar data,Daily
f_autofe_lag_3d_d168eb3e,1.0297,99.7831,"0.4339, -0.4339, 0.4339","Training dataset,Upgini","AutoFE: features from Training dataset,Calendar data",Daily
f_autofe_lag_7d_6354b8dc,1.0281,99.4807,"332.72, 344.12, 395.81","Training dataset,Upgini","AutoFE: features from Training dataset,Markets data",Daily
f_autofe_lag_3d_5dba02da,0.8501,99.7831,"344.67, 378.32, 353.11","Training dataset,Upgini","AutoFE: features from Training dataset,Markets data",Daily
f_financial_date_gold_bf71e733,0.7752,100.0,"1296.9, 1310.8, 1551.8",Upgini,Markets data,Daily
f_autofe_roll_7d_norm_mean_f4c75f3e,0.6334,99.553,"0.9487, 0.8103, 0.2945","Training dataset,Upgini","AutoFE: features from Training dataset,Calendar data",Daily
f_autofe_lag_2d_1947d061,0.6249,99.8554,"-0.2225, -0.2225, -0.2225","Training dataset,Upgini","AutoFE: features from Training dataset,Calendar data",Daily
f_autofe_lag_7d_24084e88,0.5177,99.4807,"-0.7818, 0.9749, -0.7818","Training dataset,Upgini","AutoFE: features from Training dataset,Calendar data",Daily


Provider,Source,All features SHAP,Number of relevant features
"Training dataset,Upgini","AutoFE: features from Training dataset,Calendar data",6.9362,11
Upgini,Calendar data,5.475,2
"Training dataset,Upgini","AutoFE: features from Training dataset,Markets data",4.9618,19
Upgini,Markets data,1.5773,3


Sources,Feature name,Feature 1,Function
"Training dataset,Calendar data",f_autofe_lag_7d_44ccb1e1,f_events_date_year_cos1_9014a856,lag_7d
"Training dataset,Calendar data",f_autofe_lag_3d_d168eb3e,f_events_date_week_sin1_847b5db1,lag_3d
"Training dataset,Markets data",f_autofe_lag_7d_6354b8dc,f_financial_date_stoxx_043cbcd4,lag_7d
"Training dataset,Markets data",f_autofe_lag_3d_5dba02da,f_financial_date_stoxx_043cbcd4,lag_3d
"Training dataset,Calendar data",f_autofe_roll_7d_norm_mean_f4c75f3e,f_events_date_year_cos1_9014a856,roll_7d_norm_mean
"Training dataset,Calendar data",f_autofe_lag_2d_1947d061,f_events_date_week_cos1_f6a8c1fc,lag_2d
"Training dataset,Calendar data",f_autofe_lag_7d_24084e88,f_events_date_week_sin1_847b5db1,lag_7d
"Training dataset,Calendar data",f_autofe_roll_3d_median_67a4ca4c,f_events_date_week_cos3_7525fe31,roll_3d_median
"Training dataset,Calendar data",f_autofe_roll_7d_norm_mean_b8cbc4e4,f_events_date_week_cos2_b0a07cfc,roll_7d_norm_mean
"Training dataset,Calendar data",f_autofe_roll_2d_min_bd3d6759,f_events_date_week_cos1_f6a8c1fc,roll_2d_min





Unnamed: 0,Dataset type,Rows,Mean target,Baseline mean_absolute_percentage_error,Enriched mean_absolute_percentage_error,Uplift
0,Train,15213,50.3977,0.248 ± 0.029,0.166 ± 0.020,0.081058
1,Eval 1,3787,59.2424,0.251 ± 0.015,0.185 ± 0.032,0.066307


We've got a clear uplift **after enrichment** both on the cross-validation (*Train*, MAPE from 0.248 to 0.166) and on the out-of-time validation dataset (*Eval 1*). MAPE on *Eval 1*:   
**BEFORE** enrichment 0.251   
**AFTER** enrichment 0.185
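
To put the Eval 1 numbers in relative terms, the short calculation below uses the rounded values from the metrics table:

```python
# MAPE values for the out-of-time set (Eval 1), taken from the metrics table.
baseline_mape = 0.251
enriched_mape = 0.185

absolute_uplift = baseline_mape - enriched_mape
relative_reduction = absolute_uplift / baseline_mape

print(f"absolute uplift: {absolute_uplift:.3f}")               # 0.066
print(f"relative error reduction: {relative_reduction:.1%}")   # 26.3%
```

In other words, enrichment removes roughly a quarter of the baseline percentage error on the held-out year.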

______________________________
**That's all for a quick start in 10 minutes!**  
If you found this useful or interesting, feel free to share.  
______________________________
## 🔗 Useful links
* Upgini Library [Documentation](https://github.com/upgini/upgini#readme)
* More [Notebooks and Guides](https://github.com/upgini/upgini?tab=readme-ov-file#-tutorials)
* Kaggle public [Notebooks](https://www.kaggle.com/romaupgini/code)


<sup>😔 Found a typo or a bug in a code snippet? Our bad! <a href="https://github.com/upgini/upgini/issues/new?assignees=&title=readme%2Fbug">
Please report it here.</a></sup>