![Upgini logo](https://cdn.prod.website-files.com/65d5721664bea140c05f5301/65e354e4b9ddb1c6aaa7d7b1_upgini_logo%20gradient.svg)
## [Intelligent data search & enrichment engine for Machine Learning](https://upgini.com)
# Quick Start guide: Search for new relevant external features for store item demand forecasting
_________________

In this guide, you'll learn how to **search for new relevant features with the Upgini library**. We will enrich a dataset with new features and significantly improve model accuracy, all in a few simple steps.  
The goal is to predict future sales of different goods in stores based on a 5-year history of sales. The evaluation metric is MAPE.  
⏱ Time needed: *10 minutes.*  

Download this notebook: [GitHub Link](https://github.com/upgini/upgini/blob/main/notebooks/kaggle_example.ipynb)
_________________

First, let's install the latest version of the Upgini library. We'll also need CatBoost for the last part of this guide.

In [1]:
%pip install -Uq upgini catboost

## 1️⃣ Prepare input data

For this guide we'll use the train dataset from [Store Item Demand Forecasting Challenge](https://www.kaggle.com/c/demand-forecasting-kernels-only). You can download it from [here](https://www.kaggle.com/c/demand-forecasting-kernels-only/data?select=train.csv).  
To speed up the search we'll take a subsample.  
⚠️ All date/datetime columns in the input dataset should be converted to pandas datetime objects so that dates are represented correctly.
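As a quick illustration of the datetime requirement above, here is a minimal sketch on a made-up three-row frame (the `raw` DataFrame and its values are hypothetical, not part of the Kaggle data). It converts a string column with `pd.to_datetime` and checks both the resulting dtype and the number of values that could not be parsed.

```python
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype

# Hypothetical toy data: two valid date strings and one unparseable value
raw = pd.DataFrame({"date": ["2013-01-01", "2013-01-02", "not a date"]})

# errors="coerce" turns unparseable values into NaT instead of raising
raw["date"] = pd.to_datetime(raw["date"], errors="coerce")

is_dt = is_datetime64_any_dtype(raw["date"])  # True if the column is a pandas datetime
n_bad = int(raw["date"].isna().sum())         # number of values that failed to parse

print(f"datetime dtype: {is_dt}, unparseable values: {n_bad}")
```

Checking the count of `NaT` values before searching is one way to catch malformed dates early; whether to drop or repair such rows depends on your data.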

In [2]:
from os.path import exists
import pandas as pd

df_path = "train.csv.zip" if exists("train.csv.zip") else "https://github.com/upgini/upgini/raw/main/notebooks/train.csv.zip"
df = pd.read_csv(df_path).sample(n=19_000, random_state=0)
df["store"] = df["store"].astype(str)
df["item"] = df["item"].astype(str)

# Convert date column to datetime pandas object
df["date"] = pd.to_datetime(df["date"])

df.sort_values("date", inplace=True)
df.reset_index(inplace=True, drop=True)
df.head()

Unnamed: 0,date,store,item,sales
0,2013-01-01,7,5,5
1,2013-01-01,4,9,19
2,2013-01-01,1,33,37
3,2013-01-01,3,41,14
4,2013-01-01,5,24,26


This dataset contains 5 years of records from 2013 to 2017. Let's split it into the train (2013–2016) and the evaluation (2017) parts.

In [3]:
train = df[df["date"] < "2017-01-01"]
test = df[df["date"] >= "2017-01-01"]
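The split above is an out-of-time split: every training row precedes every evaluation row. A minimal sketch of the same idea on a hypothetical four-row frame (the `toy` data is made up for illustration), with two sanity checks that can be adapted to the real `train` and `test` frames:

```python
import pandas as pd

# Hypothetical toy data straddling the 2017-01-01 cutoff
toy = pd.DataFrame({
    "date": pd.to_datetime(["2016-12-30", "2016-12-31", "2017-01-01", "2017-01-02"]),
    "sales": [10, 12, 11, 13],
})

cutoff = pd.Timestamp("2017-01-01")
toy_train = toy[toy["date"] < cutoff]
toy_test = toy[toy["date"] >= cutoff]

# Every row lands in exactly one part, and the parts do not overlap in time
assert len(toy_train) + len(toy_test) == len(toy)
assert toy_train["date"].max() < toy_test["date"].min()
```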

Let's also separate features from targets in *a scikit-learn style* (X and y).

In [4]:
train_features = train.drop(columns=["sales"])
train_target = train["sales"]
test_features = test.drop(columns=["sales"])
test_target = test["sales"]

## 2️⃣ Search new relevant features with FeaturesEnricher

Next, we will use **`FeaturesEnricher`** on the train dataset to find new features relevant for this target prediction.  
* To do this, we need to specify the column(s) containing [**search key(s)**](https://github.com/upgini/upgini#-search-key-types-we-support-more-to-come) (in this case, `date`) and provide the target to predict.  
* We can also specify any number of additional out-of-time validation datasets to evaluate the robustness of the new features.  
* The search task will be auto-detected as regression. Because this is a time series prediction (daily sales as the target variable), we have to pass the [**time series specific cross-validation split**](https://github.com/upgini/upgini#-time-series-prediction-support) **`CVType.time_series`**. The search algorithm then knows that we are working on a time series prediction task, not a simple regression, and will use [time series CV](https://scikit-learn.org/stable/modules/cross_validation.html#time-series-split) for the new feature search.  
* For multivariate time series, you should specify **`id_columns`**, which contains the `id` of each univariate time series: in this example, the combination of `store` and `item`.
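To make the time series cross-validation idea concrete, here is a minimal sketch using scikit-learn's `TimeSeriesSplit` on 12 hypothetical time steps. It only illustrates the general principle that each validation fold lies strictly after its training fold; the exact splitting logic Upgini applies for `CVType.time_series` may differ.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Hypothetical series of 12 consecutive time steps (e.g. days), already sorted by time
days = np.arange(12)

tscv = TimeSeriesSplit(n_splits=3)
folds = []
for train_idx, valid_idx in tscv.split(days):
    folds.append((train_idx, valid_idx))
    # The training window grows, and validation always comes after it
    print(f"train: {train_idx.min()}..{train_idx.max()}  "
          f"valid: {valid_idx.min()}..{valid_idx.max()}")
```

Unlike a shuffled K-fold split, no fold trains on observations that come after the ones it is validated on, which avoids leaking future information into the model.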

The search step will take around *12 minutes*.

In [5]:
from upgini import FeaturesEnricher, SearchKey
from upgini.metadata import CVType

enricher = FeaturesEnricher(
    search_keys = {
      "date": SearchKey.DATE,
    },
    cv = CVType.time_series,
    id_columns = ["store","item"],
)
enricher.fit(train_features,
             train_target,
             eval_set=[(test_features, test_target)]
)


Try to add other keys like the COUNTRY, POSTAL_CODE, PHONE NUMBER, EMAIL/HEM, IP to your training dataset
for search through all the available data sources.
See docs https://github.com/upgini/upgini#-total-239-countries-and-up-to-41-years-of-history


Detected task type: ModelTaskType.REGRESSION. Reason: date search key is present, treating as regression
You can set task type manually with argument `model_task_type` of FeaturesEnricher constructor if task type detected incorrectly




Column name,Status,Errors
target,All valid,-
store,All valid,-
item,All valid,-
date,All valid,-







Running search request, search_id=71c54cde-ae05-4359-9f96-1cd917ee322f
We'll send email notification once it's completed, just use your personal api_key from profile.upgini.com



Feature name,SHAP value,Coverage %,Value preview,Provider,Source,Updates
item,8.9572,100.0,"12, 38, 2",,,
store,6.6829,100.0,"6, 2, 5",,,
f_autofe_trend_coef_8dae5d36ce,6.1655,100.0,"0.0189, 0.0134, -0.0035",Training dataset,AutoFE: features from Training dataset,
f_events_date_week_sin1_847b5db1,4.1391,100.0,"0.0, -0.4339, 0.4339",Upgini,Calendar data,Daily
f_events_date_year_cos1_9014a856,2.624,100.0,"-0.8566, -0.263, 0.878",Upgini,Calendar data,Daily
f_autofe_roll_3d_min_aed5463a33,2.5417,100.0,"-0.8566, -0.263, 0.878","Training dataset,Upgini","AutoFE: features from Training dataset,Calendar data",Daily
f_autofe_roll_7d_median_847e678e51,1.859,100.0,"-0.3119, 0.1793, 0.4822","Training dataset,Upgini","AutoFE: features from Training dataset,Calendar data",Daily
f_events_date_week_cos3_7525fe31,1.8024,100.0,"1.0, -0.2225, -0.2225",Upgini,Calendar data,Daily
f_financial_date_crude_oil_7d_to_1y_c3e0ad17,1.2221,100.0,"1.1037, 1.1405, 0.9803",Upgini,Markets data,Daily
f_financial_date_finance_umap_0_3c020a5e,0.9725,100.0,"9.6007, 10.914, 9.645",Upgini,Markets data,Daily


Provider,Source,All features SHAP,Number of relevant features
Upgini,Calendar data,9.6795,5
Training dataset,AutoFE: features from Training dataset,6.1655,1
"Training dataset,Upgini","AutoFE: features from Training dataset,Calendar data",4.4007,2
Upgini,Markets data,2.9867,4
Upgini,World economic indicators,0.32,1


Sources,Feature name,Feature 1,Function
Training dataset,f_autofe_trend_coef_8dae5d36ce,target,trend_coef
"Training dataset,Calendar data",f_autofe_roll_3d_min_aed5463a33,f_events_date_year_cos1_9014a856,roll_3d_min
"Training dataset,Calendar data",f_autofe_roll_7d_median_847e678e51,f_events_date_year_cos1_9014a856,roll_7d_median


Calculating accuracy uplift after enrichment...
-y distributions from the training sample and eval_set differ according to the Kolmogorov-Smirnov test,
which makes metrics between the train and eval_set incomparable.


Dataset type,Rows,Mean target,Baseline MAPE,Enriched MAPE,"Uplift, abs","Uplift, %"
Train,9930,53.8254,0.328 ± 0.115,0.264 ± 0.050,0.065,19.7%
Eval 1,3787,59.2424,0.281 ± 0.010,0.240 ± 0.019,0.041,14.6%


We've got **10+ new relevant features** from [different sources such as weather data, calendar data, financial data](https://github.com/upgini/upgini#-connected-data-sources-and-coverage), which are expected to improve the accuracy of the model. They are ranked by [SHAP values](https://en.wikipedia.org/wiki/Shapley_value).

Initial features from the training dataset will be checked for relevancy as well, so you don't need an extra feature selection step.

## 3️⃣ Calculate uplift from new relevant features using optimized custom estimator and metric
You can use any model estimator with a scikit-learn compatible interface. Let's take the CatBoost regressor.  
For the evaluation metric there are two options:
* Use a predefined evaluation function alias from the [*Upgini library*](https://github.com/upgini/upgini#-accuracy-and-uplift-metrics-calculations), like **`MAPE`** for Mean Absolute Percentage Error

* Define a custom evaluation function using [scikit-learn make_scorer](https://scikit-learn.org/0.15/modules/model_evaluation.html#defining-your-scoring-strategy-from-score-functions), for example [SMAPE](https://en.wikipedia.org/wiki/Symmetric_mean_absolute_percentage_error)

The model evaluation metric for both the train and validation datasets will be calculated with the same cross-validation strategy as for **`FeaturesEnricher.fit()`**, in this example [time series CV](https://github.com/upgini/upgini#-time-series-prediction-support).
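For the second option, a custom SMAPE scorer could be sketched as follows. The `smape` function, the toy arrays, and the `DummyRegressor` are illustrative assumptions and only the scikit-learn side is shown. The Upgini README describes passing such a scorer through a `scoring` argument of `calculate_metrics`; treat that parameter name as an assumption and check the library documentation for your version.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import make_scorer


def smape(y_true, y_pred):
    """Symmetric MAPE as a fraction (0 = perfect, 2 = worst case)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    denom = (np.abs(y_true) + np.abs(y_pred)) / 2.0
    # Where both the true value and the prediction are zero, count the error as 0
    ratio = np.divide(np.abs(y_pred - y_true), denom,
                      out=np.zeros_like(denom), where=denom != 0)
    return float(np.mean(ratio))


# Lower SMAPE is better, so scikit-learn negates it internally
smape_scorer = make_scorer(smape, greater_is_better=False)

# Hypothetical toy data to show how the scorer is called
X_demo = np.arange(6).reshape(-1, 1)
y_demo = np.array([10.0, 12.0, 9.0, 11.0, 10.0, 8.0])
dummy = DummyRegressor(strategy="mean").fit(X_demo, y_demo)
demo_score = smape_scorer(dummy, X_demo, y_demo)
print(f"negated SMAPE of the mean predictor: {demo_score:.4f}")
```

Because `greater_is_better=False` makes the scorer return a negated value, scores closer to zero indicate a better model.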

In [6]:
from catboost import CatBoostRegressor
from catboost.utils import eval_metric
model = CatBoostRegressor(verbose=False, allow_writing_files=False, random_state=0)

# Calculate metrics before and after enrichment with the new relevant features
enricher.calculate_metrics(
    estimator=model,
)

Calculating accuracy uplift after enrichment...
-y distributions from the training sample and eval_set differ according to the Kolmogorov-Smirnov test,
which makes metrics between the train and eval_set incomparable.


Unnamed: 0,Dataset type,Rows,Mean target,Baseline MAPE,Enriched MAPE,"Uplift, abs","Uplift, %"
0,Train,9930,53.8254,0.294 ± 0.113,0.192 ± 0.046,0.102,34.6%
1,Eval 1,3787,59.2424,0.251 ± 0.015,0.183 ± 0.021,0.068,27.0%


We've got a strong uplift both on the cross-validation (*Train*) and on the out-of-time validation dataset (*Eval 1*) **after enrichment**. MAPE on *Eval 1*:   
**BEFORE** enrichment 0.251   
**AFTER** enrichment 0.183
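The absolute and relative uplift can be reproduced from these two MAPE values with simple arithmetic. This is a minimal sketch that assumes the relative uplift is the absolute improvement divided by the baseline; because it starts from the rounded values printed above, the last digit can differ slightly from the table.

```python
# MAPE on Eval 1, taken from the metrics table above (rounded values)
baseline_mape = 0.251
enriched_mape = 0.183

# Lower MAPE is better, so uplift is the reduction in error
abs_uplift = baseline_mape - enriched_mape
rel_uplift_pct = 100 * abs_uplift / baseline_mape

print(f"Uplift, abs: {abs_uplift:.3f}")
print(f"Uplift, %:   {rel_uplift_pct:.1f}%")
```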

## 4️⃣ Enrich dataset with selected features
Unregistered users are limited to 1000 rows. After [registration](https://profile.upgini.com/login), an additional 1000 rows will be available for enrichment.

In [8]:
xy = pd.concat([train_features, train_target.to_frame("target")], axis=1)
xy_sampled = xy.sample(n=1000)
x = xy_sampled.drop(columns="target")
y = xy_sampled["target"]

transformed = enricher.transform(x, y=y)
transformed

You use Trial access to Upgini data enrichment. Limit for Trial: 1000 rows. You have already enriched: 0 rows.


Column name,Status,Errors
store,All valid,-
item,All valid,-
date,All valid,-




Running transform request, id=80c422a7-b116-4b8e-8e8a-06dc125a434a
We'll send email notification once it's completed, just use your personal api_key from profile.upgini.com

Retrieving selected features from data sources...


Unnamed: 0,date,store,item,target,f_financial_date_silver_7d_to_7d_1y_shift_55fa8001,f_events_date_week_sin1_847b5db1,f_events_date_week_cos3_7525fe31,f_economic_date_cbpol_umap_6_aa0352de,f_financial_date_finance_umap_0_3c020a5e,f_financial_date_finance_umap_2_a414df3b,f_events_date_year_sin1_3c44bc64,f_events_date_year_sin2_59955ffd,f_events_date_year_cos1_9014a856,f_financial_date_crude_oil_7d_to_1y_c3e0ad17,f_autofe_roll_3d_min_aed5463a33,f_autofe_roll_7d_median_847e678e51,f_autofe_trend_coef_8dae5d36ce
10892,2015-11-11,6,36,54,0.939922,0.974928,0.623490,1.936489,13.300418,4.990743,-0.648630,-0.987349,0.761104,0.850650,0.761104,0.761104,0.000057
3221,2013-10-29,9,40,25,0.708309,0.781831,-0.900969,9.239768,12.981220,8.329916,-0.801361,-0.958718,0.598181,1.012201,0.598181,0.598181,0.005571
12453,2016-04-12,4,8,84,0.929156,0.781831,-0.900969,7.300408,11.202875,6.444409,0.938710,-0.647161,-0.344707,0.864444,-0.344707,-0.344707,0.000806
2910,2013-09-29,6,19,18,0.634936,-0.781831,-0.900969,8.985870,12.914780,8.351043,-0.992222,-0.247022,0.124479,1.079850,0.124479,0.124479,0.040780
3177,2013-10-25,1,37,17,0.694254,-0.433884,-0.222521,9.136400,13.017970,8.199723,-0.840618,-0.910605,0.541628,1.027521,0.541628,0.541628,0.000414
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12267,2016-03-26,4,33,70,0.919325,-0.974928,0.623490,1.176196,11.017932,6.415632,0.998195,-0.119881,-0.060049,0.880801,-0.060049,-0.060049,0.003212
4965,2014-04-17,10,17,43,0.766698,0.433884,-0.222521,6.382960,12.713668,7.391176,0.910605,-0.752667,-0.413279,1.042749,-0.413279,-0.413279,0.064625
10357,2015-09-20,5,30,38,0.799745,-0.781831,-0.900969,2.327512,13.120562,5.366381,-0.999546,0.060213,-0.030120,0.779210,-0.030120,-0.030120,0.000000
3980,2014-01-13,6,7,20,0.658138,0.000000,1.000000,9.325478,13.040100,8.248272,0.369725,0.687053,0.929141,0.945776,0.929141,0.929141,0.042657


______________________________
**That's all for a quick start in 15 minutes!**  
If you found this useful or interesting, feel free to share.  
______________________________
## 🔗 Useful links
* Upgini Library [Documentation](https://github.com/upgini/upgini#readme)
* More [Notebooks and Guides](https://github.com/upgini/upgini?tab=readme-ov-file#-tutorials)
* Kaggle public [Notebooks](https://www.kaggle.com/romaupgini/code)


<sup>😔 Found a typo or a bug in a code snippet? Our bad! <a href="https://github.com/upgini/upgini/issues/new?assignees=&title=readme%2Fbug">
Please report it here.</a></sup>