![Upgini logo](https://cdn.prod.website-files.com/65d5721664bea140c05f5301/65e354e4b9ddb1c6aaa7d7b1_upgini_logo%20gradient.svg)
## [Intelligent data search & enrichment engine for Machine Learning](https://upgini.com)
# Quick Start guide: Search for new relevant external features for store item demand forecasting
_________________

Following this guide, you'll learn how to **search for new relevant features with the Upgini library**. We will enrich a dataset with new features and improve model accuracy: in this run, MAPE on the out-of-time evaluation set drops from 0.251 to 0.190. All in a few simple steps.  
The goal is to predict future sales of different goods in stores based on a 5-year history of sales. The evaluation metric is MAPE.  
⏱ Time needed: *10 minutes.*  

Download this notebook: [GitHub Link](https://github.com/upgini/upgini/blob/main/notebooks/kaggle_example.ipynb)
_________________

First, let's install the latest version of the Upgini library. We'll also need CatBoost for the last part of this guide.

In [1]:
%pip install -Uq upgini catboost


## 1️⃣ Prepare input data

For this guide we'll use the train dataset from [Store Item Demand Forecasting Challenge](https://www.kaggle.com/c/demand-forecasting-kernels-only). You can download it from [here](https://www.kaggle.com/c/demand-forecasting-kernels-only/data?select=train.csv).  
To speed up the search we'll take a subsample.  
⚠️ All date/datetime columns in the input dataset should be converted to pandas datetime objects so that dates are represented correctly.
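As a small illustration of this warning, the sketch below parses a hypothetical toy column (not the guide's dataset) with `pd.to_datetime` and counts values that could not be parsed. The toy frame, the `%Y-%m-%d` format, and the choice of `errors="coerce"` are assumptions made for the example.

```python
import pandas as pd

# Hypothetical toy data: two valid ISO dates and one invalid value
raw = pd.DataFrame({"date": ["2013-01-01", "2013-01-02", "not a date"]})

# Parse with an explicit format; unparseable values become NaT instead of raising
parsed = pd.to_datetime(raw["date"], format="%Y-%m-%d", errors="coerce")

# Count the values that failed to parse so they can be inspected before the search
n_bad = int(parsed.isna().sum())
print(parsed.dtype, "unparseable values:", n_bad)
```

Checking the count of `NaT` values before fitting makes silent parsing problems visible early.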

In [2]:
from os.path import exists
import pandas as pd

df_path = "train.csv.zip" if exists("train.csv.zip") else "https://github.com/upgini/upgini/raw/main/notebooks/train.csv.zip"
df = pd.read_csv(df_path).sample(n=19_000, random_state=0)
df["store"] = df["store"].astype(str)
df["item"] = df["item"].astype(str)

# Convert date column to datetime pandas object
df["date"] = pd.to_datetime(df["date"])

df.sort_values("date", inplace=True)
df.reset_index(inplace=True, drop=True)
df.head()

Unnamed: 0,date,store,item,sales
0,2013-01-01,7,5,5
1,2013-01-01,4,9,19
2,2013-01-01,1,33,37
3,2013-01-01,3,41,14
4,2013-01-01,5,24,26


This dataset contains 5 years of records from 2013 to 2017. Let's split it into the train (2013–2016) and the evaluation (2017) parts.

In [3]:
train = df[df["date"] < "2017-01-01"]
test = df[df["date"] >= "2017-01-01"]

Let's also separate features from targets in *a scikit-learn style* (X and y).

In [4]:
train_features = train.drop(columns=["sales"])
train_target = train["sales"]
test_features = test.drop(columns=["sales"])
test_target = test["sales"]

## 2️⃣ Search new relevant features with FeaturesEnricher

Next, we will use **`FeaturesEnricher`** on the train dataset to find new features relevant for this target prediction.  
* To do this, we need to specify the column(s) containing the [**search key(s)**](https://github.com/upgini/upgini#-search-key-types-we-support-more-to-come), in this case `date`, and provide the target to predict.  
* We can also specify any number of additional out-of-time validation datasets to evaluate the robustness of the new features.  
* This search task will be auto-detected as a regression. Because this is a time series prediction (daily sales as the target variable), we have to pass the [**time series specific cross-validation split**](https://github.com/upgini/upgini#-time-series-prediction-support) **`CVType.time_series`**. The search algorithm then knows that this is a time series prediction task rather than a plain regression, and uses [time series CV](https://scikit-learn.org/stable/modules/cross_validation.html#time-series-split) for the new features search.  
* For a multivariate time series you should specify **`id_columns`**, the columns that identify each univariate time series. In this example it is the combination of `store` and `item`.
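To make the idea of a time series split concrete, here is a minimal sketch using scikit-learn's `TimeSeriesSplit` on a hypothetical 8-day toy series. It only illustrates the principle that each validation fold comes strictly after its training fold; it is an assumption that this resembles what `CVType.time_series` does internally, and the toy data and `n_splits=3` are made up for the example.

```python
import pandas as pd
from sklearn.model_selection import TimeSeriesSplit

# Hypothetical toy series: 8 consecutive days, already sorted by date
toy = pd.DataFrame({
    "date": pd.date_range("2013-01-01", periods=8, freq="D"),
    "sales": [5, 19, 37, 14, 26, 30, 22, 41],
})

# Each split trains on an initial stretch of the series and validates on the rows that follow
splitter = TimeSeriesSplit(n_splits=3)
folds = list(splitter.split(toy))

for fold, (train_idx, valid_idx) in enumerate(folds, start=1):
    train_end = toy["date"].iloc[train_idx].max().date()
    valid_start = toy["date"].iloc[valid_idx].min().date()
    print(f"fold {fold}: train up to {train_end}, validate from {valid_start}")
```

Unlike a shuffled K-fold split, no fold here is validated on data older than its training data, which avoids leaking future information into the model.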

The search step takes around *2.5 minutes*.

In [5]:
from upgini import FeaturesEnricher, SearchKey
from upgini.metadata import CVType

enricher = FeaturesEnricher(
    search_keys = {
      "date": SearchKey.DATE,
    },
    cv = CVType.time_series,
    id_columns = ["store","item"],
)
enricher.fit(train_features,
             train_target,
             eval_set=[(test_features, test_target)]
)


Try to add other keys like the COUNTRY, POSTAL_CODE, PHONE NUMBER, EMAIL/HEM, IP to your training dataset
for search through all the available data sources.
See docs https://github.com/upgini/upgini#-total-239-countries-and-up-to-41-years-of-history





Detected task type: ModelTaskType.REGRESSION. Reason: date search key is present, treating as regression
You can set task type manually with argument `model_task_type` of FeaturesEnricher constructor if task type detected incorrectly




Column name,Status,Errors
date,All valid,-
store,All valid,-
target,All valid,-
item,All valid,-







Running search request, search_id=c168a650-ee30-4f76-8903-6c387a95bf9f
We'll send email notification once it's completed, just use your personal api_key from profile.upgini.com

[92m[1m
22 relevant feature(s) found with the search keys: ['date'][0m


Feature name,SHAP value,Coverage %,Value preview,Provider,Source,Updates
item,8.7631,100.0,"12, 38, 2",,,
store,6.6029,100.0,"6, 2, 5",,,
f_autofe_trend_coef_8dae5d36ce,6.1936,100.0,"0.0189, 0.0072, 0.005",Training dataset,AutoFE: features from Training dataset,
f_events_date_week_sin1_847b5db1,4.1014,100.0,"0.0, -0.4339, 0.4339",Upgini,Calendar data,Daily
f_events_date_year_cos1_9014a856,2.227,100.0,"-0.8566, -0.263, 0.878",Upgini,Calendar data,Daily
f_autofe_roll_3d_min_aed5463a33,2.1619,100.0,"-0.8566, -0.263, 0.878","Training dataset,Upgini","AutoFE: features from Training dataset,Calendar data",Daily
f_autofe_roll_7d_min_ea7571ef80,1.8358,100.0,"-0.8566, -0.263, 0.878","Training dataset,Upgini","AutoFE: features from Training dataset,Calendar data",Daily
f_events_date_week_cos3_7525fe31,1.7808,100.0,"1.0, -0.2225, -0.2225",Upgini,Calendar data,Daily
f_financial_date_crude_oil_7d_to_1y_c3e0ad17,1.1176,100.0,"1.1037, 1.1405, 0.9803",Upgini,Markets data,Daily
f_events_date_year_sin1_3c44bc64,0.6248,100.0,"0.5307, 0.9806, -0.7117",Upgini,Calendar data,Daily


Provider,Source,All features SHAP,Number of relevant features
Upgini,Calendar data,8.9927,5
Training dataset,AutoFE: features from Training dataset,6.1936,1
"Training dataset,Upgini","AutoFE: features from Training dataset,Calendar data",3.9977,2
Upgini,Markets data,2.7696,6
Upgini,World economic indicators,1.7192,6


Sources,Feature name,Feature 1,Function
Training dataset,f_autofe_trend_coef_8dae5d36ce,target,trend_coef
"Training dataset,Calendar data",f_autofe_roll_3d_min_aed5463a33,f_events_date_year_cos1_9014a856,roll_3d_min
"Training dataset,Calendar data",f_autofe_roll_7d_min_ea7571ef80,f_events_date_year_cos1_9014a856,roll_7d_min


Calculating accuracy uplift after enrichment...
-y distributions from the training sample and eval_set differ according to the Kolmogorov-Smirnov test,
which makes metrics between the train and eval_set incomparable.


Dataset type,Rows,Mean target,Baseline MAPE,Enriched MAPE,"Uplift, abs","Uplift, %"
Train,9930,53.8254,0.328 ± 0.115,0.277 ± 0.064,0.052,15.7%
Eval 1,3787,59.2424,0.281 ± 0.010,0.252 ± 0.024,0.028,10.1%


We've got **20+ new relevant features** from [different sources such as weather data, calendar data, financial data](https://github.com/upgini/upgini#-connected-data-sources-and-coverage), which are expected to improve the accuracy of the model. They are ranked by [SHAP values](https://en.wikipedia.org/wiki/Shapley_value).

Initial features from the training dataset will be checked for relevancy as well, so you don't need an extra feature selection step.

## 3️⃣ Calculate uplift from new relevant features using optimized custom estimator and metric
You can use any model estimator with a scikit-learn compatible interface. Let's take the CatBoost regressor.  
For the evaluation metric there are two options:
* A predefined evaluation function alias from the [*Upgini library*](https://github.com/upgini/upgini#-accuracy-and-uplift-metrics-calculations), like **`MAPE`** for Mean Absolute Percentage Error

* A custom evaluation function defined with [scikit-learn make_scorer](https://scikit-learn.org/0.15/modules/model_evaluation.html#defining-your-scoring-strategy-from-score-functions), for example [SMAPE](https://en.wikipedia.org/wiki/Symmetric_mean_absolute_percentage_error)

The evaluation metric for both the train and validation datasets will be calculated with the same cross-validation strategy as in **`FeaturesEnricher.fit()`**, in this example [time series CV](https://github.com/upgini/upgini#-time-series-prediction-support).
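The second option can be sketched as follows. This is a minimal, hedged example of a SMAPE scorer built with scikit-learn's `make_scorer`; the `smape` function, its handling of zero denominators, and the `DummyRegressor` sanity check are assumptions made for illustration and are not part of the Upgini library.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import make_scorer

def smape(y_true, y_pred):
    """Symmetric MAPE as a fraction; pairs where both values are 0 count as zero error."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    denom = np.abs(y_true) + np.abs(y_pred)
    diff = 2.0 * np.abs(y_pred - y_true)
    # Divide only where the denominator is non-zero; other positions stay 0
    ratio = np.divide(diff, denom, out=np.zeros_like(diff), where=denom != 0)
    return float(np.mean(ratio))

# Lower SMAPE is better, so the scorer returns the negated value
smape_scorer = make_scorer(smape, greater_is_better=False)

# Sanity check on hypothetical data with a constant-mean baseline model
X = np.arange(4).reshape(-1, 1)
y = np.array([10.0, 20.0, 30.0, 40.0])
baseline = DummyRegressor(strategy="mean").fit(X, y)
print(smape_scorer(baseline, X, y))
```

Following the description above, a scorer like `smape_scorer` would then be passed as the `scoring` argument of `enricher.calculate_metrics` in place of the predefined alias.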

In [6]:
from catboost import CatBoostRegressor

# CatBoost regressor with a fixed seed, silent training, and no files written to disk
model = CatBoostRegressor(verbose=False, allow_writing_files=False, random_state=0)

# Calculate metrics before and after enrichment with the new relevant features
enricher.calculate_metrics(
    estimator = model,
    scoring = "mean_absolute_percentage_error"
)

Calculating accuracy uplift after enrichment...
-y distributions from the training sample and eval_set differ according to the Kolmogorov-Smirnov test,
which makes metrics between the train and eval_set incomparable.


Dataset type,Rows,Mean target,Baseline MAPE,Enriched MAPE,"Uplift, abs","Uplift, %"
Train,9930,53.8254,0.294 ± 0.113,0.200 ± 0.059,0.094,32.0%
Eval 1,3787,59.2424,0.251 ± 0.015,0.190 ± 0.020,0.061,24.3%


We've got a strong uplift both on the cross-validation (*Train*) and on the out-of-time validation dataset (*Eval 1*) **after enrichment**. MAPE on *Eval 1*:   
**BEFORE** enrichment 0.251   
**AFTER** enrichment 0.190
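The uplift columns in the metrics table can be reproduced with simple arithmetic. The sketch below assumes that absolute uplift is baseline MAPE minus enriched MAPE, and that relative uplift is the absolute uplift divided by the baseline; the `uplift` helper is hypothetical, and the inputs are the rounded values from the table, so results computed from unrounded metrics may differ slightly.

```python
def uplift(baseline_mape, enriched_mape):
    """Return (absolute uplift, relative uplift in percent) for an error metric."""
    abs_uplift = baseline_mape - enriched_mape
    rel_uplift_pct = 100.0 * abs_uplift / baseline_mape
    return abs_uplift, rel_uplift_pct

# Rounded baseline and enriched MAPE values taken from the metrics table
rows = {"Train": (0.294, 0.200), "Eval 1": (0.251, 0.190)}

for name, (baseline_mape, enriched_mape) in rows.items():
    abs_uplift, rel_uplift_pct = uplift(baseline_mape, enriched_mape)
    print(f"{name}: abs={abs_uplift:.3f}, rel={rel_uplift_pct:.1f}%")
```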

## 4️⃣ Enrich dataset with selected features
Trial access (for unregistered users) is limited to 1000 rows, so we enrich a 1000-row sample of the training data.

In [13]:
xy = pd.concat([train_features, train_target.to_frame("target")], axis=1)
xy_sampled = xy.sample(n=1000)
x = xy_sampled.drop(columns="target")
y = xy_sampled["target"]

transformed = enricher.transform(x, y=y)
transformed

You use Trial access to Upgini data enrichment. Limit for Trial: 1000 rows. You have already enriched: 0 rows.


Column name,Status,Errors
store,All valid,-
date,All valid,-
item,All valid,-




Running transform request, id=a06ff2ce-9c7f-4bb7-b70b-2a3fae754dee
We'll send email notification once it's completed, just use your personal api_key from profile.upgini.com

Retrieving selected features from data sources...


Unnamed: 0,date,store,item,target,f_financial_date_silver_7d_to_1y_0ccfe462,f_financial_date_silver_7d_to_7d_1y_shift_55fa8001,f_economic_date_cbpol_umap_6_aa0352de,f_financial_date_finance_umap_0_3c020a5e,f_financial_date_finance_umap_4_c0717402,f_economic_date_cbpol_pca_4_be889d56,...,f_economic_date_cpi_umap_1_ba1c4045,f_economic_date_cpi_umap_4_23b6dfce,f_economic_date_cbpol_pca_0_ef0bbff4,f_events_date_year_sin1_3c44bc64,f_events_date_year_cos1_9014a856,f_financial_date_crude_oil_7d_to_1y_c3e0ad17,f_financial_date_finance_umap_2_a414df3b,f_autofe_roll_3d_min_aed5463a33,f_autofe_roll_7d_min_ea7571ef80,f_autofe_trend_coef_8dae5d36ce
7398,2014-12-05,4,24,41,0.832188,0.816911,1.946529,13.127942,5.390800,-2.899011,...,5.000503,6.670873,1.044253,-0.288482,0.957485,0.697612,5.748660,0.957485,0.957485,0.029175
9,2013-01-01,10,21,33,0.963323,1.072025,3.161273,11.626170,3.280215,-1.012797,...,-3.219577,-3.515559,1.791906,0.171293,0.985220,0.962707,9.873652,0.985220,0.985220,0.000148
13066,2016-06-11,6,32,41,1.097506,1.047575,8.524164,10.809167,7.830725,0.686011,...,-5.599431,9.366595,-0.043835,0.187719,-0.982223,1.166733,6.578324,-0.982223,-0.982223,0.015550
7672,2015-01-03,1,24,51,0.834301,0.802567,0.504820,13.256419,5.586895,-2.636645,...,-1.852548,5.262119,1.078451,0.205104,0.978740,0.580085,5.476885,0.978740,0.978740,0.004449
11692,2016-01-28,2,49,21,0.919757,0.780429,3.428345,13.014300,6.392590,-0.518986,...,9.303439,-1.462748,-0.623442,0.593327,0.804962,0.660162,5.464448,0.804962,0.804962,0.012821
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3168,2013-10-24,7,20,26,0.852487,0.685435,9.187079,12.985901,3.776883,-2.210282,...,7.747222,7.829422,1.394478,-0.849817,0.527078,1.033131,8.343953,0.527078,0.527078,0.011203
10563,2015-10-10,1,38,86,0.976836,0.922405,2.564006,12.997958,6.642457,-0.288038,...,-0.357522,2.271612,0.476008,-0.951057,0.309017,0.855238,5.105819,0.309017,0.309017,0.000408
9088,2015-05-17,7,47,22,0.959473,0.877345,1.705898,13.701124,6.071007,-0.888304,...,9.374797,-0.928875,1.510224,0.587785,-0.809017,0.799212,4.684276,-0.809017,-0.809017,0.005109
14796,2016-11-23,7,19,30,0.987996,1.180037,8.211638,10.124886,7.855833,0.551180,...,12.966796,2.471978,-0.315194,-0.477536,0.878612,1.099300,7.118220,0.878612,0.878612,-0.014720


______________________________
**That's all for a quick start in 10 minutes!**  
If you found this useful or interesting, feel free to share.  
______________________________
## 🔗 Useful links
* Upgini Library [Documentation](https://github.com/upgini/upgini#readme)
* More [Notebooks and Guides](https://github.com/upgini/upgini?tab=readme-ov-file#-tutorials)
* Kaggle public [Notebooks](https://www.kaggle.com/romaupgini/code)


<sup>😔 Found a typo or a bug in a code snippet? Our bad! <a href="https://github.com/upgini/upgini/issues/new?assignees=&title=readme%2Fbug">
Please report it here.</a></sup>