![alt text](https://upgini.com/lib_tHloTHmnvYomfhRQ/1lth2xdahxnz5tr4.svg?w=206)

# Quick Start guide: Search for new relevant external features for product sales forecasting
_________________

In this guide, you'll learn how to **search for new relevant features with the Upgini library**. We will enrich a dataset with new features and significantly improve model accuracy, all in 4 simple steps.  
The goal is to predict future sales of different goods in stores based on a 5-year history of sales. The evaluation metric is SMAPE.  
⏱ Time needed: *15 minutes.*  

Download this notebook: [GitHub Link](https://github.com/upgini/upgini/blob/main/notebooks/kaggle_example.ipynb)
_________________

First, let's install the latest version of the Upgini library. We'll also need CatBoost for the later parts of this guide.

In [None]:
%pip install -Uq upgini catboost

## 1️⃣ Prepare input data

For this guide we'll use the train dataset from [Store Item Demand Forecasting Challenge](https://www.kaggle.com/c/demand-forecasting-kernels-only). You can download it from [here](https://www.kaggle.com/c/demand-forecasting-kernels-only/data?select=train.csv).  
To speed up the search we'll take a subsample.  
⚠️ All date/datetime columns in the input dataset should be converted to pandas datetime objects so that dates are represented correctly.
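
As a small illustration of the datetime requirement above, the sketch below converts a string column and checks its dtype before and after. The toy frame `raw` and the explicit `format` argument are assumptions made for this example; they are not part of the guide's dataset loading code.

```python
import pandas as pd

# Toy frame with dates stored as strings, as they usually arrive from a CSV file.
raw = pd.DataFrame(
    {"date": ["2013-01-01", "2013-01-02", "2017-12-31"], "sales": [5, 19, 26]}
)

# Before conversion the column holds strings, not datetimes.
before = pd.api.types.is_datetime64_any_dtype(raw["date"])

# With the default errors="raise", unparseable values fail loudly
# instead of silently turning into missing values.
raw["date"] = pd.to_datetime(raw["date"], format="%Y-%m-%d")

after = pd.api.types.is_datetime64_any_dtype(raw["date"])
print(before, after)
```

Checking the dtype right after loading is a cheap way to catch a date column that was read as plain text.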

In [3]:
from os.path import exists
import pandas as pd

df_path = "train.csv.zip" if exists("train.csv.zip") else "https://github.com/upgini/upgini/raw/main/notebooks/train.csv.zip"
df = pd.read_csv(df_path).sample(n=19_000, random_state=0)
df["store"] = df["store"].astype(str)
df["item"] = df["item"].astype(str)

# Convert date column to datetime pandas object
df["date"] = pd.to_datetime(df["date"])

df.sort_values("date", inplace=True)
df.reset_index(inplace=True, drop=True)
df.head()

| | date | store | item | sales |
|---|---|---|---|---|
| 0 | 2013-01-01 | 7 | 5 | 5 |
| 1 | 2013-01-01 | 4 | 9 | 19 |
| 2 | 2013-01-01 | 1 | 33 | 37 |
| 3 | 2013-01-01 | 3 | 41 | 14 |
| 4 | 2013-01-01 | 5 | 24 | 26 |


This dataset contains 5 years of records from 2013 to 2017. Let's split it into the train (2013–2016) and the evaluation (2017) parts.

In [4]:
train = df[df["date"] < "2017-01-01"]
test = df[df["date"] >= "2017-01-01"]

Let's also separate features from targets in *a scikit-learn style* (X and y).

In [5]:
train_features = train.drop(columns=["sales"])
train_target = train["sales"]
test_features = test.drop(columns=["sales"])
test_target = test["sales"]

## 2️⃣ Search new relevant features with FeaturesEnricher

Next, we will use **`FeaturesEnricher`** on the train dataset to find new features that are relevant to predicting this target.  
* To do this, we need to specify the column(s) containing [**search key(s)**](https://github.com/upgini/upgini#-search-key-types-we-support-more-to-come), in this case `date`, and provide the target to predict.  
* We can also specify any number of additional out-of-time validation datasets to evaluate the robustness of the new features.  
* This search task will be auto-detected as regression. Because this is a time series prediction (daily sales as the target variable), we have to pass the [**time series specific cross-validation split**](https://github.com/upgini/upgini#-time-series-prediction-support) **`CVType.time_series`**. The search algorithm then knows that we are working on a time series prediction task rather than a plain regression, and uses [time series CV](https://scikit-learn.org/stable/modules/cross_validation.html#time-series-split) for the new features search.  
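
To make the time series CV idea from the list above more concrete, here is a minimal sketch using scikit-learn's `TimeSeriesSplit` on a toy array. It only illustrates the general principle of the linked scikit-learn splitter; the splitting that Upgini applies internally for `CVType.time_series` is not shown in this guide and may differ. The array `X` and the choice of `n_splits=3` are assumptions for the example.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Ten observations that are already sorted by time (index 0 is the oldest).
X = np.arange(10).reshape(-1, 1)

splitter = TimeSeriesSplit(n_splits=3)
folds = [
    (train_idx.tolist(), test_idx.tolist())
    for train_idx, test_idx in splitter.split(X)
]

for train_idx, test_idx in folds:
    # Every training index precedes every validation index,
    # so a fold never trains on observations from its own future.
    assert max(train_idx) < min(test_idx)
    print("train:", train_idx, "validate:", test_idx)
```

Unlike a shuffled K-fold split, each fold here validates on a block of rows that comes strictly after its training rows, which is why the input must be sorted by date first.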

The search step will take around *2.5 minutes*.

In [6]:
from upgini import FeaturesEnricher, SearchKey
from upgini.metadata import CVType

enricher = FeaturesEnricher(
    search_keys={
        "date": SearchKey.DATE,
    },
    cv=CVType.time_series,
)
enricher.fit(
    train_features,
    train_target,
    eval_set=[(test_features, test_target)],
)


Detected task type: ModelTaskType.REGRESSION


| Column name | Status | Description |
|---|---|---|
| date | All valid | All values in this column are good to go |
| target | All valid | All values in this column are good to go |


Running search request with search_id=5d9fc97f-dd01-45bf-b089-a5817e40a2e6
We'll send email notification once it's completed, just use your personal api_key from profile.upgini.com
Done

26 relevant feature(s) found with the search keys: ['date'].


| | feature_name | shap_value | coverage % | type |
|---|---|---|---|---|
| 0 | item | 0.488457 | 100.0 | CHARACTER |
| 1 | store | 0.17259 | 100.0 | CHARACTER |
| 2 | f_weather_date_weather_pca_0_d7e0a1fc | 0.056901 | 100.0 | NUMERIC |
| 3 | f_events_date_week_sin1_847b5db1 | 0.047424 | 100.0 | NUMERIC |
| 4 | f_events_date_week_cos1_f6a8c1fc | 0.029408 | 100.0 | NUMERIC |
| 5 | f_weather_date_weather_umap_48_b39cd0c4 | 0.025571 | 100.0 | NUMERIC |
| 6 | f_weather_date_weather_umap_24_2e14c9a6 | 0.018662 | 100.0 | NUMERIC |
| 7 | f_weather_date_weather_umap_33_89bb7578 | 0.0154 | 100.0 | NUMERIC |
| 8 | f_events_date_year_cos1_9014a856 | 0.01297 | 100.0 | NUMERIC |
| 9 | f_financial_date_silver_14e835ea | 0.006757 | 100.0 | NUMERIC |


We've got **20+ new relevant features** from [different sources such as weather data, calendar data, and financial data](https://github.com/upgini/upgini#-connected-data-sources-and-coverage), which are expected to improve the accuracy of the model. They are ranked by [SHAP values](https://en.wikipedia.org/wiki/Shapley_value).

The initial features from the search dataset are checked for relevancy as well, so you don't need an extra feature selection step.

## 3️⃣ Calculate model metrics and uplift from new relevant features
You can use any model estimator with a scikit-learn compatible interface. Let's take the CatBoost regressor.  
For the evaluation metric there are two options:
* A predefined evaluation function alias from the [*Upgini library*](https://github.com/upgini/upgini#-accuracy-and-uplift-metrics-calculations), like **`RMSLE`** for Root Mean Squared Logarithmic Error
* A custom evaluation function defined with [scikit-learn make_scorer](https://scikit-learn.org/0.15/modules/model_evaluation.html#defining-your-scoring-strategy-from-score-functions), for example [SMAPE](https://en.wikipedia.org/wiki/Symmetric_mean_absolute_percentage_error)

The model evaluation metric for both the train and validation datasets is calculated with the same cross-validation strategy as for **`FeaturesEnricher.fit()`**, in this example [time series CV](https://github.com/upgini/upgini#-time-series-prediction-support). The cell below passes the alias `mean_absolute_percentage_error`.
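
For the second option, a custom scorer could look roughly like the sketch below. The function name `smape`, the scorer name `smape_scorer`, the percent scale (0 to 200), and the handling of pairs where both values are zero are all assumptions made for illustration; none of them is prescribed by Upgini or by this guide.

```python
import numpy as np
from sklearn.metrics import make_scorer


def smape(y_true, y_pred):
    """Symmetric MAPE in percent; pairs where both values are zero count as zero error."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    diff = np.abs(y_true - y_pred)
    denom = (np.abs(y_true) + np.abs(y_pred)) / 2.0
    # Divide only where the denominator is non-zero; other entries stay at 0.
    ratio = np.divide(diff, denom, out=np.zeros_like(diff), where=denom != 0)
    return 100.0 * float(np.mean(ratio))


# Lower SMAPE is better, so scikit-learn negates the value when scoring.
smape_scorer = make_scorer(smape, greater_is_better=False)

print(smape([100, 200], [110, 180]))
```

Following the second option above, `smape_scorer` would then be passed as the `scoring` argument of `enricher.calculate_metrics` in place of a string alias; that call is not run here.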

In [7]:
from catboost import CatBoostRegressor
from catboost.utils import eval_metric
model = CatBoostRegressor(verbose=False, allow_writing_files=False, random_state=0)

# Calculate metrics before and after enrichment with the new relevant features
enricher.calculate_metrics(
    estimator=model,
    scoring="mean_absolute_percentage_error",
)

Calculating metrics...
Done


| | match_rate | baseline mean_absolute_percentage_error | enriched mean_absolute_percentage_error | uplift |
|---|---|---|---|---|
| train | 100.0 | 0.254322 | 0.166744 | 0.087578 |
| eval 1 | 100.0 | 0.267351 | 0.187815 | 0.079536 |


We've got a strong metric uplift both on cross-validation (*train*) and on the out-of-time validation dataset (*eval 1*) **after enrichment**.
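
To read the metrics output, it can help to redo the arithmetic by hand. In the numbers reported above, the `uplift` column equals the baseline error minus the enriched error; the short sketch below checks this and also expresses the uplift relative to the baseline. The dictionary layout and variable names are assumptions for this example, and the values are copied from the output above.

```python
# Values copied from the metrics output above.
metrics = {
    "train": {"baseline": 0.254322, "enriched": 0.166744},
    "eval 1": {"baseline": 0.267351, "enriched": 0.187815},
}

summary = {}
for name, m in metrics.items():
    absolute = m["baseline"] - m["enriched"]  # matches the reported "uplift" value
    relative = absolute / m["baseline"]       # share of the baseline error removed
    summary[name] = (absolute, relative)
    print(f"{name}: absolute uplift {absolute:.6f}, relative reduction {relative:.1%}")
```

Since MAPE is an error metric, a positive uplift here means the error went down after enrichment.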

## 4️⃣ Enrich datasets with new features and retrain model

Now we can enrich our datasets with the features found and use them in our own ML pipelines. Let's enrich both the train and the test datasets.  
The enrichment step for the two datasets will take *2.5 minutes*.

In [8]:
enriched_train_features = enricher.transform(train_features, keep_input=True)
enriched_test_features = enricher.transform(test_features, keep_input=True)
enriched_train_features.head()

90.39637% of the rows are fully duplicated


Column name,Status,Description
date,All valid,All values in this column are good to go


Running search request with search_id=9faaf771-c3ae-4e7f-9390-9dc95c84da66
We'll send email notification once it's completed, just use your personal api_key from profile.upgini.com
Done

Collecting selected features...
Done
90.36176% of the rows are fully duplicated


Column name,Status,Description
date,All valid,All values in this column are good to go


Running search request with search_id=a56b34dc-7812-4d88-9f25-ad38405cfae8
We'll send email notification once it's completed, just use your personal api_key from profile.upgini.com
Done

Collecting selected features...
Done


| | date | store | item | f_weather_date_weather_pca_0_d7e0a1fc | f_events_date_week_sin1_847b5db1 | f_events_date_week_cos1_f6a8c1fc | f_weather_date_weather_umap_48_b39cd0c4 | f_weather_date_weather_umap_24_2e14c9a6 | f_weather_date_weather_umap_33_89bb7578 | f_events_date_year_cos1_9014a856 | ... | f_financial_date_dow_jones_7d_to_7d_1y_shift_61f71e90 | f_events_date_italy_game_cnt_99570b80 | f_weather_date_weather_umap_34_c3ef5b4f | f_economic_date_cbpol_pca_3_27450634 | f_financial_date_nasdaq_c568533e | f_economic_date_cbpol_umap_1_7eb7a343 | f_financial_date_silver_7d_to_7d_1y_shift_9cb6bdfc | f_weather_date_weather_umap_45_d474bf8d | f_economic_date_cbpol_umap_6_aa0352de | f_economic_date_cpi_umap_4_970cc061 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2013-01-01 | 7 | 5 | 29.676683 | 0.781831 | 0.62349 | 4.540985 | 5.828106 | 4.644803 | 0.98522 | ... | 1.065267 | 0 | 5.664261 | -0.323471 | 3019.51001 | 4.815701 | 1.072025 | 4.923654 | 1.367325 | 10.153208 |
| 1 | 2013-01-01 | 4 | 9 | 29.676683 | 0.781831 | 0.62349 | 4.540985 | 5.828106 | 4.644803 | 0.98522 | ... | 1.065267 | 0 | 5.664261 | -0.323471 | 3019.51001 | 4.815701 | 1.072025 | 4.923654 | 1.367325 | 10.153208 |
| 2 | 2013-01-01 | 1 | 33 | 29.676683 | 0.781831 | 0.62349 | 4.540985 | 5.828106 | 4.644803 | 0.98522 | ... | 1.065267 | 0 | 5.664261 | -0.323471 | 3019.51001 | 4.815701 | 1.072025 | 4.923654 | 1.367325 | 10.153208 |
| 3 | 2013-01-01 | 3 | 41 | 29.676683 | 0.781831 | 0.62349 | 4.540985 | 5.828106 | 4.644803 | 0.98522 | ... | 1.065267 | 0 | 5.664261 | -0.323471 | 3019.51001 | 4.815701 | 1.072025 | 4.923654 | 1.367325 | 10.153208 |
| 4 | 2013-01-01 | 5 | 24 | 29.676683 | 0.781831 | 0.62349 | 4.540985 | 5.828106 | 4.644803 | 0.98522 | ... | 1.065267 | 0 | 5.664261 | -0.323471 | 3019.51001 | 4.815701 | 1.072025 | 4.923654 | 1.367325 | 10.153208 |


We now have the new features and are ready to retrain the model.  
**BEFORE** enrichment with the new features:

In [9]:
model.fit(train_features, train_target)
preds = model.predict(test_features)
eval_metric(test_target.values, preds, "SMAPE")

[37.65141857448004]

**AFTER** enrichment:

In [10]:
model.fit(enriched_train_features, train_target)
enriched_preds = model.predict(enriched_test_features)
eval_metric(test_target.values, enriched_preds, "SMAPE")

[14.187522108368377]
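
To put the two SMAPE results side by side, the short sketch below computes the absolute and relative reduction from the values printed above. The variable names are chosen for this example only.

```python
# SMAPE values copied from the two cell outputs above (2017 hold-out data).
smape_before = 37.65141857448004   # model trained on the original features
smape_after = 14.187522108368377   # model trained on the enriched features

absolute_drop = smape_before - smape_after
relative_drop = absolute_drop / smape_before
print(f"SMAPE dropped by {absolute_drop:.2f} points ({relative_drop:.1%} relative)")
```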

______________________________
Thanks for reading! If you found this useful or interesting, please share with a friend.
______________________________
## 🔗 Useful links
* Upgini Library [Documentation](https://github.com/upgini/upgini#readme)
* More [Notebooks and Guides](https://github.com/upgini/upgini#briefcase-use-cases)
* Kaggle public [Notebooks](https://www.kaggle.com/romaupgini/code)


<sup>😔 Found a typo or a bug in a code snippet? Our bad! <a href="https://github.com/upgini/upgini/issues/new?assignees=&title=readme%2Fbug">
Please report it here.</a></sup>