![Upgini logo](https://cdn.prod.website-files.com/65d5721664bea140c05f5301/65e354e4b9ddb1c6aaa7d7b1_upgini_logo%20gradient.svg)   
## [Intelligent data search & enrichment engine for Machine Learning](https://upgini.com)
### Quick Start guide:
### Search for relevant external features & automated feature generation for a salary prediction task  
_________________

Following this guide, you'll learn how to **search for & auto-generate new relevant features with the Upgini library in just 3 simple steps.**  
We will enrich a training dataset with both external & automatically generated features and significantly improve model accuracy.  
*The goal is to predict the salary for a data science job posting based on information about the employer and the job description.*  
The evaluation metric is Mean Absolute Error (MAE).  
⏱ Time needed: *15 minutes.*  
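
For reference, the MAE metric used throughout this guide is the average absolute difference between actual and predicted values. The sketch below is a minimal pure-Python illustration; the `mean_absolute_error` helper and the sample numbers are hypothetical and are not part of the Upgini library.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between actual and predicted values."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Illustrative target values and predictions (not taken from a real model run)
actual = [72.0, 87.5, 85.0]
predicted = [70.0, 90.0, 80.0]

mae = mean_absolute_error(actual, predicted)
print(mae)
```

A lower MAE means predictions are, on average, closer to the true salary values.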

Download this notebook: [GitHub Link](https://github.com/upgini/upgini/blob/main/notebooks/Upgini_Features_search&generation.ipynb)  
_________________

First, let's install the latest version of the Upgini library.

In [1]:
%pip install -Uq upgini catboost


## 1️⃣ Use your labeled training dataset for search & feature generation

You can use your labeled training datasets "as is" to initiate the search.  
For this guide we'll use the dataset from [Glassdoor salary prediction](https://www.kaggle.com/datasets/thedevastator/jobs-dataset-from-glassdoor) with employer addresses geocoded as postal/ZIP codes. You can download the extended version [here](https://github.com/upgini/upgini/blob/main/notebooks/demo_salary.csv.zip).  
*This dataset contains job postings from Glassdoor.com from 2017, with several text columns including Job title, Job description, and Company name.*  
License CC0: Public Domain  
The goal is to predict the salary for a data science job posting.
The column with the target label for salary prediction is `avg_salary`.  
> ⚠️ All date/datetime columns in the input dataset should be converted to pandas datetime objects so that they are represented correctly
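
As a rough illustration of the datetime note above, date strings can be converted with `pd.to_datetime`. The `posted_at` column and its values are hypothetical; the demo dataset used in this guide does not contain such a column.

```python
import pandas as pd

# Hypothetical frame: "posted_at" is an illustrative column, not part of the demo dataset
jobs = pd.DataFrame({
    "posted_at": ["2017-03-01", "2017-03-15", "2017-04-02"],
    "avg_salary": [72.0, 87.5, 85.0],
})

# Parse the date strings into pandas datetime64 values
jobs["posted_at"] = pd.to_datetime(jobs["posted_at"], format="%Y-%m-%d")
print(jobs.dtypes)
```

Passing an explicit `format` avoids ambiguous day/month parsing when the layout of the strings is known.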


In [3]:
from os.path import exists
import pandas as pd

df_path = "demo_salary.csv.zip" if exists("demo_salary.csv.zip") else "https://github.com/upgini/upgini/raw/main/notebooks/demo_salary.csv.zip"
df = pd.read_csv(df_path)
df.head(2)

Unnamed: 0,Company Name,Location,Headquarters,Size,Founded,Type of ownership,Industry,Sector,Revenue,avg_salary,...,R_yn,spark,aws,excel,job_simp,desc_len,num_comp,Postal_code,country,combined
0,Tecolote Research\n3.8,"Albuquerque, NM","Goleta, CA",501 to 1000 employees,1973,Company - Private,Aerospace & Defense,Aerospace & Defense,$50 to $100 million (USD),72.0,...,0,0,0,1,data scientist,2536,0,87102,US,Job title: Data Scientist; Job Description: Da...
1,University of Maryland Medical System\n3.4,"Linthicum, MD","Baltimore, MD",10000+ employees,1984,Other Organization,Health Care Services & Hospitals,Health Care,$2 to $5 billion (USD),87.5,...,0,0,0,0,data scientist,4783,0,21090,US,Job title: Healthcare Data Scientist; Job Desc...


In [4]:
pd.set_option('display.max_colwidth', 768)
df["combined"].head(2)

Unnamed: 0,combined
0,"Job title: Data Scientist; Job Description: Data Scientist\nLocation: Albuquerque, NM\nEducation Required: Bachelor’s degree required, preferably in math, engineering, business, or the sciences.\nSkills Required:\nBachelor’s Degree in relevant field, e.g., math, data analysis, database, computer science, Artificial Intelligence (AI); three years’ experience credit for Master’s degree; five years’ experience credit for a Ph.D\nApplicant should be proficient in the use of Power BI, Tableau, Python, MATLAB, Microsoft Word, PowerPoint, Excel, and working knowledge of MS Access, LMS, SAS, data visualization tools, and have a strong algorithmic aptitude\nExcellent verbal and written communication skills, and quantitative analytical skills are required\nApplica..."
1,"Job title: Healthcare Data Scientist; Job Description: What You Will Do:\n\nI. General Summary\n\nThe Healthcare Data Scientist position will join our Advanced Analytics group at the University of Maryland Medical System (UMMS) in support of its strategic priority to become a data-driven and outcomes-oriented organization. The successful candidate will have 3+ years of experience with Machine Learning, Predictive Modeling, Statistical Analysis, Mathematical Optimization, Algorithm Development and a passion for working with healthcare data. Previous experience with various computational approaches along with an ability to demonstrate a portfolio of relevant prior projects is essential. This position will report to the UMMS Vice President for Enterprise Da..."


## 2️⃣ Choose one or multiple columns as search keys and select columns for automated feature generation

Under the hood, we'll search for relevant data using:
- **[search keys](https://github.com/upgini/upgini#-search-key-types-we-support-more-to-come)** from the training dataset to match records from potential data sources with new features
- **labels** from the training dataset to estimate the relevancy of candidate features for your ML task and calculate feature importance metrics  
- **your features** from the training dataset to find external datasets and features that will improve accuracy in addition to your existing features and estimate accuracy uplift ([optional](https://github.com/upgini/upgini#find-features-only-give-accuracy-gain-to-existing-data-in-the-ml-model))


Define one or multiple columns as search keys and select **text columns** for automated feature generation, in this example `'combined', 'company_txt'`  

>⚠️ This search task will be auto-detected as a regression. If you have time series prediction (for example, daily sales as a target variable) and not just simple regression, you have to pass [**time series specific cross-validation split**](https://github.com/upgini/upgini#-time-series-prediction-support) **`CVType.time_series`**, as well
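
To show why time series need a dedicated split, here is a minimal pure-Python sketch of an expanding-window split, where every test fold comes strictly after its training rows. The `expanding_window_splits` helper is hypothetical and only illustrates the general idea; it is not how Upgini implements `CVType.time_series` internally. Rows are assumed to be sorted by time.

```python
def expanding_window_splits(n_samples, n_splits):
    """Yield (train_indices, test_indices) where test rows always follow train rows."""
    fold_size = n_samples // (n_splits + 1)
    if fold_size == 0:
        raise ValueError("Not enough samples for the requested number of splits")
    for i in range(1, n_splits + 1):
        train_end = fold_size * i
        # The last fold absorbs any leftover rows at the end of the series
        test_end = train_end + fold_size if i < n_splits else n_samples
        yield list(range(train_end)), list(range(train_end, test_end))


splits = list(expanding_window_splits(n_samples=10, n_splits=3))
for train_idx, test_idx in splits:
    print(train_idx, test_idx)
```

Unlike a shuffled split, no fold here is trained on rows that come later than the rows it is evaluated on.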

In [5]:
from upgini import FeaturesEnricher, SearchKey

enricher = FeaturesEnricher(
  search_keys={
    'country': SearchKey.COUNTRY,
    'Postal_code': SearchKey.POSTAL_CODE
  },
  text_features=['combined', 'company_txt']
)
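
Rows with missing search key values fail validation and are removed from the search, as the validation report in the next step shows. A quick check like the sketch below can reveal how many rows are affected beforehand. The toy frame and its values are illustrative only; with the demo data you would run the same expression on `df`.

```python
import pandas as pd

# Toy frame with the same key columns as the demo dataset; values are illustrative
sample = pd.DataFrame({
    "country": ["US", "US", "US", "US"],
    "Postal_code": ["87102", "21090", None, "33755"],
})

# Share of rows with a missing value in each search key column, in percent
missing_share = sample[["country", "Postal_code"]].isna().mean() * 100
print(missing_share.round(2))
```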

## 3️⃣ Start your search & feature generation with Scikit-learn compatible estimator

The main abstraction you interact with is `FeaturesEnricher`, a Scikit-learn compatible estimator. You can easily add it to your existing ML pipelines.
Create an instance of the `FeaturesEnricher` class and call:
- `fit()` to search for relevant datasets & features  
- then `transform()` to enrich your dataset with features from the search result
- or combine both steps with a single method `fit_transform()`

You need to separate features from targets in *a scikit-learn style* (X and y).
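
The `fit()` / `transform()` / `fit_transform()` convention described above can be sketched with a toy class. `CenteredFeatureAdder` is a hypothetical stand-in written for illustration; it has nothing to do with how `FeaturesEnricher` works internally and only shows the calling pattern with features (X) kept separate from the target (y).

```python
class CenteredFeatureAdder:
    """Toy estimator following the fit/transform convention (not Upgini code)."""

    def fit(self, X, y=None):
        # Learn a statistic from the training features: the mean of column 0.
        # y is accepted for API compatibility but is not used by this toy class.
        self.mean_ = sum(row[0] for row in X) / len(X)
        return self

    def transform(self, X):
        # Return a copy of each row with one extra derived column appended
        return [row + [row[0] - self.mean_] for row in X]

    def fit_transform(self, X, y=None):
        # Convenience method: fit, then transform the same data
        return self.fit(X, y).transform(X)


X = [[2536], [4783]]   # features (illustrative values)
y = [72.0, 87.5]       # target, kept separate from the features
enriched = CenteredFeatureAdder().fit_transform(X, y)
print(enriched)
```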

> The search step will take around *30 minutes* for this training dataset

In [6]:
train_features = df.drop(['avg_salary'], axis=1)
train_target = df.avg_salary
enriched_train_features = enricher.fit(
    train_features,
    train_target,
    scoring = "mean_absolute_error")


Demo training dataset detected. Registration for an API key is not required.


Detected task type: ModelTaskType.REGRESSION. Reason: many unique label-values or non-integer floating point values observed
You can set task type manually with argument `model_task_type` of FeaturesEnricher constructor if task type detected incorrectly



['r_yn_8f6661']




Column name,Status,Errors
country,All valid,-
current_date,All valid,-
target,All valid,-
Postal_code,Some invalid,"2.16% values failed validation and removed from dataframe, invalid values: [<NA>]"







Running search request, search_id=e6b39324-f948-45ec-80c2-9ac426c643f1
We'll send email notification once it's completed, just use your personal api_key from profile.upgini.com



Feature name,SHAP value,Coverage %,Value preview,Provider,Source,Updates
job_simp,7.0045,100.0,"director, na, mle",,,
f_autofe_sim_jw2_fe9c7bc0f8,4.554,100.0,"0.4951, 0.4653, 0.5196",Training dataset,AutoFE: features from Training dataset,
f_autofe_div_32bb85fdeb,2.8315,96.9828,"41120.5452, 37013.3994, 21011....",Upgini,AutoFE: features from World mobile network coverage,Quarterly
f_telecom_country_postal_cells_UMTS_20km_range_stddev_c850530d,2.3359,96.9828,"8465.557, 3471.5939, 12162.449...",Upgini,World mobile network coverage,Quarterly
f_autofe_div_ccdcaa4174,2.2362,74.3534,"5914.3989, 2853.9094, 667.8693",Upgini,AutoFE: features from POI data OpenStreetMap,Quarterly
f_marketing_country_postal_person_ethnic_code_non_europe_prc_8c825cdd,2.2043,83.1897,"0.1252, 0.3747, 1.0",Upgini,Public customer profile,Quarterly
f_autofe_div_bb4ce57711,2.1399,95.6897,"11.9194, 561.4679, 130.7484",Upgini,AutoFE: features from POI data OpenStreetMap,Quarterly
f_telecom_country_postal_cells_UMTS_10km_range_avg_9dcf0c9e,2.1097,96.9828,"3476.5716, 1557.6053, 4090.918...",Upgini,World mobile network coverage,Quarterly
f_company_txt_4c5666_org_emb_autofe_emb_outlier_dist_all_49907a2062,2.0522,100.0,"16.5912, 18.7222, 14.0203",Upgini,AutoFE: feature from LLM with external data augmentation,Daily
f_avg_lat_ms_to_history_a084d607,1.6616,96.9828,"2.4444, 0.6667, 0.3571",Ookla Speedtest,okla open data,Quarterly


Provider,Source,All features SHAP,Number of relevant features
Training dataset,AutoFE: features from Training dataset,6.1677,2
Upgini,World mobile network coverage,5.3977,3
Upgini,AutoFE: features from POI data OpenStreetMap,4.3761,2
Ookla Speedtest,okla open data,2.9774,2
Upgini,AutoFE: features from World mobile network coverage,2.8315,1
Upgini,Public customer profile,2.2043,1
Upgini,AutoFE: feature from LLM with external data augmentation,2.0522,1
Upgini,"AutoFE: features from World demographic data,POI data OpenStreetMap",1.3726,1
Upgini,"AutoFE: features from LLM with external data augmentation,World demographic data",1.2568,1


Sources,Feature name,Feature 1,Feature 2,Function
Training dataset,f_autofe_sim_jw2_fe9c7bc0f8,job_simp_d49976,type_of_ownership_a589fc,sim_jw2
World mobile network coverage,f_autofe_div_32bb85fdeb,f_telecom_country_postal_cells_LTE_10km_days_from_update_avg_708fde84,f_telecom_country_postal_cells_UMTS_10km_days_from_update_min_a00c8d95,"/,norm"
POI data OpenStreetMap,f_autofe_div_ccdcaa4174,f_location_country_postal_poi_leisure_pitch_2km_cnt_c57fa462,f_location_country_postal_poi_public_police_2km_cnt_to_population_fed6a74b,"/,norm"
POI data OpenStreetMap,f_autofe_div_bb4ce57711,f_location_country_postal_poi_tourism_tourist_map_10km_cnt_bc05acc3,f_location_country_postal_poi_public_police_10km_cnt_to_population_5f94f54b,"/,norm"
LLM with external data augmentation,f_company_txt_4c5666_org_emb_autofe_emb_outlier_dist_all_49907a2062,company_txt_4c5666_org_emb,,"emb,outlier_dist_all"
Training dataset,f_autofe_sim_jw1_22754eb785,industry_b44484,type_of_ownership_a589fc,sim_jw1
"World demographic data,POI data OpenStreetMap",f_autofe_mul_95d464525a,f_location_country_postal_b01001e43_05df4697,f_location_country_postal_poi_public_kindergarten_10km_cnt_568440d3,*
"LLM with external data augmentation,World demographic data",f_autofe_dist_5a4ff26466,company_txt_4c5666_org_emb,f_location_country_postal_region_capital_city_name_489f3ce8,"dist,emb"


Calculating accuracy uplift after enrichment...


Dataset type,Rows,Mean target,Baseline mean_absolute_error,Enriched mean_absolute_error,"Uplift, abs","Uplift, %"
Train,464,100.7802,24.482 ± 1.016,22.030 ± 0.940,2.453,10.0%
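
The two uplift columns in the table above appear to follow from the baseline and enriched MAE. The sketch below reproduces them under the assumption that the percentage is taken relative to the baseline; the `uplift` helper is hypothetical and not part of the Upgini API. Because the table values are already rounded, the last digit of the absolute uplift can differ slightly from the reported 2.453.

```python
def uplift(baseline_mae, enriched_mae):
    """Return (absolute, percent) MAE reduction relative to the baseline."""
    absolute = baseline_mae - enriched_mae
    percent = absolute / baseline_mae * 100
    return absolute, percent


# Mean MAE values copied from the Train row of the table above
abs_uplift, pct_uplift = uplift(baseline_mae=24.482, enriched_mae=22.030)
print(f"{abs_uplift:.3f} {pct_uplift:.1f}%")
```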


We've got **10+ new relevant features** from:
- Various sources  [automatically optimized by Upgini](https://upgini.com/#optimized_external_data) such as [World demographic & census data, World mobile network coverage, Location/Places/POI/Area/Proximity data from OpenStreetMap](https://github.com/upgini/upgini#-connected-data-sources-and-coverage)
- Automated feature generation for two selected text columns `'combined', 'company_txt'` with [Large Language Models' data augmentation](https://upgini.com/#large_language_models)

All ranked by [SHAP values](https://en.wikipedia.org/wiki/Shapley_value).

Initial features from the training dataset will also be checked for relevancy, so you don't need an extra feature selection step.

Also, `FeaturesEnricher` automatically calculates model metrics and uplift from new relevant features using the default `calculate_metrics=True` parameter in the `fit()` or `fit_transform()` methods.  
For this, you can pass any estimator with a scikit-learn compatible interface via `estimator` and define custom model metrics with `scoring`. More details [here](https://github.com/upgini/upgini#-accuracy-and-uplift-metrics-calculations)

Result of search & enrichment request:

⭐️ Enriched pandas dataframe **with 10+ new relevant features** `enriched_train_features`  
⭐️ Calculated accuracy uplift after enrichment: MAE from 24.5 BEFORE to 22.0 AFTER for a basic **non task-optimized ML model**; MAE is mean absolute error, lower is better
>💡 You can also enrich production ML pipelines, more details [here](https://github.com/upgini/upgini#6--enrich-production-ml-pipeline-with-relevant-external-features)

## ✅ Retrain model with enriched training dataset

Now, you can use an enriched dataframe to train a more accurate, **task-optimized ML model** in your existing ML pipeline.   
As an example, let's take `CatBoostRegressor`.

In [8]:
enriched_train_features = enricher.transform(
  train_features,
  train_target,
)
enriched_train_features.head(2)

You use Trial access to Upgini data enrichment. Limit for Trial: 1000 rows. You have already enriched: 0 rows.


Column name,Status,Errors
country,All valid,-
current_date,All valid,-
Postal_code,Some invalid,"2.41% values failed validation and removed from dataframe, invalid values: [<NA>]"




Running transform request, id=dc9ab7f3-f940-447d-b4b0-998f02e08316
We'll send email notification once it's completed, just use your personal api_key from profile.upgini.com

Retrieving selected features from data sources...


Unnamed: 0,job_simp,desc_len,Postal_code,country,target,f_marketing_country_postal_person_ethnic_code_non_europe_prc_8c825cdd,f_avg_lat_ms_to_history_bea1030b,f_avg_lat_ms_to_history_a084d607,f_telecom_country_postal_cells_UMTS_20km_range_stddev_c850530d,f_telecom_country_postal_cells_20km_samples_avg_80024d9e,f_telecom_country_postal_cells_UMTS_10km_range_avg_9dcf0c9e,f_autofe_div_32bb85fdeb,f_autofe_mul_95d464525a,f_autofe_div_bb4ce57711,f_autofe_div_ccdcaa4174,f_autofe_dist_5a4ff26466,f_company_txt_4c5666_org_emb_autofe_emb_outlier_dist_all_49907a2062,f_autofe_sim_jw1_22754eb785,f_autofe_sim_jw2_fe9c7bc0f8
0,data scientist,2536,87102,US,72.0,0.07018,0.558824,0.909091,11660.630246,1097.624628,4894.663616,34113.280695,888.0,0.0,684.643303,0.693521,18.586511,0.500688,0.482726
1,data scientist,4783,21090,US,87.5,0.152481,0.612903,0.5,15182.660656,812.480826,6139.380261,39386.8965,935.0,56.748444,,0.625486,17.638172,0.541667,0.546958


In [9]:
from catboost import CatBoostRegressor
from catboost.utils import eval_metric
from sklearn.model_selection import train_test_split

# Find all categorical features and replace NaNs with 'NA'
cat_col_enriched = [col for col in enriched_train_features.columns if enriched_train_features[col].dtype == "O"]
enriched_train_features.loc[:, cat_col_enriched] = enriched_train_features.loc[:, cat_col_enriched].fillna("NA")

cat_col_baseline = [col for col in train_features.columns if train_features[col].dtype == "O"]
train_features.loc[:, cat_col_baseline] = train_features.loc[:, cat_col_baseline].fillna("NA")

# Train and test split for correct model evaluation
X_train, X_test, y_train, y_test, X_train_baseline, X_test_baseline = train_test_split(
    enriched_train_features,
    train_target,
    train_features,
    test_size=0.2,
    shuffle=True,
    random_state=0)

# Task-optimized Catboost estimator
model = CatBoostRegressor(
    learning_rate=0.03,
    iterations=330,
    random_state=0,
    eval_metric="MAE",
    verbose=False,)

Baseline **BEFORE** enrichment with the new features, *Mean Absolute Error*:

In [10]:
model.fit(X_train_baseline, y_train, cat_features=cat_col_baseline)
preds = model.predict(X_test_baseline)
eval_metric(y_test.values, preds, "MAE")

[22.415689095313247]

**AFTER** enrichment, *Mean Absolute Error*:

In [11]:
model.fit(X_train, y_train, cat_features=cat_col_enriched)
preds = model.predict(X_test)
eval_metric(y_test.values, preds, "MAE")

[2.099071501866811]

______________________________
**That's all for a quick start in 15 minutes!**  
If you found this useful or interesting, feel free to share.  
______________________________
## 🔗 Useful links
* Upgini Library [Documentation](https://github.com/upgini/upgini#readme)
* More [Notebooks and Guides](https://github.com/upgini/upgini?tab=readme-ov-file#-tutorials)
* Kaggle public [Notebooks](https://www.kaggle.com/romaupgini/code)


<sup>😔 Found a typo or a bug in a code snippet? Our bad! <a href="https://github.com/upgini/upgini/issues/new?assignees=&title=readme%2Fbug">
Please report it here.</a></sup>

## Optional: Enrichment with **external data & features only**, without LLM-based feature generation

To enrich the training dataset ONLY with features from external data sources, without automated feature generation on the text columns, simply remove the parameter `text_features=['combined', 'company_txt']` from `FeaturesEnricher`.  
Thus, you'll be able to compare the uplift from *LLM-based feature generation + external data* vs. the *uplift from external data and features only*:  

In [12]:
df = pd.read_csv(df_path)
train_features = df.drop(['avg_salary'], axis=1)
train_target = df.avg_salary

enricher = FeaturesEnricher(
  search_keys={
    'country': SearchKey.COUNTRY,
    'Postal_code': SearchKey.POSTAL_CODE
  }
)
enricher.fit(train_features, train_target, scoring="mean_absolute_error")

Demo training dataset detected. Registration for an API key is not required.


Detected task type: ModelTaskType.REGRESSION. Reason: many unique label-values or non-integer floating point values observed
You can set task type manually with argument `model_task_type` of FeaturesEnricher constructor if task type detected incorrectly



['r_yn_8f6661']


Sample of incorrect row indexes: [3, 8, 17, 24, 51, 72, 121, 122, 134, 139, 147, 151, 155, 162, 169, 178, 180, 192, 237, 244, 269, 274, 280, 282, 288, 320, 323, 329, 332, 340, 373, 380, 389, 404, 414, 426, 428, 432, 438, 441, 444, 450]



Column name,Status,Errors
country,All valid,-
current_date,All valid,-
target,All valid,-
Postal_code,Some invalid,"2.40% values failed validation and removed from dataframe, invalid values: [<NA>]"




Running search request, search_id=939399e3-2da9-473f-8a0d-bc6966c03292
We'll send email notification once it's completed, just use your personal api_key from profile.upgini.com



Feature name,SHAP value,Coverage %,Value preview,Provider,Source,Updates
job_simp,7.3723,100.0,"mle, analyst, manager",,,
f_autofe_sim_jw2_fe9c7bc0f8,4.3998,100.0,"0.4951, 0.4653, 0.4232",Training dataset,AutoFE: features from Training dataset,
f_telecom_country_postal_cells_CDMA_10km_samples_avg_c5d14676,3.1364,96.6346,"2.8302, 21.9818, 17.0479",Upgini,World mobile network coverage,Quarterly
f_telecom_country_postal_cells_UMTS_10km_range_avg_9dcf0c9e,2.669,96.6346,"4146.128, 3578.1304, 1774.4718",Upgini,World mobile network coverage,Quarterly
f_telecom_country_postal_cells_UMTS_20km_range_stddev_c850530d,2.4264,96.6346,"10745.9065, 5154.0085, 3680.14...",Upgini,World mobile network coverage,Quarterly
f_location_country_postal_poi_public_graveyard_10km_cnt_to_population_b5b66acf,2.4153,96.6346,"0.0113, 0.0729, 0.0624",Upgini,POI data OpenStreetMap,Quarterly
f_marketing_country_postal_person_ethnic_code_non_europe_prc_8c825cdd,2.0674,81.9712,"0.0748, 0.1476, 0.1927",Upgini,Public customer profile,Quarterly
f_location_country_postal_poi_shopping_car_repair_5km_cnt_2a954887,1.6547,96.6346,"26.0, 107.0, 23.0",Upgini,POI data OpenStreetMap,Quarterly
f_location_country_postal_poi_shopping_department_store_5km_cnt_5fd3f732,1.561,96.6346,"13.0, 25.0, 9.0",Upgini,POI data OpenStreetMap,Quarterly
f_autofe_sim_lv_8401ad2e64,1.1206,100.0,"0.2353, 0.0588, 0.05",Training dataset,AutoFE: features from Training dataset,


Provider,Source,All features SHAP,Number of relevant features
Upgini,World mobile network coverage,5.8955,3
Upgini,POI data OpenStreetMap,5.631,3
Training dataset,AutoFE: features from Training dataset,5.5204,2
Upgini,World mobile network coverage,3.1364,1
Upgini,Public customer profile,2.0674,1
Upgini,"AutoFE: features from World demographic data,POI data OpenStreetMap",0.9693,1
Upgini,World house prices data,0.7798,1


Sources,Feature name,Feature 1,Feature 2,Function
Training dataset,f_autofe_sim_jw2_fe9c7bc0f8,job_simp_d49976,type_of_ownership_a589fc,sim_jw2
Training dataset,f_autofe_sim_lv_8401ad2e64,job_simp_d49976,type_of_ownership_a589fc,sim_lv
"World demographic data,POI data OpenStreetMap",f_autofe_mul_3b6e8e00e0,f_location_country_postal_b24080e10_240aba9e,f_location_country_postal_poi_tourism_attraction_10km_cnt_12ab4251,*


Calculating accuracy uplift after enrichment...


Dataset type,Rows,Mean target,Baseline mean_absolute_error,Enriched mean_absolute_error,"Uplift, abs","Uplift, %"
Train,416,99.512,24.475 ± 1.455,22.376 ± 1.096,2.098,8.6%
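
To put the two runs side by side, the sketch below recomputes the relative uplift from the MAE values reported in the two `Train` rows of this guide. It assumes the uplift percentage is measured relative to the baseline MAE; the `uplift_percent` helper and the run labels are hypothetical. Note that the two runs were evaluated on different row counts (464 vs. 416), so the comparison is indicative only.

```python
# Baseline and enriched MAE copied from the two "Train" rows reported in this guide
runs = {
    "external data + text features": {"baseline": 24.482, "enriched": 22.030},
    "external data only": {"baseline": 24.475, "enriched": 22.376},
}


def uplift_percent(baseline, enriched):
    """MAE reduction as a percentage of the baseline MAE."""
    return (baseline - enriched) / baseline * 100


uplifts = {name: uplift_percent(**mae) for name, mae in runs.items()}
for name, value in uplifts.items():
    print(f"{name}: {value:.1f}%")

# Difference between the two configurations, in percentage points
gap = uplifts["external data + text features"] - uplifts["external data only"]
print(f"difference: {gap:.1f} percentage points")
```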
