![Upgini logo](https://cdn.prod.website-files.com/65d5721664bea140c05f5301/65e354e4b9ddb1c6aaa7d7b1_upgini_logo%20gradient.svg)   
## [Intelligent data search & enrichment engine for Machine Learning](https://upgini.com)
### Quick Start guide:
### Search for relevant external features & automated feature generation for a salary prediction task  
_________________

By following this guide, you'll learn how to **search for and automatically generate new relevant features with the Upgini library in just 3 simple steps.**  
We will enrich a training dataset with both external and automatically generated features and improve model accuracy (in this example, MAE improves by about 9.5%).  
*The goal is to predict the salary for a data science job posting based on information about the employer and the job description.*  
The evaluation metric is Mean Absolute Error (MAE).  
⏱ Time needed: *15 minutes.*  
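
For reference, MAE is the average absolute difference between actual and predicted values, expressed in the units of the target. A minimal sketch in plain Python follows; the salary figures are made-up illustrative numbers, not rows from the dataset:

```python
def mean_absolute_error(y_true, y_pred):
    """Average of |actual - predicted| over all samples."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Illustrative values only (hypothetical salaries, not taken from the dataset).
actual = [72.0, 87.5, 110.0]
predicted = [80.0, 85.0, 100.0]

mae = mean_absolute_error(actual, predicted)
print(round(mae, 2))
```

Because errors are not squared, a single badly mispredicted posting affects MAE less than it would affect a squared-error metric.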

Download this notebook: [GitHub Link](https://github.com/upgini/upgini/blob/main/notebooks/Upgini_Features_search&generation.ipynb)  
_________________

First, let's install the latest version of the Upgini library.

In [1]:
%pip install -Uq upgini catboost

## 1️⃣ Use your labeled training dataset for search & feature generation

You can use your labeled training datasets "as is" to initiate the search.  
For this guide we'll use the dataset from [Glassdoor salary prediction](https://www.kaggle.com/datasets/thedevastator/jobs-dataset-from-glassdoor) with employer addresses geocoded as postal/ZIP codes. You can download the extended version [here](https://github.com/upgini/upgini/blob/main/notebooks/demo_salary.csv.zip).  
*This dataset contains job postings from Glassdoor.com from 2017, with several text columns including Job title, Job description, and Company name.*  
License CC0: Public Domain  
The goal is to predict the salary for a data science job posting.
The column with the target label for salary prediction is `avg_salary`.  
> ⚠️ All date/datetime columns in the input dataset should be converted to pandas datetime objects so that dates are represented correctly
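
As a sketch of the conversion mentioned in the warning above, the snippet below uses `pd.to_datetime` on a small hypothetical frame. The column name `posting_date` is an assumption for illustration; the demo dataset is not assumed to contain it.

```python
import pandas as pd

# Hypothetical frame: `posting_date` is an illustrative column name only.
postings = pd.DataFrame({"posting_date": ["2017-03-01", "2017-03-15", "not a date"]})

# errors="coerce" turns unparseable strings into NaT instead of raising.
postings["posting_date"] = pd.to_datetime(postings["posting_date"], errors="coerce")

print(postings["posting_date"].dtype)
print(int(postings["posting_date"].isna().sum()), "value(s) could not be parsed")
```

Checking the number of `NaT` values after conversion is a cheap way to spot malformed dates before starting a search.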


In [2]:
from os.path import exists
import pandas as pd

df_path = "demo_salary.csv.zip" if exists("demo_salary.csv.zip") else "https://github.com/upgini/upgini/raw/main/notebooks/demo_salary.csv.zip"
df = pd.read_csv(df_path)
df.head(2)

Unnamed: 0,Company Name,Location,Headquarters,Size,Founded,Type of ownership,Industry,Sector,Revenue,avg_salary,...,R_yn,spark,aws,excel,job_simp,desc_len,num_comp,Postal_code,country,combined
0,Tecolote Research\n3.8,"Albuquerque, NM","Goleta, CA",501 to 1000 employees,1973,Company - Private,Aerospace & Defense,Aerospace & Defense,$50 to $100 million (USD),72.0,...,0,0,0,1,data scientist,2536,0,87102,US,Job title: Data Scientist; Job Description: Da...
1,University of Maryland Medical System\n3.4,"Linthicum, MD","Baltimore, MD",10000+ employees,1984,Other Organization,Health Care Services & Hospitals,Health Care,$2 to $5 billion (USD),87.5,...,0,0,0,0,data scientist,4783,0,21090,US,Job title: Healthcare Data Scientist; Job Desc...


In [3]:
pd.set_option('display.max_colwidth', 768)
df["combined"].head(2)

Unnamed: 0,combined
0,"Job title: Data Scientist; Job Description: Data Scientist\nLocation: Albuquerque, NM\nEducation Required: Bachelor’s degree required, preferably in math, engineering, business, or the sciences.\nSkills Required:\nBachelor’s Degree in relevant field, e.g., math, data analysis, database, computer science, Artificial Intelligence (AI); three years’ experience credit for Master’s degree; five years’ experience credit for a Ph.D\nApplicant should be proficient in the use of Power BI, Tableau, Python, MATLAB, Microsoft Word, PowerPoint, Excel, and working knowledge of MS Access, LMS, SAS, data visualization tools, and have a strong algorithmic aptitude\nExcellent verbal and written communication skills, and quantitative analytical skills are required\nApplica..."
1,"Job title: Healthcare Data Scientist; Job Description: What You Will Do:\n\nI. General Summary\n\nThe Healthcare Data Scientist position will join our Advanced Analytics group at the University of Maryland Medical System (UMMS) in support of its strategic priority to become a data-driven and outcomes-oriented organization. The successful candidate will have 3+ years of experience with Machine Learning, Predictive Modeling, Statistical Analysis, Mathematical Optimization, Algorithm Development and a passion for working with healthcare data. Previous experience with various computational approaches along with an ability to demonstrate a portfolio of relevant prior projects is essential. This position will report to the UMMS Vice President for Enterprise Da..."


## 2️⃣ Choose one or more columns as search keys and select columns for automated feature generation

Under the hood, we'll search for relevant data using:
- **[search keys](https://github.com/upgini/upgini#-search-key-types-we-support-more-to-come)** from the training dataset to match records from potential data sources with new features
- **labels** from the training dataset to estimate the relevancy of candidate features for your ML task and calculate feature importance metrics  
- **your features** from the training dataset to find external datasets and features that improve accuracy beyond what your existing features already provide, and to estimate the accuracy uplift ([optional](https://github.com/upgini/upgini#find-features-only-give-accuracy-gain-to-existing-data-in-the-ml-model))
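
Conceptually, matching by search keys resembles a left join of your rows against an external table on the key columns. The toy sketch below illustrates only that idea in plain Python; `enrich_by_keys`, `external_source`, and the `cell_towers_10km` feature with its values are hypothetical, and this is not a description of how Upgini performs the matching.

```python
def enrich_by_keys(rows, external, key_columns):
    """Left-join external features onto rows by the given key columns."""
    enriched = []
    for row in rows:
        key = tuple(row.get(col) for col in key_columns)
        # Rows without a match keep their original columns only.
        enriched.append({**row, **external.get(key, {})})
    return enriched


# Hypothetical external source keyed by (country, postal code); values are made up.
external_source = {
    ("US", "87102"): {"cell_towers_10km": 120},
    ("US", "21090"): {"cell_towers_10km": 85},
}

training_rows = [
    {"country": "US", "Postal_code": "87102", "job_simp": "data scientist"},
    {"country": "US", "Postal_code": "00000", "job_simp": "analyst"},
]

result = enrich_by_keys(training_rows, external_source, ["country", "Postal_code"])

# Share of rows that received the new feature.
coverage = sum("cell_towers_10km" in r for r in result) / len(result)
print(result[0]["cell_towers_10km"], coverage)
```

The share of rows that find a match corresponds to the idea behind the `Coverage %` column reported in the search results.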


Define one or more columns as search keys and select **text columns** for automated feature generation; in this example, `'combined', 'company_txt'`.  

>⚠️ This search task will be auto-detected as regression. If you have a time series prediction task (for example, daily sales as the target variable) rather than simple regression, you also have to pass a [**time series specific cross-validation split**](https://github.com/upgini/upgini#-time-series-prediction-support) **`CVType.time_series`**

In [4]:
from upgini import FeaturesEnricher, SearchKey

enricher = FeaturesEnricher(
  search_keys={
    'country': SearchKey.COUNTRY,
    'Postal_code': SearchKey.POSTAL_CODE
  },
  text_features=['combined', 'company_txt']
)

## 3️⃣ Start your search & feature generation with Scikit-learn compatible estimator

The main abstraction you interact with is `FeaturesEnricher`, a Scikit-learn compatible estimator. You can easily add it to your existing ML pipelines.
Create an instance of the `FeaturesEnricher` class and call:
- `fit()` to search for relevant datasets & features  
- then `transform()` to enrich your dataset with features from the search result
- or combine both steps with a single method, `fit_transform()`

You need to separate features from targets in *a scikit-learn style* (X and y).

> Search step will take around *30 minutes* for this training dataset

In [5]:
train_features = df.drop(['avg_salary'], axis=1)
train_target = df.avg_salary
# fit() only runs the search; the enriched dataframe is produced later by transform()
enricher.fit(
    train_features,
    train_target,
    scoring="mean_absolute_error"
)


Detected task type: ModelTaskType.REGRESSION. Reason: many unique label-values or non-integer floating point values observed
You can set task type manually with argument `model_task_type` of FeaturesEnricher constructor if task type detected incorrectly






Column name,Status,Errors
country,All valid,-
target,All valid,-
Postal_code,Some invalid,"2.16% values failed validation and removed from dataframe, invalid values: [<NA>]"




Running search request, search_id=1bb3b1ee-1fdf-4822-9387-a9d3dd2d883a



Feature name,SHAP value,Coverage %,Value preview,Provider,Source,Updates
job_simp,6.4065,100.0,"director, na, mle",,,
f_autofe_sim_jw2_fe9c7bc0f8,4.1053,100.0,"0.4951, 0.4653, 0.4232",Training dataset,AutoFE: features from Training dataset,
f_telecom_country_postal_cells_UMTS_10km_range_avg_9dcf0c9e,2.9398,96.9828,"3476.5716, 1557.6053, 4090.918...",Upgini,World mobile network coverage,Quarterly
f_autofe_groupbythenrank_5f4929a737,1.451,96.7672,"0.5108, 0.7339, 0.3","Upgini,Training dataset","AutoFE: feature from POI data OpenStreetMap, grouped by feature from training dataset",Quarterly
f_marketing_country_postal_person_ethnic_code_non_europe_prc_8c825cdd,1.4243,83.1897,"0.1252, 0.3747, 1.0",Upgini,Public customer profile,Quarterly
f_autofe_groupbythenrank_39559aa5e9,1.2956,83.1897,"0.6752, 0.3846, 0.5556","Upgini,Training dataset","AutoFE: feature from Public customer profile, grouped by feature from training dataset",Quarterly
f_autofe_groupbythenrank_edd259e5b2,1.2743,96.7672,"0.6968, 0.0957, 0.0616","Upgini,Training dataset","AutoFE: feature from World mobile network coverage, grouped by feature from training dataset",Quarterly
f_autofe_sim_jw1_22754eb785,1.2579,100.0,"0.5177, 0.4604, 0.5281",Training dataset,AutoFE: features from Training dataset,
f_autofe_groupbythenrank_8b1fc7d89d,1.1459,96.9828,"0.9787, 0.1344, 0.4926","Upgini,Training dataset","AutoFE: feature from World mobile network coverage, grouped by feature from training dataset",Quarterly
f_avg_lat_ms_to_history_a084d607,1.0858,96.9828,"2.4444, 0.6667, 0.3571",Ookla Speedtest,okla open data,Quarterly


Provider,Source,All features SHAP,Number of relevant features
Training dataset,AutoFE: features from Training dataset,5.3632,2
"Upgini,Training dataset","AutoFE: feature from POI data OpenStreetMap, grouped by feature from training dataset",4.1216,6
"Upgini,Training dataset","AutoFE: feature from World mobile network coverage, grouped by feature from training dataset",3.9315,5
Upgini,World mobile network coverage,3.8749,2
"Upgini,Training dataset","AutoFE: feature from World mobile network coverage, grouped by feature from training dataset",1.7351,3
"Upgini,Training dataset","AutoFE: feature from Public customer profile, grouped by feature from training dataset",1.5892,2
Upgini,Public customer profile,1.4243,1
Upgini,POI data OpenStreetMap,1.2604,2
Ookla Speedtest,okla open data,1.0858,1
Upgini,"AutoFE: features from LLM with external data augmentation,World demographic data",0.8427,1


Sources,Feature name,Feature 1,Feature 2,Function
Training dataset,f_autofe_sim_jw2_fe9c7bc0f8,job_simp_d49976,type_of_ownership_a589fc,sim_jw2
"POI data OpenStreetMap, grouped by feature from training dataset",f_autofe_groupbythenrank_5f4929a737,f_location_country_postal_poi_shopping_doityourself_5km_cnt_to_population_7c52864d,job_simp_d49976,GroupByThenRank
"Public customer profile, grouped by feature from training dataset",f_autofe_groupbythenrank_39559aa5e9,f_marketing_country_postal_person_ethnic_code_non_europe_prc_8c825cdd,job_simp_d49976,GroupByThenRank
"World mobile network coverage, grouped by feature from training dataset",f_autofe_groupbythenrank_edd259e5b2,f_telecom_country_postal_cells_5km_days_from_update_avg_a01cfafa,job_simp_d49976,GroupByThenRank
Training dataset,f_autofe_sim_jw1_22754eb785,industry_b44484,type_of_ownership_a589fc,sim_jw1
"World mobile network coverage, grouped by feature from training dataset",f_autofe_groupbythenrank_8b1fc7d89d,f_telecom_country_postal_cells_CDMA_10km_samples_avg_c5d14676,job_simp_d49976,GroupByThenRank
"POI data OpenStreetMap, grouped by feature from training dataset",f_autofe_groupbythenrank_230fa55503,f_location_country_postal_poi_shopping_department_store_5km_cnt_5fd3f732,job_simp_d49976,GroupByThenRank
"World mobile network coverage, grouped by feature from training dataset",f_autofe_groupbythenrank_5640591d87,f_telecom_country_postal_cells_UMTS_20km_range_stddev_c850530d,job_simp_d49976,GroupByThenRank
"LLM with external data augmentation,World demographic data",f_autofe_dist_301b71b7be,company_txt_4c5666_org_emb,f_location_country_postal_region_name_346abe35,"dist,emb"
"World mobile network coverage,POI data OpenStreetMap",f_autofe_mul_75c3afb5a7,f_telecom_country_postal_cells_LTE_10km_cnt_to_cells_cnt_1fa697bf,f_location_country_postal_poi_public_fire_station_10km_cnt_to_population_f4484aff,*


Calculating accuracy uplift after enrichment...


Dataset type,Rows,Mean target,Baseline mean_absolute_error,Enriched mean_absolute_error,"Uplift, abs","Uplift, %"
Train,464,100.7802,24.482 ± 1.016,22.168 ± 0.972,2.314,9.5%
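
The two uplift columns in the table above appear to follow directly from the baseline and enriched MAE. The short sketch below reproduces that arithmetic from the reported numbers; it is an observation about the output, not a statement about Upgini internals.

```python
baseline_mae = 24.482  # "Baseline mean_absolute_error" from the table above
enriched_mae = 22.168  # "Enriched mean_absolute_error" from the table above

# Lower MAE is better, so the uplift is the reduction in error.
uplift_abs = baseline_mae - enriched_mae
uplift_pct = uplift_abs / baseline_mae * 100

print(round(uplift_abs, 3), round(uplift_pct, 1))  # → 2.314 9.5
```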


We've got **10+ new relevant features** from:
- Various sources  [automatically optimized by Upgini](https://upgini.com/#optimized_external_data) such as [World demographic & census data, World mobile network coverage, Location/Places/POI/Area/Proximity data from OpenStreetMap](https://github.com/upgini/upgini#-connected-data-sources-and-coverage)
- Automated feature generation for two selected text columns `'combined', 'company_txt'` with [Large Language Models' data augmentation](https://upgini.com/#large_language_models)

All ranked by [SHAP values](https://en.wikipedia.org/wiki/Shapley_value).
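
To work with the ranking programmatically, you might copy a few rows from the feature table and filter them. A small sketch in plain Python follows; the tuples are taken from the search output above, and treating an empty provider as "original training column" is an assumption made for this illustration.

```python
# (feature name, SHAP value, provider) copied from the search output above.
features = [
    ("job_simp", 6.4065, ""),
    ("f_autofe_sim_jw2_fe9c7bc0f8", 4.1053, "Training dataset"),
    ("f_avg_lat_ms_to_history_a084d607", 1.0858, "Ookla Speedtest"),
    ("f_telecom_country_postal_cells_UMTS_10km_range_avg_9dcf0c9e", 2.9398, "Upgini"),
]

# Keep only features that come from an external provider, highest SHAP first.
external = sorted(
    (f for f in features if f[2] not in ("", "Training dataset")),
    key=lambda f: f[1],
    reverse=True,
)

for name, shap_value, provider in external:
    print(f"{shap_value:.4f}  {provider}  {name}")
```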

Initial features from the training dataset will also be checked for relevancy, so you don't need an extra feature selection step.

In addition, `FeaturesEnricher` automatically calculates model metrics and the uplift from new relevant features using the default `calculate_metrics=True` parameter in the `fit()` or `fit_transform()` methods.  
For this, you can pass any estimator with a scikit-learn compatible interface via `estimator` and define custom model metrics via `scoring`. More details [here](https://github.com/upgini/upgini#-accuracy-and-uplift-metrics-calculations)

Result of search & enrichment request:

⭐️ Enriched pandas dataframe **with 10+ new relevant features** `enriched_train_features`  
⭐️ Calculated accuracy uplift after enrichment: MAE goes from 24.5 BEFORE to 22.2 AFTER for a basic **non task-optimized ML model** (MAE is mean absolute error; lower is better)
>💡 You can also enrich production ML pipelines, more details [here](https://github.com/upgini/upgini#6--enrich-production-ml-pipeline-with-relevant-external-features)

## ✅ Enrich training dataset

Now you can enrich the dataframe to train a more accurate, **task-optimized ML model** in your existing ML pipeline.

In [6]:
enriched_train_features = enricher.transform(
  train_features,
  train_target,
)
enriched_train_features.head(2)

Column name,Status,Errors
country,All valid,-
Postal_code,Some invalid,"2.41% values failed validation and removed from dataframe, invalid values: [<NA>]"




Running transform request, id=6e2f1ca6-d1f0-4a95-b1b2-7990bcd62903

Retrieving selected features from data sources...


Unnamed: 0,job_simp,Postal_code,country,f_avg_lat_ms_to_history_a084d607,f_location_country_postal_poi_shopping_department_store_5km_cnt_5fd3f732,f_location_country_postal_poi_accommodation_motel_10km_cnt_to_population_3bc53b5d,f_weather_country_date_postal_delta_to_avg_tavg_81aca723,f_marketing_country_postal_person_ethnic_code_non_europe_prc_8c825cdd,f_telecom_country_postal_cells_UMTS_20km_range_stddev_c850530d,f_telecom_country_postal_cells_UMTS_10km_range_avg_9dcf0c9e,...,f_autofe_groupbythenrank_5fa49e33ec,f_autofe_groupbythenrank_c035e4bf2c,f_autofe_groupbythenrank_230fa55503,f_autofe_groupbythenrank_61f9cf4dbc,f_autofe_groupbythenrank_5f4929a737,f_autofe_groupbythenrank_52c5913cc2,f_autofe_groupbythenrank_35178ff462,f_autofe_dist_301b71b7be,f_autofe_sim_jw1_22754eb785,f_autofe_sim_jw2_fe9c7bc0f8
0,data scientist,87102,US,0.909091,2.0,0.092813,7.857206,0.07018,11660.630246,4894.663616,...,0.884615,0.636095,0.180473,0.571006,0.411243,0.756849,0.616071,0.744933,0.500688,0.482726
1,data scientist,21090,US,0.5,3.0,0.04035,-7.259289,0.152481,15182.660656,6139.380261,...,0.757396,0.242604,0.227811,0.585799,0.254438,,0.380952,0.474221,0.541667,0.546958


______________________________
**That's all for a quick start in 15 minutes!**  
If you found this useful or interesting, feel free to share.  
______________________________
## 🔗 Useful links
* Upgini Library [Documentation](https://github.com/upgini/upgini#readme)
* More [Notebooks and Guides](https://github.com/upgini/upgini?tab=readme-ov-file#-tutorials)
* Kaggle public [Notebooks](https://www.kaggle.com/romaupgini/code)


<sup>😔 Found a typo or a bug in a code snippet? Our bad! <a href="https://github.com/upgini/upgini/issues/new?assignees=&title=readme%2Fbug">
Please report it here.</a></sup>

## Optional: Enrichment with **external data & features only**, without LLM-based feature generation

To enrich the training dataset ONLY with features from external data sources, without automated feature generation on the text columns, simply remove the parameter `text_features=['combined', 'company_txt']` from `FeaturesEnricher`.  
This lets you compare the uplift from *LLM-based feature generation + external data* vs. the uplift from *external data and features only*:  

In [7]:
df = pd.read_csv(df_path)
train_features = df.drop(['avg_salary'], axis=1)
train_target = df.avg_salary

enricher = FeaturesEnricher(
  search_keys={
    'country': SearchKey.COUNTRY,
    'Postal_code': SearchKey.POSTAL_CODE
  }
)
enricher.fit(train_features, train_target, scoring="mean_absolute_error")

Demo training dataset detected. Registration for an API key is not required.


Detected task type: ModelTaskType.REGRESSION. Reason: many unique label-values or non-integer floating point values observed
You can set task type manually with argument `model_task_type` of FeaturesEnricher constructor if task type detected incorrectly




Sample of incorrect row indexes: [3, 8, 17, 24, 51, 72, 121, 122, 134, 139, 147, 151, 155, 162, 169, 178, 180, 192, 237, 244, 269, 274, 280, 282, 288, 320, 323, 329, 332, 340, 373, 380, 389, 404, 414, 426, 428, 432, 438, 441, 444, 450]



Column name,Status,Errors
country,All valid,-
target,All valid,-
Postal_code,Some invalid,"2.40% values failed validation and removed from dataframe, invalid values: [<NA>]"




Running search request, search_id=174ff633-d44a-4f63-a403-e1dd6fa0c091
We'll send email notification once it's completed, just use your personal api_key from profile.upgini.com



Feature name,SHAP value,Coverage %,Value preview,Provider,Source,Updates
job_simp,4.7606,100.0,"mle, analyst, manager",,,
f_autofe_sim_jw2_fe9c7bc0f8,4.115,100.0,"0.4951, 0.4653, 0.5548",Training dataset,AutoFE: features from Training dataset,
f_autofe_sim_jw1_1d4a407c9e,3.5675,100.0,"0.6022, 0.3958, 0.502",Training dataset,AutoFE: features from Training dataset,
f_telecom_country_postal_cells_UMTS_10km_range_avg_9dcf0c9e,2.7987,96.6346,"4146.128, 3578.1304, 1774.4718",Upgini,World mobile network coverage,Quarterly
f_telecom_country_postal_cells_CDMA_10km_samples_avg_c5d14676,2.1755,96.6346,"2.8302, 21.9818, 17.0479",Upgini,World mobile network coverage,Quarterly
f_autofe_groupbythenrank_39559aa5e9,2.1148,81.9712,"0.0308, 0.5942, 0.9","Upgini,Training dataset","AutoFE: feature from Public customer profile, grouped by feature from training dataset",Quarterly
f_telecom_country_postal_cells_UMTS_20km_range_stddev_c850530d,1.9575,96.6346,"10745.9065, 5154.0085, 3680.14...",Upgini,World mobile network coverage,Quarterly
f_location_country_postal_poi_shopping_department_store_5km_cnt_5fd3f732,1.4756,96.6346,"13.0, 25.0, 9.0",Upgini,POI data OpenStreetMap,Quarterly
f_autofe_groupbythenrank_8b1fc7d89d,0.9736,96.6346,"0.75, 0.5068, 0.7836","Upgini,Training dataset","AutoFE: feature from World mobile network coverage, grouped by feature from training dataset",Quarterly
f_location_country_postal_poi_public_graveyard_10km_cnt_to_population_b5b66acf,0.9375,96.6346,"0.0113, 0.0729, 0.0624",Upgini,POI data OpenStreetMap,Quarterly


Provider,Source,All features SHAP,Number of relevant features
Training dataset,AutoFE: features from Training dataset,7.6825,2
Upgini,World mobile network coverage,4.7562,2
"Upgini,Training dataset","AutoFE: feature from POI data OpenStreetMap, grouped by feature from training dataset",2.7002,5
Upgini,POI data OpenStreetMap,2.4131,2
"Upgini,Training dataset","AutoFE: feature from Public customer profile, grouped by feature from training dataset",2.3755,2
Upgini,World mobile network coverage,2.1755,1
"Upgini,Training dataset","AutoFE: feature from World mobile network coverage, grouped by feature from training dataset",1.7086,3
"Upgini,Training dataset","AutoFE: feature from World mobile network coverage, grouped by feature from training dataset",1.1789,2
"Upgini,Training dataset","AutoFE: feature from Weather & climate normals data, grouped by feature from training dataset",1.156,2
Upgini,AutoFE: features from POI data OpenStreetMap,0.9283,2


Sources,Feature name,Feature 1,Feature 2,Function
Training dataset,f_autofe_sim_jw2_fe9c7bc0f8,job_simp_d49976,type_of_ownership_a589fc,sim_jw2
Training dataset,f_autofe_sim_jw1_1d4a407c9e,job_simp_d49976,type_of_ownership_a589fc,sim_jw1
"Public customer profile, grouped by feature from training dataset",f_autofe_groupbythenrank_39559aa5e9,f_marketing_country_postal_person_ethnic_code_non_europe_prc_8c825cdd,job_simp_d49976,GroupByThenRank
"World mobile network coverage, grouped by feature from training dataset",f_autofe_groupbythenrank_8b1fc7d89d,f_telecom_country_postal_cells_CDMA_10km_samples_avg_c5d14676,job_simp_d49976,GroupByThenRank
"POI data OpenStreetMap, grouped by feature from training dataset",f_autofe_groupbythenrank_7efb6b8778,f_location_country_postal_subway_4km_length_923f45f1,job_simp_d49976,GroupByThenRank
"Weather & climate normals data, grouped by feature from training dataset",f_autofe_groupbythenrank_35178ff462,f_weather_country_date_postal_delta_to_avg_awnd_3d575167,job_simp_d49976,GroupByThenRank
"POI data OpenStreetMap, grouped by feature from training dataset",f_autofe_groupbythenrank_f2492c3f52,f_location_country_postal_poi_public_police_10km_cnt_to_population_5f94f54b,job_simp_d49976,GroupByThenRank
"World mobile network coverage, grouped by feature from training dataset",f_autofe_groupbythenrank_5640591d87,f_telecom_country_postal_cells_UMTS_20km_range_stddev_c850530d,job_simp_d49976,GroupByThenRank
"World mobile network coverage,World demographic data",f_autofe_mul_b77c185444,f_telecom_country_postal_cells_10km_samples_max_221c5684,f_location_country_postal_employed_manufacturing_0e89967c,"*,norm"
POI data OpenStreetMap,f_autofe_div_464d0fca2d,f_location_country_postal_poi_leisure_park_2km_cnt_to_population_1a51c071,f_location_country_postal_poi_miscpoi_toilet_5km_cnt_to_population_54772056,"/,norm"


Calculating accuracy uplift after enrichment...


Dataset type,Rows,Mean target,Baseline mean_absolute_error,Enriched mean_absolute_error,"Uplift, abs","Uplift, %"
Train,416,99.512,24.475 ± 1.455,22.755 ± 1.219,1.719,7.0%
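
To put the two runs side by side, the sketch below recomputes the uplift from the figures reported in the two uplift tables. Two caveats: the runs were evaluated on different row counts (464 vs. 416), so they are not strictly like-for-like, and reading the `±` value as the spread of the metric across validation folds is an assumption.

```python
# (baseline MAE, enriched MAE, reported +/- of enriched MAE, rows) from the two uplift tables.
runs = {
    "external data + text features": (24.482, 22.168, 0.972, 464),
    "external data only": (24.475, 22.755, 1.219, 416),
}

for name, (baseline, enriched, spread, rows) in runs.items():
    uplift_pct = (baseline - enriched) / baseline * 100
    print(f"{name}: uplift {uplift_pct:.1f}% on {rows} rows")

# Gap between the two enriched MAEs, compared with the smaller reported spread.
gap = runs["external data only"][1] - runs["external data + text features"][1]
smallest_spread = min(spread for _, _, spread, _ in runs.values())
within_noise = gap < smallest_spread
print(round(gap, 3), within_noise)
```

Under that reading of `±`, the gap between the two enriched scores is smaller than the reported spread, so the extra gain from text features in this demo should be treated as indicative rather than conclusive.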
