![Upgini logo](https://cdn.prod.website-files.com/65d5721664bea140c05f5301/65e354e4b9ddb1c6aaa7d7b1_upgini_logo%20gradient.svg)
## [Intelligent data search & enrichment engine for Machine Learning](https://upgini.com)
### Quick Start guide:
### Search for relevant external features & automated feature generation for a salary prediction task
_________________

Following this guide, you'll learn how to **search for and automatically generate new relevant features with the Upgini library in just 3 simple steps.**  
We will enrich a training dataset with both external and automatically generated features and significantly improve model accuracy.  
*The goal is to predict the salary for a data science job posting based on information about the employer and the job description.*  
The evaluation metric is Mean Absolute Error (MAE).  
⏱ Time needed: *15 minutes.*  

Download this notebook: [GitHub Link](https://github.com/upgini/upgini/blob/main/notebooks/Upgini_Features_search&generation.ipynb)  
_________________

First, let's install the latest version of the Upgini library.

In [None]:
%pip install -Uq upgini catboost

## 1️⃣ Use your labeled training dataset for search & feature generation

You can use your labeled training datasets "as is" to initiate the search.  
For this guide we'll use the dataset from [Glassdoor salary prediction](https://www.kaggle.com/datasets/thedevastator/jobs-dataset-from-glassdoor) with employer addresses geocoded as postal/ZIP codes. You can download the extended version [here](https://github.com/upgini/upgini/blob/main/notebooks/demo_salary.csv.zip).  
*This dataset contains job postings from Glassdoor.com from 2017, with several text columns including Job title, Job description, and Company name.*  
License CC0: Public Domain  
The goal is to predict the salary for a data science job posting.  
The column with the target label for salary prediction is `avg_salary`.  
> ⚠️ All date/datetime columns in the input dataset should be converted to pandas datetime objects so that dates are represented correctly
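
The demo dataset used in this guide has no date column, so the conversion is not needed here. For datasets that do have one, a minimal sketch of the conversion could look like this. The frame and the `posting_date` column are hypothetical and exist only for illustration:

```python
import pandas as pd

# Hypothetical frame: "posting_date" is NOT a column of the demo dataset.
jobs = pd.DataFrame({
    "posting_date": ["2017-03-01", "2017-03-15", "not a date"],
    "avg_salary": [72.0, 87.5, 65.0],
})

# Parse the strings into pandas datetime values.
# errors="coerce" turns unparseable strings into NaT instead of raising.
jobs["posting_date"] = pd.to_datetime(
    jobs["posting_date"], format="%Y-%m-%d", errors="coerce"
)

print(jobs.dtypes)
print(jobs["posting_date"].isna().sum(), "value(s) could not be parsed")
```

Checking the number of `NaT` values after the conversion is a cheap way to spot malformed dates before starting a search.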


In [None]:
from os.path import exists
import pandas as pd

df_path = "demo_salary.csv.zip" if exists("demo_salary.csv.zip") else "https://github.com/upgini/upgini/raw/main/notebooks/demo_salary.csv.zip"
df = pd.read_csv(df_path)
df.head(2)

Unnamed: 0,Company Name,Location,Headquarters,Size,Founded,Type of ownership,Industry,Sector,Revenue,avg_salary,...,R_yn,spark,aws,excel,job_simp,desc_len,num_comp,Postal_code,country,combined
0,Tecolote Research\n3.8,"Albuquerque, NM","Goleta, CA",501 to 1000 employees,1973,Company - Private,Aerospace & Defense,Aerospace & Defense,$50 to $100 million (USD),72.0,...,0,0,0,1,data scientist,2536,0,87102,US,Job title: Data Scientist; Job Description: Da...
1,University of Maryland Medical System\n3.4,"Linthicum, MD","Baltimore, MD",10000+ employees,1984,Other Organization,Health Care Services & Hospitals,Health Care,$2 to $5 billion (USD),87.5,...,0,0,0,0,data scientist,4783,0,21090,US,Job title: Healthcare Data Scientist; Job Desc...


In [None]:
pd.set_option('display.max_colwidth', 768)
df["combined"].head(2)

Unnamed: 0,combined
0,"Job title: Data Scientist; Job Description: Data Scientist\nLocation: Albuquerque, NM\nEducation Required: Bachelor’s degree required, preferably in math, engineering, business, or the sciences.\nSkills Required:\nBachelor’s Degree in relevant field, e.g., math, data analysis, database, computer science, Artificial Intelligence (AI); three years’ experience credit for Master’s degree; five years’ experience credit for a Ph.D\nApplicant should be proficient in the use of Power BI, Tableau, Python, MATLAB, Microsoft Word, PowerPoint, Excel, and working knowledge of MS Access, LMS, SAS, data visualization tools, and have a strong algorithmic aptitude\nExcellent verbal and written communication skills, and quantitative analytical skills are required\nApplica..."
1,"Job title: Healthcare Data Scientist; Job Description: What You Will Do:\n\nI. General Summary\n\nThe Healthcare Data Scientist position will join our Advanced Analytics group at the University of Maryland Medical System (UMMS) in support of its strategic priority to become a data-driven and outcomes-oriented organization. The successful candidate will have 3+ years of experience with Machine Learning, Predictive Modeling, Statistical Analysis, Mathematical Optimization, Algorithm Development and a passion for working with healthcare data. Previous experience with various computational approaches along with an ability to demonstrate a portfolio of relevant prior projects is essential. This position will report to the UMMS Vice President for Enterprise Da..."


## 2️⃣ Choose one or multiple columns as search keys, select columns for automated feature generation

Under the hood, we'll search for relevant data using:
- **[search keys](https://github.com/upgini/upgini#-search-key-types-we-support-more-to-come)** from the training dataset to match records from potential data sources with new features
- **labels** from the training dataset to estimate the relevancy of candidate features for your ML task and calculate feature importance metrics  
- **your features** from the training dataset to find external datasets and features that improve accuracy on top of your existing features, and to estimate the accuracy uplift ([optional](https://github.com/upgini/upgini#find-features-only-give-accuracy-gain-to-existing-data-in-the-ml-model))
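
To make the role of search keys concrete, here is a toy sketch of key-based matching using a plain pandas join. It is not Upgini's actual implementation: the external table, its `poi_count_5km` column, and all values are made up, and reading "coverage" as the share of matched rows is an assumption.

```python
import pandas as pd

# Training rows with the two search-key columns used in this guide.
train = pd.DataFrame({
    "country": ["US", "US", "US"],
    "Postal_code": ["87102", "21090", "00000"],
    "avg_salary": [72.0, 87.5, 65.0],
})

# Hypothetical external source keyed by the same columns (values are made up).
external = pd.DataFrame({
    "country": ["US", "US"],
    "Postal_code": ["87102", "21090"],
    "poi_count_5km": [13.0, 28.0],
})

# A left join keeps every training row; rows with unmatched keys get NaN.
enriched = train.merge(external, on=["country", "Postal_code"], how="left")

# Share of training rows that received a value from the external source.
coverage = enriched["poi_count_5km"].notna().mean() * 100
print(enriched)
print(f"coverage: {coverage:.1f}%")
```

Rows whose keys have no match keep a missing value, which is why a feature can cover less than 100% of the training dataset.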


Define one or multiple columns as search keys and select **text columns** for automated feature generation, in this example `'combined', 'company_txt'`  

>⚠️ This search task will be auto-detected as regression. If you have a time series prediction task (for example, daily sales as the target variable) rather than a simple regression, you also have to pass the [**time series specific cross-validation split**](https://github.com/upgini/upgini#-time-series-prediction-support) **`CVType.time_series`**.
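
To illustrate why time series need a dedicated split, the sketch below uses scikit-learn's `TimeSeriesSplit` on toy data. This is a stand-in for the general idea, not Upgini's `CVType.time_series` itself, whose internals may differ:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Ten observations, assumed to be already sorted by date (toy data).
daily_sales = np.arange(10)

splitter = TimeSeriesSplit(n_splits=3)
folds = list(splitter.split(daily_sales))
for train_idx, test_idx in folds:
    # Every training index precedes every validation index,
    # so the model is never validated on data older than its training data.
    print("train:", train_idx, "validate:", test_idx)
```

A shuffled split would mix future and past observations, which tends to make validation scores look better than they would be in production.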

In [None]:
from upgini import FeaturesEnricher, SearchKey

enricher = FeaturesEnricher(
    search_keys={
    'country': SearchKey.COUNTRY,
    'Postal_code': SearchKey.POSTAL_CODE},
    generate_features=['combined', 'company_txt'])

## 3️⃣ Start your search & feature generation with Scikit-learn compatible estimator

The main abstraction you interact with is `FeaturesEnricher`, a Scikit-learn compatible estimator. You can easily add it to your existing ML pipelines.
Create an instance of the `FeaturesEnricher` class and call:
- `fit()` to search for relevant datasets & features  
- then `transform()` to enrich your dataset with features from the search result
- or combine both steps with a single method, `fit_transform()`

You need to separate the features from the target in *scikit-learn style* (X and y).

> The search step will take around *12 minutes* for this training dataset

In [None]:
train_features = df.drop(['avg_salary'], axis=1)
train_target = df.avg_salary
enriched_train_features = enricher.fit_transform(
    train_features,
    train_target,
    scoring = "mean_absolute_error")


Demo training dataset detected. Registration for an API key is not required.





Detected task type: ModelTaskType.REGRESSION. Reason: many unique label-values or non-integer floating point values observed
You can set task type manually with argument `model_task_type` of FeaturesEnricher constructor if task type detected incorrectly




Column name,Status,Errors
Postal_code,Some invalid,"2.2% values failed validation and removed from dataframe, invalid values: [<NA>, <NA>, <NA>, <NA>, <NA>]"
country,All valid,-
target,All valid,-
current_date,All valid,-







Running search request, search_id=a03bb3f0-6020-40a1-9aea-f9edd7a6bd74
We'll send email notification once it's completed, just use your personal api_key from profile.upgini.com

15 relevant feature(s) found with the search keys: ['Postal_code', 'country', 'current_date']


Feature name,SHAP value,Coverage %,Value preview,Provider,Source,Updates
f_telecom_country_postal_cells_UMTS_10km_range_avg_9dcf0c9e,4.9627,96.9828,"4335.0217, 3369.9946, 4071.873...",Upgini,World mobile network coverage,Quarterly
f_autofe_mul_f8ed6d9a,2.4468,84.4828,"98.0, 172.0, 3538.0",Upgini,"AutoFE: features from World demographic data,POI data OpenStreetMap",Quarterly
f_location_country_postal_poi_shopping_car_repair_5km_cnt_2a954887,2.3457,96.9828,"13.0, 28.0, 67.0",Upgini,POI data OpenStreetMap,Quarterly
combined_8a92bc_emb11,2.0494,100.0,"-0.0363, -0.0596, -0.0498",Upgini,LLM with external data augmentation,
combined_8a92bc_emb115,2.0494,100.0,"-0.1046, -0.0763, 0.0288",Upgini,LLM with external data augmentation,
combined_8a92bc_emb155,2.0494,100.0,"-0.0614, -0.0426, -0.1027",Upgini,LLM with external data augmentation,
combined_8a92bc_emb179,2.0494,100.0,"0.0445, -0.0391, -0.028",Upgini,LLM with external data augmentation,
combined_8a92bc_emb181,2.0494,100.0,"0.0078, -0.0442, 0.04",Upgini,LLM with external data augmentation,
combined_8a92bc_emb37,2.0494,100.0,"0.015, 0.0834, 0.0101",Upgini,LLM with external data augmentation,
combined_8a92bc_emb45,2.0494,100.0,"-0.0013, -0.0398, 0.0387",Upgini,LLM with external data augmentation,


Provider,Source,All features SHAP,Number of relevant features
Upgini,LLM with external data augmentation,22.5434,11
Upgini,World mobile network coverage,4.9627,1
Upgini,"AutoFE: features from World demographic data,POI data OpenStreetMap",3.1133,2
Upgini,POI data OpenStreetMap,2.3457,1


Sources,Feature name,Feature 1,Feature 2,Function
"World demographic data,POI data OpenStreetMap",f_autofe_mul_f8ed6d9a,f_location_country_postal_b01001e25_16d0bfd6,f_location_country_postal_poi_leisure_dog_park_5km_cnt_87849333,*
"World demographic data,POI data OpenStreetMap",f_autofe_mul_922848b1,f_location_country_postal_b01001e25_16d0bfd6,f_location_country_postal_poi_shopping_garden_centre_5km_cnt_eb15de74,*


Calculating accuracy uplift after enrichment...


Dataset type,Rows,Mean target,Baseline mean_absolute_error,Enriched mean_absolute_error,Uplift
Train,464,100.7802,22.892 ± 1.865,22.263 ± 2.376,0.6284


You use Trial access to Upgini data enrichment. Limit for Trial: 1000 rows. You have already enriched: 464 rows.




We've got **10+ new relevant features** from:
- Various sources  [automatically optimized by Upgini](https://upgini.com/#optimized_external_data) such as [World demographic & census data, World mobile network coverage, Location/Places/POI/Area/Proximity data from OpenStreetMap](https://github.com/upgini/upgini#-connected-data-sources-and-coverage)
- Automated feature generation for two selected text columns `'combined', 'company_txt'` with [Large Language Models' data augmentation](https://upgini.com/#large_language_models)

All ranked by [SHAP values](https://en.wikipedia.org/wiki/Shapley_value).

Initial features from the training dataset will also be checked for relevancy, so you don't need an extra feature selection step.

Also, `FeaturesEnricher` automatically calculates model metrics and the uplift from new relevant features, using the default `calculate_metrics=True` parameter of the `fit()` or `fit_transform()` methods.  
For this, you can pass any estimator with a scikit-learn compatible interface via `estimator` and define custom model metrics via `scoring`. More details [here](https://github.com/upgini/upgini#-accuracy-and-uplift-metrics-calculations)
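
As a rough illustration of what a custom metric looks like in scikit-learn terms, the sketch below builds an MAE scorer with `make_scorer` and evaluates it with `cross_val_score` on synthetic data. Which scorer and estimator objects `FeaturesEnricher` accepts is described in the linked documentation; this example only shows the scikit-learn side and does not call Upgini.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import make_scorer, mean_absolute_error
from sklearn.model_selection import cross_val_score

# Toy regression data (synthetic, not the salary dataset).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

# greater_is_better=False makes the scorer return the negated MAE,
# so that "higher is better" holds, as scikit-learn expects.
mae_scorer = make_scorer(mean_absolute_error, greater_is_better=False)

scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring=mae_scorer)
print("MAE per fold:", -scores)
```

Because the scorer is negated, flip the sign before reporting the MAE.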

Result of search & enrichment request:

⭐️ Enriched pandas dataframe **with 10+ new relevant features**: `enriched_train_features`  
⭐️ Calculated accuracy uplift after enrichment: MAE from 22.9 BEFORE to 22.3 AFTER for a basic **non task-optimized ML model** (MAE is mean absolute error; lower is better)
>💡 You can also enrich production ML pipelines, more details [here](https://github.com/upgini/upgini#6--enrich-production-ml-pipeline-with-relevant-external-features)
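
The `Uplift` column in the metrics table is consistent with a simple difference between the baseline and enriched MAE. The sketch below reproduces that arithmetic; treating uplift as an absolute MAE reduction is an assumption inferred from the printed numbers, and the table values are rounded, so the match is only approximate.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between targets and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Tiny made-up example: errors of 2.0 and 2.5 average to 2.25.
toy_mae = mean_absolute_error([72.0, 87.5], [70.0, 90.0])
print(f"toy MAE = {toy_mae}")

# Values copied from the metrics table printed by fit_transform().
baseline_mae = 22.892
enriched_mae = 22.263

# Assumption: uplift is the absolute reduction in MAE (lower MAE is better).
uplift = baseline_mae - enriched_mae
print(f"uplift = {uplift:.3f}")  # close to the reported 0.6284
```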

## ✅ Retrain model with enriched training dataset

Now, you can use an enriched dataframe to train a more accurate, **task-optimized ML model** in your existing ML pipeline.   
As an example, let's take `CatBoostRegressor`.

In [None]:
enriched_train_features.head(2)

Unnamed: 0,Company Name,Location,Headquarters,Size,Founded,Type of ownership,Industry,Sector,Revenue,company_txt,...,combined_8a92bc_emb155,combined_8a92bc_emb179,combined_8a92bc_emb181,combined_8a92bc_emb37,combined_8a92bc_emb45,combined_8a92bc_emb72,company_txt_4c5666_org_emb100,company_txt_4c5666_org_emb142,company_txt_4c5666_org_emb76,f_autofe_mul_922848b1
0,Tecolote Research\n3.8,"Albuquerque, NM","Goleta, CA",501 to 1000 employees,1973,Company - Private,Aerospace & Defense,Aerospace & Defense,$50 to $100 million (USD),Tecolote Research,...,-0.016766,-0.032975,0.003364,-0.009393,0.004895,-0.024411,0.000687,-0.080272,0.103452,153.0
1,University of Maryland Medical System\n3.4,"Linthicum, MD","Baltimore, MD",10000+ employees,1984,Other Organization,Health Care Services & Hospitals,Health Care,$2 to $5 billion (USD),University of Maryland Medical System,...,-0.009389,0.00903,0.026491,0.005215,0.018044,0.030807,0.020352,-0.040527,0.029741,0.0


In [None]:
from catboost import CatBoostRegressor
from catboost.utils import eval_metric
from sklearn.model_selection import train_test_split

# Find all categorical features and replace NaNs with 'NA'
cat_col_enriched = [col for col in enriched_train_features.columns if enriched_train_features[col].dtype == "O"]
enriched_train_features.loc[:, cat_col_enriched] = enriched_train_features.loc[:, cat_col_enriched].fillna("NA")

cat_col_baseline = [col for col in train_features.columns if train_features[col].dtype == "O"]
train_features.loc[:, cat_col_baseline] = train_features.loc[:, cat_col_baseline].fillna("NA")

# Train and test split for correct model evaluation
X_train, X_test, y_train, y_test, X_train_baseline, X_test_baseline = train_test_split(
    enriched_train_features,
    train_target,
    train_features,
    test_size=0.2,
    shuffle=True,
    random_state=0)

# Task-optimized Catboost estimator
model = CatBoostRegressor(
    learning_rate=0.03,
    iterations=330,
    random_state=0,
    eval_metric="MAE",
    verbose=False,)

Baseline **BEFORE** enrichment with the new features, *Mean Absolute Error*:

In [None]:
model.fit(X_train_baseline, y_train, cat_features=cat_col_baseline)
preds = model.predict(X_test_baseline)
eval_metric(y_test.values, preds, "MAE")

[22.415689095313247]

**AFTER** enrichment, *Mean Absolute Error*:

In [None]:
model.fit(X_train, y_train, cat_features=cat_col_enriched)
preds = model.predict(X_test)
eval_metric(y_test.values, preds, "MAE")

[20.95046063080986]
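
To put the two numbers side by side, this short calculation uses the MAE values printed by the two evaluation cells of this run; other runs or library versions may give different values.

```python
# MAE values printed by the BEFORE and AFTER evaluation cells.
mae_before = 22.415689095313247
mae_after = 20.95046063080986

absolute_gain = mae_before - mae_after
relative_gain = absolute_gain / mae_before * 100
print(f"MAE reduced by {absolute_gain:.2f} ({relative_gain:.1f}%)")
```

On the held-out 20% of rows, that is a reduction of roughly 1.5 MAE points, or about 6.5%.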

______________________________
**That's all for a quick start in 15 minutes!**  
If you found this useful or interesting, feel free to share.  
______________________________
## 🔗 Useful links
* Upgini Library [Documentation](https://github.com/upgini/upgini#readme)
* More [Notebooks and Guides](https://github.com/upgini/upgini?tab=readme-ov-file#-tutorials)
* Kaggle public [Notebooks](https://www.kaggle.com/romaupgini/code)


<sup>😔 Found a typo or a bug in a code snippet? Our bad! <a href="https://github.com/upgini/upgini/issues/new?assignees=&title=readme%2Fbug">
Please report it here.</a></sup>

## Optional: Enrichment with **external data & features only**, without LLM-based feature generation

To enrich the training dataset ONLY with features from external data sources, without automated feature generation on the text columns, simply remove the parameter `generate_features=['combined', 'company_txt']` from `FeaturesEnricher`.  
This lets you compare the uplift from *LLM-based feature generation + external data* vs. the uplift from *external data and features only*:  

In [None]:
df = pd.read_csv(df_path)
train_features = df.drop(['avg_salary'], axis=1)
train_target = df.avg_salary

enricher = FeaturesEnricher(
    search_keys={
    'country': SearchKey.COUNTRY,
    'Postal_code': SearchKey.POSTAL_CODE})
enricher.fit(train_features, train_target, scoring = "mean_absolute_error")

Demo training dataset detected. Registration for an API key is not required.


Detected task type: ModelTaskType.REGRESSION. Reason: many unique label-values or non-integer floating point values observed
You can set task type manually with argument `model_task_type` of FeaturesEnricher constructor if task type detected incorrectly







Column name,Status,Errors
Postal_code,Some invalid,"2.2% values failed validation and removed from dataframe, invalid values: [<NA>, <NA>, <NA>, <NA>, <NA>]"
country,All valid,-
target,All valid,-
current_date,All valid,-




Running search request, search_id=79962dc4-ed99-4392-8df1-a17d04e62195
We'll send email notification once it's completed, just use your personal api_key from profile.upgini.com

1 relevant feature(s) found with the search keys: ['Postal_code', 'country', 'current_date']


Feature name,SHAP value,Coverage %,Value preview,Provider,Source,Updates
f_telecom_country_postal_cells_UMTS_10km_range_avg_9dcf0c9e,6.1814,96.9828,"4079.0219, 5621.3995, 4372.761...",Upgini,World mobile network coverage,Quarterly


Provider,Source,All features SHAP,Number of relevant features
Upgini,World mobile network coverage,6.1814,1


Calculating accuracy uplift after enrichment...


Dataset type,Rows,Mean target,Baseline mean_absolute_error,Enriched mean_absolute_error,Uplift
Train,464,100.7802,22.892 ± 1.865,22.857 ± 1.617,0.0347
