## OpenFE Short Demo / Tutorial

OpenFE is a tool for automated feature engineering. It generates many candidate features using common feature engineering techniques.
I thought this was an interesting tool and wanted to take a quick look. I am not an expert in OpenFE, and there are many other great automated feature engineering tools.   

Paper: https://arxiv.org/abs/2211.12507  
Code: https://github.com/IIIS-Li-Group/OpenFE

The notebook was exported from Snowflake, but it should work in any compute environment. The only change needed is how the data is ingested.

In [None]:
!pip install openfe matplotlib -q

# Import python packages
import streamlit as st
import pandas as pd

# We can also use Snowpark for our analyses!
from snowflake.snowpark.context import get_active_session
session = get_active_session()


References:

Using a Kaggle dataset on Mohs Hardness - Playground Series - Season 3, Episode 25: https://www.kaggle.com/competitions/playground-series-s3e25/data

Useful Kaggle notebooks using OpenFE:
Elevating Kaggle Performance with OpenFE - https://www.kaggle.com/code/sunilkumaradapa/elevating-kaggle-performance-with-openfe

1st Place Solution for the Regression with an Abalone Dataset Competition - https://www.kaggle.com/competitions/playground-series-s4e4/discussion/499174

OpenFE + Blending + Explain - https://www.kaggle.com/code/trupologhelper/ps4e5-openfe-blending-explain/notebook


In [None]:
train = session.read.table("RAJIV.KAGGLE.MOHS_TRAIN").to_pandas()
train.head()

In [None]:
test = session.read.table("RAJIV.KAGGLE.MOHS_TEST").to_pandas()
# This column was not recognized as numeric on load, so convert it explicitly
test["IONENERGY_AVERAGE"] = pd.to_numeric(test["IONENERGY_AVERAGE"])
test.head()

Select the features to use with OpenFE. You may need to use your domain expertise to exclude some features (for example, ones that leak the target).
But remember, some uninformative features may also yield informative candidate features after transformation. In a Diabetes dataset, for example, when the goal is to forecast if a patient will be readmitted to the hospital, the feature ‘patient id’ is useless. However, ‘freq(patient id)’, which is the number of times the patient has been admitted to the hospital, is a strong predictor of whether the patient would be readmitted.
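
As a minimal sketch of the `freq` idea described above, the snippet below counts how often each ID appears and uses that count as a new feature. The `patient_ids` list is made-up illustration data, not part of the Mohs dataset, and this is not how OpenFE computes the feature internally.

```python
from collections import Counter

# Hypothetical admission log: one entry per hospital admission.
patient_ids = ["p1", "p2", "p1", "p3", "p1", "p2"]

# freq(patient_id): how many times each ID appears in the column.
counts = Counter(patient_ids)
freq_feature = [counts[pid] for pid in patient_ids]

print(freq_feature)  # [3, 2, 3, 1, 3, 2]
```

With a pandas DataFrame `df`, the same feature can be written as `df["patient_id"].map(df["patient_id"].value_counts())`.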

In [None]:
y = train['HARDNESS']
X = train.drop(['ID', 'HARDNESS'], axis=1)
X_test = test.drop(['ID'], axis=1)
X.head()

Let's use OpenFE

In [None]:
from openfe import OpenFE, transform, tree_to_formula
ofe = OpenFE()
features = ofe.fit(data=X, label=y)

In [None]:
for feature in ofe.new_features_list[:10]:
    print(tree_to_formula(feature))

In [None]:
X_t, X_test = transform(X, X_test, features, n_jobs=4)
X_t.shape

In [None]:
X_t.head()

Advanced OpenFE

Generating a lot of features requires more RAM, so use a bigger instance or sample down your dataset (for example, to 1% of it).

You can increase `n_data_blocks` to speed up computation, but the quality of the candidate features might be reduced.

Feature boosting is a more efficient way to identify the best features. I have seen people use OpenFE both with and without it.
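
The sampling suggestion above can be sketched with pandas as follows. `X_demo` and `y_demo` are synthetic stand-ins for the notebook's `X` and `y`, and the 1% fraction and the seed are only illustrative choices.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real training data (X, y) used in this notebook.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(10_000, 3)), columns=["a", "b", "c"])
y_demo = pd.Series(rng.normal(size=10_000), name="HARDNESS")

# Draw a 1% random sample of rows; the fixed seed keeps it reproducible.
X_small = X_demo.sample(frac=0.01, random_state=1)
# Select the matching labels by index so features and target stay aligned.
y_small = y_demo.loc[X_small.index]

print(X_small.shape, y_small.shape)
```

The idea would be to fit OpenFE on the sample and then apply the resulting features to the full data with `transform`, as in the earlier cells.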

In [None]:
from openfe import OpenFE, transform, tree_to_formula

n_jobs = 4
# Parameters for the GBDT model used by OpenFE (passed as stage2_params)
params = {"n_estimators": 1000, "importance_type": "gain",
          "num_leaves": 64, "seed": 1, "n_jobs": n_jobs}

ofe = OpenFE()
features = ofe.fit(data=X, label=y, metric='rmse',
                   task='regression', stage2_params=params,
                   min_candidate_features=5000, n_jobs=n_jobs,
                   n_data_blocks=2, feature_boosting=True)

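# Drop ID so the test columns match X (X_test was overwritten by the earlier transform)
X_t2, X_test2 = transform(X, test.drop(['ID'], axis=1), features, n_jobs=n_jobs)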
X_t2.shape