SemForms automatically mines existing code on GitHub that may operate on the same dataset and uses it to construct features for AutoML. It applies static analysis to that Python code to extract the features the code computes.

In [None]:
from auto_example import handle_transforms
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn import metrics
import statistics
import numpy

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

import pandas as pd
pd.options.mode.chained_assignment = None 

Consider the housing dataset from OpenML as an example.

In [None]:
from sklearn.datasets import fetch_openml
dataset      = fetch_openml('houses', version=1) # name is dataset_name
X            = dataset['data']
target       = dataset['target'].to_frame()

print(X)
print(target)
print(dataset['target_names'])

We use a random forest regressor on this dataset to find a baseline R^2 value.

In [None]:
# Set a standard regressor (could be any AutoML system as well)
estimator = RandomForestRegressor(random_state = 1908)
# Evaluate on original data
scores = cross_val_score(estimator, X, numpy.ravel(target), cv=3, scoring='r2')
print("Averaged r2 score on original data:  " + str(statistics.mean(scores)))

In [None]:
cols = X.columns
print(cols)

To find relevant scripts, we can use the GitHub API to search for Python notebooks that mention the table name and its columns. For this demo, to avoid issues with networking, GitHub access tokens, and search non-determinism, we use a fixed set of files returned by an earlier search.

In [None]:
urls = ["https://raw.githubusercontent.com/sayemimtiaz/kaggle-notebooks/4b639bfedf2245b288d0b0d02dcbfa838a18caf0/notebooks/ritabanmitra/californiahousingreg.ipynb",
"https://raw.githubusercontent.com/sayemimtiaz/kaggle-notebooks/4b639bfedf2245b288d0b0d02dcbfa838a18caf0/notebooks/obrunet/california-housing-prices.ipynb",
"https://raw.githubusercontent.com/pollozhao/JupyterNotebook/424e39b10a35b9bcbb1833c14d5776b68460c8a4/pysparkML.ipynb",
"https://raw.githubusercontent.com/sayemimtiaz/kaggle-notebooks/4b639bfedf2245b288d0b0d02dcbfa838a18caf0/notebooks/ronishternberg/roni-california-housing-prices.ipynb",
"https://raw.githubusercontent.com/sayemimtiaz/kaggle-notebooks/4b639bfedf2245b288d0b0d02dcbfa838a18caf0/notebooks/ashokshamnani/handson-book-end-to-end-ml.ipynb",
"https://raw.githubusercontent.com/sayemimtiaz/kaggle-notebooks/4b639bfedf2245b288d0b0d02dcbfa838a18caf0/notebooks/harrywang/housing-price-prediction.ipynb",
"https://raw.githubusercontent.com/francisco-renteria/francisco-renteria.github.io/59427cce8ca3bd71b9968f65c52332a5ba884df5/HOML/OEBPS/ch02.html",
"https://raw.githubusercontent.com/binarybrain-009/M.L.-learning-and-reading-material/4d119e1d49bbfe8427645ae4090114bc5b1acf1f/01_end_to_end_machine_learning_project-checkpoint.ipynb",
"https://raw.githubusercontent.com/sudhanshu1402/Basic-Machine-Learning-Projects/a07146832e2289f1b9b9924a0806f74e63c375b6/Housing%20Prices%20Prediction/Housing%20Prices%20Prediction.ipynb",
"https://raw.githubusercontent.com/CS196Illinois/Group33-FA22/52fc391f2344df3c7c1d62ada34fe09e30dd5a56/Project/Backend/price_prediction_models.ipynb",
"https://raw.githubusercontent.com/Sayar1106/Sayar1106.github.io/45899ac1f8e6368a1f066cde46e9968f6ffa68f0/post/index.xml",
"https://raw.githubusercontent.com/arunpeddakotla/hansonml2/72501d47132a18da11e741db27c30c3061c1b9f0/2.%20End-to-End%20Machine%20Learning%20Project%20-%20Hands-on%20Machine%20Learning%20with%20Scikit-Learn,%20Keras,%20and%20TensorFlow,%202nd%20Edition.html",
"https://raw.githubusercontent.com/Sayar1106/portfolio-website/484cfe9ba95f69085dc11cd0d543ca4799d040b6/public/post/index.xml",
"https://raw.githubusercontent.com/Sayar1106/portfolio-website/484cfe9ba95f69085dc11cd0d543ca4799d040b6/public/authors/admin/index.xml",
"https://raw.githubusercontent.com/Sayar1106/Sayar1106.github.io/45899ac1f8e6368a1f066cde46e9968f6ffa68f0/index.xml",
"https://raw.githubusercontent.com/DAI-Lab/RP_Master_thesis/b76f726c0bb2917cde8d476124383ebebca354aa/N1_extract_data_generate_synt_data.ipynb",
"https://raw.githubusercontent.com/dmsenter89/dmsenter89.github.io/e6664e7c90bf902accaf848fedbe5a30acc906ea/post/index.xml",
"https://raw.githubusercontent.com/dmsenter89/dmsenter89.github.io/e6664e7c90bf902accaf848fedbe5a30acc906ea/index.xml",
"https://raw.githubusercontent.com/stigbosmans/MLEX/55de97bb7f94293bba8f0b29107ebd48e84ab278/CH2/.ipynb_checkpoints/HousingPrices-checkpoint.ipynb",
"https://raw.githubusercontent.com/stigbosmans/MLEX/55de97bb7f94293bba8f0b29107ebd48e84ab278/CH2/HousingPrices.ipynb"]
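
For reference, a search of the kind described above could be assembled roughly as follows. This is a minimal sketch, not the query SemForms actually issued: the helper names `build_search_url` and `search_github`, the choice of search terms, and the `extension:ipynb` filter are all assumptions made for illustration. Only the URL is built here; the function that performs the request is defined but not called.

```python
import json
import urllib.parse
import urllib.request

# GitHub REST endpoint for code search (requires an authenticated request)
GITHUB_CODE_SEARCH = "https://api.github.com/search/code"

def build_search_url(table_name, column_names, extension="ipynb", per_page=20):
    """Hypothetical helper: build a search URL for files mentioning the table and columns."""
    terms = [f'"{table_name}"'] + [f'"{c}"' for c in column_names]
    query = " ".join(terms) + f" extension:{extension}"
    params = urllib.parse.urlencode({"q": query, "per_page": per_page})
    return GITHUB_CODE_SEARCH + "?" + params

def search_github(search_url, token):
    """Hypothetical helper: run the search and return the URLs of matching files."""
    request = urllib.request.Request(search_url, headers={
        "Accept": "application/vnd.github+json",
        "Authorization": f"Bearer {token}",
    })
    with urllib.request.urlopen(request) as response:
        payload = json.load(response)
    return [item["html_url"] for item in payload.get("items", [])]

# Illustrative table and column names (assumed); no network access happens here
search_url = build_search_url("housing", ["median_income", "housing_median_age"])
print(search_url)
```

Results from such a search can change between runs, which is one reason the demo pins its inputs to commit-specific raw URLs.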

In [None]:
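import urllib.request
import logging
import nbformat
from nbconvert.exporters import PythonExporter

def get_python(nbhandle):
    """Download a notebook from a URL and return its code as Python source, or None on failure."""
    print('reading:' + nbhandle)
    try:
        # Fetch inside the try block so that network errors also yield None
        with urllib.request.urlopen(nbhandle) as f:
            nb = nbformat.read(f, as_version=4)
        # from_notebook_node returns (source, resources); keep only the source
        code, _ = PythonExporter().from_notebook_node(nb)
        return code
    except Exception:
        # Some URLs point to HTML or XML rather than notebooks; skip those
        logging.debug("could not convert %s", nbhandle, exc_info=True)
        return None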


Once we have the relevant URLs on GitHub, we can fetch the source code and submit it to a program analysis script. This script analyzes the creation of pandas DataFrame objects, any reads or writes to column names in the dataset, and the operations performed on those columns. The extracted code is then normalized into a set of Python lambda expressions so it can be dropped into an AutoML pipeline.

In [None]:
import requests

expressions = []
code2count = {}
code2url = {}

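for url in urls:
    code = get_python(url)
    if code is None:
        # Skip URLs that could not be converted to Python source
        continue
    # Submit the source to the locally running analysis service
    req = {'repo': url, 'source': code, 'indexName': 'expressions'}
    response = requests.post('http://localhost:4567/index', json=req)
    # Count how often each extracted expression occurs across all files
    for r in response.json():
        if r['code'] not in code2count:
            code2count[r['code']] = 0
            code2url[r['code']] = r['source_file']
        code2count[r['code']] += 1

# Keep the 10 most frequently occurring expressions
codes = {k: v for k, v in sorted(code2count.items(), reverse=True, key=lambda item: item[1])[:10]}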

for idx, c in enumerate(codes):
    expressions.append({'expr_name': 'expr' + str(idx), 'code': c, 'url': code2url[c]})

print(expressions)
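
To give a feel for what analyzing reads and writes to column names involves, here is a much simplified sketch built on Python's standard `ast` module (Python 3.9 or later is assumed). It is not the analysis SemForms runs, which happens in the service called above. The function `column_accesses` and the sample source are hypothetical, and the sketch only recognizes string-keyed subscripts such as `df["col"]`, without checking that the object being indexed is a DataFrame.

```python
import ast

def column_accesses(source):
    """Hypothetical helper: collect string keys that are read or written via subscripts."""
    reads, writes = set(), set()
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Subscript):
            continue
        key = node.slice
        # Only consider literal string keys, e.g. df["total_rooms"]
        if isinstance(key, ast.Constant) and isinstance(key.value, str):
            if isinstance(node.ctx, ast.Store):
                writes.add(key.value)   # assignment target: df["x"] = ...
            elif isinstance(node.ctx, ast.Load):
                reads.add(key.value)    # value being read: ... = df["x"]
    return reads, writes

# Illustrative source text; it is parsed, never executed
sample_source = '''
import pandas as pd
df = pd.read_csv("housing.csv")
df["rooms_per_household"] = df["total_rooms"] / df["households"]
df["income_cat"] = pd.cut(df["median_income"], bins=5)
'''

sample_reads, sample_writes = column_accesses(sample_source)
print(sorted(sample_reads))
print(sorted(sample_writes))
```

A real analysis also has to follow data flow, for example to tell which variables hold DataFrames and which expressions produced a written column, which is why this step is delegated to a dedicated program analysis tool.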


As the output shows, not all extracted expressions can be applied. An expression may refer to a column that does not exist in the dataset (e.g., income_cat), it may yield a single value and so cannot serve as a feature, or it may be directly correlated with an existing column. The next step filters out the expressions that cannot be used.

In [None]:
# Analyze the given transforms and, where applicable, create scikit-learn FunctionTransformer steps for a pipeline
transforms_suggested, correlation, generated_code = handle_transforms("both", expressions, target, X, 'houses.csv')

In [None]:
# Show which augmentations are kept
generated_code
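
The three filtering criteria mentioned above can be illustrated with a small sketch. How `handle_transforms` implements them is not shown in this notebook, so everything here is an assumption: the function `usable_features`, the 0.95 correlation threshold, the toy data, and the candidate lambdas are invented for illustration.

```python
import pandas as pd

def usable_features(candidates, frame, max_corr=0.95):
    """Hypothetical filter: keep candidates that yield a new, non-constant numeric column."""
    kept = {}
    numeric = frame.select_dtypes("number")
    for name, func in candidates.items():
        try:
            values = func(frame)
        except KeyError:
            continue  # refers to a column the dataset does not have
        if not isinstance(values, pd.Series):
            continue  # yields a single value, not one value per row
        if not pd.api.types.is_numeric_dtype(values) or values.nunique() <= 1:
            continue  # non-numeric or constant column
        if numeric.corrwith(values).abs().max() > max_corr:
            continue  # nearly duplicates an existing column
        kept[name] = func
    return kept

# Toy data and candidate expressions (all assumed for illustration)
toy_frame = pd.DataFrame({
    "total_rooms": [10.0, 40.0, 12.0, 30.0, 8.0],
    "households": [2.0, 5.0, 4.0, 3.0, 4.0],
})
toy_candidates = {
    "rooms_per_household": lambda df: df["total_rooms"] / df["households"],
    "income_cat_scaled": lambda df: df["income_cat"] * 2,    # missing column
    "mean_rooms": lambda df: df["total_rooms"].mean(),       # single value
    "rooms_doubled": lambda df: df["total_rooms"] * 2,       # perfectly correlated
}

toy_kept = usable_features(toy_candidates, toy_frame)
print(list(toy_kept))
```

In this toy example only the ratio feature survives, since the other three candidates each fail one of the checks.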

Now we prepend the suggested transformers to our estimator to form a pipeline.

In [None]:
# Add estimator to suggested transformation pipeline
transforms_suggested.append(('estimator', estimator))
pipeline = Pipeline(transforms_suggested)
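
To make the structure of such a pipeline more concrete, the sketch below builds one by hand from a single `FunctionTransformer` step followed by an estimator. The transformer function, the step name `expr0`, the toy data, and the use of `LinearRegression` are assumptions chosen to keep the example small; the steps that `handle_transforms` actually returns may look different.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

def add_rooms_per_household(frame):
    """Hypothetical mined feature: append a ratio column to a copy of the input."""
    out = frame.copy()
    out["rooms_per_household"] = out["total_rooms"] / out["households"]
    return out

# A list of (name, transformer) steps, with the estimator appended as the final step
sketch_steps = [("expr0", FunctionTransformer(add_rooms_per_household))]
sketch_steps.append(("estimator", LinearRegression()))
sketch_pipeline = Pipeline(sketch_steps)

# Toy data (assumed) to show that the pipeline fits and predicts end to end
toy_X = pd.DataFrame({
    "total_rooms": [10.0, 40.0, 12.0, 30.0, 8.0],
    "households": [2.0, 5.0, 4.0, 3.0, 4.0],
})
toy_y = [1.0, 3.0, 1.5, 2.5, 1.0]

sketch_pipeline.fit(toy_X, toy_y)
toy_predictions = sketch_pipeline.predict(toy_X)
toy_transformed = sketch_pipeline.named_steps["expr0"].transform(toy_X)
print(list(toy_transformed.columns))
```

Because the feature is computed inside the pipeline, it is recomputed on each cross-validation fold rather than being baked into the input data.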

In [None]:
# Evaluate with data augmentation added as function transformers based on original data
scores = cross_val_score(pipeline, X, numpy.ravel(target), cv=3, scoring='r2')
print("Averaged r2 score with augmentations on original data:  " + str(statistics.mean(scores)))

For this dataset, adding the mined features improves the score. Not every dataset will show such an improvement, which is in keeping with AutoML in general: it explores many features and pipelines for a dataset and is not guaranteed to improve performance on every one.