SemForms automatically mines existing code from large collections of GitHub repositories, looking for code that may operate on the same dataset, and uses it to construct features for AutoML.

In [1]:
from auto_example import handle_transforms
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn import metrics
import statistics
import numpy

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

Consider the housing dataset from OpenML as an example.

In [2]:
from sklearn.datasets import fetch_openml
dataset      = fetch_openml('houses', version=1) # first argument is the OpenML dataset name
X            = dataset['data']
target       = dataset['target'].to_frame()

print(X)
print(target)
print(dataset['target_names'])


       median_income  housing_median_age  total_rooms  total_bedrooms  \
0             8.3252                41.0        880.0           129.0   
1             8.3014                21.0       7099.0          1106.0   
2             7.2574                52.0       1467.0           190.0   
3             5.6431                52.0       1274.0           235.0   
4             3.8462                52.0       1627.0           280.0   
...              ...                 ...          ...             ...   
20635         1.5603                25.0       1665.0           374.0   
20636         2.5568                18.0        697.0           150.0   
20637         1.7000                17.0       2254.0           485.0   
20638         1.8672                18.0       1860.0           409.0   
20639         2.3886                16.0       2785.0           616.0   

       population  households  latitude  longitude  
0           322.0       126.0     37.88    -122.23  
1          2401.0

We use a random forest regressor on this dataset to find a baseline R^2 value.

In [3]:
# Set a standard regressor (could be any AutoML system as well)
estimator = RandomForestRegressor(random_state = 1908)
# Evaluate on original data
scores = cross_val_score(estimator, X, numpy.ravel(target), cv=3, scoring='r2')
print("Averaged r2 score on original data:  " + str(statistics.mean(scores)))

Averaged r2 score on original data:  0.5814812832363954


In [4]:
cols = X.columns
print(cols)

Index(['median_income', 'housing_median_age', 'total_rooms', 'total_bedrooms',
       'population', 'households', 'latitude', 'longitude'],
      dtype='object')


To find relevant scripts, we use the GitHub API to search for Python notebooks that mention the table name and its column names.

In [5]:
import requests

# Build the search query: the table name followed by every column name
query = ' '.join(['houses', *cols])
print('query:' + query)

rest_url = 'http://localhost:8001/api/github_search'
payload = {"query": query}
ret = requests.get(rest_url, params=payload)

# The service answers with a JSON list of hits, each carrying a 'URL' field
data = ret.json()

# Rewrite GitHub page URLs so that they point at the raw file contents
urls = [d['URL'].replace('github.com', 'raw.githubusercontent.com').replace('/blob/', '/') for d in data]

print(urls)


query:houses median_income housing_median_age total_rooms total_bedrooms population households latitude longitude
['https://raw.githubusercontent.com/sayemimtiaz/kaggle-notebooks/4b639bfedf2245b288d0b0d02dcbfa838a18caf0/notebooks/ritabanmitra/californiahousingreg.ipynb', 'https://raw.githubusercontent.com/sayemimtiaz/kaggle-notebooks/4b639bfedf2245b288d0b0d02dcbfa838a18caf0/notebooks/obrunet/california-housing-prices.ipynb', 'https://raw.githubusercontent.com/pollozhao/JupyterNotebook/424e39b10a35b9bcbb1833c14d5776b68460c8a4/pysparkML.ipynb', 'https://raw.githubusercontent.com/sayemimtiaz/kaggle-notebooks/4b639bfedf2245b288d0b0d02dcbfa838a18caf0/notebooks/ronishternberg/roni-california-housing-prices.ipynb', 'https://raw.githubusercontent.com/sayemimtiaz/kaggle-notebooks/4b639bfedf2245b288d0b0d02dcbfa838a18caf0/notebooks/ashokshamnani/handson-book-end-to-end-ml.ipynb', 'https://raw.githubusercontent.com/sayemimtiaz/kaggle-notebooks/4b639bfedf2245b288d0b0d02dcbfa838a18caf0/notebooks/har
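The URL rewriting in the cell above relies on how GitHub serves files: a page at `github.com/<owner>/<repo>/blob/<ref>/<path>` has its raw contents at `raw.githubusercontent.com/<owner>/<repo>/<ref>/<path>`. A minimal sketch of the same conversion as a standalone function follows. The name `to_raw_url` and the sample URL are hypothetical and serve only as illustration.

```python
from urllib.parse import urlsplit, urlunsplit

def to_raw_url(blob_url):
    """Convert a GitHub 'blob' page URL into its raw-content URL.

    Hypothetical helper: it swaps the host and drops the first
    '/blob/' path segment, leaving every other part untouched.
    """
    parts = urlsplit(blob_url)
    if parts.netloc != 'github.com':
        return blob_url  # not a GitHub page URL, so leave it unchanged
    path = parts.path.replace('/blob/', '/', 1)
    return urlunsplit((parts.scheme, 'raw.githubusercontent.com', path,
                       parts.query, parts.fragment))

# Made-up URL, used only to show the rewrite
example = 'https://github.com/some-user/some-repo/blob/main/notebooks/demo.ipynb'
print(to_raw_url(example))
```

Parsing the URL first, rather than calling `str.replace` on the whole string, avoids touching a `/blob/` that happens to appear later in the file path.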

In [6]:
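import urllib.request
import logging
import nbformat
from nbconvert.exporters import PythonExporter

def get_python(nbhandle):
    """Download a notebook and return its code as Python source, or None on failure."""
    print('reading:' + nbhandle)
    try:
        # Fetch inside the try block so that network errors are skipped as well
        with urllib.request.urlopen(nbhandle) as f:
            nb = nbformat.read(f, as_version=4)
        # from_notebook_node returns a (source, resources) pair; keep the source
        code, _ = PythonExporter().from_notebook_node(nb)
        return code
    except Exception:
        # logging.exception("message")
        return None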


Once we have the relevant URLs on GitHub, we can access the source code and submit it to a program analysis script.  This script analyzes the creation of pandas dataframe objects, any reads or writes to column names in the dataset, and any operations performed on them.  The extracted code is then normalized into a set of Python lambda expressions so it can be dropped into an AutoML pipeline.
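To give a rough feel for one piece of this analysis, the sketch below uses Python's standard `ast` module to find which known column names a piece of code accesses through string subscripts such as `df['total_rooms']`. It is only an illustration under simplifying assumptions, not the analysis the notebook actually runs: the function name `find_column_refs` and the sample snippet are hypothetical, and it does not track dataframe creation, data flow, or attribute-style access.

```python
import ast

def find_column_refs(source, known_columns):
    """Return the known column names used as string subscripts in source.

    Simplified, hypothetical sketch: every subscript with a string
    literal key counts, whatever object is being subscripted.
    Assumes Python 3.9+, where the subscript key is stored directly
    in node.slice.
    """
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Subscript) and isinstance(node.slice, ast.Constant):
            name = node.slice.value
            if isinstance(name, str) and name in known_columns:
                found.add(name)
    return sorted(found)

# Made-up code of the kind a mined notebook might contain
snippet = """
df['rooms_per_household'] = df['total_rooms'] / df['households']
df['income_cat'] = df['median_income'] * 2
"""
columns = ['median_income', 'total_rooms', 'households']
print(find_column_refs(snippet, columns))
```

Names such as `rooms_per_household` and `income_cat` are ignored here because they are not columns of the dataset. The notebook itself delegates the full analysis to the service called in the next cell.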

In [7]:
expressions = []
code2count = {}

for url in urls:
    code = get_python(url)
    if code is None:
        continue
    # Submit the notebook source to the analysis service
    req = {'repo': url, 'source': code, 'indexName': 'expressions'}
    response = requests.post('http://localhost:4567/index', json=req)
    # Count how often each extracted expression occurs
    for r in response.json():
        code2count[r['code']] = code2count.get(r['code'], 0) + 1

# Keep the 10 most frequent expressions
codes = {k: v for k, v in sorted(code2count.items(), reverse=True, key=lambda item: item[1])[:10]}

for idx, c in enumerate(codes):
    expressions.append({'expr_name': 'expr' + str(idx), 'code': c})

print(expressions)


reading:https://raw.githubusercontent.com/sayemimtiaz/kaggle-notebooks/4b639bfedf2245b288d0b0d02dcbfa838a18caf0/notebooks/ritabanmitra/californiahousingreg.ipynb
reading:https://raw.githubusercontent.com/sayemimtiaz/kaggle-notebooks/4b639bfedf2245b288d0b0d02dcbfa838a18caf0/notebooks/obrunet/california-housing-prices.ipynb
reading:https://raw.githubusercontent.com/pollozhao/JupyterNotebook/424e39b10a35b9bcbb1833c14d5776b68460c8a4/pysparkML.ipynb
reading:https://raw.githubusercontent.com/sayemimtiaz/kaggle-notebooks/4b639bfedf2245b288d0b0d02dcbfa838a18caf0/notebooks/ronishternberg/roni-california-housing-prices.ipynb
reading:https://raw.githubusercontent.com/sayemimtiaz/kaggle-notebooks/4b639bfedf2245b288d0b0d02dcbfa838a18caf0/notebooks/ashokshamnani/handson-book-end-to-end-ml.ipynb
reading:https://raw.githubusercontent.com/sayemimtiaz/kaggle-notebooks/4b639bfedf2245b288d0b0d02dcbfa838a18caf0/notebooks/harrywang/housing-price-prediction.ipynb
reading:https://raw.githubusercontent.com/fra

Not all extracted expressions can be applied. An expression may refer to a column that does not exist in this dataset (e.g., income_cat), it may yield a single value rather than a column (e.g., a median) and so cannot be used as a feature, or its result may be directly correlated with an existing column.  The next step filters out the expressions that cannot be used.

In [8]:
# Analyze the given transforms and, where applicable, create scikit-learn FunctionTransformers for a pipeline
transforms_suggested, correlation = handle_transforms("both", expressions, target, X, 'houses.csv')

------------------------------------------------------
Dataset columns (X): ['median_income', 'housing_median_age', 'total_rooms', 'total_bedrooms', 'population', 'households', 'latitude', 'longitude']
Dataset columns (Y): ['median_house_value']
correlation matrix:
                    median_income  housing_median_age  total_rooms  \
median_income            1.000000            0.119034     0.198050   
housing_median_age       0.119034            1.000000     0.361262   
total_rooms              0.198050            0.361262     1.000000   
total_bedrooms           0.008093            0.320485     0.929893   
population               0.004834            0.296244     0.857126   
households               0.013033            0.302916     0.918484   
latitude                 0.079809            0.011173     0.036100   
longitude                0.015176            0.108197     0.044568   

                    total_bedrooms  population  households  latitude  \
median_income             0.008
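The three filtering criteria described above (missing columns, single-value results, and results that duplicate an existing column) can be sketched with plain pandas. This is not the implementation of `handle_transforms`, whose internals are not shown here. The function name `usable_feature`, the toy data, and the `max_corr=0.95` threshold are all assumptions chosen for illustration.

```python
import pandas as pd

def usable_feature(func, X, max_corr=0.95):
    """Return True if func(X) gives a new, non-constant, non-redundant column.

    Hypothetical sketch; the 0.95 correlation threshold is an assumption.
    """
    try:
        values = func(X)
    except KeyError:
        return False  # the expression refers to a column that X does not have
    if not isinstance(values, pd.Series) or values.nunique() <= 1:
        return False  # a scalar (e.g. a median) or a constant column
    # Reject results that are almost perfectly correlated with an existing column
    corr = X.corrwith(values).abs()
    return bool(corr.max() < max_corr)

# Made-up toy data that reuses two of the dataset's column names
X_toy = pd.DataFrame({'total_rooms': [10.0, 20.0, 30.0, 40.0, 50.0],
                      'households': [2.0, 5.0, 3.0, 8.0, 4.0]})

candidates = {
    'ratio': lambda df: df['total_rooms'] / df['households'],
    'missing_column': lambda df: df['income_cat'] * 2,
    'single_value': lambda df: df['total_rooms'].median(),
    'rescaled_copy': lambda df: df['total_rooms'] * 2,
}
for name, func in candidates.items():
    print(name, usable_feature(func, X_toy))
```

In this toy setting only the ratio passes all three checks: the second candidate needs a column that does not exist, the third collapses to one number, and the fourth is a rescaled copy of `total_rooms`.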

In [9]:
# Pipeline steps for the suggested transforms (as scikit-learn FunctionTransformers)
transforms_suggested

[('expr1',
  FunctionTransformer(func=<function wrapper_func.<locals>.df_func at 0x7f3752c0a560>)),
 ('expr2',
  FunctionTransformer(func=<function wrapper_func.<locals>.df_func at 0x7f3752a53c70>)),
 ('expr3',
  FunctionTransformer(func=<function wrapper_func.<locals>.df_func at 0x7f3752a53d00>)),
 ('expr4',
  FunctionTransformer(func=<function wrapper_func.<locals>.df_func at 0x7f3752a51d80>)),
 ('expr8',
  FunctionTransformer(func=<function wrapper_func.<locals>.df_func at 0x7f3752a52b00>))]

Now we just prepend this set of suggested transformers to an existing pipeline.

In [10]:
# Add estimator to suggested transformation pipeline
transforms_suggested.append(('estimator', estimator))
pipeline = Pipeline(transforms_suggested)
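As a self-contained illustration of the pattern used above, the sketch below builds a list of `(name, FunctionTransformer)` steps, appends an estimator, and wraps the list in a `Pipeline`. It assumes that each transform returns the input dataframe with one extra derived column, which may differ from what the mined transforms do. The function `add_rooms_per_household`, the toy data, and the use of `LinearRegression` are hypothetical choices for the example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

def add_rooms_per_household(df):
    """Return a copy of df with one extra derived column (hypothetical feature)."""
    out = df.copy()
    out['rooms_per_household'] = out['total_rooms'] / out['households']
    return out

# Made-up toy data, only to make the example runnable
X_toy = pd.DataFrame({'total_rooms': [10.0, 20.0, 30.0, 40.0, 50.0],
                      'households': [2.0, 5.0, 3.0, 8.0, 4.0]})
y_toy = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Transform steps come first and the estimator goes last, as in the cell above
steps = [('expr0', FunctionTransformer(add_rooms_per_household))]
steps.append(('estimator', LinearRegression()))
toy_pipeline = Pipeline(steps)
toy_pipeline.fit(X_toy, y_toy)

# Inspect what the estimator actually receives
transformed = toy_pipeline.named_steps['expr0'].transform(X_toy)
print(list(transformed.columns))
print(toy_pipeline.predict(X_toy).shape)
```

Because the transforms live inside the pipeline, `cross_val_score` reapplies them to each fold, so the derived features never have to be materialized by hand.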

For this dataset, adding the mined features improves the average R^2 from about 0.58 to about 0.65, as the next cell shows.  Not every dataset will show such an improvement, which is in the spirit of AutoML: it explores many features and pipelines for a dataset, with no guarantee that any of them improves performance.

In [11]:
# Evaluate with data augmentation added as function transformers based on original data
scores = cross_val_score(pipeline, X, numpy.ravel(target), cv=3, scoring='r2')
print("Averaged r2 score with augmentations on original data:  " + str(statistics.mean(scores)))

Averaged r2 score with augmentations on original data:  0.6544639843416239
