In [1]:
import IPython.display


def _jupyterlab_repr_html_(self):
    from pygments import highlight
    from pygments.formatters import HtmlFormatter

    fmt = HtmlFormatter()
    style = "<style>{}\n{}</style>".format(
        fmt.get_style_defs(".output_html"), fmt.get_style_defs(".jp-RenderedHTML")
    )
    return style + highlight(self.data, self._get_lexer(), fmt)


def display_for_code():
    # Replace _repr_html_ with our own version that adds the 'jp-RenderedHTML' class
    # in addition to 'output_html', then return the patched Code class.
    IPython.display.Code._repr_html_ = _jupyterlab_repr_html_
    return IPython.display.Code


def display_code(code):
    # Convenience wrapper: build a highlighted Code object for a Python snippet.
    return display_for_code()(data=code, language="python3")

Enterprises often have multiple data scientists working on similar data, and their code is usually checked into a central repository such as GitHub. SemForms mines such a repository of code that manipulates various datasets to learn what other data scientists have done with similar datasets. In particular, this demo notebook illustrates how SemForms can generate code that helps cleanse a dataset for downstream tasks such as model building or analysis.

As an example, let us assume that a data scientist wants to clean the Kaggle Titanic survival prediction dataset.

In [2]:
import pandas
df = pandas.read_csv('/data/semforms-steam/graph4code/data/titanic_train.csv')
df

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


Notice that this dataset, like many others, has several issues and needs to be cleansed before it is useful. Some columns are categorical (e.g., Embarked), and others do not seem very relevant (e.g., Ticket, Name). Now let's ask SemForms for code recommendations on cleaning up this dataset.

In [3]:
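import requests
import json

# Describe the dataset by its name followed by its column names.
cols = list(df.columns)
dataset_desc = 'titanic ' + ' '.join(cols)

# The live service call is disabled in this demo; the response is read from a local file instead.
#response = requests.get('http://expressions2.sl.cloud9.ibm.com:8000/expressions/?dataset_url=' + dataset_desc)
#response.json()
with open('file.json') as f:
    data = json.load(f)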
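
The commented-out request above appends the dataset description, which contains spaces, directly to the URL. If the live service were used instead of the local file, one way to make the encoding explicit is sketched below. The endpoint and the `dataset_url` parameter name are taken from the commented-out line; the helper name `build_query_url` and the shortened column list are illustrative assumptions, and whether the service expects this exact encoding is not confirmed here.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Endpoint copied from the commented-out request in the cell above.
BASE_URL = 'http://expressions2.sl.cloud9.ibm.com:8000/expressions/'


def build_query_url(name, cols, base_url=BASE_URL):
    """Hypothetical helper: build the request URL with an encoded query string."""
    dataset_desc = name + ' ' + ' '.join(cols)
    # urlencode turns spaces into '+' and percent-encodes other reserved characters.
    return base_url + '?' + urlencode({'dataset_url': dataset_desc})


# Shortened column list, for illustration only.
url = build_query_url('titanic', ['PassengerId', 'Survived', 'Pclass'])
print(url)

# Round trip: decoding the query string recovers the original description.
print(parse_qs(urlsplit(url).query)['dataset_url'][0])
```

No request is sent by this sketch; it only constructs and inspects the URL.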

SemForms returns a list of functions, organized by script and by the field that they operate on. For operations that create new features from multiple columns, the system returns a single function that may depend on other cleansing functions. Choosing a specific function copies its code into the notebook.
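
The cell that follows indexes the response as `data[column][script][function]`, which suggests a nested mapping from column to script to function body. The sketch below uses a small made-up response of that shape to show how the functions for one column can be collected into a single block of text. The script names, function names, and function bodies are invented for illustration and are not real SemForms output; `render_column` is a hypothetical helper.

```python
# Made-up response with the assumed shape: column -> script -> function -> body.
# Bodies omit the leading 'def ', which the notebook appears to prepend when rendering.
sample = {
    'Embarked': {
        'script_a.py': {
            'fill_embarked': "fill_embarked(df):\n    return df['Embarked'].fillna('S')\n",
        },
        'script_b.py': {
            'encode_embarked': "encode_embarked(df):\n    return df['Embarked'].map({'S': 0, 'C': 1, 'Q': 2})\n",
        },
    },
}


def render_column(data, col):
    """Hypothetical helper: join all recommended functions for one column."""
    parts = []
    for script, functions in data[col].items():
        parts.append('# from script: ' + script)
        for body in functions.values():
            parts.append('def ' + body)
    # One item per line, so every 'def' starts at the beginning of a line.
    return '\n'.join(parts)


text = render_column(sample, 'Embarked')
print(text)
```

Keeping the rendering in a plain function like this separates it from the widget code, so it can be checked without a running notebook front end.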

In [6]:
import ipywidgets as widgets

# Dropdown listing the top-level keys of the response.
columns = widgets.Dropdown(
    options=list(data.keys()),
    value=list(data.keys())[0],
    description='Columns:',
    disabled=False,
)

display(columns)
# Display handle that is updated in place whenever the selection changes.
dh = IPython.display.display(display_id=True)
dc = display_for_code()

def on_change(change):
    # React only to changes of the dropdown's selected value.
    if change['type'] == 'change' and change['name'] == 'value':
        col = columns.value
        buf = []
        for script, functions in data[col].items():
            buf.append('# from script: ' + script)
            for body in functions.values():
                buf.append('def ' + body)

        # Join with newlines so that each 'def' starts at the beginning of a line.
        fun = '\n'.join(buf)
        dh.update(dc(data=fun, language="python3"))

columns.observe(on_change)


Dropdown(description='Columns:', options=('all', 'Embarked', 'Pclass', 'Survived', 'PassengerId', 'columns', '…