# Introduction

## AutoPandas
<img src="https://risecamp2019-atlas.s3.us-east-2.amazonaws.com/imgs/autopandas-logo.png" alt="AutoPandas Logo" style="width: 10%; float: left; padding: 10px"/>


[AutoPandas](https://autopandas.io) is an input-output example-based synthesis engine for the [Pandas](https://pandas.pydata.org) data-analytics library in Python. Users provide input-output pairs (dataframes) specifying their intent, and the AutoPandas engine searches for programs using the Pandas library that transform the input to the output.

## Atlas
<img src="https://risecamp2019-atlas.s3.us-east-2.amazonaws.com/imgs/atlas-logo.png" alt="Atlas Logo" style="width: 10%; float: left; padding: 10px"/>

[Atlas](https://github.com/rbavishi/atlas) is the framework underlying AutoPandas. It generalizes the key ideas behind the AutoPandas synthesis engine, making it possible to build engines for entirely new domains, such as other APIs like NumPy or TensorFlow, or domain-specific languages (DSLs) for, say, string manipulation. The core concept in Atlas is that of a *generator*, which you will learn about shortly. Generators are not only useful in synthesis, but also have potential applications in testing.

## What are we going to learn?

Pandas is a huge library, and building a synthesizer for the entirety of Pandas is out of scope for this tutorial.
Instead, we will pick a single function in Pandas, namely `pivot`, and try to build an efficient synthesizer for it. That is, given input and output dataframes, our synthesizer will determine the right arguments to the `pivot` function.

We will be using the abstractions provided by Atlas and, in doing so, hope to motivate why we feel these abstractions are useful for instantiating engines for other domains.

----------------------------------------------

## 0. What is Pivoting?

Pivot is a summarization operation on tables that is very useful for certain kinds of queries. It is best explained using a concrete example, inspired by the one provided [here](https://kite.com/blog/python/pandas-pivot-table/).

Suppose you are working on stock-market data that looks something like the following -

<img src="https://risecamp2019-atlas.s3.us-east-2.amazonaws.com/imgs/stocks-input.png" alt="Pivot Documentation" style="width: 30%;"/>

In this form, it is hard to compare the trade volume of stocks across dates. `pivot` is a useful transformation to alleviate this issue. We can `pivot` on the `date` column, keeping the `symbol` column as the index and rearranging the `volume` column accordingly, as follows -

<img src="https://risecamp2019-atlas.s3.us-east-2.amazonaws.com/imgs/stocks-output.png" alt="Pivot Documentation" style="width: 40%;"/>


## 1. The Synthesis Task

The user provides input dataframe(s) as well as an output dataframe. Our system should be able to produce a call or a sequence of calls using `pivot`, `groupby` and various aggregation functions that transform the input to the desired output. Here is another example - 

In [1]:
import pandas as pd
from IPython.display import display

inp_df = pd.DataFrame({
    'symbol': ['AMZN', 'AMZN', 'AMZN', 'AAPL', 'AAPL', 'AAPL', 'GOOG', 'GOOG', 'GOOG'],
    'date': ['2019-03-04', '2019-03-05', '2019-03-06', 
             '2019-03-04', '2019-03-05', '2019-03-06', 
             '2019-03-04', '2019-03-05', '2019-03-06'],
    'volume': [6167358, 3681522, 3996001, 27436203, 19737419, 20810384, 1446047, 1443174, 1099289]
})

out_df = pd.DataFrame({
 '2019-03-04': {'AAPL': 27436203, 'AMZN': 6167358, 'GOOG': 1446047},
 '2019-03-05': {'AAPL': 19737419, 'AMZN': 3681522, 'GOOG': 1443174},
 '2019-03-06': {'AAPL': 20810384, 'AMZN': 3996001, 'GOOG': 1099289}
})

print("Input DataFrame")
display(inp_df)

print("Output DataFrame")
display(out_df)

Input DataFrame


| | symbol | date | volume |
|---|---|---|---|
| 0 | AMZN | 2019-03-04 | 6167358 |
| 1 | AMZN | 2019-03-05 | 3681522 |
| 2 | AMZN | 2019-03-06 | 3996001 |
| 3 | AAPL | 2019-03-04 | 27436203 |
| 4 | AAPL | 2019-03-05 | 19737419 |
| 5 | AAPL | 2019-03-06 | 20810384 |
| 6 | GOOG | 2019-03-04 | 1446047 |
| 7 | GOOG | 2019-03-05 | 1443174 |
| 8 | GOOG | 2019-03-06 | 1099289 |


Output DataFrame


| | 2019-03-04 | 2019-03-05 | 2019-03-06 |
|---|---|---|---|
| AAPL | 27436203 | 19737419 | 20810384 |
| AMZN | 6167358 | 3681522 | 3996001 |
| GOOG | 1446047 | 1443174 | 1099289 |


We need to write a method `synthesize_pivot` that when run on the input-output example as follows -

```python
synthesize_pivot(inp_df, out_df)
```

returns the following program or an equivalent program

```python
inp_df.pivot(index="symbol", columns="date", values="volume")
```

## 2. A Brute-Force Synthesizer for Pivot

Before we start building our synthesizer, we need to know the space of programs we are going to search over.

### (a) What are the Possible Programs?

Here's the documentation for pivot -

## Documentation for **Pivot**
<img src="https://risecamp2019-atlas.s3.us-east-2.amazonaws.com/imgs/pandas-pivot-doc.png" alt="Pivot Documentation" style="width: 80%;"/>

There is a lot of information here! But here are the key points to help us get started -

1. The `index` must be one of the columns of the input table or the value `None`.
2. The `columns` argument must only be one of the columns and NOT `None`.
3. The `values` argument must also be one of the columns or `None`. It can also be a list of columns.

The set of possible arguments to `pivot` is therefore the cross-product of all the possible values for each of the arguments (`index`, `columns` and `values`) individually. How can we express this succinctly? This is where the concept of a `generator` in Atlas/AutoPandas comes in.

### (b) Introduction to Generators

In [2]:
from atlas import generator

@generator
def gen_column_tuples(df: pd.DataFrame):
    """
    Produces all possible 2-tuples of column names
    """
    col1 = Select(df.columns)  # Select one of the columns
    col2 = Select(df.columns)  # Select one of the columns
    
    return (col1, col2)  # Construct the tuple and return

inp_df = pd.DataFrame({
    'symbol': ['AMZN', 'AMZN', 'AMZN', 'AAPL', 'AAPL', 'AAPL', 'GOOG', 'GOOG', 'GOOG'],
    'date': ['2019-03-04', '2019-03-05', '2019-03-06', 
             '2019-03-04', '2019-03-05', '2019-03-06', 
             '2019-03-04', '2019-03-05', '2019-03-06'],
    'volume': [6167358, 3681522, 3996001, 27436203, 19737419, 20810384, 1446047, 1443174, 1099289]
})

# Print all possible tuples that can be returned by gen_column_tuples
# In this case all 3x3 = 9 tuples will be returned
for result in gen_column_tuples.generate(inp_df):
    print("Tuple :", result)

Tuple : ('symbol', 'symbol')
Tuple : ('symbol', 'date')
Tuple : ('symbol', 'volume')
Tuple : ('date', 'symbol')
Tuple : ('date', 'date')
Tuple : ('date', 'volume')
Tuple : ('volume', 'symbol')
Tuple : ('volume', 'date')
Tuple : ('volume', 'volume')


The `Select` operator is provided by Atlas. Given a list of values, it returns *one* of the values as the result. The `@generator` decorator transforms any function into an Atlas generator and overloads any calls to operators like `Select`. One can call `generate(*args, **kwargs)` on such a generator to get an iterator that returns the results of all possible executions of the generator (as governed by the calls to `Select`).

By default, the iterator has **depth-first behavior**. That is, a later call to `Select` runs through all of its possibilities before an earlier call to `Select` moves on to its next value, as the order of the tuples above shows.

**NOTE** : An Atlas `generator` is different from a Python generator. Everything that can be expressed using a Python generator can be expressed as an Atlas generator. The power of Atlas generators lies in combining them with probabilistic models, as we will see later.
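To make the comparison with Python generators concrete, here is a minimal sketch of `gen_column_tuples` written as an ordinary Python generator. This is plain Python rather than Atlas code, and the name `column_tuples_plain` is hypothetical. The nested loops reproduce the depth-first order seen above: the inner choice runs through all of its options before the outer one advances.

```python
from typing import Iterator, Sequence, Tuple


def column_tuples_plain(columns: Sequence[str]) -> Iterator[Tuple[str, str]]:
    """Yield every 2-tuple of column names in depth-first order."""
    for col1 in columns:        # plays the role of the first Select
        for col2 in columns:    # plays the role of the second Select
            yield (col1, col2)


tuples = list(column_tuples_plain(["symbol", "date", "volume"]))
print(len(tuples))   # 9
print(tuples[:3])    # the first three tuples all start with 'symbol'
```

Each loop corresponds to one `Select`. What the plain version lacks is a hook through which a model could reorder the choices, which is where Atlas generators differ.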

### (c) A Generator for Pivot Arguments

At this point, you should have a fair idea of how to encode the constraints described in **2.(a)** using a generator. Let us try to write a generator for the arguments to the `pivot` function.

**Exercise:** Fill in the missing arguments to the two `Select` calls shown below.

In [3]:
@generator
def gen_pivot_args(input_df: pd.DataFrame):
    # Select one of columns
    arg_columns = Select(list(input_df.columns))
    
    # Select one of columns or None
    arg_index = Select([None] + list(input_df.columns))
    
    # Whether to use a column or a list of columns
    if Select([True, False]): # TODO : FILL IN THE ARGUMENTS
        
        # Select one of the columns or None
        arg_values = Select([None] + list(input_df.columns)) # TODO : FILL IN THE ARGUMENTS
    else:
        # Pick a Permutation/Ordered-Subset of the list of columns
        arg_values = list(OrderedSubset(list(input_df.columns)))
    
    return {'index': arg_index, 'columns': arg_columns, 'values': arg_values}

We are also ready to write the `synthesize_pivot` function.

**Exercise:** Fill in the code for the iterator as shown below.

In [4]:
def check_equal(df1, df2):
    try:
        pd.testing.assert_frame_equal(df1, df2, check_names=False)
        return True
    except AssertionError:
        # assert_frame_equal signals a mismatch by raising AssertionError
        return False
    
def synthesize_pivot(inp, out):
    for args in gen_pivot_args.generate(inp): # TODO : Fill in the expression for the iterator.
        try:
            # If there are exceptions while running pivot or checking, skip
            result = inp.pivot(**args)
            if check_equal(result, out):
                print("Found Solution!", 
                      f"inp.pivot(index='{args['index']}', columns='{args['columns']}', values='{args['values']}')")
        except Exception as e:
            continue

Let's try it out on the example we had in **Section 1**!

In [5]:
inp_df = pd.DataFrame({
    'symbol': ['AMZN', 'AMZN', 'AMZN', 'AAPL', 'AAPL', 'AAPL', 'GOOG', 'GOOG', 'GOOG'],
    'date': ['2019-03-04', '2019-03-05', '2019-03-06', 
             '2019-03-04', '2019-03-05', '2019-03-06', 
             '2019-03-04', '2019-03-05', '2019-03-06'],
    'volume': [6167358, 3681522, 3996001, 27436203, 19737419, 20810384, 1446047, 1443174, 1099289]
})

out_df = pd.DataFrame({
 '2019-03-04': {'AAPL': 27436203, 'AMZN': 6167358, 'GOOG': 1446047},
 '2019-03-05': {'AAPL': 19737419, 'AMZN': 3681522, 'GOOG': 1443174},
 '2019-03-06': {'AAPL': 20810384, 'AMZN': 3996001, 'GOOG': 1099289}
})

synthesize_pivot(inp_df, out_df)

Found Solution! inp.pivot(index='symbol', columns='date', values='volume')
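One detail of `check_equal` is worth noting: the result of `pivot` carries names on its index and columns (`symbol` and `date` here), while a dataframe built from a dictionary does not. The sketch below uses a two-row table made up for illustration, and a hypothetical helper `frames_match`, to suggest why `check_names=False` is needed for the comparison to succeed.

```python
import pandas as pd

small_inp = pd.DataFrame({
    "symbol": ["AAPL", "AMZN"],
    "date": ["2019-03-04", "2019-03-04"],
    "volume": [27436203, 6167358],
})
small_out = pd.DataFrame({"2019-03-04": {"AAPL": 27436203, "AMZN": 6167358}})

result = small_inp.pivot(index="symbol", columns="date", values="volume")


def frames_match(df1, df2, check_names):
    """Return True if assert_frame_equal passes with the given check_names."""
    try:
        pd.testing.assert_frame_equal(df1, df2, check_names=check_names)
        return True
    except AssertionError:
        return False


print(result.index.name, result.columns.name)              # names set by pivot
print(frames_match(result, small_out, check_names=False))  # names ignored
print(frames_match(result, small_out, check_names=True))   # names compared
```

With names ignored the two frames should compare equal; with names compared, the mismatch between `symbol`/`date` and the unnamed axes of `small_out` should make the check fail.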


## 3. Performance Analysis

Let us log some statistics about our `synthesize_pivot` routine. In particular, let's track the number of programs explored, time taken and number of exceptions raised (while executing the program).

In [6]:
import time

def synthesize_pivot(inp, out):
    start_time = time.time()
    num_explored = 0
    num_errors = 0
    for args in gen_pivot_args.generate(inp):
        num_explored += 1
        
        try:
            # If there are exceptions while running pivot or checking, skip
            result = inp.pivot(**args)
            if check_equal(result, out):
                print("Found Solution!", 
                      f"inp.pivot(index='{args['index']}', columns='{args['columns']}', values='{args['values']}')")
                print(f"Time to solution: {time.time() - start_time: .3f} seconds")
                print(f"Number of Programs Explored Till Now: {num_explored}")

        except Exception as e:
            num_errors += 1
            continue
            
    print(f"Total Number of Programs Explored: {num_explored}")
    print(f"Total Number of Programs Crashed: {num_errors}")

In [7]:
inp_df = pd.DataFrame({
    'symbol': ['AMZN', 'AMZN', 'AMZN', 'AAPL', 'AAPL', 'AAPL', 'GOOG', 'GOOG', 'GOOG'],
    'date': ['2019-03-04', '2019-03-05', '2019-03-06', 
             '2019-03-04', '2019-03-05', '2019-03-06', 
             '2019-03-04', '2019-03-05', '2019-03-06'],
    'volume': [6167358, 3681522, 3996001, 27436203, 19737419, 20810384, 1446047, 1443174, 1099289]
})

out_df = pd.DataFrame({
 '2019-03-04': {'AAPL': 27436203, 'AMZN': 6167358, 'GOOG': 1446047},
 '2019-03-05': {'AAPL': 19737419, 'AMZN': 3681522, 'GOOG': 1443174},
 '2019-03-06': {'AAPL': 20810384, 'AMZN': 3996001, 'GOOG': 1099289}
})

synthesize_pivot(inp_df, out_df)

Found Solution! inp.pivot(index='symbol', columns='date', values='volume')
Time to solution:  0.194 seconds
Number of Programs Explored Till Now: 99
Total Number of Programs Explored: 228
Total Number of Programs Crashed: 57


The solution was produced fairly quickly. But `57` of the `228` argument combinations returned by our generator `gen_pivot_args` errored out! This means our synthesizer is wasting time evaluating programs that do not even produce an output. Can we optimize our synthesizer to avoid wasting time on these programs?

### (a) Incorporating Domain-Specific Constraints in Generator

Can we exploit our knowledge of the `pivot` function to avoid enumerating these programs? It turns out that a meaningful `pivot` program satisfies the following constraints -

1. The `columns` and `index` arguments cannot be equal.
2. The column(s) in `values` must be different from both `columns` and `index`.

It is very easy to incorporate these constraints in the generator as it is regular Python code.

In [8]:
@generator
def gen_pivot_args(input_df: pd.DataFrame):
    # Select one of columns
    arg_columns = Select(list(input_df.columns))
    
    # Select one of columns or None
    arg_index = Select([None] + [i for i in input_df.columns if i != arg_columns]) # CONSTRAINT-1
    
    # Whether to use a column or a list of columns
    if Select([True, False]):
        # Select one of the columns or None
        arg_values = Select([None] + [i for i in input_df.columns if i not in {arg_columns, arg_index}]) # CONSTRAINT-2
    else:
        # Pick a Permutation/Ordered-Subset of the list of columns
        arg_values = list(OrderedSubset([i for i in input_df.columns if i not in {arg_columns, arg_index}])) # CONSTRAINT-2
    
    return {'index': arg_index, 'columns': arg_columns, 'values': arg_values}

In [9]:
inp_df = pd.DataFrame({
    'symbol': ['AMZN', 'AMZN', 'AMZN', 'AAPL', 'AAPL', 'AAPL', 'GOOG', 'GOOG', 'GOOG'],
    'date': ['2019-03-04', '2019-03-05', '2019-03-06', 
             '2019-03-04', '2019-03-05', '2019-03-06', 
             '2019-03-04', '2019-03-05', '2019-03-06'],
    'volume': [6167358, 3681522, 3996001, 27436203, 19737419, 20810384, 1446047, 1443174, 1099289]
})

out_df = pd.DataFrame({
 '2019-03-04': {'AAPL': 27436203, 'AMZN': 6167358, 'GOOG': 1446047},
 '2019-03-05': {'AAPL': 19737419, 'AMZN': 3681522, 'GOOG': 1443174},
 '2019-03-06': {'AAPL': 20810384, 'AMZN': 3996001, 'GOOG': 1099289}
})

synthesize_pivot(inp_df, out_df)

Found Solution! inp.pivot(index='symbol', columns='date', values='volume')
Time to solution:  0.067 seconds
Number of Programs Explored Till Now: 22
Total Number of Programs Explored: 39
Total Number of Programs Crashed: 0


Much better! We are getting to the solution much faster as we are not wasting time exploring bad programs.
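The drop from 228 to 39 programs can be reproduced without Atlas. The sketch below is plain Python with hypothetical helper names. It assumes that `OrderedSubset` yields every non-empty ordered subset of its input, an assumption chosen because it reproduces the counts reported above. It enumerates the unconstrained cross-product and then filters it with the two constraints.

```python
from itertools import permutations, product

columns = ["symbol", "date", "volume"]


def ordered_subsets(items):
    """All non-empty ordered subsets (permutations of every length)."""
    for size in range(1, len(items) + 1):
        for perm in permutations(items, size):
            yield list(perm)


def unconstrained_args(cols):
    """Cross-product of the argument choices, without any constraints."""
    value_choices = [None] + list(cols) + list(ordered_subsets(cols))
    for arg_columns, arg_index, arg_values in product(cols, [None] + list(cols), value_choices):
        yield {"index": arg_index, "columns": arg_columns, "values": arg_values}


def satisfies_constraints(args):
    """CONSTRAINT-1 and CONSTRAINT-2 from the list above."""
    if args["index"] == args["columns"]:
        return False
    used = {args["columns"], args["index"]}
    values = args["values"]
    if values is None:
        return True
    value_cols = values if isinstance(values, list) else [values]
    return not any(col in used for col in value_cols)


all_args = list(unconstrained_args(columns))
kept = [args for args in all_args if satisfies_constraints(args)]
print(len(all_args), len(kept))  # 228 39
```

Filtering after the fact still enumerates all 228 combinations. Writing the constraints into the generator, as done above, avoids producing the rejected ones in the first place.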

### (b) Handling Big DataFrames

We are still doing a brute-force search. Performance would suffer if the space of possible programs is itself large. In the case of `pivot`, what happens if our dataframe has a large number of columns? In the running example, it is more realistic for the table to have other columns as well, which may not be relevant to a particular query. For example, for stocks, it is common to have columns for opening and closing values, as well as highs and lows.

In [10]:
inp_df = pd.DataFrame({
    'symbol': ['AMZN', 'AMZN', 'AMZN', 'AAPL', 'AAPL', 'AAPL', 'GOOG', 'GOOG', 'GOOG'],
    'date': ['2019-03-01', '2019-03-04', '2019-03-05', 
             '2019-03-01', '2019-03-04', '2019-03-05', 
             '2019-03-01', '2019-03-04', '2019-03-05'],
    'open': [1655.13, 1685.0, 1702.95, 174.28, 175.69, 175.94, 1124.9, 1146.99, 1150.06], 
    'high': [1674.26, 1709.43, 1707.8, 175.15, 177.75, 176.0, 1142.97, 1158.28, 1169.61], 
    'low': [1651.0, 1674.36, 1689.01, 172.89, 173.97, 174.54, 1124.75, 1130.69, 1146.19], 
    'close': [1671.73, 1696.17, 1692.43, 174.97, 175.85, 175.53, 1140.99, 1147.8, 1162.03], 
    'volume': [4974877, 6167358, 3681522, 25886167, 27436203, 19737419, 1450316, 1446047, 1443174],
    
})

out_df = pd.DataFrame({
    '2019-03-01': {'AAPL': 25886167, 'AMZN': 4974877, 'GOOG': 1450316},
    '2019-03-04': {'AAPL': 27436203, 'AMZN': 6167358, 'GOOG': 1446047},
    '2019-03-05': {'AAPL': 19737419, 'AMZN': 3681522, 'GOOG': 1443174}
})

print("Input DataFrame")
display(inp_df)
print("Output DataFrame")
display(out_df)

Input DataFrame


| | symbol | date | open | high | low | close | volume |
|---|---|---|---|---|---|---|---|
| 0 | AMZN | 2019-03-01 | 1655.13 | 1674.26 | 1651.0 | 1671.73 | 4974877 |
| 1 | AMZN | 2019-03-04 | 1685.0 | 1709.43 | 1674.36 | 1696.17 | 6167358 |
| 2 | AMZN | 2019-03-05 | 1702.95 | 1707.8 | 1689.01 | 1692.43 | 3681522 |
| 3 | AAPL | 2019-03-01 | 174.28 | 175.15 | 172.89 | 174.97 | 25886167 |
| 4 | AAPL | 2019-03-04 | 175.69 | 177.75 | 173.97 | 175.85 | 27436203 |
| 5 | AAPL | 2019-03-05 | 175.94 | 176.0 | 174.54 | 175.53 | 19737419 |
| 6 | GOOG | 2019-03-01 | 1124.9 | 1142.97 | 1124.75 | 1140.99 | 1450316 |
| 7 | GOOG | 2019-03-04 | 1146.99 | 1158.28 | 1130.69 | 1147.8 | 1446047 |
| 8 | GOOG | 2019-03-05 | 1150.06 | 1169.61 | 1146.19 | 1162.03 | 1443174 |


Output DataFrame


| | 2019-03-01 | 2019-03-04 | 2019-03-05 |
|---|---|---|---|
| AAPL | 25886167 | 27436203 | 19737419 |
| AMZN | 4974877 | 6167358 | 3681522 |
| GOOG | 1450316 | 1446047 | 1443174 |


In [11]:
synthesize_pivot(inp_df, out_df)

Found Solution! inp.pivot(index='symbol', columns='date', values='volume')
Time to solution:  13.187 seconds
Number of Programs Explored Till Now: 5918
Total Number of Programs Explored: 27643
Total Number of Programs Crashed: 0


Whoops! That is quite slow. If we wanted to build a synthesis service where people can get their input-output queries answered in real time, this approach would definitely not work.
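To see why the wider table is so much slower, the size of the constrained search space can be worked out as a function of the number of columns. The sketch below is plain Python with hypothetical function names. It assumes, as the counts reported above suggest, that `OrderedSubset` yields every non-empty ordered subset of the columns it is given.

```python
from math import factorial


def num_ordered_subsets(n):
    """Number of non-empty ordered subsets of n items: sum of n!/(n-k)! for k = 1..n."""
    return sum(factorial(n) // factorial(n - k) for k in range(1, n + 1))


def constrained_space_size(num_columns):
    """Number of argument combinations enumerated under both constraints."""
    n = num_columns
    # index=None leaves n-1 columns for `values`; a concrete index leaves n-2.
    # Each case counts the single-value branch (None or one column) plus the list branch.
    with_none_index = (1 + (n - 1)) + num_ordered_subsets(n - 1)
    with_col_index = (1 + (n - 2)) + num_ordered_subsets(n - 2)
    # n choices for `columns`, then index=None or one of the n-1 other columns.
    return n * (with_none_index + (n - 1) * with_col_index)


for n in (3, 7, 10):
    print(n, constrained_space_size(n))
```

For 3 and 7 columns this gives 39 and 27643, matching the totals printed by `synthesize_pivot`. The factorial terms from the ordered subsets dominate, so every additional column multiplies the work rather than adding to it.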

## 4. Training a Generator

In our brute-force synthesizer, we have been using a depth-first enumeration strategy. The question we ask now is: can we devise a smarter enumeration strategy that makes our generator return the most promising arguments first?

Specifically, the enumeration order of argument combinations returned by `gen_pivot_args` is governed by the order of values returned by the individual operators (four `Select`s and one `OrderedSubset`). Therefore, to speed up our synthesizer, the operators need to return the correct values *first*.
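The effect of operator ordering on a depth-first search can be illustrated with a small plain-Python sketch. It considers only two choices, `columns` and `index`, and the scores are made-up numbers standing in for a trained model's output. The helper `tries_until_hit` is hypothetical.

```python
from itertools import product


def tries_until_hit(first_choices, second_choices, target):
    """Count combinations enumerated depth-first until `target` is produced."""
    # product() varies its last argument fastest, like the depth-first iterator.
    for tries, combo in enumerate(product(first_choices, second_choices), start=1):
        if combo == target:
            return tries
    return None


cols = ["symbol", "date", "open", "high", "low", "close", "volume"]
target = ("date", "symbol")  # (columns, index) of the desired program

# Hypothetical scores; columns without a score default to 0.
scores = {"date": 0.96, "volume": 0.03, "symbol": 0.01}
ranked = sorted(cols, key=lambda c: scores.get(c, 0.0), reverse=True)

print(tries_until_hit(cols, cols, target))    # 8: all of 'symbol' is explored first
print(tries_until_hit(ranked, cols, target))  # 1: 'date' is tried first
```

The gap widens with every additional operator, because each wrong early choice costs a full sweep of everything nested beneath it.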

#### Can we train the individual operators to return values **smartly** i.e. adjust the order in which they return values based on the input-output example?
##### For the purpose of the tutorial, let us focus on the first call to `Select` i.e. the operator that decides the value of the `columns` argument.

### (a) Adding Input-Output Example as the Context

In order to train the operator, we first need to give it access to the input-output example. Currently only `inp_df` is passed as an argument to the generator `gen_pivot_args`.

In [12]:
@generator
def gen_pivot_args(input_df: pd.DataFrame, output_df: pd.DataFrame):
    # Select one of columns
    arg_columns = Select(list(input_df.columns), context=(input_df, output_df),  # ADDING CONTEXT HERE
                         uid="select_columns") # Adding uid to aid identification while defining model
    
    # Select one of columns or None
    arg_index = Select([None] + [i for i in input_df.columns if i != arg_columns])
    
    # Whether to use a column or a list of columns
    if Select([True, False]):
        # Select one of the columns or None
        arg_values = Select([None] + [i for i in input_df.columns if i not in {arg_columns, arg_index}])
    else:
        # Pick a Permutation/Ordered-Subset of the list of columns
        arg_values = list(OrderedSubset([i for i in input_df.columns if i not in {arg_columns, arg_index}]))
    
    return {'index': arg_index, 'columns': arg_columns, 'values': arg_values}

In [13]:
def synthesize_pivot(inp, out):
    start_time = time.time()
    num_explored = 0
    num_errors = 0
    for args in gen_pivot_args.generate(inp, out):  # CHANGED : PASSING out AS AN ARGUMENT
        num_explored += 1
        
        try:
            # If there are exceptions while running pivot or checking, skip
            result = inp.pivot(**args)
            if check_equal(result, out):
                print("Found Solution!", 
                      f"inp.pivot(index='{args['index']}', columns='{args['columns']}', values='{args['values']}')")
                print(f"Time to solution: {time.time() - start_time: .3f} seconds")
                print(f"Number of Programs Tried: {num_explored}")
                
                break  # CHANGED : Stop at first solution to avoid spending time exploring other programs

        except Exception as e:
            num_errors += 1
            continue

### (b) Defining the Model

Our model needs to take the list of columns (the domain) passed to the `Select` operator, as well as the context, and return a probability distribution over the list of columns.

We are going to use a Graph-Neural-Network model for our Select operator. Covering this model is out-of-scope for this tutorial, but here are the key insights behind using this model.

1. The transformation represented by an input-output example consisting of dataframes is a function of the relationships between the values in the input and the values in the output, rather than the concrete values themselves. For example, the concrete column names `Category`, `Expense` etc. are irrelevant. It is their position in the output dataframe that really captures the transformation.

2. We can represent dataframes as a graph where each column, cell and index value is represented as a node, and edges represent relationships amongst these nodes. The nodes are labeled with the type of the value (int, str etc.) rather than the value itself. For example, each column node will have a `COLUMN` edge to all the cell nodes in the corresponding column. We will also have an `EQUALITY` edge between each pair of nodes that have the same concrete value.

Here is a graphical representation of the graph encoding for the given input and output dataframe along with the domain for the `Select` operator we're trying to learn a model for.

<img src="https://risecamp2019-atlas.s3.us-east-2.amazonaws.com/imgs/input-output-example.png" alt="Pivot Documentation" style="width: 80%;"/>
<img src="https://risecamp2019-atlas.s3.us-east-2.amazonaws.com/imgs/graph-encoding-example.png" alt="Pivot Documentation" style="width: 80%;"/>

Let's instantiate the model for the generator. For more details, refer to the `models.py` file in the root directory.

In [14]:
from atlas.models.imitation import IndependentOperatorsModel
from atlas.operators import operator
from models import PivotSelectModel

class PivotGeneratorModel(IndependentOperatorsModel):
    @operator(name='Select', uid="select_columns")
    def SelectColumns(*args, **kwargs):
        config = {
            'learning_rate': 0.01,
            'node_dimension': 50,
            'classifier_hidden_dims': [50],
            'batch_size': 30000,  # number of nodes in a batch
            'layer_timesteps': [1, 1, 1]
        }

        return PivotSelectModel(config, debug=True)

### (c) Training Data

Training data for our purposes consists of **traces** of generator executions that produce the correct program given an input-output example. These traces store the correct choices made by the operators and hence suffice as training data for the operators.

This data has been generated randomly. In particular, we collect random input dataframes, run them through our brute-force generator, randomly pick some argument combinations and execute them to get the output. This gives us an `(input, output, program)` tuple that suffices as training data. For more details, refer to `data_generation.py`.
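The recipe can be sketched in a few lines of plain Python. This is only an illustration of the idea and makes no claim about what `data_generation.py` actually does: it uses one fixed, made-up table instead of random dataframes, and `pivot_rows` is a hypothetical stand-in for `DataFrame.pivot` that works on a list of dictionaries.

```python
import random


def pivot_rows(rows, index, columns, values):
    """Tiny stand-in for pivot on a list of dict rows; raises on duplicate entries."""
    table = {}
    for row in rows:
        cell = table.setdefault(row[index], {})
        if row[columns] in cell:
            raise ValueError("duplicate (index, columns) entry, cannot reshape")
        cell[row[columns]] = row[values]
    return table


def random_examples(rows, num_examples, rng):
    """Sample argument combinations and keep those that execute without error."""
    names = list(rows[0])
    examples = []
    while len(examples) < num_examples:
        index, columns, values = rng.sample(names, 3)  # three distinct columns
        program = {"index": index, "columns": columns, "values": values}
        try:
            output = pivot_rows(rows, **program)
        except ValueError:
            continue  # discard programs that crash
        examples.append((rows, output, program))
    return examples


rows = [
    {"symbol": "AMZN", "date": "d1", "volume": 10, "exchange": "NASDAQ"},
    {"symbol": "AMZN", "date": "d2", "volume": 20, "exchange": "NASDAQ"},
    {"symbol": "AAPL", "date": "d1", "volume": 30, "exchange": "NASDAQ"},
    {"symbol": "AAPL", "date": "d2", "volume": 40, "exchange": "NASDAQ"},
]

examples = random_examples(rows, num_examples=5, rng=random.Random(0))
for _, output, program in examples:
    print(program, "->", output)
```

Because the output is computed by running the sampled program, every tuple is correct by construction, so no manual labeling is needed.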

**NOTE:** This is only for illustration purposes, so you should not worry too much about understanding the format of these traces.

In [15]:
import pickle
import random
from urllib.request import urlopen

data = pickle.load(urlopen("https://risecamp2019-atlas.s3.us-east-2.amazonaws.com/training_data_pivot_columns.pkl"))
data[100]


        GeneratorTrace(inputs=((      0           1     2       3                 4   5
0   baz       123.4   baz  Joseph      banana_foo_2  50
1   bar  23894243.7  fizz     Amy       date_bar_24  30
2  buzz       123.4  buzz    Anne  cherimoya_baz_71  20
3   foo       123.4  buzz  Nikita      apple_fizz_7  35,                             1                          2                      \
3                         Amy   Anne Joseph Nikita   Amy  Anne Joseph Nikita   
4                                                                              
apple_fizz_7              NaN    NaN    NaN  123.4   NaN   NaN    NaN   buzz   
banana_foo_2              NaN    NaN  123.4    NaN   NaN   NaN    baz    NaN   
cherimoya_baz_71          NaN  123.4    NaN    NaN   NaN  buzz    NaN    NaN   
date_bar_24       2.38942e+07    NaN    NaN    NaN  fizz   NaN    NaN    NaN   

                    5                     
3                 Amy Anne Joseph Nikita  
4                                      

In [16]:
model = PivotGeneratorModel()
train = data[:500]
valid = data[500:]
model.train(train, valid, num_epochs=30)

100%|██████████| 500/500 [00:00<00:00, 2234.17it/s]
100%|██████████| 100/100 [00:00<00:00, 2166.66it/s]


[+] Training model for OpInfo(sid='/gen_pivot_args/Select@select_columns@1', gen_name='gen_pivot_args', op_type='Select', index=1, gen_group=None, uid='select_columns', tags=None)


[Training(1/30)] Loss:  1.433152 Accuracy:  0.4720
[Validation(1/30)] Loss:  1.241686 Accuracy:  0.3800
[Training(2/30)] Loss:  1.322016 Accuracy:  0.4580
[Validation(2/30)] Loss:  1.044748 Accuracy:  0.3300
[Training(3/30)] Loss:  1.124238 Accuracy:  0.4480
[Validation(3/30)] Loss:  1.082418 Accuracy:  0.3500
[Training(4/30)] Loss:  1.123267 Accuracy:  0.4680
[Validation(4/30)] Loss:  1.071257 Accuracy:  0.3900
[Training(5/30)] Loss:  1.102100 Accuracy:  0.4720
[Validation(5/30)] Loss:  1.030015 Accuracy:  0.3900
[Training(6/30)] Loss:  1.062197 Accuracy:  0.4760
[Validation(6/30)] Loss:  1.004234 Accuracy:  0.3900
[Training(7/30)] Loss:  1.043536 Accuracy:  0.4800
[Validation(7/30)] Loss:  0.992720 Accuracy:  0.3800
[Training(8/30)] Loss:  1.030523 Accuracy:  0.4980
[Validation(8/30)] Loss:  0.992106 Accuracy:  0.4000
[Training(9/30)] Loss:  1.018177 Accuracy:  0.5020
[Validation(9/30)] Loss:  0.987970 Accuracy:  0.4100
[Training(10/30)] Loss:  1.005545 Accuracy:  0.5040
[Validation(

In [17]:
def synthesize_pivot(inp, out):
    start_time = time.time()
    num_explored = 0
    num_errors = 0
    for args in gen_pivot_args.generate(inp, out).with_model(model):  # CHANGED : Using model
        num_explored += 1
        
        try:
            # If there are exceptions while running pivot or checking, skip
            result = inp.pivot(**args)
            if check_equal(result, out):
                print("Found Solution!", 
                      f"inp.pivot(index='{args['index']}', columns='{args['columns']}', values='{args['values']}')")
                print(f"Time to solution: {time.time() - start_time: .3f} seconds")
                print(f"Number of Programs Tried: {num_explored}")
                
                break

        except Exception as e:
            num_errors += 1
            continue

In [18]:
synthesize_pivot(inp_df, out_df)

Inference for operator /gen_pivot_args/Select@select_columns@1
[('date', 0.9620377), ('volume', 0.037817474), ('symbol', 0.00012981212), ('high', 5.758068e-06), ('low', 5.758068e-06), ('open', 2.6130274e-06), ('close', 7.516266e-07)]
Found Solution! inp.pivot(index='symbol', columns='date', values='volume')
Time to solution:  4.943 seconds
Number of Programs Tried: 1969


Much better! The model for the first `Select` helped it pick the right value on the first try (and with >95% confidence), which significantly improved the search time. But why stop at using a model for only one `Select`? We can use models for all the operators!

### (d) Defining a Model for All Operators

First let's give the same context to all the operators and not just the first `Select`.

In [19]:
@generator
def gen_pivot_args(input_df: pd.DataFrame, output_df: pd.DataFrame):
    # Select one of columns
    arg_columns = Select(list(input_df.columns), context=(input_df, output_df))
    # Select one of columns or None
    arg_index = Select([None] + [i for i in list(input_df.columns) if i != arg_columns], context=(input_df, output_df))

    # Select one of columns or list of columns
    if Select([True, False], context=(input_df, output_df), uid="branch"):
        arg_values = Select([None] + [i for i in list(input_df.columns) if i != arg_columns and i != arg_index],
                            context=(input_df, output_df))
    else:
        arg_values = list(OrderedSubset([i for i in list(input_df.columns) if i != arg_columns and i != arg_index],
                                        context=(input_df, output_df)))

    return {'index': arg_index, 'columns': arg_columns, 'values': arg_values}

Let's define the model as before, covering all the operators this time.

In [20]:
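# Assumption: these model classes live in models.py alongside PivotSelectModel
from models import PivotClassifyModel, PivotOrderedSubsetModel

common_config = {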
    'learning_rate': 0.01,
    'node_dimension': 50,
    'classifier_hidden_dims': [50],
    'batch_size': 30000,
    'layer_timesteps': [1, 1, 1]
}

class PivotGeneratorModel(IndependentOperatorsModel):
    @operator
    def Select(*args, **kwargs):
        return PivotSelectModel(common_config)

    @operator(name='Select', uid="branch")
    def SelectBranch(*args, **kwargs):
        return PivotClassifyModel(common_config, domain_size=2)

    @operator
    def OrderedSubset(*args, **kwargs):
        return PivotOrderedSubsetModel(common_config)

As before, we can train the model using the following commands.

```python
model = PivotGeneratorModel()
data = pickle.load(urlopen("https://risecamp2019-atlas.s3.us-east-2.amazonaws.com/training_data_pivot_full.pkl"))
train_data = data[:500]
valid_data = data[500:]
model.train(train_data, valid_data, num_epochs=100)
```

Since training for 100 epochs would take some time, we'll load a pre-trained model (obtained through the same set of commands).

In [21]:
from atlas.models.utils import restore_model
model = restore_model("https://risecamp2019-atlas.s3.us-east-2.amazonaws.com/pandas-pivot-model-full.zip", from_url=True)

  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


INFO:tensorflow:Restoring parameters from /var/folders/s9/__w2d9dx2ljdx9qk865hh5qs559bh9/T/tmph301q2dz/models/gen_pivot_args/Select@@1/model.weights
INFO:tensorflow:Restoring parameters from /var/folders/s9/__w2d9dx2ljdx9qk865hh5qs559bh9/T/tmph301q2dz/models/gen_pivot_args/Select@@2/model.weights
INFO:tensorflow:Restoring parameters from /var/folders/s9/__w2d9dx2ljdx9qk865hh5qs559bh9/T/tmph301q2dz/models/gen_pivot_args/Select@branch@1/model.weights
INFO:tensorflow:Restoring parameters from /var/folders/s9/__w2d9dx2ljdx9qk865hh5qs559bh9/T/tmph301q2dz/models/gen_pivot_args/Select@@3/model.weights
INFO:tensorflow:Restoring parameters from /var/folders/s9/__w2d9dx2ljdx9qk865hh5qs559bh9/T/tmph301q2dz/models/gen_pivot_args/OrderedSubset@@1/model.weights


In [22]:
inp_df = pd.DataFrame({
    'symbol': ['AMZN', 'AMZN', 'AMZN', 'AAPL', 'AAPL', 'AAPL', 'GOOG', 'GOOG', 'GOOG'],
    'date': ['2019-03-01', '2019-03-04', '2019-03-05', 
             '2019-03-01', '2019-03-04', '2019-03-05', 
             '2019-03-01', '2019-03-04', '2019-03-05'],
    'open': [1655.13, 1685.0, 1702.95, 174.28, 175.69, 175.94, 1124.9, 1146.99, 1150.06], 
    'high': [1674.26, 1709.43, 1707.8, 175.15, 177.75, 176.0, 1142.97, 1158.28, 1169.61], 
    'low': [1651.0, 1674.36, 1689.01, 172.89, 173.97, 174.54, 1124.75, 1130.69, 1146.19], 
    'close': [1671.73, 1696.17, 1692.43, 174.97, 175.85, 175.53, 1140.99, 1147.8, 1162.03], 
    'volume': [4974877, 6167358, 3681522, 25886167, 27436203, 19737419, 1450316, 1446047, 1443174],
    
})

out_df = pd.DataFrame({
    '2019-03-01': {'AAPL': 25886167, 'AMZN': 4974877, 'GOOG': 1450316},
    '2019-03-04': {'AAPL': 27436203, 'AMZN': 6167358, 'GOOG': 1446047},
    '2019-03-05': {'AAPL': 19737419, 'AMZN': 3681522, 'GOOG': 1443174}
})

synthesize_pivot(inp_df, out_df)

Found Solution! inp.pivot(index='symbol', columns='date', values='volume')
Time to solution:  0.044 seconds
Number of Programs Tried: 1


In [23]:
inp_df = pd.DataFrame({
    'symbol': ['AMZN', 'AMZN', 'AMZN', 'AAPL', 'AAPL', 'AAPL', 'GOOG', 'GOOG', 'GOOG'],
    'date': ['2019-03-04', '2019-03-05', '2019-03-06', 
             '2019-03-04', '2019-03-05', '2019-03-06', 
             '2019-03-04', '2019-03-05', '2019-03-06'],
    'volume': [6167358, 3681522, 3996001, 27436203, 19737419, 20810384, 1446047, 1443174, 1099289]
})

out_df = pd.DataFrame({
 '2019-03-04': {'AAPL': 27436203, 'AMZN': 6167358, 'GOOG': 1446047},
 '2019-03-05': {'AAPL': 19737419, 'AMZN': 3681522, 'GOOG': 1443174},
 '2019-03-06': {'AAPL': 20810384, 'AMZN': 3996001, 'GOOG': 1099289}
})

synthesize_pivot(inp_df, out_df)

Found Solution! inp.pivot(index='symbol', columns='date', values='volume')
Time to solution:  0.034 seconds
Number of Programs Tried: 1


Our synthesizer gets it right on the first try on both occasions! Here are some more examples involving different dataframes.

In [24]:
inp_df = pd.DataFrame({
  'foo': ['one', 'one', 'one', 'two', 'two', 'two'],
  'bar': ['A', 'B', 'C', 'A', 'B', 'C'],
  'baz': [10, 20, 30, 40, 50, 60],
})

out_df = pd.DataFrame({
    'A': {'one': 10, 'two': 40},
    'B': {'one': 20, 'two': 50},
    'C': {'one': 30, 'two': 60}
})

print("Input DataFrame")
display(inp_df)
print("Output DataFrame")
display(out_df)

synthesize_pivot(inp_df, out_df)

Input DataFrame


|   | foo | bar | baz |
|---|-----|-----|-----|
| 0 | one | A   | 10  |
| 1 | one | B   | 20  |
| 2 | one | C   | 30  |
| 3 | two | A   | 40  |
| 4 | two | B   | 50  |
| 5 | two | C   | 60  |


Output DataFrame


|     | A  | B  | C  |
|-----|----|----|----|
| one | 10 | 20 | 30 |
| two | 40 | 50 | 60 |


Found Solution! inp.pivot(index='foo', columns='bar', values='baz')
Time to solution:  0.028 seconds
Number of Programs Tried: 1


In [25]:
inp_df = pd.DataFrame({
    'Date': {0: '2018-02-18', 1: '2018-02-18', 2: '2018-02-24', 3: '2018-02-24'},
    'Location': {0: 'Terrace', 1: 'Pox', 2: 'Gate 320', 3: 'Pox'},
    'Balance': {0: 9971.66, 1: 9726.03, 2: 9604.14, 3: 9356.04},
    'Added By': {0: 'Theresa', 1: 'Margaret', 2: 'Helena', 3:'Katherine'},
    'Expense': {0: 98.34, 1: 245.63, 2: 121.89, 3: 248.0},
    'Category': {0: 'Social', 1: 'Lunch', 2: 'Social', 3: 'Lunch'}})

out_df = pd.DataFrame({'Lunch': {'2018-02-18': 245.63, '2018-02-24': 248.0},
 'Social': {'2018-02-18': 98.34, '2018-02-24': 121.89}})

print("Input DataFrame")
display(inp_df)
print("Output DataFrame")
display(out_df)

synthesize_pivot(inp_df, out_df)

Input DataFrame


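|   | Date       | Location | Balance | Added By  | Expense | Category |
|---|------------|----------|---------|-----------|---------|----------|
| 0 | 2018-02-18 | Terrace  | 9971.66 | Theresa   | 98.34   | Social   |
| 1 | 2018-02-18 | Pox      | 9726.03 | Margaret  | 245.63  | Lunch    |
| 2 | 2018-02-24 | Gate 320 | 9604.14 | Helena    | 121.89  | Social   |
| 3 | 2018-02-24 | Pox      | 9356.04 | Katherine | 248.0   | Lunch    |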


Output DataFrame


|            | Lunch  | Social |
|------------|--------|--------|
| 2018-02-18 | 245.63 | 98.34  |
| 2018-02-24 | 248.0  | 121.89 |


Found Solution! inp.pivot(index='Date', columns='Category', values='Expense')
Time to solution:  0.029 seconds
Number of Programs Tried: 1
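
A reported solution can also be checked without the synthesizer, using plain pandas. The sketch below re-runs the `pivot` call from the `foo`/`bar`/`baz` example and compares labels and cell values against the expected output. The helper `reproduces_output` and the variable names `check_inp` and `check_out` are hypothetical and not part of Atlas; the comparison ignores the axis names that `pivot` attaches to its result.

```python
import pandas as pd


def reproduces_output(inp, out, index, columns, values):
    """Return True if inp.pivot(...) has the same labels and cells as out."""
    try:
        result = inp.pivot(index=index, columns=columns, values=values)
    except (KeyError, ValueError):
        # Unknown column names, or duplicate (index, columns) pairs.
        return False
    # Compare labels and cells only; axis names set by pivot are ignored.
    # NaN cells never compare equal here, which is fine for this example.
    return (
        list(result.index) == list(out.index)
        and list(result.columns) == list(out.columns)
        and result.to_numpy().tolist() == out.to_numpy().tolist()
    )


check_inp = pd.DataFrame({
    'foo': ['one', 'one', 'one', 'two', 'two', 'two'],
    'bar': ['A', 'B', 'C', 'A', 'B', 'C'],
    'baz': [10, 20, 30, 40, 50, 60],
})
check_out = pd.DataFrame({
    'A': {'one': 10, 'two': 40},
    'B': {'one': 20, 'two': 50},
    'C': {'one': 30, 'two': 60},
})

# The synthesized arguments reproduce the output; swapping index and columns does not.
print(reproduces_output(check_inp, check_out, 'foo', 'bar', 'baz'))
print(reproduces_output(check_inp, check_out, 'bar', 'foo', 'baz'))
```

A check of this kind is what turns a candidate program into a solution: a candidate counts only if running it on the input yields the requested output.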


## Conclusion

In this tutorial, we learned to express a search space using generators in Atlas. A generator is a regular Python function that uses special operators to capture non-deterministic decisions, such as selecting one element from a list. We built a generator for the `pivot` function in Pandas and used it to construct a synthesizer. That synthesizer works well for small inputs but slows down on large ones, because the search space becomes too large to enumerate quickly. We then used graph-based neural-network models to *guide* the generator towards the programs in the search space that are most likely to produce the output given the input. With this guidance, our learned synthesizer finds the correct program on the first try in all four examples above.
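
The effect of guidance can be illustrated in plain Python, without Atlas. The sketch below enumerates every `(index, columns, values)` triple for the expenses example and counts how many candidates are tried, first in enumeration order and then ranked by a score. The functions `search` and `overlap_score` are hypothetical, and the score is a hand-written heuristic standing in for the learned model; it is not how Atlas ranks candidates.

```python
import itertools

import pandas as pd


def search(inp, out, candidates):
    """Try candidates in order; return (args, number_tried)."""
    tried = 0
    for index, columns, values in candidates:
        tried += 1
        try:
            result = inp.pivot(index=index, columns=columns, values=values)
        except ValueError:
            continue  # e.g. duplicate (index, columns) pairs
        if (list(result.index) == list(out.index)
                and list(result.columns) == list(out.columns)
                and result.to_numpy().tolist() == out.to_numpy().tolist()):
            return (index, columns, values), tried
    return None, tried


def overlap_score(inp, out, index, columns, values):
    """Heuristic stand-in for a learned model: count shared labels and cells."""
    out_cells = set(out.to_numpy().ravel().tolist())
    return (len(set(inp[index]) & set(out.index))
            + len(set(inp[columns]) & set(out.columns))
            + len(set(inp[values]) & out_cells))


inp = pd.DataFrame({
    'Date': ['2018-02-18', '2018-02-18', '2018-02-24', '2018-02-24'],
    'Location': ['Terrace', 'Pox', 'Gate 320', 'Pox'],
    'Balance': [9971.66, 9726.03, 9604.14, 9356.04],
    'Added By': ['Theresa', 'Margaret', 'Helena', 'Katherine'],
    'Expense': [98.34, 245.63, 121.89, 248.0],
    'Category': ['Social', 'Lunch', 'Social', 'Lunch'],
})
out = pd.DataFrame({
    'Lunch': {'2018-02-18': 245.63, '2018-02-24': 248.0},
    'Social': {'2018-02-18': 98.34, '2018-02-24': 121.89},
})

# Every ordered choice of three distinct columns is a candidate program.
candidates = list(itertools.permutations(inp.columns, 3))

unguided_args, unguided_tried = search(inp, out, candidates)

# Guided search: same candidates, highest score first.
ranked = sorted(candidates,
                key=lambda c: overlap_score(inp, out, *c),
                reverse=True)
guided_args, guided_tried = search(inp, out, ranked)

print(len(candidates), unguided_tried, guided_tried)
```

Both searches return the same arguments, but the ranked one reaches them on its first candidate, while the unranked enumeration has to test several others before it. A learned model plays the role of the score here, with the advantage that it does not have to be written by hand for each function.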