# Trane - A quick DEMO

Trane is a software package that automatically generates prediction problems and their labels for supervised learning. This tutorial walks through the Trane workflow.

### Get Example Dataset
[Download a synthetic taxi dataset here](https://s3.amazonaws.com/hdi-demos/trane-demo/taxi_data.zip). Unzip the file to get the `taxi_data` folder, which contains the raw data `synthetic_taxi_data.csv` and the table metadata `taxi_meta.json`. Put the `taxi_data` folder in the Trane directory, or set the correct paths in the cell below.
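The unzip step can also be scripted. The following is a minimal sketch using only the Python standard library; the helper names `extract_dataset` and `missing_files` are hypothetical (they are not part of Trane), and it assumes the archive has already been downloaded and saved locally, for example as `taxi_data.zip`.

```python
import zipfile
from pathlib import Path


def extract_dataset(zip_path, dest_dir="."):
    """Extract a zip archive into dest_dir and return the extracted file names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
        # Skip directory entries, which end with "/".
        return sorted(name for name in archive.namelist() if not name.endswith("/"))


def missing_files(data_dir, expected=("synthetic_taxi_data.csv", "taxi_meta.json")):
    """Return the expected file names that are not present in data_dir."""
    return [name for name in expected if not (Path(data_dir) / name).is_file()]
```

Calling `extract_dataset("taxi_data.zip")` followed by `missing_files("taxi_data")` should return an empty list if both files named above are in place.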

### Generate Prediction Problems
We first import Trane and the other packages we need, then set the data paths and other parameters.

In [1]:
import trane
import json

multiple_csv = ["taxi_data/synthetic_taxi_data.csv"] # paths to the csv tables (one or more).
table_meta_json = "taxi_data/taxi_meta.json"         # path to the table metadata.

entity_id_column = 'taxi_id'        # Trane will generate a label for each entity in entity_id_column.
label_generating_column = 'fare'    # Trane will use the data in label_generating_column to generate labels.
time_column = 'trip_id'             # time_column is used for cutoff times.

We load table metadata, then create a PredictionProblemGenerator.

In [3]:
with open(table_meta_json) as f:  # close the file once the metadata is parsed
    table_meta = trane.TableMeta(json.load(f))
generator = trane.PredictionProblemGenerator(table_meta, entity_id_column, label_generating_column, time_column)


We use the generator to generate 3 prediction problems. 

In [4]:
probs = []
for idx, prob in enumerate(generator.generate()):
    probs.append(prob)
    if idx + 1 == 3:
        break
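Taking the first few items from a generator, as the loop above does, can also be written with `itertools.islice`. The sketch below uses a hypothetical stand-in generator (`fake_problems`) in place of `generator.generate()` so that it runs without Trane; `first_n` is an illustrative helper, not a Trane function.

```python
from itertools import islice


def first_n(iterable, n):
    """Return the first n items of an iterable as a list."""
    return list(islice(iterable, n))


def fake_problems():
    """Hypothetical stand-in for generator.generate(): yields placeholder problems."""
    index = 0
    while True:
        yield "problem_{}".format(index)
        index += 1


probs = first_n(fake_problems(), 3)
print(probs)
```

With Trane available, the equivalent call would presumably be `first_n(generator.generate(), 3)`.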

We save the prediction problems to `prediction_problems.json`.

In [5]:
prediction_problems_json = trane.prediction_problems_to_json(
    probs, table_meta, entity_id_column, label_generating_column, time_column)
with open("prediction_problems.json", "w") as f:
    json.dump(json.loads(prediction_problems_json), f, indent=4, separators=(',', ': '))

### Check Prediction Problems and Tune Hyperparameters
Now we should check the saved prediction problems and set thresholds in the `param_values` field of the operations that need them.

Here is a truncated view of the saved file.
```
{
    "entity_id_column": "taxi_id",
    "time_column": "trip_id",
    "table_meta": ...,
    "prediction_problems": [
        {
            "operations": [
                {
                    "SubopType": "AllFilterOp",
                    "OpType": "FilterOpBase",
                    "param_values": {},
                    "column_name": "duration",
                    "iotype": [
                        "value",
                        "value"
                    ]
                },
                {
                    "SubopType": "IdentityRowOp",
                    "OpType": "RowOpBase",
                    "param_values": {},
                    "column_name": "fare",
                    "iotype": [
                        "value",
                        "value"
                    ]
                },
                {
                    "SubopType": "IdentityTransformationOp",
                    "OpType": "TransformationOpBase",
                    "param_values": {},
                    "column_name": "fare",
                    "iotype": [
                        "value",
                        "value"
                    ]
                },
                {
                    "SubopType": "FirstAggregationOp",
                    "OpType": "AggregationOpBase",
                    "param_values": {},
                    "column_name": "fare",
                    "iotype": [
                        "value",
                        "value"
                    ]
                }
            ]
        }, ...
    ],
    "label_generating_column": "fare"
}

```
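To review the operations without reading the whole file by hand, one option is to flatten them into a short summary. The sketch below runs on a trimmed sample that mirrors the structure shown above (it is not real Trane output), and `summarize_operations` is a hypothetical helper, not part of Trane.

```python
import json

# Trimmed sample mirroring the structure shown above; not real Trane output.
sample = json.loads("""
{
    "prediction_problems": [
        {
            "operations": [
                {"SubopType": "AllFilterOp", "OpType": "FilterOpBase",
                 "param_values": {}, "column_name": "duration"},
                {"SubopType": "FirstAggregationOp", "OpType": "AggregationOpBase",
                 "param_values": {}, "column_name": "fare"}
            ]
        }
    ]
}
""")


def summarize_operations(document):
    """Return (problem index, SubopType, column_name, param_values) per operation."""
    rows = []
    for index, problem in enumerate(document["prediction_problems"]):
        for op in problem["operations"]:
            rows.append((index, op["SubopType"], op["column_name"], op["param_values"]))
    return rows


for row in summarize_operations(sample):
    print(row)
```

To apply it to the saved file, load `prediction_problems.json` with `json.load` and pass the resulting dictionary to `summarize_operations`. The keys expected inside `param_values` are not shown in this tutorial, so the sketch only reports them.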

### Load Problems and Generate Labels
We load the csv files and denormalize them into a single pandas DataFrame, then group the rows by entity id.
We show the first 5 records of entity 0 (taxi 0).

In [6]:
denormalized_dataframe = trane.csv_to_df(multiple_csv)
entity_to_data_dict = trane.df_group_by_entity_id(denormalized_dataframe, entity_id_column)
entity_to_data_dict[0].head(5)

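| Unnamed: 0 | vendor_id | taxi_id | trip_id | distance | duration | fare | num_passengers |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 0 | 0 | 4.97 | 16.53 | 46.8 | 3 |
| 1 | 0 | 0 | 1 | 6.0 | 16.82 | 49.6 | 4 |
| 2 | 0 | 0 | 2 | 0.68 | 11.7 | 27.87 | 1 |
| 3 | 0 | 0 | 3 | 7.75 | 11.69 | 43.12 | 1 |
| 4 | 0 | 0 | 4 | 6.05 | 13.32 | 42.71 | 4 |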
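For readers who want to see what grouping by entity id amounts to, here is a plain-Python analogue using only the standard library. It is not Trane's implementation (which works on pandas DataFrames); `group_rows` is a hypothetical helper, and the input is the five rows shown above, so only one entity appears.

```python
import csv
import io
from collections import defaultdict

# The five rows displayed above, copied as CSV text.
SAMPLE_CSV = """Unnamed: 0,vendor_id,taxi_id,trip_id,distance,duration,fare,num_passengers
0,0,0,0,4.97,16.53,46.8,3
1,0,0,1,6.0,16.82,49.6,4
2,0,0,2,0.68,11.7,27.87,1
3,0,0,3,7.75,11.69,43.12,1
4,0,0,4,6.05,13.32,42.71,4
"""


def group_rows(rows, key):
    """Group row dicts by the value of one column, keeping the original row order."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    return dict(groups)


# csv.DictReader yields every value as a string, so the group keys are strings too.
rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
by_taxi = group_rows(rows, "taxi_id")
for taxi_id, taxi_rows in by_taxi.items():
    print(taxi_id, len(taxi_rows))
```

The result maps each entity id to the list of its records, which is the same shape of lookup that `entity_to_data_dict[0]` provides in the cell above.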


We apply a cutoff strategy. Here we simply use a fixed cutoff time: the cutoff time for all entities is 0.

In [6]:
entity_to_data_and_cutoff_dict = trane.FixedCutoffTimes().generate_cutoffs(entity_to_data_dict)
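As a rough illustration of the idea, the sketch below pairs every entity's records with one fixed cutoff and then splits the records at that cutoff. It rests on two assumptions that this tutorial does not confirm: that the result maps each entity id to a `(data, cutoff)` pair (inferred from the variable name above), and that records at the cutoff fall on the "before" side. Both helper functions are hypothetical and are not Trane's API.

```python
# trip_id and fare values taken from the five rows displayed earlier for taxi 0.
taxi_0 = [
    {"trip_id": 0, "fare": 46.8},
    {"trip_id": 1, "fare": 49.6},
    {"trip_id": 2, "fare": 27.87},
    {"trip_id": 3, "fare": 43.12},
    {"trip_id": 4, "fare": 42.71},
]


def attach_fixed_cutoff(entity_to_data, cutoff=0):
    """Pair every entity's data with the same cutoff value (assumed layout)."""
    return {entity: (data, cutoff) for entity, data in entity_to_data.items()}


def split_at_cutoff(rows, time_key, cutoff):
    """Split rows into those at or before the cutoff and those after it.

    Treating the boundary as inclusive on the "before" side is an assumption.
    """
    before = [row for row in rows if row[time_key] <= cutoff]
    after = [row for row in rows if row[time_key] > cutoff]
    return before, after


with_cutoffs = attach_fixed_cutoff({0: taxi_0}, cutoff=0)
data, cutoff = with_cutoffs[0]
before, after = split_at_cutoff(data, "trip_id", cutoff)
print(len(before), len(after))
```

A fixed cutoff is the simplest strategy because every entity shares the same boundary; a per-entity strategy would only need to change how the cutoff value is chosen for each key.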

Create a labeler and generate labels. 

In [7]:
labeler = trane.Labeler()
output = labeler.execute(entity_to_data_and_cutoff_dict, "prediction_problems.json")
output

{0: ([49.600000000000001, 49.600000000000001, 49.600000000000001], 0),
 1: ([20.449999999999999, 20.449999999999999, 20.449999999999999], 0),
 2: ([61.600000000000001, 61.600000000000001, 61.600000000000001], 0),
 3: ([58.520000000000003, 58.520000000000003, 58.520000000000003], 0),
 4: ([42.100000000000001, 42.100000000000001, 42.100000000000001], 0),
 5: ([58.659999999999997, 58.659999999999997, 58.659999999999997], 0),
 6: ([34.5, 34.5, 34.5], 0),
 7: ([34.049999999999997, 34.049999999999997, 34.049999999999997], 0),
 8: ([54.299999999999997, 54.299999999999997, 54.299999999999997], 0),
 9: ([44.549999999999997, 44.549999999999997, 44.549999999999997], 0),
 10: ([62.880000000000003, 62.880000000000003, 62.880000000000003], 0),
 11: ([30.829999999999998, 30.829999999999998, 30.829999999999998], 0),
 12: ([29.699999999999999, 29.699999999999999, 29.699999999999999], 0),
 13: ([50.560000000000002, 50.560000000000002, 50.560000000000002], 0),
 14: ([43.729999999999997, 43.72999999999999