# Python API for NYU's DataMart

This notebook shows how to use the Python API for NYU's DataMart system, which implements the common DataMart interface (https://gitlab.com/datadrivendiscovery/datamart-api/blob/master/datamart.py). To install it, run `pip install datamart_nyu`.

In [10]:
from d3m import container
import datamart
import datamart_rest
import datetime
from pathlib import Path
import requests
import json
from urllib.error import HTTPError

In [2]:
def print_results(results):
    """Print the score, name, and join columns (or id) of each search result."""
    if not results:
        return
    for result in results:
        metadata = result.get_json_metadata()
        print(result.score())
        print(metadata['metadata']['name'])
        if result.get_augment_hint():
            # The result comes with a join suggestion: show the column pairs.
            augmentation = metadata['augmentation']
            print("Left Columns: %s" % str(augmentation['left_columns_names']))
            print("Right Columns: %s" % str(augmentation['right_columns_names']))
        else:
            print(result.id())
        print("-------------------")

Loading our supplied data, here the `LL0_1008_analcatdata_reviewer` dataset.

In [3]:
ds_n = "LL0_1008_analcatdata_reviewer"
target_dataset = Path(f"datasets-master/training_datasets/LL0/{ds_n}",
         f"{ds_n}_dataset/")
target_dataset_metadata = target_dataset/Path("datasetDoc.json")
target_dataset_learning_data = target_dataset/Path("tables/learningData.csv")
target_dataset_learning_data.exists()

True

In [4]:
full_container = container.Dataset.load(target_dataset_learning_data.absolute().as_uri()) 

In [5]:
full_container

Dataset(id='file:///home/soda/rcappuzz/work/study-d3m/datasets-master/training_datasets/LL0/LL0_1008_analcatdata_reviewer/LL0_1008_analcatdata_reviewer_dataset/tables/learningData.csv', name='learningData.csv', location_uris='('file:///home/soda/rcappuzz/work/study-d3m/datasets-master/training_datasets/LL0/LL0_1008_analcatdata_reviewer/LL0_1008_analcatdata_reviewer_dataset/tables/learningData.csv',)')

## Searching for Datasets

Let's first instantiate our client:

In [6]:
client = datamart_rest.RESTDatamart('https://auctus.vida-nyu.org/api/v1')

### Search using data

In [12]:
cursor = client.search_with_data(query={}, supplied_data=full_container)

In [15]:
try:
    cursor = client.search_with_data(query={}, supplied_data=full_container)
    results = cursor.get_next_page()
except Exception:
    # Flag the failure, then re-raise so the full traceback is still shown.
    print("Server error")
    raise

Error from DataMart: 500 Internal Server Error


AttributeError: 'HTTPError' object has no attribute 'code'
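
One plausible reading of the `AttributeError` above is that an error handler read `.code` from an exception that does not carry it. `urllib.error.HTTPError` stores the status in `.code`, whereas `requests.exceptions.HTTPError` exposes it as `.response.status_code`. The sketch below shows a small helper that reads the status from either flavor. The names `http_status` and `FakeRequestsHTTPError` are hypothetical and not part of `datamart_rest`; the fake class only mimics the shape of the `requests` exception so that the example runs without network access. Which flavor the client actually raises is an assumption.

```python
from types import SimpleNamespace
from urllib.error import HTTPError as UrllibHTTPError


def http_status(exc):
    """Return the HTTP status code carried by an exception, or None."""
    # urllib.error.HTTPError stores the status directly in `.code`.
    code = getattr(exc, "code", None)
    if isinstance(code, int):
        return code
    # requests.exceptions.HTTPError keeps the response object in `.response`.
    response = getattr(exc, "response", None)
    return getattr(response, "status_code", None)


class FakeRequestsHTTPError(Exception):
    """Hypothetical stand-in with the same shape as requests' HTTPError."""

    def __init__(self, message, status_code):
        super().__init__(message)
        self.response = SimpleNamespace(status_code=status_code)


urllib_style = UrllibHTTPError("https://example.invalid", 500,
                               "Internal Server Error", None, None)
requests_style = FakeRequestsHTTPError("500 Internal Server Error", 500)

print(http_status(urllib_style))
print(http_status(requests_style))
print(http_status(ValueError("not an HTTP error")))
```

Using `getattr` with a default keeps the handler from raising a second exception while reporting the first one.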

In [46]:
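URL = 'https://auctus.vida-nyu.org/api/v1'
url = URL + '/search'
query = {}  # empty query: no additional search constraints

# Upload the supplied CSV and the JSON query in one multipart POST request.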
with open(target_dataset_learning_data, 'rb') as data:
    response = requests.post(
        url,
        files={
            'data': data,
            'query': ('query.json', json.dumps(query), 'application/json'),
        }
    )
if response.status_code == 400:
    try:
        print('Error: %s' % response.json()['error'])
    except (KeyError, ValueError):
        pass
response.raise_for_status()
query_results = response.json()['results'] 

In [36]:
response

<Response [200]>

In [45]:
len(results)

20

In [37]:
print_results(results)

1.0
OATH Hearings Division Case Status
Left Columns: [['year_zone']]
Right Columns: [['Charge #4: Code']]
-------------------
1.0
OATH Hearings Division Case Status
Left Columns: [['year_zone']]
Right Columns: [['Charge #5: Code']]
-------------------
1.0
OATH Hearings Division Case Status
Left Columns: [['year_zone']]
Right Columns: [['Charge #7: Code']]
-------------------
1.0
OATH Hearings Division Case Status
Left Columns: [['year_zone']]
Right Columns: [['Charge #9: Code']]
-------------------
1.0
Multi-Modal Intelligent Traffic Signal Systems Basic Safety Message
Left Columns: [['year_zone']]
Right Columns: [['MsgCnt']]
-------------------
1.0
Parcels
Left Columns: [['year_zone']]
Right Columns: [['Situs Unit No']]
-------------------
1.0
Arrests
Left Columns: [['zone']]
Right Columns: [['CHARGE 4 TYPE']]
-------------------
1.0
Arrests
Left Columns: [['zone']]
Right Columns: [['CHARGE 3 TYPE']]
-------------------
1.0
Arrests
Left Columns: [['zone']]
Right Columns: [['CHARGE 2 T

In [53]:
{result["id"] for result in response.json()["results"]}

{'datamart.socrata.data-austintexas-gov.m4cc-q8pr',
 'datamart.socrata.data-austintexas-gov.vv43-e55n',
 'datamart.socrata.data-bayareametro-gov.67rs-kbwq',
 'datamart.socrata.data-bayareametro-gov.iqfe-j3rr',
 'datamart.socrata.data-cityofchicago-org.dpt3-jri9',
 'datamart.socrata.data-cityofnewyork-us.jz4z-kudi',
 'datamart.socrata.data-cityofnewyork-us.m8p6-tp4b',
 'datamart.socrata.data-cityofnewyork-us.nu7n-tubp',
 'datamart.socrata.data-cityofnewyork-us.nyis-y4yr',
 'datamart.socrata.data-lacounty-gov.n54c-jkaq',
 'datamart.socrata.data-michigan-gov.abaf-2i39',
 'datamart.socrata.data-novascotia-ca.2ga3-gg5k',
 'datamart.socrata.data-novascotia-ca.36ek-n7n8',
 'datamart.socrata.data-novascotia-ca.8524-ec3n',
 'datamart.socrata.data-novascotia-ca.czww-f8n7',
 'datamart.socrata.data-novascotia-ca.k29k-n2db',
 'datamart.socrata.data-novascotia-ca.wu5w-qxki',
 'datamart.socrata.data-novascotia-ca.xxcy-v3fh',
 'datamart.socrata.data-ny-gov.q4hy-kbtf',
 'datamart.socrata.data-ny-gov.u6
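
The printed results list the same dataset several times, once per candidate join column, which is why collecting the ids into a set is useful. As a rough sketch, the suggestions can also be grouped per dataset. The function name `summarize_results` is hypothetical, and the sample entries are placeholders shaped like the metadata read in `print_results`; they are not real search results, and the exact layout of the raw JSON is an assumption.

```python
from collections import defaultdict


def summarize_results(results):
    """Group join suggestions by dataset id.

    Returns {dataset_id: [(left_columns, right_columns), ...]}.
    """
    summary = defaultdict(list)
    for result in results:
        # Results without a join suggestion get a pair of empty lists.
        augmentation = result.get("augmentation") or {}
        pair = (
            augmentation.get("left_columns_names", []),
            augmentation.get("right_columns_names", []),
        )
        summary[result["id"]].append(pair)
    return dict(summary)


# Placeholder entries (assumed layout); ids and column names are made up.
sample_results = [
    {"id": "example.dataset-a", "score": 1.0,
     "augmentation": {"left_columns_names": [["zone"]],
                      "right_columns_names": [["col_x"]]}},
    {"id": "example.dataset-a", "score": 1.0,
     "augmentation": {"left_columns_names": [["zone"]],
                      "right_columns_names": [["col_y"]]}},
    {"id": "example.dataset-b", "score": 0.5},
]

summary = summarize_results(sample_results)
for dataset_id, pairs in summary.items():
    print(dataset_id, len(pairs), "suggestion(s)")
```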

## Downloading a dataset

Now let's materialize one of the weather datasets, so that the user can inspect the data before augmenting it or perform the augmentation manually.

In [None]:
ny_weather_data = results[0].download(supplied_data=None)

In [None]:
ny_weather_data['learningData'].head()

Unnamed: 0,DATE,HOURLYSKYCONDITIONS,HOURLYDRYBULBTEMPC,HOURLYRelativeHumidity,HOURLYWindSpeed,HOURLYWindDirection,HOURLYStationPressure
0,2016-01-01 01:00:00,OVC:08 38,6.1,58.0,17,300,30.03
1,2016-01-01 02:00:00,OVC:08 38,6.1,56.0,16,320,30.03
2,2016-01-01 03:00:00,OVC:08 38,5.6,55.0,13,340,30.03
3,2016-01-01 04:00:00,OVC:08 36,5.6,55.0,13,300,30.03
4,2016-01-01 05:00:00,FEW:02 34 OVC:08 45,5.0,60.0,13,270,30.01


You can also give a dataset as input so that DataMart can try to return a dataset that joins well with it. Only portions of the DataMart dataset that join with the input data will be returned.

In [None]:
ny_weather_data = results[0].download(supplied_data=ny_taxi_demand)

In [None]:
ny_weather_data['learningData'].head()

Unnamed: 0,DATE,HOURLYSKYCONDITIONS,HOURLYDRYBULBTEMPC,HOURLYRelativeHumidity,HOURLYWindSpeed,HOURLYWindDirection,HOURLYStationPressure
0,2018-04-19 22:00:00,FEW:02 42,5.0,53.0,16.0,310,29.97
1,2018-06-30 20:00:00,SCT:04 250,30.6,43.0,5.0,180,29.97
2,2018-06-02 10:00:00,FEW:02 40 FEW:02 150 SCT:04 200,28.3,61.0,6.0,70,29.7
3,2018-04-17 13:00:00,BKN:07 46 BKN:07 85,8.3,44.0,17.0,260,29.6
4,2018-01-04 01:00:00,OVC:08 32,-1.7,45.0,8.0,20,29.91
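
The effect of passing `supplied_data` to `download` can be pictured as a semi-join: only candidate rows whose join key also occurs in the supplied data are kept. The sketch below illustrates that idea in plain Python. The helper `rows_that_join` is hypothetical, and how DataMart performs the matching on the server is not shown in this notebook; the toy rows reuse values from the tables above.

```python
def rows_that_join(candidate_rows, supplied_rows, candidate_key, supplied_key):
    """Keep only candidate rows whose key value also occurs in the supplied rows."""
    supplied_keys = {row[supplied_key] for row in supplied_rows}
    return [row for row in candidate_rows if row[candidate_key] in supplied_keys]


# Supplied (taxi) rows, keyed by pickup hour.
supplied = [
    {"tpep_pickup_datetime": "2018-04-19 22:00:00", "num_pickups": 731},
    {"tpep_pickup_datetime": "2018-06-30 20:00:00", "num_pickups": 183},
]
# Candidate (weather) rows; the 2016 row has no matching pickup hour.
weather = [
    {"DATE": "2016-01-01 01:00:00", "HOURLYDRYBULBTEMPC": 6.1},
    {"DATE": "2018-04-19 22:00:00", "HOURLYDRYBULBTEMPC": 5.0},
    {"DATE": "2018-06-30 20:00:00", "HOURLYDRYBULBTEMPC": 30.6},
]

matched = rows_that_join(weather, supplied, "DATE", "tpep_pickup_datetime")
for row in matched:
    print(row["DATE"], row["HOURLYDRYBULBTEMPC"])
```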


## Augmenting a dataset

Let's augment the supplied data with the first query result.

In [None]:
join_ = results[0].augment(supplied_data=ny_taxi_demand)

In [None]:
join_['learningData'].head()

Unnamed: 0,d3mIndex,tpep_pickup_datetime,num_pickups,HOURLYSKYCONDITIONS,HOURLYDRYBULBTEMPC,HOURLYRelativeHumidity,HOURLYWindSpeed,HOURLYWindDirection,HOURLYStationPressure
0,0,2018-04-19 22:00:00,731,FEW:02 42,5.0,53.0,16.0,310,29.97
1,1,2018-06-30 20:00:00,183,SCT:04 250,30.6,43.0,5.0,180,29.97
2,2,2018-06-02 10:00:00,384,FEW:02 40 FEW:02 150 SCT:04 200,28.3,61.0,6.0,70,29.7
3,3,2018-04-17 13:00:00,648,BKN:07 46 BKN:07 85,8.3,44.0,17.0,260,29.6
4,4,2018-01-04 01:00:00,3,OVC:08 32,-1.7,45.0,8.0,20,29.91


We can also choose which columns from the DataMart dataset (i.e., the weather data) to include in the augmentation. In the example below, column indices 3 and 5 select `HOURLYRelativeHumidity` and `HOURLYWindDirection`.

In [None]:
join_ = results[0].augment(
    supplied_data=ny_taxi_demand,
    augment_columns=[datamart.DatasetColumn('0', 3), datamart.DatasetColumn('0', 5)]
)

In [None]:
join_['learningData'].head()

Unnamed: 0,d3mIndex,tpep_pickup_datetime,num_pickups,HOURLYRelativeHumidity,HOURLYWindDirection
0,0,2018-04-19 22:00:00,731,53.0,310
1,1,2018-06-30 20:00:00,183,43.0,180
2,2,2018-06-02 10:00:00,384,61.0,70
3,3,2018-04-17 13:00:00,648,44.0,260
4,4,2018-01-04 01:00:00,3,45.0,20
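
To make the column selection concrete, here is a minimal sketch of an augmentation as a left join that keeps only the chosen candidate columns. It assumes left-join semantics and zero-based column indices that do not count the `Unnamed: 0` index column; both are inferred from the outputs above rather than confirmed by the API. The function `augment_rows` is hypothetical and is not how `augment` is implemented.

```python
def augment_rows(supplied_rows, supplied_key, candidate_header, candidate_rows,
                 candidate_key_index, keep_indices):
    """Left-join the selected candidate columns onto the supplied rows."""
    # Index candidate rows by their join key for constant-time lookups.
    lookup = {row[candidate_key_index]: row for row in candidate_rows}
    augmented = []
    for row in supplied_rows:
        match = lookup.get(row[supplied_key])
        # Unmatched supplied rows are kept, with None in the added columns.
        extra = {
            candidate_header[i]: (match[i] if match is not None else None)
            for i in keep_indices
        }
        augmented.append({**row, **extra})
    return augmented


header = ["DATE", "HOURLYSKYCONDITIONS", "HOURLYDRYBULBTEMPC",
          "HOURLYRelativeHumidity", "HOURLYWindSpeed",
          "HOURLYWindDirection", "HOURLYStationPressure"]
weather = [
    ["2018-04-19 22:00:00", "FEW:02 42", 5.0, 53.0, 16.0, 310, 29.97],
    ["2018-06-30 20:00:00", "SCT:04 250", 30.6, 43.0, 5.0, 180, 29.97],
]
taxi = [
    {"d3mIndex": 0, "tpep_pickup_datetime": "2018-04-19 22:00:00", "num_pickups": 731},
    {"d3mIndex": 1, "tpep_pickup_datetime": "2018-06-30 20:00:00", "num_pickups": 183},
]

joined = augment_rows(taxi, "tpep_pickup_datetime", header, weather,
                      candidate_key_index=0, keep_indices=[3, 5])
for row in joined:
    print(row)
```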
