## REST API for DataMart

This notebook shows how to use the DataMart REST API to search for datasets and to augment a supplied dataset with one of the results.

For the augmentation, we use the FIFA 2018 Man of Match data, available here: https://gitlab.datadrivendiscovery.org/d3m/datasets/tree/master/seed_datasets_data_augmentation/DA_fifa2018_manofmatch

The documentation for the REST API is available here: https://docs.auctus.vida-nyu.org/rest/

In [1]:
from io import BytesIO
import json
import os
import pandas as pd
from pprint import pprint
from pathlib import Path
import requests
import zipfile

In [2]:
def print_results(results):
    if not results:
        return
    for result in results:
        print(result['metadata']['name'])
        print('Score: ', result['score'])
        if 'augmentation' in result:
            aug_type = result['augmentation']['type']
            print('Augmentation: %s' % aug_type)
            print("Left Columns: %s" %
                  str(result['augmentation']['left_columns_names']))
            print("Right Columns: %s" %
                  str(result['augmentation']['right_columns_names']))
            
        print("-------------------")
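
To make the shape of a search result concrete, here is a hand-written stand-in for one entry of `response.json()['results']`. It contains only the keys that `print_results` reads; real responses may carry more fields. The values are copied from the GameID query output shown later in this notebook, and `summarize_result` is a hypothetical helper, not part of DataMart.

```python
# Hand-written stand-in for one search result. Only the keys read by
# print_results are included; real responses may contain more.
sample_result = {
    'metadata': {'name': 'FIFA 2018 game statistics data'},
    'score': 0.98275864,
    'augmentation': {
        'type': 'join',
        'left_columns_names': [['GameID']],
        'right_columns_names': [['GameID']],
    },
}


def summarize_result(result):
    """Return a one-line summary of a search result (hypothetical helper)."""
    summary = '%s (score %.2f)' % (result['metadata']['name'], result['score'])
    aug = result.get('augmentation')
    if aug:
        # Each entry on the left is paired with the entry at the same
        # position on the right; an entry is itself a list of column names.
        pairs = zip(aug['left_columns_names'], aug['right_columns_names'])
        joined = ', '.join(
            '%s -> %s' % ('+'.join(left), '+'.join(right))
            for left, right in pairs
        )
        summary += ' [%s on %s]' % (aug['type'], joined)
    return summary


print(summarize_result(sample_result))
```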

We start by loading the supplied data.

In [3]:
# Change this path to match the location of the dataset on your machine
fifa_manofmatch_dir = os.path.expanduser('~/projects/d3m/datasets/seed_datasets_data_augmentation/DA_fifa2018_manofmatch/DA_fifa2018_manofmatch_dataset')
fifa_manofmatch_file = os.path.join(fifa_manofmatch_dir, 'tables', 'learningData.csv')
fifa_manofmatch_table = pd.read_csv(fifa_manofmatch_file)

In [4]:
fifa_manofmatch_table.head()

Unnamed: 0,d3mIndex,GameID,Date,Team,Opponent,Ball Possession %,Off-Target,Blocked,Offsides,Saves,Pass Accuracy %,Passes,Distance Covered (Kms),Yellow & Red,Man of the Match,1st Goal,Round,PSO,Goals in PSO,Own goals
0,0,55,23-06-2018,Mexico,Korea Republic,59,6,2,0,5,89,485,97,0,1,26.0,Group Stage,No,0,
1,1,40,21-06-2018,Denmark,Australia,49,5,0,1,4,88,458,112,0,1,7.0,Group Stage,No,0,
2,2,19,17-06-2018,Mexico,Germany,40,6,2,2,9,82,281,106,0,0,35.0,Group Stage,No,0,
3,3,31,19-06-2018,Senegal,Poland,43,4,2,3,3,81,328,107,0,1,60.0,Group Stage,No,0,
4,4,98,30-06-2018,Uruguay,Portugal,39,2,1,0,4,69,269,106,0,1,7.0,Round of 16,No,0,


### Searching for Datasets

Let's use DataMart to search for datasets that can be used to augment the supplied one.

In [5]:
URL = 'https://auctus.vida-nyu.org/api/v1'

In [6]:
url = URL + '/search'

# http://docs.python-requests.org/en/latest/user/quickstart/#post-a-multipart-encoded-file
with open(fifa_manofmatch_file, 'rb') as data:
    response = requests.post(
        url,
        files={
            'data': data,
        }
    )
if response.status_code == 400:
    try:
        print('Error: %s' % response.json()['error'])
    except (KeyError, ValueError):
        pass
response.raise_for_status()
query_results = response.json()['results']

In [7]:
print_results(query_results)

Water Consumption And Cost (2013 - March 2019)
Score:  1.0
Augmentation: join
Left Columns: [['Date']]
Right Columns: [['Revenue Month']]
-------------------
Board of Standards and Appeals (BSA) Applications Status
Score:  0.1
Augmentation: union
Left Columns: [['Date'], ['Blocked']]
Right Columns: [['Date'], ['Block']]
-------------------
Street Construction Permits
Score:  1.0
Augmentation: join
Left Columns: [['Date']]
Right Columns: [['ModifiedOn']]
-------------------
FIFA 2018 game statistics data
Score:  0.08125
Augmentation: union
Left Columns: [['GameID'], ['Off-Target']]
Right Columns: [['GameID'], ['On-Target']]
-------------------
Housing New York Units by Building
Score:  1.0
Augmentation: join
Left Columns: [['Date']]
Right Columns: [['Project Start Date']]
-------------------
Housing Maintenance Code Complaints
Score:  1.0
Augmentation: join
Left Columns: [['Date']]
Right Columns: [['ReceivedDate']]
-------------------
Capital Projects
Score:  1.0
Augmentation: join
Left
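
Many of the results above are joins on `Date` with unrelated datasets, so it can help to narrow the list on the client side before choosing one. The sketch below is a hypothetical helper that relies only on the keys `print_results` already reads; the demo entries are hand-written and mirror two of the results printed above.

```python
def filter_results(results, aug_type=None, left_column=None, min_score=0.0):
    """Keep results matching an augmentation type, left column, and score.

    Hypothetical helper; it is not part of the DataMart API.
    """
    kept = []
    for result in results or []:
        if result['score'] < min_score:
            continue
        aug = result.get('augmentation')
        if aug_type is not None and (not aug or aug['type'] != aug_type):
            continue
        if left_column is not None:
            # left_columns_names is a list of column groups, e.g. [['Date']].
            groups = aug['left_columns_names'] if aug else []
            if not any(left_column in group for group in groups):
                continue
        kept.append(result)
    return kept


# Hand-written entries mirroring two of the results printed above.
demo = [
    {'metadata': {'name': 'Street Construction Permits'},
     'score': 1.0,
     'augmentation': {'type': 'join',
                      'left_columns_names': [['Date']],
                      'right_columns_names': [['ModifiedOn']]}},
    {'metadata': {'name': 'FIFA 2018 game statistics data'},
     'score': 0.08125,
     'augmentation': {'type': 'union',
                      'left_columns_names': [['GameID'], ['Off-Target']],
                      'right_columns_names': [['GameID'], ['On-Target']]}},
]

for result in filter_results(demo, left_column='GameID'):
    print(result['metadata']['name'])
```

In the notebook, the same call could be applied to `query_results`, for example `filter_results(query_results, aug_type='join', min_score=0.5)`.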

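It is also possible to specify which columns of the supplied data will be used for augmentation by sending a JSON query alongside the file. Here we restrict the search to the `GameID` column.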

In [8]:
url = URL + '/search'
query = {
    'variables': [
        {
            'type': 'tabular_variable',
            'columns': [1],  # GameID
            'relationship': 'contains'
        }
    ]
}

# http://docs.python-requests.org/en/latest/user/quickstart/#post-a-multipart-encoded-file
with open(fifa_manofmatch_file, 'rb') as data:
    response = requests.post(
        url,
        files={
            'data': data,
            'query': ('query.json', json.dumps(query), 'application/json'),
        }
    )
if response.status_code == 400:
    try:
        print('Error: %s' % response.json()['error'])
    except (KeyError, ValueError):
        pass
response.raise_for_status()
gameID_results = response.json()['results']

In [9]:
print_results(gameID_results)

FIFA 2018 game statistics data
Score:  0.98275864
Augmentation: join
Left Columns: [['GameID']]
Right Columns: [['GameID']]
-------------------
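
The query above refers to the column by its index in the CSV file, which is easy to get wrong when typed by hand. As a minimal sketch, the index can be looked up from the header row instead. Both functions below are hypothetical helpers; the query layout is copied from the cell above, and `'contains'` is the only `relationship` value used in this notebook.

```python
import csv
import io
import json


def column_index(csv_text, name):
    """Return the 0-based position of a column in a CSV header row."""
    header = next(csv.reader(io.StringIO(csv_text)))
    return header.index(name)


def build_column_query(column_indices):
    """Build a query restricting the search to the given column indices.

    Mirrors the query dictionary used in the cell above.
    """
    return {
        'variables': [
            {
                'type': 'tabular_variable',
                'columns': list(column_indices),
                'relationship': 'contains',
            }
        ]
    }


# Shortened header of learningData.csv, used here for illustration only.
header = 'd3mIndex,GameID,Date,Team,Opponent\n'
query = build_column_query([column_index(header, 'GameID')])
print(json.dumps(query, indent=2))
```

The resulting dictionary can be passed to `json.dumps` and sent as the `query` part of the multipart request, as in the cell above.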


### Augmenting a Dataset

Let's now augment the supplied data using the first result of the previous query.

In [10]:
url = URL + '/augment'
task = gameID_results[0]

# http://docs.python-requests.org/en/latest/user/quickstart/#post-a-multipart-encoded-file
with open(fifa_manofmatch_file, 'rb') as data:
    response = requests.post(
        url,
        files={
            'data': data,
            'task': ('task.json', json.dumps(task), 'application/json'),
        },
        stream=True,
    )
if response.status_code == 400:
    try:
        print('Error: %s' % response.json()['error'])
    except (KeyError, ValueError):
        pass
response.raise_for_status()
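# The response body is a ZIP archive; the context managers close it and
# its members even if reading fails.
with zipfile.ZipFile(BytesIO(response.content), 'r') as zip_:
    with zip_.open('tables/learningData.csv') as table:
        learning_data = pd.read_csv(table)
    with zip_.open('datasetDoc.json') as doc:
        dataset_doc = json.load(doc)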
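
The unpacking step can be tried without contacting the server. The sketch below builds a tiny ZIP archive in memory with the same two member names that the cell above reads (`tables/learningData.csv` and `datasetDoc.json`) and reads it back using only the standard library. `read_augmented_zip` is a hypothetical helper, and the archive contents are a made-up miniature, not a real response.

```python
import csv
import io
import json
import zipfile


def read_augmented_zip(payload):
    """Return (rows, dataset_doc) from the bytes of an augmentation ZIP."""
    with zipfile.ZipFile(io.BytesIO(payload)) as archive:
        with archive.open('tables/learningData.csv') as table:
            # ZIP members are opened in binary mode; wrap for the csv module.
            text = io.TextIOWrapper(table, encoding='utf-8', newline='')
            rows = list(csv.DictReader(text))
        with archive.open('datasetDoc.json') as doc:
            dataset_doc = json.load(doc)
    return rows, dataset_doc


# Build a miniature archive in memory to stand in for response.content.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as archive:
    archive.writestr('tables/learningData.csv',
                     'd3mIndex,GameID\n0,55\n1,40\n')
    archive.writestr('datasetDoc.json',
                     json.dumps({'about': {'datasetSchemaVersion': '3.2.0'}}))

rows, doc = read_augmented_zip(buffer.getvalue())
print(len(rows), rows[0]['GameID'], doc['about']['datasetSchemaVersion'])
```

Unlike `pd.read_csv`, `csv.DictReader` returns every value as a string, so this variant is only meant for inspecting the archive layout.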

In [11]:
learning_data.head()

Unnamed: 0,d3mIndex,GameID,Date,Team,Opponent,Ball Possession %,Off-Target,Blocked,Offsides,Saves,...,Own goals,Goal Scored,Attempts,On-Target,Corners,Free Kicks,Fouls Committed,Yellow Card,Red,Own goal Time
0,0,55,2018-06-23,Mexico,Korea Republic,59,6,2,0,5,...,,2,13,5,5,24,7,0,0,
1,1,40,2018-06-21,Denmark,Australia,49,5,0,1,4,...,,1,10,5,3,5,7,2,0,
2,2,19,2018-06-17,Mexico,Germany,40,6,2,2,9,...,,1,12,4,1,11,15,2,0,
3,3,31,2018-06-19,Senegal,Poland,43,4,2,3,3,...,,2,8,2,3,11,15,2,0,
4,4,98,2018-06-30,Uruguay,Portugal,39,2,1,0,4,...,,2,6,3,2,14,13,0,0,


And we have our augmented data! Its corresponding datasetDoc JSON object is presented below.

However, note that this datasetDoc JSON **does not** preserve the information from the supplied data's datasetDoc JSON. You need to use the Python DataMart API for that: https://gitlab.com/datadrivendiscovery/datamart-api/blob/master/datamart.py

In [12]:
pprint(dataset_doc, indent=2)

{ 'about': { 'approximateSize': '59300 B',
             'datasetID': 'e8a407b75568422eb33bb0338f1b989b',
             'datasetName': 'e8a407b75568422eb33bb0338f1b989b',
             'datasetSchemaVersion': '3.2.0',
             'datasetVersion': '0.0',
             'license': 'unknown',
             'redacted': False},
  'dataResources': [ { 'columns': [ { 'colIndex': 0,
                                      'colName': 'd3mIndex',
                                      'colType': 'integer',
                                      'role': ['index']},
                                    { 'colIndex': 1,
                                      'colName': 'GameID',
                                      'colType': 'integer',
                                      'role': ['attribute']},
                                    { 'colIndex': 2,
                                      'colName': 'Date',
                                      'colType': 'dateTime',
                                      'rol