## REST API for DataMart

This notebook shows how to use the REST API of the DataMart system.

For the augmentation, we use the medical malpractice example, available here: https://gitlab.datadrivendiscovery.org/d3m/datasets/tree/master/seed_datasets_data_augmentation/DA_medical_malpractice

The documentation for the REST API is available here: https://docs.auctus.vida-nyu.org/rest/

In [1]:
from io import BytesIO
import json
import os
import pandas as pd
from pprint import pprint
from pathlib import Path
import requests
import zipfile

In [2]:
def print_results(results):
    if not results:
        return
    for result in results:
        print(result['metadata']['name'])
        print('Score: ', result['score'])
        if 'augmentation' in result:
            aug_type = result['augmentation']['type']
            print('Augmentation: %s' % aug_type)
            print("Left Columns: %s" %
                  str(result['augmentation']['left_columns_names']))
            print("Right Columns: %s" %
                  str(result['augmentation']['right_columns_names']))
            
        print("-------------------")

We start by loading the medical malpractice data, which is the dataset we want to augment.

In [6]:
# Change this path to point to your local copy of the dataset
medical_malpractice_dir = os.path.expanduser('~/projects/d3m/datasets/seed_datasets_data_augmentation/DA_medical_malpractice/DA_medical_malpractice_dataset')
medical_malpractice_file = os.path.join(medical_malpractice_dir, 'tables', 'learningData.csv')
medical_malpractice_table = pd.read_csv(medical_malpractice_file)

In [7]:
medical_malpractice_table.head()

Unnamed: 0,d3mIndex,SEQNO,LICNFELD,ORIGYEAR,WORKSTAT,ALGNNATR,ALEGATN1,PTTYPE,PRACTAGE,PFIDX
0,404537,514456,10,2004,AZ,20,306,I,30,32.737
1,404538,514457,10,2004,PA,1,200,B,50,42.09
2,404540,514460,651,2004,SD,100,316,O,50,58.926
3,404554,514475,430,2004,NJ,60,334,O,20,77.633
4,404556,514477,30,2004,NH,60,306,O,50,1.871


In [22]:
print('Number of records: %d' % medical_malpractice_table.shape[0])

Number of records: 169686


### Searching for Datasets

Let's use DataMart to search for potential datasets for augmentation.

In [9]:
URL = 'https://auctus.vida-nyu.org/api/v1'

In [10]:
url = URL + '/search'
query = {
    'keywords': ['practitioner', 'clinical', 'malpractice', 'practitioner data bank',
                 'government', 'healthcare', 'Department of health and human services']
}

# http://docs.python-requests.org/en/latest/user/quickstart/#post-a-multipart-encoded-file
with open(medical_malpractice_file, 'rb') as data:
    response = requests.post(
        url,
        files={
            'data': data,
            'query': ('query.json', json.dumps(query), 'application/json'),
        }
    )
if response.status_code == 400:
    try:
        print('Error: %s' % response.json()['error'])
    except (KeyError, ValueError):
        pass
response.raise_for_status()
query_results = response.json()['results']
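The same status check follows every request in this notebook. As a sketch, it could be factored into a small helper. The name `check_response` is made up for illustration and is not part of `requests` or of the DataMart API; it only relies on the `status_code`, `json()` and `raise_for_status()` members already used in the cell above.

```python
def check_response(response):
    """Print the API error message on HTTP 400, then raise on any HTTP error.

    `response` is expected to behave like a `requests.Response`.
    Hypothetical helper, not part of the DataMart API.
    """
    if response.status_code == 400:
        try:
            print('Error: %s' % response.json()['error'])
        except (KeyError, ValueError):
            # Body was not valid JSON or had no 'error' key; nothing to print.
            pass
    response.raise_for_status()
    return response
```

With this helper, the end of the search cell could read `query_results = check_response(response).json()['results']`.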

In [10]:
print_results(query_results)

NPDB1807
Score:  50.0
Augmentation: join
Left Columns: [['ORIGYEAR']]
Right Columns: [['ORIGYEAR']]
-------------------
NPDB1807
Score:  41.71331
Augmentation: union
Left Columns: [['SEQNO'], ['LICNFELD'], ['ORIGYEAR'], ['ALGNNATR'], ['ALEGATN1'], ['PRACTAGE']]
Right Columns: [['SEQNO'], ['LICNFELD'], ['ORIGYEAR'], ['ALGNNATR'], ['ALEGATN1'], ['PRACTAGE']]
-------------------
NPDB1807
Score:  50.0
Augmentation: join
Left Columns: [['ALGNNATR']]
Right Columns: [['ALGNNATR']]
-------------------
NPDB1807
Score:  44.667477
Augmentation: join
Left Columns: [['SEQNO']]
Right Columns: [['SEQNO']]
-------------------
NPDB1807
Score:  41.181362
Augmentation: join
Left Columns: [['ALEGATN1']]
Right Columns: [['ALEGATN1']]
-------------------
NPDB1807
Score:  39.95984
Augmentation: join
Left Columns: [['LICNFELD']]
Right Columns: [['LICNFELD']]
-------------------
NPDB1807
Score:  30.434784
Augmentation: join
Left Columns: [['PRACTAGE']]
Right Columns: [['PRACTAGE']]
-------------------
NPDB1807
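The listing above is not ordered by score, so a compact, sorted view can make the candidates easier to compare. The sketch below assumes only the keys that `print_results` already reads (`score` and `augmentation`); `summarize_results` is a hypothetical helper, and `sample_results` is a trimmed copy of three of the entries printed above, not a real API response.

```python
def summarize_results(results):
    """Return (score, type, left columns) tuples, highest score first."""
    rows = []
    for result in results:
        aug = result.get('augmentation')
        if aug is None:
            # Results without augmentation information are skipped.
            continue
        rows.append((result['score'], aug['type'], aug['left_columns_names']))
    return sorted(rows, key=lambda row: row[0], reverse=True)


# Trimmed, hand-copied sample; real results carry more keys (e.g. 'id', 'metadata').
sample_results = [
    {'score': 44.667477,
     'augmentation': {'type': 'join', 'left_columns_names': [['SEQNO']]}},
    {'score': 50.0,
     'augmentation': {'type': 'join', 'left_columns_names': [['ORIGYEAR']]}},
    {'score': 30.434784,
     'augmentation': {'type': 'join', 'left_columns_names': [['PRACTAGE']]}},
]

for score, aug_type, columns in summarize_results(sample_results):
    print('%-6s %-14s %s' % (aug_type, columns, score))
```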

### Downloading a Dataset

Let's materialize the first search result, `NPDB1807`.

In [30]:
url = URL + '/download'
id_ = query_results[0]['id']
params = {'format': 'd3m'} # returns a .zip file with the data and its corresponding datasetDoc

response = requests.get(url + '/%s' % id_, params=params)
if response.status_code == 400:
    try:
        print('Error: %s' % response.json()['error'])
    except (KeyError, ValueError):
        pass
response.raise_for_status()

# The context manager closes the archive even if reading fails
with zipfile.ZipFile(BytesIO(response.content), 'r') as zip_:
    learning_data = pd.read_csv(zip_.open('tables/learningData.csv'))
    dataset_doc = json.load(zip_.open('datasetDoc.json'))
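Before loading a large table with pandas, it can be useful to peek at what the archive contains. The following is a minimal sketch using only the standard library; `inspect_d3m_archive` is a hypothetical helper, and the in-memory archive built at the end is a made-up stand-in for `response.content`, assuming the same two member names as the cell above.

```python
import csv
import io
import json
import zipfile


def inspect_d3m_archive(content):
    """Return (member names, CSV header, datasetDoc) for a zip given as bytes."""
    with zipfile.ZipFile(io.BytesIO(content)) as archive:
        names = archive.namelist()
        # Read only the first line of the table instead of the whole file.
        with archive.open('tables/learningData.csv') as raw:
            first_line = raw.readline().decode('utf-8')
        header = next(csv.reader([first_line]))
        with archive.open('datasetDoc.json') as raw:
            dataset_doc = json.load(raw)
    return names, header, dataset_doc


# Made-up archive standing in for response.content.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as demo:
    demo.writestr('tables/learningData.csv', 'SEQNO,ORIGYEAR\n1,1991\n')
    demo.writestr('datasetDoc.json', json.dumps({'about': {'datasetName': 'demo'}}))

names, header, dataset_doc = inspect_d3m_archive(buffer.getvalue())
print(names)
print(header)
```

In the notebook itself, the call would be `inspect_d3m_archive(response.content)`.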

In [31]:
learning_data.head()

Unnamed: 0,SEQNO,RECTYPE,REPTYPE,ORIGYEAR,WORKSTAT,WORKCTRY,HOMESTAT,HOMECTRY,LICNSTAT,LICNFELD,...,ACCRRPTS,NPMALRPT,NPLICRPT,NPCLPRPT,NPPSMRPT,NPDEARPT,NPEXCRPT,NPGARPT,NPCTMRPT,FUNDPYMT
0,1,A,301,1991,OK,,,,OK,10,...,0,0,2,0,0,0,0,0,0,
1,2,A,301,1991,OK,,,,OK,10,...,0,0,7,0,0,0,1,0,0,
2,4,A,301,1991,MA,,,,MA,15,...,0,1,1,0,0,0,2,0,0,
3,6,A,301,1990,OK,,,,OK,10,...,0,0,2,0,0,0,0,0,0,
4,8,A,301,1990,OK,,,,OK,10,...,0,0,7,0,1,0,0,0,0,


In [32]:
print('Number of records: %d' % learning_data.shape[0])

Number of records: 1406584


### Augmenting a Dataset

If we augment with `NPDB1807` through a join on `SEQNO` (the fourth search result above), we get our expected augmented dataset.

In [27]:
url = URL + '/augment'
task = query_results[3]  # 4th query result

# http://docs.python-requests.org/en/latest/user/quickstart/#post-a-multipart-encoded-file
with open(medical_malpractice_file, 'rb') as data:
    response = requests.post(
        url,
        files={
            'data': data,
            'task': ('task.json', json.dumps(task), 'application/json'),
        },
        stream=True,
    )
if response.status_code == 400:
    try:
        print('Error: %s' % response.json()['error'])
    except (KeyError, ValueError):
        pass
response.raise_for_status()
# The context manager closes the archive even if reading fails
with zipfile.ZipFile(BytesIO(response.content), 'r') as zip_:
    learning_data = pd.read_csv(zip_.open('tables/learningData.csv'))
    dataset_doc = json.load(zip_.open('datasetDoc.json'))
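Selecting the task with `query_results[3]` depends on the order in which the results come back. One possible alternative, sketched below, is to look the task up by augmentation type and column names. `find_augmentation` is a hypothetical helper, and `sample_results` is a trimmed, hand-written stand-in for `query_results` that keeps only the keys the helper reads.

```python
def find_augmentation(results, aug_type, left_columns):
    """Return the first result with the given augmentation type and left columns.

    `left_columns` uses the same nested-list form as the search output,
    e.g. [['SEQNO']]. Returns None when nothing matches.
    """
    for result in results:
        aug = result.get('augmentation', {})
        if (aug.get('type') == aug_type
                and aug.get('left_columns_names') == left_columns):
            return result
    return None


# Hand-written stand-in for query_results.
sample_results = [
    {'score': 50.0,
     'augmentation': {'type': 'join', 'left_columns_names': [['ORIGYEAR']]}},
    {'score': 44.667477,
     'augmentation': {'type': 'join', 'left_columns_names': [['SEQNO']]}},
]

task = find_augmentation(sample_results, 'join', [['SEQNO']])
print(task['score'])
```

In the notebook, `task = find_augmentation(query_results, 'join', [['SEQNO']])` would replace the positional lookup; checking for `None` before posting the task avoids sending an empty request.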

In [28]:
learning_data.head()

Unnamed: 0,d3mIndex,SEQNO,LICNFELD,ORIGYEAR,WORKSTAT,ALGNNATR,ALEGATN1,PTTYPE,PRACTAGE,PFIDX,...,ACCRRPTS,NPMALRPT,NPLICRPT,NPCLPRPT,NPPSMRPT,NPDEARPT,NPEXCRPT,NPGARPT,NPCTMRPT,FUNDPYMT
0,404537,514456,10,2004,AZ,20,306,I,30,32.737,...,0,2,2,0,0,0,0,0,0,0.0
1,404538,514457,10,2004,PA,1,200,B,50,42.09,...,0,5,0,0,0,0,0,0,0,0.0
2,404540,514460,651,2004,SD,100,316,O,50,58.926,...,0,1,0,0,0,0,0,0,0,0.0
3,404554,514475,430,2004,NJ,60,334,O,20,77.633,...,0,1,0,0,0,0,0,0,0,0.0
4,404556,514477,30,2004,NH,60,306,O,50,1.871,...,0,1,0,0,0,0,0,0,0,0.0


In [29]:
print('Number of records: %d' % learning_data.shape[0])

Number of records: 169686
