# Dataset-JSON Workshop Notebook

This notebook was developed as part of the PHUSE Dataset-JSON Workshop. It is experimental and meant as an exercise to explore working with Dataset-JSON in Python.

This notebook demonstrates reading and writing Dataset-JSON files and validating them against the Dataset-JSON schema. It creates a simple pandas dataframe from a Dataset-JSON file. Other conversion applications, including the notebook created for the initial Dataset-JSON Hackathon, provide more detailed dataframe conversion (e.g., they include data type conversions).

### Import the Python libraries

In [1]:
import pandas as pd
import json
from jsonschema import validate
from jsonschema.exceptions import ValidationError
import requests

### Get Dataset-JSON example file from GitHub

Retrieve an example Dataset-JSON file from the CDISC DataExchange-DatasetJson repository and write it to the local `data` directory. The directory must already exist, because `open` does not create missing directories.

In [2]:
data_file = 'data/vs.json'
try:
    r = requests.get('https://github.com/cdisc-org/DataExchange-DatasetJson/blob/master/examples/sdtm/vs.json?raw=True')
    r.raise_for_status()
except requests.exceptions.HTTPError as err:
    raise SystemExit(err)
with open(data_file, 'w') as f:
    json.dump(r.json(), f)

### Get Dataset-JSON schema file from GitHub

Retrieve the Dataset-JSON schema file from the CDISC DataExchange-DatasetJson repository and write it to the local `data` directory.

In [3]:
schema_file = 'data/dataset-json-schema.json'
try:
    r = requests.get('https://github.com/cdisc-org/DataExchange-DatasetJson/blob/master/schema/dataset.schema.json?raw=True')
    r.raise_for_status()
except requests.exceptions.HTTPError as err:
    raise SystemExit(err)
with open(schema_file, 'w') as f:
    json.dump(r.json(), f)

### Load and Validate Dataset-JSON File
For smaller datasets, simply load the data using the `json` module. This reads the entire file into memory, so it may not work for very large datasets.

In [4]:
with open(data_file, 'r') as f:
    data = json.load(f)
with open(schema_file, 'r') as f:
    dsj_schema = json.load(f)

try:
    validate(instance=data, schema=dsj_schema)
except ValidationError as e:
    print(f"Validation error in {data_file}: {e}")
else:
    print(f"{data_file} is valid")

data/vs.json is valid


### Load and Validate Invalid Dataset-JSON File
This version of the VS Dataset-JSON dataset is missing the required "datasetJSONVersion" attribute, which makes it invalid. Validating it against the schema produces a validation error.

In [8]:
invalid_data_file = 'data/vs-invalid.json'
with open(invalid_data_file, 'r') as f:
    invalid_dataset = json.load(f)

try:
    validate(instance=invalid_dataset, schema=dsj_schema)
except ValidationError as e:
    print(f"Validation error in {invalid_data_file}: {e.message}")
else:
    print(f"{invalid_data_file} is valid")

Validation error in data/vs-invalid.json: 'datasetJSONVersion' is a required property
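
`validate` raises on the first problem it reports, so a file with several defects shows only one message. As a minimal sketch of listing every problem at once, the snippet below uses `Draft7Validator.iter_errors` from `jsonschema`. The toy schema and instance are made up for illustration and are not the Dataset-JSON schema, and the choice of `Draft7Validator` is an assumption; the validator class should match the draft that the real schema declares.

```python
from jsonschema import Draft7Validator

# Hypothetical toy schema and instance, used only to show the technique.
toy_schema = {
    "type": "object",
    "required": ["datasetJSONVersion", "clinicalData"],
    "properties": {"clinicalData": {"type": "object"}},
}
toy_instance = {"clinicalData": []}  # missing version, wrong type for clinicalData

validator = Draft7Validator(toy_schema)

# iter_errors yields every violation instead of stopping at the first one.
problems = [
    ("/".join(str(part) for part in error.path) or "<root>", error.message)
    for error in validator.iter_errors(toy_instance)
]

for location, message in problems:
    print(f"{location}: {message}")
```

Each entry pairs the location of the offending value with the same `message` text that the cell above prints for a single error.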


### Show Dataset Metadata
Show the name and label for the dataset, as well as all the variable names that will be used as column headings. Also, collect the data types from the Dataset-JSON file so that they can be used to set the data types in the pandas dataframe.

In [10]:
dataset_attrs = list(data["clinicalData"]["itemGroupData"].values())[0]
print(f"Name: {dataset_attrs['name']} ({dataset_attrs['label']})", end='\n\n')
variables = [var['name'] for var in dataset_attrs['items']]
print(f"Variables: {', '.join(variables)}")
data_types = [var['type'] for var in dataset_attrs['items']]

Name: VS (Vital Signs)

Variables: ITEMGROUPDATASEQ, STUDYID, DOMAIN, USUBJID, VSSEQ, VSTESTCD, VSTEST, VSPOS, VSORRES, VSORRESU, VSSTRESC, VSSTRESN, VSSTRESU, VSSTAT, VSLOC, VSLOBXFL, VSREPNUM, VISITNUM, VISIT, EPOCH, VSDTC, VSDY
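
The lookup above relies on the nesting found in this notebook's example file: `clinicalData`, then `itemGroupData`, then one entry per dataset holding `name`, `label`, `items`, and `itemData`. As a minimal sketch of that shape, the snippet below builds a tiny hand-made document (the key `IG.VS`, the values, and the helper name `rows_as_records` are made up for illustration, and the document is not meant to pass schema validation) and pairs each row in `itemData` with the variable names from `items`.

```python
# Hypothetical miniature document mirroring the structure read above.
sample = {
    "clinicalData": {
        "itemGroupData": {
            "IG.VS": {
                "name": "VS",
                "label": "Vital Signs",
                "items": [
                    {"name": "USUBJID", "type": "string"},
                    {"name": "VSTESTCD", "type": "string"},
                    {"name": "VSSTRESN", "type": "float"},
                ],
                "itemData": [
                    ["CDISC001", "DIABP", 71.0],
                    ["CDISC001", "DIABP", 83.0],
                ],
            }
        }
    }
}


def rows_as_records(document):
    """Return (dataset name, list of row dicts) for the first dataset found."""
    item_groups = document["clinicalData"]["itemGroupData"]
    dataset = next(iter(item_groups.values()))
    names = [item["name"] for item in dataset["items"]]
    # Each row is a positional list; zip lines it up with the variable names.
    records = [dict(zip(names, row)) for row in dataset["itemData"]]
    return dataset["name"], records


sample_name, sample_records = rows_as_records(sample)
print(sample_name, sample_records[0])
```

Because the rows are positional lists, the order of `items` is what gives each value its meaning, which is why the variable names are extracted in order before the dataframe is built.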


### Create a dataframe from the Dataset-JSON file
Create a dataframe from the Dataset-JSON file, print the first 5 rows, and report the memory usage of the dataframe. The dataframe is then saved as a CSV file.

In [11]:
df = pd.DataFrame(dataset_attrs['itemData'], columns=variables)
print(df.head(5), end='\n\n')
print(f"\ndataframe memory usage: {df.memory_usage().sum()} bytes")

   ITEMGROUPDATASEQ       STUDYID DOMAIN   USUBJID  VSSEQ VSTESTCD  \
0                 1  CDISCPILOT01     VS  CDISC001      1    DIABP   
1                 2  CDISCPILOT01     VS  CDISC001      2    DIABP   
2                 3  CDISCPILOT01     VS  CDISC001      3    DIABP   
3                 4  CDISCPILOT01     VS  CDISC001      4    DIABP   
4                 5  CDISCPILOT01     VS  CDISC001      5    DIABP   

                     VSTEST     VSPOS VSORRES VSORRESU  ... VSSTRESU  VSSTAT  \
0  Diastolic Blood Pressure  STANDING      71     mmHg  ...     mmHg           
1  Diastolic Blood Pressure  STANDING      71     mmHg  ...     mmHg           
2  Diastolic Blood Pressure  STANDING      83     mmHg  ...     mmHg           
3  Diastolic Blood Pressure  STANDING      79     mmHg  ...     mmHg           
4  Diastolic Blood Pressure  STANDING      68     mmHg  ...     mmHg           

  VSLOC VSLOBXFL VSREPNUM VISITNUM        VISIT      EPOCH       VSDTC VSDY  
0                   

Write the newly created dataset as a CSV file.

In [12]:
df.to_csv(f"data/{dataset_attrs['name']}.csv", index=False, encoding='utf-8')
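
The `data_types` list gathered in the metadata step is not applied when the dataframe is built, so pandas infers each column's type. As a rough sketch of how the declared types could be used, the snippet below maps Dataset-JSON type names to pandas dtype names. The mapping `PANDAS_DTYPES`, the helper `build_dtype_map`, and the sample inputs are assumptions for illustration; the set of type names should be checked against the schema version in use.

```python
# Assumed mapping from Dataset-JSON item types to pandas dtype names.
# The nullable dtypes ("Int64", "boolean", "string") tolerate missing values.
PANDAS_DTYPES = {
    "string": "string",
    "integer": "Int64",
    "float": "float64",
    "double": "float64",
    "decimal": "float64",  # lossy; keep as string if exact decimals matter
    "boolean": "boolean",
}


def build_dtype_map(names, types):
    """Pair column names with pandas dtype names, skipping unknown types."""
    return {
        name: PANDAS_DTYPES[type_]
        for name, type_ in zip(names, types)
        if type_ in PANDAS_DTYPES
    }


example_map = build_dtype_map(
    ["USUBJID", "VSSEQ", "VSSTRESN", "VSDTC"],
    ["string", "integer", "float", "datetime"],  # "datetime" is deliberately unmapped
)
print(example_map)
```

With the notebook's own lists, `df.astype(build_dtype_map(variables, data_types))` would apply the mapping, and columns whose type is not in the mapping would keep their inferred dtype. Numeric columns that hold empty strings instead of nulls may need those values converted to missing values first.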