# NIH SPARC Python Client Tutorial

Welcome to the NIH SPARC Python client tutorial.

In this document you will learn the most commonly used features of the **sparc.client** library, a Python client designed to interact with SPARC.

# Installation

The easiest way to obtain the Python SPARC client library (sparc.client) is to install the latest available version from PyPI:

In [5]:
!pip install sparc.client



# Configuration

The **sparc.client** library supports some basic operations without authentication, such as browsing and querying publicly available datasets, listing records and files, and downloading data.

More advanced operations, such as managing datasets or uploading files, require creating a configuration file and installing the Pennsieve Agent. For details, please follow the Pennsieve Agent tutorial: [Uploading files to SPARC Portal](https://docs.pennsieve.io/docs/uploading-files-programmatically).

**sparc.client** uses an INI file to store configuration variables.

The basic structure of the config.ini file is as follows:

```
[global]
default_profile=name

[name]
...
variable_name=value
```

Each of the modules provided by **sparc.client** may require its own set of variables in the INI file:



- Pennsieve: requires the _pennsieve_profile_name_ variable, which should match the name of the profile used by the Pennsieve Agent. For more details, please refer to: https://docs.pennsieve.io/docs/uploading-files-programmatically
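
As a minimal sketch, a configuration file with the structure described above could be generated with Python's standard `configparser` module. The helper name `write_sparc_config`, the profile name `my_profile`, and the output path are hypothetical placeholders; the value of `pennsieve_profile_name` should be whatever profile your Pennsieve Agent actually uses.

```python
import configparser
from pathlib import Path


def write_sparc_config(path, profile="my_profile", pennsieve_profile="my_profile"):
    """Write a minimal config.ini with a [global] section and one profile section.

    The profile names are placeholders, not values required by sparc.client.
    """
    config = configparser.ConfigParser()
    config["global"] = {"default_profile": profile}
    config[profile] = {"pennsieve_profile_name": pennsieve_profile}

    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w") as handle:
        # Write "key=value" without spaces, matching the layout shown above.
        config.write(handle, space_around_delimiters=False)
    return path
```

The resulting file can then be passed to `SparcClient` through the `config_file` argument.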


To initialize the library, we need to import the SparcClient class and point it to the configuration file:

In [9]:
from sparc.client import SparcClient
client = SparcClient(connect=False, config_file='../config/config.ini')

#In order to connect to all the services: 
#client = SparcClient(config_file='../config/config.ini')
#client.connect()  #connect to all services

Alternatively, each of the sparc.client modules can be used independently to interact with a specific service.

# Modules


sparc.client has a modular structure. Modules can be loaded either automatically (without the 'connect' flag), or manually.

In the following example we load the Pennsieve2 module and connect to the Pennsieve Agent running in the background (see https://docs.pennsieve.io/docs/uploading-files-programmatically):



In [11]:
#Connect to a specific module - REQUIRES PENNSIEVE AGENT RUNNING
#pennsieve_module = client.pennsieve.connect()
#pennsieve_module.user.whoami() #execute internal functions of the module

Modules can also be loaded from other locations by providing a dictionary with the configuration and the path to the module.


In [3]:
#modules could also be added later by passing a config with env variables and path to the module  
client.add_module(config={'pennsieve_profile_name' : 'ci'}, 
                  paths = 'sparc.client.services.pennsieve', 
                  connect=False)

## Pennsieve Module API 

The Pennsieve module allows users to interact with the Pennsieve platform.

Without connecting to the agent, the user can query the Discover service of the Pennsieve platform for datasets, specific files, and records, and can download publicly available files or datasets.



### Accessing data

Pennsieve allows unauthenticated users to list the datasets, records, and files of publicly available datasets, as well as to download them to a local machine.

#### Listing datasets

Datasets that match a specific term, e.g. a person's last name or a keyword, can be listed as follows:

In [9]:
response=client.pennsieve.list_datasets(query='cancer', limit=2)
response

{'limit': 2,
 'offset': 0,
 'totalCount': 2,
 'datasets': [{'id': 93,
   'sourceDatasetId': 1,
   'name': 'LLC and MM DiFC CTC detections 2020_11_04',
   'description': 'Processed diffuse in vivo flow cytometry (DiFC) data for mice with Lewis lung carcinoma (LLC) and multiple myeloma (MM) tumors used to study short term circulating tumor cell (CTC) dynamics.',
   'ownerId': 1042,
   'ownerFirstName': 'Amber',
   'ownerLastName': 'Williams',
   'ownerOrcid': '0000-0003-1815-6479',
   'organizationName': 'Northeastern University',
   'organizationId': 629,
   'license': 'Community Data License Agreement – Permissive',
   'tags': ['circulating tumor cell',
    'multiple myeloma',
    'lewis lung carcinoma',
    'in vivo flow cytometry',
    'cancer',
    'flow cytometry',
    'ctc'],
   'version': 1,
   'revision': None,
   'size': 1145788,
   'modelCount': [],
   'fileCount': 241,
   'recordCount': 0,
   'uri': 's3://pennsieve-prod-discover-publish-use1/93/1/',
   'arn': 'arn:aws:s3:::pe

We can query the Discover service with different options, e.g. restricting the search to a certain organization, returning only embargoed datasets, and ordering the results by name, date, or size in ascending or descending order.

In [14]:
response=client.pennsieve.list_datasets(organization='Sparc', embargo=True, order_by='date', order_direction='asc')
response['datasets'][0]

{'id': 275,
 'sourceDatasetId': 1969,
 'name': 'Test Embargo dataset',
 'description': 'This is a test dataset for SPARC to test request to embargo datasets',
 'ownerId': 29,
 'ownerFirstName': 'Joost',
 'ownerLastName': 'Wagenaar',
 'ownerOrcid': '0000-0003-0837-7120',
 'organizationName': 'SPARC Consortium',
 'organizationId': 367,
 'license': 'Creative Commons Attribution',
 'tags': ['test'],
 'version': 1,
 'revision': None,
 'size': 4315620,
 'modelCount': [],
 'fileCount': 21,
 'recordCount': 0,
 'uri': 's3://pennsieve-prod-discover-embargo-use1/275/1/',
 'arn': 'arn:aws:s3:::pennsieve-prod-discover-embargo-use1/275/1/',
 'status': 'EMBARGO_SUCCEEDED',
 'doi': '10.26275/xioi-rjik',
 'banner': 'https://assets.discover.pennsieve.io/dataset-assets/275/1/banner.jpg',
 'readme': 'https://assets.discover.pennsieve.io/dataset-assets/275/1/readme.md',
 'contributors': [{'firstName': 'Joost',
   'middleInitial': 'B',
   'lastName': 'Wagenaar',
   'degree': 'Ph.D.',
   'orcid': '0000-0003-
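
Since the responses are plain dictionaries, the fields of interest can be pulled out with ordinary Python. The sketch below is only an illustration: `summarize_datasets` is a hypothetical helper, the sample is abbreviated from the first output above, and the `size` field is assumed to be expressed in bytes.

```python
def summarize_datasets(response):
    """Return (id, name, size in MB) for each dataset in a list_datasets response."""
    return [
        (d["id"], d["name"], round(d["size"] / 1e6, 2))  # assumes size is in bytes
        for d in response.get("datasets", [])
    ]


# Abbreviated sample shaped like the list_datasets output shown earlier.
sample = {
    "datasets": [
        {"id": 93, "name": "LLC and MM DiFC CTC detections 2020_11_04", "size": 1145788},
    ],
}

for dataset_id, name, size_mb in summarize_datasets(sample):
    print(f"{dataset_id}: {name} ({size_mb} MB)")
```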

#### Listing records

Apart from listing datasets, we can also look into the records of a specific model, for example to explore researchers within the SPARC project.

In [12]:
response=client.pennsieve.list_records(model='researcher', organization='SPARC')
response

{'limit': 10,
 'offset': 0,
 'totalCount': 1174,
 'records': [{'datasetId': 282,
   'version': 1,
   'model': 'researcher',
   'properties': {'hasORCIDId': 'https://orcid.org/0000-0002-0067-510X',
    'hasAffiliation': 'Univeristy of California, Los Angeles;University of California, Los Angeles;https://ror.org/046rm7j60',
    'middleName': '',
    'hasRole': '',
    'lastName': 'Yuan',
    'recordHash': '82329d634a673e2f45a1bfe90930e5fc',
    'firstName': 'Pu-Qing',
    'id': '3136648b-e83d-4e30-a20c-d6af747712d5'}},
  {'datasetId': 287,
   'version': 1,
   'model': 'researcher',
   'properties': {'hasORCIDId': 'https://orcid.org/0000-0002-4153-9614',
    'hasAffiliation': 'University College London',
    'middleName': '',
    'hasRole': '',
    'lastName': 'Thompson',
    'recordHash': '77fbf8f43a9adb0180b8d1c8d289e6f6',
    'firstName': 'Nicole',
    'id': 'ea50a52a-54a3-4180-ace6-0b6b59beeb2a'}},
  {'datasetId': 304,
   'version': 1,
   'model': 'researcher',
   'properties': {'hasO
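
The response above reports a `totalCount` of 1174 while returning only 10 records, so retrieving everything requires paging. The following is a minimal sketch of a generic paginator, not part of sparc.client: `iter_all` is a hypothetical helper, and it assumes the listing function accepts `limit` and `offset` keyword arguments corresponding to the fields seen in the response.

```python
def iter_all(fetch_page, key, page_size=100):
    """Yield every item from a paginated Discover-style response.

    fetch_page is any callable that accepts limit and offset keyword
    arguments and returns a dict with 'totalCount' and a list under key.
    """
    offset = 0
    while True:
        page = fetch_page(limit=page_size, offset=offset)
        items = page.get(key, [])
        yield from items
        offset += len(items)
        # Stop on an empty page or once all reported items were seen.
        if not items or offset >= page.get("totalCount", 0):
            break


# Hypothetical usage, assuming list_records accepts limit and offset:
# from functools import partial
# fetch = partial(client.pennsieve.list_records, model='researcher', organization='SPARC')
# researchers = list(iter_all(fetch, key='records'))
```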

#### Listing files

Similarly, we can query for files that match a given name or extension, e.g. files that are included in a specific dataset.

In [16]:
response=client.pennsieve.list_files(dataset_id=90, query='manifest', file_type='json')
response

[{'name': 'manifest.json',
  'datasetId': 90,
  'datasetVersion': 1,
  'size': 19928,
  'fileType': 'Json',
  'packageType': 'Unsupported',
  'icon': 'JSON',
  'uri': 's3://pennsieve-prod-discover-publish-use1/90/1/manifest.json',
  'createdAt': None,
  'sourcePackageId': None},
 {'name': 'manifest.json',
  'datasetId': 90,
  'datasetVersion': 1,
  'size': 1660,
  'fileType': 'Json',
  'packageType': 'Unsupported',
  'icon': 'JSON',
  'uri': 's3://pennsieve-prod-discover-publish-use1/90/1/revisions/1/manifest.json',
  'createdAt': None,
  'sourcePackageId': None}]

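If we are only interested in the relative paths of the files, we can use the list_filenames() function for convenience. The cell below extracts the paths manually from the `uri` field and leaves the list_filenames() call commented out.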

In [17]:
response=client.pennsieve.list_files(dataset_id=90, query='manifest')
list(map(lambda x: "/".join(x["uri"].split("/")[5:]), response))

#response=client.pennsieve.list_filenames(dataset_id=90, query='manifest')
#response

['files/primary/sub-898/manifest.xlsx',
 'files/derivative/manifest.xlsx',
 'manifest.json',
 'files/primary/sub-897/manifest.xlsx',
 'files/primary/sub-896/manifest.xlsx',
 'revisions/1/manifest.json']
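
The path extraction used above relies on splitting the URI at fixed positions. A slightly more explicit sketch of the same idea uses `urllib.parse` from the standard library; `relative_path` is a hypothetical helper, and it assumes the URIs always follow the `s3://<bucket>/<dataset id>/<version>/<path>` layout visible in the outputs.

```python
from urllib.parse import urlparse


def relative_path(uri):
    """Strip the bucket, dataset id, and version from an s3:// URI."""
    # urlparse places the bucket in netloc, so the path looks like
    # '/<dataset id>/<version>/<relative path>'.
    parts = urlparse(uri).path.split("/", 3)
    return parts[3] if len(parts) > 3 else ""


uris = [
    "s3://pennsieve-prod-discover-publish-use1/90/1/manifest.json",
    "s3://pennsieve-prod-discover-publish-use1/90/1/revisions/1/manifest.json",
]
print([relative_path(u) for u in uris])
# prints ['manifest.json', 'revisions/1/manifest.json']
```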

### Downloading files

Downloading files is also very simple. All we need to do is list the file(s) to be downloaded and pass them to the download_file function.

The function will either download the file with its original extension (if output_name is not specified) or pack the files and download them in gzip format to the specified directory.
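
Before downloading several files, it can be useful to estimate how much data will be transferred. The sketch below sums the `size` field of the entries returned by list_files; `total_size` is a hypothetical helper, the sizes are copied from the two manifest.json entries listed earlier, and the field is assumed to be in bytes.

```python
def total_size(file_list):
    """Sum the 'size' field (assumed to be bytes) over list_files entries."""
    return sum(entry.get("size") or 0 for entry in file_list)


# Sizes taken from the two manifest.json entries listed earlier.
files = [
    {"name": "manifest.json", "size": 19928},
    {"name": "manifest.json", "size": 1660},
]
print(f"{total_size(files)} bytes in {len(files)} files")
# prints "21588 bytes in 2 files"
```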

In [18]:
!dir

_build	   intro.rst  sparc.client.rst		 _templates
conf.py    make.bat   sparc.client.services.rst  tutorial.ipynb
index.rst  Makefile   _static			 tutorials.rst


In [19]:
response=client.pennsieve.list_files(dataset_id=90, query='manifest', file_type='json')
client.pennsieve.download_file(file_list=response, output_name='myfile.gz')
!dir 

_build	   intro.rst  myfile.gz			 _static	 tutorials.rst
conf.py    make.bat   sparc.client.rst		 _templates
index.rst  Makefile   sparc.client.services.rst  tutorial.ipynb


We can also download a single file. In that case, the original file will be downloaded (not a gzip archive).


In [20]:
response[1]

{'name': 'manifest.json',
 'datasetId': 90,
 'datasetVersion': 1,
 'size': 1660,
 'fileType': 'Json',
 'packageType': 'Unsupported',
 'icon': 'JSON',
 'uri': 's3://pennsieve-prod-discover-publish-use1/90/1/revisions/1/manifest.json',
 'createdAt': None,
 'sourcePackageId': None}

In [21]:
response=client.pennsieve.list_files(dataset_id=90, query='manifest', file_type='json')
client.pennsieve.download_file(file_list=response[1]) # Expected output_name: 'manifest.json'
!dir 

_build	   make.bat	  sparc.client.rst	     tutorial.ipynb
conf.py    Makefile	  sparc.client.services.rst  tutorials.rst
index.rst  manifest.json  _static
intro.rst  myfile.gz	  _templates


### Interacting with Pennsieve API

The SPARC client can also interact with the Pennsieve API by submitting HTTP requests, e.g. GET or POST.

This functionality is limited to browsing publicly available datasets. 

More advanced functions, such as creating a dataset or uploading files, may require user authentication, which we will cover in the next section.

For reference, please visit the documentation website:
- https://docs.pennsieve.io/reference
- https://docs.pennsieve.io/docs/uploading-files-programmatically

In [22]:
#e.g. calling GET: https://docs.pennsieve.io/reference/browsefiles-1 for version 1 of dataset 90 with additional parameters
client.pennsieve.get('https://api.pennsieve.io/discover/datasets/90/versions/1/files/browse', params={'limit':2})

{'limit': 2,
 'offset': 0,
 'totalCount': 6,
 'files': [{'name': 'files',
   'path': 'files',
   'size': 1140301577,
   'type': 'Directory'},
  {'name': 'metadata',
   'path': 'metadata',
   'size': 48774,
   'type': 'Directory'}]}
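
To make the GET request above more concrete, the sketch below builds the equivalent URL with the standard library only. It does not call sparc.client and sends nothing over the network; `API_BASE` and `discover_url` are hypothetical names, and the host is simply the one that appears in the example.

```python
from urllib.parse import urlencode, urljoin

API_BASE = "https://api.pennsieve.io"  # host taken from the GET example above


def discover_url(path, params=None):
    """Build a full URL from an API path and optional query parameters."""
    url = urljoin(API_BASE, path)
    return f"{url}?{urlencode(params)}" if params else url


print(discover_url("/discover/datasets/90/versions/1/files/browse", {"limit": 2}))
# prints https://api.pennsieve.io/discover/datasets/90/versions/1/files/browse?limit=2
```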

In [23]:
#e.g. calling POST: https://docs.pennsieve.io/reference/downloadmanifest-1 for version 1 of dataset 90 and browsing /metadata/records folder
client.pennsieve.post('/discover/datasets/90/versions/1/files/download-manifest', json={"paths":["/metadata/records"]})

{'header': {'count': 8, 'size': 24898},
 'data': [{'fileName': 'animal_subject.csv',
   'path': ['metadata', 'records'],
   'url': 'https://pennsieve-prod-discover-publish-use1.s3.amazonaws.com/90/1/metadata/records/animal_subject.csv?X-Amz-Security-Token=IQoJb3JpZ2luX2VjEN%2F%2F%2F%2F%2F%2F%2F%2F%2F%2F%2FwEaCXVzLWVhc3QtMSJHMEUCIA4kuiXnBvAyT2Iy%2BOoNXy3muCTnItqylSqWgQtmIOv%2BAiEA2Bq4gMgesRRSsJ9lzN%2BXcDo7TPtiosSMlE2M5Y9NE7MqgQQIGBADGgw3NDA0NjMzMzcxNzciDEssXpFcFkPOGDQajSreA0xoy65dBsXWk7%2BuSQIpx1EO3ZYCA814oJOMcaky14BTkZwXnagwWkHAmcrn50NlIw0WxBj9X8B6l47zk2%2BIKdLbJstpCHEfO5fe1F1r373stiArDUWXmFk7dvAkOJO87x76w7sDerQHYtDFNlSUYmO6zVUTNm6FyBEBRLW9vqfXM7k%2BYQfT91NJYvKw8J6FMvJLu8VLwwcfRrwvlMYnAk70u0cj7gXYriRzlN%2FBaCV7EtjxFb3T1Q059qFQQm0x0ysjsyzubLsM5qi2q%2BFO4v04jcS5EE%2BHxBU84Y2gfJC4D6V%2Fcokc9WtB22IJZ8O0P5s5eDBQ1u7M6dqdLCTkUVwAuX%2BhHuEPNbEgmDaTKdl2r2UbeBzVbQRz0azPpLX4FAZCWUeWEb1fU5AiUlHJ8c25ZJhTk8Z6Jb9P5XxRczm0XPdvaf8f9jEL58WdE9zRyFdmz8kwyfMpnknzEREjHTXjvH1m9PIIe%2BEtuqNVTV4ZxJxjzmfwQus4Vs

### Uploading files

In this section we will cover how to programmatically upload files from a local drive to the Pennsieve server with sparc.client.

First, we need to download and configure the Pennsieve Agent, which is required for uploading files.

Instructions for configuring the agent can be found here:
https://docs.pennsieve.io/docs/uploading-files-programmatically

In [24]:
!pennsieve agent

Pennsieve Agent is already running on port: 9000


In [26]:
p=client.pennsieve.connect()

Please set the dataset with use_dataset([name])


In [34]:
dataset_list=p.get_datasets()

In [35]:
dataset=p.use_dataset('Dataset_0')

In [32]:
p.manifest.create('manifest.json')

manifest_id: 1
message: "Successfully indexed 1 files."

In [33]:
p.manifest.upload()

status: "Upload initiated."
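
The steps above can be collected into a small helper. This is a sketch that only chains the calls shown in this section (use_dataset, manifest.create, manifest.upload); `upload_with_manifest` is a hypothetical name, and the function assumes the Pennsieve Agent is running and that `pennsieve` is the object returned by `client.pennsieve.connect()`.

```python
def upload_with_manifest(pennsieve, dataset_name, file_path):
    """Select a dataset, create a manifest for file_path, and start the upload.

    pennsieve is expected to be the object returned by
    client.pennsieve.connect(); only calls shown in this tutorial are used.
    """
    pennsieve.use_dataset(dataset_name)
    manifest = pennsieve.manifest.create(file_path)
    status = pennsieve.manifest.upload()
    return manifest, status


# Hypothetical usage, with the dataset and file names from the cells above:
# p = client.pennsieve.connect()
# manifest, status = upload_with_manifest(p, 'Dataset_0', 'manifest.json')
```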