# 3 - Accessing and downloading Data



*   Finding the route to get hold of the data
*   Converting spreadsheets to CSV



In [1]:
import requests
import json
import string
import io
import pandas as pd
from IPython.display import HTML

In [2]:
base_url = 'https://fairdomhub.org'

In [3]:
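def json_for_resource(resource_type, resource_id):
    """Fetch the JSON API description of a SEEK resource, e.g. data_files/1049."""
    headers = {"Accept": "application/vnd.api+json",
               "Accept-Charset": "ISO-8859-1"}
    r = requests.get(base_url + "/" + resource_type + "/" + str(resource_id), headers=headers)
    r.raise_for_status()  # raise an exception on HTTP error responses
    return r.json()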

Fetch the JSON for the Data file resource at https://fairdomhub.org/data_files/1049

We print the title to check that we have the correct item.

In [4]:
data_file_id = 1049

result = json_for_resource('data_files',data_file_id)

title = result['data']['attributes']['title']

title

'amino acid auxotrophies S. pyogenes'

The attributes contain a 'content_blobs' block. Content Blob is the name we use in SEEK for the entity that corresponds to a file or URL.

Note that content_blobs is always an array. Models can currently contain multiple content blobs (multiple files), and we plan to provide the same support to Data files and other assets in the future.
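
Because content_blobs is an array, code that handles it is probably best written as a loop rather than assuming a single entry. The following is a minimal sketch, assuming the JSON layout shown in this notebook. The helper name `blob_download_links` is hypothetical (it is not part of SEEK), and it relies on the convention, described later in this notebook, that appending */download* to a content blob's link gives its download URL.

```python
def blob_download_links(resource_json):
    """Return (original_filename, download_url) pairs for every content blob."""
    # Fall back to an empty list if the resource has no content_blobs block
    blobs = resource_json['data']['attributes'].get('content_blobs', [])
    return [(b['original_filename'], b['link'] + '/download') for b in blobs]


# Trimmed-down example with the same shape as the JSON for data file 1049
example = {
    'data': {
        'attributes': {
            'content_blobs': [
                {'original_filename': '1205 amino acid omission pyogenes.xlsx',
                 'link': 'https://fairdomhub.org/data_files/1049/content_blobs/1518'}
            ]
        }
    }
}

links = blob_download_links(example)
for filename, url in links:
    print(filename, '->', url)
```

Written this way, the same code should keep working if a resource, such as a Model, carries more than one file.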



In [5]:
result['data']['attributes']

{'discussion_links': [],
 'title': 'amino acid auxotrophies S. pyogenes',
 'license': None,
 'description': 'raw data OD600 of amino acid auxotrophy exps S. pyogenes',
 'latest_version': 1,
 'tags': [],
 'versions': [{'version': 1,
   'revision_comments': None,
   'url': 'https://fairdomhub.org/data_files/1049?version=1',
   'doi': None}],
 'version': 1,
 'revision_comments': None,
 'created_at': '2012-12-13T15:26:33.000Z',
 'updated_at': '2012-12-13T15:26:33.000Z',
 'doi': None,
 'content_blobs': [{'original_filename': '1205 amino acid omission pyogenes.xlsx',
   'url': None,
   'md5sum': '4d177f724cee2a992b9b284e145d43b7',
   'sha1sum': 'b8176276857795e89db6a2e2c154a8c351780e4d',
   'content_type': 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet',
   'link': 'https://fairdomhub.org/data_files/1049/content_blobs/1518',
   'size': 59334}],
 'creators': [{'profile': '/people/413',
   'family_name': 'van Grinsven',
   'given_name': 'Koen',
   'affiliation': 'University

Here we focus on the details about a single content blob:

* **content_type** - the MIME type of the file, or of whatever the URL points to
* **link** - the link that describes the content blob route
* **md5sum** - an MD5 checksum of the contents
* **sha1sum** - a SHA1-based checksum of the contents. These checksums are useful for checking that the downloaded file is correct, i.e. that there was no error during download and that it has not been modified since being registered with SEEK.
* **original_filename** - the filename of the file, as it was when registered with SEEK
* **size** - the size of the file in bytes
* **url** - the URL of an external resource, if the item was registered with SEEK using a URL rather than a direct upload

In this case, the file is an *Excel XLSX* file called *1205 amino acid omission pyogenes.xlsx*, and is about 59 kB in size.

In [6]:
blob = result['data']['attributes']['content_blobs'][0]

blob

{'original_filename': '1205 amino acid omission pyogenes.xlsx',
 'url': None,
 'md5sum': '4d177f724cee2a992b9b284e145d43b7',
 'sha1sum': 'b8176276857795e89db6a2e2c154a8c351780e4d',
 'content_type': 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet',
 'link': 'https://fairdomhub.org/data_files/1049/content_blobs/1518',
 'size': 59334}

The route to directly download a file is the content blob route, with the */download* action appended. This is always the case for anything downloadable in SEEK.

In this example we display the URL for downloading the content blob and generate an HTML hyperlink for it.

Although in this case we download the content blob itself directly, it is also possible to download via the Data file route, https://fairdomhub.org/data_files/1049/download . For everything other than Models, this currently downloads a single file. For Models, a ZIP file is generated that contains all the files. To be future-proof, we recommend downloading individual files through the content blob route.

In [7]:
link = blob['link']
filename = blob['original_filename']

download_link = link+"/download"

print("Download link is: " + download_link + "\n")

HTML("<a href='" + download_link + "'>Download " + filename + "</a>")

Download link is: https://fairdomhub.org/data_files/1049/content_blobs/1518/download
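
Once a file has been downloaded, the size and checksums listed in the content blob can be used to confirm that the bytes received match what was registered with SEEK. The following is a minimal sketch using Python's standard `hashlib` module. The helper name `verify_blob` is hypothetical, and the demonstration uses made-up bytes with a matching made-up blob so that it runs without network access.

```python
import hashlib


def verify_blob(data, blob):
    """Check downloaded bytes against the size, md5sum and sha1sum of a content blob."""
    return (len(data) == blob['size']
            and hashlib.md5(data).hexdigest() == blob['md5sum']
            and hashlib.sha1(data).hexdigest() == blob['sha1sum'])


# Made-up contents and a blob dictionary built to match them (illustration only)
sample = b"example file contents"
sample_blob = {
    'size': len(sample),
    'md5sum': hashlib.md5(sample).hexdigest(),
    'sha1sum': hashlib.sha1(sample).hexdigest(),
}

print(verify_blob(sample, sample_blob))         # True
print(verify_blob(sample + b"!", sample_blob))  # False: contents were altered
```

For the real file, the bytes could be fetched with `requests.get(download_link)` and the response's `content` passed to `verify_blob` together with `blob`.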



As we saw earlier, this Data file is an Excel spreadsheet. Where the data is an Excel spreadsheet, it can be converted to a Comma Separated Values (CSV) file by requesting this format through content negotiation.

In this case, we send a GET request to https://fairdomhub.org/data_files/1049/content_blobs/1518, but instead of requesting JSON we use an Accept: header of 'text/csv'. A 'sheet' parameter can be included to access different sheets; if it is omitted, the first sheet is returned.

Here we request CSV and display the first sheet in a table using the Pandas module. (NaN just denotes a blank cell in the spreadsheet.)

The code sends a GET request to the content blob link with an Accept header of text/csv, which tells the server that the client expects the response in CSV format. It then checks the request for errors and reads the CSV content into a pandas DataFrame for further manipulation.

In [8]:
headers = { "Accept": "text/csv" }
r = requests.get(link, headers=headers, params={'sheet':'1'})
r.raise_for_status()

csv = pd.read_csv(io.StringIO(r.content.decode('utf-8')))

csv

Unnamed: 0,Medium preparation (all separate media are prepared using the same stock solutions),Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,...,Unnamed: 23,Unnamed: 24,Unnamed: 25,Unnamed: 26,Unnamed: 27,Unnamed: 28,Unnamed: 29,Unnamed: 30,Unnamed: 31,Unnamed: 32
0,,,,,,,,,,,...,,,,,,,,,,
1,,CDM,CDM1,CDM2,CDM3,CDM4,CDM5,CDM6,CDM7,CDM8,...,CDM21,,CDM22,CDM23,CDM24,CDM25,CDM26,CDM27,,total amount
2,,all amino acids,ala,arg,asn,asp,cys,cyn,glu,gln,...,val,,cys/cyn,gln/glu,gly/ser,asn/asp,cys/cyn/ser,no AA,,Sums
3,Basal Solution 4*,25,25,25,25,25,25,25,25,25,...,25,Basal Solution 2*,25,25,25,25,25,25,,700
4,AGU mix 50*,2,2,2,2,2,2,2,2,2,...,2,AGU mix 50*,2,2,2,2,2,2,,56
5,hydroxide (xanth HCO3),5,5,5,5,5,5,5,5,5,...,5,hydroxide (xanth HCO3),5,5,5,5,5,5,,140
6,100*vitamin solution,1,1,1,1,1,1,1,1,1,...,1,100*vitamin solution,1,1,1,1,1,1,,28
7,100*metal solution,1,1,1,1,1,1,1,1,1,...,1,100*metal solution,1,1,1,1,1,1,,28
8,alanine*100,1,0,1,1,1,1,1,1,1,...,1,alanine*100,1,1,1,1,1,0,,26
9,arginine*100,1,1,0,1,1,1,1,1,1,...,1,arginine*100,1,1,1,1,1,0,,26
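
To read a sheet other than the first, the same request can be wrapped in a small helper that takes the sheet as an argument. This is a sketch only: the name `fetch_sheet_csv` and its `get` parameter are hypothetical, and the assumption that sheets are numbered from 1 is inferred from the request above rather than confirmed here.

```python
def fetch_sheet_csv(link, sheet=1, get=None):
    """Request one sheet of a spreadsheet content blob as CSV text.

    `get` is a hypothetical hook that allows a stand-in for requests.get
    to be supplied, e.g. for trying the helper without network access.
    """
    if get is None:
        import requests  # imported lazily so the helper can be defined offline
        get = requests.get
    # Content negotiation: ask for CSV instead of JSON, and pick the sheet
    r = get(link, headers={"Accept": "text/csv"}, params={"sheet": str(sheet)})
    r.raise_for_status()
    return r.content.decode("utf-8")
```

The returned text can be loaded as before, for example with `pd.read_csv(io.StringIO(fetch_sheet_csv(link, sheet=2)))`, provided the workbook actually has a second sheet.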
