Open and run this in Google Colab: <a href="https://colab.research.google.com/github/rcsb/rcsb-training-resources/blob/master/example-use-cases/archive-wide-queries/fetch_initial_release_date.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fetch Initial Release Date for All PDB IDs

## Introduction

The code in this notebook is designed to perform the following tasks:

1. Use the Repository Holdings Service of the Data API to get a list of all currently released PDB IDs
2. Use a Data API query to retrieve the initial release date for each ID

### Purpose

This notebook is designed to help you fetch the initial release date of every PDB ID.

## Libraries

These libraries will be called in the coding cells in this notebook. 

| Library | Contents | Source |
| :-----: | :------- | :----- |
| requests | simple HTTP library for Python | [documentation](https://requests.readthedocs.io/en/latest/) |
| dateutil | extensions to the standard datetime module | [documentation](https://dateutil.readthedocs.io/en/stable/) |
| rcsb-api | Python interface for the API services at [RCSB Protein Data Bank](https://www.rcsb.org) | [py-rcsb-api on GitHub](https://github.com/rcsb/py-rcsb-api) |

## Installation

To use this notebook, you will need the following libraries installed in your computing environment: python-dateutil, requests, and rcsb-api. To install them from the command line on your computer, use these commands:

`pip install python-dateutil`\
`pip install requests`\
`pip install rcsb-api`

To install from within a Jupyter or Colab notebook, run the same commands in a coding cell, each preceded by %:

In [None]:
# Use this coding cell to install necessary libraries if they are not already in your system or environment
%pip install python-dateutil
%pip install requests
%pip install rcsb-api

## Running the Notebook

The coding cell below contains all of the raw code for this example. **Experienced coders** can use it as they see fit.

For **novice coders**, the code is broken up into smaller chunks in the subsequent coding cells, with stepwise inputs and outputs to better explain how this code can be used.

In [None]:
import requests
from dateutil import parser
from rcsbapi.data import DataQuery as Query

# Step 1: Retrieve all PDB IDs from Data API
url = 'https://data.rcsb.org/rest/v1/holdings/current/entry_ids'
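response = requests.get(url)
response.raise_for_status()  # stop early if the request failed
ids = response.json()  # parse the JSON array of IDs instead of calling eval() on remote text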

# Step 2: Split full list of IDs into batches
batchSize = 5_000
idBatches = [ids[i:i+batchSize] for i in range(0, len(ids), batchSize)]

# Step 3: Query the initial release date for each batch
release_dates = []
for batch in idBatches:
    query = Query(
        input_type="entries",
        input_ids=batch,
        return_data_list=["rcsb_accession_info.initial_release_date"]
    )
    data = query.exec()
    for d in data['data']['entries']:
        entry_id = d['rcsb_id']
        isodate = d["rcsb_accession_info"]["initial_release_date"]
        date = parser.parse(isodate).strftime('%Y-%m-%d')
        release_dates.append({
            "pdb_id": entry_id,
            "release_date": date
        })
print(release_dates[0])

## Code Breakdown

### Importing Libraries

The following simply imports the required libraries that contain the methods that are called in this notebook.

In [None]:
import requests
from dateutil import parser
from rcsbapi.data import DataQuery as Query

### Step 1

The current entries endpoint of the [Repository Holdings Service REST API](https://data.rcsb.org/redoc/index.html#tag/Repository-Holdings-Service) provides a full list of current PDB IDs.

In [None]:
# Step 1: Retrieve all PDB IDs from Data API

url = 'https://data.rcsb.org/rest/v1/holdings/current/entry_ids'
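response = requests.get(url)
response.raise_for_status()  # stop early if the request failed
ids = response.json()  # parse the JSON array of IDs instead of calling eval() on remote text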

print(f"There are {len(ids)} released structures in the PDB archive")
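
The cell above reads the endpoint's response as a list of IDs. As a minimal sketch of why a JSON parser is preferable to `eval()` for this, the helper below parses a response body and checks its shape. `parse_entry_ids` is a hypothetical name introduced here, and the sample body is illustrative only; the sketch assumes the endpoint returns a JSON array of ID strings.

```python
import json


def parse_entry_ids(body: str) -> list[str]:
    """Parse a JSON array of PDB IDs from a response body (hypothetical helper)."""
    # json.loads only parses data; unlike eval(), it never executes code
    ids = json.loads(body)
    if not isinstance(ids, list) or not all(isinstance(i, str) for i in ids):
        raise ValueError("expected a JSON array of strings")
    return ids


# Illustrative sample only; the real endpoint returns a much longer array
sample_body = '["100D", "101D", "4HHB"]'
sample_ids = parse_entry_ids(sample_body)
print(f"Parsed {len(sample_ids)} IDs")
```

With `requests`, `response.json()` performs the same parsing step directly on the response object.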

### Step 2
Requesting a large number of objects in a single call is resource-intensive and not recommended. Splitting the ID list into batches (here, 5,000 IDs each) and making one request per batch is more effective.

In [None]:
batchSize = 5_000
idBatches = [ids[i:i+batchSize] for i in range(0, len(ids), batchSize)]

print(f"Split IDs into {len(idBatches)} batches")
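
The same batching can also be written as a small generator, which may be convenient when the batch size changes between runs. This is only a sketch: `iter_batches` is a hypothetical helper name, and the IDs below are placeholders rather than real PDB IDs.

```python
from typing import Iterator, Sequence


def iter_batches(items: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of `items`, each at most `batch_size` long."""
    if batch_size <= 0:
        raise ValueError("batch_size must be a positive integer")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Placeholder IDs for illustration only
example_ids = [f"ID{n}" for n in range(12)]
batches = list(iter_batches(example_ids, 5))
print([len(b) for b in batches])  # [5, 5, 2]
```

The last batch is simply shorter when the list length is not a multiple of the batch size.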

### Step 3

In this step, we create a query object that points to the Data API on the RCSB PDB website. The query retrieves the initial release date for a given list of IDs. An example query for a single entry:

In [None]:
# Step 3: Run a Data API query to retrieve the initial release date for one example ID (4HHB)
query = Query(
    input_type="entries",
    input_ids=["4HHB"],
    return_data_list=["rcsb_accession_info.initial_release_date"]
)
data = query.exec()
print(data['data']['entries'][0])
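
The result is a nested dictionary, with each entry under `data` and `entries`. The sketch below shows one way to pull the date out of an entry while tolerating a missing value. `extract_release_date` is a hypothetical helper, the sample response is hand-written for illustration (placeholder IDs and dates, not real API output), and the timestamp format with a trailing `Z` is an assumption. It uses the standard library instead of `dateutil` so that it runs without extra installs.

```python
from datetime import datetime
from typing import Optional


def extract_release_date(entry: dict) -> Optional[str]:
    """Return the release date as YYYY-MM-DD, or None if it is absent."""
    info = entry.get("rcsb_accession_info") or {}
    isodate = info.get("initial_release_date")
    if not isodate:
        return None
    # Older Python versions reject a trailing "Z" in fromisoformat(), so normalise it
    parsed = datetime.fromisoformat(isodate.replace("Z", "+00:00"))
    return parsed.strftime("%Y-%m-%d")


# Hand-written sample in the same nesting as the query result (placeholder values)
sample_response = {
    "data": {
        "entries": [
            {
                "rcsb_id": "XXXX",
                "rcsb_accession_info": {"initial_release_date": "2000-01-02T00:00:00Z"},
            },
            {"rcsb_id": "YYYY", "rcsb_accession_info": None},
        ]
    }
}

extracted = {
    entry["rcsb_id"]: extract_release_date(entry)
    for entry in sample_response["data"]["entries"]
}
print(extracted)  # {'XXXX': '2000-01-02', 'YYYY': None}
```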

### Step 4

The final step runs the same query for every batch, converts each ISO timestamp to a `YYYY-MM-DD` string, and collects the results in a list that can be stored for future use.

In [None]:
release_dates = []
for batch in idBatches:
    query = Query(
        input_type="entries",
        input_ids=batch,
        return_data_list=["rcsb_accession_info.initial_release_date"]
    )
    data = query.exec()
    for d in data['data']['entries']:
        entry_id = d['rcsb_id']
        isodate = d["rcsb_accession_info"]["initial_release_date"]
        date = parser.parse(isodate).strftime('%Y-%m-%d')
        release_dates.append({
            "pdb_id": entry_id,
            "release_date": date
        })
print(release_dates[0])
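
To keep the results for future use, a list shaped like `release_dates` can be written to a CSV file. The following is a sketch under stated assumptions rather than part of the original notebook: `write_release_dates` is a hypothetical helper, the rows are placeholders, and the file is written to a temporary directory so the example does not touch your working folder.

```python
import csv
import tempfile
from pathlib import Path


def write_release_dates(rows: list[dict], path: Path) -> None:
    """Write rows with 'pdb_id' and 'release_date' keys to a CSV file."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=["pdb_id", "release_date"])
        writer.writeheader()
        writer.writerows(rows)


# Placeholder rows in the same shape as the release_dates list
example_rows = [
    {"pdb_id": "XXXX", "release_date": "2000-01-02"},
    {"pdb_id": "YYYY", "release_date": "2001-03-04"},
]

output_path = Path(tempfile.mkdtemp()) / "release_dates.csv"
write_release_dates(example_rows, output_path)
print(output_path.read_text(encoding="utf-8"))
```

In the notebook itself, you would pass `release_dates` and a path of your choice in place of the placeholder rows and temporary file.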

By modifying the `return_data_list` parameter of the query object, any available data attribute can be queried for all currently released PDB entries. Available fields can be explored [here](https://data.rcsb.org/data-attributes.html).