# Using clinicaltrials.gov API for extracting information on COVID-19 antibodies

# Background

1. We will use the "full-study" API to extract results. The API returns at most 100 hits per request, so we need to page through the hits, 100 at a time, until all of them have been retrieved.

2. We're interested in extracting the following data:

  Inclusion criteria:
  - COVID-19 indication
  - Antibody treatments, combination treatments involving antibodies

  Exclusion criteria:
  - Entries describing preclinical or clinical development of diagnostic antibodies, polyclonal antibodies, convalescent plasma therapies, immune globulin intravenous therapies (IGIV), vaccines, small molecules, and recombinant proteins other than immunoglobulin (Ig), Ig fragments, and Ig fusion proteins are excluded. Studies and clinical trials that do not explicitly state COVID-19 or SARS-CoV-2 as their indication or target are also excluded.


3. Shown below are some useful resources on the JSON data returned by "full-study" queries. **Note that NOT ALL of these fields will be available for every study, so it is recommended to use `try...except` to handle fields that do not exist in a given record.**

   - [List of study fields](https://clinicaltrials.gov/api/info/study_fields_list)
   - [Empty structure of JSON returned from a full-study query](https://clinicaltrials.gov/api/info/study_structure)
   - [Search areas](https://clinicaltrials.gov/api/info/search_areas) : Note that regardless of the hierarchy of the fields in JSON, you can use the search areas directly in the `expr` param (see code)

4. Caveat: Manual inspection of the returned results from the script is strongly recommended to ensure relevancy.
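
As an alternative to wrapping every lookup in `try...except` (item 3 above), a small helper can walk a nested path and return a default when any step is missing. This is only a sketch: `get_nested` is a hypothetical helper, not part of the API or of `requests`, and the `study` dict is a hand-written stand-in shaped like one record of a "full-study" response.

```python
from typing import Any, Iterable


def get_nested(record: Any, path: Iterable, default: Any = None) -> Any:
    """Follow `path` (dict keys or list indexes) through `record`.

    Returns `default` as soon as one step of the path does not exist.
    """
    current = record
    for key in path:
        try:
            current = current[key]
        except (KeyError, IndexError, TypeError):
            return default
    return current


# Hand-written stand-in for one study record (assumption, not real API output).
# It deliberately has no DesignModule, as can happen for real records.
study = {
    "Study": {
        "ProtocolSection": {
            "IdentificationModule": {"NCTId": "NCT04320615"},
        }
    }
}

nct_id = get_nested(
    study, ["Study", "ProtocolSection", "IdentificationModule", "NCTId"]
)
phases = get_nested(
    study,
    ["Study", "ProtocolSection", "DesignModule", "PhaseList", "Phase"],
    default=[],
)
print(nct_id, phases)  # NCT04320615 []
```

Returning a default keeps the extraction loop flat, at the cost of hiding which field was missing; `try...except` remains the better choice when that distinction matters.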


# How to format your query

There are 2 ways:

## 1. Single field search

```python
import requests

url="https://ClinicalTrials.gov/api/query/full_studies"
params={
    "expr": "NCT04320615",
    "field": "NCTId",
    "min_rnk": 1,
    "max_rnk": 1,
    "fmt": "JSON"
}

data=requests.get(url, params=params).json()
```

Parameters:
- `expr`: the value to search for in the chosen field
- `field`: the field to search. Can only be a single field
- `fields`: fields to return. Only applies to "Study Fields" queries. "Full-Study" queries return all fields and are therefore NOT affected by this parameter
- `min_rnk` and `max_rnk`: When you submit a query, hits are numbered from 1 to the last hit. For full-studies queries, you can retrieve at most 100 hits per request. Use these parameters to specify the rank range of the hits that you want to retrieve, e.g. 1-100, 101-200, and so on. Iterate over these ranges to extract all hits so they can be loaded into the SQLite database.
- `fmt`: format, specify as `"JSON"`
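
The rank ranges described for `min_rnk` and `max_rnk` can be generated up front. The sketch below assumes the 100-hit limit stated above; `rank_windows` is a hypothetical helper name, and the total of 250 hits is an arbitrary illustration value.

```python
from typing import Iterator, Tuple


def rank_windows(total_hits: int, page_size: int = 100) -> Iterator[Tuple[int, int]]:
    """Yield inclusive (min_rnk, max_rnk) pairs that cover hits 1..total_hits."""
    start = 1
    while start <= total_hits:
        # The last window is shortened so it never runs past the final hit
        end = min(start + page_size - 1, total_hits)
        yield start, end
        start = end + 1


windows = list(rank_windows(250))
print(windows)  # [(1, 100), (101, 200), (201, 250)]
```

Each pair can then be written into `params["min_rnk"]` and `params["max_rnk"]` before sending the request.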

## 2. Multiple Field Search (RECOMMENDED)

A bare-bones multi-field search request looks like this:

```python
url="https://ClinicalTrials.gov/api/query/full_studies"
params={
    "expr": "AREA[NCTId]NCT04320615",
    "min_rnk": 1,
    "max_rnk": 1,
    "fmt": "JSON"
}

data=requests.get(url, params=params).json()
```

Parameters:
- `expr`: use an expression string. For how to structure the string, see [ref on logical operators](https://clinicaltrials.gov/api/gui/ref/expr).

  A simple example of the string query:
  ```
  AREA[InterventionDescription]antibody  NOT AREA[InterventionType]diagnostic
  ```

  Note that regardless of the hierarchy of the field in the JSON, you can search the field directly using the string query above. The API also appears to match some synonymous terms, e.g. COVID-19 will automatically cover covid19, etc.
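
Long expression strings are easier to maintain when assembled from (search area, term) pairs. The sketch below only reproduces the `AREA[...]`, `AND`, `OR`, and `NOT` patterns that appear in this notebook; `area` and `build_expr` are hypothetical helper names, and the helper makes no claim about operator precedence beyond what the expression reference linked above describes.

```python
from typing import Iterable, Tuple

Clause = Tuple[str, str]  # (search area, search term)


def area(field: str, term: str) -> str:
    """Format one clause, e.g. AREA[Condition]COVID-19."""
    return f"AREA[{field}]{term}"


def build_expr(
    required: Iterable[Clause],
    any_of: Iterable[Clause] = (),
    excluded: Iterable[Clause] = (),
) -> str:
    """Join clauses as: required AND (any_of OR ...) NOT excluded NOT ..."""
    required = list(required)
    any_of = list(any_of)
    if not required:
        raise ValueError("at least one required clause is needed")
    parts = [" AND ".join(area(f, t) for f, t in required)]
    if any_of:
        parts.append("AND (" + " OR ".join(area(f, t) for f, t in any_of) + ")")
    parts.extend("NOT " + area(f, t) for f, t in excluded)
    return " ".join(parts)


expr = build_expr(
    required=[("Condition", "COVID-19")],
    any_of=[("InterventionDescription", "antibody"), ("InterventionName", "antibody")],
    excluded=[("InterventionType", "diagnostic")],
)
print(expr)
```

The resulting string can be passed as the `expr` parameter in the request shown above.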

## 3. How many hits are there in my query?

The number of hits matching your query is stored in the `data['FullStudiesResponse']['NStudiesFound']` field of the response. For example:

```python

url="https://ClinicalTrials.gov/api/query/full_studies"

params={
    "expr": "AREA[Condition]COVID-19",
    "fmt": "JSON",
    "min_rnk": 1,
    "max_rnk": 1

}
data=requests.get(url, params=params).json()

# total number of trials in the database
print("Total number of studies: ", data['FullStudiesResponse']['NStudiesAvail'])

# total number of trials found matching your query
print("Total number of trials matching query: ", data['FullStudiesResponse']['NStudiesFound'])

```

# Example Code

## Imports

In [1]:
import requests
import sqlite3
import pandas as pd
from datetime import datetime

## Generate a dataframe to save output

In [2]:
columns = ['category', 'nctid', 'name', 'phase', 'summary']
df = pd.DataFrame(columns=columns)

## Base url

In [3]:
url="https://ClinicalTrials.gov/api/query/full_studies"

## Get nctid, mabs, and trial phases from entries currently in the database

In [4]:
# connect to SQLITE database
conn = sqlite3.connect('covid_mabs.sqlite')
cur = conn.cursor()

# get a list of nctid, its corresponding mabs and clinical trial phases currently in the database
nctidList, mabList, phaseList = [], [], []
cur.execute("select nctid, name, phase from covid_clinical_trials_API")
for item in cur.fetchall(): # a list of tuples
    nctidList.append(item[0])
    mabList.append(item[1])
    phases = sorted([p.strip() for p in item[2].split(',')])
    phaseList.append(phases)
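
The phase normalization in the loop above can be isolated into a function, which also makes it easy to guard against empty values. This is a sketch only: `parse_phases` is a hypothetical helper, and treating an empty or NULL `phase` column as "no phases" is an assumption about how the table may be populated.

```python
from typing import List, Optional


def parse_phases(phase_field: Optional[str]) -> List[str]:
    """Split a comma-separated phase string into a sorted list of phases."""
    if not phase_field:
        # Covers both None (NULL in SQLite) and the empty string
        return []
    return sorted(p.strip() for p in phase_field.split(",") if p.strip())


print(parse_phases("Phase 2, Phase 1"))  # ['Phase 1', 'Phase 2']
print(parse_phases(None))                # []
```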

## Update clinical trial phase for entries in the current database

In [5]:
# define function to get clinical trial phase using nctid
def get_trial_phase(nctID):
    """
    Retrieve the trial phase(s) for an NCT ID. If the phase cannot be extracted, an error message is printed and None is returned.
    :param nctID: string, NCT identifier
    :return: list of clinical trial phase(s), such as ['Phase 1', 'Phase 2'], or None on failure
    """
    params={
        "expr": "AREA[NCTId]" + nctID,
        "min_rnk": 1,
        "max_rnk": 1,
        "fmt": "JSON"
    }

    data=requests.get(url, params=params).json()
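    try:
        phaseList=data['FullStudiesResponse']['FullStudies'][0]['Study']['ProtocolSection']['DesignModule']['PhaseList']['Phase']
        return sorted(phaseList)
    except (KeyError, IndexError, TypeError):
        # a missing field or an empty result list ends up here; the function then returns None
        print("\nExtraction error occured for NCTId = " + nctID +'\n')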

In [6]:
# print trials whose clinical trial phases differ in the API and db
for nctid, mab, old_phase in zip(nctidList, mabList, phaseList):
    new_phase = get_trial_phase(nctid)
    if new_phase is None:
        continue
    elif new_phase == ['Not Applicable']:
        continue
    elif all(item in old_phase for item in new_phase):
        continue
    else:
        print(nctid)
        print("Old phase: ", old_phase)
        print("New phase: ", new_phase )
        print("\n")
        
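        new_item = {'category': 'need to update phase',
                    'nctid': nctid,
                    'name': mab,  # mab is already a string; set(mab) would split it into single characters
                    'phase': ', '.join(new_phase),
                    'summary': ''}
        # DataFrame.append was removed in pandas 2.0, so use pd.concat instead
        df = pd.concat([df, pd.DataFrame([new_item])], ignore_index=True)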


Extraction error occured for NCTId = NCT04346277
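
The comparison rule used in the loop above (skip failed lookups, skip 'Not Applicable', and flag a trial only when the API reports a phase that the database does not have) can be written as one function. This is a sketch of the same rule; `phase_needs_update` is a hypothetical name.

```python
from typing import List, Optional


def phase_needs_update(old_phases: List[str], new_phases: Optional[List[str]]) -> bool:
    """Return True when the API lists at least one phase missing from the database."""
    if new_phases is None or new_phases == ["Not Applicable"]:
        return False
    # Same test as all(item in old_phases for item in new_phases), negated
    return not set(new_phases).issubset(old_phases)


print(phase_needs_update(["Phase 1", "Phase 2"], ["Phase 2"]))  # False
print(phase_needs_update(["Phase 1"], ["Phase 1", "Phase 2"]))  # True
```

Note that this rule is one-directional: a database entry that lists more phases than the API does is not flagged.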



## Query Version 1

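Search for "mab" in InterventionName. Note that this produces a narrower result than searching for "mab" in "DetailedDescription" or "BriefSummary".

Also note that putting `AREA[InterventionName]mab` directly into the query might miss many hits. The query below therefore does not include it, and the "mab" substring match on intervention names is done in Python instead.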
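
The substring filter on intervention names can be sketched as a standalone function. Two things here are assumptions rather than the notebook's behavior: the match is made case-insensitive (the cells below use a case-sensitive `"mab" in ...` test), and missing modules are treated as "no interventions". `mab_intervention_names` is a hypothetical helper, and `protocol_section` is a hand-written stand-in for a study's `ProtocolSection`.

```python
from typing import List


def mab_intervention_names(protocol_section: dict) -> List[str]:
    """Return the unique intervention names containing 'mab', ignoring case."""
    interventions = (
        protocol_section.get("ArmsInterventionsModule", {})
        .get("InterventionList", {})
        .get("Intervention", [])
    )
    names = {
        item["InterventionName"]
        for item in interventions
        if "mab" in item.get("InterventionName", "").lower()
    }
    return sorted(names)


# Hand-written stand-in record (assumption, not real API output)
protocol_section = {
    "ArmsInterventionsModule": {
        "InterventionList": {
            "Intervention": [
                {"InterventionName": "Tocilizumab"},
                {"InterventionName": "Placebo"},
                {"InterventionName": "SARILUMAB"},
                {"InterventionName": "Tocilizumab"},
            ]
        }
    }
}

print(mab_intervention_names(protocol_section))  # ['SARILUMAB', 'Tocilizumab']
```

A substring match like this is a coarse filter, so manual review of the hits is still needed.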

### 1. Get total number of hits

In [7]:
# parameters for query

params={
    "expr": "AREA[Condition]COVID-19 NOT AREA[InterventionType]diagnostic NOT AREA[InterventionType]convalescent NOT AREA[InterventionType]plasma NOT AREA[OfficialTitle]convalescent NOT AREA[OfficialTitle]seroprevalence NOT AREA[OfficialTitle]serological NOT AREA[OfficialTitle]plasma NOT AREA[OfficialTitle]test NOT AREA[OfficialTitle]prevalence NOT AREA[OfficialTitle]polyvalent",
    "fmt": "JSON",
    "min_rnk": 1,  # starting index number (default 1, don't put 0 here)
    "max_rnk": 1   # retrieve 1 record, so we can obtain the value of totalMatches
}

response=requests.get(url, params=params)
data=response.json()
totalMatches=data['FullStudiesResponse']['NStudiesFound']

totalMatches

2124

In [8]:
# use i to iterate through the indexes of hits
i=1

# use counter to record the number of "FILTERED" hits. We're filtering using our customized script
counter=0

while i<= totalMatches:
    min_rnk=i
    if min_rnk+99 <= totalMatches:  #returns max 100 hits per query
        max_rnk=min_rnk+99
    else:
        max_rnk=totalMatches
    params['min_rnk']=min_rnk
    params['max_rnk']=max_rnk
    i+=100
    data=requests.get(url, params=params).json()

    for study in data['FullStudiesResponse']['FullStudies']:
        try:

            # dict of fields that are most relevant to what we want
            protSelect=study['Study']['ProtocolSection']

            # check if nctid already exists in the database
            nctInDb=protSelect['IdentificationModule']['NCTId'] in nctidList

            # only process contents if nctId is not currently in the database
            if not nctInDb:
                # if a match is found
                found=False

                # make a list of intervention names (there might be multiple mab drugs in 1 study)
                mabs=[]
                # make a list of phases (might be multiple phases in 1 study)
                phases=[]

                # iterate through a list of dictionaries of interventions (this includes BOTH cases where
                # intervention type = drug and intervention type = biological)
                for interv in protSelect['ArmsInterventionsModule']['InterventionList']['Intervention']:
                    if "mab" in interv['InterventionName']:
                        found=True
                        mabs.append(interv['InterventionName'])

                # if found, iterate through a list of phases
                if found:
                    for p in protSelect['DesignModule']['PhaseList']['Phase']:
                        phases.append(p)

                # print a summary of results
                if found:
                    print(protSelect['IdentificationModule']['NCTId'])
                    print(list(set(mabs)))
                    print(phases)
                    try:
                        print(protSelect['StatusModule']['OverallStatus'])
                    except:
                        pass
                    print(protSelect["DescriptionModule"]["BriefSummary"])
                    print("\n\n")
                    counter +=1
                    
                    new_item = {'category': 'new hit from stringent search', 
                                'nctid': protSelect['IdentificationModule']['NCTId'], 
                                'name': ', '.join(list(set(mabs))), 
                                'phase': ', '.join(phases), 
                                'summary': protSelect["DescriptionModule"]["BriefSummary"]}
                    # DataFrame.append was removed in pandas 2.0, so use pd.concat instead
                    df = pd.concat([df, pd.DataFrame([new_item])], ignore_index=True)


        # if a field is missing from this study (e.g. no PhaseList), skip the study
        except (KeyError, IndexError, TypeError):
            pass

print("A total of ", counter, " matching entries were retrieved.")

NCT04479358
['Tocilizumab']
['Phase 2']
Not yet recruiting
COVID-19's high mortality may be driven by hyperinflammation. Interleukin-6 (IL-6) axis therapies may reduce COVID-19 mortality. Retrospective analyses of tocilizumab in severe to critical COVID-19 patients have demonstrated survival advantage and lower likelihood of requiring invasive ventilation following tocilizumab administration. The majority of patients have rapid resolution (i.e., within 24-72 hours following administration) of both clinical and biochemical signs (fever and CRP, respectively) of hyperinflammation with only a single tocilizumab dose.

The investigators hypothesized that a dose of tocilizumab significantly lower than the EMA- and FDA-labeled dose (8mg/kg) as well as the emerging standard of care dose (400mg) may be effective in patients with COVID-19 pneumonitis and hyperinflammation. Advantages to the lower dose of tocilizumab may include lower likelihood of secondary bacterial infections as well as exten

NCT04409262
['Tocilizumab']
['Phase 3']
Recruiting
This study will evaluate the efficacy and safety of combination therapy with remdesivir plus tocilizumab compared with remdesivir plus placebo in hospitalized patients with COVID-19 pneumonia.



NCT04372186
['Tocilizumab']
['Phase 3']
Active, not recruiting
This study will evaluate the efficacy and safety of tocilizumab (TCZ) compared with a placebo in combination with standard of care (SOC) in hospitalized participants with COVID-19 pneumonia.



NCT04363736
['Tociliuzumab']
['Phase 2']
Completed
This study will assess the pharmacodynamics, pharmacokinetics, safety and efficacy of two different doses of tocilizumab (TCZ) in combination with standard-of-care (SOC) in hospitalized adult participants with moderate to severe COVID-19 pneumonia.



NCT04346355
['Tocilizumab']
['Phase 2']
Terminated
The clinical study aims at assessing whether early administration of Tocilizumab compared to late administration of Tocilizumab can reduce the

NCT04412291
['Tocilizumab Prefilled Syringe']
['Phase 2']
Recruiting
The study is designed as a randomized, controlled, single-center open-label trial to compare standard-of-care (SOC) treatment with SOC + anakinra or SOC + tocilizumab treatment in hospitalized adult subjects who are diagnosed with severe COVID 19.

Arm A: Standard-of-care Treatment (SOC) Arm B: Anakinra + SOC Arm C: Tocilizumab + SOC.

All subjects will be treated with standard-of-care treatment and broad spectrum antibiotics initiated before or latest 24 hours after initiation of treatment with study drug.

The primary follow up period of the study is 29 days



NCT04357808
['Sarilumab']
['Phase 2']
Recruiting
The global health emergency created by the rapid spread of the SARS-CoV-2 coronavirus has pushed healthcare services to face unprecedent challenges to properly manage COVID-19 severe and critical manifestations affecting a wide population in a short period of time. Clinicians are committed to do their best with

NCT04335071
['Tocilizumab (TCZ)']
['Phase 2']
Recruiting
The mortality rate of the disease caused by the corona virus induced disease (COVID-19) has been estimated to be 3.7% (WHO), which is more than 10-fold higher than the mortality of influenza. Patients with certain risk factors seem to die by an overwhelming reaction of the immune system to the virus, causing a cytokine storm with features of Cytokine-Release Syndrome (CRS) and Macrophage Activation Syndrome (MAS) and resulting in Acute Respiratory Distress Syndrome (ARDS). Several pro-inflammatory cytokines are elevated in the plasma of patients and features of MAS in COVID-19, include elevated levels of ferritin, d-dimer, and low platelets.

There is increasing data that cytokine-targeted biological therapies can improve outcomes in CRS or MAS and even in sepsis. Tocilizumab (TCZ), an anti-IL-6R biological therapy, has been approved for the treatment of CRS and is used in patients with MAS. Based on these data, it is hypothesize

A total of  50  matching entries were retrieved.


## Query Version 2

In [9]:
params={
    "expr": "AREA[Condition]COVID-19 AND (AREA[InterventionDescription]antibody OR AREA[InterventionName]antibody OR AREA[OfficialTitle]antibody) NOT AREA[InterventionType]diagnostic NOT AREA[InterventionType]convalescent NOT AREA[InterventionType]plasma NOT AREA[OfficialTitle]convalescent NOT AREA[OfficialTitle]seroprevalence NOT AREA[OfficialTitle]serological NOT AREA[OfficialTitle]ivig NOT AREA[OfficialTitle]plasma NOT AREA[OfficialTitle]test NOT AREA[OfficialTitle]prevalence NOT AREA[OfficialTitle]polyvalent NOT AREA[OfficialTitle]vaccine",
    "fmt": "JSON",
    "min_rnk": 1,  #starting index number (default 1, don't put 0 here)
    "max_rnk": 1   #max is 100 for full studies. You have to run another query starting at 101 if needed

}

response=requests.get(url, params=params)
data=response.json()
totalMatches=data['FullStudiesResponse']['NStudiesFound']

totalMatches

41

In [10]:
# a counter to keep track of the number of FILTERED hits that we get
counter=0
i=1

while i<= totalMatches:
    min_rnk=i
    if min_rnk+99 <= totalMatches:
        max_rnk=min_rnk+99
    else:
        max_rnk=totalMatches
    params['min_rnk']=min_rnk
    params['max_rnk']=max_rnk
    i+=100
    data=requests.get(url, params=params).json()

    for study in data['FullStudiesResponse']['FullStudies']:
        covered=False
        phases=[]
        nctID='unknown'  # fallback so the error message below never uses a stale or undefined ID

        try:
            # dict of fields that are most relevant to what we want
            protSelect=study['Study']['ProtocolSection']
            idModule=protSelect['IdentificationModule']

            # check if nctid already exists in the database
            nctID=idModule['NCTId']
            if nctID.upper() in nctidList:
                continue

            # studies with a "mab" intervention are already covered by Query Version 1
            for interv in protSelect['ArmsInterventionsModule']['InterventionList']['Intervention']:
                if "mab" in interv['InterventionName']:
                    covered=True

            if not covered:
                for p in protSelect['DesignModule']['PhaseList']['Phase']:
                    phases.append(p)

                print(nctID)
                print(phases)
                #print(protSelect['ArmsInterventionsModule']['InterventionList']['Intervention'][0]['InterventionName'])
                print(idModule['OfficialTitle'])
                print("\n\n")
                counter=counter+1

                # record only the studies counted above as new hits
                new_item = {'category': 'new hit from relaxed search',
                            'nctid': nctID,
                            'name': '',
                            'phase': ', '.join(phases),
                            'summary': idModule['OfficialTitle']}
                # DataFrame.append was removed in pandas 2.0, so use pd.concat instead
                df = pd.concat([df, pd.DataFrame([new_item])], ignore_index=True)

        except (KeyError, IndexError, TypeError):
            print("Extraction error occured for NCTId = " + nctID +'\n')


print("A total of ", counter, " matching entries were retrieved.")

NCT04441918
['Phase 1']
A Randomized, Double-blind, Placebo-controlled, Phase I Clinical Study to Evaluate the Tolerability, Safety, Pharmacokinetic Profile and Immunogenicity of JS016 (Anti-SARS-CoV-2 Monoclonal Antibody) Injection in Chinese Healthy Subjects After Intravenous Infusion of Single Dose



NCT04514302
['Phase 1', 'Phase 2']
Pilot Study to Evaluate Safety and Efficacy of Anti-SARS-CoV-2 Equine Immunoglobulin F(ab')2 Fragments (INOSARS) in Hospitalized Patients With COVID-19



Extraction error occured for NCTId = NCT04542200

Extraction error occured for NCTId = NCT04444310

Extraction error occured for NCTId = NCT04537338

Extraction error occured for NCTId = NCT04367402

NCT04555148
['Phase 2']
A Prospective, Open-label, Randomized, Multi-center, Phase 2a Study to Evaluation the Dose Response, Efficacy and Safety of Hyper-Ig (Hyper-immunoglobulin) GC5131 in Patients With COVID-19



Extraction error occured for NCTId = NCT04483622

NCT04551898
['Phase 2']
A Phase 2, Ran

## Write output into a CSV file for further review

In [11]:
date = datetime.today().strftime('%Y-%m-%d')
outfile = "mabs_update_from_API." + date + ".csv"
df.to_csv(outfile, index=False)