# PMA Dataset

This notebook contains code for retrieving PMA (premarket approval) statements that mention indications from OpenFDA, formatting them, and uploading them to a collection on PubAnnotation.

In [1]:
import pandas
import requests

#### Get Product Codes

There are too many PMAs to retrieve in a single query, so we group them by product code. We first get a list of product codes (with the number of PMAs for each code) from the pma endpoint.

In [2]:
response = requests.get('https://api.fda.gov/device/pma.json?count=product_code&limit=1000')
response_json = response.json()
product_codes = pandas.DataFrame(response_json['results'])
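
A count query returns a list of term/count pairs under the results key, which is why the DataFrame ends up with a term column. The sketch below uses made-up terms and counts (they are assumptions, not real product codes) to show that shape and to flag codes whose total exceeds a per-request limit. The count covers all PMAs for a code, so it is only an upper bound on how many records a keyword search might match.

```python
# Shape of an openFDA count response. The terms and counts here are
# invented for illustration; they are not real product codes.
sample_response = {
    'results': [
        {'term': 'AAA', 'count': 250},
        {'term': 'BBB', 'count': 80},
        {'term': 'CCC', 'count': 3},
    ]
}


def codes_over_limit(results, limit=100):
    """Return the terms whose total record count exceeds the per-request limit."""
    return [row['term'] for row in results if row['count'] > limit]


large_codes = codes_over_limit(sample_response['results'])
print(large_codes)  # ['AAA']
```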

#### Get Indication Statements

For each product code we fetch PMA records (at most 100 per code, the limit set in the request) where the keyword "indicated" or "intended" appears in the text of the approval statement summary. We then generate IDs from the PMA number and supplement number, convert summaries that are in ALL CAPS to sentence case (all-caps text can break NER software), and store the results in a dictionary keyed by ID. This is our corpus.
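
The ID and capitalization rules can be tried in isolation before running them against the API. This is a minimal sketch; the helper names make_pma_id and sentence_case are hypothetical and do not appear elsewhere in the notebook, and the sample statement is invented.

```python
def make_pma_id(pma_number, supplement_number):
    """Join PMA and supplement numbers, using 'S000' when the supplement is empty."""
    return f"{pma_number}_{supplement_number or 'S000'}"


def sentence_case(statement, sep='.  '):
    """Sentence-case an ALL CAPS statement; leave mixed-case text untouched."""
    if not statement.isupper():
        return statement
    # str.capitalize() uppercases the first character and lowercases the rest.
    return sep.join(part.capitalize() for part in statement.split(sep))


print(make_pma_id('P950021', ''))      # P950021_S000
print(sentence_case('APPROVAL FOR THE DEVICE.  IT IS INDICATED FOR USE.'))
# Approval for the device.  It is indicated for use.
```

Because capitalize() lowercases everything after the first character, acronyms and trade names lose their capitals (for example, PSA becomes psa).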

In [3]:
# openFDA treats '+' between search terms as OR
keywords = 'ao_statement:indicated+ao_statement:intended'

dataset = dict()
for term in product_codes['term']:
    url = f'https://api.fda.gov/device/pma.json?search=product_code:{term}+AND+({keywords})&limit=100'
    response = requests.get(url)
    response_json = response.json()
    # Skip product codes whose response has no 'results' key (no matching records)
    if 'results' not in response_json:
        continue
    for result in response_json['results']:
        pma_number = result['pma_number']
        supplement_number = result['supplement_number']
        # An empty supplement number gets the placeholder 'S000'
        if supplement_number == '':
            supplement_number = 'S000'
        pma_id = pma_number + '_' + supplement_number
        statement = result['ao_statement']
        # Sentence-case ALL CAPS statements, splitting on the '.  ' separator
        if statement.isupper():
            statement = '.  '.join([i.capitalize() for i in statement.split('.  ')])
        dataset[pma_id] = statement
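
Because each request is capped at 100 records, product codes with more matching PMAs are truncated. One possible way around this, sketched here as an assumption rather than something the notebook does, is to page through results with openFDA's skip parameter. The helper names build_search_url and iter_results are hypothetical, and fetch_json stands for any callable that takes a URL and returns the decoded JSON.

```python
def build_search_url(product_code, keywords, limit=100, skip=0):
    """Build a PMA search URL for one page of results."""
    return ('https://api.fda.gov/device/pma.json'
            f'?search=product_code:{product_code}+AND+({keywords})'
            f'&limit={limit}&skip={skip}')


def iter_results(fetch_json, product_code, keywords, limit=100):
    """Yield every matching record, requesting one page at a time."""
    skip = 0
    while True:
        page = fetch_json(build_search_url(product_code, keywords, limit, skip))
        # A response without 'results' is treated as an empty page.
        results = page.get('results', [])
        yield from results
        if len(results) < limit:  # a short (or empty) page is the last one
            break
        skip += limit


# Possible usage with requests (not run here):
#   fetch = lambda url: requests.get(url).json()
#   for result in iter_results(fetch, term, keywords): ...
```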

#### Upload to PubAnnotation

Finally, we upload all of our documents to PubAnnotation so that we can annotate them and make them available to anyone who would like to download them.

In [5]:
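import os

project_url = 'http://pubannotation.org/projects/blah6_medical_device/docs.json'
# Read credentials from environment variables (hypothetical names) instead of hard-coding them
auth = (os.environ['PUBANNOTATION_EMAIL'], os.environ['PUBANNOTATION_PASSWORD'])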
payload = {
    'sourceid' : '',
    'text' : ''
}

In [28]:
failed = []
for pma_id, pma_summary in dataset.items():
    payload['sourceid'] = pma_id
    payload['text'] = pma_summary
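    try:
        r = requests.post(project_url, data=payload, auth=auth)
    except requests.RequestException:
        # Keep a copy of the payload for any request that raised, so it can be retried
        failed.append(payload.copy())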
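
Note that requests.post only raises for connection-level problems; an HTTP error response (for example, rejected credentials) returns normally and would not land in the failed list. A minimal sketch of a loop that records both kinds of failure follows. The function name upload_all and the post callable are hypothetical; post is assumed to take a payload and return a response-like object with an ok attribute, as requests responses do.

```python
def upload_all(post, dataset):
    """Upload every document and return the payloads that did not succeed."""
    failed = []
    for pma_id, text in dataset.items():
        payload = {'sourceid': pma_id, 'text': text}
        try:
            response = post(payload)
        except OSError:
            # requests.RequestException derives from IOError (an alias of OSError),
            # so connection errors and timeouts are caught here.
            failed.append(payload)
            continue
        if not response.ok:  # HTTP 4xx/5xx responses do not raise on their own
            failed.append(payload)
    return failed


# Possible usage with requests (not run here):
#   failed = upload_all(
#       lambda p: requests.post(project_url, data=p, auth=auth), dataset)
```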

#### Conclusion

We have now uploaded a set of PMAs to PubAnnotation to annotate and share with NLP researchers who are interested in using NER tools to annotate indications. This is the first step in structuring the space of medical device indications.

In [29]:
failed

[{'sourceid': 'P950021_S007',
  'text': 'Approval for transfer of the assay to a new bayer platform, the advia centaur cp system.  The advia centaur cp psa assay is intended to quantitatively measure prostate-specific-antigen (psa) in human serum using the advia centaur cp system.'},
 {'sourceid': 'P930036_S003',
  'text': 'Approval for the acs:180 and the centaur afp assays on the advia centaur cp system.  The device, as modified, will be marketed under the trade name advia centaur cp afp and is indicated for in vitro diagnostic use in the quantitative determination of alpha-fetoprotein (afp) in 1) human serum and in amniotic fluid from specimens obtained at 15 and 20 weeks gestation, as an aid in detecting open neural tube defects (ntds) when used in conjunction with ultrasonography and amniography testing 2) human serum, as an aid in managing non-seminomatous testicular cancer when used in conjunction with physical examination, histology/pathology, and other clinical evaluation proc