## Prepare Raw Data for Use in NLP Pipeline

### Written By: Robert Thombley, UCSF (7/19/2018)
Before we begin our classification task, we need to gather the raw data and put it into a form that is easily accessible, properly formatted, and convenient to work with in Python. All of the raw data for this project lives at https://clinicaltrials.ucsf.edu/browse/. We will use the BeautifulSoup Python library to scrape this website and harvest all of the potentially informative text from each clinical trial's description.
    
This website is organized so that most clinical trials are categorized by major body system or condition, then subcategorized into sub-conditions. The conditions are listed as headings, with each sub-condition listed as a link underneath its heading. Following a sub-condition link leads to a list of all clinical trials in that condition/sub-condition bucket. Each listed trial links to its description page, which holds the text that we are most interested in.
    
As we are building our data structure without a great sense of the features we are interested in modeling, we will have to make some assumptions about the schema. Recognize that we may make a mistake here and leave out an important variable, but we'll have to cross that bridge when we get there.  For now, it seems to me that these are the interesting data points:
* Condition name/heading: This is the bucket name into which we will try to categorize the "other" trials. It is a critical value and we should be able to easily subset our data structure using this key (Hint: This means we should probably be using a dictionary/hash table as our main data structure)
* Sub condition name: We won't use this as a categorization label, but it may be a valuable piece of information.
* Sub condition link: It might be useful to retain this, but it isn't super critical to keep around.
* Number of trials: Could give a sense of the relative importance of sub-conditions, but it's probably not that useful.
* Clinical Trial name: This is the actual title of the trial and is an important piece of text data.
* Clinical Trial description page link: This will be important to retain so that we can easily add more data later
* Clinical Trial study text: This is the actual text from the study page. I made some assumptions about what is useful and what is not, but this is our payload and the content we actually care about.

So, given that the available data is of heterogeneous types (i.e., objects, strings, etc.) and that we need to be able to easily subset by the condition heading (e.g., 'cancer', 'eyes and vision'), this points to using a dictionary of lists as our data structure: each condition key maps to a list of trial records.
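As a minimal sketch of that structure, the snippet below uses the record layout that the scraping loop later in this notebook appends (condition key, sub-condition name, trial title, trial link, list of text lines). The values are copied from the sample output further down and abbreviated. The empty `eyes_and_vision` entry is only a placeholder for illustration.

```python
# Sketch of the target structure: condition key -> list of trial records.
# Each record is a list:
#   [condition key, sub-condition name, trial title, trial page link, text lines]
data = {
    "cancer": [
        [
            "cancer",
            "Acute Lymphoblastic Leukemia",
            "A Multicenter Access and Distribution Protocol for Unlicensed "
            "Cryopreserved Cord Blood Units (CBUs)",
            "https://clinicaltrials.ucsf.edu/trial/NCT01351545",
            ["Summary", "Official Title", "Details"],  # abbreviated text lines
        ],
    ],
    "eyes_and_vision": [],  # placeholder: would be filled the same way
}

# Subsetting by condition is a single dictionary lookup.
for record in data["cancer"]:
    condition, sub_condition, title, link, text_lines = record
    print(condition, "|", sub_condition, "|", link)
```

Keying on the condition name keeps the later classification step simple, since all trials for one label come back from a single lookup.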

Ok! Let's build it.
Start by importing the required packages:

In [1]:
import requests # Performs HTTP requests for returning webpage data
from bs4 import BeautifulSoup # Web Scraping library
import json # Allows us to save python objects in text format
import re # Regular expressions library

Define the file location on disk for storing the raw data:

In [2]:
data_file = '/path/to/data_repository/clinical_trial_data.json'

We are going to work backwards a little bit.  Let's create a function that we can call with the clinical trial description page link.  This will allow us to call this function each time we want to extract data from the study page, simplifying our scraping code.

Given an example trial description page, https://clinicaltrials.ucsf.edu/trial/NCT02548598, I only want to extract the Description and Eligibility sections, since those seem to hold the information that is actually condition-related.

Each of the sections (title, Description, Eligibility, etc.) is a <div> that is a member of the 'show-jargon-definitions' CSS class. This function finds all instances of this class in the BeautifulSoup object and extracts the text within the 2nd and 3rd sections (indexes 1 and 2), 'Description' and 'Eligibility', respectively.

In [3]:
def getTextFromURL(url):
    """Return the text lines of the Description and Eligibility sections of a trial page."""
    page_obj = BeautifulSoup(requests.get(url).content, 'lxml')
    sections = page_obj.findAll('div', 'show-jargon-definitions')

    # Grab all text, storing each line as a new position in the text_out list.
    # We skip any lines that are empty or contain only whitespace.
    text_out = []
    for section in (sections[1], sections[2]):  # Description, then Eligibility
        text_out.extend(line for line in section.findAll(text=True) if line.strip())
    return text_out
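To see what this extraction does without hitting the network, here is a small sketch that runs the same idea on an inline HTML string. The markup is a hypothetical, heavily simplified stand-in for a trial page, not the site's real HTML, and it uses Python's built-in `html.parser` so that `lxml` is not required.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for a trial page (assumption: not the real markup).
html = """
<div class="show-jargon-definitions"><h1>Example trial title</h1></div>
<div class="show-jargon-definitions"><h2>Description</h2><p>Summary text.</p></div>
<div class="show-jargon-definitions"><h2>Eligibility</h2> <p>Adults 18 and older.</p></div>
"""

soup = BeautifulSoup(html, "html.parser")
sections = soup.find_all("div", "show-jargon-definitions")

text_out = []
for section in sections[1:3]:  # indexes 1 and 2: Description and Eligibility
    # find_all(string=True) yields every text node; drop whitespace-only ones.
    text_out.extend(s.strip() for s in section.find_all(string=True) if s.strip())

print(text_out)
```

The title section at index 0 is skipped, and the lone space between the `<h2>` and `<p>` tags in the last section is filtered out.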

Now let's start back at the main page and use the requests library to get all of the HTML from the main page.

In [4]:
base_url = "https://clinicaltrials.ucsf.edu/browse/"
urldat = requests.get(base_url)  # Fetch the browse page
page_content = urldat.content

Convert the HTML data into a BeautifulSoup object and identify all of the headings that are instances of the 'browse-condition-cluster--block' CSS class. These are our conditions.

In [5]:
soup = BeautifulSoup(page_content, "lxml")
headings = soup.find_all("div", "browse-condition-cluster--block")
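As a rough illustration of that lookup, the sketch below applies the same `find_all` call to a hypothetical snippet that imitates the browse page. The real markup may differ. It also derives dictionary keys using the same normalization as the scraping loop further down (lower case, commas removed, spaces replaced with underscores).

```python
from bs4 import BeautifulSoup

# Hypothetical snippet imitating the browse page layout (assumption).
html = """
<div class="browse-condition-cluster--block"><h2> Cancer </h2></div>
<div class="browse-condition-cluster--block"><h2>Eyes and Vision</h2></div>
<div class="unrelated"><h2>Footer</h2></div>
"""

soup = BeautifulSoup(html, "html.parser")

# A string as the second positional argument filters on the CSS class.
headings = soup.find_all("div", "browse-condition-cluster--block")

names = [h.h2.get_text(strip=True) for h in headings]
keys = [n.lower().replace(",", "").replace(" ", "_") for n in names]

print(names)
print(keys)
```

Only the two `<div>` elements carrying the class are returned. The unrelated block is ignored.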

This section builds our dataset. Warning: the full run takes a long time (> 10 min) and creates a ~25 MB dataset, so I've added 3 "ejection commands" that cut it short. Remove them to create the full dataset.

In [8]:
data = {}
for heading in headings: # Iterate over all the headings on the browse page
    # In order to be used as a key, we need to remove commas, replace spaces with underscores, 
    # and convert the text to all lower case.
    
    key = heading.h2.contents[0].strip().lower().replace(',','').replace(' ','_')  
    data[key] = [] 
    
    # Beautiful soup sometimes parses pages in a tricky way. Often you will have to experiment to see 
    # what works best for accessing the data you want. Here we find all of the <span> tags that are 
    # children of the current <div class='browse-condition-cluster--block'> (ie -heading)
    
    for sub_condition_span in heading.div.findAll("span"):
        if sub_condition_span.a: # We only care about the span if it contains a link.
            # Extract the name and the link for each subcondition.
            sub_condition_name = sub_condition_span.findAll("span", "browse-condition-cluster--condition-name")[0].string
            sub_link = sub_condition_span.a.attrs['href']
            
            # Print sub_condition_name and link to keep track of where we are
            print("{} => {}".format(sub_condition_name, sub_link))
            
            # Load the sub-condition page into a Beautiful Soup object
            sub_page_dat = BeautifulSoup(requests.get(base_url + sub_link).content, 'lxml')
            
            # Trials are displayed as an unordered list with the CSS class 'list-unstyled'
            # However, there are some <li> tags we don't care about, so we 
            # need to filter by list items that include the text 'trials-list' somewhere in them.
            # We need to use a regex to complete this filter.
            
            trial_list = sub_page_dat.find("ul", "list-unstyled").findAll('li', re.compile("trials-list"))
            
            # For every trial we identify, extract relevant information and call the getTextFromURL() function
            for trial_num, trial in enumerate(trial_list):
                study_name = trial.h2.a.string
                study_link = trial.h2.a.attrs['href']
                study_data = getTextFromURL(study_link)
                print("{}: {}".format(trial_num+1, study_name))
                
                # Store all of the data into our data structure
                data[key].append([key, sub_condition_name, study_name, study_link, study_data])
                
                # EJECTION command #1: exit the loop after the second trial (trial_num == 1) so that this runs quickly.
                # Delete this if you want to generate the full dataset
                if trial_num > 0:
                    break
                    
            # EJECTION command #2: stop scanning sub-conditions once this heading has more than one trial
            if len(data[key]) > 1:
                break

    # EJECTION command #3: stop after three headings have been collected
    if len(data.keys()) > 2: 
        break
# Write the data file to disk. I like to serialize objects (ie- write objects to disk in a way that makes them easier
# to load back into memory) using json. You could try this with pickle or some other serialization library.

with open(data_file, 'w') as fp:
    json.dump(data, fp)

Acute Lymphoblastic Leukemia => ../acute-lymphoblastic-leukemia
1: A Multi-Center Study Evaluating KTE-C19 in Pediatric and Adolescent Subjects With Relapsed/Refractory B-precursor Acute Lymphoblastic Leukemia
2: A Multicenter Access and Distribution Protocol for Unlicensed Cryopreserved Cord Blood Units (CBUs)
Acute Coronary Syndrome => ../acute-coronary-syndrome
1: Tailored Antiplatelet Therapy Following PCI
Acute Myocardial Infarction => ../acute-myocardial-infarction
1: Prospective ARNI vs ACE Inhibitor Trial to DetermIne Superiority in Reducing Heart Failure Events After MI
Dislocation of Shoulder Region => ../dislocation-of-shoulder-region
1: MOON Shoulder Instability-Cohort of Patients Undergoing Operative Treatment.
Fibrodysplasia Ossificans Progressiva => ../fibrodysplasia-ossificans-progressiva
1: An Efficacy and Safety Study of Palovarotene for the Treatment of FOP
2: In-Home Evaluation of Episodic Administration of Palovarotene in Fibrodysplasia Ossificans Progressiva (FOP)
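To check that the saved file can be read back, a round trip along these lines may help. This is a sketch under stated assumptions: it writes a tiny stand-in dataset to a temporary file rather than to `data_file`, and "Trial title A" is a placeholder, not a real trial title.

```python
import json
import os
import tempfile

# Tiny stand-in dataset using the same record layout as the scraping loop.
data = {
    "cancer": [
        ["cancer", "Acute Lymphoblastic Leukemia", "Trial title A",  # placeholder title
         "https://clinicaltrials.ucsf.edu/trial/NCT01351545", ["Summary"]],
    ],
}

# A temporary directory keeps this sketch from touching the real data file.
with tempfile.TemporaryDirectory() as tmp_dir:
    tmp_path = os.path.join(tmp_dir, "clinical_trial_data.json")
    with open(tmp_path, "w") as fp:
        json.dump(data, fp)
    with open(tmp_path) as fp:
        reloaded = json.load(fp)

print(reloaded == data)
print(len(reloaded["cancer"]))
```

Because the structure uses only dictionaries, lists, and strings, JSON preserves it exactly. Tuples or custom objects would not survive the round trip unchanged.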

In [10]:
data['cancer'][1] # Demo the format for the 2nd trial in the 'cancer' category

['cancer',
 'Acute Lymphoblastic Leukemia',
 'A Multicenter Access and Distribution Protocol for Unlicensed Cryopreserved Cord Blood Units (CBUs)',
 'https://clinicaltrials.ucsf.edu/trial/NCT01351545',
 ['Summary',
  'This study is an access and distribution protocol for unlicensed cryopreserved cord blood units (CBUs) in pediatric and adult patients with hematologic malignancies and other indications.',
  'Official Title',
  'A Multicenter Access and Distribution Protocol for Unlicensed Cryopreserved Cord Blood Units (CBUs) for Transplantation in Pediatric and Adult Patients With Hematologic Malignancies and Other Indications',
  'Details',
  'Principal Investigators:',
  'The principal investigators (PIs) will be transplant physicians at all participating U.S. transplant centers.',
  'Study Design:',
  'This study is an access and distribution protocol for unlicensed cryopreserved cord blood units (CBUs) in pediatric and adult patients with hematologic malignancies and other indicati