**Patrick Tyson**

- **Description**: Write a program that maps drugs to related diagnoses

- **Drug Class Analyzed**: Recombinant Human Growth Hormone (abuse of HGH)

- **Data sources used**:
    - Diagnosis data: International Classification of Diseases, 10th Revision (ICD-10)
    - Drug data: National Drug Code (NDC)

- Program takes two input files, one for drugs and one for diagnoses
    - For each drug in the drug file, it pulls label information from the openFDA API

- Label and diagnosis text are then converted into sets of n-grams

     - The intersection of the sets is the wording shared between a drug and a diagnosis
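
As a rough illustration of the n-gram and intersection steps listed above, the sketch below builds bigram sets from a label phrase and a diagnosis description and intersects them. The sample strings, the tiny stop-word set, and the helper name `bigram_set` are assumptions made for this example only; the notebook itself uses the NLTK stop-word list and the classes defined further down.

```python
# Hypothetical stand-in for the NLTK stop-word list used later in the notebook.
SAMPLE_STOP_WORDS = {"for", "of", "the", "in", "with"}


def bigram_set(text):
    """Lower-case, drop stop words, and return the set of adjacent word pairs."""
    tokens = [w for w in text.lower().split() if w not in SAMPLE_STOP_WORDS]
    return {" ".join(pair) for pair in zip(tokens, tokens[1:])}


# Made-up sample phrases, not taken from a real label or code list.
label_text = "treatment of short stature in Turner syndrome"
diagnosis_text = "Other variants of Turner syndrome"

# The shared wording is whatever bigrams appear in both sets.
shared = bigram_set(label_text) & bigram_set(diagnosis_text)
print(shared)
```

Working with sets removes duplicate n-grams, and the matching then reduces to a single intersection per drug/diagnosis pair.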

In [1]:
import pandas as pd
import requests
import csv
import time
import regex

import nltk

## Getting Stop Words for text mining

In [2]:
nltk.download('stopwords')
stop_words = set(nltk.corpus.stopwords.words('english'))
stop_words.add('due')
stop_words.add('age')
stop_words.add('associated')
stop_words.add('onset')
stop_words.add('secondary')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Patrick\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## Getting tokenize ability for strings

In [3]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Patrick\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

### Creating Text Mining Class

In [4]:
class Text_Mining_String:
    # Construct a text mining string object
    def __init__(self, text_data=''):
        self.__text_data = text_data

    # getter for the text data
    def get_text_data(self):
        return self.__text_data

    # setter for the text data
    def set_text_data(self, text_data):
        self.__text_data = text_data

    # functions for text mining
    # returns text source in lower case
    def lower_case(self):
        low = str(self.__text_data)
        low = low.lower()
        return low

    # returns text with punctuation removed
    def remove_punctuation(self, string_data):
        if not string_data:
            return 'NA'
        else:
            r_p = str(string_data)
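            # removing possessive 's before stripping punctuation
            r_p = regex.sub("('s)", '', r_p)
            # replacing anything that is not a letter, digit or whitespace with a space
            no_punct = regex.sub(r'[^0-9A-Za-z\s]', ' ', r_p)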
            no_punct = ' '.join(no_punct.split())
            return no_punct

    # creates string for each word
    def tokenize(self, string_data):
        if not string_data:
            return 'NA'
        else:
            tokens = [tk for tk in string_data.split(' ') if tk != ""]
            return tokens

    # removes common english words that mostly create noise for model
    def remove_stopwords(self, string_data):
        if not string_data:
            return 'NA'
        else:
            # list comprehension to create list of non-stop words
            tokens_no_stop = ([w for w in string_data if w not in stop_words])
            return tokens_no_stop

### Subclass of Text Mining String, adding ability to create n-Grams

In [5]:
class N_Grams(Text_Mining_String):
    def __init__(self, ngrams, text_data):
        # inheriting all Text Mining String arguments, adding ngram argument
        # text_data is stored by the parent constructor
        super().__init__(text_data)
        self.__ngrams = ngrams

    def get_ngrams(self):
        return self.__ngrams

    def set_ngrams(self, ngrams):
        self.__ngrams = ngrams

    # generating strings of n-values to match other strings
    def generate_ngrams(self, string_data):
        if not string_data:
            return 'NA'
        else:
            # getting n of grams from class object
            num = self.get_ngrams()
            # zipping n-number of words together to form n-gram
            if len(string_data) == 1:
                return string_data
            else:
                n_g = zip(*[string_data[i:] for i in range(num)])
                # joining as lists
                ngrams_out = [" ".join(ngram) for ngram in n_g]
                return ngrams_out
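
The zip-based construction used in `generate_ngrams` can be tried on its own. The sketch below is a minimal standalone version; the function name `ngrams_from_tokens` and the sample tokens are assumptions for illustration and are not part of the classes above.

```python
def ngrams_from_tokens(tokens, n):
    """Join each run of n consecutive tokens into one space-separated string."""
    # tokens[i:] shifts the list by i positions; zip stops at the shortest slice.
    shifted = [tokens[i:] for i in range(n)]
    return [" ".join(group) for group in zip(*shifted)]


tokens = ["newborn", "small", "gestational", "weight"]
print(ngrams_from_tokens(tokens, 2))
print(ngrams_from_tokens(tokens, 3))
print(ngrams_from_tokens(tokens, 5))
```

When `n` is larger than the number of tokens, `zip` yields nothing and the result is an empty list, so short texts contribute no higher-order n-grams.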

## Possible future enhancement: add a boolean flag for reverse tokenization to account for differences in sentence structure

## Functions related to the API calls

### function to build out url for api requests

In [6]:
def url_output(url_beginning, query_data):
    # joining the base url and the query value into one request url
    return url_beginning + query_data

### Function for api call for set id

In [7]:
def fda_set_id_call(url_output):
    json_data = requests.get(url_output).json()

    # if statement to take care of any not found errors
    if json_data.get('error') is not None:
        return "Source Data Not Found"

    else:
        # try to find the relevant data using parsing structure
        # json data is series of lists and dictionaries
        # have to use get() and list indices to parse through data
        try:
            target = (json_data.get('results')[0]
                      .get('openfda')
                      .get('spl_set_id')[0])

            return target

        # if the parsing structure for relevant data is missing
        # (IndexError covers an empty results or spl_set_id list)
        except (AttributeError, TypeError, IndexError):
            # returning unique message for missing data
            return "Source Data Found, but Relevant Data Missing"
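
To exercise the parsing logic without calling the API, the nested lookup can be separated from the network request. The sketch below is one possible arrangement, not the notebook's own code: the helper name `extract_set_id` and the sample payloads are made up, and only the nesting (`results` → first item → `openfda` → `spl_set_id` → first item) is taken from the function above.

```python
def extract_set_id(json_data):
    """Return the first SPL set ID from an NDC-endpoint style payload, or None."""
    try:
        return json_data["results"][0]["openfda"]["spl_set_id"][0]
    except (KeyError, IndexError, TypeError):
        # Any missing level of the nested structure lands here.
        return None


# Made-up payloads that only mimic the nesting parsed above.
sample_ok = {"results": [{"openfda": {"spl_set_id": ["example-set-id"]}}]}
sample_missing = {"results": [{"openfda": {}}]}
sample_empty = {"results": []}

for payload in (sample_ok, sample_missing, sample_empty):
    print(extract_set_id(payload))
```

Returning `None` instead of a message string is a design choice of this sketch; it keeps "no value" distinguishable from any real set ID.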

### Function for api call for label

In [8]:
def fda_label_call(url_output):
    json_lbl = requests.get(url_output).json()

    # if statement to take care of any not found errors
    if json_lbl.get('error') is not None:
        return "Source Data Not Found"

    else:
        # try to find the relevant data using parsing structure
        # json data is series of lists and dictionaries
        # have to use get() and list indices to parse through data
        try:
            label = (json_lbl
                     .get('results')[0]
                     .get('indications_and_usage'))
            return label

        # if the parsing structure for relevant data is missing
        # (IndexError covers an empty results list)
        except (AttributeError, TypeError, IndexError):
            # returning unique message for missing data
            return "Source Data Found, but Relevant Data Missing"

### Function to maintain traceability throughout process, storing in Pandas DF

In [9]:
# creating function to add a list to a pandas dataframe as a new column
def list_to_df(dataframe, new_list, column_name):
    dataframe[column_name] = new_list

# Input Files

## DRUG INPUT FILE

### Reading in file

In [10]:
drug_in = open(r"C:\Users\Patrick\Documents\NDC_drug_source_file.csv", "r")
reader = csv.DictReader(drug_in)

In [11]:
# creating list to add NDC values to
ndc_11_list = []
drug_name_list = []
# for loop to add values to NDC list
for rdr in reader:
    ndc_11_list.append(rdr['NDC'])
    drug_name_list.append(rdr['PRD_LABEL'])

## DIAGNOSIS INPUT FILE

### Reading in file

In [12]:
diagnosis_in = open(r"C:\Users\Patrick\Documents\ICD10_diagnosis_source_file.csv", "r")
reader2 = csv.DictReader(diagnosis_in)

## Creating lists to add ICD-10 values to

In [13]:
icd_10_block = []
icd_10_desc = []
# for loop to add values to ICD-10 code and description lists
for rdr2 in reader2:
    icd_10_block.append([rdr2['DIAGNOSIS_CODE']])
    icd_10_desc.append([rdr2['DIAGNOSIS_DESC']])

# FDA Label & NDC Data File Prep

The NDC values in the source file are in NDC-11 format, with leading zeros stripped.
NDC-11 is used for billing a drug, but NDC-10 is displayed on the packaging.
We need to convert NDC-11 to NDC-10 in order to use the API.

https://phpa.health.maryland.gov/OIDEOR/IMMUN/Shared%20Documents/Handout%203%20-%20NDC%20conversion%20to%2011%20digits.pdf

There are three different NDC-10 formats (4-4-2, 5-3-2, 5-4-1), depending on where the additional zero sits in the NDC-11 format.
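
To see where the extra zero ends up, it can help to run the conversion in the forward direction first. The sketch below pads a hyphenated 10-digit NDC to the 11-digit 5-4-2 layout; the function name `ndc10_to_ndc11` and the sample values are assumptions for illustration.

```python
def ndc10_to_ndc11(hyphenated_ndc10):
    """Pad a hyphenated 10-digit NDC to the 11-digit 5-4-2 layout."""
    labeler, product, package = hyphenated_ndc10.split("-")
    # Each segment is left-padded to its 5-4-2 width; only the short one changes.
    return labeler.zfill(5) + product.zfill(4) + package.zfill(2)


# One sample value per 10-digit layout: 4-4-2, 5-3-2, 5-4-1.
for example in ("0013-2658-02", "12345-678-90", "12345-6789-0"):
    print(example, "->", ndc10_to_ndc11(example))
```

The added zero lands at the first digit for 4-4-2, the sixth digit for 5-3-2, and the second-to-last digit for 5-4-1, which are the three positions the reverse conversion has to check.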

## For loop to identify which type of conversion needed

In [14]:
ndc_10_list = []

for j in ndc_11_list:
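    # if < 11 digits, leading zeros were cut off, so we pad left up to 10 digits
    if len(j) < 11:
        ndc_10_list.append(j.zfill(10))

    # if all 11 digits are present and the 1st is a zero, that zero gets cut
    elif j[0] == '0':
        ndc_10_list.append(j[1:])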

    # if the 6th digit is a zero, then the 6th indexed zero gets cut
    elif j[5] == '0':
        # slicing up to 5th index, then starting at the 6th index
        ndc_10_list.append(j[: 5] + j[6:])

    # if the 2nd last digit is a zero, then the 2nd last indexed zero gets cut
    elif j[-2] == '0':
        # slicing up to second last digit, then just keeping last digit
        ndc_10_list.append(j[:-2] + j[-1])

    # not passing invalid NDCs to list for API call
    else:
        ndc_10_list.append('')
        print("NDC {} is not valid.".format(j))

From here, we need to be able to trace the changes we make back to NDC-11.
We create a pandas df and append columns as we move through the process.
The pandas df will be how we establish traceability throughout the process.

## Creating drug data df to add new columns to

In [15]:
drug_data = pd.DataFrame(ndc_11_list, columns=['NDC-11'])

# calling list to df function
list_to_df(drug_data, drug_name_list, 'Product Name')
list_to_df(drug_data, ndc_10_list, 'NDC-10')

## Need to reformat NDC

### Need to format NDC as 4-4-2

In [16]:
# now ndc-10 list is done, need to format it with dashes 9999-9999-99
pkg_ndc_442_list = []
# for loop to add dashes after the 4th and 8th digits
for k in ndc_10_list:
    pkg_ndc_442_list.append(k[:4] + '-' + k[4:8] + '-' + k[8:])

# calling 442 list to df function
list_to_df(drug_data, pkg_ndc_442_list, 'Package NDC (4-4-2)')

### Need to format NDC as 5-3-2

In [17]:
# Need to format it 5-3-2
pkg_ndc_532_list = []
# for loop to add dashes after the 5th and 8th digits
for m in ndc_10_list:
    pkg_ndc_532_list.append(m[:5] + '-' + m[5:8] + '-' + m[8:])

# calling 532 list to df function
list_to_df(drug_data, pkg_ndc_532_list, 'Package NDC (5-3-2)')

### Need to format NDC as 5-4-1

In [18]:
# Need to format it 5-4-1
pkg_ndc_541_list = []
# for loop to add dashes after the 5th and 9th digits
for n in ndc_10_list:
    pkg_ndc_541_list.append(n[:5] + '-' + n[5:9] + '-' + n[9:])

# calling 541 list to df function
list_to_df(drug_data, pkg_ndc_541_list, 'Package NDC (5-4-1)')

Now we have the NDC used on the drug package label in each format.
With the package NDC, we can query the OpenFDA API for the drug "Set ID".
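
As a sketch of how such a query could be assembled, the snippet below builds the request URL with `urllib.parse.urlencode` and derives the pause needed to stay within a per-minute request limit. The helper name `build_ndc_query_url`, the placeholder key, and the sample NDC are assumptions; the endpoint, the `packaging.package_ndc` search field, and the 240 requests/minute figure come from the cells in this notebook.

```python
from urllib.parse import urlencode

NDC_ENDPOINT = "https://api.fda.gov/drug/ndc.json"
REQUESTS_PER_MINUTE = 240  # limit quoted in this notebook's loop comments


def build_ndc_query_url(package_ndc, api_key):
    """Return the NDC-endpoint URL that searches on one package NDC."""
    params = {
        "api_key": api_key,
        "search": "packaging.package_ndc:" + package_ndc,
    }
    # urlencode percent-encodes reserved characters such as the colon.
    return NDC_ENDPOINT + "?" + urlencode(params)


# Smallest pause that keeps a sequential loop within the per-minute limit.
min_delay_seconds = 60 / REQUESTS_PER_MINUTE

print(build_ndc_query_url("0013-2658-02", "YOUR_API_KEY"))
print(min_delay_seconds)
```

Passing the key in as an argument, for example read from an environment variable, keeps it out of the notebook source.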

## Pulling Back FDA Label Information from openFDA API

### Creating for loop to get list of all set ids from 4-4-2 NDCs

In [19]:
# creating for loop to get list of all set ids from 4-4-2 NDCs
ndc_442_set_id_list = []
for p in range(0, len(pkg_ndc_442_list)):
    # calling function to create url from inputs
    url_out = url_output("https://api.fda.gov/drug/ndc.json?api_key=jX24nplX7IraJUJ6Fj2wacey0QNGxis8aljnENKA&search=packaging.package_ndc:", pkg_ndc_442_list[p])
    # calling set id parse function
    set_id = fda_set_id_call(url_out.strip())
    # appending to list
    ndc_442_set_id_list.append(set_id)
    # forcing system to wait, only allowed 240 requests/minute (one per 0.25 s)
    time.sleep(0.25)

# adding 442 set id list to df
list_to_df(drug_data, ndc_442_set_id_list, 'Set ID (4-4-2)')
print('All 4-4-2 requests completed.')

All 4-4-2 requests completed.


### Creating for loop to get list of all set ids from 5-3-2 NDCs

In [20]:
ndc_532_set_id_list = []
for q in range(0, len(pkg_ndc_532_list)):
    # calling function to create url from inputs
    url_out = url_output("https://api.fda.gov/drug/ndc.json?api_key=jX24nplX7IraJUJ6Fj2wacey0QNGxis8aljnENKA&search=packaging.package_ndc:", pkg_ndc_532_list[q])
    # calling set id parse function
    set_id = fda_set_id_call(url_out.strip())
    # appending to list
    ndc_532_set_id_list.append(set_id)
    # forcing system to wait, only allowed 240 requests/minute (one per 0.25 s)
    time.sleep(0.25)

# adding 532 set id list to df
list_to_df(drug_data, ndc_532_set_id_list, 'Set ID (5-3-2)')
print('All 5-3-2 requests completed.')

All 5-3-2 requests completed.


### Creating for loop to get list of all set ids from 5-4-1 NDCs

In [21]:
ndc_541_set_id_list = []
for r in range(0, len(pkg_ndc_541_list)):
    # calling function to create url from inputs
    url_out = url_output("https://api.fda.gov/drug/ndc.json?api_key=jX24nplX7IraJUJ6Fj2wacey0QNGxis8aljnENKA&search=packaging.package_ndc:", pkg_ndc_541_list[r])
    # calling set id parse function
    set_id = fda_set_id_call(url_out.strip())
    # appending to list
    ndc_541_set_id_list.append(set_id)
    # forcing system to wait, only allowed 240 requests/minute (one per 0.25 s)
    time.sleep(0.25)

# adding 541 set id list to df
list_to_df(drug_data, ndc_541_set_id_list, 'Set ID (5-4-1)')
print('All 5-4-1 requests completed.')

All 5-4-1 requests completed.


We now have set ids based on all three formats.
We want to combine them into a single list of set ids to pull label information from.

In [22]:
ndc_set_id_list = []
# looping through rows in pandas df to get set id if available
for row in (drug_data[['Set ID (4-4-2)', 'Set ID (5-3-2)', 'Set ID (5-4-1)']]
            .itertuples()):
    if 'Source' in row[1]:
        if 'Source' in row[2]:
            if 'Source' in row[3]:
                ndc_set_id_list.append('NA')
            else:
                ndc_set_id_list.append(row[3])
        else:
            ndc_set_id_list.append(row[2])
    else:
        ndc_set_id_list.append(row[1])


# adding set id list to df
list_to_df(drug_data, ndc_set_id_list, 'Set ID')
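
The fallback order used in the loop above (4-4-2 first, then 5-3-2, then 5-4-1) can also be written as a small helper that returns the first candidate that is not a status message. This is a sketch with made-up values; the name `first_valid_set_id` is hypothetical.

```python
def first_valid_set_id(candidates, placeholder="NA"):
    """Return the first candidate that is not a 'Source ...' status message."""
    for value in candidates:
        if "Source" not in value:
            return value
    return placeholder


# Made-up rows in the column order 4-4-2, 5-3-2, 5-4-1.
rows = [
    ("Source Data Not Found", "set-id-b", "Source Data Not Found"),
    ("set-id-a", "set-id-b", "set-id-c"),
    ("Source Data Not Found",) * 3,
]
print([first_valid_set_id(row) for row in rows])
```

A flat loop like this avoids the nested `if` blocks and extends naturally if a further NDC format ever has to be tried.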

Now that we have the set ids, we can query the API again.
We will pass the set ids into the API and retrieve the label information.

## Creating for loop to get label using set id

In [23]:
# creating for loop to get label using set id
label_list = []
for s in range(0, len(ndc_set_id_list)):
    # calling function to create url from inputs
    url_out = url_output("https://api.fda.gov/drug/label.json?api_key=jX24nplX7IraJUJ6Fj2wacey0QNGxis8aljnENKA&search=set_id:", ndc_set_id_list[s])
    # skipping the API call when no set id was found
    if ndc_set_id_list[s] == 'NA':
        label_list.append([])
    else:
        # calling label parse function
        label = fda_label_call(url_out)
        # appending to list
        label_list.append(label)
        # forcing system to wait, only allowed 240 requests/minute (one per 0.25 s)
        time.sleep(0.25)


### Adding label list to df

In [24]:
list_to_df(drug_data, label_list, 'Label - Indications and Usage')

Common FDA labeling practice is to say "Drug X is indicated for..."
Then list the diagnoses the drugs treat

We only want to include sentences with "indicated for" from the label.
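
The effect of this filter can be checked on a short made-up label. The sketch below uses the standard-library `re` module rather than the third-party `regex` package imported in this notebook (the patterns behave the same in both); the sample text is invented and is not real label wording.

```python
import re

# Invented text for illustration only.
sample_label = (
    "Drug X is indicated for treatment of condition A. "
    "It is not indicated for condition B. Store in a refrigerator."
)

# Keep every sentence fragment that mentions "indicated for".
candidates = re.findall(r"([^.]*indicated for[^.]*)", sample_label)
joined = ".".join(candidates)
# Drop the fragments that are negated ("not indicated for").
indications = re.sub(r"[^.]*not indicated for[^.]*", "", joined)

print(candidates)
print(indications)
```

Because sentences are delimited by periods only, abbreviations or decimal numbers inside a label will split a sentence early, which is a known limitation of this approach.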

In [25]:
# creating blank list to append
label_indications_list = []
for t in label_list:
    if not t:
        label_indications_list.append('NA')

    else:
        # regex used to pick out sentences that contain "indicated for"
        lbl_ind_all = regex.findall(r'([^.]*indicated for[^.]*)', t[0])
        # need to join multiple sentences into 1 paragraph
        lbl_ind_all_new = '.'.join(lbl_ind_all)
        # regex to remove sentences with "not indicated for"
        lbl_ind = regex.sub(r'([^.]*not indicated for[^.]*)',
                            '', lbl_ind_all_new)
        label_indications_list.append(lbl_ind)

## Adding label indications list to df

In [26]:
list_to_df(drug_data, label_indications_list, 'Label - Indications Only')

# Now our data is ready for n-gramming and mining

## Building FDA Drug Label N-Grams

In [27]:
# preparing a list to store cleansed label information
ngram_drug_list_lists = [list(), list(), list(), list(), list()]
temp_list = []

for u in range(1, 6):
    for v in label_indications_list:
        # creating NGrams object with variable text data
        txt_obj = N_Grams(text_data=v, ngrams=u)
        # stepping through each function and applying
        label_string = txt_obj.lower_case()
        label_string = txt_obj.remove_punctuation(label_string)
        label_string = txt_obj.tokenize(label_string)
        label_string = txt_obj.remove_stopwords(label_string)
        label_string = txt_obj.generate_ngrams(label_string)
        # appending data to list after process complete
        temp_list.append(label_string)
    # appending temp list to list of lists
    ngram_drug_list_lists[u-1].append(temp_list)
    # resetting temp list for next run
    temp_list = []

Now our drug data is ready, so we need to prepare the diagnosis info.

We will be using ICD-10 (International Classification of Diseases, 10th Revision).

In order to cut down on computation, I am using only the first 3 characters of each code (the block).

The full code set is longer and contains 70,000+ codes, versus the ~1,900 I am using.
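
Reducing full codes to their first three characters is a one-line operation. The sketch below is illustrative only: the variable names are hypothetical, and the sample codes are copied from the final output table at the end of this notebook.

```python
full_codes = ["P0517", "P0518", "P0519", "Q96", "Q968", "Q969"]

# dict.fromkeys keeps first-seen order while removing duplicate blocks.
blocks = list(dict.fromkeys(code[:3] for code in full_codes))
print(blocks)
```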

## Building Diagnosis N-Grams

### Creating new pandas df to store the data

In [28]:
diagnosis_data = pd.DataFrame()
# adding lists to new df
list_to_df(diagnosis_data, icd_10_block, 'ICD-10 Code Block')
list_to_df(diagnosis_data, icd_10_desc, 'ICD-10 Description')

### Creating N-Grams

In [29]:
## preparing a list of lists to store diagnosis ngram information
ngram_diag_list_lists = [list(), list(), list(), list(), list()]

temp_list = []
for y in range(1, 6):
    for z in icd_10_desc:
        diag_obj = N_Grams(text_data=z, ngrams=y)
        # stepping through each function and applying
        diag_string = diag_obj.lower_case()
        diag_string = diag_obj.remove_punctuation(diag_string)
        diag_string = diag_obj.tokenize(diag_string)
        diag_string = diag_obj.remove_stopwords(diag_string)
        diag_string = diag_obj.generate_ngrams(diag_string)
        # appending data to list after process complete
        temp_list.append(diag_string)
    # appending temp list to list of lists
    ngram_diag_list_lists[y-1].append(temp_list)
    temp_list = []

# Set Intersection of Drug Label vs Diagnosis 

In [30]:
# creating reference lists to help with readability of for loops
drug_ngrams = ngram_drug_list_lists[1][0]
diag_ngrams = ngram_diag_list_lists[1][0]

# creating empty list to append
intersection_list = []
# enumerating through drug bigrams
for a, b in enumerate(drug_ngrams):
    # putting bigrams into set to eliminate duplicates & allow for intersection
    drug_ngrams_set = {*b}
    # enumerating through diagnosis bigrams
    for c, d in enumerate(diag_ngrams):
        # creating diagnosis bigram set
        diag_ngrams_set = {*d}
        # finding matching bigram pairs using set intersection
        intersection = drug_ngrams_set.intersection(diag_ngrams_set)
        if intersection != set():
            #print("Drug Index = {}, Diagnosis Index = {}, shared wording = {}"
                  #.format(a, c, intersection))
            # appending intersection data to list
            # also adding the indices for both the drug and diagnosis lists
            intersection_list.append([a, c, intersection])

## Creating pandas dataframe in order to merge into other dataframes

In [31]:
intersection_df = pd.DataFrame(intersection_list)

### Naming columns

In [32]:
intersection_df.columns = ['drug index', 'diagnosis index', 'shared wording']

### Merge intersection data with drug dataframe, select columns only

In [33]:
merged_df = pd.merge(intersection_df, drug_data.iloc[:, [0, 1, 10]],
                     how='left', left_on='drug index', right_index=True)

### Merge intersection data with diagnosis dataframe, select columns only

In [34]:
merged_df = pd.merge(merged_df, diagnosis_data.iloc[:, [0, 1]],
                     how='left', left_on='diagnosis index', right_index=True)

This is the final output, showing both the drug and diagnosis info.
The map shows a drug, the related diagnoses, and the wording shared between them.

In [35]:
merged_df.tail(20)

|  | drug index | diagnosis index | shared wording | NDC-11 | Product Name | Label - Indications and Usage | ICD-10 Code Block | ICD-10 Description |
|---|---|---|---|---|---|---|---|---|
| 591 | 31 | 27129 | {small gestational} | 13265802 | GENOTROPIN INJ 2MG | [1 INDICATIONS AND USAGE GENOTROPIN is a recom... | [P0517] | [Newborn small for gestational age. 1750-1999 ... |
| 592 | 31 | 27130 | {small gestational} | 13265802 | GENOTROPIN INJ 2MG | [1 INDICATIONS AND USAGE GENOTROPIN is a recom... | [P0518] | [Newborn small for gestational age. 2000-2499 ... |
| 593 | 31 | 27131 | {small gestational} | 13265802 | GENOTROPIN INJ 2MG | [1 INDICATIONS AND USAGE GENOTROPIN is a recom... | [P0519] | [Newborn small for gestational age. other] |
| 594 | 31 | 27132 | {small gestational} | 13265802 | GENOTROPIN INJ 2MG | [1 INDICATIONS AND USAGE GENOTROPIN is a recom... | [P052] | [Newborn affected by fetal (intrauterine) maln... |
| 595 | 31 | 28337 | {short stature} | 13265802 | GENOTROPIN INJ 2MG | [1 INDICATIONS AND USAGE GENOTROPIN is a recom... | [Q771] | [Thanatophoric short stature] |
| 596 | 31 | 28426 | {short stature} | 13265802 | GENOTROPIN INJ 2MG | [1 INDICATIONS AND USAGE GENOTROPIN is a recom... | [Q871] | [Congenital malformation syndromes predominant... |
| 597 | 31 | 28498 | {turner syndrome} | 13265802 | GENOTROPIN INJ 2MG | [1 INDICATIONS AND USAGE GENOTROPIN is a recom... | [Q96] | [Turner's syndrome] |
| 598 | 31 | 28504 | {turner syndrome} | 13265802 | GENOTROPIN INJ 2MG | [1 INDICATIONS AND USAGE GENOTROPIN is a recom... | [Q968] | [Other variants of Turner's syndrome] |
| 599 | 31 | 28505 | {turner syndrome} | 13265802 | GENOTROPIN INJ 2MG | [1 INDICATIONS AND USAGE GENOTROPIN is a recom... | [Q969] | [Turner's syndrome. unspecified] |
| 600 | 31 | 83479 | {radiation therapy} | 13265802 | GENOTROPIN INJ 2MG | [1 INDICATIONS AND USAGE GENOTROPIN is a recom... | [Z510] | [Encounter for antineoplastic radiation therapy] |
