# Determine the active ingredients of RXCUIs associated with National Drug Codes

2019-06-03

Now that we have converted NDCs into RXCUIs, we have an entry point into the RxNorm semantic network.
We will use the relationships in that network to determine the active ingredients of each NDC.

## Version 6

Algorithm for version 6:
- walk the RxNorm semantic network using BFS
- only traverse edges of the good types listed below
- don't add a node if one of its `has_form` targets is already an ingredient
- only add ingredient (IN) and precise ingredient (PIN) nodes

Changes:
- Algorithm is the same as version 5
- Harmonize results for NDCs with multiple RXCUIs using the RxNorm suppress column

### All RXCUI edge types

From `code/rxnorm/extract_rxcui_relationships.ipynb`:

```
isa                       202708
inverse_isa               202708
ingredient_of             155438
has_ingredient            155438
constitutes               110911
consists_of               110911
tradename_of              106752
has_tradename             106752
dose_form_of               89850
has_dose_form              89850
has_doseformgroup          36876
doseformgroup_of           36876
has_ingredients            11649
ingredients_of             11649
has_precise_ingredient     10864
precise_ingredient_of      10864
has_part                   10585
part_of                    10585
quantified_form_of          5310
has_quantified_form         5310
has_form                    2845
form_of                     2845
contained_in                2310
contains                    2310
reformulation_of              62
reformulated_to               62
Name: rela, dtype: int64
```

I sampled edges of each type by hand to determine which types are useful for finding the active ingredients.
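
A minimal sketch of how such counts and a per-type sample for manual review could be produced. It assumes a relationship table with the `rxcui1`, `rela` and `rxcui2` columns used later in this notebook; the toy rows are made up for illustration and `sample_edges` is a hypothetical helper, not part of the original extraction notebook.

```python
import pandas as pd

# Toy relationship table; the rows are illustrative, not real RxNorm data.
rels = pd.DataFrame({
    "rxcui1": [1, 2, 3, 4, 5],
    "rela": ["has_ingredient", "has_ingredient", "isa", "isa", "has_form"],
    "rxcui2": [10, 10, 11, 12, 13],
})

# Number of edges of each type, the same kind of listing as shown above.
edge_counts = rels["rela"].value_counts()


def sample_edges(rel_table, n=1, seed=0):
    """Return up to n randomly chosen edges of every type (hypothetical helper)."""
    return pd.concat(
        group.sample(n=min(n, len(group)), random_state=seed)
        for _, group in rel_table.groupby("rela")
    )


print(edge_counts)
print(sample_edges(rels))
```

Fixing `random_state` keeps the sample reproducible, which helps when the hand-review notes refer back to specific edges.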

### Good edge types

```
# special case, treated differently
has_precise_ingredient     10864

# all equivalent, as far as I can tell
has_ingredient            155438
has_ingredients            11649

consists_of               110911
contains                    2310
has_part                   10585
```

In [1]:
from collections import defaultdict, deque

import pandas as pd

## Read semantic network and convert to adjacency list

In [2]:
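def get_adj_list(rels_fname):
    """Read the relationship table and build a nested adjacency list.

    The result maps rxcui2 -> edge type (rela) -> list of rxcui1 targets.
    """
    rel_table = pd.read_csv(rels_fname, sep='\t')

    adj_list = defaultdict(lambda: defaultdict(list))
    for row in rel_table.itertuples():
        # each row is stored as an edge from rxcui2 to rxcui1, labelled rela
        adj_list[row.rxcui2][row.rela].append(row.rxcui1)

    return adj_list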

In [3]:
fname = "../../pipeline/rxnorm/rxcui_rels.tsv"

adj_list = get_adj_list(fname)

In [4]:
len(adj_list)

196202

In [5]:
adj_list[602734]

defaultdict(list,
            {'has_dose_form': [317541],
             'consists_of': [330343, 602732],
             'tradename_of': [579148],
             'has_ingredient': [583099],
             'isa': [602733, 1178299, 1178300]})
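
To make the shape of this nested mapping concrete, here is a small sketch that builds the same structure from an in-memory table. The TSV layout (`rxcui1`, `rela`, `rxcui2`) matches what `get_adj_list` reads, but the identifiers are made up and `toy_adj` is a name introduced only for this example.

```python
import io
from collections import defaultdict

import pandas as pd

# Toy relationship table in the same TSV layout; the identifiers are made up.
toy_tsv = io.StringIO(
    "rxcui1\trela\trxcui2\n"
    "20\thas_ingredient\t10\n"
    "21\tconsists_of\t10\n"
    "22\tconsists_of\t10\n"
)
rel_table = pd.read_csv(toy_tsv, sep="\t")

toy_adj = defaultdict(lambda: defaultdict(list))
for row in rel_table.itertuples():
    # The edge is stored on rxcui2 and points at rxcui1.
    toy_adj[row.rxcui2][row.rela].append(row.rxcui1)

print(dict(toy_adj[10]))

# Indexing a defaultdict with a missing key inserts that key, so use `in`
# (or .get) for read-only checks when the size of the mapping matters.
print(99 in toy_adj)
```

One consequence of the `defaultdict` choice is that lookups such as `adj_list[some_rxcui]` never raise, but they do grow the mapping, so `len(adj_list)` can change after traversals.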

---

## Identify all ingredients and precise ingredients

In [6]:
def get_rxnorm_ingredients(conso_fname):
    """Return the set of RXCUIs whose term type is IN or PIN."""
    return set(pd
        .read_csv(conso_fname, sep='\t')
        .query("tty == 'IN' or tty == 'PIN'")
        ["rxcui"]
    )

In [7]:
conso_fname = "../../pipeline/rxnorm/rxconso_info.tsv"

rxnorm_ingredients = get_rxnorm_ingredients(conso_fname)
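
As a quick illustration of the filter, the sketch below applies the same term-type selection to a toy table. Only the `rxcui` and `tty` columns are taken from the function above; the `str` column, the rows, and the name `toy_ingredients` are assumptions made for the example.

```python
import io

import pandas as pd

# Toy concept table; only `rxcui` and `tty` are used by the filter.
toy_conso = io.StringIO(
    "rxcui\ttty\tstr\n"
    "10\tIN\texample ingredient\n"
    "11\tPIN\texample precise ingredient\n"
    "12\tSCD\texample clinical drug\n"
)
conso = pd.read_csv(toy_conso, sep="\t")

# Keep ingredient (IN) and precise ingredient (PIN) concepts only.
toy_ingredients = set(conso.loc[conso["tty"].isin(["IN", "PIN"]), "rxcui"])

print(toy_ingredients)
```

Storing the result as a `set` keeps the membership tests in the traversal cheap.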

## Determine active ingredients

In [8]:
def get_active_ingredients(start_rxcui):
    """Get the active ingredients of a specific RxCUI.

    Walks the global `adj_list` breadth-first and returns a comma-separated
    string of sorted ingredient RXCUIs, or "-1" if none are found.
    """

    GOOD_ETYPES = [
        "has_ingredient",
        "has_ingredients",
        "consists_of",
        "contains",
        "has_part"
    ]

    ingredients = set()

    queue = deque([start_rxcui])
    been = {start_rxcui}

    while queue:
        cur_node = queue.popleft()
        edges = adj_list[cur_node]

        if "has_precise_ingredient" in edges:
            # there can be multiple precise ingredients!
            precise = set(edges["has_precise_ingredient"])
            assert precise <= rxnorm_ingredients
            ingredients |= precise

        elif edges.keys().isdisjoint(GOOD_ETYPES):
            # no more edges to walk: keep the node only if it is an IN/PIN
            # and none of its has_form targets has already been added
            if (cur_node in rxnorm_ingredients
                    and ingredients.isdisjoint(edges["has_form"])):
                ingredients.add(cur_node)

        else:
            # we need to continue to traverse the graph
            for etype in GOOD_ETYPES:
                for neighbour in edges[etype]:
                    if neighbour not in been:
                        queue.append(neighbour)
                        been.add(neighbour)

    if not ingredients:
        return "-1"

    return ",".join(str(v) for v in sorted(ingredients))
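
To see how the traversal behaves, the sketch below runs a parameterised variant on a toy graph. `walk_ingredients` is a hypothetical helper written for this example: it takes the graph and the ingredient set as arguments instead of reading globals, uses plain dicts, omits the assertion, and returns a set rather than a string. The identifiers in the toy graph are made up.

```python
from collections import deque

GOOD_ETYPES = ["has_ingredient", "has_ingredients", "consists_of", "contains", "has_part"]


def walk_ingredients(start, adj, known_ingredients):
    """Breadth-first walk over good edge types (hypothetical helper)."""
    found = set()
    queue = deque([start])
    seen = {start}
    while queue:
        node = queue.popleft()
        edges = adj.get(node, {})
        if "has_precise_ingredient" in edges:
            # Precise ingredients win; the node's other edges are not walked.
            found |= set(edges["has_precise_ingredient"])
        elif not any(etype in edges for etype in GOOD_ETYPES):
            # Dead end: keep it only if it is a known ingredient and none of
            # its has_form targets was already collected.
            if node in known_ingredients and found.isdisjoint(edges.get("has_form", [])):
                found.add(node)
        else:
            for etype in GOOD_ETYPES:
                for neighbour in edges.get(etype, []):
                    if neighbour not in seen:
                        seen.add(neighbour)
                        queue.append(neighbour)
    return found


# Toy graph: a pack (1) contains two drugs (2 and 3).
toy_adj = {
    1: {"contains": [2, 3]},
    2: {"has_ingredient": [10]},
    3: {"consists_of": [4]},
    4: {"has_ingredient": [11], "has_precise_ingredient": [12]},
}
toy_ingredients = {10, 11, 12}

print(sorted(walk_ingredients(1, toy_adj, toy_ingredients)))  # [10, 12]
```

Node 11 is never reported because node 4 carries a `has_precise_ingredient` edge, which takes precedence over its `has_ingredient` edge.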

---

## Read all NDCs with RXCUIs

In [9]:
data = pd.read_csv("../../pipeline/merged_ndc_info.tsv", sep='\t')

In [10]:
data.shape

(246695, 20)

In [11]:
data.head(2)

| | NDCPACKAGECODE | rxcui | suppress | PRODUCTNDC | PACKAGEDESCRIPTION | PRODUCTTYPENAME | PROPRIETARYNAME | NONPROPRIETARYNAME | DOSAGEFORMNAME | ROUTENAME | MARKETINGCATEGORYNAME | APPLICATIONNUMBER | LABELERNAME | SUBSTANCENAME | ACTIVE_NUMERATOR_STRENGTH | ACTIVE_INGRED_UNIT | PHARM_CLASSES | DEASCHEDULE | NDC_EXCLUDE_FLAG | LISTING_RECORD_CERTIFIED_THROUGH |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0002-0800-01 | 540930 | False | 0002-0800 | 1 VIAL in 1 CARTON (0002-0800-01) > 10 mL in ... | HUMAN OTC DRUG | Sterile Diluent | diluent | INJECTION, SOLUTION | SUBCUTANEOUS | NDA | NDA018781 | Eli Lilly and Company | WATER | 1 | mL/mL | | | N | 20191231.0 |
| 1 | 0002-1200-30 | 1297712 | False | 0002-1200 | 1 VIAL, MULTI-DOSE in 1 CAN (0002-1200-30) > ... | HUMAN PRESCRIPTION DRUG | Amyvid | Florbetapir F 18 | INJECTION, SOLUTION | INTRAVENOUS | NDA | NDA202008 | Eli Lilly and Company | FLORBETAPIR F-18 | 51 | mCi/mL | Radioactive Diagnostic Agent [EPC],Positron Em... | | N | 20191231.0 |


## Filter out relevant data

We just need the starting RXCUI.

In [12]:
drugs = (data
    [["rxcui"]]
    .drop_duplicates()
    .sort_values("rxcui")
    .reset_index(drop=True)
)

In [13]:
drugs.shape

(43556, 1)

In [14]:
drugs.head()

| | rxcui |
|---|---|
| 0 | 91349 |
| 1 | 91792 |
| 2 | 92582 |
| 3 | 92583 |
| 4 | 92584 |


Finding the active ingredients only requires the starting RXCUI associated with each NDC.
The algorithm's performance is analyzed in a separate notebook.

## Use the semantic network to determine the active ingredients

In [15]:
ingredients = drugs.assign(
    active_ingredients = lambda df: df["rxcui"].map(get_active_ingredients)
)

In [16]:
ingredients.shape

(43556, 2)

In [17]:
ingredients.head()

| | rxcui | active_ingredients |
|---|---|---|
| 0 | 91349 | 5499 |
| 1 | 91792 | 7813 |
| 2 | 92582 | 30145 |
| 3 | 92583 | 30145 |
| 4 | 92584 | 30145 |


## Save results to file

In [18]:
ingredients.to_csv(
    "../../pipeline/ingredients/rxcui_ingredients/rxcui_ingredients_version_7.tsv",
    sep='\t', index=False
)
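
Because `active_ingredients` is stored as a comma-separated string with `"-1"` standing for "no ingredients found", downstream code has to decode it. A minimal sketch of one way to do that when reading the file back follows; `parse_ingredients` is a hypothetical helper and the toy rows are illustrative, not taken from the real output.

```python
import io

import pandas as pd

# Toy file in the layout written above; the values are illustrative only.
toy_tsv = io.StringIO(
    "rxcui\tactive_ingredients\n"
    "100\t10,12\n"
    "101\t-1\n"
)


def parse_ingredients(value):
    """Turn '10,12' into [10, 12] and the '-1' sentinel into an empty list."""
    return [] if value == "-1" else [int(v) for v in value.split(",")]


result = pd.read_csv(
    toy_tsv, sep="\t", converters={"active_ingredients": parse_ingredients}
)

print(result)
```

Passing the parser through `converters` means every cell arrives as a string, so single-ingredient and multi-ingredient rows are decoded the same way.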