# Validate the generated NDC to active ingredient mappings

2019-05-07

Ensure that the mappings we generated from NDCs to active ingredient RXCUIs are correct.

This notebook provides a high-level overview.
Subsequent notebooks examine the details in depth.

In [1]:
import pandas as pd
from collections import defaultdict

## Read consolidated NDC to RXCUI active ingredient mappings

In [2]:
mapping = pd.read_csv("../../pipeline/ingredients/ndc_tables/ndc_to_rxcui_map_version_2.tsv", sep='\t')

In [3]:
mapping.shape

(242988, 4)

In [4]:
mapping.head()

Unnamed: 0,NDCPACKAGECODE,rxcui,suppress,active_ingredients
0,0002-0800-01,540930,False,11295
1,0002-1200-30,1297712,False,-1
2,0002-1200-50,1297712,False,-1
3,0002-1407-01,853004,False,35220
4,0002-1433-61,1551300,False,1551291


## Read NDC metadata

In [5]:
metadata = pd.read_csv("../../pipeline/merged_ndc_info.tsv", sep='\t')

In [6]:
metadata.shape

(244648, 20)

In [7]:
metadata.head(2)

Unnamed: 0,NDCPACKAGECODE,rxcui,suppress,PRODUCTNDC,PACKAGEDESCRIPTION,PRODUCTTYPENAME,PROPRIETARYNAME,NONPROPRIETARYNAME,DOSAGEFORMNAME,ROUTENAME,MARKETINGCATEGORYNAME,APPLICATIONNUMBER,LABELERNAME,SUBSTANCENAME,ACTIVE_NUMERATOR_STRENGTH,ACTIVE_INGRED_UNIT,PHARM_CLASSES,DEASCHEDULE,NDC_EXCLUDE_FLAG,LISTING_RECORD_CERTIFIED_THROUGH
0,0002-0800-01,540930,False,0002-0800,1 VIAL in 1 CARTON (0002-0800-01) > 10 mL in ...,HUMAN OTC DRUG,Sterile Diluent,diluent,"INJECTION, SOLUTION",SUBCUTANEOUS,NDA,NDA018781,Eli Lilly and Company,WATER,1,mL/mL,,,N,20191231.0
1,0002-1200-30,1297712,False,0002-1200,"1 VIAL, MULTI-DOSE in 1 CAN (0002-1200-30) > ...",HUMAN PRESCRIPTION DRUG,Amyvid,Florbetapir F 18,"INJECTION, SOLUTION",INTRAVENOUS,NDA,NDA202008,Eli Lilly and Company,FLORBETAPIR F-18,51,mCi/mL,"Radioactive Diagnostic Agent [EPC],Positron Em...",,N,20191231.0


---

## Verify that the set of NDCs which we found active ingredients for is disjoint from the set of NDCs for which we did not find active ingredients

In [8]:
set(
    mapping.query(
        "active_ingredients == '-1'"
    )
    ["NDCPACKAGECODE"]
).isdisjoint(
    set(
        mapping.query(
            "active_ingredients != '-1'"
        )
        ["NDCPACKAGECODE"]
    )
)

True

### Verify that there is one consistent answer for each NDC

In [9]:
mapping.groupby("NDCPACKAGECODE")["active_ingredients"].nunique().value_counts()

1    242966
Name: active_ingredients, dtype: int64
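
The two checks above are related: an NDC that appears in both the "found" and "not found" sets would also have more than one distinct answer. As a minimal sketch on toy data (the NDC labels `A`, `B`, `C` and the frame `toy` are hypothetical, not rows from the real mapping), both checks flag the same conflicting NDC:

```python
import pandas as pd

# Toy mapping table (hypothetical values): NDC "A" has two conflicting answers.
toy = pd.DataFrame({
    "NDCPACKAGECODE": ["A", "A", "B", "C"],
    "active_ingredients": ["-1", "6387", "29046", "-1"],
})

# Number of distinct answers per NDC; anything above 1 is an inconsistency.
answers_per_ndc = toy.groupby("NDCPACKAGECODE")["active_ingredients"].nunique()
conflicting = answers_per_ndc[answers_per_ndc > 1].index.tolist()

# The set-based check: NDCs without an answer vs. NDCs with an answer.
missing = set(toy.loc[toy["active_ingredients"] == "-1", "NDCPACKAGECODE"])
found = set(toy.loc[toy["active_ingredients"] != "-1", "NDCPACKAGECODE"])

print(conflicting)
print(missing.isdisjoint(found))
```

The `nunique` check is the stricter of the two, since it would also catch an NDC mapped to two different non-missing answers.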

---

# Overview

## How many NDCs did we find ingredients for?

In [10]:
data = mapping.assign(good_ans = lambda df: df["active_ingredients"] != "-1")

In [11]:
data.shape

(242988, 5)

In [12]:
data.head()

Unnamed: 0,NDCPACKAGECODE,rxcui,suppress,active_ingredients,good_ans
0,0002-0800-01,540930,False,11295,True
1,0002-1200-30,1297712,False,-1,False
2,0002-1200-50,1297712,False,-1,False
3,0002-1407-01,853004,False,35220,True
4,0002-1433-61,1551300,False,1551291,True


In [13]:
(data
    [["good_ans", "NDCPACKAGECODE"]]
    .drop_duplicates()
    ["good_ans"]
    .value_counts()
 
    .to_frame("ndcs")
    .assign(percent = lambda df: df["ndcs"].div(df["ndcs"].sum()).mul(100))
)

Unnamed: 0,ndcs,percent
True,191191,78.690434
False,51775,21.309566


Our algorithm found active ingredients for 78.69% of the NDCs (191191 of 242966).
For the remaining 21.31%, it did not find any active ingredients.

---

# Merge active ingredient results with drug metadata

In [14]:
res = (data
    .merge(
        metadata[[
            "NDCPACKAGECODE",
            "PROPRIETARYNAME",
            "NONPROPRIETARYNAME",
            "MARKETINGCATEGORYNAME",
            "APPLICATIONNUMBER",
            "SUBSTANCENAME",
            "NDC_EXCLUDE_FLAG"
        ]],
        how="inner", on="NDCPACKAGECODE"
    )
    .drop_duplicates()        
    .reset_index(drop=True)
)

In [15]:
res.shape

(242993, 11)
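
The merged frame has five more rows than `data` (242993 vs. 242988). One possible explanation, not confirmed here, is that some `NDCPACKAGECODE` values match more than one distinct row in the selected metadata columns. A minimal sketch of how this could be checked, using hypothetical toy frames `data_toy` and `metadata_toy`:

```python
import pandas as pd

# Toy stand-ins for `data` and `metadata` (hypothetical values).
data_toy = pd.DataFrame({
    "NDCPACKAGECODE": ["A", "B"],
    "good_ans": [True, False],
})
metadata_toy = pd.DataFrame({
    "NDCPACKAGECODE": ["A", "A", "B"],
    "PROPRIETARYNAME": ["Name 1", "Name 2", "Name 3"],
})

# NDCs that occur more than once on the metadata side of the join.
mask = metadata_toy["NDCPACKAGECODE"].duplicated(keep=False)
dup_keys = sorted(set(metadata_toy.loc[mask, "NDCPACKAGECODE"]))

# Each duplicated key multiplies the matching rows in the merge result.
merged = data_toy.merge(metadata_toy, how="inner", on="NDCPACKAGECODE")

# validate="many_to_one" makes pandas raise if the right-hand keys are not unique.
try:
    data_toy.merge(metadata_toy, on="NDCPACKAGECODE", validate="many_to_one")
    keys_unique = True
except pd.errors.MergeError:
    keys_unique = False

print(dup_keys, len(merged), keys_unique)
```

Passing `validate` to `merge` turns a silent row-count change into an explicit error, which can be easier to notice than comparing shapes by eye.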

In [16]:
res.head()

Unnamed: 0,NDCPACKAGECODE,rxcui,suppress,active_ingredients,good_ans,PROPRIETARYNAME,NONPROPRIETARYNAME,MARKETINGCATEGORYNAME,APPLICATIONNUMBER,SUBSTANCENAME,NDC_EXCLUDE_FLAG
0,0002-0800-01,540930,False,11295,True,Sterile Diluent,diluent,NDA,NDA018781,WATER,N
1,0002-1200-30,1297712,False,-1,False,Amyvid,Florbetapir F 18,NDA,NDA202008,FLORBETAPIR F-18,N
2,0002-1200-50,1297712,False,-1,False,Amyvid,Florbetapir F 18,NDA,NDA202008,FLORBETAPIR F-18,N
3,0002-1407-01,853004,False,35220,True,Quinidine Gluconate,Quinidine Gluconate,NDA,NDA007529,QUINIDINE GLUCONATE,N
4,0002-1433-61,1551300,False,1551291,True,Trulicity,Dulaglutide,BLA,BLA125469,DULAGLUTIDE,N


## Algorithm performance by drug category

Use the FDA's drug metadata to examine our algorithm's performance based on the drug category.

In [17]:
# number of unique NDCs in each group of (marketing category name, good ans)

ngroup = (res
    .groupby(["MARKETINGCATEGORYNAME", "good_ans"])
    ["NDCPACKAGECODE"]
    .nunique()
    .to_frame("ndcs")
)

In [18]:
# number of unique NDCs in each group of marketing category name

ncat = (res
    .groupby("MARKETINGCATEGORYNAME")
    ["NDCPACKAGECODE"]
    .nunique()
    .to_frame("total")
)

In [19]:
(ngroup
    .join(ncat, on="MARKETINGCATEGORYNAME")
    .assign(percent = lambda df: df["ndcs"].div(df["total"]).mul(100))
)

MARKETINGCATEGORYNAME,good_ans,ndcs,total,percent
ANDA,False,525,101823,0.515601
ANDA,True,101298,101823,99.484399
BLA,False,1308,19878,6.580139
BLA,True,18570,19878,93.419861
NDA,False,1027,19259,5.332572
NDA,True,18232,19259,94.667428
NDA AUTHORIZED GENERIC,False,19,2419,0.785449
NDA AUTHORIZED GENERIC,True,2400,2419,99.214551
OTC MONOGRAPH FINAL,False,9406,29934,31.422463
OTC MONOGRAPH FINAL,True,20528,29934,68.577537
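
The percentage of mapped NDCs per category can also be computed in one step as the mean of the boolean `good_ans` column. This is only a sketch on hypothetical toy data, and it assumes each NDC has a single answer within its category (which the consistency check earlier suggests):

```python
import pandas as pd

# Toy stand-in for `res` (hypothetical values).
toy = pd.DataFrame({
    "NDCPACKAGECODE": ["A", "B", "C", "D", "E"],
    "MARKETINGCATEGORYNAME": ["ANDA", "ANDA", "ANDA", "ANDA", "BLA"],
    "good_ans": [True, True, True, False, True],
})

# Keep one row per (NDC, category, answer), then average the boolean column.
percent_mapped = (toy
    .drop_duplicates(["NDCPACKAGECODE", "MARKETINGCATEGORYNAME", "good_ans"])
    .groupby("MARKETINGCATEGORYNAME")["good_ans"]
    .mean()
    .mul(100)
)

print(percent_mapped)
```

This gives only the `True` share per category, so it is more compact than the two-row-per-category table but drops the raw counts.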


## Conclusion

This table shows that we managed to find active ingredients for the majority of NDCs.
For the approved ANDA, BLA, and NDA categories, we mapped >90% of NDCs in each group.

However, the algorithm performed poorly on the OTC Monograph groups, with only about 60% of NDCs in these categories mapped to their active ingredients (68.6% for OTC MONOGRAPH FINAL in the rows shown above).

Many of the unmapped NDCs belong to drugs that are not approved at all, so we will not try to identify ingredients for them.

We will proceed to checking that the generated results are accurate, and will use these results as an acceptable version 1 of the NDC to active ingredients mapping.

---

# Examine individual examples

## Check that the BN example was fixed

In [20]:
# we now correctly only find the active ingredient 29046
# the incorrect BN node 196472 has been removed

res.query("PROPRIETARYNAME == 'Zestril'")

Unnamed: 0,NDCPACKAGECODE,rxcui,suppress,active_ingredients,good_ans,PROPRIETARYNAME,NONPROPRIETARYNAME,MARKETINGCATEGORYNAME,APPLICATIONNUMBER,SUBSTANCENAME,NDC_EXCLUDE_FLAG
114355,52427-438-90,104375,False,29046,True,Zestril,Lisinopril,NDA,NDA019777,LISINOPRIL,N
114356,52427-439-90,104376,False,29046,True,Zestril,Lisinopril,NDA,NDA019777,LISINOPRIL,N
114357,52427-440-90,104377,False,29046,True,Zestril,Lisinopril,NDA,NDA019777,LISINOPRIL,N
114358,52427-441-90,104378,False,29046,True,Zestril,Lisinopril,NDA,NDA019777,LISINOPRIL,N
114359,52427-442-90,213482,False,29046,True,Zestril,Lisinopril,NDA,NDA019777,LISINOPRIL,N
114360,52427-443-90,206771,False,29046,True,Zestril,Lisinopril,NDA,NDA019777,LISINOPRIL,N
220921,70518-1451-0,104377,False,29046,True,Zestril,Lisinopril,NDA,NDA019777,LISINOPRIL,N
221250,70518-1741-0,104378,False,29046,True,Zestril,Lisinopril,NDA,NDA019777,LISINOPRIL,N


## Razadyne

In [21]:
# we correctly identified the active ingredient for most Razadyne NDCs

# one problem is still not resolved in version 5:
# rxcui 2103461 gives us an error because it has no edges
# it was removed from the FDA database though
# deal with this in a later version

res.query("PROPRIETARYNAME == 'RAZADYNE'")

Unnamed: 0,NDCPACKAGECODE,rxcui,suppress,active_ingredients,good_ans,PROPRIETARYNAME,NONPROPRIETARYNAME,MARKETINGCATEGORYNAME,APPLICATIONNUMBER,SUBSTANCENAME,NDC_EXCLUDE_FLAG
45936,21695-184-30,2103461,True,-1,False,RAZADYNE,galantamine hydrobromide,NDA,NDA021169,GALANTAMINE HYDROBROMIDE,E
46634,21695-591-30,602734,False,860693,True,RAZADYNE,galantamine hydrobromide,NDA,NDA021169,GALANTAMINE HYDROBROMIDE,E
105810,50458-387-30,860717,False,860693,True,RAZADYNE,galantamine hydrobromide,NDA,NDA021615,GALANTAMINE HYDROBROMIDE,N
105811,50458-388-30,860697,False,860693,True,RAZADYNE,galantamine hydrobromide,NDA,NDA021615,GALANTAMINE HYDROBROMIDE,N
105812,50458-389-30,860709,False,860693,True,RAZADYNE,galantamine hydrobromide,NDA,NDA021615,GALANTAMINE HYDROBROMIDE,N
105813,50458-396-60,602736,False,860693,True,RAZADYNE,galantamine hydrobromide,NDA,NDA021169,GALANTAMINE HYDROBROMIDE,N
105814,50458-397-60,602737,False,860693,True,RAZADYNE,galantamine hydrobromide,NDA,NDA021169,GALANTAMINE HYDROBROMIDE,N
105815,50458-398-60,602734,False,860693,True,RAZADYNE,galantamine hydrobromide,NDA,NDA021169,GALANTAMINE HYDROBROMIDE,N


---

# Analyze based on the FDA's stated active ingredients

The FDA metadata lists each product's active substance in the SUBSTANCENAME column.
Use this information to look for disagreements with our algorithm.
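
One way to find candidates for inspection is to look for FDA substance names that our algorithm maps to more than one distinct answer. The sketch below is an assumption about how such candidates could be listed, not necessarily how the examples in this notebook were chosen; `toy` is a hypothetical stand-in for `res`:

```python
import pandas as pd

# Toy stand-in for `res`, using a few values that appear in this notebook.
toy = pd.DataFrame({
    "SUBSTANCENAME": ["PEGFILGRASTIM", "PEGFILGRASTIM", "LISINOPRIL", "LISINOPRIL"],
    "active_ingredients": ["338036", "2048018", "29046", "29046"],
    "good_ans": [True, True, True, True],
})

# Count distinct algorithm answers per FDA substance name, mapped rows only.
n_answers = (toy[toy["good_ans"]]
    .groupby("SUBSTANCENAME")["active_ingredients"]
    .nunique()
)

# Substances with more than one answer are worth a manual look.
candidates = n_answers[n_answers > 1].sort_values(ascending=False)

print(candidates)
```

A substance appearing here is not necessarily an error, since different salts or biologic variants can legitimately map to different ingredient nodes.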

### Examine disagreements between our algorithm and the FDA

In [22]:
# this one also had a BN error earlier, which is now fixed

# there are three versions of the same protein
# debatable whether they're really the same

res.query("SUBSTANCENAME == 'PEGFILGRASTIM'")

Unnamed: 0,NDCPACKAGECODE,rxcui,suppress,active_ingredients,good_ans,PROPRIETARYNAME,NONPROPRIETARYNAME,MARKETINGCATEGORYNAME,APPLICATIONNUMBER,SUBSTANCENAME,NDC_EXCLUDE_FLAG
130366,55513-190-01,727542,False,338036,True,Neulasta,pegfilgrastim,BLA,BLA125031,PEGFILGRASTIM,N
186718,67457-833-06,2048025,False,2048018,True,Fulphila,pegfilgrastim,BLA,BLA761075,PEGFILGRASTIM,N
217328,70114-101-01,2102705,False,2102692,True,UDENYCA,pegfilgrastim-cbqv,BLA,BLA761039,PEGFILGRASTIM,N


In [23]:
# this seems ok since one is E. coli derived and the other is recombinant DNA

# still good in version 4

res.query("SUBSTANCENAME == 'SOMATROPIN'")

Unnamed: 0,NDCPACKAGECODE,rxcui,suppress,active_ingredients,good_ans,PROPRIETARYNAME,NONPROPRIETARYNAME,MARKETINGCATEGORYNAME,APPLICATIONNUMBER,SUBSTANCENAME,NDC_EXCLUDE_FLAG
6035,0169-7703-21,849851,False,314845,True,Norditropin,somatropin,NDA,NDA021148,SOMATROPIN,N
6036,0169-7704-21,847247,False,314845,True,Norditropin,somatropin,NDA,NDA021148,SOMATROPIN,N
6037,0169-7704-92,847247,False,314845,True,Norditropin,somatropin,NDA,NDA021148,SOMATROPIN,N
6038,0169-7705-21,847348,False,314845,True,Norditropin,somatropin,NDA,NDA021148,SOMATROPIN,N
6039,0169-7705-92,847348,False,314845,True,Norditropin,somatropin,NDA,NDA021148,SOMATROPIN,N
6040,0169-7708-21,847245,False,314845,True,Norditropin,somatropin,NDA,NDA021148,SOMATROPIN,N
6041,0169-7708-92,847245,False,314845,True,Norditropin,somatropin,NDA,NDA021148,SOMATROPIN,N
20893,0781-3001-07,864110,False,314845,True,Omnitrope,Somatropin,NDA,NDA021426,SOMATROPIN,N
20894,0781-3001-26,864110,False,314845,True,Omnitrope,Somatropin,NDA,NDA021426,SOMATROPIN,N
20897,0781-3004-07,854302,False,314845,True,Omnitrope,Somatropin,NDA,NDA021426,SOMATROPIN,N


For these examples it seems that there are nuanced differences between the active ingredients of some similar drugs.
The FDA's table provides a high level summary of the active ingredients, but does not contain enough information to draw a conclusion regarding whether the mapping is correct.

For the two examples we looked at here, our algorithm's outputs seem to be correct.

## Previous version 2 disagreements with version 1

In [24]:
res.query("SUBSTANCENAME == 'ONDANSETRON'").head()

Unnamed: 0,NDCPACKAGECODE,rxcui,suppress,active_ingredients,good_ans,PROPRIETARYNAME,NONPROPRIETARYNAME,MARKETINGCATEGORYNAME,APPLICATIONNUMBER,SUBSTANCENAME,NDC_EXCLUDE_FLAG
2687,0078-0679-19,876690,False,203148,True,ZOFRAN,ondansetron hydrochloride,NDA,NDA020781,ONDANSETRON,N
2688,0078-0680-19,876693,False,203148,True,ZOFRAN,ondansetron hydrochloride,NDA,NDA020781,ONDANSETRON,N
14248,0378-7732-93,104894,False,26225,True,Ondansetron,Ondansetron,ANDA,ANDA078139,ONDANSETRON,N
14249,0378-7734-93,312087,False,26225,True,Ondansetron,Ondansetron,ANDA,ANDA078139,ONDANSETRON,N
14250,0378-7734-97,312087,False,26225,True,Ondansetron,Ondansetron,ANDA,ANDA078139,ONDANSETRON,N


In [25]:
res.query("SUBSTANCENAME == 'LIDOCAINE'").head()

Unnamed: 0,NDCPACKAGECODE,rxcui,suppress,active_ingredients,good_ans,PROPRIETARYNAME,NONPROPRIETARYNAME,MARKETINGCATEGORYNAME,APPLICATIONNUMBER,SUBSTANCENAME,NDC_EXCLUDE_FLAG
3740,0113-0135-45,1442274,False,6387,True,Lidocaine,Burn Relief,OTC MONOGRAPH NOT FINAL,part348,LIDOCAINE,N
5850,0168-0204-37,1543069,False,6387,True,Lidocaine,Lidocaine,ANDA,ANDA080198,LIDOCAINE,N
11471,0362-0221-10,1543069,False,6387,True,Lidocaine,Lidocaine,ANDA,ANDA040911,LIDOCAINE,N
12631,0363-1114-01,1366789,False,6387,True,Anorectal,Lidocaine,OTC MONOGRAPH FINAL,part346,LIDOCAINE,N
12826,0363-3001-24,2104325,False,-1,False,Pain and Itch Relief,Lidocaine,OTC MONOGRAPH NOT FINAL,part348,LIDOCAINE,N


---

## Multi active ingredient drug examples

In [26]:
# now uses the right form of menthol

res.query("rxcui == 1300293")

Unnamed: 0,NDCPACKAGECODE,rxcui,suppress,active_ingredients,good_ans,PROPRIETARYNAME,NONPROPRIETARYNAME,MARKETINGCATEGORYNAME,APPLICATIONNUMBER,SUBSTANCENAME,NDC_EXCLUDE_FLAG
110402,51457-000-04,1300293,False,6750142136,True,Alo Therapeutic Massage,"MENTHOL, HISTAMINE DIHYDROCHLORIDE",OTC MONOGRAPH NOT FINAL,part348,MENTHOL; HISTAMINE DIHYDROCHLORIDE,N
110403,51457-001-32,1300293,False,6750142136,True,ALO THERAPEUTIC MASSAGE PAIN RELIEVING,"MENTHOL, HISTAMINE DIHYDROCHLORIDE",OTC MONOGRAPH NOT FINAL,part348,MENTHOL; HISTAMINE DIHYDROCHLORIDE,N
224986,71061-763-04,1300293,False,6750142136,True,Alo Therapeutic Massage,"MENTHOL, HISTAMINE DIHYDROCHLORIDE",OTC MONOGRAPH NOT FINAL,part348,MENTHOL; HISTAMINE DIHYDROCHLORIDE,N
224987,71061-764-32,1300293,False,6750142136,True,Alo Therapeutic Massage,"MENTHOL, HISTAMINE DIHYDROCHLORIDE",OTC MONOGRAPH NOT FINAL,part348,MENTHOL; HISTAMINE DIHYDROCHLORIDE,N
224988,71061-765-28,1300293,False,6750142136,True,Alo Therapeutic Massage,"MENTHOL, HISTAMINE DIHYDROCHLORIDE",OTC MONOGRAPH NOT FINAL,part348,MENTHOL; HISTAMINE DIHYDROCHLORIDE,N
224989,71061-766-05,1300293,False,6750142136,True,Alo Therapeutic Massage,"MENTHOL, HISTAMINE DIHYDROCHLORIDE",OTC MONOGRAPH NOT FINAL,part348,MENTHOL; HISTAMINE DIHYDROCHLORIDE,N


In [27]:
# this is also now correct

res.query("rxcui == 543879")

Unnamed: 0,NDCPACKAGECODE,rxcui,suppress,active_ingredients,good_ans,PROPRIETARYNAME,NONPROPRIETARYNAME,MARKETINGCATEGORYNAME,APPLICATIONNUMBER,SUBSTANCENAME,NDC_EXCLUDE_FLAG
112135,51674-0130-5,543879,False,16842836,True,RELEGARD,"GLACIAL ACETIC ACID, OXYQUINOLINE",UNAPPROVED DRUG OTHER,,ACETIC ACID; OXYQUINOLINE,E


---

# What are the term types of the active ingredients we found?

Verify that all the BN term nodes have been removed as active ingredients.

## Read relationships

In [28]:
rels = pd.read_csv("../../pipeline/rxnorm/rxcui_rels.tsv", sep='\t')

In [29]:
rels.head()

Unnamed: 0,rxcui1,rel,rxcui2,rela
0,38,RB,1760,has_tradename
1,38,RO,105050,has_ingredient
2,38,RO,105445,has_ingredient
3,38,RO,105446,has_ingredient
4,38,RO,105447,has_ingredient


## Read term types

In [30]:
conso = pd.read_csv("../../pipeline/rxnorm/rxconso_info.tsv", sep='\t')

In [31]:
conso.head()

Unnamed: 0,rxcui,rxaui,tty,str,suppress,cvf
0,38,829,BN,Parlodel,N,4096.0
1,44,947,IN,Mesna,N,4096.0
2,61,1424,IN,beta-Alanine,N,4096.0
3,73,2458041,IN,Docosahexaenoate,N,4096.0
4,74,1684,IN,4-Aminobenzoic Acid,N,4096.0


## Read ingredients

In [32]:
ingredients = pd.read_csv(
    "../../pipeline/ingredients/rxcui_ingredients/rxcui_ingredients_version_7.tsv",
    sep='\t'
)

### Get term types for each node

In [33]:
tty = defaultdict(set)
for row in conso.itertuples():
    tty[row.rxcui].add(row.tty)

### Generate results

In [34]:
ans_ttys = defaultdict(set)

for row in ingredients.itertuples():
    for node in row.active_ingredients.split(","):
        ans_ttys[row.rxcui] |= tty[int(node)]

In [35]:
ans = defaultdict(list)

for rxcui, temp in ans_ttys.items():
    ans["rxcui"].append(rxcui)
    ans["ing_ttys"].append(",".join(sorted(temp)))
    ans["num_ttys"].append(len(temp))
    
ans = pd.DataFrame(ans)

In [36]:
ans.shape

(42554, 3)

In [37]:
ans.head()

Unnamed: 0,rxcui,ing_ttys,num_ttys
0,91349,IN,1
1,91792,IN,1
2,92582,PIN,1
3,92583,PIN,1
4,92584,PIN,1


## Term types of active ingredients

In [38]:
ans["ing_ttys"].value_counts()

                  26135
IN                 8100
PIN                3862
IN,SY              1775
IN,TMSY             757
IN,PIN              709
PIN,TMSY            636
IN,PIN,TMSY         462
IN,PIN,SY            71
PIN,SY               27
IN,SY,TMSY           12
IN,PIN,SY,TMSY        8
Name: ing_ttys, dtype: int64

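All of the BN terms are gone.
Only the IN and PIN nodes remain.
The SY and TMSY terms are synonym term types attached to those same nodes.
The blank first entry (26135 RXCUIs) covers answers for which no term types were found at all.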
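
The absence of BN terms can also be checked programmatically rather than by reading the counts. The following is a minimal sketch on toy data: RXCUIs 38, 44 and 61 and their BN/IN types come from the `conso` preview above, while the SY entry for 61 and the answer keys 1001 to 1003 are hypothetical, and it assumes a missing answer is encoded as `-1`:

```python
from collections import defaultdict

# Toy term-type table (rxcui -> term types), in the shape built from `conso`.
tty = defaultdict(set)
for rxcui, term_type in [(38, "BN"), (44, "IN"), (61, "IN"), (61, "SY")]:
    tty[rxcui].add(term_type)

# Toy answers (rxcui -> comma-separated ingredient nodes); "-1" means none found.
toy_ingredients = {1001: "44", 1002: "44,61", 1003: "-1"}

# Union of the term types of every ingredient node in each answer.
ans_ttys = {
    rxcui: set().union(*(tty[int(node)] for node in nodes.split(",")))
    for rxcui, nodes in toy_ingredients.items()
}

# Answers that still contain a brand-name node, and answers with no types at all.
with_bn = [rxcui for rxcui, types in ans_ttys.items() if "BN" in types]
no_types = [rxcui for rxcui, types in ans_ttys.items() if not types]

print(with_bn, no_types)
```

In this toy case the `-1` answer ends up with an empty set of term types, which would show up as a blank entry in a `value_counts()` summary.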

In [39]:
ans["num_ttys"].value_counts()

0    26135
1    11962
2     3904
3      545
4        8
Name: num_ttys, dtype: int64

## Sample some examples

In [40]:
# BN terms have been removed

ans.query("rxcui == 757969").merge(res, how="left", on="rxcui")

Unnamed: 0,rxcui,ing_ttys,num_ttys,NDCPACKAGECODE,suppress,active_ingredients,good_ans,PROPRIETARYNAME,NONPROPRIETARYNAME,MARKETINGCATEGORYNAME,APPLICATIONNUMBER,SUBSTANCENAME,NDC_EXCLUDE_FLAG
0,757969,"IN,PIN",2,64764-702-01,False,1712821212133008,True,PREVPAC,"lansoprazole, amoxicillin and clarithromycin",NDA,NDA050757,,N


In [41]:
# previously we included one BN term as an ingredient here; it is now gone

ans.query("rxcui == 1493510").merge(res, how="left", on="rxcui")

Unnamed: 0,rxcui,ing_ttys,num_ttys,NDCPACKAGECODE,suppress,active_ingredients,good_ans,PROPRIETARYNAME,NONPROPRIETARYNAME,MARKETINGCATEGORYNAME,APPLICATIONNUMBER,SUBSTANCENAME,NDC_EXCLUDE_FLAG
0,1493510,"IN,PIN",2,51531-8977-0,False,141895229907235418253186,True,Clear Proof Acne System,Benzoyl Peroxide and Salicylic Acid,OTC MONOGRAPH FINAL,part333D,,N


## Summary

The term type filter has worked very well.
All BN nodes have been removed as active ingredients.

# Conclusion

We seem to have successfully mapped NDCs to their active ingredients.