# Validate the generated NDC to active ingredient mappings

2019-05-01

Ensure that the mappings we generated from NDCs to active ingredient RXCUIs are correct.

## Version 5

Check that we have removed the BN nodes from the active ingredients list.

Look for any remaining errors.

In [1]:
import pandas as pd
from collections import defaultdict

## Read generated mappings

In [2]:
ingredients = pd.read_csv("../../pipeline/ingredients/ndc_active_ingredients_version_5.tsv", sep='\t')

In [3]:
ingredients.shape

(41576, 2)

In [4]:
ingredients.head()

Unnamed: 0,rxcui,active_ingredients
0,91349,5499
1,91792,7813
2,92582,30145
3,92583,30145
4,92584,30145


## Read NDC metadata

In [5]:
data = pd.read_csv("../../pipeline/merged_ndc_info.tsv", sep='\t')

In [6]:
data.shape

(265692, 22)

In [7]:
data.head(2)

Unnamed: 0,rxcui,rxaui,NDCPACKAGECODE,suppress,PRODUCTID,PRODUCTNDC,PACKAGEDESCRIPTION,PRODUCTTYPENAME,PROPRIETARYNAME,NONPROPRIETARYNAME,...,MARKETINGCATEGORYNAME,APPLICATIONNUMBER,LABELERNAME,SUBSTANCENAME,ACTIVE_NUMERATOR_STRENGTH,ACTIVE_INGRED_UNIT,PHARM_CLASSES,DEASCHEDULE,NDC_EXCLUDE_FLAG,LISTING_RECORD_CERTIFIED_THROUGH
0,91349,3507080,12745-202-01,N,12745-202_7d063901-255c-bffc-e053-2a91aa0a91ee,12745-202,"59 mL in 1 BOTTLE, PLASTIC (12745-202-01)",HUMAN OTC DRUG,HYDROGEN PEROXIDE,HYDROGEN PEROXIDE,...,OTC MONOGRAPH NOT FINAL,part333A,Medical Chemical Corporation,HYDROGEN PEROXIDE,8.57,g/100mL,,,N,20191231.0
1,91349,3507080,12745-202-02,N,12745-202_7d063901-255c-bffc-e053-2a91aa0a91ee,12745-202,"118 mL in 1 BOTTLE, PLASTIC (12745-202-02)",HUMAN OTC DRUG,HYDROGEN PEROXIDE,HYDROGEN PEROXIDE,...,OTC MONOGRAPH NOT FINAL,part333A,Medical Chemical Corporation,HYDROGEN PEROXIDE,8.57,g/100mL,,,N,20191231.0


---

# Create results table

Compare against the previous version.

In [8]:
prev = pd.read_csv("../../pipeline/ingredients/ndc_active_ingredients_version_4.tsv", sep='\t')

In [9]:
prev.head()

Unnamed: 0,rxcui,active_ingredients
0,91349,5499
1,91792,7813215260
2,92582,30145
3,92583,30145
4,92584,30145


In [10]:
res = (data
    [[
        "rxcui", "NDCPACKAGECODE", "PROPRIETARYNAME",
        "NONPROPRIETARYNAME", "SUBSTANCENAME"
    ]]
    .drop_duplicates()
    .reset_index(drop=True)
       
    .merge(ingredients, how="inner", on="rxcui")
    .rename(columns={"active_ingredients": "v5"})
       
    .merge(prev, how="inner", on="rxcui")
    .rename(columns={"active_ingredients": "v4"})

    .reset_index(drop=True)
)

In [11]:
res.shape

(241458, 7)

In [12]:
res.head()

Unnamed: 0,rxcui,NDCPACKAGECODE,PROPRIETARYNAME,NONPROPRIETARYNAME,SUBSTANCENAME,v5,v4
0,91349,12745-202-01,HYDROGEN PEROXIDE,HYDROGEN PEROXIDE,HYDROGEN PEROXIDE,5499,5499
1,91349,12745-202-02,HYDROGEN PEROXIDE,HYDROGEN PEROXIDE,HYDROGEN PEROXIDE,5499,5499
2,91349,12745-202-03,HYDROGEN PEROXIDE,HYDROGEN PEROXIDE,HYDROGEN PEROXIDE,5499,5499
3,91349,34645-8030-4,Hydrogen Peroxide,Hydrogen Peroxide,HYDROGEN PEROXIDE,5499,5499
4,91349,55316-871-43,Hydrogen Peroxide,Hydrogen Peroxide,HYDROGEN PEROXIDE,5499,5499


---

## Changes from version 4

In [13]:
(res["v5"] == res["v4"]).value_counts()

True     179535
False     61923
dtype: int64
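The count above says how many rows changed, but not how. A minimal sketch of one way to break the changes down into removed and added nodes is shown here. It runs on a toy stand-in for `res`; the helper `to_set`, the toy values, and the comma separator (taken from the `split(",")` call used later in this notebook) are assumptions.

```python
import pandas as pd

# Toy stand-in for `res`; values and the "," separator are illustrative assumptions.
toy = pd.DataFrame({
    "rxcui": [91349, 91792, 104375],
    "v5": ["5499", "7813", "29046"],
    "v4": ["5499", "7813,215260", "29046,196472"],
})


def to_set(value):
    """Split a delimited ingredient string into a set of RXCUI strings."""
    return set(str(value).split(","))


# Keep only the rows where the two versions disagree.
changed = toy[toy["v5"] != toy["v4"]].copy()

# Nodes present in v4 but dropped in v5, and nodes that are new in v5.
changed["removed"] = [
    ",".join(sorted(to_set(v4) - to_set(v5)))
    for v4, v5 in zip(changed["v4"], changed["v5"])
]
changed["added"] = [
    ",".join(sorted(to_set(v5) - to_set(v4)))
    for v4, v5 in zip(changed["v4"], changed["v5"])
]
```

If the BN filter is the only change between versions, we would expect `added` to be empty on every row and every node in `removed` to be a BN term.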

## Check that the BN example was fixed

In [14]:
# we now correctly only find the active ingredient 29046
# the incorrect BN node 196472 has been removed

res.query("PROPRIETARYNAME == 'Zestril'")

Unnamed: 0,rxcui,NDCPACKAGECODE,PROPRIETARYNAME,NONPROPRIETARYNAME,SUBSTANCENAME,v5,v4
591,104375,52427-438-90,Zestril,Lisinopril,LISINOPRIL,29046,29046196472
592,104376,52427-439-90,Zestril,Lisinopril,LISINOPRIL,29046,29046196472
593,104377,52427-440-90,Zestril,Lisinopril,LISINOPRIL,29046,29046196472
594,104377,70518-1451-0,Zestril,Lisinopril,LISINOPRIL,29046,29046196472
595,104378,52427-441-90,Zestril,Lisinopril,LISINOPRIL,29046,29046196472
596,104378,70518-1741-0,Zestril,Lisinopril,LISINOPRIL,29046,29046196472
28636,206771,52427-443-90,Zestril,Lisinopril,LISINOPRIL,29046,29046196472
31079,213482,52427-442-90,Zestril,Lisinopril,LISINOPRIL,29046,29046196472


# Examine results

## Case study: razadyne

In [15]:
# we correctly identified the active ingredient for most RAZADYNE NDCs

# one problem is still not resolved in version 5:
# rxcui 2103461 gives an error (-1) because it has no edges
# it was removed from the FDA database though
# deal with this in a later version

res.query("PROPRIETARYNAME == 'RAZADYNE'")

Unnamed: 0,rxcui,NDCPACKAGECODE,PROPRIETARYNAME,NONPROPRIETARYNAME,SUBSTANCENAME,v5,v4
90547,602734,21695-591-30,RAZADYNE,galantamine hydrobromide,GALANTAMINE HYDROBROMIDE,860693,860693
90548,602734,50458-398-60,RAZADYNE,galantamine hydrobromide,GALANTAMINE HYDROBROMIDE,860693,860693
90549,602736,50458-396-60,RAZADYNE,galantamine hydrobromide,GALANTAMINE HYDROBROMIDE,860693,860693
90550,602737,50458-397-60,RAZADYNE,galantamine hydrobromide,GALANTAMINE HYDROBROMIDE,860693,860693
115659,860697,50458-388-30,RAZADYNE,galantamine hydrobromide,GALANTAMINE HYDROBROMIDE,860693,860693
115671,860709,50458-389-30,RAZADYNE,galantamine hydrobromide,GALANTAMINE HYDROBROMIDE,860693,860693
115683,860717,50458-387-30,RAZADYNE,galantamine hydrobromide,GALANTAMINE HYDROBROMIDE,860693,860693
238571,2103461,21695-184-30,RAZADYNE,galantamine hydrobromide,GALANTAMINE HYDROBROMIDE,-1,2103461


---

# Analyze based on the FDA's stated active ingredients

The FDA provides some information about the active ingredient.
Use the information to see if we can find disagreements with our algorithm.

### Examine disagreements between our algorithm and the FDA

In [16]:
# this one also had a BN error earlier, which is now fixed

# there are three versions of the same protein
# debatable whether they're really the same

res.query("SUBSTANCENAME == 'PEGFILGRASTIM'")

Unnamed: 0,rxcui,NDCPACKAGECODE,PROPRIETARYNAME,NONPROPRIETARYNAME,SUBSTANCENAME,v5,v4
96010,727542,55513-190-01,Neulasta,pegfilgrastim,PEGFILGRASTIM,338036,338036353501
235873,2048025,67457-833-06,Fulphila,pegfilgrastim,PEGFILGRASTIM,2048018,2048018
238216,2102705,70114-101-01,UDENYCA,pegfilgrastim-cbqv,PEGFILGRASTIM,2102692,2102692


In [17]:
# this seems ok since one is E. coli derived and the other is recombinant DNA

# unchanged from version 4

res.query("SUBSTANCENAME == 'SOMATROPIN'")

Unnamed: 0,rxcui,NDCPACKAGECODE,PROPRIETARYNAME,NONPROPRIETARYNAME,SUBSTANCENAME,v5,v4
103251,847245,0169-7708-21,Norditropin,somatropin,SOMATROPIN,314845,314845
103252,847245,0169-7708-92,Norditropin,somatropin,SOMATROPIN,314845,314845
103253,847247,0169-7704-21,Norditropin,somatropin,SOMATROPIN,314845,314845
103254,847247,0169-7704-92,Norditropin,somatropin,SOMATROPIN,314845,314845
103259,847348,0169-7705-21,Norditropin,somatropin,SOMATROPIN,314845,314845
103260,847348,0169-7705-92,Norditropin,somatropin,SOMATROPIN,314845,314845
104543,849851,0169-7703-21,Norditropin,somatropin,SOMATROPIN,314845,314845
108460,854302,0781-3004-07,Omnitrope,Somatropin,SOMATROPIN,314845,314845
108461,854302,0781-3004-26,Omnitrope,Somatropin,SOMATROPIN,314845,314845
118882,864110,0781-3001-07,Omnitrope,Somatropin,SOMATROPIN,314845,314845


These examples show nuanced differences between the active ingredients of some similar drugs.
The FDA's table gives only a high-level summary of the active ingredients, which is not enough to conclude whether a mapping is correct.

For the two examples examined here, our algorithm's outputs appear to be correct.
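One way to surface candidates like these systematically is to group by the FDA substance name and count the distinct answers our algorithm produced. The sketch below does this on a toy frame built from the rows shown above; the frame name `toy` and the column `num_answers` are made up for illustration, and this is not necessarily how the examples in this notebook were originally found.

```python
import pandas as pd

# Toy rows copied from the PEGFILGRASTIM and SOMATROPIN outputs above.
toy = pd.DataFrame({
    "SUBSTANCENAME": [
        "PEGFILGRASTIM", "PEGFILGRASTIM", "PEGFILGRASTIM",
        "SOMATROPIN", "SOMATROPIN",
    ],
    "v5": ["338036", "2048018", "2102692", "314845", "314845"],
})

# Count distinct non-error answers per FDA substance name.
per_substance = (toy
    .query("v5 != '-1'")
    .groupby("SUBSTANCENAME")["v5"]
    .nunique()
    .to_frame("num_answers")
    .reset_index()
)

# Substances with more than one distinct answer are worth a manual look.
candidates = per_substance.query("num_answers > 1")
```

A substance with several answers is only a candidate for review, not necessarily an error: as with pegfilgrastim, the differences may be legitimate.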

## Earlier disagreements between version 1 and version 2

In [18]:
res.query("SUBSTANCENAME == 'ONDANSETRON'").head()

Unnamed: 0,rxcui,NDCPACKAGECODE,PROPRIETARYNAME,NONPROPRIETARYNAME,SUBSTANCENAME,v5,v4
654,104894,68462-157-13,Ondansetron,Ondansetron,ONDANSETRON,26225,26225
655,104894,0781-5238-01,Ondansetron,Ondansetron,ONDANSETRON,26225,26225
656,104894,0781-5238-64,Ondansetron,Ondansetron,ONDANSETRON,26225,26225
657,104894,0378-7732-93,Ondansetron,Ondansetron,ONDANSETRON,26225,26225
658,104894,62756-240-64,ondansetron,ondansetron,ONDANSETRON,26225,26225


In [19]:
res.query("SUBSTANCENAME == 'LIDOCAINE'").head()

Unnamed: 0,rxcui,NDCPACKAGECODE,PROPRIETARYNAME,NONPROPRIETARYNAME,SUBSTANCENAME,v5,v4
1041,106222,61543-1601-1,Bikini Zone Medicated CREME,LIDOCAINE,LIDOCAINE,6387,6387
34420,251919,66428-007-01,Instant Cool Skin,Lidocaine 0.5%,LIDOCAINE,6387,6387
148619,1010895,50488-6262-1,Lidocaine 4%,Lidocaine,LIDOCAINE,142440,142440
148711,1010895,50488-6263-1,Lidocaine 4 Percent PLUS,Lidocaine,LIDOCAINE,142440,142440
148799,1010931,46122-113-21,Good Neighbor Pharmacy Burn Relief,Lidocaine,LIDOCAINE,142440,142440


---

## Multi active ingredient drug examples

In [20]:
# now uses the right form of menthol

res.query("rxcui == 1300293")

Unnamed: 0,rxcui,NDCPACKAGECODE,PROPRIETARYNAME,NONPROPRIETARYNAME,SUBSTANCENAME,v5,v4
179723,1300293,51457-000-04,Alo Therapeutic Massage,"MENTHOL, HISTAMINE DIHYDROCHLORIDE",MENTHOL; HISTAMINE DIHYDROCHLORIDE,6750142136,6750142136
179724,1300293,71061-763-04,Alo Therapeutic Massage,"MENTHOL, HISTAMINE DIHYDROCHLORIDE",MENTHOL; HISTAMINE DIHYDROCHLORIDE,6750142136,6750142136
179725,1300293,71061-764-32,Alo Therapeutic Massage,"MENTHOL, HISTAMINE DIHYDROCHLORIDE",MENTHOL; HISTAMINE DIHYDROCHLORIDE,6750142136,6750142136
179726,1300293,71061-765-28,Alo Therapeutic Massage,"MENTHOL, HISTAMINE DIHYDROCHLORIDE",MENTHOL; HISTAMINE DIHYDROCHLORIDE,6750142136,6750142136
179727,1300293,71061-766-05,Alo Therapeutic Massage,"MENTHOL, HISTAMINE DIHYDROCHLORIDE",MENTHOL; HISTAMINE DIHYDROCHLORIDE,6750142136,6750142136
179728,1300293,51457-001-32,ALO THERAPEUTIC MASSAGE PAIN RELIEVING,"MENTHOL, HISTAMINE DIHYDROCHLORIDE",MENTHOL; HISTAMINE DIHYDROCHLORIDE,6750142136,6750142136


In [21]:
# this is also now correct

res.query("rxcui == 543879")

Unnamed: 0,rxcui,NDCPACKAGECODE,PROPRIETARYNAME,NONPROPRIETARYNAME,SUBSTANCENAME,v5,v4
85480,543879,51674-0130-5,RELEGARD,"GLACIAL ACETIC ACID, OXYQUINOLINE",ACETIC ACID; OXYQUINOLINE,16842836,16842836


---

# What are the term types of the active ingredients we found?

Verify that all the BN term nodes have been removed as active ingredients.

## Read relationships

In [22]:
rels = pd.read_csv("../../pipeline/rxnorm/rxcui_rels.tsv", sep='\t')

In [23]:
rels.head()

Unnamed: 0,rxcui1,rel,rxcui2,rela,rui
0,38,RB,1760,has_tradename,4696871
1,38,RO,105050,has_ingredient,4343918
2,38,RO,105445,has_ingredient,4229336
3,38,RO,105446,has_ingredient,3798489
4,38,RO,105447,has_ingredient,4423580


## Read term types

In [24]:
conso = pd.read_csv("../../pipeline/rxnorm/rxconso_info.tsv", sep='\t')

In [25]:
conso.head()

Unnamed: 0,rxcui,rxaui,tty,str,suppress,cvf
0,38,829,BN,Parlodel,N,4096.0
1,44,947,IN,Mesna,N,4096.0
2,61,1424,IN,beta-Alanine,N,4096.0
3,73,2458041,IN,Docosahexaenoate,N,4096.0
4,74,1684,IN,4-Aminobenzoic Acid,N,4096.0


### Get term types for each node

In [26]:
tty = defaultdict(set)
for row in conso.itertuples():
    tty[row.rxcui].add(row.tty)

### Generate results

In [27]:
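# for each RXCUI, take the union of the term types of all its active ingredient nodes
# a -1 (no answer) entry has no term types, so it yields an empty set
ans_ttys = defaultdict(set)

for row in ingredients.itertuples():
    for node in row.active_ingredients.split(","):
        ans_ttys[row.rxcui] |= tty[int(node)]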

In [28]:
ans = defaultdict(list)

for rxcui, temp in ans_ttys.items():
    ans["rxcui"].append(rxcui)
    ans["ing_ttys"].append(",".join(sorted(temp)))
    ans["num_ttys"].append(len(temp))
    
ans = pd.DataFrame(ans)

In [29]:
ans.shape

(41576, 3)

In [30]:
ans.head()

Unnamed: 0,rxcui,ing_ttys,num_ttys
0,91349,IN,1
1,91792,IN,1
2,92582,PIN,1
3,92583,PIN,1
4,92584,PIN,1


## Term types of active ingredients

In [31]:
ans["ing_ttys"].value_counts()

                  25229
IN                 7972
PIN                3853
IN,SY              1849
IN,TMSY             756
IN,PIN              689
PIN,TMSY            635
IN,PIN,TMSY         453
IN,PIN,SY            89
PIN,SY               27
IN,SY,TMSY           12
IN,PIN,SY,TMSY       12
Name: ing_ttys, dtype: int64

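All of the BN terms are gone.
Only the IN and PIN nodes remain; the SY and TMSY terms are synonyms.
The blank entry (25,229 RXCUIs) has no term types at all, which is what we would expect for mappings of -1 (no answer).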
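Reading the value counts by eye works here, but the same check can be made explicit. The following is a minimal sketch using a toy term type lookup; the function name `find_bn_nodes` and the toy mappings are hypothetical, and the comma separator is assumed from the `split(",")` call above.

```python
from collections import defaultdict

# Toy term type lookup: rxcui -> set of term types (first rows of conso.head()).
tty = defaultdict(set)
for rxcui, term_type in [(38, "BN"), (44, "IN"), (61, "IN")]:
    tty[rxcui].add(term_type)


def find_bn_nodes(mappings, tty):
    """Return {rxcui: [ingredient nodes that carry the BN term type]}."""
    offenders = {}
    for rxcui, active in mappings.items():
        nodes = [int(node) for node in str(active).split(",")]
        # .get avoids adding empty entries to the defaultdict
        bad = [node for node in nodes if "BN" in tty.get(node, set())]
        if bad:
            offenders[rxcui] = bad
    return offenders


# Hypothetical mappings: rxcui -> comma separated active ingredient nodes.
clean = {1001: "44", 1002: "44,61", 1003: "-1"}
dirty = {1004: "44,38"}

clean_offenders = find_bn_nodes(clean, tty)
dirty_offenders = find_bn_nodes(dirty, tty)
```

On the real data, an empty result from such a check would confirm the conclusion drawn from the value counts.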

In [32]:
ans["num_ttys"].value_counts()

0    25229
1    11825
2     3956
3      554
4       12
Name: num_ttys, dtype: int64

## Sample some examples

In [33]:
# BN terms have been removed

ans.query("rxcui == 757969").merge(res, how="left", on="rxcui")

Unnamed: 0,rxcui,ing_ttys,num_ttys,NDCPACKAGECODE,PROPRIETARYNAME,NONPROPRIETARYNAME,SUBSTANCENAME,v5,v4
0,757969,"IN,PIN",2,64764-702-01,PREVPAC,"lansoprazole, amoxicillin and clarithromycin",,1712821212133008,171282121283156133008203729


In [34]:
# version 4 included one BN term as an ingredient here; it is gone in version 5

ans.query("rxcui == 1493510").merge(res, how="left", on="rxcui")

Unnamed: 0,rxcui,ing_ttys,num_ttys,NDCPACKAGECODE,PROPRIETARYNAME,NONPROPRIETARYNAME,SUBSTANCENAME,v5,v4
0,1493510,"IN,PIN",2,51531-8977-0,Clear Proof Acne System,Benzoyl Peroxide and Salicylic Acid,,141895229907235418253186,1418952299072354182531861489128


## Summary

The term type filter worked as intended.
All BN nodes have been removed as active ingredients.

---

# Remaining problems

For NDCs with multiple RXCUIs: do all of the RXCUIs give one consistent answer?

In [35]:
summary = (ingredients
    .merge(
        data[["rxcui", "NDCPACKAGECODE", "suppress"]],
        how="inner", on="rxcui"
    )
    .drop_duplicates()
    [["NDCPACKAGECODE", "suppress", "rxcui", "active_ingredients"]]
    .sort_values(["NDCPACKAGECODE", "rxcui"])
    .reset_index(drop=True)
)

In [36]:
summary.shape

(241456, 4)

In [37]:
summary.head()

Unnamed: 0,NDCPACKAGECODE,suppress,rxcui,active_ingredients
0,0002-0800-01,N,540930,11295
1,0002-1200-30,N,1297712,-1
2,0002-1200-50,N,1297712,-1
3,0002-1407-01,N,853004,35220
4,0002-1433-61,N,1551300,1551291


In [38]:
summary["active_ingredients"].value_counts().head()

-1      51418
7806     6125
448      4792
5640     3475
1379     2289
Name: active_ingredients, dtype: int64

### Number of RXCUIs per NDC

In [39]:
ncuis = (summary
    .groupby("NDCPACKAGECODE")
    ["rxcui"]
    .nunique()
    .to_frame("num_cuis")
    .reset_index()
)

In [40]:
ncuis.shape

(239810, 2)

In [41]:
ncuis.head()

Unnamed: 0,NDCPACKAGECODE,num_cuis
0,0002-0800-01,1
1,0002-1200-30,1
2,0002-1200-50,1
3,0002-1407-01,1
4,0002-1433-61,1


In [42]:
ncuis["num_cuis"].value_counts()

1    238185
2      1607
3        18
Name: num_cuis, dtype: int64

In [43]:
stats = (summary
    .merge(
        ncuis, how="inner", on="NDCPACKAGECODE"
    )
    .assign(
        good_ans = lambda df: df["active_ingredients"].map(
            lambda v: v != "-1"
        )
    )
)

In [44]:
stats.head()

Unnamed: 0,NDCPACKAGECODE,suppress,rxcui,active_ingredients,num_cuis,good_ans
0,0002-0800-01,N,540930,11295,1,True
1,0002-1200-30,N,1297712,-1,1,False
2,0002-1200-50,N,1297712,-1,1,False
3,0002-1407-01,N,853004,35220,1,True
4,0002-1433-61,N,1551300,1551291,1,True


In [45]:
stats["good_ans"].value_counts()

True     190038
False     51418
Name: good_ans, dtype: int64

In [46]:
stats["good_ans"].value_counts(normalize=True)

True     0.78705
False    0.21295
Name: good_ans, dtype: float64

We have mapped 78.7% of the NDC-RXCUI rows (190,038 of 241,456) to their active ingredients.
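The rate above is computed over the rows of `stats`, so an NDC with several RXCUIs is counted once per RXCUI. A sketch of the per-NDC alternative follows, on toy rows taken from the tables in this notebook. Counting an NDC as mapped when any of its RXCUIs has an answer is an assumption about what "mapped" should mean.

```python
import pandas as pd

# Toy rows taken from the summary and multi-RXCUI tables in this notebook.
toy = pd.DataFrame({
    "NDCPACKAGECODE": [
        "0002-0800-01", "0002-1200-30", "52584-360-01", "52584-360-01",
    ],
    "active_ingredients": ["11295", "-1", "1908", "1901"],
})
toy["good_ans"] = toy["active_ingredients"] != "-1"

# Row-level rate: every NDC-RXCUI pair counts once.
row_rate = toy["good_ans"].mean()

# NDC-level rate: an NDC counts as mapped if any of its RXCUIs has an answer.
ndc_rate = toy.groupby("NDCPACKAGECODE")["good_ans"].any().mean()
```

Since most NDCs have a single RXCUI, the two rates should be close on the real data.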

In [47]:
stats.groupby(["num_cuis", "suppress", "good_ans"]).size()

num_cuis  suppress  good_ans
1         N         False        49229
                    True        186032
          O         False            4
                    True            18
          Y         False         1495
                    True          1410
2         N         False            3
                    True          1588
          Y         False          657
                    True           966
3         N         True            14
          Y         False           30
                    True            10
dtype: int64

## Check that we get consistent results for NDCs with multiple RXCUIs

In [48]:
# collect NDCs whose RXCUIs produced more than one distinct non-error answer
weird = []
for label, df in stats.query("num_cuis > 1").groupby("NDCPACKAGECODE"):
    if df["good_ans"].any():
        if df.query("good_ans")["active_ingredients"].nunique() != 1:
            weird.append(label)

In [49]:
(pd
    .Series(weird)
    .to_frame("NDCPACKAGECODE")
    .merge(stats, how="left", on="NDCPACKAGECODE")
    .sort_values(["NDCPACKAGECODE", "suppress"])
)

Unnamed: 0,NDCPACKAGECODE,suppress,rxcui,active_ingredients,num_cuis,good_ans
0,52584-360-01,N,1668250,1908,2,True
1,52584-360-01,Y,1867737,1901,2,True
2,52584-360-03,N,1668248,1908,2,True
3,52584-360-03,Y,1867737,1901,2,True
5,68001-285-36,N,1803932,6313,2,True
4,68001-285-36,Y,1720771,877015,2,True
7,68001-285-37,N,1803937,6313,2,True
6,68001-285-37,Y,1720771,877015,2,True
9,68001-285-40,N,1803930,6313,2,True
8,68001-285-40,Y,1720771,877015,2,True


Based on these results, it seems that the suppressed rows (suppress = Y) carry the incorrect ingredients,
while the unsuppressed rows (suppress = N) give the correct results.

We will update our algorithm to resolve these disagreements using the suppress column.
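A minimal sketch of what such a rule could look like, using rows from the table above plus one single-RXCUI NDC. The function name `resolve_by_suppress` is hypothetical, and keeping the suppressed rows when an NDC has no unsuppressed row at all is an assumption, not a decision made in this notebook.

```python
import pandas as pd

# Toy rows copied from the disagreement table and from summary.head().
toy = pd.DataFrame({
    "NDCPACKAGECODE": [
        "52584-360-01", "52584-360-01",
        "68001-285-36", "68001-285-36",
        "0002-0800-01",
    ],
    "suppress": ["N", "Y", "N", "Y", "N"],
    "rxcui": [1668250, 1867737, 1803932, 1720771, 540930],
    "active_ingredients": ["1908", "1901", "6313", "877015", "11295"],
})


def resolve_by_suppress(df):
    """Prefer unsuppressed rows; fall back to all rows if an NDC has none."""
    is_unsuppressed = df["suppress"] == "N"
    # True for every row of an NDC that has at least one unsuppressed row
    has_unsuppressed = (is_unsuppressed
        .groupby(df["NDCPACKAGECODE"])
        .transform("any")
    )
    keep = is_unsuppressed | ~has_unsuppressed
    return df[keep].reset_index(drop=True)


resolved = resolve_by_suppress(toy)
```

After this step each NDC in the toy frame has exactly one answer. The checks that follow establish whether the same holds for the full data.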

### Verify that we can use the suppress column, and that there will only be one unsuppressed result

In [50]:
# in an NDC with at least one good answer, every failed (-1) row must be suppressed
for label, df in stats.query("num_cuis > 1").groupby("NDCPACKAGECODE"):
    if df["good_ans"].any():
        assert (df.query("~good_ans")["suppress"] != "N").all()

In [51]:
# each NDC with multiple RXCUIs must have at most one unsuppressed row
for label, df in stats.query("num_cuis > 1").groupby("NDCPACKAGECODE"):
    assert (df["suppress"] == "N").sum() <= 1, label

# Conclusion

Version 5 has removed the bad BN nodes from the ingredients list.

However, a few NDCs with multiple RXCUIs still get conflicting answers; the suppress column should let us resolve them.