# Validate the generated NDC to active ingredient mappings

2019-04-23

Ensure that the mappings we generated from NDCs to active ingredient RXCUIs are correct.

## Version 4

Check that examples where we mapped to a different ingredient form have been fixed.

In [1]:
import pandas as pd
from collections import defaultdict

## Read generated mappings

In [2]:
ingredients = pd.read_csv("../../pipeline/ingredients/ndc_active_ingredients_version_4.tsv", sep='\t')

In [3]:
ingredients.shape

(41576, 2)

In [4]:
ingredients.head()

Unnamed: 0,rxcui,active_ingredients
0,91349,5499
1,91792,7813215260
2,92582,30145
3,92583,30145
4,92584,30145


## Read NDC metadata

In [5]:
data = pd.read_csv("../../pipeline/merged_ndc_info.tsv", sep='\t')

In [6]:
data.shape

(265692, 22)

In [7]:
data.head(2)

Unnamed: 0,rxcui,rxaui,NDCPACKAGECODE,suppress,PRODUCTID,PRODUCTNDC,PACKAGEDESCRIPTION,PRODUCTTYPENAME,PROPRIETARYNAME,NONPROPRIETARYNAME,...,MARKETINGCATEGORYNAME,APPLICATIONNUMBER,LABELERNAME,SUBSTANCENAME,ACTIVE_NUMERATOR_STRENGTH,ACTIVE_INGRED_UNIT,PHARM_CLASSES,DEASCHEDULE,NDC_EXCLUDE_FLAG,LISTING_RECORD_CERTIFIED_THROUGH
0,91349,3507080,12745-202-01,N,12745-202_7d063901-255c-bffc-e053-2a91aa0a91ee,12745-202,"59 mL in 1 BOTTLE, PLASTIC (12745-202-01)",HUMAN OTC DRUG,HYDROGEN PEROXIDE,HYDROGEN PEROXIDE,...,OTC MONOGRAPH NOT FINAL,part333A,Medical Chemical Corporation,HYDROGEN PEROXIDE,8.57,g/100mL,,,N,20191231.0
1,91349,3507080,12745-202-02,N,12745-202_7d063901-255c-bffc-e053-2a91aa0a91ee,12745-202,"118 mL in 1 BOTTLE, PLASTIC (12745-202-02)",HUMAN OTC DRUG,HYDROGEN PEROXIDE,HYDROGEN PEROXIDE,...,OTC MONOGRAPH NOT FINAL,part333A,Medical Chemical Corporation,HYDROGEN PEROXIDE,8.57,g/100mL,,,N,20191231.0


---

# Create results table

Compare against version 3.

In [8]:
ver3 = pd.read_csv("../../pipeline/ingredients/ndc_active_ingredients_version_3.tsv", sep='\t')

In [9]:
ver3.head()

Unnamed: 0,rxcui,active_ingredients
0,91349,5499
1,91792,7813215260
2,92582,30145
3,92583,30145
4,92584,30145


In [10]:
res = (data
    # keep one row per NDC package, with the FDA naming fields
    [[
        "rxcui", "NDCPACKAGECODE", "PROPRIETARYNAME",
        "NONPROPRIETARYNAME", "SUBSTANCENAME"
    ]]
    .drop_duplicates()
    .reset_index(drop=True)

    # attach the version 4 mapping
    .merge(ingredients, how="inner", on="rxcui")
    .rename(columns={"active_ingredients": "v4"})

    # attach the version 3 mapping for comparison
    .merge(ver3, how="inner", on="rxcui")
    .rename(columns={"active_ingredients": "v3"})

    .reset_index(drop=True)
)

In [11]:
res.shape

(241458, 7)
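
Both merges above are inner joins, so any rxcui missing from either mapping file drops out of `res` without warning. A minimal sketch of how those rows could be counted, using the `indicator` option of `merge`. The helper name `count_unmatched` is hypothetical, and the frames are toy stand-ins for `data` and `ingredients` (the second rxcui and its package code are made-up placeholders).

```python
import pandas as pd

def count_unmatched(left, right, key="rxcui"):
    """Count rows of `left` whose key has no match in `right` (hypothetical helper)."""
    merged = left.merge(
        right[[key]].drop_duplicates(),  # avoid duplicating left rows
        how="left",
        on=key,
        indicator=True,                  # adds a `_merge` column
    )
    return int((merged["_merge"] == "left_only").sum())

# Toy stand-ins for `data` and `ingredients`; rxcui 999999 is a placeholder
ndc = pd.DataFrame({
    "rxcui": [91349, 91349, 999999],
    "NDCPACKAGECODE": ["12745-202-01", "12745-202-02", "00000-000-00"],
})
mapping = pd.DataFrame({"rxcui": [91349], "active_ingredients": ["5499"]})

unmatched = count_unmatched(ndc, mapping)
print(unmatched)
```

Running the same helper against the real `data` and each mapping file would show how many NDC rows each join removes.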

In [12]:
res.head()

Unnamed: 0,rxcui,NDCPACKAGECODE,PROPRIETARYNAME,NONPROPRIETARYNAME,SUBSTANCENAME,v4,v3
0,91349,12745-202-01,HYDROGEN PEROXIDE,HYDROGEN PEROXIDE,HYDROGEN PEROXIDE,5499,5499
1,91349,12745-202-02,HYDROGEN PEROXIDE,HYDROGEN PEROXIDE,HYDROGEN PEROXIDE,5499,5499
2,91349,12745-202-03,HYDROGEN PEROXIDE,HYDROGEN PEROXIDE,HYDROGEN PEROXIDE,5499,5499
3,91349,34645-8030-4,Hydrogen Peroxide,Hydrogen Peroxide,HYDROGEN PEROXIDE,5499,5499
4,91349,55316-871-43,Hydrogen Peroxide,Hydrogen Peroxide,HYDROGEN PEROXIDE,5499,5499


---

## Changes from version 3

In [13]:
(res["v4"] == res["v3"]).value_counts()

True     205338
False     36120
dtype: int64
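
The counts above are per NDC package, and many packages share one rxcui. A rough sketch of counting changes per rxcui instead, which is closer to the number of mappings that actually changed. The toy frame mirrors the columns of `res` with values copied from rows shown in this notebook, and it assumes the mapping columns are stored as strings.

```python
import pandas as pd

# Toy rows shaped like `res` (assumption: v4 and v3 are string columns)
toy = pd.DataFrame({
    "rxcui": [91349, 91349, 104375, 104377, 104377],
    "v4": ["5499", "5499", "29046196472", "29046196472", "29046196472"],
    "v3": ["5499", "5499", "1964721546022", "1964721546022", "1964721546022"],
})

changed = toy[toy["v4"] != toy["v3"]]

n_changed_rows = len(changed)                  # NDC-level count
n_changed_rxcui = changed["rxcui"].nunique()   # distinct rxcuis affected
print(n_changed_rows, n_changed_rxcui)
```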

In [14]:
# both versions return too many active ingredients for Zestril
# the correct answer is 29046 alone

# 196472 is a BN (brand name) type node and not an ingredient
# in all the other cases this kind of node has a has_precise_ingredient edge

res.query("PROPRIETARYNAME == 'Zestril'")

Unnamed: 0,rxcui,NDCPACKAGECODE,PROPRIETARYNAME,NONPROPRIETARYNAME,SUBSTANCENAME,v4,v3
591,104375,52427-438-90,Zestril,Lisinopril,LISINOPRIL,29046196472,1964721546022
592,104376,52427-439-90,Zestril,Lisinopril,LISINOPRIL,29046196472,1964721546022
593,104377,52427-440-90,Zestril,Lisinopril,LISINOPRIL,29046196472,1964721546022
594,104377,70518-1451-0,Zestril,Lisinopril,LISINOPRIL,29046196472,1964721546022
595,104378,52427-441-90,Zestril,Lisinopril,LISINOPRIL,29046196472,1964721546022
596,104378,70518-1741-0,Zestril,Lisinopril,LISINOPRIL,29046196472,1964721546022
28636,206771,52427-443-90,Zestril,Lisinopril,LISINOPRIL,29046196472,1964721546022
31079,213482,52427-442-90,Zestril,Lisinopril,LISINOPRIL,29046196472,1964721546022


# Examine results

## Case study: razadyne

In [15]:
# we correctly identified the active ingredient (860693) for the drug razadyne
# for every rxcui except 2103461

# that problem is still not resolved in version 4

# rxcui 2103461 is giving us an error because it has no edges,
# so it ends up mapped to itself
# it was removed from the FDA database though
# deal with this in a later version

res.query("PROPRIETARYNAME == 'RAZADYNE'")

Unnamed: 0,rxcui,NDCPACKAGECODE,PROPRIETARYNAME,NONPROPRIETARYNAME,SUBSTANCENAME,v4,v3
90547,602734,21695-591-30,RAZADYNE,galantamine hydrobromide,GALANTAMINE HYDROBROMIDE,860693,860693
90548,602734,50458-398-60,RAZADYNE,galantamine hydrobromide,GALANTAMINE HYDROBROMIDE,860693,860693
90549,602736,50458-396-60,RAZADYNE,galantamine hydrobromide,GALANTAMINE HYDROBROMIDE,860693,860693
90550,602737,50458-397-60,RAZADYNE,galantamine hydrobromide,GALANTAMINE HYDROBROMIDE,860693,860693
115659,860697,50458-388-30,RAZADYNE,galantamine hydrobromide,GALANTAMINE HYDROBROMIDE,860693,860693
115671,860709,50458-389-30,RAZADYNE,galantamine hydrobromide,GALANTAMINE HYDROBROMIDE,860693,860693
115683,860717,50458-387-30,RAZADYNE,galantamine hydrobromide,GALANTAMINE HYDROBROMIDE,860693,860693
238571,2103461,21695-184-30,RAZADYNE,galantamine hydrobromide,GALANTAMINE HYDROBROMIDE,2103461,2103461
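
The last row above shows an rxcui whose mapping is just the rxcui itself. A quick sketch for flagging every such row, on the assumption that a self-mapping signals a node with no usable edges. The toy frame copies three of the rows shown above.

```python
import pandas as pd

# Toy rows shaped like `res`, copied from the Razadyne output
toy = pd.DataFrame({
    "rxcui": [602734, 860697, 2103461],
    "NDCPACKAGECODE": ["21695-591-30", "50458-388-30", "21695-184-30"],
    "v4": ["860693", "860693", "2103461"],
})

# Compare as strings so the check works whether v4 is stored as int or str
self_mapped = toy[toy["v4"].astype(str) == toy["rxcui"].astype(str)]

flagged = self_mapped["rxcui"].tolist()
print(flagged)
```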


---

# Analyze based on the FDA's stated active ingredients

The FDA provides some information about the active ingredient.
Use the information to see if we can find disagreements with our algorithm.
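
One way to surface candidate disagreements is to look for FDA substance names that map to more than one distinct ingredient set. A minimal sketch, using toy rows copied from outputs in this notebook; it assumes the mapping column can be compared as a plain string.

```python
import pandas as pd

# Toy rows shaped like `res`
toy = pd.DataFrame({
    "SUBSTANCENAME": [
        "PEGFILGRASTIM", "PEGFILGRASTIM", "PEGFILGRASTIM",
        "HYDROGEN PEROXIDE", "HYDROGEN PEROXIDE",
    ],
    "v4": ["338036353501", "2048018", "2102692", "5499", "5499"],
})

# Number of distinct mappings per FDA substance name
n_mappings = toy.groupby("SUBSTANCENAME")["v4"].nunique()

# Substances with more than one mapping are worth a manual look
disagreements = n_mappings[n_mappings > 1].index.tolist()
print(disagreements)
```

A substance flagged this way is not necessarily wrong; it only marks where the FDA name is coarser than our mapping.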

### Examine disagreements between our algorithm and the FDA

In [16]:
# seems to be two versions of the same protein
# debatable whether they're really the same

res.query("SUBSTANCENAME == 'PEGFILGRASTIM'")

Unnamed: 0,rxcui,NDCPACKAGECODE,PROPRIETARYNAME,NONPROPRIETARYNAME,SUBSTANCENAME,v4,v3
96010,727542,55513-190-01,Neulasta,pegfilgrastim,PEGFILGRASTIM,338036353501,338036353501
235873,2048025,67457-833-06,Fulphila,pegfilgrastim,PEGFILGRASTIM,2048018,2048018
238216,2102705,70114-101-01,UDENYCA,pegfilgrastim-cbqv,PEGFILGRASTIM,2102692,2102692


In [17]:
# this seems ok since one is E. coli derived and the other is recombinant DNA

# still good in version 4

res.query("SUBSTANCENAME == 'SOMATROPIN'")

Unnamed: 0,rxcui,NDCPACKAGECODE,PROPRIETARYNAME,NONPROPRIETARYNAME,SUBSTANCENAME,v4,v3
103251,847245,0169-7708-21,Norditropin,somatropin,SOMATROPIN,314845,314845
103252,847245,0169-7708-92,Norditropin,somatropin,SOMATROPIN,314845,314845
103253,847247,0169-7704-21,Norditropin,somatropin,SOMATROPIN,314845,314845
103254,847247,0169-7704-92,Norditropin,somatropin,SOMATROPIN,314845,314845
103259,847348,0169-7705-21,Norditropin,somatropin,SOMATROPIN,314845,314845
103260,847348,0169-7705-92,Norditropin,somatropin,SOMATROPIN,314845,314845
104543,849851,0169-7703-21,Norditropin,somatropin,SOMATROPIN,314845,314845
108460,854302,0781-3004-07,Omnitrope,Somatropin,SOMATROPIN,314845,314845
108461,854302,0781-3004-26,Omnitrope,Somatropin,SOMATROPIN,314845,314845
118882,864110,0781-3001-07,Omnitrope,Somatropin,SOMATROPIN,314845,314845


In these examples, similar drugs turn out to have subtly different active ingredients.
The FDA's table gives only a high-level summary of the active ingredients, which is not enough information to conclude whether a mapping is correct.

For the two examples we looked at here our algorithm's outputs seem to be correct.

## Previous version 2 disagreements with version 1

In [18]:
# version 3 has fixed version 2's errors
# we have removed 26225 with the new rule

res.query("SUBSTANCENAME == 'ONDANSETRON'").head()

Unnamed: 0,rxcui,NDCPACKAGECODE,PROPRIETARYNAME,NONPROPRIETARYNAME,SUBSTANCENAME,v4,v3
654,104894,68462-157-13,Ondansetron,Ondansetron,ONDANSETRON,26225,26225
655,104894,0781-5238-01,Ondansetron,Ondansetron,ONDANSETRON,26225,26225
656,104894,0781-5238-64,Ondansetron,Ondansetron,ONDANSETRON,26225,26225
657,104894,0378-7732-93,Ondansetron,Ondansetron,ONDANSETRON,26225,26225
658,104894,62756-240-64,ondansetron,ondansetron,ONDANSETRON,26225,26225


In [19]:
# need to revisit this example
# more complicated than we previously thought

res.query("SUBSTANCENAME == 'LIDOCAINE'").head()

Unnamed: 0,rxcui,NDCPACKAGECODE,PROPRIETARYNAME,NONPROPRIETARYNAME,SUBSTANCENAME,v4,v3
1041,106222,61543-1601-1,Bikini Zone Medicated CREME,LIDOCAINE,LIDOCAINE,6387,6387
34420,251919,66428-007-01,Instant Cool Skin,Lidocaine 0.5%,LIDOCAINE,6387,6387
148619,1010895,50488-6262-1,Lidocaine 4%,Lidocaine,LIDOCAINE,142440,142440
148711,1010895,50488-6263-1,Lidocaine 4 Percent PLUS,Lidocaine,LIDOCAINE,142440,142440
148799,1010931,46122-113-21,Good Neighbor Pharmacy Burn Relief,Lidocaine,LIDOCAINE,142440,142440


---

## Multi active ingredient drug examples

In [20]:
# now uses the right form of menthol

res.query("rxcui == 1300293")

Unnamed: 0,rxcui,NDCPACKAGECODE,PROPRIETARYNAME,NONPROPRIETARYNAME,SUBSTANCENAME,v4,v3
179723,1300293,51457-000-04,Alo Therapeutic Massage,"MENTHOL, HISTAMINE DIHYDROCHLORIDE",MENTHOL; HISTAMINE DIHYDROCHLORIDE,6750142136,1421361648123
179724,1300293,71061-763-04,Alo Therapeutic Massage,"MENTHOL, HISTAMINE DIHYDROCHLORIDE",MENTHOL; HISTAMINE DIHYDROCHLORIDE,6750142136,1421361648123
179725,1300293,71061-764-32,Alo Therapeutic Massage,"MENTHOL, HISTAMINE DIHYDROCHLORIDE",MENTHOL; HISTAMINE DIHYDROCHLORIDE,6750142136,1421361648123
179726,1300293,71061-765-28,Alo Therapeutic Massage,"MENTHOL, HISTAMINE DIHYDROCHLORIDE",MENTHOL; HISTAMINE DIHYDROCHLORIDE,6750142136,1421361648123
179727,1300293,71061-766-05,Alo Therapeutic Massage,"MENTHOL, HISTAMINE DIHYDROCHLORIDE",MENTHOL; HISTAMINE DIHYDROCHLORIDE,6750142136,1421361648123
179728,1300293,51457-001-32,ALO THERAPEUTIC MASSAGE PAIN RELIEVING,"MENTHOL, HISTAMINE DIHYDROCHLORIDE",MENTHOL; HISTAMINE DIHYDROCHLORIDE,6750142136,1421361648123


In [21]:
# this is also now correct

res.query("rxcui == 543879")

Unnamed: 0,rxcui,NDCPACKAGECODE,PROPRIETARYNAME,NONPROPRIETARYNAME,SUBSTANCENAME,v4,v3
85480,543879,51674-0130-5,RELEGARD,"GLACIAL ACETIC ACID, OXYQUINOLINE",ACETIC ACID; OXYQUINOLINE,16842836,16842836


# Conclusion

Version 4 finally fixed the has_form mapping issue.

However, some drugs still map to too many ingredients because of issues with the semantic network, as in the Zestril example.
We will look into whether these cases can be resolved.