# Validate the generated NDC to active ingredient mappings

2019-04-23

Ensure that the mappings we generated from NDCs to active ingredient RXCUIs are correct.

## Version 3

We want to see if the new `has_form` rule has reduced the number of errors.

In [1]:
import pandas as pd
from collections import defaultdict

## Read generated mappings

In [2]:
ingredients = pd.read_csv("../../pipeline/ingredients/ndc_active_ingredients_version_3.tsv", sep='\t')

In [3]:
ingredients.shape

(41576, 2)

In [4]:
ingredients.head()

Unnamed: 0,rxcui,active_ingredients
0,91349,5499
1,91792,7813215260
2,92582,30145
3,92583,30145
4,92584,30145


## Read NDC metadata

In [5]:
data = pd.read_csv("../../pipeline/merged_ndc_info.tsv", sep='\t')

In [6]:
data.shape

(265692, 22)

In [7]:
data.head(2)

Unnamed: 0,rxcui,rxaui,NDCPACKAGECODE,suppress,PRODUCTID,PRODUCTNDC,PACKAGEDESCRIPTION,PRODUCTTYPENAME,PROPRIETARYNAME,NONPROPRIETARYNAME,...,MARKETINGCATEGORYNAME,APPLICATIONNUMBER,LABELERNAME,SUBSTANCENAME,ACTIVE_NUMERATOR_STRENGTH,ACTIVE_INGRED_UNIT,PHARM_CLASSES,DEASCHEDULE,NDC_EXCLUDE_FLAG,LISTING_RECORD_CERTIFIED_THROUGH
0,91349,3507080,12745-202-01,N,12745-202_7d063901-255c-bffc-e053-2a91aa0a91ee,12745-202,"59 mL in 1 BOTTLE, PLASTIC (12745-202-01)",HUMAN OTC DRUG,HYDROGEN PEROXIDE,HYDROGEN PEROXIDE,...,OTC MONOGRAPH NOT FINAL,part333A,Medical Chemical Corporation,HYDROGEN PEROXIDE,8.57,g/100mL,,,N,20191231.0
1,91349,3507080,12745-202-02,N,12745-202_7d063901-255c-bffc-e053-2a91aa0a91ee,12745-202,"118 mL in 1 BOTTLE, PLASTIC (12745-202-02)",HUMAN OTC DRUG,HYDROGEN PEROXIDE,HYDROGEN PEROXIDE,...,OTC MONOGRAPH NOT FINAL,part333A,Medical Chemical Corporation,HYDROGEN PEROXIDE,8.57,g/100mL,,,N,20191231.0


---

# Create results table

Compare version 3 with version 1's results.

In [8]:
ver1 = pd.read_csv("../../pipeline/ingredients/ndc_active_ingredients_version_1.tsv", sep='\t')

In [9]:
ver1.head()

Unnamed: 0,rxcui,active_ingredients
0,91349,-90000
1,91792,-500
2,92582,30145
3,92583,30145
4,92584,30145


In [10]:
res = (data
    [[
        "rxcui", "NDCPACKAGECODE", "PROPRIETARYNAME",
        "NONPROPRIETARYNAME", "SUBSTANCENAME"
    ]]
    .drop_duplicates()
    .reset_index(drop=True)
       
    .merge(ingredients, how="inner", on="rxcui")
    .rename(columns={"active_ingredients": "v3"})
       
    .merge(ver1, how="inner", on="rxcui")
    .rename(columns={"active_ingredients": "v1"})

    .reset_index(drop=True)
)

In [11]:
res.shape

(241458, 7)

In [12]:
res.head()

Unnamed: 0,rxcui,NDCPACKAGECODE,PROPRIETARYNAME,NONPROPRIETARYNAME,SUBSTANCENAME,v3,v1
0,91349,12745-202-01,HYDROGEN PEROXIDE,HYDROGEN PEROXIDE,HYDROGEN PEROXIDE,5499,-90000
1,91349,12745-202-02,HYDROGEN PEROXIDE,HYDROGEN PEROXIDE,HYDROGEN PEROXIDE,5499,-90000
2,91349,12745-202-03,HYDROGEN PEROXIDE,HYDROGEN PEROXIDE,HYDROGEN PEROXIDE,5499,-90000
3,91349,34645-8030-4,Hydrogen Peroxide,Hydrogen Peroxide,HYDROGEN PEROXIDE,5499,-90000
4,91349,55316-871-43,Hydrogen Peroxide,Hydrogen Peroxide,HYDROGEN PEROXIDE,5499,-90000


---

# Examine results

## Case study: razadyne

In [13]:
# we correctly identified the active ingredient for the drug razadyne

# version 3 still has the same problem as version 2
# rxcui 2103461 is giving us an error because it has no edges
# it was removed from the FDA database though
# deal with this in a later version

res.query("PROPRIETARYNAME == 'RAZADYNE'")

Unnamed: 0,rxcui,NDCPACKAGECODE,PROPRIETARYNAME,NONPROPRIETARYNAME,SUBSTANCENAME,v3,v1
90547,602734,21695-591-30,RAZADYNE,galantamine hydrobromide,GALANTAMINE HYDROBROMIDE,860693,860693
90548,602734,50458-398-60,RAZADYNE,galantamine hydrobromide,GALANTAMINE HYDROBROMIDE,860693,860693
90549,602736,50458-396-60,RAZADYNE,galantamine hydrobromide,GALANTAMINE HYDROBROMIDE,860693,860693
90550,602737,50458-397-60,RAZADYNE,galantamine hydrobromide,GALANTAMINE HYDROBROMIDE,860693,860693
115659,860697,50458-388-30,RAZADYNE,galantamine hydrobromide,GALANTAMINE HYDROBROMIDE,860693,860693
115671,860709,50458-389-30,RAZADYNE,galantamine hydrobromide,GALANTAMINE HYDROBROMIDE,860693,860693
115683,860717,50458-387-30,RAZADYNE,galantamine hydrobromide,GALANTAMINE HYDROBROMIDE,860693,860693
238571,2103461,21695-184-30,RAZADYNE,galantamine hydrobromide,GALANTAMINE HYDROBROMIDE,2103461,-90000
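
The last row above has `v3` equal to its own `rxcui`. Assuming that is how an RXCUI with no edges shows up in the output (an inference from this one row, not something the pipeline documents here), a minimal plain-Python sketch for flagging such rows could look like this. The `rows` list is a hand-copied subset of the table above.

```python
# Hand-copied subset of the RAZADYNE rows: rxcui and its version 3 mapping.
rows = [
    {"rxcui": 602734, "v3": "860693"},
    {"rxcui": 860697, "v3": "860693"},
    {"rxcui": 2103461, "v3": "2103461"},
]

# Assumption: a product RXCUI that appears in its own ingredient list
# was never resolved to a real active ingredient.
self_mapped = [
    row["rxcui"]
    for row in rows
    if str(row["rxcui"]) in row["v3"].split(",")
]

print(self_mapped)
```

Rows flagged this way are candidates for the "deal with this in a later version" note, not confirmed errors.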


---

## Did we improve upon version 1's good results?

In [14]:
good = res.query("v1 > 0")

In [15]:
good.shape

(6922, 7)

In [16]:
good.head()

Unnamed: 0,rxcui,NDCPACKAGECODE,PROPRIETARYNAME,NONPROPRIETARYNAME,SUBSTANCENAME,v3,v1
244,92582,0085-3149-01,ELOCON,Mometasone Furoate,MOMETASONE FUROATE,30145,30145
245,92582,0085-3149-02,ELOCON,Mometasone Furoate,MOMETASONE FUROATE,30145,30145
246,92582,0085-3149-03,ELOCON,Mometasone Furoate,MOMETASONE FUROATE,30145,30145
247,92583,0085-0854-01,ELOCON,Mometasone Furoate,MOMETASONE FUROATE,30145,30145
248,92583,0085-0854-02,ELOCON,Mometasone Furoate,MOMETASONE FUROATE,30145,30145


---

# Analyze based on the FDA's stated active ingredients

The FDA provides some information about the active ingredient.
Use the information to see if we can find disagreements with our algorithm.

### Examine disagreements between our algorithm and the FDA

In [17]:
for label, df in good.groupby("SUBSTANCENAME"):
    if df["v1"].nunique() != 1:
        print(label)

BOTULINUM TOXIN TYPE A
CHLORCYCLIZINE HYDROCHLORIDE
DEXTROMETHORPHAN HYDROBROMIDE
DOXORUBICIN HYDROCHLORIDE
DOXYCYCLINE
FILGRASTIM
GUANFACINE HYDROCHLORIDE
HUMAN IMMUNOGLOBULIN G
INFLIXIMAB
IRINOTECAN HYDROCHLORIDE
PEGFILGRASTIM
SOMATROPIN
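
The loop above only prints the substance name. A small plain-Python sketch of the same check that also keeps the distinct RXCUIs makes the disagreements easier to read. The rows are hand-copied from tables in this notebook, and the variable names are only illustrative.

```python
from collections import defaultdict

# (SUBSTANCENAME, v1) pairs copied from the tables in this notebook.
rows = [
    ("IRINOTECAN HYDROCHLORIDE", 1719767),  # Onivyde
    ("IRINOTECAN HYDROCHLORIDE", 153329),   # Camptosar
    ("PEGFILGRASTIM", 2048018),             # Fulphila
    ("PEGFILGRASTIM", 2102692),             # UDENYCA
    ("LIDOCAINE", 142440),
    ("LIDOCAINE", 142440),
]

# Collect the distinct v1 RXCUIs seen under each FDA substance name.
distinct_v1 = defaultdict(set)
for substance, v1 in rows:
    distinct_v1[substance].add(v1)

# Keep only the names that map to more than one RXCUI.
disagreements = {
    name: sorted(rxcuis)
    for name, rxcuis in distinct_v1.items()
    if len(rxcuis) > 1
}

for name, rxcuis in disagreements.items():
    print(name, rxcuis)
```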


In [18]:
# RxNorm seems to treat these as two different ingredients
# there is no clear way to get from one form to the liposome form

# this does seem like a real difference

good.query("SUBSTANCENAME == 'IRINOTECAN HYDROCHLORIDE'")

Unnamed: 0,rxcui,NDCPACKAGECODE,PROPRIETARYNAME,NONPROPRIETARYNAME,SUBSTANCENAME,v3,v1
206647,1719777,69171-398-01,Onivyde,Irinotecan hydrochloride,IRINOTECAN HYDROCHLORIDE,1719767,1719767
206648,1719777,15054-0043-1,Onivyde,IRINOTECAN HYDROCHLORIDE,IRINOTECAN HYDROCHLORIDE,1719767,1719767
208228,1726321,0009-7529-04,Camptosar,irinotecan hydrochloride,IRINOTECAN HYDROCHLORIDE,153329,153329
208229,1726323,0009-7529-04,Camptosar,irinotecan hydrochloride,IRINOTECAN HYDROCHLORIDE,153329,153329
208230,1726323,0009-7529-03,Camptosar,irinotecan hydrochloride,IRINOTECAN HYDROCHLORIDE,153329,153329
208231,1726323,0009-7529-05,Camptosar,irinotecan hydrochloride,IRINOTECAN HYDROCHLORIDE,153329,153329
208232,1726323,0009-1111-02,Camptosar,irinotecan hydrochloride,IRINOTECAN HYDROCHLORIDE,153329,153329
208250,1726325,0009-7529-03,Camptosar,irinotecan hydrochloride,IRINOTECAN HYDROCHLORIDE,153329,153329
208251,1726325,0009-1111-02,Camptosar,irinotecan hydrochloride,IRINOTECAN HYDROCHLORIDE,153329,153329
208252,1726335,0009-7529-05,Camptosar,irinotecan hydrochloride,IRINOTECAN HYDROCHLORIDE,153329,153329


In [19]:
# these seem to be two versions of the same protein
# debatable whether they're really the same

good.query("SUBSTANCENAME == 'PEGFILGRASTIM'")

Unnamed: 0,rxcui,NDCPACKAGECODE,PROPRIETARYNAME,NONPROPRIETARYNAME,SUBSTANCENAME,v3,v1
235873,2048025,67457-833-06,Fulphila,pegfilgrastim,PEGFILGRASTIM,2048018,2048018
238216,2102705,70114-101-01,UDENYCA,pegfilgrastim-cbqv,PEGFILGRASTIM,2102692,2102692


In [20]:
# this seems ok since one is E. coli derived and the other is recombinant DNA

# version 3 has fixed version 2's error of too many active ingredients

good.query("SUBSTANCENAME == 'SOMATROPIN'")

Unnamed: 0,rxcui,NDCPACKAGECODE,PROPRIETARYNAME,NONPROPRIETARYNAME,SUBSTANCENAME,v3,v1
103251,847245,0169-7708-21,Norditropin,somatropin,SOMATROPIN,314845,314845
103252,847245,0169-7708-92,Norditropin,somatropin,SOMATROPIN,314845,314845
103253,847247,0169-7704-21,Norditropin,somatropin,SOMATROPIN,314845,314845
103254,847247,0169-7704-92,Norditropin,somatropin,SOMATROPIN,314845,314845
103259,847348,0169-7705-21,Norditropin,somatropin,SOMATROPIN,314845,314845
103260,847348,0169-7705-92,Norditropin,somatropin,SOMATROPIN,314845,314845
104543,849851,0169-7703-21,Norditropin,somatropin,SOMATROPIN,314845,314845
108460,854302,0781-3004-07,Omnitrope,Somatropin,SOMATROPIN,314845,314845
108461,854302,0781-3004-26,Omnitrope,Somatropin,SOMATROPIN,314845,314845
118882,864110,0781-3001-07,Omnitrope,Somatropin,SOMATROPIN,314845,314845


For these examples, there seem to be nuanced differences between the active ingredients of some similar drugs.
The FDA's table provides a high-level summary of the active ingredients, but does not contain enough information to conclude whether a mapping is correct.

For the three examples we looked at here, our algorithm's outputs seem to be correct.

Version 3 update: we have fixed the somatropin example to remove 61148.

## Drugs with multiple ingredients

In [21]:
temp = (good
    .groupby("SUBSTANCENAME")
    ["v1"]
    .nunique()
    .to_frame("num_v1")
    .reset_index()
    .assign(num_fda = lambda df: df["SUBSTANCENAME"].str.count(";") + 1)
    .assign(match_v1 = lambda df: df["num_v1"] == df["num_fda"])
)

In [22]:
temp.head()

Unnamed: 0,SUBSTANCENAME,num_v1,num_fda,match_v1
0,.ALPHA.1-PROTEINASE INHIBITOR HUMAN,1,1,True
1,ABACAVIR SULFATE; LAMIVUDINE,1,2,False
2,ABIRATERONE ACETATE,1,1,True
3,ACEBUTOLOL HYDROCHLORIDE,1,1,True
4,ACETAMINOPHEN; ASPIRIN; DIPHENHYDRAMINE CITRATE,1,3,False
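
The `num_fda` column relies on the FDA's `SUBSTANCENAME` field listing ingredients separated by semicolons. A minimal sketch of that counting rule as a plain function, where the name `count_fda_ingredients` is just illustrative:

```python
def count_fda_ingredients(substance_name: str) -> int:
    """Count ingredients in a semicolon-delimited SUBSTANCENAME string."""
    # One ingredient has no separator, two have one separator, and so on.
    return substance_name.count(";") + 1


# Substance names taken from the table above.
examples = [
    "ABIRATERONE ACETATE",
    "ABACAVIR SULFATE; LAMIVUDINE",
    "ACETAMINOPHEN; ASPIRIN; DIPHENHYDRAMINE CITRATE",
]

for name in examples:
    print(count_fda_ingredients(name), name)
```

This assumes a semicolon never occurs inside a single ingredient name. If it did, that substance would be overcounted.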


In [23]:
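# Count the distinct version 3 RXCUIs per FDA substance name.
# v3 holds comma-separated RXCUI strings, so split each value into a set
# and take the union of those sets across all rows in the group.
v3 = defaultdict(list)

for label, df in good.groupby("SUBSTANCENAME"):
    v3["SUBSTANCENAME"].append(label)
    v3["num_v3"].append(
        len(
            set().union(
                *df["v3"].str.split(",").map(set)
            )
        )
    )

v3 = pd.DataFrame(v3)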

In [24]:
v3.head()

Unnamed: 0,SUBSTANCENAME,num_v3
0,.ALPHA.1-PROTEINASE INHIBITOR HUMAN,1
1,ABACAVIR SULFATE; LAMIVUDINE,2
2,ABIRATERONE ACETATE,1
3,ACEBUTOLOL HYDROCHLORIDE,1
4,ACETAMINOPHEN; ASPIRIN; DIPHENHYDRAMINE CITRATE,3
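
The `num_v3` count is the size of the union of all RXCUIs seen for a substance name. Here is a plain-Python sketch of the same idea, assuming (as the `.str.split(",")` call implies) that each `v3` value is a comma-separated string of RXCUIs. The helper name and the sample values are illustrative.

```python
def count_distinct_rxcuis(v3_values):
    """Return how many distinct RXCUIs appear across comma-separated strings."""
    distinct = set()
    for value in v3_values:
        distinct.update(part.strip() for part in value.split(","))
    return len(distinct)


# Some rows list one RXCUI and others list two for the same substance.
insulin_v3 = ["253181,253182", "253181", "253181,253182"]
somatropin_v3 = ["314845", "314845"]

print(count_distinct_rxcuis(insulin_v3))
print(count_distinct_rxcuis(somatropin_v3))
```

Because the union is taken over every product under a substance name, a single product with an extra ingredient is enough to raise the count for the whole group.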


---

## Compare version 1 with version 3

In [25]:
diff = (temp
    .merge(v3, how="inner", on="SUBSTANCENAME")
    .assign(match_v3 = lambda df: df["num_fda"] == df["num_v3"])
)

In [26]:
diff.shape

(769, 6)

In [27]:
diff.head()

Unnamed: 0,SUBSTANCENAME,num_v1,num_fda,match_v1,num_v3,match_v3
0,.ALPHA.1-PROTEINASE INHIBITOR HUMAN,1,1,True,1,True
1,ABACAVIR SULFATE; LAMIVUDINE,1,2,False,2,True
2,ABIRATERONE ACETATE,1,1,True,1,True
3,ACEBUTOLOL HYDROCHLORIDE,1,1,True,1,True
4,ACETAMINOPHEN; ASPIRIN; DIPHENHYDRAMINE CITRATE,1,3,False,3,True


In [28]:
diff["match_v1"].value_counts()

True     559
False    210
Name: match_v1, dtype: int64

In [29]:
diff["match_v3"].value_counts()

True     748
False     21
Name: match_v3, dtype: int64

In [30]:
diff.groupby(["match_v1", "match_v3"]).size()

match_v1  match_v3
False     False        20
          True        190
True      False         1
          True        558
dtype: int64

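The number of substance names where our ingredient count agrees with the FDA's has gone up since version 2, which is good: 748 of 769 now match.

We have also drastically reduced the number of disagreements with version 1: only one substance name that matched the FDA's count under version 1 no longer matches under version 3.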
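
To put the cross-tabulation in terms of rates, the counts from cell 30 can be summed directly. This is a small arithmetic sketch using only the numbers printed above.

```python
# (match_v1, match_v3) -> number of substance names, copied from cell 30.
crosstab = {
    (False, False): 20,
    (False, True): 190,
    (True, False): 1,
    (True, True): 558,
}

total = sum(crosstab.values())
v1_matches = sum(n for (m1, _), n in crosstab.items() if m1)
v3_matches = sum(n for (_, m3), n in crosstab.items() if m3)

print(f"version 1 agrees with the FDA count: {v1_matches}/{total} = {v1_matches / total:.1%}")
print(f"version 3 agrees with the FDA count: {v3_matches}/{total} = {v3_matches / total:.1%}")
print(f"fixed by version 3: {crosstab[(False, True)]}, broken by version 3: {crosstab[(True, False)]}")
```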

## Previous version 2 disagreements with version 1

In [31]:
# version 3 has fixed version 2's errors
# we have removed 26225 with the new rule

good.query("SUBSTANCENAME == 'ONDANSETRON'")

Unnamed: 0,rxcui,NDCPACKAGECODE,PROPRIETARYNAME,NONPROPRIETARYNAME,SUBSTANCENAME,v3,v1
121316,876690,0078-0679-19,ZOFRAN,ondansetron hydrochloride,ONDANSETRON,203148,203148
121317,876693,0078-0680-19,ZOFRAN,ondansetron hydrochloride,ONDANSETRON,203148,203148


In [32]:
# this example has also been fixed

good.query("SUBSTANCENAME == 'LIDOCAINE'")

Unnamed: 0,rxcui,NDCPACKAGECODE,PROPRIETARYNAME,NONPROPRIETARYNAME,SUBSTANCENAME,v3,v1
149244,1011705,63481-687-06,LIDODERM,lidocaine,LIDOCAINE,142440,142440
149245,1011705,50436-1331-1,LIDODERM,lidocaine,LIDOCAINE,142440,142440
149246,1011705,49999-419-30,LIDODERM,lidocaine,LIDOCAINE,142440,142440
149276,1011767,11523-0404-1,Solarcaine,Lidocaine,LIDOCAINE,142440,142440
149277,1011767,11523-0404-2,Solarcaine,Lidocaine,LIDOCAINE,142440,142440
211535,1737782,62168-0584-0,Aspercreme,Lidocaine,LIDOCAINE,142440,142440
211536,1737782,62168-0584-1,Aspercreme,Lidocaine,LIDOCAINE,142440,142440
211537,1737782,62168-0585-1,Aspercreme,Lidocaine,LIDOCAINE,142440,142440


---

## Remaining disagreement with version 1

In [33]:
# there seem to be different forms of insulin, so our result is correct

diff.query("match_v1 and ~match_v3").merge(good, how="left", on="SUBSTANCENAME")

Unnamed: 0,SUBSTANCENAME,num_v1,num_fda,match_v1,num_v3,match_v3,rxcui,NDCPACKAGECODE,PROPRIETARYNAME,NONPROPRIETARYNAME,v3,v1
0,INSULIN HUMAN,1,1,True,2,False,213442,64725-1837-1,Novolin,Human Insulin,"253181,253182",253181
1,INSULIN HUMAN,1,1,True,2,False,213442,50090-0403-0,Novolin,Human Insulin,"253181,253182",253181
2,INSULIN HUMAN,1,1,True,2,False,213442,0169-1837-02,Novolin,Human Insulin,"253181,253182",253181
3,INSULIN HUMAN,1,1,True,2,False,213442,0169-1837-11,Novolin,Human Insulin,"253181,253182",253181
4,INSULIN HUMAN,1,1,True,2,False,311027,0169-1834-02,Novolin,Human Insulin,253181,253181
5,INSULIN HUMAN,1,1,True,2,False,311027,0169-1834-11,Novolin,Human Insulin,253181,253181
6,INSULIN HUMAN,1,1,True,2,False,311027,64725-1834-1,Novolin,Human Insulin,253181,253181
7,INSULIN HUMAN,1,1,True,2,False,311027,50090-0498-0,Novolin,Human Insulin,253181,253181
8,INSULIN HUMAN,1,1,True,2,False,2049380,0169-3007-15,Novolin,Human Insulin,"253181,253182",253181
9,INSULIN HUMAN,1,1,True,2,False,2049380,0169-3007-25,Novolin,Human Insulin,"253181,253182",253181
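
To see exactly which RXCUIs version 3 added for these products, a set difference per product is enough. This is a sketch that assumes the `v3` field holds comma-separated RXCUIs, as the `.str.split(",")` call in cell 23 implies. The rows are a hand-copied subset of the table above, one per `rxcui`.

```python
# One row per product RXCUI: (rxcui, v3, v1).
rows = [
    (213442, "253181,253182", 253181),
    (311027, "253181", 253181),
    (2049380, "253181,253182", 253181),
]

# For each product, list the RXCUIs present in v3 but absent from v1.
extra_in_v3 = {}
for rxcui, v3_value, v1_value in rows:
    extra = set(v3_value.split(",")) - {str(v1_value)}
    if extra:
        extra_in_v3[rxcui] = sorted(extra)

print(extra_in_v3)
```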


## Multi active ingredient drug examples

In [34]:
diff["num_fda"].value_counts()

1    571
2    141
3     31
4     15
5      9
8      1
7      1
Name: num_fda, dtype: int64

In [35]:
# version 3 still uses the wrong form of menthol

good.query("rxcui == 1300293")

Unnamed: 0,rxcui,NDCPACKAGECODE,PROPRIETARYNAME,NONPROPRIETARYNAME,SUBSTANCENAME,v3,v1
179723,1300293,51457-000-04,Alo Therapeutic Massage,"MENTHOL, HISTAMINE DIHYDROCHLORIDE",MENTHOL; HISTAMINE DIHYDROCHLORIDE,"142136,1648123",142136
179724,1300293,71061-763-04,Alo Therapeutic Massage,"MENTHOL, HISTAMINE DIHYDROCHLORIDE",MENTHOL; HISTAMINE DIHYDROCHLORIDE,"142136,1648123",142136
179725,1300293,71061-764-32,Alo Therapeutic Massage,"MENTHOL, HISTAMINE DIHYDROCHLORIDE",MENTHOL; HISTAMINE DIHYDROCHLORIDE,"142136,1648123",142136
179726,1300293,71061-765-28,Alo Therapeutic Massage,"MENTHOL, HISTAMINE DIHYDROCHLORIDE",MENTHOL; HISTAMINE DIHYDROCHLORIDE,"142136,1648123",142136
179727,1300293,71061-766-05,Alo Therapeutic Massage,"MENTHOL, HISTAMINE DIHYDROCHLORIDE",MENTHOL; HISTAMINE DIHYDROCHLORIDE,"142136,1648123",142136
179728,1300293,51457-001-32,ALO THERAPEUTIC MASSAGE PAIN RELIEVING,"MENTHOL, HISTAMINE DIHYDROCHLORIDE",MENTHOL; HISTAMINE DIHYDROCHLORIDE,"142136,1648123",142136


In [36]:
# version 2 no longer has problems with multiple ingredients
# these results are correct now

good.query("PROPRIETARYNAME == 'EPZICOM'")

Unnamed: 0,rxcui,NDCPACKAGECODE,PROPRIETARYNAME,NONPROPRIETARYNAME,SUBSTANCENAME,v3,v1
90537,602395,53808-0767-1,EPZICOM,abacavir sulfate and lamivudine,ABACAVIR SULFATE; LAMIVUDINE,"68244,221052",221052
90538,602395,49702-206-13,EPZICOM,abacavir sulfate and lamivudine,ABACAVIR SULFATE; LAMIVUDINE,"68244,221052",221052
90539,602395,70518-0691-0,EPZICOM,abacavir sulfate and lamivudine,ABACAVIR SULFATE; LAMIVUDINE,"68244,221052",221052
90540,602395,50090-0874-0,EPZICOM,abacavir sulfate and lamivudine,ABACAVIR SULFATE; LAMIVUDINE,"68244,221052",221052


In [37]:
# this is also now correct

good.query("rxcui == 543879")

Unnamed: 0,rxcui,NDCPACKAGECODE,PROPRIETARYNAME,NONPROPRIETARYNAME,SUBSTANCENAME,v3,v1
85480,543879,51674-0130-5,RELEGARD,"GLACIAL ACETIC ACID, OXYQUINOLINE",ACETIC ACID; OXYQUINOLINE,"168,42836",42836


# Conclusion

Version 3 fixes some of the cases where we identified too many active ingredients, while retaining the same correct behaviour on previous examples.

However, we still implicitly map some ingredients to a different form, as in the menthol example above.

To fix this, we will remove the rule where we walk along single `has_form` edges.
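
As a rough illustration of what dropping that rule changes, here is a toy traversal. Everything in it is hypothetical: the graph, the node names, the `has_ingredient` label, and the `reachable_ingredients` helper are made up for this sketch and are not the pipeline's actual code. Only the `has_form` edge name comes from this notebook.

```python
# Hypothetical toy graph: node -> list of (relationship, target) edges.
TOY_GRAPH = {
    "product": [("has_ingredient", "ingredient")],
    "ingredient": [("has_form", "other_form")],
    "other_form": [],
}


def reachable_ingredients(graph, start, follow_has_form):
    """Collect ingredient nodes for `start`, optionally plus one has_form hop."""
    found = set()
    for rel, target in graph.get(start, []):
        if rel != "has_ingredient":
            continue
        found.add(target)
        if follow_has_form:
            forms = [t for r, t in graph.get(target, []) if r == "has_form"]
            # Only walk the edge when it is the single has_form edge.
            if len(forms) == 1:
                found.add(forms[0])
    return found


with_rule = reachable_ingredients(TOY_GRAPH, "product", follow_has_form=True)
without_rule = reachable_ingredients(TOY_GRAPH, "product", follow_has_form=False)

print(sorted(with_rule))
print(sorted(without_rule))
```

In this toy setup, turning the rule off leaves only the directly linked ingredient, which is the behaviour the next version is aiming for.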