# Validate the generated NDC to active ingredient mappings

2019-04-22

Ensure that the mappings we generated from NDCs to active ingredient RXCUIs are correct.

## Version 1

We want to see where the simple pattern for finding active ingredients fails.

In [1]:
import pandas as pd

## Read generated mappings

In [2]:
ingredients = pd.read_csv("../../pipeline/ingredients/ndc_active_ingredients_version_1.tsv", sep='\t')

In [3]:
ingredients.shape

(41576, 2)

In [4]:
ingredients.head()

Unnamed: 0,rxcui,active_ingredients
0,91349,-90000
1,91792,-500
2,92582,30145
3,92583,30145
4,92584,30145


## Read NDC metadata

In [5]:
data = pd.read_csv("../../pipeline/merged_ndc_info.tsv", sep='\t')

In [6]:
data.shape

(265692, 22)

In [7]:
data.head(2)

Unnamed: 0,rxcui,rxaui,NDCPACKAGECODE,suppress,PRODUCTID,PRODUCTNDC,PACKAGEDESCRIPTION,PRODUCTTYPENAME,PROPRIETARYNAME,NONPROPRIETARYNAME,...,MARKETINGCATEGORYNAME,APPLICATIONNUMBER,LABELERNAME,SUBSTANCENAME,ACTIVE_NUMERATOR_STRENGTH,ACTIVE_INGRED_UNIT,PHARM_CLASSES,DEASCHEDULE,NDC_EXCLUDE_FLAG,LISTING_RECORD_CERTIFIED_THROUGH
0,91349,3507080,12745-202-01,N,12745-202_7d063901-255c-bffc-e053-2a91aa0a91ee,12745-202,"59 mL in 1 BOTTLE, PLASTIC (12745-202-01)",HUMAN OTC DRUG,HYDROGEN PEROXIDE,HYDROGEN PEROXIDE,...,OTC MONOGRAPH NOT FINAL,part333A,Medical Chemical Corporation,HYDROGEN PEROXIDE,8.57,g/100mL,,,N,20191231.0
1,91349,3507080,12745-202-02,N,12745-202_7d063901-255c-bffc-e053-2a91aa0a91ee,12745-202,"118 mL in 1 BOTTLE, PLASTIC (12745-202-02)",HUMAN OTC DRUG,HYDROGEN PEROXIDE,HYDROGEN PEROXIDE,...,OTC MONOGRAPH NOT FINAL,part333A,Medical Chemical Corporation,HYDROGEN PEROXIDE,8.57,g/100mL,,,N,20191231.0


---

# Create results table

In [8]:
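# keep the distinct combinations of RXCUI, NDC package code, and FDA name columns,
# then attach our generated active ingredient mapping to each row
res = (data
    [[
        "rxcui", "NDCPACKAGECODE", "PROPRIETARYNAME",
        "NONPROPRIETARYNAME", "SUBSTANCENAME"
    ]]
    .drop_duplicates()
    .merge(ingredients, how="inner", on="rxcui")
    .reset_index(drop=True)
)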
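
The inner join above silently drops any NDC row whose RXCUI has no generated mapping. As a minimal sketch of how one might surface those rows, the snippet below uses a left join with pandas' `indicator` and `validate` options. The frames `ndc_info` and `mapping` and all their values are made-up stand-ins, not the real `data` and `ingredients` tables.

```python
import pandas as pd

# Toy stand-ins for `data` and `ingredients`; every value here is invented.
ndc_info = pd.DataFrame({
    "rxcui": [1, 1, 2, 3],
    "NDCPACKAGECODE": ["a-1", "a-2", "b-1", "c-1"],
})
mapping = pd.DataFrame({
    "rxcui": [1, 2],
    "active_ingredients": [-90000, 30145],
})

# indicator=True adds a `_merge` column marking rows without a mapping;
# validate="many_to_one" raises if an rxcui appears twice in `mapping`.
checked = ndc_info.merge(
    mapping, how="left", on="rxcui",
    indicator=True, validate="many_to_one",
)

# Rows present in the NDC metadata but absent from the mapping.
unmapped = checked[checked["_merge"] == "left_only"]
print(unmapped[["rxcui", "NDCPACKAGECODE"]])
```

Whether the real mapping file has exactly one row per RXCUI is an assumption here; if it does not, the `validate` argument would raise and make that visible.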

In [9]:
res.shape

(241458, 6)

In [10]:
res.head()

Unnamed: 0,rxcui,NDCPACKAGECODE,PROPRIETARYNAME,NONPROPRIETARYNAME,SUBSTANCENAME,active_ingredients
0,91349,12745-202-01,HYDROGEN PEROXIDE,HYDROGEN PEROXIDE,HYDROGEN PEROXIDE,-90000
1,91349,12745-202-02,HYDROGEN PEROXIDE,HYDROGEN PEROXIDE,HYDROGEN PEROXIDE,-90000
2,91349,12745-202-03,HYDROGEN PEROXIDE,HYDROGEN PEROXIDE,HYDROGEN PEROXIDE,-90000
3,91349,34645-8030-4,Hydrogen Peroxide,Hydrogen Peroxide,HYDROGEN PEROXIDE,-90000
4,91349,55316-871-43,Hydrogen Peroxide,Hydrogen Peroxide,HYDROGEN PEROXIDE,-90000


---

# Examine results

## Case study: razadyne

In [11]:
# we correctly identified the active ingredient for most razadyne rows;
# the last row (rxcui 2103461) still has a negative code

res.query("PROPRIETARYNAME == 'RAZADYNE'")

Unnamed: 0,rxcui,NDCPACKAGECODE,PROPRIETARYNAME,NONPROPRIETARYNAME,SUBSTANCENAME,active_ingredients
90547,602734,21695-591-30,RAZADYNE,galantamine hydrobromide,GALANTAMINE HYDROBROMIDE,860693
90548,602734,50458-398-60,RAZADYNE,galantamine hydrobromide,GALANTAMINE HYDROBROMIDE,860693
90549,602736,50458-396-60,RAZADYNE,galantamine hydrobromide,GALANTAMINE HYDROBROMIDE,860693
90550,602737,50458-397-60,RAZADYNE,galantamine hydrobromide,GALANTAMINE HYDROBROMIDE,860693
115659,860697,50458-388-30,RAZADYNE,galantamine hydrobromide,GALANTAMINE HYDROBROMIDE,860693
115671,860709,50458-389-30,RAZADYNE,galantamine hydrobromide,GALANTAMINE HYDROBROMIDE,860693
115683,860717,50458-387-30,RAZADYNE,galantamine hydrobromide,GALANTAMINE HYDROBROMIDE,860693
238571,2103461,21695-184-30,RAZADYNE,galantamine hydrobromide,GALANTAMINE HYDROBROMIDE,-90000


---

In [12]:
bad = res.query("active_ingredients < 0")

In [13]:
bad.shape

(234536, 6)

In [14]:
bad.head()

Unnamed: 0,rxcui,NDCPACKAGECODE,PROPRIETARYNAME,NONPROPRIETARYNAME,SUBSTANCENAME,active_ingredients
0,91349,12745-202-01,HYDROGEN PEROXIDE,HYDROGEN PEROXIDE,HYDROGEN PEROXIDE,-90000
1,91349,12745-202-02,HYDROGEN PEROXIDE,HYDROGEN PEROXIDE,HYDROGEN PEROXIDE,-90000
2,91349,12745-202-03,HYDROGEN PEROXIDE,HYDROGEN PEROXIDE,HYDROGEN PEROXIDE,-90000
3,91349,34645-8030-4,Hydrogen Peroxide,Hydrogen Peroxide,HYDROGEN PEROXIDE,-90000
4,91349,55316-871-43,Hydrogen Peroxide,Hydrogen Peroxide,HYDROGEN PEROXIDE,-90000


In [15]:
bad["rxcui"].nunique()

38789

In [16]:
bad["active_ingredients"].value_counts()

-90000    222819
-500       10377
-40         1340
Name: active_ingredients, dtype: int64

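These rows, marked by a negative value in `active_ingredients`, represent drugs for which the simple pattern does not apply.
They make up 234,536 of the 241,458 rows (about 97%).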
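
A short sketch of how the split between negative codes and resolved ingredients could be summarized as proportions. The toy table below reuses the code values seen in the output above, but its rows are invented, and what each negative code stands for is not stated in this notebook.

```python
import pandas as pd

# Toy results table; the rows are invented for illustration.
toy = pd.DataFrame({
    "rxcui": [10, 10, 11, 12, 13],
    "active_ingredients": [-90000, -90000, -500, 30145, -40],
})

# Share of the negative rows carrying each code.
neg = toy[toy["active_ingredients"] < 0]
code_share = neg["active_ingredients"].value_counts(normalize=True)

# Share of distinct RXCUIs with a resolved (positive) ingredient.
resolved = toy.loc[toy["active_ingredients"] > 0, "rxcui"].nunique()
coverage = resolved / toy["rxcui"].nunique()

print(code_share)
print(f"resolved RXCUIs: {coverage:.1%}")
```

Reporting coverage per RXCUI rather than per row avoids over-weighting products that come in many package sizes.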

## Look at good results

In [17]:
good = res.query("active_ingredients > 0")

In [18]:
good.shape

(6922, 6)

In [19]:
good.head()

Unnamed: 0,rxcui,NDCPACKAGECODE,PROPRIETARYNAME,NONPROPRIETARYNAME,SUBSTANCENAME,active_ingredients
244,92582,0085-3149-01,ELOCON,Mometasone Furoate,MOMETASONE FUROATE,30145
245,92582,0085-3149-02,ELOCON,Mometasone Furoate,MOMETASONE FUROATE,30145
246,92582,0085-3149-03,ELOCON,Mometasone Furoate,MOMETASONE FUROATE,30145
247,92583,0085-0854-01,ELOCON,Mometasone Furoate,MOMETASONE FUROATE,30145
248,92583,0085-0854-02,ELOCON,Mometasone Furoate,MOMETASONE FUROATE,30145


In [20]:
good["NDCPACKAGECODE"].nunique()

6736

In [21]:
good["rxcui"].nunique()

2787

In [22]:
good["active_ingredients"].nunique()

611

---

# Analyze based on the FDA's stated active ingredients

The FDA's NDC metadata lists each product's active ingredient names in the `SUBSTANCENAME` column.
We use that column to look for disagreements with our algorithm.

## Examine disagreements between our algorithm and the FDA

In [23]:
for label, df in good.groupby("SUBSTANCENAME"):
    if df["active_ingredients"].nunique() != 1:
        print(label)

BOTULINUM TOXIN TYPE A
CHLORCYCLIZINE HYDROCHLORIDE
DEXTROMETHORPHAN HYDROBROMIDE
DOXORUBICIN HYDROCHLORIDE
DOXYCYCLINE
FILGRASTIM
GUANFACINE HYDROCHLORIDE
HUMAN IMMUNOGLOBULIN G
INFLIXIMAB
IRINOTECAN HYDROCHLORIDE
PEGFILGRASTIM
SOMATROPIN
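
The same check can be written without an explicit loop. This is a sketch on a made-up frame, `toy_good`, standing in for `good`: it counts the distinct mapped ingredients per FDA substance name and keeps the names where that count is not exactly one.

```python
import pandas as pd

# Toy stand-in for `good`; substance names and ingredient ids are invented.
toy_good = pd.DataFrame({
    "SUBSTANCENAME": ["A", "A", "B", "B", "C"],
    "active_ingredients": [1, 2, 3, 3, 4],
})

# Number of distinct mapped ingredients for each FDA substance name.
n_mapped = toy_good.groupby("SUBSTANCENAME")["active_ingredients"].nunique()

# Substance names that map to more than one ingredient.
disagreements = n_mapped[n_mapped != 1].index.tolist()
print(disagreements)
```

Keeping `n_mapped` around also makes it easy to sort by the number of conflicting ingredients.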


In [24]:
# RxNorm seems to treat these as two different ingredients:
# Onivyde is the liposome form, and there is no clear way
# to get from one form to the other

# this does seem like a real difference

good.query("SUBSTANCENAME == 'IRINOTECAN HYDROCHLORIDE'")

Unnamed: 0,rxcui,NDCPACKAGECODE,PROPRIETARYNAME,NONPROPRIETARYNAME,SUBSTANCENAME,active_ingredients
206647,1719777,69171-398-01,Onivyde,Irinotecan hydrochloride,IRINOTECAN HYDROCHLORIDE,1719767
206648,1719777,15054-0043-1,Onivyde,IRINOTECAN HYDROCHLORIDE,IRINOTECAN HYDROCHLORIDE,1719767
208228,1726321,0009-7529-04,Camptosar,irinotecan hydrochloride,IRINOTECAN HYDROCHLORIDE,153329
208229,1726323,0009-7529-04,Camptosar,irinotecan hydrochloride,IRINOTECAN HYDROCHLORIDE,153329
208230,1726323,0009-7529-03,Camptosar,irinotecan hydrochloride,IRINOTECAN HYDROCHLORIDE,153329
208231,1726323,0009-7529-05,Camptosar,irinotecan hydrochloride,IRINOTECAN HYDROCHLORIDE,153329
208232,1726323,0009-1111-02,Camptosar,irinotecan hydrochloride,IRINOTECAN HYDROCHLORIDE,153329
208250,1726325,0009-7529-03,Camptosar,irinotecan hydrochloride,IRINOTECAN HYDROCHLORIDE,153329
208251,1726325,0009-1111-02,Camptosar,irinotecan hydrochloride,IRINOTECAN HYDROCHLORIDE,153329
208252,1726335,0009-7529-05,Camptosar,irinotecan hydrochloride,IRINOTECAN HYDROCHLORIDE,153329


In [25]:
# these seem to be two versions of the same protein
# debatable whether they're really the same

good.query("SUBSTANCENAME == 'PEGFILGRASTIM'")

Unnamed: 0,rxcui,NDCPACKAGECODE,PROPRIETARYNAME,NONPROPRIETARYNAME,SUBSTANCENAME,active_ingredients
235873,2048025,67457-833-06,Fulphila,pegfilgrastim,PEGFILGRASTIM,2048018
238216,2102705,70114-101-01,UDENYCA,pegfilgrastim-cbqv,PEGFILGRASTIM,2102692


In [26]:
# this seems ok since one is E. coli derived and the other is recombinant DNA

good.query("SUBSTANCENAME == 'SOMATROPIN'")

Unnamed: 0,rxcui,NDCPACKAGECODE,PROPRIETARYNAME,NONPROPRIETARYNAME,SUBSTANCENAME,active_ingredients
103251,847245,0169-7708-21,Norditropin,somatropin,SOMATROPIN,314845
103252,847245,0169-7708-92,Norditropin,somatropin,SOMATROPIN,314845
103253,847247,0169-7704-21,Norditropin,somatropin,SOMATROPIN,314845
103254,847247,0169-7704-92,Norditropin,somatropin,SOMATROPIN,314845
103259,847348,0169-7705-21,Norditropin,somatropin,SOMATROPIN,314845
103260,847348,0169-7705-92,Norditropin,somatropin,SOMATROPIN,314845
104543,849851,0169-7703-21,Norditropin,somatropin,SOMATROPIN,314845
108460,854302,0781-3004-07,Omnitrope,Somatropin,SOMATROPIN,314845
108461,854302,0781-3004-26,Omnitrope,Somatropin,SOMATROPIN,314845
118882,864110,0781-3001-07,Omnitrope,Somatropin,SOMATROPIN,314845


For these examples, there seem to be nuanced differences between the active ingredients of some similar drugs.
The FDA's table provides a high-level summary of the active ingredients, but it does not contain enough information to conclude whether a mapping is correct.

For the three examples we looked at here, our algorithm's outputs seem to be correct.

## Drugs with multiple ingredients

In [27]:
temp = (good
    .groupby("SUBSTANCENAME")
    ["active_ingredients"]
    .nunique()
    .to_frame("num_ans")
    .reset_index()
    .assign(num_ingredients = lambda df: df["SUBSTANCENAME"].str.count(";") + 1)
    .assign(match = lambda df: df["num_ans"] == df["num_ingredients"])
)

In [28]:
temp.shape

(769, 4)

In [29]:
temp.head()

Unnamed: 0,SUBSTANCENAME,num_ans,num_ingredients,match
0,.ALPHA.1-PROTEINASE INHIBITOR HUMAN,1,1,True
1,ABACAVIR SULFATE; LAMIVUDINE,1,2,False
2,ABIRATERONE ACETATE,1,1,True
3,ACEBUTOLOL HYDROCHLORIDE,1,1,True
4,ACETAMINOPHEN; ASPIRIN; DIPHENHYDRAMINE CITRATE,1,3,False


In [30]:
temp["match"].value_counts()

True     559
False    210
Name: match, dtype: int64

For 559 of the 769 substance names (about 73%), the number of distinct ingredients we mapped equals the number of ingredients the FDA lists.
This isn't too bad for a first try.
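
The ingredient count above relies on `;` separating the names in `SUBSTANCENAME`. A small sketch, on two substance strings taken from the output above, of splitting that column into individual names so that each one could later be compared against a mapped ingredient. The column names `ingredient_names` and `num_ingredients` are chosen here for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "SUBSTANCENAME": [
        "ABACAVIR SULFATE; LAMIVUDINE",
        "ABIRATERONE ACETATE",
    ],
})

# Split on ";" and strip whitespace to get one list of names per row.
names = toy["SUBSTANCENAME"].str.split(";").apply(
    lambda parts: [p.strip() for p in parts]
)

# The list length agrees with the str.count(";") + 1 approach.
toy = toy.assign(ingredient_names=names, num_ingredients=names.str.len())
print(toy)
```

Having the names as a list, rather than only a count, would allow a per-ingredient comparison instead of a comparison of totals.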

---

## Multi active ingredient drug examples

In [31]:
temp["num_ingredients"].value_counts()

1    571
2    141
3     31
4     15
5      9
8      1
7      1
Name: num_ingredients, dtype: int64

In [32]:
# we only found one of the two active ingredients

good.query("rxcui == 1300293")

Unnamed: 0,rxcui,NDCPACKAGECODE,PROPRIETARYNAME,NONPROPRIETARYNAME,SUBSTANCENAME,active_ingredients
179723,1300293,51457-000-04,Alo Therapeutic Massage,"MENTHOL, HISTAMINE DIHYDROCHLORIDE",MENTHOL; HISTAMINE DIHYDROCHLORIDE,142136
179724,1300293,71061-763-04,Alo Therapeutic Massage,"MENTHOL, HISTAMINE DIHYDROCHLORIDE",MENTHOL; HISTAMINE DIHYDROCHLORIDE,142136
179725,1300293,71061-764-32,Alo Therapeutic Massage,"MENTHOL, HISTAMINE DIHYDROCHLORIDE",MENTHOL; HISTAMINE DIHYDROCHLORIDE,142136
179726,1300293,71061-765-28,Alo Therapeutic Massage,"MENTHOL, HISTAMINE DIHYDROCHLORIDE",MENTHOL; HISTAMINE DIHYDROCHLORIDE,142136
179727,1300293,71061-766-05,Alo Therapeutic Massage,"MENTHOL, HISTAMINE DIHYDROCHLORIDE",MENTHOL; HISTAMINE DIHYDROCHLORIDE,142136
179728,1300293,51457-001-32,ALO THERAPEUTIC MASSAGE PAIN RELIEVING,"MENTHOL, HISTAMINE DIHYDROCHLORIDE",MENTHOL; HISTAMINE DIHYDROCHLORIDE,142136


In [33]:
# again we see that we only found one of the two active ingredients

good.query("SUBSTANCENAME == 'ABACAVIR SULFATE; LAMIVUDINE'")

Unnamed: 0,rxcui,NDCPACKAGECODE,PROPRIETARYNAME,NONPROPRIETARYNAME,SUBSTANCENAME,active_ingredients
90537,602395,53808-0767-1,EPZICOM,abacavir sulfate and lamivudine,ABACAVIR SULFATE; LAMIVUDINE,221052
90538,602395,49702-206-13,EPZICOM,abacavir sulfate and lamivudine,ABACAVIR SULFATE; LAMIVUDINE,221052
90539,602395,70518-0691-0,EPZICOM,abacavir sulfate and lamivudine,ABACAVIR SULFATE; LAMIVUDINE,221052
90540,602395,50090-0874-0,EPZICOM,abacavir sulfate and lamivudine,ABACAVIR SULFATE; LAMIVUDINE,221052


# Conclusion

Our simple pattern does a reasonable job of finding the active ingredient, but only for a small number of drugs: 6,922 of the 241,458 rows (about 3%), covering 2,787 RXCUIs, received a positive ingredient.
It fails, however, for drugs with multiple active ingredients.

We will need to move to a breadth-first search (BFS) algorithm to deal with these more complex drugs.
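
As a rough illustration of the idea, and not the pipeline's actual implementation, the sketch below runs a breadth-first search over a relationship graph and collects every ingredient node reachable from a product. The graph, the node names, and the `find_ingredients` helper are all hypothetical; a real version would walk actual RxNorm relationships.

```python
from collections import deque

# Hypothetical relationship graph: node -> neighbouring nodes.
# Node names and edges are invented for illustration only.
graph = {
    "product": ["component_1", "component_2"],
    "component_1": ["ingredient_a"],
    "component_2": ["ingredient_b"],
    "ingredient_a": [],
    "ingredient_b": [],
}

# Assumed set of nodes that count as active ingredients.
ingredient_nodes = {"ingredient_a", "ingredient_b"}


def find_ingredients(start, graph, ingredient_nodes):
    """Breadth-first search returning every ingredient reachable from start."""
    seen = {start}
    queue = deque([start])
    found = []
    while queue:
        node = queue.popleft()
        if node in ingredient_nodes:
            found.append(node)
            continue  # do not expand past an ingredient
        for neighbour in graph.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return found


print(find_ingredients("product", graph, ingredient_nodes))
```

Unlike a single fixed pattern, the search follows every branch, so a product with two components yields both ingredients.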