# TP2: Preprocessing and data visualization

## Winter 2023 - BIN710 Data Mining (UdeS)

Second assignment for the Data Mining class at UdeS.

Student name : Simon Lalonde

### Directory structure

├── package2.csv    ---> Data

├── product2.csv    ---> Data

├── tp1.ipynb   ---> Jupyter Notebook

└── TP1.pdf    ---> Tasks to complete

### Data
There are 2 files for each dataset, and both have the same *byte signature*, meaning they are identical when compared byte by byte.

NDC = National Drug Code

### Metadata
Description for the 2 data files used : 
- [Product](https://www.fda.gov/drugs/drug-approvals-and-databases/ndc-product-file-definitions)
- [Package](https://www.fda.gov/drugs/drug-approvals-and-databases/ndc-package-file-definitions)

### Goal
Use preprocessing and data visualization techniques on FDA drug databases.

---

## 1, 2 and 3 : Data verification and cleaning for individual tables (coherence, types, redundancy etc.)

Importing all required libraries and modules, then reading each file into a dataframe with the proper encoding.

In [1]:
from pathlib import Path

import pandas as pd
import numpy as np

from sklearn.preprocessing import OneHotEncoder

In [2]:
root_dir = Path.cwd()
pack = pd.read_csv(root_dir / "package2.csv", delimiter=";")
prod = pd.read_csv(root_dir / "product2.csv", delimiter=";", encoding="ISO-8859-1")    # Latin-1 encoding



### Exploring and cleaning the product data table

In [3]:
prod.head()

Unnamed: 0,PRODUCTID,PRODUCTNDC,PRODUCTTYPENAME,PROPRIETARYNAME,PROPRIETARYNAMESUFFIX,NONPROPRIETARYNAME,DOSAGEFORMNAME,ROUTENAME,STARTMARKETINGDATE,ENDMARKETINGDATE,MARKETINGCATEGORYNAME,APPLICATIONNUMBER,LABELERNAME,SUBSTANCENAME,ACTIVE_NUMERATOR_STRENGTH,ACTIVE_INGRED_UNIT,PHARM_CLASSES,DEASCHEDULE,NDC_EXCLUDE_FLAG,LISTING_RECORD_CERTIFIED_THROUGH
0,,0002-0800,HUMAN OTC DRUG,Sterile Diluent,,diluent,"INJECTION, SOLUTION",SUBCUTANEOUS,19870710,,NDA,NDA018781,10,WATER,1.0,mL/mL,,,N,20201231.0
1,,0002-1200,HUMAN PRESCRIPTION DRUG,Amyvid,,Florbetapir F 18,"INJECTION, SOLUTION",INTRAVENOUS,20120601,,NDA,NDA202008,10,FLORBETAPIR F-18,51.0,mCi/mL,"Radioactive Diagnostic Agent [EPC],Positron Em...",,N,20211231.0
2,,0002-1433,HUMAN PRESCRIPTION DRUG,Trulicity,,Dulaglutide,"INJECTION, SOLUTION",SUBCUTANEOUS,20140918,,BLA,BLA125469,10,DULAGLUTIDE,0.75,mg/.5mL,"GLP-1 Receptor Agonist [EPC],Glucagon-Like Pep...",,N,20201231.0
3,,0002-1434,HUMAN PRESCRIPTION DRUG,Trulicity,,Dulaglutide,"INJECTION, SOLUTION",SUBCUTANEOUS,20140918,,BLA,BLA125469,10,DULAGLUTIDE,1.5,mg/.5mL,"GLP-1 Receptor Agonist [EPC],Glucagon-Like Pep...",,N,20201231.0
4,,0002-1436,HUMAN PRESCRIPTION DRUG,EMGALITY,,galcanezumab,"INJECTION, SOLUTION",SUBCUTANEOUS,20180927,,BLA,BLA761063,10,GALCANEZUMAB,120.0,mg/mL,,,N,20201231.0


In [4]:
print(f"Product dataframe has {prod.shape[0]} objects and {prod.shape[1]} columns.") 

Product dataframe has 93238 objects and 20 columns.


In [5]:
print(f" There are {len(prod.dtypes[prod.dtypes != 'object'])} numerical columns :\n") 
print(prod.dtypes[prod.dtypes != "object"].index.to_list())

 There are 3 numerical columns :

['STARTMARKETINGDATE', 'ENDMARKETINGDATE', 'LISTING_RECORD_CERTIFIED_THROUGH']


In [6]:
print(f" There are {len(prod.dtypes[prod.dtypes == 'object'])} non-numerical columns :\n") 
print(prod.dtypes[prod.dtypes == "object"].index.to_list())

 There are 17 non-numerical columns :

['PRODUCTID', 'PRODUCTNDC', 'PRODUCTTYPENAME', 'PROPRIETARYNAME', 'PROPRIETARYNAMESUFFIX', 'NONPROPRIETARYNAME', 'DOSAGEFORMNAME', 'ROUTENAME', 'MARKETINGCATEGORYNAME', 'APPLICATIONNUMBER', 'LABELERNAME', 'SUBSTANCENAME', 'ACTIVE_NUMERATOR_STRENGTH', 'ACTIVE_INGRED_UNIT', 'PHARM_CLASSES', 'DEASCHEDULE', 'NDC_EXCLUDE_FLAG']


Looking at null/missing values for each feature. Some features have a very high number of missing values, such as ProprietaryNameSuffix, EndMarketingDate and DeaSchedule. This makes sense: these values are not missing per se but carry information about the object itself (no need for a suffix on the Rx, no known end marketing date, and no classification of dependency potential, respectively).

In [7]:
prod.isnull().sum()

PRODUCTID                            1560
PRODUCTNDC                              0
PRODUCTTYPENAME                         0
PROPRIETARYNAME                         6
PROPRIETARYNAMESUFFIX               83075
NONPROPRIETARYNAME                      4
DOSAGEFORMNAME                          0
ROUTENAME                            1932
STARTMARKETINGDATE                      0
ENDMARKETINGDATE                    88915
MARKETINGCATEGORYNAME                   0
APPLICATIONNUMBER                   13097
LABELERNAME                             0
SUBSTANCENAME                        2309
ACTIVE_NUMERATOR_STRENGTH            2309
ACTIVE_INGRED_UNIT                   2309
PHARM_CLASSES                       50984
DEASCHEDULE                         88815
NDC_EXCLUDE_FLAG                        0
LISTING_RECORD_CERTIFIED_THROUGH     4325
dtype: int64
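
The raw counts are easier to compare across features when expressed as a fraction of rows. A minimal sketch on a toy frame (the `toy` values are invented for illustration and are not taken from `product2.csv`):

```python
import pandas as pd

# Toy frame standing in for `prod` (illustrative values only)
toy = pd.DataFrame({
    "PRODUCTNDC": ["0002-0800", "0002-1200", "0002-1433", "0002-1434"],
    "ENDMARKETINGDATE": [None, None, None, 20200930.0],
    "DEASCHEDULE": [None, "CIV", None, None],
})

# isnull() gives booleans, so the column mean is the fraction of missing values
missing_frac = toy.isnull().mean().sort_values(ascending=False)
print(missing_frac)
```

Applied to the real table, `prod.isnull().mean()` should rank the features in the same order as the counts above.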

ProductID is supposed to be composed of ProductNDC and other information. Let's check whether that is true.

In [8]:
print(f"Num objects with no NA in ProdID/NDC : {len(prod[['PRODUCTID', 'PRODUCTNDC']].dropna())}")

Num objects with no NA in ProdID/NDC : 91678


In [9]:
print(f"Num objects with NDC code within ID col : {prod[['PRODUCTID', 'PRODUCTNDC']].dropna().apply(lambda x : x.PRODUCTNDC in x.PRODUCTID, axis=1).sum()}")

Num objects with NDC code within ID col : 91165


In [10]:
prod[prod['PRODUCTID'].notna()]

Unnamed: 0,PRODUCTID,PRODUCTNDC,PRODUCTTYPENAME,PROPRIETARYNAME,PROPRIETARYNAMESUFFIX,NONPROPRIETARYNAME,DOSAGEFORMNAME,ROUTENAME,STARTMARKETINGDATE,ENDMARKETINGDATE,MARKETINGCATEGORYNAME,APPLICATIONNUMBER,LABELERNAME,SUBSTANCENAME,ACTIVE_NUMERATOR_STRENGTH,ACTIVE_INGRED_UNIT,PHARM_CLASSES,DEASCHEDULE,NDC_EXCLUDE_FLAG,LISTING_RECORD_CERTIFIED_THROUGH
20,0002-3251_67a53369-eead-4f2c-afe9-f3274899c47e,0002-3251,HUMAN PRESCRIPTION DRUG,Strattera,,Atomoxetine hydrochloride,CAPSULE,ORAL,20050214,,NDA,NDA021411,10,ATOMOXETINE HYDROCHLORIDE,100,mg/1,"Norepinephrine Reuptake Inhibitor [EPC],Norepi...",,N,20211231.0
21,0002-3270_06e2a1f2-459c-45aa-9341-54e36f7726a7,0002-3270,HUMAN PRESCRIPTION DRUG,Cymbalta,,Duloxetine hydrochloride,"CAPSULE, DELAYED RELEASE",ORAL,20100115,,NDA,NDA021427,10,DULOXETINE HYDROCHLORIDE,60,mg/1,"Norepinephrine Uptake Inhibitors [MoA],Seroton...",,N,20201231.0
22,0002-4112_d561034d-ea58-45fe-9d07-2e9eba98c2e4,0002-4112,HUMAN PRESCRIPTION DRUG,Zyprexa,,Olanzapine,TABLET,ORAL,19970623,,NDA,NDA020592,10,OLANZAPINE,2.5,mg/1,Atypical Antipsychotic [EPC],,N,20201231.0
23,0002-4115_d561034d-ea58-45fe-9d07-2e9eba98c2e4,0002-4115,HUMAN PRESCRIPTION DRUG,Zyprexa,,Olanzapine,TABLET,ORAL,19961001,,NDA,NDA020592,10,OLANZAPINE,5,mg/1,Atypical Antipsychotic [EPC],,N,20201231.0
24,0002-4116_d561034d-ea58-45fe-9d07-2e9eba98c2e4,0002-4116,HUMAN PRESCRIPTION DRUG,Zyprexa,,Olanzapine,TABLET,ORAL,19961001,,NDA,NDA020592,10,OLANZAPINE,7.5,mg/1,Atypical Antipsychotic [EPC],,N,20201231.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
93233,99207-465_7578e84a-41ed-498d-8c2b-56a9931679db,99207-465,HUMAN PRESCRIPTION DRUG,Solodyn,,minocycline hydrochloride,"TABLET, FILM COATED, EXTENDED RELEASE",ORAL,20100927,,NDA,NDA050808,Valeant Pharmaceuticals North America LLC,MINOCYCLINE HYDROCHLORIDE,55,mg/1,"Tetracycline-class Drug [EPC],Tetracyclines [CS]",,N,20201231.0
93234,99207-466_7578e84a-41ed-498d-8c2b-56a9931679db,99207-466,HUMAN PRESCRIPTION DRUG,Solodyn,,minocycline hydrochloride,"TABLET, FILM COATED, EXTENDED RELEASE",ORAL,20100927,,NDA,NDA050808,Valeant Pharmaceuticals North America LLC,MINOCYCLINE HYDROCHLORIDE,80,mg/1,"Tetracycline-class Drug [EPC],Tetracyclines [CS]",,N,20201231.0
93235,99207-467_7578e84a-41ed-498d-8c2b-56a9931679db,99207-467,HUMAN PRESCRIPTION DRUG,Solodyn,,minocycline hydrochloride,"TABLET, FILM COATED, EXTENDED RELEASE",ORAL,20100927,,NDA,NDA050808,Valeant Pharmaceuticals North America LLC,MINOCYCLINE HYDROCHLORIDE,105,mg/1,"Tetracycline-class Drug [EPC],Tetracyclines [CS]",,N,20201231.0
93236,99207-525_d47eda34-3952-463c-9597-4225a19dbf13,99207-525,HUMAN PRESCRIPTION DRUG,Vanos,,fluocinonide,CREAM,TOPICAL,20060313,,NDA,NDA021758,Valeant Pharmaceuticals North America LLC,FLUOCINONIDE,1,mg/g,"Corticosteroid [EPC],Corticosteroid Hormone Re...",,N,20201231.0


In [11]:
id_ndc_incoherent = prod[prod['PRODUCTID'].notna()][prod[['PRODUCTID', 'PRODUCTNDC']].dropna().apply(lambda x : x.PRODUCTNDC not in x.PRODUCTID, axis=1)]
print(f"Num objects with incoherent ID and NDC : {len(id_ndc_incoherent)}")

Num objects with incoherent ID and NDC : 513
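
The cell above filters `prod` twice and indexes one filtered frame with a mask built from the other. One way to make the alignment explicit is to filter once and build the mask on that same frame. This is only a sketch on toy data (the shortened IDs below are hypothetical):

```python
import pandas as pd

# Toy rows: one coherent pair, one missing ID, one incoherent pair
toy = pd.DataFrame({
    "PRODUCTID": ["0002-3251_67a5", None, "0006-0005_0c7a"],
    "PRODUCTNDC": ["0002-3251", "0002-0800", "05-juin"],
})

# Keep only rows where both fields are present
known = toy.dropna(subset=["PRODUCTID", "PRODUCTNDC"])

# True when the NDC code appears inside the product ID (same test as the notebook)
coherent = pd.Series(
    [ndc in pid for pid, ndc in zip(known["PRODUCTID"], known["PRODUCTNDC"])],
    index=known.index,
)

incoherent_rows = known[~coherent]
print(len(incoherent_rows))
```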


In [12]:
id_ndc_incoherent.head()

Unnamed: 0,PRODUCTID,PRODUCTNDC,PRODUCTTYPENAME,PROPRIETARYNAME,PROPRIETARYNAMESUFFIX,NONPROPRIETARYNAME,DOSAGEFORMNAME,ROUTENAME,STARTMARKETINGDATE,ENDMARKETINGDATE,MARKETINGCATEGORYNAME,APPLICATIONNUMBER,LABELERNAME,SUBSTANCENAME,ACTIVE_NUMERATOR_STRENGTH,ACTIVE_INGRED_UNIT,PHARM_CLASSES,DEASCHEDULE,NDC_EXCLUDE_FLAG,LISTING_RECORD_CERTIFIED_THROUGH
159,0006-0005_0c7a3452-ecb2-4f66-ad52-94f8eaf8cde8,05-juin,HUMAN PRESCRIPTION DRUG,BELSOMRA,,suvorexant,"TABLET, FILM COATED",ORAL,20140829,,NDA,NDA204569,10,SUVOREXANT,5,mg/1,"Orexin Receptor Antagonist [EPC],Orexin Recept...",CIV,N,20211231.0
160,0006-0019_54e9c31a-9429-4842-b2d6-0cc1e5ad613c,19-juin,HUMAN PRESCRIPTION DRUG,PRINIVIL,,lisinopril,TABLET,ORAL,19871229,,NDA,NDA019558,10,LISINOPRIL,5,mg/1,"Angiotensin Converting Enzyme Inhibitor [EPC],...",,N,20201231.0
310,0009-0003_67759a7c-ea06-4151-87e1-a301c44d67cd,03-sept,HUMAN PRESCRIPTION DRUG,SOLU-MEDROL,,methylprednisolone sodium succinate,"INJECTION, POWDER, FOR SOLUTION",INTRAMUSCULAR; INTRAVENOUS,19590402,,NDA,NDA011856,Pharmacia and Upjohn Company LLC,METHYLPREDNISOLONE SODIUM SUCCINATE,500,mg/4mL,"Corticosteroid [EPC],Corticosteroid Hormone Re...",,N,20201231.0
311,0009-0005_c9aa26c1-05c3-479c-90eb-63b2181c5e7e,05-sept,HUMAN PRESCRIPTION DRUG,Solu-Cortef,,hydrocortisone sodium succinate,"INJECTION, POWDER, FOR SOLUTION",INTRAMUSCULAR; INTRAVENOUS,19550427,,NDA,NDA009866,Pharmacia and Upjohn Company LLC,HYDROCORTISONE SODIUM SUCCINATE,1000,mg/8mL,"Corticosteroid [EPC],Corticosteroid Hormone Re...",,N,20201231.0
312,0009-0011_c9aa26c1-05c3-479c-90eb-63b2181c5e7e,11-sept,HUMAN PRESCRIPTION DRUG,Solu-Cortef,,hydrocortisone sodium succinate,"INJECTION, POWDER, FOR SOLUTION",INTRAMUSCULAR; INTRAVENOUS,19550427,,NDA,NDA009866,Pharmacia and Upjohn Company LLC,HYDROCORTISONE SODIUM SUCCINATE,100,mg/2mL,"Corticosteroid [EPC],Corticosteroid Hormone Re...",,N,20201231.0


In [13]:
# List of possible vals
id_ndc_incoherent["PRODUCTNDC"].unique()

array(['05-juin', '19-juin', '03-sept', '05-sept', '11-sept', '12-sept',
       '13-sept', '16-sept', '17-sept', '18-sept', '20-sept', '22-sept',
       '29-sept', 'OTC MONOGRAPH NOT FINAL', 'NDA', 'OTC MONOGRAPH FINAL',
       'UNAPPROVED HOMEOPATHIC', 'UNAPPROVED MEDICAL GAS',
       'UNAPPROVED DRUG OTHER', 'ANDA', 'NDA AUTHORIZED GENERIC', 'BLA'],
      dtype=object)
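
Values such as `05-juin` look like codes that a spreadsheet converted to dates (for example `0006-0005`), and the other values match MarketingCategoryName labels, which suggests shifted columns. Both readings are assumptions. In the rows displayed above, the ProductID starts with the NDC code followed by an underscore, so one possible repair is to rebuild the NDC from that prefix. A hedged sketch on toy data with shortened, hypothetical IDs:

```python
import pandas as pd

toy = pd.DataFrame({
    "PRODUCTID": ["0006-0005_0c7a3452", "0002-3251_67a53369", None],
    "PRODUCTNDC": ["05-juin", "0002-3251", "0002-0800"],
})

# Part of PRODUCTID before the first underscore (missing when PRODUCTID is missing)
ndc_from_id = toy["PRODUCTID"].str.split("_", n=1).str[0]

# Rows where an ID exists but its prefix disagrees with the stored NDC
mismatch = ndc_from_id.notna() & (ndc_from_id != toy["PRODUCTNDC"])

# Overwrite only the mismatching rows; rows without an ID are left untouched
toy.loc[mismatch, "PRODUCTNDC"] = ndc_from_id[mismatch]
print(toy["PRODUCTNDC"].tolist())
```

This uses prefix equality rather than the substring test of the cells above, which is a stricter assumption about how ProductID is built.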

The suffix feature will probably not contribute much relevant information, since there is no real standard and the naming has many inconsistencies.

In [14]:
print(f"{len(prod['PROPRIETARYNAMESUFFIX'].dropna())} values for PPTsuffix with {len(prod['PROPRIETARYNAMESUFFIX'].dropna().unique())} unique categories")

10163 values for PPTsuffix with 4022 unique categories


In [15]:
# See a few examples
print(prod['PROPRIETARYNAMESUFFIX'].dropna().unique()[:10])

['Zydis ' 'Mix75/25 ' 'Mix50/50 ' 'Intramuscular ' 'Relprevv ' 'KwikPen '
 ' Junior KwikPen ' ' Tempo Pen ' 'R ' 'N ']


**ProductTypeName FDA labels verification**

In [16]:
content_type_label = pd.read_html("https://www.fda.gov/industry/structured-product-labeling-resources/document-type-including-content-labeling-type")[0]["LOINC Name"]


In [17]:
content_type_label = content_type_label.str.replace("LABEL", "")
content_type_label = content_type_label.str.rstrip().to_list()

In [18]:
print(f'{len([label for label in prod["PRODUCTTYPENAME"].unique() if label not in content_type_label])} producttype categories not in official FDA repo')

0 producttype categories not in official FDA repo


**Dosage form FDA codes verification**

In [19]:
dosage_form_codes = pd.read_html("https://www.fda.gov/industry/structured-product-labeling-resources/dosage-forms")[0]
dosage_form_codes = dosage_form_codes["SPL Acceptable Term"].to_list()


In [20]:
[label for label in prod["DOSAGEFORMNAME"].unique() if label not in dosage_form_codes]
print(f"{len([label for label in prod['DOSAGEFORMNAME'].unique() if label not in dosage_form_codes])} DosageForm categories not in official FDA repo codes")

0 DosageForm categories not in official FDA repo codes


**RouteName FDA codes verification**

In [21]:
routename_codes = pd.read_html("https://www.fda.gov/industry/structured-product-labeling-resources/route-administration")[0]
routename_codes = routename_codes["SPL Acceptable Term"].to_list()


In [22]:
print(f'{len([label for label in prod["ROUTENAME"].dropna().unique() if label not in routename_codes])} RouteNames not listed in official FDA repo')

127 RouteNames not listed in official FDA repo


In [23]:
# Example of multiple categories for RouteName
print([label for label in prod["ROUTENAME"].dropna().unique() if label not in routename_codes][:10])

['INTRAMUSCULAR; SUBCUTANEOUS', 'INTRAVENOUS; SUBCUTANEOUS', 'INTRA-ARTICULAR; INTRAMUSCULAR', 'INTRA-ARTICULAR; INTRALESIONAL', 'INTRAMUSCULAR; INTRAVENOUS', 'INTRALESIONAL; INTRAMUSCULAR; INTRASYNOVIAL; SOFT TISSUE', 'INTRAMUSCULAR; INTRAVENOUS; SUBCONJUNCTIVAL', 'INTRA-ARTICULAR; INTRALESIONAL; INTRAMUSCULAR; SOFT TISSUE', 'INTRAVASCULAR; INTRAVENOUS', 'INTRA-ARTERIAL; INTRAVENOUS']


The unlisted labels all belong to objects that combine several routes in one field. This can be handled with the get_dummies method or with one-hot encoding.

In [24]:
# Pandas get_dummies example
prod["ROUTENAME"].str.get_dummies().head()

Unnamed: 0,AURICULAR (OTIC),BUCCAL,BUCCAL; DENTAL; TOPICAL,BUCCAL; SUBLINGUAL,BUCCAL; VAGINAL,CUTANEOUS,CUTANEOUS; EXTRACORPOREAL,CUTANEOUS; EXTRACORPOREAL; TOPICAL; VAGINAL,CUTANEOUS; EXTRACORPOREAL; VAGINAL,CUTANEOUS; INTRADERMAL; SUBCUTANEOUS,...,TOPICAL,TOPICAL; TOPICAL,TOPICAL; TOPICAL; TOPICAL,TOPICAL; TRANSDERMAL,TOPICAL; VAGINAL,TRANSDERMAL,TRANSMUCOSAL,URETERAL,URETHRAL,VAGINAL
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [25]:
#ohe example
enc = OneHotEncoder()
enc.fit(prod[["ROUTENAME"]])
enc.categories_[0][:5]

array(['AURICULAR (OTIC)', 'BUCCAL', 'BUCCAL; DENTAL; TOPICAL',
       'BUCCAL; SUBLINGUAL', 'BUCCAL; VAGINAL'], dtype=object)
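
In both outputs above, combined strings such as `BUCCAL; DENTAL; TOPICAL` still appear as categories of their own: `str.get_dummies` splits on `|` by default, and `OneHotEncoder` treats each distinct string as one category. Passing the separator used in the data gives one indicator column per individual route. A small sketch on an invented series:

```python
import pandas as pd

# Toy route names, including a combined value, a repeated value and a missing one
routes = pd.Series(["ORAL", "INTRAMUSCULAR; INTRAVENOUS", "TOPICAL; TOPICAL", None])

# Split on "; " so that each individual route gets its own indicator column
dummies = routes.str.get_dummies(sep="; ")
print(dummies)
```

A repeated route such as `TOPICAL; TOPICAL` simply sets the `TOPICAL` column once, and a missing value yields a row of zeros.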

Some objects list the same route name several times in one field, which does not make sense.

In [26]:
prod[prod["ROUTENAME"] == "TOPICAL; TOPICAL; TOPICAL"]

Unnamed: 0,PRODUCTID,PRODUCTNDC,PRODUCTTYPENAME,PROPRIETARYNAME,PROPRIETARYNAMESUFFIX,NONPROPRIETARYNAME,DOSAGEFORMNAME,ROUTENAME,STARTMARKETINGDATE,ENDMARKETINGDATE,MARKETINGCATEGORYNAME,APPLICATIONNUMBER,LABELERNAME,SUBSTANCENAME,ACTIVE_NUMERATOR_STRENGTH,ACTIVE_INGRED_UNIT,PHARM_CLASSES,DEASCHEDULE,NDC_EXCLUDE_FLAG,LISTING_RECORD_CERTIFIED_THROUGH
74661,68466-0002_b9ea1370-4627-458d-8460-3c1fa0a56f48,68466-0002,HUMAN OTC DRUG,Sports For Trauma Gel,,"Bellis Perennis, Hypericum Perfomatum,Toxicode...",GEL,TOPICAL; TOPICAL; TOPICAL,20040701,,UNAPPROVED HOMEOPATHIC,,"Schwabe Mexico, S.A. de C.V.",BELLIS PERENNIS; HYPERICUM PERFORATUM; TOXICOD...,1; 2; 3; 1,[hp_X]/71g; [hp_X]/71g; [hp_X]/71g; [hp_X]/71g,,,N,20201231.0


In [27]:
# Find the elements with multiple RouteNames
prod["ROUTENAME"][prod["ROUTENAME"].str.split("; ").str.len() > 1].head()

49    INTRAMUSCULAR; SUBCUTANEOUS
52      INTRAVENOUS; SUBCUTANEOUS
55      INTRAVENOUS; SUBCUTANEOUS
70    INTRAMUSCULAR; SUBCUTANEOUS
71    INTRAMUSCULAR; SUBCUTANEOUS
Name: ROUTENAME, dtype: object

In [28]:
# Find all the elements with RouteName repetitions
print(f'Num of RouteName repetitions : {len(prod[prod["ROUTENAME"].str.split("; ").str.len() > 1][prod["ROUTENAME"][prod["ROUTENAME"].str.split("; ").str.len() > 1].str.split("; ").apply(lambda x : set(x)).str.len() == 1])}')
prod[prod["ROUTENAME"].str.split("; ").str.len() > 1][prod["ROUTENAME"][prod["ROUTENAME"].str.split("; ").str.len() > 1].str.split("; ").apply(lambda x : set(x)).str.len() == 1].head()

Num of RouteName repetitions : 32


Unnamed: 0,PRODUCTID,PRODUCTNDC,PRODUCTTYPENAME,PROPRIETARYNAME,PROPRIETARYNAMESUFFIX,NONPROPRIETARYNAME,DOSAGEFORMNAME,ROUTENAME,STARTMARKETINGDATE,ENDMARKETINGDATE,MARKETINGCATEGORYNAME,APPLICATIONNUMBER,LABELERNAME,SUBSTANCENAME,ACTIVE_NUMERATOR_STRENGTH,ACTIVE_INGRED_UNIT,PHARM_CLASSES,DEASCHEDULE,NDC_EXCLUDE_FLAG,LISTING_RECORD_CERTIFIED_THROUGH
9927,0527-1109_458431c7-41c7-48f4-a8f6-b6b6ccc7cbe2,0527-1109,HUMAN PRESCRIPTION DRUG,Isoniazid,,Isoniazid,TABLET,ORAL; ORAL,20131010,,ANDA,ANDA089776,"Lannett Company, Inc.",ISONIAZID,300,mg/1,Antimycobacterial [EPC],,N,20211231.0
11121,0615-8061_4643015f-3f68-4ecd-909f-85e3fd2c8549,0615-8061,HUMAN PRESCRIPTION DRUG,Lisinopril,,Lisinopril,TABLET,ORAL; ORAL,20111101,20200930.0,ANDA,ANDA076180,"NCS HealthCare of KY, Inc dba Vangard Labs",LISINOPRIL,2.5,mg/1,"Angiotensin Converting Enzyme Inhibitor [EPC],...",,N,
12307,0869-0012_2fd9a395-b322-45b0-b568-71b98e581ae4,0869-0012,HUMAN OTC DRUG,Vitamin A D,,"Lanolin, Petrolatum",OINTMENT,TOPICAL; TOPICAL,20130701,,OTC MONOGRAPH FINAL,part347,Vi-Jon,LANOLIN; PETROLATUM,133; 459,mg/g; mg/g,,,N,20211231.0
17328,16714-114_6ae8605d-16ec-9ea6-8389-ba144c924ee1,16714-114,HUMAN PRESCRIPTION DRUG,Fluoxetine hydrochloride,,Fluoxetine hydrochloride,"TABLET, FILM COATED",ORAL; ORAL,20190918,,ANDA,ANDA211721,NorthStar Rx LLC,FLUOXETINE HYDROCHLORIDE,60,mg/1,"Serotonin Reuptake Inhibitor [EPC],Serotonin U...",,N,20201231.0
29347,43598-632_b7779005-0433-747c-3ea9-16e6d45b6cee,43598-632,HUMAN PRESCRIPTION DRUG,Fluoxetine hydrochloride,,Fluoxetine hydrochloride,"TABLET, FILM COATED",ORAL; ORAL,20190128,,ANDA,ANDA211721,Dr. Reddy's Laboratories Inc.,FLUOXETINE HYDROCHLORIDE,60,mg/1,"Serotonin Reuptake Inhibitor [EPC],Serotonin U...",,N,20201231.0


Collapse to a single category for the 32 objects with repetitions

In [29]:
prod.loc[
    prod["ROUTENAME"][prod["ROUTENAME"].str.split("; ").str.len() > 1].str.split("; ").apply(lambda x: set(x))[prod["ROUTENAME"][prod["ROUTENAME"].str.split("; ").str.len() > 1].str.split("; ").apply(lambda x: set(x)).str.len() == 1].index,
    "ROUTENAME"
] = prod["ROUTENAME"][prod["ROUTENAME"].str.split("; ").str.len() > 1].str.split("; ").apply(lambda x: list(set(x))[0])[prod["ROUTENAME"][prod["ROUTENAME"].str.split("; ").str.len() > 1].str.split("; ").apply(lambda x: set(x)).str.len() == 1]
# [prod["ROUTENAME"][prod["ROUTENAME"].str.split("; ").str.len() > 1].str.split("; ")]
# .apply(lambda x : set(x)).str.len() == 1]
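
The same collapsing step can be sketched more readably with a small helper applied to each value. The helper name `collapse_repeats` and the toy series are hypothetical and only illustrate the idea:

```python
import pandas as pd

routes = pd.Series([
    "ORAL; ORAL",
    "TOPICAL; TOPICAL; TOPICAL",
    "INTRAMUSCULAR; INTRAVENOUS",
    "ORAL",
    None,
])

def collapse_repeats(value):
    """Return the single route when every listed route is identical."""
    if not isinstance(value, str):
        return value  # leave missing values untouched
    parts = value.split("; ")
    # A set of size 1 means all parts are the same route
    return parts[0] if len(set(parts)) == 1 else value

cleaned = routes.map(collapse_repeats)
print(cleaned.tolist())
```

Values with genuinely different routes are returned unchanged, so only the repeated entries are modified.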

**Verifying DateType attributes (START/END/Listing_Record_Certified_Through)**

First let's transform the data from int to datetime format

In [30]:
prod["STARTMARKETINGDATE"] = pd.to_datetime(prod["STARTMARKETINGDATE"], format="%Y%m%d")

In [31]:
print(f'Date range from {min(prod["STARTMARKETINGDATE"])} to {max(prod["STARTMARKETINGDATE"])} for marketing start : OK')

Date range from 1900-01-01 00:00:00 to 2020-02-14 00:00:00 for marketing start : OK


One EndMarketingDate value has the year 3031, which is an error. Let's correct it to 2031.

In [32]:
# pd.to_datetime(prod["ENDMARKETINGDATE"], format="%Y%m%d")
prod["ENDMARKETINGDATE"].sort_values(ascending=False).head()

29503    30310209.0
65640    20390831.0
46709    20380131.0
89575    20331010.0
89576    20331010.0
Name: ENDMARKETINGDATE, dtype: float64

In [33]:
prod.loc[prod["ENDMARKETINGDATE"] > 20500000, ["ENDMARKETINGDATE"]] = 30310209.0 - 10000000

In [34]:
print(prod.iloc[29503]["ENDMARKETINGDATE"])    # Check replacement

20310209.0
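
The fix above is hard-coded for the single record found. A more general way to spot such values is to extract the year from the `YYYYMMDD` number and flag anything outside a plausible window. This is a sketch only: the 1900 to 2050 window is an arbitrary assumption, and the series is toy data.

```python
import pandas as pd

end_dates = pd.Series([30310209.0, 20390831.0, None, 20200930.0])

# Integer part of YYYYMMDD / 10000 is the year; missing values stay missing
years = end_dates // 10000

# Hypothetical plausibility window, chosen for illustration only
suspicious = (years < 1900) | (years > 2050)
print(end_dates[suspicious])
```

Missing dates compare as False on both sides, so they are never flagged.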


In [35]:
# Converting the actual data
prod["ENDMARKETINGDATE"] = pd.to_datetime(prod["ENDMARKETINGDATE"], format="%Y%m%d")

In [36]:
print(f'Date range from {min(prod["ENDMARKETINGDATE"].dropna())} to {max(prod["ENDMARKETINGDATE"].dropna())} for marketing end : OK')

Date range from 2020-02-15 00:00:00 to 2039-08-31 00:00:00 for marketing end : OK


No objects with incongruent start/end date combinations

In [37]:
print(f'Number of objects with end dates earlier than start dates : {len(prod[prod["ENDMARKETINGDATE"] < prod["STARTMARKETINGDATE"]])}')

Number of objects with end dates earlier than start dates : 0


Converting Listing_Record_Certified_Through to datetime; no inconsistencies found

In [38]:
prod["LISTING_RECORD_CERTIFIED_THROUGH"] = pd.to_datetime(prod["LISTING_RECORD_CERTIFIED_THROUGH"], format="%Y%m%d")

In [39]:
print(f'Date range from {min(prod["LISTING_RECORD_CERTIFIED_THROUGH"].dropna())} to {max(prod["LISTING_RECORD_CERTIFIED_THROUGH"].dropna())} for listing certification : OK')

Date range from 2020-12-31 00:00:00 to 2021-12-31 00:00:00 for listing certification : OK


In [40]:
print(f'Number of objects with listing certification dates earlier than start dates : {len(prod[prod["LISTING_RECORD_CERTIFIED_THROUGH"] < prod["STARTMARKETINGDATE"]])}')

Number of objects with listing certification dates earlier than start dates : 0


**Verifying that ApplicationNumber starts with an FDA reference code**

In [41]:
# Does not match prefix NDA / ANDA / BLA or partXXXX in ApplicationNumber
prod["APPLICATIONNUMBER"].dropna()[prod["APPLICATIONNUMBER"].dropna().str.match("^[^NDA|^ANDA|^BLA|^part]")]

26428    333D
Name: APPLICATIONNUMBER, dtype: object
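
Note that `^[^NDA|^ANDA|^BLA|^part]` is a character class: it matches values whose first character is not one of the listed characters, not values that lack the listed prefixes. That is why `333D` is flagged while a value starting with `B`, such as `BA740193`, is not. A sketch comparing it with an explicit prefix pattern, using only the standard `re` module on a toy list:

```python
import re

app_numbers = ["NDA018781", "ANDA089776", "BLA125469", "part347", "333D", "BA740193"]

# Pattern used in the notebook: a negated character class on the first character
char_class = re.compile(r"^[^NDA|^ANDA|^BLA|^part]")
flagged_by_class = [a for a in app_numbers if char_class.match(a)]

# Explicit alternation of the expected prefixes
known_prefix = re.compile(r"^(?:NDA|ANDA|BLA|part)")
unexpected = [a for a in app_numbers if not known_prefix.match(a)]

print(flagged_by_class)
print(unexpected)
```

On the real column, the prefix pattern could be passed to `Series.str.match` after `dropna()` and the result negated with `~`.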

Let's remove that object

In [42]:
prod = prod.drop(prod["APPLICATIONNUMBER"].dropna()[prod["APPLICATIONNUMBER"].dropna().str.match("^[^NDA|^ANDA|^BLA|^part]")].index)

In [43]:
prod.head()

Unnamed: 0,PRODUCTID,PRODUCTNDC,PRODUCTTYPENAME,PROPRIETARYNAME,PROPRIETARYNAMESUFFIX,NONPROPRIETARYNAME,DOSAGEFORMNAME,ROUTENAME,STARTMARKETINGDATE,ENDMARKETINGDATE,MARKETINGCATEGORYNAME,APPLICATIONNUMBER,LABELERNAME,SUBSTANCENAME,ACTIVE_NUMERATOR_STRENGTH,ACTIVE_INGRED_UNIT,PHARM_CLASSES,DEASCHEDULE,NDC_EXCLUDE_FLAG,LISTING_RECORD_CERTIFIED_THROUGH
0,,0002-0800,HUMAN OTC DRUG,Sterile Diluent,,diluent,"INJECTION, SOLUTION",SUBCUTANEOUS,1987-07-10,NaT,NDA,NDA018781,10,WATER,1.0,mL/mL,,,N,2020-12-31
1,,0002-1200,HUMAN PRESCRIPTION DRUG,Amyvid,,Florbetapir F 18,"INJECTION, SOLUTION",INTRAVENOUS,2012-06-01,NaT,NDA,NDA202008,10,FLORBETAPIR F-18,51.0,mCi/mL,"Radioactive Diagnostic Agent [EPC],Positron Em...",,N,2021-12-31
2,,0002-1433,HUMAN PRESCRIPTION DRUG,Trulicity,,Dulaglutide,"INJECTION, SOLUTION",SUBCUTANEOUS,2014-09-18,NaT,BLA,BLA125469,10,DULAGLUTIDE,0.75,mg/.5mL,"GLP-1 Receptor Agonist [EPC],Glucagon-Like Pep...",,N,2020-12-31
3,,0002-1434,HUMAN PRESCRIPTION DRUG,Trulicity,,Dulaglutide,"INJECTION, SOLUTION",SUBCUTANEOUS,2014-09-18,NaT,BLA,BLA125469,10,DULAGLUTIDE,1.5,mg/.5mL,"GLP-1 Receptor Agonist [EPC],Glucagon-Like Pep...",,N,2020-12-31
4,,0002-1436,HUMAN PRESCRIPTION DRUG,EMGALITY,,galcanezumab,"INJECTION, SOLUTION",SUBCUTANEOUS,2018-09-27,NaT,BLA,BLA761063,10,GALCANEZUMAB,120.0,mg/mL,,,N,2020-12-31


**Verify if ApplicationNumber prefix and MarketingCategoryName are identical**

In [44]:
prod[["MARKETINGCATEGORYNAME","APPLICATIONNUMBER"]].isnull().sum()

MARKETINGCATEGORYNAME        0
APPLICATIONNUMBER        13097
dtype: int64

Let's look at the values for MarketingCategoryName when ApplicationNumber is null

In [45]:
prod[["MARKETINGCATEGORYNAME","APPLICATIONNUMBER"]][prod["APPLICATIONNUMBER"].isnull()].head()

Unnamed: 0,MARKETINGCATEGORYNAME,APPLICATIONNUMBER
738,UNAPPROVED DRUG OTHER,
750,UNAPPROVED DRUG OTHER,
2262,UNAPPROVED DRUG OTHER,
2307,UNAPPROVED DRUG OTHER,
2822,UNAPPROVED DRUG OTHER,


For now we leave the NaN fields in ApplicationNumber as empty

In [46]:
prod["MARKETINGCATEGORYNAME"][prod["APPLICATIONNUMBER"].isnull()].unique()

array(['UNAPPROVED DRUG OTHER', 'UNAPPROVED HOMEOPATHIC',
       'UNAPPROVED MEDICAL GAS',
       'UNAPPROVED DRUG FOR USE IN DRUG SHORTAGE'], dtype=object)

Listing the categories where the ApplicationNumber prefix does not match MarketingCategoryName shows that NDA and ANDA have some mismatches between the 2 features

In [47]:
print(prod[["MARKETINGCATEGORYNAME","APPLICATIONNUMBER"]].dropna()[prod[["MARKETINGCATEGORYNAME","APPLICATIONNUMBER"]].dropna().apply(lambda x : x["MARKETINGCATEGORYNAME"] not in x["APPLICATIONNUMBER"], axis=1)]["MARKETINGCATEGORYNAME"].unique())

['OTC MONOGRAPH NOT FINAL' 'OTC MONOGRAPH FINAL' 'NDA AUTHORIZED GENERIC'
 'NDA' 'ANDA']
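
The test used above is a substring test, so a category of `NDA` is also found inside an application number starting with `ANDA`. A stricter variant extracts the leading letters of the application number and compares them exactly. This is a sketch on invented rows, not the notebook's method:

```python
import pandas as pd

toy = pd.DataFrame({
    "MARKETINGCATEGORYNAME": ["NDA", "ANDA", "NDA", "OTC MONOGRAPH FINAL", "NDA"],
    "APPLICATIONNUMBER": ["NDA018781", "BA740193", "BN890105", "part347", "ANDA076180"],
})

# Leading run of letters in the application number, e.g. "NDA", "BA", "part"
prefix = toy["APPLICATIONNUMBER"].str.extract(r"^([A-Za-z]+)", expand=False)

# Exact comparison: unlike the substring test, "NDA" does not match "ANDA"
mismatch = prefix != toy["MARKETINGCATEGORYNAME"]
print(toy[mismatch])
```

OTC monograph rows are expected to show up here, since their application numbers use the `part` prefix; they would need to be excluded before dropping anything.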


OK for OTC MONOGRAPH + NDA AUTHORIZED GENERIC entries

In [48]:
# Entries with NDA AUTHORIZED GENERIC AND NOT STARTING WITH NDAXXXXX IN APPLICATION NUMBER
prod[["MARKETINGCATEGORYNAME","APPLICATIONNUMBER"]].dropna()[prod[["MARKETINGCATEGORYNAME","APPLICATIONNUMBER"]].dropna().apply(lambda x : x["MARKETINGCATEGORYNAME"] not in x["APPLICATIONNUMBER"] and x["MARKETINGCATEGORYNAME"] == "NDA AUTHORIZED GENERIC", axis=1)]["APPLICATIONNUMBER"].str.contains("^[^NDA]").sum()

0

Inconsistencies in ApplicationNumber for objects labelled ANDA and NDA

In [49]:
# For ANDA in Marketing
print(f'{len(prod[["MARKETINGCATEGORYNAME","APPLICATIONNUMBER"]].dropna()[prod[["MARKETINGCATEGORYNAME","APPLICATIONNUMBER"]].dropna().apply(lambda x : x["MARKETINGCATEGORYNAME"] not in x["APPLICATIONNUMBER"] and x["MARKETINGCATEGORYNAME"] == "ANDA", axis=1)])} mislabelled ANDA samples')
prod[["MARKETINGCATEGORYNAME","APPLICATIONNUMBER"]].dropna()[prod[["MARKETINGCATEGORYNAME","APPLICATIONNUMBER"]].dropna().apply(lambda x : x["MARKETINGCATEGORYNAME"] not in x["APPLICATIONNUMBER"] and x["MARKETINGCATEGORYNAME"] == "ANDA", axis=1)]


12 mislabelled ANDA samples


Unnamed: 0,MARKETINGCATEGORYNAME,APPLICATIONNUMBER
9209,ANDA,BA740193
9228,ANDA,BA720563
9229,ANDA,BA720562
16915,ANDA,BA010228
16916,ANDA,BA010228
16920,ANDA,BA125608
16921,ANDA,BA125608
16923,ANDA,BA010228
28399,ANDA,BA110057
41792,ANDA,BA740193


In [50]:
# NDA mislabelled
print(f'{len(prod[["MARKETINGCATEGORYNAME","APPLICATIONNUMBER"]].dropna()[prod[["MARKETINGCATEGORYNAME","APPLICATIONNUMBER"]].dropna().apply(lambda x : x["MARKETINGCATEGORYNAME"] not in x["APPLICATIONNUMBER"] and x["MARKETINGCATEGORYNAME"] == "NDA", axis=1)])} mislabelled NDA samples')
prod[["MARKETINGCATEGORYNAME","APPLICATIONNUMBER"]].dropna()[prod[["MARKETINGCATEGORYNAME","APPLICATIONNUMBER"]].dropna().apply(lambda x : x["MARKETINGCATEGORYNAME"] not in x["APPLICATIONNUMBER"] and x["MARKETINGCATEGORYNAME"] == "NDA", axis=1)]

126 mislabelled NDA samples


Unnamed: 0,MARKETINGCATEGORYNAME,APPLICATIONNUMBER
5955,NDA,BN890105
8905,NDA,BN070012
8966,NDA,BN200952
13099,NDA,BN160918
13100,NDA,BN160918
...,...,...
50017,NDA,BN980123
50020,NDA,BN000127
50021,NDA,BN000127
50022,NDA,BN000127


Remove them from prod dataframe

In [51]:
prod = prod.drop(prod[["MARKETINGCATEGORYNAME","APPLICATIONNUMBER"]].dropna()[prod[["MARKETINGCATEGORYNAME","APPLICATIONNUMBER"]].dropna().apply(lambda x : x["MARKETINGCATEGORYNAME"] not in x["APPLICATIONNUMBER"] and x["MARKETINGCATEGORYNAME"] == "ANDA", axis=1)].index)
prod = prod.drop(prod[["MARKETINGCATEGORYNAME","APPLICATIONNUMBER"]].dropna()[prod[["MARKETINGCATEGORYNAME","APPLICATIONNUMBER"]].dropna().apply(lambda x : x["MARKETINGCATEGORYNAME"] not in x["APPLICATIONNUMBER"] and x["MARKETINGCATEGORYNAME"] == "NDA", axis=1)].index)


In [52]:
prod.head()

Unnamed: 0,PRODUCTID,PRODUCTNDC,PRODUCTTYPENAME,PROPRIETARYNAME,PROPRIETARYNAMESUFFIX,NONPROPRIETARYNAME,DOSAGEFORMNAME,ROUTENAME,STARTMARKETINGDATE,ENDMARKETINGDATE,MARKETINGCATEGORYNAME,APPLICATIONNUMBER,LABELERNAME,SUBSTANCENAME,ACTIVE_NUMERATOR_STRENGTH,ACTIVE_INGRED_UNIT,PHARM_CLASSES,DEASCHEDULE,NDC_EXCLUDE_FLAG,LISTING_RECORD_CERTIFIED_THROUGH
0,,0002-0800,HUMAN OTC DRUG,Sterile Diluent,,diluent,"INJECTION, SOLUTION",SUBCUTANEOUS,1987-07-10,NaT,NDA,NDA018781,10,WATER,1.0,mL/mL,,,N,2020-12-31
1,,0002-1200,HUMAN PRESCRIPTION DRUG,Amyvid,,Florbetapir F 18,"INJECTION, SOLUTION",INTRAVENOUS,2012-06-01,NaT,NDA,NDA202008,10,FLORBETAPIR F-18,51.0,mCi/mL,"Radioactive Diagnostic Agent [EPC],Positron Em...",,N,2021-12-31
2,,0002-1433,HUMAN PRESCRIPTION DRUG,Trulicity,,Dulaglutide,"INJECTION, SOLUTION",SUBCUTANEOUS,2014-09-18,NaT,BLA,BLA125469,10,DULAGLUTIDE,0.75,mg/.5mL,"GLP-1 Receptor Agonist [EPC],Glucagon-Like Pep...",,N,2020-12-31
3,,0002-1434,HUMAN PRESCRIPTION DRUG,Trulicity,,Dulaglutide,"INJECTION, SOLUTION",SUBCUTANEOUS,2014-09-18,NaT,BLA,BLA125469,10,DULAGLUTIDE,1.5,mg/.5mL,"GLP-1 Receptor Agonist [EPC],Glucagon-Like Pep...",,N,2020-12-31
4,,0002-1436,HUMAN PRESCRIPTION DRUG,EMGALITY,,galcanezumab,"INJECTION, SOLUTION",SUBCUTANEOUS,2018-09-27,NaT,BLA,BLA761063,10,GALCANEZUMAB,120.0,mg/mL,,,N,2020-12-31


In [53]:
# prod["ACTIVE_NUMERATOR_STRENGTH"].astype(float)

### Exploring package data

In [55]:
pack.head()

Unnamed: 0,PRODUCTID,PRODUCTNDC,NDCPACKAGECODE,PACKAGEDESCRIPTION,STARTMARKETINGDATE,ENDMARKETINGDATE,NDC_EXCLUDE_FLAG,SAMPLE_PACKAGE
0,0002-0800_94c48759-29bb-402d-afff-9a713be11f0e,0002-0800,0002-0800-01,1 VIAL in 1 CARTON (0002-0800-01) > 10 mL in ...,19870710,,N,N
1,0002-1200_35551a38-7a8d-43b8-8abd-f6cb7549e932,0002-1200,0002-1200-30,"1 VIAL, MULTI-DOSE in 1 CAN (0002-1200-30) > ...",20120601,,N,N
2,0002-1200_35551a38-7a8d-43b8-8abd-f6cb7549e932,0002-1200,0002-1200-50,"1 VIAL, MULTI-DOSE in 1 CAN (0002-1200-50) > ...",20120601,,N,N
3,0002-1433_42a80046-fd68-4b80-819c-a443b7816edb,0002-1433,0002-1433-61,2 SYRINGE in 1 CARTON (0002-1433-61) > .5 mL ...,20141107,,N,Y
4,0002-1433_42a80046-fd68-4b80-819c-a443b7816edb,0002-1433,0002-1433-80,4 SYRINGE in 1 CARTON (0002-1433-80) > .5 mL ...,20141107,,N,N
