# TP2: Preprocessing and data visualization

## Winter 2023 - BIN710 Data Mining (UdeS)

Second assignment as part of the Data Mining class at UdeS.

Student name : Simon Lalonde

### Directory structure

├── package2.csv    ---> Data

├── product2.csv    ---> Data

├── tp1.ipynb   ---> Jupyter Notebook

└── TP1.pdf    ---> Tasks to complete

### Data
Two files are provided for each dataset, and both have the same *byte signature*, meaning they match when compared byte by byte.

NDC = National Drug Code

### Metadata
Description for the 2 data files used : 
- [Product](https://www.fda.gov/drugs/drug-approvals-and-databases/ndc-product-file-definitions)
- [Package](https://www.fda.gov/drugs/drug-approvals-and-databases/ndc-package-file-definitions)

### Goal
Use preprocessing and data visualization techniques on FDA drugs databases

---

## 1, 2 and 3 : Data verification and cleaning for individual tables (coherence, types, redundancy etc.)

Importing all required libraries and modules. Reading file to dataframe with proper encoding

In [1]:
from pathlib import Path

import pandas as pd
import numpy as np

from sklearn.preprocessing import OneHotEncoder

In [2]:
root_dir = Path.cwd()
pack = pd.read_csv(root_dir / "package2.csv", delimiter=";")
prod = pd.read_csv(root_dir / "product2.csv", delimiter=";", encoding="ISO-8859-1")    # Latin-1 encoding



### Exploring and cleaning the product data table

In [3]:
prod.head()

Unnamed: 0,PRODUCTID,PRODUCTNDC,PRODUCTTYPENAME,PROPRIETARYNAME,PROPRIETARYNAMESUFFIX,NONPROPRIETARYNAME,DOSAGEFORMNAME,ROUTENAME,STARTMARKETINGDATE,ENDMARKETINGDATE,MARKETINGCATEGORYNAME,APPLICATIONNUMBER,LABELERNAME,SUBSTANCENAME,ACTIVE_NUMERATOR_STRENGTH,ACTIVE_INGRED_UNIT,PHARM_CLASSES,DEASCHEDULE,NDC_EXCLUDE_FLAG,LISTING_RECORD_CERTIFIED_THROUGH
0,,0002-0800,HUMAN OTC DRUG,Sterile Diluent,,diluent,"INJECTION, SOLUTION",SUBCUTANEOUS,19870710,,NDA,NDA018781,10,WATER,1.0,mL/mL,,,N,20201231.0
1,,0002-1200,HUMAN PRESCRIPTION DRUG,Amyvid,,Florbetapir F 18,"INJECTION, SOLUTION",INTRAVENOUS,20120601,,NDA,NDA202008,10,FLORBETAPIR F-18,51.0,mCi/mL,"Radioactive Diagnostic Agent [EPC],Positron Em...",,N,20211231.0
2,,0002-1433,HUMAN PRESCRIPTION DRUG,Trulicity,,Dulaglutide,"INJECTION, SOLUTION",SUBCUTANEOUS,20140918,,BLA,BLA125469,10,DULAGLUTIDE,0.75,mg/.5mL,"GLP-1 Receptor Agonist [EPC],Glucagon-Like Pep...",,N,20201231.0
3,,0002-1434,HUMAN PRESCRIPTION DRUG,Trulicity,,Dulaglutide,"INJECTION, SOLUTION",SUBCUTANEOUS,20140918,,BLA,BLA125469,10,DULAGLUTIDE,1.5,mg/.5mL,"GLP-1 Receptor Agonist [EPC],Glucagon-Like Pep...",,N,20201231.0
4,,0002-1436,HUMAN PRESCRIPTION DRUG,EMGALITY,,galcanezumab,"INJECTION, SOLUTION",SUBCUTANEOUS,20180927,,BLA,BLA761063,10,GALCANEZUMAB,120.0,mg/mL,,,N,20201231.0


In [4]:
print(f"Product dataframe has {prod.shape[0]} objects and {prod.shape[1]} columns.") 

Product dataframe has 93238 objects and 20 columns.


In [5]:
print(f" There are {len(prod.dtypes[prod.dtypes != 'object'])} numerical columns :\n") 
print(prod.dtypes[prod.dtypes != "object"].index.to_list())

 There are 3 numerical columns :

['STARTMARKETINGDATE', 'ENDMARKETINGDATE', 'LISTING_RECORD_CERTIFIED_THROUGH']


In [6]:
print(f" There are {len(prod.dtypes[prod.dtypes == 'object'])} non-numerical columns :\n") 
print(prod.dtypes[prod.dtypes == "object"].index.to_list())

 There are 17 non-numerical columns :

['PRODUCTID', 'PRODUCTNDC', 'PRODUCTTYPENAME', 'PROPRIETARYNAME', 'PROPRIETARYNAMESUFFIX', 'NONPROPRIETARYNAME', 'DOSAGEFORMNAME', 'ROUTENAME', 'MARKETINGCATEGORYNAME', 'APPLICATIONNUMBER', 'LABELERNAME', 'SUBSTANCENAME', 'ACTIVE_NUMERATOR_STRENGTH', 'ACTIVE_INGRED_UNIT', 'PHARM_CLASSES', 'DEASCHEDULE', 'NDC_EXCLUDE_FLAG']


Looking at null/missing values for each feature. Some features have a very high number of missing values, such as ProprietaryNameSuffix, EndMarketingDate and DeaSchedule. This makes sense: these values are not missing per se but carry information about the object itself (no need for a suffix on the Rx, no known end marketing date, and no classification of dependency potential, respectively).

In [7]:
prod.isnull().sum()

PRODUCTID                            1560
PRODUCTNDC                              0
PRODUCTTYPENAME                         0
PROPRIETARYNAME                         6
PROPRIETARYNAMESUFFIX               83075
NONPROPRIETARYNAME                      4
DOSAGEFORMNAME                          0
ROUTENAME                            1932
STARTMARKETINGDATE                      0
ENDMARKETINGDATE                    88915
MARKETINGCATEGORYNAME                   0
APPLICATIONNUMBER                   13097
LABELERNAME                             0
SUBSTANCENAME                        2309
ACTIVE_NUMERATOR_STRENGTH            2309
ACTIVE_INGRED_UNIT                   2309
PHARM_CLASSES                       50984
DEASCHEDULE                         88815
NDC_EXCLUDE_FLAG                        0
LISTING_RECORD_CERTIFIED_THROUGH     4325
dtype: int64
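To put these counts in perspective, it can help to express them as a share of the rows. The sketch below is only an illustration on a small made-up frame (`toy` is a hypothetical stand-in for `prod`, and its values are not taken from the real data).

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for `prod`
toy = pd.DataFrame({
    "PRODUCTNDC": ["0002-0800", "0002-1200", "0002-1433", "0002-1434"],
    "ENDMARKETINGDATE": [np.nan, np.nan, np.nan, 20200930.0],
    "DEASCHEDULE": [np.nan, "CIV", np.nan, np.nan],
})

# The mean of a boolean mask is the fraction of missing values per column
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)
```

On the real table, the same call on `prod` would rank the features by how sparse they are.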

ProductID is supposed to be composed of ProductNDC and other information. Let's check whether that is true.

In [8]:
print(f"Num objects with no NA in ProdID/NDC : {len(prod[['PRODUCTID', 'PRODUCTNDC']].dropna())}")

Num objects with no NA in ProdID/NDC : 91678


In [9]:
print(f"Num objects with NDC code within ID col : {prod[['PRODUCTID', 'PRODUCTNDC']].dropna().apply(lambda x : x.PRODUCTNDC in x.PRODUCTID, axis=1).sum()}")

Num objects with NDC code within ID col : 91165
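The row-wise `apply` above checks that the NDC appears anywhere in the ID. A stricter and vectorized variant could compare the NDC with the part of the ID before the underscore. This is a sketch on a toy frame (`toy` is hypothetical, and the second row mimics a corrupted NDC), assuming the ID always has the form `<NDC>_<identifier>`.

```python
import pandas as pd

# Hypothetical toy frame; the last row has no PRODUCTID
toy = pd.DataFrame({
    "PRODUCTID": [
        "0002-3251_67a53369-eead-4f2c-afe9-f3274899c47e",
        "0006-0005_0c7a3452-ecb2-4f66-ad52-94f8eaf8cde8",
        None,
    ],
    "PRODUCTNDC": ["0002-3251", "05-juin", "0002-0800"],
})

# Part of the ID before the first underscore
id_prefix = toy["PRODUCTID"].str.split("_").str[0]

has_id = toy["PRODUCTID"].notna()
coherent = has_id & (id_prefix == toy["PRODUCTNDC"])
incoherent = has_id & ~coherent
print(toy[incoherent])
```

Rows without a ProductID are neither coherent nor incoherent here, which mirrors the `dropna()` used above.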


In [10]:
prod[prod['PRODUCTID'].notna()].head()

Unnamed: 0,PRODUCTID,PRODUCTNDC,PRODUCTTYPENAME,PROPRIETARYNAME,PROPRIETARYNAMESUFFIX,NONPROPRIETARYNAME,DOSAGEFORMNAME,ROUTENAME,STARTMARKETINGDATE,ENDMARKETINGDATE,MARKETINGCATEGORYNAME,APPLICATIONNUMBER,LABELERNAME,SUBSTANCENAME,ACTIVE_NUMERATOR_STRENGTH,ACTIVE_INGRED_UNIT,PHARM_CLASSES,DEASCHEDULE,NDC_EXCLUDE_FLAG,LISTING_RECORD_CERTIFIED_THROUGH
20,0002-3251_67a53369-eead-4f2c-afe9-f3274899c47e,0002-3251,HUMAN PRESCRIPTION DRUG,Strattera,,Atomoxetine hydrochloride,CAPSULE,ORAL,20050214,,NDA,NDA021411,10,ATOMOXETINE HYDROCHLORIDE,100.0,mg/1,"Norepinephrine Reuptake Inhibitor [EPC],Norepi...",,N,20211231.0
21,0002-3270_06e2a1f2-459c-45aa-9341-54e36f7726a7,0002-3270,HUMAN PRESCRIPTION DRUG,Cymbalta,,Duloxetine hydrochloride,"CAPSULE, DELAYED RELEASE",ORAL,20100115,,NDA,NDA021427,10,DULOXETINE HYDROCHLORIDE,60.0,mg/1,"Norepinephrine Uptake Inhibitors [MoA],Seroton...",,N,20201231.0
22,0002-4112_d561034d-ea58-45fe-9d07-2e9eba98c2e4,0002-4112,HUMAN PRESCRIPTION DRUG,Zyprexa,,Olanzapine,TABLET,ORAL,19970623,,NDA,NDA020592,10,OLANZAPINE,2.5,mg/1,Atypical Antipsychotic [EPC],,N,20201231.0
23,0002-4115_d561034d-ea58-45fe-9d07-2e9eba98c2e4,0002-4115,HUMAN PRESCRIPTION DRUG,Zyprexa,,Olanzapine,TABLET,ORAL,19961001,,NDA,NDA020592,10,OLANZAPINE,5.0,mg/1,Atypical Antipsychotic [EPC],,N,20201231.0
24,0002-4116_d561034d-ea58-45fe-9d07-2e9eba98c2e4,0002-4116,HUMAN PRESCRIPTION DRUG,Zyprexa,,Olanzapine,TABLET,ORAL,19961001,,NDA,NDA020592,10,OLANZAPINE,7.5,mg/1,Atypical Antipsychotic [EPC],,N,20201231.0


In [11]:
id_ndc_incoherent = prod[prod['PRODUCTID'].notna()][prod[['PRODUCTID', 'PRODUCTNDC']].dropna().apply(lambda x : x.PRODUCTNDC not in x.PRODUCTID, axis=1)]
print(f"Num objects with incoherent ID and NDC : {len(id_ndc_incoherent)}")

Num objects with incoherent ID and NDC : 513


In [12]:
id_ndc_incoherent.head()

Unnamed: 0,PRODUCTID,PRODUCTNDC,PRODUCTTYPENAME,PROPRIETARYNAME,PROPRIETARYNAMESUFFIX,NONPROPRIETARYNAME,DOSAGEFORMNAME,ROUTENAME,STARTMARKETINGDATE,ENDMARKETINGDATE,MARKETINGCATEGORYNAME,APPLICATIONNUMBER,LABELERNAME,SUBSTANCENAME,ACTIVE_NUMERATOR_STRENGTH,ACTIVE_INGRED_UNIT,PHARM_CLASSES,DEASCHEDULE,NDC_EXCLUDE_FLAG,LISTING_RECORD_CERTIFIED_THROUGH
159,0006-0005_0c7a3452-ecb2-4f66-ad52-94f8eaf8cde8,05-juin,HUMAN PRESCRIPTION DRUG,BELSOMRA,,suvorexant,"TABLET, FILM COATED",ORAL,20140829,,NDA,NDA204569,10,SUVOREXANT,5,mg/1,"Orexin Receptor Antagonist [EPC],Orexin Recept...",CIV,N,20211231.0
160,0006-0019_54e9c31a-9429-4842-b2d6-0cc1e5ad613c,19-juin,HUMAN PRESCRIPTION DRUG,PRINIVIL,,lisinopril,TABLET,ORAL,19871229,,NDA,NDA019558,10,LISINOPRIL,5,mg/1,"Angiotensin Converting Enzyme Inhibitor [EPC],...",,N,20201231.0
310,0009-0003_67759a7c-ea06-4151-87e1-a301c44d67cd,03-sept,HUMAN PRESCRIPTION DRUG,SOLU-MEDROL,,methylprednisolone sodium succinate,"INJECTION, POWDER, FOR SOLUTION",INTRAMUSCULAR; INTRAVENOUS,19590402,,NDA,NDA011856,Pharmacia and Upjohn Company LLC,METHYLPREDNISOLONE SODIUM SUCCINATE,500,mg/4mL,"Corticosteroid [EPC],Corticosteroid Hormone Re...",,N,20201231.0
311,0009-0005_c9aa26c1-05c3-479c-90eb-63b2181c5e7e,05-sept,HUMAN PRESCRIPTION DRUG,Solu-Cortef,,hydrocortisone sodium succinate,"INJECTION, POWDER, FOR SOLUTION",INTRAMUSCULAR; INTRAVENOUS,19550427,,NDA,NDA009866,Pharmacia and Upjohn Company LLC,HYDROCORTISONE SODIUM SUCCINATE,1000,mg/8mL,"Corticosteroid [EPC],Corticosteroid Hormone Re...",,N,20201231.0
312,0009-0011_c9aa26c1-05c3-479c-90eb-63b2181c5e7e,11-sept,HUMAN PRESCRIPTION DRUG,Solu-Cortef,,hydrocortisone sodium succinate,"INJECTION, POWDER, FOR SOLUTION",INTRAMUSCULAR; INTRAVENOUS,19550427,,NDA,NDA009866,Pharmacia and Upjohn Company LLC,HYDROCORTISONE SODIUM SUCCINATE,100,mg/2mL,"Corticosteroid [EPC],Corticosteroid Hormone Re...",,N,20201231.0


The PRODUCTNDC of these objects is corrupted (e.g. 05-juin instead of 0006-0005), so we drop them.

In [13]:
prod = prod.drop(id_ndc_incoherent.index)

In [14]:
# List of possible vals
id_ndc_incoherent["PRODUCTNDC"].unique()

array(['05-juin', '19-juin', '03-sept', '05-sept', '11-sept', '12-sept',
       '13-sept', '16-sept', '17-sept', '18-sept', '20-sept', '22-sept',
       '29-sept', 'OTC MONOGRAPH NOT FINAL', 'NDA', 'OTC MONOGRAPH FINAL',
       'UNAPPROVED HOMEOPATHIC', 'UNAPPROVED MEDICAL GAS',
       'UNAPPROVED DRUG OTHER', 'ANDA', 'NDA AUTHORIZED GENERIC', 'BLA'],
      dtype=object)
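A format check could catch such values directly, without relying on ProductID. The sketch below assumes that a product NDC is a 4 or 5 digit labeler code, a hyphen, then a 3 or 4 digit product code, which is what the valid values in this notebook look like. The pattern is an assumption and not an official FDA validation rule.

```python
import pandas as pd

# Values seen in the outputs of this notebook
ndc = pd.Series(["0002-0800", "16714-114", "68466-0002", "05-juin", "NDA"])

# Assumed pattern: labeler (4-5 digits), hyphen, product code (3-4 digits)
ndc_pattern = r"\d{4,5}-\d{3,4}"
is_valid = ndc.str.fullmatch(ndc_pattern)

print(ndc[~is_valid].to_list())
```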

The suffix feature will probably not contribute much relevant information, since there is no real naming standard and the values are very inconsistent.

In [15]:
print(f"{len(prod['PROPRIETARYNAMESUFFIX'].dropna())} values for PPTsuffix with {len(prod['PROPRIETARYNAMESUFFIX'].dropna().unique())} unique categories")

10115 values for PPTsuffix with 4010 unique categories


In [16]:
# See a few examples
print(prod['PROPRIETARYNAMESUFFIX'].dropna().unique()[:10])

['Zydis ' 'Mix75/25 ' 'Mix50/50 ' 'Intramuscular ' 'Relprevv ' 'KwikPen '
 ' Junior KwikPen ' ' Tempo Pen ' 'R ' 'N ']


**ProductTypeName FDA labels verification**

In [17]:
content_type_label = pd.read_html("https://www.fda.gov/industry/structured-product-labeling-resources/document-type-including-content-labeling-type")[0]["LOINC Name"]


In [18]:
content_type_label = content_type_label.str.replace("LABEL", "")
content_type_label = content_type_label.str.rstrip().to_list()

In [19]:
print(f'{len([label for label in prod["PRODUCTTYPENAME"].unique() if label not in content_type_label])} producttype categories not in official FDA repo')

0 producttype categories not in official FDA repo


**Dosage form FDA codes verification**

In [20]:
dosage_form_codes = pd.read_html("https://www.fda.gov/industry/structured-product-labeling-resources/dosage-forms")[0]
dosage_form_codes = dosage_form_codes["SPL Acceptable Term"].to_list()


In [21]:
[label for label in prod["DOSAGEFORMNAME"].unique() if label not in dosage_form_codes]
print(f"{len([label for label in prod['DOSAGEFORMNAME'].unique() if label not in dosage_form_codes])} DosageForm categories not in official FDA repo codes")

0 DosageForm categories not in official FDA repo codes


**RouteName FDA codes verification**

In [22]:
routename_codes = pd.read_html("https://www.fda.gov/industry/structured-product-labeling-resources/route-administration")[0]
routename_codes = routename_codes["SPL Acceptable Term"].to_list()


In [23]:
print(f'{len([label for label in prod["ROUTENAME"].dropna().unique() if label not in routename_codes])} RouteNames not listed in official FDA repo')

127 RouteNames not listed in official FDA repo


In [24]:
# Example of multiple categories for RouteName
print([label for label in prod["ROUTENAME"].dropna().unique() if label not in routename_codes][:10])

['INTRAMUSCULAR; SUBCUTANEOUS', 'INTRAVENOUS; SUBCUTANEOUS', 'INTRA-ARTICULAR; INTRAMUSCULAR', 'INTRA-ARTICULAR; INTRALESIONAL', 'INTRAMUSCULAR; INTRAVENOUS', 'INTRALESIONAL; INTRAMUSCULAR; INTRASYNOVIAL; SOFT TISSUE', 'INTRAMUSCULAR; INTRAVENOUS; SUBCONJUNCTIVAL', 'INTRA-ARTICULAR; INTRALESIONAL; INTRAMUSCULAR; SOFT TISSUE', 'INTRAVASCULAR; INTRAVENOUS', 'INTRA-ARTERIAL; INTRAVENOUS']


In [25]:
# Check for lowercase
len(prod["ROUTENAME"].dropna()[prod["ROUTENAME"].str.islower().dropna()])

0

We see that the unlisted labels belong to objects that combine several routes in a single value. This can be handled at encoding time with the get_dummies method or OneHotEncoding, provided the values are split on the "; " separator.

In [26]:
# Pandas get_dummies example
prod["ROUTENAME"].str.upper().str.get_dummies().head()

Unnamed: 0,AURICULAR (OTIC),BUCCAL,BUCCAL; DENTAL; TOPICAL,BUCCAL; SUBLINGUAL,BUCCAL; VAGINAL,CUTANEOUS,CUTANEOUS; EXTRACORPOREAL,CUTANEOUS; EXTRACORPOREAL; TOPICAL; VAGINAL,CUTANEOUS; EXTRACORPOREAL; VAGINAL,CUTANEOUS; INTRADERMAL; SUBCUTANEOUS,...,TOPICAL,TOPICAL; TOPICAL,TOPICAL; TOPICAL; TOPICAL,TOPICAL; TRANSDERMAL,TOPICAL; VAGINAL,TRANSDERMAL,TRANSMUCOSAL,URETERAL,URETHRAL,VAGINAL
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [27]:
#ohe example
enc = OneHotEncoder()
enc.fit(pd.DataFrame(prod["ROUTENAME"].str.upper()))
enc.categories_[0][:5]

array(['AURICULAR (OTIC)', 'BUCCAL', 'BUCCAL; DENTAL; TOPICAL',
       'BUCCAL; SUBLINGUAL', 'BUCCAL; VAGINAL'], dtype=object)
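Both examples above keep combined values such as `BUCCAL; DENTAL; TOPICAL` as their own category, because `str.get_dummies` splits on `|` by default. Passing `sep="; "` should give one column per individual route. A minimal sketch on a toy Series (`routes` is hypothetical):

```python
import pandas as pd

routes = pd.Series(["ORAL", "INTRAMUSCULAR; INTRAVENOUS", None, "TOPICAL; TOPICAL"])

# sep tells get_dummies how to split each string into individual labels
dummies = routes.str.get_dummies(sep="; ")
print(dummies)
```

A side effect is that repeated routes such as `TOPICAL; TOPICAL` collapse into a single indicator, and missing values become rows of zeros.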

Some values repeat the same route several times (e.g. TOPICAL; TOPICAL; TOPICAL), which does not make sense.

In [28]:
prod[prod["ROUTENAME"] == "TOPICAL; TOPICAL; TOPICAL"]

Unnamed: 0,PRODUCTID,PRODUCTNDC,PRODUCTTYPENAME,PROPRIETARYNAME,PROPRIETARYNAMESUFFIX,NONPROPRIETARYNAME,DOSAGEFORMNAME,ROUTENAME,STARTMARKETINGDATE,ENDMARKETINGDATE,MARKETINGCATEGORYNAME,APPLICATIONNUMBER,LABELERNAME,SUBSTANCENAME,ACTIVE_NUMERATOR_STRENGTH,ACTIVE_INGRED_UNIT,PHARM_CLASSES,DEASCHEDULE,NDC_EXCLUDE_FLAG,LISTING_RECORD_CERTIFIED_THROUGH
74661,68466-0002_b9ea1370-4627-458d-8460-3c1fa0a56f48,68466-0002,HUMAN OTC DRUG,Sports For Trauma Gel,,"Bellis Perennis, Hypericum Perfomatum,Toxicode...",GEL,TOPICAL; TOPICAL; TOPICAL,20040701,,UNAPPROVED HOMEOPATHIC,,"Schwabe Mexico, S.A. de C.V.",BELLIS PERENNIS; HYPERICUM PERFORATUM; TOXICOD...,1; 2; 3; 1,[hp_X]/71g; [hp_X]/71g; [hp_X]/71g; [hp_X]/71g,,,N,20201231.0


In [29]:
# Find the elements with multiple RouteNames
prod["ROUTENAME"][prod["ROUTENAME"].str.split("; ").str.len() > 1].head()

49    INTRAMUSCULAR; SUBCUTANEOUS
52      INTRAVENOUS; SUBCUTANEOUS
55      INTRAVENOUS; SUBCUTANEOUS
70    INTRAMUSCULAR; SUBCUTANEOUS
71    INTRAMUSCULAR; SUBCUTANEOUS
Name: ROUTENAME, dtype: object

In [30]:
# Find all the elements with RouteName repetitions
print(f'Num of RouteName repetitions : {len(prod[prod["ROUTENAME"].str.split("; ").str.len() > 1][prod["ROUTENAME"][prod["ROUTENAME"].str.split("; ").str.len() > 1].str.split("; ").apply(lambda x : set(x)).str.len() == 1])}')
prod[prod["ROUTENAME"].str.split("; ").str.len() > 1][prod["ROUTENAME"][prod["ROUTENAME"].str.split("; ").str.len() > 1].str.split("; ").apply(lambda x : set(x)).str.len() == 1].head()

Num of RouteName repetitions : 32


Unnamed: 0,PRODUCTID,PRODUCTNDC,PRODUCTTYPENAME,PROPRIETARYNAME,PROPRIETARYNAMESUFFIX,NONPROPRIETARYNAME,DOSAGEFORMNAME,ROUTENAME,STARTMARKETINGDATE,ENDMARKETINGDATE,MARKETINGCATEGORYNAME,APPLICATIONNUMBER,LABELERNAME,SUBSTANCENAME,ACTIVE_NUMERATOR_STRENGTH,ACTIVE_INGRED_UNIT,PHARM_CLASSES,DEASCHEDULE,NDC_EXCLUDE_FLAG,LISTING_RECORD_CERTIFIED_THROUGH
9927,0527-1109_458431c7-41c7-48f4-a8f6-b6b6ccc7cbe2,0527-1109,HUMAN PRESCRIPTION DRUG,Isoniazid,,Isoniazid,TABLET,ORAL; ORAL,20131010,,ANDA,ANDA089776,"Lannett Company, Inc.",ISONIAZID,300,mg/1,Antimycobacterial [EPC],,N,20211231.0
11121,0615-8061_4643015f-3f68-4ecd-909f-85e3fd2c8549,0615-8061,HUMAN PRESCRIPTION DRUG,Lisinopril,,Lisinopril,TABLET,ORAL; ORAL,20111101,20200930.0,ANDA,ANDA076180,"NCS HealthCare of KY, Inc dba Vangard Labs",LISINOPRIL,2.5,mg/1,"Angiotensin Converting Enzyme Inhibitor [EPC],...",,N,
12307,0869-0012_2fd9a395-b322-45b0-b568-71b98e581ae4,0869-0012,HUMAN OTC DRUG,Vitamin A D,,"Lanolin, Petrolatum",OINTMENT,TOPICAL; TOPICAL,20130701,,OTC MONOGRAPH FINAL,part347,Vi-Jon,LANOLIN; PETROLATUM,133; 459,mg/g; mg/g,,,N,20211231.0
17328,16714-114_6ae8605d-16ec-9ea6-8389-ba144c924ee1,16714-114,HUMAN PRESCRIPTION DRUG,Fluoxetine hydrochloride,,Fluoxetine hydrochloride,"TABLET, FILM COATED",ORAL; ORAL,20190918,,ANDA,ANDA211721,NorthStar Rx LLC,FLUOXETINE HYDROCHLORIDE,60,mg/1,"Serotonin Reuptake Inhibitor [EPC],Serotonin U...",,N,20201231.0
29347,43598-632_b7779005-0433-747c-3ea9-16e6d45b6cee,43598-632,HUMAN PRESCRIPTION DRUG,Fluoxetine hydrochloride,,Fluoxetine hydrochloride,"TABLET, FILM COATED",ORAL; ORAL,20190128,,ANDA,ANDA211721,Dr. Reddy's Laboratories Inc.,FLUOXETINE HYDROCHLORIDE,60,mg/1,"Serotonin Reuptake Inhibitor [EPC],Serotonin U...",,N,20201231.0


Collapse to a single category for the 32 objects with repetitions

In [31]:
# Split each ROUTENAME into its individual routes (missing values are skipped)
split_routes = prod["ROUTENAME"].dropna().str.split("; ")
# Keep the values that list several routes which are all identical, e.g. "ORAL; ORAL"
is_repetition = (split_routes.str.len() > 1) & (split_routes.apply(lambda x: len(set(x))) == 1)
# Replace each repetition with its single route
prod.loc[is_repetition[is_repetition].index, "ROUTENAME"] = split_routes[is_repetition].str[0]

**Verifying DateType attributes (START/END/Listing_Record_Certified_Through)**

First let's transform the data from int to datetime format

In [32]:
prod["STARTMARKETINGDATE"] = pd.to_datetime(prod["STARTMARKETINGDATE"], format="%Y%m%d")

In [33]:
print(f'Date range from {min(prod["STARTMARKETINGDATE"])} to {max(prod["STARTMARKETINGDATE"])} for marketing start : OK')

Date range from 1900-01-01 00:00:00 to 2020-02-14 00:00:00 for marketing start : OK


One EndMarketingDate has the year 3031, which is clearly an error. Let's change it to 2031.

In [34]:
# pd.to_datetime(prod["ENDMARKETINGDATE"], format="%Y%m%d")
prod["ENDMARKETINGDATE"].sort_values(ascending=False).head()

29503    30310209.0
65640    20390831.0
46709    20380131.0
89574    20331010.0
89575    20331010.0
Name: ENDMARKETINGDATE, dtype: float64

In [35]:
prod.loc[prod["ENDMARKETINGDATE"] > 20500000, ["ENDMARKETINGDATE"]] = 30310209.0 - 10000000

In [36]:
print(prod.loc[29503, "ENDMARKETINGDATE"])    # Check replacement by label: rows were dropped earlier, so iloc positions no longer match the index labels

20310209.0
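The fix above hard-codes the single faulty value. A slightly more general version could flag implausible years and shift only those rows. This is a sketch on a toy Series, where the cut-off year 2100 is an arbitrary assumption:

```python
import pandas as pd

# Two values from the output above plus a missing one (dates stored as YYYYMMDD floats)
end_dates = pd.Series([30310209.0, 20390831.0, None])

# Assumed upper bound for a plausible year; missing values compare as False
year = end_dates // 10000
too_late = year > 2100

# Shift only the flagged rows back by 1000 years (10_000_000 in YYYYMMDD form)
fixed = end_dates.where(~too_late, end_dates - 10_000_000)
print(fixed.to_list())
```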


In [37]:
# Converting the actual data
prod["ENDMARKETINGDATE"] = pd.to_datetime(prod["ENDMARKETINGDATE"], format="%Y%m%d")

In [38]:
print(f'Date range from {min(prod["ENDMARKETINGDATE"].dropna())} to {max(prod["ENDMARKETINGDATE"].dropna())} for marketing end : OK')

Date range from 2020-02-15 00:00:00 to 2039-08-31 00:00:00 for marketing end : OK


No objects with incongruent start/end date combinations

In [39]:
print(f'Number of objects with end dates earlier than start dates : {len(prod[prod["ENDMARKETINGDATE"] < prod["STARTMARKETINGDATE"]])}')

Number of objects with end dates earlier than start dates : 0


Converting Listing_Record_Certified_Through to datetime as well: no inconsistencies

In [40]:
prod["LISTING_RECORD_CERTIFIED_THROUGH"] = pd.to_datetime(prod["LISTING_RECORD_CERTIFIED_THROUGH"], format="%Y%m%d")

In [41]:
print(f'Date range from {min(prod["LISTING_RECORD_CERTIFIED_THROUGH"].dropna())} to {max(prod["LISTING_RECORD_CERTIFIED_THROUGH"].dropna())} for listing record certification : OK')

Date range from 2020-12-31 00:00:00 to 2021-12-31 00:00:00 for listing record certification : OK


In [42]:
print(f'Number of objects with listing certification dates earlier than start dates : {len(prod[prod["LISTING_RECORD_CERTIFIED_THROUGH"] < prod["STARTMARKETINGDATE"]])}')

Number of objects with listing certification dates earlier than start dates : 0


**Verifying that ApplicationNumber starts with an FDA reference code**

In [43]:
# Does not match prefix NDA / ANDA / BLA or partXXXX in ApplicationNumber
prod["APPLICATIONNUMBER"].dropna()[prod["APPLICATIONNUMBER"].dropna().str.match("^[^NDA|^ANDA|^BLA|^part]")]

26428    333D
Name: APPLICATIONNUMBER, dtype: object

Let's remove that object

In [44]:
prod = prod.drop(prod["APPLICATIONNUMBER"].dropna()[prod["APPLICATIONNUMBER"].dropna().str.match("^[^NDA|^ANDA|^BLA|^part]")].index)
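One caveat about the pattern used above: inside square brackets, `^[^NDA|^ANDA|^BLA|^part]` is a negated character class, so it only tests whether the first character is outside the set of listed characters. It does not compare whole prefixes. A group with alternation tests the full prefix. The sketch below shows the difference on a few values seen in this notebook (`app_numbers` is a toy Series):

```python
import pandas as pd

app_numbers = pd.Series(
    ["NDA018781", "ANDA089776", "BLA125469", "part347", "BA740193", "333D"]
)

# Negated character class: only looks at the first character
char_class = app_numbers.str.match(r"^[^NDA|^ANDA|^BLA|^part]")

# Alternation group: checks the whole prefix, then negate the result
bad_prefix = ~app_numbers.str.match(r"^(?:NDA|ANDA|BLA|part)")

print(app_numbers[char_class].to_list())
print(app_numbers[bad_prefix].to_list())
```

With the alternation, a value such as `BA740193` is flagged too, while the character class lets it through because it starts with `B`.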

In [45]:
prod.head()

Unnamed: 0,PRODUCTID,PRODUCTNDC,PRODUCTTYPENAME,PROPRIETARYNAME,PROPRIETARYNAMESUFFIX,NONPROPRIETARYNAME,DOSAGEFORMNAME,ROUTENAME,STARTMARKETINGDATE,ENDMARKETINGDATE,MARKETINGCATEGORYNAME,APPLICATIONNUMBER,LABELERNAME,SUBSTANCENAME,ACTIVE_NUMERATOR_STRENGTH,ACTIVE_INGRED_UNIT,PHARM_CLASSES,DEASCHEDULE,NDC_EXCLUDE_FLAG,LISTING_RECORD_CERTIFIED_THROUGH
0,,0002-0800,HUMAN OTC DRUG,Sterile Diluent,,diluent,"INJECTION, SOLUTION",SUBCUTANEOUS,1987-07-10,NaT,NDA,NDA018781,10,WATER,1.0,mL/mL,,,N,2020-12-31
1,,0002-1200,HUMAN PRESCRIPTION DRUG,Amyvid,,Florbetapir F 18,"INJECTION, SOLUTION",INTRAVENOUS,2012-06-01,NaT,NDA,NDA202008,10,FLORBETAPIR F-18,51.0,mCi/mL,"Radioactive Diagnostic Agent [EPC],Positron Em...",,N,2021-12-31
2,,0002-1433,HUMAN PRESCRIPTION DRUG,Trulicity,,Dulaglutide,"INJECTION, SOLUTION",SUBCUTANEOUS,2014-09-18,NaT,BLA,BLA125469,10,DULAGLUTIDE,0.75,mg/.5mL,"GLP-1 Receptor Agonist [EPC],Glucagon-Like Pep...",,N,2020-12-31
3,,0002-1434,HUMAN PRESCRIPTION DRUG,Trulicity,,Dulaglutide,"INJECTION, SOLUTION",SUBCUTANEOUS,2014-09-18,NaT,BLA,BLA125469,10,DULAGLUTIDE,1.5,mg/.5mL,"GLP-1 Receptor Agonist [EPC],Glucagon-Like Pep...",,N,2020-12-31
4,,0002-1436,HUMAN PRESCRIPTION DRUG,EMGALITY,,galcanezumab,"INJECTION, SOLUTION",SUBCUTANEOUS,2018-09-27,NaT,BLA,BLA761063,10,GALCANEZUMAB,120.0,mg/mL,,,N,2020-12-31


**Verify if ApplicationNumber prefix and MarketingCategoryName are identical**

In [46]:
prod[["MARKETINGCATEGORYNAME","APPLICATIONNUMBER"]].isnull().sum()

MARKETINGCATEGORYNAME        0
APPLICATIONNUMBER        12978
dtype: int64

Let's look at the values for MarketingCategoryName when ApplicationNumber is null

In [47]:
prod[["MARKETINGCATEGORYNAME","APPLICATIONNUMBER"]][prod["APPLICATIONNUMBER"].isnull()].head()

Unnamed: 0,MARKETINGCATEGORYNAME,APPLICATIONNUMBER
738,UNAPPROVED DRUG OTHER,
750,UNAPPROVED DRUG OTHER,
2262,UNAPPROVED DRUG OTHER,
2307,UNAPPROVED DRUG OTHER,
2822,UNAPPROVED DRUG OTHER,


For now we leave the NaN fields in ApplicationNumber as empty

In [48]:
prod["MARKETINGCATEGORYNAME"][prod["APPLICATIONNUMBER"].isnull()].unique()

array(['UNAPPROVED DRUG OTHER', 'UNAPPROVED HOMEOPATHIC',
       'UNAPPROVED MEDICAL GAS',
       'UNAPPROVED DRUG FOR USE IN DRUG SHORTAGE'], dtype=object)

Listing the categories where the ApplicationNumber prefix does not match MarketingCategoryName shows that NDA and ANDA have some mismatches between the 2 features

In [49]:
print(prod[["MARKETINGCATEGORYNAME","APPLICATIONNUMBER"]].dropna()[prod[["MARKETINGCATEGORYNAME","APPLICATIONNUMBER"]].dropna().apply(lambda x : x["MARKETINGCATEGORYNAME"] not in x["APPLICATIONNUMBER"], axis=1)]["MARKETINGCATEGORYNAME"].unique())

['OTC MONOGRAPH NOT FINAL' 'OTC MONOGRAPH FINAL' 'NDA AUTHORIZED GENERIC'
 'NDA' 'ANDA']


OK for OTC MONOGRAPH + NDA AUTHORIZED GENERIC entries

In [50]:
# Entries with NDA AUTHORIZED GENERIC AND NOT STARTING WITH NDAXXXXX IN APPLICATION NUMBER
prod[["MARKETINGCATEGORYNAME","APPLICATIONNUMBER"]].dropna()[prod[["MARKETINGCATEGORYNAME","APPLICATIONNUMBER"]].dropna().apply(lambda x : x["MARKETINGCATEGORYNAME"] not in x["APPLICATIONNUMBER"] and x["MARKETINGCATEGORYNAME"] == "NDA AUTHORIZED GENERIC", axis=1)]["APPLICATIONNUMBER"].str.contains("^[^NDA]").sum()

0

Incongruencies for ANDA and NDA labelled objects in ApplicationNumber

In [51]:
# For ANDA in Marketing
print(f'{len(prod[["MARKETINGCATEGORYNAME","APPLICATIONNUMBER"]].dropna()[prod[["MARKETINGCATEGORYNAME","APPLICATIONNUMBER"]].dropna().apply(lambda x : x["MARKETINGCATEGORYNAME"] not in x["APPLICATIONNUMBER"] and x["MARKETINGCATEGORYNAME"] == "ANDA", axis=1)])} mislabelled ANDA samples')
prod[["MARKETINGCATEGORYNAME","APPLICATIONNUMBER"]].dropna()[prod[["MARKETINGCATEGORYNAME","APPLICATIONNUMBER"]].dropna().apply(lambda x : x["MARKETINGCATEGORYNAME"] not in x["APPLICATIONNUMBER"] and x["MARKETINGCATEGORYNAME"] == "ANDA", axis=1)]


12 mislabelled ANDA samples


Unnamed: 0,MARKETINGCATEGORYNAME,APPLICATIONNUMBER
9209,ANDA,BA740193
9228,ANDA,BA720563
9229,ANDA,BA720562
16915,ANDA,BA010228
16916,ANDA,BA010228
16920,ANDA,BA125608
16921,ANDA,BA125608
16923,ANDA,BA010228
28399,ANDA,BA110057
41792,ANDA,BA740193


In [52]:
# NDA mislabelled
print(f'{len(prod[["MARKETINGCATEGORYNAME","APPLICATIONNUMBER"]].dropna()[prod[["MARKETINGCATEGORYNAME","APPLICATIONNUMBER"]].dropna().apply(lambda x : x["MARKETINGCATEGORYNAME"] not in x["APPLICATIONNUMBER"] and x["MARKETINGCATEGORYNAME"] == "NDA", axis=1)])} mislabelled NDA samples')
prod[["MARKETINGCATEGORYNAME","APPLICATIONNUMBER"]].dropna()[prod[["MARKETINGCATEGORYNAME","APPLICATIONNUMBER"]].dropna().apply(lambda x : x["MARKETINGCATEGORYNAME"] not in x["APPLICATIONNUMBER"] and x["MARKETINGCATEGORYNAME"] == "NDA", axis=1)]

111 mislabelled NDA samples


Unnamed: 0,MARKETINGCATEGORYNAME,APPLICATIONNUMBER
5955,NDA,BN890105
8905,NDA,BN070012
8966,NDA,BN200952
13099,NDA,BN160918
13100,NDA,BN160918
...,...,...
50017,NDA,BN980123
50020,NDA,BN000127
50021,NDA,BN000127
50022,NDA,BN000127


Remove them from prod dataframe

In [53]:
# ANDA/NDA objects whose ApplicationNumber does not contain the MarketingCategoryName
pairs = prod[["MARKETINGCATEGORYNAME","APPLICATIONNUMBER"]].dropna()
mismatch = pairs.apply(lambda x : x["MARKETINGCATEGORYNAME"] not in x["APPLICATIONNUMBER"], axis=1)
prod = prod.drop(pairs[mismatch & pairs["MARKETINGCATEGORYNAME"].isin(["ANDA", "NDA"])].index)


ProprietaryName values are very heterogeneous and there is no standard format or regulation for this feature. It may be dropped later on since it might not bring valuable information. At the very least, convert it to a single case before OneHotEncoding.

In [54]:
# Example propname
print(prod["PROPRIETARYNAME"].sort_values().unique()[:50])

['(CHLOROPROCAINE HCI' '.Insulin Aspart Protamine and Insulin Aspart'
 '0.9% SODIUM CHLORIDE' '02 CUSHION SPF45' '1 Bladder' '1 Detoxification'
 '1% Hydrocortisone' '1% LIDOCAINE HCI'
 '1.8OZ HAND SANITIZER WITH CLIP -ASSORTED'
 '1.8oz Armstrong Hand Sanitizer with Aloe Vera and Vitamin E'
 '10 Armani Prima Control Glow Moisturizer SBS SPF 35' '10 PARASITE DETOX'
 '10 Parasite Detox' '10 TREE MIX' '100% Pure Yerba Mate MIst'
 '1000 Roses CC Color Plus Correct Sheer Tan SPF 30'
 '1000 Roses CC Color plus Correct Sheer Nude SPF 30'
 '1000 Roses Daily Shade Facial SPF 18' '1012 Antimicrobial'
 '10g Colgate plus Toothbrush Kit'
 '10g Colgate plus Toothbrush plus Floss Kit' '11 Tree Pollen Mix'
 '111 Medco Benzoyl Peroxide' '12 Hour Nasal' '12 Hour Nasal Decongestant'
 '12 Hour Original Nasal Decongestant' '12 hour allergy and congestion'
 '12 hour allergy d' '12 hour decongestant'
 '12HR Allergy and Congestion Relief' '16OZ HYDORGEN PEROXIDE'
 '1ST MEDXPATCH' '1st RELIEF TOPICAL' '2 Cockro

In [55]:
prod["PROPRIETARYNAME"] = prod["PROPRIETARYNAME"].str.upper()

**No other cleanup/checkup except to convert to uppercase/lowercase before OneHotEncoding of NonProprietaryName to avoid repeated categories**

In [56]:
prod["NONPROPRIETARYNAME"] = prod["NONPROPRIETARYNAME"].str.upper()

In [57]:
prod["NONPROPRIETARYNAME"].head()

0             DILUENT
1    FLORBETAPIR F 18
2         DULAGLUTIDE
3         DULAGLUTIDE
4        GALCANEZUMAB
Name: NONPROPRIETARYNAME, dtype: object

Many labelers appear several times under slightly different names (e.g. '21st Century Homeopathics' and '21st Century Homeopathics, Inc'). This could confuse a model and lower metrics such as accuracy and F1 score.

In [58]:
prod["LABELERNAME"].sort_values().unique()[:20]

array(['- INDUSTRIAL WELDING SUPPLY CO. OF HARVEY, INC.',
       "-L'Oreal USA Products Inc", '.Cardinal Health',
       '1 Veterans Health', '10', '101196749', '111 Medco',
       '1ST MEDX LLC', '1st Class Pharmaceuticals, Inc.', '2 Transform',
       '20Lighter, LLC.', '21st Century Designer Health Products',
       '21st Century Formulations', '21st Century Homeopathics',
       '21st Century Homeopathics, Inc', '2xl Corporation', '3014704014',
       '3D Imaging Drug Design and Development LLC', '3LAB', '3LAB, Inc'],
      dtype=object)

Let's convert to uppercase before OneHotEncoding

In [59]:
prod["LABELERNAME"] = prod["LABELERNAME"].str.upper()
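Uppercasing alone does not merge variants such as `3LAB` and `3LAB, Inc`. One possible next step is to strip punctuation and common legal suffixes. This is a sketch only: `normalize_labeler` and its suffix list are hypothetical, and a real cleanup would need a reviewed list.

```python
import re

import pandas as pd

# Names taken from the output above
labelers = pd.Series([
    "21st Century Homeopathics",
    "21st Century Homeopathics, Inc",
    "3LAB",
    "3LAB, Inc",
    ".Cardinal Health",
])


def normalize_labeler(name: str) -> str:
    """Uppercase, drop punctuation and a few assumed legal suffixes."""
    name = name.upper()
    name = re.sub(r"[^A-Z0-9 ]", " ", name)  # punctuation becomes spaces
    name = re.sub(r"\b(INC|LLC|CO|CORP|LTD)\b", " ", name)  # assumed suffix list
    return " ".join(name.split())  # collapse repeated spaces


normalized = labelers.map(normalize_labeler)
print(normalized.unique())
```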

**SubstanceName, ACTIVE_NUMERATOR_STRENGTH (labelled StrengthNumber on the FDA website) and ACTIVE_INGRED_UNIT are linked features. Each can hold multiple values separated by semicolons. Let's check whether the number of values matches across the three features.**

Looking at the frequency of number of elements per value. They look similar.

In [60]:
prod["SUBSTANCENAME"].str.split(";").str.len().astype(str).value_counts()[:6]

1.0    69367
2.0     9430
3.0     4472
4.0     2655
nan     2249
5.0     1120
Name: SUBSTANCENAME, dtype: int64

In [61]:
prod["ACTIVE_NUMERATOR_STRENGTH"].str.split(";").str.len().astype(str).value_counts()[:6]

1.0    69367
2.0     9431
3.0     4471
4.0     2655
nan     2249
5.0     1120
Name: ACTIVE_NUMERATOR_STRENGTH, dtype: int64

In [62]:
prod["ACTIVE_INGRED_UNIT"].str.split(";").str.len().astype(str).value_counts()[:6]

1.0    69367
2.0     9431
3.0     4471
4.0     2655
nan     2249
5.0     1120
Name: ACTIVE_INGRED_UNIT, dtype: int64

Looking at differences in amount of categories per objects for SubstanceName and ACTIVE_NUMERATOR_STRENGTH

In [63]:
prod[prod["SUBSTANCENAME"].str.split(";").str.len() != prod["ACTIVE_NUMERATOR_STRENGTH"].str.split(";").str.len()][["SUBSTANCENAME", "ACTIVE_NUMERATOR_STRENGTH"]].head()

Unnamed: 0,SUBSTANCENAME,ACTIVE_NUMERATOR_STRENGTH
49,,
58,,
59,,
60,,
69,,


Mostly NaN except...

In [64]:
prod[prod["SUBSTANCENAME"].str.split(";").str.len() != prod["ACTIVE_NUMERATOR_STRENGTH"].str.split(";").str.len()][["SUBSTANCENAME", "ACTIVE_NUMERATOR_STRENGTH"]].dropna()

Unnamed: 0,SUBSTANCENAME,ACTIVE_NUMERATOR_STRENGTH
90536,GLYCERIN; HYDROLYZED SOY PROTEIN (ENZYMATIC; 2...,10; .12


In [65]:
prod[prod["ACTIVE_NUMERATOR_STRENGTH"].str.split(";").str.len() != prod["ACTIVE_INGRED_UNIT"].str.split(";").str.len()][["ACTIVE_NUMERATOR_STRENGTH", "ACTIVE_INGRED_UNIT"]].dropna()

Unnamed: 0,ACTIVE_NUMERATOR_STRENGTH,ACTIVE_INGRED_UNIT


In [66]:
# drop the problematic object
prod = prod.drop(
    prod[prod["SUBSTANCENAME"].str.split(";").str.len() != prod["ACTIVE_NUMERATOR_STRENGTH"].str.split(";").str.len()][["SUBSTANCENAME", "ACTIVE_NUMERATOR_STRENGTH"]].dropna().index
)

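Note, however, that the number of pharmacological classes does not necessarily match the number of substances. The count below also includes objects where either feature is missing, since NaN never compares equal.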

In [67]:
len(prod[prod["SUBSTANCENAME"].str.split(";").str.len() != prod["PHARM_CLASSES"].str.split(",").str.len()][ "PHARM_CLASSES"])

88816

Looking now at a possible OneHotEncoding approach to deal with multiple values per instance

In [68]:
print("ohe len categories for SubstanceName")
print(len(OneHotEncoder().fit(pd.DataFrame(prod["SUBSTANCENAME"].str.upper())).categories_[0]))
print("ohe len categories for ACTIVE_NUMERATOR_STRENGTH")
print(len(OneHotEncoder().fit(pd.DataFrame(prod["ACTIVE_NUMERATOR_STRENGTH"].str.upper())).categories_[0]))
print("ohe len categories for ACTIVE_INGRED_UNIT")
print(len(OneHotEncoder().fit(pd.DataFrame(prod["ACTIVE_INGRED_UNIT"].str.upper())).categories_[0]))

ohe len categories for SubstanceName
8923
ohe len categories for ACTIVE_NUMERATOR_STRENGTH
8716
ohe len categories for ACTIVE_INGRED_UNIT
2369


That many categories with one-hot encoding might lead to overfitting. Let's look at another way to deal with multiple values per instance in those features.
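
Part of the reason the category counts are so high is that each distinct combination of values is treated as its own category. A minimal sketch on toy data (hypothetical values) comparing the number of distinct combinations with the number of distinct individual values after splitting:

```python
import pandas as pd

# Hypothetical multi-valued feature: 5 combinations built from 3 substances
toy = pd.Series(["A; B", "B; A", "A; C", "B; C", "C; A", None],
                name="SUBSTANCENAME")

# What a one-hot encoder on the raw strings would see
n_combinations = toy.nunique()

# One row per individual value, whitespace removed
individual = toy.str.split(";").explode().str.strip()
n_individual = individual.nunique()

print(n_combinations, n_individual)
```

On the real data the same two numbers would show how much of the cardinality comes from combinations rather than from the substances themselves.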

In [69]:
multiple_vals_feats = ["SUBSTANCENAME", "ACTIVE_NUMERATOR_STRENGTH", "ACTIVE_INGRED_UNIT"]

In [70]:
prod[0:5]

Unnamed: 0,PRODUCTID,PRODUCTNDC,PRODUCTTYPENAME,PROPRIETARYNAME,PROPRIETARYNAMESUFFIX,NONPROPRIETARYNAME,DOSAGEFORMNAME,ROUTENAME,STARTMARKETINGDATE,ENDMARKETINGDATE,MARKETINGCATEGORYNAME,APPLICATIONNUMBER,LABELERNAME,SUBSTANCENAME,ACTIVE_NUMERATOR_STRENGTH,ACTIVE_INGRED_UNIT,PHARM_CLASSES,DEASCHEDULE,NDC_EXCLUDE_FLAG,LISTING_RECORD_CERTIFIED_THROUGH
0,,0002-0800,HUMAN OTC DRUG,STERILE DILUENT,,DILUENT,"INJECTION, SOLUTION",SUBCUTANEOUS,1987-07-10,NaT,NDA,NDA018781,10,WATER,1.0,mL/mL,,,N,2020-12-31
1,,0002-1200,HUMAN PRESCRIPTION DRUG,AMYVID,,FLORBETAPIR F 18,"INJECTION, SOLUTION",INTRAVENOUS,2012-06-01,NaT,NDA,NDA202008,10,FLORBETAPIR F-18,51.0,mCi/mL,"Radioactive Diagnostic Agent [EPC],Positron Em...",,N,2021-12-31
2,,0002-1433,HUMAN PRESCRIPTION DRUG,TRULICITY,,DULAGLUTIDE,"INJECTION, SOLUTION",SUBCUTANEOUS,2014-09-18,NaT,BLA,BLA125469,10,DULAGLUTIDE,0.75,mg/.5mL,"GLP-1 Receptor Agonist [EPC],Glucagon-Like Pep...",,N,2020-12-31
3,,0002-1434,HUMAN PRESCRIPTION DRUG,TRULICITY,,DULAGLUTIDE,"INJECTION, SOLUTION",SUBCUTANEOUS,2014-09-18,NaT,BLA,BLA125469,10,DULAGLUTIDE,1.5,mg/.5mL,"GLP-1 Receptor Agonist [EPC],Glucagon-Like Pep...",,N,2020-12-31
4,,0002-1436,HUMAN PRESCRIPTION DRUG,EMGALITY,,GALCANEZUMAB,"INJECTION, SOLUTION",SUBCUTANEOUS,2018-09-27,NaT,BLA,BLA761063,10,GALCANEZUMAB,120.0,mg/mL,,,N,2020-12-31


In [71]:
# Objects where the first substance, strength or unit is missing
prod[prod[multiple_vals_feats].apply(lambda x: x.str.split(";").str[0]).isnull().any(axis=1)][multiple_vals_feats]

Unnamed: 0,SUBSTANCENAME,ACTIVE_NUMERATOR_STRENGTH,ACTIVE_INGRED_UNIT
49,,,
58,,,
59,,,
60,,,
69,,,
...,...,...,...
93016,,,
93017,,,
93018,,,
93019,,,


We could extract the first 5 elements of each feature by position, which would split the 3 features into 15 new ones.

In [72]:
for i in range(5):
    # Column "<feature>_<i>" holds the i-th ";"-separated element (NaN if absent)
    prod[[f"{feat}_{i}" for feat in multiple_vals_feats]] = prod[multiple_vals_feats].apply(lambda x: x.str.split(";").str[i])
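
As an alternative sketch, `Series.str.split` with `expand=True` returns all positional columns in one call. The toy frame and the `parts` name below are illustrative only:

```python
import pandas as pd

# Hypothetical toy column with 3, 1 and 0 elements
toy = pd.DataFrame({"SUBSTANCENAME": ["A; B; C", "D", None]})

# One column per position; missing positions are filled with None/NaN
parts = toy["SUBSTANCENAME"].str.split(";", expand=True)
parts = parts.apply(lambda col: col.str.strip())
parts.columns = [f"SUBSTANCENAME_{i}" for i in parts.columns]

print(parts)
```

Unlike the loop, this creates as many columns as the longest entry, so a cap such as the first 5 positions would still have to be applied afterwards.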

In [73]:
prod.isnull().sum()[20:]

SUBSTANCENAME_0                 2249
ACTIVE_NUMERATOR_STRENGTH_0     2249
ACTIVE_INGRED_UNIT_0            2249
SUBSTANCENAME_1                71616
ACTIVE_NUMERATOR_STRENGTH_1    71616
ACTIVE_INGRED_UNIT_1           71616
SUBSTANCENAME_2                81046
ACTIVE_NUMERATOR_STRENGTH_2    81046
ACTIVE_INGRED_UNIT_2           81046
SUBSTANCENAME_3                85517
ACTIVE_NUMERATOR_STRENGTH_3    85517
ACTIVE_INGRED_UNIT_3           85517
SUBSTANCENAME_4                88172
ACTIVE_NUMERATOR_STRENGTH_4    88172
ACTIVE_INGRED_UNIT_4           88172
dtype: int64

In [74]:
prod = prod.drop(columns=prod.iloc[:, 20:].columns)

This produces a lot of NaN values, and we would lose the link to PHARM_CLASSES if we kept only these columns, so the split columns are dropped. We have to choose a different approach.
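
One possible different approach, sketched here on toy data and not necessarily the one this notebook ends up using, is a long format with one row per (product, substance) pair. It assumes pandas 1.3 or later (multi-column `explode`) and that the three features have the same number of elements in every row, which the checks above suggest. The `PRODUCTNDC` values are made up.

```python
import pandas as pd

toy = pd.DataFrame({
    "PRODUCTNDC": ["0000-0001", "0000-0002"],  # hypothetical codes
    "SUBSTANCENAME": ["A; B", "C"],
    "ACTIVE_NUMERATOR_STRENGTH": ["1; 2", "3"],
    "ACTIVE_INGRED_UNIT": ["mg/mL; mg/mL", "g/L"],
    "PHARM_CLASSES": ["X [EPC],Y [MoA]", None],
})
cols = ["SUBSTANCENAME", "ACTIVE_NUMERATOR_STRENGTH", "ACTIVE_INGRED_UNIT"]

# Turn each multi-valued string into a list
split = toy.copy()
for col in cols:
    split[col] = split[col].str.split(";")

# One row per list position; the other columns are repeated
long_df = split.explode(cols, ignore_index=True)
long_df[cols] = long_df[cols].apply(lambda s: s.str.strip())

print(long_df)
```

Every exploded row keeps its product code and its `PHARM_CLASSES` string, so no NaN padding is needed, at the cost of having several rows per product.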

**PHARM_CLASSES Verification and correction**

In [75]:
print(f"There are {prod['PHARM_CLASSES'].isnull().sum()} missing values for PHARM_CLASSES")

There are 50554 missing values for PHARM_CLASSES


In [76]:
# Number of objects for each number of classes (6 most frequent)
prod["PHARM_CLASSES"].str.split(",").str.len().sort_values().astype(str).value_counts()[:6]

nan    50554
2.0    23180
3.0     5029
4.0     3410
1.0     3066
5.0     3004
Name: PHARM_CLASSES, dtype: int64
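
The same tally can be obtained without the string conversion by asking `value_counts` to keep missing values. A small sketch on hypothetical data:

```python
import pandas as pd

# Hypothetical PHARM_CLASSES values
toy = pd.Series(["A [EPC],B [MoA]", None, "C [EPC]", "D [EPC],E [CS]"])

# Number of comma-separated classes per object (NaN when missing)
n_classes = toy.str.split(",").str.len()

# dropna=False keeps the NaN group in the tally
counts = n_classes.value_counts(dropna=False)
print(counts)
```

Splitting on commas would overcount if a class name itself contained a comma, so counting the bracketed codes instead may be more robust.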

In [77]:
prod["PHARM_CLASSES"].head()

0                                                  NaN
1    Radioactive Diagnostic Agent [EPC],Positron Em...
2    GLP-1 Receptor Agonist [EPC],Glucagon-Like Pep...
3    GLP-1 Receptor Agonist [EPC],Glucagon-Like Pep...
4                                                  NaN
Name: PHARM_CLASSES, dtype: object

**Extracting only the PHARM_CLASS code**

In [79]:
prod["PHARM_CLASSES"].str.findall(r"\[(.*?)\]").head()

0               NaN
1        [EPC, MoA]
2    [EPC, CS, MoA]
3    [EPC, CS, MoA]
4               NaN
Name: PHARM_CLASSES, dtype: object

Codes are repeated within PHARM_CLASSES because the classes are linked to the individual molecules listed in SUBSTANCENAME

In [80]:
# Objects whose list of codes contains at least one repeated code
codes = prod["PHARM_CLASSES"].dropna().str.findall(r"\[(.*?)\]")
codes[codes.str.len() > codes.apply(lambda x: len(set(x)))].head()

11    [EPC, EPC, MoA]
12    [EPC, EPC, MoA]
13    [EPC, EPC, MoA]
14    [EPC, EPC, MoA]
15    [MoA, EPC, MoA]
Name: PHARM_CLASSES, dtype: object
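
If only the set of codes per object matters, the repetitions can be removed while keeping the order of first appearance. A minimal sketch on hypothetical class strings:

```python
import pandas as pd

# Hypothetical PHARM_CLASSES values with repeated codes
toy = pd.Series([
    "Agent A [EPC],Agent B [EPC],Mechanism [MoA]",
    None,
    "Z [MoA],Y [EPC],W [MoA]",
])

# Extract the text between square brackets
codes = toy.str.findall(r"\[(.*?)\]")

# dict.fromkeys drops duplicates and preserves insertion order
unique_codes = codes.dropna().apply(lambda xs: list(dict.fromkeys(xs)))
print(unique_codes)
```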

**DEASCHEDULE codes verification**

In [81]:
deas_codes = ["CI", "CII", "CIII", "CIV", "CV"]
print(len([label for label in prod["DEASCHEDULE"].dropna().unique() if label not in deas_codes]))

0
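
The same check can be written with `Series.isin`, which also returns the offending rows rather than only their number. A sketch on toy data, where `"C2"` is a made-up invalid label:

```python
import pandas as pd

deas_codes = ["CI", "CII", "CIII", "CIV", "CV"]

# Hypothetical DEASCHEDULE values, including one invalid label
toy = pd.Series(["CII", None, "CIV", "C2"])

# Keep the non-missing values that are not in the allowed list
present = toy.dropna()
invalid = present[~present.isin(deas_codes)]
print(invalid)
```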


**NDC_Exclude_Flag codes verification**

In [82]:
ndc_exclude_codes = ["E", "U", "I"]
print([label for label in prod["NDC_EXCLUDE_FLAG"].unique()])

['N']


In [83]:
prod["NDC_EXCLUDE_FLAG"].value_counts()

N    92600
Name: NDC_EXCLUDE_FLAG, dtype: int64

Every object has the same value, so this column carries no information and can be dropped

In [84]:
prod = prod.drop(columns="NDC_EXCLUDE_FLAG")
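
More generally, constant columns can be detected in one pass instead of being inspected one by one. A minimal sketch on a toy frame (the `EMPTY` column is hypothetical):

```python
import pandas as pd

toy = pd.DataFrame({
    "NDC_EXCLUDE_FLAG": ["N", "N", "N"],
    "DEASCHEDULE": ["CII", None, "CIV"],
    "EMPTY": [None, None, None],  # hypothetical all-missing column
})

# A column is constant if it has at most one distinct value,
# counting NaN as a value of its own
constant_cols = [
    col for col in toy.columns if toy[col].nunique(dropna=False) <= 1
]
print(constant_cols)
```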

**LISTING_RECORD_CERTIFIED_THROUGH dates verification**

In [85]:
print(f"Max time : {max(prod['LISTING_RECORD_CERTIFIED_THROUGH'])}")
print(f"Min time : {min(prod['LISTING_RECORD_CERTIFIED_THROUGH'])}")

Max time : 2021-12-31 00:00:00
Min time : 2020-12-31 00:00:00


With values ranging only from 2020-12-31 to 2021-12-31, this column won't give any relevant information for a classification task. We can drop it

In [86]:
prod = prod.drop(columns="LISTING_RECORD_CERTIFIED_THROUGH")

In [98]:
prod.head()

Unnamed: 0,PRODUCTID,PRODUCTNDC,PRODUCTTYPENAME,PROPRIETARYNAME,PROPRIETARYNAMESUFFIX,NONPROPRIETARYNAME,DOSAGEFORMNAME,ROUTENAME,STARTMARKETINGDATE,ENDMARKETINGDATE,MARKETINGCATEGORYNAME,APPLICATIONNUMBER,LABELERNAME,SUBSTANCENAME,ACTIVE_NUMERATOR_STRENGTH,ACTIVE_INGRED_UNIT,PHARM_CLASSES,DEASCHEDULE
0,,0002-0800,HUMAN OTC DRUG,STERILE DILUENT,,DILUENT,"INJECTION, SOLUTION",SUBCUTANEOUS,1987-07-10,NaT,NDA,NDA018781,10,WATER,1.0,mL/mL,,
1,,0002-1200,HUMAN PRESCRIPTION DRUG,AMYVID,,FLORBETAPIR F 18,"INJECTION, SOLUTION",INTRAVENOUS,2012-06-01,NaT,NDA,NDA202008,10,FLORBETAPIR F-18,51.0,mCi/mL,"Radioactive Diagnostic Agent [EPC],Positron Em...",
2,,0002-1433,HUMAN PRESCRIPTION DRUG,TRULICITY,,DULAGLUTIDE,"INJECTION, SOLUTION",SUBCUTANEOUS,2014-09-18,NaT,BLA,BLA125469,10,DULAGLUTIDE,0.75,mg/.5mL,"GLP-1 Receptor Agonist [EPC],Glucagon-Like Pep...",
3,,0002-1434,HUMAN PRESCRIPTION DRUG,TRULICITY,,DULAGLUTIDE,"INJECTION, SOLUTION",SUBCUTANEOUS,2014-09-18,NaT,BLA,BLA125469,10,DULAGLUTIDE,1.5,mg/.5mL,"GLP-1 Receptor Agonist [EPC],Glucagon-Like Pep...",
4,,0002-1436,HUMAN PRESCRIPTION DRUG,EMGALITY,,GALCANEZUMAB,"INJECTION, SOLUTION",SUBCUTANEOUS,2018-09-27,NaT,BLA,BLA761063,10,GALCANEZUMAB,120.0,mg/mL,,


### Exploring package data

In [88]:
pack.head()

Unnamed: 0,PRODUCTID,PRODUCTNDC,NDCPACKAGECODE,PACKAGEDESCRIPTION,STARTMARKETINGDATE,ENDMARKETINGDATE,NDC_EXCLUDE_FLAG,SAMPLE_PACKAGE
0,0002-0800_94c48759-29bb-402d-afff-9a713be11f0e,0002-0800,0002-0800-01,1 VIAL in 1 CARTON (0002-0800-01) > 10 mL in ...,19870710,,N,N
1,0002-1200_35551a38-7a8d-43b8-8abd-f6cb7549e932,0002-1200,0002-1200-30,"1 VIAL, MULTI-DOSE in 1 CAN (0002-1200-30) > ...",20120601,,N,N
2,0002-1200_35551a38-7a8d-43b8-8abd-f6cb7549e932,0002-1200,0002-1200-50,"1 VIAL, MULTI-DOSE in 1 CAN (0002-1200-50) > ...",20120601,,N,N
3,0002-1433_42a80046-fd68-4b80-819c-a443b7816edb,0002-1433,0002-1433-61,2 SYRINGE in 1 CARTON (0002-1433-61) > .5 mL ...,20141107,,N,Y
4,0002-1433_42a80046-fd68-4b80-819c-a443b7816edb,0002-1433,0002-1433-80,4 SYRINGE in 1 CARTON (0002-1433-80) > .5 mL ...,20141107,,N,N


In [89]:
print(f"pack dataframe has {pack.shape[0]} objects and {pack.shape[1]} columns.") 

pack dataframe has 173887 objects and 8 columns.


In [90]:
print(f" There are {len(pack.dtypes[pack.dtypes != 'object'])} numerical columns :\n") 
print(pack.dtypes[pack.dtypes != "object"].index.to_list())

 There are 2 numerical columns :

['STARTMARKETINGDATE', 'ENDMARKETINGDATE']
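
Both columns flagged as numerical look like dates stored as `YYYYMMDD` numbers. A sketch of one way to parse them, assuming that format holds for every non-missing value (anything that does not match becomes `NaT`):

```python
import pandas as pd

# Hypothetical sample: an integer column and a float column with a gap
toy = pd.DataFrame({
    "STARTMARKETINGDATE": [19870710, 20120601],
    "ENDMARKETINGDATE": [20201231.0, float("nan")],
})

for col in ["STARTMARKETINGDATE", "ENDMARKETINGDATE"]:
    # Go through the nullable integer type so 20201231.0 becomes "20201231"
    as_text = toy[col].astype("Int64").astype(str)
    toy[col] = pd.to_datetime(as_text, format="%Y%m%d", errors="coerce")

print(toy.dtypes)
```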


In [91]:
print(f" There are {len(pack.dtypes[pack.dtypes == 'object'])} non-numerical columns :\n") 
print(pack.dtypes[pack.dtypes == "object"].index.to_list())

 There are 6 non-numerical columns :

['PRODUCTID', 'PRODUCTNDC', 'NDCPACKAGECODE', 'PACKAGEDESCRIPTION', 'NDC_EXCLUDE_FLAG', 'SAMPLE_PACKAGE']


**Evaluating NaN of all features**

In [92]:
pack.isnull().sum()

PRODUCTID                  0
PRODUCTNDC              1500
NDCPACKAGECODE          2346
PACKAGEDESCRIPTION         0
STARTMARKETINGDATE         0
ENDMARKETINGDATE      167431
NDC_EXCLUDE_FLAG           0
SAMPLE_PACKAGE             0
dtype: int64

**Evaluating PRODUCTNDC and PRODUCTID since they are interrelated**

In [93]:
print(f"Num objects with no NA in ProdID/NDC : {len(pack[['PRODUCTID', 'PRODUCTNDC']].dropna())}")

Num objects with no NA in ProdID/NDC : 172387


In [94]:
print(f"Num objects with NDC code within ID col : {pack[['PRODUCTID', 'PRODUCTNDC']].dropna().apply(lambda x : x.PRODUCTNDC in x.PRODUCTID, axis=1).sum()}")

Num objects with NDC code within ID col : 171868
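
Since `PRODUCTID` seems to follow the pattern `<PRODUCTNDC>_<identifier>` in the rows shown by `pack.head()`, the comparison can also be vectorized, and the same prefix could be used to fill missing `PRODUCTNDC` values. The sketch below uses made-up identifiers; it tests prefix equality, which is slightly stricter than the substring test above, and the filling step is an assumption rather than something the data guarantees.

```python
import pandas as pd

toy = pd.DataFrame({
    # Hypothetical identifiers following the "<NDC>_<suffix>" pattern
    "PRODUCTID": ["0002-0800_aaaa", "0002-1200_bbbb",
                  "0002-1433_cccc", "0002-1434_dddd"],
    "PRODUCTNDC": ["0002-0800", "0002-1201", None, "0002-1434"],
})

# Part of PRODUCTID before the first underscore
id_prefix = toy["PRODUCTID"].str.split("_").str[0]

# Rows where both fields exist but disagree
known = toy.dropna(subset=["PRODUCTID", "PRODUCTNDC"])
mismatched = known[id_prefix.loc[known.index] != known["PRODUCTNDC"]]

# Assumption: a missing NDC can be recovered from the ID prefix
recovered = toy["PRODUCTNDC"].fillna(id_prefix)

print(mismatched)
print(recovered)
```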


**Removing duplicated entries**

In [95]:
# prod[prod.duplicated("PRODUCTNDC", keep=False)].sort_values(by="PRODUCTNDC")[:10]