# Scrape Information from the FDA

openFDA provides data through a JSON file (see tutorial [here](01_Fetch_Data.ipynb)). However, FDA information is updated regularly, and during testing the data in that file extended only up to 2020. As a workaround, drug_nme includes additional methods that extract information from several FDA sources, which are demoed below.

### Import modules

In [1]:
import pandas as pd
from drug_nme import FDAScraper

### Initialize the FDAScraper() class

The FDAScraper() class must be initialized first. It accepts several parameters:
- url: the url to a specific site. By default, it will point to the [NME Drug and New Biologic Approvals](https://www.fda.gov/drugs/nda-and-bla-approvals/new-molecular-entity-nme-drug-and-new-biologic-approvals) page. 
- compilation_link: A link to the page containing the [Compilation of CDER NME Drug and New Biologic Approvals](https://www.fda.gov/drugs/drug-approvals-and-databases/compilation-cder-new-molecular-entity-nme-drug-and-new-biologic-approvals). 
- latest_link: A link to the page containing the [Novel Drug Approvals at FDA](https://www.fda.gov/drugs/development-approval-process-drugs/novel-drug-approvals-fda).

By default, these parameters are already set to the web addresses listed above. If one of the pages moves or fails to load, a different address can be passed to the corresponding parameter manually.

In [2]:
scrape = FDAScraper()

### Get NMEs From PDF Reports

The FDA curates approval reports by [calendar year](https://www.fda.gov/drugs/nda-and-bla-approvals/new-molecular-entity-nme-drug-and-new-biologic-approvals). The **get_pdf_links()** method extracts the links to these PDF reports.

The result is a dictionary that maps each year to the link for the corresponding PDF report. A report can be viewed by clicking its link or by copying and pasting the link into a browser. The links are also stored on the instance, so it is not necessary to assign the result of **get_pdf_links()** to a variable.

**NOTE:** Currently, the FDA curates reports for 2015-2023. Additional reports are available for 1999-2014; however, they are listed under the FDA Archive link, where extraction can be difficult. A workaround will be demoed further below.

In [3]:
# set to variable to demo the output
links = scrape.get_pdf_links()
links

{'2023': 'https://www.fda.gov/media/177083/download?attachment',
 '2022': 'https://www.fda.gov/media/165828/download?attachment',
 '2021': 'https://www.fda.gov/media/158152/download?attachment',
 '2020': 'https://www.fda.gov/media/147400/download?attachment',
 '2019': 'https://www.fda.gov/media/147414/download?attachment',
 '2018': 'https://www.fda.gov/media/124809/download?attachment',
 '2017': 'https://www.fda.gov/media/110746/download',
 '2016': 'https://www.fda.gov/media/102967/download',
 '2015': 'https://www.fda.gov/media/93424/download'}
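
Because the result is a plain dictionary keyed by year, it can be narrowed before any tables are extracted. The following is a minimal sketch, assuming the keys are four-digit year strings as in the output above. `select_years` is a hypothetical helper written for illustration and is not part of drug_nme.

```python
def select_years(links, start, end):
    """Return the entries of `links` whose year falls within [start, end]."""
    return {
        year: url
        for year, url in links.items()
        if year.isdigit() and start <= int(year) <= end
    }


# Sample copied from the output above (shortened to three entries)
links = {
    "2023": "https://www.fda.gov/media/177083/download?attachment",
    "2022": "https://www.fda.gov/media/165828/download?attachment",
    "2015": "https://www.fda.gov/media/93424/download",
}

recent = select_years(links, 2022, 2023)
print(sorted(recent))
```

The filtered dictionary has the same shape as the original, so it can be used wherever the full set of links would be.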

### Extract Table From PDF Report

The **get_pdf_links()** method above stores the links as an instance variable of the FDAScraper class. The links can then be used to extract the table from a PDF file with the **extract_table()** method.

In [4]:
year_2023 = scrape.extract_table(year='2023')
year_2023

Unnamed: 0,APPLICATION NUMBER,PROPRIETARY NAME,ESTABLISHED NAME,APPLICANT,REVIEW CLASSIFICATION,APPROVAL DATE,INDICATION
1,NDA 214373,BRENZAVVY,BEXAGLIFLOZIN,THERACOSBIO LLC,S,1/20/2023,ADJUNCT TO DIET AND EXERCISE TO IMPROVE GLYCEM...
2,NDA 216059,JAYPIRCA,PIRTOBRUTINIB,LOXO ONCOLOGY INC,"P,O",1/27/2023,FOR THE TREATMENT OF ADULT PATIENTS WITH RELAP...
3,NDA 217639,ORSERDU,ELACESTRANT,STEMLINE THERAPEUTICS INC,P,1/27/2023,FOR THE TREATMENT OF POSTMENOPAUSAL WOMEN OR A...
4,NDA 216951,JESDUVROQ,DAPRODUSTAT,GLAXOSMITHKLINE INTELLECTUAL PROPERTY NO 2 LTD...,S,2/1/2023,FOR THE TREATMENT OF ANEMIA DUE TO CHRONIC KID...
5,NDA 216403,FILSPARI,SPARSENTAN,TRAVERE THERAPEUTICS INC,"P,O",2/17/2023,TO REDUCE PROTEINURIA IN ADULTS WITH PRIMARY I...
6,NDA 216718,SKYCLARYS,OMAVELOXOLONE,REATA PHARMACEUTICALS INC,"P,O",2/28/2023,FOR THE TREATMENT OF FRIEDREICH'S ATAXIA IN AD...
7,NDA 216386,ZAVZPRET,ZAVEGEPANT,PFIZER INC,S,3/9/2023,ACUTE TREATMENT OF MIGRAINE WITH OR WITHOUT AU...
8,NDA 217026,DAYBUE,TROFINETIDE,ACADIA PHARMACEUTICALS INC,"P,O",3/10/2023,FOR THE TREATMENT OF RETT SYNDROME IN ADULTS A...
9,NDA 217417,REZZAYO,REZAFUNGIN,CIDARA THERAPEUTICS INC,"P,O",3/22/2023,FOR THE TREATMENT OF CANDIDEMIA AND INVASIVE C...
10,NDA 217759,JOENJA,LENIOLISIB PHOSPHATE,PHARMING TECHNOLOGIES BV,"P,O",3/24/2023,FOR THE TREATMENT OF ACTIVATED PHOSPHOINOSITID...


### Extract Table From Multiple Years
The example above works well for a single year. Multiple years can be obtained with a loop: iterate over the link dictionary shown above, extract the table for each year, and then combine the tables into a single pd.DataFrame of NMEs across those years.

In [5]:
# Empty list to hold pd.DataFrame
tables = []

# Extract table for all years in dictionary
for year in links:
    data = scrape.extract_table(year=year)
    tables.append(data)
    
# Combine tables into a single pd.DataFrame
tables = pd.concat(tables, ignore_index=True)
tables

Unnamed: 0,APPLICATION NUMBER,PROPRIETARY NAME,ESTABLISHED NAME,APPLICANT,REVIEW CLASSIFICATION,APPROVAL DATE,INDICATION,"CY 2020 CDER New Molecular Entity (NME) Drug & Original BLA Calendar Year Approvals\rAs of Decemeber 31, 2020\rThis report reflects the data shown as it is identified in the database.\rSelection Criteria:\rUser Response: Start Date: 1/1/2020 End Date: 12/31/2020\rSort Order: Approval Date",Unnamed: 1_level_0,Unnamed: 2_level_0,Unnamed: 3_level_0,Unnamed: 4_level_0,Unnamed: 5_level_0,Unnamed: 6_level_0,Unnamed: 7_level_0,"CY 2018 CDER New Molecular Entity (NME) Drug & Original BLA Calendar Year Approvals\rAs of December 31, 2018\rThis report reflects the data shown as it is identified in the database.\rSelection Criteria:\rUser Response: Start Date: 1/1/2018 End Date: 12/31/2018\rSort Order: Approval Date"
0,NDA 214373,BRENZAVVY,BEXAGLIFLOZIN,THERACOSBIO LLC,S,1/20/2023,ADJUNCT TO DIET AND EXERCISE TO IMPROVE GLYCEM...,,,,,,,,,
1,NDA 216059,JAYPIRCA,PIRTOBRUTINIB,LOXO ONCOLOGY INC,"P,O",1/27/2023,FOR THE TREATMENT OF ADULT PATIENTS WITH RELAP...,,,,,,,,,
2,NDA 217639,ORSERDU,ELACESTRANT,STEMLINE THERAPEUTICS INC,P,1/27/2023,FOR THE TREATMENT OF POSTMENOPAUSAL WOMEN OR A...,,,,,,,,,
3,NDA 216951,JESDUVROQ,DAPRODUSTAT,GLAXOSMITHKLINE INTELLECTUAL PROPERTY NO 2 LTD...,S,2/1/2023,FOR THE TREATMENT OF ANEMIA DUE TO CHRONIC KID...,,,,,,,,,
4,NDA 216403,FILSPARI,SPARSENTAN,TRAVERE THERAPEUTICS INC,"P,O",2/17/2023,TO REDUCE PROTEINURIA IN ADULTS WITH PRIMARY I...,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
300,BLA 125526/0.0,NUCALA,MEPOLIZUMAB,GLAXOSMITHKLINE LLC,S,11/4/2015,INDICATED FOR THE ADD-ON MAINTENANCE TREATMENT...,,,,,,,,,
301,BLA 761036/0.0,DARZALEX,DARATUMUMAB,"JANSSEN BIOTECH, INC.","P,O",11/16/2015,INDICATED FOR TREATMENT OF PATIENTS WITH MULTI...,,,,,,,,,
302,BLA 125547/0.0,PORTRAZZA,NECITUMUMAB,ELI LILLY AND COMPANY,S,11/24/2015,"IS INDICATED, IN COMBINATION WITH GEMCITABINE ...",,,,,,,,,
303,BLA 761035/0.0,EMPLICITI,ELOTUZUMAB,BRISTOL-MYERS SQUIBB COMPANY,"P,O",11/30/2015,INDICATED IN COMBINATION WITH LENALIDOMIDE AND...,,,,,,,,,
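
In the combined output above, extra columns appear for the 2020 and 2018 reports, which suggests those tables were parsed with a different header than the others. One way to spot this before concatenating is to compare each table's columns against the expected ones. The following is a minimal sketch that uses small stand-in DataFrames rather than real extractions. `EXPECTED` and `split_by_header` are hypothetical names, not part of drug_nme, and the sketch assumes the leading index is not a regular column.

```python
import pandas as pd

# Column names as displayed for the 2023 table above
EXPECTED = [
    "APPLICATION NUMBER", "PROPRIETARY NAME", "ESTABLISHED NAME", "APPLICANT",
    "REVIEW CLASSIFICATION", "APPROVAL DATE", "INDICATION",
]


def split_by_header(tables_by_year):
    """Separate tables with the expected columns from those needing review."""
    good, needs_review = {}, []
    for year, table in tables_by_year.items():
        if list(table.columns) == EXPECTED:
            good[year] = table
        else:
            needs_review.append(year)
    return good, needs_review


# Stand-in data: one well-formed table and one with a misparsed header
ok = pd.DataFrame(
    [["NDA 214373", "BRENZAVVY", "BEXAGLIFLOZIN", "THERACOSBIO LLC",
      "S", "1/20/2023", "ADJUNCT TO DIET AND EXERCISE TO IMPROVE GLYCEM..."]],
    columns=EXPECTED,
)
bad = pd.DataFrame({"CY 2020 report title": ["APPLICATION NUMBER"]})

good, needs_review = split_by_header({"2023": ok, "2020": bad})

# Only tables with a consistent header are combined
combined = pd.concat(list(good.values()), ignore_index=True)
print(needs_review)
```

Years listed in `needs_review` can then be inspected individually. How to repair their headers depends on how each PDF was parsed and is outside the scope of this sketch.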


### Get Compilation Data

The FDA hosts a page with [Compilation Dataset links](https://www.fda.gov/drugs/drug-approvals-and-databases/compilation-cder-new-molecular-entity-nme-drug-and-new-biologic-approvals). The page links to a .csv file that can be downloaded and converted into a pd.DataFrame. FDAScraper already handles this.

If the Compilation Dataset page changes, a different URL can be tried by passing it to the "url" parameter.

As of this writing, the compilation dataset contains **NME and New Biologic Approvals from 1985-2023**.

In [6]:
compilation = scrape.get_compilation()
compilation

Unnamed: 0,Proprietary Name,Active Ingredient/Moiety,Applicant,NDA/BLA,Application Number(1),Application Number(2),Application Number(3),Dosage Form(1),Route of Administration(1),Dosage Form(2),...,Approved Use(s),Review Designation,Orphan Drug Designation,Accelerated Approval,Breakthrough Therapy Designation,Fast Track Designation,Qualified Infectious Disease Product,Issued a Priority Review Voucher,Redeemed a Priority Review Voucher,Notes
0,Lupron,leuprolide acetate,TAP Pharmaceuticals,NDA,19010,,,Injectable,Injection,,...,,Priority,No,,,,,,,
1,Seldane,terfenadine,Merrell-Dow,NDA,18949,,,Tablet,Oral,,...,,Priority,No,,,,,,,
2,Ridaura,auranofin,SmithKline & French,NDA,18689,,,Capsule,Oral,,...,,Priority,No,,,,,,,
3,Marinol,dronabinol,Unimed,NDA,18651,,,Capsule,Oral,,...,,Priority,No,,,,,,,
4,Fortaz,ceftazidime,Glaxo,NDA,50578,,,Injectable,Injection,,...,,Priority,No,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1286,Ryzneuta,efbemalenograstim alfa-vuxw,Evive Biotechnology Singapore PTE. Ltd.,BLA,761134,,,Injectable,Injection,,...,RYZNEUTA is indicated to decrease the incidenc...,Standard,No,No,No,No,,No,No,
1287,Ogsiveo,nirogacestat,"SpringWorks Therapeutics, Inc.",NDA,217677,,,Tablet,Oral,,...,OGSIVEO is indicated for adult patients with p...,Priority,Yes,No,Yes,Yes,No,No,No,
1288,Fabhalta,iptacopan,Novartis Pharmaceuticals Corporation,NDA,218276,,,Capsule,Oral,,...,FABHALTA is indicated for the treatment of adu...,Priority (used priority review voucher),Yes,No,Yes,No,No,No,Yes,
1289,Filsuvez,birch triterpenes,Amryt Pharmaceuticals DAC,NDA,215064,,,Gel,Topical,,...,FILSUVEZ topical gel is indicated for the trea...,Priority,Yes,No,No,Yes,No,RPD,No,
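
Once loaded, the compilation can be summarized with ordinary pandas operations. The following is a minimal sketch on a stand-in DataFrame built from a few of the rows shown above. It assumes the column names `NDA/BLA` and `Review Designation` appear exactly as displayed in the output.

```python
import pandas as pd

# Stand-in for the real compilation, using values from the rows above
sample = pd.DataFrame(
    {
        "Proprietary Name": ["Lupron", "Ryzneuta", "Ogsiveo", "Filsuvez"],
        "NDA/BLA": ["NDA", "BLA", "NDA", "NDA"],
        "Review Designation": ["Priority", "Standard", "Priority", "Priority"],
    }
)

# Count approvals per application type
type_counts = sample["NDA/BLA"].value_counts()

# Keep only biologics (BLA) for separate analysis
biologics = sample[sample["NDA/BLA"] == "BLA"]

print(type_counts.to_dict())
print(biologics["Proprietary Name"].tolist())
```

The same two lines should apply to the full `compilation` DataFrame, provided the column names match.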


### Get Approvals for the Current Year

Finally, the FDA curates [Novel Drug Approvals](https://www.fda.gov/drugs/development-approval-process-drugs/novel-drug-approvals-fda) for a given year. This page can be used to obtain the drug approvals for the current year.

A downside of this page is that it does not designate a drug as a New Drug Application (NDA) or a Biologics License Application (BLA), which would indicate a small molecule or a biologic, respectively. As a rule of thumb, active ingredient names ending in "mab" or "cept" can indicate biologics. However, each label should still be confirmed to ensure accuracy.

In [7]:
latest = scrape.get_current_year()
latest

Unnamed: 0,Drug Name,Active Ingredient,Approval Date,FDA-approved use on approval date*
0,Kisunla,donanemab-azbt,7/2/2024,To treat Alzheimer’s disease
1,Ohtuvayre,ensifentrine,6/26/2024,To treat chronic obstructive pulmonary disease
2,Piasky,crovalimab-akkz,6/20/2024,To treat paroxysmal nocturnal hemoglobinuria
3,Sofdra,sofpironium,6/18/2024,To treat primary axillary hyperhidrosis
4,Iqirvo,elafibranor,6/10/2024,To treat primary biliary cholangitis in combin...
5,Rytelo,imetelstat,6/6/2024,To treat low- to intermediate-1 risk myelodysp...
6,Imdelltra,tarlatamab-dlle,5/16/2024,To treat extensive stage small cell lung cancer
7,Xolremdi,mavorixafor,4/26/2024,"To treat WHIM syndrome (warts, hypogammaglobul..."
8,Ojemda,tovorafenib,4/23/2024,To treat relapsed or refractory pediatric low-...
9,Anktiva,nogapendekin alfa inbakicept-pmln,4/22/2024,To treat bladder cancer
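
The suffix rule of thumb described above can be sketched as a small function. This is only a heuristic under the stated assumptions: it checks for the "mab" and "cept" endings mentioned in the text, ignores a trailing hyphenated suffix such as "-azbt", and will miss biologics that follow other naming patterns. `looks_like_biologic` is a hypothetical helper, not part of drug_nme.

```python
BIOLOGIC_STEMS = ("mab", "cept")


def looks_like_biologic(active_ingredient):
    """Heuristic: True if any word in the name ends with a biologic stem.

    A trailing hyphenated suffix such as "-azbt" is dropped before checking.
    """
    for word in active_ingredient.lower().split():
        core = word.split("-")[0]
        if core.endswith(BIOLOGIC_STEMS):
            return True
    return False


# Names taken from the output above
for name in ["donanemab-azbt", "ensifentrine", "nogapendekin alfa inbakicept-pmln"]:
    print(name, looks_like_biologic(name))
```

Assuming the column is named `Active Ingredient` as displayed above, the function could be applied to the whole table with `latest["Active Ingredient"].apply(looks_like_biologic)`. The result is a starting point for labeling and should still be confirmed against the application type.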
