## Single page Tabular Scrape

- <a href="https://apps.health.ny.gov/pubdoh/professionals/doctors/conduct/factions/AllRecordsAction.action">On this site</a>, scrape all the doctors' info on page 292.
- Export the content into a CSV file called ```page_292.csv```.

In [1]:
## create coding cells as needed

## import libraries
import pandas as pd
from random import randrange
import time

In [2]:
## TARGET page 292
pageurl = "https://apps.health.ny.gov/pubdoh/professionals/doctors/conduct/factions/AllRecordsAction.action?d-49653-p=292"


In [3]:
## scrape url

df_list = pd.read_html(pageurl)
df_list

[                                  Physician Last Name  \
 0                                               Ataee   
 1                                              Feiner   
 2                                              Taylor   
 3                                             Donshik   
 4                                             Robbins   
 5                                             Freeman   
 6                                           Markowitz   
 7                                              Puskas   
 8                                               Lopez   
 9                                              Hirsch   
 10                                             Stobie   
 11                                                Leo   
 12                                                 Lu   
 13                                              James   
 14                                             O'Hair   
 15                                           Berselli   
 16           

In [4]:
## determine which list item holds the target table
## and slice off the correct one.
## Compare df_list[0] and df_list[1]:
## df_list[0] is close, but its rows at index 20 and 21 hold non-essential info
df = df_list[1]
df

Unnamed: 0,Physician Last Name,Physician First Name,Physician Middle Name,License Number,License Type,Effective Date,Date Updated,Year of Birth
0,Ataee,Shahab,,232919,MD,09/11/2003,10/07/2003,1964
1,Feiner,Marc,Alan,147174,MD,10/10/2003,10/06/2003,1952
2,Taylor,David,Howarth,154079,MD,10/10/2003,10/06/2003,1953
3,Donshik,Jon,David,202926,MD,10/10/2003,10/06/2003,1968
4,Robbins,Richard,Gregg,179378,MD,10/10/2003,10/06/2003,1963
5,Freeman,Douglas,,220267,MD,10/10/2003,10/06/2003,1968
6,Markowitz,Howard,,188933,MD,10/10/2003,10/06/2003,1961
7,Puskas,John,Michael,120273,MD,10/06/2003,10/01/2003,1945
8,Lopez,Jose,A,192852,MD,10/06/2003,10/01/2003,1962
9,Hirsch,Anthony,,98440,MD,10/01/2003,09/24/2003,1940
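
Hard-coding `df_list[1]` works for this page, but the position of the target table could shift if the page layout changes. The sketch below shows one hedged alternative: pick the shortest table that contains an expected column. The helper name `pick_smallest_table` and the junk rows are hypothetical, and the heuristic assumes the unwanted table is simply the target table plus extra rows. The DataFrames are synthetic stand-ins for `pd.read_html` output, not scraped data.

```python
import pandas as pd


def pick_smallest_table(tables, required_column):
    """Return the shortest table that contains `required_column`."""
    candidates = [t for t in tables if required_column in t.columns]
    if not candidates:
        raise ValueError(f"No table has a column named {required_column!r}")
    return min(candidates, key=len)


# Synthetic stand-ins: a clean table, and a wrapper table that repeats
# the same rows plus two made-up junk rows at the end.
clean = pd.DataFrame({"Physician Last Name": ["Ataee", "Feiner"]})
junk = pd.DataFrame({"Physician Last Name": ["junk row A", "junk row B"]})
wrapper = pd.concat([clean, junk], ignore_index=True)

picked = pick_smallest_table([wrapper, clean], "Physician Last Name")
print(picked)
```

With real output you would call `pick_smallest_table(df_list, "Physician Last Name")` and still eyeball the result, since the heuristic is only a guess about the page structure.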


In [None]:
## export to csv
df.to_csv("page_292.csv", encoding = "UTF-8", index = False)

## Multipage Tabular Scrape

- <a href="https://apps.health.ny.gov/pubdoh/professionals/doctors/conduct/factions/AllRecordsAction.action">On this site</a>, scrape all doctors whose last names begin with "P".
- Export the content into a CSV file called ```md_P.csv```.


In [5]:
## create coding cells as needed

## figure out the URL pattern to scrape and
## test it on a single page
url = "https://apps.health.ny.gov/pubdoh/professionals/doctors/conduct/factions/AlphabetSearchAction.action?alpbhabetSearch=P&d-49653-p=1"

In [6]:
## scrape table from page using Pandas
df_list = pd.read_html(url)
df_list

[                                  Physician Last Name  \
 0                                                Paal   
 1                                                Pace   
 2                                                Pace   
 3                                             Pacetti   
 4                                              Pachas   
 5                                             Pacheco   
 6                                               Pacik   
 7                                               Pacis   
 8                                                Pack   
 9                                        Packianathan   
 10                                       Packianathan   
 11                                              Padeh   
 12                                              Padeh   
 13                                            Padilla   
 14                                            Padilla   
 15                                            Padilla   
 16           

## See if we can capture a single page easily

In [7]:
## again, we need to target df_list[1], which leaves out the non-essential rows found in df_list[0]
df_list[1]

Unnamed: 0,Physician Last Name,Physician First Name,Physician Middle Name,License Number,License Type,Effective Date,Date Updated,Year of Birth
0,Paal,Adam,,,MD,10/30/2000,,1961.0
1,Pace,Enrico,,166026.0,MD,08/21/2001,,1956.0
2,Pace,Leonard,,172870.0,MD,01/15/2002,01/22/2002,1952.0
3,Pacetti,Stephen,J,175021.0,MD,04/14/2016,04/07/2016,1957.0
4,Pachas,Hector,M,95535.0,MD,02/11/1993,,
5,Pacheco,Denny,J.,258600.0,DO,08/27/2020,08/26/2020,1962.0
6,Pacik,Peter,,96944.0,MD,11/15/2012,11/09/2012,1940.0
7,Pacis,Andresito,B.,125213.0,MD,10/22/2021,10/22/2021,1938.0
8,Pack,A,Stephen,183669.0,MD,04/28/2000,07/19/2001,1956.0
9,Packianathan,Emmanuel,,203833.0,MD,07/31/2008,07/25/2008,1945.0
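
Note that `Year of Birth` displays as `1961.0` rather than `1961`. pandas stores a numeric column as floats when it contains missing values. A minimal sketch of one way to tidy this, assuming you want whole numbers with gaps preserved, is to cast to the nullable `Int64` dtype. The two rows below are copied from the output above purely as sample data.

```python
import pandas as pd

# Two rows taken from the table above; None stands for the blank cell.
sample = pd.DataFrame(
    {
        "Physician Last Name": ["Paal", "Pachas"],
        "Year of Birth": [1961.0, None],
    }
)

# "Int64" (capital I) is the nullable integer dtype: it keeps missing
# values as <NA> instead of forcing the whole column to float.
sample["Year of Birth"] = sample["Year of Birth"].astype("Int64")
print(sample)
```

This cast is only safe for columns whose values are whole numbers. `License Number` is a different case, because it can carry leading zeros and is better treated as text.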


## Now get the scrape to iterate through all pages

In [10]:
## base URL; the page number is appended with an f-string inside the loop
base_url = "https://apps.health.ny.gov/pubdoh/professionals/doctors/conduct/factions/AlphabetSearchAction.action?alpbhabetSearch=P&d-49653-p="


In [11]:
## Combined timed URL navigation with table scrape
counter = 1 ## counter to track progress
total_pages = 24 ## number of pages we want to scrape
df_all = [] ## list that will hold all the dataframes that are produced
broken_links = [] ## to hold our broken links in case we run into any
## NOTE: range(1, total_pages) stops at total_pages - 1, so this loop visits pages 1 to 23;
## use range(1, total_pages + 1) to include page 24
for url_number in range(1,total_pages):
    print(f"Scraping link {counter} of {total_pages}")
    counter+=1 ## increment counter
    link = f"{base_url}{url_number}"

    try:
        df_list = pd.read_html(link) ## turn html table into a df using pandas
        df_all.append(df_list[1]) ## append table in index position 1 to a list

    except Exception: ## catch scraping errors without also swallowing KeyboardInterrupt
        print(f"Something is wrong with {link}!")
        broken_links.append(link)
        
    finally:
        ## let's not forget to snooze
        snooze = randrange(5,7)
        print(f"snoozing for {snooze} seconds before scraping next link.")
        time.sleep(snooze)

    print("Done scraping")

Scraping link 1 of 24
snoozing for 5 seconds before scraping next link.
Done scraping
Scraping link 2 of 24
snoozing for 6 seconds before scraping next link.
Done scraping
Scraping link 3 of 24
snoozing for 6 seconds before scraping next link.
Done scraping
Scraping link 4 of 24
snoozing for 5 seconds before scraping next link.
Done scraping
Scraping link 5 of 24
snoozing for 6 seconds before scraping next link.
Done scraping
Scraping link 6 of 24
snoozing for 6 seconds before scraping next link.
Done scraping
Scraping link 7 of 24
snoozing for 5 seconds before scraping next link.
Done scraping
Scraping link 8 of 24
snoozing for 6 seconds before scraping next link.
Done scraping
Scraping link 9 of 24
snoozing for 6 seconds before scraping next link.
Done scraping
Scraping link 10 of 24
snoozing for 5 seconds before scraping next link.
Done scraping
Scraping link 11 of 24
snoozing for 6 seconds before scraping next link.
Done scraping
Scraping link 12 of 24
snoozing for 5 seconds before

In [12]:
## len of list

len(df_all)

23
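
The list holds 23 DataFrames rather than 24 because `range(1, total_pages)` excludes its upper bound. A small sketch of building the page links with an inclusive range is shown below. The helper name `page_links` and the shortened placeholder URL are assumptions used only for illustration, not part of the original notebook.

```python
def page_links(base_url, total_pages):
    """Build one URL per page, from page 1 through total_pages inclusive."""
    return [f"{base_url}{page}" for page in range(1, total_pages + 1)]


# Placeholder base URL for the illustration; swap in the real base_url.
links = page_links("https://example.org/search?letter=P&page=", 24)
print(len(links))
print(links[0])
print(links[-1])
```

Building the list first also makes it easy to check the first and last link before any request is sent.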

In [13]:
## convert to a single df rather than a list of df
df = pd.concat(df_all, ignore_index = True)
df

Unnamed: 0,Physician Last Name,Physician First Name,Physician Middle Name,License Number,License Type,Effective Date,Date Updated,Year of Birth
0,Paal,Adam,,,MD,10/30/2000,,1961.0
1,Pace,Enrico,,166026,MD,08/21/2001,,1956.0
2,Pace,Leonard,,172870,MD,01/15/2002,01/22/2002,1952.0
3,Pacetti,Stephen,J,175021,MD,04/14/2016,04/07/2016,1957.0
4,Pachas,Hector,M,095535,MD,02/11/1993,,
...,...,...,...,...,...,...,...,...
455,Psaila,Justin,Sciberras,83832,MD,07/26/2005,07/20/2005,1928.0
456,Pua,Florence,,197443,MD,09/26/2024,09/19/2024,1950.0
457,Puca,Christopher,,147733,MD,11/25/2009,11/18/2009,1952.0
458,Puccio,Steven,T.,229825,DO,07/01/2021,07/01/2021,1964.0
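
Before exporting, it can help to run a couple of quick sanity checks on the combined table. The sketch below assumes two checks are of interest: that every last name starts with "P", and that no row appears twice (which could happen if a page were scraped twice). The sample rows are copied from the output above; on the real data you would run the same expressions against `df`.

```python
import pandas as pd

# A few rows copied from the combined output, used as sample data.
sample = pd.DataFrame(
    {
        "Physician Last Name": ["Paal", "Pace", "Pace", "Puccio"],
        "License Number": [None, "166026", "172870", "229825"],
    }
)

# True only if every last name begins with the letter we searched for.
all_start_with_p = sample["Physician Last Name"].str.startswith("P").all()

# Number of rows that are exact repeats of an earlier row.
duplicate_rows = sample.duplicated().sum()

print(all_start_with_p, duplicate_rows)
```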


In [14]:
## export to csv
df.to_csv("md_P.csv", encoding = "UTF-8", index = False)
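
One thing to watch when reading the exported CSV back in: a license number such as `095535` loses its leading zero if pandas infers the column as an integer. The sketch below is a minimal illustration using an in-memory buffer instead of a file on disk; passing `dtype={"License Number": str}` to `pd.read_csv` keeps the value as text.

```python
import io

import pandas as pd

# One row from the combined output, with the license number stored as text.
sample = pd.DataFrame(
    {"Physician Last Name": ["Pachas"], "License Number": ["095535"]}
)

# Write to an in-memory buffer so the example does not touch the disk.
buffer = io.StringIO()
sample.to_csv(buffer, index=False)

# Default read: the column is inferred as an integer, dropping the zero.
buffer.seek(0)
as_default = pd.read_csv(buffer)

# Explicit dtype: the column stays a string and the zero survives.
buffer.seek(0)
as_text = pd.read_csv(buffer, dtype={"License Number": str})

print(as_default["License Number"].iloc[0])
print(as_text["License Number"].iloc[0])
```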