## Multipage Tabular Scrape

- <a href="https://apps.health.ny.gov/pubdoh/professionals/doctors/conduct/factions/AllRecordsAction.action">On this site</a>, scrape all doctors whose last name begins with "Z".
- Export the content into a CSV file called ```md_Z.csv```.


### Our strategic approach:

There are at least two approaches.

#### Approach 1. 

Scrape ALL the pages with ALL the names and then use pandas to filter for names that begin with Z. This works but forces us to hit the site harder and for longer.

#### Approach 2. 

We notice that we can scrape also filter the results by specific letter. When we click on Z and examine the url:

```https://apps.health.ny.gov/pubdoh/professionals/doctors/conduct/factions/AlphabetSearchAction.action?alpbhabetSearch=Z```

We click on 2 in the pagination for this result and we get:

```https://apps.health.ny.gov/pubdoh/professionals/doctors/conduct/factions/AlphabetSearchAction.action?alpbhabetSearch=Z&d-49653-p=2```


Note that very last few characters. We ask, can we replace the 2 with a 1?

```https://apps.health.ny.gov/pubdoh/professionals/doctors/conduct/factions/AlphabetSearchAction.action?alpbhabetSearch=Z&d-49653-p=1```

And indeed it works.

#### We realize we just have to scrape this link but we replace the page number with a placeholder!


In [1]:
## IMPORT LIBRARIES

import requests
from bs4 import BeautifulSoup 
import pandas as pd
from random import randrange
import time

In [2]:
## test out a single 
url = "https://apps.health.ny.gov/pubdoh/professionals/doctors/conduct/factions/AlphabetSearchAction.action?alpbhabetSearch=Z&d-49653-p=1"



In [3]:
## request url and store in page
## check status code
response = requests.get(url)
response.status_code

200

In [4]:
## scrape table from page using Pandas
df_list = pd.read_html(response.text)
df_list

[                                  Physician Last Name  \
 0                                             Zaccheo   
 1                                           Zachariah   
 2                                              Zachel   
 3                                              Zackin   
 4                                              Zackin   
 5                                              Zackin   
 6                                               Zadeh   
 7                                               Zafar   
 8                                               Zafar   
 9                                                Zahl   
 10                                             Zahler   
 11                                              Zaino   
 12                                                Zak   
 13                                               Zaki   
 14                                              Zales   
 15                                           Zalmanov   
 16           

In [5]:
len(df_list)

4

In [6]:
## df_list[0] is close but notice index 20 and 21 have some non-essential info
df_list[0]

Unnamed: 0,Physician Last Name,Physician First Name,Physician Middle Name,License Number,License Type,Effective Date,Date Updated,Year of Birth
0,Zaccheo,Jerald,D,134842.0,MD,12/20/2001,12/23/2001,1946.0
1,Zachariah,Abraham,,137458.0,MD,09/15/2004,09/08/2004,1950.0
2,Zachel,Gretchen,,20699.0,PA,10/13/2017,10/06/2017,1952.0
3,Zackin,Henry,J,101457.0,MD,03/28/2002,03/09/2005,1941.0
4,Zackin,Henry,J,101457.0,MD,03/16/2005,03/09/2005,1941.0
5,Zackin,Henry,J,101457.0,MD,02/21/1990,03/09/2005,1941.0
6,Zadeh,Mehran,,3399.0,PA,07/21/2010,09/06/2013,1961.0
7,Zafar,Kamal,,113.0,SA,08/04/2016,08/08/2016,1968.0
8,Zafar,Syeda,,158264.0,MD,10/16/2007,11/06/2007,1936.0
9,Zahl,Kenneth,,151413.0,MD,04/18/2008,04/11/2008,1956.0


In [7]:
## we need to target df_list[1] which does not contain that info
df_list[1]

Unnamed: 0,Physician Last Name,Physician First Name,Physician Middle Name,License Number,License Type,Effective Date,Date Updated,Year of Birth
0,Zaccheo,Jerald,D,134842,MD,12/20/2001,12/23/2001,1946.0
1,Zachariah,Abraham,,137458,MD,09/15/2004,09/08/2004,1950.0
2,Zachel,Gretchen,,20699,PA,10/13/2017,10/06/2017,1952.0
3,Zackin,Henry,J,101457,MD,03/28/2002,03/09/2005,1941.0
4,Zackin,Henry,J,101457,MD,03/16/2005,03/09/2005,1941.0
5,Zackin,Henry,J,101457,MD,02/21/1990,03/09/2005,1941.0
6,Zadeh,Mehran,,3399,PA,07/21/2010,09/06/2013,1961.0
7,Zafar,Kamal,,113,SA,08/04/2016,08/08/2016,1968.0
8,Zafar,Syeda,,158264,MD,10/16/2007,11/06/2007,1936.0
9,Zahl,Kenneth,,151413,MD,04/18/2008,04/11/2008,1956.0


## Now gets scrape to iterate through all pages

In [8]:
## URL with placeholder 
url = "https://apps.health.ny.gov/pubdoh/professionals/doctors/conduct/factions/AlphabetSearchAction.action?alpbhabetSearch=Z&d-49653-p={}"

In [9]:
## Combined url timed nav with table scrape

total_pages = 6 ## number of pages we want to scrape
df_all = [] ## list that will hold all the dataframes that are produced
for url_number in range(1,total_pages):
    link = url.format(url_number)
    page = requests.get(link)
    if response.status_code == 200:
        print(f"got it...scraping page...{link}")
        df_list = pd.read_html(response.text) ## turn html table into a df using pandas
        df_all.append(df_list[1]) ## append table in index position 1 to a list
        ## let's not forget to snooze
        snooze = randrange(5,7)
        print(f"snoozing for {snooze} seconds before scraping next link.")
        time.sleep(snooze)

    else:
        print(f"oh no! {link} returned:", page.status_code)
        
df_all ## what does our list look like

got it...scraping page...https://apps.health.ny.gov/pubdoh/professionals/doctors/conduct/factions/AlphabetSearchAction.action?alpbhabetSearch=Z&d-49653-p=1
snoozing for 6 seconds before scraping next link.
got it...scraping page...https://apps.health.ny.gov/pubdoh/professionals/doctors/conduct/factions/AlphabetSearchAction.action?alpbhabetSearch=Z&d-49653-p=2
snoozing for 5 seconds before scraping next link.
got it...scraping page...https://apps.health.ny.gov/pubdoh/professionals/doctors/conduct/factions/AlphabetSearchAction.action?alpbhabetSearch=Z&d-49653-p=3
snoozing for 5 seconds before scraping next link.
got it...scraping page...https://apps.health.ny.gov/pubdoh/professionals/doctors/conduct/factions/AlphabetSearchAction.action?alpbhabetSearch=Z&d-49653-p=4
snoozing for 6 seconds before scraping next link.
got it...scraping page...https://apps.health.ny.gov/pubdoh/professionals/doctors/conduct/factions/AlphabetSearchAction.action?alpbhabetSearch=Z&d-49653-p=5
snoozing for 6 secon

[   Physician Last Name Physician First Name Physician Middle Name  \
 0              Zaccheo               Jerald                     D   
 1            Zachariah              Abraham                   NaN   
 2               Zachel             Gretchen                   NaN   
 3               Zackin                Henry                     J   
 4               Zackin                Henry                     J   
 5               Zackin                Henry                     J   
 6                Zadeh               Mehran                   NaN   
 7                Zafar                Kamal                   NaN   
 8                Zafar                Syeda                   NaN   
 9                 Zahl              Kenneth                   NaN   
 10              Zahler               Gideon                   NaN   
 11               Zaino               Edward                   NaN   
 12                 Zak                 John                   NaN   
 13                Z

In [10]:
## What does each dataframe hold in our list of dataframes
df_all[1]

Unnamed: 0,Physician Last Name,Physician First Name,Physician Middle Name,License Number,License Type,Effective Date,Date Updated,Year of Birth
0,Zaccheo,Jerald,D,134842,MD,12/20/2001,12/23/2001,1946.0
1,Zachariah,Abraham,,137458,MD,09/15/2004,09/08/2004,1950.0
2,Zachel,Gretchen,,20699,PA,10/13/2017,10/06/2017,1952.0
3,Zackin,Henry,J,101457,MD,03/28/2002,03/09/2005,1941.0
4,Zackin,Henry,J,101457,MD,03/16/2005,03/09/2005,1941.0
5,Zackin,Henry,J,101457,MD,02/21/1990,03/09/2005,1941.0
6,Zadeh,Mehran,,3399,PA,07/21/2010,09/06/2013,1961.0
7,Zafar,Kamal,,113,SA,08/04/2016,08/08/2016,1968.0
8,Zafar,Syeda,,158264,MD,10/16/2007,11/06/2007,1936.0
9,Zahl,Kenneth,,151413,MD,04/18/2008,04/11/2008,1956.0


## Combine dataframes and export
##### OPTION 1 - via a function

We use the function we wrote last week that takes a list of dataframes, concats them and exports to a single csv:

In [11]:
## FUNCTION to download individual dataframes in a list as a single csv
def process_lists(list_name, filename):
    df = pd.concat(list_name, ignore_index = True)
    df.to_csv(filename, encoding = "UTF-8", index = False)
    print(f"{filename} is in your current folder")
    return df

In [12]:
## call our function
process_lists(df_all, "md_Z.csv")

md_Z.csv is in your current folder


Unnamed: 0,Physician Last Name,Physician First Name,Physician Middle Name,License Number,License Type,Effective Date,Date Updated,Year of Birth
0,Zaccheo,Jerald,D,134842,MD,12/20/2001,12/23/2001,1946.0
1,Zachariah,Abraham,,137458,MD,09/15/2004,09/08/2004,1950.0
2,Zachel,Gretchen,,20699,PA,10/13/2017,10/06/2017,1952.0
3,Zackin,Henry,J,101457,MD,03/28/2002,03/09/2005,1941.0
4,Zackin,Henry,J,101457,MD,03/16/2005,03/09/2005,1941.0
...,...,...,...,...,...,...,...,...
95,Zalmanov,Mikhail,Isaakovich,158429,MD,11/08/2005,11/02/2005,1946.0
96,Zalmanov,Mikhail,I,158429,MD,03/06/2013,02/27/2013,1946.0
97,Zaman,Fasih,Q,114131,MD,11/24/1994,,
98,Zaman,Shah,Mohammad Maniruz,141193,MD,09/27/2021,10/29/2021,1945.0


#### OPTION 2 - manual combining and export

In [13]:
df = pd.concat(df_all, ignore_index = True)
df.to_csv("md_Z_2.csv", encoding = "UTF-8", index = False)

## 2. Conversion function


Write a function that takes string values like ```$12.24267```, ```10,201``` and ```$12,501``` and converts them into floating point numbers like ```12.24```, ```10201.0``` and ```12501.0```

Test it out on those 3 string values.


In [14]:
## write function

def string2float(a_string):
    ## remove $ sign and comma
    a_string = a_string.replace("$", "").replace(",", "") 
    ## return what is converted to float and rounded
    return round(float(a_string), 2)

In [15]:
## test it out on the numbers provided

string2float("$12.24267")

12.24

In [16]:
string2float("10,201")

10201.0

In [17]:
string2float("$12,501")

12501.0