## Multipage Tabular Scrape

- <a href="https://apps.health.ny.gov/pubdoh/professionals/doctors/conduct/factions/AllRecordsAction.action">On this site</a>, scrape all doctors whose last name begins with "Z".
- Export the content into a CSV file called ```md_Z.csv```.


### Our strategic approach:

There are at least two approaches.

#### Approach 1. 

Scrape ALL the pages with ALL the names and then use pandas to filter for names that begin with Z. This works but forces us to hit the site harder and for longer.
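As a rough sketch of the filtering step in Approach 1, assuming the full scrape has already been combined into one dataframe. The column names `last_name` and `license` and the sample rows are made up for illustration; the real table's columns may differ.

```python
import pandas as pd

# Hypothetical data standing in for the full scrape; real column names may differ.
df_everyone = pd.DataFrame({
    "last_name": ["Adams", "Zamora", "Lee", "zhou", "Zuckerman"],
    "license": [101, 102, 103, 104, 105],
})

# Build a True/False mask: does the last name start with "Z"?
# Upper-casing first means a lowercase entry like "zhou" still matches.
starts_with_z = df_everyone["last_name"].str.upper().str.startswith("Z")

# Keep only the matching rows and renumber the index from 0.
df_z = df_everyone[starts_with_z].reset_index(drop=True)
print(df_z)
```

The filter itself is cheap; the cost of this approach is all in the extra requests needed to collect every letter first.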

#### Approach 2. 

We notice that we can also filter the results by a specific letter. When we click on Z and examine the URL, we see:

```https://apps.health.ny.gov/pubdoh/professionals/doctors/conduct/factions/AlphabetSearchAction.action?alpbhabetSearch=Z```

We click on 2 in the pagination for this result and we get:

```https://apps.health.ny.gov/pubdoh/professionals/doctors/conduct/factions/AlphabetSearchAction.action?alpbhabetSearch=Z&d-49653-p=2```


Note the very last few characters. We ask: can we replace the 2 with a 1?

```https://apps.health.ny.gov/pubdoh/professionals/doctors/conduct/factions/AlphabetSearchAction.action?alpbhabetSearch=Z&d-49653-p=1```

And indeed it works.
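One way to see exactly what changes between pages is to split the URL into its query parameters. This is only an inspection aid using Python's standard library; the variable names are our own.

```python
from urllib.parse import urlsplit, parse_qs

page_2_url = (
    "https://apps.health.ny.gov/pubdoh/professionals/doctors/conduct/factions/"
    "AlphabetSearchAction.action?alpbhabetSearch=Z&d-49653-p=2"
)

# Pull out everything after the "?" and turn it into a dictionary.
query = parse_qs(urlsplit(page_2_url).query)
print(query)  # {'alpbhabetSearch': ['Z'], 'd-49653-p': ['2']}
```

The letter lives in `alpbhabetSearch` (spelled that way by the site) and the page number lives in `d-49653-p`, so only that last value needs to change as we move through the pages.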

#### We realize we just have to scrape this link, replacing the page number with a placeholder!
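A minimal sketch of the placeholder idea, with no requests made yet. The helper name `build_page_urls` is hypothetical, and the page count of 3 is just for the demo.

```python
url_template = (
    "https://apps.health.ny.gov/pubdoh/professionals/doctors/conduct/factions/"
    "AlphabetSearchAction.action?alpbhabetSearch=Z&d-49653-p={}"
)

def build_page_urls(template, total_pages):
    """Return one URL per page, numbered 1 through total_pages inclusive."""
    # range() stops BEFORE its second argument, hence the + 1.
    return [template.format(page_number) for page_number in range(1, total_pages + 1)]

page_urls = build_page_urls(url_template, 3)
for page_url in page_urls:
    print(page_url)
```

Printing the links first is a cheap way to confirm the first and last page numbers before hitting the site.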


In [None]:
## IMPORT LIBRARIES

import requests
from bs4 import BeautifulSoup 
import pandas as pd
from random import randrange
import time

In [None]:
## test out a single page
url = "https://apps.health.ny.gov/pubdoh/professionals/doctors/conduct/factions/AlphabetSearchAction.action?alpbhabetSearch=Z&d-49653-p=1"



In [None]:
## request url and store in page
## check status code
response = requests.get(url)
response.status_code

In [None]:
## scrape table from page using Pandas
df_list = pd.read_html(response.text)
df_list

In [None]:
len(df_list)

In [None]:
## df_list[0] is close but notice index 20 and 21 have some non-essential info
df_list[0]

In [None]:
## we need to target df_list[1] which does not contain that info
df_list[1]

## Now let's scrape by iterating through all pages

In [None]:
## URL with placeholder 
url = "https://apps.health.ny.gov/pubdoh/professionals/doctors/conduct/factions/AlphabetSearchAction.action?alpbhabetSearch=Z&d-49653-p={}"

In [None]:
## Combined url timed nav with table scrape

total_pages = 6 ## number of pages we want to scrape
df_all = [] ## list that will hold all the dataframes that are produced
for url_number in range(1, total_pages + 1): ## + 1 because range() excludes its stop value
    link = url.format(url_number)
    page = requests.get(link)
    if page.status_code == 200: ## check THIS page's response, not the earlier test response
        print(f"got it...scraping page...{link}")
        df_list = pd.read_html(page.text) ## turn this page's html tables into dfs using pandas
        df_all.append(df_list[1]) ## append table in index position 1 to a list
        ## let's not forget to snooze
        snooze = randrange(5,7)
        print(f"snoozing for {snooze} seconds before scraping next link.")
        time.sleep(snooze)

    else:
        print(f"oh no! {link} returned:", page.status_code)
        
df_all ## what does our list look like

In [None]:
## What does each dataframe hold in our list of dataframes
df_all[1]

## Combine dataframes and export
#### OPTION 1 - via a function

We use the function we wrote last week that takes a list of dataframes, concats them and exports to a single csv:

In [None]:
## FUNCTION to download individual dataframes in a list as a single csv
def process_lists(list_name, filename):
    df = pd.concat(list_name, ignore_index = True)
    df.to_csv(filename, encoding = "UTF-8", index = False)
    print(f"{filename} is in your current folder")
    return df

In [None]:
## call our function
process_lists(df_all, "md_Z.csv")

#### OPTION 2 - manual combining and export

In [None]:
df = pd.concat(df_all, ignore_index = True)
df.to_csv("md_Z_2.csv", encoding = "UTF-8", index = False)
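
Both options rely on `pd.concat` with `ignore_index = True`. A small sketch with made-up dataframes (the `name` column and its values are illustrative only) shows why that argument matters:

```python
import pandas as pd

# Two toy "pages" of results, each with its own 0-based index.
page_1 = pd.DataFrame({"name": ["Zamora", "Zane"]})
page_2 = pd.DataFrame({"name": ["Zhou", "Zuckerman"]})

# Without ignore_index, each page keeps its original row labels.
with_old_index = pd.concat([page_1, page_2])
print(with_old_index.index.tolist())  # [0, 1, 0, 1]

# With ignore_index, rows are renumbered continuously.
with_new_index = pd.concat([page_1, page_2], ignore_index=True)
print(with_new_index.index.tolist())  # [0, 1, 2, 3]
```

Since we export with `index = False`, the index never reaches the CSV, but a continuous index makes the combined dataframe easier to inspect in the notebook.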

## 2. Conversion function


Write a function that takes string values like ```$12.24267```, ```10,201``` and ```$12,501``` and converts them into floating point numbers rounded to two decimal places, like ```12.24```, ```10201.0``` and ```12501.0```.

Test it out on those 3 string values.


In [None]:
## write function

def string2float(a_string):
    ## remove $ sign and comma
    a_string = a_string.replace("$", "").replace(",", "") 
    ## return what is converted to float and rounded
    return round(float(a_string), 2)

In [None]:
## test it out on the numbers provided

string2float("$12.24267")

In [None]:
string2float("10,201")

In [None]:
string2float("$12,501")
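
If the same cleanup were needed on a whole dataframe column rather than one string at a time, a possible pandas version is sketched below. This is an alternative to the function above, not a replacement for it, and the Series `prices` is made-up sample data.

```python
import pandas as pd

# Hypothetical column of messy currency strings.
prices = pd.Series(["$12.24267", "10,201", "$12,501"])

# Strip every "$" and "," with one regex, then convert and round the whole column.
cleaned = (
    prices
    .str.replace(r"[$,]", "", regex=True)
    .astype(float)
    .round(2)
)
print(cleaned.tolist())
```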