# Data Scraping

Data is often scraped from websites, where it frequently sits inside a `<table>` element on the page.

The data here will be scraped from [this](https://www.transtats.bts.gov/Data_Elements.aspx?Data=2) website.

## Data Wrangling Procedure
The procedure for the above website in particular:

- First, build a list of required values.
    - Build a list of carrier values, for example by inspecting the website.
    - Build a list of airport values, for example by writing a script that extracts them.
- Then, make HTTP requests to download all the data.
- Finally, parse the downloaded data files.

The library used for scraping this website will be [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/).
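
As a rough illustration of pulling rows out of a `<table>` with Beautiful Soup, here is a minimal sketch. The HTML fragment, the table id `DataGrid`, the helper name `table_rows` and the cell values are all placeholders invented for this example; they are not taken from the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment standing in for a real page; all values are placeholders.
SAMPLE_HTML = """
<table id="DataGrid">
  <tr><th>Year</th><th>Month</th><th>Domestic</th></tr>
  <tr><td>2002</td><td>10</td><td>1,000</td></tr>
  <tr><td>2002</td><td>11</td><td>2,000</td></tr>
</table>
"""


def table_rows(html, table_id):
    """Return the table's rows as lists of cell strings (header row included)."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", id=table_id)
    rows = []
    for tr in table.find_all("tr"):
        # Collect both header (<th>) and data (<td>) cells, stripped of whitespace.
        cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        rows.append(cells)
    return rows


rows = table_rows(SAMPLE_HTML, "DataGrid")
print(rows[0])   # header row
print(rows[1:])  # data rows
```

On a real page, the table id would have to be looked up in the page source first.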


## Install Dependencies

In [31]:
# Uncomment the following lines of code to install dependencies
# !pip install beautifulsoup4
# !pip install requests

## Imports

Import all the dependencies here.


In [32]:
from bs4 import BeautifulSoup
import requests

## Extracting Entities

In [33]:
def get_codes(soup: BeautifulSoup, id: str):
    '''
    Get the codes (e.g. carrier or airport codes) under an element as a Python list.

    Parameters
    ----------
    - `soup`: the `BeautifulSoup` instance containing the opened page
    - `id`: the id of the HTML element containing the codes

    Returns
    --------
    A Python list of strings containing all the codes of the given id.
    '''
    codes = []
    selector = soup.find(id=id)
    for option in selector.find_all('option'):
        codes.append(option['value'])
    return codes


In [34]:
# Close the file after parsing and name the parser explicitly
with open("page_source.html") as html:
    soup = BeautifulSoup(html, "html.parser")

In [35]:
carrier_codes = get_codes(soup, "CarrierList")
print(f"carrier codes: {carrier_codes}")

carrier codes: ['All', 'AllUS', 'AllForeign', 'FL', 'AS', 'AA', 'MQ', '5Y', 'DL', 'EV', 'F9', 'HA', 'B6', 'OO', 'WN', 'NK', 'US', 'UA', 'VX']


In [36]:
airport_codes = get_codes(soup, "AirportList")
print(f"Airport codes: {airport_codes}")

Airport codes: ['All', 'AllMajors', 'ATL', 'BWI', 'BOS', 'CLT', 'MDW', 'ORD']
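
Both lists begin with entries such as `'All'`, `'AllUS'` and `'AllMajors'`. Assuming these are aggregate options rather than individual carriers or airports (an assumption, not something the page source confirms here), they could be filtered out before making requests. A minimal sketch, with `drop_aggregates` as a hypothetical helper name:

```python
def drop_aggregates(codes):
    """Keep only codes that do not start with 'All' (assumed to mark aggregate entries)."""
    return [code for code in codes if not code.startswith("All")]


# Shortened versions of the lists printed above.
carriers = drop_aggregates(['All', 'AllUS', 'AllForeign', 'FL', 'AS', 'AA'])
airports = drop_aggregates(['All', 'AllMajors', 'ATL', 'BWI', 'BOS'])
print(carriers)  # ['FL', 'AS', 'AA']
print(airports)  # ['ATL', 'BWI', 'BOS']
```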


## HTTP Requests

In [37]:
# Please note that the function 'make_request' is provided for your reference only.
# You will not be able to actually use it from within the Udacity web UI.
# Your task is to process the HTML using BeautifulSoup, extract the hidden
# form field values for "__EVENTVALIDATION" and "__VIEWSTATE" and set the appropriate
# values in the data dictionary.
# All your changes should be in the 'extract_data' function

html_page = "page_source.html"


def extract_data(page):
    data = {"eventvalidation": "",
            "viewstate": ""}
    with open(page, "r") as html:
        # Parse the page and read the hidden form fields
        soup = BeautifulSoup(html, "html.parser")
        event_validation = soup.find(id="__EVENTVALIDATION")
        data["eventvalidation"] = event_validation["value"]
        viewstate = soup.find(id="__VIEWSTATE")
        data["viewstate"] = viewstate["value"]

    return data


def make_request(data):
    eventvalidation = data["eventvalidation"]
    viewstate = data["viewstate"]

    s = requests.Session()

    r = s.post("https://www.transtats.bts.gov/Data_Elements.aspx?Data=2",
               data={"AirportList": "ATL",
                     "CarrierList": "FL",
                     "Submit": "Submit",
                     "__EVENTTARGET": "",
                     "__EVENTARGUMENT": "",
                     "__EVENTVALIDATION": eventvalidation,
                     "__VIEWSTATE": viewstate})

    return r.text


def test():
    data = extract_data(html_page)
    assert data["eventvalidation"] != ""
    assert data["eventvalidation"].startswith("/wEWjAkCoIj1ng0")
    assert data["viewstate"].startswith("/wEPDwUKLTI")

    
test()
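
`make_request` above hard-codes one airport (`ATL`) and one carrier (`FL`). To download all the data, the same form would have to be submitted for every combination. The sketch below only builds the form fields and an output file name per pair, without sending anything. `build_form_data`, `output_name`, the file naming scheme and the token values are assumptions made for illustration.

```python
from itertools import product


def build_form_data(data, airport, carrier):
    """Assemble the POST form fields for one airport/carrier pair."""
    return {
        "AirportList": airport,
        "CarrierList": carrier,
        "Submit": "Submit",
        "__EVENTTARGET": "",
        "__EVENTARGUMENT": "",
        "__EVENTVALIDATION": data["eventvalidation"],
        "__VIEWSTATE": data["viewstate"],
    }


def output_name(airport, carrier):
    """Hypothetical naming scheme for the downloaded file."""
    return f"{carrier}-{airport}.html"


# Placeholder token values; real ones would come from extract_data().
tokens = {"eventvalidation": "EV", "viewstate": "VS"}

# One (file name, form fields) job per carrier/airport combination.
jobs = [(output_name(a, c), build_form_data(tokens, a, c))
        for c, a in product(["FL", "AS"], ["ATL", "BWI"])]

for name, form in jobs:
    print(name, form["AirportList"], form["CarrierList"])
```

Each `form` could then be passed as `data=form` to `s.post(...)` on the same session used in `make_request`, and the response text written to the corresponding file.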


## Best Practices for Scraping

**STEPS**:
1. Look at how a browser makes requests.
2. Emulate the same in code.
3. If "stuff blows up", look at HTTP traffic.
4. Repeat from Step 1 until it works.
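
One way to compare the code's request with what the browser sends is to build the request without sending it and inspect it. The sketch below uses `requests.Request(...).prepare()` for that. The form values are placeholders and omit the hidden fields, so this is only an illustration of the inspection step, not a working submission.

```python
import requests

URL = "https://www.transtats.bts.gov/Data_Elements.aspx?Data=2"

# Placeholder form values; in practice the hidden fields from the page are needed too.
form = {"AirportList": "ATL", "CarrierList": "FL", "Submit": "Submit"}

# Build the request without sending it, so it can be inspected offline.
prepared = requests.Request("POST", URL, data=form).prepare()

print(prepared.method)                   # HTTP method
print(prepared.url)                      # final URL, including the query string
print(prepared.headers["Content-Type"])  # how the body is encoded
print(prepared.body)                     # URL-encoded form body
```

The printed method, headers and body can be checked side by side with the request shown in the browser's network tab.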
