# Web Scraping Data

**Web Scraping** (also called screen scraping, web data extraction, web harvesting, etc.) is a technique for extracting data from a website so that it can be saved to a file or database, or used in an application.

Data displayed by some websites can only be viewed through a web browser, with no option to download a copy to a file. Rather than manually copying and pasting the data, which is a time-consuming and tedious task, web scraping automates this process. This makes collecting the data more efficient, and the program can be scheduled to scrape new data as it is added in the future.

In this lesson, we will use the `requests` and `bs4` (Beautiful Soup) libraries to extract the data from a Wikipedia webpage of U.S. state capitals, store the data in a `pandas` dataframe, then save the collected data to a file.

In [None]:
import requests
from bs4 import BeautifulSoup
import numpy as np
import pandas as pd

## Connecting to the website

`requests` is a commonly used library that lets a Python program access a website and its contents. The `.get()` function sends a request to the website, and the website returns a **response** with a **status code** that indicates whether the request succeeded. For more information about status codes and their meanings, see the REST API Tutorial website's [list of HTTP status codes](https://www.restapitutorial.com/httpstatuscodes.html). Common status codes are `200 (OK)`, `401 (Unauthorized)`, `404 (Not Found)`, and `500 (Internal Server Error)`.

In [None]:
URL = "https://simple.wikipedia.org/wiki/List_of_U.S._state_capitals"

In [None]:
# connect to the website
# website returns a response

response = requests.get(URL)

In [None]:
# check the status code
response.status_code
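As a small aside, the meaning of a numeric status code can be looked up with Python's standard-library `http.HTTPStatus` enum. The sketch below is only an illustration; the helper name `describe_status` is hypothetical and not part of `requests`.

```python
from http import HTTPStatus


def describe_status(code):
    """Return the code with its standard phrase, e.g. '404 Not Found'."""
    try:
        return f"{code} {HTTPStatus(code).phrase}"
    except ValueError:
        # HTTPStatus raises ValueError for codes it does not know
        return f"{code} Unknown"


# the common status codes mentioned above
for code in (200, 401, 404, 500):
    print(describe_status(code))
```

In the notebook, the same helper could be called as `describe_status(response.status_code)`.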

## HTML Basics
The basic syntax of an HTML webpage contains **tags** which create the structure of the website:

- HTML documents must start with the document type declaration `<!DOCTYPE html>`.
- The HTML document is contained between `<html>` and `</html>`.
- Metadata and script declarations for the HTML document are placed between `<head>` and `</head>`.
- The visible part of the HTML document is between the `<body>` and `</body>` tags.
- Title headings are defined with the `<h1>` through `<h6>` tags.
- Paragraphs are defined with the `<p>` tag.

Other useful tags include `<a>` for hyperlinks, `<table>` for tables, `<tr>` for table rows, and `<td>` for a single cell within a table row.
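To make the nesting of these tags concrete, here is a minimal sketch that prints an indented outline of the opening tags in a tiny page. It uses the standard-library `html.parser` module rather than Beautiful Soup, and both the sample page and the `TagOutline` class are hypothetical, written only for this illustration.

```python
from html.parser import HTMLParser

# Hypothetical minimal page, used only for illustration
sample_html = """<!DOCTYPE html>
<html>
<head><title>Sample page</title></head>
<body>
<h1>State capitals</h1>
<p>See the <a href="https://example.com">full list</a>.</p>
<table>
<tr><td>Alabama</td><td>Montgomery</td></tr>
</table>
</body>
</html>"""


class TagOutline(HTMLParser):
    """Record each opening tag, indented by its nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []

    def handle_starttag(self, tag, attrs):
        self.outline.append("  " * self.depth + f"<{tag}>")
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1


outline_parser = TagOutline()
outline_parser.feed(sample_html)
print("\n".join(outline_parser.outline))
```

The printed outline shows `<head>` and `<body>` nested inside `<html>`, and each `<td>` nested inside a `<tr>` inside the `<table>`.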

## Web Scraping Guidelines

- Check a website's Terms & Conditions before you scrape it. Add `/robots.txt` to the end of the site's root URL (for example, `https://simple.wikipedia.org/robots.txt`) to see which parts of the site automated programs are allowed to access.
- Carefully read the statements about legal use of data. Many times, web scraped data cannot be used for commercial purposes.
- Do not aggressively request data from the website, as this may cause timeout issues. Write your program to make infrequent or scheduled requests in order to simulate human-like behavior. A common rule of thumb is no more than one request per second.
- Inspect the HTML structure of the webpage through your browser before using Python to parse it.
- The layout of a website may change over time, so revisit the site and update your code as needed.
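The `robots.txt` check from the guidelines above can also be done in code. The following is a minimal sketch using the standard-library `urllib.robotparser` module. The rules and the `example.com` URLs are hypothetical and are inlined so that the example runs without a network connection; for a real site, `set_url()` and `read()` would load the actual file instead.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents, inlined so the example runs offline
robots_lines = [
    "User-agent: *",
    "Disallow: /private/",
    "Crawl-delay: 1",
]

robots = RobotFileParser()
robots.parse(robots_lines)

# may a generic crawler ("*") fetch these pages?
allowed_page = robots.can_fetch("*", "https://example.com/wiki/Some_page")
blocked_page = robots.can_fetch("*", "https://example.com/private/data")

# number of seconds the site asks crawlers to wait between requests
delay = robots.crawl_delay("*")

print(allowed_page, blocked_page, delay)
```

The value of `delay` could then be passed to `time.sleep()` between requests to keep the request rate low.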

## Download and Parse HTML

`bs4` (Beautiful Soup) is a widely used library for extracting data from HTML and XML files.

The `.text` attribute of the response holds the HTML of the requested webpage as a string. The `BeautifulSoup()` function then **parses** (interprets) that string as HTML. This makes it possible to search for tags within the webpage instead of treating the contents as individual string characters.

In [None]:
# display the first 300 characters in the string
print(response.text[:300])

In [None]:
# parse the string as HTML
raw_HTML = BeautifulSoup(response.text, 'html.parser')

In [None]:
raw_HTML.title

## Find HTML Elements

Because there is only one `<table>` tag on this page, we can use the `.find()` function to search for the tag, which will return the tag and its contents. If there are multiple tags of the same type, `.find()` returns the first tag of that type listed on the page. To target a specific tag instead, add the `class_=` argument to search for a tag with a particular class.

Then within the table, we will gather all of the `<tr>` tags to collect data from each row. `.find_all()` creates a list where each item in the list is a tag of that type. This is useful for extracting data from tags that have similarly structured information, such as hyperlinks in `<a>` tags or row information in `<tr>` tags.
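The difference between `.find()` and `.find_all()` can be seen on a small scale first. The sketch below uses a hypothetical HTML string with two tables (not the Wikipedia page) so that the results are easy to check by eye.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML with two tables, used only for illustration
sample_html = """
<table class="infobox"><tr><td>ignored</td></tr></table>
<table class="wikitable sortable">
  <tr><th>State</th><th>Capital</th></tr>
  <tr><td>Alabama</td><td>Montgomery</td></tr>
  <tr><td>Alaska</td><td>Juneau</td></tr>
</table>
"""
soup = BeautifulSoup(sample_html, "html.parser")

# .find() returns only the first matching tag
first_table = soup.find("table")

# class_ narrows the search to a tag that has the given class
wiki_table = soup.find("table", class_="wikitable")

# .find_all() returns a list of every matching tag
rows = wiki_table.find_all("tr")
first_data_row = [td.text for td in rows[1].find_all("td")]

print(first_table["class"])
print(len(rows))
print(first_data_row)
```

Without `class_`, `.find("table")` picks up the first table on the page, which may not be the one that holds the data.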

In [None]:
# search for the <table> tag on the page
raw_HTML.find('table')

In [None]:
# can also search the same table using its class type
raw_HTML.find('table', class_= 'wikitable sortable')

In [None]:
# save table to variable
table = raw_HTML.find('table')

In [None]:
# find_all creates a list of all <tr> tags in table
# display the first 4 <tr> tags and their contents
table.find_all('tr')[:4]

The first two `<tr>` tags in the table contain `<th>` tags, which create the column headers on the page. The dataset is small, so we will not collect the text from those tags to use as the column headers in the dataframe. Instead, we will use list slicing to skip those rows and start at the first row that has `<td>` tags, which hold the actual data that we want to store in the dataframe.

In [None]:
# store the contents of the table that we will collect data from
# tabledata is a list
tabledata = table.find_all('tr')[2:]

In [None]:
# display first 2 <tr> tags and their contents
tabledata[:2]

Within each `<tr>` tag, we will need to create another list (using the `.find_all()` function) of all of its `<td>` tags. Each `<td>` tag holds the data for one table cell, and we can extract its text.

In [None]:
# first row in the table data
tabledata[0]

In [None]:
# all <td> tags in the first table data row
tabledata[0].find_all('td')

In [None]:
# save the list of the first row's <td> tags
first_row = tabledata[0].find_all('td')

In [None]:
first_row[0]

In [None]:
first_row[0].text

In [None]:
# loop through each <td> tag
# print out each tag's text data
for data in first_row:
    print(data.text)

**NOTE**: The `Notes` column in the table does not have information for every state. In the example above, Alabama does not have any notes displayed. However, the text of that cell still contains a hidden newline character. When we collect the data, we will store this newline as a numpy `NaN` ("null") value. This is important because missing or empty information needs a "null" placeholder when creating a dataframe.
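A hidden newline like this is easier to spot with `repr()`, which shows the escape sequence instead of a blank line. The sketch below uses a hypothetical one-row table, not the real page. It also shows `get_text(strip=True)`, a Beautiful Soup option that strips surrounding whitespace, as one possible way to detect empty cells.

```python
from bs4 import BeautifulSoup

# Hypothetical row whose last cell holds only a newline
row_html = "<table><tr><td>Alabama</td><td>AL</td><td>\n</td></tr></table>"
cells = BeautifulSoup(row_html, "html.parser").find_all("td")

# .text keeps the newline, so the last cell is not an empty string
raw = [td.text for td in cells]
print([repr(text) for text in raw])

# get_text(strip=True) removes surrounding whitespace,
# leaving an empty string for cells that have no content
stripped = [td.get_text(strip=True) for td in cells]
is_missing = [text == "" for text in stripped]
print(is_missing)
```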

## Collect Data

Now that we have identified the structure of the table, we can set up an empty dictionary to collect the values. The keys of the dictionary will later become the column headers in the dataframe. The value of each key will be a list that has information for that column in the table.

In [None]:
# empty dictionary to hold the values
state_info = {'State':[],
              'Abbreviation':[],
              'Statehood':[],
              'Capital':[],
              'Capital_Since':[],
              'Area':[],
              'City_population':[],
              'Metro_population':[],
              'State_rank':[],
              'US_rank':[],
              'Notes':[],}

In [None]:
state_info.values()

In [None]:
for index, item in enumerate(state_info.values()):
    print(index, item)

In [None]:
# loop through each <tr> tag
for row in tabledata:
    
    # get a list of <td> tags for that row
    row_data = row.find_all('td')
    
    # get a list of the values (empty lists) from the dictionary
    # loop through the index number and each item (empty list)
    for index, item in enumerate(state_info.values()):
        
        # access the data from the <td> tag with the same index position as the empty list
        data = row_data[index].text
        
        # check if the data is a newline or empty string
        # if so, store a null value
        if (data == "\n") or (data == ''):
            item.append(np.nan)
        
        # otherwise, store the actual data from that cell
        else:
            item.append(data)

In [None]:
# verify contents in the dictionary
# list of values for state names
state_info['State']

In [None]:
state_info['Notes']
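The same collection pattern can be sketched without any HTML to show how each cell lands in the list for its column. Everything here is hypothetical: plain strings stand in for the text of the `<td>` tags, `zip()` replaces the index lookup, and `None` is used as the placeholder instead of `np.nan`.

```python
# Hypothetical stand-in for the text of each <td> in two table rows
rows = [
    ["Alabama", "Montgomery", "\n"],
    ["Alaska", "Juneau", "Example note\n"],
]

# keys become column headers; each value collects one column
columns = {"State": [], "Capital": [], "Notes": []}

for row in rows:
    # zip pairs each column's list with the cell in the same position
    for values, cell in zip(columns.values(), row):
        text = cell.strip()
        # store a placeholder when the cell holds only whitespace
        values.append(text if text else None)

print(columns)
```

This relies on the dictionary keeping its keys in insertion order, so the key order must match the column order of the table.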

## Create Dataframe

Now that the data has been extracted from the table, we can use the `pandas` `DataFrame()` constructor to structure the dictionary as a dataframe. The keys from the dictionary will become the column headers, and the list for each key will become the column values. Lastly, we will save the dataframe's contents to a file.

In [None]:
# create a dataframe from dictionary info
state_df = pd.DataFrame(data=state_info)
state_df.head()

In [None]:
# create a new file and save dataframe contents
state_df.to_csv('stateinfo.csv', index=False)
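As a quick check that missing values survive the save, the sketch below writes a small hypothetical dataframe to an in-memory buffer instead of a file, then reads it back. The data and variable names are made up for illustration.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical two-row dataset with one missing note
info = {"State": ["Alabama", "Alaska"], "Notes": [np.nan, "Example note"]}
df = pd.DataFrame(data=info)

# write the CSV text to memory rather than to disk
buffer = io.StringIO()
df.to_csv(buffer, index=False)

# rewind the buffer and load the CSV back into a new dataframe
buffer.seek(0)
restored = pd.read_csv(buffer)

print(buffer.getvalue())
print(restored.isna().sum())
```

The missing note is written as an empty field in the CSV and is read back as `NaN`.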