# 03 - web scraping and data transformations

1. [The TSA posts passenger numbers](https://www.tsa.gov/coronavirus/passenger-throughput) in a table but there is no download or API option. We can use BeautifulSoup to parse this table.
1. Transform the TSA passenger data in two ways to create two different charts.
1. Create two charts inside this notebook with [Matplotlib](https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html).

__Library reference__
- [BeautifulSoup]()
- [pandas]()
- [Matplot for pandas]()
- [Datetime format codes](https://docs.python.org/3/library/datetime.html#strftime-and-strptime-format-codes)

1. Turn the TSA's html table into a dataframe
    1. Create a list of column names
    1. Create a 2d array of data
    1. Format the data into two columns: date and value
1. Transform the data in two different ways for two different charts
1. Create two charts

In [1]:
# !pipenv uninstall matplotlib

In [2]:
#### Import libraries

from bs4 import BeautifulSoup
import pandas as pd
from datetime import datetime, timedelta
import requests

# set display format for numbers
# suppress scientific notation

pd.set_option('display.float_format', lambda x: '%.3f' % x)

## 1. Turn the TSA's html table into a dataframe

In [3]:
# get the html from the page

tsa_r = requests.get('https://www.tsa.gov/coronavirus/passenger-throughput')

In [4]:
# create a beautifulsoup object
doc = BeautifulSoup(tsa_r.text, 'html.parser')

In [5]:
doc

<!DOCTYPE html>

<html dir="ltr" lang="en" prefix="content: http://purl.org/rss/1.0/modules/content/  dc: http://purl.org/dc/terms/  foaf: http://xmlns.com/foaf/0.1/  og: http://ogp.me/ns#  rdfs: http://www.w3.org/2000/01/rdf-schema#  schema: http://schema.org/  sioc: http://rdfs.org/sioc/ns#  sioct: http://rdfs.org/sioc/types#  skos: http://www.w3.org/2004/02/skos/core#  xsd: http://www.w3.org/2001/XMLSchema# ">
<head>
<meta charset="utf-8"/>
<script async="" src="https://www.googletagmanager.com/gtag/js?id=UA-34936050-1"></script>
<script>window.dataLayer = window.dataLayer || [];function gtag(){dataLayer.push(arguments)};gtag("js", new Date());gtag("config", "UA-34936050-1", {"groups":"default","anonymize_ip":true,"allow_ad_personalization_signals":false});</script>
<link href="https://www.tsa.gov/coronavirus/passenger-throughput" rel="canonical"/>
<script>var pfHeaderImgUrl = '/sites/default/files/tsa_insignia_rgb_whitespace_0.svg';var pfHeaderTagline = '';var pfdisableClickToDel  

#### table tag
![table selected](../answers/assets/table.png)

### a. Create a list of column names

In [6]:
# turn thead into a column list

thead = doc.find('thead')

In [7]:
# then find all th elements (because there is only 1 row)
ths = thead.find_all('th')

In [8]:
# and loop through each th to extract the text for a list
tsa_col = []

for th in ths:
    tsa_col.append(th.text.strip())

In [9]:
# print the list
tsa_col

['Date',
 '2021 Traveler Throughput',
 '2020 Traveler Throughput',
 '2019 Traveler Throughput']

### b. Create a 2d array of data
![tbody example](../answers/assets/tbody.png)

In [10]:
# turn data into an array of arrays (2d array)
tbody = doc.find('tbody')

In [11]:
# turn tr tags into a list
trs = tbody.find_all('tr')

In [12]:
# for each tr, create a list with the text of its td tags
tr_list = []
for tr in trs:
    tds = tr.find_all('td')
    td_list = []
    for td in tds:
        td_list.append(td.text.strip())
    tr_list.append(td_list)

In [13]:
# Check the length of the list and the first couple of items
len(tr_list)

365

In [14]:
tr_list[0:5]

# we have three value columns per row, which is not easy to transform.
# Better to have two columns: one for the date, one for the throughput value

[['7/11/2021', '2,198,635', '754,545', '2,669,717'],
 ['7/10/2021', '1,987,652', '656,284', '2,312,178'],
 ['7/9/2021', '2,147,903', '711,124', '2,716,812'],
 ['7/8/2021', '2,027,364', '709,653', '2,608,209'],
 ['7/7/2021', '1,880,160', '632,498', '2,515,902']]
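The header and body extraction above can also be written more compactly with list comprehensions. The following is a minimal sketch, not the notebook's own solution: it runs against a small hypothetical HTML sample (`sample_html`) that only mimics the structure of the TSA table, and the names `sample_doc`, `sample_cols` and `sample_rows` are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical two-row sample that mimics the structure of the TSA table
sample_html = """
<table>
  <thead>
    <tr><th>Date</th><th>2021 Traveler Throughput</th><th>2020 Traveler Throughput</th></tr>
  </thead>
  <tbody>
    <tr><td>7/11/2021</td><td>2,198,635</td><td>754,545</td></tr>
    <tr><td>7/10/2021</td><td>1,987,652</td><td>656,284</td></tr>
  </tbody>
</table>
"""

sample_doc = BeautifulSoup(sample_html, 'html.parser')

# header cells -> list of column names
sample_cols = [th.text.strip() for th in sample_doc.find('thead').find_all('th')]

# each tr -> list of cell strings, which gives a 2d array
sample_rows = [
    [td.text.strip() for td in tr.find_all('td')]
    for tr in sample_doc.find('tbody').find_all('tr')
]

print(sample_cols)
print(sample_rows)
```

Keeping the explicit loops, as the cells above do, is just as valid and is easier to step through while learning.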

### c. Format the data into two columns: date and value

In [38]:
# create a function that will generate the dates of preceding years
# figure out the process with hard-coded values first, then replace the parts that change with variables
def format_date(d, column_year):
    date_f = datetime.strptime(d, '%m/%d/%Y')
    new_date = date_f - timedelta(weeks = column_year*52)  
    return new_date
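To see what `format_date` returns, here is a small usage sketch. It repeats the function so the block runs on its own; the names `shifted` and `weekdays` are only for illustration. Shifting by multiples of 52 weeks (364 days) keeps the day of the week aligned across years, at the cost of the calendar date drifting by a day or two per year.

```python
from datetime import datetime, timedelta

def format_date(d, column_year):
    # parse the string, then step back column_year * 52 weeks
    date_f = datetime.strptime(d, '%m/%d/%Y')
    return date_f - timedelta(weeks=column_year * 52)

# column_year 0, 1, 2 correspond to the 2021, 2020 and 2019 columns
shifted = [format_date('7/11/2021', n) for n in range(3)]

# weekday() returns 0 for Monday through 6 for Sunday
weekdays = [d.weekday() for d in shifted]

for d, wd in zip(shifted, weekdays):
    print(d.date(), wd)
```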

In [16]:
# this double loop can be combined with the loop above that generates tr_list
# but i want to separate text extraction from formatting
passengers_per_day = []
# for each tr
for tr in tr_list:
    # we need to find dates for 2020 and 2019 and align them with the html table format
    # turn string into date object so we can perform datetime calculations on it
    #print(tr)
    # the date is always in position 0
    date_2021 = tr[0]
    #print(type(date_2021)) #this is a string
    date_2021 = datetime.strptime(tr[0], '%m/%d/%Y')
    # the date for 2020 will be 52 weeks before 
    date_2020 = date_2021 - timedelta(weeks = 52)  
    #check if days of the week line up...
    # the date for 2019 will be 104 weeks before
    date_2019 = date_2021 - timedelta(weeks = 104)
    #print(date_2021, date_2020, date_2019)
    # because the above is a repeatable process, how can we move this to a function?
    #print(tr[1:])
    date_list  = [date_2021, date_2020, date_2019]
    # for each passenger column td_list[1:]
    for (index, passenger_column) in enumerate(tr[1:]):
        # Create a new dictionary to populate with formatted date
        # index being the column that corresponds to the order of dates in the date_list above
        daily_passengers = {
            'date': date_list[index],
            'value': passenger_column,
        }
        #print(daily_passengers)
        passengers_per_day.append(daily_passengers)
        # note: 'value' is still a string such as '2,198,635' at this point;
        # if a value exists, it should be changed to an integer before any calculation
        # (blank values will raise an error if converted directly)

In [17]:
len(passengers_per_day)

1095
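The values collected above are still strings with thousands separators. One possible way to convert them is sketched below. The helper name `to_int` is hypothetical, and the sketch assumes that missing cells arrive as empty strings, which the output shown here does not confirm.

```python
def to_int(value):
    """Convert a string like '2,198,635' to 2198635; return None for blank cells."""
    cleaned = value.replace(',', '').strip()
    # assumption: a missing value shows up as an empty string
    return int(cleaned) if cleaned else None

print(to_int('2,198,635'))
print(to_int(''))
```

It could be applied when building each dictionary, for example `'value': to_int(passenger_column)`.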

In [18]:
# turn passengers_per_day into a DataFrame with "date" "value" columns
tsa_df = pd.DataFrame(passengers_per_day)

In [19]:
# sort dates from earliest to latest
tsa_df = tsa_df.sort_values('date', ascending= True)

In [20]:
len(tsa_df), len(tsa_df['date'].unique())
#looks like there are duplicate dates somewhere

(1095, 1093)
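The two duplicate dates are consistent with the 52-week shift. A year of daily rows spans 365 days, but 52 weeks is only 364 days, so the oldest date in each column lands on the newest shifted date of the next column. The sketch below reproduces the counts with plain dates, assuming the 365 table rows are consecutive days ending on 7/11/2021 (the variable names are illustrative only).

```python
from datetime import datetime, timedelta

latest = datetime(2021, 7, 11)

# assumption: the table holds 365 consecutive daily rows
table_dates = [latest - timedelta(days=n) for n in range(365)]

# every row yields three dates: shifted by 0, 52 and 104 weeks
all_dates = [
    d - timedelta(weeks=52 * years)
    for d in table_dates
    for years in range(3)
]

total, unique = len(all_dates), len(set(all_dates))
print(total, unique)
```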

In [37]:
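# duplicated() with no arguments compares whole rows, so it finds nothing here:
# the rows that share a date carry different values
tsa_df[tsa_df.duplicated()]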

Unnamed: 0,date,value


In [22]:
# delete duplicates
tsa_df = tsa_df.drop_duplicates(subset=['date'])

In [23]:
len(tsa_df), len(tsa_df['date'].unique())

(1093, 1093)

In [24]:
# check the value column for missing values
tsa_df[tsa_df['value'].isna()]

Unnamed: 0,date,value


In [25]:
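# check for zero values
# (note: the values are still strings here, so a comparison with the integer 0 cannot match)
tsa_df[tsa_df['value'] == 0]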

Unnamed: 0,date,value


In [26]:
# look at the raw values of the first row: they are strings with thousands separators
print(tr_list[0][1:])

['2,198,635', '754,545', '2,669,717']


## 2. Transform the data in two different ways for two different charts
[What is a moving average and why is it used? - Dallas Fed](https://www.dallasfed.org/research/basics/moving.aspx)

### a. Calculate 7-day moving average

In [27]:
# display the last 7 rows

In [28]:
# write a function that takes the current date and 6 previous dates and averages them
def moving_average(row):
    
    return row

[Read up on pandas' apply method](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html)

In [29]:
# calculate the 7-day moving average in a new column, starting 7 days in (note: see the result_type argument of apply)
# set the date as the index for Matplotlib
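One possible way to get a 7-day moving average is sketched below. Note that it swaps in pandas' built-in `rolling` window instead of the `apply`-based function this exercise asks for, and it uses hypothetical sample values rather than the TSA data. The names `sample_df` and `avg_7d` are made up for illustration, and the sketch assumes the value column is already numeric.

```python
import pandas as pd

# Hypothetical daily values; in the notebook this would be tsa_df with numeric values
sample_df = pd.DataFrame({
    'date': pd.date_range('2021-07-01', periods=10, freq='D'),
    'value': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100],
})

# sort by date and use it as the index so charts get a date axis
sample_df = sample_df.sort_values('date').set_index('date')

# average of the current day and the 6 days before it;
# the first 6 rows have no full window and come out as NaN
sample_df['avg_7d'] = sample_df['value'].rolling(window=7).mean()

# display the last 7 rows
last_rows = sample_df.tail(7)
print(last_rows)
```

Because the window needs 7 values, the average effectively starts 7 days in, which matches the intent of the comment in the cell above.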

### b. Group data by weeks

In [30]:
# create a function to get the date of the first day of the week
def weekday_start(row):
    
    return row

In [31]:
# create a new column that IDs the start date of the week

In [32]:
# group by week start, then turn the groupby object into a dataframe
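The weekly grouping could look something like the sketch below. It is one approach among several and makes two assumptions: weeks start on Monday, and the value column is numeric. It uses a vectorized subtraction instead of a row-wise function, and the sample data and the names `sample_df`, `week_start` and `weekly_df` are hypothetical.

```python
import pandas as pd

# Hypothetical two weeks of daily values; 2021-07-05 is a Monday
sample_df = pd.DataFrame({
    'date': pd.date_range('2021-07-05', periods=14, freq='D'),
    'value': [1] * 7 + [2] * 7,
})

# dt.weekday is 0 for Monday, so subtracting it lands on the Monday of that week
sample_df['week_start'] = sample_df['date'] - pd.to_timedelta(
    sample_df['date'].dt.weekday, unit='D'
)

# sum the values per week, then turn the grouped result back into a dataframe
weekly_df = sample_df.groupby('week_start')['value'].sum().reset_index()
print(weekly_df)
```

The first and last weeks of real data may be incomplete, so their totals can look artificially low on a chart.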

## 3. Create two charts - one for 7-day moving average and one for week totals
Create a bar chart of the daily values for reference

In [33]:
# create a bar chart for daily values

### a. 7-day moving average

In [34]:
# plot a 7-day average line chart

### b. By weekly totals

In [35]:
# plot the weekly totals as a line chart
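As a rough sketch of the charting step, pandas' `plot` method can draw onto Matplotlib axes directly. The example below uses hypothetical sample values, not the TSA data, and the names `daily`, `ax_bar` and `ax_line` are illustrative. The `matplotlib.use('Agg')` line is only there so the sketch also runs outside a notebook; inside Jupyter it can be dropped.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical daily values indexed by date
daily = pd.DataFrame(
    {'value': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]},
    index=pd.date_range('2021-07-01', periods=10, freq='D'),
)
daily['avg_7d'] = daily['value'].rolling(window=7).mean()

# one figure with two stacked charts
fig, (ax_bar, ax_line) = plt.subplots(2, 1, figsize=(8, 6))

# bar chart of the daily values for reference
daily['value'].plot(kind='bar', ax=ax_bar, title='Daily values')

# line chart of the 7-day moving average
daily['avg_7d'].plot(kind='line', ax=ax_line, title='7-day moving average')

fig.tight_layout()
```

The weekly totals from step 2b can be plotted the same way: set the week start as the index and call `plot(kind='line')` on the value column.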