# Using Regex in Web Scraping

When a website's ```html``` and ```css``` are not well organized, we often have to use ```regex``` to target our content. This is especially relevant today because reporters often have to scrape energy sites for their climate change pieces, and these sites tend to be poorly designed (perhaps on purpose!).

At <a href="https://content.energy.alberta.ca/Tenure/1314.asp">this energy site</a>, we want to target and download the ```csv``` files on every page.

No need to iterate through every year yet. Let's get it working on one page first.

How would you approach this?

In [2]:
import requests
import pandas as pd
from bs4 import BeautifulSoup
import re

## single page test

In [3]:
## target url
url = "https://content.energy.alberta.ca/Tenure/1314.asp?Year=2020"
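
The `Year` query parameter in this URL suggests that other years can be reached by changing its value. Once the single-page test works, the listing URLs could be generated with something like the sketch below. The helper `year_url` and the range of years are hypothetical, and whether the site serves every one of those years is an assumption that would need checking.

```python
from urllib.parse import urlencode

# Listing page without its query string (taken from the target url above).
base_url = "https://content.energy.alberta.ca/Tenure/1314.asp"


def year_url(year):
    """Build the listing URL for one year (hypothetical helper)."""
    return f"{base_url}?{urlencode({'Year': year})}"


# Hypothetical range of years; adjust to whatever the site actually offers.
years = range(2018, 2021)
year_urls = [year_url(y) for y in years]

for u in year_urls:
    print(u)
```

Each generated URL could then go through the same steps as the single-page test.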

In [4]:
response = requests.get(url)


In [5]:
## try via pandas: read_html keeps the cell text but drops the link targets
df = pd.read_html(response.text)
df

[                     0                       1                   2   3   4
 0   2020 - December 02      2020 - December 02  2020 - December 02 NaN NaN
 1                  NaN                     NaN                 NaN NaN NaN
 2                  NaN  Public Offering Notice                 NaN NaN NaN
 3                  NaN    Public Sales Results                 NaN NaN NaN
 4   2020 - November 18      2020 - November 18  2020 - November 18 NaN NaN
 5                  NaN                     NaN                 NaN NaN NaN
 6                  NaN  Public Offering Notice                 NaN NaN NaN
 7                  NaN    Public Sales Results                 NaN NaN NaN
 8      2020 - April 01         2020 - April 01     2020 - April 01 NaN NaN
 9                  NaN                     NaN                 NaN NaN NaN
 10                 NaN  Public Offering Notice                 NaN NaN NaN
 11                 NaN    Public Sales Results                 NaN NaN NaN
 12     2020

In [6]:
## try via pure html

soup = BeautifulSoup(response.text, "html.parser")
soup

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" 
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

<html lang="en" xmlns="http://www.w3.org/1999/xhtml">
<head>
<!-- -->
<link href="/css/default.css" media="screen" rel="stylesheet" title="default" type="text/css"/>
<link href="/css/print.css" media="print" rel="stylesheet" title="default" type="text/css"/>
<!--[if lte IE 7]><link rel="stylesheet" type="text/css" media="screen" href="/css/iestyles.css" /><![endif]-->
<link href="http://purl.org/dc/elements/1.1/" rel="schema.DC"/>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<title>Government of Alberta: </title>
<meta content="noarchive" name="robots"/>
<meta content="nocache" name="robots"/>
<meta content="index,follow" name="robots">
<meta content="Land sales" name="keywords">
<meta content="" name="description"/>
<meta content="Copyright © 2007 Government of Alberta" name="copyright"/>
<meta content="Government of Alberta" name="author"

In [7]:
target_table = soup.find("table")
target_table

<table border="0" cellpadding="1" cellspacing="1"><tr><td align="LEFT" colspan="3"><h4><b>2020 - December 02</b></h4></td><tr><tr><td>          </td><td align="LEFT">Public Offering Notice   </td><td></td><td align="LEFT"><a href="/FTPPNG/2020/20201202PON.pdf"><img alt="20201202PON.pdf" border="0" height="17" src="/scripts/pon/images/icon_pdf.gif" width="16"/></a>   </td><td align="LEFT"><a href="/FTPPNG/2020/20201202PON.xml"><img alt="20201202PON.xml" border="0" height="17" src="/scripts/pon/images/icon_xml.gif" width="16"/></a>   </td></tr><tr><td>          </td><td align="LEFT">Public Sales Results   </td><td align="LEFT"><a href="/FTPPNG/2020/20201202PSR.csv"><img alt="20201202PSR.csv" border="0" height="17" src="/scripts/pon/images/icon_csv.gif" width="16"/></a>   </td><td align="LEFT"><a href="/FTPPNG/2020/20201202PSR.pdf"><img alt="20201202PSR.pdf" border="0" height="17" src="/scripts/pon/images/icon_pdf.gif" width="16"/></a>   </td><td align="LEFT"><a href="/FTPPNG/2020/2020120

In [8]:
all_atags = target_table.find_all("a")
all_atags

[<a href="/FTPPNG/2020/20201202PON.pdf"><img alt="20201202PON.pdf" border="0" height="17" src="/scripts/pon/images/icon_pdf.gif" width="16"/></a>,
 <a href="/FTPPNG/2020/20201202PON.xml"><img alt="20201202PON.xml" border="0" height="17" src="/scripts/pon/images/icon_xml.gif" width="16"/></a>,
 <a href="/FTPPNG/2020/20201202PSR.csv"><img alt="20201202PSR.csv" border="0" height="17" src="/scripts/pon/images/icon_csv.gif" width="16"/></a>,
 <a href="/FTPPNG/2020/20201202PSR.pdf"><img alt="20201202PSR.pdf" border="0" height="17" src="/scripts/pon/images/icon_pdf.gif" width="16"/></a>,
 <a href="/FTPPNG/2020/20201202PSR.xml"><img alt="20201202PSR.xml" border="0" height="17" src="/scripts/pon/images/icon_xml.gif" width="16"/></a>,
 <a href="/FTPPNG/2020/20201118PON.pdf"><img alt="20201118PON.pdf" border="0" height="17" src="/scripts/pon/images/icon_pdf.gif" width="16"/></a>,
 <a href="/FTPPNG/2020/20201118PON.xml"><img alt="20201118PON.xml" border="0" height="17" src="/scripts/pon/images/ico

In [9]:
len(all_atags)

42
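
The 42 links mix `pdf`, `xml` and `csv` files, so they still need narrowing down. One possible way, sketched below, is to take each tag's `href` and keep only those ending in `.csv`. The `hrefs` list here is a stand-in copied from the first five links in the output above. In the notebook it would presumably be built from `all_atags`.

```python
import re

# In the notebook this list could come from:
#   hrefs = [a.get("href") for a in all_atags]
# Here it is a stand-in copied from the first five links shown above.
hrefs = [
    "/FTPPNG/2020/20201202PON.pdf",
    "/FTPPNG/2020/20201202PON.xml",
    "/FTPPNG/2020/20201202PSR.csv",
    "/FTPPNG/2020/20201202PSR.pdf",
    "/FTPPNG/2020/20201202PSR.xml",
]

# Match a ".csv" extension at the very end of the path, ignoring case.
csv_suffix = re.compile(r"\.csv$", re.I)

csv_hrefs = [h for h in hrefs if csv_suffix.search(h)]
print(csv_hrefs)
```

Anchoring the pattern with `$` avoids matching paths that merely contain `.csv` somewhere in the middle.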

In [10]:
## capture the href path of every link that points to a .csv file (case-insensitive)
pat = re.compile(r'href=\"([\/a-z0-9]+\.csv)', re.I)
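
A minimal sketch of how this pattern might be applied follows. It runs `findall` on the table's HTML as a string and resolves each captured path against the page URL. The `sample_html` string is a stand-in for `str(target_table)`, abridged from the output above (the `img` attributes are trimmed), and the variable names `csv_paths` and `csv_urls` are illustrative.

```python
import re
from urllib.parse import urljoin

# Page the links came from; relative hrefs are resolved against it.
url = "https://content.energy.alberta.ca/Tenure/1314.asp?Year=2020"

# Stand-in for str(target_table), abridged from the output shown above.
sample_html = (
    '<a href="/FTPPNG/2020/20201202PON.pdf">'
    '<img alt="20201202PON.pdf" src="/scripts/pon/images/icon_pdf.gif"/></a>'
    '<a href="/FTPPNG/2020/20201202PSR.csv">'
    '<img alt="20201202PSR.csv" src="/scripts/pon/images/icon_csv.gif"/></a>'
    '<a href="/FTPPNG/2020/20201202PSR.pdf">'
    '<img alt="20201202PSR.pdf" src="/scripts/pon/images/icon_pdf.gif"/></a>'
)

# Same pattern as in the cell above.
pat = re.compile(r'href=\"([\/a-z0-9]+\.csv)', re.I)

# findall returns only the captured group: the path inside href="...".
csv_paths = pat.findall(sample_html)

# Turn the site-relative paths into absolute download URLs.
csv_urls = [urljoin(url, path) for path in csv_paths]
print(csv_urls)
```

Because the character class `[\/a-z0-9]` excludes the dot, the match cannot run past a `.pdf` or `.xml` extension. Because the pattern starts with `href=`, the `alt` and `src` attributes that also mention `csv` are ignored.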