# Scraping a fixed URL site

We'll often encounter websites where the URL never changes. Here are a few examples:

- <a href="https://eportal.miteco.gob.es/BoleHWeb/">Ministry for the Ecological Transition and the Demographic Challenge</a>
- <a href="https://www.seethroughny.net/">See Through NY</a> 
- <a href="https://restructuring.ra.kroll.com/pge/Home-ClaimInfo">PG&E fire victim creditors</a>

From <a href="https://infopost.enbridge.com/InfoPost/">this homepage</a>, we want to scrape the critical notices for Algonquin Gas Transmission.

Let's explore the site to come up with our scrape strategy.

## Scrape one page to confirm and then scrape all at once


In [1]:
## import libraries
import requests
import pandas as pd


In [2]:
## target url
url = "https://infopost.enbridge.com/InfoPost/NoticesList.asp?pipe=AG&type=CRI"


In [5]:
## scrape table with pandas
all_data = pd.read_html(url)
df = all_data[0]
df

Unnamed: 0,Notice Type,Posted Date/Time,Notice Effective Date/Time,Notice End Date/Time,Notice Identifier,Subject,Response Date/Time
0,Capacity Constraint,11/04/2024 02:46:58 PM,11/05/2024 09:00:00 AM,11/06/2024 09:00:00 AM,154761,AGT Pipeline Conditions for 11/5/2024,
1,Capacity Constraint,11/03/2024 03:04:06 PM,11/04/2024 09:00:00 AM,11/05/2024 09:00:00 AM,154725,AGT Pipeline Conditions for 11/4/2024,
2,Capacity Constraint,11/02/2024 02:59:13 PM,11/03/2024 09:00:00 AM,11/04/2024 09:00:00 AM,154679,AGT Pipeline Conditions for 11/3/2024,
3,Capacity Constraint,11/01/2024 02:48:20 PM,11/02/2024 09:00:00 AM,11/03/2024 09:00:00 AM,154657,AGT Pipeline Conditions for 11/2/2024,
4,Operational Flow Order,10/31/2024 03:00:00 PM,11/02/2024 09:00:00 AM,01/29/2025 09:00:00 AM,154578,AGT Operational Flow Order -- UPDATE EFF 11/2,
...,...,...,...,...,...,...,...
125,Capacity Constraint,08/10/2024 03:15:53 PM,08/11/2024 09:00:00 AM,08/12/2024 09:00:00 AM,151175,AGT Pipeline Conditions for 8/11/2024,
126,Capacity Constraint,08/09/2024 03:32:59 PM,08/10/2024 09:00:00 AM,08/11/2024 09:00:00 AM,151145,AGT Pipeline Conditions for 8/10/2024,
127,Capacity Constraint,08/08/2024 03:05:00 PM,08/09/2024 09:00:00 AM,08/10/2024 09:00:00 AM,151073,AGT Pipeline Conditions for 8/9/2024,
128,Operational Flow Order,08/07/2024 03:00:00 PM,08/09/2024 09:00:00 AM,11/05/2024 09:00:00 AM,151039,AGT Operational Flow Order -- EFF 8/9,


In [9]:
## GET ALL URLS

## get notice id numbers
notice_ids = df["Notice Identifier"].to_list()
notice_ids[0:10]





[154761,
 154725,
 154679,
 154657,
 154578,
 154597,
 154555,
 154515,
 154468,
 154442]

In [10]:
## base link pieces: each notice id gets inserted between them

base_start = "https://infopost.enbridge.com/InfoPost/NoticeListDetail.asp?strKey1="
base_end = "&type=CRI&Embed=2&pipe=AG"

In [13]:
## create list of links with a list comprehension
links = [f"{base_start}{notice_id}{base_end}" for notice_id in notice_ids]
links

['https://infopost.enbridge.com/InfoPost/NoticeListDetail.asp?strKey1=154761&type=CRI&Embed=2&pipe=AG',
 'https://infopost.enbridge.com/InfoPost/NoticeListDetail.asp?strKey1=154725&type=CRI&Embed=2&pipe=AG',
 'https://infopost.enbridge.com/InfoPost/NoticeListDetail.asp?strKey1=154679&type=CRI&Embed=2&pipe=AG',
 'https://infopost.enbridge.com/InfoPost/NoticeListDetail.asp?strKey1=154657&type=CRI&Embed=2&pipe=AG',
 'https://infopost.enbridge.com/InfoPost/NoticeListDetail.asp?strKey1=154578&type=CRI&Embed=2&pipe=AG',
 'https://infopost.enbridge.com/InfoPost/NoticeListDetail.asp?strKey1=154597&type=CRI&Embed=2&pipe=AG',
 'https://infopost.enbridge.com/InfoPost/NoticeListDetail.asp?strKey1=154555&type=CRI&Embed=2&pipe=AG',
 'https://infopost.enbridge.com/InfoPost/NoticeListDetail.asp?strKey1=154515&type=CRI&Embed=2&pipe=AG',
 'https://infopost.enbridge.com/InfoPost/NoticeListDetail.asp?strKey1=154468&type=CRI&Embed=2&pipe=AG',
 'https://infopost.enbridge.com/InfoPost/NoticeListDetail.asp?st

In [14]:
## same list of links built with a for loop
links_fl = []
for notice_id in notice_ids:
    links_fl.append(f"{base_start}{notice_id}{base_end}")
    
links_fl

['https://infopost.enbridge.com/InfoPost/NoticeListDetail.asp?strKey1=154761&type=CRI&Embed=2&pipe=AG',
 'https://infopost.enbridge.com/InfoPost/NoticeListDetail.asp?strKey1=154725&type=CRI&Embed=2&pipe=AG',
 'https://infopost.enbridge.com/InfoPost/NoticeListDetail.asp?strKey1=154679&type=CRI&Embed=2&pipe=AG',
 'https://infopost.enbridge.com/InfoPost/NoticeListDetail.asp?strKey1=154657&type=CRI&Embed=2&pipe=AG',
 'https://infopost.enbridge.com/InfoPost/NoticeListDetail.asp?strKey1=154578&type=CRI&Embed=2&pipe=AG',
 'https://infopost.enbridge.com/InfoPost/NoticeListDetail.asp?strKey1=154597&type=CRI&Embed=2&pipe=AG',
 'https://infopost.enbridge.com/InfoPost/NoticeListDetail.asp?strKey1=154555&type=CRI&Embed=2&pipe=AG',
 'https://infopost.enbridge.com/InfoPost/NoticeListDetail.asp?strKey1=154515&type=CRI&Embed=2&pipe=AG',
 'https://infopost.enbridge.com/InfoPost/NoticeListDetail.asp?strKey1=154468&type=CRI&Embed=2&pipe=AG',
 'https://infopost.enbridge.com/InfoPost/NoticeListDetail.asp?st
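
The list comprehension and the for loop build the same list. As a side note, the query string can also be assembled with the standard library instead of gluing string pieces together. This is a minimal sketch, assuming the parameter names seen in the links above (`strKey1`, `type`, `Embed`, `pipe`); the helper name `build_notice_url` is hypothetical, and the three ids are copied from the output above.

```python
from urllib.parse import urlencode

# Page that serves a single notice (taken from the links above)
DETAIL_PAGE = "https://infopost.enbridge.com/InfoPost/NoticeListDetail.asp"

def build_notice_url(notice_id, pipe="AG", notice_type="CRI"):
    """Hypothetical helper: build the detail URL for one notice id."""
    # dicts keep insertion order, so the parameters appear in this order
    params = {"strKey1": notice_id, "type": notice_type, "Embed": 2, "pipe": pipe}
    return f"{DETAIL_PAGE}?{urlencode(params)}"

sample_ids = [154761, 154725, 154679]
sample_links = [build_notice_url(notice_id) for notice_id in sample_ids]
print(sample_links[0])
```

Keeping the parameters in a dictionary makes it easier to swap the pipeline code or notice type later without editing the URL text by hand.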

# What if we wanted to scrape ALL the gas lines?

In [15]:
## import all non-tabular scraping libs
from bs4 import BeautifulSoup
from random import randrange
import time
import requests

In [16]:
## get homepage and make soup
homepage_url = "https://infopost.enbridge.com/InfoPost/"
response = requests.get(homepage_url)
soup = BeautifulSoup(response.text, "html.parser")
soup

<!DOCTYPE html>

<!-- template.asp -->
<html lang="en">
<head>
<meta charset="utf-8"/>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<meta content="width=device-width, initial-scale=1.0" name="viewport"/>
<link href="favicon.ico" rel="shortcut icon"/>
<link href="css/jquery-ui.min.css" rel="stylesheet" type="text/css"/>
<link href="css/bootstrap.min.css" rel="stylesheet"/>
<link href="css/font-awesome-ie7.min.css" rel="stylesheet">
<link href="css/font-awesome.css" rel="stylesheet"/>
<link href="css/link.css" rel="stylesheet">
<link href="css/print.css" media="print" rel="stylesheet"/>
<link href="css/infopost-custom.css" media="screen" rel="stylesheet"/>
<link href="css/environment.css" media="screen" rel="stylesheet"/>
<!-- HTML5 shim, for IE6-9 support of HTML5 elements -->
<!--[if lt IE 9]>
	<link href="css/link-ie.css" rel="stylesheet" />
	<script src="scripts/html5shiv.js" type="text/javascript"></script>
	<script src="scripts/html5shiv-printshiv.js" type="text/javascrip

In [17]:
## get dropdown html
dropdown_list = soup.find(id="dropdown")
dropdown_list

<ul class="dropdown-menu select-pipe-dropdown-menu" id="dropdown">
<li><a href="AGHome.asp?Pipe=AG">Algonquin (AGT)</a></li><li><a href="BGSHome.asp?Pipe=BGS">Bobcat Gas Storage (BGS)</a></li><li><a href="BIGHome.asp?Pipe=BIG">BIG Pipeline (BIG)</a></li><li><a href="BSPHome.asp?Pipe=BSP">Big Sandy Pipeline (BSP)</a></li><li><a href="EGHome.asp?Pipe=EG">MHP Egan (EHP)</a></li><li><a href="ETHome.asp?Pipe=ET">East Tennessee (ETNG)</a></li><li><a href="GBHome.asp?Pipe=GB">Garden Banks (GB)</a></li><li><a href="GPLHome.asp?Pipe=GPL">Generation  Pipeline (GPL)</a></li><li><a href="MCGPHome.asp?Pipe=MCGP">Mississippi Canyon (MCGP)</a></li><li><a href="MBHome.asp?Pipe=MB">MHP Moss Bluff (MBHP)</a></li><li><a href="MNCAHome.asp?Pipe=MNCA">Maritimes &amp; Northeast Canada (MNCA)</a></li><li><a href="MNUSHome.asp?Pipe=MNUS">Maritimes &amp; Northeast U.S. (MNUS)</a></li><li><a href="MRHome.asp?Pipe=MR">Manta Ray Offshore Gathering Company (MR)</a></li><li><a href="NPCHome.asp?Pipe=NPC">Nautilus P

In [24]:
## get atags
atags = dropdown_list.find_all("a")
atags

[<a href="AGHome.asp?Pipe=AG">Algonquin (AGT)</a>,
 <a href="BGSHome.asp?Pipe=BGS">Bobcat Gas Storage (BGS)</a>,
 <a href="BIGHome.asp?Pipe=BIG">BIG Pipeline (BIG)</a>,
 <a href="BSPHome.asp?Pipe=BSP">Big Sandy Pipeline (BSP)</a>,
 <a href="EGHome.asp?Pipe=EG">MHP Egan (EHP)</a>,
 <a href="ETHome.asp?Pipe=ET">East Tennessee (ETNG)</a>,
 <a href="GBHome.asp?Pipe=GB">Garden Banks (GB)</a>,
 <a href="GPLHome.asp?Pipe=GPL">Generation  Pipeline (GPL)</a>,
 <a href="MCGPHome.asp?Pipe=MCGP">Mississippi Canyon (MCGP)</a>,
 <a href="MBHome.asp?Pipe=MB">MHP Moss Bluff (MBHP)</a>,
 <a href="MNCAHome.asp?Pipe=MNCA">Maritimes &amp; Northeast Canada (MNCA)</a>,
 <a href="MNUSHome.asp?Pipe=MNUS">Maritimes &amp; Northeast U.S. (MNUS)</a>,
 <a href="MRHome.asp?Pipe=MR">Manta Ray Offshore Gathering Company (MR)</a>,
 <a href="NPCHome.asp?Pipe=NPC">Nautilus Pipeline Company (NPC)</a>,
 <a href="NXCAHome.asp?Pipe=NXCA">NEXUS ULC (NXCA)</a>,
 <a href="NXUSHome.asp?Pipe=NXUS">NEXUS U.S. (NXUS)</a>,
 <a href

In [25]:
## get the hrefs with a list comprehension (LC)
hrefs = [atag.get("href") for atag in atags]
hrefs

['AGHome.asp?Pipe=AG',
 'BGSHome.asp?Pipe=BGS',
 'BIGHome.asp?Pipe=BIG',
 'BSPHome.asp?Pipe=BSP',
 'EGHome.asp?Pipe=EG',
 'ETHome.asp?Pipe=ET',
 'GBHome.asp?Pipe=GB',
 'GPLHome.asp?Pipe=GPL',
 'MCGPHome.asp?Pipe=MCGP',
 'MBHome.asp?Pipe=MB',
 'MNCAHome.asp?Pipe=MNCA',
 'MNUSHome.asp?Pipe=MNUS',
 'MRHome.asp?Pipe=MR',
 'NPCHome.asp?Pipe=NPC',
 'NXCAHome.asp?Pipe=NXCA',
 'NXUSHome.asp?Pipe=NXUS',
 'SESHHome.asp?Pipe=SESH',
 'SGHome.asp?Pipe=SG',
 'SRHome.asp?Pipe=SR',
 'STTHome.asp?Pipe=STT',
 'TEHome.asp?Pipe=TE',
 'TPGSHome.asp?Pipe=TPGS',
 'VCPHome.asp?Pipe=VCP',
 'WRGSHome.asp?Pipe=WRGS']

In [27]:
## get the hrefs with a for loop (FL)

hrefs_fl = []
for atag in atags:
    hrefs_fl.append(atag.get("href"))
    
hrefs_fl

['AGHome.asp?Pipe=AG',
 'BGSHome.asp?Pipe=BGS',
 'BIGHome.asp?Pipe=BIG',
 'BSPHome.asp?Pipe=BSP',
 'EGHome.asp?Pipe=EG',
 'ETHome.asp?Pipe=ET',
 'GBHome.asp?Pipe=GB',
 'GPLHome.asp?Pipe=GPL',
 'MCGPHome.asp?Pipe=MCGP',
 'MBHome.asp?Pipe=MB',
 'MNCAHome.asp?Pipe=MNCA',
 'MNUSHome.asp?Pipe=MNUS',
 'MRHome.asp?Pipe=MR',
 'NPCHome.asp?Pipe=NPC',
 'NXCAHome.asp?Pipe=NXCA',
 'NXUSHome.asp?Pipe=NXUS',
 'SESHHome.asp?Pipe=SESH',
 'SGHome.asp?Pipe=SG',
 'SRHome.asp?Pipe=SR',
 'STTHome.asp?Pipe=STT',
 'TEHome.asp?Pipe=TE',
 'TPGSHome.asp?Pipe=TPGS',
 'VCPHome.asp?Pipe=VCP',
 'WRGSHome.asp?Pipe=WRGS']
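
These hrefs are relative, so they can't be requested as they are. If we ever needed each pipeline's own homepage, `urljoin` from the standard library could resolve them against the homepage URL. A small sketch using three hrefs copied from the output above; the names `sample_hrefs` and `absolute_links` are ours, not part of the site.

```python
from urllib.parse import urljoin

homepage_url = "https://infopost.enbridge.com/InfoPost/"

# three relative hrefs copied from the dropdown output
sample_hrefs = ["AGHome.asp?Pipe=AG", "BGSHome.asp?Pipe=BGS", "TEHome.asp?Pipe=TE"]

# resolve each relative href against the homepage URL
absolute_links = [urljoin(homepage_url, href) for href in sample_hrefs]

for link in absolute_links:
    print(link)
```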

In [29]:
## target codes only
unit_codes = [href.split('=')[1] for href in hrefs]
unit_codes

['AG',
 'BGS',
 'BIG',
 'BSP',
 'EG',
 'ET',
 'GB',
 'GPL',
 'MCGP',
 'MB',
 'MNCA',
 'MNUS',
 'MR',
 'NPC',
 'NXCA',
 'NXUS',
 'SESH',
 'SG',
 'SR',
 'STT',
 'TE',
 'TPGS',
 'VCP',
 'WRGS']
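
Splitting on `=` works here because every href carries exactly one parameter. A slightly more defensive sketch, in case a link ever had extra parameters, reads the `Pipe` value by name. `get_pipe_code` is a hypothetical helper, and the last sample href is made up to show the difference.

```python
from urllib.parse import urlsplit, parse_qs

def get_pipe_code(href):
    """Hypothetical helper: return the value of the Pipe parameter, or None."""
    query = urlsplit(href).query           # the part after the "?"
    values = parse_qs(query).get("Pipe")   # list of values, or None if missing
    return values[0] if values else None

sample_hrefs = [
    "AGHome.asp?Pipe=AG",
    "WRGSHome.asp?Pipe=WRGS",
    "TEHome.asp?Pipe=TE&lang=en",   # made-up href with a second parameter
]

sample_codes = [get_pipe_code(href) for href in sample_hrefs]
print(sample_codes)
```

With the made-up third href, `href.split('=')[1]` would return `TE&lang`, while the named lookup still returns `TE`.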

In [30]:
import this

The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!


In [31]:
## url templates
base_url =\
"https://infopost.enbridge.com/InfoPost/NoticesList.asp?pipe="

end_url = "&type=CRI"

In [32]:
## attach codes to base urls
links = [f"{base_url}{unit_code}{end_url}" for unit_code in unit_codes]
links

['https://infopost.enbridge.com/InfoPost/NoticesList.asp?pipe=AG&type=CRI',
 'https://infopost.enbridge.com/InfoPost/NoticesList.asp?pipe=BGS&type=CRI',
 'https://infopost.enbridge.com/InfoPost/NoticesList.asp?pipe=BIG&type=CRI',
 'https://infopost.enbridge.com/InfoPost/NoticesList.asp?pipe=BSP&type=CRI',
 'https://infopost.enbridge.com/InfoPost/NoticesList.asp?pipe=EG&type=CRI',
 'https://infopost.enbridge.com/InfoPost/NoticesList.asp?pipe=ET&type=CRI',
 'https://infopost.enbridge.com/InfoPost/NoticesList.asp?pipe=GB&type=CRI',
 'https://infopost.enbridge.com/InfoPost/NoticesList.asp?pipe=GPL&type=CRI',
 'https://infopost.enbridge.com/InfoPost/NoticesList.asp?pipe=MCGP&type=CRI',
 'https://infopost.enbridge.com/InfoPost/NoticesList.asp?pipe=MB&type=CRI',
 'https://infopost.enbridge.com/InfoPost/NoticesList.asp?pipe=MNCA&type=CRI',
 'https://infopost.enbridge.com/InfoPost/NoticesList.asp?pipe=MNUS&type=CRI',
 'https://infopost.enbridge.com/InfoPost/NoticesList.asp?pipe=MR&type=CRI',
 '

In [33]:
len(links)

24

### enumerate()

`enumerate()` hands us a counter alongside each item, so we can track our progress with very little code.

In [34]:
fruits = ["apple", "orange", "plum", "pear", "banana"]

In [36]:
for index, fruit in enumerate(fruits, start = 1):
    print(f"{index}: {fruit}")

1: apple
2: orange
3: plum
4: pear
5: banana


In [41]:
## scrape all gas units at one time
df_list = []
broken_links = []
total_links = len(unit_codes)

for counter, unit_code in enumerate(unit_codes, start = 1):
    target_link = f"{base_url}{unit_code}{end_url}"
    print(f"Scraping {counter} of {total_links}")
    try:
        data = pd.read_html(target_link)
        df = data[0]
        df["unit"] = unit_code
        df_list.append(df)
    except Exception:
        ## pd.read_html raises an error when the request fails or no table is found
        print(f"{unit_code} was busted or had no table")
        broken_links.append(target_link)
    finally:
        snooze = randrange(10,20)
        print(f"Snoozing for {snooze} seconds")
        time.sleep(snooze)
        
print("Done scraping all units")

Scraping 1 of 24
Snoozing for 13 seconds
Scraping 2 of 24
Snoozing for 14 seconds
Scraping 3 of 24
Snoozing for 15 seconds
Scraping 4 of 24
Snoozing for 17 seconds
Scraping 5 of 24
Snoozing for 16 seconds
Scraping 6 of 24
Snoozing for 11 seconds
Scraping 7 of 24
Snoozing for 17 seconds
Scraping 8 of 24
Snoozing for 13 seconds
Scraping 9 of 24
Snoozing for 11 seconds
Scraping 10 of 24
Snoozing for 15 seconds
Scraping 11 of 24
Snoozing for 14 seconds
Scraping 12 of 24
Snoozing for 11 seconds
Scraping 13 of 24
Snoozing for 16 seconds
Scraping 14 of 24
Snoozing for 13 seconds
Scraping 15 of 24
Snoozing for 13 seconds
Scraping 16 of 24
Snoozing for 17 seconds
Scraping 17 of 24
Snoozing for 14 seconds
Scraping 18 of 24
Snoozing for 19 seconds
Scraping 19 of 24
Snoozing for 11 seconds
Scraping 20 of 24
Snoozing for 15 seconds
Scraping 21 of 24
Snoozing for 19 seconds
Scraping 22 of 24
Snoozing for 14 seconds
Scraping 23 of 24
Snoozing for 18 seconds
Scraping 24 of 24
Snoozing for 16 seconds
D
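
Every unit scraped cleanly in the run above, so the `except` branch never fired. To see how the `try`/`except`/`finally` pattern behaves when a page does fail, here is a small offline sketch. `fake_scrape` is a hypothetical stand-in for `pd.read_html`, the unit code `XX` is made up, and the pause is set to zero so the cell runs instantly.

```python
import time

def fake_scrape(unit_code):
    """Hypothetical stand-in for pd.read_html: fails for the made-up code 'XX'."""
    if unit_code == "XX":
        raise ValueError("No tables found")
    return [{"unit": unit_code}]

results = []
failed_codes = []
pauses = 0

for counter, unit_code in enumerate(["AG", "XX", "TE"], start=1):
    print(f"Scraping {counter} of 3")
    try:
        results.extend(fake_scrape(unit_code))
    except ValueError:
        # record the failure so the unit can be retried later
        failed_codes.append(unit_code)
    finally:
        # runs after every attempt, success or failure
        pauses += 1
        time.sleep(0)

print(results)
print(failed_codes)
print(pauses)
```

The failed code ends up in `failed_codes`, the loop keeps going, and the pause still happens after each of the three attempts.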

In [42]:
df_list

[                Notice Type        Posted Date/Time  \
 0       Capacity Constraint  11/04/2024 02:46:58 PM   
 1       Capacity Constraint  11/03/2024 03:04:06 PM   
 2       Capacity Constraint  11/02/2024 02:59:13 PM   
 3       Capacity Constraint  11/01/2024 02:48:20 PM   
 4    Operational Flow Order  10/31/2024 03:00:00 PM   
 ..                      ...                     ...   
 125     Capacity Constraint  08/10/2024 03:15:53 PM   
 126     Capacity Constraint  08/09/2024 03:32:59 PM   
 127     Capacity Constraint  08/08/2024 03:05:00 PM   
 128  Operational Flow Order  08/07/2024 03:00:00 PM   
 129     Capacity Constraint  08/07/2024 02:57:05 PM   
 
     Notice Effective Date/Time    Notice End Date/Time  Notice Identifier  \
 0       11/05/2024 09:00:00 AM  11/06/2024 09:00:00 AM             154761   
 1       11/04/2024 09:00:00 AM  11/05/2024 09:00:00 AM             154725   
 2       11/03/2024 09:00:00 AM  11/04/2024 09:00:00 AM             154679   
 3       11/02

In [43]:
final_df = pd.concat(df_list, ignore_index = True)


In [44]:
final_df

Unnamed: 0,Notice Type,Posted Date/Time,Notice Effective Date/Time,Notice End Date/Time,Notice Identifier,Subject,Response Date/Time,unit
0,Capacity Constraint,11/04/2024 02:46:58 PM,11/05/2024 09:00:00 AM,11/06/2024 09:00:00 AM,154761,AGT Pipeline Conditions for 11/5/2024,,AG
1,Capacity Constraint,11/03/2024 03:04:06 PM,11/04/2024 09:00:00 AM,11/05/2024 09:00:00 AM,154725,AGT Pipeline Conditions for 11/4/2024,,AG
2,Capacity Constraint,11/02/2024 02:59:13 PM,11/03/2024 09:00:00 AM,11/04/2024 09:00:00 AM,154679,AGT Pipeline Conditions for 11/3/2024,,AG
3,Capacity Constraint,11/01/2024 02:48:20 PM,11/02/2024 09:00:00 AM,11/03/2024 09:00:00 AM,154657,AGT Pipeline Conditions for 11/2/2024,,AG
4,Operational Flow Order,10/31/2024 03:00:00 PM,11/02/2024 09:00:00 AM,01/29/2025 09:00:00 AM,154578,AGT Operational Flow Order -- UPDATE EFF 11/2,,AG
...,...,...,...,...,...,...,...,...
1049,Force Majeure,09/12/2024 05:03:55 PM,09/12/2024 05:03:55 PM,12/11/2024 05:03:55 PM,152512,Walker Ridge - Hurricane Francine - System Update,,WRGS
1050,Force Majeure,09/12/2024 11:47:21 AM,09/12/2024 11:47:21 AM,12/11/2024 11:47:21 AM,152477,Walker Ridge - Tropical Storm Francine - Syste...,,WRGS
1051,Operational Alert,09/09/2024 02:10:12 PM,09/09/2024 02:10:12 PM,12/08/2024 02:10:12 PM,152286,Walker Ridge - Tropical Storm Francine - Evacu...,,WRGS
1052,Operational Alert,09/09/2024 10:03:29 AM,09/09/2024 10:03:29 AM,12/08/2024 10:03:29 AM,152278,Walker Ridge - Tropical Disturbance 24 - Evacu...,,WRGS
