# Scraping a fixed URL site

We'll often encounter websites where the URL never changes as we navigate. Here are a few examples:


- <a href="https://www.seethroughny.net/">See Through NY</a> 
- <a href="https://restructuring.ra.kroll.com/pge/Home-ClaimInfo">PG&E fire victim creditors</a>


Winter's coming, and we want to track an energy company's critical notices to see whether its maintenance issues could cause prices to spike.

From <a href="https://infopost.enbridge.com/InfoPost/">this homepage</a>, we want to scrape all the critical notices for all their business units.

<img src="https://sandeepmj.github.io/image-host/energy-scrape.png">

Let's explore the site to come up with our scrape strategy.

A good approach is always to scrape a single page first to confirm we can, and then scale up to all the pages.

## Single Page Scrape

Determine how to scrape a single page.


In [13]:
pip install lxml

Note: you may need to restart the kernel to use updated packages.


In [15]:
## import libraries
import lxml
import requests ## request content from websites
import pandas as pd ## organize scraped data
from bs4 import BeautifulSoup ## parse content from websites as html
from random import uniform, sample ## uniform: random float in a range; sample: random picks from a list
import time ## to slow down our scrapes


In [3]:
## target url
url = "https://infopost.enbridge.com/InfoPost/NoticesList.asp?pipe=AG&type=CRI"

In [19]:
## scrape table with pandas
all_data = pd.read_html(url)
type(all_data)
type(all_data[0])

pandas.core.frame.DataFrame

In [23]:
## grab the first table in the list as a DataFrame
df = all_data[0]
df

Unnamed: 0,Notice Type,Posted Date/Time,Notice Effective Date/Time,Notice End Date/Time,Notice Identifier,Subject,Response Date/Time
0,Capacity Constraint,10/06/2025 03:09:28 PM,10/07/2025 09:00:00 AM,10/08/2025 09:00:00 AM,168324,AGT Pipeline Conditions for 10/7/2025,
1,Capacity Constraint,10/05/2025 02:46:36 PM,10/06/2025 09:00:00 AM,10/07/2025 09:00:00 AM,168269,AGT Pipeline Conditions for 10/6/2025,
2,Capacity Constraint,10/04/2025 02:47:08 PM,10/05/2025 09:00:00 AM,10/06/2025 09:00:00 AM,168236,AGT Pipeline Conditions for 10/5/2025,
3,Capacity Constraint,10/03/2025 03:00:34 PM,10/04/2025 09:00:00 AM,10/05/2025 09:00:00 AM,168202,AGT Pipeline Conditions for 10/4/2025,
4,Capacity Constraint,10/02/2025 03:15:54 PM,10/03/2025 09:00:00 AM,10/04/2025 09:00:00 AM,168169,AGT Pipeline Conditions for 10/3/2025,
...,...,...,...,...,...,...,...
125,Capacity Constraint,07/12/2025 02:47:03 PM,07/13/2025 09:00:00 AM,07/14/2025 09:00:00 AM,165035,AGT Pipeline Conditions for 7/13/2025,
126,Capacity Constraint,07/11/2025 06:30:29 PM,07/11/2025 06:30:29 PM,07/13/2025 09:00:00 AM,165025,AGT Pipeline Conditions for 7/12/2025 -- INTRADAY,
127,Capacity Constraint,07/11/2025 02:59:24 PM,07/12/2025 09:00:00 AM,07/13/2025 09:00:00 AM,165005,AGT Pipeline Conditions for 7/12/2025,
128,Capacity Constraint,07/10/2025 03:20:01 PM,07/11/2025 09:00:00 AM,07/12/2025 09:00:00 AM,164968,AGT Pipeline Conditions for 7/11/2025,


In [31]:
## GET ALL URLS

## get notice id numbers

notice_ids = df["Notice Identifier"].to_list()
notice_ids[:10]




[168324,
 168269,
 168236,
 168202,
 168169,
 168152,
 168133,
 168122,
 168061,
 168013]

In [33]:
## url pieces: each notice id goes between these two

start_url = "https://infopost.enbridge.com/InfoPost/NoticeListDetail.asp?strKey1="
end_url = "&type=CRI&Embed=2&pipe=AG"

In [41]:
## build the links with a for loop
links_fl = []
for notice_id in notice_ids:
    links_fl.append(f"{start_url}{notice_id}{end_url}")
links_fl   

['https://infopost.enbridge.com/InfoPost/NoticeListDetail.asp?strKey1=168324&type=CRI&Embed=2&pipe=AG',
 'https://infopost.enbridge.com/InfoPost/NoticeListDetail.asp?strKey1=168269&type=CRI&Embed=2&pipe=AG',
 'https://infopost.enbridge.com/InfoPost/NoticeListDetail.asp?strKey1=168236&type=CRI&Embed=2&pipe=AG',
 'https://infopost.enbridge.com/InfoPost/NoticeListDetail.asp?strKey1=168202&type=CRI&Embed=2&pipe=AG',
 'https://infopost.enbridge.com/InfoPost/NoticeListDetail.asp?strKey1=168169&type=CRI&Embed=2&pipe=AG',
 'https://infopost.enbridge.com/InfoPost/NoticeListDetail.asp?strKey1=168152&type=CRI&Embed=2&pipe=AG',
 'https://infopost.enbridge.com/InfoPost/NoticeListDetail.asp?strKey1=168133&type=CRI&Embed=2&pipe=AG',
 'https://infopost.enbridge.com/InfoPost/NoticeListDetail.asp?strKey1=168122&type=CRI&Embed=2&pipe=AG',
 'https://infopost.enbridge.com/InfoPost/NoticeListDetail.asp?strKey1=168061&type=CRI&Embed=2&pipe=AG',
 'https://infopost.enbridge.com/InfoPost/NoticeListDetail.asp?st

In [43]:
## create list of links with a list comprehension

links_lc = [f"{start_url}{notice_id}{end_url}" for notice_id in notice_ids]
links_lc

['https://infopost.enbridge.com/InfoPost/NoticeListDetail.asp?strKey1=168324&type=CRI&Embed=2&pipe=AG',
 'https://infopost.enbridge.com/InfoPost/NoticeListDetail.asp?strKey1=168269&type=CRI&Embed=2&pipe=AG',
 'https://infopost.enbridge.com/InfoPost/NoticeListDetail.asp?strKey1=168236&type=CRI&Embed=2&pipe=AG',
 'https://infopost.enbridge.com/InfoPost/NoticeListDetail.asp?strKey1=168202&type=CRI&Embed=2&pipe=AG',
 'https://infopost.enbridge.com/InfoPost/NoticeListDetail.asp?strKey1=168169&type=CRI&Embed=2&pipe=AG',
 'https://infopost.enbridge.com/InfoPost/NoticeListDetail.asp?strKey1=168152&type=CRI&Embed=2&pipe=AG',
 'https://infopost.enbridge.com/InfoPost/NoticeListDetail.asp?strKey1=168133&type=CRI&Embed=2&pipe=AG',
 'https://infopost.enbridge.com/InfoPost/NoticeListDetail.asp?strKey1=168122&type=CRI&Embed=2&pipe=AG',
 'https://infopost.enbridge.com/InfoPost/NoticeListDetail.asp?strKey1=168061&type=CRI&Embed=2&pipe=AG',
 'https://infopost.enbridge.com/InfoPost/NoticeListDetail.asp?st

## Next step: scrape ALL the gas lines.

What is our approach?

### Using ```Headers``` When Web Scraping

**Headers make your scraper look like a real browser instead of a bot.** 

Websites can easily detect and block requests that lack typical browser information, returning 403 errors or empty content.

**The key header is `User-Agent`:**
```python
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'}

```

### `pd.read_html()` usually works without headers

**`pd.read_html()` is built for parsing HTML tables, not for general web scraping.** 

When given a URL, it downloads the page itself and returns every `<table>` element it finds as a DataFrame. It does not send browser-like headers, so it only works on sites that serve their pages to basic requests, which many do.

If a site blocks it, fetch the page with `requests` and your headers, then hand the HTML text to `pd.read_html()`.
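
When a site does turn away `pd.read_html()`, one option is to fetch the page with `requests` and a `User-Agent` header, then pass the HTML text to `pd.read_html()`. A minimal sketch: the helper name `tables_from_html` and the sample HTML are made up for illustration, and the live request is left commented out.

```python
from io import StringIO

import pandas as pd

def tables_from_html(html_text):
    """Parse every <table> in an HTML string into a list of DataFrames."""
    ## wrap the text in StringIO so pandas reads it as file content, not a path
    return pd.read_html(StringIO(html_text))

## made-up HTML standing in for response.text
sample_html = """
<table>
  <tr><th>Notice Identifier</th><th>Subject</th></tr>
  <tr><td>1</td><td>Example notice</td></tr>
</table>
"""

tables = tables_from_html(sample_html)
print(tables[0])

## against a live page (not run here):
## response = requests.get(url, headers=headers)
## tables = tables_from_html(response.text)
```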

In [45]:
## create headers

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'
}

In [77]:
## request the homepage
## target: the <ul> with id "dropdown"

homepage = "https://infopost.enbridge.com/InfoPost/"
response = requests.get(homepage, headers = headers)




In [79]:
type(response)

requests.models.Response

In [81]:
response.status_code

200

In [83]:
response.text

'<!DOCTYPE html>\r\n<!-- template.asp -->\r\n\r\n\r\n<HTML lang="en">\r\n<HEAD>\r\n<meta charset="utf-8"/>\r\n\r\n <meta http-equiv="X-UA-Compatible" content="IE=edge">\r\n<meta name="viewport" content="width=device-width, initial-scale=1.0">\r\n\r\n<link rel="shortcut icon" href="favicon.ico" />\r\n<link href="css/jquery-ui.min.css" rel="stylesheet" type="text/css" />\r\n<link href="css/bootstrap.min.css" rel="stylesheet">\r\n<link href="css/font-awesome-ie7.min.css" rel="stylesheet"/>\r\n<link href="css/font-awesome.css" rel="stylesheet">\r\n\r\n<link href="css/link.css" rel="stylesheet" />\r\n<link href="css/print.css" rel="stylesheet" media="print" />\r\n<link href="css/infopost-custom.css" rel="stylesheet" media="screen" />\r\n<link href="css/environment.css" rel="stylesheet" media="screen" />\r\n<!-- HTML5 shim, for IE6-9 support of HTML5 elements -->\r\n<!--[if lt IE 9]>\r\n\t<link href="css/link-ie.css" rel="stylesheet" />\r\n\t<script src="scripts/html5shiv.js" type="text/java

In [87]:
soup = BeautifulSoup(response.text, "html.parser")
print(soup.prettify())

<!DOCTYPE html>
<!-- template.asp -->
<html lang="en">
 <head>
  <meta charset="utf-8"/>
  <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
  <meta content="width=device-width, initial-scale=1.0" name="viewport"/>
  <link href="favicon.ico" rel="shortcut icon"/>
  <link href="css/jquery-ui.min.css" rel="stylesheet" type="text/css"/>
  <link href="css/bootstrap.min.css" rel="stylesheet"/>
  <link href="css/font-awesome-ie7.min.css" rel="stylesheet">
   <link href="css/font-awesome.css" rel="stylesheet"/>
   <link href="css/link.css" rel="stylesheet">
    <link href="css/print.css" media="print" rel="stylesheet"/>
    <link href="css/infopost-custom.css" media="screen" rel="stylesheet"/>
    <link href="css/environment.css" media="screen" rel="stylesheet"/>
    <!-- HTML5 shim, for IE6-9 support of HTML5 elements -->
    <!--[if lt IE 9]>
	<link href="css/link-ie.css" rel="stylesheet" />
	<script src="scripts/html5shiv.js" type="text/javascript"></script>
	<script src="scripts/html

In [93]:
## get dropdown html
dropdown = soup.find(id="dropdown")
type(dropdown)
dropdown

<ul class="dropdown-menu select-pipe-dropdown-menu" id="dropdown">
<li><a href="AGHome.asp?Pipe=AG">Algonquin (AGT)</a></li><li><a href="BGSHome.asp?Pipe=BGS">Bobcat Gas Storage (BGS)</a></li><li><a href="BIGHome.asp?Pipe=BIG">BIG Pipeline (BIG)</a></li><li><a href="BSPHome.asp?Pipe=BSP">Big Sandy Pipeline (BSP)</a></li><li><a href="EGHome.asp?Pipe=EG">MHP Egan (EHP)</a></li><li><a href="ETHome.asp?Pipe=ET">East Tennessee (ETNG)</a></li><li><a href="GBHome.asp?Pipe=GB">Garden Banks (GB)</a></li><li><a href="GPLHome.asp?Pipe=GPL">Generation  Pipeline (GPL)</a></li><li><a href="MCGPHome.asp?Pipe=MCGP">Mississippi Canyon (MCGP)</a></li><li><a href="MBHome.asp?Pipe=MB">MHP Moss Bluff (MBHP)</a></li><li><a href="MNCAHome.asp?Pipe=MNCA">Maritimes &amp; Northeast Canada (MNCA)</a></li><li><a href="MNUSHome.asp?Pipe=MNUS">Maritimes &amp; Northeast U.S. (MNUS)</a></li><li><a href="MRHome.asp?Pipe=MR">Manta Ray Offshore Gathering Company (MR)</a></li><li><a href="NPCHome.asp?Pipe=NPC">Nautilus P

In [99]:
## get atags
atags = dropdown.find_all("a")
atags

[<a href="AGHome.asp?Pipe=AG">Algonquin (AGT)</a>,
 <a href="BGSHome.asp?Pipe=BGS">Bobcat Gas Storage (BGS)</a>,
 <a href="BIGHome.asp?Pipe=BIG">BIG Pipeline (BIG)</a>,
 <a href="BSPHome.asp?Pipe=BSP">Big Sandy Pipeline (BSP)</a>,
 <a href="EGHome.asp?Pipe=EG">MHP Egan (EHP)</a>,
 <a href="ETHome.asp?Pipe=ET">East Tennessee (ETNG)</a>,
 <a href="GBHome.asp?Pipe=GB">Garden Banks (GB)</a>,
 <a href="GPLHome.asp?Pipe=GPL">Generation  Pipeline (GPL)</a>,
 <a href="MCGPHome.asp?Pipe=MCGP">Mississippi Canyon (MCGP)</a>,
 <a href="MBHome.asp?Pipe=MB">MHP Moss Bluff (MBHP)</a>,
 <a href="MNCAHome.asp?Pipe=MNCA">Maritimes &amp; Northeast Canada (MNCA)</a>,
 <a href="MNUSHome.asp?Pipe=MNUS">Maritimes &amp; Northeast U.S. (MNUS)</a>,
 <a href="MRHome.asp?Pipe=MR">Manta Ray Offshore Gathering Company (MR)</a>,
 <a href="NPCHome.asp?Pipe=NPC">Nautilus Pipeline Company (NPC)</a>,
 <a href="NXCAHome.asp?Pipe=NXCA">NEXUS ULC (NXCA)</a>,
 <a href="NXUSHome.asp?Pipe=NXUS">NEXUS U.S. (NXUS)</a>,
 <a href

In [107]:
## get the hrefs with a list comprehension
hrefs = [atag.get("href") for atag in atags]
hrefs

['AGHome.asp?Pipe=AG',
 'BGSHome.asp?Pipe=BGS',
 'BIGHome.asp?Pipe=BIG',
 'BSPHome.asp?Pipe=BSP',
 'EGHome.asp?Pipe=EG',
 'ETHome.asp?Pipe=ET',
 'GBHome.asp?Pipe=GB',
 'GPLHome.asp?Pipe=GPL',
 'MCGPHome.asp?Pipe=MCGP',
 'MBHome.asp?Pipe=MB',
 'MNCAHome.asp?Pipe=MNCA',
 'MNUSHome.asp?Pipe=MNUS',
 'MRHome.asp?Pipe=MR',
 'NPCHome.asp?Pipe=NPC',
 'NXCAHome.asp?Pipe=NXCA',
 'NXUSHome.asp?Pipe=NXUS',
 'SESHHome.asp?Pipe=SESH',
 'SGHome.asp?Pipe=SG',
 'SRHome.asp?Pipe=SR',
 'STTHome.asp?Pipe=STT',
 'TEHome.asp?Pipe=TE',
 'TPGSHome.asp?Pipe=TPGS',
 'VCPHome.asp?Pipe=VCP',
 'WEHome.asp?Pipe=WE',
 'WRGSHome.asp?Pipe=WRGS']
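
These hrefs are relative to the homepage. If we ever want to visit one directly, `urljoin()` from the standard library can turn it into a full link, the same way a browser would resolve it. A small sketch using three of the hrefs above:

```python
from urllib.parse import urljoin

homepage = "https://infopost.enbridge.com/InfoPost/"
sample_hrefs = ["AGHome.asp?Pipe=AG", "BGSHome.asp?Pipe=BGS", "TEHome.asp?Pipe=TE"]

## join each relative href to the homepage to get an absolute url
full_links = [urljoin(homepage, href) for href in sample_hrefs]
for link in full_links:
    print(link)
```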

In [111]:
## get the hrefs with a for loop
hrefs_fl = []
for atag in atags:
    hrefs_fl.append(atag.get("href"))

hrefs_fl

['AGHome.asp?Pipe=AG',
 'BGSHome.asp?Pipe=BGS',
 'BIGHome.asp?Pipe=BIG',
 'BSPHome.asp?Pipe=BSP',
 'EGHome.asp?Pipe=EG',
 'ETHome.asp?Pipe=ET',
 'GBHome.asp?Pipe=GB',
 'GPLHome.asp?Pipe=GPL',
 'MCGPHome.asp?Pipe=MCGP',
 'MBHome.asp?Pipe=MB',
 'MNCAHome.asp?Pipe=MNCA',
 'MNUSHome.asp?Pipe=MNUS',
 'MRHome.asp?Pipe=MR',
 'NPCHome.asp?Pipe=NPC',
 'NXCAHome.asp?Pipe=NXCA',
 'NXUSHome.asp?Pipe=NXUS',
 'SESHHome.asp?Pipe=SESH',
 'SGHome.asp?Pipe=SG',
 'SRHome.asp?Pipe=SR',
 'STTHome.asp?Pipe=STT',
 'TEHome.asp?Pipe=TE',
 'TPGSHome.asp?Pipe=TPGS',
 'VCPHome.asp?Pipe=VCP',
 'WEHome.asp?Pipe=WE',
 'WRGSHome.asp?Pipe=WRGS']

In [139]:
animals = ["cat", "dog", "rat"]
animals[1]

'dog'

In [145]:
## split each href on "=" and keep the unit code after it
unit_codes = [href.split('=')[1] for href in hrefs]
unit_codes

['AG',
 'BGS',
 'BIG',
 'BSP',
 'EG',
 'ET',
 'GB',
 'GPL',
 'MCGP',
 'MB',
 'MNCA',
 'MNUS',
 'MR',
 'NPC',
 'NXCA',
 'NXUS',
 'SESH',
 'SG',
 'SR',
 'STT',
 'TE',
 'TPGS',
 'VCP',
 'WE',
 'WRGS']
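
Splitting on `=` works because each href carries exactly one parameter. If an href ever carried more than one, a sturdier option might be to read the `Pipe` value with `urllib.parse`. A sketch, assuming the parameter is always named `Pipe` as in the hrefs above; `get_pipe_code` is a hypothetical helper name.

```python
from urllib.parse import urlparse, parse_qs

def get_pipe_code(href):
    """Return the value of the Pipe parameter in an href."""
    query = urlparse(href).query       ## the part after the "?"
    return parse_qs(query)["Pipe"][0]  ## parse_qs stores each value in a list

sample_hrefs = ["AGHome.asp?Pipe=AG", "BGSHome.asp?Pipe=BGS", "MNCAHome.asp?Pipe=MNCA"]
codes = [get_pipe_code(href) for href in sample_hrefs]
print(codes)
```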

In [147]:
for href in hrefs:
    print(href.split("=")[1])
    print("**********")

AG
**********
BGS
**********
BIG
**********
BSP
**********
EG
**********
ET
**********
GB
**********
GPL
**********
MCGP
**********
MB
**********
MNCA
**********
MNUS
**********
MR
**********
NPC
**********
NXCA
**********
NXUS
**********
SESH
**********
SG
**********
SR
**********
STT
**********
TE
**********
TPGS
**********
VCP
**********
WE
**********
WRGS
**********


In [149]:
type(unit_codes)

list

In [125]:
unit_codes[0]

'AG'

In [None]:
## target codes only


In [153]:
## url templates
start_url = "https://infopost.enbridge.com/InfoPost/NoticesList.asp?pipe="
end_url = "&type=CRI"

In [155]:
## attach codes to base urls
links = [f"{start_url}{unit_code}{end_url}" for unit_code in unit_codes]
links

['https://infopost.enbridge.com/InfoPost/NoticesList.asp?pipe=AG&type=CRI',
 'https://infopost.enbridge.com/InfoPost/NoticesList.asp?pipe=BGS&type=CRI',
 'https://infopost.enbridge.com/InfoPost/NoticesList.asp?pipe=BIG&type=CRI',
 'https://infopost.enbridge.com/InfoPost/NoticesList.asp?pipe=BSP&type=CRI',
 'https://infopost.enbridge.com/InfoPost/NoticesList.asp?pipe=EG&type=CRI',
 'https://infopost.enbridge.com/InfoPost/NoticesList.asp?pipe=ET&type=CRI',
 'https://infopost.enbridge.com/InfoPost/NoticesList.asp?pipe=GB&type=CRI',
 'https://infopost.enbridge.com/InfoPost/NoticesList.asp?pipe=GPL&type=CRI',
 'https://infopost.enbridge.com/InfoPost/NoticesList.asp?pipe=MCGP&type=CRI',
 'https://infopost.enbridge.com/InfoPost/NoticesList.asp?pipe=MB&type=CRI',
 'https://infopost.enbridge.com/InfoPost/NoticesList.asp?pipe=MNCA&type=CRI',
 'https://infopost.enbridge.com/InfoPost/NoticesList.asp?pipe=MNUS&type=CRI',
 'https://infopost.enbridge.com/InfoPost/NoticesList.asp?pipe=MR&type=CRI',
 '

In [157]:
## length of links
len(links)

25

### ```enumerate()```

We can create an efficient counter with minimal coding to track our progress.

In [159]:
## quick explanation
## run this cell
fruits = ["apple", "orange", "plum", "pear", "banana"]
fruits

['apple', 'orange', 'plum', 'pear', 'banana']

In [169]:
## demo enumerate
for i, fruit in enumerate(fruits, start = 1):
    print(i, fruit)

1 apple
2 orange
3 plum
4 pear
5 banana


In [None]:
## scrape all gas units at one time
df_list = []
broken_links = []
total_links = len(unit_codes)

for counter, unit_code in enumerate(unit_codes, start = 1):
    target_link = f"{start_url}{unit_code}{end_url}"
    print(f"Scraping {counter} of {total_links}")
    try:
        data = pd.read_html(target_link)
        df = data[0]
        df["unit"] = unit_code ## tag each row with its business unit
        df_list.append(df)
    except Exception:
        print(f"{unit_code} was busted or had no table")
        broken_links.append(target_link)
    finally:
        snooze = uniform(10,20)
        print(f"Snoozing for {snooze} seconds")
        time.sleep(snooze)

print("Done scraping all units")
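
Before pointing the loop at all 25 units, it can help to rehearse the `try`/`except`/`finally` flow without touching the site. A minimal sketch: `fake_read` is a hypothetical stand-in for `pd.read_html()` that fails for one made-up code, and the snooze is shortened to a fraction of a second.

```python
import time
from random import uniform

def fake_read(unit_code):
    """Hypothetical stand-in for pd.read_html(); fails on one code."""
    if unit_code == "BAD":
        raise ValueError("No tables found")
    return [f"table for {unit_code}"]

test_codes = ["AG", "BAD", "TE"]  ## "BAD" is made up to trigger the except branch
results = []
broken = []

for counter, unit_code in enumerate(test_codes, start=1):
    print(f"Scraping {counter} of {len(test_codes)}")
    try:
        data = fake_read(unit_code)
        results.append(data[0])
    except Exception:
        ## runs only when the try block raises
        print(f"{unit_code} was busted or had no table")
        broken.append(unit_code)
    finally:
        ## runs every time, success or failure
        time.sleep(uniform(0.01, 0.05))

print(results)
print(broken)
```

Because the pause sits in `finally`, the scraper waits between requests even when a unit fails.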

In [None]:
## call our list
## what do we have?


In [None]:
## concat
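
`pd.concat()` can stack a list of DataFrames into one. A minimal sketch with two tiny made-up frames standing in for `df_list`; the identifiers are placeholders, not scraped data.

```python
import pandas as pd

## two tiny made-up DataFrames standing in for the scraped ones
df_a = pd.DataFrame({"Notice Identifier": [1, 2], "unit": ["AG", "AG"]})
df_b = pd.DataFrame({"Notice Identifier": [3], "unit": ["TE"]})
demo_list = [df_a, df_b]

## stack them; ignore_index=True renumbers the rows from 0
df_all = pd.concat(demo_list, ignore_index=True)
print(df_all)
```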


### Capture actual critical notices text

In [None]:
## create link to each text description.

## get notice id numbers


In [None]:
## name of units


### What are the elements that make up the url call?

In [None]:
## build parts of url
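
One way to see the parts is to let `urllib.parse` take apart one of the detail links built earlier in this notebook. A small sketch:

```python
from urllib.parse import urlparse, parse_qs

detail_link = "https://infopost.enbridge.com/InfoPost/NoticeListDetail.asp?strKey1=168324&type=CRI&Embed=2&pipe=AG"

parts = urlparse(detail_link)
params = parse_qs(parts.query)

print(parts.netloc)  ## the site
print(parts.path)    ## the page that serves a notice
print(params)        ## the query parameters, each value in a list
```

In the links we have seen so far, only `strKey1` changes from notice to notice; presumably `pipe` changes with the business unit.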


### Build our URLs using ```zip()```

In [None]:
## zip together and build new list of text descriptions
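
`zip()` pairs items from two lists position by position, which suits building one link per notice from its id and its unit code. A sketch, assuming the detail link takes the unit code in `pipe` the same way for every unit; the three ids are the first Algonquin ones from earlier, and `detail_start` and `detail_mid` are names made up here.

```python
detail_start = "https://infopost.enbridge.com/InfoPost/NoticeListDetail.asp?strKey1="
detail_mid = "&type=CRI&Embed=2&pipe="

demo_ids = [168324, 168269, 168236]
demo_units = ["AG", "AG", "AG"]

## zip walks both lists in step, giving one (id, unit) pair per notice
text_links = [
    f"{detail_start}{notice_id}{detail_mid}{unit}"
    for notice_id, unit in zip(demo_ids, demo_units)
]
for link in text_links:
    print(link)
```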



In [None]:
## pull out 10 random samples
## use our sample method from the random library (imported earlier)
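
`sample()` returns a new list of unique picks and leaves the original list unchanged. A sketch on a made-up list of 25 placeholder links (`example.com` is not the real site):

```python
from random import sample

## placeholder links standing in for our real list
demo_links = [f"https://example.com/notice/{n}" for n in range(1, 26)]

## pick 10 different links at random
test_links = sample(demo_links, 10)
print(test_links)
```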

