# Scraping a fixed URL site

We'll often encounter websites where the URL never changes as you navigate. Here are a few examples:

- <a href="https://eportal.miteco.gob.es/BoleHWeb/">Ministry for the Ecological Transition and the Demographic Challenge</a>.
- <a href="https://www.seethroughny.net/">See Through NY</a> 
- <a href="https://restructuring.ra.kroll.com/pge/Home-ClaimInfo">PG&E fire victim creditors</a>

From <a href="https://infopost.enbridge.com/InfoPost/">this homepage</a>, we want to scrape the critical notices for Algonquin Gas Transmission (AGT).

Let's explore the site to come up with our scrape strategy.


In [11]:
## import libraries

import requests
import pandas as pd
from bs4 import BeautifulSoup

In [2]:
## target url

url = "https://infopost.enbridge.com/InfoPost/NoticesList.asp?pipe=AG&type=CRI"

In [3]:
## response

response = requests.get(url)
response.status_code

200

In [4]:
type(response.text)

str

In [6]:
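## capture all tables (read_html returns a list of DataFrames)
## StringIO avoids the FutureWarning newer pandas raises for literal HTML strings

from io import StringIO

all_data = pd.read_html(StringIO(response.text))
all_data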

[                Notice Type        Posted Date/Time  \
 0       Capacity Constraint  10/23/2023 03:36:53 PM   
 1       Capacity Constraint  10/22/2023 03:05:11 PM   
 2       Capacity Constraint  10/21/2023 03:29:00 PM   
 3       Capacity Constraint  10/20/2023 03:41:37 PM   
 4       Capacity Constraint  10/19/2023 03:10:00 PM   
 ..                      ...                     ...   
 118     Capacity Constraint  07/29/2023 02:45:36 PM   
 119     Capacity Constraint  07/28/2023 03:49:14 PM   
 120  Operational Flow Order  07/28/2023 07:11:00 AM   
 121     Capacity Constraint  07/27/2023 03:03:44 PM   
 122     Capacity Constraint  07/26/2023 02:57:00 PM   
 
     Notice Effective Date/Time    Notice End Date/Time  Notice Identifier  \
 0       10/24/2023 09:00:00 AM  10/25/2023 09:00:00 AM             139536   
 1       10/23/2023 09:00:00 AM  10/24/2023 09:00:00 AM             139482   
 2       10/22/2023 09:00:00 AM  10/23/2023 09:00:00 AM             139455   
 3       10/21

In [9]:
df = all_data[0]
df

Unnamed: 0,Notice Type,Posted Date/Time,Notice Effective Date/Time,Notice End Date/Time,Notice Identifier,Subject,Response Date/Time
0,Capacity Constraint,10/23/2023 03:36:53 PM,10/24/2023 09:00:00 AM,10/25/2023 09:00:00 AM,139536,AGT Pipeline Conditions for 10/24/2023,
1,Capacity Constraint,10/22/2023 03:05:11 PM,10/23/2023 09:00:00 AM,10/24/2023 09:00:00 AM,139482,AGT Pipeline Conditions for 10/23/2023,
2,Capacity Constraint,10/21/2023 03:29:00 PM,10/22/2023 09:00:00 AM,10/23/2023 09:00:00 AM,139455,AGT Pipeline Conditions for 10/22/2023,
3,Capacity Constraint,10/20/2023 03:41:37 PM,10/21/2023 09:00:00 AM,10/22/2023 09:00:00 AM,139420,AGT Pipeline Conditions for 10/21/2023,
4,Capacity Constraint,10/19/2023 03:10:00 PM,10/20/2023 09:00:00 AM,10/21/2023 09:00:00 AM,139374,AGT Pipeline Conditions for 10/20/2023,
...,...,...,...,...,...,...,...
118,Capacity Constraint,07/29/2023 02:45:36 PM,07/30/2023 09:00:00 AM,07/31/2023 09:00:00 AM,136178,AGT Pipeline Conditions for 7/30/2023,
119,Capacity Constraint,07/28/2023 03:49:14 PM,07/29/2023 09:00:00 AM,07/30/2023 09:00:00 AM,136161,AGT Pipeline Conditions for 7/29/2023,
120,Operational Flow Order,07/28/2023 07:11:00 AM,07/29/2023 09:00:00 AM,10/26/2023 09:00:00 AM,136132,AGT Operational Flow Order -- LIFTED EFF 7/29,
121,Capacity Constraint,07/27/2023 03:03:44 PM,07/28/2023 09:00:00 AM,07/29/2023 09:00:00 AM,136114,AGT Pipeline Conditions for 7/28/2023,


# What if we wanted to scrape ALL the gas lines?


In [12]:
## capture all the biz units


In [14]:
homepage_url = "https://infopost.enbridge.com/InfoPost/"
response = requests.get(homepage_url)
soup = BeautifulSoup(response.text, "html.parser")

In [16]:
dropdown_list = soup.find(id="dropdown")
dropdown_list

<ul class="dropdown-menu select-pipe-dropdown-menu" id="dropdown">
<li><a href="AGHome.asp?Pipe=AG">Algonquin (AGT)</a></li><li><a href="BGSHome.asp?Pipe=BGS">Bobcat Gas Storage (BGS)</a></li><li><a href="BIGHome.asp?Pipe=BIG">BIG Pipeline (BIG)</a></li><li><a href="BSPHome.asp?Pipe=BSP">Big Sandy Pipeline (BSP)</a></li><li><a href="EGHome.asp?Pipe=EG">MHP Egan (EHP)</a></li><li><a href="ETHome.asp?Pipe=ET">East Tennessee (ETNG)</a></li><li><a href="GBHome.asp?Pipe=GB">Garden Banks (GB)</a></li><li><a href="GPLHome.asp?Pipe=GPL">Generation  Pipeline (GPL)</a></li><li><a href="MCGPHome.asp?Pipe=MCGP">Mississippi Canyon (MCGP)</a></li><li><a href="MBHome.asp?Pipe=MB">MHP Moss Bluff (MBHP)</a></li><li><a href="MNCAHome.asp?Pipe=MNCA">Maritimes &amp; Northeast Canada (MNCA)</a></li><li><a href="MNUSHome.asp?Pipe=MNUS">Maritimes &amp; Northeast U.S. (MNUS)</a></li><li><a href="MRHome.asp?Pipe=MR">Manta Ray Offshore Gathering Company (MR)</a></li><li><a href="NPCHome.asp?Pipe=NPC">Nautilus P

In [19]:
## get a_tags
a_tags = dropdown_list.find_all("a")
a_tags

[<a href="AGHome.asp?Pipe=AG">Algonquin (AGT)</a>,
 <a href="BGSHome.asp?Pipe=BGS">Bobcat Gas Storage (BGS)</a>,
 <a href="BIGHome.asp?Pipe=BIG">BIG Pipeline (BIG)</a>,
 <a href="BSPHome.asp?Pipe=BSP">Big Sandy Pipeline (BSP)</a>,
 <a href="EGHome.asp?Pipe=EG">MHP Egan (EHP)</a>,
 <a href="ETHome.asp?Pipe=ET">East Tennessee (ETNG)</a>,
 <a href="GBHome.asp?Pipe=GB">Garden Banks (GB)</a>,
 <a href="GPLHome.asp?Pipe=GPL">Generation  Pipeline (GPL)</a>,
 <a href="MCGPHome.asp?Pipe=MCGP">Mississippi Canyon (MCGP)</a>,
 <a href="MBHome.asp?Pipe=MB">MHP Moss Bluff (MBHP)</a>,
 <a href="MNCAHome.asp?Pipe=MNCA">Maritimes &amp; Northeast Canada (MNCA)</a>,
 <a href="MNUSHome.asp?Pipe=MNUS">Maritimes &amp; Northeast U.S. (MNUS)</a>,
 <a href="MRHome.asp?Pipe=MR">Manta Ray Offshore Gathering Company (MR)</a>,
 <a href="NPCHome.asp?Pipe=NPC">Nautilus Pipeline Company (NPC)</a>,
 <a href="NXCAHome.asp?Pipe=NXCA">NEXUS ULC (NXCA)</a>,
 <a href="NXUSHome.asp?Pipe=NXUS">NEXUS U.S. (NXUS)</a>,
 <a href

In [20]:
## get hrefs

hrefs = [a_tag.get("href") for a_tag in a_tags]
hrefs

['AGHome.asp?Pipe=AG',
 'BGSHome.asp?Pipe=BGS',
 'BIGHome.asp?Pipe=BIG',
 'BSPHome.asp?Pipe=BSP',
 'EGHome.asp?Pipe=EG',
 'ETHome.asp?Pipe=ET',
 'GBHome.asp?Pipe=GB',
 'GPLHome.asp?Pipe=GPL',
 'MCGPHome.asp?Pipe=MCGP',
 'MBHome.asp?Pipe=MB',
 'MNCAHome.asp?Pipe=MNCA',
 'MNUSHome.asp?Pipe=MNUS',
 'MRHome.asp?Pipe=MR',
 'NPCHome.asp?Pipe=NPC',
 'NXCAHome.asp?Pipe=NXCA',
 'NXUSHome.asp?Pipe=NXUS',
 'SESHHome.asp?Pipe=SESH',
 'SGHome.asp?Pipe=SG',
 'SRHome.asp?Pipe=SR',
 'STTHome.asp?Pipe=STT',
 'TEHome.asp?Pipe=TE',
 'TPGSHome.asp?Pipe=TPGS',
 'VCPHome.asp?Pipe=VCP',
 'WRGSHome.asp?Pipe=WRGS']

In [21]:
## import regex library
import re 


In [31]:
## capture pattern: grab the unit code that follows "Pipe=" in each href

pat = re.compile(r"Pipe=(\w+)")

In [33]:
unit_codes = []

## each href holds exactly one "Pipe=" value, so take the first match
for href in hrefs:
    unit_codes.append(pat.findall(href)[0])
    
unit_codes

['AG',
 'BGS',
 'BIG',
 'BSP',
 'EG',
 'ET',
 'GB',
 'GPL',
 'MCGP',
 'MB',
 'MNCA',
 'MNUS',
 'MR',
 'NPC',
 'NXCA',
 'NXUS',
 'SESH',
 'SG',
 'SR',
 'STT',
 'TE',
 'TPGS',
 'VCP',
 'WRGS']
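
The regex works well here. As an alternative, the same codes can be pulled out with the standard library's URL parser, which reads the query string directly. The sketch below is only an illustration: `pipe_code` and `sample_hrefs` are hypothetical names, and the last sample href (with no query string) is made up to show the fallback.

```python
from urllib.parse import urlsplit, parse_qs


def pipe_code(href):
    """Return the value of the "Pipe" query parameter, or None if it is absent."""
    query = urlsplit(href).query          # e.g. "Pipe=AG"
    values = parse_qs(query).get("Pipe", [])
    return values[0] if values else None


# Two hrefs copied from the dropdown, plus a made-up one with no query string
sample_hrefs = ["AGHome.asp?Pipe=AG", "MNCAHome.asp?Pipe=MNCA", "Home.asp"]
codes = [pipe_code(href) for href in sample_hrefs]
print(codes)  # ['AG', 'MNCA', None]
```

Note that `parse_qs` is case-sensitive, so the key must be `"Pipe"` exactly as it appears in the hrefs.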

In [34]:
url_start = "https://infopost.enbridge.com/InfoPost/NoticesList.asp?pipe="
end_url = "&type=CRI"

In [37]:
links = [f"{url_start}{unit_code}{end_url}" for unit_code in unit_codes]
links

['https://infopost.enbridge.com/InfoPost/NoticesList.asp?pipe=AG&type=CRI',
 'https://infopost.enbridge.com/InfoPost/NoticesList.asp?pipe=BGS&type=CRI',
 'https://infopost.enbridge.com/InfoPost/NoticesList.asp?pipe=BIG&type=CRI',
 'https://infopost.enbridge.com/InfoPost/NoticesList.asp?pipe=BSP&type=CRI',
 'https://infopost.enbridge.com/InfoPost/NoticesList.asp?pipe=EG&type=CRI',
 'https://infopost.enbridge.com/InfoPost/NoticesList.asp?pipe=ET&type=CRI',
 'https://infopost.enbridge.com/InfoPost/NoticesList.asp?pipe=GB&type=CRI',
 'https://infopost.enbridge.com/InfoPost/NoticesList.asp?pipe=GPL&type=CRI',
 'https://infopost.enbridge.com/InfoPost/NoticesList.asp?pipe=MCGP&type=CRI',
 'https://infopost.enbridge.com/InfoPost/NoticesList.asp?pipe=MB&type=CRI',
 'https://infopost.enbridge.com/InfoPost/NoticesList.asp?pipe=MNCA&type=CRI',
 'https://infopost.enbridge.com/InfoPost/NoticesList.asp?pipe=MNUS&type=CRI',
 'https://infopost.enbridge.com/InfoPost/NoticesList.asp?pipe=MR&type=CRI',
 '
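
The f-string approach above is fine for simple codes like these. If a value ever contained spaces or special characters, `urllib.parse.urlencode` would escape it for us. A minimal sketch, assuming the same `pipe` and `type` parameters; `BASE` and `notices_url` are hypothetical names introduced for illustration:

```python
from urllib.parse import urlencode

# Same endpoint as url_start above, without the query string
BASE = "https://infopost.enbridge.com/InfoPost/NoticesList.asp"


def notices_url(unit_code, notice_type="CRI"):
    """Build the notices URL for one unit code."""
    params = {"pipe": unit_code, "type": notice_type}
    return f"{BASE}?{urlencode(params)}"


print(notices_url("AG"))
# https://infopost.enbridge.com/InfoPost/NoticesList.asp?pipe=AG&type=CRI
```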

In [39]:
import time
from random import randrange

In [41]:
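from io import StringIO

broken_url = []
df_list = []
total_links = len(unit_codes)

for counter, unit_code in enumerate(unit_codes, start=1):
    print(f"Scraping {counter} of {total_links}")
    link = f"{url_start}{unit_code}{end_url}"
    try:
        # keep the request inside the try so a network error doesn't stop the loop
        response = requests.get(link, timeout=30)  # 30-second timeout is an arbitrary choice
        response.raise_for_status()
        # read_html raises ValueError when the page has no table
        data = pd.read_html(StringIO(response.text))
        df = data[0]
        df["unit"] = unit_code
        df_list.append(df)
    except Exception:
        print(f"{unit_code} was busted or had no table")
        broken_url.append(link)
    finally:
        # pause 10 to 14 seconds between requests to go easy on the server
        snooze = randrange(10, 15)
        print(f"Snoozing for {snooze} seconds")
        time.sleep(snooze)
        
print("Done scraping all units")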

Scraping 1 of 24
Snoozing for 12 seconds
Scraping 2 of 24
Snoozing for 13 seconds
Scraping 3 of 24
Snoozing for 11 seconds
Scraping 4 of 24
Snoozing for 11 seconds
Scraping 5 of 24
Snoozing for 12 seconds
Scraping 6 of 24
Snoozing for 13 seconds
Scraping 7 of 24
Snoozing for 10 seconds
Scraping 8 of 24
Snoozing for 12 seconds
Scraping 9 of 24
Snoozing for 13 seconds
Scraping 10 of 24
Snoozing for 12 seconds
Scraping 11 of 24
Snoozing for 14 seconds
Scraping 12 of 24
Snoozing for 11 seconds
Scraping 13 of 24
MR was busted or had no table
Snoozing for 14 seconds
Scraping 14 of 24
Snoozing for 13 seconds
Scraping 15 of 24
Snoozing for 12 seconds
Scraping 16 of 24
Snoozing for 12 seconds
Scraping 17 of 24
Snoozing for 12 seconds
Scraping 18 of 24
Snoozing for 12 seconds
Scraping 19 of 24
Snoozing for 10 seconds
Scraping 20 of 24
Snoozing for 13 seconds
Scraping 21 of 24
Snoozing for 14 seconds
Scraping 22 of 24
Snoozing for 10 seconds
Scraping 23 of 24
Snoozing for 10 seconds
Scraping 24 o

In [42]:
broken_url

['https://infopost.enbridge.com/InfoPost/NoticesList.asp?pipe=MR&type=CRI',
 'https://infopost.enbridge.com/InfoPost/NoticesList.asp?pipe=WRGS&type=CRI']
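
Before giving up on the broken URLs, we might try them again in case the failure was temporary. The sketch below is one possible approach, not part of the original scrape: `retry_urls` and `fake_fetch` are hypothetical helpers, and `fake_fetch` stands in for a real request so the example runs offline. If a page simply has no notices table, retrying won't help and the URL stays in the broken list.

```python
import time


def retry_urls(urls, fetch, attempts=3, pause=0.0):
    """Call fetch(url) for each URL, retrying failures up to `attempts` times.

    Returns (results, still_broken): a dict of successful results keyed by URL
    and a list of URLs that failed on every attempt.
    """
    results = {}
    still_broken = []
    for url in urls:
        for attempt in range(1, attempts + 1):
            try:
                results[url] = fetch(url)
                break  # success, stop retrying this URL
            except Exception as error:
                print(f"Attempt {attempt} failed for {url}: {error}")
                time.sleep(pause)
        else:
            # the inner loop never hit `break`, so every attempt failed
            still_broken.append(url)
    return results, still_broken


def fake_fetch(url):
    """Stand-in fetcher (assumption): pretends the MR page never has a table."""
    if "pipe=MR" in url:
        raise ValueError("No tables found")
    return f"table for {url}"


demo_urls = [
    "https://infopost.enbridge.com/InfoPost/NoticesList.asp?pipe=MR&type=CRI",
    "https://infopost.enbridge.com/InfoPost/NoticesList.asp?pipe=AG&type=CRI",
]
results, still_broken = retry_urls(demo_urls, fake_fetch, attempts=2)
print(still_broken)
```

In a real run, `fetch` would be a function that requests the page and returns the parsed table, and `pause` would be set to a few seconds.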

In [46]:
df_list[5]

Unnamed: 0,Notice Type,Posted Date/Time,Notice Effective Date/Time,Notice End Date/Time,Notice Identifier,Subject,Response Date/Time,unit
0,Operational Flow Order,10/23/2023 05:18:28 PM,10/24/2023 09:00:00 AM,01/21/2024 05:18:28 PM,139542,ETNG Customer Specific Action Alert -- EFF 10/24,,ET
1,Capacity Constraint,10/23/2023 02:35:00 PM,10/24/2023 09:00:00 AM,10/25/2023 09:00:00 AM,139509,ETNG Pipeline Conditions for 10/24/2023,,ET
2,Capacity Constraint,10/22/2023 03:06:32 PM,10/23/2023 09:00:00 AM,10/24/2023 09:00:00 AM,139494,ETNG Pipeline Conditions for 10/23/2023,,ET
3,Capacity Constraint,10/21/2023 03:31:27 PM,10/22/2023 09:00:00 AM,10/23/2023 03:31:27 PM,139456,ETNG Pipeline Conditions for 10/22/2023,,ET
4,Capacity Constraint,10/21/2023 09:31:11 AM,10/21/2023 09:31:11 AM,01/19/2024 09:31:11 AM,139426,Tracy City to Ooltewah 3200-1 VS1 Investigatio...,,ET
...,...,...,...,...,...,...,...,...
106,Capacity Constraint,07/30/2023 02:36:31 PM,07/31/2023 09:00:00 AM,08/01/2023 09:00:00 AM,136197,ETNG Pipeline Conditions for 7/31/2023,,ET
107,Capacity Constraint,07/29/2023 02:47:00 PM,07/30/2023 09:00:00 AM,07/31/2023 09:00:00 AM,136179,ETNG Pipeline Conditions for 7/30/2023,,ET
108,Capacity Constraint,07/28/2023 03:46:51 PM,07/29/2023 09:00:00 AM,07/30/2023 09:00:00 AM,136160,ETNG Pipeline Conditions for 7/29/2023,,ET
109,Capacity Constraint,07/27/2023 02:57:57 PM,07/28/2023 09:00:00 AM,07/29/2023 09:00:00 AM,136113,ETNG Pipeline Conditions for 7/28/2023,,ET


In [47]:
final_df = pd.concat(df_list).reset_index()

In [48]:
final_df

Unnamed: 0,index,Notice Type,Posted Date/Time,Notice Effective Date/Time,Notice End Date/Time,Notice Identifier,Subject,Response Date/Time,unit
0,0,Capacity Constraint,10/23/2023 03:36:53 PM,10/24/2023 09:00:00 AM,10/25/2023 09:00:00 AM,139536,AGT Pipeline Conditions for 10/24/2023,,AG
1,1,Capacity Constraint,10/22/2023 03:05:11 PM,10/23/2023 09:00:00 AM,10/24/2023 09:00:00 AM,139482,AGT Pipeline Conditions for 10/23/2023,,AG
2,2,Capacity Constraint,10/21/2023 03:29:00 PM,10/22/2023 09:00:00 AM,10/23/2023 09:00:00 AM,139455,AGT Pipeline Conditions for 10/22/2023,,AG
3,3,Capacity Constraint,10/20/2023 03:41:37 PM,10/21/2023 09:00:00 AM,10/22/2023 09:00:00 AM,139420,AGT Pipeline Conditions for 10/21/2023,,AG
4,4,Capacity Constraint,10/19/2023 03:10:00 PM,10/20/2023 09:00:00 AM,10/21/2023 09:00:00 AM,139374,AGT Pipeline Conditions for 10/20/2023,,AG
...,...,...,...,...,...,...,...,...,...
1035,93,Capacity Constraint,07/30/2023 02:33:55 PM,07/31/2023 09:00:00 AM,08/01/2023 09:00:00 AM,136195,VCP Pipeline Conditions for 7/31/2023,,VCP
1036,94,Capacity Constraint,07/29/2023 02:40:35 PM,07/30/2023 09:00:00 AM,07/31/2023 09:00:00 AM,136174,VCP Pipeline Conditions for 7/30/2023,,VCP
1037,95,Capacity Constraint,07/28/2023 03:42:50 PM,07/29/2023 09:00:00 AM,07/30/2023 09:00:00 AM,136158,VCP Pipeline Conditions for 7/29/2023,,VCP
1038,96,Capacity Constraint,07/27/2023 02:25:34 PM,07/28/2023 09:00:00 AM,07/29/2023 09:00:00 AM,136106,VCP Pipeline Conditions for 7/28/2023,,VCP
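
The extra `index` column above comes from `reset_index()`, which keeps each table's old row numbers as a new column. If we don't need them, `pd.concat(..., ignore_index=True)` builds a fresh index without the leftover column. A small sketch with two toy tables (the notice identifiers are taken from the output above; the variable names are made up for illustration):

```python
import pandas as pd

# Two tiny stand-ins for the per-unit tables in df_list
first = pd.DataFrame({"Notice Identifier": [139536, 139482], "unit": "AG"})
second = pd.DataFrame({"Notice Identifier": [139542], "unit": "ET"})

# What the notebook does: the old row numbers survive as an "index" column
with_old_index = pd.concat([first, second]).reset_index()

# Alternative: number the rows 0..n-1 and keep no extra column
fresh_index = pd.concat([first, second], ignore_index=True)

print(list(with_old_index.columns))
print(list(fresh_index.columns))
```

Passing `drop=True` to `reset_index()` gives the same result as `ignore_index=True`.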


In [49]:
final_df.sample(20)

Unnamed: 0,index,Notice Type,Posted Date/Time,Notice Effective Date/Time,Notice End Date/Time,Notice Identifier,Subject,Response Date/Time,unit
665,4,Capacity Constraint,10/19/2023 02:22:50 PM,10/20/2023 09:00:00 AM,10/21/2023 09:00:00 AM,139354,SR Storage Conditions for 10/20/2023,,SR
935,34,Computer System Status,10/04/2023 08:15:00 AM,10/10/2023 08:00:00 AM,10/13/2023 08:00:00 AM,138690,LINK Application Testing on October 11,,TPGS
845,80,Capacity Constraint,09/04/2023 03:35:50 PM,09/05/2023 09:00:00 AM,09/06/2023 09:00:00 AM,137498,TE Pipeline Conditions for 9/5/2023,,TE
565,105,Capacity Constraint,07/26/2023 02:50:11 PM,07/27/2023 09:00:00 AM,07/28/2023 09:00:00 AM,136077,SESH Pipeline Conditions for 7/27/2023,,SESH
658,91,Capacity Constraint,07/28/2023 03:38:57 PM,07/29/2023 09:00:00 AM,07/30/2023 09:00:00 AM,136156,SGSC Storage Conditions for 7/29/2023,,SG
277,5,Maintenance,08/10/2023 08:00:00 AM,08/10/2023 08:00:00 AM,08/13/2023 04:17:16 PM,136571,Measurement Data Outage,,GB
1037,95,Capacity Constraint,07/28/2023 03:42:50 PM,07/29/2023 09:00:00 AM,07/30/2023 09:00:00 AM,136158,VCP Pipeline Conditions for 7/29/2023,,VCP
1012,70,Capacity Constraint,08/18/2023 02:31:36 PM,08/19/2023 09:00:00 AM,08/20/2023 09:00:00 AM,136907,VCP Pipeline Conditions for 8/19/2023,,VCP
525,65,Capacity Constraint,09/01/2023 02:40:26 PM,09/02/2023 09:00:00 AM,09/03/2023 09:00:00 AM,137397,SESH Pipeline Conditions for 9/2/2023,,SESH
344,6,Computer System Status,09/05/2023 02:00:00 PM,10/13/2023 10:00:00 PM,10/15/2023 09:00:00 AM,137516,LINK System Maintenance moved from September 1...,,NPC
