# Scrape a fixed URL site

We'll often encounter websites where the URL never changes. Here are a few examples: <a href="https://my.nycha.info/Outages/Outages.aspx">NYCHA outages</a> and <a href="https://www.timeanddate.com/weather/india/mumbai/historic?month=4&year=2020">global temperature search</a>.

From <a href="https://infopost.enbridge.com/InfoPost/">this homepage</a>, we want to scrape the critical notices for the Algonquin Gas Transmission pipeline.

Let's explore the site to come up with our scrape strategy.



In [8]:
## libraries
import pandas as pd
import requests
from bs4 import BeautifulSoup

In [9]:
## target url
url = "https://infopost.enbridge.com/InfoPost/AGHome.asp?Pipe=AG"

In [10]:
## get response
response = requests.get(url)
response.status_code

200

In [11]:
response.text

'<!DOCTYPE html>\r\n<!-- template.asp -->\r\n\r\n\r\n\r\n<HTML lang="en">\r\n<HEAD>\r\n<meta charset="utf-8"/>\r\n <meta http-equiv="X-UA-Compatible" content="IE=edge">\r\n<meta name="viewport" content="width=device-width, initial-scale=1.0">\r\n\r\n<link rel="shortcut icon" href="favicon.ico" />\r\n<link href="css/jquery-ui.min.css" rel="stylesheet" type="text/css" />\r\n<link href="css/bootstrap.min.css" rel="stylesheet">\r\n<link href="css/font-awesome-ie7.min.css" rel="stylesheet"/>\r\n<link href="css/font-awesome.css" rel="stylesheet">\r\n\r\n<link href="css/link.css" rel="stylesheet" />\r\n<link href="css/print.css" rel="stylesheet" media="print" />\r\n<link href="css/infopost-custom.css" rel="stylesheet" media="screen" />\r\n<link href="css/environment.css" rel="stylesheet" media="screen" />\r\n<!-- HTML5 shim, for IE6-9 support of HTML5 elements -->\r\n<!--[if lt IE 9]>\r\n\t<link href="css/link-ie.css" rel="stylesheet" />\r\n\t<script src="scripts/html5shiv.js" type="text/java

In [12]:
## actual target url: the critical notices list for Algonquin (pipe=AG)

url = "https://infopost.enbridge.com/InfoPost/NoticesList.asp?pipe=AG&type=CRI"

In [13]:
response = requests.get(url)
data = pd.read_html(response.text)
data

[                Notice Type        Posted Date/Time  \
 0    Operational Flow Order  11/01/2022 07:14:13 AM   
 1       Capacity Constraint  10/31/2022 03:40:00 PM   
 2       Capacity Constraint  10/30/2022 03:03:04 PM   
 3       Capacity Constraint  10/29/2022 03:13:39 PM   
 4       Capacity Constraint  10/28/2022 03:02:00 PM   
 ..                      ...                     ...   
 120     Capacity Constraint  08/06/2022 03:48:31 PM   
 121     Capacity Constraint  08/05/2022 03:23:00 PM   
 122     Capacity Constraint  08/04/2022 03:16:00 PM   
 123     Capacity Constraint  08/04/2022 03:11:11 PM   
 124     Capacity Constraint  08/03/2022 03:57:00 PM   
 
     Notice Effective Date/Time    Notice End Date/Time  Notice Identifier  \
 0       11/01/2022 09:00:00 AM  01/30/2023 09:00:00 AM             126922   
 1       11/01/2022 09:00:00 AM  11/02/2022 09:00:00 AM             126903   
 2       10/31/2022 09:00:00 AM  11/01/2022 09:00:00 AM             126863   
 3       10/30

In [14]:
type(data)

list

In [15]:
len(data)

1

In [16]:
type(data[0])

pandas.core.frame.DataFrame

In [18]:
ag_data = data[0].copy()

In [19]:
ag_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 125 entries, 0 to 124
Data columns (total 7 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   Notice Type                 125 non-null    object 
 1   Posted Date/Time            125 non-null    object 
 2   Notice Effective Date/Time  125 non-null    object 
 3   Notice End Date/Time        125 non-null    object 
 4   Notice Identifier           125 non-null    int64  
 5   Subject                     125 non-null    object 
 6   Response Date/Time          0 non-null      float64
dtypes: float64(1), int64(1), object(5)
memory usage: 7.0+ KB
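The `info()` output shows the date columns stored as `object`, meaning plain strings. Here is a minimal sketch of parsing that timestamp layout with the standard library. The format string is inferred from the sample values in the output above, and the variable names are illustrative rather than part of the notebook.

```python
from datetime import datetime

# Format inferred from values such as "11/01/2022 07:14:13 AM" (an assumption)
STAMP_FORMAT = "%m/%d/%Y %I:%M:%S %p"

# Two "Posted Date/Time" strings copied from the output above
posted = ["11/01/2022 07:14:13 AM", "10/31/2022 03:40:00 PM"]

# strptime turns each string into a datetime object we can sort or compare
parsed = [datetime.strptime(stamp, STAMP_FORMAT) for stamp in posted]

for original, stamp in zip(posted, parsed):
    print(original, "->", stamp.isoformat())
```

In pandas, passing the same format string to `pd.to_datetime(..., format=...)` should convert a whole column at once.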


In [21]:
ag_data

| | Notice Type | Posted Date/Time | Notice Effective Date/Time | Notice End Date/Time | Notice Identifier | Subject | Response Date/Time |
|---|---|---|---|---|---|---|---|
| 0 | Operational Flow Order | 11/01/2022 07:14:13 AM | 11/01/2022 09:00:00 AM | 01/30/2023 09:00:00 AM | 126922 | AGT Operational Flow Order -- LIFTED | |
| 1 | Capacity Constraint | 10/31/2022 03:40:00 PM | 11/01/2022 09:00:00 AM | 11/02/2022 09:00:00 AM | 126903 | AGT Pipeline Conditions for 11/1/2022 | |
| 2 | Capacity Constraint | 10/30/2022 03:03:04 PM | 10/31/2022 09:00:00 AM | 11/01/2022 09:00:00 AM | 126863 | AGT Pipeline Conditions for 10/31/2022 | |
| 3 | Capacity Constraint | 10/29/2022 03:13:39 PM | 10/30/2022 09:00:00 AM | 10/31/2022 09:00:00 AM | 126840 | AGT Pipeline Conditions for 10/30/2022 | |
| 4 | Capacity Constraint | 10/28/2022 03:02:00 PM | 10/29/2022 09:00:00 AM | 10/30/2022 09:00:00 AM | 126787 | AGT Pipeline Conditions for 10/29//2022 | |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 120 | Capacity Constraint | 08/06/2022 03:48:31 PM | 08/07/2022 09:00:00 AM | 08/08/2022 09:00:00 AM | 123656 | AGT Pipeline Conditions for 8/7/2022 | |
| 121 | Capacity Constraint | 08/05/2022 03:23:00 PM | 08/06/2022 09:00:00 AM | 08/07/2022 09:00:00 AM | 123620 | AGT Pipeline Conditions for 8/6/2022 | |
| 122 | Capacity Constraint | 08/04/2022 03:16:00 PM | 08/05/2022 09:00:00 AM | 08/06/2022 09:00:00 AM | 123596 | AGT Pipeline Conditions for 8/5/2022 -- CORREC... | |
| 123 | Capacity Constraint | 08/04/2022 03:11:11 PM | 08/05/2022 09:00:00 AM | 08/06/2022 09:00:00 AM | 123595 | AGT Pipeline Conditions for 8/5/2022 | |


In [25]:
# notice_ids = list(ag_data["Notice Identifier"])
notice_ids = ag_data["Notice Identifier"].tolist()
notice_ids

[126922,
 126903,
 126863,
 126840,
 126787,
 126767,
 126726,
 126720,
 126676,
 126641,
 126640,
 126637,
 126595,
 126589,
 126570,
 126569,
 126554,
 126520,
 126486,
 126447,
 126441,
 126410,
 126409,
 126408,
 126403,
 126343,
 126317,
 126273,
 126238,
 126203,
 126177,
 126161,
 126127,
 126053,
 126012,
 125979,
 125941,
 125907,
 125866,
 125851,
 125814,
 125765,
 125743,
 125724,
 125720,
 125685,
 125654,
 125602,
 125588,
 125582,
 125539,
 125493,
 125465,
 125449,
 125434,
 125417,
 125390,
 125362,
 125342,
 125295,
 125257,
 125238,
 125210,
 125182,
 125154,
 125122,
 125097,
 125093,
 125065,
 125062,
 124983,
 124931,
 124927,
 124888,
 124868,
 124819,
 124789,
 124751,
 124701,
 124672,
 124648,
 124631,
 124597,
 124550,
 124548,
 124546,
 124538,
 124506,
 124479,
 124478,
 124469,
 124416,
 124389,
 124360,
 124331,
 124298,
 124288,
 124261,
 124214,
 124183,
 124167,
 124141,
 124100,
 124079,
 124085,
 124082,
 124055,
 124051,
 123975,
 123933,
 123908,
 

In [26]:
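## placeholder url for a single notice's detail page; {} takes the notice id
detail_url = "https://infopost.enbridge.com/InfoPost/NoticeListDetail.asp?strKey1={}&type=CRI&Embed=2&pipe=AG"

In [27]:
## a separate name keeps this from being overwritten by the ph_url defined later
notice_links = [detail_url.format(notice_id) for notice_id in notice_ids]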
notice_links

['https://infopost.enbridge.com/InfoPost/NoticeListDetail.asp?strKey1=126922&type=CRI&Embed=2&pipe=AG',
 'https://infopost.enbridge.com/InfoPost/NoticeListDetail.asp?strKey1=126903&type=CRI&Embed=2&pipe=AG',
 'https://infopost.enbridge.com/InfoPost/NoticeListDetail.asp?strKey1=126863&type=CRI&Embed=2&pipe=AG',
 'https://infopost.enbridge.com/InfoPost/NoticeListDetail.asp?strKey1=126840&type=CRI&Embed=2&pipe=AG',
 'https://infopost.enbridge.com/InfoPost/NoticeListDetail.asp?strKey1=126787&type=CRI&Embed=2&pipe=AG',
 'https://infopost.enbridge.com/InfoPost/NoticeListDetail.asp?strKey1=126767&type=CRI&Embed=2&pipe=AG',
 'https://infopost.enbridge.com/InfoPost/NoticeListDetail.asp?strKey1=126726&type=CRI&Embed=2&pipe=AG',
 'https://infopost.enbridge.com/InfoPost/NoticeListDetail.asp?strKey1=126720&type=CRI&Embed=2&pipe=AG',
 'https://infopost.enbridge.com/InfoPost/NoticeListDetail.asp?strKey1=126676&type=CRI&Embed=2&pipe=AG',
 'https://infopost.enbridge.com/InfoPost/NoticeListDetail.asp?st
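
The detail links above come from filling a placeholder string with `str.format`. As an alternative sketch, `urllib.parse.urlencode` can assemble the same query string from a dict, which also takes care of escaping if a value ever contains special characters. `notice_detail_link` is a hypothetical helper name; the parameter names and the `Embed=2` value are copied from the links above.

```python
from urllib.parse import urlencode

BASE = "https://infopost.enbridge.com/InfoPost/NoticeListDetail.asp"

def notice_detail_link(notice_id, pipe="AG", notice_type="CRI"):
    """Build a notice detail URL (hypothetical helper, not part of the notebook)."""
    # Dict order is preserved, so the query string matches the links above
    params = {"strKey1": notice_id, "type": notice_type, "Embed": 2, "pipe": pipe}
    return f"{BASE}?{urlencode(params)}"

sample_ids = [126922, 126903]  # first two notice ids from the list above
sample_links = [notice_detail_link(notice_id) for notice_id in sample_ids]
print(sample_links[0])
```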

# What if we wanted to scrape ALL the gas lines?


In [28]:
## placeholder url: {} takes the pipeline abbreviation

ph_url = "https://infopost.enbridge.com/InfoPost/NoticesList.asp?pipe={}&type=CRI"

In [30]:
home_url = "https://infopost.enbridge.com/InfoPost/"
homepage = requests.get(home_url)
soup = BeautifulSoup(homepage.text, "html.parser")
soup

<!DOCTYPE html>

<!-- template.asp -->
<html lang="en">
<head>
<meta charset="utf-8"/>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<meta content="width=device-width, initial-scale=1.0" name="viewport"/>
<link href="favicon.ico" rel="shortcut icon"/>
<link href="css/jquery-ui.min.css" rel="stylesheet" type="text/css"/>
<link href="css/bootstrap.min.css" rel="stylesheet"/>
<link href="css/font-awesome-ie7.min.css" rel="stylesheet">
<link href="css/font-awesome.css" rel="stylesheet"/>
<link href="css/link.css" rel="stylesheet">
<link href="css/print.css" media="print" rel="stylesheet"/>
<link href="css/infopost-custom.css" media="screen" rel="stylesheet"/>
<link href="css/environment.css" media="screen" rel="stylesheet"/>
<!-- HTML5 shim, for IE6-9 support of HTML5 elements -->
<!--[if lt IE 9]>
	<link href="css/link-ie.css" rel="stylesheet" />
	<script src="scripts/html5shiv.js" type="text/javascript"></script>
	<script src="scripts/html5shiv-printshiv.js" type="text/javascrip

In [31]:
## calling soup(...) is shorthand for soup.find_all(...), so the result is a list
dropdown = soup(id="dropdown")
dropdown

[<ul class="dropdown-menu select-pipe-dropdown-menu" id="dropdown">
 <li><a href="AGHome.asp?Pipe=AG">Algonquin (AGT)</a></li><li><a href="BGSHome.asp?Pipe=BGS">Bobcat Gas Storage (BGS)</a></li><li><a href="BIGHome.asp?Pipe=BIG">BIG Pipeline (BIG)</a></li><li><a href="BSPHome.asp?Pipe=BSP">Big Sandy Pipeline (BSP)</a></li><li><a href="EGHome.asp?Pipe=EG">MHP Egan (EHP)</a></li><li><a href="ETHome.asp?Pipe=ET">East Tennessee (ETNG)</a></li><li><a href="GBHome.asp?Pipe=GB">Garden Banks (GB)</a></li><li><a href="GPLHome.asp?Pipe=GPL">Generation  Pipeline (GPL)</a></li><li><a href="MCGPHome.asp?Pipe=MCGP">Mississippi Canyon (MCGP)</a></li><li><a href="MBHome.asp?Pipe=MB">MHP Moss Bluff (MBHP)</a></li><li><a href="MNCAHome.asp?Pipe=MNCA">Maritimes &amp; Northeast Canada (MNCA)</a></li><li><a href="MNUSHome.asp?Pipe=MNUS">Maritimes &amp; Northeast U.S. (MNUS)</a></li><li><a href="MRHome.asp?Pipe=MR">Manta Ray Offshore Gathering Company (MR)</a></li><li><a href="NPCHome.asp?Pipe=NPC">Nautilus

In [41]:
dropdown[0]

<ul class="dropdown-menu select-pipe-dropdown-menu" id="dropdown">
<li><a href="AGHome.asp?Pipe=AG">Algonquin (AGT)</a></li><li><a href="BGSHome.asp?Pipe=BGS">Bobcat Gas Storage (BGS)</a></li><li><a href="BIGHome.asp?Pipe=BIG">BIG Pipeline (BIG)</a></li><li><a href="BSPHome.asp?Pipe=BSP">Big Sandy Pipeline (BSP)</a></li><li><a href="EGHome.asp?Pipe=EG">MHP Egan (EHP)</a></li><li><a href="ETHome.asp?Pipe=ET">East Tennessee (ETNG)</a></li><li><a href="GBHome.asp?Pipe=GB">Garden Banks (GB)</a></li><li><a href="GPLHome.asp?Pipe=GPL">Generation  Pipeline (GPL)</a></li><li><a href="MCGPHome.asp?Pipe=MCGP">Mississippi Canyon (MCGP)</a></li><li><a href="MBHome.asp?Pipe=MB">MHP Moss Bluff (MBHP)</a></li><li><a href="MNCAHome.asp?Pipe=MNCA">Maritimes &amp; Northeast Canada (MNCA)</a></li><li><a href="MNUSHome.asp?Pipe=MNUS">Maritimes &amp; Northeast U.S. (MNUS)</a></li><li><a href="MRHome.asp?Pipe=MR">Manta Ray Offshore Gathering Company (MR)</a></li><li><a href="NPCHome.asp?Pipe=NPC">Nautilus P

In [32]:
len(dropdown)

1

In [43]:
mylist = ["sandeep"]
mylist

['sandeep']

In [44]:
mylist.title()

AttributeError: 'list' object has no attribute 'title'

In [46]:
mylist[0].title()

'Sandeep'

In [42]:
all_a = dropdown[0].find_all("a")
all_a

[<a href="AGHome.asp?Pipe=AG">Algonquin (AGT)</a>,
 <a href="BGSHome.asp?Pipe=BGS">Bobcat Gas Storage (BGS)</a>,
 <a href="BIGHome.asp?Pipe=BIG">BIG Pipeline (BIG)</a>,
 <a href="BSPHome.asp?Pipe=BSP">Big Sandy Pipeline (BSP)</a>,
 <a href="EGHome.asp?Pipe=EG">MHP Egan (EHP)</a>,
 <a href="ETHome.asp?Pipe=ET">East Tennessee (ETNG)</a>,
 <a href="GBHome.asp?Pipe=GB">Garden Banks (GB)</a>,
 <a href="GPLHome.asp?Pipe=GPL">Generation  Pipeline (GPL)</a>,
 <a href="MCGPHome.asp?Pipe=MCGP">Mississippi Canyon (MCGP)</a>,
 <a href="MBHome.asp?Pipe=MB">MHP Moss Bluff (MBHP)</a>,
 <a href="MNCAHome.asp?Pipe=MNCA">Maritimes &amp; Northeast Canada (MNCA)</a>,
 <a href="MNUSHome.asp?Pipe=MNUS">Maritimes &amp; Northeast U.S. (MNUS)</a>,
 <a href="MRHome.asp?Pipe=MR">Manta Ray Offshore Gathering Company (MR)</a>,
 <a href="NPCHome.asp?Pipe=NPC">Nautilus Pipeline Company (NPC)</a>,
 <a href="NXCAHome.asp?Pipe=NXCA">NEXUS ULC (NXCA)</a>,
 <a href="NXUSHome.asp?Pipe=NXUS">NEXUS U.S. (NXUS)</a>,
 <a href

In [36]:
hrefs = [a.get("href") for a in all_a]
hrefs

['AGHome.asp?Pipe=AG',
 'BGSHome.asp?Pipe=BGS',
 'BIGHome.asp?Pipe=BIG',
 'BSPHome.asp?Pipe=BSP',
 'EGHome.asp?Pipe=EG',
 'ETHome.asp?Pipe=ET',
 'GBHome.asp?Pipe=GB',
 'GPLHome.asp?Pipe=GPL',
 'MCGPHome.asp?Pipe=MCGP',
 'MBHome.asp?Pipe=MB',
 'MNCAHome.asp?Pipe=MNCA',
 'MNUSHome.asp?Pipe=MNUS',
 'MRHome.asp?Pipe=MR',
 'NPCHome.asp?Pipe=NPC',
 'NXCAHome.asp?Pipe=NXCA',
 'NXUSHome.asp?Pipe=NXUS',
 'SESHHome.asp?Pipe=SESH',
 'SGHome.asp?Pipe=SG',
 'SRHome.asp?Pipe=SR',
 'STTHome.asp?Pipe=STT',
 'TEHome.asp?Pipe=TE',
 'VCPHome.asp?Pipe=VCP',
 'WRGSHome.asp?Pipe=WRGS']
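
The hrefs in the dropdown are relative, so they cannot be requested on their own. Should a pipeline's home page itself be needed, one way to get a full URL is `urllib.parse.urljoin`, sketched here on a few of the hrefs from the output above. `sample_hrefs` and `absolute_links` are illustrative names.

```python
from urllib.parse import urljoin

home_url = "https://infopost.enbridge.com/InfoPost/"

# A few of the relative hrefs from the output above
sample_hrefs = ["AGHome.asp?Pipe=AG", "BGSHome.asp?Pipe=BGS", "TEHome.asp?Pipe=TE"]

# urljoin resolves each relative href against the page it was found on
absolute_links = [urljoin(home_url, href) for href in sample_hrefs]
print(absolute_links)
```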

In [37]:
import re

In [38]:
pat = re.compile(r"Pipe=(\w+)")

In [39]:
abbrvs = []
for item in hrefs:
    abbrvs.append(pat.findall(item)[0])
    
abbrvs

['AG',
 'BGS',
 'BIG',
 'BSP',
 'EG',
 'ET',
 'GB',
 'GPL',
 'MCGP',
 'MB',
 'MNCA',
 'MNUS',
 'MR',
 'NPC',
 'NXCA',
 'NXUS',
 'SESH',
 'SG',
 'SR',
 'STT',
 'TE',
 'VCP',
 'WRGS']

In [47]:
abbrvs2 = [pat.findall(item)[0] for item in hrefs]
abbrvs2

['AG',
 'BGS',
 'BIG',
 'BSP',
 'EG',
 'ET',
 'GB',
 'GPL',
 'MCGP',
 'MB',
 'MNCA',
 'MNUS',
 'MR',
 'NPC',
 'NXCA',
 'NXUS',
 'SESH',
 'SG',
 'SR',
 'STT',
 'TE',
 'VCP',
 'WRGS']
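
The regular expression works because every href carries a `Pipe=` query parameter. As an alternative sketch, the standard library can parse the query string directly, which avoids writing a pattern by hand. `pipe_code` is a hypothetical helper name, and the sample hrefs are copied from the earlier output.

```python
from urllib.parse import urlsplit, parse_qs

sample_hrefs = ["AGHome.asp?Pipe=AG", "MNCAHome.asp?Pipe=MNCA", "WRGSHome.asp?Pipe=WRGS"]

def pipe_code(href):
    """Return the value of the Pipe query parameter (hypothetical helper)."""
    query = urlsplit(href).query        # e.g. "Pipe=AG"
    return parse_qs(query)["Pipe"][0]   # parse_qs maps each key to a list of values

codes = [pipe_code(href) for href in sample_hrefs]
print(codes)
```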

In [48]:
target_links = [ph_url.format(abbrv) for abbrv in abbrvs]
target_links

['https://infopost.enbridge.com/InfoPost/NoticesList.asp?pipe=AG&type=CRI',
 'https://infopost.enbridge.com/InfoPost/NoticesList.asp?pipe=BGS&type=CRI',
 'https://infopost.enbridge.com/InfoPost/NoticesList.asp?pipe=BIG&type=CRI',
 'https://infopost.enbridge.com/InfoPost/NoticesList.asp?pipe=BSP&type=CRI',
 'https://infopost.enbridge.com/InfoPost/NoticesList.asp?pipe=EG&type=CRI',
 'https://infopost.enbridge.com/InfoPost/NoticesList.asp?pipe=ET&type=CRI',
 'https://infopost.enbridge.com/InfoPost/NoticesList.asp?pipe=GB&type=CRI',
 'https://infopost.enbridge.com/InfoPost/NoticesList.asp?pipe=GPL&type=CRI',
 'https://infopost.enbridge.com/InfoPost/NoticesList.asp?pipe=MCGP&type=CRI',
 'https://infopost.enbridge.com/InfoPost/NoticesList.asp?pipe=MB&type=CRI',
 'https://infopost.enbridge.com/InfoPost/NoticesList.asp?pipe=MNCA&type=CRI',
 'https://infopost.enbridge.com/InfoPost/NoticesList.asp?pipe=MNUS&type=CRI',
 'https://infopost.enbridge.com/InfoPost/NoticesList.asp?pipe=MR&type=CRI',
 '

In [49]:
import time
from random import randrange

In [52]:
broken_url = []
df_all = []

for abbrv in abbrvs:
    link = ph_url.format(abbrv)
    print(f"scraping {abbrv}")
    try:
        response = requests.get(link)
        data = pd.read_html(response.text)
        df = data[0].copy()
        df["unit"] = abbrv  # tag each row with its pipeline abbreviation
        df_all.append(df)
    except Exception:
        # read_html raises ValueError when the page has no table;
        # unlike a bare except, this still lets Ctrl-C interrupt the loop
        print(f"{abbrv} has nothing or is broken!")
        broken_url.append(link)
    finally:
        # pause 10 to 14 seconds between requests to go easy on the server
        snooze = randrange(10, 15)
        print(f"snoozing for {snooze} seconds")
        time.sleep(snooze)

print("done scraping")

scraping AG
snoozing for 10 seconds
scraping BGS
snoozing for 14 seconds
scraping BIG
snoozing for 11 seconds
scraping BSP
snoozing for 11 seconds
scraping EG
snoozing for 13 seconds
scraping ET
snoozing for 11 seconds
scraping GB
snoozing for 14 seconds
scraping GPL
snoozing for 12 seconds
scraping MCGP
snoozing for 14 seconds
scraping MB
snoozing for 12 seconds
scraping MNCA
snoozing for 13 seconds
scraping MNUS
snoozing for 12 seconds
scraping MR
MR has nothing or is broken!
snoozing for 12 seconds
scraping NPC
snoozing for 13 seconds
scraping NXCA
snoozing for 13 seconds
scraping NXUS
snoozing for 14 seconds
scraping SESH
snoozing for 12 seconds
scraping SG
snoozing for 12 seconds
scraping SR
snoozing for 10 seconds
scraping STT
snoozing for 12 seconds
scraping TE
snoozing for 11 seconds
scraping VCP
snoozing for 10 seconds
scraping WRGS
WRGS has nothing or is broken!
snoozing for 13 seconds
done scraping
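
The loop above mixes fetching, parsing, error handling, and pausing. As a rough sketch of the same control flow, the version below takes the fetch, parse, and pause steps as arguments so the logic can be tried offline with stand-ins. `scrape_units`, `fake_fetch`, and `fake_parse` are made-up names for illustration, and the fake pages are not real responses from the site.

```python
def scrape_units(units, url_template, fetch, parse, pause):
    """Return (tables, broken_links) using a try/except/else/finally flow."""
    tables, broken = [], []
    for unit in units:
        link = url_template.format(unit)
        try:
            table = parse(fetch(link))
        except Exception:
            # Anything that goes wrong for this unit is recorded, not fatal
            broken.append(link)
        else:
            tables.append((unit, table))
        finally:
            pause()  # runs whether the unit succeeded or failed
    return tables, broken


# Offline stand-ins (assumptions, not the real site behavior)
def fake_fetch(link):
    return "" if "pipe=MR" in link else "<table>rows</table>"

def fake_parse(html):
    if "<table" not in html:
        raise ValueError("No tables found")
    return html

pauses = []
tables, broken = scrape_units(
    ["AG", "MR"],
    "https://infopost.enbridge.com/InfoPost/NoticesList.asp?pipe={}&type=CRI",
    fetch=fake_fetch,
    parse=fake_parse,
    pause=lambda: pauses.append(1),
)
print([unit for unit, _ in tables], broken)
```

With the real pieces, `fetch` could wrap `requests.get(link).text`, `parse` could wrap `pd.read_html(html)[0]`, and `pause` could call `time.sleep` with a random delay.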


In [61]:
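def combine_tables(list_name, file_name):
    '''
    Combine a list of DataFrames into one DataFrame and save it as a CSV.
    list_name: list of DataFrames
    file_name: CSV filename as a string
    '''
    df = pd.concat(list_name)
    # reset_index keeps each table's original row numbers in a new "index" column
    df.reset_index(inplace = True)
    df.to_csv(file_name, encoding = "UTF-8", index = False)
    return df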

In [59]:
df = combine_tables(df_all, "energy_critical.csv")
df

| | index | Notice Type | Posted Date/Time | Notice Effective Date/Time | Notice End Date/Time | Notice Identifier | Subject | Response Date/Time | unit |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Operational Flow Order | 11/01/2022 07:14:13 AM | 11/01/2022 09:00:00 AM | 01/30/2023 09:00:00 AM | 126922 | AGT Operational Flow Order -- LIFTED | | AG |
| 1 | 1 | Capacity Constraint | 10/31/2022 03:40:00 PM | 11/01/2022 09:00:00 AM | 11/02/2022 09:00:00 AM | 126903 | AGT Pipeline Conditions for 11/1/2022 | | AG |
| 2 | 2 | Capacity Constraint | 10/30/2022 03:03:04 PM | 10/31/2022 09:00:00 AM | 11/01/2022 09:00:00 AM | 126863 | AGT Pipeline Conditions for 10/31/2022 | | AG |
| 3 | 3 | Capacity Constraint | 10/29/2022 03:13:39 PM | 10/30/2022 09:00:00 AM | 10/31/2022 09:00:00 AM | 126840 | AGT Pipeline Conditions for 10/30/2022 | | AG |
| 4 | 4 | Capacity Constraint | 10/28/2022 03:02:00 PM | 10/29/2022 09:00:00 AM | 10/30/2022 09:00:00 AM | 126787 | AGT Pipeline Conditions for 10/29//2022 | | AG |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 988 | 22 | Computer System Status | 09/01/2022 08:45:25 AM | 09/01/2022 08:45:25 AM | 11/30/2022 08:45:25 AM | 124570 | Application Maintenance on September 14 | | VCP |
| 989 | 23 | Computer System Status | 08/23/2022 08:25:51 AM | 08/23/2022 08:25:51 AM | 11/21/2022 08:25:51 AM | 124231 | Implementation of Texas Railroad Commission Cu... | | VCP |
| 990 | 24 | Computer System Status | 08/18/2022 07:14:34 AM | 08/18/2022 07:14:34 AM | 11/16/2022 07:14:34 AM | 124075 | LINK System Maintenance the weekend of Septemb... | | VCP |
| 991 | 25 | Force Majeure | 08/08/2022 07:03:10 AM | 08/08/2022 07:03:10 AM | 11/06/2022 07:03:10 AM | 123686 | Brownsville Compressor Station Force Majeure -... | | VCP |
