# Scrape a fixed URL site

From <a href="https://infopost.enbridge.com/InfoPost/">this homepage</a>, we want to scrape the critical notices for the Algonquin Gas Transmission pipeline.

Let's explore the site to come up with our scrape strategy.


In [1]:
pip install icecream

Note: you may need to restart the kernel to use updated packages.


In [2]:
## import libraries
import requests
from bs4 import BeautifulSoup
import pandas as pd
from icecream import ic

In [4]:
## target url
url = "https://infopost.enbridge.com/InfoPost/AGHome.asp?Pipe=AG"

In [5]:
## request url data
response = requests.get(url)
response.status_code

200

In [6]:
print(response.text)

<!DOCTYPE html>
<!-- template.asp -->



<HTML lang="en">
<HEAD>
<meta charset="utf-8"/>
 <meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1.0">

<link rel="shortcut icon" href="favicon.ico" />
<link href="css/jquery-ui.min.css" rel="stylesheet" type="text/css" />
<link href="css/bootstrap.min.css" rel="stylesheet">
<link href="css/font-awesome-ie7.min.css" rel="stylesheet"/>
<link href="css/font-awesome.css" rel="stylesheet">

<link href="css/link.css" rel="stylesheet" />
<link href="css/print.css" rel="stylesheet" media="print" />
<link href="css/infopost-custom.css" rel="stylesheet" media="screen" />
<link href="css/environment.css" rel="stylesheet" media="screen" />
<!-- HTML5 shim, for IE6-9 support of HTML5 elements -->
<!--[if lt IE 9]>
	<link href="css/link-ie.css" rel="stylesheet" />
	<script src="scripts/html5shiv.js" type="text/javascript"></script>
	<script src="scripts/html5shiv-

In [7]:
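## real target url
from io import StringIO

url = "https://infopost.enbridge.com/InfoPost/NoticesList.asp?pipe=AG&type=CRI"
response = requests.get(url)
## newer pandas versions expect a file-like object rather than a raw HTML string
data = pd.read_html(StringIO(response.text))
data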

[                Notice Type        Posted Date/Time  \
 0       Capacity Constraint  10/28/2021 03:05:42 PM   
 1    Operational Flow Order  10/28/2021 08:08:54 AM   
 2       Capacity Constraint  10/27/2021 04:03:20 PM   
 3       Capacity Constraint  10/26/2021 03:16:30 PM   
 4       Capacity Constraint  10/25/2021 03:06:43 PM   
 ..                      ...                     ...   
 121     Capacity Constraint  08/04/2021 03:08:16 PM   
 122     Capacity Constraint  08/03/2021 03:32:33 PM   
 123     Capacity Constraint  08/02/2021 03:48:23 PM   
 124     Capacity Constraint  08/01/2021 03:15:36 PM   
 125     Capacity Constraint  07/31/2021 03:21:27 PM   
 
     Notice Effective Date/Time    Notice End Date/Time  Notice Identifier  \
 0       10/29/2021 09:00:00 AM  10/30/2021 09:00:00 AM             113798   
 1       10/30/2021 09:00:00 AM  11/01/2021 09:00:00 AM             113784   
 2       10/28/2021 09:00:00 AM  10/29/2021 09:00:00 AM             113779   
 3       10/27

In [8]:
type(data)

list

In [9]:
len(data)

1

In [10]:
type(data[0])

pandas.core.frame.DataFrame

In [11]:
df = data[0]
df

Unnamed: 0,Notice Type,Posted Date/Time,Notice Effective Date/Time,Notice End Date/Time,Notice Identifier,Subject,Response Date/Time
0,Capacity Constraint,10/28/2021 03:05:42 PM,10/29/2021 09:00:00 AM,10/30/2021 09:00:00 AM,113798,AGT Pipeline Conditions for 10/29/2021,
1,Operational Flow Order,10/28/2021 08:08:54 AM,10/30/2021 09:00:00 AM,11/01/2021 09:00:00 AM,113784,AGT Operational Flow Order,
2,Capacity Constraint,10/27/2021 04:03:20 PM,10/28/2021 09:00:00 AM,10/29/2021 09:00:00 AM,113779,AGT Pipeline Conditions for 10/28/2021,
3,Capacity Constraint,10/26/2021 03:16:30 PM,10/27/2021 09:00:00 AM,10/28/2021 09:00:00 AM,113746,AGT Pipeline Conditions for 10/27/2021,
4,Capacity Constraint,10/25/2021 03:06:43 PM,10/26/2021 09:00:00 AM,10/27/2021 09:00:00 AM,113689,AGT Pipeline Conditions for 10/26/2021,
...,...,...,...,...,...,...,...
121,Capacity Constraint,08/04/2021 03:08:16 PM,08/05/2021 09:00:00 AM,08/06/2021 09:00:00 AM,110953,AGT Pipeline Conditions for 8/5/2021,
122,Capacity Constraint,08/03/2021 03:32:33 PM,08/04/2021 09:00:00 AM,08/05/2021 09:00:00 AM,110919,AGT Pipeline Conditions for 8/4/2021,
123,Capacity Constraint,08/02/2021 03:48:23 PM,08/03/2021 09:00:00 AM,08/04/2021 09:00:00 AM,110893,AGT Pipeline Conditions for 8/3/2021,
124,Capacity Constraint,08/01/2021 03:15:36 PM,08/02/2021 09:00:00 AM,08/03/2021 09:00:00 AM,110860,AGT Pipeline Conditions for 8/2/2021,


In [12]:
## placeholder

ph_url = "https://infopost.enbridge.com/InfoPost/NoticeListDetail.asp?strKey1={}&type=CRI&Embed=2&pipe=AG"

In [13]:
## turn id column to list

notices_list = df["Notice Identifier"].tolist()
notices_list

[113798,
 113784,
 113779,
 113746,
 113689,
 113659,
 113654,
 113611,
 113600,
 113561,
 113544,
 113508,
 113469,
 113463,
 113432,
 113392,
 113381,
 113355,
 113354,
 113338,
 113316,
 113285,
 113278,
 113248,
 113217,
 113187,
 113172,
 113161,
 113130,
 113078,
 113050,
 113043,
 113009,
 112986,
 112985,
 112978,
 112975,
 112942,
 112914,
 112879,
 112853,
 112852,
 112847,
 112790,
 112757,
 112727,
 112694,
 112691,
 112673,
 112648,
 112634,
 112606,
 112575,
 112546,
 112515,
 112484,
 112444,
 112420,
 112415,
 112372,
 112345,
 112341,
 112323,
 112287,
 112276,
 112236,
 112223,
 112201,
 112144,
 112116,
 112096,
 112063,
 112054,
 112050,
 112026,
 112001,
 111961,
 111943,
 111914,
 111908,
 111872,
 111871,
 111847,
 111840,
 111813,
 111774,
 111751,
 111735,
 111716,
 111659,
 111632,
 111621,
 111580,
 111572,
 111529,
 111511,
 111487,
 111476,
 111434,
 111413,
 111404,
 111386,
 111378,
 111349,
 111308,
 111284,
 111272,
 111231,
 111227,
 111221,
 111216,
 

In [14]:
notice_links = [ph_url.format(noticeID) for noticeID in notices_list]
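
The same links can also be assembled with `urllib.parse.urlencode`, which escapes the values for you. This is a minimal sketch that assumes the parameter names seen in `ph_url`; `BASE`, `notice_url` and `sample_links` are illustrative names, not part of the notebook.

```python
from urllib.parse import urlencode

# Endpoint copied from the placeholder URL in the notebook.
BASE = "https://infopost.enbridge.com/InfoPost/NoticeListDetail.asp"


def notice_url(notice_id, pipe="AG", notice_type="CRI"):
    """Build one notice-detail URL; the parameter order mirrors ph_url."""
    params = {"strKey1": notice_id, "type": notice_type, "Embed": 2, "pipe": pipe}
    return f"{BASE}?{urlencode(params)}"


sample_ids = [113798, 113784]  # first two identifiers from the table above
sample_links = [notice_url(i) for i in sample_ids]
print(sample_links[0])
```

For plain integer IDs the result is identical to `ph_url.format(...)`; the difference only matters if a value ever contains characters that need escaping.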

In [15]:
notice_links

['https://infopost.enbridge.com/InfoPost/NoticeListDetail.asp?strKey1=113798&type=CRI&Embed=2&pipe=AG',
 'https://infopost.enbridge.com/InfoPost/NoticeListDetail.asp?strKey1=113784&type=CRI&Embed=2&pipe=AG',
 'https://infopost.enbridge.com/InfoPost/NoticeListDetail.asp?strKey1=113779&type=CRI&Embed=2&pipe=AG',
 'https://infopost.enbridge.com/InfoPost/NoticeListDetail.asp?strKey1=113746&type=CRI&Embed=2&pipe=AG',
 'https://infopost.enbridge.com/InfoPost/NoticeListDetail.asp?strKey1=113689&type=CRI&Embed=2&pipe=AG',
 'https://infopost.enbridge.com/InfoPost/NoticeListDetail.asp?strKey1=113659&type=CRI&Embed=2&pipe=AG',
 'https://infopost.enbridge.com/InfoPost/NoticeListDetail.asp?strKey1=113654&type=CRI&Embed=2&pipe=AG',
 'https://infopost.enbridge.com/InfoPost/NoticeListDetail.asp?strKey1=113611&type=CRI&Embed=2&pipe=AG',
 'https://infopost.enbridge.com/InfoPost/NoticeListDetail.asp?strKey1=113600&type=CRI&Embed=2&pipe=AG',
 'https://infopost.enbridge.com/InfoPost/NoticeListDetail.asp?st

In [16]:
df["Notice URL"] = notice_links

In [17]:
df

Unnamed: 0,Notice Type,Posted Date/Time,Notice Effective Date/Time,Notice End Date/Time,Notice Identifier,Subject,Response Date/Time,Notice URL
0,Capacity Constraint,10/28/2021 03:05:42 PM,10/29/2021 09:00:00 AM,10/30/2021 09:00:00 AM,113798,AGT Pipeline Conditions for 10/29/2021,,https://infopost.enbridge.com/InfoPost/NoticeL...
1,Operational Flow Order,10/28/2021 08:08:54 AM,10/30/2021 09:00:00 AM,11/01/2021 09:00:00 AM,113784,AGT Operational Flow Order,,https://infopost.enbridge.com/InfoPost/NoticeL...
2,Capacity Constraint,10/27/2021 04:03:20 PM,10/28/2021 09:00:00 AM,10/29/2021 09:00:00 AM,113779,AGT Pipeline Conditions for 10/28/2021,,https://infopost.enbridge.com/InfoPost/NoticeL...
3,Capacity Constraint,10/26/2021 03:16:30 PM,10/27/2021 09:00:00 AM,10/28/2021 09:00:00 AM,113746,AGT Pipeline Conditions for 10/27/2021,,https://infopost.enbridge.com/InfoPost/NoticeL...
4,Capacity Constraint,10/25/2021 03:06:43 PM,10/26/2021 09:00:00 AM,10/27/2021 09:00:00 AM,113689,AGT Pipeline Conditions for 10/26/2021,,https://infopost.enbridge.com/InfoPost/NoticeL...
...,...,...,...,...,...,...,...,...
121,Capacity Constraint,08/04/2021 03:08:16 PM,08/05/2021 09:00:00 AM,08/06/2021 09:00:00 AM,110953,AGT Pipeline Conditions for 8/5/2021,,https://infopost.enbridge.com/InfoPost/NoticeL...
122,Capacity Constraint,08/03/2021 03:32:33 PM,08/04/2021 09:00:00 AM,08/05/2021 09:00:00 AM,110919,AGT Pipeline Conditions for 8/4/2021,,https://infopost.enbridge.com/InfoPost/NoticeL...
123,Capacity Constraint,08/02/2021 03:48:23 PM,08/03/2021 09:00:00 AM,08/04/2021 09:00:00 AM,110893,AGT Pipeline Conditions for 8/3/2021,,https://infopost.enbridge.com/InfoPost/NoticeL...
124,Capacity Constraint,08/01/2021 03:15:36 PM,08/02/2021 09:00:00 AM,08/03/2021 09:00:00 AM,110860,AGT Pipeline Conditions for 8/2/2021,,https://infopost.enbridge.com/InfoPost/NoticeL...


In [20]:
## to show the full links because df columns not wide enough
df["Notice URL"].tolist()

['https://infopost.enbridge.com/InfoPost/NoticeListDetail.asp?strKey1=113798&type=CRI&Embed=2&pipe=AG',
 'https://infopost.enbridge.com/InfoPost/NoticeListDetail.asp?strKey1=113784&type=CRI&Embed=2&pipe=AG',
 'https://infopost.enbridge.com/InfoPost/NoticeListDetail.asp?strKey1=113779&type=CRI&Embed=2&pipe=AG',
 'https://infopost.enbridge.com/InfoPost/NoticeListDetail.asp?strKey1=113746&type=CRI&Embed=2&pipe=AG',
 'https://infopost.enbridge.com/InfoPost/NoticeListDetail.asp?strKey1=113689&type=CRI&Embed=2&pipe=AG',
 'https://infopost.enbridge.com/InfoPost/NoticeListDetail.asp?strKey1=113659&type=CRI&Embed=2&pipe=AG',
 'https://infopost.enbridge.com/InfoPost/NoticeListDetail.asp?strKey1=113654&type=CRI&Embed=2&pipe=AG',
 'https://infopost.enbridge.com/InfoPost/NoticeListDetail.asp?strKey1=113611&type=CRI&Embed=2&pipe=AG',
 'https://infopost.enbridge.com/InfoPost/NoticeListDetail.asp?strKey1=113600&type=CRI&Embed=2&pipe=AG',
 'https://infopost.enbridge.com/InfoPost/NoticeListDetail.asp?st

# What if we wanted to scrape ALL the gas lines?


In [3]:
### direct target

target_ph_url ="https://infopost.enbridge.com/InfoPost/NoticesList.asp?pipe={}&type=CRI"

In [4]:
homepage_url = "https://infopost.enbridge.com/InfoPost/"
homepage = requests.get(homepage_url)
soup = BeautifulSoup(homepage.text, "html.parser")
soup

<!DOCTYPE html>

<!-- template.asp -->
<html lang="en">
<head>
<meta charset="utf-8"/>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<meta content="width=device-width, initial-scale=1.0" name="viewport"/>
<link href="favicon.ico" rel="shortcut icon"/>
<link href="css/jquery-ui.min.css" rel="stylesheet" type="text/css"/>
<link href="css/bootstrap.min.css" rel="stylesheet"/>
<link href="css/font-awesome-ie7.min.css" rel="stylesheet">
<link href="css/font-awesome.css" rel="stylesheet"/>
<link href="css/link.css" rel="stylesheet">
<link href="css/print.css" media="print" rel="stylesheet"/>
<link href="css/infopost-custom.css" media="screen" rel="stylesheet"/>
<link href="css/environment.css" media="screen" rel="stylesheet"/>
<!-- HTML5 shim, for IE6-9 support of HTML5 elements -->
<!--[if lt IE 9]>
	<link href="css/link-ie.css" rel="stylesheet" />
	<script src="scripts/html5shiv.js" type="text/javascript"></script>
	<script src="scripts/html5shiv-printshiv.js" type="text/javascrip

In [5]:
target = soup(id="dropdown")
target

[<ul class="dropdown-menu select-pipe-dropdown-menu" id="dropdown">
 <li><a href="AGHome.asp?Pipe=AG">Algonquin (AGT)</a></li><li><a href="BGSHome.asp?Pipe=BGS">Bobcat Gas Storage (BGS)</a></li><li><a href="BIGHome.asp?Pipe=BIG">BIG Pipeline (BIG)</a></li><li><a href="BSPHome.asp?Pipe=BSP">Big Sandy Pipeline (BSP)</a></li><li><a href="EGHome.asp?Pipe=EG">MHP Egan (EHP)</a></li><li><a href="ETHome.asp?Pipe=ET">East Tennessee (ETNG)</a></li><li><a href="GBHome.asp?Pipe=GB">Garden Banks (GB)</a></li><li><a href="GPLHome.asp?Pipe=GPL">Generation  Pipeline (GPL)</a></li><li><a href="MCGPHome.asp?Pipe=MCGP">Mississippi Canyon (MCGP)</a></li><li><a href="MBHome.asp?Pipe=MB">MHP Moss Bluff (MBHP)</a></li><li><a href="MNCAHome.asp?Pipe=MNCA">Maritimes &amp; Northeast Canada (MNCA)</a></li><li><a href="MNUSHome.asp?Pipe=MNUS">Maritimes &amp; Northeast U.S. (MNUS)</a></li><li><a href="MRHome.asp?Pipe=MR">Manta Ray Offshore Gathering Company (MR)</a></li><li><a href="NPCHome.asp?Pipe=NPC">Nautilus

In [6]:
len(target)

1

In [27]:
# all_a = [item.find_all("a") for item in target]
# all_a = all_a[0]
# all_a

[<a href="AGHome.asp?Pipe=AG">Algonquin (AGT)</a>]

In [7]:
all_a = target[0].find_all('a')
all_a

[<a href="AGHome.asp?Pipe=AG">Algonquin (AGT)</a>,
 <a href="BGSHome.asp?Pipe=BGS">Bobcat Gas Storage (BGS)</a>,
 <a href="BIGHome.asp?Pipe=BIG">BIG Pipeline (BIG)</a>,
 <a href="BSPHome.asp?Pipe=BSP">Big Sandy Pipeline (BSP)</a>,
 <a href="EGHome.asp?Pipe=EG">MHP Egan (EHP)</a>,
 <a href="ETHome.asp?Pipe=ET">East Tennessee (ETNG)</a>,
 <a href="GBHome.asp?Pipe=GB">Garden Banks (GB)</a>,
 <a href="GPLHome.asp?Pipe=GPL">Generation  Pipeline (GPL)</a>,
 <a href="MCGPHome.asp?Pipe=MCGP">Mississippi Canyon (MCGP)</a>,
 <a href="MBHome.asp?Pipe=MB">MHP Moss Bluff (MBHP)</a>,
 <a href="MNCAHome.asp?Pipe=MNCA">Maritimes &amp; Northeast Canada (MNCA)</a>,
 <a href="MNUSHome.asp?Pipe=MNUS">Maritimes &amp; Northeast U.S. (MNUS)</a>,
 <a href="MRHome.asp?Pipe=MR">Manta Ray Offshore Gathering Company (MR)</a>,
 <a href="NPCHome.asp?Pipe=NPC">Nautilus Pipeline Company (NPC)</a>,
 <a href="NXCAHome.asp?Pipe=NXCA">NEXUS ULC (NXCA)</a>,
 <a href="NXUSHome.asp?Pipe=NXUS">NEXUS U.S. (NXUS)</a>,
 <a href

In [12]:
hrefs = [a.get("href") for a in all_a]  ## use `a`, not `url`, to avoid confusion with the url string defined earlier
hrefs

['AGHome.asp?Pipe=AG',
 'BGSHome.asp?Pipe=BGS',
 'BIGHome.asp?Pipe=BIG',
 'BSPHome.asp?Pipe=BSP',
 'EGHome.asp?Pipe=EG',
 'ETHome.asp?Pipe=ET',
 'GBHome.asp?Pipe=GB',
 'GPLHome.asp?Pipe=GPL',
 'MCGPHome.asp?Pipe=MCGP',
 'MBHome.asp?Pipe=MB',
 'MNCAHome.asp?Pipe=MNCA',
 'MNUSHome.asp?Pipe=MNUS',
 'MRHome.asp?Pipe=MR',
 'NPCHome.asp?Pipe=NPC',
 'NXCAHome.asp?Pipe=NXCA',
 'NXUSHome.asp?Pipe=NXUS',
 'SESHHome.asp?Pipe=SESH',
 'SGHome.asp?Pipe=SG',
 'SRHome.asp?Pipe=SR',
 'STTHome.asp?Pipe=STT',
 'TEHome.asp?Pipe=TE',
 'VCPHome.asp?Pipe=VCP',
 'WRGSHome.asp?Pipe=WRGS']

In [13]:
target_ph_url

'https://infopost.enbridge.com/InfoPost/NoticesList.asp?pipe={}&type=CRI'

In [14]:
## regular expressions

import re

In [15]:
pat = re.compile(r"Pipe=(\w*)")

In [16]:
abbrvs_fl =[]
for item in hrefs:
    print(pat.findall(item))
    abbrvs_fl.append(pat.findall(item)[0])

['AG']
['BGS']
['BIG']
['BSP']
['EG']
['ET']
['GB']
['GPL']
['MCGP']
['MB']
['MNCA']
['MNUS']
['MR']
['NPC']
['NXCA']
['NXUS']
['SESH']
['SG']
['SR']
['STT']
['TE']
['VCP']
['WRGS']
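
A regular expression works here, but the standard library can also parse the query string directly. The sketch below is one possible alternative, assuming every href carries a `Pipe` parameter spelled with a capital P as in the dropdown; `pipe_code` is an illustrative helper and `index.asp` is a made-up href that shows the missing-parameter case.

```python
from urllib.parse import parse_qs, urlsplit


def pipe_code(href):
    """Return the value of the Pipe query parameter, or None if it is absent."""
    query = parse_qs(urlsplit(href).query)
    values = query.get("Pipe")  # parameter names are case-sensitive
    return values[0] if values else None


# Two hrefs from the dropdown plus one hypothetical href without a query string.
sample_hrefs = ["AGHome.asp?Pipe=AG", "MNCAHome.asp?Pipe=MNCA", "index.asp"]
codes = [pipe_code(h) for h in sample_hrefs]
print(codes)
```

Returning `None` instead of indexing `[0]` blindly avoids an `IndexError` if a link without the parameter ever shows up in the menu.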


In [17]:
abbrvs_fl

['AG',
 'BGS',
 'BIG',
 'BSP',
 'EG',
 'ET',
 'GB',
 'GPL',
 'MCGP',
 'MB',
 'MNCA',
 'MNUS',
 'MR',
 'NPC',
 'NXCA',
 'NXUS',
 'SESH',
 'SG',
 'SR',
 'STT',
 'TE',
 'VCP',
 'WRGS']

In [19]:
abbrvs = [pat.findall(item)[0] for item in hrefs]
abbrvs

['AG',
 'BGS',
 'BIG',
 'BSP',
 'EG',
 'ET',
 'GB',
 'GPL',
 'MCGP',
 'MB',
 'MNCA',
 'MNUS',
 'MR',
 'NPC',
 'NXCA',
 'NXUS',
 'SESH',
 'SG',
 'SR',
 'STT',
 'TE',
 'VCP',
 'WRGS']

In [20]:
abbrvs_g = (pat.findall(item)[0] for item in hrefs)
abbrvs_g

<generator object <genexpr> at 0x7fd45ba2a200>

In [22]:
for abbrv in abbrvs_g:
    print(abbrv)

AG
BGS
BIG
BSP
EG
ET
GB
GPL
MCGP
MB
MNCA
MNUS
MR
NPC
NXCA
NXUS
SESH
SG
SR
STT
TE
VCP
WRGS


In [57]:
abbrvs = ['AG',
 'BGS',
 'BIG',
 'BSP',
 'EG',
 'ET',
 'GB',
 'GPL',
 'MCGP',
 'MB',
 'MNCA',
 'MNUS',
 'MR',
 'NPC',
 'NXCA',
 'NXUS',
 'Sandeep',
 'SESH',
 'SG',
 'SR',
 'STT',
 'TE',
 'VCP',
 'WRGS']

In [46]:
links = [target_ph_url.format(abbrv) for abbrv in abbrvs]
links

['https://infopost.enbridge.com/InfoPost/NoticesList.asp?pipe=AG&type=CRI',
 'https://infopost.enbridge.com/InfoPost/NoticesList.asp?pipe=BGS&type=CRI',
 'https://infopost.enbridge.com/InfoPost/NoticesList.asp?pipe=BIG&type=CRI',
 'https://infopost.enbridge.com/InfoPost/NoticesList.asp?pipe=BSP&type=CRI',
 'https://infopost.enbridge.com/InfoPost/NoticesList.asp?pipe=EG&type=CRI',
 'https://infopost.enbridge.com/InfoPost/NoticesList.asp?pipe=ET&type=CRI',
 'https://infopost.enbridge.com/InfoPost/NoticesList.asp?pipe=GB&type=CRI',
 'https://infopost.enbridge.com/InfoPost/NoticesList.asp?pipe=GPL&type=CRI',
 'https://infopost.enbridge.com/InfoPost/NoticesList.asp?pipe=MCGP&type=CRI',
 'https://infopost.enbridge.com/InfoPost/NoticesList.asp?pipe=MB&type=CRI',
 'https://infopost.enbridge.com/InfoPost/NoticesList.asp?pipe=MNCA&type=CRI',
 'https://infopost.enbridge.com/InfoPost/NoticesList.asp?pipe=MNUS&type=CRI',
 'https://infopost.enbridge.com/InfoPost/NoticesList.asp?pipe=MR&type=CRI',
 '

In [26]:
## abbrvs_g was already consumed by the for loop above, so nothing is left to iterate
links = [target_ph_url.format(abbrv) for abbrv in abbrvs_g]
links

[]
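
The empty list is most likely the result of the generator having been used up by the earlier `for` loop: a generator can be iterated only once. A small offline sketch of that behavior, with `hrefs_sample` and `gen` as stand-in names:

```python
# Two hrefs copied from the dropdown, used here as stand-in data.
hrefs_sample = ["AGHome.asp?Pipe=AG", "BGSHome.asp?Pipe=BGS"]

# A generator expression produces its values lazily, one at a time.
gen = (h.split("Pipe=")[1] for h in hrefs_sample)

first_pass = list(gen)   # consumes every value
second_pass = list(gen)  # nothing left, so this is empty
print(first_pass, second_pass)
```

If the values are needed more than once, a list comprehension (as in `abbrvs`) is the simpler choice; otherwise the generator has to be rebuilt before each pass.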

In [27]:
links =[]
for abbrv in abbrvs_g:
    links.append(abbrv)

In [50]:
import time
from random import randrange

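In [58]:
from io import StringIO

df_all = []
broken_url = []
for abbrv in abbrvs[15:18]:
    link = target_ph_url.format(abbrv)
#     ic(link)
#     ic(abbrv)
    try:
        response = requests.get(link, timeout=30)  ## a timeout keeps a stalled request from hanging the loop
        data = pd.read_html(StringIO(response.text))  ## raises ValueError when the page has no table
        df = data[0]
        df["unit"] = abbrv
        df_all.append(df)
    except Exception:  ## avoid a bare except, which would also swallow KeyboardInterrupt
        print(f"{abbrv} has nothing or was broken")
        broken_url.append(link)
    finally:
        snooze = randrange(10,15)
        print(f"snoozing for {snooze} seconds before next link")
        time.sleep(snooze)

print("done scraping")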

snoozing for 10 seconds before next link
Sandeep has nothing or was broken
snoozing for 14 seconds before next link
snoozing for 10 seconds before next link
done scraping
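
The loop above mixes fetching, parsing, error handling and pausing. One way to make that pattern reusable and testable without touching the network is to pass the fetch and parse steps in as functions. This is only a sketch: `scrape_pipes`, `fetch`, `parse`, `fake_fetch` and `fake_pages` are hypothetical names introduced for illustration, and the fake pages are made-up strings.

```python
import time
from random import uniform


def scrape_pipes(codes, fetch, parse, pause=(10, 15), sleep=time.sleep):
    """Fetch and parse one page per code, pausing between requests.

    fetch(code) -> str and parse(html) -> object are supplied by the caller.
    Returns (results, failed), where failed lists the codes that raised.
    """
    results, failed = {}, []
    for i, code in enumerate(codes):
        try:
            results[code] = parse(fetch(code))
        except Exception as err:  # report the reason instead of hiding it
            print(f"{code} failed: {err!r}")
            failed.append(code)
        if i < len(codes) - 1:  # no need to wait after the last request
            sleep(uniform(*pause))
    return results, failed


# Offline demonstration with stand-in functions and no real waiting.
fake_pages = {"NXUS": "<table>ok</table>", "SESH": "<table>ok</table>"}


def fake_fetch(code):
    return fake_pages[code]  # raises KeyError for unknown codes


results, failed = scrape_pipes(
    ["NXUS", "Sandeep", "SESH"], fake_fetch, len, sleep=lambda seconds: None
)
print(sorted(results), failed)
```

In real use, `fetch` could wrap `requests.get(target_ph_url.format(code)).text` and `parse` could wrap `pd.read_html`, with the default `sleep` left in place so the site is not hammered.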


In [60]:
broken_url

['https://infopost.enbridge.com/InfoPost/NoticesList.asp?pipe=Sandeep&type=CRI']

In [52]:
df_all

[               Notice Type        Posted Date/Time Notice Effective Date/Time  \
 0      Capacity Constraint  10/28/2021 02:12:03 PM     10/29/2021 09:00:00 AM   
 1      Capacity Constraint  10/27/2021 02:40:34 PM     10/28/2021 09:00:00 AM   
 2      Capacity Constraint  10/26/2021 02:40:07 PM     10/27/2021 09:00:00 AM   
 3      Capacity Constraint  10/25/2021 02:50:02 PM     10/26/2021 09:00:00 AM   
 4   Computer System Status  10/25/2021 09:19:10 AM     10/25/2021 09:19:10 AM   
 5      Capacity Constraint  10/24/2021 01:50:45 PM     10/25/2021 09:00:00 AM   
 6      Capacity Constraint  10/23/2021 01:40:43 PM     10/24/2021 09:00:00 AM   
 7      Capacity Constraint  10/22/2021 03:36:08 PM     10/23/2021 09:00:00 AM   
 8      Capacity Constraint  10/21/2021 02:54:41 PM     10/22/2021 09:00:00 AM   
 9      Capacity Constraint  10/20/2021 02:57:09 PM     10/21/2021 09:00:00 AM   
 10  Computer System Status  10/19/2021 04:19:02 PM     10/19/2021 04:19:02 PM   
 11     Capacity

In [53]:
def combine_tables(list_name, filename):
    '''
    Combines a list of dataframes into a single dataframe and saves it as a CSV.
    The dataframes must have identical column headers in the same order.
    Arguments: the list of dataframes and the CSV name you want (in quotes as a string)
    '''
    df = pd.concat(list_name) ## join/concat all the dataframes into one dataframe
    df.to_csv(filename, encoding='utf-8', index=False) ## write that single dataframe to a csv
#   files.download(filename) ## download it
    print(f"{filename} saved to the working directory")
    return df

In [54]:
df = combine_tables(df_all, "energy.csv")

energy.csv saved to the working directory


In [56]:
df.sample(15)

Unnamed: 0,Notice Type,Posted Date/Time,Notice Effective Date/Time,Notice End Date/Time,Notice Identifier,Subject,Response Date/Time,unit
0,Capacity Constraint,10/28/2021 02:12:03 PM,10/29/2021 09:00:00 AM,10/30/2021 09:00:00 AM,113795,NEXUS Pipeline Conditions for 10/29/2021,,NXUS
27,Capacity Constraint,10/04/2021 03:44:21 PM,10/05/2021 09:00:00 AM,10/06/2021 09:00:00 AM,112974,NEXUS Pipeline Conditions for 10/5/2021,,NXUS
23,Capacity Constraint,10/08/2021 01:38:47 PM,10/09/2021 09:00:00 AM,10/10/2021 09:00:00 AM,113092,NEXUS Pipeline Conditions for 10/9/2021,,NXUS
38,Computer System Status,08/10/2021 07:31:32 AM,08/10/2021 07:31:32 AM,11/08/2021 07:31:32 AM,111133,LINK System Downtime Due to System Maintenance,,NXUS
0,Computer System Status,10/25/2021 09:19:10 AM,10/25/2021 09:19:10 AM,01/23/2022 09:19:10 AM,113675,Measurement Integrity Downtime this Friday Night,,SG
4,Computer System Status,10/25/2021 09:19:10 AM,10/25/2021 09:19:10 AM,01/23/2022 09:19:10 AM,113673,Measurement Integrity Downtime this Friday Night,,NXUS
35,Computer System Status,08/25/2021 04:40:24 PM,08/25/2021 04:40:24 PM,11/23/2021 04:40:24 PM,111673,Additional TSA requirements that will impact LINK,,NXUS
10,Computer System Status,08/05/2021 08:20:54 AM,08/05/2021 08:20:54 AM,11/03/2021 08:13:54 AM,110970,TSA Directive - Required Password Changes,,SESH
3,Capacity Constraint,09/23/2021 02:45:41 PM,09/24/2021 09:00:00 AM,09/25/2021 09:00:00 AM,112589,SGSC Storage Conditions for 9/24/2021,,SG
9,Capacity Constraint,10/20/2021 02:57:09 PM,10/21/2021 09:00:00 AM,10/22/2021 09:00:00 AM,113502,NEXUS Pipeline Conditions for 10/21/2021,,NXUS
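
Note that index `0` appears twice in the sample (once for NXUS, once for SG). That happens because `pd.concat` keeps each dataframe's own row labels by default. A small sketch with made-up stand-in frames shows the difference that `ignore_index=True` makes; `a`, `b`, `kept` and `fresh` are illustrative names.

```python
import pandas as pd

# Two tiny stand-in frames; the identifier values are made up for illustration.
a = pd.DataFrame({"Notice Identifier": [1, 2], "unit": "NXUS"})
b = pd.DataFrame({"Notice Identifier": [3], "unit": "SG"})

kept = pd.concat([a, b])                      # each frame keeps its own labels
fresh = pd.concat([a, b], ignore_index=True)  # rows are renumbered from 0
print(list(kept.index), list(fresh.index))
```

Since `combine_tables` writes the CSV with `index=False`, the duplicate labels do not reach the file; they only matter when selecting rows from the combined dataframe by label.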
