# curl, Requests, XPath, and CSS selectors

Setup

In [None]:
%%capture output
%%bash
apt-get update
apt-get install -y tree

## Resources

- [Introduction to web scraping]( https://carpentries-incubator.github.io/lc-webscraping/ )


  - [Selecting content on a web page with XPath]( https://carpentries-incubator.github.io/lc-webscraping/02-xpath/index.html )

  - [Web scraping using Python and Scrapy]( https://carpentries-incubator.github.io/lc-webscraping/04-scrapy/index.html )

- [Playwright]( https://playwright.dev/python/ )

- [HTML Scraping](https://python-docs.readthedocs.io/en/latest/scenarios/scrape.html)

- [Parsing HTML with Beautiful Soup 4](https://automatetheboringstuff.com/2e/chapter12/#:~:text=Parsing%20HTML%20with%20the%20bs4%20Module)

  - [Sample files as zip](https://nostarch.com/download/Automate_the_Boring_Stuff_onlinematerials_v.2.zip)

- [CSS Selector notation]( https://www.w3schools.com/cssref/css_selectors.php )




## HyperText Markup Language (HTML)



"HTML describes the structure of a web page semantically and originally included cues for its appearance."

Features:
- Text with "markup"
- markup == HTML elements/tags
- most tags paired, <html> ... </html>
- nested: tags within tags forming a hierarchy or tree
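
The nesting described above can be made visible with a short sketch. It uses the standard library's `html.parser` module to print an indented outline of the tags. The `OutlineParser` class, the `outline` helper, and the sample markup are made up for illustration, and the sketch assumes every tag is paired (an unclosed tag such as a bare `<br>` would throw the indentation off).

```python
from html.parser import HTMLParser

class OutlineParser(HTMLParser):
    """Collect an indented outline of the opening tags in a document."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # how many tags are currently open
        self.lines = []     # one indented line per opening tag

    def handle_starttag(self, tag, attrs):
        self.lines.append("  " * self.depth + tag)
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1

def outline(markup):
    """Return the tag outline of `markup` as a list of indented strings."""
    parser = OutlineParser()
    parser.feed(markup)
    parser.close()
    return parser.lines

# Made-up sample markup, not taken from a real page.
sample = "<html><head><title>Hi</title></head><body><p>Some <b>bold</b> text</p></body></html>"
print("\n".join(outline(sample)))
```

Each level of indentation in the printed outline corresponds to one level of the tree.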

The term "hypertext" was coined by [Ted Nelson]( https://en.wikipedia.org/wiki/Ted_Nelson ) around 1963, and [Tim Berners-Lee]( https://en.wikipedia.org/wiki/Tim_Berners-Lee ) put the idea into practice with his 1989 proposal for what became the World Wide Web.

A simple [markup example.]( https://en.wikipedia.org/wiki/HTML#Markup )





## Viewing HTML from the browser



### Example.com



One way:
1. View the webpage at http://www.example.com
1. View the source ( Ctrl+U )

Another way:
1. Open Developer tools ( Ctrl+Shift+I )
1. Click on the Elements tab
1. Alt+click or right-click on the "<html>" tag and select "Expand Recursively"

Locating individual elements:
1. Click on the "Select an Element" arrow ( Ctrl+Shift+C )
1. Click on the text "More information..."

Notice that the corresponding section of HTML is highlighted.  You can also click on a section of HTML and the corresponding element in the page will be highlighted.





### ABQ Library



One way:
1. Visit [ABQ databases]( https://abqlibrary.org/az.php?p=1 )
1. View the source ( Ctrl+U )

Another way:
1. Open Developer tools ( Ctrl+Shift+I )
1. Click on the Elements tab
1. Alt+click or right-click on the "<html>" tag and select "Expand Recursively"



## From the command line





In [None]:
!curl http://www.example.com


<!doctype html>
<html>
<head>
    <title>Example Domain</title>

    <meta charset="utf-8" />
    <meta http-equiv="Content-type" content="text/html; charset=utf-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1" />
    <style type="text/css">
    body {
        background-color: #f0f0f2;
        margin: 0;
        padding: 0;
        font-family: -apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;
        
    }
    div {
        width: 600px;
        margin: 5em auto;
        padding: 2em;
        background-color: #fdfdff;
        border-radius: 0.5em;
        box-shadow: 2px 3px 7px 2px rgba(0,0,0,0.02);
    }
    a:link, a:visited {
        color: #38488f;
        text-decoration: none;
    }
    @media (max-width: 700px) {
        div {
            margin: 0 auto;
            width: auto;
        }
    }
    </style>    
</head>

<body>
<div>
    <h1>Example Domain</h1>
    <p>This domai

In [None]:
!curl -v "https://abqlibrary.org/az.php?p=1"

*   Trying 34.194.39.199:443...
* Connected to abqlibrary.org (34.194.39.199) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
*  CAfile: /etc/ssl/certs/ca-certificates.crt
*  CApath: /etc/ssl/certs
* TLSv1.0 (OUT), TLS header, Certificate Status (22):
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* TLSv1.2 (IN), TLS header, Certificate Status (22):
* TLSv1.3 (IN), TLS handshake, Server hello (2):
* TLSv1.2 (IN), TLS header, Finished (20):
* TLSv1.2 (IN), TLS header, Supplemental data (23):
* TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
* TLSv1.2 (IN), TLS header, Supplemental data (23):
* TLSv1.3 (IN), TLS handshake, Certificate (11):
* TLSv1.2 (IN), TLS header, Supplemental data (23):
* TLSv1.3 (IN), TLS handshake, CERT verify (15):
* TLSv1.2 (IN), TLS header, Supplemental data (23):
* TLSv1.3 (IN), TLS handshake, Finished (20):
* TLSv1.2 (OUT), TLS header, Finished (20):
* TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.2 (OUT), TLS header,

## Hypertext Transfer Protocol (HTTP)



From [Wikipedia: HTTP]( https://en.wikipedia.org/wiki/HTTP )

"The Hypertext Transfer Protocol (HTTP) is an application layer protocol in the Internet protocol suite model for distributed, collaborative, hypermedia information systems."

"In 1991, the first documented official version of HTTP ..., HTTP/0.9, supported only GET method, allowing clients to only retrieve HTML documents from the server, but not supporting any other file formats or information upload."

Features:
- Data flows as a Request/Response pair
- The protocol is "stateless", i.e. the server does not, by itself, remember previous requests/responses.
- Requests contain a "verb" and key:value pairs in a Header, and sometimes a payload.
- Responses contain a "status" and key:value pairs in a Header, and often a payload.




Although the protocol defines a number of methods/verbs, we will primarily use the GET, HEAD, and POST methods.
- [HTTP methods]( https://en.wikipedia.org/wiki/HTTP#Request_methods )
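
The request/response pairing and the three methods can be tried without touching the network. The sketch below starts a throwaway server on the local machine with the standard library and sends it one GET, one HEAD, and one POST. `DemoHandler`, `demo`, `PAGE`, and the POST payload are hypothetical names and values chosen for illustration.

```python
import http.client
import socketserver
import threading
from http.server import BaseHTTPRequestHandler

PAGE = b"<html><body><p>hello</p></body></html>"  # made-up payload

class DemoHandler(BaseHTTPRequestHandler):
    """Hypothetical handler that answers GET, HEAD, and POST on any path."""

    def _reply(self, body, send_body=True):
        self.send_response(200)                      # status line: 200 OK
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        if send_body:
            self.wfile.write(body)                   # response payload

    def do_GET(self):
        self._reply(PAGE)

    def do_HEAD(self):
        self._reply(PAGE, send_body=False)           # headers only, no payload

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = self.rfile.read(length)            # request payload
        self._reply(b"received %d bytes" % len(payload))

    def log_message(self, format, *args):
        pass                                         # keep the demo quiet

def demo():
    """Send one request per method and return {method: (status, reason, body)}."""
    server = socketserver.TCPServer(("127.0.0.1", 0), DemoHandler)
    port = server.server_address[1]                  # port chosen by the OS
    threading.Thread(target=server.serve_forever, daemon=True).start()
    results = {}
    try:
        for method, body in (("GET", None), ("HEAD", None), ("POST", b"a=1&b=2")):
            conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
            conn.request(method, "/", body=body)
            resp = conn.getresponse()
            results[method] = (resp.status, resp.reason, resp.read())
            conn.close()
    finally:
        server.shutdown()
        server.server_close()
    return results

for method, (status, reason, body) in demo().items():
    print(method, status, reason, body)
```

The HEAD response carries the same status and headers as the GET response but an empty body, which makes it a cheap way to check a resource before downloading it.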

### Viewing request/response pair from the browser


1. View the webpage at http://www.example.com
1. Open Developer tools ( Ctrl+Shift+I )
1. Click on the Network tab
1. Refresh the page
1. Under the Name field, click on "www.example.com"
1. Scroll down to see the General, Request, and Response sections.
1. Click on the Raw checkbox.
1. Click on the Response tab to view the response payload.







### Viewing request/response pair with curl


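In curl's verbose (`-v`) output, lines that belong to the request begin with "> ".

Lines that belong to the response headers begin with "< ". Lines that begin with "* " are informational messages from curl itself.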
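
As a rough illustration of those prefixes, the sketch below sorts a few lines of verbose output into request, response, and informational groups. The `split_verbose` helper is hypothetical, and the sample lines are an abridged copy of the kind of output curl prints for example.com.

```python
def split_verbose(lines):
    """Group `curl -v` lines by their one-character prefix."""
    groups = {"info": [], "request": [], "response": []}
    prefixes = {"*": "info", ">": "request", "<": "response"}
    for line in lines:
        kind = prefixes.get(line[:1])
        text = line[1:].strip()
        # Require a space (or nothing) after the prefix so that HTML body
        # lines such as "<html>" are not mistaken for response headers.
        if kind and line[1:2] in ("", " ") and text:
            groups[kind].append(text)
    return groups

sample = """\
*   Trying 93.184.215.14:80...
> GET / HTTP/1.1
> Host: www.example.com
> User-Agent: curl/7.81.0
> Accept: */*
>
< HTTP/1.1 200 OK
< Content-Type: text/html; charset=UTF-8
< Content-Length: 1256
<
<html>
""".splitlines()

groups = split_verbose(sample)
method, path, version = groups["request"][0].split()
status = int(groups["response"][0].split()[1])
print(method, path, version, status)
```

The first request line holds the verb, path, and protocol version; the first response line holds the status code.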

In [None]:
!curl --help


Usage: curl [options...] <url>
 -d, --data <data>          HTTP POST data
 -f, --fail                 Fail silently (no output at all) on HTTP errors
 -h, --help <category>      Get help for commands
 -i, --include              Include protocol response headers in the output
 -o, --output <file>        Write to file instead of stdout
 -O, --remote-name          Write output to a file named as the remote file
 -s, --silent               Silent mode
 -T, --upload-file <file>   Transfer local FILE to destination
 -u, --user <user:password> Server user and password
 -A, --user-agent <name>    Send User-Agent <name> to server
 -v, --verbose              Make the operation more talkative
 -V, --version              Show version number and quit

This is not the full help, this menu is stripped into categories.
Use "--help category" to get an overview of all categories.
For all options use the manual or "--help all".


In [None]:
!curl -s -v http://www.example.com

*   Trying 93.184.215.14:80...
* Connected to www.example.com (93.184.215.14) port 80 (#0)
> GET / HTTP/1.1
> Host: www.example.com
> User-Agent: curl/7.81.0
> Accept: */*
> 
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< Accept-Ranges: bytes
< Age: 336833
< Cache-Control: max-age=604800
< Content-Type: text/html; charset=UTF-8
< Date: Mon, 17 Jun 2024 16:16:32 GMT
< Etag: "3147526947"
< Expires: Mon, 24 Jun 2024 16:16:32 GMT
< Last-Modified: Thu, 17 Oct 2019 07:18:26 GMT
< Server: ECAcc (sed/5906)
< Vary: Accept-Encoding
< X-Cache: HIT
< Content-Length: 1256
< 
<!doctype html>
<html>
<head>
    <title>Example Domain</title>

    <meta charset="utf-8" />
    <meta http-equiv="Content-type" content="text/html; charset=utf-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1" />
    <style type="text/css">
    body {
        background-color: #f0f0f2;
        margin: 0;
        padding: 0;
        font-family: -apple-system, system-ui, BlinkMacSystem

## Using Python

### Fetching files/pages using the requests module

In [None]:
import requests
import json


In [None]:
page = requests.get('http://www.example.com')


In [None]:
dict(page.headers)


{'Content-Encoding': 'gzip',
 'Age': '336833',
 'Cache-Control': 'max-age=604800',
 'Content-Type': 'text/html; charset=UTF-8',
 'Date': 'Mon, 17 Jun 2024 16:16:32 GMT',
 'Etag': '"3147526947+gzip"',
 'Expires': 'Mon, 24 Jun 2024 16:16:32 GMT',
 'Last-Modified': 'Thu, 17 Oct 2019 07:18:26 GMT',
 'Server': 'ECAcc (sed/5906)',
 'Vary': 'Accept-Encoding',
 'X-Cache': 'HIT',
 'Content-Length': '648'}
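
Note that `Content-Length` here is 648, while curl reported 1256 for the same page. A likely explanation, given the `Content-Encoding: gzip` header, is that this length counts compressed bytes. The sketch below uses the standard `gzip` module on a made-up, repetitive payload (not the real page) to show the effect.

```python
import gzip

# Stand-in for an HTML payload; the real example.com body is not reproduced here.
raw = ("<p>Example Domain</p>\n" * 50).encode("utf-8")
compressed = gzip.compress(raw)

print(len(raw), len(compressed))            # compressed is much smaller
print(gzip.decompress(compressed) == raw)   # decompression restores the bytes
```

`requests` decompresses such responses transparently, so `page.text` and `page.content` hold the full, uncompressed page.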

In [None]:
page.reason, page.status_code


('OK', 200)

In [None]:
[
page.request.url,
page.request.method,
page.request.path_url,
page.request.headers,
page.request.body,
]

['http://www.example.com/',
 'GET',
 '/',
 {'User-Agent': 'python-requests/2.31.0', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'},
 None]

In [None]:
page.url


'http://www.example.com/'

In [None]:
page.connection


<requests.adapters.HTTPAdapter at 0x79e7841c0490>

In [None]:
page.cookies


<RequestsCookieJar[]>

In [None]:
page.text


'<!doctype html>\n<html>\n<head>\n    <title>Example Domain</title>\n\n    <meta charset="utf-8" />\n    <meta http-equiv="Content-type" content="text/html; charset=utf-8" />\n    <meta name="viewport" content="width=device-width, initial-scale=1" />\n    <style type="text/css">\n    body {\n        background-color: #f0f0f2;\n        margin: 0;\n        padding: 0;\n        font-family: -apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;\n        \n    }\n    div {\n        width: 600px;\n        margin: 5em auto;\n        padding: 2em;\n        background-color: #fdfdff;\n        border-radius: 0.5em;\n        box-shadow: 2px 3px 7px 2px rgba(0,0,0,0.02);\n    }\n    a:link, a:visited {\n        color: #38488f;\n        text-decoration: none;\n    }\n    @media (max-width: 700px) {\n        div {\n            margin: 0 auto;\n            width: auto;\n        }\n    }\n    </style>    \n</head>\n\n<body>\n<div>\n    <

In [None]:
page.content


b'<!doctype html>\n<html>\n<head>\n    <title>Example Domain</title>\n\n    <meta charset="utf-8" />\n    <meta http-equiv="Content-type" content="text/html; charset=utf-8" />\n    <meta name="viewport" content="width=device-width, initial-scale=1" />\n    <style type="text/css">\n    body {\n        background-color: #f0f0f2;\n        margin: 0;\n        padding: 0;\n        font-family: -apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;\n        \n    }\n    div {\n        width: 600px;\n        margin: 5em auto;\n        padding: 2em;\n        background-color: #fdfdff;\n        border-radius: 0.5em;\n        box-shadow: 2px 3px 7px 2px rgba(0,0,0,0.02);\n    }\n    a:link, a:visited {\n        color: #38488f;\n        text-decoration: none;\n    }\n    @media (max-width: 700px) {\n        div {\n            margin: 0 auto;\n            width: auto;\n        }\n    }\n    </style>    \n</head>\n\n<body>\n<div>\n    
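
`page.text` and `page.content` hold the same payload in two forms: `content` is the raw bytes, and `text` is those bytes decoded to a `str` using the encoding that requests detects (available as `page.encoding`). A minimal offline sketch of that relationship, using a made-up byte string rather than a real response:

```python
# Made-up payload containing two non-ASCII characters.
content = "<p>Caf\u00e9 \u2603</p>".encode("utf-8")   # bytes, like page.content
text = content.decode("utf-8")                        # str, like page.text

print(type(content).__name__, len(content))   # length in bytes
print(type(text).__name__, len(text))         # length in characters

# Decoding with the wrong encoding gives a different (garbled) string.
print(content.decode("latin-1") == text)
```

The byte count exceeds the character count because each non-ASCII character takes more than one byte in UTF-8.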

In [None]:
type(page)

## Trees


From [Wikipedia: Tree (Graph)]( https://en.wikipedia.org/wiki/Tree_(graph_theory) )

"In graph theory, a tree is an undirected graph in which any two vertices are connected by exactly one path, or equivalently a connected acyclic undirected graph."

Structure:
- A type of graph, i.e. nodes and edges
- Each node has zero or one parent node
- Each node has zero or more child nodes

Nomenclature:
- Nodes
- Edges
- Directed edge
- Levels, up, down
- Parent, child, sibling
- Ancestors, descendants
- Root ( no parent )
- Branches ( parents and children )
- Leaves ( no children )
- Path: sequence of nodes and edges between two nodes
- Traversing, recursing; depth first, breadth first

Type of trees:
- Binary: at most two children
- Balanced: all leaves are at roughly the same depth (the exact definition varies)


Examples:
- Filesystem
- Org charts
- Family pedigrees
- HTML, XML, JSON, YAML
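
The nomenclature above can be sketched with a small tree stored as nested dictionaries. The tree contents and the helper names (`depth_first`, `breadth_first`, `leaves`) are made up for illustration; each key is a node and each value is a dictionary of its children.

```python
from collections import deque

# Hypothetical tree: "html" is the root, "title", "h1" and "a" are leaves.
tree = {
    "html": {
        "head": {"title": {}},
        "body": {"h1": {}, "p": {"a": {}}},
    }
}

def depth_first(children, depth=0):
    """Yield (depth, name), visiting each node before its descendants."""
    for name, sub in children.items():
        yield depth, name
        yield from depth_first(sub, depth + 1)

def breadth_first(children):
    """Yield (depth, name) level by level, using a queue."""
    queue = deque((0, name, sub) for name, sub in children.items())
    while queue:
        depth, name, sub = queue.popleft()
        yield depth, name
        queue.extend((depth + 1, n, s) for n, s in sub.items())

def leaves(children):
    """Yield the names of nodes that have no children."""
    for name, sub in children.items():
        if sub:
            yield from leaves(sub)
        else:
            yield name

print([name for _, name in depth_first(tree)])
print([name for _, name in breadth_first(tree)])
print(list(leaves(tree)))
```

Depth-first order follows each branch to its leaves before moving to a sibling; breadth-first order finishes one level before starting the next.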


Example: `tree` command for viewing the filesystem

In [None]:
!tree /etc/apt

/etc/apt
├── apt.conf.d
│   ├── 01autoremove
│   ├── 01-vendor-ubuntu
│   ├── 20packagekit
│   ├── 70debconf
│   ├── 90assumeyes
│   ├── docker-autoremove-suggests
│   ├── docker-clean
│   ├── docker-disable-periodic-update
│   ├── docker-gzip-indexes
│   └── docker-no-languages
├── auth.conf.d
├── keyrings
├── preferences.d
│   └── cuda-repository-pin-600
├── sources.list
├── sources.list.d
│   ├── archive_uri-https_cloud_r-project_org_bin_linux_ubuntu-jammy.list
│   ├── c2d4u_team-ubuntu-c2d4u4_0_-jammy.list
│   ├── cuda-ubuntu2204-x86_64.list
│   ├── deadsnakes-ubuntu-ppa-jammy.list
│   ├── graphics-drivers-ubuntu-ppa-jammy.list
│   └── ubuntugis-ubuntu-ppa-jammy.list
└── trusted.gpg.d
    ├── c2d4u_team-ubuntu-c2d4u4_0_.gpg
    

In [None]:
!find /etc/apt

/etc/apt
/etc/apt/preferences.d
/etc/apt/preferences.d/cuda-repository-pin-600
/etc/apt/sources.list.d
/etc/apt/sources.list.d/archive_uri-https_cloud_r-project_org_bin_linux_ubuntu-jammy.list
/etc/apt/sources.list.d/cuda-ubuntu2204-x86_64.list
/etc/apt/sources.list.d/c2d4u_team-ubuntu-c2d4u4_0_-jammy.list
/etc/apt/sources.list.d/graphics-drivers-ubuntu-ppa-jammy.list
/etc/apt/sources.list.d/deadsnakes-ubuntu-ppa-jammy.list
/etc/apt/sources.list.d/ubuntugis-ubuntu-ppa-jammy.list
/etc/apt/trusted.gpg.d
/etc/apt/trusted.gpg.d/ubuntu-keyring-2018-archive.gpg
/etc/apt/trusted.gpg.d/ubuntu-keyring-2012-cdimage.gpg
/etc/apt/trusted.gpg.d/deadsnakes-ubuntu-ppa.gpg~
/etc/apt/trusted.gpg.d/ubuntugis-ubuntu-ppa.gpg~
/etc/apt/trusted.gpg.d/deadsnakes-ubuntu-ppa.gpg
/etc/apt/trusted.gpg.d/graphics-drivers-ubuntu-ppa.gpg
/etc/apt/trusted.gpg.d/c2d4u_team-ubuntu-c2d4u4_0_.gpg~
/etc/apt/trusted.gpg.d/c2d4u_team-ubuntu-c2d4u4_0_.gpg
/etc/apt/trusted.gpg.d/cran_ubuntu_key.asc
/etc/apt/trusted.gpg.d/gra

## Beautiful Soup and CSS selectors


In [None]:
%%capture
%%bash
curl -s -L -O https://nostarch.com/download/Automate_the_Boring_Stuff_onlinematerials_v.2.zip
unzip -jo Automate_the_Boring_Stuff_onlinematerials_v.2.zip automate_online-materials/example.html


In [None]:
!unzip -l Automate_the_Boring_Stuff_onlinematerials_v.2.zip

Archive:  Automate_the_Boring_Stuff_onlinematerials_v.2.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
        0  2018-08-27 09:25   automate_online-materials/
   582952  2015-03-30 08:51   automate_online-materials/alarm.wav
      461  2015-03-30 08:51   automate_online-materials/allMyCats1.py
      311  2015-03-30 08:51   automate_online-materials/allMyCats2.py
     1415  2015-03-30 09:03   automate_online-materials/backupToZip.py
      493  2015-03-30 08:51   automate_online-materials/birthdays.py
      633  2015-03-30 11:55   automate_online-materials/boxPrint.py
      223  2015-03-30 08:51   automate_online-materials/buggyAddingProgram.py
      399  2015-03-30 09:17   automate_online-materials/bulletPointAdder.py
      310  2015-03-30 09:18   automate_online-materials/calcProd.py
    16726  2014-09-24 11:05   automate_online-materials/catlogo.png
      121  2015-03-30 09:19   automate_online-materials/catnapping.py
   155379  2015-03-30 08:51   automate_

In [None]:
pwd

'/content'

In [None]:
ls -l

total 8608
-rw-r--r-- 1 root root 8802488 Jun 17 16:16 Automate_the_Boring_Stuff_onlinematerials_v.2.zip
-rw-r--r-- 1 root root     324 Mar 30  2015 example.html
drwxr-xr-x 1 root root    4096 Jun 13 13:28 sample_data/


In [None]:
!cat -n example.html

     1	<!-- This is the example.html file. -->
     2	
     3	<html><head><title>The Website Title</title></head>
     4	<body>
     5	<p>Download my <strong>Python</strong> book from <a href="http://inventwithpython.com">my website</a>.</p>
     6	<p class="slogan">Learn Python the easy way!</p>
     7	<p>By <span id="author">Al Sweigart</span></p>
     8	</body></html>

In [None]:
import bs4

with open('example.html') as exampleFile:
  html = exampleFile.read()
exampleSoup = bs4.BeautifulSoup( html, 'html.parser')
print(html)


<!-- This is the example.html file. -->

<html><head><title>The Website Title</title></head>
<body>
<p>Download my <strong>Python</strong> book from <a href="http://inventwithpython.com">my website</a>.</p>
<p class="slogan">Learn Python the easy way!</p>
<p>By <span id="author">Al Sweigart</span></p>
</body></html>


In [None]:
html

'<!-- This is the example.html file. -->\n\n<html><head><title>The Website Title</title></head>\n<body>\n<p>Download my <strong>Python</strong> book from <a href="http://inventwithpython.com">my website</a>.</p>\n<p class="slogan">Learn Python the easy way!</p>\n<p>By <span id="author">Al Sweigart</span></p>\n</body></html>'

In [None]:
type(html)

str

In [None]:
for i, line in enumerate(html.split('\n')):
  print(i+1, line)

1 <!-- This is the example.html file. -->
2 
3 <html><head><title>The Website Title</title></head>
4 <body>
5 <p>Download my <strong>Python</strong> book from <a href="http://inventwithpython.com">my website</a>.</p>
6 <p class="slogan">Learn Python the easy way!</p>
7 <p>By <span id="author">Al Sweigart</span></p>
8 </body></html>


In [None]:
type(exampleSoup)

In [None]:
elems = exampleSoup.select('#author')
elems

[<span id="author">Al Sweigart</span>]

In [None]:
type(elems) # elems is a list of Tag objects.

In [None]:
len(elems)

1

In [None]:
elems[0]

<span id="author">Al Sweigart</span>

In [None]:
type(elems[0])

In [None]:
str(elems[0]) # The Tag object as a string.

'<span id="author">Al Sweigart</span>'

In [None]:
elems[0].getText()

'Al Sweigart'

In [None]:
elems[0].attrs

{'id': 'author'}

In [None]:
pElems = exampleSoup.select('p')
pElems

[<p>Download my <strong>Python</strong> book from <a href="http://inventwithpython.com">my website</a>.</p>,
 <p class="slogan">Learn Python the easy way!</p>,
 <p>By <span id="author">Al Sweigart</span></p>]

In [None]:
len(pElems)

3

In [None]:
pElems[:2]

[<p>Download my <strong>Python</strong> book from <a href="http://inventwithpython.com">my website</a>.</p>,
 <p class="slogan">Learn Python the easy way!</p>]

In [None]:
pElems[0]

<p>Download my <strong>Python</strong> book from <a href="http://inventwithpython.com">my website</a>.</p>

In [None]:
type(pElems[0])

In [None]:
str(pElems[0])

'<p>Download my <strong>Python</strong> book from <a href="http://inventwithpython.com">my website</a>.</p>'

In [None]:
pElems[0].getText()

'Download my Python book from my website.'

In [None]:
str(pElems[1])

'<p class="slogan">Learn Python the easy way!</p>'

In [None]:
pElems[1].getText()

'Learn Python the easy way!'

In [None]:
str(pElems[2])

'<p>By <span id="author">Al Sweigart</span></p>'

In [None]:
pElems[2].getText()

'By Al Sweigart'

In [None]:
aElems = exampleSoup.select('a')
aElems


[<a href="http://inventwithpython.com">my website</a>]

In [None]:
aElems[0]


<a href="http://inventwithpython.com">my website</a>

In [None]:
aElems[0].attrs


{'href': 'http://inventwithpython.com'}

In [None]:
aElems[0].attrs["href"]


'http://inventwithpython.com'

In [None]:
exampleSoup.select("body")[0].get_text().split('\n')

['',
 'Download my Python book from my website.',
 'Learn Python the easy way!',
 'By Al Sweigart',
 '']

### Using DBpedia

In [None]:
dbpedia = requests.get("https://dbpedia.org/page/Digby_Morrell")
dbpedia

<Response [200]>

In [None]:
html = dbpedia.text
html

'<!DOCTYPE html>\n<html\n    prefix="\n        dbp: http://dbpedia.org/property/\n        dbo: http://dbedia.org/ontology/\n        dct: http://purl.org/dc/terms/\n        dbd: http://dbpedia.org/datatype/\n\tog:  https://ogp.me/ns#\n\t"\n>\n\n\n<!-- header -->\n<head>\n    <meta charset="utf-8" />\n    <meta name="viewport" content="width=device-width, initial-scale=1" />\n\n    <title>About: Digby Morrell</title>\n\n    <!-- Links -->\n    <link rel="alternate" type="application/rdf+xml" \t\thref="http://dbpedia.org/data/Digby_Morrell.rdf" title="Structured Descriptor Document (RDF/XML format)" />\n    <link rel="alternate" type="text/n3" \t\t\thref="http://dbpedia.org/data/Digby_Morrell.n3" title="Structured Descriptor Document (N3 format)" />\n    <link rel="alternate" type="text/turtle" \t\t\thref="http://dbpedia.org/data/Digby_Morrell.ttl" title="Structured Descriptor Document (Turtle format)" />\n    <link rel="alternate" type="application/json+rdf" \t\thref="http://dbpedia.org/

In [None]:
dom = bs4.BeautifulSoup(html, "html.parser")
dom

<!DOCTYPE html>

<html prefix="
        dbp: http://dbpedia.org/property/
        dbo: http://dbedia.org/ontology/
        dct: http://purl.org/dc/terms/
        dbd: http://dbpedia.org/datatype/
	og:  https://ogp.me/ns#
	">
<!-- header -->
<head>
<meta charset="utf-8"/>
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<title>About: Digby Morrell</title>
<!-- Links -->
<link href="http://dbpedia.org/data/Digby_Morrell.rdf" rel="alternate" title="Structured Descriptor Document (RDF/XML format)" type="application/rdf+xml"/>
<link href="http://dbpedia.org/data/Digby_Morrell.n3" rel="alternate" title="Structured Descriptor Document (N3 format)" type="text/n3"/>
<link href="http://dbpedia.org/data/Digby_Morrell.ttl" rel="alternate" title="Structured Descriptor Document (Turtle format)" type="text/turtle"/>
<link href="http://dbpedia.org/data/Digby_Morrell.jrdf" rel="alternate" title="Structured Descriptor Document (RDF/JSON format)" type="application/json+rdf"/>
<link h

In [None]:
type(html)

str

In [None]:
type(dom)

In [None]:
a_tags = dom.select("a")
a_tags

[<a class="navbar-brand" href="http://wiki.dbpedia.org/about" style="color: #2c5078" title="About DBpedia">
 <img alt="About DBpedia" class="img-fluid" src="/statics/images/dbpedia_logo_land_120.png"/>
 </a>,
 <a aria-expanded="false" class="nav-link dropdown-toggle" data-bs-toggle="dropdown" href="#" id="navbarDropdownBrowse" role="button">
 <i class="bi-eye-fill"></i> Browse using<span class="caret"></span></a>,
 <a class="nav-link" href="/describe/?uri=http%3A%2F%2Fdbpedia.org%2Fresource%2FDigby_Morrell">OpenLink Faceted Browser</a>,
 <a class="nav-link" href="http://osde.demo.openlinksw.com/#/editor?uri=http%3A%2F%2Fdbpedia.org%2Fdata%2FDigby_Morrell.ttl&amp;view=statements">OpenLink Structured Data Editor</a>,
 <a class="nav-link" href="http://en.lodlive.it/?http%3A%2F%2Fdbpedia.org%2Fresource%2FDigby_Morrell">LodLive Browser</a>,
 <a aria-expanded="false" class="nav-link dropdown-toggle" data-bs-toggle="dropdown" href="#" id="navbarDropdownFormats" role="button">
 <i class="bi-fi

In [None]:
len(a_tags)

177

In [None]:
[ tag.attrs for tag in a_tags ]

[{'class': ['navbar-brand'],
  'href': 'http://wiki.dbpedia.org/about',
  'title': 'About DBpedia',
  'style': 'color: #2c5078'},
 {'class': ['nav-link', 'dropdown-toggle'],
  'href': '#',
  'id': 'navbarDropdownBrowse',
  'role': 'button',
  'data-bs-toggle': 'dropdown',
  'aria-expanded': 'false'},
 {'class': ['nav-link'],
  'href': '/describe/?uri=http%3A%2F%2Fdbpedia.org%2Fresource%2FDigby_Morrell'},
 {'class': ['nav-link'],
  'href': 'http://osde.demo.openlinksw.com/#/editor?uri=http%3A%2F%2Fdbpedia.org%2Fdata%2FDigby_Morrell.ttl&view=statements'},
 {'class': ['nav-link'],
  'href': 'http://en.lodlive.it/?http%3A%2F%2Fdbpedia.org%2Fresource%2FDigby_Morrell'},
 {'class': ['nav-link', 'dropdown-toggle'],
  'href': '#',
  'id': 'navbarDropdownFormats',
  'role': 'button',
  'data-bs-toggle': 'dropdown',
  'aria-expanded': 'false'},
 {'class': ['dropdown-item'],
  'href': 'http://dbpedia.org/data/Digby_Morrell.ntriples'},
 {'class': ['dropdown-item'],
  'href': 'http://dbpedia.org/dat

In [None]:
for a_tag in a_tags:
  href = a_tag.attrs.get("href", "")  # some <a> tags may lack an href
  if href.startswith("http"):
    rev = a_tag.attrs.get("rev", "")
    # keep only the link marked as the page's primary topic
    if "foaf:primaryTopic" in rev:
      print(href)


http://en.wikipedia.org/wiki/Digby_Morrell


Using only CSS selectors.

In [None]:
foaf_pt = dom.select('a[href^="http"][rev="foaf:primaryTopic"]')
foaf_pt[0].attrs["href"]


'http://en.wikipedia.org/wiki/Digby_Morrell'

In [None]:
foaf_pt[0].attrs["href"].split("/")[-1]
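
Splitting on "/" works for this URL, but it would also keep any query string or fragment that follows the path. A slightly more defensive sketch uses `urllib.parse` from the standard library. The `last_path_segment` helper and the second URL are hypothetical.

```python
from urllib.parse import unquote, urlparse

def last_path_segment(url):
    """Return the final, percent-decoded segment of a URL's path."""
    path = urlparse(url).path                 # drops the query and fragment
    return unquote(path.rstrip("/").rsplit("/", 1)[-1])

print(last_path_segment("http://en.wikipedia.org/wiki/Digby_Morrell"))
# A made-up variant with a query string and a fragment gives the same result.
print(last_path_segment("http://en.wikipedia.org/wiki/Digby_Morrell?action=raw#top"))
```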


### STEM Boomerang


https://stemboomerang.org/stem-career-fair-23/



In [None]:
url = "https://stemboomerang.org/stem-career-fair-23/"


Use curl to save the page as an HTML file, then parse the saved file.

In [None]:
!curl -s {url} > boom.html
!ls -la

total 8916
drwxr-xr-x 1 root root    4096 Jun 17 16:27 .
drwxr-xr-x 1 root root    4096 Jun 17 16:15 ..
-rw-r--r-- 1 root root 8802488 Jun 17 16:16 Automate_the_Boring_Stuff_onlinematerials_v.2.zip
-rw-r--r-- 1 root root  300630 Jun 17 16:27 boom.html
drwxr-xr-x 4 root root    4096 Jun 13 13:27 .config
-rw-r--r-- 1 root root     324 Mar 30  2015 example.html
drwxr-xr-x 1 root root    4096 Jun 13 13:28 sample_data


In [None]:
!grep -i tricore boom.html

<p>.</div></div></div></div></div><div class="wpb_column vc_column_container vc_col-sm-4 vc_col-has-fill "><div class="vc_column-inner vc_custom_1675479607705"><div class="wpb_wrapper"><style type="text/css" scoped="scoped">.vc_front_widget.fs_scope_5{background-color: #ffffff;color: #777777;transition: all .2s linear; -webkit-transition: all .2s linear;border: 1px solid #eeeeee;}.vc_front_widget.fs_scope_5:hover{background-color: #0daa95;color: #ffffff;border-color:#ffffff;}.vc_front_widget.fs_scope_5 .font_icons {color: #db9232;transition: all .2s linear; -webkit-transition: all .2s linear;}.vc_front_widget.fs_scope_5:hover .font_icons {color: #ffffff;}.vc_front_widget.fs_scope_5 a {color: #4a9b45;transition: all .2s linear; -webkit-transition: all .2s linear;}.vc_front_widget.fs_scope_5:hover a {color: #ffffff;}.vc_front_widget.fs_scope_5 h3.widget-title, .vc_front_widget.fs_scope_5 h3.widget-title a {color: #444444;transition: all .2s linear; -webkit-transition: all .2s linear;}.vc

In [None]:
with open('boom.html') as boom_file:
  html = boom_file.read()
dom = bs4.BeautifulSoup( html, 'html.parser')
str(dom)[:500]

'<!DOCTYPE html>\n\n<html class="creativo-elements-custom-header" dir="ltr" lang="en-US" prefix="og: https://ogp.me/ns#" xmlns:fb="//www.facebook.com/2008/fbml" xmlns:og="//opengraphprotocol.org/schema/">\n<head>\n<meta content="IE=edge" http-equiv="X-UA-Compatible"/>\n<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>\n<meta content="width=device-width, initial-scale=1, maximum-scale=1" name="viewport"/>\n<meta charset="utf-8"/>\n<link href="https://gmpg.org/xfn/11" rel="profile"/>\n<ti'

Using `requests` and `bs4`.

In [None]:
url = "https://stemboomerang.org/stem-career-fair-23/"
boom = requests.get(url)
boom

<Response [406]>

In [None]:
boom.request.headers

{'User-Agent': 'python-requests/2.31.0', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}

Send a different `User-Agent` header with the request.

In [None]:
# Each assignment overwrites the previous one, so only the last value is sent.
# Reorder or comment out lines to try the other values.
user_agent = {'User-agent': 'Mozilla/5.0 (X11; CrOS x86_64 14541.0.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36'}
user_agent = {'User-agent': 'foobar'}
user_agent = {'User-agent': 'python-requests'}
user_agent = {'User-agent': 'I am a hacker.  I should be blocked from your site.'}
boom = requests.get(url, headers=user_agent)
boom

<Response [200]>

In [None]:
boom.request.headers

{'User-agent': 'I am a hacker.  I should be blocked from your site.', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}
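
The same idea can be sketched with the standard library alone. The block below builds a request object with a custom `User-Agent` but does not send it, and looks up the meaning of the 406 status seen earlier. The header value is a made-up placeholder.

```python
from http import HTTPStatus
from urllib.request import Request

url = "https://stemboomerang.org/stem-career-fair-23/"
# Hypothetical User-Agent value; nothing is sent over the network here.
req = Request(url, headers={"User-Agent": "my-notebook/0.1 (hypothetical)"})

print(req.get_method())                # GET, because no data is attached
print(req.get_header("User-agent"))    # urllib stores header names capitalized
print(HTTPStatus(406).phrase)          # the reason phrase for status 406
```

Header names are case-insensitive in HTTP, which is why `User-agent` and `User-Agent` are interchangeable.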

In [None]:
html = boom.text

In [None]:
dom = bs4.BeautifulSoup( html, 'html.parser')
str(dom)[:500]

'<!DOCTYPE html>\n\n<html class="creativo-elements-custom-header" dir="ltr" lang="en-US" prefix="og: https://ogp.me/ns#" xmlns:fb="//www.facebook.com/2008/fbml" xmlns:og="//opengraphprotocol.org/schema/">\n<head>\n<meta content="IE=edge" http-equiv="X-UA-Compatible"/>\n<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>\n<meta content="width=device-width, initial-scale=1, maximum-scale=1" name="viewport"/>\n<meta charset="utf-8"/>\n<link href="https://gmpg.org/xfn/11" rel="profile"/>\n<ti'

In [None]:
a_tags = dom.select("a")
a_tags

[<a class="skip-link screen-reader-text" href="#content">Skip to content</a>,
 <a href="https://stemboomerang.org" rel="home" title="stemboomerang.org">
 						stemboomerang.org					</a>,
 <a href="https://stemboomerang.org" rel="home" title="stemboomerang.org">
 <img alt="Boomerang New Mexico" class="cr-site-logo-default" decoding="async" height="527" sizes="(max-width: 2095px) 100vw, 2095px" src="https://i0.wp.com/stemboomerang.org/wp-content/uploads/2021/01/BoomerangNM.jpg?fit=2095%2C527&amp;ssl=1" srcset="https://i0.wp.com/stemboomerang.org/wp-content/uploads/2021/01/BoomerangNM.jpg?w=2095&amp;ssl=1 2095w, https://i0.wp.com/stemboomerang.org/wp-content/uploads/2021/01/BoomerangNM.jpg?resize=300%2C75&amp;ssl=1 300w, https://i0.wp.com/stemboomerang.org/wp-content/uploads/2021/01/BoomerangNM.jpg?resize=1024%2C258&amp;ssl=1 1024w, https://i0.wp.com/stemboomerang.org/wp-content/uploads/2021/01/BoomerangNM.jpg?resize=768%2C193&amp;ssl=1 768w, https://i0.wp.com/stemboomerang.org/wp-conten

In [None]:
len(a_tags)

33

In [None]:
[ tag.attrs for tag in a_tags ]

[{'class': ['skip-link', 'screen-reader-text'], 'href': '#content'},
 {'href': 'https://stemboomerang.org',
  'title': 'stemboomerang.org',
  'rel': ['home']},
 {'href': 'https://stemboomerang.org',
  'rel': ['home'],
  'title': 'stemboomerang.org'},
 {'href': 'https://stemboomerang.org'},
 {'href': 'https://stemboomerang.org/our-mission/'},
 {'href': 'https://www.boomerang-nm.com/info'},
 {'href': 'https://stemboomerang.org/home/contact/'},
 {'href': 'https://stemboomerang.org'},
 {'href': 'https://www.qstation.tech/'},
 {'href': 'mailto:Info@stemboomerang.org'},
 {'href': 'http://www.boomerang-nm.com'},
 {'href': 'https://jgmsinc.com/', 'target': '_self', 'class': ['block']},
 {'href': 'https://techsource-inc.com/careers',
  'target': '_self',
  'class': ['block']},
 {'href': 'https://www.visionquest-bio.com/',
  'target': '_self',
  'class': ['block']},
 {'href': 'www.phs.org', 'target': '_self', 'class': ['block']},
 {'href': 'www.ideas-tek.com/join-us', 'target': '_self', 'class':

In [None]:
[ tag.attrs["href"] for tag in a_tags[:10] ]

['#content',
 'https://stemboomerang.org',
 'https://stemboomerang.org',
 'https://stemboomerang.org',
 'https://stemboomerang.org/our-mission/',
 'https://www.boomerang-nm.com/info',
 'https://stemboomerang.org/home/contact/',
 'https://stemboomerang.org',
 'https://www.qstation.tech/',
 'mailto:Info@stemboomerang.org']

In [None]:
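hrefs = []
for a_tag in a_tags:
  href = a_tag.attrs.get("href", "")  # some <a> tags may lack an href
  if href.startswith("http"):
    # bs4 returns the class attribute as a list of class names
    classes = a_tag.attrs.get("class", [])
    if "block" in classes:
      hrefs.append(href)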

len(hrefs)

14

In [None]:
sorted(hrefs)

['https://axientcorp.com/resource/corporate-overview/',
 'https://careers-encantadotech.icims.com/jobs/search?ss=1&searchRelation=keyword_all&searchCategory=8730',
 'https://goadelante.org/315-2/employment/',
 'https://goadelante.org/315-2/employment/',
 'https://honeywell.phenompro.com/us/en/campaign-fm-t',
 'https://jgmsinc.com/',
 'https://jobs.boeing.com/',
 'https://rs21.io/careers',
 'https://techsource-inc.com/careers',
 'https://www.pebblelabs.com/careers',
 'https://www.stemsantafe.org/jobs',
 'https://www.tecolote.com/careers',
 'https://www.tricore.org/about-tricore/careers/',
 'https://www.visionquest-bio.com/']
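
Two things stand out in these results: one link appears twice, and links written without a scheme (such as `www.phs.org` in the attribute list earlier) were skipped by the `startswith("http")` test. A minimal sketch of handling both with `urllib.parse`, using a handful of href values copied from the outputs above; the helper names are hypothetical.

```python
from urllib.parse import urlparse

# A few href values copied from the outputs above.
sample_hrefs = [
    "#content",
    "https://stemboomerang.org",
    "mailto:Info@stemboomerang.org",
    "www.phs.org",
    "https://goadelante.org/315-2/employment/",
    "https://goadelante.org/315-2/employment/",
]

def web_links(hrefs):
    """Keep http(s) links, dropping duplicates but preserving first-seen order."""
    kept = [h for h in hrefs if urlparse(h).scheme in ("http", "https")]
    return list(dict.fromkeys(kept))

def missing_scheme(hrefs):
    """Links that look like hostnames but lack a scheme, e.g. 'www.phs.org'."""
    return [h for h in hrefs if not urlparse(h).scheme and h.startswith("www.")]

print(web_links(sample_hrefs))
print(missing_scheme(sample_hrefs))
```

Whether scheme-less links should be repaired (for example by prefixing `https://`) depends on the site, so the sketch only reports them.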

Using only CSS selectors.

In [None]:
a_tags = dom.select("a[href^='http'][class*='block']")
len(a_tags)

14

In [None]:
a_tags[0]

<a class="block" href="https://jgmsinc.com/" target="_self"><span class="overlay_effect block absolute top-0 w-full h-full opacity-0 bg-black transition-opacity duration-200 ease-linear group-hover:opacity-75 flex items-center justify-center"><i class="fa fa-link text-4xl"></i></span></a>

In [None]:
[ tag.attrs["href"] for tag in a_tags ]

['https://jgmsinc.com/',
 'https://techsource-inc.com/careers',
 'https://www.visionquest-bio.com/',
 'https://careers-encantadotech.icims.com/jobs/search?ss=1&searchRelation=keyword_all&searchCategory=8730',
 'https://rs21.io/careers',
 'https://axientcorp.com/resource/corporate-overview/',
 'https://www.stemsantafe.org/jobs',
 'https://www.tricore.org/about-tricore/careers/',
 'https://www.tecolote.com/careers',
 'https://goadelante.org/315-2/employment/',
 'https://goadelante.org/315-2/employment/',
 'https://honeywell.phenompro.com/us/en/campaign-fm-t',
 'https://jobs.boeing.com/',
 'https://www.pebblelabs.com/careers']
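
The loop and the selector agree on this page, but they are not strictly equivalent: `[class*='block']` matches any class attribute containing the substring `block`, while `a.block` matches the class token exactly. A minimal sketch on made-up markup (the `sample` HTML and the `example.com` URLs are assumptions for illustration, not taken from the scraped page):

```python
import bs4

# Hypothetical markup: one exact "block" class, one class that merely
# contains "block", and one relative link.
sample = '''
<a class="block" href="https://example.com/a">A</a>
<a class="inline-block" href="https://example.com/b">B</a>
<a class="block" href="/relative">C</a>
'''
soup = bs4.BeautifulSoup(sample, "html.parser")

# Substring match on the class attribute
substring_hrefs = [a["href"] for a in soup.select("a[href^='http'][class*='block']")]

# Exact class-token match
token_hrefs = [a["href"] for a in soup.select("a.block[href^='http']")]

print(substring_hrefs)
print(token_hrefs)
```

If the page uses utility classes such as `inline-block`, the token form is probably the safer choice.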

## XPath

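XPath is another way to address elements in the document tree. The `lxml` library evaluates XPath expressions and can pull the same kind of information as the CSS selectors above.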

In [None]:
from lxml import html
import requests


In [None]:
page = requests.get('http://www.example.com')
tree = html.fromstring(page.content)
tree

<Element html at 0x79e753d23790>

In [None]:
page.content

b'<!doctype html>\n<html>\n<head>\n    <title>Example Domain</title>\n\n    <meta charset="utf-8" />\n    <meta http-equiv="Content-type" content="text/html; charset=utf-8" />\n    <meta name="viewport" content="width=device-width, initial-scale=1" />\n    <style type="text/css">\n    body {\n        background-color: #f0f0f2;\n        margin: 0;\n        padding: 0;\n        font-family: -apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;\n        \n    }\n    div {\n        width: 600px;\n        margin: 5em auto;\n        padding: 2em;\n        background-color: #fdfdff;\n        border-radius: 0.5em;\n        box-shadow: 2px 3px 7px 2px rgba(0,0,0,0.02);\n    }\n    a:link, a:visited {\n        color: #38488f;\n        text-decoration: none;\n    }\n    @media (max-width: 700px) {\n        div {\n            margin: 0 auto;\n            width: auto;\n        }\n    }\n    </style>    \n</head>\n\n<body>\n<div>\n    

In [None]:
title = tree.xpath('/html/head/title/text()')
print(title[0])


Example Domain


In [None]:
link = tree.xpath('/html/body/div/p[2]/a/@href')
print(link[0])


https://www.iana.org/domains/example


In [None]:
with open('example.html') as exampleFile:
  content = exampleFile.read()
print(content)


<!-- This is the example.html file. -->

<html><head><title>The Website Title</title></head>
<body>
<p>Download my <strong>Python</strong> book from <a href="http://inventwithpython.com">my website</a>.</p>
<p class="slogan">Learn Python the easy way!</p>
<p>By <span id="author">Al Sweigart</span></p>
</body></html>


In [None]:
# Get a list of elements: any element with any attribute equal to "author"
tree = html.fromstring( content )
elems_xp = tree.xpath('//*[@*="author"]')
elems_xp


[<Element span at 0x79e753c7d8f0>]

In [None]:
# Get one element
elems_xp[0]


<Element span at 0x79e753c7d8f0>

In [None]:
# Text bounded by the tag
elems_xp[0].text

'Al Sweigart'

In [None]:
# Element attributes
elems_xp[0].attrib


{'id': 'author'}

In [None]:
pElems_xp = tree.xpath('//p')
pElems_xp


[<Element p at 0x79e753c7e1b0>,
 <Element p at 0x79e753c7e200>,
 <Element p at 0x79e753c7e250>]

In [None]:
pElems_xp[0]


<Element p at 0x79e753c7e1b0>

In [None]:
pElems_xp[0].text


'Download my '

In [None]:
pElems_xp[1]


<Element p at 0x79e753c7e200>

In [None]:
pElems_xp[1].text


'Learn Python the easy way!'

In [None]:
pElems_xp[2]


<Element p at 0x79e753c7e250>

In [None]:
pElems_xp[2].text


'By '
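
As the outputs above show, `.text` returns only the text that precedes the first child element. The sketch below assumes the first paragraph of `example.html` is parsed on its own. It shows how `text_content()` or the XPath `string(.)` function gathers the text of an element together with all of its descendants:

```python
from lxml import html

# First paragraph of example.html, parsed as a standalone fragment
fragment = html.fromstring(
    '<p>Download my <strong>Python</strong> book from '
    '<a href="http://inventwithpython.com">my website</a>.</p>'
)

leading_text = fragment.text             # text before the first child only
full_text = fragment.text_content()      # text of the element and its descendants
xpath_text = fragment.xpath("string(.)")  # the same result via XPath
pieces = fragment.xpath(".//text()")     # every text node, in document order

print(repr(leading_text))
print(full_text)
print(pieces)
```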

## CSS vs XPath


In [None]:
example = '''
<body>
  <p class="phrase">Hello, world!</p>
  <p>By <span id="name">Foo Bar</span></p>
</body></html>
'''
example


'\n<body>\n  <p class="phrase">Hello, world!</p>\n  <p>By <span id="name">Foo Bar</span></p>\n</body></html>\n'

In [None]:
# Using CSS selectors
import bs4
tree_css = bs4.BeautifulSoup( example, 'html.parser')
elems_css = tree_css.select('#name')
elems_css

[<span id="name">Foo Bar</span>]

In [None]:
tree_css


<body>
<p class="phrase">Hello, world!</p>
<p>By <span id="name">Foo Bar</span></p>
</body>

In [None]:
elems_css[0]

<span id="name">Foo Bar</span>

In [None]:
elems_css[0].text

'Foo Bar'

In [None]:
str(elems_css[0])

'<span id="name">Foo Bar</span>'

In [None]:
# Using XPath
from lxml import html
tree = html.fromstring( example )
elems_xp = tree.xpath('//*[@*="name"]')
elems_xp


[<Element span at 0x79e753c89350>]

In [None]:
elems_xp[0]


<Element span at 0x79e753c89350>

In [None]:
# Text bounded by the tag
elems_xp[0].text

'Foo Bar'

In [None]:
# str() on an lxml element gives its repr, not its markup; lxml.html.tostring() serializes it
str(elems_xp[0])


'<Element span at 0x79e753c89350>'

## HEAD request

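A HEAD request asks the server for the response headers only, without the body. `curl -I` sends one; `-s` silences the progress meter and `-L` follows redirects.

Having fun with HTTP headers: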
- [Fun and unusual HTTP response headers]( https://www.pingdom.com/blog/fun-and-unusual-http-response-headers/ )

Response codes:
- MDN [HTTP response status codes]( https://developer.mozilla.org/en-US/docs/Web/HTTP/Status )
- Wikipedia [List of HTTP status codes]( https://en.wikipedia.org/wiki/List_of_HTTP_status_codes )


In [None]:
!curl -s -I 'https://jgmsinc.com/'

HTTP/2 301 
age: 69605
date: Sun, 16 Jun 2024 21:54:20 GMT
location: https://www.jgmsinc.com/
server: Squarespace
set-cookie: crumb=BYC2GVhSG66+NjU5YzJiNTZiMGYyMzYyZDllZGYwN2QyMTJlY2Uy;Secure;Path=/
strict-transport-security: max-age=15552000
x-contextid: nS2ORyRj/jD7xUOBO
content-length: 0



In [None]:
!curl -s -I 'https://careers-encantadotech.icims.com/jobs/search?ss=1&searchRelation=keyword_all&searchCategory=8730'

HTTP/2 200 
content-type: text/html;charset=UTF-8
server: nginx
date: Mon, 17 Jun 2024 17:15:02 GMT
p3p: CP="CAO PSA OUR"
strict-transport-security: max-age=63072000; includeSubDomains
expires: Sun, 16 Jun 2024 17:15:02 GMT
cache-control: no-cache, no-store
pragma: no-cache
set-cookie: JSESSIONID=6B0C8443A7FA07BD2B6B002BACDF9AA9; Path=/; Secure; HttpOnly; SameSite=Lax
icims-ats-host: appip-10-47-163-217-prod206
icims-ats-customer: 12243
x-icims-tenant: 0x0001
x-cache: Miss from cloudfront
via: 1.1 29147f9e38067439b15976c1b4e88fc2.cloudfront.net (CloudFront)
x-amz-cf-pop: HKG1-P1
alt-svc: h3=":443"; ma=86400
x-amz-cf-id: PseBUJRyHF16X4-T0Hk4OIKP7BLBlfLRIx1XWIBCCzvluicXrc-efg==
server-timing: cdn-upstream-layer;desc="REC",cdn-upstream-dns;dur=5,cdn-upstream-connect;dur=623,cdn-upstream-fbl;dur=843,cdn-cache-miss,cdn-pop;desc="HK

In [None]:
!curl -s -L -I 'https://cnmingenuity.org/'


HTTP/2 301 
date: Mon, 17 Jun 2024 17:16:17 GMT
content-type: text/html; charset=UTF-8
location: https://www.cnmingenuity.org/
expires: Mon, 17 Jun 2024 18:14:43 GMT
x-redirect-by: WordPress
x-powered-by: WP Engine
x-cacheable: non200
cache-control: max-age=600, must-revalidate
x-cache: HIT: 1
x-cache-group: normal
cf-cache-status: DYNAMIC
server: cloudflare
cf-ray: 8954afc00cf78410-TPE
alt-svc: h3=":443"; ma=86400

HTTP/2 200 
date: Mon, 17 Jun 2024 17:16:17 GMT
content-type: text/html; charset=UTF-8
vary: Accept-Encoding
vary: Accept-Encoding
vary: Accept-Encoding
vary: Accept-Encoding,Cookie
link: <https://www.cnmingenuity.org/wp-json/>; rel="https://api.w.org/"
link: <https://www.cnmingenuity.org/wp-json/wp/v2/pages/2>; rel="alternate"; type="application/json"
link: <https://www.cnminge

In [None]:
!curl -I -s https://ddc-datascience.s3.amazonaws.com/Projects/Project.1-Transactions/Data/Transaction.train.csv

HTTP/1.1 200 OK
x-amz-id-2: TckodZ4QNSmep2PxyevBOJ7vX7aW1CKccasnqoMMhv9fAunrklU2WlJINn24Ef0x3FwAUExzAjA=
x-amz-request-id: FFH2E2SVXPZJNJDD
Date: Mon, 17 Jun 2024 17:17:24 GMT
Last-Modified: Sat, 23 Sep 2023 18:46:31 GMT
ETag: "97b9422de3eca4f1c4adbd7789fd110c-9"
x-amz-server-side-encryption: AES256
Accept-Ranges: bytes
Content-Type: text/csv
Server: AmazonS3
Content-Length: 73445257

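The same kind of request can be sent from Python. The sketch below uses only the standard library and a throwaway local server so that it runs without network access. `DemoHandler`, the `head` helper, and the `/data.csv` path are hypothetical names for illustration, and the header values are copied from the S3 response above. With Requests, `requests.head(url)` is the usual counterpart.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class DemoHandler(BaseHTTPRequestHandler):
    """Hypothetical handler that answers HEAD with headers only."""

    def do_HEAD(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/csv")
        self.send_header("Content-Length", "73445257")
        self.end_headers()

    def log_message(self, *args):
        pass  # keep the demo output quiet


def head(url):
    """Send a HEAD request; return (status, headers dict, body bytes)."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.status, dict(response.headers), response.read()


# Port 0 lets the operating system pick a free port
server = HTTPServer(("127.0.0.1", 0), DemoHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

status, headers, body = head(f"http://127.0.0.1:{server.server_port}/data.csv")

server.shutdown()
server.server_close()

print(status, headers["Content-Type"], headers["Content-Length"], len(body))
```

The reported `Content-Length` describes the file a GET would return; the HEAD response itself carries no body, which makes it a cheap way to check size and type before downloading.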


## robots.txt


"robots.txt is the filename used for implementing the Robots Exclusion Protocol, a standard used by websites to indicate to visiting web crawlers and other web robots which portions of the website they are allowed to visit."

- Wikipedia [robots.txt]( https://en.wikipedia.org/wiki/Robots.txt )
- Google docs [Introduction to robots.txt]( https://developers.google.com/search/docs/crawling-indexing/robots/intro )
- [RFC 9309]( https://www.rfc-editor.org/rfc/rfc9309 )




In [None]:
!curl -s 'https://cnmingenuity.org/robots.txt'


User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php


In [None]:
!curl -s 'https://www.cabq.gov/robots.txt'


Sitemap: https://www.cabq.gov/sitemap.xml.gz

# Define access-restrictions for robots/spiders
# http://www.robotstxt.org/wc/norobots.html



# By default we allow robots to access all areas of our site
# already accessible to anonymous users

User-agent: *
Disallow:



# Add Googlebot-specific syntax extension to exclude forms
# that are repeated for each piece of content in the site
# the wildcard is only supported by Googlebot
# http://www.google.com/support/webmasters/bin/answer.py?answer=40367&ctx=sibling

User-Agent: Googlebot
Disallow: /*?
Disallow: /*atct_album_view$
Disallow: /*folder_factories$
Disallow: /*folder_summary_view$
Disallow: /*login_form$
Disallow: /*mail_password_form$
Disallow: /@@search
Disallow: /*search_rss$
Disallow: /*sendto_form$
Disallow: /*summary_view$
Disallow: /*thumbnail_view$
Disallow: /*view$
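
Python's standard library can evaluate these rules through `urllib.robotparser`. A minimal sketch, assuming the `cnmingenuity.org` rules shown above are pasted in as a string rather than fetched; the `/about/` path is a made-up example:

```python
from urllib.robotparser import RobotFileParser

# Rules copied from the cnmingenuity.org output above
rules = """\
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# /about/ is a hypothetical path used only for illustration
admin_ok = parser.can_fetch("*", "https://cnmingenuity.org/wp-admin/")
about_ok = parser.can_fetch("*", "https://cnmingenuity.org/about/")

print(admin_ok, about_ok)
```

One caveat: the standard-library parser appears to apply rules in file order, whereas RFC 9309 prefers the most specific match, so its answer for an exception such as the `Allow` line here may differ from what a major crawler does. To read a live file instead, `set_url()` followed by `read()` fetches it over the network.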

## Sitemaps




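A sitemap is an XML file listing the URLs a site wants crawlers to find, optionally with metadata such as `lastmod`. Large sites split the list across several files referenced from a sitemap index, as `www.cabq.gov` does below.

- Wikipedia [sitemaps]( https://en.wikipedia.org/wiki/Sitemaps )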



In [None]:
!zcat --help

Usage: /usr/bin/zcat [OPTION]... [FILE]...
Uncompress FILEs to standard output.

  -f, --force       force; read compressed data even from a terminal
  -l, --list        list compressed file contents
  -r, --recursive   operate recursively on directories
  -S, --suffix=SUF  use suffix SUF on compressed files
      --synchronous synchronous output (safer if system crashes, but slower)
  -t, --test        test compressed file integrity
  -v, --verbose     verbose mode
      --help        display this help and exit
      --version     display version information and exit

With no FILE, or when FILE is -, read standard input.

Report bugs to <bug-gzip@gnu.org>.


In [None]:
!curl -s 'https://www.cabq.gov/sitemap.xml.gz' | zcat


<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9 http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd">
  <sitemap>
    <loc>https://www.cabq.gov/sitemap1.xml.gz</loc>
    <lastmod>2024-06-17T00:30:40-06:00</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://www.cabq.gov/sitemap2.xml.gz</loc>
    <lastmod>2024-06-17T00:30:40-06:00</lastmod>
  </sitemap>
</sitemapindex>


In [None]:
!curl -s 'https://www.cabq.gov/sitemap1.xml.gz' | zcat | head -20

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9 http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd">

  <url>
    <loc>https://www.cabq.gov</loc>
    <lastmod>2024-06-10T15:32:57-06:00</lastmod>
    
    
  </url>
  <url>
    <loc>https://www.cabq.gov/clerk/documents/augut-29-2022-lb-cancellation.pdf</loc>
    <lastmod>2022-08-23T14:07:05-06:00</lastmod>
    
    
  </url>
  <url>
    <loc>https://www.cabq.gov/cpoa/documents/draft-cpoa-board-minutes-august-11-2022-w-attachments.pdf</loc>
    <lastmod>2022-09-30T11:18:53-06:00</lastmod>
    
    


In [None]:
%%bash
curl -s 'https://www.cabq.gov/sitemap1.xml.gz' |
zcat |
grep -F '<loc>' |
sed -e 's#<loc>##g ; s#</loc>##g' |
head -3 |
while read -r url ; do
  echo "== ${url}"
  curl -s -I "${url}"
done

== https://www.cabq.gov
HTTP/2 200 
date: Mon, 17 Jun 2024 17:40:25 GMT
content-type: text/html;charset=utf-8
content-language: en
expires: Fri, 20 Jun 2014 15:28:15 GMT
vary: X-Anonymous
x-frame-options: SAMEORIGIN
age: 7929
x-cache: HIT
cache-control: max-age=0, s-maxage=0, must-revalidate
accept-ranges: bytes
cf-cache-status: DYNAMIC
server: cloudflare
cf-ray: 8954d3165cfa85ff-HKG

== https://www.cabq.gov/clerk/documents/augut-29-2022-lb-cancellation.pdf
HTTP/2 200 
date: Mon, 17 Jun 2024 17:40:25 GMT
content-type: application/pdf
content-length: 25337
expires: Fri, 20 Jun 2014 17:38:15 GMT
vary: X-Anonymous
x-frame-options: SAMEORIGIN
x-cache: HIT
cache-control: max-age=0, s-maxage=0, must-revalidate
cf-cache-status: MISS
last-modified: Mon, 17 Jun 2024 17:40:25 GMT
accept-ranges: bytes
server: cloudflare
cf-ray: 8954d31a89b6048d-HKG

== https://www.cabq.gov/cpoa/documents/draft-cpoa-board-minutes-august-11-2022-w-attachments.pdf
HTTP/2 200 
date: Mon

In [None]:
%%bash
curl -s 'https://www.cabq.gov/sitemap1.xml.gz' |
zcat |
grep -F '<loc>' |
sed -e 's#<loc>##g ; s#</loc>##g' |
wc

  40000   40000 3564999


In [None]:
%%bash
curl -s 'https://www.cabq.gov/sitemap2.xml.gz' |
zcat |
grep -F '<loc>' |
sed -e 's#<loc>##g ; s#</loc>##g' |
wc

  25269   25269 2145025
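
The `grep`/`sed` pipeline works because each `<loc>` sits on its own line. As a sketch of a more robust alternative, the XML can be parsed in Python with the standard library. The `extract_locs` helper is a hypothetical name, and the sample below is a trimmed copy of the sitemap index shown earlier, compressed in memory to stand in for the downloaded `.gz` file:

```python
import gzip
import xml.etree.ElementTree as ET

# Sitemap elements live in this XML namespace
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def extract_locs(xml_bytes):
    """Return the text of every <loc> in a sitemap or sitemap index."""
    root = ET.fromstring(xml_bytes)
    return [loc.text.strip() for loc in root.iterfind(".//sm:loc", SITEMAP_NS)]


index_xml = b"""<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://www.cabq.gov/sitemap1.xml.gz</loc>
  </sitemap>
  <sitemap>
    <loc>https://www.cabq.gov/sitemap2.xml.gz</loc>
  </sitemap>
</sitemapindex>
"""

# Stand-in for the bytes that curl would download
compressed = gzip.compress(index_xml)

locs = extract_locs(gzip.decompress(compressed))
print(locs)
```

The same helper should work on `sitemap1.xml.gz`, since `<url>` entries use the same namespace and `<loc>` element.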


In [None]:
!curl -s https://blog.nimblebox.ai/robots.txt

User-agent: *
Allow: /

Sitemap: https://blog.nimblebox.ai/sitemap.xml

In [None]:
!curl -s -I https://blog.nimblebox.ai/sitemap.xml

HTTP/2 200 
alt-svc: h3=":443"; ma=2592000
cache-control: public, max-age=0, must-revalidate
content-type: text/xml
date: Mon, 17 Jun 2024 18:00:58 GMT
etag: "f0164362eb3c72570c56da9f9dd343ed"
last-modified: Mon, 29 Apr 2024 06:25:05 GMT
server: Framer/22dcab7
server-timing: region;desc="ap-northeast-2", cache;desc="not-cached", ssg-status;desc="optimized", version;desc="22dcab7"
strict-transport-security: max-age=31536000
vary: Accept-Encoding
content-length: 6847



In [None]:
%%bash
curl -s https://blog.nimblebox.ai/sitemap.xml |
grep -F '<loc>' |
sed -e 's#<loc>##g ; s#</loc>##g' |
while read -r url ; do
  echo "== ${url}"
  curl -s -I "${url}"
done > nimblebox.txt


In [None]:
!head -100 nimblebox.txt


== https://blog.nimblebox.ai/mlops-top-python-packages
HTTP/2 200 
alt-svc: h3=":443"; ma=2592000
cache-control: public, max-age=0, must-revalidate
content-type: text/html
date: Mon, 17 Jun 2024 18:02:36 GMT
etag: "fc3505c9f389b1d8ba95dc63129046ec"
last-modified: Mon, 29 Apr 2024 06:25:04 GMT
link: <https://framerusercontent.com>; rel="preconnect", <https://framerusercontent.com>; rel="preconnect"; crossorigin=""
server: Framer/22dcab7
server-timing: region;desc="ap-northeast-2", cache;desc="not-cached", ssg-status;desc="optimized", version;desc="22dcab7"
strict-transport-security: max-age=31536000
vary: Accept-Encoding
content-length: 98329

== https://blog.nimblebox.ai/mlops-the-ultimate-guide
HTTP/2 200 
alt-svc: h3=":443"; ma=2592000
cache-control: public, max-age=0, must-revalidate
content-type: text/html
date: Mon, 17 Jun 2024 18:02:37 GMT
etag: "c274d888635ce5eabdeec772c1713aa9"
last-modified: Mon, 29 Apr 2024 06:25:04 GMT
link: <https://framerusercontent.co

In [None]:
!grep ^HTTP nimblebox.txt | sort | uniq -c


     77 HTTP/2 200 
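
The `sort | uniq -c` tally can also be done in Python with `collections.Counter`. A small sketch, assuming a shortened two-entry stand-in for `nimblebox.txt` (the `log_text` sample keeps only a few of the header lines):

```python
from collections import Counter

# Shortened stand-in for the contents of nimblebox.txt
log_text = """\
== https://blog.nimblebox.ai/mlops-top-python-packages
HTTP/2 200
content-type: text/html

== https://blog.nimblebox.ai/mlops-the-ultimate-guide
HTTP/2 200
content-type: text/html
"""

# Count each distinct status line, like `grep ^HTTP | sort | uniq -c`
status_counts = Counter(
    line.strip() for line in log_text.splitlines() if line.startswith("HTTP")
)
print(status_counts)
```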


In [None]:
!grep team nimblebox.txt

== https://blog.nimblebox.ai/mlops-team-structure


In [None]:
!curl -s -L https://nimblebox.ai/blog/mlops-team-structure | head -200


<!doctype html>
<!-- ✨ Built with Framer • https://www.framer.com/ -->
<html lang="en-US">
<head>
    <meta charset="utf-8">
    <script async src="https://ga.jspm.io/npm:es-module-shims@1.6.3/dist/es-module-shims.js" crossorigin="anonymous" data-framer-es-module-shims></script>
    <!-- Google Tag Manager -->
<script>(function(w,d,s,l,i){w[l]=w[l]||[];w[l].push({'gtm.start':
new Date().getTime(),event:'gtm.js'});var f=d.getElementsByTagName(s)[0],
j=d.createElement(s),dl=l!='dataLayer'?'&l='+l:'';j.async=true;j.src=
'https://www.googletagmanager.com/gtm.js?id='+i+dl;f.parentNode.insertBefore(j,f);
})(window,document,'script','dataLayer','GTM-W9GNZ9PF');</script>
<!-- End Google Tag Manager -->
<style>
    li>p.framer-text{
        margin-bottom: 8px;
    }
    ol.framer-text, ul.framer-text{
        padding-top: 12px !important;
    }
</style>
    <!-- End of headStart -->
    <meta name="viewport" content="width=device-width">
    <meta name="generator" content="Framer 3fa6aa4">
