https://programminghistorian.org/en/lessons/intro-to-beautiful-soup
    
https://www.historymuseum.ca/collections/search-results/?q=Ottawa&per_page=25

https://www.historymuseum.ca/collections/search-results/?q=Ottawa&per_page=200&view=list

In [1]:
!pip install beautifulsoup4

Collecting beautifulsoup4
  Downloading beautifulsoup4-4.9.3-py3-none-any.whl (115 kB)
Collecting soupsieve>1.2; python_version >= "3.0"
  Downloading soupsieve-2.0.1-py3-none-any.whl (32 kB)
Installing collected packages: soupsieve, beautifulsoup4
Successfully installed beautifulsoup4-4.9.3 soupsieve-2.0.1


In [7]:
# we need this so that we can grab stuff off the web
import requests

In [8]:
# target webpage
url = "https://www.historymuseum.ca/collections/search-results/?q=Ottawa&per_page=500&view=list"

# Get the webpage; this returns a Response object.
response = requests.get(url, timeout=30)
response.raise_for_status()  # stop here if the server returned an error status
 
# Extract the page source (HTML) as text.
data = response.text
 

In [9]:
from bs4 import BeautifulSoup

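# we give that data to BeautifulSoup so that it can extract what we're interested in
# (the 'lxml' parser is a separate package: pip install lxml; 'html.parser' is built in)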
soup = BeautifulSoup(data, 'lxml')

print(soup.prettify())


<!DOCTYPE html>
<html class="search-results" lang="en-US">
 <head>
  <meta charset="utf-8"/>
  <title>
   Search the Collections | Canadian Museum of History
  </title>
  <meta content="initial-scale=1.0, width=device-width" name="viewport"/>
  <link href="https://gmpg.org/xfn/11" rel="profile"/>
  <link href="https://www.historymuseum.ca/collections/xmlrpc.php" rel="pingback"/>
  <link href="https://www.historymuseum.ca/collections/wp-content/themes/collection-search/_images/favicons/apple-touch-icon.png?v=jw69Lm5BEG" rel="apple-touch-icon" sizes="180x180"/>
  <link href="https://www.historymuseum.ca/collections/wp-content/themes/collection-search/_images/favicons/favicon-32x32.png?v=jw69Lm5BEG" rel="icon" sizes="32x32" type="image/png"/>
  <link href="https://www.historymuseum.ca/collections/wp-content/themes/collection-search/_images/favicons/favicon-16x16.png?v=jw69Lm5BEG" rel="icon" sizes="16x16" type="image/png"/>
  <link href="https://www.historymuseum.ca/collections/wp-content/

In [10]:
# so let's find the links to the data we're interested in,
# i.e. the individual records
links = soup.find_all('a')

for link in links:
    print(link)


<a class="bypass-link" href="#main-content" id="bypass-main">
			Skip to main content		</a>
<a class="main-logo" href="https://www.historymuseum.ca/">
<img alt="Logo of the Canadian Museum of History" class="main-logo-image" height="44" src="https://www.historymuseum.ca/collections/wp-content/themes/collection-search/_images/cmh-main-logo-grey.svg" title="" width="300"/> </a>
<a class="favourites-handle" href="https://www.historymuseum.ca/collections/favourites/">
<span class="text">Favourites</span>
<svg aria-hidden="true" class="icon">
<title>Heart</title>
<use xlink:href="#heart"></use>
</svg>
<div class="favourites-count">
<span class="favourites-count-icon">0</span>
<span class="favourites-count-text"> items in your favourites list.</span>
</div>
</a>
<a href="https://www.historymuseum.ca/collections/favourites/">
				Favourites			</a>
<a aria-label="Copy favourites list url" class="favourites-box-link-handle" data-export-url-handle="favourites-box" href="#nogo">
<svg aria-hidden=

In [12]:
# so let's get the text and the href of each link

for link in links:
    names = link.contents[0]
    fullLink = link.get('href')
    print(names)
    print(fullLink)


			Skip to main content		
#main-content


https://www.historymuseum.ca/


https://www.historymuseum.ca/collections/favourites/

				Favourites			
https://www.historymuseum.ca/collections/favourites/


#nogo

		All Favourites	
https://www.historymuseum.ca/collections/favourites/
Browse the collection
https://www.historymuseum.ca/collections/search-results/
Home
https://www.historymuseum.ca/collections

	Français
https://www.museedelhistoire.ca/collections/resultats-de-recherche/?q=Ottawa&per_page=500&view=list
25
/collections/search-results/?q=Ottawa&per_page=25&view=list&page_num=1
50
/collections/search-results/?q=Ottawa&per_page=50&view=list&page_num=1
100
/collections/search-results/?q=Ottawa&per_page=100&view=list&page_num=1
200
/collections/search-results/?q=Ottawa&per_page=200&view=list&page_num=1


#media-filter
Image Only
/collections/search-results/?q=Ottawa&per_page=500&view=list&media=images
Show All Media
/collections/search-results/?q=Ottawa&per_page=500&view=list&media=a

In [14]:
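# let's write those links to a file
import csv

# newline="" stops the csv module from writing blank rows on Windows;
# the with-block closes the file so every row is flushed to disk
with open("histmuse.csv", "w", newline="", encoding="utf-8") as csvfile:
    f = csv.writer(csvfile)
    f.writerow(["Name", "Link"])  # Write column headers as the first line

    links = soup.find_all('a')
    for link in links:
        names = link.contents[0]
        fullLink = link.get('href')
        # print(names)
        # print(fullLink)
        f.writerow([names, fullLink])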

Okay! So that's a bit messy, admittedly, but you now have a csv that you could easily clean up so that it just looks like this:

```
https://www.historymuseum.ca/collections/artifact/1337564/
https://www.historymuseum.ca/collections/artifact/2359060/
https://www.historymuseum.ca/collections/artifact/1316383/
https://www.historymuseum.ca/collections/artifact/2365313/
https://www.historymuseum.ca/collections/artifact/2365193/
```
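That cleanup does not have to be done by hand. Here is a minimal sketch, assuming the record links are the ones whose `href` contains `/collections/artifact/` (the pattern in the example URLs above). The names `artifact_urls`, `BASE_URL`, and `ARTIFACT_PATH` are hypothetical, and the `hrefs` list is a small stand-in for what `soup.find_all('a')` would give you in the notebook:

```python
from urllib.parse import urljoin

# Assumption: relative links should be resolved against the search-results page.
BASE_URL = "https://www.historymuseum.ca/collections/search-results/"
# Assumption: record pages are the links containing this path.
ARTIFACT_PATH = "/collections/artifact/"


def artifact_urls(hrefs, base_url=BASE_URL):
    """Return unique, absolute artifact URLs in first-seen order."""
    seen = set()
    urls = []
    for href in hrefs:
        if not href:  # <a> tags without an href give None
            continue
        absolute = urljoin(base_url, href)
        if ARTIFACT_PATH in absolute and absolute not in seen:
            seen.add(absolute)
            urls.append(absolute)
    return urls


# In the notebook this would be: hrefs = [a.get("href") for a in soup.find_all("a")]
hrefs = [
    "#main-content",
    None,
    "https://www.historymuseum.ca/collections/favourites/",
    "https://www.historymuseum.ca/collections/artifact/1337564/",
    "/collections/artifact/2359060/",
    "https://www.historymuseum.ca/collections/artifact/1337564/",  # duplicate
]

urls = artifact_urls(hrefs)
print("\n".join(urls))
```

Writing `"\n".join(urls)` to a text file gives one URL per line, which is the layout shown above.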

Save as `urls.txt` and then pass that file to wget at the command line, like this:

`wget -i urls.txt -r --no-parent -nd -w 2 --limit-rate=100k`

and you'd get a folder of data (html pages, in this case).
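If you would rather stay inside the notebook than switch to wget, the same polite, one-page-at-a-time download can be roughly sketched in Python. This is not a drop-in replacement: it only mirrors the two-second wait (`-w 2`), not the recursion or the rate limit. The names `filename_for` and `download_all` are hypothetical, and `fetch` is an assumed parameter: any function that takes a URL and returns the page text.

```python
import time
from pathlib import Path
from urllib.parse import urlparse


def filename_for(url):
    """Name the saved page after the last path segment, e.g. 1337564.html."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    return (segments[-1] if segments else "index") + ".html"


def download_all(urls, out_dir, fetch, delay=2.0):
    """Save fetch(url) for each URL, pausing `delay` seconds between requests."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    saved = []
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay)  # be gentle with the museum's server
        target = out_path / filename_for(url)
        target.write_text(fetch(url), encoding="utf-8")
        saved.append(target)
    return saved
```

In the notebook, `urls` could be read from `urls.txt` with `Path("urls.txt").read_text().split()`, and `fetch` could be `lambda u: requests.get(u, timeout=30).text`.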

But wait! There's other data from the original search screen we could grab. Look at the original html:

```html
<span class="collection-item-metadata location">Canada</span>
<span class="collection-item-metadata artifact-number">2011.175.11</span>
<span class="collection-item-metadata date-made">1977</span>
```

Let's grab that.
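One way to target just those fields is a CSS selector per class. This is a minimal sketch, assuming every record carries exactly one span of each kind, in the same order, so the three lists line up. `FIELDS` and `extract_records` are hypothetical names, and `sample_html` is a small stand-in for the real page:

```python
from bs4 import BeautifulSoup

# Stand-in for the search-results page, using the class names shown above.
sample_html = """
<span class="text">Search</span>
<span class="collection-item-metadata location">Canada</span>
<span class="collection-item-metadata artifact-number">2011.175.11</span>
<span class="collection-item-metadata date-made">1977</span>
<span class="collection-item-metadata location">Canada</span>
<span class="collection-item-metadata artifact-number">2001.145.1</span>
<span class="collection-item-metadata date-made"></span>
"""

FIELDS = ("location", "artifact-number", "date-made")


def extract_records(soup):
    """Collect each metadata field by class, then zip the lists into records."""
    columns = [
        [span.get_text(strip=True)
         for span in soup.select(f"span.collection-item-metadata.{field}")]
        for field in FIELDS
    ]
    # Assumption: the i-th entry of every column belongs to the i-th record.
    return [dict(zip(FIELDS, row)) for row in zip(*columns)]


# 'html.parser' is used here so the sketch runs without lxml installed.
records = extract_records(BeautifulSoup(sample_html, "html.parser"))
for record in records:
    print(record)
```

In the notebook you would pass the existing `soup` object to `extract_records` instead of parsing `sample_html`. Empty spans, such as a missing date, come through as empty strings.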

In [18]:
trs = soup.find_all('span')
for tr in trs:
    print(tr)

<span class="text">Search</span>
<span class="text">Favourites</span>
<span class="favourites-count-icon">0</span>
<span class="favourites-count-text"> items in your favourites list.</span>
<span class="display-count-label">Results Per Page</span>
<span class="text">Media Filter</span>
<span class="text">Gallery Mode</span>
<span class="text">Grid View</span>
<span class="text">List View</span>
<span class="collection-item-irn">1337564</span>
<span class="collection-item-metadata location">Canada</span>
<span class="collection-item-metadata artifact-number">2001.145.1</span>
<span class="collection-item-metadata date-made"></span>
<span class="collection-item-irn">2359060</span>
<span class="collection-item-metadata location">Canada</span>
<span class="collection-item-metadata artifact-number">2013.4.18</span>
<span class="collection-item-metadata date-made"></span>
<span class="collection-item-irn">1316383</span>
<span class="collection-item-metadata location">Canada</span>
<span clas