# Getting Photos from the Cambridge Air Photo Collection

The [Cambridge University Collection of Aerial Photography](https://www.cambridgeairphotos.com/) is a valuable resource for archaeology, since many of its photos were taken immediately post-war, before developers rolled over the landscape with their bulldozers.

In this notebook, I show how to gather those photos together so that we can use transfer learning on them to build an image classifier. This image classifier can then be used on other collections of archival air photos to try to identify archaeological materials.

In [2]:
# only do this the first time
!pip install bs4

Collecting bs4
  Downloading https://files.pythonhosted.org/packages/10/ed/7e8b97591f6f456174139ec089c769f89a94a1a4025fe967691de971f314/bs4-0.0.1.tar.gz
Collecting beautifulsoup4 (from bs4)
  Downloading https://files.pythonhosted.org/packages/1d/5d/3260694a59df0ec52f8b4883f5d23b130bc237602a1411fa670eae12351e/beautifulsoup4-4.7.1-py3-none-any.whl (94kB)
Collecting soupsieve>=1.2 (from beautifulsoup4->bs4)
  Downloading https://files.pythonhosted.org/packages/77/78/bca00cc9fa70bba1226ee70a42bf375c4e048fe69066a0d9b5e69bc2a79a/soupsieve-1.8-py2.py3-none-any.whl (88kB)
Building wheels for collected packages: bs4
  Running setup.py bdist_wheel for bs4 ... done
  Stored in directory: /home/jovyan/.cache/pip/wheels/a0/b0/b2/4f80b9456b87abedbc0bf2d52235414c3467d8889be38dd472
Successfully built bs4
Installing collecte

In [4]:
# import necessary bits and bobs
from bs4 import BeautifulSoup
import csv
import requests

In [14]:
# make a directory for whatever theme it is you're after
!mkdir roman_fort


In [15]:
cd roman_fort

/home/jovyan/roman_fort


In [19]:
# create a file to write the results of our scraper
file = open("output.txt", "w")

In [20]:
# If you go to https://www.cambridgeairphotos.com/themes/earthworks/ you'll see that there are over 300 pages of results,
# with 30 showing at a time. Let's just grab a portion of that. The 'roman fort' theme used here follows the same URL pattern.
# We define the range here: the first ten pages. range() stops before its end value, so range(1, 11) gives pages 1 to 10.
pages = []

for i in range(1, 11):
    url = 'https://www.cambridgeairphotos.com/themes/roman+fort/page' + str(i) + '.html'
    pages.append(url)
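
Because `range` excludes its end value, it is easy to be off by one when picking pages. The following is a minimal sketch of a helper that takes an inclusive first and last page and builds the same kind of URL for any theme. The name `theme_page_urls` and its parameters are hypothetical, and the sketch assumes every theme uses the `themes/<theme>/page<N>.html` pattern seen above.

```python
from urllib.parse import quote_plus

BASE_URL = "https://www.cambridgeairphotos.com/themes/"


def theme_page_urls(theme, first_page=1, last_page=10):
    """Return result-page URLs for a theme, inclusive of both page numbers.

    Hypothetical helper: assumes the site's themes/<theme>/page<N>.html pattern.
    """
    # quote_plus turns spaces into '+', e.g. "roman fort" -> "roman+fort"
    slug = quote_plus(theme)
    return [
        f"{BASE_URL}{slug}/page{n}.html"
        for n in range(first_page, last_page + 1)
    ]


example_pages = theme_page_urls("roman fort", 1, 10)
```

Swapping in another theme name, or a different page span, then only means changing the arguments.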

In [21]:
# and now we define, and run, our scraper.
# it's picking out the links to the thumbnails
# and saving them to the file we made above.
for item in pages:
    page = requests.get(item)
    soup = BeautifulSoup(page.content, "html.parser")

    divs = soup.find_all('div', attrs={"class": "cucapgallery naturalwidth compressed"})

    for div in divs:
        for link in div.find_all('a', attrs={"class": "lightbox"}):
            fulllink = link.get('href')
            file.writelines(["https://www.cambridgeairphotos.com", fulllink, "\n"])

# close the file so that every line is actually written to disk before we read it back
file.close()
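
To see what the scraper's rule is matching without hitting the site, here is a sketch that applies a similar rule to a hand-written HTML fragment: keep the `href` of every `a` with class `lightbox` that sits inside the gallery `div`. It uses the standard library's `html.parser` rather than BeautifulSoup, so it is an illustration of the idea and not the scraper itself. The fragment is made up, the real page markup may differ, and this version matches the gallery on the single class `cucapgallery` only.

```python
from html.parser import HTMLParser

BASE = "https://www.cambridgeairphotos.com"

# Hypothetical fragment shaped like the markup the scraper's selectors imply;
# the real pages may be structured differently.
SAMPLE_HTML = """
<div class="cucapgallery naturalwidth compressed">
  <a class="lightbox" href="/data/thumbnails/640/35kbw022.jpg">one</a>
  <a class="other" href="/ignored.html">skip me</a>
  <div><a class="lightbox" href="/data/thumbnails/640/35kbw023.jpg">two</a></div>
</div>
<a class="lightbox" href="/outside/gallery.jpg">outside</a>
"""


class LightboxLinks(HTMLParser):
    """Collect hrefs of <a class="lightbox"> found inside the gallery div."""

    def __init__(self):
        super().__init__()
        self.depth = 0   # number of open <div>s, counted from the gallery div
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "div":
            # start counting at the gallery div, keep counting nested divs
            if self.depth or "cucapgallery" in classes:
                self.depth += 1
        elif tag == "a" and self.depth and "lightbox" in classes:
            if attrs.get("href"):
                self.links.append(BASE + attrs["href"])

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1


parser = LightboxLinks()
parser.feed(SAMPLE_HTML)
thumbnail_urls = parser.links
```

Only the two links inside the gallery `div` end up in `thumbnail_urls`; the `other` link and the one outside the gallery are ignored.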

## Examine the results

If you go to the file browser now, you'll see a file [output.txt](output.txt) with lots of lines like this:

```
https://www.cambridgeairphotos.com/data/thumbnails/640/35kbw022.jpg
https://www.cambridgeairphotos.com/data/thumbnails/640/35kbw023.jpg
https://www.cambridgeairphotos.com/data/thumbnails/640/35kci022.jpg
https://www.cambridgeairphotos.com/data/thumbnails/640/35kci027.jpg
```
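
Each of these lines is a full thumbnail URL, and its last path segment is the image's file name. A minimal sketch of pulling that name out with the standard library follows; `filename_from_url` is a hypothetical helper name, not something from the notebook.

```python
import posixpath
from urllib.parse import urlsplit


def filename_from_url(url):
    """Return the last path segment of a URL, e.g. '35kbw022.jpg'."""
    # strip() removes the trailing newline a line read from a file carries
    path = urlsplit(url.strip()).path
    return posixpath.basename(path)


name = filename_from_url(
    "https://www.cambridgeairphotos.com/data/thumbnails/640/35kbw022.jpg\n"
)
```

Splitting on the URL path, rather than deleting characters with a regular expression, keeps the original name and extension intact.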

Now we'll write some code to download those thumbnails.

In [22]:
# now we read the urls back in from output.txt and download each thumbnail

with open('output.txt') as csvfile:
    csvrows = csv.reader(csvfile, delimiter=',', quotechar='"')
    for row in csvrows:
        # skip any blank lines
        if not row:
            continue
        # this tells the machine that the url is the first element in the row.
        # if you had a csv with more than one column, you could adjust accordingly
        url = row[0]
        # the file name is the last part of the url, e.g. 35kbw022.jpg
        filename = url.rsplit('/', 1)[-1]
        print(url)
        result = requests.get(url)
        if result.status_code == 200:
            # the with block closes each image file once it has been written
            with open(filename, "wb") as image_file:
                image_file.write(result.content)


https://www.cambridgeairphotos.com/data/thumbnails/640/35kar028.jpg
https://www.cambridgeairphotos.com/data/thumbnails/640/35kar029.jpg
https://www.cambridgeairphotos.com/data/thumbnails/640/35kar030.jpg
https://www.cambridgeairphotos.com/data/thumbnails/640/35kar031.jpg
https://www.cambridgeairphotos.com/data/thumbnails/640/35kaz015.jpg
https://www.cambridgeairphotos.com/data/thumbnails/640/35kaz016.jpg
https://www.cambridgeairphotos.com/data/thumbnails/640/35kaz017.jpg
https://www.cambridgeairphotos.com/data/thumbnails/640/35kbb008.jpg
https://www.cambridgeairphotos.com/data/thumbnails/640/35kbb009.jpg
https://www.cambridgeairphotos.com/data/thumbnails/640/35kbb010.jpg
https://www.cambridgeairphotos.com/data/thumbnails/640/35kbb011.jpg
https://www.cambridgeairphotos.com/data/thumbnails/640/35kbk004.jpg
https://www.cambridgeairphotos.com/data/thumbnails/640/35kbk005.jpg
https://www.cambridgeairphotos.com/data/thumbnails/640/35kbk006.jpg
https://www.cambridgeairphotos.com/data/thumbnai