# Web scraping

Programming in Python

School of Computer Science, University of St Andrews

## Data on the web
- There's a lot of *unstructured* data on the web
- Placed on web pages for display, with no API
- We may want to get this data automatically
- Libraries exist to help us!

## BeautifulSoup
- Python library for scraping data: `bs4`
- May need installation first (`pip install beautifulsoup4`)
- Allows us to search through the document tree

Let's open a page: https://www.st-andrews.ac.uk/computer-science/

In [1]:
from bs4 import BeautifulSoup
from urllib.request import urlopen

url = "https://www.st-andrews.ac.uk/computer-science/"
page = urlopen(url)
html = page.read().decode("utf-8")
soup = BeautifulSoup(html, "html.parser")
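
The cell above assumes the page is encoded as UTF-8 and never closes the connection. As a minimal sketch, the download step could instead read the character set from the response headers and fall back to UTF-8 when none is declared. The helper name `fetch_html` and its parameters are hypothetical, not part of `bs4` or `urllib`.

```python
from urllib.request import urlopen


def fetch_html(url, default_encoding="utf-8", timeout=10):
    """Download a page and decode it with the charset the server declares.

    Hypothetical helper: falls back to `default_encoding` when the
    response does not declare a charset.
    """
    # The `with` block closes the connection once the body has been read
    with urlopen(url, timeout=timeout) as page:
        charset = page.headers.get_content_charset() or default_encoding
        # errors="replace" avoids a crash on the odd badly-encoded byte
        return page.read().decode(charset, errors="replace")
```

With this helper, the soup could be built as `BeautifulSoup(fetch_html(url), "html.parser")`.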

In [2]:
soup.title

<title>School of Computer Science - University of St Andrews</title>

In [3]:
soup.title.text

'School of Computer Science - University of St Andrews'

In [4]:
print(soup.get_text())








School of Computer Science - University of St Andrews



























Skip to content

Skip to content



Current studentsCurrent staff


University of St Andrews





















School of Computer Science




Navigation 
PeopleProspective studentsAboutResearch EventsNews 






Build a smarter worldComputer science is more important than ever. Be part of building a more intelligent world through computing technology.  StudyResearch 





People



Equality and diversity



Research 


Latest School news


All School news




Fully-funded PhD scholarship in user experience design
Applications are sought from passionate, creative and outgoing students interested in using their skills and interests in tabletop gaming in application to research in computer science, Human Computer Interaction, and User Experience design. This exciting PhD project will see the worlds of TTRPG and computing coincide to produce meaningful interactions to support the design, development 
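
The output above contains long runs of blank lines, because `get_text()` keeps the whitespace between tags. A small sketch of one way to tidy this up uses the `separator` and `strip` arguments of `get_text`. The HTML string here is a made-up stand-in for a real page.

```python
from bs4 import BeautifulSoup

# Made-up HTML with plenty of whitespace between the tags
html = """
<html><head><title>Demo page</title></head>
<body>
  <h1>School news</h1>

  <p>First item</p>
  <p>Second item</p>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

# strip=True trims each piece of text and drops the pieces that are empty;
# separator="\n" puts each remaining piece on its own line
text = soup.get_text(separator="\n", strip=True)
lines = text.split("\n")
print(lines)
```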

## Search for a type of element

In [5]:
images = soup.find_all("img")
print(images)

[<img ... />, <img ... />, <img ... />, <img ... />]


In [6]:
images[2]



In [7]:
images[2].name

'img'

In [8]:
images[2]["alt"]

'Students and staff interacting'

In [9]:
images[2]["src"]

'data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=='

In [10]:
images[2]["data-src"]

'/assets/university/schools/school-of-computer-science/images/students-staff-social-space.jpg'

![](https://www.st-andrews.ac.uk/assets/university/schools/school-of-computer-science/images/students-staff-social-space.jpg)
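
The `data-src` value above is a path relative to the site root, while the image itself lives at a full URL. As a sketch, the standard library's `urljoin` can combine the address of the page with the relative path. The variable names are only for illustration.

```python
from urllib.parse import urljoin

# The page we scraped and the relative address found in the data-src attribute
page_url = "https://www.st-andrews.ac.uk/computer-science/"
data_src = "/assets/university/schools/school-of-computer-science/images/students-staff-social-space.jpg"

# A path starting with "/" replaces the whole path of the page URL
image_url = urljoin(page_url, data_src)
print(image_url)

# A path without a leading "/" is resolved relative to the page's folder
people_url = urljoin(page_url, "people/")
print(people_url)
```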

## Example: find all links

In [11]:
links = soup.find_all("a")
print(links)

[<a href="#content-begin">Skip to content</a>, <a href="#content-begin">Skip to content</a>, <a class="audience" href="https://info.cs.st-andrews.ac.uk">Current students</a>, <a class="audience" href="https://universityofstandrews907.sharepoint.com/sites/compsci">Current staff</a>, <a aria-label="University of St Andrews" href="https://www.st-andrews.ac.uk/"><span>University of St Andrews</span></a>, <a href="/computer-science/">School of Computer Science</a>, <a class="navigation-button">Navigation <i class="chevron down"></i></a>, <a href="/computer-science/people/">People</a>, <a href="/computer-science/prospective/">Prospective students</a>, <a href="/computer-science/about/">About</a>, <a href="/computer-science/research/">Research </a>, <a href="/computer-science/events/">Events</a>, <a href="/computer-science/news/">News</a>, <a class="btn btn-action" href="/computer-science/prospective/">  Study</a>, <a class="btn btn-action" href="/computer-science/research/">Research </a>, <a

In [12]:
len(links)

38

In [13]:
links[18]

<a href="https://blogs.cs.st-andrews.ac.uk/csblog" style="margin-bottom: 30px;">All School news</a>

In [14]:
links[18].name

'a'

In [15]:
links[18].text

'All School news'

In [16]:
links[18]["href"]

'https://blogs.cs.st-andrews.ac.uk/csblog'

In [17]:
[link["href"] for link in links if link.has_attr("href") and link["href"].startswith("http")]

['https://info.cs.st-andrews.ac.uk',
 'https://universityofstandrews907.sharepoint.com/sites/compsci',
 'https://www.st-andrews.ac.uk/',
 'https://blogs.cs.st-andrews.ac.uk/csblog',
 'https://blogs.cs.st-andrews.ac.uk/csblog/2023/12/18/fully-funded-phd-scholarship-in-user-experience-design-roll-for-initiative/',
 'https://blogs.cs.st-andrews.ac.uk/csblog/2023/12/12/fully-funded-phd-scholarship-in-privacy-and-trust-on-the-web/',
 'https://blogs.cs.st-andrews.ac.uk/csblog/2023/11/27/winter-graduation-reception/',
 'https://events.st-andrews.ac.uk/events/doors-open-computer-science/',
 'https://twitter.com/StAndrewsCS',
 'https://www.facebook.com/StAndrewsCS',
 'https://www.st-andrews.ac.uk/digital-standards/accessibility/accessibility-statement/',
 'https://www.st-andrews.ac.uk/terms/',
 'https://www.st-andrews.ac.uk/help/']
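
The comprehension above drops every relative link. One possible alternative, sketched here on a small hand-written sample of the links from the page, is to ask `find_all` only for `a` tags that have an `href` attribute, and then turn relative addresses into absolute ones with `urljoin`. Skipping the `#` anchors is a choice made for this sketch.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

base = "https://www.st-andrews.ac.uk/computer-science/"

# A short hand-written sample of links like those on the page
html = """
<a href="#content-begin">Skip to content</a>
<a class="navigation-button">Navigation</a>
<a href="/computer-science/people/">People</a>
<a href="https://blogs.cs.st-andrews.ac.uk/csblog">All School news</a>
"""
soup = BeautifulSoup(html, "html.parser")

# href=True matches only the tags that have an href attribute at all
links = soup.find_all("a", href=True)

# Skip in-page anchors, and resolve everything else against the page address
absolute = [
    urljoin(base, link["href"])
    for link in links
    if not link["href"].startswith("#")
]
print(absolute)
```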

## Inspecting a particular area

In [18]:
divs = soup.find_all("div", {"class": "hero-section__content"})

In [19]:
len(divs)

1

In [20]:
div = divs[0]
print(div)

<div class="hero-section__content"><h1 class="hero-section__heading">Build a smarter world</h1><div class="hero-section__content-group"><p class="hero-section__intro">Computer science is more important than ever. Be part of building a more intelligent world through computing technology.</p><a class="btn btn-action" href="/computer-science/prospective/">  Study</a><a class="btn btn-action" href="/computer-science/research/">Research </a></div></div>
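
The search by class can be written in more than one way. The sketch below, run on a shortened copy of the markup above, compares the dictionary form with the `class_` keyword argument and with a CSS selector passed to `select`. All three should find the same single `div`, because classes are matched as whole names, so `hero-section__content-group` does not count.

```python
from bs4 import BeautifulSoup

# Shortened copy of the hero section markup
html = (
    '<div class="hero-section__content">'
    '<h1 class="hero-section__heading">Build a smarter world</h1>'
    '<div class="hero-section__content-group">'
    '<p class="hero-section__intro">Computer science is more important than ever.</p>'
    "</div></div>"
)
soup = BeautifulSoup(html, "html.parser")

# Three ways of asking for div elements with the class hero-section__content
by_dict = soup.find_all("div", {"class": "hero-section__content"})
by_keyword = soup.find_all("div", class_="hero-section__content")
by_selector = soup.select("div.hero-section__content")

print(len(by_dict), len(by_keyword), len(by_selector))
```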


In [21]:
[child.name for child in div.children]

['h1', 'div']

In [22]:
heading = div.find("h1")

In [23]:
heading.text

'Build a smarter world'

In [24]:
subdiv = div.find("div")

In [25]:
subdiv

<div class="hero-section__content-group"><p class="hero-section__intro">Computer science is more important than ever. Be part of building a more intelligent world through computing technology.</p><a class="btn btn-action" href="/computer-science/prospective/">  Study</a><a class="btn btn-action" href="/computer-science/research/">Research </a></div>

In [26]:
[child.name for child in subdiv.children]

['p', 'a', 'a']
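
The lists of children above contain only tags because this page's HTML has no whitespace between them. On a page with line breaks and indentation, `children` also yields the pieces of text between the tags. A minimal sketch, using made-up HTML, keeps only the children that are tags:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Made-up HTML with line breaks and indentation between the tags
html = """
<div id="box">
  <h1>Build a smarter world</h1>
  <div><p>Intro</p></div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
box = soup.find("div", id="box")

# Every child, including the whitespace strings between the tags
all_children = list(box.children)

# Only the children that are tags
tag_names = [child.name for child in box.children if isinstance(child, Tag)]

print(len(all_children), tag_names)
```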

In [27]:
link = div.find("a")

In [28]:
link.text

'  Study'

In [29]:
link["href"]

'/computer-science/prospective/'

In [30]:
link2 = link.next_sibling

In [31]:
link2.text

'Research '

In [32]:
link2["href"]

'/computer-science/research/'
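
Here `next_sibling` gives the second link because nothing sits between the two `a` tags. If the HTML had a line break between them, `next_sibling` would return that whitespace instead. As a sketch on made-up HTML, `find_next_sibling` skips ahead to the next sibling that is a matching tag:

```python
from bs4 import BeautifulSoup

# Made-up HTML: the two links are separated by a line break and indentation
html = """
<div>
  <a href="/computer-science/prospective/">Study</a>
  <a href="/computer-science/research/">Research</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
first = soup.find("a")

# The immediate sibling is the whitespace between the tags, not the next link
gap = first.next_sibling

# find_next_sibling("a") looks for the next sibling that is an <a> tag
second = first.find_next_sibling("a")

print(repr(str(gap)), second["href"])
```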

In [33]:
link2.parent

<div class="hero-section__content-group"><p class="hero-section__intro">Computer science is more important than ever. Be part of building a more intelligent world through computing technology.</p><a class="btn btn-action" href="/computer-science/prospective/">  Study</a><a class="btn btn-action" href="/computer-science/research/">Research </a></div>

In [34]:
link2.parent == subdiv

True

In [35]:
link2.parent.parent == div

True

## Summary
- Use `BeautifulSoup` to extract information from web pages
- Use `find_all`, `find`, `parent`, `children` and `next_sibling` to explore the document tree
- Many more features: read the docs or search for a solution to learn more!

## Exercise
- Take a look at a classic of web design, the Space Jam website: https://www.spacejam.com/1996/
- Write BeautifulSoup code to produce a Python dictionary of the 11 planet-shaped links on the page
- The dictionary's keys should be the alt-text of the images in the links
- The dictionary's values should be tuples of the form `(url, img)` where `url` is the URL the link points to, and `img` is the address of the image displayed

## Solution

In [36]:
from bs4 import BeautifulSoup
from urllib.request import urlopen

url = "https://www.spacejam.com/1996/"
page = urlopen(url)
html = page.read().decode("utf-8")
soup = BeautifulSoup(html, "html.parser")

# Keep only the links that wrap an image: these are the planet-shaped links
links = soup.find_all("a")
d = {}
for link in links:
    image = link.find("img")
    if image is not None:
        alt = image["alt"]    # key: the alt-text of the image
        href = link["href"]   # the URL the link points to
        img = image["src"]    # the address of the image displayed
        d[alt] = (href, img)

print(d)

{'Press Box Shuttle': ('cmp/pressbox/pressboxframes.html', 'img/p-pressbox.gif'), 'Jam Central': ('cmp/jamcentral/jamcentralframes.html', 'img/p-jamcentral.gif'), 'Planet B-Ball': ('cmp/bball/bballframes.html', 'img/p-bball.gif'), 'Lunar Tunes': ('cmp/tunes/tunesframes.html', 'img/p-lunartunes.gif'), 'The Lineup': ('cmp/lineup/lineupframes.html', 'img/p-lineup.gif'), 'Jump Station': ('cmp/jump/jumpframes.html', 'img/p-jump.gif'), 'Junior Jam': ('cmp/junior/juniorframes.html', 'img/p-junior.gif'), 'Warner Studio Store': ('https://shop.looneytunes.com/spacejam96?utm_source=SpaceJam1996&utm_medium=Website&utm_campaign=Theatrical2021', 'img/p-studiostore.gif'), 'Stellar Souvenirs': ('cmp/souvenirs/souvenirsframes.html', 'img/p-souvenirs.gif'), 'Site Map': ('cmp/sitemap.html', 'img/p-sitemap.gif'), 'Behind the Jam': ('cmp/behind/behindframes.html', 'img/p-behind.gif')}
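
Most of the addresses in this dictionary are relative to the page. If full URLs are wanted, one possible follow-up step is to resolve each value against the page address with `urljoin`. The sketch below uses two entries copied from the output above rather than fetching the page again.

```python
from urllib.parse import urljoin

base = "https://www.spacejam.com/1996/"

# Two entries copied from the printed dictionary
d = {
    "Site Map": ("cmp/sitemap.html", "img/p-sitemap.gif"),
    "Jam Central": ("cmp/jamcentral/jamcentralframes.html", "img/p-jamcentral.gif"),
}

# Resolve both the link address and the image address against the page URL
absolute = {
    alt: (urljoin(base, url), urljoin(base, img))
    for alt, (url, img) in d.items()
}
print(absolute["Site Map"])
```

Addresses that are already absolute, such as the Warner Studio Store link, pass through `urljoin` unchanged.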
