# Web Scraping with Beautiful Soup
Sometimes you come across useful data that is not in a convenient format such as Excel or CSV, but is embedded in a web page.  You could copy and paste it from the web page into Excel, but there are a couple of big issues with this approach:

1. You might need to spend a lot of time tidying up the data.
2. If the data on the web page changes, you have to go through the whole manual process again.

An alternative is to take a programmatic approach and use a technique called **web scraping** (sometimes called screen scraping).  In this worksheet we will use a library called Beautiful Soup to scrape some data from websites.

## Beautiful Soup
Let's have a quick play with the Beautiful Soup library to see what it does.  We will use the list of national capitals from Wikipedia:

https://en.wikipedia.org/wiki/List_of_national_capitals



In [8]:
from bs4 import BeautifulSoup
import urllib3

# This is the web page we will pull data from
url = "https://en.wikipedia.org/wiki/List_of_national_capitals"

# Pull the data from the web page
http = urllib3.PoolManager()
response = http.request('GET', url)

# Create a Beautiful Soup object, passing in the page data
soup = BeautifulSoup(response.data, "lxml")

Let's inspect the first part of the Beautiful Soup object to see what it contains:

In [9]:
str(soup)[:5000]

'<!DOCTYPE html>\n<html class="client-nojs" dir="ltr" lang="en">\n<head>\n<meta charset="utf-8"/>\n<title>List of national capitals - Wikipedia</title>\n<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"23b7951d-10c3-4d58-932d-a58bf0fa7996","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"List_of_national_capitals","wgTitle":"List of national capitals","wgCurRevisionId":1027939042,"wgRevisionId":1027939042,"wgArticleId":33728,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Articles with short description","Short description is different from Wikidata","Articles containing Spanish-language text","Li

So that looks like the HTML content of the web page!
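A raw string slice is hard to read. As a minimal sketch, the `prettify()` method returns the same markup with one tag per line, indented by nesting depth. The `sample_html` page below is a made-up stand-in so that the example runs without a network call, and it uses Python's built-in `html.parser` rather than `lxml`, so nothing extra needs to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature page (not the Wikipedia page), so no download is needed
sample_html = "<html><head><title>Demo</title></head><body><p>Hello</p></body></html>"

# "html.parser" ships with Python; "lxml" would work here too if it is installed
sample_soup = BeautifulSoup(sample_html, "html.parser")

# prettify() returns the document as an indented, multi-line string
pretty = sample_soup.prettify()
print(pretty)
```

On the real page, `print(soup.prettify()[:1000])` should give a more readable preview than `str(soup)[:5000]`.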

We can start to extract elements from the page.  First the title:

In [10]:
# HTML document title
soup.title

<title>List of national capitals - Wikipedia</title>
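The result above is a tag object, not plain text. As a small sketch of how to get at its parts, the example below reads the tag name and the text inside it. The `sample_html` string is a hand-written stand-in for the downloaded page so that the example runs on its own.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the downloaded page
sample_html = (
    "<html><head><title>List of national capitals - Wikipedia</title></head>"
    "<body></body></html>"
)
sample_soup = BeautifulSoup(sample_html, "html.parser")

title_tag = sample_soup.title

# The name of the tag itself
print(title_tag.name)

# The text between the opening and closing tags
title_text = title_tag.string
print(title_text)
```

For a tag that contains a single piece of text, `title_tag.get_text()` gives the same text as an ordinary Python string.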

Then we can find the first tag of a particular type:

In [11]:
# First link tag:
soup.find('a')

<a id="top"></a>

We can find all elements of a particular type.  Let's select the first 20:

In [12]:
# All link tags (first 20 only):
soup.find_all('a')[:20]

[<a id="top"></a>,
 <a class="mw-jump-link" href="#mw-head">Jump to navigation</a>,
 <a class="mw-jump-link" href="#searchInput">Jump to search</a>,
 <a href="/wiki/Lists_of_capitals" title="Lists of capitals">Lists of capitals</a>,
 <a class="mw-selflink selflink">in alphabetical order</a>,
 <a href="/wiki/List_of_national_capitals_by_latitude" title="List of national capitals by latitude">by latitude</a>,
 <a href="/wiki/List_of_national_capitals_by_population" title="List of national capitals by population">by population</a>,
 <a href="/wiki/List_of_former_national_capitals" title="List of former national capitals">Former</a>,
 <a href="/wiki/List_of_purpose-built_national_capitals" title="List of purpose-built national capitals">Purpose-built</a>,
 <a href="/wiki/List_of_national_capitals_situated_on_an_international_border" title="List of national capitals situated on an international border">On an international border</a>,
 <a href="/wiki/List_of_countries_whose_capital_is_not_th
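Slicing the full list is one way to narrow the results, but `find_all` can also filter as it searches. The sketch below shows three filters: tags that have an `href` attribute, tags with a given CSS class, and a cap on the number of matches. The `sample_html` snippet is a shortened, hand-copied stand-in for the real page.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the page, based on the first few links shown above
sample_html = """
<a id="top"></a>
<a class="mw-jump-link" href="#mw-head">Jump to navigation</a>
<a href="/wiki/Lists_of_capitals" title="Lists of capitals">Lists of capitals</a>
<a class="mw-selflink selflink">in alphabetical order</a>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# Only link tags that actually have an href attribute
links_with_href = sample_soup.find_all("a", href=True)

# Only link tags with a particular CSS class (class_ avoids the Python keyword)
jump_links = sample_soup.find_all("a", class_="mw-jump-link")

# Stop searching after the first two matches
first_two = sample_soup.find_all("a", limit=2)

print(len(links_with_href), len(jump_links), len(first_two))
```

Using `limit` stops the search early, whereas slicing with `[:20]` builds the full list first and then discards most of it.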

We can extract an attribute from each tag in a set.  Tags that do not have the attribute return `None`:

In [13]:
# The href attribute of each of the first 20 link tags:
for link in soup.find_all('a')[:20]:
    print(link.get('href'))

None
#mw-head
#searchInput
/wiki/Lists_of_capitals
None
/wiki/List_of_national_capitals_by_latitude
/wiki/List_of_national_capitals_by_population
/wiki/List_of_former_national_capitals
/wiki/List_of_purpose-built_national_capitals
/wiki/List_of_national_capitals_situated_on_an_international_border
/wiki/List_of_countries_whose_capital_is_not_their_largest_city
/wiki/List_of_countries_with_multiple_capitals
/wiki/Timeline_of_country_and_capital_changes
/wiki/List_of_capitals_outside_the_territories_they_serve
/wiki/List_of_purpose-built_capitals_of_country_subdivisions
/wiki/Template:Lists_of_capitals
/wiki/Template_talk:Lists_of_capitals
https://en.wikipedia.org/w/index.php?title=Template:Lists_of_capitals&action=edit
/wiki/Capital_city
/wiki/Territory_(administrative_division)
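Most of these values are relative paths, a few are in-page anchors starting with `#`, and some are `None`. If we wanted to follow the links, one possible approach is to turn them into full URLs with `urljoin` from the standard library. This is only a sketch: the `hrefs` list is hand-copied from the output above rather than taken from the live page, and skipping anchors is our own choice.

```python
from urllib.parse import urljoin

# The page the links were found on
base_url = "https://en.wikipedia.org/wiki/List_of_national_capitals"

# A few values copied from the output above (None means the tag had no href)
hrefs = [
    None,
    "#mw-head",
    "/wiki/Lists_of_capitals",
    "https://en.wikipedia.org/w/index.php?title=Template:Lists_of_capitals&action=edit",
]

# Skip missing values and in-page anchors, then resolve the rest against the base URL
absolute_links = [
    urljoin(base_url, href)
    for href in hrefs
    if href and not href.startswith("#")
]

for link in absolute_links:
    print(link)
```

Links that are already absolute pass through `urljoin` unchanged, so the same code handles both kinds.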


We can get the text content of the web page:

In [14]:
# All page text
soup.get_text()

'\n\n\nList of national capitals - Wikipedia\ndocument.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"23b7951d-10c3-4d58-932d-a58bf0fa7996","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"List_of_national_capitals","wgTitle":"List of national capitals","wgCurRevisionId":1027939042,"wgRevisionId":1027939042,"wgArticleId":33728,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Articles with short description","Short description is different from Wikidata","Articles containing Spanish-language text","Lists of countries","Lists of capitals"],"wgPageContentLanguage":"en","wgPageContentModel":"wikitext","wgRelevantPage
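The text above includes the contents of a `<script>` tag, which is not something a reader of the page would see. One way to deal with this, sketched below, is to remove the script and style tags from the soup before extracting the text. The `sample_html` page is a made-up stand-in, and removing the tags explicitly means the result does not depend on how a particular Beautiful Soup version treats script content.

```python
from bs4 import BeautifulSoup

# Made-up miniature page with a script and a style block
sample_html = """<html><head><title>Demo page</title>
<script>document.documentElement.className="client-js";</script>
<style>body { margin: 0; }</style></head>
<body><p>Visible text</p></body></html>"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# Calling the soup like a function is shorthand for find_all()
for tag in sample_soup(["script", "style"]):
    tag.decompose()  # remove the tag and its contents from the tree

# Join the remaining pieces of text with spaces, trimming surrounding whitespace
clean_text = sample_soup.get_text(separator=" ", strip=True)
print(clean_text)
```

Note that `decompose()` changes the soup in place, so it is best done on a soup we no longer need the scripts from.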