# Queen's Christmas Broadcast crawler

This web crawler was created to build a corpus of Queen Elizabeth II's Christmas broadcasts.

## Libraries

First off, there are some libraries we need for building the crawler.

- `requests` is for making HTTP requests to servers.
- Beautiful Soup (`bs4`) is for parsing and filtering HTML documents.
- `join` is useful when we want to create file paths that work on any operating system (Windows and macOS use different path separators).
- `makedirs` lets us create the folders in which we save our HTML files and, later, the corpus files.

In [2]:
import requests
from os import makedirs
from os.path import join
from bs4 import BeautifulSoup
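To see why `join` is worth importing, here is a small sketch that compares it with building the path by hand. The folder and file names are only illustrative.

```python
import os
from os.path import join

# join() inserts the separator that the current operating system expects
path = join("transcript-pages", "1952.html")
print(path)

# The same path built by hand only matches where "/" is the separator
manual = "transcript-pages" + "/" + "1952.html"
print(path == manual)  # True on macOS/Linux, False on Windows

# os.sep shows which separator join() used
print(repr(os.sep))
```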

## Setting up the crawler

When setting up a web crawler, we want to look at the structure of the web page we want to extract the HTML from. In our case, there is a nice [index page](https://www.royal.uk/history-christmas-broadcast) that links to all of the transcripts. One option is to use this page to extract all the links and then download the pages behind the links we sourced.

However, we could also check whether there is a pattern in the URLs that lets us download the transcripts directly, without having to extract the href links with Beautiful Soup. This could save us some time and would make the process a bit cleaner, as we would not need to save an index file for these links.

So let's check if there is a pattern in the URLs, which we could use to build the web crawler. Have a look at the following links:

- https://www.royal.uk/queens-first-christmas-broadcast-1952
- https://www.royal.uk/christmas-broadcast-1954
- https://www.royal.uk/christmas-broadcast-1962
- https://www.royal.uk/christmas-broadcast-2006
- https://www.royal.uk/christmas-broadcast-2021

We can see that there is a pattern, but annoyingly there is also one outlier: the very first Christmas broadcast has a different URL from all the other broadcasts. While this is a bit of a nuisance, it is still something we can account for in our design; after all, all the other speeches seem to follow a pattern.

First, let's set up the constant variables we will need later on:

In [3]:
# Setting up constant vars
TRANS_PAGES_DIR = "transcript-pages"  # folder for the downloaded HTML pages
makedirs(TRANS_PAGES_DIR, exist_ok=True)

# This is the base of the URL, which we can reuse in the loop later
THE_URL = 'https://www.royal.uk/christmas-broadcast-'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36',
}

You might be wondering what the *headers* variable is for. I am using it because the royal website blocks requests that come directly from scripts. This header makes it seem as if the request is coming from a browser and not from a Python script. Here is the status code we get when we make a request without changing the header:

In [29]:
example_year = 1998
url = THE_URL + str(example_year)
resp = requests.get(url)
print(resp.status_code)

403


**Status code 403** (Forbidden) means that the server refused our request, so let's try again using our header:

In [32]:
resp = requests.get(url, headers = headers)
print(resp.status_code)

200


Now we get **status code 200**, which means our request was successful and we got a response from the server. With this resolved, we can move on to saving the HTML files of the pages. Just to be extra sure, I want to check the status of each response before saving it. Otherwise, we might end up saving an empty file.
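The two codes seen so far belong to a larger family of HTTP status codes. As a small aside, Python's standard library can translate a numeric code into its official reason phrase. In the sketch below, `describe_status` is a hypothetical helper name and not part of `requests`.

```python
from http import HTTPStatus

def describe_status(code):
    """Return a short human-readable label for an HTTP status code."""
    status = HTTPStatus(code)  # raises ValueError for unknown codes
    return "{} {}".format(status.value, status.phrase)

for code in (200, 403, 404):
    print(describe_status(code))
```

This can make log messages easier to read than a bare number when many requests are made in a loop.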

In [6]:
# This is for downloading the one URL that differs from the usual pattern:

exception_url = 'https://www.royal.uk/queens-first-christmas-broadcast-1952'
resp = requests.get(exception_url, headers = headers)
if resp.status_code == 200:
    print("Downloading", resp.url)
    fname = join(TRANS_PAGES_DIR, '{}.html'.format(1952))
    with open(fname, "w") as wf:
        print("Saving as", fname)
        wf.write(resp.text)
else:
    print("Request for", resp.url, "was not successful.\nNot saving file.")

# Now we can download all of the other HTML files for which the URLs increment by year.
# For this I am simply using a for loop to go through the years.

start_year = 1953
end_year = 2021 + 1  # range() excludes its end value, so we add 1 to include 2021

for page_num in range(start_year, end_year): 
    url = THE_URL + str(page_num)
    resp = requests.get(url, headers = headers)
    if resp.status_code == 200:
        print("Downloading", resp.url)
        fname = join(TRANS_PAGES_DIR, '{}.html'.format(page_num))
        with open(fname, "w") as wf:
            print("Saving as", fname)
            wf.write(resp.text)
    else:
        print("Request for", resp.url, "was not successful.\nNot saving file.")

Downloading https://www.royal.uk/queens-first-christmas-broadcast-1952
Saving as transcript-pages/1952.html
Downloading https://www.royal.uk/christmas-broadcast-1953
Saving as transcript-pages/1953.html
Downloading https://www.royal.uk/christmas-broadcast-1954
Saving as transcript-pages/1954.html
Downloading https://www.royal.uk/christmas-broadcast-1955
Saving as transcript-pages/1955.html
Downloading https://www.royal.uk/christmas-broadcast-1956
Saving as transcript-pages/1956.html
Downloading https://www.royal.uk/christmas-broadcast-1957
Saving as transcript-pages/1957.html
Downloading https://www.royal.uk/christmas-broadcast-1958
Saving as transcript-pages/1958.html
Downloading https://www.royal.uk/christmas-broadcast-1959
Saving as transcript-pages/1959.html
Downloading https://www.royal.uk/christmas-broadcast-1960
Saving as transcript-pages/1960.html
Downloading https://www.royal.uk/christmas-broadcast-1961
Saving as transcript-pages/1961.html
Downloading https://www.royal.uk/chri

Oh, no. It seems that there are some more outlier URLs...
The requests for 2018 and 2019 were not successful, and it seems these pages do not exist under the expected URLs, even though the Queen gave speeches in those years. Let's have a look at the actual URLs:
- https://www.royal.uk/queens-christmas-broadcast-2018
- https://www.royal.uk/queen’s-christmas-broadcast-2019

We can see that these follow a different naming convention from most other speeches. What this shows us is that relying on URL patterns is fragile, as URL naming can vary over time. This would not be the case if the pages were addressed through a query string. Consider for instance this URL from the White House briefing room:
- https://www.whitehouse.gov/briefing-room/?page=0
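To make the contrast concrete, the sketch below generates a few URLs of this kind with the standard library. Only the number after `page=` changes, so nothing depends on how individual pages are named. It assumes that higher page numbers follow the same scheme, and it sends no requests; `page_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

BASE = "https://www.whitehouse.gov/briefing-room/"

def page_url(page):
    """Build the URL for one page of the listing from its page number."""
    return BASE + "?" + urlencode({"page": page})

for page in range(3):
    print(page_url(page))

# Going the other way: read the parameters back out of a URL
query = parse_qs(urlsplit(page_url(0)).query)
print(query)
```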

The **?page=0** is a so-called *query string*, which can be appended to a URL to pass parameters such as the page number. If the transcripts were organized in such a way that we could use a query string, we wouldn't have to make any exceptions. This shows that relying on URL patterns alone can lead to inconsistencies, and when we run into them, it might be better to scrape the links from the [index page](https://www.royal.uk/history-christmas-broadcast) with Beautiful Soup.

Funnily enough, however, my group mate asked me whether I had been able to get the broadcast for 2017, which I had. The link given for it on the [index page](https://www.royal.uk/history-christmas-broadcast), on the other hand, seems to be wrong and leads to a non-existent page.

Ironically, this means that the crawler I built found the correct link where the index page itself gave the wrong one. The conclusion I draw from this is that, one way or another, we can encounter inconsistencies when web scraping, so there is no single right way to go about it. It's important to check the status of our requests so that we identify problems early on and not later, when cleaning the corpus.
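For comparison, the index-page approach could look roughly like the sketch below. The HTML snippet is a made-up stand-in, since I have not checked the real markup of the index page, and the filter on `christmas-broadcast` is an assumption about what the relevant links have in common. It uses Python's built-in `html.parser` so that it runs without `lxml`.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Hypothetical stand-in for the index page; the real markup may differ
index_html = """
<ul>
  <li><a href="/queens-first-christmas-broadcast-1952">1952</a></li>
  <li><a href="/christmas-broadcast-1953">1953</a></li>
  <li><a href="/queens-christmas-broadcast-2018">2018</a></li>
  <li><a href="/about-this-site">About</a></li>
</ul>
"""

INDEX_URL = "https://www.royal.uk/history-christmas-broadcast"

soup = BeautifulSoup(index_html, "html.parser")  # built-in parser, no extra install

links = []
for a in soup.find_all("a", href=True):
    # Keep only links that look like broadcast pages
    if "christmas-broadcast" in a["href"]:
        # Turn the relative href into a full URL
        links.append(urljoin(INDEX_URL, a["href"]))

for link in links:
    print(link)
```

Each collected link would still need the same status check as before, because a link on the index page can be wrong as well.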

Let's just scrape the two missing broadcasts:

In [8]:
# Broadcast from 2018

exception_url_2018 = 'https://www.royal.uk/queens-christmas-broadcast-2018'
resp = requests.get(exception_url_2018, headers = headers)
if resp.status_code == 200:
    print("Downloading", resp.url)
    fname = join(TRANS_PAGES_DIR, '{}.html'.format(2018))  # add correct year manually
    with open(fname, "w") as wf:
        print("Saving as", fname)
        wf.write(resp.text)
else:
    print("Request for", resp.url, "was not successful.\nNot saving file.")

    
# Broadcast from 2019

exception_url_2019 = 'https://www.royal.uk/queen’s-christmas-broadcast-2019'  # correct link here
resp = requests.get(exception_url_2019, headers = headers)
if resp.status_code == 200:
    print("Downloading", resp.url)
    fname = join(TRANS_PAGES_DIR, '{}.html'.format(2019))  # add correct year manually
    with open(fname, "w") as wf:
        print("Saving as", fname)
        wf.write(resp.text)
else:
    print("Request for", resp.url, "was not successful.\nNot saving file.")
    

Downloading https://www.royal.uk/queens-christmas-broadcast-2018
Saving as transcript-pages/2018.html
Downloading https://www.royal.uk/queen%E2%80%99s-christmas-broadcast-2019
Saving as transcript-pages/2019.html
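The 2019 URL in the output looks different from the one we typed because the typographic apostrophe is not an ASCII character, so it is sent as percent-encoded UTF-8 bytes. A short sketch with the standard library reproduces the same encoding:

```python
from urllib.parse import quote, unquote

slug = "queen’s-christmas-broadcast-2019"

# Non-ASCII characters are replaced by %XX escapes of their UTF-8 bytes
encoded = quote(slug)
print(encoded)

# Decoding restores the original text
print(unquote(encoded) == slug)
```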


Now we should finally have all the necessary HTML files and can move to the next step.
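Looking back, the three special cases and the repeated check-and-save logic could be folded into two small helpers. This is only a sketch of how the cells above might be restructured, not the code that produced the files. The names `EXCEPTION_URLS`, `url_for_year` and `save_if_ok` are hypothetical, and the demonstration uses stand-in response objects so that it runs without network access.

```python
from os import makedirs
from os.path import join
from types import SimpleNamespace
import tempfile

BASE_URL = "https://www.royal.uk/christmas-broadcast-"

# Years whose URL does not follow the base pattern
EXCEPTION_URLS = {
    1952: "https://www.royal.uk/queens-first-christmas-broadcast-1952",
    2018: "https://www.royal.uk/queens-christmas-broadcast-2018",
    2019: "https://www.royal.uk/queen’s-christmas-broadcast-2019",
}

def url_for_year(year):
    """Return the special-case URL if there is one, else the patterned URL."""
    return EXCEPTION_URLS.get(year, BASE_URL + str(year))

def save_if_ok(resp, year, out_dir):
    """Write resp.text to <out_dir>/<year>.html only for status code 200."""
    if resp.status_code != 200:
        print("Request for", resp.url, "was not successful.\nNot saving file.")
        return None
    makedirs(out_dir, exist_ok=True)
    fname = join(out_dir, "{}.html".format(year))
    with open(fname, "w") as wf:
        wf.write(resp.text)
    return fname

# Offline demonstration with stand-in objects instead of real responses
out_dir = tempfile.mkdtemp()
ok = SimpleNamespace(status_code=200, url=url_for_year(2018), text="<html></html>")
blocked = SimpleNamespace(status_code=403, url=url_for_year(1998), text="")
print(save_if_ok(ok, 2018, out_dir))
print(save_if_ok(blocked, 1998, out_dir))
```

In the real crawler, `resp` would come from `requests.get(url_for_year(year), headers=headers)` inside a single loop over all years.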

## Creating txt files from the HTML files using Beautiful Soup

### Beautiful Soup example

To demonstrate the basic use of the parser, let's first look at a small example, which I took from the [Beautiful Soup documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/).

In [23]:
html_example = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
soup = BeautifulSoup(html_example, 'lxml')

This creates a Beautiful Soup object that we can do a bunch of different things with later on. Note that we also have to pass a so-called *parser* when we create the object. I used 'lxml' (which has to be installed separately) because I have seen other people use it and it seems to be a decent, widely used parser. Let's try out some of the things we can do with this object:

In [26]:
soup.name

'[document]'

In [28]:
soup.head

<head><title>The Dormouse's story</title></head>

In [29]:
soup.p

<p class="title"><b>The Dormouse's story</b></p>

In [30]:
soup.text

"The Dormouse's story\n\nThe Dormouse's story\nOnce upon a time there were three little sisters; and their names were\nElsie,\nLacie and\nTillie;\nand they lived at the bottom of a well.\n...\n"

In [31]:
soup.contents

[<html><head><title>The Dormouse's story</title></head>
 <body>
 <p class="title"><b>The Dormouse's story</b></p>
 <p class="story">Once upon a time there were three little sisters; and their names were
 <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
 and they lived at the bottom of a well.</p>
 <p class="story">...</p>
 </body></html>]

In [33]:
soup.find_all('p')

[<p class="title"><b>The Dormouse's story</b></p>,
 <p class="story">Once upon a time there were three little sisters; and their names were
 <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
 and they lived at the bottom of a well.</p>,
 <p class="story">...</p>]
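Beyond a bare tag name, `find_all` also accepts filters, and the elements it returns expose their attributes and text. The sketch below uses a shortened version of the same example and the built-in `html.parser`, so it runs without `lxml`.

```python
from bs4 import BeautifulSoup

html = """<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters:
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>.</p>"""

soup = BeautifulSoup(html, "html.parser")

# Restrict the search to <p> elements with a given class
story_paragraphs = soup.find_all("p", class_="story")
print(len(story_paragraphs))

# Collect an attribute and the visible text of every link
hrefs = [a["href"] for a in soup.find_all("a")]
names = [a.get_text() for a in soup.find_all("a")]
print(hrefs)
print(names)
```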

### Let's start unraveling our HTML docs

---
The find_all method above is one of the commonly used methods for searching your HTML document.
Let's see what happens when we open up one of our texts and try to find all the 'p' (paragraphs) elements in there.

In [106]:
start_year = 1968
end_year = 1969

for page_num in range(start_year, end_year):
    broadcast = join(TRANS_PAGES_DIR, '{}.html'.format(page_num))
    with open(broadcast, "r") as file:
        broadcast1968 = BeautifulSoup(file, 'lxml')

broadcast1968.find_all('p')

In [51]:
start_year = 2019
end_year = 2020

for page_num in range(start_year, end_year):
    broadcast = join(TRANS_PAGES_DIR, '{}.html'.format(page_num))
    with open(broadcast, "r") as file:
        broadcast2019 = BeautifulSoup(file, 'lxml')

broadcast2019.find_all('p')

This is not a bad start. 

We do seem to be getting the entirety of the speech, but there are also some unnecessary things we might need to filter out, such as the footer and the "Share this article" p element. Additionally, we can see that the formatting in these articles is inconsistent again.

Both the [2019 broadcast](https://www.royal.uk/queen’s-christmas-broadcast-2019) and the [1968 broadcast](https://www.royal.uk/christmas-broadcast-1968) have a "Share this article" p element, but the 1968 page lacks the short description and framed quote at the beginning. Unfortunately, there seem to be no class attributes that mark the description clearly; it is merely wrapped in an \<em> element ('em' standing for *emphasis*). However, when looking through the web pages, I have also seen descriptions that have no special formatting and are not in italics. Therefore, I unfortunately cannot rely on the em element.

In [101]:
paragraphs = []
merged = ""
for paragraph in broadcast1968.find_all('p'):
    paragraphs.append(paragraph.text)

merged = "\n".join(paragraphs)  # joining with line breaks preserves the paragraph structure of the written broadcast

#print(merged)  # for checking that everything was collected correctly

In [100]:
# I was debating if I want to split the text but I decided against it for now.

#end_phrase = 'share this article'
#if end_phrase in merged.lower():
#    print('yey')
#   print(merged.lower().split(end_phrase)) #we are splitting the text at the end phrase once
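Should the trailing material need to be removed later, one option is to cut the list of paragraphs at the share marker instead of splitting the merged string. This is a sketch under the assumption that the marker sits in its own paragraph after the speech; it does not deal with the description at the top of some pages. `trim_at_marker` is a hypothetical helper, and the sample texts are made up.

```python
def trim_at_marker(paragraphs, marker="share this article"):
    """Keep only the paragraphs that come before the first one containing marker."""
    kept = []
    for text in paragraphs:
        if marker in text.lower():
            break  # everything from here on is not part of the speech
        kept.append(text)
    return kept

# Made-up paragraph texts standing in for the output of find_all('p')
sample = [
    "First paragraph of the speech.",
    "Second paragraph of the speech.",
    "Share this article:",
    "Footer text",
]

print("\n".join(trim_at_marker(sample)))
```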

Now it's time to bring those two chunks of code together into one loop that goes through the HTML files and saves the extracted text.

In [112]:
makedirs('./corpus', exist_ok=True)
path = "corpus"

start_year = 1952
end_year = 2021 +1


for year in range(start_year, end_year):

    broadcast = join(TRANS_PAGES_DIR, '{}.html'.format(year))
    with open(broadcast, "r") as html_file:
        print("Opening and converting", broadcast)
        page_soup = BeautifulSoup(html_file, 'lxml')

    # Collect the text of every <p> element in the page
    paragraphs = []
    for paragraph in page_soup.find_all('p'):
        paragraphs.append(paragraph.text)

    # Join once, after all paragraphs have been collected
    merged = "\n".join(paragraphs)

    file_path = join(path, '{}.txt'.format(year))
    with open(file_path, "w") as txt_file:
        print("Saving as", file_path)
        txt_file.write(merged)

Opening and converting transcript-pages/1952.html
Saving as corpus/1952.txt
Opening and converting transcript-pages/1953.html
Saving as corpus/1953.txt
Opening and converting transcript-pages/1954.html
Saving as corpus/1954.txt
Opening and converting transcript-pages/1955.html
Saving as corpus/1955.txt
Opening and converting transcript-pages/1956.html
Saving as corpus/1956.txt
Opening and converting transcript-pages/1957.html
Saving as corpus/1957.txt
Opening and converting transcript-pages/1958.html
Saving as corpus/1958.txt
Opening and converting transcript-pages/1959.html
Saving as corpus/1959.txt
Opening and converting transcript-pages/1960.html
Saving as corpus/1960.txt
Opening and converting transcript-pages/1961.html
Saving as corpus/1961.txt
Opening and converting transcript-pages/1962.html
Saving as corpus/1962.txt
Opening and converting transcript-pages/1963.html
Saving as corpus/1963.txt
Opening and converting transcript-pages/1964.html
Saving as corpus/1964.txt
Opening and 