Putting a URL to work
======================

### What you will learn

How a URL works and how you can use the syntax of URLs to specify exactly what you want to see in your browser. 

### What it can be used for

This is useful as a tool in itself should you need to embed within an email or document a link to a precise piece of information. But most importantly, it is the first building block towards understanding:

1. Web APIs: the means by which you can access a huge quantity of publicly available data which will be invaluable for analysing your own data, building models and creating tools. Examples are social media data, weather data, and the product and media information available in wikis.
2. The technique of web scraping which allows you to collect the data that is present on web pages. Again, there is plenty of potential for using this data in modelling and analysis.

Once you understand how URLs are constructed you can automate their construction and cycle through them at great speed to harvest information!

### Some terminology

It's probably fair to say that while most of us use the term **URL** quite freely in conversation, few people outside of IT understand what a URL actually is. URL stands for **Uniform Resource Locator**. It is the address of a resource (thing to be used) on the internet. That resource is most commonly a web page, but it could also be an ftp site, an email server, a printer or anything else that is connected to the network of computers that we call the web. 

If the resource is a webpage then, more often than not, typing the URL into your browser just brings the resource (a file written in **HTML**, **javascript** and a few other things) into your browser where it is converted into a pretty webpage.

But URLs can do much more than just that. To see how, let's have a look at the format that a URL must follow.

	 scheme:[//[user:password@]host[:port]][/]path[?query][#fragment]

Square brackets are wrapped around the optional things. For example web pages don't require a username and password (ftp sites do though) so the format for a web page is usually just this:

	 scheme://path[?query][#fragment]

Looking first at scheme, the common schemes are http, ftp, mailto, file and data. Of these we are most concerned with **http** or **Hypertext Transfer Protocol**. Let's break this down a bit. Hypertext is just text containing links (**hyperlinks**) to other text so that a user can navigate from one text to another. The most common language for writing hypertext is of course **HTML** which stands for **Hypertext Markup Language**. The transfer protocol part is just referring to the set of rules for moving these files about. So with a webpage URL the scheme part is specifying that the URL will be bringing back some hypertext in accordance with the HTTP rules.

The path part of a webpage URL is usually made up of a domain name like `www.coppelia.io` (the rules for these are governed by the **Domain Name System** - you may have heard the acronym **DNS** being used) combined with the file path syntax that we are all familiar with from our PCs. So you might for example have 

	http://www.coppelia.io/blog/blogpost

The next two parts are where it gets interesting. Following a `?` we may add a **query string** which is a way of asking for certain results to be brought back. So for example 

	http://www.blueturtlefish.com?productid=11678

might bring back the product page for the product with product ID 11678. We can use `&` to specify more than one attribute to search on:

	http://www.blueturtlefish.com?productid=11678&colour=blue

The syntax for the query string is not strictly defined but the `attribute=value&attribute=value` approach is very common.
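
Because the `attribute=value&attribute=value` pattern is so regular, it is easy to generate in code. The sketch below uses `urlencode` from python's standard `urllib.parse` module to build query strings for the blueturtlefish.com example. The helper name `product_url` and the list of product IDs are made up purely for illustration.

```python
from urllib.parse import urlencode

base_url = "http://www.blueturtlefish.com"

def product_url(product_id, colour):
    """Build a product URL using the attribute=value&attribute=value pattern."""
    query = urlencode({"productid": product_id, "colour": colour})
    return base_url + "?" + query

# Hypothetical product IDs, used only to show URLs being generated in a loop.
for product_id in [11678, 11679, 11680]:
    print(product_url(product_id, "blue"))
```

The first URL printed is the same one we wrote by hand above. Letting `urlencode` do the work also means that any awkward characters in the values get escaped for us.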

Finally, the hash sign `#` precedes a **fragment identifier**. Originally this was just a way of identifying a portion of the webpage that has been given the label (called an **id**) that you specify after the hash. So for example  

	http://www.coppelia.io/blog/blogpost#introduction
	
will take you to the part of a blog post that has been labelled introduction (the labelling is in the HTML, you won't necessarily see it on the rendered webpage).

However the fragment identifier has become a way of passing extra bits of information that can be picked up and processed by **javascript** (the programming language that gives instructions to your browser) once the page has loaded. So just to make life more difficult sometimes you will see what looks like a query string coming after the hash sign. 

	http://www.blueturtlefish.com/products#product=11678&colour=blue

Yes this is a query but it's working slightly differently. In the original case above the URL was saying "give me the webpage that meets the criteria productid=11678 and colour=blue". This time it is saying "load the page `http://www.blueturtlefish.com/products` and then hand the instructions `product=11678&colour=blue` over to javascript for processing" (and then javascript runs off and gets some info on the product, perhaps displaying the result in an info box).
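
To see the difference between the two styles, it can help to pull a URL apart in code. The following is a minimal sketch using `urlsplit` and `parse_qs` from python's standard `urllib.parse` module. The helper name `describe` is just an illustrative choice.

```python
from urllib.parse import urlsplit, parse_qs

def describe(url):
    """Split a URL into the parts discussed above."""
    parts = urlsplit(url)
    return {
        "scheme": parts.scheme,
        "host": parts.netloc,
        "path": parts.path,
        "query": parse_qs(parts.query),   # attribute=value pairs after the ?
        "fragment": parts.fragment,       # everything after the #
    }

query_style = describe("http://www.blueturtlefish.com?productid=11678&colour=blue")
fragment_style = describe("http://www.blueturtlefish.com/products#product=11678&colour=blue")

print(query_style["query"])        # the attributes arrive as a query string
print(fragment_style["query"])     # empty: nothing follows a ? in this URL
print(fragment_style["fragment"])  # the same kind of information, but after the #
```

In the second URL the query part is empty. The browser keeps the fragment to itself rather than sending it to the server, which is why javascript has to pick it up once the page has loaded.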


------------------

## Hands on

1. First we are going to look at how google structures a URL. Using Chrome, go to google, type in a search term and execute the search. If your mind has gone blank, search for a hoover. What do you notice about the URL? Is it using a query string or a fragment identifier? Try searching for a term that contains a space, e.g. 'blue hoover'. How is the space represented in the URL?

2. In what follows we will often want to automate the construction of URLs so we can flip through pages at speed. But each website uses URLs slightly differently. To understand the URL construction in individual cases we will need to reverse engineer it. Click on the Videos tab in google and search again. What does the URL look like now?

3. Hopefully you will have worked out that there is an attribute `tbm` that controls the type of thing you are searching for (`tbm=vid` in the previous case). So now experiment to find the rest of the possible values and construct some URLs that will take you straight to the results containing Shopping, News and Books.

4. Now see if you can work out what the following URL is doing

		https://www.google.co.uk/#tbs=vw:l,mr:1,price:1,ppr_min:45,ppr_max:90,local_avail:1&tbm=shop&q=hoover&tbas=0

5. Let's use the fragment identifier for its intended purpose. Pick a page in wikipedia and inspect the URL. Sticking with the randomly chosen vacuum cleaner theme I picked [this](https://en.wikipedia.org/wiki/Roomba). Note the simplicity of the URL structure. Wikipedia is just a collection of webpages. Now right click and choose *View Page Source*. You are now looking at the raw HTML. We'll talk more about this later. For now just note that some of the parts of the text have tags (the parts within the < > symbols) that contain *ids*. For example in my page I have `id="Original_and_400_series"`. These are points in the text I can get the URL to jump to using the fragment identifier. I just put the id text behind the hash.

		https://en.wikipedia.org/wiki/Roomba#Original_and_400_series
		
	Try doing something similar with the page you have chosen.

Now complete the following tasks

---

### Task 1

6. Work out the URL syntax for opening Google Maps at a specific longitude and latitude (say your office or home)
7. Go to weather.com and work out the URL syntax for the following
	1. Weather at your office
	2. A ten day forecast for London
	3. A ten day forecast for your office
8. Pick a website and analyse the URL structure

---

### Task 2

Instead of opening a URL in a browser we are going to do something different. We are going to open the URL in python and pull the contents of the webpage into a python program. This opens up many possibilities since now we can write code that interacts with the content and that means we can automate jobs that could previously only be done manually. Imagine that for a media client, we want to do some text analysis on all the wikipedia pages for actors. 

In [1]:
import requests
response = requests.get('https://en.wikipedia.org/wiki/Christopher_Lee')
print('RESPONSE:', response.status_code)
print(response.text)

RESPONSE: 200
<!DOCTYPE html>
<html class="client-nojs" lang="en" dir="ltr">
<head>
<meta charset="UTF-8"/>
<title>Christopher Lee - Wikipedia</title>
<script>document.documentElement.className = document.documentElement.className.replace( /(^|\s)client-nojs(\s|$)/, "$1client-js$2" );</script>
<script>(window.RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"Christopher_Lee","wgTitle":"Christopher Lee","wgCurRevisionId":879747302,"wgRevisionId":879747302,"wgArticleId":53494,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Pages containing London Gazette template with parameter supp set to y","Webarchive template wayback links","CS1 maint: Multiple names: authors list","CS1 maint: BOT: original-url status unknown","EngvarB from January 2018","Use dmy dates from January 2018","Articles with hCards","Commons category link from W

You should see some html scroll across the python console. Here's a quick breakdown of the code:
	
1. `import requests` loads a python library (a collection of code) that allows you to open URLs and interact with them via HTTP. 
2. `response = requests.get('https://en.wikipedia.org/wiki/Christopher_Lee')` opens the URL for the Christopher Lee page and stores the result in the `response` variable. That response is a combination of different things (an *object* in programming speak) including any error messages that come back and the retrieved HTML (if the job has been successful).
3. Lines 3 and 4 print the return status code (200 if the page has been returned successfully) and the HTML.
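
The status code is worth checking before you do anything with the HTML. As a small sketch, python's standard `http.HTTPStatus` can translate a code into its official phrase. The helper name `describe_status` and the choice of example codes are ours.

```python
from http import HTTPStatus

def describe_status(code):
    """Return a short description of an HTTP status code."""
    status = HTTPStatus(code)
    # Codes in the 200s mean the request succeeded.
    outcome = "success" if 200 <= code < 300 else "not a success"
    return f"{code} {status.phrase} ({outcome})"

for code in (200, 404, 500):
    print(describe_status(code))
```

When working with `requests` you can also call `response.raise_for_status()`, which raises an exception if the page came back with an error code.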

The raw HTML may be useful to us for certain tasks but say we just want to get at the raw text of the wikipedia article for now. For this we'll need to use another python library called BeautifulSoup. Execute the following in your browser.

In [2]:
from bs4 import BeautifulSoup
response = requests.get('https://en.wikipedia.org/wiki/Christopher_Lee')
print('RESPONSE:', response.status_code)

html = response.content
soup = BeautifulSoup(html, "html5lib")
for p in soup.find_all('p'):
    print(p.text)

RESPONSE: 200



Sir Christopher Frank Carandini Lee CBE, CStJ (27 May 1922 – 7 June 2015) was an English actor,[1] singer, military officer, and author. With a career spanning nearly 70 years, Lee was well known for portraying villains and became best known for his role as Count Dracula in a sequence of Hammer Horror films, a typecasting situation he always lamented.[2] His other film roles include Francisco Scaramanga in the James Bond film The Man with the Golden Gun (1974), Saruman in The Lord of the Rings film trilogy (2001–2003) and The Hobbit film trilogy (2012–2014), and Count Dooku in the second and third films of the Star Wars prequel trilogy (2002 and 2005).

Lee was knighted for services to drama and charity in 2009, received the BAFTA Fellowship in 2011, and received the BFI Fellowship in 2013.[3] Lee considered his best performance to be that of Pakistan's founder Muhammad Ali Jinnah in the biopic Jinnah (1998), and his best film to be the British cult film The Wicker Man

It's the last three lines that concern us. 
   
1. `soup = BeautifulSoup(html, "html5lib")` converts the html we extracted into something that we can manipulate with the BeautifulSoup library.
2. `soup.find_all('p')` finds all the parts of the html that are tagged as a paragraph of text. We then loop through all of these paragraphs, printing out the text contained in each.
   
Complete the job of creating a dataset of actor bios. First write a python function that takes as arguments a) a list of wikipedia pages and b) a directory path. This function should then store in the specified directory, text files containing the raw text from each webpage. If you have time, create an ad hoc method of extracting the webpages from one of the wikipedia "list of actors" pages and apply your function to these pages.
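
As a starting point (not a full solution), here is one possible pair of helpers for the file handling part of the task. The names `filename_for` and `save_text` are hypothetical, and naming each file after the last part of the URL is just one reasonable convention. Fetching the page and extracting the text is left to you.

```python
from pathlib import Path
from urllib.parse import urlsplit, unquote

def filename_for(url):
    """Derive a text file name from the last part of a page URL."""
    last_part = urlsplit(url).path.rstrip("/").split("/")[-1]
    return unquote(last_part) + ".txt"

def save_text(directory, url, text):
    """Write text to a file in directory, creating the directory if needed."""
    folder = Path(directory)
    folder.mkdir(parents=True, exist_ok=True)
    target = folder / filename_for(url)
    target.write_text(text, encoding="utf-8")
    return target

print(filename_for("https://en.wikipedia.org/wiki/Christopher_Lee"))
```

Your function would then loop over the list of pages, extract the paragraph text as shown above and call something like `save_text` once per page.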
   
   

--------------------------

Extracting Data from an API
======================

### What you will learn

How to extract data from a web API using python and how to convert the data into something usable.

### What it can be used for

Many online businesses (Twitter, Facebook, Google, Spotify, Linkedin) provide access to at least some of their data via APIs. There are many more publicly available APIs providing services from weather forecasts, to route planning to facial recognition. Their range of services is huge. [Here](https://www.diycode.cc/projects/toddmotto/public-apis) is just one example of a site listing publicly available APIs.

Each of these sources can provide rich data to supplement your modelling and analysis projects.



### Some terminology

There is a strict IT meaning of a web **API** or **Application Programming Interface** and there is the meaning it has acquired for web users and developers. The former is pretty interesting but it's the latter that is most relevant for us! So, sticking with the latter, we can describe a web API as a means of accessing a service that sits on a server somewhere on the internet. The service can be pretty much anything but it usually takes some form of data as an input and returns data as an output. A common example would be query parameters as an input (get me all tweets with a certain hashtag) and a data set of query results as output.

The rules for interacting with an API are usually well documented by the provider of the API, and that documentation is usually the first port of call as you start working out how you will use the service. Web APIs use **HTTP** as a method of communication and it's here that our earlier work on URLs pays off. The API is accessed using a particular URL called an **endpoint**. In fact each service provided by the API will have its own endpoint. In the examples below we use one endpoint to query artists on Spotify and another to query albums.

As with our work on URLs we can pass **query parameters** to these endpoints, but we also need to pass other types of information, such as codes (or keys)  that prove that we are who we say we are. This is called **authentication** and **authentication keys** are supplied by the API provider as you sign up for the service. 

We pass the authentication keys to the API in the **header** part of an **HTTP request**. What do we mean by this? Well, there are two main types of HTTP Request. A **GET** request where your computer asks for something from a server on the web and a **POST** request where you pass something to a server. When we enter a URL into a browser that's just a GET request; the thing that's *got* is the webpage. We will mostly do GET requests to APIs but POST requests are common too and you'll see one in the task below. 

What we don't see when we type a URL into the browser is the *header* information that accompanies each request. This is information which our computer needs but we don't usually need to bother with, like what kind of files are acceptable as a response to the request. You can see the header for any URL by looking at the network tab which is part of **Chrome's developer tools**.

The header is where we put the authentication information. This is not something we can do in a browser, hence the need for a tool like python to handle the more complex API interactions.
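
To make the idea of request types and headers concrete, the sketch below builds (but does not send) two requests using python's standard `urllib.request` module, then inspects them. The endpoint URL, the header name `X-Api-Key` and the key value are placeholders invented for the example. Real APIs document the names they expect.

```python
from urllib.request import Request

endpoint = "https://api.example.com/lookup?term=treme"  # hypothetical endpoint

# A GET request carrying an Accept header and a placeholder key header.
get_request = Request(
    endpoint,
    headers={"Accept": "application/json", "X-Api-Key": "YOUR-KEY-HERE"},
)

# Supplying a body (data) turns the request into a POST.
post_request = Request(endpoint, data=b'{"term": "treme"}')

print(get_request.get_method())
# urllib tidies the capitalisation of header names; HTTP treats them as
# case-insensitive, so this makes no difference to the server.
print(get_request.header_items())
print(post_request.get_method())
```

Nothing is sent over the network here. With the `requests` library used elsewhere in this tutorial, the equivalent is to pass a `headers` dictionary to `requests.get`.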

The data returned from an API call can take many forms but one of the most common is **json** or **JavaScript Object Notation**, a standard data exchange format. Most of you will be familiar with it but in case you are not it looks like this:

	{
	"Title": "D.O.A.",
	"Director":  "Rudolph Maté",
	"Starring": [
		{
		 "Name": "Edmond O'Brien",
		 "Born": "September 10, 1915"
		},
		{
		 "Name": "Pamela Britton",
		 "Born": "March 19, 1923 "
		}
	]
	}


You'll need one last concept if you are fairly new to python. This is the idea of a  **python dictionary**. It's a way of storing data in python that looks like this:

	person = {'Name': 'Simon', 'Age': '7'}

(You'll notice it is very similar to json!) It's easy to access the information in a dictionary. For example, I can get Age as follows

	print(person['Age'])
	7
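
The similarity is not a coincidence: python's standard `json` module turns a JSON object into a dictionary and a JSON array into a list. Here is a short sketch using the film example from above. The variable names are our own.

```python
import json

raw = """
{
  "Title": "D.O.A.",
  "Director": "Rudolph Maté",
  "Starring": [
    {"Name": "Edmond O'Brien", "Born": "September 10, 1915"},
    {"Name": "Pamela Britton", "Born": "March 19, 1923"}
  ]
}
"""

film = json.loads(raw)  # JSON object -> python dictionary
print(film["Title"])

# "Starring" is a JSON array, so it becomes a python list of dictionaries.
names = [actor["Name"] for actor in film["Starring"]]
print(names)
```

Once the text has been parsed, the nested parts are reached by chaining keys and list positions, for example `film["Starring"][0]["Born"]`.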

## Hands on

### Task 1

First we will be looking at the International Space Station API. We've chosen this because it does not currently require authentication and this will save us from an extra layer of complexity. The first step with an API is to check out the API documentation which you can find [here](http://open-notify.org/Open-Notify-API/ISS-Location-Now/).
	
Now to construct a call to the ISS API we need to know how to structure the URL. Let's try using the International Space Station Current Location endpoint. The [documentation](http://open-notify.org/Open-Notify-API/ISS-Location-Now/) for that endpoint tells us that we need to make a GET request to the URL `http://api.open-notify.org/iss-now.json`. There are no further query parameters.

So paste the following into your browser

    http://api.open-notify.org/iss-now.json

In the steps above we were able to use the URL alone to bring back data since it was a fairly uncomplicated GET request. But we soon find ourselves needing to provide authentication information and more complex parameters. We'd also like to manipulate the json data that is returned. This will all require python. Let's start by looking at how we'd call the ISS API using python.

In [4]:
import json
response = requests.get('http://api.open-notify.org/iss-now.json')
obj = json.loads(response.text)
print(obj['timestamp'])
print(obj['iss_position']['latitude'], obj['iss_position']['longitude'])

1548247862
9.1528 52.6715
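
The timestamp is in Unix time (seconds since 1 January 1970), which is not very readable. The sketch below converts it into a date and turns the coordinates into numbers. To keep it runnable offline, `obj` is a sample dictionary assembled from the values printed above. Its exact layout, including the `message` field and the coordinates being strings, is an assumption about the response.

```python
from datetime import datetime, timezone

# Sample response built from the values printed above (layout is assumed).
obj = {
    "message": "success",
    "timestamp": 1548247862,
    "iss_position": {"latitude": "9.1528", "longitude": "52.6715"},
}

# Convert Unix time to a timezone-aware date and time in UTC.
when = datetime.fromtimestamp(obj["timestamp"], tz=timezone.utc)

# float() works whether the API supplies the coordinates as text or numbers.
latitude = float(obj["iss_position"]["latitude"])
longitude = float(obj["iss_position"]["longitude"])

print(when.isoformat())
print(latitude, longitude)
```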


### Task 2

Next we are going to look at a slightly more useful API. The utelly API available via [RapidAPI](https://rapidapi.com) (a very useful resource) will return a list of which platforms a particular programme is on. The [documentation and testing page](https://rapidapi.com/utelly/api/utelly) on RapidAPI gives you all the information you need about the endpoint as well as providing sample code for querying it. Most APIs of any note require you to subscribe to them. You are then issued with an authentication key. Below we access the API via my RapidAPI authentication key. Try executing the following code which searches for the programme **Treme**.

In [5]:
response = requests.get("https://utelly-tv-shows-and-movies-availability-v1.p.rapidapi.com/lookup?term=treme&country=uk",
  headers={
    "X-RapidAPI-Key": "53e13c2560mshc31bcc23b606a16p1b03fbjsnd3f429b002d4"
  }
)

response.json()

{'status_code': 200,
 'updated': '2019-01-23T04:00:23+0000',
 'term': 'treme',
 'results': [{'picture': 'https://utellyassets2-9.imgix.net/2/Open/HBO/Treme/Season%204/Episode%20405%20-%20...To%20Miss%20New%20Orleans/_4by3/Treme-Episode405-Still1.jpg?fit=crop&auto=compress&crop=faces,top',
   'name': 'Treme',
   'locations': [{'display_name': 'Google Play',
     'name': 'GooglePlay',
     'url': 'https://play.google.com/store/tv/show?id=7JF-u3fKZlA',
     'id': '5523f21391072d0e23728ab9',
     'icon': 'https://utellyassets7.imgix.net/locations_icons/utelly/black/GooglePlay.png?&w=92&auto=compress&app_version=5s2gpuod4j-996'},
    {'display_name': 'Rakuten TV',
     'name': 'WuakiTV',
     'url': 'https://uk.wuaki.tv/seasons/treme-4/episodes/yes-we-can',
     'id': '56c6edcba54d7559fe5028e1',
     'icon': 'https://utellyassets7.imgix.net/locations_icons/utelly/black/WuakiTV.png?&w=92&auto=compress&app_version=5s2gpuod4j-996'},
    {'display_name': 'TalkTalk TV Store',
     'name': 'Blink

As you can see the API has provided the data in JSON format which isn't very user friendly. First we need to parse the json. For example the following will extract the names of the programmes that have been returned.

In [6]:
response_json = response.json()
results = response_json['results']
for r in results:
    print(r['name'])

Treme
One Tree Hill
Happy Tree Friends
The Tree of Life


To get at the locations we need a nested loop.

In [7]:
response_json = response.json()
results = response_json['results']
for r in results:
    print()
    print(r['name'])
    for l in r['locations']:
        print('>>', l['name'])


Treme
>> GooglePlay
>> WuakiTV
>> BlinkBox
>> ITunes
>> Amazon

One Tree Hill
>> GooglePlay
>> ITunes
>> Amazon

Happy Tree Friends
>> GooglePlay

The Tree of Life
>> BlinkBox
>> Amazon
>> WuakiTV

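Printing is fine for a quick look, but for analysis it is usually handier to flatten the nested structure into one row per programme and platform. The following is a minimal sketch of that idea. So that it runs without an API key, `response_json` here is a cut-down copy of the response shown above containing only the fields we use, and the column names are our own choice.

```python
# A cut-down copy of the API response, keeping only the fields we need.
response_json = {
    "results": [
        {"name": "Treme",
         "locations": [{"name": "GooglePlay"}, {"name": "WuakiTV"}]},
        {"name": "Happy Tree Friends",
         "locations": [{"name": "GooglePlay"}]},
    ]
}

# The same nested loop as above, written as a list comprehension that
# produces one dictionary (row) per programme/platform pair.
rows = [
    {"programme": result["name"], "platform": location["name"]}
    for result in response_json["results"]
    for location in result["locations"]
]

for row in rows:
    print(row)
```

A list of dictionaries like `rows` can be handed straight to `pandas.DataFrame` if you want a table.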

How to scrape a web page
=================================

### What you will learn

How to extract information from a web page using the python package BeautifulSoup.


### What it can be used for

Many web pages contain structured information that is useful to businesses. Some practical examples of web scraping are the extraction of

* Product and price information for price comparison and market analysis
* Personal profiles for tracking reputation
* Stories from news sites, which are then summarised/curated to provide a new portal for the information.

In your case, you might be interested in extracting data describing TV programmes, or in pulling data about TV schedules or the presence of advertising.

There is some ambiguity about the legal status of web scraping. Under certain conditions (for example the scraping of an entire database) it could be seen as the theft of intellectual property. If you are at all in doubt please raise it with your line manager!


## An explanation of the parts

To understand **web scraping** we will need to get a deeper understanding of **HTML**. As I'm sure most of you are aware HTML is a markup language in which plain text is annotated with tags, which, together with instructions on styling, instruct your browser on how to render the text. For example

	<h1>Hello</h1>
	
Will render *Hello* as heading 1, so that it looks like this.

Hello
=====

Some of the most **common tags**, and we will come across them in the task below, are:

* `<h1>`, `<h2>` etc which mark headings
* `<div>` or division tags which mark out block sections of a webpage
* `<a>` otherwise known as anchor tags, which mark hyper links
* `<span>` which mark out a section of text for special formatting
* `<p>` which indicate that a block of text is a paragraph
* `<ul>` and `<ol>` which mark unordered (bullet point) and ordered (numbered) lists respectively
* `<li>` which marks a list item

Tags can be assigned a **class** or given an **id**

	<h1 class ="boxtitle">Hello</h1>
	<h1 id ="special">Hello</h1>

This is usually to mark them out for a particular kind of formatting, but it is especially useful for scraping as it allows us to tell python which tag contains the data we are interested in.

Sometimes the data we are interested in is not inside a conveniently labelled tag but lies close to it. In that case we need to learn how to navigate the **DOM** (or **Document Object Model**) **Tree** to get to it. Tags can be placed within other tags, giving us a hierarchical structure. The DOM is the tree representation of that structure which the browser builds from the HTML.

For example, here the *span* tag is the **child** of the *div* tag (or we could say the *div* tag is the **parent** of the *span* tag).

	<DIV>
		<SPAN>
	       Some text
		</SPAN>
	</DIV>

Here however we have two *anchor* tags that are **siblings** (their parent being the div tag).

	<DIV>
		<A href="www.coppelia.io">
	       Some text
		</A>
		<A href="www.coppelia.io/blog">
	       Some other text
		</A>
	</DIV>
	
With web scraping we can get from the tag that can be identified directly with class or id to one that can't by saying things like "take the child of the sibling of the parent of the node we know!"

**BeautifulSoup** which we've met already, facilitates both the selecting of tags by class or id and the navigation of the DOM tree.
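
Here is a small sketch of both ideas on a scrap of HTML modelled on the examples above. The `id` of `links` on the div is invented for the illustration, and we use python's built-in `html.parser` rather than `html5lib` simply so that nothing extra needs installing.

```python
from bs4 import BeautifulSoup

html = """
<div id="links">
  <h1 class="boxtitle">Hello</h1>
  <a href="http://www.coppelia.io">Some text</a>
  <a href="http://www.coppelia.io/blog">Some other text</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

heading = soup.find("h1", {"class": "boxtitle"})  # select a tag by class
container = soup.find("div", {"id": "links"})     # select a tag by id

first_link = container.find("a")                  # first <a> inside the div
second_link = first_link.find_next_sibling("a")   # the next <a> at the same level

print(heading.text)
print(first_link.parent.get("id"))   # going up: the parent is the div
print(second_link.get("href"))       # going sideways: the sibling link
```

The pattern is the same on a real page: find a tag you can identify by class or id, then step to its parent, children or siblings until you reach the data you want.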

We will also be making good use of *Chrome's Developer Tools* which allow us to inspect the DOM tree for any webpage.


------------------

## How to scrape a web page

1. We are going to see if we can scrape programme and episode data from ITV's website. The first thing we need to do is work out the structure of the part of the site that we are interested in. I notice that there is a [list of shows](https://www.itv.com/hub/shows) and that each show has its [own page](https://www.itv.com/hub/a-touch-of-frost/Ya1774). If we can extract from the list of shows the URL of each show page, then we can visit each show page in turn and pull out the episode data. So to get this list of URLs we will need to analyse the DOM of the page `https://www.itv.com/hub/shows`. Visit the page and then use the Chrome developer tools to view the page source. You will see many of the elements and tags we discussed above.

2. Let's start simple and see if we can extract all the programme titles. There's an easy way to find what we are looking for in this soup of html. Return to the usual browser view. Highlight the title of the first programme in the list, right click and select *Inspect Element* to bring up the DOM. You will see the text for that programme sitting in an h3 tag, nested within lots of other tags. It has a class of `tout__title` which seems to identify it as a programme title.

3. Now we are going to use BeautifulSoup to extract the text from all h3 tags with this class. Run the following code.

In [8]:
query_url = "https://www.itv.com/hub/shows"
query_response = requests.get(query_url).text
soup = BeautifulSoup(query_response, "html5lib")

titles = soup.find_all('h3',{'class': 'tout__title'})

for t in titles:
    
    print(t.text.strip(' \t\n\r'))

A Touch of Frost
A Very Royal Wedding
Absolutely Ascot
Agatha Christie's Marple
Agatha Christie's Poirot
Ainsley's Caribbean Kitchen
Alan Titchmarsh
Alexander Armstrong in the Land of the Midnight Sun
American Dad!
An Audience with...
Animanimals
The Avatars
BabyRiki
The Bachelorette
Bad Move
The Bagel & Becky Show
Be Cool Scooby Doo
Be Tasty
Bear Grylls Survival School
Bear's Mission with...
Ben 10
Benidorm
Benidorm: Ten Years on Holiday
Betch
Bette Midler: One Night Only
The Big Audition
The Big Fight
The Big Fight
The Big Fight: Highlights
Big Star's Bigger Star
Birds of a Feather
Botched
Bradley Walsh & Son: Breaking Dad
Bradley Walsh: When Dummies Took Over the World
Bridezillas
Bring It!
Britain's Got Talent
British Touring Car Championship
British Touring Car Championship Review
Buying and Selling
Cake Hunters
Car Crash Global: Caught on Camera
Carry On
Catchphrase
Celebrations
Celebrity Catchphrase
Champions League
The Chase
The Chase - The Bloopers
The Chase: Celebrity Special

Now see if you can extract all the programme summaries!

The programme titles will be useful for our data set but what we really need is the URLs. Recall that links have the tag `<a>`. By using the Chrome inspector on the links we can identify the right class for them and extract them.


In [9]:
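# Attribute filters must all go in one dictionary: a third positional argument
# is read as find_all's `recursive` flag, not as a second filter.
links = soup.find_all('a', {'class': 'complex-link'})
for l in links:
    print(l.get('href'))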

https://www.itv.com/hub/a-touch-of-frost/Ya1774
https://www.itv.com/hub/a-very-royal-wedding/2a4703
https://www.itv.com/hub/absolutely-ascot/2a5862
https://www.itv.com/hub/agatha-christies-marple/L1286
https://www.itv.com/hub/agatha-christies-poirot/L0830
https://www.itv.com/hub/ainsleys-caribbean-kitchen/2a6099
https://www.itv.com/alantitchmarsh
https://www.itv.com/hub/alexander-armstrong-in-the-land-of-the-midnight-sun/2a3386
https://www.itv.com/hub/american-dad/2a4263
https://www.itv.com/hub/an-audience-with/L0055
https://www.itv.com/hub/animanimals/2a5938
https://www.itv.com/hub/the-avatars/2a5209
https://www.itv.com/hub/babyriki/2a5940
https://www.itv.com/hub/the-bachelorette/2a3729
https://www.itv.com/hub/bad-move/2a4774
https://www.itv.com/hub/the-bagel-becky-show/2a5321
https://www.itv.com/hub/be-cool-scooby-doo/2a4501
https://www.itv.com/hub/be-tasty/2a4995
https://www.itv.com/hub/bear-grylls-survival-school/2a4050
https://www.itv.com/hub/bears-mission-with/2a5494
https://www.

Next we want to visit each of these pages in turn and pull out the episode data. First we put the URLs into a python list `prog_pages`, then we loop through the list using `requests` to get the page and `BeautifulSoup` to pull out the episode titles, which again seem to have the class `tout__title`.

In [10]:
prog_pages =[]
for l in links:
    prog_pages.append(l.get('href'))


for i, p in enumerate(prog_pages[0:10]):
    print("-------------------------")
    print(titles[i].text.strip(' \t\n\r'))
    query_response = requests.get(p).text
    prog_page_soup = BeautifulSoup(query_response, "html5lib")
    episode_titles = prog_page_soup.find_all('h3',{'class': 'tout__title'})
    for e in episode_titles:
        print(e.text.strip(' \t\n\r'))

-------------------------
A Touch of Frost
1. Penny for the Guy
5. Deep Waters
4. The Things We Do for Love
3. Funtime for Swingers
Grantchester
Maigret
Unforgotten
Agatha Christie's Marple
Coronation Street
Discover something new
-------------------------
A Very Royal Wedding
Sat Saturday 29 Dec
Dinner Date
Chrisley Knows Best
Tonight
This Morning
Pothole Wars
Discover something new
-------------------------
Absolutely Ascot
Episode 8
Episode 7
Episode 6
Episode 5
Episode 4
Episode 3
Episode 2
Episode 1
Dinner Date
The Chase - The Bloopers
The Imitation Game
Ibiza Weekender
Married To Medicine
Discover something new
-------------------------
Agatha Christie's Marple
3. Towards Zero
2. Ordeal by Innocence
1. At Bertram's Hotel
7. Dead Man's Mirror
5. Adventures of the Italian Nobleman
Want even more?
Emmerdale
Endeavour
Cleaning Up
Doc Martin
Vera
Discover something new
-------------------------
Agatha Christie's Poirot
7. Dead Man's Mirror
5. Adventures of the Italian Nobleman
4. Case

But something seems to have gone wrong. The class `tout__title` is picking up not just the episode titles but the titles of similar programmes that are displayed at the bottom of the page. Closer inspection of the DOM reveals that the episode titles are all `h3` tags that are nested inside of a `div` with an `id` of `more-episodes`. Makes sense.

In [11]:
for i, p in enumerate(prog_pages[0:10]):
    print("-------------------------")
    print(titles[i].text.strip(' \t\n\r'))
    query_response = requests.get(p).text
    prog_page_soup = BeautifulSoup(query_response, "html5lib")
   
    more_episodes_info = prog_page_soup.find('div',{'id': 'more-episodes'} )
    if more_episodes_info is not None:
        episode_titles = more_episodes_info.find_all('h3',{'class': 'tout__title'})
        for e in episode_titles:
            print(e.text.strip(' \t\n\r'))

-------------------------
A Touch of Frost
1. Penny for the Guy
5. Deep Waters
4. The Things We Do for Love
3. Funtime for Swingers
-------------------------
A Very Royal Wedding
Sat Saturday 29 Dec
-------------------------
Absolutely Ascot
Episode 8
Episode 7
Episode 6
Episode 5
Episode 4
Episode 3
Episode 2
Episode 1
-------------------------
Agatha Christie's Marple
3. Towards Zero
2. Ordeal by Innocence
1. At Bertram's Hotel
-------------------------
Agatha Christie's Poirot
7. Dead Man's Mirror
5. Adventures of the Italian Nobleman
4. Case of the Missing Will
3. Yellow Iris
1. Adventures of the Egyptian Tomb
3. One Two Buckle My Shoe
2. Death in the Clouds
1. The ABC Murders
10. The Affair at the Victory Ball
9. The Theft of the Royal Ruby
3. The Adventure of Johnnie Waverly
2. Murder in the Mews
1. The Adventure of the Clapham Cook
-------------------------
Ainsley's Caribbean Kitchen
Episode 1
-------------------------
Alan Titchmarsh
-------------------------
Alexander Armstro

This all looks good. The last step is to store the results in a pandas dataframe rather than just printing them.

In [12]:
import pandas as pd

rows = []
for i, p in enumerate(prog_pages[0:10]):
   
    query_response = requests.get(p).text
    prog_page_soup = BeautifulSoup(query_response, "html5lib")
    
    more_episodes_info = prog_page_soup.find('div',{'id': 'more-episodes'} )
    if more_episodes_info is not None:
        episode_titles = more_episodes_info.find_all('h3',{'class': 'tout__title'})
        for e in episode_titles:
            row = {"Programme Title": titles[i].text.strip(' \t\n\r'), "Episode Title": e.text.strip(' \t\n\r')}
            rows.append(row)

pd.DataFrame(rows)

Unnamed: 0,Episode Title,Programme Title
0,1. Penny for the Guy,A Touch of Frost
1,5. Deep Waters,A Touch of Frost
2,4. The Things We Do for Love,A Touch of Frost
3,3. Funtime for Swingers,A Touch of Frost
4,Sat Saturday 29 Dec,A Very Royal Wedding
5,Episode 8,Absolutely Ascot
6,Episode 7,Absolutely Ascot
7,Episode 6,Absolutely Ascot
8,Episode 5,Absolutely Ascot
9,Episode 4,Absolutely Ascot
