# Learning to Download Data

First, let's learn how to download transcript data and collect just the pieces we need. We will start with the first episode of campaign 1 of Critical Role, [Arrival at Kraghammer](https://criticalrole.fandom.com/wiki/Arrival_at_Kraghammer/Transcript).

In [2]:
from bs4 import BeautifulSoup

import urllib.request

In [5]:
transcript_url = 'https://criticalrole.fandom.com/wiki/Arrival_at_Kraghammer/Transcript'

with urllib.request.urlopen(transcript_url) as response:
    html = response.read()

transcript = BeautifulSoup(html, 'html.parser')

In [6]:
print(transcript.prettify()[:200])

<!DOCTYPE html>
<html class="client-nojs sse-other" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   Arrival at Kraghammer/Transcript | Critical Role Wiki | Fandom
  </title>
  <scr
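
As a hedged variation on the download step above, the sketch below wraps the request in two small helpers that send an explicit `User-Agent` header and apply a timeout. The helper names `build_request` and `fetch_html`, the header value, and the 30-second default are assumptions made for illustration. They are not part of the original notebook, and the notebook does not establish whether the site requires a custom header.

```python
import urllib.request


def build_request(url, user_agent="transcript-tutorial/0.1"):
    """Build a request carrying an explicit User-Agent (the value is a placeholder)."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_html(url, timeout=30):
    """Download a page and return its raw bytes.

    Network and HTTP failures usually surface as urllib.error.URLError
    (HTTPError is a subclass), so callers can catch that if they need to.
    """
    with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
        return response.read()
```

The bytes returned by `fetch_html(transcript_url)` can be passed to `BeautifulSoup(html, 'html.parser')` in the same way as the `html` variable above.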


We are using the BeautifulSoup object to parse the HTML from our target page. The main body of the page is in the `mw-parser-output` div, so let's get that:

In [21]:
main_div = transcript.find_all('div', {'class': 'mw-parser-output'})[0]

Then, we can iterate through the children of the div. We expect each major section to start with an `h2` element, followed by a number of `p` elements, each of which is one line of dialogue from someone. We will collect the paragraphs into a list for each major section, printing each section name as we find it:

In [39]:
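current_section = None
sections = {}
for child in main_div.children:
    if child.name == "h2":
        # Drop the last two characters of the heading text, which are not part of the title
        current_section = child.text[:-2]
        print(current_section)
        sections[current_section] = []
    elif child.name == "p" and current_section is not None:
        # Collect each paragraph under the most recent heading
        sections[current_section].append(child.text)

print(sections)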

Pre-Show
Part I
Break
Part II
{'Pre-Show': ["MATT: Hello everyone. My name is Matthew Mercer, voice actor and Dungeon Master for Critical Role on Geek & Sundry, where I take a bunch of other voice actors and run them through a fantastical fantasy adventure through the world of Dungeons & Dragons. We play every Thursday at 7:00pm Pacific Standard Time on Geek & Sundry's Twitch stream. Please come watch us live if you have the opportunity. Back episodes and future episodes will be uploaded on the Geek & Sundry website. You can also check them out there. In the meantime, enjoy!\nWelcome to first episode of Critical Role, and what this basically is is a continuation of our weekly D&D game. Me and a bunch of other likely nerdy and enjoyable voice actors gathering around, rolling some dice, killing some creatures, having some adventure. Now we have the pleasure of bringing it on the stream for you to watch, enjoy, and occasionally interact with. Before we get to that, to give you a little ba
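
The grouping loop above can be tried without downloading anything. The sketch below applies the same "heading opens a section, paragraphs go under the latest heading" logic to plain `(tag, text)` pairs. The function name `group_by_heading` and the sample data are made up for illustration; the pairs are only a stand-in for the children of the parsed div, not real transcript content.

```python
def group_by_heading(nodes, heading_tag="h2", body_tag="p"):
    """Group body texts under the most recent heading.

    `nodes` is an iterable of (tag_name, text) pairs, a hypothetical
    stand-in for the children of the parsed div.
    """
    sections = {}
    current = None
    for tag, text in nodes:
        if tag == heading_tag:
            # A heading starts a new, empty section
            current = text
            sections[current] = []
        elif tag == body_tag and current is not None:
            # Paragraphs before the first heading are skipped
            sections[current].append(text)
    return sections


# Synthetic sample data, not taken from the transcript
nodes = [
    ("p", "intro text before any heading"),
    ("h2", "Section A"),
    ("p", "SPEAKER1: first line"),
    ("div", "some other element"),
    ("h2", "Section B"),
    ("p", "SPEAKER2: second line"),
    ("p", "SPEAKER1: third line"),
]
grouped = group_by_heading(nodes)
print(grouped)
```

One consequence of keying the dictionary by heading text is that a repeated heading would replace the earlier section's list, so this approach assumes section titles are unique within a page.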

We can see that there are many lines of dialogue that we need to go through and clean up. Let's look at the beginning of the first few lines in each section to check how they are formatted. Most start with a NAME: followed by the dialogue, with occasional other formats (for sound effects, for example).

In [65]:
for section in sections:
    print(f'---- {section} ----')
    for p in sections[section][:10]:
        print(p[:10])

---- Pre-Show ----
MATT: Hell
TRAVIS: Ri
[record sc
TRAVIS (CO
MARISHA: A
[thunder c
MARISHA (C
TALIESIN: 
SAM: Oh, y
ORION: Gre
---- Part I ----
MATT: All 
TRAVIS: So
MATT: Yep,
TRAVIS: Ne
MATT: Yeah
LAURA: Oh 
SAM: In th
MATT: In t
SAM: Wow! 
TRAVIS: We
---- Break ----
TRAVIS: Ri
[record sc
TRAVIS (CO
MARISHA: A
[thunder c
MARISHA (C
TALIESIN: 
SAM: Oh, y
ORION: Gre
LIAM: Neve
---- Part II ----
MATT: Hell
SAM: Oh, J
TRAVIS: We
SAM: Are w
TALIESIN: 
LAURA and 
SAM: Eggs 
TALIESIN: 
LIAM: I wa
MARISHA: D


For now, let's label the lines that have no colon (the unusual formats) as 'NOSPEAKER' and split every other line at its first colon into a name and the dialogue. We may want the names as labels later.

In [66]:
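# Split each line at its first colon into [speaker, dialogue]; lines with no colon get 'NOSPEAKER'
parsed_sections = {key: [[part.strip() for part in line.split(':', 1)] if ':' in line else ['NOSPEAKER', line]
                         for line in value]
                   for (key, value) in sections.items()}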
print(parsed_sections.get('Pre-Show')[:5])

[['MATT', "Hello everyone. My name is Matthew Mercer, voice actor and Dungeon Master for Critical Role on Geek & Sundry, where I take a bunch of other voice actors and run them through a fantastical fantasy adventure through the world of Dungeons & Dragons. We play every Thursday at 7:00pm Pacific Standard Time on Geek & Sundry's Twitch stream. Please come watch us live if you have the opportunity. Back episodes and future episodes will be uploaded on the Geek & Sundry website. You can also check them out there. In the meantime, enjoy!\nWelcome to first episode of Critical Role, and what this basically is is a continuation of our weekly D&D game. Me and a bunch of other likely nerdy and enjoyable voice actors gathering around, rolling some dice, killing some creatures, having some adventure. Now we have the pleasure of bringing it on the stream for you to watch, enjoy, and occasionally interact with. Before we get to that, to give you a little backstory on the characters you'll be seei
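
To make the splitting rule easier to inspect, here is a minimal sketch of the same logic as a standalone function. The name `split_speaker` and the sample strings are hypothetical and only illustrate behavior. The key detail is `split(':', 1)`, which splits at the first colon only, so later colons (as in a time like 7:00pm) stay in the dialogue.

```python
def split_speaker(line, placeholder="NOSPEAKER"):
    """Return [speaker, dialogue], or [placeholder, line] if there is no colon."""
    if ":" in line:
        # maxsplit=1 keeps any later colons inside the dialogue
        speaker, dialogue = line.split(":", 1)
        return [speaker.strip(), dialogue.strip()]
    return [placeholder, line]


# Synthetic examples, not taken from the transcript
print(split_speaker("NAME: We start at 7:00pm sharp"))
print(split_speaker("[sound effect]"))
print(split_speaker("[clock shows 7:00]"))
```

The last call shows a limitation of the rule: a bracketed line that happens to contain a colon is split as if it were dialogue, so its "speaker" becomes `[clock shows 7`. If such lines occur in the data, an extra check (for example, on a leading `[`) would be needed.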

Now let's see how many lines each speaker has. The counts may also show us whether additional cleanup is needed.

In [70]:
speakers = {key: [x[0] for x in value] for (key, value) in parsed_sections.items()}
print(speakers.get('Pre-Show')[:5])

['MATT', 'TRAVIS', 'NOSPEAKER', "TRAVIS (CONT'D)", 'MARISHA']


In [75]:
from collections import Counter
for section, names in speakers.items():
    print(f'---- {section} ----')
    print(Counter(names))

---- Pre-Show ----
Counter({'MATT': 15, 'MARISHA': 7, 'ORION': 6, 'NOSPEAKER': 5, 'TRAVIS': 4, 'LAURA': 4, 'SAM': 3, 'ZAC': 3, 'TALIESIN': 2, 'LIAM': 2, "TRAVIS (CONT'D)": 1, "MARISHA (Cont'd)": 1, 'ORION and MATT': 1, "ORION (Cont'd)": 1})
---- Part I ----
Counter({'MATT': 329, 'LAURA': 193, 'SAM': 123, 'LIAM': 102, 'TRAVIS': 96, 'MARISHA': 84, 'ORION': 69, 'TALIESIN': 31, 'NOSPEAKER': 9, 'ALL': 4, 'LAURA and LIAM': 2, 'ZAC': 2, 'LAURA and SAM': 1, 'MATT and MARISHA': 1, 'LAURA and TRAVIS': 1, 'TALIESIN and MARISHA': 1, 'ORION and MARISHA': 1})
---- Break ----
Counter({'NOSPEAKER': 2, 'TRAVIS': 1, "TRAVIS (CONT'D)": 1, 'MARISHA': 1, "MARISHA (Cont'd)": 1, 'TALIESIN': 1, 'SAM': 1, 'ORION': 1, 'LIAM': 1, 'LAURA': 1})
---- Part II ----
Counter({'MATT': 367, 'LAURA': 150, 'SAM': 115, 'MARISHA': 94, 'ORION': 93, 'TRAVIS': 89, 'TALIESIN': 85, 'LIAM': 56, 'NOSPEAKER': 19, 'ALL': 4, 'ZAC': 3, 'LAURA and TRAVIS': 1, 'SAM and TALIESIN': 1, 'TRAVIS and LAURA': 1, 'LAURA and SAM': 1, 'SAM and MAT
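
The counts above show labels such as `TRAVIS (CONT'D)`, `MARISHA (Cont'd)`, and joint labels like `ORION and MATT`. One possible cleanup, sketched below, strips a trailing continuation marker and splits joint labels so that each named speaker is counted once. The function name `normalize_speaker` is hypothetical, and crediting a joint line to every speaker in it is an assumption about what is wanted, not something the notebook decides. Labels such as `ALL` and `NOSPEAKER` are left unchanged.

```python
import re
from collections import Counter

# Matches a trailing "(CONT'D)" in any capitalization
CONTD = re.compile(r"\s*\(cont'd\)\s*$", re.IGNORECASE)


def normalize_speaker(label):
    """Return the list of individual speaker names contained in a label."""
    base = CONTD.sub("", label.strip())
    # Split joint labels such as "A and B"; the surrounding whitespace is required,
    # so names that merely contain the letters "and" are not split
    return [name.strip() for name in re.split(r"\s+and\s+", base) if name.strip()]


# Label formats copied from the counts above
labels = ["MATT", "TRAVIS (CONT'D)", "MARISHA (Cont'd)", "ORION and MATT", "NOSPEAKER", "ALL"]
counts = Counter(name for label in labels for name in normalize_speaker(label))
print(counts)
```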