# Searching for subtitles

I tried searching for subtitles that include character names, but could not find any reliable sources.
While googling, I found https://bigbangtrans.wordpress.com/series-1-episode-1-pilot-episode/, which has transcripts.

This blog has transcripts in a reasonably consistent format that I think I can parse.
Let's get all the transcripts.

# Getting transcripts

Found the sitemap: https://bigbangtrans.wordpress.com/sitemap.xml

Let's parse this to get an episode-wise list of URLs.

In [1]:
import requests

sitemap_url = "https://bigbangtrans.wordpress.com/sitemap.xml"
res = requests.get(sitemap_url)
sitemap_data = res.text
print(sitemap_data[:100])

<?xml version="1.0" encoding="UTF-8"?>
<!-- generator="wordpress.com" -->
<urlset xmlns:xsi="http://


In [2]:
# Raw XML parsing is tedious; xmltodict converts it into plain Python dicts
!pip install xmltodict


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.0.1[0m[39;49m -> [0m[32;49m24.3.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


In [3]:
import xmltodict

In [4]:
sitemap_dict = xmltodict.parse(sitemap_data)
sitemap_dict.get('url')  # returns None: the 'url' entries are nested under the top-level 'urlset' key

In [5]:
urls = [x.get('loc') for x in sitemap_dict.get('urlset').get('url')]
urls = [x for x in urls if "series-" in x] # based on pattern in url
urls[:5]

['https://bigbangtrans.wordpress.com/series-10-episode-24-the-long-distance-dissonance/',
 'https://bigbangtrans.wordpress.com/series-10-episode-23-the-gyroscopic-collapse/',
 'https://bigbangtrans.wordpress.com/series-10-episode-22-the-cognition-regeneration/',
 'https://bigbangtrans.wordpress.com/series-10-episode-21-the-separation-agitation/',
 'https://bigbangtrans.wordpress.com/series-10-episode-20-the-recollection-dissipation/']
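If installing `xmltodict` is not an option, the same list can probably be built with the standard library. This is a minimal sketch, assuming the sitemap uses the standard sitemaps.org namespace (not confirmed by the output above). The `episode_urls` helper and the two-entry sample, including its `/about/` URL, are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Assumption: the sitemap declares the standard sitemaps.org namespace.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def episode_urls(sitemap_xml: str) -> list[str]:
    """Return every <loc> URL that matches the episode URL pattern."""
    root = ET.fromstring(sitemap_xml)
    locs = [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]
    # Same filter as the notebook: episode pages contain "series-".
    return [url for url in locs if "series-" in url]


# Hypothetical two-entry sitemap, for illustration only.
sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://bigbangtrans.wordpress.com/series-1-episode-1-pilot-episode/</loc></url>
  <url><loc>https://bigbangtrans.wordpress.com/about/</loc></url>
</urlset>"""
print(episode_urls(sample))
```

If the real sitemap uses a different namespace, `findall` returns an empty list instead of raising an error, so it is worth checking the length of the result.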

In [6]:
# some swaggy OOP
from dataclasses import dataclass

@dataclass
class Transcript:
    season: int
    episode: int
    title: str
    link: str
    html_text: str = ""
    raw_text: str = ""

In [7]:
transcripts: list[Transcript] = []
for url in urls:
    scheme, _, site, path, _ = url.split("/")
    _, season, _, episode, title = path.split("-", maxsplit=4)
    transcript = Transcript(season=int(season), episode=int(episode), title=title, link=url)
    transcripts.append(transcript)
print(transcripts[:2])

[Transcript(season=10, episode=24, title='the-long-distance-dissonance', link='https://bigbangtrans.wordpress.com/series-10-episode-24-the-long-distance-dissonance/', html_text='', raw_text=''), Transcript(season=10, episode=23, title='the-gyroscopic-collapse', link='https://bigbangtrans.wordpress.com/series-10-episode-23-the-gyroscopic-collapse/', html_text='', raw_text='')]
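The positional `split` calls above rely on every URL having exactly the same shape. A slightly more defensive alternative is a regular expression that returns `None` when a URL does not match. This is only a sketch; `EPISODE_URL` and `parse_episode_url` are hypothetical names, not part of the notebook.

```python
import re
from typing import Optional

# series-<season>-episode-<episode>-<title slug>, optionally followed by "/"
EPISODE_URL = re.compile(r"/series-(\d+)-episode-(\d+)-([^/]+)/?$")


def parse_episode_url(url: str) -> Optional[tuple[int, int, str]]:
    """Return (season, episode, title_slug), or None if the URL does not match."""
    match = EPISODE_URL.search(url)
    if match is None:
        return None
    season, episode, slug = match.groups()
    return int(season), int(episode), slug


url = "https://bigbangtrans.wordpress.com/series-10-episode-24-the-long-distance-dissonance/"
print(parse_episode_url(url))
```

Returning `None` lets the caller decide whether to skip or log an unexpected URL, rather than failing on the tuple unpacking.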


In [8]:
# Download text
for transcript in transcripts:
    print(f"Downloading for {transcript.link}")
    res = requests.get(transcript.link)
    if res.status_code == 200:
        transcript.html_text = res.text

Downloading for https://bigbangtrans.wordpress.com/series-10-episode-24-the-long-distance-dissonance/
Downloading for https://bigbangtrans.wordpress.com/series-10-episode-23-the-gyroscopic-collapse/
Downloading for https://bigbangtrans.wordpress.com/series-10-episode-22-the-cognition-regeneration/
Downloading for https://bigbangtrans.wordpress.com/series-10-episode-21-the-separation-agitation/
Downloading for https://bigbangtrans.wordpress.com/series-10-episode-20-the-recollection-dissipation/
Downloading for https://bigbangtrans.wordpress.com/series-10-episode-19-the-collaboration-fluctuation/
Downloading for https://bigbangtrans.wordpress.com/series-10-episode-18-the-escape-hatch-identification/
Downloading for https://bigbangtrans.wordpress.com/series-10-episode-17-the-comic-con-conundrum/
Downloading for https://bigbangtrans.wordpress.com/series-10-episode-16-the-allowance-evaporation/
Downloading for https://bigbangtrans.wordpress.com/series-10-episode-15-the-locomotion-reverberat
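The download loop above silently skips any page that does not return status 200, so a failed download only shows up later as an empty `html_text`. Here is a hedged sketch of a variant that records failures and pauses between requests. `download_all` is a hypothetical helper; `get` is expected to be `requests.get` or anything with the same call shape, and the timeout and delay values are arbitrary choices.

```python
import time
from typing import Callable


def download_all(
    links: list[str], get: Callable, delay_seconds: float = 1.0
) -> tuple[dict[str, str], list[str]]:
    """Fetch each link; return ({link: html}, [links that failed])."""
    pages: dict[str, str] = {}
    failed: list[str] = []
    for link in links:
        try:
            res = get(link, timeout=10)
        except OSError:
            # requests' network exceptions derive from OSError (IOError).
            failed.append(link)
            continue
        if res.status_code == 200:
            pages[link] = res.text
        else:
            failed.append(link)
        time.sleep(delay_seconds)  # be polite to the blog's server
    return pages, failed


# Usage with the notebook's names (not executed here):
# pages, failed = download_all(urls, requests.get)
```

Anything left in `failed` can be retried later instead of ending up as an empty transcript.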

In [9]:
!pip install beautifulsoup4


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.0.1[0m[39;49m -> [0m[32;49m24.3.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


In [10]:
from bs4 import BeautifulSoup

In [15]:
for transcript in transcripts:
    print(f"Extracting text from HTML for {transcript.link}")
    soup = BeautifulSoup(transcript.html_text, "html.parser")  # name the parser explicitly to avoid bs4's parser-guessing warning
    transcript.raw_text = soup.select('#content')[0].text # Blog content is in div with id `content`

Extracting text from HTML for https://bigbangtrans.wordpress.com/series-10-episode-24-the-long-distance-dissonance/
Extracting text from HTML for https://bigbangtrans.wordpress.com/series-10-episode-23-the-gyroscopic-collapse/
Extracting text from HTML for https://bigbangtrans.wordpress.com/series-10-episode-22-the-cognition-regeneration/
Extracting text from HTML for https://bigbangtrans.wordpress.com/series-10-episode-21-the-separation-agitation/
Extracting text from HTML for https://bigbangtrans.wordpress.com/series-10-episode-20-the-recollection-dissipation/
Extracting text from HTML for https://bigbangtrans.wordpress.com/series-10-episode-19-the-collaboration-fluctuation/
Extracting text from HTML for https://bigbangtrans.wordpress.com/series-10-episode-18-the-escape-hatch-identification/
Extracting text from HTML for https://bigbangtrans.wordpress.com/series-10-episode-17-the-comic-con-conundrum/
Extracting text from HTML for https://bigbangtrans.wordpress.com/series-10-episode-1

# Parsing Dialogues

Now that I have the contents of each transcript, let's parse them into dialogues.

In [28]:
len(transcripts)

231

In [32]:
from typing import Optional


@dataclass
class Dialogue:
    speaker: str
    text: str
    transcript: Optional[Transcript]
    speaker_supporting_text: Optional[str] = ""

In [33]:
dialogues = []
for transcript in transcripts:
    lines = [x for x in transcript.raw_text.split("\n") if x]
    for line in lines:
        try:
            speaker, text = line.split(":", maxsplit=1)
            dialogue = Dialogue(speaker=speaker, text=text, transcript=transcript)
            dialogues.append(dialogue)
        except ValueError:
            # No ":" in the line, so it is not a dialogue; skip it
            pass

In [34]:
len(dialogues)

54652
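The split rule is simple: everything before the first colon is the speaker and everything after it is the text, so any line containing a colon counts as a dialogue. The sketch below shows that behaviour on a few made-up lines. `split_line` is a hypothetical helper and, unlike the cell above, it strips surrounding whitespace.

```python
from typing import Optional


def split_line(line: str) -> Optional[tuple[str, str]]:
    """Split 'speaker: text' on the first colon; return None if there is no speaker."""
    speaker, sep, text = line.partition(":")
    if not sep or not speaker.strip():
        return None
    return speaker.strip(), text.strip()


# Hypothetical lines, for illustration only.
samples = [
    "Sheldon: Hello.",              # ordinary dialogue
    "Scene: The apartment.",        # scene marker, same shape as dialogue
    "A line without any colon",     # skipped
    "Leonard: The time is 10:30.",  # only the first colon splits
]
for line in samples:
    print(split_line(line))
```

Because the rule is purely syntactic, non-dialogue lines that happen to contain a colon are also picked up and need to be filtered out afterwards.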

# Filtering Scenes
Scenes appear in the transcripts in the format `scene: Description`.
In our case, each one was parsed as a dialogue whose speaker is `scene`.

In [35]:
@dataclass
class Scene:
    description: str
    transcript: Optional[Transcript]

In [36]:
scenes = []
for dialogue in dialogues:
    if dialogue.speaker.lower() == 'scene':
        scene = Scene(description=dialogue.text, transcript=dialogue.transcript)
        scenes.append(scene)

In [37]:
dialogues = [x for x in dialogues if x.speaker.lower() != 'scene']

In [38]:
len(dialogues), len(scenes)

(51804, 2848)

# Check the speakers

In [39]:
speakers = list(set([x.speaker.lower() for x in dialogues]))
print(speakers[:10])

['sheldon (now in a sparkly green suit with rhinestones)', 'katee', 'leonard, howard and raj (singing)', 'stuart (off)', 'stephen hawking', 'penny (knocking on door and entering)', 'susan (penny’s mother)', 'raj (reaching the other side)', 'leonard (voice off)', 'penny (entering, carrying a laptop)']


Many speakers are parsed as `speaker_name (blah-blah)`.
Let's store the text in parentheses as `speaker_supporting_text` on the dialogue.

In [41]:
for i, dialogue in enumerate(dialogues):
    original_speaker = dialogue.speaker.lower().strip()
    if "(" in original_speaker:
        speaker, speaker_supporting_text = original_speaker.split("(", maxsplit=1)
        speaker = speaker.strip()
        speaker_supporting_text = speaker_supporting_text.replace(")","").replace("(","").strip()
    else:
        speaker = original_speaker
        speaker_supporting_text = ""
    dialogue.speaker = speaker
    dialogue.speaker_supporting_text = speaker_supporting_text

In [42]:
len(dialogues)

51804

In [43]:
speakers = list(set([x.speaker for x in dialogues]))
speakers[:6]

['katee',
 'stephen hawking',
 'mrs fowler',
 'doctor',
 'barber',
 'first car thief']
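The same clean-up can be written as a small pure function, which makes it easy to check against the labels printed earlier. This is a sketch rather than a replacement for the loop above; `split_speaker` is a hypothetical name.

```python
def split_speaker(label: str) -> tuple[str, str]:
    """Split 'name (note)' into (name, note); the note is '' when there are no parentheses."""
    name, _, note = label.lower().strip().partition("(")
    note = note.replace("(", "").replace(")", "").strip()
    return name.strip(), note


# Labels taken from the speaker list printed above.
for label in [
    "penny (entering, carrying a laptop)",
    "leonard, howard and raj (singing)",
    "katee",
]:
    print(split_speaker(label))
```

Note that group labels such as `leonard, howard and raj` stay as a single speaker; neither this sketch nor the loop above splits them into individual characters.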

In [44]:
from collections import defaultdict
dialogues_per_speaker = defaultdict(int)
for dialogue in dialogues:
    dialogues_per_speaker[dialogue.speaker] += 1

In [45]:
from collections import Counter
dialogues_per_speaker = Counter(dialogues_per_speaker)

In [46]:
dialogues_per_speaker

Counter({'sheldon': 11620,
         'leonard': 9713,
         'penny': 7677,
         'howard': 5853,
         'raj': 4669,
         'amy': 3472,
         'bernadette': 2684,
         'stuart': 733,
         'priya': 222,
         'share this': 221,
         'mrs cooper': 213,
         'emily': 164,
         'beverley': 162,
         'mrs wolowitz': 136,
         'zack': 135,
         'arthur': 130,
         'wil': 126,
         'leslie': 113,
         'kripke': 106,
         'man': 105,
         'bert': 95,
         'barry': 79,
         'lucy': 73,
         'steph': 73,
         'ramona': 71,
         'all': 71,
         'girl': 71,
         'past sheldon': 66,
         'past leonard': 64,
         'dr koothrappali': 63,
         'alex': 63,
         'mary': 61,
         'howard’s mother': 57,
         'lesley': 53,
         'dave': 52,
         'gablehouser': 50,
         'beverly': 48,
         'mike': 48,
         'dr hofstadter': 47,
         'missy': 47,
         'alfred': 46,
 

# Cleaning and Fixing speaker names

Based on the `dialogues_per_speaker` Counter, some speaker names are misspelled and some characters appear under more than one name.

Let's create a dictionary to fix them.

In [47]:
speaker_rename_dict = {
    'barry': 'kripke', # Both barry and kripke appear in the set; keep only kripke for consistency
    'past sheldon': 'sheldon',
    'past leonard': 'leonard',
    'mary': 'mrs cooper',
    'howard’s mother': 'mrs wolowitz',
    'lesley': 'leslie',
    'beverly': 'beverley',
    'wil wheaton': 'wil',
    'penny’s dad': 'wyatt',
    'stephen hawking': 'hawking',
    # I am considering only these, others have only <20 instances
}

In [48]:
for dialogue in dialogues:
    dialogue.speaker = speaker_rename_dict.get(dialogue.speaker, dialogue.speaker)
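After renaming, the per-speaker counts change because aliases collapse into one entry. A rough sketch of that merge, using a few counts copied from the Counter output above. `merge_aliases` and its `drop` parameter are hypothetical. Dropping `share this` assumes that label is page-widget text rather than a character, which the notebook does not confirm.

```python
from collections import Counter


def merge_aliases(counts, aliases, drop=frozenset()):
    """Fold alias counts into their canonical name and skip labels listed in drop."""
    merged = Counter()
    for speaker, n in counts.items():
        if speaker in drop:
            continue
        merged[aliases.get(speaker, speaker)] += n
    return merged


# Counts copied from the output above (before renaming).
raw_counts = Counter(
    {"kripke": 106, "barry": 79, "leslie": 113, "lesley": 53, "share this": 221}
)
aliases = {"barry": "kripke", "lesley": "leslie"}
merged = merge_aliases(raw_counts, aliases, drop={"share this"})
print(merged.most_common())
```

In the notebook itself, re-running `Counter(d.speaker for d in dialogues)` after the rename loop should give the merged totals directly.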

# Big Question: How many times does Sheldon say the word `penny`?

In [49]:
penny_count = 0
for dialogue in dialogues:
    if dialogue.speaker == 'sheldon':
        penny_count += dialogue.text.lower().count('penny')


In [52]:
penny_count

543
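One caveat: `str.count('penny')` counts substrings, so a word that merely contains "penny" is included as well. A word-boundary regex is a possible refinement. The sketch below uses made-up lines, and `count_word` is a hypothetical helper. Neither approach can tell the name from the coin.

```python
import re


def count_word(texts, word):
    """Count case-insensitive whole-word occurrences of word across texts."""
    pattern = re.compile(rf"\b{re.escape(word)}\b", re.IGNORECASE)
    return sum(len(pattern.findall(text)) for text in texts)


# Hypothetical lines, for illustration only.
lines = ["Penny. Penny. Penny.", "Penny's here.", "Not a pennyworth."]

substring_count = sum(text.lower().count("penny") for text in lines)
whole_word_count = count_word(lines, "penny")
print(substring_count, whole_word_count)
```

With the notebook's data, this would be called as `count_word((d.text for d in dialogues if d.speaker == 'sheldon'), 'penny')`; whether the result differs from 543 depends on the transcripts.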