---
title: "Scraping Netflix Subtitle Data: Small Example"  
subtitle: Getting at Netflix Subtitles and Exploring a Dave Chappelle joke setup
summary: Getting my hands on some of that Stove Top stuffing
date: 2020-10-29  
categories:  
  - Python  
tags:  
  - pandas  
  - beautifulsoup
slug: "netflix-subtitle-small-example"  
image:
  caption: ''
  focal_point: ''
  preview_only: yes
links:
  donate_button:
    icon: seedling
    icon_pack: fas
    name: Ways to Support
    url: /support/

---

In [1]:
from bs4 import BeautifulSoup
import pandas as pd

pd.set_option('display.max_colwidth', 300)
pd.set_option('display.max_rows', 300)

from collections import Counter
import re

<!-- Icon Image: Small -->
<img src="featured.png" width="100"/> 

This is a small example of parsing Netflix subtitle data. Here we look at Dave Chappelle's comedy special *Equanimity* and the story he uses to set up a punchline he alludes to at the start of the show.

**Note About Getting Subtitles From Netflix**: I looked at [this GitHub repo](https://github.com/isaacbernat/netflix-to-srt) for a bit, which helped me find the subtitle file, but parsing it wasn't working. So I saved the file as XML and parsed it myself with Python.

1. Find the episode you want subtitles for
2. Open Developer Tools > Network and look for the file whose name starts with `?o=`
3. Save it as an `.xml` file

In [2]:
def make_netflix_subtitle_df(netflix_xml_file):
    '''
    Parse a Netflix subtitle XML file into a dataframe
    with one row per subtitle: begin, end, text.
    '''

    # read the whole file
    with open(netflix_xml_file, 'r') as f:
        contents = f.read()

    # replace <br/> tags with a space, important to get a full string per timestamp
    contents = contents.replace('<br/>', ' ')

    # beautiful soup
    soup = BeautifulSoup(contents, 'html.parser')

    # each <p> tag is one subtitle: copy its attributes and add its text
    rows = []
    for subtitle in soup.find_all('p'):
        row = dict(subtitle.attrs)
        row['text'] = subtitle.text
        rows.append(row)

    # dataframe
    df = pd.DataFrame(rows)[['begin', 'end', 'text']]

    # timestamps look like '63813750t': drop the 't' and convert to numbers
    df['begin'] = pd.to_numeric(df.begin.str.replace('t', ''))
    df['end'] = pd.to_numeric(df.end.str.replace('t', ''))

    return df
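
To make the file shape concrete, here is a minimal sketch run on a made-up snippet rather than the real Netflix file. The sample markup and the `parse_subtitle_rows` helper are assumptions for illustration only. It also swaps BeautifulSoup for the standard library's `xml.etree.ElementTree`, so it runs without extra installs.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: a guess at the general shape of the saved file, not real Netflix data
SAMPLE_XML = """<tt xmlns="http://www.w3.org/ns/ttml">
  <body>
    <div>
      <p begin="20000000t" end="35000000t">First line<br/>second line</p>
      <p begin="40000000t" end="50000000t">Another subtitle</p>
    </div>
  </body>
</tt>"""


def parse_subtitle_rows(xml_text):
    """Return one dict per <p> element with numeric begin/end ticks and the text."""
    # same trick as above: turn line breaks into spaces before parsing
    xml_text = xml_text.replace('<br/>', ' ')
    root = ET.fromstring(xml_text)

    rows = []
    for elem in root.iter():
        # tags come back as '{namespace}p', so compare only the local name
        if elem.tag.rsplit('}', 1)[-1] != 'p':
            continue
        rows.append({
            'begin': int(elem.attrib['begin'].rstrip('t')),
            'end': int(elem.attrib['end'].rstrip('t')),
            'text': ''.join(elem.itertext()),
        })
    return rows


rows = parse_subtitle_rows(SAMPLE_XML)
for row in rows:
    print(row)
```

Passing `rows` to `pd.DataFrame` would give the same three columns as the function above.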

## Dave Chappelle Fishbowl Story
- How many words before he wraps back around to the original punchline?
- How long is that gap?

In [3]:
df = make_netflix_subtitle_df('chappelle-equanimity.xml')

In [4]:
df.head()

| | begin | end | text |
|---|---|---|---|
| 0 | 63813750 | 72572500 | ["Killing Me Softly with His Song" playing] |
| 1 | 73406667 | 100934167 | [woman vocalizes] |
| 2 | 111778334 | 149732917 | ♪ I heard he sang a good song ♪ |
| 3 | 150567084 | 191024167 | ♪ I heard he had a style ♪ -[camera shutter clicks] |
| 4 | 191858334 | 241074167 | ♪ And so I came to see him To listen for a while ♪ |


In [5]:
interim_story = df.loc[57:146]

In [6]:
interim_story.head()

| | begin | end | text |
|---|---|---|---|
| 57 | 2051215834 | 2064145417 | You know what's weird? |
| 58 | 2072487084 | 2090421667 | I've always been this talented. |
| 59 | 2105853750 | 2123788334 | I can't remember a time when I wasn't. |
| 60 | 2124622500 | 2159240417 | You know, when I was growing up, I was probably about eight years old, |
| 61 | 2160074584 | 2188436250 | and at the time, we were living in Silver Spring. |


### How Many Words?

In [7]:
story_text = interim_story.text.str.cat(sep = ' ')

In [8]:
text = story_text

text = text.lower()

text = re.sub(r"[^\w ]", "", text)

text

'you know whats weird ive always been this talented i cant remember a time when i wasnt you know when i was growing up i was probably about eight years old and at the time we were living in silver spring yeah yes common misconception about me and dc a lot of people think im from the hood thats not true but i never bothered to correct anybody because i wanted the streets to embrace me as a matter of fact i kept it up as a ruse like sometimes ill hang out with rappers like nas and them and these motherfuckers start talking about the projects yo it was wild in the pjs yo and ill be like word nigga word but i dont know i have no idea my parents did just well enough so that i could grow up poor around white people to be honest whennas and them talk about the projects nigga i used to get jealous because it sounded fun everybody in the projects was poor and thats fair but if you were poor in silver spring nigga it felt like it was only happening to you nas does not know the pain of that first

In [9]:
words = text.split(' ')
word_count = Counter(words)

In [10]:
sum(word_count.values())

712
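
As a small worked example of the same cleanup and count, here is a sketch on a short made-up sentence. The `count_words` helper and the sample sentence are hypothetical, not part of the notebook. It uses `split()` with no argument, which ignores runs of spaces, so its total can differ slightly from `split(' ')` when the text contains double spaces.

```python
import re
from collections import Counter


def count_words(text):
    """Lowercase, strip punctuation, and count whitespace-separated words."""
    # keep only word characters and spaces, so "what's" becomes "whats"
    cleaned = re.sub(r"[^\w ]", "", text.lower())
    # split() with no argument skips empty strings caused by repeated spaces
    words = cleaned.split()
    return len(words), Counter(words)


total, counts = count_words("You know what's weird? You know, I do.")
print(total)                  # 8
print(counts.most_common(2))  # [('you', 2), ('know', 2)]
```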

### How Long of a Gap?

In [17]:
interim_story.head()

| | begin | end | text |
|---|---|---|---|
| 57 | 2051215834 | 2064145417 | You know what's weird? |
| 58 | 2072487084 | 2090421667 | I've always been this talented. |
| 59 | 2105853750 | 2123788334 | I can't remember a time when I wasn't. |
| 60 | 2124622500 | 2159240417 | You know, when I was growing up, I was probably about eight years old, |
| 61 | 2160074584 | 2188436250 | and at the time, we were living in Silver Spring. |


In [12]:
start = interim_story.iat[0, 0]
end = interim_story.iat[-1, 1]

In [13]:
duration = end - start

The timestamps are in ticks, with 10^7 ticks per second, so dividing the duration by 10^7 gives seconds.

In [14]:
minutes_full = duration/(10 ** 7)/60
minutes_full

5.243432638333333

In [15]:
minutes = int(minutes_full)
seconds_frac = minutes_full % 1
seconds = int(seconds_frac * 60)

In [16]:
print(f"Story lasts for {minutes} minutes, {seconds} seconds")

Story lasts for 5 minutes, 14 seconds
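
The same conversion can also be sketched with integer arithmetic, which avoids the fractional-minute step. This assumes the 10^7 ticks per second used above. `ticks_to_min_sec` is a hypothetical helper, and the tick value below is an arbitrary example, not a number from the special.

```python
TICKS_PER_SECOND = 10 ** 7  # assumption carried over from the division above


def ticks_to_min_sec(ticks):
    """Convert a tick count to whole (minutes, seconds), dropping any leftover fraction."""
    total_seconds = ticks // TICKS_PER_SECOND
    return divmod(total_seconds, 60)


minutes, seconds = ticks_to_min_sec(1_250_000_000)
print(f"{minutes} minutes, {seconds} seconds")  # 2 minutes, 5 seconds
```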


## Conclusion

- Example of parsing Netflix subtitles. The data includes begin and end timestamps for each subtitle and the associated text.

### What's Useful for Later:
- Function for parsing Netflix `.xml` subtitle files

### Going Further
- I could see attempting to recreate work I have done previously on the use of [swear words in comedy specials](https://zachbogart.com/project/comedy/)

Till next time!

![](https://media.giphy.com/media/42D3CxaINsAFemFuId/giphy.gif)

#### Image Credit
[comedy mask](https://thenounproject.com/search/?creator=4129988&q=comedy&i=3169849) by Zach Bogart from [the Noun Project](https://thenounproject.com/) 