# "You think that's funny?" Topic modelling and text generation using Amazon and Netflix stand-up comedy scripts

# 1) Construction of the dataset

In this section, I build the dataset used in this project. To do that, I combine the list of stand-up comedy specials and their release dates extracted from Wikipedia (the `List_stand_up_comedy_full.csv` file) with the text of the shows extracted from a series of `.srt` files (i.e., subtitle scripts) retrieved from the crowdsourced community Subscene.com.

The final dataframe is then saved in the file `Stand_up_comedy_dataset.csv`.

The final dataset for the analysis contains the following features:

| Column name | Datatype | Definition |
| :- | :- | :- |
| `Title` | object | Title of the stand-up comedy special. |
| `Producer` | object | Platform that produced and released the show. |
| `Comedian` | object | Name and surname of the comedian. |
| `Gender` | object | Gender of the comedian. |
| `Release date` | object | Release date of the show. |
| `Original language` | object | Original language of the audio of the show.|
| `Text` | object | Full text/transcript of the show. |
| `Len_Hours` | int64 | Number of runtime hours. |
| `Len_Minutes` | int64 | Number of runtime minutes. |
| `Len_Seconds` | int64 | Number of runtime seconds. |
| `File_name` | object | Name of the .srt file (subtitles/transcript) in the working folder.|


The sources used to construct this dataset are Wikipedia (i.e., [Amazon stand-up comedy specials list](https://en.wikipedia.org/wiki/List_of_Amazon_original_programming#Stand-up_comedy_specials) and [Netflix stand up comedy list](https://en.wikipedia.org/wiki/List_of_Netflix_original_stand-up_comedy_specials)) for the first six features listed above. The remaining five features were instead extracted from `.srt` files (subtitle files) downloaded from the website [Subscene.com](https://subscene.com). Subscene is a crowdsourced community of translators that hosts thousands of transcripts produced by its members. Each `.srt` file contains the English transcript of the show's audio, together with the start and end time of each subtitle line.

Note that the dataset includes only the stand-up comedy shows for which an English transcript (`.srt` file) was available on Subscene. Specials that do not meet this criterion were not included in the dataset.

In [1]:
import pysrt # Necessary to read .srt files
from tqdm import tqdm # Progress bar for loop
import os
import re
import string
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

In [2]:
# Import list of stand up shows from .csv file (source: Wikipedia + manual integrations/adjustments)
df = pd.read_csv('List_stand_up_comedy_full.csv')
print('Total number of stand-up comedy shows:', df.shape[0])
df.head()

Total number of stand-up comedy shows: 143


|  | Title | Release date | Original language | Producer | Gender | File_name |
| :- | :- | :- | :- | :- | :- | :- |
| 0 | Adam DeVine: Best Time of Our Lives | June 18, 2019 | English | Netflix | M | Adam.Devine.Best.Time.of.Our.Lives.2019.720p.W... |
| 1 | Adam Sandler: 100% Fresh | October 23, 2018 | English | Netflix | M | Adam.Sandler.100.Percent.Fresh.2018.WEBRip.x26... |
| 2 | Adel Karam: Live From Beirut | March 1, 2018 | Arabic | Netflix | M | Adel_Karam_Live_From_Beirut_En_en |
| 3 | Afonso Padilha: Classless | September 3, 2020 | Portuguese | Netflix | M | Afonso.Padilha.Classless.2020 |
| 4 | Alex Fernández: The Best Comedian in the World | January 23, 2020 | Spanish | Netflix | M | Alex.Fernández.The.Best.Comedian.in.the.World.... |


In [3]:
# Example .srt file
example = pysrt.open('Stand_up_specials_subs/Ali.Wong.Baby.Cobra.2016.720p.WEBRip.x264-JAWN.srt')
print('Starting time of sentence in position 0:', example[0].start)
print('Ending time of sentence in position 0:', example[0].end)
print('\nSentence in position 0:\n\n', example[0].text)

Starting time of sentence in position 0: 00:00:05,589
Ending time of sentence in position 0: 00:00:09,885

Sentence in position 0:

 [male announcer] <i>Ladies and gentlemen,</i>
<i>please welcome to the stage: Ali Wong!</i>
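
To make the structure of a subtitle cue concrete, the sketch below rebuilds the cue printed above as raw `.srt` text and parses it with the standard library only. It is an illustration of the file format, not of how `pysrt` works internally; the cue index `1` and the helper name `parse_cue` are assumptions introduced for this example.

```python
import re

# Illustrative cue rebuilt from the output above; the index "1" is assumed.
raw_cue = """1
00:00:05,589 --> 00:00:09,885
[male announcer] <i>Ladies and gentlemen,</i>
<i>please welcome to the stage: Ali Wong!</i>
"""

# Timing line: HH:MM:SS,mmm --> HH:MM:SS,mmm
TIMING = re.compile(
    r"(\d{2}):(\d{2}):(\d{2}),(\d{3}) --> (\d{2}):(\d{2}):(\d{2}),(\d{3})"
)


def parse_cue(block):
    """Split one .srt cue into (index, start_ms, end_ms, text)."""
    lines = block.strip().split("\n")
    index = int(lines[0])
    match = TIMING.fullmatch(lines[1].strip())
    if match is None:
        raise ValueError("unexpected timing line: " + lines[1])
    h1, m1, s1, ms1, h2, m2, s2, ms2 = (int(g) for g in match.groups())
    start_ms = ((h1 * 60 + m1) * 60 + s1) * 1000 + ms1
    end_ms = ((h2 * 60 + m2) * 60 + s2) * 1000 + ms2
    text = "\n".join(lines[2:])
    return index, start_ms, end_ms, text


index, start_ms, end_ms, text = parse_cue(raw_cue)
print("Cue", index, "lasts", end_ms - start_ms, "ms")
```

The text part still carries the `<i>` tags and the bracketed audio description, which is why a cleaning step is needed before analysis.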


## 1.1) Extract text data from .srt files and create dataset for analysis

In [4]:
# Define cleaning function for .srt text files

def clean_srt(text):
    text = text.encode("ascii", "ignore") # Remove non-ASCII characters
    text = text.decode()
    text = text.replace('\n', ' ') # Remove new row escape
    text = re.sub(r'<[^>]+>', '', text) # Eliminate text within < > characters (often containing info on subs font/color)
    text = re.sub(r'\[[^]]+\]', '', text) # Eliminate text within square brackets (often containing audio description for hearing-impaired viewers)
    text = re.sub(r'\([^)]+\)', '', text) # Eliminate text within parentheses (often containing audio description for hearing-impaired viewers)
    text = text.replace('-', '') # Remove dialogue dashes (note: this also removes hyphens inside words)
    text = ' '.join(text.split()) # Collapse repeated whitespace into single spaces
    return text

In [5]:
# Clean_srt function test on example file
clean_srt(example[0].text)

'Ladies and gentlemen, please welcome to the stage: Ali Wong!'
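
Running the cleaning steps on a few more inputs helps to see their side effects. The sketch below repeats the same steps as `clean_srt` so that it runs on its own; the sample strings are made up for illustration and are not taken from the subtitle files.

```python
import re


def clean_srt(text):
    # Same steps as the cleaning function defined above
    text = text.encode("ascii", "ignore").decode()  # drop non-ASCII characters
    text = text.replace('\n', ' ')                  # join lines
    text = re.sub(r'<[^>]+>', '', text)             # drop <...> tags
    text = re.sub(r'\[[^]]+\]', '', text)           # drop [...] descriptions
    text = re.sub(r'\([^)]+\)', '', text)           # drop (...) descriptions
    text = text.replace('-', '')                    # drop every hyphen
    text = ' '.join(text.split())                   # collapse whitespace
    return text


# Made-up samples, one per behaviour worth checking
samples = {
    "dialogue dash": "- Hello.\n- Hi there.",
    "hyphenated word": "She is a well-known comic.",
    "accented name": "ALEX FERNÁNDEZ",
    "audio description": "(audience laughing) Thank you!",
}

results = {label: clean_srt(raw) for label, raw in samples.items()}
for label, cleaned in results.items():
    print(f"{label}: {cleaned!r}")
```

Two side effects are worth keeping in mind for the later analysis: hyphenated words are fused (`well-known` becomes `wellknown`), and accented letters are dropped rather than transliterated (`FERNÁNDEZ` becomes `FERNNDEZ`).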

In [6]:
# Extract text data from .srt files and add to the dataset

main_dir = 'Stand_up_specials_subs'

# Get file names from folders
file_path = [os.path.join(root,f) for root,dirs,files in os.walk(main_dir) for f in files]

text_file = []
file_names = []
show_length = []


for file in tqdm(file_path):
    
    # Get file name to use as common column to merge with df
    file_name = os.path.basename(file)
    file_names.append(file_name)
    
    # Open .srt file
    subs = pysrt.open(file, encoding='iso-8859-1')
    
    # Extract text content and append to list
    text = subs.text
    text = clean_srt(text)
    text_file.append(text)
    
    # Extract length (end time of the last subtitle)
    length = subs[-1].end
    show_length.append(length)


# Transform the lists into dataframes and rename columns
# (each end time expands into hours, minutes, seconds and milliseconds)
df_fn = pd.DataFrame(file_names)
df_len = pd.DataFrame(show_length)
df_txt = pd.DataFrame(text_file)
extract_df = pd.concat([df_fn, df_len, df_txt], ignore_index=True, axis=1).rename(
    columns={0: 'File_name', 1: 'Len_Hours', 2: 'Len_Minutes', 3: 'Len_Seconds',
             4: 'Len_Milliseconds', 5: 'Text'})

print('Number of stand-up comedy shows with text:', extract_df.shape[0])
extract_df.head()

100%|████████████████████████████████████████████████████████████████████████████████| 143/143 [00:06<00:00, 22.25it/s]

Number of stand-up comedy shows with text: 143





|  | File_name | Len_Hours | Len_Minutes | Len_Seconds | Len_Milliseconds | Text |
| :- | :- | :- | :- | :- | :- | :- |
| 0 | Adam.Devine.Best.Time.of.Our.Lives.2019.720p.W... | 0 | 57 | 44 | 711 | Hey, man. How are you? Thank you. Let's do thi... |
| 1 | Adam.Sandler.100.Percent.Fresh.2018.WEBRip.x26... | 1 | 13 | 38 | 580 | Okay, ready, and... Take your own cue, Adam. A... |
| 2 | Adel_Karam_Live_From_Beirut_En_en.srt | 0 | 57 | 23 | 600 | A NETFLIX COMEDY SPECIAL CASINO LEBANON Hello.... |
| 3 | Afonso.Padilha.Classless.2020.srt | 1 | 2 | 24 | 699 | I'm so happy to be recording this for Netflix,... |
| 4 | Alex.Fernández.The.Best.Comedian.in.the.World.... | 0 | 51 | 0 | 432 | A NETFLIX ORIGINAL COMEDY SPECIAL ALEX FERNNDE... |
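
The runtime is stored in separate hour, minute, second and millisecond columns, taken from the end time of the last subtitle. For analysis it may be handier as a single number. The sketch below shows one way to convert it to minutes; the helper name `runtime_minutes` is an assumption for this example, and the two input rows are copied from the table above.

```python
def runtime_minutes(hours, minutes, seconds, milliseconds=0):
    """Convert separate time components into a total number of minutes."""
    total_seconds = hours * 3600 + minutes * 60 + seconds + milliseconds / 1000
    return total_seconds / 60


# (Len_Hours, Len_Minutes, Len_Seconds, Len_Milliseconds) for rows 0 and 1
rows = [(0, 57, 44, 711), (1, 13, 38, 580)]

for row in rows:
    print(row, "->", round(runtime_minutes(*row), 2), "minutes")
```

The same arithmetic can be applied column-wise to the dataframe, since it only uses addition, multiplication and division.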


## 1.2) Merge dataframe with extracted text data and stand-up shows dataset

In [7]:
# Remove file extension from file name for merging
extract_df['File_name'] = extract_df['File_name'].str.replace('.srt', '', regex=False)

# Merge the two dataframes on File_name to obtain the final dataset
df_final = pd.merge(df, extract_df, how='left', on='File_name')

# Generate comedian's name column
df_final['Comedian'] =  df_final['Title'].str.split(':', expand=True)[0]

# Reorder columns, drop milliseconds column and print basic info
df_final = df_final[['Title', 'Producer', 'Comedian', 'Gender', 'Release date', 'Original language',
                     'Text', 'Len_Hours', 'Len_Minutes', 'Len_Seconds', 'File_name']]

print('Final dataset shape:\n', df_final.shape)
df_final.info()
df_final.head()

Final dataset shape:
 (143, 11)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 143 entries, 0 to 142
Data columns (total 11 columns):
Title                143 non-null object
Producer             143 non-null object
Comedian             143 non-null object
Gender               143 non-null object
Release date         143 non-null object
Original language    143 non-null object
Text                 143 non-null object
Len_Hours            143 non-null int64
Len_Minutes          143 non-null int64
Len_Seconds          143 non-null int64
File_name            143 non-null object
dtypes: int64(3), object(8)
memory usage: 13.4+ KB


|  | Title | Producer | Comedian | Gender | Release date | Original language | Text | Len_Hours | Len_Minutes | Len_Seconds | File_name |
| :- | :- | :- | :- | :- | :- | :- | :- | :- | :- | :- | :- |
| 0 | Adam DeVine: Best Time of Our Lives | Netflix | Adam DeVine | M | June 18, 2019 | English | Hey, man. How are you? Thank you. Let's do thi... | 0 | 57 | 44 | Adam.Devine.Best.Time.of.Our.Lives.2019.720p.W... |
| 1 | Adam Sandler: 100% Fresh | Netflix | Adam Sandler | M | October 23, 2018 | English | Okay, ready, and... Take your own cue, Adam. A... | 1 | 13 | 38 | Adam.Sandler.100.Percent.Fresh.2018.WEBRip.x26... |
| 2 | Adel Karam: Live From Beirut | Netflix | Adel Karam | M | March 1, 2018 | Arabic | A NETFLIX COMEDY SPECIAL CASINO LEBANON Hello.... | 0 | 57 | 23 | Adel_Karam_Live_From_Beirut_En_en |
| 3 | Afonso Padilha: Classless | Netflix | Afonso Padilha | M | September 3, 2020 | Portuguese | I'm so happy to be recording this for Netflix,... | 1 | 2 | 24 | Afonso.Padilha.Classless.2020 |
| 4 | Alex Fernández: The Best Comedian in the World | Netflix | Alex Fernández | M | January 23, 2020 | Spanish | A NETFLIX ORIGINAL COMEDY SPECIAL ALEX FERNNDE... | 0 | 51 | 0 | Alex.Fernández.The.Best.Comedian.in.the.World.... |


In [8]:
# Save to .csv file
df_final.to_csv('Stand_up_comedy_dataset.csv', index=False)
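
The `Release date` column is saved as text in the form `June 18, 2019`. If later steps need real dates, the values can be parsed as sketched below. This is a minimal example using the standard library; the helper name `parse_release_date` is an assumption, and the `%B` directive expects English month names, so it depends on the active locale.

```python
from datetime import datetime


def parse_release_date(value):
    """Parse dates written like 'June 18, 2019' (English month names assumed)."""
    return datetime.strptime(value, "%B %d, %Y").date()


# Two values copied from the Release date column shown earlier
for raw in ["June 18, 2019", "March 1, 2018"]:
    print(raw, "->", parse_release_date(raw).isoformat())
```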