Importing Data

In [1]:
import requests
from bs4 import BeautifulSoup
import pickle

In [2]:
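def url_to_transcript(url, name):
    """Download a transcript page and save its paragraph text to transcripts/<name>.txt."""
    response = requests.get(url)
    response.raise_for_status()  # fail on HTTP errors instead of saving an error page
    soup = BeautifulSoup(response.text, 'html.parser')
    paragraphs = soup.find_all('p')
    # The transcripts/ directory must already exist.
    with open("transcripts/" + name + ".txt", 'w', encoding="utf-8") as f:
        for p in paragraphs:
            f.write(p.get_text())
            f.write("\n")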

In [3]:
urls = ['https://scrapsfromtheloft.com/comedy/nate-bargatze-greatest-average-american-transcript/',
        'https://scrapsfromtheloft.com/comedy/chris-rock-tamborine-transcript/',
        'https://scrapsfromtheloft.com/comedy/vir-das-losing-it-transcript/',
        'https://scrapsfromtheloft.com/comedy/kevin-hart-irresponsible-transcript/',
        'https://scrapsfromtheloft.com/comedy/russell-peters-deported-transcript/',
        'https://scrapsfromtheloft.com/comedy/aziz-ansari-right-now-transcript/',
        'https://scrapsfromtheloft.com/comedy/whitney-cummings-can-i-touch-it-transcript/',
        'https://scrapsfromtheloft.com/comedy/norm-macdonald-hitlers-dog-gossip-trickery-2017-full-transcript/',
        'https://scrapsfromtheloft.com/comedy/kenny-sebastian-dont-be-that-guy-transcript/',
        'https://scrapsfromtheloft.com/comedy/hannah-gadsby-douglas-transcript/'
       ]
comedians = ['nate', 'chris', 'vir', 'kevin', 'russell', 'aziz', 'whitney', 'norm', 'kenny', 'hannah']

In [5]:
# Import data from the URLs using Beautiful Soup
for url, comedian in zip(urls, comedians):
    url_to_transcript(url, comedian)

In [6]:
data = {}
for c in comedians:
    with open("transcripts/" + c + ".txt", "r", encoding="utf-8") as file:
        data[c] = file.readlines()

In [7]:
def combine_text(list_of_text):
    combined_text = ' '.join(list_of_text)
    return combined_text

In [8]:
data_combined = {key: [combine_text(value)] for (key, value) in data.items()}

In [9]:
data_combined['aziz']

["♪ Sometimes I feel so happy ♪\n ♪ Sometimes I feel so sad ♪\n ♪ Sometimes I feel so happy ♪\n ♪ But mostly you just make me mad ♪\n ♪ Baby, you just make me mad ♪\n ♪ Linger on ♪ ♪ Your pale blue eyes ♪\n ♪ Linger on ♪\n Aziz Ansari’s Right Now! Aziz Ansari!\n ♪ Thought of you as my mountaintop ♪\n ♪ Thought of you as my peak ♪\n Thought of you as everything I’ve had but couldn’t keep I’ve had but couldn’t keep\n Thank you. Thank you very much! Thank you. Thanks. I appreciate that. Thank you so much. Take a seat. Take a seat. Thanks so much. Wow. What a nice welcome. Wow, wow, wow. Very excit… By the way, this guy’s with me. He’s, uh… he’s authorized. He’s not, like, a very audacious bootlegger who really doesn’t give a fuck. “You said no phones, but what about full-on cameras?” Uh… Yeah, we’re filming these shows, so, you know, you might be in the show, uh, when it’s on, whatever I put it on. You’ll be like, “Oh, shit. I was there!” But we’re filming a few shows, so if you’re, like,

In [10]:
# Generate the corpus: one row per comedian, one transcript column
import pandas as pd
pd.set_option('display.max_colwidth', 150)

data_df = pd.DataFrame.from_dict(data_combined).transpose()
data_df.columns = ['transcript']
data_df = data_df.sort_index()
data_df

Unnamed: 0,transcript
aziz,"♪ Sometimes I feel so happy ♪\n ♪ Sometimes I feel so sad ♪\n ♪ Sometimes I feel so happy ♪\n ♪ But mostly you just make me mad ♪\n ♪ Baby, you ju..."
chris,"[indistinct overlapping chatter]\n [woman] Ladies and gentlemen, Chris Rock.\n [audience cheers and applauds]\n Yeah. Please. Oh, sit down. Sit yo..."
hannah,"The following is the transcript of Hannah Gadbsy: Douglas. In her second Netflix special, named after her dog, Gadsby explores how autism affects ..."
kenny,Make some noise for Kenny Sebastian.\n Oh my God. Thank you so much Mumbai. Thank you. Thank you. Really. How are you guys doing? Oh shit. Let’s d...
kevin,"[heartbeat]\n [indistinct chatter]\n [atmospheric whooshing]\n [audience cheering] It’s showtime, honey?\n Babe, I’m gone.\n [woman] Coming.\n Alr..."
nate,[folk rock music playing]\n ♪ Family ♪\n ♪ Singin’ in the kitchen ♪\n ♪ Family ♪\n ♪ Runnin’ through the yard… ♪\n ♪ Family ♪\n ♪ Goin’ on vacatio...
norm,"Then people go, “Goddamn, at least he’s not a hypocrite.” “You’ve got to give it to him, that’s the worst part of it.” All right. I ate a pork cho..."
russell,"[TYPING]\n [CHEERING]\n NARRATOR: Ladies and gentlemen, it’s start time at the Dome NSCI SVP Stadium. And right about now, we’re going to bring yo..."
vir,"I lost 80% of my mind. It’s very freeing. You should see the look on your faces right now, by the way. Oh! Good evening, San Francisco. Are you gu..."
whitney,"Ladies and gentlemen… Whitney Cummings!\n This is awesome. I am shooting my fourth stand-up special this evening in my hometown, Washington DC. Th..."


In [11]:
data_df.transcript.loc['norm']

"Then people go, “Goddamn, at least he’s not a hypocrite.” “You’ve got to give it to him, that’s the worst part of it.” All right. I ate a pork chop. I don’t want to brag or anything like that. But it’s in my belly right now as we speak. And I realized that you… you eat at a restaurant different than you eat at home, you know? Like, at home you would never cook up a pork chop on your skillet, you know, and make it nice and hot on one side, then turn it over, make it hot on the other side, and then cut into it and see how it’s going in the middle. And then you go, “Man, I’m going to love eating this delicious pork chop.” As soon as it’s hot enough to eat, I’ll eat it. But while I’m waiting, “I’m going to eat a big loaf of bread.” Who would do that? “With, like, 35 pats of butter, and I’ll eat that loaf of bread.” “And that will get my appetite sharpened up…” “For the pork.”\n I also noticed that desserts are different nowadays. When I was young, the waiter would come and go, “What do yo

Data Cleaning

In [12]:
import re
import string

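def clean_text_round1(text):
    """Lowercase, then remove bracketed text, ASCII punctuation, and words containing digits."""
    text = text.lower()
    text = re.sub(r'\[.*?\]', '', text)  # stage directions such as [audience cheers]
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    text = re.sub(r'\w*\d\w*', '', text)  # any token that contains a digit
    return text

round1 = clean_text_round1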
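
To make the effect of each substitution in round 1 visible, here is a small step-by-step sketch on a made-up sentence. The `sample` string is an assumption for illustration only, not a line from the transcripts.

```python
import re
import string

# Hypothetical sample text, for illustration only.
sample = "[audience cheers] I lost 80% of my mind. It's VERY freeing!"

step1 = sample.lower()
step2 = re.sub(r'\[.*?\]', '', step1)  # drop bracketed stage directions
step3 = re.sub('[%s]' % re.escape(string.punctuation), '', step2)  # drop ASCII punctuation
step4 = re.sub(r'\w*\d\w*', '', step3)  # drop tokens that contain a digit

for label, value in [("lower", step1), ("brackets", step2),
                     ("punctuation", step3), ("digits", step4)]:
    print(f"{label:12} {value!r}")

# Removed pieces leave their surrounding spaces behind, so compare word lists.
print(step4.split())
```

Note that `string.punctuation` only covers ASCII characters, so curly quotes and the ellipsis character survive this round. That is why a second round follows.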

In [13]:
data_clean = pd.DataFrame(data_df.transcript.apply(round1))
data_clean

Unnamed: 0,transcript
aziz,♪ sometimes i feel so happy ♪\n ♪ sometimes i feel so sad ♪\n ♪ sometimes i feel so happy ♪\n ♪ but mostly you just make me mad ♪\n ♪ baby you jus...
chris,\n ladies and gentlemen chris rock\n \n yeah please oh sit down sit yo asses down please let me get on with the show it’s nice to be here brookly...
hannah,the following is the transcript of hannah gadbsy douglas in her second netflix special named after her dog gadsby explores how autism affects her ...
kenny,make some noise for kenny sebastian\n oh my god thank you so much mumbai thank you thank you really how are you guys doing oh shit let’s do this c...
kevin,\n \n \n it’s showtime honey\n babe i’m gone\n coming\n alright\n see you later love you showtime baby let’s go show time bro i’ll see you on ...
nate,\n ♪ family ♪\n ♪ singin’ in the kitchen ♪\n ♪ family ♪\n ♪ runnin’ through the yard… ♪\n ♪ family ♪\n ♪ goin’ on vacation ♪\n ♪ family ♪\n ♪ on a...
norm,then people go “goddamn at least he’s not a hypocrite” “you’ve got to give it to him that’s the worst part of it” all right i ate a pork chop i do...
russell,\n \n narrator ladies and gentlemen it’s start time at the dome nsci svp stadium and right about now we’re going to bring you the brother that gav...
vir,i lost of my mind it’s very freeing you should see the look on your faces right now by the way oh good evening san francisco are you guys excited...
whitney,ladies and gentlemen… whitney cummings\n this is awesome i am shooting my fourth standup special this evening in my hometown washington dc thank y...


Data Cleaning Round 2

In [15]:
def clean_text_round2(text):
    """Remove curly quotes and ellipses that round 1 missed, then remove newline characters."""
    text = re.sub('[‘’“”…]', '', text)
    text = re.sub('\n', '', text)
    return text

round2 = clean_text_round2

In [16]:
data_clean = pd.DataFrame(data_clean.transcript.apply(round2))
data_clean

Unnamed: 0,transcript
aziz,♪ sometimes i feel so happy ♪ ♪ sometimes i feel so sad ♪ ♪ sometimes i feel so happy ♪ ♪ but mostly you just make me mad ♪ ♪ baby you just make m...
chris,ladies and gentlemen chris rock yeah please oh sit down sit yo asses down please let me get on with the show its nice to be here brooklyn heres...
hannah,the following is the transcript of hannah gadbsy douglas in her second netflix special named after her dog gadsby explores how autism affects her ...
kenny,make some noise for kenny sebastian oh my god thank you so much mumbai thank you thank you really how are you guys doing oh shit lets do this come...
kevin,its showtime honey babe im gone coming alright see you later love you showtime baby lets go show time bro ill see you on the other side bab...
nate,♪ family ♪ ♪ singin in the kitchen ♪ ♪ family ♪ ♪ runnin through the yard ♪ ♪ family ♪ ♪ goin on vacation ♪ ♪ family ♪ ♪ on a credit card ♪ ♪ hey...
norm,then people go goddamn at least hes not a hypocrite youve got to give it to him thats the worst part of it all right i ate a pork chop i dont want...
russell,narrator ladies and gentlemen its start time at the dome nsci svp stadium and right about now were going to bring you the brother that gave you ...
vir,i lost of my mind its very freeing you should see the look on your faces right now by the way oh good evening san francisco are you guys excited ...
whitney,ladies and gentlemen whitney cummings this is awesome i am shooting my fourth standup special this evening in my hometown washington dc thank you ...


Data Cleaning Round 3

In [18]:
def clean_round3(text):
    """Remove the musical note symbols that mark song lyrics."""
    text = re.sub('♪', '', text)
    return text

In [19]:
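# Note: this applies round 3 to the raw data_df, so the output below keeps the original casing,
# brackets, and punctuation. Apply it to data_clean['transcript'] instead to build on rounds 1 and 2.
data_clean = pd.DataFrame(data_df['transcript'].apply(clean_round3))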
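
If the intent is for every round to build on the previous one, the three functions can be chained so that each receives the output of the one before it. The following is a minimal sketch under that assumption. `clean_all` and the `raw` dictionary are hypothetical names introduced here, and a plain dict stands in for the DataFrame to keep the example free of dependencies.

```python
import re
import string

def clean_text_round1(text):
    text = text.lower()
    text = re.sub(r'\[.*?\]', '', text)
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    text = re.sub(r'\w*\d\w*', '', text)
    return text

def clean_text_round2(text):
    text = re.sub('[‘’“”…]', '', text)
    text = re.sub('\n', '', text)
    return text

def clean_round3(text):
    return re.sub('♪', '', text)

def clean_all(text):
    """Hypothetical helper: run the three rounds in order on one transcript."""
    for step in (clean_text_round1, clean_text_round2, clean_round3):
        text = step(text)
    return text

# Hypothetical stand-in for the transcript column.
raw = {"demo": "♪ Family ♪\n [audience cheers] It’s showtime, honey?"}
cleaned = {name: clean_all(text) for name, text in raw.items()}
print(cleaned["demo"].split())
```

With pandas, the same idea would be to pass a function like `clean_all` to `apply` on the transcript column once, rather than rebuilding `data_clean` from `data_df` in the last round.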

In [20]:
data_clean

Unnamed: 0,transcript
aziz,"Sometimes I feel so happy \n Sometimes I feel so sad \n Sometimes I feel so happy \n But mostly you just make me mad \n Baby, you just make m..."
chris,"[indistinct overlapping chatter]\n [woman] Ladies and gentlemen, Chris Rock.\n [audience cheers and applauds]\n Yeah. Please. Oh, sit down. Sit yo..."
hannah,"The following is the transcript of Hannah Gadbsy: Douglas. In her second Netflix special, named after her dog, Gadsby explores how autism affects ..."
kenny,Make some noise for Kenny Sebastian.\n Oh my God. Thank you so much Mumbai. Thank you. Thank you. Really. How are you guys doing? Oh shit. Let’s d...
kevin,"[heartbeat]\n [indistinct chatter]\n [atmospheric whooshing]\n [audience cheering] It’s showtime, honey?\n Babe, I’m gone.\n [woman] Coming.\n Alr..."
nate,[folk rock music playing]\n Family \n Singin’ in the kitchen \n Family \n Runnin’ through the yard… \n Family \n Goin’ on vacation \n Famil...
norm,"Then people go, “Goddamn, at least he’s not a hypocrite.” “You’ve got to give it to him, that’s the worst part of it.” All right. I ate a pork cho..."
russell,"[TYPING]\n [CHEERING]\n NARRATOR: Ladies and gentlemen, it’s start time at the Dome NSCI SVP Stadium. And right about now, we’re going to bring yo..."
vir,"I lost 80% of my mind. It’s very freeing. You should see the look on your faces right now, by the way. Oh! Good evening, San Francisco. Are you gu..."
whitney,"Ladies and gentlemen… Whitney Cummings!\n This is awesome. I am shooting my fourth stand-up special this evening in my hometown, Washington DC. Th..."


In [22]:
# Pickle the cleaned data
data_clean.to_pickle('data_clean.pkl')

Organizing Data

In [25]:
# Full names, in the same alphabetical order as the sorted index of data_clean
full_name = ['Aziz Ansari', 'Chris Rock', 'Hannah Gadsby', 'Kenny Sebastian', 'Kevin Hart', 'Nate Bargatze', 'Norm Macdonald', 'Russell Peters', 'Vir Das', 'Whitney Cummings']

In [23]:
import pandas as pd
data_clean = pd.read_pickle('data_clean.pkl')
data_clean

Unnamed: 0,transcript
aziz,"Sometimes I feel so happy \n Sometimes I feel so sad \n Sometimes I feel so happy \n But mostly you just make me mad \n Baby, you just make m..."
chris,"[indistinct overlapping chatter]\n [woman] Ladies and gentlemen, Chris Rock.\n [audience cheers and applauds]\n Yeah. Please. Oh, sit down. Sit yo..."
hannah,"The following is the transcript of Hannah Gadbsy: Douglas. In her second Netflix special, named after her dog, Gadsby explores how autism affects ..."
kenny,Make some noise for Kenny Sebastian.\n Oh my God. Thank you so much Mumbai. Thank you. Thank you. Really. How are you guys doing? Oh shit. Let’s d...
kevin,"[heartbeat]\n [indistinct chatter]\n [atmospheric whooshing]\n [audience cheering] It’s showtime, honey?\n Babe, I’m gone.\n [woman] Coming.\n Alr..."
nate,[folk rock music playing]\n Family \n Singin’ in the kitchen \n Family \n Runnin’ through the yard… \n Family \n Goin’ on vacation \n Famil...
norm,"Then people go, “Goddamn, at least he’s not a hypocrite.” “You’ve got to give it to him, that’s the worst part of it.” All right. I ate a pork cho..."
russell,"[TYPING]\n [CHEERING]\n NARRATOR: Ladies and gentlemen, it’s start time at the Dome NSCI SVP Stadium. And right about now, we’re going to bring yo..."
vir,"I lost 80% of my mind. It’s very freeing. You should see the look on your faces right now, by the way. Oh! Good evening, San Francisco. Are you gu..."
whitney,"Ladies and gentlemen… Whitney Cummings!\n This is awesome. I am shooting my fourth stand-up special this evening in my hometown, Washington DC. Th..."


In [26]:
data_clean.index = full_name

In [28]:
data_clean

Unnamed: 0,transcript
Aziz Ansari,"Sometimes I feel so happy \n Sometimes I feel so sad \n Sometimes I feel so happy \n But mostly you just make me mad \n Baby, you just make m..."
Chris Rock,"[indistinct overlapping chatter]\n [woman] Ladies and gentlemen, Chris Rock.\n [audience cheers and applauds]\n Yeah. Please. Oh, sit down. Sit yo..."
Hannah Gadsby,"The following is the transcript of Hannah Gadbsy: Douglas. In her second Netflix special, named after her dog, Gadsby explores how autism affects ..."
Kenny Sebastian,Make some noise for Kenny Sebastian.\n Oh my God. Thank you so much Mumbai. Thank you. Thank you. Really. How are you guys doing? Oh shit. Let’s d...
Kevin Hart,"[heartbeat]\n [indistinct chatter]\n [atmospheric whooshing]\n [audience cheering] It’s showtime, honey?\n Babe, I’m gone.\n [woman] Coming.\n Alr..."
Nate Bargatze,[folk rock music playing]\n Family \n Singin’ in the kitchen \n Family \n Runnin’ through the yard… \n Family \n Goin’ on vacation \n Famil...
Norm Macdonald,"Then people go, “Goddamn, at least he’s not a hypocrite.” “You’ve got to give it to him, that’s the worst part of it.” All right. I ate a pork cho..."
Russell Peters,"[TYPING]\n [CHEERING]\n NARRATOR: Ladies and gentlemen, it’s start time at the Dome NSCI SVP Stadium. And right about now, we’re going to bring yo..."
Vir Das,"I lost 80% of my mind. It’s very freeing. You should see the look on your faces right now, by the way. Oh! Good evening, San Francisco. Are you gu..."
Whitney Cummings,"Ladies and gentlemen… Whitney Cummings!\n This is awesome. I am shooting my fourth stand-up special this evening in my hometown, Washington DC. Th..."


Creating Document Term Matrix

In [30]:
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(stop_words='english')
data_cv = cv.fit_transform(data_clean.transcript)
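# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is available from 1.0
data_dtm = pd.DataFrame(data_cv.toarray(), columns=cv.get_feature_names_out())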
data_dtm.index = data_clean.index
data_dtm



Unnamed: 0,00,000,0what,10,100,11,12,13,14,1400,...,zahra,zero,zillionaire,zip,zodiac,zoo,zoom,zuck,zucker,zuckerberg
Aziz Ansari,0,0,1,0,0,1,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Chris Rock,0,1,0,1,4,1,0,0,0,0,...,1,0,0,0,0,0,0,3,9,1
Hannah Gadsby,0,0,0,0,0,0,0,0,0,0,...,0,0,0,6,0,0,0,0,0,0
Kenny Sebastian,0,0,0,5,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
Kevin Hart,0,0,0,0,1,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
Nate Bargatze,2,2,0,0,3,0,2,0,0,0,...,0,1,1,0,0,1,1,0,0,0
Norm Macdonald,0,0,0,0,0,1,0,1,2,0,...,0,0,0,0,0,0,0,0,0,0
Russell Peters,0,2,0,10,1,0,1,1,0,0,...,0,0,0,0,1,0,0,0,0,0
Vir Das,0,0,0,0,1,0,5,0,3,0,...,0,0,0,0,0,0,0,0,0,0
Whitney Cummings,0,1,0,0,0,0,2,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [31]:
data_dtm.to_pickle("wordcount.pkl")
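
A document-term matrix has one row per document and one column per vocabulary term, with each cell holding how often that term occurs in that document. The following is a minimal pure-Python sketch of that idea on two made-up documents. It splits on whitespace only and skips stop-word removal, so it is not a reimplementation of CountVectorizer. The names `docs` and `dtm` are assumptions for illustration.

```python
from collections import Counter

# Hypothetical toy corpus.
docs = {
    "doc_a": "thank you thank you very much",
    "doc_b": "you said no phones",
}

# Count the tokens in each document.
counts = {name: Counter(text.split()) for name, text in docs.items()}

# Vocabulary: every distinct token, sorted to give a stable column order.
vocabulary = sorted(set().union(*counts.values()))

# One row per document, one count per vocabulary term (0 when absent).
dtm = {name: [c[term] for term in vocabulary] for name, c in counts.items()}

print(vocabulary)
for name, row in dtm.items():
    print(name, row)
```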

Stop Words: CountVectorizer simply counts the occurrences of each word in its vocabulary, so extremely common words such as ‘the’ and ‘and’ become prominent features even though they add little meaning to the text. A model can often be improved by leaving those words out. Stop words are a list of words you do not want to use as features. Set the parameter stop_words='english' to use the built-in list, or set stop_words to a custom list. The parameter defaults to None.
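
As a rough illustration of why stop words matter, the sketch below counts words in a made-up sentence with and without a tiny stop list. `STOP_WORDS` here is a hypothetical five-word set chosen for the example, not the built-in English list.

```python
from collections import Counter

# Hypothetical, deliberately tiny stop list for illustration.
STOP_WORDS = {"the", "and", "a", "of", "to"}

text = "the dog and the cat went to the park"
tokens = text.split()

all_counts = Counter(tokens)
kept_counts = Counter(t for t in tokens if t not in STOP_WORDS)

print(all_counts.most_common(1))  # the most frequent token before filtering
print(sorted(kept_counts))        # the features that remain after filtering
```

Before filtering, the most frequent token is a stop word. After filtering, only the content words are left as features.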

ngram_range: An n-gram is a sequence of n consecutive words. For example, the sentence ‘I am Groot’ contains the 2-grams ‘I am’ and ‘am Groot’, and the sentence itself is a 3-gram. Set ngram_range=(a, b), where a is the minimum and b the maximum size of the n-grams to include as features. The default ngram_range is (1, 1). For online job postings, for example, adding 2-grams as features can noticeably improve a model’s predictive power. This makes intuitive sense: many job titles, such as ‘data scientist’, ‘data engineer’, and ‘data analyst’, are two words long.
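
The ‘I am Groot’ example can be reproduced with a few lines of plain Python. This is only a sketch of the n-gram idea: `ngrams` and `ngram_features` are hypothetical helpers, and they split on whitespace without lowercasing. CountVectorizer's default tokenizer lowercases the text and drops single-character tokens such as ‘I’.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by spaces."""
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_features(text, ngram_range=(1, 1)):
    """Collect all n-grams whose size lies within ngram_range (inclusive)."""
    tokens = text.split()
    low, high = ngram_range
    features = []
    for n in range(low, high + 1):
        features.extend(ngrams(tokens, n))
    return features

print(ngram_features('I am Groot', (2, 2)))  # ['I am', 'am Groot']
print(ngram_features('I am Groot', (3, 3)))  # ['I am Groot']
print(ngram_features('I am Groot', (1, 2)))  # ['I', 'am', 'Groot', 'I am', 'am Groot']
```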

min_df, max_df: These are the minimum and maximum document frequencies a word or n-gram must have to be used as a feature. If either parameter is set to an integer, it is treated as a bound on the number of documents that must contain the feature. If either is set to a float, it is interpreted as a proportion of the documents rather than an absolute count. min_df defaults to 1 (int) and max_df defaults to 1.0 (float).
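
The difference between integer and float bounds is easier to see on a toy corpus. The sketch below is an illustration of the rule described above, not scikit-learn's code: `document_frequency` and `filter_by_df` are hypothetical helpers, and the four documents are made up.

```python
def document_frequency(docs):
    """Map each term to the number of documents that contain it."""
    df = {}
    for doc in docs:
        for term in set(doc.split()):  # set(): count each document once per term
            df[term] = df.get(term, 0) + 1
    return df

def filter_by_df(docs, min_df=1, max_df=1.0):
    """Keep terms whose document frequency lies within the given bounds."""
    n_docs = len(docs)
    # Integers are absolute document counts; floats are proportions of n_docs.
    low = min_df if isinstance(min_df, int) else min_df * n_docs
    high = max_df if isinstance(max_df, int) else max_df * n_docs
    df = document_frequency(docs)
    return sorted(term for term, k in df.items() if low <= k <= high)

# Hypothetical toy corpus: 'you' is in 3 documents, 'know' and 'what' in 2 each.
docs = ["you know what", "you know me", "you said that", "what a show"]

print(filter_by_df(docs, min_df=2))              # ['know', 'what', 'you']
print(filter_by_df(docs, min_df=2, max_df=0.5))  # ['know', 'what']
print(filter_by_df(docs, min_df=0.5))            # ['know', 'what', 'you']
```

With four documents, min_df=2 and min_df=0.5 select the same terms, because half of four documents is two.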

max_features: CountVectorizer keeps only the max_features terms that occur most frequently across the corpus in its vocabulary and drops everything else. The parameter defaults to None, which keeps every term.
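
A small sketch of the selection that max_features implies: total the term counts over the whole corpus and keep the top few. `top_features` is a hypothetical helper and the three documents are made up. The example avoids ties at the cutoff, since how ties are broken is not something this sketch tries to model.

```python
from collections import Counter

# Hypothetical toy corpus.
docs = ["thank you thank you", "thank you very much", "you said no"]

# Total occurrences of each term across all documents.
totals = Counter(token for doc in docs for token in doc.split())

def top_features(totals, max_features):
    """Keep the max_features most frequent terms, returned in sorted order."""
    return sorted(term for term, _ in totals.most_common(max_features))

print(totals["you"], totals["thank"])  # 4 3
print(top_features(totals, 2))         # ['thank', 'you']
```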

In [32]:
cvec = CountVectorizer(stop_words='english', ngram_range=(1, 2), min_df=0.1, max_df=0.7, max_features=100)