In [1]:
import numpy as np
import pandas as pd
from IPython.core.interactiveshell import InteractiveShell
from sklearn.feature_extraction.text import TfidfVectorizer, ENGLISH_STOP_WORDS

pd.set_option('display.max_rows', None)
pd.set_option('display.max_colwidth', 500) 
InteractiveShell.ast_node_interactivity = "all"

In [None]:
# path to annotated files
files = {
    "ned-flanders": "../data/annotations/final/ned_tagged.csv",
    "seymour-skinner": "../data/annotations/final/seymour_skinner_tagged.csv",
    "c-montgomery-burns": "../data/annotations/final/c_montgomery_burns_tagged.csv",
}

df = pd.concat(pd.read_csv(path).assign(character=character) for character, path in files.items())

print(df.head())
print("qty of lines: " + str(df["tag"].count()))

   tag  \
0  REL   
1  REL   
2  FAM   
3  REL   
4  REL   

                                                                                                                                                                                                                                                                                                                                                                                                                           quote  \
0  Why me, Lord? Where have I gone wrong? I've always been nice to people. I don't drink or dance or swear. I've even kept kosher, just to be on the safe side. I've done everything the Bible says, even the stuff that contradicts the other stuff. What more could I do? I... I feel like I'm coming apart here. I want to yell out, but I... I... I just can't dang-diddily-do-dang-do-don-diddily-darn-do it. I... I...I...   
1                                                                                           Lord, I

In [3]:
# agg all quotes for each topic tag into single string
topic_docs = df.groupby('tag')['quote'].apply(lambda x: ' '.join(x)).reset_index()

display(topic_docs)

Unnamed: 0,tag,quote
0,COM,"You are a picture and a half! Well -- hee hee! -- if you're finished by tomorrow, come on over and strap on the feed bag. We're gonna fire up old Propane Elaine and put the heat to the meat! Nummy nummy num! That concludes our Halloween show for this year. I just want to say that for watching this network, you're all going to Hell. And that includes FX, Fox Sports and our newest Devil's portal, The Wall Street Journal! Welcome to the club! Well sir, we could do a little Quid Pro for the Kay-..."
1,CON,"Ahh, of course! This must be where he dropped the dagger. And this is the butler's pantry where Mrs. Astor concealed herself. And right here's where they found the torso heap! In front of our very own fireplace! Well, any hoodily-doodle, the embassy says it's just a routine hostage-taking -- but I have to drive to Capital City, fill out some forms to get 'em out. Could you possibly watch the kids tonight? You know, with all the energy we're puttin' into this sabotage thing, we coulda written..."
2,ECO,"Uh... Homer... Ah, about those things you borrowed from me over the years -- you know -- the TV trays, the power sander, that downstairs bathtub... you going to be needing those things in Cypress Creek? HOW I LOVE MY COAT OF MANY COLORS / IT WAS RED AND YELLOW AND GREEN AND BROWN / AND SCARLET AND BLACK AND OCHRE AND PEACH / AND RUBY AND OLIVE AND VIOLET AND FAWN... Gee, I always like to help you, Homer, but I don't want to be an accessory to some sort of shady doin's. And it does raise a wh..."
3,FAM,"No, that's not true. I... I... I don't like the service at the post office. Y'know, it's all rush, rush, get you in, get you out. Then, they've got those machines in the lobby. They're even faster. No help there. You might even say I hate the post office. That, and my parents. Lousy beatniks! Hey, that felt good! Wordplay: never cared for it. But it's never too soon for you two to join the ""I do"" crew. Now I'm not sayin' it's all Jell-O with Cool Whip. She'll nag ya. She'll try to change ya...."
4,LEI,"Scared of what? All the funny camp songs we're gonna sing? WE'LL BE SAFE INSIDE OUR FORTRESS WHEN THEY COME / WE'LL BE SAFE FROM CREEPS AND KILLERS WHEN THEY COME / UNLESS THEY HAVE A BLOWTORCH / OR A POISON GAS INJECTOR / THEN I DON'T KNOW WHAT'LL HAPPEN WHEN THEY COME! Yeah. I made a great film, but I'm havin' trouble gettin' it out there. I'm like Michael Moore, except I'm skinny, my jeans are washed, and God loves me. I just wish I could find some way to spread my message. In the last ga..."
5,PER,"Homer, I'm in a rhubarb of a pickle of a jam here. I was all set to go off on vacation, when I get called up for jury duty. Oh, it's a corker of a case -- seems a man drove up onto a traffic island and hit a decorative rowboat full of geraniums. Ohhh. I wanted to subscribe to that new Arts and Crafts Channel. Well, sir, they send over this flimflam man to install it, and do you know what he did? He offered to hook me up illegally to every cable channel for only fifty bucks. All right. It's t..."
6,REL,"Why me, Lord? Where have I gone wrong? I've always been nice to people. I don't drink or dance or swear. I've even kept kosher, just to be on the safe side. I've done everything the Bible says, even the stuff that contradicts the other stuff. What more could I do? I... I feel like I'm coming apart here. I want to yell out, but I... I... I just can't dang-diddily-do-dang-do-don-diddily-darn-do it. I... I...I... Lord, I never question you... but I've been wondering if your decision to take Mau..."
7,WRK,"Hello, Gas Company? How poisonous is your gas?... Wow. But-- Eh, but I'm talkin', you know, about outdoors with plenty of ventilation that... How could that be worse?... Okay, permanent brain damage, or just temporary?... I see. Bart, I've barely been here a good solid week, and you've been sent to my office eleven times. And now that I have peanut butter cups, you seem to be gettin' in trouble every hour. Oh I was just wondering how many boxes of staples I should order for the store. Does t..."


## tf-idf time:

In [4]:
vectoriser = TfidfVectorizer(
    lowercase=True,
    use_idf=True,
    sublinear_tf=True,
    max_df=0.95,                # ignore very common words
    min_df=2,                   # ignore words that appear in only one topic
)

X = vectoriser.fit_transform(topic_docs['quote'])
feature_names = vectoriser.get_feature_names_out()
topic_labels = topic_docs['tag'].tolist() 

display(X)
display(X.shape)
display(feature_names)

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 4662 stored elements and shape (8, 1483)>

(8, 1483)

array(['able', 'above', 'absolutely', ..., 'yourself', 'zeppelin',
       'zipper'], shape=(1483,), dtype=object)

In [5]:
rows = []
Xd = X.toarray()
for i, tag in enumerate(topic_labels):
    idx = Xd[i].argsort()[::-1][:10]
    for r, j in enumerate(idx): rows.append((tag, r+1, feature_names[j], Xd[i][j]))

tfidf_table = pd.DataFrame(rows, columns=["topic","rank","word","tfidf"])

tfidf_table

Unnamed: 0,topic,rank,word,tfidf
0,COM,1,gem,0.092817
1,COM,2,tree,0.092817
2,COM,3,apple,0.092817
3,COM,4,handed,0.080093
4,COM,5,media,0.080093
5,COM,6,poor,0.080093
6,COM,7,state,0.080093
7,COM,8,kid,0.080093
8,COM,9,seen,0.07985
9,COM,10,something,0.077687


In [6]:
# oh no! looks like the top words are semantically uninteresting
# let's try cleaning the quotes a bit more

In [7]:
# string-ify top words to uncover those semantically uninteresting
def stringify_top_terms(tfidf_table):
    s = ""
    for word in tfidf_table['word'].unique():
        s += f'"{word}", '
    return s

s = stringify_top_terms(tfidf_table)
print(s)

"gem", "tree", "apple", "handed", "media", "poor", "state", "kid", "seen", "something", "crime", "shot", "potato", "nature", "during", "drop", "murder", "doodily", "prison", "him", "oil", "pay", "buy", "money", "due", "budget", "leave", "income", "thousands", "profit", "rod", "woman", "mother", "her", "edna", "maude", "silly", "dad", "marriage", "family", "sing", "spare", "ball", "center", "nickel", "groundskeeper", "rather", "nine", "game", "seven", "dee", "bat", "hee", "am", "had", "clothes", "principal", "male", "skinner", "fruit", "bible", "church", "lord", "holy", "became", "god", "stuff", "father", "said", "box", "students", "student", "starting", "grade", "willie", "plant", "school", "elementary", "worker", 


In [8]:
# semantically uninteresting terms from top words: 
# gem, handed, seen, something, during, doodily, him, woman, her, edna, maude, silly, center, rather, nine, seven, dee, hee, am, had, became, starting, willie

In [10]:
# perform tfidf vectoriser w/ custom stop words to rm semantically uninteresting words
def custom_stops_tfidf(extra_stops,curr_stops=ENGLISH_STOP_WORDS):
    custom_stop_words = list(curr_stops.union(extra_stops))

    vectoriser = TfidfVectorizer(
        lowercase=True,
        stop_words=custom_stop_words,
        min_df=2,
        max_df=0.90,
        token_pattern=r"(?u)\b[a-zA-Z]{3,}\b",
    )
    
    return vectoriser

In [11]:
extra_stops = {
    # from top topic terms analysis
    "gem", "handed", "seen", "something", "during", "doodily", "him", "woman", 
    "her", "edna", "maude", "silly", "center", "rather", "nine", 
    "seven", "dee", "hee", "am", "had", "became", "starting", "willie",
    # conversational fillers
    "oh", "like", "just", "really", "know", "gonna", "gotta",
    "hey", "uh", "um", "yeah", "yep", "nope", "right",
    "ok", "okay", "well", "it"
    # personal pronouns
    "i", "me", "my", "we", "us", "our",
    "you", "your", "he", "him", "his",
    "she", "her", "they", "them", "their",
    # names
    "lisa", "marge", "maude", "homer", "simpson", "bart", "edna", 
    "krabappel", "agnes", "smithers", "ned", "flanders", "seymour", 
    "skinner", "nelson", "montgomery", "burns", "willie", "edna"
    # contraction fragments
    "ll", "ve", "re", "don", "didn", "doesn", "isn", "aren",
    "won", "cant", "couldn", "shouldn", "wouldn",
    # generic address terms
    "man", "guy", "sir", "maam",
    "ding", "doo", "bread", "juice", "apple", "honey"
}

vectoriser = custom_stops_tfidf(extra_stops)

X = vectoriser.fit_transform(topic_docs['quote'])
feature_names = vectoriser.get_feature_names_out()


In [12]:
rows = []
Xd = X.toarray()
for i, tag in enumerate(topic_labels):
    idx = Xd[i].argsort()[::-1][:10]
    for r, j in enumerate(idx): rows.append((tag, r+1, feature_names[j], Xd[i][j]))

tfidf_table = pd.DataFrame(rows, columns=["topic","rank","word","tfidf"])

tfidf_table

Unnamed: 0,topic,rank,word,tfidf
0,COM,1,people,0.180162
1,COM,2,children,0.141581
2,COM,3,place,0.135121
3,COM,4,thing,0.135121
4,COM,5,tree,0.126844
5,COM,6,thank,0.126053
6,COM,7,welcome,0.113265
7,COM,8,guess,0.112601
8,COM,9,need,0.112601
9,COM,10,media,0.109456


In [13]:
# could further improve by:
# 1. lemmatising words prior to tf-idf analysis
# 2. reviewing top terms again for any remaining uninteresting words

display(stringify_top_terms(tfidf_table))

'"people", "children", "place", "thing", "tree", "thank", "welcome", "guess", "need", "media", "school", "shot", "potato", "crime", "prison", "boy", "alive", "nature", "drop", "murder", "money", "oil", "buy", "pay", "leave", "business", "budget", "dollars", "store", "mother", "rod", "boys", "family", "yes", "kids", "kinda", "room", "dad", "sing", "ball", "spare", "game", "film", "called", "groundskeeper", "nickel", "lot", "bat", "diddily", "wrong", "heart", "home", "women", "principal", "clothes", "lord", "god", "bible", "church", "said", "holy", "father", "stuff", "students", "plant", "grade", "morning", "work", "elementary", '

In [14]:
# COM - guess, need
# CON - drop, nature
# ECO - na
# FAM - yes, kinda
# LEI - lot
# PER - thing
# REL - stuff
# WRK - yes

extra_stops_2 = {
    "guess", "need", "drop", "nature", "yes", "kinda", "lot", "thing", "stuff", "yes"
}

vectoriser = custom_stops_tfidf(extra_stops_2, extra_stops)

X = vectoriser.fit_transform(topic_docs['quote'])
feature_names = vectoriser.get_feature_names_out()

In [15]:
rows = []
Xd = X.toarray()
for i, tag in enumerate(topic_labels):
    idx = Xd[i].argsort()[::-1][:10]
    for r, j in enumerate(idx): rows.append((tag, r+1, feature_names[j], Xd[i][j]))

tfidf_table = pd.DataFrame(rows, columns=["topic","rank","word","tfidf"])

tfidf_table

Unnamed: 0,topic,rank,word,tfidf
0,COM,1,people,0.17326
1,COM,2,children,0.136157
2,COM,3,place,0.129945
3,COM,4,tree,0.121984
4,COM,5,thank,0.121224
5,COM,6,welcome,0.108926
6,COM,7,kid,0.105262
7,COM,8,media,0.105262
8,COM,9,poor,0.105262
9,COM,10,state,0.105262


In [16]:
topic_counts = df.groupby(["character", "tag"]).size().unstack(fill_value=0)
topic_counts


tag,COM,CON,ECO,FAM,LEI,PER,REL,WRK
character,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
c-montgomery-burns,27,68,44,19,47,43,4,48
ned-flanders,40,19,26,66,30,30,74,15
seymour-skinner,19,28,23,27,22,23,1,157


In [17]:
# topic counts normalised by row sum

topic_percentages = topic_counts.div(topic_counts.sum(axis=1), axis=0)
topic_percentages

tag,COM,CON,ECO,FAM,LEI,PER,REL,WRK
character,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
c-montgomery-burns,0.09,0.226667,0.146667,0.063333,0.156667,0.143333,0.013333,0.16
ned-flanders,0.133333,0.063333,0.086667,0.22,0.1,0.1,0.246667,0.05
seymour-skinner,0.063333,0.093333,0.076667,0.09,0.073333,0.076667,0.003333,0.523333


In [18]:
# word-weighted topic percentages

df["word_count"] = df["quote"].str.split().apply(len)

weighted = df.groupby(["character", "tag"])["word_count"].sum()
weighted = weighted.unstack(fill_value=0)

weighted_percent = weighted.div(weighted.sum(axis=1), axis=0)
weighted_percent

tag,COM,CON,ECO,FAM,LEI,PER,REL,WRK
character,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
c-montgomery-burns,0.085362,0.218999,0.146794,0.062053,0.165544,0.137367,0.015021,0.168859
ned-flanders,0.135159,0.057747,0.083012,0.214561,0.097573,0.098195,0.263846,0.049907
seymour-skinner,0.06814,0.096969,0.072584,0.088651,0.071445,0.078851,0.003304,0.520055
