This notebook tracks work I did that did not end up contributing to the final analysis.

## Getting the data from Mongo

I thought that I would have to analyze by year, but didn't end up needing to. Below is my work to pull the data from Mongo, split it by year, and save each year as a CSV.

In [59]:
# Get the data from Mongo.
# I never used it, but you can pass in a Mongo query to pull a specific episode if desired.

full_data = mon.get_episodes()

# Data type cleaning required for parsing for years and sorting
full_data['ep_air_date'] = pd.to_datetime(full_data['ep_air_date'])
full_data['ep_num'] = pd.to_numeric(full_data['ep_num'])
full_data['timestamp'] = full_data['timestamp'].str.replace('_', '.')
full_data['timestamp'] = pd.to_timedelta(full_data['timestamp'])
full_data = full_data.sort_values(by=['ep_num', 'timestamp'])

# Create CSVs per year
to_empty = full_data

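for year in range(1995, 2021):
    # Use <= so episodes that aired on Dec 31 land in that year's file.
    # With < here and > below, those rows were silently dropped.
    to_empty[to_empty['ep_air_date']<=f'{year}-12-31'].to_csv(f'/Users/pang/repos/nlp-test/data/006_by_year/df_{year}')
    to_empty = to_empty[to_empty['ep_air_date']>f'{year}-12-31']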
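
A groupby on the air-date year is another way to make this split, and it avoids comparing against year-end boundaries at all. The sketch below is only an illustration: the `split_by_year` helper and the sample rows are made up, and it returns the per-year frames instead of writing CSVs.

```python
import pandas as pd

def split_by_year(df, date_col="ep_air_date"):
    """Return {year: sub-frame}, keyed on the calendar year of date_col."""
    years = df[date_col].dt.year
    return {int(year): part for year, part in df.groupby(years)}

# Made-up rows for illustration only; note the episode airing on Dec 31.
sample = pd.DataFrame({
    "ep_num": [1, 2, 3],
    "ep_air_date": pd.to_datetime(["1995-11-17", "1995-12-31", "1996-01-05"]),
})
by_year = split_by_year(sample)
```

Each value in `by_year` could then be saved with `to_csv`, one file per year.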

## Initial removal of unnecessary data

This was work to process the data one year at a time. The final text drops this section entirely in favor of what's in its "Getting the data from Mongo" section.

In [108]:
year_data = pd.read_csv('/Users/pang/repos/nlp-test/data/006_by_year/df_2019', index_col="Unnamed: 0")
year_data = year_data.dropna(subset=['words'])
year_data['speaker'] = year_data['speaker'].fillna('UNKNOWN')

In [109]:
# Dropping the text from the credits act since they're all just acknowledgements for the episode.
no_credits = year_data[year_data['act']!='Credits']

# Remove sound effects in podcasts.
# Keeps only rows whose text does NOT end with ']' (optionally followed by periods),
# which drops rows that are just a bracketed sound effect.
# There are some that embed sound effects with speech: "[LAUGHING] That's what she said."
# Those are kept and removed later in the pipeline.
no_credits_no_sounds = no_credits[no_credits['words'].str.match(r"(?!.*\]\.*$).*")]
# The extra '\.*' between '\]' and '$' is because some sound effects were written like this: '[SPEAKING CHINESE].'

# There are around 1700 blanks for some reason and they are removed.
no_credits_sounds_blanks = no_credits_no_sounds[no_credits_no_sounds['words']!='']
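
To check what the sound-effect filter keeps and drops, the same pattern can be run with the standard `re` module, since `Series.str.match` anchors at the start of the string the way `re.match` does. This is a minimal sketch: the first, third, and last sample strings are made up for illustration.

```python
import re

# Same pattern as the cell above, written as a raw string.
KEEP = re.compile(r"(?!.*\]\.*$).*")

samples = [
    "[MUSIC PLAYING]",                   # made-up: sound effect only
    "[SPEAKING CHINESE].",               # sound effect with a trailing period
    "Wait. [LAUGHS]",                    # made-up: speech that ends in a sound effect
    "[LAUGHING] That's what she said.",  # sound effect embedded before speech
    "Yeah. Let's do this.",              # made-up: plain speech
]

# A row is kept only when the negative lookahead does not fire.
kept = [s for s in samples if KEEP.match(s)]
```

One thing the sketch makes visible: the pattern only looks at how the text ends, so speech that finishes with a bracketed sound effect is dropped along with the pure sound-effect rows.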

In [110]:
no_credits_sounds_blanks.sample(1)

Unnamed: 0,ep_num,ep_title,ep_air_date,ep_summary,speaker,words,timestamp,act
6096,678,The Wannabes,2019-07-05,"We hang out with the presidential candidates, ...",Emanuele Berry,Yeah. Let's do this.,0 days 00:32:14.560000000,Wannabes One


## Aggregate data

This was code from when I was analyzing at the "act" level. I needed a column that kept the act and episode data together.

In [111]:
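# Work on a copy so the column assignment does not trigger SettingWithCopyWarning.
no_credits_sounds_blanks = no_credits_sounds_blanks.copy()
# Key each row as "<ep_num> - <act>", with the act name truncated to 20 characters.
no_credits_sounds_blanks['ep_act'] = no_credits_sounds_blanks['ep_num'].map(str) + ' - ' + no_credits_sounds_blanks['act'].str.slice(stop=20)
# Join with a space: a plain .sum() concatenates strings with no separator, fusing words across rows.
words = no_credits_sounds_blanks.groupby(by='ep_act')['words'].agg(' '.join)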

## Data cleaning

### Keep only the nouns

I tried playing with getting just the nouns. While computationally expensive (especially on memory, it seems), it didn't provide any additional benefit to the analysis so I dropped this line of inquiry.

In [118]:
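noun_words = []

# nlp is assumed to be a spaCy pipeline loaded elsewhere in the notebook.
for counter, act in enumerate(words, start=1):
    print(f'Nouning {counter} of {len(words)} items.')
    tok_text = nlp(act)
    # Keep only the tokens tagged as nouns and rebuild the text from them.
    nouns = [tok.text for tok in tok_text if tok.pos_=='NOUN']
    noun_words.append(' '.join(nouns))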

In [119]:
words = pd.Series(noun_words)
words.sample(3)

633     night room dinner crowd gown heel minks-- ton ...
882     What something everything guy part movement co...
1388    morning confusion strategy rail side interview...
dtype: object

### Remove all numbers/punctuation and lower all capitals.

## Document term matrix format

Now we remove all stop words and convert to a document-term matrix.

### TF-IDF  (and removing stop words)

Because I ended up going with LDA, which models raw term counts, TF-IDF was not the best vectorizer, so this ended up being dropped.

In [218]:
# TF-IDF is not recommended for LDA
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(stop_words=stop_words, max_df=.8, min_df=.2, ngram_range=(1, 2))
dtm = tfidf.fit_transform(words)
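
The `max_df=.8` and `min_df=.2` arguments are proportions of documents: terms that appear in more than 80% or fewer than 20% of the documents are left out of the vocabulary. The sketch below mimics just that filtering step in plain Python to show its effect. It is a simplification, not scikit-learn's implementation: the `filter_by_doc_frequency` helper and the six toy documents are hypothetical, and it ignores n-grams, stop words, and the TF-IDF weighting itself.

```python
from collections import Counter

def filter_by_doc_frequency(docs, min_df=0.2, max_df=0.8):
    """Keep terms whose share of documents falls within [min_df, max_df]."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        # Count each term at most once per document.
        doc_freq.update(set(doc.split()))
    return sorted(
        term for term, df in doc_freq.items()
        if min_df <= df / n_docs <= max_df
    )

# Toy documents for illustration only.
docs = [
    "police court judge",
    "police bank market",
    "police bank chicken",
    "police court bird",
    "police dog bank",
    "police town river",
]
vocab = filter_by_doc_frequency(docs)
```

Here "police" appears in every document and is dropped as too common, while terms that appear in only one of the six documents are dropped as too rare.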

## Dimension reduction/Topic modeling

### Latent Semantic Analysis (LSA)

LSA provided decent results, but I dropped it in favor of LDA, which gives more probabilistic data on topic content and is easier to explain to a lay audience.

In [234]:
from sklearn.decomposition import TruncatedSVD

# # We have to convert `.toarray()` because the vectorizer returns a sparse matrix.
# # For a big corpus, we would skip the dataframe and keep the output sparse.
# pd.DataFrame(doc_word.toarray(), index=example, columns=vectorizer.get_feature_names()).head(10)

# Acronyms: Latent Semantic Analysis (LSA) is just another name for
#  Singular Value Decomposition (SVD) applied to Natural Language Processing (NLP)
lsa = TruncatedSVD(num_of_topics)
doc_topic = lsa.fit_transform(dtm)
lsa.explained_variance_ratio_

array([0.00705675, 0.03001361, 0.01688934, 0.01609716, 0.01458167,
       0.01331692, 0.01326753, 0.01138552, 0.01089079, 0.0105532 ,
       0.0103751 , 0.00977346, 0.00929703, 0.00887317, 0.00860662,
       0.00839951, 0.00760565, 0.00758318, 0.00743471, 0.00724316])

In [235]:
topic_word = pd.DataFrame(lsa.components_.round(5),
             columns = vectorizer.get_feature_names())
topic_word

Unnamed: 0,abandon,ability,absolutely,accept,accident,accord,account,accuse,acknowledge,act act,...,ye,year ago,year later,year year,yell,yellow,yesterday,york,york city,young man
0,0.0144,0.01278,0.02822,0.02116,0.02015,0.01898,0.02257,0.01473,0.01134,0.01462,...,0.01299,0.03821,0.01977,0.01455,0.02832,0.01508,0.01314,0.05841,0.0209,0.01652
1,-0.00392,0.00553,0.00553,0.00169,-0.00542,0.02176,0.01305,0.0202,0.0033,-0.00286,...,0.0002,-0.00011,-0.01254,0.00265,-0.00607,-0.01562,-0.00279,0.00148,-0.00738,0.00102
2,3e-05,0.00375,-0.01041,-0.01187,0.00062,-0.00448,0.00131,-0.01032,-0.00581,-0.0018,...,0.00342,-0.00514,-0.00855,-0.00063,0.01531,0.00424,-0.00111,0.03513,0.02204,0.0107
3,0.00475,-0.00106,-0.00659,-0.0068,0.01991,0.01568,0.01589,0.01101,-0.00241,0.0009,...,0.00607,-0.00514,0.00674,0.00114,0.00089,-0.00666,0.00384,-0.05243,-0.01824,0.01275
4,-0.00897,0.00706,0.00508,-0.00901,-0.00391,0.00554,0.02839,-0.01301,0.0015,0.00575,...,0.00294,0.01227,0.00011,0.00685,-0.0182,0.00131,-0.00545,0.02028,-0.00373,-0.02049
5,0.00992,-0.00182,-0.0062,0.00265,0.00404,0.00994,-0.00867,0.00216,-0.00159,-0.00112,...,0.00102,-0.00196,0.00163,-0.00101,-0.00041,0.01293,0.00164,-0.03627,-0.01616,-0.01084
6,0.01391,0.00175,-0.00072,-0.00282,0.01382,-0.00705,0.00285,-0.00666,0.00366,0.00323,...,-0.00116,0.00421,-0.00012,0.00123,0.00333,-0.00261,0.00436,0.0002,-0.0078,-0.00256
7,-0.00355,0.01395,0.00488,0.00608,0.01943,0.00938,-0.00032,0.00972,0.01109,0.00264,...,0.00215,-0.001,0.00552,-0.00775,-0.01205,0.00764,-0.00422,-0.03814,-0.02362,-0.00036
8,0.0136,-0.00366,-0.01034,0.00644,-0.00051,0.00632,-0.01106,-0.00275,0.00291,-0.00445,...,-0.01104,-0.00382,-0.00191,0.01403,0.00081,0.00105,-0.00197,-0.0589,-0.02779,0.00415
9,-0.01339,0.00661,-0.00223,0.01073,0.00401,0.00716,0.00972,0.01746,0.00444,-0.003,...,0.0108,-0.00374,0.00423,0.00525,-0.02557,-0.00576,-0.00742,-0.01382,-0.0072,-0.01677


In [237]:
display_topics(lsa, vectorizer.get_feature_names(), 20)


Topic  0
mother, dad, mom, father, girl, parent, black, white, brother, police, war, government, town, state, david, boy, president, buy, group, sister

Topic  1
government, vote, president, war, state, campaign, election, company, court, candidate, police, political, law, meeting, united, issue, party, department, tax, united states

Topic  2
chicken, bird, animal, dog, eat, sell, cat, water, light, police, christmas, bank, truck, plant, buy, david, fly, company, customer, food

Topic  3
police, crime, officer, murder, prison, court, drug, mom, attorney, mother, cop, arrest, trial, hospital, lawyer, bank, law, dad, agent, gun

Topic  4
company, bank, buy, sell, business, mike, market, dad, price, worker, credit, million, dollar, customer, chicken, mom, alex, tax, industry, make money

Topic  5
chicken, war, bird, father, dad, animal, military, soldier, president, eat, kill, army, americans, dog, states, united, mother, united states, attack, vote

Topic  6
war, soldier, military, arm

### Non-Negative Matrix Factorization (NMF)

NMF also provided decent results, but I dropped it for the same reason I dropped LSA.

In [232]:
from sklearn.decomposition import NMF

nmf_model = NMF(num_of_topics)
doc_topic = nmf_model.fit_transform(dtm)

topic_word = pd.DataFrame(nmf_model.components_.round(5),
             columns = vectorizer.get_feature_names())
topic_word

nmf_model.components_

Unnamed: 0,abandon,ability,absolutely,accept,accident,accord,account,accuse,acknowledge,act act,...,ye,year ago,year later,year year,yell,yellow,yesterday,york,york city,young man
0,0.082,0.01541,0.04633,0.01979,0.03807,0.00185,0.02412,0.0,0.01563,0.02101,...,0.01071,0.07722,0.01698,0.02962,0.10382,0.04144,0.03052,0.16982,0.06934,0.06581
1,0.00586,0.01201,0.04566,0.02714,0.0,0.03272,0.00578,0.03491,0.01649,0.01406,...,0.0,0.03494,0.00309,0.0086,0.02344,0.01011,0.01397,0.04669,0.0,0.00346
2,0.0143,0.0,0.03343,0.0,0.0,0.0,0.01718,0.0,0.00297,0.01786,...,0.00752,0.0583,0.01321,0.02206,0.03619,0.00335,0.02336,0.01185,0.00529,0.0
3,0.0,0.0,0.0196,0.00739,0.01967,0.03214,0.01753,0.03838,0.0,0.00934,...,0.01507,0.00686,0.01109,0.0,0.04098,0.0,0.02483,0.01274,0.01247,0.06245
4,0.0,0.02021,0.02185,0.00416,0.0137,0.02546,0.07078,0.00303,0.01222,0.01994,...,0.01397,0.02804,0.01741,0.01749,0.00698,0.0,0.00119,0.06179,0.01057,0.0
5,0.0,0.00369,0.0,0.00539,0.00402,0.02854,0.00175,0.0,0.0,0.00017,...,0.01958,0.00609,0.01766,0.00425,0.00637,0.03072,0.00312,0.01573,0.01837,0.0
6,0.0,0.00614,0.03323,0.01039,0.00836,0.03213,0.05962,0.00708,0.00466,0.00976,...,0.02031,0.04812,0.00246,0.02076,0.0,0.00124,0.00577,0.05532,0.0,0.0
7,0.0,0.02092,0.01538,0.00468,0.04746,0.04071,0.02059,0.00691,0.02289,0.01386,...,0.0037,0.03316,0.01097,0.01453,0.00208,0.00246,0.00471,0.00072,0.0,0.00407
8,0.00303,0.00351,0.0,0.01073,0.0013,0.0013,0.00106,0.02405,0.0,0.0,...,0.01022,0.00813,0.00062,0.04376,0.00396,0.0,0.01665,0.0,0.0,0.0
9,0.0198,0.0286,0.01249,0.02774,0.02044,0.05655,0.04155,0.05938,0.01243,0.0006,...,0.01978,0.02944,0.01275,0.01976,0.0,0.0,0.0,0.02459,0.0,0.00315


In [233]:
display_topics(nmf_model, vectorizer.get_feature_names(), 20)


Topic  0
town, water, road, truck, bus, camp, boat, light, foot, mile, sleep, park, ride, river, bed, summer, beach, drink, trip, hotel

Topic  1
vote, president, campaign, election, candidate, party, political, state, politic, bush, issue, debate, tax, win, speech, race, politician, congress, government, washington

Topic  2
dad, mom, parent, brother, marriage, father, divorce, son, funny, video, trip, sad, uncle, college, weird, food, happy, drug, chance, game

Topic  3
police, officer, cop, crime, gun, murder, arrest, shoot, department, kill, victim, prison, jail, county, report, evidence, video, suspect, scene, charge

Topic  4
bank, mike, buy, alex, government, sell, market, credit, price, pool, fund, dollar, wall, million, rate, business, federal, cash, agency, security

Topic  5
chicken, bird, eat, meal, episode, fish, standard, plant, food, worker, taste, sister, performance, restaurant, personality, sun, wing, fly, feed, sky

Topic  6
company, worker, plant, business, sell, c

## Further analysis

### Top topics

I took time to see if there was anything interesting I could glean from showing the top topics through the years. It wasn't particularly insightful and just looked like noise, so I ended up abandoning this review.

In [491]:
display_topics(lda, 
               vectorizer.get_feature_names(), 
               20)


Topic  0
police, court, judge, prison, law, drug, crime, department, gun, jail, attorney, officer, officers, trial, cops, cases, justice, lawyer, evidence, shot

Topic  1
dad, father, mom, mother, parents, brother, kid, son, sister, children, book, lived, child, older, wife, happy, died, years old, daughter, uncle

Topic  2
women, sex, wrote, letter, men, girl, letters, write, writing, relationship, married, gay, book, met, reading, wife, marriage, didn know, email, advice

Topic  3
company, business, pay, bank, government, buy, million, sell, deal, jobs, market, month, plan, credit, street, dollars, office, bought, sold, paid

Topic  4
dog, eat, animals, book, food, animal, cat, dogs, body, human, fat, eyes, death, eating, dead, naked, fish, running, water, stand

Topic  5
town, church, building, street, water, south, state, neighborhood, local, land, moved, community, white, wall, government, houses, workers, lots, lived, road

Topic  6
black, students, white, schools, class, teache

In [492]:
lda_results = pd.DataFrame(eps,
             columns=['episode']).join(pd.DataFrame(doc_topic))

df = ep_meta.merge(lda_results, how='inner', left_on='ep_num', right_on='episode')

In [494]:
df.columns = ['ep_num', 'air_date', 'summary', 'ep', 'law_n_order', 
              'family', 'love_n_sex', 'business', 'other2', 'local',
             'race_school', 'medical', 'media', 'holiday_family',
             'politics', 'other', 'sports', 'military', 'music']

In [538]:
just_topics = df[['law_n_order', 'family', 'love_n_sex', 'business', 'other2', 'local',
             'race_school', 'medical', 'media', 'holiday_family',
             'politics', 'other', 'sports', 'military', 'music']]
# idxmax returns, for each row, the column label holding the largest value.
top_topic = just_topics.idxmax(axis=1)

In [545]:
just_topics.max(axis=1)

0      0.315394
1      0.365250
2      0.273619
3      0.465268
4      0.180217
         ...   
683    0.327819
684    0.355211
685    0.399705
686    0.434523
687    0.305550
Length: 688, dtype: float64

In [541]:
top_topic

0        new_york
1        business
2          other2
3           other
4         medical
          ...    
683       medical
684      military
685    love_n_sex
686        family
687        family
Length: 688, dtype: object

In [546]:
new = df.merge(top_topic.rename('top_topic'), left_index=True, right_index=True)

### Named entity search

I wanted to see what kinds of sources they had over the years, with the hypothesis that they were able to gain more prominent sources for their stories over time. There were too many named entities to analyze, and since I was close to the end of the time period I ended up abandoning this line of inquiry as well.

In [None]:
from collections import defaultdict


counter = 0
entities = {}
for act in pang:
    ent_count = defaultdict(int)
    print(f'---------------------Finding Entities {counter} of {len(pang)} items.')
    tok_text = nlp(act)
    print(pang.index[counter])
    for token in tok_text.ents:
        if token.label_ not in ['TIME', 'DATE', 'CARDINAL', 'ORDINAL', 'MONEY', 'GPE', 'QUANTITY']:
            ent_count[token.text] += 1
    entities[pang.index[counter]] = ent_count 
    print(entities[pang.index[counter]])
    counter += 1

Code to find only "names", which for the code below means entities that are exactly two words long.

In [None]:
names = []
for entity in people.index:
    if len(entity.split(' '))==2:
        names.append(entity)
#         names.append([entity, people[entity]])
names2 = pd.DataFrame(names)
names2
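
The two-word rule above can be sketched on its own with a plain dictionary of entity counts. Everything here is hypothetical: the `two_word_entities` helper, the entity strings, and the counts are made up for illustration and are not results from the data.

```python
from collections import Counter

def two_word_entities(entity_counts):
    """Keep entities made of exactly two whitespace-separated words."""
    return {
        name: count
        for name, count in entity_counts.items()
        if len(name.split()) == 2
    }

# Made-up counts for illustration only.
counts = Counter({
    "Jane Doe": 12,
    "Acme Radio Network": 3,
    "NASA": 7,
    "Supreme Court": 5,
})
likely_names = two_word_entities(counts)
```

The rule is only a rough heuristic: it also lets through two-word entities that are not people, so checking the entity label would likely be needed for a cleaner list.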

### Clustering

I did not end up doing any clustering for my final analysis; I reviewed trends over time instead.

In [229]:
from sklearn.cluster import KMeans
km = KMeans(n_clusters=3)


In [230]:
ypred = km.fit_predict(dtm)

In [None]:
x,y = zip(*X)
plt.figure(dpi=200)
plt.scatter(X[:,0],X[:,1],c=plt.cm.rainbow(ypred*20),s=14);