The goal of this assignment is to give you an opportunity to get hands-on experience with the unsupervised methods we learned about.

Part 1:

(a) Experiment with either k-means clustering or LDA on your adopted document collection to try to find topics in the collection.   Be sure to try a few different values of k.  (If you want to use some other variant of clustering, that is fine.)

(b) Show your output in some easy-to-digest form.

(c) Discuss how well it did or did not work.

(d) (Optional) Compare to a WordNet grouping algorithm, such as those students came up with in the keyphrase assignment.

In [111]:
import numpy as np
import pandas as pd
import nltk
from nltk.stem.snowball import SnowballStemmer
import re
from sklearn import feature_extraction

In [3]:
stopwords = nltk.corpus.stopwords.words('english')
stemmer = SnowballStemmer("english")

In [91]:
def tokenize_and_stem(text):
    tokens = [word for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    tokens = list(filter(lambda x: not x[0].isupper(), tokens))
    filtered_tokens = []
    for token in tokens:        
        if re.search('[a-zA-Z]', token):
            filtered_tokens.append(token)
    stems = [stemmer.stem(t) for t in filtered_tokens]
    return stems


def tokenize_only(text):
    tokens = [word for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    tokens = list(filter(lambda x: not x[0].isupper(), tokens))
    filtered_tokens = []
    for token in tokens:
        if re.search('[a-zA-Z]', token):
            filtered_tokens.append(token)
    return filtered_tokens

In [92]:
tal_text = pd.read_csv("../tal_stories/tal_text_broad.txt")
tal_text.columns = ["Episode Name", "Episode Transcript"]
tal_text.head()

Unnamed: 0,Episode Name,Episode Transcript
0,#1 Party School,"('#1 Party School', '<EPISODE NUMBER:396> <EPI..."
1,2010,"('2010', '<EPISODE NUMBER:397> <EPISODE NAME:2..."
2,Long Shot,"('Long Shot', '<EPISODE NUMBER:398> <EPISODE N..."
3,Contents Unknown,"('Contents Unknown', '<EPISODE NUMBER:399> <EP..."
4,Stories Pitched by Our Parents,"('Stories Pitched by Our Parents', '<EPISODE N..."


In [96]:
totalvocab_stemmed = []
totalvocab_tokenized = []
for each in tal_text["Episode Transcript"]:
    allwords_stemmed = tokenize_and_stem(each)
    totalvocab_stemmed.extend(allwords_stemmed)
    
    allwords_tokenized = tokenize_only(each)
    totalvocab_tokenized.extend(allwords_tokenized)

In [104]:
vocab_frame = pd.DataFrame({'words': totalvocab_tokenized}, index=totalvocab_stemmed)
vocab_frame = vocab_frame.dropna()

In [105]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer(max_df=.8, max_features=200000,
                                    min_df=.2, stop_words='english',
                                  use_idf=True, tokenizer=tokenize_and_stem, ngram_range=(1,3))

%time tfidf_matrix = tfidf_vectorizer.fit_transform(tal_text["Episode Transcript"])

print(tfidf_matrix.shape)

CPU times: user 1min 17s, sys: 670 ms, total: 1min 18s
Wall time: 1min 21s
(101, 2228)


In [106]:
from sklearn.cluster import KMeans
num_clusters = 5

km = KMeans(n_clusters=num_clusters)

%time km.fit(tfidf_matrix)

clusters = km.labels_.tolist()

CPU times: user 343 ms, sys: 5.04 ms, total: 348 ms
Wall time: 457 ms


In [107]:
import joblib  # sklearn.externals.joblib is no longer available in recent scikit-learn releases

joblib.dump(km, 'cluster_algo2.pkl')

# km = joblib.load('cluster_algo2.pkl')

['cluster_algo2.pkl', 'cluster_algo2.pkl_01.npy', 'cluster_algo2.pkl_02.npy']

In [108]:
tal_text["Cluster"] = clusters
tal_text["Cluster"].value_counts()

3    46
2    16
0    16
4    12
1    11
dtype: int64

In [109]:
print("Top terms per cluster:")
print()
order_centroids = km.cluster_centers_.argsort()[:, ::-1]
terms = tfidf_vectorizer.get_feature_names_out()  # get_feature_names() in older scikit-learn

for i in range(num_clusters):
    print("Cluster %d words:" % i, end='')
    
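    for ind in order_centroids[i, :20]:
        # Map the first stem of the term back to a surface word ('nan' if the stem is missing).
        stem = terms[ind].split(' ')[0]
        word = vocab_frame.loc[[stem], 'words'].iloc[0] if stem in vocab_frame.index else 'nan'
        print(" %s" % word, end=",")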
    print()
    print()
    
    print("Cluster %d Episode Names:" %i, end='')
    for title in tal_text[tal_text["Cluster"] == i]['Episode Name']:
        print(' %s' % title, end='\n')
    print()
    print()
    
print()
print()

Top terms per cluster:

Cluster 0 words: nan, nan, nan, students, nan, dad, male, taxing, drinking, football, town, interview, interview, nan, quote, father, girls, middle, nan, card,

Cluster 0 Episode Names: #1 Party School
 Georgia Rambler
 Petty Tyrant
 Last Man Standing
 Will They Know Me Back Home?
 Game Changer
 Gossip
 Middle School
 Back to Penn State
 What Kind of Country
 Blackjack
 Switcheroo
 Red State Blue State
 Surrogates
 No Coincidence, No Story!
 Dr. Gilmer and Mr. Hyde


Cluster 1 words: nan, letter, mother, hill, nan, mom, wall, parents, heart, adamantly, doctors, interview, interview, churches, mark, bed, inside, subject, cancer, box,

Cluster 1 Episode Names: Contents Unknown
 Parent Trap
 Enemy Camp 2010
 Toxie
 Comedians of Christmas Comedy Special
 Slow To React
 Living Without (2011)
 Invisible Made Visible
 Show Me The Way
 Our Friend David
 Tribes


Cluster 2 words: nan, nan, nan, company, adamantly, government, nan, economic, banks, million, dollar, econom

### Discussion

After the first run of the clustering algorithm, it was clear that most of the clusters were affected by proper nouns, mainly the names of individuals. You can see the output of the first run here:

```
Top terms per cluster:

Cluster 0 words: b'interviews', b'interviews', b'Paul', b'Mike', b'Adam', b'drug', b'David', b'students', b'game', b'girls', b'Wall', b'parents', b'dad', b'police', b'court', b'JAMES', b'dog', b'cells', b'marked', b'cancer',

Cluster 0 Episode Names: Long Shot
 Contents Unknown
 Parent Trap
 Save the Day
 Enemy Camp 2010
 True Urban Legends 
 Island Time 
 Held Hostage
 First Contact
 Million Dollar Idea
 Neighborhood Watch
 Kid Politics
 Slow To React
 Tough Room 2011
 Very Tough Love
 Know When To Fold 'Em
 Amusement Park
 Living Without (2011)
 So Crazy It Just Might Work
 Poultry Slam 2011
 Nemeses
 Mr. Daisey and the Apple Factory
 What I Did For Love
 Retraction
 Invisible Made Visible
 Americans in China
 Hiding in Plain Sight
 The Convert
 Back to School
 This Week
 Lights, Camera, Christmas!
 Self-Improvement Kick 
 Valentine's Day


Cluster 1 words: b'Alex', b'Alex', b'Blumberg', b'David', b'company', b'Adam', b'governments', b'banks', b'economy', b'millions', b'Economic', b'crisis', b'lawyer', b'billion', b'created', b'frankly', b'dollars', b'buying', b'fed', b'governor',

Cluster 1 Episode Names: 2010
 NUMMI
 Inside Job
 Social Contract
 Crybabies
 Toxie
 The Invention of Money
 How To Create a Job
 When Patents Attack!
 Adventure!
 Continental Breakup
 Take the Money and Run for Office
 Mortal Vs. Venial
 Our Friend David
 Loopholes
 Getting Away With It
 Trends With Benefits
 Tribes
 When Patents Attack... Part Two!


Cluster 2 words: b'Sarah', b'Sarah', b'Koenig', b'students', b'dad', b'Steve', b'interviews', b'interviews', b'Lisa', b'Pollak', b'Lisa', b'tax', b'MALE', b'football', b'town', b'drink', b'Jane', b'Jane', b'Matt', b'father',

Cluster 2 Episode Names: #1 Party School
 Stories Pitched by Our Parents
 Georgia Rambler
 Petty Tyrant
 Last Man Standing
 Game Changer
 Gossip
 Middle School
 Back to Penn State
 What Kind of Country
 Blackjack
 Switcheroo
 Red State Blue State
 Surrogates
 No Coincidence, No Story!
 Dr. Gilmer and Mr. Hyde


Cluster 3 words: b'Ben', b'John', b'Calhoun', b'Ben', b'mom', b'interviews', b'interviews', b'Republicans', b'dad', b'police', b'David', b'hospital', b'I\\', b"'m", b'Parties', b'doctor', b'gun', b'Christmas', b'students', b'Jonathan',

Cluster 3 Episode Names: Right to Remain Silent
 This Party Sucks
 Comedians of Christmas Comedy Special
 Original Recipe
 See No Evil
 This Week
 The Psychopath Test
 Old Boys Network
 Father's Day 2011
 A House Divided
 Ten Years In
 The Incredible Case of the P.I. Moms
 Reap What You Sow
 Play the Part
 Own Worst Enemy
 Show Me The Way
 What Doesn't Kill You
 Little War on the Prairie
 Doppelgängers
 Harper High School, Part One
 Harper High School, Part Two
 Hit the Road
 Hot In My Backyard


Cluster 4 words: b'Nancy', b'Nancy', b'Updike', b'translator', b'soldiers', b'Army', b'dog', b'SPANISH', b'military', b'Sarah', b'boys', b'war', b'arrested', b'father', b'son', b'Brian', b'Reed', b'Brian', b'drug', b'interpreting',

Cluster 4 Episode Names: The Bridge
 Iraq After Us
 Oh You Shouldn't Have
 Will They Know Me Back Home?
 Fine Print 2011
 Thugs
 What Happened At Dos Erres
 Send a Message
 Animal Sacrifice
 Picture Show
```

Therefore, I modified the tokenizer and the stemmer to pos_tag each word, planning to remove any proper nouns. However, the POS tagger slowed the process significantly and had not finished tokenizing my corpus after 15 minutes. As a substitute, I decided to simply remove words that start with a capital letter. This ran much faster, but it may have wrongly excluded some words, such as ordinary words at the start of a sentence.
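
One way the capital-letter filter might be made less aggressive is sketched below: a capitalized token is dropped only if it never occurs in lowercase anywhere in the text, so sentence-initial common words survive. The helper name `drop_likely_proper_nouns` and the regex tokenizer are illustrative assumptions, not what the notebook actually ran.

```python
import re

def drop_likely_proper_nouns(text):
    """Drop capitalized tokens that never occur in lowercase in `text`.

    Hypothetical helper: uses a simple regex tokenizer instead of NLTK.
    """
    tokens = re.findall(r"[A-Za-z']+", text)
    # Words seen at least once starting with a lowercase letter.
    lowercase_seen = {t for t in tokens if t[0].islower()}
    kept = []
    for t in tokens:
        if t[0].isupper() and t.lower() not in lowercase_seen:
            continue  # likely a name such as "Sarah" or "Koenig"
        kept.append(t.lower())
    return kept

sample = "Students drink a lot. Sarah Koenig asked the students about it."
print(drop_likely_proper_nouns(sample))
```

This heuristic still drops any capitalized word that happens never to appear in lowercase (including "I"), so it reduces rather than eliminates wrongly excluded words.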

Overall, the clustering seems to have worked quite well using the tf-idf vectorizer. It was interesting to see episodes grouped together that I recognize as covering similar topics. For example, cluster 2 had both of the patent episodes as well as the episode on the NUMMI car plant. Given the other topics, I thought this was a proper classification.
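
The impression that the clustering worked well could be backed up by comparing several values of k with a numeric score. The sketch below is only an illustration under assumptions: it uses synthetic 2-D points as a stand-in for `tfidf_matrix`, and it picks the silhouette score as the criterion, which is one option among several and not something the notebook computed.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in for tfidf_matrix: three well-separated groups of points.
rng = np.random.RandomState(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centers])

scores = {}
for k in range(2, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # Silhouette lies in [-1, 1]; higher means tighter, better-separated clusters.
    scores[k] = silhouette_score(X, model.labels_)

best_k = max(scores, key=scores.get)
for k, s in sorted(scores.items()):
    print(k, round(s, 3))
print("best k:", best_k)
```

On the real data, `X` would be replaced by `tfidf_matrix`; scores on sparse, high-dimensional text are typically much lower than on this toy example, so they are best read relative to each other.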

Part 2:

Experiment with Word2Vec to find related terms for terms in your collection.  I recommend using the large pre-trained collection that is in the notebook we discussed in class.

In [112]:
import gensim
from gensim.models import Word2Vec
from nltk.data import find

In [115]:
from nltk.corpus import brown
brown_model = gensim.models.Word2Vec(brown.sents())

# It might take some time to train the model. So, after it is trained, it can be saved as follows:

brown_model.save('brown.embedding')
new_model = gensim.models.Word2Vec.load('brown.embedding')

(a) Select five nouns of interest from your collection, and compare what WordNet finds as the first 3 synsets to what Word2Vec finds as the top 5 rated similar nouns (using the most_similar() function).  State which results are better for your collection in each case.  (You may use negative evidence if you like, by providing positive and negative example words.)

In [142]:
from nltk.corpus import wordnet as wn

In [157]:
nouns = ["students", "economy", "soldiers", "dogs", "hospital"]
for n in nouns:
    print(n)
    
    print("Word2Vec Results")
    print(brown_model.wv.most_similar(n)[:5])  # most_similar lives on .wv in current gensim
    print()
    
    print("WordNet Results")
    print(wn.synsets(n.lower())[:3])
    print()
    
    print("*" * 100)

students
Word2Vec Results
[('enroll', 0.7520958781242371), ('automobiles', 0.7386108040809631), ('matters', 0.7279121279716492), ('countries', 0.7262592315673828), ('counterparts', 0.7189865708351135)]

WordNet Results
[Synset('student.n.01'), Synset('scholar.n.01')]

****************************************************************************************************
economy
Word2Vec Results
[('outlook', 0.8700981736183167), ('union', 0.8654215335845947), ('acceptance', 0.8580105304718018), ('leadership', 0.853107213973999), ('commitment', 0.8510946035385132)]

WordNet Results
[Synset('economy.n.01'), Synset('economy.n.02'), Synset('economy.n.03')]

****************************************************************************************************
soldiers
Word2Vec Results
[('traders', 0.7417430281639099), ('shores', 0.6989766955375671), ('indignant', 0.69255131483078), ('French', 0.6835082173347473), ('girls', 0.6714829206466675)]

WordNet Results
[Synset('soldier.n.01'), Synset('sol

WordNet seems to provide a better result for each of the words because Word2Vec seems to have high recall but low precision. For example, for "students", Word2Vec returned one similar word: "enroll". However, the remaining words ("automobiles", "matters", "countries", and "counterparts") are not very similar to "students". The same pattern was apparent with the remaining nouns.
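
The precision claim can be made concrete with a small calculation. This is a minimal sketch: the function name `precision_at_k` is my own, and the relevance judgement is simply the manual one stated in the discussion for "students".

```python
def precision_at_k(neighbours, relevant, k=5):
    """Fraction of the top-k neighbours that were judged relevant."""
    top = neighbours[:k]
    return sum(1 for w in top if w in relevant) / len(top)

# Top-5 Word2Vec neighbours of "students" from the output above.
neighbours = ["enroll", "automobiles", "matters", "countries", "counterparts"]
# Manual judgement from the discussion: only "enroll" is related.
relevant = {"enroll"}

print(precision_at_k(neighbours, relevant))  # 0.2
```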

(b) Do the same for 5 adjectives.

In [155]:
adjectives = ["greatly", "economy", "soldiers", "dogs", "hospital"]

for a in adjectives:
    print(a)
    
    print("Word2Vec Results")
    print(brown_model.wv.most_similar(a)[:5])  # most_similar lives on .wv in current gensim
    print()
    
    print("WordNet Results")
    print(wn.synsets(a.lower())[:3])
    print()
    
    print("*" * 100)

greatly
Word2Vec Results
[('radar', 0.7545557022094727), ('guided', 0.7503126859664917), ('donated', 0.7400814890861511), ('exploited', 0.7389732599258423), ('investigated', 0.7284181118011475)]

WordNet Results
[Synset('greatly.r.01')]

****************************************************************************************************
economy
Word2Vec Results
[('outlook', 0.8700981736183167), ('union', 0.8654215335845947), ('acceptance', 0.8580105304718018), ('leadership', 0.853107213973999), ('commitment', 0.8510946035385132)]

WordNet Results
[Synset('economy.n.01'), Synset('economy.n.02'), Synset('economy.n.03')]

****************************************************************************************************
soldiers
Word2Vec Results
[('traders', 0.7417430281639099), ('shores', 0.6989766955375671), ('indignant', 0.69255131483078), ('French', 0.6835082173347473), ('girls', 0.6714829206466675)]

WordNet Results
[Synset('soldier.n.01'), Synset('soldier.n.02'), Synset('soldier.v.

Aside from "economy", Word2Vec seemed to struggle with adjectives as it did with the nouns. Its results seem to be tightly coupled to the dataset it is trained on, which in this case is the Brown corpus.

(c) Do the same for 5 verbs.

In [156]:
verbs = ["climb", "assume", "translated", "create", "interview"]
for v in verbs:
    print(v)
    
    print("Word2Vec Results")
    print(brown_model.wv.most_similar(v)[:5])  # most_similar lives on .wv in current gensim
    print()
    
    print("WordNet Results")
    print(wn.synsets(v.lower())[:3])
    print()
    
    print("*" * 100)

climb
Word2Vec Results
[('grow', 0.7233080863952637), ('add', 0.7189280986785889), ('decide', 0.7062013149261475), ('drain', 0.6854236125946045), ('throw', 0.6576958894729614)]

WordNet Results
[Synset('ascent.n.01'), Synset('climb.n.02'), Synset('climb.n.03')]

****************************************************************************************************
assume
Word2Vec Results
[('suggest', 0.8785237669944763), ('recognize', 0.8768250346183777), ('consider', 0.8590641021728516), ('realize', 0.8582191467285156), ('deny', 0.8558826446533203)]

WordNet Results
[Synset('assume.v.01'), Synset('assume.v.02'), Synset('assume.v.03')]

****************************************************************************************************
translated
Word2Vec Results
[('preventing', 0.7863573431968689), ('patches', 0.7138186693191528), ('extremely', 0.7026211023330688), ('outward', 0.6979726552963257), ('plentiful', 0.6964269876480103)]

WordNet Results
[Synset('translate.v.01'), Synset('tran

Finally, with verbs, WordNet still seems to be superior because Word2Vec still returns too many words that are dissimilar despite having high similarity scores.
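
For context on those scores: most_similar() ranks words by the cosine similarity between their vectors, so a high score only says that two vectors point in a similar direction. With a model trained on a small corpus, that may reflect a handful of shared contexts rather than shared meaning. The sketch below shows the ranking idea in plain Python; the 3-dimensional vectors are made up for illustration and are not taken from the trained model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up vectors, for illustration only.
vectors = {
    "climb": [0.9, 0.1, 0.2],
    "grow": [0.8, 0.2, 0.3],
    "radar": [0.1, 0.9, 0.1],
}

def most_similar(word, vectors, topn=2):
    """Rank every other word by cosine similarity to `word`."""
    query = vectors[word]
    scored = [(w, cosine(query, v)) for w, v in vectors.items() if w != word]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

print(most_similar("climb", vectors))
```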