The goal of this assignment is to give you an opportunity to get hands-on experience with the unsupervised methods we learned about.

Part 1:

(a) Experiment with either k-means clustering or LDA on your adopted document collection to try to find topics in the collection.   Be sure to try a few different values of k.  (If you want to use some other variant of clustering, that is fine.)

(b) Show your output in some easy-to-digest form.

(c) Discuss how well it did or did not work.

(d) (Optional) Compare to a WordNet grouping algorithm, such as those students came up with in the keyphrase assignment.

In [2]:
import numpy as np
import pandas as pd
import nltk
from nltk.stem.snowball import SnowballStemmer
import re
import os
import codecs
from sklearn import feature_extraction

In [3]:
stopwords = nltk.corpus.stopwords.words('english')
stemmer = SnowballStemmer("english")

In [4]:
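def tokenize_and_stem(text):
    # Split into sentences first, then into words, so punctuation becomes its own token.
    tokens = [word for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    # Keep only tokens that contain at least one letter (drops numbers and bare punctuation).
    filtered_tokens = [token for token in tokens if re.search('[a-zA-Z]', token)]
    # Reduce each remaining token to its Snowball stem.
    return [stemmer.stem(t) for t in filtered_tokens]


def tokenize_only(text):
    # Same tokenization and filtering as tokenize_and_stem, but without stemming,
    # so that stems can later be mapped back to a readable surface form.
    tokens = [word for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    return [token for token in tokens if re.search('[a-zA-Z]', token)]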

In [58]:
tal_text = pd.read_csv("../tal_stories/tal_text_broad.txt")
tal_text.columns = ["Episode Name", "Episode Transcript"]
tal_text.head()

Unnamed: 0,Episode Name,Episode Transcript
0,#1 Party School,"('#1 Party School', '<EPISODE NUMBER:396> <EPI..."
1,2010,"('2010', '<EPISODE NUMBER:397> <EPISODE NAME:2..."
2,Long Shot,"('Long Shot', '<EPISODE NUMBER:398> <EPISODE N..."
3,Contents Unknown,"('Contents Unknown', '<EPISODE NUMBER:399> <EP..."
4,Stories Pitched by Our Parents,"('Stories Pitched by Our Parents', '<EPISODE N..."


In [44]:
totalvocab_stemmed = []
totalvocab_tokenized = []
for each in tal_text["Episode Transcript"]:
    allwords_stemmed = tokenize_and_stem(each)
    totalvocab_stemmed.extend(allwords_stemmed)
    
    allwords_tokenized = tokenize_only(each)
    totalvocab_tokenized.extend(allwords_tokenized)

In [45]:
vocab_frame = pd.DataFrame({'words': totalvocab_tokenized}, index=totalvocab_stemmed)
print('there are ' + str(vocab_frame.shape[0]) + ' items in vocab_frame')

there are 1098587 items in vocab_frame


In [48]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer(max_df=.8, max_features=200000,
                                   min_df=.2, stop_words='english',
                                   use_idf=True, tokenizer=tokenize_and_stem,
                                   ngram_range=(1, 3))

%time tfidf_matrix = tfidf_vectorizer.fit_transform(tal_text["Episode Transcript"])

print(tfidf_matrix.shape)

CPU times: user 1min 19s, sys: 858 ms, total: 1min 19s
Wall time: 1min 30s
(101, 2228)


In [49]:
from sklearn.cluster import KMeans
num_clusters = 5

km = KMeans(n_clusters=num_clusters)

%time km.fit(tfidf_matrix)

clusters = km.labels_.tolist()

CPU times: user 343 ms, sys: 5.4 ms, total: 348 ms
Wall time: 719 ms


In [50]:
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23

# joblib.dump(km, 'cluster_algo1.pkl')

km = joblib.load('cluster_algo1.pkl')

['cluster_algo1.pkl', 'cluster_algo1.pkl_01.npy', 'cluster_algo1.pkl_02.npy']

In [62]:
tal_text["Cluster"] = clusters
tal_text["Cluster"].value_counts()

0    33
3    23
1    19
2    16
4    10
dtype: int64
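
Part 1(a) asks for a few different values of k, while the run above uses only k = 5. The following is a minimal sketch of such a sweep, assuming scikit-learn and NumPy are available. The helper name `sweep_k`, the candidate values of k, and the choice of silhouette score as the comparison metric are illustrative assumptions, not part of the original notebook, and the toy matrix merely stands in for `tfidf_matrix`.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def sweep_k(matrix, k_values, seed=0):
    """Fit KMeans once per k and return {k: (inertia, silhouette)}."""
    results = {}
    for k in k_values:
        km = KMeans(n_clusters=k, n_init=10, random_state=seed)
        labels = km.fit_predict(matrix)
        results[k] = (km.inertia_, silhouette_score(matrix, labels))
    return results


# Toy stand-in for tfidf_matrix: three well-separated blobs (made-up data).
rng = np.random.RandomState(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
toy = np.vstack([c + 0.1 * rng.randn(20, 2) for c in centers])

scores = sweep_k(toy, k_values=[2, 3, 4, 5])
for k, (inertia, sil) in scores.items():
    print("k=%d  inertia=%.2f  silhouette=%.3f" % (k, inertia, sil))
```

On real transcripts the silhouette scores are likely to be much lower and flatter than on this toy data, so the numbers are best read alongside the top terms per cluster rather than on their own.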

In [65]:
print("Top terms per cluster:")
print()
order_centroids = km.cluster_centers_.argsort()[:, ::-1]
terms = tfidf_vectorizer.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2

for i in range(num_clusters):
    print("Cluster %d words:" % i, end='')
    
    for ind in order_centroids[i, :20]:
        # .ix was removed from pandas; .loc performs the same label-based lookup here.
        print(" %s" % vocab_frame.loc[terms[ind].split(' ')].values.tolist()[0][0].encode('utf-8', 'ignore'), end=",")
    print()
    print()
    
    print("Cluster %d Episode Names:" %i, end='')
    for title in tal_text[tal_text["Cluster"] == i]['Episode Name']:
        print(' %s' % title, end='\n')
    print()
    print()
    
print()
print()


### Discussion

After the first run of the clustering algorithm, it is clear that most of the clusters are dominated by proper nouns, mainly the names of individuals.

```
Top terms per cluster:

Cluster 0 words: b'interviews', b'interviews', b'Paul', b'Mike', b'Adam', b'drug', b'David', b'students', b'game', b'girls', b'Wall', b'parents', b'dad', b'police', b'court', b'JAMES', b'dog', b'cells', b'marked', b'cancer',

Cluster 0 Episode Names: Long Shot
 Contents Unknown
 Parent Trap
 Save the Day
 Enemy Camp 2010
 True Urban Legends 
 Island Time 
 Held Hostage
 First Contact
 Million Dollar Idea
 Neighborhood Watch
 Kid Politics
 Slow To React
 Tough Room 2011
 Very Tough Love
 Know When To Fold 'Em
 Amusement Park
 Living Without (2011)
 So Crazy It Just Might Work
 Poultry Slam 2011
 Nemeses
 Mr. Daisey and the Apple Factory
 What I Did For Love
 Retraction
 Invisible Made Visible
 Americans in China
 Hiding in Plain Sight
 The Convert
 Back to School
 This Week
 Lights, Camera, Christmas!
 Self-Improvement Kick 
 Valentine's Day


Cluster 1 words: b'Alex', b'Alex', b'Blumberg', b'David', b'company', b'Adam', b'governments', b'banks', b'economy', b'millions', b'Economic', b'crisis', b'lawyer', b'billion', b'created', b'frankly', b'dollars', b'buying', b'fed', b'governor',

Cluster 1 Episode Names: 2010
 NUMMI
 Inside Job
 Social Contract
 Crybabies
 Toxie
 The Invention of Money
 How To Create a Job
 When Patents Attack!
 Adventure!
 Continental Breakup
 Take the Money and Run for Office
 Mortal Vs. Venial
 Our Friend David
 Loopholes
 Getting Away With It
 Trends With Benefits
 Tribes
 When Patents Attack... Part Two!


Cluster 2 words: b'Sarah', b'Sarah', b'Koenig', b'students', b'dad', b'Steve', b'interviews', b'interviews', b'Lisa', b'Pollak', b'Lisa', b'tax', b'MALE', b'football', b'town', b'drink', b'Jane', b'Jane', b'Matt', b'father',

Cluster 2 Episode Names: #1 Party School
 Stories Pitched by Our Parents
 Georgia Rambler
 Petty Tyrant
 Last Man Standing
 Game Changer
 Gossip
 Middle School
 Back to Penn State
 What Kind of Country
 Blackjack
 Switcheroo
 Red State Blue State
 Surrogates
 No Coincidence, No Story!
 Dr. Gilmer and Mr. Hyde


Cluster 3 words: b'Ben', b'John', b'Calhoun', b'Ben', b'mom', b'interviews', b'interviews', b'Republicans', b'dad', b'police', b'David', b'hospital', b'I\\', b"'m", b'Parties', b'doctor', b'gun', b'Christmas', b'students', b'Jonathan',

Cluster 3 Episode Names: Right to Remain Silent
 This Party Sucks
 Comedians of Christmas Comedy Special
 Original Recipe
 See No Evil
 This Week
 The Psychopath Test
 Old Boys Network
 Father's Day 2011
 A House Divided
 Ten Years In
 The Incredible Case of the P.I. Moms
 Reap What You Sow
 Play the Part
 Own Worst Enemy
 Show Me The Way
 What Doesn't Kill You
 Little War on the Prairie
 Doppelgängers
 Harper High School, Part One
 Harper High School, Part Two
 Hit the Road
 Hot In My Backyard


Cluster 4 words: b'Nancy', b'Nancy', b'Updike', b'translator', b'soldiers', b'Army', b'dog', b'SPANISH', b'military', b'Sarah', b'boys', b'war', b'arrested', b'father', b'son', b'Brian', b'Reed', b'Brian', b'drug', b'interpreting',

Cluster 4 Episode Names: The Bridge
 Iraq After Us
 Oh You Shouldn't Have
 Will They Know Me Back Home?
 Fine Print 2011
 Thugs
 What Happened At Dos Erres
 Send a Message
 Animal Sacrifice
 Picture Show
```
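
Since names such as Sarah, Koenig, and Alex dominate the term lists above, one option is to filter likely proper nouns before vectorizing. The following is a rough sketch of a capitalization heuristic using only the standard library. The function names `likely_proper_nouns` and `drop_proper_nouns` are hypothetical, the sample text is made up, and the rule (a word counts as a name if it is capitalized away from a sentence start and never appears in lowercase) is an assumption that will also catch the pronoun "I" and all-caps speaker tags. A part-of-speech tagger such as `nltk.pos_tag` would likely be a more robust alternative.

```python
import re
from collections import Counter

TOKEN_RE = re.compile(r"[A-Za-z]+(?:'[a-z]+)?")
SENT_SPLIT_RE = re.compile(r"(?<=[.!?])\s+")


def likely_proper_nouns(text):
    """Return lowercased words that are capitalized mid-sentence and never seen in lowercase."""
    capitalized = Counter()
    lowercase = Counter()
    for sentence in SENT_SPLIT_RE.split(text):
        for position, token in enumerate(TOKEN_RE.findall(sentence)):
            if token.islower():
                lowercase[token] += 1
            elif position > 0 and token[0].isupper():
                # Sentence-initial words are skipped: their capital letter is uninformative.
                capitalized[token.lower()] += 1
    return {word for word in capitalized if lowercase[word] == 0}


def drop_proper_nouns(text):
    """Return the text's tokens, space-joined, with likely proper nouns removed."""
    names = likely_proper_nouns(text)
    kept = [t for t in TOKEN_RE.findall(text) if t.lower() not in names]
    return " ".join(kept)


# Made-up sample text, for illustration only.
sample = ("Today Sarah Koenig talked to students. "
          "The students said Sarah was kind. Students like football.")
print(sorted(likely_proper_nouns(sample)))
print(drop_proper_nouns(sample))
```

The cleaned text could then be passed to the vectorizer in place of the raw transcript, and the clustering rerun to see whether the top terms become more topical.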

Part 2:

Experiment with Word2Vec to find related terms for terms in your collection.  I recommend using the large pre-trained collection that is in the notebook we discussed in class.  

(a) Select five nouns of interest from your collection, and compare what WordNet finds as the first 3 synsets to what Word2Vec finds as the top 5 rated similar nouns (using the most_similar() function). State which results are better for your collection in each case. (You may use negative evidence if you like, by providing positive and negative example words.)

(b) Do the same for 5 adjectives.

(c) Do the same for 5 verbs.
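
To make the positive/negative option in Part 2 concrete, here is a toy sketch of how a `most_similar`-style query can be computed: unit-normalize the word vectors, add the positive ones and subtract the negative ones, then rank the remaining words by cosine similarity to the result. This is a simplified re-implementation for illustration, not the Word2Vec library's own code, and the 3-dimensional vectors are made up rather than taken from a pre-trained model.

```python
import math


def _unit(vec):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def most_similar(vectors, positive, negative=(), topn=5):
    """Rank words by cosine similarity to (sum of positive) - (sum of negative) unit vectors."""
    unit = {word: _unit(vec) for word, vec in vectors.items()}
    dims = len(next(iter(unit.values())))
    query = [0.0] * dims
    for word in positive:
        query = [q + x for q, x in zip(query, unit[word])]
    for word in negative:
        query = [q - x for q, x in zip(query, unit[word])]
    query = _unit(query)
    # The query words themselves are excluded from the ranking.
    exclude = set(positive) | set(negative)
    scored = [(word, sum(q * x for q, x in zip(query, vec)))
              for word, vec in unit.items() if word not in exclude]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


# Made-up 3-d vectors, for illustration only.
toy_vectors = {
    "bank":    [0.9, 0.1, 0.0],
    "money":   [0.8, 0.2, 0.1],
    "economy": [0.7, 0.3, 0.0],
    "river":   [0.1, 0.9, 0.1],
    "soldier": [0.0, 0.1, 0.9],
}
plain = most_similar(toy_vectors, positive=["bank"], topn=2)
with_negative = most_similar(toy_vectors, positive=["bank"], negative=["money"], topn=3)
print(plain)
print(with_negative)
```

Unlike WordNet synsets, this ranking depends entirely on the vectors, so the same word can get quite different neighbors from different training corpora.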