Part 1: Experiment with either k-means clustering or LDA on your adopted document collection to try to find topics in the collection. Be sure to try a few different values of k. (If you want to use some other variant of clustering, that is fine.)



### Imports + Read File

In [2]:
# below code adapted from Brandon Rose's blog post on clustering: 
# http://brandonrose.org/clustering

import numpy as np
import pandas as pd
import nltk
import re
import os
import codecs
from sklearn import feature_extraction

In [64]:
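def read_file(filename):
    # read the CSV file into a DataFrame
    data = pd.read_csv(filename, low_memory=False, delimiter=',', encoding="ISO-8859-1", na_values=['\n'])
    # null rows are NOT filtered: the line below is disabled, and notnull() alone
    # only builds a boolean mask, so empty cells stay in the list as NaN (floats)
#     filtered_data = data["ExtractedBodyText"].notnull()
    bodytext = data.ExtractedBodyText.tolist()
    return bodytext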

In [65]:
data_list_1 = read_file('email_out1.csv')
data_list_2 = read_file('email_out2.csv')
data_list_3 = read_file('email_out3.csv')

# list of body texts from all three files (300 emails)
bodytext_all = data_list_1 + data_list_2 + data_list_3
    

In [74]:
bodytext_all[:3]

['Pls find a copy of the movie "Pray the Devil Back to Hell" about the war in Liberia.\nAdd Gloria Steinem to my call list.\nPls find a copy of the new Mary Pipher book, "Seeking Peace: Chronicles of the Worst Buddhist in the World."',
 'Why did they call you?',
 'Thank you so much.']

### Stopwords, stemming, tokenizing

In [49]:
# load nltk's English stopwords as variable called 'stopwords'
stopwords = nltk.corpus.stopwords.words('english')

# load nltk's SnowballStemmer as variable 'stemmer'
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer("english")

In [50]:
# here I define a tokenizer and stemmer that returns the list of stems in the text it is passed

def tokenize_and_stem(text):
    # first tokenize by sentence, then by word to ensure that punctuation is caught as its own token
    tokens = [word for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    filtered_tokens = []
    # filter out any tokens not containing letters (e.g., numeric tokens, raw punctuation)
    for token in tokens:
        if re.search('[a-zA-Z]', token):
            filtered_tokens.append(token)
    stems = [stemmer.stem(t) for t in filtered_tokens]
    return stems


def tokenize_only(text):
    # first tokenize by sentence, then by word to ensure that punctuation is caught as its own token
    tokens = [word.lower() for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    filtered_tokens = []
    # filter out any tokens not containing letters (e.g., numeric tokens, raw punctuation)
    for token in tokens:
        if re.search('[a-zA-Z]', token):
            filtered_tokens.append(token)
    return filtered_tokens
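
To see what the letter filter in both tokenizers keeps and drops, here is a minimal sketch of that one step in isolation. The helper name `keep_alpha_tokens` and the hand-tokenized sample list are hypothetical; the sample is written out by hand rather than produced by nltk, so the snippet runs without any nltk data.

```python
import re

def keep_alpha_tokens(tokens):
    """Keep only tokens that contain at least one ASCII letter."""
    return [t for t in tokens if re.search('[a-zA-Z]', t)]

# Hypothetical, hand-tokenized input (not produced by nltk here).
sample_tokens = ['Add', 'Gloria', 'Steinem', 'to', 'my', 'call', 'list',
                 '.', '2009', "n't", '--']

kept = keep_alpha_tokens(sample_tokens)
print(kept)
```

Pure punctuation and purely numeric tokens are dropped, while a contraction fragment such as `n't` survives because it contains a letter.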

In [67]:
totalvocab_stemmed = []
totalvocab_tokenized = []
for i in bodytext_all:
    allwords_stemmed = tokenize_and_stem(i)
    totalvocab_stemmed.extend(allwords_stemmed)
    
    allwords_tokenized = tokenize_only(i)
    totalvocab_tokenized.extend(allwords_tokenized)

TypeError: expected string or bytes-like object
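
This error is consistent with `bodytext_all` containing entries that are not strings. A likely cause, though not confirmed against the CSV files here, is that pandas reads empty `ExtractedBodyText` cells as NaN, which is a float, and the tokenizers cannot process it. A minimal sketch of one possible guard follows; the helper name `keep_strings` and the sample list are hypothetical.

```python
def keep_strings(texts):
    """Return only the entries that are real strings, dropping NaN, None, etc."""
    return [t for t in texts if isinstance(t, str)]

# Hypothetical sample mimicking a column with missing values.
sample = ['Thank you so much.', float('nan'), 'Why did they call you?', None]

cleaned = keep_strings(sample)
print(cleaned)
```

Under that assumption, calling `bodytext_all = keep_strings(bodytext_all)` before the loop should let it run. An alternative is to drop the missing rows inside `read_file` with `data.ExtractedBodyText.dropna().tolist()`, which changes the number of emails in the list.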

Part 2: Experiment with Word2Vec to find related terms for terms in your collection. I recommend using the large pre-trained collection that is in the notebook we discussed in class. You can do this one of two ways. Either follow the instructions shown below, or come up with your own way to explore with it.

(a) Select five nouns of interest from your collection, and compare what WordNet finds as the first 3 synsets to what Word2Vec finds as the top 5 rated similar nouns (using the most_similar() function). State which results are better for your collection in each case. (You may use negative evidence if you like, by providing positive and negative example words.)

(b) Do the same for 5 adjectives.

(c) Do the same for 5 verbs.
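
As background for reading the `most_similar()` results, the ranking is based on cosine similarity between word vectors. The sketch below illustrates that idea on a toy scale; it is not the pre-trained model from class. The function `most_similar_toy` and the three-dimensional vectors are invented for illustration only.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar_toy(word, vectors, topn=5):
    """Rank every other word by cosine similarity to `word` (toy stand-in)."""
    target = vectors[word]
    scored = [(other, cosine(target, vec))
              for other, vec in vectors.items() if other != word]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]

# Invented vectors; real Word2Vec vectors have hundreds of dimensions.
toy_vectors = {
    'email':   [0.9, 0.1, 0.0],
    'message': [0.8, 0.2, 0.1],
    'letter':  [0.7, 0.3, 0.0],
    'war':     [0.0, 0.1, 0.9],
}

ranked = most_similar_toy('email', toy_vectors, topn=2)
print(ranked)
```

Words whose vectors point in nearly the same direction rank highest, which is why Word2Vec tends to return terms used in similar contexts rather than strict synonyms as in WordNet synsets.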