# Vector Semantics
Please write an iPython notebook that represents words and articles in the Reuters corpus as vectors and clusters the article vectors. Please use the nltk.corpus.reuters "training" documents (as shown in reuters.fileids()). Get rid of all documents in more than one class. From that, get rid of of all documents that aren't one of the 5 classes.

`ship, trade, interest, money-fx, crude`   

1. Create a term-document matrix containing a row for every word in the corpus vocabulary and a column for each document, where each entry is the tf-idf score of a word for a document.
2. Reduce the size of the matrix. Compute the maximum tf-idf score for each word and keep the 500 rows with the top 500 maxima. Did that remove the maximum tf-idf score of any column? Comment. 
3. Cluster the document vectors into five clusters using an unsupervised algorithm like k-means. Create a 5x5 matrix that compares each cluster to the each of the above five categories, using the Jaccard Index (see below). Comment.
4. Try clustering the words and comparing those clusters to the categories, too. Comment on the results.

The Jaccard Index compares two sets A and B using the formula

J(A,B) = |A intersect B| / |A union B|

In [81]:
import pandas as pd
import numpy as np
import nltk
from nltk.corpus import reuters
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize 
from nltk.stem import SnowballStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from collections import Counter
from sklearn.cluster import KMeans

In [143]:
# read in training files
trainfile = []
testfile = []
for filetype in ['ship', 'trade', 'crude', 'money-fx', 'interest']:
    for fileid in nltk.corpus.reuters.fileids(filetype):
        if len(nltk.corpus.reuters.categories(fileid)) == 1:
            if 'train' in fileid:
                trainfile.append(fileid)
            elif 'test' in fileid:
                testfile.append(fileid)
print(len(trainfile))
print(len(testfile))

1024
401


In [83]:
#Define the stopwords to remove and the stemming tool
stop_words = set(stopwords.words('english'))
stop_words.update(['.', ',', '"', "'", ':', ';', '(', ')', '[', ']', '{', '}'])
stemmer = SnowballStemmer('english')

In [84]:
#Preprocess the text in training and testing
processed_train = []
for doc in trainfile:
    tokens = reuters.words(doc)
    filtered = [word for word in tokens if word not in stop_words]
    stemmed = [stemmer.stem(word) for word in filtered]
    processed_train.append(stemmed)
    
# processed_test = []
# for doc in testfile:
#     tokens = reuters.words(doc)
#     filtered = [word for word in tokens if word not in stop_words]
#     stemmed = [stemmer.stem(word) for word in filtered]
#     processed_test.append(stemmed)


## 1. Create the term-document matrix

In [85]:
#Get all words appear in processed_train 
total_doc_contain_word = Counter()
for doc in processed_train:
    count_words = Counter(doc)
    total_doc_contain_word.update(count_words)

In [86]:
#Calculate TF for each word and doc
all_tf = []
for doc in processed_train:
    tf=[]
    total_words = len(doc)
    for word in total_doc_contain_word:
        if word not in doc:
            tf.append(0)
        else:
            tf.append(count_words[word] / total_words)
    all_tf.append(tf)


In [87]:
#Calculate IDF for each word
idf = []
all_doc = len(processed_train)
for word in total_doc_contain_word:
    thisidf = np.log(all_doc / (total_doc_contain_word[word]+1))
    idf.append(thisidf)

In [88]:
#Calculate TF-IDF 
doc_tfidf = []
for doc in all_tf:
    tf_idf = [a*b for a,b in zip(doc,idf)]
    doc_tfidf.append(tf_idf)

In [89]:
#reshape tfidf matrix to make it word * doc
tfidf = np.array(doc_tfidf)
tfidf = [*zip(*tfidf)]
tfidf = np.array(tfidf)

In [90]:
# Get the final tf-idf matrix
column_names = trainfile
row_names    = list(total_doc_contain_word)
tfidf_matrix = pd.DataFrame(tfidf, columns=column_names, index=row_names)
print(tfidf_matrix)

             training/10302  training/10388  training/10391  training/10394  \
unit               0.000000        0.000000        0.000000        0.000000   
state              0.000000        0.000000        0.000000        0.000000   
line               0.000000        0.000000        0.000000        0.000000   
lay                0.000000        0.000000        0.000000        0.000000   
off                0.000000        0.000000        0.000000        0.000000   
far                0.000000        0.000000        0.000000        0.000000   
east               0.000000        0.000000        0.000000        0.000000   
staff              0.000000        0.000000        0.000000        0.000000   
&                  0.000000        0.000000        0.000000        0.000000   
lt                 0.000000        0.000000        0.000000        0.000000   
inc                0.000000        0.000000        0.000000        0.000000   
>                  0.000000        0.000000        0

## 2. Reduce the matrix

In [91]:
tfidf_matrix['max_value'] = tfidf_matrix.max(axis=1)
print(tfidf_matrix)

             training/10302  training/10388  training/10391  training/10394  \
unit               0.000000        0.000000        0.000000        0.000000   
state              0.000000        0.000000        0.000000        0.000000   
line               0.000000        0.000000        0.000000        0.000000   
lay                0.000000        0.000000        0.000000        0.000000   
off                0.000000        0.000000        0.000000        0.000000   
far                0.000000        0.000000        0.000000        0.000000   
east               0.000000        0.000000        0.000000        0.000000   
staff              0.000000        0.000000        0.000000        0.000000   
&                  0.000000        0.000000        0.000000        0.000000   
lt                 0.000000        0.000000        0.000000        0.000000   
inc                0.000000        0.000000        0.000000        0.000000   
>                  0.000000        0.000000        0

In [92]:
# Keep the 500 rows with the top 500 maxima for the tfidf matrix.
tfidf_reduced_matrix = tfidf_matrix.nlargest(500, 'max_value')

I think it removes the maximum tf-idf score of some columns, because there are 1306 docs and there are definitely some docs' maximum tf-idf's associated words are not in these top 500 words.

## 3. Cluster the document vectors into five clusters 
Using an unsupervised algorithm like k-means. Create a 5x5 matrix that compares each cluster to the each of the above five categories, using the Jaccard Index (see below). Comment.

In [93]:
# delete the last max_value column
tfidf_reduced_matrix = tfidf_reduced_matrix.drop('max_value', 1)

In [94]:
#reshape tfidf_reduced_matrix matrix to make it doc * word
matrix = tfidf_reduced_matrix.transpose()

In [100]:
# use k-means to cluster docs into five clusters
km = KMeans(n_clusters=5,init='k-means++')
km.fit(matrix)
print(km.labels_)
print(km.cluster_centers_)

[4 4 4 ... 3 2 3]
[[2.66220821e-03 2.95402284e-02 5.72999740e-02 ... 0.00000000e+00
  0.00000000e+00 0.00000000e+00]
 [2.32479234e-01 0.00000000e+00 8.16687358e-02 ... 0.00000000e+00
  0.00000000e+00 0.00000000e+00]
 [2.56152834e-01 2.35357880e-01 1.91472156e-01 ... 0.00000000e+00
  0.00000000e+00 0.00000000e+00]
 [9.88571889e-02 9.08317823e-02 7.38949430e-02 ... 0.00000000e+00
  0.00000000e+00 0.00000000e+00]
 [6.52221878e-04 1.23917244e-03 1.84660602e-04 ... 0.00000000e+00
  0.00000000e+00 0.00000000e+00]]


In [162]:
doc_cate=[[], [], [], [], []]
for i in range(len(km.labels_)):
    if km.labels_[i] == 0:
        doc_cate[0].append(trainfile[i])
    elif km.labels_[i] == 1:
        doc_cate[1].append(trainfile[i])
    elif km.labels_[i] == 2:
        doc_cate[2].append(trainfile[i])
    elif km.labels_[i] == 3:
        doc_cate[3].append(trainfile[i])
    elif km.labels_[i] == 4:
        doc_cate[4].append(trainfile[i])
#print(len(doc_cate))

9


In [163]:
train_cate = [[],[],[],[],[]]
for file in trainfile:
    if reuters.categories(file) == ['ship']:
        train_cate[0].append(file)
    elif reuters.categories(file) == ['trade']:
        train_cate[1].append(file)
    elif reuters.categories(file) == ['interest']:
        train_cate[2].append(file)
    elif reuters.categories(file) == ['money-fx']:
        train_cate[3].append(file)
    elif reuters.categories(file) == ['crude']:
        train_cate[4].append(file)
#print(len(train_cate))

5


In [170]:
# get intersection between clustered categories and predefinied categories.
intersect = np.zeros(shape=(5,5))
for i in range(len(doc_cate)):
    for j in range(len(train_cate)):
        intersect[i][j]=len(list(set(doc_cate[i]) & set(train_cate[j])))

In [172]:
# get union between clustered categories and predefinied categories.
union = np.zeros(shape=(5,5))
for i in range(len(doc_cate)):
    for j in range(len(train_cate)):
        union[i][j]=len(list(set(doc_cate[i]) | set(train_cate[j])))

In [178]:
# Create the 5x5 jaccard matrix that compares each cluster to the each of the above five categories
jaccard = intersect/union
column_names = ['ship', 'trade', 'crude', 'money-fx', 'interest']
row_names    = ['cate0','cate1','cate2','cate3','cate4']
jaccard = pd.DataFrame(jaccard, columns=column_names, index=row_names)
print(jaccard)

           ship     trade     crude  money-fx  interest
cate0  0.000000  0.011070  0.059113  0.037975  0.000000
cate1  0.000000  0.000000  0.010309  0.013393  0.000000
cate2  0.000000  0.000000  0.020408  0.022124  0.000000
cate3  0.000000  0.000000  0.035714  0.021834  0.000000
cate4  0.110883  0.252815  0.166166  0.200803  0.259754


The clustering is pretty bad since it clustered the majority docs into one category and there are a lot of zeros for the jaccard index. I think if I normalize the tf-idf matrix before doing the k-means clustering, the result would be better.

## 4. Cluster the words and comparing those clusters to the categories

In [107]:
# use k-means to cluster words into five clusters
km2 = KMeans(n_clusters=5,init='k-means++')
km2.fit(tfidf_reduced_matrix)
print(km2.labels_)
print(km2.cluster_centers_)

[2 1 3 4 0 4 4 0 0 0 4 0 0 4 4 0 4 4 0 4 0 4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 

In [185]:
# get the word categories
wordlist = list(tfidf_reduced_matrix.index.values)
word_cate=[[], [], [], [], []]
for i in range(len(km2.labels_)):
    if km2.labels_[i] == 0:
        word_cate[0].append(wordlist[i])
    elif km2.labels_[i] == 1:
        word_cate[1].append(wordlist[i])
    elif km2.labels_[i] == 2:
        word_cate[2].append(wordlist[i])
    elif km2.labels_[i] == 3:
        word_cate[3].append(wordlist[i])
    elif km2.labels_[i] == 4:
        word_cate[4].append(wordlist[i])
print(len(word_cate[0]),len(word_cate[1]), len(word_cate[2]), len(word_cate[3]), len(word_cate[4]))
# This clustering is terrible. Most of the words go to one category while the other four categories have only a few words.

487 1 1 1 10


In [193]:
# find the assigned category for each word by using the doc's category where the word has the maximum ti-idf for that doc.
docname = tfidf_reduced_matrix.idxmax(axis=1)
docname = list(np.array(docname))
word_train_cate=[]
for file in docname:
    word_train_cate.append(reuters.categories(file))
#print(len(word_train_cate))

500


In [194]:
train_word_cate = [[],[],[],[],[]]
for i in range(len(word_train_cate)):
    if word_train_cate[i] == ['ship']:
        train_word_cate[0].append(wordlist[i])
    elif word_train_cate[i] == ['trade']:
        train_word_cate[1].append(wordlist[i])
    elif word_train_cate[i] == ['interest']:
        train_word_cate[2].append(wordlist[i])
    elif word_train_cate[i] == ['money-fx']:
        train_word_cate[3].append(wordlist[i])
    elif word_train_cate[i] == ['crude']:
        train_word_cate[4].append(wordlist[i])
print(len(train_word_cate[0]),len(train_word_cate[1]), len(train_word_cate[2]), len(train_word_cate[3]), len(train_word_cate[4]))

472 1 12 14 1


In [195]:
# get intersection between clustered categories and predefinied categories.
intersect2 = np.zeros(shape=(5,5))
for i in range(len(word_cate)):
    for j in range(len(train_word_cate)):
        intersect2[i][j]=len(list(set(word_cate[i]) & set(train_word_cate[j])))
print(intersect2)

[[472.   1.   5.   8.   1.]
 [  0.   0.   0.   1.   0.]
 [  0.   0.   0.   1.   0.]
 [  0.   0.   0.   1.   0.]
 [  0.   0.   7.   3.   0.]]


In [196]:
# get union between clustered categories and predefinied categories.
union2 = np.zeros(shape=(5,5))
for i in range(len(word_cate)):
    for j in range(len(train_word_cate)):
        union2[i][j]=len(list(set(word_cate[i]) | set(train_word_cate[j])))
print(union2)

[[487. 487. 494. 493. 487.]
 [473.   2.  13.  14.   2.]
 [473.   2.  13.  14.   2.]
 [473.   2.  13.  14.   2.]
 [482.  11.  15.  21.  11.]]


In [197]:
jaccard2 = intersect2/union2
column_names = ['ship', 'trade', 'crude', 'money-fx', 'interest']
row_names    = ['category0','category1','category2','category3','category4']
jaccard2 = pd.DataFrame(jaccard2, columns=column_names, index=row_names)
print(jaccard2)

               ship     trade     crude  money-fx  interest
category0  0.969199  0.002053  0.010121  0.016227  0.002053
category1  0.000000  0.000000  0.000000  0.071429  0.000000
category2  0.000000  0.000000  0.000000  0.071429  0.000000
category3  0.000000  0.000000  0.000000  0.071429  0.000000
category4  0.000000  0.000000  0.466667  0.142857  0.000000


This clustering for words is also terrible. I tried to assign a category to each word by using the doc's category where the word has the maximum ti-idf for that doc, but even that made a lot of words go into the 'ship' category (472 out of 500). The k-means clustering also clustered 487 out of 500 words into one category. The jaccard index shows that at least the the k-means clustering and the assigned clusters have somewhat similarity by having 0.969199 between category0 and 'ship'.