# LSA

© Anatolii Stehnii, 2018

The main goal of this assignment is to give you a basic understanding of how Latent Semantic Analysis (LSA) is performed and how to interpret its results. In this work you will calculate word embeddings using LSA for an arbitrary text and explore their properties.

Code from [Yuri Guts's Thrones2Vec](https://github.com/YuriyGuts/thrones2vec/blob/master/Thrones2Vec.ipynb) is used.

In [154]:
%env LC_ALL=en_US.UTF-8
%env LANG=en_US.UTF-8

import os
import re

import nltk
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline
plt.rcParams['figure.figsize'] = [10, 10]

env: LC_ALL=en_US.UTF-8
env: LANG=en_US.UTF-8


## Corpus preprocessing

First, download any text and save it as `../data/part1/corpus.txt` (the path the next cell reads from). I recommend using an English text, but if you feel brave enough, you can set up a pipeline for any other language.

In [162]:
data_dir = "../data/part1"
corpus_file = os.path.join(data_dir, "corpus.txt")

with open(corpus_file, "r", encoding="utf-8") as f:
    corpus = f.read()
print("Corpus is {0} characters long.".format(len(corpus)))

Corpus is 9748084 characters long.


We need to split our corpus into documents and split each document into terms. Use the English tokenizers from NLTK or write your own functions.

In [4]:
nltk.download("punkt")
nltk_english = nltk.data.load('tokenizers/punkt/english.pickle')
# use this or define your own document_tokenize
document_tokenize = lambda corpus: nltk_english.tokenize(corpus)

# use this or define your own word_tokenize
from nltk import word_tokenize

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/anatolii.stehnii/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [163]:
# split the corpus into documents
documents_raw = document_tokenize(corpus)
# split each document into tokens
documents = list(map(word_tokenize, documents_raw))
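
If you would rather define your own tokenizers, as the text above allows, the following is a minimal sketch based on regular expressions. The names `simple_document_tokenize` and `simple_word_tokenize` are hypothetical and not part of NLTK. The sentence splitter is deliberately naive: it breaks after every `.`, `!` or `?` followed by whitespace, so abbreviations such as "Mr." will be split incorrectly.

```python
import re

# Hypothetical stand-ins for document_tokenize / word_tokenize (not NLTK functions).
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")
_WORD = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)?")


def simple_document_tokenize(text):
    """Split text into sentences at whitespace that follows '.', '!' or '?'."""
    return [s.strip() for s in _SENTENCE_END.split(text) if s.strip()]


def simple_word_tokenize(sentence):
    """Return runs of letters, keeping an internal apostrophe (e.g. "don't")."""
    return _WORD.findall(sentence)


sample = "He was an old man, past fifty. He had seen the lordlings come and go!"
sample_docs = simple_document_tokenize(sample)
sample_tokens = [simple_word_tokenize(d) for d in sample_docs]
print(sample_docs)
print(sample_tokens[0])
```

To use them in the notebook, assign `document_tokenize = simple_document_tokenize` and `word_tokenize = simple_word_tokenize` before running the splitting cell.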

Add your own preprocessing if needed, or use the default:

In [164]:
from nltk.stem import PorterStemmer
ps = PorterStemmer()

# default preprocessing: keep only alphabetic characters, lowercase, and stem each token
# redefine this function if you need a different pipeline
def preprocess(word):
    clean = re.sub("[^a-zA-Z]","", word)
    clean_lower = clean.lower()
    return ps.stem(clean_lower)

In [165]:
# replace each word with preprocessed
documents = [[preprocess(word) for word in document] for document in documents]
# filter empty tokens
documents = [[word for word in document if word] for document in documents]

In [166]:
# Compare raw and preprocessed documents
print("Raw document: {}\n".format(documents_raw[5]))
print("Preprocessed document: {}\n".format(documents[5]))
print("-"*80)
token_count = sum([len(document) for document in documents])
print("The corpus contains {0:,} tokens in {1:,} documents".format(token_count, len(documents)))

Raw document: He was an old man, past fifty, and he had seen the lordlings come and go.

Preprocessed document: ['he', 'wa', 'an', 'old', 'man', 'past', 'fifti', 'and', 'he', 'had', 'seen', 'the', 'lordl', 'come', 'and', 'go']

--------------------------------------------------------------------------------
The corpus contains 1,807,559 tokens in 158,726 documents


Next, we filter rare words out of the documents. We cannot meaningfully analyze a word that occurs only a few times in the text. Define a threshold for the minimum number of occurrences; words that occur less often are treated as rare and removed.

In [167]:
min_word_count = 3

from collections import Counter

# count the occurrences of every token across all documents
words_count = Counter(w for d in documents for w in d)

rare_words = {w for w, count in words_count.items() if count < min_word_count}
print("The corpus vocabulary contains {0:,} unique words, {1:,} with fewer than {2:} occurrences.".format(len(words_count), len(rare_words), min_word_count))

The corpus vocabulary contains 16,232 unique words, 6,080 with fewer than 3 occurrences.


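Define your own set of stop words or use the default set from NLTK. Note that the tokens have already been stemmed, while the NLTK stop words are not: a stop word whose stem differs from its surface form (for example, "was" becomes "wa") will survive the filter unless you also preprocess the stop-word list.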

In [168]:
nltk.download("stopwords")
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
#stop_words = {'a', 'an', 'the'}

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/anatolii.stehnii/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [169]:
# remove stop words and rare words from documents
remove_words = rare_words | stop_words
documents = [[word for word in document if word not in remove_words] for document in documents]
print("Filtered document: {}".format(documents[5]))
print("-"*80)

all_words = {word for document in documents for word in document}
vocabulary = {w:i for i, w in enumerate(sorted(all_words))}
print("A filtered vocabulary contains {} unique words".format(len(vocabulary)))

Filtered document: ['wa', 'old', 'man', 'past', 'fifti', 'seen', 'lordl', 'come', 'go']
--------------------------------------------------------------------------------
A filtered vocabulary contains 10030 unique words


## Term-document matrix

Here you need to transform your corpus into a term-document matrix of any type you want (BoW, TF-IDF). Use a sparse matrix if you have a large amount of data. Don't forget to keep your vocabulary so that you can look up the row index for each word.

In [170]:
tf_idf_matrix = None # your code here
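
One possible way to build such a matrix is sketched below on toy data. It is only an illustration, not the reference solution: the helper `build_tf_idf` and the toy variables are hypothetical, and the weighting (raw counts multiplied by `log(n_docs / document_frequency)`) is an assumption, since the assignment leaves the exact TF-IDF variant open. Rows correspond to terms and columns to documents.

```python
from collections import Counter

import numpy as np
from scipy.sparse import csr_matrix, diags


def build_tf_idf(docs, vocab):
    """Build a sparse TF-IDF matrix of shape (n_terms, n_docs).

    Assumed weighting: tf is the raw count and idf = log(n_docs / df),
    where df is the number of documents that contain the term.
    """
    rows, cols, counts = [], [], []
    for doc_id, doc in enumerate(docs):
        # Counter yields each distinct word of the document exactly once.
        for word, count in Counter(doc).items():
            rows.append(vocab[word])
            cols.append(doc_id)
            counts.append(count)

    n_terms, n_docs = len(vocab), len(docs)
    tf = csr_matrix((counts, (rows, cols)), shape=(n_terms, n_docs), dtype=np.float64)

    # Every (term, document) pair occurs once in `rows`, so counting the row
    # indexes gives the document frequency of each term.
    doc_freq = np.bincount(np.asarray(rows, dtype=np.int64), minlength=n_terms)
    idf = np.log(n_docs / np.maximum(doc_freq, 1))

    # Multiplying by a diagonal matrix scales row i of tf by idf[i].
    return (diags(idf) @ tf).tocsr()


toy_vocabulary = {"king": 0, "man": 1, "queen": 2, "woman": 3}
toy_documents = [["king", "queen"], ["king", "man"], ["king", "woman", "woman"]]
toy_tf_idf = build_tf_idf(toy_documents, toy_vocabulary)
print(toy_tf_idf.toarray().round(3))
```

With this weighting a word that appears in every document ("king" in the toy data) gets an IDF of zero, so its whole row vanishes. On the real corpus the call would be `build_tf_idf(documents, vocabulary)`.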

## SVD

Perform SVD of the term-document matrix and reduce its dimensionality to `n_dim` components.

In [171]:
n_dim = 200

In [172]:
u = None # your code here
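
A truncated SVD of a sparse matrix can be computed with `scipy.sparse.linalg.svds`. The sketch below runs it on a small random matrix; the helper `truncated_svd` and the toy data are hypothetical, and this is one option among several rather than the intended solution. Following the notebook, the left singular vectors `u` (one row per term) are what would serve as word embeddings.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds


def truncated_svd(term_doc_matrix, n_components):
    """Return the top `n_components` left singular vectors and singular values."""
    # svds requires 1 <= n_components < min(term_doc_matrix.shape).
    u, s, _ = svds(term_doc_matrix, k=n_components)
    # The order of the singular values returned by svds is not guaranteed,
    # so sort the components from largest to smallest singular value.
    order = np.argsort(s)[::-1]
    return u[:, order], s[order]


# Toy stand-in for the term-document matrix: 50 terms, 30 documents, ~20% non-zero.
rng = np.random.default_rng(0)
toy_dense = rng.random((50, 30))
toy_dense[toy_dense < 0.8] = 0.0

toy_u, toy_s = truncated_svd(csr_matrix(toy_dense), n_components=5)
print(toy_u.shape)
```

In the notebook this would become `u, s = truncated_svd(tf_idf_matrix, n_dim)`. Whether to additionally scale the columns of `u` by the singular values is a design choice that the assignment leaves open.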

In [174]:
word_embeddings = u

## Explore result

Select a set of words you would like to explore. Use their preprocessed (stemmed) forms, because those are the keys of the vocabulary.

In [176]:
words = ['king', 'queen', 'robert', 'cersei', 'daeneri', 'stanni', 'man', 'woman']
indexes = [vocabulary[word] for word in words]

Plot your words, using the first two components as the x and y coordinates.

In [None]:
# your code here
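
A labelled scatter plot is one straightforward way to do this. The sketch below uses random vectors in place of real embeddings; `plot_words` and the demo data are hypothetical names introduced only for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np


def plot_words(embeddings, labels, row_indexes):
    """Scatter the first two embedding components of the selected rows."""
    points = embeddings[row_indexes, :2]
    fig, ax = plt.subplots()
    ax.scatter(points[:, 0], points[:, 1])
    for label, (x, y) in zip(labels, points):
        # Place each label slightly up and to the right of its point.
        ax.annotate(label, (x, y), xytext=(5, 5), textcoords="offset points")
    ax.set_xlabel("component 0")
    ax.set_ylabel("component 1")
    return ax


# Random stand-in for word_embeddings: 8 words, 5 components.
demo_embeddings = np.random.default_rng(0).normal(size=(8, 5))
demo_ax = plot_words(demo_embeddings, ["king", "queen", "man"], [0, 3, 7])
```

With the notebook's variables the call would be `plot_words(word_embeddings, words, indexes)`.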

Measure distances (cosine, Euclidean) between words to check whether the distances are meaningful.

In [None]:
# your code here
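
Both distances can be written in a few lines of NumPy. The sketch below is an illustration on toy vectors; the function names are hypothetical, and it assumes no vector is all zeros, since the cosine distance is undefined for a zero vector.

```python
import numpy as np


def cosine_distance(a, b):
    """1 - cos(angle between a and b); assumes neither vector is all zeros."""
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))


def euclidean_distance(a, b):
    """Length of the difference vector a - b."""
    return np.linalg.norm(a - b)


def pairwise_cosine_distances(vectors):
    """Matrix of cosine distances between all rows of `vectors`."""
    normalized = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    return 1.0 - normalized @ normalized.T


toy_vectors = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 0.0]])
toy_pairwise = pairwise_cosine_distances(toy_vectors)
print(cosine_distance(toy_vectors[0], toy_vectors[2]))
print(euclidean_distance(toy_vectors[0], toy_vectors[2]))
```

The first and third toy vectors point in the same direction but differ in length, so their cosine distance is zero while their Euclidean distance is not. In the notebook, `pairwise_cosine_distances(word_embeddings[indexes])` would compare all selected words at once.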

In [None]:
# try to find a meaning for individual components
component = 1
values = word_embeddings[:, component].tolist()
# vocabulary was built in row-index order, so its items line up with the rows
words_values = sorted(zip(vocabulary.items(), values), key=lambda x: x[1])
words_values[:20]

In [None]:
words_values[-20:]

## Save results

We will need the preprocessed dataset and the word embeddings in the next assignment, so let's save them.

In [181]:
# Store the vocabulary as one word per line. Indexes can be restored from the line order, therefore they are omitted
with open(os.path.join(data_dir, 'vocabulary.txt'), "w") as f:
    vocab_str = '\n'.join(vocabulary.keys())
    f.write(vocab_str)

In [185]:
# Store the corpus as one document per line, with each token replaced by its index
with open(os.path.join(data_dir, 'corpus_preprocessed.txt'), "w") as f:
    corpus_str = '\n'.join([' '.join([str(vocabulary[token]) for token in document]) for document in documents])
    f.write(corpus_str)

In [186]:
# Store LSA embeddings
np.save(os.path.join(data_dir, 'lsa_embeddings.npy'), word_embeddings)
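
For reference, here is a sketch of how the three files could be read back, which also shows how word indexes are recovered from the line order. The helpers `load_vocabulary` and `load_corpus` are hypothetical, and the round trip runs on toy data in a temporary directory rather than on the real `data_dir`.

```python
import os
import tempfile

import numpy as np


def load_vocabulary(path):
    """Rebuild the word -> index mapping from the line order of the file."""
    with open(path, "r") as f:
        return {word: i for i, word in enumerate(f.read().split("\n"))}


def load_corpus(path):
    """Read one document per line; an empty line is an empty document."""
    with open(path, "r") as f:
        return [[int(t) for t in line.split()] for line in f.read().split("\n")]


toy_vocab = {"king": 0, "man": 1, "queen": 2}
toy_docs = [["king", "queen"], [], ["man"]]
toy_embeddings = np.arange(6.0).reshape(3, 2)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Write the files the same way the cells above do.
    with open(os.path.join(tmp_dir, "vocabulary.txt"), "w") as f:
        f.write("\n".join(toy_vocab.keys()))
    with open(os.path.join(tmp_dir, "corpus_preprocessed.txt"), "w") as f:
        f.write("\n".join(" ".join(str(toy_vocab[t]) for t in d) for d in toy_docs))
    np.save(os.path.join(tmp_dir, "lsa_embeddings.npy"), toy_embeddings)

    # Read them back.
    restored_vocab = load_vocabulary(os.path.join(tmp_dir, "vocabulary.txt"))
    restored_docs = load_corpus(os.path.join(tmp_dir, "corpus_preprocessed.txt"))
    restored_embeddings = np.load(os.path.join(tmp_dir, "lsa_embeddings.npy"))

print(restored_vocab)
print(restored_docs)
```

Splitting on `"\n"` rather than iterating over lines keeps documents that became empty after filtering, so document positions stay aligned with the original corpus.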