# Word vectors

This Jupyter notebook walks you through the steps of creating word vectors from your (lemmatized, if needed) source texts, and runs you through doing various kinds of analysis on those word vectors.

## 1. Getting the source texts
This notebook assumes you've already acquired texts that you want to work with. You need, at a minimum, around 1 million words to get better-than-garbage results for word vectors, and your texts need to be lemmatized (if you're working with a language where that's relevant).

## 2. Cleaning the source texts
The lemmatization code for your language may or may not already take care of some of these cleaning steps -- and also, these cleaning steps might not be all you need. If you have other punctuation that's causing a problem, try modifying the examples below, and/or you can always check in with Quinn about it.

Even if you don't need to clean your source texts, be sure to run the first code block to import modules that will be important for the word vector steps below.

### 2.1 Importing modules and setting up paths
Change the value of *sourcefiledirectory* to where you've put your source files, then run the code block below first, even if you want to move on immediately to the word vectors. It imports a number of modules you'll need later.

For instance, the default path to a text file in the Documents directory is (substituting your user name on the computer for YOUR-USER-NAME):

* On Mac: '/Users/YOUR-USER-NAME/Documents/YOUR-TEXT-FILE.txt'
* On Windows: 'C:\\\Users\\\YOUR-USER-NAME\\\Documents\\\YOUR-TEXT-FILE.txt'
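
If typing these paths by hand keeps going wrong (especially the backslashes on Windows), one option is to let Python build the path for you. This is only a sketch using the standard-library `pathlib` module; the helper name `documents_path` and the folder name `my_corpus` are made up for illustration, and it assumes your Documents folder sits directly under your home directory, which may not be true on every computer.

```python
from pathlib import Path

def documents_path(*parts):
    """Build a path under the current user's Documents folder on any operating system."""
    return Path.home().joinpath("Documents", *parts)

# "my_corpus" is a hypothetical folder name; substitute your own.
example_directory = str(documents_path("my_corpus"))
print(example_directory)
```

The printed string can be pasted in as the value of *sourcefiledirectory*.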

In [None]:
#os is used for things like changing directories and listing files
import os
#io is used for opening and writing files
import io
#itertools is used for some of the iterative code
from itertools import chain
#glob is used to find all the pathnames matching a specified pattern (here, all text files)
import glob

#This is the full path to the directory where you've stored the source texts
sourcefiledirectory = '/Users/qad/Documents/hp_noparatext'

#Changing the directory to where you've stored the source texts, so you can open them in later code blocks
os.chdir(sourcefiledirectory)

### 2.2 Lower-casing all text
The code below **replaces** your source files with versions where all characters are lower-case. Be sure you have a copy of the original version of your source file somewhere else in case you need to go back to it!
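
If you don't have a backup yet, a small helper like the one below could make one before you run any of the cleaning cells. It is a sketch, not part of the original workflow: the function name `backup_text_files` and the example paths are hypothetical, and it only copies files ending in .txt.

```python
import shutil
from pathlib import Path

def backup_text_files(source_dir, backup_dir):
    """Copy every .txt file in source_dir into backup_dir and return the copied file names."""
    source = Path(source_dir)
    backup = Path(backup_dir)
    # Create the backup directory if it doesn't exist yet
    backup.mkdir(parents=True, exist_ok=True)
    copied = []
    for path in sorted(source.glob("*.txt")):
        shutil.copy2(path, backup / path.name)
        copied.append(path.name)
    return copied

# Example call (both paths are placeholders):
# backup_text_files('/Users/YOUR-USER-NAME/Documents/my_corpus',
#                   '/Users/YOUR-USER-NAME/Documents/my_corpus_backup')
```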

In [None]:
#Look through the directory you specified to find files that end in .txt.
for filename in os.listdir(sourcefiledirectory):
    if filename.endswith(".txt"):
        
        #For each file that ends in .txt, open and read its contents into a string. Then make the characters lower-case.
        with open(filename, 'r', encoding='utf8') as f:
            text = f.read()
        lines = [text.lower()]

        #Create a new file with the same file name (i.e. replacing the original file) and write the lowercase lines
        #This method also automatically closes the file once it's done
        with open(filename, 'w', encoding='utf8') as out:
            out.writelines(lines)

### 2.3 Removing line breaks
Line breaks get attached to the previous word, so this code adds a space to separate them.

In [None]:
# Look for files in the source directory that end in .txt
for filename in os.listdir(sourcefiledirectory):
    if filename.endswith(".txt"):
        
        #Read each text file into a string. Find newline characters (\n) and put a space before and after.
        with open(filename, 'r', encoding='utf8') as f:
            text = f.read()
        lines = [text.replace("\n", " \n ")]

        #Write output to a new file with the same name as the original, overwriting the original file.
        with open(filename, 'w', encoding='utf8') as out:
            out.writelines(lines)

### 2.4 Cleaning up punctuation
You need to either remove punctuation attached to words, or separate it from the words with a space.

In [None]:
# This gets rid of ellipses. Sets of more than one period complicate further text processing.

for filename in os.listdir(sourcefiledirectory):
    if filename.endswith(".txt"):
        
        #Read each text file into a string, and do the find-and-replace
        with open(filename, 'r', encoding='utf8') as f:
            text = f.read()
        lines = [text.replace("...", "")]
        
        #Write output to a new file with the same name as the original, overwriting the original file.
        with open(filename, 'w', encoding='utf8') as out:
            out.writelines(lines)

In [None]:
# This gets rid of sets of two periods (yes, there were some of those in the source files!)

for filename in os.listdir(sourcefiledirectory):
    if filename.endswith(".txt"):
        
        #Read each text file into a string, and do the find-and-replace
        with open(filename, 'r', encoding='utf8') as f:
            text = f.read()
        lines = [text.replace("..", "")]
        
        #Write output to a new file with the same name as the original, overwriting the original file.
        with open(filename, 'w', encoding='utf8') as out:
            out.writelines(lines)

In [None]:
# This takes a period followed by a space, and puts a space before it as well.
# You don't want to just get rid of all periods because they're used in abbreviations.

for filename in os.listdir(sourcefiledirectory):
    if filename.endswith(".txt"):
        
        #Read each text file into a string, and do the find-and-replace
        with open(filename, 'r', encoding='utf8') as f:
            text = f.read()
        lines = [text.replace(". ", " . ")]
        
        #Write output to a new file with the same name as the original, overwriting the original file.
        with open(filename, 'w', encoding='utf8') as out:
            out.writelines(lines)

In [None]:
# Sometimes the period is followed by a newline rather than a space, so this code puts a space before those.

for filename in os.listdir(sourcefiledirectory):
    if filename.endswith(".txt"):
        
        #Read each text file into a string, and do the find-and-replace
        with open(filename, 'r', encoding='utf8') as f:
            text = f.read()
        lines = [text.replace(".\n", " .\n")]
        
        #Write output to a new file with the same name as the original, overwriting the original file.
        with open(filename, 'w', encoding='utf8') as out:
            out.writelines(lines)

In [None]:
# Gets rid of colons

for filename in os.listdir(sourcefiledirectory):
    if filename.endswith(".txt"):
        
        #Read each text file into a string, and do the find-and-replace
        with open(filename, 'r', encoding='utf8') as f:
            text = f.read()
        lines = [text.replace(":", "")]
        
        #Write output to a new file with the same name as the original, overwriting the original file.
        with open(filename, 'w', encoding='utf8') as out:
            out.writelines(lines)

In [None]:
# Gets rid of semicolons

for filename in os.listdir(sourcefiledirectory):
    if filename.endswith(".txt"):
        
        #Read each text file into a string, and do the find-and-replace
        with open(filename, 'r', encoding='utf8') as f:
            text = f.read()
        lines = [text.replace(";", "")]
        
        #Write output to a new file with the same name as the original, overwriting the original file.
        with open(filename, 'w', encoding='utf8') as out:
            out.writelines(lines)

In [None]:
# Gets rid of commas

for filename in os.listdir(sourcefiledirectory):
    if filename.endswith(".txt"):
        
        #Read each text file into a string, and do the find-and-replace
        with open(filename, 'r', encoding='utf8') as f:
            text = f.read()
        lines = [text.replace(",", "")]
        
        #Write output to a new file with the same name as the original, overwriting the original file.
        with open(filename, 'w', encoding='utf8') as out:
            out.writelines(lines)

In [None]:
# Gets rid of « quotation marks

for filename in os.listdir(sourcefiledirectory):
    if filename.endswith(".txt"):
        
        #Read each text file into a string, and do the find-and-replace
        with open(filename, 'r', encoding='utf8') as f:
            text = f.read()
        lines = [text.replace("«", "")]
        
        #Write output to a new file with the same name as the original, overwriting the original file.
        with open(filename, 'w', encoding='utf8') as out:
            out.writelines(lines)

In [None]:
# Gets rid of » quotation marks

for filename in os.listdir(sourcefiledirectory):
    if filename.endswith(".txt"):
        
        #Read each text file into a string, and do the find-and-replace
        with open(filename, 'r', encoding='utf8') as f:
            text = f.read()
        lines = [text.replace("»", "")]
        
        #Write output to a new file with the same name as the original, overwriting the original file.
        with open(filename, 'w', encoding='utf8') as out:
            out.writelines(lines)

In [None]:
# Gets rid of ellipsis characters (…)

for filename in os.listdir(sourcefiledirectory):
    if filename.endswith(".txt"):
        
        #Read each text file into a string, and do the find-and-replace
        with open(filename, 'r', encoding='utf8') as f:
            text = f.read()
        lines = [text.replace("…", "")]
        
        #Write output to a new file with the same name as the original, overwriting the original file.
        #The with block closes the file automatically, so no separate close() call is needed.
        with open(filename, 'w', encoding='utf8') as out:
            out.writelines(lines)
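
Each cell above makes one full pass over every file for a single replacement. If you end up adding more punctuation of your own, it may be easier to keep all the find-and-replace pairs in one list and apply them in order. The sketch below mirrors the replacements from the cells in this section; the names `PUNCTUATION_REPLACEMENTS` and `clean_punctuation` are made up for illustration, and the sample sentence is invented.

```python
# Ordered (find, replace) pairs mirroring the cells above.
# Order matters: "..." has to be handled before "..", and both before ". ".
PUNCTUATION_REPLACEMENTS = [
    ("...", ""),
    ("..", ""),
    (". ", " . "),
    (".\n", " .\n"),
    (":", ""),
    (";", ""),
    (",", ""),
    ("«", ""),
    ("»", ""),
    ("…", ""),
]

def clean_punctuation(text, replacements=PUNCTUATION_REPLACEMENTS):
    """Apply each find-and-replace pair to the text, in list order."""
    for old, new in replacements:
        text = text.replace(old, new)
    return text

# Invented sample text, just to show the effect
print(repr(clean_punctuation("wait... no, stop. «ok»\n")))
```

To use it on your files, you would read each file into a string as in the cells above, pass the string through `clean_punctuation`, and write the result back out.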

### 2.5 Removing multiple spaces

At this point, there are still places in the texts with five spaces (after previous chapter headers). This removes them and replaces them with a single space.

In [None]:
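for filename in os.listdir(sourcefiledirectory):
    if filename.endswith(".txt"):
        #Read each text file into a string, and replace runs of five spaces with one space
        with open(filename, 'r', encoding='utf8') as f:
            text = f.read()
        lines = [text.replace("     ", " ")]

        #The with block closes the file automatically, so no separate close() call is needed.
        with open(filename, 'w', encoding='utf8') as out:
            out.writelines(lines)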
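
The replacement above only matches runs of exactly five spaces: a run of four is left alone, and a run of six still ends up as two spaces. If your texts have runs of other lengths, a regular expression can collapse any run of two or more spaces. This is a sketch using the standard-library `re` module; the function name `collapse_spaces` and the sample text are invented.

```python
import re

def collapse_spaces(text):
    """Replace any run of two or more spaces with a single space (newlines are untouched)."""
    return re.sub(r" {2,}", " ", text)

# Invented sample text with runs of five and two spaces
print(collapse_spaces("chapter one     it was  late"))
```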

## 3. Word vector creation

The code blocks in this section generate the word vector representation for a set of texts. You can specify a different directory than the one used for the data cleaning, which can be useful if you want to run the vectors separately for different subsets of your corpus (e.g. just "Harry Potter", just "Tanya Grotter", etc.). To do this, copy the cleaned-up text files for the subset of the corpus that you want to run into a new directory, and put the path to that new directory in the first code block below for *vector_sources*.

Before you run this for the first time, you need to install the *gensim* Python package. Either open a terminal window and type `pip install gensim`, or run the code cell in section 3.0a below.

### 3.0a One-time setup: install gensim
The first time you run this notebook, run the code cell below to install the `gensim` package. You won't have to run this the next time you run the notebook, but nothing bad will happen if you do run it.

In [None]:
import sys
#Installs gensim
!{sys.executable} -m pip install gensim

### 3.1 Run every time: import modules
Run the code cell below every time to import the modules you need to run the notebook.

In [None]:
# gensim is a Python module for generating and analyzing word vectors
import gensim
# Logging allows you to watch the progress of long-running processes
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
# word2vec is used to generate the vectors, phrases to identify phrases as an input for vector generation
from gensim.models import word2vec, Phrases
from gensim.models.phrases import Phraser
# These utilities are used for exporting and loading models
from gensim.test.utils import common_texts, get_tmpfile
from gensim.models import KeyedVectors

### 3.2 Specify where the text files are for the word vectors
This may be the same place as you indicated above for the text cleaning steps, or you may choose to split your corpus into various subsets for creating word vectors.

Change the value of *vector_sources* to where you've put your source files, then run the code block below.

For instance, the default path to a text file in the Documents directory is (substituting your user name on the computer for YOUR-USER-NAME):

* On Mac: '/Users/YOUR-USER-NAME/Documents/YOUR-TEXT-FILE.txt'
* On Windows: 'C:\\\Users\\\YOUR-USER-NAME\\\Documents\\\YOUR-TEXT-FILE.txt'

In [None]:
vector_sources="/Users/qad/Documents/megacorpus"

### 3.3 List each file and its length
This is a confirmation step that lists all the files that will be used as the input for the word vectors, along with how many characters are in each file.

In [None]:
#Change directory to where the data for your word vectors is
os.chdir(vector_sources)
#List all the documents in the directory with the data for your word vectors
documents = list()
for filename in glob.glob("*.txt"):
    #Open each text file in the directory and read it into a string
    with io.open(filename, mode="r", encoding="utf-8") as f:
        filedata = f.read()
    #Print the filename along with how many characters (i.e. letters, numbers, etc.) are in the file
    print(filename + " = " + str(len(filedata)) + " chars")
    documents = documents + filedata.split("\n")
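
The cell above reports characters, but the rule of thumb in section 1 is stated in words (around 1 million). A rough word count can be had by splitting on whitespace. The sketch below shows the idea on two invented sample lines; the function names `count_words` and `corpus_word_total` are made up, and note that punctuation you separated out with spaces counts as a "word" here.

```python
def count_words(text):
    """Count whitespace-separated tokens in a string."""
    return len(text.split())

def corpus_word_total(texts):
    """Add up the token counts for a list of strings."""
    return sum(count_words(text) for text in texts)

# Two invented sample lines standing in for your corpus
sample = ["harry caught the snitch .", "таня взяла контрабас"]
print(corpus_word_total(sample))
```

In the notebook itself, `corpus_word_total(documents)` would give the total for the files listed above.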

### 3.4 Identify phrases
This code block identifies bigram and trigram (2-word and 3-word, respectively) phrases. Especially if you have a small corpus, phrase mis-identification is possible through repeated words (e.g. "she said" in English).

Phrases are treated like single words when doing the word vector generation.

**Note:** this will take some time, and will generate a lot of status messages in the process.

In [None]:
# Generates bigrams and trigrams from the text
sentence_stream = [doc.split(" ") for doc in documents]
trigram_sentences_project = []
bigram = Phraser(Phrases(sentence_stream))
trigram = Phraser(Phrases(bigram[sentence_stream]))

for sent in sentence_stream:
    bigrams_ = bigram[sent]
    trigrams_ = trigram[bigram[sent]]
    trigram_sentences_project.append(trigrams_)
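
To get a feel for why a pair of words does or doesn't get joined into a phrase, it can help to look at the kind of score involved. The sketch below is a simplified, pure-Python version of the formula that, as far as I understand it, gensim's default phrase scorer is based on; gensim's own bookkeeping (for example, how it counts the vocabulary size) differs in the details, so treat the numbers as illustrative only. The function name `bigram_scores` and the toy sentences are invented. gensim joins a pair when its score is above a threshold, and the `min_count` and `threshold` arguments to `Phrases` control how strict that is.

```python
from collections import Counter

def bigram_scores(sentences, min_count=5):
    """Score each adjacent pair: (count(ab) - min_count) * vocab_size / (count(a) * count(b))."""
    word_counts = Counter(word for sent in sentences for word in sent)
    pair_counts = Counter(pair for sent in sentences for pair in zip(sent, sent[1:]))
    vocab_size = len(word_counts)
    scores = {}
    for (a, b), n_ab in pair_counts.items():
        scores[(a, b)] = (n_ab - min_count) * vocab_size / (word_counts[a] * word_counts[b])
    return scores

# Toy corpus (invented): one sentence repeated ten times, plus one unrelated sentence
toy_sentences = [["she", "said", "hello"]] * 10 + [["the", "owl", "flew"]]
for pair, score in sorted(bigram_scores(toy_sentences).items()):
    print(pair, round(score, 2))
```

If too many ordinary word pairs are being joined in your corpus, raising `threshold` or `min_count` when calling `Phrases(...)` is one thing to try.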

### 3.5 Running and saving word vectors
This code block sets the parameters for vector generation, generates vectors, and saves the model.

The default parameters should work in most cases. If you change the *num_features*, you'll need to change it again in the visualization code below.

**Note:** this will take some time and generate a lot of status messages in the process.

In [None]:
# Sets values for various parameters for vector generation.
num_features = 200    # Word vector dimensionality                      
min_word_count = 2    # Minimum word count                        
num_workers = 20      # Number of threads to run in parallel
context = 5           # Context window size                                                                                    
downsampling = 1e-3   # Downsample setting for frequent words


# Sets up the code to run the word vector creation
model = word2vec.Word2Vec(trigram_sentences_project, workers=num_workers, \
            vector_size=num_features, min_count = min_word_count, \
            window = context, sample = downsampling)


# Saves model; you can change the name as long as it ends in .model
model.save("word2vec.model")

In [None]:
# Print the total number of items (including words, phrases, standalone punctuation, etc.) in the model's vocabulary

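# gensim 4 (which the vector_size parameter above requires) exposes the vocabulary as key_to_index
print(len(model.wv.key_to_index))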

## 4. Word vector analysis

The code blocks below allow you to pull up most-similar and most-dissimilar terms, and attempt analogies with the word vectors.

### 4a Using an existing model
If you don't want to create your own word vectors, you can use one of [the pretrained models provided by Facebook](https://fasttext.cc/docs/en/pretrained-vectors.html). Download a model, put in the full path to it on your own computer below, and then run the code below.

**You only need to do this if you haven't generated your own word vectors above. Otherwise, skip the following code block!**

In [None]:
# DO NOT run this if you've already created your own word vectors. This is ONLY for loading an existing model.
# gensim 4 removed gensim.models.wrappers; load_facebook_model reads Facebook's .bin format instead.
from gensim.models.fasttext import load_facebook_model
model = load_facebook_model('/Users/qad/Downloads/wiki.vi/wiki.vi.bin')

### 4.1 Most similar terms
Put any word in the corpus between the quotes below to show the most similar words. You can change the value of *topn* to show more, or fewer, words.

Keep in mind that if you used the preprocessing steps, the text is all lower-case and lemmatized. Use lower-case, uninflected forms; otherwise the code will throw an error about the word not being in the vocabulary.

In [None]:
w1 = "hội"
model.wv.most_similar(positive=w1, topn=30)
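
"Most similar" here means closest by cosine similarity: two words score near 1 when their vectors point in nearly the same direction, and near 0 when the vectors are unrelated. The sketch below works this out by hand on made-up two-dimensional vectors (real models use hundreds of dimensions); the function names and the toy vectors are invented for illustration and are not how gensim stores anything.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_by_similarity(word, vectors):
    """Return (other_word, similarity) pairs, most similar first."""
    scored = [(other, cosine_similarity(vectors[word], vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Made-up vectors, chosen so that "snitch" points roughly the same way as "quidditch"
toy_vectors = {
    "quidditch": [1.0, 0.0],
    "snitch": [0.8, 0.6],
    "homework": [0.0, 1.0],
}
print(rank_by_similarity("quidditch", toy_vectors))
```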

### 4.2 Most dissimilar terms
Put any word in the corpus between the quotes below to show the most **dissimilar** words (i.e. those words that are used in the most dissimilar ways to the one you've given the model). You can change the value of *topn* to show more, or fewer, words.

Keep in mind that if you used the preprocessing steps, the text is all lower-case and lemmatized. Use lower-case, uninflected forms; otherwise the code will throw an error about the word not being in the vocabulary.

In [None]:
w1 = "таня"
model.wv.most_similar(negative=w1, topn=30)
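
If you'd rather get a short message than a long error when a word isn't in the vocabulary, you could wrap the lookup in a small helper. This is a sketch: the function name `most_similar_or_none` is made up, and it assumes a model trained in this notebook, where a missing word raises a `KeyError`.

```python
def most_similar_or_none(keyed_vectors, word, topn=30):
    """Return the most similar terms, or None if the word is not in the vocabulary."""
    try:
        return keyed_vectors.most_similar(positive=[word], topn=topn)
    except KeyError:
        print("'" + word + "' is not in the vocabulary; try a lower-case, lemmatized form.")
        return None

# In the notebook you would call it like this:
# most_similar_or_none(model.wv, "таня", topn=30)
```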

### 4.3 Analogies
Without a very large corpus, the results of these analogies are usually unsatisfying. The code below shows how to construct these analogies if you want to try them.

The analogy code takes three words as input. To render the analogy гарри:квиддич::таня:??? (one might imagine драконбол as a high probability answer), you would use the code below. Or, more abstractly, given *A:B::C:??*, the code would be: `positive=['B','C'],negative=['A']`, because the answer is the word whose vector is closest to *B - A + C*.

In [None]:
# гарри is to квиддич what таня is to...
model.wv.most_similar(positive=['квиддич','таня'],negative=['гарри'],topn=30)
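
Under the hood, an analogy query is vector arithmetic: the vectors for the `positive` words are added, the vectors for the `negative` words are subtracted, and the result is compared with every other word by cosine similarity. The sketch below does this by hand with the familiar English example (man is to king what woman is to...) on made-up two-dimensional vectors. It is a simplification: gensim also normalizes the length of each vector first, and the function names and toy vectors here are invented.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def solve_analogy(vectors, positive, negative):
    """Return the word whose vector is closest to sum(positive) - sum(negative)."""
    dims = len(next(iter(vectors.values())))
    target = [0.0] * dims
    for word in positive:
        target = [t + v for t, v in zip(target, vectors[word])]
    for word in negative:
        target = [t - v for t, v in zip(target, vectors[word])]
    # Leave the input words out of the candidates, as gensim does
    candidates = [w for w in vectors if w not in positive and w not in negative]
    return max(candidates, key=lambda w: cosine_similarity(target, vectors[w]))

# Made-up vectors: first number is "royalty", second is "gender"
toy_vectors = {
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "apple": [-1.0, 0.2],
}
# man is to king what woman is to...
print(solve_analogy(toy_vectors, positive=["king", "woman"], negative=["man"]))
```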

## 5. Visualization
The code below will generate two kinds of visualizations for the word vectors, by reducing the dimensionality of the vectors from 200 dimensions (or however many you specified when creating the vectors) down to 2.

For this to work, you need the most recent version of *matplotlib*; as of March 2019, you may need to open a terminal and run `conda uninstall matplotlib` then `conda install matplotlib` to update to the latest version, depending on when you installed Anaconda.

You also need to install the *scikit-learn* package, which provides the *sklearn* module. In the terminal: `pip install scikit-learn`.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

### 5.1 Visualizing similar words
The code below will plot a given word, and the most similar words to it. You can increase or decrease the number of values displayed by changing the *topn* value for *close_words* (currently 30).

You can input the word you want to use as the basis for similar words in the second code block below.

Note: if you changed the number of features to something other than 200 when generating word vectors, you'll need to change the line `arr = np.empty((0,200), dtype='f')`, replacing 200 with the number of features you used.

In [None]:
## visualizing subset of vectors
import numpy as np
import matplotlib.pyplot as plt
 
from sklearn.manifold import TSNE
def display_closestwords_tsnescatterplot(model, word):
    
    arr = np.empty((0,200), dtype='f')
    word_labels = [word]

    # get close words
    close_words = model.wv.most_similar (word, topn=30)
    
    # add the vector for each of the closest words to the array
    arr = np.append(arr, np.array([model.wv[word]]), axis=0)
    for wrd_score in close_words:
        wrd_vector = model.wv[wrd_score[0]]
        word_labels.append(wrd_score[0])
        arr = np.append(arr, np.array([wrd_vector]), axis=0)
        
    # find tsne coords for 2 dimensions
    tsne = TSNE(n_components=2, random_state=0)
    np.set_printoptions(suppress=True)
    Y = tsne.fit_transform(arr)

    x_coords = Y[:, 0]
    y_coords = Y[:, 1]
    # display scatter plot
    plt.scatter(x_coords, y_coords)

    for label, x, y in zip(word_labels, x_coords, y_coords):
        plt.annotate(label, xy=(x, y), xytext=(0, 0), textcoords='offset points')
    plt.xlim(x_coords.min()+0.00005, x_coords.max()+0.00005)
    plt.ylim(y_coords.min()+0.00005, y_coords.max()+0.00005)
    plt.show()

In [None]:
#put your word between the single quotes here
display_closestwords_tsnescatterplot(model,'снитч')

### 5.2 Visualizing all the words
To visualize the overall shape of all the word vectors, you can run the code below. You'll need to zoom in quite a bit to be able to make anything specific out of it. Certain traits (like a loop in the overall curve) may be warning signs of data cleaning issues.

In [None]:
import numpy as np
import pandas as pd
from sklearn.manifold import TSNE
from bokeh.io import push_notebook, show, output_notebook
from bokeh.plotting import figure
from bokeh.models import ColumnDataSource, LabelSet
output_notebook()
 
# model.wv.index_to_key is the gensim 4 list of every term in the vocabulary
X = []
for word in model.wv.index_to_key:
    X.append(model.wv[word])
 
X = np.array(X)
print("Computed X: ", X.shape)
X_embedded = TSNE(n_components=2, n_iter=250, verbose=2).fit_transform(X)
print("Computed t-SNE", X_embedded.shape)
 
df = pd.DataFrame(columns=['x', 'y', 'word'])
df['x'], df['y'], df['word'] = X_embedded[:,0], X_embedded[:,1], model.wv.index_to_key
 
source = ColumnDataSource(ColumnDataSource.from_df(df))
labels = LabelSet(x="x", y="y", text="word", y_offset=8,
                  text_font_size="20pt", text_color="#555555",
                  source=source, text_align='center')
 
plot = figure(plot_width=400, plot_height=400)
plot.circle("x", "y", size=12, source=source, line_color="black", fill_alpha=0.8)
plot.add_layout(labels)
show(plot, notebook_handle=True)

## 6. Acknowledgements

Thanks to [Jeff Tharsen](http://tharsen.net/) for sharing a notebook that ran word vectors with gensim on Shakespeare. That notebook got this project off the ground by giving me an example of how to actually invoke the gensim phraser and word vector creation. The code in sections 3 and 4 is based on that notebook.

Thanks to Aneesha Bakharia for [this Medium post on *Using TSNE to Plot a Subset of Similar Words from Word2Vec*](https://medium.com/@aneesha/using-tsne-to-plot-a-subset-of-similar-words-from-word2vec-bb8eeaea6229), where I found the code that section 5.1 is based on.

Thanks to Jeff Thompson for [this blog post on visualizing word vectors](https://www.jeffreythompson.org/blog/2017/02/13/using-word2vec-and-tsne/), which I reworked slightly for section 5.2.