# Word vector notebook

This notebook enables you to create word vectors from corpora on TDM Studio.

## Setup: Locating NLTK data
NLTK needs to know where to find the data files it uses to tokenize the text. The code below points it at an `nltk_data` folder inside the current working directory.

In [None]:
import nltk
import os
nltkpath = os.getcwd() + '/nltk_data'
nltk.data.path.append(nltkpath)

## 1. Getting the source texts

In TDM Studio, the datasets you have created are in the `data` folder, in a subfolder with the name of the dataset.

## 2. Cleaning the source texts

Word vectors tend to be more useful when the text is tokenized and lower-cased first, so that, for example, `Pilot` and `pilot` are counted as the same word.

### 2.1 Importing modules and setting up paths
Change the value of `sourcefiledirectory` to where your dataset is, then run the code block below first, even if you want to move on immediately to the word vectors. It imports a number of modules you'll need later.

In [None]:
#os is used for things like changing directories and listing files
import os
#io is used for opening and writing files
import io
#itertools is used for some of the iterative code
from itertools import chain
#glob is used to find all the pathnames matching a specified pattern (here, all text files)
import glob
#pandas is used to store the extracted data in a table
import pandas as pd
#lxml is used to parse the xml files
from lxml import etree
#bs4 is used to strip any remaining markup from the article text
from bs4 import BeautifulSoup


#This is the full path to the directory where you've stored the source texts. Be sure to keep the slash at the end when you put in your actual dataset name
sourcefiledirectory = 'data/dataset-name/'
dataset_directory = sourcefiledirectory
input_files = os.listdir(sourcefiledirectory)

### 2.2 Parsing the data
The following code extracts the actual article content from each of the articles in your data set, and adds it to a table.

In [None]:
def getxmlcontent(root):
    if root.find('.//HiddenText') is not None:
        return(root.find('.//HiddenText').text)
    elif root.find('.//Text') is not None:
        return(root.find('.//Text').text)
    else:
        return None
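
To see how `getxmlcontent` chooses between the two elements, here is a small sketch that runs the same lookup on three hand-written XML snippets. The snippets are invented for illustration and are far simpler than real TDM Studio records, and the standard library's `xml.etree.ElementTree` stands in for `lxml` so the example has no dependencies.

```python
import xml.etree.ElementTree as ET

def getxmlcontent(root):
    """Return the text of <HiddenText> if present, otherwise <Text>, otherwise None."""
    hidden = root.find('.//HiddenText')
    if hidden is not None:
        return hidden.text
    text = root.find('.//Text')
    if text is not None:
        return text.text
    return None

# Hypothetical, heavily simplified records (real files carry much more metadata)
with_hidden = ET.fromstring(
    '<Record><HiddenText>hidden body</HiddenText><Text>visible body</Text></Record>')
text_only = ET.fromstring('<Record><Text>visible body</Text></Record>')
neither = ET.fromstring('<Record><Abstract>summary only</Abstract></Record>')

print(getxmlcontent(with_hidden))  # hidden body
print(getxmlcontent(text_only))    # visible body
print(getxmlcontent(neither))      # None
```

`HiddenText` wins whenever both elements are present, and a record with neither yields `None`, which the parsing loop turns into a placeholder string.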

In [None]:
## Creates four lists to store filename, full text, date, and newspaper title

filename_list = []
text_list = []
date_list = []
newspaper_list = []

#Parse file and add data to lists

for file in input_files:
    tree = etree.parse(dataset_directory + file)
    root = tree.getroot()

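    #Look up the article content once and reuse it
    content = getxmlcontent(root)
    if content is not None:
        #Name the parser explicitly so BeautifulSoup doesn't have to guess (and warn)
        soup = BeautifulSoup(content, 'lxml')
        text = soup.get_text()
    else:
        text = 'Error in processing document'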
    
    date = root.find('.//NumericDate').text
    newspaper = root.find('.//PubFrosting//Title').text

    filename_list.append(file)
    text_list.append(text)
    date_list.append(date)
    newspaper_list.append(newspaper)

#creates table
df = pd.DataFrame({'Article ID': filename_list, 'Newspaper': newspaper_list, 'Date': date_list, 'Text': text_list})
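
The loop above calls `.text` directly on the result of `root.find(...)`, which raises an `AttributeError` if a record lacks that element. If you would rather record a placeholder than have the loop stop, a defensive lookup might look like the sketch below. `find_text` is a hypothetical helper (not part of lxml or TDM Studio), the sample record is invented, and `xml.etree.ElementTree` is used only to keep the example dependency-free.

```python
import xml.etree.ElementTree as ET

def find_text(root, path, default=None):
    """Return the text of the first element matching path, or default if it is missing or empty."""
    element = root.find(path)
    if element is None or element.text is None:
        return default
    return element.text

# Invented record that has a date but no publication title
record = ET.fromstring('<Record><NumericDate>2020-01-15</NumericDate></Record>')

date = find_text(record, './/NumericDate', default='unknown')
newspaper = find_text(record, './/PubFrosting//Title', default='unknown')

print(date)
print(newspaper)
```

To use it in the loop, you would replace `root.find('.//NumericDate').text` with `find_text(root, './/NumericDate', default='unknown')`, and likewise for the newspaper title.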

### 2.3 Tokenizing the text
The following code tokenizes and lowercases the text. It then exports each tokenized text as a file in a new directory called `tokenized`.

In [None]:
from nltk.tokenize import word_tokenize

df['Tokenized'] = df['Text'].apply(lambda x: word_tokenize(x))
df['Tokenized'] = df['Tokenized'].apply(lambda x: ' '.join(x))
df['Tokenized'] = df['Tokenized'].str.lower()

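#Create the output directory if it doesn't already exist
os.makedirs('tokenized', exist_ok=True)
#Note: this changes the working directory, which section 3 relies on; run this cell only once per session
os.chdir('tokenized')
for index, row in df.iterrows():
    filename = f"{index}.txt"
    #Write as UTF-8 so the files can be read back with the same encoding in section 3
    with open(filename, 'w', encoding='utf-8') as file:
        file.write(row['Tokenized'])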
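
The cell above changes the working directory into `tokenized`, so running it a second time creates a nested `tokenized/tokenized` folder. If you prefer to avoid that, one option is to write the files by path without changing directory. The sketch below shows the idea with invented sample texts and a temporary folder; `export_tokenized` is a hypothetical helper, and if you adopt this approach you would also need to set `vector_sources` in section 3 to the output folder instead of `"."`.

```python
import tempfile
from pathlib import Path

def export_tokenized(texts, output_dir):
    """Write each text to <output_dir>/<index>.txt and return the paths written."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for index, text in enumerate(texts):
        path = output_dir / f"{index}.txt"
        # UTF-8 matches the encoding used when the files are read back later
        path.write_text(text, encoding="utf-8")
        written.append(path)
    return written

# Invented sample data standing in for df['Tokenized']
sample_texts = ["the pilot landed the plane .", "the sailor left the harbour ."]

with tempfile.TemporaryDirectory() as tmp:
    paths = export_tokenized(sample_texts, Path(tmp) / "tokenized")
    names = [p.name for p in paths]
    first_text = paths[0].read_text(encoding="utf-8")

print(names)
```

In the notebook you would call something like `export_tokenized(df['Tokenized'], 'tokenized')`, and the working directory would stay where it was.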

## 3. Word vector creation

The code blocks in this section generate the word vector representation for a set of texts. 

In [None]:
# gensim is a Python module for generating and analyzing word vectors
import gensim
# Logging allows you to watch the progress of long-running processes
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
# word2vec is used to generate the vectors, phrases to identify phrases as an input for vector generation
from gensim.models import word2vec, Phrases
from gensim.models.phrases import Phraser
# These utilities are used for exporting and loading models
from gensim.test.utils import common_texts, get_tmpfile
from gensim.models import KeyedVectors

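# Directory containing the tokenized .txt files. "." means the current working directory,
# which is the `tokenized` folder if you have just run the tokenizing cell in section 2.
vector_sources="."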

### 3.1 List each file and its length
This is a confirmation step that lists all the files that will be used as the input for the word vectors.

In [None]:
#Change directory to where the data for your word vectors is
os.chdir(vector_sources)
#List all the documents in the directory with the data for your word vectors
documents = list()
for filename in glob.glob("*.txt"):
    #Open each text file in the directory and read it into a string (the file is closed automatically)
    with io.open(filename, mode="r", encoding="utf-8") as f:
        filedata = f.read()
    #Print the filename along with how many characters (i.e. letters, numbers, etc.) are in the file
    print(filename + " = " + str(len(filedata)) + " chars")
    #Each line of the file becomes one entry in the list of documents
    documents = documents + filedata.split("\n")

### 3.2 Identify phrases
This code block identifies bigram and trigram (2-word and 3-word, respectively) phrases. Phrases are treated like single words when doing the word vector generation. **Note:** this will take some time, and will generate a lot of status messages in the process.

In [None]:
# Generates bigrams and trigrams from the text
sentence_stream = [doc.split(" ") for doc in documents]
trigram_sentences_project = []
bigram = Phraser(Phrases(sentence_stream))
trigram = Phraser(Phrases(bigram[sentence_stream]))

for sent in sentence_stream:
    # Apply the bigram model first, then the trigram model on top of its output
    trigram_sentences_project.append(trigram[bigram[sent]])
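
`Phrases` decides which adjacent words to join by scoring each pair and keeping those above a threshold. The sketch below reimplements, in plain Python, the scoring rule that gensim's default scorer is understood to use, assuming the default `min_count` of 5 and `threshold` of 10.0. Treat it as an illustration only and check the gensim documentation for the version you have installed; `phrase_score` and all the counts are made up for this example.

```python
def phrase_score(count_a, count_b, count_ab, vocab_size, min_count=5):
    """Score a candidate two-word phrase.

    count_a, count_b: how often each word occurs on its own
    count_ab: how often the two words occur next to each other
    vocab_size: number of distinct words in the corpus
    """
    return (count_ab - min_count) * vocab_size / (count_a * count_b)

THRESHOLD = 10.0  # assumed default acceptance threshold

# Two rare words that nearly always appear together score high...
rare_pair = phrase_score(count_a=20, count_b=25, count_ab=15, vocab_size=1000)
# ...while two common words that meet equally often score low.
common_words = phrase_score(count_a=200, count_b=300, count_ab=15, vocab_size=1000)

print(rare_pair, rare_pair > THRESHOLD)
print(common_words, common_words > THRESHOLD)
```

Under these assumptions, a pair is joined only when it co-occurs much more often than the individual word frequencies would suggest, which is why small corpora tend to yield few phrases.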

### 3.3 Running and saving word vectors
This code block sets the parameters for vector generation, generates vectors, and saves the model. **Note:** this will take some time and generate a lot of status messages in the process.

In [None]:
# Sets values for various parameters for vector generation.
num_features = 200    # Word vector dimensionality                      
min_word_count = 2    # Minimum word count                        
num_workers = 20      # Number of threads to run in parallel
context = 5           # Context window size                                                                                    
downsampling = 1e-3   # Downsample setting for frequent words


# Sets up the code to run the word vector creation
# vector_size is the gensim 4 name for the dimensionality parameter (gensim 3 called it size)
model = word2vec.Word2Vec(trigram_sentences_project, workers=num_workers,
            vector_size=num_features, min_count=min_word_count,
            window=context, sample=downsampling)


# Saves model; you can change the name as long as it ends in .model
model.save("word2vec.model")

In [None]:
# Print the total number of items (including words, phrases, standalone punctuation, etc.) in the model's vocabulary

print(len(model.wv.index_to_key))

## 4. Word vector analysis

The code blocks below allow you to pull up the most similar terms and attempt analogies with the word vectors. (The "Harry Potter" and "Tanya Grotter" corpora, even combined, don't seem to be large enough to support meaningful analogies, but you may be able to train the models further on fanfic to get there.)

### 4.1 Most similar terms
Put any word from the corpus between the quotes below to show the words most similar to it. Change the value of `topn` to show more or fewer words.

Keep in mind that if you used the preprocessing steps, the text is all lower-case, so enter the word without capital letters. The preprocessing does not lemmatize, so inflected forms such as plurals are separate vocabulary entries. Words that occur fewer than `min_word_count` times are left out of the model. If the word you enter is not in the vocabulary, the code raises a `KeyError`.

In [None]:
w1 = "pilot"
model.wv.most_similar(positive=[w1], topn=30)
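
`most_similar` ranks words by the cosine similarity between their vectors. The sketch below shows that ranking on four invented two-dimensional vectors (real vectors from this notebook have 200 dimensions); `toy_vectors` and the helper functions are made up for illustration and do not use gensim.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1.0 means same direction, 0.0 means unrelated."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(word, vectors, topn=2):
    """Return the topn (word, similarity) pairs closest to the given word."""
    target = vectors[word]
    scores = [(other, cosine_similarity(target, vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Invented vectors, chosen so that "aviator" points almost the same way as "pilot"
toy_vectors = {
    "pilot": (1.0, 0.0),
    "aviator": (0.9, 0.1),
    "plane": (0.6, 0.8),
    "banana": (0.0, 1.0),
}

ranked = most_similar("pilot", toy_vectors, topn=2)
print([word for word, score in ranked])  # ['aviator', 'plane']
```

The scores in the output of the real `most_similar` call can be read the same way: values close to 1 mean the two words appear in very similar contexts.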

### 4.2 Analogies
Without a larger corpus, the results of these analogies are likely to be disappointing. The code below shows how to construct them if you want to try.

The analogy code takes three words as input. To render the analogy pilot:plane::sailor:??? (one might expect boat to rank highly), you add the vectors for plane and sailor and subtract the vector for pilot. More abstractly, given *A:B::C:??*, the code would be: `positive=['B','C'],negative=['A']`

In [None]:
# pilot is to plane what sailor is to...
model.wv.most_similar(positive=['plane','sailor'],negative=['pilot'],topn=30)
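
The idea behind an analogy query is vector arithmetic: if pilot relates to plane the way sailor relates to the unknown word, the unknown vector should lie near plane - pilot + sailor. The sketch below works this through on invented two-dimensional vectors where one axis loosely means air versus sea and the other person versus vehicle. It is a simplification: gensim normalises the vectors before combining them and handles much larger vocabularies, and `toy_vectors` and `analogy` are hypothetical names.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Invented vectors: first axis air (+) vs sea (-), second axis person (+) vs vehicle (-)
toy_vectors = {
    "pilot": (1.0, 1.0),
    "plane": (1.0, -1.0),
    "sailor": (-1.0, 1.0),
    "boat": (-1.0, -1.0),
    "cloud": (1.0, 0.0),
}

def analogy(a, b, c, vectors):
    """Solve a:b::c:? by finding the word closest to b - a + c, excluding the inputs."""
    target = tuple(vb - va + vc
                   for va, vb, vc in zip(vectors[a], vectors[b], vectors[c]))
    candidates = {w: v for w, v in vectors.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine_similarity(target, candidates[w]))

answer = analogy("pilot", "plane", "sailor", toy_vectors)
print(answer)  # boat
```

In gensim terms, the words that are added go in `positive` and the word that is subtracted goes in `negative`.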