In [1]:
%matplotlib inline
import numpy as np
import pandas as pd
import scipy
import sklearn
import spacy
import matplotlib.pyplot as plt
import seaborn as sns
import re
from nltk.corpus import gutenberg, stopwords

In [2]:
# Utility function to clean text.
def text_cleaner(text):
    
    # Visual inspection shows spaCy does not recognize the double dash '--'.
    # Better get rid of it now!
    text = re.sub(r'--',' ',text)
    
    # Get rid of headings in square brackets.
    text = re.sub(r'\[.*?\]', '', text)
    
    # Get rid of chapter titles.
    text = re.sub(r'Chapter \d+','',text)
    
    # Get rid of extra whitespace.
    text = ' '.join(text.split())
    
    return text


# Import all the Austen in the Project Gutenberg corpus.
austen = ""
for novel in ['persuasion','emma','sense']:
    work = gutenberg.raw('austen-' + novel + '.txt')
    austen = austen + work

# Clean the data.
austen_clean = text_cleaner(austen)
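
To see what the cleaner actually does, here is a minimal sketch that runs the same steps on a short made-up string. The sample text is invented for illustration and is not taken from the corpus. The last two lines show that the chapter pattern is case-sensitive and only matches digits, so headings written differently would pass through untouched.

```python
import re


def text_cleaner(text):
    # Same steps as the cleaner above, repeated so this sketch runs on its own.
    text = re.sub(r'--', ' ', text)            # double dashes become spaces
    text = re.sub(r'\[.*?\]', '', text)        # drop bracketed headings
    text = re.sub(r'Chapter \d+', '', text)    # drop "Chapter <digits>"
    return ' '.join(text.split())              # collapse whitespace


# Made-up sample, not real corpus text.
sample = "[Persuasion by Jane Austen 1818]\n\nChapter 1\n\nSir Walter--a vain man--read   one book."
cleaned = text_cleaner(sample)
print(cleaned)

# Headings in another case or with Roman numerals are not removed.
print(text_cleaner("CHAPTER 1"))
print(text_cleaner("Chapter IV"))
```

If the raw files use upper-case or Roman-numeral chapter headings, the pattern would need adjusting; I have not checked each novel for this.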

In [3]:
# Parse the data. This can take some time.
nlp = spacy.load('en')
austen_doc = nlp(austen_clean)

ValueError: [E088] Text of length 2006272 exceeds maximum of 1000000. The v2.x parser and NER models require roughly 1GB of temporary memory per 100,000 characters in the input. This means long texts may cause memory allocation errors. If you're not using the parser or NER, it's probably safe to increase the `nlp.max_length` limit. The limit is in number of characters, so you can check whether your inputs are too long by checking `len(text)`.

In [4]:
len(austen_clean)

2006272

In [5]:
austen_short = austen_clean[:750000]

In [6]:
len(austen_short)

750000
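
Slicing at a fixed character count discards most of the text and can cut the last word in half. One alternative, sketched here under my own assumptions, is to split the cleaned text into chunks that each stay under the limit and break only at spaces. The helper `chunk_text` is hypothetical (it is not part of spaCy), and the demo string is made up.

```python
def chunk_text(text, max_chars):
    """Split text into pieces of at most max_chars, breaking at spaces."""
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        if end < len(text):
            # Back up to the last space so no word is cut in half.
            # If there is no usable space, fall back to a hard cut.
            space = text.rfind(' ', start, end)
            if space > start:
                end = space
        chunks.append(text[start:end].strip())
        start = end
    # Drop any empty pieces left over after stripping.
    return [c for c in chunks if c]


demo = chunk_text("one two three four five", 10)
print(demo)
```

Each chunk could then be parsed separately and the resulting sentences collected into one list. A sentence that straddles a chunk boundary would be split in two, so this is a trade-off rather than a free fix.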

In [7]:
# Parse the data. This can take some time.
nlp = spacy.load('en')
austen_doc = nlp(austen_short)

In [8]:
# Organize the parsed doc into sentences, while filtering out punctuation
# and stop words, and converting words to lower case lemmas.
sentences = []
for sentence in austen_doc.sents:
    sentence = [
        token.lemma_.lower()
        for token in sentence
        if not token.is_stop
        and not token.is_punct
    ]
    sentences.append(sentence)


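print(sentences[20])
# Note: len(austen_clean) counts characters in the full cleaned text, not tokens.
print('We have {} sentences; the full cleaned text has {} characters.'.format(len(sentences), len(austen_clean)))

['daughter', 'eld', 'give', 'thing', 'tempt']
We have 6714 sentences; the full cleaned text has 2006272 characters.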


In [9]:
import gensim
from gensim.models import word2vec

model = word2vec.Word2Vec(
    sentences,
    workers=4,     # Number of threads to run in parallel (if your computer does parallel processing).
    min_count=25,  # Minimum word count threshold; rarer words are dropped.
    window=20,     # Number of words around target word to consider.
    sg=0,          # Use CBOW because our corpus is small.
    sample=1e-3,   # Downsample very frequent words.
    size=100,      # Word vector length (renamed vector_size in gensim 4).
    hs=1           # Use hierarchical softmax.
)

print('done!')

done!


In [11]:
# List of words in model.
vocab = model.wv.vocab.keys()

print(model.wv.most_similar(positive=['lady', 'man'], negative=['woman']))

# Similarity is the cosine of the angle between two word vectors: 1 means
# the same direction, 0 means unrelated, and negative values mean opposed.
print(model.wv.similarity('excellent', 'great'))
print(model.wv.similarity('mr', 'mrs'))

# One of these things is not like the other...
print(model.wv.doesnt_match("breakfast marriage dinner lunch".split()))

[('mary', 0.9835100173950195), ('goddard', 0.9834635257720947), ('bates', 0.9827357530593872), ('possible', 0.9825205206871033), ('case', 0.9823529124259949), ('love', 0.9816750288009644), ('henrietta', 0.9808393716812134), ('till', 0.9804214239120483), ('bad', 0.9787410497665405), ('morrow', 0.9786244630813599)]
0.9560243
0.9164338
dinner
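
The similarity scores above are cosine similarities. As a rough illustration of what that number means, here is a small sketch that computes it by hand for made-up two-dimensional vectors (real word vectors here have 100 dimensions). The function `cosine_similarity` is my own helper, not a gensim call.

```python
import math


def cosine_similarity(a, b):
    # Dot product divided by the product of the two vector lengths.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors chosen for illustration only.
same = cosine_similarity([1.0, 2.0], [2.0, 4.0])         # same direction
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])   # at right angles
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])   # opposite direction

print(same, orthogonal, opposite)
```

Because only the direction matters, vectors of different lengths can still score close to 1. Scores near 0.98 for almost every word, as in the `most_similar` output above, suggest the vectors are all pointing in much the same direction, which may mean the model has not learned to separate words well on this small corpus.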



# Drill 0 Thoughts

Changing the hyperparameters, especially 'min_count', 'window', and 'size', can change the model's results quite a bit. There is also some randomness in training: the results change at least slightly on every run, whether or not the hyperparameters have been altered. Of these three hyperparameters, and in my very limited experimentation, 'window' seems to have the potential to change the results the most. It sets the number of words around the target word that are considered as context. Taking more words into account might be expected to improve accuracy, although I have not measured this, and a wider window can also pull in words that are only loosely related to the target.
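
To make the 'window' idea concrete, here is a minimal sketch of which words count as context for a given target. The helper `context_words` is hypothetical and simplified: as far as I know, gensim also shrinks the window randomly during training, which this sketch ignores. The example sentence is the one printed by the notebook above.

```python
def context_words(tokens, index, window):
    # Up to `window` tokens on each side of the target, within one sentence.
    left = tokens[max(0, index - window):index]
    right = tokens[index + 1:index + 1 + window]
    return left + right


sentence = ['daughter', 'eld', 'give', 'thing', 'tempt']

# Target word is 'give' (index 2).
print(context_words(sentence, 2, 1))
print(context_words(sentence, 2, 20))
```

Context is taken from within a single sentence, so once the window is wider than the sentence, increasing it further changes nothing. After stop words are removed many sentences here are quite short, so a window of 20 probably covers the whole sentence most of the time.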

# Drill 1 Thoughts

Singular nouns very often match their plural forms.

Capitalized words very often match their uncapitalized versions.

Some names matched to a common name, but not in a way I would have expected. For example, both "Djokovic" and "Federer" match to "Nadal", whereas I would have predicted that "Federer" would have been the common match.

When analyzing sets of words, it wasn't difficult to find sets that produced a "solution" I would not have predicted. For example, in thinking about baseball, "mariners yankees cubs athletics" produces "athletics", whereas I would have expected "cubs" to be the odd one out, since the Cubs are the only one of these teams that plays in the National League. The model is evidently picking up on something other than league membership. One possibility is that "athletics" also appears as an ordinary noun outside baseball, which would pull its vector away from the team names.
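
It helps to remember that the odd one out is chosen geometrically, not by any knowledge of the subject. As I understand it, the method normalizes each word vector, averages them, and returns the word least similar to that average. The sketch below reproduces that idea in plain Python. The function `odd_one_out` and the two-dimensional toy vectors are my own inventions for illustration, not gensim's code or real embeddings.

```python
import math


def unit(v):
    # Scale a vector to length 1.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def odd_one_out(vectors):
    """vectors: dict of word -> list of floats. Returns the word whose
    unit vector is least similar (by cosine) to the mean of all of them."""
    units = {word: unit(vec) for word, vec in vectors.items()}
    dim = len(next(iter(units.values())))
    mean = unit([sum(u[i] for u in units.values()) / len(units) for i in range(dim)])
    scores = {word: sum(a * b for a, b in zip(u, mean)) for word, u in units.items()}
    return min(scores, key=scores.get)


# Toy vectors: three "meal" words point one way, 'marriage' another.
toy = {
    'breakfast': [0.9, 0.1],
    'dinner': [0.8, 0.2],
    'lunch': [0.85, 0.15],
    'marriage': [0.1, 0.9],
}

print(odd_one_out(toy))
```

Under this view, a word like "athletics" can lose simply because its vector sits furthest from the group average, whatever the reason for that distance.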