# Comparative Analysis of Descriptive Metadata, part 3

## Word Embeddings with Word2Vec

### Newcastle University Special Collections and University of Edinburgh Archives

***

**Table of Contents**
* [Train Embeddings](#train-embeddings)
* [Analysis](#analysis)
  * [Gendered Terms and Adjectives](#gendered-terms-and-adjectives)
  * [Names and Adjectives](#names-and-adjectives)

***

In [1]:
import emb_utils
import pandas as pd
from pathlib import Path
import os, re, glob, string
import gensim
from gensim.models import Word2Vec
from gensim import utils

In [2]:
# Create a directory to save the models
emb_model_dir = "embedding_models/"
Path(emb_model_dir).mkdir(exist_ok=True, parents=True)

Other files (from `analysis_metadata_nusc.ipynb`) to experiment with when creating embeddings include:

In [3]:
nusc_dir = "data/"
nusc_lower = "nusc_descriptions_lower.txt"  # keeps numbers and punctuation attached to a token (e.g., the slashes in 5/1/1), but not separating punctuation such as periods or commas
nusc_lower_alpha = "nusc_descriptions_lower_alpha.txt"
nusc_lower_alpha_no_stopwords = "nusc_descriptions_lower_alpha_no_stopwords.txt"

## Train Embeddings

Use [Word2Vec](https://perma.cc/R282-M8UM)* to create custom word embeddings from the Newcastle and Edinburgh datasets.

**Check out [this resource](https://perma.cc/49GV-E236) for an illustrated explanation of Word2Vec.*

First we'll evaluate how different parameter combinations represent the corpus to decide whether to use skip-gram or continuous bag-of-words (CBOW), and what values to use for `window` (the context window) and `min_count`. We'll keep the vector dimensions fixed at 100.
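To make the difference between the two architectures concrete, here is a simplified sketch of the training examples each one derives from a single tokenized description. The function `training_examples` and the sample sentence are made up for illustration and are not part of Gensim; the real implementation also applies details such as negative sampling and random window shrinking that are left out here.

```python
def training_examples(tokens, window, skip_gram):
    """Build simplified Word2Vec-style training examples for one document.

    CBOW: one (context_words, target) example per token position.
    Skip-gram: one (target, context_word) pair per neighbouring word.
    """
    examples = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        context = left + right
        if not context:
            continue
        if skip_gram:
            examples.extend((target, c) for c in context)
        else:
            examples.append((tuple(context), target))
    return examples


# Hypothetical description, not taken from the corpus
doc = ["photograph", "of", "a", "woman", "holding", "a", "hat"]
cbow_examples = training_examples(doc, window=2, skip_gram=False)
sg_examples = training_examples(doc, window=2, skip_gram=True)

print(cbow_examples[3])  # the context words predict the target "woman"
print(sg_examples[:3])   # each target predicts one neighbour at a time
```

With the same window, skip-gram produces many more (and smaller) training examples than CBOW, which is one reason the two architectures can prefer different window sizes.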

In [None]:
nusc_files = [nusc_lower, nusc_lower_alpha, nusc_lower_alpha_no_stopwords]
file_paths = [nusc_dir+f for f in nusc_files]
params = {
    "file_paths": file_paths, 
    "arch": [0, 1], 
    "min_count": [3, 4, 5], 
    "window": [6, 8, 10],     # Generally ~10 is suitable for skip-gram and ~5 is suitable for CBOW
    "vector_size": [100]      # Default is 100
    }
similar_words1 = ["photograph", "photographs"]
similar_words2 = ["influential", "greatest"]
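As a quick sanity check on the size of this search, the sketch below enumerates every combination in the grid with `itertools.product`. The values are copied from the cell above; the variable names `grid` and `combos` are introduced only for this illustration.

```python
from itertools import product

# Values copied from the parameter cell above
nusc_files = [
    "nusc_descriptions_lower.txt",
    "nusc_descriptions_lower_alpha.txt",
    "nusc_descriptions_lower_alpha_no_stopwords.txt",
]
file_paths = ["data/" + f for f in nusc_files]
grid = {
    "arch": [0, 1],            # 0 = CBOW, 1 = skip-gram
    "min_count": [3, 4, 5],
    "window": [6, 8, 10],
    "vector_size": [100],
}

# One tuple per model to train: (file, arch, min_count, window, vector_size)
combos = list(product(
    file_paths, grid["arch"], grid["min_count"], grid["window"], grid["vector_size"]
))

print(len(combos))  # 54
print(combos[0])
```

Each additional value in any list multiplies the number of models to train, so the grid is worth keeping small.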

In [5]:
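class MyCorpus:
    """Restartable iterable that yields one tokenized document per line."""
    def __iter__(self):
        # Reads the module-level variable `corpus_path`, which must be set before iterating
        with open(corpus_path) as corpus_file:
            for line in corpus_file:
                # Assumes one document per line; simple_preprocess lowercases, tokenizes,
                # and drops tokens shorter than 2 or longer than 15 characters
                yield utils.simple_preprocess(line)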
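The corpus is wrapped in a class with `__iter__` rather than a plain generator because Word2Vec makes more than one pass over `sentences` (one to build the vocabulary, then one per epoch). The sketch below illustrates the difference with a toy file; `LineCorpus`, `line_generator`, and the file contents are hypothetical, and plain `str.split` stands in for Gensim's tokenizer.

```python
import os
import tempfile


class LineCorpus:
    """Restartable iterable: every call to __iter__ reopens the file."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                yield line.split()


def line_generator(path):
    """One-shot generator: exhausted after a single pass."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            yield line.split()


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_corpus.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write("letter from london\nfile of photographs\n")

    corpus = LineCorpus(path)
    class_passes = [len(list(corpus)), len(list(corpus))]

    gen = line_generator(path)
    gen_passes = [len(list(gen)), len(list(gen))]

print(class_passes)  # [2, 2]
print(gen_passes)    # [2, 0]
```

The second pass over the generator yields nothing, which is why a one-shot generator is not suitable as training input.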

In [6]:
architectures = params["arch"]
windows = params["window"]
min_counts = params["min_count"]
vector_sizes = params["vector_size"]
sim_col1 = f"cosine_similarity_{similar_words1[0]}_{similar_words1[1]}"
sim_col2 = f"cosine_similarity_{similar_words2[0]}_{similar_words2[1]}"
df = pd.DataFrame(columns=[
    "file", "architecture", "context_window", "min_freq_count", "vector_size", sim_col1, sim_col2
    ])
for file_path in file_paths:
    corpus_path = file_path
    sentences = MyCorpus()
    for a in architectures:
        for min_count in min_counts:
            for w in windows:
                for vector_size in vector_sizes:
                    # print(file_path, a, min_count, w, vector_size)
                    model = Word2Vec(
                        sentences=sentences, 
                        window=w,
                        min_count=min_count, 
                        workers=3,        # Default is 3
                        epochs=5, 
                        sg=a,
                        vector_size=vector_size
                    )
                    sim1 = model.wv.similarity(similar_words1[0], similar_words1[1])
                    sim2 = model.wv.similarity(similar_words2[0], similar_words2[1])
                    new_row = pd.DataFrame.from_dict({
                        "file":[file_path], "architecture":[a], "context_window":[w], 
                        "min_freq_count":[min_count], "vector_size":[vector_size], 
                        sim_col1 : [sim1], sim_col2: [sim2]
                        })
                    df = pd.concat([df, new_row])

In [7]:
df.head(2)

| | file | architecture | context_window | min_freq_count | vector_size | cosine_similarity_photograph_photographs | cosine_similarity_influential_greatest |
|---|---|---|---|---|---|---|---|
| 0 | data/nusc_descriptions_lower.txt | 0 | 6 | 3 | 100 | 0.651196 | 0.861426 |
| 0 | data/nusc_descriptions_lower.txt | 0 | 8 | 3 | 100 | 0.65297 | 0.840397 |


In [8]:
df.to_csv(emb_model_dir+"nusc_word2vec_model_eval1.csv")  #nusc_word2vec_model_eval2.csv

Based on the evaluation results, we'll use the lowercased alphabetic corpus that excludes stop words (`nusc_descriptions_lower_alpha_no_stopwords.txt`) and the continuous bag-of-words (CBOW) architecture, as those yielded the highest cosine similarity scores for the chosen word pairs in both evaluation runs. Two parameter combinations rank among the seven highest cosine similarity scores: a context window of 8 with a minimum token frequency count of 3, and a context window of 6 with a minimum count of 5.

(These parameters returned a cosine similarity of about 0.70-0.71 for "photograph" and "photographs," and 0.93-0.97 for "influential" and "greatest.")
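For reference, cosine similarity measures the angle between two vectors, ignoring their lengths: values near 1 mean the vectors point in nearly the same direction, and values near 0 mean they are close to orthogonal. Below is a minimal pure-Python sketch of the calculation; the function name and the toy vectors are illustrative and unrelated to the trained embeddings.

```python
import math


def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy 2-dimensional vectors, not real embeddings
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])   # close to 1
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])       # 0
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])       # close to -1

print(same_direction, orthogonal, opposite)
```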

Since Gensim's Word2Vec documentation recommends a context window of about 5 for CBOW, let's use the latter set of parameters (a window of 6 with a minimum count of 5) for our word embedding model.

In [11]:
corpus_path = file_paths[2]
print(corpus_path)

data/nusc_descriptions_lower_alpha_no_stopwords.txt


In [12]:
sentences = MyCorpus()
window = 6           
sg = 0
if sg == 1:
    arch = "sg"
else:
    arch = "cbow"
min_count = 5
vector_size = 100

In [None]:
nusc_model = Word2Vec(
    sentences=sentences, 
    window=window,
    min_count=min_count, 
    workers=3,             # Default is 3
    epochs=5, 
    sg=sg,
    vector_size=vector_size
    )
nusc_model.save(emb_model_dir+f"nusc_word2vec_{arch}_{vector_size}d_context{window}.model")

In [None]:
# Look at a sample of the words in the model to make sure it was trained as expected
for index, word in enumerate(nusc_model.wv.index_to_key):
    if index == 5:
        break
    print(f"word #{index}/{len(nusc_model.wv.index_to_key)} is {word}")

word #0/9182 is of
word #1/9182 is the
word #2/9182 is and
word #3/9182 is to
word #4/9182 is in


Let's investigate relationships between grammatically and lexically gendered words and the most common adjectives from the `nusc_uoe_comarison` notebook.

In [60]:
man_similar = nusc_model.wv.most_similar("man", topn=10)
men_similar = nusc_model.wv.most_similar("men", topn=10)
boy_similar = nusc_model.wv.most_similar("boy", topn=10)
boys_similar = nusc_model.wv.most_similar("boys", topn=10)
print(man_similar, men_similar)
print()
print(boy_similar, boys_similar)

[('woman', 0.9337855577468872), ('dog', 0.8908532857894897), ('illustration', 0.872054934501648), ('bird', 0.8476430773735046), ('boy', 0.8368086814880371), ('caption', 0.8355693817138672), ('horse', 0.8285383582115173), ('men', 0.8200470805168152), ('fishing', 0.8187179565429688), ('fish', 0.8137270212173462)] [('boy', 0.8402321934700012), ('man', 0.8200470805168152), ('woman', 0.813395082950592), ('horse', 0.7978457808494568), ('illustration', 0.7831832766532898), ('fishing', 0.781781792640686), ('night', 0.765031635761261), ('birds', 0.7648122906684875), ('talking', 0.7642461657524109), ('dog', 0.7623046040534973)]

[('horse', 0.9071115851402283), ('woman', 0.9002888798713684), ('holding', 0.8924639225006104), ('sat', 0.8841385245323181), ('talking', 0.8721470236778259), ('older', 0.8648854494094849), ('caption', 0.86201411485672), ('hat', 0.8616151809692383), ('grandma', 0.8555570840835571), ('wearing', 0.8542486429214478)] [('holly', 0.9339240193367004), ('tramp', 0.92756712436676

In [61]:
woman_similar = nusc_model.wv.most_similar("woman", topn=10)
women_similar = nusc_model.wv.most_similar("women", topn=10)
girl_similar = nusc_model.wv.most_similar("girl", topn=10)
girls_similar = nusc_model.wv.most_similar("girls", topn=10)
print(woman_similar, women_similar)
print()
print(girl_similar, girls_similar)

[('man', 0.9337854981422424), ('boy', 0.9002889394760132), ('hat', 0.8966806530952454), ('horse', 0.8709551095962524), ('dog', 0.8581132292747498), ('talking', 0.8508883118629456), ('holding', 0.8499537110328674), ('bird', 0.8389260768890381), ('illustration', 0.8326569199562073), ('sat', 0.8278467655181885)] [('ireland', 0.829063892364502), ('teaching', 0.7741239070892334), ('irish', 0.770521879196167), ('suffrage', 0.7630794644355774), ('traveller', 0.7527427673339844), ('politics', 0.7476858496665955), ('education', 0.7421779632568359), ('african', 0.727326512336731), ('peace', 0.7267255187034607), ('affairs', 0.7241491079330444)]

[('playing', 0.8857541680335999), ('rod', 0.8604581952095032), ('outdoors', 0.8544914722442627), ('guides', 0.8522405624389648), ('clothes', 0.8401862978935242), ('swimming', 0.830213189125061), ('dress', 0.8267804384231567), ('northallerton', 0.8266317844390869), ('dancing', 0.8251681327819824), ('spending', 0.8245590925216675)] [('holding', 0.8831355571

A smaller context window seems better (6 or 8 rather than 10).

### CBOW

In [90]:
# Specify the model parameters, which are also used to build the saved model's file name
sentences = MyCorpus()
window = 6           # generally ~10 is suitable for skip-gram and ~5 is suitable for CBOW
sg = 0
if sg == 1:
    arch = "sg"
else:
    arch = "cbow"
min_count = 2
vector_size = 100    # Default is 100

In [91]:
nusc_model = Word2Vec(
    sentences=sentences, 
    window=window,
    min_count=min_count, 
    workers=3,        # Default is 3
    epochs=5, 
    sg=sg,
    vector_size=vector_size
    )
# nusc_model.save(emb_model_dir+f"nusc_word2vec_{arch}_{vector_size}d_context{window}.model")

In [92]:
# Look at a sample of the words in the model to make sure it was trained as expected
for index, word in enumerate(nusc_model.wv.index_to_key):
    if index == 10:
        break
    print(f"word #{index}/{len(nusc_model.wv.index_to_key)} is {word}")

word #0/17326 is of
word #1/17326 is the
word #2/17326 is and
word #3/17326 is to
word #4/17326 is in
word #5/17326 is from
word #6/17326 is for
word #7/17326 is letter
word #8/17326 is on
word #9/17326 is file
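This vocabulary has 17326 entries, compared with 9182 in the earlier model trained with `min_count = 5`, most likely because lowering `min_count` to 2 keeps many more rare words. The sketch below shows the frequency filter in isolation; the `vocabulary` function and the toy documents are hypothetical, and Gensim's actual vocabulary building does more than this.

```python
from collections import Counter


def vocabulary(docs, min_count):
    """Keep only tokens whose total frequency is at least min_count."""
    counts = Counter(token for doc in docs for token in doc)
    return {token for token, count in counts.items() if count >= min_count}


# Toy documents, not taken from the corpus
docs = [
    ["letter", "from", "london"],
    ["letter", "to", "london"],
    ["letter", "file"],
]

vocab_1 = vocabulary(docs, min_count=1)
vocab_2 = vocabulary(docs, min_count=2)
vocab_3 = vocabulary(docs, min_count=3)

print(len(vocab_1), len(vocab_2), len(vocab_3))  # 5 2 1
```

Rare words kept by a low threshold have few training contexts, so their vectors tend to be less reliable than those of frequent words.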


Let's investigate the similarity of words we'd expect to be similar in the corpus.

In [93]:
nusc_model.wv.most_similar("file", topn=10)

### with nusc_lower_alpha_no_stopwords:
 ## SG
# [('files', 0.7466872930526733),
#  ('outgoing', 0.7317661643028259),
#  ('incoming', 0.7259922027587891),
#  ('crossrail', 0.7138091325759888),
#  ('icl', 0.7008764743804932),
#  ('edco', 0.6982001066207886),
#  ('pr', 0.6961277723312378),
#  ('scans', 0.6931055188179016),
#  ('collaborator', 0.6898413896560669),
#  ('bishopsgate', 0.68710857629776)]

 ## CBOW
# [('files', 0.8331142663955688),
#  ('historic', 0.7841407060623169),
#  ('outgoing', 0.7751049995422363),
#  ('edco', 0.7709203362464905),
#  ('maps', 0.7653296589851379),
#  ('brochures', 0.7587681412696838),
#  ('divided', 0.7556792497634888),
#  ('designs', 0.7539230585098267),
#  ('preparation', 0.7517764568328857),
#  ('presentation', 0.7514598369598389)]

[('files', 0.7767755389213562),
 ('project', 0.7417789697647095),
 ('photogrpahs', 0.7277644276618958),
 ('brochure', 0.7108482718467712),
 ('masterplan', 0.6939000487327576),
 ('designs', 0.6929506659507751),
 ('concept', 0.6849844455718994),
 ('presentation', 0.6834692358970642),
 ('historical', 0.6816883087158203),
 ('maps', 0.6786699891090393)]

In [94]:
nusc_model.wv.most_similar("letters", topn=10)

### with nusc_lower_alpha_no_stopwords
 ## SG
# [('testimonial', 0.6532636284828186),
#  ('swanage', 0.6265147924423218),
#  ('telegrams', 0.6239833235740662),
#  ('capel', 0.6227788329124451),
#  ('robin', 0.6210677623748779),
#  ('rct', 0.6200641989707947),
#  ('nettlecombe', 0.6174697875976562),
#  ('cheyne', 0.6166645288467407),
#  ('receipts', 0.6135836839675903),
#  ('maconochie', 0.6127098798751831)]

 ## CBOW
#  [('diary', 0.8785619735717773),
#  ('rowcliffe', 0.8287729024887085),
#  ('brewis', 0.8260412216186523),
#  ('macaulay', 0.8240538835525513),
#  ('death', 0.8071291446685791),
#  ('diverse', 0.8017675280570984),
#  ('trevelyan', 0.7953676581382751),
#  ('bell', 0.7860344648361206),
#  ('career', 0.7856048345565796),
#  ('postcard', 0.7852859497070312)]

[('postcard', 0.7146406173706055),
 ('brannon', 0.7076549530029297),
 ('diary', 0.6909405589103699),
 ('postcards', 0.6608390808105469),
 ('trip', 0.6601690649986267),
 ('death', 0.6571880578994751),
 ('crimea', 0.6488868594169617),
 ('trevelyan', 0.6360825300216675),
 ('bell', 0.6304318308830261),
 ('trusteeship', 0.6229152679443359)]

In [95]:
nusc_model.wv.most_similar("photograph", topn=10)

### with nusc_lower_alpha_no_stopwords
 ## SG
# [('photographs', 0.6903167963027954),
#  ('portrait', 0.6546804904937744),
#  ('photo', 0.635033130645752),
#  ('potentially', 0.5975172519683838),
#  ('backed', 0.5960174798965454),
#  ('relative', 0.5952077507972717),
#  ('portraying', 0.5813989639282227),
#  ('white', 0.5800067782402039),
#  ('colour', 0.575546383857727),
#  ('outside', 0.5735738277435303)]

 ## CBOW
# [('photographs', 0.6903167963027954),
#  ('portrait', 0.6546804904937744),
#  ('photo', 0.635033130645752),
#  ('potentially', 0.5975172519683838),
#  ('backed', 0.5960174798965454),
#  ('relative', 0.5952077507972717),
#  ('portraying', 0.5813989639282227),
#  ('white', 0.5800067782402039),
#  ('colour', 0.575546383857727),
#  ('outside', 0.5735738277435303)]

[('portraying', 0.7605535984039307),
 ('photo', 0.7457089424133301),
 ('canvas', 0.6874290108680725),
 ('white', 0.6861698627471924),
 ('portrait', 0.6805410385131836),
 ('photographs', 0.6617839336395264),
 ('fluid', 0.6378166675567627),
 ('woman', 0.6364460587501526),
 ('photographic', 0.6357232332229614),
 ('black', 0.6309189200401306)]

In [96]:
nusc_model.wv.most_similar("man", topn=10)

### with nusc_lower_alpha_no_stopwords
 ## SG
# [('woman', 0.8484075665473938),
#  ('boy', 0.8225122094154358),
#  ('dressed', 0.8222455382347107),
#  ('magistrate', 0.8101038336753845),
#  ('smartly', 0.8084942102432251),
#  ('talking', 0.8078976273536682),
#  ('dog', 0.8016583323478699),
#  ('fish', 0.7976317405700684),
#  ('observing', 0.7971518039703369),
#  ('grass', 0.7877422571182251)]

 ## CBOW
#  [('woman', 0.8484075665473938),
#  ('boy', 0.8225122094154358),
#  ('dressed', 0.8222455382347107),
#  ('magistrate', 0.8101038336753845),
#  ('smartly', 0.8084942102432251),
#  ('talking', 0.8078976273536682),
#  ('dog', 0.8016583323478699),
#  ('fish', 0.7976317405700684),
#  ('observing', 0.7971518039703369),
#  ('grass', 0.7877422571182251)]

[('woman', 0.9410372972488403),
 ('dog', 0.9109161496162415),
 ('illustration', 0.8857969641685486),
 ('bird', 0.8558411002159119),
 ('boy', 0.8472152352333069),
 ('men', 0.8378831148147583),
 ('birds', 0.8346065282821655),
 ('fishing', 0.8319897651672363),
 ('foreground', 0.8214560747146606),
 ('watercolour', 0.8190545439720154)]

In [97]:
nusc_model.wv.most_similar("men", topn=10)

### with nusc_lower_alpha_no_stopwords
 ## SG
# [('talking', 0.7804786562919617),
#  ('eating', 0.7744420170783997),
#  ('killed', 0.7701953649520874),
#  ('soldiers', 0.7627596259117126),
#  ('waiter', 0.759706437587738),
#  ('smartly', 0.7592019438743591),
#  ('stags', 0.7579233646392822),
#  ('deer', 0.7574462294578552),
#  ('caption', 0.7534621357917786),
#  ('rabbits', 0.7507340312004089)]

 ## CBOW
# [('talking', 0.7804786562919617),
#  ('eating', 0.7744420170783997),
#  ('killed', 0.7701953649520874),
#  ('soldiers', 0.7627596259117126),
#  ('waiter', 0.759706437587738),
#  ('smartly', 0.7592019438743591),
#  ('stags', 0.7579233646392822),
#  ('deer', 0.7574462294578552),
#  ('caption', 0.7534621357917786),
#  ('rabbits', 0.7507340312004089)]

[('boy', 0.8657723069190979),
 ('man', 0.8378832340240479),
 ('woman', 0.8297742009162903),
 ('dog', 0.8235422372817993),
 ('talking', 0.8170042037963867),
 ('bird', 0.8152064085006714),
 ('birds', 0.8132823705673218),
 ('caption', 0.8104074001312256),
 ('hunting', 0.8051303029060364),
 ('uniform', 0.8033609986305237)]

In [98]:
nusc_model.wv.most_similar("woman", topn=10)

### with nusc_lower_alpha_no_stopwords
 ## SG
# [('wearing', 0.8845924139022827),
#  ('holding', 0.8606200814247131),
#  ('boy', 0.8556447625160217),
#  ('knitting', 0.8531962633132935),
#  ('man', 0.8484075665473938),
#  ('sat', 0.8451454043388367),
#  ('backed', 0.8439164757728577),
#  ('dressed', 0.8423831462860107),
#  ('hat', 0.83938068151474),
#  ('observing', 0.8337906002998352)]

 ## CBOW
#  [('wearing', 0.8845924139022827),
#  ('holding', 0.8606200814247131),
#  ('boy', 0.8556447625160217),
#  ('knitting', 0.8531962633132935),
#  ('man', 0.8484075665473938),
#  ('sat', 0.8451454043388367),
#  ('backed', 0.8439164757728577),
#  ('dressed', 0.8423831462860107),
#  ('hat', 0.83938068151474),
#  ('observing', 0.8337906002998352)]

[('man', 0.9410372376441956),
 ('boy', 0.9070931673049927),
 ('dog', 0.8777681589126587),
 ('bird', 0.8735437393188477),
 ('talking', 0.8582047820091248),
 ('foreground', 0.856869637966156),
 ('illustration', 0.8519386649131775),
 ('birds', 0.8442080616950989),
 ('hat', 0.8439050316810608),
 ('men', 0.8297741413116455)]

In [99]:
nusc_model.wv.most_similar("women", topn=10)

### with nusc_lower_alpha_no_stopwords
# [('suffrage', 0.7628691792488098),
#  ('popular', 0.7125651240348816),
#  ('traveller', 0.7020295262336731),
#  ('federation', 0.6859576106071472),
#  ('institutes', 0.6706743836402893),
#  ('mountains', 0.667505145072937),
#  ('unemployment', 0.6652866005897522),
#  ('careers', 0.6585383415222168),
#  ('adults', 0.6571559906005859),
#  ('sixty', 0.6563506126403809)]

 ## CBOW
# [('suffrage', 0.7628691792488098),
#  ('popular', 0.7125651240348816),
#  ('traveller', 0.7020295262336731),
#  ('federation', 0.6859576106071472),
#  ('institutes', 0.6706743836402893),
#  ('mountains', 0.667505145072937),
#  ('unemployment', 0.6652866005897522),
#  ('careers', 0.6585383415222168),
#  ('adults', 0.6571559906005859),
#  ('sixty', 0.6563506126403809)]

[('ireland', 0.8727635145187378),
 ('politics', 0.7913272976875305),
 ('irish', 0.7819133400917053),
 ('education', 0.7754734754562378),
 ('russia', 0.7749536037445068),
 ('indian', 0.7719627022743225),
 ('peace', 0.7717481255531311),
 ('social', 0.7717099785804749),
 ('traveller', 0.7682931423187256),
 ('suffrage', 0.7645044922828674)]

In [100]:
nusc_model.wv.most_similar("key", topn=10)

### with nusc_lower_alpha_no_stopwords
 ## SG
# [('strategies', 0.8460091948509216),
#  ('facilities', 0.8363745212554932),
#  ('population', 0.83343505859375),
#  ('constraints', 0.8240486979484558),
#  ('hub', 0.8183097839355469),
#  ('function', 0.8164852261543274),
#  ('incorporating', 0.8153958320617676),
#  ('achieve', 0.8151978850364685),
#  ('designated', 0.8110092878341675),
#  ('encompassing', 0.8107568025588989)]

  ## CBOW
# [('strategies', 0.8460091948509216),
#  ('facilities', 0.8363745212554932),
#  ('population', 0.83343505859375),
#  ('constraints', 0.8240486979484558),
#  ('hub', 0.8183097839355469),
#  ('function', 0.8164852261543274),
#  ('incorporating', 0.8153958320617676),
#  ('achieve', 0.8151978850364685),
#  ('designated', 0.8110092878341675),
#  ('encompassing', 0.8107568025588989)]

[('retail', 0.9050933122634888),
 ('space', 0.9026519656181335),
 ('spaces', 0.9018585085868835),
 ('facilities', 0.886414110660553),
 ('areas', 0.8766377568244934),
 ('water', 0.8765712380409241),
 ('residential', 0.8642987608909607),
 ('structure', 0.8596487641334534),
 ('reuse', 0.851976752281189),
 ('massing', 0.8495070934295654)]

In [101]:
nusc_model.wv.most_similar("influential", topn=10)

### with nusc_lower_alpha_no_stopwords
 ## SG
# [('renowned', 0.9030014276504517),
#  ('controversial', 0.896609365940094),
#  ('era', 0.8870092630386353),
#  ('haibun', 0.876724898815155),
#  ('négritude', 0.8765851855278015),
#  ('dagger', 0.8763759732246399),
#  ('aged', 0.875598669052124),
#  ('finest', 0.8741796016693115),
#  ('thirty', 0.872181236743927),
#  ('reviewer', 0.870881974697113)]

 ## CBOW
# [('renowned', 0.9030014276504517),
#  ('controversial', 0.896609365940094),
#  ('era', 0.8870092630386353),
#  ('haibun', 0.876724898815155),
#  ('négritude', 0.8765851855278015),
#  ('dagger', 0.8763759732246399),
#  ('aged', 0.875598669052124),
#  ('finest', 0.8741796016693115),
#  ('thirty', 0.872181236743927),
#  ('reviewer', 0.870881974697113)]

[('important', 0.8529764413833618),
 ('crime', 0.8338021636009216),
 ('effectively', 0.8248555660247803),
 ('frequently', 0.8217456340789795),
 ('renowned', 0.8181887865066528),
 ('body', 0.8177512884140015),
 ('language', 0.8141488432884216),
 ('languages', 0.8141064643859863),
 ('among', 0.8137490749359131),
 ('countries', 0.8130801916122437)]

In [102]:
nusc_model.wv.most_similar("significant", topn=10)

### with nusc_lower_alpha_no_stopwords
 ## SG
# [('covered', 0.8478525876998901),
#  ('resources', 0.8471731543540955),
#  ('online', 0.8461249470710754),
#  ('demonstrate', 0.8445181250572205),
#  ('bids', 0.8415424823760986),
#  ('themes', 0.8382758498191833),
#  ('initiatives', 0.8319141268730164),
#  ('sector', 0.8317221403121948),
#  ('basis', 0.8311277031898499),
#  ('objectives', 0.8302891850471497)]

 ## CBOW
# [('covered', 0.8478525876998901),
#  ('resources', 0.8471731543540955),
#  ('online', 0.8461249470710754),
#  ('demonstrate', 0.8445181250572205),
#  ('bids', 0.8415424823760986),
#  ('themes', 0.8382758498191833),
#  ('initiatives', 0.8319141268730164),
#  ('sector', 0.8317221403121948),
#  ('basis', 0.8311277031898499),
#  ('objectives', 0.8302891850471497)]

[('relevance', 0.8954179883003235),
 ('remit', 0.8918818235397339),
 ('extensively', 0.8898897767066956),
 ('exploring', 0.8885937929153442),
 ('ahead', 0.8843324780464172),
 ('similar', 0.8808154463768005),
 ('output', 0.8806847929954529),
 ('employed', 0.8798457980155945),
 ('deliberately', 0.8797436356544495),
 ('maintain', 0.8794825077056885)]

***

How similar are pretrained vectors to the custom vectors?
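One rough way to put a number on this kind of comparison is the overlap between two models' nearest-neighbour lists for the same word. The sketch below computes the Jaccard overlap of two neighbour lists for "file" reported earlier in this notebook (the commented-out CBOW run and the final output); the same function could be applied to a pretrained model's neighbours. This is an illustrative measure under that assumption, not necessarily the approach taken in the rest of the notebook.

```python
def jaccard_overlap(neighbours_a, neighbours_b):
    """Size of the intersection divided by the size of the union."""
    set_a, set_b = set(neighbours_a), set(neighbours_b)
    return len(set_a & set_b) / len(set_a | set_b)


# Top-10 neighbours of "file" copied from the outputs above
earlier_cbow_run = [
    "files", "historic", "outgoing", "edco", "maps",
    "brochures", "divided", "designs", "preparation", "presentation",
]
latest_run = [
    "files", "project", "photogrpahs", "brochure", "masterplan",
    "designs", "concept", "presentation", "historical", "maps",
]

shared = sorted(set(earlier_cbow_run) & set(latest_run))
overlap = jaccard_overlap(earlier_cbow_run, latest_run)

print(shared)   # ['designs', 'files', 'maps', 'presentation']
print(overlap)  # 0.25
```

Even two custom models trained on the same corpus share only four of ten neighbours here, which gives a useful baseline when judging how far pretrained vectors diverge.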

## Pretrained Embeddings

#### Floret

Floret combines fastText with Bloom embeddings (spaCy's default); see the [documentation](https://github.com/explosion/floret).