##### Problem - Detect whether two job titles are similar

This script contains three different models, each with its own training parameters. Comment out the ones you don't need.

The first method trains a new Word2Vec model from scratch on a dataset of ~300,000 job titles. Its results are very poor.

The second method uses Word2Vec with the pre-trained GoogleNews model. It gives good results, but only for individual words.

The third method trains Doc2Vec on the same dataset. It yields the best results so far.


Each model is saved to the working directory at the end of training and can be loaded again with its respective load method.

###### First Method - Using Word2Vec custom model

In [5]:
import os
import re
import pandas as pd
import numpy as np
from time import time
from collections import defaultdict
from nltk.tokenize import RegexpTokenizer

In [6]:
#$ pip install -r requirements.txt

In [7]:
import logging
logging.basicConfig(format="%(levelname)s - %(asctime)s: %(message)s", datefmt= '%H:%M:%S', level=logging.INFO)

In [8]:
# find current working directory
directory = os.getcwd()

# specifying csv filename, this is a switch between two datasets in the directory.

# filename = "job_title.csv"
filename = "titles_final.csv"

# build the full path portably (a hard-coded "\\" prefix only works on Windows)
file = os.path.join(directory, filename)

# load csv file containing job titles
df = pd.read_csv(file)

In [9]:
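# a little data cleaning step

# count missing values per column (not displayed, since only the last expression of a cell is shown)
df.isnull().sum()
# dropna() returns a new DataFrame; df itself is left unchanged here
df.dropna()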

# this could be optional - in a future step, vocab should only consider unique words
# df = df.drop_duplicates(keep='first', inplace=False, ignore_index=False)

Unnamed: 0,Titles
0,Senior Product Manager
1,Solutions Engineer
2,Staff Software Engineer
3,Head of Product (Platform)
4,Incubation Lead Success Cloud
...,...
298391,IT Analyst
298392,Concertmaster
298393,Technical Lead
298394,Sales Associate


In [28]:
# tokenization

# filter_lowercase = df['Titles'].str.lower()

df_sentences = df['Titles'].astype('str').tolist()
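tokenizer = RegexpTokenizer(r'\w+')
# note: this only lowercases each title; the word-level split below is commented out,
# so every title stays a single string
df_tokenized = [w.lower() for w in df_sentences]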
# df_tokenized = [tokenizer.tokenize(i) for i in df_tokenized]

# removing punctuations and extra spacing between words

# df["Titles"] = filter_lowercase.str.replace('[^\w\s]', '')
# df["Titles"] = filter_lowercase.str.replace('\s+', ' ', regex=True)

In [29]:
df_tokenized[1:10] 

['solutions engineer',
 'staff software engineer',
 'head of product (platform)',
 'incubation lead  success cloud',
 'author',
 'technical writer',
 'product',
 'technology architecture and operations',
 'software engineer']

In [31]:
# converting each element into its own list - doing this because we need each individual job title as its own list


def listoflists(inputlist):
    return [[el] for el in inputlist]

nestedlist = listoflists(df_tokenized)

# reinitializing the dataset again

# df = pd.DataFrame({'Titles':nestedlist})
nestedlist[1:10]

[['solutions engineer'],
 ['staff software engineer'],
 ['head of product (platform)'],
 ['incubation lead  success cloud'],
 ['author'],
 ['technical writer'],
 ['product'],
 ['technology architecture and operations'],
 ['software engineer']]
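
The output above shows each inner list holding one whole title as a single string, whereas Word2Vec expects each sentence to be a list of word tokens. A minimal sketch of word-level tokenization follows, using the standard library's `re` module as a stand-in for `RegexpTokenizer(r'\w+')`. The helper name `tokenize_title` and the sample titles are illustrative and not part of the original notebook.

```python
import re

def tokenize_title(title):
    """Lowercase a job title and split it into word tokens (runs of word characters)."""
    return re.findall(r"\w+", title.lower())

# two sample titles taken from the dataset preview
sample_titles = ["Solutions Engineer", "Head of Product (Platform)"]
tokenized_titles = [tokenize_title(t) for t in sample_titles]
print(tokenized_titles)
```

Feeding lists like these to the model, instead of one-element lists, would give it word contexts to learn from.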

In [33]:
# Phrases, Phraser, bigram

from gensim.models.phrases import Phrases, Phraser

In [34]:
phrases = Phrases(nestedlist, min_count=2, progress_per=10000)

INFO - 20:53:51: collecting all words and their counts
INFO - 20:53:51: PROGRESS: at sentence #0, processed 0 words and 0 word types
INFO - 20:53:51: PROGRESS: at sentence #10000, processed 10000 words and 5996 word types
INFO - 20:53:51: PROGRESS: at sentence #20000, processed 20000 words and 10906 word types
INFO - 20:53:51: PROGRESS: at sentence #30000, processed 30000 words and 15154 word types
INFO - 20:53:51: PROGRESS: at sentence #40000, processed 40000 words and 18746 word types
INFO - 20:53:51: PROGRESS: at sentence #50000, processed 50000 words and 21926 word types
INFO - 20:53:51: PROGRESS: at sentence #60000, processed 60000 words and 26699 word types
INFO - 20:53:51: PROGRESS: at sentence #70000, processed 70000 words and 31341 word types
INFO - 20:53:51: PROGRESS: at sentence #80000, processed 80000 words and 35600 word types
INFO - 20:53:51: PROGRESS: at sentence #90000, processed 90000 words and 39443 word types
INFO - 20:53:51: PROGRESS: at sentence #100000, processed 

In [35]:
bigram = Phraser(phrases)

INFO - 20:53:54: exporting phrases from Phrases<120044 vocab, min_count=2, threshold=10.0, max_vocab_size=40000000>
INFO - 20:53:54: FrozenPhrases lifecycle event {'msg': 'exported FrozenPhrases<0 phrases, min_count=2, threshold=10.0> from Phrases<120044 vocab, min_count=2, threshold=10.0, max_vocab_size=40000000> in 0.03s', 'datetime': '2021-11-15T20:53:54.640178', 'gensim': '4.0.1', 'python': '3.6.13 |Anaconda, Inc.| (default, Mar 16 2021, 11:37:27) [MSC v.1916 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.19041-SP0', 'event': 'created'}


In [36]:
bigram_sentences = bigram[nestedlist]
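
The log above reports 0 exported phrases. That is consistent with every sentence containing a single token, because there are then no adjacent word pairs to score. As a rough sketch of how a default-style bigram score is computed once word-level tokens are available, the function below uses made-up counts; the name `bigram_score` is hypothetical and the formula is an assumption about gensim's default scorer.

```python
def bigram_score(count_a, count_b, count_ab, vocab_size, min_count):
    """Phrase score: higher means the pair co-occurs more often than chance would suggest."""
    return (count_ab - min_count) / count_a / count_b * vocab_size

# made-up counts for illustration only
score = bigram_score(count_a=50, count_b=40, count_ab=30, vocab_size=1000, min_count=2)
print(score)
print(score > 10.0)  # a pair is kept as a phrase when its score exceeds the threshold
```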

In [37]:
# building the model

import multiprocessing
from gensim.models import Word2Vec

In [38]:
# parameters for the model

model = Word2Vec(min_count=1, 
                 window=4, 
                 vector_size=300, 
                 sample=0, 
                 sg=0, 
                 alpha=0.03, 
                 min_alpha=0.0007, 
                 negative=5, 
                 workers=4)

INFO - 20:54:01: Word2Vec lifecycle event {'params': 'Word2Vec(vocab=0, vector_size=300, alpha=0.03)', 'datetime': '2021-11-15T20:54:01.212172', 'gensim': '4.0.1', 'python': '3.6.13 |Anaconda, Inc.| (default, Mar 16 2021, 11:37:27) [MSC v.1916 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.19041-SP0', 'event': 'created'}
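
The `alpha` and `min_alpha` parameters set the learning-rate range: during training the rate falls linearly from `alpha` to `min_alpha` as training progresses. A small sketch of that schedule in plain Python, where the helper name `linear_alpha` is hypothetical:

```python
def linear_alpha(start_alpha, end_alpha, progress):
    """Learning rate after a fraction `progress` (0.0 to 1.0) of training is done."""
    return start_alpha - (start_alpha - end_alpha) * progress

# values from the Word2Vec constructor above
for progress in (0.0, 0.5, 1.0):
    print(progress, linear_alpha(0.03, 0.0007, progress))
```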


In [40]:
# building vocabulary from dataset

t=time()
model.build_vocab(bigram_sentences, progress_per=10000)

print('Time to build vocab: {} mins'.format(round((time() - t) / 60, 2)))

INFO - 20:54:36: collecting all words and their counts
INFO - 20:54:36: PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
INFO - 20:54:36: PROGRESS: at sentence #10000, processed 10000 words, keeping 5996 word types
INFO - 20:54:36: PROGRESS: at sentence #20000, processed 20000 words, keeping 10906 word types
INFO - 20:54:36: PROGRESS: at sentence #30000, processed 30000 words, keeping 15154 word types
INFO - 20:54:36: PROGRESS: at sentence #40000, processed 40000 words, keeping 18746 word types
INFO - 20:54:36: PROGRESS: at sentence #50000, processed 50000 words, keeping 21926 word types
INFO - 20:54:36: PROGRESS: at sentence #60000, processed 60000 words, keeping 26699 word types
INFO - 20:54:36: PROGRESS: at sentence #70000, processed 70000 words, keeping 31341 word types
INFO - 20:54:36: PROGRESS: at sentence #80000, processed 80000 words, keeping 35600 word types
INFO - 20:54:36: PROGRESS: at sentence #90000, processed 90000 words, keeping 39443 word types
INFO - 2

Time to build vocab: 0.05 mins


In [41]:
model.corpus_count

298396

In [42]:
# train the model

t = time()
model.train(bigram_sentences, total_examples=model.corpus_count, epochs=100)

print('Time to train the model: {} mins'.format(round((time() - t) / 60, 2)))

model.save("w2v.model")
print("Model Saved")

INFO - 20:54:43: Word2Vec lifecycle event {'msg': 'training model with 4 workers on 120044 vocabulary and 300 features, using sg=0 hs=0 sample=0 negative=5 window=4', 'datetime': '2021-11-15T20:54:43.672500', 'gensim': '4.0.1', 'python': '3.6.13 |Anaconda, Inc.| (default, Mar 16 2021, 11:37:27) [MSC v.1916 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.19041-SP0', 'event': 'train'}
INFO - 20:54:44: EPOCH 1 - PROGRESS: at 60.32% examples, 176700 words/s, in_qsize 7, out_qsize 0
INFO - 20:54:44: worker thread finished; awaiting finish of 3 more threads
INFO - 20:54:44: worker thread finished; awaiting finish of 2 more threads
INFO - 20:54:44: worker thread finished; awaiting finish of 1 more threads
INFO - 20:54:44: worker thread finished; awaiting finish of 0 more threads
INFO - 20:54:44: EPOCH - 1 : training on 298396 raw words (298396 effective words) took 1.1s, 264210 effective words/s
INFO - 20:54:45: worker thread finished; awaiting finish of 3 more threads
INFO - 20:54:45: EPOCH 2

INFO - 20:55:01: EPOCH 17 - PROGRESS: at 80.43% examples, 237551 words/s, in_qsize 6, out_qsize 0
INFO - 20:55:01: worker thread finished; awaiting finish of 3 more threads
INFO - 20:55:01: worker thread finished; awaiting finish of 2 more threads
INFO - 20:55:01: worker thread finished; awaiting finish of 1 more threads
INFO - 20:55:01: worker thread finished; awaiting finish of 0 more threads
INFO - 20:55:01: EPOCH - 17 : training on 298396 raw words (298396 effective words) took 1.1s, 282468 effective words/s
INFO - 20:55:02: EPOCH 18 - PROGRESS: at 90.48% examples, 268430 words/s, in_qsize 3, out_qsize 1
INFO - 20:55:02: worker thread finished; awaiting finish of 3 more threads
INFO - 20:55:02: worker thread finished; awaiting finish of 2 more threads
INFO - 20:55:02: worker thread finished; awaiting finish of 1 more threads
INFO - 20:55:02: worker thread finished; awaiting finish of 0 more threads
INFO - 20:55:02: EPOCH - 18 : training on 298396 raw words (298396 effective words) 

INFO - 20:55:18: worker thread finished; awaiting finish of 2 more threads
INFO - 20:55:18: worker thread finished; awaiting finish of 1 more threads
INFO - 20:55:18: worker thread finished; awaiting finish of 0 more threads
INFO - 20:55:18: EPOCH - 33 : training on 298396 raw words (298396 effective words) took 1.1s, 283016 effective words/s
INFO - 20:55:19: worker thread finished; awaiting finish of 3 more threads
INFO - 20:55:19: worker thread finished; awaiting finish of 2 more threads
INFO - 20:55:19: worker thread finished; awaiting finish of 1 more threads
INFO - 20:55:19: worker thread finished; awaiting finish of 0 more threads
INFO - 20:55:19: EPOCH - 34 : training on 298396 raw words (298396 effective words) took 1.0s, 302753 effective words/s
INFO - 20:55:20: EPOCH 35 - PROGRESS: at 80.43% examples, 236249 words/s, in_qsize 6, out_qsize 0
INFO - 20:55:20: worker thread finished; awaiting finish of 3 more threads
INFO - 20:55:20: worker thread finished; awaiting finish of 2 

INFO - 20:55:35: EPOCH - 49 : training on 298396 raw words (298396 effective words) took 1.2s, 257158 effective words/s
INFO - 20:55:36: EPOCH 50 - PROGRESS: at 90.48% examples, 269118 words/s, in_qsize 2, out_qsize 3
INFO - 20:55:36: worker thread finished; awaiting finish of 3 more threads
INFO - 20:55:36: worker thread finished; awaiting finish of 2 more threads
INFO - 20:55:36: worker thread finished; awaiting finish of 1 more threads
INFO - 20:55:36: worker thread finished; awaiting finish of 0 more threads
INFO - 20:55:36: EPOCH - 50 : training on 298396 raw words (298396 effective words) took 1.0s, 292388 effective words/s
INFO - 20:55:37: EPOCH 51 - PROGRESS: at 90.48% examples, 269149 words/s, in_qsize 3, out_qsize 1
INFO - 20:55:37: worker thread finished; awaiting finish of 3 more threads
INFO - 20:55:37: worker thread finished; awaiting finish of 2 more threads
INFO - 20:55:37: worker thread finished; awaiting finish of 1 more threads
INFO - 20:55:37: worker thread finished

INFO - 20:55:52: worker thread finished; awaiting finish of 3 more threads
INFO - 20:55:52: worker thread finished; awaiting finish of 2 more threads
INFO - 20:55:52: worker thread finished; awaiting finish of 1 more threads
INFO - 20:55:52: worker thread finished; awaiting finish of 0 more threads
INFO - 20:55:52: EPOCH - 66 : training on 298396 raw words (298396 effective words) took 1.0s, 298391 effective words/s
INFO - 20:55:53: worker thread finished; awaiting finish of 3 more threads
INFO - 20:55:53: worker thread finished; awaiting finish of 2 more threads
INFO - 20:55:53: EPOCH 67 - PROGRESS: at 96.65% examples, 282878 words/s, in_qsize 1, out_qsize 1
INFO - 20:55:53: worker thread finished; awaiting finish of 1 more threads
INFO - 20:55:53: worker thread finished; awaiting finish of 0 more threads
INFO - 20:55:53: EPOCH - 67 : training on 298396 raw words (298396 effective words) took 1.0s, 291246 effective words/s
INFO - 20:55:54: worker thread finished; awaiting finish of 3 

INFO - 20:56:10: worker thread finished; awaiting finish of 2 more threads
INFO - 20:56:10: worker thread finished; awaiting finish of 1 more threads
INFO - 20:56:10: worker thread finished; awaiting finish of 0 more threads
INFO - 20:56:10: EPOCH - 83 : training on 298396 raw words (298396 effective words) took 1.0s, 304359 effective words/s
INFO - 20:56:11: worker thread finished; awaiting finish of 3 more threads
INFO - 20:56:11: worker thread finished; awaiting finish of 2 more threads
INFO - 20:56:11: worker thread finished; awaiting finish of 1 more threads
INFO - 20:56:11: worker thread finished; awaiting finish of 0 more threads
INFO - 20:56:11: EPOCH - 84 : training on 298396 raw words (298396 effective words) took 0.9s, 320650 effective words/s
INFO - 20:56:12: EPOCH 85 - PROGRESS: at 90.48% examples, 269803 words/s, in_qsize 2, out_qsize 3
INFO - 20:56:12: worker thread finished; awaiting finish of 3 more threads
INFO - 20:56:12: worker thread finished; awaiting finish of 2 

INFO - 20:56:28: worker thread finished; awaiting finish of 2 more threads
INFO - 20:56:28: worker thread finished; awaiting finish of 1 more threads
INFO - 20:56:28: worker thread finished; awaiting finish of 0 more threads
INFO - 20:56:28: EPOCH - 100 : training on 298396 raw words (298396 effective words) took 1.1s, 282321 effective words/s
INFO - 20:56:28: Word2Vec lifecycle event {'msg': 'training on 29839600 raw words (29839600 effective words) took 104.9s, 284342 effective words/s', 'datetime': '2021-11-15T20:56:28.616240', 'gensim': '4.0.1', 'python': '3.6.13 |Anaconda, Inc.| (default, Mar 16 2021, 11:37:27) [MSC v.1916 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.19041-SP0', 'event': 'train'}
INFO - 20:56:28: Word2Vec lifecycle event {'fname_or_handle': 'w2v.model', 'separately': 'None', 'sep_limit': 10485760, 'ignore': frozenset(), 'datetime': '2021-11-15T20:56:28.617242', 'gensim': '4.0.1', 'python': '3.6.13 |Anaconda, Inc.| (default, Mar 16 2021, 11:37:27) [MSC v.1916 64 

Time to train the model: 1.75 mins


INFO - 20:56:31: storing np array 'syn1neg' to w2v.model.syn1neg.npy
INFO - 20:56:31: not storing attribute cum_table
INFO - 20:56:31: saved w2v.model


Model Saved
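
The figures in the training log can be cross-checked with simple arithmetic. The sketch below uses only numbers that appear in the log and in `model.corpus_count`:

```python
titles = 298396            # model.corpus_count
words_per_epoch = 298396   # "raw words" per epoch in the log
epochs = 100

total_words = words_per_epoch * epochs
words_per_title = words_per_epoch / titles
print(total_words)         # total raw words seen over all epochs
print(words_per_title)     # average number of tokens per title
```

The total matches the 29839600 raw words in the final lifecycle event, and the average works out to one token per title.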


In [43]:
# well let's see how we did

model.wv.most_similar('software engineer', topn=10)

[('senior technical writer contract', 0.26377832889556885),
 ('technical support (contract)', 0.22593717277050018),
 ('president/independent consultant', 0.22013071179389954),
 ('sessional instructor', 0.21934692561626434),
 ('front desk coordinator/senior sourcer/ jr. recruiter', 0.21793535351753235),
 ('qa lead engineer  wafl', 0.217039555311203),
 ('manager - field sales', 0.21519960463047028),
 ('systems engineer summer intern', 0.21515102684497833),
 ('senior director  business operations & analytics', 0.21431227028369904),
 ('game designer  the godfather: the game', 0.2119659036397934)]

In [44]:
model.wv.similarity('software engineer', 'engineer')

-0.073169254
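
The `similarity` call returns the cosine similarity of the two vectors, so values near 0 indicate unrelated directions. A minimal sketch of the computation in plain Python, using toy vectors rather than anything from the trained model:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# toy 3-dimensional vectors, not taken from the trained model
print(cosine_similarity([1.0, 0.0, 1.0], [1.0, 0.0, 1.0]))
print(cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))
```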

###### Well, it seems the result isn't what we were expecting. Each whole title was fed to the model as a single token, so there was no word context to learn from.

###### --------------------------------------------------------------------------------------------------

###### Third Method - Using Doc2Vec to train a model on the same dataset

In [104]:
# using Doc2Vec with df as the dataframe with lowercase csv read

In [105]:
import os
import re
import pandas as pd
import numpy as np
from time import time
from collections import defaultdict
from nltk.tokenize import RegexpTokenizer

In [106]:
from gensim.models import Doc2Vec
from gensim.models.doc2vec import TaggedDocument
from nltk.tokenize import word_tokenize

In [107]:
# find current working directory
directory = os.getcwd()

# specifying csv filename, this is a switch between two datasets in the directory.

# filename = "job_title.csv"
filename = "titles_final.csv"

# build the full path portably (a hard-coded "\\" prefix only works on Windows)
file = os.path.join(directory, filename)

# load csv file containing job titles
df = pd.read_csv(file)

In [108]:
# doc2vec tagging

tagged_data = [TaggedDocument(words=word_tokenize(_d.lower(), preserve_line=True), tags=[str(i)]) for i, _d in enumerate(df["Titles"])]


In [109]:
# note: this binds a second name to the same list object; it is not an independent copy
tagged_data_backup = tagged_data

In [110]:
# def listoflists(inputlist):
#     return [[el] for el in inputlist]

# nestedlist = listoflists(tagged_data)

# # reinitializing the dataset again

#  df = pd.DataFrame({'Titles':nestedlist})
#  df.head

In [111]:
print(nestedlist[1:10])
print("---------------")
print(tagged_data_backup[1:10])

[[TaggedDocument(words=['solutions', 'engineer'], tags=['1'])], [TaggedDocument(words=['staff', 'software', 'engineer'], tags=['2'])], [TaggedDocument(words=['head', 'of', 'product', '(', 'platform', ')'], tags=['3'])], [TaggedDocument(words=['incubation', 'lead', 'success', 'cloud'], tags=['4'])], [TaggedDocument(words=['author'], tags=['5'])], [TaggedDocument(words=['technical', 'writer'], tags=['6'])], [TaggedDocument(words=['product'], tags=['7'])], [TaggedDocument(words=['technology', 'architecture', 'and', 'operations'], tags=['8'])], [TaggedDocument(words=['software', 'engineer'], tags=['9'])]]
---------------
[TaggedDocument(words=['solutions', 'engineer'], tags=['1']), TaggedDocument(words=['staff', 'software', 'engineer'], tags=['2']), TaggedDocument(words=['head', 'of', 'product', '(', 'platform', ')'], tags=['3']), TaggedDocument(words=['incubation', 'lead', 'success', 'cloud'], tags=['4']), TaggedDocument(words=['author'], tags=['5']), TaggedDocument(words=['technical', 'w
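
Each `TaggedDocument` in the output pairs a list of word tokens with a list of tags, here the row index as a string. The sketch below rebuilds that structure with a standard-library `namedtuple` as a stand-in for the gensim class, and uses `str.split` instead of `word_tokenize`; both substitutions are assumptions made to keep the example dependency-free.

```python
from collections import namedtuple

# stand-in with the same two fields as gensim's TaggedDocument
TaggedDocument = namedtuple("TaggedDocument", "words tags")

sample_titles = ["Solutions Engineer", "Staff Software Engineer"]
tagged_sample = [
    TaggedDocument(words=title.lower().split(), tags=[str(i)])
    for i, title in enumerate(sample_titles)
]
print(tagged_sample[0])
```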

In [112]:
max_epochs = 5
vec_size = 100
alpha = 0.025

model_doc = Doc2Vec(vector_size=vec_size,
                alpha=alpha, 
                min_alpha=0.00025,
                min_count=1,
                dm=1)
  
model_doc.build_vocab(tagged_data)

# note: each train() call below runs 10 epochs, so the loop makes 5 x 10 = 50 passes in total
for epoch in range(max_epochs):
    print('iteration {0}'.format(epoch))
    model_doc.train(tagged_data,
                total_examples=model_doc.corpus_count,
                epochs=10)
    # decrease the learning rate
    model_doc.alpha -= 0.0002
    # fix the learning rate, no decay
    model_doc.min_alpha = model_doc.alpha

model_doc.save("d2v.model")
print("Model Saved")

INFO - 21:58:58: Doc2Vec lifecycle event {'params': 'Doc2Vec(dm/m,d100,n5,w5,s0.001,t3)', 'datetime': '2021-11-15T21:58:58.174503', 'gensim': '4.0.1', 'python': '3.6.13 |Anaconda, Inc.| (default, Mar 16 2021, 11:37:27) [MSC v.1916 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.19041-SP0', 'event': 'created'}
INFO - 21:58:58: collecting all words and their counts
INFO - 21:58:58: PROGRESS: at example #0, processed 0 words (0/s), 0 word types, 0 tags
INFO - 21:58:58: PROGRESS: at example #10000, processed 41471 words (613440/s), 2366 word types, 10000 tags
INFO - 21:58:58: PROGRESS: at example #20000, processed 80563 words (1981539/s), 3561 word types, 20000 tags
INFO - 21:58:58: PROGRESS: at example #30000, processed 117636 words (1713661/s), 4528 word types, 30000 tags
INFO - 21:58:58: PROGRESS: at example #40000, processed 152103 words (1587806/s), 5256 word types, 40000 tags
INFO - 21:58:58: PROGRESS: at example #50000, processed 184287 words (1415459/s), 5934 word types, 50000 tags


iteration 0


INFO - 21:59:02: EPOCH 1 - PROGRESS: at 7.50% examples, 71525 words/s, in_qsize 5, out_qsize 0
INFO - 21:59:03: EPOCH 1 - PROGRESS: at 14.21% examples, 61972 words/s, in_qsize 5, out_qsize 0
INFO - 21:59:04: EPOCH 1 - PROGRESS: at 22.39% examples, 65696 words/s, in_qsize 5, out_qsize 0
INFO - 21:59:05: EPOCH 1 - PROGRESS: at 30.77% examples, 66812 words/s, in_qsize 5, out_qsize 0
INFO - 21:59:06: EPOCH 1 - PROGRESS: at 39.14% examples, 68301 words/s, in_qsize 5, out_qsize 0
INFO - 21:59:07: EPOCH 1 - PROGRESS: at 47.75% examples, 68909 words/s, in_qsize 5, out_qsize 0
INFO - 21:59:08: EPOCH 1 - PROGRESS: at 54.51% examples, 67618 words/s, in_qsize 5, out_qsize 0
INFO - 21:59:09: EPOCH 1 - PROGRESS: at 63.34% examples, 67781 words/s, in_qsize 5, out_qsize 0
INFO - 21:59:10: EPOCH 1 - PROGRESS: at 72.22% examples, 68346 words/s, in_qsize 5, out_qsize 0
INFO - 21:59:11: EPOCH 1 - PROGRESS: at 80.36% examples, 68231 words/s, in_qsize 5, out_qsize 0
INFO - 21:59:13: EPOCH 1 - PROGRESS: at 8

INFO - 22:00:18: worker thread finished; awaiting finish of 1 more threads
INFO - 22:00:18: worker thread finished; awaiting finish of 0 more threads
INFO - 22:00:18: EPOCH - 6 : training on 1037922 raw words (888979 effective words) took 13.0s, 68545 effective words/s
INFO - 22:00:19: EPOCH 7 - PROGRESS: at 8.40% examples, 63824 words/s, in_qsize 5, out_qsize 0
INFO - 22:00:20: EPOCH 7 - PROGRESS: at 17.36% examples, 65173 words/s, in_qsize 5, out_qsize 0
INFO - 22:00:21: EPOCH 7 - PROGRESS: at 25.95% examples, 68028 words/s, in_qsize 5, out_qsize 0
INFO - 22:00:23: EPOCH 7 - PROGRESS: at 34.86% examples, 67895 words/s, in_qsize 5, out_qsize 0
INFO - 22:00:24: EPOCH 7 - PROGRESS: at 42.80% examples, 68483 words/s, in_qsize 5, out_qsize 0
INFO - 22:00:25: EPOCH 7 - PROGRESS: at 51.82% examples, 68481 words/s, in_qsize 5, out_qsize 0
INFO - 22:00:26: EPOCH 7 - PROGRESS: at 60.26% examples, 68645 words/s, in_qsize 5, out_qsize 0
INFO - 22:00:27: EPOCH 7 - PROGRESS: at 68.24% examples, 68

iteration 1


INFO - 22:01:10: EPOCH 1 - PROGRESS: at 8.42% examples, 61118 words/s, in_qsize 5, out_qsize 0
INFO - 22:01:12: EPOCH 1 - PROGRESS: at 17.12% examples, 63849 words/s, in_qsize 5, out_qsize 0
INFO - 22:01:13: EPOCH 1 - PROGRESS: at 25.95% examples, 68821 words/s, in_qsize 5, out_qsize 0
INFO - 22:01:14: EPOCH 1 - PROGRESS: at 33.82% examples, 67311 words/s, in_qsize 5, out_qsize 0
INFO - 22:01:15: EPOCH 1 - PROGRESS: at 41.88% examples, 68449 words/s, in_qsize 5, out_qsize 0
INFO - 22:01:16: EPOCH 1 - PROGRESS: at 50.84% examples, 68366 words/s, in_qsize 6, out_qsize 0
INFO - 22:01:17: EPOCH 1 - PROGRESS: at 59.24% examples, 68588 words/s, in_qsize 6, out_qsize 0
INFO - 22:01:18: EPOCH 1 - PROGRESS: at 68.24% examples, 68580 words/s, in_qsize 5, out_qsize 0
INFO - 22:01:19: EPOCH 1 - PROGRESS: at 77.31% examples, 68595 words/s, in_qsize 5, out_qsize 0
INFO - 22:01:20: EPOCH 1 - PROGRESS: at 85.51% examples, 68230 words/s, in_qsize 5, out_qsize 0
INFO - 22:01:21: EPOCH 1 - PROGRESS: at 9

INFO - 22:02:26: worker thread finished; awaiting finish of 1 more threads
INFO - 22:02:26: worker thread finished; awaiting finish of 0 more threads
INFO - 22:02:26: EPOCH - 6 : training on 1037922 raw words (887846 effective words) took 12.9s, 68812 effective words/s
INFO - 22:02:27: EPOCH 7 - PROGRESS: at 8.40% examples, 62978 words/s, in_qsize 6, out_qsize 0
INFO - 22:02:28: EPOCH 7 - PROGRESS: at 17.36% examples, 65539 words/s, in_qsize 5, out_qsize 0
INFO - 22:02:29: EPOCH 7 - PROGRESS: at 25.95% examples, 69588 words/s, in_qsize 5, out_qsize 0
INFO - 22:02:30: EPOCH 7 - PROGRESS: at 33.82% examples, 67948 words/s, in_qsize 5, out_qsize 0
INFO - 22:02:31: EPOCH 7 - PROGRESS: at 42.80% examples, 70320 words/s, in_qsize 5, out_qsize 0
INFO - 22:02:32: EPOCH 7 - PROGRESS: at 49.81% examples, 67731 words/s, in_qsize 5, out_qsize 0
INFO - 22:02:33: EPOCH 7 - PROGRESS: at 58.26% examples, 68242 words/s, in_qsize 5, out_qsize 0
INFO - 22:02:35: EPOCH 7 - PROGRESS: at 67.31% examples, 68

iteration 2


INFO - 22:03:18: EPOCH 1 - PROGRESS: at 8.42% examples, 61184 words/s, in_qsize 5, out_qsize 0
INFO - 22:03:19: EPOCH 1 - PROGRESS: at 17.36% examples, 64687 words/s, in_qsize 5, out_qsize 0
INFO - 22:03:20: EPOCH 1 - PROGRESS: at 25.95% examples, 68462 words/s, in_qsize 6, out_qsize 0
INFO - 22:03:21: EPOCH 1 - PROGRESS: at 34.86% examples, 67566 words/s, in_qsize 6, out_qsize 0
INFO - 22:03:22: EPOCH 1 - PROGRESS: at 42.84% examples, 67824 words/s, in_qsize 5, out_qsize 0
INFO - 22:03:24: EPOCH 1 - PROGRESS: at 51.72% examples, 68004 words/s, in_qsize 5, out_qsize 0
INFO - 22:03:25: EPOCH 1 - PROGRESS: at 60.26% examples, 68371 words/s, in_qsize 5, out_qsize 0
INFO - 22:03:26: EPOCH 1 - PROGRESS: at 69.21% examples, 68633 words/s, in_qsize 5, out_qsize 0
INFO - 22:03:27: EPOCH 1 - PROGRESS: at 77.31% examples, 68493 words/s, in_qsize 5, out_qsize 0
INFO - 22:03:28: EPOCH 1 - PROGRESS: at 86.54% examples, 68215 words/s, in_qsize 5, out_qsize 0
INFO - 22:03:29: EPOCH 1 - PROGRESS: at 9

INFO - 22:04:33: worker thread finished; awaiting finish of 0 more threads
INFO - 22:04:33: EPOCH - 6 : training on 1037922 raw words (887112 effective words) took 12.7s, 69585 effective words/s
INFO - 22:04:35: EPOCH 7 - PROGRESS: at 8.42% examples, 61477 words/s, in_qsize 5, out_qsize 0
INFO - 22:04:36: EPOCH 7 - PROGRESS: at 17.36% examples, 64381 words/s, in_qsize 5, out_qsize 0
INFO - 22:04:37: EPOCH 7 - PROGRESS: at 25.95% examples, 69012 words/s, in_qsize 5, out_qsize 0
INFO - 22:04:38: EPOCH 7 - PROGRESS: at 33.82% examples, 68168 words/s, in_qsize 5, out_qsize 0
INFO - 22:04:39: EPOCH 7 - PROGRESS: at 42.80% examples, 69784 words/s, in_qsize 5, out_qsize 0
INFO - 22:04:40: EPOCH 7 - PROGRESS: at 49.81% examples, 68351 words/s, in_qsize 5, out_qsize 0
INFO - 22:04:41: EPOCH 7 - PROGRESS: at 58.26% examples, 69322 words/s, in_qsize 6, out_qsize 0
INFO - 22:04:42: EPOCH 7 - PROGRESS: at 66.36% examples, 68972 words/s, in_qsize 5, out_qsize 0
INFO - 22:04:43: EPOCH 7 - PROGRESS: a

iteration 3


INFO - 22:05:26: EPOCH 1 - PROGRESS: at 7.50% examples, 71627 words/s, in_qsize 5, out_qsize 0
INFO - 22:05:27: EPOCH 1 - PROGRESS: at 14.21% examples, 62126 words/s, in_qsize 6, out_qsize 0
INFO - 22:05:28: EPOCH 1 - PROGRESS: at 22.38% examples, 65574 words/s, in_qsize 5, out_qsize 0
INFO - 22:05:29: EPOCH 1 - PROGRESS: at 30.77% examples, 66759 words/s, in_qsize 5, out_qsize 0
INFO - 22:05:30: EPOCH 1 - PROGRESS: at 39.14% examples, 68250 words/s, in_qsize 5, out_qsize 0
INFO - 22:05:31: EPOCH 1 - PROGRESS: at 47.75% examples, 68362 words/s, in_qsize 5, out_qsize 0
INFO - 22:05:32: EPOCH 1 - PROGRESS: at 56.36% examples, 68775 words/s, in_qsize 5, out_qsize 0
INFO - 22:05:33: EPOCH 1 - PROGRESS: at 64.40% examples, 68176 words/s, in_qsize 5, out_qsize 0
INFO - 22:05:34: EPOCH 1 - PROGRESS: at 73.27% examples, 67991 words/s, in_qsize 5, out_qsize 0
INFO - 22:05:35: EPOCH 1 - PROGRESS: at 82.39% examples, 68341 words/s, in_qsize 5, out_qsize 0
INFO - 22:05:36: EPOCH 1 - PROGRESS: at 9

INFO - 22:06:41: worker thread finished; awaiting finish of 0 more threads
INFO - 22:06:41: EPOCH - 6 : training on 1037922 raw words (887201 effective words) took 12.7s, 69731 effective words/s
INFO - 22:06:43: EPOCH 7 - PROGRESS: at 8.40% examples, 61038 words/s, in_qsize 5, out_qsize 0
INFO - 22:06:44: EPOCH 7 - PROGRESS: at 17.09% examples, 63147 words/s, in_qsize 5, out_qsize 0
INFO - 22:06:45: EPOCH 7 - PROGRESS: at 25.96% examples, 67401 words/s, in_qsize 5, out_qsize 0
INFO - 22:06:46: EPOCH 7 - PROGRESS: at 33.82% examples, 67082 words/s, in_qsize 5, out_qsize 0
INFO - 22:06:47: EPOCH 7 - PROGRESS: at 42.80% examples, 68825 words/s, in_qsize 5, out_qsize 0
INFO - 22:06:48: EPOCH 7 - PROGRESS: at 49.80% examples, 67423 words/s, in_qsize 5, out_qsize 0
INFO - 22:06:49: EPOCH 7 - PROGRESS: at 58.26% examples, 67830 words/s, in_qsize 6, out_qsize 0
INFO - 22:06:50: EPOCH 7 - PROGRESS: at 67.29% examples, 68049 words/s, in_qsize 6, out_qsize 0
INFO - 22:06:51: EPOCH 7 - PROGRESS: a

iteration 4


INFO - 22:07:34: EPOCH 1 - PROGRESS: at 8.42% examples, 61484 words/s, in_qsize 5, out_qsize 0
INFO - 22:07:35: EPOCH 1 - PROGRESS: at 17.36% examples, 64318 words/s, in_qsize 5, out_qsize 0
INFO - 22:07:36: EPOCH 1 - PROGRESS: at 25.95% examples, 68919 words/s, in_qsize 5, out_qsize 0
INFO - 22:07:37: EPOCH 1 - PROGRESS: at 33.82% examples, 67455 words/s, in_qsize 5, out_qsize 0
INFO - 22:07:38: EPOCH 1 - PROGRESS: at 41.88% examples, 68266 words/s, in_qsize 5, out_qsize 0
INFO - 22:07:39: EPOCH 1 - PROGRESS: at 50.84% examples, 68356 words/s, in_qsize 5, out_qsize 0
INFO - 22:07:40: EPOCH 1 - PROGRESS: at 59.24% examples, 68686 words/s, in_qsize 5, out_qsize 0
INFO - 22:07:41: EPOCH 1 - PROGRESS: at 68.24% examples, 68673 words/s, in_qsize 5, out_qsize 0
INFO - 22:07:42: EPOCH 1 - PROGRESS: at 77.31% examples, 68611 words/s, in_qsize 5, out_qsize 0
INFO - 22:07:44: EPOCH 1 - PROGRESS: at 86.54% examples, 68646 words/s, in_qsize 5, out_qsize 0
INFO - 22:07:45: EPOCH 1 - PROGRESS: at 9

INFO - 22:08:49: worker thread finished; awaiting finish of 0 more threads
INFO - 22:08:49: EPOCH - 6 : training on 1037922 raw words (886859 effective words) took 13.0s, 68048 effective words/s
INFO - 22:08:50: EPOCH 7 - PROGRESS: at 5.79% examples, 54919 words/s, in_qsize 5, out_qsize 0
INFO - 22:08:51: EPOCH 7 - PROGRESS: at 14.21% examples, 56813 words/s, in_qsize 6, out_qsize 0
INFO - 22:08:52: EPOCH 7 - PROGRESS: at 22.39% examples, 61341 words/s, in_qsize 5, out_qsize 0
INFO - 22:08:54: EPOCH 7 - PROGRESS: at 30.77% examples, 61857 words/s, in_qsize 6, out_qsize 0
INFO - 22:08:55: EPOCH 7 - PROGRESS: at 39.14% examples, 62005 words/s, in_qsize 6, out_qsize 0
INFO - 22:08:56: EPOCH 7 - PROGRESS: at 46.73% examples, 62670 words/s, in_qsize 5, out_qsize 0
INFO - 22:08:57: EPOCH 7 - PROGRESS: at 53.58% examples, 62276 words/s, in_qsize 5, out_qsize 0
INFO - 22:08:58: EPOCH 7 - PROGRESS: at 62.29% examples, 62589 words/s, in_qsize 6, out_qsize 0
INFO - 22:08:59: EPOCH 7 - PROGRESS: a

Model Saved
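
To see what the training loop does to the learning rate, the schedule can be simulated without gensim. This sketch assumes that each `train()` call decays linearly from the model's current `alpha` to its current `min_alpha` when no explicit start and end values are passed; the variable names are illustrative.

```python
alpha, min_alpha = 0.025, 0.00025
max_epochs, epochs_per_call = 5, 10

schedule = []
for iteration in range(max_epochs):
    # one train() call: the rate goes from alpha down to min_alpha
    schedule.append((alpha, min_alpha))
    alpha -= 0.0002      # same update as in the training loop
    min_alpha = alpha    # later calls therefore run at a constant rate

for iteration, (start, end) in enumerate(schedule):
    print(iteration, round(start, 5), round(end, 5))
print("total passes:", max_epochs * epochs_per_call)
```

Under that assumption only the first call decays the rate, and the remaining four run at a nearly constant, fairly high rate. A commonly recommended alternative is a single `train()` call with the full epoch count, leaving the decay to the library.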


In [113]:
# let's play around a bit

from gensim.models.doc2vec import Doc2Vec

model_doc= Doc2Vec.load("d2v.model")

test_data = word_tokenize("software".lower())
infervector = model_doc.infer_vector(test_data)
print("infervector : ", infervector)

INFO - 22:11:19: loading Doc2Vec object from d2v.model
INFO - 22:11:20: loading dv recursively from d2v.model.dv.* with mmap=None
INFO - 22:11:20: loading vectors from d2v.model.dv.vectors.npy with mmap=None
INFO - 22:11:20: loading wv recursively from d2v.model.wv.* with mmap=None
INFO - 22:11:20: setting ignored attribute cum_table to None
INFO - 22:11:20: Doc2Vec lifecycle event {'fname': 'd2v.model', 'datetime': '2021-11-15T22:11:20.487568', 'gensim': '4.0.1', 'python': '3.6.13 |Anaconda, Inc.| (default, Mar 16 2021, 11:37:27) [MSC v.1916 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.19041-SP0', 'event': 'loaded'}


infervector :  [-1.0948300e-03 -3.0130588e-03 -3.2158128e-03 -2.2336596e-03
 -5.6902436e-04  4.9792505e-03 -4.1479487e-03  3.2444722e-03
 -9.8201213e-04  3.8267256e-04  4.2805420e-03 -3.8736034e-03
 -1.0155964e-03 -2.7353568e-03  3.1132514e-03 -7.4538676e-04
  4.6149306e-03  2.1108382e-03 -1.8934292e-03 -4.5531355e-03
  4.1035544e-03 -1.3906566e-03 -2.0570979e-03 -2.9974764e-03
  2.5853992e-04 -2.4875125e-03 -3.9584483e-03 -2.5457137e-03
  9.5643580e-04 -3.4371475e-03  6.3641608e-04 -2.5162857e-03
  1.7685771e-04 -2.5888109e-03 -2.8173574e-03 -3.1098926e-03
 -7.1257085e-04  3.1269651e-03 -4.7572041e-03  1.5845734e-03
 -4.5119881e-04 -2.8822795e-03 -1.3190350e-03  2.7068586e-03
 -1.1087060e-05 -2.3971272e-03 -9.5118879e-04 -2.3176745e-03
  8.6905004e-04  2.5868844e-03  2.3079335e-03  4.1183513e-03
 -3.6493273e-04  3.5040551e-03  1.7143232e-03  9.4479322e-04
  1.4545619e-03 -8.7448000e-04 -1.8480078e-03  3.2277023e-03
 -2.9682785e-03  8.0378650e-04  1.7610121e-03 -3.6347837e-03
 -1.79246

In [114]:
model_doc.wv.most_similar('developer')

[('engineer', 0.7792899012565613),
 ('consultant', 0.767435610294342),
 ('programmer', 0.7430940866470337),
 ('development', 0.7413915395736694),
 ('software', 0.736546516418457),
 ('architect', 0.7362387180328369),
 ('designer', 0.7341099977493286),
 ('analyst', 0.7284877896308899),
 ('web', 0.7136134505271912),
 ('systems', 0.7085645198822021)]

In [115]:
model_doc.wv.most_similar('engineer')

[('engineering', 0.8313654065132141),
 ('consultant', 0.8232923150062561),
 ('analyst', 0.8065841197967529),
 ('architect', 0.7996267676353455),
 ('specialist', 0.7960496544837952),
 ('manager', 0.7815800905227661),
 ('developer', 0.779289960861206),
 ('administrator', 0.76097172498703),
 ('lead', 0.7426279783248901),
 ('systems', 0.7195594310760498)]

In [116]:
query = "software engineer".split()

new_vector = model_doc.infer_vector(query)
sims = model_doc.dv.most_similar([new_vector])
print(sims)

[('119566', 0.8511234521865845), ('231802', 0.8401719927787781), ('211635', 0.8395752906799316), ('57096', 0.8387839794158936), ('248778', 0.8378311991691589), ('182380', 0.8162076473236084), ('183983', 0.8134768605232239), ('96816', 0.8097153902053833), ('214154', 0.8094416856765747), ('181138', 0.8081831336021423)]
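
The values returned by `dv.most_similar` are document tags, not job titles. A minimal sketch of mapping them back to titles, assuming each document was tagged with its row index as a string when the Doc2Vec corpus was built (the tagging code is not shown here, so this is an assumption). `titles` and `sims` are small hypothetical stand-ins for `df_sentences` and the real output, and `tags_to_titles` is a name made up for this example.

```python
# Hypothetical stand-ins: in the notebook these would be df_sentences and
# the list returned by model_doc.dv.most_similar([new_vector]).
titles = ["Senior Product Manager", "Solutions Engineer", "Staff Software Engineer"]
sims = [("2", 0.85), ("1", 0.81)]


def tags_to_titles(sims, titles):
    """Turn (tag, score) pairs into (title, score) pairs.

    Assumes every tag is the string form of a row index into `titles`.
    Tags that cannot be read as a valid index are skipped.
    """
    resolved = []
    for tag, score in sims:
        try:
            index = int(tag)
        except (TypeError, ValueError):
            continue
        if 0 <= index < len(titles):
            resolved.append((titles[index], score))
    return resolved


for title, score in tags_to_titles(sims, titles):
    print(f"{score:.3f}  {title}")
```

If the tags were assigned differently (for example, with the title text itself), the lookup would need to change accordingly.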


In [117]:
new_vector = model_doc.infer_vector(["software", "engineer"])

In [118]:
print(new_vector)

[-0.00085565 -0.00447602  0.00162773  0.0047522  -0.00311977 -0.00454217
  0.00238067 -0.00409617 -0.003526    0.00446138  0.00213109 -0.00324883
  0.00275442  0.00182354 -0.00126268  0.00374277 -0.00459501 -0.00462416
  0.00369988 -0.00330196  0.00218844  0.00129127 -0.00461516 -0.00436336
  0.00318492  0.00330207  0.00478095  0.00369216 -0.00386617  0.00446906
  0.00144897  0.00367969  0.00378462  0.00278455  0.00077458  0.00353155
 -0.00370056  0.00021552 -0.00158209  0.00303996  0.00493509  0.00092773
 -0.00475027  0.00389528  0.00340583  0.00445411  0.00072603 -0.00418299
  0.00078716 -0.00256898 -0.00034733 -0.00125732  0.00125496  0.00381139
  0.00451908  0.00480799  0.0045063  -0.00376669  0.0042793  -0.00275833
 -0.00166    -0.0034785   0.00076885  0.00340302 -0.00140343  0.00236112
 -0.00482351 -0.00432416  0.00395162 -0.00080143  0.0035865   0.00063031
  0.0018056   0.00411497  0.00456712  0.00406306 -0.00291419  0.00050661
 -0.00323906 -0.00068642  0.00291761  0.00475291  0

In [119]:
model_doc.wv.similar_by_word('founder', topn=10, restrict_vocab=None)

[('co-founder', 0.8362897634506226),
 ('cofounder', 0.6949377655982971),
 ('owner', 0.6782577633857727),
 ('creative', 0.6626403331756592),
 ('ceo', 0.6604053378105164),
 ('media', 0.636041522026062),
 ('e-commerce', 0.6242018342018127),
 ('community', 0.6128408312797546),
 ('web', 0.5908872485160828),
 ('designer', 0.5895556211471558)]

###### ----------------------------------------

In [120]:
#  use this query to compare job titles
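
`n_similarity` compares two lists of words by averaging the word vectors in each list and taking the cosine similarity of the two averages. A minimal pure-Python sketch of that computation, with made-up 3-dimensional vectors standing in for the trained ones (the `toy_vectors` dictionary, its numbers, and the helper names are hypothetical):

```python
import math

# Hypothetical toy vectors; the real ones come from model_doc.wv.
toy_vectors = {
    "product": [1.0, 0.0, 0.0],
    "owner": [0.0, 1.0, 0.0],
    "ui": [0.0, 0.0, 1.0],
    "engineer": [0.0, 1.0, 1.0],
}


def mean_vector(words, vectors):
    """Average the vectors of the given words, component by component."""
    rows = [vectors[word] for word in words]
    return [sum(column) / len(rows) for column in zip(*rows)]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def title_similarity(title_1, title_2, vectors):
    """Cosine similarity between the mean word vectors of two titles."""
    return cosine(
        mean_vector(title_1.split(), vectors),
        mean_vector(title_2.split(), vectors),
    )


print(title_similarity("product owner", "ui engineer", toy_vectors))
```

Because the vectors are averaged, word order is ignored, and a word missing from the vocabulary raises a `KeyError` in this sketch.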

In [121]:
q1 = 'product owner'
q2 = 'ui engineer'
score = model_doc.wv.n_similarity(q1.split(), q2.split())

# map the score onto a coarse label; the bands are contiguous so that
# values such as 0.895 or 0.745 do not fall through to the last branch

if np.isclose(score, 1.0):
    print("They're the same! Score: ", score)
elif score >= 0.90:
    print("Very High Similarity! Score: ", score)
elif score >= 0.75:
    print("High Similarity! Score: ", score)
elif score >= 0.50:
    print("Medium Similarity! Score: ", score)
elif score >= 0.30:
    print("Low-ish Similarity! Score: ", score)
else:
    print("Wouldn't call them similar! Score: ", score)


Medium Similarity! Score:  0.6312029
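
The score bands can also be kept in a small lookup table, which makes them easy to adjust and guarantees that no score falls between two ranges. A sketch, with the band edges copied from the cell above; the function name `similarity_label` and the `tolerance` parameter are chosen here for illustration.

```python
# (lower bound, label) pairs, checked from the highest band to the lowest
BANDS = [
    (0.90, "Very High Similarity"),
    (0.75, "High Similarity"),
    (0.50, "Medium Similarity"),
    (0.30, "Low-ish Similarity"),
]


def similarity_label(score, tolerance=1e-6):
    """Return a coarse label for a cosine similarity score."""
    # float32 scores for identical inputs can be off from 1.0 by rounding
    if abs(score - 1.0) <= tolerance:
        return "They're the same"
    for lower_bound, label in BANDS:
        if score >= lower_bound:
            return label
    return "Wouldn't call them similar"


print(similarity_label(0.6312029))
```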


###### So far the best results have come from this model

###### --------------------------------------------------------------------------------------------------------------------

###### Using the pre-trained GoogleNews model and fine-tuning it on the same dataset

In [None]:
# Now let's try the same dataset with a pre-trained word2vec model.
# We will be using the GoogleNews vectors, available at
# https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?resourcekey=0-wjGZdNAUop6WykTtMip30g

# Download this file to the working directory.

In [83]:
import os
import re
import pandas as pd
import numpy as np
from time import time
from collections import defaultdict
from nltk.tokenize import RegexpTokenizer

In [84]:
import gensim
from gensim.models import Word2Vec 
from gensim.models import KeyedVectors

In [85]:
# find current working directory
directory = os.getcwd()

# specifying csv filename, this is a switch between two datasets in the directory.

# filename = "job_title.csv"
filename = "titles_final.csv"

# build the full path with os.path.join so it is not tied to Windows separators
file = os.path.join(directory, filename)

# load csv file containing job titles
df = pd.read_csv(file)

In [86]:
df_sentences2 = df['Titles'].astype('str').tolist()

In [88]:
# tokenize the dataset: lowercase each title, then split it into word tokens
tokenizer2 = RegexpTokenizer(r'\w+')
df_tokenized2 = [w.lower() for w in df_sentences2]
df_tokenized2 = [tokenizer2.tokenize(i) for i in df_tokenized2]

print(df_tokenized2[0:10])

[['senior', 'product', 'manager'], ['solutions', 'engineer'], ['staff', 'software', 'engineer'], ['head', 'of', 'product', 'platform'], ['incubation', 'lead', 'success', 'cloud'], ['author'], ['technical', 'writer'], ['product'], ['technology', 'architecture', 'and', 'operations'], ['software', 'engineer']]
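
For reference, the pattern `\w+` keeps runs of word characters and drops everything else, which is why the parentheses in "Head of Product (Platform)" disappear in the output above. A sketch of the same step using only the standard `re` module (the helper name `tokenize_title` is made up for this example):

```python
import re

# one or more word characters (letters, digits, underscore)
WORD_PATTERN = re.compile(r"\w+")


def tokenize_title(title):
    """Lowercase a job title and split it into word tokens."""
    return WORD_PATTERN.findall(title.lower())


sample_titles = ["Senior Product Manager", "Head of Product (Platform)"]
print([tokenize_title(title) for title in sample_titles])
```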


In [89]:
# loading the pre-trained model into the script

model_keyed = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary = True)

INFO - 21:22:34: loading projection weights from GoogleNews-vectors-negative300.bin
INFO - 21:22:58: KeyedVectors lifecycle event {'msg': 'loaded (3000000, 300) matrix of type float32 from GoogleNews-vectors-negative300.bin', 'binary': True, 'encoding': 'utf8', 'datetime': '2021-11-15T21:22:58.951608', 'gensim': '4.0.1', 'python': '3.6.13 |Anaconda, Inc.| (default, Mar 16 2021, 11:37:27) [MSC v.1916 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.19041-SP0', 'event': 'load_word2vec_format'}


In [90]:
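# Building vocabulary and training the model

t = time()

model_2 = Word2Vec(vector_size=300, min_count=1, workers=4)
model_2.build_vocab(df_tokenized2)
total_examples = model_2.corpus_count

# add the GoogleNews vocabulary (3,000,000 keys) to the model's vocabulary
model_2.build_vocab([list(model_keyed.index_to_key)], update=True)

# NOTE: load_word2vec_format is a classmethod that returns a new KeyedVectors
# object. Its return value is discarded here, so this call does not copy the
# pre-trained weights into model_2, and training starts from model_2's own
# initial vectors. KeyedVectors.intersect_word2vec_format is the gensim method
# meant for merging pre-trained weights into an existing vocabulary.
model_2.wv.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

# train on the job titles only
model_2.train(df_tokenized2, total_examples=total_examples, epochs=5)

print('Time to train the model: {} mins'.format(round((time() - t) / 60, 2)))

model_2.save('c2vpretrained.model')
print("model saved!")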

INFO - 21:24:01: Word2Vec lifecycle event {'params': 'Word2Vec(vocab=0, vector_size=300, alpha=0.025)', 'datetime': '2021-11-15T21:24:01.868482', 'gensim': '4.0.1', 'python': '3.6.13 |Anaconda, Inc.| (default, Mar 16 2021, 11:37:27) [MSC v.1916 64 bit (AMD64)]', 'platform': 'Windows-10-10.0.19041-SP0', 'event': 'created'}
INFO - 21:24:01: collecting all words and their counts
INFO - 21:24:01: PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
INFO - 21:24:01: PROGRESS: at sentence #10000, processed 39523 words, keeping 2026 word types
INFO - 21:24:01: PROGRESS: at sentence #20000, processed 76873 words, keeping 2903 word types
INFO - 21:24:01: PROGRESS: at sentence #30000, processed 112308 words, keeping 3562 word types
INFO - 21:24:01: PROGRESS: at sentence #40000, processed 145434 words, keeping 4058 word types
INFO - 21:24:01: PROGRESS: at sentence #50000, processed 176505 words, keeping 4505 word types
INFO - 21:24:01: PROGRESS: at sentence #60000, processed 214131 w

INFO - 21:25:57: worker thread finished; awaiting finish of 0 more threads
INFO - 21:25:57: EPOCH - 1 : training on 997464 raw words (982036 effective words) took 7.3s, 134581 effective words/s
INFO - 21:25:59: EPOCH 2 - PROGRESS: at 59.97% examples, 607859 words/s, in_qsize 7, out_qsize 0
INFO - 21:25:59: worker thread finished; awaiting finish of 3 more threads
INFO - 21:25:59: worker thread finished; awaiting finish of 2 more threads
INFO - 21:25:59: worker thread finished; awaiting finish of 1 more threads
INFO - 21:25:59: worker thread finished; awaiting finish of 0 more threads
INFO - 21:25:59: EPOCH - 2 : training on 997464 raw words (982020 effective words) took 1.5s, 640000 effective words/s
INFO - 21:26:00: EPOCH 3 - PROGRESS: at 63.15% examples, 638565 words/s, in_qsize 7, out_qsize 0
INFO - 21:26:01: worker thread finished; awaiting finish of 3 more threads
INFO - 21:26:01: worker thread finished; awaiting finish of 2 more threads
INFO - 21:26:01: worker thread finished; aw

Time to train the model: 2.04 mins


INFO - 21:26:04: storing np array 'vectors' to c2vpretrained.model.wv.vectors.npy
INFO - 21:27:35: storing np array 'syn1neg' to c2vpretrained.model.syn1neg.npy
INFO - 21:28:45: not storing attribute cum_table
INFO - 21:28:50: saved c2vpretrained.model


model saved!


In [91]:
# let's do some testing

model_2.wv.most_similar('software', topn=10)

[('mobile', 0.9679745435714722),
 ('driver', 0.9671269059181213),
 ('web', 0.9640653729438782),
 ('application', 0.9639647006988525),
 ('platform', 0.9580212235450745),
 ('ops', 0.9522882699966431),
 ('applications', 0.9518682956695557),
 ('industrial', 0.9515718221664429),
 ('process', 0.9483321905136108),
 ('consulting', 0.9479199647903442)]

In [92]:
list(model_2.wv.index_to_key[1:10])

['senior',
 'manager',
 'software',
 'network',
 'product',
 'intern',
 'sales',
 'technical',
 'director']

In [93]:
model_2.wv.similarity('developer', 'engineer')

0.9161561

###### The results aren't bad, but similarity here only works on single words. That is useful, but it does not solve our problem of comparing whole job titles.

###### Conclusion: best results: Doc2Vec; so-so: Word2Vec pre-trained; worst: Word2Vec custom