For the word2vec model, we will use the gensim library, which implements both the CBOW and the skip-gram variants of word2vec.

In [1]:
import gensim

Now we load the dataset that was preprocessed earlier, again using the csv module.

In [2]:
import csv

The transactions were divided into individual lines, and the items within each transaction are separated by ','.

In [3]:
entries = []
with open('extracted_dependencies.csv', 'r') as csv_file:
    entries_csv = csv.reader(csv_file, delimiter=',')
    entries = list(entries_csv)
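
To make the resulting structure concrete, here is a minimal sketch that parses a small in-memory sample instead of the real file. The package names and the `sample_csv` text are made up for illustration; only the `csv.reader` call mirrors the cell above. Skipping empty rows is an extra precaution assumed here, not something the original loading code does.

```python
import csv
import io

# Hypothetical stand-in for a few lines of 'extracted_dependencies.csv'.
sample_csv = "numpy,pandas,scipy\nrequests\nflask,jinja2\n"

# Each line becomes one transaction: a list of item strings.
reader = csv.reader(io.StringIO(sample_csv), delimiter=',')
sample_entries = [row for row in reader if row]  # drop empty rows, if any

print(sample_entries)
```

Each transaction may have a different length, which is why the window size is derived from the longest one later on.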

# Setting up the parameters for the Word2Vec model

For our word2vec-based recommendation system to train properly, a few parameters have to be chosen deliberately.

First, we define the size, that is, the dimensionality of the word vectors. Values of roughly 100 to 150 are often recommended, depending on how big the dataset is. For this example, we will use a size of 150.

In [4]:
vec_dimension = 150

Another parameter is the window size. It specifies the maximum distance between the current and the predicted word within a sentence. In our setting, the items of one transaction play the role of one sentence, and within this sentence we want to "train" every word with every other word. To cover all the cases that can occur, the window size should be the length of the longest transaction in entries, so that every word is trained with every other word present in the same transaction.

In [5]:
window_size = max([len(transaction) for transaction in entries])
print(window_size)

369
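
The effect of the window size can be illustrated with a simplified sketch that lists which (current, context) item pairs a given window produces. The helper `context_pairs` and the package names are hypothetical and written only for this illustration; gensim's real implementation differs in details (for example, it may randomly shrink the window for each target word), so treat this as a picture of the idea rather than of the library internals.

```python
def context_pairs(transaction, window):
    """List (current, context) pairs whose positions differ by at most `window`."""
    pairs = []
    for i, center in enumerate(transaction):
        lo = max(0, i - window)
        hi = min(len(transaction), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, transaction[j]))
    return pairs

# Hypothetical transaction with four dependencies.
transaction = ['numpy', 'pandas', 'scipy', 'requests']

narrow = context_pairs(transaction, window=1)                 # neighbours only
full = context_pairs(transaction, window=len(transaction))    # everything pairs up

print(len(narrow), len(full))
```

With the narrow window, 'numpy' and 'requests' never appear as a pair; with a window at least as long as the transaction, every item is paired with every other item, which is the behaviour we want for unordered transactions.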


The min_count parameter specifies the minimum number of occurrences a word needs in order to be included in the vocabulary. In other words, the model ignores words that occur fewer than min_count times. For this example, we set min_count to 1, because we want to train the model on every unique dependency in the dataset.

In [6]:
frequency = 1
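
As a rough sketch of what this threshold does, the following counts item frequencies with the standard library and shows which items would survive for two different values. The helper `kept_vocabulary` and the sample transactions are hypothetical; gensim performs this filtering internally when it builds its vocabulary.

```python
from collections import Counter

# Hypothetical transactions, standing in for `entries`.
sample_entries = [
    ['numpy', 'pandas'],
    ['numpy', 'scipy'],
    ['numpy', 'pandas', 'requests'],
]

# Count how often each dependency occurs across all transactions.
counts = Counter(item for transaction in sample_entries for item in transaction)

def kept_vocabulary(counts, min_count):
    """Return the items that occur at least `min_count` times."""
    return sorted(item for item, n in counts.items() if n >= min_count)

vocab_1 = kept_vocabulary(counts, 1)  # keeps every unique dependency
vocab_2 = kept_vocabulary(counts, 2)  # drops dependencies seen only once

print(vocab_1)
print(vocab_2)
```

With a threshold of 1 nothing is discarded, which matches the choice made above; a higher value trades coverage of rare dependencies for less noisy vectors.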

The sg parameter selects the training algorithm: 1 for skip-gram, 0 for CBOW. We will train both variants, first CBOW and then skip-gram.

In [7]:
cbow = 0
skip_gram = 1

There are many other parameters that you can specify for a better, fine-tuned model. If you are interested, you can look them up in the gensim documentation here:
https://radimrehurek.com/gensim/models/word2vec.html

In [8]:
model_cbow = gensim.models.Word2Vec(
        entries,
        size=vec_dimension,
        window=window_size,
        min_count=frequency,
        sg=cbow)

In [9]:
model_sg = gensim.models.Word2Vec(
        entries,
        size=vec_dimension,
        window=window_size,
        min_count=frequency,
        sg=skip_gram)
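
One caveat about the two calls above: the `size` keyword is the name used by gensim releases before 4.0, while gensim 4.0 and later call the same parameter `vector_size`. A minimal sketch of one way to handle both is shown below. The helper `word2vec_kwargs` is hypothetical (it is not part of gensim) and only builds a dictionary of keyword arguments from a version string.

```python
def word2vec_kwargs(gensim_version, dimension, window, min_count, sg):
    """Build Word2Vec keyword arguments that match the given gensim version."""
    major = int(gensim_version.split('.')[0])
    # gensim >= 4.0 renamed `size` to `vector_size`.
    size_key = 'vector_size' if major >= 4 else 'size'
    return {
        size_key: dimension,
        'window': window,
        'min_count': min_count,
        'sg': sg,
    }

kwargs_old = word2vec_kwargs('3.8.3', 150, 369, 1, 0)
kwargs_new = word2vec_kwargs('4.3.2', 150, 369, 1, 0)

print(kwargs_old)
print(kwargs_new)
```

In the notebook, this could be used as `gensim.models.Word2Vec(entries, **word2vec_kwargs(gensim.__version__, vec_dimension, window_size, frequency, cbow))`.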

# Training the models

Now that both variants of the model are set up, we can train them. Note that because entries was passed to the constructor, gensim has already built the vocabulary and run an initial training pass, so the calls below continue training from there. There are also various parameters that can be set for the training phase; if you are interested, you can look them up in the documentation mentioned above. The epochs parameter is the number of iterations over the corpus. It is set to 100 here, purely as an example.

In [10]:
model_cbow.train(entries, total_examples=len(entries), epochs=100)
model_sg.train(entries, total_examples=len(entries), epochs=100)

(14943449, 17526400)
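
The tuple printed above is the return value of the last train call. As far as the gensim documentation describes it, the first number is the effective word count (after ignoring unknown words and similar trimming) and the second is the total raw word count over all epochs. Under that reading, a short piece of arithmetic recovers the corpus size per epoch. The variable names below are chosen for this sketch only.

```python
# Values copied from the output of the training cell above.
effective_words, raw_words = 14943449, 17526400
epochs = 100

# Raw count is summed over all epochs, so divide to get one pass over the corpus.
raw_words_per_epoch = raw_words // epochs

# Share of raw words that actually contributed to training.
kept_fraction = effective_words / raw_words

print(raw_words_per_epoch)
print(round(kept_fraction, 3))
```

Assuming this interpretation is right, the gap between the two numbers is most plausibly explained by gensim's default downsampling of very frequent words, since min_count is 1 and no dependency is dropped from the vocabulary.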

# Saving the models

Once a model is trained, it can be saved and loaded again later. We will save both variants for use in the part dedicated to the results of the recommendation systems.

In [11]:
model_cbow.save('dependencies_recommender_w2v_cbow')
model_sg.save('dependencies_recommender_w2v_skipgram')
print('models were successfully saved')

models were successfully saved
