# Step Two: Training the Model

After preparing our dataset into a corpus - a collection of sentences - we can train our model and create our custom word embeddings. This walkthrough uses the dataset produced by the steps in ```clean_data.ipynb```. However, the process should also work for any corpus stored as a text file with one sentence per line.

To create our word embeddings, we will make use of Word2Vec algorithms (see the ```README.md``` in this directory for further details).

### Requirements

- We will use the ```Gensim``` library - a standard in natural language processing projects. More specifically, the [Word2Vec](https://radimrehurek.com/gensim/models/word2vec.html) module provides functions which will allow us to train and make use of our word embeddings.
- We need a text corpus formatted as a text file containing a list of sentences to evaluate. 


Note: As mentioned previously, it would be most appropriate to create a new directory and virtual environment for training your model rather than working within this API directory.

In [1]:
from gensim.models import Word2Vec 

### Creating and Training the Model

We can now create our ```Word2Vec``` neural network-based model! The code below is deceptively simple, as it packs in a massive amount of functionality. By passing our corpus when initialising the model, we complete the steps of building the vocabulary and training the model in one go. These steps can also be done one at a time; see the [documentation here](https://radimrehurek.com/gensim/models/word2vec.html) for examples.

Additionally, there are many parameters that control the features and implementation of the model. For example, you can choose between the *skip-gram* and *CBOW (continuous bag of words)* algorithms. While the many options are certainly worth playing with, for now we will keep it simple. The one optional parameter we will set is ```min_count```: words that occur fewer times than this threshold are ignored (Gensim's default is 5). As our dataset is relatively small, we will set this to 1 so that as many words as possible are included in our word embeddings.

In [2]:
model = Word2Vec(min_count=1, corpus_file="./cleaned_dataset.txt")

### Extracting the Word Vectors

We have our model! We can do things such as save or load this model, but for now we want to access our custom trained word vectors. Since we've finished the training, we no longer need the full model state. We can extract the vectors and keys alone - a much smaller and faster object than the full model - to use in our API functionality. 

In [3]:
word_vectors = model.wv

### Using the Word Vectors

Now we have our word embeddings, we can use them to explore the relationships between words from our dataset! The most important functionality for our use is finding similar keywords for a given word. Let's try to find the 10 words most similar to the word **heart**.

In [4]:
result = word_vectors.most_similar("heart")

print("Words most similar to heart are", result)

Words most similar to heart are [('star', 0.6312820315361023), ('stars', 0.4272494316101074), ('tealight', 0.4014790952205658), ('chandelier', 0.376049667596817), ('chick', 0.3455809950828552), ('bird', 0.34482234716415405), ('candle', 0.342176228761673), ('candelabra', 0.3411673307418823), ('soap', 0.3409248888492584), ('seat', 0.3406926691532135)]


We can also create more complex queries. Passing a list of keywords returns the words most similar to all of them combined. Let's find words most similar to both **heart** and **bag**.

In [5]:
result = word_vectors.most_similar(["heart","bag"])

print("Words most similar to heart and bag are", result)

Words most similar to heart and bag are [('star', 0.4480481445789337), ('bunting', 0.3603987991809845), ('box', 0.3536360561847687), ('wrap', 0.3321145176887512), ('tape', 0.32946687936782837), ('seat', 0.30161723494529724), ('sack', 0.292302668094635), ('chick', 0.2723071575164795), ('tags', 0.2681472897529602), ('jug', 0.26459288597106934)]


We can also input "negative" keywords, which contribute negatively to the similarity. For example, if we want keywords similar to **heart** but not similar to **star**, we can input a list of positive words and a list of negative words.

In [6]:
positive_words = ["heart"]
negative_words = ["star"]

result = word_vectors.most_similar(positive=positive_words, negative=negative_words)

print("Words most similar to heart but not to star are", result)

Words most similar to heart but not to star are [('candle', 0.3686973452568054), ('sweetheart', 0.3085654377937317), ('pan', 0.29118064045906067), ('folding', 0.28588226437568665), ('buddha', 0.26655375957489014), ('photoframe', 0.26415762305259705), ('shape', 0.2628423869609833), ('cosy', 0.26227304339408875), ('diner', 0.25945886969566345), ('eau', 0.251926064491272)]


Both the positive and negative lists can contain multiple keywords. See the [KeyedVectors module documentation](https://radimrehurek.com/gensim/models/keyedvectors.html) for further details on this functionality.
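
To give a feel for what the positive and negative keywords do, here is a rough NumPy sketch of the idea: scale each vector to unit length, add the positives, subtract the negatives, and rank the remaining words by cosine similarity to the result. The three-dimensional `toy_vectors` and the `rank_similar` helper are made up for illustration; Gensim's own implementation may differ in its details.

```python
import numpy as np

# Made-up 3-dimensional vectors, purely for illustration.
toy_vectors = {
    "heart": np.array([1.0, 0.2, 0.0]),
    "star": np.array([0.9, 0.1, 0.4]),
    "candle": np.array([0.3, 1.0, 0.0]),
    "bag": np.array([0.0, 0.1, 1.0]),
}


def unit(vec):
    """Scale a vector to length 1 so dot products become cosine similarities."""
    return vec / np.linalg.norm(vec)


def rank_similar(vectors, positive, negative=()):
    """Rank words by cosine similarity to (sum of positives - sum of negatives)."""
    query = sum(unit(vectors[w]) for w in positive)
    query = query - sum(unit(vectors[w]) for w in negative)
    query = unit(query)
    # Query words themselves are left out of the ranking.
    exclude = set(positive) | set(negative)
    scores = {
        word: float(unit(vec) @ query)
        for word, vec in vectors.items()
        if word not in exclude
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


only_positive = rank_similar(toy_vectors, positive=["heart"])
with_negative = rank_similar(toy_vectors, positive=["heart"], negative=["star"])
print(only_positive)
print(with_negative)
```

In this toy data, **star** ranks first for **heart** alone, while subtracting **star** pushes **candle** to the top.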

### Saving and Loading the Word Vectors

To make the most of our word vectors, we want to store and load them for use at a later date. We can do this by storing them as a text file. 

In [7]:
word_vectors.save_word2vec_format("ecommerce_vecs.txt", binary=False)
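
The text format is simple enough to read by hand: the first line holds the vocabulary size and the vector dimensionality, and every following line holds a word followed by its vector components, separated by spaces. The sketch below parses a small made-up sample in that layout with plain Python. The sample values and the `parse_word2vec_text` helper are illustrative assumptions; in practice Gensim's own loader is the better choice.

```python
import io

# Made-up sample in the word2vec text layout: header, then one word per line.
sample_text = """3 4
heart 0.10 0.20 0.30 0.40
star 0.11 0.19 0.31 0.39
bag -0.50 0.05 0.00 0.25
"""


def parse_word2vec_text(handle):
    """Read word2vec text-format data into a {word: [floats]} dictionary."""
    vocab_size, vector_size = map(int, handle.readline().split())
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) != vector_size + 1:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(value) for value in parts[1:]]
    if len(vectors) != vocab_size:
        raise ValueError("header does not match the number of vectors read")
    return vectors


vectors = parse_word2vec_text(io.StringIO(sample_text))
print(vectors["heart"])
```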

We can then load them from the text file at a later point wherever we want! We require the module ```KeyedVectors``` from ```Gensim``` in whatever file we wish to use our word vectors within. Then, we simply load them from the text file. 

In [8]:
from gensim.models import KeyedVectors

word_vectors = KeyedVectors.load_word2vec_format("ecommerce_vecs.txt", binary=False) 

result = word_vectors.most_similar("heart")

print("The words most similar to heart are", result)

The words most similar to heart are [('star', 0.6312820315361023), ('stars', 0.4272494316101074), ('tealight', 0.4014790952205658), ('chandelier', 0.376049667596817), ('chick', 0.3455809950828552), ('bird', 0.34482234716415405), ('candle', 0.342176228761673), ('candelabra', 0.3411673307418823), ('soap', 0.3409248888492584), ('seat', 0.3406926691532135)]


For our keyword API, we can now load our word vectors into our ```model.py``` and create functionality that accepts keywords in a POST request and returns similar words for future product searches. 