#**Word2Vec**
The previous lesson on word embeddings introduced the idea of creating vectors from words. With these vectors we are able to do 'math' on words and show relationships between them. More importantly, these word embeddings can be used as input into other ML algorithms.

This lesson is all about how to create those vectors from text as well as how to use an already trained word embedding model.  

#**What is word2vec?**

Word2Vec is a machine learning algorithm developed in 2013. Specifically, it uses a neural network (2 layers) and is trained with unlabeled data. Since the data is raw text, there is no 'label' to work with, which makes it an unsupervised ML technique. You can read the original [paper](https://arxiv.org/pdf/1301.3781.pdf) for an overview.

The open source project (https://github.com/tmikolov/word2vec) is written in C; however, the Python gensim (pronounced jen-sim) library provides a Python implementation.

For this lesson, we are going to use a dataset from Kaggle regarding the make and model for cars. It is already included in this notebook.

In [1]:
import gzip
import gensim
import pandas as pd
import warnings

def build_dataset_raw():
  # here's an example of how to use a zipped (compressed) file
  filename = 'cars.csv.gz'
  # https://www.kaggle.com/CooperUnion/cardataset?select=data.csv
  # clean and tokenize the text; the with-block closes the file when done
  with gzip.open(filename, 'rb') as file:
    return [gensim.utils.simple_preprocess(line) for line in file]
  
def test_raw():
  document = build_dataset_raw()
  print(document[0])  # make note of the column names
  print(document[10]) # row '9'
  # print(document)
test_raw()

['make', 'model', 'year', 'engine', 'fuel', 'type', 'engine', 'hp', 'engine', 'cylinders', 'transmission', 'type', 'driven_wheels', 'number', 'of', 'doors', 'market', 'category', 'vehicle', 'size', 'vehicle', 'style', 'highway', 'mpg', 'city', 'mpg', 'popularity', 'msrp']
['bmw', 'series', 'premium', 'unleaded', 'required', 'manual', 'rear', 'wheel', 'drive', 'luxury', 'compact', 'convertible']


The above example shows how to open a compressed file in Python. Many data sets are very large and may only be available in a compressed format. The example also shows how you can use gensim to clean and tokenize text. You can [read](https://radimrehurek.com/gensim/utils.html#gensim.utils.simple_preprocess) about the gensim API (how to use the methods and functions) for pre-processing.
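
As a complement to the cell above, here is a minimal sketch of reading a compressed file in text mode using only the standard library. The helper name `read_gzip_lines` and the tiny demo file are assumptions made for illustration; they are not part of the lesson's dataset.

```python
import gzip
import os
import tempfile

def read_gzip_lines(path):
    """Return the lines of a gzip-compressed text file, without newlines."""
    # 'rt' decompresses and decodes to str; the with-block closes the file
    with gzip.open(path, 'rt', encoding='utf-8') as handle:
        return [line.rstrip('\n') for line in handle]

# build a tiny compressed file so the sketch runs without cars.csv.gz
demo_path = os.path.join(tempfile.mkdtemp(), 'demo.csv.gz')
with gzip.open(demo_path, 'wt', encoding='utf-8') as handle:
    handle.write('Make,Model\nBMW,1 Series\n')

demo_lines = read_gzip_lines(demo_path)
print(demo_lines)
```

Opening the file in 'rt' mode gives you strings rather than bytes, which is convenient when you want to split or clean the text yourself.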

Instead of using the raw data, let's make use of Pandas to help us clean the data. Take note of the compression parameter in read_csv:

In [27]:
import pandas as pd

pd.options.mode.chained_assignment = None  # default='warn'

def build_dataset():

  # another way to read compressed data
  filename = 'cars.csv.gz'
  df_original = pd.read_csv(filename, compression='gzip')

  # feature selection
  # select the fields we want to train word2vec on
  features = ['Make', 'Model', 'Market Category','Vehicle Size','Vehicle Style',
              'Engine Fuel Type','Transmission Type','Driven_Wheels']
  df = df_original[features].copy()  # work on a copy, not a view of df_original
  # combine Make and Model into one field, separated by a single space
  df['Make_Model'] = df['Make'] + ' ' + df['Model']
  # drop the original fields and place Make_Model in the third position
  df = df.drop(labels=['Make', 'Model'], axis=1).reindex(
      columns=['Market Category', 'Vehicle Size', 'Make_Model', 'Vehicle Style',
               'Engine Fuel Type', 'Transmission Type', 'Driven_Wheels'])

  # print(df)
  doc = []
  for index, row in df.iterrows():
    line = [r for v in row.values for r in str(v).split(',')]
    doc.append(line)
  return doc, df

def test_pd_data():
  document, df = build_dataset()
  print(document[0][0:5])
  print(document[1][0:5])

test_pd_data()

# ==> ['Factory Tuner', 'Luxury', 'High-Performance', 'Compact', 'Coupe']

['Factory Tuner', 'Luxury', 'High-Performance', 'Compact', 'BMW 1 Series M']
['Luxury', 'Performance', 'Compact', 'BMW 1 Series', 'Convertible']


#**Exercise**
Update build_dataset such that you will create the field Make_Model in the dataset. Be sure to add this field to the front of each line of the output. The new field Make_Model combines the fields Make and Model with a single space between them.
Once this is finished, test_pd_data should print out the following:

```
['BMW 1 Series M', 'Factory Tuner', 'Luxury', 'High-Performance', 'Compact']
```

You should also confirm that there are 928 unique Make_Models in the dataset.


#**Model Building**
For creating word2vec models, gensim's Word2Vec class is available. It essentially implements the classic algorithm mentioned in the beginning. In the next code cell, re-type in the following:



```
def build_model_v0(doc):
  model = gensim.models.Word2Vec(doc)
  return model

def test_v0():
    document, df = build_dataset()
    model = build_model_v0(document)
    print(len(model.wv.key_to_index))  # gensim 4.x; older versions used model.wv.vocab

test_v0()
```

Note that the .wv property of the model is the word vector object that provides access into the word vectors themselves.



In [28]:
def build_model_v0(doc):
  model = gensim.models.Word2Vec(doc)
  return model

def test_v0():
    document, df = build_dataset()
    model = build_model_v0(document)
    print(model.wv.key_to_index)

test_v0()

{'AUTOMATIC': 0, 'regular unleaded': 1, 'front wheel drive': 2, 'Compact': 3, 'Midsize': 4, 'nan': 5, 'rear wheel drive': 6, 'Luxury': 7, 'Sedan': 8, 'MANUAL': 9, 'Large': 10, '4dr SUV': 11, 'all wheel drive': 12, 'Performance': 13, 'Crossover': 14, 'premium unleaded (required)': 15, 'premium unleaded (recommended)': 16, 'four wheel drive': 17, 'High-Performance': 18, 'Coupe': 19, 'Hatchback': 20, 'Flex Fuel': 21, 'flex-fuel (unleaded/E85)': 22, 'Convertible': 23, '4dr Hatchback': 24, 'Crew Cab Pickup': 25, 'AUTOMATED_MANUAL': 26, 'Extended Cab Pickup': 27, 'Factory Tuner': 28, 'Wagon': 29, '2dr Hatchback': 30, 'Exotic': 31, 'Passenger Minivan': 32, 'Regular Cab Pickup': 33, 'Hybrid': 34, 'Diesel': 35, 'Chevrolet Silverado 1500': 36, 'diesel': 37, 'Toyota Tundra': 38, '2dr SUV': 39, 'Passenger Van': 40, 'Ford F-150': 41, 'Cargo Van': 42, 'GMC Sierra 1500': 43, 'Volkswagen Beetle Convertible': 44, 'Toyota Tacoma': 45, 'Nissan Frontier': 46, 'Volkswagen GTI': 47, 'Honda Accord': 48, 'Vol

#**Model Evaluation: Extrinsic vs Intrinsic evaluation**

Of course, we have no idea how 'good' the default Word2Vec settings are at building word embeddings. We need a way to evaluate the result. There are two different ways to assess how useful/accurate the word embeddings are: intrinsically and extrinsically.

In **intrinsic** evaluation, you assess performance on a very specific task or subtask for the vectors themselves. For example, one task might be how many word analogies are correctly identified.
<br>

In **extrinsic** evaluation, you are using your word vectors as input into another NLP process (e.g. named entity recognition, classification, another neural network).

For this example, we will evaluate our simple model using a few intrinsic evaluations:
1. Do the word vectors capture all the make/models of the car set?
2. How accurate are the car similarities? For example, we would expect 'Toyota Camry' and 'Nissan Van' to be closer than 'Toyota Camry' and 'Mercedes-Benz SLK-Class'.

Read, understand and run the following code:

In [29]:
def evaluate_model(model, df=None):

  output = ''
  if df is not None:
    unique_set = df['Make_Model'].unique()
    missing=0
    for mm in unique_set:
      if mm not in model.wv.key_to_index:  # dict lookup instead of scanning a list
        missing += 1
    output += "{:d} models are missing of {:d}\n".format(missing, len(unique_set))

  try:
    t = 'Toyota Camry'
    other = ['Honda Accord', 'Nissan Van', 'Mercedes-Benz SLK-Class']
    for o in other:
      output += t + '->' + o + ' ' + "{:0.4f}\n".format(model.wv.similarity(t,o))

    tuples = model.wv.most_similar(positive='Honda Odyssey', topn=3)
    for mm, v in tuples:
      output += mm + ', '
    output = output.strip(', ')

  except KeyError as e:
    output += "\nError:" + str(e)

  return output

def test_v0():
  document, df = build_dataset()
  model = build_model_v0(document)
  print(evaluate_model(model, df))
  
test_v0()

263 models are missing of 928
Toyota Camry->Honda Accord 0.9765

Error:"Key 'Nissan Van' not present"


What did you notice for test_v0? Of course, this isn't a thorough testing suite, but it helps to show some simple relationships. Ideally, you would come up with a score/metric for your evaluation function.
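
One way to turn these spot checks into a single score is to count how many "A should be closer to B than to C" expectations hold. The sketch below is only an illustration: `triplet_score` is a hypothetical helper, and the similarity numbers are invented stand-ins for what `model.wv.similarity` would return.

```python
def triplet_score(similarity, triplets):
    """Fraction of (anchor, closer, farther) triplets ranked as expected."""
    correct = 0
    for anchor, closer, farther in triplets:
        if similarity(anchor, closer) > similarity(anchor, farther):
            correct += 1
    return correct / len(triplets)

# hypothetical similarity values, invented for this example
toy_sims = {
    ('Toyota Camry', 'Honda Accord'): 0.95,
    ('Toyota Camry', 'Mercedes-Benz SLK-Class'): 0.60,
    ('Honda Odyssey', 'Toyota Sienna'): 0.70,
    ('Honda Odyssey', 'Ford F-150'): 0.80,
}

def toy_similarity(a, b):
    # stand-in for model.wv.similarity(a, b)
    return toy_sims[(a, b)]

triplets = [
    ('Toyota Camry', 'Honda Accord', 'Mercedes-Benz SLK-Class'),
    ('Honda Odyssey', 'Toyota Sienna', 'Ford F-150'),
]
score = triplet_score(toy_similarity, triplets)
print(score)  # 0.5: the first expectation holds, the second does not
```

With a real model you could pass `model.wv.similarity` as the first argument and compare the score across different hyperparameter settings.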

**Tuning Our Algorithm**

There's another parameter (many actually) that we can use to configure Word2Vec. These parameters (called hyperparameters), along with our evaluation function, can be used to build an accurate model based on our dataset. The values of these hyperparameters come from experience and trial-and-error.

The first parameter is min_count whose default value is 5. There are many car models that only appear a few times and these cars are being dropped. Let's update our model to use this parameter:



```
def build_model_v1(doc):
  model = gensim.models.Word2Vec(
          doc,
          min_count=1, # only ignore words that occur less than 1 times
          )
  return model
```
Add the following code below `build_model_v1` in the same cell and run `test_v1`:
```
def test_v1():
  document, df = build_dataset()
  model = build_model_v1(document)
  print(evaluate_model(model,df))
test_v1()
```

That's much better (your output will be different, but all the car models should be there):

```
0 models are missing of 928
Toyota Camry->Honda Accord 0.9584
Toyota Camry->Nissan Van 0.9442
Toyota Camry->Mercedes-Benz SLK-Class 0.6755
Toyota Previa, Pontiac Montana, Chevrolet Uplander
```
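
To see why the default min_count was dropping car models, here is a minimal sketch of the filtering idea using a plain frequency count. It is not gensim's implementation; the function name `surviving_vocab` and the toy corpus are made up for illustration.

```python
from collections import Counter

def surviving_vocab(sentences, min_count):
    """Tokens that occur at least min_count times across all sentences."""
    counts = Counter(token for sentence in sentences for token in sentence)
    return {token for token, n in counts.items() if n >= min_count}

# toy corpus: 'Toyota Camry' appears 5 times, 'Nissan Van' only once
toy_doc = [['Toyota Camry', 'Midsize']] * 5 + [['Nissan Van', 'Midsize']]

print(surviving_vocab(toy_doc, min_count=5))  # 'Nissan Van' is dropped
print(surviving_vocab(toy_doc, min_count=1))  # every token is kept
```

A rare make/model behaves like 'Nissan Van' here: with a threshold of 5 it never enters the vocabulary, so looking it up later raises a KeyError.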




In [30]:
def build_model_v1(doc):
  model = gensim.models.Word2Vec(
          doc,
          min_count=1, # only ignore words that occur less than 1 times
          )
  return model

def test_v1():
  document, df = build_dataset()
  model = build_model_v1(document)
  print(evaluate_model(model,df))
test_v1()

0 models are missing of 928
Toyota Camry->Honda Accord 0.9081
Toyota Camry->Nissan Van 0.9334
Toyota Camry->Mercedes-Benz SLK-Class 0.5702
Toyota Previa, GMC Jimmy, Dodge Ramcharger


#**Randomness of ML**

You may see different numbers in your output than what is shown. Many machine learning algorithms start from random values so that the initial vectors are spread out in high-dimensional space. If you re-run the model above, you should see different results each time, but on average your results should be close from run to run.


**CPUs and Threads**

However, this randomness causes issues with reproducibility. We can control the randomness by doing a few things. The main issue for Word2Vec is that the work it does is split across many threads. You can think of a thread as an independent worker. Usually you want the number of threads to at least match the number of CPUs. That way, if you have multiple CPUs, you can take advantage of parallel processing. For this lesson, we will not worry about how many CPUs our VM has, but you can get this information from within a Python program.



```
import multiprocessing
print(multiprocessing.cpu_count())
```



In [18]:
import multiprocessing
print(multiprocessing.cpu_count())

4


***Coder's Log***: a process is an active program. It has its own memory and resources. A thread is 'lightweight' in that it can share the memory of its parent process. Processes are isolated; threads are not.

When work is split up between threads, each thread may be assigned different units of work and finish at different times, and the results may be combined in a different order. At the cost of being less efficient, we can tell Word2Vec to only use a single thread (workers=1). That removes the run-to-run variation caused by thread scheduling. Note that there is also a seed hyperparameter that can be used to control the random number generator.
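
The idea behind a seed can be shown with Python's own random module. This is a general illustration of seeding, not gensim's internals, and `initial_vector` is a hypothetical helper. As far as we know, gensim's documentation for the seed parameter adds that a fully reproducible run also needs workers=1 and, between interpreter launches, a fixed PYTHONHASHSEED environment variable.

```python
import random

def initial_vector(seed, size=4):
    """Pseudo-random starting vector drawn from a generator with a fixed seed."""
    rng = random.Random(seed)
    return [rng.uniform(-1.0, 1.0) for _ in range(size)]

same_a = initial_vector(seed=42)
same_b = initial_vector(seed=42)
other = initial_vector(seed=7)

print(same_a == same_b)  # same seed, same starting vector
print(same_a == other)   # different seed, different starting vector
```

The same seed always produces the same sequence of 'random' numbers, which is what makes an experiment repeatable.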


Go back to the previous code cell and update your code (and run it):



```
def build_model_v1(doc):
  model = gensim.models.Word2Vec(
             doc,
             min_count=1, # ignore words that occur less than 1 times
             workers=1
          )
  return model
  
def test_v1():
  document, df = build_dataset()
  model = build_model_v1(document)
  print(evaluate_model(model,df))
test_v1()
```
You should now see consistent numbers between multiple runs. Here's the output we get (your output will be slightly different):



```
0 models are missing of 928
Toyota Camry->Honda Accord 0.9773
Toyota Camry->Nissan Van 0.9501
Toyota Camry->Mercedes-Benz SLK-Class 0.4989
GMC Jimmy, Ford Five Hundred, GMC Envoy
```




In [31]:
def build_model_v1(doc):
  model = gensim.models.Word2Vec(
             doc,
             min_count=1, # ignore words that occur less than 1 times
             workers=1
          )
  return model
  
def test_v1():
  document, df = build_dataset()
  model = build_model_v1(document)
  print(evaluate_model(model,df))
test_v1()

0 models are missing of 928
Toyota Camry->Honda Accord 0.9263
Toyota Camry->Nissan Van 0.8998
Toyota Camry->Mercedes-Benz SLK-Class 0.7427
Chevrolet Blazer, Toyota Previa, Chevrolet Silverado 1500 Hybrid


#**Windows of Context**
The output of word2vec is a set of word vectors. And each word vector is essentially the same as shown in the previous lesson on word embeddings. The goal of the algorithm is to have words with similar context occupy close spatial positions. As discussed in the word embeddings lesson, the cosine similarity can be used as a metric of closeness.
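
As a reminder of how cosine similarity works, here is a small sketch in plain Python. The three-dimensional vectors are invented for the example and do not come from a trained model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equally sized vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# made-up vectors for illustration only
camry = [1.0, 2.0, 0.0]
accord = [2.0, 4.0, 0.0]    # points in the same direction as camry
roadster = [0.0, 0.0, 3.0]  # perpendicular to camry

print(cosine_similarity(camry, accord))    # very close to 1.0
print(cosine_similarity(camry, roadster))  # 0.0
```

Only the direction of the vectors matters, not their length, which is why `accord` scores (almost exactly) 1.0 against `camry` even though it is twice as long.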

For word2vec there is a concept of defining both a 'target word' and 'context words'. Below is an example of the target word 'by' with its context window (the context words are 'word is known' on the left and 'the company it' on the right):

![](https://drive.google.com/uc?export=view&id=1QR5HR3sFX5Dt1ZDb_sNxoX1xEsf4uMl7)


For a window of size n, the context is defined by capturing n words to the left of the target and n words to its right. This window of context (shown here to be size 3) slides along the text, so the next target word to be processed is 'the':

![](https://drive.google.com/uc?export=view&id=1Ao5X5Z07uZdnZNQRRaRhH9kDdYJ7pPc-)

Given that information, it's clear that where the target word appears in the document and the size of the context window can affect the quality of the output. If the window is too small, 'meaning' becomes very narrow. If the window is too big, words no longer separate from each other.
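
The sliding window can be sketched in a few lines of plain Python. We assume the example sentence is 'a word is known by the company it keeps' (only part of it is quoted above), and `context_windows` is a hypothetical helper, not a gensim function.

```python
def context_windows(tokens, window):
    """Yield (target, context_words) for every position in tokens."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]    # up to `window` words before
        right = tokens[i + 1:i + 1 + window]   # up to `window` words after
        yield target, left + right

# assumed example sentence
sentence = 'a word is known by the company it keeps'.split()
pairs = list(context_windows(sentence, window=3))

print(pairs[4])  # ('by', ['word', 'is', 'known', 'the', 'company', 'it'])
print(pairs[5])  # ('the', ['is', 'known', 'by', 'company', 'it', 'keeps'])
```

Notice that words near the start or end of a line get a smaller context, which is one reason the position of a field in each row of our car data matters.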

###**Exercise**

Go all the way back to the code cell that creates the function build_dataset. Move the column 'Make_Model' from the front of the list to the third position. Now re-run the cell with build_dataset in it. You should see the following (from the output of test_pd_data):

```
['Factory Tuner', 'Luxury', 'High-Performance', 'BMW 1 Series M', 'Compact']
```

Now re-run the cell with test_v1():

```
test_v1()
```

Notice that the position of where the make/model appears in the document affects the result (the similarity of Camry and Accord went down). We can avoid this issue by creating a wide context window.

The default window size is 5. Update build_model_v1 to be the following:

```
def build_model_v1(doc):
  model = gensim.models.Word2Vec(
             doc,
             min_count=1, # ignore words that occur less than 1 times
             workers=1,   # one thread to remove randomness
             window=10,   # wide window size
          )
  return model
```

When you re-run the cell (test_v1()) the output becomes:

```
0 models are missing of 928
Toyota Camry->Honda Accord 0.9734
Toyota Camry->Nissan Van 0.9292
Toyota Camry->Mercedes-Benz SLK-Class 0.1101
Dodge Ramcharger, Pontiac Montana, GMC Jimmy
```

#**Epoch Training**

As we saw in the ML Prep lesson, each machine learning algorithm involves iteration over the dataset to help adjust and improve. Initially, the word vectors are assigned random locations in very high dimensional space. As the algorithm iterates, these word vectors move closer to neighborhoods with 'similar words'.

![](https://drive.google.com/uc?export=view&id=1v1C8k8v-rZnaDtRVv2nqb0eEwmlZuKqa)

Remember that 'closeness' is defined by how similar these words are, and being similar means the words share similar contexts. So you expect common misspellings and upper/lower case versions of the same word to be located near each other in high dimensional space. The image to the left shows how the days of the week (orange circle) might be near each other. Also, similar relationships would have similar distances (e.g. king to queen and uncle to aunt).



![](https://drive.google.com/uc?export=view&id=1Gvt7N27XLaHRlpvYe2qGV732FGVgt24S)


You can control how many times word2vec iterates over the training data through the epochs parameter (named iter in gensim versions before 4.0), whose default is 5. Let's up this to 15. Of course, this is a choice that comes from experimentation and evaluation. If your corpus is huge, you may not have enough years to iterate.


Let's create another function:

In [37]:
def build_model_v2(doc):
  model = gensim.models.Word2Vec(
          doc,
          min_count=1,   # keep every word (ignore only words seen fewer than 1 time)
          workers=1,     # threads to use
          window=10,     # size of window around the target word
          epochs=15        # 15 epochs
          )
  return model

ndims = [25, 50, 75, 100, 150, 200]

def build_model_v2(doc):
  model = gensim.models.Word2Vec(
          doc,
          min_count=1,   # keep every word (ignore only words seen fewer than 1 time)
          workers=1,     # threads to use
          window=10,     # size of window around the target word
          epochs=15,        # 15 epochs
          vector_size=ndim, # how big the output vectors (spacy == 300)
          )
  return model


def test_v2():
  document, df = build_dataset()
  model = build_model_v2(document)
  print(evaluate_model(model,df))
# test_v2()

for i in ndims:
  print("NDIM = {}".format(i))
  ndim = i
  test_v2()
  print('*'*100)

NDIM = 25
0 models are missing of 928
Toyota Camry->Honda Accord 0.8772
Toyota Camry->Nissan Van 0.6246
Toyota Camry->Mercedes-Benz SLK-Class -0.1957
Toyota Sienna, Plymouth Grand Voyager, Chevrolet Astro
****************************************************************************************************
NDIM = 50
0 models are missing of 928
Toyota Camry->Honda Accord 0.8747
Toyota Camry->Nissan Van 0.6553
Toyota Camry->Mercedes-Benz SLK-Class -0.1814
Toyota Sienna, Plymouth Grand Voyager, Chevrolet Astro
****************************************************************************************************
NDIM = 75
0 models are missing of 928
Toyota Camry->Honda Accord 0.8740
Toyota Camry->Nissan Van 0.6748
Toyota Camry->Mercedes-Benz SLK-Class -0.1354
Toyota Sienna, Plymouth Grand Voyager, Chevrolet Astro
****************************************************************************************************
NDIM = 100
0 models are missing of 928
Toyota Camry->Honda Accord 0.8724
Toyota Ca

The output should look close to the following:

```
0 models are missing of 928
Toyota Camry->Honda Accord 0.8234
Toyota Camry->Nissan Van 0.6933
Toyota Camry->Mercedes-Benz SLK-Class -0.0498
Ford Aerostar, GMC Safari, Dodge Caravan
```

This looks a lot better. The three similar vans are correct and the Camry and Mercedes have a lot more distance between them (negative in fact). Note: your output will look slightly different. But you should see an improvement.

#**High Dimensional Space**

As we saw in the word embeddings lesson, word vectors have a length (we saw that spaCy uses 300) that indicates how many dimensions each word vector has. The default for word2vec is 100.

This is another hyperparameter that you can adjust. There's no perfect number. In general, the larger your corpus is, the more dimensions you will need. The cars dataset is very small, so it would be good to know how many dimensions are enough to capture the similarities between cars.

You want the smallest number of dimensions necessary to do well on your evaluation metrics (and no more). Too many dimensions become space inefficient as your corpus size increases and could result in **overfitting** (when the model doesn't generalize well).

Update build_model_v2 to allow the number of dimensions to be passed in.



```
def build_model_v2(doc):
  model = gensim.models.Word2Vec(
          doc,
          min_count=1,   # keep every word (ignore only words seen fewer than 1 time)
          workers=1,     # threads to use
          window=10,     # size of window around the target word
          epochs=15,     # 15 epochs (named iter before gensim 4.0)
          vector_size=ndim, # how big the output vectors (spacy == 300); named size before gensim 4.0
          )
  return model
```



**Exercise**

Experiment with using 25, 50, 75, 100, 150, 200. Update test_v2 to call build_model_v2 in a loop of the different sizes.

What do you notice and where would you decide to put the cutoff?

#**Two Ways To Train**

Word2Vec uses a neural network as its algorithm and architecture. We will go into more detail in the lesson on using neural networks. For now, we will simplify things a bit just so we can stay focused on the task at hand.

Word2vec provides two very different ways to structure the neural network for learning the distributed representations of words, both designed to minimize the computational complexity (how long it takes to run). These two underlying architectures are the continuous bag-of-words (CBOW) model and the continuous skip-gram model. Each one trains on a different prediction task.

#**Continuous bag-of-words (CBOW) training**

In this method, a window of words surrounding the 'target' word (i.e. the context) is used in an attempt to predict the target word.

It's a 'bag of words' in that the actual order of the surrounding words is not used in the analysis. It's 'continuous' in that the context words are represented by continuous (real-valued) vectors rather than discrete counts.

The input into CBOW is a vector representation of a group of context words. The goal is to predict the most appropriate target word, which will be within the vicinity of that group of words.
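
One common way to build that single input vector is to average the vectors of the context words. The sketch below assumes this averaging formulation; the two-dimensional embeddings are invented for illustration and `average_vectors` is a hypothetical helper.

```python
def average_vectors(vectors):
    """Element-wise mean of equally sized vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]

# hypothetical 2-dimensional embeddings, invented for this example
toy_embeddings = {
    'is': [0.0, 2.0],
    'known': [2.0, 0.0],
    'the': [4.0, 2.0],
    'company': [2.0, 4.0],
}

context = ['is', 'known', 'the', 'company']  # window of 2 around 'by'
cbow_input = average_vectors([toy_embeddings[w] for w in context])

# reversing the context gives the same average: word order is ignored
reversed_input = average_vectors([toy_embeddings[w] for w in reversed(context)])

print(cbow_input)                    # [2.0, 2.0]
print(cbow_input == reversed_input)  # True
```

Because the average does not change when the context words are reordered, the model really does treat the context as a 'bag' of words.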


![](https://drive.google.com/uc?export=view&id=1Csgy5_ICG58GuiOGsojGZo3Fs0cUUBwW)


#**Skip-gram training**

For CBOW, if you have enough context, the goal is to predict the word. In the skip-gram 'model', if you are given a target word, the output is the set of context words (i.e. words that appeared in close proximity to the target).

Essentially the task the neural network is solving is to find which context words can appear given a target word. After training the neural network, if we input any target word into the neural network, it will give a vector output which represents the words which have a high probability of appearing near the given word.
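
The training examples for skip-gram can be pictured as (target, context word) pairs. Here is a minimal sketch of how such pairs could be generated; `skipgram_pairs` is a hypothetical helper and real implementations add further refinements.

```python
def skipgram_pairs(tokens, window):
    """List every (target, context_word) pair within the window."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs

tokens = ['known', 'by', 'the', 'company']
sg_pairs = skipgram_pairs(tokens, window=1)
print(sg_pairs)
# [('known', 'by'), ('by', 'known'), ('by', 'the'),
#  ('the', 'by'), ('the', 'company'), ('company', 'the')]
```

Each target word produces several training pairs, which helps explain why skip-gram is slower to train than CBOW but gets more out of rare words.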

![](https://drive.google.com/uc?export=view&id=1dkeVAr5maRiSm2jmauQ_f2cpu9dg4hkK)


#**Choosing between the two**

The author of word2vec [summarizes the differences](https://groups.google.com/g/word2vec-toolkit/c/NLvYXU99cAM/m/E5ld8LcDxlAJ?pli=1) of CBOW and SkipGram:

* Skip-gram: works well with small amount of the training data, represents well even rare words or phrases

* CBOW: several times faster to train than the skip-gram, slightly better accuracy for the frequent words


![](https://drive.google.com/uc?export=view&id=1TVBn_sskOkehevhJfcs9sQvd8uiFAz-w)


The default training method of word2vec is CBOW. But since our car dataset is so small, let's try the skip-gram model by using the sg named parameter:

In [43]:
def build_model_v3(doc):
  model = gensim.models.Word2Vec(
          doc,
          min_count=1,   # keep every word (ignore only words seen fewer than 1 time)
          workers=1,     # threads to use
          window=10,     # size of window around the target word
          epochs=15,        # 15 epochs
          vector_size=ndim,     # how big the output vectors (spacy == 300)
          sg=1,          # 0 == CBOW (default) 1 == skip gram
          negative=15
          )
  return model

def test_v3():
  document, df = build_dataset()
  model = build_model_v3(document)
  print(evaluate_model(model,df))

test_v3()

0 models are missing of 928
Toyota Camry->Honda Accord 0.9015
Toyota Camry->Nissan Van 0.8168
Toyota Camry->Mercedes-Benz SLK-Class 0.6322
Toyota Sienna, Chevrolet Astro, Ford Aerostar


When you run it, you should see something similar to

```
0 models are missing of 928
Toyota Camry->Honda Accord 0.8568
Toyota Camry->Nissan Van 0.8036
Toyota Camry->Mercedes-Benz SLK-Class 0.5291
GMC Safari, Chevrolet Astro, Dodge Caravan
```

In order to compare the accuracy of CBOW and skip-gram, we would need a more extensive test suite. But you can see that skip-gram did perform very well.

#**Negative Sampling**

Without getting into the details (they will come), training a neural network (NN) is very time consuming. An NN is made up of connected nodes and layers.


![](https://drive.google.com/uc?export=view&id=1lhZc_wHJiH3474H1MRDvnuQdlrywtav7)

You can also think of each connection (a line or edge between nodes) as having a 'weight' that needs to be adjusted. As the size of the vocabulary increases (the number of unique words in the corpus) so does the complexity of the internal architecture (i.e. a lot more nodes, edges and weights to adjust).

Negative sampling addresses the complexity issue by having each training sample modify only a small percentage of the nodes/weights (rather than all of them). With negative sampling, we randomly select just a small number of “negative” words to update the weights for. In this context, a “negative” word is one for which we want the network to output a 0.
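
To make the idea concrete, here is a rough sketch of how one training sample could be assembled. It assumes the weighting described in the follow-up word2vec paper (word frequency raised to the 0.75 power); `draw_negatives` and the word counts are invented for illustration, and this is not gensim's code.

```python
import random
from collections import Counter

def draw_negatives(counts, positive_words, k, rng):
    """Draw k negative words, weighted by frequency ** 0.75, skipping positives."""
    candidates = [w for w in counts if w not in positive_words]
    weights = [counts[w] ** 0.75 for w in candidates]
    # sampling with replacement keeps the sketch short
    return rng.choices(candidates, weights=weights, k=k)

# hypothetical word counts, invented for this example
counts = Counter({'Compact': 50, 'Sedan': 30, 'Luxury': 20,
                  'Toyota Camry': 5, 'Honda Accord': 4})
rng = random.Random(0)

target, context = 'Toyota Camry', 'Sedan'
negatives = draw_negatives(counts, {target, context}, k=3, rng=rng)

# one training sample: the true pair is labelled 1, each negative 0
sample = [(target, context, 1)] + [(target, neg, 0) for neg in negatives]
print(sample)
```

Only the weights for these four words (one positive, three negatives) would be updated for this sample, instead of the weights for the entire vocabulary.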

Another option for the training method is softmax. A full softmax is computationally expensive, so word2vec typically uses hierarchical softmax, which is an optimized approximation. We can cover the details of these algorithms when we get to neural networks.

For word2vec, the default value of the negative parameter (the number of negative samples) is 5. Update the function build_model_v3 to set it to 15:

```
negative=15,
```



When you re-run test_v3, you should see results similar to the following:

```
0 models are missing of 928
Toyota Camry->Honda Accord 0.8977
Toyota Camry->Nissan Van 0.8216
Toyota Camry->Mercedes-Benz SLK-Class 0.6351
GMC Safari, Ford Windstar, Chevrolet Astro
```

#**Saving and Loading Models**

Once you train a model (which can take hours, days, weeks, months?), you will want to save it to a file so you can just reload the trained model. For a large corpus, the model you save is typically much smaller than the text used to train it.

For gensim, you can save models via the save method. Update your test_v3 function to include saving the model:

**Saving Models**

```
def test_v3():
  document, df = build_dataset()
  model = build_model_v3(document)
  print(evaluate_model(model,df))
  model.save('carmodel.skipgram')
test_v3()
```

**Loading Models**

You can reload saved models just as easily:
```
def test_load():
  md2 = gensim.models.Word2Vec.load('carmodel.skipgram')
  print(evaluate_model(md2))
```

In [44]:
def test_v3():
  document, df = build_dataset()
  model = build_model_v3(document)
  print(evaluate_model(model,df))
  model.save('carmodel.skipgram')
test_v3()

0 models are missing of 928
Toyota Camry->Honda Accord 0.9015
Toyota Camry->Nissan Van 0.8168
Toyota Camry->Mercedes-Benz SLK-Class 0.6322
Toyota Sienna, Chevrolet Astro, Ford Aerostar


In [46]:
def test_load():
  md2 = gensim.models.Word2Vec.load('carmodel.skipgram')
  print(evaluate_model(md2))
test_load()

Toyota Camry->Honda Accord 0.9015
Toyota Camry->Nissan Van 0.8168
Toyota Camry->Mercedes-Benz SLK-Class 0.6322
Toyota Sienna, Chevrolet Astro, Ford Aerostar


**Word Analogies Test Suite**

A classic set of word analogies is also available to use (see https://github.com/nicholas-leonard/word2vec/blob/master/questions-words.txt).
The word vectors of a gensim model also provide a method that runs this evaluation:

```
model.wv.evaluate_word_analogies
```

See the [documentation](https://radimrehurek.com/gensim/models/keyedvectors.html) for more information.
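
To show what such an evaluation measures, here is a toy version in plain Python. It is not gensim's implementation: the two-dimensional vectors are invented, the helper names are hypothetical, and the question lines only mimic the 'a b c d' layout of questions-words.txt (section headers start with a colon).

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def solve_analogy(vectors, a, b, c):
    """Answer 'a is to b as c is to ?' with the word nearest to b - a + c."""
    query = [vb - va + vc
             for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], query))

def analogy_accuracy(vectors, lines):
    """Fraction of 'a b c d' lines where the predicted word equals d."""
    correct = total = 0
    for line in lines:
        if line.startswith(':'):  # section header, not a question
            continue
        a, b, c, expected = line.split()
        total += 1
        if solve_analogy(vectors, a, b, c) == expected:
            correct += 1
    return correct / total

# made-up vectors: first axis ~ 'rank', second axis ~ 'gender'
toy_vectors = {
    'man': [1.0, 0.0], 'woman': [1.0, 1.0],
    'uncle': [2.0, 0.0], 'aunt': [2.0, 1.0],
    'king': [3.0, 0.0], 'queen': [3.0, 1.0],
}
toy_questions = [': family', 'king queen man woman', 'uncle aunt king queen']

accuracy = analogy_accuracy(toy_vectors, toy_questions)
print(accuracy)  # 1.0 on this hand-built example
```

Real embeddings are never this tidy, so on a trained model the accuracy is a number between 0 and 1 that you can track as you tune hyperparameters.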

#**fastText (2016)**

Facebook's implementation for creating word embeddings and sentence classification is called fastText. It is written in C++ and supports multiprocessing during training. Its word vectors are built from sub-words (character n-grams). You can [read](https://arxiv.org/pdf/1607.04606.pdf) about it here. You can even install the fasttext Python library; installing it and evaluating its models is straightforward.
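
To give a feel for what 'sub-words' means, here is a small sketch that splits a word into character n-grams with boundary markers, following the example in the linked paper. The helper `char_ngrams` is hypothetical and is not part of the fasttext library; the 3 to 6 range is taken from the paper.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of a word wrapped in '<' and '>' boundary markers."""
    wrapped = '<' + word + '>'
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams

trigrams = char_ngrams('where', n_min=3, n_max=3)
print(trigrams)  # ['<wh', 'whe', 'her', 'ere', 're>']
```

Because a word's vector is assembled from the vectors of pieces like these, fastText can produce a vector even for a word it never saw during training.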

#**Summary**
There's a lot going on in this lesson. So much in fact that the tests for this lesson will only confirm that you wrote the necessary functions. We'll have a separate lesson that allows you to work on a corpus and build your own word embeddings.

#**Lesson Assignment**
If you followed along with the lesson, you should be good to go. All the model building (and loading) takes too long. But be sure that all the functions are properly written and run without errors. You still need to submit your notebook in Moodle for grading.

**Steps to submit your work:**


1.   Download the notebook from Moodle. It is recommended that you use Google Colab to work on it.
2.   Upload any supporting files using file upload option within Google Colab.
3.   Complete the exercises and/or assignments
4.   Download as .ipynb
5.   Name the file as "lastname_firstname_WeekNumber.ipynb"
6.   After following the above steps, submit the final file in Moodle





<h1><center>The End!</center></h1>