# **Harry Potter 2 Vec**


![](https://drive.google.com/uc?export=view&id=10NHEpBSv-bQTZ0hMJRpJ-RqD66rrtJDR)

In previous lessons, you learned about both word embeddings and how to create them using gensim's Word2Vec. In this lesson, we are going to build our own word vector model using the text of Harry Potter.

Ideally, the model can then be used to find Harry Potter word analogies. For example, if you had the word vectors for the Harry Potter books, you could ask:
                            
                             Ron - man + woman = ?

Hopefully, you would get Hermione.

This lesson will ask you to build the best possible model. We will guide you through the steps and you can submit your best model for the assignment.



##**Hyperparameters and Model Evaluation**

The goal of this lesson is for you to experiment with the different hyperparameters to gensim's Word2Vec model. As a quick review, **hyperparameters** are those values or parameters that the model is not learning or trying to optimize, but rather are used to configure the algorithm. Examples include the 'K' in **K**•means, the degree of polynomial to use in regression, number of levels in a decision tree, the learning rate, and number of layers and nodes in a neural network. For Word2Vec there are a [multiple](https://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.Word2Vec) parameters with which to experiment.

You will also build your own scoring system to evaluate the different models you create. We have so much to discuss, so let's get started!

#**Building the Harry Potter Corpus**

All 7 Harry Potter books are available in Moodle. You need to download them and upload in Colab using Colab's upload file option.

Each book is a single 'sentence', cleaned and tokenized. The function read_one shows how to read a single Harry Potter book.



```
def read_one():
  path = 'hp1.txt'
  book = open(path, 'r').read()
  return book
```

![](https://drive.google.com/uc?export=view&id=1C-l6Bg25X1-y7wPYlweVNuWRFgtVcTei)


You will need to finish the build_hp_dataset implementation.



```
def build_hp_dataset(count=3, stopwords=[]):
  # count is the number of documents in the corpus (harry potter books)
  # each document is an array of words
  # remove any stopwords
  # returns an array of documents (same size as count)
  return []
```

* You are free to experiment using any number of books from the collections; however, **only books 1-3** will be used for evaluation.
* You should remove stopwords if any are passed in.
* You can get a list of words from a book by doing a simple split on the content.



In [1]:
import gensim
from gensim.models import KeyedVectors

import nltk
from nltk.corpus import stopwords
import ssl

try:
    _create_unverified_https_context = ssl._create_unverified_context
except AttributeError:
    pass
else:
    ssl._create_default_https_context = _create_unverified_https_context

nltk.download('stopwords')

import re
import LessonUtil as Util

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

import httpimport
url = "https://raw.githubusercontent.com/zhu-jun-ting/INFO-490-MH2/main/"
with httpimport.remote_repo(["util"], url):
    import util

[nltk_data] Downloading package stopwords to /Users/mac/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [5]:

def read_hp(number):
    path = 'data/hp{}.txt'.format(number)
    with open(path, 'r') as book:
        text = book.read()
    return text

def build_hp_dataset(count=3, stopwords=[]):
    # count is the number of documents in the corpus (harry potter books)
    # each document is an array of words
    # remove any stopwords
    # returns an array of documents (same size as count)

    hp = ""
    for i in range(1, count+1):
        hp += read_hp(i)

    hp = hp.split()

    return hp

corpus = build_hp_dataset(3)

#**Building the Model**

You will build a word embedding model using Word2Vec. The next section will describe how to create a configuration for build_model to use.

A few notes:
* most of the configuartion values are coming from the config parameter
* you are free to add additional parameters to see if you can build an awesome model; however, for the lesson, only the above parameters in a configuration will be used.
* iter is capped at 100
* size is capped at 300
* the [Word2Vec documentation](https://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.Word2Vec) should be used for additional details.

In [3]:
def build_model(config):

    # no need to change anything
    model = gensim.models.Word2Vec(
        config.doc,                   # each sentence is a HP book
        size=min(config.size,300),    # how big the output vectors (spacy == 300)
        window=config.window,         # size of window around the target word
        min_count=config.min_count,   # ignore words that occur less than 2 times
        sg=config.sg,                 # 0 == CBOW (default) 1 == skip gram
        
        negative=config.negative,
        # keep these the same
        workers=1,  # threads to use
        iter=min(config.iter, 100)  # max iterations is 100,
        )
    return model


#**Configuring the Model**

Now that you know how the model will be built AND you have your corpus, the next step is to build a configuration to configure the Word2Vec model. This lesson comes with a special class that you will use to create a configuration object.

```
import LessonUtil as Util
config = Util.build_config(corpus, size=75, window=15, iter=75)
```

Take a look below at how the configuration will be used to build a model:

```
def build_model(config):
    return gensim.models.Word2Vec(sentences=config.doc, size=config.size, ... )
```

The configuration object has the following fields:

```['doc', 'size', 'window', 'min_count', 'sg', 'negative', 'iter', 'name']```

* The same hyperparameters that you can configure a Word2Vec model are in the configuration
* The name field allows you to name the configuration if you want to keep track of it.


In [6]:
default_config = Util.build_config(corpus)
print("Default config", default_config)

config1 = Util.build_config(corpus, size=75, window=15, iter=75)
config2 = Util.build_config(corpus, 5, 2, 1, 1, 5, 30)

print("Custom config1", config1)
print("Custom config2", config2)

Default config doc_len:262364, size:10, window:5, min_count:5, sg:0, negative:3, iter:25
Custom config1 doc_len:262364, size:75, window:15, min_count:5, sg:0, negative:3, iter:75
Custom config2 doc_len:262364, size:5, window:2, min_count:1, sg:1, negative:5, iter:30


##**Preventing Randomness**

Like most machine learning algorithms, randomization is introduced during initialization. This comes at the cost when trying to compare different versions of models. You can control some of this by setting workers=1.

However, the results you get in this notebook will NOT be the same when you submit on gradescope. They should be close though.

#**Running with the model**
We now have enough in place to start the process of building a Word2Vec model with Harry Potter Books. So Exciting. Below is a sample pipeline.

Make sure you understand the pipeline inside simple_test.

In [None]:
def simple_test():
    # build the harry potter corpus
    corpus = build_hp_dataset()

    # build the configuration to try/experiment
    config = Util.build_config(corpus, 5, 2, 1, 1, 5, 30)
    
    # build the model with this configuration
    model  = build_model(config)
    
    # let's hope harry is in there
    harry = model.wv['harry']
    
    # see who is similar to ron
    print(model.wv.most_similar(positive=['ron']))

What do you see? If you built a good model, would hope to see hermione near the top?

#**Scoring the Model**

Now with that in place, the fun begins. It is your job to find the best set of parameters (hyperparameters) to configure the Word2Vec model.

The LessonUtil module has a set of tests you can use to evaluate your models. Each test looks like the following:

```
(['ron'],       [],       ['hermione', 'harry']),
(['voldemort'], ['evil'], ['dumbledore']),
```

Each test has 3 components:

* the first component is the positive vectors
* the second component is the negative vectors
* the third component is a set of possible answers

Looking at the first example: (['ron'], [],['hermione', 'harry'])

Here the **Ron** vector should be close to either (or both) the **Hermione**
or **Harry** vector.

Looking at the second example: (['voldemort'], ['evil'], ['dumbledore'])

The vector that is the result of the **Voldemort** vector - the **evil** vector; the hope is that result vector is close/similar to the **Dumbledore**
vector: veldemort - evil = dumbledore

## **Lesson Assignment**
After configuring the Word2Vec model with the best set of hyperparameters, test your best model with all (19) of the test cases give in the Util file.
* ```result = model.wv.most_similar(positive=pos, negative=neg, topn=topn)```
    *   Get the top 25 words (```topn = 25```)
* Print out the results for all the test cases (when you submit the notebook, the results should be visible). 
<br><br>


**Steps to submit your work:**


1.   Download the lesson notebook from Moodle.
2.   Upload any supporting files using file upload option within Google Colab.
3.   Complete the exercises and/or assignments
4.   Download as .ipynb
5.   Name the file as "lastname_firstname_WeekNumber.ipynb"
6.   After following the above steps, submit the final file in Moodle





<h1><center>The End!</center></h1>