# **Harry Potter 2 Vec**


![](https://drive.google.com/uc?export=view&id=10NHEpBSv-bQTZ0hMJRpJ-RqD66rrtJDR)

In previous lessons, you learned about both word embeddings and how to create them using gensim's Word2Vec. In this lesson, we are going to build our own word vector model using the text of Harry Potter.

Ideally, the model can then be used to find Harry Potter word analogies. For example, if you had the word vectors for the Harry Potter books, you could ask:
                            
                             Ron - man + woman = ?

Hopefully, you would get Hermione.

This lesson will ask you to build the best possible model. We will guide you through the steps and you can submit your best model for the assignment.



##**Hyperparameters and Model Evaluation**

The goal of this lesson is for you to experiment with the different hyperparameters to gensim's Word2Vec model. As a quick review, **hyperparameters** are those values or parameters that the model is not learning or trying to optimize, but rather are used to configure the algorithm. Examples include the 'K' in **K**•means, the degree of polynomial to use in regression, number of levels in a decision tree, the learning rate, and number of layers and nodes in a neural network. For Word2Vec there are a [multiple](https://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.Word2Vec) parameters with which to experiment.

You will also build your own scoring system to evaluate the different models you create. We have so much to discuss, so let's get started!

#**Building the Harry Potter Corpus**

All 7 Harry Potter books are available in Moodle. You need to download them and upload in Colab using Colab's upload file option.

Each book is a single 'sentence', cleaned and tokenized. The function read_one shows how to read a single Harry Potter book.



```
def read_one():
  path = 'hp1.txt'
  book = open(path, 'r').read()
  return book
```

![](https://drive.google.com/uc?export=view&id=1C-l6Bg25X1-y7wPYlweVNuWRFgtVcTei)


You will need to finish the build_hp_dataset implementation.



```
def build_hp_dataset(count=3, stopwords=[]):
  # count is the number of documents in the corpus (harry potter books)
  # each document is an array of words
  # remove any stopwords
  # returns an array of documents (same vector_size as count)
  return []
```

* You are free to experiment using any number of books from the collections; however, **only books 1-3** will be used for evaluation.
* You should remove stopwords if any are passed in.
* You can get a list of words from a book by doing a simple split on the content.



In [219]:
import gensim
from gensim.models import KeyedVectors

import nltk
from nltk.corpus import stopwords
import ssl

try:
    _create_unverified_https_context = ssl._create_unverified_context
except AttributeError:
    pass
else:
    ssl._create_default_https_context = _create_unverified_https_context

sw = nltk.download('stopwords')

import re
import LessonUtil as Util

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

import httpimport
url = "https://raw.githubusercontent.com/zhu-jun-ting/INFO-490-MH2/main/"
with httpimport.remote_repo(["util"], url):
    import util

from functools import singledispatch
from collections import namedtuple

[nltk_data] Downloading package stopwords to /Users/mac/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [221]:
print(stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [86]:

class Config(namedtuple('Config', ['doc', 'vector_size', 'window', 'min_count', 'sg', 'negative', 'epochs', 'name'])):



  def __str__(self):
    # skip doc
    fmt = "doc_len:{}, vector_size:{}, window:{}, min_count:{}, sg:{}, negative:{}, epochs:{}"
    return fmt.format(len(self.doc), self.vector_size,
                      self.window, self.min_count,
                      self.sg, self.negative, self.epochs)

def build_config(doc, vector_size=10, window=5, min_count=5, sg=0, negative=3, epochs=25, name=''):
  return Config(doc=doc, vector_size=vector_size, window=window, min_count=min_count, sg=sg, negative=negative, epochs=epochs, name=name)


In [224]:

def read_hp(number):
    path = 'data/hp{}.txt'.format(number)
    with open(path, 'r') as book:
        text = book.read()
    return text

def build_hp_dataset(count=3, stopwords=stopwords.words('english')):
    # count is the number of documents in the corpus (harry potter books)
    # each document is an array of words
    # remove any stopwords
    # returns an array of documents (same vector_size as count)

    hp = []
    for i in range(1, count+1):
        hp.append(read_hp(i).split())

    for text in hp:
        for word in stopwords:
            try:
                text.remove(word)
            except Exception as e:
                pass

    return hp

corpus = build_hp_dataset(3)
len(corpus[0]) # 75258

75112

#**Building the Model**

You will build a word embedding model using Word2Vec. The next section will describe how to create a configuration for build_model to use.

A few notes:
* most of the configuartion values are coming from the config parameter
* you are free to add additional parameters to see if you can build an awesome model; however, for the lesson, only the above parameters in a configuration will be used.
* epochs is capped at 100
* vector_size is capped at 300
* the [Word2Vec documentation](https://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.Word2Vec) should be used for additional details.

In [88]:
def build_model(config):

    # no need to change anything
    model = gensim.models.Word2Vec(
        config.doc,                   # each sentence is a HP book
        vector_size=min(config.vector_size,300),    # how big the output vectors (spacy == 300)
        window=config.window,         # vector_size of window around the target word
        min_count=config.min_count,   # ignore words that occur less than 2 times
        sg=config.sg,                 # 0 == CBOW (default) 1 == skip gram
        
        negative=config.negative,
        # keep these the same
        workers=1,  # threads to use
        epochs=min(config.epochs, 100)  # max epochsations is 100,
        )
    return model




#**Configuring the Model**

Now that you know how the model will be built AND you have your corpus, the next step is to build a configuration to configure the Word2Vec model. This lesson comes with a special class that you will use to create a configuration object.

```
import LessonUtil as Util
config = build_config(corpus, vector_size=75, window=15, epochs=75)
```

Take a look below at how the configuration will be used to build a model:

```
def build_model(config):
    return gensim.models.Word2Vec(sentences=config.doc, vector_size=config.vector_size, ... )
```

The configuration object has the following fields:

```['doc', 'vector_size', 'window', 'min_count', 'sg', 'negative', 'epochs', 'name']```

* The same hyperparameters that you can configure a Word2Vec model are in the configuration
* The name field allows you to name the configuration if you want to keep track of it.


In [89]:
default_config = build_config(corpus)
print("Default config", default_config)

config1 = build_config(corpus, vector_size=75, window=15, epochs=75)
config2 = build_config(corpus, 5, 2, 1, 1, 5, 30)

print("Custom config1", config1)
print("Custom config2", config2)

Default config doc_len:3, vector_size:10, window:5, min_count:5, sg:0, negative:3, epochs:25
Custom config1 doc_len:3, vector_size:75, window:15, min_count:5, sg:0, negative:3, epochs:75
Custom config2 doc_len:3, vector_size:5, window:2, min_count:1, sg:1, negative:5, epochs:30


##**Preventing Randomness**

Like most machine learning algorithms, randomization is introduced during initialization. This comes at the cost when trying to compare different versions of models. You can control some of this by setting workers=1.

However, the results you get in this notebook will NOT be the same when you submit on gradescope. They should be close though.

#**Running with the model**
We now have enough in place to start the process of building a Word2Vec model with Harry Potter Books. So Exciting. Below is a sample pipeline.

Make sure you understand the pipeline inside simple_test.

In [90]:
config = build_config(corpus, 5, 2, 1, 1, 5, 30)
model  = build_model(config)

In [91]:
model.wv?

[0;31mType:[0m            KeyedVectors
[0;31mString form:[0m     <gensim.models.keyedvectors.KeyedVectors object at 0x7f80dd5eae20>
[0;31mLength:[0m          11988
[0;31mFile:[0m            /Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/gensim/models/keyedvectors.py
[0;31mDocstring:[0m       <no docstring>
[0;31mClass docstring:[0m
Serialize/deserialize objects from disk, by equipping them with the `save()` / `load()` methods.

--------
This uses pickle internally (among other techniques), so objects must not contain unpicklable attributes
such as lambda functions etc.
[0;31mInit docstring:[0m 
Mapping between keys (such as words) and vectors for :class:`~gensim.models.Word2Vec`
and related models.

Used to perform operations on the vectors such as vector lookup, distance, similarity etc.

To support the needs of specific models and other downstream uses, you can also set
additional attributes via the :meth:`~gensim.models.keyedvectors.Keyed

In [92]:
def simple_test():
    # build the harry potter corpus
    corpus = build_hp_dataset()

    # build the configuration to try/experiment
    config = build_config(corpus, 5, 2, 1, 1, 5, 30)
    
    # build the model with this configuration
    model  = build_model(config)
    
    # let's hope harry is in there
    try:
        harry = model.wv['harry']
    except KeyError as e:
        print(e)
    
    # see who is similar to ron
    print(model.wv.most_similar(positive=['ron']))

simple_test()

[('quickly', 0.9927340149879456), ('shouted', 0.992134690284729), ('son', 0.9898478984832764), ('hagrid', 0.9885435700416565), ('work', 0.9871503710746765), ('born', 0.9864280223846436), ('harry', 0.9851586222648621), ('now', 0.9848641157150269), ("she's", 0.984484076499939), ("'yes'", 0.9829734563827515)]


In [93]:
'quickly' in [key for (key, value) in model.wv.most_similar(positive=['ron'], topn=25)]

True

What do you see? If you built a good model, would hope to see hermione near the top?

#**Scoring the Model**

Now with that in place, the fun begins. It is your job to find the best set of parameters (hyperparameters) to configure the Word2Vec model.

The LessonUtil module has a set of tests you can use to evaluate your models. Each test looks like the following:

```
(['ron'],       [],       ['hermione', 'harry']),
(['voldemort'], ['evil'], ['dumbledore']),
```

Each test has 3 components:

* the first component is the positive vectors
* the second component is the negative vectors
* the third component is a set of possible answers

Looking at the first example: (['ron'], [],['hermione', 'harry'])

Here the **Ron** vector should be close to either (or both) the **Hermione**
or **Harry** vector.

Looking at the second example: (['voldemort'], ['evil'], ['dumbledore'])

The vector that is the result of the **Voldemort** vector - the **evil** vector; the hope is that result vector is close/similar to the **Dumbledore**
vector: veldemort - evil = dumbledore

In [94]:
@singledispatch
def check_correct(your_answer, correct_answer):
	pass

@check_correct.register
def _(your_answer, correct_answer:list):
	for ans in correct_answer:
		if ans in your_answer:
			return True
	return False

@check_correct.register
def _(your_answer, correct_answer:str):
	return correct_answer in your_answer

check_correct([1], [1])

True

In [95]:
def test(model, tests, topn=25, debug=False):
	# initialize model score as 0, then if it get one entry correct, we add 100/test_number to the score. The final score x out of 100 will be the crepochsia of this model
	model_score = 0
	point = 100/len(tests)
	for test in tests:
		positive = test[0]
		negative = test[1]
		wanted_results = test[2]
		gensim_results = [key for (key, value) in model.wv.most_similar(positive=positive, negative=negative, topn=topn)]
		# print(gensim_result_list)
		# print(wanted_results)
		is_correct = check_correct(gensim_results, wanted_results)
		model_score += int(is_correct) * point

		# debug
		if debug:
			print('*' * 100)
			print("TEST QUESTION -> {}".format(test))
			print("RESULT FROM GENSIM -> {}".format(gensim_results))
			print("SCORE FOR THIS QUESTION -> {}".format(int(is_correct) * point))
	return model_score


test(model=model, tests=Util.tests[:-1], topn=25, debug=True)

****************************************************************************************************
TEST QUESTION -> (['mcgonagall'], [], ['professor'])
RESULT FROM GENSIM -> ['fingering', 'hampered', 'puzzle', 'drooling', 'relaxed', 'first-year', 'fickle', 'fifty-ten', "mornin'", 'tangerine', 'gentlemen', 'polo', 'verifiable', 'kip', 'orphaned', 'scott', 'patterns', 'prickled', 'utterly', 'dumbledore', 'unregistered', 'firmer', 'more', 'disgusting', "girl's"]
SCORE FOR THIS QUESTION -> 0.0
****************************************************************************************************
TEST QUESTION -> (['ron'], [], ['hermione', 'harry'])
RESULT FROM GENSIM -> ['quickly', 'shouted', 'son', 'hagrid', 'work', 'born', 'harry', 'now', "she's", "'yes'", 'failed', 'mummy', 'being', 'dying', 'build', 'ginny', 'definite', 'masons', 'explode', 'firm', 'beaten', 'hello', 'contact', 'devil', 'shrunk']
SCORE FOR THIS QUESTION -> 5.2631578947368425
*********************************************

10.526315789473685

In [226]:
import random
field = ['doc', 'vector_size', 'window', 'min_count', 'sg', 'negative', 'epochs', 'name']

corpus = build_hp_dataset(count=3)

# config = build_config(corpus, vector_size=10, window=10, min_count=15, sg=1, negative=3, epochs=8)
# print(config)
# model  = build_model(config)

def randint(a, b):
	return random.randint(a, b)

for i in range(100):
	config = build_config(corpus, vector_size=randint(10, 300), window=randint(4, 20), min_count=5, sg=randint(0, 1), negative=randint(3, 100), epochs=randint(30, 100))
	print(config)
	model = build_model(config)
	print("{} -> {}".format(i, test(model=model, tests=Util.tests[:-1], topn=25, debug=False)))



doc_len:3, vector_size:154, window:7, min_count:5, sg:1, negative:4, epochs:90
0 -> 31.578947368421055
doc_len:3, vector_size:227, window:13, min_count:5, sg:1, negative:87, epochs:75
1 -> 36.8421052631579
doc_len:3, vector_size:266, window:10, min_count:5, sg:1, negative:48, epochs:81
2 -> 31.578947368421055
doc_len:3, vector_size:225, window:16, min_count:5, sg:0, negative:4, epochs:92
3 -> 31.578947368421055
doc_len:3, vector_size:80, window:9, min_count:5, sg:0, negative:83, epochs:63
4 -> 31.578947368421055
doc_len:3, vector_size:264, window:18, min_count:5, sg:0, negative:59, epochs:74
5 -> 42.10526315789474
doc_len:3, vector_size:39, window:9, min_count:5, sg:1, negative:28, epochs:69
6 -> 52.631578947368425
doc_len:3, vector_size:102, window:8, min_count:5, sg:1, negative:88, epochs:46
7 -> 36.8421052631579
doc_len:3, vector_size:133, window:13, min_count:5, sg:0, negative:89, epochs:52
8 -> 31.578947368421055
doc_len:3, vector_size:31, window:9, min_count:5, sg:0, negative:30,

doc_len:3, vector_size:75, window:19, min_count:12, sg:0, negative:44, epochs:48
doc_len:3, vector_size:30, window:14, min_count:14, sg:1, negative:46, epochs:94
doc_len:3, vector_size:69, window:15, min_count:13, sg:1, negative:35, epochs:64

In [201]:
# best option


config = build_config(corpus, vector_size=100, window=25, min_count=12, sg=0, negative=90, epochs=48)
print(config)
model  = build_model(config)

test(model=model, tests=Util.tests[:-1], topn=25, debug=True)


doc_len:3, vector_size:100, window:25, min_count:12, sg:0, negative:90, epochs:48
****************************************************************************************************
TEST QUESTION -> (['mcgonagall'], [], ['professor'])
RESULT FROM GENSIM -> ['professor', 'dumbledore', "everyone's", 'sounding', 'live', "hagrid's", 'suppose', 'james', 'lily', 'drop', 'frightened', 'albus', 'saying', 'hollow', 'believe', "they're", "mcgonagall's", 'gently', "isn't", 'poor', 'here', 'sense', 'gone', 'anxious', 'seems']
SCORE FOR THIS QUESTION -> 5.2631578947368425
****************************************************************************************************
TEST QUESTION -> (['ron'], [], ['hermione', 'harry'])
RESULT FROM GENSIM -> ['grinning', 'george', 'fred', 'hermione', "percy's", "hermione's", 'ages', 'percy', 'okay', 'ear', 'needs', 'bar', "you've", 'bill', 'mum', 'brothers', 'try', 'midair', "we're", 'twins', 'seat', 'scabbers', 'badge', 'wing', 'rat']
SCORE FOR THIS QUESTION 

57.89473684210527

In [200]:


for i in range(90, 110):
	config = build_config(corpus, vector_size=i, window=25, min_count=12, sg=0, negative=90, epochs=48)
	print(config)
	model  = build_model(config)

	print("{} -> {}".format(i, test(model=model, tests=Util.tests[:-1], topn=25, debug=False)))



doc_len:3, vector_size:90, window:25, min_count:12, sg:0, negative:90, epochs:48
90 -> 36.8421052631579
doc_len:3, vector_size:91, window:25, min_count:12, sg:0, negative:90, epochs:48
91 -> 47.36842105263158
doc_len:3, vector_size:92, window:25, min_count:12, sg:0, negative:90, epochs:48
92 -> 42.10526315789474
doc_len:3, vector_size:93, window:25, min_count:12, sg:0, negative:90, epochs:48
93 -> 47.36842105263158
doc_len:3, vector_size:94, window:25, min_count:12, sg:0, negative:90, epochs:48
94 -> 52.631578947368425
doc_len:3, vector_size:95, window:25, min_count:12, sg:0, negative:90, epochs:48
95 -> 47.36842105263158
doc_len:3, vector_size:96, window:25, min_count:12, sg:0, negative:90, epochs:48
96 -> 47.36842105263158
doc_len:3, vector_size:97, window:25, min_count:12, sg:0, negative:90, epochs:48
97 -> 42.10526315789474
doc_len:3, vector_size:98, window:25, min_count:12, sg:0, negative:90, epochs:48
98 -> 47.36842105263158
doc_len:3, vector_size:99, window:25, min_count:12, sg:

In [225]:
# with config set as your model with parameters,

# Try implementing :


config = build_config(corpus, vector_size=100, window=25, min_count=12, sg=0, negative=90, epochs=48)

model = build_model(config)
def test_model():
	score = 0
	fail= []
	for test in Util.tests[:-1]:
		soln = test[2]
		pos = test[0]
		neg = test[1]
		result = model.wv.most_similar(positive=pos, negative=neg, topn=25)
		if any([sol in res for sol in soln for res in result]):
			score += 1
		else:
			fail.append([*pos, "->", *soln])
	print("Passed", score)
	print(fail)

test_model()
# This should show you the score no, anything over or equal to 14 is sufficient.

Passed 8
[['gryffindor', '->', 'hogwarts', 'house'], ['ron', 'hermione', '->', 'harry'], ['hagrid', '->', 'dumbledore'], ['ravenclaw', '->', 'hufflepuff'], ['muggle', 'magic', '->', 'wizard'], ['wizard', '->', 'potter'], ['house', '->', 'gryffindor'], ['house', 'evil', '->', 'slytherin'], ['voldemort', '->', 'dumbledore'], ['slytherin', 'good', '->', 'malfoy', 'lucius'], ['harry', 'aunt', '->', 'petunia']]


## **Lesson Assignment**
After configuring the Word2Vec model with the best set of hyperparameters, test your best model with all (19) of the test cases give in the Util file.
* ```result = model.wv.most_similar(positive=pos, negative=neg, topn=topn)```
    *   Get the top 25 words (```topn = 25```)
* Print out the results for all the test cases (when you submit the notebook, the results should be visible). 
<br><br>


**Steps to submit your work:**


1.   Download the lesson notebook from Moodle.
2.   Upload any supporting files using file upload option within Google Colab.
3.   Complete the exercises and/or assignments
4.   Download as .ipynb
5.   Name the file as "lastname_firstname_WeekNumber.ipynb"
6.   After following the above steps, submit the final file in Moodle





<h1><center>The End!</center></h1>