# How similar are these texts?

- Text1:
> "Tennis is a racket sport played individually or between two teams of two players each"
- Text2:
> "Quidditch is a fictional sport where witches and wizards playing by riding flying broomsticks"
- Text3:
> "The Wizard of Oz agrees to grant their wishes if they prove their worth by bringing him the Witch's broomstick"




In [11]:
texts = {
    "text1" : "Tennis is a racket sport played individually or between two teams of two players each",
    "text2" : "Quidditch is a fictional sport where witches and wizards playing by riding flying broomsticks",
    "text3" : "The Wizard of Oz agrees to grant their wishes if they prove their worth by bringing him the Witch's broomstick"
}

for name,text in texts.items():
  print (name,text)

text1 Tennis is a racket sport played individually or between two teams of two players each
text2 Quidditch is a fictional sport where witches and wizards playing by riding flying broomsticks
text3 The Wizard of Oz agrees to grant their wishes if they prove their worth by bringing him the Witch's broomstick


### Tokens

We need to split each text into its smallest processing units (tokens) before we can compare them.

Take a look at the [nltk.tokenize package](https://www.nltk.org/api/nltk.tokenize.html) to find out how to do this with regular expressions (`nltk.tokenize.regexp`).


![regular_expressions.png](https://www.dropbox.com/s/fj4va9i6zbyvebh/regular_expression.png?raw=1)
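
To get a feel for the pattern before running the tokenizer, here is a minimal sketch that uses only the standard `re` module. It assumes that a regular-expression tokenizer with default settings returns the matches of the pattern from left to right, which is what `re.findall` does; the sample sentence and the helper name `regex_tokenize` are made up for illustration.

```python
import re

# Three alternatives, tried in order at each position:
#   \w+         a run of letters, digits or underscores
#   \$[\d\.]+   a dollar amount such as $3.50
#   \S+         any other run of non-space characters (e.g. "'s")
PATTERN = r'\w+|\$[\d\.]+|\S+'

def regex_tokenize(text):
    """Return every non-overlapping match of PATTERN, scanning left to right."""
    return re.findall(PATTERN, text)

sample = "The Witch's broomstick costs $3.50 today"  # hypothetical sentence
print(regex_tokenize(sample))
# ['The', 'Witch', "'s", 'broomstick', 'costs', '$3.50', 'today']
```

The possessive `'s` becomes its own token because `\w+` stops at the apostrophe and the `\S+` alternative picks up the rest.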


In [22]:
%pip install nltk
%pip install scikit-learn

Note: you may need to restart the kernel to use updated packages.


In [23]:
from nltk.tokenize import RegexpTokenizer

tokens = {}

# raw string, so the backslashes reach the regex engine unchanged
regex = r'\w+|\$[\d\.]+|\S+'

tokenizer = RegexpTokenizer(regex)

for name in texts:
  text = texts[name]
  text_tokens = tokenizer.tokenize(text)
  tokens[name]=text_tokens
  print(name,text_tokens)


text1 ['Tennis', 'is', 'a', 'racket', 'sport', 'played', 'individually', 'or', 'between', 'two', 'teams', 'of', 'two', 'players', 'each']
text2 ['Quidditch', 'is', 'a', 'fictional', 'sport', 'where', 'witches', 'and', 'wizards', 'playing', 'by', 'riding', 'flying', 'broomsticks']
text3 ['The', 'Wizard', 'of', 'Oz', 'agrees', 'to', 'grant', 'their', 'wishes', 'if', 'they', 'prove', 'their', 'worth', 'by', 'bringing', 'him', 'the', 'Witch', "'s", 'broomstick']


### Stopwords

Let's remove the stopwords from the tokens, using the `stopwords` corpus from `nltk.corpus`.



In [24]:
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords


# create a set with stopwords
stop_words = set(stopwords.words('english'))
print(stop_words)

for name in tokens:
  text_tokens = tokens[name]

  # keep only the tokens that are not in the stop-word set
  # (the lookup is case-sensitive, so a capitalized 'The' is kept)
  filtered_tokens = [token for token in text_tokens if token not in stop_words]

  tokens[name] = filtered_tokens
  print(name,filtered_tokens)


{'ourselves', 'here', 'at', 'ours', 's', "it'll", 'haven', "that'll", 'now', "don't", 've', 'again', 'which', 'doing', 'has', 'who', 'over', 'down', 'wasn', "they'll", "hadn't", "weren't", 'd', 'that', 'off', "she'd", 'during', 'having', "she's", 'to', 'yourselves', 'in', 'or', 'an', 'mightn', 'you', 'me', 'few', 'of', 'under', "isn't", "mightn't", 'all', 'through', 'this', 'won', 'more', 'too', "i'll", 'our', 'been', 'out', "shouldn't", "we'll", 'than', 'll', 'so', 'against', 'by', 'until', 'are', "you're", "they're", "i've", 'such', "aren't", "shan't", 'o', "doesn't", 'from', "won't", 'not', "he'll", 'their', 'her', 'yours', 'most', 'on', 'nor', 'them', 'as', 'any', 'both', 'itself', 'up', "she'll", 'between', "he'd", 'shan', 'about', 'hasn', 'had', 'will', 'only', 'then', "i'd", 'him', 'am', 'with', 'once', 'mustn', 'ain', 'further', "they'd", 'i', 'if', 'hadn', 're', 'didn', 'how', 'his', 'what', "you've", 'after', 'were', 'being', 'do', 'but', 'own', "it'd", 'weren', 'very', 'need

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/federicosvendsen/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
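
The filter above compares tokens exactly as they are written, while the stop-word list is lowercase. A small sketch of the difference, using a tiny hand-picked stop-word set (an assumption for illustration, not the NLTK list):

```python
# Tiny hand-picked stop-word set, only for illustration (not the NLTK list).
toy_stop_words = {"the", "of", "to", "by", "is", "a"}

tokens_example = ["The", "Wizard", "of", "Oz", "agrees", "to", "grant"]

# Case-sensitive lookup, as in the cell above: "The" is kept.
case_sensitive = [t for t in tokens_example if t not in toy_stop_words]

# Case-insensitive lookup: lower-case each token before the lookup.
case_insensitive = [t for t in tokens_example if t.lower() not in toy_stop_words]

print(case_sensitive)    # ['The', 'Wizard', 'Oz', 'agrees', 'grant']
print(case_insensitive)  # ['Wizard', 'Oz', 'agrees', 'grant']
```

Whether to lower-case first is a design choice: it removes more stopwords, but it also changes which tokens reach the later steps.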


### Stemming

Different words with the same root can be grouped together.

Let's extract the root of each term using the [Porter stemmer](http://www.nltk.org/howto/stem.html).

In [25]:
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

stems = {}

for name in tokens:
  text_tokens = tokens[name]

  # reduce each token to its stem
  stem_tokens = [stemmer.stem(token) for token in text_tokens]

  stems[name] = stem_tokens
  print(name, stem_tokens)

text1 ['tenni', 'racket', 'sport', 'play', 'individu', 'two', 'team', 'two', 'player']
text2 ['quidditch', 'fiction', 'sport', 'witch', 'wizard', 'play', 'ride', 'fli', 'broomstick']
text3 ['the', 'wizard', 'oz', 'agre', 'grant', 'wish', 'prove', 'worth', 'bring', 'witch', "'s", 'broomstick']


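### Lemmatizer

Now, let's use a lemmatizer instead.

Import the [WordNetLemmatizer](https://www.nltk.org/api/nltk.stem.html#module-nltk.stem.wordnet) and iterate through the token list to get the lemmatized form of each token. Note that `lemmatize` treats every token as a noun unless a `pos` argument is given, which is why verb forms such as `played` are left unchanged in the output below.

Don't forget to reload the original tokens!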

In [26]:
import nltk
nltk.download('wordnet')
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

wnl = WordNetLemmatizer()

stems = {}

for name in tokens:
  text_tokens = tokens[name]

  # reduce each token to its lemma
  stem_tokens = [wnl.lemmatize(token) for token in text_tokens]

  stems[name] = stem_tokens
  print(name, stem_tokens)

text1 ['Tennis', 'racket', 'sport', 'played', 'individually', 'two', 'team', 'two', 'player']
text2 ['Quidditch', 'fictional', 'sport', 'witch', 'wizard', 'playing', 'riding', 'flying', 'broomstick']
text3 ['The', 'Wizard', 'Oz', 'agrees', 'grant', 'wish', 'prove', 'worth', 'bringing', 'Witch', "'s", 'broomstick']


[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/federicosvendsen/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


The solution code is available at: https://gist.github.com/cbadenes/2936b98270bf900ac04da09567845a36

### Vocabulary


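First we need to create the `vocabulary`.

A `vocabulary` is a list of the unique words contained in the texts. Note that the loop below only checks a word against the words collected from earlier texts, so a word repeated inside a single text (here `two`) ends up in the list twice: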

In [27]:
vocabulary = []

# extend `vocabulary` with the unique stems
for name in stems:
  vocabulary.extend([stem for stem in stems[name] if stem not in vocabulary])

print("Vocabulary Size: ", len(vocabulary))
print("Vocabulary Words: ", vocabulary)

Vocabulary Size:  28
Vocabulary Words:  ['Tennis', 'racket', 'sport', 'played', 'individually', 'two', 'team', 'two', 'player', 'Quidditch', 'fictional', 'witch', 'wizard', 'playing', 'riding', 'flying', 'broomstick', 'The', 'Wizard', 'Oz', 'agrees', 'grant', 'wish', 'prove', 'worth', 'bringing', 'Witch', "'s"]
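
The vocabulary printed above contains `two` twice, because the membership test runs before any word of the current text has been added. One possible way to keep only the first occurrence of each word, sketched with a small made-up input (`stems_example` and `build_vocabulary` are hypothetical names):

```python
# Made-up stems for two short texts; "two" is repeated inside text1.
stems_example = {
    "text1": ["sport", "two", "team", "two", "player"],
    "text2": ["sport", "witch", "wizard"],
}

def build_vocabulary(stems_by_text):
    """Collect each stem once, in order of first appearance."""
    seen = {}  # dicts keep insertion order, so this doubles as an ordered set
    for stem_list in stems_by_text.values():
        for stem in stem_list:
            seen.setdefault(stem, None)
    return list(seen)

print(build_vocabulary(stems_example))
# ['sport', 'two', 'team', 'player', 'witch', 'wizard']
```

With a vocabulary built this way, the sizes, vectors and similarity values would differ from the ones shown in the outputs of this notebook.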


The solution code is available at: https://gist.github.com/cbadenes/d7153a03ac93af9e6926f208c8295f27


### Bag-of-Words

And now, the bag-of-words can be created based on the vocabulary:

In [28]:
import pandas as pd


vectors={}

for name in stems:
  vectors[name]=[]

# create a vector from the stems with token frequencies
for word in vocabulary:
  for name in vectors:
    count = stems[name].count(word)
    vectors[name].append(count)

for name in vectors:
  print(name, vectors[name])

text1 [1, 1, 1, 1, 1, 2, 1, 2, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
text2 [0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
text3 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
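
The same counting can be sketched with `collections.Counter`, which counts every stem in a single pass instead of calling `list.count` once per vocabulary word. The vocabulary and stems below are made up for illustration, and `bag_of_words` is a hypothetical helper name.

```python
from collections import Counter

def bag_of_words(stem_list, vocabulary):
    """Return one count per vocabulary word, in vocabulary order."""
    counts = Counter(stem_list)          # stem -> number of occurrences
    return [counts[word] for word in vocabulary]  # missing words count as 0

vocab_example = ["sport", "two", "team", "witch"]   # hypothetical vocabulary
print(bag_of_words(["sport", "two", "team", "two"], vocab_example))
# [1, 2, 1, 0]
```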


Let's see the **document-term matrix** (one row per text, one column per vocabulary word):

In [29]:
import pandas as pd
from IPython.display import display, HTML
pd.set_option('display.max_columns', None)

df = pd.DataFrame(vectors.values(),
                  columns=vocabulary,
                  index=texts.keys())

display(df)

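| | Tennis | racket | sport | played | individually | two | team | two | player | Quidditch | fictional | witch | wizard | playing | riding | flying | broomstick | The | Wizard | Oz | agrees | grant | wish | prove | worth | bringing | Witch | 's |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| text1 | 1 | 1 | 1 | 1 | 1 | 2 | 1 | 2 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| text2 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| text3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |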


### Document-Similarity Matrix

Compute the cosine similarity between every pair of document vectors:

In [30]:
from sklearn.metrics.pairwise import cosine_similarity

print(cosine_similarity(df, df))

[[1.         0.0860663  0.        ]
 [0.0860663  1.         0.09622504]
 [0.         0.09622504 1.        ]]
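
To see where these numbers come from, recall that the cosine similarity of two vectors is their dot product divided by the product of their lengths. The sketch below recomputes the matrix entries by hand from the three bag-of-words vectors printed earlier, using only the standard library; the helper name `cosine` is ours, not part of scikit-learn.

```python
import math

def cosine(u, v):
    """Dot product of u and v divided by the product of their Euclidean norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)

# The three 28-dimensional vectors from the bag-of-words cell.
text1 = [1, 1, 1, 1, 1, 2, 1, 2, 1] + [0] * 19
text2 = [0, 0, 1, 0, 0, 0, 0, 0, 0] + [1] * 8 + [0] * 11
text3 = [0] * 16 + [1] * 12

print(round(cosine(text1, text2), 7))  # 0.0860663  (only 'sport' is shared)
print(round(cosine(text2, text3), 7))  # 0.096225   (only 'broomstick' is shared)
print(cosine(text1, text3))            # 0.0        (no shared word)
```

Each pair of texts shares at most one vocabulary word, which is why all off-diagonal values are so small.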


# Word Embeddings

In [38]:
%pip install numpy==1.26.4


Note: you may need to restart the kernel to use updated packages.


In [39]:
%pip install gensim


Note: you may need to restart the kernel to use updated packages.


In [41]:
import gensim.downloader as api
#glove_vectors = api.load('word2vec-google-news-300') # Takes some time!
glove_vectors = api.load('glove-twitter-25') # Fast

RuntimeError: Compiled extensions are unavailable. If you've installed from a package, ask the package maintainer to include compiled extensions. If you're building Gensim from source yourself, install Cython and a C compiler, and then run `python setup.py build_ext --inplace` to retry. 

In [None]:
# Get Similar words
word_of_interest = 'technology'
similar_words = glove_vectors.most_similar(word_of_interest, topn=10)

# Print
for word, score in similar_words:
    print(f'{word}: {score}')

In [None]:
from gensim.models import Word2Vec

# Let's do some operations
palabra1 = 'king'
palabra2 = 'man'
palabra3 = 'woman'

# Operation
result = glove_vectors[palabra1] - glove_vectors[palabra2] + glove_vectors[palabra3]

# Get similar words to the result
similar_words = glove_vectors.similar_by_vector(result, topn=10)

# Print
for palabra, score in similar_words:
    print(f'{palabra}: {score}')

In [None]:
len(glove_vectors[palabra1])

In [None]:
# Define two sentences for similarity comparison
import numpy as np

sentence1 = "Tennis racket sport played individually two teams two players"
sentence2 = "Quidditch fictional sport witches wizards playing riding flying broomsticks"

In [None]:
import numpy as np

def sentence_vectorizer(sentence, model):
    words = sentence.split()
    # Filter out words that are not in the vocabulary
    words = [word for word in words if word in model]
    if len(words) == 0:
        return np.zeros(model.vector_size)  # Return a zero vector for out-of-vocabulary sentences
    return np.mean([model[word] for word in words], axis=0)

from sklearn.metrics.pairwise import cosine_similarity



vector1 = sentence_vectorizer(sentence1, glove_vectors)
vector2 = sentence_vectorizer(sentence2, glove_vectors)

similarity = cosine_similarity([vector1], [vector2])[0][0]
print("Similarity:", similarity)
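
The averaging step can be tried without downloading any model. The sketch below uses a made-up dictionary of two-dimensional vectors (`toy_vectors`, values invented for illustration) and plain lists instead of NumPy. It also lower-cases each word before the lookup, which is an assumption on our side: if a model's vocabulary is lowercase, capitalized words such as `Tennis` would otherwise be skipped silently.

```python
# Made-up 2-dimensional word vectors, only for illustration.
toy_vectors = {
    "tennis": [1.0, 0.0],
    "sport":  [0.8, 0.2],
    "wizard": [0.0, 1.0],
}

def average_vector(sentence, vectors, size=2):
    """Average the vectors of the known words; unknown words are ignored."""
    known = [vectors[w.lower()] for w in sentence.split() if w.lower() in vectors]
    if not known:
        return [0.0] * size  # no known word at all: fall back to a zero vector
    # zip(*known) yields one tuple per dimension
    return [sum(column) / len(known) for column in zip(*known)]

print(average_vector("Tennis sport unknownword", toy_vectors))  # approximately [0.9, 0.1]
print(average_vector("unknownword", toy_vectors))               # [0.0, 0.0]
```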

In [None]:
vector1 = glove_vectors['soup']
vector2 = glove_vectors['miso']
similarity = cosine_similarity([vector1], [vector2])[0][0]
print("Similarity:", similarity)

## Creating your own Word Embedding... with Harry Potter


In [None]:
!git clone https://github.com/amephraim/nlp.git

Cloning into 'nlp'...
remote: Enumerating objects: 310, done.
remote: Total 310 (delta 0), reused 0 (delta 0), pack-reused 310 (from 1)
Receiving objects: 100% (310/310), 3.20 MiB | 8.67 MiB/s, done.
Resolving deltas: 100% (171/171), done.


In [42]:
%pip install pandas

Note: you may need to restart the kernel to use updated packages.


In [None]:
import numpy as np
import pandas as pd
import os
import re
import time

from gensim.models import Word2Vec
from tqdm import tqdm

tqdm.pandas()

import nltk
nltk.download('punkt')  # sentence tokenizer data used by sent_tokenize
from nltk.tokenize import sent_tokenize

def process_text(file_path):
    with open(file_path, 'r',encoding='utf-8',errors='ignore') as file:
        content = file.read()

    # Remove line breaks
    elements = content.replace('\n', ' ').split()

    # Drop single-character elements
    filtered_elements = [element for element in elements if len(element) > 1]

    # Join
    text = ' '.join(filtered_elements)

    sentences = sent_tokenize(text)

    return sentences


def preprocessing(titles_array):

    """
    Take in an array of sentences and return each one as a list of cleaned words.

    (e.g. input: 'i am a boy', output: ['am', 'boy'], since words of length 1 are removed)

    Feel free to change the preprocessing steps and see how it affects the modelling results!
    """

    processed_array = []

    for title in tqdm(titles_array):
        print(f"original title: {title}")
        # remove every character that is not a letter or a space (nothing is inserted in its place)
        processed = re.sub('[^a-zA-Z ]', '', title)

        words = processed.split()

        # keep words that have length of more than 1 (e.g. gb, bb), remove those with length 1.
        processed_array.append((' '.join([word for word in words if len(word) > 1])).split())

    return processed_array

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/federicosvendsen/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
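
To check what the cleaning does to a single sentence, here is a minimal stand-alone sketch of the same two steps (strip everything except letters and spaces, then drop one-letter words). `clean_sentence` is a hypothetical helper that mirrors the logic of `preprocessing` without the progress bar and printing.

```python
import re

def clean_sentence(sentence):
    """Keep only letters and spaces, then drop words of length 1."""
    letters_only = re.sub(r'[^a-zA-Z ]', '', sentence)
    return [word for word in letters_only.split() if len(word) > 1]

print(clean_sentence("You may copy it, give it away or re-use it"))
# ['You', 'may', 'copy', 'it', 'give', 'it', 'away', 'or', 'reuse', 'it']
print(clean_sentence("i am a boy"))
# ['am', 'boy']
```

Because removed characters are not replaced by a space, `re-use` becomes `reuse` and a URL such as `www.gutenberg.org` collapses into a single token.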


In [45]:
import nltk
nltk.download('punkt_tab')
folder_path = 'nlp/texts'

# Get the list of files in the folder
file_list = os.listdir(folder_path)

train_sentences=[]

# Read each file in the list
for file_name in file_list:
    # Print the file name, then build its full path
    print(file_name)
    file_path = os.path.join(folder_path, file_name)
    result = process_text(file_path)
    train_sentences.extend(preprocessing(result))

[nltk_data] Downloading package punkt_tab to
[nltk_data]     /Users/federicosvendsen/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


Bible_KJV.txt


 38%|███▊      | 11240/29928 [00:00<00:00, 57226.47it/s]

original title: ﻿The Project Gutenberg EBook of The King James Bible This eBook is for the use of anyone anywhere at no cost and with almost no restrictions whatsoever.
original title: You may copy it, give it away or re-use it under the terms of the Project Gutenberg License included with this eBook or online at www.gutenberg.org Title: The King James Bible Release Date: March 2, 2011 [EBook #10] [This King James Bible was orginally posted by Project Gutenberg in late 1989] Language: English *** START OF THIS PROJECT GUTENBERG EBOOK THE KING JAMES BIBLE *** The Old Testament of the King James Version of the Bible The First Book of Moses: Called Genesis 1:1 In the beginning God created the heaven and the earth.
original title: 1:2 And the earth was without form, and void; and darkness was upon the face of the deep.
original title: And the Spirit of God moved upon the face of the waters.
original title: 1:3 And God said, Let there be light: and there was light.
original title: 1:4 And G

100%|██████████| 29928/29928 [00:00<00:00, 67438.21it/s]

original title: 1:15 And when ye spread forth your hands, will hide mine eyes from you: yea, when ye make many prayers, will not hear: your hands are full of blood.
original title: 1:16 Wash you, make you clean; put away the evil of your doings from before mine eyes; cease to do evil; 1:17 Learn to do well; seek judgment, relieve the oppressed, judge the fatherless, plead for the widow.
original title: 1:18 Come now, and let us reason together, saith the LORD: though your sins be as scarlet, they shall be as white as snow; though they be red like crimson, they shall be as wool.
original title: 1:19 If ye be willing and obedient, ye shall eat the good of the land: 1:20 But if ye refuse and rebel, ye shall be devoured with the sword: for the mouth of the LORD hath spoken it.
original title: 1:21 How is the faithful city become an harlot!
original title: it was full of judgment; righteousness lodged in it; but now murderers.
original title: 1:22 Thy silver is become dross, thy wine mixed 


100%|██████████| 8544/8544 [00:00<00:00, 69757.83it/s]


original title: Harry Potter and the Prisoner of Azkaban by J.K. Rowling CHAPTER ONE OWL POST Harry Potter was highly unusual boy in many ways.
original title: For one thing, he hated the summer holidays more than any other time of year.
original title: For another, he really wanted to do his homework but was forced to do it in secret, in the dead of night.
original title: And he also happened to be wizard.
original title: It was nearly midnight, and he was lying on his stomach in bed, the blankets drawn right over his head like tent, flashlight in one hand and large leather-bound book (A History of Magic by Bathilda Bagshot) propped open against the pillow.
original title: Harry moved the tip of his eagle-feather quill down the page, frowning as he looked for something that would help him write his essay, "Witch Burning in the Fourteenth Century Was Completely Pointless discuss."
original title: The quill paused at the top of likely-looking paragraph.
original title: Harry Pushed his 

100%|██████████| 15046/15046 [00:00<00:00, 81462.35it/s]


original title: Harry Potter and the Goblet of Fire by J.K. Rowling THIS E-TEXT WAS NOT PRODUCED FOR PROFIT AND IS NOT FOR SALE.
original title: we all know this is copyright protected book....blah, blah, blah.
original title: no reproduction by any means...blah, blah, blah.
original title: enjoy.
original title: To Peter Rowling.
original title: In Memory of Mr. Ridley.
original title: And to Susan Sladden.
original title: Who Helped Harry Out of His Cupboard.
original title: CONTENTS ONE The Riddle House TWO The Scar 16 THREE The Invitation 26 FOUR Back to the Burrow 39 FIVE Weasleys' Wizard Wheezes 51 SIX The Portkey 65 SEVEN Bagman and Crouch 75 EIGHT The Quidditch World Cup 95 NINE The Dark Mark 117 TEN Mayhem at the Ministry 145 ELEVEN Aboard the Hogwarts Express 158 TWELVE The Triwizard Tournament 171 THIRTEEN Mad-Eye Moody 193 FOURTEEN The Unforgivable Curses 209 FIFTEEN Beauxbatons and Durmstrang 228 SIXTEEN The Goblet of Fire 248 SEVENTEEN The Four Champions -272 EIGHTEEN The

100%|██████████| 6382/6382 [00:00<00:00, 138672.28it/s]


original title: Harry Potter and the Sorcerer's Stone CHAPTER ONE THE BOY WHO LIVED Mr. and Mrs. Dursley, of number four, Privet Drive, were proud to say that they were perfectly normal, thank you very much.
original title: They were the last people you'd expect to be involved in anything strange or mysterious, because they just didn't hold with such nonsense.
original title: Mr. Dursley was the director of firm called Grunnings, which made drills.
original title: He was big, beefy man with hardly any neck, although he did have very large mustache.
original title: Mrs. Dursley was thin and blonde and had nearly twice the usual amount of neck, which came in very useful as she spent so much of her time craning over garden fences, spying on the neighbors.
original title: The Dursleys had small son called Dudley and in their opinion there was no finer boy anywhere.
original title: The Dursleys had everything they wanted, but they also had secret, and their greatest fear was that somebody w

  0%|          | 0/6584 [00:00<?, ?it/s]

original title: HARRY POTTER AND THE CHAMBER OF SECRETS by J. K. Rowling (this is BOOK in the Harry Potter series) Original Scanned/OCR: Friday, April 07, 2000 v1.0 (edit where needed, change version number by 0.1) RR THE WORST BIRTHDAY Not for the first time, an argument had broken out over breakfast at number four, Privet Drive.
original title: Mr. Vernon Dursley had been woken in the early hours of the morning by loud, hooting noise from his nephew Harry's room.
original title: "Third time this week!"
original title: he roared across the table.
original title: "If you can't control that owl, it'll have to go!"
original title: Harry tried, yet again, to explain.
original title: "She's bored," he said.
original title: "She's used to flying around outside.
original title: If could just let her out at night -" "Do look stupid?"
original title: snarled Uncle Vernon, bit of fried egg dangling from his bushy mustache.
original title: "I know what'll happen if that owl's let out."
original 

100%|██████████| 6584/6584 [00:00<00:00, 88531.57it/s]


original title: "Alone in the Chamber of Secrets, forsaken by his friends, defeated at last by the Dark Lord he so unwisely challenged.
original title: You'll be back with your dear Mudblood mother soon, Harry... She bought you twelve years of borrowed time ... but Lord Voldemort got you in the end, as you knew he must ."
original title: If this is dying, thought Harry, it's not so bad.
original title: Even the pain was leaving him ....
original title: But was this dying?
original title: Instead of going black, the Chamber seemed to be coming back into focus.
original title: Harry gave his head little shake and there was Fawkes, still resting his head on Harry's arm.
original title: pearly patch of tears was shining all around the wound -- except that there was no wound "Get away, bird," said Riddle's voice suddenly.
original title: "Get away from him said, get away --" Harry raised his head.
original title: Riddle was pointing Harry's wand at Fawkes; there was bang like gun, and Fawk




In [49]:
train_sentences[1]

['You',
 'may',
 'copy',
 'it',
 'give',
 'it',
 'away',
 'or',
 'reuse',
 'it',
 'under',
 'the',
 'terms',
 'of',
 'the',
 'Project',
 'Gutenberg',
 'License',
 'included',
 'with',
 'this',
 'eBook',
 'or',
 'online',
 'at',
 'wwwgutenbergorg',
 'Title',
 'The',
 'King',
 'James',
 'Bible',
 'Release',
 'Date',
 'March',
 'EBook',
 'This',
 'King',
 'James',
 'Bible',
 'was',
 'orginally',
 'posted',
 'by',
 'Project',
 'Gutenberg',
 'in',
 'late',
 'Language',
 'English',
 'START',
 'OF',
 'THIS',
 'PROJECT',
 'GUTENBERG',
 'EBOOK',
 'THE',
 'KING',
 'JAMES',
 'BIBLE',
 'The',
 'Old',
 'Testament',
 'of',
 'the',
 'King',
 'James',
 'Version',
 'of',
 'the',
 'Bible',
 'The',
 'First',
 'Book',
 'of',
 'Moses',
 'Called',
 'Genesis',
 'In',
 'the',
 'beginning',
 'God',
 'created',
 'the',
 'heaven',
 'and',
 'the',
 'earth']

In [None]:
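model = Word2Vec(sentences=train_sentences,
                 sg=0,             # 0 = CBOW, 1 = skip-gram
                 vector_size=400,  # dimensionality of each word vector
                 workers=4,        # worker threads used for training
                 epochs=6)         # passes over the corpus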

In [None]:
len(model.wv.get_vector('Harry'))

400

In [None]:
model.wv.most_similar('Harry')

[('Hagrid', 0.8427090048789978),
 ('Neville', 0.820137083530426),
 ('Snape', 0.8054292798042297),
 ('Malfoy', 0.7667227983474731),
 ('Wood', 0.7633455991744995),
 ('Cedric', 0.7618286609649658),
 ('Black', 0.7527939081192017),
 ('Hermione', 0.7520601749420166),
 ('Moody', 0.7496966123580933),
 ('Krum', 0.7389581799507141)]

In [None]:
model.wv.most_similar(positive=['Harry', 'Gryffindor'], negative=['Slytherin'])

[('Hagrid', 0.7553294897079468),
 ('Snape', 0.7142441272735596),
 ('Neville', 0.7079694867134094),
 ('Ron', 0.6929983496665955),
 ('Hermione', 0.6885701417922974),
 ('Crookshanks', 0.6822879314422607),
 ('Wood', 0.6807096004486084),
 ('Moody', 0.6597278714179993),
 ('Malfoy', 0.6448456048965454),
 ('Krum', 0.6448404788970947)]

In [None]:
word1 = 'Harry'
word2 = 'Muggle'
word3 = 'Malfoy'

vector1 = model.wv[word1]
vector2 = model.wv[word2]
vector3 = model.wv[word3]

result_vector = vector1 + vector2 - vector3

# Get Similar words to the result
similar_word = model.wv.most_similar(positive=[result_vector], topn=8)

print(f"Closest words to the result: {similar_word}")

Closest words to the result: [('Harry', 0.7543421983718872), ('just', 0.5745161771774292), ('everyone', 0.5583800077438354), ('hed', 0.5560871362686157), ('someone', 0.5504848957061768), ('Dursleys', 0.5210647583007812), ('something', 0.5207640528678894), ('anyone', 0.5168615579605103)]
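
Notice that `Harry` itself tops the list. When the query is a raw vector, the search has no way of knowing which words went into it, so the input words stay in the ranking. As far as we can tell, gensim only leaves the input words out when they are passed by name, as in the earlier `positive=[...]`, `negative=[...]` call. The toy sketch below reproduces the effect with made-up two-dimensional vectors; every name and value in it is hypothetical.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Made-up vectors, only for illustration.
toy = {
    "alpha": [1.0, 0.2],
    "beta":  [0.3, 0.5],
    "gamma": [0.2, 0.4],
    "delta": [0.9, 0.4],
}

def nearest(query, vectors, exclude=()):
    """Rank words by cosine similarity to query, skipping the excluded ones."""
    scored = [(word, cosine(query, vec)) for word, vec in vectors.items()
              if word not in exclude]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# alpha + beta - gamma, computed component by component
query = [a + b - c for a, b, c in zip(toy["alpha"], toy["beta"], toy["gamma"])]

print(nearest(query, toy)[0][0])                                     # alpha
print(nearest(query, toy, exclude={"alpha", "beta", "gamma"})[0][0]) # delta
```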


## Clustering!

In [None]:
%pip install scikit-learn

Note: you may need to restart the kernel to use updated packages.


In [None]:
from sklearn.cluster import KMeans


word_vectors = model.wv.vectors
kmeans = KMeans(n_clusters=50)  # Number of clusters
kmeans.fit(word_vectors)
cluster_assignments = kmeans.labels_

# Link words to their clusters
word_clusters = {}
for word, cluster in zip(model.wv.index_to_key, cluster_assignments):
    if cluster not in word_clusters:
        word_clusters[cluster] = []
    word_clusters[cluster].append(word)

# Print the words of each cluster
for cluster, words in word_clusters.items():
    print(f"Cluster {cluster + 1}:")
    print(", ".join(words))

Cluster 20:
the, earth, thereof, heaven, water, sword, light, whole, field, ground, midst, world, above, stone, waters, tree, part, trees, places, ark, darkness, fruit, sun, beast, mountains, wind, beasts, gates, mountain, walls, heavens, rain, rock, forest, cloud, edge, dust, shadow, branches, windows, hill, rivers, sides
Cluster 4:
and, of, with, great, without, both, between, young, strong, mighty, having, One, within, therein, filled, manner, money, unclean, become, slain, joy, offered, clean, books, cattle, ones, desolate, houses, nation, cup, service, sanctuary, wives, wickedness, bones, spoil, food, sweet, kind, charge, shame, enemy, flock, image, wherewith, company, idols, corn, teachers, portion, consumed, riches, anointed, saints, everlasting, angels, famine, stead, wash, rich, increase, destruction, tents, reproach, rod, ass, journey, strangers, sacrifices, measure, atonement, scattered, flocks, grave, judges, stars, lamb, souls, order, testimony, abomination, wands, prey, a
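
The word-to-cluster bookkeeping in the cell above can also be written with `collections.defaultdict`, which removes the explicit "is this cluster already in the dictionary" check. A minimal sketch with made-up words and labels (both lists are hypothetical, not the output of the model):

```python
from collections import defaultdict

# Made-up words and cluster labels, only for illustration.
words = ["earth", "heaven", "and", "water", "of"]
labels = [19, 19, 3, 19, 3]

def group_by_cluster(words, labels):
    """Map each cluster label to the list of words assigned to it."""
    clusters = defaultdict(list)
    for word, label in zip(words, labels):
        clusters[int(label)].append(word)  # int() also accepts NumPy integers
    return dict(clusters)

print(group_by_cluster(words, labels))
# {19: ['earth', 'heaven', 'water'], 3: ['and', 'of']}
```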

## Who does not belong here?

In [None]:
model.wv.doesnt_match(['Gryffindor', 'Slytherin', 'Hufflepuff', 'Ravenclaw', 'Voldemort'])

'Voldemort'

In [None]:
model.wv.doesnt_match(['Harry', 'Hermione', 'Ron', 'Crookshanks','Malfoy'])

'Crookshanks'

In [None]:
model.wv.doesnt_match(['Harry', 'Ron', 'Fred', 'George', 'Ginny'])

'Harry'
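
How is the odd one picked? Our understanding, stated here as an assumption rather than a description of gensim's exact code, is that the word vectors are normalized to unit length and averaged, and the word least similar to that average is returned. A toy sketch of that idea with made-up two-dimensional vectors:

```python
import math

def unit(v):
    """Scale v to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def odd_one_out(words, vectors):
    """Return the word whose unit vector is least similar to the mean direction."""
    units = {w: unit(vectors[w]) for w in words}
    mean = unit([sum(col) / len(words) for col in zip(*units.values())])
    # for unit vectors, the dot product equals the cosine similarity
    return min(words, key=lambda w: sum(a * b for a, b in zip(units[w], mean)))

# Made-up vectors: three point roughly the same way, one does not.
toy = {"red": [1.0, 0.1], "green": [0.9, 0.2], "blue": [1.0, 0.0], "loud": [0.1, 1.0]}

print(odd_one_out(["red", "green", "blue", "loud"], toy))  # loud
```

Under this reading, a result like `'Harry'` in the last example only says that his vector sits farthest from the average of the group, for instance because the other four names tend to appear in similar contexts.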