# Practice 6: Word Embedding II: Now It's Personal

![we](https://miro.medium.com/v2/resize:fit:2000/1*SYiW1MUZul1NvL1kc1RxwQ.png)

## Objectives

- Extract a large amount of data
    - Wikipedia dumps
- Train vector representations using the `gensim` library
- Explore the models' methods
    - Vector operations
    - Extracting probabilities from the model
- Train vector representations with PyTorch (?)

## Recap

So far we have seen two approaches to representing words as vectors:

1. Count-based
2. Prediction-based

Among the representations obtained with the *prediction* approach, we saw several popular methods, such as:

- CBOW
- Skipgram
- GloVe
- fastText

We also mentioned that there are **hyperparameters** we can tune to obtain different results.

Some of them are:

- The size of the context window
- The size of the vectors
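
To make the role of the context window concrete, here is a minimal sketch in plain Python (not gensim's implementation) that lists the (center, context) pairs a window of a given size produces for one tokenized sentence. The helper name `context_pairs` and the toy sentence are made up for illustration.

```python
def context_pairs(tokens, window):
    """Yield (center, context) pairs for every token within `window` positions."""
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                yield center, tokens[j]

tokens = ["el", "gato", "come", "pescado"]
pairs_w1 = list(context_pairs(tokens, window=1))
pairs_w2 = list(context_pairs(tokens, window=2))

# A larger window produces more pairs, including more distant ones
print(len(pairs_w1), len(pairs_w2))
print(("el", "come") in pairs_w1, ("el", "come") in pairs_w2)
```

A wider window tends to capture more topical relations, while a narrow one stays closer to syntactic neighbours; the vector size only changes how many dimensions each word gets.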



## Getting the data

![](https://data-and-the-world.onrender.com/posts/read-wikipedia-dump/dump_file_list.png)

We will work with part of the Spanish Wikipedia. We will use the [wikiextractor](https://github.com/attardi/wikiextractor) tool and download the data from https://dumps.wikimedia.org/eswiki/

In [1]:
!pip install numpy==1.24.4
!pip install wikiextractor



In [2]:
import urllib.request
from tqdm import tqdm

# Other chunks of the dump; only the uncommented URL is downloaded
# url = "https://dumps.wikimedia.org/eswiki/latest/eswiki-latest-pages-articles1.xml-p1p159400.bz2"
# url = "https://dumps.wikimedia.org/eswiki/latest/eswiki-latest-pages-articles2.xml-p159401p693323.bz2"
url = "https://dumps.wikimedia.org/eswiki/latest/eswiki-latest-pages-articles3.xml-p693324p1897740.bz2"
filename = "eswiki-articles3.bz2"

with tqdm(unit='B', unit_scale=True, unit_divisor=1024, miniters=1, desc=filename) as t:
    urllib.request.urlretrieve(url, filename, reporthook=lambda block_num, block_size, total_size: t.update(block_size))

eswiki-articles3.bz2: 498MB [02:01, 4.30MB/s]


In [3]:
%%time
!python -m wikiextractor.WikiExtractor "eswiki-articles3.bz2" --no-templates

INFO: Starting page extraction from eswiki-articles3.bz2.
INFO: Using 1 extract processes.
INFO: Extracted 100000 articles (651.3 art/s)
INFO: Extracted 200000 articles (650.1 art/s)
INFO: Extracted 300000 articles (738.5 art/s)
INFO: Finished 1-process extraction of 375709 articles in 543.3s (691.5 art/s)
CPU times: user 3.41 s, sys: 494 ms, total: 3.91 s
Wall time: 9min 3s


In [4]:
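import os

class WikiSentencesExtractor(object):
    """Stream the tokenized lines of a wikiextractor output directory."""

    def __init__(self, directory, max_lines):
        self.directory = directory
        self.max_lines = max_lines  # use -1 to read every line
        self.total_sentences = 0

    def get_sentences(self):
        # Reset the counter so the object can be iterated more than once
        # (gensim makes several passes over a streamed corpus)
        self.total_sentences = 0
        for subdir_letter in os.listdir(self.directory):
            file_path = os.path.join(self.directory, subdir_letter)
            for file_name in os.listdir(file_path):
                with open(os.path.join(file_path, file_name), encoding="utf-8") as file:
                    for line in file:
                        if self.max_lines == self.total_sentences:
                            return
                        words = line.split()
                        # Skip very short lines and wikiextractor markup lines such as <doc ...>
                        if len(words) <= 3 or words[0].startswith("<"):
                            continue
                        # Lowercase and strip surrounding ASCII punctuation from each token
                        yield [word.lower().strip(",.\"'()[]{}:;!?") for word in words]
                        self.total_sentences += 1

    def __iter__(self):
        return self.get_sentences()

    def __len__(self):
        return self.total_sentences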
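
The tokenization inside `get_sentences` is whitespace splitting plus `str.strip` with a fixed set of ASCII punctuation. The sketch below repeats that step on a made-up line (the helper `tokenize` and the sample text are illustrative, not part of the notebook) to show three consequences: punctuation inside a token survives, characters outside the set such as `«` and `»` are never removed, and a token made only of punctuation becomes an empty string.

```python
PUNCTUATION = ",.\"'()[]{}:;!?"

def tokenize(line):
    """Apply the same per-token normalization used in WikiSentencesExtractor."""
    return [word.lower().strip(PUNCTUATION) for word in line.split()]

tokens = tokenize("La «enfermedad» (crónica), desde el 7000 a.C. , continuó.")
print(tokens)
```

This is why tokens such as `'«enfermedad'` and `''` show up later in the vocabulary and in the nearest-neighbour lists; extending the punctuation set or filtering empty tokens would be one way to reduce that noise.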

In [6]:
directory = "drive/MyDrive/corpora/eswiki-dump-2"
os.listdir(directory)

['AA', 'AB', 'AC', 'AD', 'AE', 'AF']

In [14]:
%%time
sentences = WikiSentencesExtractor(directory, 3)

CPU times: user 8 µs, sys: 0 ns, total: 8 µs
Wall time: 11.9 µs


In [15]:
for sentence in sentences:
    print(sentence)

['el', 'comercio', 'transahariano', 'se', 'refiere', 'al', 'tráfico', 'de', 'mercancías', 'a', 'través', 'del', 'sahara', 'hasta', 'alcanzar', 'áfrica', 'subsahariana', 'desde', 'la', 'costa', 'del', 'norte', 'de', 'áfrica', 'europa', 'o', 'el', 'levante', 'si', 'bien', 'ha', 'existido', 'desde', 'tiempos', 'prehistóricos', 'el', 'apogeo', 'de', 'esta', 'ruta', 'comercial', 'se', 'produjo', 'entre', 'los', 'siglos', 'viii', 'hasta', 'el', 'xvi']
['desertificación', 'creciente', 'e', 'incentivo', 'económico']
['el', 'sahara', 'tuvo', 'una', 'vez', 'un', 'medio', 'ambiente', 'muy', 'diferente', 'en', 'las', 'actuales', 'libia', 'y', 'argelia', 'desde', 'al', 'menos', 'el', '7000', 'a.c', 'ya', 'existía', 'pastoreo', 'cuidado', 'de', 'ovejas', 'y', 'cabras', 'e', 'importantes', 'asentamientos', 'donde', 'se', 'trabajaba', 'la', 'cerámica', 'el', 'ganado', 'fue', 'introducido', 'en', 'el', 'sáhara', 'central', 'ahaggar', 'entre', 'los', 'años', '4000', 'y', '3500', 'a.c', 'pinturas', 'rupe

In [16]:
len(sentences)

3

In [17]:
from gensim.models import word2vec, FastText
import multiprocessing

In [21]:
multiprocessing.cpu_count()

2

In [22]:
!cat /proc/cpuinfo

processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 79
model name	: Intel(R) Xeon(R) CPU @ 2.20GHz
stepping	: 0
microcode	: 0xffffffff
cpu MHz		: 2199.998
cache size	: 56320 KB
physical id	: 0
siblings	: 2
core id		: 0
cpu cores	: 1
apicid		: 0
initial apicid	: 0
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx smap xsaveopt arat md_clear arch_capabilities
bugs		: cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs taa mmio_stale_data retbleed
bogomips	: 4399.99
clflush size	: 64
cache_alignment	: 64
addres

In [23]:
model_name = "drive/MyDrive/models/eswiki-pico.model"
try:
    print(f"Searching for model {model_name}")
    model = word2vec.Word2Vec.load(model_name)
    print("Model found!!!")
except Exception as e:
    print(f"Modelo {model_name} not found. Train a new one")
    sentences = WikiSentencesExtractor(directory, max_lines=5000)
    model = word2vec.Word2Vec(
        list(sentences),
        vector_size=300,
        window=5,
        workers=multiprocessing.cpu_count()
        )
model.save(model_name)

Searching for model drive/MyDrive/models/eswiki-pico.model
Modelo drive/MyDrive/models/eswiki-pico.model not found. Train a new one


In [25]:
from enum import Enum

class Algorithms(Enum):
    CBOW = "CBOW"
    SKIP_GRAM = "SKIP_GRAM"
    FAST_TEXT = "FAST_TEXT"

In [26]:
def load_model(model_path):
    try:
        return word2vec.Word2Vec.load(model_path)
    except Exception:  # e.g. the model file does not exist yet
        print(f"[WARN] Model not found in path {model_path}")
        return None

In [28]:
def train_model(sentences, model_name, vector_size: int, window=5, workers=2, algorithm = Algorithms.CBOW):
    model_name_params = f"{model_name}-vs{vector_size}-w{window}-{algorithm.value}.model"
    model_path = f"drive/MyDrive/models/{model_name_params}"
    # Load once and reuse the result instead of reading the file twice
    existing_model = load_model(model_path)
    if existing_model is not None:
        print(f"Already exists the model {model_path}")
        return existing_model
    print(f"TRAINING: {model_path}")
    if algorithm in [Algorithms.CBOW, Algorithms.SKIP_GRAM]:
        algorithm_number = 1 if algorithm == Algorithms.SKIP_GRAM else 0
        model = word2vec.Word2Vec(
            sentences,
            vector_size=vector_size,
            window=window,
            workers=workers,
            sg = algorithm_number,
            seed=42,
            )
    elif algorithm == Algorithms.FAST_TEXT:
        model = FastText(sentences=sentences, vector_size=vector_size, window=window, workers=workers, seed=42, epochs=100)
    else:
        print("[ERROR] algorithm not implemented yet :p")
        return
    model.save(model_path)
    return model

In [29]:
def report_stats(model):
    # NOTE: gensim's `corpus_count` is the number of sentences seen during training, not word tokens
    print("Number of words in the corpus used for training the model: ", model.corpus_count)
    print("Number of words in the model: ", len(model.wv.index_to_key))
    print("Time [s], required for training the model: ", model.total_train_time)
    print("Count of trainings performed to generate this model: ", model.train_count)
    print("Length of the word2vec vectors: ", model.vector_size)
    print("Applied context length for generating the model: ", model.window)

In [30]:
%%time
cbow_100 = train_model(
    WikiSentencesExtractor(directory, -1),
    "eswiki-large",
    100,
    2,
    workers=24,
    algorithm=Algorithms.CBOW
    )

Already exists the model drive/MyDrive/models/eswiki-large-vs100-w2-CBOW.model
CPU times: user 5.02 s, sys: 298 ms, total: 5.32 s
Wall time: 9.82 s


In [31]:
report_stats(cbow_100)

Number of words in the corpus used for training the model:  1072961
Number of words in the model:  211340
Time [s], required for training the model:  160.76787760099887
Count of trainings performed to generate this model:  1
Length of the word2vec vectors:  100
Applied context length for generating the model:  2


In [107]:
%%time
cbow_500 = train_model(WikiSentencesExtractor(directory, -1), "eswiki-large", 500, 6, workers=24, algorithm=Algorithms.CBOW)

Already exists the model drive/MyDrive/models/eswiki-large-vs500-w6-CBOW.model
CPU times: user 4.85 s, sys: 2.42 s, total: 7.27 s
Wall time: 26.2 s


In [None]:
report_stats(cbow_500)

In [42]:
%%time
skip_gram_500 = train_model(WikiSentencesExtractor(directory, -1), "eswiki-large", 500, 6, workers=24, algorithm=Algorithms.SKIP_GRAM)

Already exists the model drive/MyDrive/models/eswiki-large-vs500-w6-SKIP_GRAM.model
CPU times: user 6.96 s, sys: 2.91 s, total: 9.87 s
Wall time: 29.2 s


In [43]:
report_stats(skip_gram_500)

Number of words in the corpus used for training the model:  1072961
Number of words in the model:  211340
Time [s], required for training the model:  383.3038097620006
Count of trainings performed to generate this model:  1
Length of the word2vec vectors:  500
Applied context length for generating the model:  6


fastText takes the morphological structure of words into account, which traditional Word2Vec models ignore.

To do so, fastText treats each word as a collection of sub-tokens, which for simplicity are usually computed as the character n-grams of the word.

Source: https://radimrehurek.com/gensim/auto_examples/tutorials/run_fasttext.html#fasttext-model
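
As an illustration of the sub-token idea, the following sketch extracts character n-grams with `<` and `>` as word-boundary markers. The function name `char_ngrams` is hypothetical, and the 3 to 6 range is assumed to match the usual fastText defaults; the real model also hashes the n-grams into buckets and keeps a vector for the whole word, which is not shown here.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the set of character n-grams of `word`, with boundary markers."""
    marked = f"<{word}>"
    grams = set()
    for n in range(min_n, max_n + 1):
        for i in range(len(marked) - n + 1):
            grams.add(marked[i:i + n])
    return grams

a = char_ngrams("noche")
b = char_ngrams("noches")
shared = a & b

print(sorted(shared))
# Jaccard overlap between the two n-gram sets
print(len(shared) / len(a | b))
```

Because related word forms share many n-grams, their vectors share components, and an out-of-vocabulary word can still get a vector from the n-grams it has in common with known words.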

In [74]:
%%time
fastext_600 = train_model(
    WikiSentencesExtractor(directory, -1),
    "eswiki-medium",
    600,
    6,
    workers=24,
    algorithm=Algorithms.FAST_TEXT
    )

Already exists the model drive/MyDrive/models/eswiki-medium-vs600-w6-FAST_TEXT.model
CPU times: user 28.7 s, sys: 18 s, total: 46.7 s
Wall time: 2min 51s


In [75]:
report_stats(fastext_600)

Number of words in the corpus used for training the model:  500000
Number of words in the model:  135919
Time [s], required for training the model:  1.130548065979383
Count of trainings performed to generate this model:  1
Length of the word2vec vectors:  600
Applied context length for generating the model:  6


## Operations with the trained vectors

We will look at common vector operations. The results depend on the model currently loaded in memory.

In [76]:
model = fastext_600

In [77]:
for index, word in enumerate(model.wv.index_to_key):
    if index == 100:
        break
    print(f"word #{index}/{len(model.wv.index_to_key)} is {word}")

word #0/135919 is de
word #1/135919 is la
word #2/135919 is en
word #3/135919 is el
word #4/135919 is y
word #5/135919 is que
word #6/135919 is a
word #7/135919 is los
word #8/135919 is del
word #9/135919 is se
word #10/135919 is las
word #11/135919 is por
word #12/135919 is un
word #13/135919 is con
word #14/135919 is una
word #15/135919 is su
word #16/135919 is como
word #17/135919 is para
word #18/135919 is es
word #19/135919 is al
word #20/135919 is más
word #21/135919 is no
word #22/135919 is o
word #23/135919 is fue
word #24/135919 is lo
word #25/135919 is sus
word #26/135919 is entre
word #27/135919 is también
word #28/135919 is este
word #29/135919 is son
word #30/135919 is esta
word #31/135919 is pero
word #32/135919 is años
word #33/135919 is dos
word #34/135919 is desde
word #35/135919 is durante
word #36/135919 is sobre
word #37/135919 is parte
word #38/135919 is hasta
word #39/135919 is sin
word #40/135919 is ser
word #41/135919 is ha
word #42/135919 is le
word #43/135919 

In [78]:
gato_vec = model.wv["gato"]
print(gato_vec[:10])
print(len(gato_vec))

[-2.2363242e-04  2.7691919e-04 -3.4900822e-04  8.6492975e-05
  2.4339599e-04  2.1716925e-04 -2.4556002e-04  2.2704665e-04
 -3.5443565e-05 -4.0150961e-05]
600


In [79]:
try:
    agustisidad_vec = model.wv["agusticidad"]
except KeyError:
    print("OOV found!")


In [82]:
agustisidad_vec[:10]
len(agustisidad_vec)

600

We can check how the similarity between words decreases:

In [83]:
word_pairs = [
    ("automóvil", "camión"),
    ("automóvil", "bicicleta"),
    ("automóvil", "cereal"),
    ("automóvil", "conde"),
]

for w1, w2 in word_pairs:
    print(f"{w1} - {w2} {model.wv.similarity(w1, w2)}")

automóvil - camión -0.014398100785911083
automóvil - bicicleta -0.03809265047311783
automóvil - cereal -0.03065969981253147
automóvil - conde -0.02661689557135105
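
`similarity` returns the cosine of the angle between two word vectors, so it ranges from -1 to 1. A minimal sketch of that computation in plain Python, on made-up three-dimensional vectors (the names and values are illustrative only):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

car = [1.0, 2.0, 0.0]
truck = [2.0, 4.0, 0.0]      # same direction as `car`
cereal = [0.0, 0.0, 3.0]     # orthogonal to `car`
opposite = [-1.0, -2.0, 0.0]  # opposite direction

sim_same = cosine_similarity(car, truck)
sim_orthogonal = cosine_similarity(car, cereal)
sim_opposite = cosine_similarity(car, opposite)
print(sim_same, sim_orthogonal, sim_opposite)
```

Values close to 0, like the ones printed above, mean the two vectors are nearly orthogonal, that is, the model sees no particular relation between the words.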


In [84]:
# king is to man as ___ is to woman
# london is to england as ____ is to wine
model.wv.most_similar(positive=['vida', 'enfermedad'], negative=['salud'])

[('«enfermedad', 0.4606187641620636),
 ('enfermedad»', 0.42763596773147583),
 ('enfermedades', 0.3955746293067932),
 ('vida»', 0.36319515109062195),
 ('vida’', 0.34369346499443054),
 ('ávida', 0.3128056824207306),
 ('vivida', 0.30903947353363037),
 ('vidas»', 0.30570918321609497),
 ('vida—', 0.3042830228805542),
 ('vida”', 0.3009272515773773)]
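
The idea behind `most_similar(positive=..., negative=...)` can be sketched as vector arithmetic followed by a cosine ranking. The version below is a simplification on hand-made 2D vectors: the `analogy` helper and the toy values are assumptions for illustration, and gensim's implementation differs in details such as how the input vectors are normalized.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Toy vectors: first axis ~ "royalty", second axis ~ "gender"
toy_vectors = {
    "rey": [1.0, 1.0],
    "reina": [1.0, -1.0],
    "hombre": [0.0, 1.0],
    "mujer": [0.0, -1.0],
    "manzana": [-1.0, 0.2],
}

def analogy(positive, negative, vectors):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for word in positive:
        query = [q + x for q, x in zip(query, vectors[word])]
    for word in negative:
        query = [q - x for q, x in zip(query, vectors[word])]
    exclude = set(positive) | set(negative)  # input words are not valid answers
    candidates = [(w, cosine(query, v)) for w, v in vectors.items() if w not in exclude]
    return sorted(candidates, key=lambda item: item[1], reverse=True)

result = analogy(positive=["rey", "mujer"], negative=["hombre"], vectors=toy_vectors)
print(result[0])
```

With real embeddings the query rarely lands exactly on a word, so the quality of the answer depends heavily on how well the model was trained and on how clean the vocabulary is.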

In [86]:
model.wv.doesnt_match(["disco", "música", "mantequilla", "cantante"])

'disco'

In [90]:
model.wv.similarity("noche", "noches")

0.6047445

In [91]:
model.wv.most_similar("nochecitas", topn=10)

[('noche', 0.4461252689361572),
 ('noches', 0.4133933484554291),
 ('citas', 0.41027557849884033),
 ('noche»', 0.3886655271053314),
 ('cabecitas', 0.38327473402023315),
 ('piedrecitas', 0.3609331548213959),
 ('escitas', 0.3572961688041687),
 ('noches»', 0.35429543256759644),
 ('ilícitas', 0.3270627558231354),
 ('lícitas', 0.3112875819206238)]

## Exploring model probabilities

In [92]:
directory = "drive/MyDrive/corpora/eswiki-dump"
sentences = WikiSentencesExtractor(directory, 10000)
model = word2vec.Word2Vec(
            list(sentences),
            vector_size=100,
            window=5,
            workers=2,
            sg = 1, # Skip-gram
            seed=42,
            hs=1, # Hierarchical softmax scheme
            negative=0 # Disabling negative samples
        )

`score()` returns the log probability of a sentence. It is only implemented for models trained with the *hierarchical softmax* scheme and with *negative sampling* disabled.

Source: https://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.Word2Vec.score

In [94]:
directory = "drive/MyDrive/corpora/eswiki-dump-2"
test_sentences = list(WikiSentencesExtractor(directory, 1000))
scores = model.score(test_sentences, total_sentences=1000, chunksize=100, queue_factor=2, report_delay=1)

In [96]:
len(scores)

1000

In [99]:
#sent = "esta es la primera vez que te vi".split()
sent = "this is my new book that can be used for corruption".split()

out_sent_score = model.score([sent], total_sentences=1)
(-1)*(out_sent_score / len(sent))

array([5.395258], dtype=float32)

In [100]:
len(scores)

1000

In [101]:
# Compute the (log) perplexity per sentence:
# divide the score by the sentence length (number of tokens)
# and multiply the result by -1
ppl_per_sent = [(-1)*(score/len(sent)) for score, sent in zip(scores, test_sentences)]

In [102]:
# Average the (log) perplexity over the test sentences
avg_ppl = sum(ppl_per_sent) / len(ppl_per_sent)
print("Average (log) perplexity of the test set:", avg_ppl)

Average (log) perplexity of the test set: 49.08702824289504


In [103]:
i = 0
for sent, score, ppl in zip(test_sentences, scores, ppl_per_sent):
    print(f"#{i} len={len(sent)} | score={score} | Log ppl={ppl} | Sentence: {sent} ")
    i += 1
    if i == 10: break


#0 len=50 | score=-2985.95703125 | Log ppl=59.719140625 | Sentence: ['el', 'comercio', 'transahariano', 'se', 'refiere', 'al', 'tráfico', 'de', 'mercancías', 'a', 'través', 'del', 'sahara', 'hasta', 'alcanzar', 'áfrica', 'subsahariana', 'desde', 'la', 'costa', 'del', 'norte', 'de', 'áfrica', 'europa', 'o', 'el', 'levante', 'si', 'bien', 'ha', 'existido', 'desde', 'tiempos', 'prehistóricos', 'el', 'apogeo', 'de', 'esta', 'ruta', 'comercial', 'se', 'produjo', 'entre', 'los', 'siglos', 'viii', 'hasta', 'el', 'xvi'] 
#1 len=5 | score=-46.414710998535156 | Log ppl=9.282942199707032 | Sentence: ['desertificación', 'creciente', 'e', 'incentivo', 'económico'] 
#2 len=84 | score=-5536.42041015625 | Log ppl=65.9097667875744 | Sentence: ['el', 'sahara', 'tuvo', 'una', 'vez', 'un', 'medio', 'ambiente', 'muy', 'diferente', 'en', 'las', 'actuales', 'libia', 'y', 'argelia', 'desde', 'al', 'menos', 'el', '7000', 'a.c', 'ya', 'existía', 'pastoreo', 'cuidado', 'de', 'ovejas', 'y', 'cabras', 'e', 'import
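
The quantity called log perplexity here is the negated score divided by the sentence length; exponentiating it gives a perplexity-like number. A small sketch in plain Python, using the score and length printed for sentences #0 and #1 above (the helper name `log_perplexity` is made up for illustration):

```python
import math

def log_perplexity(log_prob, num_tokens):
    """Negated log probability, normalized by the number of tokens."""
    return -log_prob / num_tokens

# (score, length) pairs copied from the output for sentences #0 and #1
examples = [(-2985.95703125, 50), (-46.414710998535156, 5)]

log_ppls = [log_perplexity(score, n) for score, n in examples]
perplexities = [math.exp(lp) for lp in log_ppls]

for lp, ppl in zip(log_ppls, perplexities):
    print(f"log ppl={lp:.4f} | ppl={ppl:.3e}")
```

As far as we can tell, gensim's skip-gram `score()` accumulates log probabilities over (center, context) pairs rather than over tokens, which would explain why longer sentences get larger values under this normalization. It is best read as a rough comparison between sentences rather than a standard perplexity.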

### Getting the probability of a context

In [104]:
import numpy as np

In [105]:
directory = "drive/MyDrive/corpora/eswiki-dump"
sentences = WikiSentencesExtractor(directory, 10000)
model_context = word2vec.Word2Vec(
            list(sentences),
            vector_size=100,
            window=5,
            workers=2,
            sg = 1, # Skip-gram
            seed=42,
            hs=0, # Hierarchical softmax disabled; gensim's default negative sampling is used instead
        )

In [119]:
sent = "el brocoli es de un sabor rico y es de color".split()
cbow_100.predict_output_word(sent, topn=5)

[('color', 0.000556874),
 ('sabor', 0.00040580358),
 ('pelaje', 0.00030448183),
 ('característico', 0.00024318612),
 ('cabello', 0.00023327101)]

In [120]:
# Non-standard way of computing perplexity
STEP = 1000
vocab_len = len(model.wv)
oovs_contexts = []
oovs_central_words = []
ppl_per_sentence = []
for sent in test_sentences:
    sent_len = len(sent)
    # Sentence too short to be considered
    if sent_len < 5:
        continue
    i = 0
    log_probs = []
    while i < (sent_len - 4):
        # Center word starts from 2 in sent
        pos = i + 2
        central_word = sent[pos]
        left_context = sent[pos-2:pos]
        right_context = sent[pos+1:pos+3]
        i += STEP
        full_context = left_context + right_context
        print(f"Central word={central_word}, context={full_context}")
        # Extracting probability of center word given the context
        # The output will be like ('pepito', 7.0062583e-07), ('juanito', 6.9261466e-07)
        prob_context = model_context.predict_output_word(full_context, topn=vocab_len)
        if prob_context is None:
            oovs_contexts.append(full_context)
        else:
            prob_context = {word: prob for word, prob in prob_context}
            if central_word in prob_context:
                # Calculate log probability of each central word in sentence
                # given its surrounding context
                log_probs.append(np.log(prob_context[central_word]))
            else:
                oovs_central_words.append(central_word)
    # Average once per sentence, after every window has been visited
    if len(log_probs) > 0:
        ppl = (-1) * (sum(log_probs) / len(log_probs))
        ppl_per_sentence.append(ppl)

Central word=transahariano, context=['el', 'comercio', 'se', 'refiere']
Central word=e, context=['desertificación', 'creciente', 'incentivo', 'económico']
Central word=tuvo, context=['el', 'sahara', 'una', 'vez']
Central word=el, context=['como', 'desierto', 'sahara', 'es']
Central word=en, context=['el', 'comercio', 'la', 'época']
Central word=prehistórico, context=['el', 'comercio', 'se', 'expandió']
Central word=a, context=['la', 'ruta', 'través', 'del']
Central word=comercial, context=['la', 'ruta', 'de', 'darb']
Central word=occidental, context=['la', 'más', 'de', 'las']
Central word=occidentales, context=['las', 'rutas', 'eran', 'la']
Central word=este, context=['hacia', 'el', 'las', 'tres']
Central word=como, context=['heródoto', 'mencionó', 'los', 'garamantes']
Central word=más, context=['la', 'evidencia', 'temprana', 'de']
Central word=en, context=['comercio', 'transahariano', 'la', 'edad']
Central word=del, context=['el', 'ascenso', 'imperio', 'de']
Central word=de, context=[

In [121]:
avg_log_ppl = sum(ppl_per_sentence) / len(ppl_per_sentence)
print("Average (log) perplexity of the test sentences:", avg_log_ppl)


In [122]:
len(oovs_central_words)

126

In [123]:
from collections import Counter
Counter(oovs_central_words)

Counter({'transahariano': 1,
         'prehistórico': 1,
         'ocasionó': 1,
         'bruce': 1,
         'clasificatorias': 1,
         'avanzó': 1,
         'procedía': 1,
         '–en': 2,
         'avisa': 1,
         'góngora': 1,
         'futbol': 1,
         'f.c': 1,
         'empezaría': 1,
         'descriptiva': 1,
         'vigués': 1,
         'anécdotas': 1,
         'ascendía': 1,
         '2005-06': 1,
         '2010/2011': 1,
         '2013/14': 1,
         'empareja': 1,
         'pontevedra': 1,
         'granate': 1,
         'industry': 1,
         'veneciano': 1,
         'alcoyano': 2,
         'andadura': 2,
         'temporada,1942/43': 1,
         'acabaría': 1,
         'travesía': 1,
         'preferente': 1,
         'ascender': 1,
         '2019-2020': 1,
         'debut': 1,
         'barbie': 1,
         'boecia': 1,
         'cúmulos': 1,
         'renovó': 1,
         'difracción': 1,
         'fresnel': 1,
         'térmico': 1,
         'térmi

## Training models with PyTorch

In [126]:
from collections import defaultdict
from collections.abc import Iterator
from itertools import chain

def vocabulary_factory():
    """Create a vocabulary that assigns a unique index to each new key.

    The default factory returns the current length of the dictionary,
    so every unseen key gets the next free index.

    Example:
    >>> vocab = vocabulary_factory()
    >>> vocab['test']
    0
    >>> vocab['other']
    1
    >>> vocab['test']
    0
    """
    vocab = defaultdict()
    vocab.default_factory = lambda: len(vocab)
    return vocab

def word_to_index(corpus: list[list[str]], vocab: defaultdict) -> Iterator[list[int]]:
    """Map each word in a corpus to its unique index, one sentence at a time."""
    for sent in corpus:
        yield [vocab[word] for word in sent]

def get_n_grams(indexed_sents: list[list[int]], n=2) -> chain:
    """Return a lazy iterator over the n-grams of every indexed sentence."""
    return chain(*[zip(*[sent[i:] for i in range(n)]) for sent in indexed_sents])

In [128]:
vocab = vocabulary_factory()
sentences = list(WikiSentencesExtractor(directory, 5000))
indexed_sents = list(word_to_index(sentences, vocab))

In [129]:
print(sentences[0])
print(indexed_sents[0])
print(vocab["andorra"])

['andorra', 'oficialmente', 'principado', 'de', 'andorra', '', 'es', 'un', 'micro-estado', 'soberano', 'sin', 'litoral', 'ubicado', 'en', 'el', 'suroeste', 'de', 'europa', 'entre', 'españa', 'y', 'francia', 'en', 'el', 'límite', 'de', 'la', 'península', 'ibérica', 'se', 'constituye', 'en', 'estado', 'independiente', 'de', 'derecho', 'democrático', 'y', 'social', 'cuya', 'forma', 'de', 'gobierno', 'es', 'el', 'coprincipado', 'parlamentario', 'su', 'territorio', 'está', 'organizado', 'en', 'siete', 'parroquias', 'con', 'una', 'población', 'total', 'de', '79', '877', 'habitantes', 'a', '28', 'de', 'febrero', 'de', '2022', 'su', 'capital', 'es', 'andorra', 'la', 'vieja']
[0, 1, 2, 3, 0, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 3, 15, 16, 17, 18, 19, 12, 13, 20, 3, 21, 22, 23, 24, 25, 12, 26, 27, 3, 28, 29, 18, 30, 31, 32, 3, 33, 5, 13, 34, 35, 36, 37, 38, 39, 12, 40, 41, 42, 43, 44, 45, 3, 46, 47, 48, 49, 50, 3, 51, 3, 52, 36, 53, 5, 0, 21, 54]
0
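
The indexing trick relies on `defaultdict` calling its factory only for missing keys, at a moment when the new key has not been inserted yet. A self-contained sketch on two toy sentences (the data is made up) shows the resulting indices and bigrams, plus one way to freeze the vocabulary afterwards:

```python
from collections import defaultdict
from itertools import chain

vocab = defaultdict()
vocab.default_factory = lambda: len(vocab)  # next free index for unseen words

sentences = [["el", "gato", "come"], ["el", "perro", "come", "carne"]]
indexed = [[vocab[word] for word in sent] for sent in sentences]

# Bigrams are consecutive index pairs inside each sentence
bigrams = list(chain(*[zip(sent, sent[1:]) for sent in indexed]))
print(indexed)
print(bigrams)

# Freezing: without a factory, unknown words raise KeyError instead of
# silently growing the vocabulary
vocab.default_factory = None
```

Freezing matters once the network has been sized with `len(vocab)`: any later lookup of an unseen word would otherwise create an index with no corresponding embedding row.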


In [130]:
# Materialize the iterator: a bare `chain` would be exhausted after the first epoch
bigrams = list(get_n_grams(indexed_sents, n=2))

In [131]:
import torch
import torch.nn as nn

In [132]:
# We need the vocabulary size
N = len(vocab)

In [None]:
N

PyTorch already provides embedding layers, so we only need to state that we want a layer of this type.

In [133]:
# Network definition
# No Softmax layer at the end: nn.CrossEntropyLoss applies log-softmax itself and expects raw scores
network = nn.Sequential(nn.Embedding(N, 2), nn.Linear(2, N, bias=False))

In [134]:
# Definition of the risk (loss) function and the optimizer
risk = nn.CrossEntropyLoss()
optimizer = torch.optim.Adagrad(network.parameters(), lr=0.1)
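
`nn.CrossEntropyLoss` applies log-softmax to its input itself, so it expects raw scores (logits). The sketch below, in plain Python with made-up scores, shows what the loss computes and how feeding it already-normalized probabilities flattens the result; the helper names are illustrative.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(scores, target):
    """Negative log of the softmax probability assigned to `target`."""
    return -math.log(softmax(scores)[target])

logits = [4.0, 0.0, -2.0]  # the model strongly prefers class 0

loss_from_logits = cross_entropy(logits, target=0)
loss_double_softmax = cross_entropy(softmax(logits), target=0)
print(loss_from_logits, loss_double_softmax)
```

In the second case the inputs are squeezed into the interval from 0 to 1 before the loss normalizes them again, so the loss stays high even for a confident, correct prediction and the gradients shrink.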

In [None]:
%%time
total_risk = []
epochs = 10
for i in range(0, epochs):
    print(f"Starting iteration #{i}")
    # Accumulated risk for this iteration
    risk_iter = 0
    for bigram in bigrams:
        optimizer.zero_grad()

        # Forward pass and loss computation
        output = network(torch.tensor([bigram[0]]))
        loss = risk(output, torch.LongTensor([bigram[1]]))

        # Backpropagation
        loss.backward()
        optimizer.step()

        # Accumulate the risk as a plain Python float
        risk_iter += loss.item()

    # Store the risk of this epoch
    total_risk.append(risk_iter)
    # Print epoch information
    print(f"End of iteration #{i}. Risk: {risk_iter}")

## Practice 6.2: Dimensionality reduction

**Due date: November 12, 11:59 pm**



There are several methods we can apply to reduce the dimensionality of our vectors and thus visualize, in a lower-dimensional space, how the vectors are being represented.

- PCA
- t-SNE
- SVD

- Choose a pre-trained model seen in class, or train your own, and load it into memory
  - [Folder with the models 📕](https://drive.google.com/drive/folders/1reor2FGsfOB6m3AvfCE16NOHltAFjuvz?usp=sharing)
- Apply the 3 dimensionality reduction algorithms
    - Reduce to 2D
    - Plot 100 randomly chosen vectors
        - The same 100 vectors must be plotted in all three cases
    - Analyze and compare the topologies produced by each algorithm

**NOTE:** The same NumPy version (`1.24.4`) is required to use the models seen in class

### References

- Parts of the code used in this notebook were taken from the work of [Dr. Ximena Gutierrez-Vasques](https://github.com/ximenina/) and [Dr. Victor Mijangos](https://github.com/VMijangos/LinguisticaComputacional/blob/main/Notebooks/19%20Word2Vec.ipynb)
- [Corpus streaming on gensim](https://radimrehurek.com/gensim/auto_examples/core/run_corpora_and_vector_spaces.html#corpus-streaming-one-document-at-a-time)
- [Gensim docs](https://radimrehurek.com/gensim/auto_examples/index.html)