# FastText FNTR

In this notebook we will see how to fine-tune an already pretrained FastText model using the **Gensim API**.
To accomplish this task we will:

<li> Load a file with Spanish text produced by my own web-scraping script.
<li> Lightly preprocess this text so it can be fed into the model.
<li> Fine-tune a pretrained model using callbacks.
<li> Check whether the model has learnt anything.
    
Some tests have already been performed to find the proper way to do this. In previous experiments we observed that the result is strongly influenced by stopwords, and that the **pretrained fasttext model contains stopwords and accent marks and was lowercased beforehand**. We will remove both stopwords and accent marks from the new text so as not to introduce much noise into the model.

## Preprocess and Load Text to fine-tune the model

In [2]:
import nltk
from nltk.corpus import stopwords

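# Spanish stopwords as a set for fast lookup; built once instead of on every call.
# Requires the NLTK corpus: nltk.download('stopwords')
SWS = set(stopwords.words('spanish'))

def c_ac(word: str) -> str:
    # Strip acute accents from lowercase vowels and drop commas
    return word.replace('á', 'a').replace('é', 'e').replace('í', 'i').replace('ó', 'o').replace('ú', 'u').replace(',', '')

def clean_acmarks_and_sws(frase: str) -> list:
    # Split on single spaces, drop stopwords, then clean the remaining tokens
    return [c_ac(word) for word in frase.split(' ') if word not in SWS]
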
clean_acmarks_and_sws('aaááé eeeiíí ooóó a ante el de la amigo, entr')

['aaaae', 'eeeiii', 'oooo', 'amigo', 'entr']
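
The chained `.replace` calls in `c_ac` only cover lowercase accented vowels. As a minimal sketch of a table-driven alternative, assuming that uppercase accented vowels and `ü` should be folded as well, that `ñ` should be kept, and that tokens should be lowercased to match the pretrained model (`ACCENT_TABLE` and `strip_accents` are illustrative names, not part of the original notebook):

```python
import unicodedata

# Hypothetical translation table: fold accented vowels, leave 'ñ' untouched.
ACCENT_TABLE = str.maketrans('áéíóúüÁÉÍÓÚÜ', 'aeiouuAEIOUU')

def strip_accents(word: str) -> str:
    """Lowercase a token and fold accented vowels to their plain form."""
    # NFC makes sure accented letters are single code points before translating
    composed = unicodedata.normalize('NFC', word)
    return composed.translate(ACCENT_TABLE).lower()

print(strip_accents('Compañía'))
```

Unlike `c_ac`, this sketch does not remove commas; punctuation would have to be handled separately.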

In [3]:
import os

path = os.path.join(os.getcwd(), 'FastText_FINE', 'mercados_out.txt')

# Use a separate name for the file handle so the path is not shadowed
with open(path, 'r', encoding='latin-1') as f:
    txt = f.read()
    
txt_split_tokenized = [clean_acmarks_and_sws(frase) for frase in txt.split('. ')]
txt_split_tokenized = txt_split_tokenized[:18000]
print(len(txt_split_tokenized), txt_split_tokenized[:5])

18000 [['llega', 'verano', 'vacaciones', 'el', 'dividendos'], ['junio', 'julio', 'muchas', 'cotizadas', 'españolas', 'optan', 'premiar', 'fidelidad', 'accionistas', 'mediante', 'pago', 'cupon'], ['tras', 'verano', 'atipico', 'vivio', '2020', 'culpa', 'crisis', 'covid-19', 'llevo', 'muchas', 'compañias', 'suspender', 'pagos', 'recortarlos', 'muchas', 'variar', 'calendarios', 'repartos', 'verano', 'cotizadas', 'haran', 'esfuerzo', 'ir', 'recuperando', 'normalidad', 'politicas', 'retribucion', 'accionista', 'medida', 'ido', 'reactivando', 'ingresos', 'beneficios.en', 'proximas', 'semanas', 'veintena', 'compañias', 'españolas', 'repartiran', '5.000', 'millones', 'euros', 'dividendos'], ['cifra', 'añadir', 'desembolsados', 'mes', 'junio', 'superan', '2.000', 'millones', 'gracias', 'parte', '1.000', 'millones', 'repartidos', 'telefonica', 'ultimo', 'dividendo', 'desembolsos', 'realizados', 'caixabank', 'hizo', 'primer', 'pago', 'tras', 'integracion', 'bankia', 'grifols', 'acerinox.en', 'prox
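
The output above contains tokens such as `beneficios.en`, where a sentence boundary had no space after the period, so `txt.split('. ')` did not split there. A minimal sketch of a regex-based splitter that could handle this case, assuming that a period between two letters is always a sentence boundary (which would wrongly split abbreviations such as `S.A.`); `split_sentences` is a hypothetical helper, not part of the original notebook:

```python
import re

# Split on a period followed by whitespace, or on a period squeezed between
# two letters ("beneficios.en"). Periods between digits ("5.000") are kept.
_SENTENCE_BOUNDARY = re.compile(r'\.\s+|(?<=[^\W\d_])\.(?=[^\W\d_])')

def split_sentences(text: str) -> list:
    """Return the non-empty, stripped sentences found in `text`."""
    return [s.strip() for s in _SENTENCE_BOUNDARY.split(text) if s.strip()]

sample = 'ingresos y beneficios.en proximas semanas repartiran 5.000 millones. otra frase'
print(split_sentences(sample))
```

A period at the very end of the text is left attached to the last sentence and would still need to be stripped.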

## Load pretrained fasttext model

In [4]:
from gensim.models.callbacks import CallbackAny2Vec

class EpochLogger(CallbackAny2Vec):
    '''Callback to log information about training'''
    def __init__(self):
        self.epoch = 0
    def on_epoch_begin(self, model):
        if self.epoch % 100 == 0:
            print("Epoch #{} start".format(self.epoch))
    def on_epoch_end(self, model):
        if self.epoch % 100 == 0:
            print("Epoch #{} end".format(self.epoch))
        self.epoch += 1

Initial check of how FastText "thinks":

In [5]:
# from gensim.models.wrappers import FastText
from gensim.models.fasttext import FastText
import gensim

fasttext = gensim.models.fasttext.load_facebook_model(r'C:\Users\f.gonzalez\Desktop\FastText_FINE\wiki.es.bin')

Let us study the initial results of the model.

In [6]:
for word in ['mercado', 'bolsa', 'Bolsa', 'banco', 'Banco', 'cárcel', 'empresa']:
    print(word)
    print(fasttext.wv.most_similar(word)[:7])
    print(" ")

mercado
[('mercado,', 0.7775014042854309), ('mercados', 0.7754110097885132), ('posmercado', 0.7551040649414062), ('#mercado', 0.7441679239273071), ('‘mercado', 0.7340843677520752), ('mercada', 0.7078138589859009), ('promercado', 0.7052767276763916)]
 
bolsa
[('bolsas', 0.7388685941696167), ('bolsada', 0.6469042301177979), ('bolsa»', 0.6271626353263855), ('cotiza', 0.6204637289047241), ('bursátil', 0.6049832701683044), ('embolsa', 0.6042190790176392), ('nasdaq', 0.5938200354576111)]
 
Bolsa
[('olsa', 0.818496823310852), ('colsa', 0.7439507842063904), ('tolsa', 0.7165077924728394), ('dolsa', 0.7157090306282043), ('polsa', 0.7060896158218384), ('bupolsa', 0.6889318227767944), ('molsa', 0.6666685342788696)]
 
banco
[('ibanco', 0.7881178259849548), ('bancos', 0.7591121792793274), ('bancoop', 0.7470996379852295), ('mibanco', 0.7466395497322083), ('bancor', 0.7200927138328552), ('bancaria', 0.716255784034729), ('bancario', 0.7133525609970093)]
 
Banco
[('yanco', 0.6898605227470398), ('ñanco',

In [7]:
# Copy the vectors so the snapshot cannot change when the model is retrained
mercado, bolsa = fasttext.wv['mercado'].copy(), fasttext.wv['bolsa'].copy()
mercado.shape, bolsa.shape

((300,), (300,))

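The neighbours show that this model was clearly trained on lowercased text with punctuation left in: tokens such as `mercado,` and `bolsa»` appear among the results, while capitalized queries such as `Bolsa` only match subword fragments like `olsa`.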
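
The neighbours returned for `Bolsa` (`olsa`, `colsa`, `tolsa`) are consistent with FastText building the vector of an unseen word from its character n-grams. A minimal sketch of that decomposition, assuming n-gram lengths 3 to 6 (the fastText defaults, not confirmed for this particular model); `char_ngrams` is an illustrative helper rather than a Gensim function, and the real implementation additionally hashes n-grams into buckets:

```python
def char_ngrams(word: str, min_n: int = 3, max_n: int = 6) -> set:
    """Character n-grams of '<word>', with '<' and '>' as boundary markers."""
    wrapped = f'<{word}>'
    return {
        wrapped[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(wrapped) - n + 1)
    }

# N-grams shared by the capitalized query and its top neighbour
shared = char_ngrams('Bolsa') & char_ngrams('olsa')
print(sorted(shared))
```

Every n-gram that does not touch the capital `B` is shared with `olsa`, which would explain why such fragments rank highest.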

In [8]:
import time
from datetime import timedelta

epoch_logger = EpochLogger()
start = time.time()


new_sentences = txt_split_tokenized

fasttext.build_vocab(new_sentences, update=True)
fasttext.train(new_sentences, total_examples=len(new_sentences), epochs = 1500, callbacks=[epoch_logger])

elapsed = (time.time() - start)

str(timedelta(seconds=elapsed))[:-7]

Epoch #0 start
Epoch #0 end
Epoch #100 start
Epoch #100 end
Epoch #200 start
Epoch #200 end
Epoch #300 start
Epoch #300 end
Epoch #400 start
Epoch #400 end
Epoch #500 start
Epoch #500 end
Epoch #600 start
Epoch #600 end
Epoch #700 start
Epoch #700 end
Epoch #800 start
Epoch #800 end
Epoch #900 start
Epoch #900 end
Epoch #1000 start
Epoch #1000 end
Epoch #1100 start
Epoch #1100 end
Epoch #1200 start
Epoch #1200 end
Epoch #1300 start
Epoch #1300 end
Epoch #1400 start
Epoch #1400 end


'2:17:25'

Has the model learnt anything?

In [9]:
import numpy as np
np.allclose(fasttext.wv['mercado'], mercado), np.allclose(fasttext.wv['bolsa'], bolsa) 

(False, False)
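
`np.allclose` only tells us that the vectors changed, not by how much. One possible way to quantify the drift is the cosine similarity between the vector before and after training. A minimal sketch on toy vectors; `cosine_similarity` is a hypothetical helper written here for illustration:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors standing in for the "before" and "after" embeddings
before = np.array([1.0, 0.0, 0.0])
after = np.array([1.0, 1.0, 0.0])
print(round(cosine_similarity(before, after), 4))
```

In this notebook the call would be `cosine_similarity(mercado, fasttext.wv['mercado'])`; values close to 1 would mean the word barely moved.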

Neighbours before fine-tuning, repeated here for comparison:

mercado
[('mercado,', 0.7775014042854309), ('mercados', 0.7754110097885132), ('posmercado', 0.7551040649414062), ('#mercado', 0.7441679239273071), ('‘mercado', 0.7340843677520752), ('mercada', 0.7078138589859009), ('promercado', 0.7052767276763916)]
 
bolsa
[('bolsas', 0.7388685941696167), ('bolsada', 0.6469042301177979), ('bolsa»', 0.6271626353263855), ('cotiza', 0.6204637289047241), ('bursátil', 0.6049832701683044), ('embolsa', 0.6042190790176392), ('nasdaq', 0.5938200354576111)]

In [10]:
for word in ['mercado', 'bolsa', 'Bolsa', 'banco', 'Banco', 'cárcel', 'empresa']:
    print(word)
    print(fasttext.wv.most_similar(word)[:7])
    print(" ")

mercado
[('año', 0.6514887809753418), ('deuda', 0.6298579573631287), ('inversores', 0.627924919128418), ('bolsa', 0.6208838224411011), ('mercados', 0.6208115816116333), ('parte', 0.5797388553619385), ('valores', 0.5716472864151001)]
 
bolsa
[('española', 0.7075108289718628), ('año', 0.621670663356781), ('mercado', 0.6208838224411011), ('inversores', 0.5761557817459106), ('compañia', 0.5678509473800659), ('deuda', 0.5638283491134644), ('salida', 0.5417485237121582)]
 
Bolsa
[('olsa', 0.8153858780860901), ('colsa', 0.6973823308944702), ('nlsa', 0.6802817583084106), ('dolsa', 0.6668431758880615), ('tolsa', 0.6642357110977173), ('polsa', 0.6628522872924805), ('molsa', 0.6579697132110596)]
 
banco
[('central', 0.6823917031288147), ('ibanco', 0.5623183846473694), ('españa', 0.5554822087287903), ('/banco', 0.539932131767273), ('europeo', 0.5375440120697021), ('bancoop', 0.5332126617431641), ('bancoldex', 0.5326253175735474)]
 
Banco
[('yanco', 0.833631157875061), ('anco', 0.8321518898010254),
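
To compare the two neighbour lists without reading them side by side, one option is the Jaccard overlap between the sets of returned words. A minimal sketch using the `mercado` results printed in this notebook (scores rounded to three decimals); `neighbour_overlap` is a hypothetical helper, not a Gensim function:

```python
def neighbour_overlap(before, after) -> float:
    """Jaccard overlap between two most_similar-style lists of (word, score)."""
    words_before = {word for word, _ in before}
    words_after = {word for word, _ in after}
    union = words_before | words_after
    return len(words_before & words_after) / len(union) if union else 1.0

# 'mercado' neighbours copied from the outputs above, scores rounded
mercado_before = [('mercado,', 0.778), ('mercados', 0.775), ('posmercado', 0.755),
                  ('#mercado', 0.744), ('‘mercado', 0.734), ('mercada', 0.708),
                  ('promercado', 0.705)]
mercado_after = [('año', 0.651), ('deuda', 0.630), ('inversores', 0.628),
                 ('bolsa', 0.621), ('mercados', 0.621), ('parte', 0.580),
                 ('valores', 0.572)]
print(neighbour_overlap(mercado_before, mercado_after))
```

Only `mercados` survives in both lists, so the overlap is low: the neighbourhood moved from spelling variants to words from the financial domain.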

## Persist the model to disk

In [22]:
import tempfile

name = os.path.join(os.getcwd(), 'FastText_FINE', 'fastext_mercados_fine_tuned-')

# Reserve a unique file name with the given prefix, then save once the handle is closed
with tempfile.NamedTemporaryFile(prefix=name, delete=False) as tmp:
    temporary_filepath = tmp.name
fasttext.save(temporary_filepath)