## Fine-tuning word2vec with gensim

This notebook fine-tunes the 300-dimensional skip-gram word2vec vectors published by [NILC-USP](http://www.nilc.icmc.usp.br/embeddings). <br>
The corpus consists of two novels by Machado de Assis, Dom Casmurro and Memórias Póstumas de Brás Cubas, both available in the `machado` corpus of the `nltk` library.

Importing the required libraries

In [1]:
from gensim.models import KeyedVectors, Word2Vec
import nltk
from nltk.text import Text
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import machado

In [2]:
nltk.download('machado')
nltk.download('punkt')

[nltk_data] Downloading package machado to /home/luisa/nltk_data...
[nltk_data]   Package machado is already up-to-date!
[nltk_data] Downloading package punkt to /home/luisa/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

Building the corpus

In [3]:
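# Load each novel from the machado corpus and join its tokens back into one string
memorias_postumas_de_bras_cubas = " ".join(Text(machado.words('romance/marm05.txt'))[:])
dom_casmurro = " ".join(Text(machado.words('romance/marm08.txt'))[:])
# Split each book into sentences (sent_tokenize uses the English Punkt model by default)
frases_mpbc = sent_tokenize(memorias_postumas_de_bras_cubas)
frases_dc = sent_tokenize(dom_casmurro)
# Split each sentence into word tokens
frases_token_mpbc = [word_tokenize(frase) for frase in frases_mpbc]
frases_token_dc = [word_tokenize(frase) for frase in frases_dc]
# Lowercase every token and gather the sentences of both books into one training corpus
corpus_train = []
corpora_machado = [frases_token_mpbc, frases_token_dc]
for corpus in corpora_machado:
    for frase in corpus:
        clean = [w.lower() for w in frase]
        corpus_train.append(clean)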

Loading the NILC model

In [6]:
model = KeyedVectors.load_word2vec_format("skip_s300.txt")

Fine-tuning

In [7]:
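# New model with default settings (CBOW, min_count=5); `size` is the gensim 3.x name, renamed `vector_size` in gensim 4
model_ft = Word2Vec(size=300)
# Build the vocabulary from the Machado de Assis sentences only
model_ft.build_vocab(corpus_train)
# Number of sentences seen while building the vocabulary
total_examples = model_ft.corpus_count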

In [8]:
print(total_examples)

7581


In [9]:
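# Overwrite the vectors of vocabulary words that also appear in the NILC file; lockf=1.0 keeps them trainable
model_ft.intersect_word2vec_format("skip_s300.txt", lockf=1.0)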
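
Conceptually, the intersection step copies a pretrained vector into the new model only for words that are already in the model's vocabulary. Words found only in the corpus keep their random initialization, and pretrained words absent from the corpus are ignored. The sketch below mimics that behaviour with plain dictionaries; the function `intersect_vectors` and the toy 2-dimensional vectors are hypothetical and are not gensim's implementation.

```python
def intersect_vectors(vocab_vectors, pretrained, lockf=1.0):
    """Toy illustration of vocabulary intersection (hypothetical helper).

    vocab_vectors: word -> randomly initialized vector (from the new corpus)
    pretrained:    word -> pretrained vector (e.g. read from a word2vec file)
    lockf:         lock factor assigned to imported vectors
                   (0.0 would freeze them, 1.0 lets training update them)
    """
    merged = {}
    locks = {}
    for word, vector in vocab_vectors.items():
        if word in pretrained:
            # Word exists in both: take the pretrained vector
            merged[word] = list(pretrained[word])
            locks[word] = lockf
        else:
            # Word exists only in the corpus: keep its initial vector, fully trainable
            merged[word] = list(vector)
            locks[word] = 1.0
    return merged, locks


# Made-up data: "garota" is pretrained but not in the corpus vocabulary, so it is dropped
vocab_vectors = {"capitu": [0.01, -0.02], "mulher": [0.03, 0.00]}
pretrained = {"mulher": [0.5, 0.7], "garota": [0.4, 0.6]}

merged, locks = intersect_vectors(vocab_vectors, pretrained, lockf=1.0)
print(merged)
print(locks)
```

With `lockf=1.0`, as in the cell above, the imported vectors serve as a starting point and are then adjusted by training on the new corpus.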

In [10]:
# Continue training on the Machado corpus for 5 epochs
model_ft.train(corpus_train, total_examples=total_examples, epochs=5)

(434917, 797085)
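
The tuple returned by `train` holds the effective number of words trained on and the raw number of words processed, both accumulated over all epochs. The effective count is lower because gensim skips words outside the vocabulary and randomly downsamples very frequent ones. The short calculation below uses only the numbers printed above and assumes the raw count is split evenly across the 5 epochs.

```python
# Values copied from the output of model_ft.train(...) above
effective_words, raw_words = 434917, 797085
epochs = 5

# Raw tokens seen in a single pass over the corpus
raw_words_per_epoch = raw_words / epochs

# Share of processed tokens that actually contributed to training
kept_fraction = effective_words / raw_words

print(f"raw words per epoch: {raw_words_per_epoch:.0f}")
print(f"fraction of words trained on: {kept_fraction:.3f}")
```

Roughly half of the processed tokens contributed to the updates, so the fine-tuning signal is smaller than the raw corpus size suggests.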

In [11]:
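# Save the full model (can be reloaded for further training)
model_ft.save("finetuned.model")
# Save only the word vectors in word2vec text format
model_ft.wv.save_word2vec_format('finetuned_w2v.txt')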

Checking the results

In [12]:
# Nearest neighbours in the original NILC vectors
model.most_similar('mulher')

[('menina', 0.7241089940071106),
 ('mãe', 0.7074811458587646),
 ('garota', 0.671286404132843),
 ('prostituta', 0.66178959608078),
 ('rapariga', 0.6564629077911377),
 ('esposa', 0.6457226276397705),
 ('amiga', 0.6422991752624512),
 ('enteada', 0.6399168372154236),
 ('filha', 0.6370487213134766),
 ('companheira', 0.6313394904136658)]
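
The scores returned by `most_similar` are cosine similarities between word vectors. The sketch below computes that measure on made-up 3-dimensional vectors (the real ones have 300 dimensions); `cosine_similarity` is a hypothetical helper written for illustration, not a gensim function.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy vectors, invented for this example
toy_mulher = [1.0, 2.0, 0.0]
toy_menina = [2.0, 4.0, 0.0]   # same direction as toy_mulher, so similarity is close to 1
toy_tio = [0.0, 0.0, 3.0]      # orthogonal to toy_mulher, so similarity is 0

print(cosine_similarity(toy_mulher, toy_menina))
print(cosine_similarity(toy_mulher, toy_tio))
```

Because the measure depends only on direction, scaling a vector does not change its similarity to other words.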

In [13]:
# Nearest neighbours in the fine-tuned vectors
model_ft.wv.most_similar('mulher')

[('filha', 0.7753649950027466),
 ('amiga', 0.7461511492729187),
 ('mãe', 0.7395811080932617),
 ('menina', 0.733360767364502),
 ('moça', 0.7179816961288452),
 ('pessoa', 0.6897475719451904),
 ('alma', 0.6722605228424072),
 ('criança', 0.6680387258529663),
 ('voz', 0.66009521484375),
 ('noiva', 0.659413754940033)]

In [14]:
model.most_similar('tio')

[('avô', 0.8674881458282471),
 ('pai', 0.8124756813049316),
 ('sobrinho', 0.8119102716445923),
 ('primo', 0.8070578575134277),
 ('irmão', 0.8021966218948364),
 ('sogro', 0.7897737622261047),
 ('bisavô', 0.7823960781097412),
 ('tio-avô', 0.7783960700035095),
 ('filho', 0.7690531611442566),
 ('meio-irmão', 0.7577237486839294)]

In [15]:
model_ft.wv.most_similar('tio')

[('pai', 0.8111797571182251),
 ('filho', 0.7992830276489258),
 ('irmão', 0.7959020733833313),
 ('amigo', 0.7433714866638184),
 ('marido', 0.7418179512023926),
 ('cunhado', 0.7339798212051392),
 ('padrinho', 0.6576740145683289),
 ('finado', 0.6423109769821167),
 ('parente', 0.6304435729980469),
 ('companheiro', 0.5905669331550598)]

In [16]:
model.most_similar('capitu')

[('virgília', 0.7068570852279663),
 ('souvanel', 0.6866049766540527),
 ('dusá', 0.6700954437255859),
 ('inesita', 0.66921466588974),
 ('honorina', 0.667019248008728),
 ('vexada', 0.6529178619384766),
 ('fidélia', 0.6497822999954224),
 ('antonica', 0.6490815877914429),
 ('iaiá', 0.6445469856262207),
 ('quintília', 0.6438143253326416)]

In [17]:
model_ft.wv.most_similar('capitu')

[('virgília', 0.7919564247131348),
 ('ela', 0.640224039554596),
 ('ele', 0.6002749800682068),
 ('mim', 0.586045503616333),
 ('rir', 0.5820785164833069),
 ('quando', 0.5633065104484558),
 ('dela', 0.5576575398445129),
 ('recuei', 0.5544220209121704),
 ('acudi', 0.5526195764541626),
 ('chorar', 0.5524805784225464)]
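
One simple way to summarize the comparisons above is to count how many of the ten nearest neighbours survive fine-tuning. The sketch below hard-codes the neighbour lists printed in the cells above; `shared_neighbours` is a hypothetical helper.

```python
def shared_neighbours(before, after):
    """Words present in both neighbour lists, in the order of `before`."""
    after_set = set(after)
    return [word for word in before if word in after_set]


# (original NILC neighbours, fine-tuned neighbours), copied from the outputs above
neighbours = {
    "mulher": (
        ["menina", "mãe", "garota", "prostituta", "rapariga",
         "esposa", "amiga", "enteada", "filha", "companheira"],
        ["filha", "amiga", "mãe", "menina", "moça",
         "pessoa", "alma", "criança", "voz", "noiva"],
    ),
    "tio": (
        ["avô", "pai", "sobrinho", "primo", "irmão",
         "sogro", "bisavô", "tio-avô", "filho", "meio-irmão"],
        ["pai", "filho", "irmão", "amigo", "marido",
         "cunhado", "padrinho", "finado", "parente", "companheiro"],
    ),
    "capitu": (
        ["virgília", "souvanel", "dusá", "inesita", "honorina",
         "vexada", "fidélia", "antonica", "iaiá", "quintília"],
        ["virgília", "ela", "ele", "mim", "rir",
         "quando", "dela", "recuei", "acudi", "chorar"],
    ),
}

for word, (before, after) in neighbours.items():
    shared = shared_neighbours(before, after)
    print(word, len(shared), shared)
```

With these lists, 'mulher' keeps 4 of its 10 neighbours, 'tio' keeps 3, and 'capitu' keeps only 'virgília'.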