# Creating embeddings

First, I load and clean my corpus, then vectorize it. To create the embeddings, I use gensim's [FastText](https://radimrehurek.com/gensim/models/fasttext.html) model.

In [1]:
from gensim.models.fasttext import FastText

%run utility_file    # handles main module imports and loading .csv files
from utility_file import Preprocess     # custom class for preprocessing text

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Sveta\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [2]:
path = 'pi2.csv'
source_lang = 'English'
target_lang = 'Russian'

source_list, target_list = load_separate_corpora_from_csv(path, source_lang, target_lang)

In [3]:
exceptions = [
    'flowerbed', 
    'building'
    ]       # words not handled by NLP libraries in the desired way
russian_stopwords = stopwords.words('russian') + [
    'это',
    'ещё',
    'ваш',
    'всё',
    'весь',
    '-'
    ]

# Preprocessing the corpora

clean_en_corpus = [Preprocess(sentence).preprocess_en_text(exceptions) for sentence in source_list]
clean_ru_corpus = [Preprocess(sentence).preprocess_ru_text(russian_stopwords) for sentence in target_list]

In [5]:
# Pickling the clean corpora for use in other notebooks

import pickle

with open('clean_en_corpus.pkl', 'wb') as f:
    pickle.dump(clean_en_corpus, f)
with open('clean_ru_corpus.pkl', 'wb') as f:
    pickle.dump(clean_ru_corpus, f)

In [13]:
# for loading pickled corpora

# import pickle
# with open('clean_en_corpus.pkl', 'rb') as f:
#        clean_en_corpus = pickle.load(f)
# with open('clean_ru_corpus.pkl', 'rb') as f:
#        clean_ru_corpus = pickle.load(f)

In [6]:
# using nltk's word tokenizer to tokenize every item in the clean corpora in order to feed them to the model

word_tokenized_en_corpus = [word_punctuation_tokenizer.tokenize(sent) for sent in clean_en_corpus]
word_tokenized_ru_corpus = [word_punctuation_tokenizer.tokenize(sent) for sent in clean_ru_corpus]
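
`word_punctuation_tokenizer` is not defined in this notebook, so it has to come from `utility_file`. I am assuming it is an instance of nltk's `WordPunctTokenizer`, which splits text into runs of word characters and runs of punctuation. The sketch below reproduces that behaviour with the standard library only; the helper name `word_punct_tokenize` and the sample sentences are made up for illustration.

```python
import re

# Assumption: this is the pattern nltk's WordPunctTokenizer is built on.
_WORD_PUNCT = re.compile(r"\w+|[^\w\s]+")


def word_punct_tokenize(sentence):
    """Split a sentence into alphanumeric runs and punctuation runs."""
    return _WORD_PUNCT.findall(sentence)


# Toy stand-ins for the cleaned corpora (hypothetical sentences).
sample_corpus = ["build hotel island", "получить награда остров"]
tokenized = [word_punct_tokenize(s) for s in sample_corpus]
print(tokenized)
```

Each sentence becomes a list of tokens, which is the list-of-lists shape that gensim's `FastText` expects as training input.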

In [7]:
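# training gensim FastText model to create embeddings
# note: `size` and `iter` are the gensim 3.x argument names;
# gensim 4.x renamed them to `vector_size` and `epochs`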

embedding_size = 30
window_size = 45
min_word = 5
down_sampling = 1e-2

ft_en_model = FastText(word_tokenized_en_corpus,
                      size=embedding_size,
                      window=window_size,
                      min_count=min_word,
                      sample=down_sampling,
                      sg=1,
                      iter=100)

ft_ru_model = FastText(word_tokenized_ru_corpus,
                      size=embedding_size,
                      window=window_size,
                      min_count=min_word,
                      sample=down_sampling,
                      sg=1,
                      iter=100)

# Checking out the results

In [8]:
# taking a look at the embeddings

print(ft_en_model.wv['travel'])

[ 0.44251978 -0.8151118   0.08512311  0.41910678  0.1451333  -0.12282278
  0.5808773   0.15731221  0.08658551  1.0206383   0.7039301   1.3919924
 -0.41332284 -0.25333408  0.943905   -0.3329253  -0.38998294  0.04191609
  0.20740989  0.10418067  1.1407379   0.0242559  -0.14824784 -0.30736986
  0.07419863 -0.56922174 -0.94646376 -0.13597773 -0.06863962 -0.65130246]


In [9]:
# these are some commonly used game terms
game_terms = ['crystal', 'naomi', 'key', 'island', 'build', 'reward']

semantically_similar_words = {
    word: [item[0] for item in ft_en_model.wv.most_similar([word], topn=5)]
    for word in game_terms
}

for k, v in semantically_similar_words.items():
    print(k + ":" + str(v))

crystal:['bank', 'refresh', 'buy', 'purchase', 'resource']
naomi:['come', 'mano', 'could', 'soon', 'think']
key:['lock', 'hunter', 'bury', 'treasure', 'purr']
island:['visit', 'come', 'return', 'resort', 'arrive']
build:['upgrade', 'unique', 'hotel', 'building', 'majority']
reward:['prize', 'get', 'awesome', 'earn', 'receive']


The first word in each list looks reasonable: it does often appear in the same context as the query word. Below I check the cosine similarity between other pairs of terms that frequently appear in the same context.

In [10]:
print(ft_en_model.wv.similarity(w1='dorin', w2='mano'))      # both are character names
print(ft_en_model.wv.similarity(w1='event', w2='hotel'))     # events are held in hotels
print(ft_en_model.wv.similarity(w1='gold', w2='coin'))       # coins are made of gold

0.8523123
0.7713198
0.58982426
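
For reference, `wv.similarity` returns the cosine of the angle between the two word vectors, i.e. their dot product divided by the product of their lengths. Here is a minimal standard-library sketch of that formula; the function name `cosine_similarity` and the toy vectors are illustrative and are not taken from the trained models.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy vectors: b points the same way as a, c is orthogonal to a.
a = [1.0, 2.0, 0.0]
b = [2.0, 4.0, 0.0]
c = [0.0, 0.0, 3.0]

print(cosine_similarity(a, b))  # close to 1: same direction
print(cosine_similarity(a, c))  # 0: no shared direction
```

Applied to the trained model, `cosine_similarity(ft_en_model.wv['gold'], ft_en_model.wv['coin'])` should agree with `ft_en_model.wv.similarity('gold', 'coin')` up to floating-point rounding.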


# Saving and converting trained models for further use

To perform vector space alignment in notebook 03, I will need the models in Facebook fastText's .vec text format. The code below saves the trained gensim models as Facebook-format .bin files and then converts them to .vec.

In [11]:
from gensim.models.fasttext import save_facebook_model
save_facebook_model(ft_en_model, "en_model_fb.bin", encoding='utf-8')
save_facebook_model(ft_ru_model, "ru_model_fb.bin", encoding='utf-8')

In [12]:
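from fasttext import load_model

def convert_bin_to_vec(model, lang):
    """Convert a fastText .bin model to a text file <lang>_model_converted.vec."""
    f = load_model(model)

    # get all words from the model
    words = f.get_words()
    name = lang + '_model_converted'

    # explicit UTF-8 so that non-ASCII (e.g. Russian) words are written
    # regardless of the OS default encoding
    with open("%s.vec" % name, 'w', encoding='utf-8') as file_out:
        # the first line must contain the total number of words and the vector dimension
        file_out.write(str(len(words)) + " " + str(f.get_dimension()) + "\n")

        # then one line per word: the word followed by its vector components.
        # A failed write is no longer silently skipped, because a skipped word
        # would leave the header count out of sync with the file contents.
        for w in words:
            v = f.get_word_vector(w)
            vstr = " ".join(str(vi) for vi in v)
            file_out.write(w + " " + vstr + "\n")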

In [13]:
convert_bin_to_vec('en_model_fb.bin', 'en')
convert_bin_to_vec('ru_model_fb.bin', 'ru')
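
A quick sanity check on the converted files can catch a header that disagrees with the rows before the alignment step. The sketch below relies only on the .vec layout described above (a `<word count> <dimension>` header, then one word plus its vector per line). The function name `check_vec_file` is hypothetical, and the demo writes a tiny toy file to a temporary directory instead of assuming the real files exist.

```python
import os
import tempfile


def check_vec_file(path):
    """Return (n_words, dim) after verifying that every row matches the header."""
    with open(path, encoding="utf-8") as f:
        n_words, dim = map(int, f.readline().split())
        rows = 0
        for line in f:
            parts = line.rstrip("\n").split(" ")
            # parts[0] is the word; the rest must be exactly `dim` numbers
            if len(parts) != dim + 1:
                raise ValueError(
                    "row %d: expected %d values, got %d" % (rows + 1, dim, len(parts) - 1)
                )
            for x in parts[1:]:
                float(x)  # raises ValueError if a component is not numeric
            rows += 1
    if rows != n_words:
        raise ValueError("header says %d words, file has %d" % (n_words, rows))
    return n_words, dim


# Toy file with two 3-dimensional vectors (made-up values).
toy_path = os.path.join(tempfile.mkdtemp(), "toy.vec")
with open(toy_path, "w", encoding="utf-8") as f:
    f.write("2 3\n")
    f.write("остров 0.1 0.2 0.3\n")
    f.write("island 0.4 0.5 0.6\n")

shape = check_vec_file(toy_path)
print(shape)
```

On the real output this would be called as `check_vec_file('en_model_converted.vec')` and `check_vec_file('ru_model_converted.vec')`; with the settings above, the second element of the result should equal `embedding_size`.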

