[View in Colaboratory](https://colab.research.google.com/github/scarimp/TFL_COLAB/blob/master/1_gensim_fasttext_italian_wiki.ipynb)

Working with pre-trained fastText word vectors, adapted from the post by [Manash Kumar Mandal](https://blog.manash.me/how-to-use-pre-trained-word-vectors-from-facebooks-fasttext-a71e6d55f27).

Here is the GitHub repository.

**Pre-trained word vectors** for 294 languages, trained on Wikipedia using fastText. These 300-dimensional vectors were obtained with the skip-gram model described in Bojanowski et al. (2016), *Enriching Word Vectors with Subword Information*, using default parameters.
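The "subword information" in the paper refers to character n-grams: each word is wrapped in `<` and `>` boundary markers and split into overlapping pieces, and the word vector is built from the vectors of those pieces. Below is a minimal sketch of that splitting step only. The function name `char_ngrams` is made up for illustration, and the 3 to 6 range is my understanding of the fastText defaults, so treat it as an assumption. Note that the `.vec` file used in this notebook stores only the final word vectors, not the n-gram vectors.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of `word` (illustrative helper, not fastText's code)."""
    # Wrap the word in boundary markers so prefixes and suffixes are distinguishable.
    wrapped = "<" + word + ">"
    ngrams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of length n over the wrapped word.
        for start in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[start:start + n])
    return ngrams


# Small range to keep the printed list short.
print(char_ngrams("auto", min_n=3, max_n=4))
```

This shared-pieces idea is a plausible reason why the nearest neighbours further down are so often spelling variants of the query word.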

See also [this datascience.stackexchange question](https://datascience.stackexchange.com/questions/10695/how-to-initialize-a-new-word2vec-model-with-pre-trained-model-weights?rq=1) on initializing a new word2vec model with pre-trained weights.


In [3]:
#!pip list
!pip install gensim

Collecting gensim
  Downloading https://files.pythonhosted.org/packages/86/f3/37504f07651330ddfdefa631ca5246974a60d0908216539efda842fd080f/gensim-3.5.0-cp36-cp36m-manylinux1_x86_64.whl (23.5MB)
Collecting smart-open>=1.2.1 (from gensim)
  Downloading https://files.pythonhosted.org/packages/cf/3d/5f3a9a296d0ba8e00e263a8dee76762076b9eb5ddc254ccaa834651c8d65/smart_open-1.6.0.tar.gz
Collecting boto>=2.32 (from smart-open>=1.2.1->gensim)
  Downloading https://files.pythonhosted.org/packages/23/10/c0b78c27298029e4454a472a1919bde20cb182dab1662cec7f2ca1dcc523/boto-2.49.0-py2.py3-none-any.whl (1.4MB)
Collecting bz2file (from smart-open>=1.2.1->gensim)
  Downloading https://files.pythonhosted.org/packages/61/39/122222b5e85cd41c391b68a99ee296584b2a2d1d233e7ee32b4532384f2d/bz2file-0.98.tar.gz
Collecting boto3 (from smart-open>=1.2.1->gensim)
  Downlo

In [4]:
from __future__ import print_function
import numpy as np
import random
import tensorflow as tf
import nltk
import gensim
import matplotlib  # the top-level package carries __version__; `plt` conventionally means matplotlib.pyplot

# Report the library versions used in this notebook
print("numpy vers.=", np.__version__)
print("tensorflow vers.=", tf.__version__)
print("nltk vers.=", nltk.__version__)
print("gensim vers.=", gensim.__version__)
print("matplotlib vers.=", matplotlib.__version__)

numpy vers.= 1.14.5
tensorflow vers.= 1.9.0-rc2
nltk vers.= 3.2.5
gensim vers.= 3.5.0
matplotlib vers.= 2.1.2


In [0]:
from gensim.models import KeyedVectors
# Load the pre-trained Italian fastText vectors (word2vec text format)
it_model = KeyedVectors.load_word2vec_format('https://s3-us-west-1.amazonaws.com/fasttext-vectors/wiki.it.vec')
it_model
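For orientation, the `.vec` file is plain text in the word2vec format: a header line with the number of words and the vector dimension, then one line per word with its components separated by spaces. The sketch below parses a tiny in-memory sample with that layout. The helper name `read_vec` and the sample values are invented for illustration; this is not how gensim implements loading.

```python
import io


def read_vec(handle):
    """Parse word2vec text format into (count, dim, {word: [floats]})."""
    # Header line: "<number of words> <vector dimension>"
    header = handle.readline().split()
    count, dim = int(header[0]), int(header[1])
    vectors = {}
    for _ in range(count):
        # Each following line: the word, then `dim` space-separated floats.
        parts = handle.readline().rstrip().split(" ")
        word, values = parts[0], [float(x) for x in parts[1:]]
        if len(values) != dim:
            raise ValueError("unexpected vector length for %r" % word)
        vectors[word] = values
    return count, dim, vectors


# Made-up sample with 2 words of dimension 3.
sample = io.StringIO("2 3\nauto 0.1 0.2 0.3\ncittà 0.4 0.5 0.6\n")
count, dim, vectors = read_vec(sample)
print(count, dim, vectors["città"])
```

Since the real file is large, it can help while experimenting to pass `limit=` to `load_word2vec_format`, which reads only the first N vectors.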

In [13]:
# Getting the tokens
words = list(it_model.vocab)

# Printing out number of tokens available
print("Number of Tokens: {}".format(len(words)))

Number of Tokens: 871053
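A note for anyone rerunning this on a newer gensim: the `vocab` attribute used above belongs to the 3.x API of this notebook (gensim 3.5.0), while gensim 4.x exposes the same mapping as `key_to_index`. A small sketch of a helper that works with either follows. The name `vocabulary_words` and the two stand-in classes are hypothetical and only there so the snippet runs without downloading the vectors.

```python
def vocabulary_words(model):
    """Return the model's words as a list, for old and new gensim APIs."""
    # gensim 4.x exposes `key_to_index`; gensim 3.x (used here) exposes `vocab`.
    mapping = getattr(model, "key_to_index", None)
    if mapping is None:
        mapping = model.vocab
    return list(mapping)


class OldStyleModel:
    """Hypothetical stand-in for a gensim 3.x KeyedVectors object."""
    vocab = {"auto": None, "città": None}


class NewStyleModel:
    """Hypothetical stand-in for a gensim 4.x KeyedVectors object."""
    key_to_index = {"auto": 0, "città": 1}


print(vocabulary_words(OldStyleModel()))
print(vocabulary_words(NewStyleModel()))
```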


In [17]:
# Printing out the dimension of a word vector 
print("Dimension of a word vector: {}".format(
    len(it_model[words[0]])
))

Dimension of a word vector: 300


In [16]:
# Print out the vector of a word 
print("Vector components of a word: {}".format(
    it_model[words[0]]
))

Vector components of a word: [ 3.3795e-02 -1.7544e-02  2.5536e-02 -1.5358e-01  4.1421e-01  4.5707e-01
  1.2454e-01 -9.9489e-02  1.3040e-01  5.2032e-01 -2.2110e-01  1.4737e-01
  1.6640e-01 -1.1823e-01 -1.3008e-01 -5.0490e-01  5.9008e-01  2.4718e-03
  2.7920e-02  1.6570e-01 -1.3485e-01  1.6096e-01  7.0532e-02 -8.0481e-02
  2.9753e-02 -1.4702e-01 -1.0424e-01  1.8238e-01  1.0057e-01  2.0130e-01
  1.3713e-01  1.7432e-01 -2.2095e-01 -1.7741e-01 -3.2525e-01  1.3575e-01
  7.8731e-02 -4.0539e-01  2.1751e-02 -1.4919e-01  3.0200e-01 -9.9151e-02
 -1.9201e-01  4.3207e-01 -6.2661e-02  5.6057e-02 -3.0339e-01 -2.2859e-02
  1.4218e-01  6.8054e-02  5.1317e-02 -4.3531e-02 -1.2934e-01 -3.0483e-02
 -7.0046e-03  2.3159e-01  6.4715e-02 -4.5078e-02  3.8989e-01  3.7618e-01
  5.6262e-02 -2.3954e-01  5.5242e-02 -1.2803e-01 -2.0887e-01 -6.6318e-02
 -1.8209e-01  8.2813e-02 -5.2937e-01 -2.4841e-01 -3.6483e-01  2.4408e-01
 -1.7857e-01  2.4326e-01 -1.1353e-01  9.4183e-02  1.3148e-03 -3.7275e-01
 -3.2950e-01 -1.2377e-

In [19]:
# Pick a word 
find_similar_to = 'auto'

# Finding out similar words [default= top 10]
for similar_word in it_model.similar_by_word(find_similar_to):
    print("Word: {0}, Similarity: {1:.2f}".format(
        similar_word[0], similar_word[1]
))

Word: automobile, Similarity: 0.78
Word: automobili, Similarity: 0.71
Word: automobilina, Similarity: 0.70
Word: automobile», Similarity: 0.70
Word: automobiline, Similarity: 0.69
Word: automobilia, Similarity: 0.68
Word: autovetture, Similarity: 0.66
Word: automobily, Similarity: 0.66
Word: automobil, Similarity: 0.65
Word: automobilista, Similarity: 0.64
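The similarity score printed above is a cosine similarity between word vectors. To make that concrete, here is a minimal pure-Python sketch of a nearest-neighbour lookup over a toy dictionary. The two-dimensional vectors are invented for illustration, and the helper names are mine, not gensim's.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_words(query, vectors, topn=10):
    """Rank every other word by cosine similarity to `query`."""
    scores = [
        (word, cosine_similarity(vectors[query], vec))
        for word, vec in vectors.items()
        if word != query
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:topn]


# Toy vectors, made up for the example.
toy = {"auto": [1.0, 0.0], "automobile": [0.9, 0.1], "città": [0.0, 1.0]}
for word, score in nearest_words("auto", toy, topn=2):
    print("Word: {0}, Similarity: {1:.2f}".format(word, score))
```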


In [20]:
# Pick a word 
find_similar_to = 'città'

# Finding out similar words [default= top 10]
for similar_word in it_model.similar_by_word(find_similar_to):
    print("Word: {0}, Similarity: {1:.2f}".format(
        similar_word[0], similar_word[1]
))

Word: cittadina, Similarity: 0.75
Word: città…, Similarity: 0.75
Word: città,, Similarity: 0.68
Word: ,città, Similarity: 0.67
Word: cittadine, Similarity: 0.66
Word: città/, Similarity: 0.66
Word: ‘città, Similarity: 0.65
Word: cittadina», Similarity: 0.62
Word: capitale, Similarity: 0.61
Word: città/capitale, Similarity: 0.60
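Many of these neighbours are the query word with punctuation attached (`città…`, `,città`, `città/`), which suggests the Wikipedia text was tokenized without stripping punctuation. One possible clean-up, sketched below, is to keep only purely alphabetic tokens. The helper name `clean_neighbours` is hypothetical; the sample pairs are copied from the output above.

```python
def clean_neighbours(neighbours):
    """Keep only (word, score) pairs whose word is purely alphabetic."""
    # str.isalpha() accepts accented letters such as "à" but rejects
    # punctuation, digits and slashes.
    return [(word, score) for word, score in neighbours if word.isalpha()]


# First six neighbours of 'città' from the output above.
neighbours = [
    ("cittadina", 0.75),
    ("città…", 0.75),
    ("città,", 0.68),
    (",città", 0.67),
    ("cittadine", 0.66),
    ("città/", 0.66),
]
print(clean_neighbours(neighbours))
```

This filter also drops legitimate tokens that contain a hyphen or an apostrophe, so it is a rough heuristic. Asking `similar_by_word` for more candidates through its `topn` argument leaves enough results after filtering.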


In [25]:
# Pick a word 
find_similar_to = 'castrignano'

# Finding out similar words [default= top 10]
for similar_word in it_model.similar_by_word(find_similar_to):
    print("Word: {0}, Similarity: {1:.2f}".format(
        similar_word[0], similar_word[1]
))

Word: castrignanò, Similarity: 0.92
Word: trignano, Similarity: 0.87
Word: petrignano, Similarity: 0.83
Word: costrignano, Similarity: 0.82
Word: caprignano, Similarity: 0.79
Word: orignano, Similarity: 0.78
Word: cirignano, Similarity: 0.77
Word: gignano, Similarity: 0.76
Word: pignano, Similarity: 0.76
Word: cerfignano, Similarity: 0.76


In [26]:
# Pick a word 
find_similar_to = 'papa'

# Finding out similar words [default= top 10]
for similar_word in it_model.similar_by_word(find_similar_to):
    print("Word: {0}, Similarity: {1:.2f}".format(
        similar_word[0], similar_word[1]
))

Word: pontefice, Similarity: 0.84
Word: innocenzo, Similarity: 0.80
Word: pio, Similarity: 0.77
Word: antipapa, Similarity: 0.76
Word: papale, Similarity: 0.75
Word: pontificato, Similarity: 0.75
Word: pontefice», Similarity: 0.74
Word: clemente, Similarity: 0.74
Word: sisto, Similarity: 0.71
Word: pontefices, Similarity: 0.71


In [27]:
# Pick a word 
find_similar_to = 'papavero'

# Finding out similar words [default= top 10]
for similar_word in it_model.similar_by_word(find_similar_to):
    print("Word: {0}, Similarity: {1:.2f}".format(
        similar_word[0], similar_word[1]
))

Word: papaver, Similarity: 0.88
Word: papaveri, Similarity: 0.82
Word: papaverina, Similarity: 0.78
Word: papaveracee, Similarity: 0.77
Word: papaveracea, Similarity: 0.77
Word: papaverales, Similarity: 0.69
Word: papaveretalia, Similarity: 0.68
Word: papaveraceae, Similarity: 0.64
Word: papava, Similarity: 0.63
Word: garofano, Similarity: 0.62


In [28]:
# Pick a word 
find_similar_to = 'cucina'

# Finding out similar words [default= top 10]
for similar_word in it_model.similar_by_word(find_similar_to):
    print("Word: {0}, Similarity: {1:.2f}".format(
        similar_word[0], similar_word[1]
))

Word: cucina», Similarity: 0.76
Word: cucinata, Similarity: 0.73
Word: /cucina, Similarity: 0.73
Word: cucine, Similarity: 0.72
Word: cucinato, Similarity: 0.72
Word: cucinare, Similarity: 0.71
Word: cucinati, Similarity: 0.70
Word: cucin, Similarity: 0.69
Word: gastronomia, Similarity: 0.69
Word: culinaria, Similarity: 0.69


In [29]:
# Pick a word 
find_similar_to = 'andiamo'

# Finding out similar words [default= top 10]
for similar_word in it_model.similar_by_word(find_similar_to):
    print("Word: {0}, Similarity: {1:.2f}".format(
        similar_word[0], similar_word[1]
))

Word: andiamoci, Similarity: 0.86
Word: mandiamo, Similarity: 0.78
Word: andiamocene, Similarity: 0.76
Word: cominciamo, Similarity: 0.75
Word: torniamo, Similarity: 0.74
Word: incominciamo, Similarity: 0.74
Word: iniziamo, Similarity: 0.73
Word: domandiamoci, Similarity: 0.73
Word: continuiamo, Similarity: 0.73
Word: passiamoci, Similarity: 0.72


In [51]:
from itertools import islice

# Show the first five (word, vocabulary entry) pairs.
# `items` has to be called; printing `it_model.vocab.items` only shows the bound method.
for word, entry in islice(it_model.vocab.items(), 5):
    print(word, entry)


In [52]:
# Test words 
word_add = ['roma', 'francia']
word_sub = ['parigi']

# Word vector addition and subtraction 
for resultant_word in it_model.most_similar(
    positive=word_add, negative=word_sub
):
    print("Word : {0} , Similarity: {1:.2f}".format(
        resultant_word[0], resultant_word[1]
))

Word : spagna , Similarity: 0.57
Word : italia , Similarity: 0.57
Word : spagna, , Similarity: 0.53
Word : lazia , Similarity: 0.52
Word : lazio , Similarity: 0.52
Word : viterbo , Similarity: 0.51
Word : germania , Similarity: 0.50
Word : anagni , Similarity: 0.50
Word : civitavecchiese , Similarity: 0.49
Word : spagna» , Similarity: 0.49
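The `positive`/`negative` query is vector arithmetic followed by a cosine ranking. As far as I understand gensim's `most_similar`, it combines the unit-normalized vectors with weight +1 for positive and -1 for negative words, then ranks the remaining vocabulary against the result, skipping the input words. The sketch below reproduces that idea in pure Python on invented four-dimensional toy vectors; it is an approximation for illustration, not gensim's implementation.

```python
import math


def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def analogy(vectors, positive, negative, topn=3):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    dim = len(next(iter(vectors.values())))
    target = [0.0] * dim
    for word in positive:
        target = [t + x for t, x in zip(target, unit(vectors[word]))]
    for word in negative:
        target = [t - x for t, x in zip(target, unit(vectors[word]))]
    target = unit(target)
    # The query words themselves are excluded from the results.
    exclude = set(positive) | set(negative)
    scores = [
        (word, sum(a * b for a, b in zip(target, unit(vec))))
        for word, vec in vectors.items()
        if word not in exclude
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:topn]


# Toy axes (made up): [capital-ness, France, Italy, Spain]
toy = {
    "parigi": [1.0, 1.0, 0.0, 0.0],
    "francia": [0.0, 1.0, 0.0, 0.0],
    "roma": [1.0, 0.0, 1.0, 0.0],
    "italia": [0.0, 0.0, 1.0, 0.0],
    "spagna": [0.0, 0.0, 0.0, 1.0],
}
for word, score in analogy(toy, positive=["roma", "francia"], negative=["parigi"]):
    print("Word : {0} , Similarity: {1:.2f}".format(word, score))
```

On the toy data the clean answer comes first; on the real vectors above `italia` ties with `spagna` at 0.57, which shows how much noisier the real space is.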


In [60]:
# `keys` has to be called to get the words; `.wv` is deprecated on a KeyedVectors object, so use `.vocab` directly
list(it_model.vocab.keys())[:5]

In [72]:
# `vectors` is a NumPy array with one row per word; index it directly
# (`it_model.vectors.item` is a method, so it cannot be subscripted).
print(it_model.vectors.shape)
for i in range(5):
    # Show each word with the first five components of its vector
    print(it_model.index2word[i], it_model.vectors[i][:5])