## Using fastText for word representations

In [8]:
import fasttext
model = fasttext.train_unsupervised('data/data/fil9')

In [9]:
model.words

['the',
 'of',
 'one',
 'zero',
 'and',
 'in',
 'two',
 'a',
 'nine',
 'to',
 'is',
 'eight',
 'three',
 'four',
 'five',
 'six',
 'seven',
 'for',
 'are',
 'as',
 'was',
 's',
 'with',
 'by',
 'from',
 'that',
 'on',
 'or',
 'it',
 'at',
 'his',
 'an',
 'he',
 'have',
 'which',
 'be',
 'this',
 'there',
 'age',
 'also',
 'has',
 'population',
 'not',
 'were',
 'who',
 'other',
 'had',
 'but',
 'years',
 'all',
 'km',
 'their',
 'out',
 'new',
 'city',
 'under',
 'first',
 'more',
 'its',
 'american',
 'county',
 'they',
 'mi',
 'living',
 'income',
 'some',
 'median',
 'been',
 'after',
 'total',
 'most',
 'can',
 'united',
 'no',
 'when',
 'many',
 'states',
 'people',
 'over',
 'time',
 'census',
 'into',
 'used',
 'such',
 'may',
 'i',
 'up',
 'town',
 'average',
 'see',
 'older',
 'area',
 'families',
 'only',
 'family',
 'those',
 'males',
 'females',
 'households',
 'line',
 'made',
 'world',
 'them',
 'these',
 'her',
 'than',
 'would',
 'any',
 'every',
 'during',
 'war',
 'ex

This is a list of words in the vocabulary sorted by decreasing frequency.
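To make the ordering concrete, here is a minimal sketch of how a frequency-sorted vocabulary can be built from a tokenized corpus. The helper `build_vocab`, its `min_count` parameter, and the toy corpus are hypothetical illustrations. This is not fastText's own implementation, only the same idea in a few lines.

```python
from collections import Counter


def build_vocab(tokens, min_count=1):
    """Return the distinct tokens sorted by decreasing frequency.

    Hypothetical helper for illustration; not part of the fasttext API.
    """
    counts = Counter(tokens)
    # most_common() yields (token, count) pairs, highest count first
    return [word for word, count in counts.most_common() if count >= min_count]


# Toy corpus, made up for this example
toy_corpus = "the cat sat on the mat and the dog sat too".split()
vocab = build_vocab(toy_corpus)
print(vocab)
```

Raising `min_count` drops rare tokens from the vocabulary, which keeps it small on a large corpus.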

In [10]:
model.get_word_vector("the")

array([ 0.44313288, -0.06725024,  0.13870639, -0.09302375, -0.27249616,
       -0.1452028 , -0.3291832 ,  0.08051889, -0.2243666 , -0.2281157 ,
        0.13824797,  0.39411303, -0.00847176, -0.30919692,  0.18922739,
       -0.10838241,  0.06304305,  0.2618296 , -0.22830428, -0.06726748,
        0.17615782,  0.21709362, -0.04251831, -0.20397426, -0.09704718,
       -0.15144597,  0.05906538, -0.03667366, -0.12229547,  0.18454264,
       -0.03223861, -0.2859161 , -0.0116627 ,  0.1809985 , -0.05820664,
       -0.00742344, -0.263299  ,  0.03419667,  0.00328209,  0.18118934,
        0.34182513,  0.24826926,  0.1495962 ,  0.01940354,  0.03490288,
       -0.29688522, -0.14044511, -0.2119899 , -0.296442  , -0.05040457,
        0.34889236,  0.00795957,  0.10304486, -0.07437348, -0.3031885 ,
       -0.01675152,  0.01628832, -0.26944453, -0.08275838,  0.14943007,
        0.1418032 ,  0.29090133, -0.09121905,  0.05317429,  0.02159631,
        0.18770872, -0.01779362,  0.06965798,  0.00975831, -0.15

This is the word vector of the word "the". All the words in the vocabulary are represented in this format.

Since training took a while, we can save the model to disk and load it later instead of retraining.

In [12]:
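# Save the trained model so it can be reused without retraining
model.save_model("result/fil9.bin")
# To reload it in a later session:
# model = fasttext.load_model("result/fil9.bin")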

### Printing Word Vectors

We can print the vectors of the words asparagus, pidgey and yellow with the following command:

In [26]:
[model.get_word_vector(x) for x in ["asparagus", "pidgey", "yellow"]]

[array([ 0.42690518,  0.25333592,  0.35559738, -0.50758594,  0.10728838,
         0.7777949 ,  0.06088049,  0.32230195, -0.36272103, -0.01237643,
         0.32733086,  0.00986055, -0.63272154,  0.04458496, -0.13258888,
         0.03049676,  0.21458305,  0.13874006,  0.04462852,  0.03653872,
         0.33277968,  0.8327558 ,  0.29634246, -0.5567676 ,  0.2721421 ,
        -0.42235774, -0.14641348, -0.08704136, -0.02238087, -0.33137754,
        -0.25334537, -0.2736928 ,  0.3336205 ,  0.16857824,  0.4292296 ,
         0.30369   , -0.41468224,  0.3929557 ,  1.1367693 , -0.28598624,
         0.04711537,  0.20521347,  0.22843842, -0.19074629, -0.08494286,
        -0.6705761 ,  0.20914786,  0.5104783 , -0.32072568,  0.33267644,
         0.6045782 , -0.3321772 , -0.23552912,  0.43417615, -0.04569479,
         0.07495809,  0.23189762, -0.0666296 , -0.46784756,  0.3590963 ,
        -0.43390772, -0.07309743,  0.25188574,  0.14830428, -0.14054094,
         1.1045274 ,  0.09062426, -0.26540515,  0.0

A nice feature is that you can also query for words that did not appear in your data! Indeed, each word is represented by the sum of the vectors of its substrings. As long as the unknown word is made of known substrings, there is a representation of it!
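The substring idea can be sketched in plain Python. The helper `char_ngrams` below is a hypothetical illustration, not a fastText function. It assumes substrings of 3 to 6 characters taken from the word wrapped in `<` and `>` boundary markers, and it ignores details of the real library such as how substrings are stored.

```python
def char_ngrams(word, minn=3, maxn=6):
    """Return the set of character n-grams of a word.

    Hypothetical helper: the word is wrapped in '<' and '>' so that
    prefixes and suffixes are distinguishable from inner substrings.
    """
    wrapped = f"<{word}>"
    grams = set()
    for n in range(minn, maxn + 1):
        for start in range(len(wrapped) - n + 1):
            grams.add(wrapped[start:start + n])
    return grams


known = char_ngrams("environment")
misspelled = char_ngrams("enviroment")

# Substrings the misspelled word shares with the correctly spelled one
shared = known & misspelled
print(sorted(shared))
```

Every substring in `shared` would contribute the same vector to both words, which is why a word never seen during training can still land near related words.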

As an example let's try with a misspelled word:

In [14]:
model.get_word_vector("enviroment")

array([ 0.55878055, -0.09884794,  0.01644679, -0.01564694,  0.4957327 ,
        0.07122821, -0.35701048,  0.06908209, -0.48953786, -0.19269708,
       -0.2320794 ,  0.67147803, -0.08004352, -0.02940788,  0.00430572,
       -0.17793663,  0.19303249,  0.05925313, -0.09072658, -0.01990304,
        0.3315585 ,  0.2828672 ,  0.26893455, -0.1239673 , -0.0416136 ,
        0.08420929,  0.11877447,  0.14720206, -0.01383709,  0.25763714,
        0.06215807, -0.01858872,  0.11674039, -0.0890514 ,  0.2559311 ,
       -0.24585488, -0.7545257 ,  0.54074055, -0.20922379, -0.09910334,
        0.13268295,  0.2351052 ,  0.0445831 ,  0.44643945,  0.07826214,
       -0.5003347 , -0.09126402, -0.07871318, -0.12277444,  0.09350567,
       -0.00954623,  0.02424753,  0.06288505,  0.0845123 , -0.01981117,
       -0.1644663 , -0.14672217, -0.1898063 , -0.04739422, -0.155867  ,
        0.22146717, -0.14510304,  0.10355323, -0.2500524 , -0.0230857 ,
        0.20133856, -0.38991705,  0.19892141,  0.30788   ,  0.67

### Nearest Neighbor Queries

A simple way to check the quality of a word vector is to look at its nearest neighbors. This gives an intuition of the type of semantic information the vectors are able to capture.

This can be achieved with the nearest neighbor (nn) functionality. For example, we can query the 10 nearest neighbors of a word by running the following command:

In [15]:
model.get_nearest_neighbors('asparagus')

[(0.7878556847572327, 'walnuts'),
 (0.7779132127761841, 'beetroot'),
 (0.7738654017448425, 'tomato'),
 (0.7719115018844604, 'cabbages'),
 (0.7715265154838562, 'asparagales'),
 (0.7711346745491028, 'chickpea'),
 (0.7707504630088806, 'arrowroot'),
 (0.7656763792037964, 'chickpeas'),
 (0.7650848031044006, 'cabbage'),
 (0.7636562585830688, 'vegetables')]

Nice! It seems that vegetable vectors are similar. What about Pokémon?

In [16]:
model.get_nearest_neighbors('pidgey')

[(0.9023844599723816, 'pidgeot'),
 (0.8883858323097229, 'pidgeotto'),
 (0.8718113899230957, 'pidge'),
 (0.7840093970298767, 'charizard'),
 (0.7751268148422241, 'pikachu'),
 (0.7671030759811401, 'beedrill'),
 (0.7573469877243042, 'charmeleon'),
 (0.7519928216934204, 'pidgeon'),
 (0.751587450504303, 'pok'),
 (0.735741376876831, 'squirtle')]

Different evolutions of the same Pokémon have close-by vectors! But what about our misspelled word? Is its vector close to anything reasonable? Let's find out:

In [17]:
model.get_nearest_neighbors('enviroment')

[(0.8944371342658997, 'enviromental'),
 (0.8718078136444092, 'environ'),
 (0.835297167301178, 'enviro'),
 (0.7783136963844299, 'environs'),
 (0.760940432548523, 'enviromission'),
 (0.757567822933197, 'environnement'),
 (0.6687712669372559, 'environment'),
 (0.6484033465385437, 'realclimate'),
 (0.6476025581359863, 'ecotourism'),
 (0.6470270752906799, 'acclimatation')]

Thanks to the information contained within the word, the vector of our misspelled word is close to reasonable words! It is not perfect, but the main information has been captured.

In order to find nearest neighbors, we need to compute a similarity score between words. Our words are represented by continuous word vectors, so we can apply simple similarity measures to them. In particular, we use the cosine of the angle between two vectors. This similarity is computed against all words in the vocabulary, and the 10 most similar words are shown. The query word itself, which would trivially have a similarity of 1, is left out of the results, as the outputs above show.
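As a minimal sketch of that ranking, the snippet below computes cosine similarity by hand and sorts a tiny vocabulary by it. The function names and the 2-dimensional vectors are made up for illustration; real fastText vectors have many more dimensions. Like the outputs above, the sketch skips the query word itself.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_neighbors(query, vectors, k=10):
    """Rank every word except the query by cosine similarity to it."""
    scored = [
        (cosine_similarity(vectors[query], vec), word)
        for word, vec in vectors.items()
        if word != query
    ]
    # Highest similarity first, keep the top k
    return sorted(scored, reverse=True)[:k]


# Made-up 2-dimensional vectors, for illustration only
toy_vectors = {
    "cabbage": [0.9, 0.1],
    "tomato": [0.8, 0.3],
    "pikachu": [-0.2, 0.9],
}
neighbors = nearest_neighbors("cabbage", toy_vectors, k=2)
print(neighbors)
```

Because cosine similarity depends only on the angle between vectors, a frequent word with a long vector and a rare word with a short one can still be close neighbors.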

### Word Analogies

In a similar spirit, one can play around with word analogies. For example, we can see if our model can guess what is to France what Berlin is to Germany.

This can be done with the analogies functionality. It takes a word triplet (like Berlin Germany France) and outputs the analogy:

In [18]:
model.get_analogies("berlin", "germany", "france")

[(0.8991923332214355, 'paris'),
 (0.7859027981758118, 'dubourg'),
 (0.7838226556777954, 'valenciennes'),
 (0.7820677757263184, 'faubourg'),
 (0.7773563265800476, 'maubourg'),
 (0.7770666480064392, 'louveciennes'),
 (0.7768223285675049, 'dessaulles'),
 (0.7706047892570496, 'beaubourg'),
 (0.7634151577949524, 'pompignan'),
 (0.7581263780593872, 'montpellier')]

The answer provided by our model is Paris, which is correct. Let's have a look at a less obvious example:

In [19]:
model.get_analogies("psx", "sony", "nintendo")

[(0.7998178005218506, 'dreamcast'),
 (0.7940496206283569, 'gamecube'),
 (0.7940258383750916, 'sega'),
 (0.7918456792831421, 'gba'),
 (0.7733192443847656, 'playstation'),
 (0.7727131247520447, 'nintendogs'),
 (0.767773449420929, 'arcade'),
 (0.7618235349655151, 'gameboy'),
 (0.7574592232704163, 'capcom'),
 (0.7501713037490845, 'famicom')]

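Our model considers that the Nintendo analogy of a PSX is the Dreamcast. That is a Sega console rather than a Nintendo one, but it is still a game console, and the GameCube follows very closely, so the answer seems reasonable. Of course, the quality of the analogies depends on the dataset used to train the model.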
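A common way to answer such queries is vector arithmetic: take the vector of the first word, subtract the second, add the third, and look for the closest remaining word. The sketch below follows that recipe on made-up 3-dimensional vectors. It is an assumption, not something the notebook confirms, that `get_analogies` works along these lines, and all helper names here are hypothetical.

```python
import math


def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(u, v):
    """Cosine similarity, computed as the dot product of unit vectors."""
    return sum(a * b for a, b in zip(unit(u), unit(v)))


def analogy(a, b, c, vectors, k=1):
    """Rank words by similarity to a - b + c, skipping the three inputs."""
    va, vb, vc = (unit(vectors[w]) for w in (a, b, c))
    query = [x - y + z for x, y, z in zip(va, vb, vc)]
    candidates = [
        (cosine(query, vec), word)
        for word, vec in vectors.items()
        if word not in (a, b, c)
    ]
    return sorted(candidates, reverse=True)[:k]


# Made-up vectors: axis 0 ~ "German", axis 1 ~ "French", axis 2 ~ "capital"
toy_vectors = {
    "germany": [1.0, 0.0, 0.0],
    "berlin": [1.0, 0.0, 1.0],
    "france": [0.0, 1.0, 0.0],
    "paris": [0.0, 1.0, 1.0],
    "montpellier": [0.0, 1.0, 0.2],
}
best = analogy("berlin", "germany", "france", toy_vectors)[0][1]
print(best)
```

Subtracting `germany` from `berlin` leaves roughly the "capital" direction, and adding `france` moves it to the French side, so the closest remaining word is the French capital.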

Using subword-level information is particularly interesting to build vectors for unknown words. For example, the word gearshift does not exist on Wikipedia but we can still query its closest existing words:

In [20]:
model.get_nearest_neighbors('gearshift')

[(0.7984798550605774, 'gearing'),
 (0.7829551696777344, 'gears'),
 (0.7524955868721008, 'daisywheel'),
 (0.7510121464729309, 'freewheel'),
 (0.7467069029808044, 'driveshaft'),
 (0.7466177344322205, 'cogwheels'),
 (0.7461075186729431, 'driveshafts'),
 (0.7454342246055603, 'flywheel'),
 (0.7410158514976501, 'flywheels'),
 (0.7373144626617432, 'wheelset')]

Most of the retrieved words share substantial substrings with gearshift, but a few are quite different, like cogwheels.
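One rough way to check this observation is to measure how many character n-grams each neighbor shares with the query. The sketch below uses hypothetical helpers (`char_ngrams`, `ngram_overlap`) with an assumed n-gram range of 3 to 6 characters, and scores a few of the neighbors listed above by Jaccard overlap. It only looks at spelling, not at the trained vectors.

```python
def char_ngrams(word, minn=3, maxn=6):
    """Character n-grams of a word wrapped in '<' and '>' markers."""
    wrapped = f"<{word}>"
    return {
        wrapped[start:start + n]
        for n in range(minn, maxn + 1)
        for start in range(len(wrapped) - n + 1)
    }


def ngram_overlap(a, b):
    """Jaccard overlap between the n-gram sets of two words (0.0 to 1.0)."""
    grams_a, grams_b = char_ngrams(a), char_ngrams(b)
    return len(grams_a & grams_b) / len(grams_a | grams_b)


# A few of the neighbors returned for 'gearshift' above
neighbors = ["gearing", "gears", "driveshaft", "cogwheels"]
for word in neighbors:
    print(f"{word:12s} {ngram_overlap('gearshift', word):.2f}")
```

A neighbor with zero overlap cannot have been retrieved through shared substrings alone, so its similarity must come from the contexts in which it appeared during training.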

Now that we have seen the benefit of subword information for unknown words, let's check how it compares to a model that does not use subword information. To train a model without subwords, set maxn to 0 and run the following command:

In [22]:
model_without_subwords = fasttext.train_unsupervised('data/data/fil9', maxn=0)
model_without_subwords.get_nearest_neighbors('accomodation')

[(0.80191969871521, 'tyresta'),
 (0.7933712601661682, 'uutela'),
 (0.7897980809211731, 'agrotourism'),
 (0.7783662676811218, 'hostelling'),
 (0.7750516533851624, 'laponian'),
 (0.7545826435089111, 'radzima'),
 (0.7535238265991211, 'rauland'),
 (0.752931535243988, 'tresticklan'),
 (0.7523895502090454, 'maranoa'),
 (0.7478064894676208, 'bothel')]

The result does not make much sense: most of these words are unrelated. On the other hand, using subword information gives the following list of nearest neighbors:

In [23]:
model.get_nearest_neighbors('accomodation')

[(0.9611985087394714, 'accomodations'),
 (0.9291858077049255, 'accommodation'),
 (0.9038296341896057, 'accommodations'),
 (0.8348402976989746, 'accommodative'),
 (0.7768381237983704, 'accommodating'),
 (0.7459931969642639, 'amenities'),
 (0.7376998662948608, 'lodging'),
 (0.7359251976013184, 'ammenities'),
 (0.7227441072463989, 'accomodated'),
 (0.7025332450866699, 'accomodate')]

The nearest neighbors capture different variations of the word accommodation. We also get semantically related words such as amenities or lodging.