### fastText

fastText is developed by Facebook. It is written in C++, which makes it fast, and it supports multithreading. The implementation covers several tasks:

* Unsupervised pretraining - skipgram and CBOW
* Supervised text classification
* Word embedding extraction
* Nearest-neighbour queries
* Relationship building (analogy)

The implementation also supports word2vec features such as:

* Negative sampling
* Hierarchical softmax

The core idea behind fastText is to break each word into character n-grams and learn a separate embedding for each n-gram; a word's vector is then built from the embeddings of its n-grams. Below we train on a Bengali corpus to extract the embeddings.

https://fasttext.cc/docs/en/unsupervised-tutorial.html
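To make the character n-gram idea concrete, here is a minimal sketch of how a word can be split into n-grams. It assumes the `<` and `>` boundary markers that fastText wraps around each word; the function name `char_ngrams` is hypothetical, and the real implementation additionally hashes the n-grams into a fixed number of buckets, which is not shown here.

```python
def char_ngrams(word, minn, maxn):
    """Return the character n-grams of `word` with lengths minn..maxn.

    Illustrative only: the word is wrapped in '<' and '>' so that
    prefixes and suffixes get their own n-grams.
    """
    wrapped = "<" + word + ">"
    ngrams = []
    for start in range(len(wrapped)):
        for n in range(minn, maxn + 1):
            if start + n <= len(wrapped):
                ngrams.append(wrapped[start:start + n])
    return ngrams


# Trigrams of an English word.
english = char_ngrams("where", 3, 3)

# With a minimum length of 5, a three-character Bengali word yields a
# single n-gram: the whole word with its boundary markers.
bengali = char_ngrams("আমি", 5, 10)
```

With a minimum length of 5, short words contribute very few n-grams, so most of the subword information comes from longer words.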


In [1]:
import pandas as pd

In [3]:
df = pd.read_csv('/Users/victor/Desktop/ben.txt',header=None)
df.head(5)

Unnamed: 0,0
0,আমার পেট ভরতি হয়ে গেছে।
1,আমি ঠিক বলতে পারবো না।
2,টম আমাকে ছেড়ে চলে গিয়েছিল।
3,একটা গাড়ী টমকে ধাক্কা মারল।
4,আমি এটা করতে পারব না।


We use pretrained vectors to fine-tune the word representations. We set the embedding dimension to 300, matching the pretrained vectors, the minimum character n-gram length to 5 and the maximum to 10.

In [11]:
!fastText-0.2.0/fasttext skipgram -input /Users/victor/Desktop/ben.txt -output /Users/victor/Desktop/ben_emb -minn 5 -maxn 10 -dim 300 -pretrainedVectors /Users/victor/Documents/Models/cc.bn.300.vec



Read 0M words
Number of words:  391
Number of labels: 0
Progress: 100.0% words/sec/thread:   14735 lr:  0.000000 loss:  0.185358 ETA:   0h 0m


In [12]:
!echo "আমি" | fastText-0.2.0/fasttext print-word-vectors /Users/victor/Desktop/ben_emb.bin

আমি -0.32931 0.18306 1.227 -0.58233 0.46813 -0.73084 0.31281 -0.32106 -0.02146 2.0329 0.37587 -1.145 1.5 0.14548 -0.12542 0.70051 -0.93667 0.64389 0.60648 0.45956 -0.31767 0.43078 -0.71775 0.066847 0.56818 0.41195 0.52704 0.7119 -1.4954 -0.85666 0.69183 -0.28549 -0.23158 0.61 -0.56593 0.3912 0.18647 -0.36789 0.011342 -0.71634 -0.89718 -0.7821 0.57323 0.57931 -0.36548 0.7917 -0.55869 0.68534 -0.26748 -0.50278 1.0366 -0.28582 1.1061 0.74846 0.57623 -0.73126 -0.36442 -0.30215 0.12427 -0.4112 -1.1561 0.91384 -0.10559 -0.57253 -0.54086 0.49857 0.83459 1.0644 -0.23765 0.56838 0.057203 -1.332 -0.34205 -0.27973 -0.13099 1.2957 1.0169 -0.97944 -1.8257 -0.4974 1.3325 -0.95574 -2.2218 1.5474 -0.95146 0.066847 1.6112 0.47921 -0.80834 0.89942 0.70476 0.92503 0.39083 -0.13574 0.3873 -0.018164 -0.16991 0.39689 0.77541 0.48988 -0.69433 0.63338 0.34728 0.57296 0.85976 -0.67876 1.9586 -0.66458 -0.081096 0.47647 -0.62881 -0.39466 0.94243 -0.70712 0.35204 0.14057 -0.44425 1.0913 -12.001 0.34625 -1.2778 0.

We run the following command on the command line:
        
        $ fastText-0.2.0/fasttext nn /Users/victor/Desktop/ben_emb.bin

and enter the query word "আমি" to list its nearest neighbours.
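A nearest-neighbour query of this kind can be sketched in plain Python by ranking every other word by cosine similarity to the query vector. The vectors below are toy values made up for illustration, not output of the trained model, and the helper names `cosine` and `nearest_neighbours` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def nearest_neighbours(query, vectors, k=10):
    """Return the k words most similar to `query`, best match first."""
    q = vectors[query]
    scored = [(word, cosine(q, vec))
              for word, vec in vectors.items() if word != query]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 2-dimensional vectors, for illustration only.
toy = {
    "a": [1.0, 0.0],
    "b": [0.9, 0.1],
    "c": [0.0, 1.0],
    "d": [-1.0, 0.0],
}
top2 = nearest_neighbours("a", toy, k=2)
```

This scans the whole vocabulary for each query, which is fine for a small model like the one trained here but slow for a vocabulary of millions of words.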


In [16]:
!fastText-0.2.0/fasttext cbow -input /Users/victor/Desktop/ben.txt -output /Users/victor/Desktop/ben_emb2 -minn 5 -maxn 10 -dim 300 -pretrainedVectors /Users/victor/Documents/Models/cc.bn.300.vec


Read 0M words
Number of words:  391
Number of labels: 0
Progress: 100.0% words/sec/thread:   30848 lr:  0.000000 loss:  0.399491 ETA:   0h 0m


Next we load the pretrained crawl vectors from https://fasttext.cc/docs/en/crawl-vectors.html and inspect the embeddings they contain.

In [19]:
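import io
from tqdm import tqdm

def load_vectors(fname):
    # Read a fastText .vec text file into a {word: vector} dict.
    with io.open(fname, 'r', encoding='utf-8', newline='\n', errors='ignore') as fin:
        # The first line holds the vocabulary size and the vector dimension.
        n, d = map(int, fin.readline().split())
        data = {}
        for line in tqdm(fin):
            tokens = line.rstrip().split(' ')
            # Store a list: a bare map() object can only be iterated once.
            data[tokens[0]] = list(map(float, tokens[1:]))
    return data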

In [20]:
ben_pretrained = load_vectors('/Users/victor/Documents/Models/cc.bn.300.vec')

1468578it [14:13, 1720.58it/s] 
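Loading all 1.4 million entries takes several minutes. For quick experiments it can be enough to read only the first few entries of the file. The sketch below assumes the `.vec` text layout used above (a header line with the word count and dimension, then one word per line followed by its values); the function name `load_first_vectors` and the in-memory sample are hypothetical stand-ins for the real file.

```python
import io
from itertools import islice


def load_first_vectors(fin, limit):
    """Parse at most `limit` entries from an open .vec-style text stream."""
    # Header line: vocabulary size and vector dimension.
    n, d = map(int, fin.readline().split())
    data = {}
    for line in islice(fin, limit):
        tokens = line.rstrip().split(' ')
        data[tokens[0]] = [float(x) for x in tokens[1:]]
    return n, d, data


# Tiny made-up sample in the same layout, used instead of the real file.
sample = io.StringIO(
    "3 2\n"
    "আমি 0.1 0.2\n"
    "টম -0.3 0.4\n"
    "না 0.5 -0.6\n"
)
n_words, dim, vecs = load_first_vectors(sample, limit=2)
```

To use it on the real file, pass an open file handle, for example one created with `open(path, encoding='utf-8')`, in place of the `StringIO` sample.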


In [21]:
len(ben_pretrained)

1468578

In [29]:
list(ben_pretrained.keys())[-5:]

['ঢোকাছছেন', 'স্বচ্ছন্দের', 'স্কুলবছর', 'kolkata24x7Latest', 'শানালেও']

In [31]:
list(list(ben_pretrained.values())[-1])

[-0.0029,
 0.0178,
 0.04,
 -0.0083,
 0.0025,
 -0.0022,
 0.0175,
 -0.0057,
 -0.0205,
 0.0169,
 0.0247,
 0.0003,
 -0.0401,
 0.0082,
 -0.0165,
 -0.0071,
 0.01,
 0.0209,
 -0.0172,
 -0.0102,
 -0.0162,
 -0.0081,
 -0.0231,
 -0.0238,
 0.0389,
 -0.0016,
 -0.0034,
 0.0103,
 0.0456,
 0.0484,
 -0.0179,
 0.0321,
 -0.0137,
 -0.0051,
 -0.039,
 -0.0239,
 0.0026,
 0.0153,
 0.0408,
 0.0368,
 0.0239,
 -0.0076,
 -0.0049,
 -0.0497,
 0.0061,
 -0.0047,
 0.0029,
 -0.0043,
 0.0064,
 -0.0047,
 0.0059,
 -0.0064,
 -0.0323,
 0.0185,
 0.0053,
 0.0203,
 0.0148,
 0.0,
 -0.0236,
 0.0305,
 0.0144,
 0.0081,
 -0.0061,
 -0.0255,
 -0.0134,
 -0.0148,
 0.0028,
 0.0286,
 0.021,
 -0.009,
 0.0046,
 0.0019,
 -0.0,
 -0.0315,
 0.0129,
 0.0198,
 -0.0159,
 0.0011,
 -0.0353,
 -0.0179,
 -0.0218,
 0.0515,
 0.0061,
 -0.008,
 0.009,
 0.0035,
 0.0099,
 -0.0133,
 0.0166,
 0.0176,
 -0.0524,
 -0.0048,
 0.0095,
 0.0548,
 0.0122,
 -0.0002,
 -0.0139,
 0.0124,
 0.0304,
 -0.0016,
 -0.0064,
 0.0042,
 -0.015,
 0.0231,
 -0.017,
 -0.0266,
 0.0128,
 -