### Spotlight on FastText:

-- Aswin Periyadan Kadinjapali

FastText is a library created by the Facebook Research Team for efficient learning of word representations and sentence classification.

To install FastText, use:

- pip install fasttext

If the above method does not work, build it from source instead:

- git clone https://github.com/facebookresearch/fastText.git

- cd fastText

- make (builds the command-line tool)

- pip install . (builds and installs the Python module)

To check that fasttext has been installed properly, try importing it in your Python notebook as shown below:

In [38]:
import fasttext
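
If you would rather check for the package without risking an ImportError in the middle of a notebook run, a small helper like the one below can do it. This is only a sketch: `is_installed` is a name made up for this example, and it relies on nothing beyond the standard library.

```python
import importlib.util


def is_installed(module_name):
    """Return True if `module_name` can be imported in this environment."""
    return importlib.util.find_spec(module_name) is not None


if is_installed("fasttext"):
    print("fasttext is available")
else:
    print("fasttext is missing; install it with: pip install fasttext")
```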

### How is FastText different from gensim Word Vectors?

FastText differs from word2vec-style word vectors in what it treats as the smallest unit. word2vec learns one vector per whole word, whereas FastText treats a word as a bag of character n-grams. With n = 3, for example, sunny is first padded with boundary markers to `<sunny>` and then split into `<su`, `sun`, `unn`, `nny` and `ny>`. The range of n is configurable; by default fastText uses n-grams of 3 to 6 characters, in addition to the whole word itself. This representation gives fastText the following benefits over word2vec or GloVe.

1. It finds better vector representations for rare words, because a rare word shares its character n-grams with more frequent words.

2. It can give vector representations for words not present in the dictionary (OOV words), since these can also be broken down into character n-grams. word2vec and GloVe cannot produce a vector for a word outside their dictionary.

3. Character n-gram embeddings tend to outperform word2vec and GloVe on smaller datasets.
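
To make the character n-gram idea concrete, here is a minimal sketch of how a word can be split into n-grams. It assumes the `<` and `>` boundary markers and the default range of 3 to 6 characters; `char_ngrams` is a helper written for this example, not a function of the fasttext module.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of `word`, padded with < and > markers."""
    padded = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of width n over the padded word.
        for start in range(len(padded) - n + 1):
            grams.append(padded[start:start + n])
    return grams


print(char_ngrams("sunny", min_n=3, max_n=3))
# ['<su', 'sun', 'unn', 'nny', 'ny>']
```

Because a rare or unseen word such as "sunnier" shares n-grams like `sun` and `unn` with "sunny", it can borrow part of its representation from them.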

FastText can learn word representations with either of the two standard word-vector training methods: Skipgram and CBOW.

In [40]:
# Download the sample text file used here from: https://drive.google.com/file/d/0B8kCsmD4QAr7SC1wV3N2c1N3TkE/view?usp=sharing
# You can also train the word representation model on any text file of your choice.
#
# file.txt is a training file containing UTF-8 encoded text.
# The returned model object represents your learned model, and you can use it to retrieve information.

# Skipgram
model = fasttext.train_unsupervised('file.txt', model='skipgram')

# CBOW (uncomment to train with CBOW instead of Skipgram)
# model = fasttext.train_unsupervised('file.txt', model='cbow')
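
The difference between the two modes is easiest to see in the training examples each one builds from a sentence. The sketch below is a simplification for illustration only: it uses a fixed window and ignores subwords and negative sampling, and the function names are made up for this example.

```python
def skipgram_pairs(tokens, window=2):
    """Skip-gram: each (center, context) pair is one training example."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def cbow_examples(tokens, window=2):
    """CBOW: the whole context window jointly predicts the center word."""
    examples = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        examples.append((context, center))
    return examples


tokens = "the quick brown fox".split()
print(skipgram_pairs(tokens, window=1))
print(cbow_examples(tokens, window=1))
```

Skip-gram produces one example per context word, while CBOW produces one example per position.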

### Saving and loading a model object:
You can save your trained model object by calling save_model, and load it back later with fasttext.load_model.

In [7]:
# Save the model; "model.bin" is the file name we want to save it as.
model.save_model("model.bin")

# Load the saved model back into a model object.
model = fasttext.load_model("model.bin")

### Get the list of words in the dictionary and their vector representations

In [15]:
# to get the list of words in the dictionary
print("List of words in Dictionary: \n", model.words)
# to get the vector of any word.
print("\n Word vector of 'test' : \n",model["test"])

List of words in Dictionary: 
 ['the', '</s>', 'of', 'and', 'a', 'to', 'in', 'is', 'that', 'model', 'for', 'words', 'as', 'be', 'vector', 'The', 'word', 'can', 'with', 'word2vec', 'are', 'by', 'context', 'syntactic', 'semantic', 'number', 'on', 'Word2vec', 'quality', 'which', 'similar', 'more', 'window', 'words.', 'or', 'training', 'skip-gram', 'test', 'accuracy']

 Word vector of 'test' : 
 [-3.08876904e-03  1.40953832e-03  5.73646976e-03 -4.56001202e-04
  9.79778240e-04  3.26184067e-03 -5.62256295e-03 -4.54788515e-03
 -3.70595662e-04  6.51564449e-03 -2.73235748e-03  4.77137015e-04
  7.69778999e-05  2.12496007e-03 -1.21135935e-02  4.71472507e-04
 -1.07827218e-04 -1.59196241e-03  8.50300398e-03  8.24919343e-03
  6.37742365e-03 -3.52037488e-04 -6.14045374e-03 -5.98106380e-05
 -2.49465695e-03  3.22649628e-03 -2.52983358e-04 -3.95103637e-03
 -5.15940250e-04 -6.58672582e-03 -8.85066669e-03  2.23105447e-03
  3.99405416e-03 -9.83050559e-04  1.19479769e-03  1.46879244e-03
  4.82914550e-03 -1.

### Finding similar words

You can also find the words most similar to a given word. This functionality is provided by the nn subcommand of the command-line tool. (Older versions of the Python bindings did not expose it; recent releases of the fasttext module provide model.get_nearest_neighbors for the same purpose.) Let’s see how we can find the words most similar to “happy”.

./fasttext nn model.bin

After typing the above command, the terminal will ask you to input a query word.

--> "happy"

by 0.183204

be 0.0822266

training 0.0522333

the 0.0404951

similar 0.036328


The output above lists the words most similar to "happy" together with their similarity scores. Interestingly, this feature can also be used to correct spellings: when you enter a misspelled word, the correctly spelled word shows up among the nearest neighbours, provided it occurred in the training file.

--> "wrd"

word 0.481091

words. 0.389373

words 0.370469

word2vec 0.354458

more 0.345805
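
Queries like the ones above rank the vocabulary by how close each word vector is to the query vector. Here is a minimal sketch of that ranking, assuming cosine similarity as the measure. The vectors are invented 2-d values, and `cosine` and `nearest` are helpers written for this example rather than part of fastText.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest(query, vectors, k=3):
    """Return the k words whose vectors are most similar to vectors[query]."""
    scores = [(cosine(vectors[query], vec), word)
              for word, vec in vectors.items() if word != query]
    return [(word, score) for score, word in sorted(scores, reverse=True)[:k]]


# Hypothetical 2-d vectors, made up for illustration only.
toy_vectors = {
    "word": [1.0, 0.0],
    "words": [0.9, 0.1],
    "more": [0.0, 1.0],
}
print(nearest("word", toy_vectors, k=2))
```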

### Word representation of unknown or rare words:
As mentioned before, you can also query for words that did not appear in your data (out-of-vocabulary words). This works because a word is represented by the sum of the vectors of its substrings: as long as the unknown word is made of known substrings, there is a representation for it!

In [45]:
print(model["spotlight"]) # "spotlight" is not present in the dictionary as we can see from the words list in dictionary

[ 2.2415316e-03  9.0507208e-05 -1.7831399e-03  2.7298834e-04
  1.4404415e-03 -1.8822071e-03 -1.5137780e-03  6.4485741e-04
  5.0046685e-04 -1.4109930e-03 -2.7662120e-04 -1.5640133e-03
 -1.5543110e-03  1.9397345e-04 -1.1217052e-03  2.5798799e-04
 -1.2204000e-03 -7.0091290e-04 -8.1962731e-04 -3.0692064e-04
 -4.5066726e-04 -4.7739866e-04 -9.3657913e-04 -7.9209841e-04
 -6.1361020e-04  1.7475449e-04 -8.8623620e-04 -2.7449889e-04
  6.8193569e-04  1.8674624e-04  1.8099083e-03 -1.1102229e-03
  6.2761077e-04 -1.0808907e-03 -6.9864187e-04  1.3800139e-03
 -5.0646992e-04  2.6524949e-04  1.3766529e-03  7.5645803e-04
 -2.8954464e-04 -1.9392713e-03 -3.0345080e-04 -1.5006375e-03
 -7.3841045e-04  1.9331733e-04  6.2301355e-05  7.0502405e-04
 -1.3981971e-03 -1.4167129e-04 -1.6719601e-04  2.3558081e-04
 -1.2008219e-03 -7.8679627e-04  5.4402684e-04 -3.3277075e-04
  1.2448290e-03 -2.6569542e-04 -8.8291609e-04 -4.8254887e-04
  7.0571143e-04  2.0886445e-03 -4.6093765e-04  2.5633629e-04
  1.1041756e-03 -2.03579
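
To see why a vector exists even for an unseen word, here is a rough sketch of the lookup. It assumes each character n-gram is hashed into a fixed table of bucket vectors and the results are averaged. The tiny table, the helper names and the use of FNV-1a hashing are simplifications chosen for this example, not a copy of fastText's implementation.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Character n-grams of the word padded with < and > markers."""
    padded = f"<{word}>"
    return [padded[i:i + n]
            for n in range(min_n, max_n + 1)
            for i in range(len(padded) - n + 1)]


def fnv1a(text):
    """32-bit FNV-1a hash, used here to map an n-gram to a bucket."""
    h = 2166136261
    for byte in text.encode("utf-8"):
        h = ((h ^ byte) * 16777619) & 0xFFFFFFFF
    return h


def oov_vector(word, bucket_vectors):
    """Average the bucket vectors of all character n-grams of `word`."""
    dim = len(bucket_vectors[0])
    grams = char_ngrams(word)
    if not grams:
        return [0.0] * dim
    total = [0.0] * dim
    for gram in grams:
        vec = bucket_vectors[fnv1a(gram) % len(bucket_vectors)]
        total = [t + x for t, x in zip(total, vec)]
    return [t / len(grams) for t in total]


# Hypothetical table of 8 buckets with 4-d vectors (made-up values).
buckets = [[float(b + d) for d in range(4)] for b in range(8)]
print(oov_vector("spotlight", buckets))
```

Every string maps to some set of buckets, so the lookup never fails; how meaningful the result is depends on whether those n-grams were seen during training.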

### Text Classification:
Text classification means assigning a class label to each document in a collection. Sentiment analysis and email classification are classic examples of text classification.
The data for this analysis is taken from Kaggle (https://www.kaggle.com/bittlingmayer/amazonreviews).

#### Training :

In [19]:
# Using fastText for text classification
#
# train_supervised('filepath'):
# Train a supervised model and return a model object. The input must be a file path.
# The input text does not need to be tokenized as per the tokenize function,
# but it must be preprocessed and encoded as UTF-8.
#
# The dataset has two labels: __label__1 marks negative reviews and __label__2 positive ones.
supervised_model = fasttext.train_supervised('train.ft.txt/train.ft.txt')
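
If you want to train on your own data, each example has to sit on one line, with its label written as a `__label__` token next to the text. The sketch below shows one way to produce and read back such lines; the two reviews are invented and the helper names are made up for this example.

```python
def to_fasttext_line(label, text, prefix="__label__"):
    """Format one example as '<prefix><label> <text>' on a single line."""
    single_line = " ".join(text.split())  # collapse newlines and extra spaces
    return f"{prefix}{label} {single_line}"


def parse_fasttext_line(line, prefix="__label__"):
    """Split a training line back into (labels, text)."""
    tokens = line.split()
    labels = [t[len(prefix):] for t in tokens if t.startswith(prefix)]
    words = [t for t in tokens if not t.startswith(prefix)]
    return labels, " ".join(words)


# Hypothetical reviews, invented for illustration.
examples = [(2, "Great book,\nloved it"), (1, "Stopped working after a week")]
lines = [to_fasttext_line(label, text) for label, text in examples]
print("\n".join(lines))
```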

#### Saving Model : 

In [20]:
supervised_model.save_model('sup_model.bin')

#### Get the list of words in dictionary and labels

In [28]:
print("List of words in the dictionary(first 100): \n",supervised_model.words[:100])

print("\n List of labels in the model: \n",supervised_model.labels)

List of words in the dictionary(first 100): 
 ['the', 'and', 'I', 'to', 'a', 'of', 'is', 'it', 'this', '</s>', 'in', 'for', 'that', 'was', 'with', 'you', 'not', 'on', 'have', 'but', 'The', 'my', 'are', 'as', 'book', 'be', 'This', 'one', 'like', 'so', 'It', 'from', 'very', 'at', 'all', 'just', 'or', 'would', 'they', 'about', 'an', 'has', 'good', 'had', 'will', 'out', 'more', 'by', 'get', 'if', 'great', 'your', 'can', 'only', 'what', 'when', 'me', 'up', 'his', 'really', 'than', 'some', 'no', 'read', 'it.', 'who', 'other', 'A', 'he', 'because', 'much', 'were', 'even', '-', 'do', 'her', "don't", 'time', 'been', 'first', "it's", 'i', 'If', 'movie', 'their', 'these', 'which', 'am', 'any', 'there', 'them', 'how', 'love', 'could', 'after', 'bought', 'buy', 'we', 'into', 'use']

 List of labels in the model: 
 ['__label__2', '__label__1']


### Testing

In [31]:
# testing the supervised learning model.
Results = supervised_model.test("test.ft.txt/test.ft.txt")
print("Number of examples:  " , Results[0])
print("Precision P@1:  ", Results[1])
print("Recall R@1:  ", Results[2])

Number of examples:   400000
Precision P@1:   0.91624
Recall R@1:   0.91624
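
Precision and recall come out identical here, and the sketch below shows why. It computes P@k and R@k by hand on four made-up examples: when every example has exactly one true label and the model returns exactly one prediction, both ratios share the same numerator and denominator. `precision_recall_at_k` is a helper written for this illustration, not the code fastText runs internally.

```python
def precision_recall_at_k(true_labels, ranked_predictions, k=1):
    """Compute P@k and R@k over a list of examples.

    true_labels: list of sets of gold labels, one set per example.
    ranked_predictions: list of label lists, best prediction first.
    """
    hits = predicted = gold = 0
    for truth, ranked in zip(true_labels, ranked_predictions):
        top_k = ranked[:k]
        hits += sum(1 for label in top_k if label in truth)
        predicted += len(top_k)  # denominator of precision
        gold += len(truth)       # denominator of recall
    return hits / predicted, hits / gold


# Hypothetical gold labels and predictions for four reviews.
truth = [{"__label__2"}, {"__label__1"}, {"__label__2"}, {"__label__1"}]
preds = [["__label__2"], ["__label__1"], ["__label__1"], ["__label__1"]]
print(precision_recall_at_k(truth, preds, k=1))
# (0.75, 0.75)
```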


### Predicting Labels

In [37]:
# Predicting the label of a sentence using the trained model:
prediction = supervised_model.predict("This Product is Nice. I liked it a lot")
print("Predicted Label: ", prediction[0][0], " with probability: ", prediction[1][0])

Predicted Label:  __label__2  with probability:  0.9995657801628113
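
The raw `__label__` strings are not very readable, so you may want to map them to names. The sketch below does that for a (labels, probabilities) pair of the same shape as the prediction above. The value is hard-coded so the example runs without a trained model, and `describe` and `SENTIMENT` are names invented for this illustration.

```python
# Mapping taken from the dataset description above; adjust for other data.
SENTIMENT = {"__label__1": "negative", "__label__2": "positive"}


def describe(prediction, names=SENTIMENT):
    """Turn a (labels, probabilities) pair into (name, probability) tuples."""
    labels, probabilities = prediction
    return [(names.get(label, label), float(p))
            for label, p in zip(labels, probabilities)]


# Same shape as the prediction shown above, hard-coded for the sketch.
example = (("__label__2",), [0.9995657801628113])
print(describe(example))
```

With a real model you could pass `supervised_model.predict(text, k=2)` to get both labels ranked by probability.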


### Computing Sentence Vectors (Supervised):
The supervised model (supervised_model) can also be used to compute sentence vectors. Let us see how to do this by calling .get_sentence_vector('your sentence here').

In [33]:
# Get sentence vector representation.
supervised_model.get_sentence_vector("this is a sample sentence")

array([ 0.00074453,  0.00030297, -0.03772879,  0.01707233, -0.01307769,
       -0.01404042, -0.0051821 , -0.00590544,  0.08188382,  0.07856984,
        0.00419959,  0.04076321, -0.02195416,  0.00213791,  0.0328033 ,
       -0.00996574,  0.00589274,  0.00586095, -0.02743017, -0.02000796,
        0.03998065,  0.00544461, -0.01056082,  0.03092726,  0.0228732 ,
        0.00456057,  0.06740533,  0.00912136, -0.02507113,  0.03563398,
        0.00396628,  0.05174385, -0.03168781, -0.06196466, -0.04221461,
       -0.00791785,  0.02250962, -0.03556334,  0.06457132,  0.0332955 ,
       -0.0010088 ,  0.03508161,  0.0079563 ,  0.03280217, -0.01982999,
        0.02058018, -0.02008981,  0.04603979,  0.03482691,  0.02149935,
       -0.0143581 ,  0.04894133, -0.01198739,  0.00143604, -0.00303581,
       -0.017317  ,  0.04596366,  0.0090891 , -0.04792043,  0.00875898,
       -0.01712944,  0.02453457,  0.08769146,  0.00022876, -0.05284953,
        0.04763065,  0.01721057,  0.05324742,  0.0639897 , -0.02
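
Conceptually, a sentence vector of this kind is a combination of the vectors of the tokens in the sentence. The sketch below assumes the simplest version, a plain average of word vectors, and uses invented 3-d values. The real method also takes subword information into account, so treat this as an illustration of the idea rather than a reimplementation.

```python
def average_vectors(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


# Hypothetical 3-d word vectors, made up for illustration.
toy = {
    "this": [0.2, 0.0, 0.4],
    "is": [0.0, 0.2, 0.0],
    "sample": [0.4, 0.4, 0.2],
}
sentence = "this is sample"
print(average_vectors([toy[w] for w in sentence.split()]))
```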

### Compress model files:
When you want to save a supervised model file, fastText can compress it to produce a much smaller model file at the cost of only a small loss in performance.

In [None]:
# with the previously trained `supervised_model` object, call:
supervised_model.quantize(input='train.ft.txt/train.ft.txt', retrain=True)
supervised_model.save_model("model_supervised_quantized.ftz")

'model_supervised_quantized.ftz' will be much smaller than 'sup_model.bin'.
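
To check how much space quantization actually saved on your machine, you can compare the two files on disk. This is a small sketch using only the standard library: `size_reduction` is a helper made up for this example, and the file names are the ones used earlier in this notebook.

```python
import os


def size_reduction(original_path, compressed_path):
    """Return (original_bytes, compressed_bytes, fraction_saved)."""
    original = os.path.getsize(original_path)
    compressed = os.path.getsize(compressed_path)
    saved = 1 - compressed / original if original else 0.0
    return original, compressed, saved


full, small = "sup_model.bin", "model_supervised_quantized.ftz"
if os.path.exists(full) and os.path.exists(small):
    before, after, saved = size_reduction(full, small)
    print(f"{before} bytes -> {after} bytes ({saved:.1%} smaller)")
```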

### Conclusion:
The library is very fast compared with other methods that reach the same accuracy, and it makes computing sentence vectors (supervised) easy. FastText works better than gensim's word2vec vectors on small datasets; it performs better on syntactic tasks and fares equally well on semantic ones.

FastText has recently gained popularity over gensim for word representation. Its ability to return a representation for a word that is not present in the dictionary comes in handy where most other representation models fail.

The ability to vectorize an entire sentence is another powerful feature of fastText. It simplifies sentence classification, which in turn helps with document summarization and document classification, both of which are generally difficult tasks.