<a href="https://colab.research.google.com/github/raj-vijay/nl/blob/master/05.%20Understanding_Emotions_in_Tweets.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**fastText**

<p align = 'justify'>fastText (https://pypi.org/project/fasttext/) is a library for efficient learning of word representations and sentence classification.

In this lab, you will train and test a fastText text classifier, mainly using
the train_supervised function, which returns a model object, and then call test and predict on that object.</p>

**Data**

We are interested in building a classifier to automatically recognize the topic of a Stack Exchange question about cooking. Let's download examples of questions
from the cooking section of Stack Exchange (https://cooking.stackexchange.com/), and their associated tags:

>> wget https://dl.fbaipublicfiles.com/fasttext/data/cooking.stackexchange.tar.gz && tar xvzf cooking.stackexchange.tar.gz
>> head cooking.stackexchange.txt

Each line of the text file contains a list of labels, followed by the corresponding document. All the labels start with the __label__ prefix, which is how fastText recognizes what is a label and what is a word. The model is then trained to predict the labels given the words in the document.
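
To make the file format concrete, here is a minimal sketch that separates labels from words in a single line. The helper name `parse_line` is hypothetical and is not part of the fastText API; it only illustrates the prefix convention described above.

```python
def parse_line(line, prefix="__label__"):
    """Split one fastText-style line into (labels, text).

    Hypothetical helper for illustration only: any whitespace-separated
    token starting with the prefix is treated as a label.
    """
    labels, words = [], []
    for token in line.split():
        if token.startswith(prefix):
            labels.append(token[len(prefix):])
        else:
            words.append(token)
    return labels, " ".join(words)


labels, text = parse_line(
    "__label__sauce __label__cheese How much does potato starch "
    "affect a cheese sauce recipe?"
)
print(labels)
print(text)
```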

Before training our first classifier, we need to split the data into train and validation. We will use the validation set to evaluate how good the learned classifier is on new data.

>> wc cooking.stackexchange.txt

Our full dataset contains 15404 examples. Let's split it into a training set of 12404 examples and a validation set of 3000 examples:

>> head -n 12404 cooking.stackexchange.txt > cooking.train
>> tail -n 3000 cooking.stackexchange.txt > cooking.valid
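
The same split can be sketched in Python. The function name `split_head_tail` is an assumption made for this illustration, and the toy list stands in for the lines of cooking.stackexchange.txt. Because 12404 + 3000 = 15404, the head and tail slices of the real file do not overlap.

```python
def split_head_tail(lines, n_train, n_valid):
    """Mimic `head -n n_train` and `tail -n n_valid` on a list of lines."""
    train = lines[:n_train]
    valid = lines[-n_valid:] if n_valid > 0 else []
    return train, valid


# Toy data standing in for the real file (10 lines instead of 15404).
toy = [f"__label__x question {i}" for i in range(10)]
train, valid = split_head_tail(toy, n_train=7, n_valid=3)
print(len(train), len(valid))
```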

In [None]:
! wget https://dl.fbaipublicfiles.com/fasttext/data/cooking.stackexchange.tar.gz && tar xvzf cooking.stackexchange.tar.gz

--2021-05-25 12:33:44--  https://dl.fbaipublicfiles.com/fasttext/data/cooking.stackexchange.tar.gz
Resolving dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)... 172.67.9.4, 104.22.75.142, 104.22.74.142, ...
Connecting to dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)|172.67.9.4|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 457609 (447K) [application/x-tar]
Saving to: ‘cooking.stackexchange.tar.gz’


2021-05-25 12:33:45 (1.19 MB/s) - ‘cooking.stackexchange.tar.gz’ saved [457609/457609]

cooking.stackexchange.id
cooking.stackexchange.txt
readme.txt


In [None]:
!head cooking.stackexchange.txt

__label__sauce __label__cheese How much does potato starch affect a cheese sauce recipe?
__label__food-safety __label__acidity Dangerous pathogens capable of growing in acidic environments
__label__cast-iron __label__stove How do I cover up the white spots on my cast iron stove?
__label__restaurant Michelin Three Star Restaurant; but if the chef is not there
__label__knife-skills __label__dicing Without knife skills, how can I quickly and accurately dice vegetables?
__label__storage-method __label__equipment __label__bread What's the purpose of a bread box?
__label__baking __label__food-safety __label__substitutions __label__peanuts how to seperate peanut oil from roasted peanuts at home?
__label__chocolate American equivalent for British chocolate terms
__label__baking __label__oven __label__convection Fan bake vs bake
__label__sauce __label__storage-lifetime __label__acidity __label__mayonnaise Regulation and balancing of readymade packed mayonnaise and other sauces


In [None]:
!wc cooking.stackexchange.txt

  15404  169582 1401900 cooking.stackexchange.txt


In [None]:
!head -n 12404 cooking.stackexchange.txt > cooking.train

In [None]:
!tail -n 3000 cooking.stackexchange.txt > cooking.valid

**Installation**

fastText builds on modern macOS and Linux distributions. Since it uses C++11
features, it requires a compiler with good C++11 support. You will
need Python (version 2.7 or ≥ 3.4), NumPy, SciPy, and pybind11.

In [None]:
!pip install fasttext

Collecting fasttext
  Downloading https://files.pythonhosted.org/packages/f8/85/e2b368ab6d3528827b147fdb814f8189acc981a4bc2f99ab894650e05c40/fasttext-0.9.2.tar.gz (68kB)
Building wheels for collected packages: fasttext
  Building wheel for fasttext (setup.py) ... done
  Created wheel for fasttext: filename=fasttext-0.9.2-cp37-cp37m-linux_x86_64.whl size=3094722 sha256=5dda1a8b341f60199478bf8645c15b810dc3fa54a4f860de2e2ff50ad8aed7bb
  Stored in directory: /root/.cache/pip/wheels/98/ba/7f/b154944a1cf5a8cee91c

In [None]:
import fasttext

In [None]:
help(fasttext.FastText)

Help on module fasttext.FastText in fasttext:

NAME
    fasttext.FastText

DESCRIPTION
    # Copyright (c) 2017-present, Facebook, Inc.
    # All rights reserved.
    #
    # This source code is licensed under the MIT license found in the
    # LICENSE file in the root directory of this source tree.

FUNCTIONS
    cbow(*kargs, **kwargs)
    
    eprint(*args, **kwargs)
    
    load_model(path)
        Load a model given a filepath and return a model object.
    
    read_args(arg_list, arg_dict, arg_names, default_values)
    
    skipgram(*kargs, **kwargs)
    
    supervised(*kargs, **kwargs)
    
    tokenize(text)
        Given a string of text, tokenize it and return a list of tokens
    
    train_supervised(*kargs, **kwargs)
        Train a supervised model and return a model object.
        
        input must be a filepath. The input text does not need to be tokenized
        as per the tokenize function, but it must be preprocessed and encoded
        as UTF-8. You might wan

**Preprocessing the data**

Looking at the data, we observe that some words contain uppercase letters or
punctuation. One of the first steps to improve the performance of our model is to apply some simple pre-processing. A crude normalization can be obtained using command line tools such as sed and tr:

In [None]:
!cat cooking.stackexchange.txt | sed -e "s/\([.\!?,'/()]\)/ \1 /g" | tr "[:upper:]" "[:lower:]" > cooking.preprocessed.txt
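
For readers who prefer to stay in Python, the following is an approximate equivalent of the sed and tr pipeline, not an exact port. The function name `normalize` is hypothetical, and `str.lower()` may treat non-ASCII characters differently from `tr`.

```python
import re

# Punctuation characters to surround with spaces (assumed to mirror
# the character class used in the sed expression).
_PUNCT = re.compile(r"([.!?,'/()])")


def normalize(text):
    """Pad selected punctuation with spaces, then lowercase everything."""
    return _PUNCT.sub(r" \1 ", text).lower()


print(normalize("__label__bread What's the purpose of a bread box?"))
```

Note that the labels are lowercased too, just as in the shell version, so training and validation files should be produced by the same normalization.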

In [None]:
!head -n 12404 cooking.preprocessed.txt > cooking.train
!tail -n 3000 cooking.preprocessed.txt > cooking.valid

Let's train a new model on the pre-processed data.

In [None]:
model = fasttext.train_supervised(input="cooking.train")

<p align = 'justify'>By default, fastText sees each training example only five times during training, which is quite few given that our training set has only about 12k examples. The number of times each example is seen (also known as the number of epochs) can be increased using the epoch argument (the -epoch option on the command line):</p>

In [None]:
model = fasttext.train_supervised(input="cooking.train", epoch=25)

<p align = 'justify'>You can improve the performance of a model by using word bigrams, instead of just unigrams. This is especially important for classification problems where word order is important, such as sentiment analysis.</p>

In [None]:
model = fasttext.train_supervised(input="cooking.train", lr=1.0, epoch=25, wordNgrams=2)
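
To illustrate what word bigrams add, the sketch below lists the unigrams and bigrams of a tokenized sentence. It is only a conceptual illustration: the helper `word_ngrams` is hypothetical, and fastText's internal handling of n-grams differs in its details.

```python
def word_ngrams(tokens, n=2):
    """Return all word n-grams of orders 1..n as space-joined strings."""
    grams = []
    for order in range(1, n + 1):
        for i in range(len(tokens) - order + 1):
            grams.append(" ".join(tokens[i:i + order]))
    return grams


print(word_ngrams("fan bake vs bake".split()))
# ['fan', 'bake', 'vs', 'bake', 'fan bake', 'bake vs', 'vs bake']
```

With unigrams alone, "fan bake vs bake" and "bake vs fan bake" look identical to the model; the bigrams preserve some of the local word order.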

<p align = 'justify'>Apply the different training configurations explained above and compare the results by evaluating each model with the following command:</p>

In [None]:
model.test("cooking.valid")

(3000, 0.5996666666666667, 0.2593340060544904)
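
The tuple returned by test holds the number of examples (3000), the precision at one, and the recall at one. The sketch below shows how such figures are commonly computed from predicted and true labels. It is an illustration under that assumption, not fastText's own implementation, and the function name and toy data are hypothetical.

```python
def precision_recall_at_k(predictions, gold):
    """Compute precision and recall over a set of examples.

    predictions: list of predicted-label lists (top k per example).
    gold: list of true-label sets, one per example.
    """
    n_correct = sum(len(set(p) & set(g)) for p, g in zip(predictions, gold))
    n_predicted = sum(len(p) for p in predictions)
    n_gold = sum(len(g) for g in gold)
    return n_correct / n_predicted, n_correct / n_gold


# Toy example with k = 1: one predicted label per question.
preds = [["baking"], ["sauce"], ["bread"]]
truth = [{"baking", "oven"}, {"cheese"}, {"bread", "storage-method", "equipment"}]
p_at_1, r_at_1 = precision_recall_at_k(preds, truth)
print(p_at_1, r_at_1)
```

Because many questions carry several labels while only one label is predicted per question at k = 1, recall is expected to be noticeably lower than precision.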