# fastText

[fastText](https://fasttext.cc/) is a library for efficient learning of word representations and sentence classification.

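This notebook follows the official fastText [text classification tutorial](https://fasttext.cc/docs/en/supervised-tutorial.html), using the Python bindings instead of the command-line tool.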

In [2]:
pip install fasttext

Note: you may need to restart the kernel to use updated packages.


### Getting and preparing the data

In [3]:
!wget https://dl.fbaipublicfiles.com/fasttext/data/cooking.stackexchange.tar.gz && tar xvzf cooking.stackexchange.tar.gz

--2023-07-01 06:30:58--  https://dl.fbaipublicfiles.com/fasttext/data/cooking.stackexchange.tar.gz
Resolving dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)... 99.84.50.112, 99.84.50.9, 99.84.50.102, ...
Connecting to dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)|99.84.50.112|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 457609 (447K) [application/x-tar]
Saving to: ‘cooking.stackexchange.tar.gz’


2023-07-01 06:31:02 (1.05 MB/s) - ‘cooking.stackexchange.tar.gz’ saved [457609/457609]

cooking.stackexchange.id
cooking.stackexchange.txt
readme.txt


In [4]:
!head cooking.stackexchange.txt

__label__sauce __label__cheese How much does potato starch affect a cheese sauce recipe?
__label__food-safety __label__acidity Dangerous pathogens capable of growing in acidic environments
__label__cast-iron __label__stove How do I cover up the white spots on my cast iron stove?
__label__restaurant Michelin Three Star Restaurant; but if the chef is not there
__label__knife-skills __label__dicing Without knife skills, how can I quickly and accurately dice vegetables?
__label__storage-method __label__equipment __label__bread What's the purpose of a bread box?
__label__baking __label__food-safety __label__substitutions __label__peanuts how to seperate peanut oil from roasted peanuts at home?
__label__chocolate American equivalent for British chocolate terms
__label__baking __label__oven __label__convection Fan bake vs bake
__label__sauce __label__storage-lifetime __label__acidity __label__mayonnaise Regulation and balancing of readymade packed mayonnaise and other sauces


In [5]:
!wc cooking.stackexchange.txt

  15404  169582 1401900 cooking.stackexchange.txt


In [6]:
!head -n 12404 cooking.stackexchange.txt > cooking.train
!tail -n 3000 cooking.stackexchange.txt > cooking.valid
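The `head`/`tail` split above takes the first 12404 lines for training and the last 3000 for validation, which together cover all 15404 lines without overlap. A minimal Python sketch of the same idea, for environments without those shell tools; the helper name `split_head_tail` is hypothetical and the toy list stands in for the lines of the data file:

```python
def split_head_tail(lines, n_train, n_valid):
    """Return (first n_train lines, last n_valid lines), like head -n / tail -n."""
    train = lines[:n_train]
    valid = lines[len(lines) - n_valid:]
    return train, valid


# Toy stand-in for the file contents; with the real data one would read
# cooking.stackexchange.txt and call split_head_tail(lines, 12404, 3000).
toy_lines = [f"__label__x example {i}" for i in range(10)]
toy_train, toy_valid = split_head_tail(toy_lines, 7, 3)
```

Because the split is positional, it relies on the file not being ordered in a way that biases either part.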

### First classifier

In [7]:
import fasttext
model = fasttext.train_supervised(input="cooking.train")

Read 0M words
Number of words:  14543
Number of labels: 735
Progress: 100.0% words/sec/thread:    2807 lr:  0.000000 avg.loss: 10.301685 ETA:   0h 0m 0s


In [8]:
model.save_model("model_cooking.bin")

In [9]:
model.predict("Which baking dish is best to bake a banana bread ?")

(('__label__baking',), array([0.07401636]))

In [10]:
model.predict("Why not put knives in the dishwasher?")

(('__label__baking',), array([0.06632103]))

In [11]:
model.test("cooking.valid")

(3000, 0.14766666666666667, 0.06386045841141703)

In [12]:
model.test("cooking.valid", k=5)

(3000, 0.0678, 0.14660516073230503)

### Precision and recall

In [13]:
model.predict("Why not put knives in the dishwasher?", k=5)

(('__label__baking',
  '__label__food-safety',
  '__label__bread',
  '__label__equipment',
  '__label__substitutions'),
 array([0.06632103, 0.0638648 , 0.03948653, 0.03379753, 0.0314443 ]))
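The tuples returned by `model.test` above are (number of examples, precision at k, recall at k). The sketch below shows one way to compute those two ratios from ranked predictions, assuming the counts are pooled over all examples. The function name `precision_recall_at_k` and the toy labels are illustrative and not part of fastText:

```python
def precision_recall_at_k(predicted, gold, k):
    """predicted: ranked label lists, one per example.
    gold: sets of true labels, one per example.
    Returns (precision, recall) pooled over all examples."""
    n_correct = n_predicted = n_gold = 0
    for ranked, truth in zip(predicted, gold):
        top_k = ranked[:k]
        n_correct += sum(1 for label in top_k if label in truth)
        n_predicted += len(top_k)
        n_gold += len(truth)
    return n_correct / n_predicted, n_correct / n_gold


# Toy data (made up for illustration).
predicted = [["baking", "food-safety", "bread"], ["sauce", "cheese", "baking"]]
gold = [{"equipment", "knives"}, {"sauce", "cheese"}]

p1, r1 = precision_recall_at_k(predicted, gold, k=1)
p3, r3 = precision_recall_at_k(predicted, gold, k=3)
```

Raising k adds more predictions per example, so recall can only go up while precision usually drops. The k=1 and k=5 results above show the same pattern.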

### Making the model better

In [14]:
!cat cooking.stackexchange.txt | sed -e "s/\([.\!?,'/()]\)/ \1 /g" | tr "[:upper:]" "[:lower:]" > cooking.preprocessed.txt
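The `sed`/`tr` pipeline above puts spaces around a small set of punctuation characters and lowercases the whole line, labels included. A rough Python equivalent, as a sketch only; the name `preprocess` is hypothetical, and the character class includes a backslash on the assumption that `sed` treats the `\` inside the bracket expression literally:

```python
import re

# Same characters as the sed bracket expression: . \ ! ? , ' / ( )
_PUNCT = re.compile(r"([.\\!?,'/()])")


def preprocess(line):
    """Surround each listed punctuation mark with spaces, then lowercase."""
    return _PUNCT.sub(r" \1 ", line).lower()


example = preprocess("What's the purpose of a bread box?")
```

Separating punctuation from words and folding case are what shrink the vocabulary reported in the next training run.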

In [15]:
!head -n 12404 cooking.preprocessed.txt > cooking2.train
!tail -n 3000 cooking.preprocessed.txt > cooking2.valid

In [16]:
import fasttext
model2 = fasttext.train_supervised(input="cooking2.train")

Read 0M words
Number of words:  8952
Number of labels: 735
Progress: 100.0% words/sec/thread:    3144 lr:  0.000000 avg.loss: 10.099373 ETA:   0h 0m 0s


In [17]:
model2.test("cooking2.valid")

(3000, 0.17433333333333334, 0.07539282110422373)

### More epochs and larger learning rate

In [18]:
import fasttext
model3 = fasttext.train_supervised(input="cooking2.train", epoch=25)

Read 0M words
Number of words:  8952
Number of labels: 735
Progress: 100.0% words/sec/thread:    3099 lr:  0.000000 avg.loss:  7.221395 ETA:   0h 0m 0s


In [19]:
model3.test("cooking2.valid")

(3000, 0.522, 0.22574599971169093)

In [20]:
model4 = fasttext.train_supervised(input="cooking2.train", lr=1.0)

Read 0M words
Number of words:  8952
Number of labels: 735
Progress: 100.0% words/sec/thread:    3072 lr:  0.000000 avg.loss:  6.542137 ETA:   0h 0m 0s


In [21]:
model4.test("cooking2.valid")

(3000, 0.5773333333333334, 0.2496756522992648)

In [22]:
model5 = fasttext.train_supervised(input="cooking2.train", lr=1.0, epoch=25)

Read 0M words
Number of words:  8952
Number of labels: 735
Progress: 100.0% words/sec/thread:    3052 lr:  0.000000 avg.loss:  4.531206 ETA:   0h 0m 0s


In [23]:
model5.test("cooking2.valid")

(3000, 0.576, 0.24909903416462448)

### Word n-grams

In [24]:
model6 = fasttext.train_supervised(input="cooking2.train", lr=1.0, epoch=25, wordNgrams=2)

Read 0M words
Number of words:  8952
Number of labels: 735
Progress: 100.0% words/sec/thread:    2970 lr:  0.000000 avg.loss:  3.202410 ETA:   0h 0m 0s


In [25]:
model6.test("cooking2.valid")

(3000, 0.6053333333333333, 0.26178463312671185)
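With `wordNgrams=2` the model also uses pairs of adjacent tokens as features, which lets it capture some word order. The sketch below only illustrates which pairs that means for one preprocessed sentence. The helper `word_ngrams` is hypothetical, and fastText's internal handling (for example hashing n-grams into `bucket` slots) is not reproduced here:

```python
def word_ngrams(tokens, n):
    """Return all runs of n consecutive tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "why not put knives in the dishwasher ?".split()
bigrams = word_ngrams(tokens, 2)
```

Eight tokens yield seven bigrams here, starting with `"why not"` and ending with `"dishwasher ?"`.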

### Scaling things up

In [26]:
model7 = fasttext.train_supervised(input="cooking2.train", lr=1.0, epoch=25, wordNgrams=2, bucket=200000, dim=50, loss='hs')

Read 0M words
Number of words:  8952
Number of labels: 735
Progress: 100.0% words/sec/thread:  104120 lr:  0.000000 avg.loss:  2.212304 ETA:   0h 0m 0s


In [27]:
model7.test("cooking2.valid")

(3000, 0.5863333333333334, 0.2535678247080871)

### Multi-label classification

In [28]:
model8 = fasttext.train_supervised(input="cooking2.train", lr=0.5, epoch=25, wordNgrams=2, bucket=200000, dim=50, loss='ova')

Read 0M words
Number of words:  8952
Number of labels: 735
Progress: 100.0% words/sec/thread:    5297 lr:  0.000000 avg.loss:  4.341012 ETA:   0h 0m 0s


In [29]:
model8.predict("Which baking dish is best to bake a banana bread ?", k=-1, threshold=0.5)

(('__label__baking',
  '__label__bread',
  '__label__bananas',
  '__label__equipment'),
 array([1.00001001, 0.99169421, 0.89913142, 0.8808071 ]))

In [30]:
model8.test("cooking2.valid", k=-1)

(3000, 0.003146031746031746, 1.0)
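One way to read this last result: with `k=-1` and no threshold passed to `test`, every one of the 735 labels appears to be returned for each of the 3000 examples. Recall is then trivially 1.0, and precision collapses to the share of true labels among all label slots. The arithmetic below is a sanity check under that assumption. The total of true labels is inferred from the reported numbers, not read from the data:

```python
n_examples = 3000
n_labels = 735
reported_precision = 0.003146031746031746  # from model8.test(..., k=-1)

# If all labels are predicted for every example:
n_predicted = n_examples * n_labels

# Recall is 1.0, so the number of correct predictions equals the number
# of true labels in the validation set (an inferred value).
n_true_labels = round(reported_precision * n_predicted)

# Cross-check against the first classifier at k=1: precision 0.14766...
# over 3000 single predictions means 443 correct predictions.
n_correct_k1 = round(0.14766666666666667 * n_examples)
recall_k1 = n_correct_k1 / n_true_labels
```

The recomputed `recall_k1` agrees with the recall reported by `model.test("cooking.valid")` earlier, which supports this reading. Passing a `threshold` to `test`, as was done for `predict`, would give a more informative precision figure.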