# fastText

[fastText](https://fasttext.cc/) is a library for efficient learning of word representations and sentence classification.

This notebook is adapted from the official fastText [text classification tutorial](https://fasttext.cc/docs/en/supervised-tutorial.html).

In [1]:
!apt-get update 

Hit:1 https://deb.nodesource.com/node_14.x focal InRelease
Hit:2 http://archive.ubuntu.com/ubuntu focal InRelease
Hit:3 http://ppa.launchpad.net/deadsnakes/ppa/ubuntu focal InRelease
Hit:4 http://security.ubuntu.com/ubuntu focal-security InRelease
Hit:5 http://archive.ubuntu.com/ubuntu focal-updates InRelease
Hit:6 http://archive.ubuntu.com/ubuntu focal-backports InRelease
Reading package lists... Done
Building dependency tree       
Reading state information... Done
144 packages can be upgraded. Run 'apt list --upgradable' to see them.


In [2]:
!apt-get -y install gcc g++

Reading package lists... Done
Building dependency tree       
Reading state information... Done
Suggested packages:
  g++-multilib gcc-multilib make autoconf automake libtool flex bison gdb
  gcc-doc
The following NEW packages will be installed:
  g++ gcc
0 upgraded, 2 newly installed, 0 to remove and 144 not upgraded.
Need to get 6,812 B of archives.
After this operation, 67.6 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu focal/main amd64 gcc amd64 4:9.3.0-1ubuntu2 [5,208 B]
Get:2 http://archive.ubuntu.com/ubuntu focal/main amd64 g++ amd64 4:9.3.0-1ubuntu2 [1,604 B]
Fetched 6,812 B in 0s (14.9 kB/s)
debconf: delaying package configuration, since apt-utils is not installed
Selecting previously unselected package gcc.
(Reading database ... 27038 files and directories currently installed.)
Preparing to unpack .../gcc_4%3a9.3.0-1ubuntu2_amd64.deb ...
Unpacking gcc (4:9.3.0-1ubuntu2) ...
Selecting previously unselected package g++.
Preparing to unpack ...

In [3]:
pip install fasttext

Collecting fasttext
  Using cached fasttext-0.9.2.tar.gz (68 kB)
Building wheels for collected packages: fasttext
  Building wheel for fasttext (setup.py) ... done
  Created wheel for fasttext: filename=fasttext-0.9.2-cp38-cp38-linux_x86_64.whl size=4771769 sha256=5c6d4528dd89cca3db2db456c45fc7f564a48cb653c75432046ce6618fc25443
  Stored in directory: /home/jovyan/.cache/pip/wheels/93/61/2a/c54711a91c418ba06ba195b1d78ff24fcaad8592f2a694ac94
Successfully built fasttext
Installing collected packages: fasttext
Successfully installed fasttext-0.9.2
Note: you may need to restart the kernel to use updated packages.


### Getting and preparing the data

In [4]:
!wget https://dl.fbaipublicfiles.com/fasttext/data/cooking.stackexchange.tar.gz && tar xvzf cooking.stackexchange.tar.gz

--2023-08-21 03:21:40--  https://dl.fbaipublicfiles.com/fasttext/data/cooking.stackexchange.tar.gz
Resolving dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)... 99.84.238.154, 99.84.238.162, 99.84.238.206, ...
Connecting to dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)|99.84.238.154|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 457609 (447K) [application/x-tar]
Saving to: ‘cooking.stackexchange.tar.gz’


2023-08-21 03:21:40 (28.6 MB/s) - ‘cooking.stackexchange.tar.gz’ saved [457609/457609]

cooking.stackexchange.id
cooking.stackexchange.txt
readme.txt


In [5]:
!head cooking.stackexchange.txt

__label__sauce __label__cheese How much does potato starch affect a cheese sauce recipe?
__label__food-safety __label__acidity Dangerous pathogens capable of growing in acidic environments
__label__cast-iron __label__stove How do I cover up the white spots on my cast iron stove?
__label__restaurant Michelin Three Star Restaurant; but if the chef is not there
__label__knife-skills __label__dicing Without knife skills, how can I quickly and accurately dice vegetables?
__label__storage-method __label__equipment __label__bread What's the purpose of a bread box?
__label__baking __label__food-safety __label__substitutions __label__peanuts how to seperate peanut oil from roasted peanuts at home?
__label__chocolate American equivalent for British chocolate terms
__label__baking __label__oven __label__convection Fan bake vs bake
__label__sauce __label__storage-lifetime __label__acidity __label__mayonnaise Regulation and balancing of readymade packed mayonnaise and other sauces
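
Each line of the file starts with one or more labels carrying the `__label__` prefix, followed by the question text. As a rough illustration of that layout, the sketch below splits one line into labels and text. The helper name `parse_line` is hypothetical and not part of fastText.

```python
LABEL_PREFIX = "__label__"  # prefix seen in the file above


def parse_line(line):
    """Split one training line into (labels, text). Hypothetical helper."""
    labels, words = [], []
    for token in line.split():
        if token.startswith(LABEL_PREFIX):
            # Keep only the tag name, e.g. "sauce"
            labels.append(token[len(LABEL_PREFIX):])
        else:
            words.append(token)
    return labels, " ".join(words)


labels, text = parse_line(
    "__label__sauce __label__cheese "
    "How much does potato starch affect a cheese sauce recipe?"
)
print(labels)
print(text)
```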


In [6]:
!wc cooking.stackexchange.txt

  15404  169582 1401900 cooking.stackexchange.txt


In [7]:
!head -n 12404 cooking.stackexchange.txt > cooking.train
!tail -n 3000 cooking.stackexchange.txt > cooking.valid
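
The file has 15404 lines, so taking the first 12404 for training and the last 3000 for validation gives two sets that do not overlap. A minimal Python sketch of the same head/tail split, assuming the lines are already loaded into a list, could look like this. The function name `split_head_tail` is hypothetical.

```python
def split_head_tail(lines, n_train, n_valid):
    """Mimic `head -n n_train` and `tail -n n_valid` on a list of lines."""
    train = lines[:n_train]
    valid = lines[-n_valid:] if n_valid > 0 else []
    return train, valid


# Small stand-in for the real file: 10 lines, 7 for training, 3 for validation
lines = [f"line {i}" for i in range(10)]
train, valid = split_head_tail(lines, 7, 3)
print(len(train), len(valid))

# The two parts are disjoint only when n_train + n_valid <= len(lines)
print(set(train).isdisjoint(valid))
```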

### First classifier

In [8]:
import fasttext
model = fasttext.train_supervised(input="cooking.train")

In [9]:
model.save_model("model_cooking.bin")

In [10]:
model.predict("Which baking dish is best to bake a banana bread ?")

(('__label__baking',), array([0.0798744]))

In [11]:
model.predict("Why not put knives in the dishwasher?")

(('__label__baking',), array([0.07365448]))

In [12]:
# test() returns (number of examples, precision@k, recall@k); k defaults to 1
model.test("cooking.valid")

(3000, 0.129, 0.05578780452645236)

In [13]:
# Same evaluation with the top 5 predictions per example: precision@5 and recall@5
model.test("cooking.valid", k=5)

(3000, 0.0668, 0.14444284272740376)

### Precision and recall

In [14]:
model.predict("Why not put knives in the dishwasher?", k=5)

(('__label__baking',
  '__label__food-safety',
  '__label__equipment',
  '__label__bread',
  '__label__substitutions'),
 array([0.07365448, 0.06645831, 0.03438285, 0.03428278, 0.03235638]))
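
To make the two metrics concrete, the sketch below computes precision and recall for this single prediction. The gold tags (`equipment`, `cleaning`, `knives`) are an assumption taken from the upstream tutorial's description of this question; they do not appear in this notebook's output. The helper `precision_recall` is hypothetical.

```python
def precision_recall(predicted, gold):
    """Precision and recall of one set of predicted labels against gold labels."""
    correct = len(set(predicted) & set(gold))
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Top-5 labels predicted above, with the "__label__" prefix removed
predicted = ["baking", "food-safety", "equipment", "bread", "substitutions"]
# Assumption: tags reported for this question in the upstream tutorial
gold = ["equipment", "cleaning", "knives"]

p, r = precision_recall(predicted, gold)
print(f"P@5 = {p:.2f}, R@5 = {r:.2f}")  # P@5 = 0.20, R@5 = 0.33
```

Only one of the five predicted labels is in the gold set, hence the precision of 0.20, and only one of the three gold labels is recovered, hence the recall of 0.33.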

### Making the model better

In [15]:
!cat cooking.stackexchange.txt | sed -e "s/\([.\!?,'/()]\)/ \1 /g" | tr "[:upper:]" "[:lower:]" > cooking.preprocessed.txt
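
The shell pipeline above pads common punctuation with spaces and lowercases everything, so that `What's` and `what` end up sharing a token. A roughly equivalent Python sketch is shown below. It is an approximation: the character class omits the backslash that the `sed` expression also matches, and `str.lower()` handles non-ASCII characters differently from `tr`. The function name `preprocess` is hypothetical.

```python
import re

# Punctuation to surround with spaces (approximates the sed character class)
_PUNCT = re.compile(r"([.!?,'/()])")


def preprocess(line):
    """Pad punctuation with spaces, then lowercase the whole line."""
    return _PUNCT.sub(r" \1 ", line).lower()


example = "__label__bread What's the purpose of a bread box?"
print(preprocess(example).split())
```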

In [16]:
!head -n 12404 cooking.preprocessed.txt > cooking2.train
!tail -n 3000 cooking.preprocessed.txt > cooking2.valid

In [17]:
import fasttext
model2 = fasttext.train_supervised(input="cooking2.train")

In [18]:
model2.test("cooking2.valid")

(3000, 0.164, 0.07092403056076113)

### More epochs and larger learning rate

In [19]:
import fasttext
model3 = fasttext.train_supervised(input="cooking2.train", epoch=25)

In [20]:
model3.test("cooking2.valid")

(3000, 0.5156666666666667, 0.22300706357214933)

In [21]:
model4 = fasttext.train_supervised(input="cooking2.train", lr=1.0)

In [22]:
model4.test("cooking2.valid")

(3000, 0.5683333333333334, 0.24578347989044255)

In [23]:
model5 = fasttext.train_supervised(input="cooking2.train", lr=1.0, epoch=25)

In [24]:
model5.test("cooking2.valid")

(3000, 0.5783333333333334, 0.25010811590024506)

### Word n-grams

In [25]:
model6 = fasttext.train_supervised(input="cooking2.train", lr=1.0, epoch=25, wordNgrams=2)

In [26]:
model6.test("cooking2.valid")

(3000, 0.6023333333333334, 0.2604872423237711)
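
With `wordNgrams=2` the model sees pairs of adjacent words in addition to single words, which lets it capture some word order. The sketch below only illustrates what those bigram features look like; it is not fastText's internal implementation, which hashes n-grams into the number of slots set by `bucket`. The helper `word_ngrams` is hypothetical.

```python
def word_ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "which baking dish is best".split()
bigrams = word_ngrams(tokens, 2)

# Unigrams plus bigrams: the kind of feature set used when wordNgrams=2
features = tokens + bigrams
print(bigrams)
print(len(features))
```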

### Scaling things up

In [27]:
# loss='hs' selects hierarchical softmax, a faster approximation of the full softmax
model7 = fasttext.train_supervised(input="cooking2.train", lr=1.0, epoch=25, wordNgrams=2, bucket=200000, dim=50, loss='hs')

In [28]:
model7.test("cooking2.valid")

(3000, 0.5893333333333334, 0.2548652155110278)

### Multi-label classification

In [29]:
# loss='ova' (one-vs-all) trains an independent binary classifier for each label
model8 = fasttext.train_supervised(input="cooking2.train", lr=0.5, epoch=25, wordNgrams=2, bucket=200000, dim=50, loss='ova')

In [30]:
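# k=-1 removes the cap on the number of labels; threshold=0.5 keeps only labels scoring at least 0.5
model8.predict("Which baking dish is best to bake a banana bread ?", k=-1, threshold=0.5)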

(('__label__baking',
  '__label__bread',
  '__label__bananas',
  '__label__equipment'),
 array([1.00001001, 0.98902309, 0.9124462 , 0.79311597]))

In [31]:
model8.test("cooking2.valid", k=-1)

(3000, 0.003146031746031746, 1.0)
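
A recall of 1.0 with a precision near 0.003 is what one would expect if every label is returned for every example, which appears to be the case here because `k=-1` is used without a `threshold` (assumption: `test` applies a default threshold of 0.0). The sketch below illustrates the effect of a threshold on one-vs-all scores. It uses the four scores from the prediction above plus one hypothetical low-scoring label, and it is not fastText's exact selection rule. The helper `select_labels` is hypothetical.

```python
def select_labels(scores, threshold=0.0, k=-1):
    """Keep labels scoring at least `threshold`, best first; k=-1 means no cap."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    kept = [(label, score) for label, score in ranked if score >= threshold]
    return kept if k < 0 else kept[:k]


scores = {
    "baking": 1.00001001,
    "bread": 0.98902309,
    "bananas": 0.9124462,
    "equipment": 0.79311597,
    "knives": 0.01,  # hypothetical low score, added for illustration
}

with_threshold = [label for label, _ in select_labels(scores, threshold=0.5)]
without_threshold = [label for label, _ in select_labels(scores)]
print(with_threshold)
print(len(without_threshold))
```

Under that assumption, evaluating with an explicit threshold would give a more informative trade-off between precision and recall than the unfiltered result above.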