<a href="https://colab.research.google.com/github/rahiakela/practical-natural-language-processing/blob/chapter-4-text-classification/3_subword_embeddings_and_fast_text.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Subword Embeddings and fastText

**Word embeddings are about word representations**. Even off-the-shelf embeddings seem to work well on classification tasks.

However, if a word in our dataset was not present in the pre-trained model’s vocabulary, how will we get a representation for this word? 

This is popularly known as the out-of-vocabulary (OOV) problem. In our previous example, we simply ignored such words during feature extraction. Is there a better way?

fastText embeddings are based on the idea of enriching word embeddings with subword-level information. The embedding for each word is the sum of the representations of its individual character n-grams. While this may seem like a longer process than just estimating word-level embeddings, it has two advantages:

- This approach can handle words that did not appear in training data (OOV).
- The implementation facilitates extremely fast learning on even very large
corpora.
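
To make the subword idea concrete, here is a minimal sketch of how a word can be split into character n-grams and how an unseen word can still get a vector from them. It is a simplification: the function names and the toy vector table are made up for illustration, and the real library also hashes n-grams into buckets and adds a vector for the whole word.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of `word`, with < and > marking word boundaries."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(wrapped) - n + 1):
            grams.append(wrapped[start:start + n])
    return grams


def embed(word, ngram_vectors, dim):
    """Sum the vectors of all known trigrams; unknown trigrams contribute nothing."""
    total = [0.0] * dim
    for gram in char_ngrams(word, 3, 3):
        vec = ngram_vectors.get(gram)
        if vec is not None:
            total = [t + v for t, v in zip(total, vec)]
    return total


# Toy 2-dimensional n-gram table (made-up values, for illustration only).
toy_vectors = {"<wh": [1.0, 0.0], "whe": [0.0, 1.0], "ere": [0.5, 0.5]}

print(char_ngrams("where", 3, 3))
print(embed("where", toy_vectors, dim=2))
# "whence" is not in any vocabulary here, but it shares trigrams with "where".
print(embed("whence", toy_vectors, dim=2))
```

Because "whence" shares the trigrams `<wh` and `whe` with "where", it receives a non-zero vector even though it was never seen as a whole word.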

While fastText is a general-purpose library to learn the embeddings, it also supports off-the-shelf text classification by providing end-to-end classifier training and testing; i.e., we don’t have to handle feature extraction separately.

## Setup

In this notebook we will demonstrate using the fastText library to perform text classification on the DBpedia data, which can be downloaded from [here](https://github.com/le-scientifique/torchDatasets/raw/master/dbpedia_csv.tar.gz).
fastText is a library for learning word embeddings and for text classification, created by Facebook's AI Research (FAIR) lab. It supports both unsupervised and supervised learning for obtaining vector representations of words. Facebook makes pretrained models available for 294 languages (source: [wiki](https://en.wikipedia.org/wiki/FastText)).

Note: This notebook uses an older version of fasttext.

In [None]:
!pip install fasttext==0.9.2

In [2]:
import pandas as pd
from fasttext import train_supervised

In [3]:
# downloading the data
!wget -P DATAPATH https://github.com/le-scientifique/torchDatasets/raw/master/dbpedia_csv.tar.gz

# untarring the required file
!tar -xvf DATAPATH/dbpedia_csv.tar.gz -C DATAPATH

# sneak peek at the folder structure
!ls -lah DATAPATH

--2020-10-08 09:52:56--  https://github.com/le-scientifique/torchDatasets/raw/master/dbpedia_csv.tar.gz
Resolving github.com (github.com)... 52.69.186.44
Connecting to github.com (github.com)|52.69.186.44|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://github.com/srhrshr/torchDatasets/raw/master/dbpedia_csv.tar.gz [following]
--2020-10-08 09:52:56--  https://github.com/srhrshr/torchDatasets/raw/master/dbpedia_csv.tar.gz
Reusing existing connection to github.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/srhrshr/torchDatasets/master/dbpedia_csv.tar.gz [following]
--2020-10-08 09:52:56--  https://raw.githubusercontent.com/srhrshr/torchDatasets/master/dbpedia_csv.tar.gz
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connecte

## Data Preprocessing

In [4]:
data_path = "DATAPATH"

# Loading train data
train_file = data_path + "/dbpedia_csv/train.csv"
df = pd.read_csv(train_file, header=None, names=["class", "name", "description"])

# Loading test data
test_file = data_path + "/dbpedia_csv/test.csv"
df_test = pd.read_csv(test_file, header=None, names=["class", "name", "description"])

# Data we have
print("Train:{} Test:{}".format(df.shape, df_test.shape))

Train:(560000, 3) Test:(70000, 3)


In [5]:
# The data only has class numbers, so let's build a lookup ourselves
# Mapping from class number to class name
class_dict = {
    1:'Company',
    2:'EducationalInstitution',
    3:'Artist',
    4:'Athlete',
    5:'OfficeHolder',
    6:'MeanOfTransportation',
    7:'Building',
    8:'NaturalPlace',
    9:'Village',
    10:'Animal',
    11:'Plant',
    12:'Album',
    13:'Film',
    14:'WrittenWork'
}

# Mapping the classes
df["class_name"] = df["class"].map(class_dict)
df.head()

Unnamed: 0,class,name,description,class_name
0,1,E. D. Abbott Ltd,Abbott of Farnham E D Abbott Limited was a Br...,Company
1,1,Schwan-Stabilo,Schwan-STABILO is a German maker of pens for ...,Company
2,1,Q-workshop,Q-workshop is a Polish company located in Poz...,Company
3,1,Marvell Software Solutions Israel,Marvell Software Solutions Israel known as RA...,Company
4,1,Bergan Mercy Medical Center,Bergan Mercy Medical Center is a hospital loc...,Company


In [6]:
df["class_name"].value_counts()

Artist                    40000
Company                   40000
Athlete                   40000
WrittenWork               40000
Film                      40000
Album                     40000
Plant                     40000
NaturalPlace              40000
OfficeHolder              40000
Animal                    40000
MeanOfTransportation      40000
Village                   40000
EducationalInstitution    40000
Building                  40000
Name: class_name, dtype: int64

In [19]:
# Let's do some cleaning of this text
import unicodedata

def clean_it(text, normalize=True):
  # Replacing possible issues with data. We can add or remove replacements in this chain
  s = str(text).replace(',',' ').replace('"','').replace('\'',' \' ').replace('.',' . ').replace('(',' ( ').\
            replace(')',' ) ').replace('!',' ! ').replace('?',' ? ').replace(':',' ').replace(';',' ').lower()

  # normalizing to NFKD and dropping any characters that cannot be encoded as ASCII
  if normalize:
    s = unicodedata.normalize("NFKD", s).encode("ascii", "ignore").decode("utf-8")

  return s

In [20]:
# Now let's define a small function that applies the cleaning above to a dataset
def clean_df(data, cleanit=False, shuffleit=False, encodeit=False, label_prefix="__class__"):
  # Defining the new data
  df = data[["name", "description"]].copy(deep=True)
  df["class"] = label_prefix + data["class"].astype(str) + " "

  # cleaning it
  if cleanit:
    df["name"] = df["name"].apply(lambda x: clean_it(x, encodeit))
    df["description"] = df["description"].apply(lambda x: clean_it(x, encodeit))

  # shuffling it (sample returns a new frame, so the result must be assigned back)
  if shuffleit:
    df = df.sample(frac=1).reset_index(drop=True)

  return df

In [21]:
%%time

# Transform the datasets using the above clean functions
df_train_cleaned = clean_df(df, True, True)
df_test_cleaned = clean_df(df_test, True, True)

CPU times: user 6.73 s, sys: 1.06 s, total: 7.79 s
Wall time: 7.97 s


In [22]:
# Write files to disk as fastText classifier API reads files from disk.
train_file = data_path + "/dbpedia_train.csv"
df_train_cleaned.to_csv(train_file, header=None, index=False, columns=["class", "name", "description"])

test_file = data_path + "/dbpedia_test.csv"
df_test_cleaned.to_csv(test_file, header=None, index=False, columns=["class", "name", "description"])

## Feature extraction and training

Now that we have the train and test files written to disk in the format fastText expects, we are ready to use it for text classification!
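
As a rough illustration of that format, the sketch below splits one line on whitespace and treats any piece that starts with the label prefix as a label. This only approximates how the library reads its input; the function name and the sample line are hypothetical, shaped like the rows written by the cell above.

```python
def split_labels_and_tokens(line, label_prefix="__class__"):
    """Split one training line into (labels, tokens) on whitespace."""
    labels, tokens = [], []
    for piece in line.split():
        if piece.startswith(label_prefix):
            labels.append(piece)
        else:
            tokens.append(piece)
    return labels, tokens


# Hypothetical line: label column, then name and description joined by commas.
sample = "__class__1 ,q-workshop,q-workshop is a polish company"
labels, tokens = split_labels_and_tokens(sample)
print(labels)
print(tokens)
```

Note that the CSV commas stay attached to the neighbouring tokens, since only whitespace separates pieces in this sketch.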

In [24]:
%%time

# Using fastText for feature extraction and training
model = train_supervised(input=train_file, label="__class__", lr=1.0, epoch=75, loss="ova", wordNgrams=2, dim=200, thread=2, verbose=100)

CPU times: user 1h 11min 2s, sys: 17.6 s, total: 1h 11min 20s
Wall time: 36min 32s


In [25]:
for k in range(1, 6):
  results = model.test(test_file, k=k)
  print(f"Test Samples: {results[0]} Precision@{k} : {results[1]*100:2.4f} Recall@{k} : {results[2]*100:2.4f}")

Test Samples: 70000 Precision@1 : 93.0371 Recall@1 : 93.0371
Test Samples: 70000 Precision@2 : 48.5507 Recall@2 : 97.1014
Test Samples: 70000 Precision@3 : 32.5076 Recall@3 : 97.5229
Test Samples: 70000 Precision@4 : 24.4550 Recall@4 : 97.8200
Test Samples: 70000 Precision@5 : 19.6106 Recall@5 : 98.0529


We’ll notice that, even though this is a huge dataset and we gave the classifier raw text rather than feature vectors, training stays manageable: the run above took about 36 minutes of wall time for 75 epochs on two threads. We get about 93% precision and recall at k=1, and recall rises to about 98% at k=5.
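
In the test output, Precision@k is Recall@k divided by k (for example, 97.1014 / 2 = 48.5507). That is what we would expect when every example has exactly one true label and the model returns k predictions per example. The following is a minimal sketch of the two metrics under that assumption; the function name and the toy predictions are made up for illustration.

```python
def precision_recall_at_k(ranked_predictions, true_labels, k):
    """Precision@k and recall@k when each example has exactly one true label."""
    hits = 0
    n_predicted = 0
    for ranked, truth in zip(ranked_predictions, true_labels):
        top_k = ranked[:k]
        n_predicted += len(top_k)
        hits += int(truth in top_k)
    precision = hits / n_predicted       # correct labels / labels predicted
    recall = hits / len(true_labels)     # correct labels / true labels
    return precision, recall


# Toy ranked predictions (best guess first) and the single true label per example.
ranked = [
    ["Company", "Building", "Film"],
    ["Artist", "Athlete", "Album"],
    ["Plant", "Animal", "Village"],
    ["Film", "Album", "WrittenWork"],
]
truth = ["Company", "Athlete", "Village", "Album"]

for k in (1, 2, 3):
    p, r = precision_recall_at_k(ranked, truth, k)
    print(f"k={k} precision={p:.4f} recall={r:.4f}")
```

Recall can only grow as k increases, while precision is capped at 1/k, so Precision@1 is the number to compare against ordinary accuracy.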

When we have a large dataset, and when learning seems infeasible with the
approaches described so far, fastText is a good option to use to set up a strong working baseline.

However, there’s one concern to keep in mind when using fastText, as was the case with Word2vec embeddings: it learns and stores embeddings for n-grams in addition to whole words. Thus, when we save the trained model, it carries the entire n-gram embedding table with it. This results in a bulky model and can cause engineering issues.
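
A back-of-the-envelope estimate shows why the saved model gets large. The sketch below assumes the embedding table has one row per vocabulary word plus one row per n-gram hash bucket, stored as 32-bit floats. The bucket count is assumed to be the library default and the vocabulary size is hypothetical; only `dim=200` comes from the training call above.

```python
def estimate_input_matrix_bytes(vocab_size, bucket, dim, bytes_per_value=4):
    """Rough size of an embedding table with (vocab_size + bucket) rows of `dim` floats."""
    return (vocab_size + bucket) * dim * bytes_per_value


dim = 200             # from the train_supervised call above
bucket = 2_000_000    # ASSUMPTION: default number of n-gram hash buckets
vocab_size = 500_000  # HYPOTHETICAL vocabulary size, for illustration only

size_bytes = estimate_input_matrix_bytes(vocab_size, bucket, dim)
print(f"~{size_bytes / 1e9:.1f} GB")
```

Reducing `dim` or the number of buckets shrinks the table roughly in proportion. The fasttext Python package also provides a `quantize` method for supervised models; check its documentation for the options and the accuracy trade-off.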

> **fastText is extremely fast to train and very useful for setting up
strong baselines. The downside is the model size.**

Now try training a classifier on this dataset with, say, LogisticRegression to realize how fast fastText is! 93% Precision and Recall are hard numbers to beat, too!

In [None]:
# TODO
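
One possible starting point for this exercise is a TF-IDF plus LogisticRegression pipeline from scikit-learn. This is a sketch, not a tuned baseline: it assumes scikit-learn is installed, and it uses a tiny made-up corpus so that it runs on its own. In the notebook you would pass `df["description"]` and `df["class"]` instead.

```python
import time

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny made-up corpus (illustration only); replace with the DBpedia columns.
texts = [
    "a german maker of pens and highlighters",
    "a polish company producing dice",
    "a striker who played for the national team",
    "a sprinter who competed at the olympics",
    "a drama film directed in 1995",
    "a silent comedy film released in 1924",
]
labels = ["Company", "Company", "Athlete", "Athlete", "Film", "Film"]

baseline = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),  # unigrams and bigrams, similar in spirit to wordNgrams=2
    LogisticRegression(max_iter=1000),
)

# Time the fit so it can be compared with the fastText training time.
start = time.perf_counter()
baseline.fit(texts, labels)
elapsed = time.perf_counter() - start

preds = baseline.predict(texts)
print(f"fit time: {elapsed:.3f}s")
print(list(preds))
```

On the full 560,000-row training set, expect the vectorizer and the solver to need considerably more time and memory than on this toy input.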