# FastText for classification

In this notebook we will demonstrate using the fastText library to perform text classificatoin on the dbpedie data which can we downloaded from [here](https://github.com/le-scientifique/torchDatasets/raw/master/dbpedia_csv.tar.gz). <br>fastText is a library for learning of word embeddings and text classification created by Facebook's AI Research (FAIR) lab. The model allows to create an unsupervised learning or supervised learning algorithm for obtaining vector representations for words. Facebook makes available pretrained models for 294 languages(source: [wiki](https://en.wikipedia.org/wiki/FastText)).<br>
**Note**: This notebook uses an older version of fasttext.

## Setup

### Imports

In [1]:
!pip install fasttext==0.9.2



In [1]:
#necessary imports
import pandas as pd
import os, sys

#importing utils
sys.path.insert(0,os.path.split(os.getcwd())[0]) 
import utils

### Downloading data

In [25]:
DATA_PATH="Data"
DATA_TAR_PATH = utils.download_file(url='https://github.com/le-scientifique/torchDatasets/raw/master/dbpedia_csv.tar.gz', directory=DATA_PATH)
DBPEDIA_DIR = utils.extract_tar(DATA_TAR_PATH)
utils.list_files(DBPEDIA_DIR)

Downloading https://github.com/le-scientifique/torchDatasets/raw/master/dbpedia_csv.tar.gz --> Data/dbpedia_csv.tar.gz
Extracting Data/dbpedia_csv.tar.gz --> Data/dbpedia_csv
dbpedia_csv/
    classes.txt
    test.csv
    readme.txt
    train.csv


## EDA

In [123]:
data_path = 'DATAPATH'

# Loading train data
train_file = os.path.join(DBPEDIA_DIR, 'train.csv')
df = pd.read_csv(train_file, header=None, names=['class','name','description'])

# Loading test data
test_file = os.path.join(DBPEDIA_DIR, 'test.csv')
df_test = pd.read_csv(test_file, header=None, names=['class','name','description'])

# Data we have
print("Train Data Shape :\t{}\nTest Data Shape :\t{}".format(df.shape,df_test.shape))

Train Data Shape :	(560000, 3)
Test Data Shape :	(70000, 3)


In [124]:
# Since we have no clue about the classes lets build one
# Mapping from class number to class name
class_dict={
            1:'Company',
            2:'EducationalInstitution',
            3:'Artist',
            4:'Athlete',
            5:'OfficeHolder',
            6:'MeanOfTransportation',
            7:'Building',
            8:'NaturalPlace',
            9:'Village',
            10:'Animal',
            11:'Plant',
            12:'Album',
            13:'Film',
            14:'WrittenWork'
        }

# Mapping the classes
df['class_name'] = df['class'].map(class_dict)

df = df[df['class_name'].isin(['Athlete','Animal','Plant','Company','MeanOfTransportation','Film'])]
df.sample(5)

Unnamed: 0,class,name,description,class_name
494936,13,The Horror of It All,The Horror of It All is a 1963 horror comedy ...,Film
370640,10,Pseudotolithus,Pseudotolithus is a genus of croaker or bar r...,Animal
424816,11,Nectandra barbellata,Nectandra barbellata is a species of plant in...,Plant
433970,11,Oxalis ecuadorensis,Oxalis ecuadorensis is a species of plant in ...,Plant
12748,1,BIS Records,BIS Records is a record label founded in 1973...,Company


In [125]:
df["class_name"].value_counts()

Film                    40000
Company                 40000
Athlete                 40000
Plant                   40000
MeanOfTransportation    40000
Animal                  40000
Name: class_name, dtype: int64

#### Printing samples from different classes

In [126]:
print('-'+'\n\n-'.join(df[df['class_name']=="Athlete"].description.sample(3).to_list()))

- Richard J. Dick Burke (28 October 1920 – 2004) was an English professional footballer. A left back or right back he played in the Football League for Blackpool Newcastle United and Carlisle United.

- Dan Kimmel is a professional bass angler located in Lansing Michigan. He has won twenty-five money tournaments and twenty-eight fishing angler awards.In 1990 he won a Special Conservation award from the Michigan United Conservation Clubs.

- Spencer E. Wishart (3 December 1889 – 22 August 1914) was an American racecar driver. He was active during the early years of the Indianapolis 500.


In [141]:
print('-'+'\n\n-'.join(df[df['class_name']=="Film"].description.sample(3).to_list()))

- Nidhiyude Katha (The Treasure) is an experimental film written and directed by Vijayakrishnan.

- Relax Freddie (Danish: Slap af Frede!) is a 1966 Danish comedy film directed by Erik Balling and starring Morten Grunwald. It is a sequel to Slå først Frede!.

- Madeleine is a 1950 film directed by David Lean based on a true story about Madeleine Smith a young Glasgow woman from a wealthy family who was tried in 1857 for the murder of her lover Emile L'Angelier. The trial was much publicized in the newspapers of the day and was labelled the trial of the century. Lean's adaptation of the story stars his then wife Ann Todd with Ivan Desny as her French lover.


## Preprocessing

In [128]:
# Lets do some cleaning of this text
def clean_it(text,normalize=True):
    # Replacing possible issues with data. We can add or reduce the replacemtent in this chain
    s = str(text).replace(',',' ').replace('"','').replace('\'',' \' ').replace('.',' . ').replace('(',' ( ').\
            replace(')',' ) ').replace('!',' ! ').replace('?',' ? ').replace(':',' ').replace(';',' ').lower()
    
    # normalizing / encoding the text
    if normalize:
        s = s.normalize('NFKD').str.encode('ascii','ignore').str.decode('utf-8')
    
    return s

# Now lets define a small function where we can use above cleaning on datasets
def clean_df(data, cleanit= False, shuffleit=False, encodeit=False, label_prefix='__class__'):
    # Defining the new data
    df = data[['name','description']].copy(deep=True)
    df['class'] = label_prefix + data['class'].astype(str) + ' '
    
    # cleaning it
    if cleanit:
        df['name'] = df['name'].apply(lambda x: clean_it(x,encodeit))
        df['description'] = df['description'].apply(lambda x: clean_it(x,encodeit))
    
    # shuffling it
    if shuffleit:
        df.sample(frac=1).reset_index(drop=True)
            
    return df

In [130]:
# Transform the datasets using the above clean functions
df_train_cleaned = clean_df(df, True, True)
df_test_cleaned = clean_df(df_test, True, True)

In [131]:
df_train_cleaned['class'].value_counts()

__class__1      40000
__class__11     40000
__class__10     40000
__class__6      40000
__class__4      40000
__class__13     40000
Name: class, dtype: int64

In [132]:
df_train_cleaned.head()

Unnamed: 0,name,description,class
0,e . d . abbott ltd,abbott of farnham e d abbott limited was a br...,__class__1
1,schwan-stabilo,schwan-stabilo is a german maker of pens for ...,__class__1
2,q-workshop,q-workshop is a polish company located in poz...,__class__1
3,marvell software solutions israel,marvell software solutions israel known as ra...,__class__1
4,bergan mercy medical center,bergan mercy medical center is a hospital loc...,__class__1


In [133]:
# Write files to disk as fastText classifier API reads files from disk.
train_file = os.path.join(DBPEDIA_DIR,'train_clean.csv')
df_train_cleaned.to_csv(train_file, header=None, index=False, columns=['class','name','description'])

test_file = os.path.join(DBPEDIA_DIR, 'test_clean.csv')
df_test_cleaned = df_test_cleaned[df_test_cleaned['class'].isin(df_train_cleaned['class'].unique())]
df_test_cleaned.to_csv(test_file, header=None, index=False, columns=['class','name','description'])

Now that we have the train and test files written into disk in a format fastText wants, we are ready to use it for text classification!

## Training

In [134]:
%%time
## Using fastText for feature extraction and training
from fasttext import train_supervised 
"""fastText expects and training file (csv), a model name as input arguments.
label_prefix refers to the prefix before label string in the dataset.
default is __label__. In our dataset, it is __class__. 
There are several other parameters which can be seen in: 
https://pypi.org/project/fasttext/
"""
model = train_supervised(input=train_file, label="__class__", lr=1.0, epoch=75, loss='ova', wordNgrams=2, dim=200, thread=2, verbose=100)

CPU times: user 9min 36s, sys: 11.4 s, total: 9min 47s
Wall time: 5min 1s


Try training a classifier on this dataset with, say, LogisticRegression to realize how fast fastText is! 93% Precision and Recall are hard numbers to beat, too!

## Evaluation

In [143]:
results = model.test(test_file)
print(f"Test Samples: {results[0]} Precision@{k} : {results[1]*100:2.4f} Recall@{k} : {results[2]*100:2.4f}")

Test Samples: 30000 Precision@5 : 98.5367 Recall@5 : 98.5367


In [142]:
def get_pred(doc):
    scores = model.predict(clean_it(doc,False))
    return pd.Series(dict(pred=[class_dict[int(i.replace('__class__',''))] for i in scores[0]][0],prob=scores[1][0]))

sample_docs = {
    "Dugong" : "The dugong is a medium-sized marine mammal. It is one of four living species of the order Sirenia, which also includes three species of manatees. It is the only living representative of the once-diverse family Dugongidae; its closest modern relative, Steller's sea cow, was hunted to extinction in the 18th century.",
    "Michael Phelps" : "Michael Fred Phelps II is an American former competitive swimmer and the most successful and most decorated Olympian of all time, with a total of 28 medals. Phelps also holds the all-time records for Olympic gold medals, Olympic gold medals in individual events, and Olympic medals in individual events.",
    "Neelakurinji Flower" : "Strobilanthes kunthiana, known as Kurinji or Neelakurinji in Tamil, is a shrub that is found in the shola forests of the Western Ghats in Kerala. Nilgiri Hills, which literally means the blue mountains, got their name from the purplish blue flowers of Neelakurinji that blossoms only once in 12 years.",
    "DeepMind" : "DeepMind Technologies is a British artificial intelligence subsidiary of Alphabet Inc. and research laboratory founded in September 2010. DeepMind was acquired by Google in 2014. The company is based in London, with research centres in Canada, France, and the United States.",
    "Shinkasen" : "The Shinkansen, colloquially known in English as the bullet train, is a network of high-speed railway lines in Japan. Initially, it was built to connect distant Japanese regions with Tokyo, the capital, to aid economic growth and development.",
    'Green Book': "Green Book is a 2018 American biographical comedy-drama buddy film directed by Peter Farrelly. Set in 1962, the film is inspired by the true story of a tour of the Deep South by African American classical and jazz pianist Don Shirley and Italian American bouncer Frank Tony Lip Vallelonga who served as Shirley's driver and bodyguard."
}

pd.DataFrame(pd.Series(sample_docs))[0].apply(get_pred)

Unnamed: 0,pred,prob
Dugong,Animal,0.966924
Michael Phelps,Athlete,1.00001
Neelakurinji Flower,Plant,1.00001
DeepMind,Company,1.00001
Shinkasen,MeanOfTransportation,0.51563
Green Book,Film,1.00001
