## Text Classification Using FastText
**Dataset Credits**: https://www.kaggle.com/datasets/saurabhshahane/ecommerce-text-classification

We have a dataset of e-commerce item descriptions, grouped into four categories:

* Household
* Electronics
* Clothing and Accessories
* Books

The task is to classify a product into one of these four categories based on its description.

In [3]:
cd /content/drive/MyDrive/Study/NLP/codebasics/13. Word Embedding/fastText

/content/drive/MyDrive/Study/NLP/codebasics/13. Word Embedding/fastText


In [5]:
import pandas as pd

df = pd.read_csv("ecommerce_dataset.csv", names=["category", "description"], header=None)
print(df.shape)
df.head(3)

(50425, 2)


| | category | description |
|---|---|---|
| 0 | Household | Paper Plane Design Framed Wall Hanging Motivat... |
| 1 | Household | SAF 'Floral' Framed Painting (Wood, 30 inch x ... |
| 2 | Household | SAF 'UV Textured Modern Art Print Framed' Pain... |


In [6]:
df.category.value_counts()

| category | count |
|---|---|
| Household | 19313 |
| Books | 11820 |
| Electronics | 10621 |
| Clothing & Accessories | 8671 |


In [7]:
df.dropna(inplace = True)
df.shape

(50424, 2)

In [8]:
# fastText labels cannot contain spaces, so rename 'Clothing & Accessories'
# (assigning the result back avoids relying on a chained inplace replace)
df['category'] = df['category'].replace('Clothing & Accessories', 'Clothing_Accessories')
df.category.unique()

array(['Household', 'Books', 'Clothing_Accessories', 'Electronics'],
      dtype=object)

When you train a supervised fastText model, it expects every label to carry a prefix (`__label__` by default). We first add that prefix to the `category` column, then create a third column in the dataframe that combines the prefixed label with the product description.
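To make the expected line format concrete, here is a minimal sketch of how a single training line is put together. The helper name `to_fasttext_line` is hypothetical and not part of the fasttext package; it only illustrates the layout and why the space in a label has to be replaced.

```python
LABEL_PREFIX = "__label__"  # fastText's default label prefix

def to_fasttext_line(label: str, text: str) -> str:
    """Build one training line: '__label__<label> <text>'.

    Hypothetical helper for illustration; not part of the fasttext package.
    """
    if any(ch.isspace() for ch in label):
        # Whitespace would split the label into several tokens
        raise ValueError(f"label must not contain whitespace: {label!r}")
    return f"{LABEL_PREFIX}{label} {text}"

line = to_fasttext_line("Household", "Paper Plane Design Framed Wall Hanging")
print(line)
```

The notebook cells that follow achieve the same result with plain string concatenation on dataframe columns.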

In [9]:
df['category'] = '__label__' + df['category'].astype(str)
df.head(5)

| | category | description |
|---|---|---|
| 0 | __label__Household | Paper Plane Design Framed Wall Hanging Motivat... |
| 1 | __label__Household | SAF 'Floral' Framed Painting (Wood, 30 inch x ... |
| 2 | __label__Household | SAF 'UV Textured Modern Art Print Framed' Pain... |
| 3 | __label__Household | SAF Flower Print Framed Painting (Synthetic, 1... |
| 4 | __label__Household | Incredible Gifts India Wooden Happy Birthday U... |


In [11]:
df['category_description'] = df['category']+' '+df['description']
df.head()

| | category | description | category_description |
|---|---|---|---|
| 0 | __label__Household | Paper Plane Design Framed Wall Hanging Motivat... | __label__Household Paper Plane Design Framed W... |
| 1 | __label__Household | SAF 'Floral' Framed Painting (Wood, 30 inch x ... | __label__Household SAF 'Floral' Framed Paintin... |
| 2 | __label__Household | SAF 'UV Textured Modern Art Print Framed' Pain... | __label__Household SAF 'UV Textured Modern Art... |
| 3 | __label__Household | SAF Flower Print Framed Painting (Synthetic, 1... | __label__Household SAF Flower Print Framed Pai... |
| 4 | __label__Household | Incredible Gifts India Wooden Happy Birthday U... | __label__Household Incredible Gifts India Wood... |


### Preprocessing

- Remove punctuation
- Remove extra space
- Make the entire sentence lower case

In [12]:
import re

text = "  VIKI's | Bookcase/Bookshelf (3-Shelf/Shelve, White) | ? . hi"
text = re.sub(r'[^\w\s\']',' ', text) # replace every character that is not a word character (\w), whitespace (\s), or an apostrophe (') with a space
text = re.sub(' +', ' ', text)  # collapse runs of spaces into a single space
text.strip().lower()


"viki's bookcase bookshelf 3 shelf shelve white hi"

In [13]:
def preprocess(text):
  text = re.sub(r'[^\w\s\']',' ', text)
  text = re.sub(' +', ' ', text)
  return text.strip().lower()

preprocess(text)

"viki's bookcase bookshelf 3 shelf shelve white hi"
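One detail worth noting: `preprocess` collapses only runs of spaces, so tabs and newlines pass through unchanged. fastText reads one example per line, so if a description contained a newline (an assumption here, not something verified against this dataset), it could end up split across lines in the training file. The sketch below is a hypothetical variant, `preprocess_one_line`, that folds every whitespace run into a single space.

```python
import re

def preprocess_one_line(text: str) -> str:
    """Variant of preprocess() that also folds tabs and newlines into single spaces."""
    text = re.sub(r"[^\w\s']", " ", text)  # same punctuation rule as preprocess()
    text = re.sub(r"\s+", " ", text)       # collapse any whitespace run, not only spaces
    return text.strip().lower()

sample = "VIKI's | Bookcase/Bookshelf\n(3-Shelf/Shelve,\tWhite)"
print(preprocess_one_line(sample))
# prints: viki's bookcase bookshelf 3 shelf shelve white
```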

In [14]:
df['category_description'] = df['category_description'].map(preprocess)
df.head()

| | category | description | category_description |
|---|---|---|---|
| 0 | __label__Household | Paper Plane Design Framed Wall Hanging Motivat... | __label__household paper plane design framed w... |
| 1 | __label__Household | SAF 'Floral' Framed Painting (Wood, 30 inch x ... | __label__household saf 'floral' framed paintin... |
| 2 | __label__Household | SAF 'UV Textured Modern Art Print Framed' Pain... | __label__household saf 'uv textured modern art... |
| 3 | __label__Household | SAF Flower Print Framed Painting (Synthetic, 1... | __label__household saf flower print framed pai... |
| 4 | __label__Household | Incredible Gifts India Wooden Happy Birthday U... | __label__household incredible gifts india wood... |


### Train test split

In [15]:
from sklearn.model_selection import train_test_split

train, test = train_test_split(df, test_size = 0.2)
print(train.shape, test.shape)

(40339, 3) (10085, 3)
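The split above is random and unseeded, so the exact rows in each set change between runs, and the class proportions in the test set can drift slightly from the full dataset. scikit-learn's `train_test_split` accepts `random_state` and `stratify` arguments for this. As a rough illustration of what stratification does, here is a standard-library sketch; the function name `stratified_split` and the seed value are assumptions made for the example.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx) with roughly test_size of each label in test."""
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    rng = random.Random(seed)  # fixed seed so the split is repeatable
    train_idx, test_idx = [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

labels = ["Household"] * 10 + ["Books"] * 5
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

With the dataframe from this notebook, the equivalent scikit-learn call would pass `stratify=df['category']` and a fixed `random_state`.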


In [17]:
train.to_csv('ecommerce.train', columns=["category_description"], index = False, header = False)
test.to_csv('ecommerce.test', columns=["category_description"], index = False, header = False)
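Because `to_csv` applies CSV quoting rules, a field that contains a delimiter, a quote character, or a newline gets wrapped in double quotes, and such a line would no longer start with `__label__`. The preprocessing step removes commas and double quotes, so this may never happen here, but a quick check is cheap. The sketch below uses a hypothetical helper, `count_bad_lines`, and runs on a throwaway demo file rather than the real data.

```python
import tempfile
from pathlib import Path

def count_bad_lines(path, prefix="__label__"):
    """Return (total_lines, lines_not_starting_with_prefix) for a fastText-style file."""
    total = bad = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            total += 1
            if not line.startswith(prefix):
                bad += 1
    return total, bad

# Demo on a throwaway file; in the notebook you would pass 'ecommerce.train' instead.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo.train"
    demo.write_text(
        "__label__books a brief history of time\n"
        '"__label__household wall hanging, framed"\n',  # as a CSV writer would quote it
        encoding="utf-8",
    )
    result = count_bad_lines(demo)
print(result)  # (2, 1): two lines, one of which does not start with the prefix
```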

### Training

In [2]:
!pip install fasttext

Collecting fasttext
  Downloading fasttext-0.9.3.tar.gz (73 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting pybind11>=2.2 (from fasttext)
  Using cached pybind11-2.13.6-py3-none-any.whl.metadata (9.5 kB)
Using cached pybind11-2.13.6-py3-none-any.whl (243 kB)
Building wheels for collected packages: fasttext
  Building wheel for fasttext (pyproject.toml) ... done
  Created wheel for fasttext: filename=fasttext-0.9.3-cp311-cp311-linux_x86_64.whl size=4313504 sha256=c068706c002381

In [18]:
import fasttext

model = fasttext.train_supervised(input='ecommerce.train')
model.test("ecommerce.test")
# model.test returns (number of examples, precision@1, recall@1)

(10083, 0.9697510661509472, 0.9697510661509472)
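Precision and recall are identical here because each example has exactly one true label and the model is evaluated on its single top prediction, so both metrics reduce to the fraction of examples classified correctly. As a small sketch, the tuple can be unpacked and combined into an F1 score; the helper `f1_score` below is written for this example and is not a fasttext function.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Values copied from the model.test output above
n_examples, precision, recall = 10083, 0.9697510661509472, 0.9697510661509472
f1 = f1_score(precision, recall)
print(f"{n_examples} examples, F1@1 = {f1:.4f}")
```

When precision and recall are equal, the F1 score equals them as well, so this number mainly becomes informative once you evaluate with more than one predicted label per example.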

In [22]:
model.predict(["wintech assemble desktop pc cpu 500 gb sata hdd 4 gb ram intel c2d processor 3"])

([['__label__electronics']], [array([0.9983188], dtype=float32)])

In [23]:
model.predict(["ockey men's cotton t shirt fabric details 80 cotton 20 polyester super combed cotton rich fabric"])


([['__label__clothing_accessories']], [array([1.00001], dtype=float32)])

In [24]:
model.get_nearest_neighbors("painting")

[(0.9987840056419373, 'canyon'),
 (0.9987824559211731, 'designe'),
 (0.9987813234329224, '2800rpm'),
 (0.9987813234329224, 'id13'),
 (0.9987795948982239, 'bushes'),
 (0.9987785816192627, 'panini'),
 (0.9987708330154419, 'corns'),
 (0.998769223690033, 'lakme'),
 (0.9987689852714539, 'skincare'),
 (0.9987674355506897, 'spinal')]

In [25]:
model.get_nearest_neighbors("sony")

[(0.999107837677002, 'f8m935bt06'),
 (0.9991005063056946, 'scuba'),
 (0.9990997314453125, '80tl'),
 (0.9990997314453125, '80tl009mih'),
 (0.9990943074226379, '4mbps'),
 (0.9990844130516052, '400mbps3'),
 (0.9990844130516052, 'ohci'),
 (0.9990844130516052, 'ieee1394'),
 (0.9990844130516052, 'connectionpackage'),
 (0.9990844130516052, 'connection1')]

In [26]:
model.get_nearest_neighbors("banglore")

[(0.0, 'to'),
 (0.0, 'and'),
 (0.0, 'a'),
 (0.0, 'with'),
 (0.0, 'for'),
 (0.0, 'is'),
 (0.0, '</s>'),
 (0.0, "borders'"),
 (0.0, "'avital"),
 (0.0, 'atlasfor')]