# FastText Embedding

FastText is another word-embedding method that, like GloVe, learns word vectors from the statistics of a text corpus.
In my opinion, FastText is easier to work with than GloVe, which is not intuitive to use in the first place.

As a statistical approach, FastText also needs to be trained before it can be used.
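What sets FastText apart from GloVe is that it builds each word vector from character n-grams, which is why it can return a vector even for a word it never saw during training. A minimal sketch of how such n-grams could be extracted, assuming the usual `<` and `>` boundary markers and n-gram lengths of 3 to 6 (the helper name `char_ngrams` is made up for illustration and is not part of gensim):

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of `word`, with boundary markers added."""
    wrapped = f"<{word}>"
    ngrams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of length n over the wrapped word.
        for start in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[start:start + n])
    return ngrams


print(char_ngrams("aku", min_n=3, max_n=4))
# ['<ak', 'aku', 'ku>', '<aku', 'aku>']
```

Words that share many of these pieces, such as `haha` and `hahahaha`, end up with related vectors.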

In [7]:
!pip install gensim tqdm



In [21]:
from gensim.models import FastText
import pandas
import re
from tqdm import tqdm
import numpy

from google.colab import drive

In [4]:
drive.mount("/content/drive")

Mounted at /content/drive


## Data Loading and Preparation

This part loads the dataset into a pandas DataFrame.

In [5]:
dataframe = pandas.read_csv("/content/drive/MyDrive/Collab Dataset/nlp-dl-self-assignment/data.csv")
dataframe.head()

Unnamed: 0,sentimen,Tweet,Unnamed: 2
0,-1,lagu bosan apa yang aku save ni huhuhuhuhuhuhu...,
1,-1,kita lanjutkan saja diam ini hingga kau dan ak...,
2,1,doa rezeki tak putus inna haa zaa larizquna ma...,
3,1,makasih loh ntar kita bagi hasil aku 99 9 sisa...,
4,-1,aku tak faham betul jenis orang malaysia yang ...,
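
The preview shows an extra, apparently empty `Unnamed: 2` column. A small sketch of how the frame could be trimmed to the two useful columns before pre-processing, using a tiny in-memory CSV made up for illustration in place of the real file (the names `csv_text`, `sample` and `cleaned` are assumptions):

```python
import io

import pandas

# Made-up stand-in for data.csv, with the same column names as the preview.
csv_text = "sentimen,Tweet,Unnamed: 2\n-1,aku bosan,\n1,makasih loh,\n1,,\n"
sample = pandas.read_csv(io.StringIO(csv_text))

# Keep only the label and the text, and drop rows without a tweet.
cleaned = (
    sample[["sentimen", "Tweet"]]
    .dropna(subset=["Tweet"])
    .reset_index(drop=True)
)
print(cleaned)
```

Dropping rows with a missing tweet also protects the pre-processing loop, since `re.sub` expects a string and fails on `NaN`.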


The tweets are then pre-processed as follows:

1. For every sentence in `dataframe["Tweet"]`:
1.1 Remove punctuation from the sentence
1.2 Split the sentence on whitespace
1.3 Append the resulting list of words to a list called `data`
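
The steps above can be sketched on a single made-up sentence as follows. The helper name `tokenize` is only illustrative; it calls `split()` without an argument so that repeated spaces do not produce empty tokens:

```python
import re

# Anything that is neither a word character nor whitespace counts as punctuation.
PUNCT_REGEX = re.compile(r"[^\w\s]")


def tokenize(sentence):
    """Remove punctuation, then split the sentence on whitespace."""
    return PUNCT_REGEX.sub("", sentence).split()


print(tokenize("aku tak faham,  betul!"))
# ['aku', 'tak', 'faham', 'betul']
```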

In [9]:
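data = []

# Matches every character that is neither a word character nor whitespace
punct_regex = r'[^\w\s]'

print("Pre-process tweet data")
for sentence in tqdm(dataframe["Tweet"]):
    sentence = re.sub(punct_regex, "", sentence)

    # split() without arguments also drops empty tokens left by repeated spaces
    split_sentence = sentence.split()
    data.append(split_sentence)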

Pre-process tweet data


100%|██████████| 10806/10806 [00:00<00:00, 252683.85it/s]


## Model Training

Before training FastText, we have to decide on the model configuration (hyperparameters). Thanks to `gensim`, training is handled by a single constructor call, so we don't have to train things manually as with GloVe.

In [10]:
model = FastText(data, vector_size=100, window=5, min_count=1, workers=4, sg=1)

In [11]:
model.save("my_fasttext_model.bin")

## Checking Word Embeddings

In [16]:
word_embedding = model.wv["aku"]

# Get embedding for a sentence (average of word embeddings).
# `sentence` is the last cleaned tweet left over from the pre-processing loop.
sentence_embedding = numpy.mean([model.wv[word] for word in sentence.split()], axis=0)

# Print embeddings
print("Embedding for 'aku':", word_embedding)
print("Embedding for sentence:", sentence_embedding)

Embedding for 'aku': [-0.2118858  -0.46510908  0.07356274  0.5000337   0.15858892  0.01225958
  0.54230964  0.3593001  -0.45987177 -0.04324636 -0.36038065 -0.22016703
 -0.58985645  0.3964743  -0.4786581  -0.18134038  0.91035175  0.26516315
 -0.01888126 -0.24757151  0.7911123   0.26244837 -0.20457892  0.27658257
 -0.4462392  -0.14236915  0.1056407   0.3094996  -0.10937995 -0.11546614
  0.02850477 -0.12288569 -0.04496423  0.1452247   0.33418927  0.35902923
  0.17369899  0.15441279  0.08067826 -0.11640466 -0.23620823 -0.38168007
 -0.22793981 -0.13616744  0.43462357  0.3934606  -0.29961938 -0.32402587
  0.44561595 -0.49166292 -0.09167553 -0.18308903  0.08269393  0.37085322
 -0.08763634  0.11561795  0.28473616 -0.09325995  0.01993903 -0.1463585
 -0.10860904  0.4987945  -0.3746503   0.6226495   0.44252872  0.34606388
  0.03923747 -0.13738218  0.24354866  0.19789657 -0.4108645   0.727028
  0.36057258 -0.31582418  0.45508984 -0.3645546   0.22004052 -0.50036615
 -0.41709575  0.3863687   0.20096
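
Averaging breaks down when a sentence has no tokens, because `numpy.mean` of an empty list returns `nan`. A small sketch of a safer helper, assuming a plain dictionary of made-up 2-dimensional vectors in place of `model.wv` (the names `toy_vectors` and `sentence_vector` are hypothetical):

```python
import numpy

# Made-up vectors standing in for model.wv.
toy_vectors = {
    "aku": numpy.array([1.0, 2.0]),
    "suka": numpy.array([3.0, 4.0]),
}


def sentence_vector(tokens, vectors, size):
    """Average the vectors of `tokens`; return zeros for an empty sentence."""
    found = [vectors[token] for token in tokens if token in vectors]
    if not found:
        return numpy.zeros(size)
    return numpy.mean(found, axis=0)


print(sentence_vector(["aku", "suka"], toy_vectors, size=2))  # [2. 3.]
print(sentence_vector([], toy_vectors, size=2))               # [0. 0.]
```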

## How to Use the Pre-trained Model

In [23]:
model = FastText.load("/content/my_fasttext_model.bin")

word_embedding = model.wv["hahahaha"]

# Get embedding for a sentence (average of word embeddings).
# `sentence` is still the last cleaned tweet from the pre-processing loop.
sentence_embedding = numpy.mean([model.wv[word] for word in sentence.split()], axis=0)

# Print embeddings
print("Embedding for 'hahahaha':", word_embedding)
print("Embedding for sentence:", sentence_embedding)

Embedding for 'hahahaha': [-0.18335165 -0.34334713  0.04389186  0.48133576  0.5801304  -0.2742045
  0.5116993   0.7824809  -0.13718823 -0.04707021 -0.46549746 -0.21550377
 -0.63309103  0.7176306  -0.38831735 -0.16682045  0.89352465  0.19272158
 -0.17004365 -0.41392803  1.0172237   0.12615032 -0.37084568  0.29657534
 -0.370208   -0.6376893   0.24954556  0.21593751 -0.06407838  0.00776209
 -0.0479213   0.07923892  0.1431387   0.06187275  0.17086685  0.3097347
 -0.10890402 -0.03695381  0.13780384 -0.00698729 -0.6498753  -0.78965616
 -0.10598212 -0.10334729  0.18594596  0.24666108 -0.52221304 -0.5148595
  0.09788939 -0.39837727 -0.09735121  0.08195758  0.23670644  0.37776226
 -0.12358955  0.03011106  0.26545256  0.01752131  0.32981822 -0.35291848
 -0.04962087  0.2389417  -0.28921485  0.61568934  0.53317904  0.5684278
 -0.10133695  0.02910247  0.32810807  0.05823603 -0.44150966  0.5480923
  0.29240164 -0.5463312   0.24345692 -0.50213045  0.5810929  -0.540034
 -0.3065107   0.33544046  0.1327