## Transforming Words into Word Embeddings
Natural Language Processing (NLP) models operate on numbers, so each word must first be mapped to a numeric vector. Here, every word in the tweets is converted into such a vector, called an embedding, using the pretrained social media word vectors provided by Malaya (https://malaya.readthedocs.io/en/3.9/load-wordvector.html).

In [None]:
import numpy as np
import pandas as pd

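Load the data. The cell below mounts Google Drive and reads the CSV in Google Colab; on a local computer, skip the mount and point `pd.read_csv` at the local file path instead.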

In [None]:
# Run from Google Colab
from google.colab import drive
drive.mount('/content/drive')
df = pd.read_csv('drive/MyDrive/Colab/Hate/tweets_malay.csv')

Mounted at /content/drive


In [None]:
df.columns = ["text", "label"]
df

Unnamed: 0,text,label
0,haha babi dia punya tidak menyabar macam ada 1...,1
1,ini namanya pns kontol banyak gaya emosi aing ...,1
2,pukimak punya jantan trick baru dia guna bud...,1
3,pantat apa eh jual karipap inti basi nak menia...,1
4,ini warga emas ke oku le frontliner apa kepent...,1
...,...,...
1855,temen gw banget sudah tau doi nya toxic banget...,0
1856,kau komen lebai la mende la vid lucah pn kau l...,1
1857,pastu kau tahu kain dalam eh babi aku dah paka...,1
1858,sekarang ramai babi dah pandai drive kete atas...,1


In [None]:
df['label'].value_counts()

1    1188
0     672
Name: label, dtype: int64
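
The two classes are not evenly represented. As a quick worked check, the sketch below recomputes the class shares from the counts printed above. The variable names are illustrative only; on the actual DataFrame, `df['label'].value_counts(normalize=True)` should give the same proportions.

```python
# Label counts copied from the value_counts() output above
counts = {1: 1188, 0: 672}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

for label, share in shares.items():
  print(f"label {label}: {counts[label]} of {total} tweets ({share:.1%})")
```

This works out to roughly 63.9% of tweets with label 1 and 36.1% with label 0.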

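Load the word embeddings from the Malaya social media corpus. The function below reuses a cached pickle file if one already exists; otherwise it builds the word-to-vector dictionary with Malaya and saves it for later runs.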

In [None]:
import joblib
from os import path

def load_word_vector(file = 'drive/MyDrive/Colab/Hate/word_vector.pkl'):
  """Load cached word vectors from file, or build them with Malaya and cache them."""
  if path.exists(file):
    # Reuse the dictionary saved by an earlier run
    print('Loading word vectors...')
    word_vector = joblib.load(file)
    print('Word vectors loaded.')
  else:
    # First run: install Malaya and load its social media word vectors
    !pip install malaya
    import malaya

    print('Generating word vectors...')
    vocab, embedded = malaya.wordvector.load(model = 'socialmedia')
    wv = malaya.wordvector.WordVector(embedded, vocab)
    # Map every vocabulary word to its vector
    word_vector = {word: wv.get_vector_by_name(word) for word in wv.words}
    joblib.dump(word_vector, file)
    print('Word vectors saved.')

  return word_vector

word_vector = load_word_vector()

Loading word vectors...
Word vectors loaded.


In [None]:
# Show the word vector of the word Malaysia
word_vector['malaysia']

array([-2.30527613e-02, -7.00099543e-02, -7.73596168e-02,  5.76117575e-01,
        2.91033030e-01, -1.75550848e-01, -7.64554320e-03, -1.58667445e-01,
       -3.04837413e-02,  1.37347266e-01, -1.14185810e-01, -1.53691053e-01,
       -2.77505696e-01,  1.81020275e-02,  1.18203446e-01, -1.21642143e-01,
       -5.64342737e-02, -4.49809104e-01,  1.55513704e-01,  1.95306510e-01,
        1.13781989e-02, -1.10966735e-01, -4.31171991e-02, -6.23263121e-02,
       -1.46340027e-01,  2.90884599e-02,  1.68946147e-01, -5.46257079e-01,
       -2.80946821e-01, -5.25284512e-03, -4.22020286e-01, -3.56583521e-02,
       -1.92307606e-02, -3.48818123e-01, -2.38249555e-01, -1.02884211e-01,
       -3.54436100e-01, -1.33764833e-01,  2.58435667e-01, -8.92456025e-02,
       -6.93929613e-01,  4.63487327e-01, -2.31501445e-01,  3.48954909e-02,
        2.87289880e-02, -3.62083554e-01, -2.41099656e-01,  6.39930591e-02,
       -1.49497420e-01, -2.81769693e-01,  2.09515467e-01, -4.19085175e-01,
        6.95659369e-02,  

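Represent each tweet as a single vector: look up the Malaya embedding of every word in the tweet that appears in the vocabulary, then average those vectors. A tweet with no known words gets a zero vector of the same length (256).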

In [None]:
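def get_embeddings(doc, word_vector):
  """Return the mean embedding of the known words in doc (zeros if none are known)."""
  words = doc.split()
  # Keep only words that have a vector; out-of-vocabulary words are skipped
  v = np.array([word_vector[word] for word in words if word in word_vector])

  if v.size:
    # Average over the words (axis 0) to get one vector per document
    embeddings = v.mean(axis = 0)
  else:
    # No known words: fall back to a zero vector of the embedding size
    embeddings = np.zeros(256, dtype=float)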

  return embeddings
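
To make the averaging concrete, here is a minimal sketch on a made-up three-dimensional vocabulary. `toy_vectors` and `mean_embedding` are hypothetical names used only for illustration (the real Malaya vectors have 256 dimensions), and unlike `get_embeddings` this variant infers the vector size from the dictionary instead of hard-coding it.

```python
import numpy as np

# Hypothetical 3-dimensional vectors, for illustration only
toy_vectors = {
  'saya': np.array([1.0, 0.0, 2.0]),
  'suka': np.array([3.0, 2.0, 0.0]),
}

def mean_embedding(doc, vectors):
  # Infer the embedding size from any vector in the dictionary
  dim = len(next(iter(vectors.values())))
  known = [vectors[w] for w in doc.split() if w in vectors]
  # Average the known words; fall back to zeros if none are known
  return np.mean(known, axis=0) if known else np.zeros(dim)

print(mean_embedding('saya suka durian', toy_vectors))  # 'durian' is skipped
print(mean_embedding('durian', toy_vectors))            # no known words
```

The first call averages the vectors for `saya` and `suka`, giving `[2. 1. 1.]`; the second finds no known words and returns `[0. 0. 0.]`.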

In [None]:
df['embeddings'] = df['text'].apply(lambda text: get_embeddings(text, word_vector))
df

Unnamed: 0,text,label,embeddings
0,haha babi dia punya tidak menyabar macam ada 1...,1,"[0.014141768, -0.1007184, -0.13147345, -0.0103..."
1,ini namanya pns kontol banyak gaya emosi aing ...,1,"[-0.010759941, -0.08435769, -0.08514198, -0.03..."
2,pukimak punya jantan trick baru dia guna bud...,1,"[0.030590499, -0.091671385, -0.11107549, -0.02..."
3,pantat apa eh jual karipap inti basi nak menia...,1,"[-0.092314355, -0.09794184, -0.118868366, -0.0..."
4,ini warga emas ke oku le frontliner apa kepent...,1,"[0.06710751, -0.09331882, -0.11471927, -0.0963..."
...,...,...,...
1855,temen gw banget sudah tau doi nya toxic banget...,0,"[0.030354695, -0.13118468, -0.11131802, -0.078..."
1856,kau komen lebai la mende la vid lucah pn kau l...,1,"[0.054582026, -0.11036155, -0.10040197, -0.104..."
1857,pastu kau tahu kain dalam eh babi aku dah paka...,1,"[0.03438075, -0.0802009, -0.11075263, -0.09538..."
1858,sekarang ramai babi dah pandai drive kete atas...,1,"[0.08207133, -0.06931462, -0.104793094, 0.0350..."
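
Because a tweet with no in-vocabulary words is mapped to an all-zero vector, it can be worth checking how many rows ended up that way. The sketch below does this on a small stand-in list; `sample_embeddings` is a hypothetical placeholder for the `df['embeddings']` column, not the real data.

```python
import numpy as np

# Hypothetical stand-in for df['embeddings']: one vector per tweet
sample_embeddings = [
  np.array([0.1, -0.2, 0.3]),
  np.zeros(3),                 # a tweet with no known words
  np.array([0.0, 0.5, -0.1]),
]

# Stack into a 2-D array of shape (n_tweets, n_dimensions)
matrix = np.vstack(sample_embeddings)

# A row counts as empty when every component is exactly zero
empty_rows = ~matrix.any(axis=1)
print(matrix.shape)
print(int(empty_rows.sum()), 'of', len(matrix), 'tweets have no known words')
```

Applied to the real column, `np.vstack(df['embeddings'].to_list())` would give an array with one row per tweet and one column per embedding dimension.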
