## Prediction-Based Text Vectorization


Prediction-based text vectorization uses a language-prediction model to produce numerical representations of text.

This exercise uses the gensim library, which can be installed with `pip install gensim`.

# 1. CBOW

In [1]:
import pandas as pd
from gensim.models import Word2Vec
import gensim

In [2]:
data_path = '/content/dataset.json'

In [3]:
data = pd.read_json(data_path, lines=True)

In [4]:
data.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,14375,14376,14377,14378,14379,14380,14381,14382,14383,14384
0,PPP resmi mendukung Ganjar Pranowo sebagai cap...,Manajemen keuangan yang baik bisa dimulai sej...,Wakil Ketua Dewan Pembina Gerindra Sandiaga Un...,Ketua Umum PKB Muhaimin Iskandar alias Cak Imi...,"Seorang pengendara motor di Kota Cimahi, Jawa ...",.\n\nLayanan belanja sekarang bayar nanti (pay...,Laptop saat ini jadi perangkat penting yang d...,Laptop saat ini jadi perangkat penting yang d...,.\n\nAC Milan sukses menendang Napoli dari L...,Badan Pusat Statistik (BPS ) menyebut anjlokny...,...,Kepolisian Resor Mamuju Tengah menangkap dua o...,Barcelona harus puas berbagi angka saat berha...,Polres Jeneponto diserang pada Kamis dini hari...,Polres Jeneponto diserang pada Kamis dini hari...,".\n\nLampung Geh, Lampung Selatan - Menteri B...",Polres Jeneponto diserang pada Kamis dini hari...,Santri mengamati uang pecahan Rp10 ribu saat ...,Singapore Food Agency (SFA) menyetop impor ba...,Pemerintahan Presiden Amerika Serikat (AS) Joe...,Pemerintahan Presiden Amerika Serikat (AS) Joe...


In [5]:
# Transpose data
data = data.T

In [6]:
data.head()

Unnamed: 0,0
0,PPP resmi mendukung Ganjar Pranowo sebagai cap...
1,Manajemen keuangan yang baik bisa dimulai sej...
2,Wakil Ketua Dewan Pembina Gerindra Sandiaga Un...
3,Ketua Umum PKB Muhaimin Iskandar alias Cak Imi...
4,"Seorang pengendara motor di Kota Cimahi, Jawa ..."


In [7]:
data.rename(columns={0: 'content'}, inplace=True)

In [8]:
data.tail()

Unnamed: 0,content
14380,Polres Jeneponto diserang pada Kamis dini hari...
14381,Santri mengamati uang pecahan Rp10 ribu saat ...
14382,Singapore Food Agency (SFA) menyetop impor ba...
14383,Pemerintahan Presiden Amerika Serikat (AS) Joe...
14384,Pemerintahan Presiden Amerika Serikat (AS) Joe...


In [9]:
data = data.drop_duplicates(keep="first", inplace=False)
data = data.dropna(subset=['content'])

In [11]:
data

Unnamed: 0,content
0,PPP resmi mendukung Ganjar Pranowo sebagai cap...
1,Manajemen keuangan yang baik bisa dimulai sej...
2,Wakil Ketua Dewan Pembina Gerindra Sandiaga Un...
3,Ketua Umum PKB Muhaimin Iskandar alias Cak Imi...
4,"Seorang pengendara motor di Kota Cimahi, Jawa ..."
...,...
14377,Polres Jeneponto diserang pada Kamis dini hari...
14379,".\n\nLampung Geh, Lampung Selatan - Menteri B..."
14381,Santri mengamati uang pecahan Rp10 ribu saat ...
14382,Singapore Food Agency (SFA) menyetop impor ba...


Convert the text into the form gensim's Word2Vec expects: one list of tokens per document.

In [12]:
clean_data = data['content'].apply(gensim.utils.simple_preprocess)

In [13]:
clean_data

0        [ppp, resmi, mendukung, ganjar, pranowo, sebag...
1        [manajemen, keuangan, yang, baik, bisa, dimula...
2        [wakil, ketua, dewan, pembina, gerindra, sandi...
3        [ketua, umum, pkb, muhaimin, iskandar, alias, ...
4        [seorang, pengendara, motor, di, kota, cimahi,...
                               ...                        
14377    [polres, jeneponto, diserang, pada, kamis, din...
14379    [lampung, geh, lampung, selatan, menteri, bumn...
14381    [santri, mengamati, uang, pecahan, rp, ribu, s...
14382    [singapore, food, agency, sfa, menyetop, impor...
14383    [pemerintahan, presiden, amerika, serikat, as,...
Name: content, Length: 8637, dtype: object
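The token lists above suggest what `gensim.utils.simple_preprocess` does with its default settings: it lowercases the text, keeps runs of alphabetic characters (so `Rp10` becomes `rp`), and drops tokens that are very short or very long. The sketch below is a rough standard-library approximation of that behaviour, not gensim's own code. The function name `simple_preprocess_sketch` is hypothetical, and the length limits of 2 and 15 are assumed to match the library defaults.

```python
import re

# Runs of word characters that are not digits; an approximation of the
# tokenization gensim applies, written here only for illustration.
_ALPHABETIC = re.compile(r"(?:(?!\d)\w)+", re.UNICODE)


def simple_preprocess_sketch(text, min_len=2, max_len=15):
    """Lowercase the text, split it into alphabetic tokens, filter by length."""
    tokens = _ALPHABETIC.findall(text.lower())
    return [
        t for t in tokens
        if min_len <= len(t) <= max_len and not t.startswith("_")
    ]


tokens = simple_preprocess_sketch("Santri mengamati uang pecahan Rp10 ribu")
print(tokens)
```

For real preprocessing, keep calling `gensim.utils.simple_preprocess` as in the cell above; the sketch only makes the filtering rules visible.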

`sg`: the training algorithm, CBOW (`0`) or skip-gram (`1`). The default is CBOW (`0`).
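To make the difference between the two `sg` settings concrete, here is a minimal sketch of how each algorithm turns a tokenized sentence into training examples. It only illustrates the pairing. gensim's actual training also varies the effective window per position and learns from these examples with a neural objective, none of which is shown here. The helper names `cbow_examples` and `skipgram_examples` are hypothetical.

```python
def cbow_examples(tokens, window):
    """CBOW: the surrounding context words are used to predict the centre word."""
    examples = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        if context:
            examples.append((context, target))
    return examples


def skipgram_examples(tokens, window):
    """Skip-gram: the centre word is used to predict each context word."""
    return [
        (target, context_word)
        for context, target in cbow_examples(tokens, window)
        for context_word in context
    ]


sentence = ["ppp", "resmi", "mendukung", "ganjar"]
cbow = cbow_examples(sentence, window=1)
skipgram = skipgram_examples(sentence, window=1)

print(cbow[1])       # (['ppp', 'mendukung'], 'resmi')
print(skipgram[:2])  # [('ppp', 'resmi'), ('resmi', 'ppp')]
```

CBOW produces one example per position, while skip-gram produces one per (centre, context) pair, which is one reason skip-gram training tends to be slower on the same corpus.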

In [14]:
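# sg=0 selects CBOW; min_count=3 drops words seen fewer than 3 times; window=3 is the context window size
model_cbow = Word2Vec(clean_data, min_count=3, workers=3, window=3, sg=0)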

In [27]:
# Second CBOW model for comparison: higher min_count and a narrower window
model_cbow_2 = Word2Vec(clean_data, min_count=5, workers=3, window=2, sg=0)
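The two models differ in `min_count`, which controls which words enter the vocabulary at all. As a rough illustration, the sketch below counts tokens, discards those below the threshold, and indexes the rest by descending frequency, which is the ordering `key_to_index` shows later in this notebook. The function `build_key_to_index` and the toy corpus are made up for this example, and tie-breaking between equally frequent words may differ from gensim.

```python
from collections import Counter


def build_key_to_index(corpus, min_count):
    """Map each sufficiently frequent token to an index, most frequent first."""
    counts = Counter(token for doc in corpus for token in doc)
    kept = [word for word, count in counts.most_common() if count >= min_count]
    return {word: index for index, word in enumerate(kept)}


# Toy corpus (invented): 'yang' appears 3 times, 'di' twice, 'korea' once.
toy_corpus = [["yang", "di", "yang"], ["yang", "di", "korea"]]

vocab_min1 = build_key_to_index(toy_corpus, min_count=1)
vocab_min2 = build_key_to_index(toy_corpus, min_count=2)
print(vocab_min1)
print(vocab_min2)
```

A word removed this way has no vector, so looking it up in `model.wv` raises a `KeyError`. Raising `min_count` from 3 to 5 therefore shrinks the vocabulary of the second model.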

In [15]:
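# Passing clean_data to the Word2Vec constructor above already built the vocabulary
# and trained the model, so this cell rebuilds the vocabulary and trains again.
model_cbow.build_vocab(clean_data, progress_per=1000)
model_cbow.train(clean_data, total_examples=model_cbow.corpus_count, epochs=model_cbow.epochs)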



(12403460, 14119455)

In [28]:
model_cbow_2.build_vocab(clean_data, progress_per=1000)
model_cbow_2.train(clean_data, total_examples=model_cbow_2.corpus_count, epochs=model_cbow_2.epochs)



(12254032, 14119455)

In [16]:
model_cbow.save("./model_cbow.model")

In [41]:
model_cbow.wv.most_similar("korea")

[('rhodesia', 0.675208568572998),
 ('sumatra', 0.6643838882446289),
 ('afrika', 0.6131760478019714),
 ('tapanuli', 0.6056145429611206),
 ('pagai', 0.5894243121147156),
 ('tanabala', 0.5765926837921143),
 ('sumatera', 0.5752552151679993),
 ('luwu', 0.5743155479431152),
 ('sulawesi', 0.5727336406707764),
 ('kayong', 0.5506176352500916)]

In [29]:
model_cbow_2.wv.most_similar("jakarta")

[('tangerang', 0.5371121764183044),
 ('monas', 0.5288844704627991),
 ('perbelanjaan', 0.5111390948295593),
 ('korea', 0.48645028471946716),
 ('nagreg', 0.48588040471076965),
 ('kutub', 0.47718319296836853),
 ('bogor', 0.47226011753082275),
 ('jaksel', 0.470218300819397),
 ('tebet', 0.4584469795227051),
 ('pontianak', 0.4562956392765045)]
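The scores returned by `most_similar` are cosine similarities between word vectors. The sketch below shows that ranking in plain Python on invented 2-D vectors. The real vectors come from the trained model and have many more dimensions, and `most_similar_sketch` is a hypothetical helper, not part of gensim.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar_sketch(vectors, word, topn=10):
    """Rank every other word by cosine similarity to `word`, highest first."""
    query = vectors[word]
    scores = [
        (other, cosine(query, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Invented vectors, chosen so that 'tangerang' points roughly the same way as 'jakarta'.
toy_vectors = {
    "jakarta": [1.0, 0.1],
    "tangerang": [0.9, 0.2],
    "korea": [0.1, 1.0],
}

ranking = most_similar_sketch(toy_vectors, "jakarta")
print(ranking)
```

Because cosine similarity ignores vector length, a neighbour ranks highly whenever it points in a similar direction, regardless of how frequent either word is.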

In [30]:
model_cbow.wv['jakarta']

array([ 0.81410027, -1.537624  , -1.0829258 , -1.3937246 ,  2.080511  ,
       -1.7189581 , -2.9876356 , -0.8506641 ,  2.1221383 ,  0.309153  ,
        0.37311465,  1.0695043 , -2.4522593 , -0.06711048, -0.8234186 ,
        1.161898  , -0.08868101,  0.39217356,  2.3431296 , -3.0425837 ,
        3.3517861 ,  0.19228312,  2.0484798 ,  2.2557359 ,  0.4989    ,
        1.6929488 ,  0.24341449,  1.6142893 , -0.66548294, -0.7940363 ,
        2.448433  , -0.3906241 ,  0.6352482 , -0.4336496 , -1.7101483 ,
       -3.1923943 ,  0.4550399 ,  1.0709392 ,  0.54153645, -4.580053  ,
        2.241644  ,  0.91329813,  2.2185552 ,  0.9941851 ,  2.0615034 ,
        1.2530289 , -2.0640092 ,  1.4458201 ,  4.4924636 ,  0.07544965,
       -1.3628931 , -0.9391696 , -0.77249736, -0.01307233,  0.01506492,
       -0.38745734,  2.8226533 ,  2.6454558 ,  1.5090199 ,  0.6874521 ,
        1.2477663 ,  0.7477161 ,  1.6425675 ,  1.777242  , -2.5793765 ,
        0.7380453 , -1.59909   , -1.4852897 , -0.12736082, -1.49

In [31]:
model_cbow_2.wv['jakarta']

array([-2.6938076 , -0.5043423 ,  0.7696607 , -0.41427448, -0.49736008,
        1.9122648 ,  1.3616443 ,  1.6256201 ,  0.5459256 , -2.2120724 ,
       -0.77945167,  0.18299933, -1.3405207 ,  0.527081  , -0.56532764,
       -3.0832357 , -0.5363368 , -2.3178823 ,  0.20370966, -1.9522612 ,
       -1.9016595 , -1.0260912 ,  1.0630432 , -0.41219974,  1.9972317 ,
        0.6905018 ,  1.3612916 ,  1.8432562 ,  2.1302462 ,  0.94845515,
       -0.54347837,  0.10815062,  0.18798833, -0.18852116, -0.45961067,
        0.00599091,  1.4333259 , -0.9356349 ,  0.9400231 , -0.47939596,
        1.0534741 , -0.38795853,  0.15476716,  0.2348133 ,  2.0283124 ,
        1.4888669 ,  2.2342966 , -0.85814595,  0.8965871 , -0.0491261 ,
       -0.13725248, -0.39763418, -0.92175853,  0.05025787,  1.4011439 ,
       -0.1593177 , -1.635417  , -0.09061751, -2.721582  ,  1.740517  ,
        2.1743336 ,  0.19315435,  0.9301077 ,  3.447936  , -3.7608342 ,
        1.4827169 , -0.6927207 ,  0.33044666, -1.1281486 , -0.94

In [34]:
model_cbow.wv.most_similar_cosmul(positive=['jakarta', 'model'], negative=['tangerang'])

[('produk', 0.940834105014801),
 ('sertamu', 0.9323060512542725),
 ('listrik', 0.8901747465133667),
 ('pastinya', 0.8785401582717896),
 ('berkembang', 0.8744855523109436),
 ('jual', 0.8736470937728882),
 ('mengayomi', 0.8690223097801208),
 ('honda', 0.8647385835647583),
 ('bekal', 0.8618462085723877),
 ('pdb', 0.8606938719749451)]

In [33]:
model_cbow_2.wv.most_similar_cosmul(positive=['jakarta', 'model'], negative=['tangerang'])

[('beda', 0.8588116765022278),
 ('perantara', 0.8548314571380615),
 ('timbul', 0.8396003246307373),
 ('produk', 0.8242548704147339),
 ('maju', 0.8144422769546509),
 ('varian', 0.8102501630783081),
 ('kredit', 0.8095846772193909),
 ('kinerja', 0.8088123798370361),
 ('contoh', 0.8086586594581604),
 ('kunci', 0.8083571791648865)]
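`most_similar_cosmul` scores candidates with a multiplicative combination of cosine similarities (the 3CosMul objective that gensim's documentation attributes to Levy and Goldberg). Each cosine is shifted into the range 0 to 1, the positive terms are multiplied together, and the product is divided by the product of the negative terms. The sketch below is an assumption-laden illustration of that formula on invented 2-D vectors. `cosmul_score` is a hypothetical helper, and the small `eps` added to the denominator is only there to avoid division by zero.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosmul_score(candidate, positive, negative, eps=1e-6):
    """Multiplicative analogy score for one candidate vector."""
    # Shift each cosine from [-1, 1] into [0, 1] before multiplying.
    pos = math.prod((1 + cosine(candidate, p)) / 2 for p in positive)
    neg = math.prod((1 + cosine(candidate, n)) / 2 for n in negative)
    return pos / (neg + eps)


# Invented vectors: the candidate matches the positive word and is
# orthogonal to the negative word.
aligned = cosmul_score([1.0, 0.0], positive=[[1.0, 0.0]], negative=[[0.0, 1.0]])
opposed = cosmul_score([0.0, 1.0], positive=[[1.0, 0.0]], negative=[[0.0, 1.0]])
print(aligned, opposed)
```

A candidate scores highly when it is close to every positive word and far from every negative word. Ranking all vocabulary words by this score, excluding the query words themselves, gives lists like the ones above.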

In [19]:
model_cbow.wv.key_to_index

{'yang': 0,
 'di': 1,
 'dan': 2,
 'ini': 3,
 'dengan': 4,
 'untuk': 5,
 'dari': 6,
 'itu': 7,
 'dalam': 8,
 'pada': 9,
 'tidak': 10,
 'juga': 11,
 'akan': 12,
 'ke': 13,
 'ada': 14,
 'bisa': 15,
 'tersebut': 16,
 'atau': 17,
 'sudah': 18,
 'oleh': 19,
 'saat': 20,
 'kita': 21,
 'adalah': 22,
 'menjadi': 23,
 'karena': 24,
 'tahun': 25,
 'indonesia': 26,
 'orang': 27,
 'lebih': 28,
 'sebagai': 29,
 'kata': 30,
 'saya': 31,
 'mereka': 32,
 'dapat': 33,
 'satu': 34,
 'hari': 35,
 'dia': 36,
 'rp': 37,
 'hal': 38,
 'kepada': 39,
 'kamu': 40,
 'seperti': 41,
 'telah': 42,
 'bahwa': 43,
 'kami': 44,
 'masih': 45,
 'hingga': 46,
 'banyak': 47,
 'jadi': 48,
 'ia': 49,
 'melakukan': 50,
 'tak': 51,
 'anak': 52,
 'masyarakat': 53,
 'secara': 54,
 'hanya': 55,
 'namun': 56,
 'salah': 57,
 'harus': 58,
 'merupakan': 59,
 'jakarta': 60,
 'lalu': 61,
 'bagi': 62,
 'memiliki': 63,
 'para': 64,
 'lain': 65,
 'dilakukan': 66,
 'beberapa': 67,
 'kota': 68,
 'baik': 69,
 'baru': 70,
 'korban': 71,
 'atas

# 2. SKIP-GRAM

In [35]:
# sg=1 selects skip-gram; the other settings match model_cbow
model_skipgram = Word2Vec(clean_data, min_count=3, workers=3, window=3, sg=1)

In [43]:
# Second skip-gram model, with the same settings as model_cbow_2
model_skipgram_2 = Word2Vec(clean_data, min_count=5, workers=3, window=2, sg=1)

In [36]:
model_skipgram.build_vocab(clean_data, progress_per=1000)
model_skipgram.train(clean_data, total_examples=model_skipgram.corpus_count, epochs=model_skipgram.epochs)



(12403273, 14119455)

In [44]:
model_skipgram_2.build_vocab(clean_data, progress_per=1000)
model_skipgram_2.train(clean_data, total_examples=model_skipgram_2.corpus_count, epochs=model_skipgram_2.epochs)



(12252365, 14119455)

In [37]:
model_skipgram.save("./model_skipgram.model")

In [57]:
model_skipgram.wv.most_similar("sandiaga")

[('uno', 0.8341730237007141),
 ('salahuddin', 0.7296923398971558),
 ('menparekraf', 0.7230873703956604),
 ('oso', 0.6339700222015381),
 ('gerindra', 0.6202329993247986),
 ('rafli', 0.6170905828475952),
 ('ganjarian', 0.61518794298172),
 ('thohir', 0.6041550636291504),
 ('mundur', 0.6013341546058655),
 ('dipasangkan', 0.5976154208183289)]

In [56]:
model_skipgram_2.wv.most_similar("sandiaga")

[('uno', 0.7084506154060364),
 ('salahuddin', 0.6855056881904602),
 ('prabowo', 0.6673839688301086),
 ('menparekraf', 0.6644760370254517),
 ('hanura', 0.6409403085708618),
 ('duet', 0.638599157333374),
 ('mendeklarasikan', 0.6380643248558044),
 ('desmond', 0.6301162838935852),
 ('pengusung', 0.6264421939849854),
 ('ahy', 0.6241039633750916)]

In [55]:
model_skipgram.wv['sandiaga']

array([ 0.02649695,  0.02234598,  0.37347078,  0.11977343, -0.42182627,
        0.14438352,  1.4098506 ,  0.55257225, -0.43544993,  0.18819383,
        0.22773963, -0.49750057,  0.44329455,  0.1695091 , -0.5214634 ,
       -0.654717  ,  0.16290224, -0.02400804, -0.41588834,  0.18038285,
        0.38259786,  0.07778295, -0.094966  ,  0.9724474 ,  0.67444825,
        0.32952493, -0.55875725,  0.2972414 ,  0.2430169 ,  0.2887938 ,
        0.89275086, -0.89428836, -0.03470521, -0.0099121 , -0.826569  ,
        0.59627795,  0.67817974, -0.5944021 ,  0.6358281 , -0.28472146,
        0.5409063 , -0.3461897 , -0.49388683,  0.302856  ,  0.97842586,
       -0.47423643, -0.021933  , -0.02741507,  0.4888356 ,  0.6512335 ,
       -0.6251683 , -0.5025365 , -0.57432836,  0.15510829,  0.03307483,
        0.2068265 , -0.31838349, -0.27147502, -0.26349396,  0.41491348,
        0.2713875 ,  0.16051272,  0.39803588,  0.4063181 , -1.0327652 ,
        0.34001076, -0.06056989, -0.03469887, -0.03256943,  0.42

In [42]:
model_skipgram.wv.most_similar_cosmul(positive=['jakarta', 'mobil'], negative=['korea'])

[('bersenggolan', 0.9824720621109009),
 ('kendaraan', 0.9813573360443115),
 ('truk', 0.9468469619750977),
 ('jabodebek', 0.9440714716911316),
 ('bermotor', 0.9413668513298035),
 ('lajur', 0.9375934600830078),
 ('rush', 0.9285282492637634),
 ('lrt', 0.9272882342338562),
 ('transjakarta', 0.9256765842437744),
 ('roda', 0.9252162575721741)]

In [52]:
model_skipgram_2.wv.most_similar_cosmul(positive=['jakarta', 'mobil'], negative=['korea'])

[('spbu', 0.9834356307983398),
 ('sepeda', 0.9437838196754456),
 ('kendaraan', 0.942536473274231),
 ('motor', 0.9336618781089783),
 ('unit', 0.9266325235366821),
 ('sedan', 0.9249581098556519),
 ('pikap', 0.9145265817642212),
 ('pajero', 0.9132686257362366),
 ('terminal', 0.9052368998527527),
 ('jatinegara', 0.9000824689865112)]

In [53]:
model_skipgram.wv.key_to_index

{'yang': 0,
 'di': 1,
 'dan': 2,
 'ini': 3,
 'dengan': 4,
 'untuk': 5,
 'dari': 6,
 'itu': 7,
 'dalam': 8,
 'pada': 9,
 'tidak': 10,
 'juga': 11,
 'akan': 12,
 'ke': 13,
 'ada': 14,
 'bisa': 15,
 'tersebut': 16,
 'atau': 17,
 'sudah': 18,
 'oleh': 19,
 'saat': 20,
 'kita': 21,
 'adalah': 22,
 'menjadi': 23,
 'karena': 24,
 'tahun': 25,
 'indonesia': 26,
 'orang': 27,
 'lebih': 28,
 'sebagai': 29,
 'kata': 30,
 'saya': 31,
 'mereka': 32,
 'dapat': 33,
 'satu': 34,
 'hari': 35,
 'dia': 36,
 'rp': 37,
 'hal': 38,
 'kepada': 39,
 'kamu': 40,
 'seperti': 41,
 'telah': 42,
 'bahwa': 43,
 'kami': 44,
 'masih': 45,
 'hingga': 46,
 'banyak': 47,
 'jadi': 48,
 'ia': 49,
 'melakukan': 50,
 'tak': 51,
 'anak': 52,
 'masyarakat': 53,
 'secara': 54,
 'hanya': 55,
 'namun': 56,
 'salah': 57,
 'harus': 58,
 'merupakan': 59,
 'jakarta': 60,
 'lalu': 61,
 'bagi': 62,
 'memiliki': 63,
 'para': 64,
 'lain': 65,
 'dilakukan': 66,
 'beberapa': 67,
 'kota': 68,
 'baik': 69,
 'baru': 70,
 'korban': 71,
 'atas