# Word Embedding

Word embedding is a technique for representing words as vectors, typically with a low number of dimensions.

Words with similar meanings end up with vector representations that lie close to one another in that vector space.
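
As a rough illustration of what "close to one another" means, cosine similarity is one common way to compare two vectors. The three-dimensional vectors below are made-up values chosen only for this sketch; they are not the output of any real embedding model.

```python
import math

# Hypothetical 3-dimensional vectors, invented for illustration only.
toy_vectors = {
    "king":  [0.90, 0.80, 0.10],
    "queen": [0.85, 0.75, 0.20],
    "apple": [0.10, 0.20, 0.90],
}

def cosine_similarity(u, v):
    """Cosine of the angle between u and v (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

sim_related = cosine_similarity(toy_vectors["king"], toy_vectors["queen"])
sim_unrelated = cosine_similarity(toy_vectors["king"], toy_vectors["apple"])
print(sim_related, sim_unrelated)  # the related pair scores higher
```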

There are several Word Embedding algorithms, including:

- TF-IDF (Term Frequency-Inverse Document Frequency)
- BoW (Bag of Words)

Here we will only look at TF-IDF.
So what is it? Technically, TF-IDF is a statistical method for scoring **how important a word is to one document within a collection of documents**.
How is that importance measured? Through two quantities, TF and IDF.
* TF -> Term Frequency measures how often a word appears in a single document.
* IDF -> Inverse Document Frequency gives extra weight to words that appear in only a few documents of the collection, because those rarer words are considered more informative, and less weight to words that appear in almost every document.

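Formula:
```text
TF-IDF(t, d) = TF(t, d) * IDF(t)

t = a term (word)
d = a document
TF(t, d) = (number of times term t appears in document d) / (total number of terms in d)

IDF formula (the form the code in this notebook is written around):
IDF(t) = log10(N / (DF(t) + 1))

N = number of documents in the collection
DF(t) = number of documents that contain term t
```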

In [22]:
# import the libraries we need
import pandas as pd
import sklearn as sk
import math

import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

In [21]:
# How to use TF-IDF methods
first_sentence = "This is a sentence"
second_sentence = "This is another sentence"

# tokenize
first_sentence = word_tokenize(first_sentence)
second_sentence = word_tokenize(second_sentence)
total = set(first_sentence).union(set(second_sentence)) # vocabulary: every word that appears in either document (union)
print(total)

# count how often each word occurs (using a dict as a hash map)
wordDictA = dict.fromkeys(total, 0)
wordDictB = dict.fromkeys(total, 0)
for word in first_sentence:
    wordDictA[word] += 1
for word in second_sentence:
    wordDictB[word] += 1

pd.DataFrame([wordDictA, wordDictB])

{'another', 'is', 'a', 'This', 'sentence'}


   another  is  a  This  sentence
0        0   1  1     1         1
1        1   1  0     1         1
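
The same count table can be sketched with `collections.Counter` from the standard library. To stay dependency-free, this sketch assumes plain whitespace splitting with `str.split()` instead of `word_tokenize`; that gives the same tokens for these two sentences but may differ on text containing punctuation.

```python
from collections import Counter

docs = ["This is a sentence", "This is another sentence"]

# Assumption: whitespace tokenization stands in for word_tokenize here.
tokenized = [doc.split() for doc in docs]

# Sorted vocabulary so the column order is deterministic (a set has no fixed order).
vocabulary = sorted(set(word for tokens in tokenized for word in tokens))

# One Counter per document; a Counter returns 0 for words it has not seen.
counts = [Counter(tokens) for tokens in tokenized]
rows = [[c[word] for word in vocabulary] for c in counts]

print(vocabulary)
for row in rows:
    print(row)
```

Sorting the vocabulary also avoids the arbitrary column order that printing a `set` produces.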


Compute the TF-IDF values (starting with TF)

In [23]:
def computeTF(wordDict, doc):
    # TF = occurrences of the word in this document / number of tokens in the document
    tfDict = {}
    tokenCount = len(doc)
    for word, count in wordDict.items():
        tfDict[word] = count / tokenCount
    return tfDict

tfFirst = computeTF(wordDictA, first_sentence)
tfSecond = computeTF(wordDictB, second_sentence)

tf = pd.DataFrame([tfFirst, tfSecond])
print(tf)

   another    is     a  This  sentence
0     0.00  0.25  0.25  0.25      0.25
1     0.25  0.25  0.00  0.25      0.25


Filter the sentence (using stopwords)

In [26]:
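eng_stopwords = stopwords.words('english')
# Note: this iterates over the keys of wordDictA, i.e. the whole vocabulary, and the
# comparison is case-sensitive, so 'This' is kept because NLTK's list is lowercase.
filtered_sentence = [word for word in wordDictA if word not in eng_stopwords]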

print(filtered_sentence)

['another', 'This', 'sentence']
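
Because the comparison above is case-sensitive, 'This' survives the filter. A minimal sketch of a case-insensitive variant follows. `small_stopwords` is a hypothetical hand-picked subset used only to keep the example self-contained; it is not NLTK's actual list.

```python
# Hypothetical stand-in for stopwords.words('english'); the real list is much longer.
small_stopwords = {"this", "is", "a", "the", "of"}

tokens = ["This", "is", "another", "sentence"]

# Lowercase each token before the membership test so "This" matches "this".
filtered = [word for word in tokens if word.lower() not in small_stopwords]

print(filtered)
```

Using a `set` for the stopwords also makes each membership test constant-time, which matters on longer texts.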


Compute IDF

In [27]:
def computeIDF(docList):
    N = len(docList)  # number of documents

    # NOTE: every value starts at 0 and document frequencies are never counted,
    # so val is always 0 below and every term gets log10(N / 1).
    idfDict = dict.fromkeys(docList[0].keys(), 0)
    for word, val in idfDict.items():
        idfDict[word] = math.log10(N / (float(val) + 1)) # the whole IDF formula is expressed here

    idf_df = pd.DataFrame(list(idfDict.items()), columns=['Term', 'IDF'])
    return idf_df

idfs = computeIDF([wordDictA, wordDictB])
print(idfs)

       Term      IDF
0   another  0.30103
1        is  0.30103
2         a  0.30103
3      This  0.30103
4  sentence  0.30103
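
Note that in the cell above `val` is read from a dictionary that was just initialised to 0, so the document frequency is never actually counted and every term receives log10(2 / 1) = 0.30103. A minimal sketch that counts document frequency first, keeping the same log10(N / (df + 1)) form, could look like this. `compute_idf_with_df` is a name made up for this sketch.

```python
import math

def compute_idf_with_df(doc_list):
    """IDF per term as log10(N / (df + 1)), where df = documents containing the term."""
    n_docs = len(doc_list)
    idf = {}
    for word in doc_list[0]:
        # A document contains the word if its count for that word is non-zero.
        df = sum(1 for doc in doc_list if doc.get(word, 0) > 0)
        idf[word] = math.log10(n_docs / (df + 1))
    return idf

# Same counts as wordDictA and wordDictB in the notebook.
word_dict_a = {"another": 0, "is": 1, "a": 1, "This": 1, "sentence": 1}
word_dict_b = {"another": 1, "is": 1, "a": 0, "This": 1, "sentence": 1}

idf = compute_idf_with_df([word_dict_a, word_dict_b])
for term, value in idf.items():
    print(term, round(value, 6))
```

With only two documents, the `+ 1` in the denominator gives exactly 0 to terms found in one document and a negative value (log10(2 / 3)) to terms found in both, so this form is only informative on larger collections.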


Compute TF-IDF

In [30]:
def computeTFIDF(tfBoW, idfs):
    # multiply each term's TF by its IDF, looked up in the idfs DataFrame
    tfidf = {}
    for word, val in tfBoW.items():
        tfidf[word] = val * idfs.loc[idfs['Term'] == word, 'IDF'].iloc[0]
    return tfidf

tfidfFirst = computeTFIDF(tfFirst, idfs)
tfidfSecond = computeTFIDF(tfSecond, idfs)

tfidf = pd.DataFrame([tfidfFirst, tfidfSecond])
print(tfidf)

    another        is         a      This  sentence
0  0.000000  0.075257  0.075257  0.075257  0.075257
1  0.075257  0.075257  0.000000  0.075257  0.075257


# TF-IDF using a library

In [31]:
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
from numpy.linalg import norm

In [32]:
Document1 = "This is a sentence"
Document2 = "This is another sentence"

Doc = [Document1, Document2]
print(Doc)

vectorizer = TfidfVectorizer()
analyzer = vectorizer.build_analyzer()

# Compute TF-IDF
X = vectorizer.fit_transform(Doc)
print(vectorizer.get_feature_names_out())   # the vocabulary learned from the documents

print("Document 1: ", analyzer(Document1))
print("Document 2: ", analyzer(Document2))

x1 = X.toarray()[0]
x2 = X.toarray()[1]

# Cosine Similarity
cosine_sim = np.dot(x1, x2) / (norm(x1) * norm(x2))
print(cosine_sim) # the closer to 1, the more similar

['This is a sentence', 'This is another sentence']
['another' 'is' 'sentence' 'this']
Document 1:  ['this', 'is', 'sentence']
Document 2:  ['this', 'is', 'another', 'sentence']
0.7765145304745157
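
The library's numbers differ from the manual calculation because `TfidfVectorizer` uses different defaults: it lowercases the text, its default token pattern keeps only tokens of two or more characters (which is why 'a' is missing above), it uses the smoothed IDF ln((1 + n) / (1 + df)) + 1, and it L2-normalises each row. The sketch below reproduces those defaults with the standard library only. It is an approximation for this two-sentence example, not a replacement for the library.

```python
import math
import re

docs = ["This is a sentence", "This is another sentence"]

# Lowercase, then keep tokens of at least two word characters.
tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in docs]
vocabulary = sorted(set(t for tokens in tokenized for t in tokens))
n_docs = len(docs)

# Smoothed IDF: ln((1 + n) / (1 + df)) + 1
idf = {}
for term in vocabulary:
    df = sum(1 for tokens in tokenized if term in tokens)
    idf[term] = math.log((1 + n_docs) / (1 + df)) + 1

def tfidf_row(tokens):
    """Raw term count times IDF, then L2-normalised."""
    row = [tokens.count(term) * idf[term] for term in vocabulary]
    length = math.sqrt(sum(v * v for v in row))
    return [v / length for v in row]

x1, x2 = (tfidf_row(tokens) for tokens in tokenized)

# Rows are unit length, so the dot product is already the cosine similarity.
cosine_sim = sum(a * b for a, b in zip(x1, x2))
print(vocabulary)
print(round(cosine_sim, 6))
```

Because each row is already normalised to unit length, the division by the norms in the cosine formula is a no-op here.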
