# Computing TF-IDF and Cosine Similarity

We are given 10 documents (Doc1 through Doc10) in .txt format, each containing the abstract of a different paper. The papers were published in IJCCS (Indonesian Journal of Computing and Cybernetics Systems) within the last 5 years and are written in Indonesian. The goal of this program is to compute TF-IDF values and to compare how similar the 10 abstracts are to one another using cosine similarity.

## Importing Libraries

In [1]:
import re
import math
import numpy as np
import pandas as pd
from Sastrawi.StopWordRemover.StopWordRemoverFactory import StopWordRemoverFactory
from Sastrawi.Stemmer.StemmerFactory import StemmerFactory
from nltk.tokenize import sent_tokenize, word_tokenize

## Preprocessing Each Document

In this stage, the words in each document are cleaned up before any counting is done.
The preprocessing stage consists of:
1. Skipping blank (whitespace-only) lines
2. Converting all letters to lowercase
3. Removing symbols and digits
4. Stemming (reducing each word to its root form)
5. Tokenizing each sentence into an array of words
6. Removing the words in that array that appear in the Indonesian stopword list
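
The steps above can be sketched on a single sentence using only the standard library. This is a minimal illustration rather than the notebook's actual pipeline: the stopword set `TOY_STOPWORDS` and the helper `preprocess_sentence` are hypothetical names, stemming is left out because it depends on Sastrawi, and a plain `split()` stands in for `word_tokenize`.

```python
import re

# Hypothetical, hand-picked stopwords for illustration only;
# the notebook uses Sastrawi's full Indonesian stopword list.
TOY_STOPWORDS = {"yang", "dan", "di", "ini", "untuk"}

def preprocess_sentence(sentence):
    """Lowercase, drop non-letters, tokenize, and remove stopwords."""
    text = sentence.lower()
    text = re.sub(r"[^a-z]", " ", text)  # symbols and digits become spaces
    tokens = text.split()                # whitespace tokenization
    return [t for t in tokens if t not in TOY_STOPWORDS]

tokens = preprocess_sentence("Penelitian ini menggunakan 10 dokumen, dan 2 metode!")
print(tokens)  # ['penelitian', 'menggunakan', 'dokumen', 'metode']
```

Replacing non-letters with a space, rather than deleting them, keeps words separated when they are joined only by punctuation.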

In [4]:
## Preprocessing
# Build the Sastrawi helpers once instead of once per document
stop_words = set(StopWordRemoverFactory().get_stop_words())
stemmer = StemmerFactory().create_stemmer()

corpus = []  # one list of preprocessed words per document
for i in range(10):
    abstract_words = []
    # The context manager closes the file even if an error occurs
    with open("abstrak{}.txt".format(i+1), "r", encoding="utf-8") as abstract_file:
        # extract the words from every non-blank line of the abstract
        for line in abstract_file:
            if not line.strip():
                continue
            for sentence in sent_tokenize(line):
                sentence = sentence.lower()
                sentence = re.sub(r'[^a-zA-Z]', ' ', sentence)  # drop symbols and digits
                sentence = stemmer.stem(sentence)  # reduce words to their root form
                words = word_tokenize(sentence)
                abstract_words += [w for w in words if w not in stop_words]

    corpus.append(abstract_words)

## Storing the Unique Words Across All Documents

In [25]:
#Build the bag of words: every distinct word across all abstracts, sorted
bag_of_words = np.unique([word for document in corpus for word in document])
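
As a small illustration of what the bag of words contains, the sketch below builds a vocabulary from a two-document toy corpus in plain Python. The names `toy_corpus` and `vocabulary` are hypothetical and the words are made up; sorting is used so that each word maps to a stable row index, which is what the later matrices rely on.

```python
# Hypothetical toy corpus: two already-preprocessed documents.
toy_corpus = [
    ["sistem", "data", "sistem"],
    ["data", "model"],
]

# Flatten all documents, deduplicate with a set, and sort for a stable order.
vocabulary = sorted({word for document in toy_corpus for word in document})
print(vocabulary)  # ['data', 'model', 'sistem']
```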

## Computing the Normalized Term Frequency (TF)

In [9]:
## Calculating the Term Frequency
def term_frequency(document, word):
    return document.count(word)

# Rows are vocabulary terms, columns are documents
tf = np.zeros((bag_of_words.shape[0], len(corpus)))
for i in range(len(corpus)):
    for j in range(len(tf)):
        tf[j, i] = term_frequency(corpus[i], bag_of_words[j])
    # Normalize by the document's total word count so each column sums to 1
    tf[:, i] /= np.sum(tf[:, i])
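
To make the normalization concrete, here is a minimal sketch for a single toy document. The helper `normalized_tf` and the toy data are hypothetical. It divides by the document length, which matches dividing by the column sum as long as every word of the document is in the vocabulary, as is the case in this notebook.

```python
from collections import Counter

def normalized_tf(document, vocabulary):
    """Return each vocabulary word's count divided by the document length."""
    counts = Counter(document)
    total = len(document)
    return [counts[word] / total for word in vocabulary]

toy_vocabulary = ["data", "model", "sistem"]
toy_document = ["sistem", "data", "sistem", "model"]

tf_column = normalized_tf(toy_document, toy_vocabulary)
print(tf_column)  # [0.25, 0.25, 0.5]
```

Normalizing this way keeps long abstracts from dominating short ones simply because they contain more words.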

## Computing the Inverse Document Frequency (IDF)

In [10]:
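## Calculating the Inverse Document Frequency
def document_frequency(document, word, count):
    # Increment the running count if the word occurs in this document
    if word in document:
        count += 1
    return count

def inverse_document_frequency(df):
    # Smoothed variant log(N / (df + 1)), where N is the number of documents;
    # a term found in all N documents gets a slightly negative weight
    n_documents = len(corpus)
    return np.log(n_documents/(df + 1))

idf = np.zeros((tf.shape[0], 1))
for i in range(len(bag_of_words)):
    # First accumulate the document frequency in idf[i, 0] ...
    for document in corpus:
        idf[i, 0] = document_frequency(document, bag_of_words[i], idf[i, 0])
    # ... then overwrite it with the corresponding IDF value
    idf[i, 0] = inverse_document_frequency(idf[i, 0])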
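
The following sketch evaluates the same `log(N / (df + 1))` variant for a few document frequencies, assuming N = 10 as in this notebook. The function name `smoothed_idf` is hypothetical. It shows one consequence of the `+ 1` in the denominator: a term that occurs in 9 documents gets a weight of zero, and a term that occurs in all 10 gets a small negative weight.

```python
import math

def smoothed_idf(df, n_documents):
    """IDF variant used in this notebook: log(N / (df + 1))."""
    return math.log(n_documents / (df + 1))

n_documents = 10
rare = smoothed_idf(1, n_documents)         # log(10 / 2), positive
common = smoothed_idf(9, n_documents)       # log(10 / 10), exactly zero
everywhere = smoothed_idf(10, n_documents)  # log(10 / 11), slightly negative
print(rare, common, everywhere)
```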

## Computing TF-IDF

In [11]:
## Calculating tf * idf
# idf has shape (n_terms, 1), so it broadcasts across every document column
tf_idf = tf * idf
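
The multiplication relies on NumPy broadcasting: a column of shape `(n_terms, 1)` is stretched across all document columns. A minimal sketch with made-up numbers (`toy_tf` and `toy_idf` are hypothetical) shows the effect:

```python
import numpy as np

# Hypothetical values: 3 terms (rows) x 2 documents (columns).
toy_tf = np.array([[0.50, 0.0],
                   [0.25, 0.5],
                   [0.25, 0.5]])
toy_idf = np.array([[1.0], [0.0], [2.0]])  # shape (3, 1), one weight per term

# The (3, 1) column is broadcast across both document columns,
# so each row of toy_tf is scaled by that term's IDF.
toy_tf_idf = toy_tf * toy_idf
print(toy_tf_idf)
```

The second row becomes all zeros because its IDF weight is zero, which is how terms that carry no discriminating information drop out of the comparison.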

## Computing Cosine Similarity

In [12]:
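## Calculating Cosine Similarity
def multiply_column_sum(doc1, doc2):
    # Dot product of the two document vectors
    return np.sum(doc1 * doc2)

def quadratic_sum(doc):
    # Euclidean (L2) norm: square root of the sum of squares
    return math.sqrt(np.sum(np.square(doc)))

def cosine_similarity(doc1, doc2):
    # Dot product divided by the product of the two norms
    return multiply_column_sum(doc1, doc2) / (quadratic_sum(doc1) * quadratic_sum(doc2))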
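
As a sanity check on the formula, the sketch below computes cosine similarity for two easy cases using `np.dot` and `np.linalg.norm`. The helper name `cosine` and the vectors are hypothetical. Vectors pointing in the same direction should score 1 regardless of length, and orthogonal vectors should score 0.

```python
import numpy as np

def cosine(u, v):
    """Dot product of u and v divided by the product of their L2 norms."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

same_direction = cosine(np.array([1.0, 2.0]), np.array([2.0, 4.0]))
orthogonal = cosine(np.array([1.0, 0.0]), np.array([0.0, 3.0]))
print(same_direction, orthogonal)
```

Note that the formula is undefined for an all-zero vector, since its norm is zero.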

## Building the Document Similarity Matrix

In [13]:
#Creating a matrix showing the similarity between each pair of documents
#Cosine similarity is symmetric, so only the upper triangle is filled;
#the lower triangle stays NaN
n_docs = len(corpus)
doc_similarity = np.full((n_docs, n_docs), np.nan)
for i in range(n_docs):
    doc_similarity[i, i] = 1
    for j in range(i+1, n_docs):
        doc_similarity[i, j] = cosine_similarity(tf_idf[:, i], tf_idf[:, j])
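
The same matrix could also be computed without explicit loops. The sketch below is one possible vectorized alternative, not the notebook's method: it normalizes every column to unit length and takes all pairwise dot products at once. The function `similarity_matrix` and the toy input are hypothetical, and the code assumes no column is all zeros.

```python
import numpy as np

def similarity_matrix(tf_idf):
    """Cosine similarity between every pair of columns of tf_idf."""
    norms = np.linalg.norm(tf_idf, axis=0)  # one L2 norm per document column
    unit = tf_idf / norms                   # scale each column to unit length
    return unit.T @ unit                    # (docs x docs) dot products

# Hypothetical input: 2 terms (rows) x 3 documents (columns).
toy = np.array([[1.0, 2.0, 0.0],
                [0.0, 0.0, 3.0]])
sim = similarity_matrix(toy)

# Keep the diagonal and upper triangle, and blank the rest with NaN.
upper = np.where(np.triu(np.ones_like(sim, dtype=bool)), sim, np.nan)
print(upper)
```

Here the first two documents point in the same direction and score 1, while the third shares no terms with them and scores 0.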

## Presenting the Similarity Scores as a Table

In [26]:
labels = ["Doc{}".format(i+1) for i in range(len(corpus))]
similarity_table = pd.DataFrame(doc_similarity, columns=labels, index=labels)
#Replace NaN cells (the lower triangle) with empty strings ('') for display
similarity_table = similarity_table.replace(np.nan, '')

similarity_table

| | Doc1 | Doc2 | Doc3 | Doc4 | Doc5 | Doc6 | Doc7 | Doc8 | Doc9 | Doc10 |
|---|---|---|---|---|---|---|---|---|---|---|
| Doc1 | 1.0 | 0.0227736 | 0.0196035 | 0.00435743 | 0.0253961 | 0.0494693 | 0.00833291 | 0.0810233 | 0.0690927 | 0.033174 |
| Doc2 | | 1.0 | 0.0716329 | 0.0662852 | 0.0608412 | 0.0510323 | 0.138885 | 0.0483437 | 0.0595224 | 0.132859 |
| Doc3 | | | 1.0 | 0.0263141 | 0.00994799 | 0.0242179 | 0.00740318 | 0.0502603 | 0.00977117 | 0.04909 |
| Doc4 | | | | 1.0 | 0.0524253 | 0.0667139 | 0.00214859 | 0.0242466 | 0.014976 | 0.011443 |
| Doc5 | | | | | 1.0 | 0.0464078 | 0.0243708 | 0.0159987 | 0.027802 | 0.036077 |
| Doc6 | | | | | | 1.0 | 0.104601 | 0.0747908 | 0.0372366 | 0.04217 |
| Doc7 | | | | | | | 1.0 | 0.0490705 | 0.0284846 | 0.067758 |
| Doc8 | | | | | | | | 1.0 | 0.081605 | 0.136908 |
| Doc9 | | | | | | | | | 1.0 | 0.06215 |
| Doc10 | | | | | | | | | | 1.0 |
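
In the table above, the largest off-diagonal value is 0.138885, between Doc2 and Doc7. One way to locate such a pair programmatically is sketched below on a small made-up matrix; `toy_labels` and `toy_similarity` are hypothetical, and the approach assumes the lower triangle is NaN as in this notebook.

```python
import numpy as np

toy_labels = ["DocA", "DocB", "DocC"]
toy_similarity = np.array([[1.0,    0.2,    0.7],
                           [np.nan, 1.0,    0.4],
                           [np.nan, np.nan, 1.0]])

# Ignore the diagonal (every document matches itself) by masking it out.
off_diagonal = toy_similarity.copy()
np.fill_diagonal(off_diagonal, np.nan)

# nanargmax skips NaN cells; unravel_index turns the flat index into (row, col).
row, col = np.unravel_index(np.nanargmax(off_diagonal), off_diagonal.shape)
best_pair = (toy_labels[row], toy_labels[col], float(off_diagonal[row, col]))
print(best_pair)  # ('DocA', 'DocC', 0.7)
```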
