# Count Vectorizer and Jaccard Similarity

This notebook works with 10 documents (Doc1 through Doc10) in .txt format, each containing the abstract of a different paper. The papers were taken from IJCCS (Indonesian Journal of Computing and Cybernetics Systems), were published within the last 5 years, and are written in Indonesian. The goal of the program is to compute the Count Vectorizer representation (word counts per document) and to compare how similar the 10 abstracts are using Jaccard Similarity.

## Importing Libraries

In [1]:
import re
import numpy as np
import pandas as pd
from Sastrawi.StopWordRemover.StopWordRemoverFactory import StopWordRemoverFactory
from Sastrawi.Stemmer.StemmerFactory import StemmerFactory
from nltk.tokenize import sent_tokenize, word_tokenize

## Preprocessing Each Document

In this step, the text of every document is cleaned up before any counting is done.
The preprocessing steps are:
1. Removing whitespace (blank lines are skipped)
2. Converting all letters to lowercase
3. Removing symbols and digits
4. Stemming (reducing each word to its base form)
5. Tokenizing each sentence into an array of words
6. Removing the words in that array that are Indonesian stopwords
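
The steps above can be sketched on a single sentence using only the standard library. This is a minimal illustration, not the notebook's implementation: the function `preprocess` and the tiny stopword set `SAMPLE_STOPWORDS` are hypothetical, and stemming is left as an identity placeholder where the notebook uses the Sastrawi stemmer.

```python
import re

# Hypothetical, deliberately tiny stopword set for illustration only;
# the notebook uses the full Sastrawi stopword list instead.
SAMPLE_STOPWORDS = {"ini", "yang", "dan", "di"}


def preprocess(sentence, stem=lambda text: text):
    """Return the cleaned tokens of one sentence.

    `stem` is a placeholder (identity by default) for a real stemmer.
    """
    text = sentence.lower()                # step 2: lowercase
    text = re.sub(r"[^a-z]", " ", text)    # step 3: drop symbols and digits
    text = stem(text)                      # step 4: stemming (skipped here)
    tokens = text.split()                  # step 5: split into words
    # step 6: drop stopwords
    return [t for t in tokens if t not in SAMPLE_STOPWORDS]


tokens = preprocess("Penelitian ini menggunakan 10 dokumen, yang berbeda!")
print(tokens)  # ['penelitian', 'menggunakan', 'dokumen', 'berbeda']
```

Passing a real stemmer's `stem` method as the `stem` argument would slot step 4 into the same pipeline.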

In [2]:
## Preprocessing
# Build the stopword set and the stemmer once; they are the same for every document
stop_words = set(StopWordRemoverFactory().get_stop_words())
stemmer = StemmerFactory().create_stemmer()

corpus = []  # one list of preprocessed words per document
for i in range(10):
    abstract_words = []
    # "with" closes the file even if an error occurs while reading
    with open("abstrak{}.txt".format(i+1), "r", encoding="utf-8") as abstract_file:
        # extracting words in every non-empty line of the selected abstract
        for line in abstract_file:
            if line.strip():
                for sentence in sent_tokenize(line):
                    sentence = sentence.lower()
                    sentence = re.sub(r'[^a-zA-Z]', ' ', sentence)
                    sentence = stemmer.stem(sentence)
                    words = word_tokenize(sentence)
                    abstract_words += [w for w in words if w not in stop_words]

    corpus.append(abstract_words)

## Storing All Unique Words

In [3]:
#Seeding the bag of words, containing all words in all abstracts uniquely
# Collect every distinct word once, sorted alphabetically
vocabulary = sorted({word for document in corpus for word in document})

# Shape (1, number_of_unique_words), as expected by the next cell
bag_of_words = np.array(vocabulary).reshape(1, -1)

## Computing the Count Vectorizer

In [4]:
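## Calculating Count Vectorizer
# cv[j, i] holds how many times vocabulary word j occurs in document i
cv = np.zeros((bag_of_words.shape[1], len(corpus)))
for i in range(len(corpus)):
    for j in range(len(cv)):
        cv[j, i] = corpus[i].count(bag_of_words[0, j])
cv = cv.T  # transpose so that each row is one document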
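
To make the layout of the count matrix concrete, here is a small standard-library sketch on a toy corpus. The names `toy_corpus`, `vocabulary` and `count_vectors` are hypothetical and the data is made up for illustration; the idea is the same as in the cell above, with one row per document and one column per unique word.

```python
from collections import Counter

# Hypothetical toy corpus: each document is already a list of tokens
toy_corpus = [
    ["data", "model", "data"],
    ["model", "uji"],
]

# Sorted list of every distinct word across all documents
vocabulary = sorted({word for doc in toy_corpus for word in doc})

# One row per document, one column per vocabulary word
count_vectors = []
for doc in toy_corpus:
    counts = Counter(doc)  # a missing word counts as 0
    count_vectors.append([counts[word] for word in vocabulary])

print(vocabulary)     # ['data', 'model', 'uji']
print(count_vectors)  # [[2, 1, 0], [0, 1, 1]]
```

The word "data" appears twice in the first document and never in the second, which is why the first column reads 2 and 0.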

## Computing Jaccard Similarity

In [5]:
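## Calculating Jaccard Similarity
def jaccard_similarity(doc1, doc2):
    # doc1 and doc2 are count vectors over the same vocabulary;
    # a word belongs to a document when its count is greater than 0
    union = 0
    intersection = 0
    for i in range(len(doc1)):
        if doc1[i] > 0 and doc2[i] > 0:
            intersection += 1
            union += 1
        elif doc1[i] > 0 or doc2[i] > 0:
            union += 1

    # The ratio is undefined when both documents are empty;
    # returning 0.0 in that case is an assumption, not a standard rule
    if union == 0:
        return 0.0
    return intersection / union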
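
Jaccard Similarity is the size of the intersection of two sets divided by the size of their union, J(A, B) = |A ∩ B| / |A ∪ B|. The sketch below expresses the same idea directly with Python sets instead of count vectors. The function name `jaccard_from_tokens` is hypothetical, and returning 0.0 for two empty documents is an assumption made here to avoid division by zero.

```python
def jaccard_from_tokens(tokens_a, tokens_b):
    """Jaccard similarity of two token lists, ignoring word frequency."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    union = set_a | set_b
    if not union:
        # Both documents are empty: the ratio is undefined, 0.0 is assumed
        return 0.0
    return len(set_a & set_b) / len(union)


# Shared words: {"model"}; all words: {"data", "model", "uji"}
score = jaccard_from_tokens(["data", "model", "data"], ["model", "uji"])
print(round(score, 4))  # 0.3333
```

Because only the presence of a word matters, the repeated "data" in the first document does not change the score.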

## Building a Matrix of Pairwise Document Similarity

In [6]:
#Creating matrix showing similarity of each document
doc_similarity = np.zeros((len(corpus), len(corpus)))
for i in range(doc_similarity.shape[0]):
    doc_similarity[i, i] = 1
    for j in range(i+1, doc_similarity.shape[1]):
        doc_similarity[i, j] = jaccard_similarity(cv[i], cv[j])
        doc_similarity[j, i] = np.nan
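
Since Jaccard Similarity is symmetric, only the pairs above the diagonal need to be computed, which is why the lower triangle is left as NaN. A possible way to enumerate exactly those pairs is `itertools.combinations`, sketched here on made-up data. The names `toy_docs`, `pair_scores` and `best_pair` are hypothetical and not part of the notebook.

```python
from itertools import combinations

# Hypothetical token sets for three small documents
toy_docs = {
    "DocA": {"data", "model", "uji"},
    "DocB": {"model", "uji"},
    "DocC": {"citra", "data"},
}

# combinations(..., 2) yields each unordered pair exactly once,
# which corresponds to the upper triangle of the similarity matrix
pair_scores = {}
for (name_a, set_a), (name_b, set_b) in combinations(toy_docs.items(), 2):
    pair_scores[(name_a, name_b)] = len(set_a & set_b) / len(set_a | set_b)

# The pair with the highest score is the most similar pair of documents
best_pair = max(pair_scores, key=pair_scores.get)
print(best_pair)  # ('DocA', 'DocB')
```

For n documents this computes n(n-1)/2 scores, so the 10 abstracts require 45 comparisons.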

## Presenting the Similarity Scores as a Table

In [7]:
labels = ["Doc1", "Doc2", "Doc3", "Doc4", "Doc5", "Doc6", "Doc7", "Doc8", "Doc9", "Doc10"]
a = pd.DataFrame(doc_similarity, columns=labels, index=labels)
#Replace cells containing NaN with an empty string ('')
a.replace(np.nan, '', inplace=True)

a

| | Doc1 | Doc2 | Doc3 | Doc4 | Doc5 | Doc6 | Doc7 | Doc8 | Doc9 | Doc10 |
|---|---|---|---|---|---|---|---|---|---|---|
| Doc1 | 1.0 | 0.110294 | 0.12 | 0.0701754 | 0.145631 | 0.147287 | 0.119658 | 0.167939 | 0.191489 | 0.203704 |
| Doc2 | | 1.0 | 0.11039 | 0.141791 | 0.103704 | 0.0914634 | 0.140845 | 0.128834 | 0.125984 | 0.141844 |
| Doc3 | | | 1.0 | 0.136 | 0.0952381 | 0.0980392 | 0.10219 | 0.130719 | 0.118644 | 0.162791 |
| Doc4 | | | | 1.0 | 0.121495 | 0.119403 | 0.0725806 | 0.115108 | 0.106796 | 0.081967 |
| Doc5 | | | | | 1.0 | 0.140625 | 0.141593 | 0.0942029 | 0.145833 | 0.132743 |
| Doc6 | | | | | | 1.0 | 0.152174 | 0.152866 | 0.196581 | 0.17037 |
| Doc7 | | | | | | | 1.0 | 0.15493 | 0.194175 | 0.205128 |
| Doc8 | | | | | | | | 1.0 | 0.25 | 0.253846 |
| Doc9 | | | | | | | | | 1.0 | 0.284211 |
| Doc10 | | | | | | | | | | 1.0 |
