# Assignment 2: TF-IDF & Vector Space Model

## Explanation

**Vector Space Model (VSM)** is a method commonly used in information retrieval systems. It represents documents and queries as vectors of weighted terms and measures the similarity between a document and a query by comparing those vectors.
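To make the similarity idea concrete, here is a minimal sketch that ranks two documents against a query using cosine similarity, one common similarity measure for vector space models. The documents, the query, and the use of raw term counts as weights are all hypothetical choices made for illustration; they do not come from the news dataset.

```python
import math
from collections import Counter

def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse term-weight vectors (dicts)."""
    dot = sum(weight * vec_b.get(term, 0.0) for term, weight in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is similar to nothing
    return dot / (norm_a * norm_b)

# Hypothetical toy documents and query, weighted by raw term counts
docs = {
    "doc_politics": Counter("menteri korupsi impor gula menteri".split()),
    "doc_business": Counter("harga gula pasar impor sapi".split()),
}
query = Counter("korupsi menteri".split())

scores = {name: cosine_similarity(query, vec) for name, vec in docs.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

The document that shares terms with the query ranks first, while the document with no overlapping terms scores zero.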

**TF-IDF** stands for Term Frequency-Inverse Document Frequency. It measures how relevant a word is to a text within a collection (corpus). The relevance score rises with the number of times the word appears in the text, but is offset by how frequently the word appears across the corpus.
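The weighting can be sketched by hand on a toy corpus. The example below assumes the smoothed IDF variant, `ln((1 + N) / (1 + df)) + 1`, which scikit-learn documents as the default for `TfidfVectorizer`; it leaves out the L2 normalisation that the vectorizer also applies by default. The corpus and the helper name `tfidf_weights` are hypothetical.

```python
import math
from collections import Counter

def tfidf_weights(corpus):
    """Return one {term: weight} dict per document in a tokenised corpus."""
    n_docs = len(corpus)
    # Document frequency: in how many documents each term occurs
    doc_freq = Counter(term for doc in corpus for term in set(doc))
    weighted = []
    for doc in corpus:
        counts = Counter(doc)
        weighted.append({
            # Raw term frequency times smoothed IDF: ln((1 + N) / (1 + df)) + 1
            term: tf * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, tf in counts.items()
        })
    return weighted

# Hypothetical corpus: "impor" occurs in every document, "sapi" in only one
toy_corpus = [
    "gula impor gula".split(),
    "impor sapi".split(),
    "impor beras".split(),
]
weights = tfidf_weights(toy_corpus)
print(weights[1])
```

In the second document both words occur once, yet "sapi" receives a higher weight than "impor" because "impor" appears in every document of the corpus.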

In this material we build a vector space model from the data produced by the previous code, restricted to 2 categories with 50 articles each.

## Steps for processing the crawled news articles

### Import libraries

In [2]:
!pip install Sastrawi

Collecting Sastrawi
  Downloading Sastrawi-1.0.1-py2.py3-none-any.whl.metadata (909 bytes)
Downloading Sastrawi-1.0.1-py2.py3-none-any.whl (209 kB)
Installing collected packages: Sastrawi
Successfully installed Sastrawi-1.0.1


In [3]:
# Libraries for data manipulation
import pandas as pd
from tqdm import tqdm
import re
import string

# Libraries for text preprocessing
import nltk
from Sastrawi.Stemmer.StemmerFactory import StemmerFactory
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
nltk.download('stopwords')
nltk.download('punkt_tab')

# Library for text vectorization (TF-IDF)
from sklearn.feature_extraction.text import TfidfVectorizer

# Library for saving the model
import pickle

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


### Import the news data from CSV

In [4]:
data = pd.read_csv("data_terbaru1.csv")
data.columns = data.columns.str.strip()
data

Unnamed: 0,Judul Berita,Isi Berita,Tanggal Berita,Kategori Berita
0,Pesan Anies ke Tom Lembong Tersangka Korupsi I...,Reporter\nNovali Panji Nugroho\nEditor\nEko Ar...,30-10-2024 05:55,Nasional
1,"Tom Lembong Tersangka Korupsi Impor Gula, Nama...",Reporter\nNovali Panji Nugroho\nEditor\nAhmad ...,29-10-2024 23:55,Nasional
2,"Profil Tom Lembong, Eks Mendag dan Co-Captain ...",Reporter\nNovali Panji Nugroho\nEditor\nAhmad ...,29-10-2024 22:55,Nasional
3,Tom Lembong Sempat Unggah Hal Ini Sehari Sebel...,Reporter\nTempo.co\nEditor\nAndry Triyanto Tji...,30-10-2024 02:55,Nasional
4,"Prabowo Ingin Tingkatkan Pembangunan di Papua,...",Reporter\nVedro Imanuel G\nEditor\nAgung Seday...,29-10-2024 23:55,Bisnis
...,...,...,...,...
95,Kementan Berencana Datangkan Sapi Hidup Ke Ind...,Reporter\nM. Raihan Muzzaki\nEditor\nAisha Sha...,30-10-2024 00:58,Bisnis
96,"Profil Charles Sitorus, Tersangka dalam Kasus ...",Reporter\nAdil Al Hasan\nEditor\nRr. Ariyani Y...,30-10-2024 02:58,Bisnis
97,Bisik-bisik Prabowo kepada Fahri Hamzah Wakil ...,Reporter\nSukma Kanthi Nurani\nEditor\nS. Dian...,30-10-2024 03:58,Bisnis
98,"Tom Lembong Jadi Tersangka, Ini Kata Anies, Mu...","Reporter\nAntara\nEditor\nYudono Yanuar\nRabu,...",30-10-2024 10:58,Bisnis


### Implementing the clean_text() function

In [5]:
def clean_text(text):
    text = re.sub(r'((www\.[^\s]+)|(https?://[^\s]+))', ' ', text)  # Remove URLs (www.* and http(s)://*)
    text = re.sub(r'@[^\s]+', ' ', text)  # Remove @usernames
    text = re.sub(r'[\s]+', ' ', text)  # Collapse runs of whitespace into one space
    text = re.sub(r'#([^\s]+)', ' ', text)  # Remove hashtags
    # Remove the retweet marker. Note: this pattern matches "rt" anywhere, so it also
    # splits words such as "Reporter" into "Repo er"; r'\brt\b' would match only the
    # standalone token.
    text = re.sub(r'rt', ' ', text)
    text = text.translate(str.maketrans("", "", string.punctuation))  # Remove punctuation
    text = re.sub(r'\d', ' ', text)  # Replace digits with spaces
    text = text.lower()
    text = text.encode('ascii', 'ignore').decode('utf-8')  # Drop non-ASCII characters
    text = re.sub(r'[^\x00-\x7f]', r'', text)
    text = text.replace('\n', '')  # Remove newlines
    text = text.strip()
    return text
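The retweet pattern in `clean_text()` is not anchored to word boundaries, which explains why "Reporter" shows up as "repo er" in the cleaned text later in this notebook. The sketch below compares that pattern with a word-boundary variant on a hypothetical sample string. The anchored pattern is offered only as one possible alternative; it is not what the notebook ran.

```python
import re

# Hypothetical sample containing "rt" both inside a word and as a standalone token
sample = "Reporter Tempo rt berita"

# Pattern used in clean_text(): matches "rt" anywhere, even inside a word
unanchored = re.sub(r'rt', ' ', sample)

# Alternative with word boundaries: only the standalone token "rt" is removed
anchored = re.sub(r'\brt\b', ' ', sample)

print(repr(unanchored))
print(repr(anchored))
```

Switching to the anchored pattern would change the vocabulary produced downstream, so the outputs shown in this notebook would no longer match.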

### Implementing the stemming function

In [6]:
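# Build the Sastrawi stemmer once; creating it inside the function would
# reload the dictionary and reset the stemmer's cache on every call
stemmer = StemmerFactory().create_stemmer()

def stemming(tokens):
    # Stem each token and join the results back into a single string
    return ' '.join(stemmer.stem(word) for word in tokens)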

In [7]:
def preprocess_text(text):
    # Guard against non-string values
    if isinstance(text, str):
        # Lowercase the text and replace runs of non-word characters with a space
        result = re.sub(r'\W+', ' ', text.lower())
    else:
        result = ''  # Not a string: return an empty string
    return result

# Replace NaN values in the 'Isi Berita' column with empty strings
data['Isi Berita'] = data['Isi Berita'].fillna('')

# Apply the preprocessing to the 'Isi Berita' column
data['cleaned_text'] = data['Isi Berita'].apply(preprocess_text)

### Implementing the stopword-removal function

In [8]:
def clean_stopword(tokens):
    # Keep only the tokens that are not in NLTK's Indonesian stopword list
    list_stopword = set(stopwords.words('indonesian'))
    return [t for t in tokens if t not in list_stopword]

### Preprocessing each document

In [9]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [10]:
def preprocess_text(content):
    # Clean, tokenize, remove stopwords, and stem every document in turn
    result = []
    for text in tqdm(content):
        cleaned_text = clean_text(text)
        tokens = nltk.tokenize.word_tokenize(cleaned_text)
        cleaned_stopword = clean_stopword(tokens)
        stemmed_text = stemming(cleaned_stopword)
        result.append(stemmed_text)
    return result

# Overwrite 'cleaned_text' with the fully preprocessed articles
data['cleaned_text'] = preprocess_text(data['Isi Berita'])

100%|██████████| 100/100 [15:56<00:00,  9.56s/it]


### TF-IDF computation and VSM construction

Split the 100 documents into 80 for training and 20 for testing.

In [12]:
# Split the data: the first 80 rows for training, the remaining rows for testing
data_train = data[:80]
data_test = data[80:]
data_train

Unnamed: 0,Judul Berita,Isi Berita,Tanggal Berita,Kategori Berita,cleaned_text
0,Pesan Anies ke Tom Lembong Tersangka Korupsi I...,Reporter\nNovali Panji Nugroho\nEditor\nEko Ar...,30-10-2024 05:55,Nasional,repo er novali panji nugroho editor eko ari wi...
1,"Tom Lembong Tersangka Korupsi Impor Gula, Nama...",Reporter\nNovali Panji Nugroho\nEditor\nAhmad ...,29-10-2024 23:55,Nasional,repo er novali panji nugroho editor ahmad faiz...
2,"Profil Tom Lembong, Eks Mendag dan Co-Captain ...",Reporter\nNovali Panji Nugroho\nEditor\nAhmad ...,29-10-2024 22:55,Nasional,repo er novali panji nugroho editor ahmad faiz...
3,Tom Lembong Sempat Unggah Hal Ini Sehari Sebel...,Reporter\nTempo.co\nEditor\nAndry Triyanto Tji...,30-10-2024 02:55,Nasional,repo er tempoco editor andry triyanto tjitra r...
4,"Prabowo Ingin Tingkatkan Pembangunan di Papua,...",Reporter\nVedro Imanuel G\nEditor\nAgung Seday...,29-10-2024 23:55,Bisnis,repo er vedro imanuel g editor agung dayu rabu...
...,...,...,...,...,...
75,Kejagung Tetapkan Tom Lembong Tersangka Impor ...,Reporter\nRachel Farahdiba Regar\nEditor\nS. D...,30-10-2024 13:57,Bisnis,repo er rachel farahdiba regar editor s dian a...
76,PSPK Sebut Akan Ada Kemunduran jika Ujian Nasi...,Reporter\nAnastasya Lavenia Y\nEditor\nNinis C...,30-10-2024 10:57,Nasional,repo er anastasya lavenia y editor ninis chair...
77,Jadwal dan Cara Sanggah Hasil Seleksi Administ...,Reporter\nHendrik Yaputra\nEditor\nDevy Ernis\...,30-10-2024 06:57,Nasional,repo er hendrik yaputra editor devy ernis rabu...
78,"Daftar Kebijakan Tom Lembong saat jadi Mendag,...",Reporter\nMelynda Dwi Puspita\nEditor\nAisha S...,30-10-2024 09:58,Bisnis,repo er melynda dwi puspita editor aisha shaid...
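The slice above keeps the rows in file order, so the category mix of the two parts depends on how the CSV happens to be ordered. As a possible alternative (not what this notebook does), the sketch below shuffles the rows within each category and then takes the same share of each for testing. The function name `stratified_split`, the seed, and the label list are hypothetical; the labels only mirror the 2 x 50 layout described in the introduction.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_ratio=0.2, seed=42):
    """Return (train_idx, test_idx) with roughly test_ratio of each label in test."""
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train_idx, test_idx = [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_ratio)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Hypothetical labels: 50 articles per category
labels = ["Nasional"] * 50 + ["Bisnis"] * 50
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

With the notebook's DataFrame, the index lists could be applied as `data.iloc[train_idx]` and `data.iloc[test_idx]`.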


#### TF-IDF and VSM

In [13]:
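def tfidf_vsm(data, kategori):
    # Fit the vectorizer on the training texts and build the document-term matrix
    tfidf = TfidfVectorizer()
    tfidf_matrix = tfidf.fit_transform(data)
    feature_names = tfidf.get_feature_names_out()

    # One row per document, one column per term, with the category label in front
    df_tfidf = pd.DataFrame(tfidf_matrix.toarray(), columns=feature_names)
    df_tfidf.insert(0, 'Kategori Berita', kategori.reset_index(drop=True))

    return tfidf, df_tfidf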

tfidf_model, df_tfidf = tfidf_vsm(data_train['cleaned_text'], data_train['Kategori Berita'])

In [14]:
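def model_tf_idf(data, model, kategori):
    # Reuse the already fitted vectorizer so the columns match the training matrix
    tfidf_matrix = model.transform(data)
    feature_names = model.get_feature_names_out()

    # One row per document, one column per term, with the category label in front
    df_tfidf = pd.DataFrame(tfidf_matrix.toarray(), columns=feature_names)
    df_tfidf.insert(0, 'Kategori Berita', kategori.reset_index(drop=True))

    return df_tfidf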

df_tfidf_test = model_tf_idf(data_test['cleaned_text'], tfidf_model, data_test['Kategori Berita'])
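The test set is passed through `transform` rather than `fit_transform` so that its columns match the vocabulary learned from the training set. The sketch below illustrates that idea with a simplified stand-in that uses raw counts. The helpers `fit_vocabulary` and `transform_counts` and the sample texts are hypothetical and are not how `TfidfVectorizer` is implemented.

```python
def fit_vocabulary(train_docs):
    """Map each distinct training term to a column index (sorted alphabetically)."""
    terms = sorted({term for doc in train_docs for term in doc.split()})
    return {term: col for col, term in enumerate(terms)}

def transform_counts(doc, vocabulary):
    """Count the terms of one document using only the columns learned from training."""
    row = [0] * len(vocabulary)
    for term in doc.split():
        col = vocabulary.get(term)
        if col is not None:  # terms unseen during fitting are dropped
            row[col] += 1
    return row

train_docs = ["impor gula", "harga gula naik"]  # hypothetical training texts
vocabulary = fit_vocabulary(train_docs)

# "sapi" never occurred in training, so it has no column and is ignored
test_row = transform_counts("gula sapi impor gula", vocabulary)
print(vocabulary)
print(test_row)
```

Fitting a second vectorizer on the test set would instead produce a different set of columns, and the two matrices would no longer be comparable.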

In [15]:
df_tfidf_test.head()

Unnamed: 0,Kategori Berita,abdul,abimanyu,acara,ada,adapin,adat,adil,administrasi,af,...,yakti,yanuar,yaputra,yasin,yogyaka,youtube,yudono,yuk,yusuf,zaman
0,Nasional,0.167474,0.0,0.0,0.0,0.0,0.0,0.086121,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Nasional,0.018623,0.0,0.031313,0.0,0.0,0.046275,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.092549,0.0
2,Nasional,0.02228,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Nasional,0.025372,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.063044
4,Bisnis,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Save Dataset & Model

In [16]:
df_tfidf.to_csv("data_train_vsm.csv", index=False)
df_tfidf_test.to_csv("data_test_vsm.csv", index=False)

In [17]:
with open('tfidf_model.pkl', 'wb') as f:
    pickle.dump(tfidf_model, f)
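A saved model is only useful if it can be loaded again, so here is a minimal round-trip sketch. A plain dictionary stands in for the fitted vectorizer so the example runs on its own; the helper names `save_object` and `load_object` and the temporary path are hypothetical.

```python
import os
import pickle
import tempfile

def save_object(obj, path):
    """Serialise obj to path with pickle."""
    with open(path, 'wb') as f:
        pickle.dump(obj, f)

def load_object(path):
    """Load a pickled object; only unpickle files you created or trust."""
    with open(path, 'rb') as f:
        return pickle.load(f)

# Stand-in for the fitted vectorizer so the sketch is self-contained
stand_in_model = {"vocabulary": {"gula": 0, "impor": 1}}

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "tfidf_model.pkl")
    save_object(stand_in_model, model_path)
    restored = load_object(model_path)

print(restored == stand_in_model)
```

In this notebook the same pattern would apply to `tfidf_model.pkl`: the restored vectorizer could then be passed to `model_tf_idf()` to vectorize new articles with the training vocabulary.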