# Task 2: Building a VSM with TF-IDF Weights

## Explanation

The **Vector Space Model (VSM)** is a method widely used in information retrieval systems. It measures the similarity between a document and a query by representing both as vectors of weighted terms.
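As a rough illustration of how a document and a query can be compared in a vector space, the sketch below computes the cosine similarity between two term-weight vectors. Cosine similarity is one common similarity measure for VSM and is assumed here because the text does not name one; the vocabulary and the weights are made up for the example.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length term-weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an all-zero vector shares no terms with anything
    return dot / (norm_a * norm_b)

# Hypothetical weights over the vocabulary ["korupsi", "gula", "papua"]
doc_vector = [0.8, 0.6, 0.0]
query_vector = [1.0, 0.0, 0.0]

score = cosine_similarity(doc_vector, query_vector)
print(round(score, 3))
```

A score close to 1 means the document and the query point in nearly the same direction, while 0 means they share no weighted terms.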

**TF-IDF** stands for Term Frequency-Inverse Document Frequency. It measures how relevant a word is to a text within a collection (corpus). The relevance increases with the number of times the word appears in the text, but is offset by how often the word appears across the corpus.

In this material we build a vector space model using the data from the previous code, restricted to 2 categories with 50 records each.
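One possible way to obtain such a subset is sketched below on a toy DataFrame. The column name `Kategori Berita` matches the dataset shown later; the toy rows, the `Dunia` category, and the choice to keep the first rows of each category rather than a random sample are assumptions.

```python
import pandas as pd

# Toy stand-in for the scraped news data (values are made up).
toy = pd.DataFrame({
    "Judul Berita": [f"judul {i}" for i in range(9)],
    "Kategori Berita": ["Nasional", "Bisnis", "Dunia"] * 3,
})

categories = ["Nasional", "Bisnis"]  # the two categories to keep
per_category = 2                     # the task uses 50; 2 keeps the toy small

subset = (
    toy[toy["Kategori Berita"].isin(categories)]
    .groupby("Kategori Berita")
    .head(per_category)              # first N rows of each category
    .reset_index(drop=True)
)
print(subset["Kategori Berita"].value_counts())
```

With the real data, setting `per_category = 50` would give the 2 x 50 layout described above, provided each category has at least 50 rows.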

## Pre-processing

Pre-processing is the first stage of text processing. It cleans and prepares raw text so that it can be processed or used in a machine learning model.

It also ensures that the data is clean and ready for analysis, which makes it an essential step in any data analysis workflow.

The cells below first load the data and then apply each pre-processing step in turn.

In [None]:
from google.colab import drive
drive.mount('/content/drive')
import pickle
import pandas as pd

Mounted at /content/drive


In [None]:
df = pd.read_csv("/content/drive/My Drive/ppw/tugas/DataTugas2/data_terbaru1.csv")
df.head()

Unnamed: 0,Judul Berita,Isi Berita,Tanggal Berita,Kategori Berita
0,Pesan Anies ke Tom Lembong Tersangka Korupsi I...,Reporter\nNovali Panji Nugroho\nEditor\nEko Ar...,30-10-2024 05:55,Nasional
1,"Tom Lembong Tersangka Korupsi Impor Gula, Nama...",Reporter\nNovali Panji Nugroho\nEditor\nAhmad ...,29-10-2024 23:55,Nasional
2,"Profil Tom Lembong, Eks Mendag dan Co-Captain ...",Reporter\nNovali Panji Nugroho\nEditor\nAhmad ...,29-10-2024 22:55,Nasional
3,Tom Lembong Sempat Unggah Hal Ini Sehari Sebel...,Reporter\nTempo.co\nEditor\nAndry Triyanto Tji...,30-10-2024 02:55,Nasional
4,"Prabowo Ingin Tingkatkan Pembangunan di Papua,...",Reporter\nVedro Imanuel G\nEditor\nAgung Seday...,29-10-2024 23:55,Bisnis


### CLEANSING

Cleansing is the process of removing “dirt” and inaccuracies from the data so that it is ready for analysis or modeling. Here, URLs, HTML tags, emoji, symbols, and digits are stripped from the article text.

In [None]:
import re
import string
import nltk

# Remove URLs from the text.
def remove_url(ulasan):
    # Convert ulasan to a string if it is not one already
    if not isinstance(ulasan, str):
        ulasan = str(ulasan)
    # Fixed: the original pattern had 'www\.S+', which only matched a literal "S"
    url = re.compile(r'https?://\S+|www\.\S+')
    return url.sub(r'', ulasan)

# Remove HTML tags from the text.
def remove_html(ulasan):
    if not isinstance(ulasan, str):
        ulasan = str(ulasan)
    # Fixed: the original pattern '<.#?>' did not match closing tags such as </p>
    html = re.compile(r'<.*?>')
    return html.sub(r'', ulasan)

# Remove emoji from the text.
def remove_emoji(ulasan):
    if not isinstance(ulasan, str):
        ulasan = str(ulasan)
    emoji_pattern = re.compile("["
        u"\U0001F600-\U0001F64F"
        u"\U0001F300-\U0001F5FF"
        u"\U0001F680-\U0001F6FF"
        u"\U0001F1E0-\U0001F1FF""]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', ulasan)

# Remove digits from the text.
def remove_numbers(ulasan):
    if not isinstance(ulasan, str):
        ulasan = str(ulasan)
    ulasan = re.sub(r'\d+', '', ulasan)
    return ulasan

# Remove symbols from the text, keeping only letters, digits, and whitespace.
def remove_symbols(ulasan):
    if not isinstance(ulasan, str):
        ulasan = str(ulasan)
    ulasan = re.sub(r'[^a-zA-Z0-9\s]', '', ulasan)
    return ulasan

# Apply the cleansing steps in order, storing the result in a new column.
df['cleansing'] = df['Isi Berita'].apply(remove_url)
df['cleansing'] = df['cleansing'].apply(remove_html)
df['cleansing'] = df['cleansing'].apply(remove_emoji)
df['cleansing'] = df['cleansing'].apply(remove_symbols)
df['cleansing'] = df['cleansing'].apply(remove_numbers)

df.head(5)

Unnamed: 0,Judul Berita,Isi Berita,Tanggal Berita,Kategori Berita,cleansing
0,Pesan Anies ke Tom Lembong Tersangka Korupsi I...,Reporter\nNovali Panji Nugroho\nEditor\nEko Ar...,30-10-2024 05:55,Nasional,Reporter\nNovali Panji Nugroho\nEditor\nEko Ar...
1,"Tom Lembong Tersangka Korupsi Impor Gula, Nama...",Reporter\nNovali Panji Nugroho\nEditor\nAhmad ...,29-10-2024 23:55,Nasional,Reporter\nNovali Panji Nugroho\nEditor\nAhmad ...
2,"Profil Tom Lembong, Eks Mendag dan Co-Captain ...",Reporter\nNovali Panji Nugroho\nEditor\nAhmad ...,29-10-2024 22:55,Nasional,Reporter\nNovali Panji Nugroho\nEditor\nAhmad ...
3,Tom Lembong Sempat Unggah Hal Ini Sehari Sebel...,Reporter\nTempo.co\nEditor\nAndry Triyanto Tji...,30-10-2024 02:55,Nasional,Reporter\nTempoco\nEditor\nAndry Triyanto Tjit...
4,"Prabowo Ingin Tingkatkan Pembangunan di Papua,...",Reporter\nVedro Imanuel G\nEditor\nAgung Seday...,29-10-2024 23:55,Bisnis,Reporter\nVedro Imanuel G\nEditor\nAgung Seday...
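To see the effect of these steps on a single string, the sketch below chains the same kind of regular-expression substitutions in one hypothetical helper, `clean_text`, which is not part of the notebook. The sample sentence is invented.

```python
import re

def clean_text(text):
    """Apply the cleansing steps in order: URLs, HTML tags, symbols, digits."""
    text = str(text)
    text = re.sub(r'https?://\S+|www\.\S+', '', text)  # URLs
    text = re.sub(r'<.*?>', '', text)                  # HTML tags
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)         # symbols, punctuation, emoji
    text = re.sub(r'\d+', '', text)                    # digits
    return text

sample = "<p>Harga gula naik 10%!</p> Baca di https://example.com"
cleaned = clean_text(sample)
print(repr(cleaned))
```

Because the matches are deleted rather than replaced with a space, the result can contain doubled or trailing spaces. Splitting on whitespace in the tokenization step absorbs them.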


### CASE FOLDING

Case folding converts every letter in the text to lowercase. It is a basic technique in natural language processing (NLP) for simplifying and normalizing text.

At this stage, all capital letters in the documents are changed to lowercase, so that the same word written with different capitalization is not treated as two different terms.

In [None]:
def case_folding(text):
    # Lowercase strings; leave non-string values unchanged
    if isinstance(text, str):
        return text.lower()
    return text

df['case_folding'] = df['cleansing'].apply(case_folding)

df.head(5)

Unnamed: 0,Judul Berita,Isi Berita,Tanggal Berita,Kategori Berita,cleansing,case_folding
0,Pesan Anies ke Tom Lembong Tersangka Korupsi I...,Reporter\nNovali Panji Nugroho\nEditor\nEko Ar...,30-10-2024 05:55,Nasional,Reporter\nNovali Panji Nugroho\nEditor\nEko Ar...,reporter\nnovali panji nugroho\neditor\neko ar...
1,"Tom Lembong Tersangka Korupsi Impor Gula, Nama...",Reporter\nNovali Panji Nugroho\nEditor\nAhmad ...,29-10-2024 23:55,Nasional,Reporter\nNovali Panji Nugroho\nEditor\nAhmad ...,reporter\nnovali panji nugroho\neditor\nahmad ...
2,"Profil Tom Lembong, Eks Mendag dan Co-Captain ...",Reporter\nNovali Panji Nugroho\nEditor\nAhmad ...,29-10-2024 22:55,Nasional,Reporter\nNovali Panji Nugroho\nEditor\nAhmad ...,reporter\nnovali panji nugroho\neditor\nahmad ...
3,Tom Lembong Sempat Unggah Hal Ini Sehari Sebel...,Reporter\nTempo.co\nEditor\nAndry Triyanto Tji...,30-10-2024 02:55,Nasional,Reporter\nTempoco\nEditor\nAndry Triyanto Tjit...,reporter\ntempoco\neditor\nandry triyanto tjit...
4,"Prabowo Ingin Tingkatkan Pembangunan di Papua,...",Reporter\nVedro Imanuel G\nEditor\nAgung Seday...,29-10-2024 23:55,Bisnis,Reporter\nVedro Imanuel G\nEditor\nAgung Seday...,reporter\nvedro imanuel g\neditor\nagung seday...


### TOKENIZATION

Tokenization splits text into smaller units called tokens, which here are individual words.

At this stage, each case-folded article is split on whitespace, producing a list of word tokens per document.

In [None]:
def tokenize(text):
    tokens = text.split()
    return tokens

df['tokenize'] = df['case_folding'].apply(tokenize)

df.head(5)

Unnamed: 0,Judul Berita,Isi Berita,Tanggal Berita,Kategori Berita,cleansing,case_folding,tokenize
0,Pesan Anies ke Tom Lembong Tersangka Korupsi I...,Reporter\nNovali Panji Nugroho\nEditor\nEko Ar...,30-10-2024 05:55,Nasional,Reporter\nNovali Panji Nugroho\nEditor\nEko Ar...,reporter\nnovali panji nugroho\neditor\neko ar...,"[reporter, novali, panji, nugroho, editor, eko..."
1,"Tom Lembong Tersangka Korupsi Impor Gula, Nama...",Reporter\nNovali Panji Nugroho\nEditor\nAhmad ...,29-10-2024 23:55,Nasional,Reporter\nNovali Panji Nugroho\nEditor\nAhmad ...,reporter\nnovali panji nugroho\neditor\nahmad ...,"[reporter, novali, panji, nugroho, editor, ahm..."
2,"Profil Tom Lembong, Eks Mendag dan Co-Captain ...",Reporter\nNovali Panji Nugroho\nEditor\nAhmad ...,29-10-2024 22:55,Nasional,Reporter\nNovali Panji Nugroho\nEditor\nAhmad ...,reporter\nnovali panji nugroho\neditor\nahmad ...,"[reporter, novali, panji, nugroho, editor, ahm..."
3,Tom Lembong Sempat Unggah Hal Ini Sehari Sebel...,Reporter\nTempo.co\nEditor\nAndry Triyanto Tji...,30-10-2024 02:55,Nasional,Reporter\nTempoco\nEditor\nAndry Triyanto Tjit...,reporter\ntempoco\neditor\nandry triyanto tjit...,"[reporter, tempoco, editor, andry, triyanto, t..."
4,"Prabowo Ingin Tingkatkan Pembangunan di Papua,...",Reporter\nVedro Imanuel G\nEditor\nAgung Seday...,29-10-2024 23:55,Bisnis,Reporter\nVedro Imanuel G\nEditor\nAgung Seday...,reporter\nvedro imanuel g\neditor\nagung seday...,"[reporter, vedro, imanuel, g, editor, agung, s..."
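The short sketch below shows what whitespace tokenization with `str.split()` does to a lowercased article fragment; the fragment is invented. Called without arguments, `split()` treats any run of spaces, tabs, or newlines as a single separator, so the newline characters visible in the table above do not end up inside tokens.

```python
def tokenize(text):
    """Split text into tokens on any run of whitespace."""
    return text.split()

fragment = "reporter\nnovali panji  nugroho\neditor"
tokens = tokenize(fragment)
print(tokens)
```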


### STOPWORD REMOVAL

Stopword removal drops common words that carry little informative meaning in the text.

These words are known as “stopwords” because they occur frequently but add little to the overall meaning.

At this stage, words with little influence on a sentence are removed. The articles are filtered against a stopword list containing words such as “yang”, “dan”, “di”, and “dari”.

In [None]:
from nltk.corpus import stopwords
nltk.download('stopwords')
stop_words = set(stopwords.words('indonesian'))  # a set makes membership tests fast

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [None]:
def remove_stopwords(text):
  return [word for word in text if word not in stop_words]

df['stopword_removal'] = df['tokenize'].apply(lambda x: ' '.join(remove_stopwords(x)))

df.head(5)

Unnamed: 0,Judul Berita,Isi Berita,Tanggal Berita,Kategori Berita,cleansing,case_folding,tokenize,stopword_removal
0,Pesan Anies ke Tom Lembong Tersangka Korupsi I...,Reporter\nNovali Panji Nugroho\nEditor\nEko Ar...,30-10-2024 05:55,Nasional,Reporter\nNovali Panji Nugroho\nEditor\nEko Ar...,reporter\nnovali panji nugroho\neditor\neko ar...,"[reporter, novali, panji, nugroho, editor, eko...",reporter novali panji nugroho editor eko ari w...
1,"Tom Lembong Tersangka Korupsi Impor Gula, Nama...",Reporter\nNovali Panji Nugroho\nEditor\nAhmad ...,29-10-2024 23:55,Nasional,Reporter\nNovali Panji Nugroho\nEditor\nAhmad ...,reporter\nnovali panji nugroho\neditor\nahmad ...,"[reporter, novali, panji, nugroho, editor, ahm...",reporter novali panji nugroho editor ahmad fai...
2,"Profil Tom Lembong, Eks Mendag dan Co-Captain ...",Reporter\nNovali Panji Nugroho\nEditor\nAhmad ...,29-10-2024 22:55,Nasional,Reporter\nNovali Panji Nugroho\nEditor\nAhmad ...,reporter\nnovali panji nugroho\neditor\nahmad ...,"[reporter, novali, panji, nugroho, editor, ahm...",reporter novali panji nugroho editor ahmad fai...
3,Tom Lembong Sempat Unggah Hal Ini Sehari Sebel...,Reporter\nTempo.co\nEditor\nAndry Triyanto Tji...,30-10-2024 02:55,Nasional,Reporter\nTempoco\nEditor\nAndry Triyanto Tjit...,reporter\ntempoco\neditor\nandry triyanto tjit...,"[reporter, tempoco, editor, andry, triyanto, t...",reporter tempoco editor andry triyanto tjitra ...
4,"Prabowo Ingin Tingkatkan Pembangunan di Papua,...",Reporter\nVedro Imanuel G\nEditor\nAgung Seday...,29-10-2024 23:55,Bisnis,Reporter\nVedro Imanuel G\nEditor\nAgung Seday...,reporter\nvedro imanuel g\neditor\nagung seday...,"[reporter, vedro, imanuel, g, editor, agung, s...",reporter vedro imanuel g editor agung sedayu r...
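A minimal sketch of the filtering step follows. It uses a small hand-written stopword set instead of the NLTK list so that it runs without a download; the set `toy_stop_words` and the token list are assumptions for illustration only.

```python
toy_stop_words = {"yang", "dan", "di", "dari"}  # tiny illustrative subset

def remove_stopwords(tokens, stop_words):
    """Keep only the tokens that are not in the stopword set."""
    return [token for token in tokens if token not in stop_words]

tokens = ["harga", "gula", "yang", "naik", "di", "pasar"]
filtered = remove_stopwords(tokens, toy_stop_words)
joined = " ".join(filtered)  # back to one string per document
print(joined)
```

Joining the tokens back into a string mirrors the notebook, where the `stopword_removal` column holds plain text that is later passed to the vectorizer.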


In [None]:
df.to_csv("/content/drive/My Drive//ppw/tugas/DataTugas2/Hasil_Prepros.csv",encoding='utf8', index=False)

### TF-IDF (Term Frequency-Inverse Document Frequency)

TF-IDF is a statistical measure of how important a word is to one document relative to the other documents in a collection.

It helps identify the words that are most relevant and distinctive for each document, which allows more meaningful analysis.

TF-IDF is often applied in tasks such as information extraction, text mining, and text-based machine learning.

Term Frequency measures how often a word occurs in a single document: the more often it appears, the higher its Term Frequency. Inverse Document Frequency, in turn, reflects how many documents in the dataset contain the word, and gives a higher weight to words that appear in few documents, making them more significant.
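To make the weighting concrete, the sketch below computes TF-IDF by hand for a three-document toy corpus. It assumes the smoothed variant idf(t) = ln((1 + N) / (1 + df(t))) + 1 followed by L2 normalization of each document vector, which is what scikit-learn's `TfidfVectorizer` does with its default settings; other references use slightly different formulas. Tokenization is simplified to a whitespace split, and the corpus and the helper name `tfidf_vectors` are invented.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (vocabulary, list of L2-normalized TF-IDF vectors)."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({term for tokens in tokenized for term in tokens})
    n_docs = len(docs)
    # df(t): number of documents that contain term t
    doc_freq = {t: sum(t in tokens for tokens in tokenized) for t in vocab}
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)  # raw term frequency
        weights = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in weights))
        vectors.append([w / norm if norm else 0.0 for w in weights])
    return vocab, vectors

docs = ["harga gula naik", "harga beras naik", "korupsi impor gula"]
vocab, vectors = tfidf_vectors(docs)
second = dict(zip(vocab, vectors[1]))
print({term: round(weight, 3) for term, weight in second.items()})
```

In the second document, `beras` receives a higher weight than `harga` or `naik` because it appears in only one document of the corpus.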

In [None]:
import pandas as pd

data = pd.read_csv("/content/drive/My Drive//ppw/tugas/DataTugas2/Hasil_Prepros.csv", sep=",")

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

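# Initialize the TfidfVectorizer
vectorizer = TfidfVectorizer()

# Compute TF-IDF on the preprocessed text loaded above.
# fillna('') guards against documents that became empty and were read back as NaN.
tfidf_matrix = vectorizer.fit_transform(data['stopword_removal'].fillna(''))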

In [None]:
# Convert the result to a DataFrame with one column per term
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=vectorizer.get_feature_names_out())
tfidf_df.head(10)

Unnamed: 0,abdul,abimanyu,acara,adapin,adat,adil,administrasi,af,agama,agraria,...,yakti,yanuar,yaputra,yasin,yogyakarta,youtube,yudono,yuk,yusuf,zaman
0,0.168487,0.0,0.0,0.0,0.0,0.046607,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.018647,0.0,0.034854,0.0,0.04688,0.0,0.0,0.0,0.0,0.04688,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.09376,0.0
2,0.022131,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.024579,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.061793
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,0.0,0.0,0.0,0.0,0.0,0.0,0.160567,0.0,0.0,0.0,...,0.0,0.0,0.026731,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,0.078429,0.0,0.0,0.0,0.0,0.04339,0.048865,0.0,0.0,0.0,...,0.05381,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,0.016052,0.0,0.030003,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.040355,0.0,0.0,0.0,0.0
9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.049462,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
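Once documents are represented as TF-IDF vectors, the vector space model can rank them against a query by projecting the query into the same space. The sketch below does this on a toy corpus with `TfidfVectorizer.transform` and `cosine_similarity` from scikit-learn. The corpus, the query, the name `toy_vectorizer`, and the use of cosine similarity are assumptions, since the notebook stops at building the matrix.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

toy_docs = [
    "tom lembong tersangka korupsi impor gula",
    "prabowo tingkatkan pembangunan papua",
    "harga gula impor naik",
]

toy_vectorizer = TfidfVectorizer()
doc_matrix = toy_vectorizer.fit_transform(toy_docs)  # learn vocabulary and IDF

query = "korupsi gula"
query_vector = toy_vectorizer.transform([query])     # reuse the fitted vocabulary

scores = cosine_similarity(query_vector, doc_matrix).ravel()
ranking = scores.argsort()[::-1]                     # best match first

for idx in ranking:
    print(f"{scores[idx]:.3f}  {toy_docs[idx]}")
```

The query must go through `transform`, not `fit_transform`, so that it is weighted with the vocabulary and IDF values learned from the documents.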
