# Assignment 2: TF-IDF & Vector Space Model

## Explanation

**Vector Space Model (VSM)** is a method commonly used in information retrieval systems. It measures the similarity between a document and a query by representing both as vectors of weighted terms.
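
As a minimal sketch of the idea, the similarity between a query and each document can be computed as the cosine of the angle between their term vectors. The documents, the query, and the function name `cosine_similarity` below are hypothetical, and raw term counts stand in for the weights; this is an illustration, not the notebook's actual pipeline.

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-weight vectors (dicts)."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_a * norm_b)

# Hypothetical toy documents and query; raw term counts serve as weights here
documents = {
    "doc1": "menteri kabinet prabowo",
    "doc2": "tarif tol jakarta naik",
}
query = "kabinet prabowo"

query_vec = Counter(query.split())
scores = {
    name: cosine_similarity(query_vec, Counter(text.split()))
    for name, text in documents.items()
}

# Documents ordered from most to least similar to the query
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

A document that shares no term with the query scores 0, so it ends up at the bottom of the ranking.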




**TF-IDF** stands for Term Frequency-Inverse Document Frequency. It measures how relevant a word is to a text within a collection of texts (a corpus). The relevance increases with the number of times the word appears in the text, but is offset by how frequently the word appears across the corpus.
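
The weighting can be sketched from scratch as follows. The sketch assumes whitespace tokenization and the smoothed IDF variant `ln((1 + N) / (1 + df)) + 1` followed by L2 normalization, which matches the default settings of scikit-learn's `TfidfVectorizer`; the corpus and the function name `tfidf` are hypothetical.

```python
import math
from collections import Counter

def tfidf(corpus):
    """Return one {term: weight} dict per document, L2-normalized."""
    tokenized = [doc.split() for doc in corpus]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)  # term frequency within this document
        # Smoothed IDF (an assumption of this sketch): ln((1 + N) / (1 + df)) + 1
        weights = {
            term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        }
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()} if norm else {})
    return vectors

# Hypothetical three-document corpus
corpus = ["kepala bin baru", "dpr setujui kepala bin", "tarif tol naik"]
vectors = tfidf(corpus)
print(vectors[0])
```

In the first document, "baru" appears in only one document of the corpus, so it receives a higher weight than "kepala" and "bin", which each appear in two.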

In this material we will learn to build a vector space model from the data collected in the previous code, restricted to 2 categories with at most 50 articles each.

In [None]:
!pip install beautifulsoup4 requests pandas scikit-learn Sastrawi



## Before building the vector space model

## Web scraping and data collection

In [None]:
import requests  # HTTP client for fetching the index page
from bs4 import BeautifulSoup  # HTML parser
import pandas as pd  # DataFrame handling
from datetime import datetime, timedelta  # Converting relative publish times
import random  # Optional random category selection

To choose the categories, I take 2 from the list of available categories and limit each selected category to 50 articles. The random draw is commented out in the code, which fixes the selection to 'Nasional' and 'Bisnis'.
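
The selection logic can be sketched without pandas as follows. The notebook itself does this with `isin` and `groupby(...).head(50)`; the function name `sample_categories`, the `seed` parameter, and the sample rows are hypothetical and only illustrate the random draw and the per-category cap.

```python
import random
from collections import defaultdict

def sample_categories(rows, n_categories=2, per_category=50, seed=None):
    """Pick n_categories at random and keep at most per_category rows for each."""
    rng = random.Random(seed)
    categories = sorted({row["Kategori Berita"] for row in rows})
    chosen = rng.sample(categories, n_categories)
    kept = defaultdict(list)
    for row in rows:
        cat = row["Kategori Berita"]
        # Keep the row only if its category was drawn and the cap is not reached
        if cat in chosen and len(kept[cat]) < per_category:
            kept[cat].append(row)
    return chosen, [row for cat in chosen for row in kept[cat]]

# Hypothetical rows standing in for scraped articles (60 per category)
rows = [
    {"Judul Berita": f"Artikel {i}", "Kategori Berita": cat}
    for i, cat in enumerate(["Nasional", "Bisnis", "Dunia"] * 60)
]
chosen, subset = sample_categories(rows, seed=0)
print(chosen, len(subset))
```

Passing a fixed `seed` makes the draw reproducible, which serves the same purpose as hard-coding the two categories.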


In [None]:
# Fetch the index page from the website
url = 'https://www.tempo.co/indeks'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

articles = soup.find_all('article', class_='text-card')

data = []

# Loop over the articles and extract their fields
for article in articles:
    title = article.find('h2', class_='title')
    title_text = title.text.strip() if title else None

    content = article.find('p')
    content_text = content.text.strip() if content else None

    date = article.find('h4', class_='date')
    if date:
        date_text = date.text.strip()
        # Convert "N jam lalu" (N hours ago) to an absolute timestamp;
        # other formats, such as "N menit lalu", are kept unchanged
        if "jam lalu" in date_text:
            hours_ago = int(date_text.split(' ')[0])
            publish_time = datetime.now() - timedelta(hours=hours_ago)
            date_text = publish_time.strftime('%d-%m-%Y %H:%M')
    else:
        date_text = None

    # The category is the first label of the link's hostname (its subdomain)
    anchor = title.find('a') if title else None
    link = anchor.get('href') if anchor else None
    category_from_url = link.split('.')[0].replace('https://', '').replace('http://', '') if link else None
    category = category_from_url.capitalize() if category_from_url else None

    data.append({
        'Judul Berita': title_text,
        'Isi Berita': content_text,
        'Tanggal Berita': date_text,
        'Kategori Berita': category
    })

df = pd.DataFrame(data)

categories = df['Kategori Berita'].unique()

# selected_categories = random.sample(list(categories), 2)
selected_categories = ['Nasional', 'Bisnis']
df_selected = df[df['Kategori Berita'].isin(selected_categories)]
df_final = df_selected.groupby('Kategori Berita').head(50)

df_final = df_final.reset_index(drop=True)
df_final.index += 1

styled_df = df_final.style.set_table_styles(
    [{'selector': 'table', 'props': [('border-collapse', 'collapse'), ('width', '100%')]},
     {'selector': 'th, td', 'props': [('border', '1px solid black'), ('padding', '8px'), ('text-align', 'left')]}]
)

display(styled_df)


Unnamed: 0,Judul Berita,Isi Berita,Tanggal Berita,Kategori Berita
1,"Berapa Jumlah Menteri Zaman Presiden Habibie, Gus Dur, Megawati, SBY, hingga Jokowi?",Jumlah menteri di kabinet Prabowo Subianto diperkirakan lebih banyak dibanding jumlah menteri di kabinet presiden Indonesia sebelum-sebelumnya.,6 menit lalu,Nasional
2,"Kominfo Sebut 5 Dompet Digital untuk Judi Online, Ini Tanggapan Mereka",Menkominfo Budi Arie menegur keras perusahaan-perusahaan penyedia dompet digital (e-wallet) karena dinilai memfasilitasi pemain judi online.,20 menit lalu,Bisnis
3,Puan Sebut Pramono Anung Diutus Megawati untuk Datang Menemui Prabowo di Kertanegara,"Puan Maharani mengatakan, Pramono Anung diutus langsung oleh Megawati untuk bertemu Prabowo di Kertanegara.",20 menit lalu,Nasional
4,Dua Forum Betawi ini Dukung Pramono Anung-Rano Karno di Pilgub Jakarta,Pramono Anung-Rano Karno mendapat dukungan dari dua forum Betawi untuk Pilgub Jakarta 2024.,27 menit lalu,Nasional
5,"Diberhentikan Sepihak, Konsil Tenaga Kesehatan Indonesia Laporkan Kemenkes ke Ombudsman",Konsul Tenaga Kesehatan Indonesia melaporkan Kemenkes ke Ombudsman ihwal dugaan maladministrasi.,41 menit lalu,Nasional
6,Siapa Saja Calon Wamenkeu yang Bakal Mendampingi Sri Mulyani?,"Tiga Wamenkeu direncanakan Prabowo untuk membantu kerja Sri Mulyani. Mereka adalah Thomas Djiwandono, Suahasil Nazara, dan Anggito Abimanyu.",55 menit lalu,Nasional
7,BRI Raih Indonesia Distinguished Human Capital Leader Awards 2024,Penghargaan tersebut menjadi pembuktian dan wujud nyata komitmen BRI menjadi Home to The Best Talent.,56 menit lalu,Bisnis
8,Cek Rincian Tarif Tol Jakarta-Tangerang yang Bakal Naik,Penyesuaian tarif Tol Jakarta-Tangerang akan diberlakukan dalam waktu dekat. Ketahui besaran kenaikannya berikut ini.,16-10-2024 07:00,Bisnis
9,"Daftar Kepala BIN dari Era Reformasi Hingga Sekarang, Muhammad Herindra Jadi Calon Baru","Berikut daftar kepala BIN dari era Reformasi tahun 1998 hingga 2024. Terakhir, ada Herindra yang diproyeksikan bakal jadi kepala BIN.",16-10-2024 07:00,Nasional
10,DPR Setujui Muhammad Herindra Dilantik sebagai Kepala BIN,DPR menyatakan Muhammad Herindra memenuhi syarat sebagai Kepala BIN. DPR akan mengesahkan hasil uji kelayakan tersebut dalam rapat paripurna.,16-10-2024 07:00,Nasional
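
The 'Tanggal Berita' column above mixes absolute timestamps with relative ones such as "6 menit lalu" (6 minutes ago), because the scraping loop only converts the "jam lalu" (hours ago) form. A minimal sketch of a helper that handles both units is shown below; the name `parse_relative_time`, the `UNITS` mapping, and the fixed `now` value are hypothetical.

```python
import re
from datetime import datetime, timedelta

# Indonesian relative-time units mapped to timedelta keyword names
UNITS = {"menit": "minutes", "jam": "hours"}

def parse_relative_time(date_text, now=None):
    """Turn 'N menit lalu' / 'N jam lalu' into 'dd-mm-YYYY HH:MM'.

    Any other input is returned unchanged.
    """
    now = now or datetime.now()
    match = re.fullmatch(r"(\d+)\s+(menit|jam)\s+lalu", date_text.strip())
    if not match:
        return date_text
    amount = int(match.group(1))
    unit = UNITS[match.group(2)]
    return (now - timedelta(**{unit: amount})).strftime("%d-%m-%Y %H:%M")

# A fixed reference time keeps the example reproducible
now = datetime(2024, 10, 16, 8, 0)
print(parse_relative_time("6 menit lalu", now))      # 16-10-2024 07:54
print(parse_relative_time("3 jam lalu", now))        # 16-10-2024 05:00
print(parse_relative_time("16-10-2024 07:00", now))  # 16-10-2024 07:00
```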


## Text cleaning with clean_text

After the data is collected, you can clean the text in the 'Isi Berita' column with the clean_text() function. It removes URLs, mentions, hashtags, punctuation, digits, and non-ASCII characters, and lowercases the text.

In [None]:
import re
import string
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.base import BaseEstimator, TransformerMixin
import pandas as pd
import pickle

# Function for cleaning raw text
def clean_text(text):
    text = re.sub(r'((www\.[^\s]+)|(https?://[^\s]+))', ' ', text)  # Remove URLs (http(s)://... and www...)
    text = re.sub(r'@[^\s]+', ' ', text)  # Remove @username mentions
    text = re.sub(r'#([^\s]+)', ' ', text)  # Remove hashtags
    text = re.sub(r'\brt\b', ' ', text, flags=re.IGNORECASE)  # Remove the retweet marker "rt" as a whole word only
    text = text.translate(str.maketrans("", "", string.punctuation))  # Remove punctuation
    text = re.sub(r'\d', ' ', text)  # Replace digits with spaces
    text = text.lower()
    text = text.encode('ascii', 'ignore').decode('ascii')  # Drop non-ASCII characters
    text = re.sub(r'\s+', ' ', text)  # Collapse runs of whitespace, including newlines
    return text.strip()
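
One detail worth checking in a retweet-marker removal step: a bare `rt` pattern also matches the letters "rt" inside ordinary words, while a pattern with word boundaries removes only the standalone marker. The sample string below is hypothetical and the comparison is only a sketch of the difference.

```python
import re

# Hypothetical sample text
sample = "RT @tempo berita partai terbaru"

# A bare 'rt' pattern is case-sensitive and hits the letters inside "partai"
naive = re.sub(r'rt', ' ', sample)

# With word boundaries and IGNORECASE, only the standalone marker is removed
bounded = re.sub(r'\brt\b', ' ', sample, flags=re.IGNORECASE)

print(naive)
print(bounded)
```

The naive version leaves the leading "RT" in place and splits "partai" into "pa ai", which would then enter the vocabulary as two meaningless tokens.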

## Stemming with sastrawi_stem

After the text is cleaned, you can apply stemming to the cleaned column. The sastrawi_stem() function returns the root form of each word in the cleaned text.

In [None]:
from Sastrawi.Stemmer.StemmerFactory import StemmerFactory

# Create the Sastrawi stemmer once; rebuilding it on every call is slow
stemmer = StemmerFactory().create_stemmer()

# Stemming function based on Sastrawi
def sastrawi_stem(text):
    # Reduce each word in the text to its root form
    return stemmer.stem(text)

## Building the Vector Space Model

In [None]:
# Clean the 'Isi Berita' column with clean_text
df_final['Isi Berita'] = df_final['Isi Berita'].apply(lambda x: clean_text(x) if pd.notnull(x) else '')

# Create the TfidfVectorizer
vectorizer = TfidfVectorizer()

# Convert 'Isi Berita' into its TF-IDF vector representation
X = vectorizer.fit_transform(df_final['Isi Berita'].fillna(''))

# Put the VSM result in a DataFrame, one column per term
vsm_df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())

# Join the VSM result with the original DataFrame
df_with_vsm = pd.concat([df_final.reset_index(drop=True), vsm_df], axis=1)

# Show the first few rows of the VSM result
df_with_vsm.head()


Unnamed: 0,Judul Berita,Isi Berita,Tanggal Berita,Kategori Berita,abdul,abimanyu,aburizal,acara,aceh,ad,...,wapres,wib,widianto,wihaji,wujud,yaksa,yang,yassierli,yovie,zaken
0,"Berapa Jumlah Menteri Zaman Presiden Habibie, ...",jumlah menteri di kabinet prabowo subianto dip...,6 menit lalu,Nasional,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Kominfo Sebut 5 Dompet Digital untuk Judi Onli...,menkominfo budi arie menegur keras perusahaanp...,20 menit lalu,Bisnis,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Puan Sebut Pramono Anung Diutus Megawati untuk...,puan maharani mengatakan pramono anung diutus ...,20 menit lalu,Nasional,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Dua Forum Betawi ini Dukung Pramono Anung-Rano...,pramono anungrano karno mendapat dukungan dari...,27 menit lalu,Nasional,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,"Diberhentikan Sepihak, Konsil Tenaga Kesehatan...",konsul tenaga kesehatan indonesia melaporkan k...,41 menit lalu,Nasional,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [None]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Assume df_final already exists and contains the 'Isi Berita' column

# Clean the 'Isi Berita' column with clean_text
df_final['Isi Berita'] = df_final['Isi Berita'].apply(lambda x: clean_text(x) if pd.notnull(x) else '')

# Create the TfidfVectorizer
vectorizer = TfidfVectorizer()

# Convert 'Isi Berita' into its TF-IDF vector representation
X = vectorizer.fit_transform(df_final['Isi Berita'].fillna(''))

# Put the VSM result in a DataFrame, one column per term
vsm_df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())

# Join the VSM result with the original DataFrame
df_with_vsm = pd.concat([df_final.reset_index(drop=True), vsm_df], axis=1)

# Build the TF-IDF table: article text next to its term weights.
# df_final is indexed from 1 and vsm_df from 0, so reset the index first;
# otherwise concat pairs each text with the next article's weights
tfidf_table = pd.concat([df_final[['Isi Berita']].reset_index(drop=True), vsm_df], axis=1)

# Print the first few rows of the table
print(tfidf_table.head())


                                          Isi Berita  abdul  abimanyu  \
1  jumlah menteri di kabinet prabowo subianto dip...    0.0  0.000000   
2  menkominfo budi arie menegur keras perusahaanp...    0.0  0.000000   
3  puan maharani mengatakan pramono anung diutus ...    0.0  0.000000   
4  pramono anungrano karno mendapat dukungan dari...    0.0  0.000000   
5  konsul tenaga kesehatan indonesia melaporkan k...    0.0  0.271009   

   aburizal  acara  aceh   ad  ada    adalah  agendanya  ...  wapres  wib  \
1       0.0    0.0   0.0  0.0  0.0  0.000000        0.0  ...     0.0  0.0   
2       0.0    0.0   0.0  0.0  0.0  0.000000        0.0  ...     0.0  0.0   
3       0.0    0.0   0.0  0.0  0.0  0.000000        0.0  ...     0.0  0.0   
4       0.0    0.0   0.0  0.0  0.0  0.000000        0.0  ...     0.0  0.0   
5       0.0    0.0   0.0  0.0  0.0  0.271009        0.0  ...     0.0  0.0   

   widianto  wihaji  wujud  yaksa  yang  yassierli  yovie  zaken  
1       0.0     0.0    0.0    0

In [None]:
# Get the vocabulary from the TfidfVectorizer
vocabulary = vectorizer.get_feature_names_out().tolist()

# Build a DataFrame from the TF-IDF matrix
tfidf_df = pd.DataFrame(X.toarray(), columns=vocabulary)

# Add the 'Kategori Berita' column from df_final. Pass the raw values:
# df_final is indexed from 1, so inserting the Series itself would align on
# the index, leave row 0 empty, and shift every label down by one row
tfidf_df.insert(0, 'Kategori Berita', df_final['Kategori Berita'].to_numpy())

# Display the resulting TF-IDF DataFrame
tfidf_df


Unnamed: 0,Kategori Berita,abdul,abimanyu,aburizal,acara,aceh,ad,ada,adalah,agendanya,...,wapres,wib,widianto,wihaji,wujud,yaksa,yang,yassierli,yovie,zaken
0,,0.0,0.0,0.000000,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0000,0.000000,0.0,0.0,0.0
1,Nasional,0.0,0.0,0.000000,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0000,0.000000,0.0,0.0,0.0
2,Bisnis,0.0,0.0,0.000000,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0000,0.000000,0.0,0.0,0.0
3,Nasional,0.0,0.0,0.000000,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0000,0.000000,0.0,0.0,0.0
4,Nasional,0.0,0.0,0.000000,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0000,0.000000,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
73,Bisnis,0.0,0.0,0.000000,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.3696,0.000000,0.0,0.0,0.0
74,Nasional,0.0,0.0,0.000000,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0000,0.173892,0.0,0.0,0.0
75,Bisnis,0.0,0.0,0.000000,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0000,0.000000,0.0,0.0,0.0
76,Bisnis,0.0,0.0,0.252977,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0000,0.134112,0.0,0.0,0.0


In [None]:
tfidf_df.to_csv('data_berita_vsm.csv', index=False)