# Modelling with the Vector Space Model

## The Vector Space Model Explained

### Definition
The Vector Space Model (VSM) represents text as vectors in a high-dimensional space. It is used in text processing, information retrieval, and a range of Natural Language Processing (NLP) applications. In a VSM, each document or word is represented as a vector in a feature space whose dimensions usually correspond to the words in the text corpus.


### Concepts

1. **Representing Text as Vectors**
   - **Documents**: Each document in the corpus is represented as a high-dimensional vector.
   - **Words**: Each distinct word corresponds to one dimension of the document vector.

2. **Dimensions and the Vector Space**
   - **Dimensions**: Each dimension of the vector space represents a single feature or word.
   - **Vector space**: The space has as many dimensions as there are distinct words in the corpus, or as many as there are relevant features.
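
As a minimal sketch of these ideas, the snippet below builds a vocabulary from a tiny made-up corpus and turns each document into a vector of raw word counts. The corpus and the helper name `to_count_vector` are illustrative assumptions, not part of the dataset used later.

```python
from collections import Counter

# Toy corpus (hypothetical, for illustration only)
docs = [
    "harga beras naik",
    "harga minyak naik naik",
    "presiden lantik menteri",
]

# One dimension per distinct word in the corpus, in sorted order
vocabulary = sorted({word for doc in docs for word in doc.split()})

def to_count_vector(doc, vocabulary):
    """Represent a document as a vector of word counts over the vocabulary."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocabulary]

vectors = [to_count_vector(doc, vocabulary) for doc in docs]

print(vocabulary)
for vector in vectors:
    print(vector)
```

Every vector has the same length as the vocabulary, and most entries are zero because a single document uses only a small part of the corpus vocabulary.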


### Term weighting

To assign a value to each word's dimension in the vector, we use a term-weighting scheme. One of the most common is TF-IDF (Term Frequency-Inverse Document Frequency).

#### 1. Term Frequency (TF)

**TF** measures how often a word appears in a document. The formula is:

$$ \text{TF}(t, d) = \frac{\text{Number of occurrences of term } t \text{ in document } d}{\text{Total number of words in document } d} $$

#### 2. Inverse Document Frequency (IDF)

**IDF** measures how informative a word is across the whole corpus: a word that occurs in fewer documents receives a higher value. The formula is:

$$ \text{IDF}(t, D) = \log\left(\frac{|D|}{|\{d \in D : t \in d\}|}\right) $$

Where:
- \( |D| \) is the total number of documents in the corpus.
- \( |\{d \in D : t \in d\}| \) is the number of documents that contain the term \( t \).

#### 3. TF-IDF

**TF-IDF** combines TF and IDF to give the weight of a word in a document. The formula is:

$$ \text{TF-IDF}(t, d, D) = \text{TF}(t, d) \times \text{IDF}(t, D) $$
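
The three formulas can be sketched directly in Python. This is a minimal illustration on a made-up corpus, not the implementation used later in this notebook. The function names `tf`, `idf`, and `tf_idf` are hypothetical, the base-10 logarithm is an assumption (the formula does not fix a base), and documents are assumed to be non-empty and already tokenized.

```python
import math

def tf(term, doc_tokens):
    """Occurrences of `term` in the document divided by the document length."""
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term, corpus_tokens):
    """log10 of (number of documents / number of documents containing `term`)."""
    containing = sum(1 for doc in corpus_tokens if term in doc)
    if containing == 0:
        raise ValueError(f"term {term!r} does not occur in the corpus")
    return math.log10(len(corpus_tokens) / containing)

def tf_idf(term, doc_tokens, corpus_tokens):
    """Weight of `term` in one document, relative to the whole corpus."""
    return tf(term, doc_tokens) * idf(term, corpus_tokens)

# Toy corpus (hypothetical), tokenized by whitespace
corpus = [doc.split() for doc in [
    "harga beras naik",
    "harga minyak naik naik",
    "presiden lantik menteri",
]]

print(tf_idf("naik", corpus[1], corpus))      # frequent in the document, common in the corpus
print(tf_idf("presiden", corpus[2], corpus))  # occurs in only one document
print(tf_idf("harga", corpus[2], corpus))     # absent from this document
```

A word that is absent from a document gets a weight of zero, while a word confined to a single document gets the largest possible IDF for this corpus.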

### Worked example

Suppose we have a corpus of 100 documents and the word "data" appears in 10 of them. In one particular document, "data" occurs 5 times out of 100 words in total.

#### Compute TF

$$ \text{TF}(\text{"data"}, d) = \frac{5}{100} = 0.05 $$

#### Compute IDF

This example uses the base-10 logarithm:

$$ \text{IDF}(\text{"data"}, D) = \log_{10}\left(\frac{100}{10}\right) = \log_{10}(10) = 1 $$

#### Compute TF-IDF

$$ \text{TF-IDF}(\text{"data"}, d, D) = 0.05 \times 1 = 0.05 $$

With TF-IDF, the word "data" therefore receives a weight of 0.05 in that document.
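
The arithmetic above can be checked in a few lines. The sketch assumes the base-10 logarithm, which is what makes the IDF come out at 1. With the natural logarithm, the same inputs give a larger IDF and therefore a larger weight, so the base should be stated whenever weights are compared.

```python
import math

n_docs = 100          # documents in the corpus
docs_with_term = 10   # documents containing "data"
term_count = 5        # occurrences of "data" in the document
doc_length = 100      # total words in the document

tf = term_count / doc_length
idf_base10 = math.log10(n_docs / docs_with_term)
idf_natural = math.log(n_docs / docs_with_term)

print("TF:", tf)
print("IDF (base 10):", idf_base10, "TF-IDF:", tf * idf_base10)
print("IDF (natural log):", idf_natural, "TF-IDF:", tf * idf_natural)
```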

## Implementing the Vector Space Model

### Preparation

#### Importing libraries

In [1]:
import string
import pandas as pd
import re
from tqdm import tqdm
import nltk
from Sastrawi.Stemmer.StemmerFactory import StemmerFactory
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import joblib

In [6]:
nltk.download('stopwords')
nltk.download('punkt_tab')


[nltk_data] Downloading package stopwords to
[nltk_data]     /home/wchynto/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     /home/wchynto/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

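#### Creating the stemming function

In [7]:
# Create the Sastrawi stemmer once instead of rebuilding it on every call
stemmer = StemmerFactory().create_stemmer()

def sastrawi_stemmer(text):
  # Stem each whitespace-separated word, with a progress bar per document
  stemmed_text = ' '.join(stemmer.stem(word) for word in tqdm(text.split()))
  return stemmed_text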

#### Creating the text-cleaning function

In [8]:
def clean_string(text):

  # make text lowercase
  text = text.lower()

  # replace line breaks with spaces
  text = re.sub(r'\n', ' ', text)

  # remove punctuation (hyphenated words are joined, e.g. "a-b" becomes "ab")
  translator = str.maketrans('', '', string.punctuation)
  text = text.translate(translator)

  # remove numbers
  text = re.sub(r'\d+', '', text)

  # collapse runs of whitespace into a single space
  text = re.sub(r'\s+', ' ', text)

  # replace non-ASCII characters with a space
  text = re.sub(r'[^\x00-\x7F]+', ' ', text)

  # remove Indonesian stopwords
  stop_words = set(stopwords.words('indonesian'))
  text = ' '.join([word for word in text.split() if word not in stop_words])

  return text
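
To see what these steps do to a raw article snippet, the sketch below applies the same sequence using only the standard library. The three-word stopword set is a stand-in assumption for NLTK's Indonesian list, and `clean_string_demo` is a hypothetical name, so the output of the real `clean_string` may differ for words that NLTK treats as stopwords.

```python
import re
import string

# Tiny stand-in for stopwords.words('indonesian') (assumption, for illustration)
DEMO_STOPWORDS = {"yang", "di", "dan"}

def clean_string_demo(text, stop_words=DEMO_STOPWORDS):
    text = text.lower()                                   # lowercase
    text = re.sub(r'\n', ' ', text)                       # line breaks -> spaces
    text = text.translate(str.maketrans('', '', string.punctuation))  # drop ASCII punctuation
    text = re.sub(r'\d+', '', text)                       # drop digits
    text = re.sub(r'[^\x00-\x7F]+', ' ', text)            # non-ASCII (e.g. curly quotes) -> space
    # split() also collapses any leftover runs of whitespace
    return ' '.join(word for word in text.split() if word not in stop_words)

raw = '\n\t\tJakarta (ANTARA) - Harga beras di 3 kota naik, dan stok “aman”.'
print(clean_string_demo(raw))
print(clean_string_demo("Prabowo-Gibran"))
```

The second call shows a side effect of deleting punctuation outright: hyphenated names are merged into a single token (`prabowogibran`) rather than split into two words.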

#### Loading the crawled data

In [9]:
df = pd.read_csv('../antaranews.csv')
df = df[['title', 'content', 'category']]

df.head()

|   | title | content | category |
|---|---|---|---|
| 0 | Gus Ipul tanggalkan jabatan Wali Kota Pasuruan | \n"Per hari ini juga saya mundur sebagai Wali ... | Politik |
| 1 | Presiden Jokowi lantik Aida Suwandi jadi Anggo... | \n"Demi Allah saya bersumpah bahwa saya tidak ... | Politik |
| 2 | Presiden Jokowi lantik Eddy Hartono jadi Kepal... | \n"Demi Allah saya bersumpah bahwa saya akan s... | Politik |
| 3 | Wakil KSAD tetapkan 500 warga sipil sebagai ko... | \n“Dengan mengucap Bismillahirrahmanirrahim, p... | Politik |
| 4 | Relawan Prabowo-Gibran: Gerakan "tusuk 3 paslo... | \n\t\t\t\t\t\t\t\tJakarta (ANTARA) - Koordinat... | Politik |


In [10]:
# Randomly sample 50 articles each from the Politik and Ekonomi categories
politik = df[df['category'] == 'Politik'].sample(n=50, random_state=42)
ekonomi = df[df['category'] == 'Ekonomi'].sample(n=50, random_state=42)

df = pd.concat([politik, ekonomi])
df.reset_index(drop=True, inplace=True)

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   title     100 non-null    object
 1   content   100 non-null    object
 2   category  100 non-null    object
dtypes: object(3)
memory usage: 2.5+ KB


In [11]:
df.head()

|   | title | content | category |
|---|---|---|---|
| 0 | Relawan Prabowo-Gibran: Gerakan "tusuk 3 paslo... | \n\t\t\t\t\t\t\t\tJakarta (ANTARA) - Koordinat... | Politik |
| 1 | Gus Ipul tanggalkan jabatan Wali Kota Pasuruan | \n"Per hari ini juga saya mundur sebagai Wali ... | Politik |
| 2 | DPR-KPU antisipasi kotak kosong menang di Pilk... | \n\t\t\t\t\t\t\t\tJakarta (ANTARA) - Komisi II... | Politik |
| 3 | Relawan Prabowo-Gibran: Gerakan "tusuk 3 paslo... | \n\t\t\t\t\t\t\t\tJakarta (ANTARA) - Koordinat... | Politik |
| 4 | Ketua DPR sebut RUU Perampasan Aset jadi bahas... | \n\t\t\t\t\t\t\t\tJakarta (ANTARA) - Ketua DPR... | Politik |


### Data preprocessing

#### Cleaning the data

In [12]:
# Initialize a new DataFrame
cleaned_df = pd.DataFrame(columns=['cleaned_title', 'cleaned_content', 'category'])

# Cleaning data
cleaned_df['cleaned_title'] = df['title'].apply(clean_string)
cleaned_df['cleaned_content'] = df['content'].apply(clean_string)
cleaned_df['category'] = df['category']

cleaned_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 3 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   cleaned_title    100 non-null    object
 1   cleaned_content  100 non-null    object
 2   category         100 non-null    object
dtypes: object(3)
memory usage: 2.5+ KB


#### Stemming the data

In [13]:
# Initialize a new DataFrame
stemmed_df = pd.DataFrame(columns=['stemmed_title', 'stemmed_content', 'category'])

# Stemming data
stemmed_df['stemmed_title'] = cleaned_df['cleaned_title'].apply(sastrawi_stemmer)
stemmed_df['stemmed_content'] = cleaned_df['cleaned_content'].apply(sastrawi_stemmer)

100%|██████████| 7/7 [00:00<00:00, 13.96it/s]
100%|██████████| 7/7 [00:00<00:00, 18.81it/s]
100%|██████████| 6/6 [00:00<00:00, 22.58it/s]
... (remaining progress-bar output omitted)

In [4]:
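# The stemmed data was saved once with the commented-out lines below,
# then reloaded here so the stemming step does not have to be re-run.
# stemmed_df['category'] = cleaned_df['category']
# stemmed_df.to_csv('stemmed_antaranews.csv', index=False)
# stemmed_df.head()

stemmed_df = pd.read_csv('../stemmed_antaranews.csv')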

### Building the Vector Space Model

#### Creating the VSM function

In [8]:
import pickle

# Build a TF-IDF vector space model from a collection of documents
def create_vsm(docs, save_vectorizer=False, vectorizer_path='vectorizer.pkl'):
  vectorizer = TfidfVectorizer() # Initialize the TF-IDF vectorizer
  tfidf_matrix = vectorizer.fit_transform(docs) # Fit on the documents and transform them into TF-IDF vectors
  feature_names = vectorizer.get_feature_names_out() # Features (the words extracted by the vectorizer)
  df_vsm = pd.DataFrame(tfidf_matrix.toarray(), columns=feature_names) # Convert the TF-IDF matrix into a DataFrame
  if save_vectorizer:
    with open(vectorizer_path, 'wb') as file:
      pickle.dump(vectorizer, file) # Save the fitted vectorizer to a file
  return df_vsm
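
The weights produced by `TfidfVectorizer` will not match the textbook formula given earlier. With its default settings, scikit-learn uses the raw term count as TF, a smoothed IDF of ln((1 + n) / (1 + df)) + 1, and then scales each document vector to unit Euclidean length. The sketch below reproduces that scheme in plain Python on a made-up corpus. `sklearn_style_tfidf` is a hypothetical helper, not a scikit-learn function, and it tokenizes by whitespace rather than with the vectorizer's own token pattern.

```python
import math
from collections import Counter

def sklearn_style_tfidf(docs):
    """Mimic TfidfVectorizer defaults: raw counts, smoothed IDF, L2 normalization."""
    tokenized = [doc.split() for doc in docs]
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    n_docs = len(tokenized)

    # Document frequency: number of documents containing each word
    df = Counter(word for tokens in tokenized for word in set(tokens))
    # Smoothed IDF with the natural logarithm
    idf = {word: math.log((1 + n_docs) / (1 + df[word])) + 1 for word in vocabulary}

    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = [counts[word] * idf[word] for word in vocabulary]
        norm = math.sqrt(sum(w * w for w in weights))
        # Scale the document vector to unit length (leave all-zero rows as zeros)
        rows.append([w / norm if norm else 0.0 for w in weights])
    return vocabulary, rows

# Toy corpus (hypothetical, for illustration only)
vocabulary, rows = sklearn_style_tfidf([
    "harga beras naik",
    "harga minyak naik naik",
    "presiden lantik menteri",
])
print(vocabulary)
for row in rows:
    print([round(w, 3) for w in row])
```

Because every row is scaled to unit length, the values in the resulting matrix are relative weights within a document rather than raw TF × IDF products.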

#### Creating the VSM for title and content

In [6]:
vsm_title = create_vsm(stemmed_df['stemmed_title'], save_vectorizer=True, vectorizer_path='title_vectorizer.pkl')
vsm_content = create_vsm(stemmed_df['stemmed_content'], save_vectorizer=True, vectorizer_path='content_vectorizer.pkl')

#### Displaying the VSM for 'title'

In [38]:
category = stemmed_df['category']
categorized_vsm_title = category.to_frame().join(vsm_title)
categorized_vsm_title.to_csv('vsm_title.csv', index=False)
categorized_vsm_title

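|   | category | aceh | ada | adat | administrasi | agama | ahmad | aida | air | aisyiyah | ... | vs | wakil | wali | wantimpres | wapres | warga | wilayah | xxi | yen | zero |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Politik | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | Politik | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.392031 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | Politik | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | Politik | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | Politik | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 95 | Ekonomi | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 96 | Ekonomi | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 97 | Ekonomi | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 98 | Ekonomi | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |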

#### Displaying the VSM for 'content'

In [37]:
category = stemmed_df['category']
categorized_vsm_content = category.to_frame().join(vsm_content)
categorized_vsm_content.to_csv('vsm_content.csv', index=False)
categorized_vsm_content

|   | category | abadi | abd | abdi | abdul | abdullah | abon | aborted | abror | abudullah | ... | yusuf | yusufrolandus | za | zaenal | zakaria | zaman | zeno | zero | zona | zonasi |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Politik | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 |
| 1 | Politik | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.184721 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 |
| 2 | Politik | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 |
| 3 | Politik | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 |
| 4 | Politik | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 95 | Ekonomi | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 |
| 96 | Ekonomi | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 |
| 97 | Ekonomi | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.104427 | 0.042772 |
| 98 | Ekonomi | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 |
