# Modelling the Vector Space Model

## The Vector Space Model Explained

### Definition
The Vector Space Model (VSM) represents text as vectors in a high-dimensional space. It is used in text processing, information retrieval, and many NLP (Natural Language Processing) applications. In a VSM, each document or word is represented as a vector in a feature space whose dimensions usually correspond to the words in the text corpus.


### Concepts

1. **Representing Text as Vectors**
   - **Documents**: Each document in the corpus is represented as a high-dimensional vector.
   - **Words**: Each word in the vocabulary corresponds to one dimension of the document vector.

2. **Dimensions and the Vector Space**
   - **Dimensions**: Each dimension of the vector space represents one feature or word.
   - **Vector space**: The space has as many dimensions as there are distinct words in the corpus, or as there are relevant features.
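
The two ideas above can be sketched in a few lines of plain Python. The corpus below is a hypothetical toy example, and raw term counts are used as the vector entries purely for illustration; the weighting scheme actually used in this notebook is TF-IDF.

```python
# Toy corpus (hypothetical): each document is already split into tokens.
docs = [
    ["data", "science", "data"],
    ["vector", "space", "model"],
    ["data", "model"],
]

# The vocabulary fixes the dimensions: one dimension per distinct word.
vocabulary = sorted({word for doc in docs for word in doc})

def to_count_vector(doc, vocabulary):
    """Return the raw term-count vector of `doc` over `vocabulary`."""
    return [doc.count(term) for term in vocabulary]

vectors = [to_count_vector(doc, vocabulary) for doc in docs]

print(vocabulary)  # ['data', 'model', 'science', 'space', 'vector']
for vec in vectors:
    print(vec)     # first document: [2, 0, 1, 0, 0]
```

Every document ends up with the same number of entries, which is what makes the vectors comparable.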


### Term weighting

To represent words in a vector, we assign each one a weight. One of the most common weighting schemes is TF-IDF (Term Frequency-Inverse Document Frequency).

#### 1. Term Frequency (TF)

**TF** measures how often a term occurs in a document. It is defined as:

$$ \text{TF}(t, d) = \frac{\text{number of occurrences of term } t \text{ in document } d}{\text{total number of terms in document } d} $$

#### 2. Inverse Document Frequency (IDF)

**IDF** measures how informative a term is across the whole corpus: the fewer documents contain it, the higher its IDF. It is defined as:

$$ \text{IDF}(t, D) = \log\left(\frac{|D|}{|\{d \in D : t \in d\}|}\right) $$

where:
- \( |D| \) is the total number of documents in the corpus.
- \( |\{d \in D : t \in d\}| \) is the number of documents that contain the term \( t \).

#### 3. TF-IDF

**TF-IDF** combines TF and IDF into a single weight for a term in a document. It is defined as:

$$ \text{TF-IDF}(t, d, D) = \text{TF}(t, d) \times \text{IDF}(t, D) $$
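
As a minimal sketch, the three formulas translate directly into Python. The function names, the toy corpus, and the choice of a base-10 logarithm are assumptions made for illustration; documents are assumed to be non-empty lists of tokens.

```python
import math

def tf(term, doc):
    """Term frequency: occurrences of `term` divided by the document length."""
    return doc.count(term) / len(doc)

def idf(term, corpus):
    """Inverse document frequency, using a base-10 logarithm (an assumption)."""
    n_containing = sum(1 for doc in corpus if term in doc)
    if n_containing == 0:
        raise ValueError(f"term {term!r} does not occur in the corpus")
    return math.log10(len(corpus) / n_containing)

def tf_idf(term, doc, corpus):
    """TF-IDF weight of `term` in `doc`, relative to `corpus`."""
    return tf(term, doc) * idf(term, corpus)

# Hypothetical toy corpus of tokenised documents.
corpus = [
    ["data", "science", "data", "model"],
    ["vector", "space", "model"],
    ["politics", "economy"],
]

print(tf("data", corpus[0]))              # 2 of 4 tokens -> 0.5
print(idf("model", corpus))               # log10(3 / 2)
print(tf_idf("data", corpus[0], corpus))  # 0.5 * log10(3 / 1)
```

A term that appears in every document gets an IDF of 0 under this definition, so its TF-IDF weight vanishes regardless of how often it occurs.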

### Worked example

Suppose we have a corpus of 100 documents and the word "data" appears in 10 of them. In one particular document, "data" occurs 5 times out of 100 words in total.

#### Computing TF

$$ \text{TF}(\text{"data"}, d) = \frac{5}{100} = 0.05 $$

#### Computing IDF

Using a base-10 logarithm:

$$ \text{IDF}(\text{"data"}, D) = \log_{10}\left(\frac{100}{10}\right) = \log_{10}(10) = 1 $$

#### Computing TF-IDF

$$ \text{TF-IDF}(\text{"data"}, d, D) = 0.05 \times 1 = 0.05 $$

With TF-IDF, the word "data" therefore receives a weight of 0.05 in that document.
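
The arithmetic can be checked quickly in Python. The sketch below also evaluates the same example with a natural logarithm, to show that the absolute weight depends on the chosen base; the variable names are illustrative only.

```python
import math

n_docs = 100         # documents in the corpus
docs_with_term = 10  # documents that contain "data"
term_count = 5       # occurrences of "data" in the document
doc_length = 100     # total words in the document

tf = term_count / doc_length
idf_base10 = math.log10(n_docs / docs_with_term)  # base-10 logarithm
idf_natural = math.log(n_docs / docs_with_term)   # natural logarithm

print(tf * idf_base10)   # 0.05
print(tf * idf_natural)  # roughly 0.115
```

Changing the base only rescales every weight by the same constant, so the ranking of terms within a corpus is unaffected.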

## Implementing the Vector Space Model

### Preparation

#### Importing libraries

In [1]:
import string
import pandas as pd
import re
from tqdm import tqdm
import nltk
from Sastrawi.Stemmer.StemmerFactory import StemmerFactory
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

In [2]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/wchynto/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

#### Creating the stemming function

In [15]:
# Build the Sastrawi stemmer once; re-creating it on every call is slow
factory = StemmerFactory()
stemmer = factory.create_stemmer()

def sastrawi_stemmer(text):
  # Stem every whitespace-separated token (tqdm shows one progress bar per text)
  stemmed_text = ' '.join(stemmer.stem(word) for word in tqdm(text.split()))
  return stemmed_text

#### Creating a function to clean the text

In [4]:
def clean_string(text):

  # make text lowercase
  text = text.lower() 

  # remove line breaks
  text = re.sub(r'\n', ' ', text)

  # remove punctuation
  translator = str.maketrans('', '', string.punctuation)
  text = text.translate(translator)

  # remove stopwords
  stop_words = set(stopwords.words('indonesian'))
  text = ' '.join([word for word in text.split() if word not in stop_words])

  # remove numbers
  text = re.sub(r'\d+', '', text)

  # remove non-ascii characters
  text = re.sub(r'[^\x00-\x7F]+', ' ', text)

  # collapse runs of whitespace (done last, because the steps above can leave gaps)
  text = re.sub(r'\s+', ' ', text).strip()

  return text
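
One side effect of this cleaning step is worth noting: `str.maketrans('', '', string.punctuation)` deletes punctuation outright, so hyphenated names are fused into a single token (in the cleaned output, "Prabowo-Gibran" appears as `prabowogibran`). The sketch below contrasts that behaviour with a possible alternative that maps punctuation to spaces instead. The sample title is a simplified, hypothetical string, and the alternative is a suggestion rather than what the notebook does.

```python
import re
import string

title = "Relawan Prabowo-Gibran: Gerakan tusuk 3 paslon"

# Current behaviour: punctuation characters are deleted.
deleted = title.lower().translate(str.maketrans('', '', string.punctuation))

# Alternative: map every punctuation character to a space, then collapse spaces.
to_space = str.maketrans(string.punctuation, ' ' * len(string.punctuation))
spaced = re.sub(r'\s+', ' ', title.lower().translate(to_space)).strip()

print(deleted)  # relawan prabowogibran gerakan tusuk 3 paslon
print(spaced)   # relawan prabowo gibran gerakan tusuk 3 paslon
```

Which variant is preferable depends on whether compound names should count as one feature or two in the resulting vocabulary.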

#### Loading the crawled data

In [5]:
df = pd.read_csv('antaranews.csv')
df = df[['title', 'content', 'category']]

df.head()

| | title | content | category |
|---|---|---|---|
| 0 | Gus Ipul tanggalkan jabatan Wali Kota Pasuruan | \n"Per hari ini juga saya mundur sebagai Wali ... | Politik |
| 1 | Presiden Jokowi lantik Aida Suwandi jadi Anggo... | \n"Demi Allah saya bersumpah bahwa saya tidak ... | Politik |
| 2 | Presiden Jokowi lantik Eddy Hartono jadi Kepal... | \n"Demi Allah saya bersumpah bahwa saya akan s... | Politik |
| 3 | Wakil KSAD tetapkan 500 warga sipil sebagai ko... | \n“Dengan mengucap Bismillahirrahmanirrahim, p... | Politik |
| 4 | Relawan Prabowo-Gibran: Gerakan "tusuk 3 paslo... | \n\t\t\t\t\t\t\t\tJakarta (ANTARA) - Koordinat... | Politik |


In [6]:
# Randomly sample 50 articles each from the Politik (politics) and Ekonomi (economy) categories
politik = df[df['category'] == 'Politik'].sample(n=50, random_state=42)
ekonomi = df[df['category'] == 'Ekonomi'].sample(n=50, random_state=42)

df = pd.concat([politik, ekonomi])
df.reset_index(drop=True, inplace=True)

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   title     100 non-null    object
 1   content   100 non-null    object
 2   category  100 non-null    object
dtypes: object(3)
memory usage: 2.5+ KB


In [7]:
df.head()

| | title | content | category |
|---|---|---|---|
| 0 | Relawan Prabowo-Gibran: Gerakan "tusuk 3 paslo... | \n\t\t\t\t\t\t\t\tJakarta (ANTARA) - Koordinat... | Politik |
| 1 | Jokowi tetap keliling daerah meski berkantor d... | \n\t\t\t\t\t\t\t\tJakarta (ANTARA) - Presiden ... | Politik |
| 2 | Komisi II DPR sepakati pilkada ulang bila kota... | \npilkada diselenggarakan kembali pada tahun b... | Politik |
| 3 | Gus Ipul tanggalkan jabatan Wali Kota Pasuruan | \n"Per hari ini juga saya mundur sebagai Wali ... | Politik |
| 4 | KPU NTT: Film "Tepatilah Janji" bernilai eduka... | \n\t\t\t\t\t\t\t\tKupang (ANTARA) - Komisi Pem... | Politik |


### Data pre-processing

#### Cleaning the data

In [8]:
# Initialise a new DataFrame
cleaned_df = pd.DataFrame(columns=['cleaned_title', 'cleaned_content', 'category'])

# Clean the title and content columns; keep the category labels unchanged
cleaned_df['cleaned_title'] = df['title'].apply(clean_string)
cleaned_df['cleaned_content'] = df['content'].apply(clean_string)
cleaned_df['category'] = df['category']

cleaned_df.head()

| | cleaned_title | cleaned_content | category |
|---|---|---|---|
| 0 | relawan prabowogibran gerakan tusuk paslon rus... | jakarta koordinator nasional prabowogibran dig... | Politik |
| 1 | jokowi keliling daerah berkantor ikn | jakarta presiden joko widodo berkeliling daera... | Politik |
| 2 | komisi ii dpr sepakati pilkada ulang kotak kos... | pilkada diselenggarakan berikutnyajakarta rapa... | Politik |
| 3 | gus ipul tanggalkan jabatan wali kota pasuruan | mundur wali kota pasuruan otomatis itujakarta ... | Politik |
| 4 | kpu ntt film tepatilah janji bernilai edukasi ... | kupang komisi pemilihan kpu nusa tenggara timu... | Politik |


#### Stemming the data

In [19]:
# Initialise a new DataFrame
stemmed_df = pd.DataFrame(columns=['stemmed_title', 'stemmed_content', 'category'])

# Stem the cleaned text and carry the category labels over
stemmed_df['stemmed_title'] = cleaned_df['cleaned_title'].apply(sastrawi_stemmer)
stemmed_df['stemmed_content'] = cleaned_df['cleaned_content'].apply(sastrawi_stemmer)
stemmed_df['category'] = cleaned_df['category']

In [22]:
stemmed_df.to_csv('stemmed_antaranews.csv', index=False)
stemmed_df.head()

| | stemmed_title | stemmed_content | category |
|---|---|---|---|
| 0 | rawan prabowogibran gera tusuk paslon rusak de... | jakarta koordinator nasional prabowogibran dig... | Politik |
| 1 | jokowi keliling daerah kantor ikn | jakarta presiden joko widodo keliling daerahda... | Politik |
| 2 | komisi ii dpr sepakat pilkada ulang kotak koso... | pilkada selenggara berikutnyajakarta rapat den... | Politik |
| 3 | gus ipul tanggal jabat wali kota pasuruan | mundur wali kota pasuruan otomatis itujakarta ... | Politik |
| 4 | kpu ntt film tepat janji nila edukasi jelang p... | kupang komisi pilih kpu nusa tenggara timur fi... | Politik |


### Building the Vector Space Model

#### Creating the vsm function

In [30]:
# Define the VSM helper
def create_vsm(docs):
  vectorizer = TfidfVectorizer() # Initialise the TF-IDF vectorizer
  tfidf_matrix = vectorizer.fit_transform(docs) # Transform the documents into TF-IDF vectors
  feature_names = vectorizer.get_feature_names_out() # Get the features (the terms extracted by the vectorizer)
  df_vsm = pd.DataFrame(tfidf_matrix.toarray(), columns=feature_names) # Convert the TF-IDF matrix into a DataFrame
  return df_vsm
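
The weights that `TfidfVectorizer` produces will generally not match the textbook formula from the first part of this document. With its default settings, scikit-learn uses raw term counts for TF, a smoothed IDF of the form ln((1 + n) / (1 + df)) + 1, and then L2-normalises each document vector. The pure-Python sketch below mimics that default scheme on a hypothetical two-document corpus; it assumes the text is already tokenised and does not reproduce the vectorizer's own tokenisation rules.

```python
import math

def smoothed_tfidf(docs):
    """Raw counts times smoothed IDF, then L2-normalised per document."""
    vocabulary = sorted({t for doc in docs for t in doc})
    n_docs = len(docs)
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1, where df counts documents containing t.
    idf = {
        t: math.log((1 + n_docs) / (1 + sum(1 for d in docs if t in d))) + 1
        for t in vocabulary
    }
    rows = []
    for doc in docs:
        weights = [doc.count(t) * idf[t] for t in vocabulary]
        norm = math.sqrt(sum(w * w for w in weights))
        rows.append([w / norm if norm else 0.0 for w in weights])
    return vocabulary, rows

# Hypothetical tokenised documents.
docs = [["wali", "kota", "pasuruan"], ["kota", "jakarta"]]
vocabulary, rows = smoothed_tfidf(docs)

print(vocabulary)
for row in rows:
    print([round(w, 3) for w in row])
```

Because of the normalisation, every non-empty row has unit length, and a term shared by both documents ("kota") receives a lower weight than a term unique to one of them.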

#### Building the VSM for title and content

In [None]:
vsm_title = create_vsm(stemmed_df['stemmed_title'])
vsm_content = create_vsm(stemmed_df['stemmed_content'])

#### Displaying the VSM for 'title'

In [27]:
vsm_title.head()

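| | aceh | adat | administrasi | agama | agung | aida | air | ajak | akademi | al | ... | wait | wakil | wali | wantimpres | wapres | warga | wilayah | wisata | wujud | zero |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.390043 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |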


#### Displaying the VSM for 'content'

In [29]:
vsm_content.head()

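| | aai | ab | abdi | abdul | abdullah | abon | aborted | abror | abu | abudullah | ... | yusgiantoro | yusuf | yusufrolandus | zaenal | zainuddin | zakaria | zeno | zero | zona | zulhan |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.177159 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.048646 | 0.0 | 0.0 | 0.0 |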
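
Since almost every entry in these matrices is zero, `head()` reveals very little about any single document. One way to inspect a row is to list only its highest-weighted terms. The sketch below uses a small hypothetical DataFrame in place of `vsm_title`, and the helper name `top_terms` is an assumption introduced for illustration.

```python
import pandas as pd

# Hypothetical miniature of a TF-IDF DataFrame: rows are documents, columns are terms.
vsm = pd.DataFrame(
    [[0.0, 0.39, 0.0, 0.61], [0.52, 0.0, 0.48, 0.0]],
    columns=["aceh", "wali", "warga", "kota"],
)

def top_terms(vsm, doc_index, k=3):
    """Return the k highest-weighted non-zero terms of one document."""
    row = vsm.iloc[doc_index]
    return row[row > 0].nlargest(k)

print(top_terms(vsm, 0))
```

Applied to the real data, a call such as `top_terms(vsm_title, 3)` would list the terms that characterise the fourth title.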
