<a href="https://colab.research.google.com/github/rendyleo/PPW/blob/main/Vector%20Space%20Model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

1. **Vector Space Model (VSM)**

VSM is an algebraic model that represents text as numeric vectors. Each document becomes one vector, and each unique word in the document collection becomes one dimension of the vector space. Each component of a vector holds the weight of the corresponding word in that document.
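
As a minimal sketch of this idea, the snippet below builds raw term-count vectors for a three-document toy corpus. The corpus and the helper name `to_vector` are hypothetical and are not taken from bola.csv; raw counts are used here as the simplest possible weight.

```python
from collections import Counter

# Toy corpus (hypothetical, not taken from bola.csv)
docs = ["liverpool menang", "chelsea kalah", "liverpool kalah"]

# One dimension per unique word in the corpus, in sorted order
vocabulary = sorted({word for doc in docs for word in doc.split()})

def to_vector(doc):
    """Return the raw term-count vector of `doc` over `vocabulary`."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocabulary]

vectors = [to_vector(doc) for doc in docs]

print(vocabulary)
for doc, vec in zip(docs, vectors):
    print(vec, doc)
```

Every document ends up with the same number of components (one per vocabulary word), which is what makes the vectors comparable.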


2. **Applying VSM in this notebook**

In the code below, VSM is implemented with TF-IDF (Term Frequency-Inverse Document Frequency).

TF-IDF: a technique for weighting the words in a document. TF measures how often a word occurs in a document, while IDF measures how informative the word is across the whole collection: the fewer documents contain it, the higher its IDF.

TfidfVectorizer: the code uses TfidfVectorizer from scikit-learn to compute the TF-IDF representation of the preprocessed text. The result is stored in the variable X_train_tfidf.
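
To make the weighting concrete, here is a sketch that reproduces the default TfidfVectorizer output by hand on a hypothetical toy corpus. It assumes the scikit-learn defaults (smoothed IDF and L2 row normalisation); other settings would give different numbers.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy corpus (hypothetical, not taken from bola.csv)
docs = ["liverpool menang", "chelsea kalah", "liverpool kalah"]

# Raw term frequencies, one row per document
counts = CountVectorizer().fit_transform(docs).toarray()
n_docs = counts.shape[0]

# Document frequency: in how many documents each term occurs
df = (counts > 0).sum(axis=0)

# Smoothed IDF, as used by scikit-learn when smooth_idf=True (the default)
idf = np.log((1 + n_docs) / (1 + df)) + 1

# TF * IDF, then scale every row to unit L2 length
weights = counts * idf
weights = weights / np.linalg.norm(weights, axis=1, keepdims=True)

# Compare with the library result
reference = TfidfVectorizer().fit_transform(docs).toarray()
print(np.allclose(weights, reference))
```

Because each row has unit length, the dot product of two rows is already their cosine similarity.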

Cosine Similarity: once the documents are represented as TF-IDF vectors, the similarity between them can be computed with cosine similarity. The code computes it between the documents in the training data using cosine_similarity from scikit-learn.
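
Cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below computes it directly with NumPy; the helper name `cosine` and the three small vectors are illustrative assumptions, not part of the notebook's data.

```python
import numpy as np

def cosine(a, b):
    """Cosine of the angle between two non-zero vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical term-count vectors over a four-word vocabulary
d1 = [0, 0, 1, 1]
d2 = [1, 1, 0, 0]
d3 = [0, 1, 1, 0]

sim_12 = cosine(d1, d2)  # no shared terms
sim_13 = cosine(d1, d3)  # one shared term out of two in each vector
print(sim_12, sim_13)
```

Documents with no words in common score 0, and the score approaches 1 as the vectors point in the same direction.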

**Conclusion**

VSM, implemented here with TF-IDF, represents text as numeric vectors. This representation can then be used to compute the similarity between documents, which is useful for tasks such as information retrieval, document clustering, and text classification.

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
import nltk
import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [None]:
# Load the CSV data
data = pd.read_csv("/content/bola.csv")  # Adjust the path to where your CSV file is stored

3. **Label encoding**

Label encoding converts categorical data into numeric data by assigning an integer to each distinct category.

In [None]:
# Label Encoding
le = LabelEncoder()
data['Link_Text_Encoded'] = le.fit_transform(data['Link_Text'])

In [None]:
print(data['Link_Text_Encoded'])


0       47
1       46
2      120
3      119
4      138
      ... 
291     73
292    148
293    133
294    106
295     56
Name: Link_Text_Encoded, Length: 296, dtype: int64
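
For illustration, the sketch below runs LabelEncoder on a short hypothetical list of labels and maps the integers back to the original strings. The label values are made up for the example.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical category labels
labels = ["bola", "otomotif", "bola", "inggris"]

encoder = LabelEncoder()
codes = encoder.fit_transform(labels)

# Classes are stored in sorted order; a code is an index into this array
print(list(encoder.classes_))

# Integer code for every input label
print(codes.tolist())

# inverse_transform recovers the original strings
restored = encoder.inverse_transform(codes).tolist()
print(restored)
```

The integers only identify categories; their order follows alphabetical sorting and carries no other meaning.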


4. **Lower case**

Converts the text to lowercase.

In [None]:
# Lower Case
# Cast the 'cleaned_text' column to string before lowercasing.
# Note: this assumes 'cleaned_text' already exists in `data` (for example, as a column of the CSV).
data['cleaned_text'] = data['cleaned_text'].astype(str).str.lower()

In [None]:
# Show the lowercased text

print(data['cleaned_text'])


0                                                bolanet
1                                                bolacom
2                                                merdeka
3                                               liputan6
4                                                 otosia
                             ...                        
291    gulfstream g650 jet pribadi milik cristiano ro...
292    pesona bidadari wags gres premier league 20242...
293    nyentrik jersey ketiga liverpool chelsea totte...
294    kisah shin taeyong bintang iklan indonesia tar...
295                                       bolatainment »
Name: cleaned_text, Length: 296, dtype: object


5. **Text cleansing**

Text cleansing removes or rewrites parts of the raw text (here punctuation and digits) so that it is better suited for further processing.

In [None]:
# Text Cleansing
def clean_text(text):
  # Ensure the input is a string
  if isinstance(text, str):
    text = re.sub(r'[^\w\s]', '', text)  # Remove punctuation
    text = re.sub(r'\d+', '', text)  # Remove numbers
    text = text.strip()
  return text

data['cleaned_text'] = data['cleaned_text'].apply(clean_text)

In [None]:
print(data['cleaned_text'])


0                                                bolanet
1                                                bolacom
2                                                merdeka
3                                                liputan
4                                                 otosia
                             ...                        
291    gulfstream g jet pribadi milik cristiano ronal...
292     pesona bidadari wags gres premier league  aduhai
293    nyentrik jersey ketiga liverpool chelsea totte...
294    kisah shin taeyong bintang iklan indonesia tar...
295                                         bolatainment
Name: cleaned_text, Length: 296, dtype: object
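
Removing digits can leave two consecutive spaces behind, as in row 292 above. One possible refinement, shown here as a sketch, is to collapse repeated whitespace as a final step. The function name `clean_text_compact` and the sample string are hypothetical and are not part of the original notebook.

```python
import re

def clean_text_compact(text):
    """Hypothetical variant of clean_text that also collapses repeated spaces."""
    if not isinstance(text, str):
        return text  # leave non-string values (e.g. NaN) untouched
    text = re.sub(r"[^\w\s]", "", text)  # remove punctuation
    text = re.sub(r"\d+", "", text)      # remove digits
    text = re.sub(r"\s+", " ", text)     # collapse runs of whitespace
    return text.strip()

sample = "premier league 2024/2025 aduhai!"
print(clean_text_compact(sample))
```

Tokenization splits on whitespace anyway, so this mainly affects how the cleaned column reads when printed.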


6. **Tokenization** (splitting the text into words)

In [None]:
# Tokenization
data['tokens'] = data['cleaned_text'].apply(nltk.word_tokenize)

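7. **Stopword removal**

Removes stop words (common words that carry little meaning). The code uses NLTK's English stop word list, so most Indonesian function words in the headlines are left in place.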

In [None]:
#  Stopword Removal
stop_words = set(stopwords.words('english'))
data['tokens_without_stopwords'] = data['tokens'].apply(lambda x: [word for word in x if word not in stop_words])


8. **Stemming**

Stemming strips suffixes so that variants of a word map to the same stem.

In [None]:
# Stemming
stemmer = PorterStemmer()
data['stemmed_tokens'] = data['tokens_without_stopwords'].apply(lambda x: [stemmer.stem(word) for word in x])
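
The sketch below shows what PorterStemmer does on a few sample words. The words are chosen for illustration only. Porter's algorithm strips English suffixes and never removes prefixes, so Indonesian forms such as "bermain" keep their "ber-" prefix.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Two English words and two Indonesian words (illustrative samples)
words = ["playing", "players", "bermain", "pemain"]
stems = [stemmer.stem(word) for word in words]

for word, stem in zip(words, stems):
    print(word, "->", stem)
```

For text that is mostly Indonesian, a stemmer designed for Indonesian morphology would likely group word variants better; that would be a change to the pipeline and is not what the notebook does.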


In [None]:
# Join tokens back into a string
data['processed_text'] = data['stemmed_tokens'].apply(lambda x: ' '.join(x))

9. **Train/test split**

The data is split into training data and testing data.

In [None]:
# Split Data into Train and Test
X = data['processed_text']
y = data['Link_URL']  # Assuming Link_URL is your target variable

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
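
As a quick check of what `test_size=0.2` means for a dataset of 296 rows, the sketch below splits a stand-in list of the same length. The list `items` is a placeholder, not the real data.

```python
from sklearn.model_selection import train_test_split

# Stand-in for the 296 rows of the dataset
items = list(range(296))

train, test = train_test_split(items, test_size=0.2, random_state=42)

# The test size is rounded up, the training set gets the remainder
print(len(train), len(test))
```

Fixing `random_state` makes the shuffle reproducible, so the same rows land in each part on every run.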

10. **TF-IDF**

The training data is converted into TF-IDF vectors.

In [None]:
# TF-IDF

from sklearn.feature_extraction.text import TfidfVectorizer

# Assuming X_train contains your training text data
tfidf_vectorizer = TfidfVectorizer()
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)

# Now X_train_tfidf contains the TF-IDF representation of your training data
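
The vectorizer is fitted on the training data only. A sketch of how held-out text could then be vectorized is shown below on hypothetical toy data: `transform` reuses the vocabulary and IDF weights learned during `fit_transform`, and words that were never seen in training are ignored.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy data
train_texts = ["liverpool menang", "chelsea kalah"]
test_texts = ["liverpool kalah telak"]

vectorizer = TfidfVectorizer()

# Learn the vocabulary and IDF weights from the training texts
train_matrix = vectorizer.fit_transform(train_texts)

# Reuse them for new text; "telak" is not in the vocabulary and is dropped
test_matrix = vectorizer.transform(test_texts)

print(train_matrix.shape, test_matrix.shape)
```

Both matrices have the same number of columns, so test vectors can be compared directly against training vectors.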


In [None]:
print("Number of TF-IDF features:", len(tfidf_vectorizer.get_feature_names_out()))


Number of TF-IDF features: 470


In [None]:
# Show the TF-IDF matrix

print(X_train_tfidf)


  (0, 144)	0.5399441133495791
  (0, 410)	0.43722581401424643
  (0, 276)	0.5399441133495791
  (0, 19)	0.4751360820626585
  (1, 122)	1.0
  (2, 436)	0.7071067811865476
  (2, 298)	0.7071067811865476
  (3, 436)	0.7071067811865476
  (3, 298)	0.7071067811865476
  (4, 245)	0.6116553358385446
  (4, 176)	0.7911243582018171
  (5, 354)	0.4703411141585068
  (5, 151)	0.4138873425768273
  (5, 291)	0.4703411141585068
  (5, 208)	0.4703411141585068
  (5, 437)	0.2445260278317886
  (5, 181)	0.32441023159558746
  (6, 161)	0.4556474608628712
  (6, 244)	0.40547050980496835
  (6, 448)	0.48144196055506866
  (6, 299)	0.48144196055506866
  (6, 243)	0.40547050980496835
  (7, 245)	0.5828624248034108
  (7, 181)	0.8125708546042545
  (8, 324)	0.3493959112156099
  :	:
  (227, 236)	0.3110122955208984
  (227, 74)	0.3110122955208984
  (227, 468)	0.33449788239896255
  (227, 382)	0.33449788239896255
  (227, 126)	0.33449788239896255
  (227, 341)	0.33449788239896255
  (227, 277)	0.33449788239896255
  (227, 360)	0.33449788239

In [None]:
from sklearn.metrics.pairwise import cosine_similarity
import pandas as pd

# Assuming X_train_tfidf contains your TF-IDF matrix
cosine_sim = cosine_similarity(X_train_tfidf)

# Wrap the cosine similarity matrix in a DataFrame
cosine_sim_df = pd.DataFrame(cosine_sim)

# Show the first few rows of the cosine similarity matrix
cosine_sim_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,226,227,228,229,230,231,232,233,234,235
0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.373098,0.0,0.0,0.0,0.0,0.0
1,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.356511,0.0,0.0,...,0.0,0.0,0.0,0.0,0.318899,0.356511,0.0,0.0,0.0,0.0
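
One common use of these scores is ranking documents against a query. The sketch below does this on a hypothetical toy corpus; the query string and variable names are assumptions made for the example.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus (hypothetical, not taken from bola.csv)
docs = ["liverpool menang", "chelsea kalah", "liverpool kalah"]

vectorizer = TfidfVectorizer()
doc_matrix = vectorizer.fit_transform(docs)

# Vectorize the query with the same vocabulary and IDF weights
query_vec = vectorizer.transform(["liverpool"])

# One similarity score per document
scores = cosine_similarity(query_vec, doc_matrix).ravel()

# Document indices from most to least similar
ranking = np.argsort(scores)[::-1]
for i in ranking:
    print(round(float(scores[i]), 3), docs[i])
```

The document that does not contain the query word scores 0 and lands at the bottom of the ranking.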


In [None]:
temp_counts = data['Link_URL'].value_counts()
print(temp_counts)


Link_URL
https://www.bola.net/tim_nasional/                                                                                                     23
https://www.bola.net/piala_eropa/                                                                                                      17
https://www.bola.net/inggris/                                                                                                          17
https://www.bola.net/otomotif/                                                                                                          7
https://www.bola.net/open-play/                                                                                                         6
                                                                                                                                       ..
https://www.fimela.com/                                                                                                                 1
#                        

In [None]:
# Similarity matrix

import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Assuming X_train_tfidf contains your TF-IDF matrix
cosine_sim = cosine_similarity(X_train_tfidf)

# Convert the cosine similarity matrix to a Pandas DataFrame for better readability
cosine_sim_df = pd.DataFrame(cosine_sim, index=X_train.index, columns=X_train.index)

# Print or use the cosine similarity matrix
print(cosine_sim_df)


     63   17   215  219       183       114  76        284  66   178  ...  \
63   1.0  0.0  0.0  0.0  0.000000  0.000000  0.0  0.000000  0.0  0.0  ...   
17   0.0  1.0  0.0  0.0  0.000000  0.000000  0.0  0.000000  0.0  0.0  ...   
215  0.0  0.0  1.0  1.0  0.000000  0.000000  0.0  0.000000  0.0  0.0  ...   
219  0.0  0.0  1.0  1.0  0.000000  0.000000  0.0  0.000000  0.0  0.0  ...   
183  0.0  0.0  0.0  0.0  1.000000  0.000000  0.0  0.356511  0.0  0.0  ...   
..   ...  ...  ...  ...       ...       ...  ...       ...  ...  ...  ...   
188  0.0  0.0  0.0  0.0  0.356511  0.263606  0.0  1.000000  0.0  0.0  ...   
71   0.0  0.0  0.0  0.0  0.000000  0.000000  0.0  0.000000  0.0  0.0  ...   
106  0.0  0.0  0.0  0.0  0.000000  0.000000  0.0  0.000000  0.0  0.0  ...   
270  0.0  0.0  0.0  0.0  0.000000  0.000000  0.0  0.000000  0.0  0.0  ...   
102  0.0  0.0  0.0  0.0  0.000000  0.000000  0.0  0.000000  0.0  0.0  ...   

     87   214  121  295       20        188  71   106  270  102  
63   0.0 