<a href="https://colab.research.google.com/github/taufikfy/UTS_IR_KEL3/blob/master/UTS_IR_K3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **Import Libraries**

In [None]:
!pip install transformers --quiet

import pandas as pd
import re
import string
import nltk
import torch

from transformers import BertTokenizer, BertModel
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score
from sklearn.feature_extraction.text import TfidfVectorizer

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

## **Load the True & Fake Datasets**

In [None]:
# Load data
true_df = pd.read_csv('/content/True.csv')
fake_df = pd.read_csv('/content/Fake.csv')
true_df['label'] = 1
fake_df['label'] = 0
df = pd.concat([true_df, fake_df], ignore_index=True)
df = df[['title', 'text', 'label']]
df.dropna(inplace=True)

print(df['label'].value_counts())
df.head()

label
0    23481
1    21417
Name: count, dtype: int64


|   | title | text | label |
| - | ----- | ---- | ----- |
| 0 | As U.S. budget fight looms, Republicans flip t... | WASHINGTON (Reuters) - The head of a conservat... | 1 |
| 1 | U.S. military to accept transgender recruits o... | WASHINGTON (Reuters) - Transgender people will... | 1 |
| 2 | Senior U.S. Republican senator: 'Let Mr. Muell... | WASHINGTON (Reuters) - The special counsel inv... | 1 |
| 3 | FBI Russia probe helped by Australian diplomat... | WASHINGTON (Reuters) - Trump campaign adviser ... | 1 |
| 4 | Trump wants Postal Service to charge 'much mor... | SEATTLE/WASHINGTON (Reuters) - President Donal... | 1 |


## **Text Preprocessing**

In [None]:
# Cleaning
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()
def clean_text(text):
    text = text.lower()
    text = re.sub(r'<.*?>', ' ', text)
    text = text.translate(str.maketrans('', '', string.punctuation))
    text = re.sub(r'\d+', '', text)
    tokens = word_tokenize(text)
    tokens = [lemmatizer.lemmatize(w) for w in tokens if w not in stop_words]
    return ' '.join(tokens)

df['clean'] = df['text'].apply(clean_text)
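
To see what each cleaning step contributes, the following standard-library sketch traces a made-up sentence through the same regex and punctuation steps. It is a simplification under stated assumptions: `TOY_STOP_WORDS` is a hypothetical three-word list, whitespace splitting stands in for `word_tokenize`, and lemmatization is left out.

```python
import re
import string

# Hypothetical, tiny stop-word list; the notebook uses NLTK's English list instead.
TOY_STOP_WORDS = {"the", "in", "on"}

def trace_clean(text):
    """Apply the cleaning steps one at a time and record each intermediate result."""
    steps = {}
    text = text.lower()
    steps["lowercase"] = text
    text = re.sub(r"<.*?>", " ", text)  # strip HTML tags
    steps["no_html"] = text
    text = text.translate(str.maketrans("", "", string.punctuation))
    steps["no_punctuation"] = text
    text = re.sub(r"\d+", "", text)  # drop digit runs
    steps["no_digits"] = text
    # Whitespace split is an assumption standing in for word_tokenize
    steps["tokens"] = [w for w in text.split() if w not in TOY_STOP_WORDS]
    return steps

sample = "<b>The Senate</b> voted 51-49 on Tuesday!"
for name, value in trace_clean(sample).items():
    print(f"{name:>15}: {value!r}")
```

One thing the trace makes visible: removing punctuation before tokenizing glues hyphenated parts together (`51-49` becomes `5149`, and a word like `state-run` would become `staterun`), which may or may not be the intended behavior.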

## **Text Representation**

**BoW (Bag of Words)**

In [None]:
# BoW Vectorization
vectorizer = CountVectorizer()
X_bow = vectorizer.fit_transform(df['clean'])
y = df['label']

# Split & Train
X_train, X_test, y_train, y_test = train_test_split(X_bow, y, test_size=0.2, random_state=42)
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Evaluation
print("=== [BoW] ===")
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

=== [BoW] ===
Accuracy: 0.9953229398663697
              precision    recall  f1-score   support

           0       1.00      0.99      1.00      4650
           1       0.99      1.00      1.00      4330

    accuracy                           1.00      8980
   macro avg       1.00      1.00      1.00      8980
weighted avg       1.00      1.00      1.00      8980
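
As a rough illustration of what the BoW matrix contains, here is a minimal standard-library sketch on two invented documents. It is not how `CountVectorizer` is implemented; it only mirrors the idea of one column per vocabulary term (in sorted order) and one raw count per cell. The names `docs`, `vocab`, and `bow_vector` are hypothetical.

```python
from collections import Counter

# Two made-up, already-cleaned documents
docs = ["trump election vote", "election fraud claim election"]

# Vocabulary: sorted unique terms across all documents, one column per term
vocab = sorted({term for doc in docs for term in doc.split()})

def bow_vector(doc, vocab):
    """Count how often each vocabulary term occurs in a single document."""
    counts = Counter(doc.split())
    return [counts[term] for term in vocab]

matrix = [bow_vector(doc, vocab) for doc in docs]

print(vocab)
for row in matrix:
    print(row)
```

Each row is one document; word order is discarded, so only term frequencies reach the classifier.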



**TF-IDF (Term Frequency-Inverse Document Frequency)**

In [None]:
# Reuse df and df['clean'] from the BoW step
vectorizer = TfidfVectorizer()
X_tfidf = vectorizer.fit_transform(df['clean'])
y = df['label']

# Split & Train
X_train, X_test, y_train, y_test = train_test_split(X_tfidf, y, test_size=0.2, random_state=42)
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Evaluation
print("=== [TF-IDF] ===")
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))


=== [TF-IDF] ===
Accuracy: 0.9865256124721603
              precision    recall  f1-score   support

           0       0.99      0.98      0.99      4650
           1       0.98      0.99      0.99      4330

    accuracy                           0.99      8980
   macro avg       0.99      0.99      0.99      8980
weighted avg       0.99      0.99      0.99      8980
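
The sketch below shows, on two invented documents, how TF-IDF reweights raw counts. It assumes the scikit-learn defaults (`smooth_idf=True`, `norm='l2'`), i.e. idf(t) = ln((1 + n) / (1 + df(t))) + 1 followed by L2 normalization of each row; the variable and function names are hypothetical and the code is a pure-Python approximation, not the library's implementation.

```python
import math
from collections import Counter

# Two made-up, already-cleaned documents
docs = ["trump election vote", "election fraud claim election"]
vocab = sorted({term for doc in docs for term in doc.split()})
n_docs = len(docs)

# Document frequency: number of documents that contain each term
doc_freq = Counter(term for doc in docs for term in set(doc.split()))

# Smoothed inverse document frequency (assumed scikit-learn default formula)
idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}

def tfidf_vector(doc):
    """Raw term count times idf, then scaled to unit L2 length."""
    counts = Counter(doc.split())
    raw = [counts[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(x * x for x in raw))
    return [x / norm for x in raw]

for doc in docs:
    print([round(x, 3) for x in tfidf_vector(doc)])
```

A term that appears in every document (`election` here) gets the minimum idf of 1, while rarer terms are weighted up, so the classifier sees relative rather than absolute frequencies.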



**BERT (Bidirectional Encoder Representations from Transformers)**

In [None]:
# Remove only HTML tags; no other cleaning is applied for BERT
def clean_for_bert(text):
    return re.sub(r'<.*?>', ' ', text).strip()

df['clean_bert'] = df['text'].apply(clean_for_bert)

# BERT tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

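# Limit the number of documents for efficiency
MAX_BERT = 100  # or 200 if enough memory is available

# Draw a random sample; with two large classes both are very likely to appear,
# although random sampling alone does not guarantee it
df_sampled = df.sample(n=MAX_BERT, random_state=42)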

texts = df_sampled['clean_bert'].tolist()
labels = df_sampled['label'].tolist()

# Tokenization
inputs = tokenizer(texts, return_tensors='pt', padding=True, truncation=True, max_length=512)

# Extract CLS embeddings
with torch.no_grad():
    outputs = model(**inputs)
    X_bert = outputs.last_hidden_state[:, 0, :].numpy()

# Train/Test split (LogisticRegression is already imported at the top of the notebook)
X_train, X_test, y_train, y_test = train_test_split(X_bert, labels, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Evaluation
print("=== [BERT] ===")
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))



=== [BERT] ===
Accuracy: 0.95
              precision    recall  f1-score   support

           0       1.00      0.93      0.96        14
           1       0.86      1.00      0.92         6

    accuracy                           0.95        20
   macro avg       0.93      0.96      0.94        20
weighted avg       0.96      0.95      0.95        20



### **Indexing with Inverted Index**

In [None]:
# Inverted index
if 'clean' not in df.columns:
    print("Column 'clean' not found; run the text preprocessing cell first.")
else:
    # Build the inverted index: each word maps to the set of IDs of documents containing it
    inverted_index = {}
    for doc_id, text in enumerate(df['clean']):
        words = text.split()
        for word in words:
            if word not in inverted_index:
                inverted_index[word] = set()
            inverted_index[word].add(doc_id)

    # Simple boolean AND search: returns the sorted IDs of documents containing every query word
    def search(query, index):
        query_words = query.lower().split()
        if not query_words:
            return []  # Empty query, no results

        # Start with the document set of the first word
        result_docs = index.get(query_words[0], set())

        # Intersect with the document set of each remaining query word
        for word in query_words[1:]:
            if word not in index:
                # A query word missing from the index means no document can match
                return []
            result_docs = result_docs.intersection(index[word])
        return sorted(result_docs)

    # --- Using the index and the search function ---

    # Example: search for the word "election"
    query_example = "election"
    results_election = search(query_example, inverted_index)
    print(f"Documents containing the word '{query_example}': {results_election[:10]}...")  # Show the first 10 results

    # Example: search for documents containing both "donald" and "trump" (not necessarily adjacent)
    query_example_phrase = "donald trump"
    results_phrase = search(query_example_phrase, inverted_index)
    print(f"Documents containing all words of '{query_example_phrase}': {results_phrase[:10]}...")  # Show the first 10 results

    # Example: search for a word that is not in the index
    query_example_not_found = "xyzabc"
    results_not_found = search(query_example_not_found, inverted_index)
    print(f"Documents containing the word '{query_example_not_found}': {results_not_found}")

Documents containing the word 'election': [0, 2, 3, 5, 6, 9, 10, 15, 16, 18]...
Documents containing all words of 'donald trump': [0, 1, 3, 4, 5, 6, 7, 8, 10, 11]...
Documents containing the word 'xyzabc': []
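
The `search` function above intersects document sets, so a two-word query matches any document that contains both words, whether or not they are adjacent. If true phrase matching is wanted, one common option is a positional index that also stores where each word occurs. The following is a minimal sketch on invented documents; `build_positional_index` and `phrase_search` are hypothetical helpers, not part of the notebook.

```python
from collections import defaultdict

def build_positional_index(docs):
    """Map each word to {doc_id: [positions at which the word occurs]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in enumerate(docs):
        for pos, word in enumerate(text.split()):
            index[word][doc_id].append(pos)
    return index

def phrase_search(phrase, index):
    """Return sorted IDs of documents in which the words appear consecutively."""
    words = phrase.lower().split()
    if not words or any(w not in index for w in words):
        return []
    # Candidate documents must contain every word of the phrase
    candidates = set(index[words[0]])
    for w in words[1:]:
        candidates &= set(index[w])
    hits = []
    for doc_id in sorted(candidates):
        # The phrase matches if word k is found at position start + k for every k
        for start in index[words[0]][doc_id]:
            if all(start + k in index[w][doc_id] for k, w in enumerate(words)):
                hits.append(doc_id)
                break
    return hits

toy_docs = [
    "donald trump won election",
    "trump met donald tusk",
    "president donald trump spoke",
]
toy_index = build_positional_index(toy_docs)
print(phrase_search("donald trump", toy_index))
```

All three toy documents contain both words, but only the first and third contain them side by side, which is the distinction a set-based index cannot express.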


### **Analysis of the Text Representation Results**

**1. BoW (Bag of Words)**

| Label               | Precision | Recall | F1-Score | Interpretation                                   |
| ------------------- | --------- | ------ | -------- | ------------------------------------------------ |
| 0 (Fake)            | 1.00      | 0.99   | 1.00     | Nearly all fake news is detected                 |
| 1 (True)            | 0.99      | 1.00   | 1.00     | Nearly all real news is recognized correctly     |
| **Accuracy**: 0.995 |           |        |          |                                                  |


Conclusion: BoW is highly accurate on this split, with precision and recall balanced across both classes. The real articles in the sample rows shown earlier all open with a "(Reuters)" dateline, so part of this score may come from source cues rather than content; the result should be read as performance on this dataset, not on fake-news detection in general.

**2. TF-IDF**

| Label               | Precision | Recall | F1-Score | Interpretation                                      |
| ------------------- | --------- | ------ | -------- | --------------------------------------------------- |
| 0 (Fake)            | 0.99      | 0.98   | 0.99     | About 2% of fake news goes undetected               |
| 1 (True)            | 0.98      | 0.99   | 0.99     | About 1% of real news is misclassified as fake      |
| **Accuracy**: 0.986 |           |        |          |                                                     |


Conclusion: TF-IDF scores slightly below BoW on this split (0.987 vs. 0.995) but remains highly reliable. Generalization beyond this dataset was not tested.

**3. BERT**

| Label                                           | Precision | Recall | F1-Score | Interpretation                                                                                   |
| ----------------------------------------------- | --------- | ------ | -------- | ------------------------------------------------------------------------------------------------ |
| 0 (Fake)                                        | 1.00      | 0.93   | 0.96     | Every article predicted as fake is fake, but about 7% of fake articles (1 of 14) go undetected   |
| 1 (True)                                        | 0.86      | 1.00   | 0.92     | All real articles are recognized, but 14% of articles predicted as real (1 of 7) are in fact fake |
| **Accuracy**: 0.95 (from only 20 test articles) |           |        |          |                                                                                                  |

Both rows describe the same single error: one fake article predicted as real.
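
With only 20 test articles, the rounded precision and recall values can be turned back into article counts, which makes the table easier to read. The sketch below does that arithmetic for the BERT report above; it assumes the binary setting of this notebook and uses only the reported support and recall figures.

```python
# Values read from the BERT classification report in this notebook
support = {"fake": 14, "true": 6}
recall = {"fake": 0.93, "true": 1.00}

# Correctly classified articles per class = recall * support, rounded to whole articles
correct = {label: round(recall[label] * support[label]) for label in support}
missed = {label: support[label] - correct[label] for label in support}

# In a binary task, every missed fake article is predicted as true (and vice versa)
predicted_true = correct["true"] + missed["fake"]
predicted_fake = correct["fake"] + missed["true"]

precision_true = correct["true"] / predicted_true
precision_fake = correct["fake"] / predicted_fake
accuracy = sum(correct.values()) / sum(support.values())

print("missed:", missed)
print("precision (true):", round(precision_true, 2))
print("precision (fake):", round(precision_fake, 2))
print("accuracy:", accuracy)
```

The reconstructed counts reproduce the reported precision of 0.86 for class 1 and the accuracy of 0.95, and they show that a single misclassified article accounts for every deviation from a perfect score.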


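**🏆 Final Ranking by Accuracy & Consistency**

| Rank | Model      | Accuracy               | Strengths                                                           |
| ---- | ---------- | ---------------------- | ------------------------------------------------------------------- |
| 🥇 1 | **BoW**    | 99.53%                 | Simple, fast, and accurate for classification                       |
| 🥈 2 | **TF-IDF** | 98.65%                 | Nearly on par with BoW; well suited to classic IR                   |
| 🥉 3 | **BERT**   | 95% (20 test articles) | Suited to semantic retrieval, but needs more data and tuning        |

Note that BoW and TF-IDF were evaluated on 8,980 test articles from the full dataset, whereas BERT was evaluated on 20 test articles from a 100-article sample, so the BERT figure is not directly comparable with the other two.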
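
Because the three accuracies come from test sets of very different sizes, it can help to attach an uncertainty range to each one. The sketch below computes 95% Wilson score intervals; the correct-prediction counts are derived here from the reported accuracies and supports (for example 0.9953 × 8980 ≈ 8938), and `wilson_interval` is a hypothetical helper rather than part of the notebook.

```python
import math

def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95% coverage)."""
    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half, center + half

# (correct predictions, test-set size), derived from the reported accuracy and support
results = {
    "BoW": (8938, 8980),
    "TF-IDF": (8859, 8980),
    "BERT": (19, 20),
}

intervals = {name: wilson_interval(c, n) for name, (c, n) in results.items()}
for name, (low, high) in intervals.items():
    print(f"{name:>7}: {low:.3f} to {high:.3f}")
```

The BERT interval spans roughly 0.76 to 0.99, far wider than the intervals for the two sparse representations, so the ranking says more about the size of the BERT test set than about the representation itself.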
