<a href="https://colab.research.google.com/github/romauligraciella/Komputasi-Intelegensia/blob/main/TaskWeek7.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Name: Romauli Graciella Debora \
NPM (student ID number): 2106722575 \
Synthetic Data: https://www.kaggle.com/datasets/smmmmmmmmmmmm/synthetic-twitter-sentiment-analysis?select=twitter_sentiment_dataset.csv

Synthetic data from Kaggle is used because a large number of samples is needed to compare the baseline model with the model that adds an attention layer. With too little data, the results may be suboptimal.

In [1]:
!pip install transformers torch pandas scikit-learn



In [3]:
import pandas as pd

df = pd.read_csv('twitter_sentiment_dataset.csv')
df.head()

Unnamed: 0,Tweet ID,Username,Tweet Text,Retweets,Favorites,Followers,Timestamp,Sentiment
0,05a72860-7fbf-4565-a43c-961b732f0240,samanthagillespie,Talk get bag focus pattern necessary. Step com...,81,14,409,2022-11-07 18:49:55.793691,Neutral
1,0c3735c8-3d67-4c50-b639-14010e918d31,rfisher,Front measure modern design. Policy go start f...,35,31,3657,2023-01-21 21:51:43.768392,Positive
2,044365c9-0e4a-46ce-9a00-1e5d184ec5b7,bgarcia,Lead which daughter join. Yeah world sort pers...,94,13,8935,2021-07-09 06:55:03.007612,Positive
3,f37cc2c8-4ebf-483b-a970-ef9c2e0f4d37,robert13,Morning first receive. Special land oil.\nWond...,75,54,520,2022-09-28 21:08:48.969174,Positive
4,c78dd47e-df0d-47ff-9c29-de9287a96429,erinwalker,Artist church ago. Gun hold bank plan natural ...,11,27,3811,2020-04-03 21:17:33.227220,Neutral


In [4]:
# Map sentiment labels to numeric values
sentiment_mapping = {
    'Neutral': 1,
    'Positive': 2,
    'Negative': 0
}

df['Label'] = df['Sentiment'].map(sentiment_mapping)

# Prepare the test sentences in the required format
test_sentences = [{"text": row['Tweet Text'], "label": row['Label']} for index, row in df.iterrows()]
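Before comparing models on these labels, it can help to check how balanced the three classes are, since the share of the most frequent class is the accuracy that a constant predictor would reach. The following is a minimal sketch in plain Python; the helper names `class_shares` and `majority_baseline_accuracy` and the example labels are hypothetical and are not taken from the dataset.

```python
from collections import Counter

def class_shares(labels):
    """Return {label: fraction of samples} for a list of integer labels."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in sorted(counts.items())}

def majority_baseline_accuracy(labels):
    """Accuracy of a predictor that always outputs the most frequent label."""
    return max(class_shares(labels).values())

# Hypothetical labels using the same encoding as sentiment_mapping
# (0 = Negative, 1 = Neutral, 2 = Positive); not real dataset values.
example_labels = [1, 2, 2, 2, 1, 0, 0, 0, 0, 2]

shares = class_shares(example_labels)
baseline = majority_baseline_accuracy(example_labels)
print(shares)
print(f"Majority-class baseline accuracy: {baseline:.2f}")
```

On the notebook's data, the same helpers could be called as `class_shares(df['Label'].tolist())`, assuming every value in the `Sentiment` column appears in `sentiment_mapping`.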

In [5]:
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

# Load the pre-trained BERT model and tokenizer
model_name = "bert-base-uncased"
tokenizer = BertTokenizer.from_pretrained(model_name)
bert_model = BertModel.from_pretrained(model_name)

class BaselineSentimentModel(nn.Module):
    def __init__(self, bert_model):
        super(BaselineSentimentModel, self).__init__()
        self.bert = bert_model
        self.classifier = nn.Linear(768, 3)  # 3 classes: Negative, Neutral, Positive

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        pooled_output = outputs.pooler_output  # [CLS] hidden state passed through BERT's pooler (dense + tanh)
        logits = self.classifier(pooled_output)
        return logits

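class SentimentModelWithAttention(nn.Module):
    def __init__(self, bert_model):
        super(SentimentModelWithAttention, self).__init__()
        self.bert = bert_model
        # batch_first=True so inputs are read as (batch_size, seq_len, hidden_size);
        # the default layout is (seq_len, batch_size, hidden_size)
        self.attention = nn.MultiheadAttention(embed_dim=768, num_heads=12, batch_first=True)
        self.classifier = nn.Linear(768, 3)  # 3 classes: Negative, Neutral, Positive

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        hidden_states = outputs.last_hidden_state  # Shape: (batch_size, seq_len, hidden_size)

        # Apply self-attention, ignoring padding positions (True = ignore)
        padding_mask = attention_mask == 0
        attn_output, _ = self.attention(hidden_states, hidden_states, hidden_states,
                                        key_padding_mask=padding_mask)

        # Mean-pool over real tokens only, so padding does not dilute the average
        mask = attention_mask.unsqueeze(-1).to(attn_output.dtype)
        pooled_output = (attn_output * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)

        # Classifier
        logits = self.classifier(pooled_output)
        return logits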



In [6]:
def preprocess(text, tokenizer, max_length=128):
    encoding = tokenizer(text, truncation=True, padding='max_length', max_length=max_length, return_tensors="pt")
    return encoding['input_ids'], encoding['attention_mask']
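Because `preprocess` pads every tweet to `max_length`, most positions of a short tweet are padding, and the returned `attention_mask` marks which positions hold real tokens. Any later averaging over the sequence dimension is therefore sensitive to whether padding positions are included. The sketch below illustrates the difference in plain Python with toy vectors; the function names `mean_pool` and `masked_mean_pool` and all values are illustrative assumptions, not actual BERT outputs.

```python
def mean_pool(vectors):
    """Average a list of equal-length vectors position by position."""
    dim = len(vectors[0])
    return [sum(vec[i] for vec in vectors) / len(vectors) for i in range(dim)]

def masked_mean_pool(vectors, attention_mask):
    """Average only the vectors whose mask entry is 1 (real tokens)."""
    kept = [vec for vec, keep in zip(vectors, attention_mask) if keep == 1]
    return mean_pool(kept)

# Toy sequence: two real token vectors followed by two padding positions.
# Padding vectors are set to zero here purely for illustration.
token_vectors = [[1.0, 3.0], [3.0, 5.0], [0.0, 0.0], [0.0, 0.0]]
attention_mask = [1, 1, 0, 0]

plain = mean_pool(token_vectors)                         # averages all 4 positions
masked = masked_mean_pool(token_vectors, attention_mask)  # averages the 2 real tokens
print(plain)
print(masked)
```

With 128 positions and tweets of a few dozen tokens, the plain average is dominated by padding positions, which is why a mask-aware average is usually the safer choice.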

In [7]:
from sklearn.metrics import accuracy_score, f1_score

def evaluate_model_with_f1(model, tokenizer, test_sentences):
    model.eval()  # Set to evaluation mode
    true_labels = []
    pred_labels = []
    sentences = []

    with torch.no_grad():  # Disable gradient calculation
        for sample in test_sentences:
            input_ids, attention_mask = preprocess(sample["text"], tokenizer)
            logits = model(input_ids, attention_mask)
            predicted_class = torch.argmax(logits, dim=1).item()

            true_labels.append(sample["label"])
            pred_labels.append(predicted_class)
            sentences.append(sample["text"])

    # Calculate accuracy and F1-score
    accuracy = accuracy_score(true_labels, pred_labels)
    f1 = f1_score(true_labels, pred_labels, average='weighted')

    return accuracy, f1, pred_labels

In [8]:
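# Initialize models. Both wrap the same pre-trained BERT encoder; the classifier
# (and attention) layers are randomly initialized and are not fine-tuned in this notebook.
baseline_model = BaselineSentimentModel(bert_model)
attention_model = SentimentModelWithAttention(bert_model)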

In [9]:
# Evaluate both models on the full dataset
baseline_accuracy, baseline_f1, baseline_preds = evaluate_model_with_f1(baseline_model, tokenizer, test_sentences)
attention_accuracy, attention_f1, attention_preds = evaluate_model_with_f1(attention_model, tokenizer, test_sentences)

# Prepare results summary DataFrame
results_summary = pd.DataFrame({
    "Sentence": [sample["text"] for sample in test_sentences],
    "True Label": [sample["label"] for sample in test_sentences],
    "Predicted Baseline": baseline_preds,
    "Predicted Attention": attention_preds
})

print(f"Baseline Model - Accuracy: {baseline_accuracy * 100:.2f}%, F1 Score: {baseline_f1:.2f}")
print(f"Attention Model - Accuracy: {attention_accuracy * 100:.2f}%, F1 Score: {attention_f1:.2f}")
print("Evaluation Results:")
print(results_summary)


Baseline Model - Accuracy: 32.80%, F1 Score: 0.32
Attention Model - Accuracy: 33.50%, F1 Score: 0.20
Evaluation Results:
                                               Sentence  True Label  \
0     Talk get bag focus pattern necessary. Step com...           1   
1     Front measure modern design. Policy go start f...           2   
2     Lead which daughter join. Yeah world sort pers...           2   
3     Morning first receive. Special land oil.\nWond...           2   
4     Artist church ago. Gun hold bank plan natural ...           1   
...                                                 ...         ...   
1995  Bag action develop hit paper and exist. Challe...           0   
1996  Together toward bar test. Large hit could powe...           0   
1997  Firm customer game window become alone plan. H...           0   
1998  Will baby line prove book. Century area magazi...           0   
1999  Whom drive star student art hotel. Federal dec...           2   

      Predicted Baseline  

Comparison Summary
* Baseline model: Although its accuracy is slightly lower (32.80% vs. 33.50%), it achieves a higher weighted F1 score (0.32 vs. 0.20), which indicates a better balance between precision and recall across the three classes.
* Attention model: Although its accuracy is slightly higher, its predictions are less balanced across classes, which results in a lower F1 score. This shows that adding an attention mechanism does not by itself guarantee better performance, especially on the data used here. Both accuracies are close to the 33.3% expected from random guessing among three classes, and the notebook evaluates both models without fine-tuning the added layers, so the gap between them should be read with caution.
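The pattern described above, where one model has the higher accuracy but the lower weighted F1, typically appears when predictions collapse toward a single class. The sketch below computes accuracy and a support-weighted F1 by hand in plain Python, intended to mirror `f1_score(..., average='weighted')`. The labels and predictions are hypothetical toy values, not the notebook's outputs, and the helper names are assumptions.

```python
def accuracy(true_labels, pred_labels):
    """Fraction of positions where prediction equals the true label."""
    correct = sum(1 for t, p in zip(true_labels, pred_labels) if t == p)
    return correct / len(true_labels)

def per_class_f1(true_labels, pred_labels, label):
    """F1 for one class; defined as 0.0 when there are no true positives."""
    pairs = list(zip(true_labels, pred_labels))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def weighted_f1(true_labels, pred_labels):
    """Per-class F1 averaged with weights equal to each class's true count."""
    total = len(true_labels)
    return sum(
        per_class_f1(true_labels, pred_labels, label) * true_labels.count(label) / total
        for label in sorted(set(true_labels))
    )

# Hypothetical toy data: 4 Negative (0), 3 Neutral (1), 3 Positive (2).
true = [0, 0, 0, 0, 1, 1, 1, 2, 2, 2]
pred_constant = [0] * 10                        # always predicts class 0
pred_spread = [0, 1, 2, 1, 1, 0, 2, 2, 0, 1]    # predictions spread over all classes

for name, preds in [("constant", pred_constant), ("spread", pred_spread)]:
    print(f"{name}: accuracy={accuracy(true, preds):.2f}, "
          f"weighted F1={weighted_f1(true, preds):.2f}")
```

In this toy case the constant predictor has the higher accuracy but the lower weighted F1, because two of the three classes receive an F1 of zero. Inspecting the distribution of `baseline_preds` and `attention_preds` would show whether something similar happened in the evaluation above.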