<a href="https://colab.research.google.com/github/yudahendriawan/google-colab-projects/blob/general/portfolio_sentiment_analysis_deep_learning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<h1>Sentiment Analysis with Deep Learning Using IndoNLU Datasets</h1>

# Preparation

In [42]:
!git clone https://github.com/indobenchmark/indonlu

fatal: destination path 'indonlu' already exists and is not an empty directory.


Let's look at the repository we just cloned. The data is located in indonlu > dataset > smsa_doc-sentiment-prosa. That folder contains several files, including train_preprocess.tsv (the training data we will use), valid_preprocess.tsv (the validation data we will use), and test_preprocess_masked_label.tsv (the test data, whose labels are masked).
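A quick way to sanity-check one of these files before building any PyTorch objects is to count its labels. The sketch below uses only the standard library and assumes each row has the form text<TAB>label with no header row; that layout and the helper name `label_counts` are assumptions for illustration, so adjust them if the files differ.

```python
import csv
from collections import Counter


def label_counts(tsv_path):
    """Count labels in a TSV file, assuming rows of the form text<TAB>label."""
    counts = Counter()
    with open(tsv_path, encoding="utf-8", newline="") as f:
        # QUOTE_NONE: treat quote characters inside the review text as plain text.
        for row in csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE):
            if len(row) >= 2:  # skip blank or malformed lines
                counts[row[-1].strip()] += 1
    return counts


# Example call (path assumes the Colab layout used in this notebook):
# label_counts('/content/indonlu/dataset/smsa_doc-sentiment-prosa/train_preprocess.tsv')
```

A heavily skewed count would be a reason to look at per-class metrics rather than accuracy alone.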

In [43]:
import random
import numpy as np
import pandas as pd
import torch
from torch import optim
import torch.nn.functional as F
from tqdm import tqdm

import warnings
warnings.filterwarnings('ignore')

from transformers import BertForSequenceClassification, BertConfig, BertTokenizer
from nltk.tokenize import TweetTokenizer

from indonlu.utils.forward_fn import forward_sequence_classification
from indonlu.utils.metrics import document_sentiment_metrics_fn
from indonlu.utils.data_utils import DocumentSentimentDataset, DocumentSentimentDataLoader

In [44]:
# Set and assign random seeds.
def set_seed(seed):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)

# Count the number of parameters in a model
def count_param(module, trainable=False):
    if trainable:
        return sum(p.numel() for p in module.parameters() if p.requires_grad)
    else:
        return sum(p.numel() for p in module.parameters())

# Get the current learning rate from the optimizer
def get_lr(optimizer):
    for param_group in optimizer.param_groups:
        return param_group['lr']

# Convert metrics into a string
def metrics_to_string(metric_dict):
    string_list = []
    for key,value in metric_dict.items():
        string_list.append('{}:{:.2f}'.format(key,value))
    return ' '.join(string_list)

In [45]:
set_seed(19072021)

Setting the random seed makes the run reproducible: the same seed yields the same weight initialization and data shuffling, so repeated training runs give (nearly) the same results. The seed value itself is arbitrary; to keep things simple, we use the date on which the code was run.
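To see what seeding does, the sketch below reseeds Python's `random` module and checks that the same numbers come out again; the NumPy and PyTorch generators behave analogously. On a GPU, seeding alone may not make training bit-for-bit identical, because some CUDA kernels are non-deterministic. The helper name `draw` is illustrative.

```python
import random


def draw(seed, n=3):
    """Seed Python's RNG, then draw n pseudo-random floats."""
    random.seed(seed)
    return [random.random() for _ in range(n)]


first = draw(19072021)
second = draw(19072021)  # same seed -> same sequence
other = draw(12345)      # different seed -> a different sequence in practice

print(first == second)   # True
print(first == other)
```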

# Load Pre-trained Model and Configuration

At this stage, we use the pre-trained indobert-base-p1 model, which has about 124 million parameters.

IndoBERT is built on BERT (Bidirectional Encoder Representations from Transformers), a general-purpose architecture. BERT was designed to help computers understand the meaning of ambiguous language in text. The trick is to use the surrounding text, on both sides of each token, to build context.

In [46]:
# load tokenizer and config
tokenizer = BertTokenizer.from_pretrained('indobenchmark/indobert-base-p1')
config = BertConfig.from_pretrained('indobenchmark/indobert-base-p1')
config.num_labels = DocumentSentimentDataset.NUM_LABELS

#instantiate model
model = BertForSequenceClassification.from_pretrained('indobenchmark/indobert-base-p1', config=config)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at indobenchmark/indobert-base-p1 and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [47]:
model

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(50000, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12,

In [48]:
count_param(model)

124443651
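This total can be reproduced by hand from the layer shapes printed above. The sketch below assumes the standard BERT-base feed-forward width of 3072 and a 768 x 768 pooler layer, neither of which is visible in the truncated printout, plus a classifier head with one output per sentiment label. Under those assumptions the arithmetic lands on the same number.

```python
hidden, vocab, max_pos, type_vocab = 768, 50000, 512, 2
ffn = 3072      # assumption: standard BERT-base intermediate size
n_layers = 12
n_labels = 3    # positive, neutral, negative


def linear(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


layer_norm = 2 * hidden  # scale and shift vectors

embeddings = (vocab + max_pos + type_vocab) * hidden + layer_norm
attention = 4 * linear(hidden, hidden) + layer_norm  # query, key, value, output
feed_forward = linear(hidden, ffn) + linear(ffn, hidden) + layer_norm
encoder = n_layers * (attention + feed_forward)
pooler = linear(hidden, hidden)                      # assumption: BERT pooler layer
classifier = linear(hidden, n_labels)

total = embeddings + encoder + pooler + classifier
print(total)  # 124443651
```

Note that the word embeddings alone account for 38.4 million of these parameters.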

# Dataset Preparation

In [49]:
train_dataset_path = '/content/indonlu/dataset/smsa_doc-sentiment-prosa/train_preprocess.tsv'
valid_dataset_path = '/content/indonlu/dataset/smsa_doc-sentiment-prosa/valid_preprocess.tsv'
test_dataset_path = '/content/indonlu/dataset/smsa_doc-sentiment-prosa/test_preprocess_masked_label.tsv'

After specifying the dataset locations, we need to prepare the data with PyTorch, which provides a standardized way to load and batch data before modeling.

Here, we use two classes from the torch.utils.data module: Dataset and DataLoader. Adapted from Pre-Trained Models for NLP Tasks Using PyTorch: Dataset is an abstract class that we extend to describe how individual samples are read, while DataLoader is the core of PyTorch's data-processing toolkit. DataLoader provides batching, various sampling methods, parallel loading, and support for distributed processing. IndoNLU's DocumentSentimentDataset and DocumentSentimentDataLoader build on these two classes, so we wrap each Dataset object in a DataLoader for batch processing.
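The batching step is where variable-length token sequences become rectangular arrays. As a rough plain-Python illustration (not the IndoNLU implementation, and with hypothetical names `collate` and `pad_id`), a collate function pads every sequence in a batch to a common length and records which positions hold real tokens:

```python
def collate(batch, max_seq_len=512, pad_id=0):
    """Pad (token_ids, label) pairs to one length and build an attention mask."""
    longest = min(max_seq_len, max(len(ids) for ids, _ in batch))
    padded, mask, labels = [], [], []
    for ids, label in batch:
        ids = ids[:longest]                          # truncate overly long sequences
        n_pad = longest - len(ids)
        padded.append(ids + [pad_id] * n_pad)        # right-pad with the padding id
        mask.append([1] * len(ids) + [0] * n_pad)    # 1 = real token, 0 = padding
        labels.append(label)
    return padded, mask, labels


# Toy batch: two short token-id sequences with their label indices.
batch = [([2, 6540, 92, 3], 0), ([2, 899, 3], 2)]
padded, mask, labels = collate(batch)
print(padded)  # [[2, 6540, 92, 3], [2, 899, 3, 0]]
print(mask)    # [[1, 1, 1, 1], [1, 1, 1, 0]]
```

In a real DataLoader these lists would then be converted to tensors before being passed to the model.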

In [50]:
train_dataset = DocumentSentimentDataset(train_dataset_path, tokenizer, lowercase=True)
valid_dataset = DocumentSentimentDataset(valid_dataset_path, tokenizer, lowercase=True)
test_dataset = DocumentSentimentDataset(test_dataset_path, tokenizer, lowercase=True)

train_loader = DocumentSentimentDataLoader(dataset=train_dataset, max_seq_len=512,
                                           batch_size=32, num_workers=16, shuffle=True)
# Keep validation and test data in file order so saved predictions line up with the input rows.
valid_loader = DocumentSentimentDataLoader(dataset=valid_dataset, max_seq_len=512,
                                           batch_size=32, num_workers=16, shuffle=False)
test_loader = DocumentSentimentDataLoader(dataset=test_dataset, max_seq_len=512,
                                          batch_size=32, num_workers=16, shuffle=False)

In [51]:
print(train_dataset[0])

(array([    2,  6540,    92,  2970,   213,  4259,  3553,   899,    34,
         259,  5590,   262,  2558,   386,   899,  1687,    26,  1574,
       30470,   899,  3310, 30468, 22130, 30360,  6123,  6368, 30468,
       22130, 30360,  2652,  1746, 30468,  8869,  6540,    34,  6315,
        1622,  1256,  8949,   899, 30468,  4222,  1622,   752,   245,
         295,  2083, 30470,  2346,  7107,   300, 30470,   405,   724,
        5189, 30470,   843, 17464,   899,   540, 10989,  3331,  1107,
       30468,   119,  3221,    79,    34,  2170,    98,  9167, 30457,
           3]), array(0), 'warung ini dimiliki oleh pengusaha pabrik tahu yang sudah puluhan tahun terkenal membuat tahu putih di bandung . tahu berkualitas , dipadu keahlian memasak , dipadu kretivitas , jadilah warung yang menyajikan menu utama berbahan tahu , ditambah menu umum lain seperti ayam . semuanya selera indonesia . harga cukup terjangkau . jangan lewatkan tahu bletoka nya , tidak kalah dengan yang asli dari tegal !')


Next, we store DocumentSentimentDataset.LABEL2INDEX and DocumentSentimentDataset.INDEX2LABEL in the variables w2i and i2w, which map label names to indices and back.

In [52]:
w2i, i2w = DocumentSentimentDataset.LABEL2INDEX, DocumentSentimentDataset.INDEX2LABEL
print(w2i)
print(i2w)

{'positive': 0, 'neutral': 1, 'negative': 2}
{0: 'positive', 1: 'neutral', 2: 'negative'}


# Evaluate Model with Example

In [53]:
text = 'Bahagia hatiku melihat pernikahan putri sulungku yang cantik jelita'
subwords = tokenizer.encode(text)
subwords = torch.LongTensor(subwords).view(1, -1).to(model.device)

logits = model(subwords)[0]
label = torch.topk(logits, k=1, dim=-1)[1].squeeze().item()

print(f'Text: {text} | Label : {i2w[label]} ({F.softmax(logits, dim=-1).squeeze()[label] * 100:.3f}%)')

Text: Bahagia hatiku melihat pernikahan putri sulungku yang cantik jelita | Label : positive (39.380%)


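In the evaluation above, the model assigns only about 39% probability to the positive label. With three classes, that is barely better than the 33% expected from guessing, which is unsurprising because the classification head is newly initialized and has not been trained yet. A human reader would judge this sentence to be positive with well over 50%, even over 90%, confidence. Therefore, let's carry out the fine-tuning and evaluation process.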
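The label and percentage printed above come from taking the arg-max of the logits and applying softmax. The sketch below does the same in plain Python on made-up logits (the values are illustrative, not taken from the model) to show that nearly equal logits yield a confidence close to one third:

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


i2w = {0: 'positive', 1: 'neutral', 2: 'negative'}
logits = [0.20, 0.05, -0.10]  # hypothetical, nearly uniform scores
probs = softmax(logits)
label = max(range(len(probs)), key=probs.__getitem__)  # index of the largest probability

print(f'{i2w[label]} ({probs[label] * 100:.1f}%)')
```

After fine-tuning, the logits for a clearly positive sentence should be far apart, which pushes the top probability toward 100%.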

# Fine Tuning and Evaluation

In [54]:
optimizer = optim.Adam(model.parameters(), lr=3e-6)
model = model.cuda()

Next, we train the model for 5 epochs. Each epoch consists of the following steps:

1. Model training and updating process
2. Training metrics calculation
3. Evaluation on validation data
4. Validation metrics calculation
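For steps 2 and 4, the training log reports ACC, F1, REC and PRE. The sketch below shows how such numbers can be computed from lists of predicted and true labels, assuming macro averaging over the classes. That is an assumption about `document_sentiment_metrics_fn`, whose source is not shown here, and the function name `macro_metrics` is hypothetical.

```python
def macro_metrics(list_hyp, list_label):
    """Accuracy plus macro-averaged precision, recall and F1."""
    pairs = list(zip(list_hyp, list_label))
    classes = sorted(set(list_label) | set(list_hyp))
    precisions, recalls, f1s = [], [], []
    for c in classes:
        tp = sum(h == c and y == c for h, y in pairs)  # correctly predicted as c
        fp = sum(h == c and y != c for h, y in pairs)  # wrongly predicted as c
        fn = sum(h != c and y == c for h, y in pairs)  # missed instances of c
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f)
    n = len(classes)
    acc = sum(h == y for h, y in pairs) / len(pairs)
    return {'ACC': acc, 'F1': sum(f1s) / n, 'REC': sum(recalls) / n, 'PRE': sum(precisions) / n}


hyp = ['positive', 'negative', 'neutral', 'positive']
label = ['positive', 'negative', 'positive', 'positive']
print(macro_metrics(hyp, label))
```

Macro averaging weights every class equally, so a rarely predicted class can pull F1 well below accuracy, as in this toy example.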

In [55]:
# train
n_epochs = 5
for epoch in range(n_epochs):
    model.train()
    torch.set_grad_enabled(True)

    total_train_loss = 0
    list_hyp, list_label = [], []

    train_pbar = tqdm(train_loader, leave=True, total=len(train_loader))
    for i, batch_data in enumerate(train_pbar):
        #forward model
        loss, batch_hyp, batch_label = forward_sequence_classification(model, batch_data[:-1], i2w=i2w, device='cuda')

        #update model
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        tr_loss = loss.item()
        total_train_loss = total_train_loss + tr_loss

        #calculate metrics
        list_hyp += batch_hyp
        list_label += batch_label

        train_pbar.set_description("(Epoch {}) TRAIN LOSS:{:.4f} LR:{:.8f}".format((epoch+1),
            total_train_loss/(i+1), get_lr(optimizer)))

    #calculate train metric
    metrics = document_sentiment_metrics_fn(list_hyp, list_label)
    print("(Epoch {}) TRAIN LOSS:{:.4f} {} LR:{:.8f}".format((epoch+1),
        total_train_loss/(i+1), metrics_to_string(metrics), get_lr(optimizer)))

    #evaluate on validation
    model.eval()
    torch.set_grad_enabled(False)

    total_loss, total_correct, total_labels = 0,0,0
    list_hyp, list_label = [],[]

    pbar = tqdm(valid_loader, leave=True, total=len(valid_loader))
    for i, batch_data in enumerate(pbar):
        batch_seq = batch_data[-1]
        loss, batch_hyp, batch_label = forward_sequence_classification(model, batch_data[:-1], i2w=i2w, device='cuda')

        #calculate total loss
        valid_loss = loss.item()
        total_loss = total_loss + valid_loss

        #calculate evaluation metrics
        list_hyp += batch_hyp
        list_label += batch_label
        metrics = document_sentiment_metrics_fn(list_hyp, list_label)

        pbar.set_description("VALID LOSS:{:.4f} {}".format(total_loss/(i+1), metrics_to_string(metrics)))

    metrics = document_sentiment_metrics_fn(list_hyp, list_label)
    print("(Epoch {}) VALID LOSS:{:.4f} {}".format((epoch+1),
        total_loss/(i+1), metrics_to_string(metrics)))



(Epoch 1) TRAIN LOSS:0.3480 LR:0.00000300: 100%|██████████| 344/344 [02:28<00:00,  2.31it/s]


(Epoch 1) TRAIN LOSS:0.3480 ACC:0.87 F1:0.82 REC:0.79 PRE:0.86 LR:0.00000300


VALID LOSS:0.1947 ACC:0.93 F1:0.90 REC:0.89 PRE:0.90: 100%|██████████| 40/40 [00:07<00:00,  5.33it/s]


(Epoch 1) VALID LOSS:0.1947 ACC:0.93 F1:0.90 REC:0.89 PRE:0.90


(Epoch 2) TRAIN LOSS:0.1581 LR:0.00000300: 100%|██████████| 344/344 [02:31<00:00,  2.26it/s]


(Epoch 2) TRAIN LOSS:0.1581 ACC:0.95 F1:0.93 REC:0.93 PRE:0.93 LR:0.00000300


VALID LOSS:0.1780 ACC:0.93 F1:0.90 REC:0.91 PRE:0.90: 100%|██████████| 40/40 [00:08<00:00,  4.99it/s]


(Epoch 2) VALID LOSS:0.1780 ACC:0.93 F1:0.90 REC:0.91 PRE:0.90


(Epoch 3) TRAIN LOSS:0.1184 LR:0.00000300: 100%|██████████| 344/344 [02:32<00:00,  2.25it/s]


(Epoch 3) TRAIN LOSS:0.1184 ACC:0.96 F1:0.95 REC:0.95 PRE:0.95 LR:0.00000300


VALID LOSS:0.1662 ACC:0.94 F1:0.91 REC:0.90 PRE:0.92: 100%|██████████| 40/40 [00:07<00:00,  5.16it/s]


(Epoch 3) VALID LOSS:0.1662 ACC:0.94 F1:0.91 REC:0.90 PRE:0.92


(Epoch 4) TRAIN LOSS:0.0881 LR:0.00000300: 100%|██████████| 344/344 [02:33<00:00,  2.24it/s]


(Epoch 4) TRAIN LOSS:0.0881 ACC:0.97 F1:0.97 REC:0.96 PRE:0.97 LR:0.00000300


VALID LOSS:0.1800 ACC:0.93 F1:0.91 REC:0.90 PRE:0.92: 100%|██████████| 40/40 [00:07<00:00,  5.04it/s]


(Epoch 4) VALID LOSS:0.1800 ACC:0.93 F1:0.91 REC:0.90 PRE:0.92


(Epoch 5) TRAIN LOSS:0.0656 LR:0.00000300: 100%|██████████| 344/344 [02:33<00:00,  2.24it/s]


(Epoch 5) TRAIN LOSS:0.0656 ACC:0.98 F1:0.98 REC:0.97 PRE:0.98 LR:0.00000300


VALID LOSS:0.2070 ACC:0.93 F1:0.89 REC:0.88 PRE:0.92: 100%|██████████| 40/40 [00:07<00:00,  5.13it/s]

(Epoch 5) VALID LOSS:0.2070 ACC:0.93 F1:0.89 REC:0.88 PRE:0.92
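
In the log above, training loss keeps falling while validation loss is lowest at epoch 3 and then rises, a common sign of overfitting. One way to act on that is to track the best validation loss and keep the weights from that epoch. The sketch below replays the logged validation losses; the `patience` value and the function name are illustrative, and in a real loop you would save a checkpoint (for example with `torch.save(model.state_dict(), path)`) wherever the best epoch is updated.

```python
def best_epoch(valid_losses, patience=2):
    """Return (best_epoch, stop_epoch) using simple early stopping on validation loss."""
    best, best_ep, bad = float('inf'), 0, 0
    for ep, loss in enumerate(valid_losses, start=1):
        if loss < best:
            best, best_ep, bad = loss, ep, 0  # improvement: a checkpoint would be saved here
        else:
            bad += 1                          # no improvement this epoch
            if bad >= patience:
                return best_ep, ep            # stop after `patience` epochs without improvement
    return best_ep, len(valid_losses)


logged = [0.1947, 0.1780, 0.1662, 0.1800, 0.2070]  # VALID LOSS per epoch, copied from the log
print(best_epoch(logged))  # (3, 5)
```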





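Next, we run the fine-tuned model on the test data. The labels in this file are masked, so we only generate and save predictions instead of computing metrics.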

In [56]:
# evaluate on test
model.eval()
torch.set_grad_enabled(False)

total_loss, total_correct, total_labels = 0,0,0
list_hyp, list_label = [],[]

pbar = tqdm(test_loader, leave=True, total=len(test_loader))
for i, batch_data in enumerate(pbar):
    _, batch_hyp, _ = forward_sequence_classification(model, batch_data[:-1], i2w=i2w, device='cuda')
    list_hyp += batch_hyp

# save predictions
df = pd.DataFrame({
    'label':list_hyp
}).reset_index()

df.to_csv('pred.txt', index=False)

print(df)



100%|██████████| 16/16 [00:02<00:00,  5.43it/s]

     index     label
0        0  negative
1        1  positive
2        2  negative
3        3  negative
4        4  positive
..     ...       ...
495    495  negative
496    496  negative
497    497  positive
498    498  negative
499    499  negative

[500 rows x 2 columns]





# Prediction

In [57]:
def prediction(text):
    subwords = tokenizer.encode(text)
    subwords = torch.LongTensor(subwords).view(1, -1).to(model.device)

    logits = model(subwords)[0]
    label = torch.topk(logits, k=1, dim=-1)[1].squeeze().item()

    print(f'Text: {text} | Label : {i2w[label]} ({F.softmax(logits, dim=-1).squeeze()[label] * 100:.3f}%)')

In [58]:
text1 = 'Bahagia hatiku melihat pernikahan putri sulungku yang cantik jelita'
prediction(text1)

Text: Bahagia hatiku melihat pernikahan putri sulungku yang cantik jelita | Label : positive (99.592%)


In [59]:
text2 = "Ronaldo pergi ke Mall Grand Indonesia membeli cilok"
prediction(text2)

Text: Ronaldo pergi ke Mall Grand Indonesia membeli cilok | Label : neutral (98.332%)


In [60]:
text3 = "Sayang, aku marah"
prediction(text3)

Text: Sayang, aku marah | Label : negative (99.763%)


In [61]:
text4 = "Merasa kagum dengan toko ini tapi berubah menjadi kecewa setelah transaksi"
prediction(text4)

Text: Merasa kagum dengan toko ini tapi berubah menjadi kecewa setelah transaksi | Label : negative (99.697%)


The sentence in text3 was deliberately written to trick the model. The word "sayang" (darling) usually has a positive connotation, but in this sentence it is followed by the word "marah" (angry).

Likewise, text4 contains both "kagum" (amazed) and "kecewa" (disappointed), which have positive and negative connotations respectively. In context, both sentences clearly express negative sentiment, which a human reader recognizes immediately, and the fine-tuned model also labels both as negative with high confidence.

In [63]:
text5 = "Aku padahal bangga banget sama kamu, tapi aku kecewa kenapa kamu memilih jalan yang salah pada akhirnya"
prediction(text5)

Text: Aku padahal bangga banget sama kamu, tapi aku kecewa kenapa kamu memilih jalan yang salah pada akhirnya | Label : negative (99.759%)
