# Problem description
In the questions of the **Text Classification** and **POS tagging** sections, we are given a small dataset of two text strings and their corresponding labels, defined in the following Python code:

```python
corpus = [
    "you will get the low score",
    "more study more lucky come to you"
]
```
Data preprocessing, vocabulary construction, and embedding are visualized in the following figure:

![image](https://firebasestorage.googleapis.com/v0/b/aivn-images.appspot.com/o/public%2F2025%2F3%2F2%2F1740886293065-image.png?alt=media&token=2a367e86-1eee-461e-b9ee-2fde92df5a42)

## POS tagging
The goal of this problem is to build a Part-of-speech Tagging model with 4 classes (0: noun/pronoun, 1: verb, 2: others, 3: padding), using the baseline shown in the following figure:
![image](https://firebasestorage.googleapis.com/v0/b/aivn-images.appspot.com/o/public%2F2025%2F3%2F2%2F1740887028471-image.png?alt=media&token=ef0404e3-0df3-464c-8778-d2121d9187d1)

All the information needed is in the description above; read it carefully and answer the following questions:

## POS tagging - Linear

In [None]:
!pip install -U torchtext==0.17.0


## Dataset

In [None]:
import torch
import torch.nn as nn
# import torchtext; torchtext.disable_torchtext_deprecation_warning()
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

corpus = [
    "you will get the low score",
    "more study more lucky come to you"
]
data_size = len(corpus)

# 0: noun/pronoun, 1: verb, 2: others (3 is reserved for padding)
labels = [
    [0, 1, 1, 2, 2, 0],
    [2, 0, 2, 2, 1, 2, 0]
]

# Define the max vocabulary size and sequence length
vocab_size = 12
sequence_length = 6

In [None]:
# Define tokenizer function
tokenizer = get_tokenizer('basic_english')

# Create a function to yield list of tokens
def yield_tokens(examples):
    for text in examples:
        yield tokenizer(text)

# Create vocabulary
vocab = build_vocab_from_iterator(yield_tokens(corpus),
                                  max_tokens=vocab_size,
                                  specials=["<unk>", "<pad>"])
vocab.set_default_index(vocab["<unk>"])
vocab.get_stoi()

{'to': 11,
 'the': 10,
 'study': 9,
 'score': 8,
 'lucky': 7,
 'low': 6,
 'get': 5,
 'come': 4,
 'more': 2,
 '<pad>': 1,
 'you': 3,
 '<unk>': 0}
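The mapping above contains no entry for `will`: the corpus has 11 distinct tokens, but `max_tokens=12` leaves room for only 10 of them once the two special tokens are counted. The following is a minimal pure-Python sketch of that selection, assuming tokens are ranked by descending frequency with alphabetical tie-breaking (an assumption that is consistent with the indices printed above, not a reimplementation of torchtext). The whitespace tokenizer is a simplification that happens to be sufficient for this corpus.

```python
from collections import Counter

corpus = [
    "you will get the low score",
    "more study more lucky come to you",
]
specials = ["<unk>", "<pad>"]
max_tokens = 12

# Simplified tokenizer (assumption): lowercase and split on whitespace
counts = Counter(tok for text in corpus for tok in text.lower().split())

# Assumed ranking: most frequent first, ties broken alphabetically
ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))

budget = max_tokens - len(specials)          # slots left for real tokens
kept = [tok for tok, _ in ranked[:budget]]
dropped = [tok for tok, _ in ranked[budget:]]

stoi = {tok: idx for idx, tok in enumerate(specials + kept)}
print(stoi)
print("dropped:", dropped)   # these tokens fall back to <unk> (index 0)
```

Because `will` is dropped, it is looked up through the default index and becomes `0` in the first sentence vector.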

In [None]:
# Tokenize and numericalize your samples
def vectorize(text, vocab, sequence_length, sequence_label):
    tokens = tokenizer(text)

    token_ids = [vocab[token] for token in tokens][:sequence_length]
    token_ids = token_ids + [vocab["<pad>"]] * (sequence_length - len(tokens))
    sequence_label = sequence_label + [3] * (sequence_length - len(tokens))
    sequence_label = sequence_label[:sequence_length]

    return torch.tensor(token_ids, dtype=torch.long), torch.tensor(sequence_label, dtype=torch.long)

# Vectorize the samples
sentence_vecs = []
label_vecs = []
for sentence, sentence_labels in zip(corpus, labels):  # avoid rebinding the global `labels`
    sentence_vec, labels_vec = vectorize(sentence, vocab, sequence_length, sentence_labels)
    sentence_vecs.append(sentence_vec)
    label_vecs.append(labels_vec)

In [None]:
for v in sentence_vecs:
    print(v)

tensor([ 3,  0,  5, 10,  6,  8])
tensor([ 2,  9,  2,  7,  4, 11])


In [None]:
for v in label_vecs:
    print(v)

tensor([0, 1, 1, 2, 2, 0])
tensor([2, 0, 2, 2, 1, 2])
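Note that the second sentence has 7 tokens, so with `sequence_length = 6` its last token (`you`) and the corresponding label are cut off, while a shorter sentence would be padded with label `3`. A small sketch of that pad-or-truncate step in plain Python follows; the helper name `pad_or_truncate` is hypothetical and not part of the notebook.

```python
PAD_LABEL = 3  # class index used for padding positions

def pad_or_truncate(seq, length, pad_value):
    """Return a copy of seq with exactly `length` items."""
    padded = list(seq) + [pad_value] * max(0, length - len(seq))
    return padded[:length]

labels_2 = [2, 0, 2, 2, 1, 2, 0]             # 7 labels, one per token
print(pad_or_truncate(labels_2, 6, PAD_LABEL))   # last label is cut off
print(pad_or_truncate([0, 1], 6, PAD_LABEL))     # short input is padded
```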


# Model

In [None]:
class POS_Model(nn.Module):
    def __init__(self, vocab_size, num_classes):
        super().__init__()
        # Custom embedding layer
        self.embedding = nn.Embedding(vocab_size, 2)
        custom_embedding_weight = torch.tensor([
            [ 0.26, -1.31],
            [ 0.72,  0.43],
            [-0.67,  0.61],
            [ 0.50,  0.50],
            [-0.26, -0.10],
            [ 1.29,  1.25],
            [ 1.95,  1.18],
            [-1.44, -1.89],
            [-0.20,  0.88],
            [-0.39,  1.07],
            [ 0.32, -0.05],
            [ 0.59, -0.98]
        ])
        self.embedding.weight = nn.Parameter(custom_embedding_weight)
        print("Embedding weights:")
        print(self.embedding.weight)

        # Custom fully connected layer
        self.fc = nn.Linear(2, num_classes)
        custom_fc_weight = torch.tensor([
            [ 0.3792,  0.4146],
            [ 0.4638, -0.0273],
            [-0.2622,  0.2486],
            [ 0.5454, -0.3664],
        ])
        self.fc.weight = nn.Parameter(custom_fc_weight)
        custom_fc_bias = torch.tensor([-0.62,  0.37,  0.57, -0.48])
        self.fc.bias = nn.Parameter(custom_fc_bias)

        print("FC weights:")
        print(self.fc.weight)
        print("FC bias:")
        print(self.fc.bias)

    def forward(self, x):
        print(f"Input shape: {x.shape}")
        x = self.embedding(x)
        print(f"After embedding shape: {x.shape}")
        x = self.fc(x)
        print(f"After FC shape: {x.shape}")
        print(x)
        x = x.permute(0, 2, 1)
        print(f"After permute shape: {x.shape}")
        return x

model = POS_Model(vocab_size, 4)

Embedding weights:
Parameter containing:
tensor([[ 0.2600, -1.3100],
        [ 0.7200,  0.4300],
        [-0.6700,  0.6100],
        [ 0.5000,  0.5000],
        [-0.2600, -0.1000],
        [ 1.2900,  1.2500],
        [ 1.9500,  1.1800],
        [-1.4400, -1.8900],
        [-0.2000,  0.8800],
        [-0.3900,  1.0700],
        [ 0.3200, -0.0500],
        [ 0.5900, -0.9800]], requires_grad=True)
FC weights:
Parameter containing:
tensor([[ 0.3792,  0.4146],
        [ 0.4638, -0.0273],
        [-0.2622,  0.2486],
        [ 0.5454, -0.3664]], requires_grad=True)
FC bias:
Parameter containing:
tensor([-0.6200,  0.3700,  0.5700, -0.4800], requires_grad=True)


# Forward input 1

In [None]:
input_1 = torch.tensor([[3, 0, 5, 10, 6, 8]], dtype=torch.long)
label_1 = label_vecs[0]

output = model(input_1)

Input shape: torch.Size([1, 6])
After embedding shape: torch.Size([1, 6, 2])
After FC shape: torch.Size([1, 6, 4])
tensor([[[-0.2231,  0.5883,  0.5632, -0.3905],
         [-1.0645,  0.5264,  0.1762,  0.1418],
         [ 0.3874,  0.9342,  0.5425, -0.2344],
         [-0.5194,  0.5198,  0.4737, -0.2872],
         [ 0.6087,  1.2422,  0.3521,  0.1512],
         [-0.3310,  0.2532,  0.8412, -0.9115]]], grad_fn=<ViewBackward0>)
After permute shape: torch.Size([1, 4, 6])
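The logits printed above can be reproduced by hand, since each token goes through an embedding lookup followed by the affine map `W @ e + b`. The sketch below does this in plain Python with the weights copied from the model definition; the function name `forward_token` is hypothetical, and small differences in the last digit are possible because PyTorch computes in float32.

```python
# Weights copied from the POS_Model definition
embedding = [
    [0.26, -1.31], [0.72, 0.43], [-0.67, 0.61], [0.50, 0.50],
    [-0.26, -0.10], [1.29, 1.25], [1.95, 1.18], [-1.44, -1.89],
    [-0.20, 0.88], [-0.39, 1.07], [0.32, -0.05], [0.59, -0.98],
]
fc_weight = [
    [0.3792, 0.4146],
    [0.4638, -0.0273],
    [-0.2622, 0.2486],
    [0.5454, -0.3664],
]
fc_bias = [-0.62, 0.37, 0.57, -0.48]

def forward_token(token_id):
    """Embedding lookup, then one logit per class: w . e + b."""
    e = embedding[token_id]
    return [sum(w * v for w, v in zip(row, e)) + b
            for row, b in zip(fc_weight, fc_bias)]

input_1 = [3, 0, 5, 10, 6, 8]
logits_1 = [forward_token(t) for t in input_1]   # shape (seq_len, num_classes)
for row in logits_1:
    print([round(v, 4) for v in row])
```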


## M08POS01
### Question
What shape must the **Output** in the baseline figure have?  
A.
```
(1, 4, 6)
```
B.
```
(1, 6, 4)
```
C.
```
(1, 3, 6)
```
D.
```
(1, 6, 3)
```
### Answer
A  
*Explanation*: Normally the output has shape (batch_size, seq_len, num_classes), but because it is passed to nn.CrossEntropyLoss(), it must be permuted to (batch_size, num_classes, seq_len).
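As a rough illustration of this shape requirement, the sketch below feeds the sample 1 logits to `nn.CrossEntropyLoss` in the permuted layout `(batch_size, num_classes, seq_len)` and compares the result with the equivalent flattened layout `(batch_size * seq_len, num_classes)`. The variable names are illustrative, and the default `mean` reduction is assumed.

```python
import torch
import torch.nn as nn

# Logits for sample 1, shape (batch_size=1, seq_len=6, num_classes=4)
logits = torch.tensor([[[-0.2231,  0.5883,  0.5632, -0.3905],
                        [-1.0645,  0.5264,  0.1762,  0.1418],
                        [ 0.3874,  0.9342,  0.5425, -0.2344],
                        [-0.5194,  0.5198,  0.4737, -0.2872],
                        [ 0.6087,  1.2422,  0.3521,  0.1512],
                        [-0.3310,  0.2532,  0.8412, -0.9115]]])
targets = torch.tensor([[0, 1, 1, 2, 2, 0]])      # shape (1, 6)

criterion = nn.CrossEntropyLoss()

# Class dimension must come second: (1, 4, 6) against targets (1, 6)
loss_permuted = criterion(logits.permute(0, 2, 1), targets)

# Equivalent alternative: flatten tokens into the batch dimension
loss_flat = criterion(logits.reshape(-1, 4), targets.reshape(-1))

print(loss_permuted.item(), loss_flat.item())
```

Both calls average the per-token loss over the same six positions, so they should agree up to floating-point error.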

# Softmax output

In [4]:
import torch
x = torch.tensor([[-0.2231,  0.5883,  0.5632, -0.3905],
         [-1.0645,  0.5264,  0.1762,  0.1418],
         [ 0.3874,  0.9342,  0.5425, -0.2344],
         [-0.5194,  0.5198,  0.4737, -0.2872],
         [ 0.6087,  1.2422,  0.3521,  0.1512],
         [-0.3310,  0.2532,  0.8412, -0.9115]])

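import numpy as np
def softmax(x):
    # Normalize each row (token) separately so its class probabilities sum to 1
    exp_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return exp_x / np.sum(exp_x, axis=-1, keepdims=True)

rs = softmax(x.detach().numpy())
result = [int(np.argmax(row)) for row in rs]
print(result)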

[1, 1, 1, 1, 1, 2]


## M08POS02
### Question
After feeding sample 1 into the model, running the forward pass, and applying softmax, what is the model's prediction vector?  
A.
```
[1,1,2,2,1,1]
```
B.
```
[0,1,0,2,2,2]
```
C.
```
[0,1,2,2,1,1]
```
D.
```
[1,1,1,1,1,2]
```
### Answer
D
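To see where this answer comes from, the following sketch applies a softmax to each token's logits (one row per token, taken from the sample 1 forward pass) and picks the most probable class. It uses only the standard library; the helper `softmax_row` is an illustrative name.

```python
import math

# Logits for sample 1, one row per token
logits = [
    [-0.2231, 0.5883, 0.5632, -0.3905],
    [-1.0645, 0.5264, 0.1762, 0.1418],
    [0.3874, 0.9342, 0.5425, -0.2344],
    [-0.5194, 0.5198, 0.4737, -0.2872],
    [0.6087, 1.2422, 0.3521, 0.1512],
    [-0.3310, 0.2532, 0.8412, -0.9115],
]

def softmax_row(row):
    """Softmax over one token's class logits (shifted by the max for stability)."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

probs = [softmax_row(row) for row in logits]
predictions = [p.index(max(p)) for p in probs]
print(predictions)  # → [1, 1, 1, 1, 1, 2]
```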

# Forward input 2

In [None]:
input_2 = torch.tensor([[2, 9, 2, 7, 4, 11]], dtype=torch.long)
label_2 = label_vecs[1]

output = model(input_2)

Input shape: torch.Size([1, 6])
After embedding shape: torch.Size([1, 6, 2])
After FC shape: torch.Size([1, 6, 4])
tensor([[[-0.6212,  0.0426,  0.8973, -1.0689],
         [-0.3243,  0.1599,  0.9383, -1.0848],
         [-0.6212,  0.0426,  0.8973, -1.0689],
         [-1.9496, -0.2463,  0.4777, -0.5729],
         [-0.7601,  0.2521,  0.6133, -0.5852],
         [-0.8026,  0.6704,  0.1717,  0.2009]]], grad_fn=<ViewBackward0>)
After permute shape: torch.Size([1, 4, 6])
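Since softmax preserves the ordering of the logits within a row, the predicted class for each token of sample 2 can be read directly from the argmax of the logits above. A short sketch, with the logits copied from the output and the labels taken from the truncated `label_vecs[1]`:

```python
# Logits for sample 2, one row per token
logits_2 = [
    [-0.6212, 0.0426, 0.8973, -1.0689],
    [-0.3243, 0.1599, 0.9383, -1.0848],
    [-0.6212, 0.0426, 0.8973, -1.0689],
    [-1.9496, -0.2463, 0.4777, -0.5729],
    [-0.7601, 0.2521, 0.6133, -0.5852],
    [-0.8026, 0.6704, 0.1717, 0.2009],
]
label_2 = [2, 0, 2, 2, 1, 2]

# Argmax over the class dimension for every token
predictions_2 = [row.index(max(row)) for row in logits_2]
correct = sum(p == y for p, y in zip(predictions_2, label_2))

print(predictions_2)  # → [2, 2, 2, 2, 2, 1]
print(f"{correct}/{len(label_2)} tokens match the labels")
```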
