# Neural Bag-of-Words Classifiers

Lecture 2 | CMU ANLP Spring 2025 | Instructor: Sean Welleck

This is a notebook for [CMU CS11-711 Advanced NLP](https://cmu-l3.github.io/anlp-spring2025/) that trains neural network classifiers. Specifically, each model uses a bag-of-words variant to encode an input sequence into a continuous vector that is mapped to a probability distribution over the output classes. The model is trained to minimize cross-entropy loss using backpropagation.

*Ackowledgements*: adapted from Graham Neubig's ANLP Fall 2025 [code](https://github.com/neubig/anlp-code/tree/main/02-textclass)

### Tweet classification

We use the [`mteb/tweet_sentiment_extraction`](https://huggingface.co/datasets/mteb/tweet_sentiment_extraction) dataset, which consists of classifying an input tweet as positive, neutral, or negative sentiment.

In [1]:
!head -n 4 train.jsonl

{"id":"cb774db0d1","text":" I`d have responded, if I were going","label":1,"label_text":"neutral"}
{"id":"549e992a42","text":" Sooo SAD I will miss you here in San Diego!!!","label":0,"label_text":"negative"}
{"id":"088c60f138","text":"my boss is bullying me...","label":0,"label_text":"negative"}
{"id":"9642c003ef","text":" what interview! leave me alone","label":0,"label_text":"negative"}


In [2]:
!tail -n 4 train.jsonl

{"id":"4f4c4fc327","text":" I`ve wondered about rake to.  The client has made it clear .NET only, don`t force devs to learn a new lang  #agile #ccnet","label":0,"label_text":"negative"}
{"id":"f67aae2310","text":" Yay good for both of you. Enjoy the break - you probably need it after such hectic weekend  Take care hun xxxx","label":2,"label_text":"positive"}
{"id":"ed167662a5","text":" But it was worth it  ****.","label":2,"label_text":"positive"}
{"id":"6f7127d9d7","text":"   All this flirting going on - The ATG smiles. Yay.  ((hugs))","label":1,"label_text":"neutral"}

#### Train a tokenizer

Based on the examples above, splitting on whitespace isn't a great idea. Let's learn a BPE vocabulary using `sentencepiece`.

In [3]:
import sentencepiece as spm
import json

with open("bow_tokenizer_txt.txt", "w", encoding="utf-8") as f:
    with open('train.jsonl', "r") as f2:
        for line in f2:
            j = json.loads(line)
            words = j['text']
            f.write(words + "\n")

import os

options = dict(
  input="bow_tokenizer_txt.txt",
  input_format="text",
  model_prefix="bow_tok", 
  model_type="bpe",
  vocab_size=2048,
  byte_fallback=True,
  num_threads=os.cpu_count()
)

spm.SentencePieceTrainer.train(**options);

sentencepiece_trainer.cc(78) LOG(INFO) Starts training with : 
trainer_spec {
  input: bow_tokenizer_txt.txt
  input_format: text
  model_prefix: bow_tok
  model_type: BPE
  vocab_size: 2048
  self_test_sample_size: 0
  character_coverage: 0.9995
  input_sentence_size: 0
  shuffle_input_sentence: 1
  seed_sentencepiece_size: 1000000
  shrinking_factor: 0.75
  max_sentence_length: 4192
  num_threads: 16
  num_sub_iterations: 2
  max_sentencepiece_length: 16
  split_by_unicode_script: 1
  split_by_number: 1
  split_by_whitespace: 1
  split_digits: 0
  pretokenization_delimiter: 
  treat_whitespace_as_suffix: 0
  allow_whitespace_only_pieces: 0
  required_chars: 
  byte_fallback: 1
  vocabulary_output_piece_score: 1
  train_extremely_large_corpus: 0
  seed_sentencepieces_file: 
  hard_vocab_limit: 1
  use_all_vocab: 0
  unk_id: 0
  bos_id: 1
  eos_id: 2
  pad_id: -1
  unk_piece: <unk>
  bos_piece: <s>
  eos_piece: </s>
  pad_piece: <pad>
  unk_surface:  ⁇ 
  enable_differential_privacy: 0

In [4]:
sp = spm.SentencePieceProcessor()
sp.load('bow_tok.model')
vocab = [[sp.id_to_piece(idx), idx] for idx in range(sp.get_piece_size())]
vocab[1000:1020]

[['get', 1000],
 ['▁gl', 1001],
 ['▁away', 1002],
 ['eeee', 1003],
 ['▁left', 1004],
 ['▁mothers', 1005],
 ['?!', 1006],
 ['ily', 1007],
 ['oke', 1008],
 ['url', 1009],
 ['▁late', 1010],
 ['ire', 1011],
 ['hes', 1012],
 ['ner', 1013],
 ['▁Hope', 1014],
 ['▁Twitter', 1015],
 ['▁sha', 1016],
 ['▁bu', 1017],
 ['▁em', 1018],
 ['inking', 1019]]

In [5]:
#print(dir(sp))
print(sp.tokenize("My name is Reddy Vishnuvardhan Reddy Challapalli"))
#print(sp.get_piece_size())
#print(len(vocab))
words=[pp[0] for pp in vocab]
#print(words)
#print("Challapalli" in words,"   ",sp.get_piece_size("Challapalli"))
mytokens=sp.tokenize("My name is Reddy Vishnuvardhan Reddy Challapalli")
for tk in mytokens:
    print(sp.id_to_piece(tk))
print(sp.encode("My name is Reddy Vishnuvardhan Reddy Challapalli"))

[615, 1540, 325, 505, 305, 1452, 816, 498, 1969, 1975, 1988, 578, 406, 505, 305, 1452, 399, 269, 282, 663, 358, 1968]
▁My
▁name
▁is
▁R
ed
dy
▁V
ish
n
u
v
ard
han
▁R
ed
dy
▁C
ha
ll
ap
all
i
[615, 1540, 325, 505, 305, 1452, 816, 498, 1969, 1975, 1988, 578, 406, 505, 305, 1452, 399, 269, 282, 663, 358, 1968]


a_piece: <0xB6>
trainer_interface.cc(425) LOG(INFO) Adding meta_piece: <0xB7>
trainer_interface.cc(425) LOG(INFO) Adding meta_piece: <0xB8>
trainer_interface.cc(425) LOG(INFO) Adding meta_piece: <0xB9>
trainer_interface.cc(425) LOG(INFO) Adding meta_piece: <0xBA>
trainer_interface.cc(425) LOG(INFO) Adding meta_piece: <0xBB>
trainer_interface.cc(425) LOG(INFO) Adding meta_piece: <0xBC>
trainer_interface.cc(425) LOG(INFO) Adding meta_piece: <0xBD>
trainer_interface.cc(425) LOG(INFO) Adding meta_piece: <0xBE>
trainer_interface.cc(425) LOG(INFO) Adding meta_piece: <0xBF>
trainer_interface.cc(425) LOG(INFO) Adding meta_piece: <0xC0>
trainer_interface.cc(425) LOG(INFO) Adding meta_piece: <0xC1>
trainer_interface.cc(425) LOG(INFO) Adding meta_piece: <0xC2>
trainer_interface.cc(425) LOG(INFO) Adding meta_piece: <0xC3>
trainer_interface.cc(425) LOG(INFO) Adding meta_piece: <0xC4>
trainer_interface.cc(425) LOG(INFO) Adding meta_piece: <0xC5>
trainer_interface.cc(425) LOG(INFO) Adding meta_piece:

#### Data loading

Read in the data, tokenize it, and split it into a training and dev set. There is a separate test set on [HuggingFace](https://huggingface.co/datasets/mteb/tweet_sentiment_extraction).

In [6]:
from collections import defaultdict
import json
import random

random.seed(123)

label_to_text = {}
def read_dataset(filename):
    with open(filename, "r") as f:
        for line in f:
            j = json.loads(line)
            words = j['text']
            label = j['label']
            label_to_text[label] = j['label_text']
            tokens = sp.encode(words)
            yield (tokens, label)

# Read in the data
ds = list(read_dataset("train.jsonl"))
print(ds[1:3])
random.shuffle(ds)
#train = ds[:-1000]
#dev = ds[1000:]

train=ds[:1000]
test=ds[1000:]

nwords = len(sp)
ntags = 3

 size=520 all=20617 active=1309 piece=ree
bpe_model_trainer.cc(268) LOG(INFO) Added: freq=316 size=540 all=21039 active=1731 piece=ts
bpe_model_trainer.cc(268) LOG(INFO) Added: freq=299 size=560 all=21589 active=2281 piece=age
bpe_model_trainer.cc(268) LOG(INFO) Added: freq=287 size=580 all=21877 active=2569 piece=▁hate
bpe_model_trainer.cc(268) LOG(INFO) Added: freq=272 size=600 all=22316 active=3008 piece=rd
bpe_model_trainer.cc(159) LOG(INFO) Updating active symbols. max_freq=272 min_freq=78
bpe_model_trainer.cc(268) LOG(INFO) Added: freq=264 size=620 all=22803 active=1580 piece=▁?
bpe_model_trainer.cc(268) LOG(INFO) Added: freq=250 size=640 all=23175 active=1952 piece=▁We
bpe_model_trainer.cc(268) LOG(INFO) Added: freq=239 size=660 all=23423 active=2200 piece=ooo
bpe_model_trainer.cc(268) LOG(INFO) Added: freq=229 size=680 all=23839 active=2616 piece=▁Wh
bpe_model_trainer.cc(268) LOG(INFO) Added: freq=222 size=700 all=24054 active=2831 piece=self
bpe_model_trainer.cc(159) LOG(INFO)

[([332, 918, 332, 1994, 2006, 273, 512, 499, 301, 583, 312, 332, 296, 381, 394, 1883, 507], 0), ([309, 1639, 551, 325, 271, 1270, 272, 335, 321], 0)]


(INFO) Added: freq=101 size=1240 all=30683 active=1884 piece=▁these
bpe_model_trainer.cc(268) LOG(INFO) Added: freq=97 size=1260 all=30835 active=2036 piece=oring
bpe_model_trainer.cc(268) LOG(INFO) Added: freq=94 size=1280 all=31165 active=2366 piece=▁OMG
bpe_model_trainer.cc(268) LOG(INFO) Added: freq=92 size=1300 all=31297 active=2498 piece=IT
bpe_model_trainer.cc(159) LOG(INFO) Updating active symbols. max_freq=92 min_freq=34
bpe_model_trainer.cc(268) LOG(INFO) Added: freq=90 size=1320 all=31610 active=1845 piece=ower
bpe_model_trainer.cc(268) LOG(INFO) Added: freq=87 size=1340 all=31799 active=2034 piece=▁Aw
bpe_model_trainer.cc(268) LOG(INFO) Added: freq=86 size=1360 all=31982 active=2217 piece=▁shopping
bpe_model_trainer.cc(268) LOG(INFO) Added: freq=84 size=1380 all=32233 active=2468 piece=▁Pl
bpe_model_trainer.cc(268) LOG(INFO) Added: freq=83 size=1400 all=32395 active=2630 piece=▁hurt
bpe_model_trainer.cc(159) LOG(INFO) Updating active symbols. max_freq=83 min_freq=31
bpe_mod

### Model 1: Bag-of-Embeddings

Our simplest model simply sums together 3-dimensional word embeddings (3 dimensions since we have three classes).

First, for understanding purposes let's implement our own embedding layer.

To do so, we multiply a one-hot vector representation of a token with a suitably-sized weight matrix.

In [7]:
import torch

print(train[1][0][:5])
print(len(train[1][0]),"  ",len(train[0][0]))
print(nwords)

torch.nn.functional.one_hot(torch.tensor(train[0][0]), num_classes=nwords).shape
torch.nn.functional.one_hot(torch.tensor(train[1][0]), num_classes=nwords).shape

[402, 510, 953, 428, 413]
38    16
2048


torch.Size([38, 2048])

In [8]:
import torch.nn as nn

weight = nn.Parameter(torch.randn(nwords, 64))
weight.shape

torch.Size([2048, 64])

In [9]:
xs = torch.nn.functional.one_hot(torch.tensor(train[0][0]), num_classes=nwords)
print(xs.shape)

torch.matmul(xs.float(), weight).shape

torch.Size([16, 2048])


torch.Size([16, 64])

In [10]:
class Embedding(nn.Module):
    def __init__(self, vocab_size, emb_size):
        super(Embedding, self).__init__()
        self.weight = nn.Parameter(torch.randn(vocab_size, emb_size))
        self.vocab_size = vocab_size

        nn.init.xavier_uniform_(self.weight)
        
    def forward(self, x):
        xs = torch.nn.functional.one_hot(x, num_classes=self.vocab_size).float()
        return torch.matmul(xs, self.weight)

Now here is our simple bag-of-words model

In [11]:
class BoW(torch.nn.Module):
    def __init__(self, vocab_size, num_labels):
        super(BoW, self).__init__()
        self.embedding = Embedding(vocab_size, num_labels)
        nn.init.xavier_uniform_(self.embedding.weight)

    def forward(self, tokens):
        #print("TOKENS.........",tokens)
        emb = self.embedding(tokens)
        #print("Embedding shape.....",emb)
        out = torch.sum(emb, dim=0) 
        #print("Output shape.........",out)
        logits = out.view(1, -1) 
        #print("Logits shape.........",logits)
        return logits

Let's also implement cross-entropy loss ourselves this time:

In [12]:
def ce_loss(logits, target):
    log_probs = torch.nn.functional.log_softmax(logits, dim=1)
    loss = -log_probs[:, target]
    return loss

Here's a training loop.

We simply do "single batch" training here, i.e. loop over each training example one at a time and perform an update. We'll implement batching later on.

You can use the SGD (Stochastic Gradient Descent) optimizer that was introduced in class, or this typically better optimizer Adam (we'll see it in a later class).

In [13]:
import random
import time

# initialize the model
model = BoW(nwords, ntags)
criterion = ce_loss
# optimizer = torch.optim.SGD(model.parameters(), lr=5e-4)
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)

for ITER in range(5):
    # Perform training
    random.shuffle(train)
    train_loss = 0.0
    start = time.time()
    for x, y in train:
        x = torch.tensor(x, dtype=torch.long)
        y = torch.tensor([y])
        #print("IN TRAIN.......")
        logits = model(x)
        loss = criterion(logits, y)
        train_loss += loss.item()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    print("iter %r: train loss/sent=%.4f, time=%.2fs" % (
                ITER, train_loss/len(train), time.time()-start))
    # Perform validation
    test_correct = 0.0
    for x, y in test:
        x = torch.tensor(x, dtype=torch.long)
        #print("IN TEST.......")
        #print(model(x)[0].detach())
        logits = model(x)[0].detach()
        predict = logits.argmax().item()
        if predict == y:
            test_correct += 1
    print("iter %r: valid acc=%.4f" % (ITER, test_correct/len(test)))

iter 0: train loss/sent=1.0835, time=0.63s
iter 0: valid acc=0.4544
iter 1: train loss/sent=0.9576, time=0.64s
iter 1: valid acc=0.4880
iter 2: train loss/sent=0.8560, time=0.62s
iter 2: valid acc=0.5007
iter 3: train loss/sent=0.7742, time=0.62s
iter 3: valid acc=0.5077
iter 4: train loss/sent=0.7051, time=0.61s
iter 4: valid acc=0.5082


### Model 2: Bag-of-embeddings + output layer

This is what we called `CBoW` in the lecture. Take a look at the code to see how it differs from the previous model.

Also, it turns out to be important to initialize the weights well. We'll discuss this in a later class. Try removing the `nn.init` lines and see the performance change.

In [14]:
class CBoW(torch.nn.Module):
    def __init__(self, vocab_size, num_labels, emb_size):
        super(CBoW, self).__init__()
        self.embedding = nn.Embedding(vocab_size, emb_size)
        self.output_layer = nn.Linear(emb_size, num_labels)

        nn.init.xavier_uniform_(self.embedding.weight)
        nn.init.xavier_uniform_(self.output_layer.weight)

    def forward(self, tokens):
        #print("The tokens..........",tokens)
        emb = self.embedding(tokens)    # [len(tokens) x emb_size]
        #print("The emb output is..........",emb)
        emb_sum = torch.sum(emb, dim=0) # [emb_size]
        h = emb_sum.view(1, -1)         # [1 x emb_size]
        logits = self.output_layer(h)   # [1 x num_labels]
        return logits

In [15]:
EMB_SIZE=32
model = CBoW(nwords, ntags, EMB_SIZE)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)

for ITER in range(5):
    random.shuffle(train)
    train_loss = 0.0
    start = time.time()
    model.train()
    for x, y in train:
        x = torch.tensor(x, dtype=torch.long)
        y = torch.tensor([y])
        logits = model(x)
        loss = criterion(logits, y)
        train_loss += loss.item()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    print("iter %r: train loss/sent=%.4f, time=%.2fs" % (
                ITER, train_loss/len(train), time.time()-start))
    model.eval()
    # Perform testing
    test_correct = 0.0
    for x, y in test:
        x = torch.tensor(x, dtype=torch.long)
        logits = model(x)[0].detach()
        predict = logits.argmax().item()
        if predict == y:
            test_correct += 1
    print("iter %r: dev acc=%.4f" % (ITER, test_correct/len(test)))

iter 0: train loss/sent=1.0625, time=0.72s
iter 0: dev acc=0.4932
iter 1: train loss/sent=0.6967, time=0.91s
iter 1: dev acc=0.5147
iter 2: train loss/sent=0.4483, time=0.91s
iter 2: dev acc=0.5066
iter 3: train loss/sent=0.2670, time=0.91s
iter 3: dev acc=0.5064
iter 4: train loss/sent=0.1632, time=0.90s
iter 4: dev acc=0.5055


### Model 3: Deep CBoW

Now we introduce a nonlinear layer involving a tanh activation. 

In [16]:
class DeepCBoW(torch.nn.Module):
    def __init__(self, vocab_size, num_labels, emb_size, hid_size):
        super(DeepCBoW, self).__init__()
        self.embedding = nn.Embedding(vocab_size, emb_size)
        self.linear1 = nn.Linear(emb_size, hid_size)    
        self.output_layer = nn.Linear(hid_size, num_labels)

        nn.init.xavier_uniform_(self.embedding.weight)
        nn.init.xavier_uniform_(self.linear1.weight)     
        nn.init.xavier_uniform_(self.output_layer.weight)

    def forward(self, tokens):
        emb = self.embedding(tokens)
        emb_sum = torch.sum(emb, dim=0) 
        h = emb_sum.view(1, -1) 
        h = torch.tanh(self.linear1(h))  
        logits = self.output_layer(h)
        return logits

In [17]:
EMB_SIZE=32
model = DeepCBoW(nwords, ntags, EMB_SIZE, 32)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)

for EPOCH in range(10):
    random.shuffle(train)
    train_loss = 0.0
    start = time.time()
    model.train()
    for x, y in train:
        x = torch.tensor(x, dtype=torch.long)
        y = torch.tensor([y])
        logits = model(x)
        loss = criterion(logits, y)
        train_loss += loss.item()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    print("epoch %r: train loss/sent=%.4f, time=%.2fs" % (
                EPOCH, train_loss/len(train), time.time()-start))
    model.eval()
    # Perform testing
    test_correct = 0.0
    for x, y in test:
        x = torch.tensor(x, dtype=torch.long)
        logits = model(x)[0].detach()
        predict = logits.argmax().item()
        if predict == y:
            test_correct += 1
    print("iter %r: dev acc=%.4f" % (EPOCH, test_correct/len(test)))

epoch 0: train loss/sent=1.0615, time=0.95s
iter 0: dev acc=0.5027
epoch 1: train loss/sent=0.6741, time=0.97s
iter 1: dev acc=0.5205
epoch 2: train loss/sent=0.3286, time=0.94s
iter 2: dev acc=0.5068
epoch 3: train loss/sent=0.1459, time=0.93s
iter 3: dev acc=0.5109
epoch 4: train loss/sent=0.0567, time=0.93s
iter 4: dev acc=0.5051
epoch 5: train loss/sent=0.0191, time=0.92s
iter 5: dev acc=0.5066
epoch 6: train loss/sent=0.0071, time=0.93s
iter 6: dev acc=0.5055
epoch 7: train loss/sent=0.0029, time=0.93s
iter 7: dev acc=0.5078
epoch 8: train loss/sent=0.0013, time=0.93s
iter 8: dev acc=0.5058
epoch 9: train loss/sent=0.0007, time=0.93s
iter 9: dev acc=0.5067


Go deep learning!

Classify an example with our trained model

In [18]:
tweet = "I'm learning so much in advanced NLP!"
tokens = torch.tensor(sp.encode(tweet), dtype=torch.long)
logits = model(tokens)[0].detach()
predict = logits.argmax().item()
label_to_text[predict]


'positive'

### Suggested exercises

- Try changing the initialization of weights. Does the loss and/or dev accuracy change?
- Generalize the `DeepCBoW` implementation to take in a `num_layers` parameter. How does performance change as the number of layers is increased?
- Try different hyperparameters (e.g., learning rate, embedding size, hidden size, number of epochs). Can you identify any consistent trends?
- Try out different qualitative examples. Can you find patterns in how the model succeeds / fails?
- Implement batching by introducing a new `[PAD]` token. Make sure to mask out vectors for pad tokens in the model forward pass.

# Try changing the initialization of weights. Does the loss and/or dev accuracy change?




In [19]:
#First question try different weight initalizations....


class DeepCBoW_WithLayers(torch.nn.Module):
    def __init__(self, vocab_size, num_labels, emb_size, hid_size,num_layers):
        super(DeepCBoW_WithLayers, self).__init__()
        self.num_layers=num_layers
        self.embedding = nn.Embedding(vocab_size, emb_size)
        self.linear1 = nn.Linear(emb_size, hid_size)    
        self.output_layer = nn.Linear(hid_size, num_labels)
        self.linear_list=nn.ModuleList([nn.Linear(hid_size,hid_size) for _ in range(num_layers)])
        nn.init.xavier_normal_(self.embedding.weight)
        nn.init.xavier_normal_(self.linear1.weight)     
        nn.init.xavier_normal_(self.output_layer.weight)
        for i in range(self.num_layers):
            nn.init.xavier_normal_(self.linear_list[i].weight)

    def forward(self, tokens):
        emb = self.embedding(tokens)
        emb_sum = torch.sum(emb, dim=0) 
        h = emb_sum.view(1, -1) 
        h=self.linear1(h)
        for i in range(self.num_layers):
            h=self.linear_list[i](h)
        h = torch.tanh(h)  
        logits = self.output_layer(h)
        return logits



In [20]:
EMB_SIZE=32
model = DeepCBoW_WithLayers(nwords, ntags, EMB_SIZE, 32,10)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)

In [21]:
for EPOCH in range(10):
    random.shuffle(train)
    train_loss = 0.0
    start = time.time()
    model.train()
    for x, y in train:
        x = torch.tensor(x, dtype=torch.long)
        y = torch.tensor([y])
        logits = model(x)
        loss = criterion(logits, y)
        train_loss += loss.item()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    print("epoch %r: train loss/sent=%.4f, time=%.2fs" % (
                EPOCH, train_loss/len(train), time.time()-start))
    model.eval()
    # Perform testing
    test_correct = 0.0
    for x, y in test:
        x = torch.tensor(x, dtype=torch.long)
        logits = model(x)[0].detach()
        predict = logits.argmax().item()
        if predict == y:
            test_correct += 1
    print("iter %r: dev acc=%.4f" % (EPOCH, test_correct/len(test)))

epoch 0: train loss/sent=1.0897, time=2.25s
iter 0: dev acc=0.4015
epoch 1: train loss/sent=0.8040, time=2.33s
iter 1: dev acc=0.5041
epoch 2: train loss/sent=0.4514, time=2.28s
iter 2: dev acc=0.4982
epoch 3: train loss/sent=0.2620, time=2.26s
iter 3: dev acc=0.5056
epoch 4: train loss/sent=0.1702, time=2.26s
iter 4: dev acc=0.5081
epoch 5: train loss/sent=0.1521, time=2.28s
iter 5: dev acc=0.5164
epoch 6: train loss/sent=0.1422, time=3.08s
iter 6: dev acc=0.5064
epoch 7: train loss/sent=0.0996, time=3.14s
iter 7: dev acc=0.4950
epoch 8: train loss/sent=0.0681, time=3.16s
iter 8: dev acc=0.4917
epoch 9: train loss/sent=0.1146, time=3.16s
iter 9: dev acc=0.4896


In [22]:
tweet = "My name is Reddy Vishnuvardhan Reddy Challapalli and i am also called as Salar!!!!!"
tokens = torch.tensor(sp.encode(tweet), dtype=torch.long)
logits = model(tokens)[0].detach()
predict = logits.argmax().item()
label_to_text[predict]

'neutral'

In [23]:
tweet = "My name is Reddy Vishnuvardhan Reddy Challapalli and i am also called as the One and only one Violent Man!!!!!"
tokens = torch.tensor(sp.encode(tweet), dtype=torch.long)
logits = model(tokens)[0].detach()
predict = logits.argmax().item()
label_to_text[predict]

'negative'

### Trying to pad each sequence and mask out Vectors during the forward pass

In [24]:
#First question try different weight initalizations....
class DeepCBoW_WithLayers(torch.nn.Module):
    def __init__(self, vocab_size, num_labels, emb_size, hid_size,num_layers,padding_idx):
        super(DeepCBoW_WithLayers, self).__init__()
        self.num_layers=num_layers
        self.pad_idx=padding_idx
        self.embedding = nn.Embedding(vocab_size+3, emb_size, padding_idx=self.pad_idx)
        self.embedding.weight.data[self.pad_idx]=0 # This is padding index embedding which will not add any value....
        self.linear1 = nn.Linear(emb_size, hid_size)    
        self.output_layer = nn.Linear(hid_size, num_labels)
        self.linear_list=nn.ModuleList([nn.Linear(hid_size,hid_size) for _ in range(num_layers)])
        nn.init.xavier_normal_(self.embedding.weight)
        nn.init.xavier_normal_(self.linear1.weight)     
        nn.init.xavier_normal_(self.output_layer.weight)
        for i in range(self.num_layers):
            nn.init.xavier_normal_(self.linear_list[i].weight)

    def forward(self, tokens):
        #print("IMMEDIDATE EMBEEDING..........",self.embedding.weight.data[2049])
        #print("SILLY EMBEDDING.................",self.embedding(torch.tensor(2049)))
        #https://discuss.pytorch.org/t/restrict-backpropagation-in-specific-indices-in-nn-embedding/21448/3(The idea of nulling out [PAD] embedding while computing the output....
        vocab_size = self.embedding.weight.data.size(0)
        ids = torch.arange(0, vocab_size)
        mask = ids < self.pad_idx
        mask = mask.unsqueeze(1)
        weight = self.embedding.weight.data
        weight = weight * mask
        self.embedding.weight.data = weight
        emb = self.embedding(torch.tensor(tokens))
        #print(emb)
        #print("NEXT IMMEDIDATE EMBEEDING..........",self.embedding.weight.data[2049])
        #print("next SILLY EMBEDDING.................",self.embedding(torch.tensor(2049)))
        emb_sum = torch.sum(emb, dim=1) 
        h = emb_sum.view(emb_sum.shape[0], -1) 
        h=self.linear1(h)
        for i in range(self.num_layers):
            h=self.linear_list[i](h)
        h = torch.tanh(h)  
        logits = self.output_layer(h)
        return logits



In [25]:
# Corrupt Train & Test Data with some padding set up a random padding indexes at different positions...
EMB_SIZE=32
model = DeepCBoW_WithLayers(nwords, ntags, EMB_SIZE, 32,10,2049)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)

In [26]:
# Pad all the training and testing data before performing batch wise processing....
print(train[0][0])
#find the maximum train sequence length0 and pad them with extra zeros.........
maxlen=0
for x,y in train:
    maxlen=max(maxlen,len(x))
print("MAXLEN after train..........",maxlen,"  ",type(x))
for x,y in test:
    maxlen=max(maxlen,len(x))
print("MAXLEN after test.........",maxlen)

[1960, 312, 370, 303, 598, 996, 372, 1400, 1990, 277, 1031, 1119, 418, 1002, 910, 1117, 1990, 311, 266, 356, 799, 541, 868, 1975, 311, 771, 270, 350, 291, 307, 378, 1368, 261, 1552, 349, 1383, 1979]
MAXLEN after train.......... 70    <class 'list'>
MAXLEN after test......... 94


In [27]:
# Read in the data
ds = list(read_dataset("train.jsonl"))
print(ds[1:3])

train=ds[:1000]
test=ds[1000:]

nwords = len(sp)
ntags = 3

print(len(ds[0][0]))
print(ds[0][0])
print("NWORDS,,,,,,,,,,,,,",nwords)


[([332, 918, 332, 1994, 2006, 273, 512, 499, 301, 583, 312, 332, 296, 381, 394, 1883, 507], 0), ([309, 1639, 551, 325, 271, 1270, 272, 335, 321], 0)]
13
[273, 1989, 1974, 356, 339, 1185, 901, 305, 1990, 585, 273, 753, 485]
NWORDS,,,,,,,,,,,,, 2048


In [28]:
# padding all the dataset with maxlen
print("Before len is.............",len(ds),"  ",len(ds[0][0]),"  ",len(ds[1][0]))
import numpy as np
mxtoken=0
nds=[]
minlen=1e9
for i in range(len(ds)):
    minlen=min(minlen,len(ds[i][0]))
    if(len(ds[i][0])==0):
        print("ZEORLEN.........",ds[i][0],"  ",ds[i][1])
    padlen=maxlen-len(ds[i][0])
    for val in ds[i][0]:
        mxtoken=max(mxtoken,val)
    nds.append((ds[i][0]+[nwords+1]*padlen,ds[i][1]))
print(len(ds[0][0]))
print(ds[0][0])
print("MXTOKEN...........",mxtoken)
print("DS SHAPE IS..........",len(ds),"  ",len(ds[0][0]),"  ",len(ds[1][0]))
print("DS SHAPE IS..........",len(ds),"  ",len(nds[0][0]),"  ",len(nds[1][0]))
print("MINLEN IS.............",minlen)

Before len is............. 27481    13    17
ZEORLEN......... []    1
13
[273, 1989, 1974, 356, 339, 1185, 901, 305, 1990, 585, 273, 753, 485]
MXTOKEN........... 2047
DS SHAPE IS.......... 27481    13    17
DS SHAPE IS.......... 27481    94    94
MINLEN IS............. 0


In [29]:
for x,y in nds:
    if(len(x)<94):
        print("Found value....................")

In [30]:
print(train[0][0])

[273, 1989, 1974, 356, 339, 1185, 901, 305, 1990, 585, 273, 753, 485]


In [31]:
dummyembed=nn.Embedding(nwords+2, 32)
print(dummyembed)

Embedding(2050, 32)


In [32]:
print(dummyembed(torch.tensor([2049])))
dummyembed

tensor([[ 0.5967, -0.2450, -1.5582, -0.9639, -0.9788,  0.1469, -2.0398, -0.4725,
          0.4339, -1.3042, -1.7656,  0.0544,  0.3211,  1.3244, -0.9877,  0.4795,
         -0.0845, -0.7555, -0.0022,  1.9366,  0.6541,  1.0746, -0.4496,  0.8735,
          0.9208, -1.5479,  1.2979,  1.2941,  0.4048,  0.6982, -0.7297,  0.3867]],
       grad_fn=<EmbeddingBackward0>)


Embedding(2050, 32)

In [33]:
dummyembed=nn.Embedding(nwords+2, 32,padding_idx=2049)
print(dummyembed)

Embedding(2050, 32, padding_idx=2049)


In [34]:
print(dummyembed(torch.tensor([2049])))
dummyembed

tensor([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0.]], grad_fn=<EmbeddingBackward0>)


Embedding(2050, 32, padding_idx=2049)

In [35]:
len(nds)

27481

In [36]:
train=nds[:1000]
test=nds[1000:]

In [37]:
print(train[0])

([273, 1989, 1974, 356, 339, 1185, 901, 305, 1990, 585, 273, 753, 485, 2049, 2049, 2049, 2049, 2049, 2049, 2049, 2049, 2049, 2049, 2049, 2049, 2049, 2049, 2049, 2049, 2049, 2049, 2049, 2049, 2049, 2049, 2049, 2049, 2049, 2049, 2049, 2049, 2049, 2049, 2049, 2049, 2049, 2049, 2049, 2049, 2049, 2049, 2049, 2049, 2049, 2049, 2049, 2049, 2049, 2049, 2049, 2049, 2049, 2049, 2049, 2049, 2049, 2049, 2049, 2049, 2049, 2049, 2049, 2049, 2049, 2049, 2049, 2049, 2049, 2049, 2049, 2049, 2049, 2049, 2049, 2049, 2049, 2049, 2049, 2049, 2049, 2049, 2049, 2049, 2049], 1)


In [38]:
for i in range(len(train)):
    train[i]=(torch.tensor(train[i][0]),torch.tensor(train[i][1]))
for i in range(len(test)):
    test[i]=(torch.tensor(test[i][0]),torch.tensor(test[i][1]))


In [39]:
from torch.utils.data import DataLoader
# Assuming train_dataset is already defined
batch_size = 16  # You can adjust the batch size as needed
train_loader = DataLoader(train, batch_size=batch_size, shuffle=True)
test_loader = DataLoader(test, batch_size=batch_size, shuffle=True)

In [40]:
EMB_SIZE=32
model = DeepCBoW_WithLayers(nwords+2, ntags, EMB_SIZE, 32,10,2049)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)

In [41]:
for EPOCH in range(20):
    train_loss = 0.0
    start = time.time()
    model.train()
    for x, y in train_loader:
        logits = model(x)
        loss = criterion(logits, y)
        train_loss += loss.item()
        optimizer.zero_grad()
        loss.backward()
        #model.embedding.weight.grad[model.pad_idx] = torch.zeros(EMB_SIZE)
        #model.embedding(torch.tensor(2049))=0
        optimizer.step()
        #print("AFTER BACKPROP........",model.embedding.weight.grad[2049]," 8**********  ",model.embedding(torch.tensor(2049)))
    print("epoch %r: train loss/sent=%.4f, time=%.2fs" % (
                EPOCH, train_loss/len(train_loader), time.time()-start))
    model.eval()
    # Perform testing
    test_correct = 0.0
    for x, y in test_loader:
        logits = model(x).detach()
        predict = torch.argmax(logits,dim=1)
        test_correct += torch.sum(predict==y)
    print("iter %r: dev acc=%.4f" % (EPOCH, test_correct/len(test_loader)))

  emb = self.embedding(torch.tensor(tokens))


epoch 0: train loss/sent=1.1031, time=0.23s
iter 0: dev acc=6.6793
epoch 1: train loss/sent=0.9542, time=0.22s
iter 1: dev acc=7.9698
epoch 2: train loss/sent=0.6224, time=0.23s
iter 2: dev acc=8.1812
epoch 3: train loss/sent=0.3165, time=0.23s
iter 3: dev acc=8.2283
epoch 4: train loss/sent=0.1509, time=0.22s
iter 4: dev acc=8.3647
epoch 5: train loss/sent=0.1076, time=0.22s
iter 5: dev acc=8.1848
epoch 6: train loss/sent=0.0635, time=0.18s
iter 6: dev acc=8.1359
epoch 7: train loss/sent=0.0393, time=0.22s
iter 7: dev acc=8.1401
epoch 8: train loss/sent=0.0420, time=0.23s
iter 8: dev acc=8.1220
epoch 9: train loss/sent=0.0300, time=0.17s
iter 9: dev acc=8.1588
epoch 10: train loss/sent=0.0187, time=0.16s
iter 10: dev acc=8.1884
epoch 11: train loss/sent=0.0105, time=0.16s
iter 11: dev acc=8.1522
epoch 12: train loss/sent=0.0083, time=0.16s
iter 12: dev acc=8.1504
epoch 13: train loss/sent=0.0053, time=0.17s
iter 13: dev acc=8.1075
epoch 14: train loss/sent=0.0042, time=0.16s
iter 14: 