# Text processing with Transformers


The transformer is a deep learning architecture developed by researchers at Google and based on the multi-head attention mechanism. It was proposed in the 2017 paper ["Attention Is All You Need"](https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf).

In [14]:
import torch
import torch.nn as nn

In [2]:
sample_texts = [
    'I love this product',
    'This is terrible',
    'Could be better',
    'This is the best',
]

In [3]:
labels = [1,0,0,1]

In [4]:
train_data, test_data = sample_texts[:3], sample_texts[3:]
train_labels, test_labels = labels[:3], labels[3:]

TransformerEncoderLayer:

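* d_model - the size of each token's embedding vector, i.e. the number of features the layer expects per token; larger values give the model more representational capacity
* nhead - the number of attention heads, i.e. how many different attention patterns over the sentence the model can compute in parallel; `d_model` must be divisible by `nhead`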
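
As a minimal sketch of how these two parameters relate to tensor shapes, the snippet below builds a single encoder layer with small, made-up sizes (`d_model=16`, `nhead=4` are illustrative values, not the ones used in this notebook) and passes a random batch through it.

```python
import torch
import torch.nn as nn

# Hypothetical sizes chosen only for illustration
d_model, nhead = 16, 4          # each head works on d_model // nhead = 4 dimensions
layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead, batch_first=True)

# A batch of 2 sequences, 5 tokens each, every token a d_model-sized vector
x = torch.rand(2, 5, d_model)
out = layer(x)
print(out.shape)  # torch.Size([2, 5, 16]): same shape as the input

# d_model must be divisible by nhead, otherwise the embedding cannot be split across heads
try:
    nn.TransformerEncoderLayer(d_model=16, nhead=3)
except (AssertionError, ValueError) as err:
    print("invalid configuration:", err)
```

The layer keeps the input shape, which is why several of them can be stacked inside `nn.TransformerEncoder`.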

In [9]:
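class TransformerEncoder(nn.Module):
    def __init__(self, embed_size, heads, num_layers, dropout):
        super().__init__()
        # batch_first=True: inputs are (batch, seq_len, embed_size), matching mean(dim=1) in forward
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=embed_size, nhead=heads, dropout=dropout, batch_first=True),
            num_layers=num_layers
        )
        # Two output classes: negative (0) and positive (1)
        self.fc = nn.Linear(embed_size, 2)

    def forward(self, x):
        x = self.encoder(x)
        # Average over the sequence dimension to get one vector per sentence
        x = x.mean(dim=1)
        return self.fc(x)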

In [None]:
model = TransformerEncoder(embed_size=512, heads=8, num_layers=3, dropout=0.5)

In [12]:
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
loss_function = nn.CrossEntropyLoss()

In [None]:
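# training loop
# token_embeddings is not defined earlier in this notebook; as an assumption,
# each training word gets a random (1, 512) vector so that the loop can run
token_embeddings = {word: torch.rand(1, 512) for word in ' '.join(train_data).split()}

model.train()
for epoch in range(10):
    for sentence, label in zip(train_data, train_labels):
        # Split the sentence into tokens and stack the embeddings into (1, seq_len, 512)
        tokens = sentence.split()
        data = torch.stack([token_embeddings[token] for token in tokens], dim=1)
        output = model(data)
        loss = loss_function(output, torch.tensor([label]))
        # Zero the gradients, backpropagate, and update the weights
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()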

In [13]:
# testing
def predict(sentence):
    model.eval()
    # Deactivate the gradient computations and get the sentiment prediction.
    with torch.no_grad():
        tokens = sentence.split()
        data = torch.stack([token_embeddings.get(token, torch.rand((1, 512))) for token in tokens], dim=1)
        output = model(data)
        predicted = torch.argmax(output, dim=1)
    return 'positive' if predicted.item() == 1 else 'negative'

# Attention mechanism

<img src="./img/attention.png" alt="attention.png" style="width: 400px;"/>

The Attention Mechanism in the transformer assigns importance to words within a sentence. 

In the example, 'it' is understood to be most related to 'animal', then 'street', then 'the', in descending order of significance. This helps ensure that in tasks like translation, the machine's interpretation aligns with human understanding.

**SELF** and **MULTI-HEAD** attention:

* Self-Attention assigns significance to the words of a sentence relative to each other. In *The cat, which was on the roof, was scared,* the mechanism links "was scared" directly to "The cat", even though other words sit between them.

* Multi-Head Attention is akin to deploying multiple spotlights, each following a different relationship. In the same example, one head could relate "was scared" to "The cat," another could highlight "the roof," and a third could point to "was on".
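
The two ideas above can be sketched in a few lines. This is an illustrative example under stated assumptions, not the notebook's model: the input is random, and the projection matrices `w_q`, `w_k`, `w_v` are random stand-ins for weights that would normally be learned.

```python
import math
import torch

torch.manual_seed(0)

# Hypothetical toy input: 1 sentence, 4 tokens, 8-dimensional embeddings
seq_len, d_model = 4, 8
x = torch.rand(1, seq_len, d_model)

# Learned projections would normally produce queries, keys and values;
# random matrices stand in for them here
w_q, w_k, w_v = (torch.rand(d_model, d_model) for _ in range(3))
q, k, v = x @ w_q, x @ w_k, x @ w_v

# Self-attention: every token scores every token, scaled by sqrt(d_model)
scores = q @ k.transpose(-2, -1) / math.sqrt(d_model)   # (1, 4, 4)
weights = torch.softmax(scores, dim=-1)                 # each row sums to 1
context = weights @ v                                   # (1, 4, 8)

# Multi-head attention runs several such computations in parallel
# on slices of the embedding and combines the results
mha = torch.nn.MultiheadAttention(embed_dim=d_model, num_heads=2, batch_first=True)
mha_out, mha_weights = mha(x, x, x)   # returned weights are averaged over heads by default
print(weights.shape, context.shape, mha_out.shape, mha_weights.shape)
```

Each row of `weights` says how much one token draws on every token in the sentence, which is the "importance" shown in the figure above.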

### Dataset

In [2]:
sample_text = [
    "the animal didn't cross the street because it was too tired",
    "the cat sat on the mat",
]

In [7]:
# vocabulary and word index
vocab = set(' '.join(sample_text).split())
word_to_idx = {word:idx for idx, word in enumerate(vocab)}
ix_to_word = {idx:word for idx, word in enumerate(vocab)}

In [15]:
# input/target data: every word but the last is the input, the last word is the target
pairs = [sentence.split() for sentence in sample_text]
input_data = [[word_to_idx[word] for word in sentence[:-1]] for sentence in pairs]
target_data = [word_to_idx[sentence[-1]] for sentence in pairs]

inputs = [torch.tensor(seq, dtype=torch.long) for seq in input_data]
targets = torch.tensor(target_data, dtype=torch.long)

### Model

In [33]:
embedding_dim = 10
hidden_dim = 16
vocab_size = len(vocab)

class RNNWithAttentionModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim):
        super().__init__()
        
        # Embedding layer translates word indexes to vectors
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        
        # RNN layer for sequential processing
        self.rnn = nn.RNN(embedding_dim, hidden_dim, batch_first=True)

        # Attention layer computes word significances with a linear transformation from hidden_dim to one,
        # yielding a single attention score per word
        self.attention = nn.Linear(hidden_dim, 1)

        # Final layer outputting vocab_size pinpoints the predicted word index
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, x):
        
        # word indexes are embedded
        x = self.embeddings(x)

        # process embeddings in the sequential layer, generating an output for each word
        out, _ = self.rnn(x)

        # Attention scores come from a linear transformation of the RNN outputs; squeeze(2) drops the
        # trailing size-1 dimension, and softmax over the sequence dimension normalizes the scores.
        att_weights = torch.nn.functional.softmax(self.attention(out).squeeze(2), dim=1)

        # The context vector is a weighted sum of the RNN outputs, with the attention scores as weights.
        # unsqueeze(2) turns the (batch, seq_len) weights into (batch, seq_len, 1) so that they broadcast
        # across the hidden dimension in the element-wise multiplication with the RNN outputs.
        # torch.sum over the sequence dimension then gives one context vector per sequence for the fc layer.
        context = torch.sum(att_weights.unsqueeze(2) * out, dim=1)
        
        return self.fc(context)

def pad_sequences(batch):
    """
    Ensures consistent sequence lengths by padding the input sequences 
    with torch-dot-cat and torch-dot-stack, avoiding any potential length discrepancies
    """
    max_len = max([len(i) for i in batch])
    return torch.stack(
        [torch.cat([seq, torch.zeros(max_len - len(seq)).long()])
             for seq in batch]
    )
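
As a side note, PyTorch ships a helper that does the same job as the hand-written `pad_sequences` above. The sketch below uses `torch.nn.utils.rnn.pad_sequence` on two made-up index sequences. Note that the padding value 0 is also a valid word index in `word_to_idx`, so the model cannot tell padding from that word; this is tolerable for the toy example but worth keeping in mind.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Two hypothetical index sequences of different lengths
batch = [torch.tensor([4, 2, 7]), torch.tensor([5, 1])]

# batch_first=True gives shape (batch, max_len); shorter sequences are padded with 0
padded = pad_sequence(batch, batch_first=True, padding_value=0)
print(padded)
# tensor([[4, 2, 7],
#         [5, 1, 0]])
```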

In [34]:
loss_function = nn.CrossEntropyLoss()
model = RNNWithAttentionModel(vocab_size, embedding_dim, hidden_dim)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

In [37]:
# training
for epoch in range(300):
    model.train()
    optimizer.zero_grad()
    padded_inputs = pad_sequences(inputs)
    output = model(padded_inputs)
    loss = loss_function(output, targets)
    loss.backward()
    optimizer.step()

    if (epoch % 100) == 0:
        print(f'epoch {epoch}, loss: {loss.item()}')

epoch 0, loss: 0.0005866951541975141
epoch 100, loss: 0.0004257845284882933
epoch 200, loss: 0.00032556717633269727


In [40]:
# testing
for seq, target in zip(input_data, target_data):
    input_test = torch.tensor(seq, dtype=torch.long).unsqueeze(0)
    model.eval()
    output = model(input_test)
    predictions = ix_to_word[torch.argmax(output).item()]
    print(f'\nInput: {" ".join([ix_to_word[i] for i in seq])}')
    print(f'\nTarget: {ix_to_word[target]}')
    print(f'\nOutput: {predictions}')


Input: the animal didn't cross the street because it was too

Target: tired

Output: tired

Input: the cat sat on the

Target: mat

Output: mat
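
To make the tensor shapes inside `RNNWithAttentionModel.forward` concrete, here is a small shape walkthrough of the attention pooling step. It is only a sketch: random numbers stand in for the RNN outputs, and the sizes (2 sequences, 5 time steps) are made up for illustration.

```python
import torch

torch.manual_seed(0)

# Hypothetical sizes: 2 sequences, 5 time steps, hidden size 16
out = torch.rand(2, 5, 16)             # stands in for the RNN outputs
attention = torch.nn.Linear(16, 1)     # same role as self.attention in the model

scores = attention(out)                                  # (2, 5, 1): one score per word
att_weights = torch.softmax(scores.squeeze(2), dim=1)    # (2, 5): weights sum to 1 per sequence
weighted = att_weights.unsqueeze(2) * out                # (2, 5, 16) via broadcasting
context = weighted.sum(dim=1)                            # (2, 16): one vector per sequence

print(scores.shape, att_weights.shape, context.shape)
```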


# Adversarial attacks

What is an Adversarial Attack in AI? 

It is an attack where the goal is to cause an AI system to make a mistake or misclassification, often through subtle manipulations of the input data.

**Fast Gradient Sign Method (FGSM)**

<img src="./img/fgsm.png" alt="fgsm.png" style="width: 600px;"/>

FGSM makes precise changes that may go undetected. It uses the gradient of the model's loss with respect to the input to apply a single, tiny change in the direction that increases the loss, leading the model astray.

The example shows a spam filter that is usually accurate but gets deceived by a cleverly altered email. Notice the tiny tweak in the word "love". To an AI model, this could change the classification. In a real-world setting, such alterations can prevent a spammy email from being flagged.
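
A minimal FGSM sketch follows. Because text is discrete, the example perturbs a continuous input vector (think of it as a sentence embedding) rather than characters. The classifier `model`, the helper `fgsm`, and the value of `epsilon` are all hypothetical stand-ins, not part of the notebook above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-in classifier over 8-dimensional inputs
model = nn.Linear(8, 2)
loss_fn = nn.CrossEntropyLoss()

def fgsm(model, x, y, epsilon):
    """One-step attack: move x by epsilon in the direction that increases the loss."""
    x = x.clone().detach().requires_grad_(True)
    loss = loss_fn(model(x), y)
    loss.backward()
    # The sign of the gradient gives the direction; epsilon bounds the change per feature
    return (x + epsilon * x.grad.sign()).detach()

x = torch.rand(1, 8)      # made-up input
y = torch.tensor([1])     # made-up true label
x_adv = fgsm(model, x, y, epsilon=0.1)
print((x_adv - x).abs().max())   # never larger than epsilon
```

Each input feature moves by at most `epsilon`, which is what keeps the change small.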

**Projected Gradient Descent (PGD)** 

<img src="./img/pgd.png" alt="pgd" style="width: 600px;"/>

It's like the seasoned burglar who picks the lock step by step. 

It refines its deception across several iterations, searching for the most effective perturbation. Against a fake news detector, for example, PGD could subtly adjust an article's phrasing over and over until the AI is convinced of its authenticity. In the example, "likely" becomes "set to", altering the prediction confidence; such iterative tweaks could confuse the detector's judgment.
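
The iterative idea can be sketched as repeated small gradient-sign steps, each followed by a projection that keeps the total change within a budget. As before, this operates on a continuous input vector; the classifier `model`, the helper `pgd`, and all hyperparameter values are assumptions made for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

model = nn.Linear(8, 2)          # hypothetical stand-in classifier
loss_fn = nn.CrossEntropyLoss()

def pgd(model, x, y, epsilon, step_size, steps):
    """Repeat small gradient-sign steps, projecting back into the epsilon-ball each time."""
    x_orig = x.clone().detach()
    x_adv = x_orig.clone()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = loss_fn(model(x_adv), y)
        grad, = torch.autograd.grad(loss, x_adv)
        x_adv = x_adv.detach() + step_size * grad.sign()
        # Projection: keep the total change within epsilon of the original input
        x_adv = x_orig + torch.clamp(x_adv - x_orig, -epsilon, epsilon)
    return x_adv.detach()

x = torch.rand(1, 8)      # made-up input
y = torch.tensor([1])     # made-up true label
x_adv = pgd(model, x, y, epsilon=0.1, step_size=0.02, steps=10)
print((x_adv - x).abs().max())   # stays within epsilon
```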

**The Carlini & Wagner (C&W) attack**

<img src="./img/cw.png" alt="cw" style="width: 600px;"/>

It's like the mastermind spy who leaves no trace. 

By optimizing a loss function that balances fooling the model against keeping the change small, it aims for modifications that are not just deceptive to the AI but virtually undetectable to us. Consider an AI-driven stock trading system; C&W could tweak a financial transcript subtly, potentially causing erroneous investments. The addition of "somewhat" can change the sentiment and context, especially if used in critical financial or medical reports.
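
The sketch below shows the optimization view in a heavily simplified form: a perturbation `delta` is trained with Adam to minimize its squared size plus a margin term on the logits. It omits parts of the full C&W method (such as the change of variables and the search over the constant `c`), and the classifier `model`, the helper `cw_attack`, and all hyperparameters are hypothetical.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

model = nn.Linear(8, 2)   # hypothetical stand-in classifier producing logits

def cw_attack(model, x, y, c=1.0, steps=100, lr=0.05):
    """Simplified untargeted C&W-style attack: trade off the squared L2 size
    of the perturbation against a margin loss on the logits."""
    delta = torch.zeros_like(x, requires_grad=True)
    optimizer = torch.optim.Adam([delta], lr=lr)
    true_mask = nn.functional.one_hot(y, num_classes=2).bool()
    for _ in range(steps):
        logits = model(x + delta)
        true_logit = logits.gather(1, y.unsqueeze(1)).squeeze(1)
        # Largest logit among the other classes
        other_logit = logits.masked_fill(true_mask, float("-inf")).max(dim=1).values
        # The margin term is positive while the true class still wins
        margin = torch.clamp(true_logit - other_logit, min=0)
        loss = (delta ** 2).sum() + c * margin.sum()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return (x + delta).detach()

x = torch.rand(1, 8)      # made-up input
y = torch.tensor([1])     # made-up true label
x_adv = cw_attack(model, x, y)
print((x_adv - x).norm())  # size of the perturbation that was found
```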

**Defence strategies:**

* Model ensembling
* Data augmentation
* Adversarial training
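
As one illustration of the last strategy, adversarial training can be sketched as crafting adversarial examples against the current model at each step and training on clean and adversarial inputs together. Everything here is made up for the example: the data is random, the classifier is a single linear layer, and the perturbation size 0.1 is arbitrary.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

model = nn.Linear(8, 2)                      # hypothetical stand-in classifier
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

x = torch.rand(16, 8)                        # made-up inputs
y = torch.randint(0, 2, (16,))               # made-up labels

for step in range(5):
    # Craft FGSM examples against the current model
    x_req = x.clone().requires_grad_(True)
    grad, = torch.autograd.grad(loss_fn(model(x_req), y), x_req)
    x_adv = (x + 0.1 * grad.sign()).detach()

    # Train on clean and adversarial inputs together
    loss = loss_fn(model(torch.cat([x, x_adv])), torch.cat([y, y]))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```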