# Auto class for text summarization

Training a model for this task requires input-target sequence pairs.

* Input: original text
* Target: summarized text

Extractive summarization selects and combines parts of the original text to form a summary. It typically uses encoder models such as BERT, or sometimes encoder-decoder models such as T5.

Abstractive (or generative) summarization relies on sequence-to-sequence LLMs to generate a summary token by token, so the result may use different words and sentence structures from those in the original text.
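The extractive approach can be illustrated without any model at all. The sketch below scores each sentence by the average frequency of its words and returns the top-scoring sentences in their original order. This is a classical frequency heuristic swapped in for illustration, not the BERT-based approach mentioned above; the function name `extractive_summary` and the naive sentence-splitting regex are assumptions.

```python
import re
from collections import Counter


def extractive_summary(text: str, num_sentences: int = 1) -> str:
    """Return the highest-scoring sentences, kept in their original order."""
    # Naive sentence split on '.', '!' or '?' followed by whitespace (an assumption).
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s.strip()]
    # Word frequencies over the whole text, lower-cased.
    freqs = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence: str) -> float:
        words = re.findall(r"[a-z']+", sentence.lower())
        # Average frequency, so long sentences are not favoured automatically.
        return sum(freqs[w] for w in words) / max(len(words), 1)

    # Rank sentence indices by score, then restore document order.
    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:num_sentences])
    return " ".join(sentences[i] for i in chosen)


doc = "Cats sleep a lot. Cats and dogs are pets. The weather was mild."
summary = extractive_summary(doc, num_sentences=1)
print(summary)
```

Every sentence in the result is copied verbatim from the input, which is the defining property of extractive summarization.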


<p float="left">
  <img align="left" src="./img/extractive_sum.png" alt="extractive_sum" style="width: 400px;"/>
  <img align="left" hspace="100" src="./img/abstractive_sum.png" alt="abstractive_sum" style="width: 400px;"/>
</p>


In [1]:
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

In [2]:
model_name = 't5-small'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

In [3]:
sample_text = """
Alice opened the door and found that it led into a small passage, not much larger than a rat-hole: 
she knelt down and looked along the passage into the loveliest garden you ever saw. 
How she longed to get out of that dark hall, and wander about among those beds of bright flowers and those cool fountains, 
but she could not even get her head though the doorway; `and even if my head would go through,' thought poor Alice,
`it would be of very little use without my shoulders. Oh, how I wish I could shut up like a telescope! 
I think I could, if I only know how to begin.' For, you see, so many out-of-the-way things had happened lately, 
that Alice had begun to think that very few things indeed were really impossible.
"""

In [4]:
inputs = tokenizer.encode(sample_text, return_tensors="pt", max_length=512, truncation=True)
outputs = model.generate(inputs, max_length=50)
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
generated_text

"the doorway, she knelt down and found that it led into a small passage, not much larger than a rat-hole. and even if my head would go through,'"

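T5 checkpoints were trained with task prefixes, and the one for summarization is `summarize: `. The cell above passes the raw text without a prefix, which may be one reason the output reads like a copy of the opening lines rather than a summary. The sketch below only builds the prompt string; the helper name `add_task_prefix` is hypothetical, and whether the prefix improves the result for this sample has not been checked here.

```python
def add_task_prefix(text: str, prefix: str = "summarize: ") -> str:
    """Prepend a T5-style task prefix and collapse newlines and repeated spaces."""
    return prefix + " ".join(text.split())


raw_text = """
Alice opened the door and found that it led into a small passage,
not much larger than a rat-hole.
"""
prompt = add_task_prefix(raw_text)
print(prompt)

# The prompt would then replace sample_text in the encoding step, e.g.
# inputs = tokenizer.encode(prompt, return_tensors="pt", max_length=512, truncation=True)
```
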
# Auto class for text generation

Next-word prediction is a self-supervised task: the training examples are input-target sequence pairs derived from the text itself, with no manual labels.

* The input sequence is a segment of text. For example, "the cat is sleeping on the", from "the cat is sleeping on the mat".
* The target sequence is the same segment shifted one position to the left, so that each input token is paired with the token that follows it, e.g. "cat is sleeping on the mat".
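The shift between input and target can be sketched in a few lines. The example below splits on whitespace for readability, which is an assumption: real models operate on subword token IDs, but the shifting logic is the same. The helper name `make_next_token_pair` is hypothetical.

```python
def make_next_token_pair(tokens):
    """Return (input, target), where target is the input shifted by one position."""
    if len(tokens) < 2:
        raise ValueError("need at least two tokens to build a pair")
    return tokens[:-1], tokens[1:]


tokens = "the cat is sleeping on the mat".split()
input_seq, target_seq = make_next_token_pair(tokens)

# At each position the model sees the input tokens so far and must predict the target.
for position, (seen, expected) in enumerate(zip(input_seq, target_seq)):
    print(position, repr(seen), "->", repr(expected))
```
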

In [5]:
from transformers import AutoModelForCausalLM

In [6]:
model_name = 'gpt2'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

In [7]:
sample_text = "The grey cat sat on the orange mat and was looking very sad. It was dreaming about "

In [8]:
inputs = tokenizer.encode(sample_text, return_tensors="pt")
outputs = model.generate(inputs, max_length=50)
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

In [9]:
print(generated_text)

The grey cat sat on the orange mat and was looking very sad. It was dreaming about  the day he would be born.
"I'm going to be a cat," he said.
"I'm going to be a cat,"


# Auto class for text classification

`AutoModelForSequenceClassification` loads a model with a sequence classification head on top.

The example below uses a BERT-based model for sentiment classification on a 5-star rating scale.

In [10]:
from transformers import AutoModelForSequenceClassification

In [11]:
model_name = 'nlptown/bert-base-multilingual-uncased-sentiment'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

In [12]:
sample_text = "I like the colour of this product"

In [13]:
inputs = tokenizer(sample_text, return_tensors="pt", padding=True, truncation=True, max_length=64)
outputs = model(**inputs)
# index of the highest logit; class indices 0-4 correspond to 1-5 stars
predicted_class = torch.argmax(outputs.logits, dim=1).item()
predicted_class

3
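The predicted class is an index, not a star count. A minimal sketch of the post-processing follows, assuming the classes are ordered from 1 to 5 stars (this can be checked against `model.config.id2label`). The logits below are hypothetical values chosen for illustration; in the notebook they would come from `outputs.logits[0].tolist()`.

```python
import math


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    highest = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for a single review (one value per class).
example_logits = [-2.1, -1.0, 0.4, 1.9, 0.7]

probs = softmax(example_logits)
predicted_class = max(range(len(probs)), key=probs.__getitem__)
stars = predicted_class + 1  # assumed mapping: class index 0..4 -> 1..5 stars

print(f"class index: {predicted_class}, rating: {stars} stars")
```
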

# Auto class for QA

Question answering (QA) can use various architectures:

* Encoder-only for extractive QA.
* Encoder-decoder for abstractive QA.
* Decoder-only for closed QA, where the LLM generates the answer with no context provided.

The typical dataset features are: 'context', 'question', 'answers'.

Extractive QA is formulated as a supervised classification problem.

The pre-processed question and context are passed jointly as input to the model, which returns raw outputs (logits).

Two logits are produced for each token in the input sequence, scoring how likely that token is to be the start or the end of the answer span.

The raw logits are post-processed to obtain the prediction: the answer span, a portion of the input sequence delimited by the start and end token positions most likely to contain the answer.

The answer span is given by the pair of start and end positions with the highest combined score.
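Taking the argmax of the start and end logits independently is the simplest post-processing, but it can produce an end position that precedes the start. A minimal sketch of selecting the span with the highest combined score, restricted to valid spans, is shown below. The function name `best_span`, the `max_answer_len` limit, and the logit values are assumptions made for illustration.

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return (start, end, score) of the valid span with the highest combined score."""
    best_start, best_end, best_score = 0, 0, float("-inf")
    for start, start_score in enumerate(start_logits):
        # Only consider ends at or after the start, within the length limit.
        last_end = min(start + max_answer_len, len(end_logits))
        for end in range(start, last_end):
            score = start_score + end_logits[end]
            if score > best_score:
                best_start, best_end, best_score = start, end, score
    return best_start, best_end, best_score


# Hypothetical logits for a six-token input.
start_logits = [0.1, 0.2, 3.0, 0.5, 0.3, 2.5]
end_logits = [0.0, 2.8, 0.4, 0.6, 2.6, 0.2]

# Independent argmax gives an invalid span here: the end precedes the start.
naive_start = max(range(len(start_logits)), key=start_logits.__getitem__)
naive_end = max(range(len(end_logits)), key=end_logits.__getitem__)
print("independent argmax:", naive_start, naive_end)

start, end, score = best_span(start_logits, end_logits)
print("best valid span:", start, end, round(score, 2))
```

The returned `end` is inclusive, so slicing the input IDs would use `start:end + 1`.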

In [14]:
context = """
In the ten years up  to the start of the financial crisis, house prices tripled. 
Many people think this is because there were not enough houses around, but that is only part of the picture.
House prices rise much faster than wages, which means that houses become less and less affordable. 
Anyone who didn’t already own a house before the bubble started growing ends up giving up more and more of their 
salary simply to pay for a place to live. 
And it’s not just house buyers who are affected: pretty soon rents go up too, including in social housing.
This increase in prices led to a massive increase in the amount of money that first time buyers spent on mortgage repayments.
"""

In [15]:
question = 'Why are the house prices so high?'

In [16]:
from transformers import AutoModelForQuestionAnswering

In [17]:
model_name = 'deepset/minilm-uncased-squad2'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForQuestionAnswering.from_pretrained(model_name)

In [18]:
inputs = tokenizer(question, context, return_tensors="pt")

In [19]:
inputs.keys()

dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])

The inputs contain the `input_ids` and `attention_mask` tensors, plus a `token_type_ids` tensor whose values are 0 for tokens belonging to the question and 1 for tokens from the context.
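The layout of these three fields can be sketched for a BERT-style sentence pair, `[CLS] question [SEP] context [SEP]`. The example uses whitespace tokens instead of real subword IDs, and the helper name `build_pair_inputs` is hypothetical; it only mirrors what the tokenizer produces, it does not replace it.

```python
def build_pair_inputs(question_tokens, context_tokens):
    """Lay out a question/context pair the way a BERT-style tokenizer does."""
    tokens = ["[CLS]"] + question_tokens + ["[SEP]"] + context_tokens + ["[SEP]"]
    # Segment 0: [CLS], the question and the first [SEP].
    # Segment 1: the context and the final [SEP].
    token_type_ids = [0] * (len(question_tokens) + 2) + [1] * (len(context_tokens) + 1)
    # No padding in this example, so every position is attended to.
    attention_mask = [1] * len(tokens)
    return tokens, token_type_ids, attention_mask


question_tokens = "why are prices high ?".split()
context_tokens = "not enough houses".split()
tokens, token_type_ids, attention_mask = build_pair_inputs(question_tokens, context_tokens)

for token, segment in zip(tokens, token_type_ids):
    print(f"{token:10s} segment {segment}")
```
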

In [20]:
with torch.no_grad():
    outputs = model(**inputs)

# token positions that delimit the answer span
start_idx = torch.argmax(outputs.start_logits)
# +1 because the end of a slice is exclusive
end_idx = torch.argmax(outputs.end_logits) + 1

answer = tokenizer.decode(
    inputs["input_ids"][0][start_idx:end_idx]
)
answer

'there were not enough houses around'

# Auto class for translation

Training a model for this task requires input-target sequence pairs.

* Input: text in source language
* Target: translated text

Translation is typically handled by encoder-decoder models, such as the original Transformer.

# AutoModel general class

In [21]:
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

`AutoModel` is a generic class that, given some inputs at inference time, returns the hidden states produced by the model body. It has no task-specific head.

A head, such as the classification head defined below, has to be added on top of it.

In [22]:
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

In [23]:
sample_text = "I like summer long days"

In [24]:
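class TextClassification(nn.Module):
    """Linear classification head applied to the pooled model output."""

    def __init__(self, input_size, num_classes):
        super().__init__()
        self.fc = nn.Linear(input_size, num_classes)

    def forward(self, x):
        # x: (batch_size, input_size) -> logits: (batch_size, num_classes)
        return self.fc(x)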

In [25]:
# tokenize inputs
inputs = tokenizer(
    sample_text, return_tensors="pt", padding=True, truncation=True, max_length=64
)

In [26]:
outputs = model(**inputs)

Model hidden states:

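* `pooler_output` - aggregated representation of the whole sequence (for BERT, the `[CLS]` token's hidden state passed through a dense layer with tanh activation)
* `last_hidden_state` - unaggregated hidden states, one vector per input token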

In [27]:
# hidden states
pooled_output = outputs.pooler_output
print('Pooled output size: ', pooled_output.shape)
print('Last hidden state size: ', outputs.last_hidden_state.shape)    

Pooled output size:  torch.Size([1, 768])
Last hidden state size:  torch.Size([1, 7, 768])
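`pooler_output` is not the only way to aggregate the per-token vectors. A common alternative, not used in this notebook, is to average the vectors in `last_hidden_state` while ignoring padding positions. The sketch below shows the idea on plain Python lists with toy vectors of size 2 instead of 768; the function name `masked_mean_pool` is hypothetical.

```python
def masked_mean_pool(hidden_states, attention_mask):
    """Average the token vectors whose attention mask is 1, skipping padding."""
    kept = [vec for vec, mask in zip(hidden_states, attention_mask) if mask == 1]
    if not kept:
        raise ValueError("attention mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


# Three token vectors; the last one is padding and must not affect the average.
hidden_states = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
attention_mask = [1, 1, 0]

pooled = masked_mean_pool(hidden_states, attention_mask)
print(pooled)
```
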


In [28]:
# the head is randomly initialized (untrained), so these probabilities are not meaningful yet
classifier_head = TextClassification(pooled_output.size(-1), 2)
logits = classifier_head(pooled_output)
# class probabilities
proba = torch.softmax(logits, dim=1)
proba

tensor([[0.6490, 0.3510]], grad_fn=<SoftmaxBackward0>)