# Workshop week 12: Text Generation and Summarisation

### Introduction to Text Generation using GPT-2

Text generation is one of the most useful applications of Natural Language Processing.
Decoder-based GPT models are still the state of the art (SOTA) for text generation and many other NLP tasks.
Although we cannot run the latest GPT-4 ourselves, we can use GPT-2, a smaller pre-trained model that runs with limited resources.

## Activity 1: GPT-2 Open Text Generation

Work on this activity in groups (one at each table).

1. Review the following code to understand how it works
2. Think of a few more prompting examples, and generate texts using them
3. Open ChatGPT-3.5 and use the same examples to generate texts
4. Compare the GPT-2 and GPT-3.5 outputs using the human evaluation criteria discussed in Lecture 11. Score each criterion from 1 to 5, then add the scores together to compare the models:
    - fluency
    - coherence / consistency
    - factuality and correctness
    - commonsense
    - style / formality
    - grammaticality
    - typicality (what type of something, exemplars etc.)
    - redundancy
5. Discuss your findings with the class. How do the evaluations differ between groups?
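
The scoring in step 4 can be tallied with a few lines of Python. This is only a sketch: the names `CRITERIA` and `total_score` are made up for illustration, and the scores below are placeholders to be replaced with your group's own judgements.

```python
# Criteria from the activity; each is scored on a 1-5 scale.
CRITERIA = [
    "fluency", "coherence", "factuality", "commonsense",
    "style", "grammaticality", "typicality", "redundancy",
]

def total_score(scores):
    """Sum the 1-5 scores over all criteria, rejecting missing or out-of-range values."""
    for criterion in CRITERIA:
        value = scores[criterion]  # raises KeyError if a criterion was not scored
        if not 1 <= value <= 5:
            raise ValueError(f"{criterion}: score {value} is outside 1-5")
    return sum(scores[c] for c in CRITERIA)

# Placeholder values only: replace them with your group's own scores.
gpt2_scores = dict.fromkeys(CRITERIA, 3)
gpt35_scores = dict.fromkeys(CRITERIA, 4)

print("GPT-2 total:  ", total_score(gpt2_scores))
print("GPT-3.5 total:", total_score(gpt35_scores))
```

Keeping the per-criterion scores, rather than only the totals, makes it easier to see where groups disagree.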

**Explanation of code:**

1. **Tokenizer initialization:** The code loads a GPT-2 tokenizer (`tokenizer`) to preprocess text inputs. The tokenizer splits input text into subword tokens and maps each token to an integer ID that the model can consume.

2. **Model initialization:** The GPT-2 model (`model`) is loaded. It is a pre-trained neural network that has learned to predict the next token in a sequence given the preceding context.

3. **Maximum length:** `max_length` caps the total length of the output sequence in tokens, prompt included. This prevents the model from generating excessively long responses.

4. **Input prompt:** The `prompt` variable contains the initial snippet of text that the model will continue.

5. **Encoding the input:** The `encode()` method of the tokenizer converts the input prompt into token IDs (`input_ids`). With `return_tensors='pt'` the IDs are returned as a PyTorch tensor.

6. **Text generation:** The `generate()` method of the GPT-2 model continues the sequence from the input token IDs (`input_ids`). With `do_sample=True` each next token is sampled from the model's predicted probability distribution instead of always taking the most likely one, so the output differs from run to run.

7. **Decoding the output:** The `decode()` method of the tokenizer converts the generated token IDs (`output_ids`) back into text. `skip_special_tokens=True` drops special tokens such as the end-of-text token.

8. **Printing the output:** The generated text (`output_text`) is printed to the console.

This code demonstrates the process of using GPT-2 for text generation based on an initial prompt, providing participants with a hands-on understanding of how the model operates.
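
To see what sampling changes compared with always picking the most likely token, here is a toy sketch that does not use GPT-2 at all. The candidate tokens and their scores are hypothetical, and `softmax`, `greedy_pick` and `sample_pick` are illustrative helper names, not part of the `transformers` API.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Convert raw scores into probabilities; a lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def greedy_pick(tokens, probs):
    """Deterministic choice: always return the most probable token."""
    return tokens[probs.index(max(probs))]

def sample_pick(tokens, probs, rng):
    """Stochastic choice: draw one token in proportion to its probability."""
    return rng.choices(tokens, weights=probs, k=1)[0]

# Hypothetical next-token candidates and scores after the prompt "The quick brown fox"
tokens = ["jumps", "runs", "is", "sleeps"]
logits = [2.0, 1.0, 0.5, 0.1]
probs = softmax(logits)

rng = random.Random(0)  # fixed seed so the run is repeatable
print("greedy :", greedy_pick(tokens, probs))
print("sampled:", [sample_pick(tokens, probs, rng) for _ in range(5)])
```

Greedy selection returns the same token every time, while sampling can return any candidate with non-zero probability. This is why repeated runs of a sampling-based generator on the same prompt give different continuations.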

In [1]:
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

# Load the GPT-2 tokenizer and model
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')

# Set the maximum length of the generated text
max_length = 100

# Define the input prompt
prompt = "The quick brown fox"

# Encode the input prompt using the tokenizer
input_ids = tokenizer.encode(prompt, return_tensors='pt')

# Generate the text using the GPT-2 model
output_ids = model.generate(input_ids=input_ids, max_length=max_length, do_sample=True)

# Decode the generated text using the tokenizer
output_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)

# Print the generated text
print(output_text)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


The quick brown fox is a little different than the rest, only slightly tougher. The brown fox will eat even if it encounters the most obvious predator. There is definitely a difference between brown and brown fox, even though both have very similar prey for their prey. It's like they are constantly trying to find their mates, for some reason.

The brown fox is the same as the brown fox, while the brown fox and the brown fox are very similar. The brown fox is the same as


## Activity 2: Text Summarisation

The following code summarises text using BART and the `transformers` pipeline.

1. Review the example code below
2. In the second code cell, implement both abstractive and extractive summarisation for the given short news articles.
3. Compare the two types of summarisation using ROUGE, as well as human evaluation as in Activity 1.
4. Answer the following questions:

    a. Which type of summarisation generally gives a better ROUGE score?

    b. Which type of summarisation generally gives a better human evaluation score?

Discuss the results with the class. If the quality of the summaries for these articles is hard to assess, you can use articles from your Assignment 2 instead, but you will need to provide a reference summary for each.
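
To make the ROUGE numbers less of a black box, the sketch below computes a simplified ROUGE-1 (unigram overlap) by hand. Lowercasing and whitespace tokenisation are assumptions made here for brevity, and `rouge_1` is an illustrative name; the `rouge` package may tokenise and count differently, so its scores need not match exactly.

```python
from collections import Counter

def rouge_1(hypothesis, reference):
    """Unigram-overlap precision, recall and F1 between a summary and a reference."""
    hyp_counts = Counter(hypothesis.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Clipped overlap: a word counts at most as often as it appears in both texts
    overlap = sum((hyp_counts & ref_counts).values())
    precision = overlap / max(sum(hyp_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"p": precision, "r": recall, "f": f1}

reference = "the quick brown fox jumps over the lazy dog"
hypothesis = "the brown fox jumps"
print(rouge_1(hypothesis, reference))
```

Here every word of the hypothesis appears in the reference, so precision is 1.0, but only 4 of the 9 reference words are covered, so recall is 4/9. A summary that copies sentences from the source tends to share many exact words with a reference written from the same source, which is worth keeping in mind when comparing ROUGE with human scores.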

In [None]:
# Text summarisation example

!pip install rouge

import torch
from transformers import BartTokenizer, BartForConditionalGeneration
from transformers import pipeline
from rouge import Rouge

# Load the BART tokenizer and model for abstractive summarization
tokenizer_abstractive = BartTokenizer.from_pretrained('facebook/bart-large-cnn')
model_abstractive = BartForConditionalGeneration.from_pretrained('facebook/bart-large-cnn')

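# Load the pipeline used as the "extractive" summariser in this example.
# Note: with no model specified, pipeline('summarization') loads a default
# sequence-to-sequence checkpoint, which produces abstractive summaries.
pipeline_extractive = pipeline('summarization')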

# Define the input text
input_text = "The quick brown fox jumps over the lazy dog. This is a test sentence for summarization. Here is another sentence for testing."

# Define the target summary
target_summary = "The quick brown fox jumps over the lazy dog. This is a test sentence for summarization."

# Perform abstractive summarization using BART
# (inputs longer than 1024 tokens are truncated; a single input needs no padding)
inputs = tokenizer_abstractive([input_text], max_length=1024, truncation=True, return_tensors='pt')
outputs = model_abstractive.generate(inputs['input_ids'], attention_mask=inputs['attention_mask'], max_length=60, num_beams=4, length_penalty=2.0)
summary_abstractive = tokenizer_abstractive.decode(outputs[0], skip_special_tokens=True)

# Perform extractive summarization using pipeline
summary_extractive = pipeline_extractive(input_text, max_length=60)[0]['summary_text']

# Evaluate the summaries using the ROUGE metric
rouge = Rouge()
scores_abstractive = rouge.get_scores(summary_abstractive, target_summary)
scores_extractive = rouge.get_scores(summary_extractive, target_summary)

# Print the summaries and ROUGE scores
print("Input Text: ", input_text)
print("Target Summary: ", target_summary)
print("Abstractive Summary: ", summary_abstractive)
print("ROUGE Scores for Abstractive Summary: ", scores_abstractive)
print("Extractive Summary: ", summary_extractive)
print("ROUGE Scores for Extractive Summary: ", scores_extractive)
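
For a baseline that is extractive by construction (every output sentence is copied verbatim from the input), a frequency-based sentence scorer is one simple option. The sketch below is an illustration under stated assumptions, not the workshop's reference solution: `split_sentences` and `extractive_summary` are hypothetical helper names, the sentence splitter is naive, and no stop words are removed, so common words can dominate the scores.

```python
import re
from collections import Counter

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def extractive_summary(text, num_sentences=2):
    """Return the highest-scoring sentences, copied verbatim and kept in original order."""
    sentences = split_sentences(text)
    word_freq = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence):
        words = re.findall(r"[a-z']+", sentence.lower())
        # Average word frequency, so long sentences are not favoured automatically
        return sum(word_freq[w] for w in words) / max(len(words), 1)

    # Rank sentence indices by score, keep the top ones, then restore document order
    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:num_sentences])
    return " ".join(sentences[i] for i in chosen)

input_text = (
    "The quick brown fox jumps over the lazy dog. "
    "This is a test sentence for summarization. "
    "Here is another sentence for testing."
)
print(extractive_summary(input_text, num_sentences=2))
```

The result can be passed to `rouge.get_scores` in the same way as the other summaries.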

In [None]:
# Text summarisation example

# !pip install rouge
import pandas as pd
import torch
from transformers import BartTokenizer, BartForConditionalGeneration
from transformers import pipeline
from rouge import Rouge

# Load the CNN/DailyMail dataset
df = pd.read_csv('./daily_cnn.csv')

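# Iterate over every article in the file
for _, row in df.iterrows():
    # TODO: set input_text and target_summary from the columns of `row`, then
    # compute summary_abstractive, summary_extractive and their ROUGE scores
    # as in the example cell above.
    ...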
    # Print the summaries and ROUGE scores
    print("Input Text: ", input_text)
    print("Target Summary: ", target_summary)
    print("Abstractive Summary: ", summary_abstractive)
    print("ROUGE Scores for Abstractive Summary: ", scores_abstractive)
    print("Extractive Summary: ", summary_extractive)
    print("ROUGE Scores for Extractive Summary: ", scores_extractive)