# Summarization

Abstractive summarization produces a summary written in new natural-language sentences that capture the essential meaning of a text, rather than copying a subset of sentences from the original (which would be extractive summarization).

Given a dataset of (document, summary) pairs, the task can be framed like machine translation: instead of translating the input text, the model summarizes it. We therefore use an encoder-decoder model: the encoder takes the document as input, and the decoder outputs the summary token by token.
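
To make "outputs the summary token by token" concrete, here is a minimal sketch of a greedy decoding loop. The `toy_next_token` function and the `BOS`/`EOS` markers are hypothetical stand-ins: a trained encoder-decoder would instead score every vocabulary token given the encoded document and the tokens produced so far.

```python
from typing import Callable, List

BOS, EOS = "<s>", "</s>"  # hypothetical start/end-of-sequence markers


def greedy_decode(
    document: List[str],
    next_token: Callable[[List[str], List[str]], str],
    max_length: int = 20,
) -> List[str]:
    """Build a summary one token at a time until EOS or max_length tokens."""
    summary = [BOS]
    while len(summary) - 1 < max_length:
        token = next_token(document, summary)
        if token == EOS:
            break
        summary.append(token)
    return summary[1:]  # drop the start marker


def toy_next_token(document: List[str], summary_so_far: List[str]) -> str:
    """Stand-in for a trained model: emits every other document token."""
    position = 2 * (len(summary_so_far) - 1)
    return document[position] if position < len(document) else EOS


document = "the court dismissed the appeal with costs".split()
print(greedy_decode(document, toy_next_token))
```

Greedy decoding keeps a single candidate at each step. The Pegasus cell later in this notebook uses beam search (`num_beams=5`), which keeps several candidate summaries in parallel.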

We want to summarize cases of the Federal Court of Australia.

In [None]:
%%capture
!pip install transformers[sentencepiece] datasets sacremoses evaluate
# !pip install sentencepiece

In [None]:
import os
import json
import string
from pathlib import Path
from typing import Union
from collections import Counter

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from tabulate import tabulate
from tqdm import notebook
from bs4 import BeautifulSoup
import re
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from wordcloud import WordCloud

import nltk
nltk.download('punkt')
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


### Data

In [None]:
# Mount Google Drive to access the data folder
from google.colab import drive
drive.mount("/content/drive")

Mounted at /content/drive


In [None]:
directory = 'drive/MyDrive/Colab_Notebooks/fulltext'

# Map each filename to the list of tags that hold the text of the case
data = dict()

file_list = os.listdir(directory)

for filename in notebook.tqdm(file_list):
    with open(os.path.join(directory, filename), 'r') as f:
        raw_data = f.read()

    bs_data = BeautifulSoup(raw_data, 'xml')

    if bs_data.find('sentences') is None:
        # No <sentences> tag: fall back to the <text> tags
        texts = list(bs_data.find_all('text'))
    else:
        # Otherwise keep every <sentences> tag
        texts = list(bs_data.find_all('sentences'))

    data[filename] = texts

  0%|          | 0/3890 [00:00<?, ?it/s]
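
The fallback logic in the loading loop above (use `<sentences>` tags when present, otherwise `<text>` tags) can be checked on small inputs. The sketch below swaps BeautifulSoup for the standard library's `xml.etree.ElementTree` so it runs without extra dependencies. The two XML strings are hypothetical miniature files, and the real files' tag layout may differ.

```python
import xml.etree.ElementTree as ET
from typing import List

# Hypothetical miniature case files, for illustration only.
WITH_SENTENCES = (
    "<case><name>A v B</name><sentences>"
    "<sentence>First.</sentence><sentence>Second.</sentence>"
    "</sentences></case>"
)
WITHOUT_SENTENCES = "<case><name>C v D</name><text>Only a text block.</text></case>"


def extract_blocks(xml_string: str) -> List[str]:
    """Return the text of every <sentences> element, or of every <text>
    element when the file has no <sentences> element."""
    root = ET.fromstring(xml_string)
    tags = list(root.iter("sentences")) or list(root.iter("text"))
    return ["".join(tag.itertext()) for tag in tags]


print(extract_blocks(WITH_SENTENCES))
print(extract_blocks(WITHOUT_SENTENCES))
```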

In [None]:
def clean_data(doc: str) -> str:
    """Replace newlines and apostrophes with spaces."""
    x = re.sub(r'\n', ' ', doc)  # newlines become spaces
    x = re.sub(r"\'", ' ', x)    # apostrophes become spaces
    return x
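
As a quick check of what `clean_data` does, the sketch below runs it on a made-up sentence. Replacing apostrophes with spaces splits possessives ("appellant's" becomes "appellant s"), and consecutive newlines leave runs of spaces. The `collapse_spaces` helper is a hypothetical extra step, not part of the original pipeline.

```python
import re


def clean_data(doc: str) -> str:
    x = re.sub(r'\n', ' ', doc)  # newlines become spaces
    x = re.sub(r"\'", ' ', x)    # apostrophes become spaces
    return x


def collapse_spaces(doc: str) -> str:
    """Hypothetical extra step: squeeze whitespace runs into one space."""
    return re.sub(r'\s+', ' ', doc).strip()


sample = "The appellant's claim\n\nwas dismissed."
cleaned = clean_data(sample)
print(repr(cleaned))
print(repr(collapse_spaces(cleaned)))
```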

In [None]:
documents = [clean_data(data[key][0].text) for key in data.keys()]

In [None]:
documents[100]



### Summarization - Pegasus

Pegasus is a pre-trained Transformer encoder-decoder model designed for abstractive summarization.

We use Pegasus because it was pre-trained on a large amount of diverse text and reported state-of-the-art results on many summarization benchmarks when it was released. It is designed to **generate coherent and informative summaries**. Its input length is limited, however: `get_response` below truncates each document to 1024 tokens, so for **long documents** only the beginning is summarized.

We chose the `tuner007/pegasus_summarizer` checkpoint because it is a Pegasus model that has already been fine-tuned for summarization and has achieved **good performance** on various datasets. The tokenizer converts the input text into the tokens the model expects and decodes the generated token IDs back into text.

Using a pre-trained model like Pegasus avoids the time and compute needed to train a summarization model from scratch. If needed, the model can also be fine-tuned on a specific task with a smaller dataset, which is still far cheaper than full training.
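
Because `get_response` below passes `truncation=True` and `max_length=1024` to the tokenizer, text beyond that limit is never seen by the model. One possible workaround, shown here only as a sketch, is to split a long document into chunks, summarize each chunk, and join the partial summaries. The chunk size of 600 words is an arbitrary assumption (word counts only approximate token counts), and `first_words` is a stand-in summarizer so the example runs without a model.

```python
from typing import Callable, List


def split_into_chunks(text: str, max_words: int) -> List[str]:
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def summarize_long(text: str, summarize: Callable[[str], str], max_words: int = 600) -> str:
    """Summarize each chunk separately and join the partial summaries."""
    return " ".join(summarize(chunk) for chunk in split_into_chunks(text, max_words))


def first_words(chunk: str) -> str:
    """Stand-in summarizer for illustration: keeps the first three words."""
    return " ".join(chunk.split()[:3])


text = " ".join(f"w{i}" for i in range(10))
print(split_into_chunks(text, 4))
print(summarize_long(text, first_words, max_words=4))
```

In this notebook, the `summarize` argument could be something like `lambda chunk: get_response(chunk)[0]`, since `get_response` returns a one-element list.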

In [None]:
import torch
from transformers import PegasusForConditionalGeneration, PegasusTokenizer

model_name = 'tuner007/pegasus_summarizer'
torch_device = 'cuda' if torch.cuda.is_available() else 'cpu'
tokenizer = PegasusTokenizer.from_pretrained(model_name)
model = PegasusForConditionalGeneration.from_pretrained(model_name).to(torch_device)

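def get_response(input_text):
    """Return a one-element list holding the generated summary of input_text."""
    # Inputs longer than 1024 tokens are truncated
    batch = tokenizer([input_text], truncation=True, padding='longest', max_length=1024, return_tensors="pt").to(torch_device)
    # Beam search with 5 beams; temperature is dropped because it only affects sampling (do_sample=True)
    gen_out = model.generate(**batch, max_length=512, num_beams=5, num_return_sequences=1)
    output_text = tokenizer.batch_decode(gen_out, skip_special_tokens=True)
    return output_text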


In [None]:
for i, document in enumerate(notebook.tqdm(documents)):
    # Save each summary to its own JSON file so that finished work survives an interrupted run
    dict_res = {f'document {i}': get_response(document)}
    with open("drive/MyDrive/summary_nlp/summary_doc_" + str(i) + ".json", "w") as f:
        json.dump(dict_res, f)

  0%|          | 0/3890 [00:00<?, ?it/s]

We spot-checked several randomly chosen summaries. They are coherent and make sense.

*We ran the model on a standard Google Colab GPU. The computation took 2 hours.*
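
Since the run takes hours and a Colab session can disconnect, writing one file per document makes it possible to resume. The following is a minimal sketch of that idea, not the code that was actually run: `summarize_all` is a hypothetical helper that skips documents whose output file already exists, and `fake_summarize` stands in for `get_response` so the example runs without a model.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, List


def summarize_all(documents: List[str], out_dir: Path,
                  summarize: Callable[[str], List[str]]) -> int:
    """Write one JSON file per document, skipping files that already exist.

    Returns the number of documents summarized during this call.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    done = 0
    for i, document in enumerate(documents):
        out_path = out_dir / f"summary_doc_{i}.json"
        if out_path.exists():
            continue  # already summarized in an earlier run
        with open(out_path, "w") as f:
            json.dump({f"document {i}": summarize(document)}, f)
        done += 1
    return done


def fake_summarize(document: str) -> List[str]:
    """Stand-in for get_response: returns the first 10 characters."""
    return [document[:10]]


with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp) / "summary_nlp"
    first_run = summarize_all(["doc a", "doc b"], out_dir, fake_summarize)
    second_run = summarize_all(["doc a", "doc b", "doc c"], out_dir, fake_summarize)
    print(first_run, second_run)
```

The second call only processes the document that has no output file yet.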

In [None]:
# Merge the per-document JSON files into a single dictionary
summaries = dict()
sum_directory = Path("./summary_nlp")
files = [sum_directory / filename for filename in os.listdir(sum_directory) if filename.endswith(".json")]

for filepath in files:
    with open(filepath, "r") as json_file:
        # Use a new name here to avoid overwriting the `data` dict built earlier
        summary = json.load(json_file)
    summaries.update(summary)

# Each file holds exactly one key, so the counts must match
assert len(summaries) == len(files)

with open("summaries.json", "w") as summaries_json:
    json.dump(summaries, summaries_json)
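
Each merged entry maps a key of the form `document <i>` to a one-element list, because `get_response` returns the output of `batch_decode`. Since `os.listdir` does not guarantee any particular order, the entries may also be out of sequence. The sketch below shows one way to turn the merged dictionary into an index-to-string mapping sorted by document number. `flatten_summaries` is a hypothetical helper, and the example entries are made up.

```python
from typing import Dict, List


def flatten_summaries(summaries: Dict[str, List[str]]) -> Dict[int, str]:
    """Map 'document <i>' -> [summary] entries to i -> summary, sorted by i."""
    flat = {}
    for key, value in summaries.items():
        index = int(key.split()[-1])  # 'document 12' -> 12
        flat[index] = value[0]        # one summary per document
    return dict(sorted(flat.items()))


# Made-up entries for illustration only.
example = {
    "document 10": ["Appeal dismissed."],
    "document 2": ["Application granted."],
}
print(flatten_summaries(example))
```

Sorting on the integer index avoids the string ordering in which `document 10` would come before `document 2`.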

**If we had more time:**

We could test GPT-3 with few-shot learning.

In general, few-shot learning is easier with a decoder-only GPT model: you state what you want at the beginning of the prompt. GPT is the basic architecture to use when there is no specific task and you just want to generate text. Once you have a model that can generate text, you "simply" specify the task at the beginning of the prompt and let the model continue the text.

For example, you can say to ChatGPT: summarize the following text: "X".

However, this is expensive.
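
The few-shot idea can be sketched as a prompt builder: a task instruction, a few solved (document, summary) examples, then the new document with an empty summary slot for the model to complete. The prompt layout and the example texts are assumptions made for illustration, and no model is called here.

```python
from typing import List, Tuple


def build_few_shot_prompt(examples: List[Tuple[str, str]], document: str) -> str:
    """Concatenate solved (document, summary) examples, then the new document."""
    parts = ["Summarize the following text."]
    for example_document, example_summary in examples:
        parts.append(f"Text: {example_document}\nSummary: {example_summary}")
    parts.append(f"Text: {document}\nSummary:")  # left open for the model
    return "\n\n".join(parts)


# Made-up examples for illustration only.
examples = [
    ("The court heard an appeal against a costs order and dismissed it.",
     "Appeal against costs order dismissed."),
]
prompt = build_few_shot_prompt(examples, "The applicant sought leave to amend the claim.")
print(prompt)
```

The cost mentioned above grows with prompt length, since every example added to the prompt is sent with each document to summarize.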