<a href="https://colab.research.google.com/github/yungsinatra0/Big-Jah/blob/main/ThemisAI.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction

Welcome to Themis.AI! This notebook will be used to provide a general summary based on uploaded reports.

## Installing necessary dependencies for dealing with text & pdf

In [1]:
!pip install -q reportlab
!pip install -q patool
!pip install -q PyMuPDF

[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.9/1.9 MB[0m [31m16.8 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m77.5/77.5 kB[0m [31m844.8 kB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m14.1/14.1 MB[0m [31m50.3 MB/s[0m eta [36m0:00:00[0m
[?25h

## Uploading reports & reading file content

Below you will find 3 ways to upload the reports:


1.   Uploading from local machine - uploading reports stored on your PC/laptop (will need to archive the directories)
2.   Uploading from Google Drive - connecting to your Google Drive account and using the files stored in there
3. Reports already uploaded to Google Colab notebook



### OPTION 1 - Uploading from local machine

Please make sure your reports (pdf files) and directories (folders) are archived together for ease of upload. If you are using Windows, [7-zip](https://www.7-zip.org) or [WinRar](https://www.rarlab.com/download.htm) are excellent and free tools that can be used to do this.

Please refer to [this guide](https://www.howtogeek.com/276972/the-best-file-archiving-program-for-windows/) or [this guide](https://www.wikihow.com/Archive-Folders) for guidance on how to archive the folders.

In [None]:
# Necessary imports for file handling
import os
from google.colab import files
import re
import patoolib

# Prompt the user to upload files
uploaded_files = files.upload()

# Unpack the archive
archive_name = next(iter(uploaded_files))
current_directory = os.getcwd()
patoolib.extract_archive(archive_name, outdir=current_directory)

# Get parent directory path
directory_path = os.path.join(current_directory, os.path.splitext(archive_name)[0])

# print(directory_path)
print("File extraction completed.")

Saving shoigu.zip to shoigu.zip
patool: Extracting shoigu.zip ...
patool: running /usr/bin/7z x -o/content -- shoigu.zip
patool: ... shoigu.zip extracted to `/content'.
File extraction completed.


### OPTION 2 - Uploading from Google Drive

Please make sure the folder containing the reports is named "Themis.AI" and is located at the root of your Google Drive!

In [None]:
from google.colab import drive
import os

# Mount Google Drive
drive.mount("/content/drive")

# Path to the "My Drive" directory
my_drive_path = "/content/drive/My Drive"

# Check if "Themis.AI" folder exists within "My Drive"
themis_folder_path = os.path.join(my_drive_path, "Themis.AI_test")
if os.path.exists(themis_folder_path):
    directory_path = os.path.join(themis_folder_path, os.listdir(themis_folder_path)[0])
else:
    print("Could not find 'Themis.AI' folder in 'My Drive'")

Mounted at /content/drive
Could not find 'Themis.AI' folder in 'My Drive'


In [None]:
# Unmount drive
drive.flush_and_unmount()
print("All changes made in this colab session should now be visible in Drive.")

### OPTION 3 - Reports already uploaded to Colab

In [None]:
import os

person_id = input("Please input the person's ID whose reports were uploaded: ")

current_dir = os.getcwd()

directory_path = os.path.join(current_dir, person_id)

Please input the person's ID whose reports were uploaded: shoigu


## Reading file content

In [None]:
import fitz  # Import the fitz module from PyMuPDF

# Initialization of some variables
documents = {}

# Retrieve the person ID from the directory path
person_id = os.path.basename(directory_path)

# Check if the directory exists
if os.path.exists(directory_path):
    # Traverse through the report type directories
    for root, dirs, files in os.walk(directory_path):
        for directory in dirs:
            report_type = directory.lower().strip()

            # Store the report type if it's not already stored
            if report_type not in documents:
                documents[report_type] = []

            # Traverse through the PDF files
            pdf_files = []
            for file in os.listdir(os.path.join(root, directory)):
                if file.endswith(".pdf"):
                    pdf_files.append(file)

            # Initialize the concatenated text for each report
            report_texts = []

            # Read and concatenate the text from PDF files
            for pdf_file in pdf_files:
                file_path = os.path.join(root, directory, pdf_file)

                # Read the PDF file content using PyMuPDF (fitz)
                pdf_document = fitz.open(file_path)
                pdf_text = [page.get_text().strip() for page in pdf_document]
                report_info = {
                    "file_name": pdf_file,
                    "report_text": " ".join(pdf_text).replace("\n", " "),
                }
                documents[report_type].append(report_info)
                pdf_document.close()


else:
    print("Directory not found.")

print(documents)

{'football': [{'file_name': 'shoigu-12072023-football.pdf', 'report_text': 'Musician Alex Yatsun’s house was shelled by Russian forces, but he has focused the trauma  and apocalyptic feeling into atmospheric tracks that help him get ‘out of reality’  When the Russian invasion of Ukraine began in 2022, Alex Yatsun was living just 30km  from the Russian border in the northernmost part of Kharkiv. “When I woke up that day I  started living in a completely different reality,” he recalls. “There were bombs falling every  hour.”-  Yatsun’s family evacuated but he soon returned north to more dangerous territory to  volunteer at a medical centre. “But my house was hit by shelling,” he recalls. A photograph  on his Instagram shows the aftermath: huge chunks of wall blown out, smashed windows, a  mangled front door. “That was the moment I decided to move closer to central Ukraine.”  Despite the difficult and often harrowing backdrop of the last year, the 24-year-old has  managed to produce a new

# Preparing the models & documents

## Choosing number of reports to be used for summarization
**WARNING: SOME MODELS HAVE WORD INPUT LIMITS, PLEASE REFER BELOW TO HOW MANY WORDS CAN BE USED FOR ONE SUMMARIZATION TASK**


*   PRIMERA: 4096 tokens for a 'report cluster' (each document will be 4096 tokens divided by number of documents in the cluster).
*   BRIO: 512 (BART base) or 1024 tokens (PEGASUS base) - depending on model or dataset used, result may vary, but the 512 tokens version is recommended.
*   EfactSum: 512 (BART base) or 1024 tokens (PEGASUS base) - depending on model or dataset used, result may vary, but the 512 tokens version is recommended.
* Unlimiformer: "unlimited input"

In [None]:
# Print all the keys of the "documents" dictionary along with the count of reports
print("Available report types:")
for report_type in list(documents.keys()):
    print(f"{report_type} - {len(documents[report_type])} report(s)")

# Ask the user for input to select report types
chosen_reports = []
chosen_texts = {}
report_types = list(documents.keys())
while True:
    user_input = input(
        "Enter the report types you want to choose (separated by commas), or 'all' for all reports: "
    )
    if user_input.lower() == "all":
        chosen_reports = report_types
        break

    chosen_reports = [report.strip().lower() for report in user_input.split(",")]

    # Validate user input
    invalid_reports = [
        report for report in chosen_reports if report not in report_types
    ]
    if len(invalid_reports) > 0:
        print("Invalid report types:", ", ".join(invalid_reports))
    else:
        break

# Create a new dictionary containing the text of the chosen documents
for report in chosen_reports:
    print(f"\nAvailable reports for '{report}':")
    for index, report_info in enumerate(documents[report], 1):
        print(f"{index}. {report_info['file_name']}")
    while True:
        user_input = input(
            f"Enter the index of the report(s) you want to choose for '{report}' (separated by commas), or 'all' for all reports: "
        )
        if user_input.lower() == "all":
            chosen_texts[report] = [
                report_info["report_text"] for report_info in documents[report]
            ]
            break

        chosen_indices = [
            int(idx.strip())
            for idx in user_input.split(",")
            if 1 <= int(idx.strip()) <= len(documents[report])
        ]
        try:
            chosen_texts[report] = [
                documents[report][idx - 1]["report_text"] for idx in chosen_indices
            ]
            break
        except ValueError:
            print("Invalid input. Please enter valid indices or 'all' for all reports.")
            continue

print(chosen_texts)

Available report types:
football - 1 report(s)
police - 1 report(s)
school - 2 report(s)
Enter the report types you want to choose (separated by commas), or 'all' for all reports: all

Available reports for 'football':
1. shoigu-12072023-football.pdf
Enter the index of the report(s) you want to choose for 'football' (separated by commas), or 'all' for all reports: all

Available reports for 'police':
1. shoigu-10072023-police.pdf
Enter the index of the report(s) you want to choose for 'police' (separated by commas), or 'all' for all reports: all

Available reports for 'school':
1. shoigu-12072023-school.pdf
2. shoigu-10072023-school.pdf
Enter the index of the report(s) you want to choose for 'school' (separated by commas), or 'all' for all reports: all
{'football': ['Musician Alex Yatsun’s house was shelled by Russian forces, but he has focused the trauma  and apocalyptic feeling into atmospheric tracks that help him get ‘out of reality’  When the Russian invasion of Ukraine began in 2

# Abstractive summarization models

In [2]:
!pip install -q transformers torch accelerate datasets sentencepiece evaluate

[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m7.5/7.5 MB[0m [31m50.4 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m251.2/251.2 kB[0m [31m21.9 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m519.3/519.3 kB[0m [31m38.9 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.3/1.3 MB[0m [31m63.7 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m81.4/81.4 kB[0m [31m8.9 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m268.8/268.8 kB[0m [31m26.0 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m7.8/7.8 MB[0m [31m63.7 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.3/1.3 MB[0m [31m53.2 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━

## Load the healthcare report from Google Drive

In [14]:
!apt install ocrmypdf
!pip install ocrmypdf

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following additional packages will be installed:
  fonts-droid-fallback fonts-noto-mono fonts-urw-base35 ghostscript
  icc-profiles-free libgs9 libgs9-common libidn12 libijs-0.35 libimagequant0
  libjbig2dec0 libqpdf28 libraqm0 mailcap mime-support pngquant poppler-data
  python3-bs4 python3-chardet python3-coloredlogs python3-html5lib
  python3-humanfriendly python3-img2pdf python3-lxml python3-olefile
  python3-packaging python3-pdfminer python3-pikepdf python3-pil
  python3-pluggy python3-renderpm python3-reportlab python3-reportlab-accel
  python3-soupsieve python3-tqdm python3-webencodings tesseract-ocr
  tesseract-ocr-eng tesseract-ocr-osd unpaper
Suggested packages:
  fonts-noto fonts-freefont-otf | fonts-freefont-ttf fonts-texgyre
  ghostscript-x ocrmypdf-doc python-watchdog img2pdf poppler-utils
  fonts-japanese-mincho | fonts-ipafont-mincho fonts-japanese-gothic
  | fonts-ipaf

In [15]:
from google.colab import drive
import os

# Mount Google Drive
drive.mount("/content/drive")

# Path to the "My Drive" directory
my_drive_path = "/content/drive/My Drive"

# Check if "Themis.AI" folder exists within "My Drive"
themis_folder_path = os.path.join(my_drive_path, "SummarisationCalin")
if os.path.exists(themis_folder_path):
    directory_path = os.path.join(themis_folder_path, os.listdir(themis_folder_path)[0])
else:
    print("Could not find 'Themis.AI' folder in 'My Drive'")

report_path = os.path.join(themis_folder_path, "Report")

Mounted at /content/drive


In [16]:
"""
Demo script using Mupdf OCR.

Extract text of a page and interpret unrecognized characters using Tesseract.
MuPDF codes unrecognizable characters as 0xFFFD = 65533.
Extraction option is "dict", which delivers contiguous text pieces within one
line, that have the same font properties (color, fontsize, etc.). Together with
the language parameter, this helps Tesseract finding the correct character.

The basic approach is to only invoke OCR, if the span text contains
chr(65533). Because Tesseract's response ignores leading spaces and appends
line break characters, some adjustments are made.

--------------
This demo will OCR only text, that is known to be text. This means, it
does not look at parts of a page containing images or text encoded as drawings.
--------------

Dependencies:
PyMuPDF v1.19.0
"""
import os
os.environ["TESSDATA_PREFIX"] = "/usr/share/tesseract-ocr/4.00/tessdata"
import fitz
import time

mat = fitz.Matrix(5, 5)  # high resolution matrix
ocr_time = 0
pix_time = 0
INVALID_UNICODE = chr(0xFFFD)  # the "Invalid Unicode" character


def get_tessocr(page, bbox):
    """Return OCR-ed span text using Tesseract.

    Args:
        page: fitz.Page
        bbox: fitz.Rect or its tuple
    Returns:
        The OCR-ed text of the bbox.
    """
    global ocr_time, pix_time, tess, mat
    # Step 1: Make a high-resolution image of the bbox.
    t0 = time.perf_counter()
    pix = page.get_pixmap(
        matrix=mat,
        clip=bbox,
    )
    t1 = time.perf_counter()
    ocrpdf = fitz.open("pdf", pix.pdfocr_tobytes())
    ocrpage = ocrpdf[0]
    text = ocrpage.get_text()
    if text.endswith("\n"):
        text = text[:-1]
    t2 = time.perf_counter()
    ocr_time += t2 - t1
    pix_time += t1 - t0
    return text

pdf_file = "HSIB Final Report.pdf"
file_path = os.path.join(report_path, pdf_file)
chosen_pages = [34, 35, 36]
page_text_list = []

doc = fitz.open(file_path)
ocr_count = 0
for page in doc:
    if page.number in chosen_pages:
      page_text = ""
      blocks = page.get_text("dict", flags=0)["blocks"]
      for b in blocks:
          for l in b["lines"]:
              for s in l["spans"]:
                  text = s["text"]
                  if INVALID_UNICODE in text:  # invalid characters encountered!
                      # invoke OCR
                      ocr_count += 1
                      # print("before: '%s'" % text)
                      text1 = text.lstrip()
                      sb = " " * (len(text) - len(text1))  # leading spaces
                      text1 = text.rstrip()
                      sa = " " * (len(text) - len(text1))  # trailing spaces
                      new_text = sb + get_tessocr(page, s["bbox"]) + sa
                      # print(" after: '%s'" % new_text)
                      page_text += new_text
                  else:
                      page_text += text

      page_text_list.append(page_text)


page_text_list[0] = page_text_list[0] + page_text_list[1]
del page_text_list[1]


# Print accumulated text for each page
for page_num, page_text in enumerate(page_text_list):
    print(f"Page {chosen_pages[page_num]} Text:")
    print(page_text)
    print("-------------------------")

Page 34 Text:
 35  Section 6. HSIB Findings and Safety Recommendations  6.1 Findings 1. The Mother booked for maternity care, having returned to the UK from overseas. She had no documentation of the care she had received. 2. The Mothers estimated date of birth from overseas place her at 37+4 weeks. After clinical review the ongoing intention was to utilise this date, this was not clearly documented and late gestation EDD from the Trust USS was used. 3. As an EDD from a late USS was used in care planning. This placed the Mother’s pregnancy two weeks and four days earlier than the correct gestation.  4. The Mother had delivered by prior CS and requested this mode of delivery again. Latent phase of labour occurred prior to the planned CS date. 5. The Mother presented in the latent phase of labour at 36+2 weeks (Trust USS), 38+6 (overseas USS).  From admission the gestation was communicated as 36+2 weeks, this may have influenced the clinical decisions in not proceeding with a CS at the Mo

## PRIMERA Model

Code taken from https://github.com/allenai/PRIMER

Notes on usage:

- Model is pre-trained on a news-based dataset, so it may hallucinate when used in a different task/context.
- Model can handle ~2000 words at once.
- Model does not work well if the input length is too low (less than 300-500 words) or if too high (more than ~2500 words)
- Model can handle multiple documents on the same topic, but pre-processing beforehand is necessary.
- While the model can handle bigger inputs & multiple documents, the summarization process can take longer in terms of time (can be over 10 minutes!)

Evaluation results:

| Dataset | ROUGUE-1 | ROUGUE-2 | ROUGUE-L|
| --- | --- | --- | --- |
| Multi-news | 46.93 | 18.86 | 26.78 |
| CNNDM | 12.37 | 0.96 | 8.12 |

For more details on the metric, check out:

    ROUGE-n recall=40% means that 40% of the n-grams in the reference summary are also present in the generated summary.
    ROUGE-n precision=40% means that 40% of the n-grams in the generated summary are also present in the reference summary.
    ROUGE-n F1-score=40% is more difficult to interpret, like any F1-score.

https://webcache.googleusercontent.com/search?q=cache:https://towardsdatascience.com/introduction-to-text-summarization-with-rouge-scores-84140c64b471

https://webcache.googleusercontent.com/search?q=cache:https://towardsdatascience.com/the-ultimate-performance-metric-in-nlp-111df6c64460


### Importing modules & initializing variables used for the model

In [25]:
# Importing necessary modules
from transformers import (
    AutoTokenizer,
    LEDForConditionalGeneration,
    LEDConfig,
)
from datasets import load_dataset
import torch
import evaluate


# Initializing variables
TOKENIZER = AutoTokenizer.from_pretrained("allenai/PRIMERA-multinews")
CONFIG = LEDConfig.from_pretrained("allenai/PRIMERA-multinews")
MODEL = LEDForConditionalGeneration.from_pretrained("allenai/PRIMERA-multinews", config=CONFIG)
# MODEL.gradient_checkpointing_enable()
PAD_TOKEN_ID = TOKENIZER.pad_token_id
DOCSEP_TOKEN_ID = TOKENIZER.convert_tokens_to_ids("<doc-sep>")

### Pre-processing the document(s) specifically for that model

In [None]:
print(chosen_texts)

# Create a new list containing the concatenated texts of the chosen documents
concatenated_texts_list = []
for report_texts in chosen_texts.values():
    concatenated_texts_list.append("|||||".join(report_texts))

print(concatenated_texts_list)

print(len(concatenated_texts_list))

{'football': ['Musician Alex Yatsun’s house was shelled by Russian forces, but he has focused the trauma  and apocalyptic feeling into atmospheric tracks that help him get ‘out of reality’  When the Russian invasion of Ukraine began in 2022, Alex Yatsun was living just 30km  from the Russian border in the northernmost part of Kharkiv. “When I woke up that day I  started living in a completely different reality,” he recalls. “There were bombs falling every  hour.”-  Yatsun’s family evacuated but he soon returned north to more dangerous territory to  volunteer at a medical centre. “But my house was hit by shelling,” he recalls. A photograph  on his Instagram shows the aftermath: huge chunks of wall blown out, smashed windows, a  mangled front door. “That was the moment I decided to move closer to central Ukraine.”  Despite the difficult and often harrowing backdrop of the last year, the 24-year-old has  managed to produce a new 21-track compilation album as DJ Sacred, entitled Dungeon Ra

### Use Transformer pipelines to summarize text  (high level & easier to use)


It is possible to change the following parameters within the pipe() call to get a different summarization result:

* Max_length: Set the maximum length of the input + output sequence (can use max_new_tokens to set just the output sequence length)
* Min_length: Set the minimum length of the output sequence (recommended to set max_length as well, and it should be longer than min_length)
* Temperature: Temperature affects how “random” the model’s output is. Lower value = lower "randomness".
* Top_p: Arrange given tokens by probability, select the fewest needed to reach a cumulative probability of at least p, and then sample from them (need to have do_sample set to 'True').

For more details on temperature, top_p: https://peterchng.com/blog/2023/05/02/token-selection-strategies-top-k-top-p-and-temperature/

In [None]:
# Use a pipeline as a high-level helper
import torch
from transformers import pipeline

pipe = pipeline(
    task = "text2text-generation",
    model = MODEL,
    tokenizer = TOKENIZER,
    # torch_dtype=torch.bfloat16,
    # device="auto"
)

# Use model
result = pipe(
    concatenated_texts_list,
    use_cache = True,
    # min_length = 256,
    num_beams = 5,
    # max_length = 1024,
    pad_token_id = TOKENIZER.pad_token_id,
    bos_token_id = TOKENIZER.bos_token_id,
    eos_token_id = TOKENIZER.eos_token_id,
    # do_sample=True, # Only necessary to enable if want to use temperature or top_p parameters
    # temperature=0.1, # Will control
    # top_p=0.3
    )

print(result)

[{'generated_text': '– When the Russian invasion of Ukraine began in 2022, Alex Yatsun was living just 30km from the Russian border in the northernmost part of Kharkiv. "When I woke up that day I started living in a completely different reality," he recalls. "There were bombs falling every hour." His family evacuated, but he returned to volunteer at a medical center. "But my house was hit by shelling," he says. "That was the moment I decided to move closer to central Ukraine." Despite the difficult and often harrowing backdrop of the last year, the 24-year-old has managed to produce a new 21-track compilation album as DJ Sacred, entitled Dungeon Rap: The Evolution. It\'s a follow-up to 2019\'s, Dungeon rap: the Introduction. The album is a new hybrid style of hip-hop created by Yatsin, the Independent reports.'}, {'generated_text': "– A senior Russian draft officer and former submarine commander accused by Ukraine of deadly strikes on its territory has been shot dead while jogging in t

## BRIO Model

Code taken from https://github.com/yixinL7/BRIO

Notes on usage:

- Model is pre-trained on a news-based dataset, so it may hallucinate when used in a different task/context.
- Model can handle ~350-750 words at once (depending on the base model used).
- Model only handly a single documents at a time.
- Model processing time is usually quite fast, depending on chosen parameters (1-3 minutes per document).

Evaluation results:

| Dataset | ROUGUE-1 | ROUGUE-2 | ROUGUE-L|
| --- | --- | --- | --- |
| Multi-news | 26.4 | 9.9 | 16.2 |
| CNNDM | 46.8 | 22.4 | 31.5 |

For more details, check out:

    ROUGE-n recall=40% means that 40% of the n-grams in the reference summary are also present in the generated summary.
    ROUGE-n precision=40% means that 40% of the n-grams in the generated summary are also present in the reference summary.
    ROUGE-n F1-score=40% is more difficult to interpret, like any F1-score.

https://webcache.googleusercontent.com/search?q=cache:https://towardsdatascience.com/introduction-to-text-summarization-with-rouge-scores-84140c64b471

https://webcache.googleusercontent.com/search?q=cache:https://towardsdatascience.com/the-ultimate-performance-metric-in-nlp-111df6c64460


### Load the model using Hugging Face library

BRIO comes with 2 options of pre-trained weights to be loaded:
- PEGASUS-base: Trained on XSUM dataset (https://huggingface.co/datasets/xsum), has a maximum input length of 1024 tokens. Output will be a single sentence.
- BART-base: Trained on CNNDM dataset (https://huggingface.co/datasets/cnn_dailymail), has a maximum input length of 512 tokens. Output will be one or multiple sentences (depending on input length).

To choose the model, please change the value of IS_CNNDM to either:
- True, for BART-base
- False, for PEGASUS-base

In [18]:
from transformers import BartTokenizer, PegasusTokenizer
from transformers import BartForConditionalGeneration, PegasusForConditionalGeneration

IS_CNNDM = True # whether to use CNNDM dataset (BART-base) or XSum dataset (PEGASUS-base)
LOWER = False

# Load our model checkpoints
if IS_CNNDM:
    model = BartForConditionalGeneration.from_pretrained('Yale-LILY/brio-cnndm-uncased')
    tokenizer = BartTokenizer.from_pretrained('Yale-LILY/brio-cnndm-uncased')
else:
    model = PegasusForConditionalGeneration.from_pretrained('Yale-LILY/brio-xsum-cased')
    tokenizer = PegasusTokenizer.from_pretrained('Yale-LILY/brio-xsum-cased')

max_length = 1024 if IS_CNNDM else 512

### Use the model to summarize text

It is possible to change the following parameters within the generate() call to get a different summarization result:

* Max_length: Set the maximum length of the input + output sequence (can use max_new_tokens to set just the output sequence length)
* Min_length: Set the minimum length of the output sequence (recommended to set max_length as well, and it should be longer than min_length)
* Temperature: Temperature affects how “random” the model’s output is. Lower value = lower "randomness".
* Top_p: Arrange given tokens by probability, select the fewest needed to reach a cumulative probability of at least p, and then sample from them (need to have do_sample set to 'True').

For more details on temperature, top_p: https://peterchng.com/blog/2023/05/02/token-selection-strategies-top-k-top-p-and-temperature/

In [None]:
# Initialize an empty list to store all the resulted summaries
result = []

# Loop through each report type in the chosen_texts dictionary
for report_type, documents in chosen_texts.items():
    # Loop through each document in the list for the current report type
    for document in documents:
        if LOWER:
            article = document.lower()
        else:
            article = document

        # Tokenize the document and generate the summary
        inputs = tokenizer([article], max_length=max_length, return_tensors="pt", truncation=True)
        summary_ids = model.generate(
            inputs["input_ids"],
            # max_length=1024,
            # min_length=128,
            # do_sample=True,
            # temperature=0.1,
            # top_p=0.3
            )
        summary = tokenizer.batch_decode(summary_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]


        # Append the summary to the list of resulted_summaries
        result.append(summary)

# Print all the summaries
for idx, summary in enumerate(result, 1):
    print(f"Summary {idx}: {summary}")

## EFactsum

Code taken from https://github.com/tanay2001/efactsum

Notes on usage:

- Model is pre-trained on a news-based dataset, so it may hallucinate when used in a different task/context.
- Model can handle ~350-750 words at once (depending on the base model used).
- Model only handly a single documents at a time.
- Model processing time is usually quite fast, depending on chosen parameters (1-3 minutes per document).

Evaluation results:

| Dataset | ROUGUE-1 | ROUGUE-2 | ROUGUE-L|
| --- | --- | --- | --- |
| Multi-news | 25 | 8.5 | 15 |
| CNNDM | 46 | 23.1 | 30.6 |

For more details, check out:

    ROUGE-n recall=40% means that 40% of the n-grams in the reference summary are also present in the generated summary.
    ROUGE-n precision=40% means that 40% of the n-grams in the generated summary are also present in the reference summary.
    ROUGE-n F1-score=40% is more difficult to interpret, like any F1-score.

https://webcache.googleusercontent.com/search?q=cache:https://towardsdatascience.com/introduction-to-text-summarization-with-rouge-scores-84140c64b471

https://webcache.googleusercontent.com/search?q=cache:https://towardsdatascience.com/the-ultimate-performance-metric-in-nlp-111df6c64460


### Load the model using Hugging Face library

EFactSum comes with 2 options of pre-trained weights to be loaded:

- PEGASUS-base: Trained on XSUM dataset (https://huggingface.co/datasets/xsum), has a maximum input length of 1024 tokens. Output will be a single sentence.
- BART-base: Trained on CNNDM dataset (https://huggingface.co/datasets/cnn_dailymail), has a maximum input length of 512 tokens. Output will be one or multiple sentences (depending on input length).

To choose the model, please change the value of IS_CNNDM to either:

- True, for BART-base
- False, for PEGASUS-base

In [27]:
from transformers import BartTokenizer, PegasusTokenizer
from transformers import BartForConditionalGeneration, PegasusForConditionalGeneration

IS_CNNDM = True
max_length = 1024 if IS_CNNDM else 512

if IS_CNNDM:
    model = BartForConditionalGeneration.from_pretrained('tanay/efactsum-bart-cnndm')
    tokenizer = BartTokenizer.from_pretrained('tanay/efactsum-bart-cnndm')
else:
    model = PegasusForConditionalGeneration.from_pretrained('tanay/efactsum-pegasus-xsum')
    tokenizer = PegasusTokenizer.from_pretrained('tanay/efactsum-pegasus-xsum')

### Use the model to summarize text

It is possible to change the following parameters within the generate() call to get a different summarization result:

* Max_length: Set the maximum length of the input + output sequence (can use max_new_tokens to set just the output sequence length)
* Min_length: Set the minimum length of the output sequence (recommended to set max_length as well, and it should be longer than min_length)
* Temperature: Temperature affects how “random” the model’s output is. Lower value = lower "randomness".
* Top_p: Arrange given tokens by probability, select the fewest needed to reach a cumulative probability of at least p, and then sample from them (need to have do_sample set to 'True').

For more details on temperature, top_p: https://peterchng.com/blog/2023/05/02/token-selection-strategies-top-k-top-p-and-temperature/

In [None]:
# Initialize an empty list to store all the resulted summaries
result = []

# Loop through each report type in the chosen_texts dictionary
for report_type, documents in chosen_texts.items():
    # Loop through each document in the list for the current report type
    for document in documents:
        if LOWER:
            article = document.lower()
        else:
            article = document

        # Tokenize the document and generate the summary
        inputs = tokenizer([article], max_length=max_length, return_tensors="pt", truncation=True)
        summary_ids = model.generate(
            inputs["input_ids"]
            # max_length=1024,
            # min_length=128,
            # do_sample=True,
            # temperature=0.1,
            # top_p=0.3
            )
        summary = tokenizer.batch_decode(summary_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]


        # Append the summary to the list of resulted_summaries
        result.append(summary)

# Print all the summaries
for idx, summary in enumerate(result, 1):
    print(f"Summary {idx}: {summary}")

## Unlimiformer model



Code taken from https://github.com/abertsch72/unlimiformer

Notes on usage:

- Model is pre-trained on a US government report dataset, so it may hallucinate when used in a different task/context.
- Model can handle an unlimited input context (tested up to ~6000 words).
- Model only handly a single documents at a time.
- Model processing time is usually quite slow (can take up to 15 minutes per document).

Evaluation results:

| Dataset | ROUGUE-1 | ROUGUE-2 | ROUGUE-L|
| --- | --- | --- | --- |
| Multi-news |  |  |  |
| CNNDM | 26.3 | 10.3 | 17 |

For more details, check out:

    ROUGE-n recall=40% means that 40% of the n-grams in the reference summary are also present in the generated summary.
    ROUGE-n precision=40% means that 40% of the n-grams in the generated summary are also present in the reference summary.
    ROUGE-n F1-score=40% is more difficult to interpret, like any F1-score.

https://webcache.googleusercontent.com/search?q=cache:https://towardsdatascience.com/introduction-to-text-summarization-with-rouge-scores-84140c64b471

https://webcache.googleusercontent.com/search?q=cache:https://towardsdatascience.com/the-ultimate-performance-metric-in-nlp-111df6c64460


### Install & import necessary modules

In [36]:
!git clone https://github.com/abertsch72/unlimiformer.git
!pip install -q -r unlimiformer/requirements.txt
!pip install -q faiss-cpu
%cd unlimiformer/src

Cloning into 'unlimiformer'...
remote: Enumerating objects: 449, done.[K
remote: Counting objects: 100% (307/307), done.[K
remote: Compressing objects: 100% (97/97), done.[K
remote: Total 449 (delta 237), reused 264 (delta 210), pack-reused 142[K
Receiving objects: 100% (449/449), 301.72 KiB | 4.19 MiB/s, done.
Resolving deltas: 100% (302/302), done.
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.1/1.1 MB[0m [31m3.5 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m2.1/2.1 MB[0m [31m26.1 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m188.5/188.5 kB[0m [31m12.8 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m215.6/215.6 kB[0m [31m14.9 MB/s[0m eta [36m0:00:00[0m
[?25h  Preparing metadata (setup.py) ... [?25l[?25hdone
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m62.7/62.7 kB[0m [31m5.3 MB/s[0m eta [

In [43]:
from unlimiformer import Unlimiformer
from random_training_unlimiformer import RandomTrainingUnlimiformer
from usage import UnlimiformerArguments, training_addin

from transformers import BartForConditionalGeneration, AutoTokenizer
from datasets import load_dataset
import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# example using govreport
modelname = "abertsch/unlimiformer-bart-govreport-alternating"

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained(modelname)

defaults = UnlimiformerArguments()
unlimiformer_kwargs = {
            'layer_begin': defaults.layer_begin,
            'layer_end': defaults.layer_end,
            'unlimiformer_head_num': defaults.unlimiformer_head_num,
            'exclude_attention': defaults.unlimiformer_exclude,
            'chunk_overlap': defaults.unlimiformer_chunk_overlap,
            'model_encoder_max_len': defaults.unlimiformer_chunk_size,
            'verbose': defaults.unlimiformer_verbose, 'tokenizer': tokenizer,
            'unlimiformer_training': defaults.unlimiformer_training,
            'use_datastore': defaults.use_datastore,
            'flat_index': defaults.flat_index,
            'test_datastore': defaults.test_datastore,
            'reconstruct_embeddings': defaults.reconstruct_embeddings,
            'gpu_datastore': defaults.gpu_datastore,
            'gpu_index': defaults.gpu_index
}

model.to(device)

model = Unlimiformer.convert_model(model, **unlimiformer_kwargs)
model.eval()
model.to(device)

BartForConditionalGeneration(
  (model): BartModel(
    (shared): Embedding(50265, 768, padding_idx=1)
    (encoder): BartEncoder(
      (embed_tokens): Embedding(50265, 768, padding_idx=1)
      (embed_positions): BartLearnedPositionalEmbedding(1026, 768)
      (layers): ModuleList(
        (0-5): 6 x BartEncoderLayer(
          (self_attn): BartAttention(
            (k_proj): Linear(in_features=768, out_features=768, bias=True)
            (v_proj): Linear(in_features=768, out_features=768, bias=True)
            (q_proj): Linear(in_features=768, out_features=768, bias=True)
            (out_proj): Linear(in_features=768, out_features=768, bias=True)
          )
          (self_attn_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (activation_fn): GELUActivation()
          (fc1): Linear(in_features=768, out_features=3072, bias=True)
          (fc2): Linear(in_features=3072, out_features=768, bias=True)
          (final_layer_norm): LayerNorm((768,), eps=

### Summarize text

In [38]:
# print(f"INPUT LENGTH (tokens): {example['input_ids'].shape[-1]}")

example = tokenizer(example_input, truncation=False, return_tensors="pt")
example.to(device)

unlimiformer_out = tokenizer.batch_decode(model.generate(**example, max_length=512), ignore_special_tokens=True)[0]
print(unlimiformer_out)

Downloading (…)lve/main/config.json:   0%|          | 0.00/1.72k [00:00<?, ?B/s]

Downloading (…)olve/main/vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

Downloading (…)olve/main/merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

KeyboardInterrupt: ignored

### Evaluate ROUGE

In [44]:
def process_dataset(batch):
    items = batch['article']
    generated_summaries = []

    for item in items:
        example = tokenizer(item, truncation=False, return_tensors="pt")
        unlimiformer_out = tokenizer.batch_decode(model.generate(**example, max_length=512), ignore_special_tokens=True)[0]
        generated_summaries.append(unlimiformer_out)

    result = {'generated': generated_summaries}
    return result

In [45]:
# Importing necessary modules
!pip install -q rouge_score
from datasets import load_dataset
import evaluate
import torch
import random

dataset=load_dataset('cnn_dailymail','3.0.0')
data_idx = random.choices(range(len(dataset['test'])),k=10)
dataset_small = dataset['test'].select(data_idx)
result_small = dataset_small.map(process_dataset, batched=True, batch_size=2)

rouge = evaluate.load("rouge")
score = rouge.compute(predictions=result_small['generated'], references=dataset_small['highlights'])
print(score['rouge1'])
print(score['rouge2'])
print(score['rougeL'])


Map:   0%|          | 0/10 [00:00<?, ? examples/s]

Token indices sequence length is longer than the specified maximum sequence length for this model (1765 > 1024). Running this sequence through the model will result in indexing errors
INFO:Unlimiformer:Encoding 0 to 1024 out of 1765
INFO:Unlimiformer:Encoding 512 to 1536 out of 1765
INFO:Unlimiformer:Encoding 741 to 1765 out of 1765
INFO:Unlimiformer:Encoding 0 to 1024 out of 1120
INFO:Unlimiformer:Encoding 96 to 1120 out of 1120
INFO:Unlimiformer:Encoding 0 to 726 out of 726
INFO:Unlimiformer:Encoding 0 to 740 out of 740
INFO:Unlimiformer:Encoding 0 to 763 out of 763
INFO:Unlimiformer:Encoding 0 to 642 out of 642
INFO:Unlimiformer:Encoding 0 to 968 out of 968
INFO:Unlimiformer:Encoding 0 to 1011 out of 1011
INFO:Unlimiformer:Encoding 0 to 515 out of 515
INFO:Unlimiformer:Encoding 0 to 860 out of 860
Exception ignored in: <function tqdm.__del__ at 0x7ae3123e9cf0>
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/tqdm/std.py", line 1149, in __del__
    s

0.2626474292488438
0.10255731051972579
0.16981808389650785


# Extractive summarization methods

## Lexrank

https://iq.opengenus.org/lexrank-text-summarization/

https://github.com/Tuhin-SnapD/Text-Summarization-Models/blob/main/Basic%20to%20Advance%20Text%20Summarisation%20Models/LexRank.ipynb



In [None]:
# Some specific Lexrank requirements
import locale
locale.getpreferredencoding = lambda: "UTF-8"

!pip install lexrank

Collecting lexrank
  Downloading lexrank-0.1.0-py3-none-any.whl (69 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m69.8/69.8 kB[0m [31m705.7 kB/s[0m eta [36m0:00:00[0m
Collecting path.py>=10.5 (from lexrank)
  Downloading path.py-12.5.0-py3-none-any.whl (2.3 kB)
Collecting urlextract>=0.7 (from lexrank)
  Downloading urlextract-1.8.0-py3-none-any.whl (21 kB)
Collecting path (from path.py>=10.5->lexrank)
  Downloading path-16.7.1-py3-none-any.whl (25 kB)
Collecting uritools (from urlextract>=0.7->lexrank)
  Downloading uritools-4.0.1-py3-none-any.whl (10 kB)
Installing collected packages: uritools, path, urlextract, path.py, lexrank
Successfully installed lexrank-0.1.0 path-16.7.1 path.py-12.5.0 uritools-4.0.1 urlextract-1.8.0


In [None]:
import nltk
nltk.download('punkt')  # Download the necessary data for sentence tokenization

from nltk.tokenize import sent_tokenize

# Create list from dictionary first

doc_list = []

for key, value_list in chosen_texts.items():
        doc_list.extend(value_list)

# Tokenize each document into sentences and create a list of lists
list_of_lists = []
for doc in doc_list:
    sentences = sent_tokenize(doc)
    list_of_lists.append(sentences)

# Print the resulting list of sentences
print(list_of_lists[0])

['A senior Russian draft officer and former submarine commander accused by Ukraine of  deadly strikes on its territory has been shot dead while jogging in the southern Russian city of  Krasnodar.', 'Stanislav Rzhitsky, 42, was killed on Monday by an unidentified gunman during a morning  run in a park near the Olimp sports centre, local police said.', 'Russian FSB security services said on Tuesday that a 64-year-old man was arrested on  suspicion of carrying out the attack.', 'At the time of his death, Rzhitsky was serving as the deputy head of the Krasnodar city  administration’s mobilisation.', 'According to the Russian daily newspaper Kommersant, Rzhitsky was previously the  commander of the Krasnodar submarine, named after the city, in the Russian navy.', 'The Ukrainian army said in a Telegram post on Tuesday that Rzhitsky was in command of a  submarine that carried out a deadly missile attack on the Ukrainian city of Vinnytsia in July  2022, killing 23 civilians.', 'Rzhitsky’s fath

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [None]:
from lexrank import LexRank
from lexrank.mappings.stopwords import STOPWORDS

lxr = LexRank(list_of_lists, stopwords=STOPWORDS['en'])

result = []
print(list_of_lists)

# get summary with classical LexRank algorithm
for text in list_of_lists:
  summary = lxr.get_summary(text, summary_size=8, threshold=.1)
  result.append(' '.join(summary))

print(result)



# get summary with continuous LexRank
# summary_cont = lxr.get_summary(list_of_lists[0], threshold=None)
# print(summary_cont)

# get LexRank scores for sentences
# 'fast_power_method' speeds up the calculation, but requires more RAM
# scores_cont = lxr.rank_sentences(
#     sentences,
#     threshold=None,
#     fast_power_method=False,
# )
# print(scores_cont)

[['A senior Russian draft officer and former submarine commander accused by Ukraine of  deadly strikes on its territory has been shot dead while jogging in the southern Russian city of  Krasnodar.', 'Stanislav Rzhitsky, 42, was killed on Monday by an unidentified gunman during a morning  run in a park near the Olimp sports centre, local police said.', 'Russian FSB security services said on Tuesday that a 64-year-old man was arrested on  suspicion of carrying out the attack.', 'At the time of his death, Rzhitsky was serving as the deputy head of the Krasnodar city  administration’s mobilisation.', 'According to the Russian daily newspaper Kommersant, Rzhitsky was previously the  commander of the Krasnodar submarine, named after the city, in the Russian navy.', 'The Ukrainian army said in a Telegram post on Tuesday that Rzhitsky was in command of a  submarine that carried out a deadly missile attack on the Ukrainian city of Vinnytsia in July  2022, killing 23 civilians.', 'Rzhitsky’s fat

## Memsum

https://github.com/nianlonggu/MemSum

In [None]:
!git clone https://github.com/nianlonggu/MemSum.git
!pip install -q torch torchvision torchaudio
!pip install -r MemSum/requirements.txt

Cloning into 'MemSum'...
remote: Enumerating objects: 381, done.[K
remote: Counting objects: 100% (115/115), done.[K
remote: Compressing objects: 100% (103/103), done.[K
remote: Total 381 (delta 53), reused 28 (delta 10), pack-reused 266[K
Receiving objects: 100% (381/381), 82.40 MiB | 13.16 MiB/s, done.
Resolving deltas: 100% (150/150), done.
Collecting rouge_score (from -r MemSum/requirements.txt (line 3))
  Downloading rouge_score-0.1.2.tar.gz (17 kB)
  Preparing metadata (setup.py) ... [?25l[?25hdone
Collecting pyrouge (from -r MemSum/requirements.txt (line 4))
  Downloading pyrouge-0.1.3.tar.gz (60 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m60.5/60.5 kB[0m [31m1.8 MB/s[0m eta [36m0:00:00[0m
[?25h  Preparing metadata (setup.py) ... [?25l[?25hdone
Collecting jupyterlab (from -r MemSum/requirements.txt (line 6))
  Downloading jupyterlab-4.0.3-py3-none-any.whl (9.2 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m9.2/9.2 MB[0m

In [None]:
import nltk
nltk.download('punkt')  # Download the necessary data for sentence tokenization

from nltk.tokenize import sent_tokenize

# Create list from dictionary first

doc_list = []

for key, value_list in chosen_texts.items():
        doc_list.extend(value_list)

# Tokenize each document into sentences and create a list of lists
list_of_lists = []
for doc in doc_list:
    sentences = sent_tokenize(doc)
    list_of_lists.append(sentences)

# Print the resulting list of sentences
print(list_of_lists[2])

['A senior Russian draft officer and former submarine commander accused by Ukraine of  deadly strikes on its territory has been shot dead while jogging in the southern Russian city of  Krasnodar.', 'Stanislav Rzhitsky, 42, was killed on Monday by an unidentified gunman during a morning  run in a park near the Olimp sports centre, local police said.', 'Russian FSB security services said on Tuesday that a 64-year-old man was arrested on  suspicion of carrying out the attack.', 'At the time of his death, Rzhitsky was serving as the deputy head of the Krasnodar city  administration’s mobilisation.', 'According to the Russian daily newspaper Kommersant, Rzhitsky was previously the  commander of the Krasnodar submarine, named after the city, in the Russian navy.', 'The Ukrainian army said in a Telegram post on Tuesday that Rzhitsky was in command of a  submarine that carried out a deadly missile attack on the Ukrainian city of Vinnytsia in July  2022, killing 23 civilians.', 'Rzhitsky’s fath

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [None]:
%cd MemSum
from huggingface_hub import snapshot_download
## download the pretrained glove word embedding (200 dimension)
snapshot_download('nianlong/memsum-word-embedding', local_dir = "model/word_embedding" )

## download model checkpoint on the arXiv dataset
# snapshot_download('nianlong/memsum-arxiv', local_dir = "model/memsum-arxiv" )

## download model checkpoint on the PubMed dataset
# snapshot_download('nianlong/memsum-pubmed', local_dir = "model/memsum-pubmed" )

## download model checkpoint on the Gov-Report dataset
snapshot_download('nianlong/memsum-gov-report', local_dir = "model/memsum-gov-report" )

/content/MemSum


Fetching 4 files:   0%|          | 0/4 [00:00<?, ?it/s]

Fetching 3 files:   0%|          | 0/3 [00:00<?, ?it/s]

'/content/MemSum/model/memsum-gov-report'

In [None]:
from src.summarizer import MemSum
from tqdm import tqdm
from rouge_score import rouge_scorer
import json
import numpy as np

rouge_cal = rouge_scorer.RougeScorer(['rouge1','rouge2', 'rougeLsum'], use_stemmer=True)

# memsum_arxiv = MemSum(  "model/memsum-arxiv/model.pt",
#                   "model/word_embedding/vocabulary_200dim.pkl",
#                   gpu = 0 ,  max_doc_len = 500  )

# memsum_pubmed = MemSum(  "model/memsum-pubmed/model.pt",
#                   "model/word_embedding/vocabulary_200dim.pkl",
#                   gpu = 0 ,  max_doc_len = 500  )

memsum_gov_report = MemSum(  "model/memsum-gov-report/model.pt",
                  "model/word_embedding/vocabulary_200dim.pkl",
                  gpu = 0 ,  max_doc_len = 500  )

In [None]:
result = []

for text in list_of_lists:
  extracted_summary = memsum_gov_report.extract([text],
                                    p_stop_thres = 0.6,
                                    max_extracted_sentences_per_document = 7
                                    )[0]
  result.append(' '.join(extracted_summary))

print(result)


["UK defence minister: 'people want to see a bit of  gratitude' from Ukraine for weapon supplies  Dan Sabbagh is in Vilnius for the Guardian and reports these words from Ben Wallace:  The British defence secretary suggested Ukraine needed to put more emphasis on  saying thank you for western help when he was asked about President Volodymyr  Zelenskiy’s complaints on Tuesday that the country had not been issued a firm  timetable or set of conditions for joining Nato. There was an  acceptance that “Ukraine belongs at Nato” and that amounted to an effective invitation for  membership whenever the conflict died down. Wallace revealed that he had travelled to Ukraine last year to be presented with a shopping  list of weapons. “I told them that last year, when I  drove 11 hours to be given a list.”  But he said he understood Zelenskiy was speaking to his own public and that despite his  complaint on Tuesday, the final summit deal was a good one for Ukraine. “Whether we like it or not, people

## BERTSum

https://github.com/dmmiller612/bert-extractive-summarizer

https://chriskhanhtran.github.io/posts/extractive-summarization-with-bert/

In [None]:
!pip install bert-extractive-summarizer

Collecting bert-extractive-summarizer
  Downloading bert_extractive_summarizer-0.10.1-py3-none-any.whl (25 kB)
Installing collected packages: bert-extractive-summarizer
Successfully installed bert-extractive-summarizer-0.10.1


In [None]:
from summarizer import Summarizer

result = []

model = Summarizer()

for report_type, documents in chosen_texts.items():
  for doc in documents:
    text = model(doc, ratio=0.4)
    result.append(text)

print(result)



['Musician Alex Yatsun’s house was shelled by Russian forces, but he has focused the trauma  and apocalyptic feeling into atmospheric tracks that help him get ‘out of reality’  When the Russian invasion of Ukraine began in 2022, Alex Yatsun was living just 30km  from the Russian border in the northernmost part of Kharkiv. “ Yatsun’s family evacuated but he soon returned north to more dangerous territory to  volunteer at a medical centre. “ Dungeon rap is a new hybrid style of hip-hop created by Yatsun. DJ Armok is more reliant on bass and more  heavily connected to Memphis rap, whereas Pillbox is more about the ethereal and  transcendental – like a sublime melancholic feeling.” The result is moody and atmospheric stuff. He’s assembled them in mega packs, containing thousands of samples, that  can be used as the basis for other people to make their own mutated twist on the genre. There is also a feeling of inescapable darkness to the music. It corresponds to my life and my beliefs that 



## Sbert

https://github.com/dmmiller612/bert-extractive-summarizer

In [None]:
!pip install -U -q sentence-transformers

[?25l     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/86.0 kB[0m [31m?[0m eta [36m-:--:--[0m[2K     [91m━━━━━━━━━━━━━[0m[91m╸[0m[90m━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m30.7/86.0 kB[0m [31m808.8 kB/s[0m eta [36m0:00:01[0m[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m86.0/86.0 kB[0m [31m1.3 MB/s[0m eta [36m0:00:00[0m
[?25h  Preparing metadata (setup.py) ... [?25l[?25hdone
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.3/1.3 MB[0m [31m9.2 MB/s[0m eta [36m0:00:00[0m
[?25h  Building wheel for sentence-transformers (setup.py) ... [?25l[?25hdone


In [None]:
from summarizer.sbert import SBertSummarizer

model = SBertSummarizer('paraphrase-MiniLM-L6-v2')

for report_type, documents in chosen_texts.items():
  for doc in documents:
    text = model(doc, ratio=0.4)
    result.append(text)

print(result)

Downloading (…)001fa/.gitattributes:   0%|          | 0.00/690 [00:00<?, ?B/s]

Downloading (…)_Pooling/config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

Downloading (…)3bbb8001fa/README.md:   0%|          | 0.00/3.69k [00:00<?, ?B/s]

Downloading (…)bb8001fa/config.json:   0%|          | 0.00/629 [00:00<?, ?B/s]

Downloading (…)ce_transformers.json:   0%|          | 0.00/122 [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/90.9M [00:00<?, ?B/s]

Downloading (…)nce_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

Downloading (…)001fa/tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/314 [00:00<?, ?B/s]

Downloading (…)3bbb8001fa/vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading (…)b8001fa/modules.json:   0%|          | 0.00/229 [00:00<?, ?B/s]



['Musician Alex Yatsun’s house was shelled by Russian forces, but he has focused the trauma  and apocalyptic feeling into atmospheric tracks that help him get ‘out of reality’  When the Russian invasion of Ukraine began in 2022, Alex Yatsun was living just 30km  from the Russian border in the northernmost part of Kharkiv. “ Yatsun’s family evacuated but he soon returned north to more dangerous territory to  volunteer at a medical centre. “ Dungeon rap is a new hybrid style of hip-hop created by Yatsun. DJ Armok is more reliant on bass and more  heavily connected to Memphis rap, whereas Pillbox is more about the ethereal and  transcendental – like a sublime melancholic feeling.” The result is moody and atmospheric stuff. He’s assembled them in mega packs, containing thousands of samples, that  can be used as the basis for other people to make their own mutated twist on the genre. There is also a feeling of inescapable darkness to the music. It corresponds to my life and my beliefs that 



## TransformerSum (RoBERTa or Longformer) - not working yet

https://github.com/HHousen/TransformerSum

In [None]:
!git clone https://github.com/HHousen/transformersum.git
%cd transformersum

Cloning into 'transformersum'...
remote: Enumerating objects: 1424, done.[K
remote: Counting objects: 100% (192/192), done.[K
remote: Compressing objects: 100% (101/101), done.[K
remote: Total 1424 (delta 99), reused 166 (delta 91), pack-reused 1232[K
Receiving objects: 100% (1424/1424), 11.98 MiB | 10.40 MiB/s, done.
Resolving deltas: 100% (883/883), done.
/content/transformersum


In [None]:
!pip install -q pytorch_lightning==1.6.5 transformers==4.* torch_optimizer==0.3.* wandb==0.14.* rouge-score==0.1.* packaging datasets==2.* gradio==3.* torch==2.0.* scikit-learn==1.2.* tensorboard spacy sphinx pyarrow pre-commit==3.2.*

[?25l     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/1.1 MB[0m [31m?[0m eta [36m-:--:--[0m[2K     [91m━[0m[90m╺[0m[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/1.1 MB[0m [31m1.2 MB/s[0m eta [36m0:00:01[0m[2K     [91m━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.2/1.1 MB[0m [31m3.2 MB/s[0m eta [36m0:00:01[0m[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.1/1.1 MB[0m [31m12.4 MB/s[0m eta [36m0:00:00[0m
[?25h[31mERROR: Cannot install protobuf==3.20.3 and pytorch-lightning==1.6.5 because these package versions have conflicting dependencies.[0m[31m
[0m[31mERROR: ResolutionImpossible: for help visit https://pip.pypa.io/en/latest/topics/dependency-resolution/#dealing-with-dependency-conflicts[0m[31m
[0m

In [None]:
!mkdir models
!gdown 1xlBJTO1LF5gIfDNvG33q8wVmvUB4jXYx
!mv epoch=3.ckpt models/epoch=3.ckpt

mkdir: cannot create directory ‘models’: File exists
Downloading...
From: https://drive.google.com/uc?id=1xlBJTO1LF5gIfDNvG33q8wVmvUB4jXYx
To: /content/transformersum/epoch=3.ckpt
100% 1.49G/1.49G [00:14<00:00, 103MB/s] 


In [None]:
import sys
sys.path.append('/content/transformersum/src')

from extractive import ExtractiveSummarizer
model = ExtractiveSummarizer.load_from_checkpoint("models/epoch=3.ckpt")

result = []

for report_type, documents in chosen_texts.items():
  for doc in documents:
    summary = model.predict(doc, num_summary_sentences=5)
    result.append(summary)

print(result)


Downloading (…)lve/main/config.json:   0%|          | 0.00/481 [00:00<?, ?B/s]

Downloading (…)olve/main/vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

Downloading (…)olve/main/merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

Token indices sequence length is longer than the specified maximum sequence length for this model (1454 > 512). Running this sequence through the model will result in indexing errors


['Musician Alex Yatsun ’s house was shelled by Russian forces , but he has focused the trauma   and apocalyptic feeling into atmospheric tracks that help him get ‘ out of reality ’   When the Russian invasion of Ukraine began in 2022 , Alex Yatsun was living just 30 km   from the Russian border in the northernmost part of Kharkiv “.   Despite the difficult and often harrowing backdrop of the last year , the 24 - year - old has   managed to produce a new 21 - track compilation album as DJ Sacred , entitled Dungeon Rap :   the Evolution. It ’s a follow - up to 2019 ’s , Dungeon Rap : the Introduction.   Dungeon rap is a new hybrid style of hip - hop created by Yatsun.   Yatsun ’s compilations feature a variety of his musical aliases , such as DJ Bishop , DJ Armok   and Pillbox.', 'A senior Russian draft officer and former submarine commander accused by Ukraine of   deadly strikes on its territory has been shot dead while jogging in the southern Russian city of   Krasnodar.   Stanislav Rz

## HISum - training in process

https://github.com/MySong7NLPer/HISum

In [None]:
!gdown 1PxMHpDSvP1OJfj1et4ToklevQzcPr-HQ
!git clone https://github.com/maszhongming/MatchSum.git

Downloading...
From: https://drive.google.com/uc?id=1PxMHpDSvP1OJfj1et4ToklevQzcPr-HQ
To: /content/MatchSum_cnndm_model.zip
100% 855M/855M [00:09<00:00, 87.8MB/s]
Cloning into 'MatchSum'...
remote: Enumerating objects: 65, done.[K
remote: Counting objects: 100% (24/24), done.[K
remote: Compressing objects: 100% (10/10), done.[K
remote: Total 65 (delta 16), reused 14 (delta 14), pack-reused 41[K
Receiving objects: 100% (65/65), 77.84 KiB | 2.43 MiB/s, done.
Resolving deltas: 100% (31/31), done.


In [None]:
!unzip MatchSum_cnndm_model.zip -d MatchSum

Archive:  MatchSum_cnndm_model.zip
  inflating: MatchSum/MatchSum_cnndm_bert.ckpt  
  inflating: MatchSum/MatchSum_cnndm_roberta.ckpt  


In [None]:
!pip install -q torch transformers

[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m7.4/7.4 MB[0m [31m18.2 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m268.8/268.8 kB[0m [31m31.7 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m7.8/7.8 MB[0m [31m59.5 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.3/1.3 MB[0m [31m65.1 MB/s[0m eta [36m0:00:00[0m
[?25h

In [None]:
# Create list from dictionary first

doc_list = []

for key, value_list in chosen_texts.items():
        doc_list.extend(value_list)

print(doc_list)

['Musician Alex Yatsun’s house was shelled by Russian forces, but he has focused the trauma  and apocalyptic feeling into atmospheric tracks that help him get ‘out of reality’  When the Russian invasion of Ukraine began in 2022, Alex Yatsun was living just 30km  from the Russian border in the northernmost part of Kharkiv. “When I woke up that day I  started living in a completely different reality,” he recalls. “There were bombs falling every  hour.”-  Yatsun’s family evacuated but he soon returned north to more dangerous territory to  volunteer at a medical centre. “But my house was hit by shelling,” he recalls. A photograph  on his Instagram shows the aftermath: huge chunks of wall blown out, smashed windows, a  mangled front door. “That was the moment I decided to move closer to central Ukraine.”  Despite the difficult and often harrowing backdrop of the last year, the 24-year-old has  managed to produce a new 21-track compilation album as DJ Sacred, entitled Dungeon Rap:  the Evolu

In [None]:
print(transformers.__version__)

import torch;
torch.__version__

4.31.0


'2.0.1+cu118'

In [None]:
%cd MatchSum
import torch
import transformers

torch.load('MatchSum_cnndm_roberta.ckpt')

[Errno 2] No such file or directory: 'MatchSum'
/content/MatchSum


ModuleNotFoundError: ignored

# Exporting summary to pdf and downloading the file.

## Creating a pdf file out of the summary result

In [None]:
from reportlab.pdfgen.canvas import Canvas
import textwrap
from datetime import date
from reportlab.lib.pagesizes import A4
from reportlab.lib.units import inch

# Create summary PDF file (if it doesn't exist, it will be automatically created)
canvas = Canvas("summary.pdf", pagesize=A4)

# Get A4 sizes
width, length = A4
top_indent = length - inch
left_indent = inch
right_indent = width - inch

# Set the title font, size, and alignment
canvas.setFont("Helvetica-Bold", 16)
canvas.drawCentredString((width * 0.5), (length - 40), "Themis.AI Summary")

# Add person's ID, today's date, and report types chosen
canvas.setFont("Times-Roman", 12)
canvas.drawString(left_indent, top_indent + 10, "Person ID: " + person_id)
canvas.drawString(left_indent, top_indent - 4, "Date: " + str(date.today()))
canvas.drawString(
    left_indent, top_indent - 18, "Report Types: " + ", ".join(chosen_reports)
)

# Set the text position and font
text_x = left_indent
text_y = top_indent - 42
canvas.setFont("Times-Roman", 12)

# Loop through the results
for result_item in result:
    if isinstance(result_item, dict):  # Handle dictionary output
        # Get the generated text from the dictionary
        text = result_item.get("generated_text", "")

        # Wrap the text and draw it on the canvas
        wrapped_text = textwrap.wrap(text, width=80)
        for line in wrapped_text:
            canvas.drawString(text_x, text_y, line)
            text_y -= 14
            if text_y < 14:
                canvas.showPage()
                text_y = top_indent - 28  # Adjust to the next page starting position

    elif isinstance(result_item, str):  # Handle list output (simple string)
        # Wrap the text and draw it on the canvas
        wrapped_text = textwrap.wrap(result_item, width=80)
        for line in wrapped_text:
            canvas.drawString(text_x, text_y, line)
            text_y -= 14
            if text_y < 14:
                canvas.showPage()
                text_y = top_indent - 28  # Adjust to the next page starting position

# Add the text to the PDF file & close it
canvas.save()


## Download the pdf file

In [None]:
from google.colab import files

files.download("summary.pdf")

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

# Utility cells

### In case it is necessary to delete a whole directory

**WARNING: DELETING THE FOLDER 'DRIVE' WILL ALSO DELETE ALL YOUR GOOGLE DRIVE FILES. DO NOT DELETE THE FOLDER DRIVE THROUGH GOOGLE COLAB!**

In [None]:
# Unmount drive
drive.flush_and_unmount()
print("All changes made in this colab session should now be visible in Drive.")

All changes made in this colab session should now be visible in Drive.


**ATTENTION: THIS WILL DELETE ALL THE FOLDERS WITHIN THAT DIRECTORY!**

In [None]:
directory_name = input("Please input the directory name ")
!rm -rf {directory_name}

Please input the directory name shoigu


## Code formatter

[Run only once, at startup]

    Connect to your drive

    from google.colab import drive
    drive.mount("/content/drive")

    Install black for jupyter

    !pip install black[jupyter]

    Restart kernel

[Then]

    Place your .ipynb file somewhere on your drive
    Anytime you want format your code run:
    !black /content/drive/MyDrive/YOUR_PATH/YOUR_NOTEBOOK.ipynb
    Don't save your notebook, hit F5 to refresh the page
    Voila!
    Now save!


In [None]:
# run once
!pip install black[jupyter] --quiet
from google.colab import drive

drive.mount("/content/drive")

[31mERROR: Operation cancelled by user[0m[31m
[0mDrive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
[31merror: cannot format /content/drive/MyDrive/Colab Notebooks/ThemisAI.ipynb: unindent does not match any outer indentation level (<tokenize>, line 24)[0m

[1mOh no! 💥 💔 💥[0m
[31m1 file failed to reformat[0m.


In [None]:
# run many times
!black /content/drive/MyDrive/'Colab Notebooks'/'ThemisAI.ipynb'

In [None]:
# Unmount drive
drive.flush_and_unmount()
print("All changes made in this colab session should now be visible in Drive.")

All changes made in this colab session should now be visible in Drive.


## Random cells that may be useful later

- Reading from healthcare report

### Mohit version

In [None]:
import os
import fitz  # Import the fitz module from PyMuPDF
from contextlib import redirect_stdout

# Initialization of the list to store report text
report_texts = []
errlist = []

# Check if the directory exists
if os.path.exists(report_path):
  try:
    chosen_pages = [34, 37]
    pdf_file = "HSIB Final Report.pdf"
    file_path = os.path.join(report_path, pdf_file)
    print(file_path)
    doc = fitz.open(file_path)
    text_by_page = [doc.get_page_text(i) for i in range(chosen_pages[0],chosen_pages[1])]
    print(text_by_page)
    X = ' '.join(text_by_page)
    print(X)
    with open(file_path.replace("pdf", 'txt'), 'w',encoding="utf-8") as f:
      with redirect_stdout(f):
        print(X)

  except:
    errlist.append(doc)
    pass

else:
    print("Directory not found.")


print(errlist)
# Print the collected report texts
# for index, report_text in enumerate(report_texts, start=1):
#     print(f"Report {index}:\n{report_text}\n")


### Calin version (just PyMuPDF)

In [None]:
import os
import fitz  # Import the fitz module from PyMuPDF

# Initialization of the list to store report text
report_texts = []

# Check if the directory exists
if os.path.exists(report_path):
    chosen_pages = [34, 35, 36]
    pdf_file = "HSIB Final Report.pdf"
    file_path = os.path.join(report_path, pdf_file)
    pdf_document = fitz.open(file_path)
    pdf_text = " ".join([page.get_text().strip() for page in pdf_document if page.number in chosen_pages])
    report_texts.append(pdf_text.replace("\n", " "))
    pdf_document.close()

else:
    print("Directory not found.")

# Print the collected report texts
for index, report_text in enumerate(report_texts, start=1):
    print(f"Report {index}:\n{report_text}\n")


In [None]:
import os
import fitz  # Import the fitz module from PyMuPDF

# Initialization of the list to store report text
report_texts = []

# Check if the directory exists
if os.path.exists(report_path):
    # List available report directories
    report_directories = [d for d in os.listdir(report_path) if os.path.isdir(os.path.join(report_path, d))]
    print("Available report directories:", report_directories)

    # Ask the user for the desired report directory
    chosen_directory = input("Enter the number of the directory you want to use (2, 3, 4, 5 or 6): ")

    # Validate and proceed with the chosen directory
    if chosen_directory in ['2', '3', '4', '5', '6']:
        chosen_directory = chosen_directory + '-way'

        # Construct the full path to the chosen directory
        directory_path = os.path.join(report_path, chosen_directory)

        # Traverse through the PDF files in the chosen directory
        if os.path.exists(directory_path):
            pdf_files = [file for file in os.listdir(directory_path) if file.endswith(".pdf")]

            # Read and clean text from PDF files
            for pdf_file in pdf_files:
                file_path = os.path.join(directory_path, pdf_file)
                pdf_document = fitz.open(file_path)
                pdf_text = " ".join([page.get_text().strip() for page in pdf_document])
                raw_bytes = pdf_text.encode()
                text = raw_bytes.decode("utf-8")
                # pdf_text = pdf_text.decode("utf-8")
                report_texts.append(text.replace("\n", " "))
                pdf_document.close()

            print("PDF files read and stored successfully.")
        else:
            print("Chosen directory not found.")
    else:
        print("Invalid choice. Please enter a valid option (2, 3, 4, 5 or 6).")
else:
    print("Directory not found.")

# Print the collected report texts
for index, report_text in enumerate(report_texts, start=1):
    print(f"Report {index}:\n{report_text}\n")


# Temporary trash

## PRIMERA

### Evaluate using reviews datasets - not good, a lot of hallucinations! (PRIMERA)

In [None]:
# Read reviews from file first

def read_reviews_from_file(file_path):
    reviews = []

    try:
        with open(file_path, 'r') as file:
            for line in file:
                review = line.strip()  # Remove leading/trailing whitespace and newline characters
                if review:
                    reviews.append(review)
    except FileNotFoundError:
        print(f"File not found: {file_path}")

    return reviews

file_path = 'reviews.txt'  # Update this with the actual path to your .txt file
review_list = read_reviews_from_file(file_path)
print(review_list)
print(len(review_list))

File not found: reviews.txt
[]
0


In [None]:
# Use a pipeline as a high-level helper
import torch
from transformers import pipeline

results = []

pipe = pipeline(
    task = "text2text-generation",
    model = MODEL,
    tokenizer = TOKENIZER,
    torch_dtype=torch.bfloat16,
    # device="auto"
)

subset = review_list[:10]

print(subset)

# Use model
result = pipe(
    subset,
    # inputs = input_ids,
    # global_attention_mask = global_attention_mask,
    use_cache = True,
    # min_length = 256,
    num_beams = 5,
    max_length = 1024,
    pad_token_id = TOKENIZER.pad_token_id,
    bos_token_id = TOKENIZER.bos_token_id,
    eos_token_id = TOKENIZER.eos_token_id,
    )

results.append(result)


print(results)

["Still having problems getting the upper tension correct. Still too tight. Keep re-threading the machine but haven't found the problem yet.", 'My craft store was out of this yarn, and I needed more to finish a baby blanket. It is so soft, and fluffy.', 'The color cup itself works well but I wish there was a statement that it did not include a cover. I would not suggest using it without the cover due to the risk of paint spillage. Overall works as advertised and fit is correct for Anthem 155.', 'This latch hook is not made the best, the little lever that pulls the yarn through broke after about a week. The ones with the wooden handle are made MUCH better..', "Ironically, this glue was intended to be used for shipping labels just like those generated for printing by amazon check out.While I found some stuff I really liked about these sticks, the stuff I don't like in fact the critical dislike, far outweighs the positives.Positives:a) No smell, no flavour, no mess (yes, I sniffed and lic

In [None]:
# STILL WORK IN PROGRESS AND IT MAY NOT WORK PROPERLY
# NEED TO FIGURE OUT HOW TO CONCATENATE THE TOKENS INSIDE ONE VARIABLE THAT CAN
# THEN BE PASSED TO THE MODEL FOR SUMMARIZATION

subset = review_list[:10]

input_ids_all = []
for review in subset:
  review = review.replace("\n", " ")
  review = " ".join(review.split())

  input_ids = []
  input_ids.extend(
      TOKENIZER.encode(
      review,
      truncation=True,
      max_length=4096,
      )[1:-1]
  )
  input_ids.append(DOCSEP_TOKEN_ID)
  input_ids = [TOKENIZER.bos_token_id] + input_ids + [TOKENIZER.eos_token_id]
  input_ids_all.append(torch.tensor(input_ids))
  input_ids = torch.nn.utils.rnn.pad_sequence(
      input_ids_all, batch_first=True, padding_value=PAD_TOKEN_ID
      )

In [None]:
# get the input ids and attention masks together
global_attention_mask = torch.zeros_like(input_ids).to(input_ids.device)

# put global attention on <s> token
global_attention_mask[:, 0] = 1
global_attention_mask[input_ids == DOCSEP_TOKEN_ID] = 1

generated_ids = MODEL.generate(
    input_ids=input_ids,
    global_attention_mask=global_attention_mask,
    use_cache=True,
    max_length=1024,
    num_beams=5,
)
generated_str = TOKENIZER.batch_decode(generated_ids.tolist(), skip_special_tokens=True)
result = {}
result["generated_summaries"] = generated_str
# result['gt_summaries']=batch['summary']

print(result)

{'generated_summaries': ["– If you've ever wanted to know what it's like to live in a world where women don't get equal pay, well, you're in luck. The Los Angeles Times reports that a study published in the Journal of the American Medical Association has found that women in the US earn an average of $50,000 less per year than their male counterparts. The study was based on a survey of more than 1,000 medical professionals, and found that the average wage for a woman in the United States in 2014 was $61,000, while the average salary for a man in the country was $57,000. The Times notes that the study was conducted before the gender pay gap became a big issue, but it could still have an impact: The study found that, for women, the pay gap between men and women was equal to or greater than the average cost of a bachelor or a salaried full-time job. (The study also found that men make more money in the workplace than women do.)", '– If you\'ve ever wanted to know what it\'s like to be preg

### OPTION 2 - Load the model directly (more options but is more advanced)

WIP - more or less combines the reports together, but may miss details from some

In [None]:
# STILL WORK IN PROGRESS AND IT MAY NOT WORK PROPERLY
# NEED TO FIGURE OUT HOW TO CONCATENATE THE TOKENS INSIDE ONE VARIABLE THAT CAN
# THEN BE PASSED TO THE MODEL FOR SUMMARIZATION

input_ids_all = []
for data in documents:
    all_docs = data.split("|||||")[:-1]
    for i, doc in enumerate(all_docs):
        doc = doc.replace("\n", " ")
        doc = " ".join(doc.split())
        all_docs[i] = doc

    #### concat with global attention on doc-sep
    input_ids = []
    for doc in all_docs:
        input_ids.extend(
            TOKENIZER.encode(
                doc,
                truncation=True,
                max_length=4096 // len(all_docs),
            )[1:-1]
        )
        input_ids.append(DOCSEP_TOKEN_ID)
        input_ids = [TOKENIZER.bos_token_id] + input_ids + [TOKENIZER.eos_token_id]
        input_ids_all.append(torch.tensor(input_ids))
        input_ids = torch.nn.utils.rnn.pad_sequence(
            input_ids_all, batch_first=True, padding_value=PAD_TOKEN_ID
        )

NameError: ignored

In [None]:
# get the input ids and attention masks together
global_attention_mask = torch.zeros_like(input_ids).to(input_ids.device)

# put global attention on <s> token
global_attention_mask[:, 0] = 1
global_attention_mask[input_ids == DOCSEP_TOKEN_ID] = 1

generated_ids = MODEL.generate(
    input_ids=input_ids,
    global_attention_mask=global_attention_mask,
    use_cache=True,
    max_length=1024,
    num_beams=5,
)
generated_str = TOKENIZER.batch_decode(generated_ids.tolist(), skip_special_tokens=True)
result = {}
result["generated_summaries"] = generated_str
# result['gt_summaries']=batch['summary']

print(result)

{'generated_summaries': ['– When Russia invaded Ukraine in 2022, Russian-born musician Alex Yatsun was living in Kharkiv. "When I woke up that day I started living in a completely different reality," he tells the Guardian. "There were bombs falling every hour." He fled his home, but not before his house was hit by shelling. "That was the moment I decided to move closer to central Ukraine," he says. Now 24, Yatsin has created a 21-track compilation album called Dungeon Rap: the Evolution. It\'s a hybrid of two styles of hip-hop, the Guardian reports: Memphis rap, a lo-fi style of southern hip- hop, and dungeon synth, a sub-genre of dark ambient and black metal that started in Scandinavia in the 1990s. The album is out Monday, the same day a 42-year-old man was shot dead while on a jog in a park in the Russian city of Krasnodar, the Telegraph reports. Russia\'s FSB security service says it has arrested a man suspected of killing Stanislav Rzhitsky. The FSB didn\'t claim responsibility fo

### OPTION 3 - WIP - processing the documents based on the model example (may require GPU acceleration)
Currently crashes due to insufficient RAM

In [None]:
def process_document(documents):
    input_ids_all = []
    for data in documents:
        all_docs = data.split("|||||")[:-1]  # ||||| is used to delimit documents
        for i, doc in enumerate(all_docs):
            doc = doc.replace("\n", " ")
            doc = " ".join(doc.split())
            all_docs[i] = doc

        #### concat with global attention on doc-sep
        input_ids = []
        for doc in all_docs:
            input_ids.extend(
                TOKENIZER.encode(
                    doc,
                    truncation=True,
                    max_length=4096 // len(all_docs),
                )[1:-1]
            )
            input_ids.append(DOCSEP_TOKEN_ID)
        input_ids = [TOKENIZER.bos_token_id] + input_ids + [TOKENIZER.eos_token_id]
        input_ids_all.append(torch.tensor(input_ids))
    input_ids = torch.nn.utils.rnn.pad_sequence(
        input_ids_all, batch_first=True, padding_value=PAD_TOKEN_ID
    )
    return input_ids


def batch_process(batch):
    input_ids = process_document(batch["document"])
    # get the input ids and attention masks together
    global_attention_mask = torch.zeros_like(input_ids).to(input_ids.device)
    # put global attention on <s> token

    global_attention_mask[:, 0] = 1
    global_attention_mask[input_ids == DOCSEP_TOKEN_ID] = 1
    generated_ids = MODEL.generate(
        input_ids=input_ids,
        global_attention_mask=global_attention_mask,
        use_cache=True,
        max_length=1024,
        num_beams=5,
    )
    generated_str = TOKENIZER.batch_decode(
        generated_ids.tolist(), skip_special_tokens=True
    )
    result = {}
    result["generated_summaries"] = generated_str
    # result['gt_summaries']=batch['summary']
    return result

In [None]:
result = batch_process(combined_reports)

print(result)

{'document': 'A senior Russian draft officer and former submarine commander accused by Ukraine  of \ndeadly strikes on its territory has been shot dead while jogging in the southern Russian city of \nKrasnodar. \nStanislav Rzhitsky, 42, was killed on Monday by an unidentified gunman during a morning run in a park near the Olimp sports centre, local police said. \nRussian FSB security services said on Tuesday that a 64- year-old man was arrested on \nsuspicion of carrying out the attack. \nAt the time of his death, Rzhitsky was serving as the deputy head of the Krasnodar c ity \nadministration’s mobilisation.  \nAccording to the Russian daily newspaper Kommersant, Rzhitsky was previously the \ncommander of the Krasnodar submarine, named after the city, in the Russian navy. \nThe Ukrainian army said in a Telegram post on Tuesday that Rzhitsky was in command of a \nsubmarine that carried out a  \ndeadly missile attack on the Ukrainian city of Vinnytsi a in July \n2022 , killing 23 civilia

### Evaluating using ROUGUE (multinews)

    ROUGE-n recall=40% means that 40% of the n-grams in the reference summary are also present in the generated summary.
    ROUGE-n precision=40% means that 40% of the n-grams in the generated summary are also present in the reference summary.
    ROUGE-n F1-score=40% is more difficult to interpret, like any F1-score.

https://webcache.googleusercontent.com/search?q=cache:https://towardsdatascience.com/introduction-to-text-summarization-with-rouge-scores-84140c64b471

https://webcache.googleusercontent.com/search?q=cache:https://towardsdatascience.com/the-ultimate-performance-metric-in-nlp-111df6c64460

08/08 evaluation results:

| Metric | ROUGUE-1 | ROUGUE-2 | ROUGUE-L|
| --- | --- | --- | --- |
| Precision | 48.32 | 17.35| 24.42 |
| Recall | 36.21 | 12.91 | 18.21 |
| Fmeasure | 39.92 | 14.31 | 20.12 |


In [None]:
def process_document(documents):
    input_ids_all = []
    for data in documents:
        all_docs = data.split("|||||")[:-1]  # ||||| is used to delimit documents
        for i, doc in enumerate(all_docs):
            doc = doc.replace("\n", " ")
            doc = " ".join(doc.split())
            all_docs[i] = doc

        #### concat with global attention on doc-sep
        input_ids = []
        for doc in all_docs:
            input_ids.extend(
                TOKENIZER.encode(
                    doc,
                    truncation=True,
                    max_length=4096 // len(all_docs),
                )[1:-1]
            )
            input_ids.append(DOCSEP_TOKEN_ID)
        input_ids = [TOKENIZER.bos_token_id] + input_ids + [TOKENIZER.eos_token_id]
        input_ids_all.append(torch.tensor(input_ids))
    input_ids = torch.nn.utils.rnn.pad_sequence(
        input_ids_all, batch_first=True, padding_value=PAD_TOKEN_ID
    )
    return input_ids


def batch_process(batch):
    input_ids = process_document(batch["document"])
    # get the input ids and attention masks together
    global_attention_mask = torch.zeros_like(input_ids).to(input_ids.device)
    # put global attention on <s> token

    global_attention_mask[:, 0] = 1
    global_attention_mask[input_ids == DOCSEP_TOKEN_ID] = 1
    generated_ids = MODEL.generate(
        input_ids=input_ids,
        global_attention_mask=global_attention_mask,
        use_cache=True,
        max_length=1024,
        num_beams=5,
    )
    generated_str = TOKENIZER.batch_decode(
        generated_ids.tolist(), skip_special_tokens=True
    )
    result = {}
    result["generated_summaries"] = generated_str
    result['gt_summaries'] = batch['summary']
    return result

In [None]:
import random

dataset=load_dataset('multi_news')
data_idx = random.choices(range(len(dataset['test'])),k=10)
dataset_small = dataset['test'].select(data_idx)
result_small = dataset_small.map(batch_process, batched=True, batch_size=2)

Downloading data files:   0%|          | 0/5 [00:00<?, ?it/s]

Downloading data:   0%|          | 0.00/159M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/376M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/12.3M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/661k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/572k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/287113 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/13368 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/11490 [00:00<?, ? examples/s]

Map:   0%|          | 0/10 [00:00<?, ? examples/s]

KeyError: ignored

In [None]:
!pip install rouge_score
rouge = load_metric("rouge")

score=rouge.compute(predictions=result_small["generated_summaries"], references=result_small["gt_summaries"])
print(score['rouge1'].mid)
print(score['rouge2'].mid)
print(score['rougeL'].mid)

Traceback (most recent call last):

^C


  rouge = load_metric("rouge")


Score(precision=0.18273866682513235, recall=0.6319687920418022, fmeasure=0.26717736713628903)
Score(precision=0.06569223439441368, recall=0.20986742892616572, fmeasure=0.09445721550047685)
Score(precision=0.1108689297842945, recall=0.3917410555049988, fmeasure=0.16318074666735677)


### Evaluating using ROUGUE (CNNDM)

    ROUGE-n recall=40% means that 40% of the n-grams in the reference summary are also present in the generated summary.
    ROUGE-n precision=40% means that 40% of the n-grams in the generated summary are also present in the reference summary.
    ROUGE-n F1-score=40% is more difficult to interpret, like any F1-score.

https://webcache.googleusercontent.com/search?q=cache:https://towardsdatascience.com/introduction-to-text-summarization-with-rouge-scores-84140c64b471

https://webcache.googleusercontent.com/search?q=cache:https://towardsdatascience.com/the-ultimate-performance-metric-in-nlp-111df6c64460

08/08 evaluation results:

| Metric | ROUGUE-1 | ROUGUE-2 | ROUGUE-L|
| --- | --- | --- | --- |
| Precision | 19.8 | 8.4 | 12.31 |
| Recall | 64.83 | 25.32 | 39.9 |
| Fmeasure | 29.14 | 12.17 | 18.1 |

In [None]:
def process_dataset(articles):
    input_ids_all = []
    for article in articles:
        article = article.replace("\n", " ")
        article = " ".join(article.split())

        input_ids = TOKENIZER.encode(
            article,
            truncation=True,
            max_length=4096,
        )[1:-1]
        input_ids = [TOKENIZER.bos_token_id] + input_ids + [TOKENIZER.eos_token_id]
        input_ids_all.append(torch.tensor(input_ids))
    input_ids = torch.nn.utils.rnn.pad_sequence(
        input_ids_all, batch_first=True, padding_value=PAD_TOKEN_ID
    )
    return input_ids


def batch_process(batch):
    input_ids = process_dataset(batch["article"])
    # get the input ids and attention masks together
    global_attention_mask = torch.zeros_like(input_ids).to(input_ids.device)
    # put global attention on <s> token

    global_attention_mask[:, 0] = 1
    generated_ids = MODEL.generate(
        input_ids=input_ids,
        global_attention_mask=global_attention_mask,
        use_cache=True,
        max_length=1024,
        num_beams=5,
    )
    generated_str = TOKENIZER.batch_decode(
        generated_ids.tolist(), skip_special_tokens=True
    )
    result = {}
    result["generated_summaries"] = generated_str
    result['gt_summaries'] = batch['highlights']
    return result

In [None]:
import random

dataset=load_dataset('cnn_dailymail','3.0.0')
data_idx = random.choices(range(len(dataset['test'])),k=10)
dataset_small = dataset['test'].select(data_idx)
result_small = dataset_small.map(batch_process, batched=True, batch_size=2)

Map:   0%|          | 0/10 [00:00<?, ? examples/s]

In [None]:
!pip install -q rouge_score
rouge = load_metric("rouge")

score = rouge.compute(predictions=result_small["generated_summaries"], references=result_small["gt_summaries"])
print(score['rouge1'].mid)
print(score['rouge2'].mid)
print(score['rougeL'].mid)

Score(precision=0.18273866682513235, recall=0.6319687920418022, fmeasure=0.26717736713628903)
Score(precision=0.06569223439441368, recall=0.20986742892616572, fmeasure=0.09445721550047685)
Score(precision=0.1108689297842945, recall=0.3917410555049988, fmeasure=0.16318074666735677)


### Evaluate using healthcare report

In [None]:
# Use a pipeline as a high-level helper
import torch
from transformers import pipeline

pipe = pipeline(
    task = "text2text-generation",
    model = MODEL,
    tokenizer = TOKENIZER,
    torch_dtype=torch.bfloat16,
    # device="auto"
)

# Use model
result = pipe(
    page_text_list,
    use_cache = True,
    # min_length = 256,
    num_beams = 5,
    max_length = 1024,
    pad_token_id = TOKENIZER.pad_token_id,
    bos_token_id = TOKENIZER.bos_token_id,
    eos_token_id = TOKENIZER.eos_token_id,
    # temperature = 0.8,
    # top_p = 0.9
    )

print(result)

[{'generated_text': '– When a woman returned to the UK from overseas, she had no documentation of the care she had received. Her estimated date of birth from overseas placed her at 37.5 weeks, but her gestation was not clearly documented and was used in care planning. She was two weeks and four days earlier than she was supposed to be pregnant, and her placenta was not sent for histopathological examination. She requested a Cesarean delivery again, but labor began prior to the planned date, and she was delivered in the latent phase of labor at 36.2 weeks, not 38.6 weeks as she was originally estimated. The High Court of Justice in the UK has now issued a report on the case, which found that the mother\'s "late gestation EDD from the Trust USS" was not properly documented and "may have influenced the clinical decisions in not proceeding with a CS at the Mothers request."'}, {'generated_text': "– A study published in the Journal of the American College of Obstetricians and Gynecologists 

### Evaluate using sentences in Excel spreadsheet

In [None]:
from google.colab import auth
auth.authenticate_user()

import gspread
from google.auth import default
creds, _ = default()

gc = gspread.authorize(creds)

In [None]:
worksheet = gc.open('senteses').sheet1

# get_all_values gives a list of rows.
rows = worksheet.get_all_values()
print(rows)



In [None]:
# Flatten the list of lists and concatenate sentences
all_sentences = [sentence[0] for sentence in rows]
num_sentences = len(all_sentences)

# Ask the user how many times they want to split the sentences
num_splits = int(input("Enter the number of splits: "))

# Calculate the number of sentences per split
sentences_per_split = num_sentences // num_splits

# Initialize a list to hold concatenated strings
concatenated_strings = []

# Iterate over the number of splits
for i in range(num_splits):
    start_idx = i * sentences_per_split
    end_idx = start_idx + sentences_per_split
    split_sentences = all_sentences[start_idx:end_idx]

    # Concatenate the split sentences into a string
    split_string = " ".join(sentence for sentence in split_sentences)
    concatenated_strings.append(split_string)

# Handle any remaining sentences after splitting equally
remaining_sentences = all_sentences[num_splits * sentences_per_split:]
if remaining_sentences:
    remaining_string = " ".join(sentence for sentence in remaining_sentences)
    concatenated_strings.append(remaining_string)

for string in concatenated_strings:
  print(string)
print(concatenated_strings)
print(len(concatenated_strings))


Enter the number of splits: 5
Escalation for an obstetric review did not occur until 13:25 hours, this was an opportunity to expedite the birth of the Baby. Due to USS demand and capacity issues during the COVID-19 pandemic, growth USS were not continued until delivery as recommended in national guidance. As the staffing was low and the acuity was high on the night shift, and clinicians were used to â€˜managing' and supporting one another, there was a loss of awareness by clinicians of the increasing incidence of clinical risk in relation to the Mother. When the Mother attended with a history of raised BP and spontaneous rupture of membranes (SRM), the unit was busy which led to multiple handovers of care; the history of possible SRM and the Mother's concerns were not appreciated, and a full clinical examination did not occur. Triage calls were taken overnight by staff working on the labour ward who had competing demands. Staff absence led to the Mother not receiving community mental h

In [None]:
# Find number of tokens
example = TOKENIZER.encode(
                concatenated_strings[0],
                # truncation=True,
                # max_length=4096,
                return_tensors="pt"
            )[1:-1]
print(example.size())
# print(f"INPUT LENGTH (tokens): {example['input_ids'].shape[-1]}")

torch.Size([0, 44])


In [None]:
# Use a pipeline as a high-level helper
import torch
from transformers import pipeline

input_ids = []
# input_ids_all = []

pipe = pipeline(
    task = "text2text-generation",
    model = MODEL,
    tokenizer = TOKENIZER,
    torch_dtype=torch.bfloat16,
    # device="auto"
)

# Use model
result = pipe(
    concatenated_strings,
    # inputs = input_ids.tolist(),
    # global_attention_mask = global_attention_mask,
    use_cache = True,
    min_length = 1024,
    num_beams = 5,
    max_length = 4096,
    pad_token_id = TOKENIZER.pad_token_id,
    bos_token_id = TOKENIZER.bos_token_id,
    eos_token_id = TOKENIZER.eos_token_id,
    )

print(result)

IndexError: ignored

### Evaluate test

In [None]:
def process_document(documents):
    input_ids_all = []
    for data in documents:
        all_docs = data.split("|||||")[:-1]  # ||||| is used to delimit documents
        for i, doc in enumerate(all_docs):
            doc = doc.replace("\n", " ")
            doc = " ".join(doc.split())
            all_docs[i] = doc

        #### concat with global attention on doc-sep
        input_ids = []
        for doc in all_docs:
            input_ids.extend(
                TOKENIZER.encode(
                    doc,
                    truncation=True,
                    max_length=4096 // len(all_docs),
                )[1:-1]
            )
            input_ids.append(DOCSEP_TOKEN_ID)
        input_ids = [TOKENIZER.bos_token_id] + input_ids + [TOKENIZER.eos_token_id]
        input_ids_all.append(torch.tensor(input_ids))
    input_ids = torch.nn.utils.rnn.pad_sequence(
        input_ids_all, batch_first=True, padding_value=PAD_TOKEN_ID
    )
    return input_ids


def batch_process(batch):
    input_ids = process_document(batch["document"])
    # get the input ids and attention masks together
    global_attention_mask = torch.zeros_like(input_ids).to(input_ids.device)
    # put global attention on <s> token

    global_attention_mask[:, 0] = 1
    global_attention_mask[input_ids == DOCSEP_TOKEN_ID] = 1
    generated_ids = MODEL.generate(
        input_ids=input_ids,
        global_attention_mask=global_attention_mask,
        use_cache=True,
        max_length=1024,
        num_beams=5,
        # do_sample=True,
        # top_p=0.8,
        # temperature=0.1
    )
    generated_str = TOKENIZER.batch_decode(
        generated_ids.tolist(), skip_special_tokens=True
    )
    result = {}
    result["generated_summaries"] = generated_str
    result['gt_summaries'] = batch['summary']
    return result

In [None]:
import random

dataset=load_dataset('multi_news')
data_idx = random.choices(range(len(dataset['test'])),k=10)
dataset_small = dataset['test'].select(data_idx)
result_small = dataset_small.map(batch_process, batched=True, batch_size=2)

Map:   0%|          | 0/10 [00:00<?, ? examples/s]

In [None]:
!pip install rouge_score
rouge = evaluate.load("rouge")

# rouge = evaluate.combine(["f1", "precision", "recall"])

score=rouge.compute(predictions=result_small["generated_summaries"], references=result_small["gt_summaries"])
print(f"Rouge1: {score['rouge1']}")
print(f"Rouge2: {score['rouge2']}")
print(f"RougeL: {score['rougeL']}")

Rouge1: 0.42804480141512435
Rouge2: 0.1303037274924171
RougeL: 0.20012094745800413


## Centrum Model

https://github.com/ratishsp/centrum

### Option 1 - Using Transformers pipeline()

In [None]:
# Importing necessary modules
from transformers import AutoTokenizer, LEDForConditionalGeneration, LEDConfig, pipeline
from datasets import load_dataset, load_metric
import torch

# Initializing variables
TOKENIZER = AutoTokenizer.from_pretrained("ratishsp/Centrum-multinews")
CONFIG = LEDConfig.from_pretrained("ratishsp/Centrum-multinews")
MODEL = LEDForConditionalGeneration.from_pretrained("ratishsp/Centrum-multinews")
PAD_TOKEN_ID = TOKENIZER.pad_token_id

Downloading (…)okenizer_config.json:   0%|          | 0.00/1.64k [00:00<?, ?B/s]

Downloading (…)olve/main/vocab.json:   0%|          | 0.00/798k [00:00<?, ?B/s]

Downloading (…)olve/main/merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/2.11M [00:00<?, ?B/s]

Downloading (…)in/added_tokens.json:   0%|          | 0.00/25.0 [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/957 [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/1.17k [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/610M [00:00<?, ?B/s]

In [None]:
pipe = pipeline(
    task="text2text-generation",
    model=MODEL,
    tokenizer=TOKENIZER,
    torch_dtype=torch.bfloat16,
    # device="auto"
)

# Use model
result = pipe(
    page_text_list,
    # inputs = input_ids,
    # global_attention_mask = global_attention_mask,
    use_cache=True,
    # min_length = 256,
    num_beams=5,
    max_length=1024,
    pad_token_id=TOKENIZER.pad_token_id,
    bos_token_id=TOKENIZER.bos_token_id,
    eos_token_id=TOKENIZER.eos_token_id,
)

print(result)

[{'generated_text': '– A British woman was found dead of a "perinatal asphyxia" two days after she returned to the UK from overseas, the Telegraph reports. The mother, who returned to the UK from overseas at 37+4 weeks, was booked for maternity care, having returned to the UK from overseas. She had no documentation of the care she had received. The mother estimated date of birth from overseas place her at 37+4 weeks. After clinical review the ongoing intention was to utilize this date, this was not clearly documented and late gestation EDD from the Trust USS was used. As an EDD from a late USS was used in care planning, this placed the Mother’s pregnancy two weeks and four days earlier than the correct gestation. The Mother had delivered by prior CS and requested this mode of delivery again. Latent phase of labor occurred prior to the planned CS date. The Mother presented in the latent phase of labor at 36+2 weeks (Trust USS), 38+6 (overseas USS). From admission the gestation was commu

## BRIO

### Option 2: Cloning repository & initializing pytorch model first (not working at the moment)

In [None]:
!git clone https://github.com/yixinL7/BRIO.git
%cd BRIO

Cloning into 'BRIO'...
remote: Enumerating objects: 127, done.[K
remote: Counting objects: 100% (48/48), done.[K
remote: Compressing objects: 100% (22/22), done.[K
remote: Total 127 (delta 40), reused 26 (delta 26), pack-reused 79[K
Receiving objects: 100% (127/127), 7.58 MiB | 6.88 MiB/s, done.
Resolving deltas: 100% (63/63), done.
/content/BRIO


In [None]:
from transformers import BartTokenizer, PegasusTokenizer
# from transformers import BartForConditionalGeneration, PegasusForConditionalGeneration
from model import BRIO

IS_PEGASUS = False # whether to use CNNDM dataset or XSum dataset
LOWER = False

# Load our model checkpoints
if IS_PEGASUS:
    tokenizer = PegasusTokenizer.from_pretrained('Yale-LILY/brio-xsum-cased')
    model = BRIO('Yale-LILY/brio-xsum-cased', tokenizer.pad_token_id, is_pegasus=True)
else:
    tokenizer = BartTokenizer.from_pretrained('Yale-LILY/brio-cnndm-uncased')
    model = BRIO('Yale-LILY/brio-cnndm-uncased', tokenizer.pad_token_id, is_pegasus=False)

max_length = 1024 if IS_PEGASUS else 512

Downloading (…)olve/main/vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

Downloading (…)olve/main/merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/772 [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/1.29k [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/1.63k [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/1.63k [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/1.63G [00:00<?, ?B/s]

In [None]:
# Initialize an empty list to store all the resulted summaries
result = []

# Loop through each report type in the chosen_texts dictionary
for page in page_text_list:
  # Tokenize the document and generate the summary
  inputs = tokenizer([page], max_length=max_length, return_tensors="pt", truncation=True)
  summary_ids = model.generate(inputs["input_ids"],
                               early_stopping=False,
                               max_length=1024,
                              #  num_beams=1,
                              #  num_beam_groups=1
                               )
  summary = tokenizer.batch_decode(summary_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]

  # Append the summary to the list of resulted_summaries
  result.append(summary)

# Print all the summaries
for idx, summary in enumerate(result, 1):
    print(f"Summary {idx}: {summary}")

TypeError: ignored

### Evaluating using ROUGUE (CNNDM & Multi-news)

CNNDM

Score(precision=0.2849497304480312, recall=0.6269928048762143, fmeasure=0.3879213779331322)
Score(precision=0.13189367769326338, recall=0.2955957028224432, fmeasure=0.18020487293883186)
Score(precision=0.17193152783960908, recall=0.38417434531746963, fmeasure=0.23519102965625838)

Multi-news

Score(precision=0.5657348407615774, recall=0.2978514239868378, fmeasure=0.38313897764704563)
Score(precision=0.18822957833592233, recall=0.09610687222856976, fmeasure=0.1262925742149782)
Score(precision=0.3147363871594001, recall=0.1632515094800887, fmeasure=0.2109012191720496)

In [None]:
!pip install -q rouge_score

  Preparing metadata (setup.py) ... [?25l[?25hdone
  Building wheel for rouge_score (setup.py) ... [?25l[?25hdone


In [None]:
def process_dataset(dataset):
  result = []

  for item in dataset:
    inputs = tokenizer([item], max_length=max_length, return_tensors="pt", truncation=True)
    summary_ids = model.generate(
        inputs["input_ids"],
        max_length=max_length,
        min_length=128,
        temperature=0,
        # top_p=0.3
        )
    summary = tokenizer.batch_decode(summary_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]

    # Append the summary to the list of resulted_summaries
    result.append(summary)

  return result

In [None]:
# Importing necessary modules
from datasets import load_dataset, load_metric
import torch
import random

dataset=load_dataset('multi_news')
data_idx = random.choices(range(len(dataset['test'])),k=10)
dataset_small = dataset['test'].select(data_idx)
# print(len(dataset_small['article']))
# print(len(dataset_small['highlights']))
result_small = process_dataset(dataset_small['document'])

# print(len(result_small))

rouge = load_metric("rouge")
score = rouge.compute(predictions=result_small, references=dataset_small['summary'])
print(score['rouge1'].mid)
print(score['rouge2'].mid)
print(score['rougeL'].mid)


Downloading builder script:   0%|          | 0.00/3.83k [00:00<?, ?B/s]

Downloading metadata:   0%|          | 0.00/2.82k [00:00<?, ?B/s]

Downloading readme:   0%|          | 0.00/10.6k [00:00<?, ?B/s]

Downloading data files:   0%|          | 0/3 [00:00<?, ?it/s]

Downloading data:   0%|          | 0.00/548M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/58.8M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/66.9M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/7.30M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/69.0M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/7.31M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/44972 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/5622 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/5622 [00:00<?, ? examples/s]

Score(precision=0.5657348407615774, recall=0.2978514239868378, fmeasure=0.38313897764704563)
Score(precision=0.18822957833592233, recall=0.09610687222856976, fmeasure=0.1262925742149782)
Score(precision=0.3147363871594001, recall=0.1632515094800887, fmeasure=0.2109012191720496)


### Evaluate using healthcare report

In [None]:
# Initialize an empty list to store all the resulted summaries
result = []

for page in page_text_list:

  inputs = tokenizer([page], max_length=max_length, return_tensors="pt", truncation=True)
  summary_ids = model.generate(
      inputs["input_ids"],
      max_length=max_length,
      min_length=128,
      # temperature=0.3,
      do_sample=True,
      top_p=0.8
      )
  summary = tokenizer.batch_decode(summary_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]

  # Append the summary to the list of resulted_summaries
  result.append(summary)

# Print all the summaries
for idx, summary in enumerate(result, 1):
    print(f"Summary {idx}: {summary}")

RuntimeError: ignored

### Evaluate ROUGE

In [None]:
def process_dataset(batch):
    items = batch['document']
    generated_summaries = []

    for item in items:
        inputs = tokenizer([item], max_length=max_length, return_tensors="pt", truncation=True)
        summary_ids = model.generate(
            inputs["input_ids"],
            # max_length=max_length,
            # min_length=128,
            # temperature=0,
            # top_p=0.3
        )
        gen_str = tokenizer.batch_decode(summary_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
        generated_summaries.append(gen_str)

    result = {'generated': generated_summaries}
    return result

In [None]:
# Importing necessary modules
!pip install -q rouge_score
from datasets import load_dataset
import evaluate
import torch
import random

dataset=load_dataset('multi_news')
data_idx = random.choices(range(len(dataset['test'])),k=10)
dataset_small = dataset['test'].select(data_idx)
result_small = dataset_small.map(process_dataset, batched=True, batch_size=2)

rouge = evaluate.load("rouge")
score = rouge.compute(predictions=result_small['generated'], references=dataset_small['summary'])
print(score['rouge1'])
print(score['rouge2'])
print(score['rougeL'])


Map:   0%|          | 0/10 [00:00<?, ? examples/s]

0.38546372570007437
0.1319863257604803
0.18434961486619195


## EfactSum

### Evaluate using healthcare report

In [None]:
### Initialize an empty list to store all the resulted summaries
result = []

for page in page_text_list:

  inputs = tokenizer([page], max_length=max_length, return_tensors="pt", truncation=True)
  summary_ids = model.generate(
      inputs["input_ids"],
      max_length=max_length,
      min_length=64,
      temperature=0,
      top_p=0.8
      )
  summary = tokenizer.batch_decode(summary_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]

  # Append the summary to the list of resulted_summaries
  result.append(summary)

# Print all the summaries
for idx, summary in enumerate(result, 1):
    print(f"Summary {idx}: {summary}")

Summary 1: This is a report on the death of a baby boy following a Caesarean birth at the Royal Victoria Hospital in Belfast , Northern Ireland , on 5 December 2006 , following a series of failings in the care provided to the mother , who had returned from overseas , and the baby , who had been born prematurely .
Summary 2: This is a summary of the safety recommendations made by the Royal College of Obstetricians and Gynaecologists , following a review of the care provided to mothers and babies during labour at the Royal Lancaster Infirmary and the Royal Victoria Hospital , Bath , between 1 July 2014 and 31 December 2015 , and published in the Journal of the American College of Obstetricians and Gynaecologists .


### Evaluate ROUGE

In [None]:
def process_dataset(batch):
    items = batch['document']
    generated_summaries = []

    for item in items:
        inputs = tokenizer([item], max_length=max_length, return_tensors="pt", truncation=True)
        summary_ids = model.generate(
            inputs["input_ids"],
            # max_length=max_length,
            # min_length=128,
            # temperature=0,
            # top_p=0.3
        )
        gen_str = tokenizer.batch_decode(summary_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
        generated_summaries.append(gen_str)

    result = {'generated': generated_summaries}
    return result

In [None]:
# Importing necessary modules
!pip install -q rouge_score
from datasets import load_dataset
import evaluate
import torch
import random

dataset=load_dataset('multi_news')
data_idx = random.choices(range(len(dataset['test'])),k=10)
dataset_small = dataset['test'].select(data_idx)
result_small = dataset_small.map(process_dataset, batched=True, batch_size=2)

rouge = evaluate.load("rouge")
score = rouge.compute(predictions=result_small['generated'], references=dataset_small['summary'])
print(score['rouge1'])
print(score['rouge2'])
print(score['rougeL'])


Map:   0%|          | 0/10 [00:00<?, ? examples/s]

0.25256162885713884
0.08728093207429244
0.15763493117137045


## Llama-2 Model - not working

Requires GPU to run

https://webcache.googleusercontent.com/search?q=cache:https://levelup.gitconnected.com/text-summarization-llama2-how-to-use-llama2-with-langchain-ad5775c80716

In [None]:
!pip install -q transformers einops langchain bitsandbytes sentencepiece safetensors torch xformers datasets
!pip install -q -U git+https://github.com/huggingface/accelerate.git
!huggingface-cli login --token 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

  Installing build dependencies ... [?25l[?25hdone
  Getting requirements to build wheel ... [?25l[?25hdone
  Preparing metadata (pyproject.toml) ... [?25l[?25hdone
  Building wheel for accelerate (pyproject.toml) ... [?25l[?25hdone
Token will not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /root/.cache/huggingface/token
Login successful


In [None]:
# Create a new list containing the texts of the chosen documents

doc_list = []

for key, value_list in chosen_texts.items():
        doc_list.extend(value_list)

print(doc_list[0])

A senior Russian draft officer and former submarine commander accused by Ukraine of  deadly strikes on its territory has been shot dead while jogging in the southern Russian city of  Krasnodar.  Stanislav Rzhitsky, 42, was killed on Monday by an unidentified gunman during a morning  run in a park near the Olimp sports centre, local police said.  Russian FSB security services said on Tuesday that a 64-year-old man was arrested on  suspicion of carrying out the attack.  At the time of his death, Rzhitsky was serving as the deputy head of the Krasnodar city  administration’s mobilisation.  According to the Russian daily newspaper Kommersant, Rzhitsky was previously the  commander of the Krasnodar submarine, named after the city, in the Russian navy.  The Ukrainian army said in a Telegram post on Tuesday that Rzhitsky was in command of a  submarine that carried out a deadly missile attack on the Ukrainian city of Vinnytsia in July  2022, killing 23 civilians.  Rzhitsky’s father told the Ba

### Option 1: Using Langchain (not working atm)

In [None]:
from langchain import HuggingFacePipeline
from transformers import AutoTokenizer, AutoModelForCausalLM
import transformers
import torch

torch.cuda.empty_cache()

model_name = "TinyPixel/Llama-2-7B-bf16-sharded"

model = AutoModelForCausalLM.from_pretrained(model_name,
                                             load_in_4bit=True,
                                             device_map='auto',
                                             torch_dtype=torch.float16,
                                            #  low_cpu_mem_usage=True,
                                             trust_remote_code=True
                                            )

tokenizer = AutoTokenizer.from_pretrained(model_name)

pipeline = transformers.pipeline(
    "text-generation", #task
    model=model,
    tokenizer=tokenizer,
    # max_length=1000,
    max_new_tokens=512,
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
)

llm = HuggingFacePipeline(pipeline = pipeline, model_kwargs = {'temperature':0})

Downloading (…)lve/main/config.json:   0%|          | 0.00/626 [00:00<?, ?B/s]

  warn("The installed version of bitsandbytes was compiled without GPU support. "


/usr/local/lib/python3.10/dist-packages/bitsandbytes/libbitsandbytes_cpu.so: undefined symbol: cadam32bit_grad_fp32


Downloading (…)model.bin.index.json:   0%|          | 0.00/26.8k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/14 [00:00<?, ?it/s]

Downloading (…)l-00001-of-00014.bin:   0%|          | 0.00/981M [00:00<?, ?B/s]

Downloading (…)l-00002-of-00014.bin:   0%|          | 0.00/967M [00:00<?, ?B/s]

Downloading (…)l-00003-of-00014.bin:   0%|          | 0.00/967M [00:00<?, ?B/s]

Downloading (…)l-00004-of-00014.bin:   0%|          | 0.00/990M [00:00<?, ?B/s]

Downloading (…)l-00005-of-00014.bin:   0%|          | 0.00/944M [00:00<?, ?B/s]

Downloading (…)l-00006-of-00014.bin:   0%|          | 0.00/990M [00:00<?, ?B/s]

Downloading (…)l-00007-of-00014.bin:   0%|          | 0.00/967M [00:00<?, ?B/s]

Downloading (…)l-00008-of-00014.bin:   0%|          | 0.00/967M [00:00<?, ?B/s]

Downloading (…)l-00009-of-00014.bin:   0%|          | 0.00/990M [00:00<?, ?B/s]

Downloading (…)l-00010-of-00014.bin:   0%|          | 0.00/944M [00:00<?, ?B/s]

Downloading (…)l-00011-of-00014.bin:   0%|          | 0.00/990M [00:00<?, ?B/s]

Downloading (…)l-00012-of-00014.bin:   0%|          | 0.00/967M [00:00<?, ?B/s]

Downloading (…)l-00013-of-00014.bin:   0%|          | 0.00/967M [00:00<?, ?B/s]

Downloading (…)l-00014-of-00014.bin:   0%|          | 0.00/847M [00:00<?, ?B/s]

ValueError: ignored

In [None]:
from langchain import PromptTemplate,  LLMChain

template = """
              Write a summary of the following text delimited by triple backquotes.
              ```{text}```
              SUMMARY:
           """

prompt = PromptTemplate(template=template, input_variables=["text"])

llm_chain = LLMChain(prompt=prompt, llm=llm)

In [None]:
print(llm_chain.run(doc_list[0]))



 `1. A senior Russian draft officer and former submarine commander accused by Ukraine of deadly strikes on its territory has been shot dead while jogging in the southern Russian city of Krasnodar.  2. Stanislav Rzhitsky, 42, was killed on Monday by an unidentified gunman during a morning run in a park near the Olimp sports centre, local police said.  3. Russian FSB security services said on Tuesday that a 64-year-old man was arrested on suspicion of carrying out the attack.  4. At the time of his death, Rzhitsky was serving as the deputy head of the Krasnodar city administration's mobilization.
\end{code}

Comment: You can get the output you want with the `strip_comments` function. See https://stackoverflow.com/a/54300407/16134191

Comment: You are getting the output you need. You just need more experience with the tool. See [this answer.](https://stackoverflow.com/a/54102170/5618127)

Comment: @GordonCraig I want just the text in the paragraph and not the summary. I am getting the sum

### Option 2: Using Sharded model + quantization

https://colab.research.google.com/drive/1zxwaTSvd6PSHbtyaoa7tfedAS31j_N6m#scrollTo=VPYJ5vUNftKm

https://huggingface.co/Trelis/Llama-2-7b-chat-hf-sharded-bf16-5GB

In [None]:
import transformers
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, pipeline, TextStreamer

In [None]:
model_id = "Trelis/Llama-2-7b-chat-hf-sharded-bf16-5GB" # sharded model by RonanKMcGovern. Change the model here to load something else.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

config = transformers.AutoConfig.from_pretrained(model_id, trust_remote_code=True)
# config.init_device = 'cuda:0' # Unclear whether this really helps a lot or interacts with device_map.

tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_id,
                                             config=config,
                                             quantization_config=bnb_config,
                                             device_map='auto',
                                             trust_remote_code=True) # for inference use 'auto', for training us device_map={"":0}

ValueError: ignored

In [None]:
result = []

for text in doc_list:
  system_prompt = 'You are a helpful summarization assistant that provides accurate summaries of text given.'
  user_prompt = f'Please summarize the following article: {text}'

  B_INST, E_INST = "[INST]", "[/INST]"
  B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"

  prompt = f"{B_INST} {B_SYS}{system_prompt.strip()}{E_SYS}{user_prompt.strip()} {E_INST}"

  inputs = tokenizer([prompt], return_tensors="pt").to("cuda:0")
  # streamer = TextStreamer(tokenizer)

  generated_ids = model.generate(**inputs, max_new_tokens=1000)
  response = tokenizer.batch_decode(generated_ids.tolist(), skip_special_tokens=True)

  result_text = response[0].split(E_INST, 1)[1].strip()

  result.append(result_text)

print(result)

["A senior Russian military officer, Stanislav Rzhitsky, was shot dead while jogging in the southern city of Krasnodar, Russia. Rzhitsky, 42, was a former submarine commander and had served as the deputy head of the Krasnodar city administration's mobilization. According to Ukrainian authorities, Rzhitsky was in command of a submarine that carried out a deadly missile attack on the Ukrainian city of Vinnytsia in July 2022, killing 23 civilians. The Ukrainian defense ministry's main directorate of intelligence, GUR, did not claim responsibility for Rzhitsky's death, but shared details about the killing, including the time of the attack and the lack of witnesses due to heavy rain. The Russian security services have arrested a 64-year-old man on suspicion of carrying out the attack. It is worth noting that Ukraine typically declines to claim responsibility for attacks on Russia or Russian-annexed Crimea, but Kyiv officials have frequently celebrated such attacks with cryptic or mocking re

### Option 3: Using llama.cpp

https://github.com/abetlen/llama-cpp-python

https://python.langchain.com/docs/integrations/llms/llamacpp#cpu

https://python.langchain.com/docs/modules/chains/foundational/llm_chain

https://python.langchain.com/docs/modules/model_io/prompts/prompt_templates/

https://huggingface.co/TheBloke/Llama-2-13B-chat-GGML (or 7B)

In [None]:
!pip install llama-cpp-python



In [None]:
from langchain.llms import LlamaCpp
from langchain import PromptTemplate, LLMChain
from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler

In [None]:
template = """SYSTEM: You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature. If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.
USER: Please summarize this following article: {text}
ASSISTANT:"""

prompt = PromptTemplate(template=template, input_variables=["text"])

# Callbacks support token-wise streaming
callback_manager = CallbackManager([StreamingStdOutCallbackHandler()])
# Verbose is required to pass to the callback manager

In [None]:
from huggingface_hub import hf_hub_download

hf_hub_download(repo_id="TheBloke/Llama-2-13B-chat-GGML",
                filename="llama-2-13b-chat.ggmlv3.q4_0.bin",
                local_dir="/content/models/",
                local_dir_use_symlinks=False)

Downloading (…)chat.ggmlv3.q4_0.bin:   0%|          | 0.00/7.32G [00:00<?, ?B/s]

'/content/models/llama-2-13b-chat.ggmlv3.q4_0.bin'

In [None]:
# Make sure the model path is correct for your system!
llm = LlamaCpp(
    model_path="/content/models/llama-2-13b-chat.ggmlv3.q4_0.bin",
    input={"temperature": 0.75, "max_length": 2000, "top_p": 1},
    callback_manager=callback_manager,
    verbose=True,
    n_ctx=2048
)

# llm_chain = LLMChain(prompt=prompt, llm=llm)

AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 | 


In [None]:
llm(prompt.format(text=doc_list[0]))

Llama.generate: prefix-match hit


KeyboardInterrupt: ignored

In [None]:
from llama_cpp import Llama
llm = Llama(model_path="/content/models/llama-2-7b-chat.ggmlv3.q6_K.bin",
            n_ctx=2048)

prompt = f"""SYSTEM: You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature. If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.
USER: Please summarize this following article: {doc_list[0]}
ASSISTANT:"""

output = llm(prompt,
             max_tokens=1000,
             stop=["Q:", "\n"],
             echo=True,
             temperature=0.75,
             top_p=1)
print(output)

AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 | 


KeyboardInterrupt: ignored

## Unlimiformer

### Evaluate using sentences in Excel spreadsheet

In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# example using govreport
modelname = "abertsch/unlimiformer-bart-govreport-alternating"
# dataset = load_dataset("urialon/gov_report_validation")

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained(modelname)

defaults = UnlimiformerArguments()
unlimiformer_kwargs = {
            'layer_begin': defaults.layer_begin,
            'layer_end': defaults.layer_end,
            'unlimiformer_head_num': defaults.unlimiformer_head_num,
            'exclude_attention': defaults.unlimiformer_exclude,
            'chunk_overlap': defaults.unlimiformer_chunk_overlap,
            'model_encoder_max_len': defaults.unlimiformer_chunk_size,
            'verbose': defaults.unlimiformer_verbose, 'tokenizer': tokenizer,
            'unlimiformer_training': defaults.unlimiformer_training,
            'use_datastore': defaults.use_datastore,
            'flat_index': defaults.flat_index,
            'test_datastore': defaults.test_datastore,
            'reconstruct_embeddings': defaults.reconstruct_embeddings,
            'gpu_datastore': defaults.gpu_datastore,
            'gpu_index': defaults.gpu_index
}


Downloading (…)lve/main/config.json:   0%|          | 0.00/1.72k [00:00<?, ?B/s]

Downloading (…)olve/main/vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

Downloading (…)olve/main/merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/1.81k [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/558M [00:00<?, ?B/s]

In [None]:
from google.colab import auth
auth.authenticate_user()

import gspread
from google.auth import default
creds, _ = default()

gc = gspread.authorize(creds)

In [None]:
worksheet = gc.open('senteses').sheet1

# get_all_values gives a list of rows.
rows = worksheet.get_all_values()
print(rows)



In [None]:
# Flatten the list of lists and concatenate sentences
all_sentences = [sentence[0] for sentence in rows]
num_sentences = len(all_sentences)

# Ask the user how many times they want to split the sentences
num_splits = int(input("Enter the number of splits: "))

# Calculate the number of sentences per split
sentences_per_split = num_sentences // num_splits

# Initialize a list to hold concatenated strings
concatenated_strings = []

# Iterate over the number of splits
for i in range(num_splits):
    start_idx = i * sentences_per_split
    end_idx = start_idx + sentences_per_split
    split_sentences = all_sentences[start_idx:end_idx]

    # Concatenate the split sentences into a string
    split_string = " ".join(sentence for sentence in split_sentences)
    concatenated_strings.append(split_string)

# Handle any remaining sentences after splitting equally
remaining_sentences = all_sentences[num_splits * sentences_per_split:]
if remaining_sentences:
    remaining_string = " ".join(sentence for sentence in remaining_sentences)
    concatenated_strings.append(remaining_string)

# for string in concatenated_strings:
#   print(string)
# print(concatenated_strings)
# print(len(concatenated_strings))

# print(concatenated_strings[1])


Enter the number of splits: 3


In [None]:
for string in concatenated_strings:
  print(string)

  # Find number of tokens
  example = tokenizer(
                string,
                truncation=False,
                # max_length=4096,
                return_tensors="pt"
            )

  print(f"INPUT LENGTH (tokens): {example['input_ids'].shape[-1]}")

Escalation for an obstetric review did not occur until 13:25 hours, this was an opportunity to expedite the birth of the Baby. Due to USS demand and capacity issues during the COVID-19 pandemic, growth USS were not continued until delivery as recommended in national guidance. As the staffing was low and the acuity was high on the night shift, and clinicians were used to â€˜managing' and supporting one another, there was a loss of awareness by clinicians of the increasing incidence of clinical risk in relation to the Mother. When the Mother attended with a history of raised BP and spontaneous rupture of membranes (SRM), the unit was busy which led to multiple handovers of care; the history of possible SRM and the Mother's concerns were not appreciated, and a full clinical examination did not occur. Triage calls were taken overnight by staff working on the labour ward who had competing demands. Staff absence led to the Mother not receiving community mental health support in pregnancy. Th

In [None]:
results = []

for string in concatenated_strings:
  example = tokenizer(string, truncation=False, return_tensors="pt")
  example.to(device)

  model = Unlimiformer.convert_model(model, **unlimiformer_kwargs)
  model.eval()
  model.to(device)

  # the output of the model /with/ unlimiformer
  unlimiformer_out = tokenizer.batch_decode(model.generate(**example, max_length=1024, min_length=512), ignore_special_tokens=True)[0]
  results.append(unlimiformer_out)

print(results)

INFO:Unlimiformer:Encoding 0 to 1024 out of 4748
INFO:Unlimiformer:Encoding 512 to 1536 out of 4748
INFO:Unlimiformer:Encoding 1024 to 2048 out of 4748
INFO:Unlimiformer:Encoding 1536 to 2560 out of 4748
INFO:Unlimiformer:Encoding 2048 to 3072 out of 4748
INFO:Unlimiformer:Encoding 2560 to 3584 out of 4748
INFO:Unlimiformer:Encoding 3072 to 4096 out of 4748
INFO:Unlimiformer:Encoding 3584 to 4608 out of 4748
INFO:Unlimiformer:Encoding 3724 to 4748 out of 4748
INFO:Unlimiformer:Encoding 0 to 1024 out of 4420
INFO:Unlimiformer:Encoding 512 to 1536 out of 4420
INFO:Unlimiformer:Encoding 1024 to 2048 out of 4420
INFO:Unlimiformer:Encoding 1536 to 2560 out of 4420
INFO:Unlimiformer:Encoding 2048 to 3072 out of 4420
INFO:Unlimiformer:Encoding 2560 to 3584 out of 4420
INFO:Unlimiformer:Encoding 3072 to 4096 out of 4420
INFO:Unlimiformer:Encoding 3396 to 4420 out of 4420
INFO:Unlimiformer:Encoding 0 to 1024 out of 1024
INFO:Unlimiformer:Encoding 0 to 1024 out of 4485
INFO:Unlimiformer:Encoding

["</s><s>Due to the unprecedented demand and capacity issues during the COVID-19 pandemic, growth USS were not continued until delivery as recommended in national guidance. This meant that there was an opportunity to expedite the birth of the Baby. The Baby was small for gestational age at birth (below the 10th centile for growth) which was not identified antenatally. The decision to decline IOL (IOL) around 36-37 weeks had the potential to trigger a different pathway of care such as earlier rupture of membranes or to assess if transfer directly to the operating theatre was required. A referral for an oral glucose tolerance test was not made for the Baby and an assessment of the blood pressure and urine levels were not conducted. A specialist of carer was not provided prior to the Baby's admission and there was no documented risk assessment or appropriate plan for fetal monitoring and waterbirth. The lack of a comprehensive and clinically adequate plan for the mother's ongoing manageme

### Evaluate using healthcare report

In [None]:
from google.colab import drive
import os

# Mount Google Drive
drive.mount("/content/drive")

# Path to the "My Drive" directory
my_drive_path = "/content/drive/My Drive"

# Check if "Themis.AI" folder exists within "My Drive"
themis_folder_path = os.path.join(my_drive_path, "SummarisationCalin")
if os.path.exists(themis_folder_path):
    directory_path = os.path.join(themis_folder_path, os.listdir(themis_folder_path)[0])
else:
    print("Could not find 'Themis.AI' folder in 'My Drive'")

report_path = os.path.join(themis_folder_path, "Report")

Mounted at /content/drive


In [None]:
%cd /content/

/content


In [None]:
import os
import fitz  # Import the fitz module from PyMuPDF

# Initialization of the list to store report text
report_texts = []

# Check if the directory exists
if os.path.exists(report_path):
    # List available report directories
    report_directories = [d for d in os.listdir(report_path) if os.path.isdir(os.path.join(report_path, d))]
    print("Available report directories:", report_directories)

    # Ask the user for the desired report directory
    chosen_directory = input("Enter the number of the directory you want to use (2, 3, 4, 5 or 6): ")

    # Validate and proceed with the chosen directory
    if chosen_directory in ['2', '3', '4', '5', '6']:
        chosen_directory = chosen_directory + '-way'

        # Construct the full path to the chosen directory
        directory_path = os.path.join(report_path, chosen_directory)

        # Traverse through the PDF files in the chosen directory
        if os.path.exists(directory_path):
            pdf_files = [file for file in os.listdir(directory_path) if file.endswith(".pdf")]

            # Read and clean text from PDF files
            for pdf_file in pdf_files:
                file_path = os.path.join(directory_path, pdf_file)
                pdf_document = fitz.open(file_path)
                pdf_text = " ".join([page.get_text().strip() for page in pdf_document])
                raw_bytes = pdf_text.encode()
                text = raw_bytes.decode("utf-8")
                # pdf_text = pdf_text.decode("utf-8")
                report_texts.append(text.replace("\n", " "))
                pdf_document.close()

            print("PDF files read and stored successfully.")
        else:
            print("Chosen directory not found.")
    else:
        print("Invalid choice. Please enter a valid option (2, 3, 4, 5 or 6).")
else:
    print("Directory not found.")

# Print the collected report texts
for index, report_text in enumerate(report_texts, start=1):
    print(f"Report {index}:\n{report_text}\n")


Available report directories: ['2-way', '3-way', '4-way', '5-way', '6-way']
Enter the number of the directory you want to use (2, 3, 4, 5 or 6): 3
PDF files read and stored successfully.
Report 1:
14    procedure during which the waters are broken this can be used to help labour  progress) and wait for two hours before deciding whether to start an IV oxytocin  infusion.  Oxytocin   This is one of the hormones produced naturally by mothers in labour and assists in  increasing the frequency of contractions. Oxytocin is given through a drip, and the  timing of the subsequent contractions, are monitored closely. If the contractions  are too sparse, or become too frequent, the amount of oxytocin given via the drip  can be altered if needed. (HSIB maternity team)  At 22:00 hours and 23:00 hours fresh eyes reviews were performed and the CTG  was categorised as normal.  It was recorded that there was repeated loss of contact  with the tocograph which was again managed by readjustment and chang

In [None]:
%cd unlimiformer/src

[Errno 2] No such file or directory: 'unlimiformer/src'
/content/unlimiformer/src


In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# example using govreport
modelname = "abertsch/unlimiformer-bart-govreport-alternating"
# dataset = load_dataset("urialon/gov_report_validation")

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained(modelname)

defaults = UnlimiformerArguments()
unlimiformer_kwargs = {
            'layer_begin': defaults.layer_begin,
            'layer_end': defaults.layer_end,
            'unlimiformer_head_num': defaults.unlimiformer_head_num,
            'exclude_attention': defaults.unlimiformer_exclude,
            'chunk_overlap': defaults.unlimiformer_chunk_overlap,
            'model_encoder_max_len': defaults.unlimiformer_chunk_size,
            'verbose': defaults.unlimiformer_verbose, 'tokenizer': tokenizer,
            'unlimiformer_training': defaults.unlimiformer_training,
            'use_datastore': defaults.use_datastore,
            'flat_index': defaults.flat_index,
            'test_datastore': defaults.test_datastore,
            'reconstruct_embeddings': defaults.reconstruct_embeddings,
            'gpu_datastore': defaults.gpu_datastore,
            'gpu_index': defaults.gpu_index
}

results = []

for string in page_text_list:
  example = tokenizer(string, truncation=False, return_tensors="pt")
  example.to(device)

  model = Unlimiformer.convert_model(model, **unlimiformer_kwargs)
  model.eval()
  model.to(device)

  # the output of the model /with/ unlimiformer
  unlimiformer_out = tokenizer.batch_decode(model.generate(**example, max_length=1024, min_length=512), ignore_special_tokens=True)[0]
  results.append(unlimiformer_out)

print(results)

Downloading (…)lve/main/config.json:   0%|          | 0.00/1.72k [00:00<?, ?B/s]

Downloading (…)olve/main/vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

Downloading (…)olve/main/merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/1.81k [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/558M [00:00<?, ?B/s]

INFO:Unlimiformer:Encoding 0 to 469 out of 469
INFO:Unlimiformer:Encoding 0 to 109 out of 109
INFO:Unlimiformer:Encoding 0 to 109 out of 109
INFO:Unlimiformer:Encoding 0 to 207 out of 207
INFO:Unlimiformer:Encoding 0 to 207 out of 207
INFO:Unlimiformer:Encoding 0 to 207 out of 207


["</s><s>The Health and Human Services Information Administration's (HHSIB) Health and Safety Committee held 6.1 hearingings and Safety recommendations on the work of the HSIB. The HSIB was comprised of 12.1 staff findings and safety recommendations, 8.1 findings, and 12.7 recommendations. This testimony highlights some of the actions taken by theHSIB in response to HSIB recommendations. For example: (1) the Mother booked for maternity care, having returned to the UK from overseas. She had no documentation of the care she had received. 2. The Mothers estimated date of birth from overseas place her at 37+4 weeks. After clinical review the ongoing intention was to utilise this date, this was not clearly documented and late gestation EDD from the Trust USS, 38+6 (overseas USS) was used in care planning. This placed the Mother's pregnancy two weeks and four days earlier than the correct gestation. (3) Management of theMother's pain relief was difficult to achieve. (4) The Baby's's head was

# Interesting stuff

https://github.com/Tuhin-SnapD/Text-Summarization-Models
https://huggingface.co/datasets/ccdv/govreport-summarization
