reports_generator

This package generates abstractive reports from sentences or paragraphs that treat the same topic.

Install

pip install git+https://github.com/the-deep/reports_generator

Package description

  • This package is based on pretrained models from the transformers library. It relies on a multilingual summarization model (languages can be found here).
  • It was first intended to generate reports automatically for the humanitarian sector, and was tested by generating real-world reports, one example being a Humanitarian Needs Overview for the Ukraine Project.
  • To keep only the most relevant information, the library runs repeated summarization iterations over the original text.
  • The pseudocode below shows the methodology used to create reports:
`Method: Main function call`
    `Inputs`: List of sentences or a paragraph
    `Output`: summary
    
    summarized text = OneSummarizationIteration(original text)
    While (summarized text too long and maximum number of iterations not reached):
        summarized text = OneSummarizationIteration(summarized text)
        
        
        
`Method: OneSummarizationIteration`
    `Inputs`: text
    `Outputs`: Summarized Text
    
    1 - Split text into sentences
    2 - Get sentence embeddings for each sentence
        i - preprocess sentence (delete stop words, remove punctuation, stem and lemmatize words)
        ii - get sentence embeddings using preprocessed sentences (using pretrained HuggingFace models).
    3 - Cluster sentences into different categories using the HDBSCAN clustering method
    4 - Generate summary for each cluster (using pretrained HuggingFace models).
    5 - Link the summaries of all clusters together and return them.
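
The two methods above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the package's actual implementation: `generate_report`, `one_summarization_iteration`, `split_sentences`, `max_words` and `max_iterations` are hypothetical names, the regex splitter stands in for a real sentence tokenizer, and the embedding, clustering and summarization models are replaced by two callables (`cluster_fn`, `summarize_fn`) so the control flow can be run without downloading any model. The embedding step is assumed to live inside `cluster_fn`.

```python
import re
from typing import Callable, Dict, List


def split_sentences(text: str) -> List[str]:
    """Naive splitter on '.', '!' and '?' (stand-in for a real tokenizer)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def one_summarization_iteration(
    text: str,
    cluster_fn: Callable[[List[str]], List[int]],
    summarize_fn: Callable[[str], str],
) -> str:
    """Split, cluster, summarize each cluster, then join the summaries."""
    sentences = split_sentences(text)
    labels = cluster_fn(sentences)  # one cluster label per sentence
    clusters: Dict[int, List[str]] = {}
    for sentence, label in zip(sentences, labels):
        clusters.setdefault(label, []).append(sentence)
    summaries = [summarize_fn(" ".join(group)) for group in clusters.values()]
    return " ".join(summaries)


def generate_report(
    text: str,
    cluster_fn: Callable[[List[str]], List[int]],
    summarize_fn: Callable[[str], str],
    max_words: int = 50,       # hypothetical "too long" threshold
    max_iterations: int = 3,   # hypothetical iteration cap
) -> str:
    """Repeat summarization until the text is short enough or the cap is hit."""
    summary = one_summarization_iteration(text, cluster_fn, summarize_fn)
    iterations = 1
    while len(summary.split()) > max_words and iterations < max_iterations:
        summary = one_summarization_iteration(summary, cluster_fn, summarize_fn)
        iterations += 1
    return summary


# Toy stand-ins: alternate sentences between two clusters, and
# "summarize" a cluster by keeping only its first sentence.
def toy_cluster(sentences: List[str]) -> List[int]:
    return [index % 2 for index in range(len(sentences))]


def toy_summarize(text: str) -> str:
    return split_sentences(text)[0]


report = generate_report("A one. B two. C three. D four.", toy_cluster, toy_summarize)
```

In a real setting, `cluster_fn` would embed the preprocessed sentences and cluster them, and `summarize_fn` would call a pretrained summarization model. The iteration cap guarantees termination even when a pass does not shorten the text.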

Usage

Basic Usage:

Get a report from raw text.

from reports_generator import ReportsGenerator
summarizer = ReportsGenerator()    
generated_report = summarizer(original_text)

Advanced Usage:

The following example shows another possible usage of the library, using multilabeled data.

import pandas as pd
from reports_generator import ReportsGenerator
summarizer = ReportsGenerator()

# Example of classified data to import, with two columns: `original_text` and `groundtruth_labels`
df = pd.read_csv(csv_path)  # csv_path: path to your CSV file
labels_list = df.groundtruth_labels.unique()

# Get summary for each label
generated_report = {}
for one_label in labels_list:
    mask_one_label = df.groundtruth_labels==one_label
    text_one_label = df[mask_one_label].original_text.tolist()
    generated_report[one_label] = summarizer(text_one_label)
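
The same group-then-summarize pattern can be tried without pandas or any model download. The sketch below is only an illustration: `group_texts_by_label`, `reports_by_label` and `fake_summarizer` are hypothetical names, it assumes one label per row, and the stand-in summarizer simply joins the texts where a `ReportsGenerator()` instance would be passed in practice.

```python
import csv
import io
from collections import defaultdict


def group_texts_by_label(csv_file, text_column="original_text",
                         label_column="groundtruth_labels"):
    """Return {label: [texts]} from an open CSV file with one label per row."""
    grouped = defaultdict(list)
    for row in csv.DictReader(csv_file):
        grouped[row[label_column]].append(row[text_column])
    return dict(grouped)


def reports_by_label(grouped_texts, summarizer):
    """Call `summarizer` once per label on that label's list of texts."""
    return {label: summarizer(texts) for label, texts in grouped_texts.items()}


def fake_summarizer(texts):
    """Hypothetical stand-in for ReportsGenerator(): joins instead of summarizing."""
    return " ".join(texts)


# Tiny in-memory CSV standing in for a real file on disk.
sample_csv = io.StringIO(
    "original_text,groundtruth_labels\n"
    "Roads are damaged.,logistics\n"
    "Clinics lack staff.,health\n"
    "Fuel is scarce.,logistics\n"
)

grouped = group_texts_by_label(sample_csv)
reports = reports_by_label(grouped, fake_summarizer)
```

Swapping `fake_summarizer` for a real summarizer object keeps the rest of the code unchanged, since it is only ever called with a list of texts.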

English and French languages

For English and French, we advise using language-specific summarization models by initializing the library as follows:

  • English reports:
summarizer = ReportsGenerator(summarization_model_name="sshleifer/distilbart-cnn-12-6") 
  • French reports:
summarizer = ReportsGenerator(summarization_model_name="plguillou/t5-base-fr-sum-cnndm") 
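
When the report language is only known at run time, the choice between the two models above can be wrapped in a small lookup. This is a sketch, not part of the package: `LANGUAGE_SPECIFIC_MODELS` and `summarizer_kwargs` are hypothetical names, and the language codes `"en"` and `"fr"` are an assumed convention. Only the `summarization_model_name` argument and the two model names come from this README.

```python
# Model names taken from the English and French recommendations above.
LANGUAGE_SPECIFIC_MODELS = {
    "en": "sshleifer/distilbart-cnn-12-6",
    "fr": "plguillou/t5-base-fr-sum-cnndm",
}


def summarizer_kwargs(language: str) -> dict:
    """Return constructor keyword arguments for a language code.

    An empty dict means "use the default multilingual model".
    """
    model_name = LANGUAGE_SPECIFIC_MODELS.get(language.lower())
    if model_name is None:
        return {}
    return {"summarization_model_name": model_name}


# Intended use (requires the package to be installed):
#   from reports_generator import ReportsGenerator
#   summarizer = ReportsGenerator(**summarizer_kwargs("fr"))
french_kwargs = summarizer_kwargs("fr")
```

Any other language code falls back to the default constructor call, which uses the multilingual model.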

Usage On Google Colab

As this library relies on heavy pretrained models, it requires a significant amount of RAM on the machine (at least 5 GB of RAM available). Therefore, we recommend using it on an online machine. We provide an example notebook using Google Colab here.