# Text Summarization Overview

For modelling, I perform both extractive and abstractive summarization. For extractive summarization, I use a BERT-based summarizer and customize it to load the pre-trained weights of **`SciBERT`**, a model that specializes in scientific text and therefore suits our purpose. For every text, I determine the optimal number of sentences for the extracted summary.
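
To make "optimal number of sentences" concrete, here is a minimal sketch of a generic elbow heuristic. The notebook itself delegates this choice to the summarizer's `calculate_optimal_k` method; the function `elbow_k` and the cost values below are hypothetical and only illustrate the idea, not that library's implementation.

```python
def elbow_k(inertias):
    """Return the k (1-based) at which the cost curve bends most sharply.

    inertias[i] is assumed to be the clustering cost for k = i + 1 clusters.
    """
    if len(inertias) < 3:
        # Too few points to measure a bend; fall back to the largest k tried.
        return len(inertias)
    best_k, best_bend = 1, float("-inf")
    for i in range(1, len(inertias) - 1):
        # Second difference: how much the improvement slows down at this k.
        bend = (inertias[i - 1] - inertias[i]) - (inertias[i] - inertias[i + 1])
        if bend > best_bend:
            best_k, best_bend = i + 1, bend
    return best_k


# Made-up costs for k = 1..6: the gains flatten out after k = 3.
costs = [100.0, 60.0, 30.0, 27.0, 25.0, 24.0]
print(elbow_k(costs))  # 3
```

Beyond the elbow, adding more sentences buys little extra coverage, so the summary stays short without dropping a major topic.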

For abstractive summarization, I first concatenate the abstract, the extractive summary, and the conclusion, since much of the important information can be found in them. Then I use the **`facebook/bart-large-cnn`** transformer model to generate the abstractive summary.
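
One detail worth watching in the concatenation step: casting a pandas column with `astype(str)` turns missing values into the literal string `"nan"`, which would then be fed to the summarizer. The sketch below shows one way to join sections while skipping missing ones. `combine_sections` is a hypothetical helper for illustration, not part of the class defined later.

```python
import math


def combine_sections(*sections):
    """Join text sections with single spaces, skipping missing or empty ones."""
    kept = []
    for section in sections:
        if section is None:
            continue
        if isinstance(section, float) and math.isnan(section):
            continue  # pandas represents missing text as a float NaN
        text = str(section).strip()
        if text:
            kept.append(text)
    return " ".join(kept)


combined = combine_sections("Abstract text.", float("nan"), "Conclusion text.")
print(combined)  # Abstract text. Conclusion text.
```

With a dataframe, such a helper could be applied row-wise over the three columns instead of the `astype(str).agg(" ".join, axis=1)` call.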


# Building the TextSummarizer class

In [1]:
!pip install sentencepiece
!pip install transformers
!pip install tensorflow # For CPMTokenizer; replaces the deprecated tensorflow-gpu package, which no longer installs
!pip install bert-extractive-summarizer

In [10]:
import pandas as pd
import numpy as np

# Data preprocessing
import string
import re
import nltk
from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

# Text Summarization
from transformers import AutoConfig, AutoModel, AutoTokenizer, pipeline
from summarizer import Summarizer

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/yassine/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /home/yassine/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /home/yassine/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
No CUDA runtime is found, using CUDA_HOME='/usr'


In [13]:
class TextSummarizer:
  def __init__(self, data):
    self.data = data

  # Helper functions

  # Get average word length in a document
  def avg_word(self, data):
    words = data.split()
    length = (sum(len(word) for word in words)/(len(words)+0.000001))

    return length
  
  # Get number of punctuations in a document
  def count_punctuation(self, data):
    punctuation_count = sum([1 for char in data if char in string.punctuation])

    return punctuation_count
  
  # Get optimal number of sentences for extractive summarization
  def get_optimal_number_sentences(self, data, model):

    optimal_num_sentences = model.calculate_optimal_k(data, k_max=10)

    return optimal_num_sentences
  
  # Extract numerical text features
  def extract_text_features(self, text_column):
    
    """
    Extracts text features such as number of stopwords, punctuations,
    numerical characters, average word length, average document length
    :param text_column: dataframe column to perform feature extraction on
    :return: dataframe with new feature columns
    """
    
    # Get number of stop words (a set makes membership tests fast)
    stop_words = set(stopwords.words('english'))
    self.data["num_stopwords"] = self.data[text_column].apply(lambda x:
    len([word for word in x.split() if word in stop_words]))

    # Get number of punctuation characters
    self.data["num_punctuations"] = self.data[text_column].apply(self.count_punctuation)

    # Get number of purely numeric tokens
    self.data["num_numerics"] = self.data[text_column].apply(lambda x:
    len([word for word in x.split() if word.isdigit()]))

    # Get number of words in the document
    self.data["num_words"] = self.data[text_column].apply(lambda x: 
    len(str(x).split(" ")))

    # Get average word length in document

    self.data["avg_word_length"] = self.data[text_column].apply(lambda x: 
    round(self.avg_word(x),1))

    # Get the stopwords to word ratio
    self.data["stopwords_to_words_ratio"] = round(self.data["num_stopwords"] / self.data["num_words"], 3)

    return self.data
  
  def extractive_summarizer(self, model, text_column):
    
    """
    Performs extractive text summarization with BERT and allows for different 
    pretrained model loading and configurations.
    :param model: initialized pretrained model
    :param text_column: dataframe column to perform text_summarization on
    :return: dataframe with summarized text columns
    """

    # The summarizer already returns the summary as a single string
    self.data["extractive_summarized_text"] = self.data[text_column].apply(lambda x:
    model(x, num_sentences=self.get_optimal_number_sentences(x, model)))

    return self.data   


  def join_extracted_summary(self, abstract, extracted_summary, conclusion):

    """
    Concatenates the abstract, extractive_summarized_text, and conclusion columns
    into one column for abstractive summarization
    :param abstract: abstract column
    :param extracted_summary: extractive_summarized_text column
    :param conclusion: conclusion column
    :return: dataframe with concatenated abstract, extracted summary and conclusion 
    columns
    """

    self.data["combined_text"] = self.data[[abstract, extracted_summary, conclusion]].astype(str).agg(
        " ".join, axis=1
    )

    return self.data

  def abstractive_summarizer(self, model, text_column, max_length=750, min_length=250):
    
    """
    Performs abstractive text summarization with BART using the extracted summary combined
    with the abstract and conclusion of the text.
    :param model: pipeline of the abstractive summarizer model
    :param text_column: dataframe column to perform text_summarization on
    :param max_length: maximum length of the generated summary, in tokens
    :param min_length: minimum length of the generated summary, in tokens
    :return: dataframe with summarized text columns
    """

    summaries_list = []
    # Iterate over the values directly so the loop does not depend on the index labels
    for text in self.data[text_column]:
      try:
        summary = model(text, max_length=max_length,
        min_length=min_length, do_sample=False)[0]["summary_text"]
      except Exception:
        # Most likely the input exceeds the model's 1024-token limit:
        # keep only the first 1024 characters (a conservative cut) and retry
        text = text[:1024]
        summary = model(text, max_length=max_length,
        min_length=min_length, do_sample=False)[0]["summary_text"]
      
      summaries_list.append(summary)
    
    self.data["abstractive_summaries"] = summaries_list
      
    return self.data
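
As a quick sanity check of the feature logic, the sketch below recomputes the same kinds of counts for a single string, outside the class. `text_features` and the tiny stop-word set are hypothetical stand-ins (the class uses the NLTK English list), and the sketch counts words with `split()` rather than `split(" ")`.

```python
import string


def text_features(text, stop_words):
    """Compute simple surface features for one document."""
    words = text.split()
    num_words = len(words)
    return {
        "num_stopwords": sum(1 for w in words if w in stop_words),
        "num_punctuations": sum(1 for ch in text if ch in string.punctuation),
        "num_numerics": sum(1 for w in words if w.isdigit()),
        "num_words": num_words,
        # max(..., 1) avoids dividing by zero on empty text
        "avg_word_length": round(sum(len(w) for w in words) / max(num_words, 1), 1),
    }


sample = "We train the model on 1000 papers, and it works."
features = text_features(sample, stop_words={"we", "the", "on", "and", "it"})
print(features)
# {'num_stopwords': 4, 'num_punctuations': 2, 'num_numerics': 1,
#  'num_words': 10, 'avg_word_length': 3.9}
```

Note that the capitalized "We" is not counted as a stop word because the comparison is case-sensitive and the stop-word list is lowercase. The same applies to `extract_text_features`, so lowercasing the text first would raise the stop-word counts.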

# Performing Text Summarization
## Extractive Summarization

In [5]:
text_df = pd.read_csv("top1000_cleaned.csv")
text_df.drop("Unnamed: 0", axis=1, inplace=True)

MAX_LENGTH = 100000  # Maximum number of characters kept per document; adjust to your model and memory


# Step 1: Fill missing values in the full_text column
text_df["full_text"] = text_df["full_text"].fillna("")  # Replace NaN with an empty string

# Step 2: Ensure all entries are strings
text_df["full_text"] = text_df["full_text"].astype(str)

# Step 3: Truncate each document to at most MAX_LENGTH characters
text_df["full_text"] = text_df["full_text"].str[:MAX_LENGTH]


text_class = TextSummarizer(text_df)

# Extract all features from the combined abstract, body, and conclusion text
text_class.extract_text_features("full_text")

# Use scibert to perform extractive summarization
pretrained_model = 'allenai/scibert_scivocab_uncased'

# Load model, model config and tokenizer via Transformers
custom_config = AutoConfig.from_pretrained(pretrained_model)
custom_config.output_hidden_states=True
custom_tokenizer = AutoTokenizer.from_pretrained(pretrained_model)
custom_model = AutoModel.from_pretrained(pretrained_model, config=custom_config)

# Create pretrained-model object
model = Summarizer(custom_model=custom_model, custom_tokenizer=custom_tokenizer)

# Extractive summarization
extractive_summarized_text = text_class.extractive_summarizer(model, "full_text")

# Optional: Save dataframe containing extractive summaries
extractive_summarized_text.to_csv("extractive_summarized_dataframe_final.csv")

# Check extractive summaries
extractive_summaries = extractive_summarized_text["extractive_summarized_text"]
print(extractive_summaries)

loading configuration file config.json from cache at /home/yassine/.cache/huggingface/hub/models--allenai--scibert_scivocab_uncased/snapshots/24f92d32b1bfb0bcaf9ab193ff3ad01e87732fc1/config.json
Model config BertConfig {
  "_name_or_path": "allenai/scibert_scivocab_uncased",
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.46.3",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 31090
}


0       We apply statistical machine translation (SMT)...
1       Parallel corpora have become an essential reso...
2       The concept of maximum entropy can be traced b...
3       We apply the hypothesis of “One Sense Per Disc...
4       Transformation-based learning has been success...
                              ...                        
1004    Most work in statistical parsing has focused o...
1005    Finding simple, non-recursive, base noun phras...
1006    Making use of latent semantic analysis, we exp...
1007    amp;quot;Two weeks later, Bonadea had already ...
1008    Standard approaches to Chinese word segmentati...
Name: extractive_summarized_text, Length: 1009, dtype: object


## Abstractive Summarization
Now we perform abstractive summarization on the extractively summarized text, combined with the abstract and conclusion.
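
The fallback in `abstractive_summarizer` keeps only the first 1024 characters when the model rejects an input, which discards most of a long document and can cut a word in half. One alternative, sketched below under the assumption that words are an acceptable rough proxy for tokens, is to split the text on word boundaries and summarize each chunk separately. `chunk_words` is a hypothetical helper, and a suitable `max_words` would have to be chosen well below the model's 1024-token limit.

```python
def chunk_words(text, max_words):
    """Split text into consecutive chunks of at most max_words words."""
    if max_words < 1:
        raise ValueError("max_words must be at least 1")
    words = text.split()
    return [
        " ".join(words[start:start + max_words])
        for start in range(0, len(words), max_words)
    ]


chunks = chunk_words("one two three four five", max_words=2)
print(chunks)  # ['one two', 'three four', 'five']
```

Each chunk could then be passed to the summarization pipeline and the partial summaries joined, at the cost of one model call per chunk.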

In [8]:
import pandas as pd

df = pd.read_csv("extractive_summarized_dataframe_final.csv", index_col=False)
potential_index_columns = [col for col in df.columns if col.startswith("Unnamed")]

# Drop these columns
df.drop(columns=potential_index_columns, inplace=True)

df.head()

|  | abstract | full_text | conclusion | num_stopwords | num_punctuations | num_numerics | num_words | avg_word_length | stopwords_to_words_ratio | extractive_summarized_text |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | We apply statistical machine translation (SMT)... | We apply statistical machine translation (SMT)... | We presented a novel approach to the problem o... | 1345 | 678 | 27 | 3879 | 5.5 | 0.347 | We apply statistical machine translation (SMT)... |
| 1 | Parallel corpora have become an essential reso... | Parallel corpora have become an essential reso... | For each item, participants were instructed to... | 4194 | 2235 | 60 | 11364 | 5.4 | 0.369 | Parallel corpora have become an essential reso... |
| 2 | The concept of maximum entropy can be traced b... | The concept of maximum entropy can be traced b... | We began by introducing the building blocks of... | 3725 | 1673 | 65 | 9622 | 5.0 | 0.387 | The concept of maximum entropy can be traced b... |
| 3 | We apply the hypothesis of “One Sense Per Disc... | We apply the hypothesis of “One Sense Per Disc... | The trigger labeling task described in this pa... | 934 | 439 | 24 | 2628 | 5.4 | 0.355 | We apply the hypothesis of “One Sense Per Disc... |
| 4 | Transformation-based learning has been success... | Transformation-based learning has been success... | We have presented in this paper a new and impr... | 1514 | 808 | 25 | 3922 | 5.1 | 0.386 | Transformation-based learning has been success... |


In [14]:
text_class = TextSummarizer(df)

# Concatenate the extractive summary with the abstract and conclusion
text_class.join_extracted_summary("abstract", "extractive_summarized_text", "conclusion")

# Instantiate abstractive summarizer model
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
abstractive_summarized_text = text_class.abstractive_summarizer(summarizer, "combined_text")

# Save to csv
abstractive_summarized_text.to_csv("abstractive_summarized_dataframe_final.csv")
abstractive_summaries = abstractive_summarized_text["abstractive_summaries"]

# Compare abstractive summaries and full text
print(abstractive_summarized_text[["full_text", "abstractive_summaries"]])

loading configuration file config.json from cache at /home/yassine/.cache/huggingface/hub/models--facebook--bart-large-cnn/snapshots/37f520fa929c961707657b28798b30c003dd100b/config.json
Model config BartConfig {
  "_name_or_path": "facebook/bart-large-cnn",
  "_num_labels": 3,
  "activation_dropout": 0.0,
  "activation_function": "gelu",
  "add_final_layer_norm": false,
  "architectures": [
    "BartForConditionalGeneration"
  ],
  "attention_dropout": 0.0,
  "bos_token_id": 0,
  "classif_dropout": 0.0,
  "classifier_dropout": 0.0,
  "d_model": 1024,
  "decoder_attention_heads": 16,
  "decoder_ffn_dim": 4096,
  "decoder_layerdrop": 0.0,
  "decoder_layers": 12,
  "decoder_start_token_id": 2,
  "dropout": 0.1,
  "early_stopping": true,
  "encoder_attention_heads": 16,
  "encoder_ffn_dim": 4096,
  "encoder_layerdrop": 0.0,
  "encoder_layers": 12,
  "eos_token_id": 2,
  "force_bos_token_to_be_generated": true,
  "forced_bos_token_id": 0,
  "forced_eos_token_id": 2,
  "gradient_checkpointin

loading weights file model.safetensors from cache at /home/yassine/.cache/huggingface/hub/models--facebook--bart-large-cnn/snapshots/37f520fa929c961707657b28798b30c003dd100b/model.safetensors
Generate config GenerationConfig {
  "bos_token_id": 0,
  "decoder_start_token_id": 2,
  "early_stopping": true,
  "eos_token_id": 2,
  "forced_bos_token_id": 0,
  "forced_eos_token_id": 2,
  "length_penalty": 2.0,
  "max_length": 142,
  "min_length": 56,
  "no_repeat_ngram_size": 3,
  "num_beams": 4,
  "pad_token_id": 1
}

All model checkpoint weights were used when initializing BartForConditionalGeneration.

All the weights of BartForConditionalGeneration were initialized from the model checkpoint at facebook/bart-large-cnn.
If your task is similar to the task the model of the checkpoint was trained on, you can already use BartForConditionalGeneration for predictions without further training.



loading file vocab.json from cache at /home/yassine/.cache/huggingface/hub/models--facebook--bart-large-cnn/snapshots/37f520fa929c961707657b28798b30c003dd100b/vocab.json
loading file merges.txt from cache at /home/yassine/.cache/huggingface/hub/models--facebook--bart-large-cnn/snapshots/37f520fa929c961707657b28798b30c003dd100b/merges.txt
loading file tokenizer.json from cache at /home/yassine/.cache/huggingface/hub/models--facebook--bart-large-cnn/snapshots/37f520fa929c961707657b28798b30c003dd100b/tokenizer.json
loading file added_tokens.json from cache at None
loading file special_tokens_map.json from cache at None
loading file tokenizer_config.json from cache at None
                                              full_text  \
0     We apply statistical machine translation (SMT)...   
1     Parallel corpora have become an essential reso...   
2     The concept of maximum entropy can be traced b...   
3     We apply the hypothesis of “One Sense Per Disc...   
4     Transformation-based learning has been success...   
...                                                 ...   
1004  Most work in statistical parsing has focused o...   
1005  Finding simple, non-recursive, base noun phras...   
1006  Making use of latent semantic analysis, we exp...   
1007  amp;quot;Two weeks later, Bonadea had already ...   
1008  Standard approaches to Chinese word segmentati...   

                                  abstractive_summaries  
0     We apply statistical machine translation (SMT)...  
1     Parallel corpora have become an essential reso...  
2     The concept of maximum entropy can be traced b...  
3     We apply the hypothesis of “One Sense Per Disc...  
4