# **Paper-to-Podcast**
## Group-1:
* Manisha Yadav,
* Deepika Parshvanath Velapure,
* Vijayalakshmi Pujar,
* Harshita Jaiswal


# **Sections**
* **Dataset Generation**
* **T5 model fine-tuning**
* **Text summarization of Research paper**
* **Saving summary and Converting summary to podcast(audio-file)**

In [None]:
# install dependencies
!pip3 install torch
!pip install -U transformers

In [None]:
!pip install sentencepiece
!pip install accelerate --force-reinstall

# Restart runtime after installation and run the below steps after runtime has restarted

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# **Dataset Generation Technique:**

* **Available dataset:**
1.   A txt file with the title of each paper and the URL to download it
2.   Target summaries, one file per paper, in the following format:
      * Each line contains the sentence index (in the original paper),
      * the sentence score (i.e. duration),
      * and the sentence itself.

The fields are tab-separated.

The order of the sentences is according to their order in the paper.
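
As a rough illustration of the format described above, the sketch below parses the lines of one summary file into (index, score, sentence) tuples. The function name `parse_summary_lines` and the sample line are made up for this example, and the score is assumed to be numeric; the notebook itself reads these files with pandas.

```python
def parse_summary_lines(lines):
    """Parse tab-separated summary lines into (index, score, sentence) tuples."""
    parsed = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # Split on the first two tabs only, so tabs inside the sentence survive
        index_str, score_str, sentence = line.split("\t", 2)
        parsed.append((int(index_str), float(score_str), sentence))
    return parsed


# Hypothetical sample line, not taken from the dataset
sample = ["3\t2.5\tWe propose a neural model for coordination.\n"]
print(parse_summary_lines(sample))
```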

We filtered the entire dataset to find the papers that are downloadable, using the Beautiful Soup Python library to locate the PDF link on each paper's page.

Papers that could not be fetched were marked as "Failed to download" and filtered out; the downloadable ones were then used to generate the dataset.
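
The filtering step can be pictured with the plain-Python sketch below, which uses a sentinel string to mark failed downloads and then keeps only the usable records. The record layout and the helper name `keep_downloadable` are assumptions for illustration; the notebook applies the same idea to a pandas DataFrame.

```python
FAILED = "Failed to download"  # sentinel used to mark papers that could not be fetched


def keep_downloadable(records):
    """Return only the records whose PDF content was actually downloaded."""
    return [r for r in records if r["pdf_content"] != FAILED]


# Hypothetical records for illustration
records = [
    {"title": "Paper A", "pdf_content": b"%PDF-1.4 ..."},
    {"title": "Paper B", "pdf_content": FAILED},
]
print([r["title"] for r in keep_downloadable(records)])  # → ['Paper A']
```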

In [None]:
import pandas as pd
import requests
from bs4 import BeautifulSoup


In [None]:
# Reading the training pdf files into a Pandas DataFrame

file_path = '/content/drive/MyDrive/nlp_project/talksumm_papers_titles_url.txt'   # Path to text file containing ACL URLs
training_pdf_files_df = pd.read_csv(file_path, delimiter='\t', header=None, names=['Title', 'URLs'])

# Function to extract direct PDF download link from ACL page
def extract_pdf_link(acl_url):
    try:
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'
        }
        response = requests.get(acl_url, headers=headers)
        if response.status_code != 200:
            print(f"Failed to fetch URL: {acl_url}. Status code: {response.status_code}")
            return None
        soup = BeautifulSoup(response.content, 'html.parser')
        # The download button sits inside the paper link block of the ACL page
        divTag = soup.find("div", {"class": "acl-paper-link-block"})
        atag = divTag.find("a", {'class': 'btn-primary'}) if divTag else None
        if atag is None or not atag.get('href'):
            # The page does not follow the expected ACL Anthology layout
            return None
        download_link = atag.get('href')
        if not download_link.startswith('http'):
            download_link = f"https://www.aclweb.org{download_link}"
        return download_link
    except Exception as e:
        print(f"Error occurred while processing URL: {acl_url}. Error: {e}")
        return None

# Function to download PDF content given a direct PDF download link
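def download_pdf_content(pdf_url):
    # No direct PDF link was found for this paper, so there is nothing to download
    if pd.isna(pdf_url):
        return "Failed to download"
    try:
        # Note: verify=False disables TLS certificate verification for this request
        response = requests.get(pdf_url, verify=False)
        if response.status_code == 200:
            return response.content
        else:
            print(f"Failed to fetch PDF from URL: {pdf_url}. Status code: {response.status_code}")
            return "Failed to download"
    except Exception as e:
        print(f"Error occurred while downloading PDF from URL: {pdf_url}. Error: {e}")
        return "Failed to download"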

# Extract direct PDF download links from ACL URLs in DataFrame
training_pdf_files_df['Direct_PDF_Link'] = training_pdf_files_df['URLs'].apply(extract_pdf_link)

# Download PDF content from direct PDF download links
training_pdf_files_df['PDF_Content'] = training_pdf_files_df['Direct_PDF_Link'].apply(download_pdf_content)

# Display the DataFrame with PDF content
print(training_pdf_files_df)

# **Generated Dataset Format:**
We generated the final dataset as a list of JSON objects, where each object holds the input_text and target_summary for one page of a downloadable research paper.
* 967 papers
* JSON object: one data point
    * "input_text": the text of one page
    * "target_summary": the target summary for that page
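
A minimal sketch of what one data point might look like once serialized. The page text and summary strings are placeholders, not real dataset content.

```python
import json

# Placeholder values; real entries hold the text of one PDF page and its summary
data_points = [
    {
        "input_text": "Text of page 1 of a paper ...",
        "target_summary": "Summary sentences that belong to page 1 ...",
    },
]

# Serialize the list of data points the same way a dataset file would store it
json_string = json.dumps(data_points, indent=2)
print(json_string)
```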


In [None]:
# Keep only the papers (title and URL) that were downloadable in the ACL Anthology website format
training_pdf_files_df = training_pdf_files_df.loc[training_pdf_files_df['PDF_Content'] != "Failed to download"]
print(training_pdf_files_df)
name = str(training_pdf_files_df.head(1)['Title'].iloc[0]);
print(name)

                                                  Title  \
0     A Binarized Neural Network Joint Model for Mac...   
4     A Co-Matching Model for Multi-choice Reading C...   
5     A Comparison between Count and Neural Network ...   
6     A Comparison of Word Similarity Performance Us...   
7     A Computational Cognitive Model of Novel Word ...   
...                                                 ...   
1695  Zeroshot Multimodal Named Entity Disambiguatio...   
1697  Zipporah: a Fast and Scalable Data Cleaning Sy...   
1699  diaNED: Time-Aware Named Entity Disambiguation...   
1700  emrQA: A Large Corpus for Question Answering o...   
1704  simNet: Stepwise Image-Topic Merging Network f...   

                                            URLs  \
0           https://doi.org/10.18653/v1/d15-1250   
4     https://www.aclweb.org/anthology/P18-2118/   
5           https://doi.org/10.18653/v1/d15-1165   
6            https://doi.org/10.3115/v1/n15-1101   
7           https://doi.org/10.

In [None]:
import zipfile
import io
# Reading the target summaries into Pandas dataframe:

# Path to your ZIP file containing text files
zip_file_path = '/content/drive/MyDrive/nlp_project/talksumm_summaries.zip'

# Initialize an empty list to store dataframes for each text file
all_paper_summary_dfs = []
df_columns_names = ['Sentence_Index', 'Range', 'Summary']
error_file_names = []
files_parsed = []
# Open the ZIP file and read its contents
with zipfile.ZipFile(zip_file_path, 'r') as zip_ref:
    # Iterate through the list of files in the ZIP archive
    for file_name in zip_ref.namelist():
        if file_name.endswith('.txt'):
            # Extract the text file from the ZIP archive
            with zip_ref.open(file_name) as file:
              try:
                # Read the tab-separated data into a DataFrame
                each_file_df = pd.read_csv(io.TextIOWrapper(file, encoding='latin-1'), sep='\t', names=df_columns_names)
                pdf_file_name = file_name.split("/")[1]
                pdf_file_name = pdf_file_name[:len(pdf_file_name) - 4]
                each_file_df.insert(0, "Title", pdf_file_name)
                files_parsed.append(file_name)
              except pd.errors.ParserError:
                # Record files that could not be parsed and skip them
                error_file_names.append(file_name)
              else:
                # Append the DataFrame to the list only when parsing succeeded
                all_paper_summary_dfs.append(each_file_df)

# Concatenate all DataFrames into a single DataFrame
total_summary_df = pd.concat(all_paper_summary_dfs, ignore_index=True)

# Drop rows containing NaN values
total_summary_df = total_summary_df.dropna()

# Files with error
print(len(error_file_names))
print(len(files_parsed))

# Display the final DataFrame
print(total_summary_df.head)
name2 = str(total_summary_df.head(1)['Title'].iloc[0]);
print(name2)

1705
1705
<bound method NDFrame.head of                                                     Title  Sentence_Index  \
0       A Neural Network for Coordination Boundary Pre...               0   
1       A Neural Network for Coordination Boundary Pre...               3   
2       A Neural Network for Coordination Boundary Pre...               4   
3       A Neural Network for Coordination Boundary Pre...              13   
4       A Neural Network for Coordination Boundary Pre...              46   
...                                                   ...             ...   
100539  Bootstrapping into Filler-Gap: An Acquisition ...             150   
100540  Bootstrapping into Filler-Gap: An Acquisition ...             152   
100541  Bootstrapping into Filler-Gap: An Acquisition ...             153   
100542  Bootstrapping into Filler-Gap: An Acquisition ...             155   
100543  Bootstrapping into Filler-Gap: An Acquisition ...             156   

        Range                      

In [None]:
# libraries for the pdf reading
!pip install pymupdf

In [None]:
# Reading pdf page content from the dataframe
import io
import fitz  # PyMuPDF

training_data = []
for index, row in training_pdf_files_df.iterrows():
  # Select the summary rows of the current paper by label. Index labels no longer match
  # positions after dropna(), so loc is used instead of iloc. Duplicate rows are dropped.
  summary_df_of_corresponding_pdf = total_summary_df.loc[total_summary_df['Title'] == row['Title']].drop_duplicates()
  # print(summary_df_of_corresponding_pdf)

  # Create a bytes IO object
  pdf_bytes_io = io.BytesIO(row['PDF_Content'])

  # Create a PDF reader
  # pdf_reader = PyPDF2.PdfReader(pdf_bytes_io);
  # Open the PDF using PyMuPDF (fitz)
  pdf_document = fitz.open(stream=pdf_bytes_io, filetype="pdf")
  #  empty list to store text chunks ( chunk/per page)
  text_chunks = []
  page_and_its_summary = {}

  # Loop through the PDF pages and split into chunks
  #with pdfplumber.open(pdf_bytes_io) as pdf:
  #  for page_num in range(len(pdf.pages)):
  #   page = pdf.pages[page_num]
  #   text = page.extract_text()
  for page_num in range(pdf_document.page_count):
    page = pdf_document.load_page(page_num)
    # Extract text from the current page
    text = page.get_text("text")
    if page_num == 0 and "abstract" in text.lower():
      # Keep the first page from the "Abstract" heading onward.
      # find() returns a character offset, which is what the slice below needs.
      abstract_index = text.lower().find("abstract")
      abstract_text = text[abstract_index:]
      text_chunks.append(abstract_text)
    # if ((page_num == len(pdf.pages) - 1) and (("References" in text) or ("REFERENCES" in text) or ("references" in text))):
    if ((page_num == pdf_document.page_count - 1) and (("References" in text) or ("REFERENCES" in text) or ("references" in text))):
      continue
    text_chunks.append(text)
    # print("Chunk: " + text)

  first_chunk_processed = False
  for each_chunk in text_chunks:
    each_page_summary = ""
    if not first_chunk_processed:
      no_of_sentences_in_chunk = len(each_chunk.split(".")) - 7
      first_chunk_processed = True
    else:
      no_of_sentences_in_chunk = len(each_chunk.split("."))

    indices_of_each_chunk = summary_df_of_corresponding_pdf.loc[summary_df_of_corresponding_pdf['Sentence_Index'] < no_of_sentences_in_chunk].index.tolist()
    # print(indices_of_each_chunk)

    # print(summary_df_of_corresponding_pdf)
    # print(indices_of_each_chunk)

    summary_df_of_each_chunk = summary_df_of_corresponding_pdf.loc[indices_of_each_chunk]
    # print(summary_df_of_each_chunk)

    # Use separate loop variable names so the outer loop's index and row are not shadowed
    for _, summary_row in summary_df_of_each_chunk.iterrows():
      each_page_summary += summary_row['Summary']
      each_page_summary += " "

    # print(row['Title'])
    # print()
    # print(each_page_summary)
    # print()

    data_point = {
                'input_text': each_chunk,
                'target_summary': each_page_summary
            }
    training_data.append(data_point)
    print(data_point)

# **Final format of Dataset:**
 * The dataset is a list of JSON objects, where each object holds the "input_text" of one page and the "target_summary" for that page.
 * It is saved in the target folder as the "dataset.json" file.
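
Before fine-tuning, it can help to check that every entry really has both fields. The helper below is a hypothetical sanity check (`validate_dataset` is not part of the notebook) and assumes the file holds a list of objects with string values, as described above.

```python
import json

REQUIRED_KEYS = {"input_text", "target_summary"}


def validate_dataset(entries):
    """Return the positions of entries that are missing a required string field."""
    bad = []
    for position, entry in enumerate(entries):
        has_keys = isinstance(entry, dict) and REQUIRED_KEYS.issubset(entry.keys())
        if not has_keys or not all(isinstance(entry[k], str) for k in REQUIRED_KEYS):
            bad.append(position)
    return bad


# Usage with an in-memory JSON string standing in for the saved file
entries = json.loads(
    '[{"input_text": "page text", "target_summary": "summary"}, {"input_text": "only input"}]'
)
print(validate_dataset(entries))  # → [1]
```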

In [None]:
# Number of page-level data points generated from the 967 downloadable research papers
print(len(training_data))
import json
# Convert list of JSON objects to a JSON-formatted string
json_string = json.dumps(training_data, indent=2)

# Define the file path where you want to save the JSON file
file_path = '/content/drive/MyDrive/CS510_NLP/Final_Project/ResearchPaper_Dataset/dataset.json'

# Write the JSON string to a file
with open(file_path, 'w') as json_file:
    json_file.write(json_string)

print(f"Dataset saved to: {file_path}")

# **T5 Model**
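T5 is an encoder-decoder (text-to-text) Transformer model: its encoder attends to both the left and right context of each token, and its decoder generates the output text token by token.
Attention masks let the model focus on the relevant parts of the input text and ignore padding.
It can be fine-tuned to perform specific text processing tasks using domain-specific datasets.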
* **Load the Dataset for T5 model fine-tuning**
* **Split the Dataset into:**
  * Training set
  * Validation set
  * Test set
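
The two-stage split used here holds out 10% of the data for testing and then 10% of the remainder for validation, so the shares are roughly 81% / 9% / 10% overall. The sketch below works through that arithmetic. It assumes 10,184 data points (the sum of the split sizes the notebook reports) and assumes the held-out share is rounded up, which is how `train_test_split` treats a fractional `test_size` as far as we know.

```python
import math


def split_sizes(n_samples, test_frac=0.1, val_frac=0.1):
    """Sizes of a two-stage train/validation/test split (held-out share rounded up)."""
    n_test = math.ceil(test_frac * n_samples)
    remaining = n_samples - n_test
    n_val = math.ceil(val_frac * remaining)
    n_train = remaining - n_val
    return n_train, n_val, n_test


print(split_sizes(10184))  # → (8248, 917, 1019)
```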

In [None]:
import json
from sklearn.model_selection import train_test_split
# Load dataset from JSON file
with open('/content/drive/MyDrive/CS510_NLP/Final_Project/ResearchPaper_Dataset/dataset.json', 'r') as file:
    dataset = json.load(file)

# Split dataset into train, validation, and test sets
train_data, test_data = train_test_split(dataset, test_size=0.1, random_state=42)
train_data, val_data = train_test_split(train_data, test_size=0.1, random_state=42)

# Verify dataset sizes
print(f"Train set size: {len(train_data)}")
print(f"Validation set size: {len(val_data)}")
print(f"Test set size: {len(test_data)}")


Train set size: 8248
Validation set size: 917
Test set size: 1019


In [None]:
# Saving the test data as json file for testing purpose later on
with open('/content/drive/MyDrive/CS510_NLP/Final_Project/ResearchPaper_Dataset/test_dataset_4_epoch.json', 'w') as test_file:
    json.dump(test_data, test_file)

In [None]:
# Tokenize the datasets loaded from the JSON file
from transformers import T5Tokenizer, T5ForConditionalGeneration, Trainer, TrainingArguments
tokenizer = T5Tokenizer.from_pretrained('t5-base')

# Extract input_text and target_summary from train_data
input_texts_train = [example['input_text'] for example in train_data]
target_summaries_train = [example['target_summary'] for example in train_data]

# Tokenize and encode input texts
input_encodings_train = tokenizer(input_texts_train, padding='max_length', truncation=True, return_tensors='pt', max_length=128)

# Tokenize and encode target summaries
target_encodings_train = tokenizer(target_summaries_train, padding='max_length', truncation=True, return_tensors='pt', max_length=128)

# Tokenizing the validation dataset
# Extract input_text and target_summary from val_data
input_texts_val = [example['input_text'] for example in val_data]
target_summaries_val = [example['target_summary'] for example in val_data]

# Tokenize and encode input texts
input_encodings_val = tokenizer(input_texts_val, padding='max_length', truncation=True, return_tensors='pt', max_length=128)

# Tokenize and encode target summaries
target_encodings_val = tokenizer(target_summaries_val, padding='max_length', truncation=True, return_tensors='pt', max_length=128)

For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-base automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.
You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [None]:
print("Input IDs shape:", input_encodings_train['input_ids'].shape)
print("Attention mask shape:", input_encodings_train['attention_mask'].shape)
print("Target labels shape:", target_encodings_train['input_ids'].shape)

print("Input IDs shape:", input_encodings_val['input_ids'].shape)
print("Attention mask shape:", input_encodings_val['attention_mask'].shape)
print("Target labels shape:", target_encodings_val['input_ids'].shape)

Input IDs shape: torch.Size([8248, 128])
Attention mask shape: torch.Size([8248, 128])
Target labels shape: torch.Size([8248, 128])
Input IDs shape: torch.Size([917, 128])
Attention mask shape: torch.Size([917, 128])
Target labels shape: torch.Size([917, 128])


In [None]:
!pip install datasets

In [None]:
from datasets import Dataset
dataset_train = Dataset.from_dict({
        'input_ids': input_encodings_train['input_ids'],
        'attention_mask': input_encodings_train['attention_mask'],
        'labels': target_encodings_train['input_ids']
    })
dataset_val = Dataset.from_dict({
        'input_ids': input_encodings_val['input_ids'],
        'attention_mask': input_encodings_val['attention_mask'],
        'labels': target_encodings_val['input_ids']
    })
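
In this notebook the padded target IDs are used directly as `labels`. A common variant, shown here only as an optional sketch and not what the notebook does, replaces padding positions with -100 so that the loss skips them. It assumes the pad token ID is 0 (as in the T5 tokenizer) and works on plain Python lists to stay self-contained; the token IDs are made up.

```python
def mask_padding(label_ids, pad_token_id=0, ignore_index=-100):
    """Replace padding IDs in each label sequence with the ignore index."""
    return [
        [ignore_index if token_id == pad_token_id else token_id for token_id in sequence]
        for sequence in label_ids
    ]


# Hypothetical token IDs: two target sequences padded to length 5
labels = [[37, 1782, 1, 0, 0], [100, 19, 3, 9, 1]]
print(mask_padding(labels))
# → [[37, 1782, 1, -100, -100], [100, 19, 3, 9, 1]]
```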

# **Training (fine-tuning T5):**
**Load the train and validation dataset**
* training epochs = 4
* training batch size = 7
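
With these settings the number of optimizer steps can be worked out by hand. The sketch assumes a single device, no gradient accumulation, and the 8,248 training examples reported earlier; under those assumptions it reproduces the `global_step=4716` shown in the training output.

```python
import math


def total_training_steps(n_examples, batch_size, epochs):
    """Optimizer steps when the last partial batch of each epoch is kept."""
    steps_per_epoch = math.ceil(n_examples / batch_size)
    return steps_per_epoch * epochs


print(total_training_steps(8248, 7, 4))  # → 4716
```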

In [None]:
import torch

model = T5ForConditionalGeneration.from_pretrained('t5-base')

# Fine-tune the model
dir_to_save_model = f"/content/drive/MyDrive/CS510_NLP/Final_Project/T5_Model/model_data_4_epoch"
log_dir = f"/content/drive/MyDrive/CS510_NLP/Final_Project/T5_Model/logs"
training_args = TrainingArguments(
    auto_find_batch_size=True,
    output_dir=dir_to_save_model,
    num_train_epochs=4,
    per_device_train_batch_size=7,
    per_device_eval_batch_size=7,
    logging_dir=log_dir,
    logging_steps=500,
    evaluation_strategy='steps',
    eval_steps=500,
)

In [None]:
# Initialize Trainer with model, training arguments, and train dataset
trainer = Trainer(
    model=model,  # T5-base model
    args=training_args,
    train_dataset=dataset_train,
    eval_dataset=dataset_val,
)

In [None]:
# Start training
trainer.train()

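| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 500  | 3.5864 | 3.036115 |
| 1000 | 3.1639 | 2.834993 |
| 1500 | 2.9624 | 2.690738 |
| 2000 | 2.865  | 2.577353 |
| 2500 | 2.76   | 2.488248 |
| 3000 | 2.6896 | 2.421905 |
| 3500 | 2.6323 | 2.371982 |
| 4000 | 2.5855 | 2.337987 |
| 4500 | 2.5711 | 2.320912 |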


TrainOutput(global_step=4716, training_loss=2.8547670786617365, metrics={'train_runtime': 2824.838, 'train_samples_per_second': 11.679, 'train_steps_per_second': 1.669, 'total_flos': 5022684681338880.0, 'train_loss': 2.8547670786617365, 'epoch': 4.0})

# **Testing the fine-tuned T5 model**
* Load the last saved checkpoint of the fine-tuned model (checkpoint-4500) and the test set
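
The checkpoint used for testing is `checkpoint-4500` rather than the final step 4716. That is what one would expect if checkpoints are written every 500 steps, which we assume here because no `save_steps` value is set in the training arguments (500 is the Trainer default as far as we are aware). A small sketch of that reasoning:

```python
def last_checkpoint_step(total_steps, save_steps=500):
    """Largest multiple of save_steps that does not exceed total_steps."""
    return (total_steps // save_steps) * save_steps


print(last_checkpoint_step(4716))  # → 4500
```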

In [None]:
from transformers import T5Tokenizer, T5ForConditionalGeneration
import json
from datasets import Dataset

In [None]:
# install libraries for evaluation metrics
!pip install rouge-score

In [None]:
from rouge_score import rouge_scorer

In [None]:
# Load fine-tuned T5 model and tokenizer
model = T5ForConditionalGeneration.from_pretrained('/content/drive/MyDrive/CS510_NLP/Final_Project/T5_Model/model_data_4_epoch/checkpoint-4500')
tokenizer = T5Tokenizer.from_pretrained('t5-base')



In [None]:
# Load test set from JSON file
with open('/content/drive/MyDrive/CS510_NLP/Final_Project/ResearchPaper_Dataset/test_dataset_4_epoch.json', 'r') as file:
    test_dataset = json.load(file)

# Verify dataset sizes
print(f"Test set size: {len(test_dataset)}")

Test set size: 1019


In [None]:
# Extract input_text and target_summary from test_data
input_texts_test = [example['input_text'] for example in test_dataset]

# Tokenize and encode input texts
input_encodings_test = tokenizer(input_texts_test, padding='max_length', truncation=True, return_tensors='pt', max_length=128)


In [None]:
print("Input IDs shape:", input_encodings_test['input_ids'].shape)
print("Attention mask shape:", input_encodings_test['attention_mask'].shape)

Input IDs shape: torch.Size([1019, 128])
Attention mask shape: torch.Size([1019, 128])


In [None]:
# Setting batch size and generating summaries in batches
batch_size = 10
num_samples = len(input_texts_test)

# Decoding parameters for diverse beam search, used to encourage diversity.
# temperature is left out because it only affects sampling, which is not enabled here.
decoder_params = {
    'max_length': 150,
    'num_beams': 4,  # Number of beams for beam search
    'num_beam_groups': 2,  # Split the beams into groups for diverse beam search
    'diversity_penalty': 1.0,  # Apply a diversity penalty between the groups
    'no_repeat_ngram_size': 3,  # Avoid repeating n-grams
}

# Calculate ROUGE scores for each pair of input and generated text
scorer = rouge_scorer.RougeScorer(['rouge1', 'rougeL'], use_stemmer=True)

for i in range(0, num_samples, batch_size):
    input_batch = {k: v[i:i+batch_size] for k, v in input_encodings_test.items()}

    # Generate summaries
    summaries = model.generate(
        input_ids=input_batch['input_ids'],
        attention_mask=input_batch['attention_mask'],
        **decoder_params
    )

    # Decode the generated summaries and print them
    decoded_summaries = tokenizer.batch_decode(summaries, skip_special_tokens=True)

    # Pair each summary with the input text from the same batch
    for idx, (input_text, generated_summary) in enumerate(zip(input_texts_test[i:i+batch_size], decoded_summaries), start=i + 1):
      print("Input Text: ", input_text)
      print()
      print("Generated Summary: ", generated_summary)
      print()
      # RougeScorer.score expects (target, prediction); the input text is the reference here
      scores = scorer.score(input_text, generated_summary)
      print(f"Scores for pair {idx}:")
      print("ROUGE-1:", scores['rouge1'])
      print("ROUGE-L:", scores['rougeL'])
      print()



In [None]:
# to extract text from pdf.
!pip3 install PyPDF2

In [None]:
# restart runtime after sentencepiece installation
!pip install sentencepiece
!pip install rouge-score

Installing collected packages: rouge-score
Successfully installed rouge-score-0.1.2


# **Evaluation of single Research Paper:**
* **Input:**  Research Paper(pdf)
* **Output:** Summary pdf
* **Evaluation metric:**  ROUGE score
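
To make the metric concrete, here is a simplified ROUGE-1 computation (unigram overlap). It is a sketch only: it lowercases and splits on whitespace, without the stemming and tokenization that the `rouge_score` package applies, so its numbers will not match the library exactly. The function name `rouge1` and the example sentences are made up for illustration.

```python
from collections import Counter


def rouge1(reference, prediction):
    """Simplified ROUGE-1: precision, recall and F1 over unigram overlap."""
    ref_counts = Counter(reference.lower().split())
    pred_counts = Counter(prediction.lower().split())
    overlap = sum((ref_counts & pred_counts).values())  # clipped unigram matches
    precision = overlap / max(sum(pred_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1


p, r, f = rouge1("the cat sat on the mat", "the cat lay on the mat")
print(round(p, 3), round(r, 3), round(f, 3))  # → 0.833 0.833 0.833
```

Precision is measured against the length of the prediction and recall against the length of the reference, so swapping the two arguments swaps the two values.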

In [None]:
# Testing the model on a single research paper pdf file
import PyPDF2
import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration
from rouge_score import rouge_scorer

In [None]:
# pdf file as input
# read the pdf using PyPDF2.PdfReader
# save the extracted text
pdf_path = "/content/drive/MyDrive/CS510_NLP/Final_Project/Test_Paper_P11-1061.pdf"

pdf_reader = PyPDF2.PdfReader(pdf_path)

# list of per-page texts and a string holding the full extracted text
pdf_text = []
full_pdf_text = ""
# extract the text of each page once, then store and concatenate it
for page in pdf_reader.pages:
    page_text = page.extract_text()
    pdf_text.append(page_text)
    full_pdf_text += page_text


In [None]:
# Load saved fine-tuned T5 model's last checkpoint and tokenizer
fine_tuned_model = T5ForConditionalGeneration.from_pretrained('/content/drive/MyDrive/CS510_NLP/Final_Project/T5_Model/model_data_4_epoch/checkpoint-4500')
tokenizer = T5Tokenizer.from_pretrained('t5-base')

In [None]:
# Generate summaries for each page of a single pdf text

# Decoding parameters for diverse beam search, used to encourage diversity.
# temperature is left out because it only affects sampling, which is not enabled here.
decoder_params = {
    'max_length': 150,
    'num_beams': 4,  # Number of beams for beam search
    'num_beam_groups': 2,  # Split the beams into groups for diverse beam search
    'diversity_penalty': 1.0,  # Apply a diversity penalty between the groups
    'no_repeat_ngram_size': 3,  # Avoid repeating n-grams
}

# Calculate ROUGE scores between the full paper text and the generated summary
scorer = rouge_scorer.RougeScorer(['rouge1', 'rougeL'], use_stemmer=True)

full_pdf_summary = ""
for each_page in pdf_text:
  # Tokenize and encode page text
  input_encodings_of_single_pdf = tokenizer(each_page, padding='max_length', truncation=True, return_tensors='pt', max_length=128)

  # Generate summary for each page
  each_page_summary = fine_tuned_model.generate(
        input_ids=input_encodings_of_single_pdf['input_ids'],
        attention_mask=input_encodings_of_single_pdf['attention_mask'],
        **decoder_params
    )

  # Decode the generated summary
  decoded_each_page_summary = tokenizer.decode(each_page_summary[0], skip_special_tokens=True)

  # Separate page summaries with HTML line breaks for the PDF output
  full_pdf_summary += decoded_each_page_summary
  full_pdf_summary += "\n<br>\n<br>"

print("Full pdf summary: ", full_pdf_summary)
print()
# RougeScorer.score expects (target, prediction); the full paper text is the reference here
scores = scorer.score(full_pdf_text, full_pdf_summary)
print("ROUGE-1:", scores['rouge1'])
print("ROUGE-L:", scores['rougeL'])

Full pdf summary:  In this paper, we focus on the task of part-of-speech tagging, which aims to identify a segment of a sentence in a text that is likely to be uttered by a specific language. The task aims at identifying the part of the sentence that a user wants to read, and then identifying its target language. In this work, we propose a new approach to segmenting text, namely segmenting the text into a single segment, and labeling the segment of the document into segments of the text.
<br>
<br>In this paper, we focus on the task of multilingual part-of-speech induction, a task that is primarily concerned with generating grammatical representations of words in a single language. We also focus on a new task, which is to generate a multilingual representation of a word in “English”. We propose a model based on the universal part of speech tag, which can be trained on the lexical representation of the word in the input sentence. We use the universal feature of the tag, i.e., the univers

In [None]:
# libraries for saving pdf format of summary
!pip install pdfkit
!apt-get install -y wkhtmltopdf
import pdfkit

In [None]:
# Configure options for PDF generation
options = {
    'page-size': 'A4',
    'orientation': 'Portrait'
}
# Store the full_pdf_summary in a pdf file to compare with the summary generated by any other text summarization model.
summary_file_path = "/content/drive/MyDrive/CS510_NLP/Final_Project/full_pdf_summary_T5_model.pdf"

# Save the string content as a PDF file
pdfkit.from_string(full_pdf_summary, summary_file_path, options=options)

True

# **Creating a Podcast from the Generated Summary**

In [None]:
#install libraries
!pip install gTTS

In [None]:
# Import the required module for text
# to speech conversion
from gtts import gTTS
from IPython.display import Audio
import io

def text_to_speech(text, language='en', slow=False):
    # Passing the text and language to the engine, slow=False for normal speed
    speech = gTTS(text=text, lang=language, slow=slow)

    # Generate the speech once, as an in-memory file
    mp3_fp = io.BytesIO()
    speech.write_to_fp(mp3_fp)

    # Save the same audio bytes to a file, without requesting the speech a second time
    podcast_path = '/content/drive/MyDrive/CS510_NLP/Final_Project/full_pdf_summary_T5_model_podcast.mp3'
    with open(podcast_path, 'wb') as audio_file:
        audio_file.write(mp3_fp.getvalue())

    # Play the audio in the notebook
    return Audio(mp3_fp.getvalue(), autoplay=True)

podcast = text_to_speech(full_pdf_summary)
podcast