**RESEARCH PAPER SUMMARIZATION**

**PROBLEM STATEMENT**

Research papers are often long and dense, which makes reading them thoroughly time-consuming. Abstracts provide limited information and may miss crucial details. Our goal is to develop a program that generates concise, easy-to-understand summaries of these papers while capturing their essential points accurately, so that users can quickly grasp the key findings without reading the entire paper.


**DATASET DESCRIPTION**

The Random Research Papers Dataset is a collection of over 600 research papers gathered from a range of sources on the web. After a manual selection process, approximately 350 of the most relevant and high-quality papers were kept for this dataset.

Contents:

- The dataset includes research papers covering a wide range of topics and disciplines, reflecting the diverse nature of academic research.
- Each paper is accompanied by relevant metadata such as title, authors, publication source, abstract, and publication date (if available).
- The papers cover various fields of study, including but not limited to computer science, medicine, engineering, and the social sciences.

**IMPORTING LIBRARIES**

In [None]:
!pip install pymupdf
!pip install pycryptodome
import os
import fitz  # PyMuPDF
import pandas as pd
from multiprocessing import Pool

Collecting pymupdf
  Downloading PyMuPDF-1.23.5-cp310-none-manylinux2014_x86_64.whl (4.3 MB)
Collecting PyMuPDFb==1.23.5 (from pymupdf)
  Downloading PyMuPDFb-1.23.5-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (30.6 MB)
Installing collected packages: PyMuPDFb, pymupdf
Successfully installed PyMuPDFb-1.23.5 pymupdf-1.23.5


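This Python script extracts text from a collection of PDF research papers and stores it in a DataFrame for analysis. To keep figures and diagrams out of the text, it skips every page that contains an embedded image, and it stops reading a paper at the first page that contains a references-section keyword.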
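The keyword check in the cell below is a case-sensitive substring match over the whole page, so a page that merely mentions a word such as "Reference" in a sentence ends the extraction for that paper. As a possible refinement, the cutoff could be applied only when a line consists of nothing but a heading. The sketch below illustrates that idea on plain strings; `truncate_at_references` and `REFERENCE_HEADINGS` are hypothetical names, not part of the notebook.

```python
# Hypothetical helper (an assumption, not the notebook's method): cut the text
# at the first line that is exactly a section heading such as "References".
REFERENCE_HEADINGS = {"references", "reference", "bibliography"}

def truncate_at_references(page_text):
    """Return (kept_text, found); kept_text ends just before the heading line."""
    kept_lines = []
    for line in page_text.splitlines():
        # Compare the whole stripped line, so a word like "reference" inside
        # a sentence does not trigger the cutoff.
        if line.strip().lower().rstrip(":") in REFERENCE_HEADINGS:
            return "\n".join(kept_lines), True
        kept_lines.append(line)
    return "\n".join(kept_lines), False

sample = "Results are strong.\nWe use reference data.\nReferences\n[1] A. Author"
kept, found = truncate_at_references(sample)
print(found)
print(kept)
```

With this approach the text before the heading on the same page is kept, whereas the page-level check discards the whole page.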

In [None]:
# Defining the path to your folder containing PDFs
pdf_folder = "/kaggle/input/researchpaper-alldata"

# Initializing an empty list to store the data
pdf_data = []

# Defining the keywords that indicate the start of the references section
references_start_keywords = ["Bibliography", "REFERENCES", "BIBLIOGRAPHY", "References", "Acknowledgments",
                             "ACKNOWLEDGEMENTS", "Reference", "REFERENCE", "Authors’ Biography"]

# Function to extract text from the PDF, skipping pages that contain images
# (a rough way to avoid figures) and stopping at the references section
def extract_text_from_pdf_exclude_elements(pdf_path):
    text = ""
    pdf_document = fitz.open(pdf_path)

    for page_number in range(pdf_document.page_count):
        page = pdf_document.load_page(page_number)
        page_text = page.get_text("text")

        # Stop at the first page containing a references-start keyword;
        # that page and everything after it are dropped
        if any(keyword in page_text for keyword in references_start_keywords):
            break

        # Skip pages that contain embedded images, and pages with no text
        if not page.get_images() and page_text.strip():
            text += page_text

    pdf_document.close()
    return text

# Worker function for multiprocessing: takes a PDF file name and returns its extracted text
def process_pdf(pdf_file):
    pdf_path = os.path.join(pdf_folder, pdf_file)
    text = extract_text_from_pdf_exclude_elements(pdf_path)
    return {'Research_Paper_Name': pdf_file, 'source': text}

if __name__ == '__main__':
    pdf_files = [file_name for file_name in os.listdir(pdf_folder) if file_name.endswith('.pdf')]

    # Defining the number of CPU cores to use for parallelization
    num_cores = os.cpu_count()

    with Pool(processes=num_cores) as pool:
        pdf_data = pool.map(process_pdf, pdf_files)

    # Filtering out PDFs with no text output
    pdf_data = [item for item in pdf_data if item['source']]  # Removes items from pdf_data where the 'source' key is empty (no text was extracted).

    # Creating a DataFrame
    df = pd.DataFrame(pdf_data)

    # Displaying the DataFrame
    print(df)


                                   Research_Paper_Name  \
0                                       2210.00881.pdf   
1                                   TSP_CSSE_45181.pdf   
2                             electronics-11-00325.pdf   
3    Performance_of_Deep-Learning_Solutions_on_Lung...   
4                                    9789392995101.pdf   
..                                                 ...   
341                              ASAfilippi_p_558s.pdf   
342                               25894-50878-1-PB.pdf   
343           Women_Empowerment_in_Bangladesh_NGOS.pdf   
344                           diagnostics-12-00203.pdf   
345                       20220122011404pmWEB19034.pdf   

                                                source  
0    Predicting the Future of AI with AI:\nHigh-Qua...  
1    Insider Attack Detection Using Deep Belief Neu...  
2    Electronics 2022, 11, 325\n2 of 18\nartiﬁcial ...  
3    Life 2023, 13, 1911\n2 of 13\neffective in red...  
4     \n \n \n \n 

In [None]:
len(df['source'][0])

28301

In [None]:
print(df['source'][0])

Predicting the Future of AI with AI:
High-Quality link prediction in an exponentially growing knowledge network
Mario Krenn,1, ∗ Lorenzo Buﬀoni,2 Bruno Coutinho,2 Sagi Eppel,3 Jacob Gates Foster,4
Andrew Gritsevskiy,3, 5, 6 Harlin Lee,4 Yichao Lu,7 Jo˜ao P. Moutinho,2 Nima Sanjabi,8 Rishi Sonthalia,4
Ngoc Mai Tran,9 Francisco Valente,10 Yangxinyu Xie,11 Rose Yu,12 and Michael Kopp6
1Max Planck Institute for the Science of Light (MPL), Erlangen, Germany.
2Instituto de Telecomunica¸c˜oes, Lisbon, Portugal.
3University of Toronto, Canada.
4University of California Los Angeles, USA.
5Cavendish Laboratories, Cavendish, Vermont, USA.
6Institute of Advanced Research in Artiﬁcial Intelligence (IARAI), Vienna, Austria.
7Layer 6 AI, Toronto, Canada.
8Independent Researcher, Barcelona, Spain.
9University of Texas at Austin, USA.
10Independent Researcher, Leiria, Portugal.
11University of Pennsylvania, USA.
12University of California, San Diego, USA.
A tool that could suggest new personalized rese

**Preprocessing**

In [None]:
# Importing necessary packages

import nltk
from string import punctuation
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize
import string
import re
nltk.download("stopwords")
stop_words = stopwords.words('english')

[nltk_data] Downloading package stopwords to /usr/share/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [None]:
import nltk
nltk.download('averaged_perceptron_tagger')


[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /usr/share/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

This script preprocesses text data in a DataFrame (df) by removing tables, formulas, emails, URLs, converting to lowercase, removing special characters, single alphabets, numbers, punctuation, and extra spaces. It uses NLTK for text processing and regular expressions for pattern matching.
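To make the effect of these steps concrete, the following sketch applies the same regular expressions, in the same order as the cell below, to a single made-up sentence. The function name `clean_sentence` and the sample text are illustrative assumptions; the table and formula patterns are left out for brevity.

```python
import re
import string

def clean_sentence(sentence):
    # Same order of steps as the notebook's per-sentence cleaning
    sentence = re.sub(r'\S+@\S+\.\S+', '', sentence)           # e-mails
    sentence = re.sub(r'https?://\S+|www\.\S+', '', sentence)   # URLs
    sentence = sentence.lower()                                 # lowercasing
    sentence = re.sub(r'[^\w\s]', ' ', sentence)                # special characters
    sentence = re.sub(r'\s+[a-zA-Z]\s+', ' ', sentence)         # single letters
    sentence = re.sub(r'\d+', '', sentence)                     # numbers
    sentence = sentence.translate(str.maketrans('', '', string.punctuation))
    return " ".join(sentence.split())                           # extra spaces

example = "Contact jane@example.org or see https://example.org for a total of 3 results!"
print(clean_sentence(example))
# contact or see for total of results
```

Note that the cleaning is aggressive: numbers and single-letter words such as "a" disappear, which changes the wording the summarizer later sees.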

In [None]:
import pandas as pd
import re
import string
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize
from nltk.stem import WordNetLemmatizer
import nltk

# Downloading the 'punkt' tokenizer data
nltk.download('punkt')

def preprocess_text(text):
    # Sentence Tokenization
    sentences = sent_tokenize(text)

    # Initializing an empty list to store preprocessed sentences
    preprocessed_sentences = []

    for sentence in sentences:

        # Using regular expressions to remove table/figure captions and LaTeX-style formulas (customize as needed)
        table_pattern = r"(?i)(?:Table|Tab\.|Fig\.)\s+\d+\s*[:\s-]*\s*(.*?)\s*(?=(?:Table|Tab\.|Fig\.|\n|\Z))"
        sentence = re.sub(table_pattern, "", sentence)

        formula_pattern = r"(\$\$[\s\S]*?\$\$|\$[\s\S]*?\$)"
        sentence = re.sub(formula_pattern, "", sentence)

        # Removing E-mails
        email_pattern = r'\S+@\S+\.\S+'
        sentence = re.sub(email_pattern, '', sentence)

        # Removing Urls
        url_pattern = r'https?://\S+|www\.\S+'
        sentence = re.sub(url_pattern, '', sentence)

        # Lowercasing
        sentence = sentence.lower()

        # Removing Special Characters
        pattern = r'[^\w\s]'
        sentence = re.sub(pattern, ' ', sentence)

        # Removing single alphabets
        sentence = re.sub(r'\s+[a-zA-Z]\s+', ' ', sentence)

        # Removing Numbers
        sentence = re.sub(r'\d+', '', sentence)

        # Removing Punctuations
        sentence = sentence.translate(str.maketrans('', '', string.punctuation))

        # Removing Extra Spaces
        sentence = " ".join(sentence.split())

        # Appending the preprocessed sentence to the list
        preprocessed_sentences.append(sentence)

    # Joining the preprocessed sentences back into a single string, using '. ' as the delimiter
    preprocessed_text = '. '.join(preprocessed_sentences)

    return preprocessed_text

# Apply the preprocessing function to the "source" column
df['source'] = df['source'].apply(preprocess_text)

df

[nltk_data] Downloading package punkt to /usr/share/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Unnamed: 0,Research_Paper_Name,source
0,2210.00881.pdf,predicting the future of ai with ai high quali...
1,TSP_CSSE_45181.pdf,insider attack detection using deep belief neu...
2,electronics-11-00325.pdf,electronics of artiﬁcial intelligence systems ...
3,Performance_of_Deep-Learning_Solutions_on_Lung...,life of effective in reducing lung cancer mort...
4,9789392995101.pdf,keep your dreams alive. understand to achieve ...
...,...,...
341,ASAfilippi_p_558s.pdf,detecting causes of spatial variation in crop ...
342,25894-50878-1-PB.pdf,int elec comp eng issn computed tomography sca...
343,Women_Empowerment_in_Bangladesh_NGOS.pdf,nu journal of humanities social sciences busin...
344,diagnostics-12-00203.pdf,diagnostics of alt are the enzymes from liver ...


Using the LexRank algorithm to generate the reference summaries (the "target" column) for the dataset.

In [None]:
!pip install lexrank

Collecting lexrank
  Downloading lexrank-0.1.0-py3-none-any.whl (69 kB)
Collecting urlextract>=0.7 (from lexrank)
  Downloading urlextract-1.8.0-py3-none-any.whl (21 kB)
Collecting uritools (from urlextract>=0.7->lexrank)
  Downloading uritools-4.0.2-py3-none-any.whl (10 kB)
Installing collected packages: uritools, urlextract, lexrank
Successfully installed lexrank-0.1.0 uritools-4.0.2 urlextract-1.8.0


**This script performs extractive summarization on research papers using LexRank, spaCy for text processing, and multiprocessing for efficiency. The summary size can be adjusted as needed.**
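The idea behind LexRank is to treat sentences as nodes in a similarity graph and to score each sentence by how central it is. The sketch below is a simplified, dependency-free illustration of that idea, using plain word-count cosine similarity and power iteration. It is not the `lexrank` package's implementation, which (as far as we know) also uses IDF weighting; `cosine` and `rank_sentences` are hypothetical names.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two word-count vectors (Counters)."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def rank_sentences(sentences, iterations=50):
    """Score sentences by their centrality in a sentence-similarity graph."""
    vectors = [Counter(s.lower().split()) for s in sentences]
    n = len(sentences)
    sim = [[cosine(vectors[i], vectors[j]) for j in range(n)] for i in range(n)]
    # Row-normalise so each row is a probability distribution over neighbours
    trans = [[x / (sum(row) or 1.0) for x in row] for row in sim]
    scores = [1.0 / n] * n
    # Power iteration: repeatedly redistribute score along the graph edges
    for _ in range(iterations):
        scores = [sum(scores[i] * trans[i][j] for i in range(n)) for j in range(n)]
    return scores

sentences = [
    "deep learning improves lung cancer detection",
    "lung cancer detection saves lives",
    "deep learning needs large datasets",
]
scores = rank_sentences(sentences)
best = scores.index(max(scores))
print(sentences[best])
```

The first sentence shares words with both of the others, so it ends up with the highest score and would be picked first for the summary.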

In [None]:
from lexrank import LexRank
import spacy
import pandas as pd
from multiprocessing import Pool

# Loading spaCy's small English pipeline, used here for sentence segmentation
nlp = spacy.load("en_core_web_sm")

# Preprocess the text with spaCy and apply LexRank to each text
def extractive_summarize_text(text, summary_size=2):
    doc = nlp(text)
    sentences = [sent.text for sent in doc.sents]
    lxr = LexRank(sentences)
    summary = lxr.get_summary(sentences, summary_size=summary_size)  # Adjust the summary size as needed

    # Truncate the summary to the specified size
    if len(summary) > summary_size:
        summary = summary[:summary_size]

    return " ".join(summary) if summary and "No informative summary available" not in summary else ""

def parallel_summarization(df_chunk):
    df_chunk['target'] = df_chunk['source'].apply(lambda x: extractive_summarize_text(x, summary_size=2))
    return df_chunk

if __name__ == '__main__':

    # Defining the number of processes to use (we can adjust this as needed)
    num_processes = os.cpu_count()

    # Splits the DataFrame into chunks for parallel processing
    # (max(1, ...) avoids a zero chunk size when there are fewer rows than processes)
    chunk_size = max(1, len(df) // num_processes)
    df_chunks = [df[i:i + chunk_size] for i in range(0, len(df), chunk_size)]

    # Creates a multiprocessing pool and apply summarization in parallel
    with Pool(processes=num_processes) as pool:  # Use "Pool" from the multiprocessing module
        df_list = pool.map(parallel_summarization, df_chunks)

    # Concatenates the processed DataFrame chunks
    df = pd.concat(df_list, ignore_index=True)

    # Prints the DataFrame with the new "target" (summary) column
    print(df)

                                   Research_Paper_Name  \
0                                       2210.00881.pdf   
1                                   TSP_CSSE_45181.pdf   
2                             electronics-11-00325.pdf   
3    Performance_of_Deep-Learning_Solutions_on_Lung...   
4                                    9789392995101.pdf   
..                                                 ...   
341                              ASAfilippi_p_558s.pdf   
342                               25894-50878-1-PB.pdf   
343           Women_Empowerment_in_Bangladesh_NGOS.pdf   
344                           diagnostics-12-00203.pdf   
345                       20220122011404pmWEB19034.pdf   

                                                source  \
0    predicting the future of ai with ai high quali...   
1    insider attack detection using deep belief neu...   
2    electronics of artiﬁcial intelligence systems ...   
3    life of effective in reducing lung cancer mort...   
4    keep you

In [None]:
len(df['source'][4])

28711

In [None]:
len(df['target'][4])

453

In [None]:
df['source'][4]



In [None]:
df['target'][4]

'in an organization where human intelligence is tied to particular person or to group of people ai applications can provide the permanence that knowledge is not lost as individuals or group members retire or are no longer available to the organization. ai helps doctors assess how dangerous patient health is and then uses intelligence to not only develop quality of care but also observe and advise patients on the effects side effects of certain drugs.'

In [None]:
df.to_csv("output_csv(clean).csv")

# Model Building

T5 (Text-To-Text Transfer Transformer) is a versatile transformer-based language model developed by Google. Unlike models designed for a single NLP task, T5 treats every task as a text-to-text problem: both the input and the output are text, which lets one model handle a wide range of tasks such as translation and summarization.
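As a small illustration of the text-to-text framing, the sketch below formats two different tasks as plain (input text, output text) pairs distinguished only by a task prefix. The helper `to_text_to_text` and the example strings are hypothetical; the column names mirror the `source_text` and `target_text` columns used later in this notebook.

```python
# Hypothetical helper: every task becomes "prefix: input text" -> "output text".
def to_text_to_text(task_prefix, source, target):
    return {"source_text": f"{task_prefix}: {source}", "target_text": target}

examples = [
    to_text_to_text("summarize", "a long research paper body ...", "a short summary"),
    to_text_to_text("translate English to German", "The house is small.", "Das Haus ist klein."),
]

for ex in examples:
    print(ex["source_text"], "->", ex["target_text"])
```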

### Using the T5 model

In [None]:
!pip install rouge
!pip install simplet5

Collecting rouge
  Downloading rouge-1.0.1-py3-none-any.whl (13 kB)
Installing collected packages: rouge
Successfully installed rouge-1.0.1
Collecting simplet5
  Downloading simplet5-0.1.4.tar.gz (7.3 kB)
  Preparing metadata (setup.py) ... done
Collecting transformers==4.16.2 (from simplet5)
  Downloading transformers-4.16.2-py3-none-any.whl (3.5 MB)
Collecting pytorch-lightning==1.5.10 (from simplet5)
  Downloading pytorch_lightning-1.5.10-py3-none-any.whl (527 kB)
Collecting pyDeprecate==0.3.1 (from pytorch-lightning==1.5.10->simplet5)
  Downloading pyDeprecate-0.3.1-py3-none-any.whl (10 kB)
Collecting setuptools==59.5.0 (from pytorch-lightning==1.5.10->simplet5)
  Downloading setuptools-59.5.0-py3-none-any.whl (952 k

In [None]:
import pandas as pd

# Reading the cleaned CSV file produced in the previous step
df = pd.read_csv("/kaggle/input/recleaned-csv/output_csv(clean).csv")
df

Unnamed: 0.1,Unnamed: 0,Research_Paper_Name,source,target
0,0,2210.00881.pdf,predicting the future of ai with ai high quali...,we point out several extensions and future wor...
1,1,TSP_CSSE_45181.pdf,insider attack detection using deep belief neu...,the stream mining algorithm with decision grap...
2,2,electronics-11-00325.pdf,electronics of artiﬁcial intelligence systems ...,in the literature we ﬁnd different works that ...
3,3,Performance_of_Deep-Learning_Solutions_on_Lung...,life of effective in reducing lung cancer mort...,data sources and search strategy systematic li...
4,4,9789392995101.pdf,keep your dreams alive. understand to achieve ...,in an organization where human intelligence is...
...,...,...,...,...
341,341,ASAfilippi_p_558s.pdf,detecting causes of spatial variation in crop ...,introduction crop yields are affected by many ...
342,342,25894-50878-1-PB.pdf,int elec comp eng issn computed tomography sca...,finally both kenny et al. and wabnitz et al. c...
343,343,Women_Empowerment_in_Bangladesh_NGOS.pdf,nu journal of humanities social sciences busin...,whitmore conceived empowerment as an interacti...
344,344,diagnostics-12-00203.pdf,diagnostics of alt are the enzymes from liver ...,shao et al. applied ultra performance liquid c...


In [None]:
df = df.rename(columns={"target":"target_text", "source":"source_text"})
df = df[['source_text', 'target_text']]

In [None]:
# T5 model expects a task related prefix: since it is a summarization task, we will add a prefix "summarize: "
df['source_text'] = "summarize: " + df['source_text']
df

Unnamed: 0,source_text,target_text
0,summarize: predicting the future of ai with ai...,we point out several extensions and future wor...
1,summarize: insider attack detection using deep...,the stream mining algorithm with decision grap...
2,summarize: electronics of artiﬁcial intelligen...,in the literature we ﬁnd different works that ...
3,summarize: life of effective in reducing lung ...,data sources and search strategy systematic li...
4,summarize: keep your dreams alive. understand ...,in an organization where human intelligence is...
...,...,...
341,summarize: detecting causes of spatial variati...,introduction crop yields are affected by many ...
342,summarize: int elec comp eng issn computed tom...,finally both kenny et al. and wabnitz et al. c...
343,summarize: nu journal of humanities social sci...,whitmore conceived empowerment as an interacti...
344,summarize: diagnostics of alt are the enzymes ...,shao et al. applied ultra performance liquid c...


Splitting the Dataset

In [None]:
from sklearn.model_selection import train_test_split

train_df, test_df = train_test_split(df, test_size=0.2)
train_df.shape, test_df.shape



((276, 2), (70, 2))
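The sizes above are consistent with rounding the test share up: 20% of 346 rows is 69.2, which becomes 70 test rows and leaves 276 for training. Because the call sets no `random_state`, the rows that land in each part change from run to run. The sketch below reproduces the arithmetic with a seeded shuffle using only the standard library; `split_indices` and the seed value are assumptions for illustration, not scikit-learn's implementation.

```python
import math
import random

def split_indices(n_samples, test_size=0.2, seed=42):
    """Shuffle row indices and split them, rounding the test share up."""
    n_test = math.ceil(test_size * n_samples)
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # a fixed seed makes the split repeatable
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = split_indices(346)
print(len(train_idx), len(test_idx))
# 276 70
```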

Loading Model

In [None]:
from simplet5 import SimpleT5
model = SimpleT5()
model.from_pretrained(model_type="t5", model_name="t5-base")

Downloading:   0%|          | 0.00/773k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/1.32M [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/1.18k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/850M [00:00<?, ?B/s]

Fine-Tuning the Model

In [None]:
model.train(train_df=train_df,
            eval_df=test_df,
            source_max_token_len=128,
            target_max_token_len=50,
            batch_size=8, max_epochs=5, use_gpu=True)


In [None]:
! ( cd outputs; ls )

simplet5-epoch-0-train-loss-4.1447-val-loss-3.7157
simplet5-epoch-1-train-loss-3.7984-val-loss-3.6766
simplet5-epoch-2-train-loss-3.5484-val-loss-3.6781
simplet5-epoch-3-train-loss-3.3369-val-loss-3.6882
simplet5-epoch-4-train-loss-3.1244-val-loss-3.7309
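The checkpoint names encode the training and validation loss for each epoch. Training loss keeps falling while validation loss is lowest at epoch 1 and rises afterwards, which suggests the later epochs start to overfit. A minimal sketch for picking the checkpoint with the lowest validation loss from the names listed above (the helper `val_loss` is a hypothetical name):

```python
import re

checkpoints = [
    "simplet5-epoch-0-train-loss-4.1447-val-loss-3.7157",
    "simplet5-epoch-1-train-loss-3.7984-val-loss-3.6766",
    "simplet5-epoch-2-train-loss-3.5484-val-loss-3.6781",
    "simplet5-epoch-3-train-loss-3.3369-val-loss-3.6882",
    "simplet5-epoch-4-train-loss-3.1244-val-loss-3.7309",
]

def val_loss(name):
    """Parse the validation loss from the end of a checkpoint directory name."""
    match = re.search(r"val-loss-(\d+\.\d+)$", name)
    if match is None:
        raise ValueError(f"unexpected checkpoint name: {name}")
    return float(match.group(1))

best_checkpoint = min(checkpoints, key=val_loss)
print(best_checkpoint)
# simplet5-epoch-1-train-loss-3.7984-val-loss-3.6766
```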


In [None]:
# let's load the trained model from the local output folder for inferencing:
model.load_model("t5","/kaggle/working/outputs/simplet5-epoch-1-train-loss-3.7984-val-loss-3.6766", use_gpu=True)

In [None]:
text_to_summarize = df["source_text"][0]

generated_text = model.predict(text_to_summarize)

print(generated_text)

Token indices sequence length is longer than the specified maximum sequence length for this model (5600 > 512). Running this sequence through the model will result in indexing errors


['ai is an algorithm that can be used to predict the future of ai by analyzing the data in real time. this approach was presented in the ieee bigdata competition in fall and it has been widely adopted in many other fields such as artificial intelligence and machine learning.']
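The warning above shows that the input is far longer (5600 tokens) than the model's 512-token limit, and training used `source_max_token_len=128`, so the model mostly sees the beginning of each paper. One possible workaround, sketched here under loose assumptions, is to split a long text into overlapping windows and summarize each window separately. The sketch counts whitespace-separated words as a rough stand-in for tokenizer tokens, and `split_into_windows` with its parameters is a hypothetical helper.

```python
def split_into_windows(text, max_words=100, overlap=20):
    """Split text into overlapping word windows of at most max_words words."""
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = text.split()
    step = max_words - overlap
    windows = []
    for start in range(0, len(words), step):
        windows.append(" ".join(words[start:start + max_words]))
        # Stop once the current window already reaches the end of the text
        if start + max_words >= len(words):
            break
    return windows

# Synthetic 250-word text: w0 w1 ... w249
text = " ".join(f"w{i}" for i in range(250))
windows = split_into_windows(text)
print(len(windows))
# 3
```

Each window could then be passed to `model.predict` on its own and the partial summaries joined, although whether that improves quality here has not been tested.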


# Calculating ROUGE Score

In [None]:
from rouge import Rouge

# Example: reference and generated summaries
reference_summary = [df["target_text"][0]]  # Place the reference summary in a list
generated_summary = generated_text  # Keep the generated summary as a list

# Creates a Rouge object
rouge = Rouge()

# Calculates ROUGE scores
scores = rouge.get_scores(generated_summary, reference_summary)

print(scores)

[{'rouge-1': {'r': 0.2564102564102564, 'p': 0.23809523809523808, 'f': 0.2469135752537724}, 'rouge-2': {'r': 0.0392156862745098, 'p': 0.041666666666666664, 'f': 0.0404040354086324}, 'rouge-l': {'r': 0.1794871794871795, 'p': 0.16666666666666666, 'f': 0.17283950117969835}}]
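In the output above, `r`, `p`, and `f` stand for recall, precision, and F1. For ROUGE-1 these are based on the unigrams shared between the generated and the reference summary. The sketch below computes a simplified ROUGE-1 by hand on two made-up sentences; it uses clipped word counts and whitespace tokenization, so its numbers may differ in detail from what the `rouge` package reports. `rouge_1` is a hypothetical name.

```python
from collections import Counter

def rouge_1(candidate, reference):
    """Simplified ROUGE-1 using clipped unigram counts."""
    cand = Counter(candidate.split())
    ref = Counter(reference.split())
    overlap = sum((cand & ref).values())  # shared unigrams, clipped per word
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"r": recall, "p": precision, "f": f1}

example_scores = rouge_1(
    "the model predicts research trends",
    "the model predicts future research directions",
)
print(example_scores)
```

Four of the five generated words appear in the six-word reference, so precision is 4/5, recall is 4/6, and F1 is 8/11 (about 0.727).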





# User Input PDF (Pipeline)

In [None]:
!pip install pymupdf
!pip install pycryptodome

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


This Python script builds the user-input pipeline. It renders each page of a PDF to an image, extracts the text with OCR (pytesseract), stops collecting text once a references-section keyword appears, and then applies the same cleaning steps as before: removing table and figure captions, formulas, e-mails, URLs, numbers, punctuation, and extra spaces.

In [None]:
import fitz  # PyMuPDF
import pytesseract
from PIL import Image
import re
import string
from nltk.tokenize import sent_tokenize
import nltk
import os

pdf_path = "/kaggle/input/test-dataset/Virtual_Reality_in_Chemical_Engineering_Education.pdf"

# Function to perform OCR on an image and extract text
def extract_text_from_image(image_path):
    image = Image.open(image_path)
    text = pytesseract.image_to_string(image)
    return text

# Function to render each PDF page to an image, OCR it, and preprocess the resulting text
def extract_and_preprocess_text_from_pdf(pdf_path):
    text = ""
    pdf_document = fitz.open(pdf_path)

    references_section_started = False  # Flag to track if references section started

    for page_number in range(pdf_document.page_count):
        page = pdf_document.load_page(page_number)
        page_text = ""

        # Render the whole page to a pixmap and save it as a PNG
        pix = page.get_pixmap()
        image = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)
        image_path = "temp_image.png"
        image.save(image_path)

        # Extracts text from the saved image using OCR
        page_text += extract_text_from_image(image_path)

        # Checks if any of the references start keywords are present in the page text
        if any(keyword in page_text for keyword in references_start_keywords):
            references_section_started = True

        # Appends the page text if references section is not started
        if not references_section_started:
            text += page_text

        # Removes the temporary image file
        os.remove(image_path)

    # Preprocess the extracted text
    text = preprocess_text(text)

    return text

def preprocess_text(text):
    # Sentence Tokenization
    sentences = sent_tokenize(text)

    # Initializes an empty list to store preprocessed sentences
    preprocessed_sentences = []

    for sentence in sentences:

        # Using regular expressions to remove table/figure captions and LaTeX-style formulas (customize as needed)
        table_pattern = r"(?i)(?:Table|Tab\.|Fig\.)\s+\d+\s*[:\s-]*\s*(.*?)\s*(?=(?:Table|Tab\.|Fig\.|\n|\Z))"
        sentence = re.sub(table_pattern, "", sentence)

        formula_pattern = r"(\$\$[\s\S]*?\$\$|\$[\s\S]*?\$)"
        sentence = re.sub(formula_pattern, "", sentence)

        # Removing E-mails
        email_pattern = r'\S+@\S+\.\S+'
        sentence = re.sub(email_pattern, '', sentence)

        # Removing URLs
        url_pattern = r'https?://\S+|www\.\S+'
        sentence = re.sub(url_pattern, '', sentence)

        # Lowercasing
        sentence = sentence.lower()

        # Removes Special Characters
        pattern = r'[^\w\s]'
        sentence = re.sub(pattern, ' ', sentence)

        # Removes single alphabets
        sentence = re.sub(r'\s+[a-zA-Z]\s+', ' ', sentence)

        # Removing Numbers
        sentence = re.sub(r'\d+', '', sentence)

        # Removing Punctuations
        sentence = sentence.translate(str.maketrans('', '', string.punctuation))

        # Removing Extra Spaces
        sentence = " ".join(sentence.split())

        # Appends the preprocessed sentence to the list
        preprocessed_sentences.append(sentence)

    # Joining the preprocessed sentences back into a single string, using '. ' as the delimiter
    preprocessed_text = '. '.join(preprocessed_sentences)

    return preprocessed_text

# Defining the keywords that indicate the start of the references section
references_start_keywords = ["Bibliography", "REFERENCES", "BIBLIOGRAPHY", "References", "Acknowledgments",
                             "ACKNOWLEDGMENTS", "Reference", "REFERENCE", "Authors’ Biography"]

# Extracts text from the specified PDF file and preprocess it
preprocessed_text = extract_and_preprocess_text_from_pdf(pdf_path)

print(preprocessed_text)


researchgate virtual reality in chemical engineering education reprited from the proceedings of the american society for engineering eduction iino odiana sestonal conference purdue university match tual reality in chemical engineering education john bell scott fogler department of chemical engineetis university of michigan ann arbor npossible to achieve. in order to take full advantage of this new technology virtual reality based simulator vicher ly being developed at the university of michigan chemical engineering department in order to aid in the instruction of chemical reactor engineering. while virtual reality has been recently employed in few educational applications grade school and high school levels and for advanced operator training virtual surgery flight simulation the program presented here is the first known application of virtual reality to chemical engineering education backgro virtual reality vr is newly emerging computer interface designed to make the user believe that 

**Predicted Text**

In [None]:
model.predict(preprocessed_text)

['the and is to provide a will be an area for the study of heat effects non isothermal operation in chemical reaction engineering. this virtual reality as well as students their feedback will be used to guide the further development of vicher.s alsoi these aree however the first room contains tables chairs desk pictures books and working television set. that the second room has been designed to facilitate other areas the the which  they']