# 1. <a id='toc1_'></a>[Text Preprocessing & Feature Transformation](#toc0_)

**Author Name:** Salman Tahir  
**Environment:** Conda 23.7.2, Python 3.10.12


**Table of contents**<a id='toc0_'></a>    
1. [Text Preprocessing & Feature Transformation](#toc1_)    
2. [Introduction](#toc2_)    
3. [Importing Libraries](#toc3_)    
4. [Reading Data](#toc4_)    
5. [Downloading PDF Files](#toc5_)    
6. [Aggregating Data from PDFs](#toc6_)    
7. [Extracting Information to Entities](#toc7_)    
8. [Preprocessing Data](#toc8_)    
8.1. [Sentence Segmentation](#toc8_1_)    
8.2. [Tokenization](#toc8_2_)    
8.3. [Bigrams](#toc8_3_)    
8.4. [Further Text Preprocessing](#toc8_4_)    
8.5. [Stemming](#toc8_5_)    
9. [Feature Conversion and Output](#toc9_)    
10. [Statistical Summary](#toc10_)    
11. [Summary](#toc11_)    
12. [References](#toc12_)    

<!-- vscode-jupyter-toc-config
	numbering=true
	anchor=true
	flat=true
	minLevel=1
	maxLevel=2
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

---


# 2. <a id='toc2_'></a>[Introduction](#toc0_)

In this project, we convert a collection of published papers, supplied as PDF files, into numerical representations suitable for Natural Language Processing (NLP) systems and downstream modelling tasks. We extract the text of each paper, preprocess it, and write out a vocabulary index together with a sparse count vector for each paper.

Furthermore, we generate a statistical summary of the 10 most frequent terms in the titles and abstracts of the papers, along with the 10 most frequent authors.


# 3. <a id='toc3_'></a>[Importing Libraries](#toc0_)

The following libraries are used:

-   `os` (for file path manipulation)
-   `re` (for regular expressions and pattern matching)
-   `csv` (for writing to csv files)
-   `nltk` (for tokenization and stemming using PorterStemmer)
-   `requests` (for downloading files from the http links)
-   `concurrent.futures` (for multithreading utilising the ThreadPoolExecutor)
-   `pdftotext` (for extracting the text of the downloaded paper PDFs)
-   `pypdf` (for reading the table of links in the initial PDF file)
-   `itertools` (for creating combinations of words)
-   `google.oauth2` (for authenticating with google drive API)
-   `collections` (for counting the frequency of words)


In [1]:
import os
import re
import csv
import nltk
import requests
import pdftotext
import itertools
from pypdf import PdfReader
from google.oauth2 import service_account
from google.oauth2.credentials import Credentials
from concurrent.futures import ThreadPoolExecutor
from collections import Counter
from nltk.collocations import *
from nltk import PorterStemmer
from nltk.tokenize import RegexpTokenizer
from nltk.tokenize import sent_tokenize
from nltk.tokenize import MWETokenizer


# 4. <a id='toc4_'></a>[Reading Data](#toc0_)

We start by reading the initial PDF file and extracting the links to the papers in the PDF table.

-   The extracted data is stored in a dictionary.
-   The keys of the dictionary are the paper IDs.
-   The values of the dictionary are the links to the papers.
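The pairing of filenames and links can be illustrated without a PDF. The following is a minimal sketch, assuming (as the extraction in this notebook does) that each URL line is immediately preceded by its filename line. The function name `pair_filenames_with_urls` and the sample lines are made up for illustration.

```python
def pair_filenames_with_urls(lines):
    """Map each filename (minus its 4-character extension) to the URL on the next line."""
    pairs = {}
    for i, line in enumerate(lines):
        # A URL on the very first line has no filename line before it, so skip it
        if "http" in line and i > 0:
            filename = lines[i - 1].strip()
            pairs[filename[:-4]] = line.strip()
    return pairs


# Hypothetical lines, shaped like the text extracted from the table
sample_lines = [
    "filename url",
    "PP0001.pdf",
    "https://example.com/download?id=abc",
    "PP0002.pdf",
    "https://example.com/download?id=def",
]
articles_demo = pair_filenames_with_urls(sample_lines)
print(articles_demo)
```

Using `enumerate` rather than searching the list for the current line keeps the lookup correct even if the same line text appears more than once on a page.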


In [2]:
# Open the PDF file and read its contents
with open("../data/input/papers.pdf", "rb") as file:
    reader = PdfReader(file)
    data = {}

    # Extract URLs and filenames from the table in the PDF file
    for page_obj in reader.pages:
        lines = page_obj.extract_text().split("\n")
        for i, line in enumerate(lines):
            # Skip a URL on the first line, which has no filename line before it
            if "http" in line and i > 0:
                # The filename is on the line immediately before the URL
                data[lines[i - 1]] = line

# Strip the ".pdf" extension from each filename to form the keys
articles = {key[:-4]: value for key, value in data.items()}


In [3]:
# Print first item in the dictionary
for key, value in list(articles.items())[:1]:
    print(f"Filename: {key}\nURL: {value}\n")


Filename: PP3206
URL: https://drive.google.com/uc?export=download&id=1KXR-_25SCUzukdgBlYVVskKd_sOcK1M0



# 5. <a id='toc5_'></a>[Downloading PDF Files](#toc0_)

-   We start by setting up our credentials for the Google Drive API.
    -   Doing so allows us to bypass the rate limit enforced by Google Drive.
-   We download the PDF files into a directory called `pdf_files`.
    -   We also append the `.pdf` file extension to each file name.
-   Using multithreading, we download the files in parallel.
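The parallel download step can be sketched without touching the network. This is a minimal illustration, assuming a stand-in function `fake_download` (hypothetical, not part of this notebook) in place of a real HTTP request. It keeps the futures returned by `submit`, because an exception raised inside a worker is only surfaced when `result()` is called on its future.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fake_download(url, filename):
    """Stand-in for a real download: returns the name of the file it would save."""
    return f"{filename}.pdf"


# Hypothetical mapping of paper IDs to URLs
articles_demo = {
    "PP0001": "https://example.com/a",
    "PP0002": "https://example.com/b",
}

saved, failed = [], []
with ThreadPoolExecutor(max_workers=4) as executor:
    # Remember which paper each future belongs to
    futures = {
        executor.submit(fake_download, url, name): name
        for name, url in articles_demo.items()
    }
    for future in as_completed(futures):
        try:
            # result() re-raises any exception from the worker thread
            saved.append(future.result())
        except Exception:
            failed.append(futures[future])

print(sorted(saved), failed)
```

Collecting the failed paper IDs in a list makes it straightforward to retry only those downloads.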

**Before running the code block below, please ensure you have the `credentials.json` file available in the same directory as this notebook.**


In [4]:
# # Set up credentials for Google Drive API
# credentials = service_account.Credentials.from_service_account_file(
#     'credentials.json')

# # Create a directory to store PDF files
# if not os.path.exists("../data/input/pdf_files/"):
#     os.makedirs("../data/input/pdf_files/")


# def download_pdf(url, filename):
#     """
#     Downloads the PDF file from the given URL and saves it to the pdf_files directory.
#     :param url: URL of the PDF file
#     :param filename: Filename of the PDF file
#     """
#     try:
#         # Use credentials to access the Google Drive API
#         response = requests.get(
#             url, headers={"Authorization": f"Bearer {credentials.token}"})
#         # Treat HTTP error statuses as failures instead of saving an error page
#         response.raise_for_status()
#         # Save into the directory created above
#         with open(f"../data/input/pdf_files/{filename}.pdf", "wb") as f:
#             f.write(response.content)
#         # Print filename of the downloaded PDF file to verify status
#         print(f"Downloaded {filename}.pdf")
#     except (requests.RequestException, OSError):
#         # Print filename of the PDF file that failed to download
#         print(f"Error downloading {filename}.pdf from URL: {url}")


# # To download PDF files in parallel
# with ThreadPoolExecutor() as executor:
#     for filename, url in articles.items():
#         executor.submit(download_pdf, url, filename)


In [5]:
# Count the number of files in directory
num_files = len([f for f in os.listdir('../data/input/pdf_files/')
                if os.path.isfile(os.path.join('../data/input/pdf_files/', f))])

print(f"There are {num_files} files in the pdf_files directory.")


There are 200 files in the pdf_files directory.


# 6. <a id='toc6_'></a>[Aggregating Data from PDFs](#toc0_)

Now, we extract all text from the PDF files into a dictionary.

-   We remove the file extension from the file name and use it as the key for the dictionary.
-   The values of the dictionary are the extracted text from the PDF files.


In [6]:
# Set path to the directory containing the PDF files
pdf_directory = "../data/input/pdf_files/"
pdf_files = os.listdir(pdf_directory)

# Create a dictionary to store text from PDF files
text_dict = {}

# Iterate through PDF files in the directory and extract text
for file_name in pdf_files:
    # Check to confirm that the file is a PDF file
    if file_name.endswith(".pdf"):
        file_path = os.path.join(pdf_directory, file_name)
        with open(file_path, "rb") as pdf_file:
            pdf = pdftotext.PDF(pdf_file)
            text = "\n".join(pdf)
            # Remove file extension from the filename
            key = file_name[:-4]
            # Add text to our dictionary
            text_dict[key] = text


# 7. <a id='toc7_'></a>[Extracting Information to Entities](#toc0_)

Once we have our dictionary containing the extracted text from the PDF files, we can extract the required entities.

We start by compiling the regular expressions for extracting the required entities.

**Regular expression for extracting Titles**

```python
r'^(.+?)Authored'
```

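-   `^(.+?)` lazily captures all text from the start of the document until the word `Authored` is found. The patterns are compiled with `re.DOTALL`, so `.` also matches newlines.
-   `Authored` matches the literal word `Authored`.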

**Regular expression for extracting Authors**

```python
r'(?<=Authored by:)(?:\s*)([A-Za-z\s.?-]+)(?=\n\s*Abstract)'
```

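-   `(?<=Authored by:)` is a lookbehind: the match must be preceded by the literal text `Authored by:`, which is not included in the result.
-   `(?:\s*)` skips any whitespace that follows.
-   `([A-Za-z\s.?-]+)` captures a run of letters, whitespace (including newlines), periods, question marks and hyphens. The `?` is needed because some names in the extracted text contain it, for example `Bal?zs K?gl`.
-   `(?=\n\s*Abstract)` is a lookahead: the match must be followed by a newline, optional whitespace and the word `Abstract`, which is not included in the result.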

**Regular expression for extracting Abstract**

```python
r'Abstract(.+?)\s*1\s*[\n\s]*Paper Body'
```

-   `Abstract` matches the literal word `Abstract`.
-   `(.+?)` lazily captures all text up to the `1 Paper Body` heading.
-   `\s*1\s*` matches the section number `1`, with optional whitespace on either side.
-   `[\n\s]*` matches any further whitespace (`\s` already covers `\n`).
-   `Paper Body` matches the literal text `Paper Body`.

**Regular expression for extracting Paper Bodies**

```python
r'1\s*Paper Body(.+?)2\s*References'
```

-   `1\s*Paper Body` matches the heading `1 Paper Body`, with optional whitespace after the number.
-   `(.+?)` lazily captures all text up to the `2 References` heading.
-   `2\s*References` matches the heading `2 References`, with optional whitespace after the number.

Once we have used the regular expressions to extract the required entities, we can perform some preprocessing on the data.

-   We remove newlines and multiple spaces from the data.
-   Note that before removing multiple spaces for authors, we perform a split using double space as the delimiter.
-   Finally, the data is stored in a dictionary.
    -   The keys of the dictionary are the paper IDs.
    -   The values of the dictionary are the extracted entities.
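The four patterns can be tried on a small synthetic document. This is a minimal sketch: the sample text, the `clean` helper and the variable names are made up for illustration, and the sample assumes the same section layout as the papers (`Authored by:`, `Abstract`, `1 Paper Body`, `2 References`).

```python
import re

TITLE_PATTERN = re.compile(r'^(.+?)Authored', re.DOTALL)
AUTHOR_PATTERN = re.compile(
    r'(?<=Authored by:)(?:\s*)([A-Za-z\s.?-]+)(?=\n\s*Abstract)', re.DOTALL)
ABSTRACT_PATTERN = re.compile(
    r'Abstract(.+?)\s*1\s*[\n\s]*Paper Body', re.DOTALL)
PAPER_PATTERN = re.compile(
    r'1\s*Paper Body(.+?)2\s*References', re.DOTALL)

# Hypothetical document text with the assumed section layout
sample_text = (
    "A Made-Up Title\n"
    "Authored by:\n"
    "Ada Lovelace  Alan Turing\n"
    "Abstract\n"
    "We study a toy problem.\n"
    "1 Paper Body\n"
    "The body text goes here.\n"
    "2 References\n"
)


def clean(text):
    """Collapse runs of whitespace (including newlines) into single spaces."""
    return re.sub(r'\s+', ' ', text.strip())


title = clean(TITLE_PATTERN.findall(sample_text)[0])
# Authors are separated by a double space, so split before collapsing whitespace
author_block = AUTHOR_PATTERN.findall(sample_text)[0]
author_names = [name.strip() for name in author_block.split("  ") if name.strip()]
abstract = clean(ABSTRACT_PATTERN.findall(sample_text)[0])
body = clean(PAPER_PATTERN.findall(sample_text)[0])

print(title, author_names, abstract, body, sep="\n")
```

Indexing `findall(...)[0]` raises an `IndexError` when a pattern finds nothing, so a document with a different layout fails loudly rather than producing an empty entity.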


In [7]:
# Compile regular expression for titles
TITLE_PATTERN = re.compile(r'^(.+?)Authored', re.DOTALL)

# Compile regular expression for authors
AUTHOR_PATTERN = re.compile(
    r'(?<=Authored by:)(?:\s*)([A-Za-z\s.?-]+)(?=\n\s*Abstract)', re.DOTALL)

# Compile regular expression for abstracts
ABSTRACT_PATTERN = re.compile(
    r'Abstract(.+?)\s*1\s*[\n\s]*Paper Body', re.DOTALL)

# Compile regular expression for paper bodies
PAPER_PATTERN = re.compile(
    r'1\s*Paper Body(.+?)2\s*References', re.DOTALL)


# Create dictionaries to store extracted data
titles = {}
authors = {}
abstracts = {}
papers = {}


for file_name, text in text_dict.items():
    # Extract title
    title = TITLE_PATTERN.findall(text)
    # Remove extra spaces from title
    title = re.sub(r'\s+', ' ', title[0].strip())
    # Add title to dictionary
    titles[file_name] = title

    # Extract author
    author = AUTHOR_PATTERN.findall(text)
    authors_list = []
    for a in author:
        # Replace newlines in the author block with spaces
        clean_author = re.sub(r'[\n\r]+', ' ', a.strip())
        # Split multiple authors using delimiter "  "
        authors_list.extend(clean_author.split("  "))
        # Remove empty strings from list
        authors_list = list(filter(None, authors_list))
        # Add author to dictionary
        authors[file_name] = authors_list

    # Extract abstract
    abstract = ABSTRACT_PATTERN.findall(text)
    # Collapse newlines and repeated whitespace in the abstract into single spaces
    clean_abstract = re.sub(r'\s+', ' ', abstract[0].strip())
    # Add the abstract to dictionary
    abstracts[file_name] = clean_abstract

    # Extract the paper
    paper = PAPER_PATTERN.findall(text)
    # Remove newlines and extra spaces from the paper
    clean_paper = re.sub(r'\s+', ' ', paper[0].strip())
    # Add the paper to our dictionary
    papers[file_name] = clean_paper


In [8]:
# Print first paper title
for key, value in list(titles.items())[:1]:
    print(f"Filename: {key}\nTitle: {value}\n")

# Print first paper authors
for key, value in list(authors.items())[:1]:
    print(f"Filename: {key}\nAuthors: {value}\n")

# Print first paper abstract
for key, value in list(abstracts.items())[:1]:
    print(f"Filename: {key}\nAbstract: {value}\n")

# Print first paper body
for key, value in list(papers.items())[:1]:
    print(f"Filename: {key}\nPaperBody: {value}\n")


Filename: PP3206
Title: Learning the 2-D Topology of Images

Filename: PP3206
Authors: ['Yoshua Bengio', 'Bal?zs K?gl', 'Nicolas L. Roux', 'Pascal Lamblin', ' Marc Joliveau']

Filename: PP3206
Abstract: We study the following question: is the two-dimensional structure of images a very strong prior or is it something that can be learned with a few examples of natural images? If someone gave us a learning task involving images for which the two-dimensional topology of pixels was not known, could we discover it automatically and exploit it? For example suppose that the pixels had been permuted in a fixed but unknown way, could we recover the relative two-dimensional location of pixels on images? The surprising result presented here is that not only the answer is yes but that about as few as a thousand images are enough to approximately recover the relative locations of about a thousand pixels. This is achieved using a manifold learning algorithm applied to pixels associated with a measure

# 8. <a id='toc8_'></a>[Preprocessing Data](#toc0_)


## 8.1. <a id='toc8_1_'></a>[Sentence Segmentation](#toc0_)

-   Iterating over each key-value pair in the `papers` dictionary, we:
    -   Split the text into sentences using the `sent_tokenize` function from the `nltk` library.
    -   Keep the original case of the first and last word of each sentence, and of any word written entirely in uppercase (checked with `isupper()`). Every other word is converted to lowercase.
    -   Append each normalised word to a running list for the paper.
    -   Finally, update the value in the `papers` dictionary with the normalised words joined back into a single string.
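The case rule can be tried on a single sentence. This is a minimal sketch of the rule as implemented in this notebook (first word, last word and all-uppercase words keep their case); the function name `normalise_sentence` and the example sentence are made up for illustration.

```python
def normalise_sentence(sentence):
    """Lowercase every word except the first, the last and all-uppercase words."""
    words = sentence.split()
    last = len(words) - 1
    normalised = []
    for i, word in enumerate(words):
        if i == 0 or i == last or word.isupper():
            # Keep the original case
            normalised.append(word)
        else:
            normalised.append(word.lower())
    return " ".join(normalised)


example = "The Gaussian kernel in an SVM uses Euclidean Distance."
print(normalise_sentence(example))
```

Here `Gaussian` and `Euclidean` are lowercased, while `The`, `SVM` and the final `Distance.` are left unchanged.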


In [9]:
# Iterate through papers dictionary and perform sentence segmentation
for file_name, paper in papers.items():
    sentences = sent_tokenize(paper)
    # Create a list to store normalized sentences
    normalized_sentences = []
    for sentence in sentences:
        # Split the sentence into tokens
        words = sentence.split()
        for i, word in enumerate(words):
            if i == 0 or i == len(words) - 1 or word.isupper():
                # Keep capital tokens at the beginning, end or standalone
                normalized_word = word
            else:
                # Normalize lowercase for non initial capital tokens
                normalized_word = word.lower()
            # Append normalized word to list
            normalized_sentences.append(normalized_word)
    # Update the papers dict with our normalized sentences
    papers[file_name] = ' '.join(normalized_sentences)


We also remove any numbers/digits present in the sentences as they are not required for the analysis.


In [10]:
# Remove unnecessary digits from papers
papers = {key: re.sub(r'\d+', '', value) for key, value in papers.items()}


## 8.2. <a id='toc8_2_'></a>[Tokenization](#toc0_)

Using the regular expression provided in the specification, we perform word tokenization on the sentences.
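To see what the pattern accepts, it can be applied to a short string. This sketch uses `re.findall` as a stand-in for `RegexpTokenizer`, on the assumption that both return the same matches for this pattern; the example string is made up.

```python
import re

WORD_PATTERN = r"[A-Za-z]\w+(?:[-'?]\w+)?"

text = "State-of-the-art SVMs don't use a k-NN step"
tokens = re.findall(WORD_PATTERN, text)
print(tokens)
```

A token must start with a letter and be at least two characters long, so `a` and the `k` of `k-NN` are dropped. At most one `-`, `'` or `?` joiner is allowed per token, which is why `State-of-the-art` splits into `State-of` and `the-art`.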


In [11]:
# Define regular expression pattern for words
WORD_PATTERN = r"[A-Za-z]\w+(?:[-'?]\w+)?"

# Create tokenizer object
tokenizer = RegexpTokenizer(WORD_PATTERN)

# Iterate through the papers dictionary and tokenize words
for file_name, paper in papers.items():
    words = tokenizer.tokenize(paper)
    # Update papers dict with tokenized words
    papers[file_name] = words


## 8.3. <a id='toc8_3_'></a>[Bigrams](#toc0_)

In this step we generate bigrams from the tokens using the `nltk` library.

-   Using list comprehensions, we remove bigrams that contain a stopword or the separator `__`.
-   We store the top 200 remaining bigrams in `top_200_bigrams`.

Finally, we retokenize the words in all papers using the multi word expression tokenizer, and update the value in the papers dictionary with the new tokenized words.
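The idea can be sketched in plain Python on a toy token list. This is not the `nltk` implementation: `top_bigrams` and `merge_bigrams` are hypothetical helpers that stand in for the frequency distribution and the multi-word expression tokenizer, and the token list and stopword set are made up.

```python
from collections import Counter


def top_bigrams(tokens, stopwords, n):
    """Return the n most frequent bigrams that contain no stopword."""
    counts = Counter(zip(tokens, tokens[1:]))
    kept = [
        pair for pair, _ in counts.most_common()
        if pair[0] not in stopwords and pair[1] not in stopwords
    ]
    return kept[:n]


def merge_bigrams(tokens, bigrams, separator="__"):
    """Join each occurrence of a selected bigram into a single token."""
    bigram_set = set(bigrams)
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in bigram_set:
            merged.append(tokens[i] + separator + tokens[i + 1])
            i += 2  # skip the second word of the merged pair
        else:
            merged.append(tokens[i])
            i += 1
    return merged


tokens = ["the", "neural", "network", "learns",
          "the", "neural", "network", "weights"]
best = top_bigrams(tokens, stopwords={"the"}, n=1)
retokenized = merge_bigrams(tokens, best)
print(best)
print(retokenized)
```

The pair `("the", "neural")` is as frequent as `("neural", "network")` but is discarded because it contains a stopword.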


In [12]:
# Load stopwords file
with open("../data/input/stopwords_en.txt") as f:
    stopwords = set(f.read().splitlines())


# Create a single list of all words in papers
all_words = [word for words in papers.values() for word in words]


# Create frequency distribution of all bigrams
bigram_freq = nltk.FreqDist(nltk.bigrams(all_words))


# Remove bigrams that contain stopwords or '__'
filtered_bigrams = [(w1, w2) for (
    w1, w2) in bigram_freq if w1 not in stopwords and w2 not in stopwords]
filtered_bigrams = [(w1, w2) for (
    w1, w2) in filtered_bigrams if "__" not in w1 and "__" not in w2]


# Store the top 200 bigrams
top_200_bigrams = filtered_bigrams[:200]


# Create a multi word expression tokenizer with the top 200 bigrams
mwetokenizer = MWETokenizer(top_200_bigrams, separator='__')


# Retokenize the words in all papers using the multi word expression tokenizer
for file_name, words in papers.items():
    papers[file_name] = mwetokenizer.tokenize(words)


## 8.4. <a id='toc8_4_'></a>[Further Text Preprocessing](#toc0_)

We perform the following preprocessing steps on the tokens based on the given specification.


-   Using list comprehension we remove any stopwords.


In [13]:
# Iterate through papers dictionary and remove stopwords
for file_name, words in papers.items():
    papers[file_name] = [word for word in words if word not in stopwords]


-   By computing the frequency distribution of the tokens across all papers, we identify the context-dependent stopwords and remove them. A token is flagged when its total count divided by the number of papers is at least 0.95.


In [14]:
# Create a frequency distribution of all words
token_freq = nltk.FreqDist(
    [word for words in papers.values() for word in words])

# Flag context-dependent stopwords: tokens whose total count divided by
# the number of papers is at least 0.95
context_dependent_stopwords = set(
    [token for token, freq in token_freq.items() if freq/len(papers) >= 0.95])

# Iterate through papers dictionary and remove context-dependent stopwords
for file_name, words in papers.items():
    papers[file_name] = [
        word for word in words if word not in context_dependent_stopwords]


-   We identify rare tokens (total count divided by the number of papers below 0.03) and remove them.
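The ratio used in these cells divides a token's total number of occurrences by the number of papers, so it can exceed 1 and is not the same as the share of papers containing the token. If the share of papers is what is wanted, one option is a document frequency, where each paper contributes at most one count per token. The sketch below is an illustration under that assumption, not the method used in this notebook. The names and the toy corpus are hypothetical, and the rarity threshold of 0.30 is chosen only so that a four-paper corpus shows an effect.

```python
from collections import Counter


def document_frequency(docs):
    """Count, for each token, the number of documents that contain it."""
    df = Counter()
    for tokens in docs.values():
        # set() ensures a token is counted once per document
        df.update(set(tokens))
    return df


# Hypothetical toy corpus: paper ID -> tokens
docs = {
    "P1": ["model", "model", "model", "data", "kernel"],
    "P2": ["model", "data"],
    "P3": ["model", "graph"],
    "P4": ["model", "data", "data", "data", "data"],
}
n_docs = len(docs)
df = document_frequency(docs)

too_common = {t for t, c in df.items() if c / n_docs >= 0.95}
rare = {t for t, c in df.items() if c / n_docs < 0.30}

# Total occurrences, for comparison with the document frequency
total = Counter(t for tokens in docs.values() for t in tokens)
print(too_common, rare, total["data"] / n_docs, df["data"] / n_docs)
```

In this toy corpus `data` occurs six times, giving a total-count ratio of 1.5, yet it appears in only three of the four papers, a document-frequency share of 0.75.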


In [15]:
# Flag rare tokens: tokens whose total count divided by the number of
# papers is below 0.03
rare_tokens = set(
    [token for token, freq in token_freq.items() if freq/len(papers) < 0.03])

# Iterate through papers dictionary and remove these rare tokens
for file_name, words in papers.items():
    papers[file_name] = [word for word in words if word not in rare_tokens]


-   Finally, we remove tokens that are fewer than 3 characters long.


In [16]:
# Iterate through papers dictionary and remove words with length less than 3
for file_name, words in papers.items():
    papers[file_name] = [word for word in words if len(word) >= 3]


## 8.5. <a id='toc8_5_'></a>[Stemming](#toc0_)

Once we have preprocessed the tokens according to the specification, we perform stemming using the PorterStemmer from the `nltk` library.

Additionally, we account for cases where the original token was in uppercase or capitalized by preserving the original capitalization in the stemmed version of the token (this was required in the specification).
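The case restoration is independent of the stemmer itself, since the stemmer returns lowercase stems by default. The sketch below separates the two concerns. `stem_preserving_case` is a hypothetical helper, and `toy_stem` is a deliberately crude stand-in for `PorterStemmer().stem` so that the example runs without `nltk`.

```python
def stem_preserving_case(word, stem):
    """Stem a word, then restore its original UPPERCASE or Capitalised form."""
    stemmed = stem(word)
    if word.isupper():
        return stemmed.upper()
    if word[0].isupper():
        return stemmed.capitalize()
    return stemmed


def toy_stem(word):
    """Hypothetical stand-in for a real stemmer: lowercase, strip one trailing 's'."""
    lowered = word.lower()
    return lowered[:-1] if lowered.endswith("s") else lowered


words = ["Networks", "KERNELS", "images"]
stemmed_demo = [stem_preserving_case(w, toy_stem) for w in words]
print(stemmed_demo)
```

Passing the stemming function as an argument means the same helper can be called with `PorterStemmer().stem` in place of `toy_stem`.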


In [17]:
# Create stemmer object
stemmer = PorterStemmer()

# Iterate through papers dictionary and stem words
for file_name, words in papers.items():
    # Create initial list to store stemmed words
    stemmed_words = []
    for word in words:
        # Stem tokens longer than 3 characters; shorter tokens are not carried forward
        if len(word) > 3:
            # Check if word is uppercase, lowercase or TitleCase (we need to preserve this)
            if word.isupper():
                stemmed_words.append(stemmer.stem(word).upper())
            elif word[0].isupper():
                stemmed_words.append(stemmer.stem(word).capitalize())
            else:
                stemmed_words.append(stemmer.stem(word))
    # Update papers dict with stemmed words
    papers[file_name] = stemmed_words


# 9. <a id='toc9_'></a>[Feature Conversion and Output](#toc0_)

Iterating over all values in the `papers` dictionary, we aggregate the tokens into a list.

We remove all duplicates from the list and sort the list in alphabetical order.


In [18]:
# Create a single list of all words in papers
all_tokens = []
for word in papers.values():
    all_tokens.extend(word)

# Create a set of all tokens
all_tokens = set(all_tokens)

# Sort the tokens alphabetically
all_tokens = sorted(all_tokens)


In [19]:
# Print the number of tokens
print(f"Number of tokens: {len(all_tokens)}")


Number of tokens: 3927


We map each token to its index in the alphabetically sorted list of tokens.


In [20]:
# Create a dict to store index of each token
token_index = {}
for i, token in enumerate(all_tokens):
    token_index[token] = i


Now, we output the vocabulary index file with the format given in the specification.


In [21]:
# Write vocab index to file
with open("../data/output/vocab.txt", "w") as f:
    for token, index in token_index.items():
        f.write(f"{token}:{index}\n")


Now, we iterate over each paper in our `papers` dictionary and create a sparse count vector for the paper using the token index dictionary.

The format follows the specification: each line holds the paper ID (filename) followed by the sparse count vector for that paper.
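A single output line can be built as follows. This is a minimal sketch with a three-word vocabulary; `sparse_count_line` and the sample tokens are made up for illustration, and the `index:count` layout mirrors the one written by this notebook.

```python
from collections import Counter


def sparse_count_line(paper_id, tokens, token_index):
    """Format one paper as 'paper_id,index:count,index:count,...'."""
    counts = Counter(tokens)
    entries = [
        f"{token_index[token]}:{count}"
        for token, count in counts.items()
        if token in token_index  # ignore tokens outside the vocabulary
    ]
    return f"{paper_id}," + ",".join(entries)


# Hypothetical vocabulary, indexed in alphabetical order
vocab = sorted({"kernel", "image", "network"})
token_index = {token: i for i, token in enumerate(vocab)}

line = sparse_count_line(
    "PP0001", ["network", "image", "network", "kernel"], token_index)
print(line)
```

Only tokens that occur in the paper are written, which is what makes the vector sparse.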


In [22]:
# Write count vectors to file
with open("../data/output/count_vectors.txt", "w") as f:
    for paper_id, paper in papers.items():
        # Count occurrences of each token in the paper
        counts = Counter(paper)
        # Create list to store sparse counts
        sparse_counts = []
        for token, count in counts.items():
            if token in token_index:
                # Get index of token from token_index dict
                index = token_index[token]
                sparse_counts.append(f"{index}:{count}")
        # Join sparse counts with commas
        sparse_counts_str = ",".join(sparse_counts)
        f.write(f"{paper_id},{sparse_counts_str}\n")


# 10. <a id='toc10_'></a>[Statistical Summary](#toc0_)

We previously extracted all required data from the PDF files for the entities required for analysis.

Now, we perform some statistical analysis on the data to find the 10 most frequent terms in the titles and abstracts, and the 10 most frequent authors.


In [23]:
# Load stopwords file again (optional, we also have it in memory)
with open("../data/input/stopwords_en.txt") as f:
    stopwords = set(f.read().splitlines())


Set variables for pattern matching as per the specification.


In [24]:
# Set regular expression pattern for words
WORD_PATTERN = r"[A-Za-z]\w+(?:[-'?]\w+)?"

# Compile regular expression
PATTERN = re.compile(WORD_PATTERN)


Here, we perform the following steps:

-   Tokenize the titles and abstracts using the regular expression provided in the specification.
    -   It was set as a global variable in the code block above.
    -   The authors did not require tokenization.
-   We then identify the 10 most frequent terms in the titles and abstracts, and the 10 most frequent authors.

Note that the dictionaries for these entities were already created earlier in the notebook.
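The ranking uses a two-part sort key so that ties are broken deterministically: terms are ordered by descending count, then alphabetically. A minimal sketch, with a hypothetical helper `top_terms` and a made-up term list:

```python
from collections import Counter


def top_terms(terms, n=10):
    """Return the n most frequent terms, breaking ties alphabetically."""
    counts = Counter(terms)
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:n]]


terms = ["learning", "kernel", "learning", "bayesian", "kernel", "image"]
top3 = top_terms(terms, n=3)
print(top3)
```

`kernel` and `learning` both occur twice, so `kernel` is listed first, and `bayesian` wins the remaining place over `image` for the same reason.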


In [25]:
# Create empty lists for titles and abstracts
all_titles = []
all_abstracts = []

# Iterate through titles, tokenize and remove stopwords
for paper_id, title in titles.items():
    # Tokenize the title and remove stopwords
    title_tokens = PATTERN.findall(title.lower())
    title_tokens = [token for token in title_tokens if token not in stopwords]
    all_titles.extend(title_tokens)

# Count the most frequent terms in the titles
titles_count = Counter(all_titles)
top10_titles = [term for term, count in sorted(
    titles_count.items(), key=lambda x: (-x[1], x[0]))[:10]]


# Iterate through abstracts, tokenize and remove stopwords
for paper_id, abstract in abstracts.items():
    # Tokenize the abstract and remove stopwords
    abstract_tokens = PATTERN.findall(abstract.lower())
    abstract_tokens = [
        token for token in abstract_tokens if token not in stopwords]
    all_abstracts.extend(abstract_tokens)

# Count the most frequent terms in the abstracts
abstracts_count = Counter(all_abstracts)
top10_abstracts = [term for term, count in sorted(
    abstracts_count.items(), key=lambda x: (-x[1], x[0]))[:10]]


# Using list comprehension, create a list of all authors
authors_flat = [author for authors_list in authors.values()
                for author in authors_list]

# Remove whitespace from author names
authors_flat = [author.strip() for author in authors_flat]

# Count the most frequent authors
authors_counter = Counter(authors_flat)
top10_authors = [author for author, count in sorted(
    authors_counter.items(), key=lambda x: (-x[1], x[0]))[:10]]


Finally, we output our statistical summary to a CSV file using the `csv` library.


In [26]:
# Write stats to file
with open('../data/output/summary_stats.csv', mode='w', newline='') as file:
    writer = csv.writer(file)
    # Write header
    writer.writerow([
        'top10_terms_in_abstracts',
        'top10_terms_in_titles',
        'top10_authors'])
    # Write rows
    writer.writerows(zip(
        top10_abstracts,
        top10_titles,
        top10_authors))


# 11. <a id='toc11_'></a>[Summary](#toc0_)

To summarise, we have performed the following tasks in this project.

**Data Extraction:**

-   Extracted URLs from the original PDF file.
-   Downloaded PDF files using the extracted URLs.
-   Extracted textual content from the downloaded PDF files.

**Information Extraction:**

-   Identified and extracted essential entities from the text, organizing them into dictionaries.

**Data Preprocessing:**

-   Divided the text into sentences through sentence segmentation.
-   Segmented the sentences into individual words (tokenization).
-   Generated bigrams from the tokenized words.
-   Carried out additional text preprocessing in accordance with provided specifications.
-   Applied stemming to words to reduce them to their base forms.

**Statistical Analysis:**

-   Conducted statistical analysis on the extracted entities.
-   Computed the 10 most frequent terms within titles and abstracts, and the 10 most frequent authors.


# 12. <a id='toc12_'></a>[References](#toc0_)

[1] [Information on using NLTK library for text preprocessing](https://www.nltk.org/book/)

[2] [Using ThreadPool for parallel processing](https://www.digitalocean.com/community/tutorials/how-to-use-threadpoolexecutor-in-python-3#step-2-using-threadpoolexecutor-to-execute-a-function-in-threads)

[3] [Sorting data with lambda function](https://blogboard.io/blog/knowledge/python-sorted-lambda/)

[4] [Compiling and testing regex patterns](https://pythex.org/)
