Date: 15/09/2019

Version: 6.0

Environment: Python 3.7.0 and Anaconda 2019.07 (64-bit)

Libraries used:
* pdfminer 20181108 (to extract data from PDF files)
* nltk 3.4.4 (to tokenize and stem the data)
* requests 2.22.0 (to access the links to download 200 PDFs)
* re 2.2.1 (for regular expressions, included in Anaconda Python 3.7)
* pandas 0.24.2 (for data frames, included in Anaconda Python 3.7)


## 1. Introduction

This assignment requires us to write Python code that extracts the links from a PDF file (Group010.pdf), downloads the 200 published papers behind those links programmatically, preprocesses them, and converts the data into numerical representations suitable as input for NLP systems, recommender systems, information-retrieval algorithms, etc.

We are required to do the following tasks:

1. Generate a sparse representation for Paper Bodies (i.e. paper text without Title, Authors, Abstract and References). The sparse representation consists of two files:

    a. Vocabulary index file<br>
    b. Sparse count vectors file<br><br>

2. Generate a CSV file (stats.csv) containing three columns:

    a. Top 10 most frequent terms appearing in all Titles<br>
    b. Top 10 most frequent Authors<br>
    c. Top 10 most frequent terms appearing in all Abstracts<br>


More details will be provided in the following sections.

## 2. Import Libraries

In [1]:
# In memory stream for text I/O
import io
from io import StringIO

# To extract information from PDF documents
import pdfminer
from pdfminer.converter import TextConverter
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage

# For tokenizing and stemming
import nltk.data
from nltk.tokenize import RegexpTokenizer 
from nltk.stem import PorterStemmer
from nltk.tokenize import MWETokenizer
from nltk.util import ngrams
from nltk.probability import FreqDist

# To access links and download data
import requests

# For usage of regular expressions
import re

# For statistic generation
import pandas as pd

## 3. Defining functions

#### readTextFromPdf

The `readTextFromPdf` function reads the data from a PDF file using the built-in functions of the pdfminer library. It returns the extracted text as a string, or `None` if no text was found.


We are passing two arguments to this function:

- path - location of the file to be read

- pages - defaults to `None`, which reads every page; pass a collection of page numbers if the data needs to be accessed page-wise

In [2]:
# To read the data from a PDF file

def readTextFromPdf(path, pages=None):
    if not pages:
        pagenums = set()
    else:
        pagenums = set(pages)
        
    resourceManager = PDFResourceManager()                        # Repository of shared resources
    fakeFileHandle = io.StringIO()                                # Text I/O using in-memory buffer
    converter = TextConverter(resourceManager, fakeFileHandle, laparams=LAParams())      # Convert the text through interpreter
    pageInterpreter = PDFPageInterpreter(resourceManager, converter)
 
    with open(path, 'rb') as fileHandle:                          # Opening the PDF file as fileHandle
        for page in PDFPage.get_pages(fileHandle, pagenums):      # Go through each page if page exists
            pageInterpreter.process_page(page)
 
        text = fakeFileHandle.getvalue()                          # Fetching the data from PDF file
 
    # close open handles
    converter.close()
    fakeFileHandle.close()
 
    # If there's some text in the file, pass text or else pass None
    if text:
        return text
    else:
        return None

#### getFilenameAndUrl

The `getFilenameAndUrl` function extracts the filenames and URLs from the text of the `Group010.pdf` file. It removes the `filename` and `url` header strings, squeezes consecutive newline characters `\n` into a single `\n`, and then uses `re.findall` to read the PDF filenames with the pattern `(PP\d{4})\.pdf` and their URLs with the pattern `(http.*?)\n`. The filenames and links are then paired into a list of tuples, `fileLinkPair`, which the function returns.

We are passing 1 argument to this function:

- text - the content that is to be passed so that this function can extract the required data

In [3]:
# To extract filename and URL from the text that was read through readTextFromPdf function

def getFilenameAndUrl(text):
    text = re.sub(r'filename|url','', text)         # Removing the phrase filename or url
    text = re.sub(r'\n+','\n', text)                # Substituting multiple new line characters with one new line character
    file = re.findall(r'(PP\d{4})\.pdf', text)      # Finding all file names (200)
    link = re.findall(r'(http.*?)\n', text)         # Finding all URLs (200)
    
    fileLinkPair = []
    for i in range(len(file)):                      # Creating a file link pair list of file name and link tuples
        fileLinkPair.append((file[i].strip(), link[i].strip()))

    return fileLinkPair
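
As a quick check of the two patterns, the following minimal sketch applies the same substitutions and `re.findall` calls to a made-up snippet of text. The sample layout and the `example.com` URLs are assumptions for illustration only; the text actually extracted from `Group010.pdf` may be laid out differently.

```python
import re

# Hypothetical sample of the extracted text (not taken from Group010.pdf)
sampleText = (
    "filename\nurl\n"
    "PP0001.pdf\nhttps://example.com/download?id=a1\n\n\n"
    "PP0002.pdf\nhttps://example.com/download?id=b2\n"
)

cleaned = re.sub(r'filename|url', '', sampleText)    # Drop the header words
cleaned = re.sub(r'\n+', '\n', cleaned)              # Squeeze repeated newlines
names = re.findall(r'(PP\d{4})\.pdf', cleaned)       # File names without extension
links = re.findall(r'(http.*?)\n', cleaned)          # Each URL up to the end of its line

pairs = list(zip(names, links))
print(pairs)
```

Note that the first substitution removes the strings `filename` and `url` wherever they occur, so this approach relies on neither string appearing inside a link.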

#### stemWords

The `stemWords` function takes a list of words and stems them with `PorterStemmer`. This is one of the last steps of processing. Because the stemmer returns its output in lowercase regardless of how the input is capitalised, the function distinguishes four cases so that the original casing can be restored:

- As per the specification, only unigrams are stemmed. A bigram (a token containing `__`) is appended unchanged to the list that is returned at the end of this function.

- If the first letter of the word is uppercase and the second letter is lowercase, we stem the word and then capitalise its first letter again before appending it to the list.

- Similarly, if the whole word is uppercase, we stem it and convert the result back to uppercase before appending it to the list.

- In every other case, we simply stem the word and append it to the list.


We are passing 1 argument to this function:

- listOfWords - list of words that are to be stemmed using porter stemmer

In [4]:
# To stem the words (In each case, the pattern is restored after stemming)

def stemWords(listOfWords):
    stemList=[]
    ps = PorterStemmer()
        
    # Stemming performs lower case by default. So, we set four conditions :
    for word in listOfWords:
        
        # If bigram, simply append, don't stem
        if '__' in word:
            stemList.append(word)
        
        # If the first letter is uppercase and the next letter is lowercase
        elif word[0].isupper() and word[1].islower():
            word = ps.stem(word)
            word = word.replace(word[0], word[0].upper(), 1)
            stemList.append(word)
            
        # If the whole word is uppercase
        elif word.isupper():
            word = ps.stem(word)
            word = word.upper()
            stemList.append(word) 
        
        # For any other case
        else:    
            stemList.append(ps.stem(word))

    return stemList
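
The case-restoring logic can be looked at in isolation. The sketch below is not the function above: `restoreCase` and `toyStem` are hypothetical names, and `toyStem` is only a crude stand-in for `PorterStemmer` (it lowercases the word and strips trailing `s` characters) so that the example runs without NLTK.

```python
def toyStem(word):
    # Stand-in for a real stemmer: lowercase and strip trailing 's' characters
    return word.lower().rstrip('s')

def restoreCase(original, stemmed):
    # Re-apply the capitalisation pattern of the original word to its stem
    if original.isupper():
        return stemmed.upper()
    if original[:1].isupper() and original[1:2].islower():
        return stemmed[:1].upper() + stemmed[1:]
    return stemmed

for word in ['Models', 'SVMS', 'models']:
    print(word, '->', restoreCase(word, toyStem(word)))
```

Using slices such as `original[1:2]` instead of `original[1]` keeps the helper safe for one-letter words, which the pipeline removes before stemming anyway.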

#### tokenizer

The `tokenizer` function tokenizes the text, i.e. splits a sentence into tokens. It builds a `RegexpTokenizer` with the pattern `[A-Za-z]\w+(?:[-'?]\w+)?` and uses it to split the text into unigram tokens.

We are passing 1 argument to this function:

- text - sentences that are to be tokenized following a regex pattern

In [5]:
# To tokenize sentences

def tokenizer(text):
    
    tokenizer = RegexpTokenizer(r"[A-Za-z]\w+(?:[-'?]\w+)?")
    unigram_tokens = tokenizer.tokenize(text)
        
    return unigram_tokens
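
To see what the pattern keeps and what it drops, the following minimal sketch applies it with `re.findall` from the standard library. This assumes that `RegexpTokenizer` in its default mode returns the non-overlapping matches of the pattern, as `re.findall` does; the sample sentence is made up.

```python
import re

pattern = r"[A-Za-z]\w+(?:[-'?]\w+)?"
sample = "The state-of-the-art model isn't a 3D grid"

tokens = re.findall(pattern, sample)
print(tokens)
# ['The', 'state-of', 'the-art', 'model', "isn't", 'grid']
```

A token must start with a letter and contain at least two characters, so `a` and `3D` are dropped. At most one hyphenated or apostrophised part is attached to a token, which is why `state-of-the-art` is split into `state-of` and `the-art`.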

## 4. Extracting stop words

Here, we are extracting all the context-independent stop words from `stopwords_en.txt` file and creating a list out of all the stop words.

In [6]:
# Reading stop words from given text file and storing them into a list

with open('stopwords_en.txt', 'r') as stopWordsFile:
    stopWords = [word.strip('\n') for word in stopWordsFile]

## 5. Extracting filenames as well as links from _'Group010.pdf'_

Using the `readTextFromPdf` function created above, we read the content of the file `Group010.pdf` and extract the file-link pairs using the function `getFilenameAndUrl`.

In [7]:
# Reading text from Group010.pdf 
fileContent = readTextFromPdf('Group010.pdf')

# Extracting file and link pair to download the data
fileLinkPair = getFilenameAndUrl(fileContent)

## 6. Downloading 200 files

After extracting the filenames and their URLs, we download all the files using the third-party `requests` library, which helps in making simple HTTP requests.

In [8]:
# Download 200 pdf files from links

for name, url in fileLinkPair:
    file = requests.get(url)
    file.raise_for_status()                  # Stop if a download failed instead of saving an error page
    fileName = name + '.pdf'
    with open(fileName, 'wb') as fileHandle:
        fileHandle.write(file.content)

## 7. Initializing variables 

Here, we initialize a few variables that are used in the later stages of this code.

In [9]:
# creating a dictionary for Paper Body
paperBodyDict = {}

# Creating a dictionary for authors
authorDict = {}

# Creating an empty string for titles
titles = ""

# Initializing empty dictionary to store top title words and their frequency
titleDict = {}

# Creating an empty string for abstract
abstracts = ""

# Initializing empty dictionary to store top abstract words and their frequency 
abstractDict = {}

# Initializing empty list to store all tokens
fullTokenList = []

# Initializing empty list to store bigram words
bigramWordList = []

# Initializing empty list for step 12.2
tokenWordList = []

# Initializing empty list to store vocabulary information
vocabList = []        

# Initializing empty dictionary to store words after removing stop words
dictWithoutStopWords = {}

# Initializing empty dictionary to store words after removing words with length less than 3
dictWithoutSmallTokens = {}

# Initializing empty dictionary to store final set of words
finalDictionary = {}

# Initializing empty vocab list to store vocabulary information after stemming
vocabularyList = []

## 8. Extracting paper body, author names, titles and abstract from each PDF file

Here, we split the text of each PDF file into four parts using `re.search` and `group`:

1. We extract the `Paper Body` using the pattern `1 Paper Body(.*?)2 References` and store it in a dictionary with the filename as its key.

2. We extract the `Authors` using the pattern `Authored by:(.*?)Abstract`. We then update a dictionary with the author name as its key and the number of times the name occurs across all the PDF files as its value.

3. We extract the `Title` using the pattern `^(.*?)Authored by:`. All titles are concatenated into a single string; the dictionary of title word frequencies is built later, in section 17.2.

4. We extract the `Abstract` using the pattern `Abstract(.*?)1 Paper Body`. All abstracts are concatenated into a single string; the dictionary of abstract word frequencies is built later, in section 17.3.

**NOTE:**
Before storing the extracted text, we repair the typographic ligature characters that appear in the text read from the PDF files, since they would otherwise lead to an incorrect vocabulary and incorrect count vectors. For example, we replace the ligatures `ﬃ` and `ﬀ` with `ffi` and `ff` respectively. This gives us correctly spelled tokens and their actual frequencies.
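
The ligature repair can also be expressed as a single helper driven by a mapping. This is a minimal sketch, not the code used in this notebook; `fixLigatures` and `LIGATURES` are hypothetical names, and the ligatures are written as Unicode escapes to keep the example readable.

```python
# Mapping from ligature characters to their plain-letter spelling
LIGATURES = {
    '\ufb00': 'ff',    # LATIN SMALL LIGATURE FF
    '\ufb01': 'fi',    # LATIN SMALL LIGATURE FI
    '\ufb02': 'fl',    # LATIN SMALL LIGATURE FL
    '\ufb03': 'ffi',   # LATIN SMALL LIGATURE FFI
    '\ufb04': 'ffl',   # LATIN SMALL LIGATURE FFL
}

def fixLigatures(text):
    # Replace every ligature character with its plain-letter spelling
    for ligature, replacement in LIGATURES.items():
        text = text.replace(ligature, replacement)
    return text

print(fixLigatures('e\ufb03cient classi\ufb01cation'))
```

Each ligature is a single character, so the order of the replacements does not matter.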

In [10]:
# To access each of the 200 files

for i in range(len(fileLinkPair)):
    fileName = fileLinkPair[i][0] + '.pdf'
    readText = readTextFromPdf(fileName)
    
    # Extracting Paper Body
    paperBody = re.search('1 Paper Body(.*?)2 References', readText, re.DOTALL) 
    paperBody = paperBody.group(1).strip()
    
    # Removing weird patterns in paper body
    paperBody = re.sub('ﬃ', 'ffi', paperBody)
    paperBody = re.sub('ﬀ', 'ff', paperBody)
    paperBody = re.sub('ﬁ', 'fi', paperBody)
    paperBody = re.sub('ﬄ', 'ffl', paperBody)
    paperBody = re.sub('ﬂ', 'fl', paperBody)
    
    # Storing paperBody into dictionary with filename as key
    paperBodyDict[fileLinkPair[i][0]] = paperBody
    
    
    # Extracting author names
    authors = re.search('Authored by:(.*?)Abstract', readText, re.DOTALL)   
    authors = authors.group(1).strip()
    authors = re.sub('\n+', '\n', authors)
    
    # Removing weird patterns in author names
    authors = re.sub('ﬃ', 'ffi', authors)
    authors = re.sub('ﬀ', 'ff', authors)
    authors = re.sub('ﬁ', 'fi', authors)
    authors = re.sub('ﬄ', 'ffl', authors)
    authors = re.sub('ﬂ', 'fl', authors)
    
    # Creating an author name list for a file
    authorList = authors.split('\n')                                        
    
    # Updating the author dictionary
    for name in authorList:                                                 
        if name in authorDict.keys():
            authorDict[name] += 1                # If the name exists in the dict, increment value by 1
        else:
            authorDict[name] = 1                 # If the name doesn't exist in the dict, assign value as 1
            
      
    # Extracting titles
    title = re.search('^(.*?)Authored by:', readText, re.DOTALL)   
    title = title.group(1).strip()
    title = re.sub('\n+', ' ', title)
    
    # Removing weird patterns in title
    title = re.sub('ﬃ', 'ffi', title)
    title = re.sub('ﬀ', 'ff', title)
    title = re.sub('ﬁ', 'fi', title)
    title = re.sub('ﬄ', 'ffl', title)
    title = re.sub('ﬂ', 'fl', title)
    
    # Storing all titles in form of a string
    titles = titles + ' ' + title

        
    # Extracting abstract
    abstract = re.search('Abstract(.*?)1 Paper Body', readText, re.DOTALL)   
    abstract = abstract.group(1).strip()
    
    # Removing weird patterns in abstract
    abstract = re.sub('ﬃ', 'ffi', abstract)
    abstract = re.sub('ﬀ', 'ff', abstract)
    abstract = re.sub('ﬁ', 'fi', abstract)
    abstract = re.sub('ﬄ', 'ffl', abstract)
    abstract = re.sub('ﬂ', 'fl', abstract)
    
    # Storing all abstracts in form of a string
    abstracts = abstracts + ' ' + abstract
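
The four patterns can be tried on a small made-up document. The sample text below is an assumption that mimics the section markers the patterns rely on (`Authored by:`, `Abstract`, `1 Paper Body`, `2 References`); it is not taken from any of the 200 papers.

```python
import re

# Hypothetical text with the same section markers as the downloaded papers
sample = (
    "A Toy Title\n"
    "Authored by:\nAda Lovelace\nAlan Turing\n"
    "Abstract\nShort summary text.\n"
    "1 Paper Body\nBody text here.\n"
    "2 References\n[1] Some reference"
)

title = re.search(r'^(.*?)Authored by:', sample, re.DOTALL).group(1).strip()
authors = re.search(r'Authored by:(.*?)Abstract', sample, re.DOTALL).group(1).strip().split('\n')
abstract = re.search(r'Abstract(.*?)1 Paper Body', sample, re.DOTALL).group(1).strip()
body = re.search(r'1 Paper Body(.*?)2 References', sample, re.DOTALL).group(1).strip()

print(title, authors, abstract, body, sep='\n')
```

`re.DOTALL` lets `.` match newlines, so each non-greedy group can span several lines and stops at the first occurrence of the closing marker. If a marker is missing, `re.search` returns `None` and the call to `group` fails, so every file is expected to contain all four markers.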

## 9. Sentence segmentation and normalization

In this section, we break the Paper Body of each file into sentences using the `Punkt Sentence Tokenizer`, which we load from the NLTK package, and normalize the content as per the assignment specification: only the first word of each sentence is converted to lowercase.

In [11]:
# To break string into sentences and then normalize the content
sentenceDetector = nltk.data.load('tokenizers/punkt/english.pickle')

for fileName, body in paperBodyDict.items():
    sentences = sentenceDetector.tokenize(body)     # Breaking string into sentences
    normalizedData = []
    
    for sentence in sentences:                      # To access each sentence from the list of sentences
        combinedData = ""
        sentence = re.sub('-\n', '', sentence)      # Substitute '-\n' with nothing to join words that were separated in file
        sentence = re.sub('\n', ' ', sentence)      # Substituting new line character with space
        wordList = sentence.split()                 # Splitting sentence into words
        wordList[0] = wordList[0].lower()           # Normalizing only first word and leaving middle words as they are
        combinedData += ' '.join(wordList)
        normalizedData.append(combinedData)
        
    paperBodyDict[fileName] = normalizedData        # Storing normalized data in a dictionary with its filename as key
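
The per-sentence normalization can be sketched without NLTK as follows. The helper name `normalizeSentence` and the sample sentence are hypothetical, and sentence segmentation itself is left out.

```python
import re

def normalizeSentence(sentence):
    sentence = re.sub(r'-\n', '', sentence)     # Re-join words hyphenated across a line break
    sentence = re.sub(r'\n', ' ', sentence)     # Turn remaining line breaks into spaces
    words = sentence.split()
    if words:                                   # Guard against an empty sentence
        words[0] = words[0].lower()             # Lowercase only the first word
    return ' '.join(words)

print(normalizeSentence("Bayesian infer-\nence is used\nby Gaussian models."))
# bayesian inference is used by Gaussian models.
```

Capitalised words in the middle of a sentence, such as `Gaussian`, keep their case.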

## 10. Tokenization

The process of breaking down a character sequence into pieces is known as tokenization. Here, we break down the sentences into tokens using the `tokenizer` function defined above, generate a list of tokens and assign it back to its respective key in the dictionary.

In [12]:
# To tokenize the normalized data
for fileName, body in paperBodyDict.items():
    tokenList = []
    
    for sentence in body:
        token = tokenizer(sentence)       # Tokenize each sentence in the paper body from paperBodyDict dictionary
        tokenList.extend(token)           # Extending the tokenList for a particular paperbody under one file
        
    fullTokenList.extend(tokenList)       # Extending the tokenList for all the files
    paperBodyDict[fileName] = tokenList   # Storing token list in a dictionary with its filename as key

## 11. Generating bigrams

We can form n-grams by extracting a continuous sequence of `n` words from a given sentence. By picking `n=2` we can form bigrams.

Firstly, we remove the context-independent stop words, only for the purpose of generating meaningful bigrams, and then extract a list of bigrams using the function `ngrams()`. We need to pass a list of words and `n=2` as arguments to this function.

Then we use the `FreqDist` class provided by the NLTK package, which computes the frequency distribution directly from a collection of tokens.

Finally, we extract the top 200 bigrams by frequency using the `most_common()` method.

In [13]:
# Removing context independent stopwords for generating bigrams
tokenListWithoutStopwords = [word for word in fullTokenList if word not in stopWords]

# Generating bigrams 
bigrams = ngrams(tokenListWithoutStopwords, n = 2)

# Finding the frequency of the bigrams and sorting them in the highest frequency
fdbigram = FreqDist(bigrams) 

# Taking the top 200 bigrams
bigramList = fdbigram.most_common(200)

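We then extract only the bigram tuples (dropping their counts) and register them with the `MWETokenizer`, using `'__'` as the separator. Lastly, we retokenize each paper body so that every occurrence of a top-200 bigram is merged into a single token, and store the result back into the dictionary.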

In [14]:
# Extracting out the bigrams from the bigramList
for i in range(len(bigramList)):
    bigramWordList.append(bigramList[i][0])

# Tokenizing the bigrams and using '__' as the separator between the bigrams
mweTokenizer = MWETokenizer(bigramWordList, separator = '__')

# Tokenizing the bigrams back into the dictionary
for fileName, body in paperBodyDict.items():
    paperBodyDict[fileName] = mweTokenizer.tokenize(body)
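
The idea behind these two cells can be sketched with the standard library only. Here `Counter` over adjacent token pairs stands in for `ngrams()` and `FreqDist`, and `mergeBigrams` is a hypothetical, simplified stand-in for `MWETokenizer` that only handles two-word expressions. The token list is made up.

```python
from collections import Counter

tokens = ['neural', 'network', 'training', 'neural', 'network',
          'model', 'training', 'data']

# Count adjacent token pairs and keep the most frequent one
bigramCounts = Counter(zip(tokens, tokens[1:]))
topBigrams = [bigram for bigram, count in bigramCounts.most_common(1)]

def mergeBigrams(tokens, bigrams, separator='__'):
    # Scan left to right and join each listed pair into a single token
    bigramSet = set(bigrams)
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in bigramSet:
            merged.append(tokens[i] + separator + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

print(topBigrams)
print(mergeBigrams(tokens, topBigrams))
```

In this sample only `('neural', 'network')` occurs twice, so it is the single top bigram and both of its occurrences are merged into `neural__network`.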

## 12. Removing stop words

### 12.1. Removing context independent stop words

In [15]:
for fileName, body in paperBodyDict.items():
    body = [word for word in body if word not in stopWords]     # Removing stop words
    paperBodyDict[fileName] = mweTokenizer.tokenize(body)       # Storing tokenized words with filename as its key

### 12.2. Removing context dependent stop words

In [16]:
# Setting upper and lower threshold values
upperThreshold = (0.95)*len(paperBodyDict.keys()) 
lowerThreshold = (0.03)*len(paperBodyDict.keys()) 

First, we find the frequency of each token in a PDF file, extract only the unique tokens and append them to a list. Since each file contributes a token at most once, counting the words in this list after the loop gives the document frequency of every word, i.e. the number of files in which it appears.

In [17]:
# To record number of times a word has occurred per file (using FreqDist - nltk)
for fileName, body in paperBodyDict.items():
    bodyCount = FreqDist(body)
    for key in bodyCount.keys():
        tokenWordList.append(key)

# Overall words frequency
wordFrequency = FreqDist(tokenWordList)

We keep only the words whose document frequency is greater than the lower threshold and at most the upper threshold.

In [18]:
# Creating vocabulary list of words whose document frequency lies within the thresholds specified earlier
for key, value in wordFrequency.items():
    if value <= upperThreshold and value > lowerThreshold:
        vocabList.append(key)

In [19]:
# Creating dictionary with words after removing stop words per file (filename as key)
for fileName, body in paperBodyDict.items():
    body = [word for word in body if word in vocabList]
    dictWithoutStopWords[fileName] = mweTokenizer.tokenize(body)
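
The document-frequency filter can be illustrated on a toy corpus. This is a sketch under assumptions: `filterByDocumentFrequency` is a hypothetical helper, the four documents are made up, and the lower ratio is set to 0.25 instead of 0.03 so that the lower bound has a visible effect on such a small corpus.

```python
from collections import Counter

docs = {
    'PP0001': ['model', 'data', 'model'],
    'PP0002': ['model', 'kernel'],
    'PP0003': ['model', 'data'],
    'PP0004': ['model', 'graph'],
}

def filterByDocumentFrequency(docs, lowerRatio, upperRatio):
    # Count in how many documents each word occurs (once per document)
    docFreq = Counter(word for tokens in docs.values() for word in set(tokens))
    lower = lowerRatio * len(docs)
    upper = upperRatio * len(docs)
    # Same comparison as above: strictly above the lower bound, at most the upper bound
    return {word for word, freq in docFreq.items() if lower < freq <= upper}

vocab = filterByDocumentFrequency(docs, lowerRatio=0.25, upperRatio=0.95)
print(sorted(vocab))
```

Here `model` appears in all four documents and exceeds the upper bound of 3.8, while `kernel` and `graph` appear in one document each and do not exceed the lower bound of 1.0, so only `data` is kept.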

## 13. Remove tokens with length less than 3

In [20]:
# Creating dictionary of words with length greater than or equal to 3 with filename as key
for fileName, body in dictWithoutStopWords.items():
    body = [word for word in body if len(word) >= 3]
    dictWithoutSmallTokens[fileName] = body

## 14. Stemming

`Stemming` reduces a word to its stem by stripping affixes such as suffixes; unlike a lemma, the stem need not be a valid dictionary word. It lets inflected forms of the same word share a single vocabulary entry. To perform `Stemming`, we pass a list of words as an argument to the function `stemWords` defined above.

In [21]:
# Creating dictionary with final set of words
for fileName, body in dictWithoutSmallTokens.items():
    body = stemWords(body)
    finalDictionary[fileName] = body

## 15. Generating vocabulary text

In [22]:
for fileName, body in finalDictionary.items():
    bodyCount = FreqDist(body)                       # Taking frequency of each word per file
    for key in bodyCount.keys():
        vocabularyList.append(key)                   # Appending each word into vocabulary list 
vocabularyList = list(set(vocabularyList))           # Getting unique values only using set and converting it into list
vocabularyList.sort()                                # Sorting the list

In [23]:
# Storing vocabulary information into a file

vocabFile = open('Group010_vocab.txt', 'w', encoding='utf-8')
vocabDict = {}

for i in range(len(vocabularyList)):
    vocabDict[vocabularyList[i]] = i
    
    # Writing into the file in required format
    vocabFile.write(vocabularyList[i] + ':' + str(i) )
    vocabFile.write('\n')
    
vocabFile.close()

## 16. Generating sparse representation

In [24]:
# Generating count vector file

vectorFile = open('Group010_count_vectors.txt', 'w', encoding='utf-8')

for fileName, body in finalDictionary.items():
    sparseOutput = ''
    vectorString = ''
    bodyCount = FreqDist(body)
    
    # Storing the pattern in required format to write into the file
    for element, value in bodyCount.items():
        vectorString = vectorString + ',' + str(vocabDict[element]) + ':' + str(value)
    
    sparseOutput = fileName + vectorString
    
    # Writing into the file
    vectorFile.write(sparseOutput)
    vectorFile.write('\n')
    
vectorFile.close()    

#### NOTE: The entries of each count vector are in no particular order, as the assignment requirement does not specify one.
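
The two output formats can be illustrated with a toy corpus. This is a minimal sketch using `collections.Counter` in place of `FreqDist`; the two documents are made up, and the lines are collected in lists instead of being written to files.

```python
from collections import Counter

docs = {
    'PP0001': ['data', 'model', 'data'],
    'PP0002': ['kernel', 'model'],
}

# Vocabulary index: sorted unique tokens, one "token:index" line each
vocab = sorted({word for tokens in docs.values() for word in tokens})
vocabIndex = {word: i for i, word in enumerate(vocab)}
vocabLines = [word + ':' + str(i) for word, i in vocabIndex.items()]

# Sparse count vectors: "filename,index:count,index:count,..."
vectorLines = []
for fileName, tokens in docs.items():
    counts = Counter(tokens)
    entries = [str(vocabIndex[word]) + ':' + str(count) for word, count in counts.items()]
    vectorLines.append(','.join([fileName] + entries))

print(vocabLines)
print(vectorLines)
```

Only the tokens that actually occur in a document appear in its line, which is what makes the representation sparse.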

## 17. Statistics generation

### 17.1. Sorting author names

In [25]:
# Sorting the author dictionary by decreasing count, breaking ties by name in ascending order
authorList = sorted(authorDict.items(), key=lambda x: (-x[1], x[0]))  # Returns list of tuples(authors, count)
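
The sort key `(-count, name)` is what produces this ordering. The following minimal sketch shows it on a made-up list of names, with `Counter` standing in for the manually updated dictionary.

```python
from collections import Counter

# Hypothetical author names, one entry per occurrence
names = ['Li Wei', 'Ana Cruz', 'Li Wei', 'Bo Chen', 'Ana Cruz', 'Zed Ng']
counts = Counter(names)

# Negating the count sorts it in descending order; the name then breaks ties in ascending order
ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
topTwo = [name for name, count in ranked[:2]]

print(ranked)
print(topTwo)
```

`Ana Cruz` and `Li Wei` both occur twice, so they are ordered alphabetically among themselves and both rank ahead of the names that occur once.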

### 17.2. Sorting title words

In [26]:
# Stripping leading and trailing spaces
titles = titles.strip()

# tokenize titles
titleWords = tokenizer(titles)

# Removing stop words from title words
titleWords = [word for word in titleWords if word.lower() not in stopWords]        

# Converting words into lower case
titleWords = [element.lower() for element in titleWords]

# Updating the title dictionary
for word in titleWords:                                                 
    if word in titleDict.keys():
        titleDict[word] += 1                # If the word exists in the dict, increment value by 1
    else:
        titleDict[word] = 1                 # If the word doesn't exist in the dict, assign value as 1

        
# Sorting the title dictionary by decreasing count, breaking ties by word in alphabetical order
title = sorted(titleDict.items(), key=lambda x: (-x[1], x[0]))  # Returns list of tuples(title words, count)

### 17.3. Sorting abstract words

In [27]:
# Stripping leading and trailing spaces
abstracts = abstracts.strip()

# Substituting '-\n' with nothing and '\n' with space in abstract
abstracts = re.sub('-\n', '', abstracts)
abstracts = re.sub('\n', ' ', abstracts)

# Sentence segmentation of abstracts
sentenceDetector = nltk.data.load('tokenizers/punkt/english.pickle')
sentences = sentenceDetector.tokenize(abstracts)

# Normalizing the sentence and storing it in form of a string
combinedData = ""
for sentence in sentences:
    wordList = sentence.split(' ')
    wordList[0] = wordList[0].lower()
    combinedData += ' '.join(wordList) + ' '      # Trailing space keeps adjacent sentences from running together

# Tokenize the abstracts
combinedDataWords = tokenizer(combinedData)

# Removing stop words from abstract words
combinedDataWords = [word for word in combinedDataWords if word.lower() not in stopWords]


# Updating the abstract dictionary
for word in combinedDataWords:                                                 
    if word in abstractDict.keys():
        abstractDict[word] += 1                # If the word exists in the dict, increment value by 1
    else:
        abstractDict[word] = 1                 # If the word doesn't exist in the dict, assign value as 1
        
# Sorting the abstract dictionary by decreasing count, breaking ties by word in alphabetical order
abstract = sorted(abstractDict.items(), key=lambda x: (-x[1], x[0]))  # Returns list of tuples(abstract words, count)

### 17.4. Creating a dataframe and storing top 10 abstract words, title words and author names

<br>Here, we create a dataframe (using pandas) that stores the top 10 abstract terms, title terms and author names.

In [28]:
# Creating dataframe with three columns holding the top 10 abstract terms, title terms and author names
# Building it in one step avoids DataFrame.append, which was removed in pandas 2.0
df = pd.DataFrame({
    'top10_terms_in_abstracts': [word for word, count in abstract[:10]],
    'top10_terms_in_titles': [word for word, count in title[:10]],
    'top10_authors': [name for name, count in authorList[:10]]
})

### 17.5. Exporting statistics to csv file

In [29]:
# Exporting dataframe content to a csv file 
df.to_csv('Group010_stats.csv', index = False, encoding='utf-8')

## References

http://www.nltk.org/api/nltk.tokenize.html

http://www.nltk.org/api/nltk.tokenize.html#module-nltk.tokenize

https://stackoverflow.com/questions/26494211/extracting-text-from-a-pdf-file-using-pdfminer-in-python

http://www.blog.pythonlibrary.org/2018/05/03/exporting-data-from-pdfs-with-python/

https://stackoverflow.com/questions/44699682/how-to-save-a-file-downloaded-from-requests-to-another-directory