# FIT5196 Assignment 2: Text Pre-Processing & Feature Generation
<b>Group Number:</b> 149 <br>
<b>Student Names:</b> Sachit Anil Kumar, Xinming Huang <br>
<b>Student IDs:</b> 29392624, 26989166

<b>Tutor:</b> Mohammad <br>
<b>Tutorial:</b> Tuesday 12:00 PM to 2:00 PM, B4.77

<b>Date:</b> 01/09/2019 <br>
<b>Version:</b> 1.0 <br>
<b>Environment:</b> Python 3.6, Anaconda Navigator 1.8.7 and jupyter notebook 5.5.0

<b>Libraries used:</b> 
* pandas (for dataframe, included in Anaconda Python 3.6) 
* re (for regular expressions, included in Anaconda Python 3.6) 
* os (for creating folder, included in Anaconda Python 3.6)
* requests (for downloading pdf files, included in Anaconda Python 3.6)
* itertools (used for flattening 2 dimensional lists, included in Anaconda Python 3.6)
* tabula (used for reading tabulated data from pdfs)
* pdfminer (used for reading pdf files and converting to text)
* io (used for getting string input from pdf files)
* NLTK (for most of the text pre-processing tasks)
* sklearn (used for vectorizing the tokens from the documents)
* IPython display (used for clearing cell during execution)

## 1.  Introduction

The assessment tasks involve parsing a table in a PDF file and reading the URLs contained in it. The files at these URLs are then downloaded and parsed. After parsing, features such as the title, authors, abstract and paper body are stored in a dataframe. <br><br>
Further, there are two major tasks:
* Making a sparse representation of the paper bodies (represented as vocab index file and sparse count vector file) and <br>
* Making a CSV file with the ten most commonly occurring authors and the ten most frequently appearing terms in the abstracts and titles. <br>

<i><b>Note: The MAIN subsection of each section calls all the functions within that section; therefore, these have been placed at the end of each of the sections.</b></i> 

### 1.1 Import Libraries

The following libraries are being imported:

In [1]:
# Importing all the required libraries
import pandas as pd
import re
import requests
import itertools
import nltk.data
import os.path

from itertools import chain
from tabula import read_pdf
from io import StringIO
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from nltk.probability import *
from nltk.tokenize import RegexpTokenizer
from nltk.tokenize import MWETokenizer
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer
from IPython.display import clear_output

## 2. Preparing Dataframe of all Features from the Papers

### 2.1 Reading Table in PDF to a Dataframe
As the first step, all pages of the given PDF file are read and the table is parsed. The information obtained from the table is saved into a Pandas dataframe with two columns: one for the names of the files and the other for their locations on the web (URLs). These URLs will be used to programmatically download the files to the local system. <br>
<b>NOTE: We assume that the input file named "Group149.pdf" is available in the same folder/directory as "Group149_ass2.ipynb". It has been placed in the submission.</b>

In [2]:
df = pd.DataFrame([])                           # empty dataframe for each page of the file
pdf_table = pd.DataFrame([])                    # dataframe initialised as empty in the beginning 

# Iterate through each page and read into a dataframe
for pageiter in range(5):                       # 5 pages in the file - hardcoded (change if necessary)
    df = read_pdf("Group149.pdf", pages = pageiter + 1, guess = False)    # read table on page to df
    pdf_table = pd.concat([pdf_table, df], ignore_index = True)           # merge df into the actual output dataframe

### 2.2 Download all PDF Files Programmatically
Next, we programmatically download all 200 PDF files (or papers) into a folder called 'pdfs'.<br>
For this, we iterate through each row of the dataframe and use the `requests` package to request the URL in that row. The content of each response (the actual paper we will be evaluating) is written into a PDF file with the appropriate name. These PDF files are all created (or overwritten) in a folder named 'pdfs'.<br>
<b>NOTE: We assume that a folder named "pdfs" is available in the same folder/directory as "Group149_ass2.ipynb"; the code below creates it if it is missing. An empty folder called "pdfs" has been placed in the submission. </b>

In [None]:
# create a 'pdfs' directory for downloading
if not os.path.exists('./pdfs/'):
    os.makedirs('./pdfs/')

# iterate through each row in the dataframe
for index, row in pdf_table.iterrows():
    
    url = row['url']
    response = requests.get(url)                # obtain response by hitting url using requests package
    
    # write response content into PDF files - use write binary (wb); 'with' closes the file handle
    with open('./pdfs/' + row['filename'], 'wb') as pdf_file:
        pdf_file.write(response.content)

### 2.3 Get Dataframe of all Features

The following are considered to be the features of the papers: title, list of authors, abstract and paper body. <br>
To obtain these, the PDF files are first converted to text so that they can be parsed using regular expressions. This is done using the `pdfminer` package. Then, we iterate through each of the papers and extract its features with regular expressions. <br>
The features thus obtained are stored in a new dataframe with each row representing a paper and each column holding one of its features.

#### 2.3.1 Convert PDF to Text
The following function uses the `pdfminer` package to convert the PDF files to text that can be parsed easily using regular expressions.

In [None]:
def convert_pdf_to_txt(path):
    
    # set up all parameters and settings as required
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = open(path, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos=set()

    # iterate through each page and process it to obtain text from those pages
    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password, 
                                  caching=caching, check_extractable=True):
        interpreter.process_page(page)

    # obtain text
    text = retstr.getvalue()

    # close everything safely
    fp.close()
    device.close()
    retstr.close()
    
    return text

#### 2.3.2 Extract Title
The title of the paper is extracted using this function. A regular expression captures everything occurring before the keyword "Authored", which can span multiple lines. The multiple lines are then collapsed into one. 

In [None]:
def get_title(text):
    
    match = re.search(r"((?:(?:.*)\n)*)Authored", str(text)) 
    
    if match is None:                               # if no match, return NA
        return "NA"
    
    text = str(match.group(1))
    text = re.sub(r"\n", " ", text)                 # remove \n and add space
    
    return text
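
As a quick illustration of the pattern above, the sketch below runs the same regular expression on a made-up snippet; the sample text is an assumption, not taken from any of the papers. Note that the collapsed title keeps a trailing space, which `strip()` removes.

```python
import re

# Hypothetical sample mimicking the layout assumed by get_title():
# the title spans two lines and is followed by the "Authored by:" keyword.
sample_text = "Learning Sparse\nRepresentations\nAuthored by:\nJane Doe\n"

match = re.search(r"((?:(?:.*)\n)*)Authored", sample_text)
raw_title = match.group(1)                 # everything before "Authored"
title = re.sub(r"\n", " ", raw_title)      # collapse the lines into one

print(repr(title))          # the collapsed title still ends with a space
print(repr(title.strip()))  # strip() removes it
```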

#### 2.3.3 Extract Authors' Names
The author names are extracted using this function. A regular expression captures everything occurring between the keywords "Authored by:" and "Abstract". Here, text on different lines indicates different authors, so another regular expression matches the content of each line. A list is made out of these matches and returned by the function. 

In [None]:
def get_authors(text):
    
    author_list = []                                # initialise empty list
    
    # get all info between 'Authored by:' and 'Abstract' keywords
    match = re.search(r"Authored by:((.|\n)*?)Abstract", str(text)) 
    if match is None:                               # if no match, return NA
        return "NA"
    text = str(match.group(1))
    
    # split each author's name (occurring on different lines) and add to list
    match_list = re.findall(r"(.+)\n", text)
    if not match_list:                              # findall returns an empty list (never None) when nothing matches
        return "NA"
    for match in match_list:
        author_list.append(str(match))
    
    return author_list
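
To make the two-step matching concrete, here is a minimal sketch on an invented snippet (the names are placeholders). The lazy `*?` stops at the first "Abstract", and `findall` keeps only the non-empty lines.

```python
import re

# Hypothetical snippet: one author per line between the two keywords.
sample_text = "A Title\nAuthored by:\nJane Doe\nJohn Smith\n\nAbstract\nSome text.\n"

# Step 1: capture the block between "Authored by:" and the first "Abstract".
block = re.search(r"Authored by:((.|\n)*?)Abstract", sample_text).group(1)

# Step 2: every non-empty line in the block is treated as one author.
authors = re.findall(r"(.+)\n", block)

print(authors)
```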

#### 2.3.4 Extract Abstract
The abstract of the paper is extracted using this function. A regular expression captures everything occurring between the keywords "Abstract" and "1 Paper Body", which can span multiple lines. The multiple lines are then collapsed into one. Words split in two by a hyphen at a line break are joined together. 

In [None]:
def get_abstract(text):
    
    match = re.search(r"Abstract((?:.|\n)*)1 Paper Body", str(text)) 
    
    if match is None:                               # if no match, return NA
        return "NA"
    
    text = str(match.group(1))
    text = re.sub(r"\n", " ", text)                 # remove \n and add space
    text = re.sub(r"-\s", "", text)                 # handle words split across two lines by hyphen
    
    return text
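
The two substitutions can be traced on a small invented string. This is only a sketch of the behaviour: the second pattern removes any hyphen followed by whitespace, so a compound that happens to break at a line end (such as "pre-" followed by "processing") is also merged into one word.

```python
import re

# Hypothetical extract in which "generation" was split across two lines.
raw = "feature gener-\nation for pre-\nprocessing text"

step1 = re.sub(r"\n", " ", raw)     # newlines become spaces
step2 = re.sub(r"-\s", "", step1)   # drop "hyphen + whitespace" to rejoin split words

print(step1)
print(step2)
```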

#### 2.3.5 Extract Paper Body
The paper body is extracted using this function. A regular expression captures everything occurring between the keywords "1 Paper Body" and "2 References", which can span multiple lines. The multiple lines are then collapsed into one. Words split in two by a hyphen at a line break are joined together. 

In [None]:
def get_paper_body(text):
    
    match = re.search(r"1 Paper Body((?:.|\n)*)2 References", str(text)) 
    
    if match is None:                                # if no match, return NA
        return "NA"
    
    text = str(match.group(1))
    text = re.sub(r"\n", " ", text)                  # remove \n and add space
    text = re.sub(r"-\s", "", text)                  # handle words split across two lines by hyphen
    
    return text

#### 2.3.MAIN Making Dataframe of Information
Calls all the above functions to convert the PDF files into text and then extract the features from the text. The extracted information is then added to a dataframe. Page numbers are removed, and ligature characters such as "ﬄ" in the pdfminer output are replaced with their plain-letter equivalents. The filename is kept as a column to identify each paper, as it is guaranteed to be unique. 

In [None]:
papers_df = pd.DataFrame([])                         # initialise empty dataframe

# iterate through each PDF file
for index, row in pdf_table.iterrows():
    full_text = convert_pdf_to_txt("./pdfs/" + row['filename'])       # obtain text of the paper
    full_text = re.sub(r"\n\n\d{1,2}\n\n", "", full_text)             # remove page numbers 
    
    # known errors in pdfminer parsing
    full_text = re.sub(r"ﬄ", "ffl", full_text)                       # handle "ffl" pattern - replace with string
    full_text = re.sub(r"ﬃ", "ffi", full_text)                       # handle "ffi" pattern - replace with string
    full_text = re.sub(r"ﬁ", "fi", full_text)                         # handle "fi" pattern - replace with string
    full_text = re.sub(r"ﬀ", "ff", full_text)                         # handle "ff" pattern - replace with string
    full_text = re.sub(r"ﬂ", "fl", full_text)                         # handle "fl" pattern - replace with string
    
    # form single record of the dataframe - for each paper
    df_single_paper = pd.DataFrame([[row['filename'],
                                     get_title(full_text), 
                                     get_authors(full_text), 
                                     get_abstract(full_text), 
                                     get_paper_body(full_text)]], 
                                   columns = ['Filename', 'Title', 'Authors', 'Abstract', 'Paper Body'])
    
    # merge the single row with the existing large dataframe
    papers_df = pd.concat([papers_df, df_single_paper], ignore_index = True) 
    
    # just to keep track of progress during runtime
    if ((index+1) % 10 == 0):
        clear_output()
        print(str((index+1)/200*100) + "% completed")  
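
As an alternative to listing each ligature by hand, Unicode compatibility normalisation (NFKC) from the standard library maps these glyphs to plain letters. This is a swapped-in technique shown only as a sketch, not what the cell above does; NFKC also rewrites other compatibility characters (for example superscripts), so it may change more than the five explicit substitutions.

```python
import unicodedata

# Hypothetical text containing the ligature glyphs that the cell above replaces one by one.
sample = "an e\ufb03cient work\ufb02ow \ufb01ts o\ufb04ine sta\ufb00"

normalised = unicodedata.normalize("NFKC", sample)
print(normalised)
```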

In [None]:
papers_df.shape

The expected shape is 200 X 5 - 200 rows for the 200 papers and 5 columns (one column for each of the features and one for the filename). The same is obtained.

In [None]:
papers_df.head(10)

The first 10 rows of the dataframe with the papers' information look to be compiled properly. All the required features are captured in the dataframe, which will be used extensively later.

### 2.4 Task 1: Sparse Representation
This task involves doing pre-processing procedures on the text from the body of the papers to represent the documents and vocabulary in a sparse format. The pre-processing procedures are completed in the following order: 
* <b>Sentence Segmentation</b> using Punkt Sentence Tokenizer
* <b>Case Normalization</b> to lower case except capital tokens appearing in the middle of a sentence
* <b>Word Tokenization</b> using the given regular expression
* <b>Forming Bigrams and Retokenization</b> for the 200 most commonly occurring bigrams in the corpus
* <b>Removing Short Tokens</b> (length less than 3 letters)
* <b>Removing Context Independent Stopwords</b> using given list
* <b>Removing Context Dependent Stopwords</b> (threshold set to appearing in more than 95% of documents)
* <b>Removing Rare Tokens</b> (threshold set to appearing in less than or equal to 3% of documents)
* <b>Stemming of Unigrams</b> using Porter Stemmer

#### 2.4.1 Sentence Segmentation, Case Normalization and Word Tokenization
This function first does sentence segmentation using Punkt Sentence Tokenizer. Then, it makes the first letter of each sentence lowercase and tokenizes the words using the provided regular expression Word Tokenizer. The output of the function will be a two dimensional list of all the tokens (with appropriate cases) appearing in each of the documents.

In [None]:
def tokenize_body(papers_df):

    # NLTK's Punkt Sentence Tokenizer and given RegEx Word Tokenizer used
    sent_detector = nltk.data.load('tokenizers/punkt/english.pickle') 
    abstract_tokenizer = RegexpTokenizer(r"[A-Za-z]\w+(?:[-'?]\w+)?")
    
    body_list = papers_df['Paper Body']
    body_sent_list_2d = [sent_detector.tokenize(x.strip()) for x in body_list]    # sentence segmentation
    
    # lowercase first letter for each sentence
    body_tokens_list = [None] * 200
    for index in papers_df.iterrows():
        lowcase_sentance = [y[0].lower() + y[1:] for y in body_sent_list_2d[index[0]]]
        body_tokens_2d_list = [abstract_tokenizer.tokenize(i) for i in lowcase_sentance]
        body_tokens_list[index[0]] = list(itertools.chain.from_iterable(body_tokens_2d_list)) # flatten list of words
        
    return body_tokens_list
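
The effect of lowercasing only the first letter and then applying the given pattern can be seen in the following sketch. It uses `re.findall` from the standard library in place of NLTK's `RegexpTokenizer` (for a pattern without capturing groups the two should give the same matches), and the sentences are assumed to be already segmented.

```python
import re

TOKEN_PATTERN = r"[A-Za-z]\w+(?:[-'?]\w+)?"   # the pattern given in the assignment

# Hypothetical, already-segmented sentences.
sentences = ["Training on a GPU is fast.", "Models aren't always well-calibrated."]

tokens = []
for sentence in sentences:
    lowered = sentence[0].lower() + sentence[1:]        # only the first letter is lowercased
    tokens.extend(re.findall(TOKEN_PATTERN, lowered))   # single-letter words such as "a" never match

print(tokens)
```

Capitalised tokens in the middle of a sentence ("GPU") keep their case, while sentence-initial words are normalised.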

#### 2.4.2 Handling Bigrams
This function first finds the 200 most commonly occurring bigrams (two-word collocations). Then, it uses these bigrams to retokenize the list of tokens so that each of these common bigrams appears as a single term. A custom separator of double underscore is used.

In [None]:
def handle_bigram(body_tokens_list, stopwords):

    # get all words in corpus and remove context independent stopwords    
    corpus_tokens_list_with_sw = list(itertools.chain.from_iterable(body_tokens_list))
    corpus_tokens_list = [w for w in corpus_tokens_list_with_sw if w not in stopwords]

    # finding top 200 bigrams
    bigram_measures = nltk.collocations.BigramAssocMeasures()
    finder = nltk.collocations.BigramCollocationFinder.from_words(corpus_tokens_list)
    top_200_bigrams = finder.nbest(bigram_measures.raw_freq, 200)
    
    # build tokenizer; __ as the separator
    bigram_list = list(set(top_200_bigrams))
    bigram_tokenizer = MWETokenizer(bigram_list, separator = '__')
    
    # retokenize with bigrams
    body_tokens_list_w_bigrams = [None] * 200
    for index in papers_df.iterrows():
        body_tokens_list_w_bigrams[index[0]] = bigram_tokenizer.tokenize(body_tokens_list[index[0]])

    return body_tokens_list_w_bigrams
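
The idea behind this step can be sketched with the standard library alone: count adjacent token pairs, keep the most frequent ones, and merge each occurrence into a single `__`-joined token. The helper names below are hypothetical and the merge is a simple left-to-right pass; it is an illustration of the idea, not NLTK's `BigramCollocationFinder` or `MWETokenizer` implementation.

```python
from collections import Counter

def top_bigrams(tokens, n):
    """Return the n most frequent adjacent token pairs (hypothetical helper)."""
    return [pair for pair, _ in Counter(zip(tokens, tokens[1:])).most_common(n)]

def merge_bigrams(tokens, bigrams, separator="__"):
    """Join every occurrence of a listed bigram into one token, scanning left to right."""
    bigram_set = set(bigrams)
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in bigram_set:
            merged.append(tokens[i] + separator + tokens[i + 1])
            i += 2                      # skip both halves of the bigram
        else:
            merged.append(tokens[i])
            i += 1
    return merged

tokens = ["neural", "network", "model", "neural", "network", "training"]
best = top_bigrams(tokens, 1)
print(best)
print(merge_bigrams(tokens, best))
```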

#### 2.4.3 Removing Less Meaningful Tokens
This function removes from the token list the following less meaningful tokens:
* Short Tokens that are shorter than three letters
* Context Independent Stopwords such as the, for, was etc.
* Context Dependent Stopwords that appear in more than 95% of all documents
* Rare Tokens that appear in less than or equal to 3% of all documents<br>

The two-dimensional list remaining after the removal of all these tokens is returned.

In [None]:
def filter_tokens(body_tokens_list, stopwords):

    # word should only appear one time per document
    token_set = list(chain.from_iterable([set(value) for value in body_tokens_list]))
    token_document_frequency = FreqDist(token_set)
    
    # list of words to be removed
    lessFreqWords = set([token for token, count in token_document_frequency.items() if count < 6]) # fewer than 6 of 200 documents (6 = 3% of 200)
    moreFreqWords = set([token for token, count in token_document_frequency.items() if count > 189]) # 190 or more of 200 documents (190 = 95% of 200)
    
    # filter out tokens from each doc 
    final_list = [None] * 200
    for index in papers_df.iterrows():
        body_tokens_list_wo_short = [w for w in body_tokens_list[index[0]] if len(w) > 2] # short
        body_tokens_list_wo_short_sw1 = [w for w in body_tokens_list_wo_short if w not in stopwords] # CI SWs
        body_tokens_list_wo_short_sw2 = [w for w in body_tokens_list_wo_short_sw1 if w not in moreFreqWords] # CD SWs
        body_tokens_list_wo_short_sw_rt = [w for w in body_tokens_list_wo_short_sw2 if w not in lessFreqWords] # RTs

        final_list[index[0]] = body_tokens_list_wo_short_sw_rt
    
    return final_list
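
Document frequency differs from raw term frequency: each document contributes at most one count per token, which is why every document is turned into a set first. A minimal sketch on a toy corpus follows; the cut-offs `min_docs` and `max_docs` are hypothetical values chosen for three documents, not the thresholds used above.

```python
from collections import Counter
from itertools import chain

docs = [
    ["model", "model", "data"],
    ["model", "graph"],
    ["model", "data", "data"],
]

# Each document contributes at most one count per token.
doc_freq = Counter(chain.from_iterable(set(doc) for doc in docs))

min_docs, max_docs = 2, 2   # hypothetical cut-offs for this toy corpus
rare = {t for t, c in doc_freq.items() if c < min_docs}
too_common = {t for t, c in doc_freq.items() if c > max_docs}

filtered = [[t for t in doc if t not in rare and t not in too_common] for doc in docs]
print(dict(doc_freq))
print(filtered)
```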

#### 2.4.4 Stemming of Tokens
This function performs stemming of all the remaining unigram tokens so as to reduce the size of the vocabulary. If not unigrams, the tokens are left as is. They are also left as is if they contain a capital letter. The function returns a two-dimensional list of all the stemmed tokens.

In [None]:
def perform_stemming(body_tokens_list):
    
    # Porter Stemmer is used
    stemmer = PorterStemmer()
    
    # For each word, stem it if it is a unigram (ie. does not contain '__'); otherwise keep as it is
    body_tokens_list_stemmed = [None] * 200
    for index in papers_df.iterrows():
        body_tokens_list_stemmed[index[0]] = [stemmer.stem(w) if "__" not in w and w.islower() else w 
                                              for w in body_tokens_list[index[0]]]

    return body_tokens_list_stemmed
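
The condition that decides which tokens reach the stemmer is worth checking in isolation. Note that `str.islower()` ignores uncased characters such as underscores, so a bigram like `neural__network` would pass the lowercase test on its own; the explicit `"__"` check is what protects it. The helper name is hypothetical.

```python
def should_stem(token):
    """Mirror of the condition used above: lowercase unigrams only (hypothetical helper)."""
    return "__" not in token and token.islower()

for token in ["networks", "neural__network", "GPU", "Bayesian", "isn't"]:
    print(token, should_stem(token))

# islower() alone would not exclude the bigram:
print("neural__network".islower())
```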

#### 2.4.5 Perform all Pre-processing Tasks
The context independent stopwords are loaded into memory and then all the pre-processing tasks are performed in order. First, we obtain a list of all tokens appearing in each of the documents. Then, we find the 200 most commonly occurring bigrams and retokenize so that each of them is kept together as one token. Further, we remove the less meaningful tokens such as stopwords, rare tokens and short words. Next, we stem all the unigrams that are entirely lowercase.<br>
<b>NOTE: We assume that appropriate input file named "stopwords_en.txt" is available in the same folder/directory as "Group149_ass2.ipynb". It has been placed in the submission.</b>

In [None]:
# get stopwords
stopwords = []
with open('./stopwords_en.txt') as f:
    stopwords = f.read().splitlines()

# perform all the pre-processing tasks in the required order
body_tokens_list = tokenize_body(papers_df)                        
body_tokens_list = handle_bigram(body_tokens_list, stopwords)
body_tokens_list = filter_tokens(body_tokens_list, stopwords)
body_tokens_list = perform_stemming(body_tokens_list)

#### 2.4.6 Create the Vocabulary Text File
Next, we get a flat list of all tokens appearing in all the documents so that we can build a vocabulary covering all the documents. Then, we check the length of the vocabulary. The number of unique tokens comes to around 2500. 

In [None]:
# flatten tokens into a simple list, form a set and sort it
token_list_1d = list(itertools.chain.from_iterable(body_tokens_list))
vocab = set(token_list_1d)
vocab_sorted = sorted(vocab)

In [None]:
len(vocab)

In this section, we form a dataframe from the vocabulary and use the index from this to form the structure of the `Group149_vocab.txt` file. We then write this into a text file and output it. 

In [None]:
# create a dataframe for getting index easily
vocab_df = pd.DataFrame(vocab_sorted, columns = ["token"])

# create text to be written into vocab text file
vocab_text = ""                                                        # instantiated empty
for index, row in vocab_df.iterrows():
    vocab_text = vocab_text + str(row['token']) + ":" + str(index) + "\n"   # access by column name rather than position

# write all the processed text to file
output_file_name = "Group149_vocab.txt"
output_file_handle = open(output_file_name, 'w+')
output_file_handle.write(vocab_text)
output_file_handle.close()
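
The `token:index` layout can also be produced without a dataframe, since the index is simply the position of the token in the sorted vocabulary. A small sketch with an invented vocabulary:

```python
# Hypothetical vocabulary; the real one comes from the processed paper bodies.
vocab = {"model", "data", "neural__network", "GPU"}

vocab_sorted = sorted(vocab)   # uppercase letters sort before lowercase ones
vocab_lines = "".join(f"{token}:{index}\n" for index, token in enumerate(vocab_sorted))

print(vocab_lines, end="")
```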

#### 2.4.7 Create the Count Text File
Here, we first join all the tokens with a space so that they can be fed into the `CountVectorizer` class from `sklearn`. The vectorizer splits these on whitespace and returns a sparse matrix with the counts of each of the ~2500 tokens in each of the documents. 

In [None]:
def tokenizeManual(txt):
    return txt.split()

In [None]:
# instantiate count vectoriser - vocabulary specified (sorted, so column order matches the vocab file), tokenizer is only spaces
vectorizer = CountVectorizer(vocabulary = vocab_sorted, tokenizer = tokenizeManual, lowercase = False)
# get a count of each of the tokens
data_features = vectorizer.fit_transform([' '.join(value) for value in body_tokens_list])

In this section, we form a string in which each line starts with the name of a document and then lists each word (by its index) that appears in that document together with the number of times it appears. Each new document starts on a new line. The string thus prepared is saved into a text file. 

In [None]:
text = ""                                                        # instantiated empty

# iterate through each document
for index, row in pdf_table.iterrows():
    # add file name (remove '.pdf')
    text = text + str(row['filename'][0:-4]) + ","
    
    # iterate through each word in the document; the column position in the
    # count matrix is the token's index in the sorted vocabulary
    for token_index, count in enumerate(data_features[index].toarray()[0]):
        if count > 0:                                            # include only if the word appears in the document
            text = text + str(token_index) + ":" + str(count) + ","
    text = text[:-1] + "\n"
    
    # just to keep track of progress during runtime
    if ((index+1) % 10 == 0):
        clear_output()
        print(str((index+1)/200*100) + "% completed")                

In [None]:
# write all the processed text to file
output_file_name = "Group149_count_vectors.txt"
output_file_handle = open(output_file_name, 'w+')
output_file_handle.write(text)
output_file_handle.close()
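
Each line of the count file pairs a document name with `index:count` entries for the tokens it contains. The sketch below builds one such line with `collections.Counter` rather than `CountVectorizer`; the file name, tokens and index mapping are invented for illustration.

```python
from collections import Counter

# Hypothetical vocabulary index and one processed document.
vocab_index = {"GPU": 0, "data": 1, "model": 2}
filename = "PP0001.pdf"
doc_tokens = ["model", "data", "model", "GPU", "model"]

counts = Counter(doc_tokens)
entries = [f"{vocab_index[t]}:{counts[t]}" for t in sorted(counts, key=vocab_index.get)]
line = filename[:-4] + "," + ",".join(entries)   # drop the ".pdf" extension

print(line)
```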

### 2.5 Task 2: Get Statistics
In this section, we find the ten most commonly featured authors and the ten most commonly featured terms in the titles and in the abstracts. These are then written into a CSV file to be exported. 

#### 2.5.1 Find Common Authors
The following function accepts the papers dataframe and gives a list of the ten most frequently occurring authors.<br>
A list of authors is made from the dataframe column and flattened to give one 1D list with the names of all the authors in the papers. This list is first sorted alphabetically and then a frequency distribution is made out of the names of authors. From the distribution, the ten most commonly appearing authors are put into a list and returned. <br>Sorting alphabetically before making the frequency distribution ensures that ties in the number of occurrences are settled by alphabetical order. 

In [None]:
def find_common_authors(papers_df):
    
    auth_list = papers_df['Authors'].tolist()                     # get list
    merged_list = list(itertools.chain.from_iterable(auth_list))  # flatten list
    merged_list = sorted(merged_list)                             # sort alphabetically first
    fd_authors = FreqDist(merged_list)                            # fd obtained
    top_author_list = [x[0] for x in fd_authors.most_common(10)]  # top ten most frequently occurring

    return top_author_list
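
The tie-breaking rule can also be stated explicitly instead of relying on the order in which names are inserted into the frequency distribution. The sketch below sorts by descending count and then by name; it is an alternative formulation with invented names, not the function above.

```python
from collections import Counter

# Hypothetical author lists for three papers.
auth_lists = [["Zed Young", "Amy Adams"], ["Zed Young", "Amy Adams"], ["Bob Brown"]]

counts = Counter(name for authors in auth_lists for name in authors)

# Sort by count (descending), then alphabetically to break ties.
ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
top_two = [name for name, _ in ranked[:2]]

print(top_two)
```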

#### 2.5.2 Find Common Terms in Titles
The following function accepts the papers dataframe and gives a list of the ten most frequently occurring terms in the paper titles.<br>
All words in the titles are converted to lowercase and the titles are tokenized (by word) using the provided regular expression tokenizer. A flat list is then obtained and stopwords are removed from it. This list is first sorted alphabetically and then a frequency distribution is made out of the terms. From the distribution, the ten most commonly appearing terms are put into a list and returned.
<br>Sorting alphabetically before making the frequency distribution ensures that ties in the number of occurrences are settled by alphabetical order. 

In [None]:
def find_common_title_terms(papers_df):
    
    low_title_list = papers_df['Title'].str.lower().tolist()                         # case normalise all words
    
    # regexp tokenizer provided is used 
    title_tokenizer = RegexpTokenizer(r"[A-Za-z]\w+(?:[-'?]\w+)?")
    title_tokens_2d_list = [title_tokenizer.tokenize(i) for i in low_title_list]

    title_tokens_list_1 = list(itertools.chain.from_iterable(title_tokens_2d_list))  # flatten list
    
    # stopwords provided is used 
    stopwords = []
    with open('./stopwords_en.txt') as f:
        stopwords = f.read().splitlines()
    
    # removing any stopwords 
    title_tokens_list = [w for w in title_tokens_list_1 if w not in stopwords]

    title_tokens_list = sorted(title_tokens_list)                                    # sort alphabetically first
    fd_title_tokens = FreqDist(title_tokens_list)                                    # fd obtained
    top_title_tokens_list = [x[0] for x in fd_title_tokens.most_common(10)]          # top ten most frequently occurring
    
    return top_title_tokens_list

#### 2.5.3 Find Common Terms in Abstract
The following function accepts the papers dataframe and gives a list of the ten most frequently occurring terms in the abstracts of the papers.<br>
First, the abstract is sentence segmented using NLTK's Punkt Sentence Tokenizer. Then, the first letter of each sentence is converted to lower case and the text is tokenized (by word) using the provided regular expression tokenizer. A flat list is then obtained and stopwords are removed from it. This list is first sorted alphabetically and then a frequency distribution is made out of the terms. From the distribution, the ten most commonly appearing terms are put into a list and returned.
<br>Sorting alphabetically before making the frequency distribution ensures that ties in the number of occurrences are settled by alphabetical order. 

In [None]:
def find_common_abstract_terms(papers_df):
    
    # NLTK's Punkt Sentence Tokenizer used
    sent_detector = nltk.data.load('tokenizers/punkt/english.pickle') 
    
    abstract_list = papers_df['Abstract']
    abstract_sent_list_2d = [sent_detector.tokenize(x.strip()) for x in abstract_list]    # sentence segmentation
    abstract_sent_list = list(itertools.chain.from_iterable(abstract_sent_list_2d))       # flatten list of sentences

    low_abstract_sent_list = [y[0].lower() + y[1:] for y in abstract_sent_list]           # lowercase first letter 
    
    # regexp tokenizer provided is used 
    abstract_tokenizer = RegexpTokenizer(r"[A-Za-z]\w+(?:[-'?]\w+)?")
    abstract_tokens_2d_list = [abstract_tokenizer.tokenize(i) for i in low_abstract_sent_list]

    abstract_tokens_list_1 = list(itertools.chain.from_iterable(abstract_tokens_2d_list)) # flatten list of words
    
    # stopwords provided is used 
    stopwords = []
    with open('./stopwords_en.txt') as f:
        stopwords = f.read().splitlines()
    
    # removing any stopwords 
    abstract_tokens_list = [w for w in abstract_tokens_list_1 if w not in stopwords]

    abstract_tokens_list = sorted(abstract_tokens_list)                                   # sort alphabetically first
    fd_abstract_tokens = FreqDist(abstract_tokens_list)                                   # fd obtained
    top_abstract_tokens_list = [x[0] for x in fd_abstract_tokens.most_common(10)]         # top ten most frequently occurring
    
    return top_abstract_tokens_list

#### 2.5.MAIN
The above functions are called to find the most frequently occurring items. A stats dataframe is built from these and then written into a CSV file. 

In [None]:
stats_df = pd.DataFrame([])                                                    # initialise empty dataframe

# add all required data into the dataframe with appropriate column headings
stats_df['top10_terms_in_abstracts'] = find_common_abstract_terms(papers_df)   
stats_df['top10_terms_in_titles'] = find_common_title_terms(papers_df)
stats_df['top10_authors'] = find_common_authors(papers_df)

# write all to csv file
stats_df.to_csv('Group149_stats.csv', index = False)                           # no indexing required

In [None]:
stats_df

The above are the statistics generated for the 200 documents.

## 3. Summary
In conclusion, this assessment task involved pre-processing of text files. This was achieved by reading the PDF table, downloading the PDF files from the links given and parsing these files to extract information from them. <br>
Further, text pre-processing was achieved through the use of various functions from the NLTK library. The processed text was then represented in a sparse form using count vectors. This was output into two text files: one containing all the tokens and their indices, and another containing the token indices along with the number of occurrences of each token in each of the documents. This formed Task 1 of the assignment.<br>
The second task was to generate certain statistics about the papers, such as the most common authors and the most commonly appearing terms in both the titles and the abstracts. This information was output to a .csv file. <br>
In all, through the assignment, we learned about using various packages such as pdfminer, NLTK and sklearn. We also learned about various concepts in text pre-processing and topic modelling, such as the trade-off between reducing vocabulary size and loss of information, the differences between identifiers and features, and how identifiers might not be good features. 
