# FIT5196 Assessment 2
#### Student Name: Haoheng Zhu
#### Student ID: 30376467

Date: 09/11/2019

Version: 1.4

Environment: Python 3.6.5 and Anaconda 4.3.0 (64-bit)

Libraries used:
* pandas 0.19.2 (for data frame, included in Anaconda Python 3.6) 
* re 2.2.1 (for regular expression, included in Anaconda Python 3.6) 
* nltk 3.2.2 (Natural Language Toolkit, included in Anaconda Python 3.6)
* nltk.collocations (for finding bigrams, included in Anaconda Python 3.6)
* nltk.tokenize (for tokenization, included in Anaconda Python 3.6)
* nltk.corpus (for stop words, not included in Anaconda, `nltk.download('stopwords')` provided)
* pdfminer.six (for extracting info from PDF, included in Anaconda Python 3.6)
* requests (for sending HTTP requests, included in Anaconda Python 3.6)
* sklearn (for data mining and analysis, included in Anaconda Python 3.6)
* os (operating system interface, included in Anaconda Python 3.6)
* tqdm (an extensible progress bar, included in Anaconda Python 3.6)
* vocab (for nltk processing, included in Anaconda Python 3.6)

## 1. Introduction
This assignment examines the skills needed to extract text from PDF files and process it with various nltk tools. The PDF files can be obtained by downloading them from the URLs listed in `Group102.pdf`. The tasks are the following:

1. Generate a sparse representation of the paper bodies and save it to:
  1.  Vocabulary index file
  2.  Sparse count vector file
2. Generate a `CSV` file named (stats.csv) containing three columns:
 1. Top 10 most frequent terms appearing in all __*Titles*__
 2. Top 10 most frequent __*Authors*__
 3. Top 10 most frequent terms appearing in all __*Abstracts*__
 
More details for each task will be given in the following sections.

## 2.  Import libraries 
 * __*Main*__ libraries are:
   * pdfminer
   * requests
   * __*nltk*__
   * __*sklearn*__
   * vocab

In [2]:
import collections
import io
import os
import re
from itertools import chain

import nltk.data
import pandas as pd
import requests
from tqdm.autonotebook import tqdm
tqdm.pandas()

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage

# download the Punkt model before loading the sentence detector
nltk.download('punkt')
sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')

from nltk.collocations import BigramCollocationFinder 
from nltk.metrics import BigramAssocMeasures
from nltk.probability import FreqDist
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import MWETokenizer, RegexpTokenizer, word_tokenize

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

from vocab import Vocab, UnkVocab

[nltk_data] Downloading package punkt to
[nltk_data]     /srv/home/angu0069/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


## 3. Convert PDF to TXT 

### 3.1 Define a function to convert PDF to text

__Functions to use__:
   * PDFResourceManager()
   * io.StringIO()
   * TextConverter()
   * PDFPageInterpreter()

In [3]:
def pdftable_2txt(fname): 
    
    #########################################
    # This function uses pdfminer to extract
    # the text from a PDF file and returns it
    # as a raw string
    #########################################
    
    output = io.StringIO()
    # in-memory text buffer that collects the extracted (Unicode) text
    
    manager = PDFResourceManager()
    # PDF resource manager: an object that stores shared resources
    
    txt_converter = TextConverter(manager, output, laparams=LAParams())
    # converter that writes the text of each processed page into output
    
    interpreter = PDFPageInterpreter(manager, txt_converter)
    # PDF interpreter that processes page contents and feeds the converter
    
    with open(fname, 'rb') as pdf_file: # open in binary read mode
        for page in PDFPage.get_pages(pdf_file):
            interpreter.process_page(page)
            # process one page at a time
    
    txt_converter.close()
    txt = output.getvalue()
    output.close()
    return txt 

### 3.2 Use pdftable_2txt to convert Group102.pdf

* store the data with __*dataframe*__ so that they can be processed by each row
  * open 'Group102.pdf' file
  * create dataframe with pandas
  * rename column names with 'filename' and 'url'
  * drop the original column name '0'

In [4]:
#Step 0: read the data table pdf to download file. 

data = pdftable_2txt("Group102.pdf")
# generate a txt file to store the strings extracted from Group102.pdf (the url)
file = open('Group102.txt', "w")
file.write(data)
file.close()
filepath = 'Group102.txt'


with open(filepath) as f:
    lineList = f.readlines()
# read the extracted text line by line

lineList = [line for line in lineList if line[0] == 'P']
    # keep only the url lines: each starts with a filename of the form PP\w+.pdf, followed by a space and the url

df = pd.DataFrame(lineList)
    # generate a dataframe to store the urls for faster process

df[0] = df[0].apply(lambda x: x.split(" "))
    # apply split() to each line in the series to separate filenames from urls

df['filename'] = df[0].apply(lambda x: x[0])
    # create a column 'filename' and store the filename
df['url'] = df[0].apply(lambda x: x[1].strip())
    # create a column 'url' and store the urls
    
df=df.drop(0,axis = 1)
    # drop the original column
os.remove('Group102.txt')
    # remove the txt file because it's no longer needed. All data are now in df
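
The line-splitting step above can be illustrated without pandas. This is a minimal sketch, assuming each URL line has the form `<filename> <url>`; the helper name `parse_url_lines` and the sample lines are hypothetical and are not taken from `Group102.pdf`.

```python
def parse_url_lines(lines):
    """Return (filename, url) pairs from lines shaped like 'PPxxxx.pdf <url>'."""
    records = []
    for line in lines:
        # url lines start with 'P'; header and blank lines are skipped
        if not line.startswith('P'):
            continue
        parts = line.split(' ')
        if len(parts) < 2:
            continue  # a filename without a url is ignored
        records.append((parts[0], parts[1].strip()))
    return records


# hypothetical sample input
sample = ['filename url\n', 'PP0001.pdf https://example.org/a.pdf\n', '\n']
print(parse_url_lines(sample))
```

Guarding on the number of parts avoids an `IndexError` on a malformed line, which the `x[1]` lookup in the cell above would raise.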

### 3.3 Download the required pdf

Main function to use:
 * check if the directory exists, using control structure with __os.path.exists()__
 * __makedirs__ --> create directory
 * __requests.get()__ --> retrieve data from specified resource

In [5]:
if not os.path.exists('data'): 
    os.makedirs('data') # make a directory to store all the downloaded pdf files
    for _, row in tqdm(df.iterrows(), total = len(df)): 
        # row is the pandas series that stores the 'filename' and 'url' of one paper
        response = requests.get(row['url'])
        # get is a request method that downloads data from the specified resource
        # the information retrieved is stored in a Response object, named response
        with open(os.path.join('data', str(row['filename'])),'wb') as f:
            # write the downloaded bytes to a pdf file named after its filename
            f.write(response.content)

### 3.4 Preparation

Load the stopwords from the provided **stopwords_en.txt**

In [6]:
#An empty list to store all the given stopwords
stopwords=[]

#Opening the given stopwords file and storing the words in the stopwords list
with open('stopwords_en.txt') as f:
    stopwords = f.read().splitlines()    

In [7]:
#pdfminer sometimes leaves typographic ligatures such as ﬀ (ff), ﬁ (fi) or ﬄ (ffl) in the extracted text,
#so this function translates them into plain ascii letters for more accurate data.
LATIN2ASCII = {
  0xfb00: 'ff',
  0xfb01: 'fi',
  0xfb02: 'fl',
  0xfb03: 'ffi',
  0xfb04: 'ffl',
  0x00df: 'ss',
  0xfb05: 'ft',
  0xfb06: 'st',
}

def latin2ascii(s):
    return ''.join(LATIN2ASCII.get(ord(c),c) for c in s )
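
The same kind of mapping can also be applied with `str.translate`, which accepts a dictionary keyed by code point. The sketch below is only an illustration: the name `replace_ligatures` and the sample string are hypothetical, and only the five f-ligatures are included.

```python
# code point -> ASCII replacement for the f-ligatures
LIGATURES = {
    0xfb00: 'ff',
    0xfb01: 'fi',
    0xfb02: 'fl',
    0xfb03: 'ffi',
    0xfb04: 'ffl',
}


def replace_ligatures(text):
    """Replace ligature characters with their plain-letter equivalents."""
    return text.translate(LIGATURES)


# '\ufb03' is the ffi ligature and '\ufb01' is the fi ligature
print(replace_ligatures('e\ufb03cient classi\ufb01er'))
```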

### 3.5 Convert all pdf to txt files
 * use the _**pdf2txt.py**_ command-line tool (shipped with pdfminer) to convert
 * makedirs() 

In [8]:
def pdf_to_text(file):
    
    #############################################
    # pdf_to_text will make directory text_data
    # to store all converted files if the text_
    # data directory does not exist. 
    #############################################
    
    if not os.path.exists('text_data'): #if the text_data dir not exist
        os.makedirs('text_data') #make one
        
    file_txt = file[:-3]+'txt'
        #create name for text file
    
    !pdf2txt.py -o text_data/$file_txt data/$file
        #convert to txt files

## 4. Sparse

### Sparse Guideline (Orders)
1. Tokenize 
 * sentence segmentation before tokenization
2. Normalize 
3. Bigram
4. Stopwords Removal
5. Remove Rare Tokens 
6. Remove Tokens Shorter Than 3 Characters
7. Stemming


## Step 1 - (E & A) : Tokenize - Normalize

Tokens must be normalized to lowercase except the capital tokens appearing in the middle of a sentence/line. (use sentence segmentation to achieve this)

In [9]:
def selective_lower(sentence): 
    
    ########################################################
    # This block of code works around a pdfminer problem:
    # depending on the software that produced the pdf file,
    # ligatures such as ff, fi and fl may not be extracted
    # properly, so they are replaced with plain letters here
    ########################################################
    
    sentence = re.sub(r'[\357\254\200]+', 'ff', sentence)
    sentence = re.sub(r'[\357\254\201]+', 'fi', sentence)
    sentence = re.sub(r'[\357\254\202]+', 'fl', sentence)
    sentence = re.sub('fffi ', 'fi', sentence)
    sentence = re.sub('ff ', 'ff', sentence)
    ########################################################
    
    
    aux_sentence = '' # initialise the output sentence
    cap_set = re.findall(r'(?!^)\b([A-Z]\w+)',sentence) 
        # collect the capitalised words that are not at
        # the beginning of the sentence
        
    # A word that is not in cap_set is lowercased before being
    # added to aux_sentence; a word in cap_set is added unchanged.
    for word in sentence.split(" "):
        if (word not in cap_set):
            aux_sentence += word.lower() + str(' ')
        else:
            aux_sentence += word + str(' ')
            
    return  aux_sentence
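
The normalisation rule can also be expressed with a single regular-expression substitution. The following is an alternative sketch rather than the function used in this notebook: `lower_except_mid_sentence_caps` is a hypothetical name, and it assumes the input is one sentence. Because it matches word by word instead of splitting on spaces, a capitalised word stays intact even when punctuation follows it.

```python
import re


def lower_except_mid_sentence_caps(sentence):
    """Lowercase every word except capitalised words that are not sentence-initial."""
    def repl(match):
        word = match.group(0)
        # keep words of 2+ characters that start with a capital and are not at position 0
        keep = match.start() > 0 and len(word) > 1 and word[0].isupper()
        return word if keep else word.lower()
    return re.sub(r"[A-Za-z]\w*", repl, sentence)


print(lower_except_mid_sentence_caps("The model uses Bayesian inference, and MCMC."))
```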

## Step E - Normalize

### 4.2 Define get_data function to extract **body** , __author__ , __title__ , __abstract__

Store the data in dictionaries, with **file_name** as the keys:
 * segment the text into sentences using Punkt's sent_detector
 * loop through the sentences to find the marker words
   * __*title*__ is before 'Authored by'
   * __*authors*__ are between 'Authored by:' and 'Abstract'
   * __*abstract*__ is between 'Abstract' and '1 Paper Body'
   * __*body*__ is between '1 Paper Body' and '2 References'
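
The marker-based slicing can be sketched on a single string with `str.partition`. This is a simplified illustration, not the `get_data` function defined in this notebook: it skips sentence segmentation, the name `split_sections` is hypothetical, and the sample text is an invented layout that only assumes the markers 'Authored by:', 'Abstract', '1 Paper Body' and '2 References' appear in that order.

```python
def split_sections(text):
    """Split a converted paper into title, authors, abstract and body by marker."""
    title, _, rest = text.partition('Authored by:')
    authors, _, rest = rest.partition('Abstract')
    abstract, _, rest = rest.partition('1 Paper Body')
    body, _, _ = rest.partition('2 References')
    return {
        'title': title.strip(),
        'authors': authors.strip().split('\n'),  # one author per line
        'abstract': abstract.strip(),
        'body': body.strip(),
    }


# hypothetical sample text
sample = ("A Study of Things\nAuthored by:\nAda One\nBob Two\n\n"
          "Abstract\n\nWe study things.\n\n1 Paper Body\n\nThings are studied.\n\n"
          "2 References\n\n[1] Someone.")
print(split_sections(sample))
```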

In [10]:
def get_data(directory): 
    
    body_dict={}   
    author_dict = {}
    title_dict = {}
    abstract_dict = {}
    for filename in tqdm(os.listdir(directory)):
        # listdir returns a list containing the names of the entry in the directory given the path
            filepdf = filename.replace('.txt','')
            
            with open(str(os.path.join(directory, filename))) as f: raw_body = latin2ascii(f.read())
            #open the txtfile and read data
            
            sentence_list = sent_detector.tokenize(raw_body.strip())
            #tokenized sentence
            
            #Get title dict#####################################
            for i in range(len(sentence_list)):
                if 'Authored by' in sentence_list[i]:
                    title = sentence_list[i][:sentence_list[i].index('Authored by')]
                    title = title.strip()
                    title = title.replace('\n','')
                    break

            title_dict[filepdf]= title.lower()
            # store the title in title_dict dictionary in lower case with the key 'filepdf'
            ####################################################
            
            #Extract author dict################################
            author = ''
            start = 0
            stop = 0
            for i in range(len(sentence_list)):        
                if 'Authored by' in sentence_list[i]:
                    #find the line that contains 'Authored by'
                    start = i 
                    #mark the beginning of the block
                    
                if 'Abstract' in sentence_list[i]:
                    #find the line that contains 'Abstract'
                    stop = i
                    #mark the ending of the block
                    break        
                    
            for i in range(start, stop+1):
                temp = sentence_list[i]
                if 'Authored by' in temp:
                    temp = temp[temp.index('Authored by:'):]
                        #Get the data from 'Authored by'
                    temp = temp.replace('Authored by:','').strip()
                        #Delete the substring 'Authored by'
                    
                if 'Abstract' in temp:
                    temp =  temp[:temp.index('Abstract')]
                        #Get the data to 'Abstract'
                author += temp + str(' ') 
                    #add to author string
            author_dict[filepdf] = author.strip().split('\n')
            # store the list of authors (one per line, in their original case) in author_dict with the key 'filepdf' 
            ####################################################
            
            
            #Get abstract dict##################################
            abstract = []
            start = 0
            stop = 0
            #Loop through the sentences
            for i in range(len(sentence_list)):
                if 'Abstract' in sentence_list[i]:
                    #Find the sentence containing 'Abstract'
                    start = i
                        #mark the beginning of the block
                if 'Paper Body' in sentence_list[i]:
                    #Find the sentence containing 'Paper Body'
                    stop = i
                        #mark the end of the block
                    break

            #This block slices out the abstract   
            for i in range(start, stop+1):
                temp = sentence_list[i]
                if 'Abstract' in temp:
                    temp = temp[temp.index('Abstract'):]
                        #Get the data from 'Abstract'
                    temp = temp.replace('Abstract','')
                        #Delete the term 'Abstract'
                    temp = temp.strip()
                if '1 Paper Body' in temp:
                    temp =  temp[:temp.index('1 Paper Body')]
                        #Get the data up to '1 Paper Body'
                abstract.append(selective_lower(temp))             
            abstract_dict[filepdf] = " ".join(abstract)
            # use a single space to join the sentences
            # store the abstract in abstract_dict dictionary in lower case with the key 'filepdf'
            ####################################################
            
            #extract body######################################
            body = []
            start = 0
            stop = 0
            for i in range(len(sentence_list)):
                if 'Paper Body' in sentence_list[i]:
                    #Find the sentence containing 'Paper Body'
                    start = i #mark the start of block   
                if '2 References' in sentence_list[i]:
                    #Find the sentence containing '2 References'
                    stop = i #mark the ending of block
                    break
            # this is to find the start and stop of Paper body              
            for i in range(start, stop+1):
                temp = sentence_list[i]
                if 'Paper Body' in sentence_list[i]:
                    temp = temp.replace('1 Paper Body\n\n','')
                        #if 'Paper Body' in sentence, then delete the term
                if '2 References' in sentence_list[i]:
                    temp = temp.replace('2 References\n\n','')
                        #if '2 References' in sentence, then delete the term
                body.append(selective_lower(temp)) 
                # Normalize the sentence, and then added to body
            body_string = " ".join(body)
            body_dict[filepdf] = body_string
            # use a single space to join the sentences
            # store the body in body_dict dictionary in lower case with the key 'filepdf' 
            ####################################################          
            
    return title_dict, author_dict, abstract_dict, body_dict

#### Create a directory text_data to store converted txt files

In [122]:
#create a folder to store converted files
import subprocess
import shutil

if os.path.exists('text_data'): #if the text_data exist, remove
    shutil.rmtree("text_data/") #remove
    
os.makedirs('text_data') #create new directory text_data

for file in tqdm(df['filename']): #convert all the pdf file to txt
    pdf_to_text(file)

pathtxt = 'text_data/' #define the path linked to text_data


### 4.3 Handle the sentence with leftover words that pdfminer doesn't process

In [11]:
def fix(sentence):
    
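    #########################################################
    # In some documents pdfminer does not capture "The"
    # reliably, so it is removed everywhere. This also turns
    # words such as "Theorem" into "orem"; the replacements
    # further down repair the common cases. The ligatures
    # ﬁ and ﬀ, which display in the text file but not after
    # being read by python, are replaced with plain letters.
    #########################################################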
    sentence = sentence.replace('The','')
    sentence = sentence.replace('ﬁ','fi')
    sentence = sentence.replace('ﬀ','ff')
    sentence = sentence.replace('- ','')
    sentence = sentence.replace('-\n','')
    sentence = sentence.replace('\n',' ') 
    sentence = sentence.replace('the- orem','theorem')
    sentence = sentence.replace(' orem ',' theorem')
    sentence = sentence.replace(' oretical ','theoretical')
    sentence = sentence.replace(' oretic ','theoretic')
    sentence = re.sub('[^A-Za-z-]+', ' ', sentence) 

    return sentence.strip()

## Step A - Tokenize

### Retrieve title, author, abstract, body 
* call the User Defined Function get_data function
* use UDF fix function to process the __paper body__

In [13]:
#Four dictionaries are used to store the data, with the filename without '.pdf' as the key
title_dict, author_dict, abstract_dict, raw_body_dict = get_data(pathtxt)
    # extract the title, authors, abstract and body of every paper into their own dictionaries
body_dict = {}
for i in list(raw_body_dict.keys()):
    body_dict[i] = fix(raw_body_dict[i])


In [36]:
#Step A
tokenizer = RegexpTokenizer(r"[A-Za-z]\w+(?:[-'?]\w+)?")
    # the required RE expression

def tokenize(body):
    tokenized_body = tokenizer.tokenize(body_dict[body]) #tokenizing the string
    return (body, tokenized_body) # return a tuple of file name and a list of tokens

#calling the tokenize method in a loop for all the elements in the dictionary
tokenized_body = dict(tokenize(body) for body in body_dict.keys()) 
    # tokenized_body has file_name as key and tokenized_words as values
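
To see what the required pattern accepts, the sketch below applies the same regular expression with `re.findall` from the standard library, used here as a stand-in for `RegexpTokenizer`. The sample sentence is hypothetical.

```python
import re

# same pattern as the tokenizer above: a letter, one or more word characters,
# then optionally one -, ' or ? followed by more word characters
TOKEN_PATTERN = r"[A-Za-z]\w+(?:[-'?]\w+)?"

sample = "A state-of-the-art model isn't slow"
tokens = re.findall(TOKEN_PATTERN, sample)
print(tokens)
# → ['state-of', 'the-art', 'model', "isn't", 'slow']
```

Single-letter words such as "A" are never matched, and a hyphenated chain is split after its first joined pair because the optional group can match only once.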

## Step 2 G: Bigrams

First 200 meaningful bigrams (i.e., collocations), based on highest total frequency in the corpus, must be extracted and included in your tokenization process. Bigrams should not include context-independent stopwords as part of them and they should be separated using double underscore i.e. __ (example: artificial__intelligence)

   * initiate nltk.collocations.BigramAssocMeasures()
   * use nltk.collocations.BigramCollocationFinder.from_words() to collect the bigrams in all_tokens
   * use apply_word_filter() to drop bigrams containing a word shorter than 3 characters or a stopword (compared in lowercase)
   * use ngram_fd.most_common(200) to take the 200 most frequent bigrams
   * use MWETokenizer() to merge multi words into a single token
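
The idea behind these steps can be sketched with the standard library alone. The code below uses `collections.Counter` in place of nltk's collocation finder and a simple left-to-right merge in place of `MWETokenizer`; the function names, the toy token list and the toy stopword set are all hypothetical.

```python
from collections import Counter


def top_bigrams(tokens, stopwords, n):
    """Return the n most frequent adjacent pairs without short words or stopwords."""
    counts = Counter(zip(tokens, tokens[1:]))
    kept = Counter({
        pair: freq for pair, freq in counts.items()
        if all(len(w) >= 3 and w.lower() not in stopwords for w in pair)
    })
    return [pair for pair, _ in kept.most_common(n)]


def merge_bigrams(tokens, bigrams, sep='__'):
    """Join each occurrence of a selected bigram into one token, scanning left to right."""
    selected = set(bigrams)
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in selected:
            merged.append(tokens[i] + sep + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


# toy data
toy_tokens = "neural network models use a neural network".split()
best = top_bigrams(toy_tokens, stopwords={'a', 'use'}, n=1)
print(best)
print(merge_bigrams(toy_tokens, best))
```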

In [37]:
#step G
#Finding the top 200 bigrams
all_tokens = list(chain.from_iterable(tokenized_body.values()))
    #collect all tokens from tokenized_body
bigram_measures = nltk.collocations.BigramAssocMeasures()
    # association measures for scoring bigrams (not needed for the plain frequency ranking below)
    # collocations are two-word expressions that frequently occur together (collocated)
bigram_finder = nltk.collocations.BigramCollocationFinder.from_words(all_tokens)
    # count every pair of adjacent tokens in bigram_finder
bigram_finder.apply_word_filter(lambda w: len(w) < 3)
    # drop the bigrams that contain a word shorter than 3 characters
bigram_finder.apply_word_filter(lambda w: w.lower() in stopwords)
    # drop the bigrams that contain a stopword (compared in lowercase)
    
_list = bigram_finder.ngram_fd.most_common(200) # Top 200 bigrams

bigrams_list = [tup for tup, freq in _list] # save to a list

#Preserving these bigrams and putting it back in the dictionary, along with the unigrams
mwetokenizer = MWETokenizer(bigrams_list,separator='__')

#colloc_body is a dictionary that contains both the bigrams as well as the unigrams
# the keys are the file names, the values are the token lists after merging the bigrams
colloc_body =  dict((body, mwetokenizer.tokenize(data)) for body,data in tokenized_body.items())
all_tokens = list(chain.from_iterable(colloc_body.values()))

In [38]:
#check point 1 after tokenizing and obtaining bigrams:
len(set(all_tokens))

24954

## Step 3 B Stopwords Removal

The context-independent and context-dependent (with the threshold set to 95%) stop words must be removed from the vocab. The context-independent stop words list (i.e., stopwords_en.txt) provided in the zip file must be used. In this step, we will remove the context-independent stopwords.

* using provided stopwords to check whether a token needs to be removed
  * use control structure:
  if token.lower() not in stopwords

In [39]:
#Step B #removing stopwords
for file in list(colloc_body.keys()):            
    colloc_body[file] = [token for token in colloc_body[file] if token.lower() not in stopwords]
        # remove stop words from each file
all_tokens = list(chain.from_iterable(colloc_body.values()))
    # store all the tokens from stopwords removed tokenized_body
    # chain.from_iterable is an alternate constructor for chain()

In [40]:
#check point 2 after removing stopwords:
len(set(all_tokens))

24322

## Step 4 : D Rare Tokens Removal
Rare tokens (appearing in fewer than 3% of documents) must be removed from the vocab, together with the context-dependent stopwords, i.e. tokens appearing in at least 95% of documents. We therefore calculate the number of documents in which each word appears and use it to apply both thresholds.

To get the document frequency, we need to:
 * Obtain unique tokens from colloc_body
   * utilize set() to obtain the unique tokens of each document
 * Obtain word frequency distribution across documents
   * use FreqDist() from nltk library
 * Calculate document frequency
   * $\text{document frequency} = \frac{\text{number of documents containing the word}}{\text{total number of documents}}$
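
The thresholding can be illustrated on a toy corpus. This sketch uses `collections.Counter` in place of `FreqDist`; the function name, the four toy documents and the raised lower threshold (0.3 instead of 0.03, so that the effect is visible with only four documents) are assumptions made for the example.

```python
from collections import Counter


def filter_by_document_frequency(docs, low=0.03, high=0.95):
    """Drop tokens whose document frequency is below `low` or at least `high`."""
    n_docs = len(docs)
    doc_count = Counter()
    for tokens in docs.values():
        doc_count.update(set(tokens))  # count each token once per document
    drop = {
        word for word, count in doc_count.items()
        if count / n_docs < low or count / n_docs >= high
    }
    return {name: [t for t in tokens if t not in drop] for name, tokens in docs.items()}


# toy corpus
toy_docs = {
    'd1': ['model', 'kernel', 'rare'],
    'd2': ['model', 'kernel'],
    'd3': ['model', 'graph'],
    'd4': ['model', 'graph'],
}
print(filter_by_document_frequency(toy_docs, low=0.3, high=0.95))
```

Here 'model' appears in all four documents (frequency 1.0) and 'rare' in one (0.25), so both are dropped, while 'kernel' and 'graph' (0.5 each) are kept.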

In [41]:
#Step D # removing rare tokens (3%) and tokens appearing in at least 95% of documents
words_per_doc = list(chain.from_iterable([set(value) for value in colloc_body.values()]))
    # one entry per (document, unique token) pair
wpd = FreqDist(words_per_doc)
    # wpd[word] is the number of documents that contain the word
n_docs = len(colloc_body)
    # total number of documents (200 documents)
word_to_remove = set()
#collect the words to remove based on the 95% / 3% criteria
for word, count in wpd.items():
    doc_freq = count / n_docs
        # document frequency = documents containing the word / total documents
    if doc_freq < 0.03 or doc_freq >= 0.95:
        word_to_remove.add(word)
#remove the 95% / 3% words from colloc_body
for file in tqdm(list(colloc_body.keys())):
    colloc_body[file] = [token for token in colloc_body[file] if token not in word_to_remove]
        # rebuild each token list without the words in word_to_remove
all_tokens = list(chain.from_iterable(colloc_body.values()))


In [42]:
#check point 4 after removing '9503' tokens
len(set(all_tokens))

4513

## Step 5 F Two Letter Words Removal
Tokens with the length less than 3 should be removed from the vocab

In [43]:
#Step F cleaning 2 letter words
for file in list(colloc_body.keys()):            
    colloc_body[file] = [token for token in colloc_body[file] if len(token) > 2]
        # remove tokens that are shorter than 3 characters
    
all_tokens = list(chain.from_iterable(colloc_body.values()))

In [44]:
#check point 5 after removing '2 letter' tokens
len(set(all_tokens))

4188

## Step 6 C Unigram Stemming
Unigram tokens should be stemmed using the Porter stemmer (be careful: stemming performs lower casing by default).
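
Because the stemmer lowercases its input, the original capitalisation has to be restored afterwards. The sketch below shows one way to do that. It is not the cell used in this notebook: `stem_keep_case` is a hypothetical helper, and `toy_stem` is a deliberately crude stand-in for the Porter stemmer so that the example runs without nltk.

```python
def stem_keep_case(token, stem):
    """Stem a unigram with `stem` and restore its capitalisation; leave bigrams alone."""
    if '__' in token:
        return token  # bigrams are not stemmed
    stemmed = stem(token)
    if len(token) > 1 and token[0].isupper() and token[1].isupper():
        return stemmed.upper()  # all-capital words such as acronyms
    if token[:1].isupper():
        return stemmed[:1].upper() + stemmed[1:]  # capitalised words
    return stemmed


def toy_stem(word):
    """Stand-in stemmer: lowercase the word and strip one trailing 's'."""
    word = word.lower()
    return word[:-1] if word.endswith('s') else word


for tok in ['Models', 'GPUs', 'models', 'neural__networks']:
    print(tok, '->', stem_keep_case(tok, toy_stem))
```

With nltk available, `PorterStemmer().stem` could be passed as the `stem` argument instead of `toy_stem`.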

In [45]:
#Step C Stemming

#Using the porterstemmer method
ps = PorterStemmer()

#An empty string to store the content of a particular body
strcontent=''

#An empty dictionary to append the stemmed data back 
stemmed_dict=dict()

#Looping to stem each value in the dictionary
for key,body in tqdm(colloc_body.items()):  
    # key is the key in collec_body, body is the list of corresponding value to its key
    
    for word in body:
        if '__' in word: # if __ is in the word, it's bigram
            strcontent=strcontent+ ' ' + word
        else: # if not bigram
            if (word[0].isupper() and word[1].islower()):
                # capitalised words like Thompson or Mary
                
                word = ps.stem(word) # stem the word (the Porter stemmer lowercases it)
                word = word[0].upper() + word[1:] # restore the leading capital
                strcontent= strcontent + ' ' + word # store and concatenate it with single space
                
            elif (word[0].isupper() and word[1].isupper()):
                # all-capital words such as acronyms
                
                word = ps.stem(word) # stem the word (the Porter stemmer lowercases it)
                word = word.upper() # restore the capitals
                strcontent=strcontent+ ' ' + word # store and concatenate it with single space
            else:
                strcontent=strcontent+ ' ' + ps.stem(word)    # store and concatenate it with single space
    
    #Assigning the string to the respective key
    stemmed_dict[key]=strcontent
    
    #Again emptying the string to store the next body
    strcontent=''

#Loop to again word tokenize each body in the dictionary and assigning it back to its body number 
for key,body in tqdm(stemmed_dict.items()):
    stemmed_dict[key]=word_tokenize(body)


In [46]:
# Check point 7 after stemming
all_tokens = list(chain.from_iterable(stemmed_dict.values()))
len(set(all_tokens)) 

2449

## 5. Statistical Calculation and Analysis

In this step, we aim to derive the statistics, including the `count vectors` and the `top 10 most frequent` terms and authors.

To achieve the goals, we need to:
 * Obtain unique tokens set
   * utilize set() to get unique tokens
 * Obtain vocabulary count
   * use vocab library
   * use word2index() to derive word index of the document
 * Generate vocabulary vector output
   * use FreqDist() to get the vocabulary distribution
   * generate a dictionary to assign the word index and distribution to file_name
 * Top 10 frequency
   * Obtain word distribution across document, FreqDist()


### 5.1 Unique tokens 

Extract the unique tokens from all_tokens in order to perform vector value calculations

In [47]:
uni_tokens = []
# store and list the stemmed body
for file in list(stemmed_dict.keys()):
    for word in stemmed_dict[file]:
        uni_tokens.append(word)
        # each item in uni_tokens is the stemmed word
        
vocab = list(set(uni_tokens))

In [48]:
bigrams = [str(tup[0] + str('__') + tup[1]) for tup in bigrams_list]
    # adding 200 bigrams back to the vocab
for tok in bigrams:            
    if tok not in vocab:
        vocab.append(tok)

In [49]:
vocab.sort()
    # get the unique tokens by using set, and sort them for better access

### 5.2 Vocabulary Count
   * use vocab library
   * use word2index() to derive word index of the document
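
What a vocabulary index looks like can be sketched without the vocab library. In this illustration, `build_vocab_index` is a hypothetical helper that numbers the sorted unique tokens from 0; the numbering produced by the library's `word2index()` may differ.

```python
def build_vocab_index(tokens):
    """Map each unique token to an integer index, in sorted order."""
    return {word: index for index, word in enumerate(sorted(set(tokens)))}


def format_vocab_lines(vocab_index):
    """Render the index as 'word:index' lines, the layout written to the vocab file."""
    return [word + ':' + str(index) for word, index in vocab_index.items()]


# toy tokens
toy_index = build_vocab_index(['model', 'graph', 'model', 'kernel'])
print(format_vocab_lines(toy_index))
```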

In [51]:
#vocabulary index file output
from vocab import Vocab, UnkVocab
import collections
v = Vocab()
vocab_index = v.word2index(vocab, train=True)
    # create an index for each word in the vocabulary

vocab_serial = dict(zip(vocab,vocab_index))
    # generate a dictionary that has the word as key and its vocab_index as value

vocab_serial = collections.OrderedDict(sorted(vocab_serial.items()))
    #  sort the vocab_serial dictionary by word

with open('Group102_vocab.txt', "w", encoding = 'utf-8') as file:
    # write the sorted vocab_serial dictionary into Group102_vocab.txt
    for word, index in vocab_serial.items():
        file.write(str(word) + ':'+ str(index) + '\n')
        # one line per word in the format 'word:index'

### 5.3 Vocabulary Vector
   * use FreqDist() to get the vocabulary distribution
   * generate a dictionary to assign the word index and distribution to file_name
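
One line of the sparse count vector file has the form `filename:index:count,index:count,...`. The sketch below builds such a line with `collections.Counter` in place of `FreqDist`; the helper name, the toy tokens and the toy index are hypothetical.

```python
from collections import Counter


def format_count_vector(filename, tokens, vocab_index):
    """Return 'filename:index:count,...' for the tokens of one document."""
    counts = Counter(tokens)
    entries = [str(vocab_index[word]) + ':' + str(count) for word, count in counts.items()]
    return filename + ':' + ','.join(entries)


# toy document and toy vocabulary index
line = format_count_vector('PP0001', ['model', 'graph', 'model'], {'graph': 0, 'model': 1})
print(line)
```

Only the words that occur in the document are listed, which is what makes the representation sparse.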

In [52]:
vdict = {}
for file, body in stemmed_dict.items():
    # file_name is the key, body is the stemmed tokens
    vdict[file] = FreqDist(body)
        # file_name is the key, word distribution is the value

In [53]:
#write the count vectors
with open('Group102_count_vectors.txt', "w", encoding = 'utf-8') as file:
    for filename in vdict.keys():
        vector_list = []
        for word in vdict[filename]:
            # iterate over the distinct words of this document
            vector_list.append(str(vocab_serial[word]) + ':' + str(vdict[filename][word]))
                # vocab_serial[word] is the vocabulary index of the word
                # vdict[filename][word] is the count of the word in this document
        # join once per document, after all of its entries have been collected
        file.write(str(filename) + ':' + ",".join(vector_list) + '\n')

### 5.4 Top 10 Frequency

To get the top 10 frequency, need to:
 * Obtain word frequency distribution across documents
   * use FreqDist()
 * Find top 10 most common 
   * use most_common() method by nltk
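
The counting step can be tried on toy data with `collections.Counter`, whose `most_common()` method behaves like the one on `FreqDist`. The helper name `top_terms` and the toy token lists are hypothetical.

```python
from collections import Counter


def top_terms(token_lists, k=10):
    """Return the k most frequent tokens across all token lists as (token, count) pairs."""
    counts = Counter(token for tokens in token_lists for token in tokens)
    return counts.most_common(k)


# toy token lists, one per document
toy_lists = [['learning', 'model'], ['learning', 'data', 'model'], ['learning']]
print(top_terms(toy_lists, k=2))
# → [('learning', 3), ('model', 2)]
```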

In [54]:
#tokenize with the same regex as the body
def tokenize_abstract(abstract):
    tokenized_abstract = tokenizer.tokenize(abstract_dict[abstract]) #tokenizing the string
    return (abstract, tokenized_abstract) # return a tuple of file name and a list of tokens

tokenized_abstract = dict(tokenize_abstract(filepdf) for filepdf in abstract_dict.keys()) 
    # the keys are the pdf file names, the values are the tokenized abstract contents

# remove stopwords from tokenized_abstract
for file in  tqdm((list(tokenized_abstract.keys()))):
    tokenized_abstract[file] = [token for token in tokenized_abstract[file] if token.lower() not in stopwords]


In [55]:
all_abstract_tokens = list(chain.from_iterable(tokenized_abstract.values()))
    # list the tokens in tokenized_abstract
abstract_wpd = FreqDist(all_abstract_tokens)
    # get the distribution of abstract tokens
    
top_10_abstract = abstract_wpd.most_common(10) # extract the most common 10 tokens
top_10_abstract

[('learning', 200),
 ('data', 159),
 ('algorithm', 129),
 ('model', 126),
 ('show', 122),
 ('problem', 105),
 ('models', 105),
 ('method', 103),
 ('algorithms', 79),
 ('results', 73)]

In [56]:
#tokenize with the same regex as the body
def tokenize_title(title):
    tokenized_title = tokenizer.tokenize(title_dict[title]) #tokenizing the string
    return (title, tokenized_title) # return a tuple of file name and a list of tokens

tokenized_title = dict(tokenize_title(title) for title in title_dict.keys()) 
    # the keys are the pdf file names, the values are the tokenized title contents
    
# remove stopwords from tokenized_title    
for file in  tqdm((list(tokenized_title.keys()))):
    tokenized_title[file] = [token for token in tokenized_title[file] if token not in stopwords]
    
all_title_tokens = list(chain.from_iterable(tokenized_title.values()))
    # list the tokens in tokenized_title
title_wpd = FreqDist(all_title_tokens)
    # get the distribution of title tokens
    
top_10_title = title_wpd.most_common(10) # extract the most common 10 tokens
top_10_title


[('learning', 46),
 ('models', 15),
 ('stochastic', 11),
 ('neural', 10),
 ('networks', 9),
 ('bayesian', 8),
 ('deep', 8),
 ('model', 7),
 ('probabilistic', 7),
 ('linear', 7)]

In [57]:
# list all the authors
author_list = []
for file, authors in author_dict.items():
    # file is the file_names in author_dict's keys
    # authors are the corresponding values of its keys
    
    for author in authors:
        author_list.append(author)
        # append the authors to the list
        
author_list = [author for author in author_list if author != '']
    # remove empty strings

In [58]:
author_wpd = FreqDist(author_list)
    # get the document frequency distribution of the authors
top_10_author= author_wpd.most_common(10)
    # get the most common 10 authors
top_10_author

[('Prateek Jain', 3),
 ('Michael I. Jordan', 3),
 ('Lawrence Carin', 3),
 ('Kristen Grauman', 3),
 ('Yee W. Teh', 3),
 ('Ron Meir', 3),
 ('Klaus-Robert M?ller', 3),
 ('Razvan Pascanu', 2),
 ('Peter Battaglia', 2),
 ('Sewoong Oh', 2)]

## 6. Write stats into CSV

In [59]:
stat = {}
stat['top10_terms_in_titles'] = [key[0] for key in top_10_title]
    # top_10_title is a list of (term, count) tuples
    # keep only the terms as the values of the key 'top10_terms_in_titles'
    
stat['top10_authors'] = [key[0] for key in top_10_author]
    # assign the authors as values to the key 'top10_authors'
    
stat['top10_terms_in_abstracts'] = [key[0] for key in top_10_abstract]
    # assign the abstract terms as values to the key 'top10_terms_in_abstracts' 
    
stat_table = pd.DataFrame.from_dict(stat)
    # generate a dataframe of statistical information
stat_table.to_csv('Group102_stats.csv',index=False, encoding='utf-8')
    # save the dataframe to a csv file without the index column

# Summary

The main challenge in this assignment was working out a proper order in which to tokenize and stem the data. We tried several orderings of the seven given steps, and each produced different results. Moreover, according to its documentation, pdfminer <i>"cannot safely concerted to Unicode all the characters."</i> The extraction quality therefore depends largely on the software that created each PDF file.

https://buildmedia.readthedocs.org/media/pdf/pdfminer-docs/latest/pdfminer-docs.pdf

A further challenge is that the sample output provided is not correct. According to Islam Nassar in the Unit Announcement (12/09/2019 7:55 AM): <i> "sample output released (sample_vocab, sample_count_vectors, and sample_stats) are only to be used as reference as to how the output should look like. You should not look too much into them and try to infer how to go about text preprocessing as it is not entirely accurate. The files have been generated randomly and therefore they don't really represent any ground truth."</i>


Main learning:
 * pdfminer functions such as
   * PDFResourceManager()
   * io.StringIO()
   * TextConverter()
   * PDFPageInterpreter()
 * nltk functions such as
   * FreqDist()
   * nltk.collocations.BigramAssocMeasures()
   * nltk.collocations.BigramCollocationFinder.from_words()
   * apply_word_filter()
   * nbest()
   * MWETokenizer()
 * Vocab
   * word2index()

## Reference

- Steve B. *NLTK 3.2.5 Documentation* Retrieved from https://buildmedia.readthedocs.org/media/pdf/nltk/latest/nltk.pdf
- NLTK Project. (2015). *Collocations*. Retrieved from http://www.nltk.org/howto/collocations.html
- Levia3 pdfminer *Release 0.0.1* (2017) Retrieved from https://buildmedia.readthedocs.org/media/pdf/pdfminer-docs/latest/pdfminer-docs.pdf
- Vocab 0.0.4 *Project Description* Retrieved from https://pypi.org/project/vocab/
- Scikit-learn *Machine learning in python* Retrieved from https://scikit-learn.org/stable/