<img src="https://upload.wikimedia.org/wikipedia/en/thumb/7/7c/Monash_University_logo.svg/1280px-Monash_University_logo.svg.png" style="height:100px">

# FIT5196 Assessment 1: Text Pre-processing

   #### Name: Subhasish Sarkar
   #### Student ID: 29819253
   #### Date:15/4/2019
   #### Environment: Python 3.6.8 and Jupyter notebook
   #### Libraries used: 
   * slate3k - To read PDF files as text in Python 
   * re - To use regular expressions in Python
   * pandas - To make use of Pandas dataframes for ease of use in computation
   * nltk - To use NLP in Python, like Stemming, tokenization, etc
   
   
   ##### The following code reads a PDF file containing tabular data, parses it, and writes two output files: `29819253_vocab.txt`, which lists the vocabulary with the index of each word, and `29819253_countVec.txt`, which lists, for each unit, the indices of the words that occur in it together with their frequencies.
   
   
   ## 1. Introduction
   The following code reads tabular data from a PDF file named `29819253.pdf`, which contains information about units offered 
   by a university. The table lists the unit code, synopsis, and outcomes of each unit. 
   
   This data is read using the `slate3k` library which is built on top of the `PDFMiner` library. It enables us to read tabular 
   data as text. This text is then converted to a Pandas Dataframe. 
   
   Regular Expressions were extensively used to extract and clean the data. 
   
   The extracted unit information in the dataframe was then pre-processed and converted
   into numerical representations that are suitable as input for recommender-system and information-retrieval algorithms.
   
   The provided data set contains information on 400 units crawled from Monash University. The PDF file contains a table 
   in which each row describes one unit: its unit code, synopsis, and outcomes. The task was to extract the information 
   for each unit and transform it into a vector space model.

## 2. Implementation of Code

#### Importing all required libraries

In [4]:
import slate3k
import re
import pandas as pd
import nltk
from nltk.tokenize import RegexpTokenizer

#### The PDF file is opened in read-binary mode, and taken into the variable `page`. This contains all the data from all the pages from the PDF as a string. 

In [5]:
with open('29819253.pdf','rb') as file:
    page = slate3k.PDF(file)

TypeError: __init__() missing 1 required positional argument: 'parser'

#### The unit numbers, synopsis and outcomes all need to be placed inside lists so that it is easy to use their indices to make a data frame. These will be used inside a dictionary which can be directly converted to a dataframe.

#### The text in `page` is split on `\n\n`, which separates the unit numbers, synopses and outcomes from one another. Each resulting chunk is tested with `re.search` against the pattern `^[A-Z0-9]*$`.


#### This pattern matches chunks made up entirely of capital letters and digits, which are the unit numbers. Every chunk that matches is appended to the `unit_numbers` list.

#### Any chunk that starts with `[` is an outcome, and is appended to `outcomes_list`.

#### If a chunk meets neither of the above conditions and is not one of:
* '\x0c' - Garbage value (form feed character)
* 'Title' - Header
* 'Synopsis' - Header
* 'Outcomes' - Header

it is synopsis text and is appended to `synopsis_list`.
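As a small illustration of the classification rule described above, the sketch below applies the same tests to a few made-up chunks. The sample strings and the helper name `classify` are assumptions for demonstration only. Note that, because the pattern uses `*`, an empty chunk also matches and would be treated as a unit number.

```python
import re

# Hypothetical chunks, as they might look after splitting a page on "\n\n"
sample_chunks = ["Title", "MKF2121", "The process of marketing research.",
                 "['provide a working knowledge of key concepts']", "\x0c", ""]

HEADERS = {"\x0c", "Title", "Synopsis", "Outcomes"}

def classify(chunk):
    """Return 'unit', 'outcome', 'skip' or 'synopsis' for one text chunk."""
    if re.search(r"^[A-Z0-9]*$", chunk):
        return "unit"          # only capitals and digits (or empty!)
    if chunk.startswith("["):
        return "outcome"       # outcomes are printed as a bracketed list
    if chunk in HEADERS:
        return "skip"          # table headers and form feeds
    return "synopsis"

labels = [classify(c) for c in sample_chunks]
print(labels)
```

If empty chunks turn out to be a problem, replacing `*` with `+` in the pattern would be one way to exclude them.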

In [3]:
unit_numbers = []
synopsis_list = []
outcomes_list = []

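headers = {'\x0c', 'Title', 'Synopsis', 'Outcomes'}

for page_text in page:
    # Each page is split into chunks: unit codes, synopses, outcomes and headers
    for chunk in page_text.split("\n\n"):
        if re.search(r"^[A-Z0-9]*$", chunk):
            unit_numbers.append(chunk)       # only capitals and digits: a unit code
        elif chunk.startswith('['):
            outcomes_list.append(chunk)      # outcomes start with '['
        elif chunk not in headers:
            synopsis_list.append(chunk)      # everything else is synopsis text

# The three lists must have the same length to build the dataframe
print(len(unit_numbers))
print(len(synopsis_list))
print(len(outcomes_list))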

200
200
200


#### The data frame is thus made from the dictionary structure containing the previously made lists

In [4]:
pdf_dataframe = pd.DataFrame({'Unit Codes': unit_numbers,'Synopsis': synopsis_list, 'Outcomes': outcomes_list})
pdf_dataframe

| | Outcomes | Synopsis | Unit Codes |
|---|---|---|---|
| 0 | ['provide a working knowledge of key concepts ... | The process of marketing research. Role of res... | MKF2121 |
| 1 | ['critically analyse and interpret basic resea... | Basic introduction to research design in manag... | MGX5000 |
| 2 | ['analyse and explain the policies, programs a... | In this unit, students will examine key patter... | ATS3223 |
| 3 | ['Apply knowledge and skills of research desig... | This unit provides an opportunity for high ach... | PBH3006 |
| 4 | ['Examine and implement health education strat... | This unit prepares third year students in the ... | PAR3022 |
| 5 | ['Evaluate the role of taxation in the modern ... | The primary function of taxation - to raise re... | LAW4243 |
| 6 | ['Understand the dynamical principles governin... | Dynamical meteorology concerns itself with the... | EAE5021 |
| 7 | ['critically reflect on, comprehensively analy... | The aim of the unit is to ensure that students... | AZA3184 |
| 8 | ['NA'] | This unit provides the student with the opport... | MAE4904 |
| 9 | ['recognise and evaluate the developing role o... | This unit introduces students to central conce... | ATS1208 |


In [5]:
nltk.download('punkt')  # fetch the Punkt sentence-tokenizer models (only needed once)

[nltk_data] Downloading package punkt to /home/ssar0014/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

#### The Punkt sentence tokenizer is loaded from `nltk.data`.

#### The outcomes and synopsis of each unit are read from the dataframe and cleaned with `re` to remove brackets, backslashes, newlines and slashes. The cleaned text is then split into sentences.

#### The first letter of each sentence is lowercased, and the normalized text is stored in a dictionary keyed by unit code.
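The sentence-start normalization can be sketched in isolation as follows. The sample sentences and the helper name `lowercase_first_letter` are made up for illustration; the substitution itself is a `re.sub` limited to a single replacement, so only the first alphabetic character of each sentence is changed.

```python
import re

# Hypothetical sentences, standing in for the output of a sentence tokenizer
sentences = ["Students will study Python.", "NLTK is used throughout."]

def lowercase_first_letter(sentence):
    """Lowercase only the first alphabetic character of a sentence."""
    return re.sub(r"([a-zA-Z])", lambda m: m.group(1).lower(), sentence, count=1)

normalized = [lowercase_first_letter(s) for s in sentences]
print(normalized)
```

Capital letters elsewhere in the sentence (here `Python` and the remaining letters of `NLTK`) are left untouched by this step.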

In [6]:
import nltk.data
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

In [33]:
dictionary_unitcode = dict()
for i in range(pdf_dataframe.shape[0]):
    # Select columns by name so the result does not depend on the column order
    string = pdf_dataframe['Outcomes'].iloc[i] + pdf_dataframe['Synopsis'].iloc[i]
    # capitalize() upper-cases the first character and lower-cases the rest
    string = string.capitalize()
    string = re.sub(r'\[', '', string)
    string = re.sub(r'\]', '', string)
    string = re.sub(r'\\','',string)
    string = re.sub(r'\n',' ',string)
    string = re.sub(r'/',' ',string)
    sentences = tokenizer.tokenize(string)
    # Lowercase the first letter of every sentence
    sentences = [re.sub('([a-zA-Z])', lambda x: x.groups()[0].lower(), sent, 1) for sent in sentences]
    # Join the sentences back into one string per unit code
    dictionary_unitcode[pdf_dataframe['Unit Codes'].iloc[i]] = ''.join(sentences)
dictionary_unitcode

{'ACB3021': "'apply traditional and contemporary performance measurement and control techniques to enable managers to measure and enhance organisational performance', 'analyse and make recommendations regarding the design of performance measurement and control systems', 'explain how behavioural implications are crucial for the effective design of performance measurement and control systems', 'apply critical thinking, problem solving and presentation skills to individual and or group activities dealing with performance measurement and control and demonstrate in an individual summative assessment task the acquisition of a comprehensive understanding of the topics covered by acb3021.'topics include the budget planning process including master budget preparation, budgeting standard costs and variance analysis, responsibility accounting, management by objectives and non-financial performance measurement, divisional performance, transfer pricing issues, program budgeting and other approaches

#### The following function tokenizes the text stored for each unit code in the dictionary above.
#### The regular expression tokenizer given in the specifications file is used to split each unit's text into word tokens.
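To see what this token pattern keeps together, the sketch below applies the same regular expression with `re.findall`, used here as a stand-in for `RegexpTokenizer`. The sample sentence is made up. The optional group allows at most one hyphen or apostrophe inside a token.

```python
import re

# Same pattern as the one passed to RegexpTokenizer in this notebook
TOKEN_PATTERN = r"\w+(?:[-']\w+)?"

text = "a well-argued, youth-focused unit; students' work isn't graded on day-to-day tasks"
tokens = re.findall(TOKEN_PATTERN, text)
print(tokens)
```

Words with a single internal hyphen or apostrophe stay whole, a trailing apostrophe is dropped, and `day-to-day` is split into `day-to` and `day` because the optional group can match only once.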

In [30]:
def tokenizeKeys(keys):
    key_tokenizer = RegexpTokenizer(r"\w+(?:[-']\w+)?")
    unit_tokens = key_tokenizer.tokenize(dictionary_unitcode[keys])
    return (keys,unit_tokens)

In [31]:
unit_tokens =  dict(tokenizeKeys(keys) for keys in dictionary_unitcode.keys())

In [32]:
unit_tokens

{'ACB3021': ['apply',
  'traditional',
  'and',
  'contemporary',
  'performance',
  'measurement',
  'and',
  'control',
  'techniques',
  'to',
  'enable',
  'managers',
  'to',
  'measure',
  'and',
  'enhance',
  'organisational',
  'performance',
  'analyse',
  'and',
  'make',
  'recommendations',
  'regarding',
  'the',
  'design',
  'of',
  'performance',
  'measurement',
  'and',
  'control',
  'systems',
  'explain',
  'how',
  'behavioural',
  'implications',
  'are',
  'crucial',
  'for',
  'the',
  'effective',
  'design',
  'of',
  'performance',
  'measurement',
  'and',
  'control',
  'systems',
  'apply',
  'critical',
  'thinking',
  'problem',
  'solving',
  'and',
  'presentation',
  'skills',
  'to',
  'individual',
  'and',
  'or',
  'group',
  'activities',
  'dealing',
  'with',
  'performance',
  'measurement',
  'and',
  'control',
  'and',
  'demonstrate',
  'in',
  'an',
  'individual',
  'summative',
  'assessment',
  'task',
  'the',
  'acquisition',
  'of

#### This code block prepares the input for generating bigram collocations. The first step is to concatenate the token lists of all units using the `chain.from_iterable` function, which yields a single flat list of all the words.

In [11]:
from itertools import chain
all_words = list(chain.from_iterable(unit_tokens.values()))
all_words

['provide',
 'students',
 'with',
 'experience',
 'in',
 'an',
 'area',
 'of',
 'research',
 'provide',
 'students',
 'with',
 'an',
 'insight',
 'into',
 'future',
 'opportunities',
 'in',
 'the',
 'area',
 'of',
 'research',
 'encourage',
 'and',
 'attract',
 'high',
 'quality',
 'students',
 'interested',
 'in',
 'pursuing',
 'a',
 'career',
 'in',
 'research',
 'the',
 'biotechnology',
 'or',
 'pharmaceutical',
 'industry',
 'or',
 'academia',
 'an',
 'understanding',
 'of',
 'some',
 'recent',
 'advances',
 'in',
 'research',
 'in',
 'pharmaceutical',
 'science',
 'or',
 'pharmacy',
 'practice',
 'and',
 'the',
 'literature',
 'within',
 'their',
 'area',
 'of',
 'research',
 'an',
 'appreciation',
 'of',
 'the',
 'need',
 'to',
 'define',
 'a',
 'hypothesis',
 'design',
 'an',
 'approach',
 'to',
 'test',
 'the',
 'hypothesis',
 'plan',
 'the',
 'experiments',
 'undertake',
 'the',
 'experiments',
 'analyse',
 'and',
 'interpret',
 'the',
 'data',
 'and',
 'write',
 'a',
 'resear

#### A bigram collocation finder is built from the flattened word list. Bigrams containing a word shorter than 3 characters are filtered out.
#### The top 200 bigrams, ranked by the PMI measure, are then selected for inclusion in the vocabulary.
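For intuition about the PMI ranking, the sketch below computes pointwise mutual information by hand on a toy word list. The corpus, the helper name `pmi`, and the choice of the total token count as the normalizer for both unigram and bigram probabilities are assumptions for illustration, not a reproduction of NLTK's internals.

```python
import math
from collections import Counter

# Hypothetical toy corpus; the real input would be the flattened token list
words = ["research", "project", "the", "unit", "the",
         "research", "project", "the", "role"]

word_counts = Counter(words)
bigram_counts = Counter(zip(words, words[1:]))   # adjacent word pairs
total = len(words)

def pmi(w1, w2):
    """log2( P(w1, w2) / (P(w1) * P(w2)) ), estimated from raw counts."""
    joint = bigram_counts[(w1, w2)]
    return math.log2(joint * total / (word_counts[w1] * word_counts[w2]))

print(pmi("research", "project"))   # words that always occur together
print(pmi("the", "research"))       # pair involving a frequent word
```

Pairs whose words almost always appear together score higher than pairs that involve a very frequent word, which is why PMI tends to favour rarer, more specific combinations.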

In [12]:
import nltk.collocations
bigram_measures = nltk.collocations.BigramAssocMeasures()
bigram_finder = nltk.collocations.BigramCollocationFinder.from_words(all_words)
bigram_finder.apply_word_filter(lambda w: len(w) < 3)  # drop bigrams containing words shorter than 3 characters
top_200_bigrams = bigram_finder.nbest(bigram_measures.pmi, 200) # Top-200 bigrams by PMI
top_200_bigrams

[('associated', 'with'),
 ('critically', 'evaluate'),
 ('this', 'unit'),
 ('students', 'will'),
 ('research', 'project'),
 ('unit', 'will'),
 ('the', 'role'),
 ('explain', 'the'),
 ('understand', 'the'),
 ('describe', 'the'),
 ('the', 'unit'),
 ('evaluate', 'the'),
 ('the', 'ability'),
 ('and', 'other'),
 ('the', 'development'),
 ('analyse', 'the'),
 ('identify', 'and'),
 ('from', 'the'),
 ('analysis', 'and'),
 ('with', 'the'),
 ('and', 'how'),
 ('communication', 'and'),
 ('analyse', 'and'),
 ('apply', 'the'),
 ('and', 'their'),
 ('and', 'apply'),
 ('for', 'the'),
 ('health', 'and'),
 ('skills', 'and'),
 ('design', 'and'),
 ('research', 'and'),
 ('the', 'research'),
 ('and', 'the')]

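#### The token list of each unit is re-tokenized with the multi-word expression tokenizer `MWETokenizer`, which joins every selected bigram into a single token such as `this_unit`.
#### The resulting tokens are stored in a collocations dictionary, keyed by unit code, for further use. 
#### `colloc_vocab` is a list of all the unique tokens (words and collocations) obtained from `unit_collocations`.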
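The effect of merging bigrams into single tokens can be illustrated with a small hand-written function. This is a simplified stand-in, not NLTK's implementation: the helper name `merge_bigrams` and the sample data are hypothetical, and the underscore separator is chosen to match the joined tokens seen in this notebook's output.

```python
def merge_bigrams(tokens, bigrams, separator="_"):
    """Greedily join adjacent token pairs that appear in `bigrams`."""
    bigram_set = set(bigrams)
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in bigram_set:
            merged.append(tokens[i] + separator + tokens[i + 1])
            i += 2          # both words were consumed by the bigram
        else:
            merged.append(tokens[i])
            i += 1
    return merged

tokens = ["critically", "evaluate", "the", "research", "project"]
bigrams = [("critically", "evaluate"), ("research", "project")]
merged_tokens = merge_bigrams(tokens, bigrams)
print(merged_tokens)
```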

In [13]:
#Multi Word Tokenizer
from nltk.tokenize import MWETokenizer
mwetokenizer = MWETokenizer(top_200_bigrams)
unit_collocations =  dict((keys, mwetokenizer.tokenize(values)) for keys,values in unit_tokens.items())
all_words_collocations = list(chain.from_iterable(unit_collocations.values()))
colloc_vocab = list(set(all_words_collocations))
len(colloc_vocab)  # number of unique tokens, collocations included

4013


#### A stop-words text file was provided, which is read through `stop_words_file`.
#### Its contents are split on `\n` into a list of stop words.
#### The list is made unique by putting it into a `set`, and the result is stored in the list `unique_stop_words_list`.

In [14]:
with open('stopwords_en.txt','r') as stop_words_file:
    stopwords_list = stop_words_file.read().split('\n')
unique_stop_words_list = list(set(stopwords_list))
unique_stop_words_list

['truly',
 'amongst',
 'beside',
 'few',
 'l',
 'liked',
 'otherwise',
 'becoming',
 'came',
 'entirely',
 'us',
 'seen',
 'second',
 'whereafter',
 'vs',
 'also',
 'out',
 'seeming',
 'que',
 'almost',
 'however',
 'eight',
 'itself',
 'shall',
 'through',
 'can',
 'thanks',
 'three',
 'whatever',
 'please',
 'r',
 'later',
 'alone',
 'five',
 'enough',
 'yourself',
 'still',
 'having',
 'quite',
 'throughout',
 'changes',
 'a',
 'you',
 "wasn't",
 'consequently',
 'wonder',
 'regards',
 'v',
 'twice',
 'overall',
 'unfortunately',
 'com',
 "you're",
 'm',
 "aren't",
 "where's",
 "isn't",
 'other',
 'ask',
 'under',
 'way',
 'think',
 'everyone',
 'meanwhile',
 'noone',
 'whereas',
 "you've",
 'indicates',
 'as',
 'first',
 'others',
 'allow',
 'qv',
 'four',
 'let',
 'contain',
 'therein',
 'anywhere',
 'saying',
 'said',
 'now',
 'both',
 'into',
 'furthermore',
 'much',
 'herself',
 'd',
 "they'll",
 'ok',
 'those',
 'moreover',
 'uses',
 'regardless',
 'down',
 'were',
 'anyhow',


#### The following function removes the stop words: for a given unit code, it returns the tokens of that unit (after collocation merging) that are not in the stop-word list. The results are collected into a dictionary keyed by unit code.

In [15]:
# Removes every token that appears in the stop-word list and returns
# the remaining tokens of the unit together with its unit code
def clean_stopwords(key):
    values_stop_words = [token for token in unit_collocations[key] if token.lower() not in unique_stop_words_list]
    return (key, values_stop_words)
cleaned_stop_words = dict(clean_stopwords(key) for key in unit_collocations)

In [16]:
cleaned_stop_words

{'ACB3021': ['apply',
  'traditional',
  'contemporary',
  'performance',
  'measurement',
  'control',
  'techniques',
  'enable',
  'managers',
  'measure',
  'enhance',
  'organisational',
  'performance',
  'analyse_and',
  'make',
  'recommendations',
  'design',
  'performance',
  'measurement',
  'control',
  'systems',
  'explain',
  'behavioural',
  'implications',
  'crucial',
  'for_the',
  'effective',
  'design',
  'performance',
  'measurement',
  'control',
  'systems',
  'apply',
  'critical',
  'thinking',
  'problem',
  'solving',
  'presentation',
  'skills',
  'individual',
  'group',
  'activities',
  'dealing',
  'performance',
  'measurement',
  'control',
  'demonstrate',
  'individual',
  'summative',
  'assessment',
  'task',
  'acquisition',
  'comprehensive',
  'understanding',
  'topics',
  'covered',
  'acb3021',
  'topics',
  'include',
  'budget',
  'planning',
  'process',
  'including',
  'master',
  'budget',
  'preparation',
  'budgeting',
  'standar

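#### The following function returns, for each unit code, the set of unique tokens left after stop-word removal. These sets are then flattened into one list, in which each token appears once for every unit that contains it.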

In [17]:
def clean_stop_words_sort(key):
    unique_stop_words_cleaned = set(cleaned_stop_words[key])
    return(key,unique_stop_words_cleaned)

cleaned_stpwrds_list =  dict(clean_stop_words_sort(key) for key,values in cleaned_stop_words.items())
removed_stop_words_list = list(chain.from_iterable(cleaned_stpwrds_list.values()))

#### The frequency distribution of this list is calculated as key-value pairs. Because each token was kept at most once per unit, the value for each token is its document frequency, i.e. the number of units in which it appears.

In [18]:
from nltk.probability import FreqDist
FreqDist(removed_stop_words_list)

FreqDist({'this_unit': 105, 'skills': 84, 'students': 82, 'and_the': 77, 'research': 73, 'develop': 73, 'the_unit': 73, 'issues': 70, 'understanding': 70, 'demonstrate': 65, ...})

#### This code block builds a list of the tokens to discard: those that appear in fewer than 5% or in more than 95% of the documents (units). The thresholds are applied to the document frequencies computed from `removed_stop_words_list`.
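The thresholding idea can be sketched on a toy example. The unit codes, token sets and the 30% lower threshold below are made up so that the effect is visible with only four documents; the notebook itself uses 5% and 95%.

```python
from collections import Counter

# Hypothetical per-unit token sets (one set per document)
unit_token_sets = {
    "U1": {"research", "skills", "ecology"},
    "U2": {"research", "skills"},
    "U3": {"research", "design"},
    "U4": {"research", "skills", "design"},
}

# Document frequency: in how many units does each token occur?
doc_freq = Counter(tok for toks in unit_token_sets.values() for tok in toks)

n_docs = len(unit_token_sets)
low, high = 0.30 * n_docs, 0.95 * n_docs   # toy thresholds for this example

dropped = sorted(t for t, df in doc_freq.items() if df < low or df > high)
print(dropped)
```

Here `ecology` is dropped for being too rare and `research` for appearing in every unit, while `skills` and `design` are kept.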

In [19]:
# Collect the tokens that appear in fewer than 5% or more than 95% of the documents,
# using the document frequencies computed from removed_stop_words_list
rare_tokens = []
freq_stop_words= FreqDist(removed_stop_words_list)

less_than_5 = int(0.05*len(unit_numbers))
more_than_95 = int(0.95*len(unit_numbers))

for key,value in freq_stop_words.items():
    if (value < less_than_5 or value > more_than_95):
        rare_tokens.append(key)

In [20]:
rare_tokens

['express',
 'labs',
 'senior',
 'reflections',
 'overview',
 'lawyers',
 'regulation',
 'haematological',
 'dependence',
 'innovation',
 'nazi',
 'childbirth',
 'special',
 'serialism',
 'carried',
 'generated',
 'inter',
 'extratropical',
 'phase',
 'contaminant',
 'disk',
 'illustrate',
 'causation',
 'lived',
 'advocacy',
 'tract',
 'individuals',
 "professionalism'building",
 'nature-based',
 'devise',
 'derivative',
 'artists',
 'buildings',
 'judge',
 'isolated',
 'forge',
 'comparative',
 'general',
 'vicarious',
 'increasing',
 'threads',
 'contained',
 'classifying',
 'youth-focused',
 'funding',
 'exciting',
 'modes',
 'ecology',
 'extended',
 'economies',
 'anatomy',
 'occurrence',
 'advertising',
 'young',
 'flexible',
 'execute',
 'packet-switching',
 'theorists',
 'layout',
 'europe',
 'anthropological',
 'well-argued',
 'discussed',
 'capacitance',
 'emphasise',
 'nutrients',
 'scope',
 'finishing',
 'cooperation',
 'appropriateness',
 'obey',
 'sustains',
 'family-base

#### The tokens flagged above are removed from the token list of every unit. The resulting dictionary, keyed by unit code, holds the remaining tokens and is used later for finding the word stems.

In [21]:
def clean_rare_tokens(key):
    cleaned_rare_list = []
    for token in cleaned_stop_words[key]:
        if token not in rare_tokens:
            cleaned_rare_list.append(token)
    return (key,cleaned_rare_list)
stem_dictionary =  dict(clean_rare_tokens(key) for key,values in cleaned_stop_words.items())

In [22]:
stem_dictionary

{'ACB3021': ['apply',
  'contemporary',
  'performance',
  'control',
  'techniques',
  'organisational',
  'performance',
  'analyse_and',
  'design',
  'performance',
  'control',
  'systems',
  'explain',
  'implications',
  'for_the',
  'effective',
  'design',
  'performance',
  'control',
  'systems',
  'apply',
  'critical',
  'thinking',
  'problem',
  'solving',
  'presentation',
  'skills',
  'individual',
  'group',
  'activities',
  'dealing',
  'performance',
  'control',
  'demonstrate',
  'individual',
  'summative',
  'assessment',
  'task',
  'acquisition',
  'comprehensive',
  'understanding',
  'topics',
  'covered',
  'topics',
  'include',
  'planning',
  'process',
  'including',
  'standard',
  'analysis',
  'management',
  'performance',
  'performance',
  'issues',
  'and_other',
  'approaches',
  'planning',
  'control'],
 'ACW3041': ['demonstrate',
  'understanding',
  'the_role',
  'modern',
  'society',
  'including',
  'professional',
  'ethical',
  'legal

#### The Porter Stemmer is used to find out the stems of all the words appearing in the `stem_dictionary`
#### These word stems are then appended to a list which is added to a dictionary containing the keys as the unit codes and the values as the stems inside `word_stem_dictionary`

In [23]:
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()  # one shared stemmer instance for all tokens

def word_stemmer(key):
    word_stems = [stemmer.stem(vals) for vals in stem_dictionary[key]]
    return(key,word_stems)
word_stem_dictionary =  dict(word_stemmer(key) for key,values in stem_dictionary.items())

word_stem_dictionary

{'ACB3021': ['appli',
  'contemporari',
  'perform',
  'control',
  'techniqu',
  'organis',
  'perform',
  'analyse_and',
  'design',
  'perform',
  'control',
  'system',
  'explain',
  'implic',
  'for_th',
  'effect',
  'design',
  'perform',
  'control',
  'system',
  'appli',
  'critic',
  'think',
  'problem',
  'solv',
  'present',
  'skill',
  'individu',
  'group',
  'activ',
  'deal',
  'perform',
  'control',
  'demonstr',
  'individu',
  'summ',
  'assess',
  'task',
  'acquisit',
  'comprehens',
  'understand',
  'topic',
  'cover',
  'topic',
  'includ',
  'plan',
  'process',
  'includ',
  'standard',
  'analysi',
  'manag',
  'perform',
  'perform',
  'issu',
  'and_oth',
  'approach',
  'plan',
  'control'],
 'ACW3041': ['demonstr',
  'understand',
  'the_rol',
  'modern',
  'societi',
  'includ',
  'profession',
  'ethic',
  'legal',
  'examin',
  'role',
  'emphasi',
  'describe_th',
  'framework',
  'and_appli',
  'plan',
  'process',
  'evid',
  'procedur',
  'for

#### The word stems are then made unique by using a set, and then sorted so that they are in alphabetical order

In [24]:
word_stem_list = list(chain.from_iterable(word_stem_dictionary.values()))
vocab_sorted_stem = sorted(set(word_stem_list))  # unique stems in alphabetical order
vocab_sorted_stem

['abil',
 'academ',
 'acquisit',
 'activ',
 'address',
 'advanc',
 'analys',
 'analyse_and',
 'analyse_th',
 'analysi',
 'analysis_and',
 'analyt',
 'and_appli',
 'and_how',
 'and_oth',
 'and_th',
 'and_their',
 'appli',
 'applic',
 'apply_th',
 'apprais',
 'approach',
 'area',
 'articul',
 'aspect',
 'assess',
 'associated_with',
 'australian',
 'awar',
 'base',
 'basic',
 'busi',
 'capac',
 'care',
 'case',
 'challeng',
 'clinic',
 'commun',
 'communication_and',
 'complex',
 'compon',
 'comprehens',
 'concept',
 'conceptu',
 'conduct',
 'contemporari',
 'content',
 'context',
 'contribut',
 'control',
 'cover',
 'critic',
 'critically_evalu',
 'critiqu',
 'cultur',
 'current',
 'data',
 'deal',
 'debat',
 'decis',
 'demonstr',
 'describ',
 'describe_th',
 'design',
 'design_and',
 'determin',
 'develop',
 'disciplin',
 'discuss',
 'divers',
 'econom',
 'effect',
 'element',
 'emphasi',
 'engag',
 'engin',
 'environ',
 'environment',
 'ethic',
 'evalu',
 'evaluate_th',
 'evid',
 'exa

#### Once the vocabulary is obtained, it is written to the file `29819253_vocab.txt`, one `word:index` entry per line, with indices starting from 0.
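A minimal sketch of the indexing step, using a made-up list of stems and an in-memory buffer in place of the real output file:

```python
import io

stems = ["skill", "abil", "research", "abil"]   # hypothetical stems

# Unique stems in alphabetical order, numbered from 0
vocab = {stem: index for index, stem in enumerate(sorted(set(stems)))}

buffer = io.StringIO()          # stands in for the real output file
for stem, index in vocab.items():
    buffer.write(f"{stem}:{index}\n")

print(buffer.getvalue())
```

Because the stems are sorted before numbering, a stem's index also reflects its alphabetical position in the vocabulary.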

In [25]:
vocab_dictionary = dict()

# Write one "word:index" line per stem and remember the index of each stem
with open("29819253_vocab.txt", "w") as fileWriter:
    for index, word in enumerate(vocab_sorted_stem):
        vocab_dictionary[word] = index
        fileWriter.write(word + ":" + str(index) + "\n")

#### For every unit, the occurrences of each stem are counted and recorded as `index:count` pairs, giving a sparse representation per unit code. 
#### The output is then written to the file `29819253_countVec.txt`.
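The line format of the sparse representation can be sketched as follows. The vocabulary, the stems and the helper name `count_vector_line` are hypothetical and serve only to show how one unit is turned into a `code,index:count,...` line.

```python
from collections import Counter

vocab = {"control": 0, "design": 1, "perform": 2}          # hypothetical vocabulary
unit_stems = {"ACB3021": ["perform", "control", "perform", "design"]}

def count_vector_line(unit_code, stems, vocab):
    """Format one unit as 'code,index:count,index:count,...'."""
    counts = Counter(stems)
    # Sort by vocabulary index so the pairs appear in a stable order
    pairs = sorted((vocab[s], c) for s, c in counts.items() if s in vocab)
    return unit_code + "".join(f",{i}:{c}" for i, c in pairs)

line = count_vector_line("ACB3021", unit_stems["ACB3021"], vocab)
print(line)
```

Only stems that actually occur in a unit are listed, which keeps each line short even when the vocabulary is large.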

In [28]:
from collections import Counter

# Write one line per unit: the unit code followed by "index:count" pairs
with open("29819253_countVec.txt", "w") as fileWriter_1:
    for unit_code, stems in word_stem_dictionary.items():
        line = unit_code
        # Count each stem once and list the pairs in alphabetical stem order
        for stem, count in sorted(Counter(stems).items()):
            line += "," + str(vocab_dictionary[stem]) + ":" + str(count)
        fileWriter_1.write(line + "\n")