# Task B: Text Pre-Processing
#### Name: Rohit Sanjay Tapas

Environment: Python 3 and Jupyter notebook

Libraries used: 
* tika - used to parse the pdf file into txt format
* nltk - Natural Language Toolkit (tokenizers, Porter stemmer, collocations)
* re - regular expressions (part of the Python standard library)
* itertools - used to flatten the nested token lists

## 1. Introduction


* The aim of this task is to convert the extracted raw text into a structured form, i.e. to turn the textual data into a numerical representation.
* The dataset describes the units offered at Monash University.
* The data provided per unit includes:
  1. Unit Code
  2. Synopsis
  3. Outcomes
* The task is to extract this information for each unit and generate a vector-space model.

## 2.  Import libraries 

In [11]:
from tika import parser
import re
import nltk
from nltk.collocations import *
from itertools import chain
from nltk.tokenize import RegexpTokenizer 
from nltk.tokenize import MWETokenizer
from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize

## 3. Methodology

### 3.1 Convert data from pdf to txt

* The Tika library is used to extract the text from the pdf file and convert it into txt format.
* Pass the filename to parser.from_file() to extract the data.

In [12]:
raw = parser.from_file('29812135.pdf')               #parse data from pdf file
dataset = raw['content']                             #store parsed data in variable

### 3.2 File operations
* Read all stop words from the stopwords file provided.

In [13]:
with open('stopwords_en.txt','r') as stop_words:              #open stop words file (closed automatically)
    stop = stop_words.readlines()                             #read all the lines in stopwords file

### 3.3 Initialise lists

In [14]:
tokens = []                    #stores all tokens
data = []                      #stores extracted data from file
fin = []                       #list of lists which stores data unit wise
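uni_voc = []                   #list of unique tokens per unit
unigrams = []                  #list of unigrams
stops=[]                       #list of stopwords
stopped_tokens = []            #list of tokens after removing stop words
init_vocab = []                #copy of the token lists, used to count in how many units a stop word occurs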
vocab = []                     #list of vocabulary
stemmed_vocab = []             #list of stemmed vocab
unit_code = []                 #list of unit codes

### 3.4 Extract data unit wise

* Regex is used to extract data unit wise.
* Extracted data is then stored in a list.

In [15]:
main_filter = re.findall(r'([A-Z][A-Z][A-Z]\d\d\d\d)(.*?)]|[A-Z][A-Z][A-Z][A-Z]\d\d\d\d(.*?)]',dataset,re.DOTALL|re.MULTILINE)  #regex to extract data per unit (raw string so that \d is not treated as a string escape)
for every in main_filter:
    data.append(every)
for each in data:                                      #storing data per unit as a list in a list which contains all the units.
    for x in each:
        each = x.replace('\n',' ')
        fin.append(each)
del fin[2::3]                                          #drop the third capture group of every match
unit_code.append(fin[0::2])                            #the even positions of fin hold the unit codes
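
The extraction step can be illustrated on a small sample. The following is a minimal sketch, not the notebook's exact code: the sample text, its layout (a unit code followed by a description that ends in `]`) and the helper name `split_units` are assumptions made for illustration, and the two alternatives of the regex are merged into one simplified pattern.

```python
import re

# Hypothetical sample; the real layout of the parsed PDF text may differ.
sample = (
    "ABC1234 Synopsis: intro to\nprogramming. Outcomes: [write code; test code]\n"
    "WXYZ5678 Synopsis: data wrangling. Outcomes: [clean data]"
)

# Simplified pattern: a unit code of 3 or 4 capital letters and 4 digits,
# followed lazily by everything up to the next closing bracket.
UNIT_PATTERN = re.compile(r"([A-Z]{3,4}\d{4})(.*?)\]", re.DOTALL)

def split_units(text):
    """Return (unit_code, description) pairs with newlines turned into spaces."""
    return [(code, body.replace("\n", " ")) for code, body in UNIT_PATTERN.findall(text)]

units = split_units(sample)
for code, body in units:
    print(code, "->", body.strip())
```

Returning explicit (code, description) pairs avoids having to recover the unit codes afterwards by slicing a flat list.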

### 3.5 Tokenization
* The data is tokenized using the regex provided.

In [16]:
tokenizer = RegexpTokenizer(r"\w+(?:[-.]\w+)?")        #tokenise all the words
for each in fin:
    unigram_tokens = tokenizer.tokenize(each)
    unigrams.append(unigram_tokens)

for each in unigrams:
    unique = list(set(each))                           #remove repeating tokens
    uni_voc.append(unique)
    
mwe_tokenizer = MWETokenizer(uni_voc)
for each in unigrams:
    mwe_tokens = mwe_tokenizer.tokenize(each)
    tokens.append(mwe_tokens)
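
To see what the tokenizer pattern keeps together, the sketch below applies the same regular expression with the standard `re` module. It assumes that `RegexpTokenizer` in its default (non-gap) mode returns the same matches as `findall`; the sample sentence is made up for illustration.

```python
import re

# Same pattern as in the cell above: a word, optionally joined to one
# more word by a single hyphen or full stop.
TOKEN_PATTERN = re.compile(r"\w+(?:[-.]\w+)?")

text = "Non-fiction writing, e.g. state-of-the-art"  # hypothetical sample sentence
tokens = TOKEN_PATTERN.findall(text)
print(tokens)
```

Because the optional group can match only once, a longer hyphen chain such as `state-of-the-art` is split into two-word pieces, and a trailing full stop is dropped.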

### 3.6 Removing Stop Words
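* The stop words provided are stored in a list.
* Stop words that occur in more than 95% of the units (more than 190 units in the code) are taken out of the stop-word list.
* Tokens with a length of 3 characters or fewer are removed (only tokens with `len > 3` are kept).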

In [17]:
for each in stop:
    each = each.replace('\n',"")
    stops.append(each)
while len(tokens)!=len(stopped_tokens):
    stopped_tokens.append([])

for each in tokens:
    init_vocab.append(each)

for x in list(stops):                                  #drop stop words found in more than 95% of units (iterate over a copy, since stops is modified inside the loop)
    i = 0
    for each in init_vocab:
        if x in each:
            i = i+1
    if i>190:
        stops.remove(x)

i = 0
for i in range(len(tokens)):                           #remove all stopwords from the list which contains the token lists of all units
    for each in tokens[i]:
        if each not in stops:
            if len(each)>3:                            #keep only tokens longer than 3 characters
                each = each.lower()
                stopped_tokens[i].append(each)


for each in tokens:                                    #creating vocabulary set which contains words from the whole dataset
    for x in each:
        if len(x)>3:
            if x not in stops:
                x = x.lower()
                vocab.append(x)
                
temp_vocab = set(vocab)
set_vocab = sorted(temp_vocab)
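
The filtering logic can also be written as a small function. This is only a sketch under stated assumptions: the helper name `remove_stop_words` and the sample data are hypothetical, the default `min_length=4` mirrors the `len(...) > 3` check above, and, unlike the cell above, tokens are lowercased before the stop-word lookup (which assumes the stop-word file is lowercase).

```python
def remove_stop_words(tokens, stop_words, min_length=4):
    """Lowercase tokens, then drop stop words and tokens shorter than min_length."""
    kept = []
    for token in tokens:
        token = token.lower()
        if token not in stop_words and len(token) >= min_length:
            kept.append(token)
    return kept

stop_words = {"the", "and"}          # hypothetical stop-word set
sample_tokens = ["The", "units", "cover", "data", "and", "analysis", "of", "AI"]
filtered = remove_stop_words(sample_tokens, stop_words)
print(filtered)
```

Keeping the stop words in a set makes each membership test constant-time, which matters when every token of every unit is checked.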

### 3.7 Removing rare tokens
* Tokens that occur in fewer than 5% of the units (fewer than 10 units in the code) are removed from the vocabulary.

In [18]:
for each in temp_vocab:                                #filtering rare tokens from vocabulary which occur in less than 5% of units.
    i = 0
    for x in stopped_tokens:
        if each in x:
            i = i+1
    if i<10:
        set_vocab.remove(each)

set_vocab = set(set_vocab)
set_vocab = sorted(set_vocab)
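
The hard-coded limits of 10 and 190 correspond to 5% and 95% only for a corpus of 200 units (an inference from the thresholds, not a figure stated in the notebook). A minimal sketch that derives both limits from the corpus size is shown below; the function name, parameters and toy corpus are hypothetical.

```python
from collections import Counter

def filter_by_document_frequency(docs, min_ratio=0.05, max_ratio=0.95):
    """Keep tokens whose document frequency lies within [min_ratio, max_ratio]."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))          # count each token once per document
    n_docs = len(docs)
    return sorted(
        token for token, df in doc_freq.items()
        if min_ratio * n_docs <= df <= max_ratio * n_docs
    )

# Hypothetical toy corpus of four "units".
docs = [
    ["data", "analysis", "unit"],
    ["data", "model", "unit"],
    ["data", "analysis", "unit", "unit"],
    ["rare", "unit"],
]
kept = filter_by_document_frequency(docs, min_ratio=0.5, max_ratio=0.9)
print(kept)
```

Counting all document frequencies in one pass also avoids rescanning every unit once per vocabulary entry.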

### 3.8 Stemming
* The tokens in the vocabulary are stemmed using the Porter stemmer.

In [19]:
ps = PorterStemmer()                                   #stemming all the tokens in vocabulary using porter stemmer
for w in set_vocab:
    stemmed_vocab.append(ps.stem(w))

### 3.9 Finding Bigrams
* The top 200 meaningful bigrams are found, ranked by pointwise mutual information (PMI).
* To keep only meaningful bigrams, only those that appear at least 4 times are considered; the selected bigrams are then added to the vocabulary.

In [20]:
all_words = list(chain.from_iterable(stopped_tokens))                                #finding first 200 meaningful bigrams
bigram_measures = nltk.collocations.BigramAssocMeasures()
bigram_finder = nltk.collocations.BigramCollocationFinder.from_words(all_words)
bigram_finder.apply_freq_filter(4)
bigram_finder.apply_word_filter(lambda w: len(w) < 3)
top_200_bigrams = bigram_finder.nbest(bigram_measures.pmi, 200)
for first, second in top_200_bigrams:
    stemmed_vocab.append(str(first)+' '+str(second))     #adding the extracted bigrams to the final vocabulary
stemmed_vocab = set(stemmed_vocab)
stemmed_vocab = sorted(stemmed_vocab)                #sorting the final vocabulary alphabetically

In [24]:
top_200_bigrams

[('magnetic', 'resonance'),
 ('childhood', 'adolescence'),
 ('micro', 'nano'),
 ('businesses', 'operating'),
 ('non-fiction', 'fiction'),
 ('monash', 'university'),
 ('central', 'nervous'),
 ('male', 'female'),
 ('crime', 'prevention'),
 ('finite', 'element'),
 ('compare', 'contrast'),
 ('female', 'pelvis'),
 ('geological', 'maps'),
 ('korean', 'peninsula'),
 ('renewable', 'energy'),
 ('virtual', 'space'),
 ('infectious', 'diseases'),
 ('pacific', 'region'),
 ('arguments', 'existence'),
 ('female', 'reproductive'),
 ('corporate', 'finance'),
 ('decision', 'making'),
 ('delivery', 'platforms'),
 ('affecting', 'businesses'),
 ('dispute', 'resolution'),
 ('radiation', 'therapy'),
 ('buyer', 'behaviour'),
 ('sale', 'goods'),
 ('problem', 'solving'),
 ('public', 'sector'),
 ('indigenous', 'peoples'),
 ('nervous', 'system'),
 ('physical', 'computing'),
 ('individual', 'summative'),
 ('rules', 'occupational'),
 ('goods', 'services'),
 ('summative', 'assessment'),
 ('construction', 'disputes')

The output above shows part of the list of the 200 top-ranked bigrams.
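
The PMI ranking can be sketched in plain Python to show what the score measures. This is a simplified illustration and not NLTK's implementation: the function name, the toy token list and the exact counting conventions are assumptions.

```python
import math
from collections import Counter

def top_bigrams_by_pmi(words, min_freq=4, n=200):
    """Rank adjacent word pairs by PMI, ignoring pairs seen fewer than min_freq times."""
    total = len(words)
    word_counts = Counter(words)
    bigram_counts = Counter(zip(words, words[1:]))
    scored = []
    for (first, second), count in bigram_counts.items():
        if count < min_freq:
            continue
        # PMI = log2( count(first, second) * N / (count(first) * count(second)) )
        pmi = math.log2(count * total / (word_counts[first] * word_counts[second]))
        scored.append((pmi, (first, second)))
    scored.sort(key=lambda item: item[0], reverse=True)
    return [bigram for _, bigram in scored[:n]]

words = ["new", "york", "is", "big", "new", "york", "is", "old", "is", "is"]  # hypothetical tokens
best = top_bigrams_by_pmi(words, min_freq=2, n=2)
print(best)
```

PMI favours pairs whose words rarely occur apart, which is why a frequency filter is applied first: without it, pairs seen only once would dominate the ranking.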

### 3.10 Creating vocab.txt

In [21]:
vocab_dict = {}                                      #creating a dictionary for all tokens in the vocabulary with serial indices
for index, item in enumerate(stemmed_vocab, start=1):
    vocab_dict[item] = index
with open("29812135_vocab.txt",'a') as vocab_file:   #open the output file once instead of once per line
    for key in vocab_dict:
        print(str(key)+ ' : ' + str(vocab_dict[key]),file = vocab_file)   #printing the final vocabulary to output file
        

In [23]:
vocab_dict

{'abil': 1,
 'academ': 2,
 'acquir': 3,
 'acquisition comprehensive': 4,
 'activ': 5,
 'addition students': 6,
 'advanc': 7,
 'affecting businesses': 8,
 'aim': 9,
 'aims develop': 10,
 'analys': 11,
 'analyse interpret': 12,
 'analysi': 13,
 'analyt': 14,
 'analytical skills': 15,
 'appli': 16,
 'applic': 17,
 'apply rules': 18,
 'appreci': 19,
 'approach': 20,
 'area': 21,
 'argument': 22,
 'arguments existence': 23,
 'articul': 24,
 'aspect': 25,
 'assess': 26,
 'assessment children': 27,
 'audienc': 28,
 'australian': 29,
 'awar': 30,
 'base': 31,
 'basi': 32,
 'basic': 33,
 'basic theories': 34,
 'beginner level': 35,
 'behaviour': 36,
 'broad': 37,
 'broad range': 38,
 'busi': 39,
 'business environment': 40,
 'business intelligence': 41,
 'business transactions': 42,
 'businesses operating': 43,
 'buyer behaviour': 44,
 'capstone unit': 45,
 'care': 46,
 'case': 47,
 'case studies': 48,
 'central nervous': 49,
 'challeng': 50,
 'chang': 51,
 'childhood adolescence': 52,
 'chines

* Here we have our complete stemmed vocabulary, together with the bigrams added in the previous step.
* Each token is assigned an index, which is used when creating the sparse vector.

### 3.11 Creating sparse vector

In [22]:
with open("29812135_countVec.txt","a") as count_file:      #open the output file once instead of once per print
    for code, text in zip(unit_code[0], fin[1::2]):        #creating sparse vector and writing to output file
        print('\n',file = count_file)
        print(code, end=",",file = count_file)
        for every in vocab_dict:
            if every in text:                              #substring match of the vocabulary entry in the unit text
                print((str(vocab_dict[every]) + ':'+ str(text.count(every))), end=",",file = count_file)

* The code above writes the sparse vectors to the file 29812135_countVec.txt.
* Each line should be interpreted in the following manner:
* UnitCode,token1_index:wordcount,token2_index:wordcount,...
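
As an illustration of this format, the sketch below builds one such line from a token list. It is a hypothetical variant rather than the notebook's code: the helper name `sparse_line` and the sample data are made up, and it counts whole tokens, whereas the cell above counts substring occurrences in the unit text.

```python
from collections import Counter

def sparse_line(unit_code, tokens, vocab_index):
    """Format one unit as 'UnitCode,index:count,...' using token-level counts."""
    counts = Counter(token for token in tokens if token in vocab_index)
    entries = sorted((vocab_index[token], count) for token, count in counts.items())
    return ",".join([unit_code] + [f"{index}:{count}" for index, count in entries])

vocab_index = {"data": 1, "model": 2, "text": 3}      # hypothetical vocabulary
tokens = ["text", "data", "text", "other"]            # hypothetical unit tokens
line = sparse_line("ABC1234", tokens, vocab_index)
print(line)
```

Only vocabulary entries that actually occur in the unit appear in the line, which is what makes the representation sparse.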