# Text Pre-processing

Libraries used in the task:
* re (for regular expressions)
* nltk (for text pre-processing)
* codecs (for reading files)
* os (for I/O operations)
* itertools (using the chain iterator to simplify nested loops/iterations)
* numpy (for calculating statistics)

## 1. Introduction
This notebook preprocesses a set of resume files and generates a sparse representation of the resumes. The dataset contains 250 resumes, which include information about the applicants such as contact details, skills, work experience, educational background, and hobbies.

The resumes containing the information of the applicants were loaded and any duplicate resumes were excluded from the analysis. Text pre-processing was performed on the resume files in the dataset. The pre-processing included case normalization, tokenization, removal of stopwords and stemming. Additionally, the most frequent and the rare tokens were removed, tokens with length less than 3 were removed, and the first 200 meaningful bigrams were added to the vocabulary. Capitalized tokens appearing in the middle of a sentence/line were not normalized to lower case, based on the hypothesis that these tokens are likely to have a different meaning than their lower case counterparts.

The notebook concludes with the generation of the lexical vocabulary and the count vector sparse representations of the resumes. The vocabulary and vector representations are written to files, which can be used as input to various recommender systems and information retrieval algorithms.

## 2.  Import libraries 

In [1]:
# Code to import libraries that are needed for this assessment:
import re
import nltk
# nltk.download('punkt')
import os
import codecs 
import itertools
import numpy as np
from itertools import chain
from nltk.collocations import *
from nltk.tokenize import RegexpTokenizer
from nltk.tokenize import MWETokenizer
from nltk.stem import PorterStemmer
from nltk.probability import *

## 3. Loading the resumes

In this section, the resumes needed for this task are loaded.

In [2]:
# Resume IDs
resume_ids = [     685,467,555,22,534,541,359,787,3,265,606,73,144,178,
                   151,259,840,35,433,260,266,408,697,391,335,823,293,733,
                   194,1,643,384,372,93,226,709,760,857,291,817,488,482,
                   781,706,117,323,260,403,234,848,285,727,55,861,370,590,
                   704,510,819,136,577,451,619,291,224,825,259,258,477,575,
                   432,715,12,699,855,13,204,486,361,665,583,686,163,764,
                   391,61,507,147,127,316,537,113,361,857,166,328,577,205,
                   792,198,805,769,294,837,328,715,678,726,287,855,178,449,
                   512,315,453,710,546,619,487,593,329,748,376,92,437,816,
                   324,372,183,231,482,324,132,114,812,637,803,283,256,410,
                   806,724,863,776,314,326,833,277,474,664,857,126,365,578,
                   10,354,70,32,472,536,150,670,230,765,562,52,551,841,
                   85,73,322,827,576,592,211,365,205,353,546,326,64,627,
                   242,850,743,507,183,365,475,240,515,694,77,838,217,649,
                   431,315,260,686,139,823,617,194,690,481,245,87,131,665,
                   546,594,6,507,96,54,449,856,530,805,463,484,682,320,364,
                   688,666,413,22,340,577,612,348,453,538,245,641,12,483,
                   505,642,470,107,17,328,226,179,682,252,61
             ]
len(resume_ids)

250

Before proceeding with any loading/pre-processing tasks, I am checking for **duplicate resume IDs**:

In [3]:
unique_ids = set(resume_ids)
len(unique_ids)

207

There are 250 resume IDs, but only 207 of them are unique. I am using the **207 unique resumes** to complete the task.
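
To see which IDs are repeated, and how often, a frequency count is enough. The snippet below is a minimal sketch on a short, hypothetical sample (`sample_ids` is not the notebook's full list); the same calls should apply unchanged to `resume_ids`.

```python
from collections import Counter

# Hypothetical sample for illustration; substitute the full resume_ids list
sample_ids = [685, 260, 555, 260, 22, 259, 260, 259]

id_counts = Counter(sample_ids)
# Keep only the IDs that occur more than once
duplicates = {rid: n for rid, n in id_counts.items() if n > 1}
print(duplicates)                # repeated IDs and how often each occurs
print(len(set(sample_ids)))      # number of unique IDs
```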

Loading the resumes and storing them in a **dictionary** with the resume ID as the key:

In [4]:
# Sorting IDs before iterating through them
sorted_ids = list(unique_ids)  
sorted_ids.sort()
print('The IDs range from',min(sorted_ids), 'to', max(sorted_ids),"but they aren't continuous.")

The IDs range from 1 to 863 but they aren't continuous.


**Reading** all the resume files:

In [5]:
cwd = os.getcwd()
resumes_dict = {}
for i in sorted_ids:
    name = 'resume_(' + str(i) + ').txt'
    # The files are expected in a 'Resumes' folder under the current working
    # directory; change the path to match the location of your files.
    filepath = os.path.join(cwd, 'Resumes', name)
    # The context manager closes the file even if reading fails
    with codecs.open(filepath, mode='r', encoding='utf-8') as file:
        resumes_dict[i] = file.read()

In [6]:
resumes_dict[151] # Checking that the files are correctly loaded

'  \r\n\r\n                       Contact: (+65) 8123-1544 (Mobile) \r\n \r\n           Nationality: Singaporean \r\n \r\n\r\nTeo Kai Zhi Terence, CAIA \r\nAddress: Blk 366 Bukit Batok Street 31, #04-269 \r\nEmail: terencettkz@gmail.com        \r\n \r\n \r\n \r\n          \r\nWorking Experience \r\nCitco Fund Services, Singapore \r\n \r\n \r\n \r\n \r\n \r\nRole: Senior Fund Accountant, Private Equity & Real Estate Team \r\n!  Preparing monthly and quarterly reports for Private Equity Funds \r\n!  Performing weekly and monthly custodian and NAV reconciliations \r\n!  Private Equity accounting using Investran (Distributions, Capital Calls and Partner’s Account Summary) \r\n!  Maintaining day-to-day relationship with investment managers, brokers and auditors \r\n!  Working with Traditional and Alternative investments \r\n!  Price checking Hedge Fund’s portfolio using various providers \r\n\r\n \r\n\r\n                  2014 – Present \r\n\r\n \r\nEducation \r\nUniversity of London (UOL),

In [7]:
# Checking that no resume is empty
empty_resumes = [k for k, v in resumes_dict.items() if len(v) == 0]
empty_resumes

[]

## 4. Text Preprocessing

### 4.1 Case Normalization

**Case normalization** is applied to words based on their position in a sentence. The first word of each sentence is normalized to lower case (unless it is entirely in upper case), while words in the middle of a sentence are left unchanged. The hypothesis here is that capitalized words in the middle of a sentence have a different meaning compared to their lowercase counterparts. For example, 'Word' in the middle of a sentence could be referring to MS Word; so, 'Word' has a different meaning than 'word' when it appears in the middle of a sentence.

The first step is to split the resumes into sentences. **Sentence segmentation** is performed using the **nltk punkt sentence tokenizer**.

In [8]:
# Sentence Segmentation
sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')
sentences_resumes = {}
for key, resume in resumes_dict.items():
    sentences = sent_detector.tokenize(resume.strip())
    sentences_resumes[key] = sentences

In [9]:
sentences_resumes[151] # Check

['Contact: (+65) 8123-1544 (Mobile) \r\n \r\n           Nationality: Singaporean \r\n \r\n\r\nTeo Kai Zhi Terence, CAIA \r\nAddress: Blk 366 Bukit Batok Street 31, #04-269 \r\nEmail: terencettkz@gmail.com        \r\n \r\n \r\n \r\n          \r\nWorking Experience \r\nCitco Fund Services, Singapore \r\n \r\n \r\n \r\n \r\n \r\nRole: Senior Fund Accountant, Private Equity & Real Estate Team \r\n!',
 'Preparing monthly and quarterly reports for Private Equity Funds \r\n!',
 'Performing weekly and monthly custodian and NAV reconciliations \r\n!',
 'Private Equity accounting using Investran (Distributions, Capital Calls and Partner’s Account Summary) \r\n!',
 'Maintaining day-to-day relationship with investment managers, brokers and auditors \r\n!',
 'Working with Traditional and Alternative investments \r\n!',
 'Price checking Hedge Fund’s portfolio using various providers \r\n\r\n \r\n\r\n                  2014 – Present \r\n\r\n \r\nEducation \r\nUniversity of London (UOL), Singapore Ins

The first word of every sentence returned by the sentence tokenizer is **case normalized** to lower case.

**Functions** are defined to do the case normalization for every sentence. The steps are:
- Capture the first word of the sentence, even if it is preceded by a bullet
- If the captured first word is entirely in upper case, leave it as is. For example, CEO is left unchanged
- Convert any other first word to lower case. For example, Work is converted to work

A regular expression captures the first word of a sentence: it anchors at the start of the sentence string and allows an optional bullet character followed by optional whitespace before the first word.

In [10]:
def lower_repl(matchobj):
    if matchobj.group(0).isupper(): # If all CAPS, then no normalization. For Example: CGMA, CEO, etc.
        return matchobj.group(0)
    else:
        return matchobj.group(0).lower() # Else, normalize to lower case

def casenormalize(sentence):
    # regex to capture the first word of a sentence
    sentence = re.sub(r'^(?:(?:•|-|\uf0b7|■)?\s*)?\b(\w+)\b', lower_repl, sentence)
    return sentence

normalized_sent_resume = {}
for key, resume in sentences_resumes.items():
    resume_updated = []
    for sentence in resume:
        sent_updated = casenormalize(sentence)
        resume_updated.append(sent_updated)
    normalized_sent_resume[key] = resume_updated 
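
The behaviour of the two functions can be checked on a few hypothetical sentences. The sketch below repeats the logic in a self-contained form, assuming the bullet alternatives are `•`, `-`, `\uf0b7` and `■`; the names `FIRST_WORD`, `lower_first_word` and `normalize_first_word` and the sample sentences are illustrative and not part of the notebook.

```python
import re

# Optional bullet, optional whitespace, then the first word of the string
FIRST_WORD = re.compile(r'^(?:(?:•|-|\uf0b7|■)?\s*)?\b(\w+)\b')

def lower_first_word(match):
    text = match.group(0)
    # All-caps words such as CEO are treated as acronyms and left untouched
    return text if text.isupper() else text.lower()

def normalize_first_word(sentence):
    return FIRST_WORD.sub(lower_first_word, sentence)

samples = [
    "Preparing reports in Word",
    "CEO of a listed company",
    "• Managed Excel models",
]
normalized = [normalize_first_word(s) for s in samples]
for before, after in zip(samples, normalized):
    print(repr(before), '->', repr(after))
```

Only the first word changes: 'Word' and 'Excel' keep their capital letter because they are not at the start of the sentence string.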

In [11]:
normalized_sent_resume[151] # Check

['contact: (+65) 8123-1544 (Mobile) \r\n \r\n           Nationality: Singaporean \r\n \r\n\r\nTeo Kai Zhi Terence, CAIA \r\nAddress: Blk 366 Bukit Batok Street 31, #04-269 \r\nEmail: terencettkz@gmail.com        \r\n \r\n \r\n \r\n          \r\nWorking Experience \r\nCitco Fund Services, Singapore \r\n \r\n \r\n \r\n \r\n \r\nRole: Senior Fund Accountant, Private Equity & Real Estate Team \r\n!',
 'preparing monthly and quarterly reports for Private Equity Funds \r\n!',
 'performing weekly and monthly custodian and NAV reconciliations \r\n!',
 'private Equity accounting using Investran (Distributions, Capital Calls and Partner’s Account Summary) \r\n!',
 'maintaining day-to-day relationship with investment managers, brokers and auditors \r\n!',
 'working with Traditional and Alternative investments \r\n!',
 'price checking Hedge Fund’s portfolio using various providers \r\n\r\n \r\n\r\n                  2014 – Present \r\n\r\n \r\nEducation \r\nUniversity of London (UOL), Singapore Ins

The case normalized **sentences** of a resume are **joined** to get the **case normalized resumes**:

In [12]:
normalized_resumes = {}
for key, sent_list in normalized_sent_resume.items():
    resume = ''.join(sent_list)
    normalized_resumes[key] = resume

### 4.2 Tokenization

Performing the **word tokenization** using regular expression:

In [13]:
tokens_dict = {}
tokenizer = RegexpTokenizer(r"\w+(?:[-']\w+)?", gaps=False)
for key, resume in normalized_resumes.items():
    tokens = tokenizer.tokenize(resume)
    tokens_dict[key] = tokens

In [14]:
tokens_dict[1][1] # Checking

'Vitae'
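
To get a feel for what the token pattern keeps together, it can be applied to a hypothetical sample string. The sketch below uses `re.findall` from the standard library as a stand-in for `RegexpTokenizer` with `gaps=False`, on the assumption that both return the non-overlapping matches of the pattern; the sample text is made up.

```python
import re

# Same pattern as the tokenizer: a word, optionally followed by one
# hyphen- or apostrophe-joined continuation
TOKEN_PATTERN = r"\w+(?:[-']\w+)?"

sample = "Maintaining day-to-day relationship; Hedge Fund's portfolio, tel 8123-1544"
sample_tokens = re.findall(TOKEN_PATTERN, sample)
print(sample_tokens)
```

Because the pattern allows only one continuation, a doubly hyphenated word such as "day-to-day" is split into 'day-to' and 'day', while "Fund's" and "8123-1544" stay whole.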

### 4.3 Generating 200 meaningful bigrams 
A bigram is a pair of consecutive tokens. The nltk collocations package is used to identify bigrams in the dataset.

Before we generate the bigrams, we need to concatenate all the tokenized resumes; this was done with the help of the `chain.from_iterable` function. The list returned contains all tokens.

In [15]:
all_words = list(chain.from_iterable(tokens_dict.values()))
voc = list(set(all_words))
print('The number of tokens is',len(all_words))
print('The number of words in the vocabulary is',len(voc))

The number of tokens is 134220
The number of words in the vocabulary is 16443


Generating the **first 200 bigrams** using functions from the nltk.collocations package.
The important things to note here are:
- The frequency filter is set at 10; bigrams occurring fewer than 10 times are discarded, so the remaining bigrams have a **frequency of at least 10**
- Tokens with length < 3, stopwords, and tokens with only numbers are **ignored** in the analysis
- The **pointwise mutual information (PMI)** measure is used to identify the best 200 bigrams
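
PMI scores a pair of words by how much more often they occur together than would be expected if they were independent: PMI(x, y) = log2( P(x, y) / (P(x) P(y)) ). The following is a simplified sketch of that ranking in plain Python, with probabilities estimated from raw counts; the function name `pmi_ranked_bigrams` and the toy tokens are hypothetical, and NLTK's implementation may differ in details such as normalisation.

```python
import math
from collections import Counter

def pmi_ranked_bigrams(tokens, min_freq=2):
    """Rank adjacent token pairs by pointwise mutual information."""
    n = len(tokens)
    unigram_counts = Counter(tokens)
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    scored = []
    for (w1, w2), c12 in bigram_counts.items():
        if c12 < min_freq:          # keep pairs seen at least min_freq times
            continue
        # log2( P(w1, w2) / (P(w1) * P(w2)) ), probabilities estimated as count / n
        score = math.log2(c12 * n / (unigram_counts[w1] * unigram_counts[w2]))
        scored.append(((w1, w2), score))
    # Highest PMI first; ties broken alphabetically for a stable order
    return sorted(scored, key=lambda item: (-item[1], item[0]))

toy_tokens = "Kuala Lumpur fund manager Kuala Lumpur fund report fund manager".split()
ranked = pmi_ranked_bigrams(toy_tokens, min_freq=2)
print(ranked[0])
```

In the toy data 'Kuala' and 'Lumpur' only ever occur together, so their pair outranks pairs that involve the more widely used word 'fund'. This is also why a frequency filter matters: without it, pairs of words that each occur once would receive the highest PMI.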

In [16]:
stopwords_list = []
with open('stopwords_en.txt','r') as file:
    for line in file:
        word = line.strip('\n')
        stopwords_list.append(word)
# The with-block closes the file, so no explicit close() is needed
stopwords = set(stopwords_list)

In [17]:
bigram_measures = nltk.collocations.BigramAssocMeasures()
bigram_finder = nltk.collocations.BigramCollocationFinder.from_words(all_words)
bigram_finder.apply_freq_filter(10)
bigram_finder.apply_word_filter(lambda w: len(w) < 3 or w.lower() in stopwords or w.isdigit())
top_200_bigrams = bigram_finder.nbest(bigram_measures.pmi, 200) # Top-200 bigrams
top_200_bigrams[:5]

[('Merrill', 'Lynch'),
 ('Ngee', 'Ann'),
 ('Sdn', 'Bhd'),
 ('Abdul', 'Rahman'),
 ('Kuala', 'Lumpur')]

Using the concordance method to **validate the generated bigrams**:

In [18]:
nltk.Text(all_words).concordance('lie')

Displaying 14 of 14 matches:
io ns Lia ise with internat io na l c lie nts a nd o ther globa l o ffic es to 
o ther globa l o ffic es to e nsure c lie nts needs a re me t Per form va lua t
 ime ly ma nner in accorda nce with c lie nts req uireme nts Inve st igated a n
e ne va a nd Bloomberg Lia ise with c lie nts o ther C itco o ffices a nd teams
uppor ted the o n-board ing o f new c lie nts a nd ass isted in the tra ns fer 
d in the tra ns fer o f e xis t ing c lie nts from other globa l o ffices a nd 
ra t io n fo r corpo rate ba nk ing c lie nts Ro le invo lves da ta mining for 
nvo lves da ta mining for targe ted c lie nts matc hing c lie nts p ro files p 
g for targe ted c lie nts matc hing c lie nts p ro files p la nning proce ss fl
 nd a ct ive ly fo llow ing up with c lie nts for ac t iva t io n Ge nera ted d
tus Prob lem so lving for corporate c lie nts thro ugh te lecommun icat io ns D
 fere nt pa yme nt temp la tes fo r c lie nts in the new system JobsDB M arke t
t ly appro 

In [19]:
nltk.Text(all_words).concordance('Abdul')

Displaying 15 of 15 matches:
mberg English Mandarin and Cantonese ABDUL MUHSIN BIN ZULKIFLI Block 140 Pasir 
ion Working Experience College Tunku Abdul Rahman SMJK Sam Tet Advanced Diploma
e College Tunku Financial Accounting Abdul Rahman CGPA 2 66 of 4 00 Diploma In 
oma in Computerized Accounting Tunku Abdul Rahman College Kuala Lumpur Diploma 
EDUCATION 2009 2011 University Tunku Abdul Rahman UTAR Malaysia course Bachelor
 in Malaysia 2007 2009 College Tunku Abdul Rahman KTAR Malaysia diploma in Busi
Pappu a p P Anthony University Tunku Abdul Rahman Final Year Project Supervisor
ed Diploma in Business Studies Tunku Abdul Rahman College Malaysia 1999-2003 ME
 Commerce Financial Accounting Tunku Abdul Rahman College 2003 RELATED SKILLS P
e Business Management cum ICSA Tunku Abdul Rahman College Malaysia Graduated 20
istration Institute University Tunku Abdul Rahman College Malaysia Graduated 20
IVITIES Assistant Group Leader Tunku Abdul Rahman College TARC Orientation Comm
 Committee 

After analysing the results from the concordance method, we can **remove** the following bigrams:
- ('lie' , 'nts'), as it is not a meaningful bigram. It comes from the word "clients", which contains stray white spaces in the resume text
- ('Touche' , 'LLP') and ('Young' , 'LLP'), because they refer to Deloitte Touche LLP and Ernst Young LLP; ('Deloitte' , 'Touche') and ('Ernst' , 'Young') are already present as bigrams. Moreover, 'LLP' stands for limited liability partnership; it is more meaningful to capture the organization names than one part of a name paired with the token 'LLP'


The following bigrams need to be **modified**:
- Tunku Abdul Rahman: The bigrams ('Tunku' , 'Abdul') and ('Abdul' , 'Rahman') are redundant because they both refer to "Tunku Abdul Rahman" (as in Tunku Abdul Rahman College); instead we can have one token - ('Tunku' , 'Abdul' , 'Rahman') (trigram)
- Ngee Ann Polytechnic: Similarly, the bigrams ('Ngee' , 'Ann') and ('Ann' , 'Polytechnic') are redundant because they both refer to "Ngee Ann Polytechnic"; instead we can have one token - ('Ngee' , 'Ann', 'Polytechnic') (trigram)

In [20]:
bigrams_to_delete = {
    ('Tunku', 'Abdul'),
    ('Abdul','Rahman'),
    ('Ngee', 'Ann'),
    ('Ann', 'Polytechnic'),
    ('Touche', 'LLP'),
    ('Young' ,'LLP'),
    ('lie', 'nts')}
len(bigrams_to_delete)

7

We need to generate new bigrams to **replace** the bigrams we are deleting. Generating **top 210 bigrams**:

In [21]:
top_200_bigrams = set(top_200_bigrams)
top_210_bigrams = set(bigram_finder.nbest(bigram_measures.pmi, 210)) # Top-210 bigrams
top_210_bigrams - top_200_bigrams 

{('Audit', 'Manager'),
 ('Kong', 'Institute'),
 ('Senior', 'Fund'),
 ('Senior', 'Manager'),
 ('Singapore', 'Pte'),
 ('financial', 'models'),
 ('fund', 'manager'),
 ('listed', 'company'),
 ('market', 'research'),
 ('regulatory', 'reporting')}

Selecting the 5 most meaningful bigrams from the additional 10 bigrams generated based on a subjective call, and combining these 5 bigrams with the trigrams ('Tunku', 'Abdul','Rahman') and ('Ngee', 'Ann', 'Polytechnic'):

In [22]:
bigrams_to_add = {
    ('Audit', 'Manager'),
    ('Senior', 'Manager'),
    ('financial', 'models'),
    ('fund', 'manager'),
    ('market', 'research'),
    ('Tunku', 'Abdul','Rahman'),
    ('Ngee', 'Ann', 'Polytechnic')}
len(bigrams_to_add)

7

In [23]:
bigrams_updated = top_200_bigrams - bigrams_to_delete
bigrams_updated = bigrams_updated.union(bigrams_to_add)
len(bigrams_updated)

200

**Re-tokenizing** the resumes to **add collocations**. We also need to ensure that the collocations are not split into their individual words. The **MWETokenizer** takes care of that and was used to re-tokenize the resumes with collocations:

In [24]:
mwetokenizer = MWETokenizer(bigrams_updated)
colloc_resumes =  dict((key, mwetokenizer.tokenize(resume)) for key,resume in tokens_dict.items())
all_words_colloc = list(chain.from_iterable(colloc_resumes.values()))
print(len(all_words_colloc))

130369
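
The effect of the re-tokenization can be illustrated without nltk. The function below is a rough sketch of merging multi-word expressions with a greedy longest-match scan; `merge_collocations` and the toy data are hypothetical, and this is not a description of how `MWETokenizer` is implemented internally.

```python
def merge_collocations(tokens, collocations, separator='_'):
    """Greedily merge known multi-word expressions, longest match first."""
    collocations = {tuple(c) for c in collocations}
    max_len = max((len(c) for c in collocations), default=1)
    merged, i = [], 0
    while i < len(tokens):
        # Try the longest possible expression starting at position i
        for size in range(max_len, 1, -1):
            candidate = tuple(tokens[i:i + size])
            if len(candidate) == size and candidate in collocations:
                merged.append(separator.join(candidate))
                i += size
                break
        else:
            # No expression starts here; keep the single token
            merged.append(tokens[i])
            i += 1
    return merged

toy_collocations = {('Kuala', 'Lumpur'), ('Tunku', 'Abdul', 'Rahman')}
toy_tokens = ['Tunku', 'Abdul', 'Rahman', 'College', 'Kuala', 'Lumpur']
merged_tokens = merge_collocations(toy_tokens, toy_collocations)
print(merged_tokens)
```

Every merged bigram removes one token and every merged trigram removes two, which is why the total token count drops from 134,220 to 130,369 after re-tokenization.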


In [25]:
colloc_voc = list(set(all_words_colloc))
print(len(colloc_voc))

16630


Generating a set of the collocations, which can be used in the further steps while preprocessing the dataset:

In [26]:
# MWETokenizer joins the parts of a collocation with '_' by default
bigrams_set = {'_'.join(item) for item in bigrams_updated}

### 4.4 Removing very small tokens
Very small tokens of 1 or 2 characters carry very little lexical content. For example, the tokens 'V', 'B', etc. could be the initials of the applicant and do not add meaningful insights to any analysis on this dataset.

**Removing very small tokens** of length less than 3:

In [27]:
sizeable_tokens_dict = {}
for key, token_list in colloc_resumes.items():
    sizeable_tok_list = []
    for token in token_list:
        if len(token) > 2:
            sizeable_tok_list.append(token)
    sizeable_tokens_dict[key] = sizeable_tok_list

### 4.5 Removing context-independent stopwords
Stopwords are words that are used frequently and carry very little lexical content. Removing them makes processing faster and reduces storage space.

Removing **stopwords** using the given stopwords list:

In [28]:
stopped_tok_dict = {}
for key, token_list in sizeable_tokens_dict.items():
    stopped_tokens = []
    for token in token_list:
        if token not in stopwords:
            stopped_tokens.append(token)
    stopped_tok_dict[key] = stopped_tokens

### 4.6 Stemming

Stemming reduces inflected and derived forms of a word to a common base form (stem). For example, the words 'bake', 'baked', and 'baking' are reduced to the stem 'bake'. The Porter stemmer used here does this by stripping common suffixes, including plural endings.

Capitalized tokens and collocations are not stemmed; only the lower case tokens are stemmed. This is because the capitalized tokens and collocations are assumed to carry a different meaning than their stemmed lower case counterparts.

The tokens were stemmed using the **Porter Stemmer** from the nltk package:

In [29]:
stemmer = PorterStemmer()
stemmed_tok_dict = {}
for key, token_list in stopped_tok_dict.items():
    stemmed_tokens = []
    for token in token_list:
        if token not in bigrams_set and token.islower():
            stem_token = stemmer.stem(token)
            stemmed_tokens.append(stem_token)
        else:
            stemmed_tokens.append(token)
    stemmed_tok_dict[key] = stemmed_tokens
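
The selection logic in the loop above is independent of the stemmer itself. The sketch below isolates it, using a deliberately naive suffix stripper as a stand-in for `PorterStemmer().stem` so that it runs without nltk; `selective_stem`, `naive_stem` and the toy tokens are hypothetical, and the stems it produces will not always match Porter's.

```python
def selective_stem(tokens, stem, protected):
    """Stem only lower-case tokens that are not protected collocations."""
    return [stem(t) if t.islower() and t not in protected else t for t in tokens]

def naive_stem(token):
    # Stand-in for PorterStemmer().stem: strips a few common suffixes only
    for suffix in ('ing', 'ed', 's'):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[:-len(suffix)]
    return token

toy_tokens = ['reports', 'Word', 'fund_manager', 'managed', 'NAV']
stemmed = selective_stem(toy_tokens, naive_stem, protected={'fund_manager'})
print(stemmed)
```

The explicit check against the collocation set is needed because a lower-case collocation such as 'fund_manager' passes `islower()` and would otherwise be handed to the stemmer.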

In [30]:
stemmed_tok_dict[1][:5] # checking

['curriculum', 'Vitae', 'Gowribalan', 'MCSI', 'FCMA']

### 4.7 Removing context-dependent stopwords and rare tokens

The goal of this task is to remove the words appearing in more than 98% of the documents and the words appearing in less than 2% of the documents.

The words appearing in more than 98% of the documents are referred to as context-dependent stopwords. For example, in our dataset words like work, experience, etc. could potentially appear in every resume. These words are generally removed because they aren't very useful in downstream analysis and feature extraction.

The words appearing in less than 2% of the documents are rare tokens. For example, in our dataset the names of applicants like Gowribalan, Dwayne, Hardy, etc. most probably appear only in their individual resumes. These words are also removed because they aren't very useful in downstream analysis and feature extraction.

**Creating the vocab** for the resumes:

In [31]:
words = list(chain.from_iterable(stemmed_tok_dict.values()))
vocab = set(words)
lexical_diversity = len(words)/len(vocab)
print ("Vocabulary size: ",len(vocab),"\nTotal number of tokens: ", len(words), \
"\nLexical diversity: ", lexical_diversity)

Vocabulary size:  13365 
Total number of tokens:  91713 
Lexical diversity:  6.862177328843996


The document frequency of a token is the number of documents containing that token. For example, if the word 'manager' appears in 100 resumes, the document frequency of the word 'manager' is 100.

Calculating the **document frequencies** for the words in the corpus:

In [32]:
words_2 = list(chain.from_iterable([set(value) for value in stemmed_tok_dict.values()]))
fdist = FreqDist(words_2)
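
The same document-frequency count can be sketched with `collections.Counter` on toy data. The documents below are hypothetical; the point is that converting each document to a set first means a word is counted once per document, no matter how often it occurs in it.

```python
from collections import Counter

# Hypothetical tokenized documents keyed by ID
toy_docs = {
    1: ['manag', 'fund', 'manag', 'report'],
    2: ['fund', 'client'],
    3: ['manag', 'client', 'client'],
}

# set(tokens) ensures a word is counted at most once per document
doc_freq = Counter(word for tokens in toy_docs.values() for word in set(tokens))
print(doc_freq.most_common())
```

Here 'manag' occurs three times in total but has a document frequency of 2, since it appears in only two of the three documents.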

In [33]:
# Looking at the most common words
fdist.most_common(25)

[('manag', 169),
 ('client', 162),
 ('University', 155),
 ('report', 152),
 ('compani', 150),
 ('team', 147),
 ('includ', 143),
 ('Bachelor', 141),
 ('work', 141),
 ('Singapore', 138),
 ('financi', 133),
 ('account', 130),
 ('2011', 130),
 ('perform', 129),
 ('gmail', 128),
 ('2012', 128),
 ('2013', 126),
 ('busi', 124),
 ('fund', 123),
 ('EDUCATION', 123),
 ('Management', 121),
 ('2015', 121),
 ('invest', 120),
 ('2014', 120),
 ('Finance', 117)]

Calculating the **threshold** for context-dependent stopwords and rare tokens:

In [34]:
# Removing most Common Words
cutoff_most_common = round(0.98 * len(unique_ids),0)
cutoff_most_common

203.0

In [35]:
cutoff_rare = round(0.02 * len(unique_ids),0)
cutoff_rare

4.0

A couple of points to note from the thresholds calculated above:
- 'manag' is the most common word, appearing in 169 out of 207 resumes. However, 169 is lower than the threshold of 203 resumes for removing context-dependent stopwords. So, there are **no context-dependent stopwords** in our dataset
- The threshold for removing **rare tokens** is **4 Resumes**
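
The thresholds and the resulting filter can be written as two small functions. This is a minimal sketch assuming the same 98% / 2% cut-offs and the same rounding as above; the function names and the document frequencies in `toy_doc_freq` are hypothetical.

```python
def frequency_cutoffs(n_docs, upper=0.98, lower=0.02):
    """Return (upper, lower) document-count thresholds, rounded to whole documents."""
    return round(upper * n_docs), round(lower * n_docs)

def words_to_remove(doc_freq, n_docs):
    high, low = frequency_cutoffs(n_docs)
    # Too common: in more than `high` documents; too rare: in fewer than `low`
    return {w for w, df in doc_freq.items() if df > high or df < low}

high, low = frequency_cutoffs(207)
print(high, low)

# Hypothetical document frequencies for illustration
toy_doc_freq = {'manag': 169, 'Gowribalan': 1, 'audit': 4, 'everywhere': 205}
removed = words_to_remove(toy_doc_freq, 207)
print(sorted(removed))
```

Note that the comparisons are strict: a word appearing in exactly 4 resumes is kept, and a word would need to appear in at least 204 resumes to be dropped as a context-dependent stopword.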

In [36]:
most_freq_words = set([k for k, v in fdist.items() if v > cutoff_most_common])
most_freq_words

set()

In [37]:
less_freq_words = set([k for k, v in fdist.items() if v < cutoff_rare])
len(less_freq_words)

10462

It is interesting to note that **10,462 of the 13,365 words in the vocabulary appear in fewer than 4 resumes**. Removing rare tokens therefore discards more than 75% of the vocabulary (distinct words).

In [38]:
to_delete_words = less_freq_words.union(most_freq_words)
len(to_delete_words)

10462

Creating clean resumes by **removing the rare tokens**:

In [39]:
def remove_words(resumeid):
    return (resumeid, [w for w in stemmed_tok_dict[resumeid] if w not in to_delete_words])
cleaned_resumes = dict(remove_words(resumeid) for resumeid in unique_ids)

In [40]:
len(cleaned_resumes[1]) # check

596

### 4.8 Generating Final Vocabulary

After all the various preprocessing steps, it is time to generate the final vocabulary for the resume dataset.

In [41]:
words = list(chain.from_iterable(cleaned_resumes.values()))
vocab = set(words)

Computing **statistics** for the final vocabulary and the cleaned documents:

In [42]:
lexical_diversity = len(words) / len(vocab)
print ("Vocabulary size: ",len(vocab))
print ("Total number of tokens: ", len(words))
print ("Lexical diversity: ", lexical_diversity)
print ("Total number of resumes:", len(unique_ids))
lens = [len(value) for value in cleaned_resumes.values()]
print ("Average document length:", np.mean(lens))
print ("Maximum document length:", np.max(lens))
print ("Minimum document length:", np.min(lens))
print ("Standard deviation of document length:", np.std(lens))

Vocabulary size:  2903
Total number of tokens:  75463
Lexical diversity:  25.994832931450222
Total number of resumes: 207
Average document length: 364.55555555555554
Maximum document length: 1325
Minimum document length: 68
Standard deviation of document length: 193.25282070562827


**Writing** the final vocabulary to a file:

In [43]:
vocab = list(vocab)
vocab.sort()

# Map every word to its position in the sorted vocabulary
vocab_dict = {word: index for index, word in enumerate(vocab)}

# Each line of the output file has the form "word:index"
with open("./resumes_vocab.txt", 'w') as out_file:
    for word, index in vocab_dict.items():
        out_file.write("{}:{} ".format(word, index))
        out_file.write('\n')

## 5. Generating sparse count vectors

The sparse count vectors are the **numerical representations** of the documents. Each count vector lists pairs of a **token index** and the **token's frequency** in the document.

Note that the sparse count vector of a document does not contain the indices of all words in the vocab; it only contains the indices of the words present in the document. This saves storage space and computation.

The sparse count vector is generated using the FreqDist class from the nltk package.

In [44]:
out_file_2 = open("./resumes_countVec.txt", 'w')

for key, resume in cleaned_resumes.items():
    out_file_2.write(str(key) + ',')
    token_idx = [vocab_dict[w] for w in resume]
    for k, v in FreqDist(token_idx).items():
        out_file_2.write("{}:{},".format(k,v))
    out_file_2.write('\n')
out_file_2.close()
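
For anyone consuming the output file, it helps to see the line format and how to read it back. The sketch below builds one line in the same `id,index:count,...` layout (including the trailing comma) and parses it again; `format_count_vector`, `parse_count_vector` and the toy vocabulary are hypothetical helpers, and `collections.Counter` is used in place of `FreqDist` so that the example runs without nltk.

```python
from collections import Counter

def format_count_vector(doc_id, tokens, vocab_index):
    """Build one line: the document ID followed by index:count pairs."""
    counts = Counter(vocab_index[t] for t in tokens)
    pairs = ''.join('{}:{},'.format(idx, n) for idx, n in counts.items())
    return '{},{}'.format(doc_id, pairs)

def parse_count_vector(line):
    """Invert format_count_vector, returning (doc_id, {index: count})."""
    doc_id, *pairs = line.strip().rstrip(',').split(',')
    return int(doc_id), {int(i): int(n) for i, n in (p.split(':') for p in pairs)}

toy_vocab_index = {'audit': 0, 'fund': 1, 'manag': 2}
line = format_count_vector(151, ['fund', 'manag', 'fund'], toy_vocab_index)
print(line)
print(parse_count_vector(line))
```

The index 0 ('audit') does not appear in the line at all, which is what makes the representation sparse.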

## 6. Summary

The task demonstrated the steps for preprocessing text files and converting them into a numerical representation while preserving the key features of the text. The output can serve as input for downstream analysis, information retrieval algorithms and recommender systems. The statistics below show that, through basic preprocessing steps, the vocabulary shrank from about 16.6k words to about 2.9k words, and the lexical diversity measure used in this notebook (total tokens divided by vocabulary size, i.e. the average number of occurrences per word) more than tripled, from 7.84 to 25.99.

**Text Statistics Before Preprocessing** (after tokenization and collocation merging):

- Total number of tokens: 130,369
- Vocabulary size: 16,630
- Lexical diversity (tokens per vocabulary word): 7.84

**Text Statistics After Preprocessing:**

- Total number of tokens: 75,463
- Vocabulary size: 2,903
- Lexical diversity (tokens per vocabulary word): 25.99
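
The ratios quoted above follow directly from the token and vocabulary counts reported earlier in the notebook, as this short calculation shows (the variable names are illustrative):

```python
# Counts reported earlier in the notebook
tokens_before, vocab_before = 130369, 16630
tokens_after, vocab_after = 75463, 2903

ratio_before = tokens_before / vocab_before   # average occurrences per word
ratio_after = tokens_after / vocab_after
fold_change = ratio_after / ratio_before
vocab_reduction = 1 - vocab_after / vocab_before

print(round(ratio_before, 2), round(ratio_after, 2))
print(round(fold_change, 2), round(vocab_reduction * 100, 1))
```

The measure is a plain ratio of tokens to vocabulary size rather than a percentage, and it grows by a factor of a little over three.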
