# Assignment 2: Milestone I Natural Language Processing
## Task 1. Basic Text Pre-processing
#### Student Name: Nguyen Duc Quang
#### Student ID: 3927198

Date: 01 Oct 2023

Version: 1.0

Environment: Python 3 and Jupyter notebook

Libraries used:
* re
* nltk
* sklearn
* itertools

## Introduction
The purpose of this notebook is to demonstrate basic text pre-processing. It takes job articles as input and outputs the pre-processed job descriptions together with the vocabulary of the whole collection.

## Importing libraries 

In [1]:
# Import the libraries used in this notebook
from nltk.tokenize import RegexpTokenizer
from nltk.tokenize import sent_tokenize
from nltk.probability import FreqDist
from sklearn.datasets import load_files
import re
from itertools import chain

### 1.1 Examining and loading data

#### Load job data

In [2]:
# Load job data
job_data = load_files(r"data").data
# Examine 1 job text
job_data[0]

b'Title: Finance / Accounts Asst Bromley to ****k\nWebindex: 68997528\nCompany: First Recruitment Services\nDescription: Accountant (partqualified) to **** p.a. South East London Our client, a successful manufacturing company has an immediate requirement for an Accountant for permanent role in their modern offices in South East London. The Role: Credit Control Purchase / Sales Ledger Daily collection of debts by phone, letter and email. Handling of ledger accounts Handling disputed accounts and negotiating payment terms Allocating of cash and reconciliation of accounts Adhoc administration duties within the business The Person The ideal candidate will have previous experience in a Credit Control capacity, you will possess exceptional customer service and communication skills together with IT proficiency. You will need to be a part or fully qualified Accountant to be considered for this role'

In [3]:
len(job_data)

776

There are 776 job articles in total. Each one is stored as a byte string (note the `b'...'` prefix above), so it needs to be decoded before processing.
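
As a minimal sketch of that decoding step, the example below uses a shortened copy of the record printed above (the description is cut short for illustration) and splits it into its fields. The `fields` dictionary is only an illustration and is not used elsewhere in this notebook.

```python
# Shortened copy of job_data[0]; the description is truncated for illustration
raw = (b'Title: Finance / Accounts Asst Bromley to ****k\n'
       b'Webindex: 68997528\n'
       b'Company: First Recruitment Services\n'
       b'Description: Accountant (partqualified) to **** p.a. South East London')

text = raw.decode('utf-8')   # bytes -> str
lines = text.split('\n')     # one "Field: value" pair per line

# Split each line at the first ': ' into a field name and its value
fields = dict(line.split(': ', 1) for line in lines)

print(type(raw).__name__, '->', type(text).__name__)  # bytes -> str
print(fields['Webindex'])                             # 68997528
```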

#### Load stopwords 

In [4]:
# Loading stopwords into a list
with open('./stopwords_en.txt', 'r') as f:
    stw = f.read().splitlines() # One stopword per line, without a trailing empty entry
# Showing stopwords
stw

['a',
 "a's",
 'able',
 'about',
 'above',
 'according',
 'accordingly',
 'across',
 'actually',
 'after',
 'afterwards',
 'again',
 'against',
 "ain't",
 'all',
 'allow',
 'allows',
 'almost',
 'alone',
 'along',
 'already',
 'also',
 'although',
 'always',
 'am',
 'among',
 'amongst',
 'an',
 'and',
 'another',
 'any',
 'anybody',
 'anyhow',
 'anyone',
 'anything',
 'anyway',
 'anyways',
 'anywhere',
 'apart',
 'appear',
 'appreciate',
 'appropriate',
 'are',
 "aren't",
 'around',
 'as',
 'aside',
 'ask',
 'asking',
 'associated',
 'at',
 'available',
 'away',
 'awfully',
 'b',
 'be',
 'became',
 'because',
 'become',
 'becomes',
 'becoming',
 'been',
 'before',
 'beforehand',
 'behind',
 'being',
 'believe',
 'below',
 'beside',
 'besides',
 'best',
 'better',
 'between',
 'beyond',
 'both',
 'brief',
 'but',
 'by',
 'c',
 "c'mon",
 "c's",
 'came',
 'can',
 "can't",
 'cannot',
 'cant',
 'cause',
 'causes',
 'certain',
 'certainly',
 'changes',
 'clearly',
 'co',
 'com',
 'come',
 'c

### 1.2 Pre-processing data

#### Tokenize job texts

In [5]:
# Function to remove stopwords and words that are shorter than 2 characters
def remove_words(word_list):
    # Create a new list to append valid words to
    new_list = list()
    # Loop through original list
    for word in word_list:
        if word in stw: # Skip the word if it is a stopword
            continue
        if len(word) < 2: # Skip the word if it is shorter than 2 characters
            continue
        new_list.append(word) # Otherwise keep the word
    return new_list
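
To illustrate the filtering rule, here is a small self-contained sketch. It assumes a tiny hypothetical stopword set in place of `stopwords_en.txt`, and it passes the stopwords as a parameter instead of reading the global `stw`, so it is a variant of the function above and not a copy of it.

```python
def remove_words_sketch(word_list, stopwords):
    """Keep words that are not stopwords and have at least 2 characters."""
    return [w for w in word_list if w not in stopwords and len(w) >= 2]

# Hypothetical stand-in for the contents of stopwords_en.txt
sample_stopwords = {'a', 'the', 'for', 'in', 'to'}

tokens = ['accountant', 'for', 'a', 'permanent', 'role', 'x', 'in', 'london']
filtered = remove_words_sketch(tokens, sample_stopwords)
print(filtered)  # ['accountant', 'permanent', 'role', 'london']
```

Storing the stopwords in a set makes each membership test constant time on average, which may matter when the same check runs for every token in the collection.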

In [6]:
# Function to tokenize job articles
def tokenize_article(job_txt):
    job_txt = job_txt.decode('utf-8') # Decode job text
    
    lines = job_txt.split('\n') # Split the text into its "Field: value" lines

    title = re.fullmatch('Title: (.+)', lines[0]).group(1) # Extract the title

    web_index = re.fullmatch('Webindex: (.+)', lines[1]).group(1) # Extract the webindex

    job_descr = re.fullmatch('Description: (.+)', lines[-1]).group(1) # Extract the description

    # Create tokenizer
    pattern = r"[a-zA-Z]+(?:[-'][a-zA-Z]+)?"
    tokenizer = RegexpTokenizer(pattern)

    # Process job descriptions
    job_descr = job_descr.lower() # normalize description to lowercase
    sentences = sent_tokenize(job_descr) # Sentence segmentation
    
    tks = [] # Create a list to put tokens (words) in
    for sen in sentences:
        tks.extend(tokenizer.tokenize(sen)) # For each sentence tokenize it then put the list of words in tks

    tks = remove_words(tks) # Remove stopwords and words that are shorter than 2 characters

    # Process job titles 
    title = title.lower() # normalize title to lowercase
    title = tokenizer.tokenize(title) # Tokenize title
    title = remove_words(title) # Remove stopwords and words that are shorter than 2 characters

    # Return webindex, tokenized description, and title
    return [web_index, tks, title]
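
The token pattern used above matches a run of letters, optionally followed by one hyphen or apostrophe and a second run of letters. The sketch below shows what that means in practice on a hypothetical sentence (not taken from the data set). It uses `re.findall` as a stand-in for `RegexpTokenizer`, on the assumption that the two return the same matches for a pattern without capturing groups.

```python
import re

# Same pattern as in tokenize_article
pattern = r"[a-zA-Z]+(?:[-'][a-zA-Z]+)?"

# Hypothetical sample sentence
sample = "Part-qualified accountant's role: 2 years' experience, up-to-date IT skills."

tokens = re.findall(pattern, sample.lower())
print(tokens)
# ['part-qualified', "accountant's", 'role', 'years', 'experience',
#  'up-to', 'date', 'it', 'skills']
```

Digits and punctuation are dropped, a trailing apostrophe is not kept (`years'` becomes `years`), and a word with two hyphens is split after the first pair (`up-to-date` becomes `up-to` and `date`).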

In [7]:
# Extract the tokenized titles
titles = [tokenize_article(job_txt)[2] for job_txt in job_data]

In [8]:
# Extract the webindexes
web_indexes = [tokenize_article(job_txt)[0] for job_txt in job_data]

In [9]:
# Extract the tokenized job articles
tokenize_articles = [tokenize_article(job_txt)[1] for job_txt in job_data]

In [10]:
tokenize_articles

[['accountant',
  'partqualified',
  'south',
  'east',
  'london',
  'client',
  'successful',
  'manufacturing',
  'company',
  'requirement',
  'accountant',
  'permanent',
  'role',
  'modern',
  'offices',
  'south',
  'east',
  'london',
  'role',
  'credit',
  'control',
  'purchase',
  'sales',
  'ledger',
  'daily',
  'collection',
  'debts',
  'phone',
  'letter',
  'email',
  'handling',
  'ledger',
  'accounts',
  'handling',
  'disputed',
  'accounts',
  'negotiating',
  'payment',
  'terms',
  'allocating',
  'cash',
  'reconciliation',
  'accounts',
  'adhoc',
  'administration',
  'duties',
  'business',
  'person',
  'ideal',
  'candidate',
  'previous',
  'experience',
  'credit',
  'control',
  'capacity',
  'possess',
  'exceptional',
  'customer',
  'service',
  'communication',
  'skills',
  'proficiency',
  'part',
  'fully',
  'qualified',
  'accountant',
  'considered',
  'role'],
 ['leading',
  'hedge',
  'funds',
  'london',
  'recruiting',
  'fund',
  'accou

#### Remove words based on document frequency and term frequency

In [11]:
# Function to remove words that appear once
def remove_appear_once(tokenized):
    words = list(chain.from_iterable(tokenized))
    term_fd = FreqDist(words) # Calculate term frequency

    # Set of words that only appear once (a set keeps the membership test below fast)
    appear_once = set(term_fd.hapaxes())
    
    # Remove words that only appear once
    return [[word for word in article if word not in appear_once] for article in tokenized]
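
The idea behind this step can be sketched on a few hypothetical documents. The example uses `collections.Counter` as a stand-in for `FreqDist` so that it runs without NLTK, and it selects the words whose term frequency is exactly 1, which is the set of words that `hapaxes()` is used for above.

```python
from collections import Counter
from itertools import chain

# Hypothetical tokenized documents
docs = [['credit', 'control', 'ledger'],
        ['credit', 'payroll'],
        ['ledger', 'credit', 'treasury']]

# Term frequency: total number of occurrences across all documents
term_freq = Counter(chain.from_iterable(docs))

# Words that occur exactly once in the whole collection
once = {word for word, count in term_freq.items() if count == 1}

cleaned = [[word for word in doc if word not in once] for doc in docs]
print(sorted(once))  # ['control', 'payroll', 'treasury']
print(cleaned)       # [['credit', 'ledger'], ['credit'], ['ledger', 'credit']]
```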

In [12]:
# Function to remove top 50 frequent words in document frequency
def remove_top_50(tokenized):
    words_2 = list(chain.from_iterable([set(article) for article in tokenized]))
    doc_fd = FreqDist(words_2)  # Compute document frequency for each unique word/type

    # Set of the top 50 words based on document frequency
    most_common_50 = {word for word, _ in doc_fd.most_common(50)}
    
    # Remove top 50 words
    return [[word for word in article if word not in most_common_50] for article in tokenized]   
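
Document frequency counts each word at most once per document, which is why every article is converted to a set before counting. The sketch below contrasts it with term frequency on hypothetical documents. It again uses `collections.Counter` as a stand-in for `FreqDist`, and it removes only the single most frequent word (not the top 50) to keep the example small.

```python
from collections import Counter
from itertools import chain

# Hypothetical tokenized documents
docs = [['role', 'role', 'credit', 'role'],
        ['role', 'payroll'],
        ['credit', 'credit', 'credit', 'credit', 'role']]

# Document frequency: number of documents that contain the word
doc_freq = Counter(chain.from_iterable(set(doc) for doc in docs))
# Term frequency, for comparison: total number of occurrences
term_freq = Counter(chain.from_iterable(docs))

print(doc_freq['role'], doc_freq['credit'], doc_freq['payroll'])  # 3 2 1
print(term_freq['role'], term_freq['credit'])                     # 5 5

# Remove the top-1 word by document frequency (the notebook removes the top 50)
top_words = {word for word, _ in doc_freq.most_common(1)}
cleaned = [[word for word in doc if word not in top_words] for doc in docs]
print(cleaned)  # [['credit'], ['payroll'], ['credit', 'credit', 'credit', 'credit']]
```

Here `role` and `credit` have the same term frequency, but only `role` appears in every document, so document frequency is what singles it out.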

In [13]:
# Remove words that appear only once, then the top 50 words by document frequency, from descriptions
tokenize_articles = remove_appear_once(tokenize_articles)
tokenize_articles = remove_top_50(tokenize_articles)

In [14]:
# Remove words that appear only once, then the top 50 words by document frequency, from titles
titles = remove_appear_once(titles)
titles = remove_top_50(titles)

#### Define default values for blank title post-processing

In [15]:
# Define a default value for blank titles post-processing
non_specified = 'non_specified'
for i in range(len(titles)):
    if len(titles[i]) == 0: # If the title is blank, give it a default value
        titles[i] = [non_specified] 
titles

[['bromley'],
 ['fund', 'fund'],
 ['non_specified'],
 ['brokers'],
 ['nurses'],
 ['production'],
 ['scrub'],
 ['purchase', 'ledger', 'clerk', 'maternity'],
 ['non_specified'],
 ['non_specified'],
 ['treasury'],
 ['european', 'payroll'],
 ['engineering', 'yorkshire'],
 ['international'],
 ['production'],
 ['insurance'],
 ['vehicle', 'car'],
 ['marine', 'specialist', 'product'],
 ['medical'],
 ['optical'],
 ['perm', 'unit', 'flexi'],
 ['perm', 'bangor', 'flexi', 'ph', 'bangor'],
 ['non_specified'],
 ['leading', 'lending', 'plc'],
 ['dynamics', 'ax'],
 ['nursing'],
 ['mental'],
 ['nurses', 'needed'],
 ['head', 'planning', 'analysis', 'permanent', 'contract'],
 ['branch'],
 ['professional', 'telesales'],
 ['non_specified'],
 ['non_specified'],
 ['investment', 'bank'],
 ['non_specified'],
 ['telesales'],
 ['electronic', 'fire', 'security', 'solutions'],
 ['stress'],
 ['cnc', 'machinist'],
 ['treasury', 'cashier'],
 ['nurses', 'nurses', 'band', 'northampton'],
 ['non_specified'],
 ['permanen

## Saving required outputs

In [16]:
# Function to write content to file
def write_to_file(filename, content):
    with open(filename, 'w') as out_file: # Open file (closed automatically on exit)
        out_file.write(content) # Write content to file

- #### vocab.txt

In [17]:
# Sort the vocabulary alphabetically
vocab = sorted(set(chain.from_iterable(tokenize_articles)))
# Format the vocabulary as one "word:index" entry per line to write to a file
vocab_s = '\n'.join(f"{word}:{index}" for index, word in enumerate(vocab))

In [18]:
# Save vocabulary in vocab.txt
write_to_file('vocab.txt', vocab_s)
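
As a small illustration of the `word:index` format, the sketch below builds the vocabulary string for two hypothetical documents and then parses it back into a lookup table. The `word_to_index` dictionary is an assumption about how the file might be consumed later and is not part of this notebook.

```python
# Hypothetical tokenized documents standing in for tokenize_articles
docs = [['ledger', 'credit'], ['credit', 'payroll']]

# Unique words in alphabetical order, numbered from 0
vocab = sorted({word for doc in docs for word in doc})
vocab_s = '\n'.join(f"{word}:{index}" for index, word in enumerate(vocab))
print(vocab_s)
# credit:0
# ledger:1
# payroll:2

# Parse the format back into a word -> index lookup
word_to_index = {}
for line in vocab_s.split('\n'):
    word, index = line.rsplit(':', 1)
    word_to_index[word] = int(index)
print(word_to_index['ledger'])  # 1
```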

- #### preprocessed.txt

In [19]:
# Formatting preprocessed descriptions into a string to write to a file
preprocessed_s = '\n'.join(' '.join(article) for article in tokenize_articles)

In [20]:
# Save preprocessed descriptions in preprocessed.txt
write_to_file('preprocessed.txt', preprocessed_s)

- #### web_indexes.txt

In [21]:
# Formatting webindexes into a string to write to a file
web_indexes_s = '\n'.join(web_indexes)

In [22]:
# Save webindexes in web_indexes.txt
write_to_file('web_indexes.txt', web_indexes_s)

- #### titles.txt

In [23]:
# Formatting titles into a string to write to a file
titles_s = '\n'.join(' '.join(title) for title in titles)

In [24]:
# Save titles in titles.txt
write_to_file('titles.txt', titles_s)
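
One point that may be worth checking when these files are read back together: `preprocessed.txt`, `web_indexes.txt` and `titles.txt` only stay aligned line by line if every job still occupies one line. Blank titles receive a default value above, but a description that loses all of its tokens would be written as an empty line. The sketch below uses hypothetical data to show how the choice of splitting method affects the line count in that case.

```python
# Hypothetical pre-processed descriptions; two of them ended up empty
docs = [['credit', 'ledger'], [], ['payroll'], []]

# Same formatting as used for preprocessed.txt
preprocessed_s = '\n'.join(' '.join(doc) for doc in docs)

by_split = preprocessed_s.split('\n')      # keeps a trailing empty line
by_splitlines = preprocessed_s.splitlines()  # drops a trailing empty line

print(len(docs), len(by_split), len(by_splitlines))  # 4 4 3
```

Under these assumptions, splitting on `'\n'` preserves one entry per job, while `splitlines()` would silently lose an empty final description.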

## Summary
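This notebook loaded 776 job articles and extracted the title, webindex and description of each one. Titles and descriptions were lowercased and tokenized with the regular expression `[a-zA-Z]+(?:[-'][a-zA-Z]+)?`, after which stopwords and words shorter than 2 characters were removed. Words that appear only once were then removed, followed by the 50 words with the highest document frequency. Titles left empty by this filtering were given the default value `non_specified`. The results were saved to `vocab.txt`, `preprocessed.txt`, `web_indexes.txt` and `titles.txt`.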
