# Assignment 2: Milestone I Natural Language Processing

<h3 style="color:#ffc0cb;font-size:50px;font-family:Georgia;text-align:center;"><strong>Task 1. Basic Text Pre-processing</strong></h3>

#### Student Name: Tran Ngoc Anh Thu
#### Student ID: s3879312

Date: "October 2, 2022"

Version: 1.0

Environment: Python 3 and Jupyter notebook

Libraries used: please include all the libraries you used in your assignment, e.g.,:
* sklearn
* collections
* re
* numpy
* nltk
* itertools
* pandas
* os

## Steps
1. Load data
2. Text Pre-processing
    * Sentence Segmentation
    * Word Tokenization
    * Removing Single Character Tokens
    * Removing Stop words
3. Saving the Pre-processing Reviews

## Introduction
Nowadays there are many job hunting websites including seek.com.au and au.indeed.com. These job hunting sites all manage a job search system, where job hunters could search for relevant jobs based on keywords, salary, and categories. In previous years, the category of an advertised job was often manually entered by the advertiser (e.g., the employer). There were mistakes made for category assignment. As a result, the jobs in the wrong class did not get enough exposure to relevant candidate groups.
With advances in text analysis, automated job classification has become feasible; and sensible suggestions for job categories can then be made to potential advertisers. This can help reduce human data entry error, increase the job exposure to relevant candidates, and also improve the user experience of the job hunting site. In order to do so, we need an automated job ads classification system that helps to predict the categories of newly entered job advertisements.

In this **task1** notebook, we are going to explore a job advertisement data set, and focus on pre-processing the description only.
In the next task **task2_3**, we will then use the pre-processed text reviews to generate data features and build classification models to predict label of the description.

## Dataset
+ A small collection of job advertisement documents (around 776 jobs) inside the `data` folder.
+ Inside the data folder, there are four different sub-folders: Accounting_Finance, Engineering, Healthcare_Nursing, and Sales, representing a job category.
+ The job advertisement text documents of a particular category are in the corresponding sub-folder.
+ Each job advertisement document is a txt file named `Job_<ID>.txt`. It contains the title, the webindex (some will also have information on the company name, some might not), and the full description of the job advertisement.



## Importing libraries 

In [1]:
# import libraries
import numpy as np
import pandas as pd
from sklearn.datasets import load_files
from collections import Counter
from nltk import RegexpTokenizer
from nltk.tokenize import sent_tokenize
from itertools import chain
import re
import os

In [2]:
# check the version of the main packages
print("Numpy version: ", np.__version__)
print("Pandas version: ",pd.__version__)
! python --version

Numpy version:  1.21.5
Pandas version:  1.4.2
Python 3.9.12


<h3 style="color:#ffc0cb;font-size:50px;font-family:Georgia;text-align:center;"><strong>1.1 Examining and loading data</strong></h3>

- Examine the data folder, including the categories and job advertisment txt documents, etc. Explain your findings here, e.g., number of folders and format of txt files, etc.
- Load the data into proper data structures and get it ready for processing.
- Extract webIndex and description into proper data structures.



Before doing any pre-processing, we need to load the data into a proper format. 
To load the data, you have to explore the data folder. Inside the `data` folder:
+ Inside the data folder you will see 4 different subfolders, namely: `Accounting_Finance`, `Engineering`,`Healthcare_Nursing`, and `Sales`, each folder name is a job category.
+ The job advertisement text documents of a particular category are located in the corresponding subfolder.
+ Each job advertisement document is a txt file, named as "Job_<ID>.txt". It contains the title, the webindex,(some will also have information on the company name, some might not), and the full description of the job advertisement. 

In this case, providing that the dataset is given in a very well organised way, I would use a super handy API [`load_files`](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_files.html) from `sklearn.datasets`. 
    
**import the function by:**
```python
from sklearn.datasets import load_files  
```

Then you can use the function to directly load the data and labels, for example:
```python
df = load_files(r"data")  
```

The loaded `movie_data` is then a dictionary, with the following attributes:

| **ATTRIBUTES**   | **DESCRIPTION**                                           |
|--------------|---------------------------------------------------------------|
| Webindex     | 8 digit Id of the job advertisement on the website            |
| Title        | Title of the advertised job position                          |
| Company      | Company (employer) of the advertised job position             |
| Description  | the description of each job advertisement                     |


- Examine the data folder, including the categories and job advertisment txt documents, etc. Explain your findings here, e.g., number of folders and format of txt files, etc.
- Load the data into proper data structures and get it ready for processing.
- Extract webIndex and description into proper data structures.


In [3]:
# load each folder and file inside the data folder
df = load_files(r"data")
# type of the loaded file
type(df)

sklearn.utils.Bunch

In [4]:
df['target'] # this corresponding to the index value of the 4 categories

array([1, 1, 3, 1, 3, 2, 3, 1, 4, 4, 1, 1, 2, 4, 2, 4, 4, 2, 4, 3, 3, 3,
       4, 4, 1, 3, 3, 3, 1, 3, 4, 2, 3, 1, 2, 4, 4, 2, 2, 1, 3, 3, 3, 3,
       1, 1, 3, 2, 4, 2, 2, 3, 3, 4, 1, 1, 2, 1, 3, 3, 4, 4, 4, 1, 4, 1,
       2, 3, 4, 2, 4, 3, 4, 2, 4, 3, 2, 4, 3, 2, 4, 3, 3, 2, 1, 2, 2, 2,
       4, 1, 4, 2, 4, 3, 3, 1, 3, 4, 3, 2, 1, 2, 2, 3, 1, 4, 1, 2, 4, 3,
       2, 3, 1, 4, 2, 2, 2, 2, 2, 2, 2, 2, 4, 2, 2, 2, 2, 2, 4, 2, 2, 4,
       3, 1, 1, 2, 4, 3, 1, 2, 1, 4, 2, 3, 2, 1, 1, 1, 4, 1, 2, 3, 4, 2,
       2, 2, 3, 2, 1, 2, 1, 2, 1, 2, 2, 3, 1, 3, 3, 1, 3, 4, 3, 3, 1, 3,
       2, 1, 2, 2, 2, 4, 2, 4, 2, 1, 4, 2, 1, 3, 1, 1, 3, 2, 2, 1, 2, 4,
       1, 2, 2, 4, 1, 2, 1, 3, 4, 1, 3, 1, 2, 1, 2, 4, 2, 1, 2, 2, 1, 2,
       1, 2, 3, 2, 4, 2, 3, 4, 2, 2, 3, 1, 1, 2, 3, 1, 4, 3, 4, 3, 3, 4,
       1, 2, 2, 2, 2, 2, 2, 1, 4, 2, 2, 1, 1, 3, 2, 3, 3, 3, 3, 2, 4, 2,
       3, 2, 3, 4, 3, 4, 1, 2, 4, 1, 3, 2, 1, 3, 2, 3, 1, 3, 2, 2, 2, 3,
       3, 2, 3, 1, 3, 3, 2, 3, 1, 2, 1, 1, 4, 3, 2,

In [5]:
# Name of the categories
df['target_names'] # this corresponding to the name value of the 4 categories

['.ipynb_checkpoints',
 'Accounting_Finance',
 'Engineering',
 'Healthcare_Nursing',
 'Sales']

In [6]:
print(f'Category at index 0: {df["target_names"][0]}')
print(f'Category at index 1: {df["target_names"][1]}')
print(f'Category at index 2: {df["target_names"][2]}')
print(f'Category at index 3: {df["target_names"][3]}')

Category at index 0: .ipynb_checkpoints
Category at index 1: Accounting_Finance
Category at index 2: Engineering
Category at index 3: Healthcare_Nursing


In [7]:
# test whether it matches, just in case
emp = 10 # an example, note we will use this example throughout this exercise.
df['filenames'][emp], df['target'][emp] # from the file path we know that it's the correct class too

('data/Accounting_Finance/Job_00263.txt', 1)

In [8]:
# assign variables
full_description, category = df.data, df.target

In [9]:
# the 10th job advertisement description
full_description[emp]

b'Title: Investments & Treasury Controller\nWebindex: 71851935\nCompany: August Clarke\nDescription: Our client, based in Eastleigh, is looking for an Investments and Treasury Controller to join their team. Duties to include: Take responsibility for transactional management, analysis and oversight of the Company\xe2\x80\x99s investment portfolio, including compliance with relevant sections of the relevant policies Ensure that working capital and other liquid resources and cashflow are managed efficiently Deliver consistently against relevant KPIs and KRIs, analysing any shortfalls and putting appropriate action plans in place to remediate process issues Manage day to day relationships with the Company\xe2\x80\x99s outsourced Investment Managers and Custodians ensuring that there is mutual understanding of each others\xe2\x80\x99 operations, systems and developments so that business is transacted efficiently and effectively Own endtoend investment processes, ensuring that processes, pro

### ------> OBSERVATION:
As we can see the current description is in the **binary** form and read as a byte object (a `b` in front of each review text if you print it out). Therefore, we need to decode into normal string for further pre-processing

However, the tokenizer cannot apply a string pattern on a bytes-like object. To resolve this, we decode each read `full_description` text using `utf-8` by writing a decode function


In [10]:
def decode(l):
    if isinstance(l, list):
        return [decode(x) for x in l]
    else:
        return l.decode('utf-8')

# decode the binary description into utf-8 form and save it to full_description
full_description = decode(full_description)
full_description[emp]

list

### ---------------> OBSERVATION:
The current `description` contains these attributes:

| **ATTRIBUTES**   | **MEANING**                                        |
|--------------|----------------------------------------------------|
| Webindex     | 8 digit Id of the job advertisement on the website |
| Title        | Title of the advertised job position               |
| Company      | Company (employer) of the advertised job position  |
| Description  | the description of each job advertisement          |

and I only want the description itself to perform text-preprocessing and NLP on `description`. Therefore, I will perform the following pre-processing steps to the description of each job advertisement;

<h3 style="color:#ffc0cb;font-size:50px;font-family:Georgia;text-align:center;"><strong>1.2 Pre-processing data</strong></h3>

1. Extract information from each job advertisement. Perform the following pre-processing steps to the description of each job advertisement;
2. Tokenize each job advertisement description. The word tokenization must use the following regular expression, r"[a-zA-Z]+(?:[-'][a-zA-Z]+)?";
3. All the words must be converted into the lower case;
4. Remove words with length less than 2.
5. Remove stopwords using the provided stop words list (i.e, stopwords_en.txt). It is located inside the
same downloaded folder.
6. Remove the word that appears only once in the document collection, based on term frequency.
7. Remove the top 50 most frequent words based on document frequency.
8. Save all job advertisement text and information in txt file(s) (you have flexibility to choose what format
you want to save the preprocessed job ads, and you will need to retrieve the pre-processed job ads
text in Task 2 & 3);
9. Build a vocabulary of the cleaned job advertisement descriptions, save it in a txt file (please refer to the
required output);

In [11]:
# Extract description, title, webindex,  from each job advertisement. 

def extract_description(full_description):
    description = [re.search(r'\nDescription: (.*)', str(i)).group(1) for i in full_description]
    return description
description = extract_description(full_description)

# Extract title
def extract_title(full_description):
    title = [re.search(r'Title: (.*)', str(i)).group(1) for i in full_description]
    return title
title = extract_title(full_description)

# Extract Webindex
def extract_webindex(full_description):
    webindex = [re.search(r'Webindex: (.*)', str(i)).group(1) for i in full_description]
    return webindex

webindex = extract_webindex(full_description)

# Extract company
def extract_company(company):
    company = [re.search(r'Company: (.*)', str(i)).group(1) if re.search(r'Company: (.*)', str(i)) else "NA" for i in company]
    return company
company = extract_company(full_description)

In [12]:
description[emp]

'Our client, based in Eastleigh, is looking for an Investments and Treasury Controller to join their team. Duties to include: Take responsibility for transactional management, analysis and oversight of the Company’s investment portfolio, including compliance with relevant sections of the relevant policies Ensure that working capital and other liquid resources and cashflow are managed efficiently Deliver consistently against relevant KPIs and KRIs, analysing any shortfalls and putting appropriate action plans in place to remediate process issues Manage day to day relationships with the Company’s outsourced Investment Managers and Custodians ensuring that there is mutual understanding of each others’ operations, systems and developments so that business is transacted efficiently and effectively Own endtoend investment processes, ensuring that processes, procedures, risks and controls are documented, effective and efficient. Regularly review and test processes and controls in accordance w

In [13]:
def tokenizeDescription(raw_description):
    """
        This function first convert all words to lowercases,
        it then segment the raw description into sentences and tokenize each sentences
        and convert the description to a list of tokens.
    """
    # description = raw_description.decode('utf-8') # convert the bytes-like object to python string, need this before we apply any pattern search on it
    description = raw_description.lower() # cover all words to lowercase

    # segment into sentences
    sentences = sent_tokenize(description)

    # tokenize each sentence
    pattern = r"[a-zA-Z]+(?:[-'][a-zA-Z]+)?"
    tokenizer = RegexpTokenizer(pattern)
    token_lists = [tokenizer.tokenize(sen) for sen in sentences]

    # merge them into a list of tokens
    tokenised_description = list(chain.from_iterable(token_lists))
    return tokenised_description

tk_description = [tokenizeDescription(r) for r in description]  # list comprehension, generate a list of tokenized articles

print("Raw description:\n",description[emp],'\n')
print("Tokenized description:\n",tk_description[emp])

Raw description:
 Our client, based in Eastleigh, is looking for an Investments and Treasury Controller to join their team. Duties to include: Take responsibility for transactional management, analysis and oversight of the Company’s investment portfolio, including compliance with relevant sections of the relevant policies Ensure that working capital and other liquid resources and cashflow are managed efficiently Deliver consistently against relevant KPIs and KRIs, analysing any shortfalls and putting appropriate action plans in place to remediate process issues Manage day to day relationships with the Company’s outsourced Investment Managers and Custodians ensuring that there is mutual understanding of each others’ operations, systems and developments so that business is transacted efficiently and effectively Own endtoend investment processes, ensuring that processes, procedures, risks and controls are documented, effective and efficient. Regularly review and test processes and control

#### A Few Statistics Before Any Further Pre-processing

In the following, we are interested to know a few statistics at this very begining stage, including:
* The total number of tokens across the corpus
* The total number of types across the corpus, i.e. the size of vocabulary 
* The so-called, [lexical diversity](https://en.wikipedia.org/wiki/Lexical_diversity), referring to the ratio of different unique word stems (types) to the total number of words (tokens).  
* The average, minimum and maximum number of token (i.e. document length) in the dataset.

In the following, we wrap all these up as a function, since we will use this printing module later to compare these statistic values before and after pre-processing.

In [14]:
def stats_print(tk_description):
    words = list(chain.from_iterable(tk_description)) # we put all the tokens in the corpus in a single list
    vocab = set(words) # compute the vocabulary by converting the list of words/tokens to a set, i.e., giving a set of unique words
    lexical_diversity = len(vocab)/len(words)
    print("Vocabulary size: ",len(vocab))
    print("Total number of tokens: ", len(words))
    print("Lexical diversity: ", lexical_diversity)
    print("Total number of description:", len(tk_description))
    lens = [len(article) for article in tk_description]
    print("Average description length:", np.mean(lens))
    print("Maximum description length:", np.max(lens))
    print("Minimum description length:", np.min(lens))
    print("Standard deviation of description length:", np.std(lens))

stats_print(tk_description)

Vocabulary size:  9834
Total number of tokens:  186952
Lexical diversity:  0.052601737344345076
Total number of description: 776
Average description length: 240.91752577319588
Maximun description length: 815
Minimun description length: 13
Standard deviation of description length: 124.97750685071483


### Task 2.2 Remove words with length less than 2.

In this sub-task, you are required to remove any token that only contains a single character (a token that of length 1).
You need to double-check whether it has been done properly

In [15]:
words = list(chain.from_iterable(tk_description)) # we put all the tokens in the corpus in a single list
word_counts = Counter(words) # count the number of times each word appears in the corpus
print("Number of words that appear less than 2:", len([w for w in word_counts if word_counts[w] <= 1]))

Number of words that appear only once: 4233


In [28]:
st_list = [[w for w in description if len(w) <= 1 ] \
                      for description in tk_description] # create a list of single character token for each description
list(chain.from_iterable(st_list)) # merge them together in one list

# filter out single character tokens
tk_description = [[w for w in description if len(w) >=2] \
                      for description in tk_description]

words = list(chain.from_iterable(tk_description)) # we put all the tokens in the corpus in a single list
word_counts = Counter(words) # count the number of times each word appears in the corpus
print("Number of words that appear less than 2:", len([w for w in word_counts if word_counts[w] <= 1]))

Number of words that appear less than 2: 4191


In [29]:
# Remove the top 50 most frequent words
words = list(chain.from_iterable(tk_description)) # we put all the tokens in the corpus in a single list
word_counts = Counter(words) # count the number of times each word appears in the corpus
top50 = word_counts.most_common(50) # get the top 50 most frequent words
print("Top 50 most frequent words:\n",top50, "\n\n")

tk_description = [[w for w in description if w not in top50] for description in tk_description]
top50 = word_counts.most_common(50) # get the top 50 most frequent words
print("Top 50 most frequent words after removing:\n",top50)

Top 50 most frequent words:
 [('experience', 1276), ('sales', 1030), ('role', 946), ('work', 861), ('business', 832), ('team', 789), ('working', 719), ('job', 688), ('care', 675), ('skills', 669), ('company', 614), ('client', 594), ('management', 572), ('manager', 519), ('support', 501), ('uk', 496), ('service', 481), ('excellent', 455), ('development', 431), ('required', 399), ('based', 376), ('opportunity', 372), ('services', 369), ('knowledge', 349), ('apply', 349), ('successful', 340), ('training', 338), ('design', 337), ('engineering', 336), ('customer', 335), ('recruitment', 335), ('salary', 322), ('candidate', 319), ('clients', 310), ('high', 309), ('join', 302), ('ability', 301), ('strong', 299), ('provide', 298), ('home', 291), ('ensure', 290), ('leading', 289), ('including', 287), ('engineer', 285), ('not', 280), ('financial', 279), ('good', 274), ('staff', 271), ('position', 268), ('systems', 267)] 


Top 50 most frequent words after removing:
 [('experience', 1276), ('sales

### Task 2.3 Removing Stop words

In this sub-task, you are required to remove the stop words from the tokenized text inside `stopwords_en.txt` file

In [18]:
# remove the stop words inside `stopwords_en.txt` from the tokenized text
with open('stopwords_en.txt', 'r') as f:
    stop_words = f.read().splitlines() # read the stop words into a list
print("Stop words:\n",stop_words)

Stop words:
 ['a', "a's", 'able', 'about', 'above', 'according', 'accordingly', 'across', 'actually', 'after', 'afterwards', 'again', 'against', "ain't", 'all', 'allow', 'allows', 'almost', 'alone', 'along', 'already', 'also', 'although', 'always', 'am', 'among', 'amongst', 'an', 'and', 'another', 'any', 'anybody', 'anyhow', 'anyone', 'anything', 'anyway', 'anyways', 'anywhere', 'apart', 'appear', 'appreciate', 'appropriate', 'are', "aren't", 'around', 'as', 'aside', 'ask', 'asking', 'associated', 'at', 'available', 'away', 'awfully', 'b', 'be', 'became', 'because', 'become', 'becomes', 'becoming', 'been', 'before', 'beforehand', 'behind', 'being', 'believe', 'below', 'beside', 'besides', 'best', 'better', 'between', 'beyond', 'both', 'brief', 'but', 'by', 'c', "c'mon", "c's", 'came', 'can', "can't", 'cannot', 'cant', 'cause', 'causes', 'certain', 'certainly', 'changes', 'clearly', 'co', 'com', 'come', 'comes', 'concerning', 'consequently', 'consider', 'considering', 'contain', 'contai

In [19]:
[w for w in stop_words if ("not" in w or "n't" in w or "no" in w)]

["ain't",
 'another',
 "aren't",
 "can't",
 'cannot',
 "couldn't",
 "didn't",
 "doesn't",
 "don't",
 'enough',
 "hadn't",
 "hasn't",
 "haven't",
 'ignored',
 "isn't",
 'know',
 'knows',
 'known',
 'no',
 'nobody',
 'non',
 'none',
 'noone',
 'nor',
 'normally',
 'not',
 'nothing',
 'novel',
 'now',
 'nowhere',
 "shouldn't",
 "wasn't",
 "weren't",
 "won't",
 "wouldn't"]

#### The Updated Statistics

In the above, we have done a few pre-processed steps, now let's have a look at the statistics again:


In [20]:
# specify
ignored_words = [w for w in stop_words if not ("not" in w or "n't" in w or "no" in w)]

# filter out stop words
tk_description = [[w for w in description if w not in ignored_words] \
                      for description in tk_description]

stats_print(tk_description)

Vocabulary size:  9423
Total number of tokens:  107751
Lexical diversity:  0.08745162457889022
Total number of description: 776
Average description length: 138.85438144329896
Maximun description length: 489
Minimun description length: 12
Standard deviation of description length: 73.42099464751045


Recall, from the beginning, we have the following:  
_____________________________________________

Vocabulary size:  9834

Total number of tokens:  186952

Lexical diversity:  0.052601737344345076

Total number of description: 776

Average description length: 240.91752577319588

Maximun description length: 815

Minimun description length: 13

Standard deviation of description length: 124.97750685071483
_____________________________________________

We've shrunk more than 40% of the vocabulary.

category

## Saving required outputs
Save the vocabulary, bigrams and job advertisment txt as per spectification.

We are going to store all the preprocessed description texts and its corresponding labels into files for task 2.
* all the tokenized description are stored in a .txt file named `description.txt`
    * each line is a description text, which contained all the tokens of the description text, separated by a space ' '
* all the corresponding labels are store in a .txt file named `category.txt`
    * each line is a label (one of these 4 values: 0,1,2,3)

In [21]:
# save description text
def save_description(descriptionFilename,tk_description):
    out_file = open(descriptionFilename, 'w') # creates a txt file and open to save the descriptions
    string = "\n".join([" ".join(description) for description in tk_description])
    out_file.write(string)
    out_file.close() # close the file

# save the category corresponding with the description text
def save_category(categoryFilename,category):
    out_file = open(categoryFilename, 'w') # creates a txt file and open to save category
    string = "\n".join([str(s) for s in category])
    out_file.write(string)
    out_file.close() # close the file

# save the title corresponding with the description text
def save_title(titleFilename,title):
    out_file = open(titleFilename, 'w') # creates a txt file and open to save title
    string = "\n".join([str(s) for s in title])
    out_file.write(string)
    out_file.close() # close the file


# save description into txt file
descriptionFilename = "description.txt"
save_description(descriptionFilename,tk_description)
print(f'Successfully saved description into {descriptionFilename}')

# save category into txt file
categoryFilename = "category.txt"
save_category(categoryFilename,category)
print(f'Successfully saved category into {categoryFilename}')

# save title into txt file
titleFilename = "title.txt"
save_title(titleFilename,title)
print(f'Successfully saved title into {titleFilename}')

Successfully saved description into description.txt
Successfully saved category into category.txt
Successfully saved title into title.txt


`vocab.txt`
This file contains the unigram vocabulary, one each line, in the following format: word_string:word_integer_index. Very importantly, words in the vocabulary must be sorted in alphabetical order, and the index value starts from 0. This file is the key to interpret the sparse encoding. For instance, in the following example, the word aaron is the 20th word (the corresponding integer_index as 19) in the vocabulary (note that the index values and words in the following image are artificial and used to demonstrate the required format only, it doesn't reflect the values of the actual expected output).

In [22]:
# Save all job advertisement text and information in txt file
with open('job_ad.txt', 'w') as f:
    f.write("Category: " + str(category) + "\n")
    for i in range(len(tk_description)):
        f.write(full_description[i] + "\n")
        f.write("Tokenized Description: " + str(tk_description[i]) + "\n")
        f.write("Category: " + str(df['target'][i]) + "\n")
        f.write("\n")
    print("Successfully write job advertisement with the tokenized description in txt file")

Successfully write job advertisement with the tokenized description in txt file


In [23]:
def write_vocab(vocab, filename):
    with open(filename, 'w') as f:
        for i, word in enumerate(vocab):
            f.write(word + ':' + str(i) + '\n')
# convert tokenized description into a alphabetically sorted list
vocab = sorted(list(set(chain.from_iterable(tk_description))))
write_vocab(vocab, 'vocab.txt')
# print out the first 10 words in the vocabulary
print(vocab[:10])

['aah', 'aap', 'aaron', 'aat', 'abandonment', 'abb', 'abenefit', 'aberdeen', 'aberdenshire', 'abi']


In [24]:
print(f'Category at index 0: {df["target_names"][0]}')
print(f'Category at index 1: {df["target_names"][1]}')
print(f'Category at index 2: {df["target_names"][2]}')
print(f'Category at index 3: {df["target_names"][3]}')

Category at index 0: .ipynb_checkpoints
Category at index 1: Accounting_Finance
Category at index 2: Engineering
Category at index 3: Healthcare_Nursing


In [25]:
# convert job ad to a dataframe
job_ad = pd.DataFrame({'Title': title, 'Webindex': webindex, 'Company': company, 'Description': description,'Tokenized Description': tk_description, 'Category': category})

# change Tokenized Description to string separated by space
job_ad['Tokenized Description'] = job_ad['Tokenized Description'].apply(lambda x: ' '.join([str(i) for i in x]))


# replace the value in Category column
# Category at index 0: Accounting_Finance
# Category at index 1: Engineering
# Category at index 2: Healthcare_Nursing
# Category at index 3: Sales
job_ad['Category'] = job_ad['Category'].replace([0,1,2,3],['Accounting_Finance','Engineering','Healthcare_Nursing','Sales'])

# Cast Webindex to int
job_ad['Webindex'] = job_ad['Webindex'].astype(int)

# save job ad to csv file
job_ad.to_csv('job_ad.csv', index=False)

print(job_ad.info())
# print first 3 rows
job_ad.head(3)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 776 entries, 0 to 775
Data columns (total 6 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   Title                  776 non-null    object
 1   Webindex               776 non-null    int64 
 2   Company                776 non-null    object
 3   Description            776 non-null    object
 4   Tokenized Description  776 non-null    object
 5   Category               776 non-null    object
dtypes: int64(1), object(5)
memory usage: 36.5+ KB
None


Unnamed: 0,Title,Webindex,Company,Description,Tokenized Description,Category
0,Finance / Accounts Asst Bromley to ****k,68997528,First Recruitment Services,Accountant (partqualified) to **** p.a. South ...,accountant partqualified south east london cli...,Engineering
1,Fund Accountant Hedge Fund,68063513,Austin Andrew Ltd,One of the leading Hedge Funds in London is cu...,leading hedge funds london recruiting fund acc...,Engineering
2,Deputy Home Manager,68700336,Caritas,An exciting opportunity has arisen to join an ...,exciting opportunity arisen join establish pro...,Sales


In [26]:
# The .py format of the jupyter notebook
for fname in os.listdir():
    if fname.endswith('ipynb'):
        os.system(f'jupyter nbconvert {fname} --to python')

[NbConvertApp] Converting notebook task1.ipynb to python
[NbConvertApp] Writing 21057 bytes to task1.py
[NbConvertApp] Converting notebook task2_3.ipynb to python
[NbConvertApp] Writing 18131 bytes to task2_3.py


## Summary
Give a short summary and anything you would like to talk about the assessment task here.

# Reference