# Assignment 2: Milestone I Natural Language Processing
## Task 2&3
#### Student Name: Dao Sy Trung Kien
#### Student ID: S3979613

Date: 04/09/2024

Version: 1.0

Environment: Python 3 and Jupyter notebook

Libraries used: please include all the libraries you used in your assignment, e.g.,:
* pandas
* re
* numpy

## Introduction
In tasks 2 and 3, students need to create feature vectors for job advertisement descriptions and titles. Then, using the created vectors, build machine learning models to classify the category of a job advertisement text. Students should provide answers to two questions:
Q1: Language model comparisons
Q2: Does more information provide higher accuracy?

## Importing libraries 

In [1]:
# Code to import libraries as you need in this assessment, e.g.,
import nltk
from nltk.tokenize import RegexpTokenizer
import os
import string
import numpy as np
import pandas as pd
import re
from collections import Counter
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
import gensim
from gensim.models import Word2Vec
from collections import defaultdict

## Task 2. Generating Feature Representations for Job Advertisement Descriptions and Titles

### 2.1 Load Vocabulary from Vocab Files and Extract Title, Description, and Webindex from Preprocessed Job Advertisement File

a. Load the description vocabulary from the vocab.txt file

In [5]:
vocab_des = {}
with open('vocab.txt', 'r') as file:
    for line in file:
        word, idx = line.strip().split(':')
        vocab_des[word] = int(idx)

b. Load the title vocabulary from the vocab_title.txt file

In [6]:
vocab_title = {}
with open('vocab_title.txt', 'r') as file:
    for line in file:
        word, idx = line.strip().split(':')
        vocab_title[word] = int(idx)

c. Extract descriptions, titles, and webindexes from Preprocessed Job Advertisement File

In [None]:
# Define lists to store the descriptions, titles, labels, and webindexes
descriptions = []
titles = []
webindexes = []
labels = []
all_doc = []

# Open and read the preprocessed_job_ads.txt file
with open('preprocessed_job_ads.txt', 'r') as file:
    # First, we skip the header
    next(file)
    # Next iterate through each line
    for line in file:
        # Split the line by commas
        fields = line.strip().split(', ')
        # Extract the 'title', 'webindex', 'labels', and 'description' fields
        if len(fields) >= 5:
            title = fields[1]
            webindex = fields[2]
            description = fields[4]
            label = fields[5]
            # Add the title, webindex, label, and description to the respective lists
            titles.append(title)
            webindexes.append(webindex)
            descriptions.append(description)
            labels.append(label)
            all_doc.append(title)
            all_doc.append(description)

### 2.2 Generating Count Vectors

a. Descriptions Count Vector

In [8]:
# Create a CountVectorizer object with the vocabulary from the vocab dictionary
c_vec_des = CountVectorizer(vocabulary=vocab_des)

# Fit and transform descriptions using the c_vec_des
X_c_des = c_vec_des.fit_transform(descriptions)

b. Titles Count Vector

In [9]:
# Create a CountVectorizer object with the vocabulary from the vocab_title dictionary
c_vec_title = CountVectorizer(vocabulary=vocab_title)

# Fit and transform titles using the c_vec_title
X_c_title = c_vec_title.fit_transform(titles)

### 2.3 Generating TF-IDF Vectors

a. Desccriptions TF-IDF Vector

In [None]:
# Create a TfidfVectorizer object with the vocabulary from the vocab dictionary
tfidf_vec_des = TfidfVectorizer(vocabulary=vocab_des)

# Fit and transform descriptions using the TfidfVectorizer
X_tfidf_des = tfidf_vec_des.fit_transform(descriptions)

b. Titles TF-IDF Vector

In [None]:
# Create a TfidfVectorizer object with the vocabulary from the vocab_title dictionary
tfidf_vec_title = TfidfVectorizer(vocabulary=vocab_title)

# Fit and transform titles using the TfidfVectorizer
X_tfidf_title = tfidf_vec_title.fit_transform(titles)

### 2.4 Generating One-hot Vectors

a. Descriptions One-hot Vector

In [None]:
# Create a Binary CountVectorizer object with the vocabulary from the vocab dictionary
one_hot_vec_des = CountVectorizer(vocabulary=vocab_des, binary=True)

# Fit and transform the descriptions using the one_hot_vector
X_one_des = one_hot_vec_des.fit_transform(descriptions)

b. Titles One-hot Vector

In [None]:
# Create a Binary CountVectorizer object with the vocabulary from the vocab_title dictionary
one_hot_vec_title = CountVectorizer(vocabulary=vocab_title, binary=True)

# Fit and transform the descriptions using the CountVectorizer
X_one_tilte = one_hot_vec_title.fit_transform(titles)

### 2.5 Generating Word2Vec Vectors

In [10]:
# Train the Word2Vec model on the tokenized descriptions
word2vecdes_model = Word2Vec(sentences=descriptions, vector_size=100, window=5, min_count=1, workers=4)

In [11]:
# Train the Word2Vec model on the tokenized titles
word2vectitle_model = Word2Vec(sentences=titles, vector_size=100, window=5, min_count=1, workers=4)

Count vectors exported to count_vectors.txt


### 2.6 Saving the Vector Representation

In [None]:
# Open the file for writing the word vectors
output_file = 'count_vectors.txt'
with open(output_file, 'w') as file:
    # Iterate over each description and its corresponding webindex
    for idx, webindex in enumerate(webindexes):
        # Get the sparse representation of the description
        sparse_row = X_c_des[idx]
        non_zero_indices = sparse_row.nonzero()[1]
        sparse_represent = []
        for word_idx in non_zero_indices:
            word_count = sparse_row[0, word_idx]
            sparse_represent.append(f"{word_idx}:{word_count}")
        # Format line and write to the file
        line = f"#{webindex}," + ','.join(sparse_represent) + '\n'
        file.write(line)

## Task 3. Job Advertisement Classification

## Split data into train and test sets

In [12]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
seed = 0

num_folds = 5
kf = KFold(n_splits= num_folds, random_state=seed, shuffle = True)

In [13]:
def evaluate(X_train,X_test,y_train, y_test,seed):
    model = LogisticRegression(random_state=seed)
    model.fit(X_train, y_train)
    return model.score(X_test, y_test)

In [14]:
def evaluate_based_on_kf(count_df, tfidf_df, onehot_df, num_of_folds):
    title_cv_df = pd.DataFrame(columns = ['count','tfidf','onehot'],index=range(num_of_folds))

    fold = 0
    for train_index, test_index in kf.split(list(range(0,len(labels)))):
        y_train = [labels[i] for i in train_index]
        y_test = [labels[i] for i in test_index]

        title_cv_df.loc[fold,'count'] = evaluate(count_df[train_index],count_df[test_index],y_train,y_test,seed)
        
        title_cv_df.loc[fold,'tfidf'] = evaluate(tfidf_df[train_index],tfidf_df[test_index],y_train,y_test,seed)

        title_cv_df.loc[fold,'onehot'] = evaluate(onehot_df[train_index],onehot_df[test_index],y_train,y_test,seed)
        
        fold +=1
    return title_cv_df

## Classification using LogisticRegression on description

In [15]:
num_models = 3
cv_df = evaluate_based_on_kf(X_c, X_tfidf, X_one, num_folds)

In [16]:
cv_df

Unnamed: 0,count,tfidf,onehot
0,0.775641,0.794872,0.782051
1,0.76129,0.787097,0.76129
2,0.76129,0.767742,0.774194
3,0.670968,0.780645,0.722581
4,0.806452,0.845161,0.8


In [30]:
cv_df['tfidf'].mean()

0.7951033912324235

## Classification using LogisticRegression on Title

1. Preprocess titles

In [17]:
# Load stop words
with open('stopwords_en.txt', 'r') as f:
    stopwords = set(f.read().splitlines())

# List to store the results
results = []

# Preprocess the text and store the results
for directory in directories:
    for filename in os.listdir(directory):
        if filename.endswith(".txt"):
            filepath = os.path.join(directory, filename)
            with open(filepath, 'r', encoding= 'unicode_escape') as file:
                content = file.read()
                title = content.split("Title: ")[1].split("\n")[0]
                webindex = content.split("Webindex: ")[1].split("\n")[0]
                description = content.split("Description: ")[1].strip()
                # Store the result
                results.append({
                    'Title': title,
                    'Webindex': webindex,
                    'Description': description
                })


In [18]:
# Dict to store vocab created from titles and titles + descriptions
title_vocab = {}
all_vocab = {}


# List to store temporary tokens
tokens = []
tokens_lower = []

# Token's regex pattern
pattern = r"[a-zA-Z]+(?:[-'][a-zA-Z]+)?"
tokenizer = RegexpTokenizer(pattern) 

# Tokenize titles
for title in titles:
    token = tokenizer.tokenize(description)
    tokens.append(token)

# Filter tokens that only has one word, then lowercase transform them
for tokene in tokens:
    token_list = []
    for token in tokene:
        if len(token) >= 2 and token not in stopwords:
            token_list.append(token)
        token_list.append(token.lower())
    tokens_lower.append(token_list)

In [19]:
# Create a dictonary that contain the frequency of term
term_frequency = defaultdict(int)

for tokens in tokens_lower:
    for token in tokens: 
        term_frequency[token] += 1

tokens_more_than_1 = []

# Remove tokens that only appears once
for tokens in tokens_lower:
    tokens_filtered_freq = []
    for token in tokens:
        if term_frequency[token] > 1:
            tokens_filtered_freq.append(token)
    tokens_more_than_1.append(tokens_filtered_freq)

In [20]:
# Create a dictonary that contain the document frequency
document_frequency = defaultdict(int)

for document in tokens_more_than_1:
    for word in document:
        document_frequency[word] += 1
        
# Remove tokens that appears more than 50 times
more_than_50 = []
for word, count in document_frequency.items():
    more_than_50.append((word,count))
    more_than_50.sort(key=lambda x: x[1], reverse=True)
    if len(more_than_50) > 50:
        more_than_50.pop()


In [21]:
final_list = []
# Finalise tokens list, filter out too short tokens and tokens that appears too many times
for tokens in tokens_more_than_1:
    token_list = []
    for token in tokens:
        if token not in more_than_50:
            token_list.append(token)
    final_list.append(token_list)

In [22]:
# Build the titles vocabulary
title_vocabulary = []
for tokens in final_list:
    for token in tokens:
        if token not in title_vocabulary:
            title_vocabulary.append(token)

for idx in range(len(title_vocabulary)):
    word = title_vocabulary[idx]
    title_vocab[word] = int(idx)
    if word not in document_vocabulary:
        document_vocabulary.append(word)

# Add new word to the document vocabulary
for idx in range(len(document_vocabulary)):
    word = document_vocabulary[idx]
    all_vocab[word] = int(idx)


### Count Vector

In [23]:
# Initialize the CountVectorizer with the custom vocabulary
c_title_vector = CountVectorizer(vocabulary=title_vocab)
c_doc_vector = CountVectorizer(vocabulary=all_vocab)

# Fit and transform the descriptions using the c_vector
X_title_c = c_vector.fit_transform(titles)
X_doc_c = c_vector.fit_transform(all_doc)

### TF-IDF Vector

In [24]:
# Initialize the TfidfVectorizer with the custom vocabulary
tfidf_title_vector = TfidfVectorizer(vocabulary=title_vocab)
tfidf_doc_vector = TfidfVectorizer(vocabulary=all_vocab)

# Fit and transform the descriptions using the TfidfVectorizer
X_title_tfidf = tfidf_vector.fit_transform(titles)
X_doc_tfidf = tfidf_vector.fit_transform(all_doc)

### One-hot Vector

In [25]:
# Initialize the CountVectorizer with the custom vocabulary and binary option
one_hot_title_vector = CountVectorizer(vocabulary=title_vocab, binary=True)
one_hot_doc_vector = CountVectorizer(vocabulary=all_vocab, binary=True)

# Fit and transform the descriptions using the CountVectorizer
X_title_one = one_hot_vector.fit_transform(titles)
X_doc_one = one_hot_vector.fit_transform(all_doc)

## Classification using LogisticRegression on title

In [26]:
num_models = 3
title_cv_df = evaluate_based_on_kf(X_title_c, X_title_tfidf, X_title_one, num_folds)


In [27]:
title_cv_df

Unnamed: 0,count,tfidf,onehot
0,0.705128,0.711538,0.705128
1,0.787097,0.8,0.780645
2,0.741935,0.767742,0.748387
3,0.787097,0.787097,0.787097
4,0.787097,0.787097,0.774194


In [31]:
title_cv_df['tfidf'].mean()

0.7706947890818858

## Classification on title and description

In [28]:
num_models = 3
doc_cv_df = evaluate_based_on_kf(X_doc_c, X_doc_tfidf, X_doc_one, num_folds)


In [29]:
doc_cv_df

Unnamed: 0,count,tfidf,onehot
0,0.480769,0.487179,0.448718
1,0.535484,0.554839,0.516129
2,0.535484,0.606452,0.503226
3,0.580645,0.6,0.522581
4,0.529032,0.574194,0.503226


In [32]:
doc_cv_df['tfidf'].mean()

0.5645326716294459

## Summary
### Q1: Language model comparisons:
From 'cv_df' DataFrame, we can clearly see that the tfidf vector gives the best result (highest accuracy is 0.85 while average is -.79).

The model accuracy on 5-fold test is the highest on every fold. The onehot vector and count vector is pretty similar but the onehot vector is slightly better.
### Q2: Impact of amount of information on the accuracy:
Different approach of the data gives us different vocabulary to work with. In this assignment, we created 3 vocabulary based on 3 approaches:
1. Build vocabulary based on titles
2. Build vocabulary based on descriptions
3. Build vocabulary based on titles AND descriptions

The results of approaches (1) and (2) are not very differnt from each other (best model works on tfidf vector, the average accuracy is nearly 0.8) with the apprach (2) is slightly better.

The approach (3) has the most information (gather text from both titles and descriptions) but it's performance is not as good as the other two.

We can see that the descriptions can generate bigger vocabulary that the titles and their combination will generate an even bigger one. But the accuracy only improves for the case of titles to descriptions

From the result above, it is safe to say that more information is not always improve the accuracy

