# Assignment 2: Milestone I Natural Language Processing
## Task 2&3
#### Student Name: Nguyen Duc Quang
#### Student ID: 3927198

Date: 01 Oct 2023

Version: 1.0

Environment: Python 3 and Jupyter notebook

Libraries used:
* sklearn
* gensim
* numpy
* pandas

## Introduction
The tasks below demonstrate the steps to generate count vectors and weighted and unweighted document vectors, and then use them to build and evaluate machine learning classification models.

## Importing libraries 

In [1]:
# Import the libraries used in this notebook
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from gensim.models.fasttext import FastText
from sklearn.datasets import load_files
from sklearn.model_selection import KFold
import numpy as np
import pandas as pd

## Task 2. Generating Feature Representations for Job Advertisement Descriptions

### Generate count vector

In [2]:
# Read vocab.txt to get the pre-saved vocabulary:
# keep the word before ':' on each non-empty line
with open('./vocab.txt', 'r') as file:
    vocab = [line.split(':')[0] for line in file.read().splitlines() if line]

In [3]:
# Get the webindexes 
with open('web_indexes.txt', 'r') as file:
    web_indexes = file.read().splitlines()

In [4]:
# Get the preprocessed jobs
def to_string_list(filename):
    with open(filename, 'r') as file:
        string_list = file.read().splitlines()
        return string_list
string_list = to_string_list('preprocessed.txt')

In [5]:
# Initialise the CountVectorizer with the pre-saved vocabulary
cVectorizer = CountVectorizer(analyzer="word", vocabulary=vocab)

In [6]:
# Create the count feature representation
count_features = cVectorizer.fit_transform(string_list) 

In [9]:
count_features

<776x5168 sparse matrix of type '<class 'numpy.int64'>'
	with 62166 stored elements in Compressed Sparse Row format>
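
As a minimal sketch of what the fixed `vocabulary` argument does, the toy example below shows that column j of the count matrix corresponds to the j-th vocabulary entry and that tokens outside the vocabulary are ignored. The documents and the three-word vocabulary are made up for illustration and are not taken from the assignment data.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy vocabulary and documents (illustrative assumptions, not the assignment data)
toy_vocab = ["accountant", "fund", "london"]
toy_docs = ["fund accountant fund", "london nurse"]

toy_vectorizer = CountVectorizer(analyzer="word", vocabulary=toy_vocab)
toy_counts = toy_vectorizer.fit_transform(toy_docs)

# Column order follows toy_vocab; "nurse" is not in the vocabulary and is dropped
print(toy_counts.toarray())
```

The same reasoning explains the shape above: 776 rows for the job advertisements and 5168 columns for the entries in `vocab.txt`.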

### Generate document vector 

#### Train FastText model

In [21]:
# Function to train FastText model according to corpus file
def train_Fast_Text(corpus_file):
    # Initialized FastText model
    FT = FastText(vector_size=50, min_count=2)
    FT.build_vocab(corpus_file=corpus_file)

    # Train FastText model
    FT.train(
    corpus_file=corpus_file, epochs=FT.epochs,
    total_examples=FT.corpus_count, total_words=FT.corpus_total_words,
    )
    return FT

#### Create unweighted document vectors

In [22]:
# Function to generate unweighted document vectors
def docvecs(embeddings, docs):
    # Initialise the matrix with zeros
    vecs = np.zeros((len(docs), embeddings.vector_size))

    # Loop through the documents
    for i, doc in enumerate(docs):
        # Keep only the terms that have an embedding
        valid_keys = [term for term in doc if term in embeddings.key_to_index]

        # A document with no known terms keeps its zero vector
        # (np.vstack raises a ValueError on an empty list)
        if not valid_keys:
            continue

        # Stack the term vectors and sum them
        docvec = np.vstack([embeddings[term] for term in valid_keys])
        vecs[i, :] = np.sum(docvec, axis=0)
    return vecs
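
The summing step can be illustrated without training a model. The sketch below uses `ToyEmbeddings`, a hypothetical stand-in that exposes only the attributes `docvecs` relies on (`key_to_index`, `vector_size`, and item lookup); the two-dimensional vectors are invented for illustration.

```python
import numpy as np

class ToyEmbeddings:
    """Hypothetical stand-in for gensim KeyedVectors (illustration only)."""
    def __init__(self, table):
        self.table = table
        self.key_to_index = {key: i for i, key in enumerate(table)}
        self.vector_size = len(next(iter(table.values())))

    def __getitem__(self, key):
        return self.table[key]

def sum_docvec(embeddings, doc):
    # Sum the vectors of in-vocabulary tokens; return zeros if none are known
    known = [embeddings[t] for t in doc if t in embeddings.key_to_index]
    if not known:
        return np.zeros(embeddings.vector_size)
    return np.sum(np.vstack(known), axis=0)

toy = ToyEmbeddings({"fund": np.array([1.0, 0.0]),
                     "london": np.array([0.0, 2.0])})
vec_known = sum_docvec(toy, ["fund", "fund", "london", "unknown"])
vec_empty = sum_docvec(toy, ["unknown"])
print(vec_known, vec_empty)
```

Because the vectors are summed rather than averaged, longer documents produce vectors with larger magnitude.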

#### Write TF-IDF features to a text file for later

In [23]:
# Function to write TF-IDF document vectors to a file
def write_tfidfFile(data_features,filename):
    # Get the number of documents
    num = data_features.shape[0]
    # create a txt file to save the vector representation; the with-block closes it
    with open(filename, 'w') as out_file:
        # loop through the index of each document
        for a_ind in range(0, num):
            for f_ind in data_features[a_ind].nonzero()[1]: # for each word index with a non-zero entry
                value = data_features[a_ind][0,f_ind] # retrieve the value of the entry
                out_file.write("{}:{} ".format(f_ind,value)) # write the entry as word_index:value
            out_file.write('\n') # start a new line after each document

#### Generate weights for each word in vocabulary

In [24]:
def doc_wordweights(fName_tVectors, voc_dict):
    tfidf_weights = [] # a list to store the word:weight dictionaries of documents

    with open(fName_tVectors) as tVecf:
        tVectors = tVecf.read().splitlines() # each line is a tfidf vector of a document in the format 'word_index:weight word_index:weight ...'
    for tv in tVectors: # for each tfidf document vector
        weights = tv.split() # list of 'word_index:weight' entries (empty for a document with no non-zero weights)
        weights = [w.split(':') for w in weights] # change each entry to a '[word_index, weight]' pair
        wordweight_dict = {voc_dict[int(w[0])]:w[1] for w in weights} # construct the weight dictionary, where each entry is 'word:weight'
        tfidf_weights.append(wordweight_dict)
    return tfidf_weights

#### Generate weighted document vectors

In [25]:
# Function to generate weighted document vectors
def weighted_docvecs(embeddings, tfidf, docs):
    # Initialise the matrix with zeros
    vecs = np.zeros((len(docs), embeddings.vector_size))

    # Loop through the documents
    for i, doc in enumerate(docs):
        # Keep only the terms that have an embedding
        valid_keys = [term for term in doc if term in embeddings.key_to_index]

        # A document with no known terms keeps its zero vector
        # (np.vstack raises a ValueError on an empty list)
        if not valid_keys:
            continue

        # Look up the TF-IDF weight of each term (0 if the term has no weight)
        tf_weights = [float(tfidf[i].get(term, 0.)) for term in valid_keys]

        # Multiply each term vector by its weight
        weighted = [embeddings[term] * w for term, w in zip(valid_keys, tf_weights)]

        # Stack and sum the weighted vectors
        docvec = np.vstack(weighted)
        vecs[i, :] = np.sum(docvec, axis=0)
    return vecs
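
A small worked example of the weighted sum follows. The vectors and weights are invented for illustration. It also makes one property of the loop above visible: each occurrence of a token is added separately, so a token that appears twice contributes twice its weighted vector, even though the TF-IDF weight already reflects term frequency.

```python
import numpy as np

# Toy embeddings and TF-IDF weights (illustrative assumptions)
toy_vectors = {"fund": np.array([1.0, 0.0]),
               "london": np.array([0.0, 2.0]),
               "misc": np.array([5.0, 5.0])}
toy_weights = {"fund": 0.8, "london": 0.6}   # "misc" has no weight, so it counts as 0

doc = ["fund", "fund", "london", "misc"]

# Weight every token occurrence, then sum over the document
weighted = [toy_vectors[t] * toy_weights.get(t, 0.0) for t in doc]
weighted_sum = np.sum(np.vstack(weighted), axis=0)
print(weighted_sum)
```

Here the result is 2 × 0.8 × [1, 0] + 0.6 × [0, 2] + 0 × [5, 5].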

In [26]:
mod = train_Fast_Text('./preprocessed.txt')
mod.save('preprocessedFT.model')

#### Combine all the steps above

In [27]:
def generate_doc_vecs(corpus_file_name):

    # Train FastText model and keep its word vectors
    embedding = train_Fast_Text(corpus_file=corpus_file_name).wv

    # Get string list and tokenized documents based on corpus file
    with open(corpus_file_name, 'r') as file:
        str_list = file.read().splitlines()
    tokenized = [line.split(' ') for line in str_list]

    # Create unweighted vectors
    unweighted = docvecs(embedding, tokenized)

    # Create TF-IDF features over the embedding vocabulary
    vocabulary = sorted(embedding.key_to_index.keys())
    tV = TfidfVectorizer(analyzer='word', vocabulary=vocabulary)
    tfidf_features = tV.fit_transform(str_list)
    write_tfidfFile(tfidf_features, 'temp.txt')

    # Map each column index to its word
    voc_dict = dict(enumerate(vocabulary))

    # Generate per-document word weights
    tfidfWeights = doc_wordweights("temp.txt", voc_dict)

    # Create weighted vectors
    weighted = weighted_docvecs(embedding, tfidfWeights, tokenized)
    return unweighted, weighted


In [28]:
# Generate unweighted and weighted document vectors based on the preprocessed description file
unweighted_descr, weighted_descr = generate_doc_vecs('./preprocessed.txt')
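
`generate_doc_vecs` passes the TF-IDF weights through `temp.txt` before turning them into per-document dictionaries. As an alternative sketch, not the approach used in this notebook, the same dictionaries could be built directly from the sparse matrix. The toy documents and vocabulary below are assumptions for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus and vocabulary (illustrative assumptions)
toy_docs = ["fund accountant fund", "london accountant"]
toy_vocab = sorted({"fund", "accountant", "london"})

tv = TfidfVectorizer(analyzer="word", vocabulary=toy_vocab)
tfidf = tv.fit_transform(toy_docs)

# Invert the fitted term -> column mapping
index_to_word = {index: word for word, index in tv.vocabulary_.items()}

# Walk over the non-zero entries once and fill one dictionary per document
doc_weights = [dict() for _ in toy_docs]
coo = tfidf.tocoo()
for i, j, v in zip(coo.row, coo.col, coo.data):
    doc_weights[int(i)][index_to_word[int(j)]] = float(v)

print(doc_weights)
```

This avoids the string round trip, so the weights stay floats and no temporary file is left behind.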

### Saving outputs
- count_vectors.txt

In [29]:
def write_vectorFile(data_features,filename):
    num = data_features.shape[0] # the number of documents
    # create a txt file to save the vector representation; the with-block closes it
    with open(filename, 'w') as out_file:
        for a_ind in range(0, num): # loop through each document by index
            out_file.write(f"#{web_indexes[a_ind]}") # start the line with '#' and the web index
            for f_ind in data_features[a_ind].nonzero()[1]: # for each word index with a non-zero entry
                value = data_features[a_ind][0,f_ind] # retrieve the value of the entry
                out_file.write(",{}:{}".format(f_ind,value)) # write the entry as ,word_index:value
            out_file.write('\n') # start a new line after each document

In [30]:
write_vectorFile(count_features, 'count_vectors.txt')
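
Each line of `count_vectors.txt` has the form `#webindex,word_index:count,word_index:count,...`. The sketch below shows how such a line could be read back. `parse_count_line` is a hypothetical helper and `example_line` is an invented line, not one copied from the output file.

```python
def parse_count_line(line):
    # Split "#<webindex>,<idx>:<count>,..." into the web index and a dict of counts
    head, *entries = line.strip().split(",")
    web_index = head.lstrip("#")
    counts = {}
    for entry in entries:
        idx, value = entry.split(":")
        counts[int(idx)] = int(value)
    return web_index, counts

# Invented example line in the same format
example_line = "#12345678,3:1,17:2,4021:1"
print(parse_count_line(example_line))
```

A document with no in-vocabulary words produces a line holding only the web index, which this helper parses into an empty dictionary.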

## Task 3. Job Advertisement Classification

### Q1


In [31]:
# Load labels for classification
labels = load_files(r"data").target

In [46]:
labels

array([0, 0, 2, 0, 2, 1, 2, 0, 3, 3, 0, 0, 1, 3, 1, 3, 3, 1, 3, 2, 2, 2,
       3, 3, 0, 2, 2, 2, 0, 2, 3, 1, 2, 0, 1, 3, 3, 1, 1, 0, 2, 2, 2, 2,
       0, 0, 2, 1, 3, 1, 1, 2, 2, 3, 0, 0, 1, 0, 2, 2, 3, 3, 3, 0, 3, 0,
       1, 2, 3, 1, 3, 2, 3, 1, 3, 2, 1, 3, 2, 1, 3, 2, 2, 1, 0, 1, 1, 1,
       3, 0, 3, 1, 3, 2, 2, 0, 2, 3, 2, 1, 0, 1, 1, 2, 0, 3, 0, 1, 3, 2,
       1, 2, 0, 3, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 1, 1, 1, 1, 3, 1, 1, 3,
       2, 0, 0, 1, 3, 2, 0, 1, 0, 3, 1, 2, 1, 0, 0, 0, 3, 0, 1, 2, 3, 1,
       1, 1, 2, 1, 0, 1, 0, 1, 0, 1, 1, 2, 0, 2, 2, 0, 2, 3, 2, 2, 0, 2,
       1, 0, 1, 1, 1, 3, 1, 3, 1, 0, 3, 1, 0, 2, 0, 0, 2, 1, 1, 0, 1, 3,
       0, 1, 1, 3, 0, 1, 0, 2, 3, 0, 2, 0, 1, 0, 1, 3, 1, 0, 1, 1, 0, 1,
       0, 1, 2, 1, 3, 1, 2, 3, 1, 1, 2, 0, 0, 1, 2, 0, 3, 2, 3, 2, 2, 3,
       0, 1, 1, 1, 1, 1, 1, 0, 3, 1, 1, 0, 0, 2, 1, 2, 2, 2, 2, 1, 3, 1,
       2, 1, 2, 3, 2, 3, 0, 1, 3, 0, 2, 1, 0, 2, 1, 2, 0, 2, 1, 1, 1, 2,
       2, 1, 2, 0, 2, 2, 1, 2, 0, 1, 0, 0, 3, 2, 1,

In [32]:
# Initialise a 5 fold validation
num_folds = 5
kf = KFold(n_splits= num_folds, random_state=0, shuffle = True) 
print(kf)

KFold(n_splits=5, random_state=0, shuffle=True)
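
To make the splitting concrete, the sketch below applies the same `KFold` settings to ten toy indices (an assumption for illustration; the notebook splits the 776 advertisements). Every index lands in exactly one test fold.

```python
import numpy as np
from sklearn.model_selection import KFold

toy_indices = np.arange(10)
toy_kf = KFold(n_splits=5, shuffle=True, random_state=0)

# Materialise the (train, test) index pairs so they can be inspected
folds = list(toy_kf.split(toy_indices))
for fold, (train_idx, test_idx) in enumerate(folds):
    print(fold, "train:", train_idx, "test:", test_idx)
```

Because `random_state` is fixed, the folds are the same on every run, which keeps the accuracy comparisons between feature sets on identical splits.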


In [33]:
import pickle

In [34]:
# Define function to evaluate model performance
def evaluate(X_train,X_test,y_train, y_test):
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, y_train)
    return model.score(X_test, y_test)
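
The convergence warnings printed during cross-validation suggest either raising `max_iter` or scaling the data. One possible variant, sketched here as an assumption rather than the configuration used for the reported scores, standardises the features inside a pipeline so the scaler is fitted on the training fold only. `evaluate_scaled` and the synthetic data are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def evaluate_scaled(X_train, X_test, y_train, y_test):
    # Hypothetical variant of evaluate(): standardise features, then fit
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    model.fit(X_train, y_train)
    return model.score(X_test, y_test)

# Synthetic, deliberately large-valued two-class data (illustration only)
rng = np.random.default_rng(0)
X_toy = np.vstack([rng.normal(0, 1, (40, 5)), rng.normal(3, 1, (40, 5))]) * 100
y_toy = np.array([0] * 40 + [1] * 40)

toy_score = evaluate_scaled(X_toy[::2], X_toy[1::2], y_toy[::2], y_toy[1::2])
print(toy_score)
```

Whether scaling changes the accuracies in this notebook would need to be checked by rerunning the cross-validation.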

In [35]:
# Create a dataframe to store the accuracy scores of every fold
cv_df = pd.DataFrame(columns = ['description', 'weighted description'],index=range(num_folds))

# Execute cross validation
for fold, (train_index, test_index) in enumerate(kf.split(labels)):
    y_train, y_test = labels[train_index], labels[test_index]

    # Unweighted description vectors
    cv_df.loc[fold,'description'] = evaluate(unweighted_descr[train_index],unweighted_descr[test_index],y_train,y_test)

    # TF-IDF weighted description vectors
    cv_df.loc[fold,'weighted description'] = evaluate(weighted_descr[train_index],weighted_descr[test_index],y_train,y_test)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

In [36]:
# Performance summary
cv_df

| fold | description | weighted description |
|------|-------------|----------------------|
| 0 | 0.794872 | 0.75 |
| 1 | 0.877419 | 0.76129 |
| 2 | 0.812903 | 0.8 |
| 3 | 0.793548 | 0.767742 |
| 4 | 0.780645 | 0.748387 |


In [37]:
cv_df.mean()

description             0.811878
weighted description    0.765484
dtype: float64

#### Conclusion of Q1
In all five folds the unweighted description vectors outperform the TF-IDF-weighted ones, by between about 1 and 12 percentage points per fold (roughly 4.6 points on average).
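
The per-fold gap can be checked directly from the accuracies in the table above. The short sketch below copies those values by hand and computes the difference for each fold and on average.

```python
import numpy as np

# Fold accuracies copied from the Q1 performance summary
unweighted_acc = np.array([0.794872, 0.877419, 0.812903, 0.793548, 0.780645])
weighted_acc = np.array([0.75, 0.76129, 0.8, 0.767742, 0.748387])

diff = unweighted_acc - weighted_acc
print("per-fold gap:", diff.round(3))
print("smallest gap:", diff.min().round(3))
print("largest gap:", diff.max().round(3))
print("mean gap:", diff.mean().round(3))
```

With only five folds these gaps are indicative rather than a formal significance test.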

### Q2

#### Vectors with only title

In [38]:
unweighted_title, weighted_title = generate_doc_vecs('./titles.txt')

#### Vectors with descriptions and titles

In [39]:
with open('./titles.txt', 'r') as file:
    tokenized_titles = [line.split(' ') for line in file.read().splitlines()]

In [40]:
with open('./preprocessed.txt', 'r') as file:
    tokenized_articles = [line.split(' ') for line in file.read().splitlines()]

In [41]:
tokenized_titles_articles = [title + (article) for title, article in zip(tokenized_titles, tokenized_articles)]
tokenized_titles_articles

[['bromley',
  'accountant',
  'partqualified',
  'south',
  'east',
  'london',
  'manufacturing',
  'requirement',
  'accountant',
  'permanent',
  'modern',
  'offices',
  'south',
  'east',
  'london',
  'credit',
  'control',
  'purchase',
  'ledger',
  'daily',
  'collection',
  'debts',
  'phone',
  'letter',
  'email',
  'handling',
  'ledger',
  'accounts',
  'handling',
  'accounts',
  'negotiating',
  'payment',
  'terms',
  'cash',
  'reconciliation',
  'accounts',
  'adhoc',
  'administration',
  'duties',
  'person',
  'ideal',
  'previous',
  'credit',
  'control',
  'capacity',
  'possess',
  'exceptional',
  'customer',
  'communication',
  'part',
  'fully',
  'qualified',
  'accountant',
  'considered'],
 ['fund',
  'fund',
  'hedge',
  'funds',
  'london',
  'recruiting',
  'fund',
  'accountant',
  'paying',
  'outstanding',
  'west',
  'end',
  'report',
  'head',
  'fund',
  'accounting',
  'number',
  'fund',
  'accountants',
  'senior',
  'fund',
  'accountants

In [42]:
# Write one combined title + description document per line
titles_descriptions_file = 'titles_descriptions.txt'
with open(titles_descriptions_file, 'w') as open_file:
    open_file.write('\n'.join(' '.join(line) for line in tokenized_titles_articles))

In [43]:
unweighted_title_descr, weighted_title_descr = generate_doc_vecs('./titles_descriptions.txt')

In [44]:
mod = LogisticRegression(max_iter=1000)
mod.fit(unweighted_title_descr, labels)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [45]:
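# Save the trained model; the with-block closes the file after writing
with open('model.pkl', 'wb') as model_file:
    pickle.dump(mod, model_file)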
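
As a sanity check on the saving step, the sketch below pickles a small model and confirms that the restored copy makes the same predictions. It uses an in-memory byte string and synthetic data, both assumptions for illustration, instead of `model.pkl`.

```python
import pickle
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic two-class data (illustration only)
rng = np.random.default_rng(0)
X_toy = np.vstack([rng.normal(-2, 1, (20, 3)), rng.normal(2, 1, (20, 3))])
y_toy = np.array([0] * 20 + [1] * 20)
toy_model = LogisticRegression(max_iter=1000).fit(X_toy, y_toy)

# Serialise to bytes and restore, mirroring pickle.dump / pickle.load on a file
blob = pickle.dumps(toy_model)
restored = pickle.loads(blob)

same = np.array_equal(toy_model.predict(X_toy), restored.predict(X_toy))
print(same)
```

The pickle stores only the classifier. New documents would still need to be turned into vectors with the same embeddings the classifier was trained on.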

In [84]:
# Feature sets to compare: title only, description only, and both combined
feature_sets = {
    'title': unweighted_title,
    'weighted title': weighted_title,
    'description': unweighted_descr,
    'weighted description': weighted_descr,
    'both': unweighted_title_descr,
    'weighted both': weighted_title_descr,
}

# Create a dataframe to store the accuracy scores of every fold
cv_df = pd.DataFrame(columns=list(feature_sets), index=range(num_folds))

# Execute cross validation on the same folds for every feature set
for fold, (train_index, test_index) in enumerate(kf.split(labels)):
    y_train, y_test = labels[train_index], labels[test_index]
    for name, X in feature_sets.items():
        cv_df.loc[fold, name] = evaluate(X[train_index], X[test_index], y_train, y_test)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

In [85]:
cv_df

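| fold | title | weighted title | description | weighted description | both | weighted both |
|------|-------|----------------|-------------|----------------------|------|---------------|
| 0 | 0.282051 | 0.282051 | 0.794872 | 0.74359 | 0.820513 | 0.730769 |
| 1 | 0.316129 | 0.316129 | 0.845161 | 0.76129 | 0.845161 | 0.780645 |
| 2 | 0.296774 | 0.296774 | 0.832258 | 0.793548 | 0.8 | 0.819355 |
| 3 | 0.303226 | 0.303226 | 0.774194 | 0.76129 | 0.780645 | 0.76129 |
| 4 | 0.290323 | 0.290323 | 0.793548 | 0.748387 | 0.8 | 0.741935 |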


In [86]:
cv_df.mean()

title                   0.297701
weighted title          0.297701
description             0.808007
weighted description    0.761621
both                    0.809264
weighted both           0.766799
dtype: float64

#### Conclusion of Q2
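The results above show that the title-only vectors perform far worse than the rest (mean accuracy of about 0.30 against about 0.81 for descriptions), most likely because a title contains very few tokens. The weighted and unweighted title vectors score identically in every fold. The description-only and combined title-plus-description vectors barely differ (0.808 against 0.809 for the unweighted versions), because adding the title contributes only a handful of extra tokens.


In conclusion, in this experiment giving the model more text improved accuracy, although the gain from adding titles to descriptions was negligible.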
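
To put the title-only accuracy of roughly 0.30 in context, it helps to compare it with a majority-class baseline. The sketch below shows how such a baseline could be computed with scikit-learn's `DummyClassifier`. The label counts are hypothetical, so the printed number is not the baseline for this dataset.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical label distribution over four classes (illustration only)
toy_labels = np.array([0] * 10 + [1] * 15 + [2] * 10 + [3] * 5)
toy_X = np.zeros((len(toy_labels), 1))   # features are ignored by the baseline

# Always predict the most frequent class
baseline = DummyClassifier(strategy="most_frequent").fit(toy_X, toy_labels)
baseline_acc = baseline.score(toy_X, toy_labels)
print(baseline_acc)
```

Running the same baseline on the real `labels` within the cross-validation folds would show how much, if at all, the title vectors improve on guessing the largest class.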

## Summary
In short, unweighted document vectors gave higher mean accuracy than TF-IDF-weighted ones for descriptions and for titles plus descriptions, while the two tied on titles alone. The amount of text fed into the model also mattered: title-only vectors scored far below description-based ones, whereas adding titles to descriptions changed accuracy only marginally.