## Introduction
We wanted to explore whether clustering methods can be used to provide job recommendations. Ideally, users would input their skills and the recommender system would output jobs from the cluster closest to that input. The method explored here is Latent Dirichlet Allocation (LDA), a topic model that groups the jobs by their descriptions and provides the key terms that characterise each cluster.

In [None]:
##pip3 install spacy
##python3 -m spacy download en_core_web_sm

In [17]:
#Import Packages
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
import pandas as pd
import numpy as np
import spacy

In [18]:
#Initialize the spaCy model for POS tagging and lemmatization (parser and NER are disabled for speed)
nlp = spacy.load("en_core_web_sm", disable=['parser', 'ner'])



In [19]:
#Read data on jobs
df = pd.read_csv('../../Datasets/processed_data.csv')

In [None]:
#Overview of the dataset
df.head()

In [21]:
#Lemmatize the cleaned descriptions and join the lemmas back into one string per job
#nlp.pipe processes the texts as a stream, which is faster than calling nlp() row by row
df['lemmatized_tokens'] = [" ".join(token.lemma_ for token in doc) for doc in nlp.pipe(df['description_clean'])]

In [None]:
df.sample(5)

In [30]:
#Using count vectorizer to obtain term frequencies for each document in "lemmatized_tokens" column
#min_df sets the minimum number of documents a term must appear in (document frequency, not total occurrences).
#It is set to 80 because most terms below this threshold were junk words.
vectorizer = CountVectorizer(min_df = 80)
X = vectorizer.fit_transform(df['lemmatized_tokens'])

In [None]:
#Check each word and its total frequency across all documents, sorted in ascending order.
word_list = vectorizer.get_feature_names_out()
count_list = X.toarray().sum(axis=0)
word_dict = dict(zip(word_list,count_list))
{k: v for k, v in sorted(word_dict.items(), key=lambda item: item[1])}

In [33]:
#Cluster the job description into 20 clusters using LDA model
#The number of clusters was chosen by checking for overlaps in key terms between topics, which would indicate that the number of topics was too high
num_topics = 20
lda = LatentDirichletAllocation(n_components=num_topics, max_iter=5, learning_method='online',random_state=0, verbose =1).fit(X)

iteration: 1 of max_iter: 5
iteration: 2 of max_iter: 5
iteration: 3 of max_iter: 5
iteration: 4 of max_iter: 5
iteration: 5 of max_iter: 5
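
The overlap check used to pick the number of topics was done by eye, but it could also be quantified. The following is a rough sketch, not the procedure used in this notebook: it computes the Jaccard overlap between the top terms of every pair of topics. The helper `top_term_overlap` and the toy weight matrix are hypothetical and exist only for illustration.

```python
from itertools import combinations

import numpy as np


def top_term_overlap(components, feature_names, n_top=20):
    """Return the Jaccard overlap of the top-n terms for every pair of topics."""
    top_terms = [
        {feature_names[i] for i in np.argsort(row)[::-1][:n_top]}
        for row in components
    ]
    return {
        (a, b): len(top_terms[a] & top_terms[b]) / len(top_terms[a] | top_terms[b])
        for a, b in combinations(range(len(top_terms)), 2)
    }


# Hypothetical 3-topic weight matrix over 4 terms, for illustration only.
toy_components = np.array([
    [9.0, 8.0, 1.0, 0.5],
    [8.5, 9.5, 0.2, 1.0],
    [0.1, 0.3, 7.0, 6.0],
])
toy_terms = ["risk", "credit", "code", "cloud"]

# Topics 0 and 1 share both of their top-2 terms; topic 2 shares none.
overlaps = top_term_overlap(toy_components, toy_terms, n_top=2)
print(overlaps)
```

On the fitted model, the same helper could be called as `top_term_overlap(lda.components_, vectorizer.get_feature_names_out())`; pairs with a high overlap would suggest that `num_topics` is too large.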


In [34]:
#transform the matrix of term frequencies into a list of probabilities for each topic
transformed = lda.transform(X)

In [35]:
transformed.shape

(6243, 20)

In [36]:
#function to display the key words for each topic
def display_topics(model, feature_names, no_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print (f"Topic:{topic_idx}")
        print (" ".join([feature_names[i]
                        for i in topic.argsort()[:-no_top_words - 1:-1]]))

In [37]:
#display the top 20 terms for each cluster
display_topics(lda,vectorizer.get_feature_names_out(),20)

Topic:0
solution business technology technical experience team ai project cloud work lead partner development stakeholder drive need develop role enterprise understand
Topic:1
risk management business financial client team market support experience investment control work skill compliance include finance credit regulatory provide global
Topic:2
system support operation provide experience service issue work sap management process maintenance level network perform knowledge maintain ensure incident center
Topic:3
project ensure process management plan manage quality cost budget planning report analysis delivery material review prepare production work forecast standard
Topic:4
work team world opportunity help people build service make technology company we look experience well join global product singapore new
Topic:5
experience design software development system application team work knowledge code good architecture develop technology technical cloud engineering web database platform
Top

In [38]:
#produce a column denoting which cluster each job description belongs to
arg_max = transformed.argmax(axis=1)
df['cluster'] = arg_max
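
With a cluster label on every job, the recommendation step described in the introduction could look roughly like the sketch below. It runs on a hypothetical toy dataset so that it is self-contained; the function `recommend` and all `toy_*` names are assumptions made for illustration, and the same idea would apply to `df`, `vectorizer` and `lda` from this notebook.

```python
import pandas as pd
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-in for the lemmatized job descriptions.
toy_jobs = pd.DataFrame({
    "title": ["Backend Developer", "Credit Risk Analyst", "Web Developer", "Compliance Officer"],
    "lemmatized_tokens": [
        "software code cloud database application develop",
        "risk credit compliance financial investment market",
        "web software code application architecture develop",
        "compliance regulatory risk financial control credit",
    ],
})

toy_vectorizer = CountVectorizer()
toy_X = toy_vectorizer.fit_transform(toy_jobs["lemmatized_tokens"])
toy_lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(toy_X)

# Assign each job to its most probable topic, as done for df['cluster'].
toy_jobs["cluster"] = toy_lda.transform(toy_X).argmax(axis=1)


def recommend(skills, jobs, vectorizer, model, n=5):
    """Return up to n jobs from the topic that best matches the skills text."""
    topic_probs = model.transform(vectorizer.transform([skills]))[0]
    best_topic = int(topic_probs.argmax())
    return jobs[jobs["cluster"] == best_topic].head(n), best_topic


recommended, user_topic = recommend("software code cloud", toy_jobs, toy_vectorizer, toy_lda)
print(user_topic, recommended["title"].tolist())
```

The quality of such recommendations depends on how well the topic key terms line up with the skills a user would type in.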

In [None]:
df.head()

In [40]:
#Checking the distribution of the clusters
df['cluster'].value_counts()

12    1050
5      744
4      741
1      730
10     496
11     453
18     354
19     284
9      280
17     240
7      211
2      184
0      125
13      97
8       80
3       77
14      59
15      20
16      18
Name: cluster, dtype: int64

In [None]:
#Sample jobs from one cluster and, together with its key terms, hypothesize a label for that cluster
df[df['cluster']==19].sample(5)

Hypothesized Clusters
<li>0: Unknown </li>
<li>1: Banking</li>
<li>2: IT </li>
<li>3: Production</li>
<li>4: Unknown</li>
<li>5: Software Developer</li>
<li>6: Unknown</li>
<li>7: Unknown</li>
<li>8: Unknown</li>
<li>9: Unknown</li>
<li>10: Unknown</li>
<li>11: Product Manager</li>
<li>12: Data Science</li>
<li>13: Sales</li>
<li>14: Cybersecurity</li>
<li>15: DBS </li>
<li>16: Electrical Engineer </li>
<li>17: Tiktok/ByteDance</li>
<li>18: Unknown</li>
<li>19: AI/NLP roles</li>

## Conclusion
To conclude, most of the key terms for each topic were not skills but other parts of the job description, such as company information or day-to-day responsibilities. Matching a user's input skills to the closest cluster would therefore yield very poor results. A method for extracting skills from the job descriptions would greatly improve the viability of such a recommender system. Some clusters were also difficult to identify. Nonetheless, the model can still generate key terms for each identifiable cluster (which may correspond to an industry or a job title). These terms might help users who are crafting resumes, as resumes containing relevant key terms tend to pass automated screening systems more easily.

In [127]:
#Obtaining good clusters with well defined key terms
clusters = ["banking", "it","project manager", "software developer","product manager","data science","sales","cybersecurity","electrical engineer","ai and nlp"]
relevant_index =[1,2,3,5,11,12,13,14,16,19]

In [128]:
#Building a dataframe of job titles and key terms
result_dict={}
result_dict['title'] = clusters
key_terms=[]
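#Look up the feature names once, then collect the top 20 terms of each relevant topic
feature_names = vectorizer.get_feature_names_out()
for topic_idx in relevant_index:
    topic = lda.components_[topic_idx]
    key_terms.append([feature_names[i] for i in topic.argsort()[:-20 - 1:-1]])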
result_dict['key_terms']=key_terms

result_df = pd.DataFrame(result_dict)

    

In [None]:
result_df

In [29]:
result_df.to_csv("key_terms.csv", index=False)