## Introduction
We wanted to further explore whether clustering methods can be used to provide recommendations. Ideally, users would input their skills and the recommender system would output the jobs from the cluster closest to that input. The method we explore here is Latent Dirichlet Allocation (LDA), a topic model that groups the jobs by their descriptions and provides the key terms that characterise each cluster.

In [None]:
##pip3 install spacy
##python3 -m spacy download en_core_web_sm

In [17]:
#Import Packages
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
import pandas as pd
import numpy as np
import spacy

In [18]:
#Initialize the spaCy model for POS tagging and lemmatization (parser and NER are disabled as they are not needed)
nlp = spacy.load("en_core_web_sm", disable=['parser', 'ner'])



In [19]:
#Read data on jobs
df = pd.read_csv('../../Datasets/processed_data.csv')

In [20]:
#Overview of the dataset
df.head()

Unnamed: 0,url,job_title,description_html,description,job_type,company,location,description_tokens,description_clean,full_info_tokens,full_info_clean,duplicated,min_pay,max_pay
0,https://www.mycareersfuture.gov.sg/job/custome...,PRODUCTION CONTROL MANAGER,<p><strong>JOB DESCIPTION</strong></p>\n<ul>\n...,JOB DESCIPTION\n\n planning and organising pr...,"Permanent, Full Time",Snl Logistics Pte Ltd,31 GUL CIRCLE 629569,"['job', 'descipt', 'plan', 'organis', 'product...",job desciption planning organising production ...,"['product', 'control', 'manag', 'snl', 'logist...",production control manager snl logistics pte l...,False,2000.0,3400.0
1,https://www.mycareersfuture.gov.sg/job/enginee...,Design Engineer ( Mechanical / Electrical),<p><strong>SUMMARY</strong></p>\n<ul>\n <li>T...,SUMMARY\n\n This position is responsible...,Full Time,Jamco Aero Design &Amp; Engineering Private Li...,Singapore,"['summari', 'posit', 'respons', 'support', 'pr...",summary position responsible supporting projec...,"['design', 'engin', 'mechan', 'electr', 'jamco...",design engineer mechanical electrical jamco ae...,False,2500.0,4500.0
2,https://www.mycareersfuture.gov.sg/job/sales/b...,Business Development Executive,<p><strong>Job description</strong></p>\n<p>Wh...,Job description\nWho we are:\nWe are a logisti...,"Part Time, Permanent",Airpak Express Pte Ltd,"TECHPLAS INDUSTRIAL BUILDING, 45 CHANGI SOUTH ...","['job', 'descript', 'logist', 'servic', 'provi...",job description logistics service provider sol...,"['busi', 'develop', 'execut', 'airpak', 'expre...",business development executive airpak express ...,False,3200.0,3500.0
3,https://www.mycareersfuture.gov.sg/job/banking...,Senior / Data Scientist,<p>The ideal candidate should have a good unde...,The ideal candidate should have a good underst...,"Permanent, Full Time",Singapore Exchange Limited,"SGX CENTRE I, 2 SHENTON WAY 068804","['ideal', 'candid', 'good', 'understand', 'bus...",ideal candidate good understanding business do...,"['senior', 'data', 'scientist', 'singapor', 'e...",senior data scientist singapore exchange limit...,False,9000.0,14000.0
4,https://www.mycareersfuture.gov.sg/job/archite...,8890-Sales Consultant [ Digital Software| Saas...,<p><strong>Sales Consultant (Digital Software)...,Sales Consultant (Digital Software)\nLocation:...,"Permanent, Full Time",The Supreme Hr Advisory Pte. Ltd.,"SHENTON HOUSE, 3 SHENTON WAY 068805","['sale', 'consult', 'digit', 'softwar', 'locat...",sales consultant digital software location jal...,"['8890', 'sale', 'consult', 'digit', 'softwar'...",8890 sales consultant digital software saas in...,False,3000.0,4500.0


In [21]:
#Lemmatize the clean description; nlp.pipe processes the texts in batches and avoids storing the spaCy Doc objects in the dataframe
df['lemmatized_tokens'] = [" ".join(token.lemma_ for token in doc) for doc in nlp.pipe(df['description_clean'])]

In [23]:
df.sample(5)

Unnamed: 0,url,job_title,description_html,description,job_type,company,location,description_tokens,description_clean,full_info_tokens,full_info_clean,duplicated,min_pay,max_pay,lemmatized_tokens
4218,https://sg.jobsdb.com/job/Ad-Policy-Manager-ef...,Ad Policy Manager - (Product and System),"<div class=""-desktop-no-padding-top"" id=""job-d...","Founded in 2012, ByteDance's mission is to ins...",Full time,TikTok,Singapore,"['found', '2012', 'byted', 'mission', 'inspir'...",founded 2012 bytedance mission inspire creativ...,"['ad', 'polici', 'manag', 'product', 'system',...",ad policy manager product system tiktok full t...,False,,,found 2012 bytedance mission inspire creativit...
5926,https://sg.jobsdb.com/job/Research-Fellow-7976...,Research Fellow (Quantitative Research) (CARE/AC),"<div class=""-desktop-no-padding-top"" id=""job-d...",Job Description The Centre for Ageing Res...,Contract,National University of Singapore,Singapore,"['job', 'descript', 'centr', 'age', 'research'...",job description centre ageing research educati...,"['research', 'fellow', 'quantit', 'research', ...",research fellow quantitative research care ac ...,False,,,job description centre age research education ...
5342,https://sg.jobsdb.com/job/VP-AFC-Business-aaad...,"VP, - AFC Business Advisory, Group Retail","<div class=""-desktop-no-padding-top"" id=""job-d...",About UOB\n\nUnited Overseas Bank Limited (UOB...,Full time,United Overseas Bank,Singapore,"['uob', 'unit', 'oversea', 'bank', 'limit', 'u...",uob united overseas bank limited uob leading b...,"['vp', 'afc', 'busi', 'advisori', 'group', 're...",vp afc business advisory group retail united o...,False,,,uob united overseas bank limit uob lead bank a...
3632,https://sg.jobsdb.com/job/Research-Fellow-7923...,Research Fellow/Associate (Ref:NHCS/RF/ZLLB) [...,"<div class=""-desktop-no-padding-top"" id=""job-d...",We present an opportunity to contribute in the...,Contract,National Heart Centre Of Singapore,Singapore,"['present', 'opportun', 'contribut', 'research...",present opportunity contribute research arm wo...,"['research', 'fellow', 'associ', 'ref', 'nhc',...",research fellow associate ref nhcs rf zllb cvs...,False,3800.0,7600.0,present opportunity contribute research arm wo...
431,https://www.mycareersfuture.gov.sg/job/informa...,Software Analyst (SAP),"<ul>\n <li><strong>Up to $5000, Permanent Rol...","\n Up to $5000, Permanent Role with AWS and 3...",Full Time,Triton Ai Pte. Ltd.,"INTERNATIONAL PLAZA, 10 ANSON ROAD 079903","['5000', 'perman', 'role', 'aw', '3', 'month',...",5000 permanent role aws 3 months variable bonu...,"['softwar', 'analyst', 'sap', 'triton', 'ai', ...",software analyst sap triton ai pte ltd full ti...,False,3300.0,5500.0,5000 permanent role aws 3 month variable bonus...


In [30]:
#Use CountVectorizer to obtain term frequencies for each document in the "lemmatized_tokens" column
#min_df sets the minimum number of documents a term must appear in (document frequency, not total occurrences). It is set to 80 because most terms below this threshold were junk words.
vectorizer = CountVectorizer(min_df = 80)
X = vectorizer.fit_transform(df['lemmatized_tokens'])

In [31]:
#Check each word and its total frequency across all documents, sorted in ascending order
word_list = vectorizer.get_feature_names_out()
count_list = X.toarray().sum(axis=0)
word_dict = dict(zip(word_list,count_list))
{k: v for k, v in sorted(word_dict.items(), key=lambda item: item[1])}

{'additionally': 81,
 'amongst': 81,
 'dozen': 81,
 'tackle': 82,
 'autonomy': 83,
 'comment': 83,
 'confident': 83,
 'eligibility': 83,
 'fail': 83,
 'http': 83,
 'majority': 83,
 'motivation': 83,
 'panda': 83,
 'professionally': 83,
 'unless': 83,
 'variable': 83,
 'affect': 84,
 'rolling': 84,
 'tomorrow': 84,
 'tuning': 84,
 'withdraw': 84,
 'apart': 85,
 'away': 85,
 'ever': 85,
 'html5': 85,
 'initial': 85,
 'matrix': 85,
 'minimize': 85,
 'simplify': 85,
 'choice': 86,
 'collective': 86,
 'commensurate': 86,
 'eco': 86,
 'instead': 86,
 'intellectual': 86,
 'section': 86,
 'still': 86,
 'windows': 86,
 'evolution': 87,
 'harness': 87,
 'parameter': 87,
 'behind': 88,
 'demonstrable': 88,
 'employ': 88,
 'enter': 88,
 'grasp': 88,
 'ineligible': 88,
 'manipulation': 88,
 'medically': 88,
 'pyspark': 88,
 'resso': 88,
 '80': 89,
 'convert': 89,
 'fill': 89,
 'jenkin': 89,
 'regularly': 89,
 'remove': 89,
 'adjust': 90,
 'externally': 90,
 'integral': 90,
 'mumbai': 90,
 'administ

In [33]:
#Cluster the job descriptions into 20 clusters using an LDA model
#The number of clusters was chosen by checking for overlaps in key terms between topics, which would indicate that the number of topics was too high
num_topics = 20
lda = LatentDirichletAllocation(n_components=num_topics, max_iter=5, learning_method='online', random_state=0, verbose=1).fit(X)

iteration: 1 of max_iter: 5
iteration: 2 of max_iter: 5
iteration: 3 of max_iter: 5
iteration: 4 of max_iter: 5
iteration: 5 of max_iter: 5


In [34]:
#Transform the term-frequency matrix into a distribution over topics for each document (one row per job, one column per topic)
transformed = lda.transform(X)

In [35]:
transformed.shape

(6243, 20)
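
To make the shape above concrete, and to sketch how a user's skills could be mapped to a cluster as described in the introduction, here is a minimal example on a made-up corpus. The corpus, the query, and the names `toy_vectorizer`, `toy_lda`, and `closest_topic` are assumptions for illustration only; this is not the recommender itself.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus standing in for df['lemmatized_tokens'].
docs = [
    "python sql model data analytic",
    "data model statistic python",
    "customer sale account client",
    "sale customer target client",
]
toy_vectorizer = CountVectorizer()
toy_X = toy_vectorizer.fit_transform(docs)

toy_lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(toy_X)

# One row per document, one column per topic; each row sums to 1.
doc_topic = toy_lda.transform(toy_X)

# Project a user's skills with the already fitted vectorizer and model
# (transform only, no refitting), then take the most probable topic.
query = ["python sql"]
query_topic = toy_lda.transform(toy_vectorizer.transform(query))
closest_topic = int(query_topic.argmax(axis=1)[0])

print(doc_topic.shape)       # (4, 2)
print(doc_topic.sum(axis=1)) # each entry is approximately 1.0
print(closest_topic)
```

Jobs whose own most probable topic equals `closest_topic` would then be the candidates to recommend.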

In [36]:
#function to display the key words for each topic
def display_topics(model, feature_names, no_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print (f"Topic:{topic_idx}")
        print (" ".join([feature_names[i]
                        for i in topic.argsort()[:-no_top_words - 1:-1]]))
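
The slice `[:-no_top_words - 1:-1]` in `display_topics` can be hard to read. The sketch below walks through it on a made-up topic; the five-word vocabulary and the weights are hypothetical.

```python
import numpy as np

# Hypothetical weights of one topic over a five-word vocabulary.
feature_names = np.array(["risk", "team", "python", "client", "cloud"])
topic = np.array([0.4, 0.1, 0.05, 0.3, 0.15])
no_top_words = 3

# argsort() returns indices in ascending order of weight. The slice
# [:-no_top_words - 1:-1] walks backwards from the last index and stops
# after no_top_words entries, giving the highest-weighted words first.
top_idx = topic.argsort()[:-no_top_words - 1:-1]
top_words = [str(feature_names[i]) for i in top_idx]

print(top_words)  # ['risk', 'client', 'cloud']
```

The same indices can be obtained with `topic.argsort()[::-1][:no_top_words]`, which some readers may find clearer.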

In [37]:
#display the top 20 terms for each cluster
display_topics(lda,vectorizer.get_feature_names_out(),20)

Topic:0
solution business technology technical experience team ai project cloud work lead partner development stakeholder drive need develop role enterprise understand
Topic:1
risk management business financial client team market support experience investment control work skill compliance include finance credit regulatory provide global
Topic:2
system support operation provide experience service issue work sap management process maintenance level network perform knowledge maintain ensure incident center
Topic:3
project ensure process management plan manage quality cost budget planning report analysis delivery material review prepare production work forecast standard
Topic:4
work team world opportunity help people build service make technology company we look experience well join global product singapore new
Topic:5
experience design software development system application team work knowledge code good architecture develop technology technical cloud engineering web database platform
Top

In [38]:
#Produce a column denoting which cluster each job description belongs to (the topic with the highest probability)
df['cluster'] = transformed.argmax(axis=1)

In [39]:
df.head()

Unnamed: 0,url,job_title,description_html,description,job_type,company,location,description_tokens,description_clean,full_info_tokens,full_info_clean,duplicated,min_pay,max_pay,lemmatized_tokens,cluster
0,https://www.mycareersfuture.gov.sg/job/custome...,PRODUCTION CONTROL MANAGER,<p><strong>JOB DESCIPTION</strong></p>\n<ul>\n...,JOB DESCIPTION\n\n planning and organising pr...,"Permanent, Full Time",Snl Logistics Pte Ltd,31 GUL CIRCLE 629569,"['job', 'descipt', 'plan', 'organis', 'product...",job desciption planning organising production ...,"['product', 'control', 'manag', 'snl', 'logist...",production control manager snl logistics pte l...,False,2000.0,3400.0,job desciption planning organise production sc...,3
1,https://www.mycareersfuture.gov.sg/job/enginee...,Design Engineer ( Mechanical / Electrical),<p><strong>SUMMARY</strong></p>\n<ul>\n <li>T...,SUMMARY\n\n This position is responsible...,Full Time,Jamco Aero Design &Amp; Engineering Private Li...,Singapore,"['summari', 'posit', 'respons', 'support', 'pr...",summary position responsible supporting projec...,"['design', 'engin', 'mechan', 'electr', 'jamco...",design engineer mechanical electrical jamco ae...,False,2500.0,4500.0,summary position responsible support project t...,10
2,https://www.mycareersfuture.gov.sg/job/sales/b...,Business Development Executive,<p><strong>Job description</strong></p>\n<p>Wh...,Job description\nWho we are:\nWe are a logisti...,"Part Time, Permanent",Airpak Express Pte Ltd,"TECHPLAS INDUSTRIAL BUILDING, 45 CHANGI SOUTH ...","['job', 'descript', 'logist', 'servic', 'provi...",job description logistics service provider sol...,"['busi', 'develop', 'execut', 'airpak', 'expre...",business development executive airpak express ...,False,3200.0,3500.0,job description logistic service provider soli...,13
3,https://www.mycareersfuture.gov.sg/job/banking...,Senior / Data Scientist,<p>The ideal candidate should have a good unde...,The ideal candidate should have a good underst...,"Permanent, Full Time",Singapore Exchange Limited,"SGX CENTRE I, 2 SHENTON WAY 068804","['ideal', 'candid', 'good', 'understand', 'bus...",ideal candidate good understanding business do...,"['senior', 'data', 'scientist', 'singapor', 'e...",senior data scientist singapore exchange limit...,False,9000.0,14000.0,ideal candidate good understanding business do...,12
4,https://www.mycareersfuture.gov.sg/job/archite...,8890-Sales Consultant [ Digital Software| Saas...,<p><strong>Sales Consultant (Digital Software)...,Sales Consultant (Digital Software)\nLocation:...,"Permanent, Full Time",The Supreme Hr Advisory Pte. Ltd.,"SHENTON HOUSE, 3 SHENTON WAY 068805","['sale', 'consult', 'digit', 'softwar', 'locat...",sales consultant digital software location jal...,"['8890', 'sale', 'consult', 'digit', 'softwar'...",8890 sales consultant digital software saas in...,False,3000.0,4500.0,sale consultant digital software location jala...,13


In [40]:
#Checking the distribution of the clusters
df['cluster'].value_counts()

12    1050
5      744
4      741
1      730
10     496
11     453
18     354
19     284
9      280
17     240
7      211
2      184
0      125
13      97
8       80
3       77
14      59
15      20
16      18
Name: cluster, dtype: int64
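
The output above lists 19 of the 20 topic indices: index 6 does not appear, which means no job had topic 6 as its most probable topic. `value_counts()` only reports values that occur, so empty clusters are easy to miss. A minimal sketch of one way to surface them follows, using made-up labels; `cluster`, `counts`, and `empty_topics` are hypothetical names.

```python
import pandas as pd

num_topics = 5  # hypothetical; the notebook uses 20

# Hypothetical cluster labels in which topic 3 is never the most probable one.
cluster = pd.Series([0, 1, 1, 2, 4, 4, 4], name="cluster")

# reindex adds a row for every topic index, filling unused topics with 0.
counts = cluster.value_counts().reindex(range(num_topics), fill_value=0)
empty_topics = counts[counts == 0].index.tolist()

print(counts)
print(empty_topics)  # [3]
```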

In [121]:
#Sample jobs from a single cluster to hypothesize a label for it (together with its key terms)
df[df['cluster']==19].sample(5)

Unnamed: 0,url,job_title,description_html,description,job_type,company,location,description_tokens,description_clean,full_info_tokens,full_info_clean,duplicated,min_pay,max_pay,lemmatized_tokens,cluster
3612,https://sg.jobsdb.com/job/Senior-Vision-Softwa...,Senior Vision Software Design Engineer,"<div class=""-desktop-no-padding-top"" id=""job-d...",MAIN DUTIES AND RESPONSIBILITIES:\n\nThe succe...,Permanent,Mit Semiconductor Pte. Ltd.,Ang Mo Kio,"['main', 'duti', 'respons', 'success', 'candid...",main duties responsibilities successful candid...,"['senior', 'vision', 'softwar', 'design', 'eng...",senior vision software design engineer mit sem...,False,4000.0,5800.0,main duty responsibilitie successful candidate...,19
1854,https://sg.jobsdb.com/job/Associate-Learning-E...,Associate learning engineer computer,"<div class=""-desktop-no-padding-top"" id=""job-d...",\nIf you are passionate about playing a key ro...,,Techbridge Market Holdings,Singapore,"['passion', 'play', 'key', 'role', 'success', ...",passionate playing key role success private or...,"['associ', 'learn', 'engin', 'comput', 'techbr...",associate learning engineer computer techbridg...,False,,,passionate play key role success private organ...,19
2657,https://sg.jobsdb.com/job/Research-Assistant-4...,"Research Assistant (Computer Science, Machine ...","<div class=""-desktop-no-padding-top"" id=""job-d...",A Research Assistant position is available in ...,Full time,Nanyang Technological University,Singapore,"['research', 'assist', 'posit', 'avail', 'scho...",research assistant position available school e...,"['research', 'assist', 'comput', 'scienc', 'ma...",research assistant computer science machine le...,False,,,research assistant position available school e...,19
276,https://www.mycareersfuture.gov.sg/job/informa...,Software Engineer,<p><u><strong>Responsibilities:</strong></u></...,Responsibilities:\n\n Analysing and modifying...,"Permanent, Full Time",Morgan Mckinley Pte. Ltd.,Singapore,"['respons', 'analys', 'modifi', 'exist', 'soft...",responsibilities analysing modifying existing ...,"['softwar', 'engin', 'morgan', 'mckinley', 'pt...",software engineer morgan mckinley pte ltd perm...,False,4000.0,6000.0,responsibility analyse modify exist software w...,19
2274,https://sg.jobsdb.com/job/Research-Fellow-07a2...,Research Fellow [Computer Science] #WorkNow,"<div class=""-desktop-no-padding-top"" id=""job-d...",The School of Computer Science and Engineering...,Full time,Nanyang Technological University,Singapore,"['school', 'comput', 'scienc', 'engin', 'scse'...",school computer science engineering scse invit...,"['research', 'fellow', 'comput', 'scienc', 'wo...",research fellow computer science worknow nanya...,False,,,school computer science engineering scse invit...,19


Hypothesized Clusters
<li>0: Unknown </li>
<li>1: Banking</li>
<li>2: IT </li>
<li>3: Production</li>
<li>4: Unknown</li>
<li>5: Software Developer</li>
<li>6: Unknown</li>
<li>7: Unknown</li>
<li>8: Unknown</li>
<li>9: Unknown</li>
<li>10: Unknown</li>
<li>11: Product Manager</li>
<li>12: Data Science</li>
<li>13: Sales</li>
<li>14: Cybersecurity </li>
<li>15: DBS </li>
<li>16: Electrical Engineer </li>
<li>17: Tiktok/ByteDance</li>
<li>18: Unknown</li>
<li>19: AI/NLP roles</li>

## Conclusion
To conclude, we found that most of the key terms for each topic were not skills but other parts of the job description, such as company information or day-to-day responsibilities. Matching a user's skills to the closest cluster would therefore yield poor results. A method for extracting skills from the job descriptions would greatly improve the viability of such a recommender system. In addition, some clusters were difficult to identify. Nonetheless, the model is useful for generating key terms for each cluster (which might correspond to an industry or a job title). These terms could help users who are crafting resumes, as resumes containing relevant key terms tend to pass automated screening systems more easily.

In [127]:
#Select the clusters with well-defined key terms; each label in clusters corresponds, by position, to the topic index in relevant_index
clusters = ["banking", "it", "project manager", "software developer", "product manager", "data science", "sales", "cybersecurity", "electrical engineer", "ai and nlp"]
relevant_index = [1, 2, 3, 5, 11, 12, 13, 14, 16, 19]

In [128]:
#Build a dataframe of job titles and their top 20 key terms
#Iterating over relevant_index directly keeps each title aligned with its topic, whatever the order of the indices
feature_names = vectorizer.get_feature_names_out()
key_terms = [
    [feature_names[i] for i in lda.components_[topic_idx].argsort()[:-20 - 1:-1]]
    for topic_idx in relevant_index
]

result_dict = {'title': clusters, 'key_terms': key_terms}
result_df = pd.DataFrame(result_dict)

In [129]:
result_df

Unnamed: 0,title,key_terms
0,banking,"[risk, management, business, financial, client..."
1,it,"[system, support, operation, provide, experien..."
2,project manager,"[project, ensure, process, management, plan, m..."
3,software developer,"[experience, design, software, development, sy..."
4,product manager,"[product, business, marketing, team, market, s..."
5,data science,"[datum, data, experience, analytic, model, bus..."
6,sales,"[customer, sale, product, service, account, wo..."
7,cybersecurity,"[security, service, hr, information, chain, su..."
8,electrical engineer,"[engineering, electrical, electronic, industri..."
9,ai and nlp,"[computer, software, development, experience, ..."


In [29]:
result_df.to_csv("key_terms.csv", index=False)