## Introduction
We wanted to explore whether clustering methods can be used to provide job recommendations. Ideally, users would input their skills and the recommender system would output the jobs from the cluster closest to that input. The clustering method explored here is a Latent Dirichlet Allocation (LDA) model, which clusters jobs based on their descriptions and provides the key terms that characterise each cluster.

In [12]:
#Import Packages
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
import pandas as pd
import numpy as np

In [2]:
#Read data on jobs
df = pd.read_csv('processed_data.csv')

In [3]:
#Overview of the dataset
df.head()

Unnamed: 0,url,job_title,description_html,description,job_type,company,location,description_tokens,description_clean,full_info_tokens,full_info_clean,duplicated,min_pay,max_pay
0,https://www.mycareersfuture.gov.sg/job/custome...,PRODUCTION CONTROL MANAGER,<p><strong>JOB DESCIPTION</strong></p>\n<ul>\n...,JOB DESCIPTION\n\n planning and organising pr...,"Permanent, Full Time",Snl Logistics Pte Ltd,31 GUL CIRCLE 629569,"['job', 'descipt', 'plan', 'organis', 'product...",job desciption planning organising production ...,"['product', 'control', 'manag', 'snl', 'logist...",production control manager snl logistics pte l...,False,2000.0,3400.0
1,https://www.mycareersfuture.gov.sg/job/enginee...,Design Engineer ( Mechanical / Electrical),<p><strong>SUMMARY</strong></p>\n<ul>\n <li>T...,SUMMARY\n\n This position is responsible...,Full Time,Jamco Aero Design &Amp; Engineering Private Li...,Singapore,"['summari', 'posit', 'respons', 'support', 'pr...",summary position responsible supporting projec...,"['design', 'engin', 'mechan', 'electr', 'jamco...",design engineer mechanical electrical jamco ae...,False,2500.0,4500.0
2,https://www.mycareersfuture.gov.sg/job/sales/b...,Business Development Executive,<p><strong>Job description</strong></p>\n<p>Wh...,Job description\nWho we are:\nWe are a logisti...,"Part Time, Permanent",Airpak Express Pte Ltd,"TECHPLAS INDUSTRIAL BUILDING, 45 CHANGI SOUTH ...","['job', 'descript', 'logist', 'servic', 'provi...",job description logistics service provider sol...,"['busi', 'develop', 'execut', 'airpak', 'expre...",business development executive airpak express ...,False,3200.0,3500.0
3,https://www.mycareersfuture.gov.sg/job/banking...,Senior / Data Scientist,<p>The ideal candidate should have a good unde...,The ideal candidate should have a good underst...,"Permanent, Full Time",Singapore Exchange Limited,"SGX CENTRE I, 2 SHENTON WAY 068804","['ideal', 'candid', 'good', 'understand', 'bus...",ideal candidate good understanding business do...,"['senior', 'data', 'scientist', 'singapor', 'e...",senior data scientist singapore exchange limit...,False,9000.0,14000.0
4,https://www.mycareersfuture.gov.sg/job/archite...,8890-Sales Consultant [ Digital Software| Saas...,<p><strong>Sales Consultant (Digital Software)...,Sales Consultant (Digital Software)\nLocation:...,"Permanent, Full Time",The Supreme Hr Advisory Pte. Ltd.,"SHENTON HOUSE, 3 SHENTON WAY 068805","['sale', 'consult', 'digit', 'softwar', 'locat...",sales consultant digital software location jal...,"['8890', 'sale', 'consult', 'digit', 'softwar'...",8890 sales consultant digital software saas in...,False,3000.0,4500.0


In [4]:
#Use CountVectorizer to obtain term frequencies for each document in the "description_clean" column
#min_df, the minimum number of documents a word must appear in to be kept, is set to 80 because most words appearing in fewer than 80 documents were junk words.
vectorizer = CountVectorizer(min_df=80)
X = vectorizer.fit_transform(df['description_clean'])
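As a small illustration of how `min_df` behaves, the sketch below uses a made-up three-document corpus (the `toy_docs` list and the threshold of 2 are assumptions for illustration, not values from this notebook). `min_df` is compared against the number of documents that contain a term, not the term's total count, so a word repeated many times inside a single posting is still dropped.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus: "python" occurs three times, but only in one document.
toy_docs = [
    "python python python sql",
    "sql excel",
    "excel sql reporting",
]

# Keep only terms that appear in at least 2 documents.
toy_vectorizer = CountVectorizer(min_df=2)
toy_counts = toy_vectorizer.fit_transform(toy_docs)

# Terms that survived the document-frequency filter, in alphabetical order.
kept_terms = sorted(toy_vectorizer.vocabulary_)
print(kept_terms)
```

Here "python" is filtered out despite its three occurrences, because it appears in only one document.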

In [5]:
#Check each word and its total frequency across all documents, sorted in ascending order.
word_list = vectorizer.get_feature_names_out()
count_list = np.asarray(X.sum(axis=0)).ravel()  #column sums without densifying the sparse matrix
word_dict = dict(zip(word_list, count_list))
dict(sorted(word_dict.items(), key=lambda item: item[1]))

{'dozen': 80,
 'additionally': 81,
 'amongst': 81,
 'incentives': 81,
 'pandas': 81,
 'consented': 82,
 'fostering': 82,
 'producing': 82,
 'autonomy': 83,
 'confident': 83,
 'easy': 83,
 'eligibility': 83,
 'http': 83,
 'initiate': 83,
 'majority': 83,
 'professionally': 83,
 'question': 83,
 'unless': 83,
 'disclosure': 84,
 'evolution': 84,
 'inputs': 84,
 'offered': 84,
 'spans': 84,
 'still': 84,
 'tomorrow': 84,
 'withdraw': 84,
 'apart': 85,
 'away': 85,
 'commensurate': 85,
 'documentations': 85,
 'ever': 85,
 'html5': 85,
 'initial': 85,
 'majors': 85,
 'respected': 85,
 'seeks': 85,
 'achieving': 86,
 'cleansing': 86,
 'collective': 86,
 'eco': 86,
 'instead': 86,
 'intellectual': 86,
 'minded': 86,
 'actual': 87,
 'competency': 87,
 'grasp': 87,
 'guiding': 87,
 'style': 87,
 'behind': 88,
 'consolidate': 88,
 'demonstrable': 88,
 'ineligible': 88,
 'medically': 88,
 'quo': 88,
 'resso': 88,
 '80': 89,
 'delight': 89,
 'fulfil': 89,
 'launched': 89,
 'marketplace': 89,
 'reg

In [6]:
#Cluster the job descriptions into 20 clusters using an LDA model
#The number of clusters was chosen by checking for overlaps in key terms between topics, which would indicate that the number of topics was too high
num_topics = 20
lda = LatentDirichletAllocation(n_components=num_topics, max_iter=5, learning_method='online', random_state=0, verbose=1).fit(X)

iteration: 1 of max_iter: 5
iteration: 2 of max_iter: 5
iteration: 3 of max_iter: 5
iteration: 4 of max_iter: 5
iteration: 5 of max_iter: 5
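
The overlap check used to pick the number of topics was done by eye. One way it could be made more systematic is sketched below, assuming a Jaccard similarity between the top-term sets of each pair of topics. The helper `top_term_overlap` and the toy weight matrix are hypothetical and not part of the original analysis.

```python
import numpy as np

def top_term_overlap(components, feature_names, n_top=20):
    """Jaccard overlap between the top-n term sets of every pair of topics."""
    feature_names = np.asarray(feature_names)
    # For each topic, collect the n_top highest-weighted terms as a set.
    top_sets = [set(feature_names[row.argsort()[::-1][:n_top]]) for row in components]
    k = len(top_sets)
    overlap = np.zeros((k, k))
    for i in range(k):
        for j in range(k):
            shared = len(top_sets[i] & top_sets[j])
            total = len(top_sets[i] | top_sets[j])
            overlap[i, j] = shared / total
    return overlap

# Hypothetical topic-term weights: topics 0 and 1 share their top two terms.
toy_components = np.array([
    [5.0, 4.0, 1.0, 0.0],
    [4.0, 5.0, 0.0, 1.0],
    [0.0, 1.0, 5.0, 4.0],
])
toy_terms = ["risk", "credit", "sales", "customer"]
overlap = top_term_overlap(toy_components, toy_terms, n_top=2)
print(overlap)
```

With a fitted model, the same helper could be called as `top_term_overlap(lda.components_, vectorizer.get_feature_names_out())`. Large off-diagonal values would suggest that two topics are near-duplicates.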


In [7]:
#Transform the term-frequency matrix into a topic probability distribution for each document
transformed = lda.transform(X)

In [8]:
transformed.shape

(6243, 20)
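
Each row of the transformed matrix is a document's distribution over topics, so the rows should sum to 1. The sketch below checks this on a made-up four-document corpus with two topics; all `toy_` names and the corpus itself are assumptions for illustration only.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus standing in for the job descriptions.
toy_docs = [
    "credit risk trading portfolio",
    "sales customer revenue account",
    "risk market trading finance",
    "customer service sales product",
]
toy_counts = CountVectorizer().fit_transform(toy_docs)
toy_lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(toy_counts)

# One row per document, one column per topic.
doc_topic = toy_lda.transform(toy_counts)
row_sums = doc_topic.sum(axis=1)
print(doc_topic.shape, row_sums)
```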

In [10]:
#Function to display the key words for each topic
def display_topics(model, feature_names, no_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print(f"Topic:{topic_idx}")
        #argsort is ascending, so the reversed slice picks the highest-weighted terms first
        print(" ".join([feature_names[i]
                        for i in topic.argsort()[:-no_top_words - 1:-1]]))
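
The slice `[:-no_top_words - 1:-1]` can be hard to read at a glance. The sketch below works through it on a made-up weight vector (the weights and terms are illustrative assumptions): `argsort` returns indices in ascending order of weight, and the negative-step slice walks backwards from the end to return the top `n_top` indices, highest first.

```python
import numpy as np

# Hypothetical term weights for a single topic.
weights = np.array([0.1, 0.7, 0.2, 0.5])
terms = np.array(["assist", "risk", "daily", "credit"])
n_top = 2

# argsort gives [0, 2, 3, 1]; the reversed slice keeps the last two, in reverse.
top_idx = weights.argsort()[:-n_top - 1:-1]
top_terms = terms[top_idx].tolist()
print(top_terms)  # ['risk', 'credit']
```

An equivalent and arguably more readable form is `weights.argsort()[::-1][:n_top]`.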

In [11]:
#display the top 20 terms for each cluster
display_topics(lda,vectorizer.get_feature_names_out(),20)

Topic:0
research singapore students nus covid job internship staff may health 19 candidates university 2022 vaccinated successful information roles applicants campus
Topic:1
operations delivery ensure quality technology environment manage group processes production supply management operational healthcare bank control chain work innovation supports
Topic:2
risk credit financial market trading management investment finance quantitative portfolio models asset trade markets capital experience including role model products
Topic:3
support assist management reports job data perform ad duties ensure prepare assigned office hoc related provide review system required daily
Topic:4
sales customer customers product technical center new service solutions products account partners company solution existing revenue key experience services opportunities
Topic:5
clients client business financial global services management team banking experience technology compliance service work bank across program 

In [13]:
#Add a column denoting the most probable cluster for each job description
arg_max = transformed.argmax(axis=1)
df['cluster'] = arg_max
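
With a cluster assigned to each job, the recommendation step described in the introduction could look roughly like the sketch below. It is a minimal illustration on made-up data, not a tested recommender: the helper `recommend_jobs`, the `toy_jobs` frame, and the two-topic model are all hypothetical.

```python
import pandas as pd
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

def recommend_jobs(skills, vectorizer, lda, jobs, n=5):
    """Return the most probable cluster for the skills text and up to n jobs from it."""
    # Vectorise the user's skills with the vocabulary learned from the job descriptions.
    skill_counts = vectorizer.transform([skills])
    user_cluster = int(lda.transform(skill_counts)[0].argmax())
    matches = jobs[jobs["cluster"] == user_cluster]
    return user_cluster, matches.head(n)

# Hypothetical toy data standing in for the job dataframe.
toy_jobs = pd.DataFrame({
    "job_title": ["Risk Analyst", "Sales Executive", "Data Analyst", "Account Manager"],
    "description_clean": [
        "credit risk trading portfolio python",
        "sales customer revenue account",
        "python sql data reports",
        "customer service sales product",
    ],
})
toy_vectorizer = CountVectorizer()
toy_counts = toy_vectorizer.fit_transform(toy_jobs["description_clean"])
toy_lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(toy_counts)
toy_jobs["cluster"] = toy_lda.transform(toy_counts).argmax(axis=1)

user_cluster, recommended = recommend_jobs("python sql", toy_vectorizer, toy_lda, toy_jobs)
print(user_cluster)
print(recommended[["job_title", "cluster"]])
```

Skills that are absent from the vectorizer's vocabulary contribute nothing to the input vector, so the quality of the match depends heavily on how much of the vocabulary consists of skill terms.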

In [14]:
df.head()

Unnamed: 0,url,job_title,description_html,description,job_type,company,location,description_tokens,description_clean,full_info_tokens,full_info_clean,duplicated,min_pay,max_pay,cluster
0,https://www.mycareersfuture.gov.sg/job/custome...,PRODUCTION CONTROL MANAGER,<p><strong>JOB DESCIPTION</strong></p>\n<ul>\n...,JOB DESCIPTION\n\n planning and organising pr...,"Permanent, Full Time",Snl Logistics Pte Ltd,31 GUL CIRCLE 629569,"['job', 'descipt', 'plan', 'organis', 'product...",job desciption planning organising production ...,"['product', 'control', 'manag', 'snl', 'logist...",production control manager snl logistics pte l...,False,2000.0,3400.0,1
1,https://www.mycareersfuture.gov.sg/job/enginee...,Design Engineer ( Mechanical / Electrical),<p><strong>SUMMARY</strong></p>\n<ul>\n <li>T...,SUMMARY\n\n This position is responsible...,Full Time,Jamco Aero Design &Amp; Engineering Private Li...,Singapore,"['summari', 'posit', 'respons', 'support', 'pr...",summary position responsible supporting projec...,"['design', 'engin', 'mechan', 'electr', 'jamco...",design engineer mechanical electrical jamco ae...,False,2500.0,4500.0,14
2,https://www.mycareersfuture.gov.sg/job/sales/b...,Business Development Executive,<p><strong>Job description</strong></p>\n<p>Wh...,Job description\nWho we are:\nWe are a logisti...,"Part Time, Permanent",Airpak Express Pte Ltd,"TECHPLAS INDUSTRIAL BUILDING, 45 CHANGI SOUTH ...","['job', 'descript', 'logist', 'servic', 'provi...",job description logistics service provider sol...,"['busi', 'develop', 'execut', 'airpak', 'expre...",business development executive airpak express ...,False,3200.0,3500.0,14
3,https://www.mycareersfuture.gov.sg/job/banking...,Senior / Data Scientist,<p>The ideal candidate should have a good unde...,The ideal candidate should have a good underst...,"Permanent, Full Time",Singapore Exchange Limited,"SGX CENTRE I, 2 SHENTON WAY 068804","['ideal', 'candid', 'good', 'understand', 'bus...",ideal candidate good understanding business do...,"['senior', 'data', 'scientist', 'singapor', 'e...",senior data scientist singapore exchange limit...,False,9000.0,14000.0,9
4,https://www.mycareersfuture.gov.sg/job/archite...,8890-Sales Consultant [ Digital Software| Saas...,<p><strong>Sales Consultant (Digital Software)...,Sales Consultant (Digital Software)\nLocation:...,"Permanent, Full Time",The Supreme Hr Advisory Pte. Ltd.,"SHENTON HOUSE, 3 SHENTON WAY 068805","['sale', 'consult', 'digit', 'softwar', 'locat...",sales consultant digital software location jal...,"['8890', 'sale', 'consult', 'digit', 'softwar'...",8890 sales consultant digital software saas in...,False,3000.0,4500.0,14


In [15]:
#Checking the distribution of the clusters
df['cluster'].value_counts()

10    757
15    649
18    633
7     608
9     507
11    433
8     360
13    343
14    325
3     322
17    270
12    268
5     254
0     159
2     145
1      70
19     48
16     43
4      30
6      19
Name: cluster, dtype: int64

In [22]:
#Sample jobs from one cluster to hypothesize what the cluster represents
df[df['cluster']==2].sample(5)

Unnamed: 0,url,job_title,description_html,description,job_type,company,location,description_tokens,description_clean,full_info_tokens,full_info_clean,duplicated,min_pay,max_pay,cluster
763,https://www.mycareersfuture.gov.sg/job/banking...,"Senior Associate, Front Office Quant, Quant &a...",<p><strong>Business Function</strong></p>\n<p>...,Business Function\nAs a leader in treasury ope...,Full Time,Dbs Bank Ltd.,Singapore,"['busi', 'function', 'leader', 'treasuri', 'op...",business function leader treasury operations d...,"['senior', 'associ', 'front', 'offic', 'quant'...",senior associate front office quant quant amp ...,False,5000.0,9000.0,2
5205,https://sg.jobsdb.com/job/Treasury-Analyst-4b7...,Treasury Analyst,"<div class=""-desktop-no-padding-top"" id=""job-d...",\nIngenico has led the payment industry for mo...,Full time,Ingenico ePayments,Singapore,"['ingenico', 'led', 'payment', 'industri', '30...",ingenico led payment industry 30 years become ...,"['treasuri', 'analyst', 'ingenico', 'epay', 'f...",treasury analyst ingenico epayments full time ...,False,,,2
4251,https://sg.jobsdb.com/job/Market-Risk-Analyst-...,"AVP, Market Risk Analyst, MRM – Risk Control &...","<div class=""-desktop-no-padding-top"" id=""job-d...","\nAVP, Market Risk Analyst, MRM – Risk Control...",Full time,OCBC Bank,Singapore,"['avp', 'market', 'risk', 'analyst', 'mrm', 'r...",avp market risk analyst mrm risk control analy...,"['avp', 'market', 'risk', 'analyst', 'mrm', 'r...",avp market risk analyst mrm risk control analy...,False,,,2
4097,https://sg.jobsdb.com/job/Wholesale-Model-4068...,"VP, Wholesale Model Development, Risk Manageme...","<div class=""-desktop-no-padding-top"" id=""job-d...",Business Functions\nRisk Management Group work...,Full time,DBS Bank,Singapore,"['busi', 'function', 'risk', 'manag', 'group',...",business functions risk management group works...,"['vp', 'wholesal', 'model', 'develop', 'risk',...",vp wholesale model development risk management...,False,,,2
24,https://www.mycareersfuture.gov.sg/job/banking...,Financial Analyst,<p>EDHEC Singapore Infrastructure Investment I...,EDHEC Singapore Infrastructure Investment Inst...,Permanent,Association Edhec Business School Singapore Br...,"ONE GEORGE STREET, 1 GEORGE STREET 049145","['edhec', 'singapor', 'infrastructur', 'invest...",edhec singapore infrastructure investment inst...,"['financi', 'analyst', 'associ', 'edhec', 'bus...",financial analyst association edhec business s...,False,4000.0,5000.0,2


## Hypothesized Clusters
<ul>
<li>0: Research in AI/Data Science</li>
<li>1: Production</li>
<li>2: Analyst for Bank</li>
<li>3: Unknown</li>
<li>4: Sales</li>
<li>5: Unknown</li>
<li>6: Unknown</li>
<li>7: Software Engineer</li>
<li>8: Unknown</li>
<li>9: Data Analyst</li>
<li>10: Product Manager</li>
<li>11: Unknown</li>
<li>12: Openings in TikTok</li>
<li>13: Research in Data Science</li>
<li>14: Unknown</li>
<li>15: Data related roles</li>
<li>16: AI/NLP related roles</li>
<li>17: Unknown</li>
<li>18: Unknown</li>
<li>19: Cloud AI work</li>
</ul>

## Conclusion
To conclude, we realised that most of the key terms for each topic were not skills but other parts of the job description, such as company information or day-to-day responsibilities. Matching user-supplied skills to the closest cluster would therefore yield poor results. A method for extracting skills from these job descriptions would greatly improve the viability of such a recommender system. Some clusters were also difficult to label. Nonetheless, the model is useful for generating key terms for each cluster, which might help users who are crafting resumes, as resumes containing relevant key terms tend to pass automated screening systems more easily.

In [23]:
#Keep only the clusters with well-defined key terms
clusters = ["production", "bank analyst", "sales", "software engineer", "data analyst", "product manager", "data science research", "ai and nlp", "cloud ai"]
relevant_index = [1, 2, 4, 7, 9, 10, 13, 16, 19]

In [27]:
#Build a dataframe of cluster titles and their top 20 key terms
feature_names = vectorizer.get_feature_names_out()
#Look up each topic by its index so titles and key terms stay aligned
key_terms = [[feature_names[i] for i in lda.components_[topic_idx].argsort()[:-20 - 1:-1]]
             for topic_idx in relevant_index]
result_df = pd.DataFrame({'title': clusters, 'key_terms': key_terms})

In [29]:
result_df.to_csv("key_terms.csv", index=False)
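
Because the `key_terms` column holds Python lists, writing it to CSV stores each list as its string representation. The sketch below shows one way the lists could be recovered on reload, using an in-memory buffer and a made-up one-row frame (`toy_result` is an assumption for illustration, not the actual `key_terms.csv`).

```python
import ast
import io

import pandas as pd

# Hypothetical one-row frame with a list-valued column.
toy_result = pd.DataFrame({"title": ["sales"], "key_terms": [["sales", "customer"]]})

# Round-trip through CSV using an in-memory buffer instead of a file.
buffer = io.StringIO()
toy_result.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# After reloading, the column contains strings such as "['sales', 'customer']".
raw_value = reloaded.loc[0, "key_terms"]

# ast.literal_eval safely parses the string back into a list.
reloaded["key_terms"] = reloaded["key_terms"].apply(ast.literal_eval)
print(type(raw_value).__name__, reloaded.loc[0, "key_terms"])
```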