# Classifying job postings from Indeed.co.uk

In this notebook, I classify job postings from Indeed.co.uk.

The structure is as follows:


1. Create a corpus from a number of job postings.
    - This requires scraping the web. For this I adapted the notebook at https://jessesw.com/Data-Science-Skills/ , which uses the BeautifulSoup package.
       
2. Create bag-of-words features using TF-IDF. I used 1-, 2- and 3-grams. This is done with TfidfVectorizer from sklearn.feature_extraction.text.

3. Perform an unsupervised classification (clustering) of the job postings with KMeans from sklearn.cluster, using k-means++ initialization.
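
As a rough sketch of steps 2 and 3, the snippet below runs the same two scikit-learn classes on a tiny hand-written corpus instead of scraped postings. The four toy documents, the variable names and the `n_init`/`random_state` settings are illustrative assumptions, not values taken from this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Hypothetical stand-in for the scraped corpus: two data-science texts, two restaurant texts
toy_corpus = [
    "python machine learning data scientist statistics",
    "data scientist machine learning python models",
    "restaurant kitchen chef food service",
    "food service restaurant waiter kitchen",
]

# Step 2: TF-IDF features over 1-, 2- and 3-grams
toy_vectorizer = TfidfVectorizer(ngram_range=(1, 3), sublinear_tf=True)
toy_features = toy_vectorizer.fit_transform(toy_corpus)  # sparse matrix, one row per document

# Step 3: k-means with k-means++ initialization; several restarts for a stable result
toy_km = KMeans(n_clusters=2, init="k-means++", n_init=10, random_state=0)
toy_labels = toy_km.fit_predict(toy_features)

print(toy_features.shape[0], "documents,", len(toy_vectorizer.vocabulary_), "n-gram features")
print(toy_labels)  # documents 0-1 and 2-3 are expected to share a label
```

The cluster numbers themselves are arbitrary: only which documents end up together is meaningful.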

   

In [1]:
from bs4 import BeautifulSoup # For HTML parsing
import urllib.request # Website connections (urllib.request.urlopen is used below)
import re # Regular expressions
from time import sleep # To prevent overwhelming the server between connections
from collections import Counter # Keep track of our term counts
from nltk.corpus import stopwords # Filter out stopwords, such as 'the', 'or', 'and'
import pandas as pd # For converting results to a dataframe and bar chart plots
%matplotlib inline

In [29]:
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans, MiniBatchKMeans
import numpy as np
import copy

## Function raw_text_cleaner: takes a URL and extracts a job description

In [3]:
def raw_text_cleaner(website):
    '''
    From the notebook by https://jessesw.com/Data-Science-Skills/ 
    This function just cleans up the raw html so that I can look at it.
    Inputs: a URL to investigate
    Outputs: Cleaned text only
    '''
    try:
        site = urllib.request.urlopen(website).read() # Connect to the job posting
    except Exception:
        return None  # The posting may no longer exist, or the connection failed
    
    #soup_obj = BeautifulSoup(site) # Get the html from the site
    soup_obj = BeautifulSoup(site, "lxml")
    if len(soup_obj) == 0: # In case the default parser lxml doesn't work, try another one
        soup_obj = BeautifulSoup(site, 'html5lib')
    
    
    for script in soup_obj(["script", "style"]):
        script.extract() # Remove these two elements from the BS4 object
    
    
    text_original = soup_obj.get_text()
    text_original = re.sub("[^a-zA-Z+3]"," ", str(text_original))  # Now get rid of any terms that aren't words (include 3 for d3.js)
    stop_words_base = set(stopwords.words("english")) # Filter out any stop words
    stop_words_jobs = set(['job','jobs','candidate','candidates','apply','now','skills','application','new',
                           'group','day','company','experience','our','job','position',
                           'pay','train','training','team','staff','indeed','work','working',
                           'yes','we','us','pay','no','hour','hours','uk','london','hire',
                           'team','within','slavery','therefore','opportunities','opportunity',
                           'motivation','motivated','he','she','he/she','much','very'])
    stop_words = stop_words_base.union(stop_words_jobs)
    text = [w.lower() for w in text_original.split() if w.lower() not in stop_words]
    text_original = ' '.join(text)

    return text_original

Test that it works:

In [4]:
website = 'https://www.indeed.co.uk/viewjob?jk=ba3df8f30e1691b7&tk=1c82t0tcm9m5i9ng&from=serp&alid=3&advn=1402909195792678'
sample_original = raw_text_cleaner(website)
print(sample_original[0:300])

data scientist fospha co skip description searchclose find jobscompany reviewsfind salariesfind cvsemployers post upload cv sign advanced search title keywords city postcode data scientist fospha data scientist variety exciting projects fast growing organization lot tackle complex problems require m


Build a bag of words with 1-, 2- and 3-grams on this sample text, to see what the features look like:

In [6]:
vectorizer2 = TfidfVectorizer(ngram_range=(1,3), sublinear_tf=True)
sample_original_features = vectorizer2.fit_transform([sample_original])
vectorizer2.get_feature_names()[0:10]

['ability',
 'ability conduct',
 'ability conduct deep',
 'able',
 'able translate',
 'able translate business',
 'acquiring',
 'acquiring customers',
 'acquiring customers attribute',
 'acquisition']
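
The feature names above follow from how contiguous word n-grams are enumerated. The helper below, `word_ngrams`, is a hypothetical pure-Python illustration and not part of scikit-learn; TfidfVectorizer uses its own tokenizer (by default it lowercases the text and keeps tokens of two or more characters), so this is only an approximation of what happens internally.

```python
def word_ngrams(tokens, nmin=1, nmax=3):
    """Return all contiguous word n-grams with nmin <= n <= nmax, joined by spaces."""
    grams = []
    for n in range(nmin, nmax + 1):
        # Slide a window of n tokens across the token list
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams

example_tokens = "able translate business".split()
print(sorted(word_ngrams(example_tokens)))
# ['able', 'able translate', 'able translate business', 'business', 'translate', 'translate business']
```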

Now that we see how to extract the features from one job posting, let's open several job postings, create a corpus, and create a sparse matrix for the features.

In [7]:
def create_corpus(city = None, job_list=('data+scientist', 'machine+learning'),pages=5):
    '''
    Initially based on the notebook at https://jessesw.com/Data-Science-Skills/ 
    Input : city, and a list of job description queries (a single query string is also accepted)
    Output: corpus, URLs of the job descriptions found, URLs that could not be parsed
    '''

    if isinstance(job_list, str):
        job_list = [job_list] # Wrap a single query so the loop below sees one item, not characters
        
    job_descriptions = [] # Store all our descriptions in this list
    all_URLS=[]
    bad_URLS = []
    for final_job in job_list:
        print('Searching for ', final_job)

        #https://www.indeed.co.uk/jobs?q=data+scientist&l=london&sort=date&start=10    
        final_site_list = ['https://www.indeed.co.uk/jobs?q=', final_job, '&l=', city or '',
                       '&sort=date'] # Join all of our strings together so that indeed will search correctly (city=None becomes an empty location, avoiding a TypeError in join)

        final_site = ''.join(final_site_list) # Merge the html address together into one string


        base_url = 'https://www.indeed.co.uk'

        #print('TRY',final_site)
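        try:
            html = urllib.request.urlopen(final_site).read() # Open up the front page of our search first
        except Exception:
            print('Could not open the search page for that query/city. Exiting . . .') # In case the city is invalid
            return job_descriptions, all_URLS, bad_URLS # Same return shape as the normal exit, so callers can still unpack three values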
        soup = BeautifulSoup(html,"lxml") # Get the html from the first page
        if len(soup) < 1: print('THERE IS AN ERROR LOADING THE PAGE')

        # Now find out how many jobs there were

        num_jobs_area = soup.find(id = 'searchCount').string.encode('utf-8') # Now extract the total number of jobs found
                                                                             # The 'searchCount' object has this
        print('type(num_jobs_area)')
        job_numbers = re.findall(r'\d+', str(num_jobs_area)) # Extract the total jobs found from the search result


        if len(job_numbers) > 3: # Have a total number of jobs greater than 1000
            total_num_jobs = (int(job_numbers[2])*1000) + int(job_numbers[3])
        else:
            total_num_jobs = int(job_numbers[2]) 

        city_title = city
        if city is None:
            city_title = 'Nationwide'

        print('There were', total_num_jobs, 'jobs found,', city_title) # Display how many jobs were found

        num_pages = int(total_num_jobs/10) # This will be how we know the number of times we need to iterate over each new
                                      # search result page

        #for i in range(1,num_pages+1): # Loop through all of our search result pages
        for i in range(0,min(pages,num_pages+1)): # Loop through all of our search result pages
            print('Getting page', i)
            start_num = str(i*10) # Assign the multiplier of 10 to view the pages we want
            if i>0:
                current_page = ''.join([final_site, '&start=', start_num])
            else:
                current_page = final_site
            # Now that we can view the correct 10 job returns, start collecting the text samples from each
            html_page = urllib.request.urlopen(current_page).read() # Get the page

            page_obj = BeautifulSoup(html_page,'lxml') # Locate all of the job links
            job_link_area = page_obj.find(id = 'resultsCol') # The center column on the page where the job postings exist
            job_URLS=[]
            
            for link in job_link_area.find_all('a'):
                try:
                    if link.get('href')[0:3]=='/rc':
                        job_URLS.append(base_url + link.get('href'))
                except:
                        if link !=None:
                            if link.get('href') != None:
                                bad_URLS.append(base_url + link.get('href'))

            for j in range(0,len(job_URLS)):
                final_description = raw_text_cleaner(job_URLS[j])
                if final_description: # So that we only append when the website was accessed correctly
                    job_descriptions.append(final_description)
                    all_URLS.append(job_URLS[j])
                #sleep(1) # 

        print('Done with collecting the job postings!')    
        print('There were {} jobs successfully found.'.format(len(job_descriptions)))

    return job_descriptions, all_URLS, bad_URLS
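
The search URL in `create_corpus` is assembled by joining strings, which requires queries to be pre-encoded (for example `data+scientist`). As a possible alternative, the sketch below uses `urllib.parse.urlencode` from the standard library to do the encoding. The helper name `build_search_url` and its signature are hypothetical; the query parameters (`q`, `l`, `sort`, `start`) are the ones already used in the function above.

```python
from urllib.parse import urlencode

def build_search_url(query, city, page=0, base_url="https://www.indeed.co.uk"):
    """Hypothetical helper: assemble a search URL with the same parameters as create_corpus."""
    params = {"q": query, "l": city, "sort": "date"}
    if page > 0:
        params["start"] = page * 10  # results are paged in steps of 10, as in create_corpus
    # urlencode escapes each value; spaces become '+'
    return base_url + "/jobs?" + urlencode(params)

print(build_search_url("data scientist", "london", page=1))
# https://www.indeed.co.uk/jobs?q=data+scientist&l=london&sort=date&start=10
```

With this approach the caller can pass plain text such as `'data scientist'` and leave the escaping to the library.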

# Example 1: Create the corpus for two very different job descriptions:  'data scientist' and 'restaurant'

In [8]:
corpus, URLs, bad_URLs = create_corpus(city = 'london',job_list=['data+scientist', 'restaurant'],pages=5)

Searching for  data+scientist
type(num_jobs_area)
There were 1787 jobs found, london
Getting page 0
Getting page 1
Getting page 2
Getting page 3
Getting page 4
Done with collecting the job postings!
There were 47 jobs successfully found.
Searching for  restaurant
type(num_jobs_area)
There were 14500 jobs found, london
Getting page 0
Getting page 1
Getting page 2
Getting page 3
Getting page 4
Done with collecting the job postings!
There were 72 jobs successfully found.


In [9]:
print('There are {0} job postings, {1} URLS'.format(len(corpus), len(URLs)))

There are 72 job postings, 72 URLS


In [16]:
def corpus_stop_word_cleaner(corpus, stop_words_input=None):
    '''
    Removes extra words from the corpus, in case you realize later that you want to filter out more words.
    Inputs: corpus (a list of strings, or a single string), optional extra stop words
    Outputs: the corpus with the stop words filtered out (a single string if the corpus has one entry)
    Note: a list passed as corpus is modified in place.
    '''
    if not isinstance(corpus, list):
        corpus = [corpus] # Wrap a single text so the loop below works

    # Build the stop-word set once, rather than once per document
    stop_words = set(stopwords.words("english"))
    if stop_words_input is not None:
        stop_words = stop_words.union(stop_words_input)
    stop_words_jobs = set(['job','jobs','candidate','candidates','apply','now','skills','application','new',
                           'group','day','company','experience','our','position',
                           'pay','train','training','team','staff','indeed','work','working',
                           'yes','we','us','hour','hours','uk','london','hire',
                           'within','slavery','therefore','opportunities','opportunity',
                           'motivation','motivated','he','she','he/she','much','very',
                           'cookies','com','asos','postcode','ago','date','benefits',
                           'cv','role','religion','sexual','orientation','salary','asap',
                           'annum','race','like','may','enjoy','keywords'])
    stop_words = stop_words.union(stop_words_jobs)

    for ic,text_original in enumerate(corpus):
        text = [w.lower() for w in text_original.split() if w.lower() not in stop_words]
        corpus[ic] = ' '.join(text)

    if len(corpus) == 1:
        return corpus[0]
    else:
        return corpus

In [17]:
corpus[0][0:300]

'junior data scientist farfetch co skip description searchclose find jobscompany reviewsfind salariesfind cvsemployers post upload sign advanced search title keywords city junior data scientist farfetch data science directed building software solutions enhance marketing activity using machine learnin'

In [19]:
#remove some additional useless words
corpus =  corpus_stop_word_cleaner(corpus, ['cv'])

In [20]:
corpus[0][0:300]

'junior data scientist farfetch co skip description searchclose find jobscompany reviewsfind salariesfind cvsemployers post upload sign advanced search title city junior data scientist farfetch data science directed building software solutions enhance marketing activity using machine learning advance'

In [56]:
def create_features(corpus, nmin=1,nmax=3,nfeat=5000):    
    vectorizer = TfidfVectorizer(ngram_range=(nmin,nmax), min_df = 1, 
                                 sublinear_tf = True, max_features = nfeat)
    job_features = vectorizer.fit_transform(corpus)
    return vectorizer, job_features # End of the function

In [57]:
vectLondon, london_features = create_features(corpus, nmin=1,nmax=2)

In [58]:
print('Shape of extracted features',london_features.toarray().shape)
print('Some features')
print(vectLondon.get_feature_names()[20:100])
print('Features are stored in a sparse format',type(london_features))

Shape of extracted features (72, 5000)
Some features
['able demonstrate', 'able develop', 'able lead', 'abreast', 'academic', 'academic research', 'academy', 'academy aim', 'accept', 'accept right', 'access', 'accessible', 'accessible affordable', 'accomplishments', 'accordance', 'account', 'accounts', 'accounts based', 'accredited', 'accredited qualifications', 'accurately', 'achieve', 'achieve amazing', 'achieves', 'achieving', 'acquisition', 'across', 'across addressable', 'across bbc', 'across broad', 'across business', 'across industrial', 'across markets', 'across sites', 'across whole', 'act', 'acting', 'acting employment', 'action', 'actionable', 'actionable information', 'actions', 'actions organisational', 'active', 'active participant', 'activities', 'activity', 'activity using', 'acumen', 'acumen interested', 'ad', 'ad hoc', 'adam', 'adam description', 'adaptable', 'adaptable able', 'add', 'add business', 'add value', 'addition', 'addition competitive', 'addition plus', 'ad

In [59]:
#Let's see how many Bayesian-related features there are 
for ii,iv in enumerate(vectLondon.get_feature_names()):
    if 'bayes' in iv:
        print(ii,iv)

408 bayesian
409 bayesian inference
410 bayesian networks
411 bayesian statistics
2703 optimization bayesian
4236 statistics bayesian
4419 theory bayesian


In [60]:
print(london_features.shape)

(72, 5000)


# Use these features for unsupervised classification (clustering)

In [110]:
def do_custering(some_features,true_k=2, do_svd=0):
    if do_svd:
        print("Performing dimensionality reduction using LSA")
        # Vectorizer results are normalized, which makes KMeans behave as
        # spherical k-means for better results. Since LSA/SVD results are
        # not normalized, we have to redo the normalization.
        svd = TruncatedSVD(100)
        normalizer = Normalizer(copy=False)
        lsa = make_pipeline(svd, normalizer)

        X = lsa.fit_transform(some_features)
    else:
        X = copy.deepcopy(some_features)
    
    kmeans_minibatch = 0
    if kmeans_minibatch:
        km = MiniBatchKMeans(n_clusters=true_k, init='k-means++', n_init=1,
                             init_size=1000, batch_size=1000, verbose=1)
    else:
        km = KMeans(n_clusters=true_k, init='k-means++', max_iter=100, n_init=1,
                    verbose=1)

    print("Clustering sparse data with %s" % km)
    km.fit(X)
    return km

In [111]:
km = do_custering(london_features,true_k=2)

Clustering sparse data with KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=100,
    n_clusters=2, n_init=1, n_jobs=1, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=1)
Initialization complete
Iteration  0, inertia 126.634
Iteration  1, inertia 64.134
Iteration  2, inertia 64.054
Converged at iteration 2: center shift 0.000000e+00 within tolerance 1.833665e-08


In [112]:
km.labels_

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1,
       1, 0, 1], dtype=int32)

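Looking at the labels, the clustering did a reasonably good job. We know the first postings are data-science related and the last ones are restaurant related: all of the leading data-science postings land in cluster 1 and cluster 0 contains only restaurant postings, although several restaurant postings are also assigned to cluster 1.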
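
Rather than judging the label array by eye, one option is to cross-tabulate the cluster labels against the query each posting came from. The sketch below does this with `pandas.crosstab` on toy data; the helper `cluster_vs_query` and the toy arrays are hypothetical. To apply it to the real run you would need a list recording which query produced each entry of `corpus`, which `create_corpus` does not currently return.

```python
import numpy as np
import pandas as pd

def cluster_vs_query(labels, query_names):
    """Hypothetical helper: count postings per (originating query, cluster label) pair."""
    return pd.crosstab(pd.Series(query_names, name="query"),
                       pd.Series(labels, name="cluster"))

# Toy stand-ins for km.labels_ and for the query behind each posting
toy_cluster_labels = np.array([1, 1, 1, 0, 1, 0])
toy_queries = ["data+scientist"] * 3 + ["restaurant"] * 3

table = cluster_vs_query(toy_cluster_labels, toy_queries)
print(table)
```

A table with most of the counts concentrated in one cell per row indicates that the clusters line up well with the queries.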

### Let's see which features have the most weight in each category

In [113]:
feat_names = vectLondon.get_feature_names()


In [116]:
def importance_features(feat_names, km, perc=99.9):
    # For each cluster, print the features whose centroid weight exceeds the given percentile
    for iclass in sorted(set(km.labels_)):
        center = km.cluster_centers_[iclass]
        threshold = np.percentile(center, perc) # Compute the cutoff once per cluster
        print('\n****** \nImportant features for class ',iclass,'\n')
        for ii,iv in enumerate(center):
            if iv > threshold:
                print('{0:<30s}{1}'.format(feat_names[ii],iv))

In [117]:
importance_features(feat_names,km, perc=99.8)


****** 
Important features for class  0 

food                          0.043046326334422215
front house                   0.03749551893356555
great                         0.0459537366124178
guests                        0.04170009170266304
host                          0.03664484652401619
hotel                         0.03882788559878081
hotels                        0.04145442090075346
kitchen                       0.042782225094246694
service                       0.038183750649968005
ssp                           0.03655653133185926

****** 
Important features for class  1 

data                          0.05498123805199941
data science                  0.021964953597315692
data scientist                0.04409447506062927
help                          0.03384705939705874
learning                      0.02806830710367649
machine                       0.027179709618865
machine learning              0.027131425987046463
science                       0.02536939482284588
scientist   

# Now let's look only at data-science jobs and see what features set them apart

In [73]:
corpus_ds, URLs_ds, bad_URLs_ds = create_corpus(city = 'london',job_list=['data+scientist'],pages=10)

Searching for  data+scientist
type(num_jobs_area)
There were 1786 jobs found, london
Getting page 0
Getting page 1
Getting page 2
Getting page 3
Getting page 4
Getting page 5
Getting page 6
Getting page 7
Getting page 8
Getting page 9
Done with collecting the job postings!
There were 94 jobs successfully found.


In [128]:
corpus_ds =  corpus_stop_word_cleaner(corpus_ds,['best','great'])

In [150]:
vectLondon_ds, london_features_ds = create_features(corpus_ds, nmin=1,nmax=3,nfeat=10000)
feat_names_ds = vectLondon_ds.get_feature_names()

In [151]:
#Let's arbitrarily split them into 8 clusters
km_ds = do_custering(london_features_ds,true_k=8)
km_ds.labels_

Clustering sparse data with KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=100,
    n_clusters=8, n_init=1, n_jobs=1, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=1)
Initialization complete
Iteration  0, inertia 139.622
Iteration  1, inertia 73.649
Converged at iteration 1: center shift 0.000000e+00 within tolerance 9.185831e-09


array([2, 2, 7, 6, 4, 1, 6, 1, 1, 1, 2, 1, 5, 7, 6, 2, 2, 2, 2, 1, 1, 3, 1,
       1, 7, 0, 1, 6, 2, 1, 2, 2, 0, 2, 0, 7, 1, 2, 2, 0, 2, 6, 7, 2, 2, 2,
       2, 7, 1, 2, 2, 2, 1, 1, 0, 7, 2, 1, 4, 3, 3, 2, 1, 4, 2, 6, 7, 0, 1,
       5, 7, 5, 2, 2, 3, 1, 1, 7, 7, 3, 4, 2, 4, 1, 0, 1, 6, 1, 6, 6, 6, 4,
       3, 2], dtype=int32)
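
The number of clusters above is picked arbitrarily. One common heuristic, which is not used in this notebook, is to compare the mean silhouette score for several values of k and prefer the highest. The sketch below shows the idea on synthetic, well-separated blobs; the helper `silhouette_by_k`, the synthetic data and the range of k values are all assumptions made for illustration. In principle the TF-IDF matrix could be passed in place of `X_demo`, though scores on sparse, high-dimensional text features tend to be much less clear-cut.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

def silhouette_by_k(features, k_values, seed=0):
    """Hypothetical helper: fit KMeans for each k and return {k: mean silhouette score}."""
    scores = {}
    for k in k_values:
        labels = KMeans(n_clusters=k, init="k-means++", n_init=10,
                        random_state=seed).fit_predict(features)
        scores[k] = silhouette_score(features, labels)
    return scores

# Synthetic stand-in for the job features: three well-separated blobs
X_demo, _ = make_blobs(n_samples=150, centers=[[0, 0], [10, 10], [-10, 10]],
                       cluster_std=0.5, random_state=0)
demo_scores = silhouette_by_k(X_demo, range(2, 6))
best_k = max(demo_scores, key=demo_scores.get)
print(demo_scores)
print("best k:", best_k)
```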

In [152]:
print(len(corpus_ds))
print(len(km_ds.labels_))
print(london_features_ds.shape)
print(len(feat_names_ds))

94
94
(94, 10000)
10000


In [154]:
importance_features(feat_names_ds, km_ds, perc=99.8)


****** 
Important features for class  0 

analytics                     0.04604695159388804
aspire                        0.07397985857762232
aspire data                   0.07905726454155611
aspire data recruitment       0.07905726454155611
bi analyst                    0.03696233272123313
data                          0.06262106025756638
data recruitment              0.07905726454155611
data scientist                0.05232207509922674
data scientist aspire         0.03796794367499669
etc                           0.03700205106992695
marketing                     0.03607180393174988
marketing analytics           0.04286130883282559
modelling                     0.05824819132252552
python                        0.03663127686155278
recruitment                   0.04678154948975609
scientist                     0.046862274529366164
statistician                  0.05917793853204608

****** 
Important features for class  1 

big                           0.02701058305356886
business     

# Looking at each category, the job ads can be described as:
    0. not clear
    1. developer, engineer
    2. data scientist
    3. marketing, media, campaign
    4. health, employment
    5. phd, mathematics
    6. financial
    7. AI, ML
    

## Now, your job search is much easier!

## Imagine you're only interested in data science roles with a focus on finance. In that case, you can target those jobs specifically:


In [160]:
def provide_url_one_class(km,URL, target_category):
    target_url=[]
    for i,ilab in enumerate(km.labels_):
        if ilab in target_category:
            print(URL[i])
            target_url.append(URL[i])
    return target_url
        

In [161]:
my_url = provide_url_one_class(km_ds,URLs_ds, [6])

https://www.indeed.co.uk/rc/clk?jk=d9a8e791fdb670fa&fccid=1196f2e3f43d848d&vjs=3
https://www.indeed.co.uk/rc/clk?jk=7840de748811589f&fccid=36801496409e6dc9&vjs=3
https://www.indeed.co.uk/rc/clk?jk=33856e24365bae94&fccid=c46d0116f6e69eae&vjs=3
https://www.indeed.co.uk/rc/clk?jk=19808101cc3f5559&fccid=5bcd1ef0a7f4fb99&vjs=3
https://www.indeed.co.uk/rc/clk?jk=23de10bb768ae4fb&fccid=5bcd1ef0a7f4fb99&vjs=3
https://www.indeed.co.uk/rc/clk?jk=b99426fe9605e04b&fccid=3c0bf511b4a29309&vjs=3
https://www.indeed.co.uk/rc/clk?jk=02cc5d235730ef8d&fccid=c46d0116f6e69eae&vjs=3
https://www.indeed.co.uk/rc/clk?jk=603427c2c8c8a75d&fccid=df6948c9b8da6236&vjs=3
https://www.indeed.co.uk/rc/clk?jk=a7011e68e4bbc26a&fccid=c46d0116f6e69eae&vjs=3
https://www.indeed.co.uk/rc/clk?jk=69be5747ea363ebb&fccid=c46d0116f6e69eae&vjs=3


If you are interested in a more mathematical position:

In [165]:
my_url = provide_url_one_class(km_ds,URLs_ds, [5])

https://www.indeed.co.uk/rc/clk?jk=0190224aaa202c4b&fccid=cbf7c87b1ccf4a6c&vjs=3
https://www.indeed.co.uk/rc/clk?jk=52cb1c590d212275&fccid=0b33f99aac420958&vjs=3
https://www.indeed.co.uk/rc/clk?jk=64a76f928b12dedc&fccid=0b33f99aac420958&vjs=3
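
To browse every cluster at once instead of one category at a time, the URLs can be grouped by label in a single pass. This is a minimal sketch: the helper `urls_by_cluster` is hypothetical, and the toy labels and placeholder URLs stand in for `km_ds.labels_` and `URLs_ds`.

```python
from collections import defaultdict

def urls_by_cluster(labels, urls):
    """Hypothetical helper: map each cluster label to the list of URLs assigned to it."""
    grouped = defaultdict(list)
    for label, url in zip(labels, urls):
        grouped[int(label)].append(url)  # int() turns NumPy integer labels into plain keys
    return dict(grouped)

# Toy labels and placeholder URLs
demo_labels = [6, 5, 6, 2]
demo_urls = ["url_a", "url_b", "url_c", "url_d"]
demo_groups = urls_by_cluster(demo_labels, demo_urls)
print(demo_groups[6])  # ['url_a', 'url_c']
```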
