# Classifying job postings from Indeed.co.uk with Gaussian Mixture Models

In this notebook, I group job postings from Indeed.co.uk into categories of similar jobs, without using any labels.

The structure is as follows:

1. Create a corpus from a number of job postings.
    - This requires scraping the web. For this I adapted the notebook at https://jessesw.com/Data-Science-Skills/, which uses the package BeautifulSoup.

2. Create bag-of-words features using tf-idf. I use 1-, 2- and 3-gram bags of words, built with TfidfVectorizer from sklearn.feature_extraction.text.

3. Perform an unsupervised classification (clustering) of the job postings with a Gaussian Mixture Model, using GaussianMixture from sklearn.mixture.
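
As a rough, self-contained sketch of steps 2 and 3 (the scraping step is left out), the following runs tf-idf and a Gaussian mixture on a tiny made-up corpus. The toy sentences and the `toy_*` names are illustrative assumptions, not data scraped from Indeed.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.mixture import GaussianMixture

# Hypothetical toy corpus standing in for scraped job postings
toy_corpus = [
    "data scientist python machine learning statistics",
    "machine learning engineer python deep learning",
    "data analyst statistics python sql",
    "waiter restaurant customer service tables",
    "restaurant waiter bar service evenings",
    "bar staff customer service restaurant",
]

# Step 2: tf-idf features over 1-, 2- and 3-grams, capped at 20 features
toy_vectorizer = TfidfVectorizer(ngram_range=(1, 3), sublinear_tf=True, max_features=20)
toy_features = toy_vectorizer.fit_transform(toy_corpus)

# Step 3: fit a two-component Gaussian mixture on the dense feature matrix
toy_gmm = GaussianMixture(n_components=2, covariance_type="diag", random_state=0)
toy_labels = toy_gmm.fit_predict(toy_features.toarray())

print(toy_features.shape)  # (number of documents, number of features)
print(toy_labels)          # one cluster index per document
```

The cluster indices themselves are arbitrary; only which documents share an index matters.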

   

In [2]:
from bs4 import BeautifulSoup # For HTML parsing
import urllib.request # Website connections (urllib.request must be imported explicitly in Python 3)
import re # Regular expressions
from time import sleep # To prevent overwhelming the server between connections
from collections import Counter # Keep track of our term counts
from nltk.corpus import stopwords # Filter out stopwords, such as 'the', 'or', 'and'
import pandas as pd # For converting results to a dataframe and bar chart plots
import numpy as np
import copy
%matplotlib inline

In [39]:
from sklearn.mixture import GaussianMixture
from sklearn.feature_extraction.text import TfidfVectorizer
import operator

## Function raw_text_cleaner: takes a URL and extracts a job description

In [8]:
def raw_text_cleaner(website):
    '''
    From the notebook by https://jessesw.com/Data-Science-Skills/ 
    This function just cleans up the raw html so that I can look at it.
    Inputs: a URL to investigate
    Outputs: Cleaned text only
    '''
    try:
        #site = urllib2.urlopen(website).read() # Connect to the job posting
        site = urllib.request.urlopen(website).read() # Connect to the job posting
    except Exception:
        return None  # Needed in case the posting no longer exists or the connection fails
    
    #soup_obj = BeautifulSoup(site) # Get the html from the site
    soup_obj = BeautifulSoup(site, "lxml")
    if len(soup_obj) == 0: # In case the default parser lxml doesn't work, try another one
        soup_obj = BeautifulSoup(site, 'html5lib')
    
    
    for script in soup_obj(["script", "style"]):
        script.extract() # Remove these two elements from the BS4 object
    
    
    text_original = soup_obj.get_text()
    text_original = re.sub("[^a-zA-Z+3]"," ", str(text_original))  # Now get rid of any terms that aren't words (include 3 for d3.js)
    stop_words_base = set(stopwords.words("english")) # Filter out any stop words
    stop_words_jobs = set(['job','jobs','candidate','candidates','apply','now','skills','application','new',
                           'group','day','company','experience','our','job','position',
                           'pay','train','training','team','staff','indeed','work','working',
                           'yes','we','us','pay','no','hour','hours','uk','london','hire',
                           'team','within','slavery','therefore','opportunities','opportunity',
                           'motivation','motivated','he','she','he/she','much','very'])
    stop_words = stop_words_base.union(stop_words_jobs)
    text = [w.lower() for w in text_original.split() if w.lower() not in stop_words]
    text_original = ' '.join(text)

    return text_original
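
To see what the cleaning step does without touching the network, here is a minimal sketch that applies the same regular expression and the same filtering pattern to a made-up string. The sample text and the three-word stop list are assumptions for illustration; the function above uses NLTK's English list plus the job-specific words.

```python
import re

# Hypothetical raw text standing in for soup_obj.get_text()
raw_text = "Hiring a Data Scientist (Python, C++, d3.js) in London!"

# Same regular expression as in raw_text_cleaner: keep letters, '+' and '3'
letters_only = re.sub("[^a-zA-Z+3]", " ", raw_text)

# Small illustrative stop-word set (an assumption, much shorter than the real one)
demo_stop_words = {"a", "in", "london"}

# Lower-case every token and drop the stop words
tokens = [w.lower() for w in letters_only.split() if w.lower() not in demo_stop_words]
cleaned = " ".join(tokens)
print(cleaned)  # hiring data scientist python c++ d3 js
```

Note that `d3.js` survives as the two tokens `d3` and `js`, because the dot is replaced by a space.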

Test that it works:

In [9]:
website = 'https://www.indeed.co.uk/viewjob?jk=ba3df8f30e1691b7&tk=1c82t0tcm9m5i9ng&from=serp&alid=3&advn=1402909195792678'
sample_original = raw_text_cleaner(website)
print(sample_original[0:300])

data scientist fospha co skip description searchclose find jobscompany reviewsfind salariesfind cvsemployers post upload cv sign advanced search title keywords city postcode data scientist fospha data scientist variety exciting projects fast growing organization lot tackle complex problems require m


Build a 1- to 3-gram bag of words from this sample text to see what the features look like:

In [10]:
vectorizer2 = TfidfVectorizer(ngram_range=(1,3), sublinear_tf=True)
sample_original_features = vectorizer2.fit_transform([sample_original])
vectorizer2.get_feature_names()[0:10]

['ability',
 'ability conduct',
 'ability conduct deep',
 'able',
 'able translate',
 'able translate business',
 'acquiring',
 'acquiring customers',
 'acquiring customers attribute',
 'acquisition']
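
The pattern in this list (a word, then the same word followed by one and two neighbours) comes directly from `ngram_range=(1,3)`. A minimal sketch on a made-up three-word document shows the full expansion; the `demo_*` names are illustrative only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# One short, made-up document
demo_vectorizer = TfidfVectorizer(ngram_range=(1, 3), sublinear_tf=True)
demo_vectorizer.fit(["deep learning engineer"])

# vocabulary_ maps each n-gram to its column index; sorting the keys gives alphabetical order
demo_ngrams = sorted(demo_vectorizer.vocabulary_)
print(demo_ngrams)
# ['deep', 'deep learning', 'deep learning engineer', 'engineer', 'learning', 'learning engineer']
```

Three words yield three unigrams, two bigrams and one trigram, so the feature count grows quickly with document length.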

Now that we see how to extract the features from one job posting, let's open several job postings, create a corpus, and create a sparse matrix for the features.

In [11]:
def create_corpus(city = None, job_list=['data+scientist', 'machine+learning'],pages=5):
    '''
    Initially based on the notebook at https://jessesw.com/Data-Science-Skills/
    Inputs: city, a list of job search queries, and the number of result pages to read per query
    Outputs: corpus, URLs of the job descriptions found, URLs that could not be used
    '''

    if isinstance(job_list, str):  # Accept a single query string as well as a list of queries
        job_list = [job_list]
        
    job_descriptions = [] # Store all our descriptions in this list
    all_URLS=[]
    bad_URLS = []
    for final_job in job_list:
        print('Searching for ', final_job)

        #https://www.indeed.co.uk/jobs?q=data+scientist&l=london&sort=date&start=10    
        final_site_list = ['https://www.indeed.co.uk/jobs?q=', final_job, '&l=', city or '',
                       '&sort=date'] # Join all of our strings together; city=None becomes an empty location so join does not fail

        final_site = ''.join(final_site_list) # Merge the html address together into one string


        base_url = 'https://www.indeed.co.uk'

        #print('TRY',final_site)
        try:
            html = urllib.request.urlopen(final_site).read() # Open up the front page of our search first
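        except Exception:
            print('That city/state combination did not have any jobs. Exiting . . .') # In case the city is invalid
            return job_descriptions, all_URLS, bad_URLS # Return what was collected so far, so the caller can still unpack three values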
        soup = BeautifulSoup(html,"lxml") # Get the html from the first page
        if len(soup) < 1: print('THERE IS AN ERROR LOADING THE PAGE')

        # Now find out how many jobs there were

        num_jobs_area = soup.find(id = 'searchCount').string.encode('utf-8') # Now extract the total number of jobs found
                                                                             # The 'searchCount' object has this
        print('type(num_jobs_area)')
        job_numbers = re.findall(r'\d+', str(num_jobs_area)) # Extract the total jobs found from the search result


        if len(job_numbers) > 3: # Have a total number of jobs greater than 1000
            total_num_jobs = (int(job_numbers[2])*1000) + int(job_numbers[3])
        else:
            total_num_jobs = int(job_numbers[2]) 

        city_title = city
        if city is None:
            city_title = 'Nationwide'

        print('There were', total_num_jobs, 'jobs found,', city_title) # Display how many jobs were found

        num_pages = int(total_num_jobs/10) # This will be how we know the number of times we need to iterate over each new
                                      # search result page

        for i in range(0,min(pages,num_pages+1)): # Loop through all of our search result pages
            print('Getting page', i)
            start_num = str(i*10) # Assign the multiplier of 10 to view the pages we want
            if i>0:
                current_page = ''.join([final_site, '&start=', start_num])
            else:
                current_page = final_site
            # Now that we can view the correct 10 job returns, start collecting the text samples from each
            html_page = urllib.request.urlopen(current_page).read() # Get the page

            page_obj = BeautifulSoup(html_page,'lxml') # Locate all of the job links
            job_link_area = page_obj.find(id = 'resultsCol') # The center column on the page where the job postings exist
            job_URLS=[]
            
            for link in job_link_area.find_all('a'):
                try:
                    if link.get('href')[0:3]=='/rc':
                        job_URLS.append(base_url + link.get('href'))
                except:
                        if link !=None:
                            if link.get('href') != None:
                                bad_URLS.append(base_url + link.get('href'))

            for j in range(0,len(job_URLS)):
                final_description = raw_text_cleaner(job_URLS[j])
                if final_description: # So that we only append when the website was accessed correctly
                    job_descriptions.append(final_description)
                    all_URLS.append(job_URLS[j])
                #sleep(1) # 

        print('Done with collecting the job postings!')    
        print('There were {} jobs successfully found.'.format(len(job_descriptions)))

    return job_descriptions, all_URLS, bad_URLS
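
The job-count logic inside `create_corpus` is easier to follow in isolation. The sketch below mirrors it in a hypothetical helper, `parse_total_jobs`. The input format "Jobs 1 to 10 of 1,778" is an assumption inferred from the way the function indexes `job_numbers`; the real page text may differ.

```python
import re

def parse_total_jobs(search_count_text):
    """Mirror the counting logic of create_corpus for a 'searchCount' string."""
    job_numbers = re.findall(r'\d+', search_count_text)
    if len(job_numbers) > 3:  # A thousands separator splits the total into two numbers
        return int(job_numbers[2]) * 1000 + int(job_numbers[3])
    return int(job_numbers[2])

# Assumed formats, chosen so that the total is the third number (or third and fourth)
print(parse_total_jobs("Jobs 1 to 10 of 1,778"))  # 1778
print(parse_total_jobs("Jobs 1 to 10 of 778"))    # 778
```

Under the same assumption, a total of one million or more would be undercounted, since only two groups of digits are combined.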

# Example 1: Create the corpus for two very different job descriptions: 'data scientist' and 'waiter'

In [12]:
corpus, URLs, bad_URLs = create_corpus(city = 'london',job_list=['data+scientist', 'waiter'],pages=3)

Searching for  data+scientist
type(num_jobs_area)
There were 1778 jobs found, london
Getting page 0
Getting page 1
Getting page 2
Done with collecting the job postings!
There were 27 jobs successfully found.
Searching for  waiter
type(num_jobs_area)
There were 1454 jobs found, london
Getting page 0
Getting page 1
Getting page 2
Done with collecting the job postings!
There were 43 jobs successfully found.


In [13]:
print('There are {0} job postings, {1} URLS'.format(len(corpus), len(URLs)))

There are 43 job postings, 43 URLS


In [14]:
def corpus_stop_word_cleaner(corpus, stop_words_input=None):
    '''
    This function just removes some words from the corpus in case you realize you want to filter out more words.
    Inputs: corpus
    Outputs: stop_words filtered out of corpus
    '''
    cc=[0]
    if type(corpus) != list: 
        cc[0] = corpus
        corpus = cc
        
    for ic,text_original in enumerate(corpus):       
        stop_words_base = set(stopwords.words("english")) # Filter out any stop words
        if stop_words_input != None: stop_words_base = stop_words_base.union(stop_words_input)
        stop_words_jobs = set(['job','jobs','candidate','candidates','apply','now','skills','application','new',
                           'group','day','company','experience','our','job','position',
                           'pay','train','training','team','staff','indeed','work','working',
                           'yes','we','us','pay','hour','hours','uk','london','hire',
                           'team','within','slavery','therefore','opportunities','opportunity',
                           'motivation','motivated','he','she','he/she','much','very',
                              'cookies','com','asos','postcode','ago','date','benefits',
                              'cv','role','cookies','com','asos','postcode','ago','date',
                               'benefits','religion','sexual','orientation','salary','asap',
                               'annum','race','like' ,'may','enjoy','keywords' ])
    
        stop_words = stop_words_base.union(stop_words_jobs)
        text = [w.lower() for w in text_original.split() if w.lower() not in stop_words]
        text_original = ' '.join(text)
        corpus[ic] = text_original
    if len(corpus) == 1 :
        return corpus[ic]
    else:
        return corpus

In [15]:
#remove some additional useless words
corpus =  corpus_stop_word_cleaner(corpus, ['cv'])

In [16]:
def create_features(corpus, nmin=1,nmax=3,nfeat=10000):    
    vectorizer = TfidfVectorizer(ngram_range=(nmin,nmax), min_df = 1, 
                                 sublinear_tf = True, max_features = nfeat)
    job_features = vectorizer.fit_transform(corpus)
    return vectorizer, job_features # End of the function

In [84]:
vectLondon, london_features = create_features(corpus, nmin=1,nmax=3, nfeat=100)

In [85]:
print('Shape of extracted features',london_features.toarray().shape)
print('Some features')
print(vectLondon.get_feature_names()[20:100])
print('Features are stored in a sparse format',type(london_features))

Shape of extracted features (43, 100)
Some features
['deliver', 'deliver services', 'deliver services cookie', 'describes', 'describes use', 'describes use disable', 'description', 'description searchclose', 'description searchclose find', 'development', 'disable', 'disable learn', 'disable learn ok', 'engineer', 'find', 'find jobscompany', 'find jobscompany reviewsfind', 'help', 'help centre', 'help centre help', 'help deliver', 'help deliver services', 'jobscompany', 'jobscompany reviewsfind', 'jobscompany reviewsfind salariesfind', 'learn', 'learn ok', 'learn ok anti', 'learning', 'ok', 'ok anti', 'ok anti statement', 'one', 'original', 'plan', 'policy', 'policy describes', 'policy describes use', 'post', 'post upload', 'post upload sign', 'privacy', 'privacy terms', 'reviewsfind', 'reviewsfind salariesfind', 'reviewsfind salariesfind cvsemployers', 'salariesfind', 'salariesfind cvsemployers', 'salariesfind cvsemployers post', 'save', 'save original', 'scientist', 'search', 'search 

In [86]:
print(london_features.shape)

(43, 100)


In [87]:
print(corpus[3][0:100])

deep learning engineer aig co skip description searchclose find jobscompany reviewsfind salariesfind


# Use these features to do UNSUPERVISED classification with GMM

In [88]:
k = 2
cov_type='diag'
estimator1 = GaussianMixture(n_components=k,
                   covariance_type=cov_type, max_iter=100, random_state=0)
estimator1.fit(london_features.toarray()  )    # Learns model parameters
y_pred = estimator1.predict(london_features.toarray())

In [89]:
y_pred

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])

Looking at the labels, we can see that the model did a reasonably good job: the first 27 postings, which all come from the 'data scientist' search, fall in cluster 0, and the last 16, which come from the 'waiter' search, fall in cluster 1.
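
One way to put a number on this agreement is the adjusted Rand index from sklearn.metrics, which compares two groupings regardless of how the clusters are numbered. This is a sketch added for illustration, not part of the original analysis; the predicted labels are transcribed from the output above, and the reference labels are an assumption based on the order in which the two queries were scraped.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

# Cluster labels transcribed from the y_pred output above: 27 zeros followed by 16 ones
y_pred_london = np.array([0] * 27 + [1] * 16)

# Reference labels from the scraping order: 27 'data scientist' postings (coded 1),
# then 16 'waiter' postings (coded 0). The coding is deliberately flipped.
y_query = np.array([1] * 27 + [0] * 16)

# 1.0 means the two groupings are identical up to a renaming of the labels
ari = adjusted_rand_score(y_query, y_pred_london)
print(ari)
```

A score near 0 would indicate agreement no better than chance.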

### Let's see which features carry the most weight in each category

In [90]:
feat_names = vectLondon.get_feature_names()

In [109]:
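def importance_features(k, feature_names,estimator,perc=99.9):
    '''
    Print, for each of the k clusters, the features whose component mean
    lies above the perc-th percentile of that cluster's means.
    '''
    for ik in range(k):        
        impor_feat = {}
        muik = estimator.means_[ik] # Mean tf-idf vector of cluster ik
        tth = np.percentile(muik,perc) # Threshold: keep roughly the top (100 - perc)% of features
        for i,iv in enumerate(muik):
            if iv > tth:
                impor_feat[feature_names[i] ] =  iv 
        sorted_x = sorted(impor_feat.items(), key=operator.itemgetter(1),reverse=True  )
        print('\n \n Important features for cluster ', ik)
        for ii in sorted_x:
            print('{0:<30s}{1}'.format(ii[0],ii[1]))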

In [96]:
importance_features(k,feat_names,estimator1,perc=95)


 
 Important features for cluster  0
data                          0.30724879366775093
engineer                      0.17864637694965696
scientist                     0.15368974823220527
help                          0.15113825447693663
learning                      0.15076261773119337

 
 Important features for cluster  1
waiter                        0.35717768617518614
help                          0.1613751747678486
learn                         0.10780829146340816
deliver                       0.10539187608063731
find                          0.0988078090306612
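
Each cluster lists exactly five features because of the percentile threshold: with 100 features and `perc=95`, only the values strictly above the 95th percentile are kept. The sketch below reproduces that count on made-up numbers standing in for one row of `estimator.means_`; the `fake_*` arrays are assumptions, not the fitted means.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for one row of estimator.means_: 100 distinct made-up values
fake_means = rng.permutation(100) / 100.0

threshold = np.percentile(fake_means, 95)
n_selected = int(np.sum(fake_means > threshold))
print(n_selected)  # 5

# Same rule with 1000 distinct values and perc=99
fake_means_1000 = rng.permutation(1000) / 1000.0
n_selected_1000 = int(np.sum(fake_means_1000 > np.percentile(fake_means_1000, 99)))
print(n_selected_1000)  # 10
```

With tied values the count can be smaller, since the comparison is strict.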


# Now let's look only at data-science jobs and see which features set them apart

In [158]:
corpus_ds, URLs_ds, bad_URLs_ds = create_corpus(city = 'london',job_list=['data+science'],pages=6)

Searching for  data+science
type(num_jobs_area)
There were 5293 jobs found, london
Getting page 0
Getting page 1
Getting page 2
Getting page 3
Getting page 4
Getting page 5
Done with collecting the job postings!
There were 56 jobs successfully found.


In [159]:
corpus_ds =  corpus_stop_word_cleaner(corpus_ds,['ck','best','great','boyce','mendez','mondelez','durlston','cwjobs','futureheads','harnham','venturi'])

In [171]:
vectLondon_ds, london_features_ds = create_features(corpus_ds, nmin=1,nmax=3,nfeat=1000)
feat_names_ds = vectLondon_ds.get_feature_names()

In [176]:
k = 8
cov_type='diag'
estimator2 = GaussianMixture(n_components=k,
                   covariance_type=cov_type, max_iter=300, random_state=0)
estimator2.fit(london_features_ds.toarray()  )    # Learns model parameters
y_pred = estimator2.predict(london_features_ds.toarray())

In [177]:
y_pred

array([3, 6, 4, 2, 0, 1, 0, 2, 5, 3, 5, 7, 2, 7, 7, 3, 7, 2, 0, 0, 0, 5, 1,
       6, 2, 2, 2, 1, 6, 6, 2, 3, 7, 1, 7, 7, 2, 2, 2, 6, 1, 1, 2, 6, 0, 2,
       2, 2, 4, 2, 1, 6, 2, 1, 5, 6])

In [179]:
importance_features(k, feat_names_ds, estimator2, perc=99)


 
 Important features for cluster  0
engineer                      0.13678746949644155
software                      0.08559024383915062
aws                           0.08116265602017429
software engineer             0.07972412404219022
product                       0.0763694882249686
media                         0.0736189383020986
digital                       0.06650538986881856
business                      0.06617113641471413
development                   0.06561114006135321
technical                     0.05814401334860535

 
 Important features for cluster  1
scientist                     0.111998240250204
data scientist                0.10809198320696624
data science                  0.0730476656066979
data                          0.07243356979037488
ey                            0.0698053020591506
analytics                     0.06472596889054012
public health                 0.062348210126380005
health                        0.060985164976196944
33                          

### Look at each category and find the one(s) best suited to you.

### Now your job search is much easier!



In [164]:
def provide_url_one_class(label,URL, target_category):
    target_url=[]
    for i,ilab in enumerate(label):
        if ilab in target_category:
            print(URL[i])
            target_url.append(URL[i])
    return target_url
        

In [183]:
# Let's look at developer-oriented data science jobs
my_url = provide_url_one_class(y_pred,URLs_ds, [5])

https://www.indeed.co.uk/rc/clk?jk=1cb0ace166ce0185&fccid=892c145157842ceb&vjs=3
https://www.indeed.co.uk/rc/clk?jk=e6474828284e2918&fccid=892c145157842ceb&vjs=3
https://www.indeed.co.uk/rc/clk?jk=0b1e74d40435441e&fccid=0592bb9a425e26cc&vjs=3
https://www.indeed.co.uk/rc/clk?jk=657e9439763ef114&fccid=113037d2ecac6197&vjs=3


### Looking at estimator2.weights_, we can see that in this case the largest contribution is from category 2 (about 30% of the postings), followed by categories 1 and 6 (about 14% each)

In [180]:
estimator2.weights_

array([ 0.10714286,  0.14285714,  0.30357143,  0.07142857,  0.03571429,
        0.07142857,  0.14285714,  0.125     ])
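
These weights can be cross-checked against the hard cluster assignments. The sketch below counts the postings per cluster in `y_pred` (transcribed from the output above into a new array, `y_pred_ds`) and divides by the total. In general, `weights_` are averages of soft membership probabilities and need not match these shares exactly; here they agree to the printed precision.

```python
import numpy as np

# y_pred transcribed from the output of estimator2.predict above
y_pred_ds = np.array([3, 6, 4, 2, 0, 1, 0, 2, 5, 3, 5, 7, 2, 7, 7, 3, 7, 2, 0, 0, 0, 5, 1,
                      6, 2, 2, 2, 1, 6, 6, 2, 3, 7, 1, 7, 7, 2, 2, 2, 6, 1, 1, 2, 6, 0, 2,
                      2, 2, 4, 2, 1, 6, 2, 1, 5, 6])

counts = np.bincount(y_pred_ds, minlength=8)  # Postings per cluster
fractions = counts / len(y_pred_ds)           # Share of the 56 postings in each cluster

print(counts)                  # [ 6  8 17  4  2  4  8  7]
print(np.round(fractions, 8))
```

Cluster 2 holds 17 of the 56 postings, and cluster 5, whose URLs are listed above, holds 4.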