# Classifying job postings from Indeed.co.uk with Gaussian Mixture Models

In this notebook, I classify job postings from Indeed.co.uk.

The structure is as follows:


1. Create a corpus from a number of job postings.
    - This requires scraping the web. For this I used the notebook at https://jessesw.com/Data-Science-Skills/, which uses the package BeautifulSoup.
       
2. Create bag-of-words features using Tf-idf. I use 1-, 2- and 3-gram bags of words, built with TfidfVectorizer from sklearn.feature_extraction.text.

3. Perform an unsupervised classification of the job postings with a Gaussian Mixture Model (GaussianMixture from sklearn.mixture).

   

In [1]:
from bs4 import BeautifulSoup # For HTML parsing
import urllib.request # Website connections (urlopen lives in the urllib.request submodule)
import re # Regular expressions
from time import sleep # To prevent overwhelming the server between connections
from collections import Counter # Keep track of our term counts
from nltk.corpus import stopwords # Filter out stopwords, such as 'the', 'or', 'and'
import pandas as pd # For converting results to a dataframe and bar chart plots
import numpy as np
import copy
%matplotlib inline

In [2]:
from sklearn.mixture import GaussianMixture
from sklearn.feature_extraction.text import TfidfVectorizer
import operator

## Function raw_text_cleaner takes a URL and extracts a job description

In [3]:
def raw_text_cleaner(website):
    '''
    From the notebook by https://jessesw.com/Data-Science-Skills/ 
    This function just cleans up the raw html so that I can look at it.
    Inputs: a URL to investigate
    Outputs: Cleaned text only
    '''
    try:
        #site = urllib2.urlopen(website).read() # Connect to the job posting
        site = urllib.request.urlopen(website).read() # Connect to the job posting
    except Exception:
        return None  # Need this in case the website isn't there anymore or some other weird connection problem
    
    #soup_obj = BeautifulSoup(site) # Get the html from the site
    soup_obj = BeautifulSoup(site, "lxml")
    if len(soup_obj) == 0: # In case the default parser lxml doesn't work, try another one
        soup_obj = BeautifulSoup(site, 'html5lib')
    
    
    for script in soup_obj(["script", "style"]):
        script.extract() # Remove these two elements from the BS4 object
    
    
    text_original = soup_obj.get_text()
    text_original = re.sub("[^a-zA-Z+3]"," ", str(text_original))  # Now get rid of any terms that aren't words (include 3 for d3.js)
    stop_words_base = set(stopwords.words("english")) # Filter out any stop words
    stop_words_jobs = set(['job','jobs','candidate','candidates','apply','now','skills','application','new',
                           'group','day','company','experience','our','job','position',
                           'pay','train','training','team','staff','indeed','work','working',
                           'yes','we','us','pay','no','hour','hours','uk','london','hire',
                           'team','within','slavery','therefore','opportunities','opportunity',
                           'motivation','motivated','he','she','he/she','much','very'])
    stop_words = stop_words_base.union(stop_words_jobs)
    text = [w.lower() for w in text_original.split() if w.lower() not in stop_words]
    text_original = ' '.join(text)

    return text_original

Test that it works:

In [4]:
website = 'https://www.indeed.co.uk/viewjob?jk=ba3df8f30e1691b7&tk=1c82t0tcm9m5i9ng&from=serp&alid=3&advn=1402909195792678'
sample_original = raw_text_cleaner(website)
print(sample_original[0:300])

data scientist fospha co skip description searchclose find jobscompany reviewsfind salariesfind cvsemployers post upload cv sign advanced search title keywords city postcode data scientist fospha data scientist variety exciting projects fast growing organization lot tackle complex problems require m
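
The cleaning step inside raw_text_cleaner can be tried offline, without hitting the website. The sketch below applies the same regular expression and the same lower-case/stop-word filter to a made-up sentence. The helper name `clean_text`, the example sentence and the tiny stop-word set are assumptions for illustration; the notebook itself uses NLTK's English stop words plus the job-specific list.

```python
import re

# Hypothetical mini stop-word set, standing in for NLTK's list plus the job-specific words.
STOP_WORDS = {"the", "a", "and", "for", "we", "are", "with", "in", "job", "apply", "now"}

def clean_text(raw, stop_words=STOP_WORDS):
    """Keep only letters, '+' and '3', lower-case the tokens, drop stop words."""
    # Same character class as in raw_text_cleaner: everything else becomes a space.
    letters_only = re.sub(r"[^a-zA-Z+3]", " ", raw)
    tokens = [w.lower() for w in letters_only.split()]
    return " ".join(w for w in tokens if w not in stop_words)

cleaned = clean_text("Apply now! We are hiring a Data Scientist (C++, D3.js) for the London team.")
print(cleaned)
```

Note that the character class keeps `+` and `3`, so terms such as `c++` and `d3` survive while other digits and punctuation are dropped.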


Do a (1,2,3)-gram bag of words on this sample text, to see what the features look like:

In [314]:
vectorizer2 = TfidfVectorizer(ngram_range=(1,3), sublinear_tf=True)
sample_original_features = vectorizer2.fit_transform([sample_original])
vectorizer2.get_feature_names()[0:5]

['ability',
 'ability conduct',
 'ability conduct deep',
 'able',
 'able translate']
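
To see where names such as 'ability conduct deep' come from, here is a plain-Python sketch of the word n-gram expansion for n = 1 to 3. It is not TfidfVectorizer's implementation (for instance, the vectorizer's default tokenizer also drops single-character tokens), and the helper name `word_ngrams` and the example phrase are assumptions.

```python
def word_ngrams(text, nmin=1, nmax=3):
    """Return the sorted set of word n-grams of a whitespace-tokenized text."""
    tokens = text.split()
    grams = set()
    for n in range(nmin, nmax + 1):
        # Slide a window of n tokens over the text.
        for i in range(len(tokens) - n + 1):
            grams.add(" ".join(tokens[i:i + n]))
    return sorted(grams)

features = word_ngrams("ability conduct deep analysis")
for name in features:
    print(name)
```

Each n-gram becomes one column of the feature matrix, which is why the number of features grows quickly with the n-gram range.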

Now that we see how to extract the features from one job posting, let's open several job postings, create a corpus, and create a sparse matrix for the features.

In [6]:
def create_corpus(city = None, job_list=['data+scientist', 'machine+learning'],pages=5):
    '''
    Initially based on the notebook at https://jessesw.com/Data-Science-Skills/
    Input : city, a list of job description queries, and the number of result pages to read
    Output: corpus, URLs of the job descriptions found, and URLs that could not be parsed
    '''

    if isinstance(job_list, str): # Allow a single query string as well as a list of queries
        job_list = [job_list]
        
    job_descriptions = [] # Store all our descriptions in this list
    all_URLS=[]
    bad_URLS = []
    for final_job in job_list:
        print('Searching for ', final_job)

        #https://www.indeed.co.uk/jobs?q=data+scientist&l=london&sort=date&start=10    
        final_site_list = ['https://www.indeed.co.uk/jobs?q=', final_job, '&l=', city or '',
                       '&sort=date'] # Join all of our strings together; city=None becomes an empty location so join() does not fail

        final_site = ''.join(final_site_list) # Merge the html address together into one string


        base_url = 'https://www.indeed.co.uk'

        #print('TRY',final_site)
        try:
            html = urllib.request.urlopen(final_site).read() # Open up the front page of our search first
        except Exception:
            print('That city/state combination did not have any jobs. Exiting . . .') # In case the city is invalid
            return
        soup = BeautifulSoup(html,"lxml") # Get the html from the first page
        if len(soup) < 1: print('THERE IS AN ERROR LOADING THE PAGE')

        # Now find out how many jobs there were

        num_jobs_area = soup.find(id = 'searchCount').string.encode('utf-8') # Now extract the total number of jobs found
                                                                             # The 'searchCount' object has this
        print('type(num_jobs_area)')
        job_numbers = re.findall(r'\d+', str(num_jobs_area)) # Extract the total jobs found from the search result


        if len(job_numbers) > 3: # Have a total number of jobs greater than 1000
            total_num_jobs = (int(job_numbers[2])*1000) + int(job_numbers[3])
        else:
            total_num_jobs = int(job_numbers[2]) 

        city_title = city
        if city is None:
            city_title = 'Nationwide'

        print('There were', total_num_jobs, 'jobs found,', city_title) # Display how many jobs were found

        num_pages = int(total_num_jobs/10) # This will be how we know the number of times we need to iterate over each new
                                      # search result page

        for i in range(0,min(pages,num_pages+1)): # Loop through all of our search result pages
            print('Getting page', i)
            start_num = str(i*10) # Assign the multiplier of 10 to view the pages we want
            if i>0:
                current_page = ''.join([final_site, '&start=', start_num])
            else:
                current_page = final_site
            # Now that we can view the correct 10 job returns, start collecting the text samples from each
            html_page = urllib.request.urlopen(current_page).read() # Get the page

            page_obj = BeautifulSoup(html_page,'lxml') # Locate all of the job links
            job_link_area = page_obj.find(id = 'resultsCol') # The center column on the page where the job postings exist
            job_URLS=[]
            
            for link in job_link_area.find_all('a'):
                try:
                    if link.get('href')[0:3]=='/rc':
                        job_URLS.append(base_url + link.get('href'))
                except:
                        if link !=None:
                            if link.get('href') != None:
                                bad_URLS.append(base_url + link.get('href'))

            for j in range(0,len(job_URLS)):
                final_description = raw_text_cleaner(job_URLS[j])
                if final_description: # So that we only append when the website was accessed correctly
                    job_descriptions.append(final_description)
                    all_URLS.append(job_URLS[j])
                #sleep(1) # 

        print('Done with collecting the job postings!')    
        print('There were {} jobs successfully found.'.format(len(job_descriptions)))

    return job_descriptions, all_URLS, bad_URLS
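
The pagination inside create_corpus boils down to requesting the same search URL with `&start=10`, `&start=20`, and so on. The sketch below builds those URLs with `urllib.parse.urlencode` instead of joining strings by hand, so queries can be written with spaces rather than `+`. The helper name `search_page_urls` and the `per_page=10` default are assumptions taken from how the loop above is written; no request is sent.

```python
from urllib.parse import urlencode

BASE_SEARCH_URL = "https://www.indeed.co.uk/jobs"

def search_page_urls(query, city=None, pages=3, per_page=10):
    """Build the result-page URLs for one query without fetching them."""
    urls = []
    for i in range(pages):
        params = {"q": query}
        if city:
            params["l"] = city
        params["sort"] = "date"
        if i > 0:
            # The first page has no offset; later pages skip i * per_page results.
            params["start"] = i * per_page
        urls.append(BASE_SEARCH_URL + "?" + urlencode(params))
    return urls

urls = search_page_urls("data scientist", city="london", pages=3)
for u in urls:
    print(u)
```

`urlencode` escapes the space in the query as `+`, which reproduces the URL shape shown in the comment inside create_corpus.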

# Example 1: Create the corpus for two very different job descriptions: 'data scientist' and 'waiter'

In [184]:
corpus, URLs, bad_URLs = create_corpus(city = 'london',job_list=['data+scientist', 'waiter'],pages=3)

Searching for  data+scientist
type(num_jobs_area)
There were 1790 jobs found, london
Getting page 0
Getting page 1
Getting page 2
Done with collecting the job postings!
There were 27 jobs successfully found.
Searching for  waiter
type(num_jobs_area)
There were 1476 jobs found, london
Getting page 0
Getting page 1
Getting page 2
Done with collecting the job postings!
There were 44 jobs successfully found.


In [185]:
print('There are {0} job postings, {1} URLS'.format(len(corpus), len(URLs)))

There are 44 job postings, 44 URLS


In [186]:
def corpus_stop_word_cleaner(corpus, stop_words_input=None):
    '''
    This function just removes some words from the corpus in case you realize you want to filter out more words.
    Inputs: corpus
    Outputs: stop_words filtered out of corpus
    '''
    cc=[0]
    if type(corpus) != list: 
        cc[0] = corpus
        corpus = cc
        
    for ic,text_original in enumerate(corpus):       
        stop_words_base = set(stopwords.words("english")) # Filter out any stop words
        if stop_words_input != None: stop_words_base = stop_words_base.union(stop_words_input)
        stop_words_jobs = set(['job','jobs','candidate','candidates','apply','now','skills','application','new',
                           'group','day','company','experience','our','job','position',
                           'pay','train','training','team','staff','indeed','work','working',
                           'yes','we','us','pay','hour','hours','uk','london','hire',
                           'team','within','slavery','therefore','opportunities','opportunity',
                           'motivation','motivated','he','she','he/she','much','very',
                              'cookies','com','asos','postcode','ago','date','benefits',
                              'cv','role','cookies','com','asos','postcode','ago','date',
                               'benefits','religion','sexual','orientation','salary','asap',
                               'annum','race','like' ,'may','enjoy','keywords' ])
    
        stop_words = stop_words_base.union(stop_words_jobs)
        text = [w.lower() for w in text_original.split() if w.lower() not in stop_words]
        text_original = ' '.join(text)
        corpus[ic] = text_original
    if len(corpus) == 1 :
        return corpus[ic]
    else:
        return corpus

In [187]:
#remove some additional useless words
corpus =  corpus_stop_word_cleaner(corpus, ['cv'])

In [188]:
def create_features(corpus, nmin=1,nmax=3,nfeat=10000, mindf=1):    
    vectorizer = TfidfVectorizer(ngram_range=(nmin,nmax), min_df = mindf, 
                                 sublinear_tf = True, max_features = nfeat)
    job_features = vectorizer.fit_transform(corpus)
    return vectorizer, job_features # End of the function
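
For reference, this is the weight that TfidfVectorizer assigns to a term t in document d with the settings used here (`sublinear_tf=True`) and, as far as I can tell from the scikit-learn documentation, the library defaults `smooth_idf=True` and `norm='l2'`. Here n is the number of documents, tf is the raw count of t in d, and df is the number of documents containing t; the weight is 0 when the term does not occur in the document.

```latex
\[
w_{t,d} = \bigl(1 + \ln \mathrm{tf}_{t,d}\bigr)
          \Bigl(\ln\frac{1 + n}{1 + \mathrm{df}_t} + 1\Bigr),
\qquad
\hat{w}_{t,d} = \frac{w_{t,d}}{\sqrt{\sum_{t'} w_{t',d}^{2}}}
\]
```

The sublinear term means that a word repeated ten times in a posting counts only a few times more than a word that appears once, which dampens the effect of boilerplate repeated within a page.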

In [189]:
vectLondon, london_features = create_features(corpus, nmin=1,nmax=3, nfeat=100)

In [192]:
print('Shape of extracted features',london_features.toarray().shape)

Shape of extracted features (44, 100)


In [193]:
print(london_features.shape)

(44, 100)


In [194]:
print(corpus[3][0:100])

avp quantitative analyst lma recruitment co skip description searchclose find jobscompany reviewsfin


# Use these features to do UNSUPERVISED classification with GMM

In [195]:
def call_GMM(k, features):
    cov_type='tied'
    estimator3 = GaussianMixture(n_components=k,
                       covariance_type=cov_type, max_iter=20, random_state=0  )
    estimator3.fit(features )    # Learns model parameters
    y_pred = estimator3.predict(features)
    return y_pred, estimator3.predict_proba(features), estimator3

In [197]:
k = 2
ylab1, yprob1,estimator1=call_GMM(k, london_features.toarray())

In [198]:
ylab1

array([0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])

Looking at the labels, we can tell that the model did a reasonably good job: the first job postings in the corpus are all data science related, and the last ones are related to waiter jobs.
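
This visual check can be turned into a number. The sketch below copies the predicted labels from the output above and compares them with an assumed ground truth: the first 27 postings come from the 'data+scientist' search and the remaining 17 from the 'waiter' search, as reported by create_corpus. The helper name `agreement` is an assumption, and the shortcut for swapped cluster ids only works for two clusters.

```python
def agreement(pred, truth):
    """Fraction of matching labels, allowing the two cluster ids to be swapped."""
    same = sum(p == t for p, t in zip(pred, truth))
    # With two clusters, swapping the ids turns every match into a mismatch.
    best = max(same, len(pred) - same)
    return best / len(pred)

# Predicted labels, copied from the ylab1 output above.
pred = [0, 0, 0, 1] + [0] * 23 + [1] * 17
# Assumed ground truth: 27 'data+scientist' postings followed by 17 'waiter' postings.
truth = [0] * 27 + [1] * 17

score = agreement(pred, truth)
mismatches = [i for i, (p, t) in enumerate(zip(pred, truth)) if p != t]
print(score, mismatches)
```

The only disagreement is posting 3, the 'avp quantitative analyst' text printed earlier, which came from the data scientist search but was assigned to the other cluster.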

### Let's see which features have more weight in each category

In [199]:
feat_names = vectLondon.get_feature_names()

In [200]:
def importance_features(k, feature_names,estimator,perc=99.9):
    for ik in range(k):        
        impor_feat = {}
        muik = estimator.means_[ik]
        tth = np.percentile(muik,perc)
        for i,iv in enumerate(muik):
            if iv > tth:
                impor_feat[feature_names[i] ] =  iv 
        sorted_x = sorted(impor_feat.items(), key=operator.itemgetter(1),reverse=True  )
        print('\n \n Important features for cluster ', ik)
        for ii in sorted_x:
             print('{0:<30s}{1}'.format(ii[0],ii[1]))
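
importance_features selects features above a percentile of the component means, so the number of features printed varies from cluster to cluster. An alternative is to print a fixed number of top features per component. The sketch below shows that variant on made-up numbers; the helper name `top_features` and the toy means are assumptions, not results from this notebook. It accepts any sequence of rows, so `estimator.means_` could be passed in the same way.

```python
def top_features(means, feature_names, n=5):
    """For each component, return the n (name, mean) pairs with the largest mean."""
    ranked = []
    for component_means in means:
        pairs = sorted(zip(feature_names, component_means),
                       key=lambda pair: pair[1], reverse=True)
        ranked.append(pairs[:n])
    return ranked

# Hypothetical toy values for two components and four features.
names = ["data", "waiter", "service", "research"]
means = [[0.28, 0.01, 0.02, 0.13],
         [0.02, 0.35, 0.20, 0.01]]

top = top_features(means, names, n=2)
for k, pairs in enumerate(top):
    print("cluster", k, pairs)
```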

In [201]:
importance_features(k,feat_names,estimator1,perc=95)


 
 Important features for cluster  0
data                          0.28301804731452984
business                      0.15629106126922104
scientist                     0.1535616342043353
help                          0.14529364299440817
research                      0.13051616225029833

 
 Important features for cluster  1
waiter                        0.34898335237009015
service                       0.20181304942332462
help                          0.15273072464487983
reviews                       0.1223173971749404
services                      0.09996176199299997


There is very little uncertainty in the labeling: for each posting, the probability is concentrated almost entirely on a single Gaussian.

In [206]:
yprob1[0:10]

array([[  1.00000000e+00,   1.46166497e-60],
       [  1.00000000e+00,   1.73836827e-60],
       [  1.00000000e+00,   7.20258416e-62],
       [  6.93099134e-57,   1.00000000e+00],
       [  1.00000000e+00,   1.26330172e-72],
       [  1.00000000e+00,   2.77186522e-59],
       [  1.00000000e+00,   3.70681868e-60],
       [  1.00000000e+00,   1.61400038e-59],
       [  1.00000000e+00,   1.69941581e-60],
       [  1.00000000e+00,   1.03947753e-56]])

# Now let's look only at data-science jobs and see which features set them apart

In [205]:
corpus_ds, URLs_ds, bad_URLs_ds = create_corpus(city = 'london',job_list=['data+science'],pages=8)

Searching for  data+science
type(num_jobs_area)
There were 5326 jobs found, london
Getting page 0
Getting page 1
Getting page 2
Getting page 3
Getting page 4
Getting page 5
Getting page 6
Getting page 7
Done with collecting the job postings!
There were 75 jobs successfully found.


In [207]:
corpus_ds =  corpus_stop_word_cleaner(corpus_ds,['ck','best','great','boyce','mendez','mondelez','durlston','cwjobs','futureheads','harnham','venturi'])

In [270]:
vectLondon_ds, london_features_ds = create_features(corpus_ds, nmin=1,nmax=3,nfeat=500, mindf=5)
feat_names_ds = vectLondon_ds.get_feature_names()
print(london_features_ds.shape)

(75, 500)


In [313]:
k=4
zlab,zprob, estimn = call_GMM(k, london_features_ds.toarray() )
print('Labels',zlab)
print('Probability of belonging to each cluster', zprob[0:5])

Labels [3 1 3 1 2 1 1 1 2 0 2 0 1 1 1 3 3 1 1 1 3 2 1 1 1 2 3 1 0 1 2 1 1 1 3 1 1
 2 1 1 2 2 2 1 0 1 1 1 1 1 2 0 1 2 2 3 2 1 3 2 0 1 1 1 2 2 1 1 3 1 2 1 1 1
 1]
Probability of belonging to each cluster [[  4.91272164e-42   1.47300082e-01   8.24583506e-36   8.52699918e-01]
 [  1.72533774e-62   9.99947673e-01   5.00420735e-53   5.23273617e-05]
 [  8.55562609e-43   1.23105622e-01   4.17968854e-37   8.76894378e-01]
 [  5.92908971e-54   9.97533813e-01   8.29750436e-46   2.46618681e-03]
 [  4.39147798e-03   4.13869091e-45   9.95608522e-01   3.24111757e-28]]
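
With four clusters, some postings are no longer assigned with near certainty. A simple way to spot them is to flag rows whose largest membership probability falls below a threshold. The sketch below does this for the five rows printed above; the helper name `ambiguous_rows` and the 0.9 threshold are assumptions chosen for illustration.

```python
def ambiguous_rows(probs, threshold=0.9):
    """Indices of postings whose most likely cluster has probability below threshold."""
    return [i for i, row in enumerate(probs) if max(row) < threshold]

# First five rows of zprob, copied from the output above.
probs = [
    [4.91272164e-42, 1.47300082e-01, 8.24583506e-36, 8.52699918e-01],
    [1.72533774e-62, 9.99947673e-01, 5.00420735e-53, 5.23273617e-05],
    [8.55562609e-43, 1.23105622e-01, 4.17968854e-37, 8.76894378e-01],
    [5.92908971e-54, 9.97533813e-01, 8.29750436e-46, 2.46618681e-03],
    [4.39147798e-03, 4.13869091e-45, 9.95608522e-01, 3.24111757e-28],
]

flagged = ambiguous_rows(probs)
print(flagged)
```

Postings 0 and 2 split their probability between clusters 1 and 3, so they may be worth reading under both categories.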


In [277]:
importance_features(k, feat_names_ds, estimn, perc=99)


 
 Important features for cluster  0
research                      0.13519889107029157
sciences                      0.09332351293807206
social                        0.08184207446558818
data                          0.0771361762386771
content                       0.07563418282231256

 
 Important features for cluster  1
analyst                       0.1040533220599665
analytics                     0.09383322420589418
business                      0.08764275289464411
data                          0.08016894058385425
customer                      0.07747423994838035

 
 Important features for cluster  2
business                      0.060880200267596746
data                          0.0602169703252302
engineer                      0.0595723383638893
help                          0.056738072390410184
projects                      0.05558856031452268

 
 Important features for cluster  3
developer                     0.16206101848635407
software                      0.12119626940241453


# If fewer features are considered, the probabilities may be distributed slightly differently. For example, consider only 4 features

In [307]:
vectLondon_ds, london_features_ds = create_features(corpus_ds, nmin=1,nmax=1,nfeat=4, mindf=1)
feat_names_ds = vectLondon_ds.get_feature_names()
print(london_features_ds.shape)
k=4
zlab,zprob, estimn = call_GMM(k, london_features_ds.toarray() )
#print('Labels',zlab)
print('Probability of belonging to each cluster', zprob[0:10])

(75, 4)
Probability of belonging to each cluster [[  4.91272164e-42   1.47300082e-01   8.24583506e-36   8.52699918e-01]
 [  1.72533774e-62   9.99947673e-01   5.00420735e-53   5.23273617e-05]
 [  8.55562609e-43   1.23105622e-01   4.17968854e-37   8.76894378e-01]
 [  5.92908971e-54   9.97533813e-01   8.29750436e-46   2.46618681e-03]
 [  4.39147798e-03   4.13869091e-45   9.95608522e-01   3.24111757e-28]
 [  9.37917404e-59   9.99509611e-01   7.20770495e-49   4.90389271e-04]
 [  2.67654269e-55   9.99956066e-01   1.57547517e-42   4.39342998e-05]
 [  2.77135086e-54   9.99619710e-01   1.90500343e-42   3.80290336e-04]
 [  5.23434948e-04   1.26941061e-59   9.99476565e-01   1.30482674e-41]
 [  9.82504397e-01   1.58152226e-53   1.74956033e-02   3.11998608e-33]]


### Look at each category and find the one(s) best suited to you.

### Now your job search is much easier!



In [308]:
def provide_url_one_class(label,URL, target_category):
    target_url=[]
    for i,ilab in enumerate(label):
        if ilab in target_category:
            print(URL[i])
            target_url.append(URL[i])
    return target_url
        

In [310]:
# Let's look at the data science jobs assigned to cluster 0
my_url = provide_url_one_class(zlab,URLs_ds, [0])

https://www.indeed.co.uk/rc/clk?jk=3e332c559a53025f&fccid=d8589ac02ce8d5f2&vjs=3
https://www.indeed.co.uk/rc/clk?jk=3e332c559a53025f&fccid=d8589ac02ce8d5f2&vjs=3
https://www.indeed.co.uk/rc/clk?jk=698df7625ba7b9af&fccid=618c59020c53801c&vjs=3
https://www.indeed.co.uk/rc/clk?jk=02aa83986bb9c4e6&fccid=bcbe3f5328b59f6d&vjs=3
https://www.indeed.co.uk/rc/clk?jk=e6474828284e2918&fccid=892c145157842ceb&vjs=3
https://www.indeed.co.uk/rc/clk?jk=41f65aa7d234422d&fccid=f21bb40e0e79c0f9&vjs=3


### Looking at estimn.weights_ we can see which Gaussians have the largest contributions

In [312]:
estimn.weights_

array([ 0.08663011,  0.53281582,  0.23336989,  0.14718418])
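
The mixture weights sum to one and can be read as the share of the corpus that each Gaussian accounts for. Multiplying them by the number of postings gives a soft count per cluster. The sketch below uses the weights printed above and the 75 postings in corpus_ds; treating weight times corpus size as an expected cluster size is my reading of the model, not something the notebook computes.

```python
# Mixture weights copied from the output above.
weights = [0.08663011, 0.53281582, 0.23336989, 0.14718418]
n_postings = 75  # size of corpus_ds

# Expected (soft) number of postings per component.
expected = {k: w * n_postings for k, w in enumerate(weights)}
# Components ordered from largest to smallest contribution.
ranking = sorted(expected, key=expected.get, reverse=True)

for k in ranking:
    print("cluster", k, "weight", weights[k], "expected postings", round(expected[k], 1))
```

Cluster 1 carries more than half of the total weight, so roughly 40 of the 75 postings are attributed to it.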