fastText
========
- Developed by Facebook <https://github.com/facebookresearch/fastText>
- Treats each word as the aggregation of its subwords.
    - Subwords are character n-grams of the word (e.g. army --> a, r, m, y, ar, rm, my, arm, rmy, army)
- Pros:
    - Much better than Word2Vec on syntactic tasks, especially with a small training corpus
    - fastText can be used to obtain vectors for out-of-vocabulary (OOV) words
- Cons:
    - Slightly worse than Word2Vec on semantic tasks
    - Slower training time than Word2Vec
- Comparison: <https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/Word2Vec_FastText_Comparison.ipynb>
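
The subword idea above can be sketched in a few lines. This is a minimal illustration rather than fastText's actual implementation: the helper name `char_ngrams` is hypothetical. Like fastText, it wraps the word in `<` and `>` boundary markers before slicing, and the `min_n`/`max_n` defaults simply mirror the values used later in this notebook. The real library also hashes each n-gram into a fixed number of buckets, which is left out here.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of `word`, including boundary markers."""
    wrapped = "<" + word + ">"  # boundary markers distinguish prefixes and suffixes
    ngrams = []
    for n in range(min_n, max_n + 1):
        # slide a window of width n across the wrapped word
        for start in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[start:start + n])
    return ngrams


print(char_ngrams("army"))
```

Because an unseen word still shares many of these n-grams with known words, a vector can be assembled for it from the n-gram vectors, which is what makes the OOV lookup possible.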


<br><br>

Credit:
- https://radimrehurek.com/gensim/auto_examples/tutorials/run_fasttext.html#sphx-glr-download-auto-examples-tutorials-run-fasttext-py
- Word Mover's Distance (WMD) work:
    - Ofir Pele and Michael Werman, “A linear time histogram metric for improved SIFT matching”
    - Ofir Pele and Michael Werman, “Fast and robust earth mover’s distances”
    - Matt Kusner et al., “From Word Embeddings To Document Distances”



In [1]:

import numpy as np
import pandas as pd


from pprint import pprint
import operator

from gensim.models.fasttext import FastText as FT_gensim
from gensim.test.utils import datapath

In [None]:
# import logging
# logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

## Load Resume Data

In [2]:
local_resume_cleaned_lemmatized_tokenized_path = '/Users/richardkuzma/coding/NLP_projects/job_recommender_project/data/resumes_tokenized_lemmatized.pickle'
resumes = pd.read_pickle(local_resume_cleaned_lemmatized_tokenized_path)
resumes_sentences = resumes['lemmatized_resume'].tolist()
resumes.head()

Unnamed: 0,ID,Category,dirty_resume,resume,tokenized_resume,lemmatized_resume
0,1,HR,"b'John H. Smith, P.H.R.\n800-991-5187 | PO Box...",john h smith phr po box callahan fl infog...,"[john, h, smith, phr, po, box, callahan, fl, i...","[john, h, smith, phr, po, box, callahan, fl, i..."
1,2,HR,b'Name Surname\nAddress\nMobile No/Email\nPERS...,name surname address mobile noemail personal p...,"[name, surname, address, mobile, noemail, pers...","[name, surname, address, mobile, noemail, pers..."
2,3,HR,b'Anthony Brown\nHR Assistant\nAREAS OF EXPERT...,anthony brown hr assistant areas expertise per...,"[anthony, brown, hr, assistant, areas, experti...","[anthony, brown, hr, assistant, area, expertis..."
3,4,HR,b'www.downloadmela.com\nSatheesh\nEMAIL ID:\nC...,satheesh email id career objective pursue gro...,"[satheesh, email, id, career, objective, pursu...","[satheesh, email, id, career, objective, pursu..."
4,5,HR,"b""HUMAN RESOURCES DIRECTOR\n\xef\x82\xb7Expert...",human resources director expert organizational...,"[human, resources, director, expert, organizat...","[human, resource, director, expert, organizati..."


In [3]:
# a list where each element is a list of strings
resumes_sentences[0][:15]

['john',
 'h',
 'smith',
 'phr',
 'po',
 'box',
 'callahan',
 'fl',
 'infogreatresumesfastcom',
 'approachable',
 'innovator',
 'passion',
 'human',
 'resource',
 'senior']

Training base FT model for resumes
--------------

In [5]:
base_res_model = FT_gensim(
    sentences=None,   # vocabulary is built and training is run explicitly below
    sg=0,             # 0 = CBOW (default), 1 = skip-gram
    hs=0,             # 0 with negative > 0 = negative sampling (default); 1 = hierarchical softmax
    negative=5,       # number of noise words drawn for negative sampling
    size=100,         # dimensionality of the word vectors
    alpha=0.025,      # initial learning rate
    min_count=5,      # ignore words with fewer than 5 appearances
    iter=5,           # number of training epochs
    seed=42,
    cbow_mean=1,      # 1 = average the context vectors, 0 = sum them (CBOW only)
    min_n=3,          # min length of char n-grams
    max_n=6,          # max length of char n-grams; if less than min_n, no n-grams are used (behaves like Word2Vec)
    trim_rule=None,   # optional rule for trimming the vocabulary
    workers=3         # number of worker threads (default)
)

# build the vocabulary
base_res_model.build_vocab(sentences=resumes_sentences)

# train the model
base_res_model.train(
    sentences=resumes_sentences,
    epochs=base_res_model.epochs,
    total_examples=base_res_model.corpus_count,
    total_words=base_res_model.corpus_total_words
)

print(base_res_model)

FastText(vocab=11073, size=100, alpha=0.025)


<br><br>
### Pick an individual job to compare against all resumes

In [6]:
# load jobs into df

local_jobs_cleaned_lemmatized_tokenized_path = '/Users/richardkuzma/coding/NLP_projects/job_recommender_project/data/large_files/jobs_tokenized_lemmatized.pickle'
jobs = pd.read_pickle(local_jobs_cleaned_lemmatized_tokenized_path)
jobs_list = jobs['lemmatized_combined'].tolist()
jobs.head()

Unnamed: 0,Title,Company,JobDescription,RequiredQual,JobRequirement,label,combined,dirty_combined,tokenized_combined,lemmatized_combined
0,Chief Financial Officer,AMERIA Investment Consulting Company,AMERIA Investment Consulting Company is seekin...,"To perform this job successfully, an\r\nindivi...",- Supervises financial management and administ...,1,chief financial officer ameria investment cons...,Chief Financial Officer AMERIA Investment Cons...,"[chief, financial, officer, ameria, investment...","[chief, financial, officer, ameria, investment..."
1,Country Coordinator,Caucasus Environmental NGO Network (CENN),Public outreach and strengthening of a growing...,"- Degree in environmentally related field, or ...",- Working with the Country Director to provide...,2,country coordinator public outreach strengthen...,Country Coordinator Public outreach and streng...,"[country, coordinator, public, outreach, stren...","[country, coordinator, public, outreach, stren..."
2,BCC Specialist,Manoff Group,The LEAD (Local Enhancement and Development fo...,"- Advanced degree in public health, social sci...",- Identify gaps in knowledge and overseeing in...,3,bcc specialist lead local enhancement developm...,BCC Specialist The LEAD (Local Enhancement and...,"[bcc, specialist, lead, local, enhancement, de...","[bcc, specialist, lead, local, enhancement, de..."
3,"Community Development, Capacity Building and C...",Food Security Regional Cooperation and Stabili...,Food Security Regional Cooperation and Stabili...,- Higher Education and/or professional experie...,- Assist the Tavush Marz communities and commu...,4,community development capacity building confli...,"Community Development, Capacity Building and C...","[community, development, capacity, building, c...","[community, development, capacity, building, c..."
4,Country Economist (NOB),"United Nations Development Programme, Armenia",The United Nations Development Programme in Ar...,- Minimum Masters Degree in Economics;\r\n- Mi...,The incumbent under direct supervision of UNDP...,5,country economist nob united nations developme...,Country Economist (NOB) The United Nations Dev...,"[country, economist, nob, united, nations, dev...","[country, economist, nob, united, nation, deve..."


In [7]:
#13124 total jobs
jobs.shape[0]

13124

In [8]:
def pick_job(selection=-999):
    print("There are {} jobs".format(jobs.shape[0]))

    # If no job is specified, pick a random job number from 1 to the number of jobs
    if selection == -999:
        selection = np.random.randint(1, jobs.shape[0] + 1)

    print('\nselected job is ID #{}'.format(selection))

    # selection is 1-based, so the row position of the job is selection - 1
    job = jobs.iloc[selection - 1]
    job_label = job['label']  # we could grab ID, but this works for non-indexed labels too
    job_title = job['Title']
    job_company = job['Company']
    job_description = job['JobDescription']

    print('Job Posting ID is: {}'.format(job_label))
    print('Job Posting Title: {}'.format(job_title))
    print('Job Posting Company: {}'.format(job_company))
    print('Job Posting Description: {}'.format(job_description))

    # Return the lemmatized tokens of the selected job for use with wmdistance
    job_text_to_process = jobs['lemmatized_combined'].iloc[selection - 1]

    return job_text_to_process


In [12]:
def given_job_find_similar_resumes(job_you_pick, model=base_res_model):

    # find the WM distance between the chosen job and each resume
    temp_distance = []

    for i in range(len(resumes_sentences)):
        print(i)  # progress indicator: each wmdistance call is slow
        # call wmdistance on the keyed vectors; the model-level wrapper is deprecated in gensim 3.x
        dist = model.wv.wmdistance(job_you_pick, resumes_sentences[i])
        temp_distance.append((dist, i))

    # sort the list of (distance, index) tuples, closest resume first
    temp_distance.sort(key=operator.itemgetter(0))

    num_similar = 10  # e.g. 10, 20, or 25
    print('\nPrinting {} most similar candidates...\n'.format(num_similar))
    for i in range(num_similar):
        print('\n#{} most similar resume'.format(i+1))
        print('Resume ID from list: {}'.format(temp_distance[i][1]))
        print('WM Distance: {}'.format(temp_distance[i][0]))
        print('Resume ID from df: {}'.format(resumes.iloc[temp_distance[i][1]]['ID']))
        print('Resume text (500 chars): {}'.format(resumes.iloc[temp_distance[i][1]]['resume'][0:500]))


In [10]:
chosen_job = pick_job()

There are 13124 jobs

selected job is ID #9670
Job Posting ID is: 9670
Job Posting Title: Sales Manager
Job Posting Company: Navavan LLC
Job Posting Description: Navavan LLC is looking for a Sales Manager who will be
responsible for the development and performance of all sales activities
in the assigned market. He/ she will staff and direct a sales team and
provide leadership towards the achievement of maximum profitability and
growth in line with company vision and values. The incumbent will be also
responsible for establishing plans and strategies to expand the customer
base in the marketing area.


In [13]:
given_job_find_similar_resumes(chosen_job)
# painfully slow... need GPU?

0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34


KeyboardInterrupt: 
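
Exact WMD solves an optimal-transport problem for every job-resume pair, which is probably why the scan above had to be interrupted. Kusner et al. (see Credit) describe cheaper lower bounds that can be used to prune candidates before the exact distance is computed. The following is a rough sketch of their relaxed bound, not gensim's implementation: the function names and the toy 2-d vectors are made up for illustration, and `math.dist` assumes Python 3.8 or newer.

```python
import math
from collections import Counter


def nbow(tokens):
    """Normalized bag-of-words weights: token -> relative frequency."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()}


def one_sided_bound(doc_a, doc_b, vectors):
    """Move all mass of each word in doc_a to its nearest word in doc_b."""
    weights = nbow(doc_a)
    targets = set(doc_b)
    return sum(
        w * min(math.dist(vectors[tok], vectors[t]) for t in targets)
        for tok, w in weights.items()
    )


def relaxed_wmd(doc_a, doc_b, vectors):
    """Lower bound on WMD: the larger of the two one-sided relaxations."""
    return max(one_sided_bound(doc_a, doc_b, vectors),
               one_sided_bound(doc_b, doc_a, vectors))


# toy 2-d vectors, made up for illustration only
toy_vectors = {"sales": [0.0, 0.0], "manager": [1.0, 0.0], "nurse": [0.0, 2.0]}
print(relaxed_wmd(["sales", "manager"], ["nurse"], toy_vectors))
```

With real data, `vectors` could be a lookup into the trained model's word vectors, and only the resumes with the smallest bounds would then be passed to the exact `wmdistance` call.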

In [None]:
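# Reference signature: most_similar(positive=None, negative=None, topn=10, restrict_vocab=None, indexer=None)
# Example call; 'resource' is just an illustrative query word taken from the resume tokens
base_res_model.wv.most_similar(positive=['resource'], topn=10)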

Training models
---------------




Next, train a second fastText model on the job postings, using the same hyperparameters as the resume model.






In [22]:
jobs_sentences = jobs['lemmatized_combined'].tolist()

In [23]:
base_jobs_model = FT_gensim(
    sentences=None,   # vocabulary is built and training is run explicitly below
    sg=0,             # 0 = CBOW (default), 1 = skip-gram
    hs=0,             # 0 with negative > 0 = negative sampling (default); 1 = hierarchical softmax
    negative=5,       # number of noise words drawn for negative sampling
    size=100,         # dimensionality of the word vectors
    alpha=0.025,      # initial learning rate
    min_count=5,      # ignore words with fewer than 5 appearances
    iter=5,           # number of training epochs
    seed=42,
    cbow_mean=1,      # 1 = average the context vectors, 0 = sum them (CBOW only)
    min_n=3,          # min length of char n-grams
    max_n=6,          # max length of char n-grams; if less than min_n, no n-grams are used (behaves like Word2Vec)
    trim_rule=None,   # optional rule for trimming the vocabulary
    workers=3         # number of worker threads (default)
)

# build the vocabulary
base_jobs_model.build_vocab(sentences=jobs_sentences)

# train the model
base_jobs_model.train(
    sentences=jobs_sentences,
    epochs=base_jobs_model.epochs,
    total_examples=base_jobs_model.corpus_count,
    total_words=base_jobs_model.corpus_total_words
)

print(base_jobs_model)

KeyboardInterrupt: 

In [None]:
def pick_resume(selection=-999):
    print("There are {} resumes".format(resumes.shape[0]))

    # If no resume is specified, pick a random resume number from 1 to the number of resumes
    if selection == -999:
        selection = np.random.randint(1, resumes.shape[0] + 1)

    print('\nselected resume is ID #{}'.format(selection))

    # selection is 1-based, so the row position of the resume is selection - 1
    resume_label = resumes.iloc[selection - 1, resumes.columns.get_loc('ID')]
    resume_text = resumes.iloc[selection - 1]['resume'][:500]

    print('Resume ID is: {}'.format(resume_label))
    print('Resume text is (500 chars): {}'.format(resume_text))

    # Return the lemmatized tokens of the selected resume for use with wmdistance
    resume_text_to_process = resumes['lemmatized_resume'].iloc[selection - 1]

    return resume_text_to_process


In [15]:
def given_resume_find_similar_jobs(resume_you_pick, model=base_res_model):
    # model defaults to the resume model, mirroring given_job_find_similar_resumes;
    # pass base_jobs_model to use the job model instead

    # find the WM distance between the chosen resume and each job
    temp_distance = []

    for i in range(len(jobs_sentences)):
        dist = model.wv.wmdistance(resume_you_pick, jobs_sentences[i])
        temp_distance.append((dist, i))

    # sort the list of (distance, index) tuples, closest job first
    temp_distance.sort(key=operator.itemgetter(0))

    return temp_distance
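
The job-to-resume and resume-to-job searches differ only in which collection is scanned, so the loop could be factored into a single helper. A minimal sketch, assuming a hypothetical `rank_by_distance` helper and a stand-in Jaccard distance so that it runs without a trained model; in this notebook the `distance_fn` argument would be the WMD call instead.

```python
import operator


def rank_by_distance(query, documents, distance_fn):
    """Return (distance, index) tuples for every document, closest first."""
    scored = [(distance_fn(query, doc), i) for i, doc in enumerate(documents)]
    scored.sort(key=operator.itemgetter(0))
    return scored


def jaccard_distance(tokens_a, tokens_b):
    """Stand-in distance for the demo: 1 - |intersection| / |union|."""
    a, b = set(tokens_a), set(tokens_b)
    return 1.0 - len(a & b) / len(a | b)


# toy token lists, made up for illustration only
docs = [["sales", "manager"], ["nurse", "hospital"], ["sales", "team", "lead"]]
ranking = rank_by_distance(["sales", "manager", "team"], docs, jaccard_distance)
print(ranking[0])   # closest document
print(ranking[-1])  # farthest document
```

Returning the full sorted list keeps both ends available, so the same result can feed the most-similar and the least-similar printouts.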

    

In [16]:
def print_similar_jobs(list_of_similar_jobs):
    
    num_similar = 10  # e.g. 10, 20, or 25
    print('\nPrinting {} most similar jobs for this candidate...\n'.format(num_similar))
    for i in range(0,num_similar):
        print('\n#{} most similar job'.format(i+1))
        print('Job ID from list: {}'.format(list_of_similar_jobs[i][1]))
        print('WM Distance: {}'.format(list_of_similar_jobs[i][0]))
        print('Job ID from df: {}'.format(jobs.iloc[list_of_similar_jobs[i][1]]['label']))
        print('Job title: {}'.format(jobs.iloc[list_of_similar_jobs[i][1]]['Title']))
        print('Company: {}'.format(jobs.iloc[list_of_similar_jobs[i][1]]['Company']))
        print('Job Description: {}'.format(jobs.iloc[list_of_similar_jobs[i][1]]['JobDescription']))


In [17]:
def print_dissimilar_jobs(list_of_similar_jobs):
    
    num_dissimilar = 10  # e.g. 10, 20, or 25
    print('\nPrinting {} least similar jobs for this candidate...\n'.format(num_dissimilar))
    for i in range(0,num_dissimilar):
        print('\n#{} least similar job'.format(i+1))
        print('Job ID from list: {}'.format(list_of_similar_jobs[-(1+i)][1]))
        print('WM Distance: {}'.format(list_of_similar_jobs[-(1+i)][0]))
        print('Job ID from df: {}'.format(jobs.iloc[list_of_similar_jobs[-(1+i)][1]]['label']))
        print('Job title: {}'.format(jobs.iloc[list_of_similar_jobs[-(1+i)][1]]['Title']))
        print('Company: {}'.format(jobs.iloc[list_of_similar_jobs[-(1+i)][1]]['Company']))
        print('Job Description: {}'.format(jobs.iloc[list_of_similar_jobs[-(1+i)][1]]['JobDescription']))

In [18]:
chosen_resume = pick_resume()

There are 1219 resumes

selected resume is ID #915
Resume ID is: 915
Resume text is (500 chars): your resume try determine position would work objective statement says want position offers growth advancement opportunity expand education training tells employer looking education experiencewhat put first education pertinent work seeking employment history list education first education pertinent job may wish list specific course titles completed directly related work would employment history pertinent listed first resume academic background academic background included resume recently obtaine


In [None]:
ordered_job_list = given_resume_find_similar_jobs(chosen_resume)

In [None]:
print_similar_jobs(ordered_job_list)

In [None]:
print_dissimilar_jobs(ordered_job_list)