fastText
========
- Made by Facebook: <https://github.com/facebookresearch/fastText>
- Treats each word as the aggregation of its subwords.
    - Subwords are character n-grams of the word (e.g. army --> a, r, m, y, ar, rm, my, arm, rmy, army).
- Pros:
    - Much better than Word2Vec on syntactic tasks, especially with a small training corpus
    - fastText can be used to obtain vectors for out-of-vocabulary (OOV) words
- Cons:
    - Slightly worse than Word2Vec on semantic tasks
    - Slower training time than Word2Vec
    - Comparison: <https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/Word2Vec_FastText_Comparison.ipynb>
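
The subword idea above can be sketched in a few lines of plain Python. This is an illustration only, not gensim's or fastText's internal code: `char_ngrams` and its `add_boundaries` flag are hypothetical names. fastText wraps each word in `<` and `>` before extracting n-grams, so the flag is on by default; turning it off reproduces the "army" example from the list.

```python
def char_ngrams(word, min_n=3, max_n=6, add_boundaries=True):
    """Return all character n-grams of `word` with lengths min_n..max_n.

    Hypothetical helper for illustration; fastText marks word boundaries
    with '<' and '>' before extracting n-grams.
    """
    token = "<{}>".format(word) if add_boundaries else word
    grams = []
    for n in range(min_n, max_n + 1):
        # slide a window of width n across the token
        for start in range(len(token) - n + 1):
            grams.append(token[start:start + n])
    return grams


# Reproduces the plain example from the list above
print(char_ngrams("army", min_n=1, max_n=4, add_boundaries=False))
# ['a', 'r', 'm', 'y', 'ar', 'rm', 'my', 'arm', 'rmy', 'army']

# With boundary markers and trigrams only
print(char_ngrams("army", min_n=3, max_n=3))
# ['<ar', 'arm', 'rmy', 'my>']
```

Because an unseen word still shares n-grams with words seen in training, its vector can be assembled from those n-gram vectors, which is what makes OOV lookups possible.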


<br><br>

Credit:
- https://radimrehurek.com/gensim/auto_examples/tutorials/run_fasttext.html#sphx-glr-download-auto-examples-tutorials-run-fasttext-py
- Word Mover's Distance (WMD) work:
    - Ofir Pele and Michael Werman, “A linear time histogram metric for improved SIFT matching”.
    - Ofir Pele and Michael Werman, “Fast and robust earth mover’s distances”.
    - Matt Kusner et al., “From Word Embeddings To Document Distances”.



In [1]:

import numpy as np
import pandas as pd
import pickle
from scipy import spatial

from pprint import pprint
import operator

from gensim.models.fasttext import FastText as FT_gensim
from gensim.test.utils import datapath

unable to import 'smart_open.gcs', disabling that module


In [2]:
# import logging
# logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

## Load Resume Data

In [3]:
local_resume_cleaned_lemmatized_tokenized_path = '/Users/richardkuzma/coding/NLP_projects/job_recommender_project/data/resumes_tokenized_lemmatized.pickle'
ec2_path = '/home/ubuntu/NLP_projects/job_recommender_project/data/resumes_tokenized_lemmatized.pickle'

# resumes = pd.read_pickle(local_resume_cleaned_lemmatized_tokenized_path)
resumes = pd.read_pickle(ec2_path)
resumes_sentences = resumes['lemmatized_resume'].tolist()
resumes.head()

Unnamed: 0,ID,Category,dirty_resume,resume,tokenized_resume,lemmatized_resume
0,1,HR,"b'John H. Smith, P.H.R.\n800-991-5187 | PO Box...",john h smith phr po box callahan fl infog...,"[john, h, smith, phr, po, box, callahan, fl, i...","[john, h, smith, phr, po, box, callahan, fl, i..."
1,2,HR,b'Name Surname\nAddress\nMobile No/Email\nPERS...,name surname address mobile noemail personal p...,"[name, surname, address, mobile, noemail, pers...","[name, surname, address, mobile, noemail, pers..."
2,3,HR,b'Anthony Brown\nHR Assistant\nAREAS OF EXPERT...,anthony brown hr assistant areas expertise per...,"[anthony, brown, hr, assistant, areas, experti...","[anthony, brown, hr, assistant, area, expertis..."
3,4,HR,b'www.downloadmela.com\nSatheesh\nEMAIL ID:\nC...,satheesh email id career objective pursue gro...,"[satheesh, email, id, career, objective, pursu...","[satheesh, email, id, career, objective, pursu..."
4,5,HR,"b""HUMAN RESOURCES DIRECTOR\n\xef\x82\xb7Expert...",human resources director expert organizational...,"[human, resources, director, expert, organizat...","[human, resource, director, expert, organizati..."


In [4]:
# a list where each element is a list of strings
resumes_sentences[0][:15]

['john',
 'h',
 'smith',
 'phr',
 'po',
 'box',
 'callahan',
 'fl',
 'infogreatresumesfastcom',
 'approachable',
 'innovator',
 'passion',
 'human',
 'resource',
 'senior']

Training base FT model for resumes
--------------

In [5]:
base_res_model = FT_gensim(
    sentences=None,
    sg=0, #default to CBOW. if sg=1 then skip-gram
    hs=0, #default, if hs=0 & negative =/= 0 then neg. sampling. if hs=1, hierarchical softmax
    negative=5, #5 words selected for negative sampling
    size=100, #size of vector
    alpha=0.025,
    min_count=5, # ignore words with fewer than 5 appearances
    iter=5,
    seed=42,
    cbow_mean=1, #uses mean for CBOW. If it =0 then sums CBOW (provided CBOW not SG)
    min_n=3, # min length of char n-grams
    max_n=6, # max length of char n-grams. If 0 or less than min_n, this turns into W2V
    trim_rule=None, #if you had a rule to trim down vocabulary
    workers=3 # default
)    

    
# build the vocabulary
base_res_model.build_vocab(sentences = resumes_sentences)

# train the model
base_res_model.train(
    sentences=resumes_sentences,
    epochs=base_res_model.epochs,
    total_examples=base_res_model.corpus_count,
    total_words=base_res_model.corpus_total_words
)

print(base_res_model)

FastText(vocab=11073, size=100, alpha=0.025)


In [7]:
base_res_model.save('/home/ubuntu/NLP_projects/job_recommender_project/models/fasttext/base_res_ft')

<br><br><br><br><br><br><br><br><br><br>

<br><br>
### Need to be able to pick individual jobs to compare to all resumes

In [8]:
# load jobs into df

local_jobs_cleaned_lemmatized_tokenized_path = '/Users/richardkuzma/coding/NLP_projects/job_recommender_project/data/large_files/jobs_tokenized_lemmatized.pickle'
ec2_path = '/home/ubuntu/NLP_projects/job_recommender_project/data/large_files/jobs_tokenized_lemmatized.pickle'

jobs = pd.read_pickle(ec2_path)
# jobs = pd.read_pickle(local_jobs_cleaned_lemmatized_tokenized_path)

jobs_list = jobs['lemmatized_combined'].tolist()
jobs.head()

Unnamed: 0,Title,Company,JobDescription,RequiredQual,JobRequirement,label,combined,dirty_combined,tokenized_combined,lemmatized_combined
0,Chief Financial Officer,AMERIA Investment Consulting Company,AMERIA Investment Consulting Company is seekin...,"To perform this job successfully, an\r\nindivi...",- Supervises financial management and administ...,1,chief financial officer ameria investment cons...,Chief Financial Officer AMERIA Investment Cons...,"[chief, financial, officer, ameria, investment...","[chief, financial, officer, ameria, investment..."
1,Country Coordinator,Caucasus Environmental NGO Network (CENN),Public outreach and strengthening of a growing...,"- Degree in environmentally related field, or ...",- Working with the Country Director to provide...,2,country coordinator public outreach strengthen...,Country Coordinator Public outreach and streng...,"[country, coordinator, public, outreach, stren...","[country, coordinator, public, outreach, stren..."
2,BCC Specialist,Manoff Group,The LEAD (Local Enhancement and Development fo...,"- Advanced degree in public health, social sci...",- Identify gaps in knowledge and overseeing in...,3,bcc specialist lead local enhancement developm...,BCC Specialist The LEAD (Local Enhancement and...,"[bcc, specialist, lead, local, enhancement, de...","[bcc, specialist, lead, local, enhancement, de..."
3,"Community Development, Capacity Building and C...",Food Security Regional Cooperation and Stabili...,Food Security Regional Cooperation and Stabili...,- Higher Education and/or professional experie...,- Assist the Tavush Marz communities and commu...,4,community development capacity building confli...,"Community Development, Capacity Building and C...","[community, development, capacity, building, c...","[community, development, capacity, building, c..."
4,Country Economist (NOB),"United Nations Development Programme, Armenia",The United Nations Development Programme in Ar...,- Minimum Masters Degree in Economics;\r\n- Mi...,The incumbent under direct supervision of UNDP...,5,country economist nob united nations developme...,Country Economist (NOB) The United Nations Dev...,"[country, economist, nob, united, nations, dev...","[country, economist, nob, united, nation, deve..."


In [9]:
#13124 total jobs
jobs.shape[0]

13124

In [10]:
def vectorize_resumes(model=base_res_model):
    """returns list of document vectors """
    res_vecs = []
    for i in range(0, len(resumes_sentences)):
        temp_vec = model.wv[resumes_sentences[i]]
        one_dim_vec = np.mean(temp_vec, axis=0)
        res_vecs.append(one_dim_vec)
    return res_vecs
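
The document vectors above are built by mean pooling: each token is mapped to its word vector and the vectors are averaged along axis 0. The sketch below shows the same idea on toy data. `mean_pool` and `toy_vectors` are hypothetical stand-ins (a plain dict instead of `model.wv`), and the empty-document guard is an added assumption, since an empty token list has no vectors to average.

```python
import numpy as np


def mean_pool(tokens, lookup, dim):
    """Average the vectors of `tokens`; return zeros for an empty document."""
    if not tokens:
        # Assumption: represent an empty document as the zero vector
        return np.zeros(dim, dtype=np.float32)
    return np.mean([lookup[t] for t in tokens], axis=0)


# Toy 2-dimensional "embeddings" standing in for model.wv
toy_vectors = {
    "sales": np.array([1.0, 0.0]),
    "manager": np.array([0.0, 1.0]),
}

doc_vec = mean_pool(["sales", "manager"], toy_vectors, dim=2)
print(doc_vec)  # [0.5 0.5]
```

Mean pooling ignores word order and weights every token equally, so long resumes with many generic words may drift toward a common "average" direction.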

In [15]:
def pick_job(selection=-999):
    print("There are {} jobs".format(jobs.shape[0]))
    
    # Select a random 1-based job ID between 1 and the number of jobs (inclusive)
    rand_int = np.random.randint(1, jobs.shape[0]+1)
    
    
    
    if selection == -999:
        selection = rand_int
    
    
    
    
    print ('\nselected job is ID #{}'.format(selection))
    
    # pick the job text and ID associated with the random int
    job_label = jobs.iloc[selection - 1, jobs.columns.get_loc('label')] #we could grab ID, but this works for non-indexed labels too
    job_title = jobs.iloc[selection - 1 ]['Title']
    job_company = jobs.iloc[selection - 1 ]['Company']
    job_description = jobs.iloc[selection - 1 ]['JobDescription']
    
    
    print('Job Posting ID is: {}'.format(job_label))
    print('Job Posting Title: {}'.format(job_title))
    print('Job Posting Company: {}'.format(job_company))
    print('Job Posting Description: {}'.format(job_description))
    
    # Return the lemmatized token list for the selected job (IDs are 1-based, rows are 0-based)
    job_text_to_process = jobs['lemmatized_combined'][selection - 1]
    
    return job_text_to_process


In [16]:
def given_job_find_similar_resumes(job_you_pick, model=base_res_model, resumes_vectors=vectorize_resumes()):

        
    #find all distances between chosen job and each resume
    temp_distance = []
    min_dist = float("inf")
    min_index = float("inf")

    #turning chosen job into vector
    job_temp = model.wv[job_you_pick]
    job_vec = np.mean(job_temp, axis=0)
    
    
    for i in range (0, len(resumes_vectors)):
        # print(i)
        # dist = model.wmdistance(job_you_pick, resumes_sentences[i])
        dist = spatial.distance.cosine(job_vec, resumes_vectors[i])
        
        
        if dist < min_dist:
            min_dist = dist
            min_index = i
        temp_distance.append((dist, i))

    #sort list of tuples
    temp_distance.sort(key = operator.itemgetter(0))
    return temp_distance
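
The loop above computes one cosine distance per resume and then sorts. A vectorized version of the same ranking could look like the sketch below, which uses only NumPy. `rank_by_cosine_distance` is a hypothetical name, and the code assumes no document vector is all zeros (that would divide by zero). Cosine distance is 1 minus cosine similarity, so it ranges from 0 (same direction) to 2 (opposite direction), which is why values slightly above 1 can appear for very dissimilar documents.

```python
import numpy as np


def rank_by_cosine_distance(query_vec, doc_matrix):
    """Return (distance, row_index) tuples sorted from most to least similar.

    Assumes every row of `doc_matrix` and `query_vec` has non-zero norm.
    """
    query = np.asarray(query_vec, dtype=float)
    docs = np.asarray(doc_matrix, dtype=float)
    # cosine similarity of the query against every row at once
    sims = docs @ query / (np.linalg.norm(docs, axis=1) * np.linalg.norm(query))
    dists = 1.0 - sims
    order = np.argsort(dists)
    return [(float(dists[i]), int(i)) for i in order]


docs = np.array([[1.0, 0.0],    # same direction as the query
                 [0.0, 1.0],    # orthogonal
                 [-1.0, 0.0]])  # opposite direction
ranking = rank_by_cosine_distance([1.0, 0.0], docs)
print(ranking)  # [(0.0, 0), (1.0, 1), (2.0, 2)]
```

The returned list has the same `(distance, index)` shape as `temp_distance`, so it could feed the printing helpers unchanged once the list of document vectors is stacked into a 2-D array.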

    

In [17]:
def print_similar_resumes(list_of_similar_resumes):
    num_similar = 10 #or 10, 20, 25
    print('\nPrinting {} MOST similar candidates...\n'.format(num_similar))
    for i in range(0,num_similar):
        print('\n#{} most similar job'.format(i+1))
        print('Resume ID from list: {}'.format(list_of_similar_resumes[i][1]))
        print('Cosine Distance: {}'.format(list_of_similar_resumes[i][0]))
        print('Resume ID from df: {}'.format(resumes.iloc[list_of_similar_resumes[i][1]]['ID']))
        print('Resume text (500 chars): {}'.format(resumes.iloc[list_of_similar_resumes[i][1]]['resume'][0:500]))


In [18]:
def print_dissimilar_resumes(list_of_similar_resumes):
    
    
    num_dissimilar = 10 #or 10, 20, 25
    print('\nPrinting {} LEAST similar candidate resumes...\n'.format(num_dissimilar))
    count = 0
    for i in range(0, len(list_of_similar_resumes)):
        if count == num_dissimilar:
            break
        if resumes.iloc[list_of_similar_resumes[-(1+i)][1]]['resume'] != 'nan':        
            print('\n#{} least similar candidate'.format(count+1))
            print('Candidate ID from list: {}'.format(list_of_similar_resumes[-(1+i)][1]))
            print('Cosine Distance: {}'.format(list_of_similar_resumes[-(1+i)][0]))
            print('Resume ID from df: {}'.format(resumes.iloc[list_of_similar_resumes[-(1+i)][1]]['ID']))
            print('Resume text (500 chars): {}'.format(resumes.iloc[list_of_similar_resumes[-(1+i)][1]]['resume'][:500]))
            count +=1
        
        
        

# Given a job, find similar candidates

In [19]:
chosen_job = pick_job()
# 5773 data analyst

There are 13124 jobs

selected job is ID #9670
Job Posting ID is: 9670
Job Posting Title: Sales Manager
Job Posting Company: Navavan LLC
Job Posting Description: Navavan LLC is looking for a Sales Manager who will be
responsible for the development and performance of all sales activities
in the assigned market. He/ she will staff and direct a sales team and
provide leadership towards the achievement of maximum profitability and
growth in line with company vision and values. The incumbent will be also
responsible for establishing plans and strategies to expand the customer
base in the marketing area.


In [20]:
ordered_resumes_list = given_job_find_similar_resumes(chosen_job)


In [21]:
# base_res_model.save('/home/ubuntu/NLP_projects/job_recommender_project/models/fasttext/base_res_ft')

In [22]:
print_similar_resumes(ordered_resumes_list)


Printing 10 MOST similar candidates...


#1 most similar job
Resume ID from list: 653
Cosine Distance: 0.017587125301361084
Resume ID from df: 654
Resume text (500 chars): entry level sales associate resume experience richard forman  cherry tree lane seattle wa  xexxf      richard forman email com sales associate customer servicesellingmerchandisingretail operations profile energetic customer centered individual seeking become part winning sales team vivians cosmetics offers extensive training acquired skills sales enhancement management exceed sales targets education mba sales marketing northwest university seattle wa  ba business economics professional college s

#2 most similar job
Resume ID from list: 102
Cosine Distance: 0.021009325981140137
Resume ID from df: 103
Resume text (500 chars): retail store manager   career profile self motivated retail store leader ten years active experience possess required determination improve performance utmost sense obligation carefulness notabl

In [24]:
#print_dissimilar_resumes(ordered_resumes_list)

# Build model to find jobs for a candidate

In [25]:
jobs_sentences = jobs['lemmatized_combined'].tolist()

In [26]:
base_jobs_model = FT_gensim(
    sentences=None,
    sg=0, #default to CBOW. if sg=1 then skip-gram
    hs=0, #default, if hs=0 & negative =/= 0 then neg. sampling. if hs=1, hierarchical softmax
    negative=5, #5 words selected for negative sampling
    size=100, #size of vector
    alpha=0.025,
    min_count=5, # ignore words with fewer than 5 appearances
    iter=5,
    seed=42,
    cbow_mean=1, #uses mean for CBOW. If it =0 then sums CBOW (provided CBOW not SG)
    min_n=3, # min length of char n-grams
    max_n=6, # max length of char n-grams. If 0 or less than min_n, this turns into W2V
    trim_rule=None, #if you had a rule to trim down vocabulary
    workers=3 # default
)    

    
# build the vocabulary
base_jobs_model.build_vocab(sentences = jobs_sentences)

# train the model
base_jobs_model.train(
    sentences=jobs_sentences,
    epochs=base_jobs_model.epochs,
    total_examples=base_jobs_model.corpus_count,
    total_words=base_jobs_model.corpus_total_words
)

print(base_jobs_model)

FastText(vocab=8277, size=100, alpha=0.025)


In [27]:
base_jobs_model.save('/home/ubuntu/NLP_projects/job_recommender_project/models/fasttext/base_jobs_ft')

In [28]:
def vectorize_jobs(model=base_jobs_model):
    """returns list of document vectors """
    jobs_vecs = []
    for i in range(0, len(jobs_sentences)):
        temp_vec = model.wv[jobs_sentences[i]]
        one_dim_vec = np.mean(temp_vec, axis=0)
        jobs_vecs.append(one_dim_vec)
    return jobs_vecs

In [29]:
def pick_resume(selection=-999):
    print("There are {} resumes".format(resumes.shape[0]))
  
    if selection == -999:
        selection = np.random.randint(1, resumes.shape[0]+1)
    
    
    
    print ('\nselected resume is ID #{}'.format(selection))
    
    # pick the resume text and ID associated with the selection
    resume_label = resumes.iloc[selection - 1, resumes.columns.get_loc('ID')] #we could grab ID, but this works for non-indexed labels too
    resume_text = resumes.iloc[selection - 1 ]['resume'][:500] 
    
    
    print('Resume ID is: {}'.format(resume_label))
    print('Resume text is (500 chars): {}'.format(resume_text))
    
    # Return the lemmatized token list for the selected resume (IDs are 1-based, rows are 0-based)
    resume_text_to_process = resumes['lemmatized_resume'][selection - 1]
    
    return resume_text_to_process


In [30]:
def given_resume_find_similar_jobs(resume_you_pick, model=base_jobs_model, jobs_vectors = vectorize_jobs()):
  
    #find all distances between chosen resume and each job
    temp_distance = []
    min_dist = float("inf")
    min_index = float("inf")
    
    res_temp = model.wv[resume_you_pick]
    res_vec = np.mean(res_temp, axis=0)
    
    for i in range (0, len(jobs_vectors)):
        dist = spatial.distance.cosine(res_vec, jobs_vectors[i])
#         dist = model.wmdistance(resume_you_pick, jobs_sentences[i])
        if dist < min_dist:
            min_dist = dist
            min_index = i
        temp_distance.append((dist, i))


                
    #sort list of tuples
    temp_distance.sort(key = operator.itemgetter(0))

    return temp_distance
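
One Python detail is worth keeping in mind with the signature above: a default such as `jobs_vectors = vectorize_jobs()` is evaluated once, when the `def` statement runs, not on each call. If the model is retrained later, the default still holds the old vectors unless the cell is re-executed. The toy sketch below shows the difference between that eager default and a lazy `None` default; all names in it are hypothetical.

```python
calls = []


def expensive_vectors():
    """Stand-in for vectorize_jobs(); records each time it runs."""
    calls.append("computed")
    return [1, 2, 3]


def eager(vectors=expensive_vectors()):  # default computed once, at definition time
    return vectors


def lazy(vectors=None):
    if vectors is None:
        vectors = expensive_vectors()  # computed on every call that omits it
    return vectors


print(len(calls))  # 1  (eager's default was already computed)
lazy()
lazy()
print(len(calls))  # 3  (each lazy call recomputed)
eager()
eager()
print(len(calls))  # 3  (eager calls reuse the cached default)
```

The eager form is fast for repeated queries in a notebook; the lazy form, or passing the vectors explicitly, is safer when the model or data may change between calls.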

    

In [31]:
def print_similar_jobs(list_of_similar_jobs):
    
    num_similar = 10 #or 10, 20, 25
    print('\nPrinting {} most similar jobs for this candidate...\n'.format(num_similar))
    for i in range(0,num_similar):
        print('\n#{} most similar job'.format(i+1))
        print('Job ID from list: {}'.format(list_of_similar_jobs[i][1]))
        print('Cosine distance: {}'.format(list_of_similar_jobs[i][0]))
        print('Job ID from df: {}'.format(jobs.iloc[list_of_similar_jobs[i][1]]['label']))
        print('Job title: {}'.format(jobs.iloc[list_of_similar_jobs[i][1]]['Title']))
        print('Company: {}'.format(jobs.iloc[list_of_similar_jobs[i][1]]['Company']))
        print('Job Description: {}'.format(jobs.iloc[list_of_similar_jobs[i][1]]['JobDescription']))


In [32]:
def print_dissimilar_jobs(list_of_similar_jobs):
    
    num_dissimilar = 10 #or 10, 20, 25
    print('\nPrinting {} LEAST similar jobs for this candidate...\n'.format(num_dissimilar))
    for i in range(0,num_dissimilar):
        print('\n#{} least similar job'.format(i+1))
        print('Job ID from list: {}'.format(list_of_similar_jobs[-(1+i)][1]))
        print('Cosine Distance: {}'.format(list_of_similar_jobs[-(1+i)][0]))
        print('Job ID from df: {}'.format(jobs.iloc[list_of_similar_jobs[-(1+i)][1]]['label']))
        print('Job title: {}'.format(jobs.iloc[list_of_similar_jobs[-(1+i)][1]]['Title']))
        print('Company: {}'.format(jobs.iloc[list_of_similar_jobs[-(1+i)][1]]['Company']))
        print('Job Description: {}'.format(jobs.iloc[list_of_similar_jobs[-(1+i)][1]]['JobDescription']))
        
        
        

In [33]:

# jv = vectorize_jobs()

In [71]:
chosen_resume = pick_resume(1147)
# 752 sales manager
# 719 creative computer visuals
# 1147 tax advisor

There are 1219 resumes

selected resume is ID #1147
Resume ID is: 1147
Resume text is (500 chars): tax advisor resume template john depp  cliff ave  san francisco california phone   johnmarvelousgmailcom objective work tax advisor play key role handling tax activities reputable organization summary qualifications possesses extensive knowledge tax laws excellent analytical mathematical skills ability take initiatives work pressure possesses excellent communication interpersonal skills ability solve problems meet timelines effectively ability work team well independently without minimal supervi


In [72]:
ordered_job_list = given_resume_find_similar_jobs(chosen_resume)

In [73]:
print_similar_jobs(ordered_job_list)


Printing 10 most similar jobs for this candidate...


#1 most similar job
Job ID from list: 12333
Cosine distance: 0.06628823280334473
Job ID from df: 12334
Job title: Chief Accountant
Company: Asatryans LLC
Job Description: The Chief Accountant will maintain and control Clients'
current accounting and tax services.

#2 most similar job
Job ID from list: 8715
Cosine distance: 0.07574671506881714
Job ID from df: 8716
Job title: Tax Manager - Financial Department
Company: Ameriabank CJSC
Job Description: Ameriabank CJSC is looking for a Tax Manager to be
responsible for tax and financial accounting, maintenance of tax
methodology and tax planning.

#3 most similar job
Job ID from list: 9119
Cosine distance: 0.08209353685379028
Job ID from df: 9120
Job title: Tax Officer
Company: "FINCA" Universal Credit Organization
Job Description: The incumbent will be responsible for all tax issues
and work at the Finance Department.

#4 most similar job
Job ID from list: 1053
Cosine distance: 0.

In [63]:
print_dissimilar_jobs(ordered_job_list)


Printing 10 LEAST similar jobs for this candidate...


#1 least similar job
Job ID from list: 1682
Cosine Distance: 1.0440135300159454
Job ID from df: 1683
Job title: Cleaning Staff
Company: Partner Organization of Career Center
Job Description: The recruitment division of Career Center is looking
for several candidates to work as cleaning staff at a shop in Yerevan
city center. The cleaning staff will work 6 days a week, one day from
10:00 - 20:00 and another day either from 10:00-13:00 or 13:00-20:00.

#2 least similar job
Job ID from list: 971
Cosine Distance: 0.9858560198917985
Job ID from df: 972
Job title: Receptionist
Company: Golden Palace Hotel
Job Description: Golden Palace Hotel is looking for a Receptionist to
work in night hours.

#3 least similar job
Job ID from list: 7815
Cosine Distance: 0.9577327258884907
Job ID from df: 7816
Job title: Cashier
Company: Ameriabank CJSC
Job Description: The incumbent will be responsible for execution of cash
and non-cash transacti