In [36]:
# YOUR RESUME
resume = """
PhD, Senior Data Scientist, O’Reilly Author
Senior Data Scientist, Silicon Valley Data Science, Mountain View, CA, USA.
- Consulting as a member of several small data science/data engineering teams at multiple companies.
- Creating output to explain data analysis, data visualization, and statistical modeling results to managers.
- Developing Data Science best practices for team.
- Modeling non-contractual churn on customer population.
- Modeling survey data responses with ordinal logistic regression in R.
- Analyzing and visualizing user behavior migration.
2014–2016 Data Scientist, Silicon Valley Data Science, Mountain View, CA, USA.
2014 Insight Data Science Postdoctoral Fellow, Insight Data Science, Palo Alto, CA, USA.
- Created a Data Science project to predict the auction sale price of Abstract Expressionist art.
2011–2014 Postdoctoral Research Associate, Swinburne University, Melbourne, AUS.
- Cleaned noisy and inhomogeneous astronomical data taken over four years by different observing groups.
- Utilized numerous statistical techniques, including sensitivity analysis on non-linear propagation of
errors, Markov-Chain Monte Carlo for model building, and hypothesis testing via information criterion.
- Simulated spectroscopic data to expose systematic errors that challenge long-standing results on whether
the fundamental physical constants of the universe are constant.
2005–2011 Graduate Student Researcher, UCSD, San Diego, CA, USA.
- Developed a novel technique to extract information from high resolution spectroscopic data that led to
uncovering unknown short-range systematic errors.
Programming and Development Skills
Languages Python, SQL (Impala/Hive), R, LaTeX, Bash.
Tools Jupyter Notebook, pandas, matplotlib, seaborn, numpy, scikit-learn, scipy, pymc3, git, pandoc.
Publishing, Speaking, and Side Projects
2017 Instructor Stanford Continuing Studies: Tips and Tricks for Data Scientists: Optimizing Your Workflow.
2017 Invited Keynote: USC Career Conference Beyond the PhD.
2016 PyData SF: Mental Models to Use and Avoid as a Data Scientist.
2016 O’Reilly author: Jupyter Notebook for Data Science Teams [screencast], editor O’Reilly Media.
2016 UC Berkeley Master in Data Science Guest Lecturer: Jupyter Notebook Usage.
2015 OSCON Speaker: IPython Notebook best practices for data science.
2013-2014 Contributor to astropy; creator of dipole_error, an astronomy Python module.
2013 Co-star and narrator of Hidden Universe, a 3D IMAX astronomy film playing worldwide.
Education
2011 PhD Physics, University of California San Diego, San Diego, CA, USA.
Thesis title: The Fine-Structure Constant and Wavelength Calibration.
2005 Bachelor of Science–Magna Cum Laude, Vanderbilt University, Nashville, TN, USA.
Triple major: Philosophy; Mathematics; Physics (honors)."""

In [37]:
# JOB DESCRIPTION
job_post = """
Applies highly complex statistical techniques to help solve problems. Provides insights and actionable recommendations to the business. Utilizes highly complex statistical modeling to make predictions about future outcomes and in multiple scenarios. Explains findings to business audience.
Responsible for applying highly complex advanced data analysis tools and techniques to provide insights and actionable recommendations for the business
Utilizes highly complex statistical modeling to make predictions about future outcomes and in multiple scenarios.
Interprets and applies data in highly-complex analyses, and explains findings to business audiences to improve products and processes.
Executes multiple and/or highly complex statistical and mathematical analyses to support business decision making for multiple business functions.
Develops and/or uses algorithms and statistical predictive models and determines analytical approaches and modeling techniques to evaluate scenarios and potential future outcomes.
Applies analytical rigor and statistical methods to analyze large amounts of data, using advanced statistical techniques such as predictive statistical models, customer profiling, segmentation analysis, survey design and analysis and data mining.
Documents projects including business objectives, data gathering and processing, leading approaches, final algorithm, detailed set of results and analytical metrics.
Develops materials to explain project findings.
Typically assigned to important / complicated undertakings.
Anticipates and prevents problems and roadblocks before they occur.
Interacts with internal and external peers and managers to exchange complex information related to areas of specialization.
Mentors less experienced members of the team. Provides guidance regarding analytical approach and iteration of algorithms.
Bachelor's degree and at least 4 years of experience in quantitative or computational functions; or graduate degree in a quantitative, computational or technical discipline and at least 2 years of experience in quantitative or computational functions
Advanced knowledge of SQL
Advanced knowledge of open source data science and statistics packages such as Python, R, Spark, etc.
Experience in data science, advanced analytics, or statistics.
Experience interrogating data, performing analyses, interpreting data, and presenting findings to business audiences.
Experience establishing and maintaining key relationships with internal (peers, business partners and leadership) and external (business community, clients and vendors) within a matrix organization to ensure quality standards for service.
Experience diagnosing, isolating, and resolving complex business issues and recommending and implementing strategies to resolve problems.
Experience presenting to all levels of an organization
At least 2 years of experience contributing to financial decisions in the workplace.
At least 2 years of direct leadership, indirect leadership and/or crossfunctional team leadership.
Willing to travel up to/at least 10% of the time for business purposes (within state and out of state).
Graduate degree in a quantitative, computational or technical discipline
"""

In [44]:
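import re

import nltk
import pandas as pd
from fuzzywuzzy import fuzz, process
from Levenshtein import distance as levenshtein_distance
from nltk import pos_tag
from nltk.tokenize import word_tokenize
from rake_nltk import Rake

# One-time setup if the NLTK data is missing (resource names vary by NLTK version):
# nltk.download('punkt'); nltk.download('averaged_perceptron_tagger'); nltk.download('stopwords')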

def preprocess_text(text):
    text = text.lower()
    # remove HTML tags
    text = re.sub(r'<[^>]+>', ' ', text)
    # remove special characters
    text = re.sub(r'[^\w\s]', ' ', text)
    # remove digits
    text = re.sub(r'\d+', ' ', text)
    # remove extra spaces
    text = re.sub(r'\s+', ' ', text).strip()
    return text

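# Extract keywords from text.
# Note: fuzz.token_set_ratio(word, text) scores 100 for any word that occurs in
# text, so in practice this filter keeps every word (duplicates included).
def extract_keywords(text, ratio_threshold=80):
    words = text.split()
    return [word for word in words if fuzz.token_set_ratio(word, text) >= ratio_threshold]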

##########################################################################
# Preprocess the resume and extract its keywords
resume_keywords = extract_keywords(preprocess_text(resume))

# Preprocess the job post text (this overwrites the raw job_post string)
job_post = preprocess_text(job_post)

# Tokenize the text into words
words = word_tokenize(job_post)

# Get the POS tags for each word
pos_tags = pos_tag(words)

# POS tags to keep: nouns, adjectives, and verbs
included_tags = {'NN', 'NNS', 'JJ', 'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ'}

# Keep only the words whose POS tag is in the included set
keywords = [word for (word, tag) in pos_tags if tag in included_tags]

stop = set(nltk.corpus.stopwords.words('english'))

# Deduplicate (preserving order), then drop stopwords and words shorter than 5 characters
job_post_keywords = [word for word in dict.fromkeys(keywords) if len(word) >= 5 and word not in stop]



# Merge similar keywords
merged_keywords = {}
for keyword in job_post_keywords:
    # Use fuzzy matching to find the most similar keyword that has already been seen;
    # skip the lookup while nothing has been seen yet
    closest_match = (
        process.extractOne(keyword, list(merged_keywords), scorer=fuzz.ratio)
        if merged_keywords else None
    )
    if closest_match and closest_match[1] >= 90:
        merged_keywords[closest_match[0]].append(keyword)
    else:
        merged_keywords[keyword] = [keyword]

# Use the representative (first-seen) keyword of each group as the keyword list
job_post_keywords = list(merged_keywords)

# Remove near-duplicate keywords
final_keywords = []
for i, keyword in enumerate(job_post_keywords):
    # A keyword is too similar if its Levenshtein distance to any earlier
    # keyword (including ones already dropped) is below 3
    is_similar = any(
        levenshtein_distance(keyword, earlier) < 3
        for earlier in job_post_keywords[:i]
    )
    # Keep the keyword only if it is not too similar to any earlier keyword
    if not is_similar:
        final_keywords.append(keyword)

job_post_keywords = final_keywords


# Minimum normalized Levenshtein similarity (1 - distance / longer length) for a match
levenshtein_threshold = .625
# For each job post keyword, store the most similar resume keyword,
# or 'None' if the best similarity is below the threshold
matched_keywords = []
for job_keyword in job_post_keywords:
    max_ratio = 0
    best_match = 'None'
    for resume_keyword in resume_keywords:
        ratio = 1 - levenshtein_distance(job_keyword, resume_keyword) / max(len(job_keyword), len(resume_keyword))
        # '>=' keeps the last of several equally good candidates
        if ratio >= max_ratio:
            max_ratio = ratio
            best_match = resume_keyword
    matched_keywords.append(best_match if max_ratio >= levenshtein_threshold else 'None')


# Extract candidate key phrases from the (already preprocessed) job post with RAKE
job_phrases = Rake()
job_phrases.extract_keywords_from_text(job_post)
ranked_phrases = job_phrases.get_ranked_phrases()
# Drop duplicate phrases (keeping rank order) and single-word phrases
ranked_phrases = list(dict.fromkeys(ranked_phrases))
ranked_phrases = [phrase for phrase in ranked_phrases if len(phrase.split()) > 1]

# A phrase counts as matched when at least one third of its words have a
# sufficiently similar resume keyword; unmatched phrases are stored as 'None'
matched_keyphrases = []
for phrase in ranked_phrases:
    phrase_words = phrase.split()
    tally = 0
    for phrase_word in phrase_words:
        for resume_keyword in resume_keywords:
            ratio = 1 - levenshtein_distance(phrase_word, resume_keyword) / max(len(phrase_word), len(resume_keyword))
            if ratio >= levenshtein_threshold:
                tally += 1
                break
    matched_keyphrases.append(phrase if tally >= len(phrase_words) / 3 else 'None')
##########################################################################
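# WEIGHT GIVEN TO PHRASE MATCHES (KEYWORD MATCHES GET THE REMAINDER)
weight_phrase = .65
weight_word = 1 - weight_phrase

# SIMILARITY SCORE: WEIGHTED SHARE OF MATCHED KEYWORDS AND PHRASES (0 TO 1)
matched_word_count = sum(x != 'None' for x in matched_keywords)
matched_phrase_count = sum(x != 'None' for x in matched_keyphrases)
sim_score = round(
    matched_word_count / len(job_post_keywords) * weight_word
    + weight_phrase * matched_phrase_count / len(ranked_phrases),
    4,
)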

print('Similarity score:', sim_score)

THRESHOLD = 0.40

if sim_score >= THRESHOLD:
    print("Your resume is a good match for the job post.")
else:
    print("Your resume is not a good match for the job post.")

Similarity score: 0.5088
Your resume is a good match for the job post.
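
The score printed above is a weighted average of two match rates: the share of job post keywords with a resume match and the share of key phrases with a match. As a rough standalone sketch of that arithmetic (the function name `weighted_match_score`, the `missing` parameter, and the empty-list guard are illustrative additions, not part of the cell above), assuming unmatched entries are marked with the string 'None':

```python
def weighted_match_score(matched_words, matched_phrases, weight_phrase=0.65, missing="None"):
    """Blend the keyword match rate and the phrase match rate into one score."""
    def match_rate(matches):
        # Guard against an empty list (an added safety check for this sketch)
        if not matches:
            return 0.0
        return sum(m != missing for m in matches) / len(matches)

    weight_word = 1 - weight_phrase
    return round(
        weight_word * match_rate(matched_words)
        + weight_phrase * match_rate(matched_phrases),
        4,
    )


# Hypothetical toy inputs: 1 of 2 keywords and 3 of 4 phrases matched
example = weighted_match_score(["sql", "None"], ["a b", "c d", "None", "e f"])
print(example)
```

Because phrases carry 65% of the weight, a resume can clear the 0.40 threshold mostly on phrase matches even when relatively few individual keywords match.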


In [45]:
############################################
# RESULTS ANALYSIS #

In [46]:
keyword_df = pd.DataFrame({'Job Post Keyword': job_post_keywords, 'Matched Resume Keyword': matched_keywords})

In [47]:
keyphrase_df = pd.DataFrame({'Job Post Keyphrase': ranked_phrases, 'Matched Resume Keyphrase': matched_keyphrases})

In [48]:
keyword_df

,Job Post Keyword,Matched Resume Keyword
0,applies,
1,complex,
2,statistical,statistical
3,techniques,techniques
4,solve,
...,...,...
131,crossfunctional,
132,willing,building
133,travel,
134,purposes,
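
Row 132 pairs "willing" with "building", which looks like a false positive. The sketch below, a pure-Python stand-in for the `Levenshtein` package used in the notebook (the function names `edit_distance` and `similarity` are illustrative), reproduces the normalized ratio and suggests why: the pair lands exactly on the 0.625 threshold.

```python
def edit_distance(a, b):
    """Levenshtein distance computed with a two-row dynamic programming table."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep when equal)
            ))
        previous = current
    return previous[-1]


def similarity(a, b):
    """1 - distance / longer length, the ratio compared against the threshold."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - edit_distance(a, b) / longest


print(edit_distance("willing", "building"))  # 3
print(similarity("willing", "building"))     # 0.625
```

Raising the threshold slightly above 0.625 would drop this pair, at the cost of also rejecting any other match that sits at the same ratio.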


In [49]:
keyphrase_df

,Job Post Keyphrase,Matched Resume Keyphrase
0,data mining documents projects including busin...,data mining documents projects including busin...
1,predictive statistical models customer profili...,predictive statistical models customer profili...
2,statistics experience interrogating data perfo...,statistics experience interrogating data perfo...
3,processing leading approaches final algorithm ...,
4,applying highly complex advanced data analysis...,applying highly complex advanced data analysis...
...,...,...
68,financial decisions,
69,algorithms bachelor,algorithms bachelor
70,actionable recommendations,
71,least years,least years
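
A phrase is kept as matched when at least one third of its words find a resume keyword, so a two-word phrase such as "least years" needs only one matching word. The sketch below illustrates that rule in isolation; `phrase_is_matched` and the `word_matches` predicate are hypothetical names, and exact equality stands in for the Levenshtein ratio test used in the notebook.

```python
def phrase_is_matched(phrase, resume_words, word_matches):
    """Return True when at least one third of the phrase's words have a match."""
    words = phrase.split()
    # Count the phrase words that match at least one resume word
    tally = sum(
        any(word_matches(word, resume_word) for resume_word in resume_words)
        for word in words
    )
    return tally >= len(words) / 3


# Hypothetical stand-in for the similarity test: exact equality
def exact(a, b):
    return a == b


resume_words = ["years", "data"]  # toy resume vocabulary for illustration
print(phrase_is_matched("least years", resume_words, exact))
print(phrase_is_matched("financial decisions", resume_words, exact))
print(phrase_is_matched("data mining documents projects", resume_words, exact))
```

With this toy vocabulary the first phrase passes on a single word, while the four-word phrase fails because one matching word out of four is below one third.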
