<a href="https://colab.research.google.com/github/samsontran/Job-Search-Recommender-System/blob/main/Job_Search_Recommender_System.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Objective: Given web-scraped job postings from LinkedIn and Glassdoor, which jobs should we recommend to a user based on their search query or personal statement?



In [None]:
!pip install rank_bm25



In [None]:
# import libraries
import pandas as pd
import numpy as np
from rank_bm25 import BM25Okapi
import re
import string

import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from sklearn.feature_extraction import text
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

###Text Preprocessing

In [None]:
# remove @mentions, punctuation and stopwords, and set all characters to lower case
def text_preprocess(text):
  new_text = re.sub(r'@\S+', '', text)
  # apply to new_text (not text) so the @mention removal above is kept
  new_text = re.compile("[" + re.escape(string.punctuation) + '’'+ "]").sub('', new_text)
  new_text = remove_stopwords(str.lower(new_text))

  return new_text

STOPWORD = list(set(list(stopwords.words('english')) + list(text.ENGLISH_STOP_WORDS)))
def remove_stopwords(text):
  return " ".join([word for word in text.split(" ") if word not in STOPWORD])
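
The cleaning steps above can be tried on a single string without downloading any corpora. This is only a sketch: `demo_preprocess` and `DEMO_STOPWORDS` are hypothetical names introduced here, and the tiny stopword set stands in for the much larger NLTK plus scikit-learn list built in the cell above.

```python
import re
import string

# Hypothetical, tiny stopword set for illustration only; the notebook
# builds a much larger list from NLTK and scikit-learn.
DEMO_STOPWORDS = {"i", "am", "a", "the", "at", "in", "with", "and"}

def demo_preprocess(text, stopwords=DEMO_STOPWORDS):
    # 1) drop @mentions, 2) drop punctuation, 3) lower-case, 4) drop stopwords
    cleaned = re.sub(r"@\S+", "", text)
    cleaned = re.sub("[" + re.escape(string.punctuation) + "’]", "", cleaned)
    tokens = cleaned.lower().split()
    return " ".join(tok for tok in tokens if tok not in stopwords)

print(demo_preprocess("I am a Data Scientist at @acme, with Python & SQL!"))
# prints: data scientist python sql
```

Note that the sketch uses `split()` rather than `split(" ")`, so runs of spaces left behind by the removals do not produce empty tokens.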

###Develop search engine using BM25 (Okapi) approach

In [None]:
# load the Data Science job postings previously scraped from LinkedIn (Excel) and Glassdoor (CSV), then merge them
df = pd.read_excel('LinkedIn Job Data_Data Scientist in Canada.xlsx')
df2 = pd.read_csv('glassdoor job data.csv')
df2 = df2.drop(columns=['company_starRating', 'company_roleLocation']).rename(columns={"companyName": "Company", "company_offeredRole": "Title", "listing_jobDesc": "Description", "requested_url": "Link"})
df = pd.concat([df, df2])
df = df.drop(columns=['ID']).dropna().drop_duplicates(subset=['Description'])

In [None]:
df

Unnamed: 0,Company,Title,Description,Link
0,Twitter,Data Scientist - Product Data Science,Company Description Twitter serves the public...,https://ca.linkedin.com/jobs/view/data-scienti...
1,Walmart Canada,Data Scientist,Position Summary...The Data Scientist represen...,https://ca.linkedin.com/jobs/view/data-scienti...
2,Morgan Stanley,Data Scientist,We Offer To work with some of the best profes...,https://ca.linkedin.com/jobs/view/data-scienti...
3,Samsung Electronics,Data Scientist,Position Summary Do you believe in the power ...,https://ca.linkedin.com/jobs/view/data-scienti...
4,Yelp,Data Scientist (Remote),"At Yelp, it’s our mission to connect people wi...",https://ca.linkedin.com/jobs/view/data-scienti...
...,...,...,...,...
176,IBM.css-1pmc6te{-webkit-align-items:center;-we...,Data Scientist – IBM Garage – Summit 2022,\nDemonstrate a growth mindset and continuous ...,https://www.glassdoor.com/partner/jobListing.h...
177,Amazon Dev Centre Canada ULC.css-1pmc6te{-webk...,Data Scientist,"Bachelor’s degree in Statistics, Applied Math,...",https://www.glassdoor.com/partner/jobListing.h...
233,Allstate Canada.css-1pmc6te{-webkit-align-item...,DATA SCIENTIST,.css-1yuy9gt{display:-webkit-box;-webkit-line-...,https://www.glassdoor.com/partner/jobListing.h...
311,Virtus Groups,Senior Machine Learning Engineer/ Data Scientist,Thorough understanding of Python Comfortable l...,https://www.glassdoor.com/partner/jobListing.h...


In [None]:
descriptions = df["Description"]

In [None]:
descriptions

0      Company Description  Twitter serves the public...
1      Position Summary...The Data Scientist represen...
2      We Offer  To work with some of the best profes...
3      Position Summary  Do you believe in the power ...
4      At Yelp, it’s our mission to connect people wi...
                             ...                        
176    \nDemonstrate a growth mindset and continuous ...
177    Bachelor’s degree in Statistics, Applied Math,...
233    .css-1yuy9gt{display:-webkit-box;-webkit-line-...
311    Thorough understanding of Python Comfortable l...
446    Responsible for the extraction and analysis of...
Name: Description, Length: 883, dtype: object

In [None]:
# initialize bm25 with selected data
tokenized_corpus = [doc.split(" ") for doc in np.array(descriptions)]
bm25 = BM25Okapi(tokenized_corpus)
query = "data science"
tokenized_query = query.split(" ")

In [None]:
doc_scores = bm25.get_scores(tokenized_query)
print(doc_scores[:10])

[3.8674797  4.0144277  3.63166059 2.95037712 2.87196488 1.85025555
 3.41249983 3.65676377 1.67258395 2.71365618]
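
To make the scores above less of a black box, here is a from-scratch sketch of the standard Okapi BM25 formula. It is not the `rank_bm25` implementation: the function name `bm25_scores`, the toy corpus, the parameter values `k1=1.5` and `b=0.75`, and the always-positive smoothed IDF are all assumptions made for illustration, and the library may treat IDF differently, so absolute values will not match the notebook output.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, corpus_tokens, k1=1.5, b=0.75):
    """Score every document in corpus_tokens against query_tokens."""
    n_docs = len(corpus_tokens)
    avgdl = sum(len(doc) for doc in corpus_tokens) / n_docs
    # document frequency: number of documents containing each term
    doc_freq = Counter(term for doc in corpus_tokens for term in set(doc))

    scores = []
    for doc in corpus_tokens:
        tf = Counter(doc)
        # longer-than-average documents are penalised through this term
        length_norm = k1 * (1 - b + b * len(doc) / avgdl)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            # smoothed IDF (assumed variant), always positive
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores

toy_corpus = [
    "data science and machine learning".split(),
    "software engineer backend".split(),
    "data analyst".split(),
]
toy_scores = bm25_scores("data science".split(), toy_corpus)
print(toy_scores)
```

In this toy corpus the first document matches both query terms and scores highest, the third matches only "data", and the second matches nothing and scores zero.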


In [None]:
docs = bm25.get_top_n(tokenized_query, descriptions, n=5)
# note: boolean masking returns rows in DataFrame order, not in BM25 rank order
df_search = df[descriptions.isin(docs)]
df_search.head()

Unnamed: 0,Company,Title,Description,Link
117,BrainFinance,Analytics Data Scientist,BrainFinance is a leading financial technology...,https://ca.linkedin.com/jobs/view/analytics-da...
136,Interac Corp.,Senior Data Scientist,Are you interested in working for a company as...,https://ca.linkedin.com/jobs/view/senior-data-...
168,OpenText,AI Data Scientist,Opentext - The Information Company As the Inf...,https://ca.linkedin.com/jobs/view/ai-data-scie...
304,Hopper,"Sr Data Scientist, Hotels Marketplace",Hopper offers a remote work policy that empowe...,https://ca.linkedin.com/jobs/view/sr-data-scie...
482,Workday,"Senior Data Scientist, Machine Learning - Spen...",Do what you love. Love what you do. At Workd...,https://ca.linkedin.com/jobs/view/senior-data-...
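
The table above is ordered by row position rather than by relevance, because `isin` only filters. One possible way to keep the ranking is to sort positions by score and index with them. The helper `top_n_positions` below is a hypothetical name, shown on the first five scores printed earlier (rounded).

```python
def top_n_positions(scores, n):
    """Return positions of the n highest scores, best first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:n]

# First five scores printed above, rounded (illustrative input only)
demo_scores = [3.87, 4.01, 3.63, 2.95, 2.87]
print(top_n_positions(demo_scores, 3))
# prints: [1, 0, 2]
```

With the notebook's objects, something like `df.iloc[top_n_positions(doc_scores, 5)]` should then return the rows best-first, assuming the rows of `df` and the entries of `descriptions` are in the same order.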


###Example of a search based on a personal statement

In [None]:
personal_statement = "I am finishing up my studies at the University of Toronto towards Bachelor of Science, Computer Science Specialist with focus in Artificial Intelligence. Through the first two years of the program, I obtained a strong foundation in computer science subjects. I applied critical thinking and strong problem-solving skills towards course-based assignments and projects."

In [None]:
query = personal_statement
tokenized_query = query.split(" ")

docs = bm25.get_top_n(tokenized_query, descriptions, n=5)
df_search = df[descriptions.isin(docs)]
df_search.head()

Unnamed: 0,Company,Title,Description,Link
237,Dribbble,Machine Learning Engineer,"Founded in 2009, Dribbble is the top global co...",https://ca.linkedin.com/jobs/view/machine-lear...
289,MonetizeMore,Remote Data Scientist,MonetizeMore builds industry leading ad techno...,https://ca.linkedin.com/jobs/view/remote-data-...
434,Intelcom,"Data Scientist, Gestion des revenus",Description Du Poste Intelcom est un important...,https://ca.linkedin.com/jobs/view/data-scienti...
542,Twitter,"Senior Machine Learning Engineer, Topics / NLP",Company Description Twitter is what’s happeni...,https://ca.linkedin.com/jobs/view/senior-machi...
614,Twitter,"Software Engineer - Machine Learning, Recommen...",Company Description Twitter is what’s happeni...,https://ca.linkedin.com/jobs/view/software-eng...


###Develop search engine using TF-IDF approach

Find jobs similar to one of the job posts in the Glassdoor dataset

In [None]:
df = pd.read_excel('LinkedIn Job Data_Data Scientist in Canada.xlsx')
df2 = pd.read_csv('glassdoor job data.csv')
df2 = df2.drop(columns=['company_starRating', 'company_roleLocation']).rename(columns={"companyName": "Company", "company_offeredRole": "Title", "listing_jobDesc": "Description", "requested_url": "Link"})
df = pd.concat([df, df2])
df = df.drop(columns=['ID']).dropna().drop_duplicates(subset=['Description'])

In [None]:
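# clean a Series of descriptions: strip @mentions and punctuation, lower-case, remove stopwords
def df_preprocess(df):
  df = df.apply(lambda x: re.sub(r'@\S+', '', x))
  df = df.apply(lambda x: re.compile("[" + re.escape(string.punctuation) + '’'+ "]").sub('', x))
  df = df.apply(lambda x: str.lower(x))
  df = df.apply(lambda x: remove_stopwords(x))

  # drop descriptions that are empty after cleaning
  # (any dropped row shifts later positions relative to the original DataFrame)
  return df[df != ""]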

STOPWORD = list(set(list(stopwords.words('english')) + list(text.ENGLISH_STOP_WORDS)))
def remove_stopwords(text):
  return " ".join([word for word in text.split(" ") if word not in STOPWORD])

In [None]:
descriptions = df_preprocess(df["Description"])

In [None]:
tfidf = TfidfVectorizer()
tfidf_matrix  = tfidf.fit_transform(descriptions)
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)
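
`linear_kernel` computes plain dot products. It can stand in for cosine similarity here because `TfidfVectorizer` L2-normalises each row by default, so every vector has unit length. The sketch below reproduces that idea in pure Python with the smoothed IDF `ln((1 + N) / (1 + df)) + 1`. The helper names and the toy documents are assumptions for illustration, and tokenisation is a simple whitespace split rather than scikit-learn's tokenizer.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return L2-normalised TF-IDF vectors (as dicts) for tokenised docs."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # smoothed IDF: ln((1 + N) / (1 + df)) + 1
        weights = {t: c * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1)
                   for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()})
    return vectors

def dot(u, v):
    """Dot product of two sparse vectors stored as dicts."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())

docs = [
    "data scientist python".split(),
    "senior data scientist python sql".split(),
    "marketing manager".split(),
]
vecs = tfidf_vectors(docs)
# With unit-length vectors, the dot product is the cosine similarity
sim = [[dot(u, v) for v in vecs] for u in vecs]
print(sim[0])
```

Each document has similarity 1 with itself, the two data scientist postings are close to each other, and the marketing posting shares no terms with them, so its similarity to both is 0.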

In [None]:
# this function returns the jobs most similar to the job at position job_idx in the cosine similarity matrix
def find_similar_jobs(job_idx, top, cosine_sim=cosine_sim):
  # pair each job position with its similarity to job_idx, most similar first
  sim_scores = list(enumerate(cosine_sim[job_idx]))
  sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
  # skip the first entry (normally the job itself); this keeps top - 1 results
  sim_scores = sim_scores[1:top]
  jobs = [i[0] for i in sim_scores]

  return df.iloc[jobs]
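
The slice `sim_scores[1:top]` relies on the query job sorting first and yields `top - 1` neighbours. If exactly `top` results are wanted, or if another posting could tie with the query job at similarity 1.0, one alternative is to exclude the query position explicitly. `most_similar` is a hypothetical helper and the similarity row is made-up data.

```python
def most_similar(sim_row, self_idx, top):
    """Positions of the `top` highest similarities, excluding self_idx."""
    ranked = sorted(
        (i for i in range(len(sim_row)) if i != self_idx),
        key=lambda i: sim_row[i],
        reverse=True,
    )
    return ranked[:top]

# Hypothetical similarity row for job 0 (its self-similarity is 1.0)
row = [1.0, 0.2, 0.9, 0.0, 0.5]
print(most_similar(row, self_idx=0, top=3))
# prints: [2, 4, 1]
```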

In [None]:
find_similar_jobs(20, 15)

Unnamed: 0,Company,Title,Description,Link
19,TD,Data Scientist I,TD Description Tell us your story. Don't go u...,https://ca.linkedin.com/jobs/view/data-scienti...
195,TD,"Data Scientist I, AI/ML Model Testing Automati...",TD Description Tell us your story. Don't go u...,https://ca.linkedin.com/jobs/view/data-scienti...
229,TD,"Data Scientist I, AI/ML Model Testing Automati...",TD Description Tell us your story. Don't go u...,https://ca.linkedin.com/jobs/view/data-scienti...
122,Vanguard,Data Scientist II,"At Vanguard, our mission is to help investors ...",https://ca.linkedin.com/jobs/view/data-scienti...
191,407 ETR,Data Scientist,Position Summary: The successful candidate wi...,https://ca.linkedin.com/jobs/view/data-scienti...
189,TELUS,Data Scientist/Engineer,Join our team Are you obsessed with data and ...,https://ca.linkedin.com/jobs/view/data-scienti...
4,Hudson's Bay,The Bay | Jr. Data Scientist,"Design, develop, test, advocate and build pred...",https://www.glassdoor.com/partner/jobListing.h...
393,Workday,Sr Business Insights Analyst/Data Scientist,Do what you love. Love what you do. At Workd...,https://ca.linkedin.com/jobs/view/sr-business-...
293,Hudson's Bay,The Bay | Data Scientist,Who We Are As North America’s oldest start-up...,https://ca.linkedin.com/jobs/view/the-bay-data...
106,First West Credit Union,Data Scientist,We are currently seeking a Data Scientist to j...,https://ca.linkedin.com/jobs/view/data-scienti...


References: 

https://www.analyticsvidhya.com/blog/2021/05/build-your-own-nlp-based-search-engine-using-bm25/

https://www.datacamp.com/community/tutorials/recommender-systems-python