# [**Building NLP Content-Based Recommender Systems**](https://medium.com/@armandj.olivares/building-nlp-content-based-recommender-systems-b104a709c042)

- [Github](https://github.com/ArmandDS/jobs_recommendations/blob/master/job_analysis_content_recommendation.ipynb)
- [Kaggle](https://www.kaggle.com/datasets/kandij/job-recommendation-datasets?resource=download)

A replica and adaptation of Armando Olivares' work.

## **Summary**

In essence, given **potential employee** information and **job information**, both treated as text, we try to find the **best job** for each employee based on the similarity of the words used in both sets of information.

Content filtering takes an **item-to-item** approach. Here we treat the **employee** information as one item and compare it against another item, the job.

For the **potential employee** we looked for:
1. Viewed jobs
2. Job experience
3. Positions of interest

For the **job** we only have position-related data.

We first fit a `TfidfVectorizer()` on the **job information** and use it to transform that corpus. The same fitted **TF-IDF** vectorizer is then applied to the **employee information**, and we look for the most suitable job using the **cosine similarity** between both sets of vectors.
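
The flow described above can be sketched on a toy example. This is a minimal illustration, not the notebook's actual data: the `job_corpus` and `applicant_corpus` strings are made up, and `cosine_similarity` from scikit-learn is assumed as the similarity function.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpora (made up for illustration, not taken from the dataset)
job_corpus = [
    "server restaurant part time serve food drink",
    "bartender wine bar restaurant mix drink",
    "driver deliver package flexible schedule",
]
applicant_corpus = ["bartender experience wine bar"]

# Fit the vocabulary and IDF weights on the jobs only
vectorizer = TfidfVectorizer()
job_matrix = vectorizer.fit_transform(job_corpus)

# Reuse the fitted vectorizer so applicants share the same feature space
applicant_matrix = vectorizer.transform(applicant_corpus)

# One row per applicant, one column per job
scores = cosine_similarity(applicant_matrix, job_matrix)
best_job = int(scores[0].argmax())
print(best_job, scores[0].round(3))
```

Words the applicant uses that never occur in the job corpus (here `experience`) are simply ignored by `transform`, because the vocabulary is fixed at fit time.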

**References**

- [Count Vectorizer vs TFIDF Vectorizer | NLP](https://www.linkedin.com/pulse/count-vectorizers-vs-tfidf-natural-language-processing-sheel-saket)
- [TF-IDF Vectorizer Scikit-Learn](https://medium.com/@cmukesh8688/tf-idf-vectorizer-scikit-learn-dbc0244a911a)

In [1]:
import pandas as pd
pd.options.display.max_columns = 100
pd.options.display.max_rows = 100
pd.options.display.max_colwidth = 120


import numpy as np
import nltk
import warnings; warnings.filterwarnings('ignore')
import textwrap

## **Load the data**

In [2]:
df_jobs = pd.read_csv('archive/Combined_Jobs_Final.csv')
display(df_jobs.head())
df_jobs.info()

Unnamed: 0,Job.ID,Provider,Status,Slug,Title,Position,Company,City,State.Name,State.Code,Address,Latitude,Longitude,Industry,Job.Description,Requirements,Salary,Listing.Start,Listing.End,Employment.Type,Education.Required,Created.At,Updated.At
0,111,1,open,palo-alto-ca-tacolicious-server,Server @ Tacolicious,Server,Tacolicious,Palo Alto,California,CA,,37.443346,-122.16117,Food and Beverages,"Tacolicious' first Palo Alto store just opened recently, and we are hiring! If you love tacos, you will love working...",,8.0,,,Part-Time,,2013-03-12 02:08:28 UTC,2014-08-16 15:35:36 UTC
1,113,1,open,san-francisco-ca-claude-lane-kitchen-staff-chef,Kitchen Staff/Chef @ Claude Lane,Kitchen Staff/Chef,Claude Lane,San Francisco,California,CA,,37.78983,-122.404268,Food and Beverages,"\r\n\r\nNew French Brasserie in S.F. Financial District Seeks Chef\r\nWe are seeking an energetic, dynamic chef to ...",,0.0,,,Part-Time,,2013-04-12 08:36:36 UTC,2014-08-16 15:35:36 UTC
2,117,1,open,san-francisco-ca-machka-restaurants-corp-bartender,Bartender @ Machka Restaurants Corp.,Bartender,Machka Restaurants Corp.,San Francisco,California,CA,,37.795597,-122.402963,Food and Beverages,We are a popular Mediterranean wine bar and restaurant in Financial District.\r\n\r\nWe are looking for an experienc...,,11.0,,,Part-Time,,2013-07-16 09:34:10 UTC,2014-08-16 15:35:37 UTC
3,121,1,open,brisbane-ca-teriyaki-house-server,Server @ Teriyaki House,Server,Teriyaki House,Brisbane,California,CA,,37.685073,-122.400275,Food and Beverages,● Serve food/drinks to customers in a professional manner \r\n ● Act as a cashier when needed \r\n ● Clean up the d...,,10.55,,,Part-Time,,2013-09-04 15:40:30 UTC,2014-08-16 15:35:38 UTC
4,127,1,open,los-angeles-ca-rosa-mexicano-sunset-kitchen-staff-chef,Kitchen Staff/Chef @ Rosa Mexicano - Sunset,Kitchen Staff/Chef,Rosa Mexicano - Sunset,Los Angeles,California,CA,,34.073384,-118.460439,Food and Beverages,"Located at the heart of Hollywood, we are one of the most popular Mexican places in LA! We are currently looking for...",,10.55,,,Part-Time,,2013-07-17 15:26:18 UTC,2014-08-16 15:35:40 UTC


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 84090 entries, 0 to 84089
Data columns (total 23 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Job.ID              84090 non-null  int64  
 1   Provider            84090 non-null  int64  
 2   Status              84090 non-null  object 
 3   Slug                84090 non-null  object 
 4   Title               84090 non-null  object 
 5   Position            84090 non-null  object 
 6   Company             81819 non-null  object 
 7   City                83955 non-null  object 
 8   State.Name          83919 non-null  object 
 9   State.Code          83919 non-null  object 
 10  Address             36 non-null     object 
 11  Latitude            84090 non-null  float64
 12  Longitude           84090 non-null  float64
 13  Industry            267 non-null    object 
 14  Job.Description     84034 non-null  object 
 15  Requirements        0 non-null      float64
 16  Sala

Selecting the columns for the jobs corpus

In [3]:
# Select the columns of interest and normalize their names
cols = ['Job.ID', 'Title', 'Position', 'Company', 'City', 'Employment.Type', 'Job.Description']
df_jobs = df_jobs[cols].copy()
df_jobs.columns = [col.replace('.', '_') for col in df_jobs.columns]
df_jobs.head()

Unnamed: 0,Job_ID,Title,Position,Company,City,Employment_Type,Job_Description
0,111,Server @ Tacolicious,Server,Tacolicious,Palo Alto,Part-Time,"Tacolicious' first Palo Alto store just opened recently, and we are hiring! If you love tacos, you will love working..."
1,113,Kitchen Staff/Chef @ Claude Lane,Kitchen Staff/Chef,Claude Lane,San Francisco,Part-Time,"\r\n\r\nNew French Brasserie in S.F. Financial District Seeks Chef\r\nWe are seeking an energetic, dynamic chef to ..."
2,117,Bartender @ Machka Restaurants Corp.,Bartender,Machka Restaurants Corp.,San Francisco,Part-Time,We are a popular Mediterranean wine bar and restaurant in Financial District.\r\n\r\nWe are looking for an experienc...
3,121,Server @ Teriyaki House,Server,Teriyaki House,Brisbane,Part-Time,● Serve food/drinks to customers in a professional manner \r\n ● Act as a cashier when needed \r\n ● Clean up the d...
4,127,Kitchen Staff/Chef @ Rosa Mexicano - Sunset,Kitchen Staff/Chef,Rosa Mexicano - Sunset,Los Angeles,Part-Time,"Located at the heart of Hollywood, we are one of the most popular Mexican places in LA! We are currently looking for..."


In [4]:
# checking for the null values
df_jobs.isnull().sum()

Job_ID                0
Title                 0
Position              0
Company            2271
City                135
Employment_Type      10
Job_Description      56
dtype: int64

Only 9 companies have `NaN` values in `City`, so we fill them in manually with each company's headquarters location (found with a quick Google search).

In [5]:
# Set a list of tuples with company-actual city
job_city = [
    ('CHI Payment Systems', 'Illinois'),
    ('Academic Year In America', 'Stanford'),
    ('CBS Healthcare Services and Staffing ', 'Urbandale'),
    ('Driveline Retail', 'Coppell'),
    ('Educational Testing Services', 'New Jersey'),
    ('Genesis Health System', 'Davenport'),
    ('Home Instead Senior Care', 'Nebraska'),
    ('St. Francis Hospital', 'New York'),
    ('Volvo Group', 'Washington'),
    ('CBS Healthcare Services and Staffing', 'Urbandale')
]

for company, city in job_city:
    df_jobs.loc[df_jobs['Company'] == company, 'City'] = city
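
Note that the loop above assigns the city to every row of a listed company, including rows that already have one, and it needs a second `CBS Healthcare Services and Staffing` entry to cover a trailing space. A possible alternative, sketched on a toy frame with a hypothetical `hq_by_company` lookup, strips the names first and fills only the missing cities:

```python
import pandas as pd

# Toy frame standing in for df_jobs (values are illustrative)
jobs = pd.DataFrame({
    "Company": ["Driveline Retail", "Volvo Group ", "Uber", "Volvo Group"],
    "City": [float("nan"), float("nan"), "Boston", "Greensboro"],
})

# Hypothetical lookup; keys are stored without surrounding whitespace
hq_by_company = {"Driveline Retail": "Coppell", "Volvo Group": "Washington"}

# Normalize names so 'Volvo Group ' and 'Volvo Group' hit the same key
normalized = jobs["Company"].str.strip()

# Fill only the rows where City is missing; existing cities stay untouched
jobs["City"] = jobs["City"].fillna(normalized.map(hq_by_company))
print(jobs)
```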

In [6]:
# Look again for missing cities
df_jobs.isnull().sum()

Job_ID                0
Title                 0
Position              0
Company            2271
City                  2
Employment_Type      10
Job_Description      56
dtype: int64

In [7]:
# Employment type is missing for the Uber listings, so assume it is 'Flexible'
df_jobs[df_jobs['Employment_Type'].isnull()]

Unnamed: 0,Job_ID,Title,Position,Company,City,Employment_Type,Job_Description
10768,153197,Driving Partner @ Uber,Driving Partner,Uber,San Francisco,,"Uber is changing the way the world moves. From the tap of a button, Uber connects riders with drivers to make reliab..."
10769,153198,Driving Partner @ Uber,Driving Partner,Uber,Los Angeles,,"Uber is changing the way the world moves. From the tap of a button, Uber connects riders with drivers to make reliab..."
10770,153199,Driving Partner @ Uber,Driving Partner,Uber,Chicago,,"Uber is changing the way the world moves. From the tap of a button, Uber connects riders with drivers to make reliab..."
10771,153200,Driving Partner @ Uber,Driving Partner,Uber,Boston,,"Uber is changing the way the world moves. From the tap of a button, Uber connects riders with drivers to make reliab..."
10772,153201,Driving Partner @ Uber,Driving Partner,Uber,Ann Arbor,,"Uber is changing the way the world moves. From the tap of a button, Uber connects riders with drivers to make reliab..."
10773,153202,Driving Partner @ Uber,Driving Partner,Uber,Oklahoma,,"Uber is changing the way the world moves. From the tap of a button, Uber connects riders with drivers to make reliab..."
10774,153203,Driving Partner @ Uber,Driving Partner,Uber,Omaha,,"Uber is changing the way the world moves. From the tap of a button, Uber connects riders with drivers to make reliab..."
10775,153204,Driving Partner @ Uber,Driving Partner,Uber,Lincoln,,"Uber is changing the way the world moves. From the tap of a button, Uber connects riders with drivers to make reliab..."
10776,153205,Driving Partner @ Uber,Driving Partner,Uber,Minneapolis,,"Uber is changing the way the world moves. From the tap of a button, Uber connects riders with drivers to make reliab..."
10777,153206,Driving Partner @ Uber,Driving Partner,Uber,St. Paul,,"Uber is changing the way the world moves. From the tap of a button, Uber connects riders with drivers to make reliab..."


In [8]:
# Replace missing values for Employment type with 'flexible'
df_jobs['Employment_Type'] = df_jobs['Employment_Type'].fillna('Flexible')

## **Creating the job corpus**

In [9]:
# Add the corpus column, which concatenates all text attributes into a single string
# (note that str() turns a missing value into the literal token 'nan')
df_jobs['corpus'] = df_jobs.apply(lambda job: ' '.join([str(job[col]) for col in ['Position', 'Company', 'City', 'Employment_Type', 'Job_Description']]), axis=1)
df_jobs.head()

Unnamed: 0,Job_ID,Title,Position,Company,City,Employment_Type,Job_Description,corpus
0,111,Server @ Tacolicious,Server,Tacolicious,Palo Alto,Part-Time,"Tacolicious' first Palo Alto store just opened recently, and we are hiring! If you love tacos, you will love working...","Server Tacolicious Palo Alto Part-Time Tacolicious' first Palo Alto store just opened recently, and we are hiring! I..."
1,113,Kitchen Staff/Chef @ Claude Lane,Kitchen Staff/Chef,Claude Lane,San Francisco,Part-Time,"\r\n\r\nNew French Brasserie in S.F. Financial District Seeks Chef\r\nWe are seeking an energetic, dynamic chef to ...",Kitchen Staff/Chef Claude Lane San Francisco Part-Time \r\n\r\nNew French Brasserie in S.F. Financial District Seek...
2,117,Bartender @ Machka Restaurants Corp.,Bartender,Machka Restaurants Corp.,San Francisco,Part-Time,We are a popular Mediterranean wine bar and restaurant in Financial District.\r\n\r\nWe are looking for an experienc...,Bartender Machka Restaurants Corp. San Francisco Part-Time We are a popular Mediterranean wine bar and restaurant in...
3,121,Server @ Teriyaki House,Server,Teriyaki House,Brisbane,Part-Time,● Serve food/drinks to customers in a professional manner \r\n ● Act as a cashier when needed \r\n ● Clean up the d...,Server Teriyaki House Brisbane Part-Time ● Serve food/drinks to customers in a professional manner \r\n ● Act as a ...
4,127,Kitchen Staff/Chef @ Rosa Mexicano - Sunset,Kitchen Staff/Chef,Rosa Mexicano - Sunset,Los Angeles,Part-Time,"Located at the heart of Hollywood, we are one of the most popular Mexican places in LA! We are currently looking for...","Kitchen Staff/Chef Rosa Mexicano - Sunset Los Angeles Part-Time Located at the heart of Hollywood, we are one of the..."
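
One side effect of joining with `str()` is that missing values (for example the 2271 null `Company` entries) enter the corpus as the word `nan`. The sketch below, on a made-up two-row frame, shows the effect and one possible way to avoid it by blanking missing values before joining. It is an optional variation, not what the notebook does.

```python
import pandas as pd

# Toy frame standing in for df_jobs (values are illustrative)
jobs = pd.DataFrame({
    "Position": ["Server", "Bartender"],
    "Company": ["Tacolicious", float("nan")],
    "City": ["Palo Alto", "San Francisco"],
})
text_cols = ["Position", "Company", "City"]

# str(NaN) yields the literal token 'nan' in the joined text
with_nan = jobs.apply(lambda row: " ".join(str(row[c]) for c in text_cols), axis=1)

# Blanking missing values first keeps that token out of the corpus
without_nan = jobs[text_cols].fillna("").astype(str).agg(" ".join, axis=1)

print(with_nan[1])
print(without_nan[1])
```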


In [10]:
# Create a more compact version of the jobs df
df_jobs = df_jobs[['Job_ID', 'Title', 'corpus']].fillna('')
df_jobs.info()
df_jobs.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 84090 entries, 0 to 84089
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Job_ID  84090 non-null  int64 
 1   Title   84090 non-null  object
 2   corpus  84090 non-null  object
dtypes: int64(1), object(2)
memory usage: 1.9+ MB


Unnamed: 0,Job_ID,Title,corpus
0,111,Server @ Tacolicious,"Server Tacolicious Palo Alto Part-Time Tacolicious' first Palo Alto store just opened recently, and we are hiring! I..."
1,113,Kitchen Staff/Chef @ Claude Lane,Kitchen Staff/Chef Claude Lane San Francisco Part-Time \r\n\r\nNew French Brasserie in S.F. Financial District Seek...
2,117,Bartender @ Machka Restaurants Corp.,Bartender Machka Restaurants Corp. San Francisco Part-Time We are a popular Mediterranean wine bar and restaurant in...
3,121,Server @ Teriyaki House,Server Teriyaki House Brisbane Part-Time ● Serve food/drinks to customers in a professional manner \r\n ● Act as a ...
4,127,Kitchen Staff/Chef @ Rosa Mexicano - Sunset,"Kitchen Staff/Chef Rosa Mexicano - Sunset Los Angeles Part-Time Located at the heart of Hollywood, we are one of the..."


In [11]:
# Download needed functions from NLTK
# nltk.download(['punkt', 'stopwords', 'wordnet', 'averaged_perceptron_tagger'])

### **NLP approach for content filtering**

In [12]:
from nltk.corpus import stopwords
import re
import string
from nltk.stem import WordNetLemmatizer
from nltk import word_tokenize
wn = WordNetLemmatizer()

In [13]:
# Build a set of English stopwords
STOPWORDS = set(stopwords.words('english'))

def is_valid_token(token):
    """
    Validate a token. It must:

    1. Not be a stopword (and, the, to, of...)
    2. Not be punctuation (!, ?, ., ...)
    3. Be longer than two characters
    """
    return token not in STOPWORDS and token not in string.punctuation and len(token) > 2


def clean_text(text):
    # Drop single quotes
    text = re.sub(r"'", '', text)
    # Convert every run of digits or non-word characters into a single space
    text = re.sub(r'(\d|\W)+', ' ', text)
    # Remove the 'nbsp' residue left over from HTML entities
    text = text.replace('nbsp', '')

    # Tokenize the lowercased text, keep only valid tokens and lemmatize each
    # one as a verb. Validate again afterwards, since lemmatizing can turn a
    # token into a stopword or shorten it.
    lemmas = [wn.lemmatize(token, pos='v') for token in word_tokenize(text.lower()) if is_valid_token(token)]
    tokens = [lemma for lemma in lemmas if is_valid_token(lemma)]

    return ' '.join(tokens)
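
To make the regex stage of `clean_text` concrete, here is a small standalone trace using only the standard library. The input sentence is made up, and `str.split` is used as a stand-in for `word_tokenize`; stopword removal and lemmatization are left out because they need the NLTK data.

```python
import re

raw = "Tacolicious' first store opened in 2013&nbsp;-- we're hiring!"

# Step 1: drop single quotes so contractions collapse into one token
step1 = re.sub(r"'", "", raw)

# Step 2: turn every run of digits or non-word characters into one space
step2 = re.sub(r"(\d|\W)+", " ", step1)

# Step 3: remove the 'nbsp' residue left once '&' and ';' are gone
step3 = step2.replace("nbsp", "")

tokens = step3.lower().split()
print(tokens)
# ['tacolicious', 'first', 'store', 'opened', 'in', 'were', 'hiring']
```

In the full function, `in` and `were` would then be dropped as stopwords and `opened` lemmatized to `open`.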


In [14]:
# With functions that tokenize the text, drop invalid tokens and lemmatize
# the rest, apply the cleaning to the job corpus
df_jobs['corpus'] = df_jobs['corpus'].apply(clean_text)
df_jobs.head()

Unnamed: 0,Job_ID,Title,corpus
0,111,Server @ Tacolicious,server tacolicious palo alto part time tacolicious first palo alto store open recently hire love tacos love work res...
1,113,Kitchen Staff/Chef @ Claude Lane,kitchen staff chef claude lane san francisco part time new french brasserie financial district seek chef seek energe...
2,117,Bartender @ Machka Restaurants Corp.,bartender machka restaurants corp san francisco part time popular mediterranean wine bar restaurant financial distri...
3,121,Server @ Teriyaki House,server teriyaki house brisbane part time serve food drink customers professional manner act cashier need clean din s...
4,127,Kitchen Staff/Chef @ Rosa Mexicano - Sunset,kitchen staff chef rosa mexicano sunset los angeles part time locate heart hollywood one popular mexican place curre...


### **TF-IDF (Term Frequency Inverse Document Frequency)**

In [15]:
# Import the TF-IDF vectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

In [16]:
# Initializing tfidf vectorizer
tfidf_vectorizer = TfidfVectorizer()

# Fit on the job corpus and transform it into a sparse TF-IDF matrix
tfidf_jobid = tfidf_vectorizer.fit_transform(df_jobs['corpus'])
tfidf_jobid

<84090x51498 sparse matrix of type '<class 'numpy.float64'>'
	with 8488642 stored elements in Compressed Sparse Row format>
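
Each stored element in that matrix is a TF-IDF weight. As a rough sketch of where such a weight comes from, the snippet below recomputes it by hand in plain Python, using the smoothed IDF and L2 normalization that scikit-learn documents as the `TfidfVectorizer` defaults. The three documents are made up, and tokenization is simplified to `str.split`.

```python
import math
from collections import Counter

docs = [
    "server restaurant part time",
    "bartender restaurant wine bar",
    "driver part time",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each term appears
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

# Smoothed IDF: ln((1 + n) / (1 + df)) + 1
idf = {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}


def tfidf_row(tokens):
    """Return an L2-normalized {term: weight} mapping for one document."""
    counts = Counter(tokens)
    weights = {term: count * idf[term] for term, count in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}


row = tfidf_row(tokenized[0])
print(sorted(row.items(), key=lambda item: -item[1]))
```

In the first document, `server` ends up with the largest weight because it appears in only one document, while `restaurant`, `part` and `time` each appear in two.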

## **Creating the User Corpus**

### **Viewed jobs**

In [17]:
df_job_view = pd.read_csv('archive/Job_Views.csv')
df_job_view.head()

Unnamed: 0,Applicant.ID,Job.ID,Title,Position,Company,City,State.Name,State.Code,Industry,View.Start,View.End,View.Duration,Created.At,Updated.At
0,10000,73666,Cashiers & Valets Needed! @ WallyPark,Cashiers & Valets Needed!,WallyPark,Newark,New Jersey,NJ,,2014-12-12 20:12:35 UTC,2014-12-12 20:31:24 UTC,1129.0,2014-12-12 20:12:35 UTC,2014-12-12 20:12:35 UTC
1,10000,96655,"Macy's Seasonal Retail Fragrance Cashier - Garden City, NY - Roosevelt Field @ Macy's","Macy's Seasonal Retail Fragrance Cashier - Garden City, NY - Roosevelt Field",Macy's,Garden City,New York,NY,,2014-12-12 20:08:50 UTC,2014-12-12 20:10:15 UTC,84.0,2014-12-12 20:08:50 UTC,2014-12-12 20:08:50 UTC
2,10001,84141,Part Time Showroom Sales / Cashier @ Grizzly Industrial Inc.,Part Time Showroom Sales / Cashier,Grizzly Industrial Inc.,Bellingham,Washington,WA,,2014-12-12 20:12:32 UTC,2014-12-12 20:17:18 UTC,286.0,2014-12-12 20:12:32 UTC,2014-12-12 20:12:32 UTC
3,10002,77989,Event Specialist Part Time @ Advantage Sales & Marketing,Event Specialist Part Time,Advantage Sales & Marketing,Simpsonville,South Carolina,SC,,2014-12-12 20:39:23 UTC,2014-12-12 20:42:13 UTC,170.0,2014-12-12 20:39:23 UTC,2014-12-12 20:39:23 UTC
4,10002,69568,Bonefish - Kitchen Staff @ Bonefish Grill,Bonefish - Kitchen Staff,Bonefish Grill,Greenville,South Carolina,SC,,2014-12-12 20:43:25 UTC,2014-12-12 20:43:58 UTC,33.0,2014-12-12 20:43:25 UTC,2014-12-12 20:43:25 UTC


Reduce the number of features for the current dataframe

In [18]:
df_job_view = df_job_view[['Applicant.ID', 'Job.ID', 'Position', 'Company', 'City']]
df_job_view.columns = [col.replace('.', '_') for col in df_job_view.columns]
df_job_view.head()

Unnamed: 0,Applicant_ID,Job_ID,Position,Company,City
0,10000,73666,Cashiers & Valets Needed!,WallyPark,Newark
1,10000,96655,"Macy's Seasonal Retail Fragrance Cashier - Garden City, NY - Roosevelt Field",Macy's,Garden City
2,10001,84141,Part Time Showroom Sales / Cashier,Grizzly Industrial Inc.,Bellingham
3,10002,77989,Event Specialist Part Time,Advantage Sales & Marketing,Simpsonville
4,10002,69568,Bonefish - Kitchen Staff,Bonefish Grill,Greenville


In [19]:
# look for nulls
df_job_view.isnull().sum()

Applicant_ID      0
Job_ID            0
Position          0
Company         580
City              0
dtype: int64

In [20]:
# Transform the features into a single column
df_job_view['corpus'] = df_job_view.apply(lambda job_view: ' '.join([str(job_view[col]) for col in ['Position', 'Company', 'City']]), axis=1) \
                                    .apply(clean_text)

df_job_view = df_job_view[['Applicant_ID', 'corpus']]
df_job_view.info()
df_job_view.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12370 entries, 0 to 12369
Data columns (total 2 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Applicant_ID  12370 non-null  int64 
 1   corpus        12370 non-null  object
dtypes: int64(1), object(1)
memory usage: 193.4+ KB


Unnamed: 0,Applicant_ID,corpus
0,10000,cashier valet need wallypark newark
1,10000,macys seasonal retail fragrance cashier garden city roosevelt field macys garden city
2,10001,part time showroom sales cashier grizzly industrial inc bellingham
3,10002,event specialist part time advantage sales market simpsonville
4,10002,bonefish kitchen staff bonefish grill greenville


In [21]:
# Are there any Applicants with multiple viewed jobs?
df_job_view.groupby('Applicant_ID').count().sort_values('corpus', ascending=False)

Unnamed: 0_level_0,corpus
Applicant_ID,Unnamed: 1_level_1
601,75
6808,61
11475,56
6945,44
9137,43
...,...
12368,1
12364,1
9273,1
9279,1


In [22]:
# Combine all viewed corpus into a single corpus per Applicant
df_job_view = df_job_view.groupby('Applicant_ID')['corpus'].apply(' '.join).reset_index()
df_job_view.info()
df_job_view.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3448 entries, 0 to 3447
Data columns (total 2 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Applicant_ID  3448 non-null   int64 
 1   corpus        3448 non-null   object
dtypes: int64(1), object(1)
memory usage: 54.0+ KB


Unnamed: 0,Applicant_ID,corpus
0,42,movie extras actors model want san francisco part time full time cast san francisco grand open new location entry le...
1,96,kitchen staff izakaya yuzuki san francisco server waraku san francisco server sakae sushi burlingame
2,153,valic financial advisor intern roseville aig corp roseville travel inventory associate wis international citrus heig...
3,601,retail sales consultant retail bay area associate manager tumi inc san francisco toy express seasonal store supervis...
4,1877,sales associate see candy sunnyvale


In [23]:
# look for unique ids
df_job_view['Applicant_ID'].nunique()

3448

### **Experience Dataset**

In [24]:
df_experience = pd.read_csv('archive/Experience.csv')
df_experience.info()
df_experience.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8653 entries, 0 to 8652
Data columns (total 13 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Applicant.ID          8653 non-null   int64  
 1   Position.Name         7655 non-null   object 
 2   Employer.Name         8568 non-null   object 
 3   City                  4891 non-null   object 
 4   State.Name            4595 non-null   object 
 5   State.Code            4595 non-null   object 
 6   Start.Date            6618 non-null   object 
 7   End.Date              4906 non-null   object 
 8   Job.Description       5646 non-null   object 
 9   Salary                2798 non-null   float64
 10  Can.Contact.Employer  3581 non-null   object 
 11  Created.At            8653 non-null   object 
 12  Updated.At            8653 non-null   object 
dtypes: float64(1), int64(1), object(11)
memory usage: 878.9+ KB


Unnamed: 0,Applicant.ID,Position.Name,Employer.Name,City,State.Name,State.Code,Start.Date,End.Date,Job.Description,Salary,Can.Contact.Employer,Created.At,Updated.At
0,10001,Account Manager / Sales Administration / Quality Assurance,Barcode Resourcing,Bellingham,Washington,WA,2012-10-15,,,,,2014-12-12 20:10:02 UTC,2014-12-12 20:10:02 UTC
1,10001,Electronics Technician / Item Master Controller,Ryzex Group,Bellingham,Washington,WA,2001-12-01,2012-04-01,,,,2014-12-12 20:10:02 UTC,2014-12-12 20:10:02 UTC
2,10001,Machine Operator,comptec inc,Custer,Washington,WA,1997-01-01,1999-01-01,,,,2014-12-12 20:10:02 UTC,2014-12-12 20:10:02 UTC
3,10003,maintenance technician,Winn residental,washington,District of Columbia,DC,,,"Necessary maintenance for ""Make Ready"" Plumbing, electrical, HvAc",10.0,False,2014-12-12 21:27:05 UTC,2014-12-12 21:27:05 UTC
4,10003,Electrical Helper,michael and son services,alexandria,Virginia,VA,,,repair and services of electrical construction,,False,2014-12-12 21:27:05 UTC,2014-12-12 21:27:05 UTC


In [25]:
# Take columns of interest
df_experience = df_experience[['Applicant.ID', 'Position.Name', 'Employer.Name', 'City', 'Job.Description']].fillna('')
df_experience.columns = [col.replace('.', '_') for col in df_experience.columns]
df_experience.info()
df_experience.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8653 entries, 0 to 8652
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Applicant_ID     8653 non-null   int64 
 1   Position_Name    8653 non-null   object
 2   Employer_Name    8653 non-null   object
 3   City             8653 non-null   object
 4   Job_Description  8653 non-null   object
dtypes: int64(1), object(4)
memory usage: 338.1+ KB


Unnamed: 0,Applicant_ID,Position_Name,Employer_Name,City,Job_Description
0,10001,Account Manager / Sales Administration / Quality Assurance,Barcode Resourcing,Bellingham,
1,10001,Electronics Technician / Item Master Controller,Ryzex Group,Bellingham,
2,10001,Machine Operator,comptec inc,Custer,
3,10003,maintenance technician,Winn residental,washington,"Necessary maintenance for ""Make Ready"" Plumbing, electrical, HvAc"
4,10003,Electrical Helper,michael and son services,alexandria,repair and services of electrical construction


In [26]:
# Get the corpus of the job experience
df_experience['corpus'] = df_experience.apply(lambda job_experience: ' '.join([str(job_experience[col]) for col in ['Position_Name', 'Employer_Name', 'City', 'Job_Description']]), axis=1) \
                                        .apply(clean_text)

df_experience = df_experience[['Applicant_ID', 'corpus']]
df_experience.info()
df_experience.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8653 entries, 0 to 8652
Data columns (total 2 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Applicant_ID  8653 non-null   int64 
 1   corpus        8653 non-null   object
dtypes: int64(1), object(1)
memory usage: 135.3+ KB


Unnamed: 0,Applicant_ID,corpus
0,10001,account manager sales administration quality assurance barcode resourcing bellingham
1,10001,electronics technician item master controller ryzex group bellingham
2,10001,machine operator comptec inc custer
3,10003,maintenance technician winn residental washington necessary maintenance make ready plumb electrical hvac
4,10003,electrical helper michael son service alexandria repair service electrical construction


In [27]:
# Are there any applicants with multiple experience records? Ideally, yes
df_experience.groupby('Applicant_ID').count().sort_values('corpus', ascending=False)

Unnamed: 0_level_0,corpus
Applicant_ID,Unnamed: 1_level_1
7606,24
5241,22
14149,18
14347,18
8798,17
...,...
2792,1
7699,1
7685,1
4890,1


In [28]:
# Aggregate all the available info into a single record
df_experience = df_experience.groupby('Applicant_ID')['corpus'].apply(' '.join).reset_index()
df_experience.info()
df_experience.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3790 entries, 0 to 3789
Data columns (total 2 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Applicant_ID  3790 non-null   int64 
 1   corpus        3790 non-null   object
dtypes: int64(1), object(1)
memory usage: 59.3+ KB


Unnamed: 0,Applicant_ID,corpus
0,2,writer uloop blog cecilia abate san francisco write article uloop blog site mostly read students school within unite...
1,3,prep cook moscone center san francisco server aloha beach resort shenzhen china market intern honda guangzhou china
2,6,project assistant iom nairobi kenya
3,8,deli clerk server cashier food prep order taker safeway grocery inc lodi
4,11,cashier cristina green daly city


In [29]:
# Get unique Applicant_ID counts
df_experience['Applicant_ID'].nunique()

3790

### **Positions of Interest**

In [30]:
df_poi = pd.read_csv('archive/Positions_Of_Interest.csv')
df_poi.columns = [col.replace('.', '_') for col in df_poi.columns]
df_poi.info()
df_poi.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6560 entries, 0 to 6559
Data columns (total 4 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   Applicant_ID          6560 non-null   int64 
 1   Position_Of_Interest  6558 non-null   object
 2   Created_At            6560 non-null   object
 3   Updated_At            6560 non-null   object
dtypes: int64(1), object(3)
memory usage: 205.1+ KB


Unnamed: 0,Applicant_ID,Position_Of_Interest,Created_At,Updated_At
0,10003,security officer,2014-12-12 21:20:54 UTC,2014-12-12 21:20:54 UTC
1,10007,Server,2014-08-14 15:56:42 UTC,2015-02-26 20:35:12 UTC
2,10007,Bartender,2014-08-14 15:56:44 UTC,2015-02-19 23:21:28 UTC
3,10008,Host,2014-08-14 15:56:42 UTC,2015-02-26 20:35:12 UTC
4,10008,Barista,2014-08-14 15:56:43 UTC,2015-02-18 02:35:06 UTC


In [31]:
# Drop unnecessary columns and clean the corpus text
df_poi = df_poi[['Applicant_ID', 'Position_Of_Interest']].rename(columns={'Position_Of_Interest':'corpus'})
df_poi['corpus'] = df_poi['corpus'].map(str).apply(clean_text)
df_poi.info()
df_poi.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6560 entries, 0 to 6559
Data columns (total 2 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Applicant_ID  6560 non-null   int64 
 1   corpus        6560 non-null   object
dtypes: int64(1), object(1)
memory usage: 102.6+ KB


Unnamed: 0,Applicant_ID,corpus
0,10003,security officer
1,10007,server
2,10007,bartender
3,10008,host
4,10008,barista


In [32]:
# Are there any Applicants with multiple POIs?
df_poi.groupby('Applicant_ID').count().sort_values('corpus', ascending=False)

Unnamed: 0_level_0,corpus
Applicant_ID,Unnamed: 1_level_1
2254,13
3719,13
7528,13
7101,13
7902,11
...,...
5873,1
5931,1
5989,1
5991,1


In [33]:
# Then combine records into a single record per Applicant
df_poi = df_poi.groupby('Applicant_ID')['corpus'].apply(' '.join).reset_index()
df_poi.info()
df_poi.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2068 entries, 0 to 2067
Data columns (total 2 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Applicant_ID  2068 non-null   int64 
 1   corpus        2068 non-null   object
dtypes: int64(1), object(1)
memory usage: 32.4+ KB


Unnamed: 0,Applicant_ID,corpus
0,96,server
1,153,server host barista customer service rep sales rep
2,256,server host receptionist book keeper customer service rep sales rep production area
3,438,server host barista customer service rep
4,568,receptionist book keeper customer service rep


### **Create the final user dataset**

In [34]:
# First join viewed jobs with experience
jobs_view_exp = df_job_view.merge(df_experience, on='Applicant_ID', how='outer', suffixes=['_view', '_experience']).fillna('')
jobs_view_exp.info()
jobs_view_exp.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6461 entries, 0 to 6460
Data columns (total 3 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   Applicant_ID       6461 non-null   int64 
 1   corpus_view        6461 non-null   object
 2   corpus_experience  6461 non-null   object
dtypes: int64(1), object(2)
memory usage: 151.6+ KB


Unnamed: 0,Applicant_ID,corpus_view,corpus_experience
0,42,movie extras actors model want san francisco part time full time cast san francisco grand open new location entry le...,street marketer media nation fairfield courtesy clerk safeway fairfield
1,96,kitchen staff izakaya yuzuki san francisco server waraku san francisco server sakae sushi burlingame,cashier honey berry san francisco greet people introduce recommend food items menu take order pos phone take clean t...
2,153,valic financial advisor intern roseville aig corp roseville travel inventory associate wis international citrus heig...,photographer brand affinity technologies santa clara capture memories experience families group individuals provide ...
3,601,retail sales consultant retail bay area associate manager tumi inc san francisco toy express seasonal store supervis...,
4,1877,sales associate see candy sunnyvale,registration coordinator wellness corporate solutions bethesda greet people help fill paperwork front desk extern in...
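
The outer join keeps applicants that appear in either table: 3448 viewers plus 3790 applicants with experience give 6461 rows, so 3448 + 3790 - 6461 = 777 applicants appear in both. A small sketch of how that overlap could be inspected, using made-up frames and the `indicator` option of `DataFrame.merge`:

```python
import pandas as pd

# Toy frames standing in for df_job_view and df_experience
views = pd.DataFrame({"Applicant_ID": [42, 96], "corpus": ["movie extras", "kitchen staff"]})
experience = pd.DataFrame({"Applicant_ID": [96, 2], "corpus": ["cashier", "writer blog"]})

merged = views.merge(
    experience,
    on="Applicant_ID",
    how="outer",
    suffixes=("_view", "_experience"),
    indicator=True,  # adds a '_merge' column: left_only, right_only or both
)

# Fill only the text columns; '_merge' is categorical and rejects ''
text_cols = ["corpus_view", "corpus_experience"]
merged[text_cols] = merged[text_cols].fillna("")

print(merged)
print(merged["_merge"].value_counts())
```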


Now merge in the positions-of-interest dataframe

In [35]:
jobs_view_exp_poi = jobs_view_exp.merge(df_poi, on='Applicant_ID', how='outer').fillna('') \
                                .rename(columns={'corpus':'corpus_poi'})
jobs_view_exp_poi.info()
jobs_view_exp_poi.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7037 entries, 0 to 7036
Data columns (total 4 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   Applicant_ID       7037 non-null   int64 
 1   corpus_view        7037 non-null   object
 2   corpus_experience  7037 non-null   object
 3   corpus_poi         7037 non-null   object
dtypes: int64(1), object(3)
memory usage: 220.0+ KB


Unnamed: 0,Applicant_ID,corpus_view,corpus_experience,corpus_poi
0,42,movie extras actors model want san francisco part time full time cast san francisco grand open new location entry le...,street marketer media nation fairfield courtesy clerk safeway fairfield,
1,96,kitchen staff izakaya yuzuki san francisco server waraku san francisco server sakae sushi burlingame,cashier honey berry san francisco greet people introduce recommend food items menu take order pos phone take clean t...,server
2,153,valic financial advisor intern roseville aig corp roseville travel inventory associate wis international citrus heig...,photographer brand affinity technologies santa clara capture memories experience families group individuals provide ...,server host barista customer service rep sales rep
3,601,retail sales consultant retail bay area associate manager tumi inc san francisco toy express seasonal store supervis...,,server line cook customer service rep
4,1877,sales associate see candy sunnyvale,registration coordinator wellness corporate solutions bethesda greet people help fill paperwork front desk extern in...,receptionist
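
The row count grows from 6,461 to 7,037 after this merge, which is what an outer merge does when the right-hand frame contains applicants that are absent from the left-hand one. The following is a minimal sketch on made-up data (the IDs and texts are illustrative, not taken from the dataset) that uses the `indicator` option of `merge` to show where each row came from before the missing text is filled with empty strings.

```python
import pandas as pd

# Toy frames; the IDs and texts are made up for illustration only
left = pd.DataFrame({'Applicant_ID': [1, 2], 'corpus_view': ['server', 'cashier']})
right = pd.DataFrame({'Applicant_ID': [2, 3], 'corpus': ['barista', 'host']})

merged = (
    left.merge(right, on='Applicant_ID', how='outer', indicator=True)
        .rename(columns={'corpus': 'corpus_poi'})
)

# indicator=True adds a `_merge` column: 'left_only', 'right_only' or 'both'
origin_counts = merged['_merge'].value_counts().to_dict()

# Drop the indicator before filling, then replace missing text with ''
merged = merged.drop(columns='_merge').fillna('')

print(origin_counts)
print(merged)
```

Applicant 3 exists only in the right-hand frame, so it is kept with an empty `corpus_view`, which mirrors the empty cells visible in the table above.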


In [36]:
# Count the unique applicants (should match the number of rows)
jobs_view_exp_poi['Applicant_ID'].nunique()

7037

Combine all text features into a single `corpus` column.

In [37]:
applicants = jobs_view_exp_poi.copy()
applicants['corpus'] = applicants.apply(lambda employee: ' '.join([employee[col] for col in ['corpus_view', 'corpus_experience', 'corpus_poi']]), axis=1)
applicants = applicants[['Applicant_ID', 'corpus']].sort_values('Applicant_ID')

applicants.info()
applicants.head()

<class 'pandas.core.frame.DataFrame'>
Index: 7037 entries, 3448 to 6460
Data columns (total 2 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Applicant_ID  7037 non-null   int64 
 1   corpus        7037 non-null   object
dtypes: int64(1), object(1)
memory usage: 164.9+ KB


Unnamed: 0,Applicant_ID,corpus
3448,2,writer uloop blog cecilia abate san francisco write article uloop blog site mostly read students school within unit...
3449,3,prep cook moscone center san francisco server aloha beach resort shenzhen china market intern honda guangzhou china
3450,6,project assistant iom nairobi kenya
3451,8,deli clerk server cashier food prep order taker safeway grocery inc lodi
3452,11,cashier cristina green daly city
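
Joining the three columns with `' '.join(...)` leaves a leading or doubled space whenever one of them is empty. With a whitespace-insensitive tokenizer this should be harmless, but a variant that skips empty pieces keeps the text tidy. A minimal sketch on made-up rows (the values are illustrative):

```python
import pandas as pd

# Toy applicants; some text columns are intentionally empty
toy = pd.DataFrame({
    'Applicant_ID': [1, 2],
    'corpus_view': ['', 'server waraku'],
    'corpus_experience': ['pool cleaner', ''],
    'corpus_poi': ['dishwasher', 'server'],
})
text_cols = ['corpus_view', 'corpus_experience', 'corpus_poi']

# Join only the non-empty pieces so no leading or doubled spaces appear
toy['corpus'] = toy[text_cols].apply(
    lambda row: ' '.join(part for part in row if part), axis=1
)

print(toy['corpus'].tolist())
# ['pool cleaner dishwasher', 'server waraku server']
```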


In [38]:
# Reset the index so that it runs from 0 upwards, following the Applicant_ID order
applicants = applicants.reset_index(drop=True)
applicants.head()

Unnamed: 0,Applicant_ID,corpus
0,2,writer uloop blog cecilia abate san francisco write article uloop blog site mostly read students school within unit...
1,3,prep cook moscone center san francisco server aloha beach resort shenzhen china market intern honda guangzhou china
2,6,project assistant iom nairobi kenya
3,8,deli clerk server cashier food prep order taker safeway grocery inc lodi
4,11,cashier cristina green daly city


In [39]:
applicants.shape

(7037, 2)

## **Recommender System**

Compute the **cosine similarity using TFIDF**

In [40]:
from sklearn.metrics.pairwise import cosine_similarity

In [41]:
tfidf_applicants = tfidf_vectorizer.transform(applicants['corpus'])
cosine_similarity_tfidf = map(lambda job: cosine_similarity(tfidf_applicants, job), tfidf_jobid)
# `scores` is a list with one entry per available job (84,090); each entry holds
# the cosine similarity of that job against every potential employee (7,037)
scores = list(cosine_similarity_tfidf)
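
To make the idea concrete, here is a small self-contained sketch of the same technique on made-up texts. It assumes a default `TfidfVectorizer` (the notebook's `tfidf_vectorizer` is configured earlier and may differ) and computes all similarities in a single `cosine_similarity` call, which returns a matrix with one row per applicant and one column per job. The notebook instead stores one array per job, so this is the transposed layout.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Made-up corpora for illustration only
jobs = ['java developer backend', 'server restaurant cashier', 'pool attendant seasonal']
people = ['cashier server pizza', 'java software engineer']

# Fit on the job texts, then reuse the same vocabulary for the applicants
vectorizer = TfidfVectorizer()
tfidf_jobs = vectorizer.fit_transform(jobs)
tfidf_people = vectorizer.transform(people)

# Rows are applicants, columns are jobs
similarity = cosine_similarity(tfidf_people, tfidf_jobs)
print(similarity.shape)  # (2, 3)

# Index of the most similar job for each applicant
best_job = similarity.argmax(axis=1)
print(best_job)  # [1 0]
```

Words that were never seen in the job texts (such as "pizza" here) are ignored at transform time, so only shared vocabulary contributes to the score. At the scale of this dataset a dense 7,037 x 84,090 matrix takes several gigabytes, so memory is worth checking before adopting the single-call form.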

In [42]:
def get_topRecommendations(applicant_id=326, n_top=10, df_jobs=df_jobs, applicants=applicants, scores=scores):
    """
    Given an applicant_id, return the n_top most suitable jobs, ranked by the
    cosine similarity between the TF-IDF vectors of the job and applicant corpora.
    """
    # Positional index of the applicant; it matches the row order used to build `scores`
    applicant_ix = np.where(applicants['Applicant_ID'] == applicant_id)[0][0]

    # For every job, take this applicant's score, then keep the n_top
    # highest (job_ix, score) pairs
    applicant_scores = sorted(((job_ix, job[applicant_ix][0]) for job_ix, job in enumerate(scores)),
                              key=lambda job: job[1], reverse=True)[:n_top]

    # Split the pairs into job indices and their scores
    jobs_ix = [job_ix for job_ix, _ in applicant_scores]
    jobs_scores = [score for _, score in applicant_scores]

    # Combine and return applicant-job information; copy first so the new
    # columns are not written to a slice of df_jobs
    topRecommendations = df_jobs.iloc[jobs_ix].copy()
    topRecommendations['score'] = jobs_scores
    topRecommendations['Applicant_ID'] = applicant_id
    topRecommendations = topRecommendations[['Applicant_ID', 'Job_ID', 'Title', 'score']]

    return topRecommendations
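
Sorting every job score works, but only the top few are needed. As an alternative sketch (not what the notebook does), NumPy's `argpartition` can pick the `n_top` largest entries first and sort only those. The helper name `top_n_indices` and the toy scores below are hypothetical.

```python
import numpy as np

def top_n_indices(row_scores, n_top=3):
    """Return the indices of the n_top largest scores, best first (n_top >= 1)."""
    row_scores = np.asarray(row_scores)
    n_top = min(n_top, row_scores.size)
    # argpartition places the n_top largest values in the last n_top slots,
    # in no particular order, without sorting the whole array
    candidates = np.argpartition(row_scores, -n_top)[-n_top:]
    # Sort only those candidates by descending score
    return candidates[np.argsort(row_scores[candidates])[::-1]]

# Made-up similarity scores of one applicant against five jobs
toy_scores = np.array([0.10, 0.58, 0.05, 0.31, 0.22])
print(top_n_indices(toy_scores))  # [1 3 4]
```

The returned positions can be passed to `df_jobs.iloc[...]` in the same way as `jobs_ix` above. When scores are tied, the relative order of the tied jobs is not guaranteed.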

In [43]:
get_topRecommendations()

Unnamed: 0,Applicant_ID,Job_ID,Title,score
77448,326,311213,"Application UI Developer - Contract to Hire @ iTech Solutions, Inc.",0.283291
13443,326,227065,iOS Developer @ Unied Software Group,0.267393
3231,326,141831,"Lead Java/J2EE Developer - Contract to Hire @ iTech Solutions, Inc.",0.245702
76180,326,309945,"Java Software Engineer @ iTech Solutions, Inc.",0.243742
40634,326,270171,"Senior Java Developer - Contract to Hire - Great Salary @ iTech Solutions, Inc.",0.238764
6988,326,146511,Medical Charge Entry Specialist @ Accountemps,0.234826
69346,326,303112,Java Developer @ TransHire,0.230941
40385,326,269922,Entry Level Java Developer / Jr. Java Developer - Contract to Hire @ iTech Solutions,0.227022
10301,326,150882,Java Consultant - Mobile Apps Development @ Consultis,0.210647
63958,326,294684,Java Developer @ Kavaliro,0.209494


In [44]:
# Select 5 random applicants (fixed seed for reproducibility)
np.random.seed(11)
random_applicants = np.random.choice(applicants['Applicant_ID'], 5, replace=True)
random_applicants

array([ 4483,  9318, 11685,  9729,  3110], dtype=int64)
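
The five IDs drawn here happen to be distinct, but with `replace=True` the same applicant could be drawn more than once. If distinct applicants are required, sampling without replacement guarantees it. A minimal sketch, assuming NumPy's `Generator` API and a short illustrative list of IDs:

```python
import numpy as np

# A short list of applicant IDs for illustration
ids = np.array([2, 3, 6, 8, 11, 42, 96])

# Seeded generator so the draw is reproducible
rng = np.random.default_rng(11)

# replace=False means no ID can be selected twice
sample = rng.choice(ids, size=5, replace=False)
print(sample)
```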

In [45]:
for applicant_id in random_applicants:
    print(f"PRINTING Applicant_ID = {applicant_id}'s first 100 corpus words:\n")
    # Keep the first 100 space-separated tokens and wrap them at 50 characters
    corpus = ' '.join(applicants.query('Applicant_ID == @applicant_id').iloc[0]['corpus'].split(' ')[:100])
    corpus = '\n'.join(textwrap.wrap(corpus, 50))
    print(corpus)

    display(get_topRecommendations(applicant_id))
    print('\n\n')

PRINTING Applicant_ID = 4483's first 100 corpus words:

 cashier pizza hut pleasanton cashier server non
alcoholic beverages food prep host server host
receptionist cashier


Unnamed: 0,Applicant_ID,Job_ID,Title,score
84081,4483,734,Server @ Pizza Antica,0.363659
84040,4483,506,Server @ Faz Pleasanton,0.32093
72898,4483,306666,Cashiers @ Helpmates Staffing Services.,0.30561
27514,4483,257044,Marketing Assistant/ Server/ Host - To $11/hr - Part Time and Evening Hours - Unique Market Research Firm @ Select S...,0.302519
21470,4483,251,Server @ Zach's Cafe,0.284735
84005,4483,397,Server @ Ringer Hut,0.283917
48326,4483,277870,Food Service @ Six Flags,0.283473
6991,4483,146514,Billing Clerk @ Accountemps,0.281822
84089,4483,92,Cashier @ Kazoo Restaurant,0.272934
40840,4483,270378,Cashier @ Northern Virginia Community College,0.264937





PRINTING Applicant_ID = 9318's first 100 corpus words:

 driver diamond transportation springfield provide
door door transportation elderly persons
disabilities collect fare check accuracy ats
caregiver individual development inc washington
provide daily care elderly persons disabilities
cook clean feed bath dress cashier stock clerk
maxway temple hill provide customer service
operate cash register complete transactions
accuracy stock merchandise keep clean neat work
space cashier safeway washington provide customer
service operate cash register complete
transactions accuracy maintain clean neat work
space


Unnamed: 0,Applicant_ID,Job_ID,Title,score
20244,9318,249772,Cashier - Starbucks at Molly Pitcher Travel Plaza @ HMSHost - USA,0.227087
33342,9318,262878,Cashier/Stock Clerk @ Lamps Plus,0.212054
50492,9318,280035,Cashier/Stock Clerk @ Lamps Plus,0.212054
72865,9318,306633,Store Clerk (Seasonal - Full Time) @ Castaways RV Resort,0.205718
33813,9318,263349,Store Clerk (Seasonal - Full Time) @ Jellystone of Western New York,0.205006
58335,9318,287873,Store Clerk (Seasonal - Full Time) @ Jellystone of Western New York,0.205006
58331,9318,287869,Store Clerk (Seasonal - Part Time) @ Wagon Wheel RV Resort,0.202474
39905,9318,269442,Store Clerk (Seasonal - Full Time) @ Wagon Wheel RV Resort,0.202421
48517,9318,278061,Cashier I- 10-7PM @ Sodexo- Kennesaw State University,0.201931
23893,9318,253422,Lead Cashier @ Nebraska Furniture Mart,0.198548





PRINTING Applicant_ID = 11685's first 100 corpus words:

exam proctor need university downtown beacon hill
staff group llc boston part time instructors
massachusetts national safety council boston part
time instructors massachusetts national safety
council boston


Unnamed: 0,Applicant_ID,Job_ID,Title,score
52647,11685,282189,"Part Time Positions-Administrative Opportunities-Boston Area @ Beacon Hill Staffing Group, LLC",0.324753
27666,11685,257196,LPN,0.294311
79205,11685,314400,Receptionist Needed for Non-Profit in Downtown Boston! @ OfficeTeam,0.258585
45605,11685,275148,File Clerk @ OfficeTeam,0.249791
55514,11685,285054,Front Desk Coordinator @ OfficeTeam,0.240997
69945,11685,303711,Senior Sustainability Coordinator (Departmental Assistant) Auxiliary Enterprises @ University of Massachusetts Amherst,0.240496
62566,11685,293224,Driving Instructors @ Nationwide truck driver training group,0.239273
78549,11685,313745,"Software Engineer @ Beacon Hill Staffing Group, LLC",0.231378
55859,11685,285399,"Administrative Assistant @ Beacon Hill Staffing Group, LLC",0.225238
72720,11685,306488,File Clerk @ OfficeTeam,0.222003





PRINTING Applicant_ID = 9729's first 100 corpus words:

seasonal wed sales stylist davids bridal lithonia
seasonal wed sales stylist davids bridal lithonia
paul mitchell school atlanta


Unnamed: 0,Applicant_ID,Job_ID,Title,score
18046,9729,247573,Bridal Stylist Sales Consultant @ Jeanette's Bride 'N Boutique,0.392162
5562,9729,144886,Seasonal CSR @ David's Bridal,0.384417
81152,9729,316345,Stylist Sales Consultant @ Impression Bridal,0.354339
5309,9729,144633,Seasonal Wedding Sales Stylist @ David's Bridal,0.328224
5598,9729,144922,Seasonal Wedding Sales Stylist @ David's Bridal,0.31378
5513,9729,144837,Seasonal Wedding Sales Stylist @ David's Bridal,0.313622
5620,9729,144944,Seasonal Wedding Sales Stylist @ David's Bridal,0.313572
5339,9729,144663,Seasonal Wedding Sales Stylist @ David's Bridal,0.313561
5597,9729,144921,Seasonal Wedding Sales Stylist @ David's Bridal,0.313494
5782,9729,145106,Seasonal Wedding Sales Stylist @ David's Bridal,0.313494





PRINTING Applicant_ID = 3110's first 100 corpus words:

 pool cleaner beacon pool davie clean pool include
clear debris pool water proper chemical balance
mechanical inspections dishwasher customer service
rep


Unnamed: 0,Applicant_ID,Job_ID,Title,score
65814,3110,296539,Pool Attendant (Seasonal - Part Time) @ Holiday,0.579091
33825,3110,263361,Pool Attendant (Seasonal - Part Time) @ West Glen,0.578873
77569,3110,311334,Pool Attendant (Seasonal - Part Time) @ East Village Estates,0.577535
26192,3110,255723,Pool Attendant @ Country Hills Village,0.577524
79508,3110,314703,Pool Attendant (Seasonal - Part Time) @ Glen Laurel,0.5774
71085,3110,304853,Pool Attendant (Seasonal - Part Time) @ Lakeview,0.577214
71082,3110,304850,Pool Attendant (Seasonal - Part Time) @ Catalina,0.577134
71088,3110,304856,Pool Attendant (Seasonal - Part Time) @ Sherman Oaks,0.577116
79507,3110,314702,Pool Attendant (Seasonal - Part Time) @ Windham Hills,0.576907
56171,3110,285711,Pool Attendant (Seasonal - Part Time) @ Sycamore Village,0.57688





