## Search for a Job Description

This project had three main motivations. First, we wanted to explore and apply recent models in Natural Language Processing (NLP), such as OpenAI embeddings and MPNet from Hugging Face. Second, we wanted to build an application tailored to job search: given a job title, the user gets a list of the task descriptions closest to that title. Third, we wanted to experiment with two ways of deploying the application: GitHub Pages and a Streamlit app.

The inspiration for this project came from a personal experience. An employer suggested that one of our team members consider coordinator positions, which suited her resume and skills, although she had never considered or searched for such positions before. This sparked the idea of an application that would help not only her but also the thousands of people who are starting their careers and looking for job opportunities.

Furthermore, we were inspired by a lecture by Professor Ran Feldish, where he presented an application that uses OpenAI embedding to match products with product descriptions (https://github.com/ShawnTXin/openai-embeddings). This motivated us to use a similar approach, but with a different objective - matching job titles with job descriptions.

The dataset used in this project was obtained from the O*NET Resource Center (https://www.onetcenter.org/), specifically the Task Statements database, which maps O*NET-SOC codes (occupations) to the tasks associated with each occupation.

In this notebook, we will first explore OpenAI embedding and then experiment with MPNet sentence transformer.

## First Model: OpenAI "text-embedding-ada-002"

For proper attribution: approximately 90% of the implementation of this model was adapted from "https://github.com/ShawnTXin/openai-embeddings". We ran that code with our own OpenAI API key and applied it to a new dataset for our project.

### 1- Import the libraries

In [None]:
# openai.embeddings_utils was removed in openai 1.0, so pin an earlier release
!pip install utils "openai<1" tiktoken
import os
import time
import numpy as np
import pandas as pd

from openai.embeddings_utils import get_embedding, cosine_similarity
import tiktoken

In [None]:
import openai

# Location of the file that holds the OpenAI API key
ENV_PATH = "/content/env"

def get_openai_api_key(env_path=ENV_PATH):
    """Read the OPENAI_API_KEY=... entry from the env file."""
    with open(env_path, 'r') as f:
        for line in f:
            if line.startswith("OPENAI_API_KEY"):
                # split only on the first "=" in case the value contains one
                openai_api_key = line.split("=", 1)[1].strip()
                print(f'Read OpenAI API authentication key from {env_path} file')
                return openai_api_key
    raise ValueError(f'OPENAI_API_KEY not found in {env_path}')

def openai_auth(openai_api_key=None):
    """Authenticate with OpenAI; the key is read when called, not at definition time."""
    if openai_api_key is None:
        openai_api_key = get_openai_api_key()
    openai.api_key = openai_api_key
    print('Set OpenAI authentication key')

### 2- Defining the Dataset and Model

In this section, we define the input file containing the tasks to embed, the output path where the embeddings are saved, and the models used for encoding and embedding.

For the input data, we specify the location of the Excel file containing the tasks and the name of the column that holds the task description. Embeddings are generated from this column.

The output data parameter defines where the generated embeddings are saved.

For the embedding model, we specify the name of the pre-trained model and the encoding (tokenizer) that goes with it. We also set an upper limit on the number of tokens in each input sequence, following the guidelines in "https://platform.openai.com/docs/guides/embeddings/what-are-embeddings".

In [None]:
# input data parameters
DOCUMENT_PATH = "/content/Task Statements(1).xlsx"
COLUMN_TO_EMBED = "Task"

# output data parameters
OUTPUT_PATH = "/content/drive/MyDrive/TaskEmbeddings_boshra.csv"

# embedding model parameters
EMBEDDING_MODEL = "text-embedding-ada-002"
ENCODING_MODEL = "cl100k_base"  # this is the encoding for text-embedding-ada-002
MAX_TOKENS = 8191  # the maximum for text-embedding-ada-002 is 8191

print(f'Using embedding model: {EMBEDDING_MODEL}')
print(f'Using encoding: {ENCODING_MODEL}')
print(f'Maximum number of tokens: {MAX_TOKENS}')
print('***********************************')
df = pd.read_excel(DOCUMENT_PATH)
print(f'Read {len(df)} documents from {DOCUMENT_PATH}')
#Print samples from the dataset
df.head(10)

In [None]:
#Print the samples from the 'Task' column that will be used to generate embeddings
print(df['Title'].values[0],"Tasks")
print(df['Task'].values[:3])
df[['Task']].head()

Chief Executives Tasks
["Direct or coordinate an organization's financial or budget activities to fund operations, maximize investments, or increase efficiency."
 'Appoint department heads or managers and assign or delegate responsibilities to them.'
 'Analyze operations to evaluate performance of a company or its staff in meeting objectives or to determine areas of potential cost reduction, program improvement, or policy change.']


Unnamed: 0,Task
0,Direct or coordinate an organization's financi...
1,Appoint department heads or managers and assig...
2,Analyze operations to evaluate performance of ...
3,"Direct, plan, or implement policies, objective..."
4,"Prepare budgets for approval, including those ..."


### 3- Generate Encodings
In this section, we apply the encoding (tokenizer) associated with the embedding model "text-embedding-ada-002". Tokenizing first serves two purposes. First, it lets us confirm that each input stays within the maximum number of tokens the embedding model accepts. Second, it gives an estimate of the cost of using the embedding model, since usage is priced per input token.

We added the encoding of the "Task" column to the original data and checked that no encoding exceeds the model's token limit. This matters because any row over the limit is dropped before embedding, so finding none means no data is lost in the subsequent steps.
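Since usage is priced per input token, the token counts can be turned into a rough cost estimate. The following is a minimal sketch, not part of the original notebook: the function name `estimate_embedding_cost`, the three token counts, and the price are all hypothetical placeholders, and the real rate should be taken from OpenAI's pricing page.

```python
def estimate_embedding_cost(token_counts, price_per_1k_tokens):
    """Return (total_tokens, estimated_cost) for a list of per-document token counts."""
    total_tokens = sum(token_counts)
    return total_tokens, total_tokens / 1000 * price_per_1k_tokens


# Hypothetical numbers for illustration only: three documents and a
# placeholder price. This is an assumption, not an official figure.
PRICE_PER_1K_TOKENS = 0.0001
total, cost = estimate_embedding_cost([25, 18, 40], PRICE_PER_1K_TOKENS)
print(f"{total} tokens, estimated cost ${cost:.6f}")
```

In the notebook, the per-row counts stored in the `n_tokens` column (created in the cell that filters long encodings) could be passed in as `df["n_tokens"].tolist()`.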

In [None]:
# tokenize: generate encodings for 'column_to_embed'
encoding = tiktoken.get_encoding(ENCODING_MODEL)
df["encoding"] = df[COLUMN_TO_EMBED].apply(encoding.encode)
print(f'Encoded the {COLUMN_TO_EMBED} column into an encoding column')
df.head()

Encoded the Task column into an encoding column


Unnamed: 0,O*NET-SOC Code,Title,Task ID,Task,Task Type,Incumbents Responding,Date,Domain Source,encoding
0,11-1011.00,Chief Executives,8823,Direct or coordinate an organization's financi...,Core,87.0,07/2014,Incumbent,"[16411, 477, 16580, 459, 7471, 596, 6020, 477,..."
1,11-1011.00,Chief Executives,8831,Appoint department heads or managers and assig...,Core,87.0,07/2014,Incumbent,"[2213, 787, 9476, 14971, 477, 20258, 323, 9993..."
2,11-1011.00,Chief Executives,8825,Analyze operations to evaluate performance of ...,Core,87.0,07/2014,Incumbent,"[2127, 56956, 7677, 311, 15806, 5178, 315, 264..."
3,11-1011.00,Chief Executives,8826,"Direct, plan, or implement policies, objective...",Core,87.0,07/2014,Incumbent,"[16411, 11, 3197, 11, 477, 4305, 10396, 11, 26..."
4,11-1011.00,Chief Executives,8827,"Prepare budgets for approval, including those ...",Core,87.0,07/2014,Incumbent,"[51690, 42484, 369, 14765, 11, 2737, 1884, 369..."


In [None]:
# omit encodings that are too long to embed
df["n_tokens"] = df.encoding.apply(len)
n_long_encodings = len(df[df.n_tokens > MAX_TOKENS])
df = df[df.n_tokens <= MAX_TOKENS]
print(f'Omitted {n_long_encodings} encodings that were too long to embed')

Omitted 0 encodings that were too long to embed


### 4- Generate Embeddings

In this section, we focus on the core building block of the project: the embedding model. OpenAI recommends "text-embedding-ada-002" for nearly all use cases, describing it as better, cheaper, and simpler to use, so we adopted it. We did run into long computation times, which could be attributed to the size of our dataset (19265 sentences) and the computing power of our device. To check this, we tested the model on a smaller sample of 1000 sentences, and it was much faster.
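Another likely contributor to the runtime is that the embedding cell sends one API request per sentence. The OpenAI embeddings endpoint accepts a list of inputs in a single request, so grouping sentences into batches may reduce the number of round trips. The sketch below only illustrates the batching logic and was not used in this notebook: `embed_in_batches`, the `embed_batch` callable, and the batch size are hypothetical, and `fake_embed_batch` stands in for a real API call so the example runs offline.

```python
from typing import Callable, List, Sequence


def embed_in_batches(
    texts: Sequence[str],
    embed_batch: Callable[[List[str]], List[List[float]]],
    batch_size: int = 100,
) -> List[List[float]]:
    """Embed `texts` by passing them to `embed_batch` in chunks of `batch_size`."""
    embeddings: List[List[float]] = []
    for start in range(0, len(texts), batch_size):
        chunk = list(texts[start:start + batch_size])
        embeddings.extend(embed_batch(chunk))
    return embeddings


# Stand-in for a real API call (assumption): returns a one-dimensional
# "embedding" that just holds the length of each text.
def fake_embed_batch(chunk: List[str]) -> List[List[float]]:
    return [[float(len(text))] for text in chunk]


vectors = embed_in_batches(["a", "bb", "ccc"], fake_embed_batch, batch_size=2)
print(vectors)  # [[1.0], [2.0], [3.0]]
```

The results come back in the same order as the input texts, so they could be assigned directly to a new DataFrame column.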

Note that using the OpenAI embedding model requires an OpenAI API key, which is validated when requests are made. Once our key was set, we were able to call the embedding model without errors.

We also want to emphasize that we did not pay to use the embedding model. We obtained the API key from the official website (https://platform.openai.com/account/api-keys), which requires creating an account.

In [None]:
openai_auth()

# embed: generate embeddings for 'column_to_embed', by calling the OpenAI API
print(f'Embedded {COLUMN_TO_EMBED} into an embedding column, by calling the OpenAI API')
print(f'This will take some time for {len(df)} documents...')
df["embedding"] = df[COLUMN_TO_EMBED].apply(lambda x: get_embedding(x, engine=EMBEDDING_MODEL))

Set OpenAI authentication key
Embedded Task into an embedding column, by calling the OpenAI API
This may take about a minute for 19265 documents...


In [None]:
# save embeddings to output_path
df.to_csv(OUTPUT_PATH, index=False)
print(f'Saved embeddings to {OUTPUT_PATH}')

Saved embeddings to data/TaskEmbeddingss.csv


In [None]:
file_path = "/content/drive/MyDrive/TaskEmbeddings_boshra.csv"  # the CSV file with the task embeddings saved in the previous step
COLUMN_EMBEDDINGS = "embedding"  # the embedding column name in the documents embedding file.

In [None]:
# read documents embeddings
import ast

df = pd.read_csv(file_path)
# each embedding is stored as a string; parse it safely (no eval) and convert to np array
df[COLUMN_EMBEDDINGS] = df[COLUMN_EMBEDDINGS].apply(ast.literal_eval).apply(np.array)
print(f'Read {len(df)} documents embeddings from {file_path}')



Read 19265 documents embeddings from /content/drive/MyDrive/TaskEmbeddings_boshra.csv


### 5- Testing the model

#### First example: Nurse




In [None]:
openai_auth()
query=""
print('***********************************')
query = "Nurse" 
tic = time.time()
query_embedding = get_embedding(query, engine=EMBEDDING_MODEL)
toc = time.time()
print(f'Embedding {query} took {round((toc-tic)*1000)}ms')  # convert to ms before rounding
print(f'Top matches for {query} ordered by cosine similarity of vector embeddings:')
df['similarity'] = df[COLUMN_EMBEDDINGS].apply(lambda x: cosine_similarity(x, query_embedding))
#print(df.sort_values(by='similarity', ascending=False)[['Title','Task','similarity' ]].head(10))
df.sort_values(by='similarity', ascending=False)[['Title','Task','similarity' ]].head(10)


Set OpenAI authentication key
***********************************
Embedding Nurse took 1000ms
Top matches for Nurse ordered by cosine similarity of vector embeddings:


Unnamed: 0,Title,Task,similarity
9294,Urologists,"Direct the work of nurses, residents, or other...",0.866371
8807,Registered Nurses,Observe nurses and visit patients to ensure pr...,0.864399
10218,Psychiatric Aides,"Perform nursing duties, such as administering ...",0.863363
8650,Respiratory Therapists,"Work as part of a team of physicians, nurses, ...",0.862016
8975,Nurse Midwives,"Plan, provide, or evaluate educational program...",0.861606
9859,Licensed Practical and Licensed Vocational Nurses,Supervise nurses' aides or assistants.,0.861478
8891,Critical Care Nurses,Supervise and monitor unit nursing staff.,0.859506
9041,Anesthesiologists,"Coordinate and direct work of nurses, medical ...",0.858469
10170,Nursing Assistants,Assist nurses or physicians in the operation o...,0.858173
9278,Hospitalists,"Direct, coordinate, or supervise the patient c...",0.857854
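
The ranking above orders tasks by the cosine similarity between each task embedding and the query embedding, that is, the dot product of the two vectors divided by the product of their lengths. Below is a dependency-free sketch of that calculation and of a top-k lookup. The names `cosine_sim` and `top_k` and the toy 2-D vectors are illustrative assumptions; the notebook itself uses `cosine_similarity` from `openai.embeddings_utils`, and real ada-002 embeddings have 1536 dimensions.

```python
import math


def cosine_sim(a, b):
    """Cosine similarity: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=10):
    """Return (index, similarity) pairs for the k most similar documents."""
    scored = [(i, cosine_sim(vec, query_vec)) for i, vec in enumerate(doc_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy 2-D vectors, purely illustrative.
docs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
result = top_k([1.0, 0.0], docs, k=2)
print(result)
```

Here the first document points in the same direction as the query and ranks first, while the third document, at 45 degrees, ranks second.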


#### Second example: Data Scientist


In [None]:
openai_auth()
query=""
print('***********************************')
query = "Data Scientist"
tic = time.time()
query_embedding = get_embedding(query, engine=EMBEDDING_MODEL)
toc = time.time()
print(f'Embedding {query} took {round((toc-tic)*1000)}ms')  # convert to ms before rounding
print(f'Top matches for {query} ordered by cosine similarity of vector embeddings:')
df['similarity'] = df[COLUMN_EMBEDDINGS].apply(lambda x: cosine_similarity(x, query_embedding))
#print(df.sort_values(by='similarity', ascending=False)[['Title','Task','similarity' ]].head(10))
df.sort_values(by='similarity', ascending=False)[['Title','Task','similarity' ]].head(10)


Set OpenAI authentication key
***********************************
Embedding Data Scientist took 0ms
Top matches for Data Scientist ordered by cosine similarity of vector embeddings:


Unnamed: 0,Title,Task,similarity
2997,Bioinformatics Technicians,Develop or apply data mining and machine learn...,0.869527
2917,Biostatisticians,Develop or implement data analysis algorithms.,0.864632
2932,Data Scientists,"Analyze, manipulate, or process large sets of ...",0.863898
5338,Social Science Research Assistants,Design and create special programs for tasks s...,0.852801
5341,Social Science Research Assistants,Perform descriptive and multivariate statistic...,0.851475
2905,Statisticians,Develop software applications or programming f...,0.849103
2951,Business Intelligence Analysts,"Create business intelligence tools or systems,...",0.846382
2895,Statisticians,Process large amounts of data for statistical ...,0.846101
2986,Bioinformatics Technicians,Analyze or manipulate bioinformatics data usin...,0.845648
2991,Bioinformatics Technicians,Develop or maintain applications that process ...,0.845227


#### Third Example: Ask the user for a job title

In [None]:
openai_auth()
query=""
print('***********************************')
query = input('Please enter a Job Title: ') # get query from user
tic = time.time()
query_embedding = get_embedding(query, engine=EMBEDDING_MODEL)
toc = time.time()
print(f'Embedding query took {round((toc-tic)*1000)}ms')  # convert to ms before rounding
print(f'Top matches for {query} ordered by cosine similarity of vector embeddings:')
df['similarity'] = df[COLUMN_EMBEDDINGS].apply(lambda x: cosine_similarity(x, query_embedding))
#print(df.sort_values(by='similarity', ascending=False)[['Title','Task','similarity' ]].head(10))
df.sort_values(by='similarity', ascending=False)[['Title','Task','similarity' ]].head(10)

Set OpenAI authentication key
***********************************
Please enter a Job Title: Mechanical Engineer
Embedding query took 1000ms
Top matches for Mechanical Engineer ordered by cosine similarity of vector embeddings:


Unnamed: 0,Title,Task,similarity
4017,Electro-Mechanical and Mechatronics Technologi...,Assist engineers to implement electromechanica...,0.882823
4127,Mechanical Engineering Technologists and Techn...,Provide technical support to other employees r...,0.873477
4150,Mechanical Engineering Technologists and Techn...,"Devise, fabricate, or assemble new or modified...",0.873453
18767,Sailors and Marine Oilers,Provide engineers with assistance in repairing...,0.872678
4014,Electro-Mechanical and Mechatronics Technologi...,Translate electromechanical drawings into desi...,0.872547
15885,"Maintenance and Repair Workers, General",Design new equipment to aid in the repair or m...,0.872355
3542,Mechanical Engineers,"Research, design, evaluate, install, operate, ...",0.872036
4018,Electro-Mechanical and Mechatronics Technologi...,Consult with machinists to ensure that electro...,0.870969
3997,Electro-Mechanical and Mechatronics Technologi...,"Fabricate or assemble mechanical, electrical, ...",0.870386
16035,"Helpers--Installation, Maintenance, and Repair...","Design, weld, and fabricate parts, using bluep...",0.869464


### 6- Evaluating the model

In [None]:
unique_job_titles = df['Title'].unique()

# randomly select 50 unique job titles
random_job_titles = np.random.choice(unique_job_titles, size=50, replace=False)

print(random_job_titles)

['Petroleum Pump System Operators, Refinery Operators, and Gaugers'
 'Information Security Engineers' 'Advertising Sales Agents'
 'Crane and Tower Operators'
 'Extruding and Forming Machine Setters, Operators, and Tenders, Synthetic and Glass Fibers'
 'Animal Caretakers' 'Naturopathic Physicians'
 'Biomass Power Plant Managers' 'Tapers' 'Cargo and Freight Agents'
 'Sociologists' 'Gas Plant Operators' 'Social Science Research Assistants'
 'Urologists' 'New Accounts Clerks'
 'Camera Operators, Television, Video, and Film'
 'Licensed Practical and Licensed Vocational Nurses'
 'Timing Device Assemblers and Adjusters'
 'Dining Room and Cafeteria Attendants and Bartender Helpers'
 'Tailors, Dressmakers, and Custom Sewers'
 'Environmental Science and Protection Technicians, Including Health'
 'Environmental Science Teachers, Postsecondary'
 'Plating Machine Setters, Operators, and Tenders, Metal and Plastic'
 'Helpers--Brickmasons, Blockmasons, Stonemasons, and Tile and Marble Setters'
 'Spec

In [None]:
sum_FirstModel = 0
for title in random_job_titles:
    # embed the sampled job title and rank all tasks by cosine similarity
    query_embedding = get_embedding(title, engine=EMBEDDING_MODEL)
    df['similarity'] = df[COLUMN_EMBEDDINGS].apply(lambda x: cosine_similarity(x, query_embedding))
    df2 = df.sort_values(by='similarity', ascending=False)[['Title', 'Task', 'similarity']].head(10)
    predicted_titles = df2['Title'].tolist()
    # count how many of the top 10 tasks belong to a title that contains the input title
    count = sum(1 for job_title in predicted_titles if title in job_title)
    sum_FirstModel += count
    print("input title:", title, "top 10 predicted titles:", predicted_titles, count)



input title: Petroleum Pump System Operators, Refinery Operators, and Gaugers top 10 predicted titles: ['Petroleum Pump System Operators, Refinery Operators, and Gaugers', 'Wellhead Pumpers', 'Rail Yard Engineers, Dinkey Operators, and Hostlers', 'Wellhead Pumpers', 'Petroleum Pump System Operators, Refinery Operators, and Gaugers', 'Pump Operators, Except Wellhead Pumpers', 'First-Line Supervisors of Production and Operating Workers', 'Wellhead Pumpers', 'Derrick Operators, Oil and Gas', 'Ship Engineers'] 2
input title: Information Security Engineers top 10 predicted titles: ['Information Security Engineers', 'Security Management Specialists', 'Network and Computer Systems Administrators', 'Web Developers', 'Information Security Engineers', 'Information Security Engineers', 'Information Security Engineers', 'Information Security Engineers', 'Information Security Engineers', 'Security Management Specialists'] 6
input title: Advertising Sales Agents top 10 predicted titles: ['Advertisin

In [None]:
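# fraction of matching titles over all (number of sampled titles x top 10) predictions
total_accuracy_FirstModel = sum_FirstModel / (len(random_job_titles) * 10)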
print("Score for the OpenAi Model",total_accuracy_FirstModel)

Score for the OpenAi Model 0.26
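
The score is the fraction of top-10 results whose title contains the input job title, averaged over the 50 sampled titles (a form of precision at 10). As a sketch of the per-title calculation, the hypothetical helper `precision_at_k` below reproduces the count of 6 printed above for "Information Security Engineers"; the helper name is an assumption and does not appear in the notebook.

```python
def precision_at_k(query_title, predicted_titles, k=10):
    """Fraction of the top-k predicted titles that contain the query title."""
    top = predicted_titles[:k]
    hits = sum(1 for predicted in top if query_title in predicted)
    return hits / k


# Top-10 titles printed in the evaluation output for this input title.
predicted = [
    'Information Security Engineers', 'Security Management Specialists',
    'Network and Computer Systems Administrators', 'Web Developers',
    'Information Security Engineers', 'Information Security Engineers',
    'Information Security Engineers', 'Information Security Engineers',
    'Information Security Engineers', 'Security Management Specialists',
]
score = precision_at_k('Information Security Engineers', predicted)
print(score)  # 0.6
```

Averaging this value over all sampled titles gives the same number as dividing the summed counts by 500.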


### Resources:

1- https://github.com/ShawnTXin/openai-embeddings

2- https://www.onetcenter.org/ and https://www.onetcenter.org/dictionary/27.2/excel/task_statements.html

3- https://platform.openai.com/docs/guides/embeddings/what-are-embeddings

4- https://chat.openai.com/chat