<a href="https://colab.research.google.com/github/seanreed1111/colab-demos/blob/master/resumeGPT_v3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Based on https://github.com/openai/openai-cookbook/tree/main/examples

## TODO  
- Return a link to the full original resume instead of to chunks
- Run experiments to find optimal chunking size of resume text
- Figure out how to use Ada and Babbage to reduce costs
- Port to the Azure platform



In [1]:
!pip install -qqq loguru textract tiktoken openai azure-ai-ml mlflow azureml-sdk azureml-mlflow


# before running this notebook, UPLOAD these files
- openai.env
- azure.env

In [2]:
import os, argparse, loguru, json, time, datetime, openai
from pathlib import Path
from loguru import logger

In [3]:
def maybe_load_aml_env_vars(env_path=None):
  """Copy the Azure ML workspace settings from a JSON file into os.environ."""
  import os, json
  required = ("resource_group", "workspace_name", "subscription_id")
  try:
    with open(env_path, "r") as f:
      env_vars = json.load(f)
    for key in required:
      os.environ[key] = env_vars[key]
    # succeed only if every required setting is present and non-empty
    return all(os.getenv(key) for key in required)
  except Exception as e:
    # covers a missing path (env_path=None), unreadable JSON, or a missing key
    logger.error(f"{e}")
    return False

In [4]:
def set_open_ai_key(env_path=None):
  import json, os
  from pathlib import Path
  try:
    with open(env_path, "r") as f:
        env_vars = json.load(f)
    os.environ["OPENAI_API_KEY"] = env_vars["OPENAI_API_KEY"]
    openai.api_key = os.environ["OPENAI_API_KEY"]
    openai.Model.list() #test a random command on the openai API
    return True
  except Exception as e:
    logger.error(f"{e}")
  return False

def test_set_open_ai_key(key_path=None):
  openai.api_key = None #disconnect from api key if already registered
  try:
    set_open_ai_key(key_path)
    openai.Model.list()
    return True
  except Exception as e:
    logger.error(f"{e}")
  return False


In [5]:
def maybe_get_ml_client(env_path=None):
  # this is a mix of sdk v1 and v2. Try to consolidate 
  import json, os, mlflow
  from pathlib import Path
  from azureml.core import Workspace 
  from azure.identity import DefaultAzureCredential, InteractiveBrowserCredential
  from azure.ai.ml import MLClient

  if not env_path: return None

  ws = Workspace.from_config(env_path)
  tracking_uri = ws.get_mlflow_tracking_uri()
  mlflow.set_tracking_uri(tracking_uri)

  try:
      credential = DefaultAzureCredential()
  except Exception as ex:
      # Fall back to InteractiveBrowserCredential in case DefaultAzureCredential not working
      credential = InteractiveBrowserCredential()

  is_loaded = maybe_load_aml_env_vars(env_path)
  if is_loaded:
    try:
      ml_client = MLClient(
          subscription_id=os.getenv("subscription_id"),
          resource_group_name=os.getenv("resource_group"),
          workspace_name=os.getenv("workspace_name"),
          credential=credential,
      )
      return ml_client
    except Exception as e:
      logger.error(f"{e}")
      return None

In [6]:
def maybe_setup_azure_env(azure_env_path=None):
  # setup azure ml env if azure credentials are available.
  if azure_env_path and azure_env_path.is_file() and maybe_load_aml_env_vars(azure_env_path):
    ml_client = maybe_get_ml_client(azure_env_path)
    if ml_client:
      #do a random test to check that ml_client and mlflow are playing nicely together
      import mlflow
      experiment_name = 'mlflow-2'
      mlflow.set_experiment(experiment_name)
      from random import random

      with mlflow.start_run() as mlflow_test_run:
          mlflow.log_param("hello_param", "world")
          mlflow.log_metric("hello_metric2", random())
          os.system(f"echo 'hello world2' > helloworld2.txt")
          mlflow.log_artifact("helloworld2.txt")
      return True
  return False

In [7]:
# setup
azure_env_path, openai_env_path, ml_client, openai.api_key = None, None, None , None
cwd = Path.cwd()
# resume_path = cwd / "Resumes"
# resume_path.mkdir(exist_ok=True)

azure_env_path = cwd / "azure.env"  # comment out if no azure.env was uploaded
openai_env_path = cwd/ "openai.env"
maybe_setup_azure_env(azure_env_path)
set_open_ai_key(openai_env_path)

True

# SPLIT SECTIONS
source: Embedding_Wikipedia_articles_for_search.ipynb 
https://colab.research.google.com/drive/1EJMtCmF8jZc2Y-c1RaBxFSCTPcjzjJf4#scrollTo=TOVSYkDur9zA

Next, we'll recursively split long sections into smaller sections.

There's no perfect recipe for splitting text into sections.

Some tradeoffs include:
- Longer sections may be better for questions that require more context
- Longer sections may be worse for retrieval, as they may have more topics muddled together
- Shorter sections are better for reducing costs (which are proportional to the number of tokens)
- Shorter sections allow more sections to be retrieved, which may help with recall
- Overlapping sections may help prevent answers from being cut by section boundaries

Here, we'll use a simple approach and limit sections to 1,000 tokens each by default, recursively halving any sections that are too long. To avoid cutting in the middle of useful sentences, we'll split along paragraph boundaries when possible.
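The splitter defined further down does not overlap its sections. As a side illustration of the overlap tradeoff from the list above, here is a minimal sketch of a sliding window over tokens. It assumes whitespace-separated words as a stand-in for real tokenizer output, and `overlapping_chunks` is a hypothetical helper, not part of this notebook.

```python
def overlapping_chunks(tokens: list[str], size: int, overlap: int) -> list[list[str]]:
    """Slide a window of `size` tokens forward by `size - overlap` each step."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the input
    return chunks


# Assumption: words stand in for tokens, purely for illustration.
words = "a b c d e f g".split()
chunks = overlapping_chunks(words, size=4, overlap=2)
for chunk in chunks:
    print(chunk)
```

Each chunk repeats the last two tokens of the previous one, so a phrase that straddles a boundary still appears whole in at least one chunk. The cost is that more tokens are embedded overall.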

### extract text from pdf

In [8]:
import textract, os, openai, tiktoken

In [23]:
GPT_MODEL = 'gpt-3.5-turbo'  # only matters insofar as it selects which tokenizer to use

def num_tokens(text: str, model: str = GPT_MODEL) -> int:
    """Return the number of tokens in a string."""
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(text))

## TODO This needs more sophistication in the use of a delimiter
def halved_by_delimiter(string: str, delimiter: str = "\n") -> list[str]:
    """Split a string in two, on a delimiter, trying to balance tokens on each side."""
    chunks = string.split(delimiter)
    if len(chunks) == 1:
        return [string, ""]  # no delimiter found
    elif len(chunks) == 2:
        return chunks  # no need to search for halfway point
    else:
        total_tokens = num_tokens(string)
        halfway = total_tokens // 2
        best_diff = halfway
        for i, chunk in enumerate(chunks):
            left = delimiter.join(chunks[: i + 1])
            left_tokens = num_tokens(left)
            diff = abs(halfway - left_tokens)
            if diff >= best_diff:
                break
            else:
                best_diff = diff
        left = delimiter.join(chunks[:i])
        right = delimiter.join(chunks[i:])
        return [left, right]


def truncated_string(
    string: str,
    model: str,
    max_tokens: int,
    print_warning: bool = False,
    TRUNCATION_WARNING_PERCENTAGE: float = 0.25,
) -> str:
    """Truncate a string to a maximum number of tokens."""
    encoding = tiktoken.encoding_for_model(model)
    encoded_string = encoding.encode(string)
    if len(encoded_string) <= max_tokens:
        # nothing to cut; returning early also avoids dividing by zero on empty input
        return string
    truncated = encoding.decode(encoded_string[:max_tokens])
    # fraction of the original tokens that was dropped
    truncation_percentage = 1.0 - max_tokens / len(encoded_string)
    if print_warning and truncation_percentage > TRUNCATION_WARNING_PERCENTAGE:
        logger.warning(f"Warning: Truncated string from {len(encoded_string)} tokens to {max_tokens} tokens. \nOriginalString:{string} \n")

    return truncated

In [10]:
def split_strings_from_subsection(
    subsection: tuple[list[str], str], #legacy structure. 
    max_tokens: int = 1000,
    model: str = GPT_MODEL,
    max_recursion: int = 8,
) -> list[str]:
    """
    Split a subsection into a list of subsections, each with no more than max_tokens.
    Each subsection is a tuple of parent titles [H1, H2, ...] and text (str).
    """
    titles, text = subsection
    string = "\n\n".join(titles + [text])
    num_tokens_in_string = num_tokens(string)
    # if length is fine, return string
    if num_tokens_in_string <= max_tokens:
        return [string]
    # if recursion hasn't found a split after X iterations, just truncate
    elif max_recursion == 0:
        return [truncated_string(string, model=model, max_tokens=max_tokens)]
    # otherwise, split in half and recurse
    else:
        titles, text = subsection
        for delimiter in ["\n\n", "\n", ". ", "●", "•", ":", ";"]:
            left, right = halved_by_delimiter(text, delimiter=delimiter)
            if left == "" or right == "":
                # if either half is empty, retry with a more fine-grained delimiter
                continue
            else:
                # recurse on each half
                results = []
                for half in [left, right]:
                    half_subsection = (titles, half)
                    half_strings = split_strings_from_subsection(
                        half_subsection,
                        max_tokens=max_tokens,
                        model=model,
                        max_recursion=max_recursion - 1,
                    )
                    results.extend(half_strings)
                return results
    # otherwise no split was found, so just truncate (should be very rare)
    return [truncated_string(string, model=model, max_tokens=max_tokens)]
 

In [11]:
#load resumes from csv file made by pdf parser and then split and create embeddings.
import pandas as pd
df = pd.read_csv("/content/resume_books.csv",usecols=["text", "source"]);df.head()
df = df.dropna()
df.loc[:,"name"] = [f"Name: {i}" for i in df.index]
df.loc[:,"text2"] = list(zip(df["name"].to_list(), df["text"].to_list()))
df = df[["name","text2"]];df.head()
df.to_csv("resume_books_v2.csv")
clean_texts = df["text2"].to_list()
clean_texts = [([x1],x2) for x1, x2 in clean_texts]

In [20]:
clean_texts[2]

(['Name: 2'],
 'RUIZE CHEN  \n(585) 540-6418 // ruize.chen@nyu.edu  // linkedin.com/in/ ruize-chen  \nEDUCATION  \nExpected 12/23 NEW YORK UNIVERSITY                 New York, NY \nThe Courant Institute of Mathematical Sciences \nM.S. in Mathematics in Finance \n● Expected Coursework:  object-oriented programming (Java), stochastic calculus, Brownian motion, \nFama-French, Black-Scholes, risk and portfolio management , data-driven modeling \n08/18 - 05/22 UNIVERSITY OF ROCHESTER                               Rochester, NY \n  B.A. in Mathematics and Statistics & B.S in Finance \n● Coursework:  linear algebra, ordinary differential equations, real analysis, stochastic processes, \nprobability theory, linear regression, mean-variance optimization, corporate finance \n● Honors/Awards:  Dean’s List (3 years), Cum Laude, Beta Gamma Sigma Honor Society \nEXPERIENCE  \n06/21 - 08/21 NORTHEAST SECURITIES              Shenzhen, China  \n(Top 25 Chinese securities firm)  \nQuantitative Research 

In [21]:
clean_texts[-1]

(['Name: 191'],
 'MICHELLE L. DEMOTTE  \nBronx, NY 10458  | (203) -241-330 1 | mdemotte@fordham.edu | www.linkedin.com/in/michelle -demotte   \n  \nEDUCATION  \nFordham University, Gabelli School of Business           Bronx, New York  \nBachelor of Sc ience in Business Administration , Major in Finance     September 2022  – May 2024  \nProject:  Gabelli School of Business Consulting Cup Challenge  (2022)  \n• Collaborat ing with a five -person team to develop a thorough financial and strategic analysis of Nordstrom  by \nidentifying challenges, analyzing previous financial reports, and developing a marketing/implementation strategy   \n \nWestern Connecticut State University, Ancell College of Business             Danbury, Connecticut  \nKathwari Honors Program ,           September 2021 - May 202 2 \nGPA:  3.93/4.0, Dean’s List               \n \nSacred Heart University , Thomas More Honors Program              Fairfield, Connecticut  \nGPA: 3.86/4.0, Dean’s List           September 2

In [24]:
# split resumes into chunks. Small chunks probably better when searching for skills? 
# maybe even shrink to individual sentences
MAX_TOKENS = 50
resume_strings = []
for section in clean_texts:
    resume_strings.extend(split_strings_from_subsection(section, max_tokens=MAX_TOKENS))

print(f"{len(clean_texts)} resumes split into {len(resume_strings)} strings.")


191 resumes split into 4418 strings.


In [25]:
resume_strings[-5:]

['Name: 191\n\n \nSacred Heart University Journey                 Fairfield, Connecticut \nVolunteer                                     Summer 2018, 2019  ',
 'Name: 191\n\n• Centralized efforts to support lo cal gardens, housing developments, and communities in the greater Bridgeport area  ',
 'Name: 191\n\n• Assessed community needs and developed solutions followed by insightful reflection in small group settings  \n \nACTIVITIES/INTERESTS  ',
 'Name: 191\n\nSkills: Advanced Microsoft Excel, PowerPoint, Word, and Outlook \nRelevant Coursework: Financial Accounting, Information Systems, Statistical Decision Making, Business Statistics  ',
 'Name: 191\n\nInterests: Cooking, self-improvement, spending time outside, diversity, Minecraft, and traveling  ']

# calculate embeddings and store in dataframe

In [26]:
import pandas as pd
EMBEDDING_MODEL = "text-embedding-ada-002"  # OpenAI's best embeddings as of Apr 2023
MAX_BATCH_SIZE = 1000 # you can submit up to 2048 embedding inputs per request
NUMBER_OF_STRINGS_TO_EMBED = len(resume_strings)

if NUMBER_OF_STRINGS_TO_EMBED < MAX_BATCH_SIZE:
  BATCH_SIZE = NUMBER_OF_STRINGS_TO_EMBED
else: 
  BATCH_SIZE = MAX_BATCH_SIZE 

embeddings = []
for batch_start in range(0, NUMBER_OF_STRINGS_TO_EMBED, BATCH_SIZE):
    batch_end = batch_start + BATCH_SIZE
    batch = resume_strings[batch_start:batch_end]
    print(f"Batch {batch_start} to {batch_end-1}")
    response = openai.Embedding.create(model=EMBEDDING_MODEL, input=batch)
    for i, be in enumerate(response["data"]):
        assert i == be["index"]  # double check embeddings are in same order as input
    batch_embeddings = [e["embedding"] for e in response["data"]]
    embeddings.extend(batch_embeddings)

df = pd.DataFrame({"text": resume_strings, "embedding": embeddings})

Batch 0 to 999
Batch 1000 to 1999
Batch 2000 to 2999
Batch 3000 to 3999
Batch 4000 to 4999
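The loop above prints `batch_end-1` even when the final batch is short, so the last line reads "4000 to 4999" although only 4418 strings exist. Slicing past the end of a list is harmless in Python, so the embeddings are unaffected. A small sketch of how the true bounds could be computed; `batch_ranges` is a hypothetical helper, not part of the notebook.

```python
def batch_ranges(n_items: int, batch_size: int) -> list[tuple[int, int]]:
    """Return (start, end) pairs with an exclusive end clamped to n_items."""
    return [
        (start, min(start + batch_size, n_items))
        for start in range(0, n_items, batch_size)
    ]


# 4418 is the number of resume strings reported earlier in this notebook.
ranges = batch_ranges(4418, 1000)
for start, end in ranges:
    print(f"Batch {start} to {end - 1} ({end - start} strings)")
```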


In [27]:
# Store embeddings
df.to_csv("embeddings_v2.csv", index=False)
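One thing to keep in mind when this CSV is read back later: each embedding is stored as text, so it has to be parsed into a list of floats before it can be used for similarity search. The sketch below shows the round trip with the standard library only, using a made-up three-number vector in place of a real embedding.

```python
import ast
import csv
import io

# Assumption: a tiny made-up row standing in for one line of embeddings_v2.csv.
rows = [{"text": "Name: 0 example", "embedding": [0.1, -0.2, 0.3]}]

buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=["text", "embedding"])
writer.writeheader()
writer.writerows(rows)  # the list is written as its string form

buffer.seek(0)
loaded = list(csv.DictReader(buffer))

raw = loaded[0]["embedding"]      # a str such as "[0.1, -0.2, 0.3]"
restored = ast.literal_eval(raw)  # back to a list of floats
print(type(raw).__name__, type(restored).__name__)
```

With pandas, the equivalent step after `pd.read_csv` would be to apply `ast.literal_eval` to the `embedding` column.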

In [28]:
df.head()

| | text | embedding |
|---|---|---|
| 0 | Name: 0\n\n/content/resume_book_2023_courant.pdf | [0.00010535831097513437, -0.012883378192782402... |
| 1 | Name: 1\n\nRESUME BOOK\nCLASS OF 2023 | [-0.006130422931164503, -0.006197203416377306,... |
| 2 | Name: 2\n\nRUIZE CHEN \n(585) 540-6418 // rui... | [-0.0007784886402077973, 0.0017219326691702008... |
| 3 | Name: 2\n\nExpected 12/23 NEW YORK UNIVERSITY ... | [-0.013850468210875988, 0.01317818183451891, 0... |
| 4 | Name: 2\n\nM.S. in Mathematics in Finance \n● ... | [-0.001279508345760405, -0.0063638705760240555... |


# search documents using query and text embeddings, then retrieve the relevant consultant names from the resume information using GPT

1. Search (once per query)
  1. Given a user question, generate an embedding for the query from the OpenAI API
  1. Using the embeddings, rank the text sections by relevance to the query
1. Ask (once per query)
  1. Insert the question and the most relevant sections into a message to GPT
  1. Return GPT's answer
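The ranking step can be tried without calling the API. The sketch below is a toy version that uses hand-made 3-dimensional vectors in place of real embeddings and a plain-Python cosine similarity in place of `scipy`; `cosine_relatedness`, `rank_by_relatedness`, and the toy rows are illustrative names and data, not part of the notebook.

```python
import math


def cosine_relatedness(x: list[float], y: list[float]) -> float:
    """Cosine similarity: 1.0 for vectors pointing the same way, 0.0 for orthogonal ones."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm


def rank_by_relatedness(query_vec, rows, top_n=3):
    """Score every (text, vector) row against the query and keep the best top_n."""
    scored = [(text, cosine_relatedness(query_vec, vec)) for text, vec in rows]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Assumption: made-up vectors; real embeddings have far more dimensions.
toy_rows = [
    ("math", [1.0, 0.0, 0.0]),
    ("law", [0.0, 1.0, 0.0]),
    ("math and law", [1.0, 1.0, 0.0]),
]
ranked = rank_by_relatedness([1.0, 0.1, 0.0], toy_rows, top_n=2)
for text, score in ranked:
    print(f"{score:.3f}  {text}")
```

Because every row is scored against the query, the cost grows linearly with the number of stored strings, which is acceptable for a few thousand resume chunks.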

In [29]:
from scipy import spatial
EMBEDDING_MODEL = "text-embedding-ada-002"

In [30]:
# search function
def strings_ranked_by_relatedness(
    query: str,
    df: pd.DataFrame,
    relatedness_fn=lambda x, y: 1 - spatial.distance.cosine(x, y),
    top_n: int = 3
) -> tuple[list[str], list[float]]:
    """Returns a list of strings and relatednesses, sorted from most related to least."""
    query_embedding_response = openai.Embedding.create(
        model=EMBEDDING_MODEL,
        input=query,
    )
    query_embedding = query_embedding_response["data"][0]["embedding"]
    strings_and_relatednesses = [
        (row["text"], relatedness_fn(query_embedding, row["embedding"]))
        for i, row in df.iterrows()
    ]
    strings_and_relatednesses.sort(key=lambda x: x[1], reverse=True)
    strings, relatednesses = zip(*strings_and_relatednesses)
    return strings[:top_n], relatednesses[:top_n]

In [31]:
def test_strings_ranked_by_relatedness(query, df, top_n=10):
  # pass top_n by keyword; positionally it would land in relatedness_fn
  strings, relatednesses = strings_ranked_by_relatedness(query, df, top_n=top_n)
  for string, relatedness in zip(strings, relatednesses):
      print(f"{relatedness=:.3f}")
      display(string)

In [32]:
query = "strong in math"
strings_ranked_by_relatedness(query, df)


(('Name: 18\n\nin Mathematics and Applied Mathematics ',
  'Name: 16\n\nin Mathematics in Finance ',
  'Name: 23\n\nMath Department Grader \n●\nGraded homework for more than 300 students in upper-division courses including real analysis, \nlinear algebra, abstract algebra, and probability \n●'),
 (0.8197180259857161, 0.8182264010521288, 0.8150523976996884))

In [33]:
query = "understands pytorch"
strings_ranked_by_relatedness(query, df)

(('Name: 4\n\nSklearn, Tensorflow), R, Java, C++\nLanguages:\nEnglish (fluent), Mandarin (native)\nInterests:',
  'Name: 6\n\n●\nUsed natural language understanding; designed model that learned image generation from \ntext data with 1M-word vocabulary, producing high-level generic sentence representations \n●\nImproved model by employing distributed text encoder conditioned with generative ',
  'Name: 6\n\n08/16 - 02/17\nINDIAN INSTITUTE OF TECHNOLOGY ROORKEE\nRoorkee, India \nText-Image Synthesis with Uni-Skip Vectors (Python, Deep Learning) '),
 (0.7835732057219097, 0.7830662985290144, 0.7830618617194605))

In [34]:
query = " understands law"
strings_ranked_by_relatedness(query, df)

(('Name: 81\n\nCompeted as a pre-trial attorney in a mock criminal law case \n●\nAnalyzed legal documents pertaining to the case and argued for the permittance of evidence ',
  'Name: 161\n\nCompeted as a pre-trial attorney in a mock criminal law case \n●\nAnalyzed legal documents pertaining to the case and argued for the permittance of evidence ',
  'Name: 110\n\n● Attended court hearings involving, commercial law, traffic violation, family law, and data breach cases  '),
 (0.8110228926750287, 0.8092794685151007, 0.8082582811210965))

In [35]:
query = " azure databricks"
strings_ranked_by_relatedness(query, df)

(('Name: 13\n\nwith Azure HDInsight; prepared data visualization for industry report\n●',
  'Name: 122\n\nTools: Jupyter Notebook, Tableau, Lucid Chart, Anaconda Navigator and Android Studio. Big Data and Cloud: Heroku, AWS',
  'Name: 42\n\nTools: Jupyter Notebook, Tableau, Lucid Chart, Anaconda Navigator and Android Studio. Big Data and Cloud: Heroku, AWS'),
 (0.757380922555399, 0.7565015675875518, 0.7511625982697887))

## 3. Ask

With our database of resumes turned into vector embeddings, we can use the vector search function above to automatically retrieve the most relevant resume sections and feed the query plus those sections into GPT.

Below, we define a function `ask` that:
- Takes a user query
- Searches for text relevant to the query
- Stuffs that text into a message for GPT
- Sends the message to GPT
- Returns GPT's answer

In [36]:
GPT_MODEL = 'gpt-3.5-turbo'
def num_tokens(text: str, model: str = 'gpt-3.5-turbo') -> int:
    """Return the number of tokens in a string."""
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(text))


#using v1 search function
def query_message(
    query: str,
    df: pd.DataFrame,
    model: str,
    token_budget: int
) -> str:
    """Return a message for GPT, with relevant source texts pulled from a dataframe."""
    strings, relatednesses = strings_ranked_by_relatedness(query, df)
    logger.info(f"Strings Found From Search\n:{strings}\n Relatednesses:{relatednesses}\n")
    introduction = ' You are a Human Resources agent looking for skills in resumes'
    question = f"\n\nQuestion: {query}"
    message = introduction
    for string in strings:
        next_article = f'\n\nresume section:\n"""\n{string}\n"""'
        if (
            num_tokens(message + next_article + question, model=model)
            > token_budget
        ):
            break
        else:
            message += next_article
    return message + question

@logger.catch
def ask(
    query: str,
    df: pd.DataFrame = df,
    model: str = GPT_MODEL,
    token_budget: int = 4096 - 500,
    print_message: bool = False,
) -> str:
    """Answers a query using GPT and a dataframe of relevant texts and embeddings."""
    message = query_message(query, df, model=model, token_budget=token_budget)
    logger.info(f"{message}")
    content = "Construct a list of Name fields from the documents given. Remove all duplicates from the list"
    messages = [
        {"role": "system", "content": content},
        {"role": "user", "content": message},
    ]
    response = openai.ChatCompletion.create(
        model=model,
        messages=messages,
        temperature=0
    )
    response_message = response["choices"][0]["message"]["content"]
    return response_message



In [37]:
ask('who knows law')

2023-05-12 17:31:48.882 | INFO     | __main__:query_message:17 - Strings Found From Search
:('Name: 81\n\nCompeted as a pre-trial attorney in a mock criminal law case \n●\nAnalyzed legal documents pertaining to the case and argued for the permittance of evidence ', 'Name: 110\n\n● Attended court hearings involving, commercial law, traffic violation, family law, and data breach cases  ', 'Name: 161\n\nCompeted as a pre-trial attorney in a mock criminal law case \n●\nAnalyzed legal documents pertaining to the case and argued for the permittance of evidence ')
 Relatednesses:(0.7960334080871797, 0.7936805124322088, 0.7932960803294815)
2023-05-12 17:31:48.886 | INFO     | __main__:ask:42 -  You are a Human Resources agent looking for skills in resumes

resume section:
"""
Name: 81

Competed as a pre-trial attorney in a mock criminal law case 
●
Analyzed legal documents pertaining to the cas

'Possible answer:\n\n- 81\n- 161'

In [38]:
ask('who majored in physics')

2023-05-12 17:32:10.980 | INFO     | __main__:query_message:17 - Strings Found From Search
:('Name: 76\n\nBachelor of Science in Engineering Physics with Concentration in Mechanical Engineering\nRelevant Coursework: Solidworks, AutoCAD, Machine Dynamics & Design, Mechanics of Materials, ', 'Name: 156\n\nBachelor of Science in Engineering Physics with Concentration in Mechanical Engineering\nRelevant Coursework: Solidworks, AutoCAD, Machine Dynamics & Design, Mechanics of Materials, ', 'Name: 5\n\nB.S. in Physics and B.S. in Financial Math & Statistics \n●\nCoursework:\nmultivariable calculus,\nprobability and\nstatistics,')
 Relatednesses:(0.8106379068250826, 0.8095400054922902, 0.8052200089651524)
2023-05-12 17:32:10.984 | INFO     | __main__:ask:42 -  You are a Human Resources agent looking for skills in resumes

resume section:
"""
Name: 76

Bachelor of Science in Engineering Physics

'The following individuals majored in Physics:\n- Resume section 3 (Name: 5)'

In [39]:
QUERY = 'who is a consultant'
ask(QUERY)

2023-05-12 17:33:09.736 | INFO     | __main__:query_message:17 - Strings Found From Search
:('Name: 35\n\nPublic Consulting Group , Contact Tracer for New York State , Long Island Region , NY 12/2020 - 03/2022  ', 'Name: 115\n\nPublic Consulting Group , Contact Tracer for New York State , Long Island Region , NY 12/2020 - 03/2022  ', 'Name: 85\n\nAcquired industry-based knowledge from workshops regarding tax, consulting, audit, and advisory services and gained a \ndeeper perspective on the consulting services industry\nWE ARE ONE Campaign\nSaco, ME ')
 Relatednesses:(0.7516799652320376, 0.7498786336834139, 0.7467068908859048)
2023-05-12 17:33:09.739 | INFO     | __main__:ask:42 -  You are a Human Resources agent looking for skills in resumes

resume section:
"""
Name: 35

Public Consulting Group , Contact Tracer for New York State , Long Island Region , NY 12/2020 - 03/2022  
"""

resum

'It is not possible to determine who is a consultant based on the given resume sections as there is no clear indication of a specific person being referred to as a consultant.'

In [40]:
QUERY = 'who knows SAP'
ask(QUERY)

2023-05-12 17:34:19.519 | INFO     | __main__:query_message:17 - Strings Found From Search
:('Name: 3\n\nin largest global SAP S/4HANA ERP project at EY Singapore in 2019 for client, DyStar Group \n●\nConducted international localization workshops for franchises in 8 countries; communicated ', 'Name: 29\n\n12/20 - 02/21\nACCENTURE\nBeijing, China \nTechnology Consulting Assistant (SAP) \n●', 'Name: 29\n\nCollaborated with business planning and consolidation consultant to construct expense budget \ntable in SAP; created 23 logical carding diagrams of cost allocation configuration rules ')
 Relatednesses:(0.7974384286051879, 0.7935589698321063, 0.7872619620330196)
2023-05-12 17:34:19.526 | INFO     | __main__:ask:42 -  You are a Human Resources agent looking for skills in resumes

resume section:
"""
Name: 3

in largest global SAP S/4HANA ERP project at EY Singapore in 2019 for client, Dy

'The first and third resumes mention SAP, indicating that those candidates have experience with SAP.'