<a href="https://colab.research.google.com/github/seanreed1111/colab-demos/blob/master/resumeGPT_v3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

based on https://github.com/openai/openai-cookbook/tree/main/examples

## TODO  
- Return a link to the full original resume instead of to chunks
- Run experiments to find optimal chunking size of resume text
- Figure out how to use Ada and Babbage to reduce costs
- Port to the Azure platform



In [None]:
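# this notebook uses the pre-1.0 openai interface (openai.Model, openai.Embedding, openai.ChatCompletion)
!pip install -qqq loguru textract tiktoken "openai<1" azure-ai-ml mlflow azureml-sdk azureml-mlflow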


# before running this notebook, UPLOAD these files
- openai.env
- azure.env
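
Judging from the loader functions in this notebook, which parse both files with `json.load`, each file is expected to hold a JSON object despite the `.env` extension. A minimal sketch of creating placeholder versions follows; every value shown is a dummy placeholder, and the `*.example` file names are hypothetical, chosen so that real credential files are not overwritten.

```python
import json
from pathlib import Path

# Key read by set_open_ai_key(); the value is a dummy placeholder, not a real key.
openai_env = {"OPENAI_API_KEY": "sk-REPLACE-ME"}

# Keys read by maybe_load_aml_env_vars(); all values are dummy placeholders.
azure_env = {
    "subscription_id": "00000000-0000-0000-0000-000000000000",
    "resource_group": "my-resource-group",
    "workspace_name": "my-workspace",
}

# Hypothetical file names, so existing openai.env / azure.env are left untouched.
Path("openai.env.example").write_text(json.dumps(openai_env, indent=2))
Path("azure.env.example").write_text(json.dumps(azure_env, indent=2))
```

Copy each example file to the real name and fill in actual values before uploading.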

In [None]:
import os,argparse,loguru, json, time, datetime, openai
from pathlib import Path
from loguru import logger

In [None]:
def maybe_load_aml_env_vars(env_path=None):
  """Load Azure ML workspace settings from a JSON file into environment variables.

  Returns True if all three settings were loaded and are non-empty, else False.
  """
  import os, json
  required_keys = ("resource_group", "workspace_name", "subscription_id")
  try:
    with open(env_path, "r") as f:
      env_vars = json.load(f)
    for key in required_keys:
      os.environ[key] = env_vars[key]
  except Exception as e:
    logger.error(f"{e}")
    return False
  # always return a bool, even when a value is present but empty
  return all(os.getenv(key) for key in required_keys)

In [None]:
def set_open_ai_key(env_path=None):
  import json, os
  from pathlib import Path
  try:
    with open(env_path, "r") as f:
        env_vars = json.load(f)
    os.environ["OPENAI_API_KEY"] = env_vars["OPENAI_API_KEY"]
    openai.api_key = os.environ["OPENAI_API_KEY"]
    openai.Model.list() #test a random command on the openai API
    return True
  except Exception as e:
    logger.error(f"{e}")
  return False

def test_set_open_ai_key(key_path=None):
  openai.api_key = None #disconnect from api key if already registered
  try:
    set_open_ai_key(key_path)
    openai.Model.list()
    return True
  except Exception as e:
    logger.error(f"{e}")
  return False


In [None]:
def maybe_get_ml_client(env_path=None):
  # this is a mix of sdk v1 and v2. Try to consolidate 
  import json, os, mlflow
  from pathlib import Path
  from azureml.core import Workspace 
  from azure.identity import DefaultAzureCredential, InteractiveBrowserCredential
  from azure.ai.ml import MLClient

  if not env_path: return None

  ws = Workspace.from_config(env_path)
  tracking_uri = ws.get_mlflow_tracking_uri()
  mlflow.set_tracking_uri(tracking_uri)

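  try:
      credential = DefaultAzureCredential()
      # constructing the credential rarely fails; request a token to confirm it actually works
      credential.get_token("https://management.azure.com/.default")
  except Exception as ex:
      # fall back to InteractiveBrowserCredential if DefaultAzureCredential is not working
      credential = InteractiveBrowserCredential()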

  is_loaded = maybe_load_aml_env_vars(env_path)
  if is_loaded:
    try:
      ml_client = MLClient(
          subscription_id=os.getenv("subscription_id"),
          resource_group_name=os.getenv("resource_group"),
          workspace_name=os.getenv("workspace_name"),
          credential=credential,
      )
      return ml_client
    except Exception as e:
      logger.error(f"{e}")
      return None

In [None]:
def maybe_setup_azure_env(azure_env_path=None):
  # setup azure ml env if azure credentials are available.
  if azure_env_path and azure_env_path.is_file() and maybe_load_aml_env_vars(azure_env_path):
    ml_client = maybe_get_ml_client(azure_env_path)
    if ml_client:
      #do a random test to check that ml_client and mlflow are playing nicely together
      import mlflow
      experiment_name = 'mlflow-2'
      mlflow.set_experiment(experiment_name)
      from random import random

      with mlflow.start_run() as mlflow_test_run:
          mlflow.log_param("hello_param", "world")
          mlflow.log_metric("hello_metric2", random())
          Path("helloworld2.txt").write_text("hello world2\n")  # same file as before, written without a shell call
          mlflow.log_artifact("helloworld2.txt")
      return True
  return False

In [None]:
# setup
azure_env_path, openai_env_path, ml_client, openai.api_key = None, None, None , None
cwd = Path.cwd()
# resume_path = cwd / "Resumes"
# resume_path.mkdir(exist_ok=True)

azure_env_path = cwd / "azure.env"  # comment out if not providing an azure env
openai_env_path = cwd / "openai.env"
maybe_setup_azure_env(azure_env_path)
set_open_ai_key(openai_env_path)

True

# SPLIT SECTIONS
source: Embedding_Wikipedia_articles_for_search.ipynb 
https://colab.research.google.com/drive/1EJMtCmF8jZc2Y-c1RaBxFSCTPcjzjJf4#scrollTo=TOVSYkDur9zA

Next, we'll recursively split long sections into smaller sections.

There's no perfect recipe for splitting text into sections.

Some tradeoffs include:
- Longer sections may be better for questions that require more context
- Longer sections may be worse for retrieval, as they may have more topics muddled together
- Shorter sections are better for reducing costs (which are proportional to the number of tokens)
- Shorter sections allow more sections to be retrieved, which may help with recall
- Overlapping sections may help prevent answers from being cut by section boundaries

Here, we'll use a simple approach and limit sections to 1,000 tokens each by default, recursively halving any sections that are too long. To avoid cutting in the middle of useful sentences, we'll split along paragraph boundaries when possible.
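
The overlap tradeoff listed above is not implemented in this notebook. As a rough illustration only, the sketch below cuts a token sequence into fixed-size windows that share a few tokens with their neighbours. The function name `overlapping_windows` and its parameters are hypothetical, and whitespace-separated words stand in for real tokenizer tokens.

```python
def overlapping_windows(tokens: list[str], size: int, overlap: int) -> list[list[str]]:
    """Cut tokens into windows of at most `size` items, each sharing `overlap` items with the previous one."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    windows = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the sequence
    return windows

# Words stand in for tokens in this toy example.
words = "Python SQL Java R MATLAB Spark Docker".split()
for window in overlapping_windows(words, size=4, overlap=2):
    print(window)
```

A larger overlap lowers the chance that a skill list is cut at a boundary, at the price of embedding more tokens overall.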

### extract text from pdf

In [None]:
import textract, os, openai, tiktoken

In [None]:
# #TODO walk the directory to get all the filenames, use as names of the people in tagging and document retrieval
# #TODO use regex to get rid of excess spaces and new lines. only one new line needed per line.
# file_names = ["Jesse_Jayant.pdf", "Nadia_Smythe.pdf"]
# file_paths = [(resume_path / file) for file in file_names];print(file_paths)
# names = [path.stem.lower() for path in file_paths];

# # Extract the raw text from each PDF using textract

# texts =[textract.process((file_path), method='pdfminer').decode('utf-8') for file_path in file_paths]

# #TODO Do more cleaning with regex
# texts = [text.strip().replace("  ", " ") for text in texts]
# #create tuple[list[str],str]
# clean_texts = [(['NameOnResume: '+ item1], item2) for item1,item2 in zip(names,texts)] 

In [None]:
GPT_MODEL = 'gpt-3.5-turbo'  # only matters insofar as it selects which tokenizer to use

def num_tokens(text: str, model: str = GPT_MODEL) -> int:
    """Return the number of tokens in a string."""
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(text))

## TODO This needs more sophistication in the use of a delimiter
def halved_by_delimiter(string: str, delimiter: str = "\n") -> list[str]:
    """Split a string in two, on a delimiter, trying to balance tokens on each side."""
    chunks = string.split(delimiter)
    if len(chunks) == 1:
        return [string, ""]  # no delimiter found
    elif len(chunks) == 2:
        return chunks  # no need to search for halfway point
    else:
        total_tokens = num_tokens(string)
        halfway = total_tokens // 2
        best_diff = halfway
        for i, chunk in enumerate(chunks):
            left = delimiter.join(chunks[: i + 1])
            left_tokens = num_tokens(left)
            diff = abs(halfway - left_tokens)
            if diff >= best_diff:
                break
            else:
                best_diff = diff
        left = delimiter.join(chunks[:i])
        right = delimiter.join(chunks[i:])
        return [left, right]


def truncated_string(
    string: str,
    model: str,
    max_tokens: int,
    print_warning: bool = True,
    TRUNCATION_WARNING_PERCENTAGE: float = 0.25,  # a fraction (0.25 = 25%), despite the name
) -> str:
    """Truncate a string to a maximum number of tokens."""
    encoding = tiktoken.encoding_for_model(model)
    encoded_string = encoding.encode(string)
    if len(encoded_string) <= max_tokens:
        return string  # nothing to truncate; also avoids dividing by zero on an empty string
    truncated = encoding.decode(encoded_string[:max_tokens])
    # fraction of the original tokens that was cut off
    truncation_fraction = 1.0 - max_tokens / len(encoded_string)
    if print_warning and truncation_fraction > TRUNCATION_WARNING_PERCENTAGE:
        logger.warning(f"Warning: Truncated string from {len(encoded_string)} tokens to {max_tokens} tokens. \nOriginalString:{string} \n")
    return truncated

In [None]:
def split_strings_from_subsection(
    subsection: tuple[list[str], str], #legacy structure. 
    max_tokens: int = 1000,
    model: str = GPT_MODEL,
    max_recursion: int = 8,
) -> list[str]:
    """
    Split a subsection into a list of subsections, each with no more than max_tokens.
    Each subsection is a tuple of parent titles [H1, H2, ...] and text (str).
    """
    titles, text = subsection
    string = "\n\n".join(titles + [text])
    num_tokens_in_string = num_tokens(string, model=model)  # count with the same tokenizer used for truncation
    # if length is fine, return string
    if num_tokens_in_string <= max_tokens:
        return [string]
    # if recursion hasn't found a split after X iterations, just truncate
    elif max_recursion == 0:
        return [truncated_string(string, model=model, max_tokens=max_tokens)]
    # otherwise, split in half and recurse (titles and text were unpacked above)
    else:
        for delimiter in ["\n\n", "\n", ". ", "●", "•", ":", ";"]:
            left, right = halved_by_delimiter(text, delimiter=delimiter)
            if left == "" or right == "":
                # if either half is empty, retry with a more fine-grained delimiter
                continue
            else:
                # recurse on each half
                results = []
                for half in [left, right]:
                    half_subsection = (titles, half)
                    half_strings = split_strings_from_subsection(
                        half_subsection,
                        max_tokens=max_tokens,
                        model=model,
                        max_recursion=max_recursion - 1,
                    )
                    results.extend(half_strings)
                return results
    # otherwise no split was found, so just truncate (should be very rare)
    return [truncated_string(string, model=model, max_tokens=max_tokens)]
 

In [None]:
# load resumes from the csv file made by the pdf parser, then split and create embeddings
import pandas as pd
df = pd.read_csv("/content/resume_books.csv", usecols=["text", "source"])
df = df.dropna()
# use the row index as a placeholder name for each resume
df.loc[:, "name"] = [f"Name: {i}" for i in df.index]
df.loc[:, "text2"] = list(zip(df["name"].to_list(), df["text"].to_list()))
df = df[["name", "text2"]]
df.to_csv("resume_books_v2.csv")
# split_strings_from_subsection expects (list of titles, text) tuples
clean_texts = [([name], text) for name, text in df["text2"].to_list()]

In [None]:
# split resumes into chunks. Small chunks probably better when searching for skills? 
# maybe even shrink to individual sentences
MAX_TOKENS = 100
resume_strings = []
for section in clean_texts:
    resume_strings.extend(split_strings_from_subsection(section, max_tokens=MAX_TOKENS))

print(f"{len(clean_texts)} resumes split into {len(resume_strings)} strings.")


OriginalString:Name: 90

 The Ground Floor, Basic Microeconomics, Statistics, Principles of Management, Marketing Principles,  Consulting Cup, Information Systems, Principles of Financial Accounting, Business Communications  EXPERIENCE Maimonides Medical Center Department of Surgery                                                                                              Brooklyn, NY Admin Intern                                                                                                                                                      July 2022 – August 2022 ! Oversaw 35 surgical residents and formulated day assignment sheet and monthly attestation from an Excel block schedule  ! Evaluated 35 surgical residents’ files weekly on New Innovations for missing certifications and recertifications  ! Updated surgical residents’ and faculty’s PubMed academic activity  ! Emailed 10-15 medical students their acceptance email to elective surgeries weekly and confirmed their elective su

191 resumes split into 2127 strings.


# calculate embeddings and store in dataframe

In [None]:
import pandas as pd
EMBEDDING_MODEL = "text-embedding-ada-002"  # OpenAI's best embeddings as of Apr 2023
MAX_BATCH_SIZE = 1000 # you can submit up to 2048 embedding inputs per request
NUMBER_OF_STRINGS_TO_EMBED = len(resume_strings)

# cap the batch size at the number of strings; keep it at least 1 so range() gets a valid step
BATCH_SIZE = max(1, min(MAX_BATCH_SIZE, NUMBER_OF_STRINGS_TO_EMBED))

embeddings = []
for batch_start in range(0, NUMBER_OF_STRINGS_TO_EMBED, BATCH_SIZE):
    batch_end = batch_start + BATCH_SIZE
    batch = resume_strings[batch_start:batch_end]
    print(f"Batch {batch_start} to {batch_end-1}")
    response = openai.Embedding.create(model=EMBEDDING_MODEL, input=batch)
    for i, be in enumerate(response["data"]):
        assert i == be["index"]  # double check embeddings are in same order as input
    batch_embeddings = [e["embedding"] for e in response["data"]]
    embeddings.extend(batch_embeddings)

df = pd.DataFrame({"text": resume_strings, "embedding": embeddings})

Batch 0 to 999
Batch 1000 to 1999
Batch 2000 to 2999


In [None]:
# Store embeddings
df.to_csv("embeddings_v2.csv", index=False)
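
One caveat when reading this file back: CSV has no list type, so each embedding is stored as its string representation and has to be parsed before it can be used for similarity search. A minimal sketch using only the standard library follows; the two-row CSV is made-up data in the same `text,embedding` layout, standing in for `embeddings_v2.csv`.

```python
import ast
import csv
import io

# Made-up stand-in for embeddings_v2.csv: each list is stored as a quoted string.
csv_text = 'text,embedding\n"Name: 1\n\nPython, SQL","[0.1, 0.2, 0.3]"\n"Name: 2\n\nJava","[0.4, 0.5, 0.6]"\n'

rows = []
for row in csv.DictReader(io.StringIO(csv_text)):
    # literal_eval safely turns the string "[0.1, 0.2, 0.3]" back into a list of floats
    row["embedding"] = ast.literal_eval(row["embedding"])
    rows.append(row)

print(type(rows[0]["embedding"]), rows[0]["embedding"])
```

With pandas, the equivalent is to apply `ast.literal_eval` to the `embedding` column after `pd.read_csv`.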

# search documents using query and text embeddings and retrieve relevant consultant name from resume information using GPT

1. Search (once per query)
   1. Given a user question, generate an embedding for the query from the OpenAI API
   1. Using the embeddings, rank the text sections by relevance to the query
1. Ask (once per query)
   1. Insert the question and the most relevant sections into a message to GPT
   1. Return GPT's answer
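
To make the ranking step concrete, the sketch below scores a few made-up 3-dimensional vectors against a query vector with cosine similarity, which is the quantity that `1 - spatial.distance.cosine(x, y)` computes in the search function. Real `text-embedding-ada-002` vectors have 1,536 dimensions; the texts, vectors, and the helper name `cosine_similarity` here are illustrative assumptions.

```python
import math

def cosine_similarity(x: list[float], y: list[float]) -> float:
    """Dot product of x and y divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

# Made-up toy vectors standing in for real embeddings.
query_embedding = [1.0, 0.0, 0.0]
sections = {
    "linear algebra, real analysis": [0.9, 0.1, 0.0],
    "Python, SQL, Java": [0.1, 0.9, 0.0],
    "legal research": [0.0, 0.1, 0.9],
}

# Score every section against the query, then sort from most to least related.
ranked = sorted(
    ((text, cosine_similarity(query_embedding, vec)) for text, vec in sections.items()),
    key=lambda pair: pair[1],
    reverse=True,
)
for text, score in ranked:
    print(f"{score:.3f}  {text}")
```

Sections whose vectors point in nearly the same direction as the query score close to 1, and unrelated ones score close to 0.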

In [None]:
from scipy import spatial

In [None]:
EMBEDDING_MODEL = "text-embedding-ada-002"

In [None]:
# search function
def strings_ranked_by_relatedness(
    query: str,
    df: pd.DataFrame,
    relatedness_fn=lambda x, y: 1 - spatial.distance.cosine(x, y),
    top_n: int = 3
) -> tuple[list[str], list[float]]:
    """Returns a list of strings and relatednesses, sorted from most related to least."""
    query_embedding_response = openai.Embedding.create(
        model=EMBEDDING_MODEL,
        input=query,
    )
    query_embedding = query_embedding_response["data"][0]["embedding"]
    strings_and_relatednesses = [
        (row["text"], relatedness_fn(query_embedding, row["embedding"]))
        for i, row in df.iterrows()
    ]
    strings_and_relatednesses.sort(key=lambda x: x[1], reverse=True)
    strings, relatednesses = zip(*strings_and_relatednesses)
    return strings[:top_n], relatednesses[:top_n]

In [None]:
def test_strings_ranked_by_relatedness(query, df, top_n=10):
  # pass top_n by keyword: the third positional parameter is relatedness_fn
  strings, relatednesses = strings_ranked_by_relatedness(query, df, top_n=top_n)
  for string, relatedness in zip(strings, relatednesses):
      print(f"{relatedness=:.3f}")
      display(string)

In [None]:
query = "strong in math"
strings_ranked_by_relatedness(query, df)


(("Name: 8\n\nlinear algebra, partial differential equations,\nItô's lemma, real analysis, Bayesian\nstatistics, CAPM, WACC, options, data structure (Python), stochastic processes, linear regression \n●\nHonors:\nDean’s Honors (top 5% of GPA in department),\nPresident’s Scholarship",
  'Name: 2\n\n● Coursework:  linear algebra, ordinary differential equations, real analysis, stochastic processes, \nprobability theory, linear regression, mean-variance optimization, corporate finance \n● Honors/Awards:  Dean’s List (3 years), Cum Laude, Beta Gamma Sigma Honor Society ',
  'Name: 14\n\nCoursework:\nlinear algebra, mathematical statistics,\nBrownian motion, law of large numbers, \nmachine learning, data structures, algorithms, databases \n●\nHonors/Awards:\nDean’s list for 4 years, Latin Honors\nCum Laude'),
 (0.8062711795862255, 0.7996338658385634, 0.7995175364806416))

In [None]:
query = "understands pytorch"
strings_ranked_by_relatedness(query, df)

(('Name: 6\n\n08/16 - 02/17\nINDIAN INSTITUTE OF TECHNOLOGY ROORKEE\nRoorkee, India \nText-Image Synthesis with Uni-Skip Vectors (Python, Deep Learning) \n●\nUsed natural language understanding; designed model that learned image generation from \ntext data with 1M-word vocabulary, producing high-level generic sentence representations \n●\nImproved model by employing distributed text encoder conditioned with generative ',
  'Name: 21\n\nProgramming Languages:\nPython (NumPy, Pandas, Sklearn,\nSciPy), SQL, Java, R, MATLAB\nLanguages:\nEnglish (fluent); Mandarin (native)',
  'Name: 7\n\nC O M P U T A T I O N A L  S K I L L S  /  O T H E R\nProgramming Languages:\nPython (Numpy, Pandas, Scikit-learn,\nMatplotlib), Java, SQL, R\nLanguages:\nEnglish (fluent), Mandarin (native)\nPublication:\nOption Mispricing & Arbitrage Opportunity\n,\nICSET 2021 Taiwan\nActivities:\nDiscrete Structure Teaching Assistant\nat Northeastern University'),
 (0.7885447575940931, 0.7751890320298489, 0.774539988040

In [None]:
query = " understands law"
strings_ranked_by_relatedness(query, df)

(('Name: 110\n\nIntroduction to the American Legal System, Constitutional Law, American Constitution, Eur opean Politics  \n \nRELEVANT EXPERIENCE Whitestone Chambers, Legal Intern London, UK             January -April 2022  ',
  'Name: 190\n\nIntroduction to the American Legal System, Constitutional Law, American Constitution, Eur opean Politics  \n \nRELEVANT EXPERIENCE Whitestone Chambers, Legal Intern London, UK             January -April 2022  ',
  'Name: 110\n\n● Conducted in depth legal research and prepared arguments for active data breach cases  \n● Worked directly with head barrister to prepare for various stages of commercial and civil litigation  \n● Attended court hearings involving, commercial law, traffic violation, family law, and data breach cases  '),
 (0.8022047376928069, 0.8016164332607587, 0.7965876491636508))

In [None]:
query = " azure databricks"
strings_ranked_by_relatedness(query, df)

(('Name: 13\n\nwith Azure HDInsight; prepared data visualization for industry report\n●\nBuilt large-scale database  from daily news and data for 3,000 clean energy automobile stocks\nfrom 2018 to 2019, using R and SQL\n●\nUsed  feature extraction on news about 1,000 selected stocks in 2019; improved stock prediction\nbased on sentiment analysis with RNN; average accuracy increased by 7%\nP R O J E C T',
  'Name: 122\n\nTools: Jupyter Notebook, Tableau, Lucid Chart, Anaconda Navigator and Android Studio. Big Data and Cloud: Heroku, AWS.  WORK EXPERIENCE Research Assistant in Quantum Computing at Fordham University                   (August 2022 – Present) Tools: Python, Qiskit',
  'Name: 42\n\nTools: Jupyter Notebook, Tableau, Lucid Chart, Anaconda Navigator and Android Studio. Big Data and Cloud: Heroku, AWS.  WORK EXPERIENCE Research Assistant in Quantum Computing at Fordham University                   (August 2022 – Present) Tools: Python, Qiskit'),
 (0.7623053836758877, 0.74437404

In [None]:
query = " azure AND databricks"
strings_ranked_by_relatedness(query, df)


(('NameOnResume: seanreed\n\n2009 \n\n1990 \n\nDatabricks, Azure ML, Python, Pandas, Spark, PyTorch, TensorFlow, Keras, Git, Computer Vision, \nNatural Language Processing, Medical Image Segmentation, Deep Learning, Machine Learning, \nGLMs, SQL, Linux, Bayesian Statistics, Data Pipelines, GCP, AWS, Docker, Kubernetes, Pandas, \nNumPy, Random Forests, Gradient Boosting, SVMs, GLMs, Recommender Systems, Graph \nDatabases, Neo4j',
  'NameOnResume: seanreed\n\n● Built distributed Swin-UNETR encoder-decoder model using Azure ML, PyTorch, Docker, and \nLightning AI that can pretrain on unlabeled 3D CT scans, eliminating substantial future labeling \ntime and expense incurred by company radiologists. \nImplemented computer vision model from innovative academic research paper and completely \nrefactored model’s existing Python code to solve client’s business problem. \n\n● ',
  'NameOnResume: seanreed\n\nPython model to production environment and usage with Ray, Python, and PySpark on Azure \

## 3. Ask

With the search function above, we can now automatically retrieve relevant knowledge and insert it into messages to GPT.

Below, we define a function `ask` that:
- Takes a user query
- Searches for text relevant to the query
- Stuffs that text into a message for GPT
- Sends the message to GPT
- Returns GPT's answer
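
The stuffing step has one subtlety: retrieved sections are appended only while the whole message stays within a token budget. The sketch below isolates that logic under simplifying assumptions: `count_tokens` is a hypothetical stand-in that counts whitespace-separated words rather than real tokens, and `build_message` and the sample texts are made up for illustration.

```python
def count_tokens(text: str) -> int:
    """Hypothetical stand-in for a real tokenizer: counts whitespace-separated words."""
    return len(text.split())

def build_message(introduction: str, sections: list[str], question: str, token_budget: int) -> str:
    """Append sections in ranked order until the next one would exceed the budget."""
    message = introduction
    for section in sections:
        candidate = f'\n\nresume section:\n"""\n{section}\n"""'
        if count_tokens(message + candidate + question) > token_budget:
            break  # later sections are less relevant, so stop at the first overflow
        message += candidate
    return message + question

message = build_message(
    introduction="You are looking for skills in resumes",
    sections=["Name: 1 Python SQL", "Name: 2 Java R MATLAB", "Name: 3 contract law"],
    question="\n\nQuestion: who knows Python",
    token_budget=30,
)
print(message)
```

With this budget the third section is left out, which mirrors how a tight `token_budget` silently limits what GPT gets to see.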

In [None]:
GPT_MODEL = 'gpt-3.5-turbo'
def num_tokens(text: str, model: str = 'gpt-3.5-turbo') -> int:
    """Return the number of tokens in a string."""
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(text))


#using v1 search function
def query_message(
    query: str,
    df: pd.DataFrame,
    model: str,
    token_budget: int
) -> str:
    """Return a message for GPT, with relevant source texts pulled from a dataframe."""
    strings, relatednesses = strings_ranked_by_relatedness(query, df)
    logger.info(f"Strings Found From Search\n:{strings}\n Relatednesses:{relatednesses}\n")
    introduction = ' You are a Human Resources agent looking for skills in resumes'
    question = f"\n\nQuestion: {query}"
    message = introduction
    for string in strings:
        next_article = f'\n\nresume section:\n"""\n{string}\n"""'
        if (
            num_tokens(message + next_article + question, model=model)
            > token_budget
        ):
            break
        else:
            message += next_article
    return message + question

@logger.catch
def ask(
    query: str,
    df: pd.DataFrame = df,
    model: str = GPT_MODEL,
    token_budget: int = 4096 - 500,
    print_message: bool = False,
) -> str:
    """Answers a query using GPT and a dataframe of relevant texts and embeddings."""
    message = query_message(query, df, model=model, token_budget=token_budget)
    logger.info(f"{message}")
    content = "Construct a list of Name fields from the documents given. Remove all duplicates from the list"
    messages = [
        {"role": "system", "content": content},
        {"role": "user", "content": message},
    ]
    response = openai.ChatCompletion.create(
        model=model,
        messages=messages,
        temperature=0
    )
    response_message = response["choices"][0]["message"]["content"]
    return response_message



In [None]:
ask('who knows law')

2023-05-09 16:35:14.433 | INFO     | __main__:query_message:17 - Strings Found From Search
:('Name: 110\n\nIntroduction to the American Legal System, Constitutional Law, American Constitution, Eur opean Politics  \n \nRELEVANT EXPERIENCE Whitestone Chambers, Legal Intern London, UK             January -April 2022  ', 'Name: 190\n\nIntroduction to the American Legal System, Constitutional Law, American Constitution, Eur opean Politics  \n \nRELEVANT EXPERIENCE Whitestone Chambers, Legal Intern London, UK             January -April 2022  ', 'Name: 81\n\nCompeted as a pre-trial attorney in a mock criminal law case \n●\nAnalyzed legal documents pertaining to the case and argued for the permittance of evidence \nutilizing Supreme Court cases \nLeapStart\nCupertino, CA \nTeacher\nJune 2020- June 2022 \n●')
 Relatednesses:(0.7921320362469393, 0.7905199568274273, 0.7850107571639804)
2023-05-09 16:35:14.437 | INFO     |

'The following individuals have knowledge of law based on their resumes:\n\n- 110\n- 190\n- 81'

In [None]:
ask('who majored in physics')

2023-05-09 16:37:22.998 | INFO     | __main__:query_message:17 - Strings Found From Search
:('Name: 5\n\nB.S. in Physics and B.S. in Financial Math & Statistics \n●\nCoursework:\nmultivariable calculus,\nprobability and\nstatistics,\nlinear algebra,\nODE&PDEs,\ncomplex analysis, numerical methods, regression, stochastic process, machine learning \n●\nHonors/Awards:', "Name: 30\n\nB.S. in Material Science and Engineering, Minor in Economics \n●\nCoursework:\nstochastic process, probability theory,\nlinear algebra, MLE, machine learning,\npartial differential equation, corporate finance, game theory, Hamilton's equations,", 'Name: 130\n\nBachelor of Science in General Science, Minor in Business Administration, GPA: 3.42\nHigh School Diploma:\nSt. Jean Baptiste High School\nJune 2017\nRelevant Coursework\nGeneral Chemistry, Macroeconomics, Computer Science, Philosophical Ethics\nAwards\nFordham Jogues Scholarship, Fordham ASPIRES Scholarship

'The first resume section mentions a B.S. in Physics, so the person in that section majored in physics.'

In [None]:
len(clean_texts)

191