# Evaluation of a Hybrid Rules-Based and NER Approach for Search Query Interpretation

## Objective

The goal of this project was to evaluate whether an in-house hybrid Rules-Based and Named Entity Recognition (NER) approach could replace or supplement the existing LLM/Vector algorithm that translates user input from a search box into structured filter settings for search results.

## Tag Analysis

If the hybrid approach cannot identify every tag, we should keep using the existing tool, since it would still be needed for the remaining tags.

*   **JOB_TITLE** - A check against a list of known job titles. Said list would need updating periodically. (An LLM would need to be updated just as often.) [Dataset of 70k job titles from GitHub.](https://github.com/jneidel/job-titles)

*   **COMPANY** - Checked against a list of known companies. This list would require periodic updates. I used the People Data Labs dataset, although it dates back to 2017. Because it contains 7 million company names, I split it into chunks, built a separate trie for each chunk, and used Aho-Corasick for pattern matching.

*   **INDUSTRY** - While we could derive the industry from the company list, this would only provide industries where candidates have worked, not those they are seeking. To determine the desired industry, an LLM would be more appropriate due to the varied ways industries can be implied.

*   **EXPERIENCE_LEVEL** - Challenging to capture with Regex or NER due to the numerous ways to express experience (senior, lead, years of experience, etc.). An LLM would excel in this area.

*   **LOCATION** - Can be reliably identified using SpaCy NER. I would recommend cross-referencing with the user's profile location to resolve ambiguities for places with identical names.

*   **DISTANCE (from LOCATION)** - Can be identified with Regex, but not reliably. As a numeric value, it can be confused with experience (years) and salary.

*   **SALARY** - Can be identified with Regex and existing NER models like LPDoctor, but not reliably. There are numerous ways to express salary. If successful, additional code would be needed to convert it to annual salary, but this is straightforward.

*   **WORK_TYPE, JOB_TYPE** - Both are straightforward to identify due to limited options (remote, hybrid, onsite) and (full-time, part-time, contract, internship). I checked if the text contained any of these terms.
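The keyword check for WORK_TYPE and JOB_TYPE can be sketched roughly as follows. This is a minimal illustration, not the code used in the evaluation: the names `WORK_TYPE_PATTERNS`, `JOB_TYPE_PATTERNS`, and `match_labels` are hypothetical, and the tolerance for spelling variants ("on-site", "full time", "intern") plus the word-boundary anchors are my assumptions on top of the plain term lists above.

```python
import re

# Canonical label -> pattern. The labels come from the option lists above;
# the spelling variants each pattern accepts are assumptions.
WORK_TYPE_PATTERNS = {
    "remote": r"\bremote\b",
    "hybrid": r"\bhybrid\b",
    "onsite": r"\bon[- ]?site\b",
}
JOB_TYPE_PATTERNS = {
    "full-time": r"\bfull[- ]?time\b",
    "part-time": r"\bpart[- ]?time\b",
    "contract": r"\bcontract\b",
    "internship": r"\bintern(?:ship)?\b",
}


def match_labels(text, patterns):
    """Return the labels whose pattern occurs in text, in dictionary order."""
    return [
        label
        for label, pattern in patterns.items()
        if re.search(pattern, text, re.IGNORECASE)
    ]


query = "Remote full time nurse"
print(match_labels(query, WORK_TYPE_PATTERNS))
print(match_labels(query, JOB_TYPE_PATTERNS))
```

The word boundaries keep "internal" or "contractor" from triggering a job type, which a plain substring check would not.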

## Methods

To evaluate the hybrid approach, I implemented several methods for extracting different filter types from search queries:

1. **Company Name Extraction**:
   - Created Aho-Corasick tries from a dataset of 7 million company names
   - Segmented the data into chunks and built separate tries for each chunk
   - Used exact pattern matching to find company names in search queries

2. **Named Entity Recognition (NER)**:
   - Used SpaCy's built-in NER model to extract GPE (Geo-Political Entity) for locations
   - Applied an external custom NER model (LPDoctor) to extract profession, facility, experience, and money entities

3. **Regex Pattern Matching**:
   - Developed regex patterns to extract:
     - Distance information (e.g., "within 10 miles")
     - Salary information with normalization to annual figures
     - Work types (remote, hybrid, onsite)
     - Job types (full-time, part-time, contract)

4. **Job Title Matching**:
   - Created a dataset of 70k job titles
   - Implemented exact string matching to find job titles in search queries
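To make the regex step more concrete, here is a minimal sketch of distance extraction and salary normalization. It is not the pattern set used in `process_candidates.py`. The function names, the requirement of a currency symbol, the accepted units, and the conversion factors (2,080 working hours and 260 working days per year) are all assumptions chosen for illustration.

```python
import re

# Distance needs an explicit unit here, so a bare number is never treated as a distance.
DISTANCE_RE = re.compile(r"(\d+(?:\.\d+)?)\s*(miles?|mi|km|kilometers?)\b", re.IGNORECASE)

# Salary needs a currency symbol; an optional "k" suffix and pay period may follow.
SALARY_RE = re.compile(
    r"[$£€]\s*(?P<amount>\d[\d,]*(?:\.\d+)?)\s*(?P<k>k\b)?"
    r"(?:\s*(?:/|per\s+|an?\s+)\s*(?P<period>hour|hr|day|week|month|year|yr|annum)\b)?",
    re.IGNORECASE,
)

# Assumed conversion factors: 40 h/week * 52 weeks, and 5 days/week * 52 weeks.
PERIODS_PER_YEAR = {
    "hour": 2080, "hr": 2080, "day": 260, "week": 52,
    "month": 12, "year": 1, "yr": 1, "annum": 1,
}


def extract_distance(text):
    """Return (value, unit) for the first distance found, or None."""
    match = DISTANCE_RE.search(text)
    if match is None:
        return None
    return float(match.group(1)), match.group(2).lower()


def extract_annual_salary(text):
    """Return the first salary found, converted to an annual figure, or None."""
    match = SALARY_RE.search(text)
    if match is None:
        return None
    amount = float(match.group("amount").replace(",", ""))
    if match.group("k"):
        amount *= 1000
    # With no stated period, the amount is assumed to be annual already.
    period = (match.group("period") or "year").lower()
    return amount * PERIODS_PER_YEAR[period]


print(extract_distance("jobs within 10 miles of Austin"))
print(extract_annual_salary("nurse $25/hr"))
print(extract_annual_salary("5 years experience"))
```

Requiring a unit or a currency symbol avoids confusing distance, salary, and years of experience, but it also means a query such as "within 10 of Boston" or "80k" yields nothing, which is one reason these tags stay unreliable under a pure regex approach.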

The implementation details can be found in the following scripts:
- `create_tries.py`: Builds and pickles Aho-Corasick tries for company name matching
- `search_companies.py`: Searches text for company names using the tries
- `process_candidates.py`: Processes candidate searches using NER and regex patterns
- `built_in.py`: Contains various extraction functions using built-in tools
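For readers unfamiliar with the technique, the chunked Aho-Corasick matching behind the company scripts can be illustrated with a small pure-Python automaton. This is a sketch under stated assumptions, not the contents of `create_tries.py` or `search_companies.py`: the class and function names are hypothetical, the real scripts may rely on a library and on pickled tries, and the lowercasing and whole-word filter are my additions.

```python
from collections import deque


class AhoCorasick:
    """Minimal Aho-Corasick automaton over a list of patterns."""

    def __init__(self, patterns):
        self.goto = [{}]   # goto[state][char] -> next state
        self.fail = [0]    # failure link for each state
        self.out = [[]]    # patterns that end at each state
        for pattern in patterns:
            self._add(pattern)
        self._build_failure_links()

    def _add(self, pattern):
        state = 0
        for ch in pattern:
            if ch not in self.goto[state]:
                self.goto[state][ch] = len(self.goto)
                self.goto.append({})
                self.fail.append(0)
                self.out.append([])
            state = self.goto[state][ch]
        self.out[state].append(pattern)

    def _build_failure_links(self):
        # Breadth-first, so shallower states are finished before deeper ones.
        queue = deque(self.goto[0].values())
        while queue:
            state = queue.popleft()
            for ch, nxt in self.goto[state].items():
                queue.append(nxt)
                f = self.fail[state]
                while f and ch not in self.goto[f]:
                    f = self.fail[f]
                self.fail[nxt] = self.goto[f].get(ch, 0)
                self.out[nxt] = self.out[nxt] + self.out[self.fail[nxt]]

    def find(self, text):
        """Return (start_index, pattern) for every occurrence in text."""
        state, hits = 0, []
        for i, ch in enumerate(text):
            while state and ch not in self.goto[state]:
                state = self.fail[state]
            state = self.goto[state].get(ch, 0)
            hits.extend((i - len(p) + 1, p) for p in self.out[state])
        return hits


def build_chunked(names, chunk_size):
    """Build one automaton per chunk of (lowercased) company names."""
    names = [name.lower() for name in names]
    return [AhoCorasick(names[i:i + chunk_size]) for i in range(0, len(names), chunk_size)]


def find_companies(text, automata):
    """Return the company names that occur in text as whole words."""
    lowered = text.lower()
    found = set()
    for automaton in automata:
        for start, name in automaton.find(lowered):
            end = start + len(name)
            before_ok = start == 0 or not lowered[start - 1].isalnum()
            after_ok = end == len(lowered) or not lowered[end].isalnum()
            if before_ok and after_ok:
                found.add(name)
    return sorted(found)


automata = build_chunked(["Acme Corp", "Apple", "Globex"], chunk_size=2)
print(find_companies("senior engineer at Acme Corp or apple", automata))
```

Each automaton scans the query once regardless of how many names its chunk holds, so the cost per query grows with the number of chunks rather than the number of company names.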

```python
import polars as pl

# Load the candidate searches, which include the LPDoctor, SpaCy, and regex outputs as columns.
df = pl.read_parquet("./candidate_searches_lpdoctor.parquet")

# Place each existing filter column next to the extracted columns for the same tag.
column_order = [
    'text',
    'jobTitles', 'workTitle', 'TITLE_dataset',
    'industries', 'PROFESSION',
    'locations', 'LOCATION_SpaCy', 'DISTANCE_RegEx',
    'minWorkExperience', 'maxWorkExperience', 'experience', 'workHistory', 'EXPERIENCE',
    'jobTypes', 'JOB_TYPE_RegEx',
    'WORK_TYPE_RegEx',
    'ANNUAL_SALARY_RegEx', 'MONEY',
    'educationSubject', 'educationHistory',
    'competencySkills', 'requiredCompetencySkills',
    'workplaces', 'FACILITY', 'COMPANY_SpaCy', 'companies',
    'jobDescription'
]

df_reordered = df.select(column_order)
df_reordered.head()
```

## Summary

Of the 9 tags we aimed to identify, SALARY, EXPERIENCE_LEVEL, and DISTANCE proved the least reliable with existing methods. INDUSTRY can be derived from the companies in a candidate's work history, but that reflects where they have worked, not the industry they are seeking.

Developing an NER model to identify these tags would be technically possible, but not practical for me: training it would require thousands of accurately labeled examples for each tag. Our current solution already tags filter settings, albeit without spans indicating their location in the text.

I would prefer the existing system to handle the initial annotation, but there is a risk that using an external LLM could expose user data, and running a local LLM was computationally intensive on my machine.

If data privacy is not an issue, we could ask the current system to tag the data with spans, but we would then need to verify the results across thousands of entries. Its failure rate would likely be the same as when it sets search filters.

I believe that if it were possible to build a reliable NER model for these tags with my resources, one would already exist. Other companies have developed such models, but they are not publicly available, and these companies possess significantly more resources. See [a paper from ZipRecruiter detailing their challenges in building an in-house model](https://drive.google.com/file/d/1RIVLhpQSBqiZ7aYmi1SgcWdOaEswlR37/view) and [a Medium article about the same work](https://medium.com/@ziprecruiter.engineering/named-entity-recognition-ner-of-short-unstructured-job-search-queries-6b265ec0fb).

Developing a new NER model would cost engineering hours and therefore R&D budget. The free models trained for similar tasks are unreliable, which suggests that a custom model would not perform significantly better without an investment in accurately tagged data and in compute for training. Our existing solution functions adequately.

### Cost-Saving Ideas

(If we're using the same vector method as the rest of Obra to create the search filter settings, ignore this.)

Assuming we're identifying filter settings via an LLM, we could reduce costs by switching to a "Light" model if one is not already in use. This would be faster and cheaper. The trade-off would be potentially lower accuracy and reduced capability to handle complex queries.

User searches are typically short and simple, so the impact of a lighter model's limits depends mainly on the complexity of the prompt we use to obtain filter settings. If the switch does not result in more missed filter settings than before, it would be beneficial.
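One way to check whether a lighter model misses more filter settings is to score both models against the same hand-verified queries. The sketch below assumes such a labeled set exists. The function name, the dictionary layout, and the sample values are hypothetical and are not measured results.

```python
def missed_filter_rate(expected, predicted):
    """Fraction of expected (tag, value) pairs that the predictions fail to reproduce."""
    total = 0
    missed = 0
    for expected_filters, predicted_filters in zip(expected, predicted):
        for tag, value in expected_filters.items():
            total += 1
            if predicted_filters.get(tag) != value:
                missed += 1
    return missed / total if total else 0.0


# Made-up example: two queries with hand-verified filter settings.
expected = [
    {"WORK_TYPE": "remote", "JOB_TYPE": "full-time"},
    {"LOCATION": "Austin"},
]
current_model = [
    {"WORK_TYPE": "remote", "JOB_TYPE": "full-time"},
    {"LOCATION": "Austin"},
]
light_model = [
    {"WORK_TYPE": "remote"},
    {"LOCATION": "Austin"},
]

print(missed_filter_rate(expected, current_model))
print(missed_filter_rate(expected, light_model))
```

This only counts settings that were expected but not produced. Spurious extra filters would need a separate precision-style check.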

We could also run an LLM locally, but I assume we've already done a cost-benefit analysis comparing three options: hosting one in the office, running our own on a rented server, and simply using a provider.