A search engine for a database of ~477 companies. You type a plain-English query like "pharmaceutical companies in Romania" and it finds the best matches, ranked by how well each company truly fits — not just by keywords.
Instead of a simple keyword search (which would just look for the word "pharmaceutical" in a text field), this system understands your query and scores each company based on what it actually does for a living.
Every query goes through four stages. Each stage is cheaper and faster than the one after it, and each stage throws out bad matches before passing the rest forward.
Your query
│
▼
┌──────────────────────────────────────────────────────┐
│ Stage 0 — Understand the query (< 1ms) │
│ Regex + keyword matching │
│ "pharmaceutical in Romania" → │
│ country = Romania │
│ industry = Pharmaceutical │
└─────────────────────┬────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────┐
│ Stage 1 — Hard filters (< 1ms) │
│ Plain data matching, no AI 477 → ~26 │
│ "country = Romania" → keep only the │
│ 26 Romanian companies │
└─────────────────────┬────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────┐
│ Stage 2A — Semantic search (instant) │
│ Pre-built FAISS vector index │
│ Finds the most semantically similar companies │
│ using sentence embeddings (local model) │
│ 26 → top 26 re-ranked by relevance │
└─────────────────────┬────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────┐
│ Stage 2B — AI scoring (~4s) │
│ Llama-3.1-8B reads each company and │
│ scores it 0-10 for the query │
│ Uses NAICS industry code as primary signal │
└─────────────────────┬────────────────────────────────┘
│
▼
Final ranked list
(only green ≥ 70% shown)
The system reads your query using regex and keyword matching (no API call, instant). It extracts:
- Country — "Romania" → `ro`, "France" → `fr`, "Scandinavia" → `[se, no, dk]`
- Employee count — "more than 500 employees" → `min_employees = 500`
- Revenue — "revenue over $50M" → `min_revenue = 50000000`
- Public/private — "public companies" → `is_public = true`
- Industry — "pharmaceutical" → industry rule for the AI scorer
- Cities — "companies in Bucharest" → city-level filter

Example: "pharmaceutical companies in Romania" → `{ country: "ro", industry: "Pharmaceutical", criteria: ["primary business is pharma manufacturing or drug distribution (NOT generic chemicals)"] }`
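The extraction step above can be sketched in a few lines. This is a minimal illustration, not the project's actual code: the lookup tables, the `extract_intent` name, and the exact regex are hypothetical stand-ins for what `pipeline/intent_extractor.py` does.

```python
import re

# Hypothetical lookup tables; the real extractor has far more entries.
COUNTRY_CODES = {"romania": "ro", "france": "fr"}
INDUSTRY_KEYWORDS = {"pharmaceutical": "Pharmaceutical", "pharma": "Pharmaceutical"}

def extract_intent(query: str) -> dict:
    """Turn a plain-English query into a structured intent dict."""
    q = query.lower()
    intent = {}
    for name, code in COUNTRY_CODES.items():
        if name in q:
            intent["country"] = code
    for kw, label in INDUSTRY_KEYWORDS.items():
        if kw in q:
            intent["industry"] = label
    m = re.search(r"more than (\d+) employees", q)
    if m:
        intent["min_employees"] = int(m.group(1))
    return intent

print(extract_intent("pharmaceutical companies in Romania"))
# {'country': 'ro', 'industry': 'Pharmaceutical'}
```

Because this is pure string matching, it runs in well under a millisecond and never needs an API call.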
Pure data matching using pandas. No AI involved. If you said "Romania", only Romanian companies survive. If you said "more than 500 employees", companies with fewer are removed.
This alone can cut 477 companies down to 20-50 before the expensive stages even run.
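In pandas terms, Stage 1 is just boolean indexing. A toy sketch under assumed column names (`country_code`, `employee_count`, matching the walkthrough later in this document); `apply_filters` is a hypothetical helper, not the real `structured_filter.py`:

```python
import pandas as pd

# Three made-up rows standing in for the 477-company dataset.
df = pd.DataFrame([
    {"operational_name": "Fildas Trading", "country_code": "ro", "employee_count": 3000},
    {"operational_name": "Unilever",       "country_code": "gb", "employee_count": 120000},
    {"operational_name": "Small RO Shop",  "country_code": "ro", "employee_count": 12},
])

def apply_filters(df: pd.DataFrame, intent: dict) -> pd.DataFrame:
    """Keep only rows matching every hard constraint in the intent."""
    out = df
    if "country" in intent:
        out = out[out["country_code"] == intent["country"]]
    if "min_employees" in intent:
        # Note: rows with a null employee_count drop out of this comparison.
        out = out[out["employee_count"] >= intent["min_employees"]]
    return out

qualified = apply_filters(df, {"country": "ro", "min_employees": 500})
print(qualified["operational_name"].tolist())  # ['Fildas Trading']
```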
Every company's name, description, industry label, and offerings are turned into a vector (a list of 384 numbers that captures the meaning of the text) using a local AI model called all-MiniLM-L6-v2.
This is done once at startup and cached. At query time, your query is also turned into a vector, and the system finds the companies whose vectors are closest to yours — these are the most semantically relevant ones.
Think of it like: instead of searching for the exact word "pharmaceutical", it understands that "drug distributor", "medicine manufacturer", and "pharma wholesaler" all mean the same thing.
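The "closest vectors win" idea can be shown with plain cosine similarity. The real pipeline uses 384-dimensional all-MiniLM-L6-v2 embeddings and a FAISS index; the 3-dimensional vectors below are fabricated purely to illustrate the ranking step:

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: 1.0 means identical direction (same meaning)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy embeddings (the real ones are 384 numbers from all-MiniLM-L6-v2).
query_vec = np.array([0.9, 0.1, 0.0])               # "pharmaceutical distributor"
company_vecs = {
    "Fildas Trading": np.array([0.8, 0.2, 0.1]),    # pharma wholesaler
    "Unilever":       np.array([0.1, 0.9, 0.2]),    # consumer goods
}

# Rank companies by how close their vector is to the query's.
ranked = sorted(company_vecs,
                key=lambda name: cosine(query_vec, company_vecs[name]),
                reverse=True)
print(ranked[0])  # Fildas Trading — nearest vector, most semantically similar
```

FAISS does exactly this comparison, just against thousands of pre-built vectors at once instead of a Python loop.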
The remaining companies are sent in parallel batches to Llama-3.1-8B (a fast open-source language model hosted on featherless.ai). The model reads each company's name, NAICS industry code, and description, then scores it 0–10 based on how well it matches your query.
The scoring uses hard rules to avoid common mistakes:
- For a pharmaceutical query: if the company's NAICS code doesn't contain "pharmaceutical/drug/medicine" — maximum score of 2, regardless of description
- For a logistics query: energy/oil/gas companies score 0, even if they "distribute" a product
- The LLM focuses on what the company IS, not who its customers are
Batches run in parallel (all at the same time), so scoring 26 companies in 3 batches takes about the same time as scoring 1 batch.
final_score = (llm_score / 10) + (embedding_similarity × 0.01)
The embedding similarity acts as a tiny tiebreaker between companies that got the same LLM score. The LLM score is what actually matters.
Only companies scoring ≥ 70% (green) are shown. If nothing scores that high, the top 3 are shown instead.
Here is exactly what happens from the moment you press Search to the moment results appear.
POST /qualify → { query: "pharmaceutical companies in Romania", top_n: 20 }
main.py receives this and calls qualify().
Regex and keyword lists read your query. No AI, no API call, runs in under 1ms.
It figures out:
- "Romania" → filter by country code `ro`
- "pharmaceutical" → industry = `Pharmaceutical`, and writes a strict rule for the AI scorer: "must be drug/medicine company, NOT generic chemicals"
Result: a structured JSON object called the intent.
Takes all 477 companies and applies dead-simple pandas filters:
- `country_code == "ro"` → 26 companies survive
- If you had said "more than 500 employees" it would also filter by that
- No AI, runs in under 1ms
Now we have 26 companies instead of 477.
At startup, every company's text (name + description + industry label) was converted into a list of 384 numbers called an embedding — a mathematical fingerprint of its meaning. These are stored in a FAISS index in memory.
Your query is also converted to 384 numbers. FAISS then finds which of the 26 Romanian companies have the closest fingerprints to your query.
This takes practically no time because the index was built at startup and just sits in RAM.
The 26 companies are split into batches of 10. All batches are sent at the same time (parallel threads) to Llama-3.1-8B on featherless.ai.
For each company the AI sees:
[1] Fildas Trading | NAICS: Drugs and Druggists' Sundries Merchant Wholesalers
Romania's largest pharmaceutical distributor...
[2] Unilever | NAICS: Other Chemical and Allied Products Merchant Wholesalers
Consumer goods company...
It also receives a hard rule injected into the prompt:
"If NAICS does NOT contain 'pharmaceutical/drug/medicine' — max score 2"
It replies with just scores:
1:9
2:1
All batches finish in ~4 seconds because they run at the same time, not one after another.
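The fan-out pattern behind that timing can be sketched with `concurrent.futures`. This is an assumed shape, not the real `rag_filter.py`: `score_batch` stands in for the featherless.ai call and just fabricates a reply in the "index:score" format the model returns:

```python
from concurrent.futures import ThreadPoolExecutor

def parse_scores(reply: str) -> dict:
    """Parse the model's 'index:score' lines, e.g. '1:9\n2:1'."""
    scores = {}
    for line in reply.strip().splitlines():
        idx, score = line.split(":")
        scores[int(idx)] = int(score)
    return scores

def score_batch(batch: list) -> dict:
    # Stand-in for the LLM call: a real implementation would POST the
    # batch prompt to the API and get back one score line per company.
    reply = "\n".join(f"{i + 1}:5" for i in range(len(batch)))
    return parse_scores(reply)

companies = [f"company-{i}" for i in range(26)]
batches = [companies[i:i + 10] for i in range(0, len(companies), 10)]

# All batches in flight at once, so wall-clock time ~= one batch's latency.
with ThreadPoolExecutor(max_workers=len(batches)) as pool:
    results = list(pool.map(score_batch, batches))

print(len(batches), [len(r) for r in results])  # 3 [10, 10, 6]
```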
Scores are combined:
final_score = (llm_score / 10) + (embedding_similarity × 0.01)
- `llm_score / 10` — turns the 0–10 score into 0.0–1.0
- `embedding_similarity × 0.01` — tiny tiebreaker for companies with the same LLM score
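Worked out in code (the candidate names and similarity values below are invented for illustration):

```python
def final_score(llm_score: int, embedding_similarity: float) -> float:
    """Combine LLM score (dominant) with similarity (tiebreaker)."""
    return (llm_score / 10) + (embedding_similarity * 0.01)

candidates = [
    {"name": "Fildas Trading", "llm": 9, "sim": 0.81},
    {"name": "Pharma SRL",     "llm": 9, "sim": 0.62},  # same LLM score
    {"name": "Unilever",       "llm": 1, "sim": 0.74},
]
ranked = sorted(candidates,
                key=lambda c: final_score(c["llm"], c["sim"]),
                reverse=True)
print([c["name"] for c in ranked])
# ['Fildas Trading', 'Pharma SRL', 'Unilever'] — similarity breaks the 9-vs-9 tie
```

Because the similarity term is scaled by 0.01, it can never outweigh even a one-point difference in LLM score: 0.9081 vs 0.9062 vs 0.1074 here.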
Companies are sorted highest to lowest, converted to JSON, and sent back to your browser.
The frontend receives all scored companies and shows only those with score ≥ 70% (green). If nothing qualifies, the top 3 are shown so the page is never empty.
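The display rule is simple enough to sketch. The actual logic lives in the React frontend (`App.jsx`); this hypothetical `select_display` helper just mirrors it in Python for clarity:

```python
def select_display(results: list, threshold: float = 0.70, fallback: int = 3) -> list:
    """Show green (>= 70%) matches; otherwise fall back to the top 3
    so the page is never empty. Assumes results are sorted high-to-low."""
    green = [r for r in results if r["score"] >= threshold]
    return green if green else results[:fallback]
```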
0ms → intent extracted (regex, instant)
1ms → 477 companies filtered to ~26 (pandas)
2ms → FAISS ranks the 26 by semantic relevance (pre-built index)
~4s → Llama-3.1-8B scores all 26 in parallel
~4s → results appear in your browser
CompanyQualification/
├── frontend/src/App.jsx — React UI (search box, company cards, score rings)
└── solution/
├── data/companies.jsonl — 477 companies (name, description, NAICS, address, etc.)
└── backend/
├── main.py — FastAPI server (HTTP endpoints)
├── config.py — API keys, model names, knobs
├── requirements.txt — Python dependencies
└── pipeline/
├── intent_extractor.py — Stage 0: parse the query
├── structured_filter.py — Stage 1: hard pandas filters
├── rag_filter.py — Stage 2: FAISS + LLM scoring
└── qualify.py — Orchestrates all stages, builds final response
cd solution/backend
python -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate
pip install -r requirements.txt
./venv/bin/python -m uvicorn main:app --host 0.0.0.0 --port 8000

The first startup takes ~20 seconds — it downloads the embedding model and builds the FAISS index. After that it's instant.
cd frontend
npm install
npm run dev

Open http://localhost:5173.
Each company in companies.jsonl has:
| Field | What it is |
|---|---|
| `operational_name` | Company name |
| `description` | What the company does |
| `primary_naics` | Industry classification code + label (e.g. "Pharmaceutical Preparation Manufacturing") |
| `address` | Country, city |
| `employee_count` | Number of employees (often null) |
| `revenue` | Annual revenue in USD (often null) |
| `year_founded` | When the company was founded |
| `is_public` | Whether the company is publicly listed |
| `business_model` | e.g. B2B, SaaS, marketplace |
| `core_offerings` | Main products/services |
| `target_markets` | Who they sell to |
The system is only as good as the data. If you search for "pharmaceutical companies in Romania" and only one comes back, that means only one Romanian company in the database has a pharmaceutical NAICS code. The AI isn't wrong — the data just doesn't have more.
| Layer | Technology |
|---|---|
| Frontend | React + Vite |
| Backend | FastAPI (Python) |
| Embeddings | all-MiniLM-L6-v2 via sentence-transformers (local) |
| Vector search | FAISS (local, in-memory) |
| LLM scoring | Llama-3.1-8B via featherless.ai (OpenAI-compatible API) |
| Intent parsing | Regex + keyword matching (no API, instant) |