Sample app intended to illustrate how Natural Language Processing (NLP), specifically Named Entity Recognition (NER), can be used to improve the accuracy of Elasticsearch queries and the overall user experience. There are two key benefits to using NLP alongside Elasticsearch (or any other full-text search engine):
Given the query 'black jacket costing less than $200', we can infer the color and maximum price, and apply these search filters for the user. This concept can be extended to other fields (e.g. brand) and can also support conjunctions, e.g. 'black or dark green barbour jacket'.
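A minimal sketch of that extraction idea, using regexes as a stand-in for the real NER model (the color vocabulary and function name are made up for illustration):

```python
import re

# Hypothetical stand-in for the NER model: pull a maximum price and any
# known colours out of a free-text query using simple pattern matching.
KNOWN_COLORS = {"black", "green", "navy"}  # assumed vocabulary

def parse_query(text: str) -> dict:
    price_to = None
    m = re.search(r"(?:less than|under)\s*\$?(\d+)", text, re.IGNORECASE)
    if m:
        price_to = int(m.group(1))
    colors = [w for w in text.lower().split() if w in KNOWN_COLORS]
    return {"text": text, "price_to": price_to, "colors": colors}

print(parse_query("black jacket costing less than $200"))
# {'text': 'black jacket costing less than $200', 'price_to': 200, 'colors': ['black']}
```

The extracted values can then be turned directly into Elasticsearch filters rather than being matched as free text.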
Imagine you work for an outdoor clothing and equipment store, and you're building a catalog search feature. Given the query 'packable jacket', how should the database choose between a 'packable mosquito net' and a 'lightweight jacket'? Both products partially match. TF-IDF will most likely select the mosquito net, as there will be fewer instances of 'packable' than 'jacket' in the corpus. However, looking at the query it's clear that the lightweight jacket would be the better match.
We typically solve this problem by boosting certain document fields, e.g. by attaching more weight to the title or product-type fields than to the description. This sort of works, but the logic is wrong: we're essentially telling the shopper "based on what we sell, this is what we think is important to you".
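Field boosting typically looks something like this (a sketch; the field names and `^` weights are illustrative, not taken from this repo):

```python
# Illustrative Elasticsearch query body for field boosting: matches in
# title count three times as much as matches in description. The weights
# are arbitrary guesses about what shoppers care about.
boosted_query = {
    "query": {
        "multi_match": {
            "query": "packable jacket",
            "fields": ["title^3", "product_type^2", "description"],
        }
    }
}
```

Note how the weights encode assumptions about the catalog, not the shopper's intent.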
Humans understand that, given the query 'packable jacket', the shopper wants a jacket first and foremost. That's because we understand that 'jacket' is a product type and 'packable' is an attribute of the product. Natural Language Processing (NLP) allows us to apply this same reasoning programmatically. In simple terms, we can perform an Elasticsearch bool query in which we must have a match for 'jacket' and should have a match for 'packable'.
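In Elasticsearch terms, that reasoning translates to something like the following sketch (field names mirror the sample documents in this repo):

```python
# Sketch of the bool query: the product type is a hard requirement
# (must), while attributes like 'packable' only boost relevance (should).
bool_query = {
    "query": {
        "bool": {
            "must": [{"match": {"product_type": "jacket"}}],
            "should": [{"match": {"attrs": "packable"}}],
        }
    }
}
```

A mosquito net can no longer outrank a jacket, because only jackets satisfy the `must` clause.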
Firstly, and most importantly, this is not a production implementation. The NLP model used for this example is very basic. For production use we'd build something far more robust, trained on historic search data. We'd also employ Part-of-Speech Tagging along with Dependency Parsing to get a better understanding of the sentences and fragments of text.
Secondly, the Elasticsearch code is very basic. For production use we'd want custom tokenizers, analysers and synonyms. Of course, we'd also have many more fields and lots more documents.
Finally, there's no error handling!
So please treat this in the spirit in which it was created - a proof of concept!
- Setup your environment
- Fire up an Elasticsearch instance
- Create the index and mapping
- Import some test data
- Fire up a simple webserver to handle search queries
- Cleanup
The Python code needs a 3.9.7+ environment. I recommend running it in a virtualenv using either venv or pyenv/virtualenv:
$ pyenv install 3.9.7
$ pyenv virtualenv 3.9.7 nlp-search-poc
$ pyenv local nlp-search-poc
$ pip install -U pip
$ pip install -r requirements.txt
I've provided a docker-compose.yml file, so you can fire up a simple Elasticsearch instance:
$ docker-compose up -d elasticsearch-7
Python dependencies and paths can be tricky, so I've provided a simple utility to check that everything is working as expected. Note: Elasticsearch can take a few seconds to come online.
$ python -m src.tools ping
Elasticsearch alive: True
$ python -m src.tools create
productRepository INFO Creating products index
productRepository INFO products created
$ python -m src.tools ingest
productRepository INFO Ingesting lightweight black jacket
productRepository INFO Ingesting midweight black jacket
...
I've created a wrapper shell script to fire up uvicorn/FastAPI:
$ bin/server.sh
uvicorn.error INFO Uvicorn running on http://127.0.0.1:8000 (Press CTRL+C to quit)
...
Make a GET request to http://localhost:8000, passing a JSON body:
{
"query": "lightweight black jacket less than $100"
}
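From Python, building and sending that request looks roughly like this (a sketch; the actual `requests` call is commented out because it assumes the server from bin/server.sh is running locally):

```python
import json

# Build the JSON body shown above.
payload = {"query": "lightweight black jacket less than $100"}
body = json.dumps(payload)
print(body)
# {"query": "lightweight black jacket less than $100"}

# To actually send it (uncomment with the server running):
# import requests
# resp = requests.get("http://localhost:8000", json=payload, timeout=5)
# print(resp.json())
```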
Postman is probably the best tool for this, but I've also included a simple client:
$ python -m src.client 'lightweight black jacket less than $100'
{
"ner_prediction": {
"text": "lightweight black jacket less than $100",
"product": "jacket",
"price_from": null,
"price_to": 100,
"colors": [
"black"
],
"attrs": [
"lightweight"
]
},
"results": [
{
"title": "lightweight black jacket",
"product_type": "jacket",
"price": 100,
"colors": [
"black"
],
"attrs": [
"lightweight"
]
}
]
}
Important: if you choose to use this script, you should enclose your search query in single quotes to avoid shell variable expansion.
Hit Ctrl + C
Don't worry about the asyncio.exceptions.CancelledError - it's caused by the hot-reload feature of the uvicorn server.
$ python -m src.tools drop
productRepository INFO Dropping products index
productRepository INFO products dropped
$ docker-compose down
Stopping elasticsearch-7 ... done
Removing elasticsearch-7 ... done
Removing network nlp-search-poc_default
I've provided a Dockerfile in case you want to run everything inside Docker:
$ docker build -t nlp-search-poc .
Then run Elasticsearch and the server:
$ docker-compose up -d
If you also want to use Docker to ingest the test data into Elasticsearch, you can do so:
$ docker run -it --rm --network nlp-search-poc_default -e "ELASTIC_SEARCH_HOST=elasticsearch-7" nlp-search-poc "python" "-m" "src.tools" "reset"
Note: the network name is determined by Docker's networking rules.
docker-compose.yml exposes the server's port 8000, so you can query as before:
$ python -m src.client 'packable jacket'