# Minimalist semantic search

This notebook presents a proof of concept for applying very basic principles of semantic search to a search query, converting it into a more specific query for a search platform like [Solr](http://lucene.apache.org/solr/) or [Elasticsearch](https://www.elastic.co/).

To a limited extent, this can provide capabilities similar to commerce search platforms like the [Fredhopper discovery engine](https://www.attraqt.com/technology/), [FACT-finder](https://www.fact-finder.com/) or services like [Twiggle](https://twiggle.com/).

## Preparations

Install [Python 3.6+](https://www.python.org/). The code presented uses only the standard library, but it relies on f-strings, which are not available in earlier versions.

## About search platforms

Search platforms typically organize information as key-value pairs. For example, a product like a dress might be stored in a structure like:

In [1]:
%%capture
{
    "code": "A19/7983",
    "name": "wool and silk dress",
    "color": "red",
    "brand": "dior",
    "price_eur": "999",
    "description": "Sleeveless dress in red wool and silk"
}

Search platforms provide special query languages to search specific fields in such a structure. For example, the [Dismax query parser of Solr](https://lucene.apache.org/solr/guide/7_7/the-dismax-query-parser.html) can search for products matching the color red and a price below 1000 EUR using the expression:
> color:red price_eur:\[0 TO 1000\]
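
To illustrate how such an expression could reach the search platform, the following minimal sketch builds a request URL for Solr's `select` handler using only the standard library. The host, the port and the core name `products` are assumptions chosen for illustration, and `select_url` is a hypothetical helper; nothing is actually sent.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Hypothetical base URL: host, port and the core name "products" are
# assumptions for this sketch and need to match your own installation.
SOLR_SELECT_URL = 'http://localhost:8983/solr/products/select'


def select_url(query):
    """Return a URL that passes ``query`` as the ``q`` parameter."""
    # urlencode() percent-encodes characters such as ':' and '[' so the
    # expression survives transport as a URL parameter.
    return SOLR_SELECT_URL + '?' + urlencode({'q': query})


url = select_url('color:red price_eur:[0 TO 1000]')
print(url)

# Decoding the URL again yields the original expression.
print(parse_qs(urlsplit(url).query)['q'][0])
```

Encoding the query this way keeps the brackets and colons of the range expression intact, which matters as soon as the query is assembled by code rather than typed by hand.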

## Semantic search

A German customer might search for a red dress with a price below 1000 EUR using:

> rotes Kleid unter 1000 EUR

With standard search, this just searches for each of the terms individually. From the search platform's point of view, what was actually meant is:

> Kleid color:rot price_eur:\[0 TO 1000\]

Modern search platforms can already find out that `rotes` stems to `rot`, so we don't have to bother about that here. With stemming applied, our search query turns into:

> rot kleid unter 1000 EUR

## Mapping key terms to field expressions

To map key terms to fields we can use simple dictionaries:

In [2]:
FIELD_TO_TERM_MAP = {
    'color': ['blau', 'braun', 'gelb', 'grün', 'rot', 'schwarz', 'weiß'],
    'brand': ['apple', 'braun', 'dior', 'samsung', 'sony'],
    # ...
}

Notice that in German `'braun'` is both a brand and a color.

For efficient lookup we also need the reverse mapping:

In [3]:
TERM_TO_FIELDS_MAP = {}
for field, terms in FIELD_TO_TERM_MAP.items():
    for term in terms:
        if term in TERM_TO_FIELDS_MAP:
            TERM_TO_FIELDS_MAP[term].append(field)
        else:
            TERM_TO_FIELDS_MAP[term] = [field]
TERM_TO_FIELDS_MAP

{'blau': ['color'],
 'braun': ['color', 'brand'],
 'gelb': ['color'],
 'grün': ['color'],
 'rot': ['color'],
 'schwarz': ['color'],
 'weiß': ['color'],
 'apple': ['brand'],
 'dior': ['brand'],
 'samsung': ['brand'],
 'sony': ['brand']}
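
The same reverse mapping can be built a little more compactly with `dict.setdefault`. This is only an alternative sketch; it repeats a shortened `FIELD_TO_TERM_MAP` so that it runs on its own.

```python
# Shortened copy of the mapping above, so the sketch is self-contained.
FIELD_TO_TERM_MAP = {
    'color': ['blau', 'braun', 'rot'],
    'brand': ['braun', 'dior'],
}

TERM_TO_FIELDS_MAP = {}
for field, terms in FIELD_TO_TERM_MAP.items():
    for term in terms:
        # setdefault() returns the list already stored for the term, or
        # stores and returns a new empty one, so no explicit "if" is needed.
        TERM_TO_FIELDS_MAP.setdefault(term, []).append(field)

print(TERM_TO_FIELDS_MAP['braun'])
```

Both variants produce the same dictionary; which one reads better is a matter of taste.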

Now we can write a function that looks up each term of the search query in this dictionary and replaces any key term with the respective Lucene expression to search a specific field:

In [4]:
def resolved_standard_terms(search_query):
    result_parts = []
    for term in search_query.split():
        fields = TERM_TO_FIELDS_MAP.get(term)
        if fields is None:
            result_parts.append(term)
        else:
            for field in fields:
                result_parts.append(f'{field}:{term}')
    return ' '.join(result_parts)

resolved_standard_terms('rot kleid unter 1000 EUR')

'color:rot kleid unter 1000 EUR'
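
`resolved_standard_terms` only recognizes terms that are already in lower case and stemmed. If the search platform's stemmer is not available at this stage, a rough stand-in could strip common German adjective endings, but only when the result is a known key term. The following is a toy sketch under that assumption, not a real stemmer; `KNOWN_TERMS`, `ADJECTIVE_ENDINGS` and `normalized_term` are hypothetical names introduced for illustration.

```python
# Hypothetical vocabulary; in the notebook this could be the keys of
# TERM_TO_FIELDS_MAP.
KNOWN_TERMS = {'blau', 'braun', 'rot', 'dior'}

# Common German adjective endings. This list is an assumption for the
# sketch and far from a complete set of stemming rules.
ADJECTIVE_ENDINGS = ('es', 'er', 'em', 'en', 'e')


def normalized_term(term):
    """Reduce ``term`` to a known key term if possible."""
    lowered = term.lower()
    if lowered in KNOWN_TERMS:
        return lowered
    for ending in ADJECTIVE_ENDINGS:
        stem = lowered[:-len(ending)]
        if lowered.endswith(ending) and stem in KNOWN_TERMS:
            return stem
    # Unknown words such as "unter" or "EUR" are returned unchanged, so
    # later processing steps still see them as entered.
    return term


query = 'Rotes Kleid unter 1000 EUR'
print(' '.join(normalized_term(term) for term in query.split()))
```

Checking the candidate stem against the vocabulary avoids mangling words like `unter`, which also happens to end in `er`.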

When looking for a razor, the term `braun` is ambiguous, so we search both the color and the brand. This is acceptable because the color rarely matters for a razor.

In [5]:
resolved_standard_terms('braun rasierer')

'color:braun brand:braun rasierer'
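
Whether `color:braun brand:braun` finds a razor of the brand Braun in any color depends on how the search platform combines the two clauses: if every clause is required, only brown products of that brand match. A possible alternative is to group the candidates with the Lucene `OR` operator. The sketch below assumes the platform accepts parentheses and `OR` in its query syntax; `resolved_ambiguous_terms` is a hypothetical variant of the function above and uses a shortened mapping so that it runs on its own.

```python
# Shortened copy of the reverse mapping, so the sketch is self-contained.
TERM_TO_FIELDS_MAP = {
    'braun': ['color', 'brand'],
    'rot': ['color'],
    'dior': ['brand'],
}


def resolved_ambiguous_terms(search_query):
    result_parts = []
    for term in search_query.split():
        fields = TERM_TO_FIELDS_MAP.get(term)
        if fields is None:
            # Not a key term: keep it as a plain search term.
            result_parts.append(term)
        elif len(fields) == 1:
            result_parts.append(f'{fields[0]}:{term}')
        else:
            # Ambiguous key term: any of the fields may match.
            alternatives = ' OR '.join(f'{field}:{term}' for field in fields)
            result_parts.append(f'({alternatives})')
    return ' '.join(result_parts)


print(resolved_ambiguous_terms('braun rasierer'))
```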

## Mapping phrases to field queries

To handle terms composed of multiple words, we can use regular expressions to map them to a replacement term. First, let's compose a regular expression that separates the search query into groups of text that should be preserved and parts that should be converted to a field expression for the search platform:

In [6]:
import re
match = re.match(
    r'(.*\b)(unter\s+)(\d+)(\s+EUR)(\b.*)', 
    'rot kleid unter 1000 EUR dior')

match.groups()

('rot kleid ', 'unter ', '1000', ' EUR', ' dior')

With this we can build a mapping to replacement terms. Notice that references like `\1` or `\3` in the replacement refer to the value stored in the respective group of the regular expression.

In [7]:
re.sub(
    r'(.*\b)(unter\s+)(\d+)(\s+EUR)(\b.*)', 
    r'\1price_eur:[0 TO \3]\5',
    'rot kleid unter 1000 EUR dior'
)

'rot kleid price_eur:[0 TO 1000] dior'

We can collect multiple regular expressions and their replacements in a map and build a function to apply all of them:

In [8]:
REGEX_TO_REPLACEMENT_MAP = {
    r'(.*\b)(unter\s+)(\d+)(\s+EUR)(\b.*)': r'\1price_eur:[0 TO \3]\5',
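    # "ab" means "starting from", so the upper bound is left open with "*"
    # instead of a hard-coded maximum price.
    r'(.*\b)(ab\s+)(\d+)(\s+EUR)(\b.*)': r'\1price_eur:[\3 TO *]\5',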
}

def resolved_expressions(search_query):
    result = search_query
    for regex, replacement in REGEX_TO_REPLACEMENT_MAP.items():
        result = re.sub(regex, replacement, result)
        # FIXME: We might need some logic to prevent already replaced
        # parts to be replaced again. This is just a proof of concept.
    return result

resolved_expressions('rot kleid unter 1000 EUR dior')

'rot kleid price_eur:[0 TO 1000] dior'
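
Because `re.sub` only replaces the part of the text that matches, the leading `(.*\b)` and trailing `(\b.*)` groups are not strictly needed. The following sketch shows a variant with shorter patterns that are compiled once and ignore case. `PATTERN_TO_REPLACEMENT` and `resolved_price_ranges` are hypothetical names, and the open upper bound `*` assumes Lucene range syntax as supported by Solr.

```python
import re

# Each entry pairs a compiled pattern with its replacement. Only the
# number is captured, so the replacement refers to it as \1.
PATTERN_TO_REPLACEMENT = [
    (re.compile(r'\bunter\s+(\d+)\s+EUR\b', re.IGNORECASE),
     r'price_eur:[0 TO \1]'),
    (re.compile(r'\bab\s+(\d+)\s+EUR\b', re.IGNORECASE),
     r'price_eur:[\1 TO *]'),
]


def resolved_price_ranges(search_query):
    result = search_query
    for pattern, replacement in PATTERN_TO_REPLACEMENT:
        # Text outside the match is left untouched by sub().
        result = pattern.sub(replacement, result)
    return result


print(resolved_price_ranges('rot kleid unter 1000 EUR dior'))
print(resolved_price_ranges('tasche ab 50 eur'))
```

Shorter patterns are easier to extend with further phrases, and since the generated expressions no longer contain `unter` or `ab`, these two rules cannot match their own output again.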

And now let's combine this in a single function that converts a manually entered search query to a more structured query with some of the semantic information already resolved:

In [9]:
def semantic_query(search_query):
    return resolved_standard_terms(resolved_expressions(search_query))

semantic_query('rot kleid unter 1000 EUR dior')

'color:rot kleid price_eur:[0 TO 1000] brand:dior'

## Conclusion

This notebook demonstrated how a few Python dictionaries, regular expressions and string operations can convert an unstructured search query into a more specific one where some of the semantic information is already resolved and turned into expressions for a search platform.