This notebook works through the Stack Overflow question http://stackoverflow.com/questions/42301375/mapping-international-character-to-multiple-options with an ad hoc example.

So, we need a custom translation. For now I've only included the characters you showed, but feel free to complete the list.

In [16]:
accented_letters = {
    u'ö' : [u'o',u'oe'],
    u'ø' : [u'o',u'oe'],
}

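def normalize_word(word):
    # Accumulate the replacements in _word so that every accented
    # character in the word is handled, not only the last one seen.
    # The last option of each list (u'oe' here) is the one that is kept.
    _word = word
    for c in word:
        replacements = accented_letters.get(c, [])
        if replacements:
            _word = _word.replace(c, replacements[-1])
    return _word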
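
`normalize_word` keeps a single spelling per word, while the mapping lists several options per character. If you want every spelling to be searchable, one possible extension is to generate all combinations. The sketch below is only an illustration: `ACCENTED_LETTERS` and `word_variants` are hypothetical names that are not part of the code above, and the mapping is copied from this notebook.

```python
from itertools import product

# Hypothetical copy of the mapping used in this notebook.
ACCENTED_LETTERS = {
    u'ö': [u'o', u'oe'],
    u'ø': [u'o', u'oe'],
}

def word_variants(word, mapping=ACCENTED_LETTERS):
    """Return every spelling obtained by picking one option per mapped character."""
    # Each position contributes its list of options, or itself if it is not mapped.
    choices = [mapping.get(c, [c]) for c in word]
    variants = []
    for combo in product(*choices):
        variant = u''.join(combo)
        if variant not in variants:  # keep order, drop duplicates
            variants.append(variant)
    return variants

variants = word_variants(u'jörgensen')
# variants == [u'jorgensen', u'joergensen']
```

Joining these variants into `body_normalized` would let both `jorgensen` and `joergensen` match the same record, at the cost of a larger field when a word has many accented characters.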
            

We can normalize words and store them in a special property, `body_normalized` for instance, and index them as a field of your Elasticsearch records

In [6]:
import re

texts = [
    u"Paul Jorgensen, South African Advocate",
    u"Andy Jörgensen, American politician, Wisconsin State Assembly",
    u"Anker Jørgensen, Danish Prime Minister",
]
records = []

for text in texts:
    text_parts = [p.strip() for p in re.split(r'[,\s]', text)]
    text_normalized = []
    for tp in text_parts:
        tp = tp.strip().lower()
        if tp:
            text_normalized.append(normalize_word(tp)) 
                
    record = {
        'body' : text,
    }
    if text_normalized:
        record['body_normalized'] = ' '.join(text_normalized)
    records.append(record)

for r in records:
    for k,v in r.iteritems():
        print k,v
    

body Paul Jorgensen, South African Advocate
body_normalized paul jorgensen south african advocate
body Andy Jörgensen, American politician, Wisconsin State Assembly
body_normalized andy joergensen american politician wisconsin state assembly
body Anker Jørgensen, Danish Prime Minister
body_normalized anker joergensen danish prime minister


In [16]:
import json
from elasticsearch import Elasticsearch

#cluster node to query
es = Elasticsearch(['localhost:9200',])
index_name    = 'your_index'



es.indices.create(
    index = index_name,
)

kwargs = {
    'body' : []
}

index_config = { 
    "_type"  : "your_type", 
    "_index" : 'your_index', 
}
for idx,r in enumerate(records):
    _index_config = dict(index_config)
    _index_config['_id'] = idx
    kwargs['body'].append({'index' : _index_config})
    kwargs['body'].append(r)

_ = es.bulk(**kwargs)

Once they are inserted, you can perform two types of search:

1. Exact search: the user input is not normalized and the Elasticsearch query runs against the `body` field, which is not normalized either.
2. Similar search: the user input is normalized and the query runs against the `body_normalized` field.
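
The two search modes differ only in the field they target and in whether the input is normalized first, so they can be built by one small helper. This is a minimal sketch, not part of the original answer: `build_match_query` and `simple_normalize` are hypothetical names, and `simple_normalize` is a stand-in for `normalize_word` so that the block runs on its own.

```python
def simple_normalize(word):
    # Stand-in for normalize_word: maps both accented letters to u'oe'.
    return word.replace(u'ö', u'oe').replace(u'ø', u'oe')

def build_match_query(user_input, similar=False, normalize=simple_normalize):
    """Build a match query for the exact or the similar search mode."""
    if similar:
        # Similar search: normalize the input and target the normalized field.
        field = 'body_normalized'
        text = normalize(user_input)
    else:
        # Exact search: leave the input untouched and target the raw field.
        field = 'body'
        text = user_input
    return {"query": {"match": {field: text}}}

exact_query = build_match_query(u'Jörgensen')
similar_query = build_match_query(u'Jörgensen', similar=True)
```

The returned dictionary can be passed as the `body` argument of `es.search`, in the same way as the queries written out by hand in the next cell.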

In [23]:

body_matches = [
    u'Jorgensen',
    u'Jörgensen',
    u'Jørgensen',
    u'Joergensen',
]
print "------EXACT MATCH------"
for body_match in body_matches:
    elasticsearch_query = {
        "query": {
            "match" : {
                "body" : body_match
            }
        }
    }
    es_kwargs = { 
        "doc_type"  : "your_type", 
        "index" : 'your_index', 
        "body" : elasticsearch_query
    }

    res = es.search(**es_kwargs)
    print body_match," MATCHING BODIES=",res['hits']['total']

    for r in res['hits']['hits']:
        print "-",r['_source'].get('body','')

print "\n------SIMILAR MATCHES------"
for body_match in body_matches:
    body_match = normalize_word(body_match)
    elasticsearch_query = {
        "query": {
            "match" : {
                "body_normalized" : body_match
            }
        }
    }
    es_kwargs = { 
        "doc_type"  : "your_type", 
        "index" : 'your_index', 
        "body" : elasticsearch_query
    }

    res = es.search(**es_kwargs)
    print body_match," MATCHING NORMALIZED BODIES=",res['hits']['total']

    for r in res['hits']['hits']:
        print "-",r['_source'].get('body','')


------EXACT MATCH------
Jorgensen  MATCHING BODIES= 1
- Paul Jorgensen, South African Advocate
Jörgensen  MATCHING BODIES= 1
- Andy Jörgensen, American politician, Wisconsin State Assembly
Jørgensen  MATCHING BODIES= 1
- Anker Jørgensen, Danish Prime Minister
Joergensen  MATCHING BODIES= 0

------SIMILAR MATCHES------
Jorgensen  MATCHING NORMALIZED BODIES= 3
- Paul Jorgensen, South African Advocate
- Anker Jørgensen, Danish Prime Minister
- Andy Jörgensen, American politician, Wisconsin State Assembly
Joergensen  MATCHING NORMALIZED BODIES= 2
- Anker Jørgensen, Danish Prime Minister
- Andy Jörgensen, American politician, Wisconsin State Assembly
Joergensen  MATCHING NORMALIZED BODIES= 2
- Anker Jørgensen, Danish Prime Minister
- Andy Jörgensen, American politician, Wisconsin State Assembly
Joergensen  MATCHING NORMALIZED BODIES= 2
- Anker Jørgensen, Danish Prime Minister
- Andy Jörgensen, American politician, Wisconsin State Assembly
