# Tokenization

### Named Entity Normalization: person name

- Proper nouns are one of the parts of speech where many new words appear
- E.g., person names, movie titles, ...
- That's why I implemented the normalization below
- However, the default person name dictionary shipped with MeCab contains many highly ambiguous words
- So, if we can obtain a list of person names, it may be better for disambiguation to remove the default person name dictionary and build a new one ourselves
- E.g., f1(val) is 0.850 with NEN and 0.852 without it, i.e., the current normalization does not improve the score
- `인명` (person name) may also be too general a category to normalize to
- So it may be more appropriate to normalize names by job or other characteristics
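
As a rough illustration of the last two points, the sketch below normalizes names through a hand-built dictionary that maps each name to a coarser category such as a job. The dictionary `PERSON_CATEGORIES`, its entries, the category labels, and the function `normalize_by_category` are all hypothetical; the input is assumed to be a list of morphemes that has already been tokenized.

```python
# Hypothetical hand-built dictionary: surface form -> coarse category.
# Entries and labels are illustrative only, not part of the original pipeline.
PERSON_CATEGORIES = {
    "봉준호": "감독",  # director
    "송강호": "배우",  # actor
}


def normalize_by_category(morphs, categories=PERSON_CATEGORIES):
    """Replace each known name with its category label; keep other tokens unchanged."""
    return [categories.get(m, m) for m in morphs]


# Assumed to be the output of a tokenizer; written by hand here.
tokens = ["송강호", "의", "연기"]
print(" ".join(normalize_by_category(tokens)))
```

Names missing from the dictionary pass through unchanged, so ambiguous common words are never rewritten unless they are listed explicitly.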

In [None]:
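# Assumes `mecab` (a MeCab tagger whose parse() yields (surface, feature) pairs)
# and `parallel_apply` (presumably from pandarallel) were set up in earlier cells.
def normalize_person_names(x):
    """Replace person names in `x` with '사람' ("person") and return space-joined morphemes."""
    morphemes = mecab.parse(x)
    result = x

    for surface, feature in morphemes:
        # Skip every morpheme that is not tagged with the semantic class '인명' (person name).
        if feature.semantic != '인명':
            continue
        # Replace only the first remaining occurrence of the surface form.
        # Caveat: if the same string also appears earlier in the text as part of
        # another word, that earlier occurrence is the one that gets replaced.
        result = result.replace(surface, '사람', 1)

    # Re-tokenize the normalized text and join the morphemes with spaces.
    return ' '.join(mecab.morphs(result))


def analyze(documents):
    # Apply the normalization to every document in parallel.
    return documents.parallel_apply(normalize_person_names)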
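
One possible variation, not the method used above, is to rebuild the output directly from the parsed morphemes instead of replacing substrings in the raw text. This avoids replacing the wrong occurrence of a name, but the resulting tokens can differ from re-running `mecab.morphs` on the replaced string. The sketch assumes the same `(surface, feature)` pair shape as above; `Feature` here is a stand-in that models only two fields, `normalize_parsed` is a hypothetical helper, and the parse result is written by hand so the example runs without MeCab.

```python
from collections import namedtuple

# Stand-in for the tagger's feature object; only `pos` and `semantic` are modelled.
Feature = namedtuple("Feature", ["pos", "semantic"])


def normalize_parsed(morphemes, placeholder="사람"):
    """Join surfaces with spaces, substituting the placeholder for person names."""
    return " ".join(
        placeholder if feature.semantic == "인명" else surface
        for surface, feature in morphemes
    )


# Hand-written parse result, assumed to resemble what the tagger would return.
parsed = [
    ("송강호", Feature("NNP", "인명")),
    ("의", Feature("JKG", None)),
    ("연기", Feature("NNG", None)),
]
print(normalize_parsed(parsed))
```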