# French NER Processing 

## POS Tagging

We need to map, in Python, the tags listed at http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/data/french-tagset.html so that, wherever they appear, they are replaced by the corresponding tags of the UPenn tagset (shown below).

In [2]:
import nltk
nltk.help.upenn_tagset()

$: dollar
    $ -$ --$ A$ C$ HK$ M$ NZ$ S$ U.S.$ US$
'': closing quotation mark
    ' ''
(: opening parenthesis
    ( [ {
): closing parenthesis
    ) ] }
,: comma
    ,
--: dash
    --
.: sentence terminator
    . ! ?
:: colon or ellipsis
    : ; ...
CC: conjunction, coordinating
    & 'n and both but either et for less minus neither nor or plus so
    therefore times v. versus vs. whether yet
CD: numeral, cardinal
    mid-1890 nine-thirty forty-two one-tenth ten million 0.5 one forty-
    seven 1987 twenty '79 zero two 78-degrees eighty-four IX '60s .025
    fifteen 271,124 dozen quintillion DM2,000 ...
DT: determiner
    all an another any both del each either every half la many much nary
    neither no some such that the them these this those
EX: existential there
    there
FW: foreign word
    gemeinschaft hund ich jeux habeas Haementeria Herr K'ang-si vous
    lutihaw alai je jour objets salutaris fille quibusdam pas trop Monte
    terram fiche oui corporis ...
IN: preposition or

## TreeTagger

For the library to work, follow the installation process described at http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/

All the files must be in the same directory. The key point is that any file that needs it must not be left with its .gz still compressed. In any case, I attach the files ready for running the install script _sh install-tagger.sh_.
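As a rough illustration of the decompression step, the sketch below unpacks every `.gz` file in a directory with Python's standard `gzip` module. The function name `gunzip_all` and the directory argument are assumptions made for this example; it is not part of the TreeTagger distribution.

```python
import gzip
import shutil
from pathlib import Path


def gunzip_all(directory):
    """Decompress every *.gz file in `directory`, keeping the originals.

    Returns the list of decompressed file paths. Tar archives are only
    decompressed ("x.tar.gz" -> "x.tar"), not unpacked.
    """
    extracted = []
    for gz_path in sorted(Path(directory).glob("*.gz")):
        target = gz_path.with_suffix("")  # drop the trailing ".gz"
        with gzip.open(gz_path, "rb") as src, open(target, "wb") as dst:
            shutil.copyfileobj(src, dst)
        extracted.append(target)
    return extracted
```

Calling `gunzip_all("TreeTagger")` (a hypothetical folder name) would leave the uncompressed files next to the original archives.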

To get the French dictionary working, put the file _french-utf8.par_ in TreeTagger/lib.

## Python3 Tree Tagger


Clone this repository: https://github.com/miotto/treetagger-python

##### NOTE: I attach the Python file directly, so there is no need to clone anything

Do not install it with setup.py as the author suggests (at least in my case the notebook does not recognize it afterwards). It is better to import the library treetagger.py directly, the usual way (_import_ the file _treetagger.py_, located in the same directory as the notebook or in a subdirectory of this notebook's directory).
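A minimal sketch of that direct import, assuming `treetagger.py` lives in a subdirectory of the notebook's folder. The directory name `libs` and the helper `add_to_import_path` are hypothetical; if the file sits next to the notebook, a plain `from treetagger import TreeTagger` is enough.

```python
import os
import sys


def add_to_import_path(directory):
    """Put `directory` at the front of sys.path (once) and return its absolute path."""
    abs_dir = os.path.abspath(directory)
    if abs_dir not in sys.path:
        sys.path.insert(0, abs_dir)
    return abs_dir


# Hypothetical layout: <notebook dir>/libs/treetagger.py
add_to_import_path("libs")
# from treetagger import TreeTagger  # works once the file is in that folder
```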

Remember to add the path to the executable to PATH with export, as described in the installation instructions.
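Before creating the tagger it can help to check that the executable is actually reachable. The helper below is a sketch, not part of treetagger-python: it assumes the binary is named `tree-tagger` and simply searches a PATH-style string with `shutil.which`.

```python
import os
import shutil


def find_treetagger(executable="tree-tagger", search_path=None):
    """Return the full path to the TreeTagger binary, or None if it is not found.

    `search_path` is a PATH-style string; None means "use the current PATH".
    The default executable name is an assumption.
    """
    return shutil.which(executable, path=search_path)


location = find_treetagger()
if location is None:
    print("tree-tagger not found: export its directory in PATH first")
else:
    print("Using TreeTagger at", location)
```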



## Tests

The TreeTagger program can run 'tokenizing' and 'stemming' (strictly, lemmatization: the third element of each triple below is the lemma) on the target corpus. We then take the result, map its tags to the UPenn tagset and pass it to the NLTK NER.

In [3]:
from treetagger import TreeTagger
tt = TreeTagger(language='french')

test_ttg = tt.tag('Emmanuel Macron a gagné à la candidate du Front National, Marine Le Pen, par un soixante-cinq percent des votes dans les elections de la France.')
print(test_ttg)

[['Emmanuel', 'NAM', 'Emmanuel'], ['Macron', 'NAM', '<unknown>'], ['a', 'VER:pres', 'avoir'], ['gagné', 'VER:pper', 'gagner'], ['à', 'PRP', 'à'], ['la', 'DET:ART', 'le'], ['candidate', 'NOM', 'candidat'], ['du', 'PRP:det', 'du'], ['Front', 'NAM', '<unknown>'], ['National', 'NAM', '<unknown>'], [',', 'PUN', ','], ['Marine', 'NAM', 'Marine'], ['Le', 'DET:ART', 'le'], ['Pen', 'NAM', '<unknown>'], [',', 'PUN', ','], ['par', 'PRP', 'par'], ['un', 'DET:ART', 'un'], ['soixante-cinq', 'NUM', 'soixante-cinq'], ['percent', 'VER:pres', 'percer'], ['des', 'PRP:det', 'du'], ['votes', 'NOM', 'vote'], ['dans', 'PRP', 'dans'], ['les', 'DET:ART', 'le'], ['elections', 'NOM', '<unknown>'], ['de', 'PRP', 'de'], ['la', 'DET:ART', 'le'], ['France', 'NAM', 'France'], ['.', 'SENT', '.']]


In [4]:
#Dictionary uses following pair convention --> 'Treetagger_french_tagging':'UPenn_tagging'
mapping = {'ABR':'CC',
       'ADJ':'JJ',
       'ADV':'RB',
       'DET:ART':'DT',
       'DET:POS':'PRP$',
       'INT':'UH',
       'KON':'CC',
       'NAM':'NNP',
       'NOM':'NN',
       'NUM':'CD',
       'PRO':'PRP',
       'PRO:DEM':'DT',
       'PRO:IND':'DT',
       'PRO:PER':'PRP',
       'PRO:POS':'PRP$',
       'PRO:REL':'DT',
       'PRP':'IN',
       'PRP:det':'RP',
       'PUN':'SYM',
       'PUN:cit':'SYM',
       'SENT':'.',
       'SYM':'SYM',
       'VER:cond':'MD',
       'VER:futu':'MD',
       'VER:impe':'VBP',
       'VER:impf':'VBD',
       'VER:infi':'VB',
       'VER:pper':'VBN',
       'VER:ppre':'VBG',
       'VER:pres':'VB',
       'VER:simp':'VBD',
       'VER:subi':'VBN',
       'VER:subp':'VBD',      
}

The dictionary above maps the tag format used by TreeTagger to the UPenn format.

We now run the mapping on the corpus produced by TreeTagger.

In [5]:
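# Convert each [word, TreeTagger tag, lemma] triple into a (word, UPenn tag) pair.
# Tokens whose tag is not in `mapping` are skipped.
test_mapped = []
for token in test_ttg:
    if token[1] in mapping:
        test_mapped.append((token[0], mapping[token[1]]))
print(test_mapped)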


[('Emmanuel', 'NNP'), ('Macron', 'NNP'), ('a', 'VB'), ('gagné', 'VBN'), ('à', 'IN'), ('la', 'DT'), ('candidate', 'NN'), ('du', 'RP'), ('Front', 'NNP'), ('National', 'NNP'), (',', 'SYM'), ('Marine', 'NNP'), ('Le', 'DT'), ('Pen', 'NNP'), (',', 'SYM'), ('par', 'IN'), ('un', 'DT'), ('soixante-cinq', 'CD'), ('percent', 'VB'), ('des', 'RP'), ('votes', 'NN'), ('dans', 'IN'), ('les', 'DT'), ('elections', 'NN'), ('de', 'IN'), ('la', 'DT'), ('France', 'NNP'), ('.', '.')]
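Note that the loop above silently drops any token whose TreeTagger tag is missing from `mapping`. As an alternative sketch (not what this notebook does), the function below keeps every token and falls back to a default tag. The name `map_tags`, the `default="NN"` choice, the made-up tag `XYZ` and the small `sample_mapping` excerpt are all assumptions for illustration.

```python
def map_tags(tagged, mapping, default="NN"):
    """Convert [word, treetagger_tag, lemma] triples to (word, upenn_tag) pairs.

    Tags missing from `mapping` get `default` instead of being dropped.
    """
    return [(entry[0], mapping.get(entry[1], default)) for entry in tagged]


# Small excerpt of the mapping used in this notebook
sample_mapping = {"NAM": "NNP", "VER:pres": "VB", "VER:pper": "VBN"}

sample = [
    ["Macron", "NAM", "<unknown>"],
    ["a", "VER:pres", "avoir"],
    ["gagné", "VER:pper", "gagner"],
    ["hier", "XYZ", "hier"],  # made-up tag to show the fallback
]
print(map_tags(sample, sample_mapping))
# [('Macron', 'NNP'), ('a', 'VB'), ('gagné', 'VBN'), ('hier', 'NN')]
```

Whether a fallback is preferable to dropping tokens depends on how the NER chunker copes with the guessed tags, so this is a design choice to test rather than a fix.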


Once our corpus is tagged in UPenn format, we use NLTK's ne_chunk function to perform the NER processing.

In [6]:
# https://gist.github.com/onyxfish/322906
def extract_entity_names(t):
    """Recursively collect the text of every 'NE' subtree of t."""
    entity_names = []

    # Leaves are plain (word, tag) tuples and have no label() method
    if hasattr(t, 'label'):
        if t.label() == 'NE':
            entity_names.append(' '.join([child[0] for child in t]))
        else:
            for child in t:
                entity_names.extend(extract_entity_names(child))

    return entity_names

In [27]:
from nltk import ne_chunk
ne_tagged = ne_chunk(test_mapped, binary=True)
print(ne_tagged)


entity_names = []
for tree in ne_tagged:
    # Collect the entities found in each subtree
    entity_names.extend(extract_entity_names(tree))

# Print unique entity names
print("Found entities are listed below:")
print(set(entity_names))

(S
  (NE Emmanuel/NNP Macron/NNP)
  a/VB
  gagné/VBN
  à/IN
  la/DT
  candidate/NN
  du/RP
  (NE Front/NNP National/NNP)
  ,/SYM
  (NE Marine/NNP Le/DT Pen/NNP)
  ,/SYM
  par/IN
  un/DT
  soixante-cinq/CD
  percent/VB
  des/RP
  votes/NN
  dans/IN
  les/DT
  elections/NN
  de/IN
  la/DT
  France/NNP
  ./.)
Found entities are listed below:
{'Front National', 'Marine Le Pen', 'Emmanuel Macron'}


In the earlier run, without `binary=True`, the NLTK NER library correctly recognized two people (Macron and Le Pen) and one place (France). It also recognized the party Front National, but as a person rather than an organization. Finally, in that run it did not recognize the name 'Macron' as part of the person entity 'Emmanuel'.

##### NEW!!
Setting the _binary_ parameter of _ne_chunk_ to True tells the NER processor not to try to classify the entities it finds, but simply to list them without identifying their type. In this mode it identifies "Emmanuel Macron" as a single entity but, on the other hand, it no longer identifies the word "France" as another entity in the text.

#### _EXTRA_

The following section is simply an example in which the mapping has been done by hand.

In [8]:
test = [('Emmanuel','NNP'),('Macron','NNP'),('est','VBG'),('le','DT'),('nouveau','JJ'),('président','NN'),('de','IN'),('la','DT'),('France','NNP')]
ne_tagged = ne_chunk(test,binary=True)
print(ne_tagged)

(S
  (NE Emmanuel/NNP Macron/NNP)
  est/VBG
  le/DT
  nouveau/JJ
  président/NN
  de/IN
  la/DT
  France/NNP)


## NER on tweets

We now apply the NER processing shown above to all the tweets collected in each region of France.

First we load the CSV with pandas so that it is easy to handle.

In [9]:
import pandas as pd
pd.set_option('display.max_colwidth', 1000)

In [10]:
ledebat = pd.read_csv('sitc_nahr/tweets_debat_completos/Auvergne-Rhone-Alpes.csv', encoding='utf-8')
ledebat[:20]

Unnamed: 0.1,Unnamed: 0,text,dic_rounded,lsvc,knn,mnb,lsvc_bin,knn_bin,mnb_bin
0,0,RT @fille_motivee: j'ai mis le calendrier à jour #2017LeDebat #LeGrandDebat,Nan,2,2,2,0,1,0
1,1,RT @Floryan_Real: Si il te regarde comme ça t'as tout gagné #2017LeDébat,1,2,2,2,0,0,0
2,2,RT @Perrine1402: En cours j'me sentais inutile au milieu de tout ces gens puis j'me suis rappelée des journalistes du #2017leDebat ça m'a d…,0,2,2,2,0,0,0
3,3,RT @fille_motivee: j'ai mis le calendrier à jour #2017LeDebat #LeGrandDebat,Nan,2,2,2,0,1,0
4,4,RT @fille_motivee: j'ai mis le calendrier à jour #2017LeDebat #LeGrandDebat,Nan,2,2,2,0,1,0
5,5,RT @fille_motivee: j'ai mis le calendrier à jour #2017LeDebat #LeGrandDebat,Nan,2,2,2,0,1,0
6,6,RT @fille_motivee: j'ai mis le calendrier à jour #2017LeDebat #LeGrandDebat,Nan,2,2,2,0,1,0
7,7,RT @fille_motivee: j'ai mis le calendrier à jour #2017LeDebat #LeGrandDebat,Nan,2,2,2,0,1,0
8,8,RT @fille_motivee: j'ai mis le calendrier à jour #2017LeDebat #LeGrandDebat,Nan,2,2,2,0,1,0
9,9,RT @fille_motivee: j'ai mis le calendrier à jour #2017LeDebat #LeGrandDebat,Nan,2,2,2,0,1,0


Next we create a new pandas DataFrame with new columns that will store the entities recognized in each tweet.

In [11]:
# Map TreeTagger French tags to the UPenn format
def map_to_upenn(target):
    # Tokens whose tag is not in `mapping` are skipped
    return [(token[0], mapping[token[1]]) for token in target if token[1] in mapping]

In [None]:
from nltk.tokenize import TweetTokenizer
import re
import numpy as np

tknzr = TweetTokenizer(strip_handles=True, reduce_len=True) # Tweet tokenizer works better
http_regex = re.compile('^http.*$') # HTTP regex for links, compiled once

# These are the words that we want to look for inside all hashtags
key_words = {
    'LePen': ['marine','lepen'],
    'Macron': ['emmanuel','macron'],
}

# Let's tokenize all tweets
def preprocess(words, main_hashtag):
    tokens = tknzr.tokenize(words) # Tokenize a sentence
    no_links = [w for w in tokens if not http_regex.match(w)] # Remove links attached to the tweet
    no_colon = [w for w in no_links if w != ':'] # The tokenizer strips the user's handle but leaves the colon after it
    # Remove the RT marker and the main hashtag. The comparison is case-sensitive,
    # so '#2017LeDebat' is not removed when main_hashtag is '#2017ledebat'.
    final_words = [w for w in no_colon if w != main_hashtag and w != 'RT']
    return final_words

results = ledebat.copy() # Work on a copy of the original dataset
for index, row in results.iterrows():
    comment = preprocess(row['text'], '#2017ledebat') # Preprocess the content of each tweet (assuming main hashtag)
    comment_ttg = tt.tag(comment) # TreeTagger French POS tagging on the tokenized tweet
    comment_ttg_upenn = map_to_upenn(comment_ttg) # Map TreeTagger tags to the UPenn format
    comment_ne_tagged = ne_chunk(comment_ttg_upenn, binary=True) # Apply NLTK NER processing

    # First filter: find entities using NER
    entities = []
    for tree in comment_ne_tagged:
        entities.extend(extract_entity_names(tree))

    # Second filter: find entities in hashtags
    # (a hashtag is appended once for every keyword it contains)
    hashtags = [w for w in comment if '#' in w]
    for ht in hashtags:
        ht_lower = ht.lower()
        for key, names in key_words.items():
            for name in names:
                if name in ht_lower:
                    entities.append(ht)

    # Add new column with the recognized entities.
    # DataFrame.set_value was removed in pandas 1.0, so .loc is used instead.
    results.loc[index, 'entities'] = ','.join(entities) if entities else 'None'

    # HERE begins the 'candidate' field classifier
    # We consider that the candidate the tweet refers to is the one that has the most mentions
    candidate_count = {'LePen': 0, 'Macron': 0}
    for entity in entities:
        entity_crop = entity.lower().replace(' ', '')
        for key, names in key_words.items():
            for name in names:
                if name in entity_crop:
                    candidate_count[key] += 1

    if candidate_count['LePen'] > candidate_count['Macron']:
        results.loc[index, 'candidate'] = 'LePen'
    elif candidate_count['LePen'] == candidate_count['Macron']:
        results.loc[index, 'candidate'] = 'None'
    else:
        results.loc[index, 'candidate'] = 'Macron'

    print(entities)
        


[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
['Dimanche']
[]
['Interdit Interdit']
[]
[]
[]
[]
[]
[]
['Message', 'Marine Le Pen']
[]
[]
['KAIBA']
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
['Houellebecq', 'Macron', '#Macron']
[]
['KAIBA']
[]
['#Macron', '#StopMacron']
['Same Ol', '#Macron', '#StopMacron']
[]
[]
[]
[]
['Marine', '#EmmanuelMacron', '#EmmanuelMacron']
[]
[]
[]
['KAIBA']
[]
[]
[]
['Message', 'Marine Le Pen']
[]
['KAIBA']
[]
[]
['KAIBA']
[]
[]
[]
[]
['#JeVoteMacron', '#MacronPresident']
[]
[]
[]
[]
[]
[]
[]
[]
['#Macron']
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
['KAIBA']
[]
[]
[]
[]
['Interdit Interdit']
[]
[]
[]
[]
[]
['#lePen', '#macron']
[]
[]
[]
[]
['KAIBA']
[]
[]
[]
['Message', 'Marine Le Pen']
[]
['#Macron']
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
[]
['France', '#JeVoteMacron']
[]
['Macron', '#MacronLeaks', '#Macron', '#Marine2017']
[]
[]
[]
[]
[]
[]
[]
[]
[]
['Madame Lepen']
['KAIBA']
[]
[]
['

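To make the cleaning and hashtag steps easier to follow in isolation, here is a dependency-free sketch. It swaps NLTK's `TweetTokenizer` for a plain whitespace split, so its tokens differ from the notebook's (punctuation stays attached to words). Unlike the notebook loop, it compares the main hashtag case-insensitively and records a hashtag only once even if it matches several keywords. The names `clean_tokens` and `candidate_hashtags`, the sample tweet and its link are invented for this example.

```python
import re

LINK_RE = re.compile(r"^http")

KEY_WORDS = {
    "LePen": ["marine", "lepen"],
    "Macron": ["emmanuel", "macron"],
}


def clean_tokens(text, main_hashtag):
    """Split on whitespace and drop links, handles, 'RT' and the main hashtag."""
    kept = []
    for token in text.split():
        if LINK_RE.match(token) or token.startswith("@"):
            continue
        if token == "RT" or token.lower() == main_hashtag.lower():
            continue
        kept.append(token)
    return kept


def candidate_hashtags(tokens, key_words=KEY_WORDS):
    """Return the hashtags that contain any candidate keyword (case-insensitive)."""
    hits = []
    for token in tokens:
        if not token.startswith("#"):
            continue
        lowered = token.lower()
        if any(name in lowered for names in key_words.values() for name in names):
            hits.append(token)
    return hits


# Made-up tweet with a placeholder link
tweet = "RT @user: Bravo #JeVoteMacron #2017LeDebat https://t.co/xyz"
tokens = clean_tokens(tweet, "#2017ledebat")
print(tokens)                      # ['Bravo', '#JeVoteMacron']
print(candidate_hashtags(tokens))  # ['#JeVoteMacron']
```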
In [61]:
results[:80]

Unnamed: 0.1,Unnamed: 0,text,dic_rounded,lsvc,knn,mnb,lsvc_bin,knn_bin,mnb_bin,entities,candidate
0,0,RT @fille_motivee: j'ai mis le calendrier à jour #2017LeDebat #LeGrandDebat,Nan,2,2,2,0,1,0,,
1,1,RT @Floryan_Real: Si il te regarde comme ça t'as tout gagné #2017LeDébat,1,2,2,2,0,0,0,,
2,2,RT @Perrine1402: En cours j'me sentais inutile au milieu de tout ces gens puis j'me suis rappelée des journalistes du #2017leDebat ça m'a d…,0,2,2,2,0,0,0,,
3,3,RT @fille_motivee: j'ai mis le calendrier à jour #2017LeDebat #LeGrandDebat,Nan,2,2,2,0,1,0,,
4,4,RT @fille_motivee: j'ai mis le calendrier à jour #2017LeDebat #LeGrandDebat,Nan,2,2,2,0,1,0,,
5,5,RT @fille_motivee: j'ai mis le calendrier à jour #2017LeDebat #LeGrandDebat,Nan,2,2,2,0,1,0,,
6,6,RT @fille_motivee: j'ai mis le calendrier à jour #2017LeDebat #LeGrandDebat,Nan,2,2,2,0,1,0,,
7,7,RT @fille_motivee: j'ai mis le calendrier à jour #2017LeDebat #LeGrandDebat,Nan,2,2,2,0,1,0,,
8,8,RT @fille_motivee: j'ai mis le calendrier à jour #2017LeDebat #LeGrandDebat,Nan,2,2,2,0,1,0,,
9,9,RT @fille_motivee: j'ai mis le calendrier à jour #2017LeDebat #LeGrandDebat,Nan,2,2,2,0,1,0,,
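The `candidate` rule (the candidate with the most keyword hits wins, and a tie gives 'None') can be isolated into a small function. This is a sketch that mirrors the counting logic of the loop above; the name `classify_candidate` is introduced here for illustration only.

```python
KEY_WORDS = {
    "LePen": ["marine", "lepen"],
    "Macron": ["emmanuel", "macron"],
}


def classify_candidate(entities, key_words=KEY_WORDS):
    """Return 'LePen', 'Macron' or 'None' depending on who has more keyword hits."""
    counts = {key: 0 for key in key_words}
    for entity in entities:
        cropped = entity.lower().replace(" ", "")  # 'Marine Le Pen' -> 'marinelepen'
        for key, names in key_words.items():
            counts[key] += sum(name in cropped for name in names)
    if counts["LePen"] > counts["Macron"]:
        return "LePen"
    if counts["Macron"] > counts["LePen"]:
        return "Macron"
    return "None"


# Entity lists taken from the output printed earlier
print(classify_candidate(["Message", "Marine Le Pen"]))  # LePen
print(classify_candidate(["#Macron", "#StopMacron"]))    # Macron
print(classify_candidate(["#lePen", "#macron"]))         # None
print(classify_candidate([]))                            # None
```

Because every keyword contained in an entity counts separately, 'Marine Le Pen' scores two hits for LePen ('marine' and 'lepen'), which is the same behavior as the nested loops in the notebook.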
