# Neural Coref Module

## 1. Installation

We install and import all the required packages:

In [None]:
#!pip uninstall spacy 
#!pip uninstall neuralcoref
#!pip install spacy==2.1.0
#!pip install neuralcoref --no-binary neuralcoref

#!python -m spacy download en

#!pip install colorama

In [9]:
import pandas as pd
import numpy as np

import logging
logging.basicConfig(level=logging.INFO)
import neuralcoref
import spacy
nlp = spacy.load('en')
neuralcoref.add_to_pipe(nlp)

from colorama import Fore, Back, Style

INFO:neuralcoref:Loading model from /Users/clementineabed-meraim/.neuralcoref_cache/neuralcoref


We load the dataset of interest:

In [10]:
ANNOTATED_DATA_PATH  = '/Users/clementineabed-meraim/Documents/Stage 2021 Medialab/SourcedStatements-master/annotated/annotated_examples.json'

In [11]:
df = pd.read_json(ANNOTATED_DATA_PATH, orient='records', lines=True)
#df.head()
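
With `orient='records', lines=True`, `read_json` expects one JSON object per line (the JSON Lines layout). The sketch below parses an invented two-record sample with the standard library only, to show the record shape the rest of the notebook relies on. The field names `text`, `annotations`, `label`, `start_offset` and `end_offset` are the ones used later in this notebook; the sample values and the helper `read_json_lines` are assumptions made for illustration.

```python
import io
import json

# Invented sample in JSON Lines form: one record per line (values are illustrative only).
SAMPLE = (
    '{"text": "Ann said she agrees.", '
    '"annotations": [{"label": 14, "start_offset": 0, "end_offset": 3}]}\n'
    '{"text": "No source here.", "annotations": []}\n'
)

def read_json_lines(handle):
    """Parse one JSON object per non-empty line (hypothetical helper)."""
    return [json.loads(line) for line in handle if line.strip()]

records = read_json_lines(io.StringIO(SAMPLE))
first = records[0]["annotations"][0]
# The offsets are character positions into the record's text.
print(records[0]["text"][first["start_offset"]:first["end_offset"]])  # Ann
```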

## 2. Helper functions: converting between spans and strings

These functions help convert between spans (token positions) and character strings, in both directions.

In [44]:
def isprefixe(i,mot,texte): # checks whether mot (str) occurs in texte at position i
    return texte.startswith(mot, i)

In [43]:
def positions_str(mention_str,texte): # returns the positions at which a word (str) occurs in a text
    occ = []
    for i in range(len(texte)-len(mention_str)+1):
        if isprefixe(i,mention_str,texte): 
            occ.append(i)
    return occ
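
The same list of occurrences can probably be obtained with the built-in `str.find`, which avoids the character-by-character comparison. This is a minimal sketch; `find_all` is a hypothetical name, not part of the notebook. Note that, like positions_str, it matches substrings, so a short mention such as "he" is also found inside longer words.

```python
def find_all(needle, haystack):
    """Return the start index of every (possibly overlapping) occurrence of needle."""
    positions = []
    start = haystack.find(needle)
    while start != -1:
        positions.append(start)
        # Advance by one character only, so overlapping occurrences are kept.
        start = haystack.find(needle, start + 1)
    return positions

# "he" is found as a word (0 and 8) but also inside "then" (12).
print(find_all("he", "he said he then left"))  # [0, 8, 12]
```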

In [60]:
def position_str_to_span(start,end,texte): # returns the token (span) position corresponding to a character range (start and end)
    mention_str = texte[start:end]
    mention_span = nlp(mention_str)

    chaine = texte[0:end]
    chain = nlp(chaine)

    return (len(chain)-len(mention_span))
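
position_str_to_span recovers a token index by re-tokenizing the text up to the end of the mention. Another way to think about the mapping is to keep the character offset of every token and look the start offset up directly. The sketch below illustrates this with a toy regex tokenizer, which is an assumption and does not reproduce spaCy's tokenization; `tokenize_with_offsets` and `char_to_token_index` are hypothetical names. spaCy exposes comparable information through `token.idx` and `Doc.char_span`, which may be worth checking against the installed version.

```python
import re

def tokenize_with_offsets(text):
    """Toy tokenizer (not spaCy's): words and single punctuation marks, with offsets."""
    return [(m.group(), m.start()) for m in re.finditer(r"\w+|[^\w\s]", text)]

def char_to_token_index(start, tokens):
    """Return the index of the token beginning at character offset start, or None."""
    for index, (_, offset) in enumerate(tokens):
        if offset == start:
            return index
    return None

text = "Dr. Kim said he agrees."
tokens = tokenize_with_offsets(text)
# "he" starts at character 13 and is the fifth token (index 4).
print(char_to_token_index(text.index("he"), tokens))  # 4
```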


In [61]:
def positions_span(mention_str,texte): # returns the list of token (span) positions of a mention (str), which may occur several times
    occ1 = []
    mention_span = nlp(mention_str) # tokenized once, since it does not depend on the occurrence
    for i in positions_str(mention_str,texte):
        chaine = texte[0:i+len(mention_str)]
        chain = nlp(chaine)
        occ1.append(len(chain)-len(mention_span))

    return occ1

In [62]:
def position_span_to_str(mention,texte): # takes a span and returns its character position in the text
    mention_str = mention.text

    span_position = mention.start

    liste_pos_str = positions_str(mention_str,texte) # character positions of the mention in the text
    liste_pos_span = positions_span(mention_str,texte) # token positions of the mention in the text

    if span_position not in liste_pos_span :
        # without this check, position_finale would be undefined below
        raise ValueError(f"no occurrence of {mention_str!r} found at token {span_position}")

    ind = liste_pos_span.index(span_position)
    position_finale = liste_pos_str[ind]

    return position_finale # character position of the span

## 3. Preprocessing the dataframe

**Creating the annotations_source column:**

We filter the annotation dictionaries, keeping only the sources.

In [17]:
def filtrage(dataframe):  # creates a new column containing only the "source" annotations
    dict_filtered = []
    for liste_dico in dataframe['annotations'] : # iterate over the list of dictionaries in each row of the dataframe
        new_liste_dico = [dico for dico in liste_dico if dico["label"]== 14] # keep only the entries labelled 14 (source)
        dict_filtered.append(new_liste_dico)

    dataframe['annotations_source'] = dict_filtered

In [18]:
filtrage(df)
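
The filtering step can be tried on plain Python data, without pandas. This is a sketch under assumptions: the rows below are invented, label 2 stands for some arbitrary non-source label, and `keep_sources` is a hypothetical helper mirroring the list comprehension in filtrage.

```python
SOURCE_LABEL = 14  # label id that the notebook treats as "source"

def keep_sources(annotations, label=SOURCE_LABEL):
    """Return only the annotation dictionaries carrying the given label."""
    return [annotation for annotation in annotations if annotation["label"] == label]

# Invented rows: the first has one source and one other annotation, the second has none.
rows = [
    {"annotations": [
        {"label": 14, "start_offset": 0, "end_offset": 3},
        {"label": 2, "start_offset": 4, "end_offset": 8},
    ]},
    {"annotations": []},
]

for row in rows:
    row["annotations_source"] = keep_sources(row["annotations"])

print([len(row["annotations_source"]) for row in rows])  # [1, 0]
```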

**Creating the spans column:**

For each text, we gather the spans corresponding to the sources into a list.

In [41]:
def liste_span(dataframe):  # creates the column of spans corresponding to the sources of each text
    colonne_span = []
    for i in range(len(dataframe)):
        liste_span = []

        texte = dataframe['text'][i]
        nlp_texte = nlp(texte)

        for dico in dataframe['annotations_source'][i]:
            start = dico['start_offset']
            end = dico['end_offset']

            mention = texte[start:end]
            nlp_mention = nlp(mention)

            index = position_str_to_span(start,end,texte)
            span = nlp_texte[index:index+len(nlp_mention)]
            liste_span.append(span)
    
        colonne_span.append(liste_span)
    #print(colonne_span)
    dataframe['spans'] = colonne_span

In [20]:
liste_span(df)

In [22]:
#df.head()

## 4. Function: coreference chains

Once the dataframe is preprocessed, we can build a function that, for a given text (row i of the dataframe), returns the coreference chains of the sourced statements identified above.


**Helper functions:**

First, we build a function that returns the coreference chains of the sourced statements (when they exist) for a given text (row i of the dataframe).

In [27]:
def liste_cluster(i,dataframe): 
    liste_main_span = []
    liste_cluster = []

    for span in dataframe['spans'][i]:
        if span._.is_coref and span._.coref_cluster.main not in liste_main_span : # the span is coreferent and its cluster has not been seen yet
            liste_main_span.append(span._.coref_cluster.main)
            liste_cluster.append(span._.coref_cluster.mentions)

    return liste_cluster


Example usage:

In [28]:
nlp_texte = nlp(df['text'][1])
print(df['spans'][1])
liste_cluster(1,df)

[Fauci, Fauci, he, Fauci, he, Fauci, Fauci, he, Chris Murphy, D-Conn., he, Trump, Trump, he, Robert Redfield, head of the Centers for Disease Control and Prevention, He, Murphy, Murphy, Fauci, Fauci, Fauci]


[[Fauci, Fauci, Fauci, he, Fauci, he, Fauci, Fauci, he],
 [the president, he, he],
 [Trump, Trump, Trump, Donald Trump, he],
 [Robert Redfield, head of the Centers for Disease Control and Prevention,
  his,
  He],
 [Murphy, Murphy, Murphy],
 [Fauci, Fauci, His, Fauci]]
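
Several annotated spans usually belong to the same cluster, which is why each cluster is returned only once, keyed by its main mention. The sketch below reproduces that selection on plain Python objects; the `Cluster` tuple, the sample data and `unique_clusters` are stand-ins invented for illustration, not neuralcoref objects.

```python
from collections import namedtuple

# Hypothetical stand-in for a coreference cluster: a main mention and all its mentions.
Cluster = namedtuple("Cluster", ["main", "mentions"])

fauci = Cluster("Fauci", ("Fauci", "he"))
trump = Cluster("Trump", ("Trump", "he"))

# Each annotated span is paired with its cluster, or None when it is not coreferent.
annotated = [("Fauci", fauci), ("he", fauci), ("Murphy", None), ("Trump", trump), ("he", trump)]

def unique_clusters(spans):
    """Return the mentions of each cluster once, in order of first appearance."""
    seen_mains = []
    clusters = []
    for _, cluster in spans:
        if cluster is not None and cluster.main not in seen_mains:
            seen_mains.append(cluster.main)
            clusters.append(list(cluster.mentions))
    return clusters

print(unique_clusters(annotated))  # [['Fauci', 'he'], ['Trump', 'he']]
```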

Neural Coref sometimes identifies overlapping spans, of which only one should be kept. We therefore build the function no_doublons, which returns the positions of the duplicate spans to remove from the coreference clusters. When two spans overlap in the text, we keep the one with the best pair score, that is, the highest of all the pair scores computed for it.

In [49]:
from itertools import combinations

def no_doublons(clusters): # from all the coref clusters, returns the positions of the mentions (spans) to remove
    liste_mentions = [mention for clust in clusters for mention in clust.mentions] # all the spans
    liste_mentions_a_suppr = []

    # each pair is examined once, so a tie in scores removes one mention rather than both
    for mention1, mention2 in combinations(liste_mentions, 2):
        interval1 = pd.Interval(mention1.start, mention1.end) # token interval covered by the span
        interval2 = pd.Interval(mention2.start, mention2.end)

        if interval1.overlaps(interval2) and interval1 != interval2 :
            score1 = max(mention1._.coref_scores.values())
            score2 = max(mention2._.coref_scores.values())

            # remove the mention with the lower best pair score (the first one on a tie)
            a_suppr = mention1 if score1 <= score2 else mention2
            position = [a_suppr.start, a_suppr.end]
            if position not in liste_mentions_a_suppr :
                liste_mentions_a_suppr.append(position)

    return liste_mentions_a_suppr

Example usage:

In [50]:
texte = df['text'][1]
texte_nlp = nlp(texte) 
print(no_doublons(texte_nlp._.coref_clusters))

[[7, 8]]
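
The overlap rule can be checked on plain tuples, without spaCy. In this sketch each mention is a `(start, end, score)` triple over half-open token ranges; the scores are invented, and `overlaps` and `spans_to_drop` are hypothetical helpers rather than the notebook's functions.

```python
from itertools import combinations

def overlaps(a, b):
    """True when the half-open token ranges [start, end) share at least one token."""
    return a[0] < b[1] and b[0] < a[1]

def spans_to_drop(mentions):
    """Return [start, end] of the lower-scoring span in every overlapping pair."""
    drop = []
    for first, second in combinations(mentions, 2):
        if overlaps(first, second) and first[:2] != second[:2]:
            # On a tie, the first span of the pair is the one dropped.
            loser = first if first[2] <= second[2] else second
            if list(loser[:2]) not in drop:
                drop.append(list(loser[:2]))
    return drop

# Invented mentions: the first two overlap on token 7, the third stands alone.
mentions = [(7, 8, 0.4), (7, 10, 0.9), (12, 13, 0.7)]
print(spans_to_drop(mentions))  # [[7, 8]]
```

Adjacent ranges such as `(7, 8)` and `(8, 9)` share no token and are therefore not treated as overlapping.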


**We can now implement the function that displays the coreference clusters for the sourced statements of a given text in the dataframe:**

In [65]:
def coref(i,dataframe) : # returns paragraph i of the dataset with its coreference chains highlighted in colour
    texte = dataframe['text'][i].replace('\n','. ')
    texte_or = texte # original text
    nlp_texte = nlp(texte)
    liste_charactere_updated = list(range(len(texte))) # current position of each original character

    color = 0 # text colour
    colors = 240 # background colour

    mentions_a_supp = no_doublons(nlp_texte._.coref_clusters)

    for cluster in liste_cluster(i,dataframe):

        color += 1
        nouveau_clust = [mention for mention in cluster if [mention.start,mention.end] not in mentions_a_supp]

        if len(nouveau_clust)>1 : # a cluster with a single element is not a coreference chain
            for mention in nouveau_clust :

                mention_str = mention.text # mention as a string

                index_position_start = position_span_to_str(mention,texte_or) # start of the mention in the original string
                position_start = liste_charactere_updated[index_position_start]
                position_end = position_start+len(mention_str) # end of the mention in the modified string

                deb = texte[0: position_start] # text up to the mention
                fin = texte[position_end:] # rest of the text

                texte = deb + f'\033[38;5;{color}m' + f'\x1b[48;5;{colors}m' + mention_str + '\033[0;0m' + fin # recolour the mention in texte
                add1 = len(f'\033[38;5;{color}m') + len(f'\x1b[48;5;{colors}m')
                add2 = len('\033[0;0m')

                # k rather than i, so that the paragraph index i is not overwritten
                for k in range(index_position_start,len(liste_charactere_updated)): # shift the positions from the mention onwards by add1
                    liste_charactere_updated[k] += add1

                for k in range(index_position_start+len(mention_str),len(liste_charactere_updated)): # shift the positions after the mention by add2
                    liste_charactere_updated[k] += add2

    return texte

In [71]:
print(coref(0,df))

Hong Kong, with a population of around 7.5 million, had a total of 6,039 cases and 108 deaths as of Saturday, a low rate for any city. But the region’s recent setbacks underscore the challenges that the world will continue to face until there is a widely available vaccine . As cases have soared back to alarming levels in recent weeks, South Korea, Japan and Hong Kong have had to quickly recalibrate their strategies. Travel bubbles that were announced with great fanfare are now on hold. Weeks after reopening, schools have been shut again. Bars and restaurants are closing early or shifting to takeaway menus. “We need solidarity in this kind of situation, but as everyone knows, it’s not easy,” said Dr. Kim Woo-joo, an infectious disease specialist at Korea University in Seoul .
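
The highlighting relies on ANSI 256-colour escape sequences wrapped around each mention, which is why the positions of later characters have to be shifted after every insertion. A possible alternative, sketched below, is to insert the sequences from the last mention to the first, so that earlier offsets never move. `highlight` is a hypothetical helper, and the character ranges are assumed not to overlap.

```python
def highlight(text, mentions, background=240):
    """Wrap each (start, end, color) character range in ANSI colour codes."""
    # Process the ranges from right to left so earlier offsets stay valid.
    for start, end, color in sorted(mentions, reverse=True):
        text = (
            text[:start]
            + f"\033[38;5;{color}m\033[48;5;{background}m"  # foreground, then background
            + text[start:end]
            + "\033[0;0m"  # reset
            + text[end:]
        )
    return text

sample = "God makes it clear that he is not a vending machine."
colored = highlight(sample, [(0, 3, 1), (24, 26, 1)])
print(colored)
```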


Example usage:

In [66]:
print(coref(1,df))



In [67]:
print(coref(2,df))

The history of humanity is the history of impatience. Not only do we want knowledge of the future, we want it when we want it. The Book of [38;5;2m[48;5;240mJob[0;0m condemns as prideful this desire for immediate attention. Speaking out of the whirlwind, [38;5;1m[48;5;240mGod[0;0m makes it clear that [38;5;1m[48;5;240mhe[0;0m is not a vending machine. [38;5;1m[48;5;240mHe[0;0m shows [38;5;1m[48;5;240mhis[0;0m face and reveals [38;5;1m[48;5;240mhis[0;0m plans when the time is ripe, not when the mood strikes us. We must learn to wait upon the Lord, the Bible tells us. Good luck with that, [38;5;2m[48;5;240mJob[0;0m no doubt grumbled. When the gods are silent, human beings take things into their own hands. In religions where the divine was thought to inscribe its messages in the natural world, specialists were taught to take auspices from the disposition of stars in the sky, from decks of cards, dice, a pile of sticks, a candle flame, a bowl of oily water, or the live

In [68]:
print(coref(3,df))

Associated Press Florida judge blocks state order for schools to reopen A Florida judge granted a temporary injunction Monday against the state's executive order requiring school districts to reopen schools during the pandemic, the Florida teachers union said. According to the Florida Teachers Association, Circuit Court Judge Charles Dodson granted its request to put a hold on the order issued in July by state Education Commissioner Richard Corcoran compelling schools to reopen. [38;5;1m[48;5;240mThe Florida Education Department[0;0m said [38;5;1m[48;5;240mit[0;0m could not immediately comment. Some districts in south Florida were given permission by the state to start the 2020-21 school year remotely because of high virus spread, but other districts had to begin in-person education, even if they did not want to. In one instance, the administration of Governor Ron DeSantis, a close ally of President Trump, threatened to withhold nearly $200 million from Hillsborough County if it 

In [72]:
print(coref(4,df))

“This is a serious setback in a delicate stage of the recovery,” said [38;5;1m[48;5;240mDec Mullarkey, managing director of SLC Management in Wellesley[0;0m, though [38;5;1m[48;5;240mhe[0;0m cautioned that Trump's move may be a negotiating ploy. If [38;5;1m[48;5;240mhe[0;0m sticks with [38;5;1m[48;5;240mhis[0;0m decision to pause stimulus talks, Trump appears to believe that quickly pushing through [38;5;1m[48;5;240mhis[0;0m nomination of Barrett to the Supreme Court is politically smarter than striking a deal with Democrats on the economy. “The president seems to be betting that his supporters care more about the Supreme Court approval than the stimulus plan,” said Karen Firestone, CEO of Aureus Asset Management. And as Dan Kern, chief investment officer at TFC Financial in Boston, noted, “The lack of pandemic relief will hurt the economy, but major harm in terms of [economic] growth and the jobs market won't be fully reflected in economic releases until after the elect