In [13]:
%load_ext autoreload
%autoreload 2
import pandas as pd
import numpy as np
import sys
sys.path.append('.')
from Coref import CorefModel

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


## **Using class CorefModel()**


The class CorefModel() implements two coreference detection models: NeuralCoref and e2eCoref. It lets the user choose one of these two models and use it to detect and extract all coreference chains from a list of texts stored in a dataset.

The user provides a dataset with one or several columns of texts on which to run coreference detection. CorefModel() then detects and extracts the coreference chains of each text and stores the resulting clusters in a new column of the dataset. A visualisation tool can also highlight the coreference chains that a given model found in one text.

This tutorial walks through an example of coreference detection and extraction on a dataset of interest, using each model and each function of the class CorefModel(). It illustrates the standard use of the class.

### 1. Dataset 

To use the class, the user must provide a dataset as a csv or json file, with at least one column of texts (as strings).

The dataset df_tutorial used for this example is a csv file. It has 4 rows and 2 columns of interest, 'text1' and 'text2', which contain the texts in which we want to detect coreference chains.
The texts are in English and are extracted from a press corpus.



In [None]:
!pwd

In [None]:
df_tutorial = pd.read_csv('/sps/humanum/user/cabedmer/SourcedStatements/df_tutorial.csv')  # path to change
df_tutorial.head()


## **Standard steps of CorefModel():**


**The steps of the class must be run in order.**

## 1. Calling the class 

In [None]:
coref_model = CorefModel()

## 2.  Preprocessing

**Importing the dataset**, using its path and the names of the columns of interest, colnames.

colnames can be a string if there is only one column of interest, or a list of strings if there are one or several.
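
As a minimal sketch of how an argument that accepts both forms can be normalised to a list, consider the helper below. `normalize_colnames` is a hypothetical name used for illustration only; it is not part of CorefModel() and does not claim to reproduce its internals.

```python
def normalize_colnames(colnames):
    """Return colnames as a list of column names (hypothetical helper)."""
    if isinstance(colnames, str):
        return [colnames]  # a single column given as a string
    colnames = list(colnames)
    if not colnames or not all(isinstance(c, str) for c in colnames):
        raise TypeError("colnames must be a string or a non-empty list of strings")
    return colnames


print(normalize_colnames('text1'))
print(normalize_colnames(['text1', 'text2']))
```

Both calls yield a list, so later steps can loop over the columns without caring which form was passed.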

In [None]:
coref_model.import_dataset('/sps/humanum/user/cabedmer/SourcedStatements/df_tutorial.csv', colnames=['text1','text2'])

**Cleaning the dataset:** this can only be done after importing.
The columns of interest (given by colnames) are cleaned: the string format is checked, typos are corrected and line breaks are removed.
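
To give an idea of what such a cleaning pass can look like, here is a minimal sketch covering only the type check and the line-break removal. `clean_text` is a hypothetical stand-in, the sample sentence is made up, and typo correction is deliberately left out; the actual behaviour of `clean()` may differ.

```python
import re


def clean_text(text):
    """Hypothetical stand-in for the cleaning step: type check plus line-break removal."""
    if not isinstance(text, str):
        raise TypeError(f"expected a string, got {type(text).__name__}")
    # Replace line breaks with spaces, then collapse repeated whitespace.
    text = re.sub(r"[\r\n]+", " ", text)
    return re.sub(r"\s+", " ", text).strip()


print(clean_text("The minister said\nshe would resign.\r\nShe left  today."))
```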

In [None]:
coref_model.clean()

## 3. Choosing the model for the following steps


Once a model is chosen, the inference and visualisation steps must be run one after the other for that model before moving on to the other model. Be careful: if inference is run for both models and visualisation only afterwards, the visualisation function will use the latest resulting dataframes (*df_results*, *df_standardized_results*), so it will only work for the last model.
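
The toy class below illustrates why the order matters. `ToyCoref` is not the real CorefModel(): it is a hypothetical stand-in that, like the behaviour described above, keeps only the results of the latest inference. Raising an error on a mismatched model is an assumption made for the sketch.

```python
class ToyCoref:
    """Toy stand-in (not the real CorefModel) that keeps only the latest results."""

    def __init__(self):
        self.last_model = None
        self.df_results = None

    def inference(self, model):
        # Each call overwrites the results of the previous one.
        self.last_model = model
        self.df_results = f"results from {model}"
        return self.df_results

    def visualisation(self, model):
        if model != self.last_model:
            raise ValueError(
                f"last inference was run with {self.last_model!r}, not {model!r}"
            )
        return f"highlighting {self.df_results}"


toy = ToyCoref()
for model in ("neuralcoref", "e2ecoref"):
    toy.inference(model)
    print(toy.visualisation(model))  # visualise right after inference
```

Running inference and visualisation back to back for each model, as in the loop, avoids reading results that belong to the other model.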

## **When choosing : NeuralCoref**


### **4. Inference**

For evaluation, we use the function **inference**, which takes the model as argument and requires the dataset to have been imported and cleaned already.

NeuralCoref detects and extracts coreference chains.

**Steps of the inference function:**

 - **Transforming the dataset format**

The dataset must be in the right format for NeuralCoref.

NeuralCoref only needs the columns of interest in the right format, which is already the case after preprocessing.
The transformation step creates *df_eval*, used for evaluation (it is simply the dataframe *df_tutorial* after preprocessing).

 - **Detect coreference chains**

Coreference chains are detected and extracted for each text and each column of the dataframe. The results are presented in a new dataframe called *df_results*, with the columns of interest *col* and the columns of predicted clusters *cluster_col*. Each row of a cluster column is a list of detected clusters, each cluster being a list of spans (the spaCy Span objects returned by NeuralCoref): a span is the interval of text corresponding to a mention.

**The inference function returns the dataframe *df_results*, with the original columns of text and the columns of detected coreference chains.**

In [None]:
coref_model.inference(model='neuralcoref')

### **5. Visualisation**

The inference function shows the predicted coreference clusters as lists of text, but the returned dataframe can be hard to read. To see the coreference chains of a specific text of the dataframe highlighted, we can print the output of the function **visualisation**.

This requires the dataset of interest to have been imported, cleaned and passed through inference already.

The function **visualisation** takes as arguments the model (which must be the one chosen for inference) and the position of the text of interest: column col and row i.

In [None]:
print(coref_model.visualisation(model='neuralcoref', col='text1', i=1))

## **When choosing : e2eCoref**

### **4. Inference**

For evaluation, we use the function **inference**, which takes the model as argument and requires the dataset to have been imported and cleaned already.

e2eCoref detects and extracts coreference chains.

**Steps of the inference function:**

 - **Transforming the dataset format**

The dataset must be in the right format for e2eCoref.

e2eCoref needs a specific json file in the right format for each column of interest.
The transformation step creates that file for each column *col*, called *df_coref_col*.
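
For orientation, the sketch below builds one record in the style of the jsonlines input used by the public e2e-coref code (keys `clusters`, `doc_key`, `sentences`, `speakers`). Whether *df_coref_col* has exactly this layout is an assumption. `to_e2e_record` is a hypothetical helper, the whitespace tokenisation is a simplification, and the sample sentence is made up.

```python
import json


def to_e2e_record(text, doc_key="nw"):
    """Build one input record in the style of e2e-coref's jsonlines format (sketch)."""
    tokens = text.split()  # naive whitespace tokenisation, for illustration only
    return {
        "clusters": [],                    # left empty at prediction time
        "doc_key": doc_key,                # document identifier (assumed default)
        "sentences": [tokens],             # whole text treated as a single sentence
        "speakers": [[""] * len(tokens)],  # one speaker label per token
    }


record = to_e2e_record("Anna said she was tired .")
print(json.dumps(record))
```

Writing one such JSON object per line, one line per text of the column, gives a jsonlines file.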

 - **Detect coreference chains**

Coreference chains are detected and extracted for each text and each column of the dataframe. The results are presented in a new dataframe called *df_results*, with the columns of interest *col* and the columns of predicted clusters *cluster_col*. Each row of a cluster column is a list of detected clusters, each cluster being a list of strings.

- **Creates a dataframe useful for further use**

In parallel with the detection of coreference chains, we create *df_useful*, which stores for each column *col* the columns *text_list_col* (the text as a list) and *predicted_clusters_col* (the list of clusters, each cluster being a list of mention positions in list format). "List format" means that a position is an interval [a,b] such that selecting that interval of the text in list form, text_list_col[a,b], returns the mention.
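
The relation between the text as a list and the cluster positions can be sketched as follows. The token list and cluster are made-up examples, `mentions_as_text` is a hypothetical helper, and the end index b is assumed to be inclusive (as in the public e2e-coref code); the class may use a different convention.

```python
text_list = ["Anna", "said", "she", "was", "tired", "."]
predicted_clusters = [[[0, 0], [2, 2]]]  # one cluster: "Anna" and "she"


def mentions_as_text(tokens, clusters):
    """Turn [a, b] token intervals into mention strings (end index assumed inclusive)."""
    return [[" ".join(tokens[a:b + 1]) for a, b in cluster] for cluster in clusters]


print(mentions_as_text(text_list, predicted_clusters))  # [['Anna', 'she']]
```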


**The inference function returns the dataframe *df_results*, with the original columns of text and the columns of detected coreference chains.**
        

      



In [None]:
coref_model.inference(model='e2ecoref')

### **5. Visualisation**

**Steps:**

- **Standardized results**

We use *df_useful* to get the positions of the clusters and convert them to "span positions", giving a column *span_positions_col* for each column *col*.

- **Visualisation**

Those columns are then used by the visualisation to rewrite the text, highlighting the mentions of the same coreference chain in the same colour.
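
The two steps above can be sketched as follows, assuming that "span positions" are character offsets in the text and that cluster positions are inclusive token intervals. Both functions are hypothetical illustrations, not the code of `visualisation`, and the sketch does not handle nested or overlapping mentions. Colours are ANSI escape codes, so they show up in a terminal or notebook output.

```python
COLOURS = ["\033[43m", "\033[46m", "\033[45m"]  # ANSI background colours
RESET = "\033[0m"


def token_to_char_spans(tokens, clusters):
    """Map inclusive [a, b] token intervals to (start, end) character offsets
    in " ".join(tokens); the end offset is exclusive."""
    starts, pos = [], 0
    for tok in tokens:
        starts.append(pos)
        pos += len(tok) + 1  # +1 for the joining space
    return [[(starts[a], starts[b] + len(tokens[b])) for a, b in cluster]
            for cluster in clusters]


def highlight(tokens, clusters):
    """Return the text with every mention of a chain wrapped in that chain's colour."""
    text = " ".join(tokens)
    spans = [(s, e, k)
             for k, cluster in enumerate(token_to_char_spans(tokens, clusters))
             for s, e in cluster]
    # Insert markers from right to left so earlier offsets stay valid.
    for s, e, k in sorted(spans, reverse=True):
        text = text[:s] + COLOURS[k % len(COLOURS)] + text[s:e] + RESET + text[e:]
    return text


tokens = ["Anna", "said", "she", "was", "tired", "."]
print(highlight(tokens, [[[0, 0], [2, 2]]]))
```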

In [None]:
print(coref_model.visualisation(model='e2ecoref',col='text1',i=0))