# WMIR practice lesson on spaCy

### Objective
Use the spaCy library to extract relevant information from sentences and enrich their representations.  
Train several models and compare them with and without the enriched representations.

#### Author
Claudiu Daniel Hromei, April 2023.  
hromei@ing.uniroma2.it

# Introduction

[spaCy](https://spacy.io) is a free, open-source library for Natural Language Processing (NLP) in Python. It can be used to build information extraction or natural language understanding systems, or to pre-process text for deep learning.  
Some features:
- **Tokenization**: Segmenting text into words, punctuation marks, etc.
- **Part-of-speech (POS) Tagging**: Assigning word types to tokens, like verb or noun.
- **Dependency Parsing**: Assigning syntactic dependency labels, describing the relations between individual tokens, like subject or object.
- **Lemmatization**: Assigning the base forms of words. For example, the lemma of “was” is “be”, and the lemma of “rats” is “rat”.
- **Sentence Boundary Detection (SBD)**: Finding and segmenting individual sentences.
- **Named Entity Recognition (NER)**: Labelling named “real-world” objects, like persons, companies or locations.
- **Entity Linking (EL)**: Disambiguating textual entities to unique identifiers in a knowledge base.
- **Similarity**: Comparing words, text spans and documents and how similar they are to each other.
- **Text Classification**: Assigning categories or labels to a whole document, or parts of a document.
- **Rule-based Matching**: Finding sequences of tokens based on their texts and linguistic annotations, similar to regular expressions.
- **Training**: Updating and improving a statistical model’s predictions.
- **Serialization**: Saving objects to files or byte strings.
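
As a quick illustration of the first feature, the sketch below tokenizes a sentence with a blank English pipeline. It assumes spaCy is already installed; `spacy.blank("en")` only provides the tokenizer, so no trained pipeline is needed, but tagging, parsing and NER are not available in this setup. The variable names are illustrative.

```python
import spacy

# a blank pipeline contains only the rule-based English tokenizer
blank_nlp = spacy.blank("en")
doc = blank_nlp("Mark drove his car from Los Angeles to Las Vegas.")

# each element of the Doc is a Token; .text is its surface string
tokens = [token.text for token in doc]
print(tokens)

# punctuation is split off as a separate token
print(doc[-1].text, doc[-1].is_punct)
```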

# Required Libraries

In [26]:
import pandas as pd

from IPython.display import display, HTML

In [27]:
# show the full content of DataFrame cells instead of truncating it
pd.set_option("max_colwidth", None)

### Install spaCy and download the English pipeline

In [28]:
# install the spacy module
!pip install spacy

# download the English pipeline
# (use 'it_core_news_sm' for Italian texts)
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.5.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.5.0/en_core_web_sm-3.5.0-py3-none-any.whl (12.8 MB)
[+] Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [29]:
import spacy
from spacy import displacy

# Annotation example

In [30]:
input_string = "In 1982, Mark drove his car from Los Angeles to Las Vegas until 5 of july"
nlp = spacy.load('en_core_web_sm')

In [31]:
def print_annotation(input_string):
    doc = nlp(input_string)

    rows = []
    for sent in doc.sents:
        for i, word in enumerate(sent):
            # the root of a sentence is its own head: mark it with head id 0,
            # otherwise use the 1-based position of the head inside the sentence
            if word.head.i == word.i:
                head_idx = 0
            else:
                head_idx = word.head.i - sent.start + 1

            # "O" marks tokens outside any named entity
            entity_tag = word.ent_type_ or "O"

            rows.append({"id": str(i+1), "word": word.text, "lemma": word.lemma_, "tag": word.tag_,
                         "entity": entity_tag, "dependency": word.dep_, "head id": str(head_idx)})

    # build the DataFrame in one step: DataFrame.append was removed in pandas 2.0
    df = pd.DataFrame(rows, columns=["id", "word", "lemma", "tag", "entity", "dependency", "head id"])
    display(df)

- O: stands for "outside". In the `entity` column it means that the token is not part of any entity;
- ROOT: in the `dependency` column it means that the token does not depend on any other token (it is the root of the sentence).

In [32]:
print_annotation(input_string)



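| | id | word | lemma | tag | entity | dependency | head id |
|---|---|---|---|---|---|---|---|
| 0 | 1 | In | in | IN | O | prep | 5 |
| 1 | 2 | 1982 | 1982 | CD | DATE | pobj | 1 |
| 2 | 3 | , | , | , | O | punct | 5 |
| 3 | 4 | Mark | Mark | NNP | PERSON | nsubj | 5 |
| 4 | 5 | drove | drive | VBD | O | ROOT | 0 |
| 5 | 6 | his | his | PRP$ | O | poss | 7 |
| 6 | 7 | car | car | NN | O | dobj | 5 |
| 7 | 8 | from | from | IN | O | prep | 5 |
| 8 | 9 | Los | Los | NNP | GPE | compound | 10 |
| 9 | 10 | Angeles | Angeles | NNP | GPE | pobj | 8 |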


In [33]:
def visualize_annotation(input_string, style="dep"):
    doc = nlp(input_string)
    # style can be either "dep" (dependency tree) or "ent" (named entities)
    # "distance" is the distance between words in px (spaCy's default is 175)
    displacy.render(doc, style=style, jupyter=True, options={"distance": 100})

In [34]:
visualize_annotation(input_string, style="dep")

In [35]:
visualize_annotation(input_string, style="ent")

# Information extraction

Get information about a particular word in a given string.

In [36]:
def get_word_annotation(input_string, word_string):
    doc = nlp(input_string)

    words = []
    for sent in doc.sents:
        for i, word in enumerate(sent):
            # the root of a sentence is its own head: mark it with head id 0,
            # otherwise use the 1-based position of the head inside the sentence
            if word.head.i == word.i:
                head_idx = 0
            else:
                head_idx = word.head.i - sent.start + 1

            # "O" marks tokens outside any named entity
            entity_tag = word.ent_type_ or "O"

            word_obj = {"id": i+1, "word": word.text, "lemma": word.lemma_, "tag": word.tag_, "entity": entity_tag,
                                    "dependency": word.dep_, "head id": head_idx}
            words.append(word_obj)

    # return the first token whose text matches exactly, or None if there is none
    for word in words:
        if word["word"] == word_string:
            return word

    return None

In [37]:
print(get_word_annotation(input_string, "Mark"))

{'id': 4, 'word': 'Mark', 'lemma': 'Mark', 'tag': 'NNP', 'entity': 'PERSON', 'dependency': 'nsubj', 'head id': 5}


### Exercise 1: Search for relations

Define a function that takes as input a sentence (`input_string`) and the name of a relation (`relation_string`), parses the input with spaCy and returns the triples (w1, w2, relation), where w1 and w2 are lemmas. If the relation is not present, return an empty list.

```
def search_relation(input_string, relation_string):
    return word_obj_list
```

In [38]:
def search_relation(input_string, relation_string):
    word_obj_list = []
    return word_obj_list
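
One possible starting point, sketched without calling spaCy: the search can be run on word objects shaped like the ones returned by `get_word_annotation`. The helper name `triples_from_words` and the ordering (head lemma, dependent lemma, relation) are assumptions, since the exercise does not fix them. The sample records are copied from the annotation output of the example sentence (the comma is left out), and the lookup by `id` only works within a single sentence.

```python
def triples_from_words(words, relation_string):
    """Return (head lemma, dependent lemma, relation) triples for one relation."""
    # map each 1-based token id to its lemma so that heads can be looked up
    lemma_by_id = {w["id"]: w["lemma"] for w in words}
    triples = []
    for w in words:
        # head id 0 marks the root, which has no head to pair with
        if w["dependency"] == relation_string and w["head id"] != 0:
            triples.append((lemma_by_id[w["head id"]], w["lemma"], relation_string))
    return triples


# records taken from the annotation of the example sentence
words = [
    {"id": 1, "lemma": "in", "dependency": "prep", "head id": 5},
    {"id": 2, "lemma": "1982", "dependency": "pobj", "head id": 1},
    {"id": 4, "lemma": "Mark", "dependency": "nsubj", "head id": 5},
    {"id": 5, "lemma": "drive", "dependency": "ROOT", "head id": 0},
    {"id": 6, "lemma": "his", "dependency": "poss", "head id": 7},
    {"id": 7, "lemma": "car", "dependency": "dobj", "head id": 5},
]

print(triples_from_words(words, "nsubj"))  # [('drive', 'Mark', 'nsubj')]
print(triples_from_words(words, "amod"))   # []
```

Inside `search_relation`, the same loop could run directly over the tokens of the parsed `doc`, reading `token.dep_`, `token.lemma_` and `token.head.lemma_`.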

### Exercise 2: Search for entities

Define a function that takes as input a sentence (`input_string`) and the name of an entity type (`entity_type_string`), parses the input with spaCy and returns the words (the token objects, not the strings) that belong to entities of that type. If the entity type is not present, return an empty list.

```
def search_entity(input_string, entity_type_string):
    return word_obj_list
```
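
A minimal sketch of the filtering step, assuming spaCy is installed. To keep it independent of a downloaded pipeline, the `Doc` is built by hand: the entity tags below are written manually to mirror the annotations of the example sentence, they are not model output. The helper name `entity_tokens` is illustrative; in the exercise the `Doc` would come from `nlp(input_string)` instead.

```python
from spacy.tokens import Doc
from spacy.vocab import Vocab


def entity_tokens(doc, entity_type_string):
    """Return the Token objects whose entity type matches, or an empty list."""
    return [token for token in doc if token.ent_type_ == entity_type_string]


# hand-built Doc with token-level IOB entity tags (an assumption for the demo)
words = ["Mark", "drove", "from", "Los", "Angeles", "to", "Las", "Vegas"]
ents = ["B-PERSON", "O", "O", "B-GPE", "I-GPE", "O", "B-GPE", "I-GPE"]
doc = Doc(Vocab(), words=words, ents=ents)

gpe_tokens = entity_tokens(doc, "GPE")
print([token.text for token in gpe_tokens])
print(entity_tokens(doc, "ORG"))
```

Returning the tokens rather than their strings keeps attributes such as `token.i`, `token.lemma_` and `token.ent_iob_` available to the caller.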

### Exercise 3: Enriching the sentences

For every sentence in the QuestionClassification dataset, extract the `subject-verb` relation and the `verb-object` relation. Append these pairs to the original input, separated by `#`:

- Sentence: '*What is the full form of .com?*' 
- `subject-verb`: *What is*  
- `verb-object`: *is the full form* 
- Enriched sentence: '*What is the full form of .com? # What is # is the full form*'  

Store the enriched sentences in a new dataframe, train a classifier (SVM, NB, Rocchio, ...) and evaluate it.
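
The string format of the enriched sentence can be sketched as follows. The helper name `enrich_sentence` is hypothetical, and the pairs are passed in as ready-made strings; extracting them from the parse is the actual exercise.

```python
def enrich_sentence(sentence, pairs):
    """Append each extracted word pair to the sentence, separated by ' # '."""
    return " # ".join([sentence] + list(pairs))


enriched = enrich_sentence(
    "What is the full form of .com?",
    ["What is", "is the full form"],
)
print(enriched)  # What is the full form of .com? # What is # is the full form
```

A sentence without any extracted pair is returned unchanged, so the function can be applied to every row of the dataset.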

### Exercise 4: Replace texts with entities

For every sentence in the QuestionClassification dataset, take the entities annotated by spaCy, restricted to proper nouns (`PROPN`), and replace their spans in the text with the entity label:

- Sentence: '*In 1982, Mark drove his car from Los Angeles to Las Vegas until 5 of july*'  
- Modified Sentence: '*In 1982, PERSON drove his car from GPE to GPE until 5 of DATE*'

Store the modified sentences in a new dataframe, train a classifier (SVM, NB, Rocchio, ...) and evaluate it.

**WARNING**: be careful with `compounds`: they should be replaced by a SINGLE entity label, e.g. *Las Vegas* => `GPE`
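
The compound handling can be sketched on plain tuples, without calling spaCy. Each tuple is assumed to hold `(text, trailing whitespace, pos, entity type, IOB tag)`, which correspond to `token.text`, `token.whitespace_`, `token.pos_`, `token.ent_type_` and `token.ent_iob_`. The values below are written by hand for illustration, and `replace_entities` is a hypothetical helper name.

```python
def replace_entities(tokens):
    """Replace PROPN entity tokens by their label, one label per entity."""
    pieces = []      # [text, trailing whitespace] for every emitted piece
    inside = False   # True while we are inside an entity that was already replaced
    for text, ws, pos, ent_type, ent_iob in tokens:
        is_target = pos == "PROPN" and ent_type != ""
        if is_target and ent_iob == "I" and inside:
            # continuation of the same entity: keep the single label and
            # take over this token's trailing whitespace
            pieces[-1][1] = ws
            continue
        if is_target:
            pieces.append([ent_type, ws])
            inside = True
        else:
            pieces.append([text, ws])
            inside = False
    return "".join(text + ws for text, ws in pieces)


# hand-written annotations for a fragment of the example sentence
tokens = [
    ("Mark", " ", "PROPN", "PERSON", "B"),
    ("drove", " ", "VERB", "", "O"),
    ("from", " ", "ADP", "", "O"),
    ("Los", " ", "PROPN", "GPE", "B"),
    ("Angeles", " ", "PROPN", "GPE", "I"),
    ("to", " ", "ADP", "", "O"),
    ("Las", " ", "PROPN", "GPE", "B"),
    ("Vegas", "", "PROPN", "GPE", "I"),
]
modified = replace_entities(tokens)
print(modified)  # PERSON drove from GPE to GPE
```

Checking the IOB tag rather than only the entity type keeps two adjacent entities of the same type from being merged into one label.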

### Exercise 5: Compare the models

Compare the models from Exercises 3 and 4 with a simpler model you trained in the previous lessons in terms of the F1 measure.
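
For reference, the F1 measure can be computed by hand as in the sketch below. Macro-averaging over the classes is an assumption, since the exercise does not say which averaging to use, and the labels and predictions are made-up toy values, not results from the dataset.

```python
def macro_f1(y_true, y_pred):
    """Average the per-class F1 scores over all classes seen in the data."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        # F1 = 2*TP / (2*TP + FP + FN), taken as 0 when the denominator is 0
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# toy gold labels and predictions (hypothetical)
y_true = ["LOC", "LOC", "NUM", "HUM"]
y_pred = ["LOC", "NUM", "NUM", "HUM"]
score = macro_f1(y_true, y_pred)
print(round(score, 4))  # 0.7778
```

Evaluating every model with the same function on the same test split keeps the comparison between the baseline and the two enriched variants consistent.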