<a href="https://colab.research.google.com/github/valexharo/information_extraction/blob/master/nlp_in_action_2019.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Natural Language Processing in action

This _notebook_ illustrates the ease of use and flexibility of [__spaCy__](https://spacy.io/) `>= 2` for processing text. It was used during our talk at WOMEN IN DATA SCIENCE VALENCIA 2020, held in Valencia.

## Objective

In this _notebook_ you will learn:

1. the basic concepts of __spaCy__
1. how to find mentions of certain terms in a text
1. how to extract structured information from a text using the linguistic annotation that __spaCy__ provides

## Preliminaries

You will need the following dependencies installed in your environment:

- __spaCy__
- the `en_core_web_md` language model

A language model is a class whose core consists of a set of rules, a statistical model, and a pipeline for carrying out:

- tokenization
- lemmatization
- part-of-speech tagging
- syntactic dependency parsing
- named entity recognition
- sentence segmentation

The model is a convolutional neural network (CNN) trained on the OntoNotes corpus and comes with GloVe vectors trained on the Common Crawl corpus. These vectors make semantic operations possible, such as comparing words by similarity.
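
As a rough illustration of what a semantic operation on word vectors looks like, vectors are commonly compared with cosine similarity. The sketch below uses tiny made-up 3-dimensional vectors (not real GloVe values) and a hypothetical helper `cosine_similarity`; it only shows the arithmetic.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors invented for this example; real word vectors have hundreds of dimensions.
toy_vectors = {
    "breach": [0.9, 0.1, 0.0],
    "leak": [0.8, 0.2, 0.1],
    "paella": [0.0, 0.1, 0.9],
}

sim_related = cosine_similarity(toy_vectors["breach"], toy_vectors["leak"])
sim_unrelated = cosine_similarity(toy_vectors["breach"], toy_vectors["paella"])
print(round(sim_related, 3), round(sim_unrelated, 3))
```

With a model that ships vectors, spaCy exposes a comparable score through `Token.similarity`.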

You can install them by running the following commands:

```shell
pip install -U spacy
python -m spacy download en_core_web_md
```

## The basics

Import the modules.

In [0]:
import spacy

[`Doc`](https://spacy.io/api/doc) is an object that represents a text analyzed with __spaCy__ and, in turn, contains `Token` and `Span` objects. A [`Token`](https://spacy.io/api/token) is the data structure that represents a word and its linguistic annotation. A [`Span`](https://spacy.io/api/span) is a slice of a `Doc`. For example, a sentence is a `Span` object, and so is a named entity.
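
A minimal sketch of how the three objects relate, assuming spaCy is installed. It uses a blank English pipeline (tokenizer only), so it needs no model download, and the names `nlp_blank`, `doc_example`, `first_token` and `term_span` are chosen here so that they do not clash with the variables used later in the notebook.

```python
import spacy

# A blank English pipeline only tokenizes; no statistical model is loaded.
nlp_blank = spacy.blank("en")
doc_example = nlp_blank("Uber agreed to pay a penalty over a data breach")

first_token = doc_example[0]    # indexing a Doc gives a Token
term_span = doc_example[8:10]   # slicing a Doc gives a Span

print(first_token.text, first_token.i)
print(term_span.text, term_span.start, term_span.end)
```

Because the pipeline is blank, attributes such as `pos_` or `dep_` are not populated here; they require the tagger and parser of a trained model.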

In [0]:
from spacy.tokens import Doc, Span

The [`Matcher`](https://spacy.io/api/matcher) is a class that finds sequences of tokens based on rules. These [rules](https://spacy.io/usage/linguistic-features#section-rule-based-matching) resemble regular expressions, but they take advantage of all the linguistic information that __spaCy__ adds when it parses a text.

In [0]:
from spacy.matcher import Matcher
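
To make the regular-expression analogy concrete, here is a toy, pure-Python matcher over lowercased tokens. It is not how spaCy implements `Matcher`: the function `find_pattern` and the whitespace tokenization are simplifications assumed for this sketch. Each pattern element is a dict with a `LOWER` key, mirroring the token-pattern syntax used later in this notebook.

```python
def find_pattern(tokens, pattern):
    """Return (start, end) token offsets where every pattern element matches.

    Only the 'LOWER' attribute is supported in this simplified sketch.
    """
    matches = []
    width = len(pattern)
    for start in range(len(tokens) - width + 1):
        window = tokens[start:start + width]
        if all(tok.lower() == spec["LOWER"] for tok, spec in zip(window, pattern)):
            matches.append((start, start + width))
    return matches

toy_tokens = "Uber concealed a massive Data Breach for a year".split()
toy_pattern = [{"LOWER": "data"}, {"LOWER": "breach"}]
toy_matches = find_pattern(toy_tokens, toy_pattern)
print(toy_matches)
```

The real `Matcher` supports many more token attributes than `LOWER`, which is what makes the rules linguistically aware.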

__spaCy__ comes with a visualizer for displaying linguistic annotation, called [__displaCy__](https://spacy.io/usage/visualizers). A [demo](https://explosion.ai/demos/displacy-ent) is available online.

In [0]:
from spacy import displacy

In [0]:
from spacy.symbols import NOUN, PROPN, VERB

In [0]:
from itertools import takewhile

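Let's install the spaCy model. Note that the cells below download and load the small model `en_core_web_sm`, not the `en_core_web_md` model listed in the preliminaries.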

In [0]:
!python -m spacy download en_core_web_sm

✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')


Now load the model. This takes a few seconds.

In [0]:
nlp = spacy.load("en_core_web_sm")

Pipes `==` stages: spaCy lets us organize text processing into `pipelines`. For a detailed explanation of the paradigm and of what each of the stages shipped with the __spaCy__ models does, see the [documentation](https://spacy.io/usage/processing-pipelines).

In [0]:
nlp.pipe_names

['tagger', 'parser', 'ner']

- the [_tagger_](https://spacy.io/usage/linguistic-features#section-pos-tagging) annotates the part of speech of each word (noun, adjective, verb, preposition...)
- the [_parser_](https://spacy.io/usage/linguistic-features#section-dependency-parse) is the syntactic annotator; it labels the function of each word within the sentence (subject, verb, direct object...)
- [_ner_](https://spacy.io/usage/linguistic-features#section-named-entities) is the module that identifies named entities (names of people, countries, cities, organizations, quantities, dates...)
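
The pipeline idea can be sketched without spaCy: each stage is a callable that receives the document, enriches it, and returns it for the next stage. In the sketch below a plain dict stands in for the `Doc`, and the stage names and functions (`lowercase_stage`, `length_stage`, `run_pipeline`) are hypothetical, chosen only for illustration.

```python
def lowercase_stage(doc):
    """Add a lowercased copy of each token."""
    doc["lower"] = [tok.lower() for tok in doc["tokens"]]
    return doc

def length_stage(doc):
    """Record the number of tokens."""
    doc["n_tokens"] = len(doc["tokens"])
    return doc

def run_pipeline(text, stages):
    """Tokenize on whitespace, then pass the result through each stage in order."""
    doc = {"text": text, "tokens": text.split()}
    for _name, stage in stages:
        doc = stage(doc)
    return doc

toy_stages = [("lowercase", lowercase_stage), ("length", length_stage)]
toy_doc = run_pipeline("Uber agreed Wednesday", toy_stages)
print([name for name, _ in toy_stages])
print(toy_doc["lower"], toy_doc["n_tokens"])
```

The same contract (take a document, return a document) is what allows a custom stage to be appended to a spaCy pipeline.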

Now let's analyze the following text fragment with __spaCy__.

In [0]:
text = '''Uber pays $148 mn over data breach in latest image-boosting move. 
SAN FRANCISCO.
Uber agreed Wednesday to pay a $148 million penalty over a massive
2016 data breach which the company concealed for a year, in the latest effort by the 
global ridesharing giant to improve its image and move past its missteps from its early years.'''

Analyze the text linguistically with the model; the call returns an object of class `Doc`.

In [0]:
doc = nlp(text)

The `Doc` exposes a generator of sentences (`sents`) and all the words (`Token`).

In [0]:
for sent in doc.sents:
    for token in sent:
        if not token.is_space:
            print("{:<15}{:<15}{}".format(
                token.text,  # the word as it appeared in the text
                token.lemma_,  # its lemmatized form
                token.pos_  # the part of speech of the word
            ))
    print('\n')  # blank lines mark the end of a sentence

Uber           Uber           PROPN
pays           pay            VERB
$              $              SYM
148            148            NUM
mn             mn             NOUN
over           over           ADP
data           datum          NOUN
breach         breach         NOUN
in             in             ADP
latest         late           ADJ
image          image          NOUN
-              -              PUNCT
boosting       boost          VERB
move           move           NOUN
.              .              PUNCT


SAN            SAN            PROPN
FRANCISCO      FRANCISCO      PROPN
.              .              PUNCT


Uber           Uber           PROPN
agreed         agree          VERB
Wednesday      Wednesday      PROPN
to             to             PART
pay            pay            VERB
a              a              DET
$              $              SYM
148            148            NUM
million        million        NUM
penalty        penalty        NOUN
over           ov

We can also retrieve the named entities that it found in the text.

In [0]:
doc.ents

(148, SAN FRANCISCO, Uber, Wednesday, $148 million, 2016, a year, early years)

In [0]:
for ent in doc.ents:
    print("{:<15}{}".format(
        ent.text,  # the text marked as a named entity; it can be one or more words
        ent.label_  # the category assigned by spaCy
    ))

148            MONEY
SAN FRANCISCO  GPE
Uber           PERSON
Wednesday      DATE
$148 million   MONEY
2016           DATE
a year         DATE
early years    DATE
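
A common next step is to group the entities by category. The sketch below copies the (text, label) pairs from the output above into a plain list so that it runs on its own; with a live `Doc` the same loop would iterate over `doc.ents` and read `ent.text` and `ent.label_`.

```python
from collections import defaultdict

# (text, label) pairs copied from the output above.
entity_pairs = [
    ("148", "MONEY"),
    ("SAN FRANCISCO", "GPE"),
    ("Uber", "PERSON"),
    ("Wednesday", "DATE"),
    ("$148 million", "MONEY"),
    ("2016", "DATE"),
    ("a year", "DATE"),
    ("early years", "DATE"),
]

entities_by_label = defaultdict(list)
for ent_text, ent_label in entity_pairs:
    entities_by_label[ent_label].append(ent_text)

for ent_label, texts in entities_by_label.items():
    print("{:<10}{}".format(ent_label, texts))
```

Note that the output above labels Uber as `PERSON`, a reminder that statistical entity recognition may need checking before it is used downstream.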


## How to find terms in a text

Imagine this is our ontology. The dictionary keys are the concepts, and the list items are the terms, each of which is in turn a list of words.

In [0]:
terminology = {
    "CstSec": [
        [{'LOWER':'data'}, {'LOWER':'breach'}],
        [{'LOWER':'data'}, {'LOWER':'protection'}],
        [{'LOWER':'personal'}, {'LOWER':'information'}]
    ]
    ,
    "IntelPty": [
        [{'LOWER':'trade'}, {'LOWER':'secrets'}]
    ]
    ,
    "HuRgts": [
        [{'LOWER':'discrimination'}]
    ]
} 
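
To see the structure at a glance, the nested patterns can be flattened back into plain strings. The helper `surface_terms` below is hypothetical, and the sketch repeats a shortened excerpt of the terminology so that it runs on its own.

```python
# A shortened excerpt of the terminology defined above.
terminology_excerpt = {
    "CstSec": [
        [{"LOWER": "data"}, {"LOWER": "breach"}],
        [{"LOWER": "data"}, {"LOWER": "protection"}],
    ],
    "IntelPty": [
        [{"LOWER": "trade"}, {"LOWER": "secrets"}],
    ],
}

def surface_terms(terminology):
    """Map each concept to its terms rendered as plain strings."""
    return {
        concept: [" ".join(word["LOWER"] for word in term) for term in terms]
        for concept, terms in terminology.items()
    }

print(surface_terms(terminology_excerpt))
```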

Recall the stages we had for processing a text in __spaCy__:

In [0]:
nlp.pipe_names

['tagger', 'parser', 'ner']

Next we define a new stage. It wraps a `Matcher` object that will let us search for the terms in a `Doc` object.

In [0]:
class MyMatcher:

    def __init__(self, nlp, terminology, label='Match', function=None):  # constructor of our matcher
        self.matcher = Matcher(nlp.vocab)  # create a Matcher object
        for topic, patterns in terminology.items():  # load the terms of the mini ontology
            for term in patterns:
                # spaCy 2.x signature: add(key, on_match, *patterns)
                self.matcher.add(topic, function, term)
        # add an extension attribute to Doc so we can store the matches
        Doc.set_extension('rule_match', default=False, force=True)

    def __call__(self, doc):  # called by the spaCy pipeline
        matches = self.matcher(doc)  # apply the matcher to the Doc

        spans = []
        for label, start, end in matches:  # for each term found
            span = Span(doc, start, end, label=label)  # create a Span object
            spans.append(span)

        doc._.rule_match = spans  # store the Spans in the attribute declared in the constructor
        return doc

We instantiate the term matcher with our terminology.

In [0]:
my_matcher = MyMatcher(nlp, terminology, label='about')

And we add it to our linguistic analyzer as a new stage.

In [0]:
nlp.add_pipe(my_matcher, name="term_matcher", after='ner')

Now we check that it has been added to the pipeline.

In [0]:
nlp.pipe_names

['tagger', 'parser', 'ner', 'term_matcher']

## How to extract structured information from a text

With the linguistic information that __spaCy__ provides, we will extract who (subject) did (verb) what (object) in the sentences that mention one of the ontology terms. To do so we will use the syntactic information.

This is an example of the syntactic dependencies annotated by __spaCy__ for a very simple sentence.

In [0]:
displacy.render(nlp("Paul ate paella."), style='dep', jupyter=True)
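
The kind of rule we are about to write can be sketched on a hand-written parse. The triples below are the dependency labels one would typically expect for this sentence; they are written by hand as an assumption, not produced by running spaCy, and `simple_svo` is a hypothetical helper that only handles a single clause.

```python
# Hand-written (token, dependency label, head) triples for "Paul ate paella."
toy_parse = [
    ("Paul", "nsubj", "ate"),
    ("ate", "ROOT", "ate"),
    ("paella", "dobj", "ate"),
    (".", "punct", "ate"),
]

def simple_svo(parse):
    """Return (subject, verb, object) for a single-clause parse, or None."""
    verb = next((tok for tok, dep, _ in parse if dep == "ROOT"), None)
    subj = next((tok for tok, dep, head in parse
                 if dep == "nsubj" and head == verb), None)
    obj = next((tok for tok, dep, head in parse
                if dep == "dobj" and head == verb), None)
    if verb is None or subj is None or obj is None:
        return None
    return (subj, verb, obj)

print(simple_svo(toy_parse))
```

Real sentences need more care (auxiliaries, compound nouns, conjunctions, prepositional phrases), which is what the borrowed helper functions handle.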

Next we need to write the syntactic rules that retrieve the elements of the tuple we are looking for. Here we borrow some functions from [textacy](https://github.com/chartbeat-labs/textacy), a very useful module for text analysis.

In [0]:
SUBJ_list = ['agent', 'csubj', 'csubjpass', 'expl', 'nsubj', 'nsubjpass']
OBJ_list = ['attr', 'dobj', 'dative', 'oprd', 'pobj']
AUX_list = ['aux', 'auxpass', 'neg']
prepositional_phrase = ['NOUN', 'ADP', 'PROPN']

def get_main_verbs_of_sent(sent):
    """Return the main (non-auxiliary) verbs in a sentence."""
    return [tok for tok in sent
            if tok.pos == VERB and tok.dep_ not in {'aux', 'auxpass'}]

def get_subjects_of_verb(verb):
    """Return all subjects of a verb according to the dependency parse."""
    subjs = [tok for tok in verb.lefts
             if tok.dep_ in SUBJ_list]
    # get additional conjunct subjects
    subjs.extend(tok for subj in subjs for tok in _get_conjuncts(subj))
    return subjs

def get_objects_of_verb(verb):
    """
    Return all objects of a verb according to the dependency parse,
    including open clausal complements.
    """
    objs = [tok for tok in verb.rights
            if tok.dep_ in OBJ_list]
    # get open clausal complements (xcomp)
    objs.extend(tok for tok in verb.rights
                if tok.dep_ == 'xcomp')
    # get additional conjunct objects
    objs.extend(tok for obj in objs for tok in _get_conjuncts(obj))
    return objs

def get_preprositional_phrase(obj):
    """
    Receive an object token and return the prepositions attached to its
    right, i.e. the heads of its prepositional phrases. This is meant to
    capture patterns like (NOUN + ADP + PROPN) or (NOUN + ADP + NOUN).
    """
    return [right for right in obj.rights
            if right.dep_ == 'prep']

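def get_span_for_subtree(obj):
    """
    Return document indexes for a span that starts at ``obj`` and is
    extended to the right by one token per direct right child of ``obj``.
    Note that this counts direct children only, not whole subtrees.
    """
    min_i = obj.i
    max_i = obj.i + sum(1 for _ in obj.rights)
    return (min_i, max_i)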

def get_span_for_compound_noun(noun):
    """
    Return document indexes spanning all (adjacent) tokens
    in a compound noun.
    """
    min_i = noun.i - sum(1 for _ in takewhile(lambda x: x.dep_ == 'compound',
                                              reversed(list(noun.lefts))))
    return (min_i, noun.i)


def get_span_for_verb_auxiliaries(verb):
    """
    Return document indexes spanning all (adjacent) tokens
    around a verb that are auxiliary verbs or negations.
    """
    min_i = verb.i - sum(1 for _ in takewhile(lambda x: x.dep_ in AUX_list,
                                              reversed(list(verb.lefts))))
    max_i = verb.i + sum(1 for _ in takewhile(lambda x: x.dep_ in AUX_list,
                                              verb.rights))
    return (min_i, max_i)

def _get_conjuncts(tok):
    """
    Return conjunct dependents of the leftmost conjunct in a coordinated phrase,
    e.g. "Burton, [Dan], and [Josh] ...".
    """
    return [right for right in tok.rights
            if right.dep_ == 'conj']


def subject_verb_object(sent, start_i):
    """
    Extract an ordered sequence of subject-verb-object (SVO) triples from a
    spaCy-parsed sentence. Note that this only works for SVO languages.

    Args:
        sent (``spacy.Span``): the sentence to analyze
        start_i (int): document index of the first token of ``sent``

    Yields:
        Tuple[``spacy.Span``, ``spacy.Token``, ``spacy.Span``]: the next 3-tuple
            from ``sent`` representing a (subject, verb, object) triple,
            in order of appearance
    """
    
    obj_pp =[]
    verbs = get_main_verbs_of_sent(sent)
    for verb in verbs:
        subjs = get_subjects_of_verb(verb)
        if not subjs:
            continue
        objs = get_objects_of_verb(verb)
        if not objs:
            continue
                
        for subj in subjs:
            subj = sent[get_span_for_compound_noun(subj)[0] - start_i: subj.i - start_i + 1]
            for obj in objs:
                    
                if obj.pos == NOUN:
                    span = get_span_for_compound_noun(obj)
                    obj_pp = get_preprositional_phrase(obj)
                        
                    if obj_pp:
                        span = (span[0], get_span_for_subtree(obj_pp[0])[1])
                            
                elif obj.pos == VERB:
                    span = get_span_for_verb_auxiliaries(obj)
                    obj_pp = get_preprositional_phrase(obj)
                    if obj_pp:
                        span = (span[0], get_span_for_subtree(obj_pp[0])[1])
                           
                else:
                    span = (obj.i, obj.i)
                    obj_pp = get_preprositional_phrase(obj)
                    if obj_pp:
                        span = (span[0], get_span_for_subtree(obj_pp[0])[1])
                obj = sent[span[0] - start_i: span[1] - start_i + 1]
                yield (subj, verb, obj)
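
The span helpers above rely on `itertools.takewhile` to count how many tokens immediately to the left of a head share a given dependency label. A standalone sketch with plain tuples shows the idea for a compound noun; the labels are hand-written assumptions for this example, not parser output.

```python
from itertools import takewhile

# (text, dep) pairs for the left children of "breach" in "a massive data breach",
# in document order.
left_children = [("a", "det"), ("massive", "amod"), ("data", "compound")]
head_index = 3  # position of "breach" in the four-token phrase

# Walk leftwards from the head and stop at the first non-compound child.
n_compound = sum(1 for _ in takewhile(lambda child: child[1] == "compound",
                                      reversed(left_children)))
compound_span = (head_index - n_compound, head_index)
print(compound_span)
```

As in the helpers, both ends of the returned pair are inclusive token indexes, which is why the caller adds 1 when slicing.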
        

Now let's work with a longer text, a news story about Uber.

In [0]:
text = '''Uber pays $148 mn over data breach in latest image-boosting move. 
SAN FRANCISCO - Uber agreed Wednesday to pay a $148 million penalty over a massive 2016 data breach which the company concealed for a year, in the latest effort by the global ridesharing giant to improve its image and move past its missteps from its early years.
The settlement stems from a breach affecting some 57 million Uber riders and drivers, prompting litigation that was eventually joined by officials from the 50 US states and the District of Columbia.

The payment, described as the largest in a data breach settlement, is part of Uber's efforts to burnish its reputation after a series of scandals over alleged misconduct and unethical practices.

Uber disclosed the breach last year shortly after it hired chief executive Dara Khosrowshahi, who promised a new way of doing business as the company with an estimated value of more than $70 billion expands globally and prepares for what could be a massive stock offering.

"The commitments we're making in this agreement are in line with our focus on both physical and digital safety for our customers," Uber's chief legal officer Tony West said in announcing the settlement.

"We know that earning the trust of our customers and the regulators we work with globally is no easy feat ... We'll continue to invest in protections to keep our customers and their data safe and secure, and we're committed to maintaining a constructive and collaborative relationship with governments around the world."

The company reached an agreement with the US Federal Trade Commission on the breach that called for improved security and audits but no financial penalty.

According to officials, Uber paid data thieves $100,000 to destroy the swiped information -- and remained quiet about the breach for a year.

The settlement avoid a potentially lengthy court fight which could be embarrassing to Uber.

- Improving security -

As part of the settlement, Uber will be required to improve its security practices, with an independent outside review of data practices.

Illinois Attorney General Lisa Madigan said her office would oversee a fund of $5.1 million that would pay each driver from the state $100, and seek to locate those who may no longer be driving for Uber.

"While Uber is now taking the appropriate steps to protect the data of its drivers in Illinois and across the country, the company's initial response was unacceptable," Madigan said. "Companies cannot hide when they break the law."

New York Attorney General Barbara Underwood said: "This record settlement should send a clear message: we have zero tolerance for those who skirt the law and leave consumer and employee information vulnerable to exploitation."

The case is the second large court settlement this year for Uber.

In February, Uber agreed to pay $245 million to Alphabet's self-driving car unit Waymo to settle a lawsuit over allegedly stolen trade secrets.

As part of its transparency effort, Uber this year also scrapped policies requiring arbitration over claims of sexual misconduct involving employees, riders and drivers, allowing cases to be heard in public and pursued in open court.

As a privately held firm, Uber is not required to report its finances. Released data from the second quarter however shows it lost $891 million on revenues of $2.8 billion, with bookings hitting a total of $12 billion.

About Uber
'''

We parse the input text with the new pipeline we assembled in the previous section.

In [0]:
doc = nlp(text)

We apply the rules to extract who (subject) did (verb) what (object) in sentences that mention one of the ontology terms.

In [0]:
if doc._.rule_match:
    for match in doc._.rule_match:
        span = doc[match.start : match.end]  # matched span
        sent = span.sent  # sentence containing matched span
        match_ents = [{'start': span.start_char - sent.start_char,
                       'end': span.end_char - sent.start_char,
                       'label': nlp.vocab.strings[match.label]}]

        displacy.render([{'text': sent.text, 'ents': match_ents}], style='ent', manual=True, jupyter=True)

        start_i = sent[0].i
        for info in subject_verb_object(sent, start_i):
            print("Topic:\t{}\nSubject:\t{}\nVerb:\t{}\nObject:\t{}".format(
                span.text,
                info[0].text,
                info[1].text,
                info[2].text))

Topic:	data breach
Subject:	Uber
Verb:	pays
Object:	148


Topic:	data breach
Subject:	Uber
Verb:	agreed
Object:	to pay a $148 million penalty over a


Topic:	data breach
Subject:	payment
Verb:	is
Object:	part of Uber


Topic:	trade secrets
Subject:	Uber
Verb:	agreed
Object:	to pay
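
The cell above passes displaCy a manually built entity whose character offsets are relative to the sentence, not to the whole document. The sketch below reproduces that offset arithmetic on plain strings; `relative_entity` is a hypothetical helper, and `str.find` stands in for the matcher.

```python
def relative_entity(sentence, term, label):
    """Build a displaCy-style entity dict with offsets relative to the sentence."""
    start = sentence.find(term)
    if start == -1:
        return None
    return {"start": start, "end": start + len(term), "label": label}

toy_sentence = "Uber pays $148 mn over data breach in latest image-boosting move."
toy_entity = relative_entity(toy_sentence, "data breach", "CstSec")
print(toy_entity)
print(toy_sentence[toy_entity["start"]:toy_entity["end"]])
```

Slicing the sentence with the computed offsets should give back the matched term, which is a quick way to check that the highlighted region is correct.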


__The End!__