# Information Extraction

### Natural Language Processing and Information Extraction, 2024WS
Lecture 8, 12/13/2024

Gábor Recski

This material can be downloaded from [https://github.com/tuw-nlp-ie/tuw-nlp-ie-2024WS](https://github.com/tuw-nlp-ie/tuw-nlp-ie-2024WS)

## SLP3 relevant chapters

[17](https://web.stanford.edu/~jurafsky/slp3/17.pdf)

[20](https://web.stanford.edu/~jurafsky/slp3/20.pdf)

## What is Information Extraction?

**Extracting structured data from text**

|Domain|Text| Information |
|:---|:---|:---|
|HR | candidate CVs | name, address, workplaces, schools, skills|
|Business | business news  | acquisitions, stock price movements, bankruptcies|
|Medicine| clinical records  | symptoms, diagnoses, treatments|
|Science| scientific articles  | hypotheses, methods, results|
|Law | legislation, case law | courts, cases, verdicts|
|Advertising | news & social media | product mentions, company mentions |

### What should be the structure of the output?
Labels? Relations? Formulae?

### Which pieces of information do we try to extract?

### Task formulation

The answers depend on:
- business priorities
- data quality
- data quantity
- domain and genre
- budget and timeframe


And you shouldn't expect to get all the answers right on the first try.

### Relevant NLP technologies

- classification / labeling of documents, paragraphs, sentences
- **Named Entity Recognition (NER)**
- recognition of other relevant expressions (time, quantities, prices, links, IDs, etc.)
- **Relation Extraction (RE)**
- event extraction, time expression recognition and normalization, entity linking, ...

## Named Entity Recognition

![ner](ner_70.jpg)

![ner](ner1.jpg)

## NER is sequential tagging

| _American_ | _Airlines_ | _,_ | _a_  | _unit_ | _of_ | _AMR_   | _Corp_  | _._     | _,_  | _immediately_ | _matched_ | _the_ | _move_ |
| :-------: | :-------: | :-:| :-: | :---: | :-: | :--:   | :---:  | :-:    | :-: | :-: | :-: | :-: | :-: |
| B-ORG    | I-ORG    | O | O  | O    | O  | B-ORG | I-ORG | I-ORG | O  | O  | O  | O  | O  |
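
To make the B/I/O encoding concrete, here is a minimal sketch of how a tag sequence like the one in the table can be decoded back into entity spans. The function name `bio_to_spans` and the choice to join tokens with spaces are assumptions made for this illustration, not part of any particular NER library.

```python
def bio_to_spans(tokens, tags):
    """Collect (label, text) entity spans from parallel token and tag lists."""
    spans = []
    current_label, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always opens a new span, closing any span still open
            if current_tokens:
                spans.append((current_label, " ".join(current_tokens)))
            current_label, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_label == tag[2:]:
            # An I- tag with the same label continues the open span
            current_tokens.append(token)
        else:
            # "O" (or an I- tag without a matching open span) closes the span
            if current_tokens:
                spans.append((current_label, " ".join(current_tokens)))
            current_label, current_tokens = None, []
    if current_tokens:
        spans.append((current_label, " ".join(current_tokens)))
    return spans


tokens = ["American", "Airlines", ",", "a", "unit", "of", "AMR", "Corp", ".",
          ",", "immediately", "matched", "the", "move"]
tags = ["B-ORG", "I-ORG", "O", "O", "O", "O", "B-ORG", "I-ORG", "I-ORG",
        "O", "O", "O", "O", "O"]
spans = bio_to_spans(tokens, tags)
print(spans)
```

The B- prefix is what allows two adjacent entities of the same type to be kept apart, which a plain ORG/O labeling could not do.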

![ner4](ner4.jpg)

A simpler architecture for sequential tagging, which we mentioned for POS tagging, is the [Hidden Markov Model](https://github.com/tuw-nlp-ie/tuw-nlp-ie-2021WS/blob/main/lectures/06_Syntax/06b_POS_tagging_HMMs.ipynb) (HMM)

Modern DL models only need data, but they are essentially **black boxes**

HMMs and similar models require hand-coded features, but less data, and allow for **interpretability/explainability**

![ner3](ner3.jpg)

(Currently it is Figure 8.15)

## Questions?

## Relation extraction

### Example

_"Gryffindor values courage, bravery, nerve, and chivalry. Gryffindor's mascot is the lion, and its colours are scarlet and gold. The Head of this house is the Transfiguration teacher and Deputy Headmistress, Minerva McGonagall until she becomes headmistress, and the house ghost is Sir Nicholas de Mimsy-Porpington, more commonly known as Nearly Headless Nick. According to Rowling, Gryffindor corresponds roughly to the element of fire. The founder of the house is Godric Gryffindor."_ ([source](http://hogwartsss.epizy.com/gryffindor.html))

- values(Gryffindor, courage)
- mascot(Gryffindor, lion)
- color(Gryffindor, scarlet)
- head(Gryffindor, Minerva_McGonagall)
- house_ghost(Gryffindor, Sir_Nicholas_de_Mimsy-Porpington)
- founder(Gryffindor, Godric_Gryffindor)

## Rule-based approaches

Templates:
- _X dropped by Y points_ -> drop_by(X, Y)
- _X, CEO of Y_ -> ceo_of(X, Y)
- _X was born in Y_ -> born_in(X, Y)
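
Such templates can be approximated with regular expressions. The sketch below is only an illustration: the `NAME` pattern (a run of capitalised words), the `extract` helper and the example sentences are all assumptions made here, and a real system would need much more careful argument matching.

```python
import re

# Assumption for this sketch: an argument is a run of capitalised words
NAME = r"[A-Z]\w*(?: [A-Z]\w*)*"

# Each template pairs a compiled pattern with the relation it produces
TEMPLATES = [
    (re.compile(rf"({NAME}) dropped by (\d+(?:\.\d+)?) points"), "drop_by"),
    (re.compile(rf"({NAME}), CEO of ({NAME})"), "ceo_of"),
    (re.compile(rf"({NAME}) was born in ({NAME})"), "born_in"),
]


def extract(text):
    """Apply every template to the text and collect (relation, X, Y) triples."""
    relations = []
    for pattern, name in TEMPLATES:
        for match in pattern.finditer(text):
            relations.append((name, match.group(1), match.group(2)))
    return relations


# Made-up example text with hypothetical names
text = ("Jane Doe, CEO of Acme, spoke on Monday. "
        "Acme dropped by 12 points. "
        "Jane Doe was born in Vienna.")
relations = extract(text)
print(relations)
```

Any paraphrase that the patterns do not anticipate (e.g. _Acme lost 12 points_) is silently missed, which is where the recall problem discussed in this section comes from.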

If parsers, NER taggers, or chunkers are available, templates can refer to their output:

- X_NP _dropped by_ Y_NUM _points_ -> drop_by(X, Y)

- X_PERSON, CEO _of_ Y_ORGANIZATION -> ceo_of(X, Y)
- X_PERSON _was born in_ Y_LOCATION -> born_in(X, Y)

#### Pros:

- simple and effective, yields fast results

- high precision

#### Cons:

- low recall

- limited, no capacity for generalization

- real-life systems may contain thousands of templates, and require continuous development by experts

- many companies still depend on such systems
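
The precision/recall trade-off can be made concrete with a small worked example. The tuples below are invented toy data, and `precision_recall` is a helper written for this sketch: a template that fires rarely but is always right gets perfect precision and poor recall.

```python
def precision_recall(predicted, gold):
    """Precision and recall of a set of extracted tuples against a gold set."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Toy data: four true relations in the text, one found by the templates
gold = {
    ("born_in", "A", "W"),
    ("born_in", "B", "X"),
    ("born_in", "C", "Y"),
    ("born_in", "D", "Z"),
}
predicted = {("born_in", "A", "W")}

precision, recall = precision_recall(predicted, gold)
print(precision, recall)
```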

## Supervised learning

Use annotated text to train classifiers over pairs of entities

E.g. decide for each pair of named entities (PERSON, ORGANIZATION) whether they are in the "ceo_of" relationship
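
A minimal sketch of the candidate generation step, assuming entity spans are already available from an NER tagger. The feature names, the `candidate_pairs` helper and the example sentence are assumptions for illustration. A classifier (not shown) would be trained on such feature dictionaries together with gold labels.

```python
from itertools import product


def candidate_pairs(tokens, entities, type_a="PERSON", type_b="ORGANIZATION"):
    """Yield one feature dict per (type_a, type_b) entity pair in a sentence.

    entities: list of (start, end, label) token spans, with end exclusive.
    """
    firsts = [e for e in entities if e[2] == type_a]
    seconds = [e for e in entities if e[2] == type_b]
    for a, b in product(firsts, seconds):
        left, right = sorted([a, b])  # order the two spans by position
        between = tokens[left[1]:right[0]]
        yield {
            "arg1": " ".join(tokens[a[0]:a[1]]),
            "arg2": " ".join(tokens[b[0]:b[1]]),
            "words_between": between,
            "distance": len(between),
            "arg1_first": a[0] < b[0],
        }


# Made-up sentence with hypothetical names and hand-written entity spans
tokens = "Jane Doe , CEO of Acme , met John Roe of Initech".split()
entities = [
    (0, 2, "PERSON"),
    (5, 6, "ORGANIZATION"),
    (8, 10, "PERSON"),
    (11, 12, "ORGANIZATION"),
]
candidates = list(candidate_pairs(tokens, entities))
for candidate in candidates:
    print(candidate)
```

Two persons and two organizations already yield four candidates, most of which are negative examples. This is typical for pairwise formulations of relation extraction.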

![re1](re1.jpg)

![re1b](re1b.png)

![re1c](re1c.png)

#### Pros:

- effective if a large training sample is available and target texts are similar

#### Cons: 

- requires a fair amount of annotated data (costly to produce)

- doesn't generalize well across genres and domains

## Semi-supervised approaches

### Bootstrapping

When little or no training data is available, we must generalize from what we have:

- a few annotated examples -> some patterns -> more examples -> more patterns

- a few patterns -> some examples -> more patterns -> more examples

### Example

seed tuple: author(William_Shakespeare, Hamlet) 

- _William Shakespeare's Hamlet_
- _the William Shakespeare play Hamlet_
- _Hamlet by William Shakespeare_
- _Hamlet is a tragedy written by William Shakespeare_

- X's Y
- the X play Y
- Y by X
- Y is a tragedy written by X
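
One round of this loop can be sketched in a few lines. Everything here is a simplification made for illustration: patterns are plain strings, arguments are matched as runs of capitalised words (the `NAME` assumption), and the one-sentence corpus is invented. Real bootstrapping systems also score patterns and tuples to keep noisy patterns such as _X's Y_ from derailing the process.

```python
import re

# Assumption for this sketch: an argument is a run of capitalised words
NAME = r"[A-Z]\w*(?: [A-Z]\w*)*"


def induce_patterns(seed, sentences):
    """Turn every sentence that mentions both seed arguments into an X/Y pattern."""
    x, y = seed
    return [s.replace(x, "X").replace(y, "Y")
            for s in sentences if x in s and y in s]


def apply_pattern(pattern, text):
    """Find new (X, Y) tuples in the text that match a pattern."""
    regex = re.escape(pattern)
    regex = regex.replace("X", f"(?P<x>{NAME})", 1)
    regex = regex.replace("Y", f"(?P<y>{NAME})", 1)
    return [(m.group("x"), m.group("y")) for m in re.finditer(regex, text)]


seed = ("William Shakespeare", "Hamlet")
sentences = [
    "William Shakespeare's Hamlet",
    "the William Shakespeare play Hamlet",
    "Hamlet by William Shakespeare",
    "Hamlet is a tragedy written by William Shakespeare",
]
patterns = induce_patterns(seed, sentences)

# Apply the induced patterns to unseen text to harvest new tuples
corpus = "Frankenstein by Mary Shelley is a novel."
new_tuples = {t for p in patterns for t in apply_pattern(p, corpus)}
print(patterns)
print(new_tuples)
```

The new tuples would then serve as seeds for the next round.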

seed tuple: hub(Ryanair, Charleroi)

- _Budget airline Ryanair, which uses Charleroi as a hub, scrapped all weekend flights out of the airport._
- _All flights in and out of Ryanair’s Belgian hub at Charleroi airport were grounded on Friday_
- _A spokesman at Charleroi, a main hub for Ryanair, estimated that 8000 passengers had already been affected._

- ORG, _which uses_ LOC _as a hub_
- ORG _'s hub at_ LOC
- LOC, _a main hub for_ ORG

![re2](re2.jpg)

(Currently Figure 21.5)

### Distant supervision

Similar to bootstrapping, but on a larger scale

Use large databases of known relations to find possible occurrences of those relations in text

Then train a supervised model to assign confidence values / find the true relations among the candidates
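
The candidate generation step of distant supervision might look like the sketch below. The tiny `knowledge_base`, the second (invented) sentence, the hand-listed entity mentions and the `no_relation` label for unmatched pairs are all assumptions for illustration.

```python
def label_candidates(knowledge_base, sentences):
    """Label every ordered entity pair in a sentence with its KB relation, if any."""
    labelled = []
    for text, entities in sentences:
        for e1 in entities:
            for e2 in entities:
                if e1 != e2:
                    relation = knowledge_base.get((e1, e2), "no_relation")
                    labelled.append((text, e1, e2, relation))
    return labelled


# A one-entry stand-in for a large database of relations
knowledge_base = {("Ryanair", "Charleroi"): "hub"}

# (sentence, entity mentions) pairs; mentions would normally come from NER
sentences = [
    ("A spokesman at Charleroi, a main hub for Ryanair, estimated that "
     "8000 passengers had already been affected.", ["Charleroi", "Ryanair"]),
    ("Ryanair flights to Charleroi were delayed.", ["Ryanair", "Charleroi"]),
]

labelled = label_candidates(knowledge_base, sentences)
positives = [row for row in labelled if row[3] == "hub"]
for row in labelled:
    print(row[1:], "<-", row[0][:40])
```

The second sentence is labelled `hub` even though it does not express that relation. Such noisy labels are the reason a supervised model is still needed to separate true occurrences from accidental co-occurrences.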

## Unsupervised approaches

If we (only) know what the entities are, we can try to discover (new) relations

### Open IE

The patterns ARE the relations

They are extracted using simple constraints

![oie](oie.png)

Syntactic and lexical constraints:
```
V | VP | VW*P
V = verb particle? adv?
W = (noun | adj | adv | pron | det )
P = (prep | particle | inf. marker)
```

_United has a hub in Chicago, which is the headquarters of United
Continental Holdings_

```
r1: <United, has a hub in, Chicago>
r2: <Chicago, is the headquarters of, United Continental Holdings>
```
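
The syntactic constraint can be sketched as a regular expression over a string of one-letter POS codes. Several things here are assumptions for illustration: the letter encoding, the hand-assigned tags (no POS tagger is run), and the simplification that each argument is the nearest run of nouns on either side of the relation phrase. The lexical constraint is not covered.

```python
import re

# One-letter codes for coarse POS tags (an encoding invented for this sketch):
# v=verb, t=particle, r=adverb, n=noun, j=adjective, o=pronoun,
# d=determiner, p=preposition, i=infinitive marker, x=anything else
RELATION = re.compile(r"vt?r?(?:[njrod]*[pti])?")  # V | VP | VW*P


def nouns_before(code, i):
    """Slice covering the nearest run of nouns that ends before position i."""
    end = i
    while end > 0 and code[end - 1] != "n":
        end -= 1
    start = end
    while start > 0 and code[start - 1] == "n":
        start -= 1
    return slice(start, end)


def nouns_after(code, i):
    """Slice covering the nearest run of nouns that starts at or after position i."""
    start = i
    while start < len(code) and code[start] != "n":
        start += 1
    end = start
    while end < len(code) and code[end] == "n":
        end += 1
    return slice(start, end)


def extract_triples(tokens, code):
    """Return <arg1, relation phrase, arg2> triples for every pattern match."""
    triples = []
    for match in RELATION.finditer(code):
        arg1 = tokens[nouns_before(code, match.start())]
        arg2 = tokens[nouns_after(code, match.end())]
        if arg1 and arg2:
            relation = tokens[match.start():match.end()]
            triples.append(tuple(" ".join(part) for part in (arg1, relation, arg2)))
    return triples


tokens = ("United has a hub in Chicago , which is the headquarters "
          "of United Continental Holdings").split()
code = "nvdnpnxovdnpnnn"  # hand-assigned tags, one letter per token
triples = extract_triples(tokens, code)
for triple in triples:
    print(triple)
```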

Confidence values are again assigned by a supervised model

![re3](re3.jpg)

## Questions?