# Week 6

# Information extraction

<img src="images/_0.png" width="20%" >

# Agenda

+ Nature of information extraction tasks (Chapter 18 of Jurafsky and Martin's book)
+ Various forms of information extraction (Chapter 18 of Jurafsky and Martin's book)
+ Real-world example

# A prototypical information extraction task

Imagine that you are an analyst with an investment firm that tracks airline 
stocks:

+ You’re given the task of determining the relationship (if any) between 
  airline announcements of fare increases and the behavior of their 
  stocks the next day
+ Historical data about stock prices is easy to come by, but what about 
  the airline announcements?
  * You will need to know at least:
    - the name of the airline
    - the nature of the proposed fare hike
    - the dates of the announcement
    - the response of other airlines.

# A prototypical document to process

```
Citing high fuel prices, United Airlines said Friday it has increased fares 
by $6 per round trip on flights to some cities also served by lower-cost 
carriers. American Airlines, a unit of AMR Corp., immediately matched the 
move, spokesman Tim Wagner said. United, a unit of UAL Corp., said the 
increase took effect Thursday and applies to most routes where it competes 
against discount carriers, such as Chicago to Dallas and Denver to San Francisco.
```

# The information extraction process

+ The process of information extraction (IE) turns the unstructured
  information embedded in texts into structured data
  - by structured data, we mean annotated words or combinations of words
  - these structured data can be used in different ways:
    * to populate a relational database
    * as features to be post-processed in a stat/ML analysis (e.g., to
      cluster standard errors around entities such as companies, products,
      or individuals)
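
As a minimal sketch of what "structured data" could look like for the airline example, the record below holds the fields the analyst needs. The class name `FareAnnouncement` and its field names are illustrative assumptions, and the values are copied by hand from the sample document rather than extracted automatically.

```python
from dataclasses import dataclass, asdict


@dataclass
class FareAnnouncement:
    """Hypothetical target record for one fare announcement."""
    airline: str
    parent: str
    amount_usd: int
    announced_on: str   # weekday as written in the text, not yet normalized
    effective_on: str   # weekday as written in the text, not yet normalized
    matched_by: list    # airlines reported to have matched the move


# Values transcribed manually from the sample document.
record = FareAnnouncement(
    airline="United Airlines",
    parent="UAL Corp.",
    amount_usd=6,
    announced_on="Friday",
    effective_on="Thursday",
    matched_by=["American Airlines"],
)

# A flat dictionary is one convenient shape for a database row.
row = asdict(record)
print(row)
```

Each key of `row` could map to a column in a relational table, with one row per announcement.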

# The information extraction process ― forms

+ Named Entity Recognition $\rightarrow$ locating names of companies, people, 
  or products
+ Relation extraction  $\rightarrow$ identifying how two entities are connected (e.g., employment relationship)
+ Event extraction $\rightarrow$ finding entity-event affiliations
+ Temporal expressions $\rightarrow$ isolating `datetime` and `timedelta` quantities
+ Temporal normalization $\rightarrow$ mapping expressions to points in time or durations, the basis for a timeline/partial order

# Information extraction != coreference resolution

<img src="images/_1.jpg" width="100%">



"The Last Dance is a 2020 American sports documentary miniseries co-produced by ESPN Films and Netflix. Directed by Jason Hehir, the series revolves around the career of Michael Jordan, with particular focus on his last season with the Chicago Bulls. The series features exclusive footage from a film crew that had an all-access pass to the Bulls, as well as interviews of many NBA personalities including Jordan, Scottie Pippen, Dennis Rodman, Steve Kerr, and Phil Jackson."
 
Source: Wikipedia

# Information extraction != coreference resolution

+ 'Michael Jordan', 'MJ', 'GOAT', 'His Airness' are possible entities that 
   occur in the data:
   - IE detects these entities
   - however, IE doesn't recognize that these entities refer to the same person
     * that task is known as coreference resolution
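
To make the distinction concrete, the toy sketch below keeps the two steps apart: a detector returns the mentions, and a separate hand-written alias table (a stand-in for a real coreference resolver) groups them under one person. The example sentence, the `ALIASES` table, and both function names are hypothetical.

```python
import re

# Hypothetical alias table: surface mention -> canonical entity.
ALIASES = {
    "Michael Jordan": "Michael Jordan",
    "MJ": "Michael Jordan",
    "GOAT": "Michael Jordan",
    "His Airness": "Michael Jordan",
}


def detect_mentions(text):
    """Detection step: find mentions, with no notion that they share a referent."""
    # Longer names first so that a long mention is never split by a shorter one.
    names = sorted(ALIASES, key=len, reverse=True)
    pattern = "|".join(re.escape(name) for name in names)
    return re.findall(pattern, text)


def resolve(mentions):
    """Coreference step: map every mention to its canonical entity."""
    return {ALIASES[mention] for mention in mentions}


text = "Michael Jordan won six titles. Fans call MJ the GOAT, or simply His Airness."
mentions = detect_mentions(text)
print(mentions)           # four separate mentions
print(resolve(mentions))  # one underlying person
```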

# Named Entity Recognition (NER)

<img src="images/_2.png" width="50%">

# Types of entities

<img src="images/_3.png" width="70%">

# Types of entities in spaCy's models

<img src="images/_4.png" width="60%">

# Types of entities in spaCy's models (cont'd)

<img src="images/_5.png" width="60%">

# Challenges in NER

Recognition is difficult partly because of the ambiguity of segmentation; 
we need to decide:

+ what’s an entity
+ what isn’t
+ where the boundaries are

<img src="images/_6.png" width="50%">

# Challenges in NER (cont'd)

Another difficulty is caused by type ambiguity ― the mention JFK can refer to:

+ a person
+ the airport in New York
+ or any number of American:
  - schools
  - bridges 
  - streets
  
<img src="images/_7.png" width="50%">

# Implementation of NERs

This topic is highly technical and falls beyond the scope of SMM694.

For students who want a closer understanding of NER implementations, there 
are three main approaches:

+ feature-based algorithm
+ neural algorithm
+ rule-based approach
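
The rule-based approach is the easiest of the three to illustrate. The sketch below applies a handful of hand-written regular expressions to part of the sample document. The patterns in `RULES` are assumptions tuned to this one text, and the labels only mimic the style of spaCy's entity types. A production system would need far broader rules or a trained model.

```python
import re

TEXT = ("Citing high fuel prices, United Airlines said Friday it has increased "
        "fares by $6 per round trip on flights to some cities also served by "
        "lower-cost carriers. American Airlines, a unit of AMR Corp., "
        "immediately matched the move, spokesman Tim Wagner said.")

# Hand-written rules (assumptions for this sketch), one pattern per label.
RULES = [
    # Capitalized words followed by a company suffix.
    ("ORG", re.compile(r"\b(?:[A-Z][A-Za-z]+ )+(?:Airlines|Corp\.)")),
    # Dollar amounts.
    ("MONEY", re.compile(r"\$\d+(?:\.\d+)?")),
    # Weekday names.
    ("DATE", re.compile(r"\b(?:Mon|Tues|Wednes|Thurs|Fri|Satur|Sun)day\b")),
    # Two capitalized words right after the job title "spokesman".
    ("PERSON", re.compile(r"(?<=spokesman )[A-Z][a-z]+ [A-Z][a-z]+")),
]


def tag_entities(text):
    """Return (start, end, label, surface) tuples sorted by position."""
    found = []
    for label, pattern in RULES:
        for match in pattern.finditer(text):
            found.append((match.start(), match.end(), label, match.group()))
    return sorted(found)


entities = tag_entities(TEXT)
for start, end, label, surface in entities:
    print(f"{label:7} {surface}")
```

The `PERSON` rule shows how brittle such patterns are: it only fires when the name follows the word "spokesman".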

# Relation extraction

Let's consider our working example:

<img src='images/_10.png' width=50% >

# Relation extraction's scope

The text tells us, for example, that:

+ Tim Wagner is a spokesman for American Airlines
+ United is a unit of UAL Corp
+ American is a unit of AMR

These binary relations are instances of more generic affiliation relations
such as:

+ part-of
+ employment
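
One simple way to recover part-of relations like these is a lexical pattern. The sketch below matches the phrase "X, a unit of Y" in two sentences from the sample document. The `PART_OF` pattern and the triple format are assumptions made for illustration, not a general relation extractor.

```python
import re

TEXT = ("American Airlines, a unit of AMR Corp., immediately matched the move, "
        "spokesman Tim Wagner said. United, a unit of UAL Corp., said the "
        "increase took effect Thursday.")

# Lexical pattern for "X, a unit of Y Corp." (an assumption for this sketch).
PART_OF = re.compile(
    r"(?P<child>[A-Z][A-Za-z]+(?: [A-Z][A-Za-z]+)*), a unit of "
    r"(?P<parent>[A-Z][A-Za-z]+(?: [A-Z][A-Za-z]+)* Corp\.)"
)

# Each match becomes a (subject, relation, object) triple.
relations = [
    (match.group("child"), "part-of", match.group("parent"))
    for match in PART_OF.finditer(TEXT)
]

for triple in relations:
    print(triple)
```

The employment relation between Tim Wagner and American Airlines is harder to capture this way, because the person and the organization sit in different clauses of the sentence.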

# Map of possible relations to extract

<img src='images/_11.png' width=60%>

# Examples of relations

<img src='images/_12.png' width=50%>

# Extracting times

+ Times and dates are a particularly important kind of named entity
  - they play a role in longitudinal research designs (e.g., time-series analysis)
+ After we extract these temporal expressions, times and dates must 
  be normalized. This is key to:
  - appreciating the temporal distance between pairs of events
  - creating complex timelines

# Types of temporal expressions

+ Temporal expressions can take various forms:
  - absolute points in time
  - relative times
  - durations
  - sets of these
+ **Absolute** temporal expressions are those that can be mapped directly to:
  - calendar dates
  - times of day
  - both
+ **Relative** temporal expressions map to particular times through some other 
  reference point, e.g. a week from last Tuesday
+ **Durations** denote spans of time at varying levels of granularity (seconds, 
  minutes, days, weeks, centuries, etc.)
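
The three types map naturally onto Python's standard `datetime` module, as the sketch below suggests. The reference date is hypothetical, and reading "last Tuesday" as the most recent Tuesday strictly before the reference point is an assumption.

```python
from datetime import date, datetime, timedelta

# Absolute: maps directly to a calendar date and a time of day.
absolute = datetime(2020, 4, 19, 21, 0)

# Duration: a span of time, independent of any calendar position.
duration = timedelta(weeks=1)

# Relative: "a week from last Tuesday" only resolves against a reference point.
reference = date(2020, 4, 19)  # hypothetical reference date (a Sunday)

# Assumption: "last Tuesday" is the most recent Tuesday strictly before the
# reference. Monday is weekday 0, so Tuesday is weekday 1.
days_back = (reference.weekday() - 1) % 7 or 7
last_tuesday = reference - timedelta(days=days_back)
relative = last_tuesday + duration

print(absolute.isoformat())      # 2020-04-19T21:00:00
print(duration.days)             # 7
print(last_tuesday.isoformat())  # 2020-04-14
print(relative.isoformat())      # 2020-04-21
```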

# Types of temporal expressions ― examples
  
<img src='images/_13.png' width=60%>

# Finding temporal expressions ― the role of temporal triggers

+ Temporal expressions are grammatical constructions that have temporal 
  lexical triggers ― e.g.:
  - nouns $\rightarrow$ morning, noon, night, winter, dusk, dawn
  - proper nouns $\rightarrow$ January, Monday, Ides, Easter, Rosh Hashana, 
    Ramadan, Tet
  - adjectives $\rightarrow$ recent, past, annual, former
  - adverbs $\rightarrow$ hourly, daily, monthly, yearly
+ Such lexical triggers can be exploited to find the start and end of
  all of the text spans that correspond to such temporal expressions
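
A minimal sketch of trigger spotting follows, assuming a small dictionary built from some of the words listed above. The `TRIGGERS` table, the function name, and the example sentence are hypothetical. The code only marks candidate positions; finding the full start and end of each expression would need further rules or a trained tagger.

```python
import re

# A few lexical triggers per category, taken from the list above.
TRIGGERS = {
    "noun": ["morning", "noon", "night", "winter", "dusk", "dawn"],
    "proper noun": ["January", "Monday", "Easter", "Ramadan"],
    "adjective": ["recent", "past", "annual", "former"],
    "adverb": ["hourly", "daily", "monthly", "yearly"],
}


def find_triggers(text):
    """Return (position, word, category) for every trigger found in the text."""
    hits = []
    for category, words in TRIGGERS.items():
        pattern = re.compile(r"\b(?:" + "|".join(words) + r")\b")
        for match in pattern.finditer(text):
            hits.append((match.start(), match.group(), category))
    return sorted(hits)


sentence = "The annual report is due Monday morning, and prices are updated daily."
hits = find_triggers(sentence)
for position, word, category in hits:
    print(position, word, category)
```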

# Normalizing temporal expressions

+ Temporal normalization is the process of mapping a temporal expression to 
  either a specific point in time or to a duration
  - Points in time correspond to calendar dates, to times of day, or both
  - Durations primarily consist of lengths of time but may also include 
    information about start and end points
    
<img src='images/_14.png' width=50%>

# Temporal anchors

+ Fully qualified temporal expressions are fairly rare in real texts
+ Most temporal expressions in news articles are incomplete and are 
  only implicitly anchored, often with respect to the dateline of 
  the article, which we refer to as the document’s **temporal anchor**
+ Examples of implicitly anchored expressions are:
  - tomorrow
  - three days ago
  - next Monday
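
The three expressions above can be normalized against an anchor with plain date arithmetic. In the sketch below, the `normalize` function, its tiny vocabulary, and the dateline are all hypothetical. Reading "next Monday" as the first Monday strictly after the anchor is an assumption, since usage varies.

```python
from datetime import date, timedelta

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]
NUMBERS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5}


def normalize(expression, anchor):
    """Resolve a few relative expressions against the document's anchor date."""
    words = expression.lower().split()
    if words == ["tomorrow"]:
        return anchor + timedelta(days=1)
    if len(words) == 3 and words[0] in NUMBERS and words[1:] == ["days", "ago"]:
        return anchor - timedelta(days=NUMBERS[words[0]])
    if len(words) == 2 and words[0] == "next" and words[1] in WEEKDAYS:
        # Assumption: the first such weekday strictly after the anchor.
        ahead = (WEEKDAYS.index(words[1]) - anchor.weekday()) % 7 or 7
        return anchor + timedelta(days=ahead)
    raise ValueError(f"unsupported expression: {expression!r}")


anchor = date(2024, 3, 6)  # hypothetical dateline (a Wednesday)
for expression in ["tomorrow", "three days ago", "next Monday"]:
    print(expression, "->", normalize(expression, anchor).isoformat())
```

The same expression yields a different date for every dateline, which is why the anchor has to be extracted together with the text.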

# Extracting events and their times

+ The task of event extraction is to identify mentions of events in texts
+ For the purposes of this task, an event mention is any expression denoting 
  an event or state that can be assigned to a particular point, or interval, 
  in time
+ Typically, events are classified as:
  - actions
  - states
  - reporting events (introduced by verbs such as report, say, explain)
  - perceptions

# Extracting events and their times (cont'd)


+ With both the events and the temporal expressions in a text having been 
  detected, the next logical task is to use this information to fit the events 
  into a complete timeline
+ A somewhat simpler, but still useful, task is to impose a partial ordering 
  on the events and temporal expressions mentioned in a text:


<img src='images/_16.png' width=80%>
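
A partial ordering can be sketched as a set of "before" constraints plus a topological sort. The example below uses three events from the airline document. The event labels and the constraints are a hand-made reading of that text, and `graphlib` requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter

EFFECT = "United's increase takes effect (Thursday)"
ANNOUNCE = "United announces the increase (Friday)"
MATCH = "American matches the move"

# Each key maps to the set of events that must come before it.
# The text does not say whether American matched before or after United's
# Friday statement, so those two events stay unordered: the order is partial.
must_follow = {
    ANNOUNCE: {EFFECT},
    MATCH: {EFFECT},
}

# static_order() yields one linear order that respects every constraint.
order = list(TopologicalSorter(must_follow).static_order())
for event in order:
    print(event)
```

Any output that lists the Thursday event first is consistent with the constraints. Adding further constraints narrows the partial order toward a complete timeline.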

# Example

<img src='images/_17.png' width=50%>

# Skill-ML's scope: isolating unique skills

<img src='images/_18.png' width=50%>

# Skill-ML's scope: clustering skills

<img src='images/_19.png' width=50%>

# Skill-ML's scope: matching skills with occupations

<img src='images/_20.png' width=50%>