# Information Extraction (IE)
### Goal of lesson
- What is Information Extraction
- Extract knowledge from patterns
- Word representation
- Skip-Gram architecture
- See how word vectors capture relations between words (the results are surprising)

### What is Information Extraction (IE)
- the task of automatically extracting structured information from unstructured and/or semi-structured machine-readable documents ([wiki](https://en.wikipedia.org/wiki/Information_extraction))

### Extract knowledge from patterns
- Given known facts that belong together (for example, a company and its founding year), find the text patterns in which they occur together
- Example
    - Knowledge given:
        - Amazon (1994)
        - Facebook (2004)
    - Pattern (template) found:
        - "When {company} was founded in {year},"
- This is a simple, but very powerful approach
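
As a minimal sketch of the idea above, the template can be written as a regular expression with one capture group per slot. The text below is a toy corpus written for this illustration (it is not one of the lesson's files), and the names `pattern` and `facts` are assumptions.

```python
import re

# Toy corpus written for this sketch; only the sentence openings matter.
text = (
    "When Facebook was founded in 2004, it was open only to students. "
    "When Google was founded in 1998, web search looked very different. "
    "When Microsoft was founded in 1975, home computers were rare."
)

# The template "When {company} was founded in {year}," as a regex:
# each slot becomes a named capture group.
pattern = re.compile(r"When (?P<company>[A-Z]\w+) was founded in (?P<year>\d{4}),")

# Every match yields a new (company, year) pair.
facts = [(m["company"], int(m["year"])) for m in pattern.finditer(text)]
print(facts)
```

Facebook (2004) is the known pair that motivates the template; the other pairs are found only because they appear in the same pattern.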

> #### Programming Notes:
> - Libraries used
>     - [**pandas**](https://pandas.pydata.org) - a data analysis and manipulation tool
>     - [**re**](https://docs.python.org/3/library/re.html) - regular expression operations
> - Functionality and concepts used
>     - [**CSV**](https://en.wikipedia.org/wiki/Comma-separated_values) file ([Lecture on CSV](https://youtu.be/LEyojSOg4EI))
>     - [**read_csv()**](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) read a comma-separated values (csv) file into **pandas** DataFrame.
>     - [**Regular Expression**](https://en.wikipedia.org/wiki/Regular_expression) is a sequence of characters that specifies a search pattern.

In [1]:
import pandas as pd
import re

In [2]:
books = pd.read_csv('files/books.csv', header=None)

In [3]:
book_list = books.values.tolist()

In [4]:
book_list

[['1984', 'George Orwell'], ['The Help', 'Kathryn Stockett']]

In [5]:
with open('files/penguin.html') as f:
    corpus = f.read()

In [6]:
corpus = corpus.replace('\n', ' ').replace('\t', ' ')

In [7]:
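for title, author in book_list:
    print(title, '-', author)
    # Slide a 100-character window over the corpus in steps of 20
    for i in range(0, len(corpus) - 100, 20):
        window = corpus[i:i + 100]
        # Keep only windows that contain both known values
        if title in window and author in window:
            print('-:', window)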

1984 - George Orwell
-: ge-orwell-with-a-foreword-by-thomas-pynchon/">1984</a></h2>   <h2 class="author">by George Orwell</h
-: eword-by-thomas-pynchon/">1984</a></h2>   <h2 class="author">by George Orwell</h2>    <div class="de
-: hon/">1984</a></h2>   <h2 class="author">by George Orwell</h2>    <div class="desc">We were pretty c
The Help - Kathryn Stockett
-: /the-help-by-kathryn-stockett/">The Help</a></h2>   <h2 class="author">by Kathryn Stockett</h2>    <
-: -stockett/">The Help</a></h2>   <h2 class="author">by Kathryn Stockett</h2>    <div class="desc">Thi


In [8]:
prefix = re.escape('/">')
middle = re.escape('</a></h2>   <h2 class="author">by ')
suffix = re.escape('</h2>    <div class="desc">')
prefix, middle, suffix

('/">',
 '</a></h2>\\ \\ \\ <h2\\ class="author">by\\ ',
 '</h2>\\ \\ \\ \\ <div\\ class="desc">')

In [9]:
# Lazily capture up to 50 characters for the title and up to 50 for the author
regex = f"{prefix}(.{{0,50}}?){middle}(.{{0,50}}?){suffix}"
results = re.findall(regex, corpus)

In [10]:
results

[('War and Peace', 'Leo Tolstoy'),
 ('Song of Solomon', 'Toni Morrison'),
 ('Ulysses', 'James Joyce'),
 ('The Shadow of the Wind', 'Carlos Ruiz Zafon'),
 ('The Lord of the Rings', 'J.R.R. Tolkien'),
 ('The Satanic Verses', 'Salman Rushdie'),
 ('Don Quixote', 'Miguel de Cervantes'),
 ('The Golden Compass', 'Philip Pullman'),
 ('Catch-22', 'Joseph Heller'),
 ('1984', 'George Orwell'),
 ('The Kite Runner', 'Khaled Hosseini'),
 ('Little Women', 'Louisa May Alcott'),
 ('The Cloud Atlas', 'David Mitchell'),
 ('The Fountainhead', 'Ayn Rand'),
 ('The Picture of Dorian Gray', 'Oscar Wilde'),
 ('Lolita', 'Vladimir Nabokov'),
 ('The Help', 'Kathryn Stockett'),
 ("The Liar's Club", 'Mary Karr'),
 ('Moby-Dick', 'Herman Melville'),
 ("Gravity's Rainbow", 'Thomas Pynchon'),
 ("The Handmaid's Tale", 'Margaret Atwood')]

### One-Hot Representation
- Represent each word as a vector with a single 1 and all other values 0
- Not very useful on its own: the vectors grow with the vocabulary and say nothing about how words relate to each other
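
A minimal sketch of one-hot vectors with NumPy, assuming a tiny four-word vocabulary chosen only for illustration. The helper `one_hot` is a hypothetical name, not part of any library.

```python
import numpy as np

# Tiny illustrative vocabulary (an assumption for this sketch).
vocabulary = ["king", "queen", "man", "woman"]

def one_hot(word, vocabulary):
    """Return a vector with a 1 at the word's index and 0 elsewhere."""
    vector = np.zeros(len(vocabulary))
    vector[vocabulary.index(word)] = 1.0
    return vector

king = one_hot("king", vocabulary)
queen = one_hot("queen", vocabulary)
man = one_hot("man", vocabulary)

print(king)  # [1. 0. 0. 0.]
# The dot product of any two different words is 0,
# so "king" is no closer to "queen" than to "man".
print(np.dot(king, queen), np.dot(king, man))  # 0.0 0.0
```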

### Distributed Representation
- Representation of meaning distributed across multiple values, so that similar words can get similar vectors

### How to define words as vectors
- A word is defined by the words that surround it
- The definition is based on context
- That is, on which words happen to show up around it
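
To make "context" concrete, here is a small sketch that counts which words appear within a fixed window around each word. The sentence and the window size of 2 are assumptions chosen for illustration.

```python
from collections import Counter, defaultdict

# Toy sentence and window size (assumptions for this sketch).
sentence = "the king rules the land and the queen rules the land".split()
window = 2

# contexts[word] counts the words seen within `window` positions of `word`.
contexts = defaultdict(Counter)
for i, word in enumerate(sentence):
    start = max(0, i - window)
    stop = min(len(sentence), i + window + 1)
    for j in range(start, stop):
        if j != i:
            contexts[word][sentence[j]] += 1

print(contexts["king"])
print(contexts["queen"])
```

In this toy sentence "king" and "queen" share most of their context words, which is the kind of signal a learned representation can pick up.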

### word2vec
- model for generating word vectors

### Skip-Gram Architecture
- Neural network architecture for predicting context words given a target word
    - Given a word - what words show up around it in a context
- Example
    - Given a **target word** (input word), train the network to predict the **context words** that appear around it (right side)
    - After training, the weights from the input node (**target word**) to the hidden layer (5 weights) give a representation of that word
    - Hence, each word is represented by a vector
    - The number of hidden nodes determines the size of the vector (here 5)

<img src="img/word_vectors.png" width="600" align="left">

- Idea is as follows
    - Each input word gets its own weights to the hidden layer
    - These weights are learned during training
    - Each word is then represented by its weights to the hidden layer
- Intuition
    - If two words have similar contexts (they show up in the same places), they are likely similar in meaning, and their representations end up a small distance from each other
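
The following sketch shows only the forward pass of such a network, to illustrate why the input-to-hidden weights act as the word vector. The vocabulary, the names `W_in` and `W_out`, and the random (untrained) weights are all assumptions. A real model would learn the weights from many (target, context) pairs.

```python
import numpy as np

# Illustrative vocabulary; 5 hidden nodes as in the figure.
vocabulary = ["the", "king", "queen", "rules", "land"]
hidden_size = 5

# Random, untrained weights (assumption: training is omitted here).
rng = np.random.default_rng(0)
W_in = rng.normal(size=(len(vocabulary), hidden_size))   # input -> hidden
W_out = rng.normal(size=(hidden_size, len(vocabulary)))  # hidden -> output

def one_hot(index, size):
    vector = np.zeros(size)
    vector[index] = 1.0
    return vector

target = vocabulary.index("king")

# A one-hot input simply selects one row of W_in:
# that row is the vector representing the target word.
hidden = one_hot(target, len(vocabulary)) @ W_in

# Softmax over the output scores gives a probability per context word.
scores = hidden @ W_out
probs = np.exp(scores - scores.max())
probs /= probs.sum()

print(hidden.shape)  # (5,)
print(np.allclose(hidden, W_in[target]))  # True
```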

> #### Programming Notes:
> - Libraries used
>     - [**numpy**](http://numpy.org) - scientific computing with Python ([Lecture on NumPy](https://youtu.be/BpzpU8_j0-c))
>     - [**scipy**](https://www.scipy.org) - open-source software for mathematics, science, and engineering
> - Functionality and concepts used
>     - [**cosine**](https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.cosine.html) Compute the Cosine distance between 1-D arrays.

In [11]:
import numpy as np
from scipy.spatial.distance import cosine

In [12]:
with open('files/words.txt') as f:
    words = {}
    lines = f.readlines()
    for line in lines:
        row = line.split()
        word = row[0]
        vector = np.array([float(x) for x in row[1:]])
        words[word] = vector

In [14]:
words['a'].shape

(100,)

In [15]:
def distance(word1, word2):
    return cosine(word1, word2)

In [16]:
def closest_words(embedding):
    # Rank every known word by its cosine distance to the given vector
    distances = {w: distance(embedding, words[w]) for w in words}
    return sorted(distances, key=lambda w: distances[w])[:10]

In [18]:
distance(words['king'], words['queen'])

0.19707422881543946

In [19]:
distance(words['king'], words['pope'])

0.42088794105426874
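
For reference, the cosine distance used above is 1 minus the cosine of the angle between the two vectors. A minimal NumPy sketch of that definition follows. The function name `cosine_distance` and the small test vectors are assumptions for illustration.

```python
import numpy as np

def cosine_distance(u, v):
    """1 - (u . v) / (|u| * |v|): 0 for same direction, 1 for orthogonal."""
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

a = np.array([1.0, 0.0])
b = np.array([0.0, 1.0])
c = np.array([2.0, 0.0])

print(cosine_distance(a, b))  # 1.0 (orthogonal vectors)
print(cosine_distance(a, c))  # 0.0 (same direction, length is ignored)
```

Because only the direction matters, a smaller value means the two word vectors point in more similar directions.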

In [21]:
closest_words(words['king'] - words['man'] + words['woman'])

['queen',
 'king',
 'empress',
 'prince',
 'duchess',
 'princess',
 'consort',
 'monarch',
 'dowager',
 'throne']
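
Note that 'king' itself appears in the result, since the query vector is still close to it. The sketch below shows the same vector arithmetic on hand-made 2-D vectors and an optional way to leave out the input words. The toy vectors, the `exclude` parameter, and the function names are assumptions for illustration; they are not the vectors from `files/words.txt`.

```python
import numpy as np

# Hand-made toy vectors (assumption): first value ~ royalty, second ~ gender.
toy_words = {
    "king": np.array([1.0, 1.0]),
    "queen": np.array([1.0, -1.0]),
    "man": np.array([0.0, 1.0]),
    "woman": np.array([0.0, -1.0]),
    "princess": np.array([0.9, -0.8]),
}

def cosine_distance(u, v):
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def closest_words(embedding, exclude=(), n=3):
    """Rank words by cosine distance to `embedding`, skipping `exclude`."""
    candidates = [w for w in toy_words if w not in exclude]
    return sorted(candidates, key=lambda w: cosine_distance(embedding, toy_words[w]))[:n]

query = toy_words["king"] - toy_words["man"] + toy_words["woman"]

print(closest_words(query))
print(closest_words(query, exclude={"king", "man", "woman"}))
```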