# Intro

In [27]:
import spacy
import en_core_web_sm

%load_ext nb_black

nlp = spacy.load("en_core_web_sm")


In [28]:
# Process sentences 'Hello, world. Antonio is learning Python.' using spaCy
doc = nlp(u"Hello, world. Antonio is learning Python.")


## Get tokens and sentences

#### What is a Token?
A token is a single chopped up element of the sentence, which could be a word or a group of words to analyse. The task of chopping the sentence up is called "tokenisation".

Example: The following sentence can be tokenised by splitting it on whitespace into individual words. (spaCy's tokeniser goes further and also separates punctuation such as "!" into its own token.)

	"Antonio is learning Python!"
	["Antonio","is","learning","Python!"]
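
To make the difference between a plain whitespace split and punctuation-aware tokenisation concrete, here is a minimal sketch. The regular expression and the helper name `simple_tokenise` are assumptions made for illustration; this is not how spaCy's tokeniser works internally.

```python
import re


def simple_tokenise(text):
    """Split text into word tokens and single punctuation tokens (illustrative only)."""
    # \w+ matches a run of word characters; [^\w\s] matches one punctuation mark
    return re.findall(r"\w+|[^\w\s]", text)


sentence = "Antonio is learning Python!"
print(sentence.split())           # whitespace split keeps "Python!" together
print(simple_tokenise(sentence))  # punctuation becomes its own token
```

A rule this simple breaks on contractions, URLs and abbreviations, which is one reason to rely on a library tokeniser in practice.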

In [29]:
# Get first token of the processed document
token = doc[0]
print(token)

# Print sentences (one sentence per line)
for sent in doc.sents:
    print(sent)

Hello
Hello, world.
Antonio is learning Python.



## Part of speech tags

#### What is a Part-of-Speech Tag?
A part-of-speech (POS) tag labels the grammatical role a word plays in its sentence, such as noun, verb or adjective. The tag is context sensitive: the same word can receive different tags in different sentences.
More information about the kinds of part-of-speech tags used in NLP can be [found here](http://www.winwaed.com/blog/2011/11/08/part-of-speech-tags/).

Examples:

1. NUM, Numeral - 1, 2, 3
2. PROPN, Proper Noun - "Jan", "Javier", "Antonio", "Italy"
3. INTJ, Interjection - "Ohhhhhhhhhhh"

In [30]:
# For each token, print corresponding part of speech tag
tags = [(token.pos_, token) for token in doc]
print(tags)

[('INTJ', Hello), ('PUNCT', ,), ('NOUN', world), ('PUNCT', .), ('PROPN', Antonio), ('AUX', is), ('VERB', learning), ('PROPN', Python), ('PUNCT', .)]



In [31]:
from spacy import displacy


In [32]:
displacy.render(doc, style='dep')




In [33]:
displacy.render(doc, style="ent", jupyter=True)



We have said that dependency structures are represented by directed graphs that satisfy the following constraints:

1. There is a single designated root node that has no incoming arcs.

2. With the exception of the root node, each vertex has exactly one incoming arc.

3. There is a unique path from the root node to each vertex in V, where V is the set of vertices (here, the tokens of the sentence).
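
The three constraints can be checked mechanically. The sketch below assumes a sentence is encoded as a plain list `heads`, where `heads[i]` is the index of token `i`'s head and the root points to itself (the same convention spaCy uses for `.head`). The function name and the encoding are hypothetical, chosen only for illustration.

```python
def is_valid_dependency_tree(heads):
    """Check the three constraints on a head-index list (root is its own head)."""
    n = len(heads)
    if n == 0 or any(not 0 <= h < n for h in heads):
        return False

    # Constraint 1: exactly one root, i.e. one token that is its own head
    roots = [i for i, h in enumerate(heads) if h == i]
    if len(roots) != 1:
        return False
    root = roots[0]

    # Constraint 2 holds by construction: the list stores one head per token.
    # Constraint 3: following head links from any token must reach the root
    # without entering a cycle.
    for start in range(n):
        seen = set()
        node = start
        while node != root:
            if node in seen:
                return False
            seen.add(node)
            node = heads[node]
    return True


# "Antonio is learning Python ." with "learning" (index 2) as the root
print(is_valid_dependency_tree([2, 2, 2, 2, 2]))
# Tokens 0 and 1 point at each other, so they never reach the root
print(is_valid_dependency_tree([1, 0, 2]))
```

Because every token stores a single head, a path to the root is unique whenever it exists, so the check only needs to rule out cycles and multiple roots.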

You can inspect the head of each token by invoking the `.head` attribute of a spaCy token:


In [38]:
doc[2]

world


In [39]:
doc[2].head

Hello


So how would you search for the root?

The root is the only node with no incoming arcs, and every other token is reached from it by a unique path. spaCy represents this by making the root token its own head, so we can search for the token whose head is itself:

In [40]:
for token in doc:
    if token.head == token:
        print(token)

Hello
learning



As expected, since there were two sentences in the doc, we got two roots.

We can also build a function that, given a spaCy token, prints the path from that token up to the root:

In [42]:
# Define a function to find the path to the root of each word in a sentence
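def path_to_the_root(token):
    # The root token is its own head, so the recursion stops there
    if token.head == token:
        return
    else:
        print(f'{token.head}->{token}')
        # Recurse on the head until the root is reached
        path_to_the_root(token.head)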




In [43]:
path_to_the_root(doc[4])

learning->Antonio
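
If the path is needed as data rather than printed output, an iterative variant that returns a list may be more convenient. The sketch below is one possible approach; `FakeToken` is a hypothetical stand-in exposing only `.text` and `.head`, so the example runs without loading a model. The function relies on nothing beyond those two attributes, which real spaCy tokens also provide.

```python
class FakeToken:
    """Hypothetical stand-in for a spaCy token (only `.text` and `.head`)."""

    def __init__(self, text):
        self.text = text
        self.head = self  # a token is its own head until told otherwise


def path_to_root(token):
    """Return the token texts from `token` up to the root of its sentence."""
    path = [token.text]
    # Compare with != rather than `is not`: spaCy may hand back a new
    # Python object for the same token on each access
    while token.head != token:
        token = token.head
        path.append(token.text)
    return path


learning = FakeToken("learning")
antonio = FakeToken("Antonio")
antonio.head = learning

print(" -> ".join(path_to_root(antonio)))
```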



## Embeddings 

An embedding is a fixed-size numerical vector that attempts to encode the semantic meaning of the word or sentence it represents. Most embeddings rest on the distributional hypothesis, which states that words that often share the same neighboring words tend to be semantically similar. For example, if 'football' and 'basketball' usually appear close to the word 'play', we assume that they are semantically similar. Word2Vec is one algorithm based on this idea. A common way of obtaining a sentence embedding is to average the embeddings of the words in the sentence and use that average to represent the whole sentence.

- In spaCy, every token has an embedding.
- It is stored under the attribute `vector`.
- The embedding size depends on the loaded model; for the small models it is typically 96 or 128.
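
The averaging approach to sentence embeddings described above can be sketched in a few lines. The helper name `mean_vector` and the toy three-dimensional vectors are assumptions made for illustration; real embeddings are much longer.

```python
def mean_vector(vectors):
    """Average a collection of equal-length vectors component by component."""
    vectors = [list(v) for v in vectors]
    if not vectors:
        raise ValueError("need at least one vector to average")
    # zip(*vectors) groups the i-th component of every vector together
    return [sum(column) / len(vectors) for column in zip(*vectors)]


# Toy word vectors (made-up values)
word_vectors = [
    [1.0, 0.0, 2.0],
    [3.0, 2.0, 0.0],
]
print(mean_vector(word_vectors))
```

With a parsed document the same helper could be called as `mean_vector(token.vector for token in doc)`. spaCy also exposes `doc.vector`, which by default is an average of the token vectors.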


Obtain the embeddings of all the tokens.

In [44]:
embd = [token.vector for token in doc]
print(embd)

[array([-7.2769040e-01, -6.5467489e-01,  3.4545848e-01, -1.3211331e+00,
        3.8329524e-01,  1.6095481e+00,  4.0210347e+00,  2.7814531e+00,
        4.2899132e+00,  3.5156488e+00,  2.2655723e+00, -3.0103502e+00,
        2.2251215e+00,  8.6697561e-01,  2.1651659e+00,  3.6331279e+00,
        3.2939694e+00,  2.4538593e+00, -1.0154125e-01, -6.6271687e-01,
        4.4526100e+00, -4.8443186e-01,  1.1999233e+00,  5.0796497e-01,
        2.2959852e+00,  1.2470492e+00, -7.6798129e-01,  2.8587785e+00,
       -8.9219618e-01, -6.2705553e-01, -1.5051959e+00, -6.2507683e-01,
       -2.0319629e+00, -1.5944457e+00,  1.2612584e+00,  8.6164564e-01,
        4.2223060e-01,  8.7489688e-01, -2.7122064e+00, -1.7095518e+00,
        4.4883766e+00,  2.1760783e+00, -2.3332880e+00,  2.7503304e+00,
        9.2422390e-01, -2.2956979e-01, -1.6059887e+00,  3.3145928e+00,
       -2.3757935e+00, -2.1428237e+00, -3.9303966e+00,  1.0370404e+00,
       -1.9046664e-04,  1.9203992e+00, -6.8202686e-01, -2.1523390e+00,
     


## Semantic similarity 

To compute the semantic similarity between two sentences with embeddings $u$ and $v$, we measure the cosine similarity between the two embeddings. The formula is as follows:

$sim(u, v) = \frac{u \cdot v}{\|u\| \, \|v\|}$


Use the formula above to compute the semantic similarity between the words in doc.
Feel free to test it on different words too.

In [None]:
def semantic_sim(u,v):
    return
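
One possible way to turn the cosine formula into code is sketched below, using only the standard library so it works on any sequence of numbers, including the arrays returned by `token.vector`. The name `cosine_similarity` is hypothetical, and returning 0.0 for a zero vector is a convention assumed here, since the formula is undefined in that case.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between u and v: (u . v) / (||u|| ||v||)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Assumed convention: similarity with a zero vector is reported as 0.0
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # orthogonal vectors
print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))   # same direction
```

The result ranges from -1 (opposite directions) through 0 (orthogonal) to 1 (same direction), independent of vector length.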

# Pride and Prejudice analysis

We would like to:

- Extract the names of all the characters from the book (e.g. Elizabeth, Darcy, Bingley)
- Visualize characters' occurrences with regard to their relative position in the book
- Automatically describe any character from the book
- Find out which characters have been mentioned in a context of marriage
- Build keywords extraction that could be used to display a word cloud (example)

To load the text file, it is convenient to decode it as UTF-8:

In [None]:
def read_file(file_name):
    with open(file_name, "r", encoding="utf-8") as file:
        return file.read()

### Process full text

In [None]:
text = read_file("data/pride_and_prejudice.txt")
# Process the text

In [None]:
# How many sentences are in the book (Pride & Prejudice)?

# Print sentences from index 10 to index 15, to make sure that we have parsed the correct book


## Find all the personal names

[Hint](# "List doc.ents and check ent.label_")

In [None]:
# Extract all the personal names from Pride & Prejudice and count their occurrences.
# Expected output is a list in the following form: [('elizabeth', 622), ('darcy', 312), ('jane', 286), ('bennet', 266) ...].

from collections import Counter, defaultdict


def find_character_occurences(doc):
    """
    Return a list of characters from `doc` with their occurrence counts.

    :param doc: Spacy NLP parsed document
    :return: list of tuples in form
        [('elizabeth', 622), ('darcy', 312), ('jane', 286), ('bennet', 266)]
    """

    characters = Counter()
    # your code here


print(find_character_occurences(processed_text)[:20])
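
As a hedged sketch of one way to approach this exercise, the function below counts entities labelled `PERSON` by their lowercased lemma. It only assumes that the document exposes `.ents` and that each entity has `.label_` and `.lemma_`, as spaCy spans do. The `SimpleNamespace` objects are hypothetical stand-ins so the example runs without parsing the book, and the helper name `count_person_mentions` is not part of the notebook.

```python
from collections import Counter
from types import SimpleNamespace


def count_person_mentions(doc):
    """Return (name, count) pairs for PERSON entities, most frequent first."""
    counts = Counter()
    for ent in doc.ents:
        if ent.label_ == "PERSON":
            counts[ent.lemma_.lower()] += 1
    return counts.most_common()


# Hypothetical stand-in for a parsed document
fake_doc = SimpleNamespace(
    ents=[
        SimpleNamespace(label_="PERSON", lemma_="Elizabeth"),
        SimpleNamespace(label_="GPE", lemma_="London"),
        SimpleNamespace(label_="PERSON", lemma_="Darcy"),
        SimpleNamespace(label_="PERSON", lemma_="elizabeth"),
    ]
)
print(count_person_mentions(fake_doc))
```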

## Plot characters personal names as a time series 

In [None]:
# Matplotlib Jupyter HACK
%matplotlib inline

import matplotlib
import numpy as np
import matplotlib.pyplot as plt

We can investigate where a particular entity occurs in the text by accessing its `.start` attribute, which gives the index of the entity's first token:

[Hint](# "ent.start")

In [None]:
# List all the start positions of person entities

So we can create a function that stores all the offsets of every character:
   
   
[Hint](# "Create a dictionary with the lowered lemmas [ent.lemma_.lower()] and associate a list of all the ent.starts")

In [None]:
# Plot characters' mentions as a time series relative to the position of the actor's occurrence in a book.

def get_character_offsets(doc):
    """
    For every character in `doc`, collect the offsets of all occurrences and store them in a list.
    The function returns a dictionary with the character lemma as key and the list of occurrence offsets as value.
    
    :param doc: Spacy NLP parsed document
    :return: dict object in form
        {'elizabeth': [123, 543, 4534], 'darcy': [205, 2111]}
    """
            
    return dict(character_offsets)

character_occurences = get_character_offsets(processed_text)
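
A minimal sketch of how the offsets could be collected, following the hint about lowercased lemmas and `ent.start`. The helper name `collect_person_offsets` and the `SimpleNamespace` stand-ins are hypothetical; the function itself only relies on `.ents`, `.label_`, `.lemma_` and `.start`.

```python
from collections import defaultdict
from types import SimpleNamespace


def collect_person_offsets(doc):
    """Map each PERSON lemma (lowercased) to the token offsets where it starts."""
    offsets = defaultdict(list)
    for ent in doc.ents:
        if ent.label_ == "PERSON":
            offsets[ent.lemma_.lower()].append(ent.start)
    # Convert to a plain dict so missing keys raise KeyError instead of
    # silently creating empty lists
    return dict(offsets)


# Hypothetical stand-in for a parsed document
fake_doc = SimpleNamespace(
    ents=[
        SimpleNamespace(label_="PERSON", lemma_="Elizabeth", start=123),
        SimpleNamespace(label_="PERSON", lemma_="Darcy", start=205),
        SimpleNamespace(label_="ORG", lemma_="Netherfield", start=300),
        SimpleNamespace(label_="PERSON", lemma_="Elizabeth", start=543),
    ]
)
print(collect_person_offsets(fake_doc))
```

Since entities are visited in document order, each list of offsets comes out already sorted, which is convenient when the offsets are later used as the x values of a histogram.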

In [None]:
character_occurences

[Hint](# "Use the character offsets for each character as x")

In [None]:
# Plot the histogram of the character occurrences in the whole text
NUM_BINS = 20

def plot_character_hist(character_offsets, character_label, cumulative=False):
    pass

In [None]:
plot_character_hist(character_occurences, "elizabeth")

In [None]:
plot_character_hist(character_occurences, "darcy")

### Cumulative occurrences

In [None]:
plot_character_hist(character_occurences, "elizabeth", cumulative=True)

In [None]:
plot_character_hist(character_occurences, "darcy", cumulative=True)

### Spacy parse tree in action

[Hint](# "ent.subtree, token.pos_ == 'ADJ'") 

In [None]:
# Find words (adjectives) that describe Mr. Darcy.

def get_character_adjectives(doc, character_lemma):
    """
    Find all the adjectives related to `character_lemma` in `doc`
    
    :param doc: Spacy NLP parsed document
    :param character_lemma: string object
    :return: list of adjectives related to `character_lemma`
    """
    
    adjectives = []
    for ent in doc.ents:
        # your code here
        pass

    # Collect adjectival complements of verbs whose subject is the character
    for ent in doc.ents:
        if ent.lemma_.lower() == character_lemma:
            if ent.root.dep_ == 'nsubj':
                for child in ent.root.head.children:
                    if child.dep_ == 'acomp':
                        adjectives.append(child.lemma_)
                        
    return adjectives

print(get_character_adjectives(processed_text, 'darcy'))

In [None]:
# Find words (adjectives) that describe Elizabeth.


print(get_character_adjectives(processed_text, 'elizabeth'))

For the full list of dependency labels, see the Stanford typed dependencies manual: https://nlp.stanford.edu/software/dependencies_manual.pdf

`acomp`: adjectival complement
*i.e.* an adjectival phrase which functions as the complement (like an object of the verb) e.g. "She looks very beautiful": *beautiful* is an adjectival complement of *looks*

`nsubj`: nominal subject
*i.e.* a noun phrase which is the syntactic subject of a clause. The head of this relation
might not always be a verb: when the verb is a copular verb, the root of the clause is the complement of
the copular verb, which can be an adjective or noun.
*e.g.* "Clinton defeated Dole". The relationship is *nsubj(defeated, Clinton)*

"The baby is cute". The relationship is *nsubj(cute, baby)*.

In the code, `.dep_` stands for syntactic dependency, *i.e.* the relation between a token and its head.

In [None]:
processed_text.ents[30].root.dep_

[Hint](# "ent.label_, ent.root.head.lemma_") 

In [None]:
# Find characters that are 'talking', 'saying', 'doing' the most. Find the relationship between 
# entities and corresponding root verbs.

character_verb_counter = Counter()


for ent in processed_text.ents:
    if # your code here:
        character_verb_counter[ent.text] += 1

print(character_verb_counter.most_common(10)) 

# do the same for talking and doing

print(character_verb_counter.most_common(10)) 
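
One way to express the missing condition is sketched below: an entity counts as the speaker when it is a `PERSON` and the head of its root token has the lemma of the verb of interest. The helper name `count_characters_for_verb` and the nested `SimpleNamespace` stand-ins are hypothetical, and whether this condition matches the intended solution is an assumption based on the hint.

```python
from collections import Counter
from types import SimpleNamespace


def count_characters_for_verb(doc, verb_lemma):
    """Count PERSON entities whose root token is governed by `verb_lemma`."""
    counter = Counter()
    for ent in doc.ents:
        if ent.label_ == "PERSON" and ent.root.head.lemma_ == verb_lemma:
            counter[ent.text] += 1
    return counter


def fake_ent(text, label, head_lemma):
    """Build a hypothetical entity with just the attributes used above."""
    head = SimpleNamespace(lemma_=head_lemma)
    return SimpleNamespace(text=text, label_=label, root=SimpleNamespace(head=head))


fake_doc = SimpleNamespace(
    ents=[
        fake_ent("Elizabeth", "PERSON", "say"),
        fake_ent("Darcy", "PERSON", "say"),
        fake_ent("Elizabeth", "PERSON", "say"),
        fake_ent("Jane", "PERSON", "do"),
        fake_ent("Longbourn", "GPE", "say"),
    ]
)
print(count_characters_for_verb(fake_doc, "say").most_common(10))
```

Passing the verb as a parameter means the same function can be reused for "talk" and "do" instead of repeating the loop.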


[Hint](# "ent.label_, ent.root.head.pos_") 

In [None]:
# Find 20 most used verbs
verb_counter = Counter()

# your code here

print(verb_counter.most_common(20))

In [None]:
# Create a dataframe with the most used verbs and how many times each character used each verb

import pandas as pd
verb_characters = {}
verb_list = [verb[0] for verb in verb_counter.most_common(20)]
for ent in processed_text.ents:
    if ent.label_ == 'PERSON' and ent.root.head.lemma_ in verb_list:
        # complete the code
        pass


In [None]:
df = pd.DataFrame(verb_characters).transpose().fillna(0)
df

In [None]:
# drop the less meaningful columns
df = df[df.columns[df.sum()>=10]].sort_index()
df

In [None]:
import seaborn as sns
%matplotlib inline
sns.heatmap(df, annot=True, cmap='Blues')
df.style.background_gradient(cmap='Blues')
