# Pride & Prejudice analysis

# Real text analysis

Now that we are familiar with spaCy, this section analyses a real text: Pride & Prejudice.

We would like to:
* Extract the names of all the characters from the book (e.g. Elizabeth, Darcy, Bingley)
* Visualize characters' occurrences with respect to their relative position in the book
* Automatically describe any character from the book
* Find out which characters have been mentioned in the context of marriage
* Build a keyword extraction that could be used to display a word cloud

## Load text file

In [9]:
def read_file(file_name):
    # Read the whole file into one string; the encoding is assumed to be UTF-8.
    with open(file_name, 'r', encoding='utf-8') as file:
        return file.read()

## Process full text

In [10]:
import spacy

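# 'en' is the spaCy 2.x shortcut link; on spaCy 3.x, load the model package
# by its full name instead (for example 'en_core_web_sm').
nlp = spacy.load('en')

# Read the book and process the text with the spaCy NLP pipeline
text = read_file('data/pride_and_prejudice.txt')
processed_text = nlp(text)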

### How many sentences are there in the book (Pride & Prejudice)?

In [11]:
sentences = list(processed_text.sents)
print(len(sentences))

7761


### Print the sentences at indices 10 to 14
This is a quick check that we have parsed the correct book.

In [12]:
print(sentences[10:15])

[*, ** START OF THIS PROJECT GUTENBERG EBOOK PRIDE AND, PREJUDICE ***




Produced by Anonymous Volunteers





PRIDE AND PREJUDICE

By Jane Austen



Chapter 1


, It is a truth universally acknowledged, that a single man in possession
of a good fortune, must be in want of a wife.

, However little known the feelings or views of such a man may be on his
first entering a neighbourhood, this truth is so well fixed in the minds
of the surrounding families, that he is considered the rightful property
of some one or other of their daughters.

]


# Exercise
### 3.1. Find all the personal names

Extract all the personal names from Pride & Prejudice and count their occurrences. 

The expected output is a list of the following form: [('elizabeth', 622), ('darcy', 312), ('jane', 286), ('bennet', 266) ...].

*Hint: you can use a `Counter`, a container that keeps track of how many times equivalent values are added.*

*SuperHint: iterate over the entities and check whether the label of each entity is PERSON.*

In [13]:
from collections import Counter, defaultdict

def find_character_occurences(doc):
    """
    Return a list of characters from `doc` with their occurrence counts.
    
    :param doc: spaCy NLP parsed document
    :return: list of tuples in form
        [('elizabeth', 622), ('darcy', 312), ('jane', 286), ('bennet', 266)]
    """
    
    characters = Counter()
    
    # your code here
            
    return characters.most_common()

print(find_character_occurences(processed_text)[:20])

[]
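
One possible way to follow the hints above, shown as a minimal sketch rather than the reference solution: count the lemma of every entity labelled `PERSON`. The helper name `count_person_entities` is hypothetical, and the small stand-in document built from `SimpleNamespace` only mimics the `ents`, `label_` and `lemma_` attributes of a spaCy document so that the snippet runs without a language model.

```python
from collections import Counter
from types import SimpleNamespace


def count_person_entities(doc):
    """Count lowercased lemmas of all entities labelled PERSON in `doc`."""
    characters = Counter()
    for ent in doc.ents:
        if ent.label_ == 'PERSON':
            characters[ent.lemma_.lower()] += 1
    return characters.most_common()


# Stand-in for a parsed document (assumption: only `ents` is needed here).
toy_doc = SimpleNamespace(ents=[
    SimpleNamespace(label_='PERSON', lemma_='Elizabeth'),
    SimpleNamespace(label_='PERSON', lemma_='Darcy'),
    SimpleNamespace(label_='GPE', lemma_='London'),
    SimpleNamespace(label_='PERSON', lemma_='Elizabeth'),
])

print(count_person_entities(toy_doc))
# [('elizabeth', 2), ('darcy', 1)]
```

The same loop could replace the `# your code here` placeholder when applied to `processed_text`. The exact counts on the real book depend on the spaCy model and version in use.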


### 3.2.  Find words (adjectives) that describe Mr. Darcy

In [17]:
def get_character_adjectives(doc, character_lemma):
    """
    Find all the adjectives related to `character_lemma` in `doc`.
    
    :param doc: spaCy NLP parsed document
    :param character_lemma: lemma of the character's name, e.g. 'darcy'
    :return: list of adjectives related to `character_lemma`
    """
    
    adjectives = []
    
    # your code here
    
    return adjectives

print(get_character_adjectives(processed_text, 'darcy'))

[]
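
A minimal sketch of one way to approach this exercise, assuming the dependency labels `amod` (adjectival modifier) and `acomp` (adjectival complement) used by spaCy's English models. The helper name `collect_adjectives` is hypothetical, and the toy tokens below are hand-built stand-ins for the phrase "handsome Darcy was proud" that expose only the attributes the loop reads (`lemma_`, `dep_`, `children`, `head`).

```python
from types import SimpleNamespace


def collect_adjectives(doc, character_lemma):
    """Collect adjectives attached to tokens whose lemma matches `character_lemma`."""
    adjectives = []
    for token in doc:
        if token.lemma_.lower() != character_lemma:
            continue
        # Adjectives modifying the name directly, e.g. "handsome Darcy".
        for child in token.children:
            if child.dep_ == 'amod':
                adjectives.append(child.lemma_)
        # Predicative adjectives, e.g. "Darcy was proud".
        if token.dep_ == 'nsubj':
            for sibling in token.head.children:
                if sibling.dep_ == 'acomp':
                    adjectives.append(sibling.lemma_)
    return adjectives


# Hand-built stand-in parse of "handsome Darcy was proud" (assumption).
handsome = SimpleNamespace(lemma_='handsome', dep_='amod', children=[])
proud = SimpleNamespace(lemma_='proud', dep_='acomp', children=[])
was = SimpleNamespace(lemma_='be', dep_='ROOT', children=[])
darcy = SimpleNamespace(lemma_='Darcy', dep_='nsubj', children=[handsome], head=was)
was.children = [darcy, proud]

toy_doc = [handsome, darcy, was, proud]
print(collect_adjectives(toy_doc, 'darcy'))
# ['handsome', 'proud']
```

On the real book the result will contain duplicates, so wrapping it in a `Counter` is a natural next step for ranking the most frequent descriptions.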


### 3.3. Find characters that are 'talking', 'saying', 'doing' the most.

Find the relationship between entities and their corresponding root verbs (the head of each entity's root token in the dependency parse).

In [20]:
# your code here
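
A minimal sketch of one possible approach, not the reference solution: for every `PERSON` entity, look at the head of the entity's root token and count the entity when that head has the verb lemma of interest. The helper names `count_characters_for_verb` and `make_entity` are hypothetical, and the stand-in objects only mimic the `text`, `label_`, `root`, `head` and `lemma_` attributes found on spaCy spans and tokens.

```python
from collections import Counter
from types import SimpleNamespace


def count_characters_for_verb(doc, verb_lemma):
    """Count PERSON entities whose root token is governed by `verb_lemma`."""
    counts = Counter()
    for ent in doc.ents:
        if ent.label_ == 'PERSON' and ent.root.head.lemma_ == verb_lemma:
            counts[ent.text] += 1
    return counts.most_common()


def make_entity(text, label, head_lemma):
    """Build a stand-in entity whose root token is headed by `head_lemma`."""
    head = SimpleNamespace(lemma_=head_lemma)
    root = SimpleNamespace(head=head)
    return SimpleNamespace(text=text, label_=label, root=root)


# Stand-in for a parsed document (assumption: only `ents` is needed here).
toy_doc = SimpleNamespace(ents=[
    make_entity('Elizabeth', 'PERSON', 'say'),
    make_entity('Darcy', 'PERSON', 'say'),
    make_entity('Elizabeth', 'PERSON', 'say'),
    make_entity('Jane', 'PERSON', 'walk'),
    make_entity('London', 'GPE', 'say'),
])

print(count_characters_for_verb(toy_doc, 'say'))
# [('Elizabeth', 2), ('Darcy', 1)]
```

Calling the function once per verb lemma (for example 'talk', 'say' and 'do') on `processed_text` would give one ranking per verb.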