# spaCy Demo

# Set up environment

In [1]:
import pandas as pd
import numpy as np

In [2]:
import spacy

In [3]:
import configparser
config = configparser.ConfigParser()
config.read("../../../env.ini")
data_home = config['DEFAULT']['data_home'] 
output_dir = config['DEFAULT']['output_dir']
data_prefix = 'austen-melville'
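
The cell above assumes an `env.ini` file with a `[DEFAULT]` section that defines `data_home` and `output_dir`. As a rough sketch of what such a file might contain, the example below parses a hypothetical configuration from a string; the paths and the `example_` names are placeholders, not the values used in this notebook.

```python
import configparser

# Hypothetical contents of env.ini; the real paths are not shown in this notebook.
example_ini = """
[DEFAULT]
data_home = /path/to/data
output_dir = /path/to/output
"""

example_config = configparser.ConfigParser()
example_config.read_string(example_ini)

example_data_home = example_config["DEFAULT"]["data_home"]
example_output_dir = example_config["DEFAULT"]["output_dir"]

# ConfigParser.read() returns the list of files it managed to parse.
# A missing file is skipped silently, so an empty list signals a wrong path.
files_found = configparser.ConfigParser().read("no-such-env-file-for-demo.ini")
```

Checking the return value of `read()` is a cheap way to catch a wrong relative path before a `KeyError` appears on the first key lookup.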

# About spaCy

## Installing

It's best to install spaCy in its own virtual environment.

## Pipeline

<img src="spacy-pipeline.svg" width="500" />

## Object Model

<img src="spacy-architecture.svg" width="500" />

## Language Models

# Import F1 docs

We start with a DOC table that is the result of converting $F0$ data into $F1$ data.

In this case, the DOCs are paragraphs.

In [9]:
DOC = pd.read_csv(f"{output_dir}/pg105-PARAS.csv").set_index(['chap_num', 'para_num'])

In [10]:
DOC.head(10)

chap_num,para_num,para_str
1,0,"Sir Walter Elliot, of Kellynch Hall, in Somers..."
1,1,"""ELLIOT OF KELLYNCH HALL."
1,2,"""Walter Elliot, born March 1, 1760, married, J..."
1,3,Precisely such had the paragraph originally st...
1,4,Then followed the history and rise of the anci...
1,5,"""Heir presumptive, William Walter Elliot, Esq...."
1,6,Vanity was the beginning and the end of Sir Wa...
1,7,His good looks and his rank had one fair claim...
1,8,"This friend, and Sir Walter, did not marry, wh..."
1,9,"That Lady Russell, of steady age and character..."


# Apply spaCy model

## Load the model

First, we load a language model, which must be downloaded beforehand.

spaCy has models for many languages; English has four. You can learn about them here: https://spacy.io/models/en

To download one, do this:

```bash
python -m spacy download <language_model>
```

For example, do this to get the small language model for English:

```bash
python -m spacy download en_core_web_sm
```

Once you have your model, you can pass its name to `spacy.load()`. This notebook loads the large English model, `en_core_web_lg`, so that model must be downloaded in the same way:

In [11]:
nlp = spacy.load("en_core_web_lg")

The resulting `nlp` object is a callable pipeline: passing it a string returns a spaCy `Doc` object. We apply it to each doc.

## Understand the model

To get an intuition for how spaCy works, let's work with a list of strings and pass each string to the `nlp` object in a comprehension.

In [26]:
Docs = [nlp(doc) for doc in DOC.para_str.to_list()]

This produces a list of spaCy `Doc` objects, one for each paragraph in `DOC`.

We can see this by checking the object type of a sample element from the list `Docs`.

In [56]:
DocSample = Docs[10]

In [57]:
type(DocSample)

spacy.tokens.doc.Doc

The `Doc` object is a container holding all the information about the text, including tokenization, part-of-speech tags, named entities, and more.

This may not be obvious, though, since when printed the object appears as plain text.

This is because the `__str__` method has been overridden to return the text associated with the object rather than the default object representation.

In [58]:
print(DocSample)

To Lady Russell, indeed, she was a most dear and highly valued god-daughter, favourite, and friend.  Lady Russell loved them all; but it was only in Anne that she could fancy the mother to revive again.
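
The effect of overriding `__str__` can be seen in isolation with two small classes. This is only an illustration of the Python mechanism; `PlainObject` and `TextLikeObject` are hypothetical names and do not reflect how spaCy implements `Doc`.

```python
class PlainObject:
    """Holds some text but keeps Python's default string representation."""

    def __init__(self, text):
        self.text = text


class TextLikeObject(PlainObject):
    """Same data, but str() and print() show the text itself."""

    def __str__(self):
        return self.text


plain = PlainObject("Lady Russell loved them all.")
text_like = TextLikeObject("Lady Russell loved them all.")

plain_str = str(plain)          # default form, e.g. <...PlainObject object at 0x...>
text_like_str = str(text_like)  # the stored text
```

Calling `type()` on the object, as done earlier for `DocSample`, remains the reliable way to see what kind of object is actually being held.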


To unpack the contents of a `Doc` object, we can start with the results of spaCy's sentence segmentation, which are available through the object's `sents` attribute:

In [74]:
Sents = [sent for sent in DocSample.sents]

In [75]:
Sents

[To Lady Russell, indeed, she was a most dear and highly valued god-daughter, favourite, and friend.  ,
 Lady Russell loved them all; but it was only in Anne that she could fancy the mother to revive again.]

In [68]:
len(Sents)

2

Each sentence can then be broken into tokens, each carrying linguistic annotations. Here we take the first sentence:

In [82]:
Tokens = [(token.text, token.pos_, token.is_stop, token.dep_) for token in Sents[0]]

In [83]:
print(Tokens)

[('To', 'ADP', True, 'prep'), ('Lady', 'PROPN', False, 'compound'), ('Russell', 'PROPN', False, 'pobj'), (',', 'PUNCT', False, 'punct'), ('indeed', 'ADV', True, 'advmod'), (',', 'PUNCT', False, 'punct'), ('she', 'PRON', True, 'nsubj'), ('was', 'AUX', True, 'ROOT'), ('a', 'DET', True, 'det'), ('most', 'ADV', True, 'advmod'), ('dear', 'ADJ', False, 'amod'), ('and', 'CCONJ', True, 'cc'), ('highly', 'ADV', False, 'advmod'), ('valued', 'VERB', False, 'conj'), ('god', 'PROPN', False, 'nmod'), ('-', 'PUNCT', False, 'punct'), ('daughter', 'NOUN', False, 'attr'), (',', 'PUNCT', False, 'punct'), ('favourite', 'ADJ', False, 'conj'), (',', 'PUNCT', False, 'punct'), ('and', 'CCONJ', True, 'cc'), ('friend', 'NOUN', False, 'conj'), ('.', 'PUNCT', False, 'punct'), (' ', 'SPACE', False, 'dep')]
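
Besides iterating over sentences, a `Doc` can be indexed and sliced directly, which makes its role as a container more concrete. The sketch below assumes a blank English pipeline created with `spacy.blank("en")`, which only tokenizes, so it runs without downloading a model; `nlp_blank` is a name introduced here for illustration.

```python
import spacy

# Assumption: a blank English pipeline (tokenizer only), so no model download is needed.
nlp_blank = spacy.blank("en")
doc = nlp_blank("Lady Russell loved them all.")

n_tokens = len(doc)     # number of tokens in the Doc
first_token = doc[0]    # integer indexing returns a Token
first_span = doc[0:2]   # slicing returns a Span
token_texts = [token.text for token in doc]
```

Because the blank pipeline has no tagger or parser, attributes such as `pos_` and `dep_` would be empty here; the annotations shown above come from the components of the loaded model.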


## Apply the model

We can also apply the `nlp` pipeline directly to a column of our data frame.

In [31]:
DOC['spacy_doc'] = DOC.para_str.apply(nlp)

The result is that each observation now has a spaCy `Doc` object.

In [32]:
DOC

chap_num,para_num,para_str,spacy_doc
1,0,"Sir Walter Elliot, of Kellynch Hall, in Somers...","(Sir, Walter, Elliot, ,, of, Kellynch, Hall, ,..."
1,1,"""ELLIOT OF KELLYNCH HALL.","("", ELLIOT, OF, KELLYNCH, HALL, .)"
1,2,"""Walter Elliot, born March 1, 1760, married, J...","("", Walter, Elliot, ,, born, March, 1, ,, 1760..."
1,3,Precisely such had the paragraph originally st...,"(Precisely, such, had, the, paragraph, origina..."
1,4,Then followed the history and rise of the anci...,"(Then, followed, the, history, and, rise, of, ..."
...,...,...,...
24,9,"Anne, satisfied at a very early period of Lady...","(Anne, ,, satisfied, at, a, very, early, perio..."
24,10,Her recent good offices by Anne had been enoug...,"(Her, recent, good, offices, by, Anne, had, be..."
24,11,Mrs Smith's enjoyments were not spoiled by thi...,"(Mrs, Smith, 's, enjoyments, were, not, spoile..."
24,12,Finis,(Finis)
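
Applying `nlp` one row at a time works, but spaCy also provides `nlp.pipe()`, which processes an iterable of texts as a stream and is generally more efficient for larger collections. The sketch below uses a blank English pipeline and two made-up texts as assumptions, so that it runs without a downloaded model.

```python
import spacy

# Assumption: tokenizer-only pipeline and toy texts, used purely for illustration.
nlp_blank = spacy.blank("en")
texts = ["Lady Russell loved them all.", "Finis"]

# nlp.pipe() yields one Doc per input text, in the same order as the input.
docs = list(nlp_blank.pipe(texts))
doc_lengths = [len(doc) for doc in docs]
```

With the objects in this notebook, the equivalent would presumably be something like `DOC['spacy_doc'] = list(nlp.pipe(DOC.para_str))`, which should give the same column as the `apply` call.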


## Unpack the model

We can easily extract the contents of all the `Doc` objects using the pattern:

```python
DOC.spacy_doc.apply(lambda x: [<prop_var> for <prop_var> in x.<prop_attr>])\
    .apply(pd.Series)\
    .stack()
```

That is, we unpack the attributes we want from the `Doc` object in a comprehension (as above), convert them to a Series, and then stack them.
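
To see what each step of this pattern does, here is a minimal sketch on a toy frame whose cells hold plain Python lists instead of `Doc` objects. The names `toy`, `wide`, and `long` are made up for the example. An explicit `dropna()` is added after `stack()` as a precaution, since whether `stack()` drops the padding values itself depends on the pandas version.

```python
import pandas as pd

# Toy stand-in for DOC: each row holds a list of items of varying length.
toy = pd.DataFrame(
    {"items": [["a", "b", "c"], ["d"], ["e", "f"]]},
    index=pd.Index([10, 11, 12], name="doc_id"),
)

# Step 1: one column per list position; shorter lists are padded with NaN.
wide = toy["items"].apply(pd.Series)

# Step 2: move the columns into a new innermost index level and drop the padding.
long = wide.stack().dropna()
long.index.names = ["doc_id", "item_num"]
```

Each pass through the pattern adds one index level, which is how the paragraph index grows into a paragraph, sentence, and token index.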

We can apply this pattern successively to produce the `TOKEN` table.

In [84]:
TOKEN = DOC.spacy_doc.apply(lambda x: pd.Series([sent for sent in x.sents])).stack()\
    .apply(lambda x: [[token.text, token.pos_, token.lemma_, token.dep_] for token in x])\
    .apply(pd.Series).stack()\
    .apply(pd.Series) # No need to stack since we want to keep results as columns

In [52]:
TOKEN

chap_num,para_num,,,0,1,2,3
1,0,0,0,Sir,PROPN,Sir,compound
1,0,0,1,Walter,PROPN,Walter,compound
1,0,0,2,Elliot,PROPN,Elliot,nsubj
1,0,0,3,",",PUNCT,",",punct
1,0,0,4,of,ADP,of,prep
...,...,...,...,...,...,...,...
24,13,0,7,Persuasion,PROPN,Persuasion,pobj
24,13,0,8,",",PUNCT,",",punct
24,13,0,9,by,ADP,by,prep
24,13,0,10,Jane,PROPN,Jane,compound


In [53]:
TOKEN.columns = ['token_str', 'pos', 'lemma', 'dep']

In [54]:
TOKEN.index.names = ['chap_num', 'para_num', 'sent_num', 'token_num']

In [55]:
TOKEN

chap_num,para_num,sent_num,token_num,token_str,pos,lemma,dep
1,0,0,0,Sir,PROPN,Sir,compound
1,0,0,1,Walter,PROPN,Walter,compound
1,0,0,2,Elliot,PROPN,Elliot,nsubj
1,0,0,3,",",PUNCT,",",punct
1,0,0,4,of,ADP,of,prep
...,...,...,...,...,...,...,...
24,13,0,7,Persuasion,PROPN,Persuasion,pobj
24,13,0,8,",",PUNCT,",",punct
24,13,0,9,by,ADP,by,prep
24,13,0,10,Jane,PROPN,Jane,compound


In [59]:
# Count tokens tagged as proper nouns (PROPN) per chapter, one column per token string
TOKEN[TOKEN.pos == 'PROPN'].groupby(['chap_num','token_str']).token_str.count().unstack(fill_value=0)

chap_num,!,---5,.,A,A.,Abydos,Admiral,Admiralty,Alicia,Anne,...,"reply:--""Elizabeth",said--,sarcastically--,seas,sir,suggested--,suppositions:--,thither,unison,us
1,0,0,4,1,0,0,0,0,0,7,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,12,...,0,0,0,0,0,0,0,0,0,0
3,1,1,0,0,0,0,15,0,0,5,...,0,0,1,0,0,0,0,0,0,0
4,0,0,0,0,0,0,1,0,0,12,...,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,6,0,0,29,...,0,0,0,0,0,0,0,0,0,0
6,0,0,0,0,0,0,2,0,0,21,...,0,0,0,0,0,0,0,0,0,0
7,0,0,0,0,0,0,0,0,0,21,...,0,0,0,0,0,0,1,0,0,0
8,0,0,0,0,0,0,8,1,0,13,...,0,0,0,0,0,0,0,0,1,0
9,0,0,0,0,0,0,2,0,0,8,...,0,0,0,0,0,0,0,0,0,0
10,0,0,0,0,0,0,6,0,0,22,...,0,1,0,0,0,0,0,0,0,0


# Extra Stuff

## Named Entities

In [91]:
ENT = DOC.spacy_doc.apply(lambda x: [ent for ent in x.ents]).apply(pd.Series).stack().to_frame('ent')

In [92]:
ENT

chap_num,para_num,,ent
1,0,0,"(Walter, Elliot)"
1,0,1,"(Kellynch, Hall)"
1,0,2,(Somersetshire)
1,0,3,(Baronetage)
1,0,4,"(an, idle, hour)"
...,...,...,...
24,11,1,(Anne)
24,11,2,(Wentworth)
24,12,0,(Finis)
24,13,0,(Persuasion)


In [94]:
ENT['ent_str'] = ENT.ent.apply(lambda x: ' '.join(map(str,x)))
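
Joining the tokens of an entity with spaces puts a space before punctuation, which is why strings such as `March 1 , 1760` appear in the entity lists. A span's `text` attribute keeps the original spacing instead. The sketch below illustrates the difference with a hand-made span on a blank English pipeline; the span and its `DATE` label are assumptions standing in for a real recognized entity.

```python
import spacy

# Assumption: tokenizer-only pipeline; the span below is built by hand, not by an NER model.
nlp_blank = spacy.blank("en")
text = "Walter Elliot, born March 1, 1760."
doc = nlp_blank(text)

# Build a span over the date by character offsets.
start = text.index("March")
end = start + len("March 1, 1760")
date_span = doc.char_span(start, end, label="DATE")

joined = " ".join(map(str, date_span))  # tokens joined with single spaces
original = date_span.text               # text as it appears in the source
```

Which form is preferable depends on whether the entity strings need to match the source text exactly or only need to be consistent with each other.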

In [95]:
ENT

chap_num,para_num,,ent,ent_str
1,0,0,"(Walter, Elliot)",Walter Elliot
1,0,1,"(Kellynch, Hall)",Kellynch Hall
1,0,2,(Somersetshire),Somersetshire
1,0,3,(Baronetage),Baronetage
1,0,4,"(an, idle, hour)",an idle hour
...,...,...,...,...
24,11,1,(Anne),Anne
24,11,2,(Wentworth),Wentworth
24,12,0,(Finis),Finis
24,13,0,(Persuasion),Persuasion


In [143]:
ENT_TYPE = ENT.ent_str.value_counts().to_frame('n')

In [144]:
ENT_TYPE.head(10)

ent_str,n
Anne,380
Elliot,213
Wentworth,206
Mary,131
Walter,124
Charles,111
first,99
Lady Russell,97
Louisa,96
one,93


In [145]:
print(ENT_TYPE[ENT_TYPE.n == 1].index.to_list())

['about a week', 'Sophys', 'July 15 , 1784', 'almost every morning', 'ten days', 'a Mrs Wallis', 'the same morning', 'eleven', 'Ten minutes', 'printshop', 'Modest', 'Archibald Drew', 'Bond Street', 'Admiral Brand', 'eighty - seven', 'the spring months', 'The Miss Musgrove', 'Brigden', 'Archibald', 'Captain Brigden', "ten o'clock", 'Lansdown Crescent', 'five - and - thirty', 'thirty', '& c. & c', 'three - shilling', 'three months', 'the heyday', 'one evening', 'an idle hour', 'another year', 'about two years before', 'The first ten minutes', 'three weeks', 'the beginning of February', 'fifteen', 'himself!--she', 'weeks', 'seven', 'every - day', 'twelve years', 'between thirty and forty', "Henry Russell 's", 'schoolfellow', 'A Mrs Smith', 'five thousand', 'this winter', 'several days', 'seven months', 'February 1st', 'many days', 'March 1 , 1760', 'ELLIOT OF KELLYNCH HALL', 'Gay Street', 'the week', 'the last century', 'million', 'Mary M---', 'the year before', 'Dowager Viscountess Dalry

In [146]:
ENT_TYPE[ENT_TYPE.n == 2].index

Index(['Dick Musgrove', 'Wallises', '& c.', 'One morning', 'Thirteen years',
       'Mansion', 'Sunday', 'morning', 'Mrs Shirley', 'Giaour', 'November day',
       'autumn', 'last summer', 'six', 'Miss Elliot', 'Could Anne',
       'two years before', 'Lady Elliot 's', 'Lady Dalrymple 's',
       'the evening', 'Irish', 'twenty - four hours', 'Sophia', 'two days',
       'that day', 'a week ago', 'The day', 'more than half', 'Somerset',
       'Walter Elliot 's', 'only seventeen miles', 'hours', 'twenty miles',
       'the year six', 'a week later', 'the day of the month', 'Lisbon',
       'Charmouth', 'last spring', 'the winter', 'next week',
       'William Walter Elliot', 'the night', 'Monday', 'the next day',
       'fourteen', 'another hour', 'six weeks', 'William', 'another day',
       'Grappler', 'ten minutes', 'Bath Street', 'Forty', 'this summer',
       'French', 'the first week', 'fifty', 'several weeks', 'Christmas',
       'every day', 'the Christmas holidays', 'the morni

In [162]:
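# Paragraph-by-entity count matrix: rows are (chap_num, para_num), columns are entity strings
ENTM = ENT.groupby(['chap_num', 'para_num', 'ent_str']).ent_str.count().unstack(fill_value=0)
ENT_TYPE['df'] = ENTM.astype(bool).sum()          # number of paragraphs containing the entity
ENT_TYPE['dp'] = ENT_TYPE.df / ENT_TYPE.df.sum()  # df as a share of the sum of all df values
ENT_TYPE['di'] = np.log2(1/ENT_TYPE.dp)           # self-information in bits
ENT_TYPE['dh'] = ENT_TYPE.dp * ENT_TYPE.di        # contribution to entropy
# Note: N here is len(ENT), the number of entity mentions, not the number of paragraphs
ENT_TYPE['dfidf'] = ENT_TYPE.df * np.log2(len(ENT)/ENT_TYPE.df)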
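
To make the five measures concrete, the following sketch recomputes them in plain Python on a handful of made-up entity mentions. The data and variable names are hypothetical. The formulas mirror the cell above, including its use of the total number of mentions as `N` in the `dfidf` term.

```python
import math
from collections import Counter

# Hypothetical toy data: one (paragraph_id, entity_string) pair per entity mention.
mentions = [
    (0, "Anne"), (0, "Anne"), (0, "Bath"),
    (1, "Anne"), (1, "Lady Russell"),
    (2, "Bath"),
    (3, "Anne"),
]

n = Counter(ent for _, ent in mentions)        # total mentions per entity
df = Counter(ent for _, ent in set(mentions))  # paragraphs containing each entity
df_total = sum(df.values())
n_mentions = len(mentions)                     # plays the role of len(ENT)

stats = {}
for ent in n:
    dp = df[ent] / df_total
    di = math.log2(1 / dp)
    stats[ent] = {
        "n": n[ent],
        "df": df[ent],
        "dp": dp,
        "di": di,
        "dh": dp * di,
        "dfidf": df[ent] * math.log2(n_mentions / df[ent]),
    }
```

In this toy data "Anne" is mentioned four times but in only three paragraphs, so its `n` and `df` differ in the same way they do in the table for the novel.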

In [163]:
ENT_TYPE.sort_values('dfidf', ascending=False).head(30).style.background_gradient(cmap="YlGnBu")

ent_str,n,df,dp,di,dh,dfidf
Anne,380,320,0.092593,3.432959,0.317867,1167.531951
Wentworth,206,167,0.048322,4.371183,0.211223,765.989112
Elliot,213,154,0.04456,4.488101,0.199991,724.366551
Mary,131,109,0.031539,4.986703,0.157277,567.048642
Walter,124,97,0.028067,5.154975,0.144685,520.943602
Charles,111,92,0.02662,5.231326,0.13926,501.115121
first,99,92,0.02662,5.231326,0.13926,501.115121
one,93,82,0.023727,5.397335,0.128062,460.258902
Lady Russell,97,77,0.02228,5.488101,0.122275,439.183275
Louisa,96,76,0.021991,5.50696,0.121102,434.912882
