# EDA with spaCy

## The spaCy pipeline

"When you call nlp on a text, spaCy first tokenizes the text to produce a Doc object. The Doc is then processed in several different steps – this is also referred to as the processing [pipeline](https://spacy.io/usage/processing-pipelines)." 

From https://course.spacy.io/en/chapter3:

<img src="https://spacy.io/pipeline-fde48da9b43661abcdf62ab70a546d71.svg">

In [1]:
import sys
import spacy
#from spacy.tokens import Doc, Span
nlp = spacy.load("en_core_web_md") #_sm does not have word vectors
print(nlp.pipe_names)

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']
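
The idea behind the pipeline can be sketched without spaCy. The toy below is not spaCy's implementation: `ToyDoc`, `tokenize`, `lowercase_component`, and `length_component` are hypothetical names, and whitespace splitting stands in for the real tokenizer. It only illustrates the pattern of a tokenizer producing a document that named components then annotate in order.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDoc:
    """Hypothetical stand-in for a Doc: tokens plus per-component annotations."""
    tokens: list
    annotations: dict = field(default_factory=dict)


def tokenize(text):
    # Step 1: the tokenizer turns raw text into a document object.
    return ToyDoc(tokens=text.split())


def lowercase_component(doc):
    # Adds a lowercased form for every token.
    doc.annotations["lower"] = [t.lower() for t in doc.tokens]
    return doc


def length_component(doc):
    # Adds the character length of every token.
    doc.annotations["length"] = [len(t) for t in doc.tokens]
    return doc


# Step 2: named components run in order; each receives and returns the doc.
pipeline = [("lower", lowercase_component), ("length", length_component)]


def run(text):
    doc = tokenize(text)
    for _name, component in pipeline:
        doc = component(doc)
    return doc


toy = run("Knowing Gaia")
pipe_names = [name for name, _ in pipeline]
print(pipe_names)
print(toy.annotations)
```

The list of names plays the role of `nlp.pipe_names`: it records which steps run and in what order.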


## Adding sentiment via TextBlob

In [2]:
from spacytextblob.spacytextblob import SpacyTextBlob
nlp.add_pipe('spacytextblob')
print(nlp.pipe_names)

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner', 'spacytextblob']


## spaCy architecture

![](https://spacy.io/architecture-415624fc7d149ec03f2736c4aa8b8f3c.svg)

## Exploring document structure

In [3]:
# Break the corpus into:
# Title, Chapters, Sections, Paragraphs, Thoughts, Sentences
#  Title: Knowing Gaia
#  C1:Introduction 1 section, 6 paragraphs
#  C2:Taking a leap 8 sections, 7 paragraphs
#  C3-C10: each with 10 sections  thought-embedding 
#  C11: Contemplation: 2 sections

In [4]:
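# Use a context manager so the file handle is closed after reading
with open("clean.txt", "r", encoding="utf-8") as fh:
    doc = nlp(fh.read())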

In [5]:
# Before extra bash cleaning:
#print([tok.text for tok in doc[21:30] 
# if not tok.is_stop and tok.text!="\n\n"])

# After extra bash cleaning:
print([tok.text for tok in doc[21:30]])

['traditional', 'non', '-', 'duality', 'speakers', ',', 'I', 'was', 'inspired']


In [6]:
sents = list(doc.sents)
type(sents[0])

spacy.tokens.span.Span

In [7]:
nlp.vocab["\n"].is_stop = True
words=[token.text for token in doc
       if not token.is_stop and not token.is_punct ]

In [8]:
from collections import Counter
word_freq=Counter(words)

In [9]:
common_words = word_freq.most_common(30)

In [10]:
print(common_words)

[('Gaia', 190), ('want', 99), ('know', 78), ('people', 75), ('self', 71), ('like', 68), ('system', 61), ('need', 58), ('think', 53), ('experience', 52), ('feel', 48), ('life', 46), ('feeling', 46), ('power', 44), ('body', 41), ('dream', 40), ('attention', 39), ('person', 38), ('mind', 35), ('systems', 34), ('living', 33), ('right', 33), ('way', 30), ('ideology', 30), ('emotions', 29), ('work', 29), ('time', 28), ('thinking', 27), ('away', 26), ('touch', 26)]
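
The counts above are over surface forms, so "system" and "systems" (or "think" and "thinking") show up as separate entries. Counting `token.lemma_` instead of `token.text` would merge inflected forms. A minimal sketch of the difference, using hand-written (text, lemma) pairs as a stand-in for real tokens; the pairs are illustrative, not output from the corpus:

```python
from collections import Counter

# Hypothetical (text, lemma) pairs standing in for (token.text, token.lemma_).
toy_tokens = [
    ("system", "system"), ("systems", "system"),
    ("feel", "feel"), ("feels", "feel"),
    ("Gaia", "Gaia"),
]

by_text = Counter(text for text, _ in toy_tokens)     # surface forms
by_lemma = Counter(lemma for _, lemma in toy_tokens)  # merged by lemma

print(by_text["system"], by_lemma["system"])
```

Whether merging is desirable depends on the question being asked; for a first look at a corpus, lemma counts usually give a less fragmented ranking.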


In [11]:
for token in doc[1:10]:
    print(token, token.tag_, token.pos_, spacy.explain(token.tag_))

Gaia NNP PROPN noun, proper singular

 _SP SPACE whitespace
C1 NN NOUN noun, singular or mass
: : PUNCT punctuation mark, colon or ellipsis
Introduction NN NOUN noun, singular or mass

 _SP SPACE whitespace
After IN ADP conjunction, subordinating or preposition
many JJ ADJ adjective (English), other noun-modifier (Chinese)
years NNS NOUN noun, plural


In [12]:
nouns = []
adjectives = []
for token in [t for t in doc[2:4000] if t.text != "-"]:
    if token.pos_ == 'NOUN':
        nouns.append(token)
    if token.pos_ == 'ADJ':
        adjectives.append(token)

print(f'Total nouns: {len(nouns)} \nFirst 5 nouns: {nouns[:5]}')
print(f'Total adjectives: {len(adjectives)}')
print(f'First 5 adjectives: {adjectives[:5]}')

Total nouns: 685 
First 5 nouns: [C1, Introduction, years, study, practice]
Total adjectives: 277
First 5 adjectives: [many, Buddhist, less, traditional, non]


In [13]:
from spacy import displacy
# Creates large image of relations
#displacy.render(ss[3:4], style='dep', jupyter=True)

### A spaCy Doc is a sequence of tokens
A spaCy **Doc** is a sequence of Token objects. It also gives access to **sentences** (`doc.sents`)
and named entities (`doc.ents`).

In [14]:
# Get all tokens and part-of-speech tags
token_texts = [token.text for token in doc]
pos_tags = [token.pos_ for token in doc]

# Iterate over first 2000 tokens
for token in doc[1:2000]:
    # Check if the current token is a proper noun
    if token.pos_ == "PROPN":
        # Check if the next token is a verb
        if doc[token.i + 1].pos_ == "VERB":
            print("Found proper noun before a verb:", token.text)

center=4
width=4
print(f'{len(doc[1].vector)}')
print(f'Similarity between {doc[center].text!r} and its context window: '
      f'{doc[center].similarity(doc[center-width:center+width])}')

Found proper noun before a verb: NASA
Found proper noun before a verb: LeGuin
Found proper noun before a verb: Contemplation
Found proper noun before a verb: Gaia
300
Similarity between ':' and its context window: 0.4422609806060791
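
The lookahead `doc[token.i + 1]` works here because the loop stops well before the end of the document; if the last token of a `Doc` were a proper noun, the same lookup would raise an `IndexError`. A bounds-safe way to scan adjacent pairs is to zip the sequence with itself shifted by one. The sketch below uses hypothetical (text, pos) tuples in place of real tokens, and `propn_before_verb` is an illustrative helper name:

```python
# Hypothetical (text, pos) pairs standing in for (token.text, token.pos_).
tagged = [
    ("NASA", "PROPN"), ("launched", "VERB"), ("a", "DET"),
    ("probe", "NOUN"), ("near", "ADP"), ("Gaia", "PROPN"),
]


def propn_before_verb(pairs):
    """Return the text of every proper noun immediately followed by a verb."""
    # zip stops at the shorter input, so the final token is never looked past.
    return [
        cur_text
        for (cur_text, cur_pos), (_next_text, next_pos) in zip(pairs, pairs[1:])
        if cur_pos == "PROPN" and next_pos == "VERB"
    ]


found = propn_before_verb(tagged)
print(found)
```

The trailing "Gaia" is a proper noun with nothing after it, and it is skipped without an error.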


In [15]:
# Each vocabulary word maps to a 300-dimensional static (word2vec-style) vector
# The raw dot product is not length-normalized, so it is not a cosine similarity,
# but it seems to work here

In [16]:
import numpy
numpy.dot(doc.vocab['earth'].vector,doc.vocab['gaia'].vector)

0.8249819
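
A raw dot product grows with the length of the vectors, so values are hard to compare across word pairs. Cosine similarity divides the dot product by both norms and always falls between -1 and 1. A minimal sketch on two small hand-picked vectors (not real word vectors), assuming neither vector is all zeros:

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


u = [3.0, 4.0]
v = [6.0, 8.0]  # same direction as u, twice as long

raw_dot = sum(a * b for a, b in zip(u, v))
print(raw_dot, cosine(u, v))
```

The two vectors point the same way, so their cosine is 1 even though the raw dot product is 50. In spaCy, the `similarity` method on lexemes, tokens, spans, and docs returns a cosine-style score of this kind.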

In [17]:
from itertools import zip_longest
c=["C1", "C2", "C3", "C4", "C5", "C6", "C7", "C8", "C9", "C10", "C11"]
# Doc-level index of each chapter marker token
starts=[ t.i for chap in c for t in doc if t.text==chap]
# Pair each start with the next one; the last chapter gets (start, None)
ci=list(zip_longest(starts,starts[1:]))
chaps=[ doc[s[0]:s[1]] for s in ci] # list of spans
print(ci)

[(3, 720), (720, 1912), (1912, 3603), (3603, 4985), (4985, 7204), (7204, 9617), (9617, 12594), (12594, 16739), (16739, 20888), (20888, 24152), (24152, None)]
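
The splitting step pairs each marker position with the next one, and `zip_longest` pads the final pair with `None` so the last slice runs to the end. The same technique on a plain list of strings, as a sketch with made-up tokens:

```python
from itertools import zip_longest

# Made-up token list; "C1", "C2", "C3" play the role of chapter markers.
toy_tokens = ["Title", "C1", ":", "Intro", "text",
              "C2", ":", "Leap", "C3", ":", "End"]
markers = {"C1", "C2", "C3"}

# Position of every marker, found in a single pass over the tokens.
toy_starts = [i for i, tok in enumerate(toy_tokens) if tok in markers]

# (start, next_start) pairs; the last pair is (start, None).
bounds = list(zip_longest(toy_starts, toy_starts[1:]))

# Slicing with an end of None runs to the end of the list.
chapters = [toy_tokens[s:e] for s, e in bounds]
print(bounds)
print(chapters)
```

One caveat applies to both versions: a marker string that also occurs in running text would create a spurious chapter boundary.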


In [18]:
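pts=["P1", "P2", "P3", "P4", "P5", "P6", "P7", "P8", "P9", "P10", "P11"]
chap=5    # zero-based index into chaps, i.e. chapter C6
prompt=3
# token.i is an index into the whole doc, not into the chapter span
pis=[ t.i for pstr in pts for t in chaps[chap] if t.text==pstr]
pi=list(zip_longest(pis,pis[1:])) # materialize: a zip iterator can only be consumed once
# Slice doc rather than chaps[chap], because the offsets in pi are doc-level
ptexts=[ doc[s[0]:s[1]] for s in pi]
chap6prompts=ptexts
sents=list(doc[pis[prompt-1]:pis[prompt]].sents)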
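
`token.i` is an index into the whole `Doc`, while slicing a `Span` counts from the span's first token. Mixing the two silently returns the wrong tokens. The sketch below shows the same pitfall with plain lists, where `chapter` stands in for a span that starts partway through the document; all names are illustrative:

```python
tokens = list("abcdefghij")        # stands in for the full doc
chapter_start = 4
chapter = tokens[chapter_start:]   # like a span beginning at doc index 4

abs_start, abs_end = 6, 8          # doc-level offsets, like token.i

wrong = chapter[abs_start:abs_end]   # treats doc-level offsets as span-relative
right = tokens[abs_start:abs_end]    # slice the parent with doc-level offsets
also_right = chapter[abs_start - chapter_start:abs_end - chapter_start]

print(wrong, right, also_right)
```

Either slice the parent with absolute offsets or subtract the span's start before slicing the span.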

In [29]:
with open("./eda.html", 'w', encoding='utf-8') as f:
    f.write("<!DOCTYPE html>")
    f.write("<html><head><style>")
    f.write("html{ font-family: monospace; color:#888}")
    # Close <head> and open <body> so the page is well-formed
    f.write("</style></head><body>")
    for c in chaps:
        f.write(f'<h1>{c[1:3].text}</h1>')
        f.write(c.text)

    f.write("</body></html>")
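
The chapter text is written to the file as-is, so any `&` or `<` in the corpus would be read by the browser as markup. The standard library's `html.escape` handles this. A small sketch, where `chapter_to_html` is a hypothetical helper and the strings are made up:

```python
import html


def chapter_to_html(title, body):
    """Return an <h1> heading and a <p> body with special characters escaped."""
    return f"<h1>{html.escape(title)}</h1><p>{html.escape(body)}</p>"


snippet = chapter_to_html("C1: Introduction", "body & mind <together>")
print(snippet)
```

Wrapping each chapter in its own element also keeps the heading and body separate for later styling.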

## Adding spans to Doc

In [20]:
from spacy import displacy
from spacy.tokens import Span

sents = list(chaps[0].sents)
options = {"colors": {"PHRASE1": "green",
                      "PHRASE2": "red",
                      "THOUGHT": "orange"}}

# type(chaps[0]) = spacy.tokens.span.Span
# type(chaps[0][1:60]) = spacy.tokens.span.Span

# as_doc() copies the span into a new Doc whose token indices restart at 0
smalldoc=chaps[0][0:64].as_doc()
# Build the spans on the Doc they are attached to (smalldoc, not doc)
smalldoc.spans["sc"] = [
    Span(smalldoc, 24, 36, "PHRASE2"),
    Span(smalldoc, 3, 35, "THOUGHT"),
    Span(smalldoc, 4, 23, "PHRASE1"),
]
displacy.render(smalldoc, style='span', options=options)
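
A span belongs to one `Doc`, and `as_doc()` produces a new `Doc` with its own token numbering. The sketch below shows both points on a one-line text. It assumes spaCy is installed and uses `spacy.blank("en")`, a tokenizer-only pipeline, so no trained model is needed; the sentence and the `THOUGHT` span boundaries are made up for illustration.

```python
import spacy
from spacy.tokens import Span

nlp_blank = spacy.blank("en")  # tokenizer only; no trained components
full = nlp_blank("Knowing Gaia C1 : Introduction After many years of study")

chapter = full[2:]        # Span starting at doc-level token 2 ("C1")
small = chapter.as_doc()  # new Doc; token indices restart at 0

# Create the span on the Doc it will be stored in.
thought = Span(small, 0, 3, label="THOUGHT")
small.spans["sc"] = [thought]

print(small[0].text, "|", small.spans["sc"][0].text)
```

`"sc"` is the key displaCy's span style reads by default, which is why the notebook cell stores its spans there.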

## References

- [Thinc type checking](https://thinc.ai/docs/usage-type-checking): static type checking with mypy.
- [RealPython notes on SpaCy](https://realpython.com/natural-language-processing-spacy-python/#preprocessing-functions)