# spaCy stuff

Explore spaCy by working through the tutorial: https://course.spacy.io/

spaCy models are listed here: https://spacy.io/usage/models



In [None]:
import spacy

In [None]:
with open('../data/cipsum.txt','r') as f:
    corpus = f.read()

In [None]:
from spacy.lang.en import English

In [None]:
nlp = English()  # blank English pipeline: tokenizer only, no tagger, parser, or NER

What can we do with cipsum?

* LDA: detect key topics to build the presentation around
* Use word embeddings and similarity to find sentences/paragraphs around each topic
    * Possibly train our own word embeddings on this text?
* Get some catchy phrases for slide titles, taglines, or punchlines
* Generate sensible bullet points

Other tasks for good slides:
* Find images related to a topic
* Aesthetics: lay out each slide so that it looks awesome

In [None]:
nlp(corpus)

Tokens and spans: indexing a `doc` gives a token, slicing it gives a span. A span is only a view on the `doc`; it holds no data of its own.

In [None]:
doc = nlp(corpus)

In [None]:
doc[0:4]

In [None]:
doc[0:4].text

In [None]:
for token in doc[0:10]:
    if 'a' in token.text:
        print(token)
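A quick check of the "view" idea, as a minimal sketch. It assumes `spacy.blank("en")` (tokenizer only, so no model download) and a made-up sentence instead of the cipsum corpus.

```python
import spacy

# Blank English pipeline: tokenizer only (assumption: spaCy is installed, no model needed)
nlp = spacy.blank("en")

# Hypothetical example sentence, not taken from the corpus
doc = nlp("Leverage agile frameworks to provide a robust synopsis.")

# Slicing a Doc returns a Span
span = doc[1:3]
print(span.text)             # text covered by the span
print(span.start, span.end)  # token offsets of the span inside the doc

# The span points back at the doc it was sliced from
print(span.doc is doc)

# Tokens inside the span keep their index in the parent doc
print(span[0].i)
```

Because the span only stores offsets into the doc, `span[0].i` is the token's position in the whole document, not its position in the slice.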

## Language models

Load a small language model; `spacy.load` returns an `nlp` object.

In [None]:
nlp = spacy.load('en_core_web_sm')

Process the text:

In [None]:
doc = nlp(corpus)

Use the part-of-speech tagger from the language model: for each token in the `doc`, we can print the text and the `pos_` attribute, which holds the predicted part-of-speech tag.

In [None]:
for token in doc[0:10]:
    print(token.text, token.pos_)

In addition to the part-of-speech tags, we can also predict how the words are related. For example, whether a word is the subject of the sentence or an object.

The `dep_` attribute returns the predicted dependency label.

The `head` attribute returns the syntactic head token. You can also think of it as the parent token this word is attached to.

In [None]:
for token in doc[0:10]:
    print(token.text, token.pos_, token.dep_, token.head.text)

In [None]:
spacy.explain('amod')
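To see how `dep_` and `head` fit together without relying on a model's predictions, here is a small sketch with a hand-annotated toy sentence. The words, heads, and labels are made up by hand (an assumption, not model output), and building a `Doc` this way assumes the spaCy v3 `Doc` constructor.

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")

# Hand-written annotations for a toy sentence (hypothetical, not predicted)
words = ["She", "eats", "pizza"]
heads = [1, 1, 1]                 # absolute index of each token's head; the root points at itself
deps = ["nsubj", "ROOT", "dobj"]  # dependency label of each token

toy_doc = Doc(nlp.vocab, words=words, heads=heads, deps=deps)

# Same three columns as in the cell above: text, dependency label, head
rows = [(token.text, token.dep_, token.head.text) for token in toy_doc]
for row in rows:
    print(*row)

# The head relation also works the other way round: a head knows its children
children = [child.text for child in toy_doc[1].children]
print(children)
```

Both the subject and the object attach to the verb, so the verb is their head and they are its children.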

Predicting named entities

In [None]:
# Iterate over the predicted entities
for ent in doc[0:100].ents:
    # Print the entity text and its label
    print(ent.text, ent.label_)

To use a pattern, we first import the `Matcher` from `spacy.matcher`. Calling the matcher on a `doc` returns a list of `(match_id, start, end)` tuples:

* match_id: hash value of the pattern name
* start: start index of matched span
* end: end index of matched span

In [None]:
from spacy.matcher import Matcher

# Initialize the matcher with the shared vocab
matcher = Matcher(nlp.vocab)

# Add the pattern to the matcher
# pattern = [{'TEXT': 'change'}, {'TEXT': 'management'}]
pattern = [{'POS': 'ADJ'},
           {'POS': 'NOUN'}]
# Patterns are passed as a list; the older add('tryout', None, pattern) form fails on spaCy v3
matcher.add('tryout', [pattern])

# Call the matcher on the doc
matches = matcher(doc)

# Iterate over the matches
for match_id, start, end in matches[0:10]:
    # Get the matched span
    matched_span = doc[start:end]
    print(matched_span.text)
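The three values in each match tuple can be inspected directly. A minimal sketch, assuming a blank pipeline and a `LOWER` pattern (so no tagger is needed), a made-up two-sentence text, and the hypothetical pattern name `CHANGE_MANAGEMENT`:

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# Case-insensitive match on two consecutive tokens
pattern = [{"LOWER": "change"}, {"LOWER": "management"}]
matcher.add("CHANGE_MANAGEMENT", [pattern])

# Hypothetical text, not taken from the corpus
doc = nlp("Change management is hard. We love change management.")

results = []
for match_id, start, end in matcher(doc):
    # match_id is a hash; the string store maps it back to the pattern name
    name = nlp.vocab.strings[match_id]
    results.append((name, start, end, doc[start:end].text))
    print(name, start, end, doc[start:end].text)
```

`start` and `end` are token indices, so `doc[start:end]` gives the matched span directly.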

In [None]:
# Initialize the matcher with the shared vocab
matcher = Matcher(nlp.vocab)

# Add the pattern to the matcher
# pattern = [{'TEXT': 'change'}, {'TEXT': 'management'}]
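pattern = [{'LEMMA': 'leverage'},
           {'POS': 'ADJ', 'OP': '*'},  # '*' matches 0 or more times ('?' would match 0 or 1 times)
           {'POS': 'NOUN'}]
# Patterns are passed as a list; the older add('tryout', None, pattern) form fails on spaCy v3
matcher.add('tryout', [pattern])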

# Call the matcher on the doc
matches = matcher(doc)

# Iterate over the matches
for match_id, start, end in matches[0:10]:
    # Get the matched span
    matched_span = doc[start:end]
    print(matched_span.text)
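The difference between the `*` and `?` operators is easy to mix up, so here is a small comparison. It is a sketch under assumptions: a blank pipeline, a made-up text, `LOWER` patterns instead of `POS` (so no model is needed), and a hypothetical helper `matched_texts`.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")

# Hypothetical text with zero, one, and two "very" tokens between the anchors
doc = nlp("leverage agile. leverage very agile. leverage very very agile.")

def matched_texts(op):
    """Return the distinct matched texts when the middle token uses operator `op`."""
    matcher = Matcher(nlp.vocab)
    pattern = [
        {"LOWER": "leverage"},
        {"LOWER": "very", "OP": op},
        {"LOWER": "agile"},
    ]
    matcher.add("tryout", [pattern])
    return sorted({doc[start:end].text for _, start, end in matcher(doc)})

star = matched_texts("*")      # middle token may appear 0 or more times
optional = matched_texts("?")  # middle token may appear 0 or 1 times

print(star)
print(optional)
```

Only the `*` version picks up the phrase with two `very` tokens.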

## Chapter 2

In [None]:
nlp = spacy.load('en_core_web_md')  # medium model: ships with word vectors, which similarity needs


In [None]:
from itertools import islice

In [None]:
for sentence in islice(doc.sents, 3):
    print(sentence)

In [None]:
# islice returns an iterator, which cannot be indexed; materialize it as a list first
sentences = list(islice(doc.sents, 2))
sentences[0]

In [None]:
# Compare two documents
doc1 = nlp("I like fast food")
doc2 = nlp("I like pizza")
print(doc1.similarity(doc2))
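By default, `doc.similarity` is the cosine similarity of the two doc vectors, and a doc vector defaults to the average of its token vectors. The sketch below mimics that in plain Python. The 3-dimensional `toy_vectors` are invented for illustration (real models use a few hundred dimensions), and `doc_vector` and `cosine` are hypothetical helpers, not spaCy functions.

```python
import math

# Made-up word vectors, purely for illustration
toy_vectors = {
    "i": [1.0, 0.0, 0.0],
    "like": [0.0, 1.0, 0.0],
    "pizza": [0.0, 0.0, 1.0],
    "pasta": [0.0, 0.2, 0.9],
}

def doc_vector(words):
    """Average the word vectors, dimension by dimension."""
    vectors = [toy_vectors[word] for word in words]
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

pizza_doc = doc_vector(["i", "like", "pizza"])
pasta_doc = doc_vector(["i", "like", "pasta"])

sim_same = cosine(pizza_doc, pizza_doc)   # a doc compared with itself
sim_close = cosine(pizza_doc, pasta_doc)  # two docs that differ in one word
print(sim_same, sim_close)
```

Since the result depends entirely on the vectors, short texts that share most of their words will score high no matter what the remaining word is.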

## Chapter 3

In [None]:
from spacy import displacy


In [None]:
displacy.render(doc, style='ent', jupyter=True)


Lemmatization

In [None]:
text = nlp("This sentence contains multiple forms of the English verb of existence: is, am, and are.")
print(" ".join([token.lemma_ for token in text if not token.is_punct]))

We also see that tokens carry useful attributes (the full list is linked at the end of this section). Some useful ones include:

* lemma_: the underlying lemma for each word
* is_punct: allows us to ignore non-word tokens
* ent_type_: named entity type (useful for distinguishing companies, people, cities, etc.)
* like_url: indicates whether the token resembles a URL
* like_email: indicates whether the token resembles an email address

Full list of default token attributes: https://spacy.io/api/token#attributes
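Some of these attributes are purely lexical, so they work even without a trained model. A small sketch, assuming a blank pipeline and a made-up sentence (`lemma_` and `ent_type_` are left out because they need the lemmatizer and NER components):

```python
import spacy

nlp = spacy.blank("en")

# Hypothetical sentence containing an email address and a URL
doc = nlp("Mail info@example.com or visit https://spacy.io today!")

# Lexical attributes are computed from the token text alone
urls = [token.text for token in doc if token.like_url]
emails = [token.text for token in doc if token.like_email]
words = [token.text for token in doc if not token.is_punct]

print(urls)
print(emails)
print(words)
```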