## Introduction to ```spaCy```

There are a number of different NLP frameworks that you're likely to encounter. The most popular and widely used of these are:

- ```NLTK``` (Natural Language Toolkit, old-school)
- ```UDPipe``` (Neural network based, fast and light, but not super accurate)
- ```CoreNLP``` and ```stanza``` (Created by the team at Stanford; academically robust)
- ```spaCy``` (production-ready, well-documented, state-of-the-art)

We'll be working with ```spaCy``` in this module, primarily because it's easy and intuitive to use, and because it scales well.

The first thing we need to do is install ```spaCy``` and the language model that we want to use.

```
$ pip install spacy 
$ python -m spacy download en_core_web_sm
```

In [None]:
# the leading "!" runs this line as a shell command from inside the notebook
!python -m spacy download en_core_web_sm

## Initializing ```spaCy```

The first thing we need to do is import ```spaCy``` __and__ load the language model that we want to use.

Note that if you want to work with a different language, you need to load a different language model.
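
As a rough illustration of the note above, the sketch below maps language codes to pipeline package names. The dictionary and the helper ```model_name_for``` are hypothetical and only there for illustration; the package names follow spaCy's naming scheme for its small English, Danish, and German pipelines, and each one would still need to be downloaded before it can be loaded.

```python
# Illustrative mapping from language codes to small spaCy pipeline packages.
# The selection of languages here is an assumption made for this example.
MODEL_NAMES = {
    "en": "en_core_web_sm",
    "da": "da_core_news_sm",
    "de": "de_core_news_sm",
}


def model_name_for(lang_code):
    """Return the pipeline package name for a language code (hypothetical helper)."""
    try:
        return MODEL_NAMES[lang_code]
    except KeyError:
        raise ValueError(f"No model configured for language {lang_code!r}") from None


print(model_name_for("en"))
```

The returned name is what you would pass to ```spacy.load```, after first downloading that package with ```python -m spacy download```.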

In [5]:
import spacy

nlp = spacy.load("en_core_web_sm")

Here, we create a spaCy object and assign it to the variable ```nlp```. This is the NLP pipeline that will do all our heavy lifting, using the trained model we've specified.

With the model now loaded, we can begin to do some very simple NLP tasks.

Below, you can see what the pipeline does with a bit of sample text. Passing text to the ```nlp``` object returns a ```Doc``` object, which gives us access to a bunch of properties, including tokens (words), parts of speech, named entities, and so on. Here we look at two of them, tokens and entities. These objects, in turn, have certain attributes and methods attached to them. A full outline of what is available can be found in the spaCy docs.

In this case, for all token objects, let's return the token itself (```token.text```); its part-of-speech tag (```token.pos_```); and the label of the grammatical dependency relation that connects the token to its head (```token.dep_```).


In [25]:
doc = nlp("Hello from New York! we're happy here. i went running yesterday")

__Tokenize__

In [28]:
for token in doc:
    print(token.text, ':', token.pos_, token.dep_)

Hello : INTJ ROOT
from : ADP prep
New : PROPN compound
York : PROPN pobj
! : PUNCT punct
we : PRON nsubj
're : AUX ROOT
happy : ADJ acomp
here : ADV advmod
. : PUNCT punct
i : PRON nsubj
went : VERB ROOT
running : VERB advcl
yesterday : NOUN npadvmod


In [37]:
# print only the tokens that are tagged as verbs
for token in doc:
    if token.pos_ == 'VERB':
        print(token.text, 'is a verb')

went is a verb
running is a verb


__Trying some more attributes__

In [24]:
for entity in doc.ents:
    print(entity, ":", entity.label_)

New York : GPE


In [23]:
spacy.displacy.render(doc, style="dep")

## Count distribution of linguistic features

__Create doc object__

In [None]:
# load one text from the data folder (novel)
# put through spacy pipeline
# count how many adj. there are

In [39]:
pwd()

'/work/cds-lang/cds-language/notebooks'

In [50]:
import os

In [96]:
filepath = os.path.join('..', '..', '..', 'CDS-LANG', '100_english_novels', 'corpus', 'Barclay_Ladies_1917.txt')

In [97]:
with open(filepath, 'r') as f:
    text = f.read()

In [98]:
doc = nlp(text)

In [99]:
adjectives = []   # the adjective tokens themselves
adjectives2 = 0   # running count of adjectives (same as len(adjectives))

for token in doc:
    if token.pos_ == 'ADJ':
        adjectives.append(token.text)
        adjectives2 += 1

In [100]:
len(adjectives)

8573
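
The same counting idea extends from adjectives to every part-of-speech tag at once. The sketch below is a minimal illustration using ```collections.Counter```; to keep it runnable without a model, it uses the (text, tag) pairs printed for the sample sentence earlier in this notebook rather than a live ```doc```. The helper name ```count_pos``` is made up for this example.

```python
from collections import Counter

# (text, pos) pairs copied from the sample-sentence output earlier in the notebook
tagged = [
    ("Hello", "INTJ"), ("from", "ADP"), ("New", "PROPN"), ("York", "PROPN"),
    ("!", "PUNCT"), ("we", "PRON"), ("'re", "AUX"), ("happy", "ADJ"),
    ("here", "ADV"), (".", "PUNCT"), ("i", "PRON"), ("went", "VERB"),
    ("running", "VERB"), ("yesterday", "NOUN"),
]


def count_pos(pairs):
    """Count how often each part-of-speech tag occurs (hypothetical helper)."""
    return Counter(pos for _, pos in pairs)


pos_counts = count_pos(tagged)
print(pos_counts["VERB"])   # number of tokens tagged VERB
print(pos_counts["ADJ"])    # number of tokens tagged ADJ
```

With a real ```doc```, the equivalent would be ```Counter(token.pos_ for token in doc)```, which gives the adjective count as ```pos_counts["ADJ"]```.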

__Relative frequency__

In [73]:
# calculate relative frequency per 1000 words

In [104]:
# count every token that is not punctuation as a word
words = 0

for token in doc:
    if token.pos_ != 'PUNCT':
        words += 1


In [105]:
rel_freq = (adjectives2 / words) * 1000

In [106]:
rel_freq  # about 56 if punctuation tokens are counted as words; about 67 without them

67.19599943565707
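
The calculation above can be wrapped in a small reusable function. This is only a sketch: the name ```relative_frequency``` and its ```per``` parameter are assumptions made for this example, and the numbers in the call are invented to keep the arithmetic easy to check.

```python
def relative_frequency(count, total, per=1000):
    """Return how often something occurs per `per` words.

    Same formula as in the cell above: (count / total) * per.
    """
    if total <= 0:
        raise ValueError("total must be a positive number of words")
    return (count / total) * per


# made-up numbers: 50 adjectives in a text of 800 words
print(relative_frequency(50, 800))  # 62.5
```

In the notebook, the call would be ```relative_frequency(adjectives2, words)```.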