## Introduction to spaCy

__Initializing spaCy__

In [2]:
import spacy

In [3]:
# Load the small English pipeline ("en_core_web_sm").
# spacy.load() returns a Language object, which we store in a variable conventionally called "nlp".
nlp = spacy.load("en_core_web_sm")

In [4]:
type(nlp)
# The loaded pipeline is an instance of spaCy's English language class

spacy.lang.en.English

Now we can perform NLP tasks: calling the nlp object on a string processes the text and returns an annotated Doc.

In [6]:
# Create a Doc object by calling nlp on an example sentence
doc = nlp("Ferocious winter weather sweeping across large parts of the central and southern US has brought record-breaking cold temperatures, left millions without power and killed at least 21 people across multiple states.")

In [7]:
type(doc)
# A Doc is a container for a sequence of tokens.
# Calling nlp() on the string has already tokenized and annotated it, so doc holds the finished result.

spacy.tokens.doc.Doc

In [7]:
print(doc)
# Printing a Doc shows the original text. The tokenization has already happened in the background.

Ferocious winter weather sweeping across large parts of the central and southern US has brought record-breaking cold temperatures, left millions without power and killed at least 21 people across multiple states.


Now we can iterate over the tokens in the Doc.

__Tokens__

Token objects expose many attributes and methods (e.g. token.text is an attribute that holds the token's original text). The spaCy documentation lists what is available on a Doc object at https://spacy.io/api/doc and on a Token object at https://spacy.io/api/token  

In [8]:
# Iterate over each token in the Doc
for token in doc:
    print(token.text) # .text is an attribute holding the token's original text
# The output shows how the string has been tokenized.
# Punctuation marks count as individual tokens.

Ferocious
winter
weather
sweeping
across
large
parts
of
the
central
and
southern
US
has
brought
record
-
breaking
cold
temperatures
,
left
millions
without
power
and
killed
at
least
21
people
across
multiple
states
.


In [9]:
for token in doc:
    print(token.text, token.lemma) # print the token text and token.lemma (no trailing underscore)
# Without the underscore, token.lemma returns an integer instead of the lemma string. spaCy stores every string as a hash value, because operating on integers is more efficient. The same string always maps to the same number.

Ferocious 7398913484123842627
winter 8844163100600735019
weather 1756699799731398535
sweeping 5287345485016755302
across 12865022372469924430
large 2751841902330220293
parts 4485934323942657167
of 886050111519832510
the 7425985699627899538
central 13919618042645247414
and 2283656566040971221
southern 12121605977752639731
US 15397641858402276818
has 14692702688101715474
brought 3597906902382212429
record 12677120423429974351
- 9153284864653046197
breaking 5527797886271786622
cold 3117178197819627377
temperatures 5627807717403523368
, 2593208677638477497
left 9707179535890930240
millions 17365054503653917826
without 4711265942760212190
power 10405720708504167118
and 2283656566040971221
killed 3883960749573218104
at 11667289587015813222
least 12059514183285037132
21 4686009691886217934
people 7593739049417968140
across 12865022372469924430
multiple 16628341085578573424
states 12763746643991857148
. 12646065887601541794
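
The idea of storing strings as integers can be sketched in a few lines of plain Python. This is only a toy illustration: the class `ToyStringStore` and its methods are hypothetical names, and the hash function used here (BLAKE2b from the standard library) is a stand-in, so the IDs will not match the numbers spaCy printed above. In spaCy itself, the corresponding two-way lookup is available through `nlp.vocab.strings`.

```python
import hashlib


class ToyStringStore:
    """Toy two-way mapping between strings and 64-bit integer IDs.

    Illustration only: spaCy uses a different hash function, so these
    IDs will not match the values printed by token.lemma.
    """

    def __init__(self):
        self._by_id = {}

    def add(self, text):
        # Derive a stable 64-bit integer from the UTF-8 bytes of the string.
        digest = hashlib.blake2b(text.encode("utf-8"), digest_size=8).digest()
        key = int.from_bytes(digest, "big")
        self._by_id[key] = text
        return key

    def lookup(self, key):
        # Map an integer ID back to the original string.
        return self._by_id[key]


store = ToyStringStore()
sweep_id = store.add("sweep")
print(sweep_id)                        # an integer ID
print(store.lookup(sweep_id))          # sweep
print(store.add("sweep") == sweep_id)  # True: same string, same ID
```

Comparing two integers is cheaper than comparing two strings character by character, which is the efficiency argument made in the comment above.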


In [11]:
for token in doc:
    print(token.text, token.lemma_) # the trailing underscore returns the lemma as a string
# Instead of a hash value, we now get the lemma itself, so we can see how each word is lemmatized (e.g. "brought" becomes "bring").

Ferocious ferocious
winter winter
weather weather
sweeping sweep
across across
large large
parts part
of of
the the
central central
and and
southern southern
US US
has have
brought bring
record record
- -
breaking break
cold cold
temperatures temperature
, ,
left leave
millions million
without without
power power
and and
killed kill
at at
least least
21 21
people people
across across
multiple multiple
states state
. .
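
To see at a glance which words the lemmatizer actually changed, the pairs can be filtered. The sketch below is a minimal example under stated assumptions: `changed_by_lemmatization` is a hypothetical helper, and `LemmaToken` is a stand-in that mimics only the `.text` and `.lemma_` attributes of a spaCy token, filled with a few pairs copied from the output above.

```python
from collections import namedtuple

# Stand-in for a spaCy token, exposing only the two attributes used here.
LemmaToken = namedtuple("LemmaToken", ["text", "lemma_"])


def changed_by_lemmatization(tokens):
    """Return (text, lemma) pairs whose lemma differs from the text, ignoring case."""
    return [
        (t.text, t.lemma_)
        for t in tokens
        if t.text.lower() != t.lemma_.lower()
    ]


# A few pairs copied from the output above.
lemma_sample = [
    LemmaToken("Ferocious", "ferocious"),
    LemmaToken("sweeping", "sweep"),
    LemmaToken("across", "across"),
    LemmaToken("brought", "bring"),
    LemmaToken("US", "US"),
    LemmaToken("states", "state"),
]

print(changed_by_lemmatization(lemma_sample))
# [('sweeping', 'sweep'), ('brought', 'bring'), ('states', 'state')]
```

Because the helper only reads `.text` and `.lemma_`, passing the `doc` from the cells above instead of the sample list should work the same way.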


In [12]:
for token in doc:
    print(token.text, token.is_punct) # .is_punct is another token attribute: True if the token is punctuation

Ferocious False
winter False
weather False
sweeping False
across False
large False
parts False
of False
the False
central False
and False
southern False
US False
has False
brought False
record False
- True
breaking False
cold False
temperatures False
, True
left False
millions False
without False
power False
and False
killed False
at False
least False
21 False
people False
across False
multiple False
states False
. True
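
A common use of `is_punct` is to drop punctuation before further processing. The following is a minimal sketch, not part of the original notebook: `words_only` is a hypothetical helper, and `PunctToken` is a stand-in that mimics only the `.text` and `.is_punct` attributes of a spaCy token, with values copied from the output above.

```python
from collections import namedtuple

# Stand-in for a spaCy token, exposing only the two attributes used here.
PunctToken = namedtuple("PunctToken", ["text", "is_punct"])


def words_only(tokens):
    """Return the text of every token that is not punctuation."""
    return [t.text for t in tokens if not t.is_punct]


# Values copied from the output above.
punct_sample = [
    PunctToken("record", False),
    PunctToken("-", True),
    PunctToken("breaking", False),
    PunctToken("cold", False),
    PunctToken("temperatures", False),
    PunctToken(",", True),
]

print(words_only(punct_sample))
# ['record', 'breaking', 'cold', 'temperatures']
```

Since the helper only relies on `.text` and `.is_punct`, calling `words_only(doc)` on the Doc created earlier should filter the full sentence in the same way.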


In [13]:
for token in doc:
    print(token.text, token.pos_, token.tag_) # .pos_ is the coarse-grained part-of-speech tag, .tag_ is the fine-grained tag

Ferocious ADJ JJ
winter NOUN NN
weather NOUN NN
sweeping VERB VBG
across ADP IN
large ADJ JJ
parts NOUN NNS
of ADP IN
the DET DT
central ADJ JJ
and CCONJ CC
southern ADJ JJ
US PROPN NNP
has AUX VBZ
brought VERB VBN
record NOUN NN
- PUNCT HYPH
breaking VERB VBG
cold ADJ JJ
temperatures NOUN NNS
, PUNCT ,
left VERB VBD
millions NOUN NNS
without ADP IN
power NOUN NN
and CCONJ CC
killed VERB VBD
at ADV RB
least ADV RBS
21 NUM CD
people NOUN NNS
across ADP IN
multiple ADJ JJ
states NOUN NNS
. PUNCT .


In the output above: <br>
second column = token.pos_ (coarse-grained UPOS tag) <br>
third column = token.tag_ (fine-grained tag)
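
The tags in the output above can also be tallied to get a quick profile of a text. This is a small sketch under stated assumptions: `count_pos` is a hypothetical helper, and `TagToken` is a stand-in that mimics only the `.text`, `.pos_` and `.tag_` attributes of a spaCy token, filled with the first five rows of the output.

```python
from collections import Counter, namedtuple

# Stand-in for a spaCy token, exposing only the attributes used here.
TagToken = namedtuple("TagToken", ["text", "pos_", "tag_"])


def count_pos(tokens):
    """Count how often each coarse-grained part-of-speech tag occurs."""
    return Counter(t.pos_ for t in tokens)


# The first five rows of the output above.
tag_sample = [
    TagToken("Ferocious", "ADJ", "JJ"),
    TagToken("winter", "NOUN", "NN"),
    TagToken("weather", "NOUN", "NN"),
    TagToken("sweeping", "VERB", "VBG"),
    TagToken("across", "ADP", "IN"),
]

pos_counts = count_pos(tag_sample)
print(pos_counts.most_common(1))  # [('NOUN', 2)]
```

Swapping `t.pos_` for `t.tag_` inside the helper would count the fine-grained tags instead, and passing `doc` rather than the sample list should tally the whole sentence.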