## NICAR 2020: NLP Workshop

spaCy is a Python library for natural language processing that can help you with linguistic analysis.

To install spaCy and its small English-language model, run these commands in your virtual environment:

`pip3 install spacy`

`python3 -m spacy download en_core_web_sm`

We will be importing the `text.txt` file from our `data` folder. It contains a sample article about a very special [cat](https://www.buzzfeednews.com/article/juliareinstein/this-thicc-lazy-high-maintenance-incredibly-well-hydrated/).

In [1]:
# Import our libraries
import spacy
import pandas as pd

In [2]:
# Load the small English pipeline: tokenizer, tagger, parser and NER
nlp = spacy.load('en_core_web_sm')

# Open the text file read-only and read its contents into a string
with open("../data/text.txt", "r") as f:
    text = f.read()
len(text)  # number of characters, including spaces and line breaks

2990

### Corpus & Tokenizing

Now let's run the string through spaCy's pipeline to turn it into a processed document (a `Doc` object)!

In [4]:
doc = nlp(text)

The document acts like a list of words. To access the text of each word, or 'token', we can use the built-in `.text` attribute:

In [None]:
for token in doc:
    print(token.text)
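
Because the document behaves like a sequence, it also supports `len()`, indexing, and slicing. The sketch below is a minimal illustration on a made-up sentence. It assumes `spacy.blank("en")`, a tokenizer-only English pipeline, so that it runs without the downloaded model; `nlp_blank` and `sample_doc` are names introduced only for this example.

```python
import spacy

# Tokenizer-only English pipeline (an assumption for this sketch;
# the notebook itself loads en_core_web_sm).
nlp_blank = spacy.blank("en")
sample_doc = nlp_blank("Bruno is a cat.")  # made-up sentence

print(len(sample_doc))       # number of tokens, punctuation included
print(sample_doc[0].text)    # text of the first token
print(sample_doc[1:3].text)  # a slice is a Span covering tokens 1 and 2
```

Note that punctuation counts as a token, which is why it shows up in the word counts later on.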

Now we can count some words by:
- turning the words into a list
- turning that list into a pandas data frame
- counting the values

In [None]:
rows = []
for token in doc:
    rows.append(token.text)

In [None]:
print(rows)

In [None]:
word_dataframe = pd.DataFrame(rows)
word_dataframe.columns = ['word']
word_dataframe.head()

Two ways to count:

In [None]:
word_count = word_dataframe['word'].value_counts().reset_index()
word_count.head()

In [None]:
word_count_alt = word_dataframe.groupby('word').agg({"word":"count"})
word_count_alt.head()
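
If you only need the counts and not a data frame, a third option is Python's built-in `collections.Counter`. This is a minimal sketch, not part of the original notebook; `sample_rows` is a made-up list standing in for the `rows` list built above.

```python
from collections import Counter

# Made-up token list standing in for `rows` from the earlier cell.
sample_rows = ["the", "cat", "sat", "on", "the", "mat", ",", "the", "end"]

counts = Counter(sample_rows)

# most_common(n) returns (token, count) pairs, highest count first.
for word, n in counts.most_common(3):
    print(word, n)
```

In the notebook you could pass `rows` directly, i.e. `Counter(rows)`.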

You can save the outputs by using `.to_csv`:

In [None]:
word_count.to_csv('../output/word_count.csv', index=False)
word_count_alt.to_csv('../output/word_count2.csv')

### More Ways to Filter & Clean

When spaCy converts your text into tokens, each token carries more than just the word itself. It also stores attributes that can help you filter and clean your data.

#### Stopwords

spaCy has a built-in list of 'stopwords': extremely common words that might not add much insight to our analysis. When spaCy tokenizes your document, it also checks each token against that list and records the result in `token.is_stop`.

In [None]:
rows = []
for token in doc:
    rows.append([token.text, token.is_stop])

In [None]:
stop_dataframe = pd.DataFrame(rows)
stop_dataframe.columns = ['word', 'is_stop']
stop_dataframe.head()

Compare the difference between the aggregated counts with and without the stopwords:

In [None]:
no_stop = stop_dataframe[~stop_dataframe['is_stop']]

no_stop['word'].value_counts().head(10)

In [None]:
stop_dataframe['word'].value_counts().head(10)
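
Removing stopwords still leaves punctuation and whitespace tokens in the counts. spaCy tokens also carry `is_punct` and `is_space` flags, so one possible cleanup is to filter on all three at once. The sketch below is an illustration on a made-up sentence, again assuming a tokenizer-only `spacy.blank("en")` pipeline so it runs without the downloaded model; `content_words` is a name introduced for this example.

```python
import spacy

# Tokenizer-only pipeline: is_stop, is_punct and is_space are lexical
# flags, so they are available without a trained model.
nlp_blank = spacy.blank("en")
sample_doc = nlp_blank("This is Bruno, and he is a cat.")  # made-up sentence

# Keep only tokens that are not stopwords, punctuation, or whitespace.
content_words = [
    token.text
    for token in sample_doc
    if not (token.is_stop or token.is_punct or token.is_space)
]
print(content_words)
```

The same three flags could be added as extra columns to `stop_dataframe` if you prefer to filter in pandas.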

#### Lemmatization

When spaCy processes your document, it also _lemmatizes_ each token: it reduces the word to its base form, or _lemma_, so that a word and its inflected forms are grouped together (e.g., _organize_, _organized_, and _organizing_). For instance:

- am, are, is $\Rightarrow$ be
- car, cars, car's, cars' $\Rightarrow$ car
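
One way to picture this is as a lookup from each inflected form to its base form. The sketch below is a toy illustration that covers only the examples listed above; `LEMMA_LOOKUP` and `toy_lemma` are hypothetical names, and spaCy's own lemmatizer is far more elaborate than a hand-written table.

```python
# Toy lookup table covering only the examples above (not spaCy's data).
LEMMA_LOOKUP = {
    "am": "be", "are": "be", "is": "be",
    "cars": "car", "car's": "car", "cars'": "car",
}

def toy_lemma(word):
    """Return the base form from the table, or the lowercased word if unlisted."""
    key = word.lower()
    return LEMMA_LOOKUP.get(key, key)

for w in ["is", "Are", "cars", "cat"]:
    print(w, "->", toy_lemma(w))
```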

You can access a token's lemma through the `token.lemma_` attribute.

In [5]:
rows = []
for token in doc:
    rows.append([token.text, token.lemma_])

In [9]:
lemma_dataframe = pd.DataFrame(rows)
lemma_dataframe.columns = ['word', 'lemma']
lemma_dataframe.head(10)

| | word | lemma |
|---|---|---|
| 0 | This | this |
| 1 | is | be |
| 2 | Bruno | Bruno |
| 3 | , | , |
| 4 | and | and |
| 5 | he | -PRON- |
| 6 | ’s | ’s |
| 7 | a | a |
| 8 | 25-pound | 25-pound |
| 9 | cat | cat |


Compare the aggregated counts of the lemmas with the counts of the original words:

In [13]:
lemma_dataframe['lemma'].value_counts()

-PRON-          84
,               34
.               33
be              28
"               27
the             20
\n\n            18
a               15
to              15
and             12
“                9
of               8
say              8
have             8
but              7
in               7
shelter          7
do               7
with             7
like             6
pet              6
on               6
not              6
foster           6
Bruno            6
at               6
water            5
also             5
if               5
’s               5
                ..
trick            1
(                1
play             1
big              1
Rescue           1
where            1
Video            1
teach            1
out              1
Wright           1
apparently       1
meet             1
scratcher        1
furry            1
house            1
11               1
end              1
take             1
stay             1
far              1
side             1
may         

In [14]:
lemma_dataframe['word'].value_counts()

,             34
.             33
I             20
the           18
\n\n          18
a             15
to            15
my            13
and           12
“             12
”             12
"             12
is             9
he             9
you            8
said           8
of             8
He             8
shelter        7
his            7
in             7
with           7
but            7
Bruno          6
foster         6
at             6
was            6
on             6
like           5
also           5
              ..
help           1
really         1
Adoption       1
25-pound       1
loved          1
making         1
very           1
normal         1
most           1
simple         1
No             1
great          1
sleep          1
(              1
may            1
big            1
gains          1
polydactyl     1
purring        1
staying        1
Rescue         1
wand           1
meshing        1
Facebook       1
extra          1
where          1
former         1
right         

### Bonus: Extracting Entities

A common NLP task that might be useful is the extraction of **entities**. These include people, businesses, places, organizations and dates. For a full list of the entity types spaCy can recognize out of the box, refer to the [documentation](https://spacy.io/api/annotation#named-entities).

In [None]:
print(rows)

In [None]:
for ent in doc.ents:
    print(ent.text, ent.label_)

You can view them highlighted in the text by using `displacy`.

In [None]:
from spacy import displacy
displacy.render(doc, style="ent", jupyter=True)