## Introduction to ```spaCy```

There are a number of different NLP frameworks that you're likely to encounter. The most popular and widely-used of these are:

- ```NLTK``` (Natural Language Toolkit, old-school)
- ```UDPipe``` (Neural network based, fast and light, but not super accurate)
- ```CoreNLP``` and ```stanza``` (Created by the team at Stanford; academically robust)
- ```spaCy``` (Production-ready, well-documented, state-of-the-art)

We'll be working with ```spaCy``` in this module, primarily because it's easy and intuitive, and also scales well.

The first thing we need to do is install ```spaCy``` and the language model that we want to use.

From the command line, run the setup script to install the requirements:

```shell 
bash setup.sh
```

## Initializing ```spaCy```

With everything installed, we import ```spaCy``` __and__ load the language model that we want to use.

Note that if you want to work with a different language, you need to load a different language model.

In [2]:
# load the medium English model and create the NLP pipeline
import spacy
nlp = spacy.load("en_core_web_md")

With the model now loaded, we can begin to do some very simple NLP tasks.

Above, we create a spaCy object and assign it to the variable ```nlp```. This is the NLP pipeline that will do all our heavy lifting, using the trained model we've specified.

Below, you can see what the pipeline does with a bit of sample text. Passing text to the ```nlp``` object returns a ```Doc``` object, which gives us access to a range of properties, including tokens (words), parts of speech, named entities, and so on. Here we look at two of them: tokens and entities. These objects, in turn, have certain attributes and methods attached to them. A full outline of what is available can be found in the spaCy docs.

In this case, for all token objects, let's return the token itself (```token.text```); its part-of-speech tag (```token.pos_```); and the grammatical dependency relations between the tokens (```token.dep_```).


In [30]:
# a single sentence example
input_string = "My name is Rikke and I have family in New York City."


In [31]:
# create a new Doc object
doc = nlp(input_string)
# a Doc is a complex object: it holds the original text plus the annotations added by the pipeline

In [6]:
#checking type of doc
type(doc)

spacy.tokens.doc.Doc

In [7]:
# printing the Doc gives back the original string, but each token in it carries additional grammatical information
print(doc)

My name is Rikke and I have family in New York City.


__Tokenize__

In [15]:
# tokenizing text
for token in doc:
    print(token.text)  # .text is the token's string; each token also carries attributes such as POS tag and entity type

# This prints the tokens in the doc, one per line.
# As we can see, spaCy treats punctuation as separate tokens
# and splits multi-word names such as "New York" into individual tokens.
# When using spaCy, we work within its framework and its tokenization decisions.

My
name
is
Rikke
and
I
have
family
in
New
York
City
.
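
The tokenizer is rule-based and runs before any trained component, so the splitting shown above can be reproduced without downloading a model. The following is a minimal sketch that assumes only that spaCy itself is installed; `spacy.blank("en")` builds a pipeline that contains just the English tokenizer. The names `blank_nlp` and `tokens` are chosen for this illustration.

```python
import spacy

# A blank pipeline has a tokenizer but no tagger, parser, or NER
# (assumption: spaCy is installed; no trained model is needed)
blank_nlp = spacy.blank("en")

doc = blank_nlp("My name is Rikke and I have family in New York City.")

# Collect the token strings; punctuation becomes its own token,
# and "New York City" is split into three tokens
tokens = [token.text for token in doc]
print(tokens)

# len() of a Doc is the number of tokens, punctuation included
print(len(doc))
```

Because the blank pipeline has no tagger or entity recognizer, attributes such as `token.pos_` and `doc.ents` stay empty here; those need a trained model like the one loaded earlier.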


__Trying some more attributes__

In [21]:
# find parts-of-speech and grammatical relations
for token in doc:
    print(token.i, token.text, token.pos_, token.dep_, token.morph)  # prints:
    # token index
    # token text
    # part of speech (POS); the trailing _ matters: it returns the readable label,
    #     while token.pos without it returns an integer ID
    # dependency relation to the token's syntactic head
    # morphology, e.g. tense and number (singular/plural)

0 My PRON poss Number=Sing|Person=1|Poss=Yes|PronType=Prs
1 name NOUN nsubj Number=Sing
2 is AUX ROOT Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin
3 Rikke PROPN attr Number=Sing
4 and CCONJ cc ConjType=Cmp
5 I PRON nsubj Case=Nom|Number=Sing|Person=1|PronType=Prs
6 have VERB conj Mood=Ind|Tense=Pres|VerbForm=Fin
7 family NOUN dobj Number=Sing
8 in ADP prep 
9 New PROPN compound Number=Sing
10 York PROPN pobj Number=Sing
11 . PUNCT punct PunctType=Peri
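
The last column in the output above is the token's morphology, written in the Universal Dependencies `Feature=Value|Feature=Value` notation. As a small illustration of how to read it, the sketch below splits such a string into a dictionary using plain Python. `parse_feats` is a hypothetical helper written for this example, not part of spaCy, and the input string is copied from the line for "My".

```python
def parse_feats(feats: str) -> dict:
    """Split a 'Feature=Value|Feature=Value' string into a dict (hypothetical helper)."""
    # Tokens without morphological features (e.g. "in" above) have an empty string
    if not feats:
        return {}
    # Each pair is separated by "|", and feature and value by "="
    return dict(pair.split("=", 1) for pair in feats.split("|"))


# Morphology string copied from the output for the token "My"
my_feats = parse_feats("Number=Sing|Person=1|Poss=Yes|PronType=Prs")
print(my_feats["Number"], my_feats["PronType"])

# An empty feature string gives an empty dict
print(parse_feats(""))
```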


__NER__

Extracting named entities from a ```spaCy``` doc works slightly differently, but it's nothing too challenging: instead of looping over the tokens, we loop over ```doc.ents```:

In [32]:
# extracting named entities (NER)
for ent in doc.ents:  # ent = entity
    print(ent.text, ent.label_)  # prints:
    # entity text
    # entity label (here PERSON and GPE, i.e. geopolitical entity)


Rikke PERSON
New York City GPE


__Questions:__ 

1. What range of linguistic features is available beyond what we're looking at here? 
2. Is the same range of features available for all languages? Compare e.g. English and Danish.

## Count distribution of linguistic features

__Create doc object__

In [39]:
# build the path to a text file
import os 
filename = os.path.join("..", "data", "example.txt")


In [40]:
# read the file contents into a string
with open(filename, "r", encoding="utf-8") as file:
    text = file.read()

In [44]:
# make a doc from the new text and print its named entities
doc = nlp(text)
for ent in doc.ents:
    print(ent.text, ent.label_)

first ORDINAL
Twitter ORG
Canaelan PERSON
Taylor Swift PERSON
Tottenham Hotspur PERSON
Sheffield GPE
Canaelan PERSON
Advanced Impact Media Solutions ORG
more than 30,000 CARDINAL
Team Jorge WORK_OF_ART
Israel GPE
Tal Hanan PERSON
Jorge PRODUCT
UK GPE
Information Commissioner’s Office ORG
18 October 2020 DATE
ICO ORG
multimillion-pound MONEY
PPE ORG
Canaelan PERSON
Twitter two days later DATE
ICO ORG
ICO ORG
ICO ORG
UK GPE
one CARDINAL
Team Jorge PERSON
ICO ORG
Hanan PERSON
ICO ORG
Team Jorge’s Aims ORG
Hanan PERSON
Twitter ORG
LinkedIn ORG
Facebook ORG
Telegram ORG
Gmail PERSON
Instagram PERSON
YouTube ORG
Amazon ORG
Airbnb ORG
Hanan PERSON
Team Jorge ORG
Guardian ORG
Le Monde ORG
Der Spiegel PERSON
2,000 CARDINAL
Facebook ORG


In [45]:
# create an empty list
entities = []

# collect the text of every entity in the doc (text file)
for ent in doc.ents:
    entities.append(ent.text)


In [48]:
# print the unique entities
print(set(entities))  # set() keeps only unique values, like unique() in R, so each entity appears once

{'Sheffield', '2,000', 'Twitter', 'multimillion-pound', 'YouTube', 'Twitter two days later', 'PPE', 'Advanced Impact Media Solutions', 'Gmail', 'Amazon', '18 October 2020', 'Team Jorge’s Aims', 'Israel', 'Canaelan', 'Tottenham Hotspur', 'one', 'LinkedIn', 'Telegram', 'Hanan', 'first', 'ICO', 'Tal Hanan', 'Taylor Swift', 'UK', 'more than 30,000', 'Facebook', 'Information Commissioner’s Office', 'Der Spiegel', 'Instagram', 'Guardian', 'Jorge', 'Le Monde', 'Airbnb', 'Team Jorge'}
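
`set()` shows which entities occur but discards how often each one appears. If the counts matter, `collections.Counter` from the standard library keeps them. The sketch below runs on a short hand-copied sample of `(text, label)` pairs from the entity output above rather than on the real `doc`, so it works without the data file; `sample_ents`, `text_counts` and `label_counts` are illustrative names.

```python
from collections import Counter

# Hand-copied sample of (entity text, label) pairs from the output above
sample_ents = [
    ("Twitter", "ORG"),
    ("Canaelan", "PERSON"),
    ("ICO", "ORG"),
    ("ICO", "ORG"),
    ("Hanan", "PERSON"),
    ("Twitter", "ORG"),
    ("ICO", "ORG"),
]

# Count how often each entity text occurs
text_counts = Counter(text for text, _ in sample_ents)

# Count how many entity mentions carry each label
label_counts = Counter(label for _, label in sample_ents)

# The two most frequent entity texts with their counts
print(text_counts.most_common(2))
print(label_counts)
```

With a real doc, the same pattern would be `Counter(ent.text for ent in doc.ents)` or `Counter(ent.label_ for ent in doc.ents)`.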


In [61]:
# start the counter at zero
adjective_count = 0

#count number of adjectives
for token in doc:
    if token.pos_ == "ADJ":
        adjective_count += 1

In [63]:
print(adjective_count)  # raw count; a relative frequency is usually more informative

46


__Relative frequency__

In [71]:
# find the relative frequency per 10,000 tokens (len(doc) counts tokens, including punctuation)
# i.e. if the text were 10,000 tokens long, we would expect about 728 adjectives
relative_freq = (adjective_count / len(doc)) * 10000
round(relative_freq, 1)

727.8
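
The calculation above can be wrapped in a small function so that the same normalisation is reusable for other features. This is only a sketch: `relative_frequency` and its `per` parameter are names made up for this example. The token total of 632 is inferred from the two outputs above (46 adjectives giving 727.8 per 10,000), not read from the data.

```python
def relative_frequency(count: int, total: int, per: int = 10_000) -> float:
    """Return how often a feature occurs per `per` tokens (hypothetical helper)."""
    # Guard against an empty document, which would cause a division by zero
    if total <= 0:
        raise ValueError("total must be a positive number of tokens")
    return count / total * per


# 46 adjectives in a doc of 632 tokens (632 is inferred, see the note above)
adjectives_per_10k = round(relative_frequency(46, 632), 1)
print(adjectives_per_10k)
```

Normalising like this makes counts comparable across texts of different lengths, which raw counts are not.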

## Creating neater outputs using ```pandas```

At the moment, all of our output from ```spaCy``` is either printed to the screen or stored in Python lists. If we want to save these results, it makes sense to store them in a more transferable format, such as CSV or JSON files.

One very easy way to do this with Python is by using the dataframe library ```pandas```.

In [72]:
import pandas as pd

In [73]:

# a single sentence example
input_string = "My name is Rikke and I have family in New York City."

# create spaCy doc
doc = nlp(input_string)

In [78]:
annotations = []

for token in doc:
    # append() takes a single argument, so we wrap the two values in a tuple
    annotations.append((token.text, token.pos_))

In [84]:
# spaCy doc to pandas dataframe
data = pd.DataFrame(annotations, columns=["Token", "POS"])  # turn the list of tuples into a dataframe
data

Unnamed: 0,Token,POS
0,My,PRON
1,name,NOUN
2,is,AUX
3,Rikke,PROPN
4,and,CCONJ
5,I,PRON
6,have,VERB
7,family,NOUN
8,in,ADP
9,New,PROPN


In [86]:
# save dataframe
outpath = os.path.join("..", "data", "annotations.csv")

data.to_csv(outpath)
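
By default, `to_csv` also writes the dataframe's row index as an unnamed first column, which shows up as `Unnamed: 0` when the file is read back in. If that column isn't wanted, passing `index=False` leaves it out. The sketch below shows the difference on two hand-typed rows; calling `to_csv` without a path returns the CSV text instead of writing a file, which keeps the example free of side effects. The names `rows`, `with_index` and `without_index` are illustrative.

```python
import pandas as pd

# Two hand-typed (token, POS) rows standing in for the full annotations list
rows = [("My", "PRON"), ("name", "NOUN")]
data = pd.DataFrame(rows, columns=["Token", "POS"])

# With no path argument, to_csv returns the CSV as a string
with_index = data.to_csv()
without_index = data.to_csv(index=False)

# Compare the header rows: the first has an extra, empty index column
print(with_index.splitlines()[0])
print(without_index.splitlines()[0])
```

Whether to keep the index is a design choice: it is harmless for a quick look at the data, but it adds an extra column that every later `read_csv` call has to deal with.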

## Assignment 1

Spend some time exploring and familiarizing yourself with ```spaCy``` and ```pandas```. We'll come back to them quite a lot through this semester, so it will help to have a solid handle on how they function.

When you are ready, head over to [Assignment 1](https://classroom.github.com/a/PdNi7nPv), which builds on some of the skills you've learned last week and today. The task will be to count how many times certain linguistic features occur across different documents, and to save those results in a clear and easy-to-understand way.