# 01 spaCy Exploration

In [1]:
import pandas as pd
pd.set_option("display.colheader_justify","left") # sets the default alignment of column headers to 'left'
import spacy
from pathlib import Path
from IPython.display import Image
import imgkit

In [2]:
from IPython.display import display, HTML  # importing display from IPython.core.display is deprecated
display(HTML("<style>.container { width:100% !important; }</style>"))

### Helper Functions

In [3]:
def show(table):
    with pd.option_context('display.max_colwidth', None):
        display(table)

### Starting with spaCy

In [4]:
# Prints version of spaCy in use
print(spacy.__version__)

2.3.2


In [5]:
gerNLP = spacy.load('de_core_news_sm')

In [6]:
# Calling 'spacy.info()' on the German model returns the model's meta data.
info = spacy.info('de_core_news_sm')

lang             de                            
name             core_news_sm                  
license          MIT                           
author           Explosion                     
url              https://explosion.ai          
email            contact@explosion.ai          
sources          [{'name': 'TIGER Corpus', 'url': 'https://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/tiger.html', 'license': 'commercial (licensed by Explosion)'}, {'name': 'WikiNER', 'url': 'https://figshare.com/articles/Learning_multilingual_named_entity_recognition_from_Wikipedia/5462500', 'license': 'CC BY 4.0'}]
description      German multi-task CNN trained on the TIGER and WikiNER corpora. Assigns context-specific token vectors, POS tags, dependency parses and named entities.
notes            Because the model is trained on Wikipedia, it may perform inconsistently on many genres, such as social media text. The NER accuracy refers to the "silver standard" annotations in the WikiN

# Linguistic Features in Speech Parsing

To explore the configuration and features of spaCy, a speech given by Ursula Heinen-Esser (CDU) on March 23rd, 2011 during the parliament's question time is loaded as a document. At the time, Ursula Heinen-Esser was parliamentary undersecretary to the Federal Minister for the Environment, Nature Conservation and Nuclear Safety.

To get started, the raw text is converted to a Doc object using the German language model loaded previously. Tokenization happens as part of creating the Doc object.

In [7]:
# Read the raw speech; the context manager closes the file handle afterwards
with open('data/sample_speech.txt', 'r', encoding='utf8') as doc_txt:
    doc = gerNLP(doc_txt.read())

# Print the first 300 tokens of the parsed document
print(doc[:300])

Frau Präsidentin! Sehr geehrte Damen und Herren! Liebe Kolleginnen und Kollegen! Ich glaube, es isl wirklich unzweifelhaft: Die nuklearen Folgen der Erdbebenkatastrophe in Japan bedeuten einen Einschnitt, zuallererst selbstverständlich für Japan, aber auch für die ganze Welt. Die Katastrophe hat ganz deutlich gezeigt, dass Ereignisse auch jenseits der bislang berücksichtigten Szenarien eintreten können. Vielleicht noch ein paar Punkte zum Sachverhalt, weil es im Weiteren darum gehen wird - so verstehe ich das Thema der Aktuellen Stunde -, welche Sicherheitsüberprüfungen es in unseren deutschen Kernkraftwerken geben wird. Bei allen betroffenen Reaktoren gab es ein Zusammentreffen eines extremen Erdbebens mit einem Tsunami. Das Zusammenwirken hat zum Ausfall der externen Stromversorgung geführt. In der Folge wurden die notwendigen Sicherheitseinrichtungen zerstört. Die Kernkühlung bei den Blöcken 1 bis 3 am Standort Fukushima fiel aus. Die Blöcke 4 bis 6 waren zu diesem Zeitpunkt abgesch

In [8]:
sentences = [sentence.orth_ for sentence in doc.sents]
words = [token.orth_ for token in doc if token.pos_ != 'PUNCT']
print('The sample speech contains '+str(len(sentences))+' sentences and '+ str(len(words))+' words in total.')

The sample speech contains 60 sentences and 1019 words in total.


## Part-of-Speech (POS) tagging
The next chunk of code returns a table of tokens 289 to 316, a sample sentence of the speech, with labels. spaCy tokenizes everything (words, numbers, punctuation, etc.) except single spaces next to words. The POS labels are coarse-grained, language-independent part-of-speech categories; the TAG labels are fine-grained part-of-speech tags specific to German, taken from the STTS tagset. https://explosion.ai/blog/german-model#Data-sources

In [9]:
pos_tags = [] # empty list to be filled later
# for every token the loop appends the tags and labels to the list
for token in doc:
    pos_tags.append((token.text, token.lemma_, token.pos_, token.tag_, token.dep_, token.is_stop))

# converts the list to a dataframe which can be visualized nicely in markdown
pos_tags = pd.DataFrame(pos_tags, columns=('TEXT','LEMMA','POS','TAG','DEP','STOP'))
show(pos_tags[289:317])

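|     | TEXT | LEMMA | POS | TAG | DEP | STOP |
|-----|------|-------|-----|-----|-----|------|
| 289 | Die | der | DET | ART | nk | True |
| 290 | Ereignisse | Ereignis | NOUN | NN | sb | False |
| 291 | in | in | ADP | APPR | mnr | True |
| 292 | Japan | Japan | PROPN | NE | nk | False |
| 293 | haben | haben | AUX | VAFIN | ROOT | True |
| 294 | uns | sich | PRON | PPER | oa | True |
| 295 | gezeigt | zeigen | VERB | VVPP | oc | False |
| 296 | , | , | PUNCT | `$,` | punct | False |
| 297 | dass | dass | SCONJ | KOUS | cp | True |
| 298 | das | der | DET | ART | nk | True |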
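
To make the relation between the two label sets more concrete, the following sketch groups the fine-grained TAG values by their coarse POS value. The rows are copied by hand from the table above instead of being read from `doc`, so the snippet runs without the language model; `rows` and `tags_by_pos` are names made up for this illustration.

```python
from collections import defaultdict

# (TEXT, POS, TAG) copied by hand from the rows of the table above
rows = [
    ("Die", "DET", "ART"),
    ("Ereignisse", "NOUN", "NN"),
    ("in", "ADP", "APPR"),
    ("Japan", "PROPN", "NE"),
    ("haben", "AUX", "VAFIN"),
    ("uns", "PRON", "PPER"),
    ("gezeigt", "VERB", "VVPP"),
    (",", "PUNCT", "$,"),
    ("dass", "SCONJ", "KOUS"),
    ("das", "DET", "ART"),
]

# Collect the set of fine-grained tags seen under each coarse POS label
tags_by_pos = defaultdict(set)
for text, pos, tag in rows:
    tags_by_pos[pos].add(tag)

for pos in sorted(tags_by_pos):
    print(f"{pos:6} {sorted(tags_by_pos[pos])}")
```

On the full document the same grouping could be obtained from the dataframe with `pos_tags.groupby('POS')['TAG'].unique()`.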


**Decoding the tag labels** <br>
Calling *spacy.explain()* on a TAG or POS label returns a short, human-readable description of that label.

In [10]:
tags_explained = []
for token in doc:
    tags_explained.append((token.text, token.pos_, spacy.explain(token.pos_), token.tag_, spacy.explain(token.tag_)))

tags_explained = pd.DataFrame(tags_explained, columns=('TEXT','POS','POS_Explained','TAG','TAG_Explained'))
show(tags_explained[289:317])

|     | TEXT | POS | POS_Explained | TAG | TAG_Explained |
|-----|------|-----|---------------|-----|---------------|
| 289 | Die | DET | determiner | ART | definite or indefinite article |
| 290 | Ereignisse | NOUN | noun | NN | noun, singular or mass |
| 291 | in | ADP | adposition | APPR | preposition; circumposition left |
| 292 | Japan | PROPN | proper noun | NE | proper noun |
| 293 | haben | AUX | auxiliary | VAFIN | finite verb, auxiliary |
| 294 | uns | PRON | pronoun | PPER | non-reflexive personal pronoun |
| 295 | gezeigt | VERB | verb | VVPP | perfect participle, full |
| 296 | , | PUNCT | punctuation | `$,` | comma |
| 297 | dass | SCONJ | subordinating conjunction | KOUS | subordinate conjunction with sentence |
| 298 | das | DET | determiner | ART | definite or indefinite article |


In the sample speech of 60 sentences and 1019 words in total there are 40 unique tags, each shown with an example in the 'TEXT' column in the next table.
\label{se-unique-tags}

In [11]:
# Keep the first token seen for each distinct tag explanation, sorted by tag
unique_tags = (
    tags_explained.drop_duplicates('TAG_Explained')
    .sort_values(by=['TAG'])[['TAG','POS','TAG_Explained','TEXT']]
    .reset_index()  # keeps the original token position as an extra 'index' column
)
print(f"Table shaped {unique_tags.shape} with unique values")
show(unique_tags[['TAG','POS','TAG_Explained','TEXT']])

Table shaped (41, 5) with unique values


|     | TAG | POS | TAG_Explained | TEXT |
|-----|-----|-----|---------------|------|
| 0 | `$(` | PUNCT | other sentence-internal punctuation mark | - |
| 1 | `$,` | PUNCT | comma | , |
| 2 | `$.` | PUNCT | sentence-final punctuation mark | ! |
| 3 | ADJA | ADJ | adjective, attributive | geehrte |
| 4 | ADJD | ADJ | adjective, adverbial or predicative | wirklich |
| 5 | ADV | ADV | adverb | Sehr |
| 6 | APPR | ADP | preposition; circumposition left | in |
| 7 | APPRART | ADP | preposition with article | zum |
| 8 | ART | DET | definite or indefinite article | Die |
| 9 | CARD | NUM | cardinal number | 1 |


## Dependencies
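German word order is less restrictive than English word order, so the grammatical role of a word cannot be read off its position alone; the dependency parse makes these relations explicit.
https://explosion.ai/blog/german-model#word-order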

In [12]:
example = gerNLP(u'Ich glaube, wir sollten bei dieser Branche einen Schwerpunkt setzen, da sie uns weg von \
Atomenergie und fossilen Energieträgern hin zu dezentralen Lösungen führt, und nicht schon jetzt Kürzungen \
vornehmen, obwohl die Branche noch nicht einmal richtig etabliert ist. - Schönen Dank, Herr Schirmbeck.')

[displaCy](https://spacy.io/api/top-level#displacy_options) visualizes annotated text as HTML or SVG.

In [13]:
from spacy import displacy
# displacy documentation https://spacy.io/api/top-level#displacy_options
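def displacy_visual(spaCy_doc, style='dep', options=None):
    """Renders a spaCy Doc (or a list of sentence Spans) with displaCy; defaults to a compact dependency parse"""
    if options is None:
        options = {'compact': True}  # set here to avoid a shared mutable default argument
    visual = displacy.render(spaCy_doc, style=style, options=options)
    return visual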

In [14]:
sentence_spans = list(example.sents)
displacy_visual(sentence_spans)

In [15]:
displacy_visual(example)

In [16]:
short_doc = gerNLP(u'Die Ereignisse in Japan haben uns gezeigt, dass das sogenannte Restrisiko durchaus existent ist \
und dass es sich hierbei nicht nur um eine rechnerische Größe handelt.')

In [17]:
displacy_visual(short_doc)

### Investigating the parse tree
The arcs shown in the dependency parsing visualization with displaCy define the syntactic relation between two words. The arcs are directional, which indicates which word is the head and which is the child inside the dependency tree. The following chunk lists, for every token of the example, its dependency label, its head and its children.

In [18]:
parse_tree = []
for token in example:
    parse_tree.append((token.text, token.pos_, token.dep_, spacy.explain(token.dep_), token.head.text, token.head.pos_, [child for child in token.children]))

parse_tree = pd.DataFrame(parse_tree, columns=('TOKEN_TEXT','TEXT_POS','DEP','DEP_Explained','HEAD_TEXT','HEAD_POS','CHILDREN'))
show(parse_tree)

|     | TOKEN_TEXT | TEXT_POS | DEP | DEP_Explained | HEAD_TEXT | HEAD_POS | CHILDREN |
|-----|------------|----------|-----|---------------|-----------|----------|----------|
| 0 | Ich | PRON | sb | subject | glaube | VERB | [] |
| 1 | glaube | VERB | ROOT |  | glaube | VERB | [Ich, ,, sollten, .] |
| 2 | , | PUNCT | punct | punctuation | glaube | VERB | [] |
| 3 | wir | PRON | sb | subject | sollten | VERB | [] |
| 4 | sollten | VERB | oc | clausal object | glaube | VERB | [wir, setzen] |
| 5 | bei | ADP | mo | modifier | setzen | VERB | [Branche] |
| 6 | dieser | DET | nk | noun kernel element | Branche | NOUN | [] |
| 7 | Branche | NOUN | nk | noun kernel element | bei | ADP | [dieser] |
| 8 | einen | DET | nk | noun kernel element | Schwerpunkt | NOUN | [] |
| 9 | Schwerpunkt | NOUN | oa | accusative object | setzen | VERB | [einen] |


In [19]:
parse_tree = []
for token in doc:
    parse_tree.append((token.text, token.pos_, token.dep_, spacy.explain(token.dep_), token.head.text, token.head.pos_, [child for child in token.children]))

parse_tree = pd.DataFrame(parse_tree, columns=('TOKEN_TEXT','TEXT_POS','DEP','DEP_Explained','HEAD_TEXT','HEAD_POS','CHILDREN'))
show(parse_tree[289:317])

|     | TOKEN_TEXT | TEXT_POS | DEP | DEP_Explained | HEAD_TEXT | HEAD_POS | CHILDREN |
|-----|------------|----------|-----|---------------|-----------|----------|----------|
| 289 | Die | DET | nk | noun kernel element | Ereignisse | NOUN | [] |
| 290 | Ereignisse | NOUN | sb | subject | haben | AUX | [Die, in] |
| 291 | in | ADP | mnr | postnominal modifier | Ereignisse | NOUN | [Japan] |
| 292 | Japan | PROPN | nk | noun kernel element | in | ADP | [] |
| 293 | haben | AUX | ROOT |  | haben | AUX | [Ereignisse, gezeigt, .] |
| 294 | uns | PRON | oa | accusative object | gezeigt | VERB | [] |
| 295 | gezeigt | VERB | oc | clausal object | haben | AUX | [uns, ,, ist] |
| 296 | , | PUNCT | punct | punctuation | gezeigt | VERB | [] |
| 297 | dass | SCONJ | cp | complementizer | ist | AUX | [] |
| 298 | das | DET | nk | noun kernel element | Restrisiko | NOUN | [] |


The parse tree above shows that dependency labels are accessed via tokens, even though they describe the relationship between two tokens: each token carries the label of the arc that connects it to its head. The sentence root is the exception. It has no head of its own, so spaCy lists the root as its own head with the label ROOT, for which *spacy.explain()* returns no description.
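
How the root shows up in such a table can be sketched without the model. The rows below are copied by hand from the first eight rows of the table above (tokens 289 to 296, renumbered from 0). Storing the head as a list position is a simplification of spaCy's `token.head`, and all names are made up for this illustration.

```python
# (text, dep, position of the head) copied by hand from the table above
tokens = [
    ("Die", "nk", 1),
    ("Ereignisse", "sb", 4),
    ("in", "mnr", 1),
    ("Japan", "nk", 2),
    ("haben", "ROOT", 4),   # the root is listed as its own head
    ("uns", "oa", 6),
    ("gezeigt", "oc", 4),
    (",", "punct", 6),
]

# A root is a token whose head is the token itself
roots = [text for i, (text, dep, head) in enumerate(tokens) if head == i]
print("root:", roots)

# Every other token contributes one labelled arc from its head to itself
for i, (text, dep, head) in enumerate(tokens):
    if head != i:
        print(f"{tokens[head][0]} --{dep}--> {text}")
```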

In [20]:
unique_dep = parse_tree.drop_duplicates('DEP').sort_values(by=['DEP'])[['DEP','DEP_Explained']]
show(unique_dep)

|     | DEP | DEP_Explained |
|-----|-----|---------------|
| 879 |  |  |
| 0 | ROOT |  |
| 201 | ac | adpositional case marker |
| 26 | ag | genitive attribute |
| 752 | app | apposition |
| 687 | cc | coordinating conjunction |
| 6 | cd | coordinating conjunction |
| 7 | cj | conjunct |
| 686 | cm | comparative conjunction |
| 52 | cp | complementizer |


The table shows that first-degree relations can prove meaningful. For example, 'Restrisiko' (residual risk) is the direct head of the adjective 'sogenannte' (so-called), which passes a judgement on the noun 'Restrisiko'. 
Nevertheless, working only with first-degree relations would disregard plenty of information hidden in the sentence. For example, the adverb 'durchaus' (indeed), which modifies the adjective 'existent' (existent), is related to 'Restrisiko' only through the auxiliary verb 'ist' (is). This valuable information is lost unless the dependency tree is crawled in order to investigate local trees.
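
The connection between 'durchaus' and 'Restrisiko' can be traced with a small sketch that walks up from both words until the chains meet. The head of each word is written down by hand. The arcs for 'sogenannte', 'Restrisiko', 'ist' and 'gezeigt' follow the tables in this notebook, while the arcs for 'durchaus' and 'existent' are an assumption taken from the description above. The helper names are hypothetical.

```python
# child -> head, restricted to the words discussed above
head = {
    "sogenannte": "Restrisiko",
    "Restrisiko": "ist",
    "durchaus": "existent",   # assumed from the description above
    "existent": "ist",        # assumed from the description above
    "ist": "gezeigt",
    "gezeigt": "haben",
    "haben": "haben",         # the root is its own head
}

def chain_to_root(word):
    """Return the word followed by all of its ancestors up to the root."""
    chain = [word]
    while head[chain[-1]] != chain[-1]:
        chain.append(head[chain[-1]])
    return chain

def path_between(a, b):
    """Walk up from both words and join the two chains at the first shared ancestor."""
    up_a, up_b = chain_to_root(a), chain_to_root(b)
    shared = next(w for w in up_a if w in up_b)
    return up_a[:up_a.index(shared) + 1] + up_b[:up_b.index(shared)][::-1]

print(path_between("durchaus", "Restrisiko"))
```

With a parsed spaCy document, the same walk would use `token.head` or `token.ancestors` instead of the hand-written dictionary.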

### Crawling through the local tree
So far we have only investigated single arcs. Looking at local trees puts the focus on the sentence level. The ultimate interest is to iterate over the tokens to find chains, in other words to follow an arc to another arc to another arc. Later, in the analysis, conditions for continuing the iteration can be set, for example that the POS/TAG label has to be a verb and then an adverb, since we are looking for such descriptive words.

#### Local Surrounding per Token
It is of interest to follow along the word hierarchy. Therefore, let's first look at the local surroundings of each token in the next table. Syntactic children are words connected to the token by an arc that points away from it; left and right refer to whether they appear before or after the token. The number of children gives an idea of the surroundings of each token. The token head, which is also returned, indicates the location of the root: it is the token whose head is the token itself.

In [21]:
local_trees = []
for token in doc:
    local_trees.append((token.text, token.head, token.n_lefts+token.n_rights, token.n_lefts, token.n_rights))

local_trees = pd.DataFrame(local_trees, columns=('TOKEN_TEXT','TOKEN_HEAD','TOTAL_CHILD','LEFT_CHILD','RIGHT_CHILD'))
show(local_trees[289:317])

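|     | TOKEN_TEXT | TOKEN_HEAD | TOTAL_CHILD | LEFT_CHILD | RIGHT_CHILD |
|-----|------------|------------|-------------|------------|-------------|
| 289 | Die | Ereignisse | 0 | 0 | 0 |
| 290 | Ereignisse | haben | 2 | 1 | 1 |
| 291 | in | Ereignisse | 1 | 0 | 1 |
| 292 | Japan | in | 0 | 0 | 0 |
| 293 | haben | haben | 3 | 1 | 2 |
| 294 | uns | gezeigt | 0 | 0 | 0 |
| 295 | gezeigt | haben | 3 | 1 | 2 |
| 296 | , | gezeigt | 0 | 0 | 0 |
| 297 | dass | ist | 0 | 0 | 0 |
| 298 | das | Restrisiko | 0 | 0 | 0 |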
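
The left and right counts can be reproduced by hand from the head of each token. The sketch below uses only the first eight tokens of the sentence, with heads copied from the table above and renumbered from 0, so 'haben' and 'gezeigt' each have one right child fewer than in the table because those children lie outside the fragment. `child_counts` is a hypothetical helper, not a spaCy function.

```python
words = ["Die", "Ereignisse", "in", "Japan", "haben", "uns", "gezeigt", ","]
heads = [1, 4, 1, 2, 4, 6, 4, 6]   # position of each word's head; 4 is the root

def child_counts(i):
    """Count the children of token i that stand before and after it."""
    children = [c for c, h in enumerate(heads) if h == i and c != i]
    n_lefts = sum(1 for c in children if c < i)
    n_rights = sum(1 for c in children if c > i)
    return n_lefts, n_rights

for i, word in enumerate(words):
    lefts, rights = child_counts(i)
    print(f"{word:12} total={lefts + rights} left={lefts} right={rights}")
```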


#### Descendants per Token
This looks at the subtree of the token and returns every word the token has an arc to, directly or through other words. Thus, words beyond the first degree are included as well.

In [22]:
descendants = []
for token in example:
    descendants.append((token.text, [descendant.text for descendant in token.subtree if token != descendant]))
    
descendants = pd.DataFrame(descendants, columns=('TOKEN_TEXT','DESCENDANTS'))
show(descendants)

|     | TOKEN_TEXT | DESCENDANTS |
|-----|------------|-------------|
| 0 | Ich | [] |
| 1 | glaube | [Ich, ,, wir, sollten, bei, dieser, Branche, einen, Schwerpunkt, setzen, ,, da, sie, uns, weg, von, Atomenergie, und, fossilen, Energieträgern, hin, zu, dezentralen, Lösungen, führt, ,, und, nicht, schon, jetzt, Kürzungen, vornehmen, ,, obwohl, die, Branche, noch, nicht, einmal, richtig, etabliert, ist, .] |
| 2 | , | [] |
| 3 | wir | [] |
| 4 | sollten | [wir, bei, dieser, Branche, einen, Schwerpunkt, setzen, ,, da, sie, uns, weg, von, Atomenergie, und, fossilen, Energieträgern, hin, zu, dezentralen, Lösungen, führt, ,, und, nicht, schon, jetzt, Kürzungen, vornehmen, ,, obwohl, die, Branche, noch, nicht, einmal, richtig, etabliert, ist] |
| 5 | bei | [dieser, Branche] |
| 6 | dieser | [] |
| 7 | Branche | [dieser] |
| 8 | einen | [] |
| 9 | Schwerpunkt | [einen] |


In [23]:
descendants = []
for token in short_doc:
    descendants.append((token.text, [descendant.text for descendant in token.subtree if token != descendant]))
    
descendants = pd.DataFrame(descendants, columns=('TOKEN_TEXT','DESCENDANTS'))
show(descendants)

|     | TOKEN_TEXT | DESCENDANTS |
|-----|------------|-------------|
| 0 | Die | [] |
| 1 | Ereignisse | [Die, in, Japan] |
| 2 | in | [Japan] |
| 3 | Japan | [] |
| 4 | haben | [Die, Ereignisse, in, Japan, uns, gezeigt, ,, dass, das, sogenannte, Restrisiko, durchaus, existent, ist, und, dass, es, sich, hierbei, nicht, nur, um, eine, rechnerische, Größe, handelt, .] |
| 5 | uns | [] |
| 6 | gezeigt | [uns, ,, dass, das, sogenannte, Restrisiko, durchaus, existent, ist, und, dass, es, sich, hierbei, nicht, nur, um, eine, rechnerische, Größe, handelt] |
| 7 | , | [] |
| 8 | dass | [] |
| 9 | das | [] |
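
Conceptually, a subtree is collected by following the arcs downwards from a token until no further children are found. The sketch below does this recursively for the first eight tokens of the short sentence, with heads written down by hand from the tables in this notebook. Because the fragment stops after the comma, 'haben' and 'gezeigt' have fewer descendants here than in the table above. `descendants` is a hypothetical helper that stands in for spaCy's `token.subtree`.

```python
words = ["Die", "Ereignisse", "in", "Japan", "haben", "uns", "gezeigt", ","]
heads = [1, 4, 1, 2, 4, 6, 4, 6]   # position of each word's head; 4 is the root

def descendants(i):
    """Return the positions of all tokens below token i, in sentence order."""
    found = []
    for child, head in enumerate(heads):
        if head == i and child != i:            # direct child (the root's self-arc is skipped)
            found.append(child)
            found.extend(descendants(child))    # then everything below that child
    return sorted(found)

for i, word in enumerate(words):
    print(f"{word:12} {[words[j] for j in descendants(i)]}")
```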


#### Ancestors per token
This looks at the ancestors of the token and returns every word that has an arc to the token, directly or through other words.

In [24]:
ancestors = []  # renamed so the list is not shadowed by the loop variable below
for token in short_doc:
    ancestors.append((token.text, [ancestor.text for ancestor in token.ancestors if token != ancestor]))

ancestors = pd.DataFrame(ancestors, columns=('TOKEN_TEXT','ANCESTORS'))
show(ancestors)

|     | TOKEN_TEXT | ANCESTORS |
|-----|------------|-----------|
| 0 | Die | [Ereignisse, haben] |
| 1 | Ereignisse | [haben] |
| 2 | in | [Ereignisse, haben] |
| 3 | Japan | [in, Ereignisse, haben] |
| 4 | haben | [] |
| 5 | uns | [gezeigt, haben] |
| 6 | gezeigt | [haben] |
| 7 | , | [gezeigt, haben] |
| 8 | dass | [ist, gezeigt, haben] |
| 9 | das | [Restrisiko, ist, gezeigt, haben] |


In [25]:
# A token that is its own head is a sentence root; [0] takes the first one in case the parser produced several roots
root = [token for token in short_doc if token.head == token][0]
subject = list(root.lefts)[0]
for descendant in subject.subtree:
    print(descendant.text, [ancestor.text for ancestor in descendant.ancestors])

Die ['Ereignisse', 'haben']
Ereignisse ['haben']
in ['Ereignisse', 'haben']
Japan ['in', 'Ereignisse', 'haben']


In [26]:
# Direct children of the root, split by whether they stand before or after it
root_lefts = list(root.lefts)
root_rights = list(root.rights)
print(root_lefts,'\n',root_rights)

[Ereignisse] 
 [gezeigt, .]


## Named-Entity Recognition (NER)

In [27]:
for ent in doc.ents:
    print(u'{:6} {:50}'.format(ent.label_, ent.text))

MISC   Herren!                                           
MISC   Erdbebenkatastrophe in                            
LOC    Japan                                             
LOC    Japan                                             
MISC   deutschen                                         
LOC    Fukushima                                         
LOC    Reaktorkerne                                      
PER    Wasserstoff                                       
PER    Explosionen                                       
LOC    Deutschland                                       
LOC    Japan                                             
LOC    Japan                                             
LOC    Deutschland                                       
LOC    Japan                                             
MISC   Atomgesetz                                        
MISC   Zeit des Moratoriums                              
LOC    Ländern                                           
LOC    Hochwas

In [28]:
displacy_visual(doc, style='ent')

# Sentiment Analysis with TextBlob
spaCy does not ship with sentiment lexicons, so I chose TextBlob to look up sentiment values for German. TextBlob itself is another NLP library; the TextBlobDE extension provides the German sentiment lexicon and is very easy to use. At this time, subjectivity scores are not integrated into the lexicon, only polarity scores ranging from -1 to 1.
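
The idea of a polarity lexicon can be pictured with a plain dictionary. The entries below are copied from the output printed further down in this notebook; the dictionary, the `polarity()` helper and the rule that unknown words count as neutral are hypothetical stand-ins for illustration, not TextBlobDE's internals.

```python
# Toy lexicon: lemma -> polarity in [-1, 1], values as printed further down
toy_lexicon = {
    "geehrt": 1.0,
    "unzweifelhaft": 0.7,
    "zusätzlich": -0.7,
    "verheerend": -1.0,
}

def polarity(lemma):
    """Look up a lemma; words missing from the toy lexicon count as neutral (0.0)."""
    return toy_lexicon.get(lemma, 0.0)

# Keep only the lemmas that carry a positive or negative polarity
lemmas = ["geehrt", "Dame", "verheerend", "Japan", "unzweifelhaft"]
scored = [(lemma, polarity(lemma)) for lemma in lemmas if polarity(lemma) != 0]
print(scored)
```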

In [32]:
# Set to True to download the NLTK corpora TextBlob depends on (only needed once)
DOWNLOAD_CORPORA = True
if DOWNLOAD_CORPORA:
    !python -m textblob.download_corpora

from textblob_de import TextBlobDE as TextBlob

[nltk_data] Downloading package brown to
[nltk_data]     /Users/seppmacmini/nltk_data...
[nltk_data]   Unzipping corpora/brown.zip.
[nltk_data] Downloading package punkt to
[nltk_data]     /Users/seppmacmini/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/seppmacmini/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/seppmacmini/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package conll2000 to
[nltk_data]     /Users/seppmacmini/nltk_data...
[nltk_data]   Unzipping corpora/conll2000.zip.
[nltk_data] Downloading package movie_reviews to
[nltk_data]     /Users/seppmacmini/nltk_data...
[nltk_data]   Unzipping corpora/movie_reviews.zip.
Finished.


The next chunk goes through the entire document used above and retrieves the lemmatized words that are assigned either a positive or a negative polarity. It's this easy.

In [33]:
# https://spacy.io/usage/linguistic-features#accessing
for token in doc:
    sentiment = TextBlob(token.lemma_).sentiment  # look each lemma up only once
    if sentiment[0] != 0:  # index 0 is the polarity score
        print(u'{:20} {:}'.format(token.lemma_, sentiment))

geehrt               Sentiment(polarity=1.0, subjectivity=0.0)
lieben               Sentiment(polarity=1.0, subjectivity=0.0)
wirklich             Sentiment(polarity=1.0, subjectivity=0.0)
unzweifelhaft        Sentiment(polarity=0.7, subjectivity=0.0)
groß                 Sentiment(polarity=0.7, subjectivity=0.0)
dramatisch           Sentiment(polarity=0.7, subjectivity=0.0)
unabhängig           Sentiment(polarity=0.7, subjectivity=0.0)
richtig              Sentiment(polarity=1.0, subjectivity=0.0)
richtig              Sentiment(polarity=1.0, subjectivity=0.0)
robust               Sentiment(polarity=0.7, subjectivity=0.0)
zusätzlich           Sentiment(polarity=-0.7, subjectivity=0.0)
groß                 Sentiment(polarity=0.7, subjectivity=0.0)
verheerend           Sentiment(polarity=-1.0, subjectivity=0.0)
natürlich            Sentiment(polarity=0.7, subjectivity=0.0)
natürlich            Sentiment(polarity=0.7, subjectivity=0.0)
vernünftig           Sentiment(polarity=1.0, subjecti