**spaCy** is a library for advanced Natural Language Processing, written in Python and Cython and first released in 2015. This notebook applies basic spaCy operations to German text using the **German language models**.

1. Matcher - search for tokens in a given text based on given criteria
2. Word vectors and sentence similarity - calculate the cosine similarity of two sentences

https://spacy.io/

https://spacy.pythonhumanities.com/02_02_matcher.html


In [1]:
# import libraries
import re
import pandas as pd
import bs4
import requests
import spacy
from spacy import displacy

from spacy.matcher import Matcher
from spacy.tokens import Span

import networkx as nx

import matplotlib.pyplot as plt
from tqdm import tqdm

pd.set_option('display.max_colwidth', 200)
%matplotlib inline

In [2]:
!python -m spacy download de_core_news_sm
!python -m spacy download de_core_news_md
!python -m spacy download de_core_news_lg
# english: en_core_web_sm, en_core_web_md, en_core_web_lg, en_core_web_trf
# german: de_core_news_sm, de_core_news_md, de_core_news_lg

Collecting de-core-news-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/de_core_news_sm-3.8.0/de_core_news_sm-3.8.0-py3-none-any.whl (14.6 MB)
Installing collected packages: de-core-news-sm
Successfully installed de-core-news-sm-3.8.0
✔ Download and installation successful
You can now load the package via spacy.load('de_core_news_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.
Collecting de-core-news-md==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/de_core_news_md-3.8.0/de_core_news_md-3.8.0-py3-none-any.whl (44.4 MB)

1. **Use Matcher with pattern builder**

Token attributes that can be used as keys in a pattern:

ORTH, TEXT, LOWER, LENGTH, IS_ALPHA, IS_ASCII, IS_DIGIT, IS_LOWER, IS_UPPER, IS_TITLE, IS_PUNCT, IS_SPACE, IS_STOP,
IS_SENT_START, LIKE_NUM, LIKE_URL, LIKE_EMAIL, SPACY, POS, TAG, MORPH, DEP, LEMMA, SHAPE, ENT_TYPE
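As a minimal sketch of how these attributes combine, the pattern below matches a number-like token followed by a unit word. It assumes a blank German pipeline (`spacy.blank("de")`), which only tokenizes and is sufficient for lexical attributes such as `LIKE_NUM` and `LOWER`. The sentence, the rule name `MEASUREMENT` and the variable names are made up for illustration.

```python
import spacy
from spacy.matcher import Matcher

# Blank pipeline: tokenizer only, no trained model needed for lexical attributes.
nlp_blank = spacy.blank("de")
matcher = Matcher(nlp_blank.vocab)

# Each dict describes one token; the list describes consecutive tokens.
pattern = [
    {"LIKE_NUM": True},                          # a number-like token
    {"LOWER": {"IN": ["meter", "kilometer"]}},   # followed by a unit word
]
matcher.add("MEASUREMENT", [pattern])  # hypothetical rule name

# Made-up example sentence.
doc = nlp_blank("Der Weg ist 12 Kilometer lang und 3 Meter breit.")
found = [doc[start:end].text for _, start, end in matcher(doc)]
print(found)
```

Attributes such as POS, TAG, DEP, LEMMA or ENT_TYPE would need a trained pipeline like `de_core_news_sm`, because a blank pipeline does not predict them.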

In [3]:
from spacy.lang.de.examples import sentences
sentences

['Die ganze Stadt ist ein Startup: Shenzhen ist das Silicon Valley für Hardware-Firmen',
 'Wie deutsche Startups die Technologie vorantreiben wollen: Künstliche Intelligenz',
 'Trend zum Urlaub in Deutschland beschert Gastwirten mehr Umsatz',
 'Bundesanwaltschaft erhebt Anklage gegen mutmaßlichen Schweizer Spion',
 'San Francisco erwägt Verbot von Lieferrobotern',
 'Autonome Fahrzeuge verlagern Haftpflicht auf Hersteller',
 'Wo bist du?',
 'Was ist die Hauptstadt von Deutschland?']

In [19]:
# match every token that is written entirely in lower case
# ("bodensee" below is only the label of the rule, not a search term)
nlp = spacy.load("de_core_news_sm")
matcher = Matcher(nlp.vocab)
pattern = [{"IS_LOWER": True}]
matcher.add("bodensee", [pattern])
doc = nlp("Der Bodensee ist ein Binnengewässer im südwestlichen Mitteleuropa. Er besteht aus zwei Teilen und einem sie verbindenden Flussabschnitt des Rheins, namentlich dem Obersee, dem Seerhein und dem Untersee (mit Rheinsee, Zeller See und Gnadensee inklusive des Markelfinger Winkels). Der Bodensee liegt im Bodenseebecken, einem Teil des nördlichen Alpenvorlands; der See wird vom Rhein durchflossen: Der Zufluss heißt Alpenrhein, der Abfluss Hochrhein.")
matches = matcher(doc)

In [20]:
print(matches)

[(60212700337922568, 2, 3), (60212700337922568, 3, 4), (60212700337922568, 5, 6), (60212700337922568, 6, 7), (60212700337922568, 10, 11), (60212700337922568, 11, 12), (60212700337922568, 12, 13), (60212700337922568, 14, 15), (60212700337922568, 15, 16), (60212700337922568, 16, 17), (60212700337922568, 17, 18), (60212700337922568, 19, 20), (60212700337922568, 22, 23), (60212700337922568, 23, 24), (60212700337922568, 26, 27), (60212700337922568, 28, 29), (60212700337922568, 29, 30), (60212700337922568, 32, 33), (60212700337922568, 37, 38), (60212700337922568, 39, 40), (60212700337922568, 40, 41), (60212700337922568, 47, 48), (60212700337922568, 48, 49), (60212700337922568, 51, 52), (60212700337922568, 53, 54), (60212700337922568, 54, 55), (60212700337922568, 57, 58), (60212700337922568, 59, 60), (60212700337922568, 60, 61), (60212700337922568, 62, 63), (60212700337922568, 66, 67), (60212700337922568, 69, 70)]


In [21]:
# the first element of a match is the hash of the rule label; look it up in the vocab
print(nlp.vocab[matches[0][0]].text)

bodensee
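Each match is a tuple `(match_id, start, end)`, where `start` and `end` are token offsets into the doc. A small sketch of unpacking these tuples into something readable follows. It assumes a blank German pipeline so that no model download is needed; the rule name `LOWERCASE` and the short sentence are chosen only for this illustration.

```python
import spacy
from spacy.matcher import Matcher

nlp_blank = spacy.blank("de")  # tokenizer only; IS_LOWER is a lexical attribute
matcher = Matcher(nlp_blank.vocab)
matcher.add("LOWERCASE", [[{"IS_LOWER": True}]])  # hypothetical rule name

doc = nlp_blank("Der Bodensee ist ein Binnengewässer")

rows = []
for match_id, start, end in matcher(doc):
    label = nlp_blank.vocab.strings[match_id]  # hash -> rule label
    span = doc[start:end]                      # token offsets -> Span
    rows.append((label, start, end, span.text))

for row in rows:
    print(row)
```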


In [23]:
# load the text from a txt file
with open("bodensee.txt", "r", encoding="utf-8") as f:
    text = f.read()

In [24]:
# find proper nouns (POS tag PROPN) with a matcher pattern
matcher = Matcher(nlp.vocab)
pattern = [{"POS": "PROPN"}]
matcher.add("PROPER_NOUNS", [pattern])
doc = nlp(text)
matches = matcher(doc)
print(len(matches))
for match in matches[:10]:
    print(match, doc[match[1]:match[2]])

60
(3232560085755078826, 1, 2) Bodensee
(3232560085755078826, 7, 8) Mitteleuropa
(3232560085755078826, 20, 21) Rheins
(3232560085755078826, 25, 26) Obersee
(3232560085755078826, 43, 44) Untersee
(3232560085755078826, 46, 47) Rheinsee
(3232560085755078826, 49, 50) See
(3232560085755078826, 51, 52) Gnadensee
(3232560085755078826, 60, 61) Bodensee
(3232560085755078826, 75, 76) Rhein


In [26]:
# match runs of one or more consecutive proper nouns ("OP": "+");
# every overlapping sub-span of a run is returned as its own match
matcher = Matcher(nlp.vocab)
pattern = [{"POS": "PROPN", "OP": "+"}]
matcher.add("PROPER_NOUNS", [pattern])
doc = nlp(text)
matches = matcher(doc)
print(len(matches))
for match in matches[:10]:
    print(match, doc[match[1]:match[2]])

69
(3232560085755078826, 1, 2) Bodensee
(3232560085755078826, 7, 8) Mitteleuropa
(3232560085755078826, 20, 21) Rheins
(3232560085755078826, 25, 26) Obersee
(3232560085755078826, 43, 44) Untersee
(3232560085755078826, 46, 47) Rheinsee
(3232560085755078826, 49, 50) See
(3232560085755078826, 51, 52) Gnadensee
(3232560085755078826, 60, 61) Bodensee
(3232560085755078826, 75, 76) Rhein
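The count rises from 60 to 69 because `"OP": "+"` reports every sub-span of a run of matching tokens, not just the longest one. The sketch below illustrates this on a three-token run. To stay runnable without a trained model it swaps `POS: PROPN` for the lexical attribute `IS_TITLE` on a blank German pipeline; that swap, the rule name `TITLE_RUN` and the short text are assumptions made for the example only.

```python
import spacy
from spacy.matcher import Matcher

nlp_blank = spacy.blank("de")  # tokenizer only
matcher = Matcher(nlp_blank.vocab)

# IS_TITLE stands in for {"POS": "PROPN"}, which would need a trained model.
matcher.add("TITLE_RUN", [[{"IS_TITLE": True, "OP": "+"}]])

doc = nlp_blank("der Lacus Raetiae Brigantinus")

# Collect (start, end, text) for every match, sorted by position.
overlapping = sorted(
    (start, end, doc[start:end].text) for _, start, end in matcher(doc)
)
for item in overlapping:
    print(item)
print(len(overlapping), "matches for a run of 3 tokens")
```

A run of three matching tokens yields three single-token spans, two two-token spans and one three-token span.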


In [29]:
# use the greedy keyword argument to keep only the longest of overlapping matches
matcher = Matcher(nlp.vocab)
pattern = [{"POS": "PROPN", "OP": "+"}]
matcher.add("PROPER_NOUNS", [pattern], greedy='LONGEST')
doc = nlp(text)
matches = matcher(doc)
print(len(matches))
for match in matches[:10]:
    print(match, doc[match[1]:match[2]])

52
(3232560085755078826, 321, 324) Lacus Raetiae Brigantinus
(3232560085755078826, 84, 86) Abfluss Hochrhein
(3232560085755078826, 244, 246) Pomponius Mela
(3232560085755078826, 254, 256) Lacus Venetus
(3232560085755078826, 371, 373) Ammianus Marcellinus
(3232560085755078826, 377, 379) Lacus Brigantiae
(3232560085755078826, 536, 538) lacum Podamicum
(3232560085755078826, 1, 2) Bodensee
(3232560085755078826, 7, 8) Mitteleuropa
(3232560085755078826, 20, 21) Rheins


2. **Word vectors**



In [34]:
import numpy as np

In [31]:
nlp = spacy.load("de_core_news_md")

In [51]:
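# 2.A - find similar words
your_word = "Zitrone"

# most_similar expects a 2D array of query vectors and
# returns a tuple (keys, best_rows, scores)
ms = nlp.vocab.vectors.most_similar(
    np.asarray([nlp.vocab.vectors[nlp.vocab.strings[your_word]]]), n=10)
words = [nlp.vocab.strings[w] for w in ms[0][0]]
distances = ms[2]  # note: these are cosine similarity scores (higher means closer)
print(words)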

['Zitroneneis', 'Minzblättchen', 'Gingerol', 'Orangenzeste', 'Bittermelone', 'fenchel', 'Kochbanane', 'Zimtöl', 'Apfelessigs', 'Sellerieknolle']
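Conceptually, such a lookup compares the query vector against every row of the vector table by cosine similarity and keeps the top n. The NumPy sketch below shows that idea on a tiny table. It is not spaCy's implementation: the three-dimensional vectors, the helper `nearest_words` and the choice to exclude the query word itself are all assumptions for illustration.

```python
import numpy as np

# Hypothetical 3-dimensional toy vectors; real models use hundreds of dimensions.
toy_table = {
    "Zitrone": np.array([0.9, 0.1, 0.0]),
    "Orange": np.array([0.8, 0.2, 0.1]),
    "Limette": np.array([0.85, 0.15, 0.05]),
    "Auto": np.array([0.0, 0.1, 0.9]),
}


def nearest_words(word, table, n=2):
    """Return the n words most similar to `word` by cosine similarity."""
    words = [w for w in table if w != word]  # skip the query word itself
    matrix = np.stack([table[w] for w in words])
    # Normalise rows and query to unit length so the dot product is the cosine.
    matrix = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    query = table[word] / np.linalg.norm(table[word])
    scores = matrix @ query
    order = np.argsort(scores)[::-1][:n]  # indices of the highest scores first
    return [(words[i], float(scores[i])) for i in order]


neighbours = nearest_words("Zitrone", toy_table)
print(neighbours)
```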


In [52]:
# 2.B - sentence cosine similarity
nlp = spacy.load("de_core_news_md")  # use md or lg: the sm package ships without real word vectors
doc1 = nlp("Ich liebe es zu wandern und bin der Meinung, wer Berge und Seen liebt, der muss unbedingt die Wanderwege entdecken.")
doc2 = nlp("Ich mag es zu wandern. Kristallklare Seen, idyllische Wanderwege und die schönste Seenwanderungen.")

# Similarity of two documents
print(doc1, "<->", doc2, doc1.similarity(doc2))

Ich liebe es zu wandern und bin der Meinung, wer Berge und Seen liebt, der muss unbedingt die Wanderwege entdecken. <-> Ich mag es zu wandern. Kristallklare Seen, idyllische Wanderwege und die schönste Seenwanderungen. 0.9214679002761841
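By default, spaCy represents a document by the average of its token vectors and `similarity` returns the cosine of the angle between two such averages. The NumPy sketch below reproduces that calculation on made-up two-dimensional token vectors. The arrays and the helper `cosine_similarity` are hypothetical and serve only to show the arithmetic.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Hypothetical token vectors: one row per token, two dimensions each.
doc_a_tokens = np.array([[1.0, 0.0], [0.0, 1.0]])
doc_b_tokens = np.array([[1.0, 0.0], [1.0, 0.0]])

# Document vector = mean of the token vectors.
doc_a_vector = doc_a_tokens.mean(axis=0)
doc_b_vector = doc_b_tokens.mean(axis=0)

similarity = cosine_similarity(doc_a_vector, doc_b_vector)
print(round(similarity, 4))
```

Because the token vectors are averaged, word order is ignored, and two texts that share much of their vocabulary can be expected to score high.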
