# Lunch Time Python

## 25.11.2022: spaCy
<img style="width: 600px;" src="https://upload.wikimedia.org/wikipedia/commons/8/88/SpaCy_logo.svg">

[spaCy](https://spacy.io/) is an open-source natural language processing library written in Python and Cython.

spaCy focuses on production usage and is very fast and efficient. It also supports deep learning workflows through interfacing with [TensorFlow](https://www.tensorflow.org/) or [PyTorch](https://pytorch.org/), as well as the transformer model library [Hugging Face](https://github.com/huggingface).

*Press `Spacebar` to go to the next slide (or `?` to see all navigation shortcuts)*

[Lunch Time Python](https://ssciwr.github.io/lunch-time-python/), [Scientific Software Center](https://ssc.iwr.uni-heidelberg.de), [Heidelberg University](https://www.uni-heidelberg.de/)

# 0 What to do with spaCy

spaCy is very powerful for text annotation:
- sentence segmentation and tokenization
- POS (part-of-speech) tagging and lemmatization
- NER (named entity recognition)
- dependency parsing
- text classification
- morphological analysis

spaCy can also learn new tasks through integration with your machine learning stack. It also provides multi-task learning with pretrained transformers like [BERT](https://arxiv.org/abs/1810.04805) (BERT is used in the Google search engine).


In [None]:
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_md")
doc = nlp(
    "The Scientific Software Center offers lunch-time Python - an informal way to learn about new Python libraries."
)
displacy.render(doc, style="dep")

In [None]:
displacy.render(doc, style="ent")

# 1 Install spaCy
You can install spaCy using `pip`:

`pip install spacy`

It is also available via `conda-forge`:

`conda install -c conda-forge spacy`

After installing spaCy, you also need to download a language model. For a medium-sized English model, you would do this using:

`python -m spacy download en_core_web_md`

The available models are listed on the spaCy website: https://spacy.io/usage/models

## Install spaCy with CUDA support

`pip install -U spacy[cuda]`

You can also explore the [online tool](https://spacy.io/usage) for installation instructions.

# 2 Let's try it out!

In [None]:
nlp = spacy.load("en_core_web_md")
nlp("This is lunch-time Python.")

In [None]:
doc = nlp("This is lunch-time Python.")
print(type(doc))
[i for i in doc]

In [None]:
t = doc[0]
type(t)

In [None]:
t.pos_

In [None]:
displacy.render(doc)

In [None]:
spacy.explain("nsubj")

In [None]:
for t in doc:
    print(t.text, t.pos_, t.dep_, t.lemma_)

# 3 Pipelines


![pipeline](https://spacy.io/pipeline-fde48da9b43661abcdf62ab70a546d71.svg)

[source: spaCy 101]

The capabilities of the processing pipeline depend on its components, their models and how they were trained.

In [None]:
nlp.pipe_names

In [None]:
nlp.tokenizer
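To see how the components determine what a pipeline can do, here is a minimal sketch that builds a blank English pipeline (tokenizer only) and adds the rule-based `sentencizer` component. It assumes spaCy v3 and needs no downloaded model; the names `nlp_blank` and `doc_blank` and the example text are made up for illustration.

```python
import spacy

# A blank pipeline contains only the tokenizer, no trained components
nlp_blank = spacy.blank("en")
print(nlp_blank.pipe_names)

# Add the rule-based sentence splitter (spaCy v3: components are added by name)
nlp_blank.add_pipe("sentencizer")
print(nlp_blank.pipe_names)

doc_blank = nlp_blank("This is lunch. This is Python.")
sentences = [sent.text for sent in doc_blank.sents]
print(sentences)

# No tagger was added, so the part-of-speech tags stay empty
print([token.pos_ for token in doc_blank])
```

Sentence boundaries are available because the `sentencizer` was added, while `token.pos_` stays empty because this pipeline has no tagger.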

In [None]:
text = "Python is a very popular - maybe even the most popular - programming language among scientific software developers. One of the reasons for this success story is the rich standard library and the rich ecosystem of available (scientific) libraries. To fully leverage this ecosystem, developers need to stay up to date and explore new libraries. Lunch Time Python aims at providing a communication platform between Pythonistas to learn about new libraries in an informal setting. Sessions take roughly 30 minutes, one library is presented per session and the code will be made available afterwards. Come by, enjoy your lunch with us and step up your Python game!"

In [None]:
doc = nlp(text)

In [None]:
for i, sent in enumerate(doc.sents):
    print(i, sent)

In [None]:
for i, sent in enumerate(doc.sents):
    for j, token in enumerate(sent):
        print(i, j, token.text, token.pos_)

# Rule-based matching

In [None]:
# Import the Matcher
from spacy.matcher import Matcher

# Initialize the matcher with the shared vocab
matcher = Matcher(nlp.vocab)

# Add the pattern to the matcher
python_pattern = [{"TEXT": "Python", "POS": "PROPN"}]
matcher.add("PYTHON_PATTERN", [python_pattern])

doc = nlp(text)

# Call the matcher on the doc
matches = matcher(doc)

In [None]:
# Iterate over the matches
for match_id, start, end in matches:
    # Get the matched span
    matched_span = doc[start:end]
    print(matched_span.text)
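Patterns can also span several tokens: each dictionary in the list describes one token. The following sketch matches "lunch time python" in any capitalization using the `LOWER` attribute. It uses a blank pipeline, since `LOWER` only needs the tokenizer; the pattern name and the example text are invented for illustration.

```python
import spacy
from spacy.matcher import Matcher

# A blank English pipeline is enough here: "LOWER" only needs the tokenizer
nlp_blank = spacy.blank("en")
matcher_lower = Matcher(nlp_blank.vocab)

# One dictionary per token: three consecutive tokens, compared in lowercase
lunch_pattern = [{"LOWER": "lunch"}, {"LOWER": "time"}, {"LOWER": "python"}]
matcher_lower.add("LUNCH_TIME_PYTHON", [lunch_pattern])

doc_blank = nlp_blank("Lunch Time Python is fun. I like lunch time python a lot.")

# Each match is a (match_id, start, end) tuple of token indices
matched_texts = [doc_blank[start:end].text for _, start, end in matcher_lower(doc_blank)]
print(matched_texts)
```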

# Word vectors and semantic similarity
spaCy can compare two objects and predict similarity:

In [None]:
text1 = "I like Python."
text2 = "I like snakes."


doc1 = nlp(text1)
doc2 = nlp(text2)

In [None]:
print(doc1.similarity(doc2))

In [None]:
token1 = doc1[2]
token2 = doc2[2]
print(token1.text, token2.text)

In [None]:
print(token1.similarity(token2))

The similarity score is generated from word vectors.
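By default, spaCy computes similarity as the cosine similarity of two vectors, and the vector of a whole text defaults to the average of its token vectors. The sketch below reproduces both steps in plain Python. The three-dimensional vectors are invented toy values, not real spaCy vectors, and the helper names `cosine_similarity` and `mean_vector` are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # a zero vector has no direction
    return dot / (norm_u * norm_v)


def mean_vector(vectors):
    """Element-wise average of a list of equally long vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


# Invented 3-dimensional toy vectors, NOT real spaCy data
toy_vectors = {
    "I": [1.0, 0.0, 0.0],
    "like": [0.0, 1.0, 0.0],
    "Python": [0.0, 0.0, 1.0],
    "snakes": [0.0, 0.2, 0.9],
}

text_a = mean_vector([toy_vectors[w] for w in ["I", "like", "Python"]])
text_b = mean_vector([toy_vectors[w] for w in ["I", "like", "snakes"]])

word_similarity = cosine_similarity(toy_vectors["Python"], toy_vectors["snakes"])
text_similarity = cosine_similarity(text_a, text_b)
print(word_similarity, text_similarity)
```

In this toy setup the two texts score higher than the two words alone, because the shared words "I" and "like" pull the averaged vectors together.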

In [None]:
print(token1.vector)

Similarity can be used to recommend similar texts to users, or to flag duplicate content.

But: Similarity always depends on the context.

In [None]:
text3 = "I hate snakes."
doc3 = nlp(text3)
print(doc1.similarity(doc3))

These texts come out as similar because both statements express a sentiment, even though the sentiments are opposite.

# Internal workings
spaCy stores all strings as hash values and creates a lookup table. This way, a word that occurs several times only needs to be stored once.

In [None]:
nlp.vocab.strings.add("python")
python_hash = nlp.vocab.strings["python"]
python_string = nlp.vocab.strings[python_hash]
print(python_hash, python_string)

- Lexemes are entries in the vocabulary and contain context-independent information (the text, hash, lexical attributes).
![data structure](https://course.spacy.io/vocab_stringstore.png)
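The relation between the vocabulary, the string store and a lexeme can be sketched as follows. The example uses a blank English pipeline so that no model download is needed; the variable names `nlp_blank`, `lexeme` and `hash_value` are chosen for illustration.

```python
import spacy

nlp_blank = spacy.blank("en")

# Looking up a string in the vocab returns its lexeme (created if needed)
lexeme = nlp_blank.vocab["python"]

# Context-independent attributes: the text, its hash, and lexical flags
print(lexeme.text, lexeme.orth, lexeme.is_alpha)

# The string store maps in both directions: string -> hash and hash -> string
hash_value = nlp_blank.vocab.strings["python"]
print(hash_value == lexeme.orth)
print(nlp_blank.vocab.strings[hash_value])
```

The lexeme itself stores only the hash (`lexeme.orth`); the readable text is looked up in the string store when needed.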

# Train your own model

# spaCy demos
You can explore spaCy using [online tools](https://explosion.ai/software), for example:
- the [rule-based matcher explorer](https://demos.explosion.ai/matcher)
- the [spaCy online course](https://course.spacy.io/en/)


# Example use cases
- [Detection of programming language in stackoverflow posts](https://github.com/koaning/spacy-youtube-material)
