<img align="left" src="https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/tapi-logo-small.png" />

This notebook is free for educational reuse under the [Creative Commons CC BY License](https://creativecommons.org/licenses/by/4.0/).

Created by [Firstname Lastname](https://) for the 2024 Text Analysis Pedagogy Institute, with support from [Constellate](https://constellate.org).

For questions/comments/improvements, email author@email.address.<br />
____

# `spaCy in the World of LLMs` `1`

This is lesson `3` of 3 in the educational series on `spaCy and Large Language Models (LLMs)`. In this notebook, we will continue our work with `spacy-llm`. Unlike in the last notebook, we will not rely on zero-shot classification. Instead, we will learn how to provide examples to LLMs. This is known as single-shot and few-shot classification. We will also learn about the benefits of this process.

**Skills:** 
* Python

**Audience:** `Teachers` / `Learners` / `Researchers`

**Use case:** `Tutorial` / `Reference` / `Explanation` 

`Include the use case definition from [here](https://constellate.org/docs/documentation-categories)`

**Difficulty:** `Beginner`

`Beginner assumes users are relatively new to Python and Jupyter Notebooks. The user is helped step-by-step with lots of explanatory text.`
`Intermediate assumes users are familiar with Python and have been programming for 6+ months. Code makes up a larger part of the notebook and basic concepts related to Python are not explained.`
`Advanced assumes users are very familiar with Python and have been programming for years, but they may not be familiar with the process being explained.`

**Completion time:** `90 minutes`

**Knowledge Required:** 
```
* Python basics (variables, flow control, functions, lists, dictionaries)
```

**Knowledge Recommended:**
```
* A general understanding of natural language processing (NLP)
```

**Learning Objectives:**
After this lesson, learners will be able to:
```
1. Understand the main concepts of natural language processing (NLP)
2. Understand the role of spaCy within NLP
3. Understand the key advantages and disadvantages of large language models.
```
___

# Required Python Libraries
`List out any libraries used and what they are used for`
* [spaCy](https://spacy.io/) for performing [natural language processing](https://docs.constellate.org/key-terms/#nlp).
* `spacy-llm` for working with large language models
* `gliner-spacy` for performing zero-shot classification locally

## Install Required Libraries

In [160]:
### Install Libraries ###

# Using !pip installs
!pip install spacy spacy-llm gliner-spacy

Collecting gliner-spacy
  Downloading gliner_spacy-0.0.10-py3-none-any.whl.metadata (3.6 kB)
Collecting gliner>=0.2.0 (from gliner-spacy)
  Downloading gliner-0.2.8-py3-none-any.whl.metadata (6.3 kB)
Collecting onnxruntime (from gliner>=0.2.0->gliner-spacy)
  Downloading onnxruntime-1.18.1-cp310-cp310-macosx_11_0_universal2.whl.metadata (4.3 kB)
Collecting sentencepiece (from gliner>=0.2.0->gliner-spacy)
  Using cached sentencepiece-0.2.0-cp310-cp310-macosx_11_0_arm64.whl.metadata (7.7 kB)
Collecting coloredlogs (from onnxruntime->gliner>=0.2.0->gliner-spacy)
  Downloading coloredlogs-15.0.1-py2.py3-none-any.whl.metadata (12 kB)
Collecting flatbuffers (from onnxruntime->gliner>=0.2.0->gliner-spacy)
  Downloading flatbuffers-24.3.25-py2.py3-none-any.whl.metadata (850 bytes)
Collecting protobuf (from onnxruntime->gliner>=0.2.0->gliner-spacy)
  Using cached protobuf-5.27.2-cp38-abi3-macosx_10_9_universal2.whl.metadata (592 bytes)
Collecting humanfriendly>=9.1 (from coloredlogs->onnxruntim

In [None]:
!python -m spacy download en_core_web_sm

In [1]:
import os
# Uncomment the next line if you are using a Mac. It works around a bug between spacy-llm and PyTorch on macOS for now.
# os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"
os.environ["OPENAI_API_KEY"] = ""
import spacy
from spacy_llm.util import assemble
from spacy import displacy

# Introduction

In the last notebook, we saw the benefits of using a Large Language Model in a spaCy pipeline. Not only were we able to detect things that were out of scope for spaCy's models, but we could also pass in our own labels entirely! In this notebook, we will learn a few new tricks.

# GLiNER

We will begin by learning about a separate library called `gliner-spacy` (full disclosure: I wrote this package). GLiNER spaCy is built on top of [GLiNER](https://github.com/urchade/GLiNER) and spaCy. GLiNER allows you to perform zero-shot token classification. This means you can use it to perform NER, part-of-speech tagging, and even relationship extraction (more on this below)! Instead of training a model, you pass a text and a collection of labels to the model and it handles everything for you.

GLiNER by itself does not fit into a spaCy pipeline. `gliner-spacy` resolves this issue by handling chunking, device mapping (important if you are using a GPU), and token alignment for you. This means that you can install `gliner-spacy` and use GLiNER like any other spaCy component. It even has its own entry point, so you don't have to import it in your code to use it.

You can install it with pip:

```bash
pip install gliner-spacy
```
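
To make the entry point idea concrete, here is a minimal sketch that lists the pipeline factories that installed packages advertise to spaCy. It assumes the entry point group is named `spacy_factories`, as described in spaCy's documentation; `list_spacy_factories` is a hypothetical helper name and is not part of spaCy or `gliner-spacy`.

```python
from importlib.metadata import entry_points


def list_spacy_factories(group="spacy_factories"):
    """Return the sorted names of pipeline factories registered via entry points."""
    eps = entry_points()
    if hasattr(eps, "select"):
        # Newer Python versions expose a selection API.
        selected = eps.select(group=group)
    else:
        # Older Python versions return a plain dict keyed by group name.
        selected = eps.get(group, [])
    return sorted(ep.name for ep in selected)


factories = list_spacy_factories()
print(factories)
```

If `gliner-spacy` is installed in the same environment, `gliner_spacy` should appear in this list, which is why the component can be added by name without an explicit import.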

Once installed, you can import spaCy as normal and simply add `gliner_spacy` to your pipeline. It is a good idea to disable the original `ner` component first. We can do this by passing the `disable` argument, which takes a list of pipes to disable.

In [2]:
nlp = spacy.load("en_core_web_sm", disable=["ner"])
nlp.add_pipe("gliner_spacy")

Fetching 4 files:   0%|          | 0/4 [00:00<?, ?it/s]

.gitattributes:   0%|          | 0.00/1.57k [00:00<?, ?B/s]

README.md:   0%|          | 0.00/4.78k [00:00<?, ?B/s]



<gliner_spacy.pipeline.GlinerSpacy at 0x176909e70>

Now that we have the pipeline created, let's run it over a text!

In [7]:
composer_text = """Program Hodle Chrlstus Notus Est Jan Peterszoon Sweellnck Sung in Latin 1562 -1621 Today Christ is born Today the Saviour hos appeared On earth the angels sing the archangels rejoice and the righteous ore glad saying `` Glory to God in the highest Noell Alleluia/ '' Rorate Coen Giovanni Perlulgl do Palestrina Sung in Latin 1525 -1594 Pour out dew from above you heavens and let the clouds rain down the Just One Let the earth open and bring forth a saviour Show us your mercy 0 Lord and grant us your salvation Come 0 Lord and do not delay Alleluia ! Allelulal Ascendlt Deus WIiiiam Byrd Sung in Latin 1543 -1623 Alleluia ! God hos ascended in jubilation and Christ the Lord with the sound of the trumpets Alleluia/ II Nunc Dfmlttls Christian Figueroa tenor The Te Deum of Sandor Slk Ill Sergei Rachmaninoff 1873-1943 Zoltan Kodaly 1882 -1967 Motet `` Lobet den Herrn alle Heiden '' J. S. Bach Sung in German 1685 -1750 Praise the Lord all ye notions praise him all ye people For his mercy and truth watch over us for evermore Alleluia/ -intermission -
"""

colors = {
    "COMPOSER": "#a8d5e2",      # Light blue
    "COMPOSITION": "#c2e8c4",   # Light green
    "DATE_RANGE": "#f9c2c2",    # Light red
    "LANGUAGE": "#f7e1a1",       # Light yellow
    "RIVER": "#f7e1a1",
    "LOCATION": "#a8d5e2",      # Light blue
    "GROUP": "#f9c2c2",    # Light red
}


options={"colors":colors}
options["spans_key"] = "sc"

doc = nlp(composer_text)
displacy.render(doc, style="ent", options=options)

This is our output. I wouldn't call this the best output. There are some clear errors, such as `Christ` being labeled as a person. While technically this is correct, I'd rather see this identified as part of the `lyrics` of the song. One thing we can do is create a custom config and pass a custom set of labels to the model, just like we did with `spacy-llm`!

In [5]:
custom_spacy_config = { "gliner_model": "urchade/gliner_multi",
                            "chunk_size": 250,
                            "labels": ["COMPOSER"],
                            "style": "ent"}

nlp = spacy.load("en_core_web_sm", disable=["ner"])
nlp.add_pipe("gliner_spacy", config=custom_spacy_config)

options={"colors":colors}
options["spans_key"] = "sc"

doc = nlp(composer_text)
displacy.render(doc, style="ent", options=options)

Fetching 4 files:   0%|          | 0/4 [00:00<?, ?it/s]

gliner_config.json:   0%|          | 0.00/734 [00:00<?, ?B/s]

README.md:   0%|          | 0.00/5.52k [00:00<?, ?B/s]

.gitattributes:   0%|          | 0.00/153 [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/1.17G [00:00<?, ?B/s]

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


While these results are not perfect, they are an improvement. What's the advantage of this approach? It's another zero-shot classification method, but it is entirely local, meaning you do not have to pay to use GPT-3.5 or GPT-4. In many cases, GLiNER is not only the cheaper option, but also the better one. This is especially true if you fine-tune your own model.

But GLiNER is not English-specific! We can use this exact same model to identify entities in other languages. This particular model never saw NER examples for Latin, but it does a fairly good job of finding things like PERSON, GROUP, and LOCATION!

In [8]:
latin_text = """Gallia est omnis divisa in partes tres, quarum unam incolunt Belgae, aliam Aquitani, tertiam qui ipsorum lingua Celtae, nostra Galli appellantur. Hi omnes lingua, institutis, legibus inter se differunt. Gallos ab Aquitanis Garumna flumen, a Belgis Matrona et Sequana dividit.
"""
latin_config = { "gliner_model": "urchade/gliner_multi",
                            "chunk_size": 250,
                            "labels": ["PERSON", "LOCATION", "GROUP"],
                            "style": "ent"}

latin_nlp = spacy.load("en_core_web_sm", disable=["ner"])
latin_nlp.add_pipe("gliner_spacy", config=latin_config)

latin_doc = latin_nlp(latin_text)
displacy.render(latin_doc, style="ent", options=options)

Fetching 4 files:   0%|          | 0/4 [00:00<?, ?it/s]

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


This is actually a perfect output. Those who can read Latin may notice an obvious addition to this set of labels: RIVER! Garumna, [Matrona (Marne)](https://en.wikipedia.org/wiki/Rivers_of_classical_antiquity), and the [Sequana (Seine)](https://topostext.org/place/494005WSeq) are all rivers. We can find rivers by simply adding `"RIVER"` to the labels.

In [9]:
latin_text = """Gallia est omnis divisa in partes tres, quarum unam incolunt Belgae, aliam Aquitani, tertiam qui ipsorum lingua Celtae, nostra Galli appellantur. Hi omnes lingua, institutis, legibus inter se differunt. Gallos ab Aquitanis Garumna flumen, a Belgis Matrona et Sequana dividit.
"""
latin_config = { "gliner_model": "urchade/gliner_multi",
                            "chunk_size": 250,
                            "labels": ["PERSON", "LOCATION", "GROUP", "RIVER"],
                            "style": "ent"}

latin_nlp = spacy.load("en_core_web_sm", disable=["ner"])
latin_nlp.add_pipe("gliner_spacy", config=latin_config)

latin_doc = latin_nlp(latin_text)
displacy.render(latin_doc, style="ent", options=options)

Fetching 4 files:   0%|          | 0/4 [00:00<?, ?it/s]

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


While we successfully got Garumna, we missed the other two. Fortunately, we can improve our results by simply targeting a more specific set of labels. Here, we will change our labels to simply `["RIVER"]`. Notice how much our results improve.

In [12]:
latin_text = """Gallia est omnis divisa in partes tres, quarum unam incolunt Belgae, aliam Aquitani, tertiam qui ipsorum lingua Celtae, nostra Galli appellantur. Hi omnes lingua, institutis, legibus inter se differunt. Gallos ab Aquitanis Garumna flumen, a Belgis Matrona et Sequana dividit.
"""
latin_config = { "gliner_model": "urchade/gliner_multi",
                            "chunk_size": 250,
                            "labels": ["RIVER"],
                            "style": "ent"}

latin_nlp = spacy.load("en_core_web_sm", disable=["ner"])
latin_nlp.add_pipe("gliner_spacy", config=latin_config)

latin_doc = latin_nlp(latin_text)
displacy.render(latin_doc, style="ent", options=options)

Fetching 4 files:   0%|          | 0/4 [00:00<?, ?it/s]

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
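
The Latin cells above differ only in their `labels` list (and, later on, in the model name). As a minimal sketch, a small helper can build the config dictionary so that only the changing parts need to be typed. `make_gliner_config` is a hypothetical name introduced here for illustration; the keys and default values are copied from the cells above.

```python
def make_gliner_config(labels, model="urchade/gliner_multi", chunk_size=250, style="ent"):
    """Build a config dict with the same keys used in the cells above."""
    if not labels:
        raise ValueError("Provide at least one label.")
    return {
        "gliner_model": model,
        "chunk_size": chunk_size,
        "labels": list(labels),
        "style": style,
    }


river_config = make_gliner_config(["RIVER"])
print(river_config)
```

The result can be passed to the pipeline exactly as before, for example `latin_nlp.add_pipe("gliner_spacy", config=river_config)`.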


GLiNER is, therefore, potentially a good solution to many zero-shot tasks. If you need to improve the model, fine-tuning is also possible. Fine-tuning GLiNER has two key advantages. First, it can help the model recognize your particular set of labels. This is useful if the model struggles in your domain. Second, it can adjust how annotations are done. For example, imagine we did not want `flumen` attached to `Garumna`. We could use a fine-tuned Latin GLiNER model that does not attach qualifiers to the governing entity.

In [14]:
latin_text = """Gallia est omnis divisa in partes tres, quarum unam incolunt Belgae, aliam Aquitani, tertiam qui ipsorum lingua Celtae, nostra Galli appellantur. Hi omnes lingua, institutis, legibus inter se differunt. Gallos ab Aquitanis Garumna flumen, a Belgis Matrona et Sequana dividit.
"""
latin_config = { "gliner_model": "medieval-data/gliner_multi-v2.1-medieval-latin",
                            "chunk_size": 250,
                            "labels": ["PERSON", "LOCATION", "GROUP", "RIVER"],
                            "style": "ent"}

latin_nlp = spacy.load("en_core_web_sm", disable=["ner"])
latin_nlp.add_pipe("gliner_spacy", config=latin_config)

latin_doc = latin_nlp(latin_text)
displacy.render(latin_doc, style="ent", options=options)

Fetching 4 files:   0%|          | 0/4 [00:00<?, ?it/s]

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Notice here that `Garumna` is no longer flagged as a RIVER, but we have managed to extract the entity in the style that we wanted, that is, without `flumen` attached. This result demonstrates two things. First, it shows how fine-tuning changes the way a model performs annotation. Second, it shows that the fine-tuned model is biased a bit away from RIVER, at least in this context, because it fails to correctly identify any of the three rivers as such. But again, we can use our RIVER trick to identify just the rivers in the text.

In [16]:
latin_text = """Gallia est omnis divisa in partes tres, quarum unam incolunt Belgae, aliam Aquitani, tertiam qui ipsorum lingua Celtae, nostra Galli appellantur. Hi omnes lingua, institutis, legibus inter se differunt. Gallos ab Aquitanis Garumna flumen, a Belgis Matrona et Sequana dividit.
"""
latin_config = { "gliner_model": "medieval-data/gliner_multi-v2.1-medieval-latin",
                            "chunk_size": 250,
                            "labels": ["RIVER"],
                            "style": "ent"}

latin_nlp = spacy.load("en_core_web_sm", disable=["ner"])
latin_nlp.add_pipe("gliner_spacy", config=latin_config)

latin_doc = latin_nlp(latin_text)
displacy.render(latin_doc, style="ent", options=options)

Fetching 4 files:   0%|          | 0/4 [00:00<?, ?it/s]

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


GLiNER isn't just good at zero-shot classification for NER. Numerous teams are now working with GLiNER and pushing it to do more complex tasks, such as relationship extraction. In this notebook, we won't be using GLiNER for this; we will use `spacy-llm` instead. If you want to check out some of these more advanced models, I recommend following [`Knowledgator`](https://huggingface.co/knowledgator) on Hugging Face!

# spaCy LLM Label Definitions

I would like to pivot now back to `spacy-llm` for the remainder of this notebook. The goal here is to give you the configs necessary to begin using `spacy-llm` to improve outputs and to handle more advanced tasks. We will use it not only for NER in this notebook, but also for `text classification` and `relationship extraction`.

In the previous notebook, when we tried to identify composers with `GPT-3.5`, it struggled to identify anyone other than `J.S. Bach`. There are a few possible reasons why this was the case. But the big question is: how can we improve our chances of identifying other composers with a smaller model? One approach is to provide the model with `label definitions`. We can do this by adding a single section to our config:

```yml
[nlp]
lang = "en"
pipeline = ["llm_ner"]

[components]

[components.llm_ner]
factory = "llm"

[components.llm_ner.task]
@llm_tasks = "spacy.NER.v3"
labels = ["COMPOSER"]

[components.llm_ner.model]
@llm_models = "spacy.GPT-3-5.v3"
config = {"temperature": 0.0}

[components.llm_ner.task.label_definitions]
COMPOSER = "Extract the name of any one who contextually looks like a composer of music."
```

The main thing that is different here is the final section:

```yml
[components.llm_ner.task.label_definitions]
COMPOSER = "Extract the name of any one who contextually looks like a composer of music."
```

Here, we are able to set the specific definition for our label. Let's see how it performs!

In [17]:
composer_text = """Program Hodle Chrlstus Notus Est Jan Peterszoon Sweellnck Sung in Latin 1562 -1621 Today Christ is born Today the Saviour hos appeared On earth the angels sing the archangels rejoice and the righteous ore glad saying `` Glory to God in the highest Noell Alleluia/ '' Rorate Coen Giovanni Perlulgl do Palestrina Sung in Latin 1525 -1594 Pour out dew from above you heavens and let the clouds rain down the Just One Let the earth open and bring forth a saviour Show us your mercy 0 Lord and grant us your salvation Come 0 Lord and do not delay Alleluia ! Allelulal Ascendlt Deus WIiiiam Byrd Sung in Latin 1543 -1623 Alleluia ! God hos ascended in jubilation and Christ the Lord with the sound of the trumpets Alleluia/ II Nunc Dfmlttls Christian Figueroa tenor The Te Deum of Sandor Slk Ill Sergei Rachmaninoff 1873-1943 Zoltan Kodaly 1882 -1967 Motet `` Lobet den Herrn alle Heiden '' J. S. Bach Sung in German 1685 -1750 Praise the Lord all ye notions praise him all ye people For his mercy and truth watch over us for evermore Alleluia/ -intermission -
"""

nlp_composer = assemble("../assets/openai-ner-composer-v2.cfg")
doc_composer = nlp_composer(composer_text)
displacy.render(doc_composer, style="ent", options=options)
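
The cell above loads its config from a file on disk. If you want to experiment with your own definitions, the following sketch builds the `label_definitions` section from a Python dictionary and writes the whole config to a file. The assumptions are mine: the base config is copied from the listing above, the file is written to a temporary folder under a hypothetical file name, and `json.dumps` is used to produce the double-quoted values the config expects.

```python
import json
import tempfile
from pathlib import Path

# Base config copied from the listing above, minus the label definitions.
BASE_CONFIG = """\
[nlp]
lang = "en"
pipeline = ["llm_ner"]

[components]

[components.llm_ner]
factory = "llm"

[components.llm_ner.task]
@llm_tasks = "spacy.NER.v3"
labels = ["COMPOSER"]

[components.llm_ner.model]
@llm_models = "spacy.GPT-3-5.v3"
config = {"temperature": 0.0}
"""

label_definitions = {
    "COMPOSER": "Extract the name of any one who contextually looks like a composer of music.",
}

# json.dumps adds the double quotes (and any escaping) around each definition.
definition_lines = "\n".join(
    f"{label} = {json.dumps(text)}" for label, text in label_definitions.items()
)
config_text = (
    BASE_CONFIG
    + "\n[components.llm_ner.task.label_definitions]\n"
    + definition_lines
    + "\n"
)

# Hypothetical location: a temporary folder rather than the notebook's assets folder.
config_path = Path(tempfile.mkdtemp()) / "ner-composer-definitions.cfg"
config_path.write_text(config_text, encoding="utf-8")
print(config_path.read_text(encoding="utf-8"))
```

The resulting path can then be handed to `assemble`, as in the cell above, provided `spacy-llm` is installed and an OpenAI API key is set.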

While our results still aren't perfect, they are a clear improvement over our earlier attempts with GPT-3.5. That's because this definition helps clarify what we mean by `COMPOSER`. This is a useful prompting trick that tells the model more precisely what we want it to identify. Notice, though, that the results are far from perfect. To improve them further, we can give the model a few examples of what we are looking for. To do this, we can add a few new lines to our config:

```yml
[components.llm_ner.task.examples]
@misc = "spacy.FewShotReader.v1"
path = "../assets/examples/composer_example.json"
```

This specifies where our data sits and how to load it. It will be used to help guide the model. Our data in this file looks like this:

```json
[
    {
      "text": "Program Hodie Christus Natus Est Jan Pieterszoon Sweelinck Sung in Latin 1562 -1621 Today Christ is born Today the Saviour has appeared On earth the angels sing the archangels rejoice and the righteous are glad saying `` Glory to God in the highest Noell Alleluia/ ''",
      "spans": [
        {
          "text": "Jan Pieterszoon Sweelinck",
          "is_entity": true,
          "label": "COMPOSER",
          "reason": "name of the composer for 'Hodie Christus Natus Est'"
        }
      ]
    },
...
]
```
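
Few-shot files are easy to get subtly wrong, for example when a span's text does not actually occur in its example. The sketch below checks a list of examples in the format shown above before they are saved. `find_example_problems` is a hypothetical helper written for this notebook, not part of `spacy-llm`, and the example text is shortened from the one above.

```python
import json

# One example in the same shape as the file above (text shortened for brevity).
examples = [
    {
        "text": "Program Hodie Christus Natus Est Jan Pieterszoon Sweelinck Sung in Latin 1562 -1621",
        "spans": [
            {
                "text": "Jan Pieterszoon Sweelinck",
                "is_entity": True,
                "label": "COMPOSER",
                "reason": "name of the composer for 'Hodie Christus Natus Est'",
            }
        ],
    }
]


def find_example_problems(examples, allowed_labels):
    """Return a list of human-readable problems found in the few-shot examples."""
    problems = []
    for i, example in enumerate(examples):
        for span in example["spans"]:
            if span["text"] not in example["text"]:
                problems.append(f"example {i}: span {span['text']!r} not found in text")
            if span["label"] not in allowed_labels:
                problems.append(f"example {i}: unknown label {span['label']!r}")
    return problems


problems = find_example_problems(examples, allowed_labels={"COMPOSER"})
print(problems)

# Serialize to the JSON layout shown above; write this string to your examples file.
serialized = json.dumps(examples, indent=2)
```

An empty list means every span was found in its text and every label is one the task knows about.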

In [45]:
nlp_composer = assemble("../assets/openai-ner-composer-v2-examples.cfg")
doc_composer = nlp_composer(composer_text)

displacy.render(doc_composer, style="ent", options=options)

Full disclosure: if we look at our few-shot examples, I'm clearly cheating here. I'm using the precise examples from this text. Ignoring that fact, this demonstrates how examples can influence the output. I did this because I tried a lot of ways to get this to work with GPT-3.5, but ultimately failed. I think it's good to include this failure in this notebook because it demonstrates that some problems are just too complex for `3.5` and require GPT-4 or one of its variants.

In the last notebook, we worked with only GPT-4, but as of yesterday (July 18, 2024), there's a replacement for GPT-3.5. It's cheaper and it's called `gpt-4o-mini`. We can use this specific model by changing a few things in our config:


```yml
[components.llm_ner.model]
@llm_models = "spacy.GPT-4.v3"
name = "gpt-4o-mini"
config = {"temperature": 0.9}
```

Here, we have changed to `spacy.GPT-4.v3` to specify that we will be using the GPT-4 family of models, and we have specified which model with `name`. If I wanted to use the larger `gpt-4o`, I could drop the `-mini` here. In my small tests, I didn't notice much of a difference; if anything, `mini` proved to be a bit better. Notice also that I have turned up the temperature. This will help it make predictions on out-of-scope entities not included in our examples. All of this allows us to identify a couple of extra composers.

In [48]:
nlp_composer = assemble("../assets/openai-ner-composer-v2-examples-4o.cfg")
doc_composer = nlp_composer(composer_text)

displacy.render(doc_composer, style="ent", options=options)

# Relationship Extraction

Sometimes extracting entities on their own isn't enough. Sometimes, we need to do relationship extraction. This is where we first identify the named entities and then try to understand how those entities relate to one another in the context of a text. This is where LLMs particularly shine because they have a broad understanding of language and of the text in front of them. Relationship extraction is more complex than NER, so expect to need a larger model here, especially if the types of relationships you are extracting are many and complex. Always start small and simple and work your way up.

Imagine we wanted to identify not only `COMPOSER`, but also `COMPOSITION`. We could use GPT to find these two types of entities for us, but we wouldn't know which composer wrote which composition. Here's where relationship extraction comes into play.

Let's take a look at our config for this problem.

```yml
[nlp]
lang = "en"
pipeline = ["llm_ner", "llm_rel"]

[components]

[components.llm_ner]
factory = "llm"

[components.llm_ner.task]
@llm_tasks = "spacy.NER.v2"
labels = ["COMPOSER", "COMPOSITION"]
alignment_mode = "expand"

[components.llm_ner.model]
@llm_models = "spacy.GPT-4.v3"
name = "gpt-4"
config = {"temperature": 0.0}

[components.llm_ner.task.examples]
@misc = "spacy.FewShotReader.v1"
path = "../assets/examples/composer_example.yaml"


[components.llm_rel]
factory = "llm"

[components.llm_rel.task]
@llm_tasks = "spacy.REL.v1"
labels = ["Wrote"]
```

As we can see, we have defined an extra component in our pipeline (`llm_rel`). This will be our relationship extraction component:

```yml
[nlp]
lang = "en"
pipeline = ["llm_ner", "llm_rel"]
```

To add this to our pipeline, we only need to define the factory. Again, we are using the `llm` factory.


```yml
[components.llm_rel]
factory = "llm"
```

Finally, we need to specify the task.

```yml
[components.llm_rel.task]
@llm_tasks = "spacy.REL.v1"
labels = ["Wrote"]
```

Here, we are using the `spacy.REL.v1` task and we are specifying that we want to identify the relationship of Wrote. It's important to be explicit in your label naming here.

Let's go ahead and run this in precisely the same way as before. Notice that we are changing the config to our relationship config.


In [52]:
nlp_composer = assemble("../assets/openai-ner-composer-v2-rel.cfg")
doc_composer = nlp_composer(composer_text)

displacy.render(doc_composer, style="ent", options=options)

You will notice that our output looks good, but we have some odd things appearing after our entities, such as `[ENT1:COMPOSER]`. This tells us the specific index of each entity. This is important because when we access our relationship tags at the Doc container level, we need to be able to reassemble these relationships. Let's examine the relationships by looking at the newly added `._.rel` attribute on our Doc.

In [53]:
doc_composer._.rel

[RelationItem(dep=1, dest=0, relation='Wrote'),
 RelationItem(dep=3, dest=2, relation='Wrote'),
 RelationItem(dep=5, dest=4, relation='Wrote'),
 RelationItem(dep=7, dest=6, relation='Wrote'),
 RelationItem(dep=9, dest=8, relation='Wrote'),
 RelationItem(dep=10, dest=11, relation='Wrote')]

Each one of these is an instance of a `RelationItem`. Each has three parts: `dep`, the index of the source entity; `dest`, the index of the target entity; and `relation`, the relationship label. Both indices refer to positions in `doc_composer.ents`. If we want to reconstruct this knowledge as prose, we can use the code snippet below. Please note that the order is sometimes reversed, so always validate the results manually. Providing examples would help the model understand more clearly how you intend this relationship to be constructed.

In [54]:
for relation in doc_composer._.rel:
    print(f"{doc_composer.ents[relation.dep]} {relation.relation} {doc_composer.ents[relation.dest]}")

Jan Peterszoon Sweellnck Wrote Hodle Chrlstus Notus Est
Giovanni Perlulgl do Palestrina Wrote Rorate Coen
WIiiiam Byrd Wrote Ascendlt Deus
Christian Figueroa Wrote Nunc Dfmlttls
Sergei Rachmaninoff Wrote Te Deum of Sandor Slk
Zoltan Kodaly Wrote Motet `` Lobet den Herrn alle Heiden ''
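
Since the direction of a relation can come back reversed, one option is to normalize it using the entity labels. The sketch below is a minimal illustration under stated assumptions: `Relation` is a stand-in dataclass with the same three fields as the `RelationItem` objects above, `orient_relations` is a hypothetical helper, and the labels and relations are made-up illustrative input, not output from the model.

```python
from dataclasses import dataclass


@dataclass
class Relation:
    """Stand-in with the same three fields as the RelationItem objects above."""
    dep: int
    dest: int
    relation: str


def orient_relations(relations, ent_labels, source_label="COMPOSER"):
    """Return (source_index, relation, target_index) triples with the source first."""
    oriented = []
    for rel in relations:
        source, target = rel.dep, rel.dest
        # Swap only when the target, not the source, carries the source label.
        if ent_labels[source] != source_label and ent_labels[target] == source_label:
            source, target = target, source
        oriented.append((source, rel.relation, target))
    return oriented


# Hypothetical labels, indexed in the same way as doc.ents.
ent_labels = ["COMPOSITION", "COMPOSER", "COMPOSER", "COMPOSITION"]
relations = [
    Relation(dep=1, dest=0, relation="Wrote"),
    Relation(dep=3, dest=2, relation="Wrote"),  # reversed: the composition is listed first
]

triples = orient_relations(relations, ent_labels)
print(triples)
```

With a real Doc, the inputs would be `[ent.label_ for ent in doc_composer.ents]` and `doc_composer._.rel`. This only fixes direction; it does not tell you whether the relationship itself is correct, so manual validation is still needed.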


# Text Classification

In addition to relationship extraction, we can also do text classification with `spacy-llm`. Here is an example of a simple config for text classification that determines if the input text is Toxic or Not Toxic.

```yml
[nlp]
lang = "en"
pipeline = ["llm_textcat"]

[components]

[components.llm_textcat]
factory = "llm"

[components.llm_textcat.task]
@llm_tasks = "spacy.TextCat.v2"
labels = ["NON_TOXIC", "TOXIC"]

[components.llm_textcat.model]
@llm_models = "spacy.GPT-3-5.v3"
config = {"temperature": 0.0}
```

The big difference here is where we specify the task:

```yml
[components.llm_textcat.task]
@llm_tasks = "spacy.TextCat.v2"
labels = ["NON_TOXIC", "TOXIC"]
```

In [55]:
tweet = """
Tweet:
You are terrible!
"""

nlp_textcat = assemble("../assets/openai-textcat.cfg")
doc_textcat = nlp_textcat(tweet)
doc_textcat.cats

defaultdict(<function spacy_llm.tasks.textcat.util.reduce_shards_to_doc.<locals>.<lambda>()>,
            {'NON_TOXIC': 0.0, 'TOXIC': 1.0})
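
The `cats` attribute behaves like a dictionary that maps each label to a score, so picking the predicted category is a one-liner. Here is a small sketch using a plain dictionary with the same values as the output above; `top_category` is a hypothetical helper name.

```python
def top_category(cats):
    """Return the (label, score) pair with the highest score."""
    if not cats:
        raise ValueError("No categories were assigned.")
    label = max(cats, key=cats.get)
    return label, cats[label]


cats = {"NON_TOXIC": 0.0, "TOXIC": 1.0}  # same values as the output above
label, score = top_category(cats)
print(label, score)
```

With a real Doc, you would call `top_category(doc_textcat.cats)` instead.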

# Multiple Tasks

We have already seen an example of multiple tasks above, where we combined NER with relationship extraction. Let's break this idea down a bit more to explain how you can stack multiple tasks in the same config file. Here, we will be using this config:

```yml
[nlp]
lang = "en"
pipeline = ["llm_ner", "llm_textcat"]

[components]

[components.llm_ner]
factory = "llm"

[components.llm_ner.task]
@llm_tasks = "spacy.NER.v2"
labels = ["COMPOSER", "PERSON"]

[components.llm_ner.model]
@llm_models = "spacy.GPT-3-5.v3"
config = {"temperature": 0.0}

[components.llm_textcat]
factory = "llm"

[components.llm_textcat.task]
@llm_tasks = "spacy.TextCat.v2"
labels = ["COMPLIMENT", "INSULT"]

[components.llm_textcat.model]
@llm_models = "spacy.GPT-3-5.v3"
config = {"temperature": 0.0}
```

Can you identify and break down the two tasks that we have here? Take a moment to consider the config file and try to figure out what it will do before we move forward.

In [57]:
tweet = """
Tweet:
John, you are great!
"""

nlp_multi = assemble("../assets/openai-multiple-task.cfg")
doc_multi = nlp_multi(tweet)

displacy.render(doc_multi, style="ent", options=options)

As we can see, we have access to our named entities, but we can also access our categories.

In [58]:
doc_multi.cats

defaultdict(<function spacy_llm.tasks.textcat.util.reduce_shards_to_doc.<locals>.<lambda>()>,
            {'COMPLIMENT': 1.0, 'INSULT': 0.0})

And as we can see, this is a compliment.
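
When a config stacks several tasks, it can help to summarize it before running it (and before paying for any API calls). The sketch below reads the multi-task config with Python's standard `configparser` and lists each component's task and labels. It assumes the config contains no variable interpolation and that its values are valid JSON, which holds for the config shown above; `summarize_llm_config` is a hypothetical helper, not part of spaCy or `spacy-llm`.

```python
import configparser
import json

CONFIG_TEXT = """
[nlp]
lang = "en"
pipeline = ["llm_ner", "llm_textcat"]

[components]

[components.llm_ner]
factory = "llm"

[components.llm_ner.task]
@llm_tasks = "spacy.NER.v2"
labels = ["COMPOSER", "PERSON"]

[components.llm_ner.model]
@llm_models = "spacy.GPT-3-5.v3"
config = {"temperature": 0.0}

[components.llm_textcat]
factory = "llm"

[components.llm_textcat.task]
@llm_tasks = "spacy.TextCat.v2"
labels = ["COMPLIMENT", "INSULT"]

[components.llm_textcat.model]
@llm_models = "spacy.GPT-3-5.v3"
config = {"temperature": 0.0}
"""


def summarize_llm_config(config_text):
    """Map each pipeline component to its (task name, labels) pair."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)
    summary = {}
    for name in json.loads(parser["nlp"]["pipeline"]):
        task = parser[f"components.{name}.task"]
        summary[name] = (json.loads(task["@llm_tasks"]), json.loads(task["labels"]))
    return summary


summary = summarize_llm_config(CONFIG_TEXT)
for component, (task_name, labels) in summary.items():
    print(component, task_name, labels)
```

This is only a reading aid: spaCy's own config loader remains the authority on how the file is interpreted.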