<img align="left" src="https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/tapi-logo-small.png" />

This notebook is free for educational reuse under the [Creative Commons CC BY License](https://creativecommons.org/licenses/by/4.0/).

Created by [Firstname Lastname](https://) for the 2024 Text Analysis Pedagogy Institute, with support from [Constellate](https://constellate.org).

For questions/comments/improvements, email author@email.address.<br />
____

# `spaCy in the World of LLMs` `2`

This is lesson `2` of 3 in the educational series on `spaCy and Large Language Models (LLMs)`. This notebook will introduce you to Large Language Models, the basic concepts behind them, when to use them, and how to bring their outputs into a spaCy pipeline.

**Skills:** 
* Python

**Audience:** `Teachers` / `Learners` / `Researchers`

**Use case:** `Tutorial` / `Reference` / `Explanation` 


**Difficulty:** `Beginner`

`Beginner assumes users are relatively new to Python and Jupyter Notebooks. The user is helped step-by-step with lots of explanatory text.`
`Intermediate assumes users are familiar with Python and have been programming for 6+ months. Code makes up a larger part of the notebook and basic concepts related to Python are not explained.`
`Advanced assumes users are very familiar with Python and have been programming for years, but they may not be familiar with the process being explained.`

**Completion time:** `90 minutes`

**Knowledge Required:** 
```
* Python basics (variables, flow control, functions, lists, dictionaries)
```

**Knowledge Recommended:**
```
* A general understanding of natural language processing (NLP)
```

**Learning Objectives:**
After this lesson, learners will be able to:
```
1. Understand text processing pipelines
2. Understand the basics of large language models (LLMs)
3. Understand how to start using spacy-llm to bring LLM outputs into a spaCy pipeline
```
___

# Required Python Libraries
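* [spaCy](https://spacy.io/) for performing [natural language processing](https://docs.constellate.org/key-terms/#nlp).
* [spacy-llm](https://github.com/explosion/spacy-llm) for bringing LLM outputs into a spaCy pipeline.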

## Install Required Libraries

In [None]:
### Install Libraries ###

# Using !pip installs
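!pip install spacy spacy-llm

# Download the small English model that is loaded later in this notebook
!python -m spacy download en_core_web_sm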

Before starting this notebook, you **MUST** have an OpenAI API key. To create one, [visit OpenAI's website, make an account, and navigate to Platform](https://platform.openai.com/api-keys), then follow the directions shown in this GIF.

![openai api key](../images/openai-api-key.gif)

The video covers these steps.

1. Make an Account
2. Visit the Dashboard
3. Go to API Keys
4. Create an API Key
5. Paste it into this notebook in the cell below.

In [20]:
import os
# Uncomment the next line if you are using a Mac. It works around a bug involving spacy-llm and PyTorch on macOS.
# os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"
os.environ["OPENAI_API_KEY"] = ""

In [25]:
### Import Libraries ###
import spacy
from spacy import displacy

# Introduction to Large Language Models and spacy-llm

Today, we will be looking at Large Language Models (LLMs) and `spacy-llm`. In this lesson, we will learn about the concepts behind LLMs, their strengths and weaknesses, and we will learn how they can be used for classification. We will learn about a few different methods of classification, from zero-shot to single and multi-shot classification. This background will allow us to then dive into NLP pipelines.

After covering this introductory material, we can then learn how LLMs can be easily introduced into a spaCy pipeline with `spacy-llm`. We will learn about APIs and how to properly set up an environment to work with API keys. While we will focus specifically on OpenAI, the methods introduced here will work for other LLM services as well as local models downloaded from HuggingFace.

# What are Large Language Models?

Large Language Models (LLMs) represent the cutting edge of natural language processing. These sophisticated AI systems are designed to comprehend, generate, and manipulate human language with proficiency. Built on deep learning architectures, typically employing transformer models, LLMs are trained on vast corpora of text data (trillions of tokens). This extensive training allows them to capture the subtle nuances of language, including complex grammatical structures, contextual meanings, and even rudimentary reasoning capabilities.

LLMs are very versatile and can be used to solve (or partially solve) many NLP problems. From translation to summarization, from question-answering to creative writing, these models demonstrate a wide-ranging applicability across numerous language tasks. You might be familiar with some popular examples, such as the GPT (Generative Pre-trained Transformer) series, the basis for ChatGPT. These models have played a pivotal role in revolutionizing natural language processing, enabling more human-like text generation and understanding than ever before.

# The strengths and weaknesses

Like any technology, LLMs come with their own set of strengths and weaknesses. On the positive side, they have the ability to adapt to various language tasks without requiring task-specific training. This is very different from traditional task-specific machine learning models. Their contextual understanding allows them to grasp nuances in language that previous models struggled with. Moreover, their generation capabilities enable them to produce human-like text for diverse purposes. One of their most impressive features is few-shot learning – the ability to adapt to new tasks with minimal examples.

Yet, it's crucial to be aware of their limitations. Despite their impressive abilities, LLMs lack true real-world knowledge and may generate plausible-sounding but factually incorrect information. This is known as a `hallucination`. They can inadvertently perpetuate societal biases present in their training data. From a practical standpoint, they demand significant computational resources to train and run. It's also important to note that while they can mimic reasoning, they don't truly "understand" in a human sense, and they may struggle with consistency, potentially providing different answers to the same question asked in different ways.

# Zero-Shot, Single-Shot, and Multi-Shot Classification

As we dive deeper into working with LLMs, you'll encounter terms like Zero-Shot, Single-Shot, and Multi-Shot Classification. These refer to different approaches in using LLMs for classification tasks. 

Zero-Shot Classification is particularly intriguing – it involves the model classifying inputs into categories it hasn't been explicitly trained on, relying instead on its general language understanding to infer appropriate categories. Imagine classifying news articles into topics without providing any specific examples – that's Zero-Shot Classification in action.

Single-Shot Classification takes this a step further by giving the model one example of each category before making classifications. This approach is invaluable when you have very limited labeled data available. For instance, you might provide one example each of a positive and negative review before conducting sentiment analysis.

Multi-Shot Classification, as the name suggests, provides the model with multiple examples for each category. This method typically improves accuracy by giving the model more context about each category. Think of it as giving several examples of different animal species before asking the model to classify new animals.
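
To make the difference concrete, here is a minimal sketch of how the three prompt styles might be written by hand. The `build_prompt` helper, the labels, and the example reviews are hypothetical and only for illustration; `spacy-llm` builds its own prompts internally, and they will not look exactly like this.

```python
def build_prompt(text, labels, examples=None):
    """Build a classification prompt with zero or more labeled examples."""
    lines = [f"Classify the text as one of: {', '.join(labels)}."]
    # Each (text, label) pair is one "shot" shown to the model.
    for example_text, example_label in examples or []:
        lines.append(f"Text: {example_text}")
        lines.append(f"Label: {example_label}")
    # The text we actually want classified goes last, with the label left blank.
    lines.append(f"Text: {text}")
    lines.append("Label:")
    return "\n".join(lines)


labels = ["POSITIVE", "NEGATIVE"]
review = "The plot dragged, but the acting was superb."

# Zero-shot: no examples at all.
zero_shot = build_prompt(review, labels)

# Single-shot: one example per label.
single_shot = build_prompt(
    review,
    labels,
    examples=[("I loved it.", "POSITIVE"), ("I hated it.", "NEGATIVE")],
)

# Multi-shot: several examples per label.
multi_shot = build_prompt(
    review,
    labels,
    examples=[
        ("I loved it.", "POSITIVE"),
        ("A wonderful film.", "POSITIVE"),
        ("I hated it.", "NEGATIVE"),
        ("A waste of time.", "NEGATIVE"),
    ],
)

print(multi_shot)
```

The only thing that changes between the three approaches is how many labeled examples appear before the text to classify.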

# What are Pipelines?

Now, let's talk about pipelines, a crucial concept in NLP and specifically in spacy-llm. In essence, pipelines are sequences of data processing components that work together to analyze text. They're the backbone of efficient and modular text processing systems. In the context of spacy-llm, pipelines play a vital role in integrating Large Language Models into spaCy's ecosystem, allowing for a seamless combination of rule-based and statistical NLP with LLM capabilities.

A typical NLP pipeline might include components like tokenization (breaking text into individual tokens), part-of-speech tagging (assigning grammatical categories to tokens), named entity recognition (identifying and categorizing named entities), and dependency parsing (analyzing the grammatical structure of sentences). With spacy-llm, we can now integrate LLM capabilities into these pipelines, using them for tasks like classification, generation, or analysis.

The beauty of pipelines lies in their standardization, efficiency, and flexibility. They provide a consistent interface for various NLP tasks, are optimized for performance and easy to scale, and can be customized and extended based on specific needs. In the context of spacy-llm, this means you can seamlessly integrate LLM capabilities into existing spaCy workflows, enabling powerful hybrid approaches that combine traditional NLP techniques with the advanced capabilities of Large Language Models.
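
The idea of a pipeline can be sketched in a few lines of plain Python. This is a toy illustration, not how spaCy is implemented: the `tokenize`, `lowercase`, and `count_tokens` components and the dictionary used as a "doc" are hypothetical stand-ins for spaCy's real components and its `Doc` object.

```python
def tokenize(doc):
    # Split the raw text on whitespace (far cruder than spaCy's tokenizer).
    doc["tokens"] = doc["text"].split()
    return doc


def lowercase(doc):
    # Relies on the tokens produced by the previous component.
    doc["lower"] = [token.lower() for token in doc["tokens"]]
    return doc


def count_tokens(doc):
    # Adds one more annotation without touching the earlier ones.
    doc["n_tokens"] = len(doc["tokens"])
    return doc


# A pipeline is an ordered list of components.
pipeline = [tokenize, lowercase, count_tokens]


def run_pipeline(text, components):
    doc = {"text": text}
    # Each component receives the doc, adds annotations, and passes it on.
    for component in components:
        doc = component(doc)
    return doc


result = run_pipeline("Mr Skinner is gone to Philadelphia", pipeline)
print(result["lower"])
print(result["n_tokens"])
```

Because every component takes a doc and returns a doc, components can be added, removed, or reordered without rewriting the others, which is the property that lets an LLM-backed component slot into an existing workflow.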

As we progress through this lesson, you'll gain hands-on experience with these concepts, learning how to leverage the full potential of both spaCy and Large Language Models in your NLP projects. Whether you're looking to classify text, generate human-like responses, or perform complex language analysis, the combination of LLMs and spacy-llm provides a powerful toolkit for tackling a wide range of natural language processing challenges.

So, let's roll up our sleeves and dive into the exciting world of Large Language Models and spacy-llm. By the end of this lesson, you'll have a solid foundation in these cutting-edge technologies and practical knowledge of how to apply them in real-world scenarios. Get ready to unlock new possibilities in your NLP journey!

# spaCy Pipeline

Let's take a look at a typical spaCy pipeline. To do that, we will load up the `en_core_web_sm` model.

In [4]:
nlp = spacy.load("en_core_web_sm")

Now that we have it loaded into memory, let's go ahead and access the pipe names by using `.pipe_names`:

In [9]:
nlp.pipe_names

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

This shows us the sequence of the pipes.

1. 'tok2vec': This is the first component in the pipeline. It stands for "token to vector". By the time it runs, spaCy's tokenizer has already split the text into tokens (the tokenizer always runs first and is not listed in `.pipe_names`). This component converts each token into a numerical vector that later components can use as input, which makes it a crucial preprocessing step.
2. 'tagger': This component performs part-of-speech (POS) tagging. It assigns grammatical categories (like noun, verb, adjective, etc.) to each token in the text.
3. 'parser': The parser analyzes the grammatical structure of the sentence. It determines the relationships between words and creates a dependency parse tree.
4. 'attribute_ruler': This component can be used to add, modify or remove token attributes based on token or span matches. It's often used for rule-based corrections or additions to the pipeline's output.
5. 'lemmatizer': The lemmatizer reduces words to their base or dictionary form. For example, "running" would be lemmatized to "run".
6. 'ner': This stands for Named Entity Recognition. It identifies and classifies named entities (like persons, organizations, locations, etc.) in the text.

This sequence represents a common order of operations in NLP:

First, the text is tokenized and converted to vectors. Then, grammatical information is added (tagging and parsing). Additional attributes might be adjusted. Words are reduced to their base forms. Finally, named entities are identified.

Each step in this pipeline builds on the previous ones, creating a rich set of linguistic annotations for the input text. This particular pipeline is quite comprehensive and would be suitable for a wide range of NLP tasks.

We can analyze this pipeline further with the `.analyze_pipes()` method.

In [11]:
nlp.analyze_pipes()

{'summary': {'tok2vec': {'assigns': ['doc.tensor'],
   'requires': [],
   'scores': [],
   'retokenizes': False},
  'tagger': {'assigns': ['token.tag'],
   'requires': [],
   'scores': ['tag_acc'],
   'retokenizes': False},
  'parser': {'assigns': ['token.dep',
    'token.head',
    'token.is_sent_start',
    'doc.sents'],
   'requires': [],
   'scores': ['dep_uas',
    'dep_las',
    'dep_las_per_type',
    'sents_p',
    'sents_r',
    'sents_f'],
   'retokenizes': False},
  'attribute_ruler': {'assigns': [],
   'requires': [],
   'scores': [],
   'retokenizes': False},
  'lemmatizer': {'assigns': ['token.lemma'],
   'requires': [],
   'scores': ['lemma_acc'],
   'retokenizes': False},
  'ner': {'assigns': ['doc.ents', 'token.ent_iob', 'token.ent_type'],
   'requires': [],
   'scores': ['ents_f', 'ents_p', 'ents_r', 'ents_per_type'],
   'retokenizes': False}},
 'problems': {'tok2vec': [],
  'tagger': [],
  'parser': [],
  'attribute_ruler': [],
  'lemmatizer': [],
  'ner': []},
 'att

# spaCy Config System

As of spaCy version 3, a spaCy pipeline is based on a config system. The config controls the pipeline: it records the location of important assets such as training data, defines the pipes, and points the pipeline to the correct machine learning models. If we want to examine the config of a given spaCy pipeline, we can use `nlp.config`.

In [17]:
nlp.config

{'paths': {'train': None, 'dev': None, 'vectors': None, 'init_tok2vec': None},
 'system': {'gpu_allocator': None, 'seed': 0},
 'nlp': {'lang': 'en',
  'pipeline': ['tok2vec',
   'tagger',
   'parser',
   'senter',
   'attribute_ruler',
   'lemmatizer',
   'ner'],
  'disabled': ['senter'],
  'before_creation': None,
  'after_creation': None,
  'after_pipeline_creation': None,
  'batch_size': 256,
  'tokenizer': {'@tokenizers': 'spacy.Tokenizer.v1'},
  'vectors': {'@vectors': 'spacy.Vectors.v1'}},
 'components': {'tok2vec': {'factory': 'tok2vec',
   'model': {'@architectures': 'spacy.Tok2Vec.v2',
    'embed': {'@architectures': 'spacy.MultiHashEmbed.v2',
     'width': '${components.tok2vec.model.encode:width}',
     'attrs': ['NORM', 'PREFIX', 'SUFFIX', 'SHAPE', 'SPACY', 'IS_SPACE'],
     'rows': [5000, 1000, 2500, 2500, 50, 50],
     'include_static_vectors': False},
    'encode': {'@architectures': 'spacy.MaxoutWindowEncoder.v2',
     'width': 96,
     'depth': 4,
     'window_size': 1

This can be a little difficult to read when rendered as a nested dictionary. It is a bit easier if we use the `.to_str()` method to convert the config into a readable string.

In [16]:
print(nlp.config.to_str())

[paths]
train = null
dev = null
vectors = null
init_tok2vec = null

[system]
gpu_allocator = null
seed = 0

[nlp]
lang = "en"
pipeline = ["tok2vec","tagger","parser","senter","attribute_ruler","lemmatizer","ner"]
disabled = ["senter"]
before_creation = null
after_creation = null
after_pipeline_creation = null
batch_size = 256
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}
vectors = {"@vectors":"spacy.Vectors.v1"}

[components]

[components.attribute_ruler]
factory = "attribute_ruler"
scorer = {"@scorers":"spacy.attribute_ruler_scorer.v1"}
validate = false

[components.lemmatizer]
factory = "lemmatizer"
mode = "rule"
model = null
overwrite = false
scorer = {"@scorers":"spacy.lemmatizer_scorer.v1"}

[components.ner]
factory = "ner"
incorrect_spans_key = null
moves = null
scorer = {"@scorers":"spacy.ner_scorer.v1"}
update_with_oracle_cut_size = 100

[components.ner.model]
@architectures = "spacy.TransitionBasedParser.v2"
state_type = "ner"
extra_state_tokens = false
hidden_width = 64
ma

It's a bit beyond the scope of this notebook to dive into the entirety of this config file, but it is worth noting that this is the basis for defining and structuring a pipeline manually. I am providing this example here because we will be creating our own spaCy config files in order to properly define and create a spaCy pipeline for working with LLMs.

# Introduction to `spacy-llm`

The best way to work with LLMs inside of spaCy is to use the spaCy-created package `spacy-llm`. We've already installed it above. In order to begin working with spacy-llm, you need to assemble a pipeline. There are several ways to do this. For this tutorial, we will be using a config file because it makes results easier to reproduce. Rather than pasting a code snippet into other notebooks, you can reuse the same config file.

For this course, all config files are located in `./assets`. Because our notebooks are in `./notebooks`, we need to navigate out of `notebooks` and into `assets`. To do that, we can use the relative path `../assets`. The `..` brings us back to the root directory of this project.
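
If relative paths are new to you, the sketch below shows how `..` resolves. The `project/notebooks` folder name is a hypothetical stand-in for wherever this course lives on your machine, and `posixpath` is used only so the result looks the same on every operating system.

```python
import posixpath

# Hypothetical location of the folder that holds this notebook.
notebook_dir = "project/notebooks"

# ".." steps up one level to "project", then "assets" steps down again.
config_path = posixpath.normpath(
    posixpath.join(notebook_dir, "../assets/openai-ner.cfg")
)
print(config_path)
```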

To begin working with spacy-llm, let's go ahead and import `assemble` from `spacy_llm.util`

In [18]:
from spacy_llm.util import assemble

Now that we have everything imported, we can go ahead and create our pipeline. Before we do, though, we need to make sure our `OPENAI_API_KEY` environment variable is set correctly. We have already done this at the top of the notebook. Remember, when working with notebooks, you MUST set your environment variables before assembling the pipeline, either with the `os` module or in the shell that launches the notebook.

The code you should use is this:

```python
import os
# Uncomment the next line if you are using a Mac. It works around a bug involving spacy-llm and PyTorch on macOS.
# os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"
os.environ["OPENAI_API_KEY"] = "YOUR OPENAI API KEY"
```

`spacy-llm` expects your API keys to be set as environment variables rather than written into the script. The more secure way to do this is to use `export` in the terminal. Because we are working on different operating systems and different computers, we are using `os` to ensure that all students have a shared experience.
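
Before assembling a pipeline, it can help to confirm that the key is actually visible to Python. The `get_api_key` helper below is a hypothetical convenience, not part of spaCy or `spacy-llm`; it only checks that the variable is present and non-empty, and it never prints the key itself.

```python
import os


def get_api_key(name="OPENAI_API_KEY"):
    """Return the key stored in the environment, or raise a clear error."""
    key = os.environ.get(name, "").strip()
    if not key:
        raise RuntimeError(
            f"{name} is not set. Set it with os.environ or `export` "
            "before assembling the pipeline."
        )
    return key


try:
    get_api_key()
    print("API key found.")
except RuntimeError as error:
    print(error)
```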

Once your API key is set correctly, we can begin assembling our pipeline. To do that, we will use `assemble()` and pass a single argument: the path to our config file.

In [75]:
nlp = assemble("../assets/openai-ner.cfg")

But what does this config file look like? It looks like this:

```yaml
[nlp]
lang = "en"
pipeline = ["llm_ner"]

[components]

[components.llm_ner]
factory = "llm"

[components.llm_ner.task]
@llm_tasks = "spacy.NER.v2"
labels = ["GPE", "PERSON"]

[components.llm_ner.model]
@llm_models = "spacy.GPT-3-5.v3"
config = {"temperature": 0.0}

```

What's happening here exactly? Well, we are defining our LLM pipeline within spaCy. Let's break down each line.

# Breakdown of spaCy LLM Configuration

Let's go through each section of the configuration file and explain what each line means:

```yaml
[nlp]
lang = "en"
pipeline = ["llm_ner"]
```

- `[nlp]`: This section defines general settings for the NLP (Natural Language Processing) model.
- `lang = "en"`: This specifies that the language of the model is English.
- `pipeline = ["llm_ner"]`: This defines the components of the processing pipeline. Here, we have only one component named "llm_ner".

```yaml
[components]
```

- `[components]`: This section introduces the configuration for individual components in the pipeline.

```yaml
[components.llm_ner]
factory = "llm"
```

- `[components.llm_ner]`: This subsection configures the "llm_ner" component we specified in the pipeline.
- `factory = "llm"`: This tells spaCy to use the LLM (Large Language Model) factory to create this component.

```yaml
[components.llm_ner.task]
@llm_tasks = "spacy.NER.v2"
labels = ["GPE", "PERSON"]
```

- `[components.llm_ner.task]`: This subsection defines the specific task for the LLM component.
- `@llm_tasks = "spacy.NER.v2"`: This specifies that we're using version 2 of spaCy's Named Entity Recognition (NER) task.
- `labels = ["GPE", "PERSON"]`: This defines the label(s) that the NER task should identify. In this case, it's looking for entities labeled as "GPE" or "PERSON". GPE stands for Geo-Political Entity.

```yaml
[components.llm_ner.model]
@llm_models = "spacy.GPT-3-5.v3"
config = {"temperature": 0.0}
```

- `[components.llm_ner.model]`: This subsection configures the specific LLM model to use.
- `@llm_models = "spacy.GPT-3-5.v3"`: This specifies that we're using version 3 of the GPT-3.5 model integration in spaCy.
- `config = {"temperature": 0.0}`: This sets the configuration for the GPT-3.5 model. The "temperature" parameter controls the randomness of the model's output. A value of 0.0 means the model will always choose the most likely next word, making the output more deterministic and focused.

In summary, this configuration sets up a spaCy pipeline that uses GPT-3.5 to perform Named Entity Recognition, specifically looking for entities that can be labeled as "GPE" or "PERSON". The pipeline is set to process English text and aims to produce consistent, focused results due to the low temperature setting of the model.
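
To build intuition for the temperature setting, here is a toy sketch of temperature-scaled probabilities. The scores are made up, and this is not OpenAI's actual implementation; it only illustrates why a temperature of 0.0 behaves like always picking the top-scoring option, while higher values spread probability across the alternatives.

```python
import math


def temperature_probs(scores, temperature):
    """Turn raw scores into probabilities, sharpened or flattened by temperature."""
    if temperature == 0.0:
        # Limit case: all probability goes to the highest-scoring option.
        best = max(range(len(scores)), key=lambda i: scores[i])
        return [1.0 if i == best else 0.0 for i in range(len(scores))]
    scaled = [score / temperature for score in scores]
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(scaled)
    exps = [math.exp(value - top) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


# Hypothetical scores for three candidate next words.
scores = [2.0, 1.0, 0.5]

for temperature in (0.0, 0.5, 1.0, 2.0):
    probs = temperature_probs(scores, temperature)
    print(temperature, [round(p, 2) for p in probs])
```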

## Models

When it comes to models, spaCy has a lot of flexibility with most of the major APIs. Here is a table that comes from spaCy's documentation. To stay up-to-date, make sure you visit their page directly [here](https://spacy.io/api/large-language-models#models).


| MODEL                     | PROVIDER            | SUPPORTED NAMES                                                                 | DEFAULT NAME         | DEFAULT CONFIG                           |
|---------------------------|---------------------|--------------------------------------------------------------------------------|----------------------|------------------------------------------|
| spacy.GPT-4.v1            | OpenAI              | ["gpt-4", "gpt-4-0314", "gpt-4-32k", "gpt-4-32k-0314"]                          | "gpt-4"              | {}                                       |
| spacy.GPT-4.v2            | OpenAI              | ["gpt-4", "gpt-4-0314", "gpt-4-32k", "gpt-4-32k-0314"]                          | "gpt-4"              | {temperature=0.0}                        |
| spacy.GPT-4.v3            | OpenAI              | All names of GPT-4 models offered by OpenAI                                     | "gpt-4"              | {temperature=0.0}                        |
| spacy.GPT-3-5.v1          | OpenAI              | ["gpt-3.5-turbo", "gpt-3.5-turbo-16k", "gpt-3.5-turbo-0613", "gpt-3.5-turbo-0613-16k", "gpt-3.5-turbo-instruct"] | "gpt-3.5-turbo"      | {}                                       |
| spacy.GPT-3-5.v2          | OpenAI              | ["gpt-3.5-turbo", "gpt-3.5-turbo-16k", "gpt-3.5-turbo-0613", "gpt-3.5-turbo-0613-16k", "gpt-3.5-turbo-instruct"] | "gpt-3.5-turbo"      | {temperature=0.0}                        |
| spacy.GPT-3-5.v3          | OpenAI              | All names of GPT-3.5 models offered by OpenAI                                   | "gpt-3.5-turbo"      | {temperature=0.0}                        |
| spacy.Davinci.v1          | OpenAI              | ["davinci"]                                                                     | "davinci"            | {}                                       |
| spacy.Davinci.v2          | OpenAI              | ["davinci"]                                                                     | "davinci"            | {temperature=0.0, max_tokens=500}        |
| spacy.Text-Davinci.v1     | OpenAI              | ["text-davinci-003", "text-davinci-002"]                                        | "text-davinci-003"   | {}                                       |
| spacy.Text-Davinci.v2     | OpenAI              | ["text-davinci-003", "text-davinci-002"]                                        | "text-davinci-003"   | {temperature=0.0, max_tokens=1000}       |
| spacy.Code-Davinci.v1     | OpenAI              | ["code-davinci-002"]                                                            | "code-davinci-002"   | {}                                       |
| spacy.Code-Davinci.v2     | OpenAI              | ["code-davinci-002"]                                                            | "code-davinci-002"   | {temperature=0.0, max_tokens=500}        |
| spacy.Curie.v1            | OpenAI              | ["curie"]                                                                       | "curie"              | {}                                       |
| spacy.Curie.v2            | OpenAI              | ["curie"]                                                                       | "curie"              | {temperature=0.0, max_tokens=500}        |
| spacy.Text-Curie.v1       | OpenAI              | ["text-curie-001"]                                                              | "text-curie-001"     | {}                                       |
| spacy.Text-Curie.v2       | OpenAI              | ["text-curie-001"]                                                              | "text-curie-001"     | {temperature=0.0, max_tokens=500}        |
| spacy.Babbage.v1          | OpenAI              | ["babbage"]                                                                     | "babbage"            | {}                                       |
| spacy.Babbage.v2          | OpenAI              | ["babbage"]                                                                     | "babbage"            | {temperature=0.0, max_tokens=500}        |
| spacy.Text-Babbage.v1     | OpenAI              | ["text-babbage-001"]                                                            | "text-babbage-001"   | {}                                       |
| spacy.Text-Babbage.v2     | OpenAI              | ["text-babbage-001"]                                                            | "text-babbage-001"   | {temperature=0.0, max_tokens=500}        |
| spacy.Ada.v1              | OpenAI              | ["ada"]                                                                         | "ada"                | {}                                       |
| spacy.Ada.v2              | OpenAI              | ["ada"]                                                                         | "ada"                | {temperature=0.0, max_tokens=500}        |
| spacy.Text-Ada.v1         | OpenAI              | ["text-ada-001"]                                                                | "text-ada-001"       | {}                                       |
| spacy.Text-Ada.v2         | OpenAI              | ["text-ada-001"]                                                                | "text-ada-001"       | {temperature=0.0, max_tokens=500}        |
| spacy.Azure.v1            | Microsoft, OpenAI   | Arbitrary values                                                                | No default           | {temperature=0.0}                        |
| spacy.Command.v1          | Cohere              | ["command", "command-light", "command-light-nightly", "command-nightly"]         | "command"            | {}                                       |
| spacy.Claude-2-1.v1       | Anthropic           | ["claude-2-1"]                                                                  | "claude-2-1"         | {}                                       |
| spacy.Claude-2.v1         | Anthropic           | ["claude-2", "claude-2-100k"]                                                   | "claude-2"           | {}                                       |
| spacy.Claude-1.v1         | Anthropic           | ["claude-1", "claude-1-100k"]                                                   | "claude-1"           | {}                                       |
| spacy.Claude-1-0.v1       | Anthropic           | ["claude-1.0"]                                                                  | "claude-1.0"         | {}                                       |
| spacy.Claude-1-2.v1       | Anthropic           | ["claude-1.2"]                                                                  | "claude-1.2"         | {}                                       |
| spacy.Claude-1-3.v1       | Anthropic           | ["claude-1.3", "claude-1.3-100k"]                                               | "claude-1.3"         | {}                                       |
| spacy.Claude-instant-1.v1 | Anthropic           | ["claude-instant-1", "claude-instant-1-100k"]                                   | "claude-instant-1"   | {}                                       |
| spacy.Claude-instant-1-1.v1| Anthropic          | ["claude-instant-1.1", "claude-instant-1.1-100k"]                               | "claude-instant-1.1" | {}                                       |
| spacy.PaLM.v1             | Google              | ["chat-bison-001", "text-bison-001"]                                            | "text-bison-001"     | {temperature=0.0}                        |


## Creating our First Doc

Now that we have created our pipeline, let's create a text and run it through the pipeline. Our syntax from here on out remains precisely the same as any other spaCy pipeline.

In [59]:
text = """
Sir
I understand Mr Skinner is gone to Philadelphia. You will keep the inclosed Letter for him till he returns, when You will take the earliest opportunity of delivering it to him. I desire to see him as soon as he arrives & have written to him for the purpose.1

You will inform the Officer who came with a flag to Elizabeth Town Yesterday—that he is not to wait for an Answer to the Letters he brought; and that One will be transmitted by an early conveyance.2 You will deliver him the Letters in the packet which accompanies this.3 I am Sir Yr Hbl. sert
"""

doc = nlp(text)

This is the same text we used in our previous notebook, but notice that this particular text preserves the archaic English style of the 18th century. Let's see how good our GPT-3.5 annotations are with `displacy`.

In [60]:
displacy.render(doc, style="ent")

The results here are perfect. It has correctly labeled each entity. Remember, you may have a slightly different output. That's okay, especially as you work with more complex tasks.

Let's see how well this process works with a more challenging text. The following text comes from the [FIND PROJECT NAME]. Our goal is to identify the composers in the text. Let's see how well our pipeline works.

In [76]:
composer_text = """
Program Hodle Chrlstus Notus Est Jan Peterszoon Sweellnck Sung in Latin 1562 -1621 Today Christ is born Today the Saviour hos appeared On earth the angels sing the archangels rejoice and the righteous ore glad saying `` Glory to God in the highest Noell Alleluia/ '' Rorate Coen Giovanni Perlulgl do Palestrina Sung in Latin 1525 -1594 Pour out dew from above you heavens and let the clouds rain down the Just One Let the earth open and bring forth a saviour Show us your mercy 0 Lord and grant us your salvation Come 0 Lord and do not delay Alleluia ! Allelulal Ascendlt Deus WIiiiam Byrd Sung in Latin 1543 -1623 Alleluia ! God hos ascended in jubilation and Christ the Lord with the sound of the trumpets Alleluia/ II Nunc Dfmlttls Christian Figueroa tenor The Te Deum of Sandor Slk Ill Sergei Rachmaninoff 1873-1943 Zoltan Kodaly 1882 -1967 Motet `` Lobet den Herrn alle Heiden '' J. S. Bach Sung in German 1685 -1750 Praise the Lord all ye notions praise him all ye people For his mercy and truth watch over us for evermore Alleluia/ -intermission -
"""
colors = {
    "COMPOSER": "#a8d5e2",      # Light blue
    "COMPOSITION": "#c2e8c4",   # Light green
    "DATE_RANGE": "#f9c2c2",    # Light red
    "LANGUAGE": "#f7e1a1"       # Light yellow
}

options = {"colors": colors}

nlp_composer = assemble("../assets/openai-ner-composer.cfg")
doc_composer = nlp_composer(composer_text)
displacy.render(doc_composer, style="ent", options=options)

This looks very bad. Let's break down one of the key issues. While we correctly identified `J.S. Bach`, we missed all other true positives. Why is that? For one, we are working with an NER label that is not traditional. I don't know of any datasets that label composers in this way. Secondly, we are using GPT-3.5. For more complex tasks, we should be using GPT-4. I have prepared for us another config that uses GPT-4. Let's load up that config and rerun this process.

In [77]:
nlp_composer = assemble("../assets/openai-ner-composer-good.cfg")
doc_composer = nlp_composer(composer_text)
displacy.render(doc_composer, style="ent", options=options)

This is looking a lot better! Why do we have such improvement? For one, GPT-4 is a much larger model. It was trained on more material and has a better understanding of more complex things. Why don't we just use GPT-4 for everything then? Well, GPT-4 is drastically more expensive than GPT-3.5. You really want to use GPT-4 when the task is complex and cannot be done with 3.5.

But GPT-4 can do a lot more than label composers. We can also label a lot of other things in the text, like COMPOSITION, DATE_RANGE, and LANGUAGE. Let's load up another config.

In [78]:
nlp_composer = assemble("../assets/openai-ner-composer-good-all.cfg")
doc_composer = nlp_composer(composer_text)
displacy.render(doc_composer, style="ent", options=options)

Here we can see how well our extra labels perform alongside composers. Are the results perfect? No. But that's okay. No machine learning model is perfect. The big question at this stage is how well does this approach work on domain-specific data?
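
One hedged way to put a number on "how well" is to compare the predicted entities against a hand-made answer key. In the sketch below, the `gold` and `predicted` sets are illustrative only: `gold` lists three composers that appear in the program text, and `predicted` mirrors the GPT-3.5 run described earlier, where only J. S. Bach was found. With a real `Doc`, you could build `predicted` from `doc.ents` using `ent.text` and `ent.label_`.

```python
def precision_recall(gold, predicted):
    """Compare two sets of (text, label) pairs."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Illustrative answer key: a partial, hand-picked list, not a full annotation.
gold = {
    ("J. S. Bach", "COMPOSER"),
    ("Sergei Rachmaninoff", "COMPOSER"),
    ("Zoltan Kodaly", "COMPOSER"),
}

# Illustrative prediction: only one composer was found.
predicted = {("J. S. Bach", "COMPOSER")}

precision, recall = precision_recall(gold, predicted)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Precision asks how many of the predicted entities were correct, while recall asks how many of the expected entities were found. A model that finds one composer and nothing else scores well on the first and poorly on the second.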

# In-Class Exercise

Let's test this. For this portion of the class, you will:

1) Build your own config file, using the ones provided as a template. You should really only change the labels and the model sections of the config.
2) Paste in your text in the code cell below.
3) Create your pipeline
4) Process the text
5) Visualize the text

Where does your approach work? Where does it fail? How does it compare to the other spaCy models?
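
As a starting point for step 1, here is a minimal sketch that fills in the labels and model of the config shown earlier and writes it to disk. The `make_ner_config` helper, the example labels, and the `my-ner.cfg` file name are hypothetical; the section names and registry strings (`spacy.NER.v2`, `spacy.GPT-4.v3`) are taken from the config and the model table above.

```python
import json
from pathlib import Path

CONFIG_TEMPLATE = """[nlp]
lang = "en"
pipeline = ["llm_ner"]

[components]

[components.llm_ner]
factory = "llm"

[components.llm_ner.task]
@llm_tasks = "spacy.NER.v2"
labels = {labels}

[components.llm_ner.model]
@llm_models = "{model}"
config = {{"temperature": 0.0}}
"""


def make_ner_config(labels, model="spacy.GPT-4.v3"):
    """Return config text with the given labels and model filled in."""
    # json.dumps writes the list with double quotes, as the config expects.
    return CONFIG_TEMPLATE.format(labels=json.dumps(labels), model=model)


# Hypothetical labels; replace them with labels that suit your own text.
config_text = make_ner_config(["PLANT", "ANIMAL"])

# Hypothetical file name; save it wherever you keep your configs.
config_path = Path("my-ner.cfg")
config_path.write_text(config_text, encoding="utf-8")
print(config_text)

# Afterwards, the pipeline could be created as before:
# nlp = assemble("my-ner.cfg")
```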

# Text Classification

In [89]:
tweet = """
Tweet:
You are terrible!
"""

nlp_textcat = assemble("../assets/openai-textcat.cfg")
doc_textcat = nlp_textcat(tweet)
doc_textcat.cats

defaultdict(<function spacy_llm.tasks.textcat.util.reduce_shards_to_doc.<locals>.<lambda>()>,
            {'NON_TOXIC': 0.0, 'TOXIC': 1.0})
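
The `cats` attribute behaves like a dictionary that maps each label to a score. Here is a minimal sketch of picking the top label, using a plain dictionary with the same values as the output above so that it runs without an API call:

```python
# Same values as the doc_textcat.cats output above, copied into a plain dict.
cats = {"NON_TOXIC": 0.0, "TOXIC": 1.0}

# Pick the label with the highest score.
top_label = max(cats, key=cats.get)
print(top_label, cats[top_label])
```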