## CEO Letter to Shareholders Demo

In this demo, we build a text classifier that flags sentences about technology-related topics in CEO letters to shareholders.

The workflow follows an example similar to the one in the Prodigy [text classification docs](https://prodi.gy/docs/text-classification#workflow).
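
The input file `ceo-letters-sample.jsonl` is not shown in this notebook. As a minimal sketch, assuming it uses newline-delimited JSON with one `text` key per record, a file of that shape could be produced like this. The two sentences and the `to_jsonl` helper are hypothetical.

```python
import json

# Hypothetical sample sentences; the real contents of
# ceo-letters-sample.jsonl are not shown in this notebook.
sentences = [
    "We continued to invest in cloud infrastructure this year.",
    "Our employees remain our greatest strength.",
]

def to_jsonl(texts):
    """Serialize each sentence as one JSON object per line with a "text" key."""
    return "\n".join(json.dumps({"text": t}) for t in texts)

jsonl = to_jsonl(sentences)
print(jsonl)
```

Writing `jsonl` to disk would then give a file that can be passed as the source argument of the annotation recipe.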

In [3]:
# step 1: manual label
# see https://prodi.gy/docs/recipes#textcat-manual
!source pgy-env/bin/activate && prodigy textcat.manual ceo_manual ceo-letters-sample.jsonl --label TECHNOLOGY

Using 1 label(s): TECHNOLOGY

✨  Starting the web server at http://localhost:8080 ...
Open the app in your browser and start annotating!

^C

✔ Saved 200 annotations to database SQLite
Dataset: ceo_manual
Session ID: 2022-08-24_17-34-41



In [35]:
# step 2: train initial model
# see https://prodi.gy/docs/recipes#train
!source pgy-env/bin/activate && \
    prodigy \
    train \
    ceo-tech-model \
    --textcat-multilabel ceo_manual 

ℹ Using CPU
ℹ Auto-generating config with spaCy
✔ Generated training config
[2022-08-24 20:39:30,980] [INFO] Set up nlp object from config
Components: textcat_multilabel
Merging training and evaluation data for 1 components
  - [textcat_multilabel] Training: 160 | Evaluation: 40 (20% split)
Training: 160 | Evaluation: 40
Labels: textcat_multilabel (1)
[2022-08-24 20:39:31,008] [INFO] Pipeline: ['textcat_multilabel']
[2022-08-24 20:39:31,011] [INFO] Created vocabulary
[2022-08-24 20:39:31,012] [INFO] Finished initializing nlp object
[2022-08-24 20:39:31,115] [INFO] Initialized pipeline components: ['textcat_multilabel']
✔ Initialized pipeline
Components: textcat_multilabel
Merging training and evaluation data for 1 components
  - [textcat_multilabel] Training: 160 | Evaluation: 40 (20% split)
Training: 160 | Evaluation: 40
Labels: textcat_multilabel (1)
ℹ Pipeline: ['textcat_multilabel']
ℹ Initial l

In [46]:
# prodigy configuration: see https://prodi.gy/docs/install#config
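# `!export` only affects a throwaway subshell; `%env` sets the variable for the kernel and the processes it launches
%env PRODIGY_CONFIG=prodigy.json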

In [54]:
# step 3: active learning
# see https://prodi.gy/docs/recipes#textcat-teach 
!source pgy-env/bin/activate && \
    prodigy \
    textcat.teach \
    ceo_teach \
    ceo-tech-model/model-best \
    ceo-letters-sample.jsonl \
    --label TECHNOLOGY \
    --exclude ceo_manual # exclude previously labeled

Using 1 label(s): TECHNOLOGY

✨  Starting the web server at http://localhost:8080 ...
Open the app in your browser and start annotating!

^C
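
With a model in the loop, `textcat.teach` focuses annotation on examples the model is unsure about. The sketch below is a simplified illustration of that idea, not Prodigy's internal sorter: it ranks hypothetical `(text, score)` pairs by how close the score is to 0.5. The `most_uncertain` helper and the candidate scores are assumptions made for this example.

```python
def most_uncertain(scored, k=2):
    """Return the k texts whose score lies closest to 0.5."""
    ranked = sorted(scored, key=lambda pair: abs(pair[1] - 0.5))
    return [text for text, _ in ranked[:k]]

# Hypothetical (text, score) pairs for illustration only.
candidates = [
    ("sentence a", 0.95),
    ("sentence b", 0.52),
    ("sentence c", 0.10),
    ("sentence d", 0.40),
]

picked = most_uncertain(candidates)
print(picked)
```

Confident predictions such as 0.95 or 0.10 are ranked last, since a human answer on them is likely to teach the model less.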


In [48]:
# step 4: merge manual + teach
# see https://prodi.gy/docs/recipes#merge
!source pgy-env/bin/activate && \
    prodigy \
    db-merge \
    ceo_manual,ceo_teach \
    ceo_dataset

✔ Created dataset 'ceo_dataset'
✔ Merged 300 examples from 2 datasets
Created merged dataset 'ceo_dataset'
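
One way to sanity-check a merged dataset is to tally the `answer` field of its records. The sketch below assumes the annotations have been exported as JSONL (for example with `prodigy db-out`) and that each record carries an `answer` of `accept`, `reject`, or `ignore`. The three records and the `count_answers` helper are made up for illustration.

```python
import json
from collections import Counter

# Hypothetical exported records; real records carry more fields.
exported = "\n".join([
    json.dumps({"text": "We expanded our AI platform.", "answer": "accept"}),
    json.dumps({"text": "Dividends rose this year.", "answer": "reject"}),
    json.dumps({"text": "Cloud revenue doubled.", "answer": "accept"}),
])

def count_answers(jsonl_text):
    """Count how often each answer value occurs in a JSONL export."""
    counts = Counter()
    for line in jsonl_text.splitlines():
        if line.strip():
            counts[json.loads(line).get("answer", "missing")] += 1
    return counts

answer_counts = count_answers(exported)
print(answer_counts)
```

A heavily skewed accept/reject ratio would be worth knowing about before retraining.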


In [49]:
# step 5: retrain model
# see https://prodi.gy/docs/recipes#train
!source pgy-env/bin/activate && \
    prodigy \
    train \
    ceo-tech-model \
    --textcat-multilabel ceo_dataset

ℹ Using CPU
ℹ Auto-generating config with spaCy
✔ Generated training config
[2022-08-24 21:01:21,007] [INFO] Set up nlp object from config
Components: textcat_multilabel
Merging training and evaluation data for 1 components
  - [textcat_multilabel] Training: 240 | Evaluation: 60 (20% split)
Training: 240 | Evaluation: 60
Labels: textcat_multilabel (1)
[2022-08-24 21:01:21,045] [INFO] Pipeline: ['textcat_multilabel']
[2022-08-24 21:01:21,048] [INFO] Created vocabulary
[2022-08-24 21:01:21,049] [INFO] Finished initializing nlp object
[2022-08-24 21:01:21,182] [INFO] Initialized pipeline components: ['textcat_multilabel']
✔ Initialized pipeline
Components: textcat_multilabel
Merging training and evaluation data for 1 components
  - [textcat_multilabel] Training: 240 | Evaluation: 60 (20% split)
Training: 240 | Evaluation: 60
Labels: textcat_multilabel (1)
ℹ Pipeline: ['textcat_multilabel']
ℹ Initial l

In [50]:
# optional: run train-curve for label diagnostic
# see https://prodi.gy/docs/recipes#train-curve
!source pgy-env/bin/activate && \
    prodigy \
    train-curve \
    --textcat-multilabel ceo_dataset

ℹ Auto-generating config with spaCy
✔ Generated training config
Training 4 times with 25%, 50%, 75%, 100% of the data

%      Score    textcat_multilabel
----   ------   ------
  0%   0.47     0.47  
 25%   0.64 ▲   0.64 ▲
 50%   0.71 ▲   0.71 ▲
 75%   0.65 ▼   0.65 ▼
100%   0.72 ▲   0.72 ▲

✔ Accuracy improved in the last sample
As a rule of thumb, if accuracy increases in the last segment, this could
indicate that collecting more annotations of the same type will improve the
model further.
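
To make the rule of thumb concrete, the sketch below recomputes the change between consecutive segments from the scores printed in the table above. The `deltas` helper is a hypothetical name introduced for this example.

```python
# (percent of data, score) pairs copied from the train-curve table.
curve = [(0, 0.47), (25, 0.64), (50, 0.71), (75, 0.65), (100, 0.72)]

def deltas(points):
    """Return (percent, change in score versus the previous segment)."""
    return [
        (pct, round(score - prev, 2))
        for (_, prev), (pct, score) in zip(points, points[1:])
    ]

steps = deltas(curve)
last_improved = steps[-1][1] > 0

print(steps)
print("improved in last segment:", last_improved)
```

The dip at 75% followed by a recovery at 100% suggests the curve is noisy at this dataset size, so the final gain is best read as a hint rather than a guarantee that more annotations will help.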


## Scoring with the Trained spaCy Model

In [51]:
import spacy

nlp = spacy.load("ceo-tech-model/model-last")
doc = nlp("As the importance of cloud, AI and digital platforms grows, this competition will become even more formidable.")
print(doc.cats)

{'TECHNOLOGY': 0.5724478363990784}


In [52]:
doc = nlp("We have taken extensive steps to support our employees, who are our greatest strength.")
print(doc.cats)

{'TECHNOLOGY': 0.10642561316490173}
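
To turn these scores into a yes/no decision, one option is to apply a cutoff to the `TECHNOLOGY` entry of `doc.cats`. The sketch below uses the two score dictionaries printed above. The 0.5 threshold and the `predict_label` helper are assumptions for illustration, and a different cutoff may suit the data better.

```python
def predict_label(cats, label="TECHNOLOGY", threshold=0.5):
    """Return True when the score for `label` reaches the threshold."""
    return cats.get(label, 0.0) >= threshold

# Score dictionaries copied from the two cells above.
tech_sentence = {"TECHNOLOGY": 0.5724478363990784}
other_sentence = {"TECHNOLOGY": 0.10642561316490173}

is_tech = predict_label(tech_sentence)
is_other_tech = predict_label(other_sentence)
print(is_tech, is_other_tech)
```

With a score of about 0.57, the first sentence only narrowly clears this cutoff, which fits the modest accuracy seen in the training curve.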
