## CEO Letter to Shareholder Demo

In this demo, we'll build a text classifier for sentences in CEO letters to shareholders, tagging each sentence with topics such as technology.

We'll go through an example similar to the [TextClassification Docs](https://prodi.gy/docs/text-classification#workflow).

## Step 1: Matcher Rules

Let's set up these annotation guidelines (definitions) for our labels:
1. "TECHNOLOGY": Electronic object or system (e.g., computing) that helps employees or customers to accomplish tasks
2. "ESG": Environmental, social, and governance (ESG) criteria are a set of standards for a company’s behavior used by socially conscious investors to screen potential investments.
3. "DEI": Describes policies and programs that promote the representation and participation of different groups of individuals, including people of different ages, races and ethnicities, abilities and disabilities, genders, religions, cultures and sexual orientations.
4. "FINANCIAL": Company-specific ("micro") financial topics including profitability, sales / revenue, capital, and balance sheet.
5. "ECONOMIC": Economy-specific ("macro") events including employment levels, asset prices, wages, and trade (not regulation).
6. "WORKPLACE": Workplace-related terms including working from home, returning to the office, and employee retention.

First, specify simple matcher rules.

In [23]:
import srsly

patterns = [
    {"label": "TECHNOLOGY", "pattern": [{"lower": "technology"}]},
    {"label": "TECHNOLOGY", "pattern": [{"lower": "ai"}]},
    {"label": "TECHNOLOGY", "pattern": [{"lower": "platform"}]},
    {"label": "TECHNOLOGY", "pattern": [{"lower": "artificial"}, {"lower": "intelligence"}]},
    {"label": "TECHNOLOGY", "pattern": [{"lower": "analytics"}]},
    {"label": "TECHNOLOGY", "pattern": [{"lower": "fintech"}]},
    {"label": "ESG", "pattern": [{"lower": "sustainability"}]},
    {"label": "ESG", "pattern": [{"lower": "communities"}]},
    {"label": "ESG", "pattern": [{"lower": "environmental"}]},
    {"label": "ESG", "pattern": [{"lower": "climate"}]},
    {"label": "ESG", "pattern": [{"lower": "philanthropy"}]},
    {"label": "ESG", "pattern": [{"lower": "social"}, {"lower": "justice"}]},
    {"label": "DEI", "pattern": [{"lower": "diversity"}]},
    {"label": "DEI", "pattern": [{"lower": "inclusion"}]},
    {"label": "DEI", "pattern": [{"lower": "equity"}]},
    {"label": "FINANCIAL", "pattern": [{"lower": "financial"}]},
    {"label": "FINANCIAL", "pattern": [{"lower": "profit"}]},
    {"label": "FINANCIAL", "pattern": [{"lower": "loss"}]},
    {"label": "FINANCIAL", "pattern": [{"lower": "liquidity"}]},
    {"label": "ECONOMIC", "pattern": [{"lower": "unemployment"}]},
    {"label": "ECONOMIC", "pattern": [{"lower": "economy"}]},
    {"label": "ECONOMIC", "pattern": [{"lower": "economic"}]},
    {"label": "ECONOMIC", "pattern": [{"lower": "inflation"}]},
    {"label": "ECONOMIC", "pattern": [{"lower": "markets"}]},
    {"label": "WORKPLACE", "pattern": [{"lower": "remote"}]},
    {"label": "WORKPLACE", "pattern": [{"lower": "workplace"}]},
    {"label": "WORKPLACE", "pattern": [{"lower": "from"}, {"lower": "home"}]},
    {"label": "WORKPLACE", "pattern": [{"lower": "home"}, {"lower": "office"}]},
]

# write to file
srsly.write_jsonl("../assets/patterns.jsonl", patterns)
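
To get a feel for what these rules do before launching Prodigy, the token patterns can be approximated in plain Python. This is a minimal sketch, not spaCy's `Matcher`: it assumes naive whitespace tokenization with surrounding punctuation stripped, and the names `sample_patterns` and `match_labels` are hypothetical helpers introduced only for illustration.

```python
import string

# A small subset of the patterns defined above.
sample_patterns = [
    {"label": "TECHNOLOGY", "pattern": [{"lower": "ai"}]},
    {"label": "TECHNOLOGY", "pattern": [{"lower": "artificial"}, {"lower": "intelligence"}]},
    {"label": "WORKPLACE", "pattern": [{"lower": "from"}, {"lower": "home"}]},
    {"label": "ESG", "pattern": [{"lower": "climate"}]},
]


def match_labels(text, patterns):
    """Return the sorted labels whose token sequence occurs in the text.

    Simplification: tokens are split on whitespace and stripped of
    surrounding punctuation, which only approximates spaCy's tokenizer.
    """
    tokens = [t.strip(string.punctuation).lower() for t in text.split()]
    found = set()
    for rule in patterns:
        wanted = [tok["lower"] for tok in rule["pattern"]]
        n = len(wanted)
        # Slide a window over the tokens and compare it to the pattern.
        for i in range(len(tokens) - n + 1):
            if tokens[i:i + n] == wanted:
                found.add(rule["label"])
                break
    return sorted(found)


print(match_labels("Many employees now work from home using AI tools.", sample_patterns))
```

With this sample sentence the sketch prints `['TECHNOLOGY', 'WORKPLACE']`: the single-token rule for "ai" and the two-token rule for "from home" both fire, which is why one sentence can carry more than one label.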

In [24]:
!python -m prodigy \
    textcat.teach \
    ceo_data \
    blank:en \
    ../assets/ceo-letters-sample.jsonl \
    --label TECHNOLOGY,ESG,DEI,FINANCIAL,ECONOMIC,WORKPLACE \
    --patterns ../assets/patterns.jsonl

Using 6 label(s): TECHNOLOGY, ESG, DEI, FINANCIAL, ECONOMIC, WORKPLACE

✨  Starting the web server at http://localhost:8080 ...
Open the app in your browser and start annotating!

^C

✔ Saved 214 annotations to database SQLite
Dataset: ceo_data
Session ID: 2022-09-05_16-28-18



In [33]:
!python -m prodigy progress ceo_data

✔ Loaded 214 annotations from 1 datasets

New      New annotations collected in interval
Total    Total annotations collected   
Unique   Unique examples (not counting multiple annotations of same example)


           New   Unique   Total   Unique
--------   ---   ------   -----   ------
Sep 2022   214      169     214      169



In [38]:
# step 2: train initial model
# see https://prodi.gy/docs/recipes#train
!python -m \
    prodigy \
    train \
    ../ceo-topics \
    --textcat-multilabel ceo_data \
    --label-stats

ℹ Using CPU
ℹ Auto-generating config with spaCy
✔ Generated training config
[2022-09-05 17:08:49,270] [INFO] Set up nlp object from config
Components: textcat_multilabel
Merging training and evaluation data for 1 components
  - [textcat_multilabel] Training: 169 | Evaluation: 45 (20% split)
Training: 136 | Evaluation: 33
Labels: textcat_multilabel (6)
[2022-09-05 17:08:49,293] [INFO] Pipeline: ['textcat_multilabel']
[2022-09-05 17:08:49,295] [INFO] Created vocabulary
[2022-09-05 17:08:49,295] [INFO] Finished initializing nlp object
[2022-09-05 17:08:49,364] [INFO] Initialized pipeline components: ['textcat_multilabel']
✔ Initialized pipeline
Components: textcat_multilabel
Merging training and evaluation data for 1 components
  - [textcat_multilabel] Training: 169 | Evaluation: 45 (20% split)
Training: 136 | Evaluation: 33
Labels: textcat_multilabel (6)
ℹ Pipeline: ['textcat_multilabel']
ℹ Initial l

## Step 3: Correct to get gold labels

In [51]:
!python -m \
    prodigy \
    textcat.correct \
    ceo_correct \
    ../ceo-topics/model-last \
    ../assets/ceo-letters-sample.jsonl \
    --label TECHNOLOGY,ESG,DEI,FINANCIAL,ECONOMIC,WORKPLACE \
    --update

Using 6 label(s): TECHNOLOGY, ESG, DEI, FINANCIAL, ECONOMIC, WORKPLACE
ℹ Annotating non-exclusive categories based on 'textcat_multilabel'
component config

✨  Starting the web server at http://localhost:8080 ...
Open the app in your browser and start annotating!

^C

✔ Saved 250 annotations to database SQLite
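
Once corrections are saved, it can be useful to check how often each label was selected. The sketch below assumes the annotations have been exported (for example with `prodigy db-out`) and that each record stores the chosen labels in an `"accept"` list next to an `"answer"` field, as the choice interface does. The three records are made-up placeholders, not rows from `ceo_correct`, and `label_counts` is a hypothetical helper.

```python
from collections import Counter

# Hypothetical records in the shape assumed above (not real data).
examples = [
    {"text": "placeholder sentence 1", "accept": ["TECHNOLOGY"], "answer": "accept"},
    {"text": "placeholder sentence 2", "accept": ["FINANCIAL", "ECONOMIC"], "answer": "accept"},
    {"text": "placeholder sentence 3", "accept": [], "answer": "ignore"},
]


def label_counts(records):
    """Count selected labels across accepted records only."""
    counts = Counter()
    for rec in records:
        # Skip rejected or ignored examples so they do not skew the tally.
        if rec.get("answer") == "accept":
            counts.update(rec.get("accept", []))
    return counts


print(label_counts(examples))
```

A heavily skewed tally is a hint that some labels may need more examples before retraining.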


In [52]:
# retrain the model on the corrected annotations
!python -m \
    prodigy \
    train \
    ../ceo-topics \
    --textcat-multilabel ceo_correct \
    --label-stats

ℹ Using CPU
ℹ Auto-generating config with spaCy
✔ Generated training config
[2022-09-05 17:51:13,781] [INFO] Set up nlp object from config
Components: textcat_multilabel
Merging training and evaluation data for 1 components
  - [textcat_multilabel] Training: 200 | Evaluation: 50 (20% split)
Training: 200 | Evaluation: 50
Labels: textcat_multilabel (6)
[2022-09-05 17:51:13,821] [INFO] Pipeline: ['textcat_multilabel']
[2022-09-05 17:51:13,823] [INFO] Created vocabulary
[2022-09-05 17:51:13,824] [INFO] Finished initializing nlp object
[2022-09-05 17:51:13,917] [INFO] Initialized pipeline components: ['textcat_multilabel']
✔ Initialized pipeline
Components: textcat_multilabel
Merging training and evaluation data for 1 components
  - [textcat_multilabel] Training: 200 | Evaluation: 50 (20% split)
Training: 200 | Evaluation: 50
Labels: textcat_multilabel (6)
ℹ Pipeline: ['textcat_multilabel']
ℹ Initial l

In [34]:
# prodigy configuration: see https://prodi.gy/docs/install#config
# `!export` only affects a throwaway subshell, so set the variable with %env instead
%env PRODIGY_CONFIG=../prodigy.json

In [66]:
# step 3: active learning
# see https://prodi.gy/docs/recipes#textcat-teach 
!python -m \
    prodigy \
    textcat.teach \
    ceo_teach \
    ../ceo-topics/model-last \
    ../assets/ceo-letters-sample.jsonl \
    --label TECHNOLOGY,ESG,DEI,FINANCIAL,ECONOMIC,WORKPLACE \
    --exclude ceo_correct # exclude previously labeled

Using 6 label(s): TECHNOLOGY, ESG, DEI, FINANCIAL, ECONOMIC, WORKPLACE

✨  Starting the web server at http://localhost:8080 ...
Open the app in your browser and start annotating!

^C


In [48]:
# step 4: merge manual + teach
# see https://prodi.gy/docs/recipes#db-merge
!python -m \
    prodigy \
    db-merge \
    ceo_manual,ceo_teach \
    ceo_dataset

✔ Created dataset 'ceo_dataset'
✔ Merged 300 examples from 2 datasets
Created merged dataset 'ceo_dataset'


In [56]:
# step 5: retrain model
# see https://prodi.gy/docs/recipes#train
!python -m \
    prodigy \
    train \
    ceo-tech-model \
    --textcat-multilabel ceo_dataset

ℹ Using CPU
ℹ Auto-generating config with spaCy
✔ Generated training config
[2022-09-05 22:06:53,705] [INFO] Set up nlp object from config
Components: textcat_multilabel
Merging training and evaluation data for 1 components
  - [textcat_multilabel] Training: 200 | Evaluation: 50 (20% split)
Training: 200 | Evaluation: 50
Labels: textcat_multilabel (6)
[2022-09-05 22:06:53,746] [INFO] Pipeline: ['textcat_multilabel']
[2022-09-05 22:06:53,748] [INFO] Created vocabulary
[2022-09-05 22:06:53,749] [INFO] Finished initializing nlp object
[2022-09-05 22:06:53,849] [INFO] Initialized pipeline components: ['textcat_multilabel']
✔ Initialized pipeline
Components: textcat_multilabel
Merging training and evaluation data for 1 components
  - [textcat_multilabel] Training: 200 | Evaluation: 50 (20% split)
Training: 200 | Evaluation: 50
Labels: textcat_multilabel (6)
ℹ Pipeline: ['textcat_multilabel']
ℹ Initial l

In [57]:
# optional: run train-curve for label diagnostic
# see https://prodi.gy/docs/recipes#train-curve
!python -m \
    prodigy \
    train-curve \
    --textcat-multilabel ceo_correct

ℹ Auto-generating config with spaCy
✔ Generated training config
Training 4 times with 25%, 50%, 75%, 100% of the data

%      Score    textcat_multilabel
----   ------   ------
  0%   0.35     0.35  
 25%   0.43 ▲   0.43 ▲
 50%   0.55 ▲   0.55 ▲
 75%   0.62 ▲   0.62 ▲
100%   0.65 ▲   0.65 ▲

✔ Accuracy improved in the last sample
As a rule of thumb, if accuracy increases in the last segment, this could
indicate that collecting more annotations of the same type will improve the
model further.
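
To make the rule of thumb concrete, the gain contributed by each extra quarter of the data can be computed directly from the table. This is a small sketch using only the scores printed above; the variable names are illustrative.

```python
# Scores copied from the train-curve output above.
fractions = [0, 25, 50, 75, 100]
scores = [0.35, 0.43, 0.55, 0.62, 0.65]

# Gain contributed by each additional 25% of the data.
gains = [round(b - a, 2) for a, b in zip(scores, scores[1:])]

for pct, gain in zip(fractions[1:], gains):
    print(f"{pct:>3}%: +{gain:.2f}")

# The curve is still rising if the final segment added a positive gain.
still_improving = gains[-1] > 0
print("still improving:", still_improving)
```

The gains come out as 0.08, 0.12, 0.07 and 0.03, so the last segment still helps, though by less than the earlier ones. That suggests more annotations may improve the model, possibly with diminishing returns.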


## Scoring Text with the Trained spaCy Model

In [59]:
import spacy

nlp = spacy.load("../ceo-topics/model-last")
doc = nlp("As the importance of cloud, AI and digital platforms grows, this competition will become even more formidable.")
print(doc.cats)

{'DEI': 0.002334033139050007, 'WORKPLACE': 0.0006752138724550605, 'TECHNOLOGY': 0.6779870986938477, 'ECONOMIC': 0.03985508531332016, 'ESG': 0.0628301203250885, 'FINANCIAL': 0.060610558837652206}


In [60]:
doc = nlp("We have taken extensive steps to support our employees, who are our greatest strength.")
print(doc.cats)

{'DEI': 0.006951575633138418, 'WORKPLACE': 0.01583658903837204, 'TECHNOLOGY': 0.010677837766706944, 'ECONOMIC': 0.0004547737189568579, 'ESG': 0.04622439295053482, 'FINANCIAL': 0.3732890486717224}
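
Because the pipeline uses `textcat_multilabel`, each score is an independent per-label probability, so the values need not sum to 1 and a sentence can receive several labels or none. One common way to turn `doc.cats` into labels is a fixed cutoff. The sketch below assumes a threshold of 0.5, which is an arbitrary choice rather than a value from the training run, and uses the two score dictionaries above rounded to four decimals. `predicted_labels` and `top_label` are hypothetical helpers.

```python
# Scores copied from the two outputs above, rounded to four decimals.
cats_tech = {"DEI": 0.0023, "WORKPLACE": 0.0007, "TECHNOLOGY": 0.6780,
             "ECONOMIC": 0.0399, "ESG": 0.0628, "FINANCIAL": 0.0606}
cats_staff = {"DEI": 0.0070, "WORKPLACE": 0.0158, "TECHNOLOGY": 0.0107,
              "ECONOMIC": 0.0005, "ESG": 0.0462, "FINANCIAL": 0.3733}


def predicted_labels(cats, threshold=0.5):
    """Return every label whose score meets the (assumed) threshold."""
    return sorted(label for label, score in cats.items() if score >= threshold)


def top_label(cats):
    """Return the single highest-scoring label, regardless of threshold."""
    return max(cats, key=cats.get)


print(predicted_labels(cats_tech))   # labels for the cloud/AI sentence
print(predicted_labels(cats_staff))  # labels for the employee sentence
print(top_label(cats_staff))
```

Under this cutoff the first sentence is labeled `TECHNOLOGY`, while the second gets no label at all: its highest score is `FINANCIAL` at about 0.37, which looks like a weak and probably wrong guess for a sentence about employees.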
