From [Building a Text Classifier with Spacy 3.0](https://medium.com/analytics-vidhya/building-a-text-classifier-with-spacy-3-0-dd16e9979a)

In [8]:
import spacy
# tqdm provides progress bars for Python loops;
# tqdm.auto picks a text-based bar in the console
# and an HTML-based bar in Jupyter notebooks
from tqdm.auto import tqdm
# DocBin is spaCy's binary format for storing Docs,
# which are used later for training
from spacy.tokens import DocBin

# !pip install ml-datasets
# A collection of public datasets for machine learning and deep learning.
import ml_datasets
from ml_datasets import imdb

# disable jedi so that tab auto-completion works in the notebook
%config Completer.use_jedi = False

[ml_datasets](https://pypi.org/project/ml-datasets/) provides the following datasets:

| ID / Function             | Description                                  | NLP task                                  | From URL |
| :------------------------ | :------------------------------------------- | :---------------------------------------- | :------: |
| `imdb`                    | IMDB sentiment dataset                       | Binary classification: sentiment analysis |    ✓     |
| `dbpedia`                 | DBPedia ontology dataset                     | Multi-class single-label classification   |    ✓     |
| `cmu`                     | CMU movie genres dataset                     | Multi-class, multi-label classification   |    ✓     |
| `quora_questions`         | Duplicate Quora questions dataset            | Detecting duplicate questions             |    ✓     |
| `reuters`                 | Reuters dataset (texts not included)         | Multi-class multi-label classification    |    ✓     |
| `snli`                    | Stanford Natural Language Inference corpus   | Recognizing textual entailment            |    ✓     |
| `stack_exchange`          | Stack Exchange dataset                       | Question Answering                        |    ✓     |
| `ud_ancora_pos_tags`      | Universal Dependencies Spanish AnCora corpus | POS tagging                               |    ✓     |
| `ud_ewtb_pos_tags`        | Universal Dependencies English EWT corpus    | POS tagging                               |    ✓     |
| `wikiner`                 | WikiNER data                                 | Named entity recognition                  |    ✓     |


In [9]:
dir(ml_datasets)

['__builtins__',
 '__cached__',
 '__doc__',
 '__file__',
 '__loader__',
 '__name__',
 '__package__',
 '__path__',
 '__spec__',
 '_registry',
 'cmu',
 'dbpedia',
 'imdb',
 'loaders',
 'mnist',
 'quora_questions',
 'reuters',
 'snli',
 'stack_exchange',
 'ud_ancora_pos_tags',
 'ud_ewtb_pos_tags',
 'util']

## Data preparation

In [14]:
# We want to classify movie reviews as positive or negative
# http://ai.stanford.edu/~amaas/data/sentiment/
# imdb() returns the training and validation sets,
# each a list of (text, label) tuples
train_data, valid_data = imdb()

print(len(train_data), type(train_data))
print(len(valid_data), type(valid_data))

print(train_data[0])

25000 <class 'list'>
25000 <class 'list'>
('- A film crew is shooting a horror movie in an old, supposedly cursed house where over the years, seven people have mysteriously died. One of the crew finds an old book of spells and it looks like it would be perfect to use in some of the ritual scenes in their movie. It is reasoned that the spells in the book are better written than the script they are using. But as the book is read, the graveyard outside suddenly comes to life. Now the cast and crew are faced with real danger .\n\n\n\n- IMDb lists a running time of 90 minutes. For the first 60 of those minutes, nothing happens. Far too much time is spent on the movie within a movie. Are we supposed to be frightened by the horror movie that they are shooting? We already know that their movie isn\'t "real". These scares just don\'t work.\n\n\n\n- There are very few things to enjoy about The House of Seven Corpses. The acting is atrocious. Most of these "actors" would have trouble making a ele

In [18]:
# load a medium-sized English language model in spaCy
# !python -m spacy download en_core_web_md
nlp = spacy.load("en_core_web_md")
print(nlp.pipe_names)

doc = nlp('load a medium sized english language model in spacy')
print(f'doc.cats = {doc.cats}')
print(f'doc.ents = {doc.ents}')

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']
doc.cats = {}
doc.ents = (english,)


In [19]:
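def make_docs(data):
    """
    Transform a list of texts and labels into spaCy documents.
    
    data: list of (text, label) tuples, where label is 'pos' or 'neg'
    
    returns: list of spacy.tokens.Doc
    """
    
    docs = []
    # nlp.pipe(texts) is much faster than calling 
    # nlp(text) on each text separately.
    # With as_tuples=True we can pass in (text, context) tuples:
    # the first item is processed as text,
    # the second one is returned unchanged.
    
    for doc, label in tqdm(nlp.pipe(data, as_tuples=True), total=len(data)):
        
        # set the text categories for each document;
        # the IMDB labels are the strings 'pos' and 'neg', while
        # doc.cats expects a float score per category.
        # textcat treats its classes as mutually exclusive,
        # so we store both categories
        doc.cats["positive"] = 1.0 if label == "pos" else 0.0
        doc.cats["negative"] = 1.0 - doc.cats["positive"]
        
        # collect the annotated docs in a list
        docs.append(doc)
    
    return docs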

# num_texts controls how many reviews are used.
# A smaller value (e.g. 5000) keeps the training time short;
# here we use all of them, since in practice
# you should take as much data as you can get.
num_texts = len(train_data)
# first we need to transform all the training data
train_docs = make_docs(train_data[:num_texts])
# then we save it in a binary file on disk
doc_bin = DocBin(docs=train_docs)
doc_bin.to_disk("./data/train.spacy")

num_texts = len(valid_data)
# repeat for validation data
valid_docs = make_docs(valid_data[:num_texts])
doc_bin = DocBin(docs=valid_docs)
doc_bin.to_disk("./data/valid.spacy")

  0%|          | 0/25000 [00:00<?, ?it/s]

  0%|          | 0/25000 [00:00<?, ?it/s]
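
To check what was written, a `.spacy` file can be read back with `DocBin`. The sketch below round-trips two toy documents through bytes rather than the files above, and uses a blank English pipeline (`spacy.blank("en")`) so that no model download is needed. The example texts, the scores and the name `nlp_blank` are made up for illustration.

```python
import spacy
from spacy.tokens import DocBin

# a blank pipeline only tokenizes, which is enough for this check
nlp_blank = spacy.blank("en")

examples = [("A wonderful film.", 1.0), ("A tedious mess.", 0.0)]
docs = []
for text, score in examples:
    doc = nlp_blank.make_doc(text)
    doc.cats["positive"] = score
    docs.append(doc)

# serialize to bytes and restore; to_disk/from_disk do the same with a path
data = DocBin(docs=docs).to_bytes()
restored = list(DocBin().from_bytes(data).get_docs(nlp_blank.vocab))

for doc in restored:
    print(doc.text, doc.cats)
```

The categories survive the round trip, so a quick loop like this is a cheap way to spot label mistakes before starting a long training run.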

In [31]:
print(len([label for text, label in train_data if label=='neg']))
print(len([label for text, label in train_data if label=='pos'])) 

12500
12500
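
The counts show that the training set is balanced and that the labels are the strings `'pos'` and `'neg'`. spaCy keeps text categories in `doc.cats` as a mapping from label name to a float score, so string labels need converting before they are assigned. A minimal sketch of that conversion in plain Python follows; the helper name `label_to_cats`, the `negative` category name and the toy data are assumptions for illustration.

```python
from collections import Counter

def label_to_cats(label):
    """Map an IMDB label ('pos' or 'neg') to a dict of category scores.

    Hypothetical helper: the category names are illustrative choices.
    """
    if label not in ("pos", "neg"):
        raise ValueError(f"unexpected label: {label!r}")
    is_positive = label == "pos"
    return {"positive": float(is_positive), "negative": float(not is_positive)}

# toy stand-in for train_data: a list of (text, label) tuples
sample = [("great film", "pos"), ("dull plot", "neg"), ("loved it", "pos")]

# class balance, equivalent to the two list comprehensions above
label_counts = Counter(label for _, label in sample)
cats = [label_to_cats(label) for _, label in sample]
print(label_counts)
print(cats[0])
```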


## Generating the config file

In [21]:
#!python -m spacy init fill-config ./config/base_config.cfg ./config/config.cfg

2021-08-29 08:51:25.712934: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
✔ Auto-filled config with all values
✔ Saved config
config/config.cfg
You can now add your data and train your pipeline:
python -m spacy train config.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy
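
The hint printed by `init fill-config` passes the data locations as dotted command-line overrides: `--paths.train` addresses the `train` key of the config's `[paths]` section. The plain-Python sketch below illustrates that mapping on a nested dict. `apply_overrides` is a hypothetical helper written for this illustration, not spaCy's implementation, it handles only one level of nesting, and the `None` defaults are an assumption about the config contents.

```python
def apply_overrides(config, overrides):
    """Return a copy of a nested dict with dotted-key overrides applied.

    Hypothetical helper that mimics how '--paths.train X' addresses
    the 'train' key inside the 'paths' section.
    """
    result = {section: dict(values) for section, values in config.items()}
    for dotted_key, value in overrides.items():
        section, key = dotted_key.split(".", 1)
        if section not in result or key not in result[section]:
            raise KeyError(f"unknown config key: {dotted_key}")
        result[section][key] = value
    return result

# assumed shape of the relevant config section before any override
config = {"paths": {"train": None, "dev": None}}

resolved = apply_overrides(
    config,
    {"paths.train": "./data/train.spacy", "paths.dev": "./data/valid.spacy"},
)
print(resolved["paths"])
```

Keeping the paths out of the config file and supplying them at training time means the same config can be reused with different datasets.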


## Training

In [25]:
!python -m spacy train ./config/config.cfg --output ./output --gpu-id 0

2021-08-29 08:54:22.230562: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
ℹ Using GPU: 0
[2021-08-29 08:54:25,339] [INFO] Set up nlp object from config
[2021-08-29 08:54:25,355] [INFO] Pipeline: ['transformer', 'textcat']
[2021-08-29 08:54:25,362] [INFO] Created vocabulary
[2021-08-29 08:54:25,363] [INFO] Finished initializing nlp object
Some weights of the model checkpoint at roberta-base were not used when initializing RobertaModel: ['lm_head.dense.weight', 'lm_head.layer_norm.bias', 'lm_head.layer_norm.weight', 'lm_head.decoder.weight', 'lm_head.bias', 'lm_head.dense.bias']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that

In [32]:
!python -m spacy train ./config/config.cfg --output ./output --gpu-id 0 --paths.train ./data/train.spacy --paths.dev ./data/valid.spacy

2021-08-29 09:04:20.809378: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
ℹ Using GPU: 0
[2021-08-29 09:04:23,898] [INFO] Set up nlp object from config
[2021-08-29 09:04:23,915] [INFO] Pipeline: ['transformer', 'textcat']
[2021-08-29 09:04:23,921] [INFO] Created vocabulary
[2021-08-29 09:04:23,922] [INFO] Finished initializing nlp object
Some weights of the model checkpoint at roberta-base were not used when initializing RobertaModel: ['lm_head.bias', 'lm_head.dense.bias', 'lm_head.decoder.weight', 'lm_head.dense.weight', 'lm_head.layer_norm.bias', 'lm_head.layer_norm.weight']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that

## Evaluation

In [None]:
!python -m spacy evaluate output/gpu_aug/model-best ./data/test_aug.spacy --gpu-id 0

In [None]:
import spacy
# load the best model from training
nlp = spacy.load("output/model-best")
text = ""
print("type 'quit' to exit")

# # predict the sentiment until someone writes quit
# while text != "quit":
#     text = input("Please enter example input: ")
#     doc = nlp(text)
#     if doc.cats['positive'] > 0.5:
#         print("the sentiment is positive")
#     else:
#         print("the sentiment is negative")
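
The commented-out loop decides the sentiment by comparing the `positive` score with 0.5. That decision rule can be tried without a trained model by applying it to a plain dict shaped like `doc.cats`. The function name, the `threshold` parameter and the example scores below are assumptions for illustration.

```python
def sentiment_from_cats(cats, threshold=0.5):
    """Return 'positive' if the positive score exceeds the threshold, else 'negative'."""
    return "positive" if cats["positive"] > threshold else "negative"

# made-up scores standing in for doc.cats of two reviews
confident_review = {"positive": 0.93}
critical_review = {"positive": 0.12}

print(sentiment_from_cats(confident_review))
print(sentiment_from_cats(critical_review))
```

Making the threshold a parameter allows a stricter cut-off when false positives are more costly than false negatives.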