<a href="https://colab.research.google.com/github/zhangxs131/NER/blob/main/NER_with_spacy.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

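# NER with the spaCy framework

We first try out spaCy's built-in NER pipeline and show how to use it. Note that spaCy's sentence segmentation has some weaknesses.

We then train an NER model on the CoNLL-2003 dataset, compare it with the pretrained en_core_web_sm model, and visualize the results.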

In [None]:
!python -m spacy download en_core_web_lg

In [4]:
import en_core_web_lg

nlp=en_core_web_lg.load()

In [5]:
mytext = """SAN FRANCISCO — Shortly after Apple used a new tax law last year to bring back most of the $252 billion it had held abroad, the company said it would buy back $100 billion of its stock.

On Tuesday, Apple announced its plans for another major chunk of the money: It will buy back a further $75 billion in stock.

“Our first priority is always looking after the business and making sure we continue to grow and invest,” Luca Maestri, Apple’s finance chief, said in an interview. “If there is excess cash, then obviously we want to return it to investors.”

Apple’s record buybacks should be welcome news to shareholders, as the stock price is likely to climb. But the buybacks could also expose the company to more criticism that the tax cuts it received have mostly benefited investors and executives.
"""

doc=nlp(mytext)
for ent in doc.ents:
  print(ent.text,'\t',ent.label_)

print('_______________________________\n')
for sent in doc.sents:
  print(sent.text)
  print('_________________________________')

SAN FRANCISCO 	 GPE
Apple 	 ORG
last year 	 DATE
$252 billion 	 MONEY
$100 billion 	 MONEY
Tuesday 	 DATE
Apple 	 ORG
a further $75 billion 	 MONEY
first 	 ORDINAL
Luca Maestri 	 PERSON
Apple 	 ORG
Apple 	 ORG
_______________________________

SAN FRANCISCO —
_________________________________
Shortly after Apple used a new tax law last year to bring back most of the $252 billion it had held abroad, the company said it would buy back $100 billion of its stock.


_________________________________
On Tuesday, Apple announced its plans for another major chunk of the money: It will buy back a further $75 billion in stock.


_________________________________
“Our first priority is always looking after the business and making sure we continue to grow and invest,” Luca Maestri, Apple’s finance chief, said in an interview.
_________________________________
“If there is excess cash, then obviously we want to return it to investors.”


_________________________________
Apple’s record buybacks shou
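
The output above shows the segmentation weakness: the dateline fragment is split off as a sentence of its own, and several sentences keep their trailing newlines. Below is a minimal sketch of one possible clean-up step, assuming a simple heuristic (a fragment that ends in a dash is glued onto the following sentence). `merge_dateline` is a hypothetical helper, not part of spaCy, and the sample sentences are shortened versions of the output above.

```python
from typing import List

# Characters treated as "dangling dash" markers: em dash, en dash, hyphen.
DASHES = ("\u2014", "\u2013", "-")


def merge_dateline(sentences: List[str]) -> List[str]:
    """Strip whitespace and glue a fragment ending in a dash onto the next sentence."""
    merged: List[str] = []
    carry = ""
    for raw in sentences:
        text = raw.strip()
        if not text:
            continue  # skip whitespace-only segments
        if carry:
            text = carry + " " + text
            carry = ""
        if text.endswith(DASHES):
            carry = text  # hold the fragment until the next sentence arrives
        else:
            merged.append(text)
    if carry:  # a trailing fragment with nothing after it is kept as-is
        merged.append(carry)
    return merged


# Shortened, illustrative versions of the segments printed above.
sents = [
    "SAN FRANCISCO \u2014",
    "Shortly after Apple used a new tax law.\n\n",
    "On Tuesday, Apple announced its plans.\n\n",
]
cleaned = merge_dateline(sents)
print(cleaned)
```

In the notebook, the input list could be built with `[sent.text for sent in doc.sents]`. The heuristic is deliberately crude and would also merge a genuine sentence that happens to end in a hyphen.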

# Training an NER model with spaCy (the CoNLL-2003 data must be downloaded in advance)

CoNLL-2003 download page: https://deepai.org/dataset/conll-2003-english

In [6]:
# upload train.txt, test.txt, valid.txt from Data/conll2003/en
try:
    from google.colab import files
    uploaded = files.upload()
except ModuleNotFoundError:
    print('Not using colab')

Saving conll2003.zip to conll2003.zip


In [7]:
!unzip conll2003.zip

Archive:  conll2003.zip
  inflating: metadata                
  inflating: test.txt                
  inflating: train.txt               
  inflating: valid.txt               


In [8]:
import os
os.makedirs('spacyNER_data', exist_ok=True)  # safe to re-run: no error if the directory already exists

try:
    import google.colab 
    !python -m spacy convert "train.txt" spacyNER_data -c ner
    !python -m spacy convert "test.txt" spacyNER_data -c ner
    !python -m spacy convert "valid.txt" spacyNER_data -c ner
except ModuleNotFoundError:
    !python -m spacy convert "Data/conll2003/en/train.txt" spacyNER_data -c ner
    !python -m spacy convert "Data/conll2003/en/test.txt" spacyNER_data -c ner
    !python -m spacy convert "Data/conll2003/en/valid.txt" spacyNER_data -c ner

ℹ Auto-detected token-per-line NER format
ℹ Grouping every 1 sentences into a document.
⚠ To generate better training data, you may want to group sentences
into documents with `-n 10`.
✔ Generated output file (14987 documents): spacyNER_data/train.json
ℹ Auto-detected token-per-line NER format
ℹ Grouping every 1 sentences into a document.
⚠ To generate better training data, you may want to group sentences
into documents with `-n 10`.
✔ Generated output file (3684 documents): spacyNER_data/test.json
ℹ Auto-detected token-per-line NER format
ℹ Grouping every 1 sentences into a document.
⚠ To generate better training data, you may want to group sentences
into documents with `-n 10`.
✔ Generated output file (3466 documents): spacyNER_data/valid.json


In [11]:
# Inspect the data before conversion

!echo 'Before: (train.txt)'
!head 'train.txt' -n 11 |tail -n 9

Before: (train.txt)
EU NNP B-NP B-ORG
rejects VBZ B-VP O
German JJ B-NP B-MISC
call NN I-NP O
to TO B-VP O
boycott VB I-VP O
British JJ B-NP B-MISC
lamb NN I-NP O
. . O O
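
Each line of the raw file holds one token with four space-separated columns (word, POS tag, chunk tag, NER tag), and a blank line ends a sentence. As a rough sketch of how this format can be read without spaCy, the hypothetical `read_conll` helper below groups rows into sentences. It assumes that document boundaries are marked by `-DOCSTART-` lines, which the sample includes for illustration.

```python
from typing import List, Tuple

Token = Tuple[str, str, str, str]  # (word, POS tag, chunk tag, NER tag)


def read_conll(lines: List[str]) -> List[List[Token]]:
    """Group token-per-line CoNLL-2003 rows into sentences (blank line = boundary)."""
    sentences: List[List[Token]] = []
    current: List[Token] = []
    for line in lines:
        line = line.strip()
        # Blank lines and document markers both close the current sentence.
        if not line or line.startswith("-DOCSTART-"):
            if current:
                sentences.append(current)
                current = []
            continue
        word, pos, chunk, ner = line.split()
        current.append((word, pos, chunk, ner))
    if current:
        sentences.append(current)
    return sentences


sample = """-DOCSTART- -X- -X- O

EU NNP B-NP B-ORG
rejects VBZ B-VP O
German JJ B-NP B-MISC
call NN I-NP O
to TO B-VP O
boycott VB I-VP O
British JJ B-NP B-MISC
lamb NN I-NP O
. . O O
""".splitlines()

sentences = read_conll(sample)
ner_tags = [token[3] for token in sentences[0]]
print(len(sentences), ner_tags)
```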


In [12]:
# Data after conversion
!echo "AFTER : (spacyNER_data/train.json)"
!head "spacyNER_data/train.json" -n 77 | tail -n 58

AFTER : (spacyNER_data/train.json)
  {
    "id":1,
    "paragraphs":[
      {
        "sentences":[
          {
            "tokens":[
              {
                "orth":"EU",
                "tag":"NNP",
                "ner":"U-ORG"
              },
              {
                "orth":"rejects",
                "tag":"VBZ",
                "ner":"O"
              },
              {
                "orth":"German",
                "tag":"JJ",
                "ner":"U-MISC"
              },
              {
                "orth":"call",
                "tag":"NN",
                "ner":"O"
              },
              {
                "orth":"to",
                "tag":"TO",
                "ner":"O"
              },
              {
                "orth":"boycott",
                "tag":"VB",
                "ner":"O"
              },
              {
                "orth":"British",
                "tag":"JJ",
                "ner":"U-MISC"
              },
              {
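
Note that the tags changed during conversion: `B-ORG` for "EU" became `U-ORG`, because the converted data uses the BILUO scheme (Begin, In, Last, Unit, Out), in which a single-token entity is tagged `U-`. The following is a minimal sketch of that mapping for well-formed IOB2 input. `iob2_to_biluo` is a hypothetical helper written for illustration and is not spaCy's own implementation.

```python
from typing import List


def iob2_to_biluo(tags: List[str]) -> List[str]:
    """Convert IOB2 tags (B-/I-/O) to BILUO (Begin, In, Last, Unit, Out)."""
    biluo: List[str] = []
    for i, tag in enumerate(tags):
        if tag == "O":
            biluo.append(tag)
            continue
        prefix, label = tag.split("-", 1)
        next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
        continues = next_tag == "I-" + label  # does the entity extend to the next token?
        if prefix == "B":
            biluo.append(("B-" if continues else "U-") + label)
        else:  # prefix == "I"
            biluo.append(("I-" if continues else "L-") + label)
    return biluo


single = iob2_to_biluo(["B-ORG", "O", "B-MISC", "O", "O", "O", "B-MISC", "O", "O"])
multi = iob2_to_biluo(["B-PER", "I-PER", "I-PER", "O"])
print(single)  # ['U-ORG', 'O', 'U-MISC', 'O', 'O', 'O', 'U-MISC', 'O', 'O']
print(multi)   # ['B-PER', 'I-PER', 'L-PER', 'O']
```

The first call reproduces the `ner` values shown in the JSON above for the sentence "EU rejects German call to boycott British lamb ."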


In [13]:
# Train the model
!python -m spacy train en model spacyNER_data/train.json spacyNER_data/valid.json -G -p tagger,ner

✔ Created output directory: model
Training pipeline: ['tagger', 'ner']
Starting with blank model 'en'
Counting training words (limit=0)

Itn  Tag Loss    Tag %    NER Loss   NER P   NER R   NER F   Token %  CPU WPS
---  ---------  --------  ---------  ------  ------  ------  -------  -------
  1  31226.986    94.087  17033.296  83.526  82.767  83.145  100.000     4555
  2  16668.119    94.835   7955.035  86.424  85.813  86.117  100.000     4646
  3  13555.303    95.050   5270.181  87.443  86.957  87.199  100.000     4563
  4  11789.310    95.225   4058.905  87.983  87.731  87.857  100.000     4369
  5  10602.921    95.341   3063.747  88.313  88.001  88.156  100.000     4427
  6   9522.561    95.434   2700.996  88.521  88.253  88.387  100.000     4465
  7   8925.069    95.485   2326.025  88.677  88.573  88.625  100.000     4472
  8   8423.105    95.516   1986.350  88.449  88.270  88.359  100.000     4450
  9   7761.064    95.541   1926.637  88.675  8

# Evaluate on the test set

In [18]:
if not os.path.exists('result'):
  os.mkdir('result')
!python -m spacy evaluate model/model-best spacyNER_data/test.json -dp result

Time      11.11 s
Words     46666  
Words/s   4202   
TOK       100.00 
POS       95.16  
UAS       0.00   
LAS       0.00   
NER P     81.78  
NER R     82.15  
NER F     81.96  
Textcat   0.00   

✔ Generated 25 parses as HTML
result
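
The NER F value reported above is the harmonic mean of precision and recall, F = 2PR / (P + R). As a quick sanity check, the small sketch below (the helper name `f_score` is just for illustration) recomputes it from the P and R columns, both for the test set and for the first training iteration.

```python
def f_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both on the same scale, e.g. percent)."""
    if precision + recall == 0:
        return 0.0  # avoid division by zero when nothing was predicted or matched
    return 2 * precision * recall / (precision + recall)


test_f = f_score(81.78, 82.15)      # NER P / NER R on the test set
iter1_f = f_score(83.526, 82.767)   # NER P / NER R after training iteration 1
print(round(test_f, 2))   # 81.96
print(round(iter1_f, 3))  # 83.145
```

Because the printed P and R are already rounded, a recomputed F can occasionally differ from the reported one in the last digit.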


# Evaluating the pretrained model en_core_web_sm

In [19]:
!python -m spacy download en_core_web_sm

Collecting en_core_web_sm==2.2.5
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.2.5/en_core_web_sm-2.2.5.tar.gz (12.0 MB)
     |████████████████████████████████| 12.0 MB 11.2 MB/s 
✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')


In [20]:
if not os.path.exists('pretrained_result'):
  os.mkdir('pretrained_result')
!python -m spacy evaluate en_core_web_sm spacyNER_data/test.json -dp pretrained_result

Time      18.66 s
Words     46666  
Words/s   2501   
TOK       100.00 
POS       86.21  
UAS       0.00   
LAS       0.00   
NER P     6.51   
NER R     9.17   
NER F     7.62   
Textcat   0.00   

✔ Generated 25 parses as HTML
pretrained_result
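
The very low NER scores for `en_core_web_sm` are probably caused largely by a label mismatch rather than by poor recognition. The pretrained model predicts OntoNotes-style labels such as `PERSON`, `GPE`, `DATE` and `MONEY` (visible in the first example of this notebook), while the converted CoNLL-2003 data only contains `PER`, `LOC`, `ORG` and `MISC`, and an entity with a different label counts as an error. The sketch below shows one way the comparison could be made fairer by remapping predicted labels before scoring. The mapping table and `remap_entities` are hypothetical choices made for illustration; they are not part of spaCy or of the evaluation run above.

```python
from typing import Dict, List, Tuple

Entity = Tuple[int, int, str]  # (start_char, end_char, label)

# Assumed, approximate mapping from OntoNotes-style labels to CoNLL-2003 labels.
ONTONOTES_TO_CONLL: Dict[str, str] = {
    "PERSON": "PER",
    "ORG": "ORG",
    "GPE": "LOC",
    "LOC": "LOC",
    "NORP": "MISC",
    "EVENT": "MISC",
    "LANGUAGE": "MISC",
}


def remap_entities(entities: List[Entity]) -> List[Entity]:
    """Rename mappable labels and drop entity types CoNLL-2003 does not annotate."""
    return [
        (start, end, ONTONOTES_TO_CONLL[label])
        for start, end, label in entities
        if label in ONTONOTES_TO_CONLL
    ]


# Character offsets of the first four entities in the news text used earlier.
predicted = [(0, 13, "GPE"), (30, 35, "ORG"), (55, 64, "DATE"), (91, 103, "MONEY")]
remapped = remap_entities(predicted)
print(remapped)  # [(0, 13, 'LOC'), (30, 35, 'ORG')]
```

Even with such a remapping, the two annotation schemes differ in what they treat as an entity, so the resulting scores would still only be a rough comparison.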


In [21]:
# Visualize the results: en_core_web_sm
from IPython.display import HTML
with open('pretrained_result/entities.html','r') as f:
  html=f.read()
HTML(html)

In [22]:
# Visualize the results: our own trained model
from IPython.display import HTML
with open('result/entities.html','r') as f:
  html=f.read()
HTML(html)