<a href="https://colab.research.google.com/github/syq-tju/Bert/blob/main/BertforNER.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
from transformers import BertTokenizer, BertForTokenClassification
from transformers import pipeline
import torch

# Load the model and tokenizer
model_name = "dslim/bert-base-NER"
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForTokenClassification.from_pretrained(model_name)

# Initialize the NER pipeline
ner_pipeline = pipeline("ner", model=model, tokenizer=tokenizer)

# Example text
text = "Apple is looking at buying U.K. startup for $1 billion"

# Run named entity recognition with the pipeline
ner_results = ner_pipeline(text)

# Print one line per tagged token
for entity in ner_results:
    print(f"Entity: {entity['word']}, Type: {entity['entity']}, Score: {entity['score']:.4f}")


Some weights of the model checkpoint at dslim/bert-base-NER were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Entity: Apple, Type: B-ORG, Score: 0.9987
Entity: U, Type: B-LOC, Score: 0.9995
Entity: ., Type: I-LOC, Score: 0.9982
Entity: K, Type: I-LOC, Score: 0.9990
Entity: ., Type: I-LOC, Score: 0.9807


In [2]:
def merge_entities(ner_results):
    """
    Merge the entity fragments returned by the NER model.

    Args:
        ner_results: List of recognition results; each element is a dict
            with the keys 'word', 'entity', and 'score'.

    Returns:
        A list of merged entities; each element is a dict with the keys
        'word', 'type', and 'average_score'.
    """
    merged_entities = []
    current_entity = None

    for token in ner_results:
        if token['entity'].startswith('B-'):
            # Start a new entity, closing any entity still in progress
            if current_entity:
                merged_entities.append(current_entity)
            current_entity = {
                "words": [token['word']],
                "type": token['entity'][2:],
                "scores": [token['score']]
            }
        elif token['entity'].startswith('I-') and current_entity and current_entity['type'] == token['entity'][2:]:
            # Continue the current entity
            current_entity['words'].append(token['word'])
            current_entity['scores'].append(token['score'])
        else:
            # Close the current entity and ignore the invalid tag
            # (an I- tag with no matching open entity, or any other label)
            if current_entity:
                merged_entities.append(current_entity)
                current_entity = None

    # Append the last entity, if any
    if current_entity:
        merged_entities.append(current_entity)

    # Join each entity's tokens with spaces and average their scores
    for entity in merged_entities:
        entity['word'] = ' '.join(entity['words'])
        entity['average_score'] = sum(entity['scores']) / len(entity['scores'])
        del entity['words'], entity['scores']

    return merged_entities

# Example usage

# Merge the entity fragments from the previous cell
merged_entities = merge_entities(ner_results)

# Print the merged entities
for entity in merged_entities:
    print(f"Entity: {entity['word']}, Type: {entity['type']}, Average Score: {entity['average_score']:.4f}")


Entity: Apple, Type: ORG, Average Score: 0.9987
Entity: U . K ., Type: LOC, Average Score: 0.9943
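
The merged location prints as `U . K .` because `merge_entities` joins tokens with spaces. One possible refinement, sketched below, is to slice the entity out of the original text using character offsets. This is a minimal sketch under assumptions, not part of the original notebook: the function name `merge_entities_by_offsets` is hypothetical, and it assumes every token dict carries integer `start` and `end` keys. To my understanding the pipeline only fills these in when a fast tokenizer is used, so with the slow `BertTokenizer` loaded above they may be `None`. The sample tokens are hand-written, with offsets counted from the example sentence and scores copied from the rounded output above.

```python
def merge_entities_by_offsets(text, ner_results):
    """Merge B-/I- tagged tokens and slice each entity out of `text`.

    Assumption: every token dict has integer 'start' and 'end' character
    offsets in addition to 'entity' and 'score'.
    """
    spans = []
    current = None

    for token in ner_results:
        prefix, _, label = token["entity"].partition("-")
        if prefix == "I" and current is not None and current["type"] == label:
            # Extend the open entity up to the end of this token
            current["end"] = token["end"]
            current["scores"].append(token["score"])
            continue
        # Any other tag closes the entity in progress
        if current is not None:
            spans.append(current)
            current = None
        if prefix == "B":
            current = {
                "type": label,
                "start": token["start"],
                "end": token["end"],
                "scores": [token["score"]],
            }

    if current is not None:
        spans.append(current)

    return [
        {
            "word": text[span["start"]:span["end"]],
            "type": span["type"],
            "average_score": sum(span["scores"]) / len(span["scores"]),
        }
        for span in spans
    ]


# Hand-written sample mirroring the token-level output above (illustrative only)
sample_text = "Apple is looking at buying U.K. startup for $1 billion"
sample_results = [
    {"word": "Apple", "entity": "B-ORG", "score": 0.9987, "start": 0, "end": 5},
    {"word": "U", "entity": "B-LOC", "score": 0.9995, "start": 27, "end": 28},
    {"word": ".", "entity": "I-LOC", "score": 0.9982, "start": 28, "end": 29},
    {"word": "K", "entity": "I-LOC", "score": 0.9990, "start": 29, "end": 30},
    {"word": ".", "entity": "I-LOC", "score": 0.9807, "start": 30, "end": 31},
]

offset_merged = merge_entities_by_offsets(sample_text, sample_results)
for entity in offset_merged:
    print(f"Entity: {entity['word']}, Type: {entity['type']}, Average Score: {entity['average_score']:.4f}")
```

Slicing the source text also sidesteps WordPiece continuation markers such as `##`, which a space join would leave in the entity string. The `pipeline` call additionally accepts an `aggregation_strategy` argument (for example `"simple"`) that groups tokens on its own, which may be the simpler route when a hand-written merge is not required.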
