Dataset: https://github.com/thunlp/Few-NERD
I use the inter subset, downloaded from https://cloud.tsinghua.edu.cn/f/a176a4870f0a4f8ba0db/?dl=1 (download link)

I need to build an NER model that detects mountain names in sentences.

First I need to understand the data. It has two columns: the word and its NER label.

Final dataset
https://drive.google.com/drive/folders/1giuhBNTPA5BoNy0vajkpiVLXUwP9ncvd?usp=sharing

In [1]:
import pandas as pd
import warnings
warnings.simplefilter('ignore')


data = pd.read_csv('inter/train.txt', sep="\t", header=None)
data = data.rename(columns={0: "word", 1: "label"}, errors="raise")
print("Data shape:", data.shape)

Data shape: (3455940, 2)


In [2]:
data[26:38]

             word                   label
26            The  organization-education
27      Institute  organization-education
28             of  organization-education
29  International  organization-education
30        Finance  organization-education
31       meetings                       O
32            are                       O
33          being                       O
34           held                       O
35             at                       O


In [3]:
print("Amount of mountain targets in dataset is only:", data.label.value_counts()['location-mountain'])
data.label.value_counts()[:15]

Amount of mountain targets in dataset is only: 6600


O                                           2873658
location-GPE                                 130205
organization-other                            61718
organization-company                          41167
organization-education                        33839
person-artist/author                          31553
person-politician                             24898
organization-sportsteam                       24445
location-road/railway/highway/transit         20506
other-award                                   17276
product-other                                 16198
event-attack/battle/war/militaryconflict      15560
other-biologything                            13034
organization-media/newspaper                  11969
art-film                                      11575
Name: label, dtype: int64

As we can see, the dataset has 3455940 tokens in total, spread over many labels. Of these, 2873658 are labeled O (no entity), so only 582282 tokens belong to any entity, and I only need location-mountain.
The dataset has 6600 location-mountain tokens, about 0.19% of all tokens. To train the model, I am going to build a subset made only of the sentences that contain location-mountain.
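
The arithmetic above can be reproduced directly from the label counts. This is a minimal sketch on a toy label column (the values in `labels` are illustrative, not taken from Few-NERD); on the real data the same lines would run on `data.label`.

```python
import pandas as pd

# Toy stand-in for the real label column (illustrative values only).
labels = pd.Series(
    ["O", "O", "location-mountain", "location-GPE", "O", "location-mountain"]
)

counts = labels.value_counts()
total = int(counts.sum())                           # all tokens
outside = int(counts.get("O", 0))                   # tokens with no entity
mountain = int(counts.get("location-mountain", 0))  # target tokens

labeled = total - outside  # tokens that belong to any entity
print(f"labeled tokens: {labeled} of {total}")
print(f"mountain share of all tokens: {mountain / total:.2%}")
```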

In [4]:
# Show which location-mountain tokens the dataset contains
data[data['label']=='location-mountain']

             word              label
   3473     Grand  location-mountain
   3474    Canyon  location-mountain
   8855     Hetch  location-mountain
   8856    Hetchy  location-mountain
   8857    Valley  location-mountain
    ...       ...                ...
3453077     Mount  location-mountain
3453078        St  location-mountain
3453079  Benedict  location-mountain
3455376   Beverly  location-mountain


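I add a Sentence column so that I can select the sentences that contain location-mountain. Every "." token increments the counter, so each period is grouped with the sentence that follows it.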

In [5]:
data['Sentence'] = (data['word'] == '.').cumsum()
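
Counting periods is a heuristic: a "." inside an abbreviation would also start a new sentence. A possible alternative, assuming the raw Few-NERD files separate sentences with an empty line (as the dataset README describes), is to assign sentence ids while reading the file. The helper name `read_conll_with_sentences` and the sample lines are hypothetical, and this is a sketch rather than the method used in this notebook.

```python
import pandas as pd

def read_conll_with_sentences(lines):
    """Parse 'word<TAB>label' lines; a blank line ends the current sentence."""
    rows = []
    sentence_id = 0
    in_sentence = False
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # Blank line: the next token starts a new sentence.
            in_sentence = False
            continue
        if not in_sentence:
            sentence_id += 1
            in_sentence = True
        # Split on the last tab so the label is always the final field.
        word, label = line.rsplit("\t", 1)
        rows.append((word, label, sentence_id))
    return pd.DataFrame(rows, columns=["word", "label", "Sentence"])

# Hypothetical sample in the assumed file layout.
sample = [
    "Mount\tlocation-mountain\n",
    "St\tlocation-mountain\n",
    ".\tlocation-mountain\n",
    "Helens\tlocation-mountain\n",
    "erupted\tO\n",
    ".\tO\n",
    "\n",
    "Paris\tlocation-GPE\n",
    "is\tO\n",
    "nice\tO\n",
    ".\tO\n",
]
df = read_conll_with_sentences(sample)
print(df)
```

On the real file this could be called on an open file handle, for example `read_conll_with_sentences(open('inter/train.txt', encoding='utf-8'))`, since a file object yields its lines one by one.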

In [6]:
mountain_sentences=data[data['label']=='location-mountain'].Sentence.unique()
data=data[data['Sentence'].isin(mountain_sentences)]
data[:15]

         word              label  Sentence
3465        .                  O       128
3466    After                  O       128
3467  joining                  O       128
3468        a                  O       128
3469  rafting                  O       128
3470     trip                  O       128
3471       in                  O       128
3472      the                  O       128
3473    Grand  location-mountain       128
3474   Canyon  location-mountain       128


In [7]:
# Recompute the Sentence index so that it starts at 1, and set every other
# entity label to O, because I only need to detect location-mountain.
data['Sentence'] = (data['word'] == '.').cumsum()
data['label'] = data['label'].apply(lambda x: 'O' if x != 'location-mountain' else x)
data.reset_index(drop=True, inplace=True)

data.head(20)

      word              label  Sentence
0        .                  O         1
1    After                  O         1
2  joining                  O         1
3        a                  O         1
4  rafting                  O         1
5     trip                  O         1
6       in                  O         1
7      the                  O         1
8    Grand  location-mountain         1
9   Canyon  location-mountain         1


In [8]:
print(f"So we have {data.Sentence.max()} sentences, and {data.shape[0]} tokens.")

print('\nAnd here is the split after preprocessing:')
print(data.label.value_counts())

So we have 2167 sentences, and 62881 tokens.

And here is the split after preprocessing:
O                    56281
location-mountain     6600
Name: label, dtype: int64


Test and train split before fine-tuning

In [9]:
from simpletransformers.ner import NERModel,NERArgs
from sklearn.metrics import f1_score



label = data["label"].unique().tolist()
label

# 80/20 train/test split.
# int(62881 * 0.8) = 50304, but I do not want to cut a sentence in half,
# so I split at token 50302 instead.

data.rename(columns={"word": "words", "label": "labels", "Sentence": "sentence_id"}, inplace=True)
train=data[:50302]
test=data[50302:]
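
The cut point above was picked by hand. A possible way to get the same guarantee automatically is to split on sentence ids instead of token positions. This is a minimal sketch: the helper `split_by_sentence` and the toy frame are hypothetical, and because it takes a fraction of sentences rather than of tokens, the resulting token ratio may differ slightly from 80/20.

```python
import pandas as pd

def split_by_sentence(df, train_fraction=0.8, id_col="sentence_id"):
    """Split so that every sentence lands entirely in train or in test."""
    ids = df[id_col].drop_duplicates().tolist()  # keeps the original order
    cut = int(len(ids) * train_fraction)
    train_ids = set(ids[:cut])
    in_train = df[id_col].isin(train_ids)
    return df[in_train], df[~in_train]

# Toy frame with the column names used for simpletransformers above.
toy = pd.DataFrame({
    "sentence_id": [1, 1, 1, 2, 2, 3, 3, 3, 4, 4],
    "words": list("abcdefghij"),
    "labels": ["O"] * 10,
})
train_toy, test_toy = split_by_sentence(toy, train_fraction=0.75)
print(len(train_toy), len(test_toy))
```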

In [10]:
args = NERArgs()
args.num_train_epochs = 3
args.learning_rate = 1e-4
args.overwrite_output_dir = True
args.train_batch_size = 32
args.eval_batch_size = 32


model = NERModel('bert', 'bert-base-cased', labels=label, args=args, use_cuda=False)

model.train_model(train, eval_data=test, acc=f1_score)

Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



(165, 0.08828501872097452)

Model performance

In [11]:
result, model_outputs, preds_list = model.eval_model(test)

result


{'eval_loss': 0.08183732362730163,
 'precision': 0.819060773480663,
 'recall': 0.819060773480663,
 'f1_score': 0.819060773480663}
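
For reference, this is how precision, recall and F1 relate to each other when counted per token for a single target label. It is only an illustrative sketch with made-up label sequences: `token_prf` is a hypothetical helper, and it is not claimed to reproduce the numbers reported by `eval_model`, which may be computed at a different granularity.

```python
def token_prf(true_labels, pred_labels, target="location-mountain"):
    """Token-level precision, recall and F1 for one target label."""
    tp = fp = fn = 0
    for true_seq, pred_seq in zip(true_labels, pred_labels):
        for t, p in zip(true_seq, pred_seq):
            if p == target and t == target:
                tp += 1  # predicted mountain, really a mountain
            elif p == target:
                fp += 1  # predicted mountain, actually something else
            elif t == target:
                fn += 1  # missed a mountain token
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Made-up gold and predicted sequences (two short sentences).
gold = [["O", "location-mountain", "location-mountain", "O"],
        ["location-mountain", "O"]]
pred = [["O", "location-mountain", "O", "O"],
        ["location-mountain", "O"]]

precision, recall, f1 = token_prf(gold, pred)
print(precision, recall, f1)
```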

To evaluate the model further, I also asked ChatGPT to write some sentences that mention mountains.

In [12]:
val_data_byGPT=["Mount Everest, standing at 29,032 feet, is the highest peak in the world, located in the Himalayas.",
                "The Rocky Mountains, spanning North America from British Columbia to New Mexico, are known for their breathtaking scenery and diverse wildlife.",
                "Switzerland is renowned for its stunning Alps, with iconic peaks like the Matterhorn attracting climbers and tourists alike.", 
                "The Andes, the longest mountain range in the world, traverse seven South American countries, offering a rich tapestry of landscapes and cultures.", 
                "Japan's Mount Fuji, an active stratovolcano, is an iconic symbol and the highest peak in the country.",
                "The Appalachian Mountains, stretching from Georgia to Maine, are known for their lush forests and historic significance in the United States.",
                "K2, the second-highest mountain on Earth, is part of the Karakoram Range and is considered one of the most challenging peaks to climb.", 
                "The Cascade Range in the Pacific Northwest is home to notable volcanoes like Mount Rainier and Mount St. Helens.", 
                "The Atlas Mountains in North Africa extend across Morocco, Algeria, and Tunisia, providing a rugged and scenic landscape." , 
                "The Australian Alps, located in the southeastern part of the continent, offer unique alpine environments and are a haven for outdoor enthusiasts."]

In [13]:
prediction, model_output = model.predict(val_data_byGPT)
# Here is the result of predictions
for i in range(len(val_data_byGPT)):
    print(val_data_byGPT[i])
    print(prediction[i])
    print("\n")


Mount Everest, standing at 29,032 feet, is the highest peak in the world, located in the Himalayas.
[{'Mount': 'location-mountain'}, {'Everest,': 'location-mountain'}, {'standing': 'O'}, {'at': 'O'}, {'29,032': 'O'}, {'feet,': 'O'}, {'is': 'O'}, {'the': 'O'}, {'highest': 'O'}, {'peak': 'O'}, {'in': 'O'}, {'the': 'O'}, {'world,': 'O'}, {'located': 'O'}, {'in': 'O'}, {'the': 'O'}, {'Himalayas.': 'location-mountain'}]


The Rocky Mountains, spanning North America from British Columbia to New Mexico, are known for their breathtaking scenery and diverse wildlife.
[{'The': 'O'}, {'Rocky': 'location-mountain'}, {'Mountains,': 'location-mountain'}, {'spanning': 'O'}, {'North': 'O'}, {'America': 'O'}, {'from': 'O'}, {'British': 'O'}, {'Columbia': 'O'}, {'to': 'O'}, {'New': 'O'}, {'Mexico,': 'O'}, {'are': 'O'}, {'known': 'O'}, {'for': 'O'}, {'their': 'O'}, {'breathtaking': 'O'}, {'scenery': 'O'}, {'and': 'O'}, {'diverse': 'O'}, {'wildlife.': 'O'}]


Switzerland is renowned for its stunning Alps,
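
The predictions come back one token at a time, with punctuation still attached ('Everest,' and 'Himalayas.'). One possible post-processing step is to join consecutive location-mountain tokens into a single name and trim punctuation at the edges. This is a sketch under those assumptions: `extract_mountains` is a hypothetical helper, and the sample is a shortened copy of the first prediction printed above.

```python
import string

def extract_mountains(token_predictions, target="location-mountain"):
    """Join consecutive target tokens into spans and trim edge punctuation."""
    spans, current = [], []
    for item in token_predictions:
        # Each item is a one-entry dict such as {'Mount': 'location-mountain'}.
        word, label = next(iter(item.items()))
        if label == target:
            current.append(word)
        elif current:
            spans.append(" ".join(current))
            current = []
    if current:  # flush a span that runs to the end of the sentence
        spans.append(" ".join(current))
    return [span.strip(string.punctuation) for span in spans]

# Shortened copy of the first prediction in the output above.
sample_prediction = [
    {"Mount": "location-mountain"}, {"Everest,": "location-mountain"},
    {"standing": "O"}, {"at": "O"}, {"29,032": "O"}, {"feet,": "O"},
    {"located": "O"}, {"in": "O"}, {"the": "O"},
    {"Himalayas.": "location-mountain"},
]
mountains = extract_mountains(sample_prediction)
print(mountains)  # ['Mount Everest', 'Himalayas']
```

Two different mountains that appear directly next to each other would be merged into one span by this approach, because the labels carry no begin/inside markers.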

In [17]:
!pip freeze > requirements.txt