# Task 1. Natural Language Processing. Named entity recognition

In this task, we need to train a named entity recognition (NER) model that identifies
mountain names in text.

This raises semantic problems:

What counts as a mountain? Are the Himalayas a mountain or not? What about Three Bald Heads Hill?

For example, Mount Saser Kangri has several official peaks: Saser Kangri I, Saser Kangri II, Saser Kangri III, etc.

Sometimes the spurs of a mountain or its peaks are considered distinct mountains: Lungnak La by Hrten Nyima.

We will treat as a mountain whatever Wikipedia treats as one.

There are also linguistic problems:

A mountain name may be shortened in the text: Changabang -> Chang.

One mountain may have several names, and we may not know the alternatives: 'Saltoro Kangri' and 'Peak 36', or 'Saser Kangri' and 'Sasir Kangri'.

For now we will use the mountain names and their synonyms found on Wikipedia.

# Load libraries

In [1]:
!pip install transformers



In [69]:
from transformers import AutoModelForTokenClassification, pipeline, BertTokenizer
import pandas as pd, numpy as np
import os

In [3]:
# tokenizer = AutoTokenizer.from_pretrained("dslim/bert-base-NER")
tokenizer = BertTokenizer.from_pretrained("dslim/bert-base-NER")
model = AutoModelForTokenClassification.from_pretrained("dslim/bert-base-NER")

Some weights of the model checkpoint at dslim/bert-base-NER were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [36]:
classifier = pipeline('ner', model = model, tokenizer = tokenizer)

# Testing the hypothesis

The `dslim/bert-base-NER` model is already trained to tag named-entity tokens, and some of its labels (`B-LOC`, `I-LOC`) mark locations.

If we run texts about mountains through the model, the mountains should receive location labels.

If we then compare the tokens labeled as locations with a list of mountain names, we can quickly find the mountains we need.

If this works, there is no need to spend time and resources on additional training of the BERT model, which, moreover, may be of questionable quality because of the semantic and linguistic problems listed above.
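The two-step idea above can be sketched without loading the model, using entity dictionaries shaped like the output of the `ner` pipeline. The helper name `find_mountains`, its `require_location` flag, and the sample entities are illustrative assumptions, not part of the notebook.

```python
def find_mountains(entities, names, require_location=True):
    """Keep the entities whose word is a known mountain name.

    entities: list of dicts with at least 'entity' and 'word' keys,
              shaped like the output of a token-classification pipeline.
    names:    iterable of known mountain names.
    """
    known = set(names)  # set membership is fast for repeated lookups
    location_tags = {"B-LOC", "I-LOC"}
    found = []
    for ent in entities:
        if require_location and ent["entity"] not in location_tags:
            continue  # step 1: drop everything that is not a location
        if ent["word"] in known:
            found.append(ent)  # step 2: keep only known mountains
    return found


# Hand-written sample in the pipeline's output format (illustrative only).
sample = [
    {"entity": "B-LOC", "word": "Mount", "index": 1},
    {"entity": "B-LOC", "word": "Everest", "index": 2},
    {"entity": "B-LOC", "word": "Nepal", "index": 29},
    {"entity": "B-MISC", "word": "British", "index": 136},
]
hits = find_mountains(sample, ["Everest"])
print([h["index"] for h in hits])
```

Passing `require_location=False` skips the first step, which makes it easy to compare both variants on the same output.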

In [6]:
mount = ['Everest']  # mountain name(s) to look for

Take a short sample text:

In [37]:
data = '''
Mount Everest attracts many climbers, including highly experienced mountaineers. There are two main climbing routes, one approaching the summit from the southeast in Nepal (known as the "standard route") and the other from the north in Tibet. While not posing substantial technical climbing challenges on the standard route, Everest presents dangers such as altitude sickness, weather, and wind, as well as hazards from avalanches and the Khumbu Icefall. As of November 2022, 310 people have died on Everest. Over 200 bodies remain on the mountain and have not been removed due to the dangerous conditions.[5][6]

The first recorded efforts to reach Everest's summit were made by British mountaineers. As Nepal did not allow foreigners to enter the country at the time, the British made several attempts on the north ridge route from the Tibetan side. After the first reconnaissance expedition by the British in 1921 reached 7,000 m (22,970 ft) on the North Col, the 1922 expedition pushed the north ridge route up to 8,320 m (27,300 ft), marking the first time a human had climbed above 8,000 m (26,247 ft). The 1924 expedition resulted in one of the greatest mysteries on Everest to this day: George Mallory and Andrew Irvine made a final summit attempt on 8 June but never returned, sparking debate as to whether they were the first to reach the top. Tenzing Norgay and Edmund Hillary made the first documented ascent of Everest in 1953, using the southeast ridge route. Norgay had reached 8,595 m (28,199 ft) the previous year as a member of the 1952 Swiss expedition. The Chinese mountaineering team of Wang Fuzhou, Gonpo, and Qu Yinhua made the first reported ascent of the peak from the north ridge on 25 May 1960.
'''

In [39]:
results = classifier(data)  # run NER over the sample text

In [40]:
results

[{'entity': 'B-LOC',
  'score': 0.83748555,
  'index': 1,
  'word': 'Mount',
  'start': None,
  'end': None},
 {'entity': 'B-LOC',
  'score': 0.42056566,
  'index': 2,
  'word': 'Everest',
  'start': None,
  'end': None},
 {'entity': 'B-LOC',
  'score': 0.9998179,
  'index': 29,
  'word': 'Nepal',
  'start': None,
  'end': None},
 {'entity': 'B-LOC',
  'score': 0.99981266,
  'index': 46,
  'word': 'Tibet',
  'start': None,
  'end': None},
 {'entity': 'B-LOC',
  'score': 0.9714537,
  'index': 60,
  'word': 'Everest',
  'start': None,
  'end': None},
 {'entity': 'B-MISC',
  'score': 0.6827861,
  'index': 84,
  'word': 'K',
  'start': None,
  'end': None},
 {'entity': 'I-LOC',
  'score': 0.85744745,
  'index': 85,
  'word': '##hum',
  'start': None,
  'end': None},
 {'entity': 'I-MISC',
  'score': 0.41460603,
  'index': 86,
  'word': '##bu',
  'start': None,
  'end': None},
 {'entity': 'I-MISC',
  'score': 0.7169586,
  'index': 87,
  'word': 'Ice',
  'start': None,
  'end': None},
 {'enti

The problem: one name is split into several pieces with inconsistent labels. We have:

{'entity': 'B-MISC',
  'score': 0.6827861,
  'index': 84,
  'word': 'K',
  'start': 440,
  'end': 441},  
{'entity': 'I-LOC',
  'score': 0.85744745,
  'index': 85,
  'word': '##hum',
  'start': 441,
  'end': 444},  
 {'entity': 'I-MISC',
  'score': 0.41460603,
  'index': 86,
  'word': '##bu',
  'start': 444,
  'end': 446},  
 {'entity': 'I-MISC',
  'score': 0.7169586,
  'index': 87,
  'word': 'Ice',
  'start': 447,
  'end': 450},  
 {'entity': 'I-MISC',
  'score': 0.55333614,
  'index': 88,
  'word': '##fall',
  'start': 450,
  'end': 454},  

But this should be a single entity, "Khumbu Icefall".

We see that the tokenizer splits words it does not know into sub-word pieces and marks each continuation piece with ##. This is useful for separating a root from a suffix, but harmful when it breaks up a place name: the pieces can no longer be compared directly with a list of names.
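As an alternative to changing the vocabulary, the sub-word pieces marked with `##` can be glued back together after prediction. The sketch below is a hypothetical post-processing helper (`merge_wordpieces` is not part of the notebook or of `transformers`); it assumes that every continuation piece of a word is present in the output.

```python
def merge_wordpieces(tokens):
    """Glue '##' continuation pieces onto the piece that precedes them."""
    words = []
    for tok in tokens:
        piece = tok["word"]
        if piece.startswith("##") and words:
            last = words[-1]
            last["word"] += piece[2:]       # drop the '##' marker
            last["end"] = tok["end"]        # extend the character span
            last["entities"].append(tok["entity"])
        else:
            words.append({
                "word": piece,
                "start": tok["start"],
                "end": tok["end"],
                "entities": [tok["entity"]],  # keep every piece's label
            })
    return words


# The five pieces shown above for "Khumbu Icefall".
tokens = [
    {"entity": "B-MISC", "word": "K", "start": 440, "end": 441},
    {"entity": "I-LOC", "word": "##hum", "start": 441, "end": 444},
    {"entity": "I-MISC", "word": "##bu", "start": 444, "end": 446},
    {"entity": "I-MISC", "word": "Ice", "start": 447, "end": 450},
    {"entity": "I-MISC", "word": "##fall", "start": 450, "end": 454},
]
merged = merge_wordpieces(tokens)
print([w["word"] for w in merged])
```

The merged words can then be compared with the list of names. The `pipeline` function also accepts an `aggregation_strategy` argument that groups pieces into entities; it is not used in this notebook, so its effect on these texts is untested here.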

Let's add the new words to the tokenizer's vocabulary.

In [48]:
new_words = ['Khumbu', 'Icefall', 'Khumbu Icefall']

In [49]:
tokenizer.add_tokens(new_words)

1

In [50]:
model.resize_token_embeddings(len(tokenizer)) # adjusting the embedding size

Embedding(28999, 768)

In [51]:
classifier = pipeline('ner', model = model, tokenizer = tokenizer) # refresh classifier

In [52]:
results = classifier(data)  # rerun NER with the updated tokenizer
results

[{'entity': 'B-LOC',
  'score': 0.8793697,
  'index': 1,
  'word': 'Mount',
  'start': None,
  'end': None},
 {'entity': 'B-LOC',
  'score': 0.46240875,
  'index': 2,
  'word': 'Everest',
  'start': None,
  'end': None},
 {'entity': 'B-LOC',
  'score': 0.99980587,
  'index': 29,
  'word': 'Nepal',
  'start': None,
  'end': None},
 {'entity': 'B-LOC',
  'score': 0.99980694,
  'index': 46,
  'word': 'Tibet',
  'start': None,
  'end': None},
 {'entity': 'B-LOC',
  'score': 0.9612443,
  'index': 60,
  'word': 'Everest',
  'start': None,
  'end': None},
 {'entity': 'B-LOC',
  'score': 0.9922184,
  'index': 97,
  'word': 'Everest',
  'start': None,
  'end': None},
 {'entity': 'B-LOC',
  'score': 0.9925248,
  'index': 129,
  'word': 'Everest',
  'start': None,
  'end': None},
 {'entity': 'B-MISC',
  'score': 0.99977463,
  'index': 136,
  'word': 'British',
  'start': None,
  'end': None},
 {'entity': 'B-LOC',
  'score': 0.9998192,
  'index': 141,
  'word': 'Nepal',
  'start': None,
  'end': N

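New problem: once it has been added as a token, Khumbu Icefall is no longer marked as a location. The embeddings of newly added tokens are untrained, so the model has learned nothing about them.

Keep only the locations from BERT's output: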

In [54]:
[r for r in results if r['entity'] in ('B-LOC', 'I-LOC')]

[{'entity': 'B-LOC',
  'score': 0.8793697,
  'index': 1,
  'word': 'Mount',
  'start': None,
  'end': None},
 {'entity': 'B-LOC',
  'score': 0.46240875,
  'index': 2,
  'word': 'Everest',
  'start': None,
  'end': None},
 {'entity': 'B-LOC',
  'score': 0.99980587,
  'index': 29,
  'word': 'Nepal',
  'start': None,
  'end': None},
 {'entity': 'B-LOC',
  'score': 0.99980694,
  'index': 46,
  'word': 'Tibet',
  'start': None,
  'end': None},
 {'entity': 'B-LOC',
  'score': 0.9612443,
  'index': 60,
  'word': 'Everest',
  'start': None,
  'end': None},
 {'entity': 'B-LOC',
  'score': 0.9922184,
  'index': 97,
  'word': 'Everest',
  'start': None,
  'end': None},
 {'entity': 'B-LOC',
  'score': 0.9925248,
  'index': 129,
  'word': 'Everest',
  'start': None,
  'end': None},
 {'entity': 'B-LOC',
  'score': 0.9998192,
  'index': 141,
  'word': 'Nepal',
  'start': None,
  'end': None},
 {'entity': 'B-LOC',
  'score': 0.9722634,
  'index': 193,
  'word': 'North',
  'start': None,
  'end': None}

Select only names of mountains from the list above:

In [55]:
[r for r in results if r['word'] in mount]

[{'entity': 'B-LOC',
  'score': 0.46240875,
  'index': 2,
  'word': 'Everest',
  'start': None,
  'end': None},
 {'entity': 'B-LOC',
  'score': 0.9612443,
  'index': 60,
  'word': 'Everest',
  'start': None,
  'end': None},
 {'entity': 'B-LOC',
  'score': 0.9922184,
  'index': 97,
  'word': 'Everest',
  'start': None,
  'end': None},
 {'entity': 'B-LOC',
  'score': 0.9925248,
  'index': 129,
  'word': 'Everest',
  'start': None,
  'end': None},
 {'entity': 'B-LOC',
  'score': 0.99444515,
  'index': 248,
  'word': 'Everest',
  'start': None,
  'end': None},
 {'entity': 'B-LOC',
  'score': 0.9827717,
  'index': 299,
  'word': 'Everest',
  'start': None,
  'end': None}]

It worked: we obtained the token indices of the places in the text where Everest is mentioned.
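The `index` field counts tokens, and `start` and `end` are `None` in the output above, so character positions are not available directly. If character offsets are needed, one possible workaround is to search the raw text for the names that were found. This is a minimal sketch; `find_name_spans` is a hypothetical helper and the sample sentence is made up.

```python
import re


def find_name_spans(text, names):
    """Return (start, end, name) for every whole-word occurrence of a name."""
    if not names:
        return []
    # Try longer names first so 'Saser Kangri' wins over 'Saser'.
    ordered = sorted(set(names), key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(n) for n in ordered) + r")\b"
    )
    return [(m.start(), m.end(), m.group(0)) for m in pattern.finditer(text)]


text = "Mount Everest attracts many climbers. Everest presents dangers."
for start, end, name in find_name_spans(text, ["Everest"]):
    print(start, end, name)
```

Each span can be checked with `text[start:end]`, which returns the matched name.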

# Checking against the list of six- to eight-thousand-meter mountains in India

## Take a single text from the dataset

The list of mountain names:

In [54]:
mountains = ['Kangchenjunga', 'Himalayas', 'Nanda Devi', 'Kamet', 'Saltoro Kangri', 'Peak 36', 'Saser Kangri', 'Sasir Kangri',
 'Mamostong Kangri', 'K35', 'Teram Kangri', 'Jongsong Peak', 'Saltoro Mountains', 'K12', 'Kabru', 'Ghent Kangri',
 'Rimo I', 'Kirat Chuli', 'Tent Peak', 'Mana Peak', 'Apsarasas Kangri', 'Mukut Parbat', 'Singhi Kangri', 'Hardeol',
 'Chaukhamba', 'Nun Kun', 'Pauhunri', 'Pathibhara', 'Trisul', 'Satopanth', 'Tirsuli', 'Chong Kumdang Ri', 'Dunagiri',
 'Kangto', 'Kanggardo Rize', 'Nyegyi Kansang', 'Katoie Gyang', 'Kra-Daadi', 'Padmanabh', 'Shudu Tsempa', 'Langpo',
 'Chamshen Kangri', 'Tughmo Zarpo', 'Aq Tash', 'Rishi Pahar', 'Thalay Sagar', 'Mount Lakshmi', 'Kedarnath', 'Manda I',
 'Saraswati Parbat I', 'Shahi Kangri', 'Sri Kailash', 'Mana Parbat I', 'Pilapani Parbat', 'Sudarshan Parbat',
 'Chaturbhuj', 'Shyamvarna', 'Yogeshwar', 'Meru Peak', 'Chorten Nyima Ri', 'Kalanka', 'Saf Minal', 'Changabang',
 'Panchachuli', 'Lungnak La', 'Dibibokari', 'Pyramid', 'Papsura', 'Mukarbeh', 'Indrasan', 'Jorkanden', 'Manirang',
 'Baspa', 'Tirung', 'Leo Pargial', 'Kullu Pumori', 'Shigri Parbat', 'Minar', 'Akela Killa', 'Phabrang', 'Mulkilla',
 'Gya', 'Shilla', 'Kun', 'Nun', 'Zanskar', 'Kolahoi', 'Haramukh', 'Chong Kundan', 'Thingchinkhang', 'Jopunu', 'Pandim',
 'Pangarchulla Peak', 'Everest', ]

In [11]:
with open("/content/Apsarasas Kangri.txt", # open the file
          "r", encoding = 'utf-8') as f:
  data = f.read()

In [12]:
print(data)

Apsarasas Kangri is a mountain in the Siachen subrange of the Karakoram mountain range. With an elevation of 7,245 m (23,770 ft) it is the 96th highest mountain in the world. Apsarasas Kangri is located within the broader Kashmir region disputed between India, Pakistan and China. It is situated on the border between the areas controlled by China as part of the Xinjiang autonomous region, and the Siachen Glacier controlled by India as part of Ladakh.

Apsarasas was named by Grant Peterkin of the 1908 Workman expedition, from apsara ("fairies") and sas ("place"), thus "place of the fairies".[3] There are at least three main summits of near-equal height, usually labeled I to III from west to east over a distance of 5 km. The eastern summit (35°31′14″N 77°11′56″E) is separated from the other two by a saddle just over 6800 m high.

Only the western peak (Apsarasas I) appears to have been climbed. The first ascent was made over the west ridge by Yoshio Inagaki, Katsuhisa Yabuta and Takamasa 

In [14]:
tokenizer.add_tokens(mountains) # update the tokenizer

92

In [15]:
model.resize_token_embeddings(len(tokenizer)) # update the model

Embedding(29088, 768)

In [16]:
classifier = pipeline('ner', model = model, tokenizer = tokenizer)

In [None]:
results = classifier(data)


In [18]:
len(results)  # 82 tokens in the text were tagged as entities

82

In [24]:
results

[{'entity': 'B-LOC',
  'score': 0.6467143,
  'index': 1,
  'word': 'Apsarasas Kangri',
  'start': None,
  'end': None},
 {'entity': 'B-LOC',
  'score': 0.99342453,
  'index': 7,
  'word': 'Si',
  'start': None,
  'end': None},
 {'entity': 'I-LOC',
  'score': 0.8817828,
  'index': 8,
  'word': '##ache',
  'start': None,
  'end': None},
 {'entity': 'I-LOC',
  'score': 0.8802007,
  'index': 9,
  'word': '##n',
  'start': None,
  'end': None},
 {'entity': 'B-LOC',
  'score': 0.9928596,
  'index': 15,
  'word': 'Kara',
  'start': None,
  'end': None},
 {'entity': 'I-LOC',
  'score': 0.98871917,
  'index': 16,
  'word': '##kor',
  'start': None,
  'end': None},
 {'entity': 'I-LOC',
  'score': 0.9953636,
  'index': 17,
  'word': '##am',
  'start': None,
  'end': None},
 {'entity': 'B-LOC',
  'score': 0.5768311,
  'index': 47,
  'word': 'Apsarasas Kangri',
  'start': None,
  'end': None},
 {'entity': 'B-LOC',
  'score': 0.99972373,
  'index': 53,
  'word': 'Kashmir',
  'start': None,
  'end': 

Filter only locations:

In [20]:
filter1 = [r for r in results if r['entity'] in ('B-LOC', 'I-LOC')]
len(filter1)

36

Only 36 of the 82 tagged tokens are locations.

In [23]:
filter1

[{'entity': 'B-LOC',
  'score': 0.6467143,
  'index': 1,
  'word': 'Apsarasas Kangri',
  'start': None,
  'end': None},
 {'entity': 'B-LOC',
  'score': 0.99342453,
  'index': 7,
  'word': 'Si',
  'start': None,
  'end': None},
 {'entity': 'I-LOC',
  'score': 0.8817828,
  'index': 8,
  'word': '##ache',
  'start': None,
  'end': None},
 {'entity': 'I-LOC',
  'score': 0.8802007,
  'index': 9,
  'word': '##n',
  'start': None,
  'end': None},
 {'entity': 'B-LOC',
  'score': 0.9928596,
  'index': 15,
  'word': 'Kara',
  'start': None,
  'end': None},
 {'entity': 'I-LOC',
  'score': 0.98871917,
  'index': 16,
  'word': '##kor',
  'start': None,
  'end': None},
 {'entity': 'I-LOC',
  'score': 0.9953636,
  'index': 17,
  'word': '##am',
  'start': None,
  'end': None},
 {'entity': 'B-LOC',
  'score': 0.5768311,
  'index': 47,
  'word': 'Apsarasas Kangri',
  'start': None,
  'end': None},
 {'entity': 'B-LOC',
  'score': 0.99972373,
  'index': 53,
  'word': 'Kashmir',
  'start': None,
  'end': 

Select only mountains:

In [21]:
mounts = [r for r in results if r['word'] in mountains]
len(mounts)

2

Only two tokens match mountain names:

In [22]:
mounts

[{'entity': 'B-LOC',
  'score': 0.6467143,
  'index': 1,
  'word': 'Apsarasas Kangri',
  'start': None,
  'end': None},
 {'entity': 'B-LOC',
  'score': 0.5768311,
  'index': 47,
  'word': 'Apsarasas Kangri',
  'start': None,
  'end': None}]

We also see that at the beginning of the second paragraph the mountain is simply called Apsarasas, and BERT did not recognize it as a mountain. This means that we need an additional list of the individual words that make up mountain names, in case they are used on their own.

In [55]:
mountains2 = []
for name in mountains:  # 'name' avoids overwriting the earlier 'mount' list
  parts = name.split()  # split the name on whitespace
  if len(parts) > 1:
    # keep only parts longer than two characters, dropping e.g. 'I' and 'Aq'
    mountains2.extend(part for part in parts if len(part) > 2)
  else:
    mountains2.extend(parts)

In [41]:
mountains2

['Kangchenjunga',
 'Himalayas',
 'Nanda',
 'Devi',
 'Kamet',
 'Saltoro',
 'Kangri',
 'Peak',
 'Saser',
 'Kangri',
 'Sasir',
 'Kangri',
 'Mamostong',
 'Kangri',
 'K35',
 'Teram',
 'Kangri',
 'Jongsong',
 'Peak',
 'Saltoro',
 'Mountains',
 'K12',
 'Kabru',
 'Ghent',
 'Kangri',
 'Rimo',
 'Kirat',
 'Chuli',
 'Tent',
 'Peak',
 'Mana',
 'Peak',
 'Apsarasas',
 'Kangri',
 'Mukut',
 'Parbat',
 'Singhi',
 'Kangri',
 'Hardeol',
 'Chaukhamba',
 'Nun',
 'Kun',
 'Pauhunri',
 'Pathibhara',
 'Trisul',
 'Satopanth',
 'Tirsuli',
 'Chong',
 'Kumdang',
 'Dunagiri',
 'Kangto',
 'Kanggardo',
 'Rize',
 'Nyegyi',
 'Kansang',
 'Katoie',
 'Gyang',
 'Kra-Daadi',
 'Padmanabh',
 'Shudu',
 'Tsempa',
 'Langpo',
 'Chamshen',
 'Kangri',
 'Tughmo',
 'Zarpo',
 'Tash',
 'Rishi',
 'Pahar',
 'Thalay',
 'Sagar',
 'Mount',
 'Lakshmi',
 'Kedarnath',
 'Manda',
 'Saraswati',
 'Parbat',
 'Shahi',
 'Kangri',
 'Sri',
 'Kailash',
 'Mana',
 'Parbat',
 'Pilapani',
 'Parbat',
 'Sudarshan',
 'Parbat',
 'Chaturbhuj',
 'Shyamvarna',
 'Yo
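The list above repeats words such as 'Kangri' and contains generic terms such as 'Peak' and 'Mount', which can also occur outside mountain names. A possible refinement is sketched below; the `GENERIC_WORDS` set and the `split_names` helper are assumptions for illustration and would need tuning on real data.

```python
# Assumption: a hand-picked set of generic words that should not count
# as mountain names on their own.
GENERIC_WORDS = frozenset({"Mount", "Mountains", "Peak"})


def split_names(names, min_len=3, generic=GENERIC_WORDS):
    """Split multi-word names into unique, non-generic single words."""
    seen = set()
    parts = []
    for name in names:
        words = name.split()
        if len(words) > 1:
            # Drop very short parts such as 'I' or '36'.
            words = [w for w in words if len(w) >= min_len]
        for word in words:
            if word in generic or word in seen:
                continue
            seen.add(word)  # remember the word to avoid duplicates
            parts.append(word)
    return parts


short_names = split_names(
    ["Saltoro Kangri", "Peak 36", "Saser Kangri", "Rimo I", "Mount Lakshmi", "Kamet"]
)
print(short_names)
```

The order of first appearance is preserved, so the result stays easy to compare with the original list.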

In [42]:
tokenizer.add_tokens(mountains2)  # add the single-word names to the tokenizer

55

In [43]:
model.resize_token_embeddings(len(tokenizer)) # update the model

Embedding(29143, 768)

In [44]:
classifier = pipeline('ner', model = model, tokenizer = tokenizer)

In [45]:
results = classifier(data)

In [46]:
filter1 = [r for r in results if r['entity'] in ('B-LOC', 'I-LOC')]
len(filter1)  # first filtering

26

In [47]:
mounts = [r for r in results if r['word'] in mountains]
len(mounts)  # second filtering

2

In [48]:
mounts

[{'entity': 'B-LOC',
  'score': 0.590518,
  'index': 1,
  'word': 'Apsarasas Kangri',
  'start': None,
  'end': None},
 {'entity': 'B-LOC',
  'score': 0.59702426,
  'index': 47,
  'word': 'Apsarasas Kangri',
  'start': None,
  'end': None}]

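We see the same two matches. Note that this check compares against `mountains`, which at this point does not contain the single word "Apsarasas", so the short name cannot appear here even if it was tagged. The first filter, however, now returns 26 locations instead of 36, so adding tokens did not help BERT recognize more locations. Therefore, even if we fine-tune the model by changing its weights, there is a high probability that this short name would still not be recognized.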

## Running the pipeline on the entire dataset

In [28]:
PATH = '/content/drive/MyDrive/Colab Notebooks/test_task'  # path to the dataset files

The training dataset consists of 70 texts that contain many mountain names:

In [30]:
import os
for root, dirs, files in os.walk(PATH + '/train'):
    for filename in files:
        print(filename)

Kangchenjunga.txt
Kangchenjunga3.txt
Kangchenjunga5.txt
Kangchenjunga7.txt
Himalayas.txt
Himalayas3.txt
Himalayas5.txt
Himalayas7.txt
Himalayas9.txt
Nanda Devi.txt
Nanda Devi3.txt
Kamet.txt
Kamet3.txt
Saltoro Kangri.txt
Saser Kangri.txt
Mamostong Kangri.txt
Teram Kangri.txt
Jongsong Peak.txt
K12.txt
Kabru.txt
Ghent Kangri.txt
Rimo I.txt
Kirat Chuli.txt
Mana Peak.txt
Mukut Parbat.txt
Singhi Kangri.txt
Hardeol.txt
Chaukhamba.txt
Nun Kun.txt
Pauhunri.txt
Trisul.txt
Satopanth.txt
Dunagiri.txt
Kangto.txt
Nyegyi Kansang.txt
Aq Tash.txt
Rishi Pahar.txt
Thalay Sagar.txt
Kedarnath.txt
Langpo.txt
Sri Kailash.txt
Mana Parbat I.txt
Pilapani Parbat.txt
Sudarshan Parbat.txt
Manda I.txt
Chaturbhuj.txt
Shyamvarna.txt
Yogeshwar.txt
Meru Peak.txt
Kalanka.txt
Kalanka2.txt
Saf Minal.txt
Changabang.txt
Panchachuli.txt
Panchachuli3.txt
Panchachuli5.txt
Changabang3.txt
Saf Minal3.txt
Kalanka4.txt
Chorten Nyima Ri.txt
Jopunu.txt
Meru Peak2.txt
Meru Peak4.txt
Sudarshan Parbat3.txt
Mana Peak2.txt
Mana Peak4.txt

In [49]:
total_results = dict()
for root, dirs, files in os.walk(PATH + '/train'):
    for filename in files:
        # join with root so that files in subdirectories are opened correctly
        with open(os.path.join(root, filename), "r", encoding = 'utf-8') as f:
            data = f.read()  # read each text
        results = classifier(data)  # run NER
        # first filter: keep only locations
        filter1 = [r for r in results if r['entity'] in ('B-LOC', 'I-LOC')]
        # second filter: keep only known mountain names
        mounts = [r for r in filter1 if r['word'] in mountains]
        total_results[filename] = mounts  # store the matches for this file

In [50]:
total_results

{'Kangchenjunga.txt': [],
 'Kangchenjunga3.txt': [],
 'Kangchenjunga5.txt': [{'entity': 'B-LOC',
   'score': 0.7431501,
   'index': 92,
   'word': 'Kirat Chuli',
   'start': None,
   'end': None}],
 'Kangchenjunga7.txt': [],
 'Himalayas.txt': [],
 'Himalayas3.txt': [{'entity': 'B-LOC',
   'score': 0.9912909,
   'index': 152,
   'word': 'Everest',
   'start': None,
   'end': None},
  {'entity': 'B-LOC',
   'score': 0.88812953,
   'index': 303,
   'word': 'Everest',
   'start': None,
   'end': None},
  {'entity': 'B-LOC',
   'score': 0.9874944,
   'index': 336,
   'word': 'Everest',
   'start': None,
   'end': None}],
 'Himalayas5.txt': [{'entity': 'B-LOC',
   'score': 0.90638924,
   'index': 301,
   'word': 'Gya',
   'start': None,
   'end': None}],
 'Himalayas7.txt': [],
 'Himalayas9.txt': [],
 'Nanda Devi.txt': [],
 'Nanda Devi3.txt': [{'entity': 'I-LOC',
   'score': 0.9240715,
   'index': 283,
   'word': 'Everest',
   'start': None,
   'end': None},
  {'entity': 'B-LOC',
   'score': 

We see that some documents that definitely contain mountain names have no matches. Either BERT does not label the names as locations, or it does not tag them as entities at all.

In [56]:
mountains.extend(mountains2)  # extend the original list with the single-word names

At the same time, we measure the processing time:

In [88]:
%%timeit

total_results = dict()
for root, dirs, files in os.walk(PATH + '/train'):
    for filename in files:
        with open(os.path.join(root, filename), "r", encoding = 'utf-8') as f:
            data = f.read()
        results = classifier(data)
        filter1 = [r for r in results if r['entity'] in ('B-LOC', 'I-LOC')]
        mounts = [r for r in filter1 if r['word'] in mountains]
        total_results[filename] = mounts

1min 53s ± 1.25 s per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [94]:
113 / len(total_results)

1.6142857142857143

That is about 1.6 seconds per document.

In [59]:
total_results

{'Kangchenjunga.txt': [],
 'Kangchenjunga3.txt': [],
 'Kangchenjunga5.txt': [{'entity': 'B-LOC',
   'score': 0.7431501,
   'index': 92,
   'word': 'Kirat Chuli',
   'start': None,
   'end': None}],
 'Kangchenjunga7.txt': [],
 'Himalayas.txt': [{'entity': 'B-LOC',
   'score': 0.5645227,
   'index': 94,
   'word': 'Mount',
   'start': None,
   'end': None}],
 'Himalayas3.txt': [{'entity': 'B-LOC',
   'score': 0.9912909,
   'index': 152,
   'word': 'Everest',
   'start': None,
   'end': None},
  {'entity': 'B-LOC',
   'score': 0.88812953,
   'index': 303,
   'word': 'Everest',
   'start': None,
   'end': None},
  {'entity': 'B-LOC',
   'score': 0.9874944,
   'index': 336,
   'word': 'Everest',
   'start': None,
   'end': None}],
 'Himalayas5.txt': [{'entity': 'B-LOC',
   'score': 0.90638924,
   'index': 301,
   'word': 'Gya',
   'start': None,
   'end': None}],
 'Himalayas7.txt': [],
 'Himalayas9.txt': [],
 'Nanda Devi.txt': [],
 'Nanda Devi3.txt': [{'entity': 'B-LOC',
   'score': 0.80895

Some documents still have no matches.

For clarity, let's create a dataframe:

In [78]:
res = pd.DataFrame()
res['document'] = list(total_results.keys())  # file names
res['len_of_searched_names'] = [len(total_results[key]) for key in total_results.keys()]  # number of mountain names found


In [79]:
res

Unnamed: 0,document,len_of_searched_names
0,Kangchenjunga.txt,0
1,Kangchenjunga3.txt,0
2,Kangchenjunga5.txt,1
3,Kangchenjunga7.txt,0
4,Himalayas.txt,1
...,...,...
65,Mana Peak4.txt,3
66,Pangarchulla Peak.txt,1
67,Shahi Kangri.txt,0
68,Saraswati Parbat I2.txt,6


Next we check whether more mountains are found if we remove the location filtering step. If BERT labels mountain names as something other than locations, this should help. It might also reduce the processing time.

In [89]:
%%timeit
total_results = dict()
for root, dirs, files in os.walk(PATH + '/train'):
    for filename in files:
        with open(os.path.join(root, filename), "r", encoding = 'utf-8') as f:
            data = f.read()
        results = classifier(data)
        # the location filter is skipped here: search all tagged tokens
        mounts = [r for r in results if r['word'] in mountains]
        total_results[filename] = mounts

2min 8s ± 22.1 s per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [95]:
128 / len(total_results)

1.8285714285714285

That is about 1.83 seconds per document.
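The per-document figures follow directly from the two `%%timeit` means. The short calculation below reproduces them and adds the relative slowdown; it assumes 70 documents, as implied by the divisions above, and the helper name is illustrative.

```python
def seconds_per_document(minutes, seconds, n_documents):
    """Convert a 'Xmin Ys' timing into seconds per document."""
    return (minutes * 60 + seconds) / n_documents


N_DOCUMENTS = 70  # assumption: len(total_results) in the notebook

with_filter = seconds_per_document(1, 53, N_DOCUMENTS)     # 1min 53s
without_filter = seconds_per_document(2, 8, N_DOCUMENTS)   # 2min 8s
slowdown = without_filter / with_filter - 1

print(round(with_filter, 2), round(without_filter, 2), f"{slowdown:.1%}")
```

The two means differ by 15 s per run, while the second measurement has a spread of ± 22.1 s, so the difference should be read with caution.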

In [81]:
# add a column with new data
res['len_of_searched_names2'] = [len(total_results[key]) for key in total_results.keys()]

In [82]:
res

Unnamed: 0,document,len_of_searched_names,len_of_searched_names2
0,Kangchenjunga.txt,0,4
1,Kangchenjunga3.txt,0,0
2,Kangchenjunga5.txt,1,1
3,Kangchenjunga7.txt,0,0
4,Himalayas.txt,1,2
...,...,...,...
65,Mana Peak4.txt,3,3
66,Pangarchulla Peak.txt,1,1
67,Shahi Kangri.txt,0,1
68,Saraswati Parbat I2.txt,6,10


# Conclusions
We compared two options. The first selects the location tokens and then looks for mountain names among them.

The second simply searches for mountain names among all tagged tokens.

In our measurements the second option was slightly slower (about 1.8 s versus 1.6 s per document), although the difference is within the spread of the second measurement (± 22.1 s per run). It also found almost 30% more mountain mentions:

In [96]:
print('the second way is more effective than the first by a factor of', np.sum(res['len_of_searched_names2']) / np.sum(res['len_of_searched_names']))

the second way is more effective than the first by a factor of 1.2894736842105263


# Options for improving the accuracy of mountain recognition.

We obtained more than just the names of the mountains in the text.

We also obtained a labeled dataset on which the BERT model can be trained further, which may increase the accuracy of mountain recognition.
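One way to turn the found names into training labels is the BIO scheme that the model already uses. The sketch below labels whitespace-separated words; the `MOUNT` label and the `bio_labels` helper are hypothetical, and a real dataset would need proper tokenization, since punctuation attached to a word prevents a match here.

```python
def bio_labels(words, names, label="MOUNT"):
    """Assign B-/I-/O labels to words, preferring the longest matching name."""
    # Split every name into words and try longer names first.
    split = [n.split() for n in names]
    name_tokens = sorted((p for p in split if p), key=len, reverse=True)
    labels = ["O"] * len(words)
    i = 0
    while i < len(words):
        for parts in name_tokens:
            if words[i:i + len(parts)] == parts:
                labels[i] = "B-" + label            # first word of the name
                for j in range(i + 1, i + len(parts)):
                    labels[j] = "I-" + label        # remaining words
                i += len(parts)
                break
        else:
            i += 1  # no name starts at this word
    return labels


words = "Apsarasas Kangri is a mountain in the Karakoram".split()
labels = bio_labels(words, ["Apsarasas Kangri", "Apsarasas"])
print(list(zip(words, labels)))
```

Listing both the full name and its short form lets the same function label "Apsarasas" when it appears alone.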

We can take a window of 10-15 words on either side of each found mountain name and train the model only on this data. This shortens the dataset and removes unrecognized mountain names from it, which reduces noise and helps the model focus on identifying patterns.
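The windowing step can be sketched as follows. The helper `context_windows` and the synthetic word list are illustrative assumptions; overlapping windows are not merged in this version.

```python
def context_windows(words, hit_positions, radius=12):
    """Return the words within `radius` positions of each hit."""
    windows = []
    for pos in hit_positions:
        start = max(0, pos - radius)               # clip at the text start
        end = min(len(words), pos + radius + 1)    # clip at the text end
        windows.append(words[start:end])
    return windows


# Synthetic text of 40 words with one mountain name at position 20.
words = [f"w{i}" for i in range(40)]
words[20] = "Everest"
win = context_windows(words, [20], radius=10)[0]
print(len(win), win[0], win[-1])
```

A radius between 10 and 15 matches the range suggested above; the best value would have to be found experimentally.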

We can use rule-based tokenizers or conditional random fields and increase the accuracy of searching for mountain names by using words that naturally occur nearby.
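A very small rule-based variant of this idea is sketched below. It treats a capitalized word next to a cue word as a mountain candidate. The cue words are taken from the names in the list above, but the rule itself is an assumption and will miss names without a cue word; a conditional random field is not shown.

```python
import re

# Assumption: these cue words signal a mountain name when they stand
# next to a capitalized word.
CUE_BEFORE = r"(?:Mount|Mt\.)"
CUE_AFTER = r"(?:Peak|Kangri|Parbat)"
NAME = r"[A-Z][\w-]+"  # one capitalized word, hyphens allowed

PATTERN = re.compile(
    rf"\b{CUE_BEFORE}\s+{NAME}"      # e.g. 'Mount Everest'
    rf"|\b{NAME}\s+{CUE_AFTER}\b"    # e.g. 'Teram Kangri'
)


def mountain_candidates(text):
    """Return the substrings that match one of the cue-word rules."""
    return [m.group(0) for m in PATTERN.finditer(text)]


text = "Mount Everest and Teram Kangri rise above Jongsong Peak."
candidates = mountain_candidates(text)
print(candidates)
```

Candidates found this way could be merged with the list-based matches, so that names missing from the list still have a chance of being detected.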