## Dataset Preparation

As a first step in creating the dataset, I decided, as recommended in the task description, to explore the possibilities of using ChatGPT. It seemed to me that such a powerful transformer model should be a perfect generator of a synthetic dataset about mountains: it must have seen similar texts on the internet, and its main task is to generate new text based on its knowledge. Furthermore, it should be aware of how NER works and how to build datasets with an appropriate structure.

What I found is that, given precise instructions, ChatGPT as is can generate a really high-quality dataset, format it appropriately, and advise on how to use it. The problem was that ChatGPT was severely limited in the number of rows it could generate in one turn. When asked for more, it kept proposing scripts that would combine different sentence parts to generate any number of rows. Such a dataset could be big enough, but not very useful, as it would contain only a limited number of actual sentence patterns.

The script, written by ChatGPT, that was supposed to create a dataset of 1,000 unique sentences about mountains using 100 different patterns:

In [2]:
import numpy as np
import pandas as pd
import random

# Function to generate a very diverse dataset of 1,000 sentences
def generate_diverse_ner_data(num_sentences, num_templates, num_mountains):
    # Approximately 100 sentence templates
    sentence_templates = [
        "{} rises above its surrounding terrain.", "{} can be seen from miles away.",
        "The beauty of {} is unmatched.", "{} is on the bucket list of many climbers.",
        "Tales of {} have inspired adventurers for generations.", "The summit of {} was reached in an epic expedition.",
        "The slopes of {} are treacherous.", "{} stands as a beacon for explorers.",
        "The history of {} is as fascinating as its geography.", "Many myths surround the mystical {}.",
        "The ascent of {} is a formidable challenge.", "{}'s peak is one of the most sought after.",
        "The routes to {} vary from easy to extremely difficult.", "Ancient stories tell of gods dwelling on {}.",
        "{} is part of a larger range of mountains.", "The first ascent of {} is still debated.",
        "The view from the top of {} is breathtaking.", "{} was named after a legendary figure.",
        "The glaciers on {} are slowly melting.", "Wildlife around {} is abundant.",
        # ... 80 more templates
    ]

    # Expand the list to reach approximately 100 unique templates
    while len(sentence_templates) < num_templates:
        template = random.choice(sentence_templates)
        # Add some variation to the template
        variations = [
            template.replace("is", "remains"),
            template.replace("The", "A"),
            template.replace("many", "various"),
            template.replace("miles", "kilometers"),
        ]
        sentence_templates.extend(variations)
        sentence_templates = list(set(sentence_templates))  # Remove duplicates

    # Approximately 100 mountain names
    mountain_names = [
        "Everest", "Kilimanjaro", "Denali", "McKinley", "K2", "Annapurna",
        "Elbrus", "Aconcagua", "Fuji", "Olympus", "Mont Blanc", "Matterhorn",
        "Eiger", "Vesuvius", "Rainier", "Whitney", "Kangchenjunga", "Lhotse",
        "Makalu", "Cho Oyu",
        # ... 80 more names
    ]

    # Expand the list to reach approximately 100 unique mountain names
    while len(mountain_names) < num_mountains:
        mountain = random.choice(mountain_names)
        # Add some variation to the mountain name
        variations = [
            mountain + " Peak",
            "Mount " + mountain,
            mountain + " Range",
            "The Great " + mountain
        ]
        mountain_names.extend(variations)
        mountain_names = list(set(mountain_names))  # Remove duplicates

    # Generate the sentences
    sentences = []
    for _ in range(num_sentences):
        template = random.choice(sentence_templates)
        mountain = random.choice(mountain_names)
        sentence = template.format(mountain)
        sentences.append(sentence)

    # Shuffle the sentences to add randomness
    random.shuffle(sentences)

    # Convert to DataFrame for NER tagging
    ner_data = []
    for sentence_id, sentence in enumerate(sentences):
        tokens = sentence.split()
        for token in tokens:
            # Randomly decide if a token is a mountain name
            if token in mountain_names:
                # Randomly assign B-MNT or I-MNT tags to mountain names
                tag = np.random.choice([1, 2], p=[0.7, 0.3])
            else:
                tag = 0
            ner_data.append([sentence_id, token, tag])
    
    return pd.DataFrame(ner_data, columns=["id", "tokens", "ner_tags"])

# Generate 1,000 diverse sentences
# diverse_ner_data = generate_diverse_ner_data(1000, 100, 100)
diverse_ner_data = generate_diverse_ner_data(100, 10, 10)

# Display the first few rows of the generated dataset
diverse_ner_data.head(20)

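| | id | tokens | ner_tags |
|---|---|---|---|
| 0 | 0 | Whitney | 1 |
| 1 | 0 | was | 0 |
| 2 | 0 | named | 0 |
| 3 | 0 | after | 0 |
| 4 | 0 | a | 0 |
| 5 | 0 | legendary | 0 |
| 6 | 0 | figure. | 0 |
| 7 | 1 | Mont | 0 |
| 8 | 1 | Blanc's | 0 |
| 9 | 1 | peak | 0 |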
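
The output above exposes a weakness of the generated script: it compares whole whitespace-separated tokens against the name list, so a multi-word or possessive name such as "Mont Blanc's" is tagged 0, and the choice between B-MNT and I-MNT is random. Below is a minimal sketch of a span-based alternative, assuming the list of mountain names is known in advance. The names `tag_sentence` and `_normalize` are hypothetical helpers introduced for illustration, not part of the original script.

```python
import string

def _normalize(token):
    """Strip surrounding punctuation and a trailing possessive 's."""
    token = token.strip(string.punctuation)
    return token[:-2] if token.endswith("'s") else token

def tag_sentence(sentence, mountain_names):
    """Return (token, tag) pairs with 0 = O, 1 = B-MNT, 2 = I-MNT."""
    tokens = sentence.split()
    normalized = [_normalize(t) for t in tokens]
    tags = [0] * len(tokens)
    # Split names into words and try longer names first,
    # so "Mont Blanc" wins over any shorter overlapping name.
    names = sorted(
        (n.split() for n in mountain_names if n.split()),
        key=len,
        reverse=True,
    )
    i = 0
    while i < len(tokens):
        for name in names:
            if normalized[i:i + len(name)] == name:
                tags[i] = 1  # first word of the name: B-MNT
                for j in range(i + 1, i + len(name)):
                    tags[j] = 2  # remaining words: I-MNT
                i += len(name)
                break
        else:
            i += 1  # no name starts at this token
    return list(zip(tokens, tags))

tagged = tag_sentence(
    "Mont Blanc's peak is one of the most sought after.",
    ["Mont Blanc", "Everest"],
)
print(tagged[:3])
```

Matching on a normalized copy of the tokens keeps the original tokens (including punctuation) in the dataset while still recognizing the name inside them.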


After that, I tried the premium version of ChatGPT, and it actually started generating CSV files of 1,000 sentences each, in the appropriate format and ready to use. Here is how it looked:

In [3]:
'''
I'd like you to generate a synthetic dataset for me. I need to use it for finetuning of bert model for NER of mountains so it should be in appropriate format.
I need 3 rows and columns id, tokens, ner_tags. Use 3 tags: 0, 1, 2 that correspond to O, B-MNT, I-MNT. Give me 100 rows for a start.
Here is a short example of desired output: ...
'''

'''
id,tokens,ner_tags
0,Mount,1
0,Vesuvius,2
0,is,0
0,known,0
0,for,0
0,its,0
0,breathtaking,0
0,views,0
0,and,0
0,challenging,0
0,trails,0
1,Mount,1
1,Vesuvius,2
1,Mount,0
1,is,0
1,the,0
1,highest,0
1,peak,0
1,in,0
1,the,0
1,region,0
1,of,0
2,Mount,1
2,Matterhorn,2
2,has,0
2,a,0
2,rich,0
2,history,0
2,and,0
2,is,0
2,a,0
2,symbol,0
2,of,0
3,Mount,1
3,Matterhorn,2
3,is,0
3,located,0
3,in,0
3,the,0
3,beautiful,0
3,landscape,0
3,of,0
'''

The sentences turned out to be too short and not diverse enough.

In [4]:
'''
Increse the sentence lengths to mean of 7 words and make them more different. Try to partially use public data and real sentences for generation. 
Also try to write about mountains from different locations
'''

'''
id,tokens,ner_tags
0,Mount,1
0,Fuji,2
0,offers,0
0,some,0
0,of,0
0,the,0
0,most,0
0,spectacular,0
0,hiking,0
0,routes,0
0,in,0
1,Mount,1
1,McKinley,2
1,Mount,0
1,is,0
1,the,0
1,highest,0
1,peak,0
1,in,0
1,the,0
1,region,0
1,of,0
2,Mount,1
2,Rainier,2
2,Mount,0
2,is,0
2,the,0
2,highest,0
2,peak,0
2,in,0
2,the,0
2,region,0
2,of,0
3,Mount,1
3,Vesuvius,2
3,is,0
3,famous,0
3,for,0
3,its,0
3,snow-capped,0
3,peaks,0
3,and,0
3,alpine,0
3,flora,0
'''

It was better, but when I asked for 5,000 sentences, it silently generated only 959. So I decided that I had found the limit and started to experiment.

In [5]:
'''
Make another 1k rows but add some tricky case with present and abcent mountains
'''

'''
id,tokens,ner_tags
0,Many,0
0,climbers,0
0,dream,0
0,of,0
0,scaling,0
0,the,0
0,summit,0
0,of,0
0,a,0
0,great,0
0,peak,0
1,The,0
1,expedition,0
1,team,0
1,set,0
1,up,0
1,camp,0
1,at,0
1,the,0
1,base,0
1,of,0
2,The,0
2,expedition,0
2,team,0
2,set,0
2,up,0
2,camp,0
2,at,0
2,the,0
2,base,0
2,of,0
3,The,0
3,legend,0
3,of,0
3,revolves,0
3,around,0
3,the,0
3,mythical,0
3,creatures,0
3,of,0
'''

At this point I discovered that not all sentences in the dataset were actually unique, and after all my attempts it turned out that ChatGPT, premium or not, could only generate 50 truly unique sentences.
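
A check like this can be sketched in a few lines: rebuild every sentence from its token rows and compare the total number of sentences with the number of distinct ones. This is only an illustration on a tiny inline sample in the same `id,tokens,ner_tags` layout; the helper name `count_unique_sentences` is hypothetical.

```python
import csv
import io
from collections import defaultdict

def count_unique_sentences(csv_text):
    """Return (total sentences, distinct sentences) for token-per-row CSV text."""
    tokens_by_id = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        tokens_by_id[row["id"]].append(row["tokens"])
    sentences = [" ".join(tokens) for tokens in tokens_by_id.values()]
    return len(sentences), len(set(sentences))

# Tiny sample: sentences 1 and 2 are identical.
sample_csv = "\n".join([
    "id,tokens,ner_tags",
    "1,The,0", "1,expedition,0", "1,team,0",
    "2,The,0", "2,expedition,0", "2,team,0",
    "3,The,0", "3,legend,0",
])
total, unique = count_unique_sentences(sample_csv)
print(f"{unique} unique out of {total} sentences")
```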

In [6]:
'''
Make another 1000 sentences, but make it very very diverse. Use about 50 different sentence templates and about 50 mountains and combine them into 1000 sentences.
Add some randomness to it. The result should look like it was scraped over the internet, not generated by a short and determined script.
Be super creative about it. And stop putting Mount before each mountain, this it too obvious.
'''

'''
In response to this, ChatGPT actually wrote a Python script for dataset generation, and broke
'''

In [7]:
'''
Make another 1000 sentences but use AI generation instead of plain scripts. The result should not have so many repeats of identical sentence parts.
'''

'''
id,tokens,ner_tags
0,The,0
0,first,0
0,ascent,0
0,of,0
0,K2,1
0,was,0
0,a,0
0,monumental,0
0,event,0
0,in,0
0,mountaineering,0
0,history.,0
1,The,0
1,route,0
1,to,0
1,Kangchenjunga,1
1,is,0
1,dotted,0
1,with,0
1,both,0
1,beauty,0
1,and,0
1,peril.,0
2,K2,1
2,is,0
2,often,0
2,shrouded,0
2,in,0
2,"mist,",0
2,adding,0
2,to,0
2,its,0
2,mystique.,0
3,Annapurna,1
3,stands,0
3,as,0
3,an,0
3,icon,0
3,in,0
3,the,0
3,Fuji,1
3,landscape.,0
'''

In conclusion, I decided to use a combined dataset of all the previously generated chunks, since they at least differ from each other, and to generate some more in small batches with other adjustments, e.g. without the word Mount in each sentence.

But before that, I wanted to check one more option: the OpenAI API. The idea was to create a pool of prompt variations asking for a unique NER dataset about mountains, and to iteratively build a big dataset from the small chunks that the API would return.

In [None]:

import os
from openai import OpenAI

# The client takes the API key as a keyword argument.
client = OpenAI(
    api_key=os.environ.get("OPENAI_API_KEY"),
)

def infer_gpt(prompt):
    """Send one prompt to the chat completions endpoint and return the reply text."""
    try:
        resp = client.chat.completions.create(
            messages=[
                {
                    "role": "user",
                    "content": prompt,
                }
            ],
            model="gpt-3.5-turbo",
            temperature=0.6,
            max_tokens=3500,
            top_p=1,
            frequency_penalty=0.02,
            presence_penalty=0.2,
            timeout=120,
        )
    except Exception as e:
        print('Exception: ', e)
        return ""
    # The response is an object, not a JSON string; the text is in message.content.
    return resp.choices[0].message.content

Unfortunately, my OpenAI quota was already used up, so I decided to leave this option for the improvements chapter.
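
For reference, the iterative idea could look roughly like the sketch below. It was not run against the real API: `generate` is any callable that takes a prompt and returns CSV text (for example `infer_gpt`), and a stub stands in for it here. The names `parse_chunk`, `build_dataset`, and `fake_generate` and the `max_calls` limit are assumptions made for illustration.

```python
import csv
import io
import itertools

def parse_chunk(csv_text):
    """Parse 'id,tokens,ner_tags' CSV text into sentences,
    each a list of (token, tag) pairs. Malformed rows are skipped."""
    if not csv_text:
        return []  # failed call or empty reply
    sentences = {}
    for row in csv.reader(io.StringIO(csv_text.strip())):
        if len(row) != 3 or not row[0].isdigit() or row[2] not in {"0", "1", "2"}:
            continue  # header line, prose around the CSV, or a broken row
        sentences.setdefault(row[0], []).append((row[1], int(row[2])))
    return list(sentences.values())

def build_dataset(prompts, generate, target_sentences, max_calls=50):
    """Cycle through the prompt pool, keeping only sentences not seen before."""
    seen, dataset = set(), []
    for prompt in itertools.islice(itertools.cycle(prompts), max_calls):
        for sentence in parse_chunk(generate(prompt)):
            key = " ".join(token for token, _ in sentence)
            if key not in seen:
                seen.add(key)
                dataset.append(sentence)
        if len(dataset) >= target_sentences:
            break
    return dataset[:target_sentences]

# Stub generator that always returns the same two sentences.
def fake_generate(prompt):
    return (
        "id,tokens,ner_tags\n"
        "0,K2,1\n0,is,0\n0,steep.,0\n"
        "1,Mount,1\n1,Fuji,2\n1,erupted.,0\n"
    )

dataset = build_dataset(["prompt A", "prompt B"], fake_generate,
                        target_sentences=5, max_calls=4)
print(len(dataset))
```

Because duplicates are dropped across chunks, the number of sentences collected per call also gives a rough signal of when a prompt pool stops producing anything new.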

### Summary

I've discovered several ways to build a dataset, including basic ones that haven't been described yet:
- Scraping the internet
- Prompting ChatGPT for a generator script
- Prompting ChatGPT for a chunk of unique sentences
- Using the OpenAI API to iteratively build a dataset from chunks

The best way would be to combine all these ideas: build a dataset that contains scraped text and extend it using the OpenAI API with diverse prompts for dataset chunks.

One important note is that the training and validation sets should be built independently, so that the patterns used in them do not overlap significantly.
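
One possible way to quantify such overlap, shown here only as a sketch, is to replace each tagged mountain span with a placeholder and compare the resulting sentence patterns between the two sets. The functions `sentence_pattern` and `pattern_overlap` and the `<MNT>` placeholder are hypothetical, and sentences are assumed to be lists of `(token, tag)` pairs with tags 0, 1, 2 for O, B-MNT, I-MNT.

```python
def sentence_pattern(sentence):
    """Collapse each tagged mountain span into one <MNT> placeholder."""
    pattern = []
    for token, tag in sentence:
        if tag == 0:
            pattern.append(token.lower())
        elif tag == 1:
            pattern.append("<MNT>")
        # tag == 2 continues the current span, so nothing is appended
    return " ".join(pattern)

def pattern_overlap(train, valid):
    """Share of distinct validation patterns that also occur in training."""
    train_patterns = {sentence_pattern(s) for s in train}
    valid_patterns = {sentence_pattern(s) for s in valid}
    if not valid_patterns:
        return 0.0
    return len(valid_patterns & train_patterns) / len(valid_patterns)

train = [[("K2", 1), ("is", 0), ("steep.", 0)]]
valid = [
    [("Mount", 1), ("Fuji", 2), ("is", 0), ("steep.", 0)],  # same pattern as train
    [("Many", 0), ("climb", 0), ("Denali.", 1)],            # new pattern
]
overlap = pattern_overlap(train, valid)
print(overlap)
```

A value close to 1 would suggest that the validation set mostly repeats training templates with different mountain names, so validation scores would overstate how well the model generalizes.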