# Creating a dataset for the NER task

In [35]:
import requests
import re
import csv
import pandas as pd
from nltk.stem import WordNetLemmatizer

The function get_wikipedia_article(animal) sends a request to the Wikipedia API for the plain-text extract of the article whose title is the given animal name. It parses the JSON response and returns the extract, or an empty string if the page has none. Because the request does not restrict the extract to the introduction, the returned text is the article body rather than a short summary.

In [27]:
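def get_wikipedia_article(animal):
    # MediaWiki Action API endpoint for English Wikipedia
    url = "https://en.wikipedia.org/w/api.php"
    params = {
        "action": "query",
        "format": "json",
        "prop": "extracts",   # ask for the page text
        "explaintext": True,  # plain text instead of HTML
        "titles": animal
    }
    response = requests.get(url, params=params, timeout=10)
    response.raise_for_status()  # fail loudly on HTTP errors
    data = response.json()
    # "pages" is keyed by page ID; a single title yields a single entry
    page = next(iter(data["query"]["pages"].values()))
    # Missing pages have no "extract" key, so fall back to an empty string
    return page.get("extract", "")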
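
To make the response handling easier to follow without touching the network, here is a minimal sketch of the parsing step on its own. The helper name extract_text and both sample payloads are hypothetical; they only mirror the nested "query" / "pages" shape that the function above relies on.

```python
def extract_text(data):
    """Return the plain-text extract from a parsed API response, or ""."""
    pages = data.get("query", {}).get("pages", {})
    if not pages:
        return ""
    # The pages mapping is keyed by page ID, so take its first (only) value.
    page = next(iter(pages.values()))
    return page.get("extract", "")

# Illustrative payloads (assumed shape, not captured from a real request).
found = {"query": {"pages": {"12345": {
    "pageid": 12345,
    "title": "Albatross",
    "extract": "Albatrosses are large seabirds.",
}}}}
missing = {"query": {"pages": {"-1": {"title": "Notananimal", "missing": ""}}}}

print(repr(extract_text(found)))
print(repr(extract_text(missing)))
```

Using .get at each level means a malformed or empty response yields an empty string instead of a KeyError, which fits the "skip this animal" handling later in the notebook.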

The function process_article(text, animal) splits the article text into paragraphs and cleans each one by removing bracketed content and collapsing runs of whitespace. It keeps the paragraphs that contain the animal's name (a case-insensitive substring match) and stops once it has collected 7 of them.

In [28]:
def process_article(text, animal):
    paragraphs = text.split('\n')
    selected_paragraphs = []
    for p in paragraphs:
        p = re.sub(r'\[.*?\]', '', p)  # Remove content in square brackets
        p = re.sub(r'\s+', ' ', p)  # Replace multiple spaces with a single space
        cleaned_paragraph = p.strip()

        if cleaned_paragraph and animal.lower() in cleaned_paragraph.lower():
            selected_paragraphs.append(cleaned_paragraph)
        if len(selected_paragraphs) == 7:
            break
    return selected_paragraphs
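
A substring test also accepts paragraphs where the name is only part of a longer word (for example "cat" inside "category"). If that turns out to matter, one possible alternative is a whole-word match that tolerates a simple plural ending. The helper mentions_animal below is hypothetical and the plural rule is a rough heuristic, not part of the original pipeline.

```python
import re

def mentions_animal(paragraph, animal):
    """True if `animal` occurs as a whole word, optionally followed by "s" or "es"."""
    pattern = rf"\b{re.escape(animal)}(?:e?s)?\b"
    return re.search(pattern, paragraph, flags=re.IGNORECASE) is not None

print(mentions_animal("Cats are small carnivores.", "cat"))
print(mentions_animal("This category is broad.", "cat"))
print(mentions_animal("Albatrosses are seabirds.", "albatross"))
```

Irregular plurals such as "mice" are not covered by this pattern, so the substring check may still be the better choice for some names.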

The file animals_names.txt contains the list of animal names, one per line. Blank lines are skipped.

In [29]:
animal_names_file = "animals_names.txt"

with open(animal_names_file, 'r', encoding='utf-8') as file:
    animals = [line.strip() for line in file if line.strip()]

I retrieve and process the Wikipedia article for each animal in the list. The paragraphs that mention the animal's name are joined into a single text field and added to the dataset, with the animal's name as the label. If the article cannot be retrieved or no paragraph mentions the name, I print a message and continue with the next animal.

In [None]:
dataset = []
for animal in animals:
    article_text = get_wikipedia_article(animal)
    if not article_text:
        print(f"Could not retrieve article for: {animal}")
        continue
    paragraphs = process_article(article_text, animal)
    if paragraphs:
        dataset.append({"text": "\n".join(paragraphs), "label": animal})
    else:
        print(f"No paragraphs found containing the animal name for: {animal}")
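
The shape of the resulting records can be checked offline by passing the fetch and selection steps in as functions. This is only a sketch: build_dataset, fake_articles, and the two lambdas are hypothetical stand-ins for get_wikipedia_article and process_article, and the skipped list replaces the printed messages.

```python
def build_dataset(animals, fetch, select):
    """Collect {"text", "label"} records and the names that produced nothing."""
    dataset, skipped = [], []
    for animal in animals:
        text = fetch(animal)
        paragraphs = select(text, animal) if text else []
        if paragraphs:
            dataset.append({"text": "\n".join(paragraphs), "label": animal})
        else:
            skipped.append(animal)
    return dataset, skipped

# Stand-ins for the network call and the paragraph filter.
fake_articles = {"lion": "The lion is a large cat.\nIt lives in Africa."}
fetch = lambda name: fake_articles.get(name, "")
select = lambda text, name: [p for p in text.split("\n") if name in p.lower()]

dataset, skipped = build_dataset(["lion", "unicorn"], fetch, select)
print(dataset)
print(skipped)
```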


Finally, I save the selected paragraphs and their corresponding animal names into a CSV file for further analysis.

In [32]:
with open("dataset.csv", "w", newline='', encoding="utf-8") as csvfile:
    fieldnames = ["text", "label"]
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()
    for data in dataset:
        writer.writerow(data)
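
Each text field holds several paragraphs joined with newlines, so it is worth confirming that such a field survives a write and read cycle. The sketch below does this in memory with a made-up record; the csv module quotes fields that contain newlines, commas, or quotation marks.

```python
import csv
import io

# Made-up record with an embedded newline, a comma, and quotation marks.
rows = [{"text": 'First paragraph, with a comma.\nSecond "quoted" paragraph.',
         "label": "lion"}]

buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=["text", "label"])
writer.writeheader()
writer.writerows(rows)

buffer.seek(0)
restored = list(csv.DictReader(buffer))
print(restored == rows)
```

Passing newline="" when opening the real file serves the same purpose as it does here: the csv module controls line endings itself, so embedded newlines are not altered on the way out.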


# Preprocessing of the dataset

I will preprocess the dataset by converting it into a tagged format suitable for training a Named Entity Recognition (NER) model. The tagged format consists of sentences where each word is tagged with the corresponding entity label. In this case, the entity label will be "ANIMAL" for words that represent the animal's name and "O" for all other words.

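I will use the NLTK library to lemmatize the words in the dataset. Lemmatization reduces a word to its base form, so an inflected form such as the plural "albatrosses" can be matched against the label "albatross". The lemma is used only for this comparison; the words stored in the dataset are not lemmatized. WordNetLemmatizer needs the WordNet corpus, which can be fetched once with nltk.download("wordnet").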

In [36]:
lemmatizer = WordNetLemmatizer()

Read the dataset from the CSV file into a list of dictionaries.

In [41]:
data = pd.read_csv("dataset.csv").to_dict(orient='records')

The transform_to_tagged_format function takes a text string and an entity label and returns a list of sentences, each a list of (word, tag) pairs. It splits the text into sentences and then into whitespace-separated words. Each word is stripped of punctuation, and a lowercased, lemmatized copy is compared with the lowercased entity label. If they match, the word is tagged "ANIMAL"; otherwise it is tagged "O". The stored word keeps its original casing. Because the comparison is made one word at a time, a label made of several words never matches.
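
The sentence-splitting pattern used by this function is dense, so here is a small sketch of what it does on a made-up text. It splits on whitespace that follows a period or question mark, but not after a capitalized two-letter abbreviation such as "Dr." or a dotted abbreviation such as "e.g.". Exclamation marks are not treated as sentence ends.

```python
import re

SENTENCE_BOUNDARY = r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s'

text = ("Dr. Smith studies the albatross. It ranges widely, e.g. over "
        "the Southern Ocean. Does it fly far? Yes.")
sentences = re.split(SENTENCE_BOUNDARY, text)
for sentence in sentences:
    print(sentence)
```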

In [42]:
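def transform_to_tagged_format(text, entity_label):
    tagged_format = []
    target = entity_label.lower()

    # Split after "." or "?", but not after abbreviations like "e.g." or "Dr."
    sentences = re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s', text)
    for sentence in sentences:
        sentence_tags = []
        for word in sentence.split():
            cleaned_word = re.sub(r'[^\w\s]', '', word)  # strip punctuation
            if not cleaned_word:
                # Skip tokens that were pure punctuation; an empty token would
                # misalign words and tags once they are joined with spaces
                continue
            lemma = lemmatizer.lemmatize(cleaned_word.lower())
            tag = "ANIMAL" if lemma == target else "O"
            sentence_tags.append((cleaned_word, tag))
        if sentence_tags:  # drop sentences that ended up empty
            tagged_format.append(sentence_tags)
    return tagged_format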
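
The per-word tagging rule can be tried without NLTK by passing the lemmatizer in as a function. In this sketch, tag_sentence is a hypothetical single-sentence helper and toy_lemmas is a one-entry lookup standing in for lemmatizer.lemmatize; it is not a real lemmatizer. The helper also skips tokens that consist only of punctuation.

```python
import re

def tag_sentence(sentence, entity_label, lemmatize):
    """Return (token, tag) pairs for one sentence."""
    target = entity_label.lower()
    pairs = []
    for word in sentence.split():
        token = re.sub(r"[^\w\s]", "", word)  # strip punctuation
        if not token:
            continue  # the token was pure punctuation
        tag = "ANIMAL" if lemmatize(token.lower()) == target else "O"
        pairs.append((token, tag))
    return pairs

# Toy lookup used in place of WordNetLemmatizer for this illustration.
toy_lemmas = {"albatrosses": "albatross"}
lemmatize = lambda word: toy_lemmas.get(word, word)

pairs = tag_sentence("Great albatrosses are large - very large.",
                     "Albatross", lemmatize)
print(pairs)
```

The standalone dash produces no token, so the number of words and the number of tags stay equal.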

I will transform the dataset into a tagged format.

In [43]:
tagged_data = [
    sentence
    for record in data
    for sentence in transform_to_tagged_format(record.get("text", ""), record.get("label", ""))
]

I will convert the tagged data into a DataFrame with one row per sentence. The words of a sentence are joined into one space-separated string, and so are their tags.

In [46]:
sentence_column = [" ".join([word for word, tag in sentence]) for sentence in tagged_data]
tag_column = [" ".join([tag for word, tag in sentence]) for sentence in tagged_data]
tagged_df = pd.DataFrame({'Sentence': sentence_column, 'Tags': tag_column})
tagged_df.head()

| | Sentence | Tags |
|---|---|---|
| 0 | Albatrosses of the biological family Diomedeid... | ANIMAL O O O O O O O O O O O O O O O O O O O O... |
| 1 | They range widely in the Southern Ocean and th... | O O O O O O O O O O O |
| 2 | They are absent from the North Atlantic althou... | O O O O O O O O O O O O ANIMAL O O O O O O O O... |
| 3 | Great albatrosses are among the largest of fly... | O ANIMAL O O O O O O O O O O O O O O O O O O O... |
| 4 | The albatrosses are usually regarded as fallin... | O ANIMAL O O O O O O O O O O O O O O O O |
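
Since each row stores its words and its tags as two space-separated strings, anything that reads the file back will split both and expect the same number of items. A quick alignment check along these lines can catch rows where that does not hold. The function name misaligned_rows and the sample values are hypothetical.

```python
def misaligned_rows(sentences, tags):
    """Return the indices of rows whose word count differs from their tag count."""
    return [
        index
        for index, (sentence, tag_string) in enumerate(zip(sentences, tags))
        if len(sentence.split()) != len(tag_string.split())
    ]

# Made-up rows: the second has four tags but only three words.
sample_sentences = ["Great albatrosses are large", "They range  widely"]
sample_tags = ["O ANIMAL O O", "O O O O"]

print(misaligned_rows(sample_sentences, sample_tags))
```

With the notebook's variables, the same check would be misaligned_rows(sentence_column, tag_column), and an empty list means every row is aligned.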


Save the tagged dataset to a CSV file.

In [45]:
tagged_df.to_csv("tagged_dataset.csv", index=False)