# Preparing Data for Fine-tuning a NER Model
started Oct 17th

In [7]:
import pandas as pd
import re
from bs4 import BeautifulSoup
import numpy as np

### 1. **Data Collection**
- Gather text data representative of your content.
- If your dataset is insufficient, consider augmenting it with more representative examples.


In [67]:
# Read in HTML sections and tables
with open('2022ApJ...924...14P.html') as file:
    soup = BeautifulSoup(file, 'html.parser')

# To inspect the HTML and identify the class labels:
# print(soup.prettify())

# Get the text sections from the HTML
sections = soup.find_all('div', class_="article-text")
print(len(sections))

# Extract every <p> element in the document (not only those inside `sections`)
paragraphs = soup.find_all('p')
ps = 0  # number of paragraphs kept
for i, para in enumerate(paragraphs):
    p = para.get_text()
    # Keep body-text paragraphs: longer than 100 characters and starting with a letter
    if len(p) > 100 and p[0].isalpha():
        # print(f"Paragraph {i+1}:", p)
        # print('--------------')
        ps += 1
print(i, ps)  # index of the last paragraph, number of paragraphs kept

# TODO: read in the label file

# TODO: find where in the sections or tables the labeled entities are mentioned



50
258 133


### 2. **Annotation**
- Mark and label entities within your text.
- Entities to start with: `Object Name`, `RA`, `DEC`, `Redshift`, `Type`. We may add more later.
#### 2.1 **Figure out annotation formats**
Data can be represented in various formats:
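- **BIO (or IOB) Format**: a tagging scheme that marks each token as the Beginning of an entity (`B-`), Inside one (`I-`), or Outside any entity (`O`).
- **CoNLL Format**: column-based, one token per line with its tag and a blank line between sentences, used in datasets like CoNLL-2003. **Will go with this for now.**
- **spaCy Format**: JSON format (for spaCy users) with entities represented by start/end character positions.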

Manual annotation can be time-consuming. If NED had not already done part of this, we could have considered [Doccano](https://doccano.herokuapp.com/), [Prodigy](https://prodi.gy/) (by the spaCy creators, paid), [Labelbox](https://www.labelbox.com/), or [Brat](http://brat.nlplab.org/).

In [6]:
text = "John S. Maro lives in New York and is tired. He is 27.2 years old"
entities = [("John S. Maro", "PERSON"), ("New York", "LOCATION"), ("27.2","AGE")]

# Step 1: Tokenize
tokens = text.split()  # Simplistic whitespace tokenization; punctuation stays attached (e.g. "tired.")
labels = ['O'] * len(tokens)  # Step 2: Initialize with 'O' tags

# Step 3: Match entities and assign tags
for entity, entity_type in entities:
    entity_tokens = entity.split()
    for i in range(len(tokens) - len(entity_tokens) + 1):
        if tokens[i:i+len(entity_tokens)] == entity_tokens:
            labels[i] = "B-" + entity_type
            for j in range(1, len(entity_tokens)):
                labels[i+j] = "I-" + entity_type

# Step 4: Compile to CoNLL format
conll_data = "\n".join([f"{token} {label}" for token, label in zip(tokens, labels)])

print(conll_data)


John B-PERSON
S. I-PERSON
Maro I-PERSON
lives O
in O
New B-LOCATION
York I-LOCATION
and O
is O
tired. O
He O
is O
27.2 B-AGE
years O
old O
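
The cell above matches entities by comparing whitespace tokens, so an entity directly followed by punctuation would be missed. If annotations arrive as start/end character positions instead (the spaCy-style representation listed earlier), one possible conversion to BIO tags is sketched below. The function name `spans_to_bio` and the hand-written offsets are illustrative assumptions, not part of any library.

```python
import re

def spans_to_bio(text, spans):
    """Convert (start, end, label) character spans to per-token BIO tags."""
    # Tokenize on runs of non-whitespace and remember each token's offsets.
    tokens = [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        inside = False
        for idx, (_, tok_start, tok_end) in enumerate(tokens):
            # A token belongs to the span if the two character ranges overlap.
            if tok_start < end and tok_end > start:
                tags[idx] = ("I-" if inside else "B-") + label
                inside = True
    return [tok for tok, _, _ in tokens], tags

text = "John S. Maro lives in New York and is tired. He is 27.2 years old"
# Offsets written by hand for this example (end is exclusive).
spans = [(0, 12, "PERSON"), (22, 30, "LOCATION"), (51, 55, "AGE")]
tokens, tags = spans_to_bio(text, spans)

for token, tag in zip(tokens, tags):
    print(token, tag)
```

Because the match is made on character overlap, a token such as `York.` would still receive the `I-LOCATION` tag even with trailing punctuation attached.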


### 3. **Train/Test Split**
- Consider an 80% training, 10% validation, and 10% test split.
- Respect document boundaries to avoid overlap between sets.
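
A minimal sketch of a split that respects document boundaries: whole documents are shuffled and assigned to one set each, so sentences from the same paper never land in both training and test data. The function name `split_by_document` and the `paper_XX` identifiers are hypothetical.

```python
import random

def split_by_document(doc_ids, train_frac=0.8, val_frac=0.1, seed=0):
    """Assign whole documents to train/val/test so no document spans two sets."""
    unique_ids = sorted(set(doc_ids))
    random.Random(seed).shuffle(unique_ids)  # seeded for reproducibility
    n_train = round(len(unique_ids) * train_frac)
    n_val = round(len(unique_ids) * val_frac)
    return {
        "train": unique_ids[:n_train],
        "val": unique_ids[n_train:n_train + n_val],
        "test": unique_ids[n_train + n_val:],  # whatever remains
    }

# Hypothetical document identifiers, e.g. one per article.
doc_ids = [f"paper_{k:02d}" for k in range(20)]
splits = split_by_document(doc_ids)
print({name: len(ids) for name, ids in splits.items()})
```

The fractions apply to the number of documents, not sentences, so the sentence-level proportions will only be approximate when documents differ a lot in length.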

### 4. **Preprocessing**
- Tokenize consistently with the pre-trained model's tokenization.
- Other steps might include lowercasing (only for uncased models, since case can matter for object names) and handling punctuation.
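
Subword tokenizers usually split one word into several pieces, so word-level labels have to be re-aligned after tokenization. The sketch below assumes a `word_ids` list like the one returned by `word_ids()` on a HuggingFace fast tokenizer's output (one word index per subword, `None` for special tokens); the values here are made up for illustration, and `align_labels` is a hypothetical helper name.

```python
def align_labels(word_labels, word_ids, ignore_index=-100):
    """Expand word-level label ids to subword positions.

    Only the first piece of each word keeps the label; special tokens and
    later pieces get `ignore_index` so they do not contribute to the loss.
    """
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(ignore_index)          # special token
        elif word_id != previous:
            aligned.append(word_labels[word_id])  # first piece of a word
        else:
            aligned.append(ignore_index)          # continuation piece
        previous = word_id
    return aligned

# Assumed example: three words, the second one split into two subword pieces.
word_labels = [1, 2, 0]                  # e.g. B-PERSON, I-PERSON, O as integer ids
word_ids = [None, 0, 1, 1, 2, None]      # illustrative values, not real tokenizer output
aligned = align_labels(word_labels, word_ids)
print(aligned)
```

The value `-100` is the default `ignore_index` of PyTorch's cross-entropy loss, which is why it is commonly used for masked positions.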

### 5. **Model-Specific Formatting**
- Convert data to be compatible with your chosen framework.
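- For HuggingFace Transformers, token-classification models (e.g. `AutoModelForTokenClassification`) expect `input_ids`, an `attention_mask`, and one integer label per token in `labels`.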
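
As a rough sketch of the conversion step, the CoNLL text produced earlier can be parsed into sentences and the string tags mapped to integer ids. The helper `read_conll` is hypothetical, and the field names `tokens` and `ner_tags` are an assumption that follows a common convention in token-classification datasets.

```python
def read_conll(conll_text):
    """Parse 'token label' lines; blank lines separate sentences."""
    sentences, tokens, tags = [], [], []
    for line in conll_text.splitlines():
        line = line.strip()
        if not line:
            if tokens:  # a blank line closes the current sentence
                sentences.append((tokens, tags))
                tokens, tags = [], []
            continue
        token, tag = line.rsplit(maxsplit=1)  # the label is the last column
        tokens.append(token)
        tags.append(tag)
    if tokens:  # flush the last sentence if the text has no trailing blank line
        sentences.append((tokens, tags))
    return sentences

conll_text = "John B-PERSON\nS. I-PERSON\nMaro I-PERSON\nlives O\n\nHe O\nis O\n27.2 B-AGE"
sentences = read_conll(conll_text)

# Build the label vocabulary and both lookup directions.
label_list = sorted({tag for _, tags in sentences for tag in tags})
label2id = {label: i for i, label in enumerate(label_list)}
id2label = {i: label for label, i in label2id.items()}

examples = [
    {"tokens": tokens, "ner_tags": [label2id[t] for t in tags]}
    for tokens, tags in sentences
]
print(label2id)
print(examples[0])
```

Both `label2id` and `id2label` are worth keeping, since the model configuration typically needs them to report predictions as readable tags.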


### 6. **Augmentation (Optional)**
For smaller datasets, consider:
- Back translation
- Synonym replacement
- Sentence shuffling
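
For NER, any augmentation has to keep tokens and labels aligned and should leave entity mentions intact. A minimal sketch of synonym replacement under those constraints is shown below: only tokens tagged `O` are eligible. The `synonyms` table and the function name `replace_synonyms` are hypothetical; a real setup would draw synonyms from a lexical resource.

```python
import random

def replace_synonyms(tokens, tags, synonyms, prob=0.3, seed=0):
    """Replace non-entity tokens with a synonym; labels are left unchanged."""
    rng = random.Random(seed)
    new_tokens = []
    for token, tag in zip(tokens, tags):
        options = synonyms.get(token.lower())
        # Only touch tokens outside entities, and only one-to-one swaps,
        # so the label sequence stays valid.
        if tag == "O" and options and rng.random() < prob:
            new_tokens.append(rng.choice(options))
        else:
            new_tokens.append(token)
    return new_tokens, list(tags)

tokens = ["John", "lives", "in", "New", "York"]
tags = ["B-PERSON", "O", "O", "B-LOCATION", "I-LOCATION"]
synonyms = {"lives": ["resides"], "new": ["fresh"]}  # toy table for illustration

# prob=1.0 replaces every eligible token, which makes the result deterministic.
aug_tokens, aug_tags = replace_synonyms(tokens, tags, synonyms, prob=1.0)
print(aug_tokens)
```

Note that `New` is left alone even though the toy table has an entry for it, because it is part of a labeled entity.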

### 7. **Data Quality Checks**
- Ensure annotation consistency.
- Address issues like overlapping annotations or mislabeled entities.
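
Two of these checks are easy to automate. The sketch below flags `I-` tags that do not continue an entity of the same type, and finds character spans that overlap; both function names are hypothetical and the inputs are toy examples.

```python
def find_bio_errors(tags):
    """Return indices where an I- tag does not continue an entity of the same type."""
    errors = []
    previous = "O"
    for idx, tag in enumerate(tags):
        if tag.startswith("I-"):
            entity_type = tag[2:]
            if previous not in ("B-" + entity_type, "I-" + entity_type):
                errors.append(idx)
        previous = tag
    return errors

def find_overlaps(spans):
    """Return pairs of (start, end, label) spans that share at least one character."""
    ordered = sorted(spans)
    overlaps = []
    for i, first in enumerate(ordered):
        for second in ordered[i + 1:]:
            if second[0] >= first[1]:
                break  # sorted by start, so no later span can overlap `first`
            overlaps.append((first, second))
    return overlaps

bad_tags = ["O", "I-PERSON", "B-LOCATION", "I-LOCATION", "I-PERSON"]
bio_errors = find_bio_errors(bad_tags)

toy_spans = [(0, 12, "PERSON"), (8, 12, "PERSON"), (22, 30, "LOCATION")]
overlaps = find_overlaps(toy_spans)

print(bio_errors)
print(overlaps)
```

Checks like these catch structural problems only; mislabeled entity types still need a manual review of a sample.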

After data preparation, proceed with fine-tuning your NER model, evaluating on the validation set and tuning hyperparameters as needed.
