# Exploratory Data Analysis (EDA) of Processed Textual Data using OpenAI Models

This notebook demonstrates how OpenAI models (e.g., GPT-3.5-turbo, GPT-4) can be used for initial EDA and feature engineering on processed textual data. The focus is on Named Entity Recognition (NER), conceptual Topic Modeling, Relationship Extraction, and Geocoding/Disambiguation relevant to Amazonian archaeology.

**Strategy Reference:** `EDA_FEATURE_ENGINEERING_STRATEGY.md`

In [None]:
import configparser
from pathlib import Path
import os
import json
import time # For potential rate limiting
from openai import OpenAI # Using the new OpenAI Python library v1.x.x

# Helper for pretty printing JSON
def print_json(data):
    print(json.dumps(data, indent=2))

## 1. Configuration and OpenAI API Key Setup

In [None]:
CONFIG_FILE_PATH = "../scripts/satellite_pipeline/config.ini" # Adjust if your config is elsewhere
SCRIPT_DIR = Path(".").resolve().parent # Assuming notebook is in 'notebooks' dir, so parent is project root
EDA_OUTPUT_DIR = SCRIPT_DIR / "eda_outputs" / "textual"
EDA_OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

def load_config(config_path):
    config = configparser.ConfigParser(interpolation=None, allow_no_value=True)
    if not Path(config_path).exists():
        raise FileNotFoundError(f"Configuration file '{config_path}' not found.")
    config.read(config_path)
    return config

config = load_config(CONFIG_FILE_PATH)

# Get relevant paths from config
base_processed_dir_raw = config['DEFAULT'].get('base_processed_data_dir', '../../data')
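# config.get(..., fallback=...) also covers a missing [TextualData] section,
# whereas config['TextualData'] would raise a KeyError in that case.
text_processed_suffix = config.get('TextualData', 'text_processed_suffix', fallback='textual/processed')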

# Construct absolute path for processed_text_dir from SCRIPT_DIR (project root)
PROCESSED_TEXT_DIR = (SCRIPT_DIR / base_processed_dir_raw.replace('../../', '') / text_processed_suffix).resolve()

print(f"Processed Text Directory: {PROCESSED_TEXT_DIR}")
print(f"EDA Output Directory: {EDA_OUTPUT_DIR}")

### OpenAI API Key Configuration

To use OpenAI models, you need to set up your API key. **Never hardcode your API key directly in the notebook.**

**Recommended methods:**

1.  **Environment Variable (Preferred):**
    Set an environment variable named `OPENAI_API_KEY` to your actual API key.
    ```bash
    export OPENAI_API_KEY='your_actual_api_key_here'
    ```
    You can set this in your shell session before launching Jupyter, or in your shell's startup files (e.g., `.bashrc`, `.zshrc`). A `.env` file also works, but Jupyter does not read it automatically; it has to be loaded explicitly (for example with the `python-dotenv` package).

2.  **Configuration File (Less Secure if not managed properly):**
    You could add it to a configuration file that is **NOT** committed to version control. For instance, you could create a separate `~/.openai_config.ini` or add it to your main `config.ini` under a specific section, ensuring this file is in your `.gitignore`.
    Example in `config.ini` (ensure this file is gitignored or key is externalized):
    ```ini
    [OpenAI]
    api_key = sk-your_actual_api_key_here 
    ```
    Then load it in the notebook: `openai_api_key = config['OpenAI'].get('api_key')`

The OpenAI Python library will automatically look for the `OPENAI_API_KEY` environment variable. If you use another method, you'll need to pass the key when initializing the client:
```python
# client = OpenAI(api_key="YOUR_KEY") 
```
For this notebook, we assume the environment variable `OPENAI_API_KEY` is set.
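
The lookup order described above (environment variable first, then a gitignored config file) could be sketched as follows. `resolve_api_key` is a hypothetical helper introduced here for illustration, not part of the OpenAI library, and the `[OpenAI]` section name simply follows the example above.

```python
import configparser
import os


def resolve_api_key(config, env=None):
    """Return the API key from the environment, else from an [OpenAI] config section.

    Hypothetical helper: the name and the lookup order are assumptions.
    """
    env = os.environ if env is None else env
    key = env.get("OPENAI_API_KEY")
    if key:
        return key
    # fallback=None avoids an exception when the section or option is missing
    key = config.get("OpenAI", "api_key", fallback=None)
    if key and key.strip():
        return key.strip()
    raise RuntimeError("No API key found in OPENAI_API_KEY or the [OpenAI] config section.")


# Demo with an in-memory config and an empty fake environment (no real key involved)
demo_config = configparser.ConfigParser(interpolation=None)
demo_config.read_string("[OpenAI]\napi_key = sk-placeholder\n")
print(resolve_api_key(demo_config, env={}))  # falls back to the config value
```

The result could then be passed explicitly, e.g. `client = OpenAI(api_key=resolve_api_key(config))`.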

In [None]:
# Initialize OpenAI Client
# The client automatically looks for the OPENAI_API_KEY environment variable.
try:
    client = OpenAI()
    # Test the client with a simple call (optional, but good for ensuring setup)
    # client.models.list() 
    print("OpenAI client initialized successfully. It will use the OPENAI_API_KEY environment variable.")
except Exception as e:
    print(f"Error initializing OpenAI client: {e}")
    print("Please ensure your OPENAI_API_KEY environment variable is set correctly.")
    client = None # Set client to None if initialization fails

## 2. Load Sample Processed Text Data

We'll load up to `NUM_SAMPLES_TO_LOAD` sample text files from the processed directory. If fewer are found, in-memory placeholder texts fill the gap for demonstration. In a real scenario, these would be outputs from the `preprocess_texts.py` script.

In [None]:
sample_texts = {}
NUM_SAMPLES_TO_LOAD = 3 # Number of sample files to attempt to load

if PROCESSED_TEXT_DIR.exists():
    # Try to load actual processed files (sorted so the selection is reproducible)
    processed_files = sorted(f for f in PROCESSED_TEXT_DIR.glob("*_processed.txt") if f.is_file())
    for filepath in processed_files[:NUM_SAMPLES_TO_LOAD]:
        try:
            with open(filepath, 'r', encoding='utf-8') as f:
                sample_texts[filepath.name] = f.read()
            print(f"Loaded sample: {filepath.name}")
        except Exception as e:
            print(f"Error loading {filepath.name}: {e}")

# If not enough files loaded, create/use placeholder examples
if len(sample_texts) < NUM_SAMPLES_TO_LOAD:
    print("\nNot enough processed files found, using placeholder examples for demonstration.")
    placeholder_texts = {
        "placeholder_colonial_diary_extract_processed.txt": "In the year of our Lord 1750, we journeyed up the Rio Negro for many leagues. The lands were fertile, and the natives, called the Manao, had large villages with extensive fields of manioc. Near a great bend in the river, they showed us ancient earthworks, mounds they called 'geoglifos', remnants of an older people. They also spoke of a hidden city, 'El Dorado', further west, built of gold near Lake Parime. We found much Brazilwood and collected samples of a peculiar black soil, 'terra preta', which they used for their crops.",
        "placeholder_academic_paper_summary_processed.txt": "Recent archaeological surveys in the Upper Xingu region have revealed a complex network of pre-Columbian settlements. These sites, often characterized by ring villages, plazas, and causeways, suggest a high degree of social organization. LiDAR analysis has been crucial in identifying features obscured by forest cover, including canals and fish weirs dating from 1200 AD to 1600 AD. Ceramic evidence points to trade connections with groups in the Andean foothills. The Kuhikugu site complex is a prime example.",
        "placeholder_field_notes_short_processed.txt": "Visited the 'Sitio das Antas' near the Curua river on July 10, 1988. Found several large ceramic urns and stone axes. The local guide, Mr. Silva, mentioned his grandfather found similar artifacts while clearing land for a new roça (field). The soil here is dark and rich. Many Brazil nut trees nearby."
    }
    # Add placeholders if real samples are missing, prioritizing real ones
    for name, text in placeholder_texts.items():
        if len(sample_texts) < NUM_SAMPLES_TO_LOAD:
            if name not in sample_texts: # Avoid overwriting a real sample if it happened to have a similar name
                sample_texts[name] = text
        else:
            break

if not sample_texts:
    print("CRITICAL: No sample texts available. Please add some processed text files to the processed directory or ensure placeholders are defined.")
else:
    print(f"\nTotal samples for EDA: {len(sample_texts)}")
    for name, text_content in sample_texts.items():
        print(f"\n--- Sample: {name} ---")
        print(text_content[:300] + "...") # Print a snippet

## 3. Named Entity Recognition (NER)

We'll use an OpenAI model to extract predefined entity types relevant to Amazonian archaeology.

In [None]:
def extract_entities_with_openai(text_content, entity_types, model="gpt-3.5-turbo", max_retries=2, delay_seconds=5):
    if not client:
        print("OpenAI client not initialized. Skipping NER.")
        return None

    # Build the example output structure before the f-string is evaluated;
    # otherwise {entity_types_example} would raise a NameError.
    entity_types_example = json.dumps(
        {etype: ["example segment 1", "example segment 2"] for etype in entity_types.keys()}
    )

    prompt = f"""Extract the specified entities from the following text.
For each entity, provide the text segment and the entity type.
If an entity appears multiple times, list each instance.
Desired entity types: {', '.join(entity_types.keys())}

Provide the output as a JSON object where keys are entity types and values are lists of extracted text segments.
Example of the output structure: {entity_types_example}

Text to analyze:
--- --- --- --- ---
{text_content}
--- --- --- --- ---
Extracted entities in JSON format:"""

    # print(f"\n--- NER Prompt ---\n{prompt[:500]}...\n") # For debugging prompt structure

    for attempt in range(max_retries):
        try:
            response = client.chat.completions.create(
                model=model,
                messages=[
                    {"role": "system", "content": "You are an expert in Amazonian archaeology and history, skilled at extracting specific information from texts."},
                    {"role": "user", "content": prompt}
                ],
                temperature=0.2, # Lower temperature for more deterministic output
                response_format={"type": "json_object"} # For newer models supporting JSON output
            )
            # Assuming the response structure is as expected from the API
            # The actual content is in response.choices[0].message.content
            extracted_json_str = response.choices[0].message.content
            # print(f"\n--- Raw OpenAI Response ---\n{extracted_json_str}") # For debugging
            
            entities = json.loads(extracted_json_str)
            # Validate that the output structure is as expected
            validated_entities = {}
            for etype in entity_types.keys():
                validated_entities[etype] = entities.get(etype, [])
                if not isinstance(validated_entities[etype], list):
                    print(f"Warning: Entity type '{etype}' was not a list in OpenAI response. Found: {validated_entities[etype]}")
                    validated_entities[etype] = [] # Default to empty list if format is wrong
            return validated_entities
        
        except json.JSONDecodeError as e_json:
            print(f"Attempt {attempt + 1}/{max_retries}: JSONDecodeError from OpenAI response: {e_json}. Raw response: {extracted_json_str}")
            if attempt < max_retries - 1:
                time.sleep(delay_seconds)
            else:
                print("Max retries reached for JSON decoding.")
                return {etype: [] for etype in entity_types.keys()} # Return empty structure on failure
        except Exception as e:
            print(f"Attempt {attempt + 1}/{max_retries}: An error occurred with OpenAI API: {e}")
            if "rate limit" in str(e).lower() and attempt < max_retries -1:
                print(f"Rate limit likely hit. Waiting for {delay_seconds * (attempt + 1)} seconds before retrying.")
                time.sleep(delay_seconds * (attempt + 1))
            elif attempt < max_retries - 1:
                 time.sleep(delay_seconds)
            else:
                print("Max retries reached for API call.")
                return {etype: [] for etype in entity_types.keys()} # Return empty structure
    return {etype: [] for etype in entity_types.keys()} # Should be unreachable if retries work
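
The prompt in `extract_entities_with_openai` lists only the entity type names. A minimal sketch of how the type descriptions and one or more worked examples could be included is shown below. `build_ner_messages` is a hypothetical helper, and the demo sentence is adapted from the placeholder diary text rather than from real data.

```python
import json


def build_ner_messages(text_content, entity_definitions, few_shot_examples=None):
    """Build chat messages with type descriptions and optional worked examples.

    `few_shot_examples` is a list of (input_text, expected_entities_dict) pairs.
    Hypothetical helper; the message layout is one possible design, not a fixed API.
    """
    definitions_block = "\n".join(
        f"- {etype}: {description}" for etype, description in entity_definitions.items()
    )
    system = (
        "You are an expert in Amazonian archaeology and history. "
        "Extract entities and answer with a JSON object whose keys are the entity "
        "types below and whose values are lists of text segments.\n"
        f"Entity types:\n{definitions_block}"
    )
    messages = [{"role": "system", "content": system}]
    # Each worked example becomes a user turn followed by the desired assistant reply
    for example_text, example_entities in few_shot_examples or []:
        messages.append({"role": "user", "content": example_text})
        messages.append({"role": "assistant", "content": json.dumps(example_entities)})
    messages.append({"role": "user", "content": text_content})
    return messages


demo_definitions = {
    "PLACE_NAME": "Specific geographic locations like rivers or lakes.",
    "INDIGENOUS_GROUP": "Names of Indigenous peoples, tribes, or communities.",
}
demo_examples = [
    (
        "The natives, called the Manao, had large villages near the Rio Negro.",
        {"PLACE_NAME": ["Rio Negro"], "INDIGENOUS_GROUP": ["Manao"]},
    )
]
demo_messages = build_ner_messages("Text to analyze goes here.", demo_definitions, demo_examples)
print(len(demo_messages))
```

The resulting list could be passed as `messages=` to `client.chat.completions.create`, in place of the two-message list built inside the function above.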

In [None]:
entity_definitions = {
    "PLACE_NAME": "Specific geographic locations like rivers, mountains, lakes, waterfalls, or named regions.",
    "ARCHAEOLOGICAL_SITE": "Named or described ancient sites, ruins, or locations of past human settlements (e.g., 'El Dorado', 'Kuhikugu site complex', 'Sitio das Antas', 'ancient earthworks').",
    "INDIGENOUS_GROUP": "Names of Indigenous peoples, tribes, or communities (e.g., 'Manao', 'Omagua').",
    "DATE_TIME_PERIOD": "Specific dates, years, or general time period descriptions (e.g., '1750', '1200 AD to 1600 AD', 'pre-Columbian', 'ancient times', 'July 10, 1988').",
    "SETTLEMENT_STRUCTURE": "Descriptions of villages, cities, houses, fortifications, plazas, causeways, canals, mounds, earthworks, geoglyphs, fish weirs, fields, roça.",
    "RESOURCE_MENTION": "Mentions of natural resources used or sought, like specific plants (manioc, Brazilwood, Brazil nut trees), animals, minerals (gold), or soil types (terra preta, black soil).",
    "ARTIFACT": "Mentions of human-made objects like ceramic urns, stone axes."
}

if not sample_texts:
    print("No sample texts to perform NER on. Please load or define some samples first.")
else:
    # Process the first sample text, or a specific one by key
    first_sample_name = list(sample_texts.keys())[0]
    text_to_analyze = sample_texts[first_sample_name]

    print(f"\n--- Analyzing text: {first_sample_name} ---")
    print(f"Full Text:\n{text_to_analyze}\n")

    extracted_entities = extract_entities_with_openai(text_to_analyze, entity_definitions)

    print("\n--- Extracted Entities: ---")
    if extracted_entities:
        print_json(extracted_entities)
    else:
        print("No entities extracted or an error occurred.")
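
Because the model can return segments that do not actually occur in the source text, a simple sanity check is to confirm that each extracted segment appears verbatim. The sketch below assumes the dictionary shape returned by `extract_entities_with_openai`. `split_grounded_entities` is a hypothetical helper, and the demo "model output" is invented for illustration.

```python
def split_grounded_entities(text_content, entities):
    """Separate segments that occur verbatim (case-insensitively) in the text from those that do not."""
    lowered = text_content.lower()
    grounded, ungrounded = {}, {}
    for etype, segments in entities.items():
        grounded[etype] = []
        ungrounded[etype] = []
        for segment in segments:
            is_text = isinstance(segment, str) and segment.strip() != ""
            if is_text and segment.strip().lower() in lowered:
                grounded[etype].append(segment)
            else:
                ungrounded[etype].append(segment)
    return grounded, ungrounded


demo_text = "We journeyed up the Rio Negro. The natives, called the Manao, had large villages."
# Invented model output: "Amazon River" is deliberately absent from demo_text
demo_entities = {"PLACE_NAME": ["Rio Negro", "Amazon River"], "INDIGENOUS_GROUP": ["Manao"]}
demo_grounded, demo_ungrounded = split_grounded_entities(demo_text, demo_entities)
print(demo_grounded)
print(demo_ungrounded)
```

A verbatim match is a conservative test: paraphrased or normalized segments land in the ungrounded bucket and would need manual review rather than automatic rejection.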

#### Discussion: Prompt Engineering for NER

1.  **Clear Instructions:** The prompt explicitly asks the model to identify entities and provide the text segment and type. It also specifies the desired JSON output format.
2.  **Entity Definitions:** `extract_entities_with_openai` passes only the type names (`entity_types.keys()`) to the model; the descriptions in `entity_definitions` are not included in the prompt. Adding a brief description for each type helps the model with nuances, especially for domain-specific types such as `SETTLEMENT_STRUCTURE` or `RESOURCE_MENTION`.
3.  **Examples in Prompt (Few-Shot Learning):** `entity_types_example` shows the model the expected output structure, but it contains only placeholder segments and is not a true few-shot example. For more complex cases, including one or more complete input texts with their desired JSON output in the prompt (few-shot prompting) can significantly improve accuracy.
4.  **System Message:** The system message `"You are an expert in Amazonian archaeology and history..."` helps set the context and persona for the model.
5.  **Iterative Refinement:** If the initial results are not satisfactory (e.g., missed entities, incorrect classifications), the prompt should be refined. This could involve:
    *   Making entity definitions more precise.
    *   Adding more varied examples to the prompt.
    *   Specifying what *not* to extract if certain patterns are consistently misidentified.
    *   Breaking down very long texts into smaller chunks if context window limits are an issue or if performance degrades.
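6.  **JSON Output Format:** Using `response_format={"type": "json_object"}` (for compatible models like gpt-3.5-turbo-1106+ and gpt-4-turbo-preview) is highly recommended because it constrains the model to emit syntactically valid JSON, which reduces parsing errors. It does not guarantee that the JSON follows the requested schema, the prompt itself must mention JSON, and a response cut off by the token limit can still be incomplete, so the validation and `JSONDecodeError` handling remain necessary. With older models, the string output may not be valid JSON and needs more robust parsing.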
7.  **Temperature:** A lower temperature (e.g., 0.0 to 0.3) makes the output more focused and deterministic, which is usually desirable for extraction tasks.
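
For the chunking suggestion under iterative refinement, a minimal sketch is shown below. It splits on characters rather than tokens, which is an assumption made to keep the example dependency-free; `chunk_text` and its default sizes are hypothetical.

```python
def chunk_text(text, max_chars=2000, overlap=200):
    """Split text into overlapping character windows, preferring to break at a space."""
    if max_chars <= overlap:
        raise ValueError("max_chars must be larger than overlap")
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        if end < len(text):
            # Back up to the last space so words are not cut in half,
            # but never so far that the next window fails to advance.
            space = text.rfind(" ", start + overlap + 1, end)
            if space != -1:
                end = space
        chunks.append(text[start:end].strip())
        if end == len(text):
            break
        # The overlap keeps entities near a boundary visible in both chunks
        start = end - overlap
    return chunks


demo_long_text = " ".join(["word"] * 1000)
demo_chunks = chunk_text(demo_long_text, max_chars=100, overlap=20)
print(len(demo_chunks), "chunks; longest has", max(len(c) for c in demo_chunks), "characters")
```

Each chunk could be passed to `extract_entities_with_openai` separately and the per-type lists concatenated. Segments from the overlapping regions may then appear twice and would need de-duplication.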

## 4. Topic Modeling (Conceptual Thematic Summarization)

Here, we'll ask the OpenAI model to identify main themes in a small collection of texts and provide keywords for each theme. This is more about high-level thematic understanding than traditional statistical topic modeling (like LDA).

In [None]:
def get_thematic_summary_openai(text_collection_dict, model="gpt-3.5-turbo"):
    if not client:
        print("OpenAI client not initialized. Skipping thematic summary.")
        return None
    
    # Prepare the text collection for the prompt
    formatted_texts = ""
    for i, (filename, text) in enumerate(text_collection_dict.items()):
        formatted_texts += f"--- Document {i+1} ({filename}) ---\n{text[:1000]}...\n\n" # Use snippets for brevity
        
    prompt = f"""Analyze the following collection of document snippets related to Amazonian studies.
Identify the main themes or topics present across this collection and for each individual document.
For each identified theme, provide a brief theme name and 3-5 representative keywords.
For each document, list the primary themes it covers from your identified list.

Provide the output as a JSON object with two main keys:
1. "overall_themes": A list of objects, where each object has "theme_name" and "keywords" (a list of strings).
2. "document_themes": A list of objects, where each object has "document_name" and "primary_themes" (a list of theme names).

Document Collection Snippets:
{formatted_texts}
--- End of Collection ---

Thematic Analysis in JSON format:"""

    # print(f"\n--- Thematic Summary Prompt ---\n{prompt[:1000]}...\n")
    
    try:
        response = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": "You are an expert in qualitative text analysis and thematic summarization, particularly for historical and archaeological texts."},
                {"role": "user", "content": prompt}
            ],
            temperature=0.5, # Higher temperature for more abstract/creative summarization
            response_format={"type": "json_object"}
        )
        summary_json_str = response.choices[0].message.content
        summary = json.loads(summary_json_str)
        return summary
    except json.JSONDecodeError as e_json:
        print(f"JSONDecodeError from OpenAI response for thematic summary: {e_json}. Raw response: {summary_json_str}")
        return None
    except Exception as e:
        print(f"An error occurred with OpenAI API during thematic summary: {e}")
        return None

In [None]:
if not sample_texts:  # an empty dict is falsy, so no separate length check is needed
    print("No sample texts to perform thematic summarization on.")
else:
    print(f"\n--- Performing Thematic Summarization on {len(sample_texts)} sample texts ---")
    # Use a subset if too many samples, or use shorter snippets as done in the prompt
    thematic_summary = get_thematic_summary_openai(sample_texts)

    if thematic_summary:
        print("\n--- Thematic Summary Results: ---")
        print_json(thematic_summary)
    else:
        print("Thematic summarization failed or returned no results.")
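
The prompt in `get_thematic_summary_openai` asks for two top-level keys, `overall_themes` and `document_themes`. JSON mode does not enforce that shape, so a structural check along the following lines may be useful. `check_thematic_summary` is a hypothetical helper and the demo summary is invented.

```python
def check_thematic_summary(summary):
    """Return a list of structural problems (an empty list means the shape looks as requested)."""
    if not isinstance(summary, dict):
        return ["summary is not a JSON object"]
    problems = []
    themes = summary.get("overall_themes")
    docs = summary.get("document_themes")
    if not isinstance(themes, list):
        problems.append("'overall_themes' is missing or not a list")
        themes = []
    if not isinstance(docs, list):
        problems.append("'document_themes' is missing or not a list")
        docs = []
    known_themes = set()
    for i, theme in enumerate(themes):
        if not isinstance(theme, dict) or not isinstance(theme.get("theme_name"), str):
            problems.append(f"overall_themes[{i}] has no 'theme_name' string")
            continue
        known_themes.add(theme["theme_name"])
        if not isinstance(theme.get("keywords"), list):
            problems.append(f"overall_themes[{i}] has no 'keywords' list")
    for i, doc in enumerate(docs):
        if not isinstance(doc, dict) or not isinstance(doc.get("primary_themes"), list):
            problems.append(f"document_themes[{i}] has no 'primary_themes' list")
            continue
        # Every theme a document refers to should be defined in overall_themes
        for name in doc["primary_themes"]:
            if not isinstance(name, str) or name not in known_themes:
                problems.append(f"document_themes[{i}] references unknown theme {name!r}")
    return problems


demo_summary = {
    "overall_themes": [{"theme_name": "Earthworks", "keywords": ["mounds", "causeways", "canals"]}],
    "document_themes": [{"document_name": "doc_a", "primary_themes": ["Earthworks", "Trade"]}],
}
print(check_thematic_summary(demo_summary))
```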

## 5. Relationship Extraction (Example)

From a text where entities have been identified, we'll attempt to extract simple relationships like "[Indigenous Group] located_near [Place/River]" or "[Settlement] has_feature [Structure]". This is a more complex task and results can vary.

In [None]:
def extract_relationships_openai(text_content, relationships_to_find, model="gpt-3.5-turbo"):
    if not client:
        print("OpenAI client not initialized. Skipping relationship extraction.")
        return None

    # relationships_to_find should be a list of strings describing the desired relationships
    # e.g., ["(PERSON, LIVED_IN, LOCATION)", "(GROUP, BUILT, STRUCTURE)"]
    
    prompt = f"""From the following text, extract specific types of relationships between entities.
The relationships I am interested in are:
{', '.join(relationships_to_find)}

For each found relationship, provide the text segments for the entities involved and the relationship type.
Output the result as a JSON object with a single key "relationships" whose value is a list of objects, where each object has 'subject', 'relationship', and 'object'.
Example: {{ "relationships": [{{ "subject": "Manao people", "relationship": "HAD_VILLAGES_WITH", "object": "extensive fields of manioc" }}, {{ "subject": "ancient earthworks", "relationship": "CALLED", "object": "geoglifos" }}] }}

Text to analyze:
--- --- --- --- ---
{text_content}
--- --- --- --- ---
Extracted relationships in JSON format:"""
    
    # print(f"\n--- Relationship Extraction Prompt ---\n{prompt[:500]}...\n")

    try:
        response = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": "You are an AI assistant specialized in identifying relationships between entities in historical and archaeological texts."},
                {"role": "user", "content": prompt}
            ],
            temperature=0.3,
                response_format={"type": "json_object"} # JSON mode returns an object, not a bare list
            )
        # In JSON mode the reply is a JSON object, so the list of relationships is
        # expected under a single top-level key, e.g. {"relationships": [...]}.
        # The parsing below accepts either that shape or a bare list.
        relationships_json_str = response.choices[0].message.content
        # print(f"\n--- Raw Relationship Response ---\n{relationships_json_str}")
        
        # Attempt to parse, trying common structures
        try:
            data = json.loads(relationships_json_str)
            if isinstance(data, list):
                relationships = data
            elif isinstance(data, dict) and len(data.keys()) == 1:
                relationships = list(data.values())[0] # Assume list is under the first key
                if not isinstance(relationships, list):
                    print(f"Expected a list of relationships, but got {type(relationships)} after extracting from dict.")
                    return []
            else:
                print(f"Unexpected JSON structure for relationships: {type(data)}")
                return []
            return relationships
        except json.JSONDecodeError as e_json:
            print(f"JSONDecodeError from OpenAI response for relationships: {e_json}. Raw response: {relationships_json_str}")
            return []

    except Exception as e:
        print(f"An error occurred with OpenAI API during relationship extraction: {e}")
        return []

In [None]:
desired_relationships = [
    "(INDIGENOUS_GROUP, LOCATED_NEAR, PLACE_NAME)",
    "(INDIGENOUS_GROUP, CALLED, PLACE_NAME/ARCHAEOLOGICAL_SITE)", # e.g. natives called X, their village Y
    "(INDIGENOUS_GROUP, HAD_SETTLEMENT_WITH, SETTLEMENT_STRUCTURE)",
    "(ARCHAEOLOGICAL_SITE, CONSISTED_OF, SETTLEMENT_STRUCTURE)",
    "(INDIGENOUS_GROUP, USED_RESOURCE, RESOURCE_MENTION)",
    "(ARTIFACT, FOUND_AT, PLACE_NAME/ARCHAEOLOGICAL_SITE)"
]

if not sample_texts:
    print("No sample texts to perform relationship extraction on.")
else:
    # Using the same first sample text as NER for consistency
    first_sample_name = list(sample_texts.keys())[0]
    text_for_relations = sample_texts[first_sample_name]
    
    print(f"\n--- Extracting relationships from: {first_sample_name} ---")
    # print(f"Text:\n{text_for_relations}\n") # Text already printed in NER section

    extracted_relations = extract_relationships_openai(text_for_relations, desired_relationships)

    print("\n--- Extracted Relationships: ---")
    if extracted_relations:
        print_json(extracted_relations)
    else:
        print("No relationships extracted or an error occurred.")
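
Since relationship output tends to need more validation than NER output, one option is to keep only well-formed triples and drop duplicates before further use. The sketch below assumes the `subject` / `relationship` / `object` keys requested in the prompt; `clean_triples` is a hypothetical helper.

```python
REQUIRED_KEYS = ("subject", "relationship", "object")


def clean_triples(relationships):
    """Keep dicts that have non-empty string values for all three keys, de-duplicated in order."""
    cleaned = []
    seen = set()
    for item in relationships or []:
        if not isinstance(item, dict):
            continue
        values = [item.get(key) for key in REQUIRED_KEYS]
        if not all(isinstance(v, str) and v.strip() for v in values):
            continue
        # Normalize whitespace and upper-case the relationship label for comparison
        triple = (values[0].strip(), values[1].strip().upper(), values[2].strip())
        if triple not in seen:
            seen.add(triple)
            cleaned.append(dict(zip(REQUIRED_KEYS, triple)))
    return cleaned


demo_relationships = [
    {"subject": "ancient earthworks", "relationship": "called", "object": "geoglifos"},
    {"subject": "ancient earthworks", "relationship": "CALLED", "object": "geoglifos "},
    {"subject": "Manao people", "relationship": "HAD_VILLAGES_WITH"},  # missing 'object'
    "not a dict",
]
demo_cleaned = clean_triples(demo_relationships)
print(demo_cleaned)
```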

## 6. Geocoding / Disambiguation (Conceptual Demonstration)

This section demonstrates how one might prompt an OpenAI model to help disambiguate place names or suggest likely locations based on textual context. This is highly dependent on the model's knowledge base and reasoning capabilities.

In [None]:
def get_geolocation_context_openai(text_snippet, ambiguous_place_name, model="gpt-3.5-turbo"):
    if not client:
        print("OpenAI client not initialized. Skipping geolocation context.")
        return None

    prompt = f"""Consider the following text snippet which mentions the place '{ambiguous_place_name}'.
Based on the context provided in the snippet (other locations, peoples, environmental descriptions), what are the possible real-world geographic regions or specific locations for '{ambiguous_place_name}'?
If there are multiple possibilities, list them and explain your reasoning for each based on the text.
If possible, provide approximate latitude/longitude or known nearby major geographical features for the most likely candidate(s).

Text Snippet:
--- --- --- --- ---
{text_snippet}
--- --- --- --- ---

Geographic analysis and disambiguation for '{ambiguous_place_name}':"""

    # print(f"\n--- Geolocation Prompt ---\n{prompt[:500]}...\n")
    
    try:
        response = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": "You are an AI assistant with expertise in historical geography and Amazonian studies."},
                {"role": "user", "content": prompt}
            ],
            temperature=0.4 
        )
        analysis = response.choices[0].message.content
        return analysis
    except Exception as e:
        print(f"An error occurred with OpenAI API during geolocation context: {e}")
        return "Error retrieving analysis."

In [None]:
# Assume some ambiguous place names were extracted by NER earlier
# For this example, let's use a place from our placeholder text
ambiguous_place = "Lake Parime"
context_text = sample_texts.get("placeholder_colonial_diary_extract_processed.txt", "No context available.")

if context_text != "No context available.":
    print(f"\n--- Getting Geolocation Context for: '{ambiguous_place}' ---")
    print(f"Context Text Snippet (first 500 chars):\n{context_text[:500]}...\n")
    
    geolocation_analysis = get_geolocation_context_openai(context_text, ambiguous_place)
    
    print(f"\n--- Geolocation Analysis for '{ambiguous_place}': ---")
    print(geolocation_analysis)
else:
    print(f"Could not find context text to analyze for '{ambiguous_place}'.")

# Example 2: A less mythical place name
ambiguous_place_2 = "great bend in the river"
if context_text != "No context available.":
    print(f"\n--- Getting Geolocation Context for: '{ambiguous_place_2}' ---")
    # A more focused snippet might be better here if the full text is very long
    # For now, using the same full context text
    print(f"Context Text Snippet (first 500 chars):\n{context_text[:500]}...\n")
    geolocation_analysis_2 = get_geolocation_context_openai(context_text, ambiguous_place_2)
    print(f"\n--- Geolocation Analysis for '{ambiguous_place_2}': ---")
    print(geolocation_analysis_2)
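
`get_geolocation_context_openai` returns free text, which is hard to check automatically. If the prompt were changed to request JSON, the coordinates could at least be range-checked. The sketch below assumes a hypothetical reply shape `{"candidates": [{"name", "latitude", "longitude", "reasoning"}]}` that the function above does not currently request, and the demo values are invented.

```python
import json


def parse_geocoding_candidates(raw_json):
    """Parse a hypothetical {"candidates": [...]} reply and keep entries with plausible coordinates."""
    try:
        data = json.loads(raw_json)
    except json.JSONDecodeError:
        return []
    candidates = data.get("candidates") if isinstance(data, dict) else None
    if not isinstance(candidates, list):
        return []
    valid = []
    for cand in candidates:
        if not isinstance(cand, dict):
            continue
        lat, lon = cand.get("latitude"), cand.get("longitude")
        # bool is a subclass of int in Python, so exclude it explicitly
        numeric = all(
            isinstance(v, (int, float)) and not isinstance(v, bool) for v in (lat, lon)
        )
        if numeric and -90 <= lat <= 90 and -180 <= lon <= 180:
            valid.append(cand)
    return valid


# Invented reply: the second candidate has an impossible latitude
demo_reply = json.dumps({
    "candidates": [
        {"name": "made-up candidate A", "latitude": 1.5, "longitude": -61.0, "reasoning": "..."},
        {"name": "made-up candidate B", "latitude": 123.0, "longitude": -61.0, "reasoning": "..."},
    ]
})
demo_valid = parse_geocoding_candidates(demo_reply)
print([c["name"] for c in demo_valid])
```

A range check only rules out impossible values. Whether a suggested location is historically plausible still has to be verified against gazetteers or other sources.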

## 7. Summary of EDA with OpenAI

This notebook demonstrated initial explorations using OpenAI models for several NLP tasks on sample textual data relevant to Amazonian archaeology:

1.  **Named Entity Recognition (NER):**
    *   Prompted the model to extract entities such as Place Names, Indigenous Groups, Settlement/Structure descriptions, and Resources, using tailored prompts.
    *   The quality of extraction depends heavily on prompt clarity, entity definitions, and potentially few-shot examples for more nuanced cases.
    *   Using the JSON response format is beneficial for structured output.

2.  **Topic Modeling (Conceptual Thematic Summarization):**
    *   The model was prompted to produce high-level thematic summaries and to associate each document with these themes.
    *   This approach is good for quickly understanding a small corpus but isn't a replacement for rigorous statistical topic modeling on large datasets.

3.  **Relationship Extraction:**
    *   Showed potential for identifying simple relationships between entities (e.g., Group-Location, Group-Resource).
    *   Prompt design is critical here, and the task is inherently more complex than NER. Output might require more careful validation.

4.  **Geocoding/Disambiguation (Conceptual):**
    *   Demonstrated how models can be prompted to reason about ambiguous locations based on textual context.
    *   The accuracy and utility depend on the model's underlying knowledge base and the specificity of the context provided.

**General Challenges & Considerations:**
*   **Prompt Engineering:** Achieving desired results is an iterative process of refining prompts, providing clear definitions, and good examples.
*   **Model Choice:** Newer models (GPT-4 series) might offer better reasoning and adherence to complex instructions but come at a higher cost. GPT-3.5-turbo is a good starting point.
*   **Cost:** Processing large volumes of text or making many complex calls can become expensive. Strategies like processing only new/updated texts, using smaller models for simpler tasks, or sampling data might be needed.
*   **Output Variability & Validation:** While temperature can be lowered for more deterministic output, some variability can still exist. Extracted information, especially relationships and geocoding suggestions, needs careful review and validation against other sources.
*   **Context Window Limits:** Very long documents might need to be chunked for processing, which could affect context available for extraction or summarization tasks.
*   **Rate Limits:** Be mindful of API rate limits and implement appropriate backoff/retry strategies for larger batch jobs.
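
For the backoff/retry point, a generic exponential backoff wrapper might look like the sketch below. `call_with_backoff` and its parameters are hypothetical; the `sleep` argument exists only so the demo and tests can run without real waiting.

```python
import random
import time


def call_with_backoff(func, max_retries=5, base_delay=1.0, max_delay=60.0,
                      retry_on=(Exception,), sleep=time.sleep):
    """Call func() and retry on the given exceptions with exponentially growing delays."""
    if max_retries < 1:
        raise ValueError("max_retries must be at least 1")
    for attempt in range(max_retries):
        try:
            return func()
        except retry_on:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the last error to the caller
            delay = min(max_delay, base_delay * 2 ** attempt)
            # A little random jitter keeps parallel workers from retrying in lockstep
            sleep(delay + random.uniform(0, delay * 0.1))


# Demo: a function that fails twice before succeeding; waits are recorded, not slept
demo_state = {"calls": 0}


def demo_flaky_call():
    demo_state["calls"] += 1
    if demo_state["calls"] < 3:
        raise TimeoutError("temporary failure")
    return "ok"


demo_waits = []
print(call_with_backoff(demo_flaky_call, sleep=demo_waits.append), len(demo_waits))
```

With the v1 OpenAI library, passing `retry_on=(openai.RateLimitError,)` would restrict retries to rate-limit errors. The client also accepts a `max_retries` argument for its built-in retry behavior, which may be sufficient for smaller jobs.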

**How these AI-driven insights can feed into broader archaeological search:**
*   **NER outputs** directly provide structured data (place names, site types, group names, resources) that can be mapped, cataloged, and used to query other datasets (e.g., find LiDAR data for an extracted `ARCHAEOLOGICAL_SITE`).
*   **Thematic summaries** can help categorize documents and prioritize those most relevant to specific research questions (e.g., finding all texts discussing riverine agriculture).
*   **Extracted relationships** can build knowledge graphs, showing connections between peoples, places, and practices, which might reveal patterns or areas for further investigation.
*   **Geocoding assistance** can help translate textual descriptions into spatial hypotheses, guiding field surveys or remote sensing analysis.
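
As a small illustration of the knowledge-graph idea, extracted triples can be grouped into an adjacency structure and queried by subject. The sketch uses only the standard library. `build_graph` and `related` are hypothetical helpers, and the demo triples are taken from the example in the relationship-extraction prompt.

```python
from collections import defaultdict


def build_graph(triples):
    """Group (relationship, object) pairs under each subject."""
    graph = defaultdict(list)
    for triple in triples:
        graph[triple["subject"]].append((triple["relationship"], triple["object"]))
    return dict(graph)


def related(graph, subject, relationship=None):
    """Return objects linked to a subject, optionally filtered by relationship type."""
    return [
        obj for rel, obj in graph.get(subject, [])
        if relationship is None or rel == relationship
    ]


demo_triples = [
    {"subject": "Manao people", "relationship": "HAD_VILLAGES_WITH", "object": "extensive fields of manioc"},
    {"subject": "ancient earthworks", "relationship": "CALLED", "object": "geoglifos"},
]
demo_graph = build_graph(demo_triples)
print(related(demo_graph, "ancient earthworks", "CALLED"))
```

For larger collections, a dedicated graph library or database would likely be a better fit than plain dictionaries.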

These OpenAI-driven techniques offer powerful tools to augment traditional textual analysis, enabling faster extraction of key information and generation of new hypotheses from historical and archaeological texts.