## 🗺️ Country–Language Mapping via LLM

This block uses an OpenAI model to automatically map each country in the Twitter user dataset  
to its most commonly spoken primary language, returning standardized ISO 639-1 language codes.  
The raw model output is saved as text and the parsed mapping as a JSON file for reuse.  
The prompt asks for each country's dominant social language rather than just its official one.  
This mapping enables aggregation of Twitter users by language and supports accurate cross-language comparisons in later analyses.
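
As a rough illustration of the aggregation this mapping is meant to enable, the sketch below sums per-country user counts into per-language totals using only the standard library. The `users` column name, the `users_by_language` helper, and the sample values are hypothetical; only the `country` column is known from the notebook's CSV.

```python
from collections import defaultdict


def users_by_language(rows, mapping, count_key="users"):
    """Sum per-country counts into per-language totals.

    rows:    iterable of dicts with a "country" key and a numeric count column
             (the count column name is an assumption, not taken from the CSV).
    mapping: dict of country name -> ISO 639-1 code, as produced by the LLM step.
    Returns (totals, unmapped), where unmapped lists countries with no code.
    """
    totals = defaultdict(int)
    unmapped = []
    for row in rows:
        code = mapping.get(row["country"])
        if code is None:
            # Keep track of countries the model skipped instead of dropping them silently.
            unmapped.append(row["country"])
            continue
        totals[code] += row[count_key]
    return dict(totals), unmapped


# Made-up illustration values, not figures from the dataset.
sample_rows = [
    {"country": "United States", "users": 100},
    {"country": "United Kingdom", "users": 20},
    {"country": "France", "users": 10},
    {"country": "Atlantis", "users": 1},
]
sample_mapping = {"United States": "en", "United Kingdom": "en", "France": "fr"}

totals, unmapped = users_by_language(sample_rows, sample_mapping)
print(totals)
print(unmapped)
```

Returning the unmapped countries alongside the totals makes gaps in the model's answer visible before any cross-language comparison is drawn.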


In [1]:
import os
import json
import polars as pl
from openai import OpenAI
from dotenv import load_dotenv

# Load API key
load_dotenv("/home/jovyan/env.txt")
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

# Load CSV of Twitter users per country
csv_path = "/home/jovyan/Shared/project1-group1/info-470-project-1/langdata/twitter-x-users-by-country-2025.csv"
df_users = pl.read_csv(csv_path)

# Extract unique countries
countries = df_users["country"].unique().to_list()

# Prompt
prompt = f"""
You are a data assistant.
Map each of the following countries to the primary language spoken by the largest share of its population (not necessarily the official one), using ISO 639-1 two-letter codes.
Use social and cultural prevalence as the main criterion (for example, South Africa → "zu" for Zulu rather than "en" for English).
Return ONLY a valid JSON object like:
{{"United States": "en", "France": "fr", ...}}
DO NOT add markdown json fencing.
Countries:
{countries}
"""

# Send request to OpenAI (single batched call)
response = client.responses.create(
    model="gpt-4o-mini",
    input=prompt,
    temperature=0
)

# Extract the model's text output
raw_text = response.output_text

# Save the raw output first for record keeping
txt_path = "/home/jovyan/Shared/project1-group1/info-470-project-1/langdata/country_lang_map_raw.txt"
with open(txt_path, "w") as f:
    f.write(raw_text)

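# Strip Markdown code fences if the model added them anyway
cleaned = raw_text.strip()
if cleaned.startswith("```"):
    lines = cleaned.splitlines()[1:]  # drop the opening fence (```json or ```)
    if lines and lines[-1].strip() == "```":
        lines = lines[:-1]  # drop the closing fence only if it is present
    cleaned = "\n".join(lines)

# Parse the JSON text into a dict of country -> language code
mapping = json.loads(cleaned)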

# Save the valid JSON
json_path = txt_path.replace(".txt", ".json")
with open(json_path, "w") as f:
    json.dump(mapping, f, indent=2)

print(f"✅ Cleaned and saved JSON to {json_path}")

✅ Cleaned and saved JSON to /home/jovyan/Shared/project1-group1/info-470-project-1/langdata/country_lang_map_raw.json
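
Because the mapping comes from a model, it may be worth a sanity check before it is reused. The sketch below is one possible check, not part of the original notebook: the `check_mapping` helper is hypothetical, and it only tests that each value has the two-lowercase-letter shape of an ISO 639-1 code, not that the code is actually assigned or correct for the country.

```python
import re

# Shape of an ISO 639-1 code: exactly two lowercase ASCII letters.
# This does not confirm that the code is an assigned one.
ISO_639_1_SHAPE = re.compile(r"[a-z]{2}")


def check_mapping(countries, mapping):
    """Compare the requested countries against the model's mapping.

    Returns a dict with:
      missing    - countries the model did not return
      unexpected - keys the model returned that were not requested
      malformed  - entries whose value is not a two-letter lowercase string
    """
    requested = set(countries)
    returned = set(mapping)
    malformed = {
        country: code
        for country, code in mapping.items()
        if not (isinstance(code, str) and ISO_639_1_SHAPE.fullmatch(code))
    }
    return {
        "missing": sorted(requested - returned),
        "unexpected": sorted(returned - requested),
        "malformed": malformed,
    }


# Made-up example input to show the three kinds of problems.
report = check_mapping(
    ["France", "Japan", "Brazil"],
    {"France": "fr", "Japan": "jpn", "Germany": "de"},
)
print(report)
```

In the notebook, the same function could be called with the `countries` list and the parsed `mapping`; an empty report would suggest the single batched call covered every country with well-formed codes.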
