# 🌍 Add a new language to the Number Learning App

This interactive notebook guides you through adding a new language to the number trainer.
Each step includes:
- **Explanation** of what we're doing and why
- **LLM prompt** to generate the content
- **Editable output** so you can review and modify before saving

## Prerequisites
- An OpenAI API key (or a key for another LLM provider)
- A basic understanding of the target language's number system

---
## Step 0: Setup

Install dependencies and configure your LLM API.
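
Before making any API calls, it can help to fail fast when the key is missing. The following is a minimal sketch, assuming the key is stored in `OPENAI_API_KEY` as in the setup cell; `require_env` is a hypothetical helper, not part of the project.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set. Add it to a .env file or export it in your shell."
        )
    return value
```

Calling `require_env("OPENAI_API_KEY")` after `load_dotenv()` would surface a missing key immediately instead of at the first LLM request.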

In [None]:
# Install required packages (run once)
%pip install openai python-dotenv

In [None]:
import os
from pathlib import Path
from openai import OpenAI
from dotenv import load_dotenv

# Load environment variables from .env file
load_dotenv()

# Initialize OpenAI client
client = OpenAI(api_key=os.getenv('OPENAI_API_KEY'))

# Project paths
PROJECT_ROOT = Path('..').resolve()
LANGUAGES_DIR = PROJECT_ROOT / 'src' / 'languages'
CURRICULUM_GEN_DIR = PROJECT_ROOT / 'scripts' / 'curriculum-gen' / 'languages'

def call_llm(prompt: str, system_prompt: str = None) -> str:
    """Call the LLM with a prompt and return the response."""
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": prompt})
    
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
        temperature=0.7
    )
    return response.choices[0].message.content

def save_file(path: Path, content: str):
    """Save content to a file, creating directories if needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # Write as UTF-8 explicitly so non-ASCII number words survive on every platform
    path.write_text(content, encoding="utf-8")
    print(f"✅ Saved: {path}")

print("✅ Setup complete!")

---
## Step 1: Choose your language

Define the basic language metadata:
- **Language ID**: lowercase, hyphenated (e.g., `german`, `sino-korean`, `brazilian-portuguese`)
- **Display name**: Human-readable name (e.g., "German", "Sino-Korean")
- **Language code**: IETF BCP 47 language tag (e.g., `de-DE`, `ko-KR`, `pt-BR`)
- **Flag emoji**: Country flag (e.g., 🇩🇪, 🇰🇷, 🇧🇷)
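
The conventions above can be checked mechanically before any files are created. This is a sketch only: `validate_metadata` and both patterns are hypothetical, and the language-code pattern covers just the simple `language-REGION` shape used in the examples, not the full BCP 47 grammar.

```python
import re

# Lowercase words separated by single hyphens, e.g. "sino-korean".
LANGUAGE_ID_RE = re.compile(r"[a-z]+(-[a-z]+)*")
# Simplified "language-REGION" shape, e.g. "de-DE"; real BCP 47 tags allow more.
LANGUAGE_CODE_RE = re.compile(r"[a-z]{2,3}-[A-Z]{2}")


def validate_metadata(language_id: str, language_code: str) -> list[str]:
    """Return a list of human-readable problems (empty when everything looks fine)."""
    problems = []
    if not LANGUAGE_ID_RE.fullmatch(language_id):
        problems.append(f"Language ID {language_id!r} is not lowercase and hyphenated")
    if not LANGUAGE_CODE_RE.fullmatch(language_code):
        problems.append(f"Language code {language_code!r} is not of the form 'xx-XX'")
    return problems
```

An empty list means the values follow the conventions; anything else is worth fixing before the derived paths are computed.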

In [None]:
# ✏️ EDIT THIS: Define your language
LANGUAGE_ID = "german"  # lowercase, hyphenated
LANGUAGE_NAME = "German"  # Display name
LANGUAGE_CODE = "de-DE"  # IETF language tag
FLAG_EMOJI = "🇩🇪"  # Country flag

# Derived paths
LANG_SRC_DIR = LANGUAGES_DIR / LANGUAGE_ID
LANG_CONFIG_FILE = CURRICULUM_GEN_DIR / f"{LANGUAGE_ID}.ts"

print(f"Language: {LANGUAGE_NAME} ({LANGUAGE_ID})")
print(f"Code: {LANGUAGE_CODE}")
print(f"Flag: {FLAG_EMOJI}")
print(f"\nFiles will be created at:")
print(f"  - {LANG_SRC_DIR}/")
print(f"  - {LANG_CONFIG_FILE}")

---
## Step 2: Analyze number patterns

Before writing code, we need to understand how numbers work in your language.

### What are patterns?

Languages have systematic rules for forming numbers. For example, in **English**:
- 0-12 are memorized individually
- 13-19 use the "-teen" suffix
- 20, 30... 90 are memorized decades
- 21-99 combine decade + unit

The LLM will analyze your target language and identify its patterns.
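
To make the idea of patterns concrete, the four English rules listed above can be written down directly. This is an illustrative Python sketch for 0-99 only; the function name and tables are hypothetical, and the notebook itself generates TypeScript rather than using this code.

```python
# Stage 1: numbers memorized individually (0-12).
ONES = ["zero", "one", "two", "three", "four", "five", "six",
        "seven", "eight", "nine", "ten", "eleven", "twelve"]
# Stage 2: stems that take the "-teen" suffix (13-19).
TEEN_STEMS = {13: "thir", 14: "four", 15: "fif", 16: "six",
              17: "seven", 18: "eigh", 19: "nine"}
# Stage 3: memorized decades.
DECADES = {20: "twenty", 30: "thirty", 40: "forty", 50: "fifty",
           60: "sixty", 70: "seventy", 80: "eighty", 90: "ninety"}


def english_0_to_99(n: int) -> str:
    """Spell out 0-99 in English using the staged patterns."""
    if not 0 <= n <= 99:
        raise ValueError("only 0-99 is covered by this sketch")
    if n <= 12:
        return ONES[n]
    if n <= 19:
        return TEEN_STEMS[n] + "teen"
    tens, unit = n - n % 10, n % 10
    if unit == 0:
        return DECADES[tens]
    # Stage 4: decade + unit, joined with a hyphen.
    return f"{DECADES[tens]}-{ONES[unit]}"
```

Each branch corresponds to one pattern a learner has to know, which is the kind of breakdown the analysis prompt asks the LLM to produce for the target language.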

In [None]:
# Generate pattern analysis
pattern_prompt = f"""
Analyze the number system of {LANGUAGE_NAME} ({LANGUAGE_CODE}) and identify all the patterns 
a learner needs to understand to form any number from 0 to 1 trillion.

Structure your response like this:

## Overview
Brief description of the number system (e.g., decimal, vigesimal, special groupings)

## Patterns by Stage

### Stage 1: Atomic Base (0-12 or similar)
List each number that must be memorized individually.

### Stage 2: Teens (13-19 or equivalent)
Explain the pattern. Note any irregularities.

### Stage 3: Decades (20, 30... 90)
How are round tens formed? Any special cases?

### Stage 4: Compound numbers (21-99)
How are tens and units combined? What's the word order?

### Stage 5: Hundreds
How is "hundred" expressed? Any special rules for 100 vs 200?

### Stage 6: Thousands
How is "thousand" expressed? Any grouping differences?

### Stage 7+: Large numbers
Millions, billions, etc. Note any long-scale vs short-scale differences.

## Key irregularities
List the most important exceptions a learner must memorize.

## Comparison to English
What will be most confusing for English speakers?
"""

pattern_analysis = call_llm(pattern_prompt)
print(pattern_analysis)

In [None]:
# ✏️ EDIT: Review and modify the pattern analysis if needed
# You can edit this variable directly before proceeding

pattern_analysis_edited = pattern_analysis

# Uncomment and edit if you want to make changes:
# pattern_analysis_edited = """
# Your edited version here...
# """

In [None]:
# Save pattern documentation
docs_path = PROJECT_ROOT / 'docs' / f'{LANGUAGE_ID}-patterns.md'
save_file(docs_path, f"# {LANGUAGE_NAME} Number Patterns\n\n{pattern_analysis_edited}")
print(f"\nüìÑ Pattern documentation saved to: {docs_path}")

---
## Step 3: Generate `numberToWords` function

This function converts a number to its spoken form in the target language.

Example: `numberToWords(54)` → `"vierundfünfzig"` (German)
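
The German example shows why the conversion logic is language-specific: the unit comes before the decade, joined by "und". The Python sketch below covers 0-99 only and is purely illustrative; `german_0_to_99` is a hypothetical name, and the real function is generated as TypeScript in the cells that follow.

```python
UNITS = ["null", "eins", "zwei", "drei", "vier", "fünf", "sechs",
         "sieben", "acht", "neun", "zehn", "elf", "zwölf"]
# Note the shortened stems in 16 (sech-) and 17 (sieb-).
TEENS = {13: "dreizehn", 14: "vierzehn", 15: "fünfzehn", 16: "sechzehn",
         17: "siebzehn", 18: "achtzehn", 19: "neunzehn"}
TENS = {20: "zwanzig", 30: "dreißig", 40: "vierzig", 50: "fünfzig",
        60: "sechzig", 70: "siebzig", 80: "achtzig", 90: "neunzig"}


def german_0_to_99(n: int) -> str:
    """Spell out 0-99 in German (unit + 'und' + decade for compounds)."""
    if not 0 <= n <= 99:
        raise ValueError("only 0-99 is covered by this sketch")
    if n <= 12:
        return UNITS[n]
    if n <= 19:
        return TEENS[n]
    tens, unit = n - n % 10, n % 10
    if unit == 0:
        return TENS[tens]
    # "eins" loses its final "s" inside compounds: 21 is "einundzwanzig".
    unit_word = "ein" if unit == 1 else UNITS[unit]
    return f"{unit_word}und{TENS[tens]}"
```

Irregularities such as "ein" versus "eins" are exactly what the pattern analysis from Step 2 should surface before code is generated.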

In [None]:
# Load Swedish normalizer as reference (read as UTF-8 so non-ASCII characters are preserved)
swedish_normalizer = (PROJECT_ROOT / 'src' / 'languages' / 'swedish' / 'normalizer.ts').read_text(encoding='utf-8')

number_to_words_prompt = f"""
Create a TypeScript function that converts numbers to {LANGUAGE_NAME} words.

## Requirements
- Function signature: `export function numberTo{LANGUAGE_ID.replace('-', ' ').title().replace(' ', '')}(n: number): string`
- Handle numbers from 0 to at least 1,000,000,000,000 (1 trillion)
- Return the standard spoken form (how a native speaker would say it)
- Use the patterns identified earlier:

{pattern_analysis_edited}

## Reference implementation (Swedish)
Here's how the Swedish version works for reference:

```typescript
{swedish_normalizer}
```

## Output format
Return ONLY the TypeScript code, no explanations. Include:
1. All necessary constants (digit names, teen names, decade names, etc.)
2. The main conversion function
3. Helper functions as needed
"""

number_to_words_code = call_llm(number_to_words_prompt)
print(number_to_words_code)

In [None]:
# ✏️ EDIT: Review and modify the generated code
number_to_words_code_edited = number_to_words_code

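# Remove markdown code fences if present (strip first so stray whitespace doesn't hide them)
number_to_words_code_edited = number_to_words_code_edited.strip()
if number_to_words_code_edited.startswith('```'):
    lines = number_to_words_code_edited.split('\n')
    number_to_words_code_edited = '\n'.join(lines[1:-1] if lines[-1].strip() == '```' else lines[1:])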
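
The same fence-stripping logic appears again in later steps, so it may be worth centralizing. Below is a sketch of a hypothetical helper, `strip_code_fences`, that tolerates leading or trailing whitespace and a language tag on the opening fence. It assumes the response contains at most one surrounding fence.

```python
FENCE = "`" * 3  # three backticks, built this way to keep the example readable


def strip_code_fences(text: str) -> str:
    """Remove one surrounding Markdown code fence, if present."""
    lines = text.strip().split("\n")
    if lines and lines[0].startswith(FENCE):
        lines = lines[1:]  # drop the opening fence, including any language tag
        if lines and lines[-1].strip() == FENCE:
            lines = lines[:-1]  # drop the closing fence
    return "\n".join(lines)
```

With such a helper, each review cell could reduce to a single call, for example `strip_code_fences(number_to_words_code)`.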

---
## Step 4: Generate `parseSpokenNumber` function

This function parses spoken/typed number representations back to integers.
It must handle various input formats:
- Pure digits: `"54"`
- Pure words: `"vierundfünfzig"`
- Mixed: `"50-vier"`
- With/without hyphens and spaces
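
One way to handle all of these formats is to normalize the input, try digits, then try a reverse lookup built from the number-to-words function, and finally fall back to adding up the parts of a mixed form. The Python sketch below illustrates that idea; every name in it is hypothetical, and the additive fallback is a simplification that will not suit every language.

```python
import re
from typing import Callable, Iterable, Optional


def normalize(text: str) -> str:
    """Lowercase and drop spaces and hyphens so spelling variants compare equal."""
    return re.sub(r"[\s\-]+", "", text.lower())


def build_reverse_table(to_words: Callable[[int], str],
                        numbers: Iterable[int]) -> dict[str, int]:
    """Map the normalized word form of each number back to the number."""
    return {normalize(to_words(n)): n for n in numbers}


def parse_spoken(text: str, table: dict[str, int]) -> Optional[int]:
    """Parse digits, words, or a mixed form such as '50-four'; None if unknown."""
    compact = normalize(text)
    if re.fullmatch(r"[0-9]+", compact):
        return int(compact)
    if compact in table:
        return table[compact]
    # Mixed input: parse each hyphen- or space-separated part and add them up.
    parts = [p for p in re.split(r"[\s\-]+", text.strip()) if p]
    if len(parts) < 2:
        return None
    values = [parse_spoken(p, table) for p in parts]
    if any(v is None for v in values):
        return None
    return sum(values)
```

Building the table from the forward function keeps the parser and the converter in sync, which is why the prompt below passes the generated converter to the LLM as a reference.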

In [None]:
parse_number_prompt = f"""
Create a TypeScript function that parses {LANGUAGE_NAME} number words back to integers.

## Requirements
- Function signature: `export function parse{LANGUAGE_ID.replace('-', ' ').title().replace(' ', '')}(input: string): number | null`
- Return `null` if the input cannot be parsed
- Handle these input variations:
  - Pure digits: "54"
  - Pure words: the native word form
  - Mixed formats: "50-four" style
  - With/without hyphens, spaces
  - Case-insensitive
- This is used for speech-to-text validation, so be lenient with input

## The numberToWords function for reference
```typescript
{number_to_words_code_edited}
```

## Output format
Return ONLY the TypeScript code for the parser function.
"""

parse_number_code = call_llm(parse_number_prompt)
print(parse_number_code)

In [None]:
# ✏️ EDIT: Review and modify the parser code
parse_number_code_edited = parse_number_code

# Remove markdown code fences if present (strip first so stray whitespace doesn't hide them)
parse_number_code_edited = parse_number_code_edited.strip()
if parse_number_code_edited.startswith('```'):
    lines = parse_number_code_edited.split('\n')
    parse_number_code_edited = '\n'.join(lines[1:-1] if lines[-1].strip() == '```' else lines[1:])

---
## Step 5: Generate `numberToRomanized` (if needed)

**Only needed if your language uses non-Latin script** (e.g., Korean, Japanese, Arabic, Chinese).

Skip this step if your language uses the Latin alphabet.

In [None]:
# Set to True if your language needs romanization
NEEDS_ROMANIZATION = False  # ✏️ EDIT: Set to True for non-Latin scripts

romanization_code = ""

if NEEDS_ROMANIZATION:
    romanization_prompt = f"""
    Create a TypeScript function that romanizes {LANGUAGE_NAME} number words.
    
    ## Requirements
    - Function signature: `export function numberToRomanized(n: number): string`
    - Convert the native script to a Latin-alphabet pronunciation guide
    - Use the standard romanization system for the language
    
    ## The numberToWords function for reference
    ```typescript
    {number_to_words_code_edited}
    ```
    
    ## Output format
    Return ONLY the TypeScript code.
    """
    
    romanization_code = call_llm(romanization_prompt)
    print(romanization_code)
else:
    print("⏭️ Skipping romanization (Latin-script language)")

In [None]:
# ✏️ EDIT: Review romanization code if generated
romanization_code_edited = romanization_code

# Strip first so stray whitespace doesn't hide the fences
romanization_code_edited = romanization_code_edited.strip()
if romanization_code_edited.startswith('```'):
    lines = romanization_code_edited.split('\n')
    romanization_code_edited = '\n'.join(lines[1:-1] if lines[-1].strip() == '```' else lines[1:])

---
## Step 6: Combine into normalizer.ts

Now we'll combine all the functions into a single `normalizer.ts` file.

In [None]:
# Combine all code into normalizer.ts
normalizer_content = f"""{number_to_words_code_edited}

{parse_number_code_edited}
"""

if romanization_code_edited:
    normalizer_content += f"\n\n{romanization_code_edited}"

print("=" * 60)
print("NORMALIZER.TS PREVIEW")
print("=" * 60)
print(normalizer_content[:2000] + "..." if len(normalizer_content) > 2000 else normalizer_content)

In [None]:
# ✏️ FINAL EDIT: Make any last changes to normalizer.ts
normalizer_content_final = normalizer_content

# Save the file
normalizer_path = LANG_SRC_DIR / 'normalizer.ts'
save_file(normalizer_path, normalizer_content_final)

---
## Step 7: Generate unit tests

Create comprehensive tests for the normalizer functions.
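
A useful property for such tests is the round trip: parsing the words produced for a number should give that number back. The generated tests are Vitest/TypeScript, so the Python sketch below only illustrates the idea; `round_trip_failures` is a hypothetical helper that accepts any converter and parser pair.

```python
from typing import Callable, Iterable, Optional


def round_trip_failures(
    to_words: Callable[[int], str],
    parse: Callable[[str], Optional[int]],
    numbers: Iterable[int],
) -> list[tuple[int, str, Optional[int]]]:
    """Return (number, words, parsed) for each number that fails the round trip."""
    failures = []
    for n in numbers:
        words = to_words(n)
        parsed = parse(words)
        if parsed != n:
            failures.append((n, words, parsed))
    return failures
```

An empty result over a sample of edge cases (0, teens, decades, large round numbers) is a quick signal that the converter and parser agree with each other, though it does not prove that either matches native usage.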

In [None]:
test_prompt = f"""
Create Vitest unit tests for the {LANGUAGE_NAME} normalizer functions.

## Functions to test
```typescript
{normalizer_content_final[:3000]}
```

## Requirements
- Use Vitest (import {{ describe, it, expect }} from 'vitest')
- Test edge cases: 0, single digits, teens, decades, hundreds, thousands, millions
- Test the parser with various input formats
- Include at least 20 test cases for numberToWords
- Include at least 15 test cases for parseSpokenNumber

## Output format
Return ONLY the TypeScript test code.
"""

test_code = call_llm(test_prompt)
print(test_code)

In [None]:
# ✏️ EDIT: Review and modify tests
test_code_edited = test_code

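# Strip first so stray whitespace doesn't hide the fences
test_code_edited = test_code_edited.strip()
if test_code_edited.startswith('```'):
    lines = test_code_edited.split('\n')
    test_code_edited = '\n'.join(lines[1:-1] if lines[-1].strip() == '```' else lines[1:])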

# Save test file
test_path = LANG_SRC_DIR / 'normalizer.test.ts'
save_file(test_path, test_code_edited)

---
## Step 8: Generate help texts for curriculum

Help texts provide learning tips for each number, explaining patterns and pronunciation.

In [None]:
# Load Swedish config as reference (read as UTF-8 so non-ASCII characters are preserved)
swedish_config = (CURRICULUM_GEN_DIR / 'swedish.ts').read_text(encoding='utf-8')

help_texts_prompt = f"""
Create help texts for learning {LANGUAGE_NAME} numbers.

## Pattern analysis
{pattern_analysis_edited}

## Requirements
- Create a TypeScript Record<number, string> with help texts
- Max 170 characters per help text
- Focus on:
  - Pronunciation tips for English speakers
  - Pattern explanations ("X + Y = XY")
  - Irregularities and exceptions
  - Memory aids
- Cover these numbers at minimum:
  - 0-20 (all)
  - Decades: 30, 40, 50, 60, 70, 80, 90, 100
  - Key hundreds: 200, 500, 1000
  - Representative compounds: 21, 33, 44, 55, 66, 77, 88, 99
  - Large numbers: 1000000, 1000000000

## Reference (Swedish help texts)
```typescript
{swedish_config[:2500]}
```

## Output format
Return ONLY the TypeScript code for the helpTexts constant:
```typescript
const helpTexts: Record<number, string> = {{
    0: 'Help text for zero...',
    // ...
}}
```
"""

help_texts_code = call_llm(help_texts_prompt)
print(help_texts_code)

In [None]:
# ✏️ EDIT: Review and modify help texts
help_texts_code_edited = help_texts_code

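# Strip first so stray whitespace doesn't hide the fences
help_texts_code_edited = help_texts_code_edited.strip()
if help_texts_code_edited.startswith('```'):
    lines = help_texts_code_edited.split('\n')
    help_texts_code_edited = '\n'.join(lines[1:-1] if lines[-1].strip() == '```' else lines[1:])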
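
The prompt in this step sets two checkable constraints: a 170-character limit and a minimum set of numbers to cover. The sketch below shows how they could be verified, assuming the help texts have been copied into a Python dictionary by hand (parsing the generated TypeScript is out of scope here). `check_help_texts` and the constants are hypothetical names.

```python
MAX_HELP_TEXT_LENGTH = 170

# The minimum coverage requested in the prompt.
REQUIRED_NUMBERS = (
    list(range(0, 21))
    + [30, 40, 50, 60, 70, 80, 90, 100]
    + [200, 500, 1000]
    + [21, 33, 44, 55, 66, 77, 88, 99]
    + [1_000_000, 1_000_000_000]
)


def check_help_texts(help_texts: dict[int, str]) -> dict[str, list[int]]:
    """Report numbers whose text is too long and required numbers with no text."""
    too_long = sorted(
        n for n, text in help_texts.items() if len(text) > MAX_HELP_TEXT_LENGTH
    )
    missing = sorted(n for n in REQUIRED_NUMBERS if n not in help_texts)
    return {"too_long": too_long, "missing": missing}
```

LLM output often drifts past length limits, so a check like this is a cheap complement to reading the texts manually.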

---
## Step 9: Generate language config

Create the curriculum generator config file.

In [None]:
lang_id_camel = LANGUAGE_ID.replace('-', ' ').title().replace(' ', '')
lang_id_camel_lower = lang_id_camel[0].lower() + lang_id_camel[1:]

config_prompt = f"""
Create a curriculum generator config for {LANGUAGE_NAME}.

## Help texts (already generated)
```typescript
{help_texts_code_edited}
```

## Pattern analysis
{pattern_analysis_edited[:1500]}

## Reference (Swedish config structure)
```typescript
{swedish_config}
```

## Requirements
- Export as `{lang_id_camel_lower}Config: LanguageConfig`
- Include the help texts
- Create a `localizeStages` function to customize stage names/descriptions for {LANGUAGE_NAME}
- Leave `voices: []` empty (we'll add voices later)
- Import types from './types.ts'

## Output format
Return the complete TypeScript file content.
"""

config_code = call_llm(config_prompt)
print(config_code)

In [None]:
# ✏️ EDIT: Review and modify config
config_code_edited = config_code

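# Strip first so stray whitespace doesn't hide the fences
config_code_edited = config_code_edited.strip()
if config_code_edited.startswith('```'):
    lines = config_code_edited.split('\n')
    config_code_edited = '\n'.join(lines[1:-1] if lines[-1].strip() == '```' else lines[1:])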

# Save config file
save_file(LANG_CONFIG_FILE, config_code_edited)
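
The exported names in this step and the next all derive from the language ID. The sketch below gathers that derivation in one place so the results are easy to inspect; `derive_identifiers` is a hypothetical helper that mirrors the `lang_id_camel` logic used in the notebook.

```python
def derive_identifiers(language_id: str) -> dict[str, str]:
    """Derive TypeScript identifier names from a lowercase, hyphenated language ID."""
    # "sino-korean" -> "Sino Korean" -> "SinoKorean"
    pascal = language_id.replace("-", " ").title().replace(" ", "")
    # "SinoKorean" -> "sinoKorean"
    camel = pascal[0].lower() + pascal[1:]
    return {
        "to_words": f"numberTo{pascal}",
        "parse": f"parse{pascal}",
        "language_export": camel,
        "config_export": f"{camel}Config",
    }
```

For `"sino-korean"` this gives `numberToSinoKorean`, `parseSinoKorean`, `sinoKorean`, and `sinoKoreanConfig`. The generated normalizer needs to export the first two names for the import in `index.ts` to resolve.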

---
## Step 10: Generate language index.ts

Create the main language definition file.

In [None]:
# Determine function names from normalizer
func_name = f"numberTo{lang_id_camel}"
parse_name = f"parse{lang_id_camel}"
romanize_name = "numberToRomanized" if NEEDS_ROMANIZATION else None

index_content = f"""import {{ loadCurriculum }} from '@curriculum/curriculum.ts'
import type {{ Language }} from '@languages/index.ts'

import {{ {func_name}, {parse_name}{', ' + romanize_name if romanize_name else ''} }} from './normalizer.ts'

export const {lang_id_camel_lower}: Language = {{
    id: '{LANGUAGE_ID}',
    name: '{LANGUAGE_NAME}',
    ttsLanguageCode: '{LANGUAGE_CODE}',
    sttLanguageCode: '{LANGUAGE_CODE}',
    flag: '{FLAG_EMOJI}',
    curriculum: loadCurriculum('{LANGUAGE_ID}'),
    numberToWords: {func_name},
    parseSpokenNumber: {parse_name},
{f'    numberToRomanized: {romanize_name},' if romanize_name else ''}
}}
"""

print(index_content)

In [None]:
# ✏️ EDIT: Review and modify
index_content_final = index_content

# Save index file
index_path = LANG_SRC_DIR / 'index.ts'
save_file(index_path, index_content_final)

---
## Step 11: Summary & next steps

### Files created
Run the cell below to see all generated files.

In [None]:
print("üìÅ Generated files:")
print(f"  ‚úÖ {LANG_SRC_DIR / 'index.ts'}")
print(f"  ‚úÖ {LANG_SRC_DIR / 'normalizer.ts'}")
print(f"  ‚úÖ {LANG_SRC_DIR / 'normalizer.test.ts'}")
print(f"  ‚úÖ {LANG_CONFIG_FILE}")
print(f"  ‚úÖ {PROJECT_ROOT / 'docs' / f'{LANGUAGE_ID}-patterns.md'}")

print("\nüìã Manual steps remaining:")
print(f"""  
1. Register config in scripts/curriculum-gen/languages/index.ts:
   import {{ {lang_id_camel_lower}Config }} from './{LANGUAGE_ID}.ts'
   export const configs = {{ ..., '{LANGUAGE_ID}': {lang_id_camel_lower}Config }}

2. Register language in src/languages/index.ts:
   import {{ {lang_id_camel_lower} }} from './{LANGUAGE_ID}/index.ts'
   export const languages = {{ ..., '{LANGUAGE_ID}': {lang_id_camel_lower} }}

3. Generate curriculum:
   pnpm cur-gen --lang {LANGUAGE_ID}

4. Run tests:
   pnpm test src/languages/{LANGUAGE_ID}/

5. Generate audio (see docs/adding-a-language.md Step 4)
""")
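
The summary cell prints the expected paths without checking them. The sketch below shows a small check for files that are absent or empty; `missing_files` is a hypothetical helper, and the paths passed to it would be the ones listed above.

```python
from pathlib import Path


def missing_files(paths: list[Path]) -> list[Path]:
    """Return the paths that do not exist as files or are empty."""
    return [p for p in paths if not p.is_file() or p.stat().st_size == 0]
```

For example, `missing_files([LANG_SRC_DIR / 'index.ts', LANG_SRC_DIR / 'normalizer.ts', LANG_CONFIG_FILE])` should return an empty list once every step has run successfully.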

---
## 🎉 Done!

You've created all the core files for your new language. Follow the manual steps above to complete the integration.

If you encounter issues:
1. Run `pnpm tsc --noEmit` to check for TypeScript errors
2. Run `pnpm test` to verify the normalizer functions
3. Check the pattern documentation and adjust help texts as needed