# Romeo and Juliet Text Extraction with LangExtract

This notebook demonstrates structured extraction with LangExtract. The task defined below pulls entities and attributes from a fictional driver license, and the last section tries a passage from Shakespeare's Romeo and Juliet.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/google/langextract/blob/main/examples/notebooks/romeo_juliet_extraction.ipynb)

## Setup

In [1]:
# Install LangExtract
%pip install -q langextract

In [2]:
# Set up your Gemini API key
# Get your key from: https://aistudio.google.com/app/apikey
import os
from getpass import getpass

if 'GEMINI_API_KEY' not in os.environ:
    os.environ['GEMINI_API_KEY'] = getpass('Enter your Gemini API key: ')
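
The cell above prompts for the key only when it is not already set. A minimal sketch of the same idea as a reusable helper follows, together with a masking function for any place the key might be logged. The names `ensure_api_key` and `mask` are hypothetical and not part of LangExtract, and the demo key is a placeholder.

```python
import os
from getpass import getpass


def ensure_api_key(env=os.environ, name="GEMINI_API_KEY", ask=getpass):
    """Return the key stored under `name`, prompting once if it is missing."""
    if not env.get(name):
        env[name] = ask(f"Enter your {name}: ")
    return env[name]


def mask(secret, visible=4):
    """Hide all but the last `visible` characters of a secret."""
    if len(secret) <= visible:
        return "*" * len(secret)
    return "*" * (len(secret) - visible) + secret[-visible:]


# Demo with a placeholder key in a plain dict, so nothing is prompted.
demo_env = {"GEMINI_API_KEY": "example-key-1234"}
print(mask(ensure_api_key(demo_env)))
```

Passing `env` and `ask` as parameters keeps the helper testable without touching the real environment or blocking on input.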



## Define Extraction Task

In [3]:
import langextract as lx
import textwrap

prompt = textwrap.dedent("""\
    Extract key entities and attributes from an identification document.
    Use exact text where possible. Do not paraphrase names or IDs.
    Include: person, document, id_number, address, dates (dob, issue, expiry),
    classifications (class, endorsements, restrictions), and flags (organ_donor).
""")

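# Note: `fake_license` is defined in the "Extract from Sample Text" cell below;
# run that definition before this cell, or building `examples` raises NameError.
examples = [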
    lx.data.ExampleData(
        text=fake_license,
        extractions=[
            lx.data.Extraction(
                extraction_class="document",
                extraction_text="DRIVER LICENSE",
                attributes={"jurisdiction": "STATE OF PACIFICA"}
            ),
            lx.data.Extraction(
                extraction_class="id_number",
                extraction_text="P123-456-789",
                attributes={"label": "DLN"}
            ),
            lx.data.Extraction(
                extraction_class="person",
                extraction_text="RIVERA, ALEX J",
                attributes={"given_name": "ALEX", "family_name": "RIVERA", "middle_initial": "J"}
            ),
            lx.data.Extraction(
                extraction_class="address",
                extraction_text="1420 Beacon Ave, Apt 5B, Northport, PC 94021",
                attributes={"city": "Northport", "postal_code": "94021", "region": "PC"}
            ),
            lx.data.Extraction(
                extraction_class="date",
                extraction_text="1991-07-18",
                attributes={"type": "DOB"}
            ),
            lx.data.Extraction(
                extraction_class="date",
                extraction_text="2024-06-15",
                attributes={"type": "Issue"}
            ),
            lx.data.Extraction(
                extraction_class="date",
                extraction_text="2032-07-18",
                attributes={"type": "Exp"}
            ),
            lx.data.Extraction(
                extraction_class="classification",
                extraction_text="Class: C",
                attributes={"class": "C"}
            ),
            lx.data.Extraction(
                extraction_class="endorsement",
                extraction_text="Endorsements: None",
                attributes={"endorsements": "None"}
            ),
            lx.data.Extraction(
                extraction_class="restriction",
                extraction_text="Restr: Corrective Lenses",
                attributes={"restriction": "Corrective Lenses"}
            ),
            lx.data.Extraction(
                extraction_class="flag",
                extraction_text="Organ Donor: YES",
                attributes={"organ_donor": "YES"}
            ),
            lx.data.Extraction(
                extraction_class="note",
                extraction_text="Temporary address valid until 2025-01-31.",
                attributes={"type": "address_validity"}
            ),
        ]
    )
]
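
The prompt asks for exact text, so it seems reasonable to hold the few-shot example to the same rule. The sketch below is a plain-Python check that uses no LangExtract calls: it copies the `fake_license` text from the next section and the `extraction_text` values from the example above, then lists any span that does not occur verbatim in the text. The helper name `ungrounded` is hypothetical.

```python
import textwrap

# Copy of the notebook's `fake_license` sample text.
fake_license = textwrap.dedent("""\
    STATE OF PACIFICA • DRIVER LICENSE
    DLN: P123-456-789   Class: C   Endorsements: None   Restr: Corrective Lenses
    Name: RIVERA, ALEX J
    Address: 1420 Beacon Ave, Apt 5B, Northport, PC 94021
    DOB: 1991-07-18   Issue: 2024-06-15   Exp: 2032-07-18
""")

# (extraction_class, extraction_text) pairs copied from the example above.
example_spans = [
    ("document", "DRIVER LICENSE"),
    ("id_number", "P123-456-789"),
    ("person", "RIVERA, ALEX J"),
    ("address", "1420 Beacon Ave, Apt 5B, Northport, PC 94021"),
    ("date", "1991-07-18"),
    ("date", "2024-06-15"),
    ("date", "2032-07-18"),
    ("classification", "Class: C"),
    ("endorsement", "Endorsements: None"),
    ("restriction", "Restr: Corrective Lenses"),
    ("flag", "Organ Donor: YES"),
    ("note", "Temporary address valid until 2025-01-31."),
]


def ungrounded(text, spans):
    """Return the spans whose text does not appear verbatim in `text`."""
    return [(cls, span) for cls, span in spans if span not in text]


missing = ungrounded(fake_license, example_spans)
for cls, span in missing:
    print(f"not found in example text: {cls}: {span!r}")
```

On this data the check reports the `flag` and `note` spans, since `fake_license` contains neither an organ donor line nor a temporary address note. Whether that matters depends on how the library aligns example spans, so treat it as a prompt to review the example rather than a confirmed defect.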

## Extract from Sample Text

In [9]:
# Clean license text (used by the few-shot example) and a noisy scan to extract from
fake_license = textwrap.dedent("""\
    STATE OF PACIFICA • DRIVER LICENSE
    DLN: P123-456-789   Class: C   Endorsements: None   Restr: Corrective Lenses
    Name: RIVERA, ALEX J
    Address: 1420 Beacon Ave, Apt 5B, Northport, PC 94021
    DOB: 1991-07-18   Issue: 2024-06-15   Exp: 2032-07-18
""")

noisy_license = textwrap.dedent("""\
    ====== Document Scan (IMG_2031.JPG) ======
    STATE OF PACIFICA • DRIVER LICENSE
    [Help? Call 1-800-PACIFICA • v3.9.2 • Print: 2024-06-16T08:12Z]

    ▶ CUSTOMER COPY – NOT VALID FOR TRAVEL
    DLN: P123-456-789   Class: C   Endorsements: None   Restr: Corrective Lenses
    Name: RIVERA, ALEX J
    Address: 1420 Beacon Ave, Apt 5B, Northport, PC 94021
    DOB: 1991-07-18   Issue: 2024-06-15   Exp: 2032-07-18

    ---- FOOTNOTES / SYSTEM LOGS -------------------------------------
    • Reprint reason: Card damaged (case #NP-88412).
    • Payment Ref: 77-22-991  Auth: OK
    • Barcode (Code128): |||:::|||:::
    • “Birth Date” (legacy field): 1991/07/18
    • Issued by: DMV Northport Office (Window 3)
    • Temp Visitor Until: 2026-01-01 (does not change card expiration)
    • Prior Customer ID: P123-456-780 (deprecated)
    • D1N (OCR guess): P123-456-789  <-- ignore OCR guess key

    ---- RANDOM INSERTS / AD SPACE -----------------------------------
    Get 15% off car registration with SAFE-DRIVE course!  Promo ends 07/31.
    Terms apply. See pacifica.gov/safe-drive.  © State of Pacifica.

    ---- DUPLICATED METADATA (SHADOW SCAN) ---------------------------
    STATE OF PACIFICA — DRIVER LICENSE — SAMPLE
    Name: Rivera, Alex J.    Class C     (this line may vary casing/punctuation)
    Exp: 2032-07-18  Issue: 2024-06-15  DOB: 1991-07-18
    Address on file: 1420 Beacon Ave Apt 5B Northport PC 94021

    ---- PAGE 2 (BACK SIDE OCR MERGE) --------------------------------
    EMERGENCY CONTACT (self-reported): Aunt M. Rivera  (555) 212-9090
    Organ Donor: YES   Veteran: NO
    Restrictions: Corrective Lenses
    Endorsements: None
    Signature: /ARivera/
    ------------------------------------------------------------------
    End of Scan
""")

result = lx.extract(
    text_or_documents=noisy_license,
    prompt_description=prompt,
    examples=examples,
    model_id="gemini-2.5-flash",
)

# Display results
print(f"Extracted {len(result.extractions)} entities:\n")
for extraction in result.extractions:
    print(f"• {extraction.extraction_class}: '{extraction.extraction_text}'")
    if extraction.attributes:
        for key, value in extraction.attributes.items():
            print(f"  - {key}: {value}")

✓ Extraction processing complete
✓ Extracted 10 entities (8 unique types)
  • Time: 6.36s
  • Speed: 38 chars/sec
  • Chunks: 1
Extracted 10 entities:

• document: 'DRIVER LICENSE'
  - jurisdiction: STATE OF PACIFICA
• id_number: 'P123-456-789'
  - label: DLN
• person: 'RIVERA, ALEX J'
  - given_name: ALEX
  - family_name: RIVERA
  - middle_initial: J
• address: '1420 Beacon Ave, Apt 5B, Northport, PC 94021'
  - city: Northport
  - postal_code: 94021
  - region: PC
• date: '1991-07-18'
  - type: DOB
• date: '2024-06-15'
  - type: Issue
• date: '2032-07-18'
  - type: Exp
• classification: 'Class: C'
  - class: C
• endorsement: 'Endorsements: None'
  - endorsements: None
• restriction: 'Restr: Corrective Lenses'
  - restriction: Corrective Lenses
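
Extracted values arrive as strings, so a light validation pass can catch misread fields before they are used downstream. The sketch below takes the three `date` rows from the output above, re-entered by hand as plain tuples rather than LangExtract objects, parses them with the standard library, and checks that they fall in a plausible order. The helper names are hypothetical.

```python
from datetime import date

# (extraction_class, extraction_text, attributes) copied from the output above.
extracted = [
    ("date", "1991-07-18", {"type": "DOB"}),
    ("date", "2024-06-15", {"type": "Issue"}),
    ("date", "2032-07-18", {"type": "Exp"}),
]


def dates_by_type(rows):
    """Map each date's `type` attribute to a parsed datetime.date."""
    return {
        attrs["type"]: date.fromisoformat(text)
        for cls, text, attrs in rows
        if cls == "date"
    }


def in_plausible_order(dates):
    """A license should be issued after birth and expire after issue."""
    return dates["DOB"] < dates["Issue"] < dates["Exp"]


parsed = dates_by_type(extracted)
print(parsed["Exp"].year - parsed["Issue"].year, "years between issue and expiry")
print("plausible order:", in_plausible_order(parsed))
```

`date.fromisoformat` raises `ValueError` on anything that is not `YYYY-MM-DD`, which makes it a cheap guard against values such as the legacy `1991/07/18` field in the noisy scan.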


## Interactive Visualization

In [10]:
# Save results to JSONL
lx.io.save_annotated_documents([result], output_name="fake_license.jsonl", output_dir=".")

# Generate interactive visualization
html_content = lx.visualize("fake_license.jsonl")

# Display in notebook
print("Interactive visualization (hover over highlights to see attributes):")
html_content

LangExtract: Saving to fake_license.jsonl: 1 docs [00:00, 492.17 docs/s]
✓ Saved 1 documents to fake_license.jsonl
LangExtract: Loading fake_license.jsonl: 100%|██████████| 3.00k/3.00k [00:00<00:00, 7.82MB/s]
✓ Loaded 1 documents from fake_license.jsonl
Interactive visualization (hover over highlights to see attributes):
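
The saved file uses the JSONL convention of one JSON object per line, so it can be inspected with the standard library alone. The sketch below reads such a file and prints each record's top-level keys. It runs against a stand-in file with a made-up record layout, because the actual fields LangExtract writes are not shown in this notebook. Point `read_jsonl` at `fake_license.jsonl` to see the real ones.

```python
import json
import os
import tempfile


def read_jsonl(path):
    """Parse one JSON object per non-empty line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Stand-in file with a hypothetical layout (NOT LangExtract's real schema).
demo_path = os.path.join(tempfile.mkdtemp(), "demo.jsonl")
with open(demo_path, "w", encoding="utf-8") as f:
    f.write(json.dumps({"id": "doc_1", "items": ["a", "b"]}) + "\n")
    f.write("\n")  # blank lines are skipped by read_jsonl
    f.write(json.dumps({"id": "doc_2", "items": []}) + "\n")

records = read_jsonl(demo_path)
for record in records:
    print(sorted(record.keys()))
```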





In [11]:
# Save visualization to file (for downloading)
with open("fake_license.html", "w") as f:
    # Handle both Jupyter (HTML object) and non-Jupyter (string) environments
    if hasattr(html_content, 'data'):
        f.write(html_content.data)
    else:
        f.write(html_content)

print("✓ Visualization saved to fake_license.html")
print("You can download this file from the Files panel on the left.")

✓ Visualization saved to fake_license.html
You can download this file from the Files panel on the left.
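
The save cell branches on whether `html_content` has a `data` attribute. A minimal sketch of that branch as a small helper follows, with an explicit UTF-8 encoding so non-ASCII characters in the source text survive on any platform. `FakeHTML` is a stand-in for the notebook display object, and all names here are hypothetical.

```python
import os
import tempfile
from pathlib import Path


class FakeHTML:
    """Stand-in for a notebook display object that exposes `.data`."""

    def __init__(self, data):
        self.data = data


def html_text(obj):
    """Return the markup whether `obj` is a display object or a string."""
    return obj.data if hasattr(obj, "data") else str(obj)


def save_html(obj, path):
    """Write the markup as UTF-8 and return the number of characters."""
    text = html_text(obj)
    Path(path).write_text(text, encoding="utf-8")
    return len(text)


out_dir = tempfile.mkdtemp()
path_a = os.path.join(out_dir, "a.html")
path_b = os.path.join(out_dir, "b.html")
n_a = save_html(FakeHTML("<p>café</p>"), path_a)  # display-object case
n_b = save_html("<p>plain</p>", path_b)           # plain-string case
print(n_a, n_b)
```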


## Try Your Own Text

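Experiment with your own Shakespeare quotes or any literary text. Note that `prompt` and `examples` as defined above target identification documents, while the recorded output below uses `character`, `emotion`, and `relationship` classes, which suggests it came from a literary prompt and examples. Redefine both to match your text before running this cell.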

In [None]:
# Try your own text
your_text = """
JULIET: O Romeo, Romeo! wherefore art thou Romeo?
Deny thy father and refuse thy name;
Or, if thou wilt not, be but sworn my love,
And I'll no longer be a Capulet.
"""

custom_result = lx.extract(
    text_or_documents=your_text,
    prompt_description=prompt,
    examples=examples,
    model_id="gemini-2.5-flash",
)

print("Extractions from your text:\n")
for e in custom_result.extractions:
    print(f"• {e.extraction_class}: '{e.extraction_text}'")
    if e.attributes:
        for key, value in e.attributes.items():
            print(f"  - {key}: {value}")

LangExtract: model=gemini-2.5-flash, current=163 chars, processed=163 chars:  [00:05]
✓ Extraction processing complete
✓ Extracted 6 entities (3 unique types)
  • Time: 5.84s
  • Speed: 28 chars/sec
  • Chunks: 1
Extractions from your text:

• character: 'JULIET'
  - emotional_state: longing
• emotion: 'O Romeo, Romeo! wherefore art thou Romeo?'
  - feeling: desperate questioning
• relationship: 'thy father'
  - type: familial
• relationship: 'thy name'
  - type: lineage
• relationship: 'my love'
  - type: romantic bond
• relationship: 'Capulet'
  - type: family affiliation
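
A quick way to summarize a result is to group extractions by class. The sketch below does this with the six rows from the output above, re-entered by hand as plain tuples. With a live result, the same grouping could be built from each extraction's `extraction_class` and `extraction_text`.

```python
from collections import Counter, defaultdict

# (extraction_class, extraction_text) pairs copied from the output above.
recorded = [
    ("character", "JULIET"),
    ("emotion", "O Romeo, Romeo! wherefore art thou Romeo?"),
    ("relationship", "thy father"),
    ("relationship", "thy name"),
    ("relationship", "my love"),
    ("relationship", "Capulet"),
]

# Count entities per class.
counts = Counter(cls for cls, _ in recorded)

# Collect the extracted texts under each class, preserving order.
by_class = defaultdict(list)
for cls, text in recorded:
    by_class[cls].append(text)

print(f"{len(recorded)} entities, {len(counts)} unique types")
for cls, n in counts.most_common():
    print(f"{cls}: {n} -> {by_class[cls]}")
```

The totals agree with the summary line in the recorded output: 6 entities across 3 unique types.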



