# Structured output with Instructor

This tutorial demonstrates how to use [Instructor](https://useinstructor.com/) with Writer to extract structured data, such as JSON, from text, CSV, or PDF files.

You'll learn how to define Pydantic models, set up the Writer client, extract structured data, and generate CSV outputs.

## Prerequisites

Before getting started, you'll need:

- A [Writer AI Studio](https://app.writer.com/register) account
- An API key, which you can obtain by following the [API Quickstart](https://dev.writer.com/api-guides/quickstart)

## Setup

Install the required libraries:

In [None]:
%pip install "instructor[writer]" writer-sdk python-dotenv pydantic

## Environment setup

Set the `WRITER_API_KEY` environment variable. You can store it in a `.env` file or enter it interactively:

In [1]:
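import getpass
import os

from dotenv import load_dotenv
from writerai import Writer

# Load WRITER_API_KEY from a local .env file, if one exists
load_dotenv()

# Fall back to an interactive prompt when the key is still missing
if not os.getenv("WRITER_API_KEY"):
    os.environ["WRITER_API_KEY"] = getpass.getpass("Enter your Writer API key: ")

# The client reads WRITER_API_KEY from the environment
writer_client = Writer()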

Enter your Writer API key:  ········


## Basic usage with Instructor

Here's a minimal example using Instructor to extract structured data from a simple string:

In [2]:
import os

import instructor
from writerai import Writer
from pydantic import BaseModel

# Initialize Instructor client
client = instructor.from_writer(Writer(api_key=os.getenv('WRITER_API_KEY')))

class User(BaseModel):
    name: str
    age: int

user = client.chat.completions.create(
    model="palmyra-x5",
    messages=[{"role": "user", "content": "Extract: John is 30 years old"}],
    response_model=User,
)

print(user)

name='John' age=30


## Defining a data model for file extraction

You can define a Pydantic model to validate structured data extracted from text, CSV, or PDF files.

In [3]:
from typing import Annotated
from pydantic import BaseModel, AfterValidator, Field


def first_last_name_validator(v: str) -> str:
    # Check isalpha() first so an empty string fails with ValueError instead of IndexError
    if not v.isalpha() or v[0] != v[0].upper() or v[1:] != v[1:].lower():
        raise ValueError("Name must contain only letters and start with an uppercase letter")
    return v


class UserExtract(BaseModel):
    first_name: Annotated[str, AfterValidator(first_last_name_validator)] = Field(..., description="The first name")
    last_name: Annotated[str, AfterValidator(first_last_name_validator)] = Field(..., description="The last name")
    email: str
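
To see which names the validator above accepts, the rule can be tried on its own without calling the API. The sketch below is a minimal, standard-library-only illustration; `is_valid_name` is a hypothetical helper that mirrors the rule as a boolean check and is not part of Instructor or Pydantic.

```python
def is_valid_name(value: str) -> bool:
    """Hypothetical helper mirroring the validator's rule:
    letters only, first letter uppercase, remaining letters lowercase."""
    return (
        value.isalpha()
        and value[0] == value[0].upper()
        and value[1:] == value[1:].lower()
    )


# Try a few candidate names against the rule
for candidate in ["Alice", "alice", "ALICE", "McDonald", "Anne-Marie", ""]:
    print(f"{candidate!r}: {is_valid_name(candidate)}")
```

Only `Alice` passes. The rule is strict: legitimate names such as `McDonald` or `Anne-Marie` are rejected, so you may want to relax it for real data.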

## File processing functions

Define functions to read text, CSV, or PDF files and extract content for structured parsing.

In [13]:
import asyncio
import os
from typing import Type, List, Iterable
import csv
import json
from writerai import AsyncWriter

async_writer_client = AsyncWriter()


async def fetch_file_text(file_path: str, name: str, extension: str) -> str:
    allowed_extensions = [".txt", ".csv", ".pdf"]
    if extension not in allowed_extensions:
        raise ValueError(f"File extension {extension} is not allowed")

    with open(file_path, 'rb') as file:
        file_contents = file.read()

    if extension == ".pdf":
        file = await async_writer_client.files.upload(content=file_contents,
                                                       content_disposition=f"attachment; filename={name}{extension}",
                                                       content_type="application/octet-stream")
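        parsed_pdf = await async_writer_client.tools.parse_pdf(file_id=file.id, format="text")
        # parse_pdf returns a response object; the extracted text is in its `content` field
        file_text = parsed_pdf.content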
        await async_writer_client.files.delete(file.id)
    else:
        file_text = file_contents.decode("utf-8")

    return file_text


async def repair_data(file_text: str, response_model: Type[BaseModel]) -> List[BaseModel]:
    instructor_client = instructor.from_writer(client=async_writer_client)
    entities_async_gen = await instructor_client.chat.completions.create(
        model="palmyra-x5",
        response_model=Iterable[response_model],
        max_retries=5,
        messages=[{"role": "user", "content": f"Extract entities from {file_text}"}]
    )
    
    return [entity async for entity in entities_async_gen]


def generate_csv(entities: List[BaseModel], response_model: Type[BaseModel], output_path: str = None) -> None:
    # Use the model's field names as the CSV header
    fieldnames = list(response_model.model_json_schema()["properties"].keys())
    file_path = f"{response_model.__name__}.csv"
    if output_path:
        # Create the output directory itself; os.path.dirname("out") would return ""
        os.makedirs(output_path, exist_ok=True)
        file_path = os.path.join(output_path, file_path)
    # newline="" lets the csv module control line endings
    with open(file_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        for entity in entities:
            # mode="json" converts field values to JSON-compatible types
            writer.writerow(entity.model_dump(mode="json"))


async def handle_file(file_path: str, response_model: Type[BaseModel], output_path: str = None):
    name, extension = os.path.splitext(os.path.basename(file_path))
    file_text = await fetch_file_text(file_path, name, extension)
    repaired_entities = await repair_data(file_text, response_model)
    print(f"Extracted {len(repaired_entities)} entities from {file_path}")
    generate_csv(repaired_entities, response_model, output_path)
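
The CSV step in `generate_csv` can be checked in isolation. The following is a minimal sketch using only the standard library: the rows are made-up dictionaries standing in for dumped `UserExtract` entities, and the output goes to an in-memory buffer instead of a file.

```python
import csv
import io

# Hypothetical rows standing in for the dictionaries produced from extracted entities
rows = [
    {"first_name": "Ada", "last_name": "Lovelace", "email": "ada@example.com"},
    {"first_name": "Alan", "last_name": "Turing", "email": "alan@example.com"},
]
fieldnames = ["first_name", "last_name", "email"]

# newline="" leaves line endings to the csv module
buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=fieldnames)
writer.writeheader()
writer.writerows(rows)

csv_text = buffer.getvalue()
print(csv_text)
```

The column order follows `fieldnames`, not the key order of each row, which is why `generate_csv` derives the header from the model's schema.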

## Running the data repair tool

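You can now process one or more files concurrently with `asyncio.gather`. Top-level `await` works in Jupyter; in a plain Python script, call `asyncio.run(main())` instead: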

In [19]:
async def main():
    data = [
        # ("example_data/ExampleFileTextFormat.txt", UserExtract, None),
        ("example_data/AnotherExampleFileTextFormat.txt", UserExtract, "out/"),
    ]
    tasks = [handle_file(file, model, path) for file, model, path in data]
    await asyncio.gather(*tasks)

await main()


Extracted 20 entities from example_data/AnotherExampleFileTextFormat.txt
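
When several files are processed at once, one bad file should not hide the results of the others. The sketch below shows one way to do that with `asyncio.gather(..., return_exceptions=True)`. It is written as a standalone script, and `fake_handle_file` is a hypothetical stand-in for `handle_file` that sleeps instead of calling the Writer API.

```python
import asyncio


async def fake_handle_file(file_path: str) -> str:
    """Hypothetical stand-in for handle_file: no API call, just a short sleep."""
    if not file_path.endswith((".txt", ".csv", ".pdf")):
        raise ValueError(f"Unsupported file: {file_path}")
    await asyncio.sleep(0.01)
    return f"processed {file_path}"


async def run_all(paths: list) -> list:
    # return_exceptions=True returns exceptions as results instead of raising the first one
    return await asyncio.gather(
        *(fake_handle_file(path) for path in paths),
        return_exceptions=True,
    )


results = asyncio.run(run_all(["a.txt", "b.docx", "c.pdf"]))
for path, result in zip(["a.txt", "b.docx", "c.pdf"], results):
    status = "failed" if isinstance(result, Exception) else "ok"
    print(f"{path}: {status} ({result})")
```

Results come back in the same order as the inputs, so each one can be matched to its file path.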
