# Sample Extractor

## Setup Path

We add the backend directory to the Python path so that we can import the modules under `src`.

In [12]:
import sys
from pathlib import Path

# Add the backend directory to the Python path
backend_dir = Path.cwd().parent
sys.path.insert(0, str(backend_dir))
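
Re-running the cell above inserts the same directory into `sys.path` again each time. A minimal sketch of an idempotent variant is shown below; the helper name `add_to_sys_path` is hypothetical and not part of the project.

```python
import sys
from pathlib import Path


def add_to_sys_path(directory: Path) -> bool:
    """Insert `directory` at the front of sys.path unless it is already there.

    Returns True if the path was inserted, False if it was already present.
    """
    resolved = str(directory.resolve())
    if resolved in sys.path:
        return False
    sys.path.insert(0, resolved)
    return True


# Usage in the notebook (assumes the notebook lives one level below `backend`):
# add_to_sys_path(Path.cwd().parent)
```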

## Import and Load the `.env` File

A template for the `.env` file is available as `.env.example` under the `backend` directory.

For `LITELLM_BASE_URL`, use `127.0.0.1:4000` in our app's case. Do not use `0.0.0.0`; for some reason that does not work.

For `LITELLM_API_KEY`, use a **virtual key** created from the LiteLLM UI available at [this URL](http://127.0.0.1:4000/ui).

In [2]:
from openai import AsyncOpenAI
import os
from pathlib import Path
from dotenv import load_dotenv

load_dotenv()

client = AsyncOpenAI(
    base_url=os.environ.get('LITELLM_BASE_URL'),
    api_key=os.environ.get('LITELLM_API_KEY'),
)
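
If either variable is missing, `os.environ.get` silently returns `None` and the failure only shows up later at request time. As a rough sketch, we could fail early instead; `require_env` is a hypothetical helper, not something the project provides.

```python
import os


def require_env(*names: str) -> dict:
    """Return the requested environment variables, raising if any is unset or empty."""
    missing = [name for name in names if not os.environ.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {name: os.environ[name] for name in names}


# Usage, after load_dotenv() has run:
# env = require_env('LITELLM_BASE_URL', 'LITELLM_API_KEY')
```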

## Upload the PDF File

This step uses the Files API to upload the file to the remote. It gives us an `upload` object whose `id` we can reference later as the `file_id`.

Try not to upload the same file many times, and delete the files you upload after testing. Otherwise, the leftover ghost files cost extra money (negligible for testing purposes, but still).

In [7]:
upload = await client.files.create(
    file=Path('./files/sample1.pdf'),
    purpose='user_data',
)
upload

FileObject(id='file-CqHBSiKG8GPEgB9dMmDK1T', bytes=50919, created_at=1755433787, filename='sample1.pdf', object='file', purpose='user_data', status='processed', expires_at=None, status_details=None)
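
One way to avoid ghost files is to tie the upload and the deletion together, so the file is removed even if a later cell raises. The sketch below assumes only the `client.files.create` and `client.files.delete` calls already used in this notebook; the name `temporary_upload` is hypothetical.

```python
from contextlib import asynccontextmanager
from pathlib import Path


@asynccontextmanager
async def temporary_upload(client, path: Path, purpose: str = 'user_data'):
    """Upload `path`, yield the upload object, and always delete it afterwards."""
    upload = await client.files.create(file=path, purpose=purpose)
    try:
        yield upload
    finally:
        # Runs on success and on error, so no file is left behind.
        await client.files.delete(upload.id)


# Usage in a notebook cell:
# async with temporary_upload(client, Path('./files/sample1.pdf')) as upload:
#     ...  # call the model with upload.id here
```

The trade-off is that the file only exists inside the `async with` block, so this fits a single-cell experiment better than the step-by-step flow in this notebook.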

## Extract with Structured Output

This is how we extract the content with structured output. Be aware that this call extracts only a single `EducationLLMSchema`, while some resumes list several education experiences worth noting. Hence, this is just a demo.

We probably won't need reasoning. I just put it there for fun.

The type of `output` is rather complicated. Inspect it carefully. **I think we could try using `instructor` for simpler output schemas.**

In [23]:
from src.schemas.llm.education import EducationLLMSchema

output = await client.responses.parse(
    model='gpt-5-mini',
    text_format=EducationLLMSchema,
    reasoning={
        'effort': 'medium',
    },
    input=[
        {
            'role': 'user',
            'content': [
                {
                    'type': 'input_text',
                    'text': 'Help extract the education experience of this user.'
                },
                {
                    'type': 'input_file',
                    'file_id': upload.id
                }
            ]
        }
    ]
)
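
To capture several education experiences instead of one, a possible approach is to wrap the entry schema in a list-holding model and pass that as `text_format`. The sketch below uses a hypothetical `EducationEntry` stand-in with a few of the fields seen in the output, because the real `EducationLLMSchema` definition is not shown here; `EducationListSchema` is likewise an assumed name.

```python
from typing import List

from pydantic import BaseModel


class EducationEntry(BaseModel):
    # Hypothetical stand-in for EducationLLMSchema; the real schema has more fields.
    institution_name: str
    degree: str
    field_of_study: str


class EducationListSchema(BaseModel):
    # One entry per education experience found in the resume.
    educations: List[EducationEntry]


# Usage: pass the wrapper instead of the single-entry schema, e.g.
# output = await client.responses.parse(..., text_format=EducationListSchema, ...)
```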

## Extract Out the Structured Output

This part extracts the content of the structured output. Note how complicated this is. We should go with `instructor`.

In [26]:
output_items = output.output
for item in output_items:
    if item.type == 'message':
        content_items = item.content
        for content in content_items:
            if content.type == 'output_text':
                print(content.parsed.model_dump_json(indent=2))

{
  "institution_name": "National University of Singapore",
  "degree": "bachelor",
  "field_of_study": "Computer Science",
  "focus_area": "Selected for NUS Overseas College (NOC) Stockholm — year-long exchange and overseas internship; participated in classes at KTH Royal Institute of Technology and Stockholm School of Economics",
  "start_date": "August 2021",
  "end_date": "December 2025 (Expected)",
  "gpa": 4.71,
  "max_gpa": 5.0
}
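
Until we switch libraries, the nested loop above can be folded into a small helper. This is a sketch that relies only on the attributes used in the cell above (`output`, `type`, `content`, `parsed`); the name `iter_parsed` is hypothetical.

```python
from typing import Any, Iterator


def iter_parsed(response: Any) -> Iterator[Any]:
    """Yield every parsed structured-output object found in a response."""
    for item in response.output:
        if item.type != 'message':
            continue  # skip non-message items such as reasoning
        for content in item.content:
            if content.type == 'output_text' and content.parsed is not None:
                yield content.parsed


# Usage:
# for education in iter_parsed(output):
#     print(education.model_dump_json(indent=2))
```

Recent versions of the `openai` SDK also appear to expose an `output_parsed` shortcut on the parsed response; check the installed version before relying on it.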


## Delete the Files

Remember to delete the uploaded file so that it does not cost us extra money. No worries if you forget, though: we can always delete leftover files in batch through the API later.

In [None]:
await client.files.delete(upload.id)

AsyncCursorPage[FileObject](data=[], has_more=False, object='list', first_id='file-WFbaryWgWXk67jeJpYYvmU', last_id='file-CgdqFDxQNrftBzWZYLAVLTro')
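
For the batch cleanup mentioned above, a minimal sketch could list the files for one purpose and delete each of them. It assumes `client.files.list(purpose=...)` can be iterated with `async for`, as in the `openai` SDK, and that the LiteLLM proxy forwards the `purpose` filter; `delete_files_by_purpose` is a hypothetical helper.

```python
from typing import Any, List


async def delete_files_by_purpose(client: Any, purpose: str = 'user_data') -> List[str]:
    """Delete every remote file with the given purpose and return the deleted ids."""
    # Collect the ids first so we do not delete while still paginating.
    file_ids = [f.id async for f in client.files.list(purpose=purpose)]
    for file_id in file_ids:
        await client.files.delete(file_id)
    return file_ids


# Usage in a notebook cell:
# await delete_files_by_purpose(client)
```

Be careful with this on a shared key, since it removes every file with that purpose, not only the ones uploaded from this notebook.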