# Convert IMSDB screenplays to structured JSON

 - Taking the ScreenJSON schema as a starting point.
 - Using the Aladdin script for now. See [Screenplay_Dataset_to_files.ipynb](./Screenplay_Dataset_to_files.ipynb) for how I converted the IMSDB dataset to individual files. I picked Aladdin from that converted set. _The full dataset is huge, weighing in at 245MB._
 - Copying bits and pieces from the OpenAI completion code in other notebooks.

In [4]:
import os
from pathlib import Path

# Install pre-reqs
!pip install nb-js-diagrammers --quiet
!pip install iplantuml --quiet
!pip install tiktoken --quiet

REPO_ROOT = Path("~/bitbucket/").expanduser()

# download the files, if in colab
if 'google.colab' in str(get_ipython()):
    # There will be a `/content` directory
    # download to this and then allow it to go away after 8 hours.
    os.chdir("/content")
    !git clone https://github.com/juvvination/juvvination.github.io.git

    # git clone strips the trailing ".git" from the directory name
    REPO_ROOT = Path("/content") / "juvvination.github.io"


In [3]:
import openai
import os

# Setup logging 
# Note that module needs to be reloaded for our config to take as Jupyter already configures it
# which makes all future configs no-ops.
from importlib import reload
import logging
reload(logging)
logging.basicConfig(format='%(asctime)s %(levelname)s:%(message)s', 
                    level=logging.DEBUG, 
                    datefmt='%I:%M:%S')

#---------------------------------
# Configure for colab
# Either paste your OpenAI Key here or put it in secrets
if 'google.colab' in str(get_ipython()):
  from google.colab import userdata
  logging.debug("Trying to fetch OPENAI_API_KEY from your secrets. Remember to make it available to this notebook")
  os.environ["OPENAI_API_KEY"] = userdata.get("OPENAI_API_KEY")

logging.debug("Checking if OPENAI_API_KEY is available")
assert(os.environ.get("OPENAI_API_KEY"))


# Helper function to wait on the response.
def get_completion(prompt, model="gpt-4o-mini", temperature=0) -> str:
    messages = [{"role":"user", "content":prompt}]
    response = openai.chat.completions.create(
        model=model,
        messages=messages,
        temperature=temperature)
    return response.choices[0].message.content

# For charts and such
%load_ext nb_js_diagrammers
import iplantuml

# For displaying HTML and Markdown responses from ChatGPT
from IPython.display import display, HTML, Markdown

def colorBox(txt):
    display(HTML(f"<div style='border-radius:15px;padding:15px;background-color:pink;color:black;'>{txt}</div>"))    

05:28:35 DEBUG:Checking if OPENAI_API_KEY is available


## Explore Aladdin

 - Before sending this to OpenAI, I want to check how many tokens it has.
 - Per [this OpenAI doc](https://help.openai.com/en/articles/4936856-what-are-tokens-and-how-to-count-them), I need to use tiktoken to count them. Also:
   - `gpt-4o-mini` has a 128k context (prompt + data)
   - Figure out what else is needed for structured outputs.

In [None]:
import json
from pathlib import Path

SCRIPT_FILE  ='Aladdin'
IMSDB_ROOT   = REPO_ROOT / "data/imsdb_scripts"
SCRIPTS_DIR  = IMSDB_ROOT / "raw_screenplay"
aladdin_path = SCRIPTS_DIR / (SCRIPT_FILE + ".json")

with open(str(aladdin_path), 'r') as in_file:
    aladdin_json = json.load(in_file)

SCRIPTS_STRUCTURED_DIR = IMSDB_ROOT / "json_screenplay"
if not SCRIPTS_STRUCTURED_DIR.exists():
    print(f"Creating {SCRIPTS_STRUCTURED_DIR}")
    os.makedirs(SCRIPTS_STRUCTURED_DIR)

In [4]:
OPENAI_MODEL = "gpt-4o-mini"

# The encoder from tiktoken is model specific. I could not locate an encoder specifically 
# for gpt-4o-mini, so I am asking tiktoken to figure it out based on the model name.
import tiktoken
enc = tiktoken.encoding_for_model(OPENAI_MODEL)

screen_play_json_string = json.dumps(aladdin_json) 
data_tokens = enc.encode(screen_play_json_string)
print(f"{str(aladdin_path)} has {len(data_tokens)} tokens")

/home/vamsi/scripts/data/Aladdin.json has 31854 tokens


Ok. `Aladdin.json` has ~32k tokens. The limit is 128k for `gpt-4o-mini` as of Oct 2024! Looks like I can feed the whole thing in and see if it can generate the needed JSON.
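
One thing worth checking before feeding the whole script in: the context window is shared by the prompt and the response, and models also cap the response length separately. Below is a minimal sketch of that budget check. The `TokenBudget` helper is made up for illustration, the 16,384 output cap is my assumption about `gpt-4o-mini` (check the model page), and assuming the structured output is about as long as the input is only a rough guess.

```python
from dataclasses import dataclass


@dataclass
class TokenBudget:
    """Hypothetical helper for sanity-checking a request against model limits."""
    context_limit: int = 128_000  # context size noted above
    max_output: int = 16_384      # ASSUMPTION: per-response output cap, verify for your model

    def check(self, prompt_tokens: int, expected_output_tokens: int) -> dict:
        return {
            # Prompt and response share the same context window
            "fits_context": prompt_tokens + expected_output_tokens <= self.context_limit,
            # The response also has its own, smaller ceiling
            "fits_output_cap": expected_output_tokens <= self.max_output,
        }


budget = TokenBudget()
# Rough guess: a faithful conversion needs at least as many output tokens as the input has
result = budget.check(prompt_tokens=31_854, expected_output_tokens=31_854)
print(result)
```

If the output cap assumption holds, the input fits comfortably but a full verbatim conversion would not fit in a single response.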

## JSON Format

The [ScreenJSON](https://www.screenjson.com/schema/objects.html) format is very extensive. For now I will use a simpler format and follow the [OpenAI example](https://platform.openai.com/docs/guides/structured-outputs/introduction)

In [None]:
from pydantic import BaseModel

# I am trying to add per-attribute and per-class docs hoping that 
# pydantic will include them in the json schema and help the LLM
# get a better feel for what the fields are. Sort of like the description 
# fields used for function calling.
class Actor(BaseModel):
    name: str
    "The name of the actor"

    physical_description: str
    "The physical description of the actor deduced from the screenplay"
    "height, weight, skin color, eye color, hair color, hair texture, build"
    "disposition, scars or other distinguishing features"

class Role(BaseModel):
    name: str
    "The name of the role"    

class CameraAction(BaseModel):
    camera_action: str
    "A line indicating movie camera action like pan, zoom, frame etc"

class SceneAction(BaseModel):
    scene_action: str
    "lines indicating changes to the scene. New sounds, change in scenery"
    "new people or objects or situations arising or exiting"

class CutTo(BaseModel):
    cut_to: str
    "A instruction to cut to another scene. Usually"
    "indicated by 'Cut to', 'Cut_to', 'CUT TO' or similar"

# Let's see if it understands the dialogue_sequence part
# dialogue has bits like
#   come closer--(Camera zooms in hitting
#   peddler in face) Too close, a little too close.  (Camera
#   zooms back out to CU)There
class Lines(BaseModel):
    actor: Actor|Role
    "The actor or role saying the lines"

    dialogue_sequence: list[str|CameraAction|SceneAction]    
    "A sequence of actor or role attributed lines or"
    "camera actions occurring during an actor's lines"

class MiscellaneousAction(BaseModel):

    misc_action_segment: str
    "Miscellaneous script segment that is not attributed to"
    "an actor or groups of actors"

class Script(BaseModel):
    author: str
    "The author of the screen play"

    preamble: str
    "The initial preamble before any of the actor lines start"

    script_segments: list[Lines | CameraAction | SceneAction | CutTo | MiscellaneousAction]
    "The various segments that make up the script"
    " - lines of each actors"
    " - cut instructions"
    " - scene descriptions, internal, external etc"
    " - misc action between lines of actors"
    " - camera action that occurs between actors lines"


class ScreenplayJSON(BaseModel):
    opinion: str
    genres : list[str]    
    movie_release_date: str
    writers : list[str]
    actors: list[Actor]
    script: Script
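
As far as I can tell, bare string literals placed after an attribute are not picked up by Pydantic by default, so the descriptions above may never reach the model. A minimal sketch of an alternative, assuming Pydantic v2: put the text in `Field(description=...)` and inspect the generated schema to confirm it shows up. `CutToExample` is a throwaway class for illustration, not part of the schema above.

```python
from pydantic import BaseModel, Field


class CutToExample(BaseModel):
    """An instruction to cut to another scene."""

    # Field(description=...) is carried into the generated JSON schema
    cut_to: str = Field(
        description="Usually indicated by 'Cut to', 'Cut_to', 'CUT TO' or similar"
    )


# Pydantic v2 API; the class docstring becomes the schema-level description
schema = CutToExample.model_json_schema()
print(schema["description"])
print(schema["properties"]["cut_to"]["description"])
```

Printing `model_json_schema()` for the real classes is a quick way to see exactly what text the structured-output request would carry.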


In [10]:
from openai import OpenAI

client = OpenAI()

prompt = f"""
Convert the following screenplay in triple quotes 
into structured output. The screen play is in JSON format.
Deduce the various actors and roles involved and accurately 
convert the lines to be read by each actor or role. Pay attention to 
scene changes before, during or after an actor or role's 
lines

'''
{screen_play_json_string}
'''
"""

completion = client.beta.chat.completions.parse(
    model=OPENAI_MODEL,
    messages=[
        {"role": "system", "content": "You are a helpful screen writing assistant who does a thorough job. Do not hallucinate and perform the conversion without changing any lines"},
        {"role": "user", "content": prompt}
    ],
    response_format=ScreenplayJSON,
)

# Note that screen_json is the response message; screen_json.parsed is the
# ScreenplayJSON instance. Use .model_dump_json() on that to generate JSON.
screen_json = completion.choices[0].message

if not screen_json.parsed:
    print(screen_json.refusal)

ContentFilterFinishReasonError: Could not parse response content as the request was rejected by the content filter

In [9]:
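# model_dump_json() (Pydantic v2) already returns a JSON string, so write it
# directly; passing it through json.dump() would encode it a second time.
with open(SCRIPTS_STRUCTURED_DIR / (SCRIPT_FILE + ".parsed.json"), 'w') as out:
    out.write(screen_json.parsed.model_dump_json(indent=4))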


## What exactly triggered the OpenAI content filter refusal
 - [OpenAIs content filter blocking lyrics content for seemingly no reason](https://community.openai.com/t/openai-s-content-filter-blocking-lyrics-content-for-seemingly-no-reason/320483)
 - [OpenAI Platform Moderation Guide](https://platform.openai.com/docs/guides/moderation)

The quickstart from the [OpenAI Platform Moderation Guide](https://platform.openai.com/docs/guides/moderation) says that one should send content to the moderation end-points to classify text/images as harmful or not and in what ways.

The example they provide is:

```python
response = client.moderations.create(
    model="omni-moderation-latest",
    input="...text to classify goes here...",
)

print(response)
```

The forum thread above suggested that some `harassment` categories might be flagged, but eventually everyone agreed that this was most likely a copyright infringement issue. The screenplay might be similar, even though I am asking it to convert existing text instead of generating something new.

Best to not even go there I think. Will figure out workarounds.
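
If I did want to narrow down which part of the script trips the filter, one option would be to split the text into chunks and send each one to the moderation endpoint. Below is a minimal sketch of the chunking part. `chunk_lines`, the 4000-character default, and `script_text` are made up for illustration. Since there is no copyright category, every chunk may well come back clean.

```python
def chunk_lines(text: str, max_chars: int = 4000) -> list[str]:
    """Split text into chunks of whole lines, each at most max_chars long.

    A single line longer than max_chars is kept intact as its own chunk.
    """
    chunks, current, size = [], [], 0
    for line in text.splitlines(keepends=True):
        # Start a new chunk when adding this line would exceed the limit
        if current and size + len(line) > max_chars:
            chunks.append("".join(current))
            current, size = [], 0
        current.append(line)
        size += len(line)
    if current:
        chunks.append("".join(current))
    return chunks


# Toy data, not taken from the dataset
sample = "INT. CAVE - NIGHT\nALADDIN\nHello?\n" * 3
chunks = chunk_lines(sample, max_chars=40)
print(len(chunks))

# Hypothetical use with the moderation endpoint (needs network access and an API key):
# for i, chunk in enumerate(chunk_lines(script_text)):
#     result = client.moderations.create(
#         model="omni-moderation-latest", input=chunk
#     ).results[0]
#     if result.flagged:
#         print(i, result.categories)
```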

# OpenAI Parse Progress

Giving up on the LLM approach _(maybe a local LLM will do much better, who knows?)_. Working on [an ANTLR-based parser](./StructureScreenplay_Antlr.ipynb) to make progress.

# Attempt 5

TODO
 - ⬜ Research local LLM installation _(vLLM, Ollama, HF's TGI)_ and a decent model that will fit on the 4090 or 3090+4090.
 - ⬜ Install the LLM, hopefully with OpenAI-compatible endpoints
 - ⬜ Try this same thing on that.
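
For the OpenAI-compatible route, the existing client code should mostly carry over by pointing it at a local base URL. Below is a minimal sketch, assuming the servers run on their usual default ports. `LOCAL_ENDPOINTS` and `local_client_kwargs` are made-up names, and whether a given local server supports the structured-output `parse` call is something I would still need to verify.

```python
# ASSUMPTION: default local ports for each server's OpenAI-compatible API
LOCAL_ENDPOINTS = {
    "ollama": "http://localhost:11434/v1",
    "vllm": "http://localhost:8000/v1",
}


def local_client_kwargs(server: str, api_key: str = "not-needed") -> dict:
    """Build keyword arguments for openai.OpenAI pointing at a local server.

    Local servers typically ignore the key, but the client requires one to be set.
    """
    return {"base_url": LOCAL_ENDPOINTS[server], "api_key": api_key}


kwargs = local_client_kwargs("ollama")
print(kwargs["base_url"])

# Usage (assumes the server is already running locally):
# from openai import OpenAI
# client = OpenAI(**kwargs)
```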

Note that even after I get a parsed screenplay, I will still need an LLM for many things, so I will likely get to this soon:
 - build up a description for each character
 - summarize
 - etc.

# Attempt 4

Investigate the ContentFilter headaches.

Apparently the only way of doing this is to send the same input to the moderation endpoint. Even then you might not get the full answer, as there is no moderation flag for copyrighted content.

# Attempt 3

Changed the system prompt to 
```json
{"role": "system", "content": "You are a helpful screen writing assistant who does a thorough job. Do not hallucinate and perform the conversion without changing any lines"},
```

and now I get a *ContentFilterFinishReasonError: Could not parse response content as the request was rejected by the content filter*. Something about the kiss scene ??


# Attempt 2

> Took 2min 35s.
> gpt-4o-mini produced some serious hallucinations and also cut off the output. It made up some dialogue, so it is not a good conversion tool.

```python
prompt = f"""
Convert the following screenplay in triple quotes 
into structured output. The screen play is in JSON format.
Deduce the various actors and roles involved and accurately 
convert the lines to be read by each actor or role. Pay attention to 
scene changes before, during or after an actor or role's 
lines

'''
{screen_play_json_string}
'''
"""

completion = client.beta.chat.completions.parse(
    model=OPENAI_MODEL,
    messages=[
        {"role": "system", "content": "You are a helpful screen writing assistant who does a thorough job"},
        {"role": "user", "content": prompt}
    ],
    response_format=ScreenplayJSON,
)
```
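
The made-up dialogue could be caught mechanically rather than by eyeballing the output. Below is a minimal sketch, assuming the dialogue strings have already been pulled out of the structured result: flag any line whose whitespace-normalized text does not appear in the source. `normalize` and `invented_lines` are made-up helpers and the sample text is toy data. A line that the model legitimately split around a camera action would also be flagged, so treat hits as candidates to review.

```python
import re


def normalize(text: str) -> str:
    """Collapse whitespace and lowercase so line wrapping does not matter."""
    return re.sub(r"\s+", " ", text).strip().lower()


def invented_lines(source_text: str, dialogue_lines: list[str]) -> list[str]:
    """Return the dialogue lines that do not occur verbatim in the source."""
    haystack = normalize(source_text)
    return [
        line for line in dialogue_lines
        if normalize(line) and normalize(line) not in haystack
    ]


# Toy data, not taken from the dataset
source = "NARRATOR\n  It was a dark\n  and stormy night.\n"
lines = ["It was a dark and stormy night.", "The end is near."]
missing = invented_lines(source, lines)
print(missing)
```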