# Final Tree

I am organizing the repos this way and putting all the scripts under the new `data` repo.

```console
root/
├── hillops/
└── data/
```

Once all of this is done, the final tree looks like the one below, with `Index.md` listing every file along with its genres and year of release.

```console
data/
└── imsdb_scripts
    ├── index.md
    ├── json_screenplay
    │   └── Aladdin.gpt_mini_parsed.json
    └── raw_screenplay
        ├── 10_Things_I_Hate_About_You.json
        ├── 12.json
        ├── 127_Hours.json
        ├── 12_Monkeys.json
        ├── 12_Years_a_Slave.json
        ├── 12_and_Holding.json
        ...snip...
        ├── Yes_Man.json
        ├── You've_Got_Mail.json
        ├── You_Can_Count_On_Me.json
        ├── Youth_in_Revolt.json
        ├── Zero_Dark_Thirty.json
        ├── Zerophilia.json
        ├── Zootopia.json
        ├── eXistenZ.json
        └── xXx.json
```

# IMSDB Scripts

IMSDb is a huge repository of movie scripts. This Hugging Face user has published some [scripts](https://huggingface.co/datasets/mattismegevand/IMSDb) to download the entire IMSDb script collection.

 - His instructions say to clone the repo and run the scripts.
 - Hugging Face's generic cloning instructions use `git clone https://huggingface.co/<your-username>/<your-model-name>`.
 - In any case, the entire scraped set _(244MB)_ is available in [his datasets folder](https://huggingface.co/datasets/mattismegevand/IMSDb/tree/main) as `data.jsonl`. I am using that as a starting point to save time.
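
A minimal sketch of fetching `data.jsonl` directly instead of cloning. It assumes Hugging Face serves dataset files under a `/datasets/<repo>/resolve/<revision>/<file>` URL; the helper names `dataset_file_url` and `download_dataset` are mine, not part of any library.

```python
import shutil
import urllib.request
from pathlib import Path

# Assumption: dataset files are served at
# https://huggingface.co/datasets/<repo_id>/resolve/<revision>/<filename>
HF_BASE = "https://huggingface.co"


def dataset_file_url(repo_id: str, filename: str, revision: str = "main") -> str:
    """Build the direct-download URL for one file in a dataset repo."""
    return f"{HF_BASE}/datasets/{repo_id}/resolve/{revision}/{filename}"


def download_dataset(dest: Path,
                     repo_id: str = "mattismegevand/IMSDb",
                     filename: str = "data.jsonl") -> Path:
    """Stream the file to dest, skipping the download if dest already exists."""
    if dest.exists():
        return dest
    dest.parent.mkdir(parents=True, exist_ok=True)
    url = dataset_file_url(repo_id, filename)
    with urllib.request.urlopen(url) as resp, open(dest, "wb") as out:
        shutil.copyfileobj(resp, out)  # copy in chunks, not all in memory
    return dest


# Usage (this performs the full ~244MB download):
# download_dataset(Path("~/data.jsonl").expanduser())
```

Downloading by hand from the datasets folder works just as well; the function only saves a step when re-running the notebook on a new machine.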

# Split these out into one file per movie and store by title



In [23]:
import os
from pathlib import Path

REPO_ROOT = Path("~/bitbucket/").expanduser()

# Copy from https://huggingface.co/datasets/mattismegevand/IMSDb/tree/main as needed.
SCRIPTS_DATASET = Path("~/data.jsonl").expanduser()

# Copy into REPO_ROOT/data/imsdb_scripts/raw_screenplay 
# once satisfied
OUT_DIR = Path("~/scripts/").expanduser()

# Aiming for this type of structure.
# OUT_DIR
# ├── raw_screenplay
# │   ├── 10_Things_I_Hate_About_You.json
# │   ├── 127_Hours.json
# │   ├── ...snip...
# │   ├── Year_One.json
# │   ├── Yes_Man.json
# │   ├── ...snip...
# │   ├── Zerophilia.json
# │   └── Zootopia.json
# └── Index.md

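# The per-movie files are written into OUT_DIR/raw_screenplay later on,
# so create that subdirectory here (makedirs also creates OUT_DIR itself).
raw_dir = OUT_DIR / "raw_screenplay"
if not raw_dir.exists():
    print(f"Creating {raw_dir}")
    os.makedirs(raw_dir)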


In [5]:
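# pandas could load the JSONL too, but the standard json module is enough:
# import pandas as pd
# scripts_json_obj = pd.read_json(str(SCRIPTS_DATASET), lines=True)
import json
with open(SCRIPTS_DATASET, encoding="utf-8") as f:
    # One JSON object per line; skip any blank lines.
    data = [json.loads(line) for line in f if line.strip()]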

In [7]:
# What keys does each record have?
# IMSDb_opinion
# genres
# script_date
# IMSDb_rating
# poster
# submitted_by
# movie_release_date
# writers
# script
# average_user_rating
# title
for key in data[0].keys():
    print(key)

IMSDb_opinion
genres
script_date
IMSDb_rating
poster
submitted_by
movie_release_date
writers
script
average_user_rating
title


In [2]:
import re
from tqdm import tqdm

doWrite = False

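def canonicalize_title(origTitle: str) -> str:
    # Collapse whitespace runs to underscores, then strip one pair of
    # surrounding double quotes. Python backreferences are \1, not $1,
    # and the pattern needs a capture group for \1 to refer to.
    return re.sub(r'^"(.*)"$', r'\1',
                  re.sub(r'\s+', '_', origTitle)
                 )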

if doWrite:
    for script_json in tqdm(data):
        title_fname = canonicalize_title(script_json['title'])

        out_file = OUT_DIR / "raw_screenplay" / f"{title_fname}.json"
        with open(out_file, 'w') as out:
            json.dump(script_json, out, indent=4)

## Save a markdown index file to access all of these titles

I want to group by year and genre. I have seen examples of grouping by a single key, but here I want to group by year first _(or perhaps by decade)_ and then by genre.
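
A rough sketch of the two-level grouping with `itertools.groupby` and a composite `(decade, genre)` key. The sample records are made up for illustration, and the formats are assumptions: that `movie_release_date` contains a four-digit year somewhere, and that `genres` is a bracketed, comma-separated string. The helper names are hypothetical.

```python
import re
from itertools import groupby


def decade_of(release_date):
    """Return the decade (e.g. 1990) found in a date string, or None."""
    match = re.search(r"\b(1[89]\d{2}|20\d{2})\b", release_date or "")
    return (int(match.group(1)) // 10) * 10 if match else None


def parse_genres(serialized):
    """Turn "['Comedy', 'Romance']" (assumed format) into a list of names."""
    cleaned = re.sub(r"[\[\]'\"]", "", serialized or "")
    return [g.strip() for g in cleaned.split(",") if g.strip()]


def group_by_decade_and_genre(records):
    # One (decade, genre, title) row per genre, so a multi-genre film
    # shows up under each of its genres.
    rows = [
        (decade_of(r.get("movie_release_date")), genre, r["title"])
        for r in records
        for genre in (parse_genres(r.get("genres")) or ["Unknown"])
    ]
    # groupby only merges adjacent rows, so sort on the same composite key.
    # Rows without a decade (None) are pushed to the end.
    rows.sort(key=lambda row: (row[0] is None, row[0] or 0, row[1], row[2].lower()))
    grouped = {}
    for (decade, genre), items in groupby(rows, key=lambda row: (row[0], row[1])):
        grouped.setdefault(decade, {})[genre] = [title for _, _, title in items]
    return grouped


# Made-up sample records, not taken from the dataset.
sample = [
    {"title": "Aladdin", "genres": "['Animation', 'Family']", "movie_release_date": "November 1992"},
    {"title": "12 Monkeys", "genres": "['Sci-Fi']", "movie_release_date": "1995"},
    {"title": "Zerophilia", "genres": "['Comedy']", "movie_release_date": ""},
]
grouped = group_by_decade_and_genre(sample)
```

Records with no usable year end up under the `None` key, which fits the situation here where many entries have no release date.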

In [29]:
from itertools import groupby
from operator import itemgetter

# Thought I could sort by year, but it turns out many of them don't have years in them.
# Could hit an LLM to ask which year the movie was released and complete my dataset.
# For now, simply add all of these as metadata in the markdown file.
def round_year_to_decade(year):
    # Floor to the start of the decade (1999 -> 1990); round() would push 1999 to 2000.
    return (int(year) // 10) * 10

def json_to_md_link(d) -> str:
    title = d['title']
    file_name = canonicalize_title(title)
    genres = re.sub(r'[\[\]]', '', d['genres'])          # serialized array; drop the brackets.
    metadata = f"{genres} - {d['movie_release_date']}"
    return f" - [{title} - {metadata}](./raw_screenplay/{file_name}.json)"

doWrite = True
if doWrite:
    # Sort by the canonical title, case insensitive
    # and generate a markdown list (include the metadata in the title)
    # so a simple text search at least will work for now.
    data.sort(key=lambda x: canonicalize_title(x['title']).lower())
    md_script_list = [json_to_md_link(d) for d in data]

    # Generate the full markdown file and save it
    md = '''
# Scripts

The script titles are listed alphabetically and linked to the full script. Genres and year of release are listed where available.
''' + "\n".join(md_script_list)

    out_path = str(OUT_DIR / "Index.md")
    with open(out_path,  'w') as out:
        print(f"Writing {out_path}")
        out.write(md)

Writing /home/vamsi/scripts/Index.md
