# Analyzing Sefaria Schema Files

> **Goal:** build a tidy summary of every book‐level schema contained in the [Sefaria export](https://github.com/Sefaria/Sefaria-Export) and save the result as a CSV for downstream pipelines.

---

## 1  Imports

In [1]:
import json
import os
import glob
import pathlib
import pandas as pd

We rely only on the Python standard library plus **pandas** for tabular work.

---

## 2  Locate the schema JSON files

In [2]:
ROOT = r"C:\Users\yotam\OneDrive\שולחן העבודה\PROJECTS\ChavrutAI\Sefaria-Export-master\schemas"
json_files = [fp for fp in glob.glob(os.path.join(ROOT, "*.json"))
              if os.path.getsize(fp) > 0        # skip empty files
              and os.path.basename(fp) != "Sheet.json"]


* `ROOT` points to the *schemas* directory inside the Sefaria export.
* `glob.glob(…)` collects the absolute paths of every `*.json` file; the comprehension then drops empty files and `Sheet.json`, leaving only the files we parse in the next step.
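
The same filter can be sketched with `pathlib`, which avoids the repeated `os.path` calls. This is only an illustration: `find_schema_files` and the file names in the demo are hypothetical and not part of the original notebook.

```python
from pathlib import Path
import tempfile

def find_schema_files(root, skip=("Sheet.json",)):
    """Return sorted paths of non-empty *.json files in root, minus skipped names."""
    return sorted(
        p for p in Path(root).glob("*.json")
        if p.stat().st_size > 0 and p.name not in skip
    )

# Demo on a throwaway directory with made-up files.
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    (tmp / "Genesis.json").write_text('{"title": "Genesis"}', encoding="utf-8")
    (tmp / "Empty.json").write_text("", encoding="utf-8")   # dropped: zero bytes
    (tmp / "Sheet.json").write_text("{}", encoding="utf-8")  # dropped: in skip list
    found_names = [p.name for p in find_schema_files(tmp)]

print(found_names)
```

Sorting the result makes the row order of the final table reproducible across runs and operating systems.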

---

## 3  Parse each schema file

In [74]:
def add_record(meta, subpath, lvl, sec_en, sec_he, addr, node_type):
    """Small helper so we don't repeat the same dict over and over."""
    records.append({
        "book_en"    : meta.get("title"),
        "book_he"    : meta.get("heTitle"),
        "domain"     : (meta.get("heCategories")
                        or meta.get("categories")
                        or ["Unknown"])[0],     # also covers an empty list
        "subpath"    : subpath,             # kept for future reference
        "level"      : lvl,                 # 1‑based
        "section_en" : sec_en,
        "section_he" : sec_he,
        "addressType": addr,
        "nodeType"   : node_type,
        "file"       : meta["_file"],
    })

def walk_nodes(node, meta, path_stack=None):
    """
    • Skips only the root node (heTitle == meta["heTitle"]).
    • subpath = '/'.join(path_stack), i.e. the parent path only.
    """
    if path_stack is None:
        path_stack = []

    node_type = node.get("nodeType") or "SchemaNode"
    he_title  = (node.get("heTitle") or "").strip()
    en_title  = (node.get("title")   or node.get("key", "")).strip() or None

    is_root   = (not path_stack) and he_title and \
                he_title == (meta.get("heTitle") or "").strip()
    is_jagged = node_type == "JaggedArrayNode"

    # ---------- 1) one record for every named node that is not the root ----------
    if he_title and not is_root:
        subpath_val = "/".join(path_stack) or he_title     # top-level nodes fall back to their own title
        add_record(
            meta      = meta,
            subpath   = subpath_val,
            lvl       = len(path_stack) + 1,
            sec_en    = en_title,
            sec_he    = he_title,
            addr      = (node.get("addressTypes") or [None])[0] if is_jagged else None,
            node_type = node_type,
        )

    # ---------- 2) prepare to descend into the children ----------
    next_stack = path_stack + ([he_title] if he_title and not is_root else [])

    # ---------- 3) JaggedArrayNode: inner depth levels ----------
    if is_jagged:
        depth   = node["depth"]
        sec_en  = node.get("sectionNames", [])
        sec_he  = node.get("heSectionNames", [])
        addr    = node.get("addressTypes", [])

        for i in range(depth):
            he_lbl = (sec_he[i] if i < len(sec_he) else "").strip()
            if not he_lbl:
                continue

            en_lbl = (sec_en[i] if i < len(sec_en) else "").strip() or None
            add_record(
                meta      = meta,
                subpath   = "/".join(next_stack),   # includes the named part, e.g. "Teshuvat HaRambam"
                lvl       = len(next_stack) + i + 1,
                sec_en    = en_lbl,
                sec_he    = he_lbl,
                addr      = addr[i] if i < len(addr) else None,
                node_type = node_type,
            )

    # ---------- 4) recursion ----------
    for child in node.get("nodes", []):
        walk_nodes(child, meta, next_stack)


In [75]:
records = []

for fp in json_files:                 # reuse the filtered list from step 2
    try:
        with open(fp, encoding="utf-8-sig") as f:
            meta = json.load(f)
    except (json.JSONDecodeError, UnicodeDecodeError):
        # skip files that are not valid JSON or not valid UTF-8
        continue

    meta["_file"] = os.path.basename(fp)
    walk_nodes(meta.get("schema", {}), meta)

### Why we flatten per‑level

Sefaria stores a **hierarchical** structure (e.g. *Chapter → Verse → Comment*). For data wrangling it is often easier to work with a **tidy table** where each row describes *exactly one level* of structure. The loop above therefore emits one `records` entry for each named section node and one for each depth level of every `JaggedArrayNode`.
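
To make the flattening concrete, here is a stripped-down sketch of the same idea on a toy schema. The `flatten` helper and `toy_schema` are illustrative assumptions: they keep only titles and section names, and they ignore the root check, Hebrew labels, and address types that `walk_nodes` handles.

```python
def flatten(node, path=()):
    """Yield (subpath, level, label) rows for a toy schema node."""
    title = node.get("title")
    if title:
        # A named node gets its own row; top-level nodes use their title as subpath.
        yield ("/".join(path) or title, len(path) + 1, title)
        path = path + (title,)
    # Each section name of a leaf array becomes one further level.
    for i, name in enumerate(node.get("sectionNames", [])):
        yield ("/".join(path), len(path) + i + 1, name)
    # Recurse into child nodes, carrying the extended path.
    for child in node.get("nodes", []):
        yield from flatten(child, path)

toy_schema = {
    "nodes": [
        {"title": "Introduction", "sectionNames": ["Paragraph"]},
        {"sectionNames": ["Chapter", "Verse", "Comment"]},
    ]
}

rows = list(flatten(toy_schema))
for row in rows:
    print(row)
```

The unnamed node yields three rows with an empty subpath and levels 1 to 3, while the named "Introduction" node yields a row for itself plus one for its single section level.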

---


## 4  Build a DataFrame

In [76]:
# 1. Build the tidy DataFrame
schema_df = pd.DataFrame(records)

# 2. (Optional) Reorder the columns for readability
col_order = ["book_en","book_he","domain","subpath","level",
             "section_en","section_he","addressType","nodeType","file"]
schema_df = schema_df[col_order]


At this point `schema_df` holds **one row per structural level** across the entire library.

---

## 5  Quick sanity checks

In [77]:
print(len(schema_df.book_en.unique()),"Books Loaded")

6541 Books Loaded


In [82]:
schema_df.head(10)

Unnamed: 0,book_en,book_he,domain,subpath,level,section_en,section_he,addressType,nodeType,file
0,Abarbanel on Amos,אברבנאל על עמוס,"תנ""ך",,1,Chapter,פרק,Perek,JaggedArrayNode,Abarbanel_on_Amos.json
1,Abarbanel on Amos,אברבנאל על עמוס,"תנ""ך",,2,Verse,פסוק,Pasuk,JaggedArrayNode,Abarbanel_on_Amos.json
2,Abarbanel on Amos,אברבנאל על עמוס,"תנ""ך",,3,Comment,פירוש,Integer,JaggedArrayNode,Abarbanel_on_Amos.json
3,Abarbanel on Ezekiel,אברבנאל על יחזקאל,"תנ""ך",הקדמה,1,Introduction,הקדמה,Integer,JaggedArrayNode,Abarbanel_on_Ezekiel.json
4,Abarbanel on Ezekiel,אברבנאל על יחזקאל,"תנ""ך",הקדמה,2,Paragraph,פסקה,Integer,JaggedArrayNode,Abarbanel_on_Ezekiel.json
5,Abarbanel on Ezekiel,אברבנאל על יחזקאל,"תנ""ך",,1,Chapter,פרק,Integer,JaggedArrayNode,Abarbanel_on_Ezekiel.json
6,Abarbanel on Ezekiel,אברבנאל על יחזקאל,"תנ""ך",,2,Verse,פסוק,Integer,JaggedArrayNode,Abarbanel_on_Ezekiel.json
7,Abarbanel on Ezekiel,אברבנאל על יחזקאל,"תנ""ך",,3,Comment,פירוש,Integer,JaggedArrayNode,Abarbanel_on_Ezekiel.json
8,Abarbanel on Guide for the Perplexed,אברבנאל על מורה נבוכים,מחשבת ישראל,הקדמה,1,Introduction,הקדמה,,SchemaNode,Abarbanel_on_Guide_for_the_Perplexed.json
9,Abarbanel on Guide for the Perplexed,אברבנאל על מורה נבוכים,מחשבת ישראל,הקדמה,2,Letter to R Joseph son of Judah,איגרת,Integer,JaggedArrayNode,Abarbanel_on_Guide_for_the_Perplexed.json


In [83]:
schema_df[schema_df.book_he == 'איגרות הרמב"ם']

Unnamed: 0,book_en,book_he,domain,subpath,level,section_en,section_he,addressType,nodeType,file
16424,Iggerot HaRambam,"איגרות הרמב""ם","שו""ת","חידושי הרמב""ם",1,Khiddushei HaRambam,"חידושי הרמב""ם",Integer,JaggedArrayNode,Iggerot_HaRambam.json
16425,Iggerot HaRambam,"איגרות הרמב""ם","שו""ת","חידושי הרמב""ם",2,Paragraph,פסקה,Integer,JaggedArrayNode,Iggerot_HaRambam.json
16426,Iggerot HaRambam,"איגרות הרמב""ם","שו""ת",איגרת תימן,1,Iggeret Teiman,איגרת תימן,Integer,JaggedArrayNode,Iggerot_HaRambam.json
16427,Iggerot HaRambam,"איגרות הרמב""ם","שו""ת",איגרת תימן,2,Paragraph,פסקה,Integer,JaggedArrayNode,Iggerot_HaRambam.json
16428,Iggerot HaRambam,"איגרות הרמב""ם","שו""ת",מאמר תחיית המתים,1,Maamar Tekhiyat HaMetim,מאמר תחיית המתים,Integer,JaggedArrayNode,Iggerot_HaRambam.json
16429,Iggerot HaRambam,"איגרות הרמב""ם","שו""ת",מאמר תחיית המתים,2,Paragraph,פסקה,Integer,JaggedArrayNode,Iggerot_HaRambam.json
16430,Iggerot HaRambam,"איגרות הרמב""ם","שו""ת",מאמר קידוש השם,1,Maamar Kiddush HaShem,מאמר קידוש השם,Integer,JaggedArrayNode,Iggerot_HaRambam.json
16431,Iggerot HaRambam,"איגרות הרמב""ם","שו""ת",מאמר קידוש השם,2,Paragraph,פסקה,Integer,JaggedArrayNode,Iggerot_HaRambam.json
16432,Iggerot HaRambam,"איגרות הרמב""ם","שו""ת","תשובת הרמב""ם",1,Teshuvat HaRambam,"תשובת הרמב""ם",Integer,JaggedArrayNode,Iggerot_HaRambam.json
16433,Iggerot HaRambam,"איגרות הרמב""ם","שו""ת","תשובת הרמב""ם",2,Paragraph,פסקה,Integer,JaggedArrayNode,Iggerot_HaRambam.json


In [84]:
schema_df.tail()

Unnamed: 0,book_en,book_he,domain,subpath,level,section_en,section_he,addressType,nodeType,file
57726,Zohar HaRakia,זוהר הרקיע,הלכה,מצוות לא תעשה,1,Negative Commandments,מצוות לא תעשה,Integer,JaggedArrayNode,Zohar_HaRakia.json
57727,Zohar HaRakia,זוהר הרקיע,הלכה,מצוות לא תעשה,2,Mitzvah,מצוה,Integer,JaggedArrayNode,Zohar_HaRakia.json
57728,Zohar HaRakia,זוהר הרקיע,הלכה,מצוות לא תעשה,3,Paragraph,פסקה,Integer,JaggedArrayNode,Zohar_HaRakia.json
57729,Zohar HaRakia,זוהר הרקיע,הלכה,חתימה,1,Postscript,חתימה,Integer,JaggedArrayNode,Zohar_HaRakia.json
57730,Zohar HaRakia,זוהר הרקיע,הלכה,חתימה,2,Paragraph,פסקה,Integer,JaggedArrayNode,Zohar_HaRakia.json


In [89]:
schema_df[schema_df.book_he=="דברים רבה"]

Unnamed: 0,book_en,book_he,domain,subpath,level,section_en,section_he,addressType,nodeType,file
10874,Devarim Rabbah,דברים רבה,מדרש,,1,Chapter,פרק,Perek,JaggedArrayNode,Devarim_Rabbah.json
10875,Devarim Rabbah,דברים רבה,מדרש,,2,Paragraph,פסקה,Integer,JaggedArrayNode,Devarim_Rabbah.json


Inspecting the **head** and **tail** confirms our flattening logic behaves as expected.
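
Eyeballing rows can be complemented by a few automated invariants. The sketch below is one possible check over the raw `records` list; `check_records` and the specific rules (positive integer level, non-empty `section_he`, a `.json` file name) are assumptions chosen for illustration, not checks taken from the original notebook.

```python
def check_records(records):
    """Return a list of problems found in flattened schema records."""
    problems = []
    for i, rec in enumerate(records):
        level = rec.get("level")
        if not isinstance(level, int) or level < 1:
            problems.append(f"row {i}: level must be an integer >= 1")
        if not rec.get("section_he"):
            problems.append(f"row {i}: section_he is empty")
        if not str(rec.get("file") or "").endswith(".json"):
            problems.append(f"row {i}: file should be a .json name")
    return problems

# Two rows shaped like the Devarim Rabbah output, plus one deliberately broken row.
good = [
    {"level": 1, "section_he": "פרק", "file": "Devarim_Rabbah.json"},
    {"level": 2, "section_he": "פסקה", "file": "Devarim_Rabbah.json"},
]
bad = [{"level": 0, "section_he": "", "file": "Devarim_Rabbah.json"}]

good_problems = check_records(good)
bad_problems = check_records(bad)
print(good_problems)
print(bad_problems)
```

Running `check_records(records)` right after the parsing loop would surface malformed rows before they reach the DataFrame.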

> **Note:** One file, `Sheet.json`, is skipped by the filter in step 2. It is a known edge case that describes user-created source sheets rather than a standard book schema.

---

## 6  Persist the result

In [87]:
# Optional: save for later pipelines
schema_df.to_csv("schema_summary.csv", index=False)

A comma‑separated file makes the data immediately usable by SQL engines, BI tools, or further Python/R scripts.
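
Because the table contains Hebrew text, the file encoding matters for some consumers. The sketch below uses only the standard `csv` module on two sample rows to show a round trip with the `utf-8-sig` codec, which prepends a byte-order mark that some spreadsheet tools use to detect UTF-8. The demo file name is hypothetical, and whether a BOM is needed depends on the downstream tool.

```python
import csv
import tempfile
from pathlib import Path

rows = [
    {"book_en": "Devarim Rabbah", "book_he": "דברים רבה", "level": 1},
    {"book_en": "Devarim Rabbah", "book_he": "דברים רבה", "level": 2},
]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "schema_summary_demo.csv"

    # Write with a BOM so BOM-aware tools can detect the encoding.
    with open(path, "w", newline="", encoding="utf-8-sig") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0]))
        writer.writeheader()
        writer.writerows(rows)

    has_bom = path.read_bytes().startswith(b"\xef\xbb\xbf")

    # Reading with the same codec strips the BOM again.
    with open(path, newline="", encoding="utf-8-sig") as f:
        restored = list(csv.DictReader(f))

print(has_bom)
print(restored[0])
```

Note that the `csv` module returns every field as a string, so `level` comes back as `"1"` rather than `1`; readers that need typed columns have to convert them explicitly.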

---