# Create Penn Tree like feature for N1904-TF

**This feature is not yet fully stable**

## Table of contents (ToC)<a class="anchor" id="TOC"></a>

* <a href="#bullet1">1 - Introduction</a>
  * <a href="#bullet1x1">1.1 - Labels</a>
  * <a href="#bullet1x2">1.2 - Mapping MACULA to Penn Treebank (PTB) </a>
* <a href="#bullet2">2 - Building the trees</a>
  * <a href="#bullet2x1">2.1 - Load TF package with the N1904-TF dataset</a>
  * <a href="#bullet2x2">2.2 - Gather the data from the XML source files</a>
* <a href="#bullet3">3 - Verification</a>    
  * <a href="#bullet3x1">3.1 - Count opening and closing parentheses</a>
  * <a href="#bullet3x2">3.2 - Display the penn tree</a>
  * <a href="#bullet3x3">3.3 - Script to check tree consistency</a>
  * <a href="#bullet3x4">3.4 - Print a tree </a>
  * <a href="#bullet3x5">3.5 - Check if the trees can be read by nltk</a>
* <a href="#bullet4">4 - Create the additional TF feature</a>
  * <a href="#bullet4x1">4.1 - First load Text-Fabric</a> 
  * <a href="#bullet4x2">4.2 - Prepare metadata</a>
  * <a href="#bullet4x3">4.3 - Prepare featuredata</a>
  * <a href="#bullet4x4">4.4 - Link featuredata to metadata</a>
  * <a href="#bullet4x5">4.5 - Save the features to file</a>
  * <a href="#bullet4x6">4.6 - Move the files to the proper location</a>
* <a href="#bullet5">5 - Attribution and footnotes</a>
* <a href="#bullet6">6 - Notebook version</a>

# 1 - Introduction <a class="anchor" id="bullet1"></a>
##### [Back to ToC](#TOC)

This notebook creates a new Text-Fabric feature for the N1904-TF dataset. It creates syntax trees **akin to** the Penn Treebank (PTB) format.

In short, the PTB format is a bracketed tree in Lisp-style:

``` lisp
(PHRASE_LABEL (POS_TAG word) ...)
```
Each phrase contains:

- A label like `NP`, `VP`, `PP`, etc.
- One or more child nodes: other phrases or terminal words tagged with POS.

To my knowledge there is no standard format that can deal effectively with Koine Greek. This notebook may be updated to change the output format once new insights are obtained.
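
To make the bracketed notation concrete, here is a minimal sketch that renders a nested Python structure as a PTB-style string. The function name `to_bracketed` and the tuple layout are assumptions made for illustration; they are not part of the notebook's own code.

```python
def to_bracketed(node):
    """Render a (label, content) pair as a PTB-style bracketed string.

    A terminal is a (pos_tag, word) pair whose second item is a string;
    a phrase is a (label, [child, ...]) pair whose second item is a list.
    """
    label, content = node
    if isinstance(content, str):
        # Terminal: (POS_TAG word)
        return f"({label} {content})"
    # Phrase: (PHRASE_LABEL child child ...)
    return f"({label} " + " ".join(to_bracketed(c) for c in content) + ")"


# Illustrative English example, using the labels from the tables in section 1.1
example = ("S", [
    ("NP", [("DET", "the"), ("N", "man")]),
    ("VP", [("V", "saw"), ("NP", [("N", "fish")])]),
])
print(to_bracketed(example))
```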

## 1.1 - Labels <a class="anchor" id="bullet1x1"></a>

Penn tree phrase Labels (Syntactic Categories)

| Label  | Meaning                     | Example                  |
|--------|-----------------------------|--------------------------|
| `S`    | Sentence                    | `(S (NP ...) (VP ...))`  |
| `NP`   | Noun Phrase                 | `(NP (DET the) (N man))` |
| `VP`   | Verb Phrase                 | `(VP (V saw) (NP ...))`  |
| `PP`   | Prepositional Phrase        | `(PP (P with) (NP ...))` |
| `ADJP` | Adjective Phrase            | `(ADJP (ADJ great))`     |
| `ADVP` | Adverb Phrase               | `(ADVP (ADV quickly))`   |
| `SBAR` | Subordinate Clause          | `(SBAR (IN that) (S ...))` |
| `XP`   | Generic/unknown phrase      | Fallback label           |

The extended version of PTB lets you add grammatical roles as suffixes to phrases using a dash (`-`):

| Suffix   | Role                     | Example                      |
|----------|--------------------------|------------------------------|
| `-SBJ`   | Subject                  | `(NP-SBJ (DET the) (N man))` |
| `-OBJ`   | Object                   | `(NP-OBJ (N fish))`          |
| `-PRD`   | Predicate complement     | `(VP-PRD (ADJ good))`        |
| `-TMP`   | Temporal                 | `(NP-TMP (N yesterday))`     |
| `-LOC`   | Locative                 | `(PP-LOC (P in) (NP town))`  |
| `-ADV`   | Adverbial                | `(ADVP-ADV (ADV quickly))`   |
| `-DIR`   | Directional              | `(PP-DIR ...)`               |
| `-BNF`   | Beneficiary              | `(NP-BNF ...)`               |

These make the trees much more semantically informative.
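
A suffixed label can be taken apart again by splitting on the dash. The sketch below assumes labels of the simple form `CATEGORY-TAG[-TAG...]` as in the table above; the helper name `split_label` is hypothetical, and special PTB labels that begin with a dash (such as `-NONE-`) are not handled.

```python
def split_label(label):
    """Split a phrase label such as 'NP-SBJ' into ('NP', ['SBJ'])."""
    category, *tags = label.split("-")
    return category, tags


print(split_label("NP-SBJ"))
print(split_label("PP"))
```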


POS Tags (Part of Speech)

| Tag     | Meaning                |
|---------|------------------------|
| `N`     | Noun                   |
| `V`     | Verb                   |
| `ADJ`   | Adjective              |
| `ADV`   | Adverb                 |
| `DET`   | Determiner / Article   |
| `P`     | Preposition            |
| `PRON`  | Pronoun                |
| `CONJ`  | Conjunction            |
| `PART`  | Particle               |
| `INTJ`  | Interjection           |
| `X`     | Unknown / other        |


However, some of these tags cannot be defined using the MACULA dataset / N1904-TF.

## 1.2 - Mapping MACULA to Penn Treebank (PTB) <a class="anchor" id="bullet1x2"></a>

This table maps syntactic roles from the MACULA Greek New Testament treebank to Penn Treebank (PTB) labels and function tags.

| MACULA Role | PTB Phrase Label | PTB Function Tag | Description              | 
|-------------|------------------|------------------|--------------------------|
| `s`         | `NP`             | `-SBJ`           | Subject                  | 
| `o`         | `NP`             | `-OBJ`           | Object                   |
| `p`         | `PP`             | *(none)*         | Prepositional phrase     |
| `vc`        | `VP`             | *(none)*         | Verb (copula)            |
| `v`         | `VP`             | *(none)*         | Verb                     |
| `pred`      | `VP`             | `-PRD`           | Predicate complement     |
| `adv`       | `ADVP`           | `-ADV`           | Adverbial modifier       |
| `advl`      | `ADVP`           | `-ADV`           | Adverbial (alternative)  |
| *(other)*   | `XP`             | *(optional)*     | Unclassified phrase      |
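
The table can be expressed as a simple lookup. This is only a sketch of the mapping as tabulated here; the names `ROLE_TO_PTB` and `ptb_label` are illustrative assumptions, and the tree-building script in this notebook does not necessarily derive its labels this way.

```python
# MACULA role -> (PTB phrase label, PTB function tag), copied from the table
ROLE_TO_PTB = {
    "s": ("NP", "-SBJ"),
    "o": ("NP", "-OBJ"),
    "p": ("PP", ""),
    "vc": ("VP", ""),
    "v": ("VP", ""),
    "pred": ("VP", "-PRD"),
    "adv": ("ADVP", "-ADV"),
    "advl": ("ADVP", "-ADV"),
}


def ptb_label(role):
    """Return the combined PTB label for a MACULA role, or 'XP' as fallback."""
    key = (role or "").lower()
    phrase, tag = ROLE_TO_PTB.get(key, ("XP", ""))
    return phrase + tag


for role in ("s", "pred", "advl", "unknown"):
    print(role, "->", ptb_label(role))
```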


# 2 - Building the trees <a class="anchor" id="bullet2"></a>
##### [Back to ToC](#TOC)

In this notebook I build the trees from the XML source files, using the transformation tables above. The following Python script uses a stack to keep track of the sequence of tags in an XML file, and walks each file recursively. A recursive approach is often a good fit for tree-like data structures such as XML documents: it traverses the tree depth-first, which is exactly what we need to build the Penn Treebank-style parse tree.

In this method I base the Penn tree on the *sequence* of words found in the N1904-TF dataset. This is a major limitation, since we would like to get constituents: in the Greek text, certain postpositive conjunctions 'break in' on clauses.
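
The depth-first idea can be shown on a tiny, made-up XML fragment. The sketch below is a simplification: it uses the standard library's `xml.etree.ElementTree` rather than `lxml`, relies on the call stack instead of an explicit stack, and the `SAMPLE` fragment is invented for illustration (only the `sentence`, `wg` and `w` tags and the `class` attribute mirror what the notebook reads from the source files).

```python
import xml.etree.ElementTree as ET

# Invented sample; real MACULA files carry many more attributes.
SAMPLE = """
<sentence>
  <wg class="cl">
    <w class="noun">λόγος</w>
    <w class="verb">ἦν</w>
  </wg>
</sentence>
""".strip()


def to_ptb(el):
    """Depth-first walk: a bracket opens on entry and closes on exit."""
    if el.tag == "w":
        # Terminal node: (POS word)
        return f"({el.get('class', 'x').upper()} {el.text})"
    label = "S" if el.tag == "sentence" else el.get("class", "wg").upper()
    children = " ".join(to_ptb(child) for child in el)
    return f"({label} {children})"


root = ET.fromstring(SAMPLE)
print(to_ptb(root))
```

Because every element closes its own bracket before returning, the output is balanced by construction for well-formed XML.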

## 2.1 - Load TF package with the N1904-TF dataset <a class="anchor" id="bullet2x1"></a>

In [1]:
# Load the autoreload extension to automatically reload modules before executing code
%load_ext autoreload
%autoreload 2

In [2]:
from tf.fabric import Fabric
from tf.app import use

In [3]:
# load the N1904 app and data
N1904 = use ("CenterBLC/N1904", version="1.0.0", silence="terse", hoist=globals() )

**Locating corpus resources ...**

   |     0.81s T lemma_ttr            from ~/text-fabric-data/github/CenterBLC/N1904/tf/1.0.0
   |     0.79s T morph_ttr            from ~/text-fabric-data/github/CenterBLC/N1904/tf/1.0.0
   |     1.14s T num_words            from ~/text-fabric-data/github/CenterBLC/N1904/tf/1.0.0
   |     0.79s T text_ttr             from ~/text-fabric-data/github/CenterBLC/N1904/tf/1.0.0


Name,# of nodes,# slots / node,% coverage
book,27,5102.93,100
chapter,260,529.92,100
verse,7944,17.34,100
sentence,8011,17.2,100
group,8945,7.01,46
clause,42506,8.36,258
wg,106868,6.88,533
phrase,69007,1.9,95
subphrase,116178,1.6,135
word,137779,1.0,100


Display is setup for viewtype [syntax-view](https://github.com/saulocantanhede/tfgreek2/blob/main/docs/syntax-view.md#start)

See [here](https://github.com/saulocantanhede/tfgreek2/blob/main/docs/viewtypes.md#start) for more information on viewtypes

In [4]:
# The following will push the Text-Fabric stylesheet to this notebook (to facilitate proper display with notebook viewer)
N1904.dh(N1904.getCss())

## 2.2 - Gather the data from the XML source files <a class="anchor" id="bullet2x2"></a>

From file otype.tf, we can easily see:
```txt
1-137779	word
137780-137806	book
137807-138066	chapter
138067-180572	clause
180573-189517	group
189518-258524	phrase
258525-266535	sentence    <===
266536-382713	subphrase
382714-390657	verse
390658-497525	wg
```
This implies that the first sentence node is 258525. So we need to 'shift' the index of the dictionary `pennTreeDict`, which is created in the following code block. This is easily done by starting `sentenceCounter` at 258525.
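
The shift amounts to numbering the trees from the first sentence node instead of from zero. A minimal sketch, assuming the trees arrive as a list in corpus order; `index_by_node` is a hypothetical helper, not a function from the notebook or from Text-Fabric.

```python
FIRST_SENTENCE_NODE = 258525  # first sentence node according to otype.tf
LAST_SENTENCE_NODE = 266535   # last sentence node according to otype.tf


def index_by_node(trees, first_node=FIRST_SENTENCE_NODE):
    """Key each tree by its sentence node number, in corpus order."""
    return dict(enumerate(trees, start=first_node))


# Placeholder strings stand in for real tree strings.
demo = index_by_node(["(S ...)", "(S ...)", "(S ...)"])
print(list(demo))

# The node range spans 8011 nodes, matching the sentence count reported on load.
print(LAST_SENTENCE_NODE - FIRST_SENTENCE_NODE + 1)
```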

In [5]:
import os
from lxml import etree as ET
import unicodedata

# Define a function to normalize text to precomposed form
def normalizeToPrecomposed(s):
    return unicodedata.normalize('NFC', s.lower())

# Define a function to check for punctuation
def check_punctuation(after):
    for ch in [',', '.', '·', ';']:
        if ch in after:
            return True, ch
    return False, ''

# Define a function to map MACULA roles to PTB function tags
def phraseRole(role):
    if role is None:
        return ""
    role = role.lower()
    if role == "s":
        return "-SBJ"
    elif role == "o":
        return "-OBJ"
    elif role == "p":
        return ""  # ??
    elif role == "vc":
        return "" # ??
    elif role == "pred":
        return "-PRD"
    elif role in {"adv", "advl"}:
        return  "-ADV"
    else:
       return ''


class TagTracker:
    def __init__(self):
        self.stack = []
        self.penn_tree = ""
        self.word_id = 0
        self.finished_sentences = []

    def start_element(self, el):
        tag = el.tag
        if tag == 'w':
            self.word_id += 1
            word = normalizeToPrecomposed(el.text or "")
            morph = el.get('morph', '')
            role = el.get('role', '')
            sp = el.get('class', '').upper()
            after = el.get('after', '')

            if sp == 'VERB' and self.stack[-1][0] == 'wg' and self.stack[-1][1] is None:
                self.penn_tree += f" (VP-{role.upper()} ({sp} {word}) )"
            else:
                self.penn_tree += f"({sp} {word}) "

            has_punct, punct = check_punctuation(after)
            if has_punct:
                self.word_id += 1
                self.penn_tree += f"(PUNCT {punct}) "

        elif tag == 'sentence':
            self.penn_tree += "(S "
            self.stack.append(('sentence', None, el))
        elif tag == 'wg':
            rule = el.get('rule', 'wg')
            role = el.get('role', 'wg')
            type = el.get('class', 'wg').upper()
            if self.stack and self.stack[-1][0] == 'sentence':
                if rule == 'wg' and role == 'wg':
                    return
            label = phraseRole(role)
            self.penn_tree += f"({type}{label} "
            self.stack.append((tag, None, el))

    def end_element(self, el):
        if not self.stack:
            return
        tag, _, _ = self.stack[-1]
        if tag == el.tag:
            self.stack.pop()

            if tag == 'w':
                self.penn_tree += ") "
            elif tag == 'sentence':
                self.penn_tree += ")"
                self.penn_tree += "\n"  # End-of-sentence signal

                # Capture completed sentence
                self.finished_sentences.append(self.penn_tree.strip())
                self.penn_tree = ""  # Reset for next sentence
            else:
                self.penn_tree += ") "

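# NOTE: this second definition replaces the mapping-based phraseRole above.
# It is the one in effect when the main script runs, so every role
# (including the default 'wg') becomes a suffix such as -O, -P, -ADV or -WG.
def phraseRole(role):
    return f"-{role.upper()}" if role else ""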

# === Main script ===

inputXmlFilePath = r"C:\Users\tonyj\OneDrive\Documents\GitHub\REMA-Update-XML\XML-input"

pennTreeDict = {}
startIndex = 258525       # this is the number of  the first sentence node
sentenceCounter = startIndex  

fileNames = sorted(os.listdir(inputXmlFilePath))

for inputFileName in fileNames:
    if inputFileName.lower().endswith('.xml'):
        fullInputPath = os.path.join(inputXmlFilePath, inputFileName)

        tree = ET.parse(fullInputPath)
        root = tree.getroot()

        tracker = TagTracker()

        def recurse(el):
            tracker.start_element(el)
            for c in el:
                recurse(c)
            tracker.end_element(el)

        recurse(root)

        for treeStr in tracker.finished_sentences:
            pennTreeDict[sentenceCounter] = treeStr
            sentenceCounter += 1

# Preview the first few entries (starting at key 258525)
# If all is OK for now, it should start at Matthew 1:1
previewCount = 5  # How many entries to preview

for i in range(startIndex, startIndex + previewCount):
    if i in pennTreeDict:
        print(f"Sentence {i}:\n{pennTreeDict[i]}\n{'-'*40}")
    else:
        print(f"Sentence {i} not found in dictionary.\n{'-'*40}")

Sentence 258525:
(S (CL-WG (NP-P (NOUN βίβλος) (NP-WG (NOUN γενέσεως) (NP-WG (NP-WG (NP-WG (NOUN ἰησοῦ) (NOUN χριστοῦ) ) (NP-WG (NOUN υἱοῦ) (NOUN δαυεὶδ) ) ) (NP-WG (NOUN υἱοῦ) (NOUN ἀβραάμ) (PUNCT .) ) ) ) ) ) )
----------------------------------------
Sentence 258526:
(S (WG-WG (CL-WG (NOUN ἀβραὰμ)  (VP-V (VERB ἐγέννησεν) )(NP-O (DET τὸν) (NOUN ἰσαάκ) (PUNCT ,) ) ) (WG-WG (CONJ δὲ) (CL-WG (NOUN ἰσαὰκ)  (VP-V (VERB ἐγέννησεν) )(NP-O (DET τὸν) (NOUN ἰακώβ) (PUNCT ,) ) ) ) (WG-WG (CONJ δὲ) (CL-WG (NOUN ἰακὼβ)  (VP-V (VERB ἐγέννησεν) )(WG-O (NP-WG (DET τὸν) (NOUN ἰούδαν) ) (WG-WG (CONJ καὶ) (NP-WG (DET τοὺς) (NP-WG (NOUN ἀδελφοὺς) (PRON αὐτοῦ) (PUNCT ,) ) ) ) ) ) ) (WG-WG (CONJ δὲ) (CL-WG (NOUN ἰούδας)  (VP-V (VERB ἐγέννησεν) )(WG-O (NP-WG (DET τὸν) (NOUN φαρὲς) ) (WG-WG (CONJ καὶ) (NP-WG (DET τὸν) (NOUN ζαρὰ) ) ) ) (PP-ADV (PREP ἐκ) (NP-WG (DET τῆς) (NOUN θάμαρ) (PUNCT ,) ) ) ) ) (WG-WG (CONJ δὲ) (CL-WG (NOUN φαρὲς)  (VP-V (VERB ἐγέννησεν) )(NP-O (DET τὸν) (NOUN ἐσρώμ) (PUNCT ,) ) ) ) (

# 3 - Verification <a class="anchor" id="bullet3"></a>
##### [Back to ToC](#TOC)

## 3.1 - Count opening and closing parentheses <a class="anchor" id="bullet3x1"></a>

Here we can just throw in a tree and check. The number of opening and closing parentheses should match.

In [6]:
tree_str = """
( S ( CONJ3CL ( P-VC-S ( PP ( PREP ἐν )( NOUN ἀρχῇ ) )( VERB ἦν )( NP ( DET ὁ )( NOUN λόγος ) ( PUNCT , ) ) )( WG ( CONJ καὶ )( S-VC-P ( NP ( DET ὁ )( NOUN λόγος ) )( VERB ἦν )( PP ( PREP πρὸς )( NP ( DET τὸν )( NOUN θεόν ) ( PUNCT , ) ) ) ) )( WG ( CONJ καὶ )( P-VC-S ( NOUN θεὸς )( VERB ἦν )( NP ( DET ὁ )( NOUN λόγος ) ( PUNCT . ) ) ) ) ) )
"""

open_count = tree_str.count("(")
close_count = tree_str.count(")")
print("Opening parentheses:", open_count)
print("Closing parentheses:", close_count)

Opening parentheses: 33
Closing parentheses: 33
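
Equal counts are necessary but not sufficient: the string `)(` has one parenthesis of each kind yet is not a valid tree. A slightly stricter check tracks the nesting depth and rejects any point where it drops below zero. The function name `is_balanced` is an illustrative assumption.

```python
def is_balanced(tree_str):
    """True if every ')' closes an earlier '(' and nothing is left open."""
    depth = 0
    for ch in tree_str:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                # A closing parenthesis appeared before its opener.
                return False
    return depth == 0


print(is_balanced("(S (NP (NOUN λόγος)))"))
print(is_balanced(")("))
```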


## 3.2 - Display the penn tree <a class="anchor" id="bullet3x2"></a>

We can also display the tree in a nice format using nltk.

In [7]:
from nltk import Tree


ptbString = """

(S (WG (CL  (VP-V (VERB ἐγένετο) )(NP-SBJ (NOUN ἄνθρωπος) (PUNCT ,) (CL  (VP-V (VERB ἀπεσταλμένος) )(PP-ADV (PREP παρὰ) (NOUN θεοῦ) (PUNCT ,) ) ) ) ) (CL (NOUN ὄνομα) (PRON αὐτῷ) (NOUN ἰωάνης) (PUNCT ·) ) ) )

"""

tree = Tree.fromstring(ptbString)
tree.pretty_print()  # console-style print     

                                               S                                 
                                               |                                  
                                               WG                                
            ___________________________________|_________________                 
           CL                                                    |               
    _______|________________                                     |                
   |                      NP-SBJ                                 |               
   |        ________________|_____________                       |                
   |       |       |                      CL                     |               
   |       |       |         _____________|____                  |                
  VP-V     |       |       VP-V              PP-ADV              CL              
   |       |       |        |         _________|______       ____|____________    
  VERB    

## 3.3 - Script to check tree consistency <a class="anchor" id="bullet3x3"></a>

The following script proved helpful for figuring out where the tree-building logic went astray. I leave it in for anyone modifying this notebook, e.g., for another dataset.

In [8]:
import re

def annotate_parentheses(text):
    """
    Returns a new string where each '(' is annotated with "[open N]"
    and each ')' is annotated with "[close N]" where N is a counter.
    Uses a stack to match open and close parentheses.
    """
    stack = []
    counter = 1
    annotated = []
    
    for char in text:
        if char == "(":
            # For every open parenthesis, push the current counter
            stack.append(counter)
            annotated.append(f"( [open {counter}]")
            counter += 1
        elif char == ")":
            if stack:
                # Pop the matching open counter
                num = stack.pop()
                annotated.append(f")[close {num}]")
            else:
                # If there is no matching open, still annotate and flag it.
                annotated.append(f")[close {counter} (no matching open)]")
                counter += 1
        else:
            annotated.append(char)
    
    # If there are any unmatched open parentheses, add a warning at the end.
    if stack:
        annotated.append("\n[Warning: Unmatched open parentheses remain: " + ", ".join(map(str, stack)) + "]")
    
    return "".join(annotated)

def check_consistency(annotated_text):
    """
    Checks that every [open N] tag has a corresponding [close N] tag.
    Returns True if the numbers match; otherwise, it prints an error message and returns False.
    """
    open_tags = re.findall(r'\[open (\d+)\]', annotated_text)
    close_tags = re.findall(r'\[close (\d+)\]', annotated_text)

    if len(open_tags) != len(close_tags):
        print("The number of open and close tags are unequal")
        return False

    if set(open_tags) != set(close_tags):
        print("Mismatch in tag numbers between open and close tags.")
        return False

    return True

# the penn tree to validate:
input_text = """( S ( CL_P2CL ( NP_NPofNP ( N-NSF βίβλος )( NP_NPofNP ( N-GSF γενέσεως )( NP_Np-Appos ( NP_Np-Appos ( NP_Np-Appos ( N-GSM ἰησοῦ )( N-GSM χριστοῦ ) )( NP_NPofNP ( N-GSM υἱοῦ )( N-PRI δαυεὶδ ) ) )( NP_NPofNP ( N-GSM υἱοῦ )( N-PRI ἀβραάμ ) ( PUNCT . ) ) ) ) ) )"""

annotated = annotate_parentheses(input_text)
print("Annotated text:")
print(annotated)
print("\nConsistency Check:")
if check_consistency(annotated):
    print("The structure is consistent.")
else:
    print("The structure is inconsistent.")

Annotated text:
( [open 1] S ( [open 2] CL_P2CL ( [open 3] NP_NPofNP ( [open 4] N-NSF βίβλος )[close 4]( [open 5] NP_NPofNP ( [open 6] N-GSF γενέσεως )[close 6]( [open 7] NP_Np-Appos ( [open 8] NP_Np-Appos ( [open 9] NP_Np-Appos ( [open 10] N-GSM ἰησοῦ )[close 10]( [open 11] N-GSM χριστοῦ )[close 11] )[close 9]( [open 12] NP_NPofNP ( [open 13] N-GSM υἱοῦ )[close 13]( [open 14] N-PRI δαυεὶδ )[close 14] )[close 12] )[close 8]( [open 15] NP_NPofNP ( [open 16] N-GSM υἱοῦ )[close 16]( [open 17] N-PRI ἀβραάμ )[close 17] ( [open 18] PUNCT . )[close 18] )[close 15] )[close 7] )[close 5] )[close 3] )[close 2]

Consistency Check:
The number of open and close tags are unequal
The structure is inconsistent.


## 3.4 - Print a tree <a class="anchor" id="bullet3x4"></a>

A simple tree-printing script that does not require nltk. The `Node` class and all parsing/formatting logic are written from scratch. The function `parse_input()` handles basic parenthesis-based tree parsing, similar in spirit to `nltk.Tree.fromstring()`, but using its own logic.

In [9]:
class Node:
    def __init__(self, value):
        self.value = value
        self.children = []

def parse_input(input_text):
    """
    Parses an input string in a parenthesized format into a tree.
    Assumes that after each '(' the next token is the label.
    """
    tokens = input_text.replace('(', ' ( ').replace(')', ' ) ').split()
    stack = []
    root = None
    i = 0
    while i < len(tokens):
        token = tokens[i]
        if token == '(':
            # Next token is the node’s label.
            i += 1
            if i < len(tokens):
                label = tokens[i]
                new_node = Node(label)
                if stack:
                    stack[-1].children.append(new_node)
                else:
                    root = new_node
                stack.append(new_node)
        elif token == ')':
            if stack:
                stack.pop()
        else:
            # Any token that is not a parenthesis is a leaf.
            leaf = Node(token)
            if stack:
                stack[-1].children.append(leaf)
            else:
                root = leaf
        i += 1
    return root

def is_simple(node):
    """
    Returns True if the node is a leaf or a 'simple' node—that is, if all its children
    are leaves. (This covers typical preterminal nodes like (POS word).)
    """
    if not node.children:
        return True
    return all(not child.children for child in node.children)

def format_tree(node, indent=0):
    """
    Recursively returns a formatted string for the tree.
    
    - If the node is simple (its children are leaves), the node is printed on one line.
    - Otherwise, complex children are printed on their own indented lines.
    - Additionally, consecutive simple children are grouped on one line.
    """
    sp = "  " * indent

    # If this is a leaf, just return its value.
    if not node.children:
        return node.value

    # If all children are leaves, print everything on one line.
    if is_simple(node):
        children_str = " ".join(child.value for child in node.children)
        return f"{sp}({node.value} {children_str})"

    # Otherwise, prepare to build multiple lines.
    lines = [f"{sp}({node.value}"]
    # We'll group consecutive simple (i.e. one‑line) children together.
    simple_group = []

    def flush_group():
        nonlocal simple_group
        if simple_group:
            # Group them on one line (with one extra indent level)
            group_line = "  " * (indent+1) + " ".join(simple_group)
            lines.append(group_line)
            simple_group = []

    for child in node.children:
        if is_simple(child):
            # Format the simple child on one line (ignoring current indent).
            simple_group.append(format_tree(child, 0).strip())
        else:
            flush_group()
            # For a complex child, add its formatted version (with increased indent).
            lines.append(format_tree(child, indent+1))
    flush_group()
    lines.append(f"{sp})")
    return "\n".join(lines)

def main():
    input_text = """
    
(S (WG (CL (PP (PREP ἐν) (NOUN ἀρχῇ) ) (VERB ἦν) (NP (DET ὁ) (NOUN λόγος) (PUNCT ,) ) ) (WG (CONJ καὶ) (CL (NP (DET ὁ) (NOUN λόγος) ) (VERB ἦν) (PP (PREP πρὸς) (NP (DET τὸν) (NOUN θεόν) (PUNCT ,) ) ) ) ) (WG (CONJ καὶ) (CL (NOUN θεὸς) (VERB ἦν) (NP (DET ὁ) (NOUN λόγος) (PUNCT .) ) ) ) ) )

"""
    tree = parse_input(input_text)
    print(format_tree(tree))

if __name__ == "__main__":
    main()


(S
  (WG
    (CL
      (PP
        (PREP ἐν) (NOUN ἀρχῇ)
      )
      (VERB ἦν)
      (NP
        (DET ὁ) (NOUN λόγος) (PUNCT ,)
      )
    )
    (WG
      (CONJ καὶ)
      (CL
        (NP
          (DET ὁ) (NOUN λόγος)
        )
        (VERB ἦν)
        (PP
          (PREP πρὸς)
          (NP
            (DET τὸν) (NOUN θεόν) (PUNCT ,)
          )
        )
      )
    )
    (WG
      (CONJ καὶ)
      (CL
        (NOUN θεὸς) (VERB ἦν)
        (NP
          (DET ὁ) (NOUN λόγος) (PUNCT .)
        )
      )
    )
  )
)


## 3.5 - Check if the trees can be read by nltk <a class="anchor" id="bullet3x5"></a>

In [10]:
import os
from nltk import Tree

# Define the full path to the file containing the tree strings.
outputFilePath = r"C:\Users\tonyj\OneDrive\Documents\GitHub\REMA-Update-XML\grammar\allpenntrees.txt"

# Read the content of the file.
with open(outputFilePath, "r", encoding="utf-8") as file:
    production = file.read()

# Split the content into lines.
lines = production.splitlines()
tree_strings = []
current_tree = []
trees=[]

for line in lines:
    # Check if the line is not empty (after stripping whitespace).
    if line.strip():
        current_tree.append(line)
    else:
        # If current_tree has content, join it and add to tree_strings.
        if current_tree:
            tree_strings.append('\n'.join(current_tree))
            current_tree = []

# Add the last tree if any lines remain.
if current_tree:
    tree_strings.append('\n'.join(current_tree))

for i, tree_str in enumerate(tree_strings):
    try:
        tree = Tree.fromstring(tree_str)
        trees.append(tree)
    except ValueError as e:
        print(f"Error parsing tree #{i}:")
        print(tree_str)
        print("Error:", e)
        print("-" * 40)

print(f"Successfully parsed {len(trees)} trees.\n")

# print first tree
trees[0].pretty_print()

Successfully parsed 8011 trees.

                    S                                                                                          
                    |                                                                                           
                 CL_P2CL                                                                                       
                    |                                                                                           
                NP_NPofNP                                                                                      
   _________________|_________________________________________                                                  
  |                                                       NP_NPofNP                                            
  |        ___________________________________________________|____________________                             
  |       |                                                        

# 4 - Create the additional TF feature <a class="anchor" id="bullet4"></a>
##### [Back to ToC](#TOC)

## 4.1 - First load Text-Fabric <a class="anchor" id="bullet4x1"></a>

In [11]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [12]:
# Loading the Text-Fabric code
from tf.fabric import Fabric
from tf.app import use

In this notebook we initially only need the N1904-TF base feature set.

In [13]:
# Load the N1904-TF app and data with the additional features
A = use ("CenterBLC/N1904", version="1.0.0", silence="terse", hoist=globals())

**Locating corpus resources ...**

Name,# of nodes,# slots / node,% coverage
book,27,5102.93,100
chapter,260,529.92,100
verse,7944,17.34,100
sentence,8011,17.2,100
group,8945,7.01,46
clause,42506,8.36,258
wg,106868,6.88,533
phrase,69007,1.9,95
subphrase,116178,1.6,135
word,137779,1.0,100


Display is setup for viewtype [syntax-view](https://github.com/saulocantanhede/tfgreek2/blob/main/docs/syntax-view.md#start)

See [here](https://github.com/saulocantanhede/tfgreek2/blob/main/docs/viewtypes.md#start) for more information on viewtypes

## 4.2 - Prepare metadata <a class="anchor" id="bullet4x2"></a>

As usual, start with a helper function to easily create Metadata for multiple features in one go.

In [14]:
# Common Text-Fabric metadata template function
def createMetadata(description,type):
    return {
        'author': 'OSIS initiative',
        'convertedBy': 'Tony Jurg',
        'website': 'https://github.com/tonyjurg/N1904addons', 
        'description': description,
        'coreData': 'Nestle 1904 Text-Fabric (centerBLC)',
        'coreDataUrl': 'https://github.com/CenterBLC/N1904',
        'provenance': 'jupyter Notebook (https://github.com/tonyjurg/create_TF_feature_betacode)',
        'version': '1.0.0',   # This is the version of the N1904-TF dataset against which this feature is built!
        'license': 'Creative Commons Attribution 4.0 International (CC BY 4.0)',
        'licenseUrl': 'https://github.com/tonyjurg/N1904addons/blob/main/LICENSE.md',
        'valueType': type
    }


Define the specifics for the feature we are going to add.

In [15]:
# Create metadata for the penntree feature using the createMetadata function
penntree_Metadata  = createMetadata('Penn Tree like syntax tree of a sentence.','str')

In [16]:
# quick check
penntree_Metadata

{'author': 'OSIS initiative',
 'convertedBy': 'Tony Jurg',
 'website': 'https://github.com/tonyjurg/N1904addons',
 'description': 'Penn Tree like syntax tree of a sentence.',
 'coreData': 'Nestle 1904 Text-Fabric (centerBLC)',
 'coreDataUrl': 'https://github.com/CenterBLC/N1904',
 'provenance': 'jupyter Notebook (https://github.com/tonyjurg/create_TF_feature_betacode)',
 'version': '1.0.0',
 'license': 'Creative Commons Attribution 4.0 International (CC BY 4.0)',
 'licenseUrl': 'https://github.com/tonyjurg/N1904addons/blob/main/LICENSE.md',
 'valueType': 'str'}

## 4.3 - Prepare feature data <a class="anchor" id="bullet4x3"></a>

Now we are going to create the real data.

This was already done in the section 'Gather the data from the XML source files'. The dictionary created there starts at key 258525, which is the number of the first sentence node. So we can just link the dictionary `pennTreeDict` in the next step.
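
Before linking, it may be worth confirming that the dictionary covers exactly the sentence nodes. The following is a minimal sketch under the assumption that sentence nodes form one contiguous range; `check_feature_keys` is a hypothetical helper and not part of Text-Fabric.

```python
def check_feature_keys(node_data, first_node, expected_count):
    """Return a list of problems found in a {node: value} feature dict."""
    problems = []
    expected = range(first_node, first_node + expected_count)
    missing = [n for n in expected if n not in node_data]
    extra = [n for n in node_data if n not in expected]
    if missing:
        problems.append(f"{len(missing)} sentence nodes have no tree")
    if extra:
        problems.append(f"{len(extra)} keys fall outside the sentence range")
    return problems


# Toy data: three expected nodes, one missing and one out of range.
toy = {100: "(S ...)", 101: "(S ...)", 250: "(S ...)"}
print(check_feature_keys(toy, first_node=100, expected_count=3))
```

For this notebook the call would be `check_feature_keys(pennTreeDict, 258525, 8011)`, where an empty list means the keys line up with the sentence nodes.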

## 4.4 - Link featuredata to metadata<a class="anchor" id="bullet4x4"></a>

Now we give the new feature its real TF name, and connect this name with the names of the related data dictionary and metadata dictionary.

In [17]:
metadata ={ 
    'penntree':      penntree_Metadata,
}

In [18]:
nodedata = {
    'penntree':     pennTreeDict,
}

## 4.5 - Save the feature to files<a class="anchor" id="bullet4x5"></a>

Now we save the new feature to its own `.tf` file.

If you don’t pass an explicit target path, `TF.save()` writes the file to the directory that already contains the loaded corpus—in this case the local on‑disk copy of the N1904 Text‑Fabric dataset.

In [19]:
TF.save(nodeFeatures=nodedata, metaData=metadata)  # silent="terse"

  0.00s Exporting 1 node and 0 edge and 0 configuration features to ~/text-fabric-data/github/CenterBLC/N1904/tf/1.0.0:
   |     0.05s T penntree             to ~/text-fabric-data/github/CenterBLC/N1904/tf/1.0.0
  0.05s Exported 1 node features and 0 edge features and 0 config features to ~/text-fabric-data/github/CenterBLC/N1904/tf/1.0.0


True

## 4.6 - Move the files to the proper location<a class="anchor" id="bullet4x6"></a>

The last step is to move the files to their final location.

In this case that will be at [the N1904addons repo](https://github.com/tonyjurg/N1904addons/tree/main/tf/1.0.0).

# 5 - Attribution and footnotes <a class="anchor" id="bullet5"></a>
##### [Back to ToC](#TOC)

The following resources were consulted for creating this notebook:

- [Penn Treebank Tag Guidelines PDF](https://catalog.ldc.upenn.edu/docs/LDC99T42/tagguid1.pdf)
- [NLP @ Univ. of Pennsylvania](https://alliance.seas.upenn.edu/~nlp) (note: the original homepage of the Penn Treebank project is gone.)
- [Building a Large Annotated Corpus of English: The Penn Treebank ](https://alliance.seas.upenn.edu/~nlp/publications/pdf/marcus1993.pdf)
- [The Penn treebank: an overview](https://www.researchgate.net/publication/2873803_The_Penn_Treebank_An_overview)

The source XML data (part of the MACULA Greek Linguistic Datasets) is licensed under [CC BY 4.0](https://github.com/Clear-Bible/macula-greek/blob/main/LICENSE.md) and available on [github.com/Clear-Bible/macula-greek](https://github.com/Clear-Bible/macula-greek/tree/main).

The Text-Fabric features created in this notebook were added to the dataset published at [tonyjurg.github.io/N1904addons](https://tonyjurg.github.io/N1904addons/) and made available under the [Creative Commons Attribution 4.0 International (CC BY 4.0)](https://github.com/tonyjurg/N1904addons/blob/main/LICENSE.md) license.

The [Anaconda Assistant](https://www.anaconda.com/capability/anaconda-assistant) (using [OpenAI](https://openai.com/) as backend) was used to debug and/or optimize parts of the code in this Jupyter Notebook.

This Jupyter notebook is released under the [Creative Commons Attribution 4.0 International (CC BY 4.0)](https://github.com/tonyjurg/Create_penntree_feature_for_TF/blob/main/LICENSE.md).

# 6 - Notebook version <a class="anchor" id="bullet6"></a>
##### [Back to ToC](#TOC)

<div style="float: left;">
  <table>
    <tr>
      <td><strong>Author</strong></td>
      <td>Tony Jurg</td>
    </tr>
    <tr>
      <td><strong>Version</strong></td>
      <td>1.0</td>
    </tr>
    <tr>
      <td><strong>Date</strong></td>
      <td>23 May 2025</td>
    </tr>
  </table>
</div>