# [Doc4TF](https://github.com/tonyjurg/Doc4TF)
#### *automatic creation of feature documentation for existing Text-Fabric datasets*

## Table of contents <a class="anchor" id="TOC"></a>
* <a href="#bullet1">1 - Introduction</a>
* <a href="#bullet2">2 - Setting up the environment</a>
* <a href="#bullet3">3 - Load Text-Fabric data</a>
* <a href="#bullet4">4 - Creation of the dataset</a>
    * <a href="#bullet4x1">4.1 - Setting up some production values</a>
    * <a href="#bullet4x2">4.2 - Store data in dictionaries</a>
         * <a href="#bullet4x2x1">4.2.1 - Get node types and their node ranges</a>
         * <a href="#bullet4x2x2">4.2.2 - Determine which node types have specific features</a>
         * <a href="#bullet4x2x3">4.2.3 - Create dictionary with description and value frequency per feature</a>
* <a href="#bullet5">5 - Create the pages</a>
    * <a href="#bullet5x1">5.1 - Create set of feature pages</a>
    * <a href="#bullet5x2">5.2 - Create overview page</a>
* <a href="#bullet6">6 - Licence</a>

# 1 - Introduction <a class="anchor" id="bullet1"></a>
##### [Back to TOC](#TOC)

Ideally, a comprehensive documentation set is created as part of developing a Text-Fabric dataset. In practice, however, this is not always done during the initial phase, nor kept up to date after features change. This Jupyter Notebook contains Python code that automatically generates a documentation set for any [Text-Fabric](https://github.com/annotation/text-fabric) dataset, which also keeps the documentation consistent with the data. It serves as a robust starting point for developing a brand new documentation set, or as validation for an existing one. One major advantage is that the resulting documentation set is fully hyperlinked, a task that can be laborious if done manually.

The main steps in producing the documentation set are:
* Load a Text-Fabric database
* Execute the code present in the subsequent cells. The code will:
   * construct a few Python dictionaries with relevant data from the TF dataset
   * create a separate file for each feature
   * create an overview page of all features per node type

# 2 - Setting up the environment<a class="anchor" id="bullet2"></a>
##### [Back to TOC](#TOC)

Your environment must include the Python package `Text-Fabric`. The current implementation of the script also requires the Python package `markdown2`, which can be installed with `pip` if it is not yet present. (Note: this dependency may be removed in a future version.)

In [68]:
!pip install markdown2

Collecting markdown2
  Downloading markdown2-2.4.11-py2.py3-none-any.whl.metadata (2.0 kB)
Downloading markdown2-2.4.11-py2.py3-none-any.whl (41 kB)
Installing collected packages: markdown2
Successfully installed markdown2-2.4.11
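
To avoid running the install on every execution, the notebook could first check whether the package can be imported. The following is a minimal sketch that uses only the standard library; the helper name `is_installed` is hypothetical and not part of Doc4TF.

```python
import importlib.util

def is_installed(package_name: str) -> bool:
    """Return True if a top-level module with this name can be located."""
    return importlib.util.find_spec(package_name) is not None

# Only point to the pip command when markdown2 cannot be found
if not is_installed("markdown2"):
    print("markdown2 is missing; install it with: pip install markdown2")
```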


# 3 - Load Text-Fabric data <a class="anchor" id="bullet3"></a>
##### [Back to TOC](#TOC)

At this stage the Text-Fabric dataset that will be documented is loaded. See the documentation for the function [`use`](https://annotation.github.io/text-fabric/tf/app.html#tf.app.use) for the various options regarding storage locations.

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
# Loading the Text-Fabric code
# Note: it is assumed Text-Fabric is installed in your environment
from tf.fabric import Fabric
from tf.app import use

In [3]:
# load the N1904 app and data
N1904 = use("saulocantanhede/tfgreek2", version='0.5.2', hoist=globals())

**Locating corpus resources ...**

Name|# of nodes|# slots / node|% coverage
---|---|---|---
book|27|5102.93|100
chapter|260|529.92|100
verse|7944|17.34|100
sentence|18609|13.84|187
group|8964|7.02|46
clause|30479|7.19|159
wg|106868|6.88|533
phrase|113750|2.29|189
subphrase|72845|1.0|53
word|137779|1.0|100


# 4 - Creation of the dataset<a class="anchor" id="bullet4"></a>

## 4.1 - Setting up some production values<a class="anchor" id="bullet4x1"></a>
##### [Back to TOC](#TOC)

In [6]:
# set the title for all pages (indicating the dataset the documentation is describing)
pageTitle="N1904 Greek New Testament Text-Fabric dataset [saulocantanhede/tfgreek2 - 0.5.2](https://github.com/saulocantanhede/tfgreek2)"

# location to store the resulting files: a directory relative to the location of the notebook (no trailing slash)
resultLocation = "results"

# Set verbose to True to print the dictionaries; set it to False to mute that output
verbose=True

## 4.2 - Store data in dictionaries<a class="anchor" id="bullet4x2"></a>

### 4.2.1 - Get node types and their node ranges<a class="anchor" id="bullet4x2x1"></a>
##### [Back to TOC](#TOC)

The following creates a dictionary that maps each node type to its node range (the first and last node number of that type).

In [7]:
# Initialize an empty dictionary
nodeDict = {}

# Iterate over C.levels.data
for item in C.levels.data:
    node,_,start,end = item
    # Create empty node list
    nodeDict[node] = []
    # Append the tuple (start, end) to the node's list
    nodeDict[node].append((start, end))
    
# Print resulting dictionary depending on setting 'verbose' 
if verbose: print(nodeDict)
print('finished')

{'book': [(137780, 137806)], 'chapter': [(137807, 138066)], 'verse': [(382714, 390657)], 'sentence': [(291260, 309868)], 'group': [(168546, 177509)], 'clause': [(138067, 168545)], 'wg': [(390658, 497525)], 'phrase': [(177510, 291259)], 'subphrase': [(309869, 382713)], 'word': [(1, 137779)]}
finished


Alternatively (with an identical result):

In [8]:
# Initialize an empty dictionary
nodeDict = {}
# Iterate over node types
for NodeType in F.otype.all:
    nodeDict[NodeType] = []
    start, end = F.otype.sInterval(NodeType)
    # Append the tuple (start, end) to the node's list
    nodeDict[NodeType].append((start, end))
    
# Print resulting dictionary depending on setting 'verbose' 
if verbose: print(nodeDict)
print('finished')

{'book': [(137780, 137806)], 'chapter': [(137807, 138066)], 'verse': [(382714, 390657)], 'sentence': [(291260, 309868)], 'group': [(168546, 177509)], 'clause': [(138067, 168545)], 'wg': [(390658, 497525)], 'phrase': [(177510, 291259)], 'subphrase': [(309869, 382713)], 'word': [(1, 137779)]}
finished
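
Once the ranges are stored, they can be used without further calls to Text-Fabric. The sketch below shows two such uses: finding the node type of a node number and counting the nodes of a type. The sample data is a subset of the output above; the helper names `node_type_of` and `node_count` are hypothetical.

```python
# Subset of the nodeDict printed above
nodeDict = {
    'book': [(137780, 137806)],
    'chapter': [(137807, 138066)],
    'word': [(1, 137779)],
}

def node_type_of(node, node_ranges):
    """Return the node type whose (start, end) range contains node, or None."""
    for node_type, ranges in node_ranges.items():
        for start, end in ranges:
            if start <= node <= end:
                return node_type
    return None

def node_count(node_type, node_ranges):
    """Return the number of nodes of a type (ranges are inclusive)."""
    return sum(end - start + 1 for start, end in node_ranges[node_type])

print(node_type_of(137780, nodeDict))  # book
print(node_count('book', nodeDict))    # 27
```

The counts derived this way agree with the "# of nodes" column shown when the corpus was loaded (27 books, 260 chapters).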


### 4.2.2 - Determine which node types have specific features<a class="anchor" id="bullet4x2x2"></a>
##### [Back to TOC](#TOC)

The following creates a dictionary that lists, for each feature, the node types that have values for that feature.

In [9]:
# Initialize an empty dictionary
featureDict = {}
# Iterate over Fall(), all features
for item in Fall():
    # Use a set to store the unique node types that have a value for this feature
    featureDict[item] = set()
    for node, content in Fs(item).items():
        featureDict[item].add(F.otype.v(node))
        
# Print the resulting dictionary depending on setting 'verbose' 
if verbose: print(featureDict)
print('finished')

{'after': {'word', 'subphrase', 'phrase'}, 'appositioncontainer': {'wg', 'phrase'}, 'articular': {'sentence', 'group', 'wg', 'clause', 'phrase'}, 'before': {'word', 'subphrase', 'phrase'}, 'book': {'verse', 'group', 'sentence', 'wg', 'clause', 'book', 'chapter', 'word'}, 'book_short': {'word', 'book'}, 'case': {'word', 'subphrase', 'phrase'}, 'chapter': {'verse', 'word', 'chapter'}, 'clauseType': {'wg', 'clause', 'sentence'}, 'cls': {'sentence', 'wg', 'clause', 'word', 'subphrase', 'phrase'}, 'cltype': {'wg', 'clause', 'sentence'}, 'criticalsign': {'word', 'subphrase', 'phrase'}, 'crule': {'wg', 'clause', 'sentence'}, 'degree': {'word', 'subphrase', 'phrase'}, 'discontinuous': {'word', 'subphrase', 'phrase'}, 'domain': {'word', 'subphrase', 'phrase'}, 'framespec': {'word', 'subphrase', 'phrase'}, 'function': {'wg', 'word', 'phrase'}, 'gender': {'word', 'subphrase', 'phrase'}, 'gloss': {'word', 'subphrase', 'phrase'}, 'id': {'word', 'subphrase', 'phrase'}, 'junction': {'wg', 'clause', '
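
The overview page in section 5.2 needs the reverse view: the features available for each node type. As an illustration of that inversion, here is a minimal sketch on a small subset of the dictionary above; the function name `features_by_node_type` is hypothetical.

```python
from collections import defaultdict

# Subset of the featureDict printed above
featureDict = {
    'after': {'word', 'subphrase', 'phrase'},
    'appositioncontainer': {'wg', 'phrase'},
    'book_short': {'word', 'book'},
}

def features_by_node_type(feature_dict):
    """Invert feature -> node types into node type -> sorted list of features."""
    inverted = defaultdict(list)
    # Visiting the features in sorted order keeps every resulting list sorted
    for feature in sorted(feature_dict):
        for node_type in feature_dict[feature]:
            inverted[node_type].append(feature)
    return dict(inverted)

nodeTypeDict = features_by_node_type(featureDict)
print(nodeTypeDict['phrase'])  # ['after', 'appositioncontainer']
```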

### 4.2.3 - Create dictionary with description and value frequency per feature<a class="anchor" id="bullet4x2x3"></a>
##### [Back to TOC](#TOC)

The following creates a dictionary with, per feature, the description and data type (taken from the metadata) and the most frequent values (at most ten).

In [10]:
# Initialize an empty dictionary
featureMetaDict = {}
# Iterate over Fall(), all features
for item in Fall():
    featureMetaDict[item] = []
    featureMetaData=Fs(item).meta
    # Check if 'description' key exists in the meta dictionary
    if 'description' in featureMetaData:
        featureDescription = featureMetaData['description']
    else:
        featureDescription = "No feature description"
        
    # Check if 'valueType' key exists in the meta dictionary    
    if 'valueType' in featureMetaData:
        featureType = "unknown"
        if featureMetaData["valueType"] == 'str': featureType = "string" 
        if featureMetaData["valueType"] == 'int': featureType = "integer" 
    else:
        featureType = "not found"
    
    # Start from an empty list so that 'otype' (skipped below) does not reuse the values of the previous feature
    FeatureValueSetList = []
    if item!='otype':
        FeatureFrequenceLists=Fs(item).freqList()
        FoundItems=0
        for value, freq in FeatureFrequenceLists:
            FoundItems+=1
            # Store each (value, frequency) pair; stop after the first ten
            FeatureValueSetList.append((value, freq))
            if FoundItems==10: break

    featureMetaDict[item].append((featureDescription, featureType, FeatureValueSetList))

# Print resulting dictionary depending on setting 'verbose' 
if verbose: print(featureMetaDict)
print('finished')

{'after': [('material after the end of the word', 'string', [(' ', 238522), (',', 18878), ('.', 11408), ('·', 4710), (';', 1938), (',—', 36), ('—', 14), (').', 12), ('.]]', 8), ('·—', 8)])], 'appositioncontainer': [('1 if it is an apposition container', 'integer', [(1, 3816)])], 'articular': [('1 if the wg has an article', 'integer', [(1, 57544)])], 'before': [('this is XML attribute before', 'string', [('—', 32), ('(', 20), ('[[', 14), ('[', 2)])], 'book': [('book name (abbreviated), from ref attribute in xml', 'string', [('Luke', 38054), ('Matthew', 35292), ('Acts', 34574), ('John', 30536), ('Mark', 22386), ('Revelation', 17798), ('1Corinthians', 13504), ('Romans', 13079), ('Hebrews', 9204), ('2Corinthians', 8367)])], 'book_short': [('this is XML attribute book_short', 'string', [('LUK', 19457), ('ACT', 18394), ('MAT', 18300), ('JHN', 15644), ('MRK', 11278), ('REV', 9833), ('ROM', 7101), ('1CO', 6821), ('HEB', 4956), ('2CO', 4470)])], 'case': [('grammatical case', 'string', [('nomina
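
The counter-and-break loop in the cell above keeps only the first ten (value, frequency) pairs. As a sketch of an equivalent way to truncate such a sequence, `itertools.islice` can be used. The sample pairs are taken from the `book_short` entry in the output above; the helper name `first_items` is hypothetical.

```python
from itertools import islice

# First four (value, frequency) pairs of feature 'book_short' from the output above
freq_pairs = (('LUK', 19457), ('ACT', 18394), ('MAT', 18300), ('JHN', 15644))

def first_items(pairs, limit=10):
    """Return at most `limit` leading items of an iterable as a list."""
    return list(islice(pairs, limit))

print(first_items(freq_pairs, 2))  # [('LUK', 19457), ('ACT', 18394)]
```

If the sequence holds fewer items than `limit`, all of them are returned, which matches the behaviour of the original loop.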

# 5 - Create the pages<a class="anchor" id="bullet5"></a>

## 5.1 - Create set of feature pages<a class="anchor" id="bullet5x1"></a>
##### [Back to TOC](#TOC)

In [11]:
import markdown2
import os

filesCreated=0
for feature in featureDict:
    # prepare the data
    featureName = feature
    nodeList = ''
    featureValues=''
    for node in featureDict[feature]:
        nodeList += f' <A HREF=\"featurebynodetype.md#{node}\">`{node}`</A>'
    featureDescription, featureType, valueFreq = featureMetaDict[feature][0]

    featureValues="Value|Frequency|\n---|---|\n"
    for value, freq in valueFreq:
        if value=='':
           featureValues+=f"empty |{freq}|\n"
        else:
           featureValues+=f"`{value}` | {freq} |\n"
    featureValues+="Note: only the first 10 items are shown"
    
    # define the content of the feature description page (the f-string fills in all values directly;
    # a further .format() call would fail on feature values that contain curly braces)
    FeaturePageContent = f"{pageTitle}\n#Feature: {featureName}\nData type|Available for node types|\n---|---|\n`{featureType}` |{nodeList}|\n## Description\n{featureDescription}\n## Values\n{featureValues}\n"

    # Convert the Markdown text to HTML
    markdown_content = markdown2.markdown(FeaturePageContent, extras=['tables'])

    # set up path to location to store the resulting file
    fileName = os.path.join(resultLocation, f"{feature}.md")

    try:
        # Write the converted content to the file
        with open(fileName, "w", encoding="utf-8") as file:
            file.write(markdown_content)
        filesCreated+=1
        if verbose: print(f"Markdown content written to {fileName}")
    except OSError as e:
        print(f"Error writing to file {fileName}: {e} (please create directory \'{resultLocation}\' first)")
        break
if filesCreated!=0: print(f'finished (writing {filesCreated} files)') 

Markdown content written to results\after.md
Markdown content written to results\appositioncontainer.md
Markdown content written to results\articular.md
Markdown content written to results\before.md
Markdown content written to results\book.md
Markdown content written to results\book_short.md
Markdown content written to results\case.md
Markdown content written to results\chapter.md
Markdown content written to results\clauseType.md
Markdown content written to results\cls.md
Markdown content written to results\cltype.md
Markdown content written to results\criticalsign.md
Markdown content written to results\crule.md
Markdown content written to results\degree.md
Markdown content written to results\discontinuous.md
Markdown content written to results\domain.md
Markdown content written to results\framespec.md
Markdown content written to results\function.md
Markdown content written to results\gender.md
Markdown content written to results\gloss.md
Markdown content written to results\id.md
Markd
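
The cell above reports an error when the directory named in `resultLocation` does not exist. One possible way to avoid that manual step is to create the directory just before writing. The following is a minimal sketch under that assumption; the helper name `write_page` is hypothetical, and the demonstration runs in a temporary directory.

```python
import tempfile
from pathlib import Path

def write_page(directory, name, content):
    """Create the directory if needed and write content to <directory>/<name>.md."""
    target_dir = Path(directory)
    # parents=True also creates missing parent directories; exist_ok=True tolerates reruns
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / f"{name}.md"
    target.write_text(content, encoding="utf-8")
    return target

# Demonstration in a temporary directory, so no real files are touched
with tempfile.TemporaryDirectory() as tmp:
    page = write_page(Path(tmp) / "results", "after", "# Feature: after\n")
    print(page.name)  # after.md
```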

## 5.2 - Create overview page<a class="anchor" id="bullet5x2"></a>
##### [Back to TOC](#TOC)

In [12]:
overviewPage = f"{pageTitle}\n#Features per node type\n"

# Iterate over node types
for NodeType in F.otype.all:
    # Initialize an empty list to store keys
    FeaturesWithNodeType = []
    # Check each set in featureDict for the presence of this nodetype
    for feature, value_set in featureDict.items():
        if NodeType in value_set:
             FeaturesWithNodeType.append(feature)
    NodeItemText=f"##{NodeType}\nFeature|Datatype|Description|Examples\n|---|---|---|---|\n"
    for item in FeaturesWithNodeType:
        featureDescription =featureMetaDict[item][0][0]
        DataType="`"+featureMetaDict[item][0][1]+"` "
        #Get some example values
        FoundItems=0
        valueExamples=''
        for value, freq in featureMetaDict[item][0][2]:
           FoundItems+=1
           valueExamples+='`'+str(value)+'` '
           if FoundItems==2: break
        NodeItemText+=f'<A HREF=\"{item}.md#readme\">{item}</A>| {DataType} | {featureDescription} | {valueExamples} \n'
    overviewPage+=NodeItemText
    

# create the feature overview file
# Convert the Markdown text to HTML
markdown_content = markdown2.markdown(overviewPage, extras=['tables'])
    
# set up path to location to store the resulting file
fileName = os.path.join(resultLocation, "featurebynodetype.md")
try:
    # Write the converted content to the file
    with open(fileName, "w", encoding="utf-8") as file:
        file.write(markdown_content)
    filesCreated+=1
    if verbose: print(f"Markdown content written to {fileName}")
    print('Overview page created successfully')
except OSError as e:
    print(f"Error writing to file {fileName}: {e} (please create directory \'{resultLocation}\' first)")

Markdown content written to results\featurebynodetype.md
Overview page created successfully
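
The table-building part of the cell above can also be isolated in a small function, which makes it easier to inspect the generated Markdown before it is converted. The sketch below is a simplified variant, not the notebook's own code: it uses Markdown links instead of the `<A HREF>` tags, and the function name `overview_table` is hypothetical.

```python
def overview_table(node_type, rows):
    """Render one node type as a Markdown heading followed by a feature table.

    rows: iterable of (feature, datatype, description, values) tuples.
    """
    lines = [
        f"## {node_type}",
        "Feature|Datatype|Description|Examples",
        "---|---|---|---",
    ]
    for feature, datatype, description, values in rows:
        # Show at most two example values, as the cell above does
        examples = " ".join(f"`{value}`" for value in values[:2])
        lines.append(f"[{feature}]({feature}.md)|`{datatype}`|{description}|{examples}")
    return "\n".join(lines) + "\n"

# Sample row based on the featureMetaDict output earlier in the notebook
sample_rows = [
    ("book_short", "string", "this is XML attribute book_short", ["LUK", "ACT", "MAT"]),
]
print(overview_table("book", sample_rows))
```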


# 6 - Licence<a class="anchor" id="bullet6"></a>
##### [Back to TOC](#TOC)

Licensed under [Creative Commons Attribution 4.0 International (CC BY 4.0)](https://github.com/tonyjurg/Doc4TF/blob/main/LICENCE.md)