In [1]:
%load_ext autoreload
%autoreload 2

# Lowfat to TF - version 0.5.1 (updated version of Dirk Roorda's code v. 0.4.0)

We use the machinery of Text-Fabric combined with some custom code to convert
the lowfat XML of the Greek New Testament into TF.

# Set up

We gather all prerequisites.

In [2]:
from tf.convert.xml import XML
from lowfat import convertTaskCustom
from tf.advanced.helpers import dm
from tf.app import use
color = {2: "#1be7ff", 3: "#6eeb83", 4: "#e4ff1a", 5: "#ffb800", 6: "#ff5714"}

The custom code is in `lowfat.py`, here in this directory.

It consists of two functions that replace default functions in
[xmlCustom](https://annotation.github.io/text-fabric/tf/convert/xmlCustom.html),
which is part of TF.

So you only have to focus on the bits that actually touch the lowfat XML.

We pass the function `convertTaskCustom()`, defined in `lowfat.py`, to the XML converter.

We also specify the way we want to see some attributes in the report files:

* keyword attributes: we want to see an inventory of all words that occur in such attributes
* trim attributes: we do not want to see the values of these attributes
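
As a rough illustration of what these two settings mean for a report, the sketch below builds an inventory from a few made-up elements. The helper `inventory()`, the placeholder `"X"` and the sample data are hypothetical and not part of TF; the real report files are written by the converter's check task.

```python
from collections import Counter, defaultdict

def inventory(elements, keyword_atts, trim_atts):
    """Summarize attribute values along the lines of the two settings.

    elements: list of (tag, attributes) pairs (hypothetical sample input).
    """
    report = defaultdict(Counter)
    for tag, atts in elements:
        for name, value in atts.items():
            if name in trim_atts:
                # trim attribute: record that it occurs, but hide the value
                report[name]["X"] += 1
            elif name in keyword_atts:
                # keyword attribute: count every word in the value separately
                report[name].update(value.split())
            else:
                # any other attribute: count the value as a whole
                report[name][value] += 1
    return report

sample = [
    ("w", {"class": "noun", "case": "nominative", "lemma": "ἀρχή"}),
    ("w", {"class": "det", "case": "genitive", "lemma": "ὁ"}),
    ("w", {"class": "noun", "case": "genitive", "lemma": "εὐαγγέλιον"}),
]
report = inventory(sample, {"class", "case"}, {"lemma"})
print(dict(report["class"]))
print(dict(report["lemma"]))
```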

In [3]:
keywordAtts = set(
    """
    case
    class
    number
    gender
    mood
    person
    role
    tense
    type
    voice
    degree
    articular
""".strip().split()
)

trimAtts = set(
    """
    domain
    frame
    gloss
    id
    lemma
    ln
    morph
    normalized
    ref
    referent
    rule
    strong
    subjref
    unicode
""".strip().split()
)
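
The triple-quoted string followed by `.strip().split()` is just a compact way to write a set of names. The sketch below uses shortened copies of the two lists (the variable names are illustrative) and adds a small sanity check that no attribute is listed both as a keyword attribute and as a trim attribute.

```python
keyword_sample = set(
    """
    case
    class
    number
""".strip().split()
)
trim_sample = set(
    """
    gloss
    id
    lemma
""".strip().split()
)

# split() without arguments splits on any whitespace, including newlines
print(sorted(keyword_sample))

# an attribute should not be both a keyword attribute and a trim attribute
overlap = keyword_sample & trim_sample
print(overlap or "no overlap")
```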

We do not want both the `Rule` and `rule` features in our dataset, because this can clash on file systems
that are case insensitive.
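
Since TF stores each feature in its own file named after the feature, two names that differ only in case can end up competing for the same file. A minimal sketch of how such clashes could be spotted in a list of feature names; `case_clashes()` is a hypothetical helper, not a TF function.

```python
from collections import defaultdict

def case_clashes(names):
    """Return groups of names that differ only in upper/lower case."""
    groups = defaultdict(list)
    for name in names:
        groups[name.casefold()].append(name)
    return [sorted(group) for group in groups.values() if len(group) > 1]

features = ["Rule", "rule", "role", "lemma"]
print(case_clashes(features))
print(case_clashes(["crule", "rule", "role", "lemma"]))
```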

We translate the `frame` and `subjref` attributes to edge features, but we retain the original contents in the
`framespec` and `subjrefspec` attributes.

The name `class` is exceptionally cumbersome if you want to use it inside Python code,
so we rename it to `cls`.
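
The reason is that `class` is a reserved word in Python, so an expression such as `F.class.v(1)` cannot even be parsed. A short check with the standard library illustrates this; the helper `parses()` exists only for this sketch.

```python
import keyword

for name in ("class", "cls"):
    print(name, keyword.iskeyword(name))

def parses(expression):
    """Report whether Python can parse the expression (it is not evaluated)."""
    try:
        compile(expression, "<sketch>", "eval")
        return True
    except SyntaxError:
        return False

print(parses("F.class.v(1)"))
print(parses("F.cls.v(1)"))
```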

Also, to make queries friendlier, we change the _label_ of the node type `w` to `word`.

In [4]:
renameAtts = {
    "Rule": "crule",
    "frame": "framespec",
    "subjref": "subjrefspec",
    "class": "cls",
}
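
To make the effect of such a mapping concrete, here is a minimal sketch that applies it to the attributes of one element. The function `rename_attributes()` and the attribute values are made up for illustration; the converter applies `renameAtts` internally in its own way.

```python
rename_atts = {
    "Rule": "crule",
    "frame": "framespec",
    "subjref": "subjrefspec",
    "class": "cls",
}

def rename_attributes(atts, mapping):
    """Return a copy of atts with keys renamed according to mapping."""
    return {mapping.get(name, name): value for name, value in atts.items()}

# made-up attributes of a single element
atts = {"class": "noun", "Rule": "Np-Appos", "rule": "DetNP", "lemma": "λόγος"}
renamed = rename_attributes(atts, rename_atts)
print(renamed)
```

Attributes that are not in the mapping, such as `rule` and `lemma` here, keep their names.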

In [5]:
X = XML(
    convertTaskCustom=convertTaskCustom,
    keywordAtts=keywordAtts,
    trimAtts=trimAtts,
    renameAtts=renameAtts,
    verbose=1,
    xml=0,
    tf="0.5.1",
)

Working in repository XML-nestle1904/programs in backend Downloads
XML data version is 2022-11-01 (most recent)
TF data version is 0.5.1 (explicit existing)
Processing instructions will be ignored


In [5]:
X = XML(
    convertTaskCustom=convertTaskCustom,
    keywordAtts=keywordAtts,
    trimAtts=trimAtts,
    renameAtts=renameAtts,
    verbose=1,
    xml="2000-01-01",
    tf="0.5.1",
)

Working in repository XML-nestle1904/programs in backend Downloads
XML data version is 2000-01-01 (oldest)
TF data version is 0.5.1 (explicit existing)
Processing instructions will be ignored


# Check

First we check the input:

In [8]:
X.task(check=True)

XML to TF checking: ~/Downloads/XML-nestle1904/programs/xml/2022-11-01 => ~/Downloads/XML-nestle1904/programs/report/2022-11-01
Processing instructions are ignored
Start folder gnt:
  27 27-revelation.xml                                 
End   folder gnt

151 info line(s) written to ~/Downloads/XML-nestle1904/programs/report/2022-11-01/elements.txt
0 error(s) in 0 file(s) written to ~/Downloads/XML-nestle1904/programs/report/2022-11-01/errors.txt
7 tags of which 0 with multiple namespaces written to ~/Downloads/XML-nestle1904/programs/report/2022-11-01/namespaces.txt


True

# Convert, Load, and App Creation

Here we generate the TF, load it to check that it is valid, and create the config file that turns the dataset into a TF app.

In [29]:
X.task(convert=True)
X.task(load=True)
X.task(app=True)

XML to TF converting: ~/Downloads/XML-nestle1904/programs/xml/2022-11-01 => ~/Downloads/XML-nestle1904/programs/tf/0.5.1
  0.00s Not all of the warp features otype and oslots are present in
~/Downloads/XML-nestle1904/programs/tf/0.5.1
  0.00s Only the Feature and Edge APIs will be enabled
  0.00s Warp feature "otext" not found. Working without Text-API

  0.00s Importing data from walking through the source ...
   |     0.00s Preparing metadata... 
   |     0.00s No structure nodes will be set up
   |   SECTION   TYPES:    book, chapter, verse
   |   SECTION   FEATURES: book, chapter, verse
   |   STRUCTURE TYPES:    
   |   STRUCTURE FEATURES: 
   |   TEXT      FEATURES:
   |      |   text-orig-clean      punctuation, text
   |      |   text-orig-full       after, before, text
   |     0.01s OK
   |     0.00s Following director... 
  27 27-revelation.xml                                 
There are no broken subjref references.
There are 9 broken frame references.
gnt/01-matthew.xml    

True
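
The report on broken references can be pictured as follows. This is only a sketch under assumptions: it supposes that a frame specification is a space-separated list of `label:target` items and that a reference is broken when its target id does not occur in the data. The ids, the format and the helper `broken_references()` are hypothetical; the actual check lives in `lowfat.py`.

```python
def broken_references(frames):
    """Collect (source, label, target) triples whose target id is unknown.

    frames: dict mapping a word id to its frame specification
    (assumed format: space-separated "label:target" items).
    """
    known = set(frames)
    broken = []
    for source, spec in frames.items():
        for item in spec.split():
            label, _, target = item.partition(":")
            if target not in known:
                broken.append((source, label, target))
    return broken

frames = {
    "w1": "A0:w2 A1:w3",
    "w2": "",
    "w3": "A0:w9",  # w9 does not exist in this sample
}
print(broken_references(frames))
```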

# Test

We test a bit of the resulting dataset right here.

In [45]:
A = use("app:~/Downloads/XML-nestle1904/programs/app", version="0.5.1", hoist=globals())
#B = use("etcbc/nestle1904", hoist=globals())
#B = use("saulocantanhede/tfgreek2", version="0.5.0", hoist=globals())

**Locating corpus resources ...**

Name,# of nodes,# slots/node,% coverage
book,27,5102.93,100
chapter,260,529.92,100
verse,7944,17.34,100
sentence,8011,17.2,100
clause,52242,8.56,324
wg,106868,6.88,533
phrase,119560,2.95,256
subphrase,72845,1.0,53
word,137779,1.0,100


## Comparing queries between the BHSA and our results

In [43]:
A.showFormats()

format | level | template
--- | --- | ---
`text-orig-clean` | **word** | `{text}{punctuation}`
`text-orig-full` | **word** | `{before}{text}{after}`


In [41]:
B.showFormats()

format | level | template
--- | --- | ---
`text-orig-full` | **word** | `{text}{after}`


In [46]:
F.before.freqList(nodeTypes={"word"})

(('—', 16), ('(', 10), ('[[', 7), ('[', 1))

In [47]:
F.after.freqList(nodeTypes={"word"})

((' ', 119261),
 (',', 9439),
 ('.', 5704),
 ('·', 2355),
 (';', 969),
 (',—', 18),
 ('—', 7),
 (').', 6),
 ('.]]', 4),
 ('·—', 4),
 (',)', 3),
 (']].', 2),
 ('(,', 1),
 ('.)', 1),
 (';)', 1),
 (';—', 1),
 (']', 1),
 (']]', 1),
 ('—,', 1))

In [48]:
F.punctuation.freqList(nodeTypes={"word"})

((' ', 119264), (',', 9462), ('.', 5717), ('·', 2359), (';', 971))

In [49]:
F.criticalsign.freqList(nodeTypes={"word"})

(('—', 25), ('(', 11), (')', 11), ('[[', 7), (']]', 7), ('[', 1), (']', 1))
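
The frequency lists above are ordered by descending count, with ties ordered by value. A standard-library analogue of that ordering, applied to a small made-up list of signs, could look like this; `freq_list()` is a stand-in for illustration, not TF's implementation of `freqList()`.

```python
from collections import Counter

def freq_list(values):
    """Return (value, count) pairs, most frequent first, ties by value."""
    counts = Counter(values)
    return tuple(sorted(counts.items(), key=lambda item: (-item[1], item[0])))

signs = ["—", "(", ")", "—", "[[", "]]", "(", "—"]
print(freq_list(signs))
```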

In [39]:
results = A.search("""
w1:word before* after* criticalsign* unicode=Ἐφέσῳ] punctuation* text*
""")
A.show(results, multiFeatures=False, queryFeatures=True, condenseType='phrase',
       colorMap=color, hiddenTypes={"wg"})

  1.63s 1 result


In [50]:
for formats in T.formats:
    print(f'fmt={formats}\t: {T.text(206334,formats)}')

fmt=text-orig-clean	: Ἀρχὴ τοῦ εὐαγγελίου Ἰησοῦ Χριστοῦ Υἱοῦ Θεοῦ.
fmt=text-orig-full	: Ἀρχὴ τοῦ εὐαγγελίου Ἰησοῦ Χριστοῦ (Υἱοῦ Θεοῦ).
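
The difference between the two formats follows directly from their templates, `{text}{punctuation}` versus `{before}{text}{after}`. The sketch below renders the last two words of the verse with both templates. The feature values in `words` are assumptions chosen to match the output above, and `render()` is a hypothetical helper, not the TF text API.

```python
formats = {
    "text-orig-clean": "{text}{punctuation}",
    "text-orig-full": "{before}{text}{after}",
}

# assumed feature values for the last two words of Mark 1:1
words = [
    {"before": "(", "text": "Υἱοῦ", "after": " ", "punctuation": " "},
    {"before": "", "text": "Θεοῦ", "after": ").", "punctuation": "."},
]

def render(words, template):
    """Fill in the template for every word and concatenate the results."""
    return "".join(template.format(**word) for word in words)

for name, template in formats.items():
    print(f"fmt={name}\t: {render(words, template)}")
```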


In [51]:
for formats in T.formats:
    print(f'fmt={formats}\t: {T.text(216371,formats)}')

fmt=text-orig-clean	: διὰ τῶν ἐπακολουθούντων σημείων.
fmt=text-orig-full	: διὰ τῶν ἐπακολουθούντων σημείων.]]


In [53]:
results = A.search("""
phrase
    word before* after* criticalsign* unicode=(Υἱοῦ punctuation* book=Mark chapter=1 verse=1 text*
    word before* after* criticalsign* after* unicode=Θεοῦ). punctuation* book=Mark chapter=1 verse=1 text*
""")
A.show(results, end=1, multiFeatures=False, queryFeatures=True, condenseType='phrase',
       colorMap=color, hiddenTypes={"wg"}, withNodes=False)

  1.81s 5 results


In [82]:
results = B.search("""
phrase
    word after* unicode=(Υἱοῦ book=Mark chapter=1 verse=1
    word after* unicode=Θεοῦ). book=Mark chapter=1 verse=1
""")
B.show(results, end=1, multiFeatures=False, queryFeatures=True, condenseType='phrase',
       colorMap=color, hiddenTypes={"wg"})

  1.36s 5 results


In [397]:
results = A.search("""
c1:clause
    p1:phrase
        w1:word
w1 -parent> p1
""")
A.show(results, end=1, multiFeatures=False, queryFeatures=True,
       colorMap=color, hiddenTypes={"wg"})

  0.02s 936 results


In [16]:
results = A.search("""
w1:word book=Mark chapter=16 verse=20 after* before* unicode* punctuation* text* criticalsign*
""")
A.show(results, end=1, multiFeatures=False, queryFeatures=True,
       colorMap=color, hiddenTypes={"wg", "clause"}, withNodes=True)

  0.84s 16 results


In [45]:
results = A.search("""
w1:word book=Mark chapter=2 verse=10 after* before* unicode* punctuation* text*
""")
A.show(results, end=1, multiFeatures=False, queryFeatures=True,
       colorMap=color, hiddenTypes={"wg", "clause", "phrase", "subphrase"}, withNodes=True)

  0.08s 18 results


In [43]:
results = A.search("""
word book=Acts chapter=22 verse=2 after* before* unicode*
""")
A.show(results, end=1, multiFeatures=False, queryFeatures=True,
       colorMap=color, hiddenTypes={"wg"})

  0.35s 13 results


In [4]:
Search0 = '''
phrase role* function#Subj
 word role=s


'''
Search0 = A.search(Search0)
A.show(Search0, start=1, end=1, condensed=False, colorMap={1:'pink', 2:'yellow'}, hiddenTypes={"wg"})

  0.80s 833 results


In [49]:
results = A.search("""
book book=Jude
    verse verse=9
        c1:wg cls=cl
            p1:word role=s
            p2:word role=o

p1 -parent> c1
p2 -parent> c1
""")
A.show(results, multiFeatures=False, queryFeatures=True,
       colorMap=color, hiddenTypes={"wg"})

  0.01s 1 result


In [150]:
results = A.search("""
verse book=Matthew verse=19 chapter=1
    wg rule=CLaCL role=adv
""")
A.show(results, end=1, multiFeatures=False, queryFeatures=True,
       colorMap=color, hiddenTypes={"wg"}, extraFeatures={"role"})

  0.12s 1 result


In [17]:
import sys, os, collections
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt; plt.rcdefaults()
from matplotlib.pyplot import figure
from collections import Counter

In [69]:
results = A.search("""
w:word role=p
""")

  0.33s 1904 results


In [70]:
results = A.search("""
p:phrase role=p
    w:word role=p
""")

  0.61s 1910 results


In [25]:
def possui_mais_de_uma_palavra(texto):
    """Return True if the text has more than one word (the name is Portuguese for that)."""
    palavras = texto.split()
    return len(palavras) > 1

In [71]:
results = A.search("""
p:phrase role=p
    w:word role=p
""")
A.export(results, toDir='C:/Users/TF/Downloads', toFile='test2.tsv')
results=pd.read_csv('C:/Users/TF/Downloads/test2.tsv',delimiter='\t',encoding='utf-16')
pd.set_option('display.max_columns', 50)
results.head(50)

df = pd.DataFrame(results)
df['Possui_Mais_de_Uma_Palavra'] = df['TEXT1'].apply(possui_mais_de_uma_palavra)

df_filtrado = df.loc[df['Possui_Mais_de_Uma_Palavra'] == True]

# Display the resulting DataFrame
print(df_filtrado)

  0.63s 1910 results
         R            S1  S2  S3   NODE1   TYPE1  \
253    254          Mark   4  31  218663  phrase   
254    255          Mark   4  31  218663  phrase   
1345  1346  1Corinthians  15   9  278203  phrase   
1346  1347  1Corinthians  15   9  278203  phrase   
1586  1587      1Timothy   1   5  290027  phrase   
1624  1625         Titus   1   7  292115  phrase   

                                                  TEXT1 role1   NODE2 TYPE2  \
253   ὡς κόκκῳ σινάπεως,ὃς ὅταν σπαρῇ ἐπὶ τῆς γῆς,μι...     p   20590  word   
254   ὡς κόκκῳ σινάπεως,ὃς ὅταν σπαρῇ ἐπὶ τῆς γῆς,μι...     p   20605  word   
1345  ὁ ἐλάχιστος τῶν ἀποστόλων,ὃς οὐκ εἰμὶ ἱκανὸς κ...     p   95932  word   
1346  ὁ ἐλάχιστος τῶν ἀποστόλων,ὃς οὐκ εἰμὶ ἱκανὸς κ...     p   95934  word   
1586  ἀγάπη ἐκ καθαρᾶς καρδίας καὶ συνειδήσεως ἀγαθῆ...     p  111691  word   
1624  ἀνέγκλητον μὴ αὐθάδη,μὴ ὀργίλον,μὴ πάροινον,μὴ...     p  114546  word   

                TEXT2 role2  Possui_Mais_de_Uma_Palavra  
25
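
The same filter can be sketched with the standard library only, which may help when pandas is not at hand. The in-memory TSV below is a made-up stand-in for the exported file and contains only a few of its columns; a real export would be opened with `encoding="utf-16"`, as in the cell above.

```python
import csv
import io

def possui_mais_de_uma_palavra(texto):
    """True if the text has more than one whitespace-separated word."""
    return len(texto.split()) > 1

# made-up stand-in for the exported file; a real export would be opened with
# open(path, encoding="utf-16", newline="")
tsv = io.StringIO(
    "R\tNODE1\tTEXT1\n"
    "1\t100\tἐστιν\n"
    "2\t101\tὡς κόκκῳ σινάπεως\n"
)
rows = list(csv.DictReader(tsv, delimiter="\t"))

# keep only the rows whose phrase text consists of several words
filtered = [row for row in rows if possui_mais_de_uma_palavra(row["TEXT1"])]
print([row["R"] for row in filtered])
```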

In [19]:
results = A.search("""
p1:phrase role=apposition articular* rule* type* rela* appositioncontainer*
    word book=Jude
""")
A.show(results, end=10, multiFeatures=False, queryFeatures=True,
       condenseType="clause", colorMap=color, hiddenTypes={"wg"})

  0.49s 23 results


In [24]:
results = A.search("""
word book=Luke chapter=2 verse=35 num=1 after* unicode*
""")
A.show(results, end=1, multiFeatures=False, queryFeatures=True,
       condenseType="clause", colorMap=color, hiddenTypes={"wg"})

  0.25s 1 result


In [26]:
results = A.search("""
word unicode=σημείων.]] num* after*
""")
A.show(results, end=1, multiFeatures=False, queryFeatures=True,
       condenseType="clause", colorMap=color, hiddenTypes={"wg"})

  0.32s 1 result


In [10]:
results = A.search("""
word book* chapter* verse* unicode=[ἐν after*
""")
A.show(results, end=1, multiFeatures=False, queryFeatures=True,
       condenseType="clause", colorMap=color, hiddenTypes={"wg"})

  0.44s 1 result


In [26]:
results = A.search("""
book book=Matthew
    word framespec
    -frame> word
""")
A.show(results, end=1, multiFeatures=False, queryFeatures=True,
       condenseType="clause", colorMap=color, hiddenTypes={"wg"})

  0.77s 6019 results


In [36]:
results = A.search("""
wg role*
""")
A.show(results, end=1, multiFeatures=False, queryFeatures=True,
       condenseType="clause", colorMap=color)

  0.47s 106868 results


In [77]:
results = B.search("""
book book=Jude
    verse verse=9
        c1:wg cls=cl
            p1:word role=s
            p2:word role=o

p1 -parent> c1
p2 -parent> c1
""")
B.show(results, multiFeatures=False, queryFeatures=True,
       colorMap=color, hiddenTypes={"wg"})

  0.41s 1 result


In [93]:
results = A.search("""
book book=Jude
    verse verse=9
        c1:wg cls=cl
            p1:word role=s
            p2:word role=o

p1 -parent> c1
p2 -parent> c1
""")
A.show(results, multiFeatures=False, queryFeatures=True,
       colorMap=color, hiddenTypes={"wg"})

  0.00s 1 result


In [104]:
results = A.search("""
book book=Jude
  phrase
  -sibling> phrase
""")
A.show(results, end=1, queryFeatures=False, hiddenTypes="wg")

  0.00s 64 results


In [80]:
results = A.search("""
verse book=Jude verse=9
    c1:clause
        p1:phrase function=Objc
        p2:phrase function=Subj

p1 -parent> c1
p2 -parent> c1
""")
A.show(results, multiFeatures=False, queryFeatures=True,
       colorMap=color, condenseType="sentence", hiddenTypes={"wg"})

  0.00s 1 result


In [39]:
results = B.search("""
verse book=Matthew chapter=4 verse=10
    c1:clause
        p1:phrase rule* cltype=Minor
        p2:phrase text=Σατανᾶ
""")
B.show(results, end=1, multiFeatures=False, queryFeatures=True, hiddenTypes={"wg"}, colorMap=color)

  0.27s 2 results


In [167]:
B.displaySetup(withNodes=False,
               standardFeatures=True,
               hiddenTypes={"clause", "phrase","subphrase"},
               hideTypes=True,
               queryFeatures=False)

In [81]:
matthew = B.nodeFromSectionStr("Matthew")
#L is the Locality class looking for downward (d) nodes
print(matthew)
s = L.d(matthew, otype="sentence")[1]
#A.pretty(s)

137780


In [86]:
results = A.search("""
verse book=Jude
  clause
  <parent- phrase
""")
A.show(results, end=1, queryFeatures=True, hiddenTypes={"wg"})

  0.00s 202 results


In [123]:
results = A.search("""
verse book=John chapter=1 verse=9
  c1:clause
      p1:phrase function=Subj
      p2:phrase function=Pred

p1 <parent> c1
p2 <parent> c1
""")
A.show(results, start=1, end=1, condensed=True, multiFeatures=False, hiddenTypes={"wg"})

  0.10s 0 results


In [46]:
results = A.search("""
verse book=John chapter=1 verse=9
    phrase function* role* rule* typ*
        word ref*
""")
A.show(results, start=1, end=1, condensed=True, multiFeatures=False, hiddenTypes={"wg"})

  1.58s 28 results


In [25]:
results = A.search("""
book book=John
  phrase
  -sibling> phrase
""")
A.show(results, start=1, end=1, condensed=True, multiFeatures=False, hiddenTypes={"wg"})

  0.06s 0 results


In [23]:
results = C.search("""
book book=John
  word
  -sibling>3> word
""")
C.show(results, start=1, end=1, condensed=True, multiFeatures=False, hiddenTypes={"wg"})

  0.28s 13 results


In [11]:
results = A.search('''
verse book=Matthew chapter=6 verse=24
    clause
        phrase typ* function* rela*
                word role*
''')
A.show(results, start=1, end=1, condensed=True, multiFeatures=False, hiddenTypes={"wg"})

  1.25s 28 results


# Browse

We are ready to browse the data.
If you run this notebook, then the next cell will open a browser window with the TF-browser
on the Greek New Testament.

In [None]:
X.task(browse=True)

# Terminate

You can stop the browser by interrupting the kernel: press `i` twice in command mode.

# Create zip

It is time to commit and push the repo to GitHub now:

```
git add --all .
git commit -m "new data version"
git push origin master
```

Then go over to GitHub and create a new release there.

After that, fetch the new tags from GitHub by

```
git pull --tags
```

Then we are ready to create a zip file for publishing the dataset in a release on GitHub,
so that users can get it easily.

In [13]:
A.zipAll()

Data to be zipped:
	missing  app                      (??)                : ~/github/None/None/app
	missing  main data                (??)                : ~/github/None/None/tf/0.5.0


# Fetch

We now test whether users can use this dataset in the normal way.

Run this after you have attached the complete.zip file that we created earlier to the latest release on GitHub.

In [52]:
A = use("ETCBC/nestle1904:latest")

**Locating corpus resources ...**

   |     0.17s T otype                from ~/text-fabric-data/github/ETCBC/nestle1904/tf/0.4.0
   |     2.36s T oslots               from ~/text-fabric-data/github/ETCBC/nestle1904/tf/0.4.0
   |     0.36s T text                 from ~/text-fabric-data/github/ETCBC/nestle1904/tf/0.4.0
   |     0.28s T after                from ~/text-fabric-data/github/ETCBC/nestle1904/tf/0.4.0
   |     0.28s T book                 from ~/text-fabric-data/github/ETCBC/nestle1904/tf/0.4.0
   |     0.24s T chapter              from ~/text-fabric-data/github/ETCBC/nestle1904/tf/0.4.0
   |     0.26s T verse                from ~/text-fabric-data/github/ETCBC/nestle1904/tf/0.4.0
   |      |     0.05s C __levels__           from otype, oslots, otext
   |      |     1.47s C __order__            from otype, oslots, __levels__
   |      |     0.06s C __rank__             from otype, __order__
   |      |     4.38s C __levUp__            from otype, oslots, __rank__
   |      |     2.38s C __levDown__          fr

Name,# of nodes,# slots/node,% coverage
book,27,5102.93,100
chapter,260,529.92,100
verse,7944,17.34,100
sentence,8011,17.2,100
wg,114879,7.6,633
clause,30152,7.37,161
phrase,42636,3.21,99
w,137779,1.0,100


Indeed, downloading and installing went without hassle.

# Demo

We demo the effect of the reshuffling of the words.

Our test corpus is the letter of Jude, first sentence, twice.

The first time we do not shuffle the words in the sentence, the second time we do.

We run the conversion with `demo = True` in `lowfat.py`.

In [5]:
X = XML(
    convertTaskCustom=convertTaskCustom,
    keywordAtts=keywordAtts,
    trimAtts=trimAtts,
    renameAtts=renameAtts,
    verbose=1,
    xml=-1,
    tf="0.3.1t",
)

Working in repository ETCBC/nestle1904 in backend github
XML data version is 2000-01-01 (oldest)
TF data version is 0.3.1t (explicit new)


In [7]:
X.task(convert=True, load=True, app=True)

XML to TF converting: ~/github/ETCBC/nestle1904/xml/2000-01-01 => ~/github/ETCBC/nestle1904/tf/0.3.1t
  0.00s Not all of the warp features otype and oslots are present in
~/github/ETCBC/nestle1904/tf/0.3.1t
  0.00s Only the Feature and Edge APIs will be enabled
  0.00s Warp feature "otext" not found. Working without Text-API

  0.00s Importing data from walking through the source ...
   |     0.00s Preparing metadata... 
   |     0.00s No structure nodes will be set up
   |   SECTION   TYPES:    book, chapter, verse
   |   SECTION   FEATURES: book, chapter, verse
   |   STRUCTURE TYPES:    
   |   STRUCTURE FEATURES: 
   |   TEXT      FEATURES:
   |      |   text-orig-full       after, text
   |     0.00s OK
   |     0.00s Following director... 
   1 26-jude.xml                                       
source reading done
   |     0.00s "edge" actions: 0
   |     0.00s "feature" actions: 71
   |     0.00s "node" actions: 39
   |     0.00s "resume" actions: 0
   |     0.00s "slot" actions

True

In [9]:
A = use("ETCBC/nestle1904:clone", checkout="clone", hoist=globals())

**Locating corpus resources ...**

Name,# of nodes,# slots/node,% coverage
book,1,34.0,100
chapter,1,34.0,100
verse,1,34.0,100
sentence,2,17.0,100
wg,34,6.0,600
w,34,1.0,100


In [10]:
(s1, s2) = F.otype.s("sentence")

In [21]:
color1 = "cyan"
color2 = "goldenrod"
start = 5
offset = 17
highlights = {
    start: color1,
    start + 1: color2,
    start + offset: color2,
    start + offset + 1: color1,
}
A.displaySetup(standardFeatures=True, highlights=highlights)
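
Written out, the highlight mapping looks as follows. Each of the two sentences spans 17 slots according to the table above, so `start + offset` is the position in the second sentence that corresponds to `start` in the first. The two colours are swapped in the second sentence, which suggests that the pair of words at these positions changes places when the words are shuffled; that reading is an interpretation of the cell, not something the output confirms.

```python
color1 = "cyan"
color2 = "goldenrod"
start = 5
offset = 17  # slots per sentence, as reported in the table above

highlights = {
    start: color1,
    start + 1: color2,
    start + offset: color2,
    start + offset + 1: color1,
}
print(highlights)
```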

In [22]:
A.pretty(s1)

In [23]:
A.pretty(s2)

# Restore

We restore the app so that it uses the normal TF version again.

In [32]:
X = XML(
    convertTaskCustom=convertTaskCustom,
    keywordAtts=keywordAtts,
    trimAtts=trimAtts,
    renameAtts=renameAtts,
    verbose=1,
    tf="0.3.1",
)

Working in repository ETCBC/nestle1904 in backend github
XML data version is 2022-11-01 (most recent)
TF data version is 0.3.1 (explicit existing)


In [33]:
X.task(app=True)

App updating ...
	~/github/ETCBC/nestle1904/app/static/logo.png (already exists, not overwritten)
	~/github/ETCBC/nestle1904/app/static/display.css (no custom info, older orginal exists)
	~/github/ETCBC/nestle1904/app/config.yaml (generated with custom info)
	~/github/ETCBC/nestle1904/app/app.py (deleted)
Done


True

Now save this notebook, commit and push the repo again to publish this very notebook.

```
git add --all .
git commit -m "maker notebook updated"
git push origin master
```