# Some corpus statistics (Nestle1904LFT)

**Work in progress!**

## Table of contents <a class="anchor" id="TOC"></a>
* <a href="#bullet1">1 - Introduction</a>
* <a href="#bullet2">2 - Load Text-Fabric app and data</a>
* <a href="#bullet3">3 - Performing the queries</a>
    * <a href="#bullet3x1">3.1 - The 25 most frequent words in the corpus</a>
    * <a href="#bullet3x2">3.2 - Frequency of characters in corpus</a>
    * <a href="#bullet3x3">3.3 - Some stats on node types</a>    
    * <a href="#bullet3x4">3.4 - The available text formats</a>    
    * <a href="#bullet3x5">3.5 - List of feature frequencies</a> 
    * <a href="#bullet3x6">3.6 - Frequency list of punctuations</a>
    * <a href="#bullet3x7">3.7 - Node number ranges</a>
    * <a href="#bullet3x8">3.8 - Count the objects per type</a>
    * <a href="#bullet3x9">3.9 - Obtain meta data for a feature</a>

# 1 - Introduction <a class="anchor" id="bullet1"></a>
##### [Back to TOC](#TOC)

This Jupyter Notebook showcases several examples of statistical analysis performed on a Text-Fabric corpus. For demonstration purposes, various methods of collecting and presenting the data are employed.

# 2 - Load Text-Fabric app and data <a class="anchor" id="bullet2"></a>
##### [Back to TOC](#TOC)

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
# Loading the Text-Fabric code
# Note: it is assumed Text-Fabric is installed in your environment
from tf.fabric import Fabric
from tf.app import use

In [3]:
# load the N1904 app and data
N1904 = use ("tonyjurg/Nestle1904LFT", version="0.6", hoist=globals())

**Locating corpus resources ...**

Name | # of nodes | # slots / node | % coverage
--- | --- | --- | ---
book | 27 | 5102.93 | 100
chapter | 260 | 529.92 | 100
verse | 7943 | 17.35 | 100
sentence | 8011 | 17.2 | 100
wg | 105430 | 6.85 | 524
word | 137779 | 1.0 | 100


In [4]:
# The following will push the Text-Fabric stylesheet to this notebook (to facilitate proper display with notebook viewer)
N1904.dh(N1904.getCss())

In [5]:
# Set default view in a way to limit noise as much as possible.
N1904.displaySetup(condensed=True, multiFeatures=False, queryFeatures=False)

# 3 - Performing the queries <a class="anchor" id="bullet3"></a>
##### [Back to TOC](#TOC)

## 3.1 - The 25 most frequent words in the corpus<a class="anchor" id="bullet3x1"></a>
##### [Back to TOC](#TOC)

The method [`freqList`](https://annotation.github.io/text-fabric/tf/core/nodefeature.html#tf.core.nodefeature.NodeFeature.freqList) returns a tuple of (value, frequency) items, ordered by frequency, with the highest frequencies first.

In [4]:
print("Amount\tword")
for (w, amount) in F.word.freqList("word")[0:25]:
    print(f"{amount}\t{w}")

Amount	word
8545	καὶ
2769	ὁ
2684	ἐν
2620	δὲ
2497	τοῦ
1755	εἰς
1658	τὸ
1556	τὸν
1518	τὴν
1411	αὐτοῦ
1300	τῆς
1281	ὅτι
1221	τῷ
1201	τῶν
1069	οἱ
941	ἡ
921	γὰρ
902	μὴ
859	τῇ
849	αὐτῷ
817	τὰ
767	οὐκ
722	τοὺς
689	Θεοῦ
670	πρὸς
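
As a rough illustration of what such a frequency list contains, the sketch below builds one in plain Python, without Text-Fabric. The helper `freq_list` and the sample words are hypothetical, and the tie-breaking rule (equal frequencies ordered by value) is an assumption of this sketch rather than a statement about the library.

```python
from collections import Counter


def freq_list(values):
    """Return (value, frequency) pairs, highest frequency first."""
    counts = Counter(values)
    # Sort by descending frequency; ties are ordered by value (an assumption).
    return tuple(sorted(counts.items(), key=lambda item: (-item[1], item[0])))


# Hypothetical sample, not taken from the corpus
sample_words = ["καὶ", "ὁ", "καὶ", "ἐν", "ὁ", "καὶ"]

print("Amount\tword")
for word, amount in freq_list(sample_words)[0:2]:
    print(f"{amount}\t{word}")
```

Slicing the result with `[0:2]` keeps only the two most frequent values, in the same way `[0:25]` is used above.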


## 3.2 - Frequency of characters in corpus <a class="anchor" id="bullet3x2"></a>
##### [Back to TOC](#TOC)

This code generates a table that displays the frequency of each character in the Text-Fabric corpus. The API call `C.characters.data` returns a Python dictionary, keyed by text format, that contains the data. The remaining code unpacks and sorts this structure to present the results in a formatted table.

Note that each table in the output is preceded by a line such as 'Format:  text-orig-full'. This line names the pre-defined text format to which the table applies.

In [5]:
# Library to format table
from tabulate import tabulate

# The following API call will result in a Python dictionary structure
FrequencyDictionary=C.characters.data

# Present the results
KeyList = list(FrequencyDictionary.keys())
for Key in KeyList:
    print('Format: ',Key)
    # 'Key' is one of the pre-defined formats in which the text can be displayed
    FrequencyList=FrequencyDictionary[Key]
    SortedFrequencyList=sorted(FrequencyList, key=lambda x: x[1], reverse=True)
    
    # In this example the table will be truncated to the first 15 entries
    max_rows = 15  # Set your desired number of rows here
    TruncatedTable = SortedFrequencyList[:max_rows]
    
    headers = ["character", "frequency"]
    print(tabulate(TruncatedTable, headers=headers, tablefmt='fancy_grid'))
    
    # Add a warning using markdown (API call N1904.dm) so it is printed in bold type
    N1904.dm("**Warning: table truncated!**")

Format:  text-critical
╒═════════════╤═════════════╕
│ character   │   frequency │
╞═════════════╪═════════════╡
│ ν           │       56230 │
├─────────────┼─────────────┤
│ α           │       51892 │
├─────────────┼─────────────┤
│ τ           │       50599 │
├─────────────┼─────────────┤
│ ο           │       45151 │
├─────────────┼─────────────┤
│ ε           │       38597 │
├─────────────┼─────────────┤
│ ς           │       27090 │
├─────────────┼─────────────┤
│ ι           │       26131 │
├─────────────┼─────────────┤
│ σ           │       24095 │
├─────────────┼─────────────┤
│ ρ           │       22871 │
├─────────────┼─────────────┤
│ κ           │       22630 │
├─────────────┼─────────────┤
│ π           │       20308 │
├─────────────┼─────────────┤
│ μ           │       19218 │
├─────────────┼─────────────┤
│ λ           │       18228 │
├─────────────┼─────────────┤
│ δ           │       12476 │
├─────────────┼─────────────┤
│ ἐ           │       12116 │
╘═════════════╧═════════════╛

**Warning: table truncated!**

Format:  text-normalized
╒═════════════╤═════════════╕
│ character   │   frequency │
╞═════════════╪═════════════╡
│             │      137779 │
├─────────────┼─────────────┤
│ ν           │       56230 │
├─────────────┼─────────────┤
│ α           │       52127 │
├─────────────┼─────────────┤
│ τ           │       50599 │
├─────────────┼─────────────┤
│ ο           │       45516 │
├─────────────┼─────────────┤
│ ε           │       38807 │
├─────────────┼─────────────┤
│ ς           │       27090 │
├─────────────┼─────────────┤
│ ι           │       26404 │
├─────────────┼─────────────┤
│ σ           │       24095 │
├─────────────┼─────────────┤
│ ρ           │       22871 │
├─────────────┼─────────────┤
│ κ           │       22630 │
├─────────────┼─────────────┤
│ ί           │       21518 │
├─────────────┼─────────────┤
│ π           │       20308 │
├─────────────┼─────────────┤
│ μ           │       19218 │
├─────────────┼─────────────┤
│ λ           │       18228 │
╘═════════════╧═════════════╛

**Warning: table truncated!**

Format:  text-orig-full
╒═════════════╤═════════════╕
│ character   │   frequency │
╞═════════════╪═════════════╡
│             │      137779 │
├─────────────┼─────────────┤
│ ν           │       56230 │
├─────────────┼─────────────┤
│ α           │       51892 │
├─────────────┼─────────────┤
│ τ           │       50599 │
├─────────────┼─────────────┤
│ ο           │       45151 │
├─────────────┼─────────────┤
│ ε           │       38597 │
├─────────────┼─────────────┤
│ ς           │       27090 │
├─────────────┼─────────────┤
│ ι           │       26131 │
├─────────────┼─────────────┤
│ σ           │       24095 │
├─────────────┼─────────────┤
│ ρ           │       22871 │
├─────────────┼─────────────┤
│ κ           │       22630 │
├─────────────┼─────────────┤
│ π           │       20308 │
├─────────────┼─────────────┤
│ μ           │       19218 │
├─────────────┼─────────────┤
│ λ           │       18228 │
├─────────────┼─────────────┤
│ δ           │       12476 │
╘═════════════╧═════════════╛

**Warning: table truncated!**

Format:  text-transliterated
╒═════════════╤═════════════╕
│ character   │   frequency │
╞═════════════╪═════════════╡
│             │      137779 │
├─────────────┼─────────────┤
│ e           │       93371 │
├─────────────┼─────────────┤
│ o           │       87008 │
├─────────────┼─────────────┤
│ a           │       75119 │
├─────────────┼─────────────┤
│ i           │       62778 │
├─────────────┼─────────────┤
│ t           │       60011 │
├─────────────┼─────────────┤
│ n           │       56230 │
├─────────────┼─────────────┤
│ s           │       52132 │
├─────────────┼─────────────┤
│ u           │       39287 │
├─────────────┼─────────────┤
│ k           │       27300 │
├─────────────┼─────────────┤
│ p           │       25081 │
├─────────────┼─────────────┤
│ r           │       22871 │
├─────────────┼─────────────┤
│ h           │       20033 │
├─────────────┼─────────────┤
│ m           │       19218 │
├─────────────┼─────────────┤
│ l           │       18228 │
╘═════════════╧═════════════╛

**Warning: table truncated!**

Format:  text-unaccented
╒═════════════╤═════════════╕
│ character   │   frequency │
╞═════════════╪═════════════╡
│             │      137779 │
├─────────────┼─────────────┤
│ α           │       75119 │
├─────────────┼─────────────┤
│ ε           │       66656 │
├─────────────┼─────────────┤
│ ο           │       65731 │
├─────────────┼─────────────┤
│ ι           │       62834 │
├─────────────┼─────────────┤
│ ν           │       56230 │
├─────────────┼─────────────┤
│ τ           │       50599 │
├─────────────┼─────────────┤
│ υ           │       39287 │
├─────────────┼─────────────┤
│ ς           │       27090 │
├─────────────┼─────────────┤
│ η           │       26715 │
├─────────────┼─────────────┤
│ σ           │       24095 │
├─────────────┼─────────────┤
│ ρ           │       23046 │
├─────────────┼─────────────┤
│ κ           │       22630 │
├─────────────┼─────────────┤
│ ω           │       21277 │
├─────────────┼─────────────┤
│ π           │       20308 │
╘═════════════╧═════════════╛

**Warning: table truncated!**
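
The counts differ between formats because accented and unaccented variants of a letter are counted as separate characters unless the format removes the diacritics. The sketch below illustrates this effect with Python's standard `unicodedata` module. Whether the unaccented text of the corpus was produced in this way is an assumption; the helper `strip_diacritics` and the sample phrase are hypothetical.

```python
import unicodedata
from collections import Counter


def strip_diacritics(text):
    """Decompose each character and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Hypothetical sample phrase
sample = "ἐν ἀρχῇ ἦν ὁ λόγος"

# Count characters with diacritics (composed form) and without
accented_counts = Counter(unicodedata.normalize("NFC", sample))
unaccented_counts = Counter(strip_diacritics(sample))

for char in ("ο", "η"):
    print(char, "accented text:", accented_counts[char],
          "unaccented text:", unaccented_counts[char])
```

In the unaccented text, all variants of a vowel collapse into the plain letter, so its count can only stay equal or rise.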

## 3.3 - Some stats on node types <a class="anchor" id="bullet3x3"></a>
##### [Back to TOC](#TOC)

In [8]:
C.levels.data

(('book', 5102.925925925926, 137780, 137806),
 ('chapter', 529.9192307692308, 137807, 138066),
 ('verse', 17.345965000629484, 146078, 154020),
 ('sentence', 17.198726750717764, 138067, 146077),
 ('wg', 7.583849727185382, 154021, 267467),
 ('word', 1, 1, 137779))
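
Each tuple in this output holds, in order, the node type, the average number of slots per node, and the first and last node number of that type. Under that reading, the sketch below derives the number of nodes per type and the total number of slots they cover. The variable names are hypothetical and the values are copied from the output above.

```python
# Values copied from the C.levels.data output above
levels = (
    ("book", 5102.925925925926, 137780, 137806),
    ("chapter", 529.9192307692308, 137807, 138066),
    ("verse", 17.345965000629484, 146078, 154020),
    ("sentence", 17.198726750717764, 138067, 146077),
    ("wg", 7.583849727185382, 154021, 267467),
    ("word", 1, 1, 137779),
)

total_slots = 137779  # the last word node is also the number of slots

summary = {}
for node_type, avg_slots, first, last in levels:
    count = last - first + 1             # node numbers are contiguous per type
    covered = round(avg_slots * count)   # slots covered by all nodes of this type
    summary[node_type] = (count, covered)
    percent = 100 * covered / total_slots
    print(f"{node_type:<9}{count:>7} nodes{covered:>8} slots{percent:>6.0f}%")
```

A coverage above 100% means that slots are counted more than once, presumably because nodes of that type can contain one another.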

## 3.4 - The available text formats <a class="anchor" id="bullet3x4"></a>
##### [Back to TOC](#TOC)

This is not really a statistical function, but it is still important in relation to the corpus. The output of this command lists the formats available for presenting the text of the corpus. See also [module tf.advanced.options Display Settings](https://annotation.github.io/text-fabric/tf/advanced/options.html).

In [8]:
N1904.showFormats()

format | level | template
--- | --- | ---
`text-critical` | **word** | `{unicode} `
`text-normalized` | **word** | `{normalized}{after}`
`text-orig-full` | **word** | `{word}{after}`
`text-transliterated` | **word** | `{wordtranslit}{after}`
`text-unaccented` | **word** | `{wordunacc}{after}`


The same result (although formatted differently) can be obtained by the following call:

In [9]:
T.formats

{'text-critical': 'word',
 'text-normalized': 'word',
 'text-orig-full': 'word',
 'text-transliterated': 'word',
 'text-unaccented': 'word'}

Note that this data originates from file `otext.tf`:

> 
```
@config
...
@fmt:text-orig-full={word}{after}
...
```
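
To make the template notation concrete: a template such as `{word}{after}` names the features whose values are concatenated for every word. The sketch below imitates this with Python's `str.format`. It is only an approximation, since Text-Fabric's own template handling may differ, and the feature values in `tokens` are hypothetical.

```python
# Template as listed for text-orig-full
template = "{word}{after}"

# Hypothetical feature values for five consecutive words
tokens = [
    {"word": "Ἐν", "after": " "},
    {"word": "ἀρχῇ", "after": " "},
    {"word": "ἦν", "after": " "},
    {"word": "ὁ", "after": " "},
    {"word": "Λόγος", "after": ", "},
]

# Fill in the template for each word and join the pieces
text = "".join(template.format(**token) for token in tokens)
print(text)
```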


## 3.5 - List of feature frequencies <a class="anchor" id="bullet3x5"></a>
##### [Back to TOC](#TOC)

This code generates a lot of output! For that reason we will cut it off after 5 lines per feature.

In [6]:
FeatureList=Fall()
LinesToPrint=5
for Feature in FeatureList: 
    if Feature!='otype':
        print ('Feature:',Feature,'\n\n\t value\t frequency')
        FeatureFrequenceLists=Fs(Feature).freqList()
        PrintedLine=0
        for item, freq in FeatureFrequenceLists:
            PrintedLine+=1
            print ('\t',item,'\t',freq)
            if PrintedLine==LinesToPrint: break
        print ('\n')

Feature: after 

	 value	 frequency
	   	 119270
	 ,  	 9462
	 .  	 5717
	 ·  	 2359
	 ;  	 971


Feature: book 

	 value	 frequency
	 Luke 	 21785
	 Matthew 	 20529
	 Acts 	 20307
	 John 	 17582
	 Mark 	 12695


Feature: booknumber 

	 value	 frequency
	 3 	 19457
	 5 	 18394
	 1 	 18300
	 4 	 15644
	 2 	 11278


Feature: bookshort 

	 value	 frequency
	 Luke 	 19457
	 Acts 	 18394
	 Matt 	 18300
	 John 	 15644
	 Mark 	 11278


Feature: case 

	 value	 frequency
	  	 58261
	 nominative 	 24197
	 accusative 	 23031
	 genitive 	 19515
	 dative 	 12126


Feature: chapter 

	 value	 frequency
	 1 	 12922
	 2 	 10923
	 3 	 9652
	 4 	 9631
	 5 	 8788


Feature: clausetype 

	 value	 frequency
	  	 102662
	 VerbElided 	 1009
	 Verbless 	 929
	 Minor 	 830


Feature: containedclause 

	 value	 frequency
	  	 8372
	 2 	 148
	 172 	 69
	 97 	 69
	 389 	 68


Feature: degree 

	 value	 frequency
	  	 137266
	 comparative 	 313
	 superlative 	 200


Feature: gloss 

	 value	 frequency
	 the 	 985
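
The cut-off after a fixed number of lines can also be expressed with slicing, which avoids the manual line counter. The following is a minimal sketch on a plain tuple of (value, frequency) pairs. The helper name `head_with_rest` is hypothetical, and the sample pairs are the five shown for the `case` feature above, with the blank value taken to be an empty string.

```python
def head_with_rest(freq_pairs, limit):
    """Return the first `limit` pairs and the number of pairs left out."""
    head = freq_pairs[:limit]
    return head, len(freq_pairs) - len(head)


# The five pairs shown for the 'case' feature above
case_pairs = (
    ("", 58261),
    ("nominative", 24197),
    ("accusative", 23031),
    ("genitive", 19515),
    ("dative", 12126),
)

head, omitted = head_with_rest(case_pairs, 3)
for value, freq in head:
    print("\t", value, "\t", freq)
print(f"\t... {omitted} more value(s) not shown")
```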

## 3.6 - Frequency list of punctuations <a class="anchor" id="bullet3x6"></a>
##### [Back to TOC](#TOC)

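Make a list of punctuation marks with their Unicode code points and frequencies. The rows are printed with `N1904.dm`, a function for displaying markdown-formatted strings. Because every row is displayed by a separate call, the output does not yet form a single table, so the desired result has not yet been achieved.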

In [11]:
result = F.after.freqList()
N1904.dm(" String | Unicode | Frequency\n--- | --- | ---")
for (string, freq) in result:
    # important: for punctuation the string contains two characters; only the first one is used
    frequency=str(freq)             #convert it to a string
    unicode_value = str(ord(string[0])) #convert it to a string
    N1904.dm(" `{}` | {} | {} ".format(string[0],unicode_value,frequency))  

 String | Unicode | Frequency
--- | --- | ---

 ` ` | 32 | 119272 

 `,` | 44 | 9441 

 `.` | 46 | 5712 

 `·` | 183 | 2355 

 `;` | 59 | 969 

 `—` | 8212 | 30 
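
One way to obtain a single table is to collect all rows in one string and display that string with one call. The sketch below builds such a string in plain Python from (string, frequency) pairs copied from the output above. The helper `markdown_table` is hypothetical, and the final display call (for instance a single `N1904.dm(table)`) is left out so that the block runs without the corpus.

```python
def markdown_table(freq_pairs):
    """Build one markdown table from (string, frequency) pairs."""
    lines = ["String | Unicode | Frequency", "--- | --- | ---"]
    for string, freq in freq_pairs:
        char = string[0]  # only the first character is listed
        lines.append(f"`{char}` | {ord(char)} | {freq}")
    return "\n".join(lines)


# Pairs copied from the output above (escapes used for the non-ASCII marks)
pairs = [
    (" ", 119272),
    (",", 9441),
    (".", 5712),
    ("\u00b7", 2355),
    (";", 969),
    ("\u2014", 30),
]

table = markdown_table(pairs)
print(table)
```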

## 3.7 - Node number ranges <a class="anchor" id="bullet3x7"></a>
##### [Back to TOC](#TOC)

The node number range of each node type is readily available: `F.otype.all` holds all node types, and `F.otype.sInterval` returns the first and last node number of a given type.

In [26]:
for NodeType in F.otype.all:
    print (NodeType, F.otype.sInterval(NodeType))

book (137780, 137806)
chapter (137807, 138066)
verse (146078, 154020)
sentence (138067, 146077)
wg (154021, 268899)
word (1, 137779)
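
Because every node type occupies one contiguous range, the type of a node can be looked up from these intervals alone. The sketch below does this with the intervals copied from the output above. The helper `node_type_of` is hypothetical; within Text-Fabric itself, `F.otype.v(node)` gives the type of a node directly.

```python
# Intervals copied from the output above
intervals = {
    "book": (137780, 137806),
    "chapter": (137807, 138066),
    "verse": (146078, 154020),
    "sentence": (138067, 146077),
    "wg": (154021, 268899),
    "word": (1, 137779),
}


def node_type_of(node):
    """Return the node type whose interval contains `node`, or None."""
    for node_type, (first, last) in intervals.items():
        if first <= node <= last:
            return node_type
    return None


for node in (1, 137780, 150000, 200000):
    print(node, node_type_of(node))
```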


## 3.8 - Count the objects per type <a class="anchor" id="bullet3x8"></a>
##### [Back to TOC](#TOC)

Using `F.otype.s`, which yields all nodes of a given type, we can also produce a list with the number of nodes for each type.

In [27]:
for otype in F.otype.all:
    i = 0
    for n in F.otype.s(otype):
        i += 1
    print ("{:>7} {}s".format(i, otype))

     27 books
    260 chapters
   7943 verses
   8011 sentences
 114879 wgs
 137779 words


In [7]:
N1904.showProvenance(...)

## 3.9 - Obtain meta data for a feature <a class="anchor" id="bullet3x9"></a>
##### [Back to TOC](#TOC)

This can be useful if you want to process all features in a script.

In [12]:
# Just print the metadata dictionary of the feature
FeatureName='word'
MetaData=Fs(FeatureName).meta
print (MetaData)

{'Availability': 'Creative Commons Attribution 4.0 International (CC BY 4.0)', 'Converter_author': 'Tony Jurg, ReMa Student Vrije Universiteit Amsterdam, Netherlands', 'Converter_execution': 'Tony Jurg, ReMa Student Vrije Universiteit Amsterdam, Netherlands', 'Converter_version': '0.3', 'Convertor_source': 'https://github.com/tonyjurg/Nestle1904LFT/tree/main/tools', 'Data source': 'MACULA Greek Linguistic Datasets, available at https://github.com/Clear-Bible/macula-greek/tree/main/Nestle1904/lowfat', 'Editors': 'Eberhard Nestle', 'Name': 'Greek New Testament (Nestle 1904 based on Low Fat Tree)', 'TextFabric version': '11.4.10', 'description': 'Word as it appears in the text (excl. punctuations)', 'valueType': 'str', 'writtenBy': 'Text-Fabric', 'dateWritten': '2023-06-19T15:13:46Z'}


Now perform a very basic check on this data:

In [13]:
print ('feature ',FeatureName, end='')
if MetaData['valueType']=='str':
    print (' is of type str.')
else:
    print (' is not of type str.')

feature  word is of type str.
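
When many features are processed, the same check can be applied to each metadata dictionary in turn. The sketch below groups feature names by their `valueType`. To stay runnable without the corpus, it uses a hand-written dictionary in place of the result of `Fs(name).meta`; the helper `group_by_value_type` is hypothetical, and the `int` entry for `booknumber` is an assumption made for illustration only.

```python
from collections import defaultdict

# Hand-written stand-in for {name: Fs(name).meta}; the values are assumptions
meta_by_feature = {
    "word": {"valueType": "str"},
    "after": {"valueType": "str"},
    "booknumber": {"valueType": "int"},
}


def group_by_value_type(meta_by_feature):
    """Group feature names by the 'valueType' entry of their metadata."""
    groups = defaultdict(list)
    for name, meta in meta_by_feature.items():
        groups[meta.get("valueType", "unknown")].append(name)
    return dict(groups)


groups = group_by_value_type(meta_by_feature)
for value_type, names in sorted(groups.items()):
    print(value_type, ":", ", ".join(sorted(names)))
```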


# Trying the various formats

In [None]:
# 'node' must hold a valid node number before this cell is run
origText=T.text(node,fmt='text-orig-full')
critText=T.text(node,fmt='text-critical')

        'fmt:text-orig-full':     '{word}{after}',
        'fmt:text-normalized':    '{normalized}{after}',
        'fmt:text-unaccented':    '{wordunacc}{after}',
        'fmt:text-transliterated':'{wordtranslit}{after}', 
        'fmt:text-critical':  