# Some corpus statistics (Nestle1904LFT)

**Work in progress!**

## Table of contents <a class="anchor" id="TOC"></a>
* <a href="#bullet1">1 - Introduction</a>
* <a href="#bullet2">2 - Load Text-Fabric app and data</a>
* <a href="#bullet3">3 - Performing the queries</a>
    * <a href="#bullet3x1">3.1 - The 25 most frequent words in the corpus</a>
    * <a href="#bullet3x2">3.2 - Frequency of characters in corpus</a>
    * <a href="#bullet3x3">3.3 - Some stats on node types</a>    
    * <a href="#bullet3x4">3.4 - The available text formats</a>    
    * <a href="#bullet3x5">3.5 - List of feature frequencies</a> 
    * <a href="#bullet3x6">3.6 - Frequency list of punctuations</a>
    * <a href="#bullet3x7">3.7 - Node number ranges</a>
    * <a href="#bullet3x8">3.8 - Count the objects per type</a>

# 1 - Introduction <a class="anchor" id="bullet1"></a>
##### [Back to TOC](#TOC)

This Jupyter notebook shows several examples of statistical analysis performed on a Text-Fabric corpus. For demonstration purposes, it uses a variety of methods for collecting and presenting the data.

# 2 - Load Text-Fabric app and data <a class="anchor" id="bullet2"></a>
##### [Back to TOC](#TOC)

In [1]:
%load_ext autoreload
%autoreload 2

In [15]:
# Loading the Text-Fabric code
# Note: it is assumed Text-Fabric is installed in your environment
from tf.fabric import Fabric
from tf.app import use

In [19]:
# load the N1904 app and data
N1904 = use("tonyjurg/Nestle1904LFT", version="0.3", hoist=globals())

**Locating corpus resources ...**

The requested app is not available offline
	~/text-fabric-data/github/tonyjurg/Nestle1904LFT/app not found


findAppClass: invalid syntax (~/text-fabric-data/github/tonyjurg/Nestle1904LFT/app/app.py, line 5)


findAppClass: Api for "tonyjurg/Nestle1904LFT" not loaded
The requested data is not available offline
	~/text-fabric-data/github/tonyjurg/Nestle1904LFT/tf/0.3 not found


Name,# of nodes,# slots/node,% coverage
book,27,5102.93,100
chapter,260,529.92,100
verse,7943,17.35,100
sentence,8011,17.2,100
wg,114879,7.6,633
word,137779,1.0,100


In [None]:
# The following will push the Text-Fabric stylesheet to this notebook (to facilitate proper display with notebook viewer)
N1904.dh(N1904.getCss())

# 3 - Performing the queries <a class="anchor" id="bullet3"></a>
##### [Back to TOC](#TOC)

## 3.1 - The 25 most frequent words in the corpus<a class="anchor" id="bullet3x1"></a>
##### [Back to TOC](#TOC)

The method [`freqList`](https://annotation.github.io/text-fabric/tf/core/nodefeature.html#tf.core.nodefeature.NodeFeature.freqList) returns a tuple of (value, frequency) items, ordered by frequency, with the highest frequencies first.

In [20]:
print("Amount\tword")
# freqList expects a set of node types; only word nodes are counted here
for (w, amount) in F.word.freqList({"word"})[0:25]:
    print(f"{amount}\t{w}")

Amount	word
8541	καὶ
2768	ὁ
2683	ἐν
2620	δὲ
2497	τοῦ
1755	εἰς
1657	τὸ
1556	τὸν
1518	τὴν
1410	αὐτοῦ
1300	τῆς
1281	ὅτι
1221	τῷ
1201	τῶν
1068	οἱ
941	ἡ
921	γὰρ
902	μὴ
859	τῇ
849	αὐτῷ
817	τὰ
767	οὐκ
722	τοὺς
688	Θεοῦ
670	πρὸς
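
To make the shape of that return value concrete, the following sketch builds a comparable structure with `collections.Counter` on a hand-made word list. It does not call Text-Fabric: the `words` list and the helper name `freq_list` are illustrative assumptions, and the way the real method orders values with equal frequency may differ.

```python
from collections import Counter


def freq_list(values):
    """Return (value, frequency) pairs, highest frequency first.

    Hypothetical stand-in that only imitates the shape of a frequency list;
    ties are broken by value so the result is deterministic.
    """
    counts = Counter(values)
    return tuple(sorted(counts.items(), key=lambda item: (-item[1], item[0])))


# Toy data, not taken from the corpus
words = ["καὶ", "ὁ", "καὶ", "ἐν", "καὶ", "ὁ"]

print("Amount\tword")
for word, amount in freq_list(words)[:2]:
    print(f"{amount}\t{word}")
```

Slicing the tuple, as the cell above does with `[0:25]`, keeps only the most frequent entries.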


## 3.2 - Frequency of characters in corpus <a class="anchor" id="bullet3x2"></a>
##### [Back to TOC](#TOC)

This code generates a table that displays the frequency of characters within the Text-Fabric corpus. The API call `C.characters.data` produces a Python dictionary that contains the data. The remaining code unpacks and sorts this structure to present the results in a formatted table.

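Note that the first line of the output is 'Format:  text-orig-full'. This is the name of the text format for which the characters were counted (see section 3.4 for the available formats).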

In [21]:
# Library to format table
from tabulate import tabulate

# The following API call will result in a Python dictionary structure
FrequencyDictionary=C.characters.data

# Present the results
KeyList = list(FrequencyDictionary.keys())
for Key in KeyList:
    print('Format: ',Key)
    # Key is the name of the pre-defined text format to which these counts apply
    FrequencyList=FrequencyDictionary[Key]
    SortedFrequencyList=sorted(FrequencyList, key=lambda x: x[1], reverse=True)
    
    # In this example the table will be truncated to the first 15 entries
    max_rows = 15  # Set your desired number of rows here
    TruncatedTable = SortedFrequencyList[:max_rows]
    
    headers = ["character", "frequency"]
    print(tabulate(TruncatedTable, headers=headers, tablefmt='fancy_grid'))
    
    # Add a warning using markdown (API call N1904.dm) so that it can be printed in bold type
    N1904.dm("**Warning: table truncated!**")

Format:  text-orig-full
╒═════════════╤═════════════╕
│ character   │   frequency │
╞═════════════╪═════════════╡
│             │      137779 │
├─────────────┼─────────────┤
│ ν           │       56230 │
├─────────────┼─────────────┤
│ α           │       51892 │
├─────────────┼─────────────┤
│ τ           │       50599 │
├─────────────┼─────────────┤
│ ο           │       45151 │
├─────────────┼─────────────┤
│ ε           │       38597 │
├─────────────┼─────────────┤
│ ς           │       27090 │
├─────────────┼─────────────┤
│ ι           │       26131 │
├─────────────┼─────────────┤
│ σ           │       24095 │
├─────────────┼─────────────┤
│ ρ           │       22871 │
├─────────────┼─────────────┤
│ κ           │       22630 │
├─────────────┼─────────────┤
│ π           │       20308 │
├─────────────┼─────────────┤
│ μ           │       19218 │
├─────────────┼─────────────┤
│ λ           │       18228 │
├─────────────┼─────────────┤
│ δ           │       12476 │
╘═════════════╧═════════════╛

**Warning: table truncated!**
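
The sort-and-truncate step can also be tried without a loaded corpus. The sketch below is a minimal stand-alone version that counts the characters of a short sample string with `collections.Counter` and reports a warning only when rows were actually dropped. The helper name `top_characters` and the sample text are assumptions for illustration; the notebook cell itself takes its counts from `C.characters.data`.

```python
from collections import Counter


def top_characters(text, max_rows):
    """Return the max_rows most frequent characters and a truncation flag."""
    ranked = sorted(Counter(text).items(), key=lambda item: (-item[1], item[0]))
    return ranked[:max_rows], len(ranked) > max_rows


# Toy sample, not corpus data
sample = "ἐν ἀρχῇ ἦν ὁ λόγος"
rows, truncated = top_characters(sample, max_rows=3)

for character, frequency in rows:
    print(f"{character!r}\t{frequency}")
if truncated:
    print("Warning: table truncated!")
```

Returning the flag separately keeps the decision to warn next to the code that prints the table.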

## 3.3 - Some stats on node types <a class="anchor" id="bullet3x3"></a>
##### [Back to TOC](#TOC)

In [22]:
C.levels.data

(('book', 5102.925925925926, 137780, 137806),
 ('chapter', 529.9192307692308, 137807, 138066),
 ('verse', 17.345965000629484, 146078, 154020),
 ('sentence', 17.198726750717764, 138067, 146077),
 ('wg', 7.597385074730804, 154021, 268899),
 ('word', 1, 1, 137779))
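
Each tuple holds the node type, the average number of slots per node, and the first and last node number of that type. As a rough sketch of how these numbers relate to the table printed when the corpus was loaded, the code below derives the node count and the coverage percentage from tuples copied from the output above. The helper name `summarize` and the coverage formula (count times average slots, divided by the total number of slots) are assumptions chosen to match that table, not a description of Text-Fabric internals.

```python
# Tuples copied from the C.levels.data output above:
# (node type, average slots per node, first node, last node)
levels = (
    ("book", 5102.925925925926, 137780, 137806),
    ("chapter", 529.9192307692308, 137807, 138066),
    ("verse", 17.345965000629484, 146078, 154020),
    ("sentence", 17.198726750717764, 138067, 146077),
    ("wg", 7.597385074730804, 154021, 268899),
    ("word", 1, 1, 137779),
)

# The word nodes are the slots of this corpus
total_slots = 137779


def summarize(levels, total_slots):
    """Derive (type, node count, coverage %) from the level tuples."""
    rows = []
    for name, avg_slots, first, last in levels:
        count = last - first + 1  # node ranges are inclusive
        # Assumed formula: slots covered by all nodes of this type vs. all slots
        coverage = round(count * avg_slots / total_slots * 100)
        rows.append((name, count, coverage))
    return rows


for name, count, coverage in summarize(levels, total_slots):
    print(f"{name:<10}{count:>8}{coverage:>6}%")
```

For these tuples the sketch gives 27 books and 114879 word groups, with a coverage of 633% for `wg`, which agrees with the load-time table. A value above 100% is consistent with word groups being nested inside one another.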

## 3.4 - The available text formats <a class="anchor" id="bullet3x4"></a>
##### [Back to TOC](#TOC)

This is not strictly a statistical function, but it is still relevant to the corpus. The output of this command lists the formats available for presenting the text of the corpus. See also [module tf.advanced.options, Display Settings](https://annotation.github.io/text-fabric/tf/advanced/options.html).

In [23]:
N1904.showFormats()

format | level | template
--- | --- | ---
`text-orig-full` | **word** | `{word}{after}`


The same result (although formatted differently) can be obtained with the following call:

In [24]:
T.formats

{'text-orig-full': 'word'}

Note that this data originates from file `otext.tf`:

```
@config
...
@fmt:text-orig-full={word}{after}
...
```
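
To illustrate what a template such as `{word}{after}` amounts to, the following sketch fills it in with Python's own `str.format` for a few made-up word nodes. This only imitates the idea: the feature values are invented toy data, the helper name `render` is hypothetical, and Text-Fabric's own template handling may support more than plain substitution.

```python
# Template copied from the otext.tf fragment above
template = "{word}{after}"

# Invented feature values for three consecutive word nodes (toy data)
nodes = [
    {"word": "ἐν", "after": " "},
    {"word": "ἀρχῇ", "after": " "},
    {"word": "ἦν", "after": ", "},
]


def render(template, nodes):
    """Fill the template for every node and concatenate the results."""
    return "".join(template.format(**features) for features in nodes)


text = render(template, nodes)
print(repr(text))
```

Because the `after` feature carries the spacing and punctuation, concatenating the filled-in templates gives running text without any extra joining logic.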


## 3.5 - List of feature frequencies <a class="anchor" id="bullet3x5"></a>
##### [Back to TOC](#TOC)

This code generates a lot of output! For that reason we will cut it off after 5 lines per feature.

In [25]:
FeatureList=Fall()
LinesToPrint=5
for Feature in FeatureList:
    if Feature=='otype': continue # this feature needs to be skipped.
    print ('Feature:',Feature,'\n\n\t value\t frequency')
    FeatureFrequenceLists=Fs(Feature).freqList()
    PrintedLine=0
    for item, freq in FeatureFrequenceLists:
        PrintedLine+=1
        print ('\t',item,'\t',freq)
        if PrintedLine==LinesToPrint: break
    print ('\n')

Feature: after 

	 value	 frequency
	   	 119272
	 ,  	 9441
	 .  	 5712
	 ·  	 2355
	 ;  	 969


Feature: appos 

	 value	 frequency
	  	 46169
	 modifier-scope 	 29645
	 wrapper-clause-scope 	 12166
	 wrapper-scope 	 11264
	 conjuncted-wg 	 8075


Feature: book 

	 value	 frequency
	 Acts 	 1
	 Colossians 	 1
	 Ephesians 	 1
	 Galatians 	 1
	 Hebrews 	 1


Feature: book_long 

	 value	 frequency
	 Luke 	 19456
	 Acts 	 18393
	 Matthew 	 18299
	 John 	 15643
	 Mark 	 11277


Feature: booknumber 

	 value	 frequency
	 3 	 19457
	 5 	 18394
	 1 	 18300
	 4 	 15644
	 2 	 11278


Feature: bookshort 

	 value	 frequency
	 Luke 	 19457
	 Acts 	 18394
	 Matt 	 18300
	 John 	 15644
	 Mark 	 11278


Feature: case 

	 value	 frequency
	  	 58261
	 nominative 	 24197
	 accusative 	 23031
	 genitive 	 19515
	 dative 	 12126


Feature: chapter 

	 value	 frequency
	 1 	 12868
	 2 	 10923
	 3 	 9652
	 4 	 9631
	 5 	 8788


Feature: clausetype 

	 value	 frequency
	  	 112036
	 VerbElided 	 1050
	 V

## 3.6 - Frequency list of punctuations <a class="anchor" id="bullet3x6"></a>
##### [Back to TOC](#TOC)

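This cell lists the punctuation marks found in the `after` feature together with their Unicode code points (in decimal) and their frequencies. The output is written with `N1904.dm`, which prints markdown-formatted strings. The desired result has not yet been achieved: the rows are not rendered as a single table, most likely because each row is sent in a separate call.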

In [11]:
result = F.after.freqList()
N1904.dm(" String | Unicode | Frequency\n--- | --- | ---")
for (string, freq) in result:
    # important: for punctuation the string contains two characters, so only the first one is used
    frequency=str(freq)             #convert it to a string
    unicode_value = str(ord(string[0])) #convert it to a string
    N1904.dm(" `{}` | {} | {} ".format(string[0],unicode_value,frequency))  

 String | Unicode | Frequency
--- | --- | ---

 ` ` | 32 | 119272 

 `,` | 44 | 9441 

 `.` | 46 | 5712 

 `·` | 183 | 2355 

 `;` | 59 | 969 

 `—` | 8212 | 30 
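
One possible way to get a single table is to assemble all rows into one markdown string first. The sketch below does this for a few (value, frequency) pairs copied from the output above. The helper name `markdown_table` is hypothetical, and the trailing space in the punctuation values is assumed from the comment in the cell above.

```python
def markdown_table(freq_pairs):
    """Build one markdown table string from (string, frequency) pairs."""
    lines = ["String | Unicode | Frequency", "--- | --- | ---"]
    for string, freq in freq_pairs:
        if not string:
            continue  # an empty value has no first character to report
        first = string[0]  # punctuation values carry a trailing space
        lines.append(f"`{first}` | {ord(first)} | {freq}")
    return "\n".join(lines)


# Frequencies copied from the output above; trailing spaces are assumed
pairs = [(" ", 119272), (", ", 9441), (". ", 5712)]
table = markdown_table(pairs)
print(table)
```

In the notebook, the resulting string could then be passed to `N1904.dm` in one call instead of one call per row.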

## 3.7 - Node number ranges <a class="anchor" id="bullet3x7"></a>
##### [Back to TOC](#TOC)

The node number ranges are readily available: `F.otype.all` returns a list of all node types, and `F.otype.sInterval` gives the first and last node number for a given type.

In [26]:
for NodeType in F.otype.all:
    print (NodeType, F.otype.sInterval(NodeType))

book (137780, 137806)
chapter (137807, 138066)
verse (146078, 154020)
sentence (138067, 146077)
wg (154021, 268899)
word (1, 137779)


## 3.8 - Count the objects per type <a class="anchor" id="bullet3x8"></a>
##### [Back to TOC](#TOC)

Using the same list of node types, we can also count the number of nodes of each type.

In [27]:
for otype in F.otype.all:
    i = 0
    for n in F.otype.s(otype):
        i += 1
    print ("{:>7} {}s".format(i, otype))

     27 books
    260 chapters
   7943 verses
   8011 sentences
 114879 wgs
 137779 words


In [14]:
N1904.showProvenance(...)