# Some corpus statistics (Nestle1904GBI)

## Table of contents <a class="anchor" id="TOC"></a>
* <a href="#bullet1">1 - Introduction</a>
* <a href="#bullet2">2 - Load Text-Fabric app and data</a>
* <a href="#bullet3">3 - Performing the queries</a>
    * <a href="#bullet3x1">3.1 - The 25 most frequent words in the corpus</a>
    * <a href="#bullet3x2">3.2 - Frequency of characters in corpus</a>
    * <a href="#bullet3x3">3.3 - Some stats on node types</a>    
    * <a href="#bullet3x4">3.4 - The available text formats</a>    
    * <a href="#bullet3x5">3.5 - List of feature frequencies</a> 
    * <a href="#bullet3x6">3.6 - Frequency list of punctuations</a>
    * <a href="#bullet3x7">3.7 - Node number ranges</a>
    * <a href="#bullet3x8">3.8 - Count the objects per type</a>

# 1 - Introduction <a class="anchor" id="bullet1"></a>
##### [Back to TOC](#TOC)

This Jupyter Notebook showcases several examples of statistical analysis performed on a Text-Fabric corpus. For demonstration purposes, it uses a variety of methods to collect and present the data.

# 2 - Load Text-Fabric app and data <a class="anchor" id="bullet2"></a>
##### [Back to TOC](#TOC)

In [1]:
%load_ext autoreload
%autoreload 2

In [3]:
# Loading the New Testament Text-Fabric code
# Note: it is assumed Text-Fabric is installed in your environment.

from tf.fabric import Fabric
from tf.app import use

In [4]:
# load the app and data
N1904 = use ("tonyjurg/Nestle1904GBI:latest", hoist=globals())

**Locating corpus resources ...**

The requested data is not available offline
	~/text-fabric-data/github/tonyjurg/Nestle1904GBI/tf/0.3 not found


Name | # of nodes | # slots/node | % coverage
--- | --- | --- | ---
book | 27 | 5102.93 | 100
chapter | 260 | 529.92 | 100
sentence | 5720 | 24.09 | 100
verse | 7943 | 17.35 | 100
clause | 16124 | 8.54 | 100
phrase | 72674 | 1.9 | 100
word | 137779 | 1.0 | 100


# 3 - Performing the queries <a class="anchor" id="bullet3"></a>

## 3.1 - The 25 most frequent words in the corpus<a class="anchor" id="bullet3x1"></a>
##### [Back to TOC](#TOC)

The method [`freqList`](https://annotation.github.io/text-fabric/tf/core/nodefeature.html#tf.core.nodefeature.NodeFeature.freqList) returns a tuple of (value, frequency) pairs, ordered by frequency, highest frequencies first.

In [42]:
print("Amount\tword")
# Pass the node types as a set, so only nodes of type word are counted
for (w, amount) in F.word.freqList({"word"})[0:25]:
    print(f"{amount}\t{w}")

Amount	word
8541	καὶ
2768	ὁ
2683	ἐν
2620	δὲ
2497	τοῦ
1755	εἰς
1657	τὸ
1556	τὸν
1518	τὴν
1410	αὐτοῦ
1300	τῆς
1281	ὅτι
1221	τῷ
1201	τῶν
1068	οἱ
941	ἡ
921	γὰρ
902	μὴ
859	τῇ
849	αὐτῷ
817	τὰ
767	οὐκ
722	τοὺς
688	Θεοῦ
670	πρὸς
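
To make the ordering of such a result concrete, the sketch below rebuilds a frequency list in plain Python from a small, made-up token list. The helper name `freq_list` and the sample tokens are assumptions for illustration only. This is not how Text-Fabric computes `freqList` internally, and the tie-break on value is a choice of this sketch.

```python
from collections import Counter

def freq_list(values):
    """Return (value, frequency) pairs, highest frequency first."""
    counts = Counter(values)
    # Sort by descending frequency; ties are broken by value (a choice of this sketch)
    return tuple(sorted(counts.items(), key=lambda item: (-item[1], item[0])))

# Hypothetical sample tokens, not taken from the corpus
tokens = ["καὶ", "ὁ", "καὶ", "ἐν", "ὁ", "καὶ"]

# Print the two most frequent tokens in the same layout as the cell above
for value, amount in freq_list(tokens)[0:2]:
    print(f"{amount}\t{value}")
```

Slicing the returned tuple, as in `[0:2]` here or `[0:25]` in the cell above, keeps only the top entries.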


## 3.2 - Frequency of characters in corpus <a class="anchor" id="bullet3x2"></a>
##### [Back to TOC](#TOC)

This code generates a table that displays the frequency of characters within the Text-Fabric corpus. The API call `C.characters.data` produces a Python dictionary structure that contains the data. The remaining code unpacks and sorts this structure to present the results in a formatted table.

Note that the first line of the output is 'Format:  text-orig-full'. This is the dictionary key printed by the code: the name of the text format to which the character frequencies belong.

In [43]:
# Library to format table
from tabulate import tabulate

# The following API call will result in a Python dictionary structure
FrequencyDictionary=C.characters.data

# Present the results
KeyList = list(FrequencyDictionary.keys())
for Key in KeyList:
    print('Format: ',Key)
    FrequencyList=FrequencyDictionary[Key]
    SortedFrequencyList=sorted(FrequencyList, key=lambda x: x[1], reverse=True)
    headers = ["character", "frequency"]
    print(tabulate(SortedFrequencyList, headers=headers, tablefmt='fancy_grid'))

Format:  text-orig-full
╒═════════════╤═════════════╕
│ character   │   frequency │
╞═════════════╪═════════════╡
│             │      137779 │
├─────────────┼─────────────┤
│ ν           │       56230 │
├─────────────┼─────────────┤
│ α           │       51892 │
├─────────────┼─────────────┤
│ τ           │       50599 │
├─────────────┼─────────────┤
│ ο           │       45151 │
├─────────────┼─────────────┤
│ ε           │       38597 │
├─────────────┼─────────────┤
│ ς           │       27090 │
├─────────────┼─────────────┤
│ ι           │       26131 │
├─────────────┼─────────────┤
│ σ           │       24095 │
├─────────────┼─────────────┤
│ ρ           │       22871 │
├─────────────┼─────────────┤
│ κ           │       22630 │
├─────────────┼─────────────┤
│ π           │       20308 │
├─────────────┼─────────────┤
│ μ           │       19218 │
├─────────────┼─────────────┤
│ λ           │       18228 │
├─────────────┼─────────────┤
│ δ           │       12476 │
├─────────────┼─────────────┤
│ ...         │         ... │
╘═════════════╧═════════════╛

## 3.3 - Some stats on node types <a class="anchor" id="bullet3x3"></a>
##### [Back to TOC](#TOC)

In [44]:
C.levels.data

(('book', 5102.925925925926, 137780, 137806),
 ('chapter', 529.9192307692308, 137807, 138066),
 ('sentence', 24.087237762237763, 226865, 232584),
 ('verse', 17.345965000629484, 232585, 240527),
 ('clause', 8.54496402877698, 138067, 154190),
 ('phrase', 1.8958499600957701, 154191, 226864),
 ('word', 1, 1, 137779))
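
Each tuple holds the node type, the average number of slots per node, and the first and last node number of that type. As a small plausibility check, the sketch below copies the output above into plain Python, derives the number of nodes per type from the node ranges, and multiplies it by the average size. The variable names are choices of this sketch, not part of the Text-Fabric API.

```python
# Output of C.levels.data as shown above:
# (node type, average slots per node, first node, last node)
levels = (
    ('book', 5102.925925925926, 137780, 137806),
    ('chapter', 529.9192307692308, 137807, 138066),
    ('sentence', 24.087237762237763, 226865, 232584),
    ('verse', 17.345965000629484, 232585, 240527),
    ('clause', 8.54496402877698, 138067, 154190),
    ('phrase', 1.8958499600957701, 154191, 226864),
    ('word', 1, 1, 137779),
)

TOTAL_SLOTS = 137779  # the word nodes are the slots of this corpus

node_counts = {}
for node_type, avg_slots, first, last in levels:
    # The range is inclusive, hence the + 1
    count = last - first + 1
    node_counts[node_type] = count
    # Average size times number of nodes gives the slots covered by this type
    covered = round(avg_slots * count)
    print(f"{node_type:<9}{count:>7} nodes, {covered} slots covered")
```

For every node type the product equals the total number of slots, which agrees with the 100% coverage reported when the corpus was loaded.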

## 3.4 - The available text formats <a class="anchor" id="bullet3x4"></a>
##### [Back to TOC](#TOC)

This is not strictly a statistical function, but it is still important in relation to the corpus. The output of this command lists the formats that are available to present the text of the corpus.

In [19]:
N1904.showFormats()

format | level | template
--- | --- | ---
`text-orig-full` | **word** | `{word}{after}`


Note that this data is taken from file `otext.tf`:

```
@config
...
@fmt:text-orig-full={word}{after}
...
```
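
The template `{word}{after}` can be read as: for every word node, emit the value of the feature `word` followed by the value of the feature `after`. The sketch below mimics that idea with Python's `str.format`. The feature values are hypothetical samples, and the code only illustrates the concept; it is not Text-Fabric's own rendering code.

```python
# Template taken from the @fmt line in otext.tf
TEMPLATE = "{word}{after}"

# Hypothetical feature values for three consecutive word nodes
sample_words = [
    {"word": "καὶ", "after": " "},
    {"word": "τοῦ", "after": " "},
    {"word": "Θεοῦ", "after": ", "},
]

# Fill in the template for each word and concatenate the results
text = "".join(TEMPLATE.format(**features) for features in sample_words)
print(text)
```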


## 3.5 - List of feature frequencies <a class="anchor" id="bullet3x5"></a>
##### [Back to TOC](#TOC)

This code prints the frequency list of every node feature in the corpus, so it generates a lot of output!

In [None]:
# Fall() lists the names of all node features
FeatureList = Fall()
for Feature in FeatureList:
    print('Feature:', Feature, '\n\n\t value\t frequency')
    # Fs(name) returns the feature object for the given name
    FeatureFrequencyList = Fs(Feature).freqList()
    for item, freq in FeatureFrequencyList:
        print('\t', item, '\t', freq)
    print('\n')

## 3.6 - Frequency list of punctuations <a class="anchor" id="bullet3x6"></a>
##### [Back to TOC](#TOC)

List the punctuation marks with their Unicode code points. The function `dm` renders a markdown-formatted string; because every call is rendered separately, the rows are collected first and passed to `dm` in a single call, so that they form one table.

In [41]:
result = F.after.freqList()
# Collect the header and all rows, then render them as one markdown table
rows = ["String | Unicode | Frequency", "--- | --- | ---"]
for (string, freq) in result:
    # important: for punctuation the string holds two characters (the mark and a space);
    # only the first character is reported
    rows.append("`{}` | {} | {}".format(string[0], ord(string[0]), freq))
N1904.dm("\n".join(rows))

String | Unicode | Frequency
--- | --- | ---
` ` | 32 | 119272
`,` | 44 | 9441
`.` | 46 | 5712
`·` | 183 | 2355
`;` | 59 | 969
`—` | 8212 | 30
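
The decimal code points are easier to interpret with their official Unicode names. The sketch below uses `unicodedata` from the Python standard library on the values copied from the table above, and also adds up the frequencies. The variable names are choices of this sketch.

```python
import unicodedata

# (code point, frequency) pairs copied from the table above
punctuation = [(32, 119272), (44, 9441), (46, 5712), (183, 2355), (59, 969), (8212, 30)]

for code_point, freq in punctuation:
    # Look up the official Unicode character name for the code point
    name = unicodedata.name(chr(code_point))
    print(f"U+{code_point:04X}\t{name}\t{freq}")

# Sum of all frequencies in the table
total = sum(freq for _, freq in punctuation)
print(total)
```

The frequencies add up to 137779, the number of word nodes reported when the corpus was loaded.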

## 3.7 - Node number ranges <a class="anchor" id="bullet3x7"></a>
##### [Back to TOC](#TOC)

The node number ranges are readily available: `F.otype.all` holds all node types, and `F.otype.sInterval(NodeType)` returns the first and last node number of a given type.

In [23]:
for NodeType in F.otype.all:
    print (NodeType, F.otype.sInterval(NodeType))

book (137780, 137806)
chapter (137807, 138066)
sentence (227738, 233457)
verse (233458, 241401)
clause (138067, 154190)
phrase (154191, 227737)
word (1, 137779)
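
Because every node type occupies one contiguous range, the type of a node follows directly from its number. The sketch below shows this with the ranges copied from the output above. The helper `node_type_of` is hypothetical and only illustrates the idea; within Text-Fabric itself, `F.otype.v(node)` gives the type of a node.

```python
# Ranges copied from the output above: node type -> (first node, last node)
ranges = {
    "book": (137780, 137806),
    "chapter": (137807, 138066),
    "sentence": (227738, 233457),
    "verse": (233458, 241401),
    "clause": (138067, 154190),
    "phrase": (154191, 227737),
    "word": (1, 137779),
}

def node_type_of(node):
    """Return the node type whose inclusive range contains the node, or None."""
    for node_type, (first, last) in ranges.items():
        if first <= node <= last:
            return node_type
    return None

# A few sample node numbers, including both ends of the word range
for node in (1, 137779, 137780, 154191):
    print(node, node_type_of(node))
```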


## 3.8 - Count the objects per type <a class="anchor" id="bullet3x8"></a>
##### [Back to TOC](#TOC)

The same attribute, `F.otype.all`, also lets us produce another list that counts the number of nodes of each type.

In [9]:
for otype in F.otype.all:
    # Count the nodes of this type
    count = sum(1 for _ in F.otype.s(otype))
    print("{:>7} {}s".format(count, otype))

     27 books
    260 chapters
   5720 sentences
   7943 verses
  16124 clauses
  72674 phrases
 137779 words
