# Statistics for parashot (BHSA)

## Table of Contents<a class="anchor" id="TOC"></a> (ToC)

* <a href="#bullet1">1 - Introduction</a>
* <a href="#bullet2">2 - Load Text-Fabric app and data</a>
* <a href="#bullet3">3 - Performing the queries</a>
    * <a href="#bullet3x1">3.1 Locate the parallels</a>
* <a href="#bullet4">4 - Required libraries</a>
* <a href="#bullet5">5 - Further reading</a>
* <a href="#bullet6">6 - Notebook version details</a>

# 1 - Introduction <a class="anchor" id="bullet1"></a>
##### [Back to ToC](#TOC)

In this notebook we compute basic statistics for the parashot of the Torah: the number of verses, sentences, clauses and words per parasha, compared with the totals for the whole Torah.

# 2 - Load Text-Fabric app and data <a class="anchor" id="bullet2"></a>
##### [Back to ToC](#TOC)

The following code will load the Text-Fabric version of the [Biblia Hebraica Stuttgartensia (Amstelodamensis)](https://etcbc.github.io/bhsa/).

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
# Loading the Text-Fabric code
# Note: it is assumed Text-Fabric is installed in your environment.
from tf.fabric import Fabric
from tf.app import use

In [3]:
# Load the app and data; hoist=globals() makes the API handles (F, T, ...) available as globals
BHSA = use("etcbc/BHSA", mod="tonyjurg/BHSaddons/tf/:hot", hoist=globals())

**Locating corpus resources ...**

Could not get rate limit details
unexpected error from github.GithubException: 401 {"message": "Bad credentials", "documentation_url": "https:/docs.github.com/rest", "status": "401"}
	connecting to online GitHub repo tonyjurg/BHSaddons ... failed
unexpected error from github.GithubException: 401 {"message": "Bad credentials", "documentation_url": "https:/docs.github.com/rest", "status": "401"}
The offline data may not be the latest


| Name | # of nodes | # slots / node | % coverage |
|---|---|---|---|
| book | 39 | 10938.21 | 100 |
| chapter | 929 | 459.19 | 100 |
| lex | 9230 | 46.22 | 100 |
| verse | 23213 | 18.38 | 100 |
| half_verse | 45179 | 9.44 | 100 |
| sentence | 63717 | 6.7 | 100 |
| sentence_atom | 64514 | 6.61 | 100 |
| clause | 88131 | 4.84 | 100 |
| clause_atom | 90704 | 4.7 | 100 |
| phrase | 253203 | 1.68 | 100 |




- clauses: feature `typ` (node feature, string values): clause/phrase(atom) type (VP; NP; Ellp; Ptcp; WayX); example values: NmCl, Way0, InfC, WayX
- check named entities: feature `nametype`: https://github.com/ETCBC/bhsa/blob/master/docs/features/nametype.md
- feature `kind`, present on objects of type clause: https://github.com/ETCBC/bhsa/blob/master/docs/features/kind.md


# Verses, words, clauses and letters

In [4]:
# Find all words with a word boundary in this parasha, nested in clause, sentence and verse;
# the parasha is selected by its sequence number
parashaQuery = '''
verse parashanum=1
  sentence
     clause
        word wordboundary=1
'''
parashaResults = BHSA.search(parashaQuery)

  0.39s 1858 results


In [5]:
import pandas as pd

# Convert list of tuples to DataFrame
parashaDF = pd.DataFrame(parashaResults,columns=['verse','sentence','clause','word'])

In [6]:
# Get unique value counts for each column
parashaUniqueCounts = parashaDF.nunique()
print("Unique value counts per column:")
print(parashaUniqueCounts)

Unique value counts per column:
verse        145
sentence     397
clause       521
word        1858
dtype: int64
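
To make explicit what the unique counts above measure, here is a minimal sketch of the same column-wise distinct count in plain Python. The node numbers are hypothetical toy values, not real BHSA nodes; each row stands for one `(verse, sentence, clause, word)` result tuple.

```python
# Hypothetical result tuples: (verse, sentence, clause, word) node numbers
rows = [
    (10, 20, 30, 1),
    (10, 20, 30, 2),
    (10, 20, 31, 3),
    (10, 21, 32, 4),
    (11, 22, 33, 5),
]
column_names = ["verse", "sentence", "clause", "word"]

# zip(*rows) transposes the rows into columns; a set keeps only distinct nodes
unique_counts = {
    name: len(set(column))
    for name, column in zip(column_names, zip(*rows))
}

for name, count in unique_counts.items():
    print(f"{name:<10}{count}")
```

Every row holds a different word node, so the word count equals the number of rows, while the higher-level nodes repeat across rows and are counted once each.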


In [7]:
torahQuery = '''
verse book=Genesis|Exodus|Leviticus|Numeri|Deuteronomium
  sentence
     clause
        word wordboundary=1
'''
torahResults = BHSA.search(torahQuery)

  0.66s 73591 results


In [8]:
import pandas as pd

# Convert list of tuples to DataFrame
torahDF = pd.DataFrame(torahResults,columns=['verse','sentence','clause','word'])

In [9]:
# Get unique value counts for each column
torahUniqueCounts = torahDF.nunique()
print("Unique value counts per column:")
print(torahUniqueCounts)

Unique value counts per column:
verse        5482
sentence    14851
clause      19947
word        73591
dtype: int64


## Domain types

In [10]:
clauseQuery = '''
verse parashanum=1
     clause
'''
clauseResults = BHSA.search(clauseQuery)

  0.04s 535 results


In [11]:
# Initialize a dictionary to store counts of each unique domain type
clauseDomains = {}

# Loop through each clause and count occurrences of each domain type
for verse, clause in clauseResults:
    clauseDomain=F.domain.v(clause)
    # Count each domain type occurrence
    if clauseDomain in clauseDomains:
        clauseDomains[clauseDomain] += 1
    else:
        clauseDomains[clauseDomain] = 1

# Convert the counts dictionary to a DataFrame
df = pd.DataFrame(list(clauseDomains.items()), columns=['Domain', 'Count'])

# Display the result
print("Sum of types in clause domains:")
print(df)

Sum of types in clause domains:
  Domain  Count
0      ?      8
1      N    340
2      Q    157
3      D     30


# Determine the number of verses, sentences, clauses and words

Here the number of words in the Torah is determined by counting items separated by spaces OR a maqqef (a hyphen-like mark indicating a strong connection between words).

First check what can be placed after an individual word:

In [12]:
# note: this is for the full TeNaCH!
F.trailer.freqList()

((' ', 236930),
 ('', 121801),
 ('&', 42275),
 ('00 ', 20146),
 ('05 ', 2266),
 ('00_S ', 1892),
 ('00_P ', 1165),
 ('_S ', 76),
 (' 05 ', 17),
 ('_P ', 13),
 ('00_N ', 7),
 ('00_N_P ', 1),
 ('00_N_S ', 1))

In this list, the `' '` value (a space) means the word is followed by a space, while `'&'` indicates a maqqef (־), a hyphen-like mark indicating a strong connection between words. The empty string `''` means the word is attached directly to the next word. We consider both the space and the maqqef as word separators. Examining the frequency list above, there are two ways to determine the word boundaries. The first uses the fact that every trailer value indicating a word boundary has a length of 1 or more, so the regular expression `(.+)` excludes all cases where the trailer is empty. The second looks explicitly for spaces and maqqefs, using the regular expression `[\s&]`. As expected, both produce the same outcome. The following query determines the number of words in the Torah based on this method of counting.

In [13]:
# Define the query template
# The 'r' prefix makes the template a raw string, so Python leaves any backslashes in it untouched.

WordQuery2 = r'''
book book=Genesis|Exodus|Leviticus|Numeri|Deuteronomium
  word wordboundary=1
'''

WordResults2 = BHSA.search(WordQuery2)

  0.46s 79886 results
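
The claim that the two regular expressions described above select the same trailers can be checked directly against the frequency list. This is a minimal sketch: the list is copied from the `F.trailer.freqList()` output above and therefore covers the full Tanakh, not only the Torah.

```python
import re

# Trailer values and frequencies copied from the F.trailer.freqList() output
trailer_freq = [
    (' ', 236930), ('', 121801), ('&', 42275), ('00 ', 20146),
    ('05 ', 2266), ('00_S ', 1892), ('00_P ', 1165), ('_S ', 76),
    (' 05 ', 17), ('_P ', 13), ('00_N ', 7), ('00_N_P ', 1),
    ('00_N_S ', 1),
]

non_empty = re.compile(r'(.+)')          # method 1: any trailer of length >= 1
space_or_maqqef = re.compile(r'[\s&]')   # method 2: explicit space or maqqef

count_non_empty = sum(n for trailer, n in trailer_freq if non_empty.search(trailer))
count_explicit = sum(n for trailer, n in trailer_freq if space_or_maqqef.search(trailer))

print("Non-empty trailers:       ", count_non_empty)
print("Space or maqqef trailers: ", count_explicit)
print("Same outcome:", count_non_empty == count_explicit)
```

Both methods agree because every non-empty trailer in the list contains either a space or the `&` sign; only the empty trailer is excluded.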


In [14]:
import re
from collections import Counter

torahQuery = '''
verse book=Genesis|Exodus|Leviticus|Numeri|Deuteronomium
  sentence
     clause
        word wordboundary=1
'''
torahResults = BHSA.search(torahQuery)

# Calculate total counts for each level
totalVerses = len(set([row[0] for row in torahResults]))
totalSentences = len(set([row[:2] for row in torahResults]))
totalClauses = len(set([row[:3] for row in torahResults]))

wordList = []
totalWords=0
# for the number of words we do not count word nodes, but define a word as a string followed by a space or a maqqef
for verse, sentence, clause, word in torahResults:
    trailer = F.trailer.v(word)
    if re.search(r'[ &]', trailer):
        totalWords += 1
        wordList.append(word)  # Append the word node to the list

# Count occurrences of each word
wordCounts = Counter(wordList)

# Find duplicates
duplicates = [word for word, count in wordCounts.items() if count > 1]

print("Total Words:", totalWords)
print("Duplicates:", duplicates)

  0.64s 73591 results
Total Words: 73591
Duplicates: []


In [15]:
# Convert wordList to a set for much faster membership testing
wordSet = set(wordList)

# Iterate through WordResults2 and print nodes not in wordSet
for book, node in WordResults2:
    if node not in wordSet:
        print(f'{node} {T.sectionFromNode(node)} |{F.g_word_utf8.v(node)}|  |{F.voc_lex_utf8.v(node)}| |{F.trailer.v(node)}|')

315 ('Genesis', 1, 17) |יִּתֵּ֥ן|  |נתן| | |
316 ('Genesis', 1, 17) |אֹתָ֛ם|  |אֵת| | |
317 ('Genesis', 1, 17) |אֱלֹהִ֖ים|  |אֱלֹהִים| | |
319 ('Genesis', 1, 17) |רְקִ֣יעַ|  |רָקִיעַ| | |
321 ('Genesis', 1, 17) |שָּׁמָ֑יִם|  |שָׁמַיִם| | |
323 ('Genesis', 1, 17) |הָאִ֖יר|  |אור| | |
324 ('Genesis', 1, 17) |עַל|  |עַל| |&|
326 ('Genesis', 1, 17) |אָֽרֶץ|  |אֶרֶץ| |00 |
329 ('Genesis', 1, 18) |מְשֹׁל֙|  |משׁל| | |
332 ('Genesis', 1, 18) |יֹּ֣ום|  |יֹום| | |
336 ('Genesis', 1, 18) |לַּ֔יְלָה|  |לַיְלָה| | |
339 ('Genesis', 1, 18) |הַבְדִּ֔יל|  |בדל| | |
340 ('Genesis', 1, 18) |בֵּ֥ין|  |בַּיִן| | |
342 ('Genesis', 1, 18) |אֹ֖ור|  |אֹור| | |
344 ('Genesis', 1, 18) |בֵ֣ין|  |בַּיִן| | |
346 ('Genesis', 1, 18) |חֹ֑שֶׁךְ|  |חֹשֶׁךְ| | |
593 ('Genesis', 1, 29) |הִנֵּה֩|  |הִנֵּה| | |
594 ('Genesis', 1, 29) |נָתַ֨תִּי|  |נתן| | |
595 ('Genesis', 1, 29) |לָכֶ֜ם|  |לְ| | |
596 ('Genesis', 1, 29) |אֶת|  |אֵת| |&|
597 ('Genesis', 1, 29) |כָּל|  |כֹּל| |&|
598 ('Genesis', 1, 29) |עֵ֣שֶׂב|  |עֵשֶׂב| 

In [16]:
print(totalVerses, totalSentences, totalClauses, totalWords)

5482 14851 19947 73591


In [17]:
wordResultsList=[row[1] for row in WordResults2]
# Find duplicates
def printDuplicates(inputList):
    # Count occurrences of each item in the list
    itemCounts = Counter(inputList)
    
    # Filter items that have more than one occurrence
    duplicates = {item: count for item, count in itemCounts.items() if count > 1}
    
    # Print duplicates
    if duplicates:
        print("Duplicates found:")
        for item, count in duplicates.items():
            print(f"{item}: appears {count} times")
    else:
        print("No duplicates found")

printDuplicates(wordResultsList)

No duplicates found


In [18]:
# Convert lists to sets and find the difference
uniqueInWordList = set(wordList) - set(wordResultsList)
uniqueInWordResults2 = set(wordResultsList) - set(wordList)

In [19]:
for item in uniqueInWordList:
    print(F.trailer.v(item))

In [20]:
# Calculate the average percentage per parasha (54 parashot)
averagePercentage = (1 / 54) * 100
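
As an illustration of how this average can be used, the sketch below compares the word share of the first parasha with the average share. The two word counts are taken from the query outputs above (1858 for parasha 1, 73591 for the Torah); the variable names are illustrative only.

```python
# Counts taken from the query results earlier in this notebook
parasha_words = 1858    # words in parasha 1 (Bereshit)
torah_words = 73591     # words in the whole Torah

# Average share if all 54 parashot were equally long
average_pct = (1 / 54) * 100

# Actual share of this parasha
parasha_pct = parasha_words / torah_words * 100

# Indicator: above, below or equal to the average
if parasha_pct > average_pct:
    indicator = "↑"
elif parasha_pct < average_pct:
    indicator = "↓"
else:
    indicator = "="

print(f"Average share per parasha: {average_pct:.2f}%")
print(f"Share of parasha 1:        {parasha_pct:.2f}% {indicator}")
```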

In [21]:
# Add the parasha number and name by applying F.parashanum.v and F.parashatrans.v to each verse node in torahResults
torahResultsWithParasha = [(verse, sentence, clause, word, F.parashanum.v(verse), F.parashatrans.v(verse)) for verse, sentence, clause, word in torahResults]

# Convert to DataFrame and add column names
torahDF = pd.DataFrame(torahResultsWithParasha, columns=['verse', 'sentence', 'clause', 'word', 'number', 'name'])

# Display the DataFrame
print(torahDF)

         verse  sentence  clause    word  number              name
0      1414389   1172308  427559       2       1          Bereshit
1      1414389   1172308  427559       3       1          Bereshit
2      1414389   1172308  427559       4       1          Bereshit
3      1414389   1172308  427559       5       1          Bereshit
4      1414389   1172308  427559       7       1          Bereshit
...        ...       ...     ...     ...     ...               ...
73586  1420238   1187394  448734  112862      54  Vezot Haberakhah
73587  1420238   1187394  448734  112863      54  Vezot Haberakhah
73588  1420238   1187394  448734  112864      54  Vezot Haberakhah
73589  1420238   1187394  448734  112865      54  Vezot Haberakhah
73590  1420238   1187394  448734  112866      54  Vezot Haberakhah

[73591 rows x 6 columns]


In [22]:
from bokeh.models import ColumnDataSource, DataTable, TableColumn
from bokeh.io import output_notebook, show
from bokeh.layouts import column

# Prepare the Bokeh output to display in the notebook
output_notebook()

# Create the summary table using 'number' as the numeric field and 'name' as the Parasha name (text field)
summary_table = torahDF.groupby(['number', 'name']).agg({
    'verse': 'nunique',
    'sentence': 'nunique',
    'clause': 'nunique',
    'word': 'nunique'
}).reset_index()
summary_table.columns = ['ParashaNum', 'ParashaName', 'TotalVerses', 'TotalSentences', 'TotalClauses', 'TotalWords']

# Ensure ParashaNum is treated as an integer for any sorting if needed, but ParashaName remains as text
summary_table['ParashaNum'] = summary_table['ParashaNum'].astype(int)

# Sort by ParashaNum if required
summaryTable = summary_table.sort_values(by='ParashaNum').reset_index(drop=True)

# Create a ColumnDataSource from the DataFrame
source = ColumnDataSource(summaryTable)

# Define columns for the DataTable
columns = [
    TableColumn(field="ParashaNum", title="Parasha Number"),
    TableColumn(field="ParashaName", title="Parasha Name"),
    TableColumn(field="TotalVerses", title="Total Verses"),
    TableColumn(field="TotalSentences", title="Total Sentences"),
    TableColumn(field="TotalClauses", title="Total Clauses"),
    TableColumn(field="TotalWords", title="Total Words"),
]

# Create a DataTable with sortable columns
dataTable = DataTable(source=source, columns=columns, width=800, height=400, sortable=True)

# Display the table
show(column(dataTable))


## Stacked bar chart of parashot metrics

The plot is a stacked bar chart showing the distribution of verses, sentences, clauses, and words across the parashot in the Torah, with each bar representing a parasha. Hover tooltips display absolute counts, percentages of the Torah-wide total, and indicators showing whether each metric is above, below, or equal to the average.

The script creates combined labels for the parashot, calculates Torah-wide totals and averages for each metric, and determines how each parasha compares to these averages. It then generates an interactive chart with color-coded bars and tooltips.

In [26]:
from bokeh.models import ColumnDataSource, HoverTool
from bokeh.plotting import figure, show
from bokeh.io import output_notebook

# Prepare the Bokeh output to display in the notebook
output_notebook()

# Combine ParashaNum and ParashaName into a single label for the x-axis
summary_table['ParashaLabel'] = summary_table['ParashaNum'].astype(str) + " - " + summary_table['ParashaName']

# Sort the summary table by ParashaNum to ensure correct sequence order
summary_table = summary_table.sort_values(by='ParashaNum').reset_index(drop=True)

# Calculate total counts for each metric across the entire Torah for percentages
total_verses = summary_table['TotalVerses'].sum()
total_sentences = summary_table['TotalSentences'].sum()
total_clauses = summary_table['TotalClauses'].sum()
total_words = summary_table['TotalWords'].sum()

# Calculate averages for each metric
avg_verses = total_verses / len(summary_table)
avg_sentences = total_sentences / len(summary_table)
avg_clauses = total_clauses / len(summary_table)
avg_words = total_words / len(summary_table)

# Calculate percentages and indicators for each metric, adding them to the DataFrame
summary_table['VersesPct'] = (summary_table['TotalVerses'] / total_verses * 100).round(2)
summary_table['SentencesPct'] = (summary_table['TotalSentences'] / total_sentences * 100).round(2)
summary_table['ClausesPct'] = (summary_table['TotalClauses'] / total_clauses * 100).round(2)
summary_table['WordsPct'] = (summary_table['TotalWords'] / total_words * 100).round(2)

# Add indicators for each metric based on comparison to average
summary_table['VersesIndicator'] = summary_table['TotalVerses'].apply(lambda x: "↑" if x > avg_verses else ("↓" if x < avg_verses else "="))
summary_table['SentencesIndicator'] = summary_table['TotalSentences'].apply(lambda x: "↑" if x > avg_sentences else ("↓" if x < avg_sentences else "="))
summary_table['ClausesIndicator'] = summary_table['TotalClauses'].apply(lambda x: "↑" if x > avg_clauses else ("↓" if x < avg_clauses else "="))
summary_table['WordsIndicator'] = summary_table['TotalWords'].apply(lambda x: "↑" if x > avg_words else ("↓" if x < avg_words else "="))

# Prepare data for the stacked bar chart with the combined ParashaLabel
source = ColumnDataSource(summary_table)

# Create a figure with sorted ParashaLabel on the x-axis
p = figure(x_range=list(summary_table['ParashaLabel']), 
           height=600, width=900, title="Metrics Distribution Across Parashot",
           toolbar_location=None)

# Define the metrics and display names for the stacked bars
metrics = ['TotalVerses', 'TotalSentences', 'TotalClauses', 'TotalWords']
display_names = ["Total Verses", "Total Sentences", "Total Clauses", "Total Words"]
colors = ["#718dbf", "#e84d60", "#ddb7b1", "#c9d9d3"]

# Add stacked bars for each metric with display-friendly legend labels
p.vbar_stack(metrics, x='ParashaLabel', width=0.9, color=colors, source=source,
             legend_label=display_names)

# Deactivate any active drag or scroll tools (if the toolbar exists)
if p.toolbar:
    p.toolbar.active_drag = None
    p.toolbar.active_scroll = None

# Configure tooltips to show absolute values, percentages, and indicators
hover = HoverTool(tooltips=[
    ("Parasha", "@ParashaLabel"),
    ("Total Verses", "@TotalVerses (@VersesPct%) @VersesIndicator"),
    ("Total Sentences", "@TotalSentences (@SentencesPct%) @SentencesIndicator"),
    ("Total Clauses", "@TotalClauses (@ClausesPct%) @ClausesIndicator"),
    ("Total Words", "@TotalWords (@WordsPct%) @WordsIndicator")
])
p.add_tools(hover)

# Customize plot
p.y_range.start = 0
p.xgrid.grid_line_color = None
p.axis.minor_tick_line_color = None
p.outline_line_color = None
p.legend.location = "top_left"
p.legend.orientation = "horizontal"
p.xaxis.axis_label = "Parasha"
p.yaxis.axis_label = "Total Counts"

# Rotate x-axis labels to 90 degrees
p.xaxis.major_label_orientation = "vertical"

# Show the plot
show(p)
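
The percentage and indicator logic used for the tooltips can be tried in isolation. The following is a minimal sketch in plain Python with hypothetical counts; `describe_metric` is an illustrative helper, not part of the notebook.

```python
from statistics import mean

def describe_metric(values):
    """Return (value, percentage of total, indicator vs. average) per item."""
    total = sum(values)
    average = mean(values)
    described = []
    for value in values:
        pct = round(value / total * 100, 2)
        if value > average:
            indicator = "↑"
        elif value < average:
            indicator = "↓"
        else:
            indicator = "="
        described.append((value, pct, indicator))
    return described

# Hypothetical word counts for three parashot
word_counts = [120, 80, 100]
for value, pct, indicator in describe_metric(word_counts):
    print(f"{value} ({pct}%) {indicator}")
```

The helper mirrors the tooltip format `@TotalWords (@WordsPct%) @WordsIndicator` used in the chart above.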


In [27]:
from bokeh.io import save
from bokeh.resources import CDN

# Save the plot to a standalone HTML file; passing resources and title explicitly
# avoids Bokeh's warnings about missing defaults
save(p, filename="metrics_distribution.html", resources=CDN, title="Metrics Distribution Across Parashot")


'C:\\Users\\tonyj\\OneDrive\\Documents\\GitHub\\parashot\\General\\metrics_distribution.html'