# Create TF-IDF Text-Fabric features for the N1904 GNT 

## Table of contents (ToC)<a class="anchor" id="TOC"></a>
* <a href="#bullet1">1 - Introduction</a>
* <a href="#bullet2">2 - Setup</a>
    * <a href="#bullet2x1">2.1 - Package dependencies</a>
    * <a href="#bullet2x2">2.2 - Load app and data</a>
* <a href="#bullet3">3 - Creating the data</a>
    * <a href="#bullet3x1">3.1 - Aggregate tokens per book</a>
    * <a href="#bullet3x2">3.2 - Compute TF-IDF matrix</a>
    * <a href="#bullet3x3">3.3 - Taking a random peek</a>
    * <a href="#bullet3x4">3.4 - Inspect top tokens per book</a>
    * <a href="#bullet3x5">3.5 - Aggregate tokens per book (no stop-words)</a>
    * <a href="#bullet3x6">3.6 - Compute TF-IDF matrix (no stop-words)</a>
* <a href="#bullet4">4 - Create the TF features</a>
    * <a href="#bullet4x1">4.1 - Prepare metadata</a>
    * <a href="#bullet4x2">4.2 - Prepare data for the TF-IDF features</a>
    * <a href="#bullet4x3">4.3 - Link metadata to the featuredata</a>
    * <a href="#bullet4x4">4.4 - Save the features to files</a>
* <a href="#bullet5">5 - Test the new features</a>
    * <a href="#bullet5x1">5.1 - Load the new features in a new instance</a>
    * <a href="#bullet5x2">5.2 - Check if the new features are loaded</a>
    * <a href="#bullet5x3">5.3 - A basic sanity check</a>
    * <a href="#bullet5x4">5.4 - Move the newly created features to their final location</a>
* <a href="#bullet6">6 - Attribution and footnotes</a>
* <a href="#bullet7">7 - Required libraries</a>
* <a href="#bullet8">8 - Notebook version</a>

#  1 - Introduction <a class="anchor" id="bullet1"></a>
##### [Back to ToC](#TOC)

This notebook creates two new Text-Fabric features that make it possible to explore the distribution of Greek word forms in the Nestle 1904 Greek New Testament (GNT) using TF–IDF. The aim is to treat each book as a “document,” compute TF–IDF scores per normalized token, and then map those scores back to the Text-Fabric nodes of the corpus. This allows us to identify book-specific vocabulary and to use these weights for further quantitative or visualization-oriented analyses.

It follows the information provided in [the TF-IDF explanation on GeeksforGeeks](https://www.geeksforgeeks.org/machine-learning/understanding-tf-idf-term-frequency-inverse-document-frequency/).

# 2 - Setup <a class="anchor" id="bullet2"></a>
##### [Back to ToC](#TOC)

Set up the environment to create the feature data.

## 2.1 - Package dependencies <a class="anchor" id="bullet2x1"></a>

We need to install dependencies if they are not already available, especially `tf` (Text-Fabric), `pandas`, and `scikit-learn`, which handles the TF-IDF computation. The try/except construct installs a package if it cannot be imported.

In [1]:
import sys, subprocess

try:
    import tf  # Text-Fabric
except ImportError:
    subprocess.check_call([sys.executable, "-m", "pip", "install", "text-fabric"])
    import tf

try:
    import pandas as pd
except ImportError:
    subprocess.check_call([sys.executable, "-m", "pip", "install", "pandas"])
    import pandas as pd

try:
    from sklearn import __version__ as _sk_version
except ImportError:
    subprocess.check_call([sys.executable, "-m", "pip", "install", "scikit-learn"])
    from sklearn import __version__ as _sk_version
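
The three try/except blocks above follow the same pattern, so they could be factored into one small helper. The sketch below is only an illustration: `ensure_package` is a hypothetical name, not part of this notebook or of Text-Fabric, and it assumes `pip` is available for the running interpreter.

```python
import importlib
import subprocess
import sys
from types import ModuleType
from typing import Optional


def ensure_package(import_name: str, pip_name: Optional[str] = None) -> ModuleType:
    """Import a module, installing it with pip first if it is missing.

    Hypothetical helper for illustration. `pip_name` covers cases where the
    PyPI name differs from the import name (import `tf`, install `text-fabric`).
    """
    try:
        return importlib.import_module(import_name)
    except ImportError:
        subprocess.check_call(
            [sys.executable, "-m", "pip", "install", pip_name or import_name]
        )
        # make sure the freshly installed package is visible to the importer
        importlib.invalidate_caches()
        return importlib.import_module(import_name)


# A standard-library module is always importable, so nothing is installed here.
json_module = ensure_package("json")
print(json_module.__name__)
```

With such a helper, the cell above would reduce to calls like `ensure_package("tf", "text-fabric")` and `ensure_package("sklearn", "scikit-learn")`.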

## 2.2 - Load app and data <a class="anchor" id="bullet2x2"></a>

Since the new features should act as an extension to the N1904-TF dataset, we first need to load this dataset, together with the Text-Fabric Python code.

In [2]:
%load_ext autoreload
%autoreload 2

In [3]:
# Loading the Text-Fabric code
from tf.fabric import Fabric
from tf.app import use

In [4]:
# load the N1904 app and data
N1904 = use ("CenterBLC/N1904", version="1.0.0", silence="terse", hoist=globals())

**Locating corpus resources ...**

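| Name | # of nodes | # slots / node | % coverage |
|---|---|---|---|
| book | 27 | 5102.93 | 100 |
| chapter | 260 | 529.92 | 100 |
| verse | 7944 | 17.34 | 100 |
| sentence | 8011 | 17.2 | 100 |
| group | 8945 | 7.01 | 46 |
| clause | 42506 | 8.36 | 258 |
| wg | 106868 | 6.88 | 533 |
| phrase | 69007 | 1.9 | 95 |
| subphrase | 116178 | 1.6 | 135 |
| word | 137779 | 1.0 | 100 |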


Display is setup for viewtype [syntax-view](https://github.com/saulocantanhede/tfgreek2/blob/main/docs/syntax-view.md#start)

See [here](https://github.com/saulocantanhede/tfgreek2/blob/main/docs/viewtypes.md#start) for more information on viewtypes

# 3 - Creating the data <a class="anchor" id="bullet3"></a>
##### [Back to ToC](#TOC)

This section constructs the TF–IDF matrix over the corpus using scikit-learn’s `TfidfVectorizer`. Because tokenization (and normalisation) has already been done via Text-Fabric, internal lowercasing and regex-based tokenization are disabled. The result is a document-term matrix whose rows correspond to books and whose columns correspond to token forms.

## 3.1 - Aggregate tokens per book <a class="anchor" id="bullet3x1"></a>

This implementation of TF-IDF treats each 'document' separately. Here, the Greek New Testament is considered the corpus, and each of its books is treated as a document. We use the normalized Greek word form as the token; see feature [normalized](https://centerblc.github.io/N1904/features/normalized.html#start).

In [6]:
from collections import defaultdict
from typing import Dict, List

def collect_tokens_by_book() -> Dict[str, List[str]]:
    """Return a mapping of book name -> list of normalized word tokens."""
    tokens: Dict[str, List[str]] = defaultdict(list)
    for word in F.otype.s("word"):
        book  = F.book.v(word)
        token = (F.normalized.v(word) or "").lower()  # guard against a missing value
        if token:
            tokens[book].append(token)
    return tokens

book_tokens = collect_tokens_by_book()
print(f"Identified {len(book_tokens)} books and {sum(len(v) for v in book_tokens.values()):,} tokens.")

Identified 27 books and 137,779 tokens.


## 3.2 - Compute TF-IDF matrix <a class="anchor" id="bullet3x2"></a>

`TfidfVectorizer` in scikit-learn expects raw documents, so we pass in the token lists and override `tokenizer`/`preprocessor` to leave the tokens untouched. Each book becomes a document, and the resulting matrix stores TF-IDF scores for every token in every book.

See also [TfidfVectorizer documentation](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html).

In [7]:
from typing import Iterable, List
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(
    tokenizer=lambda doc: doc,        # identity tokenizer: tokens are precomputed
    preprocessor=lambda doc: doc,     # no extra preprocessing
    token_pattern=None,               # disable the default word token pattern
    lowercase=False,                  # tokens are already normalized
    strip_accents=None,               # accents are critical to meaning in Koine Greek
)

book_names = list(book_tokens.keys())
documents: Iterable[List[str]] = book_tokens.values()

tfidf_matrix = vectorizer.fit_transform(documents)
feature_names = vectorizer.get_feature_names_out()

# DataFrame where rows=books, columns=tokens
book_tfidf = pd.DataFrame(
    tfidf_matrix.toarray(), index=book_names, columns=feature_names
)
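
To make the numbers in the matrix less of a black box, the following sketch recomputes TF-IDF by hand using only the standard library. It assumes the weighting that scikit-learn documents as the `TfidfVectorizer` defaults (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2 normalisation per document), which are the settings left untouched in the cell above. The function name `manual_tfidf` and the toy corpus are invented for illustration.

```python
import math
from collections import Counter
from typing import Dict, List


def manual_tfidf(docs: Dict[str, List[str]]) -> Dict[str, Dict[str, float]]:
    """TF-IDF with smoothed IDF and L2 normalisation per document."""
    n_docs = len(docs)
    # document frequency: number of documents that contain each token
    doc_freq = Counter(token for tokens in docs.values() for token in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}

    result: Dict[str, Dict[str, float]] = {}
    for name, tokens in docs.items():
        counts = Counter(tokens)
        raw = {t: c * idf[t] for t, c in counts.items()}
        norm = math.sqrt(sum(v * v for v in raw.values()))
        result[name] = {t: v / norm for t, v in raw.items()}
    return result


# Toy corpus (invented for illustration, not taken from the N1904 data)
toy_docs = {
    "docA": ["καί", "καί", "λόγος"],
    "docB": ["καί", "ἔργων"],
}
toy_scores = manual_tfidf(toy_docs)
for name, scores in toy_scores.items():
    print(name, {t: round(s, 4) for t, s in scores.items()})
```

Because of the L2 normalisation, the scores within one document are comparable to each other, but a score of a token in one book is not directly comparable to the raw count of that token in another book.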

## 3.3 - Taking a random peek <a class="anchor" id="bullet3x3"></a>

A peek at a few random token columns.

In [8]:
book_tfidf.sample(10, axis=1).head(6)  

| | οὖσαι | νυκτός | ἐνέγκαι | ἐθαύμασαν | φανερωθῇ | πληρωθήσονται | εὐρύχωρος | πάρεστιν | διηποροῦντο | γλώσσης |
|---|---|---|---|---|---|---|---|---|---|---|
| Matthew | 0.0 | 0.004276 | 0.0 | 0.007149 | 0.0 | 0.0 | 0.002011 | 0.0 | 0.0 | 0.0 |
| Mark | 0.0 | 0.00286 | 0.00269 | 0.0 | 0.001764 | 0.0 | 0.0 | 0.0 | 0.0 | 0.004356 |
| Luke | 0.0 | 0.002826 | 0.0 | 0.0063 | 0.0 | 0.001772 | 0.0 | 0.0 | 0.0 | 0.0 |
| John | 0.0 | 0.002483 | 0.0 | 0.0 | 0.004594 | 0.0 | 0.0 | 0.00415 | 0.0 | 0.0 |
| Acts | 0.0 | 0.010042 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.002099 | 0.0 |
| Romans | 0.005722 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


## 3.4 - Inspect top tokens per book <a class="anchor" id="bullet3x4"></a>

Sorting a book's row of scores in descending order yields the highest-scoring tokens for that book. Adjust `top_n` to retrieve more or fewer entries.

In [9]:
def top_tokens_for_book(book: str, top_n: int = 15) -> pd.DataFrame:
    """Return the top-n TF-IDF tokens for a single book."""
    scores = book_tfidf.loc[book].sort_values(ascending=False).head(top_n)
    return scores.to_frame(name="tfidf")

# display the top 10 tokens for the first book in the corpus (Matthew)
first_book = book_names[0]
top_tokens_for_book(first_book, top_n=10)

| | tfidf |
|---|---|
| καί | 0.646104 |
| δέ | 0.275516 |
| ὁ | 0.270269 |
| ἐν | 0.161941 |
| τοῦ | 0.161388 |
| αὐτοῦ | 0.1495 |
| τό | 0.12888 |
| οἱ | 0.123252 |
| τόν | 0.121041 |
| εἰς | 0.119936 |


Observation: the top-scoring tokens are actually stop-words. Hence, it makes sense to create two versions:
- TF-IDF based on all tokens
- TF-IDF based on all non-stop-word tokens (i.e., tokens whose part of speech is not article, conjunction, preposition, or interjection)
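
One reason why function words end up on top is that, with only 27 documents, the IDF factor has little room to push them down. The sketch below evaluates the smoothed IDF formula documented for scikit-learn, `ln((1 + n) / (1 + df)) + 1`, for a few document frequencies. It assumes the default vectorizer settings; the helper name `smooth_idf` is introduced here for illustration.

```python
import math


def smooth_idf(doc_freq: int, n_docs: int) -> float:
    """Smoothed IDF as documented for scikit-learn: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1


N_BOOKS = 27  # number of book nodes reported when the corpus was loaded

for df in (27, 14, 1):
    idf = smooth_idf(df, N_BOOKS)
    print(f"token occurring in {df:>2} of {N_BOOKS} books -> idf = {idf:.3f}")
```

A token that occurs in every book keeps an IDF of exactly 1, while a token found in a single book gets an IDF below 4. The very high term frequency of words such as καί therefore dominates the product.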

## 3.5 - Aggregate tokens per book (no stop-words)<a class="anchor" id="bullet3x5"></a>

In [10]:
from collections import defaultdict
from typing import Dict, List

stop_words = {'intj', 'prep', 'art', 'conj'}

def collect_tokens_by_book_ns() -> Dict[str, List[str]]:
    """Return a mapping of book name -> list of normalized non-stop-word tokens."""
    tokens: Dict[str, List[str]] = defaultdict(list)
    for word in F.otype.s("word"):
        book  = F.book.v(word)
        token = (F.normalized.v(word) or "").lower()  # guard against a missing value
        sp    = F.sp.v(word)
        if token:
            if sp not in stop_words:
                tokens[book].append(token)
    return tokens

book_tokens_ns = collect_tokens_by_book_ns()
print(f"Identified {len(book_tokens_ns)} books and {sum(len(v) for v in book_tokens_ns.values()):,} tokens.")

Identified 27 books and 88,064 tokens.


## 3.6 - Compute TF-IDF matrix (no stop-words)<a class="anchor" id="bullet3x6"></a>

In [11]:
from typing import Iterable, List
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(
    tokenizer=lambda doc: doc,        # identity tokenizer: tokens are precomputed
    preprocessor=lambda doc: doc,     # no extra preprocessing
    token_pattern=None,               # disable the default word token pattern
    lowercase=False,                  # tokens are already normalized
)

book_names_ns = list(book_tokens_ns.keys())
documents: Iterable[List[str]] = book_tokens_ns.values()

tfidf_matrix = vectorizer.fit_transform(documents)
feature_names_ns = vectorizer.get_feature_names_out()

# DataFrame where rows=books, columns=tokens
book_tfidf_ns = pd.DataFrame(
    tfidf_matrix.toarray(), index=book_names_ns, columns=feature_names_ns
)

# 4 - Create the TF features <a class="anchor" id="bullet4"></a>
##### [Back to ToC](#TOC)

In the following section I reuse previously developed procedures and code to generate both the metadata and the actual token-related data.

## 4.1 - Prepare metadata <a class="anchor" id="bullet4x1"></a>

In [12]:
# Common metadata template function
def createMetadata(description):
    return {
        'author': 'Tony Jurg (using TfidfVectorizer from scikit-learn)',
        'convertedBy': 'Tony Jurg',
        'website': 'https://github.com/tonyjurg/N1904addons', 
        'description': description,
        'coreData': 'Nestle 1904 Text-Fabric (centerBLC)',
        'coreDataUrl': 'https://github.com/CenterBLC/N1904',
        'provenance': 'jupyter Notebook (https://github.com/tonyjurg/Create_TF-IDF_Text-Fabric_features',
        'version': '1.0.0',   # This is the version of the N1904-TF dataset against which this feature is built!
        'license': 'Creative Commons Attribution 4.0 International (CC BY 4.0)',
        'licenseUrl': 'https://github.com/tonyjurg/N1904addons/blob/main/LICENSE.md',
        'valueType': 'int'
    }

# Create metadata dictionaries using createMetadata function
tfIdfMetadata   = createMetadata('TF–IDF score (× 1,000,000) for this token, calculated using all tokens in the GNT corpus, aggregated per book.')
tfIdfNsMetadata = createMetadata('TF–IDF score (× 1,000,000) for this token, calculated using only non-stopword tokens in the GNT corpus, aggregated per book.')

Just check one metadata dictionary:

In [13]:
tfIdfNsMetadata

{'author': 'Tony Jurg (using TfidfVectorizer from scikit-learn)',
 'convertedBy': 'Tony Jurg',
 'website': 'https://github.com/tonyjurg/N1904addons',
 'description': 'TF–IDF score (× 1,000,000) for this token, calculated using only non-stopword tokens in the GNT corpus, aggregated per book.',
 'coreData': 'Nestle 1904 Text-Fabric (centerBLC)',
 'coreDataUrl': 'https://github.com/CenterBLC/N1904',
 'provenance': 'jupyter Notebook (https://github.com/tonyjurg/Create_TF-IDF_Text-Fabric_features',
 'version': '1.0.0',
 'license': 'Creative Commons Attribution 4.0 International (CC BY 4.0)',
 'licenseUrl': 'https://github.com/tonyjurg/N1904addons/blob/main/LICENSE.md',
 'valueType': 'int'}

## 4.2 - Prepare data for the TF-IDF features <a class="anchor" id="bullet4x2"></a>

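The code below loops over all word nodes, reads each node's values for the features `normalized`, `book`, and `sp`, and stores the calculated scores for that node. The float TF-IDF value is multiplied by 1,000,000 and stored as an integer (truncated by `int()`) in its Text-Fabric feature. Words whose part of speech is in the stop-word set receive the value 0 for `tfidfns`.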

In [14]:
stop_words = {'intj', 'prep', 'art', 'conj'}

# Initialize dictionary
tfIdfDictionary = {}
tfIdfNsDictionary = {}

# Looping over word nodes and populate the dictionary
for word in F.otype.s('word'):
    token = (F.normalized.v(word) or "").lower()  # guard so the warning branch below is reachable
    book = F.book.v(word)
    sp = F.sp.v(word)
    if token:
        tfIdfDictionary[word]=int(book_tfidf.at[book, token]*1000000)
        if sp not in stop_words:
            tfIdfNsDictionary[word]=int(book_tfidf_ns.at[book, token]*1000000)
        else:
            tfIdfNsDictionary[word]= 0
    else:
        print(f"Warning: No token for {word}")
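
The scaling step is easy to try out in isolation. The sketch below shows the truncating conversion used in the loop above, the reverse conversion a consumer of the feature would apply, and, for comparison only, a rounding variant that this notebook does not use. All function names are introduced here for illustration.

```python
SCALE = 1_000_000  # same factor as used for the tfidf / tfidfns features


def scale_truncate(score: float) -> int:
    """Scale and truncate toward zero, as int() does in the loop above."""
    return int(score * SCALE)


def scale_round(score: float) -> int:
    """Alternative (not used in this notebook): round to the nearest integer."""
    return round(score * SCALE)


def unscale(value: int) -> float:
    """Convert a stored integer back to an approximate TF-IDF float."""
    return value / SCALE


print(scale_truncate(0.5), unscale(500_000))
print(scale_truncate(1.9e-6), scale_round(1.9e-6))
```

Truncation can differ from rounding by at most one unit in the last stored digit, which corresponds to a difference of one millionth in the original score.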

In [15]:
for i, (key, value) in enumerate(tfIdfNsDictionary.items()):
    if i >= 10:
        break
    print(f"{key}: {value}")

1: 4262
2: 3787
3: 30350
4: 6291
5: 19475
6: 40417
7: 19475
8: 15147
9: 15147
10: 134583


## 4.3 - Link metadata to the featuredata <a class="anchor" id="bullet4x3"></a>

Now we give the new features their names, and connect each of them with its data dictionary and its metadata dictionary.

In [16]:
nodedata = {'tfidf'  : tfIdfDictionary,
            'tfidfns': tfIdfNsDictionary}
metadata = {'tfidf'  : tfIdfMetadata,
            'tfidfns': tfIdfNsMetadata}
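
Before saving, it can be useful to verify that both dictionaries use the same feature names and that the values match the declared `valueType`. The sketch below is one possible check on stand-in data; `check_feature_payload` is a hypothetical helper, not a Text-Fabric function.

```python
from typing import Dict, List


def check_feature_payload(
    nodedata: Dict[str, Dict[int, int]],
    metadata: Dict[str, Dict[str, str]],
) -> List[str]:
    """Return a list of problems; an empty list means the payload looks consistent."""
    problems = []
    if set(nodedata) != set(metadata):
        problems.append("feature names in nodedata and metadata differ")
    for name, values in nodedata.items():
        declared = metadata.get(name, {}).get("valueType")
        if declared == "int" and not all(isinstance(v, int) for v in values.values()):
            problems.append(f"{name}: non-integer value despite valueType 'int'")
    return problems


# Stand-in data shaped like the dictionaries built above (illustrative values)
demo_nodedata = {"tfidf": {1: 4262, 2: 3787}, "tfidfns": {1: 0, 2: 3787}}
demo_metadata = {"tfidf": {"valueType": "int"}, "tfidfns": {"valueType": "int"}}
print(check_feature_payload(demo_nodedata, demo_metadata))
```

In the notebook itself the call would be `check_feature_payload(nodedata, metadata)`.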

## 4.4 - Save the features to files <a class="anchor" id="bullet4x4"></a>

Now we save each new feature to its own `.tf` file.

If you don’t pass an explicit target path, `TF.save()` writes the files to the directory that already contains the loaded corpus, in this case the local on‑disk copy of the N1904 Text‑Fabric dataset.

In [17]:
TF.save(nodeFeatures=nodedata, metaData=metadata)

  0.00s Exporting 2 node and 0 edge and 0 configuration features to ~/text-fabric-data/github/CenterBLC/N1904/tf/1.0.0:
   |     0.22s T tfidf                to ~/text-fabric-data/github/CenterBLC/N1904/tf/1.0.0
   |     0.20s T tfidfns              to ~/text-fabric-data/github/CenterBLC/N1904/tf/1.0.0
  0.43s Exported 2 node features and 0 edge features and 0 config features to ~/text-fabric-data/github/CenterBLC/N1904/tf/1.0.0


True

# 5 - Test the new features <a class="anchor" id="bullet5"></a>
##### [Back to ToC](#TOC)

Next we’ll confirm that Text‑Fabric can pick up the new features.

## 5.1 - Load the new features in a new instance <a class="anchor" id="bullet5x1"></a>

Because the `tfidf.tf` and `tfidfns.tf` files live in the same directory as the rest of the N1904 dataset that we initially downloaded, we can load them with `use()` as in section 2.2. The main change is that we bind the result to a different instance (`N1904_ADD` instead of `N1904`), so both the enriched and the original dataset can be inspected side by side.

In [18]:
# load the N1904-TF app and data in another instance 
N1904_ADD = use ('CenterBLC/N1904', silence="terse", hoist=globals())

**Locating corpus resources ...**

   |     0.71s T tfidf                from ~/text-fabric-data/github/CenterBLC/N1904/tf/1.0.0
   |     0.78s T tfidfns              from ~/text-fabric-data/github/CenterBLC/N1904/tf/1.0.0


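| Name | # of nodes | # slots / node | % coverage |
|---|---|---|---|
| book | 27 | 5102.93 | 100 |
| chapter | 260 | 529.92 | 100 |
| verse | 7944 | 17.34 | 100 |
| sentence | 8011 | 17.2 | 100 |
| group | 8945 | 7.01 | 46 |
| clause | 42506 | 8.36 | 258 |
| wg | 106868 | 6.88 | 533 |
| phrase | 69007 | 1.9 | 95 |
| subphrase | 116178 | 1.6 | 135 |
| word | 137779 | 1.0 | 100 |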


Display is setup for viewtype [syntax-view](https://github.com/saulocantanhede/tfgreek2/blob/main/docs/syntax-view.md#start)

See [here](https://github.com/saulocantanhede/tfgreek2/blob/main/docs/viewtypes.md#start) for more information on viewtypes

## 5.2 - Check if the new features are loaded <a class="anchor" id="bullet5x2"></a>

This can be done easily using the `A.isLoaded()` method, which we apply to both the initial and the expanded dataset:

In [19]:
print ('N1904 (original dataset):')
N1904.isLoaded('tfidf')
N1904.isLoaded('tfidfns')
print ('N1904_ADD (expanded dataset):')
N1904_ADD.isLoaded('tfidf')
N1904_ADD.isLoaded('tfidfns')

N1904 (original dataset):
tfidf                NOT LOADED
tfidfns              NOT LOADED
N1904_ADD (expanded dataset):
tfidf                node (int) TF–IDF score (× 1,000,000) for this token, calculated using all tokens in the
                                GNT corpus, aggregated per book.
tfidfns              node (int) TF–IDF score (× 1,000,000) for this token, calculated using only non-stopword
                                tokens in the GNT corpus, aggregated per book.


## 5.3 - A basic sanity check <a class="anchor" id="bullet5x3"></a>

In [20]:
# access the class F and T of the new dataset to access the new features
FA=N1904_ADD.api.F
TA=N1904_ADD.api.T

def sampleNodesForToken(tokenNorm: str, limit: int = 5) -> None:
    tokenNorm = tokenNorm.lower()
    count = 0
    for w in FA.otype.s("word"):
        if (FA.normalized.v(w) or "").lower() == tokenNorm:  # compare lowercased forms, as when the features were built
            book, chapter, verse = TA.sectionFromNode(w)
            tfidf   = FA.tfidf.v(w)
            tfidfns = FA.tfidfns.v(w)
            print(f"{book} {chapter}:{verse}  ->  TF–IDF={tfidf}  / TF-IDF_NS={tfidfns}")
            count += 1
            if count >= limit:
                break

# Example:
sampleNodesForToken("ἔργων",limit=30)

Matthew 11:19  ->  TF–IDF=976  / TF-IDF_NS=2070
Acts 9:36  ->  TF–IDF=1019  / TF-IDF_NS=2445
Romans 3:20  ->  TF–IDF=22229  / TF-IDF_NS=42720
Romans 3:27  ->  TF–IDF=22229  / TF-IDF_NS=42720
Romans 3:28  ->  TF–IDF=22229  / TF-IDF_NS=42720
Romans 4:2  ->  TF–IDF=22229  / TF-IDF_NS=42720
Romans 4:6  ->  TF–IDF=22229  / TF-IDF_NS=42720
Romans 9:12  ->  TF–IDF=22229  / TF-IDF_NS=42720
Romans 9:32  ->  TF–IDF=22229  / TF-IDF_NS=42720
Romans 11:6  ->  TF–IDF=22229  / TF-IDF_NS=42720
Galatians 2:16  ->  TF–IDF=51807  / TF-IDF_NS=78961
Galatians 2:16  ->  TF–IDF=51807  / TF-IDF_NS=78961
Galatians 2:16  ->  TF–IDF=51807  / TF-IDF_NS=78961
Galatians 3:2  ->  TF–IDF=51807  / TF-IDF_NS=78961
Galatians 3:5  ->  TF–IDF=51807  / TF-IDF_NS=78961
Galatians 3:10  ->  TF–IDF=51807  / TF-IDF_NS=78961
Ephesians 2:9  ->  TF–IDF=6553  / TF-IDF_NS=13270
I_Timothy 2:10  ->  TF–IDF=10786  / TF-IDF_NS=16191
Titus 2:7  ->  TF–IDF=107222  / TF-IDF_NS=131312
Titus 2:14  ->  TF–IDF=107222  / TF-IDF_NS=131312
Titus 
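
Since the scores are computed per book, all occurrences of a token within one book should carry the same value, which matches the listing above. The sketch below shows one way to test this property on a handful of rows. The helper `inconsistent_pairs` is hypothetical and works on plain tuples, so it does not depend on Text-Fabric.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

Row = Tuple[str, str, int]  # (book, token, stored value)


def inconsistent_pairs(rows: Iterable[Row]) -> Dict[Tuple[str, str], Set[int]]:
    """Return the (book, token) pairs that carry more than one distinct value."""
    seen: Dict[Tuple[str, str], Set[int]] = defaultdict(set)
    for book, token, value in rows:
        seen[(book, token)].add(value)
    return {pair: values for pair, values in seen.items() if len(values) > 1}


# A few rows copied from the output above
sample_rows = [
    ("Romans", "ἔργων", 22229),
    ("Romans", "ἔργων", 22229),
    ("Galatians", "ἔργων", 51807),
    ("Galatians", "ἔργων", 51807),
    ("Titus", "ἔργων", 107222),
]
print(inconsistent_pairs(sample_rows))
```

An empty dictionary means that no token has conflicting values within a book.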

## 5.4 - Move the newly created features to their final location <a class="anchor" id="bullet5x4"></a>

The last step is to move the newly created features from ~/text-fabric-data/github/CenterBLC/N1904/tf/1.0.0 (see the output of step 4.4) to their final location: https://github.com/tonyjurg/N1904addons/tree/main/tf/1.0.0.
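
The local part of this step can be scripted. The following sketch copies the two `.tf` files into a directory that stands in for a local clone of the target repository. The function `copy_features` and the target path are assumptions made for illustration; committing and pushing to GitHub is not covered.

```python
import shutil
from pathlib import Path
from typing import Iterable, List


def copy_features(source_dir: Path, target_dir: Path, names: Iterable[str]) -> List[Path]:
    """Copy `<name>.tf` for each feature name from source_dir to target_dir."""
    target_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for name in names:
        source = source_dir / f"{name}.tf"
        if not source.is_file():
            raise FileNotFoundError(f"feature file not found: {source}")
        copied.append(Path(shutil.copy2(source, target_dir / source.name)))
    return copied


# Example call; the target path is a placeholder for a local clone of N1904addons
source = Path.home() / "text-fabric-data/github/CenterBLC/N1904/tf/1.0.0"
target = Path("N1904addons/tf/1.0.0")
# copy_features(source, target, ["tfidf", "tfidfns"])
```

The call is left commented out so that running the cell does not touch the file system by accident.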

# 6 - Attribution and footnotes <a class="anchor" id="bullet6"></a>
##### [Back to ToC](#TOC)

Greek base text: Nestle1904 Greek New Testament, edited by Eberhard Nestle, published in 1904 by the British and Foreign Bible Society. Transcription by [Diego Santos](https://sites.google.com/site/nestle1904/home). Public domain.

The [N1904-TF dataset](https://centerblc.github.io/N1904/) is available under the [MIT licence](https://github.com/CenterBLC/N1904/blob/main/LICENSE.md). Formal reference: Tony Jurg, Saulo de Oliveira Cantanhêde, & Oliver Glanz. (2024). *CenterBLC/N1904: Nestle 1904 Text-Fabric data*. Zenodo. DOI: [10.5281/zenodo.13117911](https://doi.org/10.5281/zenodo.13117910).

This notebook is released under the [Creative Commons Attribution 4.0 International (CC BY 4.0)](https://github.com/tonyjurg/create_TF_feature_betacode/blob/main/LICENSE.md).

# 7 - Required libraries<a class="anchor" id="bullet7"></a>
##### [Back to ToC](#TOC)

Since the scripts in this notebook use Text-Fabric, they [currently (Apr 2025) require Python >=3.9.0](https://pypi.org/project/text-fabric), together with the following libraries installed in the environment:

    pandas
    scikit-learn

You can install any missing library from within Jupyter Notebook using either `pip` or `pip3`.

# 8 - Notebook version<a class="anchor" id="bullet8"></a>
##### [Back to ToC](#TOC)

<div style="float: left;">
  <table>
    <tr>
      <td><strong>Author</strong></td>
      <td>Tony Jurg</td>
    </tr>
    <tr>
      <td><strong>Version</strong></td>
      <td>1.0</td>
    </tr>
    <tr>
      <td><strong>Date</strong></td>
      <td>December 7, 2025</td>
    </tr>
  </table>
</div>