# [Doc4TF/tools/displayDeltaBetweenVersions](https://github.com/tonyjurg/Doc4TF/tools/determineDeltaBetweenVersions.ipynb)
#### *Tool to determine which features and feature values changed between two Text-Fabric datasets*

Version: 0.1 (25 September 2024); implementation of enhancement feature [17](https://github.com/tonyjurg/Doc4TF/issues/17).

## Table of contents <a class="anchor" id="TOC"></a>
* <a href="#bullet1">1 - Introduction</a>
* <a href="#bullet2">2 - Preparing the environment</a>
    * <a href="#bullet2x1">2.1 - Setting script version</a>
    * <a href="#bullet2x2">2.2 - Setting script parameters</a>
    * <a href="#bullet2x3">2.3 - Load Text-Fabric code</a>
* <a href="#bullet3">3 - Load the two Text-Fabric datasets</a>
    * <a href="#bullet3x1">3.1 - Load the first dataset</a>
    * <a href="#bullet3x2">3.2 - Load the second dataset</a>
* <a href="#bullet4">4 - Create dictionaries for the two datasets</a>
* <a href="#bullet5">5 - Report the delta between the datasets</a>
* <a href="#bullet6">6 - Licence</a>

# 1 - Introduction <a class="anchor" id="bullet1"></a>
##### [Back to TOC](#TOC)

The main steps in producing the comparison are:
* Load the two Text-Fabric datasets.
* Construct two Python dictionaries storing all the relevant data from both versions.
* Compare the two dictionaries.
* Print the results in an informative format.
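
The steps above can be sketched on toy data, without loading Text-Fabric. This is a minimal sketch, not the implementation used further down in this notebook: the feature names and the helper `summarize_delta` are hypothetical and chosen for illustration only.

```python
# Toy stand-ins for the two feature dictionaries (hypothetical feature names).
features_v1 = {"lemma": "String", "gloss": "String", "num": "Integer"}
features_v2 = {"lemma": "String", "num": "String", "morph": "String"}


def summarize_delta(old, new):
    """Return the features removed, added, and changed between two dictionaries."""
    removed = sorted(old.keys() - new.keys())
    added = sorted(new.keys() - old.keys())
    changed = sorted(k for k in old.keys() & new.keys() if old[k] != new[k])
    return {"removed": removed, "added": added, "changed": changed}


delta = summarize_delta(features_v1, features_v2)
for kind, names in delta.items():
    print(f"{kind}: {', '.join(names) or 'none'}")
```

Set operations on the dictionary keys separate the features found in only one version from those found in both. Only the shared features need a value-by-value comparison.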

# 2 - Preparing the environment<a class="anchor" id="bullet2"></a>
##### [Back to TOC](#TOC)

Your environment must include the Python package `Text-Fabric`. If it is not installed yet, it can be installed using `pip`.

Furthermore, you need to be able to load the Text-Fabric datasets (either from an online resource or from a locally stored copy). There are no further requirements, as the scripts basically operate 'stand alone'.
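
Whether the package is available can be checked before running the rest of the notebook. The following is a minimal sketch that uses only the standard library; the helper name `is_installed` is hypothetical.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be found, without importing it."""
    return importlib.util.find_spec(module_name) is not None


# Text-Fabric is imported under the module name `tf` (see section 2.3).
if is_installed("tf"):
    print("Text-Fabric is available.")
else:
    print("Text-Fabric not found; install it with: pip install text-fabric")
```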

## 2.1 - Setting script version<a class="anchor" id="bullet2x1"></a>

Set the version number and creation date of this script.

#### Required user action:
> Run the following cell to store details on the script version into memory.

In [None]:
scriptVersion="0.1"
scriptDate="25 September 2024"

## 2.2 - Setting script parameters<a class="anchor" id="bullet2x2"></a>

Set some parameters used by the script.
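
The effect of the `tableLimit` parameter can be illustrated on a toy frequency list. This is a sketch under assumptions: the values and counts are invented, and `limit_table` is a hypothetical helper that mirrors the slicing rule applied in section 4.

```python
# Hypothetical frequency list: (value, count) pairs, most frequent first.
frequency_list = [("noun", 120), ("verb", 95), ("adjective", 40), ("adverb", 12)]


def limit_table(freq_list, table_limit):
    """Keep the first `table_limit` entries; 0 means no limit."""
    return freq_list[:table_limit] if table_limit > 0 else freq_list


print(limit_table(frequency_list, 2))  # the two most frequent values only
print(limit_table(frequency_list, 0))  # the full list
```

With a non-zero limit, the comparison in section 5 only sees the truncated lists, so changes in less frequent values may go unreported. Keeping the limit at 0 avoids this.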

#### Required user action:
> Review the options in the following cell and execute the cell.

In [167]:
# This switch can be set to 'True' if you want additional information, such as dictionary entries to be printed. For basic output, set this switch to 'False'.
verbose=False

# Limit the number of entries in the frequency tables per node type (set to 0 for 'no limit')
tableLimit=0

## 2.3 - Load Text-Fabric code<a class="anchor" id="bullet2x3"></a>

#### Required user action:
> Load the Text-Fabric code in this notebook by running the following two cells.

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
# Loading the Text-Fabric code
# Note: it is assumed Text-Fabric is installed in your environment
from tf.fabric import Fabric
from tf.app import use

# 3 - Load the two Text-Fabric datasets<a class="anchor" id="bullet3"></a>
##### [Back to TOC](#TOC)

In this phase, the two Text-Fabric datasets are loaded. Which datasets are loaded is specified in the parameters, as detailed below:
```
Ax = use ("{GitHub user name}/{repository name}", version="{version}")
```

In this notebook, we load the two versions into two objects, named `A1` and `A2` respectively. One consequence of working with two Text-Fabric datasets in the same Python environment is that we need to address them individually when using advanced API functions. This also means the invocation must omit the `hoist=globals()` option.

For various options regarding other possible storage locations, and other load options, see the documentation for function [`use`](https://annotation.github.io/text-fabric/tf/app.html#tf.app.use).
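
Why hoisting does not combine with two datasets can be shown without Text-Fabric. The sketch below is a simulation under assumptions: `api_v1`, `api_v2`, and `hoist_into` are hypothetical stand-ins, not Text-Fabric objects or functions. They only mimic the fact that hoisting binds short names such as `F` in a shared namespace.

```python
from types import SimpleNamespace

# Hypothetical stand-ins for two loaded datasets; real Text-Fabric API
# objects expose much more than a version string.
api_v1 = SimpleNamespace(F=SimpleNamespace(version="0.5.6"))
api_v2 = SimpleNamespace(F=SimpleNamespace(version="0.5.7"))


def hoist_into(namespace, api):
    """Mimic hoisting: bind the API member `F` in a shared namespace."""
    namespace["F"] = api.F


shared = {}
hoist_into(shared, api_v1)
hoist_into(shared, api_v2)  # rebinds the name set by the first call

print(shared["F"].version)                 # only the second dataset is reachable
print(api_v1.F.version, api_v2.F.version)  # separate handles keep both reachable
```

Keeping a separate handle per dataset, as this notebook does with `A1` and `A2`, avoids the name clash.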

## 3.1 - Load the first dataset<a class="anchor" id="bullet3x1"></a>

#### Required user action:
>  Update the next cell to match the first version of the Text-Fabric dataset you wish to compare and then execute the cell.

In [None]:
# Load the app and data from the first dataset
A1 = use ("saulocantanhede/tfgreek2", version="0.5.6")

## 3.2 - Load the second dataset<a class="anchor" id="bullet3x2"></a>

#### Required user action:
>  Update the next cell to match the second version of the Text-Fabric dataset you wish to compare and then execute the cell.

In [None]:
# Load the app and data from the second version in the set for comparison
A2 = use ("saulocantanhede/tfgreek2", version="0.5.7")

# 4 - Create dictionaries for the two datasets<a class="anchor" id="bullet4"></a>
##### [Back to TOC](#TOC)

#### Required user action:
>  Execute the following cell to create dictionaries containing all relevant information for the loaded node and edge features of the two datasets.
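
The shape of a single dictionary entry can be sketched as follows. The keys mirror those written by the cell below; the feature name, description, values, and counts are invented for illustration.

```python
# Hypothetical entry for one feature (all values invented).
example_entry = {
    "name": "lemma",
    "descr": "Dictionary form of the word",
    "type": "Node",
    "datatype": "String",
    "freqlist": {
        # One sub-entry per node type that carries the feature.
        "word": {"nodetype": "word", "freq": [("alpha", 3), ("beta", 1)]},
    },
}

# Look up how often a value occurs on a given node type.
counts = dict(example_entry["freqlist"]["word"]["freq"])
print(counts["alpha"])
```

Because each frequency list is a sequence of `(value, count)` pairs, it converts directly to a dictionary, which is the form the comparison step works with.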

In [169]:
import time

# Initialize both APIs
api1 = A1.api
api2 = A2.api

# Initialize empty dictionaries to store feature data for both APIs
featureDict1 = {}
featureDict2 = {}

# Define some critical variables if not already defined by execution of step 2.2
if 'tableLimit' not in globals(): tableLimit = 0  # same default as step 2.2 (0 = no limit)
if 'verbose' not in globals(): verbose = False
if 'scriptVersion' not in globals(): scriptVersion="not set"
if 'scriptDate' not in globals(): scriptDate="not set"

overallTime = time.time()

def getFeatureDescription(metaData):
    """
    Retrieves the description of a feature from its metadata.
    """
    return metaData.get('description', "No feature description")

def setDataType(metaData):
    """
    Determines the data type of a feature based on its metadata.
    """
    if 'valueType' in metaData:
        return "String" if metaData["valueType"] == 'str' else "Integer"
    return "Unknown"

def processFeature(feature, featureType, featureMethod, api, featureDict):
    """
    Processes a single feature and updates the feature dictionary.
    
    Parameters:
        feature (str): The name of the feature to process.
        featureType (str): Type of the feature ('Node' or 'Edge').
        featureMethod (function): Method to retrieve feature data.
        api: The API instance being processed.
        featureDict (dict): The dictionary to store feature data.
    """
    # Obtain the meta data
    featureMetaData = featureMethod(feature).meta
    featureDescription = getFeatureDescription(featureMetaData)
    dataType = setDataType(featureMetaData)

    # Initialize dictionary to store feature frequency data
    featureFrequencyDict = {}

    # Skip the features 'otype' (node) and 'oslots' (edge)
    if not (featureType == 'Node' and feature == 'otype') and not (featureType == 'Edge' and feature == 'oslots'):
        for nodeType in api.F.otype.all:
            # Pass a one-element set: freqList is documented to take a set of node types,
            # and a bare string could be matched as a substring (e.g. 'phrase' in 'subphrase')
            frequencyLists = featureMethod(feature).freqList({nodeType})
            if isinstance(frequencyLists, int):
                # A plain count instead of a list of (value, count) pairs
                if frequencyLists != 0:
                    featureFrequencyDict[nodeType] = {
                        'nodetype': nodeType, 
                        'freq': [("Link", frequencyLists)]
                    }
            elif len(frequencyLists) != 0:
                featureFrequencyDict[nodeType] = {
                    'nodetype': nodeType, 
                    'freq': frequencyLists[:tableLimit] if tableLimit > 0 else frequencyLists
                }

    # Add processed feature data to the main dictionary
    featureDict[feature] = {
        'name': feature, 
        'descr': featureDescription, 
        'type': featureType, 
        'datatype': dataType, 
        'freqlist': featureFrequencyDict
    }

def process_api(api, featureDict, api_label):
    """
    Processes all node and edge features for a given API and populates the feature dictionary.
    
    Parameters:
        api: The API instance to process.
        featureDict (dict): The dictionary to store feature data.
        api_label (str): Label for the API (used in print statements).
    """
    print(f'Analyzing Node Features for {api_label}: ', end='')
    for nodeFeature in api.Fall():
        if not verbose:
            print('.', end='')  # Progress indicator
        processFeature(nodeFeature, 'Node', api.Fs, api, featureDict)
        if verbose:
            print(f'\nFeature {nodeFeature} = {featureDict[nodeFeature]}\n')
    print('\n')  # Newline after node features

    print(f'Analyzing Edge Features for {api_label}: ', end='')
    for edgeFeature in api.Eall():
        if not verbose:
            print('.', end='')  # Progress indicator
        processFeature(edgeFeature, 'Edge', api.Es, api, featureDict)
        if verbose:
            print(f'\nFeature {edgeFeature} = {featureDict[edgeFeature]}\n')
    print('\n')  # Newline after edge features

########################################################
#                     MAIN FUNCTION                    #
########################################################

# Gather generic information for first dataset (stored in API1)
print('Gathering generic details for first dataset')

# Initialize default values
corpusName1 = A1.appName
liveName1 = 'not set'
versionName1 = A1.version

# Locate corpus information for first dataset (stored in API1)
if A1.provenance:
    for parts in A1.provenance[0]: 
        if isinstance(parts, tuple):
            key, value = parts[0], parts[1]
            if verbose: print(f'API1 General info: {key}={value}')
            if key == 'corpus': corpusName1 = value
            if key == 'version': versionName1 = value
            # Value for live is a tuple
            if key == 'live': liveName1 = value[1]


# Repeat the generic information gathering for API2 if needed
print('Gathering generic details for second dataset')

# Initialize default values for API2
corpusName2 = A2.appName
liveName2 = 'not set'
versionName2 = A2.version

# Locate corpus information for API2
if A2.provenance:
    for parts in A2.provenance[0]: 
        if isinstance(parts, tuple):
            key, value = parts[0], parts[1]
            if verbose: print(f'API2 General info: {key}={value}')
            if key == 'corpus': corpusName2 = value
            if key == 'version': versionName2 = value
            # Value for live is a tuple
            if key == 'live': liveName2 = value[1]

# Process both APIs
process_api(api1, featureDict1, api_label="first dataset (stored in API1)")
process_api(api2, featureDict2, api_label="second dataset (stored in API2)")

print(f'Finished in {time.time() - overallTime:.2f} seconds.')

Gathering generic details for first dataset
Gathering generic details for second dataset
Analyzing Node Features for first dataset (stored in API1): ......................................................

Analyzing Edge Features for first dataset (stored in API1): ....

Analyzing Node Features for second dataset (stored in API2): .......................................................

Analyzing Edge Features for second dataset (stored in API2): ....

Finished in 29.47 seconds.


# 5 - Report the delta between the datasets<a class="anchor" id="bullet5"></a>
##### [Back to TOC](#TOC)

#### Required user action:
>  Execute the following cell to create a detailed report indicating the delta between the two datasets.
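
How a changed count shows up in the report can be sketched on toy data. This is a simplified variant, not the cell below: it uses set differences and sorts the result, whereas the notebook keeps the original list order. The values, counts, and the helper `diff_frequencies` are hypothetical.

```python
# Hypothetical (value, count) frequency lists for one feature on one node type.
freq_v1 = [("sg", 100), ("pl", 60), ("dual", 2)]
freq_v2 = [("sg", 100), ("pl", 62)]


def diff_frequencies(old, new):
    """Return the (value, count) pairs that occur in only one of the two lists."""
    old_pairs, new_pairs = set(old), set(new)
    return {
        "Dataset1": sorted(old_pairs - new_pairs),
        "Dataset2": sorted(new_pairs - old_pairs),
    }


print(diff_frequencies(freq_v1, freq_v2))
```

A value whose count changed (`pl` here) appears on both sides with its respective count. A value that exists in only one dataset (`dual`) appears on one side only.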

In [171]:
from IPython.display import display, HTML

# Function to compare two feature dictionaries and report datatype and freqlist differences
def compare_feature_dicts(dict1, dict2):
    """
    Compares two feature dictionaries and returns a report of differences,
    filtering out identical entries in both datasets, and comparing 'datatype' and 'freqlist'.
    
    Parameters:
        dict1 (dict): The first feature dictionary.
        dict2 (dict): The second feature dictionary.
    
    Returns:
        dict: A dictionary containing the differences between dict1 and dict2.
    """
    report = {
        'only_in_dict1': [],
        'only_in_dict2': [],
        'differences_in_common': {}
    }
    
    # Convert keys to sets for set operations
    keys1 = set(dict1.keys())
    keys2 = set(dict2.keys())
    
    # Features only in dict1
    only_in_1 = keys1 - keys2
    report['only_in_dict1'] = sorted(list(only_in_1))
    
    # Features only in dict2
    only_in_2 = keys2 - keys1
    report['only_in_dict2'] = sorted(list(only_in_2))
    
    # Features present in both dictionaries
    common_features = keys1 & keys2
    
    for feature in sorted(common_features):  # sorted for a stable report order
        differences = {}
        feature1 = dict1[feature]
        feature2 = dict2[feature]
        
        # Compare 'datatype'
        datatype1 = feature1.get('datatype', None)
        datatype2 = feature2.get('datatype', None)
        
        if datatype1 != datatype2:
            differences['datatype'] = {'Dataset1': datatype1, 'Dataset2': datatype2}
        
        # Compare 'freqlist'
        freqlist1 = feature1.get('freqlist', {})
        freqlist2 = feature2.get('freqlist', {})
        
        freqlist_diff = {}

        # Compare individual frequency items and only report differences
        for nodetype in freqlist1.keys() | freqlist2.keys():  # union of keys to include all nodetypes
            freq1 = dict(freqlist1.get(nodetype, {}).get('freq', []))
            freq2 = dict(freqlist2.get(nodetype, {}).get('freq', []))
            
            # Compare each tuple in the freqlist
            diff1 = [t for t in freq1.items() if t not in freq2.items()]  # Differences in Dataset1
            diff2 = [t for t in freq2.items() if t not in freq1.items()]  # Differences in Dataset2
            
            if diff1 or diff2:  # Only if there are differences
                freqlist_diff[nodetype] = {'Dataset1': diff1, 'Dataset2': diff2}
        
        if freqlist_diff:
            differences['freqlist'] = freqlist_diff
        
        if differences:
            report['differences_in_common'][feature] = differences
    
    return report


# Function to generate HTML delta report
def generate_html_delta_report(report):
    """
    Generates an HTML delta report from the comparison with highlighted differences.
    
    Parameters:
        report (dict): The comparison report generated by compare_feature_dicts.
    
    Returns:
        str: The formatted delta report in HTML.
    """
    html = []
    html.append("<!DOCTYPE html>")
    html.append("<html lang='en'>")
    html.append("<head>")
    html.append("<meta charset='UTF-8'>")
    html.append("<meta name='viewport' content='width=device-width, initial-scale=1.0'>")
    html.append("<title>Delta Report</title>")
    # Add some basic styling
    html.append("""
    <style>
        body { font-family: Arial, sans-serif; margin: 20px; }
        .only-in-1 { color: #E74C3C; }
        .only-in-2 { color: #E67E22; }
        .feature-name { color: #2980B9; }
        .diff-key { color: #8E44AD; }
        .freq-type { color: #16A085; }
        .freq-value { color: #D35400; }
        .section { margin-bottom: 20px; }
        ul { list-style-type: disc; margin-left: 40px; }
        .nodetype { color: #2C3E50; }
        .api1 { color: #3498DB; }
        .api2 { color: #1ABC9C; }
        table { border-collapse: collapse; width: 100%; margin-bottom: 20px; }
        th, td { border: 1px solid #dddddd; text-align: left; padding: 8px; }
        th { background-color: #f2f2f2; }
    </style>
    """)
    html.append("</head>")
    html.append("<body>")
    
    html.append("<h1>Delta Report</h1>")
    
    # Features only in dict1
    if report['only_in_dict1']:
        html.append("<div class='section'>")
        html.append("<h2>Features only in dataset 1:</h2>")
        html.append("<ul>")
        for feature in report['only_in_dict1']:
            html.append(f"<li class='only-in-1'>{feature}</li>")
        html.append("</ul>")
        html.append("</div>")
    
    # Features only in dict2
    if report['only_in_dict2']:
        html.append("<div class='section'>")
        html.append("<h2>Features only in dataset 2:</h2>")
        html.append("<ul>")
        for feature in report['only_in_dict2']:
            html.append(f"<li class='only-in-2'>{feature}</li>")
        html.append("</ul>")
        html.append("</div>")
    
    # Differences in common features
    if report['differences_in_common']:
        html.append("<div class='section'>")
        html.append("<h2>Differences in Common Features:</h2>")
        for feature, diffs in report['differences_in_common'].items():
            html.append(f"<h3 class='feature-name'>Feature: {feature}</h3>")
            for key, change in diffs.items():
                if key == 'datatype':
                    html.append("<p><strong class='diff-key'>Datatype Difference:</strong></p>")
                    html.append("<ul>")
                    html.append(f"<li class='api1'>Dataset 1: {change['Dataset1']}</li>")
                    html.append(f"<li class='api2'>Dataset 2: {change['Dataset2']}</li>")
                    html.append("</ul>")
                elif key == 'freqlist':
                    freqlist = change
                    for nodetype, freq_diff in freqlist.items():
                        html.append(f"<p><strong class='diff-key'>Nodetype: {nodetype}</strong></p>")
                        html.append("<ul>")
                        # freq_diff maps 'Dataset1' and 'Dataset2' to lists of (value, count) tuples
                        dataset1_val = ', '.join([f"{t[0]}: {t[1]}" for t in freq_diff['Dataset1']]) if freq_diff['Dataset1'] else 'None'
                        dataset2_val = ', '.join([f"{t[0]}: {t[1]}" for t in freq_diff['Dataset2']]) if freq_diff['Dataset2'] else 'None'
                        if dataset1_val != 'None':
                            html.append(f"<li>Dataset 1: {dataset1_val}</li>")
                        if dataset2_val != 'None':
                            html.append(f"<li>Dataset 2: {dataset2_val}</li>")
                        html.append("</ul>")
        html.append("</div>")
    
    html.append("</body>")
    html.append("</html>")
    
    report_html = "\n".join(html)
    
    return report_html


# Function to display HTML report in Jupyter Notebook
def display_html_report(report_html):
    """
    Displays the HTML delta report within the Jupyter Notebook.
    
    Parameters:
        report_html (str): The HTML delta report content.
    """
    display(HTML(report_html))


# Example usage:
# Assuming you have `featureDict1` and `featureDict2` defined from your dataset.

# Compare the dictionaries
delta_report = compare_feature_dicts(featureDict1, featureDict2)

# Generate the HTML delta report
report_html = generate_html_delta_report(delta_report)

# Display the report in the Jupyter Notebook
display_html_report(report_html)

# 6 - Licence<a class="anchor" id="bullet6"></a>
##### [Back to TOC](#TOC)

Licensed under [Creative Commons Attribution 4.0 International (CC BY 4.0)](https://github.com/tonyjurg/Doc4TF/blob/main/LICENCE.md)