# Version Diffing with Python

In this tutorial, we will explore different approaches to compare JSON objects from Speckle commits. Speckle is a data management platform that allows users to store, share, and manage data across multiple applications. The goal of this tutorial is to provide you with a comprehensive understanding of the following:

1. **Using Speckle SDKs** 🛠️: We will start by demonstrating how to use the Speckle SDKs for Python to retrieve and handle source data from Speckle. This will help you understand how to work with the Speckle platform more effectively and perform the comparison using the SDKs. (Jupyter Notebook)

2. **Basic Example in Python** 🐍: Next, we will provide a simple Python implementation to compare JSON objects from two Speckle commits. This will help you understand the basic structure of the problem and how to approach it. (Jupyter Notebook)

By the end of this tutorial, you should have a clear understanding of various approaches to compare JSON objects from Speckle commits, and how to choose the best approach based on your specific use case. 💡

## Using Speckle SDKs

The Speckle SDKs provide a set of tools to help you work with the Speckle platform. In this tutorial, we will use the Speckle Python SDK (specklepy) to retrieve and handle source data from Speckle. specklepy is a Python library that lets you interact with the Speckle API and perform various operations on Speckle objects. You can find more information about the Speckle Python SDK [here](https://speckle.guide/dev/python.html).

### Installation

To install the Speckle Python SDK, you can use the following command:
  
```bash
pip install specklepy
```

---
*In a Jupyter Notebook, I run command-line tasks through the Jupyter kernel, so the same commands work in a notebook as they would in a terminal. `%%capture` is a Jupyter magic command that captures the output of a cell, which is useful for installs that produce noisy output.*


In [1]:
%%capture
%pip install specklepy

# dotenv is a library that allows you to load environment variables from a .env file
%pip install python-dotenv
%reload_ext dotenv
%dotenv

Retrieving the Speckle commit data is a two-step process. First, we retrieve the commit object from the Speckle server. Then, we use the object id referenced by that commit to retrieve the actual data. Before either step, we load the server address and access token from environment variables and authenticate a client:

In [2]:
# boilerplate user credentials and server

import os
HOST_SERVER = os.getenv('HOST_SERVER')
ACCESS_TOKEN = os.getenv('ACCESS_TOKEN')
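
`os.getenv` returns `None` when a variable is not set, so a typo in the `.env` file only shows up later. A minimal sketch of failing early instead; `require_env` is a hypothetical helper written for this tutorial, not part of specklepy or python-dotenv:

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise a clear error."""
    value = os.getenv(name)
    if not value:
        # Covers both a missing variable and one set to an empty string
        raise RuntimeError(
            f"Environment variable {name!r} is not set; check your .env file"
        )
    return value
```

You could then write `HOST_SERVER = require_env("HOST_SERVER")` in place of the plain `os.getenv` call.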

In [3]:
from specklepy.api.client import SpeckleClient

client = SpeckleClient(host=HOST_SERVER)  # or whatever your host is
client.authenticate_with_token(ACCESS_TOKEN)  # or whatever your token is

I will choose a stream, and for this first simple example I will retrieve the two most recent commits to the `main` branch.

In [4]:
stream_id = "20e76a799c"  # or whatever your stream id is

from specklepy.transports.server import ServerTransport
from specklepy.api.wrapper import StreamWrapper

transport = ServerTransport(client=client, stream_id=stream_id)

stream = client.stream.get(stream_id)

By default, `commit.list()` returns commits from most recent to oldest, so the first item in the list is the latest commit and the second is the one before it.

In [5]:
commits = client.commit.list(stream_id=stream_id, limit=2)

from specklepy.api import operations

# get obj id from latest commit
latest = commits[0].referencedObject
previous = commits[1].referencedObject

# receive objects from speckle
latest_data = operations.receive(obj_id=latest, remote_transport=transport)
previous_data = operations.receive(obj_id=previous, remote_transport=transport)

In this example I'm using a simple model from Rhino. The topmost object is a `Collection` object of type `rhino model`.

The totalChildrenCount is 25, but those children can form a nested structure of any depth. This is a collection of all the elements selected and sent in that commit. For ease, this is a straightforward model, but in principle you will need to transform the nested data into a flat structure.

In the Speckle Docs we have an example of how to do this in .NET but I'll interpret it in Python here.

In [6]:
from collections.abc import Iterable, Mapping
from specklepy.objects import Base


def flatten(obj, visited=None):
    
    # Avoiding pesky circular references
    if visited is None:
        visited = set()

    if obj in visited:
        return

    visited.add(obj)

    # Define a logic for what objects to include in the diff
    should_include = any(
        [
            hasattr(obj, "displayValue"),
            hasattr(obj, "speckle_type")
            and obj.speckle_type == "Objects.Organization.Collection",
            hasattr(obj, "displayStyle"),
        ]
    )

    if should_include:
        yield obj

    props = obj.__dict__

    # traverse the object's nested properties - which may include yieldable objects
    for prop in props:
        value = getattr(obj, prop)

        if value is None:
            continue

        if isinstance(value, Base):
            yield from flatten(value, visited)

        elif isinstance(value, Mapping):
            for dict_value in value.values():
                if isinstance(dict_value, Base):
                    yield from flatten(dict_value, visited)

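        # strings are iterable too, but can never contain Base objects
        elif isinstance(value, Iterable) and not isinstance(value, (str, bytes)):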
            for list_value in value:
                if isinstance(list_value, Base):
                    yield from flatten(list_value, visited)
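
To see the traversal at work without a server connection, here is a reduced sketch on a toy tree. `Node` and `flatten_nodes` are hypothetical stand-ins (not specklepy classes), and the inclusion rule is simplified to "has a `displayValue`":

```python
from collections.abc import Iterable, Mapping


class Node:
    """Hypothetical stand-in for specklepy's Base; attributes live in __dict__."""

    def __init__(self, **attrs):
        self.__dict__.update(attrs)


def flatten_nodes(obj, visited=None):
    """Yield every Node that has a displayValue, walking nested attributes."""
    if visited is None:
        visited = set()
    if id(obj) in visited:  # track object identity so cycles terminate
        return
    visited.add(id(obj))

    if hasattr(obj, "displayValue"):
        yield obj

    for value in vars(obj).values():
        if isinstance(value, Node):
            children = [value]
        elif isinstance(value, Mapping):
            children = list(value.values())
        elif isinstance(value, Iterable) and not isinstance(value, (str, bytes)):
            children = list(value)
        else:
            continue
        for child in children:
            if isinstance(child, Node):
                yield from flatten_nodes(child, visited)


wall = Node(name="wall", displayValue="mesh-1")
door = Node(name="door", displayValue="mesh-2")
layer = Node(name="layer", elements=[wall, door])
root = Node(name="root", elements=[layer], lookup={"main": wall})
wall.parent = layer  # deliberate circular reference

names = [node.name for node in flatten_nodes(root)]
print(names)  # ['wall', 'door']
```

The wall is reachable twice (through the layer and through the `lookup` dictionary) and also points back at its parent, yet it is yielded only once.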

NB: The Speckle ObjectLoader JavaScript package also has a `flatten` function that can be used to do this. The GraphTraversal methods in SpeckleSharp can define a custom TraversalBreaker function to do this.

You'll also find flattening methods in some of our Connectors code - choose your poison.

In [7]:
latest_objects = list(flatten(latest_data))
previous_objects = list(flatten(previous_data))

The task at hand is to compare the two commits and find the differences.

In [8]:
from specklepy.objects.base import Base
from typing import Iterator, List, Optional, Tuple


def compare_speckle_commits(
    commit1_objects: List[Base], commit2_objects: List[Base]
) -> Iterator[Tuple[Optional[Base], Optional[Base]]]:
    # Yields (old, new) pairs; index by id, skipping the first (root) object of each list
    commit1_dict = {obj.id: obj for obj in commit1_objects[1:]}
    commit2_dict = {obj.id: obj for obj in commit2_objects[1:]}

    # Find unchanged objects
    for obj_id in commit1_dict.keys():
        if obj_id in commit2_dict.keys():
            yield (commit1_dict[obj_id], commit2_dict[obj_id]) # old, new

    # Find changed objects
    for obj_id, obj in commit1_dict.items():
        if obj_id not in commit2_dict.keys() and obj.applicationId in [
            x.applicationId for x in commit2_dict.values()
        ]:
            yield (
                obj, # old object
                [ x for x in commit2_dict.values()
                    if x.applicationId == obj.applicationId
                ][0], # new changed object
            )

    # Find added objects
    for obj_id, obj in commit2_dict.items():
        if obj_id not in commit1_dict.keys() and obj.applicationId not in [
            x.applicationId for x in commit1_dict.values()
        ]:
            yield (None, obj) # old, new

    # Find removed objects
    for obj_id, obj in commit1_dict.items():
        if obj_id not in commit2_dict.keys() and obj.applicationId not in [
            x.applicationId for x in commit2_dict.values()
        ]:
            yield (obj, None) # old, new

In this first example, knowing that we are using a small data sample, I rebuild a list of `applicationId` values for every object checked, which makes this potentially an O(n^2) operation. Building a set or dictionary once could improve this to O(n), but we will leave that for later. Perhaps I'll award extra credit for community contributions. :D
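
For readers who want a head start on that extra credit, here is one possible sketch of the indexed approach. It runs on a hypothetical stand-in class `Obj` rather than specklepy's `Base`, and it makes two assumptions that differ from the function above: objects without an `applicationId` are never treated as changed, and the root object is not skipped.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional, Tuple


@dataclass(frozen=True)
class Obj:
    """Hypothetical stand-in for a Speckle object: only the two ids the diff needs."""

    id: str
    applicationId: Optional[str] = None


Pair = Tuple[Optional[Obj], Optional[Obj]]  # (old, new)


def compare_indexed(
    old_objects: Iterable[Obj], new_objects: Iterable[Obj]
) -> Iterator[Pair]:
    """Yield (old, new) pairs using dictionary lookups built once up front."""
    old_by_id = {o.id: o for o in old_objects}
    new_by_id = {o.id: o for o in new_objects}
    # If several objects share an applicationId, the last one wins (assumption)
    old_by_app = {
        o.applicationId: o for o in old_by_id.values() if o.applicationId is not None
    }
    new_by_app = {
        o.applicationId: o for o in new_by_id.values() if o.applicationId is not None
    }

    for obj_id, old in old_by_id.items():
        if obj_id in new_by_id:
            yield old, new_by_id[obj_id]  # unchanged
        elif old.applicationId in new_by_app:
            yield old, new_by_app[old.applicationId]  # changed
        else:
            yield old, None  # removed

    for obj_id, new in new_by_id.items():
        if obj_id not in old_by_id and new.applicationId not in old_by_app:
            yield None, new  # added


old = [Obj("a1", "A"), Obj("b1", "B"), Obj("c1", "C")]
new = [Obj("a1", "A"), Obj("b2", "B"), Obj("d1", "D")]
result = list(compare_indexed(old, new))
```

Here `result` holds one unchanged pair (`a1`), one changed pair (`b1` to `b2`), one removal (`c1`), and one addition (`d1`). The pairs come out in a different order than in the list-based version, which does not matter once they are sorted into categories.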

specklepy will soon sync with speckle-sharp and include the `Collection` class, but for now we can define our own version in place for this example.

In [9]:
from typing import Optional


class Collection(
    Base, speckle_type="Objects.Organization.Collection", detachable={"elements"}
):
    name: Optional[str] = None
    collectionType: Optional[str] = None
    elements: Optional[List[Base]] = None

The meat and potatoes of the comparison is the `store_speckle_commit_diff` function. It iterates over the pairs yielded by `compare_speckle_commits` and returns a dictionary that sorts the objects into four categories: added, removed, changed (stored as `changed_from` and `changed_to`), and unchanged.

The changed category is the most interesting. It uses the `applicationId` to determine whether an object that disappeared from the previous commit is the same as one that appeared in the latest commit. If it is, the pair is marked as changed. If not, the object from the previous commit is marked as removed and the object from the latest commit as added.

Highlighting what has changed within an object is beyond the scope of this tutorial, but you can use the resulting `diff` dictionary to find the differences and then use just those ids to limit the scope of your more intensive comparison.

In [10]:
# Compare two Speckle commits and populate a base object with the results
def store_speckle_commit_diff(commit1_objects, commit2_objects):
    diff = {
      'changed_to': [],
      'changed_from': [],
      'added': [],
      'removed': [],
      'unchanged': []
    }

    for obj in compare_speckle_commits(commit1_objects, commit2_objects):
        if getattr(obj[0], "id", None) == getattr(obj[1], "id", None):            
            diff["unchanged"].append(obj[0]) # old object, though it's the same as the new object
        elif (
            obj[0] is not None
            and obj[1] is not None
            and getattr(obj[0], "id", None) != getattr(obj[1], "id", None)
        ):
            diff["changed_to"].append(obj[1]) # new object
            diff["changed_from"].append(obj[0]) # old object
        elif obj[0] is None and obj[1] is not None:
            diff["added"].append(obj[1]) # new object
        elif obj[0] is not None and obj[1] is None:
            diff["removed"].append(obj[0]) # old object

    return diff

In [11]:
diff = store_speckle_commit_diff(previous_objects, latest_objects)
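
Because `changed_from` and `changed_to` are appended together, the two lists are index-aligned, so `zip(diff["changed_from"], diff["changed_to"])` pairs each old object with its replacement. As a rough idea of what a more intensive comparison on such a pair could look like, here is a sketch on plain dictionaries. `changed_properties` and the sample property names are hypothetical; with real Speckle objects you would first have to decide which members to compare.

```python
from typing import Any, Dict, Tuple


def changed_properties(
    old: Dict[str, Any], new: Dict[str, Any], ignore: Tuple[str, ...] = ("id",)
) -> Dict[str, Tuple[Any, Any]]:
    """Map each differing key to an (old_value, new_value) pair."""
    changes = {}
    for key in sorted(set(old) | set(new)):
        if key in ignore:
            continue  # the id is a hash, so it always differs for changed objects
        before, after = old.get(key), new.get(key)  # a missing key is treated like None
        if before != after:
            changes[key] = (before, after)
    return changes


old_wall = {"id": "b1", "applicationId": "B", "height": 3.0, "material": "brick"}
new_wall = {"id": "b2", "applicationId": "B", "height": 3.5, "material": "brick"}
print(changed_properties(old_wall, new_wall))  # {'height': (3.0, 3.5)}
```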

In [12]:
diff_result = Base(name="Diff")

unchanged = Collection(
    name="Unchanged",
    elements=[
        o
        for o in diff["unchanged"]
        if o.speckle_type != "Objects.Organization.Collection"
    ],
    collectionType="Unchanged Objects",
)
changed_to = Collection(
    name="Changed To",
    elements=[
        o
        for o in diff["changed_to"]
        if o.speckle_type != "Objects.Organization.Collection"
    ],
    collectionType="Changed Objects",
)
changed_from = Collection(
    name="Changed From",
    elements=[
        o
        for o in diff["changed_from"]
        if o.speckle_type != "Objects.Organization.Collection"
    ],
    collectionType="Changed Objects",
)
added = Collection(
    name="Added",
    elements=[
        o for o in diff["added"] if o.speckle_type != "Objects.Organization.Collection"
    ],
    collectionType="Added Objects",
)
removed = Collection(
    name="Removed",
    elements=[
        o
        for o in diff["removed"]
        if o.speckle_type != "Objects.Organization.Collection"
    ],
    collectionType="Removed Objects",
)

diff_result["@unchanged"] = unchanged
diff_result["@changed_to"] = changed_to
diff_result["@changed_from"] = changed_from
diff_result["@added"] = added
diff_result["@removed"] = removed

obj_id = operations.send(base=diff_result, transports=[transport])
commit_id = client.commit.create(
    stream_id=transport.stream_id,
    object_id=obj_id,
    message="Diff!",
    branch_name="diffs",
)

With the changed, added, and removed collections in hand, we can now start to compare what has changed.

With just the added and removed objects we can start to show which are the same objects represented by a different hash.

The basic example examines the `applicationId` property* of the Speckle objects in the two commits. If an `applicationId` is found among both the added and the removed objects, the pair is recorded in the `changed_from` and `changed_to` lists. If it is found among the added objects only, the object goes to the `added` list. If it is found among the removed objects only, the object goes to the `removed` list.


\* *Not all software produces reliable applicationIds, so this is not a foolproof method. It is a good starting point, but you may need to use other properties to determine if two objects are the same.*

#### Just a few additional notes:

This is a simplified version of the diff process that might not work for every scenario. For example, if the model has been manipulated in such a way that the `applicationId` of an object changes (e.g., an object was deleted and a new one was created in its place), this will be detected as a removed and added object, not a modified one.

The diff operation can be quite resource intensive depending on the size of the commits. It's generally a good idea to limit the scope of the diff to the parts of the model you're interested in, if possible.
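
One simple way to limit the scope is to filter the flattened objects before diffing them. The sketch below assumes filtering by a `speckle_type` prefix is meaningful for your model; `Item` and `limit_scope` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Item:
    """Hypothetical stand-in for a flattened Speckle object."""

    id: str
    speckle_type: str


def limit_scope(objects: Iterable[Item], type_prefix: str) -> List[Item]:
    """Keep only the objects whose speckle_type starts with the given prefix."""
    return [o for o in objects if o.speckle_type.startswith(type_prefix)]


sample = [
    Item("m1", "Objects.Geometry.Mesh"),
    Item("w1", "Objects.BuiltElements.Wall"),
    Item("m2", "Objects.Geometry.Mesh"),
]
meshes_only = limit_scope(sample, "Objects.Geometry")
print([o.id for o in meshes_only])  # ['m1', 'm2']
```

Applying the same filter to both flattened lists before comparing them keeps the diff focused on the object types you care about.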

Lastly, it's important to mention that the diff operation is performed locally on your machine using the data fetched from the server. This means that your machine needs to have sufficient resources to perform this operation, especially for large models.

### Next Time: Automation

In the next tutorial, we will discuss how to automate the comparison process using the Speckle SDKs. This will demonstrate running analysis in response to change as indicated by new commits to a stream.

### Additional Resources

- [Speckle Python SDK](https://speckle.guide/dev/python.html)
- [Speckle API](https://speckle.guide/dev/api.html)
