<a href="https://colab.research.google.com/github/thefroggydoggy/python-demos/blob/main/Comparing_Python_Data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Comparing data with Python

Comparing data of various structures is central to verifying that a solution works properly. You will often have known inputs and known expected results, so knowing how to compare data effectively helps you write unit tests for the functions you develop.

## Basic Python types

The `int`, `str`, `float`, and `bool` objects can all be compared by their `type` and by their value. Depending on your use case, the `type` may or may not matter as long as the values are equal.
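One subtlety worth keeping in mind when the `type` does matter: `bool` is a subclass of `int`, so `isinstance` alone will not tell them apart. The sketch below uses a hypothetical helper, `same_type_and_value` (not part of the original notebook), to show a stricter check based on the exact type.

```python
def same_type_and_value(a, b):
    """Hypothetical helper: equal only when the exact types match too."""
    return type(a) is type(b) and a == b

print(isinstance(True, int))                    # True: bool is a subclass of int
print(True == 1)                                # True: the values compare equal
print(same_type_and_value(True, 1))             # False: bool is not exactly int
print(6040 == 6040.0)                           # True: int and float compare by value
print(same_type_and_value(6040, 6040.0))        # False: int is not float
print(same_type_and_value('froggy', 'froggy'))  # True
```

Whether the strict or the loose check is appropriate depends on the use case.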

In [13]:
my_int = 6040
my_wrong_int = 6041
my_float = 6040.0
my_wrong_float = 6040.1
true_float = 6040.0
true_int = 6040
float_tolerances = [0.2, 2**(-16)]
my_str = 'froggy'
my_wrong_str = 'Froggy'
true_str = 'froggy'

### Compare types - use `isinstance`
print(
    f'isinstance(my_int, type(true_int)) -> {isinstance(my_int, type(true_int))}', # True
    f'isinstance(my_wrong_int, type(true_int)) -> {isinstance(my_wrong_int, type(true_int))}', # True
    f'isinstance(my_float, type(true_int)) -> {isinstance(my_float, type(true_int))}', # False
    f'isinstance(my_str, type(true_str)) -> {isinstance(my_str, type(true_str))}', # True
    sep='\n'
)
### Compare value
print(
    f'my_float == true_int -> {my_float == true_int}', # True
    f'my_wrong_str == true_str -> {my_wrong_str == true_str}', # False
    ### Floats should be checked for being "close enough" - what that means depends on the use case.
    f'abs(my_float - true_float) <= float_tolerances[0] -> {abs(my_float - true_float) <= float_tolerances[0]}', # True
    abs(my_wrong_float - true_float) <= float_tolerances[0], # True (wide tolerance)
    abs(my_wrong_float - true_float) <= float_tolerances[1], # False (narrower tolerance)
    sep='\n'
)

isinstance(my_int, type(true_int)) -> True
isinstance(my_wrong_int, type(true_int)) -> True
isinstance(my_float, type(true_int)) -> False
isinstance(my_str, type(true_str)) -> True
my_float == true_int -> True
my_wrong_str == true_str -> False
abs(my_float - true_float) <= float_tolerances[0] -> True
True
False
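As an alternative to writing `abs(a - b) <= tolerance` by hand, the standard library offers `math.isclose`. The sketch below reuses the values from the cell above; setting `rel_tol=0.0` is a choice made here so that only the absolute tolerance applies, matching the manual check.

```python
import math

true_float = 6040.0
my_wrong_float = 6040.1

# rel_tol=0.0 disables the relative check, leaving only the absolute tolerance.
wide = math.isclose(my_wrong_float, true_float, rel_tol=0.0, abs_tol=0.2)
narrow = math.isclose(my_wrong_float, true_float, rel_tol=0.0, abs_tol=2**-16)

# Rounding error is why exact equality is fragile for floats.
exact = (0.1 + 0.2 == 0.3)
close = math.isclose(0.1 + 0.2, 0.3)  # default relative tolerance of 1e-09

print(wide, narrow, exact, close)  # True False False True
```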


## Basic Python data structures

- `list` and `tuple` types are equal when...
  - Both are the same length.
  - The objects at each corresponding index are equal.
  - Compare by verifying length and iterating through objects checking for equality.
- `dict` types are equal when...
  - Both dicts have the same keys.
  - The values associated with each key are equal.
  - Compare by verifying keys are the same and iterating through objects checking for equality.
- `set` types are equal when both contain the same values.
  - Compare by casting each set to a sorted list and comparing the results as lists. This requires the elements to be orderable against each other.
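For reference, Python's built-in `==` already follows these rules for flat containers. The sketch below illustrates them with made-up values. It also shows two limits: `==` compares floats exactly, and sorting a set fails when its elements cannot be ordered against each other.

```python
same_lists = [1, 2, 3] == [1, 2, 3]               # same length, equal items
list_vs_tuple = [1, 2, 3] == (1, 2, 3)            # a list never equals a tuple
same_dicts = {'a': 1, 'b': 2} == {'b': 2, 'a': 1}  # key order is ignored
same_sets = {1, 2, 3} == {3, 2, 1}                # same members
float_lists = [0.1 + 0.2] == [0.3]                # exact float comparison

print(same_lists, list_vs_tuple, same_dicts, same_sets, float_lists)
# True False True True False

# Sorting only works when the elements can be ordered against each other.
try:
    sorted({1, 'a'})
    sortable = True
except TypeError:
    sortable = False
print(sortable)  # False
```

The exact float comparison is the main reason to write a custom comparison when a tolerance is needed.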

## Handling nested data structures

The basic Python data structures can be nested to any depth and in any combination. We can implement a recursive solution to compare them. Here's the strategy:

1. Determine the types of the objects being compared.
2. Use the comparison strategy outlined above for that type to determine equality.
3. Use a recursive call to determine the equality of child objects.
   - Base types do not make a recursive call since they have no children.
   - Unrecognized or mismatched types return `False`.

In [78]:
def compare_nested_data(a, b, lvl=0, tolerance=2**(-16), verbose=False):
  ### Define Helper functions

  ### Compares Lists or Tuples
  def compare_lists(a, b, lvl=0, verbose=False):
    ### check length
    if len(a) != len(b):
      if verbose: print(f'lvl {lvl}: lists/tuples different lengths')
      return False
    ### iterate over objects and compare recursively
    for i, (ai, bi) in enumerate(zip(a, b)):
      c = compare_nested_data(ai, bi, lvl+1, tolerance=tolerance, verbose=verbose)
      if not c:
        if verbose: print(f'lvl {lvl}: mismatch at index {i}: {a} is not equal to {b}')
        return False
    return True
  ### Compares Dictionaries
  def compare_dicts(a, b, lvl=0, verbose=False):
    ### Check keys
    keys_match = a.keys() == b.keys()
    if not keys_match:
      if verbose: print(f'lvl {lvl}: keys do not match: mismatched keys - a: {(a.keys() - b.keys())}, b:  {(b.keys() - a.keys())}')
      return False
    ### Iterate over values and compare recursively
    for k in a.keys():
      c = compare_nested_data(a[k], b[k], lvl+1, tolerance=tolerance, verbose=verbose)
      if not c:
        if verbose: print(f'lvl {lvl}: mismatch at key {k}: {a[k]} is not equal to {b[k]}')
        return False
    return True

  ### Handle different types

  ### Handle lists/tuples
  if (isinstance(a, tuple) and isinstance(b, tuple)) or (isinstance(a, list) and isinstance(b, list)):
    return compare_lists(a,b, lvl, verbose=verbose)
  ### Handle sets
  if (isinstance(a, set) and isinstance(b, set)):
    return compare_lists(sorted(a), sorted(b), lvl, verbose=verbose)
  ### Handle dicts
  if isinstance(a, dict) and isinstance(b, dict):
    return compare_dicts(a, b, lvl, verbose=verbose)
  ### Handle base type float
  if isinstance(a, float) and isinstance(b, float):
    if (abs(a - b) > tolerance):
      if verbose: print(f'lvl {lvl}: {a} is different from {b}')
      return False
    return True
  ### Handle other base types
  ### (bool is a subclass of int, so True and 1 compare as equal here)
  if (isinstance(a, str) and isinstance(b, str)) or (isinstance(a, int) and isinstance(b, int)) or (isinstance(a, bool) and isinstance(b, bool)):
    if (a != b):
      if verbose: print(f'lvl {lvl}: {a} is different from {b}')
      return False
    return True
  if verbose: print(f'lvl {lvl}: type mismatch between a ({type(a)}) and b ({type(b)}), or unsupported type')
  return False


In [79]:
compare_nested_data(tuple_c, tuple_b, verbose=True)
print()
compare_nested_data({(1,2), (3, ('foo', 'bar'))},
    {(1,2), (1,2), (3, ('foo',))})
print()
compare_nested_data({'foo':{'lower level': 3}, 'bar': {'something': {'else': 4}}}, 
    {'foo':{'lower level': 3}, 'baz': {'something': {'else': 3}}})
print()
compare_nested_data(tuple_c, list_b)
print()

lvl 2: 4 is different from 1
lvl 1: mismatch at index 0: [4, 2, 3, 1, 4, 2, 3, 1] is not equal to [1, 2, 3, 4, 1, 2, 3, 4]
lvl 0: mismatch at index 1: ([1, 2, 3, 4, 1, 2, 3, 4], [4, 2, 3, 1, 4, 2, 3, 1]) is not equal to ([1, 2, 3, 4, 1, 2, 3, 4], [1, 2, 3, 4, 1, 2, 3, 4])
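Since the goal of these comparisons is to support unit tests, here is a minimal sketch of how similar checks might look with the standard `unittest` module. The function under test, `scale`, and the test class are hypothetical and exist only for illustration. The tolerance `2**-16` is the default used earlier in this notebook.

```python
import unittest

def scale(values, factor):
    """Hypothetical function under test: multiply each value by factor."""
    return [v * factor for v in values]

class ComparisonTests(unittest.TestCase):
    def test_exact_structure(self):
        # assertEqual relies on ==, so nested containers compare element-wise.
        self.assertEqual({'a': [1, 2], 'b': (3,)}, {'b': (3,), 'a': [1, 2]})

    def test_floats_within_tolerance(self):
        result = scale([0.1, 0.2], 3)
        expected = [0.3, 0.6]
        self.assertEqual(len(result), len(expected))
        for r, e in zip(result, expected):
            # delta is an absolute tolerance, like abs(r - e) <= delta.
            self.assertAlmostEqual(r, e, delta=2**-16)

# Build and run the suite explicitly so this also works inside a notebook.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(ComparisonTests)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Running the suite explicitly avoids `unittest.main()` exiting the interpreter, which is inconvenient in a notebook.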




