# Problem

We need a way to compare:
1. Two REDCap projects with each other.
2. The same REDCap project before vs. after a change.

For example, we need to know:
1. The difference in data when running a script on a Test project vs. a Target project.
2. The difference in data before vs. after running a script on a project.

# Solution

### This notebook is currently very limited.
- It can only compare effectively when the condition/filter logic is simple.
- Simple filter logic has exactly one condition. Logic with multiple clauses, such as `[variable2] = '2' AND [variable3] = '3'`, is not yet handled.
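
To make this limitation concrete, here is a minimal sketch of how a single-condition filter string could be split into a field, an operator, and a value. The function `parse_simple_condition`, the `OPERATORS` table, and the regular expression are hypothetical illustrations, not the notebook's actual `Filter` implementation.

```python
import operator
import re

# Hypothetical mapping from comparison symbols to Python operator functions.
OPERATORS = {
    "=": operator.eq,
    "!=": operator.ne,
    "<>": operator.ne,
}

# Matches exactly one clause of the form: [field] <op> 'value'
SIMPLE_CONDITION = re.compile(r"^\s*\[(\w+)\]\s*(=|!=|<>)\s*'([^']*)'\s*$")


def parse_simple_condition(logic):
    """Split a single-clause filter into (field, operator function, value)."""
    match = SIMPLE_CONDITION.match(logic)
    if match is None:
        # Multi-clause logic (AND/OR) falls through to here.
        raise ValueError(f"Only single-condition logic is supported: {logic!r}")
    field, symbol, value = match.groups()
    return field, OPERATORS[symbol], value


field, op, value = parse_simple_condition("[patient_status_1] = ''")
print(field, repr(value), op("", value))
```

Under this sketch, a string such as `[variable2] = '2' AND [variable3] = '3'` does not match the pattern and raises a `ValueError`, which mirrors the restriction described above.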

This notebook compares REDCap data between two Excel files.
The two files can come from two different projects, or from the same project at two points in time.



# Code
### 1. Define first & second Excel files

In [None]:
import pandas as pd
from datetime import datetime
from Filter.Filter import Filter
from Change.Change import Change
from utils import check_equality

In [None]:
# FIRST: data that comes first
first_excel_path = "Personal Test_before_update.xlsx"
first_excel_df = pd.read_excel(first_excel_path)

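# SECOND: data that comes after
# NOTE: this path is currently identical to first_excel_path; point it at the
# later export, otherwise the diff below will be empty.
second_excel_path = "Personal Test_before_update.xlsx"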
second_excel_df = pd.read_excel(second_excel_path)

In [None]:
# Define the conditions and their expected differences
conditions_lst = [
    Filter("[patient_status_1] = ''", [Change("patient_status_1", "", "1")])
]

### 2. Get merged and diff df
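
The merge step in this section can be illustrated on toy data. This is a sketch only: the two small frames and their values are made up for illustration, and only the `pd.merge(..., how="outer", indicator=True)` call mirrors the notebook.

```python
import pandas as pd

# Toy "before" and "after" exports; the values are illustrative only.
before = pd.DataFrame({"record_id": [1, 2], "patient_status_1": ["", "2"]})
after = pd.DataFrame({"record_id": [1, 2], "patient_status_1": ["1", "2"]})

# With no `on=` argument, merge joins on all shared columns, so a row only
# counts as "both" when every column matches in the two files.
merged = pd.merge(before, after, how="outer", indicator=True)

# Rows present in only one of the two files are the differences.
diff = merged[merged["_merge"] != "both"]
print(diff)
```

Record 2 is identical in both frames and is dropped from `diff`, while record 1 appears twice: once as `left_only` (the old value) and once as `right_only` (the new value).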

In [None]:
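# Outer-merge on all shared columns; `_merge` records which file each row came from
merged_df = pd.merge(first_excel_df, second_excel_df, how="outer", indicator=True)
# Keep only the rows that appear in exactly one of the two files
diff_df = merged_df[merged_df["_merge"] != "both"]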

In [None]:
diff_df

In [None]:
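# One filesystem-safe timestamp for both files (no ":" in the file names)
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
diff_df.to_excel(f"Diff_{timestamp}.xlsx")
merged_df.to_excel(f"Merge_{timestamp}.xlsx")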

### 3. Validate comparison

Check each difference against the expected changes defined in `conditions_lst`.
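
Blank cells in an Excel file are read by pandas as `NaN`, and whole numbers in a column with blanks come back as floats, so a plain `==` between an expected value such as `""` or `"1"` and a cell value can fail even when the data agree. The sketch below shows one possible way to compare values with that in mind. The helpers `normalize` and `values_match` are hypothetical and are not the `check_equality` function imported from `utils`, whose behavior is not shown in this notebook.

```python
import pandas as pd


def normalize(value):
    """Map a cell value to a comparable string (hypothetical helper)."""
    # Missing cells (NaN/None) and empty strings both count as blank.
    if pd.isna(value) or value == "":
        return ""
    # Whole-number floats such as 1.0 are rendered as "1".
    if isinstance(value, float) and value.is_integer():
        return str(int(value))
    return str(value)


def values_match(expected, actual):
    """Return True when both values normalize to the same string."""
    return normalize(expected) == normalize(actual)


print(values_match("", float("nan")))
print(values_match("1", 1.0))
print(values_match("1", "2"))
```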

In [None]:
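def validate_conditions(conditions_lst, df):
    """Check the two differing rows of one record against the expected changes.

    `df` must hold exactly two rows; their order follows `merged_df`.
    """
    first_row = df.iloc[0]
    second_row = df.iloc[1]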

    for condition in conditions_lst:
        field_name = condition.condition_field
        operator = condition.condition_operator
        value = condition.condition_value
        changes = condition.changes

        print("=========")

        first_value = first_row[field_name]
        second_value = second_row[field_name]
        print("Values:")
        print(f"{first_value} -> {second_value}")
        condition_validated = operator(second_row[field_name], value)
        # An empty value ('') is read from Excel as NaN, so treat NaN as a match
        if value == "''" and pd.isna(first_value):
            condition_validated = True

        # If second_row satisfies condition, see if change is valid
        if condition_validated:
            for change in changes:
                change_field_name = change.field_name
                change_old_value = change.old_value
                change_new_value = change.new_value

                print(f"""Change:
{change_field_name}
{change_old_value}
{change_new_value}""")
                print(f"""
second_row[change_field_name]: {second_row[change_field_name]}
    """)

                # If the old value does not match, this change is not applicable: skip
                if not check_equality(change_old_value, second_row[change_field_name]):
                    print("Different!")
                    continue
                print("Same!")

                # The old value matches, so the other row must hold the new value
                if not check_equality(change_new_value, first_row[change_field_name]):
                    raise ValueError(
                        f"Invalid difference with condition {condition.condition_obj}.\n"
                        f"Base row value: {second_row[change_field_name]}\n"
                        f"Change old value: {change_old_value}\n"
                        f"Other row value: {first_row[change_field_name]}\n"
                        f"Change new value: {change_new_value}"
                    )

In [None]:
diff_groupby_df = diff_df.groupby("record_id")

for name, group in diff_groupby_df:
    # Each differing record must appear once per file: one "before" and one "after" row
    if len(group) != 2:
        raise ValueError(
            f"There should be exactly 2 rows in each group.\nName: {name}\n{group}"
        )
    validate_conditions(conditions_lst, group)

print(f"Comparison successful! Between: \n{first_excel_path}\n{second_excel_path}")
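
The loop above stops at the first record that does not have exactly two rows. As a possible complement, the sketch below lists every such record in one pass, which may help when a record exists in only one of the two files. The toy frame is made up for illustration; only the `record_id` and `_merge` column names come from the notebook.

```python
import pandas as pd

# Toy diff table: record 3 appears only once (it exists in only one file).
toy_diff = pd.DataFrame({
    "record_id": [1, 1, 3],
    "_merge": ["left_only", "right_only", "right_only"],
})

# Count rows per record and keep the records that do not form a before/after pair.
rows_per_record = toy_diff.groupby("record_id").size()
unpaired = rows_per_record[rows_per_record != 2]
print(unpaired.index.tolist())
```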