# Issue 442

We observed a social worker appearing twice in the registry.

This notebook investigates that report. Our initial hypothesis is that the provider was created twice; the documents in the archive then confirm it.

## Imports

In [None]:
import json
import pathlib

import pandas as pd
from IPython.display import HTML
from scope.documents import document_set
from scope.populate.data.archive import Archive

## Obtain Archive

In [None]:
# Provide the file name of the encrypted archive.
# archive_file_name = input("Encrypted archive file name: ")
archive_file_name = "archive_scca_v0.7.0_20230702_final.zip"

# Provide password to encrypted archive.
# getpass keeps the password from being echoed into the notebook output.
import getpass

archive_password = getpass.getpass("Encrypted archive password: ")

# Obtain a full path to encrypted archive, relative to the location of the notebook.
# Expects the encrypted archive to be in the "secrets/data" directory.
archive_path = pathlib.Path(
    "../../../secrets/data",
    archive_file_name,
)

In [None]:
print("Decrypting archive:")
print(archive_path.resolve())

# Obtain the archive.
archive = Archive.read_archive(
    archive_path=archive_path,
    password=archive_password,
)

print("Decryption complete.")

## Obtain Providers Documents

Load every document in the "providers" collection, excluding the sentinel. Later cells further filter and inspect these documents.

In [None]:
# Obtain all documents in the "providers" collection.
documents_providers = archive.collection_documents(
    collection="providers",
)

# Filter out the sentinel.
documents_providers = documents_providers.remove_sentinel()

## Identify Duplicate Providers

List the current revision of every provider, sorted by name, so that a duplicate can be spotted visually.

In [None]:
# Filter out old revisions.
documents_current = documents_providers.remove_revisions()

# Convert to dataframe.
df_current = pd.DataFrame.from_records(documents_current.documents)

# Filter to relevant columns.
df_current = df_current[
    [
        "name",
        "providerId",
    ]
]

# Sort for inspection.
df_current = df_current.sort_values(
    [
        "name",
        "providerId",
    ]
)

HTML(df_current.to_html(index=False))
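
The visual check can also be done programmatically. The following is a minimal sketch, assuming a dataframe with the same `name` and `providerId` columns as `df_current`. The rows in `df_example` are invented placeholders, not data from the archive.

```python
import pandas as pd

# Hypothetical stand-in for df_current; names and ids are invented for illustration.
df_example = pd.DataFrame(
    [
        {"name": "Provider A", "providerId": "id-001"},
        {"name": "Provider B", "providerId": "id-002"},
        {"name": "Provider A", "providerId": "id-003"},
    ]
)

# keep=False marks every row whose name occurs more than once,
# not only the second and later occurrences.
df_duplicates = df_example[df_example.duplicated(subset=["name"], keep=False)]
df_duplicates = df_duplicates.sort_values(["name", "providerId"])

print(df_duplicates.to_string(index=False))
```

This only catches exact name matches. Names that differ in case or whitespace would need to be normalized first.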

## Inspect Duplicates

The above confirms a duplicate provider: the same name appears under two distinct `providerId` values.

Inspect the revision history of both documents to confirm that the first was created on 2022-03-28 and the duplicate on 2022-08-29.

In [None]:
documents_bfi3jrlzu7ygu = documents_providers.filter_match(
    match_values={
        "providerId": "bfi3jrlzu7ygu",
    }
).order_by_revision()

for document_current in documents_bfi3jrlzu7ygu:
    print(document_set.datetime_from_document(document=document_current))
    print(json.dumps(document_current, indent=2))

In [None]:
documents_jg6bdxyte6nwk = documents_providers.filter_match(
    match_values={
        "providerId": "jg6bdxyte6nwk",
    }
).order_by_revision()

for document_current in documents_jg6bdxyte6nwk:
    print(document_set.datetime_from_document(document=document_current))
    print(json.dumps(document_current, indent=2))

## Inspect References

Inspect all documents that reference either of the two `providerId` values.

We need to understand these references before preparing a fix as part of data migration.

In [None]:
# Any document that contains the providerId.
documents_bfi3jrlzu7ygu_references = document_set.DocumentSet(
    documents=[
        document_current
        for document_current in archive.entries.values()
        if "bfi3jrlzu7ygu" in json.dumps(document_current)
    ]
)

# But not the providerIdentity documents themselves.
documents_bfi3jrlzu7ygu_references = documents_bfi3jrlzu7ygu_references.remove_all(
    documents=documents_bfi3jrlzu7ygu,
)

print("{} references".format(len(documents_bfi3jrlzu7ygu_references)))

In [None]:
# Any document that contains the providerId.
documents_jg6bdxyte6nwk_references = document_set.DocumentSet(
    documents=[
        document_current
        for document_current in archive.entries.values()
        if "jg6bdxyte6nwk" in json.dumps(document_current)
    ]
)

# But not the providerIdentity documents themselves.
documents_jg6bdxyte6nwk_references = documents_jg6bdxyte6nwk_references.remove_all(
    documents=documents_jg6bdxyte6nwk,
)

print("{} references".format(len(documents_jg6bdxyte6nwk_references)))
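
The two reference searches differ only in the `providerId` they look for. The following is a minimal sketch of the same substring search written as a reusable function, using plain dictionaries in place of the archive entries. `find_references` and the sample documents are hypothetical and are not part of `scope`.

```python
import json


def find_references(documents, provider_id, exclude=()):
    """Return documents whose JSON serialization contains provider_id.

    Documents listed in `exclude` (for example the provider's own
    documents) are dropped from the result.
    """
    return [
        document
        for document in documents
        if provider_id in json.dumps(document) and document not in exclude
    ]


# Invented sample documents, for illustration only.
own = {"_type": "providerIdentity", "providerId": "id-001"}
documents = [
    own,
    {"_type": "example", "assignedTo": {"providerId": "id-001"}},
    {"_type": "example", "assignedTo": {"providerId": "id-002"}},
]

references = find_references(documents, "id-001", exclude=[own])
print("{} references".format(len(references)))
```

Because this is a plain substring match on the serialized document, it would also match an id that happens to appear inside a longer string. That seems unlikely for randomly generated ids, but it is worth keeping in mind when reading the counts.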