# Problem Statement

In imaging studies, PHI is generally removed by applying anonymization rules to DICOM tags, combined with the CRPs' efforts to delete series that contain PHI or to redact information burned into the images.

However, in rare cases, PHI can slip through in study or series descriptions. For example, a description might include the subject's name.

# Solution

Using the NLTK library, the notebook finds low-frequency words in the descriptions of the studies and series uploaded to the specified Inteleshare project. These rare words are then checked manually for PHI.
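The idea can be illustrated on toy data. The sketch below is not the notebook's code: it uses `collections.Counter` (which NLTK's `FreqDist` subclasses) and plain `str.split` in place of `word_tokenize`, and the descriptions are made up, with "john" and "smith" standing in for leaked PHI.

```python
from collections import Counter

# Hypothetical descriptions; the last one carries a made-up subject name.
descriptions = [
    "MRI_KNEE_LEFT",
    "MRI_KNEE_RIGHT",
    "MRI_KNEE_LEFT",
    "MRI_KNEE_RIGHT_JOHN_SMITH",
]

# Split on underscores and lowercase, as the notebook does before tokenizing.
tokens = []
for description in descriptions:
    tokens.extend(description.replace("_", " ").lower().split())

freq = Counter(tokens)

# most_common() sorts by descending count, so the tail holds the rarest tokens.
rarest = freq.most_common()[-2:]
print(rarest)
```

Protocol words such as "mri" and "knee" repeat across studies, so they sit at the head of the list, while a one-off name ends up in the tail that gets reviewed.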

# Code
### 1. Set up Inteleshare project

In [None]:
import AMBRA_Utils
import itertools
from nltk import FreqDist, word_tokenize

In [None]:
ambra_account_name = "MOST"
ambra = AMBRA_Utils.utilities.get_api()
account = ambra.get_account_by_name(ambra_account_name)
namespace = account.get_location_by_name('3 - Assigned Studies')

### 2. Get all studies and series

In [None]:
studies = list(namespace.get_studies())

In [None]:
studies_desc_tokens = []
series_desc_tokens = []

for study in studies:
    # Replace underscores with spaces so each word becomes its own token, then lowercase
    study_tokens = word_tokenize(' '.join(study.formatted_description.split('_')).lower())
    studies_desc_tokens.append(study_tokens)
    series = study.get_series()
    for s in series:
        # Same normalization for each series description
        s_tokens = word_tokenize(' '.join(s.formatted_description.split('_')).lower())
        series_desc_tokens.append(s_tokens)

In [None]:
# Studies

studies_desc_tokens_flat = list(itertools.chain.from_iterable(studies_desc_tokens))
studies_desc_freq = FreqDist(studies_desc_tokens_flat)

# Get least frequent
num = 40
studies_least_common = studies_desc_freq.most_common()[-num:]

# Manually check if studies desc contains PHI
studies_least_common

In [None]:
# Series

series_desc_tokens_flat = list(itertools.chain.from_iterable(series_desc_tokens))

# Get rid of hash strings by ignoring tokens that have at least 10 characters
# and contain at least 6 digits
series_desc_tokens_flat_filtered = []
for series_token in series_desc_tokens_flat:
    num_count = sum(char.isdigit() for char in series_token)

    # Check the length of the token itself, not of the list of token lists
    if not (num_count >= 6 and len(series_token) >= 10):
        series_desc_tokens_flat_filtered.append(series_token)


series_desc_freq = FreqDist(series_desc_tokens_flat_filtered)

# Get least frequent
num = 400
series_least_common = series_desc_freq.most_common()[-num:]

# Manually check if series desc contains PHI
series_least_common
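
The hash filter used for series tokens can also be written as a small predicate, which makes the thresholds easier to test and tune. This is a minimal sketch, not part of the original notebook: the helper name `looks_like_hash`, its parameters, and the sample tokens are hypothetical, and the defaults mirror the 10-character and 6-digit thresholds used above.

```python
def looks_like_hash(token, min_length=10, min_digits=6):
    """Return True for long tokens that are mostly digits, e.g. hash-like IDs."""
    digit_count = sum(char.isdigit() for char in token)
    return len(token) >= min_length and digit_count >= min_digits


# Made-up tokens: one hash-like string, one 8-digit date, three ordinary words.
sample_tokens = ["sag", "t2", "1a2b3c4d5e6f7a8b", "localizer", "20190412"]
kept = [token for token in sample_tokens if not looks_like_hash(token)]
print(kept)
```

With these defaults, an 8-digit token such as a date is shorter than 10 characters, so it is kept and still shows up in the manual review.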