# 1-OCR.ipynb

  > [Zachary Kilhoffer](https://zkilhoffer.github.io/ "Author's website")

  > Updated 2024-06-17

## Description
The code is part of the topic modeling pipeline for privacy standards described in this paper: 
- [Kilhoffer, Z. et al. (2024). "Cloud Privacy Beyond Legal Compliance: An NLP analysis of certifiable privacy and security standards"](https://ieeexplore.ieee.org/document/10631062 "IEEE page for paper")


This notebook shows much of the data cleaning work the pipeline required.

It handles OCR error correction by querying a GPT model.
***

In [None]:
import pandas as pd

In [None]:
# load data, which is all OCR'd text
df = pd.read_excel(
    r"all_document_data.xlsx",
    index_col=0,
)

Notes about the standards after manual inspection:
- c5: `control_name` is very short; `control_text` holds the detail.
- eu_coc: naming conventions differ. `control_name` is closer to the control text, and `control_text` is actually "control guidance".
- iso_27002: most of the `control_text` is referential.

## OpenAI OCR error correction

In [None]:
from openai import OpenAI

# function to retrieve key
def read_key_from_file(filename=r"../YOURS-HERE.txt"):
    with open(filename, 'r') as file:
        return file.read().strip()

# initialize client
client = OpenAI(organization="YOURS-HERE", api_key=read_key_from_file())

In [None]:
# pick one sample text to test the prompt on before paying for API calls over the whole dataset
text_to_check = df.loc[1506, 'control_text']
# the text will look a little messy because of OCR errors
text_to_check

'control the organization should determine and securely maintain the necessary records in support of its obligations for the processing of personally identifiable information. implementation guidance a way to maintain records of the processing of personally identifiable information is to have an inventory or list of the personally identifiable information processing activities that theorganization performs. such an inventory can include: the type of processing; the purposes for the processing; a description of the categories of personally identifiable information and personally identifiable information principals (e.g. children); the categories of recipients to whom personally identifiable information has been or will be disclosed, including recipients in third countries or international organizations; a general description of the technical and organizational security measures; and a privacy impact assessment report. such an inventory should have an owner who is responsible for its acc

In [None]:
# querying API
completion = client.chat.completions.create(
  model="gpt-3.5-turbo",  # plenty of choices of models, but this one is cheap, fast, and pretty good
  messages=[
    {"role": "system", "content": "I will provide you text from cybersecurity and privacy documentation like ISO-IEC 27017, FedRamp, etc. The text is the result of imperfect optical character recognition (OCR), which needs to have errors fixed. Your job is to return the same text with OCR errors corrected. Some text I give you will include headings. This is especially likely if there's no ending punctuation. When some part of the text is likely to be a heading, treat it like its own sentence, separating it from other text with a period. Examples of headings may include something like 'Controls' or 'Responsibilities of data controller'."},
    {"role": "user", "content": text_to_check}
  ]
)

# Show main part of response
completion.choices[0].message.content

'Control: The organization should determine and securely maintain the necessary records in support of its obligations for the processing of personally identifiable information. \n\nImplementation guidance: A way to maintain records of the processing of personally identifiable information is to have an inventory or list of the personally identifiable information processing activities that the organization performs. \n\nSuch an inventory can include:\n\n- The type of processing\n- The purposes for the processing\n- A description of the categories of personally identifiable information and personally identifiable information principals (e.g., children)\n- The categories of recipients to whom personally identifiable information has been or will be disclosed, including recipients in third countries or international organizations\n- A general description of the technical and organizational security measures\n- A privacy impact assessment report\n\nSuch an inventory should have an owner who i

In [None]:
# That worked, so we make our function
def ocr_correction(text_to_check):
    completion = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {
                "role": "system",
                "content": "I will provide you text from cybersecurity and privacy documentation like FedRamp, SOC-2, etc. The text is the result of imperfect optical character recognition (OCR), which needs to have errors fixed. Your job is to return the same text with OCR errors corrected. Some text I give you will include headings. Text is especially likely to be a heading if there's no ending punctuation. When some part of the text is likely to be a heading, treat it like its own sentence, separating it from other text with a period. Examples of headings may include something like 'Controls' or 'Responsibilities of data controller'.",
            },
            {"role": "user", "content": text_to_check},
        ],
    )
    return completion.choices[0].message.content

Note: correcting the text of the control's name was not successful, probably because the names are so short.

Once in a while, the model wrote about the name of the control instead of simply returning a version of it with any OCR errors fixed.
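One way to catch that failure mode is to check that the model's output still resembles its input. The sketch below is not part of the original pipeline: the function names and the `min_ratio` threshold are assumptions for illustration, and the similarity measure is the standard library's `difflib.SequenceMatcher`.

```python
from difflib import SequenceMatcher


def looks_like_correction(original, corrected, min_ratio=0.8):
    """Return True if `corrected` is still close to `original`.

    `min_ratio` is an assumed threshold, not a value taken from this notebook.
    """
    # Compare case-insensitively: the OCR'd text is lowercase,
    # while the model tends to restore capitalization.
    matcher = SequenceMatcher(
        None, original.lower(), corrected.lower(), autojunk=False
    )
    return matcher.ratio() >= min_ratio


def keep_or_fall_back(original, corrected, min_ratio=0.8):
    """Keep the model output only if it resembles the input; otherwise keep the input."""
    if looks_like_correction(original, corrected, min_ratio):
        return corrected
    return original
```

A response that fixes a typo or two scores close to 1.0, while a response that starts explaining the control scores far lower, so the original text is kept instead. The threshold would need tuning on real examples.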

In [None]:
# applying function to correct OCR text (one API call per row)
df['control_text_corrected'] = df['control_text'].apply(ocr_correction)
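Since this makes one API call per row, a single transient failure can abort the whole run. Below is a minimal sketch of a retry wrapper with exponential backoff. `with_retries` and its parameters are hypothetical names, not part of the notebook or of the OpenAI library, and catching every `Exception` is a deliberate simplification.

```python
import time


def with_retries(func, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Wrap `func` so that a failed call is retried with exponential backoff."""

    def wrapper(*args, **kwargs):
        for attempt in range(1, max_attempts + 1):
            try:
                return func(*args, **kwargs)
            except Exception:
                if attempt == max_attempts:
                    raise  # give up after the last attempt
                # wait 1x, 2x, 4x ... the base delay before the next try
                sleep(base_delay * 2 ** (attempt - 1))

    return wrapper
```

Under these assumptions, the call above could become `df['control_text'].apply(with_retries(ocr_correction))`. The `sleep` parameter exists only so the waiting can be swapped out in tests.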

In [None]:
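# Apply the custom function to create the new column
# NOTE: concatenate_with_condition is not defined in this notebook; it must be defined before running this cell
df['full_control_text'] = df.apply(concatenate_with_condition, axis=1)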

print(df.shape)

(1661, 9)


In [None]:
# Fix the "Nan. <control text>" problem that appears at the start of the text for some documents:
# iso_27002, iso_27018, iso_27017, eu_coc, iso_27701
df['full_control_text'] = df['full_control_text'].str.replace(r'^Nan\. ', '', regex=True)
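The helper `concatenate_with_condition` is not shown in this notebook. Judging from the "Nan. " prefix that has to be cleaned up, it presumably joins the control name and the corrected text with a period. The sketch below is a guess at such a helper, not the original code: the column names `control_name` and `control_text_corrected` come from this notebook, everything else is an assumption. It skips missing names, so the "Nan. " prefix never appears.

```python
import math


def is_missing(value):
    """True for None or a float NaN (how pandas represents an empty cell)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def concatenate_with_condition(row):
    """Hypothetical helper: prefix the corrected text with the control name, if any."""
    name = row["control_name"]
    text = row["control_text_corrected"]
    if is_missing(name) or not str(name).strip():
        return text  # no name: return the text alone instead of "Nan. <text>"
    name = str(name).strip()
    # Capitalize only the first character and leave the rest untouched.
    return f"{name[0].upper()}{name[1:]}. {text}"
```

Because the function only indexes `row` by column name, it works both on a plain dict and on the rows passed by `df.apply(..., axis=1)`.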

## Identifying reference-only controls

- It's valuable to remove controls that are mere references, like "the blah blah in ISO/IEC 10.1.2 applies".
- But some of the longer referential texts carry interesting information, so we keep any text of 250 characters or more.
- We use rules and regex to identify the reference-only controls.


The actual code was longer but not really instructive, as I can't show the original texts due to copyright.
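Since the real texts can't be shown, here is a toy illustration of the rule on invented sentences. It uses only the standard `re` module. The two patterns are condensed forms of patterns used in this notebook (`requirements?` covers the singular and plural variants), and `is_reference_only` is a hypothetical name.

```python
import re

# Condensed versions of two of the notebook's patterns.
pattern = re.compile(
    r"the requirements? stated in.*appl|"
    r"the objectives? specified in.*appl"
)
MAX_REFERENCE_LENGTH = 250  # same cut-off as in the notebook


def is_reference_only(text):
    """A control is reference-only if it is short AND matches a reference pattern."""
    return len(text) < MAX_REFERENCE_LENGTH and pattern.search(text) is not None


# Invented examples, not text from any standard.
examples = [
    "the requirements stated in iso/iec 27002, 5.1.1 apply.",
    "the organization should maintain an inventory of processing activities.",
    "the objective specified in iso/iec 27002, 9.1 applies. "
    + "additional guidance follows. " * 10,
]
flags = [is_reference_only(text) for text in examples]
print(flags)  # [True, False, False]
```

The third example matches a pattern but is longer than the cut-off, so it is kept as a regular control.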

In [None]:
df['document'].value_counts(dropna=False)

document
fedramp      410
iso_27701    238
ccm          197
iso_27017    154
c5           121
iso_27018    114
iso_27002    110
nist         100
iso_27001     93
eu_coc        62
soc2          61
NaN            1
Name: count, dtype: int64

In [None]:
# combine the regex patterns into a single alternation
pattern = (
    r"the control.*implementation guidance and other information stated in.*appl|"
    r"the requirements stated in.*appl|"
    r"the requirement stated in.*appl|"
    r"the objective specified in.*appl|"
    r"the objectives specified in.*appl|"
    r"control.*and the associated implementation guidance.*specified in.*appl|"
    r"the objective specified in.*iso.*appl|"
    r"the objectives specified in iso.*appl|"
    r"control.*implementation guidance and other information.*appl"
)

# Apply the pattern and length condition to filter the dataframe
# (na=False treats rows with a missing control_text as non-matches)
reference_only_df = df[
    (df["control_text"].str.len() < 250)
    & df["control_text"].str.contains(pattern, regex=True, na=False)
]

In [None]:
# how many controls from each document are referential?
reference_only_df['document'].value_counts()

document
iso_27017    107
iso_27701     97
iso_27001     93
iso_27018     70
iso_27002     69
Name: count, dtype: int64

In [None]:
# how many controls from each document remain?
# (this output reflects df after filtering; the filtering step itself is not shown in this notebook)
df['document'].value_counts()

document
fedramp      410
ccm          197
iso_27701    141
c5           121
eu_coc        62
soc2          61
iso_27017     47
iso_27018     44
iso_27002     41
Name: count, dtype: int64