In [1]:
import logging
log_level = 25
logging.basicConfig(level=log_level) # root logger

import os
import re
import pandas as pd
from bs4 import BeautifulSoup
import sys
sys.path.append('E:/Code/chat/gdpr')


In [None]:
from ico.guidelines.guidelines.spiders.monitor_worker import MonitorWorkerSpider
os.chdir('../ico/guidelines/guidelines/') # change working directory
!scrapy crawl monitor_worker -O monitor_worker.json  # run the spider; -O overwrites monitor_worker.json, -o appends to it
os.chdir('../../../conversion_notebooks/') # reset working directory
print(f"Current working directory: {os.getcwd()}")

In [23]:
import json
data_file = "../ico/guidelines/guidelines/monitor_worker.json"
with open(data_file, 'r', encoding='utf-8') as file:
    data = json.load(file)

# later I will split data_file into individual questions. I will use the file split_data_file to save that data
# Split the path into directory and file name
directory, file_name = os.path.split(data_file)
# Split the file name into name and extension
name, extension = os.path.splitext(file_name)
# Append "_split" to the file name
split_file_name = name + "_split" + extension
# Combine the directory and the new file name
split_data_file = os.path.join(directory, split_file_name)

print(f"The file {split_data_file} will be used later to save the chunked version of the data_file")

The file ../ico/guidelines/guidelines\monitor_worker_split.json will be used later to save the chunked version of the data_file
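The "_split" file name can also be derived with `pathlib`, which avoids the mixed separators visible in the output above. This is only a sketch: the helper name `with_split_suffix` is an assumption made for illustration, and `PurePosixPath` is chosen here purely so the result uses forward slashes on every platform.

```python
from pathlib import PurePosixPath

def with_split_suffix(path_str):
    """Return path_str with "_split" inserted before the file extension."""
    path = PurePosixPath(path_str)
    # stem is the file name without its extension, suffix is the extension itself
    return str(path.with_name(f"{path.stem}_split{path.suffix}"))

print(with_split_suffix("../ico/guidelines/guidelines/monitor_worker.json"))
```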


## Code to search through an ICO Guidance page
The content of these pages (already scraped using Scrapy) has a particular layout. A page is either plain un-sectioned text or, possibly after some introductory notes, it starts with a section called "In detail", which is a table of contents (ToC) of FAQs. The function below finds "In detail" if it exists and then looks for the in-page links (e.g. #can1) in the ToC that point to anchors in the document. It keeps only the first instance of each link.

The output is a list of FAQs from the ToC along with their relative links.

In [22]:
def find_faq_questions_and_links(content):

    start_pos = content.find("In detail")
    
    # If "In detail" is found, slice the content so that it starts at that heading
    if start_pos != -1:
        content = content[start_pos:]
    
    # Regular expression to find all <a> tags with href and the associated text. The # here ensures we only find links in the input content
    matches = re.findall(r'<a href="#([^"]+)"[^>]*>(.*?)</a>', content)

    # Initialize an empty set to track unique href references
    seen_hrefs = set()

    # Filter the list to include only valid and unique references that appear later in the document
    questions_with_references = []
    for href, question in matches:
        if href not in seen_hrefs and f'id="{href}"' in content:
            questions_with_references.append((question, href))
            seen_hrefs.add(href)

    return questions_with_references

# Quick check to see how many pages have been downloaded and which of them contains a ToC
print(f"Downloaded the content from {len(data)} pages")
counter = 1
for page in data:
    name = page["section"]
    questions_with_references = find_faq_questions_and_links(page['content'])
    if len(questions_with_references) > 0:
        print(f"Page {counter}: {name:<100}- is an FAQ page")    
    else:
        print(f"Page {counter}: {name:<100}- contains only content, and no FAQ")
    counter += 1

Downloaded the content from 6 pages
Page 1: Employment practices and data protection: monitoring workers                                        - contains only content, and no FAQ
Page 2: Data protection and monitoring workers                                                              - is an FAQ page
Page 3: What do we need to do if we use monitoring tools that use solely automated processes?               - is an FAQ page
Page 4: Specific data protection considerations for different ways or methods of monitoring workers         - is an FAQ page
Page 5: Can we use biometric data for time and access control and monitoring?                               - is an FAQ page
Page 6: Checklists                                                                                          - contains only content, and no FAQ
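To make the matching rules concrete, here is a minimal self-contained sketch that applies the same logic to a made-up HTML snippet. The helper name `find_toc_links` and the sample markup are assumptions for illustration; they are not taken from the scraped ICO pages.

```python
import re

def find_toc_links(content):
    """Return (question, href) pairs for in-page links whose target id exists."""
    start = content.find("In detail")
    if start != -1:
        content = content[start:]
    seen, pairs = set(), []
    for href, question in re.findall(r'<a href="#([^"]+)"[^>]*>(.*?)</a>', content):
        # Keep a link only the first time it appears and only if its anchor exists
        if href not in seen and f'id="{href}"' in content:
            pairs.append((question, href))
            seen.add(href)
    return pairs

# Hypothetical page: two valid ToC links, one dangling link, one repeated link
sample_html = (
    '<h2>In detail</h2>'
    '<ul><li><a href="#q1">What is monitoring?</a></li>'
    '<li><a href="#q2">Can we monitor?</a></li>'
    '<li><a href="#missing">Dangling link</a></li></ul>'
    '<h3><a id="q1"></a>What is monitoring?</h3><p>Answer one.</p>'
    '<h3><a id="q2"></a>Can we monitor?</h3><p>Answer two. <a href="#q1">Back</a></p>'
)

print(find_toc_links(sample_html))
```

The dangling link is dropped because no matching `id` exists, and the repeated `#q1` link is ignored because that reference has already been seen.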


For each FAQ page, divide the full page into subsections so that each subsection holds exactly one question.

In [25]:
def subdivide_content_into_faq(content, questions_with_references):

    # Pattern of the anchor that marks the start of each question in the body of the content
    section_pattern = r'<a id="{}"></a>'

    # List to store the subsections
    subsections = []

    # Track where the previous subsection ended; the search starts at the beginning of the content
    last_pos = 0

    for question, href in questions_with_references:
        # Compile the pattern for the current href
        pattern = re.compile(section_pattern.format(href))
        
        # Find the section start position
        match = pattern.search(content, last_pos)
        
        if match:
            # Get the position of the match
            start_pos = match.start()
            
            if last_pos == 0 and start_pos > 0:
                # Capture text before the first section if there is any
                subsections.append(content[last_pos:start_pos].strip())
            
            # Find the next occurrence of any section or the end of the content
            next_section_match = None
            for next_question, next_href in questions_with_references:
                if next_href != href:
                    next_pattern = re.compile(section_pattern.format(next_href))
                    next_section_match = next_pattern.search(content, start_pos + len(match.group()))
                    if next_section_match:
                        break
            
            if next_section_match:
                end_pos = next_section_match.start()
            else:
                end_pos = len(content)
            
            # Extract the subsection
            subsection = content[start_pos:end_pos].strip()
            subsections.append(subsection)
            
            # Update the last position
            last_pos = end_pos

    # Capture any remaining content after the last section
    if last_pos < len(content):
        subsections.append(content[last_pos:].strip())

    # The variable subsections now contains the list of content parts split by the questions_with_references
    return subsections
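The splitting idea can be illustrated on a toy string. The sketch below is a simplified variant, not the function above: it assumes every anchor has the exact form `<a id="..."></a>`, it escapes the href with `re.escape`, and it always returns a (possibly empty) preamble as the first element. The name `split_on_anchors` and the sample content are hypothetical.

```python
import re

def split_on_anchors(content, hrefs):
    """Split content into [preamble, section_1, ..., section_n] at <a id="..."></a> anchors."""
    starts = []
    for href in hrefs:
        # re.escape guards against hrefs that contain regex metacharacters
        match = re.search(r'<a id="{}"></a>'.format(re.escape(href)), content)
        if match:
            starts.append(match.start())
    starts.sort()
    bounds = [0] + starts + [len(content)]
    # Each consecutive pair of boundaries delimits one part
    return [content[a:b].strip() for a, b in zip(bounds, bounds[1:])]

sample = 'Intro text<a id="q1"></a>First?<p>A1</p><a id="q2"></a>Second?<p>A2</p>'
parts = split_on_anchors(sample, ["q1", "q2"])
for part in parts:
    print(repr(part))
```

With two anchors the result has three parts, which mirrors the `len(questions_with_references) + 1` check used when the subdivided data is built.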

In [26]:
# create subdivided text

subdivided_data = []
for page in data:    
    section = page["section"]
    url = page["url"]
    content = page['content']
    questions_with_references = find_faq_questions_and_links(content)
    if len(questions_with_references) == 0:
        entry = {
            "section": section,
            "url": url,
            "content": content,
        }
        subdivided_data.append(entry)
    else:
        subsections = subdivide_content_into_faq(content, questions_with_references)
        if len(subsections) != len(questions_with_references) + 1:            
            print("Could not find links for every FAQ in the ToC for page: " + section)
            break
        else:
            # entry for the ToC
            entry = {
                "section": section,
                "url": url,
                "content": subsections[0],
            }
            subdivided_data.append(entry)

            for i in range(0, len(questions_with_references)):
                question, href = questions_with_references[i]
                entry = {
                    "section": section,
                    "subsection": question,
                    "url": url + "#" + href,
                    "content": subsections[i+1],
                }
                subdivided_data.append(entry)
    
    
# split_data_file defined near the top where the data_file is created
with open(split_data_file, 'w', encoding='utf-8') as f:
    json.dump(subdivided_data, f, indent=4)

The content contains a lot of HTML tags. The LLM won't need these, so here is a function to remove them.

In [27]:
def extract_visible_text(html_content):
    # Parse the HTML content
    soup = BeautifulSoup(html_content, 'html.parser')

    # Insert a blank line after each <p> element so that paragraphs stay separated
    for p in soup.find_all('p'):
        p.insert_after('\n\n')

    # Get the visible text
    visible_text = soup.get_text()

    # Strip each line, drop empty lines and separate the remaining lines with blank lines
    visible_text = '\n\n'.join([line.strip() for line in visible_text.splitlines() if line.strip()])

    return visible_text
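For a quick look at what tag stripping produces without installing BeautifulSoup, the sketch below uses the standard library's `html.parser` instead. It is a swapped-in, dependency-free approximation and not the function above: the class `VisibleTextParser`, the helper `strip_tags`, the chosen set of block tags and the sample markup are all assumptions for illustration.

```python
from html.parser import HTMLParser

class VisibleTextParser(HTMLParser):
    """Collect text nodes, adding a line break after block-level closing tags."""
    BLOCK_TAGS = {"p", "div", "h1", "h2", "h3", "h4", "li"}  # assumed set, adjust as needed

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

    def handle_endtag(self, tag):
        if tag in self.BLOCK_TAGS:
            self.chunks.append("\n")

def strip_tags(html_content):
    parser = VisibleTextParser()
    parser.feed(html_content)
    parser.close()
    text = "".join(parser.chunks)
    # Same final step as above: drop empty lines, separate the rest with blank lines
    return "\n\n".join(line.strip() for line in text.splitlines() if line.strip())

sample = '<div><h3><a id="q1"></a>Can we monitor workers?</h3><p>Yes, if it is <b>lawful</b>.</p></div>'
print(strip_tags(sample))
```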

## Once you have the sections in a JSON file

In [30]:
# Read the JSON file into a DataFrame
df = pd.read_json(split_data_file)
df['subsection'] = df['subsection'].fillna("")

In [31]:
df

Unnamed: 0,section,url,content,subsection
0,Employment practices and data protection: moni...,https://ico.org.uk/for-organisations/uk-gdpr-g...,"<div class=""article-content"">\r\n <...",
1,Data protection and monitoring workers,https://ico.org.uk/for-organisations/uk-gdpr-g...,"<div class=""article-content"">\r\n <...",
2,Data protection and monitoring workers,https://ico.org.uk/for-organisations/uk-gdpr-g...,"<a id=""dp1""></a>What do we mean by monitoring ...",What do we mean by monitoring workers?
3,Data protection and monitoring workers,https://ico.org.uk/for-organisations/uk-gdpr-g...,"<a id=""dp2""></a>Can we monitor workers?</h3>\n...",Can we monitor workers?
4,Data protection and monitoring workers,https://ico.org.uk/for-organisations/uk-gdpr-g...,"<a id=""dp3""></a>How do we lawfully monitor wor...",How do we lawfully monitor workers?
5,Data protection and monitoring workers,https://ico.org.uk/for-organisations/uk-gdpr-g...,"<a id=""dp4""></a>How do we identify a lawful ba...",How do we identify a lawful basis?
6,Data protection and monitoring workers,https://ico.org.uk/for-organisations/uk-gdpr-g...,"<a id=""dp5""></a>What if our monitoring involve...",What if our monitoring involves special catego...
7,Data protection and monitoring workers,https://ico.org.uk/for-organisations/uk-gdpr-g...,"<a id=""dp6""></a>What about criminal offence da...",What about criminal offence data?
8,Data protection and monitoring workers,https://ico.org.uk/for-organisations/uk-gdpr-g...,"<a id=""dp7""></a>Are there other laws we should...",Are there other laws we should consider?
9,Data protection and monitoring workers,https://ico.org.uk/for-organisations/uk-gdpr-g...,"<a id=""dp8""></a>How do we ensure our monitorin...",How do we ensure our monitoring is fair?


In [3]:


#columns = ["section", "subsection", "point", "subpoint", "heading", "text", "section_reference"]
# One name per field in the six-element rows appended to data below
columns = ["section", "subsection", "point", "heading", "text", "section_reference"]
section = ""
subsection = ""
point = ""
heading = False
text = ""
section_reference = ""

section_pattern = re.compile(r'^(\d+) (?!\.)(.+)$')
subsection_pattern = re.compile(r'(\d\.\d)(?!\.)\s*(.*)') 
point_pattern = re.compile(r'^(\d\.\d\.\d)\s+(.+)$') 
annex_pattern = re.compile(r'(?i)^ANNEX (\d+) - (.+)$') # ignore capitalization


annex_started = False
data = []
table_data = []
for entry in doc_as_array:
    if entry.strip() != '':
        section_match = section_pattern.match(entry)
        subsection_match = subsection_pattern.match(entry)
        point_match = point_pattern.match(entry)
        annex_match = annex_pattern.match(entry)

        if section_match:
            if annex_started:
                data.append(["", "", "", False, entry, "Annex"])
            else:
                match = section_match
                section = match.group(1)
                subsection = ""
                point = ""
                heading = True
                text = match.group(2)
                section_reference = section
                data.append([section, subsection, point, heading, text, section_reference])
        elif subsection_match:
            match = subsection_match
            section = section
            subsection = match.group(1)
            point = ""
            heading = True
            text = match.group(2)
            section_reference = subsection
            data.append([section, subsection, point, heading, text, section_reference])
        elif point_match:
            match = point_match
            section = section
            subsection = subsection
            point = match.group(1)
            heading = True
            text = match.group(2)
            section_reference = point
            data.append([section, subsection, point, heading, text, section_reference])

        elif annex_match:
            match = annex_match
            annex_started = True
            section = "Annex"
            subsection = ""
            point = ""
            heading = True
            text = match.group(1)
            section_reference = section
            data.append([section, subsection, point, heading, text, section_reference])
        
        else:
            if annex_started:
                data.append(["", "", "", False, entry, "Annex"])
            else:
                section = section
                subsection = subsection
                point = point
                heading = False
                text = entry
                section_reference = section_reference
                
                data.append(["", "", "", heading, text, section_reference])




df = pd.DataFrame(data, columns = columns)
df = df.drop([0,1,2,3,4])
df.reset_index(inplace=True)

#df.loc[df["section_reference"] == "",  "section_reference"] = "INTRODUCTION"
# Remove my note about the table
#df = df[df["text"] != 'Note this table contains a column "References to BCR-C, application form BCR-C, and / or supporting documents[^14]" which is empty in the document because it is supposed to be filled out by the controller'] 
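The heading patterns used in the loop above can be tried on a few sample lines. The sketch below reuses the three numbering patterns and checks them in the same order; the helper name `classify` is an assumption for illustration, and the sample lines are taken from the headings shown at the end of this notebook.

```python
import re

section_pattern = re.compile(r'^(\d+) (?!\.)(.+)$')
subsection_pattern = re.compile(r'(\d\.\d)(?!\.)\s*(.*)')
point_pattern = re.compile(r'^(\d\.\d\.\d)\s+(.+)$')

def classify(line):
    """Return (kind, reference, text), testing the patterns in the same order as the loop."""
    for kind, pattern in (("section", section_pattern),
                          ("subsection", subsection_pattern),
                          ("point", point_pattern)):
        match = pattern.match(line)
        if match:
            return kind, match.group(1), match.group(2)
    return "text", "", line

for line in ["3 LAWFULNESS OF PROCESSING",
             "3.1 Legitimate interest, Article 6 (1) (f)",
             "3.1.1 Existence of legitimate interests",
             "18. Video surveillance is lawful if it is necessary"]:
    print(classify(line))
```

Numbered paragraphs such as "18. ..." fall through to plain text because the section pattern requires a space directly after the digits. Note also that the subsection and point patterns use a single `\d` per level, so a heading like "10.1 ..." would be classified as plain text by these patterns.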

In [2]:
import re
import pandas as pd
file_path = "../../original/video.md"
with open(file_path, 'r', encoding = "utf-8") as file:
    text = file.read()

lines = text.split('\n')
# get rid of empty lines
lines = [line for line in lines if line]

doc_as_array = []
notes_as_array = []
#footnote_pattern = re.compile(r'^(\[\^\d{1,2}\]:)(.*)$')
footnote_pattern = re.compile(r'^\[\^(\d{1,2})\]:(.*)$')
for entry in lines:
    footnote_match = footnote_pattern.match(entry)
    if footnote_match:
        notes_as_array.append([footnote_match.group(1), footnote_match.group(2).strip()])
    else:
        doc_as_array.append(entry)

columns = ["note_number", "text"]
df_notes = pd.DataFrame(notes_as_array, columns = columns)
df_notes = df_notes[df_notes["text"].str.strip() != '']
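The footnote separation can be checked on a handful of lines without reading the markdown file. The sketch below uses the same pattern as the cell above; the helper name `split_footnotes` and the sample lines are assumptions for illustration.

```python
import re

footnote_pattern = re.compile(r'^\[\^(\d{1,2})\]:(.*)$')

def split_footnotes(lines):
    """Separate footnote definitions from body lines; returns (body, notes)."""
    body, notes = [], {}
    for line in lines:
        match = footnote_pattern.match(line)
        if match:
            notes[match.group(1)] = match.group(2).strip()
        else:
            body.append(line)
    return body, notes

sample_lines = [
    "19. Some paragraph with a reference.[^9]",
    "[^9]: see WP217, Article 29 Working Party.",
    "[^123]: a three-digit footnote number",
]
body, notes = split_footnotes(sample_lines)
print(body)
print(notes)
```

Because the pattern allows at most two digits, the three-digit definition stays in the body. That is fine for a document with fewer than 100 footnotes, but worth keeping in mind for longer ones.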

In [4]:
df[220:240]
#df.iloc[192]["text"]
#df[df["text"] == " "]

Unnamed: 0,index,section,subsection,point,heading,text,section_reference
220,225,,,,False,- Data encryption.,9.3.2
221,226,,,,False,- Use of hardware and software based solutions...,9.3.2
222,227,,,,False,"- Detection of failures of components, softwar...",9.3.2
223,228,,,,False,- Means to restore availability and access to ...,9.3.2
224,229,,,,False,135. Access control ensures that only authoriz...,9.3.2
225,230,,,,False,- Ensuring that all premises where monitoring ...,9.3.2
226,231,,,,False,- Positioning monitors in such a way (especial...,9.3.2
227,232,,,,False,"- Procedures for granting, changing and revoki...",9.3.2
228,233,,,,False,- Methods and means of user authentication and...,9.3.2
229,234,,,,False,- User performed actions (both to the system a...,9.3.2


In [5]:
# Add footnotes
import re

def find_footnote_references(text):
    pattern = r'\[\^(\d+)\]'
    return re.findall(pattern, text)

for index, row in df.iterrows():
    footnotes = find_footnote_references(row['text'])
    if footnotes:
        augmented_note = row['text']
        for note in footnotes:
            augmented_note += f"\n[^{note}]: {df_notes[df_notes['note_number'] == note].iloc[0]['text']}"
        print(f"Row {index} augmented with footnotes")
        #print(augmented_note)
        df.at[index, "text"] = augmented_note
        #print

Row 8 augmented with footnotes
Row 18 augmented with footnotes
Row 19 augmented with footnotes
Row 25 augmented with footnotes
Row 26 augmented with footnotes
Row 33 augmented with footnotes
Row 35 augmented with footnotes
Row 54 augmented with footnotes
Row 57 augmented with footnotes
Row 62 augmented with footnotes
Row 65 augmented with footnotes
Row 68 augmented with footnotes
Row 103 augmented with footnotes
Row 117 augmented with footnotes
Row 160 augmented with footnotes
Row 161 augmented with footnotes
Row 168 augmented with footnotes
Row 182 augmented with footnotes
Row 195 augmented with footnotes
Row 199 augmented with footnotes
Row 200 augmented with footnotes
Row 205 augmented with footnotes
Row 233 augmented with footnotes
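The augmentation step can be sketched without pandas by using a plain dictionary for the notes. This variant differs from the loop above in two hedged ways: it skips references that have no definition instead of raising an error, and it appends each footnote only once. The helper name `append_footnotes` and the sample data are assumptions for illustration.

```python
import re

def append_footnotes(text, notes):
    """Append the definition of each footnote referenced in text; skip unknown ones."""
    augmented = text
    seen = set()
    for number in re.findall(r'\[\^(\d+)\]', text):
        if number in notes and number not in seen:
            augmented += f"\n[^{number}]: {notes[number]}"
            seen.add(number)
    return augmented

notes = {"8": "Judgment in Case C-13/16", "9": "see WP217"}
row_text = "Interests can be legal[^8], economic or non-material.[^9] See also [^99]."
print(append_footnotes(row_text, notes))
```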


In [6]:
file = "../../inputs/documents/video.parquet" # use parquet to deal with the complex text so I don't need to worry about escape characters
df.to_parquet(file, engine = 'pyarrow')

#df_no_table.to_csv(file, encoding = "utf-8", sep="|", index = False, na_rep="", quotechar='"')


## Check that the document class works as expected

In [9]:
import sys
sys.path.append('E:/Code/chat/gdpr')

import importlib
import gdpr_rag.documents.video
importlib.reload(gdpr_rag.documents.video)
from gdpr_rag.documents.video import Video

path_to_document_as_parquet_file = "../../inputs/documents/video.parquet"

doc = Video(path_to_document_as_parquet_file)


In [12]:
from IPython.display import Markdown, display

section = "1"

section = "3.1.1"
#section = "10"

print(doc.get_heading(section))
display(Markdown(doc.get_text(section)))


3 LAWFULNESS OF PROCESSING
3.1 Legitimate interest, Article 6 (1) (f)
3.1.1 Existence of legitimate interests


# 3 LAWFULNESS OF PROCESSING

## 3.1 Legitimate interest, Article 6 (1) (f)

### 3.1.1 Existence of legitimate interests

18. Video surveillance is lawful if it is necessary in order to meet the purpose of a legitimate interest pursued by a controller or a third party, unless such interests are overridden by the data subject's interests or fundamental rights and freedoms (Article 6 (1) (f)). Legitimate interests pursued by a controller or a third party can be legal[^8], economic or non-material interests.[^9] However, the controller should consider that if the data subject objects to the surveillance in accordance with Article 21 the controller can only proceed with the video surveillance of that data subject if it is a compelling legitimate interest which overrides the interests, rights and freedoms of the data subject or for the establishment, exercise or defence of legal claims.

19. Given a real and hazardous situation, the purpose to protect property against burglary, theft or vandalism can constitute a legitimate interest for video surveillance.

20. The legitimate interest needs to be of real existence and has to be a present issue (i.e. it must not be fictional or speculative)[^10]. A real-life situation of distress needs to be at hand - such as damages or serious incidents in the past - before starting the surveillance. In light of the principle of accountability, controllers would be well advised to document relevant incidents (date, manner, financial loss) and related criminal charges. Those documented incidents can be a strong evidence for the existence of a legitimate interest. The existence of a legitimate interest as well as the necessity of the monitoring should be reassessed in periodic intervals (e. g. once a year, depending on the circumstances).

21. Example: A shop owner wants to open a new shop and wants to install a video surveillance system to prevent vandalism. He can show, by presenting statistics, that there is a high expectation of vandalism in the near neighbourhood. Also, experience from neighbouring shops is useful. It is not necessary that a damage to the controller in question must have occurred. As long as damages in the neighbourhood suggest a danger or similar, and thus can be an indication of a legitimate interest. It is however not sufficient to present national or general crime statistic without analysing the area in question or the dangers for this specific shop.

22. Imminent danger situations may constitute a legitimate interest, such as banks or shops selling precious goods (e.g. jewellers), or areas that are known to be typical crime scenes for property offences (e. g. petrol stations).

23. The GDPR also clearly states that public authorities cannot rely their processing on the grounds of legitimate interest, as long as they are carrying out their tasks, Article 6 (1) sentence 2.

  
[^8]: European Court of Justice, Judgment in Case C-13/16, Rīgas satiksme case, 4 may 2017  
[^9]: see WP217, Article 29 Working Party.  
[^10]: see WP217, Article 29 Working Party, p. 24 seq. See also ECJ Case C-708/18 p.44