# DPK Example: Detect Hate, Abuse, and Profanity (HAP) Speech

 [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/sujee/data-prep-kit-examples/blob/main/data-prep-kit/hap_1.ipynb)

This notebook illustrates how to use Data Prep Kit's [HAP detector](https://github.com/data-prep-kit/data-prep-kit/tree/dev/transforms/universal/hap) to tag HAP speech in documents.

 References and credits:

 - https://github.com/data-prep-kit/data-prep-kit/tree/dev/transforms/universal/hap
 - https://github.com/data-prep-kit/data-prep-kit/blob/dev/transforms/universal/hap/hap_python.ipynb
 - https://github.com/proz92/RAG-with-watsonx-HAP-Guardrails
  

## Step-1: Figure out Runtime Environment

### 1.1 - Determine runtime

Determine whether we are running on Google Colab or in a local Python environment.

In [1]:
import os

if os.getenv("COLAB_RELEASE_TAG"):
   print("Running in Colab")
   RUNNING_IN_COLAB = True
else:
   print("NOT in Colab")
   RUNNING_IN_COLAB = False

NOT in Colab


### 1.2 - Install dependencies if running on Google Colab

In [2]:
## Download any code files we may need

if RUNNING_IN_COLAB:
    !wget -O 'file_utils.py'   'https://raw.githubusercontent.com/sujee/data-prep-kit-examples/main/data-prep-kit/file_utils.py'

In [3]:
# %%capture
# %%time

import os

if RUNNING_IN_COLAB:
  ## setup a sandbox env to avoid conflicts with colab libraries
  !pip install -q condacolab
  import condacolab
  condacolab.install()

  !conda create -n my_env python=3.11 -y
  !conda activate my_env
  !pip install  --default-timeout=100  \
        'data-prep-toolkit-transforms[hap, pdf2parquet]'==1.1.0 \
        humanfriendly
  ## terminate the current kernel, so we restart the runtime
  os.kill(os.getpid(), 9)
  ## restart the session

### 1.3 - Restart Runtime

After installing dependencies, you may need to <font color="red">restart the runtime</font> so that the newly installed libraries are loaded.

You do this by going to **`Runtime --> Restart Session`**

Then you can continue to the next step (no need to re-run the notebook)

## Step-2: Configuration  & Utils

### 2.1 - Basic Config

In [4]:
import os

if os.getenv("COLAB_RELEASE_TAG"):
   print("Running in Colab")
   RUNNING_IN_COLAB = True
else:
   print("NOT in Colab")
   RUNNING_IN_COLAB = False

NOT in Colab


### 2.2 - Set up input/output directories

In [5]:
## setup path to utils folder
import sys
sys.path.append('../utils')

In [6]:
# Optional: if connecting to https://huggingface.co/ fails, route model downloads through a mirror.
# Comment out the two lines below to use the default Hugging Face endpoint.
import os
os.environ['HF_ENDPOINT'] = 'https://hf-mirror.com'

In [7]:
import os, sys
import shutil

if RUNNING_IN_COLAB:
    input_dir = "input/"
    os.makedirs(input_dir, exist_ok=True)
else:
    input_dir = "../data/hap/"

output_dir = "output"
output_pdf2pq_dir = os.path.join (output_dir, '01_pdf2pq_out')
output_hap_dir = os.path.join (output_dir, '02_hap_out')
output_md_dir = os.path.join (output_dir, "03_md")


## clear output folder
shutil.rmtree(output_dir, ignore_errors=True)
os.makedirs(output_dir, exist_ok=True)
os.makedirs(output_md_dir, exist_ok=True)
print ("✅ Cleared output directory")

✅ Cleared output directory


## Step-3: Inspect the Data

Sample data files are [here](https://github.com/sujee/data-prep-kit-examples/tree/main/data/hap)

### 3.1 - Download Data

In [8]:
from file_utils import download_file

if RUNNING_IN_COLAB:
    download_file ('https://raw.githubusercontent.com/sujee/data-prep-kit-examples/main/data/hap/earth.pdf', os.path.join(input_dir, 'earth.pdf'))
    download_file ('https://raw.githubusercontent.com/sujee/data-prep-kit-examples/main/data/hap/hap1.pdf', os.path.join(input_dir, 'hap1.pdf'))
    download_file ('https://raw.githubusercontent.com/sujee/data-prep-kit-examples/main/data/hap/hap2.pdf', os.path.join(input_dir, 'hap2.pdf'))
    download_file ('https://raw.githubusercontent.com/sujee/data-prep-kit-examples/main/data/hap/hap3.pdf', os.path.join(input_dir, 'hap3.pdf'))
    download_file ('https://raw.githubusercontent.com/sujee/data-prep-kit-examples/main/data/hap/hap4.pdf', os.path.join(input_dir, 'hap4.pdf'))
else:
    print ('input : ', input_dir)

input :  ../data/hap/
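
The `download_file` helper comes from `file_utils.py`, which is not shown in this notebook. As a rough idea of what such a helper might do, here is a minimal sketch built only on the standard library. The name `fetch_file` is hypothetical and is not the actual `file_utils` implementation.

```python
import shutil
import urllib.request
from pathlib import Path


def fetch_file(url: str, destination: str) -> Path:
    """Hypothetical stand-in for file_utils.download_file: copy `url` to `destination`."""
    target = Path(destination)
    # Create the parent folder so the caller does not have to.
    target.parent.mkdir(parents=True, exist_ok=True)
    # Stream the response to disk instead of holding it all in memory.
    with urllib.request.urlopen(url) as response, open(target, "wb") as out_file:
        shutil.copyfileobj(response, out_file)
    return target
```

It would be called the same way as the helper above, for example `fetch_file(url, os.path.join(input_dir, 'earth.pdf'))`.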


## Step-4: Extract Data from PDF (pdf2parquet)

In this step, we read the PDF files and extract their text.

[Pdf2Parquet documentation](https://github.com/IBM/data-prep-kit/blob/dev/transforms/language/pdf2parquet/README.md)

We use the [Docling package](https://github.com/DS4SD/docling).

### 4.1 - Execute

In [9]:
%%time

from dpk_pdf2parquet.transform_python import Pdf2Parquet
from dpk_pdf2parquet.transform import pdf2parquet_contents_types

print (f"🏃🏼 Processing input='{input_dir}' --> output='{output_pdf2pq_dir}'\n", flush=True)

result = Pdf2Parquet(input_folder= input_dir,
                    output_folder= output_pdf2pq_dir,
                    data_files_to_use=['.pdf'],
                    pdf2parquet_contents_type=pdf2parquet_contents_types.MARKDOWN,   # markdown
                    ).transform()

if result == 0:
    print ("✅ Operation completed successfully")
else:
    raise Exception ("❌ Operation failed")

🏃🏼 Processing input='../data/hap/' --> output='output/01_pdf2pq_out'



12:05:39 INFO - pdf2parquet parameters are : {'batch_size': -1, 'artifacts_path': None, 'contents_type': <pdf2parquet_contents_types.MARKDOWN: 'text/markdown'>, 'do_table_structure': True, 'do_ocr': True, 'ocr_engine': <pdf2parquet_ocr_engine.EASYOCR: 'easyocr'>, 'bitmap_area_threshold': 0.05, 'pdf_backend': <pdf2parquet_pdf_backend.DLPARSE_V2: 'dlparse_v2'>, 'double_precision': 8}
12:05:39 INFO - pipeline id pipeline_id
12:05:39 INFO - code location None
12:05:39 INFO - data factory data_ is using local data access: input_folder - ../data/hap/ output_folder - output/01_pdf2pq_out
12:05:39 INFO - data factory data_ max_files -1, n_sample -1
12:05:39 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.pdf'], files to checkpoint ['.parquet']
12:05:39 INFO - orchestrator pdf2parquet started at 2025-03-26 12:05:39
12:05:39 INFO - Number of files is 5, source profile {'max_file_size': 0.055823326110839844, 'min_file_size': 0.0

✅ Operation completed successfully
CPU times: user 8.29 s, sys: 1.87 s, total: 10.2 s
Wall time: 14 s


### 4.2 - Inspect Generated output


In [10]:
from file_utils import read_parquet_files_as_df

print ("Displaying contents of : ", output_pdf2pq_dir)
output_df = read_parquet_files_as_df(output_pdf2pq_dir)
output_df


Displaying contents of :  output/01_pdf2pq_out
Successfully read 5 parquet files with 5 total rows


| | filename | contents | num_pages | num_tables | num_doc_elements | document_id | document_hash | ext | hash | size | date_acquired | pdf_convert_time | source_filename |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | hap3.pdf | `## HAP Example - Hate (Mild)\n\nI hate all tal...` | 1 | 0 | 3 | 4640a2a1-bbef-4dff-870c-951df7e5ffab | 14415754726703435224 | pdf | 61704194822aec595d2b38e8cbe223b3191c859dc8cfdc... | 53 | 2025-03-26T12:05:47.035007 | 0.161497 | hap3.pdf |
| 1 | hap4.pdf | `## HAP Example\n\nI hate this traffic!` | 1 | 0 | 3 | 7988fcc4-7763-4cfd-b91d-1d2b4c0e11e5 | 1118223425461926585 | pdf | ebf223c519ac1c4e9c3702e89366ca0b2d131d5f6fbd5a... | 36 | 2025-03-26T12:05:47.199717 | 0.163099 | hap4.pdf |
| 2 | hap2.pdf | `## HAP Example - Hate\n\nI hate all immigrants!` | 1 | 0 | 3 | 186d1e43-07a0-4e68-a41a-16c426cf236f | 9456501710902741619 | pdf | 18b2a552b5b54d5bd374a1416117d019a5e18a5da1547a... | 45 | 2025-03-26T12:05:46.871889 | 0.160447 | hap2.pdf |
| 3 | hap1.pdf | `## HAP example - Abuse and Profanity\n\nYou ar...` | 1 | 0 | 4 | a3c6dde1-4e6f-4e15-9e21-ea4497517f44 | 4197663060196400691 | pdf | d1157451bf6b9b69c16bf89f05c06a0210497306236c72... | 136 | 2025-03-26T12:05:46.709816 | 0.184696 | hap1.pdf |
| 4 | earth.pdf | `## Earth\n\n## Solar System\n\nOur solar syste...` | 1 | 0 | 11 | c7b68ce5-8a50-4eb5-baf7-546580cd47c0 | 1139588115783882743 | pdf | 3766e7a7dfb15354f2a8c77e43db4cfa40d4627f921126... | 611 | 2025-03-26T12:05:46.511357 | 0.945042 | earth.pdf |


In [11]:
print (output_df[output_df['filename'] == 'hap1.pdf'].iloc[0,]['contents'])

## HAP example - Abuse and Profanity

You are such an idiot for making these stupid decisions!

you are such a complete f*cking failure!


In [12]:
print (output_df[output_df['filename'] == 'hap2.pdf'].iloc[0,]['contents'])

## HAP Example - Hate

I hate all immigrants!


In [13]:
print (output_df[output_df['filename'] == 'hap3.pdf'].iloc[0,]['contents'])

## HAP Example - Hate (Mild)

I hate all tall people!


In [14]:
print (output_df[output_df['filename'] == 'hap4.pdf'].iloc[0,]['contents'])

## HAP Example

I hate this traffic!


## Step-5: HAP Detector

[HAP transform documentation](https://github.com/data-prep-kit/data-prep-kit/blob/dev/transforms/universal/hap/)

Some parameters:

- `model_name_or_path` - the HAP model to use; it must be compatible with Hugging Face's `AutoModelForSequenceClassification`. Defaults to IBM's open-source toxicity classifier **ibm-granite/granite-guardian-hap-38m**.
- `annotation_column` - the name of the column that holds the HAP (toxicity) score in the output .parquet file. Defaults to `hap_score`.
- `doc_text_column` - the name of the column that holds the document text in the input .parquet file. Defaults to `contents`.
- `batch_size` - the inference batch size; tune it to the capacity of your infrastructure. Defaults to 128.
- `max_length` - the maximum sequence length for the tokenizer. Defaults to 512.

Available HAP detection models:

- [ibm-granite/granite-guardian-hap-38m](https://huggingface.co/ibm-granite/granite-guardian-hap-38m)
- [ibm-granite/granite-guardian-hap-125m](https://huggingface.co/ibm-granite/granite-guardian-hap-125m)
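
To build some intuition for where a `hap_score` between 0 and 1 could come from, here is a small sketch in plain Python. It rests on two assumptions that this notebook does not confirm: that the classifier emits two logits per text with the second one standing for the HAP class, and that a document score is the maximum over per-sentence scores. The function names and the logits are made up for illustration.

```python
import math


def hap_probability(logits):
    """Softmax probability of the second class (assumed to be the HAP label)."""
    # Subtract the largest logit first for numerical stability.
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    return exps[1] / sum(exps)


def document_score(sentence_scores):
    """Assumed aggregation: a document is scored by its most toxic sentence."""
    return max(sentence_scores)


# Made-up logits for two sentences; these are NOT real model outputs.
sentence_logits = [[2.0, -2.0], [-1.0, 3.0]]
scores = [hap_probability(pair) for pair in sentence_logits]
print(round(document_score(scores), 3))
```

Under these assumptions, a single strongly toxic sentence is enough to push the whole document's score close to 1.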

### 5.1 - Execute

In [15]:
from dpk_hap.transform_python import HAP


result = HAP(input_folder= output_pdf2pq_dir,
        output_folder= output_hap_dir,
        model_name_or_path= 'ibm-granite/granite-guardian-hap-38m',
        annotation_column= "hap_score",
        doc_text_column= "contents",
        inference_engine= "CPU",
        max_length= 512,
        batch_size= 128,
        ).transform()

if result == 0:
    print ("✅ Operation completed successfully")
else:
    raise Exception ("❌ Operation failed")

[nltk_data] Downloading package punkt_tab to /home/sujee/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
12:05:47 INFO - hap params are {'model_name_or_path': 'ibm-granite/granite-guardian-hap-38m', 'annotation_column': 'hap_score', 'doc_text_column': 'contents', 'inference_engine': 'CPU', 'max_length': 512, 'batch_size': 128} 
12:05:47 INFO - pipeline id pipeline_id
12:05:47 INFO - code location None
12:05:47 INFO - data factory data_ is using local data access: input_folder - output/01_pdf2pq_out output_folder - output/02_hap_out
12:05:47 INFO - data factory data_ max_files -1, n_sample -1
12:05:47 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']
12:05:47 INFO - orchestrator hap started at 2025-03-26 12:05:47
12:05:47 INFO - Number of files is 5, source profile {'max_file_size': 0.009408950805664062, 'min_file_size': 0.005721092224121094, 'total_file_siz

Processing batch: 0/0
    filename                                           contents  num_pages  \
0  earth.pdf  ## Earth\n\n## Solar System\n\nOur solar syste...          1   

   num_tables  num_doc_elements                           document_id  \
0           0                11  c7b68ce5-8a50-4eb5-baf7-546580cd47c0   

         document_hash  ext  \
0  1139588115783882743  pdf   

                                                hash  size  \
0  3766e7a7dfb15354f2a8c77e43db4cfa40d4627f921126...   611   

                date_acquired  pdf_convert_time source_filename  hap_score  
0  2025-03-26T12:05:46.511357          0.945042       earth.pdf   0.000406  
Processing batch: 0/0
   filename                                           contents  num_pages  \
0  hap1.pdf  ## HAP example - Abuse and Profanity\n\nYou ar...          1   

   num_tables  num_doc_elements                           document_id  \
0           0                 4  a3c6dde1-4e6f-4e15-9e21-ea4497517f44   

        

### 5.2 - Inspect Generated output

Let's inspect the output, paying particular attention to the **hap_score** column.

In [16]:
from file_utils import read_parquet_files_as_df

print ("Displaying contents of : ", output_hap_dir)
output_df = read_parquet_files_as_df(output_hap_dir)
output_df

Displaying contents of :  output/02_hap_out
Successfully read 5 parquet files with 5 total rows


| | filename | contents | num_pages | num_tables | num_doc_elements | document_id | document_hash | ext | hash | size | date_acquired | pdf_convert_time | source_filename | hap_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | hap3.pdf | `## HAP Example - Hate (Mild)\n\nI hate all tal...` | 1 | 0 | 3 | 4640a2a1-bbef-4dff-870c-951df7e5ffab | 14415754726703435224 | pdf | 61704194822aec595d2b38e8cbe223b3191c859dc8cfdc... | 53 | 2025-03-26T12:05:47.035007 | 0.161497 | hap3.pdf | 0.359412 |
| 1 | hap4.pdf | `## HAP Example\n\nI hate this traffic!` | 1 | 0 | 3 | 7988fcc4-7763-4cfd-b91d-1d2b4c0e11e5 | 1118223425461926585 | pdf | ebf223c519ac1c4e9c3702e89366ca0b2d131d5f6fbd5a... | 36 | 2025-03-26T12:05:47.199717 | 0.163099 | hap4.pdf | 0.00148 |
| 2 | hap2.pdf | `## HAP Example - Hate\n\nI hate all immigrants!` | 1 | 0 | 3 | 186d1e43-07a0-4e68-a41a-16c426cf236f | 9456501710902741619 | pdf | 18b2a552b5b54d5bd374a1416117d019a5e18a5da1547a... | 45 | 2025-03-26T12:05:46.871889 | 0.160447 | hap2.pdf | 0.90329 |
| 3 | hap1.pdf | `## HAP example - Abuse and Profanity\n\nYou ar...` | 1 | 0 | 4 | a3c6dde1-4e6f-4e15-9e21-ea4497517f44 | 4197663060196400691 | pdf | d1157451bf6b9b69c16bf89f05c06a0210497306236c72... | 136 | 2025-03-26T12:05:46.709816 | 0.184696 | hap1.pdf | 0.997993 |
| 4 | earth.pdf | `## Earth\n\n## Solar System\n\nOur solar syste...` | 1 | 0 | 11 | c7b68ce5-8a50-4eb5-baf7-546580cd47c0 | 1139588115783882743 | pdf | 3766e7a7dfb15354f2a8c77e43db4cfa40d4627f921126... | 611 | 2025-03-26T12:05:46.511357 | 0.945042 | earth.pdf | 0.000406 |


In [17]:
output_df[['filename', 'contents', 'hap_score']]

| | filename | contents | hap_score |
|---|---|---|---|
| 0 | hap3.pdf | `## HAP Example - Hate (Mild)\n\nI hate all tal...` | 0.359412 |
| 1 | hap4.pdf | `## HAP Example\n\nI hate this traffic!` | 0.00148 |
| 2 | hap2.pdf | `## HAP Example - Hate\n\nI hate all immigrants!` | 0.90329 |
| 3 | hap1.pdf | `## HAP example - Abuse and Profanity\n\nYou ar...` | 0.997993 |
| 4 | earth.pdf | `## Earth\n\n## Solar System\n\nOur solar syste...` | 0.000406 |


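## Step-6: Extract Clean Documents

Keep only the documents whose `hap_score` is below 0.2; documents at or above this threshold are treated as containing HAP content and are dropped.
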
In [20]:
from file_utils import read_parquet_files_as_df

hap_output_df = read_parquet_files_as_df(output_hap_dir)
clean_docs_df = hap_output_df[hap_output_df['hap_score'] < 0.2]

print ('clean documents')
clean_docs_df[['filename', 'contents', 'hap_score']]

Successfully read 5 parquet files with 5 total rows
clean documents


| | filename | contents | hap_score |
|---|---|---|---|
| 1 | hap4.pdf | `## HAP Example\n\nI hate this traffic!` | 0.00148 |
| 4 | earth.pdf | `## Earth\n\n## Solar System\n\nOur solar syste...` | 0.000406 |
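
The 0.2 cutoff is a choice, and it decides which borderline documents survive. The sketch below replays the filter in plain Python on the five scores reported above, so the effect of a looser threshold is easy to see. The helper name `clean_files` is illustrative only.

```python
# hap_score values copied from the HAP output shown earlier in this notebook.
hap_scores = {
    "hap3.pdf": 0.359412,
    "hap4.pdf": 0.00148,
    "hap2.pdf": 0.90329,
    "hap1.pdf": 0.997993,
    "earth.pdf": 0.000406,
}


def clean_files(scores, threshold):
    """Return the file names whose score is strictly below the threshold, sorted."""
    return sorted(name for name, score in scores.items() if score < threshold)


for threshold in (0.2, 0.5):
    print(threshold, clean_files(hap_scores, threshold))
# 0.2 ['earth.pdf', 'hap4.pdf']
# 0.5 ['earth.pdf', 'hap3.pdf', 'hap4.pdf']
```

Raising the threshold to 0.5 would let the mildly hateful `hap3.pdf` (score 0.359412) through, which is why a conservative cutoff is used here.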


## Step-7: Save as MD

In [21]:
for index, row in clean_docs_df.iterrows():
    output_file_name = os.path.join (output_md_dir, row['filename'] + '.md')
    with open(output_file_name, 'w', encoding='utf-8') as output_file:
        output_file.write(row['contents'])

print (f"✅ Saved CLEAN markdown output to '{output_md_dir}'")

✅ Saved CLEAN markdown output to 'output/03_md'
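
Note that appending `'.md'` to `row['filename']` produces names such as `earth.pdf.md`. If you would rather have `earth.md`, one option is to swap the extension with `pathlib`. This is a sketch of that variation; `markdown_path` and `save_markdown` are hypothetical helper names, not part of Data Prep Kit.

```python
import tempfile
from pathlib import Path


def markdown_path(output_dir, pdf_name):
    """Map 'earth.pdf' to '<output_dir>/earth.md' by replacing the extension."""
    return Path(output_dir) / Path(pdf_name).with_suffix(".md").name


def save_markdown(output_dir, pdf_name, contents):
    """Write the markdown text as UTF-8 and return the path that was written."""
    target = markdown_path(output_dir, pdf_name)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(contents, encoding="utf-8")
    return target


# Demo in a throwaway folder so nothing under output/ is touched.
with tempfile.TemporaryDirectory() as tmp_dir:
    saved = save_markdown(tmp_dir, "earth.pdf", "## Earth\n")
    print(saved.name)
```

In the loop above, this would replace the `os.path.join(...)` and `open(...)` lines with a single `save_markdown(output_md_dir, row['filename'], row['contents'])` call.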
