HAP Transform Example Notebook
=====================================

This notebook processes a CSV file containing text data to analyze for Hate, Abuse, and Profanity (HAP) scores.
It converts the CSV file into Parquet format, uses the `hap_local_python.py` script to calculate HAP scores, 
and generates outputs for further analysis.



### Overview
This notebook demonstrates the use of the HAP transformation to annotate documents with a `hap_score`, 
indicating the likelihood of Hate, Abuse, or Profanity in the text.

### Workflow
The HAP process consists of:
1. **Sentence Splitting**: Documents are split into sentences using NLTK.
2. **HAP Annotation**: Each sentence is scored between 0 and 1 (1 = high HAP, 0 = no HAP).
3. **Aggregation**: The document's final HAP score is the maximum score among all sentences.
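
The three steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the transform's implementation: `split_sentences` is a naive regex stand-in for NLTK sentence splitting, `toy_scorer` is a hypothetical placeholder for the HAP model, and returning 0.0 for empty text is an assumption made only for this example.

```python
import re
from typing import Callable, List


def split_sentences(text: str) -> List[str]:
    # Naive stand-in for NLTK sentence splitting (assumption: split after ., ! or ?).
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def document_hap_score(text: str, score_sentence: Callable[[str], float]) -> float:
    # Score every sentence, then keep the maximum as the document-level score.
    sentences = split_sentences(text)
    if not sentences:
        return 0.0  # assumption for this sketch: empty text scores 0
    return max(score_sentence(s) for s in sentences)


def toy_scorer(sentence: str) -> float:
    # Hypothetical placeholder for the model: flags one made-up keyword.
    return 0.9 if "awful" in sentence.lower() else 0.01


doc = "The agent was polite. The wait time was awful! I will call again."
print(split_sentences(doc))
print(document_hap_score(doc, toy_scorer))
```

Taking the maximum means a single high-scoring sentence is enough to flag the whole document, regardless of how many benign sentences surround it.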


### Configuration
- **Model Name**: IBM Granite Guardian (`ibm-granite/granite-guardian-hap-38m` by default).
- **Document Text Column** (`--doc_text_column`): The input column containing the document text to score. Defaults to `contents`.
- **Annotation Column** (`--annotation_column`): The output column where HAP scores are written. Defaults to `hap_score`.


### Steps in This Notebook
1. Define paths and import libraries.
2. Convert CSV input to Parquet.
3. Run the HAP transformation script.
4. View and analyze the results.


### Install dependencies

These pip installs need to be adapted to use the appropriate release level. Alternatively, the venv running JupyterLab can be pre-configured with a requirements file that includes the right release. Example for transform developers working from a git clone:
```
make venv 
source venv/bin/activate 
pip install jupyterlab
```

In [39]:
! pip install data-prep-connector
! pip install 'data-prep-toolkit[ray]==0.2.2.dev1'
! pip install 'data-prep-toolkit-transforms[ray,all]==0.2.2.dev1'
# Quote the torch specifier so the shell does not treat ">" and "<" as redirections
! pip install nltk==3.9.1 transformers==4.38.2 'torch>=2.2.2,<=2.4.1' pandas==2.2.2

Collecting argparse (from data-prep-toolkit==0.2.2.dev1->data-prep-toolkit[ray]==0.2.2.dev1)
  Using cached argparse-1.4.0-py2.py3-none-any.whl.metadata (2.8 kB)
Using cached argparse-1.4.0-py2.py3-none-any.whl (23 kB)
Installing collected packages: argparse
Successfully installed argparse-1.4.0
Collecting argparse (from data-prep-toolkit>=0.2.2.dev1->data-prep-toolkit-transforms==0.2.2.dev1->data-prep-toolkit-transforms[all,ray]==0.2.2.dev1)
  Using cached argparse-1.4.0-py2.py3-none-any.whl.metadata (2.8 kB)
Using cached argparse-1.4.0-py2.py3-none-any.whl (23 kB)
Installing collected packages: argparse
Successfully installed argparse-1.4.0


### Import necessary libraries

In [40]:
import os
import sys
import pandas as pd
import subprocess

from data_processing.runtime.pure_python import PythonTransformLauncher
from data_processing.utils import ParamsUtils
from hap_transform_python import HAPPythonTransformConfiguration
from hap_transform import HAPTransform

[nltk_data] Downloading package punkt_tab to /Users/aisha/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


### Step 1: Setup runtime parameters

- `input_folder`: Path to the input data (Parquet files) used by the transform.
- `output_folder`: Path where the output file with HAP scores will be saved.
- `doc_text_column`: The column containing the text to analyze (for example, `Customer Feedback`).
- `annotation_column`: The column where HAP scores will be saved (default: `hap_score`).

**Customization**: 
- Ensure the column containing the text matches the `doc_text_column` parameter.
- If your text column has a different name, update the value of `--doc_text_column` accordingly.
- You can adjust other parameters like `--batch_size` and `--max_length` if needed.
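
As a sketch of the customization above, optional settings can be layered over the defaults before they are handed to the launcher. The parameter names and default values below are the ones the transform reports in its startup log in this notebook (with `contents` as the documented default text column); the `build_hap_params` helper and the `batch_size=64` override are hypothetical and only illustrate the merge.

```python
# Defaults as reported in the transform's startup log in this notebook.
HAP_DEFAULTS = {
    "model_name_or_path": "ibm-granite/granite-guardian-hap-38m",
    "annotation_column": "hap_score",
    "doc_text_column": "contents",
    "inference_engine": "CPU",
    "max_length": 512,
    "batch_size": 128,
}


def build_hap_params(**overrides):
    # Hypothetical helper: reject unknown keys so typos surface early.
    unknown = set(overrides) - set(HAP_DEFAULTS)
    if unknown:
        raise KeyError(f"Unknown HAP parameter(s): {sorted(unknown)}")
    # Later keys win, so overrides replace the defaults.
    return {**HAP_DEFAULTS, **overrides}


hap_params = build_hap_params(doc_text_column="Customer Feedback", batch_size=64)
print(hap_params)
```

Assuming the launcher accepts these keys under the same names the log shows, the resulting entries could be merged into the `params` dictionary defined in the next cell.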

In [47]:

# create parameters
input_csv = './input-csv'
input_folder = './input-parquet'
output_folder = './output'
output_parquet_file = os.path.join(output_folder, "customer-feedback.parquet")
output_csv_file = os.path.join(output_folder, "customer-feedback.csv")

local_conf = {
    "input_folder": input_folder,
    "output_folder": output_folder,
}

print(input_folder)
code_location = {"github": "github", "commit_hash": "12345", "path": "path"}
params = {
    # Data access. Only required parameters are specified
    "data_local_config": ParamsUtils.convert_to_ast(local_conf),
    # execution info
    "runtime_pipeline_id": "pipeline_id",
    "runtime_job_id": "job_id",
    "runtime_code_location": ParamsUtils.convert_to_ast(code_location),
    # hap params
    "doc_text_column": "Customer Feedback",
}

./input-parquet


### Step 2: Setup Input Data

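- Place your CSV file in the `input_csv` folder (`./input-csv`). The cleanup cell below deletes every file already in that folder, so copy your CSV in after running it.
- Ensure the column containing the text matches the `doc_text_column` parameter.
- If your text column has a different name, update the `doc_text_column` value in the Step 1 parameters cell and rerun it.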
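
A quick way to confirm the column name before converting is to read only the CSV header. The sketch below uses the standard `csv` module; `has_text_column` is a hypothetical helper, and the sample file it writes (including the `Date` column) is made up purely to keep the example runnable.

```python
import csv
import os
import tempfile


def has_text_column(csv_path: str, column: str) -> bool:
    # Read just the first row (the header) and check for the column name.
    with open(csv_path, newline="", encoding="utf-8") as f:
        header = next(csv.reader(f), [])
    return column in header


# Write a tiny sample CSV so the check can be demonstrated end to end.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "sample.csv")
    with open(sample_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["Customer Feedback", "Date"])
        writer.writerow(['Rating: 4 Comments: "Service was prompt."', "2024-12-06"])

    found = has_text_column(sample_path, "Customer Feedback")
    missing = has_text_column(sample_path, "contents")

print(found, missing)
```

If the check fails for your file, either rename the column in the CSV or change `doc_text_column` to match.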

In [48]:
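# Clear the existing input-csv folder. The loop below deletes every file in it,
# so copy your CSV into the folder after running this cell.
if os.path.exists(input_csv):
    for file_name in os.listdir(input_csv):
        file_path = os.path.join(input_csv, file_name)
        try:
            os.remove(file_path)
            print(f"Deleted file: {file_path}")
        except Exception as e:
            print(f"Failed to delete {file_path}: {e}")
else:
    os.makedirs(input_csv)
    print(f"Created folder: {input_csv}")

# Make sure the Parquet input folder exists even when input-csv was already present
os.makedirs(input_folder, exist_ok=True)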

Deleted file: ./input-csv/customer_feedback_file.csv


In [50]:
csv_files = [f for f in os.listdir(input_csv) if f.endswith(".csv")]

if not csv_files:
    print(f"No CSV files found in the input folder: {input_csv}")
    print("Please place a CSV file in the input folder and rerun this script.")
else:
    # Pick the first CSV file in the folder
    csv_file_path = os.path.join(input_csv, csv_files[0])
    print(f"Using CSV file: {csv_file_path}")

Using CSV file: ./input-csv/customer_feedback_file.csv


### Step 3: Convert CSV to Parquet
Convert the selected CSV file to Parquet format.

In [51]:
parquet_file_path = os.path.join(input_folder, "customer-feedback.parquet")
df = pd.read_csv(csv_file_path)
df.to_parquet(parquet_file_path, index=False)
print(f"CSV file converted to Parquet format at: {parquet_file_path}")

CSV file converted to Parquet format at: ./input-parquet/customer-feedback.parquet


### Step 4: Invoke the transform
Use the pure Python runtime to invoke the transform.

In [52]:
%%capture
sys.argv = ParamsUtils.dict_to_req(d=params)
launcher = PythonTransformLauncher(runtime_config=HAPPythonTransformConfiguration())
launcher.launch()

10:55:04 INFO - hap params are {'model_name_or_path': 'ibm-granite/granite-guardian-hap-38m', 'annotation_column': 'hap_score', 'doc_text_column': 'Customer Feedback', 'inference_engine': 'CPU', 'max_length': 512, 'batch_size': 128} 
10:55:04 INFO - pipeline id pipeline_id
10:55:04 INFO - code location {'github': 'github', 'commit_hash': '12345', 'path': 'path'}
10:55:04 INFO - data factory data_ is using local data access: input_folder - ./input-parquet output_folder - ./output
10:55:04 INFO - data factory data_ max_files -1, n_sample -1
10:55:04 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']
10:55:04 INFO - orchestrator hap started at 2024-12-06 10:55:04
10:55:04 INFO - Number of files is 1, source profile {'max_file_size': 0.04511737823486328, 'min_file_size': 0.04511737823486328, 'total_file_size': 0.04511737823486328}
10:55:06 INFO - Completed 1 files (100.0%) in 0.008

In [53]:
# The output folder will include the transformed Parquet files.
# List it via output_folder so the path matches the one passed to the transform.

import glob
glob.glob(os.path.join(output_folder, "*"))

### Step 5: Display Output in a Readable Format

This step converts the transform's Parquet output to CSV and previews it in the notebook. The cell below performs the following actions:

1. **Find Parquet Files**: It lists the output folder and identifies the `.parquet` files.
2. **Remove Old CSV Files**: Any `.csv` files left in the output folder from a previous run are deleted.
3. **Rename Parquet File**: The transformed Parquet file is renamed to `customer-feedback.parquet`.
4. **Read Parquet File**: The renamed Parquet file is read into a DataFrame.
5. **CSV Output**: The DataFrame is written to `customer-feedback.csv`.
6. **Select Columns**: The text column (`Customer Feedback`) and the annotation column (`hap_score`) are selected from the DataFrame.
7. **Display Output**: The first 10 rows are displayed in the notebook for quick reference.

Example Output:
<div>
<style scoped>
    .dataframe tbody tr th:only-of-type {
        vertical-align: middle;
    }

    .dataframe tbody tr th {
        vertical-align: top;
    }

    .dataframe thead th {
        text-align: left;
    }
</style>
<table border="0" class="dataframe">
  <thead>
    <tr style="text-align: left;">
      <th></th>
      <th>Customer Feedback</th>
      <th>hap_score</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>Rating: 4 Comments: "Service was prompt, but ...</td>
      <td>0.000195</td>
    </tr>
    <tr>
      <th>1</th>
      <td>Rating: 5 Comments: "Great help from Peter! H...</td>
      <td>0.000153</td>
    </tr>
    <tr>
      <th>2</th>
      <td>Rating: 3 Comments: "The service was quick, b...</td>
      <td>0.000169</td>
    </tr>
    <tr>
      <th>3</th>
      <td>Rating: 5 Comments: "Excellent service and ad...</td>
      <td>0.000158</td>
    </tr>
  </tbody>
</table>
</div>

In [55]:
# Locate transformed Parquet files in the output folder
output_parquet_files = [f for f in os.listdir(output_folder) if f.endswith(".parquet")]

if output_parquet_files:
    # Clear existing CSV files in the output folder
    for file_name in os.listdir(output_folder):
        if file_name.endswith(".csv"):
            file_path = os.path.join(output_folder, file_name)
            try:
                os.remove(file_path)
                print(f"Deleted old CSV file: {file_path}")
            except Exception as e:
                print(f"Failed to delete {file_path}: {e}")

    for output_parquet in output_parquet_files:
        original_parquet_path = os.path.join(output_folder, output_parquet)
        try:
            # Rename the Parquet file
            os.rename(original_parquet_path, output_parquet_file)
            print(f"Renamed Parquet file to: {output_parquet_file}")

            # Convert the renamed Parquet file to CSV
            transformed_df = pd.read_parquet(output_parquet_file)
            transformed_df.to_csv(output_csv_file, index=False)
            print(f"Transformed CSV file saved at: {output_csv_file}")

            # Display selected columns in tabular format
            if 'Customer Feedback' in transformed_df.columns and 'hap_score' in transformed_df.columns:
                display_df = transformed_df[['Customer Feedback', 'hap_score']]
                print("\nSelected Columns (Customer Feedback and HAP Score):")
                from IPython.display import display  # Ensure pretty display in Jupyter
                display(display_df.head(10))  # Display the first 10 rows
            else:
                print("The required columns ('Customer Feedback' and 'hap_score') are not in the transformed data.")

        except Exception as e:
            print(f"Error processing files: {e}")
else:
    print(f"No Parquet files found in the output folder: {output_folder}")


Deleted old CSV file: ./output/customer-feedback.csv
Renamed Parquet file to: ./output/customer-feedback.parquet
Transformed CSV file saved at: ./output/customer-feedback.csv

Selected Columns (Customer Feedback and HAP Score):


   Customer Feedback                                 hap_score
0  Rating: 4 Comments: "Service was prompt, but ...  0.000195
1  Rating: 5 Comments: "Great help from Peter! H...  0.000153
2  Rating: 3 Comments: "The service was quick, b...  0.000169
3  Rating: 5 Comments: "Excellent service and ad...  0.000158
4  Rating: 2 Comments: "I’m really frustrated. T...  0.000875
5  Rating: 4 Comments: "The service was helpful,...  0.015778
6  Rating: 3 Comments: "The support was helpful,...  0.000221
7  Rating: 5 Comments: "The service was very qui...  0.000179
8  Rating: 5 Comments: "The support was excellen...  0.000236
9  Rating: 2 Comments: "I’m frustrated with this...  0.000448
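
To analyze the results, it can help to surface the highest-scoring rows first. The sketch below uses the ten scores displayed above and plain Python; the `top_rows` helper and the review threshold of 0.01 are illustrative assumptions, not values recommended by the transform.

```python
# (row index, hap_score) pairs copied from the output displayed above.
scores = [
    (0, 0.000195), (1, 0.000153), (2, 0.000169), (3, 0.000158), (4, 0.000875),
    (5, 0.015778), (6, 0.000221), (7, 0.000179), (8, 0.000236), (9, 0.000448),
]


def top_rows(rows, n=3):
    # Sort by score, highest first, and keep the first n rows.
    return sorted(rows, key=lambda r: r[1], reverse=True)[:n]


THRESHOLD = 0.01  # hypothetical cut-off for manual review
flagged = [idx for idx, score in scores if score >= THRESHOLD]

print(top_rows(scores))
print(flagged)
```

With this sample, only row 5 crosses the hypothetical threshold, and every score remains far below 1, so none of the feedback shown would be considered high HAP.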
