HAP Transform Example Notebook
=====================================

This notebook scores the text in a CSV file for Hate, Abuse, and Profanity (HAP).
It converts the CSV file into Parquet format, runs the `hap_local_python.py` script to calculate HAP scores,
and generates outputs for further analysis.



### Overview
This notebook demonstrates the use of the HAP transformation to annotate documents with a `hap_score`, 
indicating the likelihood of Hate, Abuse, or Profanity in the text.

### Workflow
The HAP process consists of:
1. **Sentence Splitting**: Documents are split into sentences using NLTK.
2. **HAP Annotation**: Each sentence is scored between 0 and 1 (1 = high HAP, 0 = no HAP).
3. **Aggregation**: The document's final HAP score is the maximum score among all sentences.
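
The three workflow steps can be sketched as follows. This is a minimal illustration, not the transform's implementation: a regular expression stands in for the NLTK sentence tokenizer, and `toy_sentence_score` is a hypothetical keyword-based scorer standing in for the Granite Guardian model. Returning 0.0 for an empty document is also an assumption made for this sketch.

```python
import re
from typing import Callable, List

# Hypothetical word list for the toy scorer; the real transform uses a model.
FLAGGED_WORDS = {"idiot", "stupid"}


def split_sentences(text: str) -> List[str]:
    """Naive splitter on '.', '!' or '?' followed by whitespace (stand-in for NLTK)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def toy_sentence_score(sentence: str) -> float:
    """Hypothetical scorer: 1.0 if any flagged word appears, else 0.0."""
    words = set(re.findall(r"[a-z']+", sentence.lower()))
    return 1.0 if words & FLAGGED_WORDS else 0.0


def document_hap_score(
    text: str, score_fn: Callable[[str], float] = toy_sentence_score
) -> float:
    """Score each sentence and return the maximum (0.0 for an empty document)."""
    scores = [score_fn(sentence) for sentence in split_sentences(text)]
    return max(scores, default=0.0)


doc = "The service was quick. The agent was an idiot! I will come back."
print(split_sentences(doc))
print(document_hap_score(doc))
```

Taking the maximum means a single high-scoring sentence is enough to flag the whole document, regardless of how many clean sentences surround it.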


### Configuration
- **Model Name**: IBM Granite Guardian (`ibm-granite/granite-guardian-hap-38m` by default).
- **Document Text Column** (`--doc_text_column`): The input column containing the document text to score. Defaults to `contents`.
- **Annotation Column** (`--annotation_column`): The output column that receives the HAP scores. Defaults to `hap_score`.


### Steps in This Notebook
1. Define paths and import libraries.
2. Convert CSV input to Parquet.
3. Run the HAP transformation script.
4. View and analyze the results.


### Open this notebook in Google Colab

Click the badge to open this notebook in Google Colab: [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/IBM/data-prep-kit/blob/8381d353c3aa90d334e81ac3029ab774c753cc4b/examples/notebooks/hap/generate_hap_score_csv.ipynb)


### Install dependencies for Google Colab environment

In [None]:
! pip install data-prep-connector
! pip install  'data-prep-toolkit[ray]==0.2.2.dev1'
! pip install  'data-prep-toolkit-transforms[ray,all]==0.2.2.dev1'
! pip install nltk==3.9.1 transformers==4.38.2 'torch>=2.2.2,<=2.4.1' pandas==2.2.2

### Import necessary libraries

In [None]:
import os
import pandas as pd
import subprocess
import sys

### Step 1: Define Paths
Define the paths for the script, input folder, and output folder.
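
Because the script location is resolved relative to the notebook's working directory, the path may not exist when the notebook runs outside a repository checkout. A minimal sketch of an early check follows; `resolve_existing_file` is a hypothetical helper, not part of the toolkit, and the temporary directory exists only for the demonstration.

```python
import tempfile
from pathlib import Path


def resolve_existing_file(base_dir: Path, relative_path: str) -> Path:
    """Resolve relative_path against base_dir and fail early if it is not a file."""
    candidate = (base_dir / relative_path).resolve()
    if not candidate.is_file():
        raise FileNotFoundError(f"Expected a file at: {candidate}")
    return candidate


# Demonstration with a throwaway directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    base = Path(tmp_dir)
    (base / "script.py").write_text("print('hello')\n")

    found_name = resolve_existing_file(base, "script.py").name

    try:
        resolve_existing_file(base, "missing.py")
        missing_raised = False
    except FileNotFoundError:
        missing_raised = True

print(found_name, missing_raised)
```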

In [None]:
from pathlib import Path

notebook_dir = Path().resolve()
relative_script_dir = '../../../transforms/universal/hap/python/src/hap_local_python.py'
hap_script_path = (notebook_dir / relative_script_dir).resolve()

input_folder = "./input"
output_folder = "./output"

In [None]:
# Ensure the necessary folders exist.
os.makedirs(input_folder, exist_ok=True)
os.makedirs(output_folder, exist_ok=True)

print(f"Script Path: {hap_script_path}")
print(f"Input Folder: {input_folder}")
print(f"Output Folder: {output_folder}")

### Step 2: Check for CSV Files in Input Folder

- Place your CSV file in the `input_folder`.
- Ensure the column containing the text matches the `doc_text_column` parameter.
- If your text column has a different name, update the `doc_text_column` parameter in later cells.
- This cell sets up the file paths for the input file.
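
Before converting, it can help to confirm that the CSV actually contains the expected text column. A minimal sketch using only the standard library follows; `csv_has_column` is a hypothetical helper, and the sample file is created just for the demonstration.

```python
import csv
import os
import tempfile


def csv_has_column(csv_path: str, column: str) -> bool:
    """Return True if the CSV header row contains the given column name."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        header = next(csv.reader(handle), [])
    return column in header


# Demonstration with a throwaway CSV file.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.csv")
    with open(sample_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["Customer Feedback", "Rating"])
        writer.writerow(["Great product, fast delivery.", 5])

    has_feedback = csv_has_column(sample_path, "Customer Feedback")
    has_contents = csv_has_column(sample_path, "contents")

print(has_feedback, has_contents)
```

Reading only the header row keeps the check cheap even for large files.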


In [None]:
csv_files = sorted(f for f in os.listdir(input_folder) if f.endswith(".csv"))

if not csv_files:
    # Stop here: the following cells need a CSV file to work with.
    raise FileNotFoundError(
        f"No CSV files found in the input folder: {input_folder}. "
        "Please place a CSV file in the input folder and rerun this notebook."
    )

print(f"Found CSV file(s): {csv_files}")

# Pick the first CSV file in the folder (alphabetical order)
csv_file_path = os.path.join(input_folder, csv_files[0])
print(f"Using CSV file: {csv_file_path}")


### Step 3: Convert CSV to Parquet
Convert the selected CSV file to Parquet format.

In [None]:
parquet_file_path = os.path.join(input_folder, "data.parquet")
df = pd.read_csv(csv_file_path)
df.to_parquet(parquet_file_path, index=False)
print(f"CSV file converted to Parquet format at: {parquet_file_path}")

### Step 4: Define HAP Parameters

In [None]:
hap_params = {
    "model_name_or_path": "ibm-granite/granite-guardian-hap-38m",  # Default model name
    "annotation_column": "hap_score",  # Output column for HAP scores
    "doc_text_column": "Customer Feedback",  # Input column containing document text
    "inference_engine": "CPU",  # Inference engine (CPU or GPU)
    "max_length": 512,  # Maximum token length
    "batch_size": 128,  # Batch size
}

### Step 5: Run the Transform with the Defined HAP Parameters

This cell sets the HAP parameters as environment variables and runs the HAP transformation script as a subprocess:
- `INPUT_FOLDER`: Folder containing the input Parquet file.
- `OUTPUT_FOLDER`: Folder where the output file with HAP scores will be saved.
- `DOC_TEXT_COLUMN`: The column containing the text for analysis (set to `Customer Feedback` in this notebook).
- `ANNOTATION_COLUMN`: The column where HAP scores will be saved (set to `hap_score` in this notebook).

**Customization**: 
- If your text column has a different name, update the value of `DOC_TEXT_COLUMN` accordingly.
- You can adjust other parameters like `BATCH_SIZE` and `MAX_LENGTH` if needed.
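
Settings supplied through environment variables reach a script because a child process inherits its parent's environment. A minimal sketch of that mechanism follows, in which a one-line Python command stands in for the HAP script. This is an assumption for illustration only: how `hap_local_python.py` consumes these variables is not shown here.

```python
import os
import subprocess
import sys

# Copy the parent environment and add one setting for the child process.
child_env = os.environ.copy()
child_env["DOC_TEXT_COLUMN"] = "Customer Feedback"

# Stand-in for the real script: a child that echoes the variable it received.
child_code = "import os; print(os.environ['DOC_TEXT_COLUMN'])"

result = subprocess.run(
    [sys.executable, "-c", child_code],
    env=child_env,
    check=True,
    text=True,
    capture_output=True,
)
received = result.stdout.strip()
print(received)
```

Passing `env=` keeps the setting scoped to the child process. Assigning to `os.environ` instead, as this notebook does, also makes the value visible to later cells in the same kernel.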

In [None]:
# Set environment variables for the HAP parameters.
# The subprocess started below inherits them from this process.
os.environ["MODEL_NAME_OR_PATH"] = "ibm-granite/granite-guardian-hap-38m"
os.environ["ANNOTATION_COLUMN"] = "hap_score"
os.environ["DOC_TEXT_COLUMN"] = "Customer Feedback"
os.environ["INFERENCE_ENGINE"] = "CPU"
os.environ["MAX_LENGTH"] = "512"
os.environ["BATCH_SIZE"] = "128"
os.environ["INPUT_FOLDER"] = input_folder
os.environ["OUTPUT_FOLDER"] = output_folder

try:
    # Use the interpreter running this notebook so the installed packages are available
    result = subprocess.run(
        [sys.executable, str(hap_script_path)],
        check=True,
        text=True,
        capture_output=True
    )

    # If successful, print the result of the transform
    print("Transform completed successfully.")
    print(result.stdout)

except subprocess.CalledProcessError as e:
    # If there was an error, print the error message
    print("Error occurred during transform execution.")
    print(e.stderr)

### Step 6: Generate the Output CSV

This step checks for any existing CSV files in the output folder and removes them before generating new ones. The following actions are performed:

1. **Listing Output Files**: The script lists all files in the output folder.
2. **Check for Parquet Files**: It identifies `.parquet` files in the output folder.
3. **Remove Old CSV Files**: If any previous output files (`hap_complete_output.csv` or `hap_filtered_output.csv`) exist, they are deleted.
4. **Read Parquet File**: The Parquet file is read into a DataFrame.
5. **Filter Data**: The relevant columns, `doc_text_column` (from the environment variable) and `hap_score_column`, are selected from the DataFrame.
6. **Save New CSV Files**: The filtered data is saved into two new CSV files:
   - `hap_complete_output.csv` (containing the full output)
   - `hap_filtered_output.csv` (containing only the filtered relevant columns).

This ensures that only the latest output is retained, and no old files remain in the output folder.

In [None]:
import os
import pandas as pd

# Collect the Parquet files written by the transform
parquet_files = [f for f in os.listdir(output_folder) if f.endswith(".parquet")]

if not parquet_files:
    print("No Parquet output files found. Please check the script or configuration.")

# Define the output CSV file paths
complete_output_csv = os.path.join(output_folder, "hap_complete_output.csv")
filtered_output_csv = os.path.join(output_folder, "hap_filtered_output.csv")

for file in parquet_files:
    output_file_path = os.path.join(output_folder, file)
    output_df = pd.read_parquet(output_file_path)  # Read the Parquet file
    print(f"Complete Output Parquet File Path: {output_file_path}")

    # Remove old CSV files if they exist
    for old_csv in (complete_output_csv, filtered_output_csv):
        if os.path.exists(old_csv):
            os.remove(old_csv)
            print(f"Old CSV file removed: {old_csv}")

    # Keep only the document text and HAP score columns
    hap_score_column = hap_params["annotation_column"]
    doc_text_column = os.getenv("DOC_TEXT_COLUMN")  # Read from environment variable
    filtered_df = output_df[[doc_text_column, hap_score_column]]

    # Show the filtered DataFrame (only the HAP score and document text)
    print("Filtered Output (only HAP score and document text):")
    display(filtered_df)

    # Save the complete output as a CSV file
    output_df.to_csv(complete_output_csv, index=False)  # Convert the Parquet to CSV
    print(f"Complete output saved to: {complete_output_csv}")

    # Save the filtered output as a CSV file
    filtered_df.to_csv(filtered_output_csv, index=False)
    print(f"Filtered output saved to: {filtered_output_csv}")
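
As a starting point for further analysis, the rows whose score exceeds a chosen cutoff can be pulled out of the filtered CSV. A minimal sketch using the standard `csv` module follows; the 0.5 threshold, the `rows_above_threshold` helper, and the sample rows are illustrative assumptions, not values or functions provided by the transform.

```python
import csv
import io

# Illustrative stand-in for the contents of hap_filtered_output.csv.
sample_csv = io.StringIO(
    "Customer Feedback,hap_score\n"
    "Great product,0.01\n"
    "Example of abusive text,0.97\n"
    "Delivery was late,0.12\n"
)


def rows_above_threshold(handle, score_column="hap_score", threshold=0.5):
    """Return the rows whose score is strictly greater than the threshold."""
    reader = csv.DictReader(handle)
    return [row for row in reader if float(row[score_column]) > threshold]


flagged = rows_above_threshold(sample_csv)
for row in flagged:
    print(row["Customer Feedback"], row["hap_score"])
```

To run this against real output, replace `sample_csv` with a file handle opened on `hap_filtered_output.csv`.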
