### Creating a test dataset to use after fine-tuning

This notebook builds a test dataset for evaluating fine-tuned models on unseen data. It excludes every complaint narrative already present in the training dataset (`filtered_dataset.csv`) and then validates that the new test dataset shares no narratives with it.

In [1]:
from datasets import load_dataset, concatenate_datasets
import pandas as pd
from pathlib import Path

#### Overlap validator function (between datasets)

In [2]:
# Validator to ensure no overlap between test_dataset and filtered_dataset.csv
def validate_no_overlap(existing_narratives, new_test_df):
	"""
	Validate that none of the 'Consumer complaint narrative' values in the new test dataset
	are present in the existing dataset.

	Parameters:
		existing_narratives: Set of existing narratives (from `filtered_dataset.csv`).
		new_test_df: Pandas DataFrame of the newly created test dataset.

	Returns:
		bool: True if no overlap, False otherwise (overlapping narratives are printed).
	"""
	# Extract narratives from the new test dataset
	test_narratives = set(new_test_df["Consumer complaint narrative"].dropna().unique())

	# Find intersection
	overlapping_narratives = test_narratives.intersection(existing_narratives)

	# Check if there is any overlap
	if overlapping_narratives:
		print("Validation Failed: The following narratives overlap with the existing dataset:")
		for narrative in overlapping_narratives:
			print(f" - {narrative}")
		return False
	else:
		print("Validation Passed: No overlapping narratives found.")
		return True
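
The validator above compares narratives by exact string equality, so a narrative whose whitespace or line endings changed during the CSV round trip would not be reported as a duplicate. The following is a minimal sketch of a more tolerant check, not part of the original notebook: `normalize_narrative` and `find_overlap` are hypothetical helper names, and the normalization rule (collapse whitespace, lowercase) is an assumption that may be too strict or too loose for a given dataset.

```python
import re


def normalize_narrative(text):
    """Collapse whitespace runs, trim the ends, and lowercase the text (assumed rule)."""
    return re.sub(r"\s+", " ", text).strip().lower()


def find_overlap(existing_narratives, new_narratives, normalize=normalize_narrative):
    """Return the new narratives whose normalized form also occurs in the existing set."""
    existing_keys = {normalize(text) for text in existing_narratives}
    return sorted(text for text in new_narratives if normalize(text) in existing_keys)


# Toy data (invented for illustration): the first test narrative differs from a
# training narrative only by an extra space and a trailing newline.
train_narratives = {
    "I was charged a late fee twice.",
    "My mortgage payment was misapplied.",
}
test_narratives = [
    "I was charged  a late fee twice.\n",
    "The bank closed my card without notice.",
]

overlap = find_overlap(train_narratives, test_narratives)
print(overlap)
```

An exact-match comparison would treat both toy test narratives as unseen; the normalized comparison flags the first one.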


#### Loading the original dataset and applying the same filters

In [3]:
# Load the original dataset from Hugging Face
original_dataset = load_dataset("BEE-spoke-data/consumer-finance-complaints", split = "train")

# Apply filters to get rows matching the specified criteria
filtered_dataset = original_dataset.filter(
	lambda row: row["Company"] == "BANK OF AMERICA, NATIONAL ASSOCIATION"
	            and row["Product"] in ["Credit card or prepaid card", "Mortgage"]
	            and row["Consumer complaint narrative"] is not None
)

# Load the previously processed dataset from CSV
filtered_dataset_csv = pd.read_csv("data/filtered_dataset.csv")

#### Excluding the instances that exist in **filtered_dataset.csv**

In [6]:
# Extract narratives from the CSV to use for exclusion
existing_narratives = set(filtered_dataset_csv["Consumer complaint narrative"].dropna().unique())

# Exclude rows with matching "Consumer complaint narrative"
remaining_dataset = filtered_dataset.filter(
	lambda row: row["Consumer complaint narrative"] not in existing_narratives
)

# Separate the remaining dataset into the two product classes
mortgage_dataset = remaining_dataset.filter(lambda row: row["Product"] == "Mortgage")
credit_card_dataset = remaining_dataset.filter(lambda row: row["Product"] == "Credit card or prepaid card")

# Sample 100 rows from each class (use a fixed seed for consistent test set generation)
mortgage_sample = mortgage_dataset.shuffle(seed = 42).select(range(100))
credit_card_sample = credit_card_dataset.shuffle(seed = 42).select(range(100))

# Combine the two samples into one test dataset
test_dataset = concatenate_datasets([mortgage_sample, credit_card_sample])

# Keep only the required columns: "Consumer complaint narrative" and "Product"
test_dataset = test_dataset.remove_columns(
	[col for col in remaining_dataset.column_names if col not in ["Consumer complaint narrative", "Product"]]
)
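
The sampling step assumes each product class still has at least 100 rows after exclusion; if a class is smaller, `select(range(100))` is expected to fail with an index error. The sketch below illustrates the same idea (fixed seed, equal sample per class, explicit size check) in plain Python so it runs without the `datasets` library. `sample_per_class` is a hypothetical helper, and the toy rows are invented for illustration.

```python
import random
from collections import Counter


def sample_per_class(rows, label_key, n_per_class, seed=42):
    """Draw n_per_class rows per label with a fixed seed; fail loudly if a class is too small."""
    groups = {}
    for row in rows:
        groups.setdefault(row[label_key], []).append(row)

    rng = random.Random(seed)  # Fixed seed so the same sample is drawn on every run
    sample = []
    for label in sorted(groups):  # Sorted labels keep the draw order deterministic
        group = groups[label]
        if len(group) < n_per_class:
            raise ValueError(
                f"Class {label!r} has only {len(group)} rows; {n_per_class} requested."
            )
        sample.extend(rng.sample(group, n_per_class))
    return sample


# Toy rows (invented): 5 mortgage complaints and 3 card complaints
toy_rows = (
    [{"Product": "Mortgage", "Consumer complaint narrative": f"mortgage issue {i}"} for i in range(5)]
    + [{"Product": "Credit card or prepaid card", "Consumer complaint narrative": f"card issue {i}"} for i in range(3)]
)

toy_sample = sample_per_class(toy_rows, "Product", n_per_class=2)
print(Counter(row["Product"] for row in toy_sample))
```

In the notebook itself, an equivalent guard would compare `len(mortgage_dataset)` and `len(credit_card_dataset)` against 100 before calling `select`.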

#### Converting to a pandas DataFrame, validating, and saving the unseen data to test_dataset.csv

In [7]:
# Convert the dataset to a pandas DataFrame
test_df = test_dataset.to_pandas()

# Validate the new test dataset
validation_result = validate_no_overlap(existing_narratives, test_df)

# Save the dataset only if validation passes
if validation_result:
	# Define the output folder and file path using pathlib
	output_folder = Path("data")
	output_folder.mkdir(exist_ok = True)  # Create the folder if it doesn't exist
	test_output_path = output_folder / "test_dataset.csv"

	# Save the new test dataset to a CSV file for evaluation purposes
	test_df.to_csv(test_output_path, index = False)
	print(f"Test dataset successfully saved to {test_output_path}")
else:
	print("Test dataset not saved due to overlapping entries.")


Validation Passed: No overlapping narratives found.
Test dataset successfully saved to data\test_dataset.csv
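
As a follow-up check, the saved file can be read back to confirm that both product classes are present in the expected proportions. The sketch below is one possible way to do that with the standard library only. `count_labels` is a hypothetical helper, and the in-memory CSV stands in for `data/test_dataset.csv` so the example runs on its own.

```python
import csv
import io
from collections import Counter


def count_labels(csv_file, label_column="Product"):
    """Count rows per label in an open CSV file (quoted multi-line fields are handled by csv)."""
    reader = csv.DictReader(csv_file)
    return Counter(row[label_column] for row in reader)


# In-memory stand-in for the saved file (contents invented for illustration);
# the first narrative contains a comma and a line break inside the quoted field.
toy_csv = io.StringIO(
    "Consumer complaint narrative,Product\n"
    '"Payment was applied late,\nthen a fee was added.",Mortgage\n'
    '"Card was closed without notice.",Credit card or prepaid card\n'
)

counts = count_labels(toy_csv)
print(counts)
```

For the real file, the same function could be called inside `with open("data/test_dataset.csv", newline="", encoding="utf-8") as f:`; with the sampling used in this notebook, each class would be expected to show 100 rows.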
