# Consumer Complaints Classification with Tagmatic

This notebook demonstrates how to use the Tagmatic library for real-world text classification using consumer complaint data. We'll classify complaints into different categories based on their content.

## Scenario

A financial institution receives thousands of consumer complaints daily. These complaints need to be automatically categorized to route them to the appropriate departments for resolution. Instead of manually creating rules or training complex ML models, we can use Tagmatic to define categories through simple descriptions and let the LLM handle the classification.

## Dataset

We're using a sample of 50 consumer complaints (10 examples from each of 5 categories) from the Consumer Financial Protection Bureau (CFPB) database. This dataset was taken from Kaggle. You can find it here:
https://www.kaggle.com/datasets/selener/consumer-complaint-database?resource=download

In [29]:
# Import required libraries
import json
import pandas as pd
from typing import List, Dict
import os
from collections import Counter

# Import Tagmatic components
from tagmatic.core.category import Category, CategorySet
from tagmatic.core.classifier import Classifier

# For this demo, we'll use OpenAI (you can replace with any LangChain LLM)
from langchain_openai import ChatOpenAI

## 1. Load and Explore the Sample Dataset

In [30]:
# Load the sample complaints dataset
with open('sample_complaints.json', 'r') as f:
    complaints_data = json.load(f)

# Convert to DataFrame for easier analysis
df = pd.DataFrame(complaints_data)

print(f"Dataset contains {len(df)} complaints")
print(f"Categories: {df['category'].unique()}")
print(f"\nCategory distribution:")
print(df['category'].value_counts())

Dataset contains 50 complaints
Categories: ['Account information incorrect' 'Debt is not yours'
 'Account status incorrect' 'Information belongs to someone else'
 'Privacy issues']

Category distribution:
category
Account information incorrect          10
Debt is not yours                      10
Account status incorrect               10
Information belongs to someone else    10
Privacy issues                         10
Name: count, dtype: int64


In [31]:
# Display a few sample complaints
print("Sample complaints:")
print("=" * 80)

for i, row in df.head(3).iterrows():
    print(f"Category: {row['category']}")
    print(f"Text: {row['text'][:200]}...")
    print("-" * 80)

Sample complaints:
Category: Account information incorrect
Text: I sent three letters to XXXX XXXX, Equifax, to verify the information on my credit report but they never sent me any verification I need to know how are they Verifying this information?...
--------------------------------------------------------------------------------
Category: Account information incorrect
Text: I tried contacting XXXX on XXXX XXXX, XXXX via correspondence USPS tracking number XXXX. The late payment dates for XXXX XXXX to XXXX XXXX ; XXXX XXXX to XXXX XXXX is incorrect, I was currently attend...
--------------------------------------------------------------------------------
Category: Account information incorrect
Text: My Name is XXXX XXXX. Am looking at my credit report and there are 8 hard inquiries from the following companies that an disputing because i haven't had any dealings with them. 
1. XXXX. There are 5 i...
--------------------------------------------------------------------------------


## 2. Define Categories Using Tagmatic

Instead of using the original category names, we'll create more business-friendly category descriptions that our classification system can understand and use.

In [None]:
# Define categories with clear, descriptive explanations
categories = [
    Category(
        name="account_information_error",
        description="Complaints about incorrect account information, wrong payment history, "
                   "inaccurate balances, or errors in how account details are reported on credit reports. "
                   "This includes disputes about payment dates, account status, or other account-related data."
    ),
    Category(
        name="debt_not_owed", 
        description="Complaints where the consumer claims they do not owe the debt being collected. "
                   "This includes cases where the debt belongs to someone else, was never authorized, "
                   "or the consumer disputes owing the amount being claimed."
    ),
    Category(
        name="account_status_wrong",
        description="Complaints about incorrect account status reporting, such as accounts showing as open "
                   "when they should be closed, incorrect payment plan status, wrong derogatory marks, "
                   "or accounts not reflecting their true current status."
    ),
    Category(
        name="identity_mix_up",
        description="Complaints involving identity theft, mixed files, or information belonging to someone else "
                   "appearing on the consumer's credit report. This includes unauthorized accounts, "
                   "fraudulent inquiries, or personal information that doesn't belong to the consumer."
    ),
    Category(
        name="privacy_issues",
        description="Complaints related to privacy violations, such as unauthorized access to credit reports, "
                   "sharing of personal information without consent, or issues with how personal data is handled "
                   "by credit reporting agencies or debt collectors."
    )
]

# Create a CategorySet (the list is passed via the `categories` keyword argument)
category_set = CategorySet(categories=categories)

print("Defined Categories:")
print("=" * 50)
for category in categories:
    print(f"\n{category.name.upper()}:")
    print(f"{category.description}")

Defined Categories:

ACCOUNT_INFORMATION_ERROR:
Complaints about incorrect account information, wrong payment history, inaccurate balances, or errors in how account details are reported on credit reports. This includes disputes about payment dates, account status, or other account-related data.

DEBT_NOT_OWED:
Complaints where the consumer claims they do not owe the debt being collected. This includes cases where the debt belongs to someone else, was never authorized, or the consumer disputes owing the amount being claimed.

ACCOUNT_STATUS_WRONG:
Complaints about incorrect account status reporting, such as accounts showing as open when they should be closed, incorrect payment plan status, wrong derogatory marks, or accounts not reflecting their true current status.

IDENTITY_MIX_UP:
Complaints involving identity theft, mixed files, or information belonging to someone else appearing on the consumer's credit report. This includes unauthorized accounts, fraudulent inquiries, or personal i

## 3. Set Up the Classifier

We'll configure the Tagmatic classifier with our categories and an LLM provider.

In [39]:
# Set up the LLM (make sure you have OPENAI_API_KEY in your environment)
# You can replace this with any LangChain-compatible LLM
llm = ChatOpenAI(
    model="gpt-4.1",
    temperature=0.1,  # Low temperature for consistent classification
    max_tokens=100    # We only need short responses for classification
)

# Create the classifier
classifier = Classifier(
    categories=category_set,
    llm=llm
)

print("Classifier initialized successfully!")
print(f"Ready to classify into {len(categories)} categories")

Classifier initialized successfully!
Ready to classify into 5 categories
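
The setup cell assumes `OPENAI_API_KEY` is already present in the environment. As a minimal sketch, a small guard can fail early with a readable message instead of a provider error at the first classification call. The helper name `require_env` is hypothetical and is not part of Tagmatic or LangChain.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise a clear error.

    Hypothetical helper for this notebook; not a Tagmatic or LangChain API.
    """
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set; export it before creating the LLM client."
        )
    return value
```

Calling `require_env("OPENAI_API_KEY")` just before constructing `ChatOpenAI` would surface a missing key immediately.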


## 4. Test Classification on Sample Complaints

Let's test our classifier on a few individual complaints to see how it performs.

In [34]:
# Test on a few sample complaints
test_samples = df.sample(n=5, random_state=42)

print("Testing Classification on Sample Complaints:")
print("=" * 80)

for idx, row in test_samples.iterrows():
    complaint_text = row['text']
    actual_category = row['category']
    
    # Classify the complaint
    result = classifier.classify(complaint_text)
    
    print(f"\nComplaint: {complaint_text[:150]}...")
    print(f"Actual Category: {actual_category}")
    print(f"Predicted Category: {result.category}")
    print("-" * 80)

Testing Classification on Sample Complaints:

Complaint: Debt doesn't belong to me, company put in on my credit. Never had a XXXX account before. I asked them to remove it, they still have it on my credit....
Actual Category: Debt is not yours
Predicted Category: debt_not_owed
--------------------------------------------------------------------------------

Complaint: I have recently obtained a copy of my credit file and there are many mistakes in my report...
Actual Category: Information belongs to someone else
Predicted Category: account_information_error
--------------------------------------------------------------------------------

Complaint: XXXX XXXX, XXXX SOC SEC # XXXX DOB XX/XX/XXXX ADDRESS XXXX XXXX XXXX, XXXX, FL XXXX ATTENTION DISPUTE DEPARTMENT TODAYS DATE : XX/XX/XXXX This serves ...
Actual Category: Information belongs to someone else
Predicted Category: identity_mix_up
--------------------------------------------------------------------------------

Complaint: On XX/X

## 5. Classify All Complaints and Evaluate Performance

Now let's classify all complaints in our dataset and see how well our system performs.

In [35]:
# Create a mapping from original categories to our new category names
category_mapping = {
    "Account information incorrect": "account_information_error",
    "Debt is not yours": "debt_not_owed",
    "Account status incorrect": "account_status_wrong", 
    "Information belongs to someone else": "identity_mix_up",
    "Privacy issues": "privacy_issues"
}

# Add mapped categories to our dataframe
df['expected_category'] = df['category'].map(category_mapping)

print("Category mapping:")
for orig, new in category_mapping.items():
    print(f"{orig} -> {new}")

Category mapping:
Account information incorrect -> account_information_error
Debt is not yours -> debt_not_owed
Account status incorrect -> account_status_wrong
Information belongs to someone else -> identity_mix_up
Privacy issues -> privacy_issues
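
`Series.map` leaves `NaN` for any label that is missing from the mapping, and such rows would silently count as misclassifications later. The sketch below checks that every dataset label has a mapping and that every mapping target is a defined category name. The function name `check_label_mapping` and the toy data are assumptions made for illustration.

```python
from typing import Dict, Iterable, Set


def check_label_mapping(
    labels: Iterable[str],
    mapping: Dict[str, str],
    defined_names: Set[str],
) -> None:
    """Raise ValueError if a label is unmapped or maps to an undefined category."""
    unmapped = sorted(set(labels) - set(mapping))
    undefined = sorted(set(mapping.values()) - set(defined_names))
    if unmapped:
        raise ValueError(f"Labels with no mapping: {unmapped}")
    if undefined:
        raise ValueError(f"Mapping targets that are not defined categories: {undefined}")


# Toy data standing in for df['category'], category_mapping, and the category names
toy_labels = ["Privacy issues", "Debt is not yours", "Privacy issues"]
toy_mapping = {"Privacy issues": "privacy_issues", "Debt is not yours": "debt_not_owed"}
toy_defined = {"privacy_issues", "debt_not_owed"}
check_label_mapping(toy_labels, toy_mapping, toy_defined)  # passes silently
```

In this notebook the equivalent call would be `check_label_mapping(df['category'], category_mapping, {c.name for c in categories})`.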


In [40]:
# Classify all complaints (this may take a few minutes)
print("Classifying all complaints... This may take a few minutes.")

predictions = []
confidences = []

for idx, row in df.iterrows():
    try:
        result = classifier.classify(row['text'])
        predictions.append(result.category)
        confidences.append(result.confidence)
        
        # Progress indicator
        if (idx + 1) % 10 == 0:
            print(f"Processed {idx + 1}/{len(df)} complaints")
            
    except Exception as e:
        print(f"Error classifying complaint {idx}: {e}")
        predictions.append("error")
        confidences.append(0.0)

# Add predictions to dataframe
df['predicted_category'] = predictions
df['confidence'] = confidences

print("\nClassification complete!")

Classifying all complaints... This may take a few minutes.
Processed 10/50 complaints
Processed 20/50 complaints
Processed 30/50 complaints
Processed 40/50 complaints
Processed 50/50 complaints

Classification complete!
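
The loop above records any failed call as `"error"`, which later counts as a misclassification. For transient API failures, one option is to retry a few times before giving up. The following is a generic sketch of retrying with exponential backoff. `call_with_retries`, its parameters, and the delay values are hypothetical and are not part of Tagmatic.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(
    fn: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn(); on an exception, wait and retry, doubling the delay each time.

    The exception from the final attempt is re-raised to the caller.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")
```

Inside the loop, the call could then be written as `result = call_with_retries(lambda: classifier.classify(row['text']))`, with the existing `except` branch still catching the final failure.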


## 6. Analyze Results

Let's analyze how well our classifier performed.

In [41]:
# Calculate accuracy
correct_predictions = (df['expected_category'] == df['predicted_category']).sum()
total_predictions = len(df)
accuracy = correct_predictions / total_predictions

print("Classification Results:")
print("=" * 40)
print(f"Total complaints: {total_predictions}")
print(f"Correct predictions: {correct_predictions}")
print(f"Accuracy: {accuracy:.2%}")


Classification Results:
Total complaints: 50
Correct predictions: 35
Accuracy: 70.00%
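
Overall accuracy hides which categories are being confused with each other. As a sketch, the per-category accuracy and the counts of (expected, predicted) pairs can be computed with the standard library alone. The function name `per_category_report` and the toy labels are assumptions for illustration.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def per_category_report(
    expected: Iterable[str],
    predicted: Iterable[str],
) -> Tuple[Dict[str, float], Counter]:
    """Return (accuracy per expected label, counts of (expected, predicted) pairs)."""
    confusion = Counter(zip(expected, predicted))
    totals: Counter = Counter()
    hits: Counter = Counter()
    for (exp, pred), n in confusion.items():
        totals[exp] += n
        if exp == pred:
            hits[exp] += n
    accuracy = {label: hits[label] / totals[label] for label in totals}
    return accuracy, confusion


# Toy labels standing in for df['expected_category'] and df['predicted_category']
toy_expected = ["a", "a", "b", "b"]
toy_predicted = ["a", "b", "b", "b"]
accuracy_by_label, pair_counts = per_category_report(toy_expected, toy_predicted)
print(accuracy_by_label)        # {'a': 0.5, 'b': 1.0}
print(pair_counts[("a", "b")])  # 1
```

In this notebook the call would be `per_category_report(df['expected_category'], df['predicted_category'])`. Large off-diagonal counts point to category descriptions that overlap and may need tightening.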


## 7. Demonstrate Voting Classifier for Higher Accuracy

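Tagmatic's voting classifier runs several classification rounds per text (three in the cell below, via `voting_rounds=3`) and returns the majority label, with the aim of improving accuracy by smoothing out run-to-run variation in the LLM's answers.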

In [42]:
# Classify all complaints using the voting classifier
print("Classifying all complaints with voting classifier... This may take a few minutes. (more than before)")

predictions = []
confidences = []

for idx, row in df.iterrows():
    try:
        result = classifier.classify(
            row['text'],
            voting_classifier=True,
            voting_rounds=3
        )
        predictions.append(result.category)
        confidences.append(result.confidence)
        
        # Progress indicator
        if (idx + 1) % 10 == 0:
            print(f"Processed {idx + 1}/{len(df)} complaints")
            
    except Exception as e:
        print(f"Error classifying complaint {idx}: {e}")
        predictions.append("error")
        confidences.append(0.0)

# Add predictions to dataframe
df['predicted_category'] = predictions
df['confidence'] = confidences

print("\nClassification complete!")

Classifying all complaints with voting classifier... This may take a few minutes. (more than before)
Processed 10/50 complaints
Processed 20/50 complaints
Processed 30/50 complaints
Processed 40/50 complaints
Processed 50/50 complaints

Classification complete!


In [44]:
# Calculate accuracy
correct_predictions = (df['expected_category'] == df['predicted_category']).sum()
total_predictions = len(df)
accuracy = correct_predictions / total_predictions

print("Classification Results:")
print("=" * 40)
print(f"Total complaints: {total_predictions}")
print(f"Correct predictions: {correct_predictions}")
print(f"Accuracy: {accuracy:.2%}")


Classification Results:
Total complaints: 50
Correct predictions: 37
Accuracy: 74.00%


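The voting classifier scored 4 percentage points higher (74% vs. 70%), which amounts to 2 additional correct predictions out of 50. With a sample this small, the gap may reflect run-to-run variation rather than a reliable improvement.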
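
The idea behind majority voting can be sketched in a few lines. This is a generic illustration and not Tagmatic's actual implementation: `majority_vote` and `classify_once` are hypothetical names, and the way ties are broken here (the label seen first wins) is an assumption.

```python
from collections import Counter
from typing import Callable, Tuple


def majority_vote(
    classify_once: Callable[[str], str],
    text: str,
    rounds: int = 3,
) -> Tuple[str, float]:
    """Classify `text` `rounds` times; return the top label and its vote share."""
    if rounds < 1:
        raise ValueError("rounds must be at least 1")
    votes = Counter(classify_once(text) for _ in range(rounds))
    label, count = votes.most_common(1)[0]
    return label, count / rounds


# Stub classifier that returns a fixed sequence of answers, in place of an LLM call
answers = iter(["debt_not_owed", "identity_mix_up", "debt_not_owed"])
label, share = majority_vote(lambda _text: next(answers), "example complaint", rounds=3)
print(label)  # debt_not_owed
```

Here two of the three rounds agree, so the vote share is 2/3. Each extra round costs one more LLM call per complaint, which is why the voting run takes longer than the single-pass run.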

## 8. Real-World Application Example

Let's simulate how this would work in a real customer service system.

In [26]:
def process_new_complaint(complaint_text: str) -> Dict:
    """
    Simulate processing a new complaint in a customer service system
    """
    # Classify the complaint
    result = classifier.classify(
        complaint_text, 
        voting_classifier=True, 
        voting_rounds=3
    )
    
    # Define routing rules based on category
    routing_rules = {
        "account_information_error": {
            "department": "Credit Reporting Department",
            "priority": "Medium",
            "sla_hours": 48
        },
        "debt_not_owed": {
            "department": "Debt Validation Team", 
            "priority": "High",
            "sla_hours": 24
        },
        "account_status_wrong": {
            "department": "Account Services",
            "priority": "Medium", 
            "sla_hours": 48
        },
        "identity_mix_up": {
            "department": "Fraud Investigation Unit",
            "priority": "High",
            "sla_hours": 12
        },
        # Key must match a defined category name; the department name is illustrative
        "privacy_issues": {
            "department": "Privacy and Compliance Team",
            "priority": "High",
            "sla_hours": 24
        }
    }
    
    routing_info = routing_rules.get(result.category, {
        "department": "General Customer Service",
        "priority": "Low",
        "sla_hours": 72
    })
    
    return {
        "classification": result.category,
        "confidence": result.confidence,
        "routing": routing_info,
        "complaint_text": complaint_text[:100] + "..."
    }

# Test with a new complaint
new_complaint = """
I've been trying to resolve an issue with my credit report for months. 
There's an account showing that I supposedly opened with XYZ Bank, 
but I have never had any relationship with this bank. I've never 
applied for credit with them, never received any cards or statements, 
and I've never even heard of them until I saw this on my credit report. 
This is clearly a case of identity theft or mixed files, and it's 
preventing me from getting approved for a mortgage.
"""

processing_result = process_new_complaint(new_complaint)

print("New Complaint Processing Result:")
print("=" * 40)
print(f"Complaint: {processing_result['complaint_text']}")
print(f"Classification: {processing_result['classification']}")
print(f"Routed to: {processing_result['routing']['department']}")
print(f"Priority: {processing_result['routing']['priority']}")
print(f"SLA: {processing_result['routing']['sla_hours']} hours")

New Complaint Processing Result:
Complaint: 
I've been trying to resolve an issue with my credit report for months. 
There's an account showing ...
Classification: identity_mix_up
Routed to: Fraud Investigation Unit
Priority: High
SLA: 12 hours
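
`process_new_complaint` returns a confidence value but does not act on it. One possible extension is to flag low-confidence results for manual review before routing. The sketch below assumes the confidence is a number between 0 and 1. The helper `apply_review_gate`, the `needs_human_review` key, and the 0.6 threshold are hypothetical choices, not Tagmatic features.

```python
from typing import Any, Dict

REVIEW_THRESHOLD = 0.6  # hypothetical cut-off; tune it against labeled data


def apply_review_gate(
    processing_result: Dict[str, Any],
    threshold: float = REVIEW_THRESHOLD,
) -> Dict[str, Any]:
    """Return a copy of the result with a flag for manual review.

    A missing confidence or one below `threshold` sets the flag.
    """
    gated = dict(processing_result)
    confidence = processing_result.get("confidence")
    gated["needs_human_review"] = confidence is None or confidence < threshold
    return gated


# Toy result shaped like the dictionary returned by process_new_complaint
toy_result = {"classification": "identity_mix_up", "confidence": 0.4}
print(apply_review_gate(toy_result)["needs_human_review"])  # True
```

In this notebook the call would be `apply_review_gate(processing_result)`, and a flagged complaint could be held in a review queue instead of going straight to the routed department.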


## 9. Summary and Key Benefits

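This demonstration shows how Tagmatic can be used for real-world text classification with minimal setup: categories are defined through plain-language descriptions, with no hand-written rules or model training. On this 50-complaint sample, single-pass classification reached 70% accuracy and three-round voting reached 74%.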