Let's revise our approach for this situation. The parser seems to have trouble identifying the severity categories, so let's try a more direct approach with your LLM.

First, let's modify your prompt to get cleaner output:

Please generate exactly 10 deposit-related bank customer complaints with the following format for each complaint:

complaint_id: [number]
complaint: [text of complaint]
sentiment_score: [score between -1 and 1]
severity_category: [category]

Use the following severity categories:
- Very severe (-1.0 to -0.6)
- Severe (-0.6 to -0.2)
- Neutral (-0.2 to 0.2)
- Less severe (0.2 to 0.6)
- Least severe (0.6 to 1.0)

Make sure each complaint follows exactly this format for easy parsing.
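
Once the model responds to a prompt like this, a quick count of the field labels can tell you whether it actually followed the format before you try to parse anything. The snippet below is a minimal sketch: `count_labels` and `sample_output` are hypothetical names and data made up for illustration, and it assumes each field starts on its own line.

```python
import re

# Field labels requested in the prompt
LABELS = ("complaint_id", "complaint", "sentiment_score", "severity_category")

def count_labels(raw_text):
    """Count the lines that start with each expected 'label:' prefix."""
    counts = {}
    for label in LABELS:
        # [ \t]* allows leading indentation; MULTILINE makes ^ match at each line start
        pattern = rf"^[ \t]*{label}[ \t]*:"
        counts[label] = len(re.findall(pattern, raw_text, flags=re.IGNORECASE | re.MULTILINE))
    return counts

# Hypothetical LLM output in which the second complaint lacks a severity category
sample_output = """complaint_id: 1
complaint: My deposit is still pending.
sentiment_score: -0.7
severity_category: Very severe

complaint_id: 2
complaint: I was charged a fee on a deposit.
sentiment_score: -0.3
"""

counts = count_labels(sample_output)
print(counts)
```

If the four counts differ, or none of them equals the 10 complaints you asked for, the problem is in the LLM output rather than in the parser.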

Here's an improved parsing script that's more flexible with the actual structure:

import pandas as pd
import re

# Paste the LLM output here
raw_text = """[Paste the entire LLM output here]"""

# Process text to extract structured data
complaints = []
current_complaint = {}
# The score pattern accepts whole numbers such as -1, 0 or 1 as well as decimals
complaint_pattern = r'complaint_id:\s*(\d+)|complaint:\s*(.*)|sentiment_score:\s*([-+]?\d*\.?\d+)|severity_category:\s*(.*)'

for line in raw_text.split('\n'):
    line = line.strip()
    if not line:
        # Empty line might indicate a new complaint
        if current_complaint and 'complaint' in current_complaint:
            complaints.append(current_complaint)
            current_complaint = {}
        continue
        
    # Check for complaint components
    match = re.search(complaint_pattern, line, re.IGNORECASE)
    if match:
        if match.group(1):  # complaint_id starts a new record
            if current_complaint and 'complaint' in current_complaint:
                complaints.append(current_complaint)
            # Always start from a fresh dict so stale fields cannot leak into the next record
            current_complaint = {'complaint_id': int(match.group(1))}
        elif match.group(2):  # complaint text
            current_complaint['complaint'] = match.group(2).strip()
        elif match.group(3):  # sentiment score
            current_complaint['sentiment_score'] = float(match.group(3))
        elif match.group(4):  # severity category
            current_complaint['severity_category'] = match.group(4).strip()

# Don't forget the last complaint
if current_complaint and 'complaint' in current_complaint:
    complaints.append(current_complaint)

# Create DataFrame
df = pd.DataFrame(complaints)

# Check what we extracted
print(f"Successfully extracted {len(df)} complaints")
print(df.head())

# Save to CSV
if not df.empty:
    df.to_csv('parsed_complaints.csv', index=False)
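
Since the severity category is the field that keeps getting lost, one fallback is to derive it from the sentiment score instead of relying on the parsed text. Here is a minimal sketch of that idea. The names `category_from_score` and `fill_missing_categories` are hypothetical, and because the ranges in the prompt share their endpoints, this assumes that a score exactly on a boundary goes to the more severe category.

```python
# Upper bound of each range from the prompt, paired with its category.
# Assumption: a score exactly on a boundary goes to the more severe category.
SEVERITY_BOUNDS = [
    (-0.6, "Very severe"),
    (-0.2, "Severe"),
    (0.2, "Neutral"),
    (0.6, "Less severe"),
    (1.0, "Least severe"),
]

def category_from_score(score):
    """Map a sentiment score in [-1, 1] to one of the prompt's categories."""
    if not -1.0 <= score <= 1.0:
        raise ValueError(f"sentiment score out of range: {score}")
    for upper, category in SEVERITY_BOUNDS:
        if score <= upper:
            return category

def fill_missing_categories(records):
    """Fill in severity_category where parsing did not capture one."""
    for record in records:
        if not record.get("severity_category") and "sentiment_score" in record:
            record["severity_category"] = category_from_score(record["sentiment_score"])
    return records
```

You could call `fill_missing_categories(complaints)` before building the DataFrame, or overwrite the whole column afterwards with `df['severity_category'] = df['sentiment_score'].apply(category_from_score)`.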

If this still doesn't work well, you might need to resort to a more manual approach:

import pandas as pd

# Create your dataset manually from what the LLM generated
complaints = [
    {
        'complaint_id': 1,
        'complaint': "I deposited a check 5 days ago and it still hasn't cleared. This is ridiculous!",
        'sentiment_score': -0.75,
        'severity_category': "Very severe"
    },
    {
        'complaint_id': 2,
        'complaint': "Your mobile app crashed while I was trying to transfer money between my accounts.",
        'sentiment_score': -0.5,
        'severity_category': "Severe"
    },
    # Add more manually formatted entries
]

df = pd.DataFrame(complaints)
df.to_csv('manual_complaints.csv', index=False)

The key is to either get cleaner output from your LLM or improve the parsing approach to handle the specific format you're getting.
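
To see which of the two you need, it can help to list the records that came out incomplete. The sketch below is one way to do that on a list of dicts like `complaints`. The function name `find_incomplete` and the sample records are hypothetical and only there for illustration.

```python
# The four fields every parsed record should contain
REQUIRED_FIELDS = ("complaint_id", "complaint", "sentiment_score", "severity_category")

def find_incomplete(records):
    """Return (index, missing_fields) for every record lacking a required field."""
    problems = []
    for index, record in enumerate(records):
        missing = [field for field in REQUIRED_FIELDS if field not in record]
        if missing:
            problems.append((index, missing))
    return problems

# Hypothetical parser output: the second record lost its score and category
sample_records = [
    {"complaint_id": 1, "complaint": "My check still has not cleared.",
     "sentiment_score": -0.75, "severity_category": "Very severe"},
    {"complaint_id": 2, "complaint": "The app crashed during a transfer."},
]

for index, missing in find_incomplete(sample_records):
    print(f"record {index} is missing: {', '.join(missing)}")
```

If the same field is missing from most records, the regex for that field is the likely culprit. If only a few records are affected, the LLM output is probably inconsistent and a stricter prompt is the better fix.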