# Text Feedback Analysis with Embeddings and Clustering

This notebook demonstrates a complete workflow for analyzing text feedback using modern embedding models and clustering techniques. We'll transform unstructured feedback into actionable insights.

## Overview

1. **Data Loading**: Import and prepare feedback data
2. **Embedding Generation**: Convert text to numerical representations
3. **Clustering**: Group similar feedback together
4. **Visualization**: Create interactive visualizations
5. **Analysis**: Summarize cluster themes with an LLM (Gemini)

Let's begin!

## 1. Setup and Imports

In [None]:
# Install required packages (run once)
# %pip installs into the environment of the running kernel
%pip install pandas numpy scikit-learn umap-learn hdbscan matplotlib seaborn plotly google-genai tqdm

In [None]:
# Import libraries
import pandas as pd
import numpy as np
from datetime import datetime
import json
import time
from pathlib import Path

# Embedding and AI
from google import genai
from google.genai import types

# Clustering and dimensionality reduction
import umap
import hdbscan
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go

# Progress tracking
from tqdm.notebook import tqdm

# Set style
plt.style.use('seaborn-v0_8-darkgrid')
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', 100)

## 2. Configuration

In [None]:
# Configuration
# Replace with your API key (avoid committing a real key with the notebook)
GEMINI_API_KEY = "YOUR_API_KEY_HERE"  # Get from https://aistudio.google.com/app/apikey

# Model settings
EMBEDDING_MODEL = "gemini-embedding-001"  # or "text-embedding-004" (768-dimensional output instead of 3072)
LLM_MODEL = "gemini-2.0-flash"

# Analysis parameters
MIN_FEEDBACK_LENGTH = 50  # Minimum character count
BATCH_SIZE = 20  # For API calls
MIN_CLUSTER_SIZE = 10  # For HDBSCAN; keep well below the dataset size (try 2-3 with the 10-row demo data)

# Initialize Gemini client
client = genai.Client(api_key=GEMINI_API_KEY)
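
The configuration cell above stores the key as a string literal. A minimal sketch of reading it from the environment instead is shown here; the helper name `load_api_key` and the variable name `GEMINI_API_KEY` are assumptions for illustration, not something this notebook or the SDK requires.

```python
import os


def load_api_key(env=os.environ, var_name="GEMINI_API_KEY"):
    """Return the API key stored in `var_name`, or raise if it is missing.

    `env` is any mapping; it defaults to the process environment and can be
    swapped for a plain dict when testing.
    """
    key = env.get(var_name, "").strip()
    # Treat the notebook's placeholder value as "not configured"
    if not key or key == "YOUR_API_KEY_HERE":
        raise RuntimeError(f"Set the {var_name} environment variable first.")
    return key


# Hypothetical usage in the configuration cell:
# client = genai.Client(api_key=load_api_key())
```

Failing early with a clear message is usually easier to debug than an authentication error from the first API call.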

## 3. Data Loading and Preparation

In [None]:
# Load your data (replace with your file path)
# Example format: CSV with columns: id, feedback_text, date, category (optional)

# For demo, let's create sample data
sample_feedback = [
    "The meeting room was difficult to find and parking was a nightmare. Better signage needed.",
    "Audio quality was terrible - echo made it impossible to understand speakers clearly.",
    "Great experience documenting the city council meeting. Well organized and easy to follow.",
    "Meeting was cancelled but no one notified us. Wasted trip downtown.",
    "The agenda wasn't available until the meeting started, making it hard to prepare.",
    "Excellent facility with good WiFi and power outlets. Made note-taking much easier.",
    "Board members spoke too quickly and used lots of acronyms without explanation.",
    "Meeting ran 2 hours over schedule. Future documenters should plan accordingly.",
    "First time documenting - the training materials were very helpful!",
    "Technical issues with the streaming platform made remote attendance frustrating."
]

# Create dataframe
df = pd.DataFrame({
    'id': range(1, len(sample_feedback) + 1),
    'feedback_text': sample_feedback,
    'date': pd.date_range('2024-01-01', periods=len(sample_feedback), freq='D'),
    'category': ['Logistics'] * 2 + ['Experience'] * 2 + ['Communication'] * 2 + 
                ['Process'] * 2 + ['Training'] * 1 + ['Technical'] * 1
})

# For real data, use:
# df = pd.read_csv('your_feedback_data.csv')

print(f"Loaded {len(df)} feedback entries")
df.head()

In [None]:
# Data cleaning and filtering
print("Data cleaning...")

# Remove null values
df = df.dropna(subset=['feedback_text'])

# Filter by length
df['text_length'] = df['feedback_text'].str.len()
df_filtered = df[df['text_length'] >= MIN_FEEDBACK_LENGTH].copy()

print(f"Filtered from {len(df)} to {len(df_filtered)} entries")
print(f"Average feedback length: {df_filtered['text_length'].mean():.0f} characters")

# Show length distribution
plt.figure(figsize=(10, 4))
plt.subplot(1, 2, 1)
df_filtered['text_length'].hist(bins=30, alpha=0.7)
plt.xlabel('Feedback Length (characters)')
plt.ylabel('Count')
plt.title('Feedback Length Distribution')

plt.subplot(1, 2, 2)
df_filtered['category'].value_counts().plot(kind='bar', alpha=0.7)
plt.xlabel('Category')
plt.ylabel('Count')
plt.title('Feedback by Category')
plt.xticks(rotation=45)

plt.tight_layout()
plt.show()

## 4. Generate Embeddings

Now we'll convert text to numerical representations using the Gemini embedding model.
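
To make "numerical representations" concrete: texts with similar meaning should map to vectors that point in similar directions, which is commonly measured with cosine similarity. The sketch below uses made-up 3-dimensional vectors (real Gemini embeddings have hundreds or thousands of dimensions) and plain Python so it runs without the notebook's dependencies.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy vectors invented for illustration; they are not real model output
parking = [0.9, 0.1, 0.0]
signage = [0.8, 0.2, 0.1]
wifi = [0.0, 0.2, 0.9]

print(cosine_similarity(parking, signage) > cosine_similarity(parking, wifi))
```

Clustering in the later steps relies on the same idea: feedback about related problems ends up close together in the embedding space.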

In [None]:
def generate_embeddings(texts, batch_size=BATCH_SIZE, task_type="CLUSTERING"):
    """
    Generate embeddings for a list of texts using Gemini API.
    
    Args:
        texts: List of strings to embed
        batch_size: Number of texts per API call
        task_type: One of CLUSTERING, CLASSIFICATION, SEMANTIC_SIMILARITY
    
    Returns:
        numpy array of embeddings
    """
    embeddings = []
    
    print(f"Generating embeddings for {len(texts)} texts...")
    
    for i in tqdm(range(0, len(texts), batch_size)):
        batch = texts[i:i+batch_size]
        
        retries = 3
        for attempt in range(retries):
            try:
                # Generate embeddings
                result = client.models.embed_content(
                    model=EMBEDDING_MODEL,
                    contents=batch,
                    config=types.EmbedContentConfig(task_type=task_type)
                )
                
                # Extract embedding values
                for embedding in result.embeddings:
                    embeddings.append(embedding.values)
                
                # Rate limiting
                time.sleep(0.5)
                break
                
            except Exception as e:
                if attempt < retries - 1:
                    print(f"\nRetry {attempt + 1}/{retries} after error: {e}")
                    time.sleep(2 ** attempt)  # Exponential backoff
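                else:
                    print(f"\nFailed to generate embeddings for batch {i//batch_size}")
                    # Add zero placeholders for the failed batch so rows stay aligned with the input.
                    # Reuse the size of earlier embeddings; 3072 is the gemini-embedding-001 default.
                    dim = len(embeddings[0]) if embeddings else 3072
                    embeddings.extend([[0.0] * dim for _ in batch])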
    
    return np.array(embeddings)

# Generate embeddings for our feedback
embeddings = generate_embeddings(df_filtered['feedback_text'].tolist())
print(f"\nGenerated embeddings with shape: {embeddings.shape}")
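
Because `generate_embeddings` pads failed batches with zero vectors, it can help to detect those rows before clustering; cosine distance is undefined for an all-zero vector. The following is a minimal sketch, and `split_failed_rows` is a hypothetical helper, not part of any library.

```python
def split_failed_rows(embeddings):
    """Return (ok_indices, failed_indices).

    A row counts as failed when every value in it is zero, which is how the
    embedding function above marks a batch that could not be embedded.
    """
    ok, failed = [], []
    for idx, row in enumerate(embeddings):
        if any(row):
            ok.append(idx)
        else:
            failed.append(idx)
    return ok, failed


# Hypothetical usage in the notebook:
# ok, failed = split_failed_rows(embeddings)
# embeddings = embeddings[ok]
# df_filtered = df_filtered.iloc[ok].copy()
```

Dropping the failed rows from both the array and the dataframe keeps them aligned for the later steps.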

## 5. Dimensionality Reduction and Clustering

In [None]:
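# Reduce dimensions for clustering
print("Reducing dimensions with UMAP...")

# Clamp UMAP parameters so small datasets (such as the 10-row demo) still run:
# n_neighbors must be below the sample count, and the default spectral
# initialization needs n_components to be at most n_samples - 2
n_samples = len(embeddings)
n_neighbors = min(15, n_samples - 1)
n_components = min(50, n_samples - 2)

# UMAP for clustering (up to 50 dimensions)
reducer_clustering = umap.UMAP(
    n_components=n_components,
    n_neighbors=n_neighbors,
    min_dist=0.1,
    metric='cosine',
    random_state=42
)
embeddings_reduced = reducer_clustering.fit_transform(embeddings)

# UMAP for visualization (2 dimensions)
reducer_viz = umap.UMAP(
    n_components=2,
    n_neighbors=n_neighbors,
    min_dist=0.1,
    metric='cosine',
    random_state=42
)
embeddings_2d = reducer_viz.fit_transform(embeddings)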

print(f"Reduced to {embeddings_reduced.shape[1]} dimensions for clustering")
print(f"Reduced to {embeddings_2d.shape[1]} dimensions for visualization")

In [None]:
# Perform clustering with HDBSCAN
print("\nClustering with HDBSCAN...")

clusterer = hdbscan.HDBSCAN(
    min_cluster_size=MIN_CLUSTER_SIZE,
    min_samples=5,
    metric='euclidean',
    cluster_selection_epsilon=0.5,
    cluster_selection_method='eom'
)

cluster_labels = clusterer.fit_predict(embeddings_reduced)

# Add results to dataframe
df_filtered['cluster'] = cluster_labels
df_filtered['x'] = embeddings_2d[:, 0]
df_filtered['y'] = embeddings_2d[:, 1]

# Print cluster statistics
n_clusters = len(set(cluster_labels)) - (1 if -1 in cluster_labels else 0)
n_noise = list(cluster_labels).count(-1)

print(f"\nNumber of clusters: {n_clusters}")
print(f"Number of noise points: {n_noise}")
print("\nCluster sizes:")
for cluster_id in sorted(set(cluster_labels)):
    if cluster_id != -1:
        size = sum(cluster_labels == cluster_id)
        print(f"  Cluster {cluster_id}: {size} points")

## 6. Visualize Clusters

In [None]:
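# Create interactive scatter plot
# Cast cluster IDs to strings so Plotly treats them as discrete categories;
# with numeric IDs it uses a continuous color scale and ignores color_discrete_map
plot_df = df_filtered.assign(cluster_label=df_filtered['cluster'].astype(str))

fig = px.scatter(
    plot_df,
    x='x', y='y',
    color='cluster_label',
    hover_data=['feedback_text', 'category'],
    title='Feedback Clusters Visualization',
    labels={'x': 'UMAP 1', 'y': 'UMAP 2', 'cluster_label': 'Cluster'},
    color_discrete_map={'-1': 'lightgray'}  # Noise points in gray
)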

fig.update_traces(
    marker=dict(size=10, line=dict(width=1, color='white')),
    selector=dict(mode='markers')
)

fig.update_layout(
    width=900,
    height=600,
    plot_bgcolor='rgba(240,240,240,0.5)'
)

fig.show()

In [None]:
# Cluster size distribution
cluster_sizes = df_filtered[df_filtered['cluster'] >= 0]['cluster'].value_counts().sort_index()

fig_sizes = go.Figure(data=[
    go.Bar(
        x=[f'Cluster {i}' for i in cluster_sizes.index],
        y=cluster_sizes.values,
        text=cluster_sizes.values,
        textposition='auto',
    )
])

fig_sizes.update_layout(
    title='Cluster Size Distribution',
    xaxis_title='Cluster',
    yaxis_title='Number of Feedback Entries',
    showlegend=False,
    height=400
)

fig_sizes.show()

## 7. Analyze Cluster Themes

In [None]:
def analyze_cluster(cluster_feedback, cluster_id):
    """
    Use Gemini to analyze and describe a cluster based on sample feedback.
    """
    # Take the first entries (up to 10); they are not selected for representativeness
    samples = cluster_feedback[:10]
    
    prompt = f"""Analyze the following {len(samples)} feedback comments that share similar themes:

{chr(10).join([f'{i+1}. "{sample}"' for i, sample in enumerate(samples)])}

Please provide:
1. A brief 2-3 sentence description of the main theme or common characteristics
2. Key topics or concerns mentioned (bullet points)
3. The general tone (positive, negative, neutral, mixed)
4. 2-3 actionable insights for improvement

Keep the analysis concise and focused on patterns across all samples."""

    try:
        response = client.models.generate_content(
            model=LLM_MODEL,
            contents=prompt,
            config=types.GenerateContentConfig(
                temperature=0.7,
                max_output_tokens=500
            )
        )
        # response.text can be None when the model returns no text parts
        return response.text or "No analysis text returned."
    except Exception as e:
        return f"Error analyzing cluster: {str(e)}"

# Analyze each cluster
cluster_analyses = {}
# Exclude noise (-1) and cast NumPy integers to int so the dict keys are JSON-serializable
unique_clusters = sorted(int(c) for c in set(cluster_labels) if c >= 0)

print("Analyzing clusters...\n")
for cluster_id in unique_clusters:
    cluster_mask = df_filtered['cluster'] == cluster_id
    cluster_feedback = df_filtered[cluster_mask]['feedback_text'].tolist()
    
    print(f"\n{'='*60}")
    print(f"CLUSTER {cluster_id} ({len(cluster_feedback)} entries)")
    print(f"{'='*60}")
    
    # Get analysis
    analysis = analyze_cluster(cluster_feedback, cluster_id)
    cluster_analyses[cluster_id] = {
        'size': len(cluster_feedback),
        'analysis': analysis,
        'samples': cluster_feedback[:3]
    }
    
    print(analysis)
    
    # Show sample feedback
    print("\nSample feedback:")
    for i, sample in enumerate(cluster_feedback[:3]):
        print(f"{i+1}. {sample[:100]}{'...' if len(sample) > 100 else ''}")
    
    time.sleep(1)  # Rate limiting
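
`analyze_cluster` sends the first ten entries of each cluster to the model. One alternative is to pick the entries whose vectors lie closest to the cluster's centroid. This is a sketch under the assumption that centroid proximity is a reasonable proxy for "typical"; HDBSCAN clusters can be irregularly shaped, so that may not always hold. `closest_to_centroid` is a hypothetical helper.

```python
import math


def closest_to_centroid(vectors, k=10):
    """Return the indices of the k vectors nearest (Euclidean) to their mean."""
    if not vectors:
        return []
    dim = len(vectors[0])
    # Component-wise mean of all vectors in the cluster
    centroid = [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]
    distances = [math.dist(v, centroid) for v in vectors]
    order = sorted(range(len(vectors)), key=lambda i: distances[i])
    return order[:k]


# Hypothetical usage with one cluster's reduced embeddings:
# rows = embeddings_reduced[cluster_mask.to_numpy()].tolist()
# picks = closest_to_centroid(rows, k=10)
# samples = [cluster_feedback[i] for i in picks]
```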

## 8. Export Results

In [None]:
# Create output directory
output_dir = Path('feedback_analysis_output')
output_dir.mkdir(exist_ok=True)

# Save cluster assignments
df_filtered.to_csv(output_dir / 'feedback_with_clusters.csv', index=False)
print(f"Saved cluster assignments to {output_dir / 'feedback_with_clusters.csv'}")

# Save cluster analyses
with open(output_dir / 'cluster_analyses.json', 'w') as f:
    json.dump(cluster_analyses, f, indent=2)
print(f"Saved cluster analyses to {output_dir / 'cluster_analyses.json'}")

# Generate summary report
report = f"""# Feedback Analysis Report

**Generated**: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}

## Summary Statistics
- Total feedback analyzed: {len(df_filtered)}
- Number of clusters found: {n_clusters}
- Noise points: {n_noise}
- Average cluster size: {(len(df_filtered[df_filtered['cluster'] >= 0]) / n_clusters if n_clusters else 0.0):.1f}

## Cluster Analyses
"""

for cluster_id, data in cluster_analyses.items():
    report += f"\n### Cluster {cluster_id} (Size: {data['size']})\n\n"
    report += data['analysis'] + "\n"

# Save report
with open(output_dir / 'analysis_report.md', 'w', encoding='utf-8') as f:
    f.write(report)
print(f"\nSaved analysis report to {output_dir / 'analysis_report.md'}")

## 9. Advanced Analysis (Optional)

In [None]:
# Temporal analysis - how clusters change over time
if 'date' in df_filtered.columns:
    # Group by month and cluster
    df_filtered['month'] = pd.to_datetime(df_filtered['date']).dt.to_period('M')
    temporal_data = df_filtered[df_filtered['cluster'] >= 0].groupby(['month', 'cluster']).size().unstack(fill_value=0)
    
    # Create stacked area chart
    fig_temporal = go.Figure()
    
    for cluster in temporal_data.columns:
        fig_temporal.add_trace(go.Scatter(
            x=temporal_data.index.astype(str),
            y=temporal_data[cluster],
            mode='lines',
            stackgroup='one',
            name=f'Cluster {cluster}'
        ))
    
    fig_temporal.update_layout(
        title='Cluster Distribution Over Time',
        xaxis_title='Month',
        yaxis_title='Number of Feedback Entries',
        hovermode='x unified',
        height=400
    )
    
    fig_temporal.show()

In [None]:
# Cross-tabulation with categories (if available)
if 'category' in df_filtered.columns:
    # Create crosstab
    crosstab = pd.crosstab(df_filtered['category'], df_filtered['cluster'], normalize='index') * 100
    
    # Create heatmap
    fig_heatmap = px.imshow(
        crosstab.values,
        labels=dict(x="Cluster", y="Category", color="Percentage"),
        x=[f"Cluster {i}" if i >= 0 else "Noise" for i in crosstab.columns],
        y=crosstab.index,
        title="Category Distribution Across Clusters (%)",
        color_continuous_scale="Blues"
    )
    
    fig_heatmap.update_layout(height=400)
    fig_heatmap.show()
    
    print("\nKey insights:")
    for category in crosstab.index:
        dominant_cluster = crosstab.loc[category].idxmax()
        if dominant_cluster >= 0:
            percentage = crosstab.loc[category, dominant_cluster]
            print(f"- {category} feedback is {percentage:.0f}% in Cluster {dominant_cluster}")

## 10. Conclusions and Next Steps

### What We've Accomplished
1. ✅ Converted text feedback into numerical embeddings
2. ✅ Clustered similar feedback together
3. ✅ Analyzed themes using AI
4. ✅ Created visualizations
5. ✅ Exported results for further analysis

### Next Steps
1. **Fine-tune parameters**: Adjust `min_cluster_size` and UMAP parameters
2. **Add more data**: Include additional feedback sources
3. **Track changes**: Run analysis periodically to track trends
4. **Take action**: Use insights to improve your product/service
5. **Automate**: Set up scheduled analysis pipelines
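
For the parameter tuning step, one simple approach is to rerun clustering over several `min_cluster_size` values and compare the number of clusters and the share of noise points. The sketch below is library-agnostic: `cluster_fn` is a hypothetical callable you supply, returning one integer label per point with `-1` for noise (the HDBSCAN convention used in this notebook).

```python
def sweep_min_cluster_size(points, sizes, cluster_fn):
    """Run cluster_fn(points, size) for each size and summarize the labels."""
    rows = []
    for size in sizes:
        labels = list(cluster_fn(points, size))
        n_noise = sum(1 for label in labels if label == -1)
        n_clusters = len({label for label in labels if label != -1})
        rows.append({
            "min_cluster_size": size,
            "n_clusters": n_clusters,
            "noise_fraction": n_noise / len(labels) if labels else 0.0,
        })
    return rows


# Hypothetical usage with the notebook's reduced embeddings:
# cluster_fn = lambda pts, size: hdbscan.HDBSCAN(min_cluster_size=size).fit_predict(pts)
# for row in sweep_min_cluster_size(embeddings_reduced, [5, 10, 20], cluster_fn):
#     print(row)
```

A very high noise fraction or a single giant cluster are both signs that the setting does not suit the data.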

### Tips for Production Use
- Cache embeddings to avoid re-computing
- Use batch processing for large datasets
- Implement proper error handling
- Monitor API costs
- Consider fine-tuning embeddings for your domain
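
As one way to approach the caching tip, embeddings can be stored on disk keyed by a hash of the text, so that only new texts trigger API calls. This is a minimal sketch: `embed_with_cache` and `embed_fn` are hypothetical names, and the JSON file format is an assumption chosen for simplicity rather than efficiency.

```python
import hashlib
import json
from pathlib import Path


def embed_with_cache(texts, embed_fn, cache_path):
    """Return one embedding per text, calling embed_fn only for uncached texts.

    embed_fn takes a list of strings and returns a list of vectors in the
    same order.
    """
    cache_path = Path(cache_path)
    cache = json.loads(cache_path.read_text()) if cache_path.exists() else {}
    keys = [hashlib.sha256(t.encode("utf-8")).hexdigest() for t in texts]

    # Collect texts that are not cached yet; the dict removes duplicates
    missing = {}
    for key, text in zip(keys, texts):
        if key not in cache:
            missing[key] = text

    if missing:
        new_vectors = embed_fn(list(missing.values()))
        for key, vector in zip(missing, new_vectors):
            cache[key] = list(vector)
        cache_path.write_text(json.dumps(cache))

    return [cache[key] for key in keys]


# Hypothetical usage with this notebook's function:
# vectors = embed_with_cache(
#     df_filtered['feedback_text'].tolist(),
#     lambda batch: generate_embeddings(batch).tolist(),
#     'embedding_cache.json',
# )
```

If you switch embedding models, use a separate cache file per model, since vectors from different models are not comparable.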

In [None]:
# Save this notebook's configuration for reproducibility
config = {
    'analysis_date': datetime.now().isoformat(),
    'parameters': {
        'embedding_model': EMBEDDING_MODEL,
        'llm_model': LLM_MODEL,
        'min_feedback_length': MIN_FEEDBACK_LENGTH,
        'batch_size': BATCH_SIZE,
        'min_cluster_size': MIN_CLUSTER_SIZE,
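        # Record the values actually used rather than hard-coded numbers
        'umap_components': int(embeddings_reduced.shape[1]),
        'umap_neighbors': int(reducer_clustering.n_neighbors)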
    },
    'results': {
        'total_feedback': len(df_filtered),
        'num_clusters': n_clusters,
        'noise_points': n_noise
    }
}

with open(output_dir / 'analysis_config.json', 'w') as f:
    json.dump(config, f, indent=2)

print("Analysis complete! 🎉")
print(f"\nAll results saved to: {output_dir.absolute()}")