Code Interpreter Generates Inadequate Themes for Complaints in CSV #465

Open
@madhubandru

Description

I encountered an issue with TaskWeaver's code interpreter when analyzing a CSV file containing customer complaints. After uploading the CSV and requesting the top 5 themes of the complaints, the interpreter generated Python code utilizing Latent Dirichlet Allocation (LDA) for topic modeling. However, the resulting themes were merely lists of words without meaningful context, which doesn't meet the expectation of coherent thematic summaries.

Steps to Reproduce:

  1. Upload a CSV file with a column containing textual data of customer complaints.
  2. Ask TaskWeaver: "Provide the top 5 themes of the complaints."

Expected Behavior: The Large Language Model (LLM) should interpret the complaints and provide coherent thematic summaries, such as:

  1. Billing Issues
  2. Service Delays
  3. Product Quality Concerns
  4. Customer Support Complaints
  5. Account Access Problems

Actual Behavior: The code interpreter generated the following Python code:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Vectorize the complaint texts
vectorizer = CountVectorizer(stop_words='english')
X = vectorizer.fit_transform(df['COMPLAINT_SUM_TXT'])

# Apply LDA for topic modeling
lda = LatentDirichletAllocation(n_components=5, random_state=0)
lda.fit(X)

# Get the top words for each topic
def get_top_words(model, feature_names, n_top_words):
    top_words = []
    for topic_idx, topic in enumerate(model.components_):
        top_words.append([feature_names[i] for i in topic.argsort()[:-n_top_words - 1:-1]])
    return top_words

n_top_words = 5
feature_names = vectorizer.get_feature_names_out()
top_themes = get_top_words(lda, feature_names, n_top_words)
top_themes

This code produced the following output:

1. ['Databricks', 'client', 'account', 'told', 'form']
2. ['client', 'Databricks', 'funds', 'contract', 'received']
3. ['policy', 'client', 'account', 'called', 'told']
4. ['client', 'Databricks', 'account', 'states', 'advisor']
5. ['client', 'policy', 'received', '2023', 'told']

These outputs are lists of words without clear thematic context, making it difficult to derive actionable insights.

Is there a way to enhance TaskWeaver so that it uses the LLM, or invokes an agent or plugin, to post-process the code's output and generate the expected themes? Any suggestions or alternative approaches would be appreciated. Thank you!
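To make the request concrete, here is a minimal sketch of the kind of post-processing step I have in mind: turning the LDA word lists, plus a few raw complaints, into a prompt that an LLM could answer with named themes. The function name `build_theme_prompt` and the prompt wording are hypothetical, and how the prompt would actually reach the LLM (a TaskWeaver plugin, a direct API call, or something else) is deliberately left open, since I do not know the recommended way.

```python
def build_theme_prompt(topics, sample_complaints=(), max_samples=3):
    """Build a prompt asking an LLM to name each topic (hypothetical helper).

    topics: list of keyword lists, e.g. the `top_themes` value from the LDA code.
    sample_complaints: optional raw complaint texts to give the LLM context.
    """
    lines = [
        "You are analyzing customer complaints.",
        "Each topic below is a list of keywords produced by a topic model.",
        "Give each topic a short, descriptive theme name (2 to 4 words).",
        "",
    ]
    # One numbered line per topic, keywords joined by commas.
    for i, words in enumerate(topics, start=1):
        lines.append(f"Topic {i}: {', '.join(words)}")

    # Include at most max_samples complaints as extra context.
    samples = list(sample_complaints)[:max_samples]
    if samples:
        lines.append("")
        lines.append("Example complaints:")
        for text in samples:
            lines.append(f"- {text}")

    return "\n".join(lines)
```

With the output above, `build_theme_prompt(top_themes)` would produce one "Topic N: ..." line per word list. The LLM's reply to that prompt, rather than the raw word lists, would then be shown to the user.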
