Code Interpreter Generates Inadequate Themes for Complaints in CSV #465

Open
madhubandru opened this issue Feb 18, 2025 · 2 comments

madhubandru commented Feb 18, 2025

I encountered an issue with TaskWeaver's code interpreter when analyzing a CSV file containing customer complaints. After I uploaded the CSV and requested the top 5 themes of the complaints, the interpreter generated Python code that uses Latent Dirichlet Allocation (LDA) for topic modeling. However, the resulting themes were merely lists of words without meaningful context, not the coherent thematic summaries I expected.

Steps to Reproduce:

  1. Upload a CSV file with a column containing textual data of customer complaints.
  2. Ask TaskWeaver: "Provide the top 5 themes of the complaints."

Expected Behavior: The Large Language Model (LLM) should interpret the complaints and provide coherent thematic summaries, such as:

  1. Billing Issues
  2. Service Delays
  3. Product Quality Concerns
  4. Customer Support Complaints
  5. Account Access Problems

Actual Behavior: The code interpreter generated the following Python code:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Vectorize the complaint texts
vectorizer = CountVectorizer(stop_words='english')
X = vectorizer.fit_transform(df['COMPLAINT_SUM_TXT'])

# Apply LDA for topic modeling
lda = LatentDirichletAllocation(n_components=5, random_state=0)
lda.fit(X)

# Get the top words for each topic
def get_top_words(model, feature_names, n_top_words):
    top_words = []
    for topic_idx, topic in enumerate(model.components_):
        top_words.append([feature_names[i] for i in topic.argsort()[:-n_top_words - 1:-1]])
    return top_words

n_top_words = 5
feature_names = vectorizer.get_feature_names_out()
top_themes = get_top_words(lda, feature_names, n_top_words)
top_themes

This code produced the following output:

1. ['Databricks', 'client', 'account', 'told', 'form']
2. ['client', 'Databricks', 'funds', 'contract', 'received']
3. ['policy', 'client', 'account', 'called', 'told']
4. ['client', 'Databricks', 'account', 'states', 'advisor']
5. ['client', 'policy', 'received', '2023', 'told']

These outputs are lists of words without clear thematic context, making it difficult to derive actionable insights.

Is there a way to have TaskWeaver use the LLM, or invoke an agent or plugin, to post-process the code's output and generate the expected themes? Any suggestions or alternative approaches would be appreciated. Thank you!

Contributor

liqul commented Feb 19, 2025

I don't have data to reproduce this, but I can see a few different options:

  1. You can build a plugin that takes the CSV file as input, so you control the logic inside the plugin. This is more suitable if you consider this flow a common one and need to analyze many similar files. In particular, if you have a predefined list of themes, this is the best way to incorporate it into the flow.
  2. You can break the flow into two steps, e.g., first load the file and display the content, and then ask the agent to do the second step. This is fine if the file is small enough for the LLM to handle entirely in one prompt. It is also better suited to casual, one-off analysis.
  3. You can add an example to the framework to demonstrate the desired flow the agent should follow. This is in between the two above.
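
As a rough illustration of option 1 with a predefined theme list, the core logic could look something like the sketch below. This is not TaskWeaver API: the `llm` argument is a hypothetical callable that takes a prompt string and returns the model's reply, `THEMES`, `build_prompt`, `top_themes`, and `fake_llm` are made-up names for this example, and the plugin wiring around it is omitted. Each complaint is sent to the model for classification and the labels are then counted in plain Python.

```python
from collections import Counter
from typing import Callable, Iterable, List, Tuple

# Hypothetical predefined theme list (taken from the expected output in the
# issue); replace it with your own taxonomy.
THEMES = [
    "Billing Issues",
    "Service Delays",
    "Product Quality Concerns",
    "Customer Support Complaints",
    "Account Access Problems",
]
OTHER = "Other"  # bucket for replies that match no known theme


def build_prompt(complaint: str, themes: List[str]) -> str:
    """Build a single-label classification prompt for one complaint."""
    options = "\n".join(f"- {t}" for t in themes)
    return (
        "Assign the customer complaint below to exactly one theme.\n"
        f"Themes:\n{options}\n"
        f"Complaint: {complaint}\n"
        "Answer with the theme name only."
    )


def top_themes(
    complaints: Iterable[str],
    llm: Callable[[str], str],  # assumption: prompt in, reply text out
    themes: List[str] = THEMES,
    k: int = 5,
) -> List[Tuple[str, int]]:
    """Classify every complaint with the LLM and return the k most common themes."""
    counts: Counter = Counter()
    for text in complaints:
        answer = llm(build_prompt(text, themes)).strip()
        counts[answer if answer in themes else OTHER] += 1
    return counts.most_common(k)


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call: keyword match on the complaint text only."""
    complaint = prompt.split("Complaint: ", 1)[1].lower()
    return "Billing Issues" if "charge" in complaint else "Service Delays"


if __name__ == "__main__":
    sample = ["I was charged twice", "My order is late", "Wrong charge on invoice"]
    print(top_themes(sample, fake_llm))
    # [('Billing Issues', 2), ('Service Delays', 1)]
```

Calling the model once per row keeps each prompt small, at the cost of many calls on a large file; batching several complaints per prompt would be a reasonable variation. With a real dataset, `complaints` could simply be the text column of the loaded DataFrame.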

Author

madhubandru commented Feb 19, 2025

@liqul Thank you for your response. The themes question is just one example; for any analytics question, TaskWeaver tries to solve it with code-interpreter code.

When I upload the same CSV to Copilot and ask the same questions, it responds as expected. Is there a way to bring that functionality into TaskWeaver?
