sphinxbio/sliceanddice

A small experiment that uses LLMs to analyze and isolate data boundaries within messy spreadsheets.

🔗 Try it here   •   🐦 @sphinx_bio   •   😼 Sphinx Bio

Slice & Dice is an experiment in using AI to extract structured data from unstructured spreadsheets. Upload an Excel or CSV file and watch the AI attempt to slice the sheet into multiple regions, each holding its own dataset.

This is a work in progress, and everyone is welcome to contribute!


🔗 Try it here

Try out the live demo on our website.

⚠️ Please don't upload any personal or private data in the demo! Your files will be visible to AI providers and our analytics platform ⚠️

Feel free to reach out about self-hosted or enterprise versions.

👀 Discussion

Currently, the app sends the CSV/Excel data as text to a series of fast Llama3 or Mixtral prompts: some ask for boundaries, others check the answers for correctness. Previous attempts using LangChain agents, the OpenAI Assistants API, and a combination of Claude and OpenAI yielded fairly unimpressive results, and were very expensive and slow as well. Although Llama3 and Mixtral are very fast and affordable (especially through Groq) for prototype development and iteration, they present several challenges and shortcomings:

  • No function calling or JSON mode (sometimes the results are not well-formatted)
  • Small context window (large datasets won't work)
  • Lack of planning, reasoning, and agent-like decision-making (all models are incredibly error-prone and generally poor at handling spreadsheets)
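Because the models have no JSON mode, any reply has to be parsed defensively before it can be trusted. The sketch below is not the app's actual code: it assumes a hypothetical reply format (a JSON array of regions with `top`, `left`, `bottom`, `right` keys, zero-indexed and inclusive) and shows one way to render a sheet as text and then discard malformed or out-of-range regions.

```python
import json
import re


def sheet_to_prompt_text(rows):
    """Render a grid (list of lists) as numbered, pipe-separated lines."""
    return "\n".join(
        f"{i}: " + " | ".join("" if cell is None else str(cell) for cell in row)
        for i, row in enumerate(rows)
    )


def parse_boundaries(reply, n_rows, n_cols):
    """Extract regions from a free-form model reply.

    Assumed (hypothetical) format: a JSON array of objects with integer
    keys top/left/bottom/right. Anything unparseable or outside the
    sheet is dropped rather than raising.
    """
    # Grab everything from the first '[' to the last ']' in the reply.
    match = re.search(r"\[.*\]", reply, re.DOTALL)
    if match is None:
        return []
    try:
        candidates = json.loads(match.group(0))
    except json.JSONDecodeError:
        return []
    if not isinstance(candidates, list):
        return []

    valid = []
    for region in candidates:
        if not isinstance(region, dict):
            continue
        try:
            top, left, bottom, right = (
                int(region[key]) for key in ("top", "left", "bottom", "right")
            )
        except (KeyError, TypeError, ValueError):
            continue
        # Keep only rectangles that lie fully inside the sheet.
        if 0 <= top <= bottom < n_rows and 0 <= left <= right < n_cols:
            valid.append(
                {
                    "name": str(region.get("name", "")),
                    "top": top,
                    "left": left,
                    "bottom": bottom,
                    "right": right,
                }
            )
    return valid
```

A bounds check like this only catches structurally impossible answers; whether a region is semantically right still needs a second prompt or a human.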

We wrote up a blog post about the entire process.

🎉 Features & Roadmap

Slice & Dice is a small experiment and playground for manipulating spreadsheets. We'd love to add more features like:

  • Manually create slices: Create slices by hand, and an LLM will add context such as names and descriptions.
  • Manipulate sheets & slices: Add a way to Join & Concatenate datasets and slices to manipulate, merge, or generate new tabular datasets from sheets & slices.
  • Download tabular slices: Create & download well-formed, tabular slices of data.
  • Explore alternate UIs: Add "Chat with sheets" and other UI modes to progressively get and edit the slices you need.
  • Excel macro or Google Apps Script: A few people have asked for this to be built into Excel.
  • Handle more messy use cases: Many use cases will break the tool; add better slicing for more kinds of data. More use cases can be found under src/lib/samples/dirtysheets.
  • UI improvements: A clear/reset button and other UI improvements would make the demo more than a toy.
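As a rough illustration of the "manipulate" and "download" roadmap items above, here is a minimal sketch of turning a slice into tabular records, concatenating two slices, and exporting CSV. None of this exists in the repository; the function names and the convention that a slice's first row holds its headers are assumptions made for the example.

```python
import csv
import io


def slice_to_records(grid, top, left, bottom, right):
    """Cut an inclusive rectangle out of the grid and return a list of
    dicts, treating the first row of the rectangle as column headers."""
    region = [row[left:right + 1] for row in grid[top:bottom + 1]]
    if not region:
        return []
    headers = [str(h).strip() for h in region[0]]
    return [dict(zip(headers, row)) for row in region[1:]]


def concat_records(*tables):
    """Stack tables row-wise; columns missing from a table become None."""
    columns = []
    for table in tables:
        for record in table:
            for key in record:
                if key not in columns:
                    columns.append(key)
    return [
        {col: record.get(col) for col in columns}
        for table in tables
        for record in table
    ]


def records_to_csv(records):
    """Serialize records to a CSV string (None is written as empty)."""
    if not records:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(records[0]), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Example: a sheet with a stray note above the real table.
grid = [
    ["notes", None, None],
    ["id", "value", None],
    [1, 0.5, None],
    [2, 0.7, None],
]
first = slice_to_records(grid, top=1, left=0, bottom=3, right=1)
merged = concat_records(first, [{"id": 3, "unit": "mg"}])
csv_text = records_to_csv(merged)
```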

We'd also like to work with the community to come up with better ways to overcome some of the severe limitations models have when handling spreadsheets. These might include methods to identify the data type (e.g. what kind of "messed up" the data is) and sub-tasks (e.g. given that we know a spreadsheet has a control and an experiment arm, how can that be used to prompt a model?). There are lots of variations and possibilities, and we'd like to explore them with you!
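One cheap, non-LLM hint of this kind is to pre-segment a sheet on fully blank rows and pass the resulting row bands to the model as context. This is only a hypothetical heuristic offered as a starting point, not something the project does today, and it will miss tables that sit side by side or contain blank rows of their own.

```python
def is_blank(cell):
    """A cell counts as blank if it is None or whitespace-only."""
    return cell is None or str(cell).strip() == ""


def row_bands(grid):
    """Return (first_row, last_row) for each run of consecutive rows
    that contain at least one non-blank cell."""
    bands = []
    start = None
    for i, row in enumerate(grid):
        empty = all(is_blank(cell) for cell in row)
        if not empty and start is None:
            start = i  # a new band begins
        elif empty and start is not None:
            bands.append((start, i - 1))  # the band ended on the previous row
            start = None
    if start is not None:
        bands.append((start, len(grid) - 1))
    return bands


def bands_to_hint(bands):
    """Format bands as a short text hint that could be added to a prompt."""
    return "\n".join(
        f"rows {first}-{last} may form one dataset" for first, last in bands
    )
```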

🚀 Tech Stack

Note that we use PostHog for usage analytics on the demo site.

Supporters

Many thanks to Sphinx Bio for sponsoring this project. If you'd like to work on this and other cool projects, consider joining us!

License

This project is licensed under the terms of the Apache 2.0 license.

Contributing

Nothing in life is certain except Death, Taxes, and Really Messy Spreadsheets. We're excited to permanently remove the spreadsheet problem! We'd love your contributions, so please submit ideas and bug reports on GitHub!

Acknowledgements

This is a Sphinx Bio project! If you're interested in a hosted solution for your lab, please reach out.

Thanks to Harrison and the LangChain team for their help as well!