
Support for the web search connector #2

Closed
simonw opened this issue Apr 4, 2024 · 6 comments
Labels: enhancement (New feature or request)



simonw commented Apr 4, 2024

If you add this to the API call:

diff --git a/llm_command_r.py b/llm_command_r.py
index 7a334cd..e49c599 100644
--- a/llm_command_r.py
+++ b/llm_command_r.py
@@ -43,6 +43,8 @@ class CohereMessages(llm.Model):
         if conversation:
             kwargs["chat_history"] = self.build_chat_history(conversation)
 
+        kwargs["connectors"] = [{"id": "web-search"}]
+
         if stream:
             for event in client.chat_stream(**kwargs):
                 if event.event_type == "text-generation":

Cohere will run a web search, use the results to answer the question and include citations in the returned JSON!
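
For the plugin itself this would be better exposed as an opt-in option than hardcoded for every prompt. Here is a minimal sketch using llm's pydantic-based Options class; the option name websearch is an assumption for illustration, not the plugin's confirmed API:

from typing import Optional

import llm
from pydantic import Field


class CohereMessages(llm.Model):
    model_id = "command-r"

    class Options(llm.Options):
        # Hypothetical option name, chosen for this sketch
        websearch: Optional[bool] = Field(
            description="Use Cohere's web-search connector to ground the answer",
            default=None,
        )

    def execute(self, prompt, stream, response, conversation):
        kwargs = {"message": prompt.prompt}
        if prompt.options.websearch:
            # Attach the connector only when the user asks for it
            kwargs["connectors"] = [{"id": "web-search"}]
        # ... rest of the existing execute() logic unchanged ...

That would make the feature available as: llm -m command-r 'question' -o websearch 1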


simonw commented Apr 4, 2024

Here's the JSON that was logged for this:

llm -m command-r-plus 'who is simon willison and what is Datasette?'

Response (with search):

Simon Willison is a British programmer, co-founder of the social conference directory Lanyrd, and co-creator of the Django Web framework. He is also the creator of Datasette, an open-source tool for exploring and publishing data. Datasette helps users take data of any shape or size and publish it as an interactive, explorable website and accompanying API.
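
The citations and source documents arrive as separate streaming events, not as part of the generated text. Here is a sketch of how they can be collected, assuming the Cohere Python SDK's chat stream event types (text-generation, citation-generation, stream-end); check these names against your SDK version:

import cohere

client = cohere.Client()  # reads the CO_API_KEY environment variable

text, citations, documents = [], [], []
for event in client.chat_stream(
    message="who is simon willison and what is Datasette?",
    model="command-r-plus",
    connectors=[{"id": "web-search"}],
):
    if event.event_type == "text-generation":
        text.append(event.text)
    elif event.event_type == "citation-generation":
        citations.extend(event.citations)
    elif event.event_type == "stream-end":
        # The final event carries the complete response, including the
        # documents the citations' document_ids refer back to
        documents = event.response.documents or []

print("".join(text))
for c in citations:
    print(c.start, c.end, c.text, c.document_ids)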

JSON:

{
    "text": "Simon Willison is a British programmer, co-founder of the social conference directory Lanyrd, and co-creator of the Django Web framework. He is also the creator of Datasette, an open-source tool for exploring and publishing data. Datasette helps users take data of any shape or size and publish it as an interactive, explorable website and accompanying API.",
    "generation_id": "7aea92ce-bcfb-418a-8dc5-9d6f7e77c089",
    "citations": [
        "ChatCitation(start=20, end=38, text='British programmer', document_ids=['web-search_12'])",
        "ChatCitation(start=40, end=92, text='co-founder of the social conference directory Lanyrd', document_ids=['web-search_12', 'web-search_14'])",
        "ChatCitation(start=98, end=137, text='co-creator of the Django Web framework.', document_ids=['web-search_12', 'web-search_14', 'web-search_18'])",
        "ChatCitation(start=153, end=173, text='creator of Datasette', document_ids=['web-search_14', 'web-search_18'])",
        "ChatCitation(start=178, end=189, text='open-source', document_ids=['web-search_0', 'web-search_2', 'web-search_5', 'web-search_6', 'web-search_7', 'web-search_9', 'web-search_14', 'web-search_15'])",
        "ChatCitation(start=190, end=229, text='tool for exploring and publishing data.', document_ids=['web-search_0', 'web-search_2', 'web-search_3', 'web-search_6', 'web-search_7', 'web-search_9', 'web-search_14'])",
        "ChatCitation(start=252, end=282, text='take data of any shape or size', document_ids=['web-search_0', 'web-search_2', 'web-search_3', 'web-search_7', 'web-search_9'])",
        "ChatCitation(start=287, end=357, text='publish it as an interactive, explorable website and accompanying API.', document_ids=['web-search_0', 'web-search_2', 'web-search_3', 'web-search_7', 'web-search_9'])"
    ],
    "documents": [
        {
            "id": "web-search_12",
            "snippet": "Simon Willison is a British programmer, co-founder of the social conference directory Lanyrd, and co-creator of the Django Web framework.\n\nSimon started his professional web development in 2000 as a web master and developer for the UK based website Gameplay, where he was instrumental in creating File Monster, a large games related file download site. In 2001 he left to attend the University of Bath. Whilst studying, he worked part-time for Incutio where he developed the Incutio XML-RPC Library, a popular XML-RPC library for PHP (used in WordPress and Drupal). During this time Simon started his web development blog. In developing the software for his blog, Simon built one of the first implementations of pingback. Through his blog he was an early adopter and evangelist of OpenID.\n\nIn 2003\u20132004, whilst working at the Lawrence Journal-World during an industrial placement year, he and other web developers (Adrian Holovaty, Jacob Kaplan-Moss and Wilson Miner) created Django, an open source web application framework for Python.\n\nAfter graduating in 2005, Simon worked on Yahoo!'s Technology Development team and on very early versions of the Fire Eagle Internet geolocation service. After Yahoo! he worked as a consultant on OpenID and web development in various publishing and media companies. Willison was hired in 2008 by the UK newspaper The Guardian to work as a software architect.\n\nIn late 2010, he launched the social conference directory Lanyrd with his wife and co-founder, Natalie Downe. They received funding from Y Combinator in early 2011. In 2013, Lanyrd was acquired by Eventbrite with Simon and Natalie joining the Eventbrite engineering team in San Francisco.",
            "timestamp": "2024-04-02T07:17:12",
            "title": "Simon Willison - Wikipedia",
            "url": "https://en.wikipedia.org/wiki/Simon_Willison"
        },
        {
            "id": "web-search_14",
            "snippet": "Simon Willison\u2019s Weblog Subscribe\n\nHere's my most recent conference bio:\n\nSimon Willison is the creator of Datasette, an open source tool for exploring and publishing data. He currently works full-time building open source tools for data journalism, built around Datasette and SQLite.\n\nPrior to becoming an independent open source developer, Simon was an engineering director at Eventbrite. Simon joined Eventbrite through their acquisition of Lanyrd, a Y Combinator funded company he co-founded in 2010.\n\nHe is a co-creator of the Django Web Framework, and has been blogging about web development and programming since 2002 at simonwillison.net\n\nYou can subscribe to my blog via email newsletter, using Atom feeds or by following me on Mastodon or Twitter.\n\nI send out a newsletter version of this blog once every week or so. You can subscribe to that here:\n\nMastodon and Twitter\n\n@simon@simonwillison.net on Mastodon\n\nThe main feed for my site combines my blog entries, my blogmarks and my collected quotations:\n\nhttps://simonwillison.net/atom/everything/\n\nIf you just want my longer form blog entries you can subscribe to this feed instead:\n\nhttps://simonwillison.net/atom/entries/\n\nEvery tag on my blog has its own feed. You can subscribe to those by adding .atom to the URL to the tag page.\n\nFor example, to subscribe to just my content about Datasette, use the following:\n\nhttps://simonwillison.net/tags/datasette.atom",
            "timestamp": "2024-02-14T13:37:33",
            "title": "About Simon Willison",
            "url": "https://simonwillison.net/about/"
        },
        {
            "id": "web-search_18",
            "snippet": "Creator of @datasetteproj, co-creator Django. Fellow at @JSKstanford. Collector of @nichemuseums. Usually hanging out with @natbat and @cleopaws. He/Him\n\nCreator of @datasetteproj, co-creator Django. Fellow at @JSKstanford. Collector of @nichemuseums. Usually hanging out with @natbat and @cleopaws. He/Him\n\nSimon Willison\u2019s Newsletter\n\nAI, LLMs, web engineering, open source, data science, Datasette, SQLite, Python and more\n\nDatasette Newsletter\n\nDive into your interests\n\nWe'll recommend top publications based on the topics you select.",
            "timestamp": "2024-02-14T08:11:58",
            "title": "Simon Willison | Substack",
            "url": "https://substack.com/@simonw"
        },
        {
            "id": "web-search_0",
            "snippet": "An open source multi-tool for exploring and publishing data\n\nDatasette is a tool for exploring and publishing data. It helps people take data of any shape or size and publish that as an interactive, explorable website and accompanying API.\n\nDatasette is aimed at data journalists, museum curators, archivists, local governments, scientists, researchers and anyone else who has data that they wish to share with the world.\n\nExplore a demo, watch a video about the project or try it out by uploading and publishing your own CSV data.\n\ndatasette.io is the official project website\n\nLatest Datasette News\n\nComprehensive documentation: https://docs.datasette.io/\n\nExamples: https://datasette.io/examples\n\nLive demo of current main branch: https://latest.datasette.io/\n\nQuestions, feedback or want to talk about the project? Join our Discord\n\nWant to stay up-to-date with the project? Subscribe to the Datasette newsletter for tips, tricks and news on what's new in the Datasette ecosystem.\n\nIf you are on a Mac, Homebrew is the easiest way to install Datasette:\n\nbrew install datasette\n\nYou can also install it using pip or pipx:\n\npip install datasette\n\nDatasette requires Python 3.8 or higher. We also have detailed installation instructions covering other options such as Docker.\n\ndatasette serve path/to/database.db\n\nThis will start a web server on port 8001 - visit http://localhost:8001/ to access the web interface.\n\nserve is the default subcommand, you can omit it if you like.\n\nUse Chrome on OS X? You can run datasette against your browser history like so:\n\ndatasette ~/Library/Application\\ Support/Google/Chrome/Default/History --nolock\n\nNow visiting http://localhost:8001/History/downloads will show you a web interface to browse your downloads data:\n\nIf you want to include licensing and source information in the generated datasette website you can do so using a JSON file that looks something like this:\n\n{ \"title\": \"Five Thirty Eight\", \"license\": \"CC Attribution 4.0 License\", \"license_url\": \"http://creativecommons.org/licenses/by/4.0/\", \"source\": \"fivethirtyeight/data on GitHub\", \"source_url\": \"https://github.com/fivethirtyeight/data\" }\n\nSave this in metadata.json and run Datasette like so:\n\ndatasette serve fivethirtyeight.db -m metadata.json\n\nThe license and source information will be displayed on the index page and in the footer. They will also be included in the JSON produced by the API.\n\nIf you have Heroku or Google Cloud Run configured, Datasette can deploy one or more SQLite databases to the internet with a single command:\n\ndatasette publish heroku database.db\n\ndatasette publish cloudrun database.db\n\nThis will create a docker image containing both the datasette application and the specified SQLite database files. It will then deploy that image to Heroku or Cloud Run and give you a URL to access the resulting website and API.\n\nSee Publishing data in the documentation for more details.\n\nDatasette Lite is Datasette packaged using WebAssembly so that it runs entirely in your browser, no Python web application server required. Read more about that in the Datasette Lite documentation.",
            "timestamp": "2024-04-02T03:48:33",
            "title": "GitHub - simonw/datasette: An open source multi-tool for exploring and publishing data",
            "url": "https://github.com/simonw/datasette"
        },
        {
            "id": "web-search_2",
            "snippet": "Hide navigation sidebar\n\nHide table of contents sidebar\n\nToggle site navigation sidebar\n\nDatasette documentation\n\nToggle Light / Dark / Auto color theme\n\nToggle table of contents sidebar\n\nThe Datasette Ecosystem\n\nPages and API endpoints\n\nAuthentication and permissions\n\nPerformance and caching\n\nCustom pages and templates\n\nInternals for plugins\n\nToggle Light / Dark / Auto color theme\n\nToggle table of contents sidebar\n\nAn open source multi-tool for exploring and publishing data\n\nDatasette is a tool for exploring and publishing data. It helps people take data of any shape or size and publish that as an interactive, explorable website and accompanying API.\n\nDatasette is aimed at data journalists, museum curators, archivists, local governments and anyone else who has data that they wish to share with the world. It is part of a wider ecosystem of tools and plugins dedicated to making working with structured data as productive as possible.\n\nExplore a demo, watch a presentation about the project or Try Datasette without installing anything using Glitch.\n\nInterested in learning Datasette? Start with the official tutorials.\n\nSupport questions, feedback? Join our GitHub Discussions forum.\n\nPlay with a live demo\n\nDatasette in your browser with Datasette Lite\n\nTry Datasette without installing anything using Glitch\n\nUsing Datasette on your own computer\n\nDatasette Desktop for Mac\n\nAdvanced installation options\n\nA note about extensions\n\nThe Datasette Ecosystem\n\ndatasette serve --help-settings\n\ndatasette publish cloudrun\n\ndatasette publish heroku\n\nPages and API endpoints\n\nPublishing to Google Cloud Run\n\nPublishing to Heroku\n\nPublishing to Vercel\n\nCustom metadata and plugins\n\nDeployment fundamentals\n\nRunning Datasette using systemd\n\nRunning Datasette using OpenRC\n\nDeploying using buildpacks\n\nRunning Datasette behind a proxy\n\nNginx proxy configuration\n\nApache proxy configuration\n\nSpecial JSON arguments\n\nColumn filter arguments\n\nSpecial table arguments\n\nExpanding foreign key references\n\nDiscovering the JSON for a page\n\nCanned query parameters\n\nAdditional canned query options\n\nWritable canned queries\n\nJSON API for writable canned queries\n\nCross-database queries\n\nAuthentication and permissions\n\nUsing the \"root\" actor\n\nDefining permissions with \"allow\" blocks\n\nThe /-/allow-debug tool\n\nConfiguring permissions in metadata.json\n\nControlling access to an instance\n\nControlling access to specific databases\n\nControlling access to specific tables and views\n\nControlling access to specific canned queries\n\nControlling the ability to execute arbitrary SQL\n\nChecking permissions in plugins\n\nactor_matches_allow()\n\nThe permissions debug tool\n\nIncluding an expiry time\n\nBuilt-in permissions\n\nview-database-download\n\nPerformance and caching\n\nUsing \"datasette inspect\"\n\ndatasette-hashed-urls\n\nStreaming all records\n\nLinking to binary downloads\n\nFacets in query strings\n\nFacets in metadata.json\n\nSpeeding up facets with indexes\n\nThe table page and table view API\n\nAdvanced SQLite search queries\n\nConfiguring full-text search for a table or view\n\nSearches using custom SQL\n\nEnabling full-text search for a SQLite table\n\nConfiguring FTS using sqlite-utils\n\nConfiguring FTS using csvs-to-sqlite\n\nConfiguring FTS by hand\n\nInstalling SpatiaLite on OS X\n\nInstalling SpatiaLite on Linux\n\nSpatial indexing latitude/longitude columns\n\nMaking use of a 
spatial index\n\nImporting shapefiles into SpatiaLite\n\nImporting GeoJSON polygons using Shapely\n\nQuerying polygons using within()\n\nPer-database and per-table metadata\n\nSource, license and about\n\nSpecifying units for a column\n\nSetting a default sort order\n\nSetting a custom page size\n\nSetting which columns can be used for sorting\n\nSpecifying the label column for a table\n\nUsing YAML for metadata\n\nConfiguration directory mode\n\nfacet_suggest_time_limit_ms\n\nConfiguring the secret\n\nUsing secrets with datasette publish\n\nCustom pages and templates\n\nCustom CSS and JavaScript\n\nCSS classes on the <body>\n\nServing static files\n\nPublishing static assets\n\nPath parameters for pages\n\nCustom headers and status codes\n\nOne-off plugins using --plugins-dir\n\nDeploying plugins using datasette publish\n\nSeeing what plugins are installed\n\nPlugin configuration\n\nSecret configuration values\n\nWriting one-off plugins\n\nStarting an installable plugin using cookiecutter\n\nWriting plugins that accept configuration\n\nDesigning URLs for your plugin\n\nBuilding URLs within plugins\n\nprepare_connection(conn, database, datasette)\n\nprepare_jinja2_environment(env, datasette)\n\nextra_template_vars(template, database, table, columns, view_name, request, datasette)\n\nextra_css_urls(template, database, table, columns, view_name, request, datasette)\n\nextra_js_urls(template, database, table, columns, view_name, request, datasette)\n\nextra_body_script(template, database, table, columns, view_name, request, datasette)\n\npublish_subcommand(publish)\n\nrender_cell(row, value, column, table, database, datasette)\n\nregister_output_renderer(datasette)\n\nregister_routes(datasette)\n\nregister_commands(cli)\n\nregister_facet_classes()\n\nasgi_wrapper(datasette)\n\ncanned_queries(datasette, database, actor)\n\nactor_from_request(datasette, request)\n\nfilters_from_request(request, database, table, datasette)\n\npermission_allowed(datasette, actor, action, resource)\n\nregister_magic_parameters(datasette)\n\nforbidden(datasette, request, message)\n\nhandle_exception(datasette, request, exception)\n\nmenu_links(datasette, actor, request)\n\ntable_actions(datasette, actor, database, table, request)\n\ndatabase_actions(datasette, actor, database, request)\n\nskip_csrf(datasette, scope)\n\nget_metadata(datasette, key, database, table)\n\nSetting up a Datasette test instance\n\nUsing pdb for errors thrown inside Datasette\n\nUsing pytest fixtures\n\nTesting outbound HTTP calls with pytest-httpx\n\nRegistering a plugin for the duration of a test\n\nInternals for plugins\n\nThe MultiParams class\n\nReturning a response with .asgi_send(send)\n\nSetting cookies with response.set_cookie()\n\n.plugin_config(plugin_name, database=None, table=None)\n\nawait .render_template(template, context=None, request=None)\n\nawait .permission_allowed(actor, action, resource=None, default=False)\n\nawait .ensure_permissions(actor, permissions)\n\nawait .check_visibility(actor, action=None, resource=None, permissions=None)\n\n.add_database(db, name=None, route=None)\n\n.add_memory_database(name)\n\n.remove_database(name)\n\n.sign(value, namespace=\"default\")\n\n.unsign(value, namespace=\"default\")\n\n.add_message(request, message, type=datasette.INFO)\n\n.absolute_url(request, path)\n\nDatabase(ds, path=None, is_mutable=True, is_memory=False, memory_name=None)\n\nawait db.execute(sql, ...)\n\nawait db.execute_fn(fn)\n\nawait db.execute_write(sql, params=None, block=True)\n\nawait 
db.execute_write_script(sql, block=True)\n\nawait db.execute_write_many(sql, params_seq, block=True)\n\nawait db.execute_write_fn(fn, block=True)\n\nDatabase introspection\n\nThe _internal database\n\nThe datasette.utils module\n\nparse_metadata(content)\n\nawait_me_maybe(value)\n\nSetting up a development environment\n\nEditing and building the documentation\n\nContinuously deployed demo instances\n\nAlpha and beta releases\n\nReleasing bug fixes from a branch\n\nUpgrading CodeMirror\n\nPlugin hooks and internals\n\nPlugins and internals\n\nBug fixes and other improvements\n\nThe _internal database\n\nNamed in-memory database support\n\nCode formatting with Black and Prettier\n\nPlugins can now add links within Datasette\n\nRunning Datasette behind a proxy\n\nMagic parameters for canned queries\n\nBetter plugin documentation\n\nWritable canned queries\n\nSigned values and secrets\n\nregister_routes() plugin hooks\n\nThe road to Datasette 1.0\n\nNew plugin hook: asgi_wrapper\n\nNew plugin hook: extra_template_vars\n\nSecret plugin configuration options\n\nEasier custom templates for table rows\n\n?_through= for joins through many-to-many tables\n\nSupporting databases that change\n\nFaceting improvements, and faceting plugins\n\ndatasette publish cloudrun\n\nregister_output_renderer plugins\n\nForeign key expansions\n\nNew configuration settings\n\nControl HTTP caching with ?_ttl=\n\nImproved support for SpatiaLite",
            "timestamp": "2024-03-11T01:07:56",
            "title": "Datasette documentation",
            "url": "https://docs.datasette.io/en/stable/"
        },
        {
            "id": "web-search_5",
            "snippet": "Application and Data\n\nDatasette vs TablePlus\n\nNeed advice about which tool to choose?Ask the StackShare community!Get Advice\n\nDatasette vs TablePlus: What are the differences?\n\nDatasette: An instant JSON API for your SQLite databases. Provides an instant, read-only JSON API for any SQLite database. It also provides tools for packaging the database up as a Docker container and deploying that container to hosting providers; TablePlus: Easily edit database data and structure. TablePlus is a native app which helps you easily edit database data and structure. TablePlus includes many security features to protect your database, including native libssh and TLS to encrypt your connection.\n\nDatasette and TablePlus belong to \"Database Tools\" category of the tech stack.\n\nDatasette is an open source tool with 2.53K GitHub stars and 146 GitHub forks. Here's a link to Datasette's open source repository on GitHub.\n\nGet Advice from developers at your company using StackShare Enterprise. Sign up for StackShare Enterprise.Learn More\n\nBe the first to leave a pro\n\n5Great tool, sleek UI, run fast and secure connections\n\n2Perfect for develop use\n\nSign up to add or upvote prosMake informed product decisionsSign up now\n\n- No public GitHub repository available -\n\nProvides an instant, read-only JSON API for any SQLite database. It also provides tools for packaging the database up as a Docker container and deploying that container to hosting providers.\n\nTablePlus is a native app which helps you easily edit database data and structure. TablePlus includes many security features to protect your database, including native libssh and TLS to encrypt your connection.\n\nNeed advice about which tool to choose?Ask the StackShare community!Get Advice\n\nWhat companies use Datasette?\n\nWhat companies use TablePlus?\n\nWhat companies use Datasette?\n\nWhat companies use TablePlus?\n\nSmart Campus Management Center, Chiang Mai University\n\nSee which teams inside your own company are using Datasette or TablePlus. Sign up for StackShare EnterpriseLearn More\n\nSign up to get full access to all the companiesMake informed product decisionsSign up now\n\nWhat tools integrate with Datasette?\n\nWhat tools integrate with TablePlus?\n\nWhat tools integrate with Datasette?\n\nWhat tools integrate with TablePlus?\n\nMicrosoft SQL Server\n\nSign up to get full access to all the tool integrationsMake informed product decisionsSign up now\n\nWhat are some alternatives to Datasette and TablePlus?\n\nIt is a modern database query and access library for Scala. It allows you to work with stored data almost as if you were using Scala collections while at the same time giving you full control over when a database access happens and which data is transferred.\n\nIt makes it easy to use data access technologies, relational and non-relational databases, map-reduce frameworks, and cloud-based data services. This is an umbrella project which contains many subprojects that are specific to a given database.\n\nA cross-platform IDE that is aimed at DBAs and developers working with SQL databases.\n\nIt is a free multi-platform database tool for developers, SQL programmers, database administrators and analysts. Supports all popular databases: MySQL, PostgreSQL, SQLite, Oracle, DB2, SQL Server, Sybase, Teradata, MongoDB, Cassandra, Redis, etc.\n\nMicrosoft SQL Server Management Studio\n\nIt is an integrated environment for managing any SQL infrastructure, from SQL Server to Azure SQL Database. 
It provides tools to configure, monitor, and administer instances of SQL Server and databases. Use it to deploy, monitor, and upgrade the data-tier components used by your applications, as well as build queries and scripts. See all alternatives\n\nSee all the technologies you\u2019re using across your company. Sign up for StackShare EnterpriseLearn More\n\nRelated ComparisonsPostico vs TablePlus vs graphqurlDatasette vs GraphiQLDatasette vs xmysqlDatasette vs migraTablePlus vs xmysql\n\nTrending ComparisonsDjango vs Laravel vs Node.jsBootstrap vs Foundation vs Material-UINode.js vs Spring BootFlyway vs LiquibaseAWS CodeCommit vs Bitbucket vs GitHub\n\nTop ComparisonsBitbucket vs GitHub vs GitLabBootstrap vs MaterializeHipChat vs Mattermost vs SlackPostman vs Swagger UI",
            "timestamp": "2024-01-05T16:05:45",
            "title": "Datasette vs TablePlus | What are the differences?",
            "url": "https://stackshare.io/stackups/datasette-vs-tableplus"
        },
        {
            "id": "web-search_6",
            "snippet": "Simon Willison\u2019s Weblog Subscribe\n\nThe interesting ideas in Datasette\n\nDatasette (previously) is my open source tool for exploring and publishing structured data. There are a lot of ideas embedded in Datasette. I realized that I haven\u2019t put many of them into writing.\n\nPublishing read-only data Bundling the data with the code SQLite as the underlying data engine Far-future cache expiration Publishing as a core feature License and source metadata Facet everything Respect for CSV SQL as an API language Optimistic query execution with time limits Keyset pagination Interactive demos based on the unit tests Documentation unit tests\n\nPublishing read-only data\n\nDatasette provides a read-only API to your data. It makes no attempt to deal with writes. Avoiding writes entirely is fundamental to a plethora of interesting properties, many of which are expanded on further below. In brief:\n\nHosting web applications with no read/write persistence requirements is incredibly cheap in 2018\u2014often free (both ZEIT Now and a Heroku have generous free tiers). This is a big deal: even having to pay a few dollars a month is enough to dicentivise sharing data, since now you have to figure out who will pay and ensure the payments don\u2019t expire in the future.\n\nBeing read-only makes it trivial to scale: just add more instances, each with their own copy of the data. All of the hard problems in scaling web applications that relate to writable data stores can be skipped entirely.\n\nSince the database file is opened using SQLite\u2019s immutable mode, we can accept arbitrary SQL queries with no risk of them corrupting the data.\n\nAny time your data changes, you need to publish a brand new copy of the whole database. With the right hosting this is easy: deploy a brand new copy of your data and application in parallel to your existing live deployment, then switch over incoming HTTP traffic to your API at the load balancer level. Heroku and Zeit Now both support this strategy out of the box.\n\nBundling the data with the code\n\nSince the data is read-only and is encapsulated in a single binary SQLite database file, we can bundle the data as part of the app. This means we can trivially create and publish Docker images that provide both the data and the API and UI for accessing it. We can also publish to any hosting provider that will allow us to run a Python application, without also needing to provision a mutable database.\n\nThe datasette package command takes one or more SQLite databases and bundles them together with the Datasette application in a single Docker image, ready to be deployed anywhere that can run Docker containers.\n\nSQLite as the underlying data engine\n\nDatasette encourages people to use SQLite as a standard format for publishing data.\n\nRelational database are great: once you know how to use them, you can represent any data you can imagine using a carefully designed schema.\n\nWhat about data that\u2019s too unstructured to fit a relational schema? SQLite includes excellent support for JSON data\u2014so if you can\u2019t shape your data to fit a table schema you can instead store it as text blobs of JSON\u2014and use SQLite\u2019s JSON functions to filter by or extract specific fields.\n\nWhat about binary data? Even that\u2019s covered: SQLite will happily store binary blobs. My datasette-render-images plugin (live demo here) is one example of a tool that works with binary image data stored in SQLite blobs.\n\nWhat if my data is too big? 
Datasette is not a \u201cbig data\u201d tool, but if your definition of big data is something that won\u2019t fit in RAM that threshold is growing all the time (2TB of RAM on a single AWS instance now costs less than $4/hour).\n\nI\u2019ve personally had great results from multiple GB SQLite databases and Datasette. The theoretical maximum size of a single SQLite database is around 140TB.\n\nSQLite also has built-in support for surprisingly good full-text search, and thanks to being extensible via modules has excellent geospatial functionality in the form of the SpatiaLite extension. Datasette benefits enormously from this wider ecosystem.\n\nThe reason most developers avoid SQLite for production web applications is that it doesn\u2019t deal brilliantly with large volumes of concurrent writes. Since Datasette is read-only we can entirely ignore this limitation.\n\nFar-future cache expiration\n\nSince the data in a Datasette instance never changes, why not cache calls to it forever?\n\nDatasette sends a far future HTTP cache expiry header with every API response. This means that browsers will only ever fetch data the first time a specific URL is accessed, and if you host Datasette behind a CDN such as Fastly or Cloudflare each unique API call will hit Datasette just once and then be cached essentially forever by the CDN.\n\nThis means it\u2019s safe to deploy a JavaScript app using an inexpensively hosted Datasette-backed API to the front page of even a high traffic site\u2014the CDN will easily take the load.\n\nZeit added Cloudflare to every deployment (even their free tier) back in July, so if you are hosted there you get this CDN benefit for free.\n\nWhat if you re-publish an updated copy of your data? Datasette has that covered too. You may have noticed that every Datasette database gets a hashed suffix automatically when it is deployed:\n\nhttps://fivethirtyeight.datasettes.com/fivethirtyeight-c9e67c4\n\nThis suffix is based on the SHA256 hash of the entire database file contents\u2014so any change to the data will result in new URLs. If you query a previous suffix Datasette will notice and redirect you to the new one.\n\nIf you know you\u2019ll be changing your data, you can build your application against the non-suffixed URL. This will not be cached and will always 302 redirect to the correct version (and these redirects are extremely fast).\n\nhttps://fivethirtyeight.datasettes.com/fivethirtyeight/alcohol-consumption/drinks.json\n\nThe redirect sends an HTTP/2 push header such that if you are running behind a CDN that understands push (such as Cloudflare) your browser won\u2019t have to make two requests to follow the redirect. You can use the Chrome DevTools to see this in action:\n\nAnd finally, if you need to opt out of HTTP caching for some reason you can disable it on a per-request basis by including ?_ttl=0 in the URL query string. 
\u2014for example, if you want to return a random member of the Avengers it doesn\u2019t make sense to cache the response:\n\nhttps://fivethirtyeight.datasettes.com/fivethirtyeight?sql=select+*+from+[avengers/avengers]+order+by+random()+limit+1&_ttl=0\n\nPublishing as a core feature\n\nDatasette aims to reduce the friction for publishing interesting data online as much as possible.\n\nTo this end, Datasette includes a \u201cpublish\u201d subcommand:\n\n# deploy to Heroku datasette publish heroku mydatabase.db # Or deploy to Zeit Now datasette publish now mydatabase.db\n\nThese commands take one or more SQLite databases, upload them to a hosting provider, configure a Datasette instance to serve them and return the public URL of the newly deployed application.\n\nOut of the box, Datasette can publish to either Heroku or to Zeit Now. The publish_subcommand plugin hook means other providers can be supported by writing plugins.\n\nLicense and source metadata\n\nDatasette believes that data should be accompanied by source information and a license, whenever possible. The metadata.json file that can be bundled with your data supports these. You can also provide source and license information when you run datasette publish:\n\ndatasette publish fivethirtyeight.db \\ --source=\"FiveThirtyEight\" \\ --source_url=\"https://github.com/fivethirtyeight/data\" \\ --license=\"CC BY 4.0\" \\ --license_url=\"https://creativecommons.org/licenses/by/4.0/\"\n\nWhen you use these options Datasette will create the corresponding metadata.json file for you as part of the deployment.\n\nI really love faceted search: it\u2019s the first tool I turn to whenever I want to start understanding a collection of data. I\u2019ve built faceted search engines on top of Solr, Elasticsearch and PostgreSQL and many of my favourite tools (like Splunk and Datadog) have it as a core feature.\n\nDatasette automatically attempts to calculate facets against every table. You can read more about the Datasette Facets feature here\u2014as a huge faceted search fan it\u2019s one of my all-time favourite features of the project. Now I can add SQLite to the list of technologies I\u2019ve used to build faceted search!\n\nCSV is by far the most common format for sharing and publishing data online. Almost every useful data tool has the ability to export to it, and it remains the lingua franca of spreadsheet import and export.\n\nIt has many flaws: it can\u2019t easily represent nested data structures, escaping rules for values containing commas are inconsistently implemented and it doesn\u2019t have a standard way of representing character encoding.\n\nDatasette aims to promote SQLite as a much better default format for publishing data. I would much rather download a .db file full of pre-structured data than download a .csv and then have to re-structure it as a separate piece of work.\n\nBut interacting well with the enormous CSV ecosystem is essential. Datasette has deep CSV export functionality: any data you can see, you can export\u2014including the results of arbitrary SQL queries. If your query can be paginated Datasette can stream down every page in a single CSV file for you.\n\nDatasette\u2019s sister-tool csvs-to-sqlite handles the other side of the equation: importing data from CSV into SQLite tables. 
And the Datasette Publish web application allows users to upload their CSVs and have them deployed directly to their own fresh Datasette instance\u2014no command line required.\n\nSQL as an API language\n\nA lot of people these days are excited about GraphQL, because it allows API clients to request exactly the data they need, including traversing into related objects in a single query.\n\nGuess what? SQL has been able to do that since the 1970s!\n\nThere are a number of reasons most APIs don\u2019t allow people to pass them arbitrary SQL queries:\n\nSecurity: we don\u2019t want people messing up our data\n\nPerformance: what if someone sends an accidental (or deliberate) expensive query that exhausts our resources?\n\nHiding implementation details: if people write SQL against our API we can never change the structure of our database tables\n\nDatasette has answers to all three.\n\nOn security: the data is read-only, using SQLite\u2019s immutable mode. You can\u2019t damage it with a query\u2014INSERT and UPDATEs will simply throw harmless errors.\n\nOn performance: SQLite has a mechanism for canceling queries that take longer than a certain threshold. Datasette sets this to one second by default, though you can alter that configuration if you need to (I often bump it up to ten seconds when exploring multi-GB data on my laptop).\n\nOn hidden implementation details: since we are publishing static data rather than maintaining an evolving API, we can mostly ignore this issue. If you are really worried about it you can take advantage of canned queries and SQL view definitions to expose a carefully selected forward-compatible view into your data.\n\nOptimistic query execution with time limits\n\nI mentioned Datasette\u2019s SQL time limits above. These aren\u2019t just there to avoid malicious queries: the idea of \u201coptimistic SQL evaluation\u201d is baked into some of Datasette\u2019s core features.\n\nConsider suggested facets\u2014where Datasette inspects any table you view and tries to suggest columns that are worth faceting against.\n\nThe way this works is Datasette loops over every column in the table and runs a query to see if there are less than 20 unique values for that column. On a large table this could take a prohibitive amount of time, so Datasette sets an aggressive timeout on those queries: just 50ms. If the query fails to run in that time it is silently dropped and the column is not listed as a suggested facet.\n\nDatasette\u2019s JSON API provides a mechanism for JavaScript applications to use that same pattern. If you add ?_timelimit=20 to any Datasette API call, the underlying query will only get 20ms to run. If it goes over you\u2019ll get a very fast error response from the API. This means you can design your own features that attempt to optimistically run expensive queries without damaging the performance of your app.\n\nSQL pagination using OFFSET/LIMIT has a fatal flaw: if you request page number 300 at 20 per page the underlying SQL engine needs to calculate and sort all 6,000 preceding rows before it can return the 20 you have requested.\n\nThis does not scale at all well.\n\nKeyset pagination (often known by other names, including cursor-based pagination) is a far more efficient way to paginate through data. It works against ordered data. 
Each page is returned with a token representing the last record you saw, then when you request the next page the engine merely has to filter for records that are greater than that tokenized value and scan through the next 20 of them.\n\n(Actually, it scans through 21. By requesting one more record than you intend to display you can detect if another page of results exists\u2014if you ask for 21 but get back 20 or less you know you are on the last page.)\n\nDatasette\u2019s table view includes a sophisticated implementation of keyset pagination.\n\nDatasette defaults to sorting by primary key (or SQLite rowid). This is perfect for efficient pagination: running a select against the primary key column for values greater than X is one of the fastest range scan queries any database can support. This allows users to paginate as deep as they like without paying the offset/limit performance penalty.\n\nThis is also how the \u201cexport all rows as CSV\u201d option works: when you select that option, Datasette opens a stream to your browser and internally starts keyset-pagination over the entire table. This keeps resource usage in check even while streaming back millions of rows.\n\nHere\u2019s where Datasette gets fancy: it handles keyset pagination for any other sort order as well. If you sort by any column and click \u201cnext\u201d you\u2019ll be requesting the next set of rows after the last value you saw. And this even works for columns containing duplicate values: If you sort by such a column, Datasette actually sorts by that column combined with the primary key. The \u201cnext\u201d pagination token it generates encodes both the sorted value and the primary key, allowing it to correctly serve you the next page when you click the link.\n\nTry clicking \u201cnext\u201d on this page to see keyset pagination against a sorted column in action.\n\nInteractive demos based on the unit tests\n\nI love interactive demos. I decided it would be useful if every single release of Datasette had a permanent interactive demo illustrating its features.\n\nThanks to Zeit Now, this was pretty easy to set up. I\u2019ve actually taken it a step further: every successful push to master on GitHub is also deployed to a permanent URL.\n\nhttps://latest.datasette.io/\u2014the most recent commit to Datasette master. You can see the currently deployed commit hash on https://latest.datasette.io/-/versions and compare it to https://github.com/simonw/datasette/commits\n\nhttps://v0-25.datasette.io/ is a permanent URL to the 0.25 tagged release of Datasette. See also https://v0-24.datasette.io/ and https://v0-23-2.datasette.io/\n\nhttps://700d83d.datasette.io/-/versions is a permanent URL to the code from this commit: https://github.com/simonw/datasette/commit/700d83d\n\nThe database that is used for this demo is the exact same database that is created by Datasette\u2019s unit test fixtures. 
The unit tests are already designed to exercise every feature, so reusing them for a live demo makes a lot of sense.\n\nYou can view this test database on your own machine by checking out the full Datasette repository from GitHub and running the following:\n\npython tests/fixtures.py fixtures.db metadata.json datasette fixtures.db -m metadata.json\n\nHere\u2019s the code in the Datasette Travis CI configuration that deploys a live demo for every commit and every released tag.\n\nDocumentation unit tests\n\nI wrote about the Documentation unit tests pattern back in July.\n\nDatasette\u2019s unit tests include some assertions that ensure that every plugin hook, configuration setting and underlying view class is mentioned in the documentation. A commit or pull request that adds or modifies these without also updating the documentation (or at least ensuring there is a corresponding heading in the docs) will fail its tests.\n\nDatasette\u2019s documentation is in pretty good shape now, and the changelog provides a detailed overview of new features that I\u2019ve added to the project. I presented Datasette at the PyBay conference in August and I\u2019ve published my annotated slides from that talk. I was interviewed about Datasette for the Changelog podcast in May and my notes from that conversation include some of my favourite demos.\n\nDatasette now has an official Twitter account\u2014you can follow @datasetteproj there for updates about the project.\n\nPosted 4th October 2018 at 2:28 am \u00b7 Follow me on Mastodon or Twitter or subscribe to my newsletter\n\nMore recent articles\n\nDALL-E 3, GPT4All, PMTiles, sqlite-migrate, datasette-edit-schema - 30th October 2023\n\nNow add a walrus: Prompt engineering in DALL-E 3 - 26th October 2023\n\nExecute Jina embeddings with a CLI using llm-embed-jina - 26th October 2023\n\nEmbeddings: What they are and why they matter - 23rd October 2023\n\nWeeknotes: PyBay, AI Engineer Summit, Datasette metadata and JavaScript plugins - 22nd October 2023\n\nOpen questions for AI engineering - 17th October 2023\n\nMulti-modal prompt injection image attacks against GPT-4V - 14th October 2023\n\nWeeknotes: the Datasette Cloud API, a podcast appearance and more - 1st October 2023\n\nThings I've learned about building CLI tools in Python - 30th September 2023\n\nTalking Large Language Models with Rooftop Ruby - 29th September 2023\n\nThis is The interesting ideas in Datasette by Simon Willison, posted on 4th October 2018. projects 326 datasette 366 sqlite 223 bakeddata 9\n\nNext: How I moderated the State of Django panel at DjangoCon US.\n\nPrevious: Letterboxing on Lundy\n\nI decided to blog some of the interesting ideas in @datasetteproj that I haven't committed to writing yet. It ended up being pretty long!https://t.co/tV5ZnadTMd pic.twitter.com/WXrabE9DnV\u2014 Simon Willison (@simonw) October 4, 2018",
            "timestamp": "2023-11-02T14:03:37",
            "title": "The interesting ideas in Datasette",
            "url": "https://simonwillison.net/2018/Oct/4/datasette-ideas/"
        },
        {
            "id": "web-search_7",
            "snippet": "Find stories in data\n\nAnnotated version of this introductory video\n\nDatasette is a tool for exploring and publishing data. It helps people take data of any shape, analyze and explore it, and publish it as an interactive website and accompanying API.\n\nDatasette is aimed at data journalists, museum curators, archivists, local governments, scientists, researchers and anyone else who has data that they wish to share with the world. It is part of a wider ecosystem of 46 tools and 153 plugins dedicated to making working with structured data as productive as possible.\n\nTry a demo and explore 33,000 power plants around the world, then follow the tutorial or take a look at some other examples of Datasette in action.\n\nThen read how to get started with Datasette, subscribe to the monthly-ish newsletter and consider signing up for office hours for an in-person conversation about the project.\n\nNew: Datasette Desktop - a macOS desktop application for easily running Datasette on your own computer!\n\nExploratory data analysis\n\nImport data from CSVs, JSON, database connections and more. Datasette will automatically show you patterns in your data and help you share your findings with your colleagues.\n\nInstant data publishing\n\ndatasette publish lets you instantly publish your data to hosting providers like Google Cloud Run, Heroku or Vercel.\n\nSpin up a JSON API for any data in minutes. Use it to prototype and prove your ideas without building a custom backend.\n\n18th February 2024 #\n\nDatasette 1.0a10 is a focused alpha that changes some internal details about how Datasette handles transactions. The datasette.execute_write_fn() internal method now wraps the function in a database transaction unless you pass transaction=False.\n\n16th February 2024 #\n\nDatasette 1.0a9 adds basic alter table support to the JSON API, tweaks how permissions works and introduces some new plugin debugging utilities.\n\nDatasette 1.0a8 introduces several new plugin hooks, a JavaScript plugin system and moves plugin configuration from metadata.yaml to datasette.yaml. Read more about the release in the annotated release notes for 1.0a8.\n\nDatasette Enrichments is a new feature for Datasette that supports enriching data by running custom code against every selected row in a table. Read Datasette Enrichments: a new plugin framework for augmenting your data for more details, plus a video demo of enrichments for geocoding addresses and processing text and images using GPT-4.\n\n30th November 2023 #\n\ndatasette-comments is a new plugin by Alex Garcia which adds collaborative commenting to Datasette. Alex built the plugin for Datasette Cloud, but it's also available as an open source package for people who are hosting their own Datasette instances. See Annotate and explore your data with datasette-comments on the Datasette Cloud blog for more details.\n\nDatasette 1.0a4 has a fix for a security vulnerability in the Datasette 1.0 alpha series: the API explorer interface exposed the names of private databases and tables in public instances that were protected by a plugin such as datasette-auth-passwords, though not the actual content of those tables. See the security advisory for more details and workarounds for if you can't upgrade immediately. 
The latest edition of the Datasette Newsletter also talks about this issue.\n\ndatasette-write-ui: a Datasette plugin for editing, inserting, and deleting rows introduces a new plugin adding add/edit/delete functionality to Datasette, developed by Alex Garcia. Alex built this for Datasette Cloud, and this post is the first announcement made on the new Datasette Cloud blog - see also Welcome to Datasette Cloud.\n\nDatasette 1.0a3 is an alpha release of Datasette that previews the new default JSON API design that\u2019s coming in version 1.0 - the single most significant change planned for that 1.0 release.\n\nNew tutorial: Data analysis with SQLite and Python. This tutorial, originally presented at PyCon 2023, includes a 2h45m video and an extensive handout that should be useful with or without the video. Topics covered include Python's sqlite3 module, sqlite-utils, Datasette, Datasette Lite, advanced SQL patterns and more.\n\nI built a ChatGPT plugin to answer questions about data hosted in Datasette describes a new experimental Datasette plugin to enable people to query data hosted in a Datasette interface via ChatGPT, asking human language questions that are automatically converted to SQL and used to generate a readable response.\n\n23rd February 2023 #\n\nUsing Datasette in GitHub Codespaces is a new tutorial showing how Datasette can be run in GitHub's free Codespaces browser-based development environments, using the new datasette-codespaces plugin.\n\nExamples of sites built using Datasette now includes screenshots of Datasette deployments that illustrate a variety of problems that can be addressed using Datasette and its plugins.\n\nSemantic search answers: Q&A against documentation with GPT3 + OpenAI embeddings shows how Datasette can be used to implement semantic search and build a system for answering questions against an existing corpus of text, using two new plugins: datasette-openai and datasette-faiss, and a new tool: openai-to-sqlite.\n\nDatasette 0.64 is out, and includes a strong warning against running SpatiaLite in production without disabling arbitrary SQL queries, plus a new --setting default_allow_sql off setting to make it easier to do that. See Datasette 0.64, with a warning about SpatiaLite for more about this release. A new tutorial, Building a location to time zone API with SpatiaLite, describes how to safely use SpatiaLite and Datasette to build and deploy an API for looking up time zones for a latitude/longitude location.\n\n15th December 2022 #\n\nDatasette 1.0a2: Upserts and finely grained permissions describes the new upsert API and much improved permissions capabilities introduced in the latest Datasette 1.0a2 alpha release.\n\ndatasette-enrichments 0.3.1 - Tools for running enrichments against data stored in Datasette\n\nFix for a bug where the row action menu did not work correctly for tables with primary keys starting with an underscore. #42\n\ndatasette-write 0.3.2 - Datasette plugin providing a UI for writing to a database\n\nRemoved unnecessary print() debug statement.\n\ndatasette-enrichments 0.3 - Tools for running enrichments against data stored in Datasette\n\nNow adds an \"Enrich this row\" row action menu item on Datasette 1.0a13 and higher. #41\n\ndatasette-packages 0.2.1 - Show a list of currently installed Python packages\n\nSwitch from pkg_resources to importlib.metadata to fix a deprecation warning. Dropped support for Python 3.7. 
#6\n\ndatasette-export-database 0.2.1\n\nTemporary files created for the user to download are now deleted after the download finishes, and any stale ones are cleared out when Datasette first starts running. #5\n\ndatasette-export-database 0.2\n\nDatabase action menu now shows the size of the database in the menu item description. #3\n\nConfirm that signed URL token is from the same user based on their csrftoken cookie. #4\n\ndatasette-export-database 0.1.1\n\nFix for NameError: name 'Permission' is not defined error. #2\n\ndatasette-export-database 0.1\n\nInitial release. Users with the export-database permission gain the ability to export a snapshot of a database from an option in the database action menu. #1\n\ndatasette-configure-fts 1.1.3 - Datasette plugin for enabling full-text search against selected table columns\n\nAdded a description to the table action menu item for Datasette 1.0a13 and higher.\n\ndatasette-upload-csvs 0.9.1 - Datasette plugin for uploading CSV files and converting them to database tables\n\nFixed incorrect page title on the upload page. #40\n\nAdded a description to the database action menu item for Datasette 1.0a13 and higher.\n\ndatasette-write 0.3.1 - Datasette plugin providing a UI for writing to a database\n\nAdded a description to the database action menu item for Datasette 1.0a13 and higher.\n\ndatasette-edit-schema 0.8a1 - Datasette plugin for modifying table schemas\n\nTo rename a table users must now have drop-table permission for the old name and create-table permission for the new name. #60\n\nAction menu items now have descriptions in addition to labels.\n\ndatasette-extract 0.1a3 - Import unstructured data (text and images) into structured tables\n\nExtraction jobs can now provide additional instructions to pass to the model, which are persisted and suggested for reuse when data is imported into the table in the future. #17\n\ndatasette 1.0a13 - An open source multi-tool for exploring and publishing data\n\nEach of the key concepts in Datasette now has an actions menu, which plugins can use to add additional functionality targeting that entity.\n\nPlugin hook: view_actions() for actions that can be applied to a SQL view. (#2297)\n\nPlugin hook: homepage_actions() for actions that apply to the instance homepage. (#2298)\n\nPlugin hook: row_actions() for actions that apply to the row page. (#2299)\n\nAction menu items for all of the *_actions() plugin hooks can now return an optional \"description\" key, which will be displayed in the menu below the action label. (#2294)\n\nPlugin hooks documentation page is now organized with additional headings. (#2300)\n\nImproved the display of action buttons on pages that also display metadata. (#2286)\n\nThe header and footer of the page now uses a subtle gradient effect, and options in the navigation menu are better visually defined. (#2302)\n\nTable names that start with an underscore now default to hidden. (#2104)\n\npragma_table_list has been added to the allow-list of SQLite pragma functions supported by Datasette. select * from pragma_table_list() is no longer blocked. (#2104)\n\ndatasette-enrichments-quickjs 0.1a1 - Enrich data with a custom JavaScript function\n\nQuickJS functions now run with a 4MB memory limit, avoiding potential crashes if code tries to allocate too much memory. #4",
            "timestamp": "2024-04-02T03:48:31",
            "title": "Datasette: An open source multi-tool for exploring and publishing data",
            "url": "https://datasette.io/"
        },
        {
            "id": "web-search_9",
            "snippet": "Open Source Software\n\nSubscribe to our Newsletter\n\nAn open source multi-tool for exploring and publishing data\n\nThis is an exact mirror of the Datasette project, hosted at https://github.com/simonw/datasette. SourceForge is not affiliated with Datasette. For more information, see the SourceForge Open Source Mirror Directory.\n\nDownloads: 3 This Week\n\nLast Update: 2023-12-22\n\nDatasette is a tool for exploring and publishing data. It helps people take data of any shape or size, analyze and explore it, and publish it as an interactive website and accompanying API. Datasette is aimed at data journalists, museum curators, archivists, local governments, scientists, researchers and anyone else who has data that they wish to share with the world. It is part of a wider ecosystem of tools and plugins dedicated to making working with structured data as productive as possible. Try a demo and explore 33,000 power plants around the world, then take a look at some other examples of Datasette in action. Then read how to get started with Datasette, subscribe to the monthly-ish newsletter and consider signing up for office hours for an in-person conversation about the project.\n\nImport data from CSVs, JSON, database connections and more\n\nDatasette will automatically show you patterns in your data and help you share your findings with your colleagues\n\ndatasette publish lets you instantly publish your data to hosting providers like Google Cloud Run, Heroku or Vercel\n\nSpin up a JSON API for any data in minutes\n\nUse it to prototype and prove your ideas without building a custom backend\n\nExploratory data analysis\n\nCategoriesDesktop Publishing, Data Analytics\n\nLicenseApache License V2.0\n\nOther Useful Business Software\n\nCloudflare secures and ensures the reliability of your external-facing resources such as websites, APIs, and applications.\n\nCloudflare is the foundation for your infrastructure, applications, and teams.\n\nIt protects your internal resources such as behind-the-firewall applications, teams, and devices.\n\nRate This ProjectLogin To Rate This Project\n\nBe the first to post a review of Datasette!\n\nAdditional Project Details\n\nOperating SystemsMac\n\nProgramming LanguagePython\n\nRelated Categories Python Desktop Publishing Software, Python Data Analytics Tool\n\nSimilar Business Software\n\nVisokio builds Omniscope Evo, complete and extensible BI software for data processing, analytics and reporting. A smart experience on any device. Start from any data in any shape, load, edit, blend, transform while visually exploring it, extract insights through ML algorithms, automate your...\n\nDeepnote is building the best data science notebook for teams. In the notebook, users can connect their data, explore, and analyze it with real-time collaboration and version control. Users can easily share project links with team collaborators, or with end-users to present polished assets. All...\n\nDbVisualizer is one of the world\u2019s most popular database editors. Developers, analysts, and DBAs use it to elevate their SQL experience with modern tools to visualize and manage their databases, schemas, objects, and table data, and to auto-generate, write and optimize queries. 
And so much...\n\nReport inappropriate content\n\nRecommended Projects\n\nOpen source web spreadsheet\n\nA lightweight and easy-to-use password manager\n\nDeSmuME: Nintendo DS emulator\n\nDeSmuME is a Nintendo DS emulator\n\nA free file archiver for extremely high compression\n\nThe free and Open Source productivity suite\n\nRelated Business Categories\n\nThanks for helping keep SourceForge clean. X",
            "timestamp": "2024-03-05T13:31:40",
            "title": "Datasette download | SourceForge.net",
            "url": "https://sourceforge.net/projects/datasette.mirror/"
        },
        {
            "id": "web-search_15",
            "snippet": "Search or jump to...\n\nSearch code, repositories, users, issues, pull requests...\n\nYou signed in with another tab or window. Reload to refresh your session. You signed out in another tab or window. Reload to refresh your session. You switched accounts on another tab or window. Reload to refresh your session.\n\nSimon Willison simonw\n\n5.8k followers \u00b7 138 following\n\nhttps://simonwillison.net/\n\n@simon@fedi.simonwillison.net\n\nDeveloper Program Member\n\nCurrently working on Datasette and associated projects. Read my blog or follow @simon@simonwillison.net on Mastodon.\n\nllm-nomic-api-embed 0.1 - 2024-03-30\n\ndatasette-embeddings 0.1a2 - 2024-03-30\n\ndatasette-paste 0.1a4 - 2024-03-29\n\ntextract-cli 0.1 - 2024-03-29\n\nllm-gemini 0.1a1 - 2024-03-27\n\nllm-cmd 0.1a0 - 2024-03-26\n\nfiles-to-prompt 0.1 - 2024-03-22\n\ndatasette-enrichments 0.3.1 - 2024-03-19\n\nMore recent releases\n\nRunning OCR against PDFs and images directly in your browser - 2024-03-30\n\nllm cmd undo last git commit - a new plugin for LLM - 2024-03-26\n\nBuilding and testing C extensions for SQLite with ChatGPT Code Interpreter - 2024-03-23\n\nClaude and ChatGPT for ad-hoc sidequests - 2024-03-22\n\nWeeknotes: the aftermath of NICAR - 2024-03-16\n\nThe GPT-4 barrier has finally been broken - 2024-03-08\n\nMore on simonwillison.net\n\nInstalling tools written in Go - 2024-03-26\n\nGoogle Chrome --headless mode - 2024-03-24\n\nReviewing your history of public GitHub repositories using ClickHouse - 2024-03-20\n\nRunning self-hosted QuickJS in a browser - 2024-03-20\n\nProgrammatically comparing Python version strings - 2024-03-17\n\nRedirecting a whole domain with Cloudflare - 2024-03-15\n\nMore on til.simonwillison.net\n\nPython CLI utility and library for manipulating SQLite databases\n\nAccess large language models from the command-line\n\nFind the Python code for specified symbols\n\ns3-credentials Public\n\nA tool for creating credentials for accessing S3 buckets\n\nA command-line utility for taking automated screenshots of websites\n\nSomething went wrong, please refresh the page to try again. If the problem persists, check the GitHub status page or contact support.\n\nYou can\u2019t perform that action at this time.",
            "timestamp": "2024-03-31T05:27:22",
            "title": "simonw (Simon Willison) \u00b7 GitHub",
            "url": "https://github.com/simonw"
        },
        {
            "id": "web-search_3",
            "snippet": "Exploring and Visualizing data using Datasette\n\nDatasette is a tool for exploring and visualizing data. It helps people take data of any shape or size and publish that as an interactive, explorable website and accompanying API.\n\nin datasette , sqlite , python , api\n\nDatasette is a tool for exploring and publishing data. It is used for creating and publishing JSON APIs for SQLite databases.\n\nDatasette makes it easy to expose JSON APIs from our SQLite database without the need of a custom web application.\n\nIt helps people take data of any shape or size and publish that as an interactive, explorable website and accompanying API. Datasette is aimed at data journalists, archivists, local governments and anyone else who has data that they wish to share with the world.\n\nWe can install Datasette using docker or pip. Let us install using pip(datasette requires min python3.5)\n\n$ pip install datasette\n\nTo use datasette on any SQLite db just run\n\n$ datasette serve some-sqlite-database.db\n\nthis command serves up specified SQLite database files with a web UI\n\nFor sample datasets - https://github.com/simonw/datasette/wiki/Datasettes\n\nLet us take a sample data from fivethirtyeight data repo as it has wide range of datasets.\n\nlet us use comic-characters data in the repo, and let us explore the dc-wikia-data.csv\n\nfirst let us install a package to convert the csvs to sqlite db\n\n$ pip install csvs-to-sqlite\n\nnow let us convert the csv to sqlite db\n\n$ csvs-to-sqlite dc-wikia-data.csv dc-wikia-data.db\n\nNow let us serve the sqlite db\n\n$ datasette serve dc-wikia-data.db Serve! files=('dc-wikia-data.db',) (immutables=()) on port 8001 INFO: Started server process [12668] INFO: Waiting for application startup. INFO: Uvicorn running on http://127.0.0.1:8001 (Press CTRL+C to quit)\n\nNow when we go to the url\n\nwe can see that there are 6000+ rows in the table,\n\nlet\u2019s click on the table name\n\nthis will take us to the page where we can view/filter the table and also edit the sql queries\n\nNow let us filter the table by clicking on the suggested facets like -\n\nWe can also view the filtered-data in json or csv and also download the filtered data\n\nEverything datasette can do is driven by URLs. Queries can produce responsive HTML pages or with the .json or .jsono extension can produce JSON. All JSON responses are served with an Access-Control-Allow-Origin: * HTTP header, meaning we can query them from any page.\n\nwe can view the data in the browser by accessing the link http://localhost:8001/dc-wikia-data?sql=select+*+from+[dc-wikia-data]\n\nand we can get the json for the same data by http://localhost:8001/dc-wikia-data.json?sql=select+*+from+[dc-wikia-data]\n\nOr we can get the csv for the same data by http://localhost:8001/dc-wikia-data.csv?sql=select+*+from+[dc-wikia-data]\n\nDatasette serving options\n\n$ datasette serve --help Usage: datasette serve [OPTIONS] [FILES]... 
Serve up specified SQLite database files with a web UI Options: -i, --immutable PATH Database files to open in immutable mode -h, --host TEXT host for server, defaults to 127.0.0.1 -p, --port INTEGER port for server, defaults to 8001 --debug Enable debug mode - useful for development --reload Automatically reload if database or code change detected - useful for development --cors Enable CORS by serving Access-Control-Allow-Origin: * --load-extension PATH Path to a SQLite extension to load --inspect-file TEXT Path to JSON file created using \"datasette inspect\" -m, --metadata FILENAME Path to JSON file containing license/source metadata --template-dir DIRECTORY Path to directory containing custom templates --plugins-dir DIRECTORY Path to directory containing custom plugins --static STATIC MOUNT mountpoint:path-to-directory for serving static files --memory Make :memory: database available --config CONFIG Set config option using configname:value datasette.readthedocs.io/en/latest/config.html --version-note TEXT Additional note to show on /-/versions --help-config Show available config options --help Show this message and exit.\n\nConverting data to SQLite DB\n\nTo view the data using we have to convert the initial data to SQLite database, and we can convert it by using python packages\n\nfor CSV - csvs-to-sqlite lets us take one or more CSV files and load them into a SQLite database.\n\nfor databases - db-to-sqlite is a CLI tool that builds on top of SQLAlchemy and allows us to connect to any database supported by that library (including MySQL, oracle and PostgreSQL), run a SQL query and save the results to a new table in a SQLite database.\n\nUsing Programmatically - sqlite-utils is a Python library and CLI tool that provides shortcuts for loading data into SQLite. It can be used programmatically (e.g. in a Jupyter notebook) to load data, and will automatically create SQLite tables with the necessary schema.\n\nDatasette\u2019s plugin system allows additional features to be implemented as Python code (or front-end JavaScript) which are wrapped up as a separate Python package.\n\ndatasette-vega - datasette-vega allows us to construct line, bar and scatter charts against our data and share links to our visualizations. It is built using the Vega charting library,\n\ndatasette-cluster-map - datasette-cluster-map The plugin works against any table with latitude and longitude columns. It can load over 100,000 points onto a map to visualize the geographical distribution of the underlying data.\n\ndatasette-cors - datasette-cors this plugin allows JavaScript running on a whitelisted set of domains to make fetch() calls to the JSON API provided by our Datasette instance.\n\nLet us check datasette-vega plugin in our dataset\n\n$ pip install datasette-vega\n\nUsing this plugin we get the charting options like bar, line and scatter.\n\nPages and API endpoints\n\nThe Datasette web application offers a number of different pages that can be accessed to explore the data, each of which is accompanied by an equivalent JSON API.\n\nThe allow_sql config option is enabled by default, which enables an interface for executing arbitrary SQL select queries against the data.\n\nEvery row in every Datasette table has its own URL. This means individual records can be linked to directly.\n\nWe can return the JSON/CSV data by appending .json/.csv to the URL path, before any ? 
querystring arguments.\n\nDatasette has tools for publishing and deploying our data to the internet.\n\nThe datasette publish command will deploy a new Datasette instance containing our databases directly to a Heroku or Google Cloud hosting account.\n\nWe can also use datasette package to create a Docker image that bundles our databases together with the datasette application that is used to serve them.\n\nDatasette treats SQLite database files as read-only and immutable. This means it is not possible to execute INSERT or UPDATE statements using Datasette, which allows us to expose SELECT statements to the outside world without needing to worry about SQL injection attacks.\n\nThe easiest way to execute custom SQL against Datasette is through the web UI.\n\nAny Datasette SQL query is reflected in the URL of the page, allowing us to bookmark them, share them with others and navigate through previous queries using our browser back button.\n\nDatasette supports many features like -\n\nNamed parameters - Datasette has special support for SQLite named parameters.\n\nPagination - When paginating through tables, Datasette instead orders the rows in the table by their primary key and performs a WHERE clause against the last seen primary key for the previous page.\n\nselect rowid, * from [dc-wikia-data] where rowid > 200 order by rowid limit 101\n\nAny Datasette table, view or custom SQL query can be exported as JSON/CSV.\n\ndownload file - instead of displaying CSV in your browser, this forces your browser to download the CSV to your downloads directory.\n\nexpand labels - if your table has any foreign key references this option will cause the CSV to gain additional COLUMN_NAME_label columns with a label for each foreign key derived from the linked table.\n\nstream all rows - by default CSV files only contain the first max_returned_rows records. This option will cause Datasette to loop through every matching record and return them as a single CSV file.\n\nThe default URL for the CSV representation of a table is that table with .csv appended to it:\n\nhttps://latest.datasette.io/fixtures/facetable - HTML interface\n\nhttps://latest.datasette.io/fixtures/facetable.csv - CSV export\n\nhttps://latest.datasette.io/fixtures/facetable.json - JSON API\n\nSQLite includes a powerful mechanism for enabling full-text search against SQLite records. Datasette can detect if a table has had full-text search configured for it in the underlying database and display a search interface for filtering that table.\n\nDatasette automatically detects which tables have been configured for full-text search.\n\nAdding full-text search to a SQLite table\n\nDatasette takes advantage of the external content mechanism in SQLite, which allows a full-text search virtual table to be associated with the contents of another SQLite table.\n\nTo set up full-text search for a table, we need to do two things:\n\nCreate a new FTS virtual table associated with our table\n\nPopulate that FTS table with the data that we would like to be able to run searches against\n\nDatasette provides a number of configuration options. These can be set using the --config name:value option to datasette serve.\n\nWe can set multiple configuration options at once like\n\n$ datasette dc-wikia-data.db --config default_page_size:50 \\ --config sql_time_limit_ms:3500 \\ --config max_returned_rows:2000\n\namong many config options, the most frequently used are:\n\ndefault_page_size - The default number of rows returned by the table page. 
We can over-ride this on a per-page basis using the ?_size=80 querystring parameter, provided we do not specify a value higher than the max_returned_rows setting. We can set this default using --config like so:\n\n$ datasette dc-wikia-data.db --config default_page_size:50\n\nmax_returned_rows - Datasette returns a maximum of 1,000 rows of data at a time. You can increase or decrease this limit like so:\n\n$ datasette dc-wikia-data.db --config max_returned_rows:2000\n\nDatasette provides a number of ways of customizing the way data is displayed. Like\n\nCustom CSS and JavaScript - we can specify a custom metadata file like this:\n\n$ datasette dc-wikia-data.db --metadata metadata.json\n\nAnd in metadata.json file can include links like this:\n\n{ \"extra_css_urls\": [ \"https://simonwillison.net/static/css/all.bf8cd891642c.css\" ], \"extra_js_urls\": [ \"https://code.jquery.com/jquery-3.2.1.slim.min.js\" ] }\n\nThe extra CSS and JavaScript files will be linked in the <head> of every page.\n\nCustom templates - We can over-ride the default templates by specifying a custom --template-dir like this:\n\n$ datasette dc-wikia-data.db --template-dir=mytemplates/\n\nBuy Django ORM Cookbook\n\nThank you for reading the Agiliq blog. This article was written by Anmol Akhilesh on Jul 12, 2019 in datasette , sqlite , python , api .\n\nYou can subscribe \u269b to our blog.\n\nWe love building amazing apps for web and mobile for our clients. If you are looking for development help, contact us today \u2709.\n\nWould you like to download 10+ free Django and Python books? Get them here",
            "timestamp": "2023-11-28T11:43:32",
            "title": "Exploring and Visualizing data using Datasette",
            "url": "https://www.agiliq.com/blog/2019/07/using-datasette/"
        }
    ],
    "is_search_required": null,
    "search_queries": [
        "ChatSearchQuery(text='Simon Willison', generation_id='490f6ae2-debf-4412-b4f4-6a1bd6de914f')",
        "ChatSearchQuery(text='what is datasette', generation_id='490f6ae2-debf-4412-b4f4-6a1bd6de914f')"
    ],
    "search_results": [
        "ChatSearchResult(search_query=ChatSearchQuery(text='what is datasette', generation_id='490f6ae2-debf-4412-b4f4-6a1bd6de914f'), connector=ChatSearchResultConnector(id='web-search'), document_ids=['web-search_0', 'web-search_2', 'web-search_3', 'web-search_5', 'web-search_6', 'web-search_7', 'web-search_9'], error_message=None, continue_on_failure=None)",
        "ChatSearchResult(search_query=ChatSearchQuery(text='Simon Willison', generation_id='490f6ae2-debf-4412-b4f4-6a1bd6de914f'), connector=ChatSearchResultConnector(id='web-search'), document_ids=['web-search_12', 'web-search_14', 'web-search_15', 'web-search_18'], error_message=None, continue_on_failure=None)"
    ],
    "finish_reason": null,
    "tool_calls": null,
    "chat_history": [
        "ChatMessage(role='USER', message='who is simon willison and what is Datasette?')",
        "ChatMessage(role='CHATBOT', message='Simon Willison is a British programmer, co-founder of the social conference directory Lanyrd, and co-creator of the Django Web framework. He is also the creator of Datasette, an open-source tool for exploring and publishing data. Datasette helps users take data of any shape or size and publish it as an interactive, explorable website and accompanying API.')"
    ],
    "response_id": "2149bf36-adab-4641-9e1f-71b15709ed76",
    "meta": {
        "api_version": {
            "version": "1"
        },
        "billed_units": {
            "input_tokens": 35899,
            "output_tokens": 82
        }
    },
    "token_count": {
        "prompt_tokens": 36719,
        "search_query_tokens": 7,
        "response_tokens": 82,
        "total_tokens": 36801,
        "billed_tokens": 35981
    }
}

simonw added the enhancement label on Apr 4, 2024
simonw commented Apr 4, 2024

The easiest way to support this is via an option:

llm -m command-r-plus 'What is Datasette?' -o websearch 1

It would be nice to make the citations visible. I'm not sure how to do that cleanly though, since LLM expects a model to return just the streamed text.
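
A minimal sketch of how that option could be wired up, using LLM's standard Options mechanism for plugin models. The option name and the shape of the execute() scaffolding here are illustrative, not final:

from typing import Optional

import llm


class CohereMessages(llm.Model):
    class Options(llm.Options):
        # Illustrative option: a truthy value turns on the web-search connector
        websearch: Optional[bool] = None

    def execute(self, prompt, stream, response, conversation=None):
        kwargs = {}  # ... existing prompt / chat_history setup goes here
        if prompt.options.websearch:
            kwargs["connectors"] = [{"id": "web-search"}]
        # ... existing streaming logic continues unchanged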

simonw commented Apr 4, 2024

One option: register a new sub-command, maybe something like this:

llm command-r-search "What is Datasette?" -m command-r-plus

This would behave like the default prompt command but would add search citations, maybe as Markdown.
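
A rough sketch of that sub-command using LLM's register_commands() plugin hook. How the sources get pulled out of response_json is an assumption about what the logged JSON will look like once it has been cleaned up:

import click
import llm


@llm.hookimpl
def register_commands(cli):
    @cli.command(name="command-r-search")
    @click.argument("prompt")
    @click.option("-m", "--model", "model_id", default="command-r")
    def command_r_search(prompt, model_id):
        "Run a prompt with the web-search connector, then list the sources"
        model = llm.get_model(model_id)
        response = model.prompt(prompt, websearch=True)
        print(response.text())
        # Hypothetical: assumes the search documents end up in the logged
        # response JSON under a "documents" key
        documents = (response.response_json or {}).get("documents", [])
        if documents:
            print("\nSources:\n")
            for doc in documents:
                print("{} - {}".format(doc["title"], doc["url"]))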

simonw commented Apr 4, 2024

llm command-r-search -m command-r 'what is shot-scraper?'

Shot-scraper is a command-line utility for taking automated screenshots of websites. It can also be used as a scraping tool. It is built on Playwright, a Microsoft open-source browser automation framework.

Shot-scraper is primarily intended for taking screenshots for documentation, to automate the process so that screenshots remain up-to-date without having to manually recreate them. However, it also has some devious scraping applications and can be used to scrape web pages.

Shot-scraper can be installed using pip:

pip install shot-scraper
shot-scraper install

Then you can take a screenshot of a web page like this:

shot-scraper https://www.example.com/

This will create a screenshot in a file with a name tailored to the page URL, such as "example.jpg".

Sources:

GitHub - simonw/shot-scraper: A command-line utility for taking automated screenshots of websites - https://github.com/simonw/shot-scraper
shot-scraper: automated screenshots for documentation, built on Playwright - https://simonwillison.net/2022/Mar/10/shot-scraper/
shot-scraper · PyPI - https://pypi.org/project/shot-scraper/
shot-scraper - Python Package Health Analysis | Snyk - https://snyk.io/advisor/python/shot-scraper
shot-scraper - https://shot-scraper.datasette.io/en/stable/
Shot-scraper: Automating screenshots for documentation | Hacker News - https://news.ycombinator.com/item?id=33216789
Shot-scraper: automated screenshots for documentation, built on Playwright | Hacker News - https://news.ycombinator.com/item?id=30621802
Scraping pages using JavaScript - shot-scraper - https://shot-scraper.datasette.io/en/stable/javascript.html
Websites that need authentication - shot-scraper - https://shot-scraper.datasette.io/en/stable/authentication.html
GitHub - simonw/shot-scraper-template: Template repository for setting up shot-scraper - https://github.com/simonw/shot-scraper-template

simonw commented Apr 4, 2024

I'm going to fix the JSON so it doesn't bundle stringified objects like "ChatCitation(start=98, end=137, text='co-creator of the Django Web framework.', document_ids=['web-search_12', 'web-search_14', 'web-search_18'])" in there.
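
One way to do that is to flatten the SDK objects into plain dicts before they are logged, instead of letting them be str()'d. A sketch, with attribute names taken from the reprs in the JSON above (response_json and citations are stand-ins for wherever the logging code builds the record):

def citation_to_dict(citation):
    # Attribute names match the ChatCitation reprs in the logged JSON
    return {
        "start": citation.start,
        "end": citation.end,
        "text": citation.text,
        "document_ids": citation.document_ids,
    }


response_json["citations"] = [citation_to_dict(c) for c in citations]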

simonw commented Apr 4, 2024

Example from the README:

llm command-r-search 'What is the LLM CLI tool by simonw?'

Example output:

The LLM CLI tool is a command-line utility that allows users to access large language models. It was created by Simon Willison and can be installed via pip, Homebrew or pipx. The tool supports interactions with remote APIs and models that can be locally installed and run. Users can run prompts from the command line and even build an image search engine using the CLI tool.

Sources:

simonw added a commit that referenced this issue Apr 4, 2024