Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improvements to table label detection #485

Open
simonw opened this issue May 23, 2019 · 10 comments
Open

Improvements to table label detection #485

simonw opened this issue May 23, 2019 · 10 comments
Assignees
Labels

Comments

@simonw
Copy link
Owner

simonw commented May 23, 2019

Label detection doesn't work if the primary key is called pk rather than id, so this page doesn't work: https://latest.datasette.io/fixtures/roadside_attraction_characteristics

Code is here:

datasette/datasette/app.py

Lines 644 to 653 in cccea85

async def label_column_for_table(self, db_name, table):
explicit_label_column = self.table_metadata(db_name, table).get("label_column")
if explicit_label_column:
return explicit_label_column
# If a table has two columns, one of which is ID, then label_column is the other one
column_names = await self.table_columns(db_name, table)
if column_names and len(column_names) == 2 and "id" in column_names:
return [c for c in column_names if c != "id"][0]
# Couldn't find a label:
return None

@simonw simonw added the small label May 23, 2019
@simonw simonw changed the title Bette rtable label detection Improvements to table label detection May 23, 2019
@simonw
Copy link
Owner Author

simonw commented May 23, 2019

If a table has more than two columns we could do a betterl job at guessing the label column. A few potential tricks:

  • look for a column called name or title
  • look for the first column of type text
  • check for the text column with the most diversity in values

@simonw
Copy link
Owner Author

simonw commented May 23, 2019

I could attempt to calculate the statistics needed for this in a time limited SQL query something like this one: https://latest.datasette.io/fixtures?sql=select+%27name%27+as+column%2C+count+%28distinct+name%29+as+count_distinct%2C+avg%28length%28name%29%29+as+avg_length+from+roadside_attractions%0D%0A++union%0D%0Aselect+%27address%27+as+column%2C+count%28distinct+address%29+as+count_distinct%2C+avg%28length%28address%29%29+as+avg_length+from+roadside_attractions

select 'name' as column, count (distinct name) as count_distinct, avg(length(name)) as avg_length from roadside_attractions
  union
select 'address' as column, count(distinct address) as count_distinct, avg(length(address)) as avg_length from roadside_attractions

simonw added a commit that referenced this issue May 25, 2019
This needs unit tests.
@simonw
Copy link
Owner Author

simonw commented May 26, 2019

The code currently assumes the primary key is called "id" or "pk" - improving it to detect the primary key using database introspection should work much better.

@simonw
Copy link
Owner Author

simonw commented May 26, 2019

Thinking about that union query: I imagine doing this with union could encourage multiple full table scans. Maybe this query would only do one? https://latest.datasette.io/fixtures?sql=select%0D%0A++count+%28distinct+name%29+as+count_distinct_column_1%2C%0D%0A++avg%28length%28name%29%29+as+avg_length_column_1%2C%0D%0A++count%28distinct+address%29+as+count_distinct_column_2%2C%0D%0A++avg%28length%28address%29%29+as+avg_length_column_2%0D%0Afrom+roadside_attractions

select
  count (distinct name) as count_distinct_column_1,
  avg(length(name)) as avg_length_column_1,
  count(distinct address) as count_distinct_column_2,
  avg(length(address)) as avg_length_column_2
from roadside_attractions

fixtures__select_count__distinct_name__as_count_distinct_column_1__avg_length_name___as_avg_length_column_1__count_distinct_address__as_count_distinct_column_2__avg_length_address___as_avg_length_column_2_from_roadside_attractions

@simonw
Copy link
Owner Author

simonw commented May 27, 2019

This code now lives in a method on the new datasette.database.Database class, which should make it easier to write unit tests for.

@simonw
Copy link
Owner Author

simonw commented May 28, 2019

I'm going to generate statistics for every TEXT column.

Any column with more than 90% distinct rows (compared to the total count of rows) will be a candidate for the label.

I will then pick the candidate column with the shortest average length.

@simonw
Copy link
Owner Author

simonw commented May 29, 2019

Another good rule of thumb: look for text fields with a unique constraint?

@simonw simonw self-assigned this Jun 24, 2019
@simonw
Copy link
Owner Author

simonw commented Oct 2, 2022

To clarify: the feature this issue is talking about relates to the way Datasette automatically displays foreign key relationships, for example on this page: https://github-to-sqlite.dogsheep.net/github/commits

image

Each of those columns is a foreign key to another table. The link text that is displayed there comes from the "label column" that has either been configured or automatically detected for that other table.

I wonder if this could be handled with a tiny machine learning model that's trained to help pick the best label column?

Inputs to that model could include:

  • The names of the columns
  • The number of unique values in each column
  • The type of each column (or maybe only TEXT columns should be considered)
  • How many null values there are
  • Is the column marked as unique?
  • What's the average (or median or some other statistic) string length of values in each column?

Output would be the most likely label column, or some indicator that no likely candidates had been found.

My hunch is that this would be better solved using a few extra heuristics rather than by training a model, but it does feel like an interesting opportunity to experiment with a tiny ML model.

Asked for tips about this on Twitter: https://twitter.com/simonw/status/1576680930680262658

@simonw
Copy link
Owner Author

simonw commented Oct 3, 2022

I love these tips - tools that can compile a simple machine learning model to a SQL query! Would be pretty cool if I could bundle a model in Datasette itself as a big in-memory SQLite SQL query:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

1 participant