Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sqlite-utils analyze-tables command #207

Closed
simonw opened this issue Dec 12, 2020 · 4 comments
Closed

sqlite-utils analyze-tables command #207

simonw opened this issue Dec 12, 2020 · 4 comments
Labels
enhancement New feature or request

Comments

@simonw
Copy link
Owner

simonw commented Dec 12, 2020

A command which analyzes a table (potentially taking quite a while if the table is large) and outputs information for each column - things like:

  • How many unique values does this column have?
  • How many null rows?
  • How many blank rows? (defined as empty string)
  • What are the 10 most common values?
  • What are the 10 least common values?

The command can output this information to the terminal, but it should also provide an option for writing the information to a database table so it can be explored later.

@simonw simonw added the enhancement New feature or request label Dec 12, 2020
@simonw
Copy link
Owner Author

simonw commented Dec 12, 2020

Prototype:

from collections import namedtuple

ColumnDetails = namedtuple("ColumnDetails", ("column", "num_null", "num_blank", "num_distinct", "most_common", "least_common"))

def analyze_column(db, table, column, values=10):
    num_null = db.execute("select count(*) from [{}] where [{}] is null".format(table, column)).fetchone()[0]
    num_blank = db.execute("select count(*) from [{}] where [{}] = ''".format(table, column)).fetchone()[0]
    num_distinct = db.execute("select count(distinct [{}]) from [{}]".format(column, table)).fetchone()[0]
    most_common = None
    least_common = None
    if num_distinct != 1:
        most_common = [(r[0], r[1]) for r in db.execute(
            "select [{}], count(*) from [{}] group by [{}] order by count(*) desc limit ".format(column, table, column, values)
        ).fetchall()]
        if num_distinct <= values:
            # No need to run the query if it will just return the results in revers order
            least_common = most_common[::-1]
        else:
            least_common = [(r[0], r[1]) for r in db.execute(
                "select [{}], count(*) from [{}] group by [{}] order by count(*) limit {}".format(column, table, column, values)
            ).fetchall()]
    return ColumnDetails(column, num_null, num_blank, num_distinct, most_common, least_common)


def analyze_table(db, table):
    for column in db[table].columns:
        details = analyze_column(db, table, column.name)
        print(details)

@simonw
Copy link
Owner Author

simonw commented Dec 12, 2020

I'll add a table.analyze_column(column) method which is used by the CLI tool - with a note that this is an unstable interface which may change in the future.

@simonw
Copy link
Owner Author

simonw commented Dec 12, 2020

CLI could be:

sqlite-utils analyze-tables

To analyze all tables or:

sqlite-utils analyze-tables table1 table2

To analyze specific tables.

@simonw simonw changed the title sqlite-utils analyze-table command sqlite-utils analyze-tables command Dec 12, 2020
@simonw simonw closed this as completed in 69a121e Dec 13, 2020
@simonw
Copy link
Owner Author

simonw commented Dec 13, 2020

simonw added a commit that referenced this issue Dec 13, 2020
simonw added a commit that referenced this issue Dec 13, 2020
simonw added a commit to dogsheep/github-to-sqlite that referenced this issue Dec 13, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

1 participant