sqlite-utils analyze-tables command #207
Comments
Prototype:

    from collections import namedtuple

    ColumnDetails = namedtuple(
        "ColumnDetails",
        ("column", "num_null", "num_blank", "num_distinct", "most_common", "least_common"),
    )

    def analyze_column(db, table, column, values=10):
        num_null = db.execute("select count(*) from [{}] where [{}] is null".format(table, column)).fetchone()[0]
        num_blank = db.execute("select count(*) from [{}] where [{}] = ''".format(table, column)).fetchone()[0]
        num_distinct = db.execute("select count(distinct [{}]) from [{}]".format(column, table)).fetchone()[0]
        most_common = None
        least_common = None
        # Skip the frequency queries when every non-null row holds the same value
        if num_distinct != 1:
            most_common = [(r[0], r[1]) for r in db.execute(
                "select [{}], count(*) from [{}] group by [{}] order by count(*) desc limit {}".format(column, table, column, values)
            ).fetchall()]
            if num_distinct <= values:
                # No need to run the query if it will just return the results in reverse order
                least_common = most_common[::-1]
            else:
                least_common = [(r[0], r[1]) for r in db.execute(
                    "select [{}], count(*) from [{}] group by [{}] order by count(*) limit {}".format(column, table, column, values)
                ).fetchall()]
        return ColumnDetails(column, num_null, num_blank, num_distinct, most_common, least_common)

    def analyze_table(db, table):
        for column in db[table].columns:
            details = analyze_column(db, table, column.name)
            print(details)
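The prototype runs three separate `count` queries per column, which means three scans of a potentially large table. As a possible variation (not what the prototype does), the three counts could be gathered in one pass. The sketch below assumes a plain `sqlite3` connection from the standard library rather than a sqlite-utils `Database`, and `column_counts` is a hypothetical helper name.

```python
import sqlite3


def column_counts(conn, table, column):
    """Return (num_null, num_blank, num_distinct) for one column in a single query.

    Hypothetical helper: table and column names are interpolated with [] quoting,
    as in the prototype, so they must come from a trusted source.
    """
    sql = (
        "select "
        "sum(case when [{c}] is null then 1 else 0 end), "
        "sum(case when [{c}] = '' then 1 else 0 end), "
        "count(distinct [{c}]) "
        "from [{t}]"
    ).format(c=column, t=table)
    num_null, num_blank, num_distinct = conn.execute(sql).fetchone()
    # sum() yields NULL on an empty table, so fall back to 0
    return num_null or 0, num_blank or 0, num_distinct


conn = sqlite3.connect(":memory:")
conn.execute("create table t (v text)")
conn.executemany("insert into t values (?)", [("a",), ("a",), ("",), (None,), ("b",)])
print(column_counts(conn, "t", "v"))  # (1, 1, 3)
```

Note that `count(distinct ...)` ignores nulls, so the null row above is not counted as a distinct value; the same is true of the prototype's query.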
I'll add a |
CLI could be:
To analyze all tables or:
To analyze specific tables.
CLI documentation: https://sqlite-utils.readthedocs.io/en/latest/cli.html#analyzing-tables
Python library documentation: https://sqlite-utils.readthedocs.io/en/latest/python-api.html#analyzing-a-column
A command which analyzes a table (potentially taking quite a while if the table is large) and outputs information for each column - things like:
The command can output this information to the terminal, but it should also provide an option for writing the information to a database table so it can be explored later.
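One way the "write to a database table" option might work is sketched below. This is only an illustration under assumptions: the table name `column_stats`, its schema, and the `save_column_details` helper are all hypothetical, and the list-valued fields are stored as JSON text so that they fit in a single row per column.

```python
import json
import sqlite3


def save_column_details(conn, table, details):
    """Store one column's statistics as a row in a hypothetical column_stats table.

    `details` is a dict with the same keys as the ColumnDetails fields,
    e.g. the result of calling ._asdict() on a ColumnDetails namedtuple.
    """
    conn.execute(
        """
        create table if not exists column_stats (
            table_name text,
            column_name text,
            num_null integer,
            num_blank integer,
            num_distinct integer,
            most_common text,
            least_common text,
            primary key (table_name, column_name)
        )
        """
    )
    # insert or replace keeps a single row per (table, column) across re-runs
    conn.execute(
        "insert or replace into column_stats values (?, ?, ?, ?, ?, ?, ?)",
        (
            table,
            details["column"],
            details["num_null"],
            details["num_blank"],
            details["num_distinct"],
            json.dumps(details["most_common"]),
            json.dumps(details["least_common"]),
        ),
    )
    conn.commit()
```

Keying the rows on table and column name means re-running the analysis overwrites the earlier results instead of accumulating duplicates, and the saved rows can then be explored with ordinary SQL.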