A zero-dependency CSV profiler. Given a CSV file, it tells you:
- How many rows and columns, what delimiter, what encoding
- Per column: inferred type, null count, distinct count, summary statistics
- For numeric columns: min, max, mean, median, stdev
- For date/datetime columns: min, max
- For string columns: min length, max length, most common value
- Examples of a few actual values per column
Pure Python, stdlib only. No pandas, no polars, no chardet. The whole thing installs in a ~50 MB Alpine container.
pandas' `df.describe()` gives you count / mean / std / min / 25% / 50% / 75% / max
for numeric columns only; non-numeric columns get count / unique / top / freq.
That's fine when you already know the schema, but the first question about a
CSV someone just sent you is always "what's in this file?" — and pandas
makes you do more work than you should have to.
csvkit's csvstat is closer in spirit but has been effectively unmaintained
for years and pulls a long dependency chain. csv-profile is the tiny CLI
I actually want when a file lands in Slack.
```
pip install csv-profile   # published package
```

Or run from Docker:
```
docker build -t csv-profile .
docker run --rm -v "$PWD:/work" csv-profile users.csv
```

Usage:

```
csv-profile <file.csv> [--delimiter CHAR] [--encoding ENC]
            [--sample N] [--columns C1,C2,...]
            [--format human|json|markdown]
```
- `<file.csv>` — path to a file, or `-` to read from stdin
- `--delimiter` — override auto-detection (default: `csv.Sniffer`)
- `--encoding` — override the utf-8 → utf-8-sig → latin-1 fallback
- `--sample N` — only profile the first N rows (for huge files)
- `--columns` — restrict to a comma-separated list of column names
- `--format` — `human` (default), `json`, or `markdown`
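The utf-8 → utf-8-sig → latin-1 fallback can be sketched like this. This is an illustration of the described behaviour, not the tool's actual code; `detect_and_read` is a hypothetical name. Note that plain utf-8 happily decodes a BOM as a `\ufeff` character, so the sketch re-labels that case as utf-8-sig, while latin-1 maps every possible byte and therefore never fails:

```python
def detect_and_read(path, encoding=None):
    """Return (text, encoding_used). Hypothetical sketch of the fallback chain."""
    if encoding:                              # explicit --encoding: no fallback
        with open(path, encoding=encoding) as f:
            return f.read(), encoding
    with open(path, "rb") as f:
        raw = f.read()
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        # latin-1 decodes any byte sequence, so this branch always succeeds
        return raw.decode("latin-1"), "latin-1"
    if text.startswith("\ufeff"):             # BOM survived utf-8: re-label it
        return text[1:], "utf-8-sig"
    return text, "utf-8"
```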
Exit codes: 0 = success, 1 = file, encoding, or malformed-CSV error, 2 = bad arguments.
```
$ cat users.csv
id,name,age,active,signup
1,alice,30,true,2024-01-15
2,bob,25,false,2024-02-01
3,carol,,true,2024-03-10
4,dave,45,true,
```
```
$ csv-profile users.csv
rows=4 cols=5 delimiter=comma encoding=utf-8 nulls=2 bytes≈96

id      int     nulls=0/4 (0%)   distinct=4
        min=1, max=4, mean=2.5, median=2.5, stdev=1.118
        most_common='1' (×1)
        examples=['1', '2', '3']
name    string  nulls=0/4 (0%)   distinct=4
        len=3..5
        most_common='alice' (×1)
        examples=['alice', 'bob', 'carol']
age     int     nulls=1/4 (25%)  distinct=3
...
```

`csv-profile users.csv --format markdown` produces a table you can paste directly into a GitHub PR description.
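One detail worth noting in the output above: `stdev=1.118` for the `id` column matches the population standard deviation (`statistics.pstdev`), not the sample one (`statistics.stdev`, which would be ≈1.291). This is an observation from the example output rather than a documented guarantee, and it is easy to check with the stdlib:

```python
import statistics

ids = [1, 2, 3, 4]  # the id column from users.csv above
print(f"min={min(ids)}, max={max(ids)}, mean={statistics.mean(ids)}, "
      f"median={statistics.median(ids)}, stdev={round(statistics.pstdev(ids), 3)}")
# min=1, max=4, mean=2.5, median=2.5, stdev=1.118
```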
csv-profile walks each column once and assigns the strictest type that every non-null value parses as. The ladder is:
```
int → float → bool → date → datetime → string
```
One value that doesn't fit demotes the whole column to string. That's
deliberate: "mostly int with a few strings" is a seductive heuristic that
produces silent data corruption downstream. If you have dirty data, you want
to see it in the profile, not have the profile pretend it's clean.
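The ladder walk can be sketched in a few lines of stdlib Python. This is a simplified illustration under stated assumptions, not the tool's actual code: only the int and float rungs are shown, and `infer_column_type` and the helper names are hypothetical. The column gets the strictest type that every non-null value passes, with string as the fallback that accepts anything:

```python
def looks_int(v):
    try:
        int(v)
        return True
    except ValueError:
        return False

def looks_float(v):
    try:
        float(v)
        return True
    except ValueError:
        return False

# Strictest type first; string accepts anything, so the loop always returns.
LADDER = [("int", looks_int), ("float", looks_float), ("string", lambda v: True)]

def infer_column_type(values):
    non_null = [v for v in values if v != ""]   # empty string counts as null
    for name, check in LADDER:
        if all(check(v) for v in non_null):
            return name                          # strictest rung that fits all

print(infer_column_type(["1", "2", "3"]))       # int
print(infer_column_type(["1", "2.5", ""]))      # float
print(infer_column_type(["1", "2.5", "oops"]))  # string
```

A single bad value fails every rung above string, which is exactly the all-or-nothing demotion described above.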
Empty strings are treated as null. Bools accept true/false/yes/no/t/f/y/n/1/0
(case-insensitive). Dates are strict YYYY-MM-DD. Datetimes accept ISO 8601
with T or space separator, with or without fractional seconds. Timezones
are not supported — see "Tradeoffs" below.
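Those rules map onto small stdlib checks. A sketch, assuming the behaviour stated above (function names are hypothetical, and the bare trailing `Z` is tolerated because the Tradeoffs section says it parses):

```python
import re
from datetime import date, datetime

BOOL_TOKENS = {"true", "false", "yes", "no", "t", "f", "y", "n", "1", "0"}
DATE_RE = re.compile(r"\d{4}-\d{2}-\d{2}$")          # enforce zero-padding
DT_FORMATS = ("%Y-%m-%dT%H:%M:%S", "%Y-%m-%d %H:%M:%S",
              "%Y-%m-%dT%H:%M:%S.%f", "%Y-%m-%d %H:%M:%S.%f")

def is_bool(v):
    return v.lower() in BOOL_TOKENS                  # case-insensitive tokens

def is_date(v):
    # strict YYYY-MM-DD: strptime alone would accept "2024-1-15"
    if not DATE_RE.match(v):
        return False
    try:
        date.fromisoformat(v)
        return True
    except ValueError:
        return False

def is_datetime(v):
    if v.endswith("Z"):                              # bare Z tolerated; numeric
        v = v[:-1]                                   # offsets are rejected
    for fmt in DT_FORMATS:
        try:
            datetime.strptime(v, fmt)
            return True
        except ValueError:
            pass
    return False
```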
- All-or-nothing types: no "95% of this column is int". Use `--columns` and `--sample` to isolate the bad rows if you need to debug.
- No "numeric strings": a column of zero-padded ZIP codes (`"02139"`) will be labelled `int`. If you need to preserve the leading zero, you needed a quoted-string-typed system from the start; csv-profile won't save you here.
- No timezones: `2024-01-15T10:30:00Z` parses fine but `2024-01-15T10:30:00+09:00` does not. Adding full RFC 3339 support would require a hand-rolled parser or pulling in `dateutil`, and csv-profile refuses the dependency.
- In-memory: columns are materialized once before stats. For multi-GB files, use `--sample`.
```
pip install -e ".[dev]"
pytest
```

MIT. Part of a 100+ portfolio series by SEN 合同会社.
