sen-ltd/csv-profile

csv-profile

A zero-dependency CSV profiler. Given a CSV file, it tells you:

  • How many rows and columns, what delimiter, what encoding
  • Per column: inferred type, null count, distinct count, summary statistics
  • For numeric columns: min, max, mean, median, stdev
  • For date/datetime columns: min, max
  • For string columns: min length, max length, most common value
  • Examples of a few actual values per column

Pure Python, stdlib only. No pandas, no polars, no chardet. The whole thing installs in a ~50 MB Alpine container.
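The "stdlib only" claim is easy to believe once you see how little code the numeric summary needs. A minimal sketch, not csv-profile's actual internals (the function name is illustrative; population stdev is an assumption inferred from the 1.118 shown in the Example section below):

```python
import statistics

def numeric_summary(values):
    """Min/max/mean/median/stdev for a list of numbers.

    Uses population stdev (pstdev) — an assumption; it reproduces
    the 1.118 shown for the `id` column in the Example section.
    """
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.pstdev(values),
    }
```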


Why not pandas?

df.describe() gives you count / mean / std / min / 25% / 50% / 75% / max for numeric columns only. Non-numeric columns get count / unique / top / freq. That's fine when you already know the schema, but the first question about a CSV someone just sent you is always "what's in this file?" — and pandas makes you do more work than you should have to.

csvkit's csvstat is closer in spirit but has been effectively unmaintained for years and pulls a long dependency chain. csv-profile is the tiny CLI I actually want when a file lands in Slack.

Install

pip install csv-profile       # published package

Or run from Docker:

docker build -t csv-profile .
docker run --rm -v "$PWD:/work" csv-profile users.csv

Usage

csv-profile <file.csv> [--delimiter CHAR] [--encoding ENC]
                       [--sample N] [--columns C1,C2,...]
                       [--format human|json|markdown]
  • <file.csv> — path to a file, or - to read from stdin
  • --delimiter — override auto-detection (default: csv.Sniffer)
  • --encoding — override the utf-8 → utf-8-sig → latin-1 fallback
  • --sample N — only profile the first N rows (for huge files)
  • --columns — restrict to a comma-separated list of column names
  • --format — human (default), json, or markdown

Exit codes: 0 success, 1 file / encoding / malformed error, 2 bad args.

Example

$ cat users.csv
id,name,age,active,signup
1,alice,30,true,2024-01-15
2,bob,25,false,2024-02-01
3,carol,,true,2024-03-10
4,dave,45,true,

$ csv-profile users.csv
rows=4  cols=5  delimiter=comma  encoding=utf-8  nulls=2  bytes≈96

  id      int     nulls=0/4 (0%)  distinct=4
    min=1, max=4, mean=2.5, median=2.5, stdev=1.118
    most_common='1' (×1)
    examples=['1', '2', '3']
  name    string  nulls=0/4 (0%)  distinct=4
    len=3..5
    most_common='alice' (×1)
    examples=['alice', 'bob', 'carol']
  age     int     nulls=1/4 (25%)  distinct=3
    ...

Markdown for PR comments

csv-profile users.csv --format markdown

produces a table you can paste directly into a GitHub PR description.

Type inference

csv-profile walks each column once and assigns the strictest type that every non-null value parses as. The ladder is:

int → float → bool → date → datetime → string

One value that doesn't fit demotes the whole column to string. That's deliberate: "mostly int with a few strings" is a seductive heuristic that produces silent data corruption downstream. If you have dirty data, you want to see it in the profile, not have the profile pretend it's clean.

Empty strings are treated as null. Bools accept true/false/yes/no/t/f/y/n/1/0 (case-insensitive). Dates are strict YYYY-MM-DD. Datetimes accept ISO 8601 with T or space separator, with or without fractional seconds. Timezones are not supported — see "Tradeoffs" below.
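The whole ladder fits in a few lines of stdlib Python. This is a sketch of the rules above, not the actual implementation (names are illustrative, and it omits e.g. the trailing-Z datetime form):

```python
from datetime import datetime

# Accepted boolean spellings and datetime layouts, per the rules above.
_BOOLS = {"true", "false", "yes", "no", "t", "f", "y", "n", "1", "0"}
_DT_FORMATS = ("%Y-%m-%dT%H:%M:%S", "%Y-%m-%d %H:%M:%S",
               "%Y-%m-%dT%H:%M:%S.%f", "%Y-%m-%d %H:%M:%S.%f")

def _strptime_ok(s, fmt):
    try:
        datetime.strptime(s, fmt)
        return True
    except ValueError:
        return False

def _int_ok(s):
    try:
        int(s)
        return True
    except ValueError:
        return False

def _float_ok(s):
    try:
        float(s)
        return True
    except ValueError:
        return False

# The ladder, strictest first. One value failing a rung drops the
# whole column to the next rung; failing every rung means "string".
_LADDER = [
    ("int",      _int_ok),
    ("float",    _float_ok),
    ("bool",     lambda s: s.lower() in _BOOLS),
    ("date",     lambda s: _strptime_ok(s, "%Y-%m-%d")),
    ("datetime", lambda s: any(_strptime_ok(s, f) for f in _DT_FORMATS)),
]

def infer_type(values):
    """Strictest type that every non-null (non-empty) value parses as."""
    non_null = [v for v in values if v != ""]
    for name, check in _LADDER:
        if non_null and all(check(v) for v in non_null):
            return name
    return "string"
```

Note that the ordering does real work here: "1" and "0" sit in the bool set, but a column of them is reported as int because int is the stricter rung.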

Tradeoffs

  • All-or-nothing types: no "95% of this column is int". Use --columns and --sample to isolate the bad rows if you need to debug.
  • No "numeric strings": a column of zero-padded ZIP codes ("02139") will be labelled int, and the leading zero disappears from the summary stats. If the leading zero matters, you needed a format that carries type information from the start; csv-profile won't save you here.
  • No timezones: 2024-01-15T10:30:00Z parses fine but 2024-01-15T10:30:00+09:00 does not. Adding full RFC 3339 support would require a hand-rolled parser or pulling in dateutil, and csv-profile refuses the dependency.
  • In-memory: columns are materialized once before stats. For multi-GB files, use --sample.

Development

pip install -e ".[dev]"
pytest

License

MIT. Part of a portfolio series of 100+ projects by SEN 合同会社.
