sen-ltd/csv-profile

csv-profile

A zero-dependency CSV profiler. Given a CSV file, it tells you:

  • How many rows and columns, what delimiter, what encoding
  • Per column: inferred type, null count, distinct count, summary statistics
  • For numeric columns: min, max, mean, median, stdev
  • For date/datetime columns: min, max
  • For string columns: min length, max length, most common value
  • Examples of a few actual values per column

Pure Python, stdlib only. No pandas, no polars, no chardet. The whole thing installs in a ~50 MB Alpine container.
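The "stdlib only" claim is easy to believe once you see how little code the numeric summary needs. A minimal sketch, not csv-profile's actual internals (the function name is illustrative; population stdev is an assumption inferred from the 1.118 shown in the Example section below):

```python
import statistics

def numeric_summary(values):
    """Min/max/mean/median/stdev for a list of numbers.

    Uses population stdev (pstdev) — an assumption; it reproduces
    the 1.118 shown for the `id` column in the Example section.
    """
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.pstdev(values),
    }
```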


Why not pandas?

df.describe() gives you count / mean / std / min / 25% / 50% / 75% / max for numeric columns only. Non-numeric columns get count / unique / top / freq. That's fine when you already know the schema, but the first question about a CSV someone just sent you is always "what's in this file?" — and pandas makes you do more work than you should have to.

csvkit's csvstat is closer in spirit but has been effectively unmaintained for years and pulls a long dependency chain. csv-profile is the tiny CLI I actually want when a file lands in Slack.

Install

pip install csv-profile       # published package

Or run from Docker:

docker build -t csv-profile .
docker run --rm -v "$PWD:/work" csv-profile users.csv

Usage

csv-profile <file.csv> [--delimiter CHAR] [--encoding ENC]
                       [--sample N] [--columns C1,C2,...]
                       [--format human|json|markdown]
  • <file.csv> — path to a file, or - to read from stdin
  • --delimiter — override auto-detection (default: csv.Sniffer)
  • --encoding — override the utf-8 → utf-8-sig → latin-1 fallback
  • --sample N — only profile the first N rows (for huge files)
  • --columns — restrict to a comma-separated list of column names
  • --format — human (default), json, or markdown

Exit codes: 0 success, 1 file / encoding / malformed error, 2 bad args.

Example

$ cat users.csv
id,name,age,active,signup
1,alice,30,true,2024-01-15
2,bob,25,false,2024-02-01
3,carol,,true,2024-03-10
4,dave,45,true,

$ csv-profile users.csv
rows=4  cols=5  delimiter=comma  encoding=utf-8  nulls=2  bytes≈96

  id      int     nulls=0/4 (0%)  distinct=4
    min=1, max=4, mean=2.5, median=2.5, stdev=1.118
    most_common='1' (×1)
    examples=['1', '2', '3']
  name    string  nulls=0/4 (0%)  distinct=4
    len=3..5
    most_common='alice' (×1)
    examples=['alice', 'bob', 'carol']
  age     int     nulls=1/4 (25%)  distinct=3
    ...

Markdown for PR comments

csv-profile users.csv --format markdown

produces a table you can paste directly into a GitHub PR description.

Type inference

csv-profile walks each column once and assigns the strictest type that every non-null value parses as. The ladder is:

int → float → bool → date → datetime → string

One value that doesn't fit demotes the whole column to string. That's deliberate: "mostly int with a few strings" is a seductive heuristic that produces silent data corruption downstream. If you have dirty data, you want to see it in the profile, not have the profile pretend it's clean.

Empty strings are treated as null. Bools accept true/false/yes/no/t/f/y/n/1/0 (case-insensitive). Dates are strict YYYY-MM-DD. Datetimes accept ISO 8601 with T or space separator, with or without fractional seconds. Timezones are not supported — see "Tradeoffs" below.
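The whole ladder fits in a few lines of stdlib Python. This is a sketch of the rules above, not the actual implementation (names are illustrative, and it omits e.g. the trailing-Z datetime form):

```python
from datetime import datetime

# Accepted boolean spellings and datetime layouts, per the rules above.
_BOOLS = {"true", "false", "yes", "no", "t", "f", "y", "n", "1", "0"}
_DT_FORMATS = ("%Y-%m-%dT%H:%M:%S", "%Y-%m-%d %H:%M:%S",
               "%Y-%m-%dT%H:%M:%S.%f", "%Y-%m-%d %H:%M:%S.%f")

def _strptime_ok(s, fmt):
    try:
        datetime.strptime(s, fmt)
        return True
    except ValueError:
        return False

def _int_ok(s):
    try:
        int(s)
        return True
    except ValueError:
        return False

def _float_ok(s):
    try:
        float(s)
        return True
    except ValueError:
        return False

# The ladder, strictest first. One value failing a rung drops the
# whole column to the next rung; failing every rung means "string".
_LADDER = [
    ("int",      _int_ok),
    ("float",    _float_ok),
    ("bool",     lambda s: s.lower() in _BOOLS),
    ("date",     lambda s: _strptime_ok(s, "%Y-%m-%d")),
    ("datetime", lambda s: any(_strptime_ok(s, f) for f in _DT_FORMATS)),
]

def infer_type(values):
    """Strictest type that every non-null (non-empty) value parses as."""
    non_null = [v for v in values if v != ""]
    for name, check in _LADDER:
        if non_null and all(check(v) for v in non_null):
            return name
    return "string"
```

Note that the ordering does real work here: "1" and "0" sit in the bool set, but a column of them is reported as int because int is the stricter rung.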

Tradeoffs

  • All-or-nothing types: no "95% of this column is int". Use --columns and --sample to isolate the bad rows if you need to debug.
  • No "numeric strings": a column of zero-padded ZIP codes ("02139") will be labelled int, and the leading zero disappears from the summary stats. If the leading zero matters, you needed a format that carries type information from the start; csv-profile won't save you here.
  • No timezones: 2024-01-15T10:30:00Z parses fine but 2024-01-15T10:30:00+09:00 does not. Adding full RFC 3339 support would require a hand-rolled parser or pulling in dateutil, and csv-profile refuses the dependency.
  • In-memory: columns are materialized once before stats. For multi-GB files, use --sample.

Development

pip install -e ".[dev]"
pytest

License

MIT. Part of a portfolio series of 100+ projects by SEN 合同会社.
