Add "Values are unique" to --csv #1217

tacman · 2023-10-14T12:07:53Z

It would be valuable to me to know whether all the values are unique. There is a count of unique values, and at the end of the csvstat output there is

 Row count: 1856

csvstat data/subtitles_day.tsv 
  1. "IDSubtitle"

	Type of data:          Number
	Contains null values:  False
	Unique values:         1856
	Smallest value:        9,747,231
	Largest value:         9,749,339
	Sum:                   18,092,851,467
	Mean:                  9,748,303.592
	Median:                9,748,352.5
	StDev:                 628.279
	Most common values:    9,747,231 (1x)
	                       9,747,232 (1x)
	                       9,747,233 (1x)
	                       9,747,234 (1x)
	                       9,747,235 (1x)

That's lost with --csv, along with the frequency, so there's no easy way to know if the values are unique. For my purposes, I'm trying to find the primary key from a set of files, so knowing that the values are unique would be enormously helpful.

csvstat data/subtitles_day.tsv --csv | csvjson | jq

 {
    "column_id": 1,
    "column_name": "IDSubtitle",
    "type": "Number",
    "nulls": false,
    "unique": 1856,
    "min": "9,747,231",
    "max": "9,749,339",
    "sum": 18092851467,
    "mean": 9748303.592,
    "median": 9748352.5,
    "stdev": 628.279,
    "len": null,
    "freq": "9747231, 9747232, 9747233, 9747234, 9747235"
  },

If the frequency count were included in the "freq" key, I could parse that and see if the top one was just 1, but adding "Values are unique" would be better. Of course, to determine primary key I'd also check "Contains null values".


jpmckinney · 2023-10-17T15:57:47Z

In HEAD, I instead added a "Non-null values" statistic (also appears in the --csv output). This information is useful for this use case as well as others.

You can thus compare non-null values to unique values. Note that if the column contains nulls, then NULL counts as one additional unique value.
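As a rough sketch of that comparison (not part of csvkit itself): the Python below reads `csvstat --csv` output and lists columns that look like primary-key candidates. It assumes the new statistic appears in a column named `nonnulls`, which this thread does not confirm, so the field name is a parameter. The function names and the sample rows, apart from the IDSubtitle counts quoted above, are made up for illustration.

```python
import csv
import io


def to_int(text):
    """Parse an integer that may carry thousands separators, e.g. '1,856'."""
    return int(text.replace(",", ""))


def non_null_values_are_unique(row, nonnull_field="nonnulls"):
    """True if every non-null value in the column is distinct.

    NULL counts as one additional unique value when the column contains
    nulls, so it is subtracted before comparing.
    `nonnull_field` is an assumed header name; adjust it to the real output.
    """
    has_nulls = row["nulls"].strip().lower() == "true"
    distinct_non_null = to_int(row["unique"]) - (1 if has_nulls else 0)
    return distinct_non_null == to_int(row[nonnull_field])


def primary_key_candidates(stats_csv, nonnull_field="nonnulls"):
    """Return names of columns with no nulls and all values distinct."""
    rows = csv.DictReader(io.StringIO(stats_csv))
    return [
        row["column_name"]
        for row in rows
        if row["nulls"].strip().lower() != "true"
        and non_null_values_are_unique(row, nonnull_field)
    ]


# Illustrative input: the second row is invented, not real csvstat output.
SAMPLE = """column_id,column_name,type,nulls,nonnulls,unique
1,IDSubtitle,Number,False,1856,1856
2,Language,Text,True,1800,13
"""

print(primary_key_candidates(SAMPLE))
```

In practice the input would be the text produced by something like `csvstat data/subtitles_day.tsv --csv`, captured to a file or a pipe and passed in as a string.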

tacman · 2023-10-17T16:44:05Z

Thanks!

I've been installing this via "sudo apt install csvkit" but I think I need a PPA in order to get the latest version. Is one available?

I've had trouble following the installation instructions on Ubuntu via pip.

jpmckinney · 2023-10-17T18:42:54Z

I only manage the PyPI package. Packages in Linux distributions are created independently.

I'll make a new release of the PyPI package shortly.

jpmckinney closed this as completed Oct 17, 2023
