Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add "Values are unique" to --csv #1217

Closed
tacman opened this issue Oct 14, 2023 · 3 comments
Closed

Add "Values are unique" to --csv #1217

tacman opened this issue Oct 14, 2023 · 3 comments

Comments

@tacman
Copy link

tacman commented Oct 14, 2023

It would be valuable to me to know if the all the values are unique. There is a number of unique values, and at the of the csvstat command there is

 Row count: 1856
csvstat data/subtitles_day.tsv 
  1. "IDSubtitle"

	Type of data:          Number
	Contains null values:  False
	Unique values:         1856
	Smallest value:        9,747,231
	Largest value:         9,749,339
	Sum:                   18,092,851,467
	Mean:                  9,748,303.592
	Median:                9,748,352.5
	StDev:                 628.279
	Most common values:    9,747,231 (1x)
	                       9,747,232 (1x)
	                       9,747,233 (1x)
	                       9,747,234 (1x)
	                       9,747,235 (1x)

That's lost with --csv, along with the frequency, so there's no easy way to know if the values are unique. For my purposes, I'm trying to find the primary key from a set of files, so knowing that the values are unique would be enormously helpful.

csvstat data/subtitles_day.tsv --csv | csvjson | jq

 {
    "column_id": 1,
    "column_name": "IDSubtitle",
    "type": "Number",
    "nulls": false,
    "unique": 1856,
    "min": "9,747,231",
    "max": "9,749,339",
    "sum": 18092851467,
    "mean": 9748303.592,
    "median": 9748352.5,
    "stdev": 628.279,
    "len": null,
    "freq": "9747231, 9747232, 9747233, 9747234, 9747235"
  },

If the frequency count were included in the "freq" key, I could parse that and see if the top one was just 1, but adding "Values are unique" would be better. Of course, to determine primary key I'd also check "Contains null values".

@jpmckinney
Copy link
Member

jpmckinney commented Oct 17, 2023

In HEAD, I instead added a "Non-null values" statistic (also appears in the --csv output). This information is useful for this use case as well as others.

You can thus compare non-null values to unique values. Note that if the column contains nulls, then NULL counts as one additional unique value.

@tacman
Copy link
Author

tacman commented Oct 17, 2023

Thanks!

I've been installing this via "sudo apt install csvkit" but I think I need a ppm in order to get the latest version. Is one available?

I've had trouble following the installation instructions on Ubuntu via pip.

@jpmckinney
Copy link
Member

I only manage the PyPI package. Packages in Linux distributions are created independently.

I'll make a new release of the PyPI package shortly.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants