scoky/data_tools

data_tools

Useful generic Python scripts for working with text files (e.g., CSV) from the command line.

Working from the command line is the easiest way to explore a new dataset of text files, such as logs. However, many tasks require more advanced tools than are commonly available. So, I wrote a set of Python scripts for exploring data that complement the standard command-line utilities. The scripts range from general (e.g., binning log lines by one or more common column values) to domain-specific (e.g., computing the stack distance between values in a file).

The repository installs both a library and a series of scripts. Most are self-documenting. I actively add scripts, so expect to see more.

Examples:

  • To create an empirical cumulative distribution function from column X (zero-based indexing) in a log file:
	<file ecdf.py -c X
  • You can filter the log file to only the rows where column Y satisfies a condition with:
	<file where.py -n -e "c[Y] > 100 or c[Y] == 1" | ecdf.py -c X
  • Most of the scripts support grouping. This command will count the unique entries in column X per group in column Y:
	<file unique.py -g Y -c X
  • The result can then be piped to determine what fraction of the total each group represents:
	<file unique.py -g Y -c X | fraction.py --append -c 1
  • To compute the 5th, 50th (median), and 95th percentiles per group in the log, you could use:
	<file percentile.py -g Y -c X -p 0.05 0.5 0.95
  • Not satisfied with just numbers? Plot the data with:
	<file mode.py -g Y -c X | plot.py --geom line --mapping x=0 y=1
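
To make the first example concrete, here is a minimal Python sketch of what computing an empirical cumulative distribution function over one column involves. It is an illustration only, not the code in ecdf.py; the helper names `column` and `ecdf` and the sample lines are hypothetical.

```python
from collections import Counter

def column(lines, index, delimiter=None):
    """Yield column `index` (zero-based) of each non-blank line as a float.

    delimiter=None splits on runs of whitespace, like the default described above.
    """
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        yield float(line.split(delimiter)[index])

def ecdf(values):
    """Return sorted (value, cumulative fraction) pairs for the given numbers."""
    counts = Counter(values)
    total = sum(counts.values())
    running = 0
    result = []
    for value in sorted(counts):
        running += counts[value]
        result.append((value, running / total))
    return result

# Hypothetical sample data: two whitespace-separated columns.
sample = ["1 10", "2 20", "2 30", "4 40"]
for value, fraction in ecdf(column(sample, 0)):
    print(value, fraction)
```

Each output row pairs a distinct value with the fraction of rows whose value is less than or equal to it, so the last fraction is always 1.0.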

Additionally, log headers are supported with the --header option. If the option is provided, then column names may be specified instead of indices. A file with a header looks like this:

	column_one column_two column_three
	1          2          Value
	2          3          Value2
	...

The delimiter between columns defaults to whitespace, but can be modified with the --delimiter option.
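
As a rough sketch of how a header line lets column names stand in for indices, the Python below splits lines on a delimiter and maps a column specifier to a zero-based index. This is an assumption about the general idea, not the library's actual implementation; `read_table` and `resolve_column` are hypothetical names.

```python
def read_table(lines, has_header=False, delimiter=None):
    """Split non-blank lines into fields; optionally treat the first row as a header.

    delimiter=None splits on runs of whitespace.
    """
    rows = [line.split(delimiter) for line in lines if line.strip()]
    header = rows.pop(0) if has_header and rows else None
    return header, rows

def resolve_column(spec, header):
    """Map a column specifier (a header name or a zero-based index) to an index."""
    if header is not None and spec in header:
        return header.index(spec)
    return int(spec)

lines = [
    "column_one column_two column_three",
    "1          2          Value",
    "2          3          Value2",
]
header, rows = read_table(lines, has_header=True)
index = resolve_column("column_two", header)
print([row[index] for row in rows])
```

In this sketch a specifier that is not a header name falls back to being read as a numeric index, so both styles can be accepted when a header is present.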

The header and delimiter settings may also be specified with environment variables:

	TOOLBOX_DELIMITER=,
	TOOLBOX_HEADER=true
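
For illustration, a script could pick up these two variables along the following lines. This is a hedged sketch: only the variable names come from the text above; the function name `settings_from_env` and the set of strings treated as "true" are assumptions, and the real scripts may interpret the values differently.

```python
import os

def settings_from_env(environ=None):
    """Return (delimiter, header) read from TOOLBOX_* environment variables.

    A delimiter of None means "split on whitespace". Which strings count as
    true for TOOLBOX_HEADER is an assumption made for this sketch.
    """
    env = os.environ if environ is None else environ
    delimiter = env.get("TOOLBOX_DELIMITER")
    header = env.get("TOOLBOX_HEADER", "").strip().lower() in ("1", "true", "yes")
    return delimiter, header

# Passing a plain dict makes the behavior easy to check without touching os.environ.
print(settings_from_env({"TOOLBOX_DELIMITER": ",", "TOOLBOX_HEADER": "true"}))
print(settings_from_env({}))
```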

The scripts accept many more options and can be combined in pipelines to perform a wide variety of tasks.
