pysum: summarize pandas dataframes
pysum takes a pandas dataframe (and a few other arguments to
customize the output) and creates a markdown, html, or xlsx report with
a summary of each variable in the dataframe.
The program iterates over the columns of the dataframe, computes summary statistics for each column according to its data type, and writes them to a table.
The function takes the following arguments:
dataframe: pandas dataframe. No default. The passed dataframe must also have an attribute name that carries the name of the dataframe. See examples for clarification.
round_digits: Integer. Digits to which the numbers reported should be rounded. Default is 2.
var_numbers: Boolean. Whether or not to add a column indicating the column number. Default is
missing_col: Boolean. Adds a column that reports the proportion missing. Default is true.
max_distinct_values: Numeric. The maximum number of distinct values for which frequencies are displayed. If a variable has more distinct values than this number, the remaining frequencies are reported as a whole, along with the number of additional distinct values. Default is 10.
max_string_width: Integer. Limits the number of characters to display in the frequency tables. Default is 25.
output_type: String. The file format of the output file.
xlsx, html, markdown. Default is
output_file: String. The path and filename to which the script should output the results. Default is summary.html in the local directory.
append: Boolean. Whether to append the results to an existing output file or overwrite it. Default is true. When append is true, the results are appended. When it is false, the file is overwritten.
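The name requirement on the dataframe argument is easy to miss, because pandas dataframes carry no such attribute by default. A minimal sketch of how a caller might set and check it before summarizing; `require_name` is a hypothetical helper for illustration and is not part of pysum:

```python
import pandas as pd


def require_name(df: pd.DataFrame) -> str:
    """Return df.name if it is a plain string, else raise (hypothetical helper)."""
    name = getattr(df, "name", None)
    # If the frame has a column called "name", df.name is that column
    # (a Series), not a label, so only plain strings are accepted here.
    if not isinstance(name, str):
        raise ValueError("set df.name = '<label>' before summarizing")
    return name


df = pd.DataFrame({"x": [1, 2, 3]})
df.name = "toy"  # plain attribute assignment, as in the iris example
print(require_name(df))
```

A check like this fails early with a readable message instead of an attribute error from inside the report code.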
html output also depends on custom.css in the
The output is an xlsx, html, or markdown file. For numeric columns, it reports by default the mean, standard deviation, minimum, maximum, median, IQR, number of distinct values, percentage that are valid, and percentage missing.
Definitions of Things in Output
- Valid = entries with non-missing values
- mean (sd) = mean (standard deviation)
- min = minimum
- med = median
- max = maximum
- IQR = Interquartile range
- CV = Coefficient of variation
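The definitions above can be sketched with plain pandas. This is an illustration of the quantities, not pysum's own code: the function name `numeric_summary`, the key names, and the choice of sd / mean for the coefficient of variation are assumptions.

```python
import pandas as pd


def numeric_summary(col: pd.Series, round_digits: int = 2) -> dict:
    """Compute the listed statistics for one numeric column (sketch)."""
    valid = col.dropna()  # "Valid" = entries with non-missing values
    q1, q3 = valid.quantile(0.25), valid.quantile(0.75)
    mean, sd = valid.mean(), valid.std()  # pandas std() uses ddof=1
    stats = {
        "mean": mean,
        "sd": sd,
        "min": valid.min(),
        "med": valid.median(),
        "max": valid.max(),
        "IQR": q3 - q1,
        "CV": sd / mean,  # assumed definition: sd divided by mean
        "pct_valid": 100 * len(valid) / len(col),
        "pct_missing": 100 * col.isna().mean(),
    }
    out = {k: round(float(v), round_digits) for k, v in stats.items()}
    out["distinct"] = int(valid.nunique())  # count, so left unrounded
    return out


print(numeric_summary(pd.Series([1.0, 2.0, 3.0, 4.0, None])))
```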
For character vectors, it reports as many as
reports the number of other values and their percentage. It also
reports the percentage of observations that are valid and that are missing.
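A rough sketch of such a frequency table, using the max_distinct_values and max_string_width arguments described earlier. The function name `frequency_table`, the row layout, and the use of valid (non-missing) entries as the percentage base are assumptions, not pysum's confirmed behavior.

```python
import pandas as pd


def frequency_table(col: pd.Series, max_distinct_values: int = 10,
                    max_string_width: int = 25) -> list:
    """Top values with percentages; the rest lumped together (sketch)."""
    counts = col.value_counts()  # drops missing values, sorted descending
    n_valid = int(counts.sum())
    top = counts.head(max_distinct_values)

    rows = []
    for value, n in top.items():
        n = int(n)
        label = str(value)[:max_string_width]  # truncate long strings
        rows.append((label, n, round(100 * n / n_valid, 2)))

    # Lump everything beyond the top values into a single row.
    n_other_values = len(counts) - len(top)
    if n_other_values > 0:
        n_other = n_valid - int(top.sum())
        rows.append((f"{n_other_values} other values", n_other,
                     round(100 * n_other / n_valid, 2)))
    return rows


col = pd.Series(["a", "a", "a", "b", "b", "c", "d", None])
for row in frequency_table(col, max_distinct_values=2):
    print(row)
```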
Limitations: Dates are parsed as characters by default. Dates are best handled as numeric, but given the variety of formats in which dates appear, no standard support is offered for now.
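One possible workaround is to convert a date column to a numeric one before passing the dataframe in. A minimal sketch, assuming all dates share a single known format; the column names and the choice of days since 1970-01-01 are illustrative only:

```python
import pandas as pd

df = pd.DataFrame({"visit_date": ["2021-01-01", "2021-01-03", None]})

# Parse with an explicit format; entries that fail to parse become NaT.
parsed = pd.to_datetime(df["visit_date"], format="%Y-%m-%d", errors="coerce")

# Days since 1970-01-01 as floats; NaT becomes NaN, so missing values
# are still counted as missing in a numeric summary.
df["visit_date_days"] = (parsed - pd.Timestamp("1970-01-01")) / pd.Timedelta(days=1)
print(df)
```

The numeric column can then be summarized like any other, at the cost of min, max, and median being reported as day counts rather than dates.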
Running the Script
Install the requirements:
pip install -r requirements.txt
You also need pandoc to be installed on your machine.
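A quick way to confirm that pandoc is reachable before running the script is to look it up on the PATH. This is a small sketch using only the standard library; `pandoc_available` is a hypothetical helper, not something pysum provides:

```python
import shutil


def pandoc_available() -> bool:
    """True if a `pandoc` executable is found on PATH."""
    return shutil.which("pandoc") is not None


if not pandoc_available():
    print("pandoc not found on PATH; install it before generating reports")
```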
    import pandas
    import pysum

    # Load dataset
    url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
    names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
    dataset = pandas.read_csv(url, names=names)

    # Pass name of the dataset; required
    dataset.name = 'iris'

    pysum.summarizeDF(dataset)
    pysum.summarizeDF(dataset, output_type = "xlsx", append = False)
    pysum.summarizeDF(dataset, output_type = "markdown", append = False)
The package is based on https://github.com/dcomtois/summarytools