Statistics

Kanit Wongsuphasawat edited this page Jun 2, 2016 · 36 revisions

Summary Statistics

# dl.count(values)

Count the number of values, including nulls. The return value is the same as the values.length property.

# dl.count.valid(values[, accessor])

Count the number of values that are not null, undefined or NaN. An optional accessor function or property string may be specified, which is equivalent to calling values.map(dl.$(accessor)) before counting.

# dl.count.missing(values[, accessor])

Count the number of null and undefined values. An optional accessor function or property string may be specified, which is equivalent to calling values.map(dl.$(accessor)) before counting.
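
The difference between the valid and missing counts is easiest to see on an array that contains NaN, which is neither valid nor missing under the definitions above. The following is a minimal standalone sketch of those definitions, not datalib's source; the names `isMissing`, `isValid`, `countValid` and `countMissing` are hypothetical.

```javascript
// Standalone sketch of the counting rules described above.
// These helper names are illustrative and are not part of datalib.
function isMissing(v) {
  return v === null || v === undefined;
}

function isValid(v) {
  return !isMissing(v) && v === v; // v !== v holds only for NaN
}

function countValid(values) {
  return values.filter(isValid).length;
}

function countMissing(values) {
  return values.filter(isMissing).length;
}

var sample = [1, null, undefined, NaN, 3];
console.log(sample.length);        // 5
console.log(countValid(sample));   // 2
console.log(countMissing(sample)); // 2
```

Note that the valid and missing counts need not add up to the total count, because NaN falls into neither group.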

# dl.count.distinct(values[, accessor])

Count the number of distinct values. This method treats null, undefined and NaN as three distinct values. An optional accessor function or property string may be specified, which is equivalent to calling values.map(dl.$(accessor)) before counting.

# dl.count.map(values[, accessor])

Construct a map of counts per distinct value. This method treats null, undefined and NaN as three distinct unique values. An optional accessor function or property string may be specified, which is equivalent to calling values.map(dl.$(accessor)) before counting distinct values.

dl.count.map(['a', 'b', 'c', null, 'a', 'b'])
// {'a': 2, 'b': 2, 'c': 1, 'null': 1}

# dl.unique(values[, accessor])

Collect unique values. This method treats null, undefined and NaN as three distinct unique values. An optional accessor function or property string may be specified, which is equivalent to calling values.map(dl.$(accessor)) before tabulating unique values. The method returns an array of unique values.

dl.unique(['a', 'b', 'c', null, 'a', 'b']) // ['a', 'b', 'c', null]
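
One way to keep null, undefined and NaN apart while counting is to key the counts by the values themselves rather than by their string forms. The sketch below assumes an ES2015 `Map` for that purpose; `countMap` and `uniqueValues` are hypothetical names, and the plain-object result shown for dl.count.map above is not reproduced here.

```javascript
// Standalone sketch: tally distinct values with a Map so that null,
// undefined and NaN stay distinct from each other and from the strings
// 'null', 'undefined' and 'NaN'. Not datalib's implementation.
function countMap(values) {
  var counts = new Map();
  values.forEach(function (v) {
    counts.set(v, (counts.get(v) || 0) + 1);
  });
  return counts;
}

function uniqueValues(values) {
  // Map preserves insertion order, so values appear in first-seen order.
  return Array.from(countMap(values).keys());
}

console.log(uniqueValues(['a', 'b', 'c', null, 'a', 'b'])); // [ 'a', 'b', 'c', null ]
console.log(countMap([null, undefined, NaN, NaN]).size);    // 3
```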

# dl.median(values[, accessor])

Compute the median of an array of values, using the R-7 algorithm. An optional accessor function or property string may be specified, which is equivalent to calling values.map(dl.$(accessor)) before computing the median. This method ignores the invalid values null, undefined and NaN.

# dl.quartile(values[, accessor])

Compute the quartile boundaries of an array of values, using the R-7 algorithm. Returns a three-element array of the form [q1, q2, q3], where q1 is the lower quartile boundary, q2 is the median, q3 is the upper quartile boundary, and q3-q1 is the interquartile range. An optional accessor function or property string may be specified, which is equivalent to calling values.map(dl.$(accessor)) before computing the quartiles. This method ignores the invalid values null, undefined and NaN.

# dl.quantile(numbers, p)

Compute the quantile of a sorted array of numbers. Returns the p-quantile of the given sorted array of numbers, where p is a number in the range [0,1]. For example, the median can be computed using p = 0.5, the first quartile at p = 0.25, and the third quartile at p = 0.75. This particular implementation uses the R-7 algorithm, which is the default for the R programming language and Excel. This method requires that numbers contains numeric elements and is already sorted in ascending order.

var a = [0, 1, 3];
dl.quantile(a, 0); // 0
dl.quantile(a, 0.5); // 1
dl.quantile(a, 1); // 3
dl.quantile(a, 0.25); // 0.5
dl.quantile(a, 0.75); // 2
dl.quantile(a, 0.1); // 0.19999999999999996 
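
For readers unfamiliar with R-7, the method places the p-quantile at the fractional position (n - 1) * p of the sorted array and interpolates linearly between the two neighbouring elements. The following is a minimal sketch of that rule; `quantileR7` and `quartiles` are hypothetical names, and results may differ from dl.quantile in the last floating-point digits.

```javascript
// Standalone sketch of R-7 quantile interpolation. Not datalib's source.
function quantileR7(sorted, p) {
  var h = (sorted.length - 1) * p; // fractional index into the sorted array
  var lo = Math.floor(h);
  var frac = h - lo;
  var v = sorted[lo];
  // Interpolate toward the next element only when h is not an integer.
  return frac ? v + frac * (sorted[lo + 1] - v) : v;
}

function quartiles(values) {
  var sorted = values
    .filter(function (v) { return v != null && v === v; }) // drop null, undefined, NaN
    .sort(function (a, b) { return a - b; });              // numeric ascending order
  return [0.25, 0.5, 0.75].map(function (p) {
    return quantileR7(sorted, p);
  });
}

console.log(quantileR7([0, 1, 3], 0.25));     // 0.5
console.log(quantileR7([0, 1, 3], 0.75));     // 2
console.log(quartiles([3, null, 0, NaN, 1])); // [ 0.5, 1, 2 ]
```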

# dl.sum(values[, accessor])

Compute the sum of an array of numbers. An optional accessor function or property string may be specified, which is equivalent to calling values.map(dl.$(accessor)) before computing the sum. This method ignores the invalid values null, undefined and NaN.

# dl.mean(values[, accessor])

Compute the arithmetic mean (average) of an array of numbers. An optional accessor function or property string may be specified, which is equivalent to calling values.map(dl.$(accessor)) before computing the mean. This method ignores the invalid values null, undefined and NaN.

# dl.mean.geometric(values[, accessor])

Compute the geometric mean of an array of positive numbers. An optional accessor function or property string may be specified, which is equivalent to calling values.map(dl.$(accessor)) before computing the mean. This method ignores the invalid values null, undefined and NaN. This method will throw an error if negative or zero values are observed.

# dl.mean.harmonic(values[, accessor])

Compute the harmonic mean of an array of numbers. An optional accessor function or property string may be specified, which is equivalent to calling values.map(dl.$(accessor)) before computing the mean. This method ignores the invalid values null, undefined and NaN.
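
The geometric and harmonic means can be sketched from their textbook definitions: the geometric mean is the exponential of the average logarithm, and the harmonic mean is the count divided by the sum of reciprocals. The code below is an illustrative standalone version with hypothetical names (`validNumbers`, `geometricMean`, `harmonicMean`); how datalib computes these internally is not stated in this page.

```javascript
// Standalone sketch of the two means. Not datalib's source.
function validNumbers(values) {
  return values.filter(function (v) { return v != null && v === v; });
}

function geometricMean(values) {
  var xs = validNumbers(values);
  var sumLog = 0;
  xs.forEach(function (x) {
    if (x <= 0) throw new Error('Geometric mean requires positive values.');
    sumLog += Math.log(x);
  });
  return Math.exp(sumLog / xs.length); // NaN for an empty input
}

function harmonicMean(values) {
  var xs = validNumbers(values);
  var sumInverse = 0;
  xs.forEach(function (x) { sumInverse += 1 / x; });
  return xs.length / sumInverse; // NaN for an empty input
}

console.log(geometricMean([1, 4, null])); // approximately 2
console.log(harmonicMean([1, 2, 4]));     // approximately 1.714 (12/7)
```

Summing logarithms rather than multiplying the raw values avoids overflow on long arrays, at the cost of a small rounding error.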

# dl.variance(values[, accessor])

Compute the sample variance of an array of numbers. Returns an unbiased estimator of the population variance of the given values array of numbers. An optional accessor function or property string may be specified, which is equivalent to calling values.map(dl.$(accessor)) before computing the variance. This method ignores the invalid values null, undefined and NaN.

# dl.stdev(values[, accessor])

Compute the sample standard deviation of an array of numbers. Returns the standard deviation, defined as the square root of the bias-corrected variance, of the given values array of numbers. If the array has fewer than two values, returns undefined. An optional accessor function or property string may be specified, which is equivalent to calling values.map(dl.$(accessor)) before computing the standard deviation. This method ignores the invalid values null, undefined and NaN.
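
"Bias-corrected" here means dividing the sum of squared deviations by n - 1 rather than n. A minimal two-pass sketch follows; `sampleVariance` and `sampleStdev` are hypothetical names, and returning undefined from the variance helper for fewer than two values is an assumption of this sketch that mirrors the stdev behaviour described above.

```javascript
// Standalone sketch of the bias-corrected (n - 1) variance and its square root.
// Not datalib's source.
function sampleVariance(values) {
  var xs = values.filter(function (v) { return v != null && v === v; });
  var n = xs.length;
  if (n < 2) return undefined; // assumption: undefined when the estimate is not defined

  var mean = xs.reduce(function (s, x) { return s + x; }, 0) / n;
  var sumSquares = xs.reduce(function (s, x) {
    return s + (x - mean) * (x - mean);
  }, 0);
  return sumSquares / (n - 1);
}

function sampleStdev(values) {
  var variance = sampleVariance(values);
  return variance === undefined ? undefined : Math.sqrt(variance);
}

console.log(sampleVariance([1, 2, 3, 4])); // approximately 1.667 (5/3)
console.log(sampleStdev([1]));             // undefined
```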

# dl.modeskew(values[, accessor])

Compute the Pearson mode skewness of an array of numbers. Returns the mode skewness, defined as (mean - median) / stdev. An optional accessor function or property string may be specified, which is equivalent to calling values.map(dl.$(accessor)) before computing the mode skewness. This method ignores the invalid values null, undefined and NaN.

# dl.min(values[, accessor])

Find the minimum value of an array of values, using natural order. An optional accessor function or property string may be specified, which is equivalent to calling values.map(dl.$(accessor)) before computing the minimum. This method ignores the invalid values null, undefined and NaN.

# dl.max(values[, accessor])

Find the maximum value of an array of values, using natural order. An optional accessor function or property string may be specified, which is equivalent to calling values.map(dl.$(accessor)) before computing the maximum. This method ignores the invalid values null, undefined and NaN.

# dl.extent(values[, accessor])

Find the minimum and maximum of an array of values, using natural order. Returns an array of the form [min, max]. An optional accessor function or property string may be specified, which is equivalent to calling values.map(dl.$(accessor)) before computing the extent. This method ignores the invalid values null, undefined and NaN.

# dl.extent.index(values[, accessor])

Find the integer indices of the minimum and maximum values, using natural order. Returns an array of the form [min_index, max_index]. An optional accessor function or property string may be specified, which is equivalent to calling values.map(dl.$(accessor)) before computing the extent. This method ignores the invalid values null, undefined and NaN.

# dl.rank(values[, accessor])

Compute ascending rank scores for an array of values. The minimum value is given a rank score of 1. Other values receive higher scores. In the case of ties, the average rank is assigned to all identical values. An optional accessor function or property string may be specified, which is equivalent to calling values.map(dl.$(accessor)) before computing the rank scores. Note: the inclusion of null, undefined or NaN values may lead to unexpected results.

dl.rank([3, 5, 4]); // [1, 3, 2]
dl.rank([3, 4, 4, 5]); // [1, 2.5, 2.5, 4]
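
The tie-averaging rule can be sketched by sorting the values, finding each run of equal values, and assigning the mean of the 1-based positions that run occupies. The function name `rankScores` is hypothetical, and the sketch assumes an array of valid numbers.

```javascript
// Standalone sketch of ascending ranks with average ranks for ties.
// Assumes numeric input without null, undefined or NaN. Not datalib's source.
function rankScores(values) {
  var order = values
    .map(function (v, i) { return {value: v, index: i}; })
    .sort(function (a, b) { return a.value - b.value; });

  var ranks = new Array(values.length);
  var i = 0;
  while (i < order.length) {
    // Extend j to the end of the run of values equal to order[i].
    var j = i;
    while (j + 1 < order.length && order[j + 1].value === order[i].value) j++;

    // Average of the 1-based positions i + 1 through j + 1.
    var average = (i + j) / 2 + 1;
    for (var k = i; k <= j; k++) ranks[order[k].index] = average;
    i = j + 1;
  }
  return ranks;
}

console.log(rankScores([3, 5, 4]));    // [ 1, 3, 2 ]
console.log(rankScores([3, 4, 4, 5])); // [ 1, 2.5, 2.5, 4 ]
```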

# dl.dist.mat(numbers)

Construct a mean-centered distance matrix for an array of numbers. The argument numbers must be an array of valid numerical values. Returns a matrix of all pairwise 1D distances among a set of values, centered by subtracting global, row and column means. This method is primarily used as a subroutine for computing distance correlation (dl.cor.dist) measures.

Distance and Correlation Measures

# dl.dot(x, y)
   dl.dot(values, accessor_x, accessor_y)

Compute the vector dot product of two arrays of numbers. The method accepts either two arrays of numbers (x and y), or a values array of objects and two corresponding accessors. This second option is equivalent to the first option after calling values.map(dl.$(accessor)) for each of the accessors.

# dl.dist(x, y[, exp])
   dl.dist(values, accessor_x, accessor_y[, exp])

Compute the vector distance between two arrays of numbers. The method accepts either two arrays of numbers (x and y), or a values array of objects and two corresponding accessors. This second option is equivalent to the first option after calling values.map(dl.$(accessor)) for each of the accessors. The optional exp argument provides an exponent for the distance calculation. The default is 2, corresponding to normal Euclidean distance.
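
To make the two calling styles concrete, the sketch below computes a dot product and a distance from plain arrays, then repeats the distance calculation from an array of objects by extracting two fields first. It assumes the exponent generalizes Euclidean distance in the Minkowski sense (sum of absolute differences raised to exp, then the exp-th root); the names `dotProduct`, `minkowski` and `pluck` are hypothetical.

```javascript
// Standalone sketch of a dot product and an exponent-parameterized distance.
// The Minkowski interpretation of the exponent is an assumption.
function dotProduct(x, y) {
  var sum = 0;
  for (var i = 0; i < x.length; i++) sum += x[i] * y[i];
  return sum;
}

function minkowski(x, y, exp) {
  var e = exp === undefined ? 2 : exp; // default exponent 2: Euclidean distance
  var sum = 0;
  for (var i = 0; i < x.length; i++) {
    sum += Math.pow(Math.abs(x[i] - y[i]), e);
  }
  return e === 2 ? Math.sqrt(sum) : Math.pow(sum, 1 / e);
}

// Accessor-style usage: extract two fields from an array of objects first.
function pluck(rows, field) {
  return rows.map(function (row) { return row[field]; });
}

var rows = [{a: 0, b: 3}, {a: 0, b: 4}];
console.log(dotProduct([1, 2, 3], [4, 5, 6]));            // 32
console.log(minkowski([0, 0], [3, 4]));                   // 5
console.log(minkowski([0, 0], [3, 4], 1));                // 7
console.log(minkowski(pluck(rows, 'a'), pluck(rows, 'b'))); // 5
```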

# dl.cor(x, y)
   dl.cor(values, accessor_x, accessor_y)

Compute the Pearson sample product-moment correlation of two arrays of numbers. The method accepts either two arrays of numbers (x and y), or a values array of objects and two corresponding accessors. This second option is equivalent to the first option after calling values.map(dl.$(accessor)) for each of the accessors.
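
The Pearson coefficient is the sum of co-deviations divided by the square root of the product of the two sums of squared deviations. A minimal sketch, assuming two equal-length arrays of valid numbers; `pearson` is a hypothetical name.

```javascript
// Standalone sketch of the Pearson product-moment correlation.
// Assumes equal-length arrays of valid numbers. Not datalib's source.
function pearson(x, y) {
  var n = x.length;
  var meanX = 0, meanY = 0, i;
  for (i = 0; i < n; i++) { meanX += x[i]; meanY += y[i]; }
  meanX /= n;
  meanY /= n;

  var sxy = 0, sxx = 0, syy = 0;
  for (i = 0; i < n; i++) {
    var dx = x[i] - meanX, dy = y[i] - meanY;
    sxy += dx * dy;
    sxx += dx * dx;
    syy += dy * dy;
  }
  return sxy / Math.sqrt(sxx * syy); // NaN when either input is constant
}

console.log(pearson([1, 2, 3], [2, 4, 6]));        // 1
console.log(pearson([1, 2, 3], [3, 2, 1]));        // -1
console.log(pearson([1, 2, 3, 4], [1, -1, -1, 1])); // 0
```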

# dl.cor.rank(x, y)
   dl.cor.rank(values, accessor_x, accessor_y)

Compute the Spearman rank correlation of two arrays of values. The method accepts either two arrays of numbers (x and y), or a values array of objects and two corresponding accessors. This second option is equivalent to the first option after calling values.map(dl.$(accessor)) for each of the accessors. Ranks are computed using dl.rank.

# dl.cor.dist(x, y)
   dl.cor.dist(values, accessor_x, accessor_y)

Compute the distance correlation of two arrays of numbers. The method accepts either two arrays of numbers (x and y), or a values array of objects and two corresponding accessors. This second option is equivalent to the first option after calling values.map(dl.$(accessor)) for each of the accessors.

# dl.covariance(x, y)
   dl.covariance(values, accessor_x, accessor_y)

Computes the covariance between two arrays of numbers. The method accepts either two arrays of numbers (x and y), or a values array of objects and two corresponding accessors. This second option is equivalent to the first option after calling values.map(dl.$(accessor)) for each of the accessors. This method will ignore non-valid values so long as they are paired across input vectors.

# dl.cohensd(x, y)
   dl.cohensd(values, accessor_x, accessor_y)

Compute Cohen's d, an effect size, for two arrays of numbers. Cohen's d is defined as the difference between two means divided by a standard deviation for the data. The method accepts either two arrays of numbers (x and y), or a values array of objects and two corresponding accessors. This second option is equivalent to the first option after calling values.map(dl.$(accessor)) for each of the accessors. This method will ignore non-valid values so long as they are paired across input vectors.
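
The text above does not say which standard deviation is used in the denominator or whether the covariance is divided by n or n - 1. The sketch below assumes the common choices: a sample covariance with an n - 1 divisor, and a pooled standard deviation for Cohen's d. Both are assumptions, the names `covarianceSample` and `cohensD` are hypothetical, and inputs are assumed to contain only valid numbers.

```javascript
// Standalone sketch. The n - 1 divisor and the pooled standard deviation
// are assumptions, not confirmed datalib behaviour.
function mean(xs) {
  return xs.reduce(function (s, x) { return s + x; }, 0) / xs.length;
}

function covarianceSample(x, y) {
  var mx = mean(x), my = mean(y), sum = 0;
  for (var i = 0; i < x.length; i++) sum += (x[i] - mx) * (y[i] - my);
  return sum / (x.length - 1);
}

function cohensD(x, y) {
  var n1 = x.length, n2 = y.length;
  var v1 = covarianceSample(x, x); // sample variance of x
  var v2 = covarianceSample(y, y); // sample variance of y
  var pooled = Math.sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2));
  return (mean(x) - mean(y)) / pooled;
}

console.log(covarianceSample([1, 2, 3], [2, 4, 6])); // 2
console.log(cohensD([2, 4, 6], [1, 3, 5]));          // 0.5
```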

Modeling

Datalib provides basic methods for linear regression as well as confidence intervals and significance tests for normally-distributed data. For utilities for probability distributions, see the Data Generators and Distributions documentation.

# dl.linearRegression(x, y)
   dl.linearRegression(values, accessor_x, accessor_y)

Fit a univariate linear regression model, regressing y against the predictor x. The method accepts either two arrays of numbers (x and y), or a values array of objects and two corresponding accessors. This second option is equivalent to the first option after calling values.map(dl.$(accessor)) for each of the accessors. This method will ignore non-valid values so long as they are paired across input vectors. Returns a "fit" object with the properties slope, intercept, rss (the sum-squared residual error), and R (the square root of the coefficient of determination, equal to Pearson's correlation coefficient).
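
The four properties of the fit object follow directly from the ordinary least-squares formulas. The sketch below is a standalone illustration under the assumption that both inputs are equal-length arrays of valid numbers; `linearFit` is a hypothetical name and this is not datalib's source.

```javascript
// Standalone sketch of a univariate least-squares fit.
function linearFit(x, y) {
  var n = x.length, meanX = 0, meanY = 0, i;
  for (i = 0; i < n; i++) { meanX += x[i]; meanY += y[i]; }
  meanX /= n;
  meanY /= n;

  var sxx = 0, sxy = 0, syy = 0;
  for (i = 0; i < n; i++) {
    var dx = x[i] - meanX, dy = y[i] - meanY;
    sxx += dx * dx;
    sxy += dx * dy;
    syy += dy * dy;
  }

  var slope = sxy / sxx;
  var intercept = meanY - slope * meanX;

  // Residual sum of squares of the fitted line.
  var rss = 0;
  for (i = 0; i < n; i++) {
    var residual = y[i] - (intercept + slope * x[i]);
    rss += residual * residual;
  }

  return {
    slope: slope,
    intercept: intercept,
    R: sxy / Math.sqrt(sxx * syy), // Pearson's correlation coefficient
    rss: rss
  };
}

var fit = linearFit([0, 1, 2, 3], [1, 3, 5, 7]);
console.log(fit.slope, fit.intercept, fit.R, fit.rss); // 2 1 1 0
```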

# dl.bootstrap.ci(values[, runs, alpha, smooth])
   dl.bootstrap.ci(values, accessor[, runs, alpha, smooth])

Compute confidence intervals based on a bootstrap sample of the provided numerical values. Returns an array of the form [min_extent, max_extent]. An optional accessor function or property string may be specified, which is equivalent to calling values.map(dl.$(accessor)) before computing the extent. This method ignores the invalid values null, undefined and NaN. The runs parameter indicates the number of bootstrapping runs to perform (default 1000). The alpha parameter defaults to 0.05, indicating 95% confidence intervals. The smooth parameter determines how much noise to add to each sampled value (see dl.random.bootstrap).
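
A percentile bootstrap can be sketched as follows: resample the data with replacement, record a statistic for each resample, and read the interval off the alpha/2 and 1 - alpha/2 quantiles of those statistics. The sketch assumes the statistic is the mean and omits smoothing; the name `bootstrapMeanCI` and its `rng` parameter are hypothetical and are not part of the dl.bootstrap.ci signature.

```javascript
// Standalone sketch of a percentile bootstrap interval for the mean.
// Bootstrapping the mean is an assumption; smoothing is not implemented.
function bootstrapMeanCI(values, runs, alpha, rng) {
  var xs = values.filter(function (v) { return v != null && v === v; });
  var n = xs.length;
  var total = runs || 1000;
  var a = alpha || 0.05;
  var random = rng || Math.random; // injectable for reproducible runs

  // Resample with replacement and record the mean of each resample.
  var means = [];
  for (var r = 0; r < total; r++) {
    var sum = 0;
    for (var i = 0; i < n; i++) sum += xs[Math.floor(random() * n)];
    means.push(sum / n);
  }
  means.sort(function (p, q) { return p - q; });

  // Linear interpolation between order statistics (R-7 style).
  function quantile(sorted, p) {
    var h = (sorted.length - 1) * p, lo = Math.floor(h), frac = h - lo;
    return frac ? sorted[lo] + frac * (sorted[lo + 1] - sorted[lo]) : sorted[lo];
  }

  return [quantile(means, a / 2), quantile(means, 1 - a / 2)];
}

var ci = bootstrapMeanCI([2, 4, 4, 5, 7, 9], 500);
console.log(ci); // varies from run to run
```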

# dl.z.ci(values[, alpha])
   dl.z.ci(values, accessor[, alpha])

Compute confidence intervals assuming a normal (gaussian) distribution using the specified alpha level. Returns an array of the form [min_extent, max_extent]. An optional accessor function or property string may be specified, which is equivalent to calling values.map(dl.$(accessor)) before computing the extent. This method ignores the invalid values null, undefined and NaN. The alpha parameter defaults to 0.05, indicating 95% confidence intervals.

# dl.z.test(values[, accessor, options])
   dl.z.test(x, y[, options])
   dl.z.test(values, accessor_x, accessor_y[, options])

Perform a z-test of means. Returns the p-value of the test. If a single array (with optional accessor) is provided, performs a one-sample location test. If two arrays or a table and two accessors are provided, performs a two-sample location test. A paired test is performed if specified by the options hash. This method ignores the invalid values null, undefined and NaN.

The supported options properties are:

  • paired: A boolean flag indicating if a two-sample test is being conducted on paired samples.
  • nullh: A number indicating the null hypothesis for the difference between means (default 0).
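
For intuition, a one-sample z-test can be sketched as: compute z as the difference between the sample mean and the null value divided by the standard error, then convert z to a p-value with the standard normal CDF. The sketch below assumes a two-sided p-value and uses the sample standard deviation for the standard error; both are assumptions, as are the names `normalCdf` and `zTestOneSample`. The CDF uses the Abramowitz and Stegun 7.1.26 approximation of erf, which is accurate to roughly 1.5e-7.

```javascript
// Standalone sketch of a one-sample, two-sided z-test. Not datalib's source.
function normalCdf(z) {
  // Abramowitz & Stegun 7.1.26 polynomial approximation of erf(|z| / sqrt(2)).
  var x = Math.abs(z) / Math.SQRT2;
  var t = 1 / (1 + 0.3275911 * x);
  var poly = t * (0.254829592 + t * (-0.284496736 + t * (1.421413741 +
             t * (-1.453152027 + t * 1.061405429))));
  var erf = 1 - poly * Math.exp(-x * x);
  return z >= 0 ? 0.5 * (1 + erf) : 0.5 * (1 - erf);
}

function zTestOneSample(values, nullh) {
  var xs = values.filter(function (v) { return v != null && v === v; });
  var n = xs.length;
  var mu0 = nullh || 0;
  var mean = xs.reduce(function (s, x) { return s + x; }, 0) / n;
  var sumSquares = xs.reduce(function (s, x) {
    return s + (x - mean) * (x - mean);
  }, 0);
  var standardError = Math.sqrt(sumSquares / (n - 1)) / Math.sqrt(n);

  if (standardError === 0) return mean === mu0 ? 1 : 0; // degenerate case
  var z = (mean - mu0) / standardError;
  return 2 * normalCdf(-Math.abs(z)); // two-sided p-value
}

console.log(zTestOneSample([1, 2, 3, 4, 5], 3)); // approximately 1
console.log(zTestOneSample([10, 11, 12, 9, 10])); // approximately 0
```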

Entropy Measures

# dl.entropy(counts[, accessor])

Compute the Shannon entropy of an array of counts. An optional accessor function or property string may be specified, which is equivalent to calling counts.map(dl.$(accessor)) before computing the entropy. If counts is empty or contains zero values only, this method returns zero.
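
Shannon entropy turns the counts into probabilities and sums -p * log(p) over the non-zero entries. The sketch below assumes base-2 logarithms (entropy in bits), which this page does not state; `shannonEntropy` is a hypothetical name.

```javascript
// Standalone sketch of Shannon entropy over an array of counts.
// The base-2 logarithm is an assumption. Not datalib's source.
function shannonEntropy(counts) {
  var total = counts.reduce(function (s, c) { return s + c; }, 0);
  if (total === 0) return 0; // empty input or all-zero counts

  var entropy = 0;
  counts.forEach(function (c) {
    if (c > 0) {
      var p = c / total;
      entropy -= p * Math.log(p) / Math.LN2;
    }
  });
  return entropy;
}

console.log(shannonEntropy([1, 1]));       // 1
console.log(shannonEntropy([5, 5, 5, 5])); // 2
console.log(shannonEntropy([0, 0]));       // 0
```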

# dl.mutual(x, y, counts)
   dl.mutual(values, accessor_x, accessor_y, accessor_counts)

Computes both the mutual information and mutual information distance between two discrete variables. Returns an array of the form [I, 1-I/H], where I is the mutual information and H is the joint entropy of x and y. This method accepts either three arrays of numbers or an array of values objects and three corresponding accessors. This second option is equivalent to the first option after calling values.map(dl.$(accessor)) for each of the accessors. The x and y values represent discrete random variables for which to compute the mutual information. For each index i, the corresponding x and y values must denote a unique bin in the two-dimensional joint frequency distribution. For each of these bins, the counts value indicates the occurrence frequency.
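
The input layout is easiest to follow with a small example: each index i names one cell (x[i], y[i]) of the joint frequency table and counts[i] gives that cell's frequency. The sketch below computes I and 1 - I/H from that layout using base-2 logarithms, which is an assumption; `mutualInformation` is a hypothetical name, and the case where the joint entropy is zero is not handled.

```javascript
// Standalone sketch of mutual information and the derived distance from
// a sparse joint frequency table. Base-2 logarithms are an assumption.
function mutualInformation(x, y, counts) {
  var total = 0, marginalX = {}, marginalY = {}, i;

  // Accumulate the grand total and the marginal counts of x and y.
  for (i = 0; i < counts.length; i++) {
    total += counts[i];
    marginalX[x[i]] = (marginalX[x[i]] || 0) + counts[i];
    marginalY[y[i]] = (marginalY[y[i]] || 0) + counts[i];
  }

  var info = 0, jointEntropy = 0;
  for (i = 0; i < counts.length; i++) {
    if (counts[i] === 0) continue; // empty cells contribute nothing
    var p = counts[i] / total;
    var px = marginalX[x[i]] / total;
    var py = marginalY[y[i]] / total;
    info += p * Math.log(p / (px * py)) / Math.LN2;
    jointEntropy -= p * Math.log(p) / Math.LN2;
  }

  // jointEntropy is zero when all mass sits in one cell; not special-cased here.
  return [info, 1 - info / jointEntropy];
}

// Perfectly dependent variables: I = 1 bit, distance = 0.
console.log(mutualInformation(['a', 'b'], ['u', 'v'], [3, 3]));

// Independent variables on a uniform 2x2 table: I = 0, distance = 1.
console.log(mutualInformation(['a', 'a', 'b', 'b'], ['u', 'v', 'u', 'v'], [1, 1, 1, 1]));
```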

# dl.mutual.info(x, y, counts)
   dl.mutual.info(values, accessor_x, accessor_y, accessor_counts)

Computes the mutual information between two discrete variables. The return value is the first element in the array returned by dl.mutual. This method accepts either three arrays of numbers or an array of values objects and three corresponding accessors. This second option is equivalent to the first option after calling values.map(dl.$(accessor)) for each of the accessors. The x and y values represent discrete random variables for which to compute the mutual information. For each index i, the corresponding x and y values must denote a unique bin in the two-dimensional joint frequency distribution. For each of these bins, the counts value indicates the occurrence frequency.

# dl.mutual.dist(x, y, counts)
   dl.mutual.dist(values, accessor_x, accessor_y, accessor_counts)

Computes the mutual information distance between two discrete variables. The return value is the second element in the array returned by dl.mutual. This method accepts either three arrays of numbers or an array of values objects and three corresponding accessors. This second option is equivalent to the first option after calling values.map(dl.$(accessor)) for each of the accessors. The x and y values represent discrete random variables for which to compute the mutual information. For each index i, the corresponding x and y values must denote a unique bin in the two-dimensional joint frequency distribution. For each of these bins, the counts value indicates the occurrence frequency.

Binning, Histograms and Profiles

# dl.bins(options)

Determine quantitative bins (e.g., for a histogram). Based on the parameters given, this method will search over a space of possible bins, aligning step sizes with a given number base and applying constraints such as the maximum number of allowable bins. Given a set of options (see below), returns an object describing the binning scheme, including start, stop and step properties.

The supported options properties are:

  • min: (required) The minimum bin value to consider.
  • max: (required) The maximum bin value to consider.
  • base: The number base to use for automatic bin determination (default is base 10).
  • maxbins: The maximum number of allowable bins.
  • step: An exact step size to use between bins. If provided, options such as maxbins will be ignored.
  • steps: An array of allowable step sizes to choose from.
  • minstep: A minimum allowable step size (particularly useful for integer values).
  • div: An array of scale factors indicating allowable subdivisions. The default value is [5, 2], which indicates that for base 10 numbers (the default base), the method may consider dividing bin sizes by 5 and/or 2. For example, for an initial step size of 10, the method can check if bin sizes of 2 (= 10/5), 5 (= 10/2), or 1 (= 10/(5*2)) might also satisfy the given constraints.

In addition, the returned bin descriptor includes two bound methods:

  • value: given an input value, will return the corresponding bin value
  • index: given an input value, will return the integer bin index

dl.bins({min:0, max:1, maxbins:10}); // {start:0, stop:1, step:0.1}
dl.bins({min:0, max:10, maxbins:5}); // {start:0, stop:10, step:2}
var b = dl.bins({min:5, max:10, maxbins:5}); // {start:5, stop:10, step:1}

b.index(5);   // 0
b.index(6);   // 1
b.index(7.2); // 2

b.value(5);   // 5
b.value(6.5); // 6
b.value(8.7); // 8
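
Once start and step are known, the two bound methods reduce to simple arithmetic: the index is the number of whole steps between start and the input, and the value is the lower edge of that bin. The sketch below illustrates only that mapping, not the search for a step size; `makeBins` is a hypothetical name, and floating-point rounding for fractional steps is not handled.

```javascript
// Standalone sketch of the value and index mappings for a fixed binning scheme.
// The step-size search performed by dl.bins is not reproduced here.
function makeBins(start, stop, step) {
  function index(v) {
    return Math.floor((v - start) / step);
  }
  return {
    start: start,
    stop: stop,
    step: step,
    index: index,
    value: function (v) { return start + step * index(v); } // lower bin edge
  };
}

var bins = makeBins(5, 10, 1);
console.log(bins.index(7.2)); // 2
console.log(bins.value(6.5)); // 6
console.log(bins.value(8.7)); // 8
```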

# dl.bins.date(options)

Determine bins over common date/time units, appropriate for both Date objects and UNIX timestamp values. Based on the span of the provided min and max options, this method first determines an appropriate time unit (years, months, days, etc.) and then uses dl.bins to produce a binning scheme. To ensure consistent processing, this method performs all date calculations using UTC time.

Options in addition to those supported by dl.bins are:

  • minbins: The minimum number of allowable bins (default 4).
  • utc: A boolean indicating if time binning should be performed on Coordinated Universal Time (UTC) Date values. The default value is false, in which case the local timezone is used instead.
  • unit: A string value indicating an explicit date/time unit to use for binning. If specified, no automatic search for an appropriate time unit will be performed. Valid values are the chronological units 'year', 'month', 'day', 'hour', 'minute', or 'second' and the pluralized periodic units 'months' (month of year), 'dates' (day of month), 'weekdays', 'hours' (hour of day), 'minutes' (minute of hour), or 'seconds' (second of minute).

var b = dl.bins.date({
  min: Date.parse('1/1/2000'),
  max: Date.parse('1/1/2010')
}); // {start: 2000, stop: 2010, step: 1} -> numerical year codes

b.index(Date.parse('1/1/2000')); // 0
b.index(Date.parse('1/1/2010')); // 10

b.value(b.start).toUTCString();  // "Sat, 01 Jan 2000 00:00:00 GMT"
b.value(b.stop).toUTCString();   // "Fri, 01 Jan 2010 00:00:00 GMT"

var b = dl.bins.date({
  min:  Date.parse('1/1/2000'),
  max:  Date.parse('1/1/2001'),
  unit: 'month'
}); // {start: 24000, stop: 24012, step: 1} -> numerical month codes

b.index(Date.parse('1/1/2000')); // 0
b.index(Date.parse('1/1/2001')); // 12

b.value(b.start).toUTCString();  // "Sat, 01 Jan 2000 00:00:00 GMT"
b.value(b.stop).toUTCString();   // "Mon, 01 Jan 2001 00:00:00 GMT"

# dl.$bin([values, accessor, options])

Create a function that maps values to binned values. The method optionally accepts an array of values, an accessor, and/or an options hash for dl.bins. Any combination of input arguments is legal so long as they are provided in the specified order. If the values array is provided, it will be used to configure the minimum and maximum bin values. If provided, the accessor will be used to retrieve a value prior to binning.

The options parameter may include a type property to inform the type of bins; for example, 'date' will cause dl.bins.date to be used to determine the binning scheme, 'string' will result in no additional binning. If the type property is not specified, this method will attempt to guess the type by calling dl.type(values, f). For more, see the discussion of the type property under dl.histogram.

var values = [1, 2.5, 6.4, 9.2];
var b = dl.$bin(values);
b(1.0); // 1
b(3.7); // 3

var values = [{v:1}, {v:2.5}, {v:6.4}, {v:9.2}];
var b = dl.$bin(values, 'v');
b({v:1.0}); // 1
b({v:3.7}); // 3

# dl.histogram(values[, options])
   dl.histogram(values, accessor[, options])

Compute a histogram for an array of values. Given an input values array, an optional accessor function and optional options hash, returns a histogram of per-value counts. Numerical and date values are binned (using dl.bins or dl.bins.date) prior to counting. The return value is an array of objects of the form {value: v, count: c}, where v is the (binned) value and c is the number of observed values in that bin.

The options hash may include any of the parameters accepted by dl.bins or dl.bins.date. If min and max options are not provided, they will be determined using dl.extent. The histogram method also accepts the following additional options:

  • type: The data type to use for binning. The valid values are 'string', 'number', 'integer' or 'date'. By default, the histogram method will attempt to automatically determine the type based on the first non-null value found in the values array. For string-typed data, the histogram contains counts of all distinct values using dl.unique. For number-typed data, dl.bins is used to determine the binning scheme. Integer-typed data is treated similarly, but with a minstep of 1 imposed. Date-typed data is binned using dl.bins.date.
  • sort: For string-typed data only, the histogram will be sorted by count when true. Otherwise, the histogram will use the natural order of the values.
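
For number-typed data, the histogram amounts to mapping each value to a bin and counting per bin. The sketch below shows that step for a binning scheme that is already known; it assumes that empty bins are included in the output and that a value equal to stop falls into the last bin, neither of which is stated above. The name `histogramFixed` is hypothetical.

```javascript
// Standalone sketch of counting numeric values into fixed-width bins.
// Assumes values lie within [start, stop]. Not datalib's source.
function histogramFixed(values, start, stop, step) {
  var binCount = Math.ceil((stop - start) / step);
  var bins = [];
  for (var i = 0; i < binCount; i++) {
    bins.push({value: start + i * step, count: 0});
  }

  values.forEach(function (v) {
    if (v == null || v !== v) return; // skip null, undefined and NaN
    // Clamp so that v === stop lands in the last bin.
    var index = Math.min(binCount - 1, Math.floor((v - start) / step));
    if (index >= 0) bins[index].count += 1;
  });
  return bins;
}

var histogram = histogramFixed([0, 1, 1.5, 2, 9.9, 10], 0, 10, 2);
console.log(histogram.map(function (b) { return b.count; })); // [ 3, 1, 0, 0, 2 ]
```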

# dl.profile(values[, accessor])

Compute a profile of summary statistics for a variable. An optional accessor function may be specified, which is equivalent to calling values.map(accessor) prior to profile calculation. The returned object contains a collection of computed statistics (see example below).

var x = dl.random.normal().samples(1000);
var p = dl.profile(x);
/*
{
  "unique": {...},      // hash of all value:count pairs
  "count": 1000,
  "valid": 1000,
  "missing": 0,
  "distinct": 1000,
  "min": -2.7983914656538427,
  "max": 2.9931226872439436,
  "mean": 0.024987537472992594,
  "stdev": 0.965105294571081,
  "median": 0.012544328964464974,
  "q1": -0.6686040453830552, // 25th percentile
  "q3": 0.7090708668163043,  // 75th percentile
  "modeskew": 0.01289310977623195
}
*/

# dl.summary(data[, fields])

Compute profiles for all variables in a data set. An optional fields array provides a list of variable names for which to compute summaries. If fields is unspecified, then all keys found on the first object in the input data array will be used as the variables. The summary method then invokes dl.profile for each variable and returns an array of the computed profiles. To generate a string representation of a summary, use dl.print.summary.

var data = [...];                // an array of data objects
var summary = dl.summary(data);  // summarize all variables
console.log(summary.toString()); // print summary statistics to the console
/*
[
{
  "field": ...,         // field name
  "unique": {...},      // hash of all value:count pairs
  "count": 1000,
  "valid": 1000,
  "missing": 0,
  "distinct": 1000,
  "min": -2.7983914656538427,
  "max": 2.9931226872439436,
  "mean": 0.024987537472992594,
  "stdev": 0.965105294571081,
  "median": 0.012544328964464974,
  "q1": -0.6686040453830552, // 25th percentile
  "q3": 0.7090708668163043,  // 75th percentile
  "modeskew": 0.01289310977623195
},
...
]
*/