<h1>Basic Statistics</h1>                  
<ul>
  <li>Summary statistics</li>
  <li>Correlations</li>
  <li>Stratified sampling</li>
  <li>Hypothesis testing
      <ul>
          <li>Streaming Significance Testing</li>
      </ul>
  </li>
  <li>Random data generation</li>
  <li>Kernel density estimation</li>
</ul>
<h2>Summary statistics</h2>
<p>We provide column summary statistics for <code>RDD[Vector]</code> through the function <code>colStats</code>
available in <code>Statistics</code>.</p>
<p><a href="http://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#pyspark.mllib.stat.Statistics.colStats"><code>colStats()</code></a> returns an instance of
<a href="http://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#pyspark.mllib.stat.MultivariateStatisticalSummary"><code>MultivariateStatisticalSummary</code></a>, which contains the column-wise max, min, mean, variance, and number of nonzeros, as well as the
total count.</p>
<p>Refer to the <a href="http://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#pyspark.mllib.stat.MultivariateStatisticalSummary"><code>MultivariateStatisticalSummary</code> Python docs</a> for more details on the API.</p>

<dl>
<dt>
<em>static </em><tt>colStats</tt><big>(</big><em>rdd</em><big>)</big></dt>
<dd><p>Computes column-wise summary statistics for the input RDD[Vector].</p>
<table>
<colgroup><col>
<col>
</colgroup><tbody valign="top">
<tr><th>Parameters:</th><td><strong>rdd</strong> – an RDD[Vector] for which column-wise summary statistics
are to be computed.</td>
</tr>
<tr><th>Returns:</th><td><tt><span>MultivariateStatisticalSummary</span></tt> object containing
column-wise summary statistics.</td>
</tr>
</tbody>
</table>
</dd></dl>

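<p>The <code>colStats</code> function of <code>Statistics</code> computes column-wise statistics: the maximum, minimum, mean, variance, L1 norm, and L2 norm of each column.</p>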

In [1]:
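import numpy as np
from pyspark.mllib.stat import Statistics

# `sc` is the SparkContext provided by the notebook or PySpark shell.
# Each NumPy array is one row; colStats summarizes each column.
mat = sc.parallelize(
    [np.array([1.0, 10.0, 100.0]), np.array([2.0, 20.0, 200.0]), np.array([3.0, 30.0, 300.0])]
)
summary = Statistics.colStats(mat)
print(summary.mean())         # column-wise mean
print(summary.variance())     # column-wise variance
print(summary.numNonzeros())  # number of nonzero entries per column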

[   2.   20.  200.]
[  1.00000000e+00   1.00000000e+02   1.00000000e+04]
[ 3.  3.  3.]


In [2]:
mat.collect()

[array([   1.,   10.,  100.]),
 array([   2.,   20.,  200.]),
 array([   3.,   30.,  300.])]
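
<p>As a cross-check on the numbers printed above, the same column statistics can be computed without Spark. This is only a sketch using the Python standard library; the names <code>rows</code>, <code>columns</code>, <code>col_mean</code>, <code>col_variance</code>, and <code>col_nonzeros</code> are illustrative and not part of the MLlib API. It suggests that the variance reported in the output above is the unbiased sample variance (divisor <code>n - 1</code>).</p>

```python
from statistics import mean, variance

# The same three rows used in the colStats example.
rows = [
    [1.0, 10.0, 100.0],
    [2.0, 20.0, 200.0],
    [3.0, 30.0, 300.0],
]

# Transpose the rows so that each tuple holds one column.
columns = list(zip(*rows))

col_mean = [mean(col) for col in columns]
# statistics.variance is the unbiased sample variance (divides by n - 1).
col_variance = [variance(col) for col in columns]
col_nonzeros = [sum(1 for value in col if value != 0.0) for col in columns]

print(col_mean)      # [2.0, 20.0, 200.0]
print(col_variance)  # [1.0, 100.0, 10000.0]
print(col_nonzeros)  # [3, 3, 3]
```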

<h2>Correlations</h2>
<p>Calculating the correlation between two series of data is a common operation in statistics. In <code>spark.mllib</code> we provide the flexibility to calculate pairwise correlations among many series. The supported correlation methods are currently Pearson’s and Spearman’s correlation.</p>
<p><a href="http://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#pyspark.mllib.stat.Statistics"><code>Statistics</code></a> provides methods to calculate correlations between series. Depending on the type of input, two <code>RDD[Double]</code>s or an <code>RDD[Vector]</code>, the output will be a <code>Double</code> or the correlation <code>Matrix</code> respectively.</p>
<p>Refer to the <a href="http://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#pyspark.mllib.stat.Statistics"><code>Statistics</code> Python docs</a> for more details on the API.</p>

<dl>
<dt>
<em>static </em><tt>corr</tt><big>(</big><em>x</em>, <em>y=None</em>, <em>method=None</em><big>)</big></dt>
<dd><p>Compute the correlation (matrix) for the input RDD(s) using the
specified method. Methods currently supported: <cite>pearson (default), spearman</cite>.</p>
<p>If <font color="red">a single RDD of Vectors</font> is passed in, a correlation matrix comparing the columns in the input RDD is returned. Use <cite>method=</cite> to specify the method to be used for single RDD input. If <font color="red">two RDDs of floats</font> are passed in, a single float is returned.</p>
<table>
<colgroup><col>
<col>
</colgroup><tbody valign="top">
<tr><th>Parameters:</th><td><ul>
<li><strong>x</strong> – an RDD of vectors for which the correlation matrix is to be computed, or an RDD of floats of the same cardinality as y when y is specified.</li>
<li><strong>y</strong> – an RDD of float of the same cardinality as x.</li>
<li><strong>method</strong> – String specifying the method to use for computing correlation.
Supported: <cite>pearson</cite> (default), <cite>spearman</cite></li>
</ul>
</td>
</tr>
<tr><th>Returns:</th><td><p><a href="http://baike.baidu.com/view/3523768.htm" target="_blank">Correlation matrix</a> comparing columns in x.</p>
</td>
</tr>
</tbody>
</table>
</dd></dl>

In [3]:
# Two series of equal length: corr returns a single float.
seriesX = sc.parallelize([1.0, 2.0, 3.0, 3.3, 5.0])
seriesY = sc.parallelize([11.0, 22.0, 33.0, 33.0, 555.0])

print("Correlation(pearson) is " + str(Statistics.corr(seriesX, seriesY, method="pearson")))
print("Correlation(spearman) is " + str(Statistics.corr(seriesX, seriesY, method="spearman")))

# One RDD of vectors: corr returns the matrix of pairwise column correlations.
data = sc.parallelize(
    [np.array([1.0, 10.0, 100.0]), np.array([2.0, 20.0, 200.0]), np.array([5.0, 33.0, 366.0])]
)
print("pearson--")
print(Statistics.corr(data, method="pearson"))
print("spearman--")
print(Statistics.corr(data, method="spearman"))

Correlation(pearson) is 0.820289577391524
Correlation(spearman) is 0.9746794344808964
pearson--
[[ 1.          0.97888347  0.99038957]
 [ 0.97888347  1.          0.99774832]
 [ 0.99038957  0.99774832  1.        ]]
spearman--
[[ 1.  1.  1.]
 [ 1.  1.  1.]
 [ 1.  1.  1.]]


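<p>The Pearson correlation coefficient expresses the linear correlation between two numeric variables and is generally suited to normally distributed data. Its range is [-1, 1]: a value of 0 indicates no linear correlation, values in [-1, 0) indicate negative correlation, and values in (0, 1] indicate positive correlation.<br/>
The Spearman correlation coefficient also expresses the correlation between two variables, but its requirements on the distribution of the variables are less strict than those of the Pearson coefficient. It is also better suited to measuring the rank-order relationship between variables.</p>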
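
<p>The two coefficients for <code>seriesX</code> and <code>seriesY</code> can be reproduced by hand. The following is a minimal sketch in plain Python, not the MLlib implementation; the helper names <code>pearson</code>, <code>average_ranks</code>, and <code>spearman</code> are hypothetical. It assumes that tied values share the mean of their ranks, an assumption that is consistent with the Spearman value printed above.</p>

```python
import math


def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """Rank values starting at 1; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2.0 + 1.0
        for position in range(start, end + 1):
            ranks[order[position]] = shared_rank
        start = end + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


series_x = [1.0, 2.0, 3.0, 3.3, 5.0]
series_y = [11.0, 22.0, 33.0, 33.0, 555.0]

print(pearson(series_x, series_y))   # approximately 0.8203
print(spearman(series_x, series_y))  # approximately 0.9747
```

<p>The outlier <code>555.0</code> pulls the Pearson value down because the relationship is far from linear, while the ranks are still almost perfectly aligned, which is why the Spearman value stays close to 1.</p>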

<h2>Stratified sampling</h2>
<p>Unlike the other statistics functions, which reside in <code>spark.mllib</code>, the stratified sampling methods,
<code>sampleByKey</code> and <code>sampleByKeyExact</code>, can be performed on RDDs of key-value pairs. For stratified sampling, the keys can be thought of as a label and the value as a specific attribute. For example,
the key can be man or woman, or document ids, and the respective values can be the list of ages of the people in the population or the list of words in the documents. The <code>sampleByKey</code> method flips a coin to decide whether each observation is sampled, so it requires only one pass over the data and provides an <em>expected</em> sample size. <code>sampleByKeyExact</code> requires significantly more resources than the per-stratum simple random sampling used in <code>sampleByKey</code>, but provides the exact sample size with 99.99% confidence. <code>sampleByKeyExact</code> is currently not supported in Python.</p>

<p><a href="http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.sampleByKey" target="_blank"><code>sampleByKey()</code></a> allows users to sample approximately $\lceil f_k \cdot n_k \rceil \, \forall k \in K$ items, where $f_k$ is the desired fraction for key $k$, $n_k$ is the number of key-value pairs for key $k$, and $K$ is the set of keys.</p>
<p><em>Note:</em> <code>sampleByKeyExact()</code> is currently not supported in Python.</p>
<dl>
    <dt>
        <tt>sampleByKey</tt>
        <big>(</big><em>withReplacement</em>, <em>fractions</em>, <em>seed=None</em><big>)</big>
    </dt>
<dd><p>Return a subset of this RDD sampled by key (via stratified sampling). Create a sample of this RDD using variable sampling rates for different keys as specified by fractions, a key to sampling rate map.</p>
</dd>
</dl>

In [4]:
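# Key-value pairs: the key is the stratum, the value is the observation.
data = sc.parallelize([(1, 'a'), (1, 'b'), (1, 'g'), (2, 'c'), (2, 'd'), (2, 'e'), (3, 'f'), (3, 'a'), (3, 's'), (3, 'z')])
# Sampling fraction per key; the sample is random, so results vary between runs.
fractions = {1: 0.1, 2: 0.6, 3: 0.3}
approxSample = data.sampleByKey(False, fractions)  # False: sample without replacement
approxSample.collect()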

[(2, 'c'), (3, 's')]
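
<p>The coin-flip idea behind sampling without replacement can be sketched in plain Python. This is an illustration only, not Spark's implementation: the function <code>sample_by_key</code> and the variable <code>target_sizes</code> are hypothetical names, and the local random generator will not reproduce the sample shown above. Each pair is kept independently with the probability assigned to its key, which is why the sample size is only approximate.</p>

```python
import math
import random
from collections import Counter


def sample_by_key(pairs, fractions, seed=None):
    """Keep each (key, value) pair independently with probability fractions[key]."""
    rng = random.Random(seed)
    return [(key, value) for key, value in pairs if rng.random() < fractions[key]]


data = [(1, 'a'), (1, 'b'), (1, 'g'), (2, 'c'), (2, 'd'), (2, 'e'),
        (3, 'f'), (3, 'a'), (3, 's'), (3, 'z')]
fractions = {1: 0.1, 2: 0.6, 3: 0.3}

# Approximate per-key sample size, ceil(f_k * n_k), as in the formula above.
counts = Counter(key for key, _ in data)
target_sizes = {key: math.ceil(fractions[key] * counts[key]) for key in sorted(counts)}
print(target_sizes)  # {1: 1, 2: 2, 3: 2}

# One pass over the data; the result changes with the seed.
print(sample_by_key(data, fractions, seed=42))
```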

<h2>Hypothesis testing</h2>
<p>Hypothesis testing is a powerful tool in statistics to determine whether a result is statistically significant, that is, whether it is unlikely to have occurred by chance. <code>spark.mllib</code> currently supports Pearson&#8217;s chi-squared ($\chi^2$) tests for goodness of fit and independence. The input data types determine whether the goodness of fit or the independence test is conducted. The goodness of fit test requires an input type of <code>Vector</code>, whereas the independence test requires a <code>Matrix</code> as input.</p>
<p><code>spark.mllib</code> also supports the input type <code>RDD[LabeledPoint]</code> to enable feature selection via chi-squared independence tests.</p>
<p><a href="http://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#pyspark.mllib.stat.Statistics"><code>Statistics</code></a> provides methods to
run Pearson&#8217;s chi-squared tests. The following example demonstrates how to run and interpret hypothesis tests.</p>
<p>Refer to the <a href="http://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#pyspark.mllib.stat.Statistics"><code>Statistics</code> Python docs</a> for more details on the API.</p>

<dl>
<dt>
<em>static </em><tt>chiSqTest</tt><big>(</big><em>observed</em>, <em>expected=None</em><big>)</big></dt>
<dd><p>If <cite>observed</cite> is a Vector, conduct Pearson’s chi-squared goodness of fit test of the observed data against the expected distribution, or against the uniform distribution (by default), with each category having an expected frequency of <cite>1/len(observed)</cite>. (Note: <cite>observed</cite> cannot contain negative values.)</p>
<p>If <cite>observed</cite> is a matrix, conduct Pearson’s independence test on the input contingency matrix, which cannot contain negative entries or columns or rows that sum up to 0.</p>
<p>If <cite>observed</cite> is an RDD of LabeledPoint, conduct Pearson’s independence test for every feature against the label across the input RDD. For each feature, the (feature, label) pairs are converted into a
contingency matrix for which the chi-squared statistic is computed. All label and feature values must be categorical.</p>
<table frame="void" rules="none">
<colgroup><col class="field-name">
<col >
</colgroup><tbody valign="top">
<tr ><th>Parameters:</th><td ><ul>
<li><strong>observed</strong> – it could be a vector containing the observed categorical counts/relative frequencies, or the contingency matrix (containing either counts or relative frequencies), or an RDD of LabeledPoint containing the labeled dataset with categorical features. Real-valued features will be treated as categorical for each distinct value.</li>
<li><strong>expected</strong> – Vector containing the expected categorical counts/relative frequencies. <cite>expected</cite> is rescaled if the <cite>expected</cite> sum differs from the <cite>observed</cite> sum.</li>
</ul>
</td>
</tr>
<tr><th>Returns:</th><td><p>ChiSquaredTest object containing the test statistic, degrees
of freedom, p-value, the method used, and the null hypothesis.</p>
</td>
</tr>
</tbody>
</table>
</dd></dl>


In [5]:
from pyspark.mllib.linalg import Matrices, Vectors
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.stat import Statistics

# Goodness-of-fit test: a vector of observed frequencies, compared with
# the uniform distribution because no expected vector is given.
vec = Vectors.dense(0.1, 0.15, 0.2, 0.3, 0.25)
goodnessofFitTestResult = Statistics.chiSqTest(vec)
print("%s\n" % goodnessofFitTestResult)

# Independence test: a 3x2 contingency matrix given in column-major order.
mat = Matrices.dense(3, 2, [1.0, 3.0, 5.0, 2.0, 4.0, 6.0])
independenceTestResult = Statistics.chiSqTest(mat)
print("%s\n" % independenceTestResult)

# Independence test of every feature against the label.
# All three labels are equal here, so each test has 0 degrees of freedom.
obs = sc.parallelize(
    [LabeledPoint(1.0, [1.0, 0.0, 3.0]),
     LabeledPoint(1.0, [1.0, 2.0, 0.0]),
     LabeledPoint(1.0, [-1.0, 0.0, -0.5])]
)
featureTestResults = Statistics.chiSqTest(obs)
for i, result in enumerate(featureTestResults):
    print('Column %d:\n%s' % (i + 1, result))

Chi squared test summary:
method: pearson
degrees of freedom = 4 
statistic = 0.12499999999999999 
pValue = 0.998126379239318 
No presumption against null hypothesis: observed follows the same distribution as expected..

Chi squared test summary:
method: pearson
degrees of freedom = 2 
statistic = 0.14141414141414144 
pValue = 0.931734784568187 
No presumption against null hypothesis: the occurrence of the outcomes is statistically independent..

Column 1:
Chi squared test summary:
method: pearson
degrees of freedom = 0 
statistic = 0.0 
pValue = 1.0 
No presumption against null hypothesis: the occurrence of the outcomes is statistically independent..
Column 2:
Chi squared test summary:
method: pearson
degrees of freedom = 0 
statistic = 0.0 
pValue = 1.0 
No presumption against null hypothesis: the occurrence of the outcomes is statistically independent..
Column 3:
Chi squared test summary:
method: pearson
degrees of freedom = 0 
statistic = 0.0 
pValue = 1.0 
No presumption against n
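
<p>The statistic and degrees of freedom in the first two summaries above can be checked by hand. The sketch below uses plain Python rather than MLlib; <code>goodness_of_fit</code> and <code>independence</code> are hypothetical helper names. It computes only the statistic $\sum (O - E)^2 / E$ and the degrees of freedom; the p-value is omitted because it needs the chi-squared distribution function, which the standard library does not provide.</p>

```python
def goodness_of_fit(observed, expected=None):
    """Pearson chi-squared statistic and degrees of freedom for a vector."""
    n = len(observed)
    total = sum(observed)
    if expected is None:
        # Default: uniform distribution over the categories.
        expected = [total / n] * n
    else:
        # Rescale expected so that it sums to the same total as observed.
        scale = total / sum(expected)
        expected = [e * scale for e in expected]
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return statistic, n - 1


def independence(table):
    """Pearson chi-squared statistic and degrees of freedom for a contingency table."""
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    total = sum(row_sums)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_sums[i] * col_sums[j] / total
            statistic += (observed - expected) ** 2 / expected
    return statistic, (len(row_sums) - 1) * (len(col_sums) - 1)


gof_stat, gof_df = goodness_of_fit([0.1, 0.15, 0.2, 0.3, 0.25])
print(gof_stat, gof_df)  # approximately 0.125, with 4 degrees of freedom

# Matrices.dense(3, 2, [1, 3, 5, 2, 4, 6]) is column-major, so the rows are:
table = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
ind_stat, ind_df = independence(table)
print(ind_stat, ind_df)  # approximately 0.1414, with 2 degrees of freedom
```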

<p>Additionally, <code>spark.mllib</code> provides a 1-sample, 2-sided implementation of the Kolmogorov-Smirnov (KS) test for equality of probability distributions. By providing the name of a theoretical distribution (currently only the normal distribution is supported) and its parameters, or a function to calculate the cumulative distribution according to a given theoretical distribution, the user can test the null hypothesis that their sample is drawn from that distribution. If the user tests against the normal distribution (<code>distName="norm"</code>) but does not provide distribution parameters, the test defaults to the standard normal distribution and logs an appropriate message.</p>
<p><a href="http://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#pyspark.mllib.stat.Statistics"><code>Statistics</code></a> provides methods to run a 1-sample, 2-sided Kolmogorov-Smirnov test. The following example demonstrates how to run
and interpret the hypothesis tests.</p>
<p>Refer to the <a href="http://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#pyspark.mllib.stat.Statistics"><code>Statistics</code> Python docs</a> for more details on the API.</p>

In [6]:
from pyspark.mllib.stat import Statistics

parallelData = sc.parallelize([0.1, 0.15, 0.2, 0.3, 0.25])
# Test against the normal distribution with mean 0 and standard deviation 1.
testResult = Statistics.kolmogorovSmirnovTest(parallelData, "norm", 0, 1)
print(testResult)

Kolmogorov-Smirnov test summary:
degrees of freedom = 0 
statistic = 0.539827837277029 
pValue = 0.06821463111921133 
Low presumption against null hypothesis: Sample follows theoretical distribution.
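
<p>The KS statistic is the largest vertical distance between the empirical distribution function of the sample and the theoretical one. The following plain-Python sketch, with the hypothetical helpers <code>standard_normal_cdf</code> and <code>ks_statistic</code>, recomputes the statistic for the sample above. It is not the MLlib code path and does not compute the p-value.</p>

```python
import math


def standard_normal_cdf(x):
    """Cumulative distribution function of N(0, 1)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def ks_statistic(sample, cdf):
    """Largest gap between the empirical CDF of sample and the given CDF."""
    ordered = sorted(sample)
    n = len(ordered)
    distance = 0.0
    for i, x in enumerate(ordered):
        theoretical = cdf(x)
        # The empirical CDF jumps from i/n to (i + 1)/n at x, so the gap is
        # checked on both sides of the jump.
        distance = max(distance, theoretical - i / n, (i + 1) / n - theoretical)
    return distance


sample = [0.1, 0.15, 0.2, 0.3, 0.25]
print(ks_statistic(sample, standard_normal_cdf))  # approximately 0.5398
```

<p>All five values lie just above 0, where the standard normal CDF is already above 0.5, so the largest gap occurs immediately before the smallest observation.</p>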


<h2>Random data generation</h2>
<p>Random data generation is useful for randomized algorithms, prototyping, and performance testing.
<code>spark.mllib</code> supports generating random RDDs with i.i.d. values drawn from a given distribution: uniform, standard normal, or Poisson.</p>
<p><a href="http://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#pyspark.mllib.random.RandomRDDs"><code>RandomRDDs</code></a> provides factory methods to generate random double RDDs or vector RDDs. The following example generates a random double RDD whose values follow the standard normal distribution <code>N(0, 1)</code>, and then maps it to <code>N(1, 4)</code>.</p>
<p>Refer to the <a href="http://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#pyspark.mllib.random.RandomRDDs"><code>RandomRDDs</code> Python docs</a> for more details on the API.</p>

In [7]:
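from pyspark.mllib.random import RandomRDDs

# 1,000,000 i.i.d. N(0, 1) values, spread evenly over 10 partitions.
u = RandomRDDs.normalRDD(sc, 1000000, 10)

# Shift and scale to N(1, 4): mean 1, standard deviation 2.
v = u.map(lambda x: 1.0 + 2.0 * x)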
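
<p>In <code>N(1, 4)</code> the second parameter is the variance, so the map <code>1.0 + 2.0 * x</code> scales by the standard deviation 2. A small local sketch, using Python's <code>random</code> module in place of <code>RandomRDDs</code> and a hypothetical helper <code>mean_and_variance</code>, illustrates the effect of the transformation on a smaller sample.</p>

```python
import random


def mean_and_variance(values):
    """Return the mean and the population variance of a list of numbers."""
    n = len(values)
    m = sum(values) / n
    return m, sum((x - m) ** 2 for x in values) / n


# Fixed seed so that the sketch is repeatable.
rng = random.Random(0)
u = [rng.gauss(0.0, 1.0) for _ in range(100000)]  # draws from N(0, 1)
v = [1.0 + 2.0 * x for x in u]                    # mapped to N(1, 4)

mean_u, var_u = mean_and_variance(u)
mean_v, var_v = mean_and_variance(v)
print(mean_u, var_u)  # close to 0 and 1
print(mean_v, var_v)  # close to 1 and 4
```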

<h2 id="kernel-density-estimation">Kernel density estimation</h2>
<p><a href="https://en.wikipedia.org/wiki/Kernel_density_estimation">Kernel density estimation</a> is a technique
useful for visualizing empirical probability distributions without requiring assumptions about the particular distribution that the observed samples are drawn from. It computes an estimate of the probability density function of a random variable, evaluated at a given set of points. It achieves this estimate by expressing the PDF of the empirical distribution at a particular point as the mean of PDFs of normal distributions centered around each of the samples.</p>
<p><a href="http://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#pyspark.mllib.stat.KernelDensity"><code>KernelDensity</code></a> provides methods to compute kernel density estimates from an RDD of samples. The following example demonstrates how
to do so.</p>
<p>Refer to the <a href="http://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#pyspark.mllib.stat.KernelDensity"><code>KernelDensity</code> Python docs</a> for more details on the API.</p>

In [8]:
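from pyspark.mllib.stat import KernelDensity

data = sc.parallelize([1.0, 1.0, 1.0, 2.0, 3.0, 4.0, 5.0, 5.0, 6.0, 7.0, 8.0, 9.0, 9.0])

# Build the estimator from the sample data and set the
# standard deviation (bandwidth) of the Gaussian kernels.
kd = KernelDensity()
kd.setSample(data)
kd.setBandwidth(3.0)

# Evaluate the density estimate at the given points.
densities = kd.estimate([-1.0, 2.0, 5.0])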
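
<p>The estimate described above, the mean of normal PDFs centered on each sample, is short enough to write out directly. This is a plain-Python sketch under the assumption that the bandwidth is the standard deviation of each Gaussian kernel; <code>gaussian_pdf</code> and <code>kernel_density</code> are hypothetical names, not part of <code>KernelDensity</code>.</p>

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of the normal distribution N(mean, std^2) at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def kernel_density(sample, bandwidth, points):
    """Average the Gaussian kernels centered on each sample, at each point."""
    return [
        sum(gaussian_pdf(point, s, bandwidth) for s in sample) / len(sample)
        for point in points
    ]


sample = [1.0, 1.0, 1.0, 2.0, 3.0, 4.0, 5.0, 5.0, 6.0, 7.0, 8.0, 9.0, 9.0]
densities = kernel_density(sample, 3.0, [-1.0, 2.0, 5.0])
print(densities)
```

<p>A larger bandwidth widens each kernel and smooths the estimate; a smaller one follows the individual samples more closely.</p>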