#### Data types: dense and sparse local vectors
A dense vector stores its values in a single array of doubles.

A sparse vector is backed by two parallel arrays, indices and values, plus a field that records the vector's size.
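
To make the sparse layout concrete, here is a minimal pure-Python sketch that expands a (size, indices, values) triple into the equivalent dense list. The helper name `sparse_to_dense` is hypothetical and is not part of MLlib; it only illustrates how the two parallel arrays relate to the dense form.

```python
def sparse_to_dense(size, indices, values):
    """Expand a (size, indices, values) triple into a plain list of floats."""
    if len(indices) != len(values):
        raise ValueError("indices and values must have the same length")
    dense = [0.0] * size              # every position not listed is zero
    for i, v in zip(indices, values):
        dense[i] = float(v)           # place each stored value at its index
    return dense


dense = sparse_to_dense(3, [0, 2], [1.0, 3.0])
print(dense)
```

The triple used here is the same one passed to `Vectors.sparse(3, [0, 2], [1.0, 3.0])` in the cell below.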

MLlib recognizes the following types as dense vectors:
- NumPy’s **array**
- Python’s **list**, e.g., [1, 2, 3]

and the following as sparse vectors:
- MLlib’s **SparseVector**.
- SciPy’s **csc_matrix with a single column**

In [10]:
import numpy as np
import scipy.sparse as sps
from pyspark.mllib.linalg import Vectors

# Use a NumPy array as a dense vector.
dv1 = np.array([1.0, 0.0, 3.0])
# Use a Python list as a dense vector.
dv2 = [1.0, 0.0, 3.0]
# Create a SparseVector.
sv1 = Vectors.sparse(3, [0, 2], [1.0, 3.0])
# Use a single-column SciPy csc_matrix as a sparse vector.
sv2 = sps.csc_matrix((np.array([1.0, 3.0]), np.array([0, 2]),
                      np.array([0, 2])), shape=(3, 1))
print(dv1)
print(dv2)

[1. 0. 3.]
[1.0, 0.0, 3.0]


In [12]:
print(sv1)

(3,[0,2],[1.0,3.0])


In [13]:
print(sv2)

  (0, 0)	1.0
  (2, 0)	3.0


#### static parse(s): build a Vector from a string
Parse a string representation back into the Vector.

In [14]:
Vectors.parse(' ( 100,  [0],  [2])')

SparseVector(100, {0: 2.0})
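
As a rough illustration of what parsing the sparse form involves, the sketch below reads the same string with the standard library. The helper `parse_sparse` is hypothetical: it is not MLlib's parser, and it handles only the `(size, [indices], [values])` form.

```python
import ast


def parse_sparse(s):
    """Parse '(size, [indices], [values])' into (size, {index: value})."""
    # literal_eval safely evaluates the tuple literal; strip() drops outer spaces.
    size, indices, values = ast.literal_eval(s.strip())
    return int(size), {int(i): float(v) for i, v in zip(indices, values)}


size, entries = parse_sparse(' ( 100,  [0],  [2])')
print(size, entries)
```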

#### static sparse(size, *args): build a sparse Vector from indices and values
Create a sparse vector, using either a dictionary, a list of (index, value) pairs, 
or two separate arrays of indices and values (sorted by index).

Parameters
- size – Size of the vector.
- args – Non-zero entries, as a dictionary, list of tuples,
or two sorted lists containing indices and values.
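
The three argument shapes describe the same data. The following pure-Python sketch reduces each of them to one sorted (indices, values) pair. `normalize_sparse_args` is a hypothetical helper, not an MLlib function, and unlike the real API it sorts the two-list form itself instead of requiring sorted input.

```python
def normalize_sparse_args(*args):
    """Return (indices, values), sorted by index, for the three accepted shapes."""
    if len(args) == 2:
        # Two parallel sequences: indices and values.
        pairs = list(zip(args[0], args[1]))
    elif len(args) == 1 and isinstance(args[0], dict):
        # A dictionary mapping index -> value.
        pairs = list(args[0].items())
    elif len(args) == 1:
        # A list of (index, value) tuples.
        pairs = list(args[0])
    else:
        raise TypeError("expected a dict, a list of pairs, or two lists")
    pairs.sort(key=lambda p: p[0])
    indices = [int(i) for i, _ in pairs]
    values = [float(v) for _, v in pairs]
    return indices, values


from_dict = normalize_sparse_args({3: 5.5, 1: 1.0})
from_pairs = normalize_sparse_args([(1, 1.0), (3, 5.5)])
from_lists = normalize_sparse_args([1, 3], [1.0, 5.5])
print(from_dict == from_pairs == from_lists)
```

With pyspark available, the corresponding calls would be `Vectors.sparse(4, {1: 1.0, 3: 5.5})`, `Vectors.sparse(4, [(1, 1.0), (3, 5.5)])` and `Vectors.sparse(4, [1, 3], [1.0, 5.5])`.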

## Functions in the Spark ML stat module

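#### Correlation
Computes the pairwise correlation matrix of the features in a Vector column, using Pearson (the default) or Spearman correlation.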

In [3]:
from pyspark.ml.linalg import Vectors
from pyspark.ml.stat import Correlation
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('correlation example').getOrCreate()
data = [(Vectors.sparse(4, [(0, 1.0), (3, -2.0)]),),
        (Vectors.dense([4.0, 5.0, 0.0, 3.0]),),
        (Vectors.dense([6.0, 7.0, 0.0, 8.0]),),
        (Vectors.sparse(4, [(0, 9.0), (3, 1.0)]),)]
df = spark.createDataFrame(data, ["features"])

r1 = Correlation.corr(df, "features").head()
print("Pearson correlation matrix:\n" + str(r1[0]))

r2 = Correlation.corr(df, "features", "spearman").head()
print("Spearman correlation matrix:\n" + str(r2[0]))

Pearson correlation matrix:
DenseMatrix([[1.        , 0.05564149,        nan, 0.40047142],
             [0.05564149, 1.        ,        nan, 0.91359586],
             [       nan,        nan, 1.        ,        nan],
             [0.40047142, 0.91359586,        nan, 1.        ]])
Spearman correlation matrix:
DenseMatrix([[1.        , 0.10540926,        nan, 0.4       ],
             [0.10540926, 1.        ,        nan, 0.9486833 ],
             [       nan,        nan, 1.        ,        nan],
             [0.4       , 0.9486833 ,        nan, 1.        ]])
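
To see where these numbers come from, the sketch below recomputes a few entries in plain Python from the same four rows. It is a hand-rolled illustration, not Spark's implementation; the helper names `pearson`, `average_ranks` and `spearman` are assumptions of this example. It also shows why the third row and column are `nan`: that feature is constant, so its variance is zero and the correlation is undefined.

```python
import math


def pearson(xs, ys):
    """Pearson correlation; returns nan when either input has zero variance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return float("nan")
    return cov / math.sqrt(sxx * syy)


def average_ranks(xs):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(xs), average_ranks(ys))


# The four rows from the cell above, written out densely.
rows = [[1.0, 0.0, 0.0, -2.0],
        [4.0, 5.0, 0.0, 3.0],
        [6.0, 7.0, 0.0, 8.0],
        [9.0, 0.0, 0.0, 1.0]]
cols = list(zip(*rows))  # one tuple per feature

print(pearson(cols[0], cols[1]))   # entry (0, 1) of the Pearson matrix
print(spearman(cols[0], cols[1]))  # entry (0, 1) of the Spearman matrix
print(pearson(cols[0], cols[2]))   # constant feature, so nan
```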


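#### Chi-square test
Runs Pearson's chi-square independence test of each feature against the label.
An independence test checks whether two variables X and Y are related and quantifies how reliable that conclusion is. For a given number of degrees of freedom, a larger chi-square statistic (and so a smaller p-value) is stronger evidence that X and Y are related.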

In [13]:
from pyspark.ml.linalg import Vectors
from pyspark.ml.stat import ChiSquareTest

data = [(0.0, Vectors.dense(0.5, 10.0)),
        (0.0, Vectors.dense(1.5, 20.0)),
        (1.0, Vectors.dense(1.5, 30.0)),
        (0.0, Vectors.dense(3.5, 30.0)),
        (0.0, Vectors.dense(3.5, 40.0)),
        (1.0, Vectors.dense(3.5, 40.0))]
df = spark.createDataFrame(data, ["label", "features"])

r = ChiSquareTest.test(df, "features", "label").head()
print("pValues: " + str(r.pValues))
print("degreesOfFreedom: " + str(r.degreesOfFreedom))
print("statistics: " + str(r.statistics))

pValues: [0.6872892787909721,0.6822703303362126]
degreesOfFreedom: [2, 3]
statistics: [0.75,1.5]
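
The statistics and degrees of freedom above can be reproduced by hand from the contingency table of each feature against the label. The following is a minimal pure-Python sketch, not Spark's code; `chi_square_statistic` is a hypothetical helper. The p-value is computed only for the first feature, using the closed form exp(-x/2) that holds for 2 degrees of freedom.

```python
import math
from collections import Counter


def chi_square_statistic(feature, label):
    """Pearson chi-square statistic and degrees of freedom for two categorical columns."""
    n = len(feature)
    observed = Counter(zip(feature, label))   # cell counts; missing cells count as 0
    feature_totals = Counter(feature)          # row totals
    label_totals = Counter(label)              # column totals
    stat = 0.0
    for f, f_total in feature_totals.items():
        for l, l_total in label_totals.items():
            expected = f_total * l_total / n   # expected count under independence
            stat += (observed[(f, l)] - expected) ** 2 / expected
    dof = (len(feature_totals) - 1) * (len(label_totals) - 1)
    return stat, dof


labels = [0.0, 0.0, 1.0, 0.0, 0.0, 1.0]
feature0 = [0.5, 1.5, 1.5, 3.5, 3.5, 3.5]
feature1 = [10.0, 20.0, 30.0, 30.0, 40.0, 40.0]

stat0, dof0 = chi_square_statistic(feature0, labels)
stat1, dof1 = chi_square_statistic(feature1, labels)
p0 = math.exp(-stat0 / 2)  # valid only because dof0 == 2

print(stat0, dof0, p0)
print(stat1, dof1)
```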


In [22]:
spark.stop()

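#### Summary statistics with Summarizer
Computes statistics over a Vector column of a DataFrame, including max, min, mean, variance, number of non-zero entries, and count.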

In [6]:
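from pyspark.ml.stat import Summarizer
from pyspark.sql import Row, SparkSession
from pyspark.ml.linalg import Vectors

# Start (or reuse) a session, since the previous one was stopped above.
spark = SparkSession.builder.appName('summarizer example').getOrCreate()
sc = spark.sparkContext

df = sc.parallelize([Row(weight=1.0, features=Vectors.dense(1.0, 1.0, 1.0)),
                     Row(weight=0.0, features=Vectors.dense(1.0, 2.0, 3.0))]).toDF()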

# create summarizer for multiple metrics "mean" and "count"
summarizer = Summarizer.metrics("mean", "count")

In [8]:
df.show()

+-------------+------+
|     features|weight|
+-------------+------+
|[1.0,1.0,1.0]|   1.0|
|[1.0,2.0,3.0]|   0.0|
+-------------+------+



In [9]:
df.select(summarizer.summary(df.features, df.weight)).show(truncate=False)

+-----------------------------------+
|aggregate_metrics(features, weight)|
+-----------------------------------+
|[[1.0,1.0,1.0], 1]                 |
+-----------------------------------+



In [10]:
df.select(summarizer.summary(df.features)).show(truncate=False)

+--------------------------------+
|aggregate_metrics(features, 1.0)|
+--------------------------------+
|[[1.0,1.5,2.0], 2]              |
+--------------------------------+



In [11]:
df.select(Summarizer.mean(df.features, df.weight)).show(truncate=False)

+--------------+
|mean(features)|
+--------------+
|[1.0,1.0,1.0] |
+--------------+



In [12]:
df.select(Summarizer.mean(df.features)).show(truncate=False)

+--------------+
|mean(features)|
+--------------+
|[1.0,1.5,2.0] |
+--------------+
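
The difference between the weighted and unweighted means can be checked with a few lines of plain Python. This is a sketch of the arithmetic only, not of how Summarizer is implemented; `weighted_mean` is a hypothetical helper. With weights 1.0 and 0.0 the second row contributes nothing, which is why the weighted mean equals the first row.

```python
def weighted_mean(rows, weights=None):
    """Per-coordinate weighted mean; every row gets weight 1.0 when weights is None."""
    if weights is None:
        weights = [1.0] * len(rows)
    total = sum(weights)
    dim = len(rows[0])
    return [sum(w * row[j] for row, w in zip(rows, weights)) / total
            for j in range(dim)]


rows = [[1.0, 1.0, 1.0],
        [1.0, 2.0, 3.0]]

weighted = weighted_mean(rows, [1.0, 0.0])
unweighted = weighted_mean(rows)
print(weighted)
print(unweighted)
```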

