Environment: Spark 2.0.0, Python 3.5.2, Jupyter notebook server 4.2.1.

<h1 class="title">Data Types - RDD-based API</h1>
<ul id="markdown-toc">
  <li>Local vector</li>
  <li>Labeled point</li>
  <li>Local matrix</li>
  <li>Distributed matrix  
      <ul>
          <li>RowMatrix</li>
          <li>IndexedRowMatrix</li>
          <li>CoordinateMatrix</li>
          <li>BlockMatrix</li>
      </ul>
  </li>
</ul>
<p>MLlib supports local vectors and matrices stored on a single machine, as well as distributed matrices backed by one or more RDDs. Local vectors and local matrices are simple data models that serve as public interfaces. The underlying linear algebra operations are provided by <a href="http://www.scalanlp.org/">Breeze</a>. A training example used in supervised learning is called a &#8220;labeled point&#8221; in MLlib.</p>
<h2>Local vector</h2>
<p>A local vector has integer-typed and 0-based indices and double-typed values, stored on a single machine.  MLlib supports two types of local vectors: dense and sparse.  A dense vector is backed by a double array representing its entry values, while a sparse vector is backed by two parallel arrays: indices and values.  For example, a vector <code>(1.0, 0.0, 3.0)</code> can be represented in dense format as <code>[1.0, 0.0, 3.0]</code> or in sparse format as <code>(3, [0, 2], [1.0, 3.0])</code>, where <code>3</code> is the size of the vector.</p>
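<p>Sparse and dense vectors are two representations of the same vector.<br/>The difference: a dense vector stores its values in an ordinary Double array, while a sparse vector consists of two parallel arrays, indices and values. For example, the vector (1.0, 0.0, 1.0, 3.0) is [1.0, 0.0, 1.0, 3.0] in dense form and (4, [0, 2, 3], [1.0, 1.0, 3.0]) in sparse form. The leading 4 is the length of the vector (its number of elements), [0, 2, 3] is the indices array, and [1.0, 1.0, 3.0] is the values array: position 0 holds 1.0, position 2 holds 1.0, position 3 holds 3.0, and every other position is 0.</p>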
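
The relationship between the two representations can be sketched in plain Python. This is only an illustration of the `(size, indices, values)` layout described above, not MLlib code; the helper names `sparse_to_dense` and `dense_to_sparse` are hypothetical.

```python
def sparse_to_dense(size, indices, values):
    """Expand a (size, indices, values) sparse vector into a dense list."""
    if len(indices) != len(values):
        raise ValueError("indices and values must have the same length")
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = float(v)
    return dense


def dense_to_sparse(dense):
    """Collapse a dense list into (size, indices, values), dropping zeros."""
    indices = [i for i, v in enumerate(dense) if v != 0]
    values = [float(dense[i]) for i in indices]
    return len(dense), indices, values


print(sparse_to_dense(4, [0, 2, 3], [1.0, 1.0, 3.0]))  # [1.0, 0.0, 1.0, 3.0]
print(dense_to_sparse([1.0, 0.0, 3.0]))                # (3, [0, 2], [1.0, 3.0])
```

The sparse form only pays off when most entries are zero, since it stores an index alongside every value.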
<p>MLlib recognizes the following types as dense vectors:</p>
<ul>
   <li>NumPy&#8217;s <a href="http://docs.scipy.org/doc/numpy/reference/generated/numpy.array.html"><code>array</code></a></li>
   <li>Python&#8217;s list, e.g., <code>[1, 2, 3]</code></li>
</ul>
<p>and the following as sparse vectors:</p>
<ul>
    <li>MLlib&#8217;s <a href="http://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#pyspark.mllib.linalg.SparseVector"><code>SparseVector</code></a>.</li>
    <li>SciPy&#8217;s <a href="http://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csc_matrix.html#scipy.sparse.csc_matrix"><code>csc_matrix</code></a> with a single column</li>
</ul>
<p>We recommend using NumPy arrays over lists for efficiency, and using the factory methods implemented
in <a href="http://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#pyspark.mllib.linalg.Vectors"><code>Vectors</code></a> to create sparse vectors.</p>
<p>Refer to the <a href="http://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#pyspark.mllib.linalg.Vectors"><code>Vectors</code> Python docs</a> for more details on the API.</p>

In [1]:
import numpy as np
import scipy.sparse as sps
from pyspark.mllib.linalg import Vectors

# Use a NumPy array as a dense vector.
dv1 = np.array([1.0, 0.0, 3.0])
# Use a Python list as a dense vector.
dv2 = [1.0, 0.0, 3.0]
# Create a SparseVector.
sv1 = Vectors.sparse(3, [0, 2], [1.0, 3.0])
# Use a single-column SciPy csc_matrix as a sparse vector.
sv2 = sps.csc_matrix((np.array([1.0, 3.0]), np.array([0, 2]), np.array([0, 2])), shape = (3, 1))

In [2]:
dv1

array([ 1.,  0.,  3.])

In [3]:
dv2

[1.0, 0.0, 3.0]

In [4]:
sv1.toArray()

array([ 1.,  0.,  3.])

In [5]:
sv2.toarray()

array([[ 1.],
       [ 0.],
       [ 3.]])

csc_matrix: Compressed Sparse Column matrix
>csc_matrix((data, indices, indptr), [shape=(M, N)])<br/>&emsp;&emsp;is the standard CSC representation where the row indices for column i are stored in <strong>`indices[indptr[i]:indptr[i+1]]`</strong> and their corresponding values are stored in <strong>`data[indptr[i]:indptr[i+1]]`</strong>. If the shape parameter is not supplied, the matrix dimensions are inferred from the index arrays.

In [6]:
indptr = np.array([0, 2, 3, 6])
indices = np.array([0, 2, 2, 0, 1, 2])
data = np.array([1, 2, 3, 4, 5, 6])
sps.csc_matrix((data, indices, indptr),shape=(3, 3)).toarray()

array([[1, 0, 4],
       [0, 0, 5],
       [2, 3, 6]])

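For column 0, indptr[0]:indptr[1] = 0:2, so the row indices are indices[0:2] = [0, 2] and the values are data[0:2] = [1, 2]: column 0 holds 1 in row 0 and 2 in row 2.<br/>
For column 1, indptr[1]:indptr[2] = 2:3, so the row indices are indices[2:3] = [2] and the values are data[2:3] = [3]: column 1 holds 3 in row 2.<br/>
For column 2, indptr[2]:indptr[3] = 3:6, so the row indices are indices[3:6] = [0, 1, 2] and the values are data[3:6] = [4, 5, 6]: column 2 holds 4, 5, and 6 in rows 0, 1, and 2.<br/>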
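
The column-by-column reading above can be written out as a small decoder. This is a plain-Python sketch of the CSC layout and not SciPy's implementation; `csc_to_dense` is a hypothetical helper name.

```python
def csc_to_dense(data, indices, indptr, shape):
    """Decode CSC arrays into a dense list of rows."""
    n_rows, n_cols = shape
    dense = [[0] * n_cols for _ in range(n_rows)]
    for col in range(n_cols):
        # Entries of this column occupy the slice indptr[col]:indptr[col + 1].
        start, end = indptr[col], indptr[col + 1]
        for row, value in zip(indices[start:end], data[start:end]):
            dense[row][col] = value
    return dense


indptr = [0, 2, 3, 6]
indices = [0, 2, 2, 0, 1, 2]
data = [1, 2, 3, 4, 5, 6]
for row in csc_to_dense(data, indices, indptr, (3, 3)):
    print(row)
# [1, 0, 4]
# [0, 0, 5]
# [2, 3, 6]
```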

<h2>Labeled point</h2>
<p>A labeled point is represented by <a href="http://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#pyspark.mllib.regression.LabeledPoint"><code>LabeledPoint</code></a>.</p>
<p>Refer to the <a href="http://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#pyspark.mllib.regression.LabeledPoint"><code>LabeledPoint</code> Python docs</a> for more details on the API.</p>

In [7]:
from pyspark.mllib.linalg import SparseVector
from pyspark.mllib.regression import LabeledPoint

# Create a labeled point with a positive label and a dense feature vector.
pos = LabeledPoint(1.0, [1.0, 0.0, 3.0])

# Create a labeled point with a negative label and a sparse feature vector.
neg = LabeledPoint(0.0, SparseVector(3, [0, 2], [1.0, 3.0]))

<dt>
<em>class </em>
<tt>pyspark.mllib.regression.</tt>
<tt>LabeledPoint</tt>
<big>(</big><em>label</em>, <em>features</em><big>)</big>
</dt>

<table>
<col/>
<col/>
<tbody>
<tr>
    <th>Parameters:</th>
    <td>
        <ul>
            <li><strong>label</strong> &#8211; Label for this data point.</li>
            <li><strong>features</strong> &#8211; Vector of features for this point (NumPy array, list, pyspark.mllib.linalg.SparseVector, or scipy.sparse column matrix).</li>
        </ul>
    </td>
</tr>
</tbody>
</table>
<p>Note: &#8216;label&#8217; and &#8216;features&#8217; are accessible as class attributes.</p>

In [8]:
pos.label

1.0

In [9]:
pos.features

DenseVector([1.0, 0.0, 3.0])

<p><strong><em>Sparse data</em></strong></p>

<p>It is very common in practice to have sparse training data.  MLlib supports reading training
examples stored in <code>LIBSVM</code> format, which is the default format used by <a href="http://www.csie.ntu.edu.tw/~cjlin/libsvm/"><code>LIBSVM</code></a> and <a href="http://www.csie.ntu.edu.tw/~cjlin/liblinear/"><code>LIBLINEAR</code></a>.  It is a text format in which each line represents a labeled sparse feature vector using the following format:</p>
<pre><code>label index1:value1 index2:value2 ...</code></pre>
<p>where the indices are one-based and in ascending order. After loading, the feature indices are converted to zero-based.</p>
<p><a href="http://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#pyspark.mllib.util.MLUtils.loadLibSVMFile"><code>MLUtils.loadLibSVMFile</code></a> reads training examples stored in LIBSVM format.</p>
<p>Refer to the <a href="http://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#pyspark.mllib.util.MLUtils"><code>MLUtils</code> Python docs</a> for more details on the API.</p>

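<font color="red">Consider the following line of LIBSVM-format data:<br/>
<pre><code>        0 128:51 129:159 130:253 131:159 132:50</code></pre>
Here 0 is the label, 128 is a (one-based) feature index, and 51 is the corresponding feature value.<br/><br/>
After being read by MLUtils.loadLibSVMFile, the line is converted to the following (assuming the file contains only this line, so that the number of features is inferred as 132):<br/>
<pre><code>        (0.0,(132,[127,128,129,130,131],[51.0,159.0,253.0,159.0,50.0]))</code></pre>
0.0 is the label, 132 is the size of the feature vector (the number of features, not the number of stored values), [127,128,129,130,131] are the zero-based feature indices, and [51.0,159.0,253.0,159.0,50.0] are the feature values.<br/><br/>
Note: in the file the indices start at 1 and are in ascending order; after loading, the feature indices are converted to start at 0.
</font>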
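
The index conversion can be illustrated with a minimal parser for a single LIBSVM line. This is a sketch under simplifying assumptions (no comments, no validation of index order) and is not how `MLUtils.loadLibSVMFile` is implemented; `parse_libsvm_line` is a hypothetical helper name.

```python
def parse_libsvm_line(line):
    """Parse 'label index1:value1 ...' into (label, indices, values)."""
    parts = line.split()
    label = float(parts[0])
    indices, values = [], []
    for item in parts[1:]:
        index, value = item.split(":")
        indices.append(int(index) - 1)  # one-based in the file, zero-based after loading
        values.append(float(value))
    return label, indices, values


label, indices, values = parse_libsvm_line("0 128:51 129:159 130:253 131:159 132:50")
print(label)    # 0.0
print(indices)  # [127, 128, 129, 130, 131]
print(values)   # [51.0, 159.0, 253.0, 159.0, 50.0]
```

The vector size is not part of a line; it comes either from `numFeatures` or from the largest index seen in the data.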

<dt>
<em>static </em>
    <tt>loadLibSVMFile</tt>
    <big>(</big><em>sc</em>, <em>path</em>, <em>numFeatures=-1</em>, <em>minPartitions=None</em>, <em>multiclass=None</em><big>)</big>
</dt>
<tbody>
    <tr>
        <th>Parameters:</th>
        <td>
            <ul>
                <li><strong>sc</strong> – Spark context</li>
                <li><strong>path</strong> – file or directory path in any Hadoop-supported file system URI</li>
                <li><strong>numFeatures</strong> – number of features, which will be determined from the input data if a nonpositive value is given. This is useful when the dataset is already split into multiple files and you want to load them separately, because some features may not be present in certain files, which leads to inconsistent feature dimensions.</li>
                <li><strong>minPartitions</strong> – min number of partitions</li>
            </ul>
        </td>
    </tr>
    <tr>
        <th>Returns:</th>
        <td><p>labeled data stored as an RDD of LabeledPoint</p>
        </td>
    </tr>
</tbody>

In [10]:
from pyspark.mllib.util import MLUtils

examples = MLUtils.loadLibSVMFile(sc, "mllib/sample_libsvm_data.txt")

In [11]:
examples.first()

LabeledPoint(0.0, (692,[127,128,129,130,131,154,155,156,157,158,159,181,182,183,184,185,186,187,188,189,207,208,209,210,211,212,213,214,215,216,217,235,236,237,238,239,240,241,242,243,244,245,262,263,264,265,266,267,268,269,270,271,272,273,289,290,291,292,293,294,295,296,297,300,301,302,316,317,318,319,320,321,328,329,330,343,344,345,346,347,348,349,356,357,358,371,372,373,374,384,385,386,399,400,401,412,413,414,426,427,428,429,440,441,442,454,455,456,457,466,467,468,469,470,482,483,484,493,494,495,496,497,510,511,512,520,521,522,523,538,539,540,547,548,549,550,566,567,568,569,570,571,572,573,574,575,576,577,578,594,595,596,597,598,599,600,601,602,603,604,622,623,624,625,626,627,628,629,630,651,652,653,654,655,656,657],[51.0,159.0,253.0,159.0,50.0,48.0,238.0,252.0,252.0,252.0,237.0,54.0,227.0,253.0,252.0,239.0,233.0,252.0,57.0,6.0,10.0,60.0,224.0,252.0,253.0,252.0,202.0,84.0,252.0,253.0,122.0,163.0,252.0,252.0,252.0,253.0,252.0,252.0,96.0,189.0,253.0,167.0,51.0,238.0,253.0,253.0,190.0,

<h2>Local matrix</h2>

<p>A local matrix has integer-typed row and column indices and double-typed values, stored on a single
machine.  MLlib supports dense matrices, whose entry values are stored in a single double array in
column-major order, and sparse matrices, whose non-zero entry values are stored in the Compressed Sparse
Column (CSC) format in column-major order.  For example, the following dense matrix
$$\begin{pmatrix}
1.0 & 2.0 \\
3.0 & 4.0 \\
5.0 & 6.0 \\
\end{pmatrix}$$
is stored in a one-dimensional array <code>[1.0, 3.0, 5.0, 2.0, 4.0, 6.0]</code> with the matrix size <code>(3, 2)</code>.</p>
<p>The base class of local matrices is <a href="http://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#pyspark.mllib.linalg.Matrix"><code>Matrix</code></a>, and we provide two implementations: <a href="http://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#pyspark.mllib.linalg.DenseMatrix"><code>DenseMatrix</code></a>, and <a href="http://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#pyspark.mllib.linalg.SparseMatrix"><code>SparseMatrix</code></a>.
We recommend using the factory methods implemented in <a href="http://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#pyspark.mllib.linalg.Matrices"><code>Matrices</code></a> to create local
matrices. <strong>Remember, local matrices in MLlib are stored in column-major order</strong>.</p>
<p>Refer to the <a href="http://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#pyspark.mllib.linalg.Matrix"><code>Matrix</code> Python docs</a> and <a href="http://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#pyspark.mllib.linalg.Matrices"><code>Matrices</code> Python docs</a> for more details on the API.</p>
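
Column-major order means the flat array lists the first column top to bottom, then the second column, and so on. The following plain-Python sketch shows the index arithmetic; the helper names are hypothetical and this is not MLlib code.

```python
def column_major_to_rows(num_rows, num_cols, values):
    """Rebuild a list of rows from a flat column-major array."""
    if len(values) != num_rows * num_cols:
        raise ValueError("values must hold num_rows * num_cols entries")
    # Entry (row, col) lives at flat position col * num_rows + row.
    return [[values[col * num_rows + row] for col in range(num_cols)]
            for row in range(num_rows)]


def rows_to_column_major(rows):
    """Flatten a list of equal-length rows into column-major order."""
    return [row[col] for col in range(len(rows[0])) for row in rows]


print(column_major_to_rows(3, 2, [1.0, 3.0, 5.0, 2.0, 4.0, 6.0]))
# [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(rows_to_column_major([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]))
# [1.0, 3.0, 5.0, 2.0, 4.0, 6.0]
```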


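A sparse matrix is a matrix in which most elements are 0. In practice, large matrices are almost always sparse, often with sparsity above 90% or even 99%, so an efficient storage format is needed. Three typical formats are COO, CSR, and CSC.<br/>
(1) Coordinate (COO)<br/>
<img src="imgs/DataTypes_01.png">
This is the simplest format: each element is represented by a triple (row index, column index, value), as in the right-hand column of the figure above. It is simple, but every triple carries its full position (both row and column), so the storage is not space-optimal.<br/>
(2) Compressed Sparse Row (CSR)<br/>
<img src="imgs/DataTypes_02.png">
CSR is a more standard format and also needs three kinds of data: values, column indices, and row offsets. Rather than storing triples, CSR encodes the matrix as a whole. The values and column indices are the same as in COO (an element and its column), while each row offset gives the position in values where that row's first element starts.<br/>In the figure above, the first row starts with element 1 at offset 0, the second row with element 2 at offset 2, the third row with element 5 at offset 4, and the fourth row with element 6 at offset 7. The total number of stored (non-zero) elements, 9 in this example, is appended as the final row offset.<br/>
(3) Compressed Sparse Column (CSC)<br/>
CSC is the column-wise counterpart of CSR: the matrix is compressed by column.<br/>
For the matrix in the figure above:<br/>
Values: [1 5 7 2 6 8 3 9 4]<br/>
Row Indices: [0 2 0 1 3 1 2 2 3]<br/>
Column Offsets: [0 2 5 7 9]<br/>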
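
The three formats describe the same non-zero entries, so one can be derived from another. As a plain-Python sketch (hypothetical helper names, not library code), the CSC arrays listed above can be expanded to COO triples and then re-encoded as CSR:

```python
def csc_to_coo(values, row_indices, col_offsets):
    """Expand CSC arrays into (row, col, value) triples."""
    triples = []
    for col in range(len(col_offsets) - 1):
        for k in range(col_offsets[col], col_offsets[col + 1]):
            triples.append((row_indices[k], col, values[k]))
    return triples


def coo_to_csr(triples, num_rows):
    """Encode (row, col, value) triples as CSR arrays."""
    ordered = sorted(triples)  # row-major order: by row, then by column
    values = [v for _, _, v in ordered]
    col_indices = [c for _, c, _ in ordered]
    row_offsets = [0] * (num_rows + 1)
    for r, _, _ in ordered:
        row_offsets[r + 1] += 1           # count entries per row
    for r in range(num_rows):
        row_offsets[r + 1] += row_offsets[r]  # running total gives the offsets
    return values, col_indices, row_offsets


coo = csc_to_coo([1, 5, 7, 2, 6, 8, 3, 9, 4],
                 [0, 2, 0, 1, 3, 1, 2, 2, 3],
                 [0, 2, 5, 7, 9])
csr_values, csr_cols, csr_offsets = coo_to_csr(coo, 4)
print(csr_values)   # [1, 7, 2, 8, 5, 3, 9, 6, 4]
print(csr_cols)     # [0, 1, 1, 2, 0, 2, 3, 1, 3]
print(csr_offsets)  # [0, 2, 4, 7, 9]
```

The resulting row offsets 0, 2, 4, 7 and the final 9 agree with the CSR description above.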

In [12]:
from pyspark.mllib.linalg import Matrix, Matrices

# Create a dense matrix ((1.0, 4.0), (2.0, 5.0), (3.0, 6.0))
dm2 = Matrices.dense(3, 2, [1, 2, 3, 4, 5, 6])

# Create a sparse matrix ((9.0, 0.0), (0.0, 8.0), (0.0, 6.0))
sm = Matrices.sparse(3, 2, [0, 1, 3], [0, 2, 1], [9, 6, 8])

In [13]:
dm2.toArray()

array([[ 1.,  4.],
       [ 2.,  5.],
       [ 3.,  6.]])

In [14]:
sm.toArray()

array([[ 9.,  0.],
       [ 0.,  8.],
       [ 0.,  6.]])

<h2>Distributed matrix</h2>
<p>A distributed matrix has long-typed row and column indices and double-typed values, stored distributively in one or more RDDs. It is very important to choose the right format to store large and distributed matrices. Converting a distributed matrix to a different format may require a global shuffle, which is quite expensive. Four types of distributed matrices have been implemented so far.</p>
<p>The basic type is called <code>RowMatrix</code>. A <code>RowMatrix</code> is a row-oriented distributed
matrix without meaningful row indices, e.g., a collection of feature vectors. It is backed by an RDD of its rows, where each row is a local vector. We assume that the number of columns is not huge for a <code>RowMatrix</code> so that a single local vector can be reasonably communicated to the driver and can also be stored/operated on using a single node. An <code>IndexedRowMatrix</code> is similar to a <code>RowMatrix</code> but with row indices, which can be used for identifying rows and executing joins. A <code>CoordinateMatrix</code> is a distributed matrix stored in <a href="https://en.wikipedia.org/wiki/Sparse_matrix#Coordinate_list_.28COO.29">coordinate list (COO)</a> format, backed by an RDD of its entries. A <code>BlockMatrix</code> is a distributed matrix backed by an RDD of <code>MatrixBlock</code>, which is a tuple of <code>((Int, Int), Matrix)</code>.</p><p><strong><em>Note</em></strong></p>
<p>The underlying RDDs of a distributed matrix must be deterministic, because we cache the matrix size. In general the use of non-deterministic RDDs can lead to errors.</p>
<p>The distributed matrix types are illustrated below:</p>
<img src="imgs/DataTypes_03.png">
<h3 id="rowmatrix">RowMatrix</h3>
<p>A <code>RowMatrix</code> is a row-oriented distributed matrix without meaningful row indices, backed by an RDD
of its rows, where each row is a local vector. Since each row is represented by a local vector, the number of columns is limited by the integer range but it should be much smaller in practice.</p>
<p>A <a href="http://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#pyspark.mllib.linalg.distributed.RowMatrix"><code>RowMatrix</code></a> can be created from an <code>RDD</code> of vectors.</p>
<p>Refer to the <a href="http://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#pyspark.mllib.linalg.distributed.RowMatrix"><code>RowMatrix</code> Python docs</a> for more details on the API.</p>

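A RowMatrix is defined directly from an RDD[Vector] and can be used to compute statistics such as column means, variances, and covariances. A distributed row matrix stores the matrix row by row: each row is a local vector and one element of the backing RDD, so the rows are distributed across the cluster. This resembles the data matrix of multivariate statistics. Because each row is a local vector, the number of columns is limited by the integer range, though in practice it should be much smaller.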

In [15]:
from pyspark.mllib.linalg.distributed import RowMatrix

# Create an RDD of vectors.
rows = sc.parallelize([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]])

# Create a RowMatrix from an RDD of vectors.
mat = RowMatrix(rows)

# Get its size.
m = mat.numRows()  # 4
n = mat.numCols()  # 3

# Get the rows as an RDD of vectors again.
rowsRDD = mat.rows

In [16]:
rowsRDD.first()

DenseVector([1.0, 2.0, 3.0])
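
To make the idea of column statistics over a row-oriented matrix concrete, here is a plain-Python sketch on the same four rows. It runs locally, does not call MLlib, and uses the sample variance (dividing by n - 1) as an assumption of this sketch; `column_summary` is a hypothetical helper name.

```python
def column_summary(rows):
    """Return (means, sample variances) per column for a list of rows."""
    n = len(rows)
    num_cols = len(rows[0])
    means = [sum(row[c] for row in rows) / n for c in range(num_cols)]
    variances = [
        sum((row[c] - means[c]) ** 2 for row in rows) / (n - 1)
        for c in range(num_cols)
    ]
    return means, variances


rows_local = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]
means, variances = column_summary(rows_local)
print(means)      # [5.5, 6.5, 7.5]
print(variances)  # [15.0, 15.0, 15.0]
```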

<h3>IndexedRowMatrix</h3>
<p>An <code>IndexedRowMatrix</code> is similar to a <code>RowMatrix</code> but with meaningful row indices.  It is backed by an RDD of indexed rows, so that each row is represented by its index (long-typed) and a local
vector.</p>
<p>An <a href="http://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#pyspark.mllib.linalg.distributed.IndexedRowMatrix"><code>IndexedRowMatrix</code></a> can be created from an <code>RDD</code> of <code>IndexedRow</code>s, where
<a href="http://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#pyspark.mllib.linalg.distributed.IndexedRow"><code>IndexedRow</code></a> is a wrapper over <code>(long, vector)</code>. An <code>IndexedRowMatrix</code> can be converted to a <code>RowMatrix</code> by dropping its row indices.</p>
<p>Refer to the <a href="http://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#pyspark.mllib.linalg.distributed.IndexedRowMatrix"><code>IndexedRowMatrix</code> Python docs</a> for more details on the API.</p>

<dl>
<dt>
<em>class </em><tt>pyspark.mllib.linalg.distributed.</tt><tt><span>IndexedRow</span></tt><big>(</big><em>index</em>, <em>vector</em><big>)</big></dt>
<dd><p>Bases: <tt><span class="pre">object</span></tt></p>
<p>Represents a row of an <span>IndexedRow</span>Matrix.</p>
<p>Just a wrapper over a (long, vector) tuple.</p>
<table>
<colgroup><col>
<col>
</colgroup><tbody valign="top">
<tr><th>Parameters:</th><td><ul>
<li><strong>index</strong> – The index for the given row.</li>
<li><strong>vector</strong> – The row in the matrix at the given index.</li>
</ul>
</td>
</tr>
</tbody>
</table>
</dd>
</dl>

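An IndexedRowMatrix is similar to a RowMatrix, except that it carries meaningful row indices. In a RowMatrix, rows has type RDD[Vector]; in an IndexedRowMatrix, rows has type RDD[IndexedRow], where IndexedRow(index: Long, vector: Vector) adds the index information that a RowMatrix lacks. An IndexedRowMatrix can be created from an RDD[IndexedRow], where IndexedRow is a wrapper over (Long, Vector), and it can be converted to a RowMatrix in order to use that class's statistics functions.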

In [17]:
from pyspark.mllib.linalg.distributed import IndexedRow, IndexedRowMatrix

# Create an RDD of indexed rows.
# - This can be done explicitly with the IndexedRow class:
indexedRows = sc.parallelize([IndexedRow(0, [1, 2, 3]), 
                              IndexedRow(1, [4, 5, 6]), 
                              IndexedRow(2, [7, 8, 9]), 
                              IndexedRow(3, [10, 11, 12])])
# - or by using (long, vector) tuples:
indexedRows = sc.parallelize([(0, [1, 2, 3]), (1, [4, 5, 6]), 
                              (2, [7, 8, 9]), (3, [10, 11, 12])])

# Create an IndexedRowMatrix from an RDD of IndexedRows.
mat = IndexedRowMatrix(indexedRows)

# Get its size.
m = mat.numRows()  # 4
n = mat.numCols()  # 3

# Get the rows as an RDD of IndexedRows.
rowsRDD = mat.rows

# Convert to a RowMatrix by dropping the row indices.
rowMat = mat.toRowMatrix()

In [18]:
rowsRDD.first()

IndexedRow(0, [1.0,2.0,3.0])

In [19]:
rowMat.rows.first()

DenseVector([1.0, 2.0, 3.0])

<h3>CoordinateMatrix</h3>

<p>A <code>CoordinateMatrix</code> is a distributed matrix backed by an RDD of its entries.  Each entry is a tuple
of <code>(i: Long, j: Long, value: Double)</code>, where <code>i</code> is the row index, <code>j</code> is the column index, and <code>value</code> is the entry value.  A <code>CoordinateMatrix</code> should be used only when both dimensions of the matrix are huge and the matrix is very sparse.</p>
<p>A <a href="http://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#pyspark.mllib.linalg.distributed.CoordinateMatrix"><code>CoordinateMatrix</code></a>
can be created from an <code>RDD</code> of <code>MatrixEntry</code> entries, where 
<a href="http://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#pyspark.mllib.linalg.distributed.MatrixEntry"><code>MatrixEntry</code></a> is a wrapper over <code>(long, long, float)</code>.  A <code>CoordinateMatrix</code> can be converted to a <code>RowMatrix</code> by calling <code>toRowMatrix</code>, or to an <code>IndexedRowMatrix</code> with sparse rows by calling <code>toIndexedRowMatrix</code>.</p>
<p>Refer to the <a href="http://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#pyspark.mllib.linalg.distributed.CoordinateMatrix"><code>CoordinateMatrix</code> Python docs</a> for more details on the API.</p>

<dl>
<dt>
<em>class </em><tt>pyspark.mllib.linalg.distributed.</tt><tt><span>MatrixEntry</span></tt><big>(</big><em>i</em>, <em>j</em>, <em>value</em><big>)</big></dt>
<dd><p>Bases: <tt><span>object</span></tt></p>
<p>Represents an entry of a CoordinateMatrix.</p>
<p>Just a wrapper over a (long, long, float) tuple.</p>
<table>
<colgroup><col>
<col>
</colgroup><tbody valign="top">
<tr><th>Parameters:</th><td><ul>
<li><strong>i</strong> – The row index of the matrix.</li>
<li><strong>j</strong> – The column index of the matrix.</li>
<li><strong>value</strong> – The (i, j)th entry of the matrix, as a float.</li>
</ul>
</td>
</tr>
</tbody>
</table>
</dd></dl>

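Coordinate matrices are typically used in computations on highly sparse data. Each entry is a tuple (i: Long, j: Long, value: Float), where i is the row index, j is the column index, and value is the entry value. A coordinate matrix is a good fit when both dimensions of the matrix are huge and the matrix is very sparse. It is created from an RDD[MatrixEntry], where MatrixEntry wraps (Long, Long, Float), and it can be converted to an IndexedRowMatrix.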

In [20]:
from pyspark.mllib.linalg.distributed import CoordinateMatrix, MatrixEntry

# Create an RDD of coordinate entries.
# - This can be done explicitly with the MatrixEntry class:
entries = sc.parallelize([MatrixEntry(0, 0, 1.2), MatrixEntry(1, 0, 2.1), MatrixEntry(2, 1, 3.7)])
# - or using (long, long, float) tuples:
entries = sc.parallelize([(0, 0, 1.2), (1, 0, 2.1), (2, 1, 3.7)])

# Create a CoordinateMatrix from an RDD of MatrixEntries.
mat = CoordinateMatrix(entries)

# Get its size.
m = mat.numRows()  # 3
n = mat.numCols()  # 2

# Get the entries as an RDD of MatrixEntries.
entriesRDD = mat.entries

# Convert to a RowMatrix.
rowMat = mat.toRowMatrix()

# Convert to an IndexedRowMatrix.
indexedRowMat = mat.toIndexedRowMatrix()

# Convert to a BlockMatrix.
blockMat = mat.toBlockMatrix()

In [21]:
entriesRDD.collect()

[MatrixEntry(0, 0, 1.2), MatrixEntry(1, 0, 2.1), MatrixEntry(2, 1, 3.7)]

In [22]:
rowMat.rows.first()

SparseVector(2, {0: 1.2})

In [23]:
indexedRowMat.rows.first()

IndexedRow(0, (2,[0],[1.2]))
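
The conversion from coordinate entries to indexed sparse rows amounts to grouping the entries by row index. The following plain-Python sketch shows that grouping locally; it is not the MLlib implementation, it infers the column count from the largest column index, and `entries_to_indexed_rows` is a hypothetical helper name.

```python
def entries_to_indexed_rows(entries):
    """Group (i, j, value) entries into (i, (num_cols, indices, values)) rows."""
    num_cols = max(j for _, j, _ in entries) + 1
    grouped = {}
    for i, j, value in entries:
        grouped.setdefault(i, []).append((j, value))
    indexed_rows = []
    for i in sorted(grouped):
        cols = sorted(grouped[i])  # ascending column index within the row
        indexed_rows.append(
            (i, (num_cols, [j for j, _ in cols], [v for _, v in cols]))
        )
    return indexed_rows


local_entries = [(0, 0, 1.2), (1, 0, 2.1), (2, 1, 3.7)]
indexed = entries_to_indexed_rows(local_entries)
for row in indexed:
    print(row)
# (0, (2, [0], [1.2]))
# (1, (2, [0], [2.1]))
# (2, (2, [1], [3.7]))
```

Rows with no entries simply do not appear in the result, which is why this layout suits very sparse matrices.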

<h3>BlockMatrix</h3>
<p>A <code>BlockMatrix</code> is a distributed matrix backed by an RDD of <code>MatrixBlock</code>s, where a <code>MatrixBlock</code> is a tuple of <code>((Int, Int), Matrix)</code>, where the <code>(Int, Int)</code> is the index of the block, and <code>Matrix</code> is the sub-matrix at the given index with size <code>rowsPerBlock</code> x <code>colsPerBlock</code>. <code>BlockMatrix</code> supports methods such as <code>add</code> and <code>multiply</code> with another <code>BlockMatrix</code>. <code>BlockMatrix</code> also has a helper function <code>validate</code> which can be used to check whether the <code>BlockMatrix</code> is set up properly.</p>
<p>A <a href="http://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#pyspark.mllib.linalg.distributed.BlockMatrix"><code>BlockMatrix</code></a> can be created from an <code>RDD</code> of sub-matrix blocks, where a sub-matrix block is a
<code>((blockRowIndex, blockColIndex), sub-matrix)</code> tuple.</p>
<p>Refer to the <a href="http://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#pyspark.mllib.linalg.distributed.BlockMatrix"><code>BlockMatrix</code> Python docs</a> for more details on the API.</p>

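In mathematics, a block matrix (or partitioned matrix) is a matrix that has been divided into smaller rectangular matrices called blocks; equivalently, it is a matrix assembled from smaller matrices. The partition is made with horizontal and vertical lines, so all sub-matrices in the same block row have the same number of rows, and all sub-matrices in the same block column have the same number of columns. Partitioning a large matrix into blocks and treating each block as an element of another matrix often makes computations clearer and can simplify them considerably. For example, some large matrices take a special form, such as diagonal or triangular, once they are partitioned into blocks.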

In [24]:
from pyspark.mllib.linalg import Matrices
from pyspark.mllib.linalg.distributed import BlockMatrix

# Create an RDD of sub-matrix blocks.
blocks = sc.parallelize([((0, 0), Matrices.dense(3, 2, [1, 2, 3, 4, 5, 6])), 
                         ((1, 0), Matrices.dense(3, 2, [7, 8, 9, 10, 11, 12]))])

# Create a BlockMatrix from an RDD of sub-matrix blocks.
mat = BlockMatrix(blocks, 3, 2)

# Get its size.
m = mat.numRows() # 6
n = mat.numCols() # 2

# Get the blocks as an RDD of sub-matrix blocks.
blocksRDD = mat.blocks

# Convert to a LocalMatrix.
localMat = mat.toLocalMatrix()

# Convert to an IndexedRowMatrix.
indexedRowMat = mat.toIndexedRowMatrix()

# Convert to a CoordinateMatrix.
coordinateMat = mat.toCoordinateMatrix()
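
How block indices map to positions in the full matrix can be sketched in plain Python. This local illustration is not MLlib code: it assumes every block is exactly `rows_per_block` x `cols_per_block`, stores each block as a flat column-major list, and uses the hypothetical helper name `assemble_blocks`.

```python
def assemble_blocks(blocks, rows_per_block, cols_per_block):
    """Assemble {(block_row, block_col): column-major values} into full rows."""
    n_block_rows = max(bi for bi, _ in blocks) + 1
    n_block_cols = max(bj for _, bj in blocks) + 1
    full = [[0.0] * (n_block_cols * cols_per_block)
            for _ in range(n_block_rows * rows_per_block)]
    for (bi, bj), values in blocks.items():
        for c in range(cols_per_block):
            for r in range(rows_per_block):
                # Block (bi, bj) starts at row bi * rows_per_block,
                # column bj * cols_per_block of the full matrix.
                full[bi * rows_per_block + r][bj * cols_per_block + c] = float(
                    values[c * rows_per_block + r]
                )
    return full


local_blocks = {
    (0, 0): [1, 2, 3, 4, 5, 6],
    (1, 0): [7, 8, 9, 10, 11, 12],
}
full = assemble_blocks(local_blocks, 3, 2)
for row in full:
    print(row)
# [1.0, 4.0]
# [2.0, 5.0]
# [3.0, 6.0]
# [7.0, 10.0]
# [8.0, 11.0]
# [9.0, 12.0]
```

The assembled matrix has 6 rows and 2 columns, matching the sizes reported by `numRows()` and `numCols()` in the cell above.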