Introduction to Matrix Library

david206 edited this page Sep 2, 2013 · 27 revisions

Contents

Getting help

Documentation

Matrix API

Third Party Modules

Videos

How-tos

Tutorials

Articles

Other

Clone this wiki locally

About

Matrix.scala is a Scalding library that introduces the possibility of treating pipes as sparse matrices and to operate on them using standard matrix operations, such as matrix multiplication:

//Computing the innerproduct of matrix A
innerProd = A * A.transpose

The matrix constructor takes in a pipe containing triples that have the assumed semantics of (row index, column index, matrix value). Additionally, the user can specify the approximate dimensions of the matrix (number of rows, columns, non-zero values) and its skewness (if the distribution of values over the row/column keys is skewed or not). This additional information can help speed-up the computation and improve scalability of the resulting job.

Type restrictions

The matrix row and column indexes can be of any type that is comparable. The usual cases are Int, Long, String. This means that labeled matrices are allowed. For example, we can create a matrix containing the number of users that like specific movie genres per geo without first reindexing the categorical fields to numerical ids:

//Loading the number of users interested in movie genres per geo from a Tsv source
val interestsMatrix = Tsv( args("input") ).read.toMatrix[String,String,Long]('geo, 'movie_genre, 'freq)

The value type decides what operations can be applied to the matrix.

The reason why this is a more powerful approach than to require for the value type to be Numeric is that all of the four operations: addition, subtraction, multiplication, division can be then extended to non-numeric types. For example:

  • String addition can be defined as string concatenation
  • List addition can be defined as list concatenation
  • Set addition can be defined as set union

These operations can be stacked together: For example, Map addition can be defined as set union and the values in the Maps intersection could be aggregated using their own definition of addition. By allowing matrix values to be structured types we can work with higher-order tensors such as cubes or four-tensors with the same library.

For more information on algebraic structures see: Algebraic_structure(Wikipedia) and the pages on Monoid, Group, Ring, Field.

Getting Started. The "Hello World!" example for the Matrix library

Graph nodes outdegrees

Graphs have a straightforward representation in Matrix library as adjacency matrices. We will use the library to compute the outdegrees of the nodes in the graph.

package com.twitter.scalding.examples

import com.twitter.scalding._
import com.twitter.scalding.mathematics.Matrix

class GraphOutDegreeJob(args : Args) extends Job(args) {

  import Matrix._

  val adjacencyMatrix = Tsv( args("input"), ('user1, 'user2, 'rel) )
    .read
    .toMatrix[Long,Long,Double]('user1, 'user2, 'rel)

  // each row i represents all of the outgoing edges from i
  // by summing out all of the columns we get the outdegree of i
  adjacencyMatrix.sumColVectors.write( Tsv( args("output") ) )
}

We convert a pipe of triples to a sparse matrix where element[i,j] represents and edge between row[i] and column[j]. We then sum the values of the columns together into a column vector that has the outdegree of node[i] at row[i].

Next steps

  • Read the Matrix API Reference: includes code snippets explaining different kinds of matrix functions (e.g., sumRowVectors, matrix product, element-wise product, diagonal, topRowElems) and much more.
  • Go over the Matrix tutorials: the tutorials range from one-liners to more complex examples that show real applications of the Matrix functions to graph problems and text processing.