# Graph Learning

# Sparse Matrices

In this lab, you will learn to use sparse matrices.

In [1]:
import numpy as np
from scipy import sparse

## CSR format

We first focus on the [CSR](https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.html) (Compressed Sparse Row) format. There is also the [CSC](https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csc_matrix.html) (Compressed Sparse Column) format, which is simply the CSR format of the transposed matrix.
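The relation between the two formats can be checked on a small example. The sketch below uses a hypothetical 3 x 3 matrix `M` (not part of the lab) and compares the three arrays of the CSC format of `M` with those of the CSR format of `M.T`.

```python
import numpy as np
from scipy import sparse

# hypothetical small matrix, chosen for illustration only
M = np.array([[1, 0, 2],
              [0, 0, 3],
              [4, 5, 0]])

M_csc = sparse.csc_matrix(M)      # column-compressed storage of M
Mt_csr = sparse.csr_matrix(M.T)   # row-compressed storage of the transpose

# the three underlying arrays should coincide
same_storage = (
    np.array_equal(M_csc.indptr, Mt_csr.indptr)
    and np.array_equal(M_csc.indices, Mt_csr.indices)
    and np.array_equal(M_csc.data, Mt_csr.data)
)
print(same_storage)
```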

In [2]:
# random matrix (dense format)
X_dense = np.random.randint(3, size = (10,5))

In [3]:
X_dense

array([[2, 1, 1, 0, 2],
       [0, 0, 0, 2, 0],
       [0, 2, 2, 1, 1],
       [0, 2, 2, 0, 1],
       [2, 1, 1, 2, 2],
       [0, 2, 2, 1, 0],
       [2, 2, 2, 0, 0],
       [0, 0, 1, 1, 1],
       [0, 0, 0, 0, 1],
       [2, 1, 1, 1, 1]])

In [4]:
X_csr = sparse.csr_matrix(X_dense)

In [5]:
X_csr.shape

(10, 5)

In [6]:
# number of non-zeros
X_csr.nnz

32

The data structure consists of 3 vectors:

In [7]:
X_csr.indices

array([0, 1, 2, 4, 3, 1, 2, 3, 4, 1, 2, 4, 0, 1, 2, 3, 4, 1, 2, 3, 0, 1,
       2, 2, 3, 4, 4, 0, 1, 2, 3, 4])

In [8]:
X_csr.indptr

array([ 0,  4,  5,  9, 12, 17, 20, 23, 26, 27, 32])

In [9]:
X_csr.data

array([2, 1, 1, 2, 2, 2, 2, 1, 1, 2, 2, 1, 2, 1, 1, 2, 2, 2, 2, 1, 2, 2,
       2, 1, 1, 1, 1, 2, 1, 1, 1, 1])
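To see how the three vectors fit together, here is a minimal sketch on a hypothetical 3 x 4 matrix `M` (an illustration, not part of the lab). For row `i`, the slice `indptr[i]:indptr[i + 1]` gives the positions in `indices` (column numbers) and `data` (values) of the entries stored for that row.

```python
import numpy as np
from scipy import sparse

# hypothetical small matrix with an empty row, for illustration only
M = np.array([[0, 3, 0, 1],
              [0, 0, 0, 0],
              [2, 0, 0, 5]])
M_csr = sparse.csr_matrix(M)

# rebuild the dense matrix row by row from the three vectors
rebuilt = np.zeros(M_csr.shape, dtype=M_csr.dtype)
for i in range(M_csr.shape[0]):
    start, end = M_csr.indptr[i], M_csr.indptr[i + 1]
    columns = M_csr.indices[start:end]   # columns of the entries of row i
    values = M_csr.data[start:end]       # values of the entries of row i
    rebuilt[i, columns] = values

print(np.array_equal(rebuilt, M))
```

An empty row shows up as two equal consecutive values in `indptr`.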

## To do

* Can you find the number of non-zeros from these 3 vectors?
* What about the shape of the matrix?

In [15]:
nonzero_count = len(X_csr.indices) # alternatively nonzero_count = X_csr.indptr[-1]
nonzero_count

32

In [27]:
# the number of rows is the length of indptr minus 1
row_count = len(X_csr.indptr) - 1

# the number of columns cannot be recovered exactly: max(X_csr.indices) + 1 is only a lower bound,
# because trailing all-zero columns leave no trace in the 3 vectors
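The ambiguity on the number of columns can be illustrated with two hypothetical matrices (names `A` and `B` are assumptions made for this sketch) that differ only by a trailing zero column: their three vectors are identical although their shapes differ.

```python
import numpy as np
from scipy import sparse

# two hypothetical matrices differing only by a trailing zero column
A = sparse.csr_matrix(np.array([[1, 0],
                                [0, 2]]))
B = sparse.csr_matrix(np.array([[1, 0, 0],
                                [0, 2, 0]]))

# the three vectors are the same for both matrices
same_vectors = (
    np.array_equal(A.indptr, B.indptr)
    and np.array_equal(A.indices, B.indices)
    and np.array_equal(A.data, B.data)
)
# the largest stored column index only gives a lower bound on the number of columns
min_cols = int(B.indices.max()) + 1

print(same_vectors, A.shape, B.shape, min_cols)
```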

## Arithmetic

Usual arithmetic operations apply to sparse matrices. The only constraint is to have the sparse matrix on the left-hand side of the operator.
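As a quick sanity check, the sketch below compares a sparse matrix-vector product with its dense counterpart on a hypothetical 2 x 3 matrix, and shows the element-wise product given by the `multiply` method.

```python
import numpy as np
from scipy import sparse

# hypothetical small matrix, for illustration only
X_small_dense = np.array([[1, 0, 2],
                          [0, 3, 0]])
X_small = sparse.csr_matrix(X_small_dense)
a_small = np.ones(3, dtype=int)

# matrix-vector product, sparse matrix on the left-hand side
b_sparse = X_small.dot(a_small)
b_dense = X_small_dense.dot(a_small)

# element-wise product, which stays sparse
squares = X_small.multiply(X_small)

print(b_sparse, b_dense)
print(squares.toarray())
```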

In [20]:
n_row, n_col = 10, 4
X_dense = np.random.randint(2, size = (n_row, n_col))
X = sparse.csr_matrix(X_dense)

In [21]:
X_dense

array([[1, 0, 0, 1],
       [1, 1, 1, 1],
       [1, 0, 0, 0],
       [1, 1, 1, 0],
       [0, 0, 1, 0],
       [1, 0, 0, 0],
       [0, 1, 1, 0],
       [1, 1, 1, 0],
       [1, 1, 1, 0],
       [1, 1, 0, 0]])

In [22]:
a = np.ones(n_col, dtype=int)

In [23]:
b = X.dot(a)

In [24]:
b

array([2, 4, 1, 3, 1, 1, 2, 3, 3, 2], dtype=int32)

In [25]:
a = np.ones(n_row, dtype=int)
b = X.T.dot(a)

In [26]:
b

array([8, 6, 6, 2], dtype=int32)

In [28]:
A = np.random.randint(2, size=(n_col, 2))
B = X.dot(A)

In [29]:
A

array([[0, 1],
       [0, 1],
       [0, 1],
       [0, 0]])

In [30]:
B

array([[0, 1],
       [0, 3],
       [0, 1],
       [0, 3],
       [0, 1],
       [0, 1],
       [0, 2],
       [0, 3],
       [0, 3],
       [0, 2]], dtype=int32)

In [31]:
A = sparse.csr_matrix(A)
B = X.dot(A)

In [32]:
B

<Compressed Sparse Row sparse matrix of dtype 'int32'
	with 10 stored elements and shape (10, 2)>

In [33]:
B.toarray()

array([[0, 1],
       [0, 3],
       [0, 1],
       [0, 3],
       [0, 1],
       [0, 1],
       [0, 2],
       [0, 3],
       [0, 3],
       [0, 2]], dtype=int32)

In [34]:
X.T.dot(X)

<Compressed Sparse Column sparse matrix of dtype 'int32'
	with 16 stored elements and shape (4, 4)>

In [35]:
X.dot(X.T)

<Compressed Sparse Row sparse matrix of dtype 'int32'
	with 86 stored elements and shape (10, 10)>

In [36]:
n_row, n_col = 10, 5
X_dense = np.random.randint(3, size = (n_row, n_col))
X = sparse.csr_matrix(X_dense)

In [37]:
X

<Compressed Sparse Row sparse matrix of dtype 'int32'
	with 32 stored elements and shape (10, 5)>

In [38]:
Y = X > 1

In [39]:
Y

<Compressed Sparse Row sparse matrix of dtype 'bool'
	with 15 stored elements and shape (10, 5)>

In [40]:
Y.dot(np.ones(n_col, dtype=int))

array([1, 1, 2, 1, 3, 3, 0, 2, 0, 2], dtype=int32)

In [41]:
Y.dot(np.ones(n_col, dtype=bool))

array([ True,  True,  True,  True,  True,  True, False,  True, False,
        True])

In [42]:
Z = 2 * X + 5 * Y

In [43]:
Z

<Compressed Sparse Row sparse matrix of dtype 'int32'
	with 32 stored elements and shape (10, 5)>

In [44]:
Z.power(2)

<Compressed Sparse Row sparse matrix of dtype 'int32'
	with 32 stored elements and shape (10, 5)>

## To do

Consider the following matrix:

In [45]:
n_row, n_col = 20, 4
X = sparse.csr_matrix(np.random.randint(3, size = (n_row, n_col)))

* Compute the vector of the Euclidean norm of each row.

In [49]:
# sum the squared entries of each row with a sparse product (no dense copy of the matrix)
row_norms = np.sqrt(X.multiply(X).dot(np.ones(X.shape[1])))
row_norms

array([3.16227766, 2.23606798, 1.        , 2.23606798, 3.46410162,
       2.44948974, 2.44948974, 2.        , 2.23606798, 3.        ,
       2.82842712, 1.        , 2.82842712, 1.73205081, 2.23606798,
       1.41421356, 2.44948974, 2.82842712, 1.73205081, 3.        ])
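As a cross-check, `scipy.sparse.linalg.norm` with `axis=1` computes the Euclidean norm of each row directly on a sparse matrix. The sketch below compares it with a manual computation on a hypothetical 3 x 3 matrix `X_small` whose row norms are easy to verify by hand.

```python
import numpy as np
from scipy import sparse
from scipy.sparse import linalg as sparse_linalg

# hypothetical small matrix with row norms 5, 0 and 3
X_small = sparse.csr_matrix(np.array([[3, 4, 0],
                                      [0, 0, 0],
                                      [1, 2, 2]]))

# manual computation: square the entries, sum over each row, take the square root
norms_manual = np.sqrt(X_small.multiply(X_small).dot(np.ones(X_small.shape[1])))

# library routine for row-wise norms
norms_scipy = sparse_linalg.norm(X_small, axis=1)

print(norms_manual)
print(np.allclose(norms_manual, norms_scipy))
```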

## Slicing

Sparse matrices can be sliced like NumPy arrays.

In [51]:
n_row, n_col = 10, 5
X_dense = np.random.randint(3, size = (n_row, n_col))

In [52]:
X = sparse.csr_matrix(X_dense)

In [53]:
indices = [2, 5, 6]

In [54]:
X[indices]

<Compressed Sparse Row sparse matrix of dtype 'int32'
	with 10 stored elements and shape (3, 5)>

In [55]:
X[:, [1, 3]]

<Compressed Sparse Row sparse matrix of dtype 'int32'
	with 12 stored elements and shape (10, 2)>

## To do 

Consider the following matrix:

In [105]:
n_row, n_col = 20, 10
X = sparse.csr_matrix(np.random.randint(3, size = (n_row, n_col)))
X.data

array([2, 1, 2, 2, 1, 2, 1, 1, 1, 1, 1, 2, 1, 2, 2, 1, 1, 1, 2, 1, 1, 2,
       2, 2, 1, 1, 2, 2, 2, 2, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 2,
       2, 1, 1, 2, 2, 1, 2, 2, 2, 1, 2, 2, 2, 1, 1, 2, 2, 1, 1, 1, 1, 2,
       1, 1, 2, 2, 2, 1, 1, 2, 2, 2, 1, 2, 1, 1, 2, 1, 1, 2, 2, 2, 1, 1,
       2, 1, 2, 2, 1, 2, 1, 1, 2, 2, 2, 1, 2, 1, 2, 2, 1, 1, 1, 2, 2, 2,
       2, 2, 2, 2, 1, 2, 2, 2, 1, 2, 1, 1, 1, 1, 2, 1, 2, 1, 2, 1, 2, 1,
       2, 1, 2, 1, 1, 1, 1, 2, 1, 2, 1, 1, 2])

* Select the 5 rows of largest sums and build the corresponding CSR matrix (size 5 x 10).

In [103]:
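# sums of each row (array of shape (n_row, 1))
row_sums = np.array(X.sum(axis=1))
print(row_sums)
# indices of the 5 rows of largest sums, in decreasing order of sum
largest_rows = np.argsort(row_sums.ravel())[-5:][::-1]
# corresponding CSR matrix (size 5 x 10)
X_top = X[largest_rows]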

[[2]
 [3]
 [2]
 [4]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [1]
 [2]
 [2]
 [2]
 [4]
 [2]
 [3]
 [1]
 [3]
 [2]]


## DIA (diagonal) format

In [57]:
D = sparse.diags(np.arange(10))

In [58]:
D

<DIAgonal sparse matrix of dtype 'float64'
	with 10 stored elements (1 diagonals) and shape (10, 10)>

In [59]:
D.data

array([[0., 1., 2., 3., 4., 5., 6., 7., 8., 9.]])

In [60]:
D.nnz

10

In [61]:
D = sparse.csr_matrix(D)

In [62]:
D.data

array([1., 2., 3., 4., 5., 6., 7., 8., 9.])

In [63]:
D.nnz

9

In [64]:
n_row, n_col = 10, 4
X = sparse.csr_matrix(np.random.randint(2, size = (n_row, n_col)))

In [65]:
D.dot(X)

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 20 stored elements and shape (10, 4)>

In [66]:
D = sparse.diags(np.ones(4))

In [67]:
X.dot(D)

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 21 stored elements and shape (10, 4)>

## To do

Consider the following matrix:

In [68]:
n_row, n_col = 20, 4
X = sparse.csr_matrix(np.random.randint(2, size = (n_row, n_col)))

Using sparse diagonal matrices:
* Normalize this matrix so that each row sums to 1 (or to 0 if the whole row is zero). 
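
One possible approach, given here as a sketch and not as the reference solution, is to multiply on the left by the diagonal matrix of inverse row sums, with a 0 on the diagonal for empty rows. The matrix `X_small` and the variable names below are assumptions made for this example.

```python
import numpy as np
from scipy import sparse

# hypothetical small binary matrix with an empty row
X_small = sparse.csr_matrix(np.array([[1, 1, 0, 0],
                                      [0, 0, 0, 0],
                                      [1, 1, 1, 1]]))

# row sums as a float vector
row_sums = X_small.dot(np.ones(X_small.shape[1]))

# inverse of the row sums, keeping 0 for empty rows
inv_sums = np.zeros_like(row_sums)
nonzero = row_sums > 0
inv_sums[nonzero] = 1 / row_sums[nonzero]

# left multiplication by a diagonal matrix rescales each row
D_inv = sparse.diags(inv_sums, format='csr')
X_norm = D_inv.dot(X_small)

print(X_norm.toarray())
```

Left multiplication by a diagonal matrix rescales rows, while right multiplication rescales columns.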

## COO format

Another way to represent sparse matrices is the COO (COOrdinate) format. It is useful to load a matrix from a list of entries.

In [69]:
row = [1, 4, 2]
col = [2, 0, 2]
data = [1, 2, 3]

In [70]:
X_coo = sparse.coo_matrix((data, (row, col)), shape=(5, 5))

In [71]:
X_coo

<COOrdinate sparse matrix of dtype 'int32'
	with 3 stored elements and shape (5, 5)>

In [72]:
X_coo.row

array([1, 4, 2])

In [73]:
X_coo.col

array([2, 0, 2])

In [74]:
X_coo.data

array([1, 2, 3])

You can change the format:

In [75]:
X_csr = X_coo.tocsr()

In [76]:
X_csr

<Compressed Sparse Row sparse matrix of dtype 'int32'
	with 3 stored elements and shape (5, 5)>

In [77]:
X_csr.indices

array([2, 2, 0])

In [78]:
X_csr.indptr

array([0, 0, 1, 2, 2, 3])

In [79]:
X_csr.data

array([1, 3, 2])
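The conversion reorders the entries by row, and `indptr` is the cumulative count of entries per row. The sketch below rebuilds `indptr` from the COO row indices of the same example using `np.bincount`; the intermediate names are assumptions made for this illustration.

```python
import numpy as np
from scipy import sparse

row = [1, 4, 2]
col = [2, 0, 2]
data = [1, 2, 3]
X_coo = sparse.coo_matrix((data, (row, col)), shape=(5, 5))
X_csr = X_coo.tocsr()

# number of stored entries in each row
counts = np.bincount(X_coo.row, minlength=X_coo.shape[0])

# indptr starts at 0 and accumulates the counts
indptr_rebuilt = np.concatenate(([0], np.cumsum(counts)))

print(indptr_rebuilt)
print(np.array_equal(indptr_rebuilt, X_csr.indptr))
```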

In [80]:
X_csr.tocoo()

<COOrdinate sparse matrix of dtype 'int32'
	with 3 stored elements and shape (5, 5)>

You can directly load a CSR matrix from COO format:

In [81]:
X_csr = sparse.csr_matrix((data, (row, col)), shape=(5, 5))

In [82]:
X_csr

<Compressed Sparse Row sparse matrix of dtype 'int32'
	with 3 stored elements and shape (5, 5)>

Duplicate entries are not summed by default in the COO format (nor in the CSR format in recent versions of SciPy):

In [83]:
row = [1, 4, 2, 1]
col = [2, 0, 2, 2]
data = [1, 2, 3, 4]

In [84]:
X_coo = sparse.coo_matrix((data, (row, col)), shape=(5, 5))

In [85]:
X_coo

<COOrdinate sparse matrix of dtype 'int32'
	with 4 stored elements and shape (5, 5)>

In [86]:
X_coo.nnz

4

In [87]:
X_coo.toarray()

array([[0, 0, 0, 0, 0],
       [0, 0, 5, 0, 0],
       [0, 0, 3, 0, 0],
       [0, 0, 0, 0, 0],
       [2, 0, 0, 0, 0]])

In [88]:
X_coo.dot(np.ones(5, dtype=int))

array([0, 5, 3, 0, 2], dtype=int32)

In [89]:
X_coo.data

array([1, 2, 3, 4])

In [90]:
X_coo.sum_duplicates()

In [91]:
X_coo

<COOrdinate sparse matrix of dtype 'int32'
	with 3 stored elements and shape (5, 5)>

In [92]:
X_coo.data

array([5, 3, 2])

If the data contain zeros, these are stored as explicit entries by default and are counted in `nnz`:

In [93]:
row = [1, 4, 2]
col = [2, 0, 2]
data = [1, 2, 0]

In [94]:
X_csr = sparse.csr_matrix((data, (row, col)), shape=(5, 5))

In [95]:
X_csr

<Compressed Sparse Row sparse matrix of dtype 'int32'
	with 3 stored elements and shape (5, 5)>

In [96]:
X_csr.eliminate_zeros()

In [97]:
X_csr

<Compressed Sparse Row sparse matrix of dtype 'int32'
	with 2 stored elements and shape (5, 5)>
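
The attribute `nnz` counts stored entries, explicit zeros included, whereas the method `count_nonzero()` counts the entries whose value is actually non-zero. A minimal sketch on the same data as above:

```python
import numpy as np
from scipy import sparse

row = [1, 4, 2]
col = [2, 0, 2]
data = [1, 2, 0]
X_zero = sparse.csr_matrix((data, (row, col)), shape=(5, 5))

stored = X_zero.nnz               # stored entries, including the explicit zero
actual = X_zero.count_nonzero()   # entries with a non-zero value

# after removing explicit zeros, both counts agree
X_zero.eliminate_zeros()
stored_after = X_zero.nnz

print(stored, actual, stored_after)
```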

## To do

Consider the following matrix:

In [98]:
row = [1, 1, 2, 4, 2, 1]
col = [2, 2, 2, 0, 2, 2]
data = [1, 2, 3, 4, 0, 2]

X_csr = sparse.csr_matrix((data, (row, col)), shape=(5, 5))
X_csr.sum_duplicates()

* What is the number of non-zeros? 
* Check your answer using ``nnz``.