# SD212: Graph mining

# Sparse matrices

The objective of this lab is to learn to work with sparse matrices.

In [205]:
import numpy as np
from scipy import sparse

## CSR format

We first focus on the [CSR](https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.html) (Compressed Sparse Row) format. There is also the [CSC](https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csc_matrix.html) (Compressed Sparse Column) format, which is simply the CSR format of the transposed matrix.
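The relation between CSC and CSR can be checked directly. The sketch below uses a small fixed matrix `M` (a hypothetical example, not one of the random matrices of this lab) and compares the three vectors of the CSC format of `M` with those of the CSR format of its transpose.

```python
import numpy as np
from scipy import sparse

# Small fixed matrix, chosen for illustration only
M = np.array([[1, 0, 2],
              [0, 0, 3],
              [4, 5, 0],
              [0, 6, 0]])

M_csc = sparse.csc_matrix(M)      # CSC format of M
Mt_csr = sparse.csr_matrix(M.T)   # CSR format of the transpose of M

# The three underlying vectors should coincide
same_indptr = np.array_equal(M_csc.indptr, Mt_csr.indptr)
same_indices = np.array_equal(M_csc.indices, Mt_csr.indices)
same_data = np.array_equal(M_csc.data, Mt_csr.data)
print(same_indptr, same_indices, same_data)
```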

In [206]:
# random matrix (dense format)
X_dense = np.random.randint(3, size = (10,5))

In [207]:
X_dense

array([[1, 0, 1, 0, 1],
       [1, 0, 2, 2, 2],
       [2, 0, 1, 0, 1],
       [2, 0, 2, 1, 1],
       [1, 0, 0, 0, 0],
       [0, 1, 0, 1, 1],
       [2, 0, 0, 2, 2],
       [2, 0, 2, 1, 0],
       [1, 1, 0, 2, 0],
       [1, 1, 2, 1, 0]])

In [208]:
X_csr = sparse.csr_matrix(X_dense)

In [209]:
X_csr.shape

(10, 5)

In [210]:
X_csr.nnz

31

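The data structure consists of 3 vectors: `indices` (the column index of each stored entry), `indptr` (the position in `indices` and `data` where each row starts) and `data` (the stored values).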

In [211]:
X_csr.indices

array([0, 2, 4, 0, 2, 3, 4, 0, 2, 4, 0, 2, 3, 4, 0, 1, 3, 4, 0, 3, 4, 0,
       2, 3, 0, 1, 3, 0, 1, 2, 3], dtype=int32)

In [212]:
X_csr.indptr

array([ 0,  3,  7, 10, 14, 15, 18, 21, 24, 27, 31], dtype=int32)

In [213]:
X_csr.data

array([1, 1, 1, 1, 2, 2, 2, 2, 1, 1, 2, 2, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2,
       2, 1, 1, 1, 2, 1, 1, 2, 1])
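To see how the three vectors fit together, here is a minimal sketch that rebuilds a dense matrix from them. The helper `csr_to_dense` and the matrix `X_small` are hypothetical names introduced for illustration; the number of columns is passed explicitly as an assumption.

```python
import numpy as np
from scipy import sparse

# Small fixed matrix with an empty middle row (illustration only)
X_small = np.array([[1, 0, 2],
                    [0, 0, 0],
                    [0, 3, 4]])
X_small_csr = sparse.csr_matrix(X_small)

def csr_to_dense(indptr, indices, data, n_col):
    """Rebuild a dense array from the three CSR vectors (hypothetical helper)."""
    n_row = len(indptr) - 1
    dense = np.zeros((n_row, n_col), dtype=data.dtype)
    for i in range(n_row):
        # entries of row i are stored in positions indptr[i] .. indptr[i+1] - 1
        start, end = indptr[i], indptr[i + 1]
        dense[i, indices[start:end]] = data[start:end]
    return dense

rebuilt = csr_to_dense(X_small_csr.indptr, X_small_csr.indices,
                       X_small_csr.data, n_col=3)
print(rebuilt)
```

An empty row shows up as two equal consecutive values in `indptr`.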

## To do

Can you find the number of non-zeros ``nnz`` and the shape of the matrix from these vectors?

In [214]:
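# the last pointer is the total number of stored entries; so is len(data)
nnz = X_csr.indptr[-1]
nnz2 = len(X_csr.data)

# rows: one pointer per row, plus one; columns: largest stored column index + 1
# (a lower bound only: trailing all-zero columns leave no trace in the vectors)
shape = (len(X_csr.indptr) - 1, max(X_csr.indices) + 1)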

print("nnz = ", nnz)
print("nnz2 = ", nnz2)
print("shape = ", shape, "\n")

nnz =  31
nnz2 =  31
shape =  (10, 5) 
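Note that the number of columns recovered from `indices` is only a lower bound: a trailing column without any non-zero entry never appears in the vectors. A minimal sketch on a hypothetical 2 x 3 matrix whose last column is empty:

```python
import numpy as np
from scipy import sparse

# The last column is entirely zero (illustration only)
M = np.array([[1, 0, 0],
              [0, 2, 0]])
M_csr = sparse.csr_matrix(M)

n_row_guess = len(M_csr.indptr) - 1      # exact number of rows
n_col_guess = M_csr.indices.max() + 1    # lower bound on the number of columns

print("guessed shape:", (n_row_guess, n_col_guess))
print("true shape:", M_csr.shape)
```

This is why the shape is stored alongside the three vectors rather than inferred from them.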



## Arithmetic

Usual arithmetic operations apply to sparse matrices. The main constraint is to keep the sparse matrix on the left-hand side of a product, that is, to call `dot` on the sparse matrix.
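When the dense vector should multiply the matrix from the left, one option is to transpose the sparse matrix, since $b^T X = (X^T b)^T$. The sketch below uses a small hypothetical matrix `X_small` and checks both products against their dense counterparts.

```python
import numpy as np
from scipy import sparse

X_small_dense = np.array([[1, 0, 2],
                          [0, 3, 0]])
X_small = sparse.csr_matrix(X_small_dense)

a = np.array([1, 1, 1])   # one entry per column
b = np.array([1, 1])      # one entry per row

right = X_small.dot(a)    # X a: the sparse matrix is on the left
left = X_small.T.dot(b)   # b^T X, computed as X^T b

print(right)
print(left)
```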

In [215]:
n_row, n_col = 10, 4
X_dense = np.random.randint(2, size = (n_row, n_col))
X = sparse.csr_matrix(X_dense)

In [216]:
X_dense

array([[0, 1, 1, 0],
       [0, 1, 1, 1],
       [0, 1, 0, 1],
       [1, 0, 1, 1],
       [0, 0, 1, 0],
       [1, 0, 0, 1],
       [1, 0, 0, 0],
       [0, 0, 1, 1],
       [1, 1, 0, 0],
       [0, 1, 0, 0]])

In [217]:
a = np.ones(n_col, dtype=int)

In [218]:
b = X.dot(a)

In [219]:
b  # row sum

array([2, 3, 2, 3, 1, 2, 1, 2, 2, 1])

In [220]:
a = np.ones(n_row, dtype=int)
b = X.T.dot(a)

In [221]:
b # col sum

array([4, 5, 5, 5])

In [222]:
A = np.random.randint(2, size=(n_col, 2))
B = X.dot(A)

In [223]:
A

array([[0, 1],
       [0, 0],
       [0, 0],
       [0, 1]])

In [224]:
B

array([[0, 0],
       [0, 1],
       [0, 1],
       [0, 2],
       [0, 0],
       [0, 2],
       [0, 1],
       [0, 1],
       [0, 1],
       [0, 0]])

In [225]:
A = sparse.csr_matrix(A)
B = X.dot(A)

In [226]:
B

<10x2 sparse matrix of type '<class 'numpy.int64'>'
	with 7 stored elements in Compressed Sparse Row format>

In [227]:
B.toarray()

array([[0, 0],
       [0, 1],
       [0, 1],
       [0, 2],
       [0, 0],
       [0, 2],
       [0, 1],
       [0, 1],
       [0, 1],
       [0, 0]])

In [228]:
X.T.dot(X)

<4x4 sparse matrix of type '<class 'numpy.int64'>'
	with 16 stored elements in Compressed Sparse Column format>

In [229]:
X.dot(X.T)

<10x10 sparse matrix of type '<class 'numpy.int64'>'
	with 70 stored elements in Compressed Sparse Row format>

In [230]:
n_row, n_col = 10, 5
X_dense = np.random.randint(3, size = (n_row, n_col))
X = sparse.csr_matrix(X_dense)

In [231]:
X

<10x5 sparse matrix of type '<class 'numpy.int64'>'
	with 30 stored elements in Compressed Sparse Row format>

In [232]:
Y = X > 1

In [233]:
Y

<10x5 sparse matrix of type '<class 'numpy.bool_'>'
	with 18 stored elements in Compressed Sparse Row format>

In [234]:
Y.dot(np.ones(n_col, dtype=int))

array([3, 3, 1, 1, 2, 2, 3, 2, 0, 1])

In [235]:
Y.dot(np.ones(n_col, dtype=bool))

array([ True,  True,  True,  True,  True,  True,  True,  True, False,
        True])

In [236]:
Z = 2 * X + 5 * Y

In [237]:
Z

<10x5 sparse matrix of type '<class 'numpy.int64'>'
	with 30 stored elements in Compressed Sparse Row format>

## To do

Consider the following matrix:

In [238]:
n_row, n_col = 20, 4
X = sparse.csr_matrix(np.random.randint(3, size = (n_row, n_col)))

* Compute the vector of the Euclidean norm of each row.

In [239]:
# Computing the vector of the Euclidean norm of each row
prod = X.dot(X.T)
diag = prod.diagonal()
norm = np.sqrt(diag)

norm

array([2.23606798, 2.23606798, 3.        , 2.44948974, 2.44948974,
       3.16227766, 3.        , 3.        , 1.41421356, 2.        ,
       2.64575131, 2.23606798, 3.46410162, 2.82842712, 3.60555128,
       2.44948974, 2.44948974, 2.44948974, 3.46410162, 2.23606798])
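The solution above forms the full product `X.dot(X.T)` only to read its diagonal. As an alternative sketch, the squared entries can be summed row by row with an elementwise product, which avoids building an `n_row` x `n_row` matrix. The seeded generator `rng` is an assumption made here for reproducibility; the lab itself uses `np.random.randint`.

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)   # seeded for reproducibility (assumption)
n_row, n_col = 20, 4
X_dense = rng.integers(3, size=(n_row, n_col))
X = sparse.csr_matrix(X_dense)

# elementwise square, then row sums through a product with a vector of ones
squared = X.multiply(X)
row_norms = np.sqrt(squared.dot(np.ones(n_col)))

print(row_norms)
```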

## Slicing

Sparse matrices can be sliced like numpy arrays.

In [240]:
n_row, n_col = 10, 5
X_dense = np.random.randint(3, size = (n_row, n_col))

In [241]:
X = sparse.csr_matrix(X_dense)

print(X)

  (0, 1)	1
  (0, 2)	1
  (1, 0)	1
  (1, 1)	2
  (1, 2)	1
  (1, 3)	1
  (2, 0)	2
  (2, 1)	2
  (2, 2)	2
  (2, 3)	1
  (3, 1)	2
  (3, 4)	1
  (4, 0)	2
  (4, 1)	1
  (4, 2)	1
  (4, 3)	2
  (4, 4)	2
  (5, 0)	1
  (5, 2)	1
  (6, 0)	1
  (6, 1)	2
  (6, 2)	1
  (6, 3)	2
  (7, 0)	1
  (7, 1)	2
  (7, 2)	1
  (8, 0)	1
  (8, 1)	2
  (8, 2)	1
  (9, 2)	1
  (9, 3)	1
  (9, 4)	2


In [242]:
indices = [2, 5, 6]

In [243]:
print(X[indices])

  (0, 0)	2
  (0, 1)	2
  (0, 2)	2
  (0, 3)	1
  (1, 0)	1
  (1, 2)	1
  (2, 0)	1
  (2, 1)	2
  (2, 2)	1
  (2, 3)	2


In [244]:
print(X[:, [1, 3]])

  (0, 0)	1
  (1, 0)	2
  (1, 1)	1
  (2, 0)	2
  (2, 1)	1
  (3, 0)	2
  (4, 0)	1
  (4, 1)	2
  (6, 0)	2
  (6, 1)	2
  (7, 0)	2
  (8, 0)	2
  (9, 1)	1


## To do 

Consider the following matrix:

In [245]:
n_row, n_col = 20, 10
X = sparse.csr_matrix(np.random.randint(3, size = (n_row, n_col)))

* Select the 5 rows of largest sums and build the corresponding CSR matrix (size 5 x 10).

In [246]:
row_sums = X.dot(np.ones(n_col))
print(row_sums)

# take the 5 largest sums (argsort is ascending, so keep the last 5 indices)
indices = np.argsort(row_sums)[-5:]
print(indices)

Y = X[indices]
Y

[14. 10.  9. 10. 10. 10. 11.  7. 11. 13.  8. 13.  9. 13. 15.  8.  9. 14.
 15. 11.]
[13 17  0 14 18]


<5x10 sparse matrix of type '<class 'numpy.int64'>'
	with 43 stored elements in Compressed Sparse Row format>

## DIA format

In [247]:
D = sparse.diags(np.arange(10))

In [248]:
D

<10x10 sparse matrix of type '<class 'numpy.float64'>'
	with 10 stored elements (1 diagonals) in DIAgonal format>

In [249]:
D.data

array([[0., 1., 2., 3., 4., 5., 6., 7., 8., 9.]])

In [250]:
D.nnz

10

In [251]:
D = sparse.csr_matrix(D)

In [252]:
D.data

array([1., 2., 3., 4., 5., 6., 7., 8., 9.])

In [253]:
D.nnz

9

In [254]:
n_row, n_col = 10, 4
X = sparse.csr_matrix(np.random.randint(2, size = (n_row, n_col)))

In [255]:
D.dot(X)

<10x4 sparse matrix of type '<class 'numpy.float64'>'
	with 15 stored elements in Compressed Sparse Row format>

In [256]:
D = sparse.diags(np.ones(4))

In [257]:
X.dot(D)

<10x4 sparse matrix of type '<class 'numpy.float64'>'
	with 17 stored elements in Compressed Sparse Row format>

## To do

Consider the following matrix:

In [272]:
n_row, n_col = 20, 4
X = sparse.csr_matrix(np.random.randint(2, size = (n_row, n_col)))

Using sparse diagonal matrices:
* Normalize this matrix so that each row sums to 1 (or to 0 if the whole row is zero). 

In [273]:
# Normalize this matrix so that each row sums to 1 (or to 0 if the whole row is zero).

# convert the stored values to float so that the in-place division works
X.data = X.data.astype(float)

row_sums = X.dot(np.ones(n_col, dtype=float))
print(row_sums)

# divide the stored entries of each non-empty row by the row sum
for i in range(n_row):
    if row_sums[i] != 0:
        X.data[X.indptr[i]: X.indptr[i+1]] /= row_sums[i]

X.data

[2. 0. 2. 2. 3. 0. 1. 3. 2. 2. 2. 3. 3. 2. 2. 3. 0. 4. 1. 4.]


array([0.5       , 0.5       , 0.5       , 0.5       , 0.5       ,
       0.5       , 0.33333333, 0.33333333, 0.33333333, 1.        ,
       0.33333333, 0.33333333, 0.33333333, 0.5       , 0.5       ,
       0.5       , 0.5       , 0.5       , 0.5       , 0.33333333,
       0.33333333, 0.33333333, 0.33333333, 0.33333333, 0.33333333,
       0.5       , 0.5       , 0.5       , 0.5       , 0.33333333,
       0.33333333, 0.33333333, 0.25      , 0.25      , 0.25      ,
       0.25      , 1.        , 0.25      , 0.25      , 0.25      ,
       0.25      ])
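The exercise asks for sparse diagonal matrices, whereas the loop above works directly on `data`. A possible sketch of the diagonal variant follows: multiplying on the left by a diagonal matrix scales each row, so the diagonal holds the inverse row sums (and 0 for empty rows). The matrix `X_dense` is a small fixed example chosen here so that one row is empty.

```python
import numpy as np
from scipy import sparse

# Small fixed binary matrix with one empty row (illustration only)
X_dense = np.array([[1, 1, 0, 0],
                    [0, 0, 0, 0],
                    [1, 0, 1, 1],
                    [0, 0, 0, 1]])
n_row, n_col = X_dense.shape
X = sparse.csr_matrix(X_dense)

row_sums = X.dot(np.ones(n_col))

# inverse of each row sum, keeping 0 for empty rows
inv_sums = np.zeros(n_row)
non_empty = row_sums > 0
inv_sums[non_empty] = 1 / row_sums[non_empty]

# left multiplication by a diagonal matrix rescales the rows
D_inv = sparse.diags(inv_sums, format="csr")
X_norm = D_inv.dot(X)

print(X_norm.dot(np.ones(n_col)))
```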

## COO format

Another way to represent sparse matrices is the COO (COOrdinate) format. It is useful to load a matrix from a list of entries.

In [129]:
row = [1, 4, 2]
col = [2, 0, 2]
data = [1, 2, 3]

In [130]:
X_coo = sparse.coo_matrix((data, (row, col)), shape=(5, 5))

In [131]:
X_coo

<5x5 sparse matrix of type '<class 'numpy.int64'>'
	with 3 stored elements in COOrdinate format>

In [132]:
X_coo.row

array([1, 4, 2], dtype=int32)

In [133]:
X_coo.col

array([2, 0, 2], dtype=int32)

In [134]:
X_coo.data

array([1, 2, 3])
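In practice the entries often come as a list of (row, column, value) triples, for instance the weighted edges of a graph. A minimal sketch, where the list `entries` is a hypothetical input holding the same values as above:

```python
import numpy as np
from scipy import sparse

# hypothetical list of (row, column, value) triples
entries = [(1, 2, 1), (4, 0, 2), (2, 2, 3)]

# split the triples into three parallel sequences
row, col, data = zip(*entries)
X_from_entries = sparse.coo_matrix((data, (row, col)), shape=(5, 5))

print(X_from_entries.toarray())
```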

You can change the format:

In [135]:
X_csr = X_coo.tocsr()

In [136]:
X_csr

<5x5 sparse matrix of type '<class 'numpy.int64'>'
	with 3 stored elements in Compressed Sparse Row format>

In [137]:
X_csr.indices

array([2, 2, 0], dtype=int32)

In [138]:
X_csr.indptr

array([0, 0, 1, 2, 2, 3], dtype=int32)

In [139]:
X_csr.data

array([1, 3, 2])

In [140]:
X_csr.tocoo()

<5x5 sparse matrix of type '<class 'numpy.int64'>'
	with 3 stored elements in COOrdinate format>

You can directly load a CSR matrix from COO format:

In [141]:
X_csr = sparse.csr_matrix((data, (row, col)), shape=(5, 5))

In [142]:
X_csr

<5x5 sparse matrix of type '<class 'numpy.int64'>'
	with 3 stored elements in Compressed Sparse Row format>

Duplicate entries are summed when building a CSR matrix, whereas the COO format stores them separately:

In [143]:
row = [1, 4, 2, 1]
col = [2, 0, 2, 2]
data = [1, 2, 3, 4]

In [144]:
X_coo = sparse.coo_matrix((data, (row, col)), shape=(5, 5))

In [145]:
X_coo

<5x5 sparse matrix of type '<class 'numpy.int64'>'
	with 4 stored elements in COOrdinate format>

In [146]:
X_coo.data

array([1, 2, 3, 4])

In [147]:
X_csr = sparse.csr_matrix((data, (row, col)), shape=(5, 5))

In [148]:
X_csr

<5x5 sparse matrix of type '<class 'numpy.int64'>'
	with 3 stored elements in Compressed Sparse Row format>

In [149]:
X_csr.data

array([5, 3, 2])
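Although COO keeps duplicates as separate stored entries, both formats represent the same matrix. The sketch below reuses the entries above and checks that the dense versions agree; it also calls `sum_duplicates`, which merges the duplicates of a COO matrix in place.

```python
import numpy as np
from scipy import sparse

row = [1, 4, 2, 1]
col = [2, 0, 2, 2]
data = [1, 2, 3, 4]

X_coo = sparse.coo_matrix((data, (row, col)), shape=(5, 5))
X_csr = sparse.csr_matrix((data, (row, col)), shape=(5, 5))

nnz_before = X_coo.nnz   # duplicates are still stored separately
same_dense = np.array_equal(X_coo.toarray(), X_csr.toarray())

X_coo.sum_duplicates()   # merge duplicate entries in place
nnz_after = X_coo.nnz

print(nnz_before, nnz_after, same_dense)
```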

## To do

* Build the following matrix in sparse CSR format:
$$
\begin{pmatrix}
0 & 0 & 1 & 2 \\
3 & 0 & 0 & 0 \\
0 & 0 & 4 & 0
\end{pmatrix}
$$



* The matrix in sparse CSR format is:
$$
\begin{aligned}
\mathtt{indptr} &= [0, 2, 3, 4] \\
\mathtt{data} &= [1, 2, 3, 4] \\
\mathtt{indices} &= [2, 3, 0, 2]
\end{aligned}
$$
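These three vectors can be passed directly to `sparse.csr_matrix` in the form `(data, indices, indptr)`. A short sketch that builds the matrix and compares it with the dense target; the variable names `M` and `expected` are chosen here for illustration.

```python
import numpy as np
from scipy import sparse

indptr = np.array([0, 2, 3, 4])
data = np.array([1, 2, 3, 4])
indices = np.array([2, 3, 0, 2])

# build the CSR matrix from its three vectors; the shape is given explicitly
M = sparse.csr_matrix((data, indices, indptr), shape=(3, 4))

expected = np.array([[0, 0, 1, 2],
                     [3, 0, 0, 0],
                     [0, 0, 4, 0]])
print(M.toarray())
```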