## SciPy's sparse package

* 2-D sparse matrix data structures
   * A sparse matrix is mostly "empty": the large majority of its entries are zero.
   * Contrast this with a dense matrix, where few entries are zero.
* Efficient storage and performant retrieval for various types of access
* NumPy-like matrix math functions
   * Methods on matrix objects
   * sparse.linalg and sparse.csgraph subpackages
   * If a NumPy-only algorithm is needed after all, first convert to a dense NumPy array with .toarray()
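
As a minimal sketch of the points above (the matrix values are made up for illustration), the cell below uses methods on a sparse matrix, a `sparse.linalg` routine, and an explicit conversion to a dense array for a NumPy-only function:

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import spsolve

# Small illustrative matrix (values chosen arbitrarily)
M = sparse.csr_matrix(np.array([[4.0, 0.0, 0.0],
                                [0.0, 2.0, 1.0],
                                [0.0, 0.0, 5.0]]))
b = np.array([8.0, 5.0, 10.0])

# Methods on the matrix object work without densifying
col_sums = M.sum(axis=0)   # column sums
y = M.dot(b)               # sparse matrix-vector product

# sparse.linalg operates directly on the sparse matrix
x = spsolve(M, b)

# A NumPy-only routine needs a dense array first
det = np.linalg.det(M.toarray())
```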
   
### Examples of when useful

* When a large amount of matrix data is sparse and must fit in limited memory.
* Large sparse matrices often appear in scientific or engineering applications when solving partial differential equations (PDEs).
   * The finite element method
   * Mechanical engineering, electrotechnics, physics, etc.
* Graph theory
   * nonzero at (i, j) means that node i is connected to node j
* Bag of Words vectorization in NLP (to be illustrated later!)
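
To illustrate the graph-theory use, here is a small sketch with a hypothetical 5-node undirected graph (the edge list is invented for this example). The adjacency matrix is stored sparsely and passed to `sparse.csgraph.connected_components`:

```python
import numpy as np
from scipy import sparse
from scipy.sparse.csgraph import connected_components

# Hypothetical undirected graph on 5 nodes with edges 0-1, 1-2, 3-4
i = np.array([0, 1, 3])
j = np.array([1, 2, 4])
ones = np.ones(len(i), dtype=int)

# A nonzero at (i, j) means node i is connected to node j
adj = sparse.coo_matrix((ones, (i, j)), shape=(5, 5)).tocsr()

# directed=False treats each stored edge as bidirectional
n_components, labels = connected_components(adj, directed=False)
```

Nodes 0, 1, 2 end up with one label and nodes 3, 4 with another.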

### Sparse matrix types

* Select the right type for the job.
* Row-oriented and column-oriented formats are more efficient for row-wise and column-wise operations, respectively.
* Each type has both general constructors and custom constructors specific to that type.
* Some types are designed for efficient incremental construction.
   * These are typically converted to another type before being used for computation.
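
The workflow described above might look like the following sketch: build incrementally in LIL, then convert to CSR or CSC depending on the kind of access needed. The shape and values are arbitrary.

```python
import numpy as np
from scipy import sparse

# Build incrementally in LIL, which is cheap to modify
L = sparse.lil_matrix((4, 4))
L[0, 1] = 3.0
L[2, 0] = 4.0
L[2, 3] = 1.0
L[3, 3] = 5.0

# Convert once construction is done
R = L.tocsr()   # row-oriented
C = L.tocsc()   # column-oriented

row2 = R[2, :].toarray()   # one row, taken from the CSR copy
col3 = C[:, 3].toarray()   # one column, taken from the CSC copy
```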

Name | Row | Col | Diagonal | Incr-Constr | Long Name
--- | --- | --- | --- | --- | :---
csc | . | y | . | . | Compressed Sparse Column format
csr | y | . | . | . | Compressed Sparse Row format
bsr | y | . | . | . | Block Sparse Row format
lil | . | . | . | y | List of Lists format
dok | . | . | . | y | Dictionary of Keys format
coo | . | . | . | (*) | COOrdinate format
dia | . | . | y | . | DIAgonal format

> (*) COO supports fast construction of the full matrix at once, but not incremental construction.

In [None]:
import numpy as np
import os
import matplotlib as mpl
import matplotlib.pyplot as plt
from scipy import sparse

#### CSC -- Compressed Sparse Column format

* For efficient COLUMN access
* Uses a column-oriented variant of the "Yale" format:
   * A -- nonzero values in "column major" order
   * IA -- column pointers, defined recursively: IA[0] = 0 and IA[j+1] = IA[j] + (number of nonzeros in column j), so its length is n_cols + 1
   * JA -- row index of each element in A (same length as A)
   
<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/4/4d/Row_and_column_major_order.svg/170px-Row_and_column_major_order.svg.png"/>
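
A small CSC sketch, using an arbitrary 4x4 matrix: the three arrays are passed to `sparse.csc_matrix` as `(data, indices, indptr)`, which here corresponds to `(A, JA, IA)`.

```python
import numpy as np
from scipy import sparse

# CSC example:
A  = [5, 8, 6, 3]       # nonzero values, column by column
IA = [0, 1, 3, 4, 4]    # column pointers: column j occupies A[IA[j]:IA[j+1]]
JA = [1, 1, 3, 2]       # row index of each value in A

M = sparse.csc_matrix((A, JA, IA), shape=(4, 4))
dense = M.toarray()

# One pointer per column, plus a final end marker
assert len(IA) == M.shape[1] + 1
```

Column 3 is empty, which is why the last two entries of `IA` are equal.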

#### CSR -- Compressed Sparse Row format

* Similar to CSC, but more efficient for operations accessing the rows
* Uses the "Yale" format:
   * A -- nonzero values in "row major" order
   * IA -- row pointers, defined recursively: IA[0] = 0 and IA[i+1] = IA[i] + (number of nonzeros in row i), so its length is n_rows + 1
   * JA -- column index of each element in A (same length as A)

In [None]:
# CSR Example:
A = [5, 8, 3, 6]      # nonzero values
IA = [0, 0, 2, 3, 4]  # 0 nonzero elts on 0th row, 2 on 1st, 1 on 2nd, 1 on 3rd (as ranges)
JA = [0, 1, 2, 1]     # 5 is in the 0th col, 8 and 6 are in the 1st col, 3 is in the 2nd col

sparse.csr_matrix((A, JA, IA), shape=(4,4)).toarray()
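
To make the role of IA concrete, this sketch walks the same three arrays in plain Python and rebuilds the dense rows by hand. It is only meant to show how the arrays are read, not how SciPy implements the conversion.

```python
A  = [5, 8, 3, 6]
IA = [0, 0, 2, 3, 4]
JA = [0, 1, 2, 1]

n_rows, n_cols = len(IA) - 1, 4
dense = [[0] * n_cols for _ in range(n_rows)]
for i in range(n_rows):
    # The entries of row i live in A[IA[i]:IA[i+1]]
    for k in range(IA[i], IA[i + 1]):
        dense[i][JA[k]] = A[k]
```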

#### BSR -- Block Sparse Row format

* Essentially CSR, but having identically shaped "blocks" of dense sub-matrices inside

In [None]:
# BSR example
indptr = np.array([0, 2, 3, 6])
indices = np.array([0, 2, 2, 0, 1, 2])
data = np.array([1, 2, 3, 4, 5, 6]).repeat(4).reshape(6, 2, 2)
mtx = sparse.bsr_matrix((data, indices, indptr), shape=(6, 6))
mtx.todense()
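
As a follow-up sketch on the same BSR matrix, the attributes below show how the blocks are stored: `data` holds one dense 2x2 array per stored block, while `indices` and `indptr` play the same role as in CSR but count blocks rather than single elements.

```python
import numpy as np
from scipy import sparse

indptr = np.array([0, 2, 3, 6])
indices = np.array([0, 2, 2, 0, 1, 2])
data = np.array([1, 2, 3, 4, 5, 6]).repeat(4).reshape(6, 2, 2)
mtx = sparse.bsr_matrix((data, indices, indptr), shape=(6, 6))

block_shape = mtx.blocksize        # shape of each dense block
n_blocks = mtx.data.shape[0]       # number of stored blocks

# Block row 0 holds the blocks at block-columns indices[0:2], i.e. 0 and 2
top_left = mtx.toarray()[:2, :2]   # block (0, 0)
```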

#### LIL -- List of Lists format

* Row-based list of lists: each row stores a sorted list of column indices and a matching list of values
* One of two formats that support efficient incremental construction (DOK is the other)
* Flexible slicing
* Efficient when the sparsity structure changes
* Slow arithmetic
* Use when the sparsity pattern is not known a priori or changes
   * e.g., reading a sparse matrix from a text file


In [None]:
# Example 1 from scipy.sparse documentation:

# Construct a 1000x1000 lil_matrix and add some values to it:
from scipy.sparse import lil_matrix
from scipy.sparse.linalg import spsolve
from numpy.linalg import solve, norm
from numpy.random import rand

A = lil_matrix((1000, 1000))
A[0, :100] = rand(100)
A[1, 100:200] = A[0, :100]
A.setdiag(rand(1000))

# visualize the random sparse matrix
plt.spy(A.toarray(), aspect='auto')

# Now convert it to CSR format and solve A x = b for x:
A = A.tocsr()
b = rand(1000)
x = spsolve(A, b)  # sparse solve on sparse matrices

# Convert it to a dense matrix and solve, and check that the result is the same:
x_ = solve(A.toarray(), b)  # numpy solve on numpy arrays

# Now we can compute norm of the error with:
err = norm(x-x_)
err < 1e-10

# It should be small :)

#### DOK -- Dictionary of Keys format

* Dictionary mapping (row, column) index tuples to the value of the non-zero elements
   * subclass of Python dict
* One of two formats that support efficient incremental construction (LIL is the other)
* Efficient O(1) access to individual elements
* Flexible slicing
* Efficient sparsity structure changes
* Slow arithmetic
* Use when the sparsity pattern is not known a priori or changes
   * e.g., reading a sparse matrix from a text file

In [None]:
# Example from scipy.sparse.dok_matrix documentation
from scipy.sparse import dok_matrix
S = dok_matrix((5, 5), dtype=np.float32)
for i in range(5):
    for j in range(5):
        S[i, j] = i + j    # Update element.
S.toarray()

# Every cell's value is the sum of its indices
# Not very sparse, but illustrates how to use!
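
A sparser sketch of the same idea, with arbitrary positions and values: individual elements are inserted and updated one at a time, and the result is converted to CSR once construction is finished.

```python
import numpy as np
from scipy.sparse import dok_matrix

S = dok_matrix((1000, 1000), dtype=np.float64)
S[2, 5] = 1.5       # insert a new element
S[999, 0] = -3.0
S[2, 5] += 1.0      # update an existing element

stored = S.nnz      # number of stored entries
value = S[2, 5]     # read a stored element
missing = S[0, 0]   # unset positions read as zero

C = S.tocsr()       # convert before doing arithmetic
```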

#### COO -- COOrdinate format

* A convenient format for constructing a complete sparse matrix from IJV triples (row index, column index, value)
* Advantages
   * Facilitates fast conversion among sparse formats
   * Permits duplicate entries
   * Very fast conversion to and from CSR/CSC formats
   * Data can be presented in any order
   * Duplicate (row, col) entries are summed when converting to another format or to a dense array
* Disadvantages
   * does not directly support:
      * arithmetic operations
      * slicing

In [None]:
# Example from scipy.sparse.coo_matrix documentation
# Constructing a matrix using ijv format
from scipy.sparse import coo_matrix

# 4 at (0,0), 5 at (3,3), 7 at (1,1), 9 at (0,2)
row  = np.array([0, 3, 1, 0])
col  = np.array([0, 3, 1, 2])
data = np.array([4, 5, 7, 9])

coo_matrix((data, (row, col)), shape=(4, 4)).toarray()
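
The duplicate-entry and slicing points can be seen in a short sketch (values are arbitrary): position (0, 0) is listed twice, the two values are summed on conversion, and slicing is done on the CSR copy rather than on the COO matrix.

```python
import numpy as np
from scipy.sparse import coo_matrix

# (0, 0) appears twice, with values 1 and 2
row = np.array([0, 0, 1])
col = np.array([0, 0, 2])
data = np.array([1, 2, 3])

M = coo_matrix((data, (row, col)), shape=(2, 3))
stored_before = M.nnz    # COO keeps both duplicate entries
dense = M.toarray()      # duplicates are summed here

R = M.tocsr()            # conversion also sums duplicates
stored_after = R.nnz
first_row = R[0, :].toarray()   # slice the CSR copy
```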

#### DIA -- DIAgonal format

* Very simple scheme
* Diagonals in dense NumPy array of shape (n_diag, length)
   * fixed length, so a little space is wasted for diagonals far from the main diagonal
* Offset for each diagonal
   * 0 is the main diagonal
   * negative offset = below
   * positive offset = above
* Fast item-wise operations
   * direct manipulation of NumPy data array
* No slicing, no individual item access
* Specialized use (e.g., solving PDEs)

In [None]:
data = np.array([[1, 2, 3, 4]]).repeat(3, axis=0)
data

In [None]:
offsets = np.array([0, -1, 2])
mtx = sparse.dia_matrix((data, offsets), shape=(4, 4))
mtx   

In [None]:
mtx.todense()
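
For the PDE use case mentioned above, a common pattern is a banded operator. The sketch below builds a 1-D second-difference (discrete Laplacian) matrix with `sparse.diags`; the size `n = 5` and the omission of any grid-spacing factor are simplifications chosen for illustration.

```python
import numpy as np
from scipy import sparse

n = 5
main = -2.0 * np.ones(n)   # main diagonal (offset 0)
off = np.ones(n - 1)       # first sub- and super-diagonals

# Three diagonals: one below (-1), the main (0), one above (+1)
lap = sparse.diags([off, main, off], offsets=[-1, 0, 1], format='dia')
dense = lap.toarray()
```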

In [None]:
np.info(sparse)

### Exercise:

#### Build *AND SHOW* the matrix in Example 2 from the documentation for "sparse".

In [None]:
# Exercise: Build AND SHOW matrix B in Example 2
from scipy import sparse
from numpy import array
# Put your code in the empty cell below after running this cell
# HINT: to "show" the matrix, convert it to a dense matrix!

In [None]:
I = array([0,0,1,3,1,0,0])
J = array([0,2,1,3,1,0,0])
V = array([1,1,1,1,1,1,1])
sparse.coo_matrix((V,(I,J)),shape=(4,4)).tocsr().toarray()

#### Observations:

* The three vectors are correlated by position: entry k of each vector describes one matrix element
   * The *row index* is specified in vector *I*
   * The *column index* is specified in vector *J*
   * The *data value* is specified in vector *V*
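
One way to check this reading of the IJV format is to accumulate the triples into a dense array by hand and compare with what `coo_matrix` produces. This is only a sketch of the semantics, not of SciPy's internals.

```python
import numpy as np
from scipy import sparse

I = np.array([0, 0, 1, 3, 1, 0, 0])
J = np.array([0, 2, 1, 3, 1, 0, 0])
V = np.array([1, 1, 1, 1, 1, 1, 1])

# Entry k contributes V[k] at position (I[k], J[k]); repeats add up
manual = np.zeros((4, 4), dtype=int)
for i, j, v in zip(I, J, V):
    manual[i, j] += v

from_coo = sparse.coo_matrix((V, (I, J)), shape=(4, 4)).toarray()
```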

### Exercise:

#### Build *AND SHOW* the following as a sparse matrix using the IJV format:

```
array([[0, 3, 0, 1],
       [4, 0, 1, 0],
       [5, 0, 9, 0]])
```

In [None]:
# Exercise: Build and show the following as a sparse matrix:
# Fill in the ellipses ("...") below this line

I = array([...])
J = array([...])
V = array([...])
A = sparse.coo_matrix((V,(I,J)),shape=(...))

# Show A
...

In [None]:
# Solution
# Exercise: Build and show the following as a sparse matrix:
# Fill in the ellipses ("...") below this line

# NOTE: My solution references cell (2, 2) twice (7 and 2) to show that duplicate entries are summed (7 + 2 = 9)!

I = array([0,0,1,1,2,2,2])
J = array([1,3,0,2,0,2,2])
V = array([3,1,4,1,5,7,2])
A = sparse.coo_matrix((V,(I,J)),shape=(3,4))

# Show A
A.toarray()

### So, how important is it to use a sparse matrix anyway?!

* In machine learning, as the number of instances (rows, R) and features (columns, C) grows, the space required to hold the data densely grows as R*C (quadratically if R and C grow together).
* Often, when there are many features, each instance has values for only a small fraction of the full feature set.
* This generally leads to the need for sparse matrices in order to hold and manipulate the data in memory.

#### Let's take a look at some examples...

In [None]:
# Generate a random sparse matrix
rows, cols, density = 1000, 10000, 0.01

# Exercise: Change the density value above and rerun the cell to explore the effects of other values

# Create a random sparse matrix
A = sparse.rand(rows, cols, density=density, format='csr')

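# NOTE: spy is given the dense array here (.toarray()), which is fine at this size but allocates the full matrix
# (matplotlib's spy can also take a scipy sparse matrix directly, drawing markers instead of an image)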
plt.spy(A.toarray(), aspect='auto')

# Count the number of non-zero entries
# (A.nonzero() returns index arrays, so counting nonzeros in them would count indices, not entries)
nz_count = A.count_nonzero()
tot_count = rows * cols
print(f'{nz_count} / {tot_count} ({100*nz_count/tot_count}%) non-zero values')

# Let's look at the memory usage
#B = np.zeros((rows, cols)); bsize = B.nbytes
bsize = 8 * rows * cols  # constant answer for dense
asize = A.data.nbytes + A.indptr.nbytes + A.indices.nbytes
print(f'sparse size = {asize} B, dense size = {bsize} B  (sparse size is {100*asize/bsize}% of dense)')
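
The sizes printed above can also be estimated by hand. The sketch below assumes 8-byte values and 4-byte indices (the helper `csr_bytes` and its default item sizes are assumptions for illustration; the actual index width SciPy picks can differ) and also computes the density at which CSR stops saving memory.

```python
def csr_bytes(rows, nnz, value_bytes=8, index_bytes=4):
    """Estimate CSR storage: data + indices + indptr (assumed item sizes)."""
    return nnz * value_bytes + nnz * index_bytes + (rows + 1) * index_bytes

rows, cols, density = 1000, 10000, 0.01
nnz = round(rows * cols * density)

sparse_est = csr_bytes(rows, nnz)   # estimated CSR size in bytes
dense_bytes = 8 * rows * cols       # dense float64 size in bytes

# Density at which the CSR estimate equals the dense size
break_even = (dense_bytes - (rows + 1) * 4) / (12 * rows * cols)
```

Under these assumptions each stored nonzero costs 12 bytes instead of 8, so CSR only pays off while the density stays below roughly two thirds.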

### Let's see the space required as a function of density...

In [None]:
# Plot the sparse size as a function of density in random sparse matrices
rows, cols = 1000, 10000
densities = [10**(-x) for x in range(1,9)]
sizes = list()
for d in densities:
    A = sparse.rand(rows, cols, density=d, format='csr')
    sizes.append(A.data.nbytes + A.indptr.nbytes + A.indices.nbytes)
plt.semilogx(densities[::-1], sizes[::-1], label='sparse')
plt.plot(densities, [rows*cols*8]*len(sizes), label='dense')
plt.title('Sparse size as a function of density')
plt.xlabel('density (log scale)')
plt.ylabel('size(B) (linear scale)')
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)

### Or conversely, the space required as the dimension n grows for a dense (square, n x n) matrix:

In [None]:
x = np.linspace(0, 1e6, 10)
plt.plot(x, 8.0 * (x**2) / 1e6)    
plt.xlabel('size n')    
plt.ylabel('memory [MB]') 