# Important note!

Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All).

Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE", as well as your name and collaborators below:

In [None]:
YOUR_ID = "" # Please enter your GT login, e.g., "rvuduc3" or "gtg911x"
COLLABORATORS = [] # list of strings of your collaborators' IDs

In [None]:
import re

RE_CHECK_ID = re.compile (r'''[a-zA-Z]+\d+|[gG][tT][gG]\d+[a-zA-Z]''')
assert RE_CHECK_ID.match (YOUR_ID) is not None

collab_check = [RE_CHECK_ID.match (i) is not None for i in COLLABORATORS]
assert all (collab_check)

del collab_check
del RE_CHECK_ID
del re

**Jupyter / IPython version check.** The following code cell verifies that you are using the correct version of Jupyter/IPython.

In [None]:
import IPython
assert IPython.version_info[0] >= 3, "Your version of IPython is too old, please update it."

# Part 3: Sparse matrix storage [27 points]

**Downloads.** For this part of the lab, you'll need to download the following dataset:

* http://cse6040.gatech.edu/fa15/UserEdges-1M.csv (about 44 MiB)

It's a list of pairs of strings. The strings, it turns out, correspond to Yelp! user IDs; a pair $(a, b)$ exists if user $a$ is friends on Yelp! with user $b$.

In [None]:
import numpy as np
import pandas as pd
from random import sample # Used to generate a random sample
from IPython.display import display

## Sample dataset

Start by looking at the sample dataset.

In [None]:
edges_raw = pd.read_csv ('UserEdges-1M.csv')
display (edges_raw.head ())

**Exercise 1** (3 points). Explain what the following code cell does.

In [None]:
edges_raw_trans = pd.DataFrame ({'Source': edges_raw['Target'],
                                 'Target': edges_raw['Source']})
edges_raw_symm = pd.concat ([edges_raw, edges_raw_trans])
edges = edges_raw_symm.drop_duplicates ()

V_names = set (edges['Source'])
V_names.update (set (edges['Target']))

num_edges = len (edges)
num_verts = len (V_names)
print ("==> |V| == %d, |E| == %d" % (num_verts, num_edges))

YOUR ANSWER HERE

## Sparse matrix storage: Baseline methods

Let's start by reminding ourselves how our previous method for storing sparse matrices, based on nested default dictionaries, works and performs.
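
As a quick refresher, the sketch below shows the nested-default-dictionary idea on a tiny matrix, using `collections.defaultdict` directly rather than the helper functions of this lab. The names `M`, `count_stored`, and the `stored_*` variables are illustrative only. One behavior worth keeping in mind: merely reading a missing entry creates it, so the number of stored entries can grow without any explicit assignment.

```python
from collections import defaultdict

# A tiny sparse matrix: outer keys are row indices, inner keys are column indices.
M = defaultdict(lambda: defaultdict(float))
M[0][1] = -2.5
M[2][0] = 6.0

def count_stored(mat):
    """Counts the entries physically stored in a nested-dictionary matrix."""
    return sum(len(row) for row in mat.values())

stored_before = count_stored(M)  # two explicit assignments so far

# Reading a missing entry returns the default (0.0) but also inserts it.
missing_value = M[1][1]
stored_after = count_stored(M)

# Using .get() on an existing row reads without inserting anything.
safe_value = M[0].get(2, 0.0)
stored_final = count_stored(M)

print(stored_before, missing_value, stored_after, safe_value, stored_final)
```

If the stored-entry count matters (for example, when counting edges), iterating over existing keys or using `.get()` avoids these accidental insertions.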

In [None]:
def sparse_matrix (base_type=float):
    """Returns a sparse matrix using nested default dictionaries."""
    from collections import defaultdict
    return defaultdict (lambda: defaultdict (base_type))

def dense_vector (init, base_type=float):
    """
    Returns a dense vector, either of a given length
    and initialized to 0 values or using a given list
    of initial values.
    """
    # Case 1: `init` is a list of initial values for the vector entries
    if type (init) is list:
        initial_values = init
        return [base_type (x) for x in initial_values]
    
    # Else, case 2: `init` is a vector length.
    assert type (init) is int
    return [base_type (0)] * init

**Exercise 2** (3 points). Implement a function to compute $y \leftarrow A x$. Assume that the keys of the sparse matrix data structure are integers in the interval $[0, s)$ where $s$ is the number of rows or columns as appropriate.

In [None]:
def spmv (A, x, num_rows=None):
    if num_rows is None:
        num_rows = max (A.keys ()) + 1
    y = dense_vector (num_rows)
    # YOUR CODE HERE
    raise NotImplementedError()
    return y

In [None]:
# Test:
#
#   / 0.   -2.5   1.2 \   / 1. \   / -1.4 \
#   | 0.1   1.    0.  | * | 2. | = |  2.1 |
#   \ 6.   -1.    0.  /   \ 3. /   \  4.0 /

A = sparse_matrix ()
A[0][1] = -2.5
A[0][2] = 1.2
A[1][0] = 0.1
A[1][1] = 1.
A[2][0] = 6.
A[2][1] = -1.

x = dense_vector ([1, 2, 3])
y0 = dense_vector ([-1.4, 2.1, 4.0])


# Try your code:
y = spmv (A, x)

max_abs_residual = max ([abs (a-b) for a, b in zip (y, y0)])

print ("==> A:", A)
print ("==> x:", x)
print ("==> True solution, y0:", y0)
print ("==> Your solution, y:", y)
print ("==> Residual (infinity norm):", max_abs_residual)
assert max_abs_residual <= 1e-15

print ("\n(Passed.)")

> Do you notice anything interesting about the testing procedure and results?

Next, let's convert the `edges` input into a sparse matrix representing its connectivity graph. To do so, we'll first want to map names to integers.

In [None]:
id2name = {} # id2name[id] == name
name2id = {} # name2id[name] == id

for k, v in enumerate (V_names):
    # for debugging
    if k <= 5: print ("Name %s -> Vertex id %d" % (v, k))
    if k == 6: print ("...")
        
    id2name[k] = v
    name2id[v] = k

**Exercise 3** (3 points). Given `id2name` and `name2id` as computed above, convert `edges` into a sparse matrix, `G`, where there is an entry `G[s][t] == 1.0` wherever an edge `(s, t)` exists.

In [None]:
G = sparse_matrix ()

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
G_rows_nnz = [len (row_i) for row_i in G.values ()]
print ("G has {} vertices and {} edges.".format (len (G.keys ()), sum (G_rows_nnz)))

assert len (G.keys ()) == num_verts
assert sum (G_rows_nnz) == num_edges

# Check a random sample
for k in sample (range (num_edges), 1000):
    i = name2id[edges['Source'].iloc[k]]
    j = name2id[edges['Target'].iloc[k]]
    assert i in G
    assert j in G[i]
    assert G[i][j] == 1.0

print ("\n(Passed.)")

**Exercise 4** (3 points). In the above, we asked you to construct `G` using integer keys. However, since we are, after all, using default dictionaries, we could also use the vertex _names_ as keys. Construct a new sparse matrix, `H`, which uses the vertex names as keys instead of integers.

In [None]:
H = sparse_matrix ()

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
H_rows_nnz = [len (h) for h in H.values ()]
print ("`H` has {} vertices and {} edges.".format (len (H.keys ()), sum (H_rows_nnz)))

assert len (H.keys ()) == num_verts
assert sum (H_rows_nnz) == num_edges

# Check a random sample
for i in sample (list (G.keys ()), 100):  # `sample` needs a sequence in recent Python versions
    i_name = id2name[i]
    assert i_name in H
    assert len (G[i]) == len (H[i_name])
    
print ("\n(Passed.)")

**Exercise 5** (3 points). Implement a sparse matrix-vector multiply for matrices with _named_ keys. In this case, it will be convenient to have vectors that also have named keys; assume dictionaries as suggested below.

In [None]:
def vector_keyed (keys=None, values=0, base_type=float):
    """Returns a keyed vector, stored as a dictionary that maps each key to a value."""
    if keys is not None:
        if type (values) is not list:
            values = [base_type (values)] * len (keys)
        else:
            values = [base_type (v) for v in values]
        x = dict (zip (keys, values))
    else:
        x = {}
    return x

def spmv_keyed (A, x):
    """Performs a sparse matrix-vector multiply for keyed matrices and vectors."""
    assert type (x) is dict
    
    y = vector_keyed (keys=x.keys (), values=0.0)
    
    # YOUR CODE HERE
    raise NotImplementedError()
    
    return y

In [None]:
# Test:
#
#   'row':  / 0.   -2.5   1.2 \   / 1. \   / -1.4 \
#  'your':  | 0.1   1.    0.  | * | 2. | = |  2.1 |
#  'boat':  \ 6.   -1.    0.  /   \ 3. /   \  4.0 /

KEYS = ['row', 'your', 'boat']

A_keyed = sparse_matrix ()
A_keyed['row']['your'] = -2.5
A_keyed['row']['boat'] = 1.2
A_keyed['your']['row'] = 0.1
A_keyed['your']['your'] = 1.
A_keyed['boat']['row'] = 6.
A_keyed['boat']['your'] = -1.

x_keyed = vector_keyed (KEYS, [1, 2, 3])
y0_keyed = vector_keyed (KEYS, [-1.4, 2.1, 4.0])


# Try your code:
y_keyed = spmv_keyed (A_keyed, x_keyed)

# Measure the residual:
residuals = [(y_keyed[k] - y0_keyed[k]) for k in KEYS]
max_abs_residual = max ([abs (r) for r in residuals])

print ("==> A_keyed:", A_keyed)
print ("==> x_keyed:", x_keyed)
print ("==> True solution, y0_keyed:", y0_keyed)
print ("==> Your solution:", y_keyed)
print ("==> Residual (infinity norm):", max_abs_residual)
assert max_abs_residual <= 1e-15

print ("\n(Passed.)")

Let's benchmark `spmv()` against `spmv_keyed()` on the full dataset. Do they perform differently?

In [None]:
x = dense_vector ([1.] * num_verts)
%timeit spmv (G, x)

x_keyed = vector_keyed (keys=[v for v in V_names], values=1.)
%timeit spmv_keyed (H, x_keyed)

## Alternative formats: COO and CSR formats

Take a look at the following slides, which we (hopefully) covered in class: [link](https://t-square.gatech.edu/access/content/group/gtc-3bd6-e221-5b9f-b047-31c7564358b7/slides/2016-10-17--matstore.pdf). These slides cover the basics of two list-based sparse matrix formats known as _coordinate format_ (COO) and _compressed sparse row_ (CSR).

Although these are available as native formats in SciPy, let's create native Python versions first using lists. We can then compare the performance of, say, sparse matrix-vector multiply, against the ones we ran above.
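
To make the two layouts concrete, here is a small sketch that stores a 3-by-3 example matrix in both formats and expands each back into a dense list of lists. The helper names `coo_to_dense` and `csr_to_dense` are hypothetical, chosen for illustration; they are not part of the lab. In COO, entry `k` sits at row `rows[k]` and column `cols[k]`. In CSR, the entries of row `i` occupy positions `ptrs[i]` up to (but not including) `ptrs[i+1]` of the index and value lists.

```python
# Example matrix:
#   [ 0.   -2.5   1.2 ]
#   [ 0.1   1.    0.  ]
#   [ 6.   -1.    0.  ]

# COO: one (row, column, value) triple per nonzero, in any order.
rows = [0, 0, 1, 1, 2, 2]
cols = [1, 2, 0, 1, 0, 1]
vals = [-2.5, 1.2, 0.1, 1.0, 6.0, -1.0]

# CSR: nonzeros grouped by row; ptrs[i]:ptrs[i+1] delimits row i.
ptrs = [0, 2, 4, 6]
inds = [1, 2, 0, 1, 0, 1]

def coo_to_dense(rows, cols, vals, n_rows, n_cols):
    """Expands COO triples into a dense list of lists."""
    dense = [[0.0] * n_cols for _ in range(n_rows)]
    for i, j, a_ij in zip(rows, cols, vals):
        dense[i][j] = a_ij
    return dense

def csr_to_dense(ptrs, inds, vals, n_cols):
    """Expands CSR arrays into a dense list of lists."""
    n_rows = len(ptrs) - 1
    dense = [[0.0] * n_cols for _ in range(n_rows)]
    for i in range(n_rows):
        for k in range(ptrs[i], ptrs[i + 1]):
            dense[i][inds[k]] = vals[k]
    return dense

dense_from_coo = coo_to_dense(rows, cols, vals, 3, 3)
dense_from_csr = csr_to_dense(ptrs, inds, vals, 3)
print(dense_from_coo == dense_from_csr)
```

Note that CSR replaces the per-entry row index list with one pointer per row (plus one), which is where the "compressed" in the name comes from.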

**Exercise 6** (3 points). Convert the `edges[:]` data into a coordinate (COO) data structure in native Python using three lists, `coo_rows[:]`, `coo_cols[:]`, and `coo_vals[:]`, to store the row indices, column indices, and matrix values, respectively. Use integer indices and set all values to 1.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert len (coo_rows) == num_edges
assert len (coo_cols) == num_edges
assert len (coo_vals) == num_edges
assert all ([v == 1. for v in coo_vals])

# Randomly check a bunch of values
coo_zip = zip (coo_rows, coo_cols, coo_vals)
for i, j, a_ij in sample (list (coo_zip), 1000):
    assert (i in G) and j in G[i]
    
print ("\n(Passed.)")

**Exercise 7** (3 points). Implement a sparse matrix-vector multiply routine for the COO format.

In [None]:
def spmv_coo (R, C, V, x, num_rows=None):
    """
    Returns y = A*x, where A has `num_rows` rows and is stored in
    COO format by the triple of lists, (R, C, V).
    """
    assert type (x) is list
    assert type (R) is list
    assert type (C) is list
    assert type (V) is list
    assert len (R) == len (C) == len (V)
    if num_rows is None:
        num_rows = max (R) + 1
    
    y = dense_vector (num_rows)
    
    # YOUR CODE HERE
    raise NotImplementedError()

    return y

In [None]:
# Test:
#
#   / 0.   -2.5   1.2 \   / 1. \   / -1.4 \
#   | 0.1   1.    0.  | * | 2. | = |  2.1 |
#   \ 6.   -1.    0.  /   \ 3. /   \  4.0 /

A_coo_rows = [0, 0, 1, 1, 2, 2]
A_coo_cols = [1, 2, 0, 1, 0, 1]
A_coo_vals = [-2.5, 1.2, 0.1, 1., 6., -1.]

x = dense_vector ([1, 2, 3])
y0 = dense_vector ([-1.4, 2.1, 4.0])

# Try your code:
y_coo = spmv_coo (A_coo_rows, A_coo_cols, A_coo_vals, x)

max_abs_residual = max ([abs (a-b) for a, b in zip (y_coo, y0)])

print ("==> A_coo:", list (zip (A_coo_rows, A_coo_cols, A_coo_vals)))
print ("==> x:", x)
print ("==> True solution, y0:", y0)
print ("==> Your solution:", y_coo)
print ("==> Residual (infinity norm):", max_abs_residual)
assert max_abs_residual <= 1e-15

print ("\n(Passed.)")

In [None]:
x = dense_vector ([1.] * num_verts)
%timeit spmv_coo (coo_rows, coo_cols, coo_vals, x)

**Exercise 8** (3 points). Now create a CSR data structure, again using native Python lists. Name your output CSR lists `csr_ptrs`, `csr_inds`, and `csr_vals`.

It's easiest to start with the COO representation. We've given you some starter code.

In [None]:
# Aside: What does this do? Try running it to see.

z1 = ['q', 'v', 'c']
z2 = [1, 2, 3]
z3 = ['dog', 7, 'man']

from operator import itemgetter
print (sorted (zip (z1, z2, z3), key=itemgetter (0)))

In [None]:
C = sorted (zip (coo_rows, coo_cols, coo_vals),
            key=itemgetter (0))
nnz = len (C)

assert (C[-1][0] + 1) == num_verts  # Why?

csr_inds = [j for _, j, _ in C]
csr_vals = [a_ij for _, _, a_ij in C]

# What about csr_ptrs?
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert len (csr_ptrs) == (num_verts + 1)
assert len (csr_inds) == num_edges
assert len (csr_vals) == num_edges
assert csr_ptrs[num_verts] == num_edges

# Check some random entries
for i in sample (range (num_verts), 1000):
    assert i in G
    a, b = csr_ptrs[i], csr_ptrs[i+1]
    assert (b-a) == len (G[i])
    assert all ([(j in G[i]) for j in csr_inds[a:b]])
    assert all ([(j in csr_inds[a:b]) for j in G[i].keys ()])

print ("\n(Passed.)")

**Exercise 9** (3 points). Now implement a CSR-based sparse matrix-vector multiply.

In [None]:
def spmv_csr (ptr, ind, val, x, num_rows=None):
    assert type (ptr) == list
    assert type (ind) == list
    assert type (val) == list
    assert type (x) == list
    if num_rows is None: num_rows = len (ptr) - 1
    assert len (ptr) >= (num_rows+1)  # Why?
    assert len (ind) >= ptr[num_rows]  # Why?
    assert len (val) >= ptr[num_rows]  # Why?
    
    y = dense_vector (num_rows)

    # YOUR CODE HERE
    raise NotImplementedError()
    
    return y

In [None]:
# Test:
#
#   / 0.   -2.5   1.2 \   / 1. \   / -1.4 \
#   | 0.1   1.    0.  | * | 2. | = |  2.1 |
#   \ 6.   -1.    0.  /   \ 3. /   \  4.0 /

A_csr_ptrs = [ 0,        2,       4,       6]
A_csr_cols = [ 1,   2,   0,   1,  0,   1]
A_csr_vals = [-2.5, 1.2, 0.1, 1., 6., -1.]

x = dense_vector ([1, 2, 3])
y0 = dense_vector ([-1.4, 2.1, 4.0])

# Try your code:
y_csr = spmv_csr (A_csr_ptrs, A_csr_cols, A_csr_vals, x)

max_abs_residual = max ([abs (a-b) for a, b in zip (y_csr, y0)])

print ("==> A_csr_ptrs:", A_csr_ptrs)
print ("==> A_csr_{cols, vals}:", list (zip (A_csr_cols, A_csr_vals)))
print ("==> x:", x)
print ("==> True solution, y0:", y0)
print ("==> Your solution:", y_csr)
print ("==> Residual (infinity norm):", max_abs_residual)
assert max_abs_residual <= 1e-15

print ("\n(Passed.)")

In [None]:
x = dense_vector ([1.] * num_verts)
%timeit spmv_csr (csr_ptrs, csr_inds, csr_vals, x)

## Using SciPy's implementations

You should have noticed that the list-based COO and CSR formats do not yield sparse matrix-vector multiply implementations that are much faster than the dictionary-based methods. Let's instead try SciPy's native COO and CSR implementations.
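
For a sense of what these objects look like, the sketch below builds a 3-by-3 example with `scipy.sparse.coo_matrix`, converts it to CSR, and inspects the resulting arrays. The attributes `indptr`, `indices`, and `data` are SciPy's names for what this lab calls the row pointers, column indices, and values. The variable names `A_small`, `A_small_csr`, `x_small`, and `y_small` are illustrative only.

```python
import numpy as np
import scipy.sparse as sp

# The same 3-by-3 example used in the tests above, as COO triples.
rows = [0, 0, 1, 1, 2, 2]
cols = [1, 2, 0, 1, 0, 1]
vals = [-2.5, 1.2, 0.1, 1.0, 6.0, -1.0]

A_small = sp.coo_matrix((vals, (rows, cols)), shape=(3, 3))
A_small_csr = A_small.tocsr()

# SciPy's CSR arrays correspond to (ptrs, inds, vals).
print("row pointers:  ", A_small_csr.indptr)
print("column indices:", A_small_csr.indices)
print("values:        ", A_small_csr.data)

# Sparse matrix-vector product with a dense NumPy vector.
x_small = np.array([1.0, 2.0, 3.0])
y_small = A_small_csr.dot(x_small)
print("y:", y_small)
```

Passing `shape` explicitly is a precaution: without it, SciPy infers the dimensions from the largest indices present, which can undercount if trailing rows or columns are empty.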

In [None]:
import numpy as np
import scipy.sparse as sp

A_coo_sp = sp.coo_matrix ((coo_vals, (coo_rows, coo_cols)))
A_csr_sp = A_coo_sp.tocsr () # Alternatively: sp.csr_matrix ((val, ind, ptr))
x_sp = np.ones (num_verts)

print ("\n==> COO in Scipy:")
%timeit A_coo_sp.dot (x_sp)

print ("\n==> CSR in Scipy:")
%timeit A_csr_sp.dot (x_sp)