# FLIP (00): Data Science 
**(Module 03: Linear Algebra)**

---
- Materials in this module include resources collected from various open-source online repositories.
- You are free to use, but NOT allowed to change and distribute this package.

Prepared by and for 
**Student Members** |
2006-2018 [TULIP Lab](http://www.tulip.org.au), Australia

---
## Session 18 Sparse Matrices

In [None]:
%matplotlib inline
import numpy as np
import pandas as pd
from scipy import sparse
import scipy.sparse.linalg as spla
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
sns.set_context('notebook', font_scale=1.5)

## Creating a sparse matrix

There are many applications in which we deal with matrices that are mostly zeros. For example, a matrix representing a social network is very sparse: there are 7 billion people, but most people are directly connected to only a few hundred or a few thousand others. Storing such a network as a sparse rather than a dense matrix can reduce memory requirements by orders of magnitude, with corresponding speed-ups in computation.
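
As a rough illustration of the memory claim, the sketch below builds a toy network and compares the bytes used by a dense array with the bytes used by the three arrays behind a CSR matrix. The network size, the number of links per node and the random seed are arbitrary assumptions chosen only to keep the example small.

```python
import numpy as np
from scipy import sparse

# Hypothetical toy network: 2,000 nodes with about 10 links per node
rng = np.random.default_rng(0)
n, links_per_node = 2_000, 10
rows = np.repeat(np.arange(n), links_per_node)
cols = rng.integers(0, n, size=rows.size)
vals = np.ones(rows.size)

S = sparse.coo_matrix((vals, (rows, cols)), shape=(n, n)).tocsr()
D = S.toarray()

# A CSR matrix is backed by three 1-D arrays: data, indices and indptr
dense_bytes = D.nbytes
sparse_bytes = S.data.nbytes + S.indices.nbytes + S.indptr.nbytes
print(f"dense: {dense_bytes:,} bytes, CSR: {sparse_bytes:,} bytes")
```

The gap widens as the matrix grows, since dense storage scales with the number of cells while sparse storage scales with the number of non-zero entries.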

### Coordinate format

The simplest sparse matrix format is built from the coordinates and values of the non-zero entries.

#### From dense matrix

In [None]:
A = np.random.poisson(0.2, (5,15)) * np.random.randint(0, 10, (5, 15))
A

In [None]:
rows, cols = np.nonzero(A)
vals = A[rows, cols]

In [None]:
vals

In [None]:
rows

In [None]:
cols

In [None]:
X1 = sparse.coo_matrix(A)
X1

In [None]:
print(X1)

#### From coordinates

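Note that the `(values, (rows, cols))` argument is a single tuple. Passing `shape` explicitly keeps trailing all-zero rows or columns, which would otherwise be dropped when the shape is inferred from the largest index.

In [None]:
X2 = sparse.coo_matrix((vals, (rows, cols)), shape=A.shape)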
X2

In [None]:
print(X2)

#### Convert back to dense matrix

In [None]:
X2.todense()

### Compressed Sparse Row and Column formats

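When several non-zero entries share a row (or a column), storing the full coordinates of each one is redundant. The Compressed Sparse Row (CSR) format keeps the values and their column indices, ordered row by row, and replaces the row coordinates with a shorter pointer array that records where each row starts. The Compressed Sparse Column (CSC) format does the same with the roles of rows and columns swapped.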
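
To make the three CSR arrays concrete, here is a minimal sketch on a small hand-written matrix (the matrix `M` is made up for illustration). Row `i` occupies the slice `indptr[i]:indptr[i+1]` of `data` and `indices`, so an empty row shows up as two equal consecutive pointers.

```python
import numpy as np
from scipy import sparse

M = np.array([[0, 5, 0, 0],
              [0, 0, 0, 0],
              [7, 0, 0, 2],
              [0, 0, 3, 0]])
C = sparse.csr_matrix(M)

print(C.data)     # [5 7 2 3]    non-zero values, row by row
print(C.indices)  # [1 0 3 2]    column index of each value
print(C.indptr)   # [0 1 1 3 4]  start of each row, plus the total count

# Rebuild row 2 from the three arrays alone
start, stop = C.indptr[2], C.indptr[3]
row2 = np.zeros(M.shape[1], dtype=M.dtype)
row2[C.indices[start:stop]] = C.data[start:stop]
print(row2)       # [7 0 0 2]
```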

In [None]:
np.vstack([rows, cols])

In [None]:
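# rows is sorted, so the start of row i is the number of entries in earlier rows;
# searching for every row index also handles rows with no non-zero entries
indptr = np.searchsorted(rows, np.arange(A.shape[0] + 1))
indptr

In [None]:
X3 = sparse.csr_matrix((vals, cols, indptr), shape=A.shape)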
X3

In [None]:
X3.todense()

#### Casting from COO format

Because the coordinate format is more intuitive, it is often more convenient to create a COO matrix first and then convert it to CSR or CSC form.

In [None]:
X4 = X2.tocsr()

In [None]:
X4

### COO summation convention

When entries are repeated in a COO matrix, they are **summed** when the matrix is converted to CSR, CSC or dense form. This provides a quick way to construct confusion matrices for evaluating multi-class classification algorithms.
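
A minimal sketch of this behaviour, using made-up coordinates and values: the COO matrix keeps every entry as given, and the duplicates are only added together on conversion.

```python
import numpy as np
from scipy import sparse

r = np.array([0, 0, 1, 1, 1])
c = np.array([0, 0, 1, 1, 0])
v = np.array([1, 2, 3, 4, 5])

X = sparse.coo_matrix((v, (r, c)), shape=(2, 2))
print(X.nnz)          # 5: duplicate coordinates are still stored separately
print(X.tocsr().nnz)  # 3: duplicates are merged on conversion
print(X.toarray())    # [[3 0]
                      #  [5 7]]
```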

In [None]:
rows = np.r_[np.zeros(4), np.ones(4)].astype(int)  # indices should be integers
cols = np.repeat([0,1], 4)
vals = np.arange(8)

In [None]:
rows

In [None]:
cols

In [None]:
vals

In [None]:
X5 = sparse.csr_matrix((vals, (rows, cols)))

In [None]:
print(X5)

## Application: Confusion matrix

### Creating a 2 by 2 confusion matrix

In [None]:
obs = np.random.randint(0, 2, 100)
pred = np.random.randint(0, 2, 100)
vals = np.ones(100).astype('int')

In [None]:
pred

In [None]:
vals.shape, obs.shape , pred.shape

In [None]:
X6 = sparse.coo_matrix((vals, (pred, obs)))

In [None]:
X6.todense()

### Creating an $n$ by $n$ confusion matrix

For classifications with a large number of classes (e.g. image segmentation), the savings are even more dramatic.

In [None]:
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

In [None]:
iris = datasets.load_iris()

In [None]:
knn = KNeighborsClassifier()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, 
                                                    test_size=0.5, random_state=42)

In [None]:
pred = knn.fit(X_train, y_train).predict(X_test)

In [None]:
X7 = sparse.coo_matrix((np.ones(len(pred)).astype('int'), (pred, y_test)))
pd.DataFrame(X7.todense(), index=iris.target_names, columns=iris.target_names)

## Application: PageRank

SciPy provides efficient routines for solving large sparse linear systems, just as it does for dense ones. We will illustrate by calculating the PageRank of airports using data from the [Bureau of Transportation Statistics](http://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236).

In [None]:
data = pd.read_csv('data/airports.csv', usecols=[0,1])

In [None]:
data.shape

In [None]:
data.head()

In [None]:
lookup = pd.read_csv('data/names.csv', index_col=0)

In [None]:
lookup.shape

In [None]:
lookup.head()

In [None]:
import networkx as nx

### Construct the sparse adjacency matrix

In [None]:
# from_pandas_dataframe was removed in networkx 2.0; from_pandas_edgelist replaces it
g = nx.from_pandas_edgelist(data, source='ORIGIN_AIRPORT_ID', target='DEST_AIRPORT_ID')

In [None]:
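airports = np.array(g.nodes())
# to_scipy_sparse_matrix was removed in networkx 3.0; convert the sparse array instead
adj_matrix = sparse.csr_matrix(nx.to_scipy_sparse_array(g))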

### Construct the transition matrix

In [None]:
out_degrees = np.ravel(adj_matrix.sum(axis=1))
diag_matrix = sparse.diags(1 / out_degrees).tocsr()
M = (diag_matrix @ adj_matrix).T

### Modify the transition matrix with a damping factor

The PageRank algorithm assumes that every node can be reached from every other node. To guard against the case where a node has out-degree 0, we allow every node a small random chance of transitioning to any other node, controlled by a damping factor $d$. Then we solve the linear system to find the PageRank score $r$, where $M$ is the transition matrix, $N$ is the number of nodes and $\mathbb{1}$ is a vector of ones.

$$
r = (I - dM)^{-1}\frac{1-d}{N}\mathbb{1}
$$

or equivalently in the $Ax = b$ format

$$
(I - dM)r = \frac{1-d}{N}\mathbb{1}
$$
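
As a sanity check on the linear system, the sketch below solves it for a small hand-made undirected graph and compares the result with plain power iteration, $r \leftarrow dMr + \frac{1-d}{N}\mathbb{1}$. The 4-node graph and the iteration count are assumptions made for illustration and are unrelated to the airport data.

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import spsolve

# Hypothetical undirected graph with edges 0-1, 0-2, 0-3 and 1-2
adj = sparse.csr_matrix(np.array([[0, 1, 1, 1],
                                  [1, 0, 1, 0],
                                  [1, 1, 0, 0],
                                  [1, 0, 0, 0]], dtype=float))
n = adj.shape[0]
d = 0.85

# Column-stochastic transition matrix: each column sums to 1
degrees = np.ravel(adj.sum(axis=1))
M = (sparse.diags(1 / degrees) @ adj).T.tocsc()

# Direct solve of (I - dM) r = (1 - d)/N * 1
A = sparse.eye(n, format='csc') - d * M
b = (1 - d) / n * np.ones(n)
r_direct = spsolve(A, b)

# Power iteration converges to the same fixed point
r_iter = np.full(n, 1 / n)
for _ in range(200):
    r_iter = d * (M @ r_iter) + b

print(r_direct, r_direct.sum())
```

Both approaches should agree to numerical precision, and the scores sum to 1 because the columns of $M$ sum to 1.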

In [None]:
n = len(airports) 
d = 0.85 
I = sparse.eye(n, format='csc')
A = I - d * M
b = (1-d) / n * np.ones(n) # so the sum of all page ranks is 1

In [None]:
A.todense()

In [None]:
from scipy.sparse.linalg import spsolve

In [None]:
r =  spsolve(A, b)
r.sum()

In [None]:
idx = np.argsort(r)

In [None]:
top10 = idx[-10:][::-1]
bot10 = idx[:10]

In [None]:
df = lookup.loc[airports[top10]]
df['degree'] = out_degrees[top10]
df['pagerank']= r[top10]
df

In [None]:
df = lookup.loc[airports[bot10]]
df['degree'] = out_degrees[bot10]
df['pagerank']= r[bot10]
df

### Visualize the airport connections graph and label the top and bottom 5 airports by PageRank

In [None]:
import warnings

In [None]:
labels = {airports[i]: lookup.loc[airports[i]].str.split(':').str[0].values[0] 
          for i in np.r_[top10[:5], bot10[:5]]}

with warnings.catch_warnings():
    warnings.simplefilter('ignore')
    nx.draw(g, pos=nx.spring_layout(g), labels=labels, 
            node_color='blue', font_color='red', alpha=0.5,
            # floor node sizes at 1; clipping with 5000*r as its own upper bound had no effect
            node_size=np.maximum(5000*r, 1), width=0.1)