# Constructing and Manipulating Network Data

When we think of data, we often think of tables or spreadsheets. However, network data is a set of nodes linked together that does not directly and neatly fit into tables. In this section, we will cover the basics of how to represent a network, manipulate the network data, and visualize it.

## Data is not often a network

Data often does not appear as a network. Instead, it is often a collection of observations about entities. If the observations represent interactions, dependencies, or any other kind of relationship, we can abstract out the relationships as a network.

### Example

Suppose we have a dataset containing information about students and their course enrollments. Each row in the dataset represents a student's enrollment in a particular course. The columns contain information such as the student's name, the course name, the course instructor, and the semester in which the course was taken.

We can abstract out the relationships as a network by treating the students and courses as nodes in the network and drawing edges between them to represent enrollments. For example, if a student has enrolled in a particular course, we can draw an edge between the student node and the course node. This network representation can be useful for analyzing patterns of enrollment and identifying clusters of students who tend to enroll in similar courses. It can also help us understand the relationships between courses and instructors and how they may influence student enrollment.

Let's create and visualize the network. We start by loading the csv file:

In [None]:
import pandas as pd

data_file = "./course-enrollements.csv"

# Load the csv file using pandas.read_csv function and show the first few rows.
data_table = pd.read_csv(data_file)
data_table.head()

## Representing networks

### Edge table
Now, let's slice the columns expressing the relationships between students and courses.

In [None]:
# Assignment:
# Create an edge table with DataFrame and name it `edge_table`.
# Then, print the first few rows of the edge table.

Congratulations! You have just created the most common representation of network data, the **edge table** (or edge list). Formally, an edge table is a table with rows representing edges and columns representing the nodes connected by the edges. Because of its simplicity, readability, and portability across platforms and programming languages, the edge table is widely used.
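
As a minimal sketch of what an edge table looks like, the toy example below builds one from a small made-up enrollment table. The rows and the `instructor` column are hypothetical; only the column names `student_name` and `course_name` follow the ones used later in this section.

```python
import pandas as pd

# Hypothetical toy data; the real dataset has more rows and columns.
toy_table = pd.DataFrame(
    {
        "student_name": ["Alice", "Alice", "Bob", "Carol"],
        "course_name": ["History 101", "English 101", "History 101", "English 101"],
        "instructor": ["Smith", "Jones", "Smith", "Jones"],
    }
)

# Keep only the two columns that define an edge (student -- course)
# and drop repeated enrollments, if any.
toy_edge_table = toy_table[["student_name", "course_name"]].drop_duplicates()
print(toy_edge_table)
```

Each row of `toy_edge_table` is one edge: the first column names the student node and the second column names the course node.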

### Indexing

The edge table is direct and interpretable, but it is not an ideal format for computations such as sorting and searching. It is often a good idea to create an **index table** by assigning an integer ID to each node.

In [None]:
import numpy as np

# Extract the unique node labels
node_labels = list(
    set(edge_table["student_name"].values) | set(edge_table["course_name"].values)
)

# Sort the names
node_labels = sorted(node_labels)

# Create a node table
node_table = pd.DataFrame(
    {"label": node_labels, "node_id": np.arange(len(node_labels), dtype=int)}
)
node_table

Then, we recreate the edge table by using the newly created index table.

In [None]:
# Create a mapping from node labels to node index
to_node_ids = node_table[["node_id", "label"]].set_index("label")["node_id"].to_dict()

student_ids = edge_table["student_name"].map(to_node_ids)
course_ids = edge_table["course_name"].map(to_node_ids)
edge_table = pd.DataFrame({"student_id": student_ids, "course_id": course_ids})
edge_table

### Visualization

The edge table can be easily processed by machines, but it is not easily understandable for humans. Do you have any insights about the network pattern based on the edge table? This is where visualization becomes important. By visualizing the network, we can understand its structure effectively and efficiently.

In [None]:
import igraph
import numpy as np

g = igraph.Graph.DataFrame(edge_table, directed=False)
# Draw the graph with igraph, labeling each vertex with its node label
igraph.plot(g, vertex_label=node_table["label"].values)

Can you see any patterns?

Now, let's compute some simple statistics. How many students are enrolled in Computer Science 101?

In [None]:
# Assignment:
# Count the number of enrollments using edge_table and save it to the variable `cnt`

How many students are enrolled in both "History 101" and "English 101"?

In [None]:
# Assignment:
# Count the number of students who enrolled in both courses by using the edge table.
#
# Hint:
# 1. Find the ids of the students who enrolled in History 101
# 2. Find the ids of the students who enrolled in English 101
# 3. Find the intersection of the two sets of students.
#
# - pandas.DataFrame has a convenient API `query` that can be used to filter rows based on column values. For instance,
# > history_course_id = to_node_ids["History 101"]
# > edge_table.query(f"course_id == {history_course_id}")
# gives the subset of the edge table with the students who enrolled in History 101.

### From edge list to adjacency matrix

The code to answer these simple questions may seem tedious. While the edge table is simple, it is not a convenient format for computing with network data. A more convenient format is the **adjacency matrix**. It is a square matrix in which each $(i,j)$ entry represents the presence ($A_{ij}=1$) or absence ($A_{ij}=0$) of an edge between nodes $i$ and $j$.
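
To make the definition concrete, the sketch below fills an adjacency matrix by hand for a small undirected network. The edge list `toy_edges` and the names `A_toy` and `n_nodes` are hypothetical and are not part of the course dataset.

```python
import numpy as np

# Hypothetical undirected edge list over 4 nodes (node IDs 0..3).
toy_edges = np.array([[0, 1], [0, 2], [0, 3], [1, 2], [2, 3]])
n_nodes = 4

# Start from an all-zero matrix and flag each edge in both directions,
# because an undirected edge (i, j) is also an edge (j, i).
A_toy = np.zeros((n_nodes, n_nodes), dtype=int)
A_toy[toy_edges[:, 0], toy_edges[:, 1]] = 1
A_toy[toy_edges[:, 1], toy_edges[:, 0]] = 1
print(A_toy)
```

For an undirected network the resulting matrix is symmetric, and the number of ones is twice the number of edges.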

Let's create the adjacency matrix from the `igraph` graph `g`:

In [None]:
import numpy as np

# Convert the adjacency matrix of the graph into a dense NumPy array
A = np.array(g.get_adjacency().data)

Let's compute the number of enrollments in Computer Science 101 by using the adjacency matrix. To do that, let's look at the row corresponding to the course.

In [None]:
A[to_node_ids["Computer Science 101"], :]

It encodes the neighbors of the course, where the neighbors (i.e., students) are flagged as one. So we can compute the number of enrollments in Computer Science 101 by

In [None]:
np.sum(A[to_node_ids["Computer Science 101"], :])

Can you compute the number of students who enrolled in both English 101 and History 101 by using the adjacency matrix?

Hint: Use `A[to_node_ids[course name], :]` twice.

In [None]:
# Assignment: compute the number of students who enrolled in both English 101 and History 101

The adjacency matrix is a powerful representation that is convenient for computation. By utilizing linear algebra
operations, you write less code and get results quickly.
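
As a rough illustration of this point, the sketch below uses a small hypothetical adjacency matrix `A_toy` (not the course network). Row sums give the number of neighbors of every node, and the matrix product of `A_toy` with itself counts the neighbors shared by every pair of nodes.

```python
import numpy as np

# Hypothetical 4-node undirected network.
A_toy = np.array(
    [
        [0, 1, 1, 1],
        [1, 0, 1, 0],
        [1, 1, 0, 1],
        [1, 0, 1, 0],
    ]
)

# Number of neighbors (degree) of each node: sum over each row.
degree = A_toy.sum(axis=1)

# Entry (i, j) of A_toy @ A_toy is the dot product of rows i and j,
# i.e., the number of nodes adjacent to both i and j.
common = A_toy @ A_toy

print(degree)
print(common[1, 3])  # number of neighbors shared by nodes 1 and 3
```

The diagonal of `common` reproduces `degree`, since every node shares all of its neighbors with itself.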

### Sparse matrix: coordinate list and adjacency list

The adjacency matrix is often *sparse*. What does this mean? Let's take a look at the whole adjacency matrix:


In [None]:
# Print the adjacency matrix
import sys

np.set_printoptions(threshold=sys.maxsize)
A

As you can see, only a few entries have value one, while most entries are zero. The adjacency matrix is a *redundant* representation; it has $n \times n$ values, few of which indicate the presence of edges. The number of values grows quadratically with the number of nodes and can quickly exceed the available memory.

There are several representations of sparse matrices. The edge list is in fact one of them; it keeps only the non-zero elements. It is also called the *coordinate list*. Why *coordinate*? We can consider each entry of an adjacency matrix as a pixel, and these pixels form an image.

In [None]:
import matplotlib.pyplot as plt

plt.imshow(A)

The yellow pixels represent the presence of edges, and the edge table lists the *coordinates* of these pixels, i.e.,

In [None]:
edge_table
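
The link between the matrix and the coordinate list can be sketched with `numpy.nonzero`, which returns the row and column indices of the non-zero entries. The matrix `A_toy` below is a hypothetical 4-node example, not the course network.

```python
import numpy as np

# Hypothetical 4-node undirected network.
A_toy = np.array(
    [
        [0, 1, 1, 1],
        [1, 0, 1, 0],
        [1, 1, 0, 1],
        [1, 0, 1, 0],
    ]
)

# Coordinates (row, column) of all non-zero entries.
rows, cols = np.nonzero(A_toy)
coords = list(zip(rows.tolist(), cols.tolist()))

# In an undirected network each edge appears twice, as (i, j) and (j, i).
# Keeping only the pairs with i < j yields one row per edge.
toy_edge_list = [(i, j) for i, j in coords if i < j]
print(toy_edge_list)
```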

Another representation is the adjacency list representation. It is a dictionary with keys being nodes. Each value associated with a key is a list of neighbor node IDs.

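This representation corresponds closely to the list-of-lists (LIL) sparse format, which stores one list of column indices per row. The related dictionary-of-keys (DOK) format instead stores a dictionary keyed by (row, column) pairs.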
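
A minimal sketch of building an adjacency list from an edge list is shown below. The edge list `toy_edges` and the name `toy_adj_list` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical undirected edge list over nodes 0..3.
toy_edges = [(0, 1), (0, 2), (0, 3), (1, 2), (2, 3)]

# Append each endpoint to the neighbor list of the other endpoint.
neighbors_of = defaultdict(list)
for i, j in toy_edges:
    neighbors_of[i].append(j)
    neighbors_of[j].append(i)

# Convert to a plain dictionary with sorted keys and sorted neighbor lists.
toy_adj_list = {node: sorted(nbrs) for node, nbrs in sorted(neighbors_of.items())}
print(toy_adj_list)
```

Looking up the neighbors of a node is then a single dictionary access, e.g. `toy_adj_list[2]`.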

### Sparse matrix: compressed sparse row/column

![](https://matteding.github.io/images/csr.gif)

Compressed Sparse Row (CSR) is a highly efficient representation of a sparse matrix. Its main idea is to concatenate the neighbor lists of the adjacency list. Here's how it works: suppose we have the following adjacency matrix.

In [None]:
[[0, 1, 1, 1], [1, 0, 1, 0], [1, 1, 0, 1], [1, 0, 1, 0]]

The adjacency list of the matrix is

In [None]:
{
    0: [1, 2, 3],  # neighbors of node 0
    1: [0, 2],  # neighbors of node 1
    2: [0, 1, 3],  # neighbors of node 2
    3: [0, 2],  # neighbors of node 3
}

Now, let's talk about the CSR matrix.
The CSR matrix representation consists of three arrays, i.e., `indices`, `indptr`, and `data`. The `indices` array is formed by concatenating the neighbor lists into a single list.
$$
\begin{align}
\text{indices}:=[\underbrace{1,2,3}_{\text{node 0}},\underbrace{0,2}_{\text{node 1}},\underbrace{0,1,3}_{\text{node 2}},\underbrace{0,2}_{\text{node 3}}]
\end{align}
$$
The array `indices` consists of $n=4$ subarrays, with each subarray representing the neighbors of a specific node. Additionally, there is another array called `indptr` that indicates the partitioning of the
`indices` array.
$$
\begin{align}
\text{indptr}:=[0,3,5,8,10]
\end{align}
$$
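The `indptr` array indicates the starting index of each subarray. For example, `indptr[0]` is the index from which the neighbors of node 0 are listed, and `indptr[1]` is the index from which the neighbors of node 1 are listed. The last element, `indptr[n]`, equals the total length of `indices`, so the neighbors of node $i$ occupy positions `indptr[i]` up to (but not including) `indptr[i+1]`.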

If the sparse matrix is not a binary matrix, meaning the entries of the matrix can take values other than 0 and 1, the CSR representation also includes a `data` array that contains the entry values. The `data` array is divided into subarrays in the same manner as the `indices` array. For example, `data[0]` represents the weight of the edge to the neighbor `indices[0]`.
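
The construction can be sketched for the 4-node example above using only NumPy. The variable names (`toy_adj_list`, `toy_indices`, `toy_indptr`, `toy_data`) are hypothetical, and the final loop is only a sanity check that the three arrays encode the original matrix.

```python
import numpy as np

# Adjacency list of the 4-node example.
toy_adj_list = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1, 3], 3: [0, 2]}
n_nodes = len(toy_adj_list)

# `indices`: the neighbor lists concatenated in node order.
toy_indices = np.concatenate([toy_adj_list[i] for i in range(n_nodes)])

# `indptr`: cumulative number of neighbors, with a leading zero.
n_neighbors = [len(toy_adj_list[i]) for i in range(n_nodes)]
toy_indptr = np.concatenate([[0], np.cumsum(n_neighbors)])

# `data`: all ones, because the network is unweighted.
toy_data = np.ones(len(toy_indices), dtype=int)

# Sanity check: rebuild the dense matrix row by row from the three arrays.
A_rebuilt = np.zeros((n_nodes, n_nodes), dtype=int)
for i in range(n_nodes):
    start, end = toy_indptr[i], toy_indptr[i + 1]
    A_rebuilt[i, toy_indices[start:end]] = toy_data[start:end]

print(toy_indices)
print(toy_indptr)
print(A_rebuilt)
```

The three arrays store 10 + 5 + 10 numbers here. The saving over the $n \times n$ dense matrix becomes substantial only when the matrix is large and sparse.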

With that, let's create the CSR representation of the network of the students and courses. Let's first create the adjacency list.

In [None]:
# Assignment:
# Create the adjacency list (with a dictionary named adjList) of the network of the students and courses.
# adjList = {
#   0: [...]
#   1: [...]
#   2: [...]
#   ....
# }

Then, create the CSR representation.

In [None]:
# Assignment:
# Create the CSR representation of the network of the students and courses.

The CSR matrix is useful for finding the neighbors of a specific node, just like the adjacency list. For instance,
the neighbors of node `Computer Science 101` are given by

In [None]:
node_id = to_node_ids["Computer Science 101"]
indices[indptr[node_id] : indptr[node_id + 1]]

Why does it work? Let's break down the second line.
The variable

In [None]:
indices

is an array consisting of subarrays. Each subarray is associated with a row of the adjacency matrix and consists of the column IDs of the non-zero entries in that row.
The subarrays are partitioned by

In [None]:
indptr

For instance,

In [None]:
indices[indptr[0] : indptr[1]]

is a subarray associated with the first row (and thus node 0), which contains the column IDs of the non-zero entries.
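
The same slicing can be wrapped in a small helper, sketched here on the 4-node example rather than on the course network. The helper name `csr_neighbors` and the `toy_` arrays are hypothetical.

```python
import numpy as np

# CSR arrays of the 4-node example.
toy_indices = np.array([1, 2, 3, 0, 2, 0, 1, 3, 0, 2])
toy_indptr = np.array([0, 3, 5, 8, 10])


def csr_neighbors(indptr, indices, node_id):
    """Return the neighbor IDs of `node_id` from the CSR arrays."""
    return indices[indptr[node_id] : indptr[node_id + 1]]


# Neighbors of single nodes.
print(csr_neighbors(toy_indptr, toy_indices, 0))
print(csr_neighbors(toy_indptr, toy_indices, 1))

# Nodes adjacent to both node 0 and node 2.
shared = np.intersect1d(
    csr_neighbors(toy_indptr, toy_indices, 0),
    csr_neighbors(toy_indptr, toy_indices, 2),
)
print(shared)
```

Because each lookup is a slice of a contiguous array, no scan over the full edge list is needed.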

In [None]:
# Assignment
# Compute the number of students who enrolled in both English 101 and History 101 by using the CSR format.