# Defining network data

Networks appear in virtually every corner of the world, but they may not appear explicitly in data.
Thus, defining a network from data is an essential yet often overlooked aspect of network analysis. In this notebook, we will explore non-network data and create a network dataset from it. This will help us understand the complex and crucial decisions required to initiate network analysis.

## Data

Our dataset is taken from the [Copenhagen Networks Study](https://www.nature.com/articles/s41597-019-0325-x), which collected data on the physical proximity of about 700 students, measured from Bluetooth signals between their smartphones. Our dataset is a subset of the original dataset, and I added random student names to make it closer to raw form. The original data can be obtained from [here](https://figshare.com/articles/dataset/The_Copenhagen_Networks_Study_interaction_data/7267433/1?file=14000795).

## Defining your goal

There are tons of networks that can be created from the same data, and the choice depends on the specific research question. Thus, it is crucial to have a well-thought-out plan in place before touching the data.

Let's suppose that we want to use the data to identify who to vaccinate to prevent communicable disease from spreading among the students. A communicable disease spreads through contacts with close physical proximity. This means that we want a physical contact network of the students.

## Understanding data

Let us take a look at our dataset:

In [None]:
import pandas as pd

filename = "proximity_data.csv"
contact_data_table = pd.read_csv(filename)
contact_data_table

While you may have an idea about what the columns represent, I encourage you to read the README carefully whenever one is available. Misunderstanding the data format and semantics is a common mistake, and it becomes disastrous as the analysis moves forward.

The README attached to the data is the following:
```raw
column names:
	- timestamp
	- user A
	- user B
	- received signal strength

Notes:
Empty scans are marked with user B = -1 and RSSI = 0
Scans of devices outside of the experiment are marked with user B = -2. All non-experiment devices are given the same ID.
```
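
The special values described in the README can be counted before any filtering is done. The following is a minimal sketch on a small made-up table; the column names `user_a`, `user_b`, and `rssi` are assumptions and may differ from the headers in `proximity_data.csv`, and the values are invented for illustration.

```python
import pandas as pd

# Toy table mimicking the README layout (hypothetical column names and values).
toy = pd.DataFrame(
    {
        "timestamp": [0, 0, 300, 300, 600],
        "user_a": [0, 1, 0, 2, 3],
        "user_b": [1, -1, -2, 1, 0],
        "rssi": [-60, 0, -88, -80, -70],
    }
)

# Count the special rows described in the README notes.
n_empty = int((toy["user_b"] == -1).sum())     # empty scans
n_external = int((toy["user_b"] == -2).sum())  # devices outside the experiment
print(f"empty scans: {n_empty}, external devices: {n_external}")
```

Knowing how many rows carry these markers helps confirm later that the filtering step removed what it was supposed to remove.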

## Inclusion criteria

Now, let's define the network of the students.
The raw data may contain errors, and not all Bluetooth signal data observed can be considered as close physical interactions. Therefore, it is necessary to filter out certain observations. To do so, we must establish clear inclusion criteria to determine what should be considered as physical interactions.

Here, our inclusion criteria are the following:
1. Bluetooth signals must be stronger than -75 dB.
2. Focus on the interactions between students who participated in the experiment.
3. Ignore the empty scans.

In [None]:
# Assignment:
# Filter out the interactions.
#
# Hint: pandas.DataFrame.query is a convenient API for filtering rows based on the column values.
#
# data_table =
#
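
To illustrate how `pandas.DataFrame.query` expresses criteria like these, here is a sketch on a made-up table. The column names (`user_a`, `user_b`, `rssi`) and the values are assumptions for illustration only, so the expression will likely need adapting to the real column headers.

```python
import pandas as pd

# Toy table (hypothetical column names and values).
toy = pd.DataFrame(
    {
        "timestamp": [0, 0, 300, 300, 600],
        "user_a": [0, 1, 0, 2, 3],
        "user_b": [1, -1, -2, 1, 0],
        "rssi": [-60, 0, -88, -80, -70],
    }
)

rssi_threshold = -75  # signals must be stronger than this value

# `@name` refers to a Python variable inside the query string.
# `user_b >= 0` drops both empty scans (-1) and external devices (-2).
filtered = toy.query("rssi > @rssi_threshold and user_b >= 0")
print(filtered)
```

Note that the empty scan in the toy table has an RSSI of 0, which passes the signal threshold, so the condition on `user_b` is needed to exclude it.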

## Normalization

Now, let's create the network data from the filtered proximity data. The first step is *normalization* (or making the data tidy, which we will cover in the next module).
In the context of network data, normalization means reducing redundancy in the data to make subsequent network analysis easier.
This includes:
1. Assigning IDs to nodes and creating a table of nodes, `node_table`.
2. Creating `edge_table`, in which each row consists of the two node IDs forming an edge.

Let's create the node table. An easy and fast way is to use `numpy.unique` with the flag `return_inverse=True`.

In [None]:
# Assignment:
# Create the node table (as a pandas DataFrame) with columns `node_id` and `student_name`. For example,
#
# | node_id | student_name   |
# | 0       | Charlotte Bell |
# | 1       | Anna Volkova   |
#  .....
#
# Hint:
# numpy.unique is useful in assigning unique IDs.
# For instance, consider an array of names
# > names = ["Bob", "Alice", "Bob", "James", "James", "Hana"]
# With `numpy.unique`, we can generate unique IDs for the names
# > unique_names, name_ids = numpy.unique(names, return_inverse=True)
# where unique_names is an array of the unique names, and name_ids is a representation of the input array with integer IDs in place of the name strings.
# Don't forget to set `return_inverse=True`; otherwise you'll get only unique_names.
#
# The list of user names can be generated by
# > user_names = data_table[["user_a", "user_b"]].values.reshape(-1)
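
The hint above can be run as is on the toy list of names. The sketch below also shows one possible way to turn the result into a node table; the toy names come from the hint, not from the dataset.

```python
import numpy as np
import pandas as pd

names = np.array(["Bob", "Alice", "Bob", "James", "James", "Hana"])

# unique_names is sorted; name_ids[i] is the position of names[i] in unique_names.
unique_names, name_ids = np.unique(names, return_inverse=True)
print(unique_names)  # ['Alice' 'Bob' 'Hana' 'James']
print(name_ids)      # [1 0 1 3 3 2]

# One row per unique name; the row position serves as the node ID.
node_table = pd.DataFrame(
    {"node_id": np.arange(len(unique_names)), "student_name": unique_names}
)
print(node_table)
```

Because the IDs index into `unique_names`, the original names can always be recovered with `unique_names[name_ids]`.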

Next, let's create the edge table. Since the same pair of users can be observed in contact multiple times, the data can contain duplicated edges, and we need to count the number of edges between each pair of users.

First, let's forget about the duplicated edges and create the edge table, with `src` and `trg`. We consider the network to be undirected, and thus it doesn't matter whether `user_a` is the `src` or `trg`.

In [None]:
# Assignment:
# Create the edge table (as a pandas DataFrame) with columns `src` and `trg`. Note that `src` should be smaller than or equal to `trg`.
# For example,
#
#  | src | trg |
#  | 0   | 1   |
#  | 0   | 22  |
#  .....
#
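
For an undirected network, one way to make the order of the endpoints irrelevant is to always store the smaller ID in `src` and the larger in `trg`. The following is a minimal sketch using made-up integer IDs; the arrays `a` and `b` are hypothetical stand-ins for the node IDs of the two users in each row.

```python
import numpy as np
import pandas as pd

# Hypothetical node IDs of the two endpoints of each observed contact.
a = np.array([3, 0, 1, 3])
b = np.array([1, 1, 0, 1])

# Element-wise min/max puts every pair in a canonical (small, large) order,
# so that (3, 1) and (1, 3) become the same row (1, 3).
edge_table = pd.DataFrame({"src": np.minimum(a, b), "trg": np.maximum(a, b)})
print(edge_table)
```

After this step, pairs that differ only in the order of their endpoints become identical rows, which makes counting duplicates straightforward.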

Finding the duplicates in the edge table is a common but daunting task (as you will soon find out).

Since we consider the network to be undirected, we want to group rows with the same ID pairs (order insensitive, e.g., (1,2) and (2,1) represent the same pair).
The easiest, though inefficient, way is to use the `pandas.DataFrame.groupby` API.

In [None]:
weighted_edge_table = (
    edge_table.groupby(["src", "trg"]).size().reset_index(name="weight")
)

Alternatively, you can represent an edge with a single complex number. A complex number consists of two values, the real part and the imaginary part.
We can create a complex number from two node IDs. For instance, for an edge connecting nodes 10 and 20, we can represent it by a complex number
$$
10 + 20 j
$$
With this representation, we can find duplicates and compute the frequencies more efficiently. For instance,

In [None]:
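import numpy as np

# Encode each (src, trg) pair as the complex number src + trg * j
src_trg = edge_table["src"].values + 1j * edge_table["trg"].values
# Find the unique pairs and count how often each one occurs
uvalues, counts = np.unique(src_trg, return_counts=True)
# Decode the unique pairs back into integer node IDs
src = np.real(uvalues).astype(int)
trg = np.imag(uvalues).astype(int)
weighted_edge_table = pd.DataFrame({"src": src, "trg": trg, "weight": counts})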
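
As a sanity check, the two counting approaches can be compared on a small made-up edge list. This is only a sketch: the toy edges are invented, and the endpoints are assumed to be already ordered so that `src <= trg`.

```python
import numpy as np
import pandas as pd

# Toy edge list with duplicates (hypothetical values).
toy_edges = pd.DataFrame({"src": [0, 0, 1, 0, 1], "trg": [1, 2, 3, 1, 3]})

# Approach 1: group identical (src, trg) rows and count them.
by_groupby = toy_edges.groupby(["src", "trg"]).size().reset_index(name="weight")

# Approach 2: encode each pair as src + trg*j and count the unique values.
pairs = toy_edges["src"].values + 1j * toy_edges["trg"].values
uvalues, counts = np.unique(pairs, return_counts=True)
by_complex = pd.DataFrame(
    {
        "src": np.real(uvalues).astype(int),
        "trg": np.imag(uvalues).astype(int),
        "weight": counts,
    }
)

print(by_groupby)
print(by_complex)
```

Both tables list the same three unique edges with the same weights, in the same sorted order.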

Finally, save the data into the files:

In [None]:
# Assignment:
#
# Create the "node_table.csv" and "edge_table.csv"
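
Writing a table to CSV is typically done with `DataFrame.to_csv`. The sketch below uses tiny placeholder tables and a temporary directory (both hypothetical) and reads one file back to confirm the round trip. Passing `index=False` keeps the pandas row index out of the file.

```python
import os
import tempfile

import pandas as pd

# Placeholder tables for illustration only.
node_table = pd.DataFrame(
    {"node_id": [0, 1], "student_name": ["Charlotte Bell", "Anna Volkova"]}
)
edge_table = pd.DataFrame({"src": [0], "trg": [1]})

out_dir = tempfile.mkdtemp()  # hypothetical output location
node_path = os.path.join(out_dir, "node_table.csv")
edge_path = os.path.join(out_dir, "edge_table.csv")

# index=False avoids writing the row index as an extra unnamed column.
node_table.to_csv(node_path, index=False)
edge_table.to_csv(edge_path, index=False)

# Read the node table back to check that the file has the expected columns.
reloaded = pd.read_csv(node_path)
print(reloaded)
```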

## Documentation

It is highly recommended to thoroughly document the inclusion criteria and the code used to generate the data in order to ensure reproducibility. Additionally, comprehensive documentation makes you *replaceable*, meaning that if someone wishes to repeat or improve upon the process, the document explains all the steps taken on your behalf, so that you can focus on matters that interest you most now.

```markdown
# Assignment:
# Write the documentation about the steps taken to compile the dataset. Make sure to cover the following points:
# 1. Data source: Which source data is your dataset compiled from?
# 2. What are the inclusion criteria of your dataset? Which data records are excluded?
# 3. Format of your dataset. If it is a table, explain each column.
# 4. When did you compile your dataset?
```

Now you have created the network data on your own. It is portable, shareable, and reproducible! Let's see what the network looks like:

In [None]:
import igraph

g = igraph.Graph.DataFrame(
    edge_table[["src", "trg"]],
    directed=False,
)
igraph.plot(g, vertex_size=5, edge_width=0.1)

## Assignment

1. Your final assignment is to create a dataset of a network of characters in Les Misérables by Victor Hugo.
2. Obtain the source data from [http://ftp.cs.stanford.edu/pub/sgb/jean.dat](http://ftp.cs.stanford.edu/pub/sgb/jean.dat).
This file consists of two tables, separated by a blank line.
The first table is about the characters in the book, separated by space (or tab).
The second table contains the section numbers and the characters appearing in each section, separated by ":".
3.
4. Create the node table, containing the IDs of the characters and their names (in two letters).
5. Create the edge table, consisting of "src" and "trg" columns representing node IDs.
6. Visualize the network and write the documentation about your Les Misérables network dataset.
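
One possible way to define edges from co-appearance data is to connect every pair of characters listed in the same section. The sketch below shows this on a single made-up line. The line, the character codes, and the comma delimiter are hypothetical, so the actual layout of `jean.dat` should be checked against the file, and whether "appearing in the same section" is the right edge definition is itself a decision to document.

```python
from itertools import combinations

# Hypothetical section line: "<section number>:<character codes>".
line = "1.1:AA,BB,CC"

section, members = line.split(":")
codes = sorted(members.split(","))  # sort so each pair is in a canonical order

# Every unordered pair of characters in the section becomes one edge.
edges = list(combinations(codes, 2))
print(section, edges)
```

Repeating this over all sections yields an edge list with duplicates, which can then be counted in the same way as the contact data above.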