![](https://www.lepoint.fr/images/2016/12/13/6457609lpw-6457837-article-jpg_3958977_1250x625.jpg)

# Constructing data and data integrity

In this assignment, we will learn the importance of data integrity and a widely used network construction method, *one-mode projection*, using the [Star Wars Social Network](http://evelinag.com/blog/2015/12-15-star-wars-social-network/index.html). It is a network of Star Wars characters in which an edge connects two characters if they appear in the same scene.

## Data

The Star Wars Social Network is built from movie scripts, and its construction involves complex natural language processing techniques beyond this course's scope. Therefore, we will use a pre-compiled dataset provided by the network's creator, which is available on the creator's [GitHub](https://github.com/evelinag/StarWars-social-network). More specifically, we will use `data/charactersPerScene.csv`. Look at [the data](https://github.com/evelinag/StarWars-social-network/blob/master/data/charactersPerScene.csv) and carefully read the creator's documentation below.

- `charactersPerScene.csv`: each line contains the name of a character followed by the relative
   times when the character is mentioned in the screenplay. I used this data to generate character timelines.
   The values were computed as

       episode number + scene number/number of scenes in the episode

   Values [0,1] correspond to mentions in Episode I, [1,2] to Episode II etc.

   [Note that this is not a valid CSV file because each line contains
   a different number of columns]

What does that mean? Let's see a line in the file:
```csv
LUKE,2.94605809128631,3.0082135523614,3.02669404517454,...
```
The first value decomposes into
$$
2.94605809128631 = \underbrace{2}_{\text{Episode number} - 1} + \underbrace{0.94605809128631}_{\text{Scene \# / number of scenes}}.
$$
This means that Luke makes his first appearance in Episode 3, in a scene located about 95% of the way through the episode.

The original data is not a valid table since each row has a different number of columns, and it is thus not *tidy* (an important concept we will learn in a different module). Therefore, I reformatted the data so that pandas can load it easily, like [this](https://raw.githubusercontent.com/skojaku/adv-net-sci-course/main/data/charactersPerScene-tidied.csv).
```csv
character, scene,episode
LUKE, 2.94605809128631, 3
LUKE, 3.0082135523614, 4
LUKE, 3.02669404517454, 4
...
```

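Although the scene \# looks like a numerical value, ***DO NOT load it as a number***. Floating-point values can be stored in several formats of different precision, and depending on the format, a value may be silently rounded to a similar but distinct value. Because the value serves as the identifier of a scene, even a tiny change can make rows from the same scene look different, or rows from different scenes look identical. For example, `2.94605809128631` can be stored as any of the following: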

In [None]:
2.94605809128631  # Double-precision float (float 64)
2.946058  # Single precision float (float 32)
2.945  # half-precision float (float 16)

In Python, we often don't need to explicitly specify data types because Python is capable of automatically inferring the appropriate types. However, there are cases in which Python may misinterpret and break the data, and identifying this type of error is notoriously difficult.
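The rounding described above can be checked directly. The following is a minimal sketch, assuming NumPy is available; the variable names are illustrative and not part of the assignment.

```python
import numpy as np

scene_as_text = "2.94605809128631"  # the value as it appears in the file
value = float(scene_as_text)  # Python floats are double precision

as_float64 = np.float64(value)
as_float32 = np.float32(value)
as_float16 = np.float16(value)

# Converting back to a Python float exposes the rounding introduced by each type
print(float(as_float64), float(as_float32), float(as_float16))

# Lower-precision types no longer compare equal to the original value
print(float(as_float32) == value, float(as_float16) == value)
```

Keeping the value as text (`scene_as_text`) sidesteps the issue entirely, since a string is never rounded.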

Keeping this in mind, let's load the data with pandas. If you load the data with the `pandas.read_csv` API without further instructions, it will infer the data types of the table columns, which may break the data. Thus, we **must** specify the data types. This can be done with the `dtype` argument of the API. See the [documentation](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) for how to specify data types.
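To see what type inference does, here is a small sketch on a toy table. The inline CSV text below is an assumption made for illustration (three rows copied from the example above, without spaces after the commas); it is not the course file itself.

```python
import io

import pandas as pd

# Toy CSV text mimicking the layout of the tidied file
toy_csv = (
    "character,scene,episode\n"
    "LUKE,2.94605809128631,3\n"
    "LUKE,3.0082135523614,4\n"
    "LUKE,3.02669404517454,4\n"
)

# Without `dtype`, pandas infers the type of each column
inferred = pd.read_csv(io.StringIO(toy_csv))

# With `dtype`, each column is loaded with the type we ask for
explicit = pd.read_csv(
    io.StringIO(toy_csv),
    dtype={"character": str, "scene": str, "episode": int},
)

print(inferred.dtypes)  # "scene" is inferred as a floating-point column
print(explicit["scene"].tolist())  # "scene" is kept as text
```

The same `dtype` dictionary pattern applies when the first argument is a URL instead of an in-memory buffer.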

In [None]:
# Assignment:
# By using the pandas.read_csv API, load the data from the following URL, and save it as `data_table` with the column data types being appropriately specified.
#
import pandas as pd
import numpy as np

data_url = "https://raw.githubusercontent.com/skojaku/adv-net-sci-course/main/data/charactersPerScene-tidied.csv"
data_table = ...  # specify the data types

In [None]:
# Test Data type
assert np.all([isinstance(s, str) for s in data_table["character"].values])
assert np.all([isinstance(s, str) for s in data_table["scene"].values])
display(data_table)

The dataset includes information on all episodes. Let's narrow our focus to a specific episode.

In [None]:
# Assignment: Select the episode you like
episode = 3 # change or keep the value.
data_table = data_table.query(f"episode=={episode}")

## Network construction

The Star Wars social network consists of nodes representing the characters and edges representing their *co-appearances*. In other words, we place an edge between two characters if they appear in the same scene.
We can think of the network as a *one-mode projection* of a bipartite network of characters and scenes, projected onto the character mode. For instance, let's consider the following bipartite network.

![](https://toreopsahl.files.wordpress.com/2009/04/fig1_twomode_half.png?w=271&zoom=2)

In the figure, the blue and orange nodes represent the characters and scenes, respectively. The Star Wars social network can be constructed by placing an edge between every pair of blue nodes (characters) that share an orange node (scene) and then removing the orange nodes. An edge is weighted by the number of scenes in which the two characters co-appear.

![](https://toreopsahl.files.wordpress.com/2009/04/fig1_twomode_simple.png?w=271&zoom=2)

Operationally, let's construct the network in the following steps:
1. Normalize the data by assigning integer IDs to characters and scenes.
2. Construct the adjacency matrix $B$ of the bipartite network, where rows and columns correspond to characters and scenes, respectively.
3. One-mode projection by $B B^\top$
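
As a small illustration of the projection in step 3, the sketch below applies $B B^\top$ to a hand-made toy matrix. The matrix `B_toy` is hypothetical and unrelated to the Star Wars data.

```python
import numpy as np

# Hypothetical toy data: 3 characters (rows) x 2 scenes (columns)
B_toy = np.array(
    [
        [1, 1],  # character 0 appears in scenes 0 and 1
        [1, 0],  # character 1 appears in scene 0 only
        [0, 1],  # character 2 appears in scene 1 only
    ]
)

# One-mode projection onto the character mode
A_toy = B_toy @ B_toy.T
print(A_toy)
```

The off-diagonal entry `A_toy[i, j]` counts the scenes shared by characters `i` and `j`, while the diagonal entry `A_toy[i, i]` counts the scenes in which character `i` appears. Here characters 1 and 2 never share a scene, so `A_toy[1, 2]` is zero.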

### Step 1

In [None]:
# Data normalization
character_names, character_ids = np.unique(data_table["character"], return_inverse=True)
scene_names, scene_ids = np.unique(data_table["scene"], return_inverse=True)

character_table = pd.DataFrame(
    {
        "character_id": np.arange(len(character_names), dtype=int),
        "name": character_names,
    }
)
scene_table = pd.DataFrame(
    {"scene_id": np.arange(len(scene_names), dtype=int), "scene_number": scene_names}
)

n_characters = character_table.shape[0]
n_scenes = scene_table.shape[0]
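
The normalization above relies on `np.unique` with `return_inverse=True`. A minimal sketch on a made-up list of labels shows what the two returned arrays contain:

```python
import numpy as np

# Made-up labels, one per row of a hypothetical table
labels = np.array(["LUKE", "LEIA", "LUKE", "HAN"])

unique_labels, ids = np.unique(labels, return_inverse=True)

print(unique_labels)  # sorted unique labels
print(ids)  # for each row, the position of its label in `unique_labels`

# Indexing the unique labels with the IDs reconstructs the original column
print(unique_labels[ids])
```

Each row thus receives an integer ID in the range `0` to `len(unique_labels) - 1`, which can serve as a row or column index of a matrix.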

### Step 2

In [None]:
# Assignment:
# Construct the adjacency matrix B of the bipartite network, where
# B[i,j] = 1 if character i appears at scene j. Otherwise B[i,j] = 0.
#
# Hint: You can use the `character_ids` and `scene_ids` generated in the data normalization step.
#
from scipy import sparse

B = ...

In [None]:
# Test
assert B.shape[0] == n_characters
assert B.shape[1] == n_scenes
assert B.sum() == data_table.shape[0]

### Step 3

In [None]:
# Assignment:
# Create the Star Wars social network A by projecting the bipartite network B on the character mode (one-mode projection).
#
# Think about the matrix operation corresponding to the one-mode projection.
#
A = ...

In [None]:
# Test
import scipy.linalg  # import the submodule explicitly so that scipy.linalg.svdvals is available

assert A.shape[0] == B.shape[0]
assert A.shape[1] == B.shape[0]

vals = scipy.linalg.svdvals(sparse.csr_matrix(B).toarray())
assert np.isclose(sparse.csr_matrix(A).trace(), np.sum(vals**2))

Let's visualize the network.

In [None]:
import igraph

src, trg, weight = sparse.find(sparse.triu(A, 1))  # will remove the self-loops

edge_list = tuple(zip(src, trg))

g = igraph.Graph(
    edge_list,
    n=A.shape[0],  # keep characters without any edge so that labels stay aligned
    vertex_attrs=dict(label=character_table["name"].values),
    edge_attrs=dict(weight=weight),
)

# See here for how to configure the visualization
# https://igraph.org/python/tutorial/0.9.6/visualisation.html
max_edge_width = 5
min_edge_weight_for_display = 5
layout = g.layout("kk")
igraph.plot(
    g,
    vertex_size=20 * np.sqrt(A.sum(axis=1)) / np.sqrt(np.max(A.sum(axis=1))),
    vertex_color="grey",
    vertex_frame_color="black",
    vertex_label_color="yellow",
    edge_color="#adadad88",
    background="black",
    vertex_label_size=np.maximum(6, np.power(A.sum(axis=1), 0.35)),
    edge_width=max_edge_width
    * np.sqrt(g.es["weight"])
    / np.sqrt(np.max(g.es["weight"])),
    bbox=(500, 500),
    weights=np.sqrt(g.es["weight"]),
    vertex_markeredge=0,
    layout=layout,
)

## Edge weighting

The one-mode projection weights each edge by the number of scenes in which the two characters co-appear. But this simple counting method may overemphasize connections between characters, especially when many characters appear in the same scene (for example, this one):

![](https://miro.medium.com/v2/resize:fit:1400/1*YWKhI4Kup1aSrY68VelC6g.png)

If more than two characters are present in a scene, it is reasonable to consider that the level of interaction is weaker than in a scene with only two characters.

Let's implement this idea. We will adopt the fractional counting scheme proposed by [Newman (2001)](https://journals.aps.org/pre/abstract/10.1103/PhysRevE.64.016132), which normalizes the edge weight by the number of characters appearing in a scene. The new edge weight in the Star Wars social network is given by the following formula:
$$
A_{ij} = \sum_{s \in \text{ scenes}} \frac{B_{is}B_{js}}{d_s}
$$
where $d_s = \sum_{\ell} B_{\ell s}$ is the degree (i.e., the number of edges) of scene $s$ in the bipartite network. In other words, $d_s$ is the number of characters appearing in scene $s$, and it discounts the strength of the connection between two characters $i$ and $j$ that co-appear in a crowded scene.
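To make the formula concrete, here is a sketch that evaluates it on a hypothetical toy matrix (unrelated to the Star Wars data). It assumes every scene contains at least one character, so that no $d_s$ is zero.

```python
import numpy as np

# Hypothetical toy data: 3 characters (rows) x 2 scenes (columns)
B_toy = np.array(
    [
        [1, 1],  # character 0 appears in both scenes
        [1, 1],  # character 1 appears in both scenes
        [1, 0],  # character 2 appears in scene 0 only
    ],
    dtype=float,
)

# d[s] = number of characters in scene s (column sums of B_toy)
d = B_toy.sum(axis=0)

# A_frac[i, j] = sum_s B_toy[i, s] * B_toy[j, s] / d[s]
A_frac = (B_toy / d) @ B_toy.T

print(d)
print(A_frac)
```

Scene 0 has three characters and scene 1 has two, so the pair (0, 1) receives weight 1/3 + 1/2, whereas the unnormalized projection would give it weight 2. The pair (0, 2) shares only the crowded scene 0 and receives weight 1/3.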

In [None]:
# Assignment:
# Generate the adjacency matrix `A_newman` of the Star Wars social network with Newman's edge weighting method.
#
A_newman = ...

In [None]:
# Test:
assert A_newman.shape[0] == B.shape[0]
assert A_newman.shape[1] == B.shape[0]

In [None]:
import igraph

src, trg, weight = sparse.find(sparse.triu(A_newman, 1))  # will remove the self-loops
edge_list = tuple(zip(src, trg))
g = igraph.Graph(
    edge_list,
    n=A_newman.shape[0],  # keep characters without any edge so that labels stay aligned
    vertex_attrs=dict(label=character_table["name"].values),
    edge_attrs=dict(weight=weight),
)

# See here for how to configure the visualization
# https://igraph.org/python/tutorial/0.9.6/visualisation.html
layout = g.layout("kk")
igraph.plot(
    g,
    vertex_size=20
    * np.sqrt(A_newman.sum(axis=1))
    / np.sqrt(np.max(A_newman.sum(axis=1))),
    vertex_color="grey",
    vertex_frame_color="black",
    vertex_label_color="yellow",
    edge_color="#adadad88",
    background="black",
    vertex_label_size=np.maximum(6, np.power(A_newman.sum(axis=1), 0.4)),
    edge_width=max_edge_width
    * np.sqrt(g.es["weight"])
    / np.sqrt(np.max(g.es["weight"])),
    bbox=(500, 500),
    weights=np.sqrt(g.es["weight"]),
    vertex_markeredge=0,
    layout=layout,
)

Can you spot any differences between the two networks? Which one makes more sense to you?

## Assignment: Create a Les Miserables Network

Let's create a dataset of a social network of the characters in Les Miserables by Victor Hugo. You can employ any edge weighting method.

Steps:
1. Utilize the pre-processed dataset available at this link: [pre-processed dataset](https://github.com/skojaku/adv-net-sci-course/tree/main/data/les_miserable).
2. Construct a node table consisting of columns labeled "node_id", "code", and "name", where the "node_id" column contains the integer IDs of the characters, the "code" column contains the name codes (represented by the two letters in the characters.csv file), and the "name" column corresponds to the `name` column of `character_table` in the following code cell.
3. Develop an edge table comprising columns labeled "src", "trg", and "weight" which represent the node IDs and the weight of each edge.
4. Visualize the network.
5. Save the node table and the edge table as separate CSV files.
6. Document your Les Miserables network dataset by including the following information:
   - Specify the data source.
   - Describe the processes involved in transforming the source data into the new dataset.
   - Explain the generated data, including the attributes of nodes and edges.

Once you have completed these steps, please submit the following files:
1. The notebook file in .ipynb format, along with its corresponding HTML file.
2. The node table and the edge table as CSV files.
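
Steps 2, 3, and 5 can be sketched on toy data as follows. Everything in this example is hypothetical: the adjacency matrix, the codes, the names, and the output file names are placeholders, and the files are written to a temporary directory.

```python
import os
import tempfile

import numpy as np
import pandas as pd
from scipy import sparse

# Hypothetical weighted adjacency matrix of a three-node network
A_toy = sparse.csr_matrix(np.array([[0, 2, 1], [2, 0, 0], [1, 0, 0]]))

# Node table with the three required columns (placeholder codes and names)
node_table_toy = pd.DataFrame(
    {
        "node_id": [0, 1, 2],
        "code": ["AA", "BB", "CC"],
        "name": ["Alice", "Bob", "Carol"],
    }
)

# Reading the strict upper triangle keeps each undirected edge once
src, trg, weight = sparse.find(sparse.triu(A_toy, 1))
edge_table_toy = pd.DataFrame({"src": src, "trg": trg, "weight": weight})

# Save both tables; index=False avoids writing the pandas row index
out_dir = tempfile.mkdtemp()
node_path = os.path.join(out_dir, "node_table.csv")
edge_path = os.path.join(out_dir, "edge_table.csv")
node_table_toy.to_csv(node_path, index=False)
edge_table_toy.to_csv(edge_path, index=False)
```

When reloading such files, remember to pass `dtype` to `pandas.read_csv` for columns that should stay as text, such as "code".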

---
Write code to generate the network here:

In [None]:
character_table_url = "https://raw.githubusercontent.com/skojaku/adv-net-sci-course/main/data/les_miserable/characters.csv"
section_table_url = "https://raw.githubusercontent.com/skojaku/adv-net-sci-course/main/data/les_miserable/section_character_table.csv"

# Your code here ---------------------
character_table = pd.read_csv(character_table_url, dtype={"code": str, "name": str})
...

---
Visualization here

---
Documentation about your dataset.