# Advanced Learning - Class 2

## Statistical Learning with Complex Data

<hr>

## Introduction

### The analysis of (social) networks

**Origin in Sociology** -- The notion predates the advent of the internet (it dates back to the 19th century) and mainly refers to the networks built between humans, understood through the lens of sociology.
- The first researchers were Émile Durkheim and Ferdinand Tönnies, who studied the link between individual actions and society (topics: religion, suicide, etc.)
- In the 1930s, Jacob Moreno was the first to advocate for the *massive use of data* in sociology (initially, the use of data to describe how small societies function).

**Graph Theory** -- Graph theory has been extensively studied in Mathematics over the past centuries (e.g. Euler formalized the basis of graph theory).
- Applications: biology, chemistry, supply chains, etc.
- ***Note***: Networks are not just graphs (they are a graph, the mathematical object, plus some additional information)

Most networks are in fact described across several sources/documents. **In this case, modeling/encoding the relationships between individuals requires substantial work** (even the definition of a *node* can be difficult).

### Highlights from examples

- Networks can be directly observed or reconstructed from sources
- The structure of networks may be extremely different, in particular in terms of density
- Network analysis has very different application fields, ranging from sociology and economics to history and medicine.

### Applications

- **Medicine** -- public health, epidemiology
- **Biology** -- modeling of drugs
- **Social sciences** -- understanding social phenomena
- **Marketing** -- identification of groups of clients and of influencers
- **Fraud detection**

### Shapes of networks

Networks can be found in different forms:
- graph (simplest form)
- adjacency matrix / social matrix
- transactional data 
- different sources of different types (e.g. 1 or several documents, texts, tweets, text messages, images)

## Characterization and manipulation of networks

### Storage of a network

- A **graph**: a text file listing the interactions between the nodes (e.g. 1;2, 2;1, 1;3, etc.). It corresponds to a **list of all directed edges**.
- An **adjacency matrix** is an $n\times n$ square matrix with a $0$-valued diagonal (as we assume that there are no self-loops). Elements are either 0- or 1-valued, with $A_{i,j}=1$ indicating a link from node $i$ to node $j$. If the network is undirected, the matrix is symmetric; if it is directed, $A_{i,j}$ and $A_{j, i}$ may differ.
    - The adjacency matrix is not an efficient way to store a network, especially if it is sparse: it stores all $n^2$ entries even when most of them are 0.
- **Transactional data** is a collection of structured data from which relationships can be clearly extracted as graphs (e.g. emails with the fields from: A, to: B, C, cc: D, bcc: E, F, subject:\_\_, date:\_\_, etc.).
    - This requires writing a script to transform the transactional data into a graph.
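
As an illustration of the three storage forms, here is a minimal sketch that turns transactional data into a list of directed edges and then into an adjacency matrix. The `emails` records, their field names, and the helper functions `emails_to_edges` and `edges_to_adjacency` are hypothetical and only mirror the email example above; the choice to treat every recipient (to, cc, bcc) as one directed edge from the sender is an assumption, not the only possible encoding.

```python
# Hypothetical transactional records; field names mirror the email example.
emails = [
    {"from": "A", "to": ["B", "C"], "cc": ["D"], "bcc": []},
    {"from": "B", "to": ["A"], "cc": [], "bcc": ["C"]},
    {"from": "A", "to": ["B"], "cc": [], "bcc": []},
]

def emails_to_edges(records):
    """Return the set of directed edges (sender, recipient), without self-loops."""
    edges = set()
    for rec in records:
        # Assumption: to, cc and bcc recipients all count as the same kind of link.
        recipients = rec["to"] + rec["cc"] + rec["bcc"]
        for recipient in recipients:
            if recipient != rec["from"]:
                edges.add((rec["from"], recipient))
    return edges

def edges_to_adjacency(edges):
    """Build a 0/1 adjacency matrix (list of lists) and the node ordering used."""
    nodes = sorted({v for edge in edges for v in edge})
    index = {v: k for k, v in enumerate(nodes)}
    n = len(nodes)
    A = [[0] * n for _ in range(n)]
    for source, target in edges:
        A[index[source]][index[target]] = 1
    return nodes, A

edges = emails_to_edges(emails)
nodes, A = edges_to_adjacency(edges)

# Edge-list text format from the notes: one "source;target" pair per edge.
edge_lines = [f"{s};{t}" for s, t in sorted(edges)]
```

Repeated emails between the same pair collapse into a single edge here; keeping a count per pair instead would give a weighted network.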

### Definition of a network

A network is composed of:
- nodes (individuals)
- edges (relationships) -- **note**: nodes + edges = graph
- extra information on nodes and/or edges (covariates)

The different types of networks:
- directed and undirected networks
- dynamic and static networks (**dynamic involves a temporal dimension in the relationships**)
- multiple networks (different types of connection between sets of nodes)
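
The definition above (a graph plus covariates) can be sketched as a small data structure. The `Network` class and all names in it are hypothetical, written only to illustrate the idea; it is not the API of any library used in this class.

```python
from dataclasses import dataclass, field

@dataclass
class Network:
    """A graph (nodes + edges) plus covariates on nodes and edges."""
    directed: bool = True
    nodes: dict = field(default_factory=dict)  # node id -> covariates
    edges: dict = field(default_factory=dict)  # (source, target) -> covariates

    def add_node(self, node, **covariates):
        # Create the node if needed, then merge in any new covariates.
        self.nodes.setdefault(node, {}).update(covariates)

    def add_edge(self, source, target, **covariates):
        # Make sure both endpoints exist as nodes.
        self.add_node(source)
        self.add_node(target)
        # In an undirected network, (a, b) and (b, a) are the same edge.
        key = (source, target) if self.directed else tuple(sorted((source, target)))
        self.edges[key] = covariates

net = Network(directed=False)
net.add_node("alice", age=34)
net.add_node("bob", age=29)
# A time covariate on the edge is one simple way to record a dynamic network.
net.add_edge("bob", "alice", kind="friendship", year=2019)
net.add_edge("alice", "carol", kind="coworker", year=2021)
```

This sketch stores at most one edge per pair of nodes; representing multiple networks (several types of connection between the same nodes) would require, for example, adding the connection type to the edge key.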

### Characterizing

A first way to characterize a network is to **compute general statistics for it**:
- **degree of a node** $d_i$: the number of connections of node $i$. It is a simple measure of the importance (centrality) of the node in the network.
\begin{align}
d_i &= \underset{j\neq i}{\sum}\mathbb{1}\{x_i \sim x_j\} = \underset{j\neq i}{\sum}A_{i,j}\text{ (sum of connections in the adjacency matrix)}\\
d_i&\in\{0, \dots, n-1\}\text{ (if undirected), }\{0, \dots, 2(n-1)\}\text{ (if directed, counting incoming and outgoing edges)}
\end{align}
In most natural networks, the distribution of the degrees follows a power law.
- **density of a graph** $d_G$. The density of the network is another way to describe it:
\begin{align}
d_G&=\frac{\overset{n}{\underset{i=1}{\sum}}\overset{n}{\underset{j=1, j\neq i}{\sum}}A_{i, j}}{n(n-1)}\in[0, 1]\\
\text{numerator}&:\text{ total number of edges in the network}\\
\text{denominator}&:\ n(n-1)\text{, maximum number of connections in a directed graph}
\end{align}

The density can also be computed for parts of a network, and these local densities may differ from one another (small world effect).
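
The statistics above can be computed directly from an adjacency matrix. The following is a minimal sketch in plain Python; the example matrix `A` and the function names `degrees` and `density` are illustrative assumptions, and the convention of returning 0 for fewer than two nodes is a choice made here.

```python
from collections import Counter

# Small undirected example: symmetric adjacency matrix with a zero diagonal.
A = [
    [0, 1, 1, 1, 0],
    [1, 0, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [1, 0, 0, 0, 1],
    [0, 0, 0, 1, 0],
]

def degrees(A):
    """d_i = sum over j != i of A[i][j] (the out-degree if the network is directed)."""
    n = len(A)
    return [sum(A[i][j] for j in range(n) if j != i) for i in range(n)]

def density(A, subset=None):
    """Density restricted to the nodes in `subset` (all nodes by default)."""
    idx = list(range(len(A))) if subset is None else list(subset)
    m = len(idx)
    if m < 2:
        return 0.0  # convention chosen here: no possible pair, so density 0
    present = sum(A[i][j] for i in idx for j in idx if i != j)
    return present / (m * (m - 1))

d = degrees(A)                                # one degree per node
degree_distribution = Counter(d)              # how many nodes have each degree
global_density = density(A)                   # d_G for the whole network
local_density = density(A, subset=[0, 1, 2])  # density of the sub-network {0, 1, 2}
```

In this example the sub-network made of nodes 0, 1 and 2 is fully connected, so its local density is higher than the global one, which is the kind of contrast the remark on local densities refers to.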

In [1]:
!pip install igraph network sna

In [2]:
import igraph, network, sna