<a href="https://colab.research.google.com/github/ttruong1000/MAT-494-Mathematical-Methods-for-Data-Science/blob/main/4_1_Introduction_to_Network_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **4.1 - Introduction to Network Analysis**

Network analysis is essential in data analysis, not only because social
networks generate huge amounts of data, but also because many datasets are
network-structured. One simple way to introduce a network structure is to
analyze correlations between variables and build correlation networks, a
widely used data mining method for studying biological networks based on
pairwise correlations between variables.

Networks can be conveniently modeled by graphs; for social networks, such a
graph is often called a social graph. The individuals within a network are
the nodes, and an edge connects two nodes if they are related by the
relationship that characterizes the network. The explosive growth of social
media in recent years has attracted millions of end users, creating social
graphs with millions of nodes and billions of edges that reflect the
interactions and relationships between these nodes.

Networks often exhibit community structure with inherent clusters. Detecting
clusters or communities is one of the critical tasks in network analysis
because of its broad applications, such as friend recommendation,
link prediction, and collaborative filtering in online social networks. From the
graph theory perspective, clustering and community detection amount to
discovering groups of nodes in a graph that are more densely connected with
each other than with nodes outside the group. Given the size
and complexity of today's networks, clustering and community detection in
these networks are inherently challenging.

Communities (clusters) are essential for gaining spatio-temporal insight into
big datasets from networks. Spatial distances often describe the strength of
network connectivity among communities (clusters) rather than among individual
nodes. As a result, good clustering results enable us to capture key
characteristics of datasets in networks.

### **4.1.0 - Python Libraries for Introduction to Network Analysis**

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.neighbors import kneighbors_graph
from scipy import sparse
import networkx as nx
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.metrics import homogeneity_score, completeness_score, v_measure_score
```

### **4.1.1 - Graph Models**

In this section, we briefly review some common notation used for graphs. A graph consists of a set of objects, called nodes, and the connections between these nodes, called edges. Mathematically, a graph $G$ is denoted as a pair $G(V, E)$, where $V = \{v_1, v_2, \ldots, v_n\}$ is the set of nodes and $E = \{e_1, e_2, \ldots, e_m\}$ is the set of edges; the sizes of these sets are commonly written $n = |V|$ and $m = |E|$. Edges are also represented by their endpoints (nodes), so $e\left(v_1, v_2\right)$ or $\left(v_1, v_2\right)$ denotes an edge between nodes $v_1$ and $v_2$. An edge is directed if it connects one node to another but not vice versa; in that case, $e\left(v_1, v_2\right)$ is not the same as $e\left(v_2, v_1\right)$. An edge is undirected if it connects its two endpoints in both directions. Graphs that have only directed edges are called directed graphs, and graphs that have only undirected edges are called undirected graphs. Finally, mixed graphs have both directed and undirected edges.
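The difference between directed and undirected edges can be sketched with `networkx`. The node labels and edge lists below are made up for illustration only.

```python
import networkx as nx

# Undirected graph: the edge (v1, v2) is the same as (v2, v1).
G = nx.Graph()
G.add_edges_from([(1, 2), (2, 3), (3, 4), (4, 1)])

# Directed graph: the edge (v1, v2) does not imply (v2, v1).
D = nx.DiGraph()
D.add_edges_from([(1, 2), (2, 3), (3, 1)])

n, m = G.number_of_nodes(), G.number_of_edges()   # n = |V|, m = |E|
print("n =", n, "m =", m)
print("undirected:", G.has_edge(1, 2), G.has_edge(2, 1))
print("directed:  ", D.has_edge(1, 2), D.has_edge(2, 1))
```

In the undirected graph both queries succeed, while in the directed graph only the stored orientation does.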

A sequence of edges $e_1\left(v_1, v_2\right), e_2\left(v_2, v_3\right)$, $e_3\left(v_3, v_4\right), \ldots, e_i\left(v_i, v_{i + 1}\right)$ in which the nodes and edges are distinct is called a path. A closed path is called a cycle. The length of a path or cycle is the number of edges traversed in it. In a directed graph, we only count directed paths because edges may only be traversed in their own direction. In a connected graph, multiple paths can exist between any pair of nodes. Often, we are interested in the path with the smallest length, which is called the shortest path. We will also use the length of the shortest path as the distance between two nodes when modeling on networks. The concept of the neighborhood of a node $v_i$ can be generalized using shortest paths: an $n$-hop neighborhood of node $v_i$ is the set of nodes that are within $n$ hops of $v_i$.
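As a minimal sketch of shortest paths and hop neighborhoods, the example below uses a small made-up graph (a path on six nodes plus one shortcut edge). Excluding the node itself from its own neighborhood is a choice made here, not something the text prescribes.

```python
import networkx as nx

# Path 0-1-2-3-4-5 with an extra shortcut edge between 0 and 3.
G = nx.path_graph(6)
G.add_edge(0, 3)

# Shortest path and its length (number of edges) between nodes 0 and 4.
path = nx.shortest_path(G, source=0, target=4)
dist = nx.shortest_path_length(G, source=0, target=4)

# 2-hop neighborhood of node 0: all nodes at shortest-path distance 1 or 2.
hops = nx.single_source_shortest_path_length(G, 0, cutoff=2)
two_hop = {v for v, d in hops.items() if d > 0}

print("shortest path:", path, "length:", dist)
print("2-hop neighborhood of 0:", sorted(two_hop))
```

The shortcut makes `0 -> 3 -> 4` shorter than walking along the original path, and node 5 falls outside the 2-hop neighborhood because it is three hops away.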

The degree of a node in a graph, which is the number of edges connected to the node, plays a significant role in the study of graphs. For a directed graph, there are two types of degrees:
1. in-degrees (edges toward the node) and 
2. out-degrees (edges away from the node).

In a network, the nodes with the most connections have the greatest degree centrality. Degree centrality measures relative levels of importance: we often regard people with many interpersonal connections as more important than those with few. In-degree centrality describes the popularity of a node and its prominence or prestige. Out-degree centrality describes the gregariousness of the node. In social media, the degree of a user reflects the user's connections. On Facebook, the degree is the number of friends. On Twitter, in-degree and out-degree give the numbers of followers and followees, respectively.
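The following sketch computes in-degree, out-degree, and in-degree centrality on a tiny made-up "follower" graph. The user names are hypothetical, and the convention that an edge `(u, v)` means "u follows v" is an assumption of this example.

```python
import networkx as nx

# Edge (u, v) is read as "u follows v" (an assumption for this example).
D = nx.DiGraph()
D.add_edges_from([("ann", "bob"), ("cat", "bob"), ("dan", "bob"), ("bob", "ann")])

in_deg = dict(D.in_degree())     # number of followers
out_deg = dict(D.out_degree())   # number of followees

# networkx normalizes in-degree by (n - 1), the largest possible in-degree.
centrality = nx.in_degree_centrality(D)
most_followed = max(in_deg, key=in_deg.get)

print("in-degree: ", in_deg)
print("out-degree:", out_deg)
print("most followed:", most_followed, "centrality:", centrality[most_followed])
```

Here `bob` is followed by every other user, so his normalized in-degree centrality reaches the maximum value of 1.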

### **4.1.2 - The Laplacian Matrix**

A graph with $n$ nodes can be represented by an $n \times n$ adjacency matrix $A$. A value of 1 at row $i$, column $j$ of the adjacency matrix indicates a connection between nodes $v_i$ and $v_j$, and a value of 0 denotes no connection between the two nodes. More generally, any real number can be used to indicate the strength of the connection between two nodes. In directed graphs, there can be two edges between $i$ and $j$ (one from $i$ to $j$ and one from $j$ to $i$), whereas in undirected graphs only one edge can exist. As a result, the adjacency matrix of a directed graph is not symmetric in general, whereas the adjacency matrix of an undirected graph is symmetric $\left(A = A^T\right)$. In social media, there are many directed and undirected networks. For instance, Facebook is an undirected network and Twitter is a directed network.
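The symmetry claim can be checked on a small example. The sketch below builds the same made-up edge list once as an undirected graph and once as a directed graph, then compares each adjacency matrix with its transpose.

```python
import networkx as nx
import numpy as np

edge_list = [(0, 1), (1, 2), (2, 0), (2, 3)]
nodes = [0, 1, 2, 3]

G = nx.Graph(edge_list)      # undirected
D = nx.DiGraph(edge_list)    # directed, edges point from first to second node

# Fix the node order so that row i and column i both refer to node i.
A_undirected = nx.to_numpy_array(G, nodelist=nodes)
A_directed = nx.to_numpy_array(D, nodelist=nodes)

sym_undirected = np.array_equal(A_undirected, A_undirected.T)
sym_directed = np.array_equal(A_directed, A_directed.T)

print(A_undirected)
print("undirected symmetric:", sym_undirected)
print(A_directed)
print("directed symmetric:  ", sym_directed)
```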

Consider a weighted graph $G = (V, E)$ with $n$ vertices and $m$ edges, where the edge connecting nodes $i$ and $j$ has weight $E_{ij}$. The adjacency matrix $M$ of the graph is defined by $M_{ij} = E_{ij}$ if there is an edge $\{i, j\}$ and $M_{ij} = 0$ otherwise. The Laplacian matrix $L$ of $G$ is an $n \times n$ symmetric matrix, with one row and column for each vertex, such that
$$
L_{ij} = \left\{\begin{array}{lr}
\sum_{k} E_{i k}, & i=j \\
-E_{i j}, & i \neq j, \text { and } v_{i} \text { is adjacent to } v_{j} \\
0, & \text { otherwise }
\end{array}\right.
$$
In addition, an $n \times m$ incidence matrix of $G$, denoted by $I_G$, has one row per vertex and one column per edge. The column of $I_G$ corresponding to edge $\{i, j\}$ is zero except for the $i$-th and $j$-th entries, which are $\sqrt{E_{ij}}$ and $-\sqrt{E_{ij}}$, respectively.
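To make these definitions concrete, the sketch below fills in $M$, $L$, and $I_G$ entry by entry for a small weighted graph. The edge list and weights are made up, and the variable names simply mirror the symbols above.

```python
import numpy as np

# Hypothetical weighted undirected graph: each tuple is (i, j, E_ij).
edges = [(0, 1, 2.0), (1, 2, 1.0), (0, 2, 3.0), (2, 3, 0.5)]
n, m = 4, len(edges)

M = np.zeros((n, n))      # adjacency matrix
L = np.zeros((n, n))      # Laplacian matrix
I_G = np.zeros((n, m))    # incidence matrix, one column per edge

for col, (i, j, w) in enumerate(edges):
    M[i, j] = M[j, i] = w
    L[i, j] = L[j, i] = -w       # off-diagonal entries: -E_ij
    L[i, i] += w                 # diagonal entries: sum_k E_ik
    L[j, j] += w
    I_G[i, col] = np.sqrt(w)     # sqrt(E_ij) in row i
    I_G[j, col] = -np.sqrt(w)    # -sqrt(E_ij) in row j

print(M)
print(L)
print(I_G)
```

Which endpoint of an edge receives the positive square root is arbitrary for an undirected graph; here it is the first node listed in each tuple.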

##### Theorem 4.1.2.1 - Properties of the Laplacian Matrix

The Laplacian matrix $L$ of $G$ has the following properties.
1. $L = D - M$, where $M$ is the adjacency matrix and $D$ is the diagonal degree
matrix with $D_{ii} = \displaystyle\sum_k E_{ik}$.
2. $L = I_GI_G^T$
3. $L$ is symmetric positive semi-definite. All eigenvalues of $L$ are real and
non-negative, and $L$ has a full set of $n$ real and orthogonal eigenvectors.
4. Let $\mathbf{e} = [1, 1, \ldots, 1]^T$. Then $L\mathbf{e} = \mathbf{0}$. Thus 0 is the smallest eigenvalue and $\mathbf{e}$ is a corresponding eigenvector.
5. If the graph $G$ has $c$ connected components, then $L$ has $c$ eigenvalues equal to 0.
6. For any vector $\mathbf{x}$, $\mathbf{x}^TL\mathbf{x} = \displaystyle\sum_{\{i,j\} \in E} E_{ij}(x_i - x_j)^2$.
7. For any vector $\mathbf{x}$ and scalars $\alpha, \beta$, $(\alpha\mathbf{x} + \beta\mathbf{e})^TL(\alpha\mathbf{x} + \beta\mathbf{e}) = \alpha^2\mathbf{x}^TL\mathbf{x}$.
8. The problem
\begin{equation*}
  \min_{\mathbf{x} \neq 0} \mathbf{x}^TL\mathbf{x} \quad \text{ subject to } \mathbf{x}^T\mathbf{x} = 1, \mathbf{x}^T\mathbf{e} = 0
\end{equation*}
is solved when $\mathbf{x}$ is the eigenvector (the Fiedler vector) corresponding to the second smallest eigenvalue $\lambda_2$ of the eigenvalue problem
\begin{equation*}
  L\mathbf{x} = \lambda\mathbf{x}
\end{equation*}
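The properties above can be checked numerically on a small example. The sketch below is a sanity check rather than a proof: the weighted graph is made up and connected (so property 5 predicts exactly one zero eigenvalue), the test vector is random, and the scalars $\alpha = 2$, $\beta = 3$ in property 7 are arbitrary.

```python
import numpy as np

# Hypothetical weighted, connected, undirected graph: (i, j, E_ij).
edges = [(0, 1, 2.0), (1, 2, 1.0), (0, 2, 3.0), (2, 3, 0.5)]
n, m = 4, len(edges)

M = np.zeros((n, n))      # adjacency matrix
I_G = np.zeros((n, m))    # incidence matrix
for col, (i, j, w) in enumerate(edges):
    M[i, j] = M[j, i] = w
    I_G[i, col], I_G[j, col] = np.sqrt(w), -np.sqrt(w)

D = np.diag(M.sum(axis=1))
L = D - M                                  # property 1
e = np.ones(n)
eigvals, eigvecs = np.linalg.eigh(L)       # eigenvalues in ascending order

rng = np.random.default_rng(0)
x = rng.standard_normal(n)
y = 2.0 * x + 3.0 * e                      # alpha = 2, beta = 3
edge_sum = sum(w * (x[i] - x[j]) ** 2 for i, j, w in edges)

checks = {
    "2: L = I_G I_G^T": bool(np.allclose(L, I_G @ I_G.T)),
    "3: eigenvalues >= 0": bool(np.all(eigvals > -1e-10)),
    "4: L e = 0": bool(np.allclose(L @ e, 0.0)),
    "5: one zero eigenvalue": int(np.sum(np.abs(eigvals) < 1e-10)) == 1,
    "6: quadratic form": bool(np.isclose(x @ L @ x, edge_sum)),
    "7: shift and scale": bool(np.isclose(y @ L @ y, 4.0 * (x @ L @ x))),
}

# Property 8: the second eigenvector is orthogonal to e and attains lambda_2.
fiedler = eigvecs[:, 1]
checks["8: Fiedler vector"] = bool(
    np.isclose(fiedler @ e, 0.0) and np.isclose(fiedler @ L @ fiedler, eigvals[1])
)

for name, ok in checks.items():
    print(name, ok)
```

`np.linalg.eigh` is used because $L$ is symmetric; it returns real eigenvalues in ascending order together with orthonormal eigenvectors, which matches property 3.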

##### Theorem 4.1.2.2 - Courant-Fischer Theorem

Let $A$ be an $n \times n$ symmetric matrix with an orthogonal diagonalization $A = PDP^{-1}$. The columns of $P$ are orthonormal eigenvectors $\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_n$ of $A$. Assume that the diagonal entries of $D$ are arranged so that $\lambda_1 \leq \lambda_2 \leq \cdots \leq \lambda_n$. Let $S_k$ be the span of $\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_k$ and let $S_k^\perp$ denote the orthogonal complement of $S_k$. Then,
\begin{equation*}
  \min_{\mathbf{x} \neq 0, \mathbf{x} \in S_{k - 1}^\perp} \frac{\mathbf{x}^TA\mathbf{x}}{\mathbf{x}^T\mathbf{x}} = \lambda_k
\end{equation*}
When $k = 2$, $S_1^\perp$ is the set of all $\mathbf{x}$ such that
\begin{equation*}
  \mathbf{x} \perp \mathbf{v}_1, \quad \text{that is,} \quad \mathbf{v}_1^T\mathbf{x} = 0
\end{equation*}
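A rough numerical illustration of the theorem, under stated assumptions: the symmetric matrix is random, $k = 2$ is chosen arbitrarily, and random sampling only shows that no sampled vector in $S_{k-1}^\perp$ falls below $\lambda_k$; it does not prove the minimum. The helper `rayleigh` is a name introduced for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((5, 5))
A = (B + B.T) / 2                    # random symmetric matrix

lam, P = np.linalg.eigh(A)           # ascending eigenvalues, orthonormal columns

def rayleigh(A, x):
    """Rayleigh quotient x^T A x / x^T x."""
    return (x @ A @ x) / (x @ x)

k = 2                                # 1-based index, as in the theorem
S = P[:, :k - 1]                     # orthonormal basis of S_{k-1}

samples = []
for _ in range(1000):
    x = rng.standard_normal(5)
    x -= S @ (S.T @ x)               # project x onto S_{k-1}^perp
    samples.append(rayleigh(A, x))

lam_k = lam[k - 1]
attained = rayleigh(A, P[:, k - 1])  # quotient at the eigenvector v_k

print("lambda_k:", lam_k)
print("smallest sampled quotient:", min(samples))
print("quotient at v_k:", attained)
```

Every sampled quotient stays at or above $\lambda_k$, and the value is attained at $\mathbf{v}_k$ itself.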

##### Corollary 4.1.2.3 - Corollary of Courant-Fischer Theorem

Let $A$ be an $n \times n$ symmetric matrix with an orthogonal diagonalization $A = PDP^{-1}$. The columns of $P$ are orthonormal eigenvectors $\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_n$ of $A$. Assume that the diagonal entries of $D$ are arranged so that $\lambda_1 \leq \lambda_2 \leq \cdots \leq \lambda_n$. Then
\begin{equation*}
  \min_{\mathbf{x} \neq 0, \mathbf{x}^T\mathbf{v}_1 = 0} \frac{\mathbf{x}^TA\mathbf{x}}{\mathbf{x}^T\mathbf{x}} = \lambda_2
\end{equation*}
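Applied to a graph Laplacian $L$ of a connected graph, where $\mathbf{v}_1$ is proportional to $\mathbf{e}$, the corollary gives property 8 of Theorem 4.1.2.1. The sketch below illustrates this on a made-up graph of two 4-node cliques joined by one edge. The competitor vector `x` (plus one on the first clique, minus one on the second) is chosen here only for comparison.

```python
import networkx as nx
import numpy as np

# Two 4-node cliques (nodes 0-3 and 4-7) joined by the single edge (3, 4).
G = nx.complete_graph(4)
G.add_edges_from((u, v) for u in range(4, 8) for v in range(u + 1, 8))
G.add_edge(3, 4)

n = G.number_of_nodes()
A = nx.to_numpy_array(G, nodelist=range(n))
L = np.diag(A.sum(axis=1)) - A       # unweighted Laplacian, L = D - M

lam, V = np.linalg.eigh(L)           # ascending eigenvalues
v1 = np.ones(n) / np.sqrt(n)         # normalized all-ones vector
fiedler = V[:, 1]                    # eigenvector for lambda_2

def rayleigh(L, x):
    """Rayleigh quotient x^T L x / x^T x."""
    return (x @ L @ x) / (x @ x)

# A competitor that is also orthogonal to e.
x = np.array([1.0] * 4 + [-1.0] * 4)

print("lambda_2:", lam[1])
print("quotient at Fiedler vector:", rayleigh(L, fiedler))
print("quotient at competitor x:  ", rayleigh(L, x))
print("signs of Fiedler vector:", np.sign(fiedler).astype(int))
```

The competitor only pays for the single bridging edge, yet its quotient is still larger than $\lambda_2$. On this toy graph the signs of the Fiedler vector also happen to separate the two cliques, which hints at why this eigenvector is useful for clustering.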

### **4.1.3 - References**

1. MAT 494 Chapter 4 Notes