# Community Detection Algorithms
- essential for evaluating group behaviour and emergent phenomena
- members of a community have more relationships within the group than outside of it
- reveals 
    - clusters of nodes, 
    - isolated groups, and 
    - network structure
- infer similar behaviour or preferences of peer groups
- estimate resiliency
- find nested relationships
- prepare data for other analyses 

| Algorithm type                         | What it does                                                                                                                     | Example use                                                                                                                        |
|----------------------------------------|----------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------|
| Triangle Count and Cluster Coefficient | Measures how many nodes form triangles and the degree to which nodes tend to cluster together                                    | Estimating group stability and whether the network might exhibit “small-world” behaviors seen in graphs with tightly knit clusters |
| Strongly Connected Components          | Finds groups where each node is reachable from every other node in that same group following the direction of relationships      | Making product recommendations based on group affiliation or similar items                                                         |
| Connected Components                   | Finds groups where each node is reachable from every other node in that same group, regardless of the direction of relationships | Performing fast grouping for other algorithms and identifying islands                                                              |
| Label Propagation                      | Infers clusters by spreading labels based on neighborhood majorities                                                             | Understanding consensus in social communities or finding dangerous combinations of possible co-prescribed drugs                    |
| Louvain Modularity                     | Maximizes the presumed accuracy of groupings by comparing relationship weights and densities to a defined estimate or average    | In fraud analysis, evaluating whether a group has just a few discrete bad behaviors or is acting as a fraud ring                   |

# Example Graph Data: The Software Dependency Graph
## Importing Data into Apache Spark

In [1]:
from pyspark.sql.session import SparkSession
spark = SparkSession.builder.appName('community').getOrCreate() 
spark.sparkContext.setCheckpointDir('/home/share/learn/graph/algorithms/code/notebook/checkpoints')
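# Assumption: this project-local module supplies GraphFrame, F (pyspark.sql.functions),
# op (os.path) and data_path through the star import, since later cells use them
from code.script.community import *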
g = GraphFrame(
        spark.read.csv(op.join(data_path, 'sw-nodes.csv'), header=True), 
        spark.read.csv(op.join(data_path, 'sw-relationships.csv'), header=True)
    )

In [2]:
g.vertices.toPandas()

Unnamed: 0,id
0,six
1,pandas
2,numpy
3,python-dateutil
4,pytz
5,pyspark
6,matplotlib
7,spacy
8,py4j
9,jupyter


In [3]:
g.edges.toPandas()

Unnamed: 0,src,dst,relationship
0,pandas,numpy,DEPENDS_ON
1,pandas,pytz,DEPENDS_ON
2,pandas,python-dateutil,DEPENDS_ON
3,python-dateutil,six,DEPENDS_ON
4,pyspark,py4j,DEPENDS_ON
5,matplotlib,numpy,DEPENDS_ON
6,matplotlib,python-dateutil,DEPENDS_ON
7,matplotlib,six,DEPENDS_ON
8,matplotlib,pytz,DEPENDS_ON
9,spacy,six,DEPENDS_ON


## Triangle Count and Clustering Coefficient
- often used together
- triangle: a set of three nodes where each node has a relationship to both of the others
- clustering coefficient: the ratio of existing triangles to the number of triangles that could exist
### Local Clustering Coefficient
The likelihood that a node's neighbors are also connected to each other

\begin{equation*}
CC(u) = {\frac{2R_u}{k_u{(k_u-1)}}}
\end{equation*}

- $u$ is a node
- $R_u$ is the number of relationships through the neighbors of $u$ (can be obtained from the number of triangles passing through $u$)
- $k_u$ is the degree of $u$

### Global Clustering Coefficient
The normalized sum of the local clustering coefficients
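
A minimal sketch of both coefficients in plain Python, assuming an undirected graph stored as a dictionary of neighbor sets. The toy graph and the function name `local_clustering` are made up for illustration, and reading "normalized sum" as the mean of the local values is an assumption.

```python
from itertools import combinations

def local_clustering(adj, u):
    """CC(u) = 2 * R_u / (k_u * (k_u - 1)) for an undirected graph."""
    neighbors = adj[u]
    k = len(neighbors)  # k_u: degree of u
    if k < 2:
        return 0.0  # convention used here: no neighbor pair, so no triangle is possible
    # R_u: relationships between neighbors of u (one per triangle through u)
    r = sum(1 for a, b in combinations(neighbors, 2) if b in adj[a])
    return 2 * r / (k * (k - 1))

# Hypothetical toy graph: one triangle (a, b, c) with a tail c - d
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]
adj = {}
for s, t in edges:
    adj.setdefault(s, set()).add(t)
    adj.setdefault(t, set()).add(s)

local = {u: local_clustering(adj, u) for u in sorted(adj)}
# Global coefficient taken as the mean of the local coefficients (assumption)
global_cc = sum(local.values()) / len(local)
print(local, global_cc)
```

Here `a` and `b` score 1.0 because their two neighbors are connected, while `c` scores 1/3 because only one of its three neighbor pairs is connected.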

### When Should I Use Triangle Count and Clustering Coefficient?
- Triangle Count: determine the stability of a group; used in social network analysis
- Clustering Coefficient: quickly evaluate the cohesiveness of a specific group or the overall network, estimate resiliency, and look for network structures

## Triangle Count with Apache Spark

In [4]:
result = g.triangleCount()
(result.sort("count", ascending=False)
    .filter('count > 0')
    .toPandas())

Unnamed: 0,count,id
0,1,six
1,1,ipykernel
2,1,jupyter
3,1,python-dateutil
4,1,matplotlib
5,1,jpy-console


## Strongly Connected Components
Finds sets of connected nodes in a directed graph where each node is reachable in both directions from any other node in the same set
### When Should I Use Strongly Connected Components?
- As an early step in graph analysis to see how a graph is structured
- Identify tight clusters that may warrant independent investigation
- Profile similar behaviour or inclinations in a group for applications such as recommendation engines
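
To make the "reachable in both directions" condition concrete, here is a small sketch using Kosaraju's algorithm. This is one standard way to compute strongly connected components and is not necessarily what GraphFrames does internally. The graph and function name are hypothetical.

```python
def strongly_connected_components(graph):
    """Kosaraju's algorithm; graph maps each node to a list of successors."""
    nodes = set(graph) | {v for targets in graph.values() for v in targets}
    visited, order = set(), []

    def visit(u):
        # First pass: depth-first search, recording nodes by finish time
        visited.add(u)
        for v in graph.get(u, []):
            if v not in visited:
                visit(v)
        order.append(u)

    for u in sorted(nodes):
        if u not in visited:
            visit(u)

    # Build the graph with every relationship reversed
    reverse = {u: [] for u in nodes}
    for u, targets in graph.items():
        for v in targets:
            reverse[v].append(u)

    assigned, components = set(), []

    def collect(u, component):
        # Second pass: everything reachable in the reversed graph joins u's component
        assigned.add(u)
        component.add(u)
        for v in reverse[u]:
            if v not in assigned:
                collect(v, component)

    for u in reversed(order):
        if u not in assigned:
            component = set()
            collect(u, component)
            components.append(component)
    return components

# a -> b -> c -> a is a cycle; d is reachable from c but has no path back
graph = {"a": ["b"], "b": ["c"], "c": ["a", "d"], "d": []}
components = strongly_connected_components(graph)
print(components)
```

The recursion keeps the sketch short; a very deep graph would need an iterative traversal instead.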

In [5]:
result = g.stronglyConnectedComponents(maxIter=10)
(result.sort("component")
    .groupby("component")
    .agg(F.collect_list("id").alias("libraries"))
    .toPandas())

Unnamed: 0,component,libraries
0,180388626432,[jpy-core]
1,223338299392,[spacy]
2,498216206336,[numpy]
3,523986010112,[six]
4,549755813888,[pandas]
5,558345748480,[nbconvert]
6,661424963584,[ipykernel]
7,721554505728,[jupyter]
8,764504178688,[jpy-client]
9,833223655424,[pytz]


## Connected Components (Union Find) (Weakly Connected Components)

Finds sets of connected nodes in an undirected graph where each node is reachable from any other node in the same set
Differs from SCC in that it only needs a path to exist between two nodes in one direction, whereas SCC needs a path in both directions

### When Should I Use Connected Components?
- Often used in early analysis to understand a graph's structure
- Scales efficiently, so consider it for graphs that are updated frequently
- Useful for fraud detection
- Testing whether a graph is connected
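
The "Union Find" name refers to the disjoint-set structure that can be used to compute these components. A minimal sketch follows, using a few of the DEPENDS_ON pairs from the sample data and ignoring their direction. The function name is made up for illustration.

```python
def connected_components(nodes, edges):
    """Group nodes with a disjoint-set (union-find) structure, ignoring direction."""
    parent = {n: n for n in nodes}

    def find(x):
        # Walk up to the root, shortening the path along the way (path halving)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for src, dst in edges:
        root_src, root_dst = find(src), find(dst)
        if root_src != root_dst:
            parent[root_dst] = root_src  # union: merge the two sets

    groups = {}
    for n in nodes:
        groups.setdefault(find(n), set()).add(n)
    return list(groups.values())

nodes = ["pandas", "numpy", "pytz", "pyspark", "py4j"]
edges = [("pandas", "numpy"), ("pandas", "pytz"), ("pyspark", "py4j")]
components = connected_components(nodes, edges)
print(components)
```

Each new relationship costs only two `find` calls and at most one merge, which is why the approach suits graphs that change often.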

## Connected Components with Apache Spark

In [16]:
result = g.connectedComponents()
(result.sort("component")
    .groupby("component")
    .agg(F.collect_list("id").alias("libraries"))
    .toPandas())

Unnamed: 0,component,libraries
0,180388626432,"[jpy-core, nbconvert, ipykernel, jupyter, jpy-client, jpy-console]"
1,223338299392,"[spacy, numpy, six, pandas, pytz, python-dateutil, matplotlib]"
2,936302870528,"[pyspark, py4j]"


*3 clusters of nodes*

## Label Propagation
- Fast algorithm for finding communities in a graph
- Suited to networks where groupings are less clear and weights can be used to help a node determine which community to place itself within
- Lends itself well to semi-supervised learning

### Two variations of label propagation
#### Push
- Pushes labels to neighbors to find clusters
#### Pull
- Pulls labels from neighbors based on relationship weights to find clusters
### Semi-Supervised Learning and Seed Labels
- Can return different community structures when run on the same graph, depending on the order in which nodes are evaluated
- Does not converge on a single solution
- Labeling some nodes (seed labels) and leaving others unlabeled narrows the range of solutions
- LP can therefore be considered a semi-supervised method for finding communities
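
A rough sketch of the pull variant with seed labels, under several simplifying assumptions: the graph is undirected and unweighted, all nodes update at once from the previous round (synchronous), seed labels never change, and ties keep the current label or else take the alphabetically smallest one. The graph and names are hypothetical.

```python
from collections import Counter

def propagate_labels(adj, seeds, max_iter=10):
    """Pull-style label propagation with fixed seed labels."""
    labels = dict(seeds)  # unlabeled nodes are simply absent at the start
    for _ in range(max_iter):
        updated = dict(labels)
        for u in adj:
            if u in seeds:
                continue  # seed labels stay fixed
            # Pull the labels currently held by labeled neighbors
            counts = Counter(labels[v] for v in adj[u] if v in labels)
            if not counts:
                continue  # no labeled neighbor yet
            best = max(counts.values())
            candidates = sorted(l for l, c in counts.items() if c == best)
            # On a tie keep the current label, otherwise take the smallest candidate
            updated[u] = labels[u] if labels.get(u) in candidates else candidates[0]
        if updated == labels:
            break  # nothing changed in this round
        labels = updated
    return labels

# Two triangles (a, b, c) and (d, e, f) joined by the bridge c - d
adj = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
    "d": {"c", "e", "f"}, "e": {"d", "f"}, "f": {"d", "e"},
}
labels = propagate_labels(adj, seeds={"a": "left", "f": "right"})
print(labels)
```

Changing the tie rule or updating nodes one at a time in a different order can give a different split, which is the order dependence noted above.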

### When Should I Use Label Propagation?
- Community detection in large-scale networks, especially if weights are available
- Can be parallelized

In [8]:
result = g.labelPropagation(maxIter=10)
(result
    .sort("label")
    .groupby("label")
    .agg(F.collect_list("id"))
    .toPandas())

Unnamed: 0,label,collect_list(id)
0,549755813888,"[matplotlib, spacy, six, pandas]"
1,764504178688,"[nbconvert, ipykernel, jpy-client]"
2,833223655424,"[python-dateutil, numpy, pytz]"
3,936302870528,[pyspark]
4,1099511627776,"[jpy-core, jpy-console, jupyter]"
5,1279900254208,[py4j]


## Louvain Modularity
- Finds clusters by comparing community density as it assigns nodes to different groups
- One of the fastest modularity-based algorithms
- Detects communities
- Reveals hierarchies

*Modularity*
- a measure of the quality of a community assignment that compares the density of connections within a cluster to an average or random sample
- a technique for uncovering communities by partitioning a graph into more coarse-grained modules and then measuring the strength of the groupings

Useful for understanding the structure of a network at different levels of granularity

#### Quality-based grouping via modularity
- compares relationship density within clusters to relationship density between clusters
- optimizes locally, then globally

### Calculating Modularity
\begin{equation*}
M = \sum^{n_c}_{c=1}\left[{\frac{L_c}{L}}-\left({\frac{k_c}{2L}}\right)^2\right]
\end{equation*}
where :
- $L$ is the number of relationships in the entire graph.
- $L_c$ is the number of relationships in a partition
- $k_c$ is the total degree of nodes in a partition
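
As a worked example of the formula above, the sketch below computes $M$ for a made-up graph of two triangles joined by one relationship, assuming an unweighted, undirected graph given as a list of pairs. The function name is hypothetical.

```python
def modularity(edges, communities):
    """M = sum over communities of L_c / L - (k_c / 2L)^2."""
    total = len(edges)  # L: relationships in the whole graph
    score = 0.0
    for members in communities:
        # L_c: relationships with both ends inside the community
        inside = sum(1 for s, t in edges if s in members and t in members)
        # k_c: total degree of the community's nodes (each endpoint counts once)
        degree = sum((s in members) + (t in members) for s, t in edges)
        score += inside / total - (degree / (2 * total)) ** 2
    return score

edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d"),
         ("d", "e"), ("d", "f"), ("e", "f")]
split = modularity(edges, [{"a", "b", "c"}, {"d", "e", "f"}])
single = modularity(edges, [{"a", "b", "c", "d", "e", "f"}])
print(split, single)
```

With $L = 7$, each triangle has $L_c = 3$ and $k_c = 7$, so $M = 2\,(3/7 - (7/14)^2) = 5/14 \approx 0.357$. Putting every node in a single community gives $M = 0$.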

The algorithm repeats two steps:
1. "greedy" assignment of nodes to communities, favouring local optimizations of modularity
2. definition of a more coarse-grained network based on the communities found in the first step, used in the next iteration of the algorithm
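
Step 2 can be sketched on its own. The snippet below assumes step 1 has already produced a node-to-community mapping (here written by hand) and collapses each community into a single node, summing relationship weights. Names and data are hypothetical, and this is not a full Louvain implementation.

```python
from collections import defaultdict

def aggregate(edges, community_of):
    """Collapse each community into one node, summing the merged relationship weights."""
    coarse = defaultdict(float)
    for src, dst, weight in edges:
        # Undirected: store each community pair in a canonical order
        a, b = sorted((community_of[src], community_of[dst]))
        coarse[(a, b)] += weight  # a == b becomes a self-loop holding internal weight
    return dict(coarse)

edges = [("a", "b", 1.0), ("a", "c", 1.0), ("b", "c", 1.0), ("c", "d", 1.0),
         ("d", "e", 1.0), ("d", "f", 1.0), ("e", "f", 1.0)]
community_of = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
coarse_edges = aggregate(edges, community_of)
print(coarse_edges)
```

The self-loops keep the weight that was internal to each community, so modularity can still be evaluated on the smaller graph in the next pass.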

Evaluating the modularity of a grouping:
\begin{equation*}
Q = \frac{1}{2m}\sum_{u,v}\left[A_{uv}-\frac{{k_u}{k_v}}{2m}\right]\delta(c_u,c_v)
\end{equation*}
where:
- $u$ and $v$ are nodes
- $m$ is the total relationship weight across the entire graph ($2m$ is a common normalization value in modularity formulas)
- $A_{uv}-\frac{{k_u}{k_v}}{2m}$ is the strength of the relationship between $u$ and $v$ compared to what we would expect with a random assignment (tends toward averages) of those nodes in the network.
- $A_{uv}$ is the weight of the relationship between $u$ and $v$.
- $k_u$ is the sum of relationship weights for $u$.
- $k_v$ is the sum of relationship weights for $v$.
- $\delta(c_u,c_v)$ is equal to 1 if $u$ and $v$ are assigned to the same community, and 0 if they are not.
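
A direct, unoptimized transcription of $Q$ into Python, assuming an undirected graph without self-loops given as `(u, v, weight)` triples. The function name and the sample graph are invented for illustration.

```python
from collections import defaultdict

def modularity_q(weighted_edges, community_of):
    """Q = 1/(2m) * sum over u, v of [A_uv - k_u * k_v / (2m)] * delta(c_u, c_v)."""
    adjacency = defaultdict(float)  # A_uv, stored in both directions
    strength = defaultdict(float)   # k_u: sum of relationship weights for u
    m = 0.0                         # total relationship weight
    for u, v, w in weighted_edges:
        adjacency[(u, v)] += w
        adjacency[(v, u)] += w
        strength[u] += w
        strength[v] += w
        m += w
    q = 0.0
    for u in strength:
        for v in strength:
            if community_of[u] == community_of[v]:  # delta(c_u, c_v) = 1
                q += adjacency.get((u, v), 0.0) - strength[u] * strength[v] / (2 * m)
    return q / (2 * m)

# Two triangles joined by one relationship, all weights equal to 1
edges = [("a", "b", 1.0), ("a", "c", 1.0), ("b", "c", 1.0), ("c", "d", 1.0),
         ("d", "e", 1.0), ("d", "f", 1.0), ("e", "f", 1.0)]
community_of = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
q = modularity_q(edges, community_of)
print(q)
```

The double loop visits every ordered pair of nodes, including $u = v$, so it is quadratic in the number of nodes and only suitable for small examples. For this graph $Q = 5/14 \approx 0.357$.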

### When Should I Use Louvain?
- Find communities in vast networks
- Applies a heuristic that is useful on large graphs
- Evaluating the structure of complex networks
    - uncovering many levels of hierarchies

### Validating Communities
Choosing the right algorithm for a particular problem is challenging and requires a bit of exploration

Validate the accuracy of the communities found by comparing results to a benchmark based on data with known communities

Best known benchmarks:
- Girvan-Newman (GN): generates a random, homogeneous network
- Lancichinetti-Fortunato-Radicchi (LFR): creates a more heterogeneous graph where node degrees and community sizes are distributed according to a power law

Important to match the benchmark to the dataset
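
The notes do not say how found communities are compared with known ones. One simple option, assumed here purely for illustration, is the Rand index: the fraction of node pairs on which the two groupings agree (both put the pair together, or both keep it apart). The labels below are hypothetical.

```python
from itertools import combinations

def rand_index(found, known):
    """Fraction of node pairs treated the same way by both groupings."""
    pairs = list(combinations(sorted(found), 2))
    agree = sum(
        (found[u] == found[v]) == (known[u] == known[v])
        for u, v in pairs
    )
    return agree / len(pairs)

# Known benchmark communities versus a result that misplaces node d
known = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
found = {"a": "x", "b": "x", "c": "x", "d": "x", "e": "y", "f": "y"}
score = rand_index(found, known)
perfect = rand_index(known, known)
print(score, perfect)
```

A score of 1.0 means the groupings are identical up to renaming of the labels. The misplaced node `d` causes disagreement on 5 of the 15 pairs, giving 2/3.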