Community formation is common in all types of networks, and identifying communities is essential for evaluating group behavior and emergent phenomena. The general principle in finding communities is that their members have more relationships within the group than with nodes outside it. Identifying these related sets reveals clusters of nodes, isolated groups, and network structure. This information helps infer similar behavior or preferences of peer groups, estimate resiliency, find nested relationships, and prepare data for other analyses. Community detection algorithms are also commonly used to produce network visualizations for general inspection.

We'll provide details on the most representative community detection algorithms:

- Triangle Count and Clustering Coefficient for overall relationship density
- Strongly Connected Components and Connected Components for finding connected clusters
- Label Propagation for quickly inferring groups based on node labels
- Louvain Modularity for looking at grouping quality and hierarchies

We'll explain how the algorithms work and show examples in Apache Spark and Neo4j. In cases where an algorithm is only available in one platform, we'll provide just one example. We use weighted relationships for these algorithms because they're typically used to capture the significance of different relationships.

Figure 6-1 gives an overview of the differences between the community detection algorithms covered here, and Table 6-1 provides a quick reference as to what each algorithm calculates with example uses.

![48.png](./picture/48.png)

We use the terms set, partition, cluster, group, and community interchangeably. These terms are different ways to indicate that similar nodes can be grouped. Community detection algorithms are also called clustering and partitioning algorithms. In each section, we use the terms that are most prominent in the literature for a particular algorithm.

![49.png](./picture/49.png)

First, we'll describe the data for our examples and walk through importing the data into Spark and Neo4j. The algorithms are covered in the order listed in Table 6-1. For each, you'll find a short description and advice on when to use it. Most sections also include guidance on when to use related algorithms. We demonstrate example code using sample data at the end of each algorithm section.

When using community detection algorithms, be conscious of the density of the relationships. 

If the graph is very dense, you may end up with all nodes congregating in one or just a few clusters. You can counteract this by filtering by degree, relationship weights, or similarity metrics. 

On the other hand, if the graph is too sparse with few connected nodes, you may end up with each node in its own cluster. In this case, try to incorporate additional relationship types that carry more relevant information.
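
As a rough illustration of the filtering advice for dense graphs, the following Python sketch drops weak relationships and overly connected hub nodes before community detection. The edge tuple format, the function name, and the `min_weight` and `max_degree` thresholds are assumptions made for this example, not settings from Spark or Neo4j.

```python
from collections import Counter


def filter_dense_graph(edges, min_weight=0.5, max_degree=None):
    """Return the (source, target, weight) edges that survive both filters.

    edges: iterable of (source, target, weight) tuples (hypothetical format).
    min_weight: relationships lighter than this are dropped.
    max_degree: if set, relationships touching a node with more
        relationships than this are dropped.
    """
    # Filter by relationship weight first.
    kept = [(s, t, w) for s, t, w in edges if w >= min_weight]
    if max_degree is None:
        return kept
    # Count the degree of every node in the weight-filtered graph.
    degree = Counter()
    for s, t, _ in kept:
        degree[s] += 1
        degree[t] += 1
    # Drop relationships that touch an overly connected hub.
    return [(s, t, w) for s, t, w in kept
            if degree[s] <= max_degree and degree[t] <= max_degree]


# Toy data: node "a" acts as a hub.
edges = [("a", "b", 0.9), ("a", "c", 0.2), ("a", "d", 0.8),
         ("a", "e", 0.6), ("b", "d", 0.7)]
strong = filter_dense_graph(edges, min_weight=0.5)
no_hubs = filter_dense_graph(edges, min_weight=0.5, max_degree=2)
```

Suitable thresholds depend on the data, so it is worth trying a few values and checking how the resulting clusters change.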

# Example Graph Data: The Software Dependency Graph

Dependency graphs are particularly well suited for demonstrating the sometimes subtle differences between community detection algorithms because they tend to be more connected and hierarchical. The examples in this chapter are run against a graph containing dependencies between Python libraries, although dependency graphs are used in various fields, from software to energy grids. This kind of software dependency graph is used by developers to keep track of transitive interdependencies and conflicts in software projects. You can download the nodes and relationships files from the book's GitHub repository.

![50.png](./picture/50.png)
![51.png](./picture/51.png)


Figure 6-2 shows the graph that we want to construct. Looking at this graph, we see that there are three clusters of libraries. We can use visualizations on smaller datasets as a tool to help validate the clusters derived by community detection algorithms.


![52.png](./picture/52.png)

## Importing the Data into Neo4j

The following query imports the nodes:

And this imports the relationships:

# Triangle Count and Clustering Coefficient

The Triangle Count and Clustering Coefficient algorithms are presented together because they are so often used together. Triangle Count determines the number of triangles passing through each node in the graph. A triangle is a set of three nodes, where each node has a relationship to all other nodes. Triangle Count can also be run globally for evaluating our overall dataset.
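
To make the definition concrete, here is a minimal pure-Python sketch of per-node and global triangle counting on an undirected graph. The function name and the toy edge list are assumptions for illustration; this is not how Spark or Neo4j implement the algorithm.

```python
from itertools import combinations


def triangle_counts(edges):
    """Count the triangles passing through each node of an undirected graph."""
    neighbors = {}
    for a, b in edges:
        if a == b:
            continue  # self-loops cannot form triangles
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    counts = {}
    for node, nbrs in neighbors.items():
        # Every pair of neighbors that is itself connected closes a triangle.
        counts[node] = sum(1 for x, y in combinations(nbrs, 2)
                           if y in neighbors[x])
    return counts


# Hypothetical toy graph: one triangle (a, b, c) plus a pendant node d.
edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]
counts = triangle_counts(edges)
# Each triangle is counted once per corner, so divide by three for the global count.
total_triangles = sum(counts.values()) // 3
```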

Networks with a high number of triangles are more likely to exhibit small-world structures and behaviors.

The goal of the Clustering Coefficient algorithm is to measure how tightly a group is clustered compared to how tightly it could be clustered. The algorithm uses Triangle Count in its calculations, which provides a ratio of existing triangles to possible relationships. A maximum value of 1 indicates a clique where every node is connected to every other node.

There are two types of clustering coefficients: local clustering and global clustering.
    

## Local Clustering Coefficient

The local clustering coefficient of a node is the likelihood that its neighbors are also connected. The computation of this score involves triangle counting.

The clustering coefficient of a node can be found by multiplying the number of triangles passing through the node by two and then dividing that by k(k - 1), where k is the degree of the node. Equivalently, it is the triangle count divided by k(k - 1)/2, the maximum number of relationships that could exist among the node's k neighbors. Examples of different triangles and clustering coefficients for a node with five relationships are portrayed in Figure 6-3.

![37.png](./picture/37.png)

Note in Figure 6-3, we use a node with five relationships which makes it appear that the clustering coefficient will always equate to 10% of the number of triangles. We can see this is not the case when we alter the number of relationships. If we change the second example to have four relationships (and the same two triangles) then the coefficient is 0.33.

The clustering coefficient for a node uses the formula:
![53.png](./picture/53.png)
    
where:

- u is a node.
- R(u) is the number of relationships through the neighbors of u (this can be obtained by using the number of triangles passing through u).
- k(u) is the degree of u.
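
The definitions above can be sketched in a few lines of Python, assuming the formula takes the usual form CC(u) = 2R(u) / (k(u)(k(u) - 1)). The helper names and toy graphs are hypothetical; the two examples reproduce the node with five relationships and two triangles (0.2) and the variant with four relationships and the same two triangles (0.33).

```python
from itertools import combinations


def build_neighbors(edges):
    """Build an undirected adjacency map from (a, b) pairs."""
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    return neighbors


def local_clustering(neighbors, u):
    """CC(u) = 2 * R(u) / (k(u) * (k(u) - 1)); taken as 0.0 when k(u) < 2."""
    k = len(neighbors[u])
    if k < 2:
        return 0.0
    # R(u): relationships among the neighbors of u, one per triangle through u.
    r = sum(1 for x, y in combinations(neighbors[u], 2) if y in neighbors[x])
    return 2 * r / (k * (k - 1))


five = build_neighbors([("u", "n1"), ("u", "n2"), ("u", "n3"), ("u", "n4"),
                        ("u", "n5"), ("n1", "n2"), ("n3", "n4")])
four = build_neighbors([("u", "n1"), ("u", "n2"), ("u", "n3"), ("u", "n4"),
                        ("n1", "n2"), ("n3", "n4")])
cc_five = local_clustering(five, "u")  # 2 * 2 / (5 * 4)
cc_four = local_clustering(four, "u")  # 2 * 2 / (4 * 3)
```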
    
## Global Clustering Coefficient

The global clustering coefficient is the normalized sum of the local clustering coefficients. 

Clustering coefficients give us an effective means to find obvious groups like cliques, where every node has a relationship with all other nodes, but we can also specify thresholds to set levels (say, where nodes are 40% connected).
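
As a sketch of both ideas, the snippet below computes local coefficients, averages them as one common reading of "normalized sum", and then applies a 40% threshold. Treating the global score as the plain average is an assumption here (other definitions of the global coefficient exist), and the function names and toy graph are hypothetical.

```python
from itertools import combinations


def clustering_coefficients(edges):
    """Return the local clustering coefficient of every node."""
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    scores = {}
    for u, nbrs in neighbors.items():
        k = len(nbrs)
        links = sum(1 for x, y in combinations(nbrs, 2) if y in neighbors[x])
        scores[u] = 2 * links / (k * (k - 1)) if k > 1 else 0.0
    return scores


def average_clustering(scores):
    """Average of the local coefficients (assumed normalization)."""
    return sum(scores.values()) / len(scores) if scores else 0.0


# Toy graph: a triangle (a, b, c) with a pendant node d attached to c.
edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]
scores = clustering_coefficients(edges)
global_score = average_clustering(scores)
# Threshold example: keep nodes whose neighborhoods are at least 40% connected.
cohesive = sorted(u for u, s in scores.items() if s >= 0.4)
```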

## When Should I Use Triangle Count and Clustering Coefficient?

Use Triangle Count when you need to determine the stability of a group or as part of calculating other network measures such as the clustering coefficient. Triangle counting is popular in social network analysis, where it is used to detect communities.

Clustering Coefficient can provide the probability that randomly chosen nodes will be connected. You can also use it to quickly evaluate the cohesiveness of a specific group or your overall network. Together these algorithms are used to estimate resiliency and look for network structures.

Example use cases include:

- Identifying features for classifying a given website as spam content. This is described in "Efficient Semi-Streaming Algorithms for Local Triangle Counting in Massive Graphs", a paper by L. Becchetti et al.
- Investigating the community structure of Facebook's social graph, where researchers found dense neighborhoods of users in an otherwise sparse global graph. Find this study in the paper "The Anatomy of the Facebook Social Graph", by J. Ugander et al.
- Exploring the thematic structure of the web and detecting communities of pages with common topics based on the reciprocal links between them. For more information, see "Curvature of Co-Links Uncovers Hidden Thematic Layers in the World Wide Web", by J.-P. Eckmann and E. Moses.

## Triangles with Neo4j

Getting a stream of the triangles isn't available using Spark, but we can return it using Neo4j:

Running this procedure gives the following result

![54.png](./picture/54.png)

We see the same six libraries as we did before, but now we know how they're connected. matplotlib, six, and python-dateutil form one triangle. jupyter, jpy-console, and ipykernel form the other.

We can see these triangles visually in Figure 6-4.

![55.png](./picture/55.png)

## Local Clustering Coefficient with Neo4j

We can also work out the local clustering coefficient. The following query will calculate this for each node:

Running this procedure gives the following result:

![56.png](./picture/56.png)

ipykernel has a score of 1, which means that all of ipykernel's neighbors are neighbors of each other. We can clearly see that in Figure 6-4. This tells us that the community directly around ipykernel is very cohesive.

We've filtered out nodes with a coefficient score of 0 in this code sample, but nodes with low coefficients may also be interesting. A low score can be an indicator that a node is a structural hole: a node that is well connected to nodes in different communities that aren't otherwise connected to each other. This is a method for finding potential bridges that we discussed in Chapter 5.

# Strongly Connected Components

The Strongly Connected Components (SCC) algorithm is one of the earliest graph algorithms. SCC finds sets of connected nodes in a directed graph where each node is reachable in both directions from any other node in the same set. Its runtime operations scale well, proportional to the number of nodes. In Figure 6-5 you can see that the nodes in an SCC group don't need to be immediate neighbors, but there must be directional paths between all nodes in the set.

![57.png](./picture/57.png)

Decomposing a directed graph into its strongly connected components is a classic application of the Depth First Search algorithm. Neo4j uses DFS under the hood as part of its implementation of the SCC algorithm.
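
To show how depth-first search yields the decomposition, here is a minimal sketch of Kosaraju's two-pass approach, one classic DFS-based technique. It is not necessarily the variant Neo4j uses internally, and the function name and toy dependency data are assumptions for illustration.

```python
def strongly_connected_components(graph):
    """graph: dict mapping each node to an iterable of successor nodes."""
    # Collect every node, including ones that only appear as targets.
    nodes = set(graph)
    for targets in graph.values():
        nodes.update(targets)

    # Pass 1: iterative DFS on the original graph, recording finish order.
    visited, order = set(), []
    for start in sorted(nodes):
        if start in visited:
            continue
        visited.add(start)
        stack = [(start, iter(graph.get(start, ())))]
        while stack:
            node, successors = stack[-1]
            for nxt in successors:
                if nxt not in visited:
                    visited.add(nxt)
                    stack.append((nxt, iter(graph.get(nxt, ()))))
                    break
            else:
                # All successors explored: the node is finished.
                order.append(node)
                stack.pop()

    # Reverse every relationship.
    reverse = {n: [] for n in nodes}
    for src, targets in graph.items():
        for dst in targets:
            reverse[dst].append(src)

    # Pass 2: DFS on the reversed graph in decreasing finish order.
    assigned, components = set(), []
    for start in reversed(order):
        if start in assigned:
            continue
        assigned.add(start)
        component, stack = [], [start]
        while stack:
            node = stack.pop()
            component.append(node)
            for prev in reverse[node]:
                if prev not in assigned:
                    assigned.add(prev)
                    stack.append(prev)
        components.append(sorted(component))
    return components


# Toy dependency data with one deliberate cycle through a made-up "extra" library.
graph = {"pyspark": ["py4j"], "py4j": ["extra"], "extra": ["pyspark"],
         "pandas": ["numpy"], "numpy": []}
components = sorted(strongly_connected_components(graph))
```

Any component with more than one member contains a cycle, which is why SCC is handy for spotting circular dependencies.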

## When Should I Use Strongly Connected Components?

Use Strongly Connected Components as an early step in graph analysis to see how a graph is structured or to identify tight clusters that may warrant independent investigation. A component that is strongly connected can be used to profile similar behavior or inclinations in a group for applications such as recommendation engines.

Many community detection algorithms like SCC are used to find and collapse clusters into single nodes for further intercluster analysis. You can also use SCC to visualize cycles for analyses like finding processes that might deadlock because each subprocess is waiting for another member to take action.

Example use cases include:

- Finding the set of firms in which every member directly and/or indirectly owns shares in every other member, as in "The Network of Global Corporate Control", an analysis of powerful transnational corporations by S. Vitali, J. B. Glattfelder, and S. Battiston.
- Computing the connectivity of different network configurations when measuring routing performance in multihop wireless networks. Read more in "Routing Performance in the Presence of Unidirectional Links in Multihop Wireless Networks", by M. K. Marina and S. R. Das.
- Acting as the first step in many graph algorithms that work only on strongly connected graphs. In social networks we find many strongly connected groups. In these sets people often have similar preferences, and the SCC algorithm is used to find such groups and suggest pages to like or products to purchase to the people in the group who have not yet done so.

Some algorithms have strategies for escaping infinite loops, but if we're writing our own algorithms or finding nonterminating processes, we can use SCC to check for cycles.

## Strongly Connected Components with Neo4j

Let's run the same algorithm using Neo4j. Execute the following query to run the algorithm:

The parameters passed to this algorithm are:

- Library: the node label to load from the graph
- DEPENDS_ON: the relationship type to load from the graph

This is the output we'll see when we run the query:

![58.png](./picture/58.png)

As with the Spark example, every node is in its own partition.

So far the algorithm has only revealed that our Python libraries are very well behaved, but let's create a circular dependency in the graph to make things more interesting. This should mean that we'll end up with some nodes in the same partition.

The following query adds an extra library that creates a circular dependency between py4j and pyspark:
    
We can clearly see the circular dependency that got created in Figure 6-6.

![59.png](./picture/59.png)

Now if we run the SCC algorithm again we'll see a slightly different result:

![60.png](./picture/60.png)

Before we move on to the next algorithm we'll delete the extra library and its relationships from the graph:

# Connected Components

The Connected Components algorithm (sometimes called Union Find or Weakly Connected Components) finds sets of connected nodes in an undirected graph where each node is reachable from any other node in the same set. It differs from the SCC algorithm because it only needs a path to exist between pairs of nodes in one direction, whereas SCC needs a path to exist in both directions. Bernard A. Galler and Michael J. Fischer first described this algorithm in their 1964 paper, "An Improved Equivalence Algorithm".
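
The Union Find name hints at how the algorithm can work. The following is a minimal sketch of a union-find (disjoint-set) structure with path halving; the function name and sample data are assumptions for illustration, and relationship direction is deliberately ignored.

```python
def connected_components(nodes, edges):
    """Group nodes into components, treating every edge as undirected."""
    parent = {n: n for n in nodes}

    def find(x):
        # Walk up to the set representative, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # union the two sets

    groups = {}
    for n in nodes:
        groups.setdefault(find(n), []).append(n)
    return sorted(sorted(group) for group in groups.values())


nodes = ["a", "b", "c", "d", "e"]
edges = [("a", "b"), ("c", "b"), ("d", "e")]  # direction is not considered
components = connected_components(nodes, edges)
```

Because each new relationship only triggers two `find` calls and at most one union, this structure suits graphs that are updated frequently.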

## When Should I Use Connected Components?

As with SCC, Connected Components is often used early in an analysis to understand a graph's structure. Because it scales efficiently, consider this algorithm for graphs requiring frequent updates. It can quickly show new nodes in common between groups, which is useful for analysis such as fraud detection.

Make it a habit to run Connected Components to test whether a graph is connected as a preparatory step for general graph analysis. Performing this quick test can avoid accidentally running algorithms on only one disconnected component of a graph and getting incorrect results.
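
A quick connectivity test along these lines might look like the following sketch, which uses a breadth-first search from an arbitrary node. The function name is hypothetical and the node list is assumed to contain no duplicates.

```python
from collections import deque


def is_connected(nodes, edges):
    """Return True if every node is reachable from the first one (undirected)."""
    nodes = list(nodes)
    if not nodes:
        return True
    neighbors = {n: set() for n in nodes}
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    seen = {nodes[0]}
    queue = deque([nodes[0]])
    while queue:
        for nxt in neighbors[queue.popleft()]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return len(seen) == len(nodes)


whole = is_connected(["a", "b", "c"], [("a", "b"), ("b", "c")])
split = is_connected(["a", "b", "c", "d"], [("a", "b"), ("c", "d")])
```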

Example use cases include:

- Keeping track of clusters of database records, as part of the deduplication process. Deduplication is an important task in master data management applications; the approach is described in more detail in "An Efficient Domain-Independent Algorithm for Detecting Approximately Duplicate Database Records", by A. Monge and C. Elkan.
- Analyzing citation networks. One study uses Connected Components to work out how well connected a network is, and then to see whether the connectivity remains if hub or authority nodes are removed from the graph. This use case is explained further in "Characterizing and Mining Citation Graph of Computer Science Literature", a paper by Y. An, J. C. M. Janssen, and E. E. Milios.
    
    

## Connected Components with Neo4j

We can also execute this algorithm in Neo4j by running the following query:

The parameters passed to this algorithm are:

- Library: the node label to load from the graph
- DEPENDS_ON: the relationship type to load from the graph

Here’s the output:
![61.png](./picture/61.png)

Both of the community detection algorithms that we've covered so far are deterministic: they return the same results each time we run them. Our next two algorithms are examples of nondeterministic algorithms, where we may see different results if we run them multiple times, even on the same data.

# Label Propagation

The Label Propagation algorithm (LPA) is a fast algorithm for finding communities in a graph. In LPA, nodes select their group based on their direct neighbors. This process is well suited to networks where groupings are less clear and weights can be used to help a node determine which community to place itself within. It also lends itself well to semisupervised learning because you can seed the process with preassigned, indicative node labels. 


The intuition behind this algorithm is that a single label can quickly become dominant in a densely connected group of nodes, but it will have trouble crossing a sparsely connected region. Labels get trapped inside a densely connected group of nodes, and nodes that end up with the same label when the algorithm finishes are considered part of the same community. The algorithm resolves overlaps, where nodes are potentially part of multiple clusters, by assigning membership to the label neighborhood with the highest combined relationship and node weight.

LPA is a relatively new algorithm proposed in 2007 by U. N. Raghavan, R. Albert, and S. Kumara, in a paper titled "Near Linear Time Algorithm to Detect Community Structures in Large-Scale Networks".

Figure 6-8 depicts two variations of Label Propagation, a simple push method and the  more typical pull method that relies on relationship weights. The pull method lends itself well to parallelization.

![62.png](./picture/62.png)

The steps often used for the Label Propagation pull method are:

 1. Every node is initialized with a unique label (an identifier), and, optionally, preliminary seed labels can be used.

 2. These labels propagate through the network.

 3. At every propagation iteration, each node updates its label to match the one with the maximum weight, which is calculated based on the weights of neighbor nodes and their relationships. Ties are broken uniformly and randomly.

 4. LPA reaches convergence when each node has the majority label of its neighbors.

As labels propagate, densely connected groups of nodes quickly reach a consensus on a unique label. At the end of the propagation, only a few labels will remain, and nodes that have the same label belong to the same community.
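
The steps above can be sketched in plain Python as follows. This is a simplified illustration, not the Spark or Neo4j implementation: it uses relationship weights only (no node weights), updates nodes one at a time in random order, keeps a node's current label when it is already among the heaviest, and lets seed labels be overwritten like any other label. The function signature and sample data are assumptions.

```python
import random
from collections import defaultdict


def label_propagation(edges, seeds=None, max_iterations=10, rng=None):
    """edges: iterable of (a, b, weight) tuples for an undirected graph.
    seeds: optional {node: label} preliminary labels.
    Returns {node: label}."""
    rng = rng or random.Random()
    neighbors = defaultdict(dict)
    for a, b, w in edges:
        neighbors[a][b] = neighbors[a].get(b, 0.0) + w
        neighbors[b][a] = neighbors[b].get(a, 0.0) + w
    # Step 1: a unique label per node unless a seed label is supplied.
    labels = {n: (seeds or {}).get(n, n) for n in neighbors}
    nodes = list(neighbors)
    for _ in range(max_iterations):
        rng.shuffle(nodes)
        changed = False
        for node in nodes:
            # Step 3: total relationship weight behind each neighboring label.
            totals = defaultdict(float)
            for nbr, w in neighbors[node].items():
                totals[labels[nbr]] += w
            best = max(totals.values())
            candidates = [lab for lab, t in totals.items() if t == best]
            if labels[node] not in candidates:
                labels[node] = rng.choice(candidates)  # random tie-break
                changed = True
        # Step 4: stop once every node already holds a heaviest label.
        if not changed:
            break
    return labels


# Toy graph: a pair (a, b) and a separate triangle (c, d, e).
edges = [("a", "b", 1.0), ("c", "d", 1.0), ("c", "e", 1.0), ("d", "e", 1.0)]
labels = label_propagation(edges, rng=random.Random(0))
seeded = label_propagation(edges, seeds={"c": "red", "d": "red", "e": "red"},
                           rng=random.Random(0))
```

Labels can only travel along relationships, so the two disconnected groups never share a label, whatever the random order.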
 

## Semi-Supervised Learning and Seed Labels

In contrast to other algorithms, Label Propagation can return different community structures when run multiple times on the same graph. The order in which LPA evaluates nodes can have an influence on the final communities it returns.

The range of solutions is narrowed when some nodes are given preliminary labels (i.e., seed labels), while others are unlabeled. Unlabeled nodes are more likely to adopt the preliminary labels.

This use of Label Propagation can be considered a semi-supervised learning method to find communities. Semi-supervised learning is a class of machine learning tasks and techniques that operate on a small amount of labeled data, along with a larger amount of unlabeled data. We can also run the algorithm repeatedly on graphs as they evolve.

Finally, LPA sometimes doesn't converge on a single solution. In this situation, our community results will continually flip between a few remarkably similar communities and the algorithm will never complete. Seed labels help guide it toward a solution. Spark and Neo4j use a set maximum number of iterations to avoid never-ending execution. You should test the iteration setting for your data to balance accuracy and execution time.


## When Should I Use Label Propagation?

Use Label Propagation in large-scale networks for initial community detection, especially when weights are available. This algorithm can be parallelized and is therefore extremely fast at graph partitioning.

Example use cases include:

- Assigning polarity of tweets as a part of semantic analysis. In this scenario, positive and negative seed labels from a classifier are used in combination with the Twitter follower graph. For more information, see "Twitter Polarity Classification with Label Propagation over Lexical Links and the Follower Graph", by M. Speriosu et al.
- Finding potentially dangerous combinations of possible co-prescribed drugs, based on the chemical similarity and side effect profiles. See "Label Propagation Prediction of Drug-Drug Interactions Based on Clinical Side Effects", a paper by P. Zhang et al.

## Label Propagation with Neo4j

Now let's try the same algorithm with Neo4j. We can execute LPA by running the following query:

![63.png](./picture/63.png)

The parameters passed to this algorithm are:

- Library: the node label to load from the graph
- DEPENDS_ON: the relationship type to load from the graph
- iterations: 10 (the maximum number of iterations to run)

The results, which can also be seen visually in Figure 6-9, are fairly similar to those we got with Apache Spark.

![64.png](./picture/64.png)

We can also run the algorithm assuming that the graph is undirected, which means that nodes will try to adopt labels from the libraries they depend on as well as ones that depend on them.

To do this, we pass the DIRECTION:BOTH parameter to the algorithm:
    
If we run that, we'll get the following output:

![65.png](./picture/65.png)

The number of clusters has reduced from six to four, and all the nodes in the matplotlib part of the graph are now grouped together. This can be seen more clearly in Figure 6-10.

![66.png](./picture/66.png)

Although the results of running Label Propagation on this data are similar for undirected and directed calculation, on complicated graphs you will see more significant differences. This is because ignoring direction causes nodes to try and adopt more labels, regardless of the relationship source.

# Louvain Modularity

The Louvain Modularity algorithm finds clusters by comparing community density as it assigns nodes to different groups. You can think of this as a "what if" analysis to try various groupings with the goal of reaching a global optimum.

Proposed in 2008, the Louvain algorithm is one of the fastest modularity-based algorithms. As well as detecting communities, it also reveals a hierarchy of communities at different scales. This is useful for understanding the structure of a network at different levels of granularity.

Louvain quantifies how well a node is assigned to a group by looking at the density of connections within a cluster in comparison to an average or random sample. This measure of community assignment is called modularity.

## Quality-based grouping via modularity

Modularity is a technique for uncovering communities by partitioning a graph into more coarse-grained modules (or clusters) and then measuring the strength of the groupings. As opposed to just looking at the concentration of connections within a cluster, this method compares relationship densities in given clusters to densities between clusters. The measure of the quality of those groupings is called modularity.

Modularity algorithms optimize communities locally and then globally, using multiple iterations to test different groupings and increasing coarseness. This strategy identifies community hierarchies and provides a broad understanding of the overall structure. However, all modularity algorithms suffer from two drawbacks:

- They merge smaller communities into larger ones.
- A plateau can occur where several partition options are present with similar modularity, forming local maxima and preventing progress.


For more information, see the paper "The Performance of Modularity Maximization in Practical Contexts", by B. H. Good, Y.-A. de Montjoye, and A. Clauset.

## Calculating Modularity

A simple calculation of modularity is based on the fraction of the relationships within the given groups minus the expected fraction if relationships were distributed at random between all nodes. The value is always between -1 and 1, with positive values indicating more relationship density than you'd expect by chance and negative values indicating less density. Figure 6-11 illustrates several different modularity scores based on node groupings.

![67.png](./picture/67.png)

The formula for the modularity of a group is:
 ![68.png](./picture/68.png) 

where:

- L is the number of relationships in the entire group.
- Lc is the number of relationships in a partition.
- kc is the total degree of nodes in a partition.
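
Using the symbols defined above, a simple modularity calculation can be sketched as follows. The sketch assumes the formula has the standard form M = sum over partitions of [Lc / L - (kc / 2L)^2]; the function name and the toy graph of two triangles joined by one relationship are hypothetical.

```python
def modularity(edges, community_of):
    """M = sum over partitions c of [ Lc / L - (kc / (2 * L)) ** 2 ].

    edges: list of (a, b) pairs for an undirected, unweighted graph.
    community_of: {node: partition identifier}.
    """
    total = len(edges)   # L
    internal = {}        # Lc per partition
    degree = {}          # kc per partition
    for a, b in edges:
        ca, cb = community_of[a], community_of[b]
        degree[ca] = degree.get(ca, 0) + 1
        degree[cb] = degree.get(cb, 0) + 1
        if ca == cb:
            internal[ca] = internal.get(ca, 0) + 1
    return sum(internal.get(c, 0) / total - (k / (2 * total)) ** 2
               for c, k in degree.items())


# Two triangles (0, 1, 2) and (3, 4, 5) joined by the relationship 2-3.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
split = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
single = {n: "A" for n in range(6)}
split_score = modularity(edges, split)    # 2 * (3/7 - (7/14) ** 2)
single_score = modularity(edges, single)  # 7/7 - (14/14) ** 2
```

Putting every node into one partition scores zero, while separating the two triangles gives a clearly positive score.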


The calculation for the optimal partition at the top of Figure 6-11 is as follows:


The dark partition is
![69.png](./picture/69.png)

The light partition is
![70.png](./picture/70.png)

These are added together for

![71.png](./picture/71.png)

Initially the Louvain Modularity algorithm optimizes modularity locally on all nodes, which finds small communities; then each small community is grouped into a larger conglomerate node and the first step is repeated until we reach a global optimum.

The algorithm consists of repeated application of two steps, as illustrated in Figure 6-12.

![72.png](./picture/72.png)

The Louvain algorithm's steps include:

 1. A “greedy” assignment of nodes to communities, favoring local optimizations of modularity.

 2. The definition of a more coarse-grained network based on the communities found in the first step. This coarse-grained network will be used in the next iteration of the algorithm.
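
As a rough sketch of the first, greedy step only, the code below moves each node into the neighboring community that most improves modularity and repeats until no move helps. For clarity it recomputes the full modularity for every candidate move, whereas practical Louvain implementations use an incremental gain formula; the coarse-graining step is not shown. The function names and toy graph are assumptions.

```python
def modularity(edges, community_of):
    """Sum over communities of Lc / L - (kc / (2 * L)) ** 2."""
    total, internal, degree = len(edges), {}, {}
    for a, b in edges:
        ca, cb = community_of[a], community_of[b]
        degree[ca] = degree.get(ca, 0) + 1
        degree[cb] = degree.get(cb, 0) + 1
        if ca == cb:
            internal[ca] = internal.get(ca, 0) + 1
    return sum(internal.get(c, 0) / total - (k / (2 * total)) ** 2
               for c, k in degree.items())


def local_moving(edges):
    """Greedy step: repeat single-node moves while modularity improves."""
    nodes = sorted({n for edge in edges for n in edge})
    neighbors = {n: set() for n in nodes}
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    community_of = {n: n for n in nodes}  # every node starts alone
    improved = True
    while improved:
        improved = False
        for node in nodes:
            original = community_of[node]
            best_comm = original
            best_score = modularity(edges, community_of)
            # Only communities of direct neighbors are candidates.
            options = {community_of[m] for m in neighbors[node]} - {original}
            for candidate in sorted(options):
                community_of[node] = candidate
                score = modularity(edges, community_of)
                if score > best_score + 1e-12:  # require a real improvement
                    best_comm, best_score = candidate, score
            community_of[node] = best_comm
            if best_comm != original:
                improved = True
    return community_of


# Two triangles (0, 1, 2) and (3, 4, 5) joined by the relationship 2-3.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
communities = local_moving(edges)
final_score = modularity(edges, communities)
```

On this small graph the greedy pass settles on the two triangles as separate communities.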
    
    1. 将节点贪婪地分配给社区，有利于模块化的局部优化。
    2. 根据第一步中找到的社区定义更粗粒度的网络。这个粗粒度网络将在算法的下一个迭代中使用。
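
The second step, building the coarse-grained network, can be sketched as follows. This is an illustrative simplification, not the implementation used by any particular library: the `coarsen` function, the sample relationships, and the community assignment are all hypothetical.

```python
from collections import defaultdict


def coarsen(weighted_edges, community_of):
    """Collapse each community into one node, summing relationship weights."""
    merged = defaultdict(float)
    for u, v, weight in weighted_edges:
        # Sort the pair so (0, 1) and (1, 0) count as the same relationship.
        key = tuple(sorted((community_of[u], community_of[v])))
        merged[key] += weight
    return dict(merged)


# Hypothetical result of step 1 on two triangles joined by one relationship.
weighted_edges = [("a", "b", 1.0), ("b", "c", 1.0), ("a", "c", 1.0),
                  ("d", "e", 1.0), ("e", "f", 1.0), ("d", "f", 1.0),
                  ("c", "d", 1.0)]
community_of = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}

coarse = coarsen(weighted_edges, community_of)
# (0, 0) and (1, 1) are self-loops holding each community's internal weight;
# (0, 1) is the single relationship between the two communities.
```

Keeping the internal weight as a self-loop means the next iteration can still account for how dense each merged community is.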

Part of the first optimization step is evaluating the modularity of a group. Louvain uses the following formula to accomplish this:

第一个优化步骤的一部分是评估一个组的模块度。Louvain使用以下公式来完成这项工作:

![73.png](./picture/73.png)

![74.png](./picture/74.png)

Another part of that first step evaluates the change in modularity if a node is moved to another group. Louvain uses a more complicated variation of this formula and then determines the best group assignment.

第一步的另一部分是评估将节点移动到另一个组时模块度的变化。Louvain使用这个公式的一个更复杂的变体，然后确定最佳的分组分配。
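
To make the "change in modularity" idea concrete, here is a brute-force sketch. It assumes the commonly cited pairwise form of modularity, Q = (1 / 2m) Σ [A_uv - k_u k_v / 2m] over node pairs in the same community, which may differ in notation from the formula shown above. It also recomputes Q from scratch for each candidate move, whereas Louvain uses an incremental variation. The function names and the sample graph are hypothetical.

```python
from itertools import product


def pairwise_modularity(edges, community_of):
    """Q for an unweighted, undirected graph, summed over same-community pairs."""
    m = len(edges)
    adjacent = set()
    degree = {node: 0 for node in community_of}
    for u, v in edges:
        adjacent.add((u, v))
        adjacent.add((v, u))
        degree[u] += 1
        degree[v] += 1
    total = 0.0
    for u, v in product(community_of, repeat=2):
        if community_of[u] == community_of[v]:
            a_uv = 1.0 if (u, v) in adjacent else 0.0
            total += a_uv - degree[u] * degree[v] / (2 * m)
    return total / (2 * m)


def best_move(edges, community_of, node):
    """Try `node` in every existing community; return (community, gain)."""
    base = pairwise_modularity(edges, community_of)
    best = (community_of[node], 0.0)
    for candidate in set(community_of.values()):
        trial = {**community_of, node: candidate}
        gain = pairwise_modularity(edges, trial) - base
        if gain > best[1]:
            best = (candidate, gain)
    return best


# Hypothetical graph: two triangles joined by c-d, with "c" misassigned.
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"),
         ("c", "d")]
assignment = {"a": 0, "b": 0, "c": 1, "d": 1, "e": 1, "f": 1}

target, gain = best_move(edges, assignment, "c")
```

Moving `c` back to its own triangle raises modularity, so the greedy step would accept the move. A node that is already well placed gets a gain of 0 and stays where it is.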

## When Should I Use Louvain?

Use Louvain Modularity to find communities in vast networks. This algorithm applies a heuristic rather than computing exact modularity, which is computationally expensive. Louvain can therefore be used on large graphs where standard modularity algorithms may struggle.

Louvain is also very helpful for evaluating the structure of complex networks, in particular uncovering many levels of hierarchies, such as those you might find in a criminal organization. The algorithm can provide results where you can zoom in on different levels of granularity and find subcommunities within subcommunities within subcommunities.
 
 使用Louvain模块化在巨大的网络中寻找社区。该算法采用启发式，而不是精确的模块化，后者的计算成本很高。因此，Louvain可以用于标准模块化算法可能会遇到困难的大型图。
 
Louvain对于评估复杂网络的结构也很有帮助，特别是揭示许多层次结构，比如在犯罪组织中可能发现的结构。该算法可以提供可以放大不同粒度级别的结果，并在子社区中找到子社区中的子社区。

Example use cases include:
    
    Detecting cyberattacks. The Louvain algorithm was used in a 2016 study by S. V. Shanbhaq of fast community detection in large-scale cybernetworks for cybersecurity applications. Once these communities have been detected, they can be used to detect cyberattacks.
    
    Extracting topics from online social platforms, like Twitter and YouTube, based on the co-occurrence of terms in documents as part of the topic modeling process. This approach is described in a paper by G. S. Kido, R. A. Igawa, and S. Barbon Jr., "Topic Modeling Based on Louvain Method in Online Social Networks".
    
    Finding hierarchical community structures within the brain's functional network, as described in "Hierarchical Modularity in Human Brain Functional Networks" by D. Meunier et al.

    检测网络攻击。尚巴克(s.v. Shanbhaq)在2016年的一项研究中使用了Louvain算法，该算法研究的是用于网络安全应用的大规模网络中的快速社区检测。一旦这些社区被发现，它们就可以被用来检测网络攻击。

    从Twitter和YouTube等在线社交平台提取主题，基于文档中词汇的共现，作为主题建模过程的一部分。G. S. Kido、R. A. Igawa 和 S. Barbon Jr. 在论文 "Topic Modeling Based on Louvain Method in Online Social Networks" 中描述了这种方法。

    D. Meunier等人在《人脑功能网络的层次模块性》一书中描述了在大脑功能网络中发现层次共同体结构。

Modularity optimization algorithms, including Louvain, suffer from two issues. First, the algorithms can overlook small communities within large networks. You can overcome this problem by reviewing the intermediate consolidation steps. Second, in large graphs with overlapping communities, modularity optimizers may not correctly determine the global maxima. In the latter case, we recommend using any modularity algorithm as a guide for gross estimation but not complete accuracy.

模块化优化算法，包括Louvain，存在两个问题。首先，算法可以忽略大型网络中的小社区。您可以通过回顾中间合并步骤来克服这个问题。其次，在具有重叠社区的大型图中，模块优化器可能不能正确地确定全局最大值。在后一种情况下，我们建议使用任何模块算法作为粗略估计的指导，但不能完全准确。

## Louvain with Neo4j

Let's see the Louvain algorithm in action. We can execute the following query to run the algorithm over our graph:
    
    让我们看看Louvain算法的实际应用。我们可以执行以下查询来在图上运行算法:

The parameters passed to this algorithm are:
Library 
    
        The node label to load from the graph 

DEPENDS_ON 
    
        The relationship type to load from the graph 
        
传递给算法的参数为:
    
Library
    
    节点标签以从图中加载
    
DEPENDS_ON
    
    要从图中加载的关系类型
   
These are the results:

![75.png](./picture/75.png)

The communities column describes the community that nodes fall into at two levels. The last value in the array is the final community and the other one is an intermediate community.

communities列描述节点在两个级别上所属的社区。数组中的最后一个值是最终社区，另一个是中间社区。

The numbers assigned to the intermediate and final communities are simply labels with no measurable meaning. Treat them as labels that indicate which community a node belongs to, such as "belongs to a community labeled 0", "belongs to a community labeled 4", and so forth.

分配给中间和最终社区的数字只是没有可衡量意义的标签。将这些节点视为表明哪些社区节点属于的标签，例如属于标记为0的社区、标记为4的社区，等等。

For example, matplotlib has a result of [2,0]. This means that matplotlib's final community is labeled 0 and its intermediate community is labeled 2.

例如，matplotlib的结果是[2,0]。这意味着matplotlib的最终社区标记为0，其中间社区标记为2。

It's easier to see how this works if we store these communities using the write version of the algorithm and then query them afterwards. The following query will run the Louvain algorithm and store the result in the communities property on each node.

如果我们使用算法的写版本存储这些社区，然后查询它，就更容易看出这是如何工作的。下面的查询将运行Louvain算法，并将结果存储在每个节点上的communities属性中。

We could also store the resulting communities using the streaming version of the algorithm, followed by calling the SET clause to store the result. The following query shows how we could do this.

我们还可以使用算法的流版本存储结果社区，然后调用SET子句来存储结果。下面的查询显示了我们如何做到这一点.

Once we've run either of those queries, we can write the following query to find the final clusters:
    
    一旦我们运行了这些查询中的任何一个，我们就可以编写下面的查询来找到最终的集群

l.communities[-1] returns the last item from the underlying array that this property stores.

l.communities[-1]从该属性存储的基础数组返回最后一项。
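
The same "last item is the final community" logic can be sketched outside the database. The snippet below is a Python analogue, not the query itself. Apart from matplotlib's `[2, 0]`, which comes from the text above, the library names and community arrays are hypothetical placeholders.

```python
from collections import defaultdict

# Hypothetical per-node results; only matplotlib's [2, 0] comes from the text.
communities = {
    "matplotlib": [2, 0],
    "lib_a": [2, 0],
    "lib_b": [1, 0],
    "lib_c": [3, 4],
}

final_clusters = defaultdict(list)
for library, levels in communities.items():
    # levels[-1] is the final community; levels[0] is the intermediate one.
    final_clusters[levels[-1]].append(library)

# Largest clusters first.
ranked = sorted(final_clusters.items(),
                key=lambda item: len(item[1]), reverse=True)
```

Here `lib_b` ends up in the same final cluster as matplotlib even though their intermediate communities (1 and 2) differ, which is the hierarchy the communities array is meant to expose.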

Running the query yields this output. **Note: the results here differ from those in the original text.**
![76.png](./picture/76.png)

......



# Validating Communities

Community detection algorithms generally have the same goal: to identify groups. However, because different algorithms begin with different assumptions, they may uncover different communities. This makes choosing the right algorithm for a particular problem more challenging and a bit of an exploration. 

        社区检测算法通常具有相同的目标:识别群体。然而，由于不同的算法开始于不同的假设，它们可能揭示不同的社区。这使得为特定的问题选择正确的算法变得更有挑战性和一点探索。
        
Most community detection algorithms do reasonably well when relationship density is high within groups compared to their surroundings, but real-world networks are often less distinct. We can validate the accuracy of the communities found by comparing our results to a benchmark based on data with known communities. 
        
        与周围环境相比，当群体内部的关系密度较高时，大多数社区检测算法都表现得相当不错，但现实世界中的网络往往没有那么明显。通过将结果与基于已知社区的数据的基准进行比较，我们可以验证发现的社区的准确性。

Two of the best-known benchmarks are the Girvan-Newman (GN) and Lancichinetti-Fortunato-Radicchi (LFR) algorithms. The reference networks that these algorithms generate are quite different: GN generates a random network which is more homogeneous, whereas LFR creates a more heterogeneous graph where node degrees and community sizes are distributed according to a power law.
        
        两个最著名的基准测试是Girvan-Newman (GN)和Lancichinetti Fortunato Radicchi (LFR)算法。这些算法生成的参考网络有很大的不同:GN生成一个更加均匀的随机网络，而LFR生成一个更加异构的图，其中节点度和社区大小按照幂律分布。
        
Since the accuracy of our testing depends on the benchmark used, it's important to match our benchmark to our dataset. As much as possible, look for similar densities, relationship distributions, community definitions, and related domains.
        
       由于测试的准确性取决于所使用的基准，因此将基准与数据集匹配非常重要。尽可能多地寻找相似的密度、关系分布、社区定义和相关领域。
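
One simple way to score detected communities against known ones is to count how often the two groupings agree on pairs of nodes. The sketch below uses the Rand index for this. The metric is our choice for illustration rather than one prescribed here, and the `known` and `found` assignments are hypothetical.

```python
from itertools import combinations


def rand_index(found, known):
    """Fraction of node pairs that both partitions treat the same way."""
    agree = 0
    pairs = 0
    for u, v in combinations(sorted(known), 2):
        same_found = found[u] == found[v]
        same_known = known[u] == known[v]
        agree += same_found == same_known   # True counts as 1
        pairs += 1
    return agree / pairs


# Hypothetical benchmark labels and detected communities for six nodes.
known = {"a": "x", "b": "x", "c": "x", "d": "y", "e": "y", "f": "y"}
found = {"a": 0, "b": 0, "c": 1, "d": 1, "e": 1, "f": 1}

score = rand_index(found, known)
```

Because the comparison is pair-based, the community labels themselves do not need to match, only the groupings. In this example, 10 of the 15 node pairs agree, giving a score of about 0.67.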

# Summary

Community detection algorithms are useful for understanding the way that nodes are grouped together in a graph. 
    
    社区检测算法有助于理解节点在图中分组的方式。
    
In this chapter, we started by learning about the Triangle Count and Clustering Coefficient algorithms. We then moved on to two deterministic community detection algorithms: Strongly Connected Components and Connected Components. These algorithms have strict definitions of what constitutes a community and are very useful for getting a feel for the graph structure early in the graph analytics pipeline. 
    
    在本章中，我们首先学习三角形计数和聚类系数算法。然后，我们转向两种确定性的社区检测算法:强连接组件和连接组件。这些算法对构成社区的内容有严格的定义，对于在图形分析管道的早期了解图形结构非常有用。
    
We finished with Label Propagation and Louvain, two nondeterministic algorithms which are better able to detect finer-grained communities. Louvain also showed us a hierarchy of communities at different scales. 

    最后我们介绍了标签传播和Louvain，这两种非确定性算法能够更好地检测更细粒度的社区。Louvain还向我们展示了不同尺度上的社区层次结构。
    
In the next chapter, we'll take a much larger dataset and learn how to combine the algorithms to gain even more insight into our connected data.

    在下一章中，我们将使用一个更大的数据集，并学习如何将这些算法组合在一起，以获得对连接数据的更深入的了解。