Centrality algorithms are used to understand the roles of particular nodes in a graph and their impact on that network. They're useful because they identify the most important nodes and help us understand group dynamics such as credibility, accessibility, the speed at which things spread, and bridges between groups. Although many of these algorithms were invented for social network analysis, they have since found uses in a variety of industries and fields.

We'll cover the following algorithms: 

    Degree Centrality as a baseline metric of connectedness 

    Closeness Centrality for measuring how central a node is to the group, including two variations for disconnected groups 

    Betweenness Centrality for finding control points, including an alternative for approximation 

    PageRank for understanding the overall influence, including a popular option for personalization

Different centrality algorithms can produce significantly different results based on what they were created to measure. When you see suboptimal answers, it's best to check that the algorithm you've used is aligned with its intended purpose.

We'll explain how these algorithms work and show examples in Spark and Neo4j. Where an algorithm is unavailable on one platform or where the differences are unimportant, we'll provide just one platform example.


Figure 5-1 shows the differences between the types of questions centrality algorithms can answer, and Table 5-1 is a quick reference for what each algorithm calculates with an example use.

![21.png](./picture/21.png)
![22.png](./picture/22.png)

Several of the centrality algorithms calculate shortest paths between every pair of nodes. This works well for small- to medium-sized graphs but for large graphs can be computationally prohibitive. To avoid long runtimes on larger graphs, some algorithms (for example, Betweenness Centrality) have approximating versions.

First, we'll describe the dataset for our examples and walk through importing the data into Apache Spark and Neo4j. Each algorithm is covered in the order listed in Table 5-1. We'll start with a short description of the algorithm and, when warranted, information on how it operates. Variations of algorithms already covered will include less detail. Most sections also include guidance on when to use the related algorithm. We demonstrate example code using a sample dataset at the end of each section. Let's get started!

# Example Graph Data: The Social Graph

Centrality algorithms are relevant to all graphs, but social networks provide a very relatable way to think about dynamic influence and the flow of information. The examples in this chapter are run against a small Twitter-like graph. You can download the nodes and relationships files we'll use to create our graph from the book's GitHub repository.

![23.png](./picture/23.png)
![24.png](./picture/24.png)
![25.png](./picture/25.png)

We have one larger set of users with connections between them and a smaller set with no connections to that larger group.

## Importing the Data into Neo4j

We’ll load the data for Neo4j. The following query imports nodes:

And this query imports relationships:

# Degree Centrality

# Closeness Centrality

Neo4j's implementation of Closeness Centrality uses the following formula:
    
![26.png](./picture/26.png)

where:

    u is a node. 

    n is the number of nodes in the same component (subgraph or group) as u.

    d(u,v) is the shortest-path distance between another node v and u.
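
To make the definitions above concrete, here is a minimal Python sketch of a per-component closeness score on an unweighted graph. It assumes the formula in the figure is the normalized form, (n - 1) divided by the sum of shortest-path distances from u. The toy graph and the names `bfs_distances` and `closeness` are hypothetical and are not part of Neo4j.

```python
from collections import deque


def bfs_distances(graph, source):
    """Return hop distances from source to every node reachable from it."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def closeness(graph, u):
    """(n - 1) / sum of distances, where n counts u plus the nodes reachable from u."""
    dist = bfs_distances(graph, u)
    total = sum(dist.values())
    return (len(dist) - 1) / total if total > 0 else 0.0


# Hypothetical undirected toy graph: a path a-b-c plus a separate pair x-y.
graph = {
    "a": ["b"], "b": ["a", "c"], "c": ["b"],
    "x": ["y"], "y": ["x"],
}
scores = {node: closeness(graph, node) for node in graph}
```

Because n is taken per component, the isolated pair x-y scores as high as the middle of the larger path, which is the limitation the later variations address.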

## When Should I Use Closeness Centrality?

Apply Closeness Centrality when you need to know which nodes disseminate things the fastest. Using weighted relationships can be especially helpful in evaluating interaction speeds in communication and behavioral analyses.

Example use cases include: 

     Uncovering individuals in very favorable positions to control and acquire vital information and resources within an organization. One such study is "Mapping Networks of Terrorist Cells", by V. E. Krebs.

     As a heuristic for estimating arrival time in telecommunications and package delivery, where content flows through the shortest paths to a predefined target. It is also used to shed light on propagation through all shortest paths simultaneously, such as infections spreading through a local community. Find more details in "Centrality and Network Flow", by S. P. Borgatti.

     Evaluating the importance of words in a document, based on a graph-based keyphrase extraction process. This process is described by F. Boudin in "A Comparison of Centrality Measures for Graph-Based Keyphrase Extraction".

Closeness Centrality works best on connected graphs. When the original formula is applied to an unconnected graph, we end up with an infinite distance between two nodes where there is no path between them. This means that we'll end up with an infinite closeness centrality score when we sum up all the distances from that node. To avoid this issue, a variation on the original formula will be shown after the next example.

## Closeness Centrality with Neo4j

A call to the following procedure will calculate the closeness centrality for each of the nodes in our graph:

**[Download algo](https://github.com/neo4j-contrib/neo4j-graph-algorithms/releases)**

**[Install algo](https://blog.csdn.net/qq_38737992/article/details/89036406)**

Running this procedure gives the following output:
![27.png](./picture/27.png)

We get the same results as with the Spark algorithm, but, as before, the score represents their closeness to others within their subgraph but not the entire graph.

In the strict interpretation of the Closeness Centrality algorithm, all the nodes in our graph would have a score of ∞ because every node has at least one other node that it's unable to reach. However, it's usually more useful to implement the score per component.

Ideally we'd like to get an indication of closeness across the whole graph, and in the next two sections we'll learn about a few variations of the Closeness Centrality algorithm that do this.

## Closeness Centrality Variation: Wasserman and Faust

Stanley Wasserman and Katherine Faust came up with an improved formula for calculating closeness for graphs with multiple subgraphs without connections between those groups. Details on their formula are in their book, Social Network Analysis: Methods and Applications. The result of this formula is a ratio of the fraction of nodes in the group that are reachable to the average distance from the reachable nodes. The formula is as follows:

![28.png](./picture/28.png)

where:

    u is a node. 

    N is the total node count. 

    n is the number of nodes in the same component as u. 

     d(u,v) is the shortest-path distance between another node v and u.
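
The ratio described above can be sketched in a few lines of Python. This is an illustration under assumptions rather than Neo4j's implementation: it reads the formula as (n - 1)/(N - 1) multiplied by (n - 1) over the sum of distances, and the toy graph and function names are hypothetical.

```python
from collections import deque


def bfs_distances(graph, source):
    """Return hop distances from source to every node reachable from it."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def wasserman_faust(graph, u):
    """Fraction of reachable nodes divided by the average distance to them."""
    dist = bfs_distances(graph, u)
    reachable = len(dist) - 1          # n - 1: nodes in u's component, excluding u
    total = sum(dist.values())         # sum of d(u, v) over reachable v
    if total == 0:
        return 0.0
    return (reachable / (len(graph) - 1)) * (reachable / total)


# Hypothetical undirected toy graph: a path a-b-c plus a separate pair x-y (N = 5).
graph = {
    "a": ["b"], "b": ["a", "c"], "c": ["b"],
    "x": ["y"], "y": ["x"],
}
wf_scores = {node: wasserman_faust(graph, node) for node in graph}
```

On this toy graph the nodes in the small component now score lowest, because they reach only one of the four other nodes.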
     
We can tell the Closeness Centrality procedure to use this formula by passing the parameter `improved: true`.

The following query executes Closeness Centrality using the Wasserman and Faust formula:

![29.png](./picture/29.png)

As Figure 5-6 shows, the results are now more representative of the closeness of nodes to the entire graph. The scores for the members of the smaller subgraph (David, Amy, and James) have been dampened, and they now have the lowest scores of all users. This makes sense as they are the most isolated nodes. This formula is more useful for detecting the importance of a node across the entire graph rather than within its own subgraph.

![30.png](./picture/30.png)

In the next section we'll learn about the Harmonic Centrality algorithm, which achieves similar results using another formula to calculate closeness.

## Closeness Centrality Variation: Harmonic Centrality

Harmonic Centrality (also known as Valued Centrality) is a variant of Closeness Centrality, invented to solve the original problem with unconnected graphs. In "Harmony in a Small World", M. Marchiori and V. Latora proposed this concept as a practical representation of an average shortest path.

When calculating the closeness score for each node, rather than summing the distances of a node to all other nodes, it sums the inverse of those distances. This means that infinite values become irrelevant.

The raw harmonic centrality for a node is calculated using the following formula:

![31.png](./picture/31.png)
where:

     u is a node.

     n is the number of nodes in the graph.

     d(u,v) is the shortest-path distance between another node v and u.
        
As with closeness centrality, we can also calculate a normalized harmonic centrality with the following formula:
![32.png](./picture/32.png)
In this formula, ∞ values are handled cleanly.
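
A short Python sketch shows why infinite distances stop being a problem. It assumes the raw score is the sum of 1/d(u,v) over all other nodes and the normalized score divides that sum by n - 1. Unreachable nodes would contribute 1/∞ = 0, so they are simply skipped. The toy graph and function names are hypothetical.

```python
from collections import deque


def bfs_distances(graph, source):
    """Return hop distances from source to every node reachable from it."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def harmonic(graph, u, normalized=True):
    """Sum of inverse distances from u; unreachable nodes add nothing."""
    dist = bfs_distances(graph, u)
    raw = sum(1.0 / d for node, d in dist.items() if node != u)
    return raw / (len(graph) - 1) if normalized else raw


# Hypothetical undirected toy graph: a path a-b-c plus a separate pair x-y (n = 5).
graph = {
    "a": ["b"], "b": ["a", "c"], "c": ["b"],
    "x": ["y"], "y": ["x"],
}
raw_scores = {node: harmonic(graph, node, normalized=False) for node in graph}
norm_scores = {node: harmonic(graph, node) for node in graph}
```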


### Harmonic Centrality with Neo4j

The following query executes the Harmonic Centrality algorithm:
    
Running this procedure gives the following result:

 ![33.png](./picture/33.png)

The results from this algorithm differ from those of the original Closeness Centrality algorithm but are similar to those from the Wasserman and Faust improvement. Either algorithm can be used when working with graphs with more than one connected component.

# Betweenness Centrality

Sometimes the most important cog in the system is not the one with the most overt power or the highest status. Sometimes it's the middlemen that connect groups or the brokers who have the most control over resources or the flow of information. Betweenness Centrality is a way of detecting the amount of influence a node has over the flow of information or resources in a graph. It is typically used to find nodes that serve as a bridge from one part of a graph to another.

The Betweenness Centrality algorithm first calculates the shortest (weighted) path between every pair of nodes in a connected graph. Each node receives a score, based on the number of these shortest paths that pass through the node. The more shortest paths that a node lies on, the higher its score.

Betweenness Centrality was considered one of the three distinct intuitive conceptions of centrality when it was introduced by Linton C. Freeman in his 1977 paper, "A Set of Measures of Centrality Based on Betweenness".

## Bridges and control points

A bridge in a network can be a node or a relationship. In a very simple graph, you can find them by looking for the node or relationship that, if removed, would cause a section of the graph to become disconnected. However, as that's not practical in a typical graph, we use a Betweenness Centrality algorithm. We can also measure the betweenness of a cluster by treating the group as a node.



A node is considered pivotal for two other nodes if it lies on every shortest path between those nodes, as shown in Figure 5-7.

![34.png](./picture/34.png)

Pivotal nodes play an important role in connecting other nodes. If you remove a pivotal node, the new shortest path for the original node pairs will be longer or more costly. This can be a consideration for evaluating single points of vulnerability.

## Calculating betweenness centrality

The betweenness centrality of a node is calculated by adding the results of the following formula for all shortest paths:
![35.png](./picture/35.png)

where:

     u is a node. 

    p is the total number of shortest paths between nodes s and t.

    p(u) is the number of shortest paths between nodes s and t that pass through node u.


Figure 5-8 illustrates the steps for working out betweenness centrality.

![36.png](./picture/36.png)

Here's the procedure:

    1. For each node, find the shortest paths that go through it.

        a. B, C, and E have no shortest paths passing through them and are assigned a value of 0.

    2. For each shortest path in step 1, calculate its percentage of the total possible shortest paths for that pair. 

    3. Add together all the values in step 2 to find a node's betweenness centrality score. The table in Figure 5-8 illustrates steps 2 and 3 for node D.

    4. Repeat the process for each node.
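
The steps above can be sketched in Python. Rather than enumerating every shortest path explicitly, this sketch uses Brandes's accumulation technique, which yields the same sum of p(u)/p fractions. It assumes an unweighted, undirected graph; the toy graph and the function name are hypothetical, and it is not the Neo4j implementation.

```python
from collections import deque


def betweenness(graph):
    """Sum, over node pairs, of the fraction of shortest paths passing through each node."""
    score = {v: 0.0 for v in graph}
    for s in graph:
        # Breadth-first search from s, counting shortest paths (sigma)
        # and remembering each node's predecessors on those paths.
        order = []
        preds = {v: [] for v in graph}
        sigma = {v: 0 for v in graph}
        sigma[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Walk back from the farthest nodes, accumulating path fractions.
        delta = {v: 0.0 for v in graph}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Each unordered pair was visited from both ends, so halve the totals.
    return {v: total / 2 for v, total in score.items()}


# Hypothetical undirected toy graph: a diamond a-b-d, a-c-d with a tail d-e.
graph = {
    "a": ["b", "c"],
    "b": ["a", "d"],
    "c": ["a", "d"],
    "d": ["b", "c", "e"],
    "e": ["d"],
}
bc_scores = betweenness(graph)
```

In this toy graph, d sits on every shortest path to e and on half of the paths between b and c, so it receives the highest score.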

## When Should I Use Betweenness Centrality?

Betweenness Centrality applies to a wide range of problems in real-world networks. We use it to find bottlenecks, control points, and vulnerabilities.

Example use cases include:

 Identifying influencers in various organizations. Powerful individuals are not necessarily in management positions, but can be found in brokerage positions using Betweenness Centrality. Removal of such influencers can seriously destabilize the organization. This might be considered a welcome disruption by law enforcement if the organization is criminal, or could be a disaster if a business loses key staff it underestimated. More details are found in "Brokerage Qualifications in Ringing Operations", by C. Morselli and J. Roy.

Uncovering key transfer points in networks such as electrical grids. Counterintuitively, removal of specific bridges can actually improve overall robustness by islanding disturbances. Research details are included in "Robustness of the European Power Grids Under Intentional Attack", by R. Solé et al.

Helping microbloggers spread their reach on Twitter, with a recommendation engine for targeting influencers. This approach is described in a paper by S. Wu et al., "Making Recommendations in a Microblog to Improve the Impact of a Focal User".

Betweenness Centrality makes the assumption that all communication between nodes happens along the shortest path and with the same frequency, which isn't always the case in real life. Therefore, it doesn't give us a perfect view of the most influential nodes in a graph, but rather a good representation. Mark Newman explains this in more detail in Networks: An Introduction (Oxford University Press, p. 186).

## Betweenness Centrality with Neo4j

Spark doesn't have a built-in algorithm for Betweenness Centrality, so we'll demonstrate this algorithm using Neo4j. A call to the following procedure will calculate the betweenness centrality for each of the nodes in our graph:

Running this procedure gives the following result:
    ![38.png](./picture/38.png)

As we can see in Figure 5-9, Alice is the main broker in this network, but Mark and Doug aren't far behind. In the smaller subgraph all shortest paths go through David, so he is important for information flow among those nodes.


![39.png](./picture/39.png)

For large graphs, exact centrality computation isn't practical. The fastest known algorithm for exactly computing betweenness of all the nodes has a runtime proportional to the product of the number of nodes and the number of relationships.

We may want to filter down to a subgraph first or use an approximation algorithm (described in the next section) that works with a subset of nodes.

We can join our two disconnected components together by introducing a new user called Jason, who follows and is followed by people from both groups of users:

If we rerun the algorithm we’ll see this output:

![40.png](./picture/40.png)

Jason has the highest score because communication between the two sets of users will pass through him. Jason can be said to act as a local bridge between the two sets of users, as illustrated in Figure 5-10.

![41.png](./picture/41.png)

Before we move on to the next section, let's reset our graph by deleting Jason and his relationships:

## Betweenness Centrality Variation: Randomized-Approximate Brandes

Recall that calculating the exact betweenness centrality on large graphs can be very expensive. We could therefore choose to use an approximation algorithm that runs much faster but still provides useful (albeit imprecise) information.

The Randomized-Approximate Brandes (RA-Brandes for short) algorithm is the best-known algorithm for calculating an approximate score for betweenness centrality. Rather than calculating the shortest path between every pair of nodes, the RA-Brandes algorithm considers only a subset of nodes. Two common strategies for selecting the subset of nodes are:

**Random** 

Nodes are selected uniformly, at random, with a defined probability of selection. The default probability is `log10(N) / e^2`. If the probability is 1, the algorithm works the same way as the normal Betweenness Centrality algorithm, where all nodes are loaded.

**Degree** 

Nodes are selected randomly, but those whose degree is lower than the mean are automatically excluded (i.e., only nodes with a lot of relationships have a chance of being visited). 

As a further optimization, you could limit the depth used by the Shortest Path algorithm, which will then provide a subset of all the shortest paths.
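
The random selection strategy can be illustrated with a small Python sketch. It is a simplified stand-in, not the RA-Brandes implementation in Neo4j: each node is kept as a source with the given probability, and a Brandes-style pass runs from the selected sources only. Scores count ordered pairs and are left unscaled, which is a choice made for this sketch. All names and the toy graph are hypothetical.

```python
import random
from collections import deque


def single_source_dependencies(graph, s):
    """Contribution of shortest paths that start at s to every other node's score."""
    order = []
    preds = {v: [] for v in graph}
    sigma = {v: 0 for v in graph}
    sigma[s] = 1
    dist = {s: 0}
    queue = deque([s])
    while queue:
        v = queue.popleft()
        order.append(v)
        for w in graph[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
            if dist[w] == dist[v] + 1:
                sigma[w] += sigma[v]
                preds[w].append(v)
    delta = {v: 0.0 for v in graph}
    for w in reversed(order):
        for v in preds[w]:
            delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
    delta[s] = 0.0  # a source is not "between" itself and other nodes
    return delta


def approx_betweenness(graph, probability, seed=None):
    """Accumulate dependencies from a random subset of source nodes."""
    rng = random.Random(seed)
    sources = [v for v in graph if rng.random() < probability]
    score = {v: 0.0 for v in graph}
    for s in sources:
        for v, d in single_source_dependencies(graph, s).items():
            score[v] += d
    return score, sources


# Hypothetical undirected toy graph: a diamond a-b-d, a-c-d with a tail d-e.
graph = {
    "a": ["b", "c"],
    "b": ["a", "d"],
    "c": ["a", "d"],
    "d": ["b", "c", "e"],
    "e": ["d"],
}
exact, all_sources = approx_betweenness(graph, probability=1.0, seed=42)
sampled, some_sources = approx_betweenness(graph, probability=0.5, seed=42)
```

With a probability of 1 every node is a source and the result matches the exact computation. With a lower probability the scores are partial sums, so they can only be lower than or equal to the exact values.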

### Approximation of Betweenness Centrality with Neo4j

The following query executes the RA-Brandes algorithm using the random selection method:
    
Running this procedure gives the following result:
 ![42.png](./picture/42.png)   

Our top influencers are similar to before, although Mark now has a higher ranking than Doug.

Due to the random nature of this algorithm, we may see different results each time that we run it. On larger graphs this randomness will have less of an impact than it does on our small sample graph.


# PageRank

PageRank is the best known of the centrality algorithms. It measures the transitive (or directional) influence of nodes. All the other centrality algorithms we discuss measure the direct influence of a node, whereas PageRank considers the influence of a node's neighbors, and their neighbors. For example, having a few very powerful friends can make you more influential than having a lot of less powerful friends. PageRank is computed either by iteratively distributing one node's rank over its neighbors or by randomly traversing the graph and counting the frequency with which each node is hit during these walks.

PageRank is named after Google cofounder Larry Page, who created it to rank websites in Google's search results. The basic assumption is that a page with more incoming and more influential incoming links is more likely a credible source. PageRank measures the number and quality of incoming relationships to a node to determine an estimation of how important that node is. Nodes with more sway over a network are presumed to have more incoming relationships from other influential nodes.

## Influence

The intuition behind influence is that relationships to more important nodes contribute more to the influence of the node in question than equivalent connections to less important nodes. Measuring influence usually involves scoring nodes, often with weighted relationships, and then updating the scores over many iterations. Sometimes all nodes are scored, and sometimes a random selection is used as a representative distribution.

Keep in mind that centrality measures represent the importance of a node in comparison to other nodes. Centrality is a ranking of the potential impact of nodes, not a measure of actual impact. For example, you might identify the two people with the highest centrality in a network, but perhaps policies or cultural norms are in play that actually shift influence to others. Quantifying actual impact is an active research area to develop additional influence metrics.

## The PageRank Formula

PageRank is defined in the original Google paper as follows:

 ![43.png](./picture/43.png)
    
where: 

    We assume that a page u has citations from pages T1 to Tn. 

     d is a damping factor which is set between 0 and 1. It is usually set to 0.85. You can think of this as the probability that a user will continue clicking. This helps minimize rank sink, explained in the next section. 

    1-d is the probability that a node is reached directly without following any relationships. 

    C(Tn) is defined as the out-degree of a node T.  
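
As an illustration of the definitions above, the following Python sketch iterates the formula in its original, non-normalized form: PR(u) = (1 - d) + d × (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)). The starting value of 1.0 per node, the stopping rule, and the toy graph are assumptions of this sketch, and every node in the toy graph has at least one outgoing link.

```python
def pagerank(out_links, d=0.85, max_iterations=200, tolerance=1e-10):
    """Iterate PR(u) = (1 - d) + d * sum(PR(T) / C(T)) over the pages T linking to u."""
    # Build the reverse index: which pages cite each page.
    in_links = {u: [] for u in out_links}
    for t, targets in out_links.items():
        for u in targets:
            in_links[u].append(t)

    rank = {u: 1.0 for u in out_links}  # assumed initial value
    for _ in range(max_iterations):
        new_rank = {
            u: (1 - d) + d * sum(rank[t] / len(out_links[t]) for t in in_links[u])
            for u in out_links
        }
        change = max(abs(new_rank[u] - rank[u]) for u in out_links)
        rank = new_rank
        if change < tolerance:  # scores have converged
            break
    return rank


# Hypothetical directed toy graph: a links to b and c, b links to c, c links to a.
out_links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(out_links)
```

Here c collects links from both a and b and ends up with the highest rank, while a benefits from being the only page c links to.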
   
Figure 5-11 walks through a small example of how PageRank will continue to update the rank of a node until it converges or meets the set number of iterations.

 ![44.png](./picture/44.png)

## Iteration, Random Surfers, and Rank Sinks

PageRank is an iterative algorithm that runs either until scores converge or until a set number of iterations is reached.

Conceptually, PageRank assumes there is a web surfer visiting pages by following links or by using a random URL. A damping factor _d_ defines the probability that the next click will be through a link, so 1 - _d_ is the probability that a surfer will become bored and randomly switch to another page. A PageRank score represents the likelihood that a page is visited through an incoming link and not randomly.


A node, or group of nodes, without outgoing relationships (also called a dangling node) can monopolize the PageRank score by refusing to share. This is known as a rank sink. You can imagine this as a surfer that gets stuck on a page, or a subset of pages, with no way out. Another difficulty is created by nodes that point only to each other in a group. Circular references cause an increase in their ranks as the surfer bounces back and forth among the nodes. These situations are portrayed in Figure 5-12.

![45.png](./picture/45.png)

There are two strategies used to avoid rank sinks. First, when a node is reached that has no outgoing relationships, PageRank assumes outgoing relationships to all nodes. Traversing these invisible links is sometimes called teleportation. Second, the damping factor provides another opportunity to avoid sinks by introducing a probability for direct link versus random node visitation. When you set d to 0.85, a completely random node is visited 15% of the time.
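
Both strategies can be sketched together in Python. This sketch uses the probability-normalized variant of PageRank, in which scores sum to 1, rather than the (1 - d) form of the original formula. That choice, the fixed iteration count, and the toy graph are assumptions made for illustration.

```python
def pagerank_with_teleport(out_links, d=0.85, iterations=100):
    """PageRank where dangling nodes behave as if they linked to every node."""
    nodes = list(out_links)
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iterations):
        # Strategy 1: rank held by nodes without outgoing links is spread
        # evenly over all nodes (teleportation).
        dangling = sum(rank[u] for u in nodes if not out_links[u])
        # Strategy 2: with probability 1 - d the surfer jumps to a random node.
        new_rank = {u: (1 - d) / n + d * dangling / n for u in nodes}
        for t in nodes:
            for u in out_links[t]:
                new_rank[u] += d * rank[t] / len(out_links[t])
        rank = new_rank
    return rank


# Hypothetical directed toy graph: a -> b -> c, where c is a dangling node.
out_links = {"a": ["b"], "b": ["c"], "c": []}
teleport_ranks = pagerank_with_teleport(out_links)
```

Without the dangling-node redistribution, the rank flowing into c would leave the system on every iteration. With it, the total stays at 1 and c still ranks highest.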

Although the original formula recommends a damping factor of 0.85, its initial use was on the World Wide Web with a power-law distribution of links (most pages have very few links and a few pages have many). Lowering the damping factor decreases the likelihood of following long relationship paths before taking a random jump. In turn, this increases the contribution of a node's immediate predecessors to its score and rank.

If you see unexpected results from PageRank, it is worth doing some exploratory analysis of the graph to see if any of these problems are the cause. Read Ian Rogers's article, "The Google PageRank Algorithm and How It Works", to learn more.

## When Should I Use PageRank?

PageRank is now used in many domains outside web indexing. Use this algorithm whenever you're looking for broad influence over a network. For instance, if you're looking to target a gene that has the highest overall impact on a biological function, it may not be the most connected one. It may, in fact, be the gene with the most relationships with other, more significant functions.

Example use cases include:
    
    Presenting users with recommendations of other accounts that they may wish to follow (Twitter uses Personalized PageRank for this). The algorithm is run over a graph that contains shared interests and common connections. The approach is described in more detail in the paper "WTF: The Who to Follow Service at Twitter", by P. Gupta et al.

    Predicting traffic flow and human movement in public spaces or streets. The algorithm is run over a graph of road intersections, where the PageRank score reflects the tendency of people to park, or end their journey, on each street. This is described in more detail in "Self-Organized Natural Roads for Predicting Traffic Flow: A Sensitivity Study", a paper by B. Jiang, S. Zhao, and J. Yin.

    As part of anomaly and fraud detection systems in the healthcare and insurance industries. PageRank helps reveal doctors or providers that are behaving in an unusual manner, and the scores are then fed into a machine learning algorithm.
    
David Gleich describes many more uses for the algorithm in his paper, "PageRank Beyond the Web".

## PageRank with Neo4j

We can also run PageRank in Neo4j. A call to the following procedure will calculate the PageRank for each of the nodes in our graph:

Running this procedure gives the following result:
    ![46.png](./picture/46.png)

As with the Spark example, Doug is the most influential user, and Mark follows closely after as the only user that Doug follows. We can see the importance of the nodes relative to each other in Figure 5-13.

PageRank implementations vary, so they can produce different scoring even when the ordering is the same. Neo4j initializes nodes using a value of 1 minus the damping factor, whereas Spark uses a value of 1. In this case, the relative rankings (the goal of PageRank) are identical but the underlying score values used to reach those results are different.

![47.png](./picture/47.png)

As with our Spark example, the relationships in the graph on which we ran the PageRank algorithm don't have weights, so each relationship is considered equal. Relationship weights can be considered by including the `weightProperty` property in the config passed to the PageRank procedure. For example, if relationships have a property `weight` containing weights, we would pass the following config to the procedure: `weightProperty: "weight"`.

# Summary

Centrality algorithms are an excellent tool for identifying influencers in a network. In this chapter we've learned about the prototypical centrality algorithms: Degree Centrality, Closeness Centrality, Betweenness Centrality, and PageRank. We've also covered several variations to deal with issues such as long runtimes and isolated components, as well as options for alternative uses.
        

There are many wide-ranging uses for centrality algorithms, and we encourage their exploration for a variety of analyses. You can apply what we've learned to locate optimal touch points for disseminating information, find the hidden brokers that control the flow of resources, and uncover the indirect power players lurking in the shadows.

Next, we’ll turn to community detection algorithms that look at groups and partitions.
