# Install GREG

## Confirm installation of Java

Install OpenJDK 8 or Oracle Java 8 on your Windows system before you install Neo4j.

**Note:** Java 8 is recommended for Neo4j 3.0.x. Version 7 is recommended for releases prior to 2.3.0.

If Java is not installed on your computer, refer to the [Install Java tutorial](InstallJava.ipynb).

## Install neo4j

### Download neo4j

Website:  https://neo4j.com/download-center/#releases

![79.png](./picture/79.png)

Here we select the `neo4j 3.5.6(zip)` version for Windows.

### Unzip the file 

Find the zip file you just downloaded, right-click it, and choose Extract All.

Rename the folder `neo4j-community-3.5.6` to `GREG`.

Here we put the GREG folder in `F:/June28`.

Open the file

`F:\June28\GREG\bin\neo4j.ps1`

and find this line (on line 27):

`Import-Module "$PSScriptRoot\Neo4j-Management.psd1"`

and change it to

`Import-Module "F:\June28\GREG\bin\Neo4j-Management.psd1"`

Save the file.

**You can now use Neo4j, or continue below to install the GREG database.**



### Import GREG database

The folder `bigbin` is our GREG database. Copy it to the `F:\June28\GREG\data\databases` directory.

After that, make some changes in the `F:\June28\GREG\conf\neo4j.conf` file.

Find

`#dbms.active_database=graph.db` 

and change it to

`dbms.active_database=bigbin`

Save the file.

## To run GREG as a console application

**Open Command Prompt and input:**

**Move to the `GREG\bin` directory**

**Start GREG (Neo4j)**

![30.png](./picture/30.png)

For additional commands and to learn about the Windows PowerShell module included in the Zip file, see the Windows installation documentation.

**Open Neo4j Browser**


Visit http://localhost:7474 in your web browser.
Connect using the username `neo4j` with default password `neo4j`.
![80.png](./picture/80.png)

# Install algo and apoc

If we are using a standalone Neo4j Server, the libraries need to be installed and configured manually.

1. Download `graph-algorithms-algo-[version].jar` from [the matching release](https://github.com/neo4j-contrib/neo4j-graph-algorithms/releases) and copy it into the `$NEO4J_HOME/plugins` directory. We can work out which release to download by referring to the versions file.

   Download `apoc-[version]-all.jar` from [the matching release](https://github.com/neo4j-contrib/neo4j-apoc-procedures/releases/) and copy it into the `$NEO4J_HOME/plugins` directory. We can work out which release to download by referring to the versions file.

   **Note: You must download the release that matches your Neo4j version. Otherwise, Neo4j will not start.**

2. Add the following line to your `$NEO4J_HOME/conf/neo4j.conf` file: `dbms.security.procedures.unrestricted=algo.*,apoc.*`

   We need to give the libraries unrestricted access because the algorithms use the lower level Kernel API to read from, and to write to Neo4j.

3. Restart Neo4j

4. Verify the installation

   Once we've installed the library, run the query `CALL algo.list()` to see a list of all the algorithms.

# Pathfinding

In this chapter, we discuss only Shortest Path and All Pairs Shortest Path. The table below gives an overview of these pathfinding and graph search algorithms.

Algorithm type	|What it does	|Example use
-----------------|----------|----------
Shortest Path|	Calculates the shortest path between a pair of nodes	|Finding driving directions between two locations
All Pairs Shortest Path	|Calculates the shortest path between all pairs of nodes in the graph	|Evaluating alternate routes around a traffic jam

## Shortest path

The Shortest Path algorithm calculates the shortest (weighted) path between a pair of nodes. 

The Shortest Path algorithm operates by first finding the lowest-weight relationship from the start node to directly connected nodes. It keeps track of those weights and moves to the closest node. It then performs the same calculation, but now as a cumulative total from the start node. The algorithm continues to do this, evaluating a wave of cumulative weights and always choosing the lowest weighted cumulative path to advance along, until it reaches the destination node.
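The wave of cumulative weights described above can be sketched in plain Python. This is a minimal illustration of the idea (Dijkstra's algorithm with a priority queue), not the code that Neo4j runs; the function name `shortest_path` and the toy graph are assumptions made up for this example.

```python
import heapq

def shortest_path(graph, start, end):
    """Return (cost, path) for the lowest-cost path, or None if unreachable.

    graph maps each node to a dict {neighbor: weight} (hypothetical toy format).
    """
    best = {start: 0}        # lowest cumulative cost found so far per node
    previous = {}            # predecessor of each node on its best known path
    queue = [(0, start)]     # (cumulative cost, node), cheapest entry first
    visited = set()
    while queue:
        cost, node = heapq.heappop(queue)
        if node in visited:
            continue         # stale queue entry, a cheaper one was handled
        visited.add(node)
        if node == end:
            break
        for neighbor, weight in graph.get(node, {}).items():
            new_cost = cost + weight
            if new_cost < best.get(neighbor, float("inf")):
                best[neighbor] = new_cost
                previous[neighbor] = node
                heapq.heappush(queue, (new_cost, neighbor))
    if end not in best:
        return None
    path = [end]
    while path[-1] != start:
        path.append(previous[path[-1]])
    return best[end], path[::-1]

# Toy directed graph, invented for illustration only.
toy = {
    "A": {"B": 4, "C": 1},
    "B": {"D": 1},
    "C": {"B": 2, "D": 5},
    "D": {},
}
print(shortest_path(toy, "A", "D"))
```

The direct relationship from A to B costs 4, but the algorithm prefers the cumulative route through C (1 + 2 = 3) before moving on to D.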

Distance is often used within an algorithm as the name of the relationship property that indicates the cost of traversing between a pair of nodes. It's not required that this be an actual physical measure of distance. Hop is commonly used to express the number of relationships between two nodes. You may see some of these terms combined, as in "It's a five-hop distance to London" or "That's the lowest cost for the distance."

In our example, we set the relationship property to null, because the database has no relationship property that we can use as a weight. We therefore find shortest paths based only on hops (the number of relationships).
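When no weight property is available, every relationship counts as one hop and a breadth-first search is enough. The sketch below shows that idea under stated assumptions: the function name `hop_distance` and the node names `g1` to `g5` are placeholders, not genes from the GREG database.

```python
from collections import deque

def hop_distance(neighbors, start, end):
    """Number of relationships on a shortest path, or None if unreachable."""
    hops = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == end:
            return hops[node]
        for nxt in neighbors.get(node, ()):
            if nxt not in hops:          # first visit is the shortest one
                hops[nxt] = hops[node] + 1
                queue.append(nxt)
    return None

# Hypothetical directed graph with placeholder node names.
links = {"g1": ["g2", "g3"], "g2": ["g4"], "g3": ["g4"], "g4": ["g5"]}
print(hop_distance(links, "g1", "g5"))
```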

### algo.shortestPath.stream

The `algo.shortestPath.stream` procedure calculates the shortest path between a pair of nodes. Its parameters and results are as follows:

#### Parameters

Name	|Type	|Default	|Optional	|Description
-----------|--------|----------------|-----------------|----------------
startNode|node|null|no|The start node
endNode|node|null|no|The end node
weightProperty|string|null|yes|The property name that contains weight. If null, treats the graph as unweighted. Must be numeric.
nodeQuery|string|null|yes|The label to load from the graph. If null, load all nodes
relationshipQuery|string|null|yes|The relationship-type to load from the graph. If null, load all relationships
defaultValue|float|null|yes|The default value of the weight in case it is missing or invalid
direction|string|outgoing|yes|The relationship direction to load from the graph. If 'both', treats the relationships as undirected

#### Results

Name	|Type	|Description
-------|----------|------------
nodeId|int|Node ID
cost|int|The cost it takes to get from start node to specific node

### Example

This example finds the shortest path between the TF genes "BHLHE40" and "ATF1":

Result
![1.png](./picture/1.png)

- Here the cost is the cumulative total for relationships (or hops).

- In our example, we set the relationship property as `null`.

## All Pairs Shortest Path

The All Pairs Shortest Path (APSP) algorithm calculates the shortest (weighted) path between all pairs of nodes. It's more efficient than running the Single Source Shortest Path algorithm for every pair of nodes in the graph.

The calculation for APSP is easiest to understand when you follow a sequence of operations. The diagram below walks through the steps for node A.

![11.png](./picture/11.png)

-- This picture is from [*Graph Algorithms*](https://neo4j.com/lp/book-graph-algorithms-thanks/?aliId=eyJpIjoiT1lBd0tIeEh6Y2N6ajZCYiIsInQiOiJPemxyM1BhUG9uczhBdzFYRUwrM3Z3PT0ifQ%253D%253D).

Initially the algorithm assumes an infinite distance to all nodes. When a start node is selected, then the distance to that node is set to 0. The calculation then proceeds as follows:


1. From start node A we evaluate the cost of moving to the nodes we can reach and update those values. Looking for the smallest value, we have a choice of B (cost of 3) or C (cost of 1). C is selected for the next phase of traversal.

2. Now from node C, the algorithm updates the cumulative distances from A to nodes that can be reached directly from C. Values are only updated when a lower cost has been found:

   `A=0, B=3, C=1, D=8, E=∞`

3. Then B is selected as the next closest node that hasn't already been visited. It has relationships to nodes A, D, and E. The algorithm works out the distance to those nodes by summing the distance from A to B with the distance from B to each of those nodes. Note that the lowest cost from the start node A to the current node is always preserved as a sunk cost. The distance (d) calculation results:

   `d(A,A) = d(A,B) + d(B,A) = 3 + 3 = 6`

   `d(A,D) = d(A,B) + d(B,D) = 3 + 3 = 6`

   `d(A,E) = d(A,B) + d(B,E) = 3 + 1 = 4`

   In this step the distance from node A to B and back to A, shown as d(A,A) = 6, is greater than the shortest distance already computed (0), so its value is not updated.

   The distances for nodes D (6) and E (4) are less than the previously calculated distances, so their values are updated.
    
4. E is selected next. Only the cumulative total for reaching D (5) is now lower, and therefore it is the only one updated.

5. When D is finally evaluated, there are no new minimum path weights; nothing is updated, and the algorithm terminates.
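The walkthrough can be replayed in Python. The edge weights below are an assumption: they are inferred from the numbers quoted in the steps (A-B 3, A-C 1, C-D 7, B-D 3, B-E 1, E-D 1), and the original figure may contain relationships that the text does not mention. The sketch simply repeats a single-source search from every node, which is one straightforward way to obtain all pairs.

```python
import heapq

def single_source(graph, source):
    """Lowest cumulative cost from source to every reachable node."""
    dist = {source: 0}
    queue = [(0, source)]
    done = set()
    while queue:
        d, node = heapq.heappop(queue)
        if node in done:
            continue
        done.add(node)
        for nxt, weight in graph[node].items():
            if d + weight < dist.get(nxt, float("inf")):
                dist[nxt] = d + weight   # keep only the lower cost
                heapq.heappush(queue, (d + weight, nxt))
    return dist

def all_pairs(graph):
    """Run the single-source search once per node."""
    return {source: single_source(graph, source) for source in graph}

# Undirected weights inferred from the walkthrough text (assumed, see above).
walkthrough = {
    "A": {"B": 3, "C": 1},
    "B": {"A": 3, "D": 3, "E": 1},
    "C": {"A": 1, "D": 7},
    "D": {"B": 3, "C": 7, "E": 1},
    "E": {"B": 1, "D": 1},
}
print(all_pairs(walkthrough)["A"])
```

For start node A this reproduces the final values of the walkthrough: B stays at 3, C at 1, E ends at 4, and D ends at 5 after the update through E.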

**NOTE**

Some pairs of nodes might not be reachable from each other, which means that there is no shortest path between these nodes. The algorithm doesn’t return distances for these pairs of nodes.

Even though the All Pairs Shortest Path algorithm is optimized to run calculations in parallel for each node, this can still add up for a very large graph. Consider using a subgraph if you only need to evaluate paths between a subcategory of nodes.

### algo.allShortestPaths.stream

 - The first parameter to this procedure is the property to use to work out the shortest weighted path. 
 
 - yields a stream of {sourceNodeId, targetNodeId, distance}

#### Parameters

Name	|Type	|Default	|Optional	|Description
-----------|--------|----------------|-----------------|----------------
weightProperty|string|null|yes|The property name that contains weight. If null, treats the graph as unweighted. Must be numeric.
nodeQuery|string|null|yes|The label to load from the graph. If null, load all nodes
relationshipQuery|string|null|yes|The relationship-type to load from the graph. If null, load all relationships
defaultValue|float|null|yes|The default value of the weight in case it is missing or invalid
direction|string|outgoing|yes|The relationship direction to load from the graph. If 'both', treats the relationships as undirected

#### Results

Name	|Type	|Description
-------|----------|------------
sourceNodeId|long|The start node ID
targetNodeId|long|The end node ID
distance|int|The distance it takes to get from start node to specific node

### Example

Now we want to find the shortest paths between all pairs of TF genes in our database.

If we set the weight property to null, the algorithm calculates the unweighted shortest paths between all pairs of nodes.

The following query does this:

Result
![12.png](./picture/12.png)

## Exercises

#### 1. Find the shortest path between the "FOXA1" and "ZNF143" genes. The result should be shown as a graph.

#### 2. Try to find the shortest paths between all pairs of TF genes using only the Bind relationship.

# Centrality

Centrality algorithms are used to understand the roles of particular nodes in a graph and their impact on that network. They are useful because they identify the most important nodes and help us understand group dynamics such as credibility, accessibility, the speed at which things spread, and bridges between groups. 

**In biology, centrality algorithms help us find which genes are important, which ones play a key role in control or regulation, and which genes are critical to a research project.**

Here we introduce Degree Centrality, Closeness Centrality, Betweenness Centrality, and PageRank. The table below summarizes what each of these centrality algorithms does.

Algorithm type	|What it does	|Example use
---------|------------|-------
Degree Centrality	|Measures the number of relationships a node has|Estimating a person's popularity by looking at their in-degree and using their out-degree to estimate gregariousness
Closeness Centrality	|Calculates which nodes have the shortest paths to all other nodes	|Finding the optimal location of new public services for maximum accessibility
Betweenness Centrality	|Measures the number of shortest paths that pass through a node 	|Improving drug targeting by finding the control genes for specific diseases
PageRank	|Estimates a current node's importance from its linked neighbors (popularized by Google)	|Finding the most influential features for extraction in machine learning and ranking text for entity relevance in natural language processing

In Closeness Centrality, the measure of a node's centrality is its average farness (inverse distance) to all other nodes. Nodes with a high closeness score have the shortest distances to all other nodes.

![15.png](./picture/15.png)
-- This picture is from [*Graph Algorithms*](https://neo4j.com/lp/book-graph-algorithms-thanks/?aliId=eyJpIjoiT1lBd0tIeEh6Y2N6ajZCYiIsInQiOiJPemxyM1BhUG9uczhBdzFYRUwrM3Z3PT0ifQ%253D%253D).

## Degree Centrality

### algo.degree.stream

#### Parameters

Name	|Type	|Default	|Optional|	Description
--------|--------|-----------|---------|--------------
label|string|null|yes|The label to load from the graph. If null, load all nodes.
relationship|string|null|yes|The relationship-type to load from the graph. If null, load all relationships.
direction|string|incoming|yes|The relationship direction to load from the graph. If 'both', treats the relationships as undirected.
concurrency|int|available CPUs|yes|The number of concurrent threads.

#### Results

Name	|Type	|Description
--------|----------|-----------
nodeId|long|Node ID
score|float|Degree Centrality score

### Example

results 
![17.png](./picture/17.png)

## Closeness Centrality

Closeness Centrality is a way of detecting nodes that are able to spread information efficiently through a subgraph.

Neo4j's implementation of Closeness Centrality uses the following formula:
    
![26.png](./picture/26.png)

where:

 - `u` is a node.
 - `n` is the number of nodes in the same component (subgraph or group) as `u`.
 - `d(u,v)` is the shortest-path distance between another node `v` and `u`.
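To make the definitions concrete, here is a small Python sketch. It assumes the normalized form `C(u) = (n - 1) / sum of d(u,v)`, which is consistent with the symbols listed above; the formula image itself is not reproduced here, so treat the exact form as an assumption. Distances are counted in hops, and the helper names are made up for this example.

```python
from collections import deque

def bfs_distances(neighbors, source):
    """Hop distance from source to every node in its component."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in neighbors[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist

def closeness(neighbors, u):
    """(n - 1) / sum of distances, with n = size of u's component."""
    dist = bfs_distances(neighbors, u)
    total = sum(dist.values())
    if total == 0:
        return 0.0                      # isolated node
    return (len(dist) - 1) / total

# Undirected path a - b - c as adjacency lists (toy data).
path = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
for node in path:
    print(node, round(closeness(path, node), 3))
```

The middle node `b` is one hop from everyone and scores 1.0, while the end nodes score 2/3.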

**In our graph, we can use this measure to find which genes are able to relate to other genes efficiently.**

### algo.closeness.stream

#### Parameters

Name| Type| Default | Optional | Description
----|----|-----|------|-------
label | string |null|yes |The label to load from the graph. If null, load all nodes
relationship|string|null|yes|The relationship-type to load from the graph. If null, load all relationships
concurrency|int|available CPUs|yes|The number of concurrent threads
graph|string|'heavy' |yes |Use 'heavy' when describing the subset of the graph with label and relationship-type parameter. Use 'cypher' for describing the subset with cypher node-statement and relationship-statement

#### Results

Name	|Type	|Description
----|-----|------
node|long|Node ID
centrality|float|Closeness centrality weight

### Example

To calculate the closeness centrality for each of the TF nodes in our graph:

result
![2.png](picture/2.png)

## Closeness Centrality Variation: Wasserman and Faust

Stanley Wasserman and Katherine Faust came up with an improved formula for calculating closeness for graphs with multiple subgraphs without connections between those groups. Details on their formula are in their book, *Social Network Analysis: Methods and Applications*. The result of this formula is a ratio of the fraction of nodes in the group that are reachable to the average distance from the reachable nodes. The formula is as follows:

![28.png](./picture/28.png)

where:

 - `u` is a node.
 - `N` is the total node count.
 - `n` is the number of nodes in the same component as `u`.
 - `d(u,v)` is the shortest-path distance between another node `v` and `u`.
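The ratio described above can be sketched as `((n - 1) / (N - 1)) * ((n - 1) / sum of d(u,v))`, that is, the fraction of reachable nodes divided by the average distance to them. The exact layout of the formula image is assumed to match this reading. The toy graph has two disconnected groups so that the effect of `N` is visible.

```python
from collections import deque

def bfs_distances(neighbors, source):
    """Hop distance from source to every node in its component."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in neighbors[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist

def wf_closeness(neighbors, u):
    """Wasserman and Faust closeness: reachable fraction / average distance."""
    dist = bfs_distances(neighbors, u)
    total = sum(dist.values())
    if total == 0:
        return 0.0
    n = len(dist)                # nodes in the same component as u
    big_n = len(neighbors)       # total node count
    return ((n - 1) / (big_n - 1)) * ((n - 1) / total)

# Two disconnected groups: a path a - b - c and a pair x - y (toy data).
groups = {"a": ["b"], "b": ["a", "c"], "c": ["b"], "x": ["y"], "y": ["x"]}
for node in groups:
    print(node, round(wf_closeness(groups, node), 3))
```

With the plain formula, `x` would score 1.0 because it is one hop from the only node it can reach. The improved formula scales that down to 0.25, since `x` reaches just one of the four other nodes.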

**The improved formula calculates closeness for graphs with multiple subgraphs that have no connections between them.**

### algo.closeness.stream

We can tell the Closeness Centrality procedure to use this formula by passing the parameter `improved: true`.

#### Parameters

Name| Type| Default | Optional | Description
----|----|-----|------|-------
label | string |null|yes |The label to load from the graph. If null, load all nodes
relationship|string|null|yes|The relationship-type to load from the graph. If null, load all relationships
concurrency|int|available CPUs|yes|The number of concurrent threads

#### Results

Name	|Type	|Description
----|-----|------
node|long|Node ID
centrality|float|Closeness centrality weight

### Example

To calculate the closeness centrality for each of the TF nodes in our graph using the Wasserman and Faust formula:

result
![3.png](./picture/3.png)

## Closeness Centrality Variation: Harmonic Centrality

**Harmonic Centrality (also known as Valued Centrality) is a variant of Closeness Centrality, invented to solve the original problem with unconnected graphs. In "Harmony in a Small World," M. Marchiori and V. Latora proposed this concept as a practical representation of an average shortest path.**

When calculating the closeness score for each node, rather than summing the distances of a node to all other nodes, **it sums the inverse of those distances**. This means that infinite values become irrelevant.

The raw harmonic centrality for a node is calculated using the following formula:

![31.png](./picture/31.png)

where:

 - `u` is a node.
 - `n` is the number of nodes in the graph.
 - `d(u,v)` is the shortest-path distance between another node `v` and `u`.
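A short Python sketch of the idea: sum `1 / d(u,v)` over every other reachable node, so an unreachable node simply contributes nothing. Dividing the raw sum by `n - 1` gives a normalized score; that normalization is an assumption based on the definition of `n` above, since the formula image is not reproduced here.

```python
from collections import deque

def bfs_distances(neighbors, source):
    """Hop distance from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in neighbors[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist

def harmonic(neighbors, u, normalized=False):
    """Sum of inverse distances; unreachable nodes add 0 instead of infinity."""
    dist = bfs_distances(neighbors, u)
    raw = sum(1 / d for node, d in dist.items() if node != u)
    if normalized:
        return raw / (len(neighbors) - 1)   # n = number of nodes in the graph
    return raw

# Two disconnected groups: a path a - b - c and a pair x - y (toy data).
groups = {"a": ["b"], "b": ["a", "c"], "c": ["b"], "x": ["y"], "y": ["x"]}
for node in groups:
    print(node, harmonic(groups, node), harmonic(groups, node, normalized=True))
```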

### algo.closeness.harmonic.stream

 - yields centrality for each node

#### Parameters

Name| Type| Default | Optional | Description
----|----|-----|------|-------
label | string |null|yes |The label to load from the graph. If null, load all nodes
relationship|string|null|yes|The relationship-type to load from the graph. If null, load all relationships
concurrency|int|available CPUs|yes|The number of concurrent threads

#### Results

Name	|Type	|Description
----|-----|------
node|long|Node ID
centrality|float|Closeness centrality weight

### Example

To calculate the closeness centrality for each of the TF nodes in our graph using Harmonic Centrality:

Result
![4.png](./picture/4.png)

## Betweenness Centrality


Betweenness Centrality is a way of detecting the amount of influence a node has over the flow of information or resources in a graph. It is typically used **to find nodes that serve as a bridge from one part of a graph to another**.

The Betweenness Centrality algorithm first calculates the shortest (weighted) path between every pair of nodes in a connected graph. Each node receives a score, based on the number of these shortest paths that pass through the node. The more shortest paths that a node lies on, the higher its score.

![34.png](./picture/34.png)
-- This picture is from [*Graph Algorithms*](https://neo4j.com/lp/book-graph-algorithms-thanks/?aliId=eyJpIjoiT1lBd0tIeEh6Y2N6ajZCYiIsInQiOiJPemxyM1BhUG9uczhBdzFYRUwrM3Z3PT0ifQ%253D%253D).


Pivotal nodes play an important role in connecting other nodes: if you remove a pivotal node, the new shortest path for the original node pairs will be longer or more costly. This can be a consideration for evaluating single points of vulnerability.

The betweenness centrality of a node is calculated by adding the results of the following formula for all shortest paths:
![35.png](./picture/35.png)

where:

 - `u` is a node.
 - `p` is the total number of shortest paths between nodes `s` and `t`.
 - `p(u)` is the number of shortest paths between nodes `s` and `t` that pass through node `u`.

The figure illustrates the steps for working out betweenness centrality.

![36.png](./picture/36.png)

Here's the procedure:

1. For each node, find the shortest paths that go through it.
   - B, C, and E have no shortest paths and are assigned a value of 0.

2. For each shortest path in step 1, calculate its percentage of the total possible shortest paths for that pair.

3. Add together all the values in step 2 to find a node's betweenness centrality score. The table in the figure above illustrates steps 2 and 3 for node D.

4. Repeat the process for each node.
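The procedure above can be written out directly in Python. This sketch is a brute-force reading of the formula for a small undirected, unweighted graph: it counts shortest paths with a breadth-first search and adds `p(u) / p` for every unordered pair `(s, t)`. It is meant to illustrate the definition, not to reproduce the library's implementation, and a directed run may count pairs differently.

```python
from collections import deque
from itertools import combinations

def count_paths(neighbors, source):
    """BFS returning (dist, sigma): hop distance and number of shortest paths."""
    dist = {source: 0}
    sigma = {source: 1}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in neighbors[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                sigma[nxt] = 0
                queue.append(nxt)
            if dist[nxt] == dist[node] + 1:
                sigma[nxt] += sigma[node]   # extend every shortest path to node
    return dist, sigma

def betweenness(neighbors):
    """Sum of p(u) / p over all unordered pairs (s, t) that exclude u."""
    info = {s: count_paths(neighbors, s) for s in neighbors}
    score = {u: 0.0 for u in neighbors}
    for s, t in combinations(neighbors, 2):
        dist_s, sigma_s = info[s]
        if t not in dist_s:
            continue                        # no path between s and t
        p = sigma_s[t]
        for u in neighbors:
            if u in (s, t) or u not in dist_s:
                continue
            dist_u, sigma_u = info[u]
            # u lies on a shortest s-t path when the two legs add up exactly
            if t in dist_u and dist_s[u] + dist_u[t] == dist_s[t]:
                score[u] += sigma_s[u] * sigma_u[t] / p
    return score

# Diamond: two equally short routes from s to t (toy data).
diamond = {"s": ["a", "b"], "a": ["s", "t"], "b": ["s", "t"], "t": ["a", "b"]}
print(betweenness(diamond))
```

In the diamond, each node carries one of the two shortest paths for exactly one pair, so every node scores 0.5.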

### algo.betweenness.stream

#### Parameters

Name| Type| Default | Optional | Description
----|----|-----|------|-------
label | string |null|yes |The label to load from the graph. If null, load all nodes
relationship|string|null|yes|The relationship-type to load from the graph. If null, load all relationships
concurrency|int|available CPUs|yes|The number of concurrent threads
direction|string|outgoing|yes|The relationship direction to load from the graph. If 'both', treats the relationships as undirected

#### Results

Name	|Type	|Description
----|-----|------
node|long|Node ID
centrality|float|Betweenness centrality weight

### Example

This example measures the number of shortest paths that pass through each TF gene in this database.

Results
![5.png](./picture/5.png)

## Betweenness Centrality Variation: Randomized-Approximate Brandes

The Randomized-Approximate Brandes (RA-Brandes for short) algorithm is the best-known algorithm for calculating an approximate score for betweenness centrality. **Rather than calculating the shortest path between every pair of nodes, the RA-Brandes algorithm considers only a subset of nodes.** Two common strategies for selecting the subset of nodes are:

**Random**

Nodes are selected uniformly, at random, with a defined probability of selection. The default probability is `log10(N) / e^2`. If the probability is 1, the algorithm works the same way as the normal Betweenness Centrality algorithm, where all nodes are loaded.

**Degree**

Nodes are selected randomly, but those whose degree is lower than the mean are automatically excluded (i.e., only nodes with a lot of relationships have a chance of being visited).

As a further optimization, you could limit the depth used by the Shortest Path algorithm, which will then provide a subset of all the shortest paths.
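The Random strategy can be sketched in Python as Brandes' accumulation run only from sampled source nodes. Several things here are assumptions for illustration: the function name, the `seed` argument, the lack of any rescaling of the sampled scores, and the fact that each undirected pair is counted from both ends when every node is selected. The Degree strategy and the depth limit are not implemented.

```python
import random
from collections import deque

def sampled_betweenness(neighbors, probability=1.0, seed=None):
    """Brandes-style accumulation from randomly selected source nodes."""
    rng = random.Random(seed)
    score = {u: 0.0 for u in neighbors}
    for s in neighbors:
        if rng.random() >= probability:
            continue                      # this source was not selected
        dist, sigma, preds, order = {s: 0}, {s: 1}, {s: []}, []
        queue = deque([s])
        while queue:                      # BFS that counts shortest paths
            v = queue.popleft()
            order.append(v)
            for w in neighbors[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    sigma[w] = 0
                    preds[w] = []
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in order}
        for w in reversed(order):         # farthest nodes first
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    return score

# Undirected path a - b - c - d (toy data).
line = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
print(sampled_betweenness(line, probability=1.0))
print(sampled_betweenness(line, probability=0.5, seed=1))
```

With `probability=1.0` every node is a source and the result is the exact (unsampled) score; lower probabilities trade accuracy for fewer traversals, and different seeds can give different scores.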

### algo.betweenness.sampled.stream

#### Parameters

Name| Type| Default | Optional | Description
----|----|-----|------|-------
label | string |null|yes |The label to load from the graph. If null, load all nodes
relationship|string|null|yes|The relationship-type to load from the graph. If null, load all relationships
strategy|string|'random'|yes|The node selection strategy
probability|float|log10(N) / e^2|yes|The probability a node is selected. Values between 0 and 1. If 1, selects all nodes and works like original Brandes algorithm
maxDepth|int|Integer.MAX|yes|The depth of the shortest paths traversal
concurrency|int|available CPUs|yes|The number of concurrent threads
direction|string|outgoing|yes|The relationship direction to load from the graph. If 'both', treats the relationships as undirected

#### Results

Name	|Type	|Description
----|-----|------
node|long|Node ID
centrality|float|Betweenness centrality weight

### Example

The following query executes the RA-Brandes algorithm using the random selection method.

results
![6.png](./picture/6.png)

## PageRank

PageRank is the best known of the centrality algorithms. It measures the transitive (or directional) influence of nodes. All the other centrality algorithms we discuss measure the direct influence of a node, whereas **PageRank considers the influence of a node's neighbors, and their neighbors.**  For example, having a few very powerful friends can make you more influential than having a lot of less powerful friends. **PageRank is computed either by iteratively distributing one node's rank over its neighbors or by randomly traversing the graph and counting the frequency with which each node is hit during these walks.**

PageRank is defined in the original Google paper as follows:

 ![43.png](./picture/43.png)
    
where:

 - We assume that a page `u` has citations from pages `T1` to `Tn`.
 - `d` is a damping factor which is set between 0 and 1. It is usually set to 0.85. You can think of this as the probability that a user will continue clicking. This helps minimize rank sink.
 - `1-d` is the probability that a node is reached directly without following any relationships.
 - `C(Tn)` is defined as the out-degree of node `Tn`.
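The iterative variant can be sketched in a few lines of Python, assuming the usual form of the definition, `PR(u) = (1 - d) + d * (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))`. The starting rank of 1.0, the function name, and the toy graph are choices made for this example, and nodes without outgoing relationships simply pass on nothing.

```python
def pagerank(out_links, damping=0.85, iterations=20):
    """Iteratively distribute each node's rank over its outgoing links."""
    nodes = set(out_links)
    for targets in out_links.values():
        nodes.update(targets)
    rank = {node: 1.0 for node in nodes}          # assumed starting value
    for _ in range(iterations):
        new_rank = {node: 1 - damping for node in nodes}
        for source, targets in out_links.items():
            if not targets:
                continue                          # nothing to distribute
            share = damping * rank[source] / len(targets)   # d * PR(T) / C(T)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank

# Toy directed graph: a and b both point to c, and c points back to a.
links = {"a": ["c"], "b": ["c"], "c": ["a"]}
for node, value in sorted(pagerank(links, iterations=100).items()):
    print(node, round(value, 4))
```

Node `c` ends up highest because two nodes point to it, `a` follows because it inherits rank from `c`, and `b`, which nobody points to, keeps only the `1 - d` baseline.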

### algo.pageRank.stream

#### Parameters

Name| Type| Default | Optional | Description
----|----|-----|------|-------
label | string |null|yes |The label to load from the graph. If null, load all nodes
relationship|string|null|yes|The relationship-type to load from the graph. If null, load all relationships
direction|string|'OUTGOING'|yes|The relationship-direction to use in the algorithm
iterations|int|20|yes|How many iterations of PageRank to run
dampingFactor|float|0.85|yes|The damping factor of the PageRank calculation
concurrency|int|available CPUs|yes|The number of concurrent threads

#### Results

Name	|Type	|Description
---|---|-----
nodeId|long|Node ID
score|float|PageRank weight

### Example

A call to the following procedure will calculate the PageRank for each of the TF genes in our graph:

![7.png](./picture/7.png)

## Exercises

#### 1. Try to find out which range on Chr1 is bound by the maximum number of LncRNAs.

#### 2. Try to find out which LncRNA binds the biggest range on Chr1.

DO NOT RUN if your computer has less than 8 GB of memory.

#### 3. Calculate which TF has the maximum number of **relationships** that bind to chr1_Range.

DO NOT RUN if your computer has less than 8 GB of memory.

#### 4. Compare Closeness Centrality with its two variations (Wasserman and Faust, Harmonic Centrality). What are the differences?



#### 5. Find the LncRNA that the maximum number of shortest paths pass through. Please use Closeness Centrality, Wasserman and Faust, and Harmonic Centrality.

DO NOT RUN if your computer has less than 8 GB of memory.

#### 6. Calculate which TF has the maximum number of **shortest paths** that bind to chr1_Range.

Betweenness Centrality

Randomized-Approximate Brandes

#### 7. Use PageRank to estimate which TF is most important in the subgraph of TFs that bind on chr1_range.

#### 8. The default label and relationship-type projection has a limitation of 2 billion nodes and 2 billion relationships. Therefore, if our projected graph contains more than 2 billion nodes or relationships, we will need to use huge graph projection. Set `graph:'huge'` in your Cypher commands to test whether your results are the same.

# Clustering

Algorithm type	|What it does 	|Example use
-----------|-----------|-----------------
Triangle Count and Clustering Coefficient	|Measures how many nodes form triangles and the degree to which nodes tend to cluster together	|Estimating group stability and whether the network might exhibit “small-world” behaviors seen in graphs with tightly knit clusters
Strongly Connected Components	|Finds groups where each node is reachable from every other node in that same group following the direction of relationships 	|Making product recommendations based on group affiliation or similar items
Connected Components	|Finds groups where each node is reachable from every other node in that same group, regardless of the direction of relationships	|Performing fast grouping for other algorithms and identifying islands
Label Propagation	|Infers clusters by spreading labels based on neighborhood majorities	|Understanding consensus in social communities or finding dangerous combinations of possible co-prescribed drugs
Louvain Modularity	|Maximizes the presumed accuracy of groupings by comparing relationship weights and densities to a defined estimate or average	|In fraud analysis, evaluating whether a group has just a few discrete bad behaviors or is acting as a fraud ring

![48.png](./picture/48.png)
-- This picture is from [*Graph Algorithms*](https://neo4j.com/lp/book-graph-algorithms-thanks/?aliId=eyJpIjoiT1lBd0tIeEh6Y2N6ajZCYiIsInQiOiJPemxyM1BhUG9uczhBdzFYRUwrM3Z3PT0ifQ%253D%253D).

## Triangles


Triangle Count determines the number of triangles passing through each node in the graph. A triangle is a set of three nodes, where each node has a relationship to all other nodes. Triangle Count can also be run globally for evaluating our overall dataset.

Networks with a high number of triangles are more likely to exhibit small-world structures and behaviors.

The goal of the Clustering Coefficient algorithm is to measure how tightly a group is clustered compared to how tightly it could be clustered. The algorithm uses Triangle Count in its calculations, which provides a ratio of existing triangles to possible relationships. A maximum value of 1 indicates a clique where every node is connected to every other node.
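Both measures are easy to compute by hand on a small undirected graph. The sketch below assumes the common local definition: the coefficient of a node with `k` neighbors is `2 * triangles / (k * (k - 1))`, the share of neighbor pairs that are themselves connected. The function names and the toy graph are made up for this example, and a node is assumed not to appear in its own neighbor set (self-relationships would need to be filtered out first).

```python
from itertools import combinations

def triangle_count(neighbors, u):
    """Number of triangles that pass through u (neighbors are sets)."""
    return sum(1 for a, b in combinations(neighbors[u], 2) if b in neighbors[a])

def clustering_coefficient(neighbors, u):
    """Existing triangles divided by the possible pairs of u's neighbors."""
    k = len(neighbors[u])
    if k < 2:
        return 0.0                 # fewer than two neighbors, no pair to close
    return 2 * triangle_count(neighbors, u) / (k * (k - 1))

# Triangle a - b - c with an extra node d attached to c (toy data).
graph = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
for node in sorted(graph):
    print(node, triangle_count(graph, node), round(clustering_coefficient(graph, node), 3))
```

Node `a` scores 1.0 because its two neighbors are connected to each other, while `c` scores 1/3 because only one of its three neighbor pairs is connected.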

### algo.triangle.stream

#### Parameters

Name	|Type	|Default	|Optional	|Description
----|------|-----|------|------
label|string|null|yes|The label to load from the graph. If null, load all nodes
relationship|string|null|yes|The relationship-type to load from the graph. If null, load all relationships
concurrency|int|available CPUs|yes|The number of concurrent threads

#### Results

Name	|Type	|Description
-----|------|--------
nodeA|int|The ID of node in the given triangle
nodeB|int|The ID of node in the given triangle
nodeC|int|The ID of node in the given triangle

### Example

Get a stream of the triangles among TF nodes:

result
![8.png](./picture/8.png)

**The problem is that a TF gene can interact with itself.**

## Local Clustering Coefficient

Clustering Coefficient can provide the probability that randomly chosen nodes will be connected. You can also use it to quickly evaluate the cohesiveness of a specific group or your overall network. Together these algorithms are used to estimate resiliency and look for network structures.

### algo.triangleCount.stream

#### Parameters

Name	|Type	|Default	|Optional	|Description
----|------|-----|------|------
label|string|null|yes|The label to load from the graph. If null, load all nodes
relationship|string|null|yes|The relationship-type to load from the graph. If null, load all relationships
concurrency|int|available CPUs|yes|The number of concurrent threads

#### Results

Name	|Type	|Description
-----|------|--------
nodeId|int|The ID of node
triangles|int|The number of triangles a node is member of

 - yield nodeId, number of triangles

### Example

We can also work out the local clustering coefficient. The following query will calculate this for each TF gene:

![9.png](./picture/9.png)

BTAF1 has a score of 1, which means that all of BTAF1's neighbors are neighbors of each other. This tells us that the community directly around BTAF1 is very cohesive.

## Strongly Connected Components


Use Strongly Connected Components as an early step in graph analysis to see how a graph is structured or to identify tight clusters that may warrant independent investigation. A component that is strongly connected can be used to profile similar behavior or inclinations in a group for applications such as recommendation engines.

### algo.scc.stream

#### Parameters

Name	|Type	|Default	|Optional	|Description
----|----|----|-----|----
label|string|null|yes|The label to load from the graph. If null, load all nodes
relationship|string|null|yes|The relationship-type to load from the graph. If null, load all relationships
concurrency|int|available CPUs|yes|The number of concurrent threads

#### Results

Name	|Type	|Description
-----|------|--------
nodeId|int|The ID of node
partition|int|Partition ID

### Example

## Connected Components

The Connected Components algorithm (sometimes called Union Find or Weakly Connected Components) finds sets of connected nodes in an undirected graph where each node is reachable from any other node in the same set. It differs from the SCC algorithm because it only needs a path to exist between pairs of nodes in one direction, whereas SCC needs a path to exist in both directions. 
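The name Union Find comes from the data structure that is commonly used for this task. The sketch below is a minimal Python version, assuming the graph is given as a list of nodes and a list of relationships whose direction is ignored; the function name and the toy data are placeholders.

```python
def connected_components(nodes, edges):
    """Group nodes that are linked by any chain of relationships."""
    parent = {node: node for node in nodes}

    def find(node):
        # Follow parent pointers to the representative, shortening the chain.
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a        # union the two sets

    groups = {}
    for node in nodes:
        groups.setdefault(find(node), set()).add(node)
    return list(groups.values())

# Direction is ignored: ("c", "b") still joins c to the a - b group.
nodes = ["a", "b", "c", "d", "e"]
edges = [("a", "b"), ("c", "b"), ("d", "e")]
print(connected_components(nodes, edges))
```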

### algo.unionFind.stream

#### Parameters

Name	|Type	|Default	|Optional	|Description
----|----|----|-----|----
label|string|null|yes|The label to load from the graph. If null, load all nodes
relationship|string|null|yes|The relationship-type to load from the graph. If null, load all relationships
weightProperty|string|null|yes|The property name that contains weight. If null, treats the graph as unweighted. Must be numeric.
threshold|float|null|yes|The value of the weight above which the relationship is not thrown away
defaultValue|float|null|yes|The default value of the weight in case it is missing or invalid
concurrency|int|available CPUs|yes|The number of concurrent threads


#### Results

Name	|Type	|Description
-----|------|--------
nodeId|int|The ID of node
setId|int|Partition ID

### Example

## Label Propagation


The Label Propagation algorithm (LPA) is a fast algorithm for finding communities in a graph. In LPA, nodes select their group based on their direct neighbors. This process is well suited to networks where groupings are less clear and weights can be used to help a node determine which community to place itself within. It also lends itself well to semisupervised learning because you can seed the process with preassigned, indicative node labels.
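A minimal Python sketch of the idea follows. Every node starts with its own label and repeatedly adopts the label that is most common among its direct neighbors. The details are assumptions for illustration: nodes are visited in a random order each sweep, a node keeps its label when that label is already among the most common ones, and remaining ties go to the smallest label. Weights and seed labels are left out.

```python
import random
from collections import Counter

def label_propagation(neighbors, max_sweeps=20, seed=None):
    """Return a dict node -> community label for an undirected graph."""
    rng = random.Random(seed)
    labels = {node: node for node in neighbors}   # unique starting labels
    nodes = list(neighbors)
    for _ in range(max_sweeps):
        rng.shuffle(nodes)                        # visiting order is random
        changed = False
        for node in nodes:
            if not neighbors[node]:
                continue                          # isolated node keeps its label
            counts = Counter(labels[nb] for nb in neighbors[node])
            top = max(counts.values())
            candidates = sorted(l for l, c in counts.items() if c == top)
            if labels[node] in candidates:
                continue                          # already in a majority group
            labels[node] = candidates[0]
            changed = True
        if not changed:
            break                                 # converged
    return labels

# Two separate triangles (toy data).
graph = {
    "a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b"],
    "x": ["y", "z"], "y": ["x", "z"], "z": ["x", "y"],
}
print(label_propagation(graph, seed=42))
```

Because the visiting order is random, the label that ends up naming a community can change between runs, and on less clear-cut graphs the communities themselves can differ.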

### algo.labelPropagation.stream

### Example

Result

![13.png](./picture/13.png)

Try running the code again. You may find that the results are different.

**Set `graph:'cypher'` in the config, and run the code.** For more about Cypher projection, see [the documentation](https://neo4j.com/docs/graph-algorithms/current/projected-graph-model/cypher-projection/).

   The first query `Match(m:TF) return id(m) as id` returns TF node ids. The Cypher loader expects the query to return an id field.

   The second query `Match(n:TF)-[r:Interaction]-(m) return id(n) AS source, id(m) AS target` returns pairs of node ids that have an `Interaction` relationship between them in our projected graph. The Cypher loader expects the query to return source and target fields.

**Note that in both queries we use the id function to return the node id.**



If the label and relationship-type projection is not selective enough to describe the subgraph to run the algorithm on, we can use Cypher statements to project subsets of our graph. Use a node-statement instead of the label parameter and a relationship-statement instead of the relationship-type, and use `graph:'cypher'` in the config.

Relationships described in the relationship-statement will only be projected if both source and target nodes are described in the node-statement; otherwise they will be ignored.

Cypher projection lets us be more expressive in describing the subgraph that we want to analyse, but projecting the graph might take longer with more complex Cypher queries.
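The rule that a relationship is only projected when both of its end nodes are returned by the node-statement can be illustrated with plain Python. The row dictionaries below are hypothetical stand-ins for what the two Cypher statements would return (`id`, `source`, `target`), and `project_graph` is an illustrative helper, not part of the library.

```python
def project_graph(node_rows, relationship_rows):
    """Keep a relationship only if both ends were returned by the node query."""
    node_ids = {row["id"] for row in node_rows}
    kept = [
        (row["source"], row["target"])
        for row in relationship_rows
        if row["source"] in node_ids and row["target"] in node_ids
    ]
    return node_ids, kept


# Hypothetical query results.
node_rows = [{"id": 1}, {"id": 2}, {"id": 3}]
relationship_rows = [
    {"source": 1, "target": 2},
    {"source": 2, "target": 3},
    {"source": 3, "target": 9},  # node 9 was not returned by the node query
]
node_ids, relationships = project_graph(node_rows, relationship_rows)
print(relationships)
```

The relationship pointing at node 9 is dropped because node 9 is not in the projected node set.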

## Louvain


The Louvain Modularity algorithm finds clusters by comparing community density as it assigns nodes to different groups. You can think of this as a what-if analysis that tries various groupings with the goal of reaching a global optimum.
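The quantity being compared across groupings is modularity. The sketch below only scores a given grouping of an unweighted, undirected graph, using the standard formula (fraction of edges inside a community minus the fraction expected from its total degree); it does not perform the Louvain optimisation itself. The function name and the toy graph are assumptions for illustration.

```python
def modularity(edges, community_of):
    """Modularity of a partition of an unweighted, undirected graph."""
    m = len(edges)
    internal = {}  # number of edges with both ends inside the community
    degree = {}    # sum of node degrees per community
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        degree[cu] = degree.get(cu, 0) + 1
        degree[cv] = degree.get(cv, 0) + 1
        if cu == cv:
            internal[cu] = internal.get(cu, 0) + 1
    return sum(
        internal.get(c, 0) / m - (degree[c] / (2 * m)) ** 2
        for c in degree
    )


# Hypothetical toy graph: two triangles joined by one edge.
toy_edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]

# "What if" grouping 1: one community per triangle.
q_split = modularity(toy_edges, {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"})
# "What if" grouping 2: everything in a single community.
q_single = modularity(toy_edges, {n: "a" for n in range(6)})
print(q_split, q_single)
```

The split into two triangles scores higher than the single community, which is the kind of comparison a modularity-based method relies on.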

### algo.louvain

#### Parameters

Name	|Type	|Default	|Optional	|Description
----|----|----|-----|----
label|string|null|yes|The label to load from the graph. If null, load all nodes
relationship|string|null|yes|The relationship-type to load from the graph. If null, load all relationships
weightProperty|string|null|yes|The property name that contains weight. If null, treats the graph as unweighted. Must be numeric.
write|boolean|true|yes|Specifies if the result should be written back as a node property
writeProperty|string|'community'|yes|The name of the property to which the ID of the community that a node belongs to is written
communityProperty|string|null|yes|The property name that contains an initial or pre-defined community (must be a number)
defaultValue|float|null|yes|The default value of the weight in case it is missing or invalid
concurrency|int|available CPUs|yes|The number of concurrent threads


#### Results

Name	|Type	|Description
-----|------|--------
nodes|int|The number of nodes considered
communityCount|int|The number of communities found

### Example

![10.png](./picture/10.png)

## Exercises

### Cluster the subgraph of `CTCF` binding to chr1 using different clustering methods.

Triangles

Local Clustering Coefficient

Strongly Connected Components

Connected Components

Label Propagation

Louvain

# Network alignment

## Jaccard Similarity

![78.png](./picture/78.png)

The Jaccard index is a statistic used for comparing the similarity between pairs of sample sets or nodes in our example. It is defined as the size of the intersection divided by the size of the union of the sample sets.
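As a quick worked example of this definition, the Python sketch below computes the index for two small sets. The function name and the sample values are made up for illustration, and returning 0.0 for two empty sets is a convention chosen here.

```python
def jaccard(a, b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty sets
    return len(a & b) / len(union)


# Two hypothetical sets of neighbour ids sharing two of four distinct members.
score = jaccard([1, 2, 3], [2, 3, 4])
print(score)  # 2 shared / 4 distinct = 0.5
```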

### algo.similarity.jaccard

#### Parameters

Name	|Type	|Default	|Optional	|Description
--------|-----|------------|--------|---------
data|list|null|no|A list of maps of the following structure: {item: nodeId, categories: [nodeId, nodeId, nodeId]}
top|int|0|yes|The number of similar pairs to return. If 0, it will return as many as it finds.
topK|int|0|yes|The number of similar values to return per node. If 0, it will return as many as it finds.
similarityCutoff|int|-1|yes|The threshold for Jaccard similarity. Values below this will not be returned.
write|boolean|false|yes|Indicates whether results should be stored.
writeProperty|string|score|yes|The property to use when storing results.

#### Results

Name	|Type	|Description
--------|-------|---------
nodes|int|The number of nodes passed in.
similarityPairs|int|The number of pairs of similar nodes computed.
write|boolean|Indicates whether results were stored.
writeRelationshipType|string|The relationship type used when storing results.
writeProperty|string|The property used when storing results.
min|double|The minimum similarity score computed.
max|double|The maximum similarity score computed.
mean|double|The mean of the similarity scores computed.
stdDev|double|The standard deviation of the similarity scores computed.
p25|double|The 25th percentile of the similarity scores computed.
p50|double|The 50th percentile of the similarity scores computed.
p75|double|The 75th percentile of the similarity scores computed.
p90|double|The 90th percentile of the similarity scores computed.
p95|double|The 95th percentile of the similarity scores computed.
p99|double|The 99th percentile of the similarity scores computed.
p999|double|The 99.9th percentile of the similarity scores computed.
p100|double|The 100th percentile of the similarity scores computed.

### Example

Compare the two networks of TFs that bind to Bin1 and Bin2 in chr1.

![14.png](./picture/14.png)

### algo.similarity.jaccard.stream

#### Parameters

Name	| Type	|Default	|Optional|	Description
--------|--------|-----------|---------|-------------
data|list|null|no|A list of maps of the following structure: {item: nodeId, categories: [nodeId, nodeId, nodeId]}
top|int|0|yes|The number of similar pairs to return. If 0, it will return as many as it finds.
topK|int|0|yes|The number of similar values to return per node. If 0, it will return as many as it finds.
similarityCutoff|int|-1|yes|The threshold for Jaccard similarity. Values below this will not be returned.
degreeCutoff|int|0|yes|The threshold for the number of items in the targets list. If the list contains less than this amount, that node will be excluded from the calculation.
concurrency|int|available CPUs|yes|The number of concurrent threads.
sourceIds|long[]|null|yes|The ids of items from which we need to compute similarities. Defaults to all the items provided in the data parameter.
targetIds|long[]|null|yes|The ids of items to which we need to compute similarities. Defaults to all the items provided in the data parameter.

#### Results

Name	|Type	|Description
-----|--------------------|--------------
item1|int|The ID of one node in the similarity pair.
item2|int|The ID of the other node in the similarity pair.
count1|int|The size of the targets list of one node.
count2|int|The size of the targets list of the other node.
intersection|int|The number of intersecting values in the two nodes' targets lists.
similarity|double|The Jaccard similarity of the two nodes.
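To show how the `data` input relates to these result columns, here is a small Python sketch that mimics the shape of the output. It is only an approximation for illustration: `jaccard_stream`, its arguments and the sample ids are assumptions, each pair is reported once, and the sorting used to apply `top` is a choice made here rather than documented behaviour.

```python
from itertools import combinations


def jaccard_stream(data, similarity_cutoff=-1.0, top=0):
    """Return one row per pair of items, shaped like the results table."""
    rows = []
    for first, second in combinations(data, 2):
        set1, set2 = set(first["categories"]), set(second["categories"])
        union = set1 | set2
        shared = len(set1 & set2)
        similarity = shared / len(union) if union else 0.0
        if similarity < similarity_cutoff:
            continue  # values below the cutoff are not returned
        rows.append({
            "item1": first["item"],
            "item2": second["item"],
            "count1": len(set1),
            "count2": len(set2),
            "intersection": shared,
            "similarity": similarity,
        })
    # Highest similarity first, so that `top` keeps the most similar pairs.
    rows.sort(key=lambda row: row["similarity"], reverse=True)
    return rows[:top] if top > 0 else rows


# Hypothetical input in the {item, categories} structure described above.
data = [
    {"item": 10, "categories": [1, 2, 3]},
    {"item": 11, "categories": [2, 3, 4]},
    {"item": 12, "categories": [5]},
]
rows = jaccard_stream(data, similarity_cutoff=0.1)
print(rows)
```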

### Example

The following will return a stream of TF pairs along with their intersection and Jaccard similarities:

results
![19.png](./picture/19.png)

### Example--Specifying source and target ids

Sometimes we don't want to compute the similarity of all pairs, but would rather specify subsets of items to compare to each other.

We do this using the `sourceIds` and `targetIds` keys in the config. 

We might want to use this technique when comparing nodes with different labels that intersect on a common label.
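The effect of restricting the comparison to chosen sources and targets can be sketched in Python as follows. The helper `jaccard_between` and the sample ids are hypothetical; the sketch only illustrates which pairs get compared, not how the procedure is implemented.

```python
def jaccard_between(data, source_ids, target_ids):
    """Compare only (source, target) pairs instead of all pairs."""
    categories = {entry["item"]: set(entry["categories"]) for entry in data}
    rows = []
    for source in source_ids:
        for target in target_ids:
            if source == target:
                continue  # skip comparing an item with itself
            a, b = categories[source], categories[target]
            union = a | b
            similarity = len(a & b) / len(union) if union else 0.0
            rows.append((source, target, similarity))
    return rows


# Hypothetical input: four items with their lists of neighbour ids.
data = [
    {"item": 1, "categories": [100, 101]},
    {"item": 2, "categories": [101, 102]},
    {"item": 3, "categories": [100]},
    {"item": 4, "categories": [103, 104]},
]
# Item 1 is compared with items 3 and 4 only; item 2 is never a source or target.
pairs = jaccard_between(data, source_ids=[1], target_ids=[3, 4])
print(pairs)
```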

**The following will find similarities between subgraphs based on the direct relationships of "CBX5" and "BACH1":**

results
![18.png](./picture/18.png)

## Exercise

Now try to compute the Jaccard similarity between TFs by comparing TF binding on chr1 between MCF7 and K562 cells.

# Reference

1. The Neo4j Graph Algorithms User Guide v3.5: [HTML](https://neo4j.com/docs/graph-algorithms/current/), [PDF](https://neo4j.com/docs/pdf/neo4j-graph-algorithms-3.5.pdf)

2. Graph Algorithms: [PDF](https://neo4j.com/lp/book-graph-algorithms-thanks/?aliId=eyJpIjoiT1lBd0tIeEh6Y2N6ajZCYiIsInQiOiJPemxyM1BhUG9uczhBdzFYRUwrM3Z3PT0ifQ%253D%253D)