# Quickstart Neo4j

Let's download and visualize the community graph data hosted on [Neo4j](https://neo4j.com/docs/). Neo4j can be queried with the [Cypher query language](https://neo4j.com/docs/cypher/).

## Downloading data from Neo4j 

Here is an example that uses the `Neo4JDownloader` to retrieve the nodes and edges of the graph.
Adapt the `Neo4JDownloader` class as needed to retrieve other information.

In [None]:
# Define your nodes

nodes = ["user", "repo", "org"]

# Define your relationships (edges)

relationships = {
    "member_of": {"type1": {"source": "user", "target": "org"}},
    "owner_of": {
        "type1": {"source": "user", "target": "repo"},
        "type2": {"source": "org", "target": "repo"},
    },
    "contributor_of": {
        "type1": {"source": "user", "target": "repo"},
        "type2": {"source": "org", "target": "repo"},
    },
    "parent_of": {
        "type1": {"source": "repo", "target": "repo"},
    },
}
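
To make the shape of this dictionary easier to read, here is a minimal sketch that flattens it into `(source, relation, target)` triples and prints an illustrative Cypher pattern for each one. The helpers `flatten_relationships` and `to_cypher_pattern` are hypothetical and not part of the `utils` package, and the sketch assumes that node labels and relationship types in the database are spelled exactly like the dictionary entries, which may not be how `Neo4JDownloader` builds its queries.

```python
# Hypothetical helpers for illustration only; not part of the project's utils package.
def flatten_relationships(relationships):
    """Return a sorted list of (source, relation, target) triples."""
    triples = []
    for relation, variants in relationships.items():
        for endpoints in variants.values():
            triples.append((endpoints["source"], relation, endpoints["target"]))
    return sorted(triples)


def to_cypher_pattern(source, relation, target):
    """Build an illustrative Cypher MATCH clause for one triple.

    Assumption: labels and relationship types in the database match the
    dictionary entries exactly. Check your schema before using this.
    """
    return f"MATCH (s:{source})-[r:{relation}]->(t:{target}) RETURN s, r, t"


# A shortened copy of the dictionary defined above, to keep the sketch self-contained.
example_relationships = {
    "member_of": {"type1": {"source": "user", "target": "org"}},
    "owner_of": {
        "type1": {"source": "user", "target": "repo"},
        "type2": {"source": "org", "target": "repo"},
    },
}

triples = flatten_relationships(example_relationships)
for triple in triples:
    print(to_cypher_pattern(*triple))
```

Each relationship name can connect several pairs of node types, which is why the dictionary nests `type1`, `type2`, and so on under one key.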

In [None]:
from pathlib import Path

from utils.neo4jdownloader import Neo4JDownloader


def get_downloader():
    # Connection settings are read from files mounted under /secrets.
    # strip() removes the trailing newline that such files often end with.
    secrets_dir = Path("/secrets")
    NEO4J_URI = (secrets_dir / "neo4j_uri").read_text().strip()
    NEO4J_USERNAME = (secrets_dir / "neo4j_user").read_text().strip()
    NEO4J_PASSWORD = (secrets_dir / "neo4j_password").read_text().strip()
    NEO4J_DATABASE = (secrets_dir / "neo4j_database").read_text().strip()

    return Neo4JDownloader(NEO4J_URI, NEO4J_USERNAME, NEO4J_PASSWORD, NEO4J_DATABASE)

def extract_data(nodes, relationships):
    downloader = get_downloader()

    try:
        nodes_ids, nodes_features = downloader.retrieve_nodes(nodes)
        edges_indices, edges_attributes = downloader.retrieve_edges(relationships)

        return nodes_ids, nodes_features, edges_indices, edges_attributes
    finally:
        downloader.close()

In [None]:
nodes_ids, nodes_features, edges_indices, edges_attributes = extract_data(nodes, relationships)
# example of looking at the output
# print(nodes_ids["org"])
# print(nodes_features["org"])
# print(edges_indices)

In [None]:
from utils.builder_dataframe import neo4j_to_dataframe
df = neo4j_to_dataframe(nodes_ids, nodes_features, edges_indices, relationships)
df.head()

Let's build a graph from the first 200 rows of the dataframe:

In [None]:
from utils.builder_models import df_to_pydantic_models
graph = df_to_pydantic_models(df.head(200), relationships)

Let's see how to filter the dataframe (a regular pandas dataframe) to keep only the rows that mention EPFL or SDSC. We will then use these subsets in the visualizations below.

In [None]:
import re

epfl_pattern = r"EPFL"
epfl_df = df[
    df['source'].astype(str).str.contains(epfl_pattern, flags=re.IGNORECASE, na=False) |
    df['target'].astype(str).str.contains(epfl_pattern, flags=re.IGNORECASE, na=False)
]
epfl_df.head()

In [None]:
# Non-capturing group: avoids the pandas warning about match groups in str.contains
sdsc_pattern = r"(?:SwissDataScienceCenter|SDSC)"
sdsc_df = df[
    df["source"].astype(str).str.contains(sdsc_pattern, flags=re.IGNORECASE, na=False) |
    df["target"].astype(str).str.contains(sdsc_pattern, flags=re.IGNORECASE, na=False)
]
sdsc_df.head()
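
The two filters above follow the same recipe, so they could be wrapped in one small helper. The sketch below is one possible way to do that; `filter_edges` is a hypothetical name, and the toy dataframe contains made-up values that only mimic the `source` and `target` columns used above.

```python
import re

import pandas as pd


def filter_edges(df, pattern, columns=("source", "target")):
    """Keep rows where any of `columns` matches `pattern`, ignoring case."""
    mask = pd.Series(False, index=df.index)
    for column in columns:
        mask |= df[column].astype(str).str.contains(
            pattern, flags=re.IGNORECASE, na=False
        )
    return df[mask]


# Made-up rows for illustration; the real dataframe comes from neo4j_to_dataframe.
toy_df = pd.DataFrame(
    {
        "source": ["alice", "epfl-lab", "bob"],
        "target": ["SwissDataScienceCenter/example", "repo-a", "EPFL/tools"],
    }
)

toy_epfl = filter_edges(toy_df, r"EPFL")
toy_sdsc = filter_edges(toy_df, r"(?:SwissDataScienceCenter|SDSC)")
print(toy_epfl)
print(toy_sdsc)
```

Passing the column names as a parameter makes it easy to extend the search to other columns later, if the dataframe has any that are worth matching.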

In [None]:
sdsc_graph = df_to_pydantic_models(sdsc_df, relationships)
epfl_graph = df_to_pydantic_models(epfl_df, relationships)

## Visualizing the Graph

Here are some code snippets for visualizing the graph you downloaded from Neo4j.

**Disclaimer: visualizing an entire graph of more than 5000 nodes may take a long time or produce an unreadable plot. Choose the subset you visualize carefully.**
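
One way to stay under that size is to count the distinct nodes in a dataframe before plotting and, if needed, keep only the leading rows that fit a node budget. The sketch below assumes the dataframe has `source` and `target` columns as in the cells above; `count_nodes` and `limit_edges` are hypothetical helpers, and the toy data is made up.

```python
import pandas as pd

MAX_NODES = 5000  # threshold mentioned in the disclaimer


def count_nodes(df):
    """Number of distinct node names across the source and target columns."""
    return pd.concat([df["source"], df["target"]]).nunique()


def limit_edges(df, max_nodes=MAX_NODES):
    """Return the leading rows of df that involve at most max_nodes distinct nodes."""
    seen = set()
    keep = 0
    for source, target in zip(df["source"], df["target"]):
        new_nodes = {source, target} - seen
        if len(seen) + len(new_nodes) > max_nodes:
            break
        seen |= new_nodes
        keep += 1
    return df.iloc[:keep]


# Made-up chain of edges: a -> b -> c -> d -> e
toy_edges = pd.DataFrame(
    {"source": ["a", "b", "c", "d"], "target": ["b", "c", "d", "e"]}
)

print(count_nodes(toy_edges))
small = limit_edges(toy_edges, max_nodes=3)
print(small)
```

Taking leading rows is the simplest policy; filtering on a name pattern, as done earlier for EPFL and SDSC, usually gives a more meaningful subset.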

For custom visualizations, if you are familiar with networkx, `visualization.py` provides a `create_networkx_graph` function that you can use as a starting point for other graph manipulations or visualizations.

### Visualize the entire graph or a subpart of the graph

We suggest the following steps:
1. Select rows from the dataframe (classic pandas operations).
2. Convert the selection to the pydantic models.
3. Run the visualizations.

In [None]:
from utils.visualization import visualize_graph
from pathlib import Path

output_path = Path("plots/graphs/graph_200_visualization.png")
visualize_graph(graph, output_path)

output_path = Path("plots/graphs/sdsc_graph.png")
visualize_graph(sdsc_graph, output_path)

output_path = Path("plots/graphs/epfl_graph.png")
visualize_graph(epfl_graph, output_path)

### Visualize Clusters in the graph
This function groups the nodes it is given into clusters.
The clusters are computed with the networkx function `nx.weakly_connected_components(G)`.
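
To illustrate what a weakly connected component is, here is a minimal standard-library sketch that computes them with a breadth-first search over an edge list, ignoring edge direction. It is not the networkx implementation and not what `visualize_clusters` runs internally; the function name simply mirrors the networkx one, and the toy edges are made up.

```python
from collections import defaultdict, deque


def weakly_connected_components(edges):
    """Group nodes that are linked once edge direction is ignored.

    `edges` is an iterable of (source, target) pairs. Nodes that appear in
    no edge are not covered by this sketch.
    """
    neighbours = defaultdict(set)
    for source, target in edges:
        neighbours[source].add(target)
        neighbours[target].add(source)

    seen = set()
    components = []
    for start in neighbours:
        if start in seen:
            continue
        # Breadth-first search collects everything reachable from `start`.
        component = set()
        queue = deque([start])
        seen.add(start)
        while queue:
            node = queue.popleft()
            component.add(node)
            for neighbour in neighbours[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        components.append(component)
    return components


# Made-up edges: alice and org-x both point to repo-a, bob points to repo-b.
toy_edges = [("alice", "repo-a"), ("org-x", "repo-a"), ("bob", "repo-b")]
components = weakly_connected_components(toy_edges)
print(components)
```

Here `alice` and `org-x` end up in the same cluster even though no directed path connects them, because both point to `repo-a`.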

In [None]:
from utils.visualization import visualize_clusters
from pathlib import Path
output_dir = Path("plots/clusters/")

cluster_prefix_name = "200_first_nodes"
visualize_clusters(graph, output_dir, cluster_prefix_name)

cluster_prefix_name = "sdsc"
visualize_clusters(sdsc_graph, output_dir, cluster_prefix_name)

cluster_prefix_name = "epfl"
visualize_clusters(epfl_graph, output_dir, cluster_prefix_name)


### Follow-up on this example visualization

For EPFL, we can see that simple string matching misses many of the EPFL-affiliated repositories. How could other tools and approaches complement it to build a better EPFL graph? Your turn to play around, good luck!