In [1]:
from IPython import display

**Links**
- [Neo4j Browser Interface](http://localhost:7474/browser/)
- [Cypher Query Language Docs](https://neo4j.com/developer/cypher/)

# Importing MMR dataset in Neo4j
Because the dataset (usernames plus MMR files) is very large, I used the [batch-import tool](https://neo4j.com/docs/operations-manual/current/tools/import/) to load it into a `graph.db` Neo4j database. As the guidelines recommend, I kept the header files separate from the actual data files: the headers tell the import tool how to interpret each column, so that the database is built correctly.

## Data Preparation
The database consists of **nodes** labeled **User** and relationships of type **MENTIONS**, so a generic (undirected) node-to-node pattern is written `(n:User)-[r:MENTIONS]-(m:User)` in the Cypher query language.

**Note**: since there is only one node label (all nodes are `User`) and one relationship type (all relationships are `MENTIONS`), both may be omitted when querying in Cypher, and the generic node-to-node pattern becomes `(n)-[r]-(m)`.

The contents of `mmr_encoded_header.csv` and `usernames_header.csv` are:
- **MMR Header**: `:START_ID(User),:END_ID(User)`
- **Usernames Header**: `username,encoding:ID(User)`

Note that the `encoding` property of the `User` node was deliberately not named `id`, to avoid confusion with Neo4j's internal built-in `<id>` property. Each line of the MMR dataset represents one relationship, and its two endpoints are resolved against the column marked `ID(User)` in the usernames header file.
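
To make the header/data split concrete, here is a minimal Python sketch that writes the two header rows listed above next to tiny data files, then checks that every relationship row points at a known `encoding`. The sample usernames and encodings, and the `write_csv` and `dangling_rows` helpers, are invented for illustration and are not part of the real dataset.

```python
import csv
import tempfile
from pathlib import Path

# Header rows as listed above; the data rows below are invented samples.
USERNAMES_HEADER = ["username", "encoding:ID(User)"]
MMR_HEADER = [":START_ID(User)", ":END_ID(User)"]


def write_csv(path, rows):
    """Write rows (lists of values) to path as comma-separated lines."""
    with open(path, "w", newline="") as handle:
        csv.writer(handle).writerows(rows)


def dangling_rows(usernames_path, mmr_path):
    """Return relationship rows whose start or end ID is not a known encoding."""
    with open(usernames_path, newline="") as handle:
        known = {row[1] for row in csv.reader(handle)}
    with open(mmr_path, newline="") as handle:
        return [row for row in csv.reader(handle)
                if row[0] not in known or row[1] not in known]


workdir = Path(tempfile.mkdtemp())
write_csv(workdir / "usernames_header.csv", [USERNAMES_HEADER])
write_csv(workdir / "usernames.csv", [["alice", 0], ["bob", 1], ["carol", 2]])
write_csv(workdir / "mmr_encoded_header.csv", [MMR_HEADER])
write_csv(workdir / "mmr_encoded.csv", [[0, 1], [1, 2], [2, 7]])

# The last sample row references encoding 7, which no user has.
bad = dangling_rows(workdir / "usernames.csv", workdir / "mmr_encoded.csv")
print(bad)
```

A check like this is cheap compared with a failed bulk import, since a relationship row that references a missing ID cannot be connected to any node.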

## Network Data Import
After these preparation steps, I was then ready to bulk import the data with the following shell command:
```
neo4j-admin import --id-type INTEGER --nodes:User "path/to/usernames_header.csv,path/to/usernames.csv" --relationships:MENTIONS "path/to/mmr_encoded_header.csv,path/to/mmr_encoded.csv"
```
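
When the import is driven from the notebook rather than typed by hand, the same command can be assembled as an argument list. This is only a sketch: `build_import_command` is a hypothetical helper, it uses exactly the flags shown above, and the paths are the same placeholders.

```python
import shlex


def build_import_command(nodes_header, nodes_data, rels_header, rels_data):
    """Assemble the neo4j-admin import call shown above as an argument list."""
    return [
        "neo4j-admin", "import",
        "--id-type", "INTEGER",
        # Header file first, then the data file, joined by a comma.
        "--nodes:User", f"{nodes_header},{nodes_data}",
        "--relationships:MENTIONS", f"{rels_header},{rels_data}",
    ]


command = build_import_command(
    "path/to/usernames_header.csv", "path/to/usernames.csv",
    "path/to/mmr_encoded_header.csv", "path/to/mmr_encoded.csv",
)
print(" ".join(shlex.quote(part) for part in command))

# To actually run it (assumes neo4j-admin is on PATH):
# import subprocess
# subprocess.run(command, check=True)
```

Passing a list to `subprocess.run` avoids shell quoting issues if the real paths contain spaces.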
The tool turned out to be fairly quick and I ended up with a nice complete network consisting of:

| Total Nodes | Total Relationships |
|---|---|
| 612.497.446 (~600M) | 89.486.651 (~90M) |

Last but not least, an **index** has been created on the encoding property by adding a unique property constraint:
```
CREATE CONSTRAINT ON (n:User) ASSERT n.encoding IS UNIQUE
```
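
A uniqueness constraint cannot be created if two existing nodes already share the same value, so it can be worth scanning the node file for repeated `encoding` values. The following is a minimal sketch of such a check; `duplicate_encodings` is a hypothetical helper and the sample rows are invented, following the `username,encoding` column order of the usernames file.

```python
import csv
import io
from collections import Counter


def duplicate_encodings(lines):
    """Return encodings that occur more than once in username,encoding rows."""
    counts = Counter(row[1] for row in csv.reader(lines) if row)
    return sorted(value for value, n in counts.items() if n > 1)


# Invented sample rows: "bob" and "carol" share encoding 1.
sample = io.StringIO("alice,0\nbob,1\ncarol,1\n")
duplicates = duplicate_encodings(sample)
print(duplicates)
```

For the real file, an open file handle can be passed in place of the `io.StringIO` sample, since `csv.reader` accepts any iterable of lines.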

# Working with Neo4j and Cypher
Neo4j comes in handy for quickly querying the network and for inspecting its structure and some examples in the browser GUI. After playing around with a few example queries, I noticed a downside of the current network structure as defined in the previous sections.

Possible optimizations:
- Multiple relationships are added between the same pair of users: why not attach extra information to each of them, such as the period (after encoding it)?
- I could remove the username string from the nodes to save space, but let's measure the performance first.
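
The first idea could look roughly like the sketch below. Everything in it is an assumption made for illustration: that a period is a calendar month, that it is encoded as the number of months since a `base_year`, and that it is stored in a relationship property called `period` declared as `period:int` in the header. None of this comes from the actual dataset.

```python
# Hypothetical header for a MENTIONS file that also carries an encoded period.
HEADER = ":START_ID(User),:END_ID(User),period:int"


def encode_period(year, month, base_year=2000):
    """Map a (year, month) pair to a small integer: months since base_year."""
    return (year - base_year) * 12 + (month - 1)


def to_rows(mentions):
    """Turn (start, end, (year, month)) tuples into CSV lines for the import."""
    return [f"{start},{end},{encode_period(year, month)}"
            for start, end, (year, month) in mentions]


# Invented sample: user 0 mentions user 1 twice in March 2016, once in April.
mentions = [(0, 1, (2016, 3)), (0, 1, (2016, 3)), (0, 1, (2016, 4))]
rows = to_rows(mentions)
print(HEADER)
print("\n".join(rows))
```

With a property like this, parallel relationships between the same two users become distinguishable by period instead of being indistinct duplicates.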


In [2]:
display.display_png()