# Data Import

In [None]:
import sys
!{sys.executable} -m pip install -r requirements.txt

In [2]:
import pandas as pd

In [3]:
# Postings and votes are split into one file per timeframe.
timeframes = ['01052019_07052019', '08052019_15052019', '16052019_23052019', '24052019_31052019']

# IDs are read as strings so that they can be prefixed with a type letter later on.
postings = pd.concat([
    pd.read_csv(
        f'data/Postings_{timeframe}.csv',
        sep=';',
        dtype={'ID_Posting': str, 'ID_Posting_Parent': str, 'ID_CommunityIdentity': str, 'ID_Article': str},
        parse_dates=['ArticlePublishingDate', 'PostingCreatedAt', 'UserCreatedAt'],
    )
    for timeframe in timeframes
])
votes = pd.concat([
    pd.read_csv(
        f'data/Votes_{timeframe}.csv',
        sep=';',
        dtype={'ID_CommunityIdentity': str, 'ID_Posting': str},
        parse_dates=['VoteCreatedAt', 'UserCreatedAt'],
    )
    for timeframe in timeframes
])
following = pd.read_csv(
    'data/Following_Ignoring_Relationships_01052019_31052019.csv',
    sep=';',
    dtype={'ID_CommunityIdentity': str, 'ID_CommunityIdentityConnectedTo': str},
)

First, we have to read the data stored in `data/` in order to transform it into a form that can be loaded into a Neo4j database.

Neo4j supports several ways to import data, but most of them are not well suited to datasets as large as the one we are dealing with here. The most efficient option is the [neo4j-admin import command](https://neo4j.com/docs/operations-manual/current/tutorial/neo4j-admin-import/). However, it requires the data to be in a very specific format. Therefore, we have to transform our data accordingly and store it in the `graph/` directory.

To load data into Neo4j this way, we have to create separate CSV files for nodes and edges. Each node CSV file needs a header row that includes an `:ID` column, and the values in that column have to be globally unique. Each edge CSV file needs a header row that includes the columns `:START_ID` and `:END_ID`, which hold the ID of the start node and the ID of the end node, respectively.
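
As a minimal sketch of this header layout, the toy frames below are serialized in memory so that the header rows can be inspected. The IDs and the `title` property are hypothetical and not taken from the dataset.

```python
import pandas as pd

# Toy node table: `:ID` holds the globally unique identifier of each node.
nodes = pd.DataFrame({':ID': ['p1', 'p2', 'u1'], 'title': ['first', 'second', 'alice']})

# Toy edge table: each row connects a start node to an end node by ID.
edges = pd.DataFrame({':START_ID': ['p2', 'p1'], ':END_ID': ['p1', 'u1']})

# Without a path argument, to_csv returns the CSV content as a string.
node_csv = nodes.to_csv(index=False)
edge_csv = edges.to_csv(index=False)

node_header = node_csv.splitlines()[0]
edge_header = edge_csv.splitlines()[0]
print(node_header)
print(edge_header)
```

Any additional column, such as `title` here, ends up as a property of the node or edge.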

## Nodes

In [4]:
postings['ID_Posting_Global'] = 'p' + postings['ID_Posting']
postings['ID_Posting_Parent_Global'] = 'p' + postings['ID_Posting_Parent']
postings['ID_CommunityIdentity_Global'] = 'u' + postings['ID_CommunityIdentity']
postings['ID_Article_Global'] = 'a' + postings['ID_Article']
votes['ID_Posting_Global'] = 'p' + votes['ID_Posting']
votes['ID_CommunityIdentity_Global'] = 'u' + votes['ID_CommunityIdentity']
following['ID_CommunityIdentity_Global'] = 'u' + following['ID_CommunityIdentity']
following['ID_CommunityIdentityConnectedTo_Global'] = 'u' + following['ID_CommunityIdentityConnectedTo']

Our dataset contains different types of nodes, and each node has an ID that is unique within its type but not across types. Therefore, we have to create a new, globally unique ID for each node. We do this by prefixing the ID with a letter that encodes the node type. For example, the `Posting` node with the ID `1` gets the new ID `p1`, while the user with the ID `1` gets `u1`.
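
The effect of the prefix can be illustrated with a small sketch on hypothetical toy IDs, where a posting and a user share the raw ID `1`. Note that the concatenation only works because the IDs are stored as strings.

```python
import pandas as pd

# Hypothetical toy IDs: posting '1' and user '1' collide as raw values.
posting_ids = pd.Series(['1', '2'])
user_ids = pd.Series(['1', '3'])
raw_overlap = set(posting_ids) & set(user_ids)

# Prefix each ID with a letter for its node type.
posting_global = 'p' + posting_ids
user_global = 'u' + user_ids
global_overlap = set(posting_global) & set(user_global)

print(raw_overlap)
print(global_overlap)
```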

### Postings

In [5]:
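# Parent postings that are referenced but not contained in the postings files.
missing_nodes = postings['ID_Posting_Parent'].dropna()
missing_nodes = missing_nodes[~missing_nodes.isin(postings['ID_Posting'])]
# Several replies can share the same missing parent, so keep each ID only once.
missing_nodes = missing_nodes.drop_duplicates()
missing_nodes = pd.DataFrame({'ID_Posting_Global': 'p' + missing_nodes, 'ID_Posting': missing_nodes})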

posting_nodes = postings[['ID_Posting_Global', 'ID_Posting', 'PostingHeadline', 'PostingComment', 'PostingCreatedAt']].copy()
posting_nodes = pd.concat([posting_nodes, missing_nodes])
posting_nodes.rename(columns={'ID_Posting_Global': ':ID', 'ID_Posting': 'id:long', 'PostingHeadline': 'headline', 'PostingComment': 'comment', 'PostingCreatedAt': 'created_at'}, inplace=True)
# Split the posting nodes into two files of roughly equal size.
half = len(posting_nodes) // 2
posting_nodes.iloc[:half].to_csv('graph/posting_1.csv', index=False, date_format='%Y-%m-%dT%H:%M:%S.%fZ')
posting_nodes.iloc[half:].to_csv('graph/posting_2.csv', index=False, date_format='%Y-%m-%dT%H:%M:%S.%fZ')

For postings, we can find information about the nodes in the postings CSV files. However, we have to consider that the column `ID_Posting_Parent` can reference a posting for which we do not have any information in the CSV files. For such postings, we create placeholder nodes that only carry an ID.
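
A minimal sketch of this lookup on hypothetical toy data: the parent `99` is referenced by two postings but does not appear as a posting itself. The `drop_duplicates()` call is included here so that a parent referenced by several replies yields only one placeholder row.

```python
import pandas as pd

# Hypothetical toy postings: '99' is referenced as a parent but never listed.
toy = pd.DataFrame({
    'ID_Posting': ['1', '2', '3', '4'],
    'ID_Posting_Parent': [None, '1', '99', '99'],
})

# Keep only parent references that do not match any known posting ID.
parents = toy['ID_Posting_Parent'].dropna()
unknown_parents = parents[~parents.isin(toy['ID_Posting'])].drop_duplicates()

# Placeholder nodes carry nothing but their IDs.
placeholders = pd.DataFrame({
    'ID_Posting_Global': 'p' + unknown_parents,
    'ID_Posting': unknown_parents,
})
print(placeholders)
```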

### Users

In [6]:
missing_nodes = pd.concat([following['ID_CommunityIdentity'].dropna(), following['ID_CommunityIdentityConnectedTo'].dropna()])
missing_nodes = pd.DataFrame({'ID_CommunityIdentity_Global': 'u' + missing_nodes, 'ID_CommunityIdentity': missing_nodes})

user_nodes = pd.concat([missing_nodes, postings[['ID_CommunityIdentity_Global', 'ID_CommunityIdentity', 'UserCommunityName', 'UserGender', 'UserCreatedAt']], votes[['ID_CommunityIdentity_Global', 'ID_CommunityIdentity', 'UserCommunityName', 'UserGender', 'UserCreatedAt']]])
user_nodes = user_nodes.groupby('ID_CommunityIdentity').last().reset_index()
user_nodes.rename(columns={'ID_CommunityIdentity_Global': ':ID', 'ID_CommunityIdentity': 'id:long', 'UserCommunityName': 'community_name', 'UserGender': 'gender', 'UserCreatedAt': 'created_at'}, inplace=True)
user_nodes.to_csv('graph/user.csv', index=False, date_format='%Y-%m-%dT%H:%M:%S.%fZ')

For users, we encounter a similar problem: the postings and votes CSV files contain information about the users they reference, but the following file only contains user IDs. Therefore, we have to consider that there might be users that are referenced in the following CSV file, but for which we do not have any information.

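By concatenating all information about users and grouping it by the user ID, we can ensure that we only create one node per user. Because `last()` returns the last non-null value of each column within a group, the placeholder rows from the following file do not overwrite attributes known from the postings or votes files.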
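
The following sketch shows this deduplication on hypothetical toy data. User `7` appears both as a bare placeholder and with attributes, while user `8` is only known by ID.

```python
import pandas as pd

# Placeholder rows: only the ID is known (as in the following file).
placeholder = pd.DataFrame({'ID_CommunityIdentity': ['7', '8']})

# Rows with attributes (as in the postings and votes files); one gender value is missing.
known = pd.DataFrame({
    'ID_CommunityIdentity': ['7', '7'],
    'UserCommunityName': ['alice', 'alice'],
    'UserGender': ['f', None],
})

combined = pd.concat([placeholder, known])

# last() keeps the last non-null value per column within each group.
users = combined.groupby('ID_CommunityIdentity').last().reset_index()
print(users)
```

User `7` keeps the gender `f` even though the last row for that user has no value, and user `8` remains a node without attributes.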

### Articles

In [7]:
article_nodes = postings[['ID_Article_Global', 'ID_Article', 'ArticlePublishingDate', 'ArticleTitle', 'ArticleChannel', 'ArticleRessortName']]
article_nodes = article_nodes.groupby('ID_Article').last().reset_index()
article_nodes.rename(columns={'ID_Article_Global': ':ID', 'ID_Article': 'id:long', 'ArticlePublishingDate': 'publishing_date', 'ArticleTitle': 'title', 'ArticleChannel': 'channel', 'ArticleRessortName': 'ressort'}, inplace=True)
article_nodes.to_csv('graph/article.csv', index=False, date_format='%Y-%m-%dT%H:%M:%S.%fZ')

Information about articles is only contained in the postings file. Similar to the users, we have to ensure that we only create one node per article by grouping the information by the article ID.

## Edges

By using the globally unique IDs we created for the nodes, we can now create the edges CSV files by simply referencing the IDs of the nodes for each connection.
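
Since an edge can only be attached to nodes that exist, it can be useful to check that every referenced ID appears in a node file before importing. The helper `dangling_edges` below is a hypothetical illustration on toy data and not part of the original pipeline.

```python
import pandas as pd

# Toy node IDs and edges; 'p3' is referenced by an edge but is not a known node.
node_ids = pd.Series(['p1', 'p2', 'u1'], name=':ID')
edges = pd.DataFrame({
    ':START_ID': ['p2', 'p1', 'p3'],
    ':END_ID': ['p1', 'u1', 'u1'],
})


def dangling_edges(edges: pd.DataFrame, node_ids: pd.Series) -> pd.DataFrame:
    """Return the edges whose start or end ID is not among the node IDs."""
    known = set(node_ids)
    mask = ~edges[':START_ID'].isin(known) | ~edges[':END_ID'].isin(known)
    return edges[mask]


dangling = dangling_edges(edges, node_ids)
print(dangling)
```

An empty result means that all edges reference known nodes.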

### Parent Posting Relationship

In [8]:
has_parent_edges = postings[['ID_Posting_Global', 'ID_Posting_Parent_Global']][postings['ID_Posting_Parent_Global'].notnull()].copy()
has_parent_edges.rename(columns={'ID_Posting_Global': ':START_ID', 'ID_Posting_Parent_Global': ':END_ID'}, inplace=True)
has_parent_edges.to_csv('graph/has_parent.csv', index=False, date_format='%Y-%m-%dT%H:%M:%S.%fZ')

### Posting Article Relationship

In [9]:
posted_on_edges = postings[['ID_Posting_Global', 'ID_Article_Global']].copy()
posted_on_edges.rename(columns={'ID_Posting_Global': ':START_ID', 'ID_Article_Global': ':END_ID'}, inplace=True)
posted_on_edges.to_csv('graph/posted_on.csv', index=False, date_format='%Y-%m-%dT%H:%M:%S.%fZ')

### User Posting Relationship

In [10]:
posted_by_edges = postings[['ID_Posting_Global', 'ID_CommunityIdentity_Global']].copy()
posted_by_edges.rename(columns={'ID_Posting_Global': ':START_ID', 'ID_CommunityIdentity_Global': ':END_ID'}, inplace=True)
posted_by_edges.to_csv('graph/posted_by.csv', index=False, date_format='%Y-%m-%dT%H:%M:%S.%fZ')

### Following/Ignoring Relationship

In [11]:
follows_edges = following[following['ID_CommunityConnectionType'] == 1][['ID_CommunityIdentity_Global', 'ID_CommunityIdentityConnectedTo_Global']].copy()
follows_edges.rename(columns={'ID_CommunityIdentity_Global': ':START_ID', 'ID_CommunityIdentityConnectedTo_Global': ':END_ID'}, inplace=True)
follows_edges.to_csv('graph/follows.csv', index=False, date_format='%Y-%m-%dT%H:%M:%S.%fZ')

In [12]:
ignores_edges = following[following['ID_CommunityConnectionType'] == 2][['ID_CommunityIdentity_Global', 'ID_CommunityIdentityConnectedTo_Global']].copy()
ignores_edges.rename(columns={'ID_CommunityIdentity_Global': ':START_ID', 'ID_CommunityIdentityConnectedTo_Global': ':END_ID'}, inplace=True)
ignores_edges.to_csv('graph/ignores.csv', index=False, date_format='%Y-%m-%dT%H:%M:%S.%fZ')

### Upvoted/Downvoted Relationship

In [13]:
upvoted_edges = votes[votes['VotePositive'] == 1][['ID_CommunityIdentity_Global', 'ID_Posting_Global', 'VoteCreatedAt']].copy()
upvoted_edges.rename(columns={'ID_CommunityIdentity_Global': ':START_ID', 'ID_Posting_Global': ':END_ID', 'VoteCreatedAt': 'created_at'}, inplace=True)
# Split the upvote edges into two files of roughly equal size.
half = len(upvoted_edges) // 2
upvoted_edges.iloc[:half].to_csv('graph/upvoted_1.csv', index=False, date_format='%Y-%m-%dT%H:%M:%S.%fZ')
upvoted_edges.iloc[half:].to_csv('graph/upvoted_2.csv', index=False, date_format='%Y-%m-%dT%H:%M:%S.%fZ')

In [14]:
downvoted_edges = votes[votes['VoteNegative'] == 1][['ID_CommunityIdentity_Global', 'ID_Posting_Global', 'VoteCreatedAt']].copy()
downvoted_edges.rename(columns={'ID_CommunityIdentity_Global': ':START_ID', 'ID_Posting_Global': ':END_ID', 'VoteCreatedAt': 'created_at'}, inplace=True)
downvoted_edges.to_csv('graph/downvoted.csv', index=False, date_format='%Y-%m-%dT%H:%M:%S.%fZ')

## Importing

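Finally, we can import the data stored in the `graph/` directory into a Neo4j database by using the following command. The paths in the command assume that the files from `graph/` have been made available in Neo4j's `import/` directory: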

```{bash}
bin/neo4j-admin database import full \
  --nodes=Posting=import/posting_1.csv,import/posting_2.csv \
  --nodes=User=import/user.csv \
  --nodes=Article=import/article.csv \
  --relationships=HAS_PARENT=import/has_parent.csv \
  --relationships=POSTED_ON=import/posted_on.csv \
  --relationships=POSTED_BY=import/posted_by.csv \
  --relationships=FOLLOWS=import/follows.csv \
  --relationships=IGNORES=import/ignores.csv \
  --relationships=UPVOTED=import/upvoted_1.csv,import/upvoted_2.csv \
  --relationships=DOWNVOTED=import/downvoted.csv \
  neo4j
```