# An Analysis of Social Network Metrics based on Neo4j Graph Database

## Abstract


## Introduction
The growing number of social network users has increased investment in, and the dissemination of, social media \cite{social}. In this scenario, companies have recognized the potential of social networks to influence customers and have incorporated social media marketing into their business strategies. As a result, studies have emerged on the relationships between online publications and users' interactions with them, as well as on data mining and prediction \cite{social}.

Given the volume of data generated every day and its characteristics, traditional relational databases have some limitations. For this kind of data, in which connectivity and topological information are important, NoSQL (Not Only SQL) databases have proven to be a good approach, especially graph databases (GDBs) \cite{neo4j}. A graph is a collection of vertices and edges, where the vertices (nodes) represent entities and the edges represent the relationships among them. This structure allows us to model all kinds of contexts \cite{graphDB}.

Considering the scenario and the relevance of studies in social media, this research aims to combine resources that are offered by GDBs and data from social media. The main goal of this research is to index information about social media metrics in a GDB, in order to discover relations and patterns inside the data. 

The research question is whether, and how, a GDB can facilitate the analysis of social media metrics. Considering the characteristics of GDBs and the nature of social network information, our hypothesis is that storing data in graphs enables more accessible and understandable queries than other storage methods, such as SQL databases, files, and other common options.

This paper is organized as follows. Section 2 presents

## Related Works

This research builds on previous work on social media and graph databases, in particular on research involving Facebook metrics and GDBs.

Moro et al. \cite{social} present a data mining approach to predict the performance metrics of posts published on brands' Facebook pages. To validate their proposal, they used a dataset of 790 posts published by a company in 2014, from which 12 performance metrics were extracted. The resulting dataset, called Facebook Metrics, is used in the experiment of this paper (see Section \ref{description}).

Souza et al. \cite{ewsdn} present an approach that provides semantic modeling language support in a GDB holding data about computer networks. They present a data model and a set of primitives to answer questions about the network, for example, finding the shortest path between nodes and counting the in-degree of a specific node. They selected Neo4j as the GDB and Cypher, a query language similar to SQL.
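
To make the two example questions concrete, the following plain-Python sketch computes an in-degree and a shortest path (fewest hops) on a small directed graph. It is only an illustration of the questions themselves: it is not the implementation of Souza et al., it does not use Neo4j or Cypher, and the edge list is hypothetical.

```python
from collections import deque


def in_degree(edges, node):
    """Count the directed edges that end at the given node."""
    return sum(1 for _, target in edges if target == node)


def shortest_path(edges, start, goal):
    """Return one shortest path (fewest hops) as a list of nodes, or None."""
    # Build an adjacency list from the directed edge list.
    neighbors = {}
    for source, target in edges:
        neighbors.setdefault(source, []).append(target)
    # Breadth-first search, remembering how each node was first reached.
    previous = {start: None}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if current == goal:
            path = []
            while current is not None:
                path.append(current)
                current = previous[current]
            return path[::-1]
        for nxt in neighbors.get(current, []):
            if nxt not in previous:
                previous[nxt] = current
                queue.append(nxt)
    return None


# Hypothetical directed network: a -> b -> d, a -> c -> d, d -> e.
edges = [("a", "b"), ("b", "d"), ("a", "c"), ("c", "d"), ("d", "e")]
print(in_degree(edges, "d"))
print(shortest_path(edges, "a", "e"))
```

In a GDB, questions of this kind are expressed as queries over stored nodes and relationships instead of being hand-coded as traversals.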

Robinson et al. \cite{graphDB} explain GDB models and two important characteristics: native graph storage and native graph processing.

## Description of data

The dataset used in this research \cite{social} is available at http://archive.ics.uci.edu/ml/datasets/Facebook+metrics. It is composed of 19 features and 500 instances. The features fall into two groups: (i) input features used for modeling and (ii) output features to be modeled. For this research, we selected the input features and the output features related to the number of interactions:

* Category - Manual content characterization: action (special offers and contests), product (direct advertisement, explicit brand content), and inspiration (non-explicit brand related content).
* Page total likes - Number of people who have liked the company's page.
* Type - Type of content (Link, Photo, Status, Video).
* Post month - Month the post was published (January, ..., December).
* Post hour - Hour the post was published (0, 1, 2, ..., 23).
* Post weekday - Weekday the post was published (Sunday, ..., Saturday).
* Paid - Whether the company paid Facebook for advertising (yes, no).
* Comments - Number of comments on the publication.
* Likes - Number of "Likes" on the publication.
* Shares - Number of times the publication was shared.


```python
import pandas as pd
import numpy as np

data = pd.read_csv('../data/dataset_Facebook.csv', sep=";")
print(data.columns)
```

```
Index(['Page total likes', 'Type', 'Category', 'Post Month', 'Post Weekday',
       'Post Hour', 'Paid', 'Lifetime Post Total Reach',
       'Lifetime Post Total Impressions', 'Lifetime Engaged Users',
       'Lifetime Post Consumers', 'Lifetime Post Consumptions',
       'Lifetime Post Impressions by people who have liked your Page',
       'Lifetime Post reach by people who like your Page',
       'Lifetime People who have liked your Page and engaged with your post',
       'comment', 'like', 'share', 'Total Interactions'],
      dtype='object')
```
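
The feature selection described in this section can be sketched as follows. The column names are taken from the printed index; the two data rows are made-up placeholders, not records from the dataset, and the standard `csv` module is used so that the snippet runs without the CSV file.

```python
import csv
import io

# Input features and interaction outputs, named as in the printed column index.
INPUT_COLUMNS = ['Page total likes', 'Type', 'Category', 'Post Month',
                 'Post Weekday', 'Post Hour', 'Paid']
INTERACTION_COLUMNS = ['comment', 'like', 'share']
SELECTED_COLUMNS = INPUT_COLUMNS + INTERACTION_COLUMNS

# Placeholder rows in the same ';'-separated layout. The values are
# illustrative, NOT taken from the dataset, and only two of the unused
# columns are included for brevity.
SAMPLE_CSV = """\
Page total likes;Type;Category;Post Month;Post Weekday;Post Hour;Paid;Lifetime Post Total Reach;comment;like;share;Total Interactions
1000;Photo;2;12;4;3;0;500;4;79;17;100
1200;Status;1;11;3;10;1;800;5;130;29;164
"""


def select_features(rows):
    """Keep only the columns used in this research for each row."""
    return [{name: row[name] for name in SELECTED_COLUMNS} for row in rows]


reader = csv.DictReader(io.StringIO(SAMPLE_CSV), delimiter=';')
selected = select_features(reader)
print(selected[0])

# With the pandas DataFrame loaded earlier, the equivalent selection would be
# data[SELECTED_COLUMNS].
```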


## Methodology 

## Methods
This research uses the Neo4j graph database, which provides client libraries for Python.

The metrics data from the social network are indexed in a GDB, and then queries are performed in order to discover what patterns emerge. The GDB selected is Neo4j \cite{graphDB}, chosen for its graph model (property graph), its available documentation, and its free access (Community Edition).
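
As a minimal sketch of the indexing and querying steps, the function below turns one row of metrics into a parameterized Cypher statement, and a second constant shows one possible pattern query. The node labels (`Post`, `Type`), the relationship name (`HAS_TYPE`), and the property names are assumptions made for illustration only; they are not the data model of this research.

```python
def build_post_statement(row):
    """Return a (Cypher text, parameters) pair that stores one post.

    The labels Post and Type, the relationship HAS_TYPE, and the property
    names are hypothetical; adapt them to the actual data model.
    """
    query = (
        "MERGE (t:Type {name: $type}) "
        "CREATE (p:Post {month: $month, hour: $hour, paid: $paid, "
        "comments: $comments, likes: $likes, shares: $shares}) "
        "CREATE (p)-[:HAS_TYPE]->(t)"
    )
    params = {
        "type": row["Type"],
        "month": int(row["Post Month"]),
        "hour": int(row["Post Hour"]),
        "paid": bool(int(row["Paid"])),
        "comments": int(row["comment"]),
        "likes": int(row["like"]),
        "shares": int(row["share"]),
    }
    return query, params


# One possible pattern query under the same hypothetical model:
# average number of likes per content type.
AVERAGE_LIKES_BY_TYPE = (
    "MATCH (p:Post)-[:HAS_TYPE]->(t:Type) "
    "RETURN t.name AS type, avg(p.likes) AS avg_likes "
    "ORDER BY avg_likes DESC"
)

# Placeholder row (illustrative values, not a record from the dataset).
example_row = {"Type": "Photo", "Post Month": 12, "Post Hour": 3, "Paid": 0,
               "comment": 4, "like": 79, "share": 17}
query, params = build_post_statement(example_row)
print(query)
print(params)
```

With the official `neo4j` Python driver, such a pair could be executed inside a session with `session.run(query, params)`. This requires a running Neo4j instance and is therefore not shown here.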

In summary, Figure 1 shows the base workflow used in this experiment.
### Workflow

![Research Workflow](../figures/research.png)



### Data Model
Neo4j's data model is the *Labeled Property Graph*. In this model:
* Nodes and relationships contain properties (key-value pairs);
* Nodes can be labeled with one or more labels;
* Relationships are named and directed (they always have a start and an end node).

![Example of Property Graph](../figures/property-graph.svg)

These characteristics allow us to represent the data in an intuitive way \cite{graphDB}. Furthermore, all necessary information can be modeled and stored. For modeling data in Neo4j, the following elements are considered:
* Node - An entity.
* Label - The type of a node; it has a name and groups nodes into subsets.
* Relationship - An interaction between nodes.
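
The elements above can be illustrated with a small in-memory sketch in Python. It only mirrors the concepts of the labeled property graph and is not how Neo4j stores data; the class names, the `HAS_TYPE` relationship, and the example values are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """An entity with a set of labels and key-value properties."""
    labels: frozenset
    properties: dict = field(default_factory=dict)


@dataclass
class Relationship:
    """A named, directed connection from a start node to an end node."""
    name: str
    start: Node
    end: Node
    properties: dict = field(default_factory=dict)


def nodes_with_label(nodes, label):
    """Labels group nodes into subsets; return the subset for one label."""
    return [node for node in nodes if label in node.labels]


# Hypothetical example: a post that has a content type.
post = Node(frozenset({"Post"}), {"likes": 79})
photo = Node(frozenset({"Type"}), {"name": "Photo"})
has_type = Relationship("HAS_TYPE", start=post, end=photo)

print(nodes_with_label([post, photo], "Post"))
```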

## Results
The main contribution of this research is the data model designed for Facebook metrics, since it offers a visual understanding of how significant the publications were in the social media environment.
![Data Model of Facebook Metrics](../figures/data-model.svg)

## Future Works
Future work includes creating further queries about data relations and patterns.
It also includes investigating how data mining can be combined with a GDB, based on the results of this research.

## References
S. Moro, P. Rita and B. Vala. Predicting social media performance metrics and evaluation of the impact on brand building: A data mining approach. Journal of Business Research, Elsevier, In press, 2016.