# Preprocessing Data
The dataset used in these experiments, Facebook Metrics, is available from the [UCI Machine Learning Repository](http://archive.ics.uci.edu/ml/datasets/Facebook+metrics). The related files are also included in this repository, in the [/data](../data) directory.

## Facebook Metrics Dataset
The data covers posts published during 2014 (from 1 January to 31 December) on the Facebook page of a renowned cosmetics brand. The dataset has 500 instances and 19 attributes.

### Describing Attributes
The authors published the dataset with the original Facebook metrics plus data mining outputs. For these initial experiments, however, only some of the attributes were selected, as explained in the section **Selecting Attributes** below.

### Original Dataset Attributes

In [None]:
import pandas as pd
import numpy as np
data = pd.read_csv('../data/dataset_Facebook.csv', sep=";")

In [None]:
print("Original Dataset Attributes")
#print(data.head())
print(data.columns)

In [None]:
#Checking the dataset
#print("Dataset")
#data

### Selecting Attributes
Because the main goal of this step is to index the information in a graph database, only the attributes that represent each post's raw data were used:

* Page total likes
* Type (Link, Photo, Status, Video)
* Category (Action, Product, Inspiration)
* Post Month
* Post Weekday
* Post Hour
* Paid
* comment
* like
* share

#### Reading dataset
In the code below, the `usecols` argument of `pd.read_csv` loads only the selected attributes.

In [None]:
data = pd.read_csv('../data/dataset_Facebook.csv', sep=";", usecols=['Page total likes', 'Type', 'Category', 'Post Month', 'Post Weekday',
       'Post Hour', 'Paid', 'comment', 'like', 'share'] )
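
The effect of `usecols` can be seen on a small in-memory sample. This is only a sketch: the sample values and the `Other metric` column are made up for illustration and are not taken from the dataset.

```python
import io
import pandas as pd

# Hypothetical ';'-separated sample; "Other metric" stands in for any unselected column
sample_csv = (
    "Page total likes;Type;Other metric\n"
    "100;Photo;50\n"
    "100;Status;75\n"
)

# usecols keeps only the listed columns; the rest are skipped while parsing
sample = pd.read_csv(io.StringIO(sample_csv), sep=";",
                     usecols=["Page total likes", "Type"])
print(sample.columns.tolist())
```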

#### Finding and Removing null values
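To clean the data, the missing values in each column are counted first, and then every row that contains a missing value is dropped.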

In [None]:
# Count the missing values in each column
print("Missing values per column:")
print(data.isnull().sum())


In [None]:
data = data.dropna()
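
To see how many rows `dropna()` discards, the row count can be compared before and after. The sketch below uses a toy frame with made-up values, not the real dataset.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; two of the three rows contain a NaN
toy = pd.DataFrame({
    "Paid": [0.0, np.nan, 1.0],
    "like": [10.0, 5.0, np.nan],
    "share": [1.0, 2.0, 3.0],
})

missing_per_column = toy.isnull().sum()  # NaN count for each column
cleaned = toy.dropna()                   # drops every row with at least one NaN
rows_removed = len(toy) - len(cleaned)

print(missing_per_column)
print("Rows removed:", rows_removed)
```

Note that `dropna()` with default arguments removes a row when any of its columns is missing, so a single empty cell is enough to discard the whole post.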

#### Adding Attributes
To index the data in a graph database, two more attributes (columns) were added:
* Post id - an ID number for the post
* Increase likes - the change in *Page total likes* relative to the previous post in the sorted data


In [None]:
# Create the ID column: one sequential number per remaining row
dataLength = len(data)

data['Post id'] = pd.Series(range(dataLength), index=data.index)

In [None]:
# Some descriptions of data
data.describe()

##### Sorting Data
So that the rows form a chronological sequence, the data is sorted by `Post id` in descending order, which reverses the original file order.

In [None]:
dataOrderedDescending = data.sort_values(by='Post id', ascending=False)
dataOrderedDescending

##### Calculating the Increase in Likes
The *Page total likes* value is important, but another useful piece of information to store in the database is the increase or decrease in likes, that is, how much the page's total changed from one post to the next.

In [None]:
listIncreaseLikes = []

# Start from the first row's total, so the first post gets an increase of 0
currentPageLikes = dataOrderedDescending['Page total likes'].iloc[0]

for i, row in dataOrderedDescending.iterrows():
    # Difference between this post's page total and the previous row's total
    dif = int(row['Page total likes']) - currentPageLikes
    listIncreaseLikes.append(dif)

    # Remember this total for the next iteration
    currentPageLikes = row['Page total likes']

dataOrderedDescending['Increase likes'] = pd.Series(listIncreaseLikes, index=dataOrderedDescending.index)
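
As an alternative to the explicit loop, the same differences can probably be obtained with pandas' `Series.diff()`. The sketch below compares both approaches on a toy frame with made-up totals; it is not the code used to produce the processed file.

```python
import pandas as pd

# Toy totals (made up), already in the order in which they are processed
toy = pd.DataFrame({"Page total likes": [100, 100, 120, 115]})

# Loop version, mirroring the cell above
previous = toy["Page total likes"].iloc[0]
loop_result = []
for total in toy["Page total likes"]:
    loop_result.append(int(total) - previous)
    previous = total

# Vectorized version: diff() subtracts the previous row from each row.
# The first row has no predecessor, so its NaN is replaced with 0.
toy["Increase likes"] = toy["Page total likes"].diff().fillna(0).astype(int)

print(toy)
```

Both versions give an increase of 0 for the first row, because there is no earlier post to compare it with.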


#### Altering category from numbers to values
The dataset's category values are numbers (1, 2, 3), but according to the paper's authors, the corresponding categories are:
* 1 = Action
* 2 = Product
* 3 = Inspiration

So, for clarity and for indexing the data in a graph database, each category number is replaced with its corresponding name.

In [None]:
categoriesNames = ["","Action","Product","Inspiration"]
listCategories  = []

for i, row in dataOrderedDescending.iterrows():   
    listCategories.append((categoriesNames[int(row["Category"])]))
    
dataOrderedDescending["Category"] = pd.Series(listCategories, index=dataOrderedDescending.index)
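
The same replacement can also be sketched with a dictionary and `Series.map`, which avoids the explicit loop and the empty placeholder at index 0. The `category_names` dictionary and the toy values below are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Number-to-name mapping, as listed above
category_names = {1: "Action", 2: "Product", 3: "Inspiration"}

# Toy column with made-up category numbers
toy = pd.DataFrame({"Category": [2, 1, 3, 2]})

# map() looks up each value in the dictionary
toy["Category"] = toy["Category"].map(category_names)
print(toy["Category"].tolist())
```

One difference in behavior: a number that is missing from the dictionary becomes `NaN` with `map`, whereas indexing the list raises an `IndexError` for numbers outside its range.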

#### Saving Output File

In [None]:
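# Forward slashes match the path used for reading and work on every OS; sep is passed by keyword
dataOrderedDescending.to_csv('../data/dataset_Facebook_processed.csv', sep=";", index=False)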
print("File saved: dataset_Facebook_processed.csv")
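
A quick way to check a saved file is to read it back with the same separator. The sketch below does the round trip through an in-memory buffer with a made-up two-row frame, so it does not touch the real output file.

```python
import io
import pandas as pd

# Made-up frame standing in for the processed data
toy = pd.DataFrame({"Post id": [1, 0], "Category": ["Action", "Product"]})

# Write with ';' as separator and without the index, as in the cell above
buffer = io.StringIO()
toy.to_csv(buffer, sep=";", index=False)

# Read it back with the same separator
buffer.seek(0)
restored = pd.read_csv(buffer, sep=";")
print(restored)
```

If the separator passed to `read_csv` does not match the one used when writing, each line is parsed as a single column, so keeping `sep=";"` on both sides matters.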