# Preprocessing Data
The dataset used in these experiments, Facebook Metrics, is available from the [UCI Machine Learning Repository](http://archive.ics.uci.edu/ml/datasets/Facebook+metrics). The related files are also in this repository, in the [/data](../data) directory.

## Facebook Metrics Dataset
The data covers posts published during 2014 (from 1 January to 31 December) on the Facebook page of a renowned cosmetics brand. The dataset has 500 instances and 19 attributes.

### Describing Attributes
The authors published the dataset with the original Facebook metrics plus data mining outputs. However, for these initial experiments only some attributes were selected, as explained in the **Selecting Attributes** section below.

### Original Dataset Attributes

In [None]:
import pandas as pd
import numpy as np
data = pd.read_csv('../data/dataset_Facebook.csv', sep=";")

In [None]:
print("Original Dataset Attributes")
#print(data.head())
print(data.columns)

## Preprocessing Steps
The preprocessing steps are:

1. Select the features of interest - open the file and select the columns;
2. Identify null values and exclude them - find the null values and drop the corresponding rows;
3. Replace the category number with its name - in the original dataset the category feature is a number (1, 2, 3); this step substitutes the corresponding name (Action, Product, Inspiration);
4. Replace the weekday number with its name - the weekday feature is a number; this step substitutes the corresponding name (Sunday, Monday, etc.);
5. Add the Id attribute - add a sequential number that identifies each publication;
6. Sort the dataset - sort the rows by Post id in descending order;
7. Calculate and add the increase in likes - compute the change in Page total likes from one post to the next;
8. Save the output - write the *preprocessed data* to a new file.

### Selecting Attributes
Because the main goal of this step is to index the information in a graph database, only the attributes that represent a post's raw data were used:

* Page total likes
* Type (Link, Photo, Status, Video)
* Category (Action, Product, Inspiration)
* Post Month
* Post Weekday
* Post Hour
* Paid
* comment
* like
* share

#### Reading dataset
In this code, the `usecols` argument of `read_csv` loads only the targeted attributes.

In [None]:
data = pd.read_csv('../data/dataset_Facebook.csv', sep=";", usecols=['Page total likes', 'Type', 'Category', 'Post Month', 'Post Weekday',
       'Post Hour', 'Paid', 'comment', 'like', 'share'] )
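
As a minimal sketch of how `usecols` behaves, the example below reads a small in-memory CSV instead of the real file. The two sample rows are hypothetical values made up for illustration, and only a subset of the column names is used.

```python
import io
import pandas as pd

# Hypothetical two-row sample in the same ';'-separated layout (not real dataset rows).
sample_csv = io.StringIO(
    "Page total likes;Type;Category;Post Month;Total Interactions\n"
    "1000;Photo;2;12;100\n"
    "990;Status;1;12;80\n"
)

wanted = ["Page total likes", "Type", "Category", "Post Month"]
sample = pd.read_csv(sample_csv, sep=";", usecols=wanted)

# Columns that are not listed in `usecols` are never loaded.
print(list(sample.columns))
```

Names passed to `usecols` must match the CSV header exactly, including case (`Post Hour`, not `Post hour`); otherwise `read_csv` raises a `ValueError`.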

#### Finding and Removing null values
To clean the data, count the missing values in each column and then drop every row that contains at least one.

In [None]:
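# Show which columns contain at least one null value
print(data.isnull().any())

# Count the null values in a single column
def num_missing(x):
    return x.isnull().sum()

print("Missing values per column:")
print(data.apply(num_missing, axis=0))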


In [None]:
data = data.dropna()
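
The effect of these two cells can be checked on a small frame. This is only a sketch: the `toy` frame and its values are hypothetical, and `isnull().sum()` is used as a shorter way to get the same per-column counts.

```python
import numpy as np
import pandas as pd

# Hypothetical values for illustration only.
toy = pd.DataFrame({
    "Paid": [0.0, np.nan, 1.0],
    "like": [10.0, 5.0, np.nan],
    "share": [1.0, 2.0, 3.0],
})

# Per-column count of missing values.
missing_per_column = toy.isnull().sum()

# dropna() keeps only the rows that have no missing value in any column.
cleaned = toy.dropna()

print(missing_per_column)
print(len(toy) - len(cleaned), "rows dropped")
```

Because `dropna()` removes a row when any selected column is null, it is worth comparing the row count before and after to see how much data is lost.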

##### Altering category from numbers to values
The category values in the dataset are expressed as numbers (1, 2, 3). According to the paper's authors, the corresponding categories are:
* 1 = Action
* 2 = Product
* 3 = Inspiration

So, to make the data clearer and to index it in a graph database, each category number is replaced by its corresponding name.

In [None]:
categoriesNames = ["","Action","Product","Inspiration"]
listCategories  = []

for i, row in data.iterrows():   
    listCategories.append((categoriesNames[int(row["Category"])]))
    
data["Category"] = pd.Series(listCategories, index=data.index)
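
The same replacement can also be written without an explicit loop. The sketch below is an alternative, not the notebook's method: it uses `Series.map` with a dictionary, and the `toy` frame and the `category_names` variable are hypothetical names introduced for illustration.

```python
import pandas as pd

# Hypothetical category codes for illustration only.
toy = pd.DataFrame({"Category": [2, 1, 3, 2]})

# Mapping taken from the list of categories described above.
category_names = {1: "Action", 2: "Product", 3: "Inspiration"}

# map() looks up every value of the column in the dictionary.
toy["Category"] = toy["Category"].astype(int).map(category_names)

print(toy["Category"].tolist())
```

One behavioural difference: a code that is missing from the dictionary becomes `NaN` with `map`, whereas indexing a list with an out-of-range number raises an `IndexError`.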

##### Altering weekdays from numbers to values
The weekday values in the dataset are expressed as numbers (1, 2, 3, 4, 5, 6 and 7). To make the data clearer and to index it in a graph database, each weekday number is replaced by its corresponding name.

In [None]:
weekdaysNames = ["","Sunday","Monday","Tuesday","Wednesday","Thursday","Friday","Saturday"]
listWeekdays  = []

for i, row in data.iterrows():   
    listWeekdays.append((weekdaysNames[int(row["Post Weekday"])]))
    
data["Post Weekday"] = pd.Series(listWeekdays, index=data.index)
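
As a sketch, the lookup table can be built from the list of names with `enumerate`, followed by a check that every number was translated. The numbering (1 = Sunday) follows the list used in the cell above; `toy`, `weekday_list` and `weekday_names` are hypothetical names introduced for illustration.

```python
import pandas as pd

weekday_list = ["Sunday", "Monday", "Tuesday", "Wednesday",
                "Thursday", "Friday", "Saturday"]

# Number the names starting at 1, giving {1: "Sunday", ..., 7: "Saturday"}.
weekday_names = dict(enumerate(weekday_list, start=1))

# Hypothetical weekday codes for illustration only.
toy = pd.DataFrame({"Post Weekday": [1, 7, 4]})
toy["Post Weekday"] = toy["Post Weekday"].map(weekday_names)

# A NaN here would mean a code outside the range 1..7.
all_translated = not toy["Post Weekday"].isnull().any()

print(toy["Post Weekday"].tolist(), all_translated)
```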

#### Adding Attributes
To index the data in a graph database, two other attributes (columns) were added:
* Post Id - sequential ID number of the post
* Increase in Likes - change in *Page total likes* relative to the preceding post in the sorted data


In [None]:
# Create the Id column: a sequential number following the current row order
dataLength = len(data)

dataIndexes = list(range(dataLength))

data['Post id'] = pd.Series(dataIndexes, index=data.index)

##### Sorting Data
So that consecutive posts can be compared, the data is sorted by `Post id` in descending order.

In [None]:
dataOrderedDescending = data.sort_values(by='Post id', ascending=False)
dataOrderedDescending

##### Calculating the Increase in Likes
The *Page total likes* value is important, but another interesting piece of information to store in the database is the increase or decrease in likes, that is, the difference between a post's *Page total likes* and the value recorded for the preceding post in the sorted data.

In [None]:
listIncreaseLikes = []

# Start from the page likes of the first row, so its increase is 0
currentPageLikes = dataOrderedDescending['Page total likes'].iloc[0]

for i, row in dataOrderedDescending.iterrows():
    # Difference from the page likes of the preceding row
    dif = int(row['Page total likes']) - currentPageLikes

    listIncreaseLikes.append(dif)

    # Remember this row's value for the next iteration
    currentPageLikes = row['Page total likes']

dataOrderedDescending['Increase likes'] = pd.Series(listIncreaseLikes, index=dataOrderedDescending.index)
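
The loop above computes a row-to-row difference, which pandas also offers as `Series.diff()`. The sketch below compares both on a hypothetical `toy` frame; the values are made up for illustration, and `diff()` is shown as an alternative, not as the notebook's method.

```python
import pandas as pd

# Hypothetical page-like totals for illustration only.
toy = pd.DataFrame({"Page total likes": [100, 100, 120, 115]})

# Loop version, following the logic of the cell above.
increases_loop = []
current = toy["Page total likes"].iloc[0]
for _, row in toy.iterrows():
    increases_loop.append(int(row["Page total likes"]) - int(current))
    current = row["Page total likes"]

# Vectorized version: diff() leaves NaN in the first row, replaced here by 0.
increases_diff = toy["Page total likes"].diff().fillna(0).astype(int)

print(increases_loop)
print(increases_diff.tolist())
```

Both versions give 0 for the first row, since it has no preceding row to compare with.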


#### Saving Output File

In [None]:
dataOrderedDescending.to_csv('../data/dataset_Facebook_processed.csv', sep=";", index=False)
print("File saved: dataset_Facebook_processed.csv")
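
To confirm that a file written this way can be read back unchanged, a round trip can be tried on a small frame. This is a sketch under stated assumptions: the `toy` frame, its values and the file name `toy_processed.csv` are hypothetical, and a temporary directory is used so that nothing in `../data` is touched.

```python
import os
import tempfile
import pandas as pd

# Hypothetical values for illustration only.
toy = pd.DataFrame({"Post id": [1, 0], "Increase likes": [0, 20]})

with tempfile.TemporaryDirectory() as tmp_dir:
    # os.path.join builds the path with the separator of the current OS.
    out_path = os.path.join(tmp_dir, "toy_processed.csv")

    # Write with the same separator as the original dataset, without the index.
    toy.to_csv(out_path, sep=";", index=False)

    # Read the file back with the same separator.
    reloaded = pd.read_csv(out_path, sep=";")

print(reloaded.equals(toy))
```

Passing `sep` as a keyword argument keeps the call valid on recent pandas versions, where the arguments after the path are keyword-only.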