# Preprocessing Data
The dataset used in these experiments, Facebook Metrics, is available from the [UCI Machine Learning Repository](http://archive.ics.uci.edu/ml/datasets/Facebook+metrics). The related files are also in this repository, in the [/data](../data) directory.

## Facebook Metrics Dataset
The data relates to posts published during 2014 (from January 1st to December 31st) on the Facebook page of a renowned cosmetics brand. The dataset has 500 instances and 19 attributes.

### Attributes Description
The authors published the dataset with the original Facebook metrics plus data mining outputs. For these initial experiments, however, only some attributes were selected, as explained in the **Attributes Selection** section below.

### Original Dataset Attributes

In [1]:
import pandas as pd
import numpy as np
data = pd.read_csv('../data/dataset_Facebook.csv', sep=";")

In [2]:
print("Original Dataset Attributes")
#print(data.head())
print(data.columns)

Original Dataset Attributes
Index(['Page total likes', 'Type', 'Category', 'Post Month', 'Post Weekday',
       'Post Hour', 'Paid', 'Lifetime Post Total Reach',
       'Lifetime Post Total Impressions', 'Lifetime Engaged Users',
       'Lifetime Post Consumers', 'Lifetime Post Consumptions',
       'Lifetime Post Impressions by people who have liked your Page',
       'Lifetime Post reach by people who like your Page',
       'Lifetime People who have liked your Page and engaged with your post',
       'comment', 'like', 'share', 'Total Interactions'],
      dtype='object')


In [3]:
#Checking the dataset
#print("Dataset")
#data

### Attributes Selection
Because the main goal of this step is to index the information in a graph database, I use only the attributes that represent the raw data of each post:
* Page total likes
* Type (Link, Photo, Status, Video)
* Category (Action, Product, Inspiration)
* Post Month
* Post Weekday
* Post Hour
* Paid
* comment
* like
* share

#### Reading dataset
The code below reads only the selected attributes.

In [4]:
data = pd.read_csv('../data/dataset_Facebook.csv', sep=";", usecols=['Page total likes', 'Type', 'Category', 'Post Month', 'Post Weekday',
       'Post Hour', 'Paid', 'comment', 'like', 'share'] )
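
As a minimal sketch of how `usecols` filters columns while parsing, the snippet below reads a small in-memory sample instead of the real file. The sample rows and the names `sample_csv`, `wanted`, and `sample` are hypothetical and exist only for illustration.

```python
import io
import pandas as pd

# Hypothetical three-row sample in the same semicolon-separated layout;
# the values are made up for illustration only.
sample_csv = io.StringIO(
    "Page total likes;Type;Category;Lifetime Engaged Users;like\n"
    "139441;Photo;2;178;79\n"
    "139441;Status;2;1457;130\n"
    "139441;Link;3;42;\n"
)

wanted = ['Page total likes', 'Type', 'Category', 'like']

# usecols drops every column that is not listed while parsing,
# so 'Lifetime Engaged Users' never reaches the DataFrame.
sample = pd.read_csv(sample_csv, sep=";", usecols=wanted)

print(sample.columns.tolist())
```

If a name in `usecols` does not match a column in the file, `read_csv` raises a `ValueError`, so a typo in the attribute list shows up immediately rather than silently producing a narrower table.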

#### Finding and Removing null values
Count the missing values in each column, then drop the rows that contain any.

In [5]:
# Count the missing values in each column
print("Missing values per column:")
print(data.isnull().sum())


Missing values per column:
Page total likes    0
Type                0
Category            0
Post Month          0
Post Weekday        0
Post Hour           0
Paid                1
comment             0
like                1
share               4
dtype: int64


In [6]:
data = data.dropna()
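
Before dropping rows it can help to look at which rows are affected and how many will be lost. The sketch below shows one way to do that on a hypothetical toy frame (`toy`, `incomplete`, and `cleaned` are illustrative names, and the values are made up). In the real data the per-column counts above add up to six missing values, while `describe()` later reports 495 of the 500 rows, so at least one row must be missing more than one value.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real data.
toy = pd.DataFrame({
    'Paid':  [0.0, 1.0, np.nan, 0.0],
    'like':  [79.0, np.nan, 130.0, 12.0],
    'share': [17.0, 29.0, 14.0, 3.0],
})

# Rows that contain at least one missing value.
incomplete = toy[toy.isnull().any(axis=1)]
print(incomplete)

# dropna() with default arguments removes exactly those rows.
cleaned = toy.dropna()
rows_dropped = len(toy) - len(cleaned)
print(f"Dropped {rows_dropped} of {len(toy)} rows")
```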

#### Adding Columns
To index the data in the graph database, I added two columns:
* Post id - ID number of the post
* Increase likes - change in the page's total likes relative to the previous post


In [7]:
# Create the Post id column: a sequential ID that follows the current row order
data['Post id'] = range(len(data))

In [8]:
# Some descriptions of data
data.describe()

| | Page total likes | Category | Post Month | Post Weekday | Post Hour | Paid | comment | like | share | Post id |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 495.0 | 495.0 | 495.0 | 495.0 | 495.0 | 495.0 | 495.0 | 495.0 | 495.0 | 495.0 |
| mean | 123173.268687 | 1.886869 | 7.028283 | 4.133333 | 7.844444 | 0.280808 | 7.557576 | 179.145455 | 27.264646 | 247.0 |
| std | 16203.818031 | 0.853268 | 3.304274 | 2.030735 | 4.385064 | 0.449849 | 21.274384 | 324.412161 | 42.656388 | 143.038456 |
| min | 81370.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 112324.0 | 1.0 | 4.0 | 2.0 | 3.0 | 0.0 | 1.0 | 57.0 | 10.0 | 123.5 |
| 50% | 129600.0 | 2.0 | 7.0 | 4.0 | 9.0 | 0.0 | 3.0 | 101.0 | 19.0 | 247.0 |
| 75% | 136393.0 | 3.0 | 10.0 | 6.0 | 11.0 | 1.0 | 7.0 | 188.0 | 32.5 | 370.5 |
| max | 139441.0 | 3.0 | 12.0 | 7.0 | 23.0 | 1.0 | 372.0 | 5172.0 | 790.0 | 494.0 |


##### Sorting Data
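Post id was assigned in file order, and the output below shows that the highest ids belong to January posts. Sorting by Post id in descending order therefore puts the oldest posts first, so that the change in likes can be computed from one post to the next.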

In [9]:
dataOrderedDescending = data.sort_values(by='Post id', ascending=False)
dataOrderedDescending

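| | Page total likes | Type | Category | Post Month | Post Weekday | Post Hour | Paid | comment | like | share | Post id |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 498 | 81370 | Photo | 3 | 1 | 4 | 11 | 0.0 | 7 | 91.0 | 38.0 | 494 |
| 497 | 81370 | Photo | 1 | 1 | 5 | 2 | 0.0 | 4 | 93.0 | 18.0 | 493 |
| 496 | 81370 | Photo | 2 | 1 | 5 | 8 | 0.0 | 0 | 53.0 | 22.0 | 492 |
| 495 | 85093 | Photo | 3 | 1 | 7 | 2 | 0.0 | 5 | 53.0 | 26.0 | 491 |
| 494 | 85093 | Photo | 3 | 1 | 7 | 10 | 0.0 | 10 | 125.0 | 41.0 | 490 |
| 493 | 85093 | Photo | 3 | 1 | 1 | 2 | 0.0 | 17 | 185.0 | 55.0 | 489 |
| 492 | 85979 | Link | 1 | 1 | 5 | 11 | 0.0 | 0 | 128.0 | 9.0 | 488 |
| 491 | 85979 | Photo | 3 | 1 | 6 | 3 | 1.0 | 1 | 105.0 | 46.0 | 487 |
| 490 | 85979 | Photo | 3 | 1 | 6 | 11 | 0.0 | 1 | 79.0 | 30.0 | 486 |
| 489 | 85979 | Photo | 3 | 1 | 7 | 2 | 0.0 | 1 | 74.0 | 28.0 | 485 |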


##### Calculating the Increase Likes
*Page total likes* is important, but another useful value to store in the database is the increase or decrease in likes, that is, how much the page's total changed between one post and the next.

In [10]:
listIncreaseLikes = []

# Start from the first row so that its difference is 0
currentPageLikes = dataOrderedDescending['Page total likes'].iloc[0]

for i, row in dataOrderedDescending.iterrows():
    # Change in page likes relative to the previous row
    dif = int(row['Page total likes']) - currentPageLikes
    listIncreaseLikes.append(dif)

    # Remember this row's value for the next comparison
    currentPageLikes = row['Page total likes']

dataOrderedDescending['Increase likes'] = pd.Series(listIncreaseLikes, index=dataOrderedDescending.index)
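
The loop compares each row with the one before it, which is what pandas' `Series.diff()` does in a single call. The following is a sketch of that alternative on a hypothetical toy frame (the name `toy` is illustrative; the values are taken from the sorted output above), not a replacement the original notebook uses.

```python
import pandas as pd

# Hypothetical toy frame, ordered oldest post first like dataOrderedDescending.
toy = pd.DataFrame({'Page total likes': [81370, 81370, 85093, 85979, 85979]})

# diff() subtracts the previous row; the first row has no predecessor,
# so its NaN is replaced by 0 to match the loop's starting value.
toy['Increase likes'] = (
    toy['Page total likes'].diff().fillna(0).astype(int)
)

print(toy['Increase likes'].tolist())  # [0, 0, 3723, 886, 0]
```

Posts that share the same *Page total likes* value get an increase of 0, so only the first post after a change carries the difference.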


#### Saving Output File

In [11]:
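# Forward slashes match the input path and avoid backslash escape sequences; sep is passed by keyword
dataOrderedDescending.to_csv('../data/dataset_Facebook_processed.csv', sep=";", index=False)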
print("File saved: dataset_Facebook_processed.csv")

File saved: dataset_Facebook_processed.csv
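
One way to check the saved file is to read it back with the same separator and compare it with the frame that was written. The sketch below does this against an in-memory buffer rather than the real output path, using a hypothetical two-row frame (`toy`, `buffer`, and `restored` are illustrative names).

```python
import io
import pandas as pd

# Hypothetical two-row frame standing in for dataOrderedDescending.
toy = pd.DataFrame({
    'Type': ['Photo', 'Link'],
    'like': [91.0, 128.0],
    'Post id': [494, 488],
})

# Write to an in-memory buffer instead of disk to keep the sketch self-contained.
buffer = io.StringIO()
toy.to_csv(buffer, sep=";", index=False)

# Read it back with the same separator and compare shape and columns.
buffer.seek(0)
restored = pd.read_csv(buffer, sep=";")
print(restored.shape == toy.shape)
print(list(restored.columns) == list(toy.columns))
```

Reading the file with the default comma separator would instead yield a single column whose name is the whole header line, which is a quick sign that `sep=";"` was forgotten.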
