# Preprocessing Data
The dataset used in these experiments, Facebook Metrics, is available from the [UCI Machine Learning Repository](http://archive.ics.uci.edu/ml/datasets/Facebook+metrics). The related files are also in this repository, in the [/data](../data) directory.

## Facebook Metrics Dataset
The data covers posts published during 2014 (from January 1st to December 31st) on the Facebook page of a renowned cosmetics brand. The dataset has 500 instances and 19 attributes.

### Attributes Description
The authors published the dataset with the original Facebook metrics plus the data mining outputs.

In [2]:
import pandas as pd
import numpy as np
data = pd.read_csv('../data/dataset_Facebook.csv', sep=";")

### Original Dataset Attributes

In [None]:
print("Original Dataset Attributes")
#print(data.head())

In [None]:
print(data.columns)

In [None]:
print("Dataset")
data

### Attributes Selection
In the experiments of this research, I extract the attributes from the data mining outputs, because the main goal is to index the information in a graph database and run queries that help the data mining process.

#### Reading dataset

In [12]:
data = pd.read_csv('../data/dataset_Facebook.csv', sep=";", usecols=['Page total likes', 'Type', 'Category', 'Post Month', 'Post Weekday',
       'Post Hour', 'Paid', 'comment', 'like', 'share'] )
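
To illustrate how `usecols` restricts the columns that `read_csv` loads, here is a minimal sketch on an in-memory sample. The three rows and the `wanted` list are made up for illustration and are not taken from the real file; only the semicolon-separated layout mirrors the dataset.

```python
import io
import pandas as pd

# Hypothetical sample in the same semicolon-separated layout as the dataset;
# the values are invented for illustration.
sample_csv = io.StringIO(
    "Page total likes;Type;Category;Paid;like;Lifetime Post Total Reach\n"
    "139441;Photo;2;0;79;2752\n"
    "139441;Status;2;0;130;10460\n"
    "139441;Photo;3;1;66;2413\n"
)

# Only the columns named here are parsed; the others are skipped.
wanted = ['Page total likes', 'Type', 'Paid', 'like']
sample = pd.read_csv(sample_csv, sep=";", usecols=wanted)

print(sample.columns.tolist())
print(sample.shape)
```

The columns come back in file order, so the order of the `usecols` list does not matter.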

#### Finding and Removing null values

In [13]:
# Flag the columns that contain at least one null value
data.isnull().any()

# Count the null values in a single column
def num_missing(x):
    return x.isnull().sum()

print("Missing values per column:")
print(data.apply(num_missing, axis=0))


Missing values per column:
Page total likes    0
Type                0
Category            0
Post Month          0
Post Weekday        0
Post Hour           0
Paid                1
comment             0
like                1
share               4
dtype: int64


In [14]:
data = data.dropna()
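
The per-column counts above add up to 6 missing values, while `describe()` further down reports 495 of the 500 rows. One possible reading is that a single row is missing two values: `dropna()` removes a row once, however many nulls it holds. The sketch below shows that behaviour on a made-up frame (the `toy` values are hypothetical).

```python
import numpy as np
import pandas as pd

# Hypothetical frame: row 1 is missing two values, row 2 is missing one.
toy = pd.DataFrame({
    'Paid':  [0.0, np.nan, 1.0, 0.0, 1.0],
    'like':  [79.0, np.nan, 66.0, 152.0, 249.0],
    'share': [17.0, 29.0, np.nan, 14.0, 5.0],
})

# Column-wise count of nulls: three missing cells in total.
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# dropna() removes every row with at least one null: only two rows go.
cleaned = toy.dropna()
print(len(toy), "->", len(cleaned))
```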

#### Adding Columns
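I added the following columns:
* Post id - sequential ID number for the post
* Increase Likes - change in `Page total likes` relative to the previous post (computed below as a list)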

In [65]:
dtLength = len(data['Page total likes'])

dataIndexes = [x for x in range(dtLength)]

data['Post id'] = pd.Series(dataIndexes, index=data.index)
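
After `dropna()` the row index has gaps, so the index itself cannot serve as a consecutive post ID. A minimal sketch of the same idea on a made-up frame (the `toy` data is hypothetical), using `np.arange` as an assumed shorter alternative to building the list by hand:

```python
import numpy as np
import pandas as pd

# Hypothetical frame whose index skips 2, as if that row had been dropped.
toy = pd.DataFrame({'like': [79, 130, 66, 152]}, index=[0, 1, 3, 4])

# Assign consecutive IDs by position, independent of the index labels.
toy['Post id'] = np.arange(len(toy))

print(toy)
```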

In [16]:
#data['like increase'] = data['Page total likes'] - 139441
#print(data.ix[2])
data.describe()

| | Page total likes | Category | Post Month | Post Weekday | Post Hour | Paid | comment | like | share | Post id |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 495.0 | 495.0 | 495.0 | 495.0 | 495.0 | 495.0 | 495.0 | 495.0 | 495.0 | 495.0 |
| mean | 123173.268687 | 1.886869 | 7.028283 | 4.133333 | 7.844444 | 0.280808 | 7.557576 | 179.145455 | 27.264646 | 247.0 |
| std | 16203.818031 | 0.853268 | 3.304274 | 2.030735 | 4.385064 | 0.449849 | 21.274384 | 324.412161 | 42.656388 | 143.038456 |
| min | 81370.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 112324.0 | 1.0 | 4.0 | 2.0 | 3.0 | 0.0 | 1.0 | 57.0 | 10.0 | 123.5 |
| 50% | 129600.0 | 2.0 | 7.0 | 4.0 | 9.0 | 0.0 | 3.0 | 101.0 | 19.0 | 247.0 |
| 75% | 136393.0 | 3.0 | 10.0 | 6.0 | 11.0 | 1.0 | 7.0 | 188.0 | 32.5 | 370.5 |
| max | 139441.0 | 3.0 | 12.0 | 7.0 | 23.0 | 1.0 | 372.0 | 5172.0 | 790.0 | 494.0 |


In [93]:
#print(data.get(1))
#data.sort_index(1, ascending=False)
#newData = data.sort_index(ascending=False)
newData = data.sort_values(by='Post id', ascending=False)

In [94]:
listIncreaseLikes = []
# Start from 0 so the first difference equals the first total (81370 in the output)
currentPageLikes = 0

for i, row in newData.iterrows():
    dif = int(row['Page total likes']) - currentPageLikes
    listIncreaseLikes.append(dif)
    if (row['Page total likes'] != currentPageLikes):
        currentPageLikes = row['Page total likes']
    
print(listIncreaseLikes)


[81370, 0, 0, 3723, 0, 0, 886, 0, 0, 0, 0, 0, 0, 512, 0, 0, 0, 0, 418, 0, 0, 0, 0, 0, 4100, 107, 107, 107, 107, 107, 107, 107, 107, 107, 107, 107, 107, 107, 107, 107, 107, 107, 107, 107, 107, 107, 107, 107, 107, 107, 3065, 0, 0, 0, 0, 0, 1446, 0, 0, 0, 0, 2537, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1380, 0, 0, 0, 0, 0, 0, 0, 1958, 0, 0, 0, 0, 2858, 0, 0, 0, 0, 0, 979, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1763, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1950, 0, 0, 0, 0, 0, 1408, 0, 0, 0, 0, 0, 2340, 0, 0, 0, 525, 0, 198, 0, 0, 0, 0, 0, 344, 0, 0, 0, 0, 0, 0, 1329, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1434, 0, 0, 0, 0, 0, 0, 0, 852, 0, 0, 0, 1490, 0, 1507, 0, 0, 0, 0, 0, 0, 0, 0, 1893, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 672, 0, 0, 0, 0, 0, 0, 0, 0, 0, 529, 0, 0, 0, 0, 0, 204, 0, 0, 0, 0, 79, 0, 0, 0, 0, 658, 950, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1568, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1191, 0, 0, 0, 0, 0, 0, 0, 509, 0, 0, 0, 330, 0, 0, 0, 0, 0, 98, 0, 0, 80, 0, 0, 0, 148, 0, 0, 0, 0, 0, 0, 2
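
The loop above subtracts the previous row's page total from the current one. As a sketch, the same differences can be computed without an explicit loop using `Series.diff()`. This assumes the rows are ordered oldest post first and that the running total starts at 0, which matches the first printed difference. The `totals` series is a made-up sample whose steps mirror the first few non-zero values in the output.

```python
import pandas as pd

# Hypothetical page-like totals, ordered oldest post first.
totals = pd.Series([81370, 81370, 85093, 85093, 85979], name='Page total likes')

# diff() subtracts the previous row; the first row has no predecessor,
# so fill it with the total itself (equivalent to starting from 0).
increase = totals.diff().fillna(totals.iloc[0]).astype(int)

print(increase.tolist())
```

Storing the result as a column (for example `newData['Increase Likes'] = ...`, a hypothetical name) would keep it in the saved file; the loop above only prints the list.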

#### Saving Output File

In [95]:
newData.to_csv('../data/dataset_Facebook_processed.csv', sep=";", index=False)
print("File saved: dataset_Facebook_processed.csv")

File saved: dataset_Facebook_processed.csv
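
A quick way to confirm that a file was written as intended is to read it back with the same separator. The sketch below does this in a temporary directory with a made-up frame; the file name `toy_processed.csv` and the `toy` data are hypothetical. Building the path with `os.path.join` avoids hard-coding a path separator.

```python
import os
import tempfile
import pandas as pd

# Hypothetical frame standing in for the processed dataset.
toy = pd.DataFrame({
    'Type': ['Photo', 'Status'],
    'like': [79, 130],
    'Post id': [1, 0],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, 'toy_processed.csv')
    # Write without the index column, using ";" as in the source file.
    toy.to_csv(out_path, sep=";", index=False)
    # Read back with the same separator to check the round trip.
    restored = pd.read_csv(out_path, sep=";")

print(restored)
```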
