# Facebook Comment Volume Prediction
Social networking services have drawn extensive public attention over the past decade, and the amount of data uploaded to them grows day by day. This creates a strong need to study the highly dynamic behavior of their users. This is an introductory work to model user patterns, predict user behavior, and study the effectiveness of machine learning approaches on Facebook. We model user comment patterns on posts to Facebook Pages based on various parameters and predict the number of comments a post is expected to receive in the next H hours.

In [1]:
import pandas as pd
import numpy as np

# Reading the Data
The dataset is loaded and given reasonable column names. Not all columns will be useful, however, so we select only some of them.

In [2]:
names = ['likes', 'visitors', 'people', 'category', 'unknown0', 'unknown1', 'unknown2', 'unknown3', 'unknown4', 'unknown5', 'unknown6', 'unknown7', 'unknown8', 'unknown9', 'unknown10', 'unknown11', 'unknown12', 'unknown13', 'unknown14', 'unknown15', 'unknown16', 'unknown17', 'unknown18', 'unknown19', 'unknown20', 'unknown21', 'unknown22', 'unknown23', 'unknown24', 'cc1 ', 'cc2', 'cc3 ', 'cc4', 'cc5', 'base time', 'length', 'shares', 'promoted', 'hours', 'sunday', 'monday', 'tuesday', 'wednesday', 'thursday', 'friday', 'saturday', 'base sunday', 'base monday', 'base tuesday', 'base wednesday', 'base thursday', 'base friday', 'base saturday', 'incoming comments']
df_train = pd.read_csv('Dataset/Training/Features_Variant_5.csv', header=None, names=names)
df_train.head(6)

Unnamed: 0,likes,visitors,people,category,unknown0,unknown1,unknown2,unknown3,unknown4,unknown5,...,friday,saturday,base sunday,base monday,base tuesday,base wednesday,base thursday,base friday,base saturday,incoming comments
0,634995,0,463,1,0.0,1280.0,13.158779,1.0,94.99364,0.0,...,0,0,0,0,0,0,1,0,0,0
1,634995,0,463,1,0.0,1280.0,13.158779,1.0,94.99364,0.0,...,0,0,1,0,0,0,0,0,0,0
2,634995,0,463,1,0.0,1280.0,13.158779,1.0,94.99364,0.0,...,1,0,0,0,0,0,0,0,1,0
3,634995,0,463,1,0.0,1280.0,13.158779,1.0,94.99364,0.0,...,1,0,0,1,0,0,0,0,0,0
4,634995,0,463,1,0.0,1280.0,13.158779,1.0,94.99364,0.0,...,0,0,0,0,0,0,1,0,0,0
5,634995,0,463,1,0.0,1280.0,13.158779,1.0,94.99364,0.0,...,0,0,0,0,0,1,0,0,0,0
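
The long `names` list above can also be generated instead of typed out. The following is a minimal sketch, not the notebook's original code; it keeps the trailing spaces in `'cc1 '` and `'cc3 '` so that it stays a drop-in replacement for the list used in the later cells.

```python
# Build the 54 column names programmatically.
days = ['sunday', 'monday', 'tuesday', 'wednesday',
        'thursday', 'friday', 'saturday']

names = (
    ['likes', 'visitors', 'people', 'category']
    + [f'unknown{i}' for i in range(25)]        # unknown0 .. unknown24
    + ['cc1 ', 'cc2', 'cc3 ', 'cc4', 'cc5']     # trailing spaces kept on purpose
    + ['base time', 'length', 'shares', 'promoted', 'hours']
    + days                                      # weekday flags
    + [f'base {d}' for d in days]               # base-time weekday flags
    + ['incoming comments']                     # target variable
)

print(len(names))
```

Checking `len(names)` against the number of columns in the CSV is a cheap guard against a misaligned header.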


In [3]:
ytrain = df_train['incoming comments'].values
Xtrain = df_train[['likes', 'people', 'cc1 ', 'cc2', 'cc3 ', 'cc4', 'cc5', 'base time', 'length', 'shares', 'promoted', 'hours', 'sunday','monday', 'tuesday', 'wednesday', 'thursday', 'friday', 'saturday']]

# Training the Model
Given the data, we fit a linear regression. We then calculate the RSS on the training data, normalized by the variance of the target (that is, the mean squared error divided by the target variance), to check that the model is reasonably accurate.

In [4]:
from sklearn import linear_model

regr = linear_model.LinearRegression()
regr.fit(Xtrain, ytrain)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

In [5]:
ytrain_pred = regr.predict(Xtrain)
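# Mean squared error divided by the target variance (0 = perfect fit, 1 = no better than predicting the mean)
RSS_train = np.mean((ytrain_pred-ytrain)**2)/(np.std(ytrain)**2)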
print(RSS_train)

0.68374955361
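
To help interpret this number: the quantity computed above is the mean squared error divided by the variance of the target, which equals 1 - R² when evaluated on the same data. The sketch below illustrates the identity on synthetic data using plain NumPy least squares; the data and the helper name `normalized_mse` are assumptions made for illustration, not part of the original notebook.

```python
import numpy as np

def normalized_mse(y_true, y_pred):
    """Mean squared error divided by the variance of the targets."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y_pred - y_true) ** 2) / np.var(y_true)

# Synthetic regression problem (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=2.0, size=200)

# Ordinary least squares with an intercept column
A = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
y_pred = A @ coef

# R^2 computed from its definition
ss_res = np.sum((y - y_pred) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

score = normalized_mse(y, y_pred)
baseline = normalized_mse(y, np.full_like(y, y.mean()))
print(score, 1.0 - r2, baseline)
```

The constant mean predictor scores 1 on this metric, so values close to 1 indicate that a model adds little over predicting the average.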


# Testing the Model
The final step is to test the model. We concatenate all the test files into one dataset and run the model on it. Then we calculate the normalized RSS on the test data to assess the accuracy of our linear model.

In [6]:
import glob
path = 'Dataset/Testing/TestSet/'
allFiles = glob.glob(path + '*.csv')  # path already ends with '/'
testdfs = []
for file in allFiles:
    df = pd.read_csv(file, header=None, names=names, na_values='?')
    testdfs.append(df)

df_test = pd.concat(testdfs, axis=0, ignore_index=True)
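
The loading loop above can be wrapped in a small helper. This is a sketch under stated assumptions: the function name `load_csv_folder` and the three-column demo files are hypothetical and only serve to show the behavior of `na_values='?'` and `ignore_index=True` on temporary data.

```python
import glob
import os
import tempfile

import pandas as pd

def load_csv_folder(folder, names):
    """Read every headerless CSV in `folder` and stack them into one DataFrame."""
    files = sorted(glob.glob(os.path.join(folder, '*.csv')))
    if not files:
        raise FileNotFoundError(f'no CSV files found in {folder!r}')
    frames = [pd.read_csv(f, header=None, names=names, na_values='?')
              for f in files]
    return pd.concat(frames, axis=0, ignore_index=True)

# Demo on three tiny temporary files; one cell holds the '?' placeholder.
with tempfile.TemporaryDirectory() as tmp:
    rows = ['1,2,3\n4,5,6\n', '7,?,9\n10,11,12\n', '13,14,15\n16,17,18\n']
    for i, text in enumerate(rows):
        with open(os.path.join(tmp, f'part_{i}.csv'), 'w') as fh:
            fh.write(text)
    demo = load_csv_folder(tmp, ['a', 'b', 'c'])

print(demo.shape)
print(demo.isna().sum().sum())
```

Any `'?'` entries become `NaN`, which `LinearRegression.predict` rejects, so it may be worth checking `df_test.isna().sum()` before predicting.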

In [7]:
ytest = df_test['incoming comments'].values
Xtest = df_test[['likes', 'people','cc1 ', 'cc2', 'cc3 ', 'cc4', 'cc5', 'base time', 'length', 'shares', 'promoted', 'hours', 'sunday','monday', 'tuesday', 'wednesday', 'thursday', 'friday', 'saturday']]
ytest_pred = regr.predict(Xtest)

In [8]:
RSS_test = np.mean((ytest_pred-ytest)**2)/(np.std(ytest)**2)
print(RSS_test)

0.887817998335


# Conclusion
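Since the RSS values (normalized by the target variance) for our training data (0.68) and testing data (0.89) are large, our analysis has much room for improvement. A value of 1 corresponds to always predicting the mean of the target, so the linear model explains only a small share of the variance. With the features used here, the dataset did not show a strong linear relationship between its factors and the target variable.

# Suggestions for Improvement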
1. Find a better dataset whose features correlate more strongly with the target
2. Use a different technique, such as an SVM or a neural network
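
As a rough illustration of the second suggestion, the sketch below compares a linear regression with a support vector regressor on synthetic, deliberately nonlinear data. Everything here is an assumption made for illustration (the data, the `C=10.0` setting, the helper `normalized_mse`); it does not show results on the Facebook dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

def normalized_mse(y_true, y_pred):
    """Mean squared error divided by the variance of the targets."""
    return np.mean((y_pred - y_true) ** 2) / np.var(y_true)

# Synthetic nonlinear target (illustrative only, not the Facebook data)
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(400, 2))
y = 3 * np.sin(X[:, 0]) + X[:, 1] ** 2 + rng.normal(scale=0.1, size=400)

# Simple split: first 300 rows for fitting, last 100 held out
X_fit, X_hold = X[:300], X[300:]
y_fit, y_hold = y[:300], y[300:]

linear = LinearRegression().fit(X_fit, y_fit)
# SVR is scale-sensitive, so standardize the features first
svr = make_pipeline(StandardScaler(), SVR(kernel='rbf', C=10.0)).fit(X_fit, y_fit)

linear_score = normalized_mse(y_hold, linear.predict(X_hold))
svr_score = normalized_mse(y_hold, svr.predict(X_hold))
print('linear:', linear_score)
print('SVR:   ', svr_score)
```

Kernel SVR can be slow on large training sets, so on the full data it may be necessary to fit on a subsample first.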