# AnalyzeData
>Here we do various kinds of analysis on our data. We first load the data from the saved pickle file. The dataset contains posts from April 2009 to July 2017. Refer to GetData.ipynb for details on querying the Facebook Graph API.

In [1]:
import json
import pickle
import pandas as pd
import numpy as np
from collections import Counter, OrderedDict, defaultdict

# local wrapper over bokeh
import vis_bokeh as bkh

# bokeh
from bokeh.plotting import figure
from bokeh.io import show, output_notebook
from bokeh.layouts import row, gridplot, column
from bokeh.models.tools import HoverTool
output_notebook()

##### Load data from pickle 

In [2]:
# Use a context manager so the file handle is closed after loading
with open("steam_data.pkl", "rb") as f:
    data = pickle.load(f)

In [23]:
# pd.json_normalize is the top-level name in pandas >= 1.0 (older versions: pd.io.json.json_normalize)
df = pd.json_normalize(data)
df.loc[0]

comments.data                                                                  []
comments.summary.can_comment                                                 True
comments.summary.order                                                     ranked
comments.summary.total_count                                                    9
created_time                                             2017-07-28T17:01:24+0000
id                                                  67919847338_10154886336202339
likes.data                                                                     []
likes.summary.can_like                                                       True
likes.summary.has_liked                                                     False
likes.summary.total_count                                                      47
message                         Today's Deal: Save 60% on Sherlock Holmes: The...
shares.count                                                                    2
Name: 0, dtype: object
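
The dotted column names in the output above (for example `likes.summary.total_count`) come from flattening the nested JSON of each post. As a rough illustration of that idea only, and not of how pandas implements `json_normalize`, here is a minimal standard-library sketch. The `flatten` helper and the `post` record are hypothetical; the record is just shaped like the first row shown above.

```python
def flatten(record, parent_key="", sep="."):
    """Flatten nested dicts into one level with dotted keys (lists are kept as-is)."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested dicts, carrying the dotted prefix along
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


# Hypothetical post, shaped like the first row of the DataFrame
post = {
    "id": "67919847338_10154886336202339",
    "likes": {"data": [], "summary": {"total_count": 47, "can_like": True}},
    "shares": {"count": 2},
}

flat_post = flatten(post)
for column in sorted(flat_post):
    print(column, "=", flat_post[column])
```

Each nested level contributes one segment to the column name, which is why the like count ends up under `likes.summary.total_count`.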

##### Let's drop the unnecessary columns

In [24]:
# Drop the columns we do not need (selected by position)
df.drop(df.columns[[0, 1, 2, 6, 7, 8]], axis=1, inplace=True)
# Rename the columns
df.columns = ['Comments', 'CreatedTime', 'Id', 'Likes', 'Message', 'Shares']
# The shares column has a lot of NaNs, replace them with 0
df['Shares'] = df['Shares'].fillna(0)
df.loc[0]

Comments                                                       9
CreatedTime                             2017-07-28T17:01:24+0000
Id                                 67919847338_10154886336202339
Likes                                                         47
Message        Today's Deal: Save 60% on Sherlock Holmes: The...
Shares                                                         2
Name: 0, dtype: object

##### Let's do some time analysis: how have the number of comments, likes and shares on posts varied?

In [6]:
likes = df.Likes.tolist()
comments = df.Comments.tolist()
shares = df.Shares.tolist()

# the above three have the latest posts in the beginning, so reverse all three
likes.reverse()
comments.reverse()
shares.reverse()

# Now plot each series as a line graph
plot1 = bkh.line_graph([likes], xlabel='Time', ylabel='Likes')
plot2 = bkh.line_graph([comments], xlabel='Time', ylabel='Comments')
plot3 = bkh.line_graph([shares], xlabel='Time', ylabel='Shares')
grid = gridplot([[plot1, plot2], [plot3, None]])
show(column(plot1, plot2, plot3))

>We have a lot of data (6516 data points), and line charts do not help much in this case. The lines do not show much of a pattern in the activity. There are a lot of spikes in our data, caused by some of the extra popular posts.
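
One common way to make a spiky series easier to read is to smooth it before plotting. The notebook does not do this; the following is only a sketch of a trailing moving average using the standard library, with a hypothetical helper name and made-up demo numbers.

```python
from collections import deque


def moving_average(values, window):
    """Trailing moving average; the first window-1 outputs average fewer points."""
    recent = deque(maxlen=window)
    running_sum = 0.0
    smoothed = []
    for value in values:
        if len(recent) == window:
            # The oldest value is about to be evicted by append()
            running_sum -= recent[0]
        recent.append(value)
        running_sum += value
        smoothed.append(running_sum / len(recent))
    return smoothed


# Made-up like counts with a single spike
likes_demo = [10, 12, 500, 11, 9, 13]
smoothed_demo = moving_average(likes_demo, window=3)
print(smoothed_demo)
```

A mean-based window spreads a spike over its neighbours rather than removing it, so a larger window (or a median instead of a mean) may suit data with very popular outlier posts better.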

##### Let's draw a scatter plot of likes vs comments

In [25]:
plot = bkh.scatter_plot(likes, comments, xlabel='Likes', ylabel='Comments')
corr = np.corrcoef(likes, comments)
print("Correlation between likes and comments = ", corr[0][1])
show(plot)

Correlation between likes and comments =  0.651560874778


>As expected, there is a fairly strong positive correlation (about 0.65) between the number of likes and the number of comments.

>Except for a few, most of the posts are under 500 comments and 1000 likes. 
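
For reference, the off-diagonal entry of `np.corrcoef` used above is the Pearson correlation coefficient: the covariance of the two series divided by the product of their standard deviations. A minimal standard-library sketch of that formula follows; the `pearson_r` helper and the demo lists are illustrative and not part of the notebook.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products of deviations; the 1/n factors cancel in the ratio
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up data: comments exactly proportional to likes, then reversed
likes_demo = [1, 2, 3, 4, 5]
r_up = pearson_r(likes_demo, [2, 4, 6, 8, 10])
r_down = pearson_r(likes_demo, [10, 8, 6, 4, 2])
print(r_up, r_down)
```

Perfectly linear data gives values of 1 or -1; the 0.65 observed for likes and comments sits between no relationship and a perfect linear one.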

##### Let's list the posts with more than 5000 likes and 1500 comments

In [8]:
# Chain the filter and the sort so that a slice of df is never modified in place
popular_posts = df[(df.Likes > 5000) & (df.Comments > 1500)].sort_values(
    by=['Likes', 'Comments', 'Shares'], ascending=False)
for msg, post_id in zip(popular_posts.Message, popular_posts.Id):
    print(msg, '\n\nURL: https://facebook.com/' + post_id, '\n\n\t\t\t************************\n')

Grand Theft Auto V is Coming Soon to Steam!

The biggest, most dynamic and most diverse open world ever created and now packed with layers of new detail, Grand Theft Auto V blends storytelling and gameplay in new ways as players repeatedly jump in and out of the lives of the games three lead characters, playing all sides of the games interwoven story. 

URL: https://facebook.com/67919847338_10152188716552339 

			************************

Left 4 Dead 2 is FREE today only on Steam!*

Free Zombies!

To celebrate the holidays in a special way this year, Left 4 Dead 2 will be free until 10AM PST on 12/26.

It will be free as in, grab it now, pay no money, and it is yours to keep forever  … 

URL: https://facebook.com/67919847338_10151870697512339 

			************************

Add PAYDAY 2 to your account for FREE starting now to the first 5 Million customers! Once you add the game it will remain in your account permanently, so don't miss out on this opportunity to play a great game! 

URL



### Year by year and month by month analysis 

In [9]:
df_times = df.copy()  # work on a copy so that df itself is left untouched
times = df_times.CreatedTime.str.split('-')
years = []
months = []

for time in times:
    years.append(time[0])
    months.append(time[1])
df_times['Years'] = years
df_times['Months'] = months

# CreatedTime has been split into Years and Months, so drop it by name
df_times.drop(columns=['CreatedTime'], inplace=True)
df_times.head()



| | Comments | Id | Likes | Message | Shares | Years | Months |
|---|---|---|---|---|---|---|---|
| 0 | 9 | 67919847338_10154886336202339 | 47 | Today's Deal: Save 60% on Sherlock Holmes: The... | 2.0 | 2017 | 7 |
| 1 | 4 | 67919847338_10154886171172339 | 53 | X Rebirth VR Edition is Now Available on Steam... | 1.0 | 2017 | 7 |
| 2 | 2 | 67919847338_10154886171112339 | 51 | The Wizards is Now Available on Steam Early Ac... | 2.0 | 2017 | 7 |
| 3 | 9 | 67919847338_10154886116497339 | 83 | Sundered is Now Available on Steam and is 10% ... | 4.0 | 2017 | 7 |
| 4 | 14 | 67919847338_10154886058252339 | 92 | Car Mechanic Simulator 2018 is Now Available o... | 9.0 | 2017 | 7 |


##### Plot the number of posts made in each year and each month

In [10]:
# takes in a dictionary and plots a bar graph for it
def plot_bars(dct, title='', xlabel=''):
    bars = list(dct.values())
    x_labels = [str(x) for x in dct.keys()]
    plot = bkh.bar_graph(bars, title=title, ylabel='Count', xlabel=xlabel, x_labels=x_labels)
    return plot

In [11]:
class OrderedCounter(Counter, OrderedDict):
     'Counter that remembers the order elements are first seen'
     def __repr__(self):
         return '%s(%r)' % (self.__class__.__name__,
                            OrderedDict(self))
     def __reduce__(self):
         return self.__class__, (OrderedDict(self),)


year_counts = OrderedCounter(years)
month_counts = OrderedCounter(months)

# bar graph for yearly posts
plot1 = plot_bars(year_counts, 'No of posts each year', 'Years')
# bar graph for monthly posts
plot2 = plot_bars(month_counts, 'No of posts each month', 'Months')

show(column(plot1, plot2))
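
The `OrderedCounter` recipe above counts items while remembering the order in which they were first seen. On Python 3.7 and later, regular dictionaries preserve insertion order, so a plain `Counter` iterates in first-seen order as well. A small sketch with made-up year strings:

```python
from collections import Counter

# Made-up sample, newest first, like the years list built from the posts
years_demo = ['2017', '2017', '2016', '2016', '2016', '2015']

counts_demo = Counter(years_demo)
# Iteration follows first-seen order, not count or key order
ordered_items = list(counts_demo.items())
print(ordered_items)
```

Note that `Counter.most_common()` and the `Counter` repr sort by count instead, so the order only carries over when iterating with `keys()`, `values()` or `items()`.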

> **Posts Each Year**: The page was active in the beginning (2009, 2010) but then slowed down for the next two years (2011, 2012). After that, the number of posts increased significantly. A possible reason is that the company realized the importance of social media in its customers' daily lives: Steam works in the field of gaming and thus relies heavily on social media advertising. 2017 is low because the data only runs until July.

> **Posts Each Month**: The page is most active during the middle of the year and least active during January. A possible reason is the New Year period: customers are in a holiday mood, so gaming companies avoid releasing new products, which leads to fewer posts.

#### The above two bar graphs show how active the company is during various phases. Let's analyze how active its customers are.
We analyze the average likes on posts for each year and each month. Each data point is, for example, the average number of likes per post in the month of January (or in the year 2015).

In [12]:
likes_years = defaultdict(int)
likes_months = defaultdict(int)

# Named `post` rather than `row` to avoid shadowing bokeh.layouts.row imported above
for ind, post in df_times.iterrows():
    likes_years[post.Years] += post.Likes
    likes_months[post.Months] += post.Likes

In [13]:
# Now get the average of each.
# likes_years and likes_months contain the total likes for each year and month.
# year_counts and month_counts contain the number of posts for each year and month.
# Dividing gives the average number of likes per post for each year and month.

likes_years_avg = OrderedDict()
likes_months_avg = OrderedDict()

for i in range(2009, 2018):
    year = str(i)
    likes_years_avg[year] = likes_years[year] / year_counts[year]

for i in range(1, 13):
    month = str(i).zfill(2)     # 1 becomes 01
    likes_months_avg[month] = likes_months[month] / month_counts[month]

In [14]:
plot1 = plot_bars(likes_years_avg, 'Avg likes per post per year', 'Years')
plot2 = plot_bars(likes_months_avg, 'Avg likes per post per month', 'Months')
show(column(plot1, plot2))

> **Per year graph**: The year with the highest average number of likes was 2012. Interestingly, 2012 was also the year with the lowest number of posts. Does this mean that, since there were so few posts in 2012, they were valued much more by customers? Or maybe the quality of posts decreased as the number of posts increased every year. Since gamers usually love game-related content, the rare occurrence of a Steam post may have been highly valued by the community: in 2012 the page made just 3 posts every 2 months.

> **Per month graph**: Here also the community is most active during the middle of the year. Apart from June, there is not much variation across the other months. Like the Facebook page, maybe the customers are also in a New Year mood during January. One interesting thing is the drop in the average from December to January, which was not the case for the number of posts.

##### An interesting question we had was: does making too many posts decrease their value? Let's make a scatter plot of the number of posts each year against the average likes per post

In [21]:
avg = list(likes_years_avg.values())
# year_counts starts from the year 2017, so look the counts up in the same
# 2009-to-2017 order as likes_years_avg instead of relying on a reverse
counts = [year_counts[year] for year in likes_years_avg]

plot = bkh.scatter_plot(counts, avg, ylabel='Avg likes per post', xlabel='No of posts')
plot.add_tools(HoverTool())
show(plot)


>There is not much variation, but the years with about 600 posts or fewer are the only three years with an average of more than 200 likes per post. They are 2011, 2012 and 2017. Of these three, 2017 is an outlier; the other two have very few posts (18 and 35, compared to 604 in 2017). 2014, on the other hand, had the maximum number of posts (1731) but a below-average number of likes per post (164).

Let's find the correlation between the average likes and the number of posts.

In [16]:
correlation = np.corrcoef(avg, counts)
correlation

array([[ 1.        , -0.44918521],
       [-0.44918521,  1.        ]])

>The correlation between these two comes out to be **-0.45**, which hints that a higher number of posts is associated with somewhat fewer likes per post. ***Maybe they should focus more on quality posts rather than the number of posts.***

>But with only a few points, a correlation has to be very close to -1 or 1 to be statistically significant. Here we have just 9 data points, so we cannot say anything confidently about the significance of the correlation between the two.
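
To make the small-sample caveat concrete, one standard check (not performed in the notebook) is the t-test for a correlation coefficient: under the null hypothesis of zero correlation, t = r·sqrt(n−2)/sqrt(1−r²) follows a Student's t distribution with n−2 degrees of freedom. The sketch below plugs in the r and n reported above; the helper name is hypothetical, and the critical value 2.365 (two-sided 5% level, 7 degrees of freedom) is taken from standard t tables rather than computed.

```python
import math


def correlation_t_statistic(r, n):
    """t statistic for testing H0: true correlation is zero, from sample r and n points."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)


r_yearly = -0.44918521   # correlation reported by np.corrcoef above
n_yearly = 9             # one data point per year, 2009 to 2017

t_stat = correlation_t_statistic(r_yearly, n_yearly)

# Two-sided 5% critical value for Student's t with n - 2 = 7 degrees of freedom
# (assumed from standard tables, not computed here)
T_CRITICAL_DF7 = 2.365
is_significant = abs(t_stat) > T_CRITICAL_DF7

print(round(t_stat, 2), is_significant)
```

With |t| of roughly 1.33, well below 2.365, the yearly correlation cannot be distinguished from zero at the 5% level, which agrees with the caution expressed above.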

We can get more data by considering the posts and their likes for each month. That gives us up to 9 years × 12 = 108 data points to deal with (somewhat fewer in practice, because 2009 and 2017 are only partially covered).
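
The 108 figure is an upper bound. Counting the calendar months actually covered by the dataset (April 2009 to July 2017, as stated at the top) gives the maximum number of monthly data points. A small sketch of that arithmetic, with a hypothetical helper name:

```python
def months_inclusive(start_year, start_month, end_year, end_month):
    """Number of calendar months from the start month to the end month, both included."""
    return (end_year - start_year) * 12 + (end_month - start_month) + 1


# Dataset range: April 2009 to July 2017
max_monthly_points = months_inclusive(2009, 4, 2017, 7)
print(max_monthly_points)
```

Months without any posts produce no entry at all, so the real number of points can be lower than this count.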

In [17]:
likes_per_month = defaultdict(list)
for ind, post in df_times.iterrows():
    likes_per_month[post.Years + post.Months].append(post.Likes)

# Now build two parallel lists: avg likes and post counts, one entry per month
avg_likes = []
posts = []
for month_key, month_likes in likes_per_month.items():
    avg_likes.append(sum(month_likes) / len(month_likes))
    posts.append(len(month_likes))

##### Scatter plot 

In [20]:
corr = np.corrcoef(avg_likes, posts)
print("\t\tCorrelation between avg likes per post and no of posts for that month = ", corr[0][1])

plot = bkh.scatter_plot(posts, avg_likes, xlabel='Posts for the month', ylabel='Avg likes per post',
                       title='Likes per post vs no of posts for the month')
show(plot)

		Correlation between avg likes per post and no of posts for that month =  -0.26149280084


>Well, one of the points can be considered an outlier, but overall the months which had a lot of posts have fewer likes per post (there are 8 points with more than 500 likes, and all have fewer than 100 posts for that month). As the number of posts increases, a person's wall may get crowded by them, and they start ignoring them.

>Also, the correlation (-0.26) is on the negative side. However, it is too weak to say anything confidently.
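
Since at least one point looks like an outlier, a rank-based measure such as Spearman's rho is a possible cross-check: it is the Pearson correlation of the ranks, so a single extreme value has less influence. The notebook does not compute it; the sketch below uses only the standard library, and all helper names and demo numbers are made up.

```python
import math


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def spearman_rho(xs, ys):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson_r(average_ranks(xs), average_ranks(ys))


# Made-up monthly data: the first month has an extreme average
posts_demo = [5, 20, 40, 80, 150]
avg_likes_demo = [900, 300, 250, 240, 100]

rho_demo = spearman_rho(posts_demo, avg_likes_demo)
tied_ranks = average_ranks([3, 1, 3])
print(rho_demo, tied_ranks)
```

On the real data, `spearman_rho(posts, avg_likes)` could be compared with the -0.26 above; a clearly different value would suggest that the outlier is driving the Pearson result.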