<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Load-Data-and-Libraries" data-toc-modified-id="Load-Data-and-Libraries-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Load Data and Libraries</a></span></li><li><span><a href="#Visualize" data-toc-modified-id="Visualize-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Visualize</a></span><ul class="toc-item"><li><span><a href="#Views-and-Comments-in-time" data-toc-modified-id="Views-and-Comments-in-time-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Views and Comments in time</a></span><ul class="toc-item"><li><ul class="toc-item"><li><span><a href="#Let's-do-some-data-exploration-to-try-to-understand-what-makes-a-post-appear." data-toc-modified-id="Let's-do-some-data-exploration-to-try-to-understand-what-makes-a-post-appear.-2.1.0.1"><span class="toc-item-num">2.1.0.1&nbsp;&nbsp;</span>Let's do some data exploration to try to understand what makes a post appear.</a></span></li></ul></li></ul></li></ul></li></ul></div>

# Load Data and Libraries

In [None]:
path = '../sample_data/combined.csv'

import altair as alt
import pandas as pd

# for the notebook only (not for JupyterLab) run this command once per session
alt.renderers.enable('notebook')

df = pd.read_csv(path)
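# Keep only rows whose 'reactions' value looks like a plain number.
# astype(str) turns missing values into 'nan', which the pattern rejects.
mask = df['reactions'].astype(str).str.match(r"^\d+(\.\d+)*$")
df = df[mask].copy()
df['reactions'] = pd.to_numeric(df['reactions'])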
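
To illustrate what the `reactions` filter above keeps and drops, here is a minimal sketch using only the standard library. The sample strings are invented for illustration and are not taken from `combined.csv`; the helper `parses_as_float` is hypothetical.

```python
import re

# Same pattern as the notebook's filter; sample strings are made up.
PATTERN = re.compile(r"^\d+(\.\d+)*$")

samples = ["12", "3.5", "1.2.3", "1,2K", "", "12 mil", "7."]

# Like Series.str.match, re.match anchors the pattern at the start.
kept = [s for s in samples if PATTERN.match(s)]
print(kept)


def parses_as_float(text):
    """Return True if Python can convert the text to a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


# Values that pass the pattern but are still not valid numbers.
unparseable = [s for s in kept if not parses_as_float(s)]
print(unparseable)
```

Note that the `(\.\d+)*` group may repeat, so a value such as `1.2.3` passes the filter even though it is not a valid number; a numeric conversion would likely fail on such a row if the data contains one.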

# Visualize

## Views and Comments in time

We can explore which posts are available, whether they have been seen, and how often.

In [None]:
alt.Chart(df, width=800, height=600).transform_calculate(
    url='https://www.facebook.com' + alt.datum.url
).mark_circle(color='red', filled=True).encode(
    x='date:T',
    y='comments:Q',
    color='source:N',
    size=alt.Size('total_views:O'),
    href='url:N',
    tooltip=['source:N', 'seen_by:O', 'total_views:O', 'reactions:Q',
             'likes:Q', 'ahah', 'love', 'wow', 'grrr:Q', 'sigh:Q', 'comments:Q']
).interactive()

Next, let's count how many posts were seen versus not seen, and how the posts are distributed by number of views.

In [None]:
chart1 = alt.Chart(df, width=200, height=300).mark_bar().encode(
    x='visible:N',
    y='count(visible):Q',
    color='visible:N'
)

chart2 = alt.Chart(df).mark_bar().encode(
    y='count(postId):Q',
    x='total_views:O',
    color='visible:N'
)

chart1 | chart2

In [None]:
alt.Chart(df, width=200, height=300).mark_bar().encode(
    x='visible:N',
    y='count(source):Q',
    color='visible:N',
    column='source:N'
)

In [None]:
datevalues = pd.to_datetime(df.date)
data = df[['visible', 'source', 'postId']].groupby(
    ['visible', 'source', datevalues.dt.floor('d')]).count().reset_index()

alt.Chart(data).mark_area().encode(
    x="date:T",
    y=alt.Y("sum(postId):Q"),
    color="visible:N",
    column='source'
).properties(title='Comparison between seen and unseen unique posts by source.')
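
The daily aggregation in the cell above groups by two column names plus a separate Series of floored dates. The sketch below shows that step on a toy frame; the column names mirror the notebook, but every value is invented.

```python
import pandas as pd

# Toy data: column names follow the notebook, values are invented.
toy = pd.DataFrame({
    "postId": [1, 2, 3, 4],
    "source": ["ABC.es", "ABC.es", "eldiario.es", "ABC.es"],
    "visible": [True, True, False, True],
    "date": ["2018-10-01 08:15", "2018-10-01 21:40",
             "2018-10-01 09:00", "2018-10-02 10:30"],
})

# Truncate each timestamp to midnight of the same day.
days = pd.to_datetime(toy["date"]).dt.floor("D")

# Grouping by a Series (rather than a column name) uses the Series
# name, "date", as the name of the resulting index level.
daily = (toy[["visible", "source", "postId"]]
         .groupby(["visible", "source", days])
         .count()
         .reset_index())
print(daily)
```

After `count()`, the `postId` column holds the number of posts per group rather than identifiers, which is why the chart can sum it.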

In [None]:
data = df[['visible', 'source', 'postId', 'total_views']].groupby(
    ['visible', 'source', datevalues.dt.floor('d')]).sum().reset_index()

alt.Chart(data).mark_area().encode(
    x="date:T",
    y=alt.Y("sum(total_views):Q",
            stack="normalize"
            ),
    color="source:N"
).properties(width=600,
             height=400,
             title='Comparison of the (normalized) number of post appearances by source (non-unique).')
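
With `stack="normalize"`, each date's stacked values are rescaled so they sum to 1. As a rough pandas equivalent, and assuming made-up view counts, the same shares can be computed by hand:

```python
import pandas as pd

# Invented view counts for two sources over two days.
views = pd.DataFrame({
    "date": ["2018-10-01", "2018-10-01", "2018-10-02", "2018-10-02"],
    "source": ["ABC.es", "eldiario.es", "ABC.es", "eldiario.es"],
    "total_views": [30, 10, 5, 15],
})

# Total views per day, broadcast back to every row of that day.
daily_total = views.groupby("date")["total_views"].transform("sum")

# Each source's share of that day's views; shares sum to 1 per day.
views["share"] = views["total_views"] / daily_total
print(views)
```

This is only a sketch of the idea behind the normalized chart, not a description of how Altair implements it internally.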

#### Let's do some data exploration to try to understand what makes a post appear.

First, we can compare the reactions of posts that appeared with those of posts that did not. Although this is not enough to understand how other people's reactions influence whether a post appears on a timeline, it might give some insight.

In [None]:
data = df[['visible', 'likes', 'ahah', 'love', 'wow',
           'sigh', 'grrr', 'comments']].groupby('visible').mean()
# Display the per-group averages as the cell output
data

We can also get similar statistics by aggregating on the number of users who have seen the post. Again, there might be some correlation between how many reactions a post has and how many users see it.

In [None]:
# Work on a copy so the helper column below is not added to df itself
correlations = df.copy()
sources_dict = {'ABC.es': 1, 'eldiario.es': 2}
correlations['source_int'] = correlations['source'].apply(
    lambda x: sources_dict[x])


def color_negative_red(value):
    """
    Colors elements in a dataframe
    green if above 0.5, red if
    below -0.5, and black otherwise
    (including NaN values).
    """

    if value < -0.5:
        color = 'red'
    elif value > 0.5:
        color = 'green'
    else:
        color = 'black'

    return 'color: %s' % color


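# Restrict to numeric and boolean columns; recent pandas versions raise
# an error when corr() meets string columns such as 'source'.
correlations = correlations.select_dtypes(
    include=['number', 'bool']).corr().drop(['postId'])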
correlations = correlations.drop(correlations.index[0])
correlations.style.applymap(color_negative_red)
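
The styling function above colors a cell by fixed thresholds rather than by sign. The standalone sketch below reproduces that logic so the boundary cases are easy to check; `color_for` and its `threshold` parameter are hypothetical names introduced here for illustration.

```python
import math


def color_for(value, threshold=0.5):
    """Return the CSS rule for a correlation value, using the same
    comparisons as the notebook's styling function."""
    if value < -threshold:
        color = "red"
    elif value > threshold:
        color = "green"
    else:
        color = "black"
    return "color: %s" % color


# Boundary values and NaN: every comparison with NaN is False,
# so NaN falls through to the default color.
examples = [-0.8, -0.5, 0.2, 0.9, math.nan]
results = [color_for(v) for v in examples]
print(results)
```

Values exactly at the thresholds stay black because the comparisons are strict.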

In [None]:
data = df[['visible', 'source', 'comments', 'likes', 'ahah', 'love',
           'wow', 'sigh', 'grrr']].groupby(['visible', 'source']).mean()

data = data.stack().reset_index(-1).iloc[:, ::-1]
data.columns = ['mean', 'type']
data = data.reset_index()
alt.Chart(data).mark_line().encode(
    x='visible:O',
    y=alt.Y('mean(mean):Q', axis=alt.Axis(title='Average reactions per post')),
    color='type:N'
).properties(
    height=600,
    width=300,
    title='Change by reaction between seen and unseen posts').interactive()
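
The `stack().reset_index(-1).iloc[:, ::-1]` idiom used above converts a wide table of averages into long form, with one row per (group, reaction type). As a sketch on invented numbers, the same result can also be reached with `melt`, which some readers may find easier to follow:

```python
import pandas as pd

# Invented averages, indexed by 'visible' as after groupby(...).mean().
means = pd.DataFrame(
    {"likes": [10.0, 20.0], "comments": [1.0, 4.0]},
    index=pd.Index([False, True], name="visible"),
)

# The notebook's idiom: move column labels into rows, flip the column
# order so the value comes first, then rename and restore 'visible'.
long_a = means.stack().reset_index(-1).iloc[:, ::-1]
long_a.columns = ["mean", "type"]
long_a = long_a.reset_index()

# An alternative with melt that names the columns in one step.
long_b = means.reset_index().melt(
    id_vars="visible", var_name="type", value_name="mean")

# Both hold the same (visible, type, mean) rows, in a different order.
rows_a = sorted(zip(long_a["visible"], long_a["type"], long_a["mean"]))
rows_b = sorted(zip(long_b["visible"], long_b["type"], long_b["mean"]))
print(rows_a == rows_b)
```

The two forms differ only in row and column order, which does not matter for the charts since Altair refers to fields by name.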

In [None]:
datevalues = pd.to_datetime(df.date)
data = df[['visible', 'comments', 'likes', 'ahah', 'love',
           'wow', 'sigh', 'grrr']].groupby(['visible']).mean()

data = data.stack().reset_index(-1).iloc[:, ::-1]
data.columns = ['mean', 'type']
data = data.reset_index()

alt.Chart(data).mark_bar().encode(
    x='visible:N',
    y='mean:Q',
    color='visible:N',
    column='type:N'
).interactive()

In [None]:
data = df[['visible', 'source', 'comments', 'likes', 'ahah', 'love',
           'wow', 'sigh', 'grrr']].groupby(['visible', 'source']).mean()

data = data.stack().reset_index(-1).iloc[:, ::-1]
data.columns = ['mean', 'type']
data = data.reset_index()

alt.Chart(data).mark_bar().encode(
    y='type:N',
    x="mean:Q",
    color='type:N',
    column='visible'
).properties(
    width=400,
    height=300,
    title='Comparison between seen and unseen posts average composition of the reactions'
)

Moreover, we can check how the number of times a post has been seen relates to its average number of reactions.
This lets us see which posts are more likely to be seen given a specific type of reaction.

In [None]:
data = df[['visible', 'comments', 'likes', 'ahah', 'love',
           'wow', 'sigh', 'grrr']].groupby(['visible']).mean()
data = data.stack().reset_index(-1).iloc[:, ::-1]
data = data.reset_index()
data.columns = ['visible', 'mean', 'type']
alt.Chart(data).mark_bar().encode(
    x='type:N',
    y='mean:Q',
    color='type:N',
    column='visible:N'
).properties(
    width=300,
    height=200
).interactive()

In [None]:
data = df[['total_views', 'reactions', 'comments']
          ].groupby('total_views').mean().reset_index()
alt.Chart(data).mark_bar().encode(
    y='comments:Q',
    x='total_views:O'
)