# Visualizing Semantic Data

In this notebook we manipulate and visualize label data produced by Dandelion, using the fbTREX semantic API.

First we load the usual libraries and the CSV output of labels.py for two different users.
We use the sample dataset here. You can change the paths in file1 and file2, as well as the number of top labels to keep for each user.

Then we build a dataframe with the daily counts of the n top labels of each user.

# Load and prepare data

Combine data for two users and take a look at the dataset.

In [None]:
# import libraries
import pandas as pd
import altair as alt
alt.renderers.enable('notebook')

# configure files location and number of top labels to get.
file1 = '../sample_data/user_a_labels.csv'
file2 = '../sample_data/user_b_labels.csv'
top = 5

# load the data
df1 = pd.read_csv(file1)
df2 = pd.read_csv(file2)
df = pd.concat([df1, df2])

df.head()
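
Before going further it can help to confirm that both files carry the columns the later cells rely on. The sketch below is a minimal check; the column list is inferred from the cells in this notebook (not from labels.py itself), and `missing_columns` and the sample frame are hypothetical names and data used only for illustration.

```python
import pandas as pd

# Columns referenced by the later cells; inferred from this notebook, an assumption.
REQUIRED_COLUMNS = ["user", "word", "count", "impressionTime"]


def missing_columns(frame, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from `frame`."""
    return [name for name in required if name not in frame.columns]


# Tiny made-up frame standing in for one of the CSV files.
sample = pd.DataFrame(
    {
        "user": ["user_a", "user_a"],
        "word": ["Barcelona", "Partido Popular"],
        "count": [3, 1],
        "impressionTime": ["2018-01-01", "2018-01-01"],
    }
)

# An empty list means every required column is present.
print(missing_columns(sample))
# Dropping a column makes the helper report it.
print(missing_columns(sample.drop(columns=["count"])))
```

Running the same helper on `df1` and `df2` right after loading would surface a schema mismatch before any chart is built.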

Group by word, sum the counts, and keep only the rows for the n most frequent labels of each user.

In [None]:
# keep only each user's top n labels (n is the `top` value configured above)

keep_list = df1.groupby('word')['count'].sum().nlargest(top).index.tolist()
df1 = df1[df1['word'].isin(keep_list)]
keep_list = df2.groupby('word')['count'].sum().nlargest(top).index.tolist()
df2 = df2[df2['word'].isin(keep_list)]

# combined frame of top labels; named so that it does not overwrite the `top` setting
top_labels = pd.concat([df1, df2])

# Visualize

## Labels per user

In [None]:
alt.Chart(top_labels).mark_line().encode(
    x='impressionTime:T',
    y='count:Q',
    color='word:N',
    row='user:N'
).properties(
    width = 600,
    height = 450
)
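
The chart above is drawn from rows restricted to each user's most frequent labels. With more than two users, repeating the filter once per dataframe gets tedious, so here is a minimal sketch of the same idea as one helper working on a combined frame. It assumes the `user`, `word` and `count` columns used in this notebook; `keep_top_labels` and the sample data are hypothetical.

```python
import pandas as pd


def keep_top_labels(frame, n):
    """Keep only rows whose word is among the n highest-total words of its user."""
    totals = frame.groupby(["user", "word"], as_index=False)["count"].sum()
    # Sort all (user, word) totals, then take the first n rows of every user.
    ranked = totals.sort_values("count", ascending=False).groupby("user").head(n)
    # An inner merge keeps only the rows that match a retained (user, word) pair.
    return frame.merge(ranked[["user", "word"]], on=["user", "word"])


# Made-up daily counts for two users.
sample = pd.DataFrame(
    {
        "user": ["user_a"] * 4 + ["user_b"] * 3,
        "word": ["Barcelona", "Barcelona", "Madrid", "Sevilla",
                 "Madrid", "Barcelona", "Valencia"],
        "count": [3, 2, 1, 4, 7, 1, 2],
    }
)

kept = keep_top_labels(sample, n=2)
print(kept)
```

Ties between labels with equal totals are broken arbitrarily in this sketch, which may or may not matter for a given dataset.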

## Users per label

Choose a list of words (in this example, 'Barcelona' and 'Partido Popular').
Then we plot how the counts of these words evolve over time for the two users.

In [None]:
words_list = ['Barcelona', 'Partido Popular']

filtered = df[df['word'].isin(words_list)]
alt.Chart(filtered).mark_line().encode(
    x='impressionTime:T',
    y='count:Q',
    color='user:N',
    row='word:N'
).properties(
    width = 600,
    height = 300
)
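
The same comparison can also be read as a table. The sketch below pivots the rows of one word so that each user becomes a column and each day a row, which makes days where only one user saw the label easy to spot. It assumes the column names used in this notebook; the records are made up for illustration.

```python
import pandas as pd

# Made-up rows in the shape assumed for the label CSVs.
records = pd.DataFrame(
    {
        "impressionTime": ["2018-01-01", "2018-01-01", "2018-01-02"],
        "user": ["user_a", "user_b", "user_a"],
        "word": ["Barcelona", "Barcelona", "Barcelona"],
        "count": [3, 1, 2],
    }
)

one_word = records[records["word"] == "Barcelona"]

# One row per day, one column per user; days without data for a user become 0.
side_by_side = one_word.pivot_table(
    index="impressionTime",
    columns="user",
    values="count",
    aggfunc="sum",
    fill_value=0,
)
print(side_by_side)
```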

## Compare

In [None]:

# sort rows by count (the bar order itself is set through `sort` below)
df1 = df1.sort_values('count', axis=0, ascending=False)
df2 = df2.sort_values('count', axis=0, ascending=False)

user1 = alt.Chart(df1).mark_bar().encode(
    # total count per word across all days
    x='sum(count):Q',
    y=alt.Y(
        'word:N',
        # the sort field is a plain column name, without a ':Q' type suffix
        sort=alt.EncodingSortField(
            field="count",
            op="sum",
            order="descending"
        )
    )
).properties(title=str(df1.user.value_counts().idxmax()))

user2 = alt.Chart(df2).mark_bar().encode(
    x='sum(count):Q',
    y=alt.Y(
        'word:N',
        sort=alt.EncodingSortField(
            field="count",
            op="sum",
            order="descending"
        )
    )
).properties(title=str(df2.user.value_counts().idxmax()))

user1 & user2
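
Alongside the two bar charts, it can be useful to see which labels the users have in common. The following is a rough sketch that totals the counts per word and user and keeps the words seen by every user. It assumes the `user`, `word` and `count` columns used in this notebook; `label_totals` and the sample rows are hypothetical.

```python
import pandas as pd


def label_totals(frame):
    """Total count per word, one column per user, 0 where a user never saw the word."""
    return frame.groupby(["word", "user"])["count"].sum().unstack("user", fill_value=0)


# Made-up rows for two users.
sample = pd.DataFrame(
    {
        "user": ["user_a", "user_a", "user_b", "user_b"],
        "word": ["Barcelona", "Madrid", "Barcelona", "Valencia"],
        "count": [5, 1, 2, 4],
    }
)

totals = label_totals(sample)
# Keep only the words that have a non-zero total for every user.
shared = totals[(totals > 0).all(axis=1)]
print(totals)
print(shared)
```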

# Wordcloud

In [None]:
from wordcloud import WordCloud
import matplotlib
import matplotlib.pyplot as plt

%matplotlib inline
matplotlib.rcParams['figure.figsize'] = [20, 10]

# join all labels of the first user into one space-separated string
data = df1.word.str.cat(sep=' ')

wordcloud = WordCloud(font_path='../src/fonts/DejaVuSans.ttf',
                      relative_scaling = 1.0,
                      width=2000,
                      height=1000
                      ).generate(data)
plt.imshow(wordcloud)
# the figure size is already set through rcParams above
plt.axis("off")
plt.title('User ' + str(df1.user.value_counts().idxmax()))
plt.show()
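
The cell above feeds plain text to `generate`, which tokenizes the text itself, so a multi-word label such as 'Partido Popular' may be split and the `count` column is not taken into account. As a possible alternative, the sketch below builds a word-to-total dictionary with pandas; the `wordcloud` package offers `WordCloud.generate_from_frequencies` for input of this kind. The column names are those used in this notebook and the sample rows are made up.

```python
import pandas as pd

# Made-up label rows in the shape assumed for the label CSVs.
labels = pd.DataFrame(
    {
        "word": ["Partido Popular", "Barcelona", "Partido Popular"],
        "count": [2, 5, 1],
    }
)

# Sum the counts per label; multi-word labels stay intact as dictionary keys.
frequencies = labels.groupby("word")["count"].sum().to_dict()
print(frequencies)

# With the wordcloud package installed, the dictionary could be used like this:
# WordCloud(width=2000, height=1000).generate_from_frequencies(frequencies)
```

Whether weighting by `count` gives a more faithful picture than repeating words in a text depends on how labels.py aggregates its output, which this notebook does not show.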

