<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Load,-Clean-and-Explore" data-toc-modified-id="Load,-Clean-and-Explore-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Load, Clean and Explore</a></span></li><li><span><a href="#Venn-diagrams" data-toc-modified-id="Venn-diagrams-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Venn diagrams</a></span><ul class="toc-item"><li><span><a href="#Load-libraries-and-define-functions" data-toc-modified-id="Load-libraries-and-define-functions-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Load libraries and define functions</a></span></li><li><span><a href="#Load-data-(the-youtube.py---video-csv-output-specified-as-path)" data-toc-modified-id="Load-data-(the-youtube.py---video-csv-output-specified-as-path)-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Load data (the youtube.py --video csv output specified as path)</a></span></li><li><span><a href="#Clean-the-data" data-toc-modified-id="Clean-the-data-2.3"><span class="toc-item-num">2.3&nbsp;&nbsp;</span>Clean the data</a></span></li><li><span><a href="#Visualize" data-toc-modified-id="Visualize-2.4"><span class="toc-item-num">2.4&nbsp;&nbsp;</span>Visualize</a></span></li></ul></li></ul></div>

# Load, Clean and Explore

Before using this notebook, run youtube.py (in the src folder) to download the data for a specific videoId.

Then we:
1. import all the needed libraries,
2. specify the path to the csv file (the output of dashboard/src/youtube.py),
3. check that the title of the video corresponds to the one we want to analyze.

First of all, specify a path to a valid .csv file in the box below.

In [None]:
path = '../sample_data/video.csv'

In [None]:
import math
import itertools
from matplotlib import pyplot as plt
from matplotlib_venn import venn2, venn3
import numpy as np
import squarify
import pandas as pd
%matplotlib inline

In [None]:


df = pd.read_csv(path)

title = df['sourceTitle'][0]
df = df[['watcher', 'id', 'related_source',
         'related_videoId', 'related_title', 'related_index']]

print(title)
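
Before selecting columns, it can help to confirm that the file contains the fields the rest of the notebook relies on. The following is a minimal sketch, assuming the column names used in the cell above. The `missing_columns` helper and the two sample rows are made up for illustration and do not come from youtube.py.

```python
import io
import pandas as pd

# Columns the notebook reads from the csv (taken from the cell above).
REQUIRED = ['sourceTitle', 'watcher', 'id', 'related_source',
            'related_videoId', 'related_title', 'related_index']

def missing_columns(frame, required=REQUIRED):
    """Return the required column names that are absent from `frame`."""
    return [col for col in required if col not in frame.columns]

# Made-up sample standing in for the real csv file.
sample_csv = io.StringIO(
    "sourceTitle,watcher,id,related_source,related_videoId,related_title,related_index\n"
    "Some video,user-a,session-1,Channel X,vid001,First suggestion,1\n"
    "Some video,user-b,session-2,Channel Y,vid002,Second suggestion,1\n"
)
sample = pd.read_csv(sample_csv)

print(missing_columns(sample))                            # nothing is missing
print(missing_columns(sample.drop(columns=['watcher'])))  # 'watcher' is reported
print(sample['sourceTitle'].nunique())                    # distinct source titles
```

If the last number is greater than one, the file mixes several source videos and `df['sourceTitle'][0]` only describes the first row.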

Next, list the users who watched that video and choose two of them, so we can compare the videos suggested to each.

In [None]:
df.watcher.unique()

Pick the two users, then create a separate dataset for each of them.

In [None]:
user1 = 'garbanzo-muffin-orzo'
user2 = 'milk-quince-alfalfa'

df1 = df[df['watcher'] == user1]
df2 = df[df['watcher'] == user2]


For each of the two users, we need to choose only one id (one specific session of recommended videos).
Then we reduce the datasets to that session only, so we can make a comparison.

First, choose among the unique values for df1, then for df2. Those will be id1 and id2.

In [None]:
df1.id.unique()

In [None]:
df2.id.unique()
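
To make the choice easier, the sessions can also be listed together with the number of suggestions each one contains. This is only a sketch on made-up rows: the watcher names, session ids and the `n_suggestions` column name are placeholders, not values from the sample data.

```python
import pandas as pd

# Made-up rows with the same column names as the notebook's dataframe.
toy = pd.DataFrame({
    'watcher': ['user-a', 'user-a', 'user-a', 'user-b', 'user-b'],
    'id': ['session-1', 'session-1', 'session-2', 'session-3', 'session-3'],
    'related_videoId': ['vid001', 'vid002', 'vid001', 'vid001', 'vid003'],
})

# One row per (watcher, session) pair with the number of suggested videos.
sessions = (toy.groupby(['watcher', 'id'])
               .size()
               .reset_index(name='n_suggestions'))
print(sessions)
```

Running the same `groupby` on `df` would show every session of every watcher at once, which may help when picking sessions of comparable length.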

In [None]:
id1 = '46e2fff75bbc798cce53388de80e9ac7e9f8a6ef'
id2 = '120bce0960e72e333ebc5f95e48611c6b8d64b20'


df1 = df[df['id'] == id1]
df2 = df[df['id'] == id2]
df1.index = df1.related_index
df2.index = df2.related_index

df1 = df1[['related_source', 'related_title', 'related_videoId', 'related_index']]
df2 = df2[['related_source', 'related_title', 'related_videoId', 'related_index']]

Now we can see which videos were suggested to each user after watching the same video.

In [None]:
merge1 = df1[['related_title']].rename(columns={"related_title": user1})
merge2 = df2[['related_title']].rename(columns={"related_title": user2})
compare = pd.concat([merge1,merge2], axis=1)
compare

We can also look at the most frequently suggested video sources (YouTube channels) for each user.

In [None]:
channels1 = df1.related_source.value_counts().rename_axis('name').to_frame('Count')
channels2 = df2.related_source.value_counts().rename_axis('name').to_frame('Count')

fig, (ax, ax2) = plt.subplots(ncols=2, figsize=(18, 8))


channels1.plot(kind='barh', ax=ax, title=user1)
channels2.plot(kind='barh', ax=ax2, title=user2)
# set_label_position() returns None, so do not assign its result back to ax2
ax2.yaxis.set_label_position("right")

plt.tight_layout()

In [None]:
fig = plt.figure(figsize=(15,10))
plt.title('Suggested channels after watching: '+title+'\n User: '+user1)
squarify.plot(sizes=channels1.Count, label=channels1.index, alpha=.8)
plt.axis('off')
plt.show()

In [None]:
fig = plt.figure(figsize=(15,10))
plt.title('Suggested channels after watching: '+title+'\n User: '+user2)
squarify.plot(sizes=channels2.Count, label=channels2.index, alpha=.8)
plt.axis('off')
plt.show()

# Venn diagrams

## Load libraries and define functions

Define some helper functions to generate Venn diagrams.

In [None]:
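# Yield, in increasing order, the integers whose binary form has exactly n ones.
# They are used as list positions for the intersections of n sets.
def gen_index(n):
    x = -1
    while True:
        while True:
            x = x + 1
            if bin(x).count('1') == n:
                break
        yield x

# Compute the intersection of every combination of sets. Each result is stored
# at the position whose bits mark the sets involved (position 0 stays None).
# Note: combinations() and gen_index() only agree on the ordering for up to
# three sets, which is all that venn2/venn3 need.
def make_intersections(sets):
    inters = [None] * 2**len(sets)
    for i in range(1, len(sets) + 1):
        ind = gen_index(i)
        for subset in itertools.combinations(sets, i):
            inters[next(ind)] = set.intersection(*subset)
    return inters

# Convert a position into the region id used by matplotlib_venn: a binary
# string, reversed so that the first character refers to the first set
# (e.g. 1 -> '100' for three sets).
def number2venn_id(x, n_fill):
    venn_id = bin(x)[2:].zfill(n_fill)
    return venn_id[::-1]

# Map each region id to the elements that belong to that region only:
# every intersection is subtracted from the intersections of fewer sets,
# so no element is listed in more than one region.
def sets2dict(sets):
    inters = make_intersections(sets)
    d = {}
    for i in range(1, len(inters)):
        d[number2venn_id(i, len(sets))] = inters[i]
        for j in range(1, len(inters)):
            if bin(j).count('1') < bin(i).count('1'):
                inters[j] = inters[j] - inters[i]
                d[number2venn_id(j, len(sets))] = inters[j]
    return d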
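
As a cross-check of the helpers above, the same regions can be computed by brute force: for every element, record which sets contain it. The sketch below is an independent illustration, not part of the original helpers. `venn_regions` is a hypothetical name and the channel names are made up. Unlike `sets2dict`, it leaves empty regions out of the result.

```python
def venn_regions(sets):
    """Group elements by their exact membership pattern.

    The key has one character per set ('1' = member), in the same order
    as `sets`, e.g. '110' means "in the first two sets only".
    """
    regions = {}
    for element in set().union(*sets):
        key = ''.join('1' if element in s else '0' for s in sets)
        regions.setdefault(key, set()).add(element)
    return regions

# Made-up channel names for three hypothetical users.
A = {'news', 'music', 'sport'}
B = {'news', 'music', 'games'}
C = {'news', 'cooking'}

for key, members in sorted(venn_regions([A, B, C]).items()):
    print(key, sorted(members))
# 001 ['cooking']
# 010 ['games']
# 100 ['sport']
# 110 ['music']
# 111 ['news']
```

The keys use the same format as the ids produced by `number2venn_id`, so for non-empty regions the two approaches should agree.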

In [None]:
fig = plt.figure(figsize=(20,10))
plt.title('Video suggestions after: '+title)
v = venn2([set(df1.related_videoId), set(df2.related_videoId)], (user1, user2))
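
The overlap shown in the diagram can also be summarized as a single number. One possible summary is the Jaccard similarity, the size of the intersection divided by the size of the union. The sketch below uses made-up video ids; the `jaccard` helper is an addition for illustration and is not part of the original analysis.

```python
def jaccard(a, b):
    """Jaccard similarity: shared items divided by distinct items overall.

    Defined here as 1.0 when both inputs are empty.
    """
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Made-up suggestion lists for two hypothetical users.
ids_user1 = ['vid001', 'vid002', 'vid003', 'vid004']
ids_user2 = ['vid003', 'vid004', 'vid005']

print(jaccard(ids_user1, ids_user2))  # 2 shared out of 5 distinct -> 0.4
```

With the notebook's data, the call would be `jaccard(df1.related_videoId, df2.related_videoId)`.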

In [None]:
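# Combine position and videoId into a single key. assign() returns new
# dataframes, which avoids pandas' SettingWithCopyWarning on filtered data.
df1 = df1.assign(uniqueId=df1["related_index"].map(str) + df1["related_videoId"])
df2 = df2.assign(uniqueId=df2["related_index"].map(str) + df2["related_videoId"])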

fig = plt.figure(figsize=(20,10))
plt.title('Video suggestions after: '+title+'\n With both videoId and position in the suggested list in common.')
v2 = venn2([set(df1.uniqueId), set(df2.uniqueId)], (user1, user2))
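
Concatenating the position and the videoId into one string works when the ids have a fixed length, but in general two different pairs can produce the same string (for example `'1' + '2abc'` and `'12' + 'abc'`). A sketch of an alternative that compares `(position, videoId)` tuples instead, on made-up lists:

```python
# Made-up suggestion lists, in the order shown to each hypothetical user.
suggested_user1 = ['vid001', 'vid002', 'vid003']
suggested_user2 = ['vid001', 'vid003', 'vid002']

# Pair every video with its 1-based position in the list.
pairs1 = set(enumerate(suggested_user1, start=1))
pairs2 = set(enumerate(suggested_user2, start=1))

# Videos suggested to both users at the same position.
same_position = pairs1 & pairs2
print(sorted(same_position))  # [(1, 'vid001')]
```

With the notebook's dataframes, the equivalent sets would be `set(zip(df1.related_index, df1.related_videoId))` and the same for `df2`; sets of tuples can be passed to `venn2` just like sets of strings.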

## Load data (the youtube.py --video csv output specified as path)

In [None]:
path = '../sample_data/video.csv'

df = pd.read_csv(path)

title = df['sourceTitle'][0]
df = df[['watcher','id','related_source','related_videoId','related_title','related_index']]

print(title)

## Clean the data

First, choose three users among the unique ones in your dataset.


In [None]:
df.watcher.unique()

In [None]:
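# Take the first three watchers; unpacking the whole array only works when
# the dataset contains exactly as many users as there are variables
user1, user2, user3 = df.watcher.unique()[:3]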

df1 = df[df.watcher == user1]
df2 = df[df.watcher == user2]
df3 = df[df.watcher == user3]

Then, you need to pick just one session (id) per user.

In [None]:
df1.id.unique()

In [None]:
df2.id.unique()

In [None]:
df3.id.unique()

In [None]:
df1 = df1[df1.id == '46e2fff75bbc798cce53388de80e9ac7e9f8a6ef']
df2 = df2[df2.id == 'b2a48bdfddc7a9a0cd0f706da9c850dc790d7c41']
df3 = df3[df3.id == '120bce0960e72e333ebc5f95e48611c6b8d64b20']

Then we create the three sets of sources, one per user, each built from a single session (id).

In [None]:
A = set(df1.related_source)
B = set(df2.related_source)
C = set(df3.related_source)

sets_source = [A, B, C]

## Visualize

By source name:

In [None]:
d = sets2dict(sets_source)

# Plot it
plt.figure(figsize=(40,20))
h = venn3(sets_source, (user1, user2, user3))
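# Replace the default counts in each region with the names of its sources
for region_id, sources in d.items():
    label = h.get_label_by_id(region_id)
    if label:
        label.set_fontsize(12)
        label.set_text('\n'.join(sorted(sources)))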
