# MemeTracker Dataset

This notebook contains the documented code used to preprocess the MemeTracker dataset for our experiments.

**Note:** The dataset is large, so running this notebook requires a machine with about 40 GB of RAM or more.

---

Import libs

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
import os
import glob
import gzip
import pickle
import pandas as pd
import numpy as np
import multiprocessing

from datetime import datetime
import pytz

from matplotlib import pyplot as plt
%matplotlib inline

import networkx as nx

from tsvar.preprocessing import Dataset

# Set larger cell width for nicer visualization
from IPython.display import display, HTML  # `IPython.core.display` is deprecated for these imports
display(HTML("<style>.container { width:80% !important; }</style>"))

---

## 1. Download and parse the dataset

First, all the files from the dataset must be downloaded from the [SNAP](http://snap.stanford.edu/data/memetracker9.html) dataset repository.

Then the raw files must be parsed using the `raw2df.py` script provided by [NPHC](https://github.com/achab/nphc/tree/master/nphc/datasets/memetracker) to format the raw data in a convenient tabular format.

The point processes can then be built with the rest of this notebook.

---

## 2. Load the raw MemeTracker dataframe

Set the input directory where the parsed dataframes are located

In [None]:
DATA_DIR = './parsed_memetracker_data'

Load raw dataframes (in parallel)

In [None]:
list_df_files = sorted(glob.glob(os.path.join(DATA_DIR, 'parsed', 'df_*.csv')))

def worker(fname):
    return pd.read_csv(fname)

# One process per file, capped at the number of available CPUs
pool = multiprocessing.Pool(min(len(list_df_files), multiprocessing.cpu_count()))

jobs = list()
for fname in list_df_files:
    job = pool.apply_async(worker, (fname, ))
    jobs.append(job)

data = list()
for job in jobs:
    data.append(job.get())

df = pd.concat(data, ignore_index=True)
del data

pool.close()  # No more jobs will be submitted
pool.join()   # Wait for the worker processes to exit

* `Blog` = receiver (the site that published the post)
* `Hyperlink` = sender (the site that the post links to)
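
To make the sender/receiver convention concrete, here is a minimal sketch on toy data. The site names and the `edges` and `n_events` names are hypothetical and not part of the dataset or of the original notebook. It counts how many events go from each sender to each receiver.

```python
import pandas as pd

# Toy events (hypothetical site names, for illustration only)
toy = pd.DataFrame({
    'Blog':      ['b.com', 'b.com', 'c.com', 'a.com'],
    'Hyperlink': ['a.com', 'a.com', 'b.com', ''],
})

# Keep events that carry a hyperlink, then count sender -> receiver pairs
edges = (toy[toy['Hyperlink'] != '']
         .groupby(['Hyperlink', 'Blog'])
         .size()
         .rename('n_events')
         .reset_index()
         .rename(columns={'Hyperlink': 'sender', 'Blog': 'receiver'}))
print(edges)
```

The last toy event has an empty hyperlink, so it has no sender and does not contribute to any pair.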

Visualize the dataset

In [None]:
print(df.shape)
df

---

## 3. Clean the dataframe

### 3.1. Clean columns

#### Clean the `Hyperlink` column

In [None]:
df['Hyperlink'] = df['Hyperlink'].str.strip()  # Strip whitespace (which appears in null hyperlinks)

#### Cast `Date` and build timestamps

In [None]:
df['Date'] = pd.to_datetime(df['Date'])
df['Timestamp'] = df['Date'].values.astype(np.int64) // (10 ** 9)
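
As a sanity check of the conversion above, the following sketch uses toy dates (not taken from the dataset) to show that casting `datetime64[ns]` values to `int64` gives nanoseconds since the Unix epoch, so the integer division by `10 ** 9` yields whole seconds. The explicit cast to `datetime64[ns]` is an addition in this sketch: it assumes nothing about the resolution that `pd.to_datetime` infers, which can differ across pandas versions.

```python
import numpy as np
import pandas as pd

# Toy dates (hypothetical, for illustration only)
toy = pd.DataFrame({'Date': ['1970-01-01 00:00:10', '1970-01-02 00:00:00']})
toy['Date'] = pd.to_datetime(toy['Date'])

# Force nanosecond resolution, view as int64 nanoseconds since the epoch,
# then convert to whole seconds
nanoseconds = toy['Date'].values.astype('datetime64[ns]').astype(np.int64)
toy['Timestamp'] = nanoseconds // (10 ** 9)
print(toy)
```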

### 3.2. Find the top-100 blogs

In [None]:
df['has_hyperlink'] = df.Hyperlink != ''  # Indicate if event has hyperlink
df['has_hyperlink'] = df['has_hyperlink'].astype(int)

Count the number of hyperlinks per blog, i.e., how many times each blog was cited.
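
The notebook does not show a cell for this count, so the following is only a sketch of how it could be computed, on toy data with hypothetical site names. The name `cited_count` is an assumption, and grouping by `Hyperlink` follows from reading the hyperlink as the cited site.

```python
import pandas as pd

# Toy events (hypothetical site names, for illustration only)
toy = pd.DataFrame({
    'Blog':      ['b.com', 'c.com', 'c.com', 'a.com'],
    'Hyperlink': ['a.com', 'a.com', 'b.com', ''],
})
toy['has_hyperlink'] = (toy['Hyperlink'] != '').astype(int)

# Number of times each site is cited: count events per hyperlink target,
# ignoring events without a hyperlink
cited_count = (toy.loc[toy['has_hyperlink'] == 1]
               .groupby('Hyperlink')
               .size()
               .sort_values(ascending=False))
print(cited_count)
```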

Count the number of distinct posts per blog.

In [None]:
count_series = df.groupby('Blog').agg({'PostNb': set})['PostNb'].apply(len)

Keep only the top-100 sites

In [None]:
top_num = 100

top_series = count_series.sort_values(ascending=False).iloc[:top_num]
print(f'There are {top_series.sum():,d} items in the top-{top_num:d} sites')
display(top_series)

### 3.3. Keep only events between sites in the top-100 blogs

Finally, we keep only the events whose blog is in the top-100 and whose hyperlink is either empty or points to another top-100 blog.

In [None]:
top_site_set = set(top_series.index.tolist())  # All top blog sites

top_blog_mask = df['Blog'].isin(top_site_set)  # Blog is in top
top_hp_mask = df['Hyperlink'].isin(top_site_set | {''})  # Hyperlink is in top or there is no hyperlink (i.e. it is null)

In [None]:
# Build mask of valid events
valid_event_mask = top_blog_mask & top_hp_mask

# Filter
df_top = df.loc[valid_event_mask].copy()  # Copy so that columns can be added later without a SettingWithCopyWarning
assert len(df_top) == np.sum(valid_event_mask)

print(f'{np.sum(valid_event_mask):,d} events are between the top-{top_num} sites'
      f' out of the {len(df):,d} ({np.sum(valid_event_mask)*100/len(df):.2f}%)')
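
The effect of the two masks can be checked on a small example. This is a sketch with hypothetical site names and a two-site "top" set: an event survives only if its blog is in the set and its hyperlink is either in the set or empty.

```python
import pandas as pd

# Toy events (hypothetical site names, for illustration only)
toy = pd.DataFrame({
    'Blog':      ['a.com', 'a.com', 'b.com', 'z.com', 'b.com'],
    'Hyperlink': ['b.com', '',      'z.com', 'a.com', 'a.com'],
})
top_sites = {'a.com', 'b.com'}  # Hypothetical "top" set

blog_ok = toy['Blog'].isin(top_sites)              # Receiver is a top site
link_ok = toy['Hyperlink'].isin(top_sites | {''})  # Sender is a top site or absent
kept = toy.loc[blog_ok & link_ok]
print(kept)
```

Rows 2 and 3 are dropped because they involve `z.com`, either as the hyperlink or as the blog.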

### 3.4. Final formatting steps

Build numerical index for each blog

In [None]:
top_name_to_idx_map = dict(zip(top_series.index, range(top_num)))

# Make numerical index for blogs
df_top['Blog_idx'] = df_top['Blog'].apply(lambda name: top_name_to_idx_map[name])

# Add hyperlinks index
top_name_to_idx_map[''] = None  # Set None for No-Hyperlink
df_top['Hyperlink_idx'] = df_top['Hyperlink'].apply(lambda name: top_name_to_idx_map[name]).astype(pd.Int32Dtype())
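
The nullable `Int32` dtype is what lets `Hyperlink_idx` hold integer indices alongside missing values. The sketch below illustrates this on toy data with a hypothetical two-site mapping. It uses `Series.map` with a dictionary instead of the `apply` call above, which is a swapped-in technique: names absent from the dictionary become missing values rather than raising a `KeyError`.

```python
import pandas as pd

name_to_idx = {'a.com': 0, 'b.com': 1}  # Hypothetical mapping
hyperlinks = pd.Series(['b.com', '', 'a.com'])

# Names absent from the mapping (here the empty string) become NaN,
# which the nullable Int32 dtype stores as <NA>
hyperlink_idx = hyperlinks.map(name_to_idx).astype(pd.Int32Dtype())
print(hyperlink_idx)
```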

In [None]:
df_top = df_top.sort_values(by='Timestamp')  # Sort events chronologically
df_top['Timestamp'] -= df_top['Timestamp'].min()  # Translate time origin
df_top = df_top[['Hyperlink_idx', 'Blog_idx', 'Hyperlink', 'Blog', 'Date', 'Timestamp']]  # Keep only the relevant columns

In [None]:
print(df_top.shape)
display(df_top.head(10))

Save the clean dataframe

In [None]:
df_top.to_pickle(os.path.join(DATA_DIR, 'memetracker-top100-clean.pickle.gz'), compression='gzip')
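
To load the saved file in a later session, `pd.read_pickle` with the same compression should work. The sketch below shows the round trip on a toy dataframe written to a temporary directory; the file name and contents are hypothetical.

```python
import os
import tempfile

import pandas as pd

# Toy dataframe (hypothetical content, for illustration only)
toy = pd.DataFrame({'Blog_idx': [0, 1], 'Timestamp': [0, 42]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy-clean.pickle.gz')
    toy.to_pickle(path, compression='gzip')              # Save, as above
    reloaded = pd.read_pickle(path, compression='gzip')  # Load it back

print(reloaded.equals(toy))
```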