Plotting many series together in a heatmap can show highly correlated series very clearly.

I did this offline, dumping the whole training set to ~145 PNG files, each 1000 pixels high, then browsed them with an app... There are around 49000 page titles and hundreds of shared patterns between them, so sorting by the "Page" column is a good first step towards finding correlated series.

Here are some of the highlights, patterns that really stand out.

Heatmaps are good for a general overview, but I have not found a good way to show the page titles alongside the graphics, so they are quite limited. Some of them make good guessing games ;)

In [None]:
import numpy as np
import pandas as pd
import gc, os, sys
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
%matplotlib inline

In [None]:
def dtypes():
    train = pd.read_csv("../input/train_2.csv", index_col='Page', nrows=2)
    return {c:np.float32 for c in train.columns}

In [None]:
train = np.log1p(pd.read_csv("../input/train_2.csv", index_col='Page', dtype=dtypes()))
train.columns = train.columns.astype('datetime64[ns]')
train.fillna(0, inplace=True)
train.sort_index(inplace=True)
train.head()

In [None]:
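import imageio  # needed by save_plot; was not imported above

def save_plot(fname, df):
    # Scale the frame to 0..255 and write it as an 8-bit greyscale image
    dat = df.values
    dat = (dat / np.max(dat)) * 255.
    print(fname, np.min(dat), np.max(dat))
    imageio.imsave(fname, dat.astype(np.uint8))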

#save_plot('testplot.png', train.loc[stats.sort_values('sum').tail(900).index])
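
The offline dump mentioned at the top can be sketched roughly as follows. This is a minimal sketch, not the code that was actually used: `iter_image_chunks`, `rows_per_image` and the `heatmap_NNN.png` naming are assumptions, as is the choice of one global scale (unlike `save_plot`, which scales each frame by its own maximum). It only builds the image arrays; each pair could then be written out with `imageio.imsave`.

```python
import numpy as np
import pandas as pd

def iter_image_chunks(df, rows_per_image=1000, prefix="heatmap"):
    """Yield (file name, uint8 image array) for consecutive row blocks of df."""
    # Assumption: one global scale so brightness is comparable across images
    scale = np.max(df.values)
    for i, start in enumerate(range(0, len(df), rows_per_image)):
        block = df.values[start:start + rows_per_image]
        img = (block / scale * 255.0).astype(np.uint8)
        yield f"{prefix}_{i:03d}.png", img

# Tiny synthetic stand-in for the log-scaled `train` frame
demo = pd.DataFrame(np.random.RandomState(0).rand(2500, 30).astype(np.float32))
chunks = list(iter_image_chunks(demo, rows_per_image=1000))
for fname, img in chunks:
    print(fname, img.shape, img.dtype)
```

With the sorted `train` frame, consecutive images then hold alphabetically adjacent pages, which is what makes shared patterns easy to spot while browsing.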

In [None]:
def substr(s):
    # regex=False: match s literally, so '.' in '_es.wiki' is not a wildcard
    return train.loc[train.index.str.contains(s, regex=False)]

def substrs(l):
    return pd.concat((substr(s) for s in l))

In [None]:
# Change the palette for whole notebook here...
cmap = 'magma'

def show_plot(df):
    dat = df.values
    dat = (dat / np.max(dat)) * 255.

    fig, ax = plt.subplots()
    fig.set_size_inches(18, 8)
    ax.xaxis_date()
    ax.yaxis.tick_left()
    ax.grid(False)  # the string 'off' is truthy and turns the grid on in current matplotlib
    cols = df.columns
    x_lims = [ cols.min(), cols.max() ]
    x_lims = mdates.date2num(x_lims)
    y_lims = [0, df.shape[0]]
    plt.imshow(dat, cmap=cmap, aspect='equal', extent=[x_lims[0], x_lims[1],  y_lims[0], y_lims[1]])
    date_format = mdates.DateFormatter('%y-%m-%d')
    ax.xaxis.set_major_formatter(date_format)
    plt.show()

def show_plot_for_index(idx):
    print(idx)
    show_plot(train.loc[idx])

In [None]:
show_plot(train.loc[train.index.str.startswith('API:')])

In [None]:
train.max(1).describe()

In [None]:
show_plot(train.loc[train.index.str.contains('Olympic')])

In [None]:
show_plot(train.loc[train.index.str.contains('Super_Bowl')])

In [None]:
show_plot(train.loc[train.index.str.contains('Special:WhatLinksHere')])

Pages that have a date in the URL: highly time sensitive (note some faint vertical blue traces later on - are the pages fetched all at once for maintenance? Or spidered?)

In [None]:
show_plot(substr('/featured/201'))

In [None]:
show_plot(substrs(('Game_of_Thrones', 'Walking_Dead')))

French TV series - more weekly cycles...

In [None]:
show_plot(substr('serie_de_televis'))

In [None]:
show_plot(substr('Topic:'))

Current and past F1 drivers, nice illustration of the F1 calendar. Which one won his first race in May 2016 and which won a championship in November that year?

In [None]:
show_plot(substrs(('Lewis_Hamilton', 'Nico_Rosberg', 'Max_Verstappen', 'Niki_Lauda')))

Monthly access patterns...

In [None]:
show_plot(train.loc[train.index.str.startswith('Category:Deletion')])

Halloween: when grouping by page title, this is the group for which the median benchmark has the worst SMAPE error for the validation period Sept 13 - Nov 13...

In [None]:
show_plot(train.loc[train.index.str.startswith('Halloween')])
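
For reference, SMAPE for one series can be sketched as below. Assumptions: the 0/0 case counts as zero error, the constant forecast stands in for a median benchmark, and the numbers are made up. Since `train` here is log-scaled, values would need `np.expm1` first to get back to raw views.

```python
import numpy as np

def smape(actual, forecast):
    """Symmetric mean absolute percentage error, in percent (0 to 200)."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    denom = np.abs(actual) + np.abs(forecast)
    # Assumed convention: a day with actual == forecast == 0 contributes zero error
    ratio = np.divide(2.0 * np.abs(forecast - actual), denom,
                      out=np.zeros_like(denom), where=denom != 0)
    return 100.0 * ratio.mean()

# Made-up example: flat history, then a one-day spike in the validation window
history = np.array([90.0, 100.0, 110.0, 100.0, 100.0])
actual = np.array([100.0, 100.0, 400.0, 100.0])
forecast = np.full_like(actual, np.median(history))  # constant median forecast
score = smape(actual, forecast)
print(round(score, 2))
```

A constant forecast cannot follow a seasonal spike like Halloween, so days near the peak dominate the error.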

In [None]:
show_plot(train.loc[train.index.str.startswith('Fußball-')])

Seems like national flag SVG files are fetched simultaneously, periodically. Or is it automated maintenance?

In [None]:
show_plot(substr('File:Flag_'))

In [None]:
show_plot(substr('Help:Categories'))

Unlikely premier league champions...

In [None]:
show_plot(substr('Leicester'))

Brexit...

In [None]:
show_plot(substr('United_Kingdom'))

Highly stationary, badly spelt, now dormant?

In [None]:
show_plot(substr('User:GoogleAnalitycsRoman'))

This is interesting: Prince the musician died in April 2016 - but most pages containing "Prince" had a surge of hits on that day... Something to do with the Royal family happened early November 2016...

TODO no good! really need to see page titles...

In [None]:
show_plot(substr('Prince'))
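
One possible way to see titles for a small group is to put them on the y axis. The sketch below is an assumption-laden starting point rather than a fix for the general case: `show_plot_with_titles` and `max_rows` are hypothetical names, the demo titles and data are made up, and it only stays legible for a few dozen rows.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

def show_plot_with_titles(df, cmap="magma", max_rows=40):
    """Heatmap of a small frame with one y tick label per page title."""
    if len(df) > max_rows:
        raise ValueError(f"too many rows to label legibly: {len(df)} > {max_rows}")
    fig, ax = plt.subplots(figsize=(18, 0.3 * len(df) + 2))
    ax.imshow(df.values, cmap=cmap, aspect="auto", interpolation="nearest")
    # One tick per row, labelled with the index value (the page title)
    ax.set_yticks(np.arange(len(df)))
    ax.set_yticklabels(list(df.index))
    # Label roughly eight columns rather than every day
    step = max(1, df.shape[1] // 8)
    ticks = np.arange(0, df.shape[1], step)
    ax.set_xticks(ticks)
    ax.set_xticklabels([str(c)[:10] for c in df.columns[ticks]],
                       rotation=45, ha="right")
    return fig, ax

# Made-up titles and random data, only to exercise the function
dates = pd.date_range("2016-01-01", periods=60)
demo = pd.DataFrame(np.random.RandomState(1).rand(3, 60),
                    index=["Prince_page_A", "Prince_page_B", "Prince_page_C"],
                    columns=dates)
fig, ax = show_plot_with_titles(demo)
```

With the real data this could be called on a sample, for example `show_plot_with_titles(substr('Prince').head(40))`.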

In [None]:
show_plot(substrs(['Star_Wars', 'スター']))

Some samples of pages from different regions - different seasonality visible.

In [None]:
def do_sample(df, n):
    return df.sample(n, random_state=42)

In [None]:
show_plot(do_sample(substr('_es.wiki'), 200))

In [None]:
show_plot(do_sample(substr('_ru.wiki'), 200))

Note: there is a French-specific thin bright line around October 2016, and a similar one in early 2016 for the Russian pages above - days of unusually high spider activity?

In [None]:
show_plot(do_sample(substr('_fr.wiki'), 200))

In [None]:
show_plot(do_sample(substr('_de.wiki'), 200))

Darker left edges mean more missing data at start of series for Japanese & Chinese sites.

In [None]:
show_plot(do_sample(substr('_ja.wiki'), 200))

In [None]:
show_plot(do_sample(substr('_zh.wiki'), 200))
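
The dark left edges can be quantified by counting the zero-valued days at the start of each series. This is a sketch under an assumption: after `fillna(0)`, missing data and genuinely zero views are indistinguishable, so the count covers both. `leading_zero_days` is a hypothetical helper and the demo frame is made up.

```python
import numpy as np
import pandas as pd

def leading_zero_days(df):
    """Number of zero-valued days before the first non-zero value in each row."""
    nonzero = df.values != 0
    first = nonzero.argmax(axis=1)              # index of first non-zero day (0 if none)
    first[~nonzero.any(axis=1)] = df.shape[1]   # all-zero rows: every day counts
    return pd.Series(first, index=df.index, name="leading_zero_days")

# Made-up rows: two leading zeros, none, and an all-zero series
demo = pd.DataFrame(
    [[0, 0, 3, 1], [5, 0, 2, 2], [0, 0, 0, 0]],
    index=["a_ja.wikipedia.org", "b_ja.wikipedia.org", "c_zh.wikipedia.org"],
)
gaps = leading_zero_days(demo)
print(gaps)
```

Comparing something like `leading_zero_days(substr('_ja.wiki')).mean()` across language groups would put a number on the difference seen in the plots.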

In [None]:
stats = train.sum(1).to_frame('sum')
stats['mean'] = train.mean(1)
stats['max'] = train.max(1)
stats['min'] = train.min(1)
stats.describe()

Find pages where the proportion of views is highest for one particular day:

In [None]:
date_of_interest = '2016-08-22'
ser = (train.loc[:, date_of_interest] / stats['sum']).sort_values().dropna()
show_plot(train.loc[ser[-300:].index])

Find pages where the views have a strong peak:

In [None]:
ser = (stats['max'] / stats['mean']).sort_values().dropna()
show_plot(train.loc[ser[-300:].index])

Least peaky - the weekly cycle is visible, and Xmas/New Year show up as faint dark vertical lines:

In [None]:
ser = (stats['max'] / stats['mean']).sort_values().dropna()
show_plot_for_index(ser[:300].index)

Pages with highest level of minimum views:

In [None]:
ser = stats['min'].sort_values().dropna()
show_plot_for_index(ser[-200:].index)

In [None]:
import seaborn as sns

In [None]:
def show_heatmap(df):
    g = sns.clustermap(df.corr(), figsize=(12,12))
    plt.setp(g.ax_heatmap.yaxis.get_majorticklabels(), rotation=0)
    plt.show()

Correlation heatmap for F1 drivers: note spider pages are less correlated to others and are mostly split off into a subgroup...

In [None]:
show_heatmap(substrs(('Lewis_Hamilton', 'Nico_Rosberg', 'Max_Verstappen', 'Niki_Lauda')).T)
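
To check which rows of the clustermap are spider series, the agent can be read off the end of the page name. This sketch assumes page names end in `_<access>_<agent>`, as in the example names below; `agent_of` is a hypothetical helper.

```python
import pandas as pd

def agent_of(pages):
    """Map each page name to its trailing agent field (e.g. 'spider')."""
    # Assumption: the last two '_'-separated fields are access type and agent
    return pd.Series([p.rsplit("_", 2)[-1] for p in pages],
                     index=pages, name="agent")

pages = [
    "Lewis_Hamilton_en.wikipedia.org_all-access_spider",
    "Lewis_Hamilton_en.wikipedia.org_desktop_all-agents",
    "Nico_Rosberg_de.wikipedia.org_all-access_spider",
]
agents = agent_of(pages)
spider_pages = agents[agents == "spider"].index.tolist()
print(agents.value_counts())
```

Calling `agent_of` on the index of the frame passed to `show_heatmap` gives labels that can be compared against the clusters.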

In [None]:
show_heatmap(substr('File:Flag').sample(25, random_state=4242).T)