#Usage
This notebook is a collection of functions and processes for analyzing typical clickstream datasets.

##Data Format(s)

###Structured data
This notebook largely assumes that the data is organized roughly as below:

|      date_time      | user_id | content_id |     attr1    |           attr2           |   attr3  | ... |
|:-------------------:|:-------:|:----------:|:------------:|:-------------------------:|:--------:|:---:|
| 2011-01-01 00:00:00 |  user_x |  webpage_i |  google.com  |    "slick amazon deals"   | Nebraska | ... |
| 2011-01-01 00:10:00 |  user_y |  webpage_i | facebook.com | "reddit photoshopbattles" | New York | ... |

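As a rough sketch of how data in this shape might be loaded, the snippet below reads a small inline CSV with pandas. The sample rows and the `clicks_df` name are illustrative assumptions; the column names simply mirror the table above and will likely differ in a real dataset.

```python
import io
import pandas as pd

# Hypothetical sample mirroring the table above; the column names are
# assumptions and should be adapted to your own dataset.
raw_csv = u"""date_time,user_id,content_id,attr1,attr2,attr3
2011-01-01 00:10:00,user_y,webpage_i,facebook.com,reddit photoshopbattles,New York
2011-01-01 00:00:00,user_x,webpage_i,google.com,slick amazon deals,Nebraska
"""

# parse_dates turns the date_time column into proper timestamps
clicks_df = pd.read_csv(io.StringIO(raw_csv), parse_dates=["date_time"])

# sort chronologically so that later per-user operations see clicks in order
clicks_df = clicks_df.sort_values("date_time").reset_index(drop=True)
print(clicks_df.dtypes)
```

Parsing the timestamps up front keeps later steps, such as ordering clicks within a session, from having to compare raw strings.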

###List of content accessed
Some datasets are instead a collection of clickstream pathways, and that's fine too! We can still extract meaningful information from them. Here, each row corresponds to a separate session, and each element in the row is a content ID. The IDs are typically listed in the order in which they were accessed. This information may or may not be linked to timestamps, user IDs, or other attributes like those in the richer dataset shown above:

```
10307 10311 12487
12559
12695 12703 18715
10311 12387 12515 12691 12695 12699 12703 12823 12831 12847 18595 18679 18751
...
```

If your click data is not in this format, consider pre-processing it into this format. If you have a specific proprietary format that you would like to analyze using this code, feel free to contact me.
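
One possible way to do that pre-processing is sketched below using only the standard library. The records are made up for illustration, and the sketch assumes that all clicks by one user form a single session; real data may need a session ID or an inactivity timeout to split sessions instead.

```python
from collections import defaultdict

# Hypothetical records in (date_time, user_id, content_id) form; the values
# are made up for illustration.
records = [
    ("2011-01-01 00:10:00", "user_y", "12559"),
    ("2011-01-01 00:00:00", "user_x", "10307"),
    ("2011-01-01 00:05:00", "user_x", "10311"),
    ("2011-01-01 00:07:00", "user_x", "12487"),
]

# Assumption: all clicks by one user form a single session.
clicks_by_user = defaultdict(list)
for date_time, user_id, content_id in sorted(records):
    # ISO-style timestamps sort chronologically as plain strings,
    # so sorting the tuples puts the clicks in access order
    clicks_by_user[user_id].append(content_id)

# one space-separated line per session, matching the format shown above
session_lines = [" ".join(clicks_by_user[user]) for user in sorted(clicks_by_user)]
for line in session_lines:
    print(line)
```

The resulting `session_lines` can be written to a text file, one session per line, and then read back in the same way as the list-of-clicks data.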

In [2]:
#boilerplate imports
import pandas as pd
import numpy as np
from collections import Counter
from bokeh.plotting import figure, output_file, show
from bokeh.io import output_notebook

In [3]:
output_notebook()

###Let's dive right in!
The simplest form of data you'll encounter is the list of clicks. Here's an example of how to get started with the simplest use-case.

In [14]:
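#getting the data from a URL [can be made easier if you have a local copy]
#data obtained from Philippe Fournier-Viger's SPMF page. This is a KDD Cup 2000 dataset
try:
    from urllib.request import urlopen  #Python 3
except ImportError:
    from urllib2 import urlopen  #Python 2
target_url = "http://www.philippe-fournier-viger.com/spmf/datasets/BMS1_itemset_mining.txt"
#decode each line so the rest of the notebook works with text rather than bytes
list_of_clicks = [_line.decode("utf-8") for _line in urlopen(target_url).readlines()]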

In [20]:
#see what the data looks like
for _line in list_of_clicks[:5]:
    print(" -> ".join(_line.split()))

10307 -> 10311 -> 12487
12559
12695 -> 12703 -> 18715
10311 -> 12387 -> 12515 -> 12691 -> 12695 -> 12699 -> 12703 -> 12823 -> 12831 -> 12847 -> 18595 -> 18679 -> 18751
10291 -> 12523 -> 12531 -> 12535 -> 12883


In [21]:
#each line is a string, let's make it into a list exactly like we did when viewing it
list_of_clicks = [_line.split() for _line in list_of_clicks]

In [40]:
#let's find the most common pages accessed in the dataset
#flatten the list of lists...
import itertools
flat_list_of_clicks = list(itertools.chain.from_iterable(list_of_clicks))
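
With the clicks flattened, ranking pages by popularity is a short step with `Counter.most_common`. The sketch below uses a toy list of sessions (an assumption standing in for the real `list_of_clicks`) so that it runs without downloading anything.

```python
from collections import Counter
from itertools import chain

# toy sessions standing in for list_of_clicks after each line has been split
toy_sessions = [["10307", "10311", "12487"], ["12559"], ["10311", "12487"], ["10311"]]

# flatten the list of lists into a single list of content IDs
flat_clicks = list(chain.from_iterable(toy_sessions))
page_counts = Counter(flat_clicks)

# most_common(n) returns (page, count) pairs sorted by descending count
top_pages = page_counts.most_common(2)
print(top_pages)
```

On the real dataset the same call would be `Counter(flat_list_of_clicks).most_common(n)` for whatever `n` is useful.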

In [53]:
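#bokeh.charts is no longer part of Bokeh, so build the bar chart with bokeh.plotting instead
#p = figure(title = "Most common pages in BMS webview")
#placeholder data for now; swap in the real page counts later
names = ['a', 'b', 'c', 'd']
p = figure(x_range = names)
p.vbar(x = names, top = [3, 4, 5, 6], width = 0.8)
show(p)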

ImportError: No module named ipykernel.comm
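
Beyond the top pages, it can help to see how click counts are distributed: how many pages were clicked once, twice, and so on. Applying `Counter` twice gives that distribution. The toy data and variable names below are assumptions for illustration only.

```python
from collections import Counter

# toy click log: page "a" is clicked 3 times, "b" and "d" twice, "c" once
toy_clicks = ["a", "a", "a", "b", "b", "c", "d", "d"]

# first pass: number of clicks per page
clicks_per_page = Counter(toy_clicks)

# second pass: number of pages that share each click count
pages_per_click_count = Counter(clicks_per_page.values())
print(sorted(pages_per_click_count.items()))
```

Each key of the second counter is a click count and each value is the number of pages with exactly that many clicks.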

In [45]:
Counter((Counter(flat_list_of_clicks)).values())

Counter({1: 14, 3: 10, 2: 9, 4: 8, 6: 8, 15: 7, 8: 6, 22: 6, 24: 6, 5: 5, 12: 5, 13: 5, 14: 5, 7: 5, 10: 4, 67: 4, 77: 4, 90: 4, 58: 4, 410: 4, 9: 3, 16: 3, 23: 3, 38: 3, 87: 3, 92: 3, 94: 3, 101: 3, 110: 3, 152: 3, 43: 3, 11: 2, 19: 2, 20: 2, 25: 2, 27: 2, 33: 2, 40: 2, 47: 2, 56: 2, 63: 2, 64: 2, 65: 2, 70: 2, 73: 2, 587: 2, 78: 2, 80: 2, 93: 2, 96: 2, 105: 2, 109: 2, 113: 2, 128: 2, 129: 2, 145: 2, 148: 2, 164: 2, 167: 2, 173: 2, 187: 2, 204: 2, 207: 2, 230: 2, 232: 2, 233: 2, 240: 2, 211: 2, 251: 2, 260: 2, 272: 2, 797: 2, 294: 2, 297: 2, 319: 2, 345: 2, 209: 2, 407: 2, 411: 2, 439: 2, 440: 2, 448: 2, 512: 1, 2049: 1, 88: 1, 18: 1, 533: 1, 3612: 1, 543: 1, 32: 1, 34: 1, 35: 1, 36: 1, 37: 1, 39: 1, 291: 1, 42: 1, 257: 1, 45: 1, 46: 1, 48: 1, 561: 1, 562: 1, 365: 1, 54: 1, 521: 1, 570: 1, 863: 1, 60: 1, 62: 1, 66: 1, 68: 1, 69: 1, 583: 1, 72: 1, 3658: 1, 79: 1, 81: 1, 82: 1, 83: 1, 84: 1, 85: 1, 86: 1, 600: 1, 1113: 1, 1039: 1, 97: 1, 99: 1, 100: 1, 103: 1, 106: 1, 620: 1, 229: 1, 14