# ServeWPM Tutorial
ServeWPM is a Docker image for running [OpenWPM](https://github.com/citp/OpenWPM).

### Running Cells 
To run a cell, select it and press **Shift-Enter**, or click the `Run` button in the menu. An asterisk to the left of the cell, shown as `[*]`, indicates that the cell is currently running.

### Project Settings
First, add the OpenWPM files to the Python path so that they can be imported. `Run` each Python cell to execute it.

In [None]:
import sys
import os
sys.path.insert(0, os.environ['FRAMEWORK'])

Next, set the project name. Pass the value to `%env` without quotes; in this case, `PROJECT_NAME` is `demo`.

In [None]:
%env PROJECT_NAME demo
PROJECT_NAME = os.environ['PROJECT_NAME']
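
The `%env` magic only works inside Jupyter/IPython. As a minimal sketch, assuming the same project name `demo`, a standalone Python script could set the variable through `os.environ` instead:

```python
import os

# Plain-Python counterpart of the `%env PROJECT_NAME demo` magic above.
# 'demo' is the example project name used throughout this tutorial.
os.environ['PROJECT_NAME'] = 'demo'

PROJECT_NAME = os.environ['PROJECT_NAME']
print(PROJECT_NAME)
```

Values set this way are visible to the current process and to any child processes it starts.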

# Run OpenWPM Example
First, run a version of the `demo.py` example from OpenWPM.

The import and project name statements are repeated in the cell so that the example could also run as a standalone Python file.

Notes: The `%%time` cell magic tells Jupyter to time the cell. The output directory is already set to the `$PROJECT_NAME/` directory by the line `manager_params['data_directory'] = os.environ['NOTEBOOKS'] + '/' + PROJECT_NAME`. The OpenWPM imports here are `TaskManager` and `CommandSequence`.

In [None]:
%%time
from __future__ import absolute_import
from six.moves import range
import sys
import os
sys.path.insert(0, os.environ['FRAMEWORK'])
from automation import TaskManager, CommandSequence

# The local output directory
PROJECT_NAME = os.environ['PROJECT_NAME']

# The number of browsers to launch
NUM_BROWSERS = 3
# The list of sites that we wish to crawl
sites = ['http://www.example.com',
         'http://www.princeton.edu',
         'http://citp.princeton.edu/']

# Loads the default manager parameters and NUM_BROWSERS copies of the default browser parameters
manager_params, browser_params = TaskManager.load_default_params(NUM_BROWSERS)

# Update browser configuration (use this for per-browser settings)
for i in range(NUM_BROWSERS):
    # Record HTTP Requests and Responses
    browser_params[i]['http_instrument'] = True
    # Enable flash for all three browsers
    browser_params[i]['disable_flash'] = False
browser_params[0]['headless'] = True  # Launch only browser 0 headless

# Update TaskManager configuration (use this for crawl-wide settings)
manager_params['data_directory'] = os.environ['NOTEBOOKS'] + '/' + PROJECT_NAME
manager_params['log_directory'] = os.environ['NOTEBOOKS'] + '/' + PROJECT_NAME

# Instantiates the measurement platform
# Commands time out by default after 60 seconds
manager = TaskManager.TaskManager(manager_params, browser_params)

# Visits the sites with all browsers simultaneously
for site in sites:
    command_sequence = CommandSequence.CommandSequence(site)

    # Start by visiting the page
    command_sequence.get(sleep=0, timeout=60)

    # dump_profile_cookies/dump_flash_cookies closes the current tab.
    command_sequence.dump_profile_cookies(120)

    # index='**' synchronizes visits between the three browsers
    manager.execute_command_sequence(command_sequence, index='**')

# Shuts down the browsers and waits for the data to finish logging
manager.close()

### Expected output
The resulting output should look something like this when it finishes:
```
...
BrowserManager       - INFO     - BROWSER 3: EXECUTING COMMAND: ('DUMP_PROFILE_COOKIES', 1508454416.26996, 8)
BrowserManager       - INFO     - BROWSER 1: EXECUTING COMMAND: ('DUMP_PROFILE_COOKIES', 1508454416.269908, 7)
CPU times: user 180 ms, sys: 100 ms, total: 280 ms
Wall time: 39 s
```

# Querying The Database
The output files are in the local `$PROJECT_NAME/` dir and include: `openwpm.log`, `crawl-data.sqlite`, `screenshots`, and `sources`.
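
Before querying, it can help to confirm that the crawl actually wrote these files. The following is a minimal sketch: `check_outputs` is a hypothetical helper (not part of OpenWPM or ServeWPM), and it is demonstrated on a throwaway temporary directory rather than on a real crawl.

```python
import os
import tempfile

# Output names taken from the list above
EXPECTED_OUTPUTS = ['openwpm.log', 'crawl-data.sqlite', 'screenshots', 'sources']

def check_outputs(project_dir):
    """Map each expected output name to whether it exists in project_dir."""
    return {name: os.path.exists(os.path.join(project_dir, name))
            for name in EXPECTED_OUTPUTS}

# Demonstration on a temporary directory that holds only two of the outputs
demo_dir = tempfile.mkdtemp()
open(os.path.join(demo_dir, 'openwpm.log'), 'w').close()
os.mkdir(os.path.join(demo_dir, 'sources'))

status = check_outputs(demo_dir)
print(status)
```

In the notebook, calling `check_outputs(PROJECT_NAME)` should report which of the outputs are present, assuming the notebook's working directory contains the `$PROJECT_NAME/` directory.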

## Using Django
A Django app named `export` is set up to read databases with the `crawl-data.sqlite` structure. It has a read-only connection to `export/crawl-data.sqlite`.

First, copy the `crawl-data.sqlite` output file from `$PROJECT_NAME` to `export` using bash:

In [None]:
!cp $PROJECT_NAME/crawl-data.sqlite export/crawl-data.sqlite

Next, import the models:

In [None]:
from export.models import (
    Crawl,
    Crawlhistory,
    FlashCookies,
    HttpRequests,
    HttpResponses,
    Localstorage,
    ProfileCookies,
    SiteVisits,
    Task,
    Xpath,
)

Finally, use the Django ORM (object-relational mapper) to inspect the objects created from each table (e.g. `SiteVisits`, as below):

In [None]:
visits = SiteVisits.objects.all()
v = visits[0]
v.__dict__

For more on Django's ORM, there is extensive documentation: https://docs.djangoproject.com/en/dev/topics/db/queries/#retrieving-objects

## Using Sqlite3
You can also, of course, use `sqlite3` or other scripts to analyze the data in `$PROJECT_NAME/crawl-data.sqlite`.

First, connect to the database:

In [None]:
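import os
import sqlite3
# PROJECT_NAME was set in the Project Settings cell; the path is relative
# to the notebook's working directory, as in the `cp` command above
rel_path = os.path.join(PROJECT_NAME, 'crawl-data.sqlite')
db = sqlite3.connect(rel_path)
cursor = db.cursor()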

Then have fun, for example by listing the tables and fetching every row from each:

In [None]:
cursor.execute("SELECT name FROM sqlite_master WHERE type='table';")
# Each row is a 1-tuple holding the table name (works on Python 2 and 3)
tables = [row[0] for row in cursor.fetchall()]
print(tables)

pairs = []
for table in tables:
    cursor.execute("SELECT * from {}".format(table))
    elements = cursor.fetchall()
    pairs.append((table, elements))

print(pairs[1])
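
Fetching every row can be slow on a large crawl, so a row count per table is often a lighter first look. The sketch below runs against an in-memory database so that it is self-contained; the `site_visits` table and its columns are a hypothetical stand-in for the real schema, and `row_counts` is an illustrative helper rather than part of ServeWPM.

```python
import sqlite3

# In-memory stand-in for crawl-data.sqlite; the table and columns here are
# hypothetical and only illustrate the query pattern.
db = sqlite3.connect(':memory:')
db.execute("CREATE TABLE site_visits (visit_id INTEGER PRIMARY KEY, site_url TEXT)")
db.executemany(
    "INSERT INTO site_visits (site_url) VALUES (?)",
    [('http://www.example.com',), ('http://www.princeton.edu',)],
)

def row_counts(connection):
    """Return {table_name: row_count} for every table in the database."""
    cursor = connection.cursor()
    cursor.execute("SELECT name FROM sqlite_master WHERE type='table'")
    names = [row[0] for row in cursor.fetchall()]
    counts = {}
    for name in names:
        # Table names cannot be bound as parameters, so quote the identifier
        cursor.execute('SELECT COUNT(*) FROM "{}"'.format(name))
        counts[name] = cursor.fetchone()[0]
    return counts

counts = row_counts(db)
print(counts)
```

Passing the connection opened on the real `crawl-data.sqlite` file to `row_counts` should work the same way.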

# Serving Django
Django isn't just a great ORM; it is also a web framework. Its server is not running right now, but it can be started from here.

First, create a superuser, or admin. There is an example password below.

In [None]:
from django.contrib.auth.models import User
User.objects.create_superuser('admin', 'admin@example.com', 'synthetics1126599/commencements')

Next, start the server with gunicorn, bound to port 8000:

In [None]:
!gunicorn ServeWPM.wsgi:application --bind 0.0.0.0:8000 --workers 3

Finally, log in on port 8000 as the superuser. If this is running on localhost, go to [http://127.0.0.1:8000/](http://127.0.0.1:8000/).
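
If the page does not load, a quick check of whether anything is listening on the port can narrow down the problem. This is a minimal sketch using only the standard library; `port_is_open` is a hypothetical helper. Because the `!gunicorn` cell should keep running for as long as the server is up, the check would need to run from a separate notebook or terminal.

```python
import socket

def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts
        return False

# Port 8000 matches the --bind argument passed to gunicorn above
print(port_is_open('127.0.0.1', 8000))
```

A `True` result only shows that something accepts connections on the port, not that it is the Django app.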


# Troubleshooting
If the OpenWPM script fails, try starting Firefox by itself through Selenium:

In [None]:
from selenium import webdriver
browser = webdriver.Firefox()
browser.get('http://www.princeton.edu')
browser.quit()

# That's It!