pywik

Welcome to pywik, the static web server CSV logfile analyzer.

Features

pywik can filter and list:

visited pages,
search queries,
server errors,
unknown files (not in the good set),
bot visits,
404 pages (malware shows up here),
visits from Tor users (and 404s from Tor users),
external referers

Setup

Install the dependencies (you will need MongoDB):

virtualenv env
source env/bin/activate
pip install -r deps.txt

Set up the environment for pywik:

mkdir logs
./updatelists.sh
mkdir data/myhost
touch data/myhost/goodpaths
touch data/myhost/ignoremissing
touch data/myhost/ignorepaths
touch data/myhost/ownhosts
touch data/myhost/classes

Host-specific files

pywik uses a few host-specific files, which improve the output considerably. Create a directory under data named after your host and populate the following files accordingly.

ownhosts

A list of hostnames that are considered part of your infrastructure. Any log entry whose referer points to a host not on this list is considered an external hit.
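As a rough illustration of this rule, the following sketch treats a referer as external when its hostname is not listed in ownhosts. The helper name is_external and the exact-match comparison are assumptions made for this example; pywik's own matching may differ.

```python
from urllib.parse import urlparse


def is_external(referer, ownhosts):
    """Return True if the referer points to a host outside ownhosts.

    Hypothetical helper: assumes exact hostname matching against a set
    of lowercase hostnames.
    """
    host = urlparse(referer).hostname
    if host is None:
        # Empty or placeholder referers such as "-" carry no hostname.
        return False
    return host not in ownhosts


ownhosts = {"myhost.net", "www.myhost.net"}
print(is_external("http://www.myhost.net/index.html", ownhosts))
print(is_external("https://example.org/links", ownhosts))
```

Referers without a hostname are treated as internal here simply to keep them out of the external listing; that choice is part of the sketch, not documented pywik behaviour.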

goodpaths

Paths that count as page visits. Each line is a regexp.

ignoremissing

Paths that regularly generate 404 responses and should be ignored in the 404 listing. Each line is a regexp.

ignorepaths

Paths that are uninteresting for tracking pageviews, such as page requisites (e.g. .css and .js files). Each line is a regexp.

rss

Each line is a regexp for an rss/atom feed.
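The regexp-per-line files described above can be pictured with the following minimal sketch. The helper names and the sample patterns are made up for illustration, and whether pywik matches with re.search or re.match is an assumption here.

```python
import re


def compile_patterns(lines):
    """Compile one regexp per non-empty line (hypothetical helper)."""
    return [re.compile(line.strip()) for line in lines if line.strip()]


def matches_any(path, patterns):
    # Assumption: a pattern may match anywhere in the path (re.search).
    return any(p.search(path) for p in patterns)


# Example contents for an ignorepaths file; these patterns are made up.
ignorepaths = compile_patterns([r"\.css$", r"\.js$", r"^/static/"])
print(matches_any("/static/logo.png", ignorepaths))
print(matches_any("/blog/hello", ignorepaths))
```

Anchoring patterns explicitly with ^ and $ keeps them unambiguous regardless of which matching function is used.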

classes

This file allows you to categorise the entries. The format is as follows: each class starts with a line holding its name, followed by pairs of lines, a field name and then a regexp. Classes are separated by empty lines.

Users
path
/user/\?id=

Indexed Products
path
/products/\?id=
http_user_agent
.*Googlebot/2\.1

The above example defines two new classes:

Users: any entries whose path starts with “/user/?id=”.
Indexed Products: entries whose path starts with “/products/?id=” and that were requested by Googlebot. Note the two rules: one for the path, the other for the user agent.
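To make the file format concrete, here is a small sketch of a parser and classifier for it. The function names parse_classes and classify are hypothetical, and the use of re.search is an assumption; pywik's own loader may work differently.

```python
import re


def parse_classes(text):
    """Parse the classes format into {name: [(field, compiled_regexp), ...]}."""
    classes = {}
    # Classes are separated by empty lines.
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = [line.strip() for line in block.splitlines() if line.strip()]
        if not lines:
            continue
        name, rest = lines[0], lines[1:]
        # The remaining lines come in (fieldname, regexp) pairs.
        classes[name] = [(rest[i], re.compile(rest[i + 1]))
                         for i in range(0, len(rest) - 1, 2)]
    return classes


def classify(entry, classes):
    """Return the names of all classes whose rules all match the entry."""
    return [name for name, rules in classes.items()
            if all(rx.search(entry.get(field, "")) for field, rx in rules)]


SAMPLE = r"""Users
path
/user/\?id=

Indexed Products
path
/products/\?id=
http_user_agent
.*Googlebot/2\.1
"""

classes = parse_classes(SAMPLE)
entry = {"path": "/products/?id=7",
         "http_user_agent": "Mozilla/5.0 (compatible; Googlebot/2.1)"}
print(classify(entry, classes))
```

An entry belongs to a class only when every rule of that class matches, which is why the second class needs both the path and the user agent to agree.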

Web-server logformat

Set your web server to use one of the following log formats, or use:

./ncsa2csv.py <access.log | ./load.py mysite

to convert NCSA logs to the CSV format. Note, however, that the converted output lacks some data that the native CSV format provides.

Apache

For Apache the following should work:

LogFormat "%{%Y-%m-%dT%H:%M:%S%z}t;x;%h;0;%v;%R;%s;%I;%O;%D;%{Referer}i;%u;%{User-agent}i;%{X-Forwarded-For}i;x" csv-http
LogFormat "%{%Y-%m-%dT%H:%M:%S%z}t;x;%h;1;%v;%R;%s;%I;%O;%D;%{Referer}i;%u;%{User-agent}i;%{X-Forwarded-For}i;x" csv-https

nginx

For nginx the following should work:

log_format csv-http  '"$time_local";$connection;"$remote_addr";0;"$http_host";"$request";'
   '$status;$request_length;$body_bytes_sent;$request_time;"$http_referer";"$remote_user";'
   '"$http_user_agent";"$http_x_forwarded_for";$msec';
log_format csv-https '"$time_local";$connection;"$remote_addr";1;"$http_host";"$request";'
   '$status;$request_length;$body_bytes_sent;$request_time;"$http_referer";"$remote_user";'
   '"$http_user_agent";"$http_x_forwarded_for";$msec';

Then use them for logging in your host stanzas:

access_log /var/log/nginx/access.csv csv-http;

or, for HTTPS host stanzas:

access_log /var/log/nginx/access.csv csv-https;
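For orientation, the sketch below splits one line in the nginx format above into named fields using Python's csv module. The field names are chosen for this example from the nginx variable names, the sample line is made up, and this is not pywik's actual loader.

```python
import csv
import io

# Field names follow the order of the nginx log_format above; the names
# themselves are chosen for this example, not taken from pywik.
FIELDS = ["time_local", "connection", "remote_addr", "https", "http_host",
          "request", "status", "request_length", "body_bytes_sent",
          "request_time", "http_referer", "remote_user", "http_user_agent",
          "http_x_forwarded_for", "msec"]


def parse_line(line):
    """Split one semicolon-separated log line into a dict keyed by FIELDS."""
    row = next(csv.reader(io.StringIO(line), delimiter=";", quotechar='"'))
    return dict(zip(FIELDS, row))


# A made-up log line in the csv-http format.
sample = ('"10/Oct/2023:13:55:36 +0000";42;"203.0.113.7";0;"myhost.net";'
          '"GET /index.html HTTP/1.1";200;312;5120;0.004;"-";"-";'
          '"Mozilla/5.0 (X11; Linux x86_64)";"-";1696946136.123')
entry = parse_line(sample)
print(entry["request"], entry["status"])
```

The quoting matters because user agents routinely contain semicolons; a quote-aware reader keeps such a value in one field.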

Running pywik

./fetchlogs.sh myhost.net
./pywik.py month myhost | less

If you find anything interesting, you can extract all log entries matching certain fields:

./getentries.py logs/access.csv myhost path 'cart.php?a=asdf&templatefile=../../../configuration.php'

Alternatively you can also run pywik as a Flask webapp:

./webapp.py

Point your browser at http://localhost:5002/myhost/today and start clicking around.

Plugins

You can easily extend the functionality of pywik using plugins. Plugins can be

global if you put them into data/plugins
or site-specific if you put them in data/<site>/plugins

There are two kinds of plugins:

those that generate queries for filtered listings in the output,
and those that enrich the database while parsing the logfile

For examples, look into data/plugins; **addrapp** and **tor** are good candidates for starting off.

Plugin Initialization

A plugin that provides an init(ctx) function can initialize itself. The ctx parameter is a dictionary that currently has only one key, 'host'.
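A minimal sketch of a plugin that uses init(ctx) might look like this. The module-level variable current_host and the 'site' field are hypothetical names chosen for the example, and the last two lines only simulate how a loader could drive the plugin.

```python
# Hypothetical plugin module, e.g. saved under data/plugins/.
current_host = None


def init(ctx):
    """Remember which host is being processed; ctx holds only 'host'."""
    global current_host
    current_host = ctx['host']


def process(entry):
    # Tag each entry with the host the plugin was initialised for.
    entry['site'] = current_host
    return entry


# Simulate how a loader might drive the plugin (illustration only).
init({'host': 'myhost'})
print(process({'path': '/'}))
```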

query plugins

Query plugins implement a queries() function that returns a list of:

('title', {'field1': value1, 'field2': value2},['displayfield1', 'displayfield2'])

Here 'title' is the title to be displayed,
the second element is a dict containing a MongoDB filter expression,
and the final element is a list of field names to be returned by MongoDB for each matching entry.

This can be as simple as:

def queries():
    return [('tor', {'tags': ['tor', 'page'], },['path', 'hostname', 'http_user_agent']),
            ('tor404', {'tags': ['tor'], 'status': 404 },['path', 'hostname', 'http_user_agent'])]
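To show how such a tuple could map onto a MongoDB query, the sketch below converts the list of display fields into a projection document. The helper to_find_args is hypothetical, and how pywik actually issues the query is an assumption; the pymongo call is mentioned only in a comment, so the example runs without a database.

```python
def to_find_args(query):
    """Turn a (title, filter, fields) tuple into (title, filter, projection)."""
    title, mongo_filter, fields = query
    # MongoDB projections map each wanted field name to 1.
    projection = {name: 1 for name in fields}
    return title, mongo_filter, projection


def queries():
    return [('tor404', {'tags': ['tor'], 'status': 404},
             ['path', 'hostname', 'http_user_agent'])]


for title, flt, proj in map(to_find_args, queries()):
    print(title, flt, proj)
    # With pymongo this could be run as: collection.find(flt, proj)
```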

loader plugins

Loader plugins enrich the information in each log entry during database import. A loader plugin implements a process(entry) function that returns the changed entry.

def process(entry):
    # Add a custom field to entries for one specific path.
    if entry['path'] == '/foo':
        entry['foo'] = 'bar'
    return entry

Here’s a more advanced example (you can find more in data/plugins):

from load import basepath

# Load the Tor exit addresses once, when the plugin is imported.
with open('%s/data/torexits.csv' % basepath, 'r') as fp:
    torexits = set(x.strip() for x in fp)
# print('[tor plugin]', len(torexits), 'torexits loaded')

def process(entry):
    # Tag entries whose client address is a known Tor exit.
    if entry['remote_addr'] in torexits:
        entry['tags'].append('tor')
    return entry
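Since every loader plugin takes an entry and returns it, several plugins compose naturally. The following sketch chains two process-style functions; run_plugins is a hypothetical driver, the order in which pywik applies plugins is an assumption, and the exit address is a documentation placeholder rather than real data.

```python
def tag_foo(entry):
    if entry['path'] == '/foo':
        entry['foo'] = 'bar'
    return entry


def tag_tor(entry, torexits=frozenset({'198.51.100.9'})):
    # The exit address above is a documentation placeholder, not real data.
    if entry['remote_addr'] in torexits:
        entry['tags'].append('tor')
    return entry


def run_plugins(entry, plugins):
    """Pass the entry through each process function in turn."""
    for process in plugins:
        entry = process(entry)
    return entry


entry = {'path': '/foo', 'remote_addr': '198.51.100.9', 'tags': []}
print(run_plugins(entry, [tag_foo, tag_tor]))
```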

Bugs

Many. Reporting them is encouraged; fixing them is very welcome.
