stef/pywik

pywik

Welcome to pywik, the static webserver CSV logfile analyzer.

Features

pywik can filter and list:

  • Visited Pages,
  • Search Queries,
  • Server errors,
  • unknown files (not in the good set),
  • bot visits,
  • 404 pages (malware shows up here),
  • visits from tor users (and 404s from tor users),
  • external referers

Setup

Install the dependencies (you will also need MongoDB):

virtualenv --no-site-packages env
source env/bin/activate
pip install -r deps.txt

Set up the environment for pywik:

mkdir logs
./updatelists.sh
mkdir data/myhost
touch data/myhost/goodpaths
touch data/myhost/ignoremissing
touch data/myhost/ignorepaths
touch data/myhost/ownhosts
touch data/myhost/classes

Host specific files

pywik uses a few host-specific files, which improve the output considerably. Create a directory under data, named after your host, and populate the following files accordingly.

ownhosts

A list of hostnames that are considered part of your infrastructure. Any log entry whose referer points to a host not in this list is considered an external hit.
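As a rough illustration of the rule above, the following Python 3 sketch classifies a referer as external or not. It is not pywik's own code: the function names are hypothetical, and it assumes one hostname per line and that an empty or "-" referer does not count as external.

```python
from urllib.parse import urlparse

def load_ownhosts(lines):
    """Return the set of non-empty, lower-cased hostnames from an iterable of lines."""
    return {line.strip().lower() for line in lines if line.strip()}

def is_external(referer, ownhosts):
    """True if the referer names a host that is not listed in ownhosts."""
    if not referer or referer == "-":
        return False  # assumption: a missing referer is not an external hit
    host = urlparse(referer).hostname  # None when the referer has no scheme/host part
    return host is not None and host not in ownhosts

# In practice the lines would come from data/myhost/ownhosts.
ownhosts = load_ownhosts(["myhost.net\n", "www.myhost.net\n"])
print(is_external("https://www.myhost.net/about", ownhosts))
print(is_external("https://example.org/link", ownhosts))
```

Keeping the hostnames in a set makes each lookup cheap even for long lists.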

goodpaths

Paths that count as page visits. Each line is a regexp.

ignoremissing

Paths that regularly generate 404 responses and should be ignored. Each line is a regexp.

ignorepaths

Paths that are uninteresting for tracking pageviews, such as page requisites (e.g. .css and .js files). Each line is a regexp.

rss

Each line is a regexp for an rss/atom feed.
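The regexp-per-line files above can be pictured with a small Python 3 sketch. This is an illustration under assumptions, not pywik's loader: the helper names are made up, and whether pywik anchors patterns at the start of the path (re.match, used here) or searches anywhere in it is not stated in this README.

```python
import re

def compile_patterns(lines):
    """Compile one regexp per non-empty line."""
    return [re.compile(line.strip()) for line in lines if line.strip()]

def matches_any(path, patterns):
    """True if any pattern matches from the start of path (re.match semantics)."""
    return any(p.match(path) for p in patterns)

# Hypothetical file contents; real ones live in data/<host>/.
ignorepaths = compile_patterns([r".*\.css$", r".*\.js$", r"/favicon\.ico$"])
goodpaths = compile_patterns([r"/$", r"/blog/"])

for path in ["/", "/blog/post-1", "/static/site.css"]:
    print(path, matches_any(path, goodpaths), matches_any(path, ignorepaths))
```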

classes

This file allows you to categorise the entries. The format is as follows: each class starts with its name on a line of its own, followed by pairs of lines giving a fieldname and a regexp. Classes are separated by empty lines.

Users
path
/user/\?id=

Indexed Products
path
/products/\?id=
http_user_agent
.*Googlebot/2\.1

The above example defines two new classes:

  • Users: any entries whose path starts with “/user/?id=”
  • Indexed Products: entries whose path starts with “/products…” and that are requested by Googlebot. Note the two rules, one for the path and one for the user agent.
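To make the file format concrete, here is a minimal Python 3 sketch that parses such a classes definition and applies it to log entries. The functions parse_classes and classify are hypothetical names, not part of pywik, and the sketch assumes that every rule of a class must match and that patterns are anchored at the start of the field.

```python
import re

def parse_classes(text):
    """Parse blank-line separated blocks: a class name, then fieldname/regexp pairs."""
    classes = {}
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = [line.strip() for line in block.splitlines() if line.strip()]
        if not lines:
            continue
        name, rest = lines[0], lines[1:]
        if len(rest) % 2:
            raise ValueError("class %r has a fieldname without a regexp" % name)
        classes[name] = [(rest[i], re.compile(rest[i + 1]))
                         for i in range(0, len(rest), 2)]
    return classes

def classify(entry, classes):
    """Return the names of all classes whose rules all match the entry."""
    return [name for name, rules in classes.items()
            if all(rx.match(str(entry.get(field, ""))) for field, rx in rules)]

EXAMPLE = r"""
Users
path
/user/\?id=

Indexed Products
path
/products/\?id=
http_user_agent
.*Googlebot/2\.1
"""

classes = parse_classes(EXAMPLE)
print(classify({"path": "/user/?id=7", "http_user_agent": "Mozilla/5.0"}, classes))
```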

Web-server logformat

Set your webserver to use the following logformats, or use:

./ncsa2csv.py <access.log | ./load.py mysite

to convert from NCSA logs to the CSV format. Note, however, that NCSA logs lack some of the data that the CSV-based format provides.

Apache

For Apache the following should work:

LogFormat "%{%Y-%m-%dT%H:%M:%S%z}t;x;%h;0;%v;%R;%s;%I;%O;%D;%{Referer}i;%u;%{User-agent}i;%{X-Forwarded-For}i;x" csv-http
LogFormat "%{%Y-%m-%dT%H:%M:%S%z}t;x;%h;1;%v;%R;%s;%I;%O;%D;%{Referer}i;%u;%{User-agent}i;%{X-Forwarded-For}i;x" csv-https

nginx

For nginx the following should work:

log_format csv-http  '"$time_local";$connection;"$remote_addr";0;"$http_host";"$request";'
   '$status;$request_length;$body_bytes_sent;$request_time;"$http_referer";"$remote_user";'
   '"$http_user_agent";"$http_x_forwarded_for";$msec';
log_format csv-https '"$time_local";$connection;"$remote_addr";1;"$http_host";"$request";'
   '$status;$request_length;$body_bytes_sent;$request_time;"$http_referer";"$remote_user";'
   '"$http_user_agent";"$http_x_forwarded_for";$msec';

Then reference the matching format in the access_log directive of each host:

access_log /var/log/nginx/access.csv csv-http;

or

access_log /var/log/nginx/access.csv csv-https;

The first form is for http host stanzas, the second for https host stanzas.
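A line written with the nginx format above is semicolon-separated, with string values in double quotes. The following Python 3 sketch shows one way to split such a line into named fields with the standard csv module. It is not pywik's loader: the FIELDS list simply mirrors the nginx variable names in order (with "https" as an assumed name for the 0/1 flag), and the sample line is invented.

```python
import csv

# Assumed field names, in the order of the nginx log_format above.
FIELDS = ["time_local", "connection", "remote_addr", "https", "http_host",
          "request", "status", "request_length", "body_bytes_sent",
          "request_time", "http_referer", "remote_user", "http_user_agent",
          "http_x_forwarded_for", "msec"]

def parse_line(line):
    """Split one semicolon-separated log line into a dict keyed by FIELDS."""
    row = next(csv.reader([line], delimiter=";", quotechar='"'))
    if len(row) != len(FIELDS):
        raise ValueError("expected %d fields, got %d" % (len(FIELDS), len(row)))
    entry = dict(zip(FIELDS, row))
    entry["status"] = int(entry["status"])
    entry["https"] = entry["https"] == "1"
    return entry

# Invented sample line; note the semicolon inside the quoted user agent.
SAMPLE = ('"10/Oct/2023:13:55:36 +0000";42;"203.0.113.7";0;"myhost.net";'
          '"GET /index.html HTTP/1.1";200;512;1024;0.004;"-";"-";'
          '"Mozilla/5.0 (X11; Linux x86_64)";"-";1696946136.123')

entry = parse_line(SAMPLE)
print(entry["remote_addr"], entry["status"], entry["request"])
```

Using a CSV reader rather than a plain split on ";" keeps quoted values such as user agents intact when they contain semicolons.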

Running pywik

./fetchlogs.sh myhost.net
./pywik.py month myhost | less

If you find anything interesting, you can extract all log entries matching certain fields:

./getentries.py logs/access.csv myhost path 'cart.php?a=asdf&templatefile=../../../configuration.php'

Alternatively you can also run pywik as a Flask webapp:

./webapp.py

Point your browser at http://localhost:5002/myhost/today and start clicking around.

Plugins

You can easily extend the functionality of pywik using plugins. Plugins can be

  • global if you put them into data/plugins
  • or site-specific if you put them in data/<site>/plugins

There are two kinds of plugins:

  • those that generate queries for filtered listings for output,
  • and those that enrich the database while parsing the logfile

For examples, look into data/plugins; **addrapp** and **tor** are good candidates for starting off.

Plugin Initialization

Plugins that provide an init(ctx) function are able to initialize themselves. The ctx parameter is a dictionary that currently has only one key, ‘host’.

Query plugins

Query plugins implement a queries() function that returns a list of:

('title', {'field1': value1, 'field2': value2},['displayfield1', 'displayfield2'])
  • where ‘title’ is the title to be displayed,
  • the second element is a dict containing a mongodb filter expression,
  • and the final element is a list of the fieldnames to be returned by mongo for each matching entry

This can be as simple as:

def queries():
    return [('tor', {'tags': ['tor', 'page'], },['path', 'hostname', 'http_user_agent']),
            ('tor404', {'tags': ['tor'], 'status': 404 },['path', 'hostname', 'http_user_agent'])]
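How pywik consumes these tuples is not shown here. As a sketch under that caveat, the Python 3 code below defines another query plugin and a hypothetical helper, to_find_args, that turns one tuple into the filter and projection arguments that pymongo's Collection.find(filter, projection) accepts. The title and filter are invented for illustration.

```python
def queries():
    """A query plugin listing entries with a 5xx status (illustrative filter)."""
    return [('server errors', {'status': {'$gte': 500}},
             ['path', 'status', 'hostname'])]

def to_find_args(query):
    """Split a (title, filter, fields) tuple into title, filter and projection.

    Hypothetical helper: the projection dict asks mongo to return only the
    listed fields, which is what the third tuple element describes.
    """
    title, flt, fields = query
    projection = {name: 1 for name in fields}
    return title, flt, projection

title, flt, projection = to_find_args(queries()[0])
print(title, flt, projection)
# With a live database this could be passed on as: collection.find(flt, projection)
```

Note that in MongoDB a filter such as {'tags': ['tor', 'page']} matches only entries whose tags array equals that list exactly, whereas {'tags': {'$all': ['tor', 'page']}} matches any array containing both values.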

Loader plugins

Loader plugins enrich the information in each log entry during database import. A loader plugin implements a process(entry) function that returns the changed entry:

def process(entry):
   # Add a custom field to every request for /foo.
   if entry['path'] == '/foo':
      entry['foo'] = 'bar'
   return entry

Here’s a more advanced example (you can find more in data/plugins):

from load import basepath

# Load the list of Tor exit addresses once, at import time.
# A set gives constant-time membership tests in process().
with open('%s/data/torexits.csv' % basepath, 'r') as fp:
   torexits = set(x.strip() for x in fp)

def process(entry):
   # Tag entries whose client address is a known Tor exit.
   if entry['remote_addr'] in torexits:
      entry['tags'].append('tor')
   return entry
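The same pattern works for other kinds of enrichment. As a further sketch (not one of the plugins shipped in data/plugins), the Python 3 loader plugin below tags entries whose user agent looks like a crawler. The BOT_RE pattern and the 'bot' tag name are assumptions made for this example.

```python
import re

# Hypothetical pattern; real bot detection usually needs a longer list.
BOT_RE = re.compile(r"bot|crawler|spider", re.IGNORECASE)

def process(entry):
    """Append a 'bot' tag when the user agent matches BOT_RE."""
    agent = entry.get('http_user_agent', '')
    tags = entry.setdefault('tags', [])  # create the list if the entry has none
    if BOT_RE.search(agent) and 'bot' not in tags:
        tags.append('bot')
    return entry

print(process({'http_user_agent': 'Googlebot/2.1', 'tags': ['page']}))
```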

Bugs

Many. Reporting them is encouraged; fixing them is very welcome.
