Welcome to pywik the static webserver csv logfile analyzer.
can filter and list:
- Visited Pages,
- Search Queries,
- Server errors,
- unknown files (not in the good set),
- bot visits,
- 404 pages (malware shows up here),
- visits from tor users (and 404s from tor users),
- external referers
install the dependencies (you will need mongodb):
virtualenv --no-site-packages env source env/bin/activate pip install -r deps.txt
setup environment for pywik:
mkdir logs ./updatelists.sh mkdir data/myhost touch data/myhost/goodpaths touch data/myhost/ignoremissing touch data/myhost/ignorepaths touch data/myhost/ownhosts touch data/myhost/classes
pywik uses a few host specific files, which improve the output considerably. Create a directory under data with your hostname as the name and populate the following files accordingly.
a list of hostnames that are considered part of your infrastructure. Any log entries with referers from other than these hosts are considered external hits.
Any path considered a page visit, each line is a regexp.
Any path that is regularly generating 404 responses, each line is a regexp.
Any path that is uninteresting for tracking pageviews, like all requisites for pages (e.g. .css, .js, etc files), each line is a regexp.
Each line is a regexp for an rss/atom feed.
This file allows you to categorise the entries. The format is the following: Each class starts with its name, then pairwise fieldnames and regexps. Classes are separated with empty lines.
Users path /user/\?id= Indexed Products path /products/\?id= http_user_agent .*Googlebot/2\.1
The above example defines two new classes:
- Users are any entries that start with the path ”user?id=”
- indexed products, certain paths starting with “/products…” and are hit by googlebot - notice the double rule one for the path, the other for the user agent
set your webserver to use the following logformats, or use:
./ncsa2csv.py <access.log | ./load.py mysite
to convert from NCSA logs to csv format - note however that this is missing some data, that the csv based format provides.
For Apache the following should work:
LogFormat "%{%Y-%m-%dT%H:%M:%S%z}t;x;%h;0;%v;%R;%s;%I;%O;%D;%{Referer}i;%u;%{User-agent}i;%{X-Forwarded-For}i;x" csv-http LogFormat "%{%Y-%m-%dT%H:%M:%S%z}t;x;%h;1;%v;%R;%s;%I;%O;%D;%{Referer}i;%u;%{User-agent}i;%{X-Forwarded-For}i;x" csv-https
For nginx the following should work:
log_format csv-http '"$time_local";$connection;"$remote_addr";0;"$http_host";"$request";' '$status;$request_length;$body_bytes_sent;$request_time;"$http_referer";"$remote_user";' '"$http_user_agent";"$http_x_forwarded_for";$msec'; log_format csv-https '"$time_local";$connection;"$remote_addr";1;"$http_host";"$request";' '$status;$request_length;$body_bytes_sent;$request_time;"$http_referer";"$remote_user";' '"$http_user_agent";"$http_x_forwarded_for";$msec';
and for your hosts use them for logging:
access_log /var/log/nginx/access.csv csv-http;
or
access_log /var/log/nginx/access.csv csv-https;
respectively for https hosts stanzas.
./fetchlogs.sh myhost.net ./pywik.py month myhost | less
if you find anything interesting, you can extract all logentries matching certain fields:
./getentries.py logs/access.csv myhost path 'cart.php?a=asdf&templatefile=../../../configuration.php'
Alternatively you can also run pywik as a Flask webapp:
./webapp.py
Point your browser at http://localhost:5002/myhost/today and start clicking around.
You can easily extend the functionality of pywik using plugins. Plugins can be
- global if you put them into data/plugins
- or site-specific if you put them in data/<site>/plugins
There are two kind of plugins:
- those that generate queries for filtered listings for output,
- and those that enrich the database with while parsing the logfile
For examples look into data/plugins, **addrapp** and **tor** are good canditates for starting off.
Plugins providing an init(ctx) function, will be able to initialize themselves. The param ctx is a dictionary, that currently only has one key ‘host’.
Query plugins implement a queries() function that returns a list of:
('title', {'field1': value1, 'field2': value2},['displayfield1', 'displayfield2'])
- Where ‘title’ is the title to be displayed,
- the second elem is a dict containing a mongodb filter expression,
- the final elem is a list of fieldnames to be returned by mongo for each matching elements
This can be as simple as:
def queries():
return [('tor', {'tags': ['tor', 'page'], },['path', 'hostname', 'http_user_agent']),
('tor404', {'tags': ['tor'], 'status': 404 },['path', 'hostname', 'http_user_agent'])]
Loader plugins enrich the information in each log entry during database import. A loader plugin implements a process(entry) interface, that returns the changed entry.
def process(entry):
if entry['path']=='/foo': entry['foo']='bar'
return entry
Here’s a more advanced example (you can find more in data/plugins)
from load import basepath
with open('%s/data/torexits.csv' % basepath,'r') as fp:
torexits=[x.strip() for x in fp]
#print '[tor plugin]', len(torexits), 'torexits loaded'
def process(entry):
if entry['remote_addr'] in torexits:
entry['tags'].append('tor')
return entry
Many, reporting them is encouraged, fixing them very welcome.