Skip to content

wikimedia/analytics-udp-filters

Repository files navigation

=Wikimedia's generic UDP filtering system=

Wikimedia Foundation (c) 2012 // Diederik van Liere
This code has been released under the GPL2. 

=Dependencies=
libgeoip-dev and libcidr0-dev are dependencies that needs to be installed
if you are going to compile manually.

=Installation=

==Compiling==
You can run the following sequence to install udp-filter:
* ./configure
* make install

==Debian package==
You can create a package yourself using the following steps:
(replace strings where necessary)

# git clone git://gerrit.wikimedia.org:29416/analytics/udp-filters.git
# mv udp-filters udp-filters-0.1+git<yyyymmdd>
# tar -cvzf udp-filters-0.1+git<yyyymmdd>.tar.gz udp-filters-0.1+git<yyyymmdd>/
#dh_make -c gpl2 -e <your_email@wikimedia.org> -f ../udp-filters-0.1+git<yyyymmdd>.tar.gz 
# make changes (if necessary) to control-sample and copy it to debian/control
# make changes (if necessary) to copyright-sample and copy it to debian/copyright
#dpkg-depcheck -d ./configure (#output of dpkg-depcheck (libraries) must be added to debian/control, the following three packages need to be added to the control file: 
## libgeoip-dev
## libcidr-dev
## mime-support
## mawk
## autotools-dev
#dpkg-buildpackage -rfakeroot -sgpg (if you want to sign the package then make sure that name/ email address of the maintainer in the control file 'exactly' matches the name and email address of your GPG key.

==Example control file==
<pre>
Source: udp-filters
Section: utils
Priority: extra
Maintainer: Diederik van Liere (Wikimedia Foundation) <dvanliere@wikimedia.org>
Build-Depends: debhelper (>= 7.0.50~), autotools-dev, libgeoip-dev,mime-support, mawk
Standards-Version: 3.8.4
Homepage: <http://www.mediawiki.org/wiki/Analytics/UDP-filters>
Vcs-Git: git://gerrit.wikimedia.org:29416/analytics/udp-filters.git
Vcs-Browser: https://gerrit.wikimedia.org/r/gitweb?p=analytics/udp-filters.git

Package: udp-filters
Architecture: any
Depends: libc6 (>= 2.4), libgeoip1 (>= 1.4.6)
Description: <Wikimedia's udp-filter system.>
 <Wikimedia has a udp-logger that sends packets from the squid servers containing pageviews. UDP-filtes allows you to configure a filter and write particular pageviews, based on a combination of domain and url matching, to a logfile. It also offers geocoding and anonymization of ip addresses. >
</pre>

=Background=
This new filter system replaces the old collection of filters written in C. 

=Command line arguments for udp-filter=
The following is a list of valid command line parameters. 

Either --path or --domain are mandatory (you can use them both, the other command line parameters are optional:
-p or --path:         the string or multiple strings separated by a comma that indicate what you want to match.
-d or --domain:       the part of the domain name that you want to match. For example, 'en.m.' would match all English mobile Wikimedia projects.

-g or --geocode:      flag to indicate geocode the log, by default turned off.
-b or --bird:         parameter that is mandatory when specifying -g or --geocode. Valid choices are <country>, <region> and <city>.
-a or --anonymize:    flag to indicate anonymize the log, by default turned off.
-i or --ip:           flag to indicate ip-filter the log, by default turned off. You can supply comma separated ip adresses, or comma-separated ip-ranges.

-m or --maxmind:     specify alternative path to MaxMind database.

-c or --country_list: limit the log to particular countries, this should be a comma separated list of country codes. Valid country codes are the ISO 3166 country codes (see http://www.maxmind.com/app/iso3166). 
-r or --regex:        the parameters -p and -u are interpreted as regular expressions. Regular expression searching is probably slower so substring matching is recommended.
-f or --force:        do not match on either domain, path, or ip address, basically turn filtering off. Can be useful when filtering for specific country.

-v or --verbose:      output detailed debug information to stderr, not recommended in production.
-h or --help:         show this menu with all command line options.
	

For options -u, -p and -c you can enter multiple values if you separate them by 
a comma.

=Examples for udp-filter=


./udp_filter -d en.wikipedia            this will log all pageviews (depending 
on the sampling rate) if the domain contains en.wikipedia

./udp_filter -p SOPO                    this will log all pageviews (depending 
on the sampling rate) when the url (excluding the domain name) contains SOPA.
So this will collect SOPA *across* projects.

./udp_filter -d en.wikipedia -p SOPA	this will log all pageviews (depending 
on the sampling rate) where the domain contains en.wikipedia and the url 
contains SOPA.

./udp_filter -d en.wikipedia -p SOPA,PIPA   this will log all pageviews (
depending on the sampling rate) where the domain contains en.wikipedia and the 
url contains either SOPA or PIPA.

./udp_filter -d en.wikipedia -a	        this will log all pageviews (depending 
on the sampling rate) if the domain contains en.wikipedia and replace the 
ip address of the visitor with 0.0.0.0

./udp_filter -d en.wikipedia -g         this will log all pageviews (depending 
on the sampling rate) if the domain contains en.wikipedia and replace the 
ip address of the visitor with the country code. See for a list of all the 
valid country codes: http://www.maxmind.com/app/iso3166

./udp_filter -d en.wikipedia -g -c BA   this will log all pageviews (depending 
on the sampling rate) if the domain contains en.wikipedia and replace the ip 
address of the visitor with the country code. In addition, only hits from Brasil 
(BA) will be logged.

./udp_filter -d en.wikipedia -g -m GeoIP.dat   this specifies an alternative path 
for the MaxMind database and this  will log all pageviews (depending 
on the sampling rate) if the domain contains en.wikipedia.

./udp_filter -d en.wikipedia -p SOPA -c US  this will log all pageviews 
(depending on the sampling rate) where the domain contains en.wikipedia and 
the url contains SOPA and the visitor comes from the US. 

./udp_filter -d en.wikipedia -v         this turns on verbose logging and can 
help in debugging and verifying that the appropriate hits are being logged.
This setting is not recommended in production.

./udp_filter -i 71.190.22.0/24,2607:f0d0:1002:51::/64          this will filter for logs with IP addresses that match the given CIDR ranges.  IPv4 and IPv6 CIDR blocks are supported.

=Description of multiplexer=
The multiplexor (src/multiplexor.c) is a standalone program that creates a set of child
processes (all running the same command), then reads lines from stdin and writes each
line to one of the children in round-robin fashion. It is intended to read from processes
like udp2log that have limited buffering and tend to lose data if their output stream is not
read fast enough. So it provides 2 benefits to minimize data loss:
(a) Improves throughput when spare CPU capacity is available in additional cores.
(b) Provides additional buffering when transient spikes slow down a single child.

=Command line arguments for multiplexer=
-cmd <cmd>   -- command to run in subprocesses (default: none)
[-proc n]    -- number of child processes to create (default: 2)
[-lines n]   -- max. lines to process (default: 0 = infinite)
[-o <path>]  -- path to output files (default: /var/tmp/multiplexor_)
[-p <cmd>]   -- path to output pipe command (default: none)
-cmd is required; only one of -o and -p is allowed

=Examples for multiplexer=
The udp2log config file can contain lines like this:

# create 2 children with output going to /var/tmp/data_* files; each child runs udp-filter with
# the given arguments
#
pipe 1 /usr/bin/multiplexor -o /var/tmp/data_ -proc 2 -cmd "/usr/bin/udp-filter -F '\t' -i 189.40.0.0/16"

# create 4 children each running udp-filter; output from each is piped to another process
# running /usr/local/bin/foo which deals with the data as it sees fit.
#
pipe 1 /usr/bin/multiplexor -p /usr/local/bin/foo -proc 4 -cmd "/usr/bin/udp-filter -F '\t' -i 189.40.0.0/16"

Acknowledgements
Many thanks to Roan Kattouw and Tim Starling for showing me the way around in C.

About

Github mirror of analytics/udp-filters - our actual code is hosted with Gerrit (please see https://www.mediawiki.org/wiki/Developer_access for contributing)

Resources

License

Code of conduct

Stars

Watchers

Forks

Packages

No packages published

Contributors 4

  •  
  •  
  •  
  •