webxray is a tool for analyzing third-party content on webpages and identifying the companies which collect user data. A command line user interface makes webxray easy to use for non-programmers, and those with advanced needs may analyze millions of pages with proper configuration. webxray is a professional tool designed for academic research, and may be used by privacy compliance officers, regulators, and those who are generally curious about hidden data flows on the web.
webxray uses a custom library of domain ownership to chart the flow of data from a given third-party domain to a corporate owner and, if applicable, to parent companies. The tracking attribution reports produced by webxray provide fine-grained detail. Reports on the average numbers of third parties and cookies per site, the most commonly occurring third-party domains and elements, volumes of data transferred, use of SSL encryption, and more are provided out of the box. A flexible data schema allows for generating custom reports as well as authoring extensions that add data sources.
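To illustrate the idea of charting a third-party domain to its owner and parent companies, here is a minimal sketch. It is not webxray's actual code or data format: the dictionaries, the company names, and the `ownership_chain` function are all hypothetical placeholders.

```python
# Hypothetical ownership data. These names are placeholders and are NOT
# entries from webxray's real domain ownership library.
DOMAIN_OWNERS = {
    "cdn.tracker.example": "Example Ads",
}
PARENT_COMPANIES = {
    "Example Ads": "Example Media",
    "Example Media": "Example Holdings",
}


def ownership_chain(domain):
    """Return [owner, parent, grandparent, ...] for a third-party domain.

    An unknown domain yields an empty list.
    """
    chain = []
    seen = set()  # guards against accidental cycles in the parent data
    owner = DOMAIN_OWNERS.get(domain)
    while owner is not None and owner not in seen:
        chain.append(owner)
        seen.add(owner)
        owner = PARENT_COMPANIES.get(owner)
    return chain
```

Under these assumptions, a request to "cdn.tracker.example" would be attributed to "Example Ads" first and then to each parent company in turn.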
By default, webxray uses Chrome to load pages, stores data in a SQLite database, and can be used on a normal desktop computer. Users with advanced needs may install webxray on a server and leverage MySQL or PostgreSQL for heavy-duty data storage.
More information and detailed installation instructions may be found on the project website.
webxray depends on several pieces of software being installed on your computer in advance. The webxray website has detailed instructions for setting up the software on Ubuntu and macOS. If you are familiar with installing dependencies on your own, the following are needed:
Python 3.4+ is required:
Python 3.4+ https://www.python.org
If you want to use Google Chrome as your browser engine you must install:
- Chrome 66+ https://www.google.com/chrome/
- Chrome Driver https://sites.google.com/a/chromium.org/chromedriver/
- Selenium https://pypi.python.org/pypi/selenium
If you want to use the PhantomJS browser engine instead of Chrome you must install:
PhantomJS 1.9+ http://phantomjs.org
If you want to use the MySQL database engine you must install:
- MySQL https://www.mysql.com
- MySQL Python Connector https://dev.mysql.com/downloads/connector/python/
If you want to use the PostgreSQL database engine you must install:
- PostgreSQL https://www.postgresql.org
- psycopg http://initd.org/psycopg/
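If you are unsure which dependencies are already present, a short script along the following lines can help. This is only a sketch and is not part of webxray. It assumes the executables are named `chromedriver` and `phantomjs` and are on your PATH, and that the Python modules are importable as `selenium`, `mysql.connector`, and `psycopg2`. Only the entries for the browser and database engines you plan to use need to be True.

```python
import importlib.util
import shutil
import sys


def has_module(name):
    """Return True if the named module can be located without importing it fully."""
    try:
        return importlib.util.find_spec(name) is not None
    except ImportError:
        # Raised for dotted names when the parent package is missing.
        return False


def check_dependencies():
    """Return a dict mapping each dependency check to True or False."""
    return {
        "python>=3.4": sys.version_info >= (3, 4),
        # Assumed module and executable names; adjust to your setup.
        "selenium module": has_module("selenium"),
        "chromedriver on PATH": shutil.which("chromedriver") is not None,
        "phantomjs on PATH": shutil.which("phantomjs") is not None,
        "mysql.connector module": has_module("mysql.connector"),
        "psycopg2 module": has_module("psycopg2"),
    }


if __name__ == "__main__":
    for check, ok in sorted(check_dependencies().items()):
        print("{:<26} {}".format(check, "found" if ok else "missing"))
```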
If the dependencies above are met, you can clone this repository and get started:
git clone https://github.com/timlib/webxray.git
To start webxray in interactive mode type:
The prompts will guide you through scanning a sample list of websites using the default settings of Chrome in windowed mode and a SQLite database. If you wish to run several browsers in parallel to increase speed, leverage a more powerful database engine, or perform other advanced tasks, please see the project website for details.
Using webxray to Analyze Your Own List of Pages
The raison d'être of webxray is to allow you to analyze pages of your choosing. To do so, first place all of the page addresses you wish to scan into a text file and place this file in the "page_lists" directory. Make sure your addresses start with "http://" or "https://"; otherwise, webxray will not recognize them as valid addresses. Once you have placed your page list in the proper directory, you may run webxray and it will allow you to select your page list.
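Before running a scan, it can be useful to check a page list for addresses that lack the required prefix. The following is a minimal sketch, not webxray's own validation logic. It assumes one address per line, and the `split_page_list` function and the file name in the usage comment are hypothetical.

```python
def split_page_list(lines):
    """Split lines into (valid, invalid) addresses based on their prefix.

    Blank lines are skipped. An address counts as valid only if it starts
    with "http://" or "https://".
    """
    valid, invalid = [], []
    for line in lines:
        address = line.strip()
        if not address:
            continue
        if address.startswith(("http://", "https://")):
            valid.append(address)
        else:
            invalid.append(address)
    return valid, invalid


# Example usage with a hypothetical file name:
#
#     with open("page_lists/my_pages.txt", encoding="utf-8") as handle:
#         valid, invalid = split_page_list(handle)
#     print("addresses missing a prefix:", invalid)
```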
Viewing and Understanding Reports
Once you have completed your data collection, use the interactive mode to guide you through generating an analysis. When the analysis is complete, it is written to the '/reports' directory, which will contain a number of CSV files:
- db_summary.csv: a basic report of what is in the database and how many pages loaded
- aggregated_tracking_attribution.csv: details on percentages of sites tracked by different companies and their subsidiaries
- 3p_domain.csv: most frequently occurring third-party domains
- 3p_element.csv: most frequently occurring third-party elements of all types
- 3p_image.csv: most frequently occurring third-party images
- 3p_uses.csv: high-level stats on what first parties use third-party services for, as well as the rates at which cookies are set for different types of services. Note that first-party use may differ from the ultimate third-party use. For example, a site may use audience measurement tools from a third party to gain insights into traffic, but the third party may use this data for marketing.
- 3p_ssl_use.csv: rates at which detected third-parties encrypt requests
- data_xfer_summary.csv: volume and percentage of data received from first- and third-party domains
- data_xfer_aggregated.csv: volume and percentage of data received from various companies
- data_xfer_by_domain.csv: volume and percentage of data received from specific third-party domains
- network: pairings between page domains and third-party domains; you can import this data into network visualization software
- per_page_data_flow.csv: one giant file that lists the requests made for each page, off by default
Important Note on Speed and Parallelization
webxray can load many pages in parallel and may be used to analyze millions of pages fairly quickly. However, out of the box, webxray is configured to scan only one page at a time. If you think your system can handle more (and chances are it can!), open the 'run_webxray.py' file and search for the first occurrence of the 'pool_size' variable. There you will find instructions on how to increase the number of pages processed concurrently. Additional information is available on the project website.
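For readers unfamiliar with the concept, the sketch below shows what a pool size means in general terms: it caps how many pages are being processed at the same moment. This is a generic illustration using Python's standard library and is not how webxray itself is implemented. The `scan_pages` function and its `load_page` argument are hypothetical stand-ins.

```python
from concurrent.futures import ThreadPoolExecutor


def scan_pages(addresses, load_page, pool_size=1):
    """Apply load_page to every address using at most pool_size workers.

    Results are returned in the same order as the input addresses.
    With pool_size=1, pages are processed one at a time.
    """
    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        return list(pool.map(load_page, addresses))
```

A larger pool size finishes a long list sooner, but each additional browser consumes memory and CPU, so it is sensible to increase the value gradually.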
This tool is produced by Timothy Libert. If you are using it for academic research, please cite the most pertinent publication from his Google Scholar page.
webxray is FOSS and licensed under GPLv3; see LICENSE.md for details.