# Golden Retriever Scripts

This repository contains aggregation/acquisition scripts for extracting data from the MySQL and Hive databases to compute metrics for various teams within Discovery. It uses Analytics' Reportupdater infrastructure. The codebase is maintained by Discovery's Analysis team; for questions and comments, contact Deb, Mikhail, or Chelsy.

## Table of Contents

- [Setup and Usage](#setup-and-usage)
  - [Dependencies](#dependencies)
- [Testing](#testing)
- [Modules](#modules)
  - [Adding New Metrics Modules](#adding-new-metrics-modules)
  - [Adding New Forecasting Modules](#adding-new-forecasting-modules)
- [Additional Information](#additional-information)

## Setup and Usage

As of T170494, the setup and daily runs are Puppetized on stat1005 via the `statistics::discovery` Puppet module (also mirrored on GitHub).

### Dependencies

```bash
pip install -r reportupdater/requirements.txt
```

Some of the R packages require C++ libraries, which are installed on stat1002 (which uses compute.pp, GitHub) by being listed in packages (GitHub); see operations-puppet/modules/statistics/manifests/packages.pp (GitHub) for an example.

```R
# Set WMF proxies:
Sys.setenv("http_proxy" = "http://webproxy.eqiad.wmnet:8080")
Sys.setenv("https_proxy" = "http://webproxy.eqiad.wmnet:8080")

# Set path for packages:
lib_path <- "/srv/discovery/r-library"
.libPaths(lib_path)

# Essentials:
install.packages(
  c("devtools", "testthat", "Rcpp",
    "tidyverse", "data.table", "plyr",
    "optparse", "yaml", "data.tree",
    "ISOcodes", "knitr", "glue",
    # For wmf:
    "urltools", "ggthemes", "pwr",
    # For polloi's datavis functions:
    "shiny", "shinydashboard", "dygraphs", "RColorBrewer",
    # For polloi's data manipulation functions:
    "xts", "mgcv", "zoo",
    # For forecasting modules:
    "bsts", "forecast", "prophet"
    # ^ see note below
  ),
  repos = c(CRAN = "https://cran.rstudio.com/"),
  lib = lib_path
)

# 'uaparser' requires C++11, and libyaml-cpp, boost-system, boost-regex C++ libraries
devtools::install_github("ua-parser/uap-r", configure.args = "-I/usr/include/yaml-cpp -I/usr/include/boost")

# 'ortiz' is needed for Search team's user engagement calculation | https://phabricator.wikimedia.org/diffusion/WDOZ/
devtools::install_git("https://gerrit.wikimedia.org/r/wikimedia/discovery/ortiz")

# 'wmf' is needed for querying MySQL and Hive | https://phabricator.wikimedia.org/diffusion/1821/
devtools::install_git("https://gerrit.wikimedia.org/r/wikimedia/discovery/wmf")

# 'polloi' is needed for wikiid-splitting | https://phabricator.wikimedia.org/diffusion/WDPL/
devtools::install_git("https://gerrit.wikimedia.org/r/wikimedia/discovery/polloi")
```

Don't forget to add packages to `test.R`, because that script checks that all packages are installed before performing a test run of the reports.
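
For reference, the kind of check `test.R` performs can be sketched as follows; `required_packages` is an illustrative name and list, not the actual contents of `test.R`:

```R
# Minimal sketch of an installed-packages check like the one test.R performs.
# The vector below is illustrative, not the actual list from test.R.
required_packages <- c("optparse", "yaml", "data.table", "wmf", "polloi", "ortiz")
missing <- required_packages[
  !vapply(required_packages, requireNamespace, logical(1), quietly = TRUE)
]
if (length(missing) > 0) {
  stop("Missing packages: ", paste(missing, collapse = ", "))
}
```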

To update packages, use `update-library.R`:

```bash
Rscript /etc/R/update-library.R -l /srv/discovery/r-library
Rscript /etc/R/update-library.R -l /srv/discovery/r-library -p polloi
```

## Testing

If you wish to run all the modules without writing data to files or checking for missingness, use:

```bash
Rscript test.R >> test_`date +%F_%T`.log.md 2>&1
# The test script automatically uses yesterday's date.

# Alternatively:
Rscript test.R --start_date=2017-01-01 --end_date=2017-01-02 >> test_`date +%F_%T`.log.md 2>&1

# You can disable forecasting modules:
Rscript test.R --disable_forecasts >> test_`date +%F_%T`.log.md 2>&1

# And have it include samples of the existing data (for comparison):
Rscript test.R --include_samples >> test_`date +%F_%T`.log.md 2>&1
```

The testing utility finds all the modules, builds a list of the reports, and then performs the appropriate action depending on whether the report is a SQL query or a script. Each module's output is printed to the console. It should go without saying that running through all the modules will take a while. The script outputs a Markdown-formatted log that can be saved to a file using the commands above. Various statistics on execution times are printed at the end, including a table of each report's execution time; that table can be omitted with the `--omit_times` option.
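
Schematically, that dispatch looks something like the sketch below. This is illustrative, not the actual `test.R` code; `run_sql_report()` and `run_script_report()` are hypothetical helpers:

```R
# Illustrative sketch of the dispatch described above -- not actual test.R code.
library(yaml)

configs <- list.files("modules", pattern = "config\\.yaml$",
                      recursive = TRUE, full.names = TRUE)
for (config in configs) {
  reports <- yaml.load_file(config)$reports
  for (report_name in names(reports)) {
    report <- reports[[report_name]]
    elapsed <- system.time({
      if (isTRUE(report$type == "script")) {
        run_script_report(report_name, report)  # hypothetical helper
      } else {
        run_sql_report(report_name, report)     # hypothetical helper
      }
    })["elapsed"]
    message(report_name, ": ", round(elapsed, 2), "s")
  }
}
```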

## Modules

### Adding New Metrics Modules

#### MySQL

For metrics computed from EventLogging data stored in MySQL, try to write pure SQL queries whenever possible, using the conventions described here. Use the following template to get started:

```sql
SELECT
  DATE('{from_timestamp}') AS date,
  ...,
  COUNT(*) AS events
FROM {Schema_Revision}
WHERE timestamp >= '{from_timestamp}' AND timestamp < '{to_timestamp}'
GROUP BY date, ...;
```

#### Hive

The scripts that invoke Hive (e.g. the ones that count web requests) must follow the conventions described here. Use the following template to get started:

```bash
#!/bin/bash

hive -e "USE wmf;
SELECT
  '$1' AS date,
  ...,
  COUNT(*) AS requests
FROM webrequest
WHERE
  webrequest_source = 'text' -- also available: 'maps' and 'misc'
  AND CONCAT(year, '-', LPAD(month, 2, '0'), '-', LPAD(day, 2, '0')) >= '$1'
  AND CONCAT(year, '-', LPAD(month, 2, '0'), '-', LPAD(day, 2, '0')) < '$2'
  ...
GROUP BY
  '$1',
  ...;
" 2> /dev/null | grep -v parquet.hadoop
```

#### R

A note on paths: Reportupdater does not cd into the query folder, so scripts are executed relative to the directory you run Reportupdater from, e.g. `Rscript modules/metrics/search/some_script.R -d $1`.

These scripts have two parts: a shell script that is called by update_reports.py, which must adhere to Reportupdater's script conventions:

```bash
#!/bin/bash

Rscript modules/metrics/search/script.R --date=$1
# Alternatively: Rscript modules/metrics/search/script.R -d $1
```

and the script.R that it calls, which should adhere to one of the two templates below. Note that in both we specify `file = ""` in `write.table` because we want to print the data as a TSV to the console for Reportupdater.

##### MySQL in R

For R scripts that need to fetch (and process) data from MySQL, use the following template:

```R
#!/usr/bin/env Rscript

suppressPackageStartupMessages(library("optparse"))

option_list <- list(
  make_option(c("-d", "--date"), default = NA, action = "store", type = "character")
)

# Get command line options; if the help option is encountered, print help and
# exit. Otherwise, options not found on the command line get their defaults:
opt <- parse_args(OptionParser(option_list = option_list))

if (is.na(opt$date)) {
  quit(save = "no", status = 1)
}

# Build query; e.g. "2017-01-02" yields "LEFT(timestamp, 8) = '20170102'":
date_clause <- as.character(as.Date(opt$date), format = "LEFT(timestamp, 8) = '%Y%m%d'")

query <- paste0("
SELECT
  DATE('", opt$date, "') AS date,
  COUNT(*) AS events
FROM TestSearchSatisfaction2_15922352
WHERE ", date_clause, "
GROUP BY date;
")

# Fetch data from the MySQL database:
results <- tryCatch(suppressMessages(wmf::mysql_read(query, "log")), error = function(e) {
  quit(save = "no", status = 1)
})

# ...whatever else you need to do with the data before returning a TSV to console...

write.table(results, file = "", append = FALSE, sep = "\t", row.names = FALSE, quote = FALSE)
```

##### Hive in R

For R scripts that need to fetch (and process) data from Hive, use the following template:

```R
#!/usr/bin/env Rscript

suppressPackageStartupMessages(library("optparse"))

option_list <- list(
  make_option(c("-d", "--date"), default = NA, action = "store", type = "character")
)

# Get command line options; if the help option is encountered, print help and
# exit. Otherwise, options not found on the command line get their defaults:
opt <- parse_args(OptionParser(option_list = option_list))

if (is.na(opt$date)) {
  quit(save = "no", status = 1)
}

# Build query; e.g. "2017-01-02" yields "year = 2017 AND month = 01 AND day = 02":
date_clause <- as.character(as.Date(opt$date), format = "year = %Y AND month = %m AND day = %d")

query <- paste0("USE wmf;
SELECT
  TO_DATE(ts) AS date,
  COUNT(*) AS pageviews
FROM webrequest
WHERE
  webrequest_source = 'text'
  AND ", date_clause, "
  AND is_pageview
GROUP BY
  TO_DATE(ts);
")

# Fetch data from the database using Hive:
results <- tryCatch(wmf::query_hive(query), error = function(e) {
  quit(save = "no", status = 1)
})

# ...whatever else you need to do with the data before returning a TSV to console...

write.table(results, file = "", append = FALSE, sep = "\t", row.names = FALSE, quote = FALSE)
```

### Adding New Forecasting Modules

Forecasting modules assume that all the data is current (which is why they are scheduled to run after the metrics modules in main.sh), and the forecast is made for the next day. For example, when backfilling a forecast for 2016-12-01, the model is fit using all available data up to and including 2016-11-30.
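
In other words, the training data must stop the day before the date being forecast. A toy illustration of that subsetting (all names and data here are placeholders):

```R
library(dplyr)

# Toy data standing in for a metric's time series:
data <- data.frame(
  date = seq(as.Date("2016-11-01"), as.Date("2016-12-05"), by = "day"),
  events = rpois(35, 1000)
)

# A forecast for 2016-12-01 is fit on data up to and including 2016-11-30:
forecast_date <- as.Date("2016-12-01")
training_data <- filter(data, date <= forecast_date - 1)
```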

There are three model wrappers in modules/forecasts/models.R (an example call is sketched after this list):

- `forecast_arima()`, which models the time series via ARIMA and accepts the following inputs:
  - `x`: a 1-column xts object
  - `arima_params`: a list with order & seasonal components
  - `bootstrap_ci`: whether prediction intervals are computed using simulation with resampled errors
  - `bootstrap_npaths`: number of sample paths used in computing simulated prediction intervals
  - `transformation`: a transformation to apply to the data ("none", "log", "logit", or "in millions"); the function back-transforms the predictions to the original scale depending on the transformation chosen
- `forecast_bsts()`, which models the time series via BSTS and accepts the following inputs:
  - `x`: a 1-column xts object
  - `n_iter`: number of MCMC iterations to keep
  - `burn_in`: number of MCMC iterations to throw away as burn-in
  - `transformation`: a transformation to apply to the data ("none", "log", "logit", or "in millions"); the function back-transforms the predictions to the original scale depending on the transformation chosen
  - `ar_lags`: number of lags ("p") in the AR(p) process; omitted by default, so an AR(p) state component is NOT added to the state specification
- `forecast_prophet()`, which models the time series via Prophet, the open source forecasting procedure from Facebook's Core Data Science team, and accepts the following inputs:
  - `x`: a 1-column xts object
  - `n_iter`: number of MCMC samples (default 500); if greater than 0, performs full Bayesian inference using Stan with 4 chains, and if 0, performs a fast maximum a posteriori probability (MAP) estimation
  - `transformation`: a transformation to apply to the data ("none", "log", "logit", or "in millions"); the function back-transforms the predictions to the original scale depending on the transformation chosen
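
For example, a call to `forecast_arima()` might look like the sketch below. The series and all argument values are made up for illustration; in particular, the exact form models.R expects for the `seasonal` component may differ:

```R
source("modules/forecasts/models.R")  # defines forecast_arima() et al.
library(xts)

# Placeholder series: 90 days of daily counts as a 1-column xts object.
set.seed(42)
daily_events <- xts(
  rpois(90, lambda = 1e6),
  order.by = seq(as.Date("2016-09-02"), by = "day", length.out = 90)
)

# Illustrative call; these orders and settings are made-up examples:
fit <- forecast_arima(
  x = daily_events,
  arima_params = list(order = c(1, 1, 1), seasonal = c(0, 1, 1)),
  bootstrap_ci = TRUE,
  bootstrap_npaths = 1000,
  transformation = "log"
)
```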

When adding a new forecasting module, add a script-type report to the respective config.yaml and use the following template for the script:

```bash
#!/bin/bash

Rscript modules/forecasts/forecast.R --date=$1 --metric=[your forecasted metric] --model=[ARIMA [--bootstrap_ci]|BSTS|Prophet]
```

Change the `--metric` and `--model` arguments accordingly. The actual data-reading and metric-forecasting calls are in a switch statement in modules/forecasts/forecast.R. Don't forget to add the forecasted metric to the `--metric` option's help text at the top of forecast.R, and to subset the data after reading it in (e.g. `dplyr::filter(data, date <= as.Date(opt$date))`).
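
The pattern inside that switch statement can be sketched as follows; the metric name and data path here are hypothetical, not taken from the actual forecast.R:

```R
# Rough sketch of the switch pattern described above; the metric name and
# data path are hypothetical, not from the actual forecast.R.
opt <- list(metric = "sample_metric", date = "2017-01-01")  # stand-in for parsed options

data <- switch(
  opt$metric,
  "sample_metric" = readr::read_tsv("output/metrics/search/sample_metric.tsv"),
  stop("Unknown metric: ", opt$metric)
)

# Subset so a backfilled forecast never sees data past the target date:
data <- dplyr::filter(data, date <= as.Date(opt$date))
```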

## Additional Information

This repository is developed in Gerrit (see https://www.mediawiki.org/wiki/Developer_access for contributing); it can be browsed in Phabricator/Diffusion and is also mirrored (read-only) to GitHub.

Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.