GitHub mirror of wikimedia/discovery/golden; the canonical code is hosted in Gerrit (see https://www.mediawiki.org/wiki/Developer_access for how to contribute).

Golden Retriever Scripts

This repository contains data acquisition and aggregation scripts that extract data from the MySQL and Hive databases to compute metrics for various teams within Discovery. It uses Analytics' Reportupdater infrastructure and is maintained by the Product Analytics team.

For questions and comments, contact Mikhail or Chelsy.

Table of Contents

Setup and Usage
Dependencies
Testing
Modules
Additional Information

Setup and Usage

As of T170494, the setup and daily runs are Puppetized on stat1005 via the statistics::discovery module (also mirrored on GitHub).

Dependencies

pip install -r reportupdater/requirements.txt

Some of the R packages require C++ libraries. These are installed on stat1002 (which uses compute.pp (GitHub)) by being listed in packages (GitHub); see operations-puppet/modules/statistics/manifests/packages.pp (GitHub) for an example.

# Set WMF proxies:
Sys.setenv("http_proxy" = "http://webproxy.eqiad.wmnet:8080")
Sys.setenv("https_proxy" = "http://webproxy.eqiad.wmnet:8080")

# Set path for packages:
lib_path <- "/srv/discovery/r-library"
.libPaths(lib_path)

# Essentials:
install.packages(
  c("devtools", "testthat", "Rcpp",
    "tidyverse", "data.table", "plyr",
    "optparse", "yaml", "data.tree",
    "ISOcodes", "knitr", "glue",
    # For wmf:
    "urltools", "ggthemes", "pwr",
    # For polloi's datavis functions:
    "shiny", "shinydashboard", "dygraphs", "RColorBrewer",
    # For polloi's data manipulation functions:
    "xts", "mgcv", "zoo"
  ),
  repos = c(CRAN = "https://cran.rstudio.com/"),
  lib = lib_path
)

# 'ortiz' is needed for Search team's user engagement calculation | https://phabricator.wikimedia.org/diffusion/WDOZ/
devtools::install_git("https://gerrit.wikimedia.org/r/wikimedia/discovery/ortiz")

# 'wmf' is needed for querying MySQL and Hive | https://phabricator.wikimedia.org/diffusion/1821/
devtools::install_git("https://gerrit.wikimedia.org/r/wikimedia/discovery/wmf")

# 'polloi' is needed for wikiid-splitting | https://phabricator.wikimedia.org/diffusion/WDPL/
devtools::install_git("https://gerrit.wikimedia.org/r/wikimedia/discovery/polloi")

Don't forget to add any new packages to test.R, since that script verifies that all required packages are installed before performing a test run of the reports.
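The check in test.R amounts to something like the following sketch (hypothetical; the actual script's package list and behavior may differ):

```r
# Hypothetical sketch of a package check like the one test.R performs.
# requireNamespace() returns FALSE instead of erroring when a package is absent.
required <- c("optparse", "yaml", "data.table", "wmf", "polloi", "ortiz")
installed <- vapply(required, requireNamespace, logical(1), quietly = TRUE)
if (!all(installed)) {
  message("Missing packages: ", paste(required[!installed], collapse = ", "))
}
```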

To update packages, use update-library.R:

Rscript /etc/R/update-library.R -l /srv/discovery/r-library
Rscript /etc/R/update-library.R -l /srv/discovery/r-library -p polloi

Testing

If you wish to run all the modules without writing data to files or checking for missingness, use:

Rscript test.R >> test_`date +%F_%T`.log.md 2>&1
# The test script automatically uses yesterday's date.

# Alternatively:
Rscript test.R --start_date=2017-01-01 --end_date=2017-01-02 >> test_`date +%F_%T`.log.md 2>&1

# And have it include samples of the existing data (for comparison):
Rscript test.R --include_samples >> test_`date +%F_%T`.log.md 2>&1

The testing utility finds all the modules, builds a list of the reports, and then performs the appropriate action depending on whether the report is a SQL query or a script. Each module's output is printed to the console. It should go without saying that running through all the modules will take a while. The script produces a Markdown-formatted log that can be saved to a file using the commands above. Various statistics on execution times are printed at the end, including a table of every report's execution time; the table can be omitted with the --omit_times option.

Modules

Adding New Metrics Modules

Hive queries

The scripts that invoke Hive (e.g. the ones that count web requests or use event logging data in Hadoop) must follow the conventions described here. Use the following template to get started:

#!/bin/bash

hive -e "USE wmf;
SELECT
  '$1' AS date,
  ...,
  COUNT(*) AS requests
FROM webrequest
WHERE webrequest_source = 'text' -- also available: 'maps' and 'misc'
  AND CONCAT(year, '-', LPAD(month, 2, '0'), '-', LPAD(day, 2, '0')) >= '$1'
  AND CONCAT(year, '-', LPAD(month, 2, '0'), '-', LPAD(day, 2, '0')) < '$2'
  ...
GROUP BY
  '$1',
  ...;
" 2> /dev/null | grep -v parquet.hadoop
For event logging data in Hadoop, use a similar template that queries the event database:

#!/bin/bash

hive -e "USE event;
SELECT
  '$1' AS date,
  ...
FROM ${SCHEMA_NAME}
WHERE CONCAT(year, '-', LPAD(month, 2, '0'), '-', LPAD(day, 2, '0')) >= '$1'
  AND CONCAT(year, '-', LPAD(month, 2, '0'), '-', LPAD(day, 2, '0')) < '$2'
  -- optional: specifying revision ID
  ...;
" 2> /dev/null | grep -v parquet.hadoop

R scripts

A note on paths: Reportupdater does not cd into the query folder, so scripts must be invoked relative to the directory Reportupdater is executed from, e.g. Rscript modules/metrics/search/some_script.R -d $1

These scripts have two parts. The first is the shell script called by update_reports.py, which must adhere to Reportupdater's script conventions:

#!/bin/bash

Rscript modules/metrics/search/script.R --date=$1
# Alternatively: Rscript modules/metrics/search/script.R -d $1

The second part is the script.R that it calls, which should adhere to one of the two templates below. Note that both specify file = "" in write.table because we want to print the data as TSV to the console for Reportupdater.
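The effect of file = "" can be checked in isolation (the data frame here is just an illustration):

```r
# file = "" makes write.table print to standard output instead of a file,
# which is how Reportupdater captures a report's TSV:
results <- data.frame(date = "2017-01-01", events = 42L)
write.table(results, file = "", append = FALSE, sep = "\t",
            row.names = FALSE, quote = FALSE)
# date	events
# 2017-01-01	42
```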

MySQL in R

For R scripts that need to fetch (and process) data from MySQL, use the following template:

#!/usr/bin/env Rscript

suppressPackageStartupMessages(library("optparse"))

option_list <- list(
  make_option(c("-d", "--date"), default = NA, action = "store", type = "character")
)

# Get command line options, if help option encountered print help and exit,
# otherwise if options not found on command line then set defaults:
opt <- parse_args(OptionParser(option_list = option_list))

if (is.na(opt$date)) {
  quit(save = "no", status = 1)
}

# Build query:
date_clause <- as.character(as.Date(opt$date), format = "LEFT(timestamp, 8) = '%Y%m%d'")

query <- paste0("
SELECT
  DATE('", opt$date, "') AS date,
  COUNT(*) AS events
FROM TestSearchSatisfaction2_15922352
WHERE ", date_clause, "
GROUP BY date;
")

# Fetch data from MySQL database:
results <- tryCatch(suppressMessages(wmf::mysql_read(query, "log")), error = function(e) {
  quit(save = "no", status = 1)
})

# ...whatever else you need to do with the data before returning a TSV to console...

write.table(results, file = "", append = FALSE, sep = "\t", row.names = FALSE, quote = FALSE)
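The date_clause trick in the template works because format.Date passes characters that are not conversion codes through unchanged, so a date turns directly into a WHERE condition:

```r
# as.character() on a Date forwards the format string to format.Date(),
# which substitutes %Y%m%d and leaves the surrounding text as-is:
date_clause <- as.character(as.Date("2017-01-01"),
                            format = "LEFT(timestamp, 8) = '%Y%m%d'")
date_clause
# [1] "LEFT(timestamp, 8) = '20170101'"
```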

Hive in R

For R scripts that need to fetch (and process) data from Hive, use the following template:

#!/usr/bin/env Rscript

suppressPackageStartupMessages(library("optparse"))

option_list <- list(
  make_option(c("-d", "--date"), default = NA, action = "store", type = "character")
)

# Get command line options, if help option encountered print help and exit,
# otherwise if options not found on command line then set defaults:
opt <- parse_args(OptionParser(option_list = option_list))

if (is.na(opt$date)) {
  quit(save = "no", status = 1)
}

# Build query:
date_clause <- as.character(as.Date(opt$date), format = "year = %Y AND month = %m AND day = %d")

query <- paste0("USE wmf;
SELECT
  TO_DATE(ts) AS date,
  COUNT(*) AS pageviews
FROM webrequest
WHERE
  webrequest_source = 'text'
  AND ", date_clause, "
  AND is_pageview
GROUP BY
  TO_DATE(ts);
")

# Fetch data from database using Hive:
results <- tryCatch(wmf::query_hive(query), error = function(e) {
  quit(save = "no", status = 1)
})

# ...whatever else you need to do with the data before returning a TSV to console...

write.table(results, file = "", append = FALSE, sep = "\t", row.names = FALSE, quote = FALSE)
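Here the same format.Date trick yields a partition filter. The %m and %d codes are zero-padded, which is harmless: Hive parses 01 as the integer 1 when comparing against the int-typed partition columns.

```r
# Non-format characters pass through format.Date() unchanged, producing a
# ready-made Hive partition predicate from a single date:
date_clause <- as.character(as.Date("2017-01-01"),
                            format = "year = %Y AND month = %m AND day = %d")
date_clause
# [1] "year = 2017 AND month = 01 AND day = 01"
```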

Additional Information

This repository can be browsed in Phabricator/Diffusion, but is also (read-only) mirrored to GitHub.

Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.
