
GitHub mirror of "wikimedia/discovery/golden" - our actual code is hosted with Gerrit (please see https://www.mediawiki.org/wiki/Developer_access for contributing)


wikimedia/wikimedia-discovery-golden


Golden Retriever Scripts – ARCHIVED

This repository contains archived aggregation/acquisition scripts that extracted data from the MySQL/Hive databases to compute metrics for the Search Platform team (formerly Discovery). It uses the Reportupdater infrastructure. This codebase was maintained by Mikhail Popov of the Product Analytics team and was decommissioned as part of T227782.


Setup and Usage

As of T170494, the setup and daily runs are Puppetized on stat1007 via the statistics::discovery module (also mirrored on GitHub).

Dependencies

pip install -r reportupdater/requirements.txt

Some of the R packages require C++ libraries. On stat1007, and on other machines that use compute.pp (GitHub), these libraries are installed because they are listed in packages (GitHub). See operations-puppet/modules/statistics/manifests/packages.pp (GitHub) for an example.

# Set WMF proxies:
Sys.setenv("http_proxy" = "http://webproxy.eqiad.wmnet:8080")
Sys.setenv("https_proxy" = "http://webproxy.eqiad.wmnet:8080")

# Set path for packages:
lib_path <- "/srv/discovery/r-library"
.libPaths(lib_path)

# Essentials:
install.packages(
  c("devtools", "testthat", "Rcpp",
    "tidyverse", "data.table", "plyr",
    "optparse", "yaml", "data.tree",
    "ISOcodes", "knitr", "glue",
    # For wmf:
    "urltools", "ggthemes", "pwr",
    # For polloi's datavis functions:
    "shiny", "shinydashboard", "dygraphs", "RColorBrewer",
    # For polloi's data manipulation functions:
    "xts", "mgcv", "zoo"
  ),
  repos = c(CRAN = "https://cran.rstudio.com/"),
  lib = lib_path
)

# 'ortiz' is needed for Search team's user engagement calculation | https://phabricator.wikimedia.org/diffusion/WDOZ/
devtools::install_git("https://gerrit.wikimedia.org/r/wikimedia/discovery/ortiz")

# 'wmf' is needed for querying MySQL and Hive | https://phabricator.wikimedia.org/diffusion/1821/
devtools::install_git("https://gerrit.wikimedia.org/r/wikimedia/discovery/wmf")

# 'polloi' is needed for wikiid-splitting | https://phabricator.wikimedia.org/diffusion/WDPL/
devtools::install_git("https://gerrit.wikimedia.org/r/wikimedia/discovery/polloi")

Remember to add any new packages to test.R, because that script checks that all packages are installed before performing a test run of the reports.

To update packages, use update-library.R:

Rscript /etc/R/update-library.R -l /srv/discovery/r-library
Rscript /etc/R/update-library.R -l /srv/discovery/r-library -p polloi

Testing

If you wish to run all the modules without writing data to files or checking for missingness, use:

Rscript test.R >> test_`date +%F_%T`.log.md 2>&1
# The test script automatically uses yesterday's date.

# Alternatively:
Rscript test.R --start_date=2017-01-01 --end_date=2017-01-02 >> test_`date +%F_%T`.log.md 2>&1

# And have it include samples of the existing data (for comparison):
Rscript test.R --include_samples >> test_`date +%F_%T`.log.md 2>&1

The testing utility finds all the modules, builds a list of the reports, and then performs the appropriate action depending on whether each report is a SQL query or a script. Each module's output is printed to the console. Running through all the modules takes a while. The script outputs a Markdown-formatted log that can be saved to a file using the commands above. Various statistics on the execution times are printed at the end, including a table of all the reports' execution times. The table can be omitted using the --omit_times option.
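The discovery step described above can be illustrated with a minimal shell sketch. It is not the logic of test.R itself. It assumes a hypothetical layout of one directory per module with report files directly inside it, and it assumes that SQL reports end in .sql while script reports are executable files. The function names classify_report and list_reports are invented for this illustration.

```shell
#!/bin/sh
# Hypothetical sketch: list the reports found under a modules directory.
# Assumed layout (not taken from the repository): <root>/<module>/<report>

# Print "sql" for SQL query files, "script" for executable files,
# and "skip" for anything else (for example, configuration files).
classify_report() {
  case "$1" in
    *.sql) echo "sql" ;;
    *)
      if [ -f "$1" ] && [ -x "$1" ]; then
        echo "script"
      else
        echo "skip"
      fi
      ;;
  esac
}

# Walk every module directory under the given root and print one
# tab-separated line per report: its kind, then its path.
list_reports() {
  root="$1"
  for report in "$root"/*/*; do
    [ -e "$report" ] || continue        # the glob matched nothing
    kind=$(classify_report "$report")
    [ "$kind" = "skip" ] && continue    # not a report
    printf '%s\t%s\n' "$kind" "$report"
  done
}
```

Calling list_reports with the path to a modules directory prints the kind and path of each report found. A test runner could then read that list and either submit each SQL file as a query or execute each script with the start and end dates.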

Modules

Additional Information

Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.
