This repository contains archived aggregation/acquisition scripts that extract data from the MySQL/Hive databases to compute metrics for the Search Platform team (formerly Discovery). It uses the Reportupdater infrastructure. This codebase was maintained by Mikhail Popov of the Product Analytics team and was decommissioned as part of T227782.
As of T170494, the setup and daily runs are Puppetized on stat1007 via the statistics::discovery module (also mirrored on GitHub).
pip install -r reportupdater/requirements.txt
Some of the R packages require C++ libraries. These are installed on stat1007, which uses compute.pp (GitHub), by being listed in packages (GitHub). See operations-puppet/modules/statistics/manifests/packages.pp (GitHub) for an example.
# Set WMF proxies:
Sys.setenv("http_proxy" = "http://webproxy.eqiad.wmnet:8080")
Sys.setenv("https_proxy" = "http://webproxy.eqiad.wmnet:8080")
# Set path for packages:
lib_path <- "/srv/discovery/r-library"
.libPaths(lib_path)
# Essentials:
install.packages(
  c("devtools", "testthat", "Rcpp",
    "tidyverse", "data.table", "plyr",
    "optparse", "yaml", "data.tree",
    "ISOcodes", "knitr", "glue",
    # For wmf:
    "urltools", "ggthemes", "pwr",
    # For polloi's datavis functions:
    "shiny", "shinydashboard", "dygraphs", "RColorBrewer",
    # For polloi's data manipulation functions:
    "xts", "mgcv", "zoo"
  ),
  repos = c(CRAN = "https://cran.rstudio.com/"),
  lib = lib_path
)
# 'ortiz' is needed for Search team's user engagement calculation | https://phabricator.wikimedia.org/diffusion/WDOZ/
devtools::install_git("https://gerrit.wikimedia.org/r/wikimedia/discovery/ortiz")
# 'wmf' is needed for querying MySQL and Hive | https://phabricator.wikimedia.org/diffusion/1821/
devtools::install_git("https://gerrit.wikimedia.org/r/wikimedia/discovery/wmf")
# 'polloi' is needed for wikiid-splitting | https://phabricator.wikimedia.org/diffusion/WDPL/
devtools::install_git("https://gerrit.wikimedia.org/r/wikimedia/discovery/polloi")
Don't forget to add new packages to test.R, because that script checks that all packages are installed before performing a test run of the reports.
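Since test.R checks for installed packages before a test run, it can be handy to inspect the library directory beforehand. The following is a rough shell sketch and not what test.R itself does: it assumes that an installed R package shows up as a directory containing a DESCRIPTION file under the library path, and `missing_packages` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: print the names of packages that have no
# <library>/<package>/DESCRIPTION file, one name per line.
# Assumption: an installed package is a directory with a DESCRIPTION file.
missing_packages() {
  lib_dir=$1
  shift
  for pkg in "$@"; do
    if [ ! -f "$lib_dir/$pkg/DESCRIPTION" ]; then
      printf '%s\n' "$pkg"
    fi
  done
}

# Example: check the three Wikimedia packages in the shared library.
missing_packages /srv/discovery/r-library ortiz wmf polloi
```

Any name this prints is a package that still needs to be installed; no output means all of the listed packages were found.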
To update packages, use update-library.R:
Rscript /etc/R/update-library.R -l /srv/discovery/r-library
Rscript /etc/R/update-library.R -l /srv/discovery/r-library -p polloi
If you wish to run all the modules without writing data to files or checking for missingness, use:
Rscript test.R >> test_`date +%F_%T`.log.md 2>&1
# The test script automatically uses yesterday's date.
# Alternatively:
Rscript test.R --start_date=2017-01-01 --end_date=2017-01-02 >> test_`date +%F_%T`.log.md 2>&1
# And have it include samples of the existing data (for comparison):
Rscript test.R --include_samples >> test_`date +%F_%T`.log.md 2>&1
The testing utility finds all the modules, builds a list of the reports, and then runs each report in the way appropriate to its type: as a SQL query or as a script. Each module's output is printed to the console. Running through all the modules takes a while. The script outputs a Markdown-formatted log that can be saved to a file using the commands above. Statistics on the execution times are printed at the end, including a table of every report's execution time. The table can be omitted using the --omit_times option.
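The find-and-dispatch step can be sketched in shell. This is only an illustration, not the logic in test.R: it assumes that SQL reports can be recognised by a .sql extension and that every other file is a script, and `report_type` and `list_reports` are hypothetical names. A real run would presumably take the report list from each module's configuration rather than from file names.

```shell
#!/bin/sh
# Hypothetical classifier: decide how a report file would be handled.
# Assumption: SQL reports end in .sql; anything else is treated as a script.
report_type() {
  case "$1" in
    *.sql) echo "sql" ;;
    *)     echo "script" ;;
  esac
}

# Walk every file below a modules directory and print "<type><TAB><path>".
list_reports() {
  modules_dir=$1
  find "$modules_dir" -type f | sort | while IFS= read -r report; do
    printf '%s\t%s\n' "$(report_type "$report")" "$report"
  done
}

# Example: list the reports if a modules/metrics directory is present.
if [ -d modules/metrics ]; then
  list_reports modules/metrics
fi
```

Separating the classification from the directory walk keeps each piece easy to check on its own, for example by calling `report_type` on a single file name.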
- Metrics (modules/metrics)
- Search (configuration)
- API usage
- Search on Android and iOS apps
- Event counts
- Load times (invokes load_times.R)
- Invoke source counts on Android (T143726)
- Positions of clicked results on Android (T143726)
- Search on Mobile Web
- Event counts
- Load times (invokes load_times.R)
- Session counts (invokes mobile_session_counts.R)
- Search on Desktop
- Event counts
- Load times (invokes load_times.R)
- Survival/LDN: Retention of users on visited pages (T113297)
- Dwell-time: % of users visiting results for more than 10s (T113297, T113513, Change 240593)
- Time spent on search result pages (SRPs) (invokes srp_survtime.R)
- PaulScore (T144424)
- Bounce rate (invokes desktop_return_rate.R)
- Dwell-time, PaulScore, event counts, etc. broken down by language-project (planned, T150410)
- Zero results rate (all invoke cirrus_aggregates.R)
- Sister search
- Article pageviews from full-text search
- Full-text SRP views by device and agent type
- Wikidata Query Service (configuration)
- Maps (configuration)
- Kartotherian usage
- Users by country (T119448)
- Tile requests (T113832)
- Maps prevalence on wikis (T170022)
- Kartotherian usage
- External Traffic (configuration)
- Search (configuration)
Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.