-
Notifications
You must be signed in to change notification settings - Fork 0
/
report.Rmd
147 lines (128 loc) · 8.22 KB
/
report.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
---
title: The unfulfilled promises of sitemaps
subtitle: Statistical analysis of search engine-referred traffic to the Italian Wikipedia
author: Mikhail Popov
date: "`r sub('^0(\\d)', '\\1', format(Sys.Date(), '%d %B %Y'))`"
abstract: |
In July 2018, the Italian Wikipedia instituted a temporary block to protest the copyright potential copyright changes in Europe. This had a negative side-effect of search engines sending traffic to the redirected page even after the redirect was turned off, so the Wikimedia Foundation's Performance Team created and deployed sitemaps for the Italian Wikipedia (and submitted them to Google Search Console) on August 10th, 2018 -- both as a fix to the problem at hand and as an initial foray into sitemaps for Wikimedia projects.
We used Bayesian time series intervention analysis with hierarchical modeling to analyze desktop traffic to the Italian Wikipedia from all recognized search engines and found that approximately 9.2K pageviews (95% CI: -4.6K--31K) -- an increase of 0.75% from the daily average of 1.24M -- could be attributed to the deployment of sitemaps, up to a long-term boost of 35K (95% CI: -12K--87K). These initial results are not particularly impressive. Although the confidence interval extends below 0 -- signifying chance of negative effect -- we can say with some certainty that sitemaps at least do not hurt traffic, but there is currently no strong, convincing evidence that they improve it either (especially not by much). We recommend further testing on wikis where there would be no artefacts such as the redirect, and where the size of the wiki might affect how useful sitemaps are to its search engine-referred traffic.
----
**Open report:** Code, data, and RMarkdown source document are [publicly available](https://github.com/wikimedia-research/SEO-Adhoc-Sitemaps-Itwiki) under a [Creative Commons Attribution-ShareAlike 4.0 International License](http://creativecommons.org/licenses/by-sa/4.0/).
output:
wmfpar::pdf_report:
short_title: Italian Wikipedia sitemap impact (T202643)
cite_r_packages:
# Presenation:
- kableExtra
# Data workflow:
- magrittr
- glue
- dplyr
- tidyr
- purrr
- broom
- readr
# Data visualization:
- ggplot2
- patchwork
# Modeling
- stats # arima()
- mlr # createDummyFeatures()
- rstan
- bridgesampling
# Misc.
- wmf
extra_bibs:
- references.bib
includes:
after_body: appendix.tex
header-includes:
- \usepackage{pdflscape}
- \newcommand{\blandscape}{\begin{landscape}}
- \newcommand{\elandscape}{\end{landscape}}
nocite: '@*'
---
```{r appendix, include=FALSE}
# Use the latest version of the appendix:
rmarkdown::render("appendix.Rmd", output_file = "appendix.tex", quiet = TRUE, envir = new.env())
```
```{r setup, include=FALSE}
library(knitr); library(kableExtra)
opts_chunk$set(
echo = FALSE, message = FALSE, warning = FALSE,
dev = "png", dpi = 600
)
options(knitr.table.format = "latex", scipen = 500, digits = 4)
library(ggplot2)
library(patchwork)
library(zeallot)
```
# Introduction
For a few days in July 2018, all traffic that went to the Italian Wikipedia was redirected (via JavaScript) to a [page protesting potential copyright changes in the European Union](https://it.wikipedia.org/wiki/Wikipedia:Comunicato_3_luglio_2018/en); refer to [Hershenov [-@hershenov_2018]](https://blog.wikimedia.org/2018/06/29/eu-copyright-proposal-will-hurt-web-wikipedia/) for more information. After the redirect ended, it was observed that there was still a non-negligible amount of traffic going to the redirect page (see [T199252](https://phabricator.wikimedia.org/T199252)), with almost a million pageviews on July 11th, a full six days after the redirect was turned off. In an attempt to fix the problem while also implementing a recommendation from our SEO consultation with Go Fish Digital ([T198965](https://phabricator.wikimedia.org/T198965)), the Wikimedia Performance Team created [sitemaps](https://en.wikipedia.org/wiki/Sitemaps) for the Italian Wikipedia and deployed them on 10 August 2018, when it was also submitted to the Google Search Console.
In this report, we present the results of Bayesian time series intervention analysis with hierarchical modeling to determine whether the deployment of the sitemaps had a positive impact on search engine-referred desktop traffic to the Italian Wikipedia, as measured by [pageviews](https://meta.wikimedia.org/wiki/Research:Page_view). Our statistical model employed an [autoregressive moving average component](https://en.wikipedia.org/wiki/Autoregressive%E2%80%93moving-average_model) for the time series with the change-due-to-intervention modeled as immediate, gradually levelling-off.
Using the model, we estimated the effect of the sitemaps to be an additional 9.2K desktop pageviews (95% CI: -4.6K--31K) to the Italian Wikipedia from all recognized search engines, which would make it an increase of 0.75% from the average of 1.24M search engine-referred desktop pageviews per day. These initial results do not look particularly impressive or convincing. Granted, deploying the sitemaps to the Italian Wikipedia -- whose traffic from Google suffered a negative side-effect due to the copyright redirect -- makes the results potentially slightly unreliable due to the unmodeled artefact. Therefore, we recommend further tests by deploying on a couple of big and small wikis where there were no additional interventions leading up to the deployment of sitemaps. Such additional tests would enable us to identify with greater certainty whether the sitemaps help traffic, and whether smaller wikis would benefit more or less than bigger wikis.
\clearpage
```{r data, include=FALSE}
source("data.R")
itwiki_pageviews <- itwiki_pvs %>%
dplyr::select(-day) %>%
tidyr::gather(access_method, pageviews, both, desktop, `mobile web`)
```
```{r datavis_ts, fig.height=6, fig.width=10, fig.cap="The negative trend of the slowly decreasing desktop traffic cancels out the positive trend of the slowly increasing mobile (web) traffic, with the overall search engine traffic around 8 million pageviews per day."}
p1 <- ggplot(
dplyr::filter(itwiki_pageviews, access_method == "both"),
aes(x = date, y = pageviews)
) +
geom_line() +
geom_vline(
aes(xintercept = date), linetype = "dashed",
data = dplyr::filter(events, event == "sitemap deployment")
) +
scale_x_date(date_breaks = "6 months", date_minor_breaks = "1 month") +
labs(
x = "Date", y = "Pageviews (in millions)",
title = "Search engine traffic to the Italian Wikipedia"
) +
wmf::theme_min(14)
p2 <- ggplot(
dplyr::filter(itwiki_pageviews, access_method != "both"),
aes(x = date, y = pageviews, color = access_method)
) +
geom_line() +
geom_vline(
aes(xintercept = date), linetype = "dashed",
data = dplyr::filter(events, event == "sitemap deployment")
) +
scale_color_manual(values = c("#2a4b8d", "#ac6600")) +
scale_x_date(date_breaks = "6 months", date_minor_breaks = "1 month") +
labs(
x = "Date", y = "Pageviews (in millions)", color = "Access method",
title = "Search engine traffic to the Italian Wikipedia, by access method"
) +
wmf::theme_min(14)
p1 + p2 + plot_layout(ncol = 1)
```
```{r dataviz_monthly_seasonality, fig.width=8, fig.height=4, fig.cap="Every year the Italian Wikipedia has a similar pattern of search engine-referred traffic across months, which indicated to us that year and month should be included in the model of general search engine-referred traffic."}
month_abbreviations <- set_names(month.abb, month.name)
itwiki_pageviews %>%
dplyr::filter(access_method == "desktop") %>%
dplyr::mutate(month = factor(month_abbreviations[month], month_abbreviations)) %>%
ggplot(aes(x = month, y = pageviews, group = year, color = year)) +
stat_summary(fun.y = mean, geom = "line", size = 1.1) +
scale_color_manual(values = unname(wmf::colors_discrete(3))) +
labs(
x = "Month", y = "Pageviews (in millions)", color = "Year",
title = "Average daily desktop traffic from search engines"
) +
wmf::theme_min(14)
```
\clearpage
# Methods
```{r methods, child="methods.Rmd"}
```
# Results and Discussion
```{r results, child="results.Rmd"}
```
\clearpage
# References
\footnotesize