Report.Rmd

---
output:
  html_document:
    css: Report.css
    fig_width: 10
    includes:
      before_body: Summary.html
    keep_md: yes
    toc: yes
---

```{r setup, include = FALSE}
library(ggplot2)
library(magrittr)
library(knitr)
library(scales)
library(mclust)
library(gridExtra)
library(urltools) # devtools::install_github('ironholds/urltools')
library(xts)
library(dygraphs)
import::from(dplyr, group_by, select, summarize,
             right_join, left_join, ungroup, mutate,
             keep_where = filter)
knitr::opts_chunk$set(echo = FALSE, warning = FALSE, message = FALSE)
load('data/Queries_2015-10-05.RData')
load('data/Referrers_2015-10-05.RData')
```

```{r utils, include = FALSE}
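# Return the n most frequent values of x (as character strings):
top_n <- function(x, n = 10) {
  y <- sort(table(x), decreasing = TRUE)
  head(names(y), n)
}
# Negation of %in%:
"%notin%" <- function(x, y) !(x %in% y)
# Truncate strings longer than `width` characters and append an ellipsis:
smart_trim <- function(x, width) {
  trimmed <- paste0(strtrim(x, width), '...')
  ifelse(nchar(x) > width, trimmed, x)
}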
```

## Background

The Wikidata Query Service (WDQS) is designed to let users run queries on the data contained in Wikidata. The service uses *SPARQL Protocol and RDF Query Language* (SPARQL) as the query language.

## Statistics

```{r queries_per_day}
wdqs_queries %>%
  group_by(date) %>%
  summarize(Queries = n()) %>%
  ggplot(data = ., aes(x = date, y = Queries)) +
  geom_line(size = 1.1) +
  geom_vline(xintercept = as.numeric(as.Date("2015-09-07")),
             linetype = "dashed", color = "purple") +
  geom_text(x = as.numeric(as.Date("2015-09-01")), y = 1000,
            label = "Announcement", color = "purple") +
  ggtitle("Daily service usage (total queries per day)") +
  wmf::theme_fivethirtynine()
```

```{r users_per_day}
wdqs_queries %>%
  select(date, user_id) %>%
  unique %>%
  group_by(date) %>%
  summarize(Users = n()) %>%
  ggplot(data = ., aes(x = date, y = Users)) +
  geom_hline(yintercept = 0) +
  geom_line(size = 1.1) +
  geom_vline(xintercept = as.numeric(as.Date("2015-09-07")),
             linetype = "dashed", color = "purple") +
  geom_text(x = as.numeric(as.Date("2015-09-01")), y = 400,
            label = "Announcement", color = "purple") +
  ggtitle("Daily service usage (total users per day)") +
  wmf::theme_fivethirtynine()
```

The number of users of the service has fallen since the announcement, hovering at around 100 users per day in recent weeks.

```{r median_queries_per_user}
wdqs_queries %>%
  group_by(date, user_id) %>%
  summarize(`queries per user` = n()) %>%
  group_by(date) %>%
  summarize(`median queries per user` = median(`queries per user`),
            `25%` = quantile(`queries per user`, 0.25),
            `75%` = quantile(`queries per user`, 0.75)) %>%
  ggplot(data = .) +
  geom_ribbon(aes(x = date, ymax = `75%`, ymin = `25%`),
              alpha = 0.3) +
  geom_vline(xintercept = as.numeric(as.Date("2015-09-07")),
             linetype = "dashed", color = "purple") +
  geom_text(x = as.numeric(as.Date("2015-09-14")), y = 60,
            label = "Announcement", color = "purple") +
  geom_line(aes(x = date, y = `median queries per user`)) +
  ggtitle("Median queries per user per day") +
  wmf::theme_fivethirtynine()
```

The lower and upper bounds of the ribbon represent the first and third quartiles (25% and 75%). The number of queries per user stabilized considerably after the announcement, mostly because before the announcement the queries came from a few bots testing the service.

```{r countries, eval = FALSE}
summarize(group_by(users, countries), n = n())$countries %>%
  sort %>% paste0(collapse = ', ') %>% print
```

WDQS users are a very geographically diverse bunch! In fact, 73 different countries[^countries] were represented between August 23<sup>rd</sup> and October 4<sup>th</sup>.

```{r top_10_countries, fig.width = 4, fig.height = 8}
users %>%
  mutate(Country = ifelse(countries %in% top_n(countries),
                          countries, 'Other')) %>%
  group_by(Country) %>%
  summarize(Users = n()) %>%
  mutate(Total = sum(Users),
         Percent = Users/Total) %>%
  select(-Total) %>%
  ggplot(data = ., aes(y = Percent, x = reorder(Country, Users))) +
  scale_y_continuous(labels = scales::percent_format()) +
  geom_bar(stat = "identity", alpha = 0.4) +
  geom_text(aes(label = sprintf("%.2f%%", 100*Percent))) +
  coord_flip() +
  xlab("Country") +
  ggtitle("Top 10 countries by number of WDQS users") +
  wmf::theme_fivethirtynine()
```

The U.S., the U.K., Germany, and France are the top-represented countries, with the U.S. leading the pack.

```{r top_10_browsers, fig.width = 8, fig.height = 4}
users %>%
  mutate(Browser = ifelse(browser %in% top_n(browser, 5),
                          browser, 'Other')) %>%
  group_by(Browser) %>%
  summarize(Users = n()) %>%
  mutate(Total = sum(Users),
         Percent = Users/Total) %>%
  select(-Total) %>%
  ggplot(data = ., aes(y = Percent, x = reorder(Browser, -Users))) +
  scale_y_continuous(labels = scales::percent_format()) +
  geom_bar(stat = "identity", alpha = 0.4) +
  geom_text(aes(label = sprintf("%.2f%%", 100*Percent))) +
  xlab("Browser") +
  ggtitle("Top 5 browsers by number of WDQS users") +
  wmf::theme_fivethirtynine()
```

Chrome and Firefox are, unsurprisingly, WDQS users' preferred browsers.

```{r top_10_oses, fig.width = 4, fig.height = 8}
users %>%
  mutate(`Operating System` = ifelse(os %in% top_n(os, 10),
                                     os, 'Other')) %>%
  group_by(`Operating System`) %>%
  summarize(Users = n()) %>%
  mutate(Total = sum(Users),
         Percent = Users/Total) %>%
  select(-Total) %>%
  ggplot(data = ., aes(y = Percent, x = reorder(`Operating System`, Users))) +
  scale_y_continuous(labels = scales::percent_format()) +
  geom_bar(stat = "identity", alpha = 0.4) +
  geom_text(aes(label = sprintf("%.2f%%", 100*Percent))) +
  coord_flip() +
  xlab("Operating System") +
  ggtitle("Top 10 OSes by number of WDQS users") +
  wmf::theme_fivethirtynine()
```

Windows 7 and Mac OS X are by far the most popular operating systems among WDQS users.

### User-written queries vs provided example queries

Perhaps the biggest challenge of working with this dataset was the fact that a lot of the queries that our users ran were just sample queries we provide on query.wikidata.org or examples found on MediaWiki/Wikitech. Therefore, we put together a procedure for detecting whether a query is an example or not.

We were able to find a few queries and manually mark them as examples. Other queries that were perfect matches to these manually verified queries were marked as "definitely an example".

#### Methods

First, we compiled a list of example queries, stripped out extraneous spaces ("condensed"), and cropped them using the maximum length of condensed user-submitted queries.
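
As a rough illustration of this preprocessing step, the sketch below condenses and crops a couple of made-up queries. The helper `condense()`, the sample strings, and the reading of "extraneous spaces" as "runs of whitespace" are assumptions made for illustration, not the exact code used for this report.

```r
# Hypothetical helper: collapse runs of whitespace into single spaces and trim the ends.
condense <- function(x) trimws(gsub("[[:space:]]+", " ", x))

# Made-up stand-ins for the example queries and the user-submitted queries:
example_queries <- c(
  "SELECT ?item\nWHERE {\n  ?item wdt:P31 wd:Q146 .\n}",
  "SELECT ?x   WHERE { ?x wdt:P31 wd:Q5 }   LIMIT 10"
)
user_queries <- c(
  "SELECT ?item WHERE { ?item wdt:P31",
  "ASK  {  wd:Q42 wdt:P21 ?g  }"
)

# Crop each condensed example to the length of the longest condensed user query:
max_length <- max(nchar(condense(user_queries)))
examples_cropped <- substr(condense(example_queries), 1, max_length)
```

Cropping puts the examples on the same footing as the user-submitted queries, which (as discussed under Partial Queries) were themselves cut off.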

The procedure can be described as follows. For each user-submitted query $Q$:

1. Compute the Levenshtein distance $d_m$ between it and each of the $M$ example queries on file ($E_m$ for $m = 1, \ldots, M$), that is, the minimal (possibly weighted) number of insertions, deletions, and substitutions needed to transform the query into the example.
2. Calculate the maximum edit distance for each of the query-example query pairs. That is, if $|q|$ is the length of the query $q$, then the maximum edit distance is $\max\{|Q|, |E_m|\}$ for each $m = 1, \ldots, M$.
3. Normalize the computed Levenshtein distance by dividing it by the maximum edit distance. Let $d_m^N \in [0, 1]$ be this normalized distance.
4. Find the smallest normalized distance $\hat{d}$: $$ \hat{d} = \min_{m = 1, \ldots, M} d_m^N$$
5. Calculate the "pseudo probability" $\hat{p} = 1 - \hat{d}$.
6. Calculate the "pseudo odds" $\hat{\theta} = \hat{p}/(1-\hat{p})$.
7. Then:
    a. If $\hat{\theta} \geq 4$, mark as "probably example".
    b. If $2 \leq \hat{\theta} < 4$, mark as "maybe an example, maybe not".
    c. Otherwise, mark as "definitely not an example" (or "definitely an example" when the query is a perfect match).

That is, if a query is at least 4 times as likely to be an example as not, it makes sense to say it's probably an example.
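
The steps above can be sketched in a few lines of R. This is a minimal illustration rather than the code used for this report: `classify_query()` and the sample strings are hypothetical, the unweighted distance from `utils::adist()` stands in for the Levenshtein distance, the thresholds of 2 and 4 are the ones described in the procedure, and treating an exact match as "definitely an example" is our reading of the last step.

```r
# Hypothetical helper: label one (condensed) query given a vector of example queries.
classify_query <- function(query, examples) {
  d <- as.vector(utils::adist(query, examples))  # edit distances d_m
  d_max <- pmax(nchar(query), nchar(examples))   # maximum edit distances
  d_hat <- min(d / d_max)                        # smallest normalized distance
  p_hat <- 1 - d_hat                             # "pseudo probability"
  odds <- p_hat / (1 - p_hat)                    # "pseudo odds" (Inf for an exact match)
  if (d_hat == 0) {
    "definitely an example"
  } else if (odds >= 4) {
    "probably example"
  } else if (odds >= 2) {
    "maybe an example, maybe not"
  } else {
    "definitely not an example"
  }
}

# Made-up example queries and a user query that differs by a single character:
examples <- c("SELECT ?item WHERE { ?item wdt:P31 wd:Q146 }",
              "SELECT ?x WHERE { ?x wdt:P31 wd:Q5 } LIMIT 10")
classify_query("SELECT ?item WHERE { ?item wdt:P31 wd:Q147 }", examples)
```

Because the distance is divided by the length of the longer string, long queries that differ from an example by only a few characters still end up with very high pseudo odds.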

#### Results

```{r query_by_chance_of_sample}
wdqs_queries %>%
  group_by(date, sample) %>%
  summarize(n = n()) %>%
  ggplot(data = .) +
  geom_line(aes(x = date, y = n, color = sample)) +
  ylab("Number of queries") +
  scale_color_discrete(name = "Example query") +
  wmf::theme_fivethirtynine() +
  geom_line(aes(x = date, y = n),
            data = wdqs_queries %>%
              group_by(date) %>%
              summarize(n = n()),
            linetype = "dashed") +
  ggtitle("Total queries, example queries, and user-written queries over time")
```

Total queries (black dashes) over time and how many were the sample queries we provided for demonstration.


```{r example_query_counts}
wdqs_queries %>%
  group_by(sample) %>%
  summarize(`Total queries` = n()) %>%
  dplyr::rename(`Is the query an example?` = sample) %>%
  mutate(Total = sum(`Total queries`),
         `% of total` = round(100*`Total queries`/Total, 2)) %>%
  select(-Total) %>%
  knitr::kable()
```

For many of the statistical breakdowns in this report, we will restrict ourselves to queries that are definitely not examples we provided.

### Tracking daily service usage in top countries

```{r queries_from_countries_over_time}
wdqs_queries %>%
  left_join(users, by = "user_id") %>%
  keep_where(countries != 'Unknown') %>%
  mutate(Country = ifelse(countries %in% top_n(countries, 6),
                          countries, 'Other'),
         Day = as.POSIXct(round(timestamp, "days"))) %>%
  keep_where(Country != 'Other') %>%
  group_by(Day, Country) %>%
  summarize(Queries = n()) %>%
  ggplot(data = ., aes(x = Day, y = Queries)) +
  facet_grid(Country ~ .) +
  scale_x_datetime(labels = date_format("%a %m/%d")) +
  geom_vline(xintercept = as.numeric(lubridate::ymd("2015-09-07", tz = "UTC")),
             linetype = "dashed", color = "purple") +
  # geom_smooth(method = "loess", se = FALSE) +
  geom_line() +
  theme_bw() +
  ggtitle("Usage in top 5 countries (by total queries)")
```

Varying patterns of WDQS usage by country (top 5 countries, over time). Purple dashes mark the public announcement.

```{r users_from_countries_over_time}
wdqs_queries %>%
  left_join(users, by = "user_id") %>%
  keep_where(countries != 'Unknown') %>%
  mutate(Country = ifelse(countries %in% top_n(countries, 6),
                          countries, 'Other'),
         Day = as.POSIXct(round(timestamp, "days"))) %>%
  keep_where(Country != 'Other') %>%
  group_by(Day, Country) %>%
  summarize(Users = dplyr::n_distinct(user_id)) %>%
  ggplot(data = ., aes(x = Day, y = Users)) +
  facet_grid(Country ~ .) +
  scale_x_datetime(labels = date_format("%a %m/%d")) +
  geom_vline(xintercept = as.numeric(lubridate::ymd("2015-09-07", tz = "UTC")),
             linetype = "dashed", color = "purple") +
  # geom_smooth(method = "loess", se = FALSE) +
  geom_line() +
  theme_bw() +
  ggtitle("Users in top 5 countries (by total queries)")
```

Varying patterns of WDQS unique users by country (top 5 countries, over time). Purple dashes mark the public announcement. Interestingly, South Korea is a top 5 country by number of queries but has barely any users.

### Who are our most active users?

#### Top 20 users by total queries

```{r most_active_overall, as.is = TRUE}
most_active_overall <- wdqs_queries %>%
  group_by(user_id) %>%
  summarize(`total queries` = n()) %>%
  dplyr::top_n(20, `total queries`) %>%
  dplyr::arrange(desc(`total queries`)) %>%
  left_join(users, by = "user_id") %>%
  select(-c(browser, device)) %>%
  dplyr::rename(country = countries)
knitr::kable(select(most_active_overall, -user_id))
```

#### Top 20 users by daily service usage

```{r most_active_daily, as.is = TRUE}
most_active_daily <- wdqs_queries %>%
  group_by(user_id, date) %>%
  summarize(`queries per day` = n()) %>%
  group_by(user_id) %>%
  summarize(`median queries per day` = median(`queries per day`)) %>%
  dplyr::top_n(20, `median queries per day`) %>%
  dplyr::arrange(desc(`median queries per day`)) %>%
  left_join(users, by = "user_id") %>%
  select(-c(browser, device)) %>%
  dplyr::rename(country = countries)
knitr::kable(select(most_active_daily, -user_id))
```

#### Users who made it to both lists

```{r most_active_both, as.is = TRUE}
dplyr::inner_join(select(most_active_overall, c(user_id, `total queries`)),
                  select(most_active_daily, c(user_id, `median queries per day`)),
                  by = "user_id") %>%
  left_join(users, by = "user_id") %>%
  dplyr::arrange(desc(`total queries`)) %>%
  select(-c(user_id, browser, device)) %>%
  dplyr::rename(country = countries) %>%
  knitr::kable()
```

```{r most_active_cleanup}
rm(most_active_overall, most_active_daily)
```

### Referrers

```{r top_referers_overall, as.is = TRUE}
top_referers_overall <- wdqs_refs %>%
  # Do some url cleaning up before summarizing:
  mutate(url = sub("query.wikidata.org/#.*", "query.wikidata.org/#…QUERY…", url)) %>%
  mutate(url = ifelse(is.na(url), "--", url)) %>%
  mutate(url = smart_trim(url, 50)) %>%
  # Summarize:
  group_by(url) %>%
  summarize(total = sum(n)) %>%
  mutate(`total total` = sum(total),
         `% of total` = 100*total/`total total`) %>%
  select(-`total total`) %>%
  dplyr::top_n(20, total) %>%
  dplyr::arrange(desc(total)) %>%
  ungroup
knitr::kable(top_referers_overall, digits = 2)
```

```{r top_referers_overall_part_deux, as.is = TRUE}
top_referers_overall_2 <- wdqs_refs %>%
  # Do some url cleaning up before summarizing:
  mutate(url = sub("query.wikidata.org/#.*", "query.wikidata.org/#…QUERY…", url)) %>%
  mutate(url = ifelse(is.na(url), "--", url)) %>%
  mutate(url = smart_trim(url, 50)) %>%
  mutate(domain = urltools::domain(url)) %>%
  # Summarize:
  group_by(domain) %>%
  summarize(total = sum(n)) %>%
  mutate(`total total` = sum(total),
         `% of total` = 100*total/`total total`) %>%
  select(-`total total`) %>%
  dplyr::top_n(20, total) %>%
  dplyr::arrange(desc(total)) %>%
  ungroup
knitr::kable(top_referers_overall_2, digits = 2)
```

#### Daily visitors...

**...from ourselves:**

```{r top_refers_daily_self}
wdqs_refs %>%
  mutate(url = sub("query.wikidata.org/#.*", "query.wikidata.org/#…QUERY…", url)) %>%
  keep_where(url %in% c('https://query.wikidata.org', 'https://query.wikidata.org/', 'https://query.wikidata.org/#…QUERY…')) %>%
  mutate(url = smart_trim(url, 50)) %>%
  # Summarize:
  group_by(date, url) %>%
  summarize(total = sum(n)) %>%
  ungroup %>%
  ggplot(data = ., aes(x = date, y = total, color = url)) +
  geom_line() +
  ggtitle("Referred from query.wikidata.org") +
  ylab("Total visitors") +
  wmf::theme_fivethirtynine()
```

**...from others:**

```{r top_referers_daily_dynamic, eval = TRUE}
daily_refs <- wdqs_refs %>%
  # Perform the same cleanup procedure as above:
  keep_where(!is.na(url) & !grepl("query.wikidata.org", url)) %>%
  mutate(url = smart_trim(url, 50)) %>%
  mutate(domain = urltools::domain(url)) %>%
  # Summarize:
  group_by(date, domain) %>%
  summarize(total = sum(n)) %>%
  ungroup %>%
  # Throw away the referrers that aren't among the overall top 6 domains:
  keep_where(domain %in% head(top_referers_overall_2$domain, 6)) %>%
  tidyr::spread(domain, total, fill = 0) %>%
  # Make it xts-compatible:
  mutate(date = as.POSIXct(date)) %>%
  { xts(.[, -1], order.by = .$date) }
# Plot it!
dygraph(daily_refs, width = "95%",
        main = "Daily visitors from top 4 referrers",
        xlab = "Date", ylab = "Visitors") %>%
  dyOptions(colors = scales::hue_pal()(ncol(daily_refs)),
            labelsKMB = TRUE, pointSize = 3, strokeWidth = 2) %>%
  dyLegend(show = "always", showZeroValues = FALSE,
           labelsDiv = "refer_labels")
```

_**Note** that this is an interactive graph like the ones we use in Discovery Dashboards. Mouse-over to see the values of the time series in the legend. You can also zoom in on a particular range. (Zoom out by double-clicking.)_

<strong>Legend:</strong>

<div id="refer_labels"></div>

```{r top_referers_daily_static, eval = FALSE}
wdqs_refs %>%
  # Perform the same cleanup procedure as above:
  keep_where(!is.na(url) & !grepl("query.wikidata.org", url)) %>%
  mutate(url = smart_trim(url, 50)) %>%
  mutate(domain = urltools::domain(url)) %>%
  # Summarize:
  group_by(date, domain) %>%
  summarize(total = sum(n)) %>%
  ungroup %>%
  # Throw away the referrers that aren't among the overall top 6 domains:
  keep_where(domain %in% head(top_referers_overall_2$domain, 6)) %>%
  ggplot(data = .) +
  geom_line(aes(x = date, y = total, color = domain)) +
  wmf::theme_fivethirtynine()
```

#### Top referrers (of query requests):

```{r top_referers_queries, as.is = TRUE}
wdqs_queries %>%
  mutate(referer = smart_trim(referer, 40)) %>%
  group_by(referer, user_id) %>%
  summarize(`queries by user` = n()) %>%
  group_by(referer) %>%
  summarize(users = n()) %>%
  dplyr::top_n(7, users) %>%
  dplyr::arrange(desc(users)) %>%
  knitr::kable()
```

The referrers were shortened for privacy and space reasons, as they contained queries.

## Queries

### Query lengths

```{r nchar_hist}
wdqs_queries %>%
  keep_where(sample == "definitely no") %>%
  select(c(length, length_condensed)) %>%
  dplyr::rename(`Query` = length,
                `Condensed query` = length_condensed) %>%
  tidyr::gather(`Length of`, value, 1:2) %>%
  ggplot(data = .,
         aes(x = value, fill = `Length of`)) +
  geom_density(alpha = 0.5, adjust = 3) +
  xlab("Number of characters in query") +
  ggtitle("Distribution of (definitely not example) query lengths") +
  wmf::theme_fivethirtynine()
```

We can see multiple modes in the distribution of query lengths, which suggests that the distribution is a mixture of several distributions. The next step is to use a clustering algorithm to separate the distributions out into distinct groups. For this task, we chose a model-based clustering algorithm.

We performed model-based clustering on the log10-transformed character counts of condensed queries that were "definitely not" sample queries we provided. (Model-based clustering relies on Gaussian mixture models, so the log10 transformation was employed to correct for the right-skewness and bring the data closer to Normal.)

```{r nchar_clust, cache = TRUE}
# Use model-based clustering on the log10 transformed character counts:
set.seed(0)
wdqs_queries_nonexample <- wdqs_queries %>%
  keep_where(sample == "definitely no") %>%
  mutate(log10length = log10(length),
         log10length_condensed = log10(length_condensed))
clust <- Mclust(wdqs_queries_nonexample$log10length_condensed, modelNames = "V")
wdqs_queries_nonexample$cluster <- factor(LETTERS[clust$classification])

# Merge into overall dataset:
wdqs_queries <- wdqs_queries_nonexample %>%
  select(query_id, cluster) %>%
  left_join(wdqs_queries, ., by = "query_id")

# For calculating f(x):
x <- seq(min(wdqs_queries_nonexample$log10length_condensed),
         max(wdqs_queries_nonexample$log10length_condensed),
         length.out = 1e3)

# Calculate the f(x) using the mean and sd of that cluster:
densities <- sapply(1:length(clust$parameters$pro),
                    function(i) {
                      dnorm(x,
                            mean = clust$parameters$mean[i],
                            sd = sqrt(clust$parameters$variance$sigmasq[i]))
                    }) %>%
  as.data.frame() %>%
  { colnames(.) <- LETTERS[1:ncol(.)]; . } %>%
  cbind(n = x, .) %>%
  # Convert wide to long:
  tidyr::gather("cluster", "density", -1) %>%
  # Normalize:
  group_by(cluster) %>%
  mutate(density = density/max(density)) %>%
  ungroup

# Adjust densities for visualization:
for (i in 1:length(clust$parameters$pro) ) {
  densities$density[densities$cluster == LETTERS[i]] %<>%
    { . * clust$parameters$pro[i] }
}; rm(i)

# This will be used to add cluster means (on the back-transformed scale) to the plot:
annotations <- data.frame(x = unname(round(10^clust$parameters$mean, 1)),
                          y = rep(0.15, length(clust$parameters$pro)),
                          cluster = factor(levels(wdqs_queries_nonexample$cluster)))

# Plot the log-normal densities:
ggplot(data = densities) +
  geom_line(aes(x = 10^n, y = density, color = cluster),
            size = 1.1) +
  xlab("Number of characters in query") +
  ggtitle("Two distinct categories of query lengths") +
  scale_color_discrete(name = "Group") +
  geom_vline(data = annotations, linetype = "dashed",
             aes(xintercept = x, color = cluster)) +
  geom_text(data = annotations, aes(label = x, x = x, y = y)) +
  wmf::theme_fivethirtynine() +
  # scale_x_log10() +
  theme(axis.title.y = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks.y = element_blank(),
        axis.line.y = element_blank())

# Clean up:
rm(x, densities, annotations, wdqs_queries_nonexample)
```

The best model was a 2-component univariate mixture with unequal variances. The centers of the 2 clusters (on the raw scale) are 97 and 576 characters.

### Example queries

#### Shortest queries

```{r example_query_shortest}
wdqs_queries %>%
  keep_where(sample == "definitely no") %>%
  dplyr::arrange(length) %>%
  select(query_condensed) %>%
  unique() %>%
  head(25) %>%
  unlist %>%
  unname %>%
  matrix(nrow = 5, ncol = 5) %>%
  as.data.frame() %>%
  knitr::kable(col.names = rep("&nbsp;", 5))
```

#### Some of the longer queries

```{r longest_query, comment = NA}
wdqs_queries %>%
  keep_where(pseudo_odds < 0.15) %>%
  select(query_condensed, length_condensed) %>%
  unique() %>%
  dplyr::top_n(3, length_condensed) %>%
  select(query_condensed) %>%
  unlist %>%
  paste(collapse = "\n\n======================\n\n") %>%
  cat
```

#### Some of the longer queries (part 2)

```{r longest_query_part_deux, comment = NA}
wdqs_queries %>%
  keep_where(sample == "definitely no") %>%
  keep_where(!grepl("Query to return latitudes and longitudes", query)) %>%
  keep_where(!grepl("inttech.flab.fujitsu.co.jp", query)) %>%
  keep_where(!grepl("DESCRIBE <https", query)) %>%
  keep_where(!grepl("# ", query, fixed = TRUE)) %>%
  select(query, length_condensed) %>%
  unique() %>%
  dplyr::top_n(3, length_condensed) %>%
  select(query) %>%
  unlist %>%
  paste(collapse = "\n======================\n\n") %>%
  cat
```

#### Examples of Category "A" queries:

```{r group_a_queries, comment = NA}
set.seed(0)
wdqs_queries %>%
  keep_where(cluster == "A") %>%
  keep_where(!grepl("Query to return latitudes and longitudes", query)) %>%
  keep_where(!grepl("inttech.flab.fujitsu.co.jp", query)) %>%
  keep_where(!grepl("DESCRIBE <http", query)) %>%
  keep_where(!grepl("# ", query, fixed = TRUE)) %>%
  keep_where(length_condensed > 97) %>%
  select(query, length_condensed) %>%
  unique() %>%
  dplyr::sample_n(3) %>%
  select(query) %>%
  unlist %>%
  paste(collapse = "\n======================\n\n") %>%
  cat
```

#### Examples of Category "B" queries:

```{r group_b_queries, comment = NA}
set.seed(0)
wdqs_queries %>%
  keep_where(cluster == "B") %>%
  keep_where(!grepl("Query to return latitudes and longitudes", query)) %>%
  keep_where(!grepl("inttech.flab.fujitsu.co.jp", query)) %>%
  keep_where(!grepl("DESCRIBE <http", query)) %>%
  keep_where(!grepl("# ", query, fixed = TRUE)) %>%
  keep_where(!grepl("FILTER (lang", query, fixed = TRUE)) %>%
  keep_where(length_condensed > 576) %>%
  select(query, length_condensed) %>%
  unique() %>%
  dplyr::sample_n(2) %>%
  select(query) %>%
  unlist %>%
  paste(collapse = "\n======================\n\n") %>%
  cat
```

## Discussion

### Partial Queries

One of the bigger challenges encountered in this analysis was the fact that queries were cropped. When the user executes a query, their query is passed via GET, and is saved in Varnish as an encoded `uri_path`. Varnish, however, has a character limit, so the encoded queries get cropped. Therefore, when we decode the queries, the end result is also cropped. So a lot of the queries in this dataset were partial queries.
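
To make the mechanism concrete, here is a small sketch of how cropping an encoded query yields a partial query after decoding. The query, the value of `char_limit`, and the step that drops a trailing incomplete escape are all assumptions made for illustration; the actual Varnish limit and decoding code are not reproduced here.

```r
# A made-up query and a made-up character limit (not the real Varnish limit):
query <- "SELECT ?item WHERE { ?item wdt:P31 wd:Q146 }"
char_limit <- 41

# Percent-encode the query, as happens when it is passed via GET:
encoded <- utils::URLencode(query, reserved = TRUE)

# Crop the encoded string, then drop a trailing incomplete escape such as "%" or "%2":
cropped <- sub("%[0-9A-Fa-f]?$", "", substr(encoded, 1, char_limit))

# Decoding the cropped string recovers only the beginning of the original query:
decoded <- utils::URLdecode(cropped)
```

Note that every encoded space or punctuation mark takes up three characters, so the limit is reached sooner than the raw query length would suggest.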

### Example Matching

Another issue (and this may actually be the biggest issue) is that many of the queries are sample queries found on various WDQS-related MediaWiki/Wikitech articles. We (read: I) had to compile as many of the example queries as we could and then perform approximate string matching to separate the user-written queries from the ones that are probably examples.

The process we employed was ad-hoc, not very robust, and highly dubious, but not entirely unreasonable. We recommend collaborating with our language expert (read: Trey) to develop a more robust methodology for detecting when the query submitted matches an example query we have on file.

Furthermore, for the sake of time, we did not include example queries from other languages in our initial compilation of examples. Some of the queries that were deemed "definitely not an example" actually WERE most definitely examples written in French.

## Acknowledgements

We would like to thank Trey Jones for his advice in dealing with approximate string matching, and Oliver Keyes for his review of this report and helpful feedback.

## References

- [*Wikidata query service* on MediaWiki](https://www.mediawiki.org/wiki/Wikidata_query_service) and [WDQS User Manual](https://www.mediawiki.org/wiki/Wikidata_query_service/User_Manual)
- [*SPARQL* on Wikipedia](https://en.wikipedia.org/wiki/SPARQL)
- [**mclust**](http://www.stat.washington.edu/mclust/): Normal Mixture Modeling for Model-Based Clustering, Classification, and Density Estimation by Fraley, C. and [Raftery, A.](https://en.wikipedia.org/wiki/Adrian_Raftery)

[^countries]: The countries are: Algeria, Angola, Argentina, Armenia, Australia, Austria, Azerbaijan, Belarus, Belgium, Brazil, Bulgaria, Cambodia, Canada, Chile, China, Colombia, Croatia, Czech Republic, Denmark, Ecuador, Egypt, Estonia, Finland, France, Germany, Ghana, Greece, Guadeloupe, Hungary, India, Indonesia, Iran, Ireland, Israel, Italy, Japan, Latvia, Luxembourg, Malaysia, Mali, Malta, Martinique, Mexico, Montenegro, Nepal, Netherlands, New Zealand, Norway, Poland, Portugal, Qatar, Republic of Korea, Romania, Russia, Saudi Arabia, Serbia, Singapore, Slovak Republic, Slovenia, South Africa, Spain, Sri Lanka, Sweden, Switzerland, Taiwan, Thailand, Turkey, Ukraine, United Kingdom, United States, Uruguay, Venezuela, and Vietnam.