# Desktop Search Term Analysis

[Task](https://phabricator.wikimedia.org/T260243)

## Purpose: 
Understand what search terms users enter, to help guide how page titles are formatted in each search result.

## Data Source/Approach:

- Reviewed top search terms for ten large Wikipedias and ten small Wikipedias. For the small Wikipedias, I selected wikis with at least 100,000 articles to have enough data for the analysis and to avoid privacy or sensitive-data concerns that may arise from reviewing small wikis with only a small number of searches.
- Reviewed frequency of multi vs single word searches overall.
- Data includes searches conducted in August 2020 and comes from [SearchSatisfaction table](https://meta.wikimedia.org/wiki/Schema:SearchSatisfaction). 
- There are two types of search events recorded in SearchSatisfaction: (1) Autocomplete search: a user types in the search widget and reviews or selects an option from the drop-down menu. (2) Full-text search: a user enters text in the search widget and is then directed to the search results page if the article does not exist. This analysis focuses primarily on autocomplete searches, since its purpose is to inform how page titles are formatted in the results shown in the drop-down menu.
 - For autocomplete events, the query field can include multiple text entries as the user types. For example, if someone is searching for Paris, there might be records for "Pa", "Par", and finally "Paris". To isolate complete searches, I took the longest query entered in each search session and required a minimum query length of 3 characters.
 - The search terms below reflect what the person entered into the search box, not what was selected from the drop-down menu. For example, if the person typed "Par" and then clicked "Paris" in the drop-down results, the search term would be recorded as "Par".
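
The session-level filtering described above can be sketched in R on a toy data frame. The session IDs and queries here are made up for illustration, and the data frame is hypothetical; only the two rules (longest query per session, minimum length of 3 characters) come from the approach described in this section.

```r
# Toy autocomplete events: one row per recorded query (made-up data).
events <- data.frame(
  search_session = c("s1", "s1", "s1", "s2", "s2", "s3"),
  search_query   = c("Pa", "Par", "Paris", "be", "bel", "th"),
  stringsAsFactors = FALSE
)

# Keep the longest query typed in each session.
# Ties on length are all kept, which is also how RANK() behaves in SQL.
events$query_length <- nchar(events$search_query)
max_length <- ave(events$query_length, events$search_session, FUN = max)
longest <- events[events$query_length == max_length, ]

# Drop queries shorter than 3 characters.
complete_searches <- longest[longest$query_length >= 3, ]
print(complete_searches$search_query)
```

In this toy example, session `s3` drops out entirely because its longest query has only two characters. Note that if a session contains several distinct queries of the same maximum length, each of them is counted.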


In [None]:
shhh <- function(expr) suppressPackageStartupMessages(suppressWarnings(suppressMessages(expr)))
shhh({
    library(tidyverse); library(wmfdata)
})

# Common search terms on large-size Wikipedias

I reviewed the top search terms for a set of 10 large Wikipedias (English, Spanish, German, French, Japanese, Russian, Italian, Chinese, Portuguese, and Polish) in August 2020.

In [103]:
#Review top autocomplete search terms on large size wikis
query<-
"
-- find complete search term entered into search widget for all sessions
WITH ranked_searches AS (
  SELECT
    event.searchSessionid AS search_session,
    event.query AS search_query,
--find longest length search query in each session to remove partial searches
    RANK() OVER (PARTITION BY event.searchSessionid
                 ORDER BY LENGTH(event.query) DESC) AS ranking
  FROM event.SearchSatisfaction
  WHERE year = 2020 and month = 08
    AND event.source = 'autocomplete'
    AND event.action = 'searchResultPage'
    AND wiki IN ('enwiki', 'eswiki', 'dewiki', 'frwiki', 'jawiki', 'ruwiki', 'itwiki', 'zhwiki', 'ptwiki', 'plwiki')
    AND useragent.is_bot = false 
)

SELECT 
    search_query,
    Count(*) as n_searches
FROM ranked_searches
WHERE 
--longest character search term entered in session
    ranking = 1 
--looking for search terms at least 3 characters in length
    AND LENGTH(search_query) > 2
GROUP BY 
    search_query
ORDER BY n_searches DESC
LIMIT 100
"

In [104]:
autocomplete_queries_largewiki <-  wmfdata::query_hive(query)

Don't forget to authenticate with Kerberos using kinit



In [111]:
head(autocomplete_queries_largewiki, 15)

| | search_query `<chr>` | n_searches `<int>` |
|---|---|---|
| 1 | nasdaq | 156726 |
| 2 | 2020 | 31542 |
| 3 | part of an url | 22403 |
| 4 | tenet | 22117 |
| 5 | kamala | 17500 |
| 6 | covid | 14902 |
| 7 | lucifer | 14334 |
| 8 | belarus | 14111 |
| 9 | joe biden | 13479 |
| 10 | the | 13173 |


In [41]:
#Review top full-text search terms on large size wikis

query <- 
"SELECT
    event.query AS search_query,
    COUNT(*) AS n_searches
FROM
    event.SearchSatisfaction
WHERE
    event.action = 'searchResultPage' 
    AND event.source = 'fulltext' 
--large wikis by size (note: unlike the autocomplete query, this list omits enwiki and includes arwiki)
    AND wiki IN ('eswiki', 'dewiki', 'frwiki', 'jawiki', 'ruwiki', 'itwiki', 'zhwiki', 'ptwiki', 'plwiki', 'arwiki') 
    AND year = 2020 and month = 08 
    AND useragent.is_bot = false
GROUP BY
    event.query
ORDER BY n_searches DESC
LIMIT 100"

In [42]:
fulltext_queries_largewiki <-  wmfdata::query_hive(query)

Don't forget to authenticate with Kerberos using kinit



In [74]:
head(fulltext_queries_largewiki, 10)

| | search_query `<chr>` | n_searches `<int>` |
|---|---|---|
| 1 | part of an url | 73912 |
| 2 | us party affiliation | 6406 |
| 3 | va privatization | 5110 |
| 4 | MGM Television | 3859 |
| 5 | armed teachers | 3410 |
| 6 | Warner Bros. Television Distribution | 2943 |
| 7 | "lots of" | 2791 |
| 8 | "日本語吹替" | 2740 |
| 9 | Warner Bros. Family Entertainment | 2378 |
| 10 | Jemeter | 2083 |


# Common search terms on small-size Wikipedias

I reviewed the top search terms for a set of 10 small Wikipedias (Persian, Catalan, Serbian, Indonesian, Norwegian, Korean, Finnish, Hungarian, Czech, and Serbo-Croatian) in August 2020. Note: I selected wikis with at least 100,000 articles to have enough data for the analysis and to avoid privacy or sensitive-data concerns that may arise from reviewing small wikis with only a small number of searches.

In [86]:
#Review top autocomplete search terms on small size wikis
query<-
"
-- find complete search term entered into search widget for all sessions
WITH ranked_searches AS (
  SELECT
    event.searchSessionid AS search_session,
    event.query AS search_query,
--find longest length search query in each session to remove partial searches
    RANK() OVER (PARTITION BY event.searchSessionid
                 ORDER BY LENGTH(event.query) DESC) AS ranking
  FROM event.SearchSatisfaction
  WHERE year = 2020 and month = 08
    AND event.source = 'autocomplete'
    AND event.action = 'searchResultPage'
    AND wiki IN ('fawiki', 'cawiki', 'srwiki', 'idwiki', 'nowiki', 'kowiki', 'fiwiki', 'huwiki', 'cswiki', 'shwiki') 
    AND useragent.is_bot = false 
)
SELECT 
    search_query,
    Count(*) as n_searches
FROM ranked_searches
WHERE 
--longest character search term entered in session
    ranking = 1 
--looking for search terms at least 3 characters in length
    AND LENGTH(search_query) > 2
GROUP BY 
    search_query
ORDER BY n_searches DESC
LIMIT 100
"

In [87]:
autocomplete_queries_smallwiki <-  wmfdata::query_hive(query)

Don't forget to authenticate with Kerberos using kinit



In [109]:
head(autocomplete_queries_smallwiki, 15)

| | search_query `<chr>` | n_searches `<int>` |
|---|---|---|
| 1 | libanon | 1067 |
| 2 | bělorusko | 983 |
| 3 | 2020 | 720 |
| 4 | ledek | 668 |
| 5 | suomi | 535 |
| 6 | سوپر جام | 441 |
| 7 | covid | 344 |
| 8 | usa | 341 |
| 9 | ایران | 322 |
| 10 | praha | 318 |


Based on the above, most of the top search terms on both the small and large wikis reviewed in August 2020 are single words. "Kamala" was searched more often than "Kamala Harris", indicating that a larger number of people selected a result from the drop-down menu or pressed the search button before typing her entire name into the search box.

A few terms, such as "part of an url", are likely generated by unidentified bots. Other common search terms include names of people and events that were in the news at the time.

There is also a large number of searches for "the". In those sessions, "the" was the longest recorded query, so the user either started typing a search beginning with "the" and selected one of the drop-down results or, more likely, abandoned the search.

# Check number of single word vs multi word search terms 

In [124]:
query <-
"
-- find complete search term entered into search widget for all sessions
WITH ranked_searches AS (
  SELECT
    event.searchSessionid AS search_session,
    event.query AS search_query,
--find longest length search query in each session to remove partial searches
    RANK() OVER (PARTITION BY event.searchSessionid
                 ORDER BY LENGTH(event.query) DESC) AS ranking
  FROM event.SearchSatisfaction
  WHERE year = 2020 and month = 08
    AND event.source = 'autocomplete'
    AND event.action = 'searchResultPage'
    AND wiki IN ('fawiki', 'cawiki', 'srwiki', 'idwiki', 'nowiki', 'kowiki', 'fiwiki', 'huwiki', 'cswiki', 'shwiki') 
    AND useragent.is_bot = false 
)
SELECT 
-- find number of words
   (LENGTH(search_query) - LENGTH(REGEXP_REPLACE(search_query,' ',''))+1) AS num_words,
    Count(*) AS n_searches
FROM ranked_searches
WHERE 
--longest character search term entered in session
    ranking = 1 
--looking for search terms at least 3 characters in length
    AND LENGTH(search_query) > 2 
GROUP BY 
    (LENGTH(search_query) - LENGTH(REGEXP_REPLACE(search_query,' ',''))+1) 
ORDER BY n_searches DESC
LIMIT 100"
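
The `num_words` expression in the query above counts the spaces in a search term and adds one. A rough R equivalent is sketched below; `count_words` is a hypothetical helper written for illustration and is not part of the analysis, and the example queries are chosen only to show the behavior.

```r
# Count words as (number of spaces + 1), mirroring the Hive expression
# LENGTH(q) - LENGTH(REGEXP_REPLACE(q, ' ', '')) + 1.
count_words <- function(query) {
  nchar(query) - nchar(gsub(" ", "", query, fixed = TRUE)) + 1
}

# The last example has two consecutive spaces between the words.
queries <- c("nasdaq", "joe biden", "part of an url", "the  office")
word_counts <- count_words(queries)
print(word_counts)
```

Because the count is driven purely by spaces, leading, trailing, or repeated spaces inflate it: the last example above is counted as three words rather than two. Trimming and collapsing whitespace first would be one way to avoid that, at the cost of departing from the query as written.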

In [125]:
autocomplete_queries_wordcount <-  wmfdata::query_hive(query)

Don't forget to authenticate with Kerberos using kinit



In [126]:
head(autocomplete_queries_wordcount, 20)

| | num_words `<int>` | n_searches `<int>` |
|---|---|---|
| 1 | 1 | 914678 |
| 2 | 2 | 733311 |
| 3 | 3 | 174888 |
| 4 | 4 | 55795 |
| 5 | 5 | 21810 |
| 6 | 6 | 10633 |
| 7 | 7 | 5564 |
| 8 | 8 | 3058 |
| 9 | 9 | 1963 |
| 10 | 10 | 1127 |


In [134]:
autocomplete_queries_wordcount_prop <- autocomplete_queries_wordcount %>%
    mutate(prop_searches = round(n_searches/sum(n_searches) *100, 2))

head(autocomplete_queries_wordcount_prop, 10)

| | num_words `<int>` | n_searches `<int>` | prop_searches `<dbl>` |
|---|---|---|---|
| 1 | 1 | 914678 | 47.46 |
| 2 | 2 | 733311 | 38.05 |
| 3 | 3 | 174888 | 9.07 |
| 4 | 4 | 55795 | 2.9 |
| 5 | 5 | 21810 | 1.13 |
| 6 | 6 | 10633 | 0.55 |
| 7 | 7 | 5564 | 0.29 |
| 8 | 8 | 3058 | 0.16 |
| 9 | 9 | 1963 | 0.1 |
| 10 | 10 | 1127 | 0.06 |


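One-word and two-word searches together account for 85.5% of autocomplete searches on the ten smaller Wikipedias included in the word-count query. One-word searches are about 24.7% more frequent than two-word searches (914,678 vs. 733,311), a gap of roughly 9.4 percentage points.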
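
As a quick check of the summary figures, the arithmetic can be reproduced from the counts and proportions in the table above. This is only a sketch: the numbers are copied by hand from the first two rows of the table, and the variable names are illustrative.

```r
# Values copied from the first two rows of the word-count table.
n_one_word    <- 914678
n_two_word    <- 733311
prop_one_word <- 47.46
prop_two_word <- 38.05

# Combined share of one-word and two-word searches (percent).
combined_share <- prop_one_word + prop_two_word

# How much more frequent one-word searches are, relative to two-word searches (percent).
relative_difference <- (n_one_word / n_two_word - 1) * 100

# Difference between the two shares (percentage points).
point_difference <- prop_one_word - prop_two_word

print(round(c(combined_share, relative_difference, point_difference), 1))
```

The relative difference (about 24.7%) and the percentage-point difference (about 9.4) describe the same gap in different ways, so it helps to state which one is meant when quoting the result.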