# Boolean Searching

Boolean search is one of the most powerful ways to find "the needle in the haystack": it lets you narrow down a search and remove irrelevant results to (ideally) reveal the set of records you want. But boolean search escalates in complexity as you start combining operators, and the subtle interactions between them can be challenging for even the most logical of minds to follow.

The boolean operators available are:

  * Phrase Searching - e.g. "William Morris"
  * Negated Searching  - e.g. -silver
  * AND/OR searching - gelato rome (AND), gelato|rome (OR)
  * Fuzzy searching (not actually a boolean operator) - graffito~2

## Phrase searching

Perhaps the simplest boolean operator to understand, phrase searching lets you specify an exact phrase (a word or a sequence of words) to be matched exactly as it is written. Normally, if you search for multiple words, any occurrence of the words anywhere in the record is considered a match. For example, if you search for objects that contain the words *William Morris*, you probably do not want to see objects created by the (hypothetical) partnership of "William Smith" and "Jane Morris", but these would be returned because they contain both of the words you searched for. If you instead put double quotes around the search words (e.g. *"William Morris"*), you will see a smaller number of results, each containing that exact phrase.

(N.B. of course, if you really only want to find objects connected to the Victorian artist and designer William Morris, you would be better off using a filter, either on the person or, if you are only interested in objects he was involved in creating, on the maker; see {ref}`Filters`.)


In [None]:
import requests
req = requests.get('https://api.vam.ac.uk/v2/objects/search?q=William Morris')
object_data = req.json()
object_info = object_data["info"]
object_records = object_data["records"]
record_count = object_info["record_count"]
print(f"There are {record_count} objects that have the words William and Morris somewhere in the record")

In [1]:
import requests
req = requests.get('https://api.vam.ac.uk/v2/objects/search?q="William Morris"')
object_data = req.json()
object_info = object_data["info"]
object_records = object_data["records"]
record_count = object_info["record_count"]
print(f'There are {record_count} objects that have the phrase "William Morris" somewhere in the record')

There are 2265 objects that have the phrase "William Morris" somewhere in the record
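
The requests above put the spaces and double quotes straight into the URL and rely on `requests` to percent-encode them. As a minimal sketch of making that step explicit, the helpers below wrap a phrase in double quotes and build the encoded URL using only the standard library. The names `phrase` and `search_url` are hypothetical and not part of the API; only the endpoint and the `q` parameter come from the examples above.

```python
from urllib.parse import quote, urlencode

# Endpoint taken from the examples above.
SEARCH_URL = "https://api.vam.ac.uk/v2/objects/search"


def phrase(words: str) -> str:
    """Wrap words in double quotes so they are searched as an exact phrase."""
    return f'"{words.strip()}"'


def search_url(q: str) -> str:
    """Return the search URL with the query string percent-encoded."""
    return f"{SEARCH_URL}?{urlencode({'q': q}, quote_via=quote)}"


print(search_url("William Morris"))          # two separate words
print(search_url(phrase("William Morris")))  # one exact phrase
```

Passing `params={"q": phrase("William Morris")}` to `requests.get` should have the same effect, since `requests` encodes query parameters for you.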


## Negated Searching

Also fairly simple to understand, this simply lets you exclude results you do not want to see, even though they would match otherwise. This is most commonly used in tandem with another operator or with a query, as otherwise you are likely to return a very large number of results (e.g. returning all records that do not mention pineapples would be over a million records).

You can apply negation to single words in a text search (e.g. find records that contain the word *blue* but not the word *azure*), to phrases (as described above) and to {ref}`Filters` (e.g. find records that are not associated with the maker William Morris).

All you need to do to apply negation is prefix the word, phrase or filter term with a hyphen '-' (there must not be a space between the hyphen and the word/phrase/term; see the following example).

In [2]:
import requests
req = requests.get('https://api.vam.ac.uk/v2/objects/search?q=blue')
object_data = req.json()
object_info = object_data["info"]
object_records = object_data["records"]
record_count = object_info["record_count"]
print(f"There are {record_count} objects that have the word blue somewhere in the record")

There are 62119 objects that have the word blue somewhere in the record


In [3]:
import requests
req = requests.get('https://api.vam.ac.uk/v2/objects/search?q=blue azure')
object_data = req.json()
object_info = object_data["info"]
object_records = object_data["records"]
record_count = object_info["record_count"]
print(f"There are {record_count} objects that have the words blue and azure somewhere in the record")

There are 120 objects that have the words blue and azure somewhere in the record


In [4]:
import requests
req = requests.get('https://api.vam.ac.uk/v2/objects/search?q=blue -azure')
object_data = req.json()
object_info = object_data["info"]
object_records = object_data["records"]
record_count = object_info["record_count"]
print(f"There are {record_count} objects that have the word 'blue' and not the word 'azure' somewhere in the record")

There are 61999 objects that have the word 'blue' and not the word 'azure' somewhere in the record
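
The three counts above fit together: every record that mentions *blue* either also mentions *azure* or does not, so the last two counts should add up to the first. A quick check using the figures printed above (these are a snapshot, so live numbers will drift as the collection data changes):

```python
# Counts copied from the outputs above (a snapshot; live numbers will differ).
blue = 62119            # q=blue
blue_and_azure = 120    # q=blue azure
blue_not_azure = 61999  # q=blue -azure

# The AND query and the negated query split the 'blue' records
# into two groups that do not overlap.
assert blue_and_azure + blue_not_azure == blue
print(f"{blue_and_azure} + {blue_not_azure} = {blue_and_azure + blue_not_azure}")
```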


## AND/OR Searching

One of the key issues in searching within a large dataset is whether you want to have a broad set of results, some of which might be correct (and some not), or whether you want a smaller set of results which are mostly correct (but might be missing some other correct results). 

Using AND or OR in your search parameters lets you decide which of these approaches to take. When you search for multiple words, it indicates whether you want to find records that contain _all_ the words, or whether you are happy to see records that contain only some of them. Up until now, without your knowledge, we have been defaulting to AND, so only records that match on all the terms are returned.

XXX need to bring in scoring somehow here

To continue using AND you do not need to change anything. To switch to using OR you need to use the '|' operator between terms. However, you might need to consider operator precedence at this point, as XXX. To force the correct interpretation you need to also use brackets '(' and ')' around the alternative terms

For example, you might want to find mentions of 

In [None]:
## AND Searching



### OR Searching 

One use of OR searching is to search for variations on a topic, finding mentions of either or both in one search query. For example, if you want to find all records that mention 'photography' and one or both of 'travel' and 'landscape', you could run two queries separately:

In [25]:
import requests
req = requests.get('https://api.vam.ac.uk/v2/objects/search?q=travel photography')
object_data = req.json()
object_info = object_data["info"]
object_records = object_data["records"]
record_count = object_info["record_count"]
print(f"There are {record_count} objects that have the words 'travel' and 'photography' somewhere in the record")

There are 12371 objects that have the words travel and photography somewhere in the record


In [26]:
import requests
req = requests.get('https://api.vam.ac.uk/v2/objects/search?q=landscape photography')
object_data = req.json()
object_info = object_data["info"]
object_records = object_data["records"]
record_count = object_info["record_count"]
print(f"There are {record_count} objects that have the words 'landscape' and 'photography' somewhere in the record")

There are 6269 objects that have the words 'landscape' and 'photography' somewhere in the record


and then combine the results (removing duplicates when a record would have matched both queries as it contains all three words). Easier though would be to send one query:

In [22]:
import requests
req = requests.get('https://api.vam.ac.uk/v2/objects/search?q=(travel|landscape) photography')
object_data = req.json()
object_info = object_data["info"]
object_records = object_data["records"]
record_count = object_info["record_count"]
print(f"There are {record_count} objects that have the words 'travel' and/or 'landscape' and 'photography' somewhere in the record")

There are 13683 objects that have the words 'travel' and/or 'landscape' and 'photography' somewhere in the record


which then returns all the records that:

  * Mention 'travel' and 'photography'
  * Mention 'landscape' and 'photography'
  * Mention 'travel' and 'landscape' and 'photography'
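
To illustrate why the single OR query is easier, here is a sketch of the manual alternative: merging two result lists and dropping the duplicates. The records below are made-up placeholders, and using `systemNumber` as the unique key is an assumption about the record format, so check the records your own queries return. The helper name `merge_records` is hypothetical.

```python
def merge_records(*result_lists, key="systemNumber"):
    """Combine several lists of records, keeping the first copy of each key."""
    merged = {}
    for records in result_lists:
        for record in records:
            # setdefault leaves an already stored record untouched,
            # so a repeated key is only kept once.
            merged.setdefault(record[key], record)
    return list(merged.values())


# Made-up placeholder records standing in for two sets of API results.
travel_photography = [{"systemNumber": "A1"}, {"systemNumber": "A2"}]
landscape_photography = [{"systemNumber": "A2"}, {"systemNumber": "A3"}]

combined = merge_records(travel_photography, landscape_photography)
print([record["systemNumber"] for record in combined])  # ['A1', 'A2', 'A3']
```

The same overlap shows up in the counts above: 12371 + 6269 - 13683 = 4957, which would be the number of records that mention all three words, assuming the three counts were taken from the same snapshot of the data.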


## Fuzzy Searching

This is not strictly a boolean operator, but it's close enough to be considered so and can be combined with the other boolean operators. It lets you run a search query with a certain degree of variation in spelling allowed in the results it returns, allowing you to find records that might have slightly different spellings across times & cultures, singular and plural terms, typos and so on. For example, you might want to find any variations of how Shakespeare might be spelt:

In [12]:
req = requests.get('https://api.vam.ac.uk/v2/objects/search?q=Shakespeare')

object_data = req.json()
object_info = object_data["info"]
object_records = object_data["records"]
record_count = object_info["record_count"]
print(f"There are {record_count} objects that have the word 'Shakespeare' somewhere in the record")

There are 5535 objects that have the word 'Shakespeare' somewhere in the record


In [11]:
req = requests.get('https://api.vam.ac.uk/v2/objects/search?q=Shakespeare~1')

object_data = req.json()
object_info = object_data["info"]
object_records = object_data["records"]
record_count = object_info["record_count"]
print(f"There are {record_count} objects that have the word 'Shakespeare' (with a fuzzy distance of 1 character change allowed) somewhere in the record")

There are 5569 objects that have the word 'Shakespeare' (with a fuzzy distance of 1 character change allowed) somewhere in the record


In [10]:
req = requests.get('https://api.vam.ac.uk/v2/objects/search?q=Shakespeare~2')

object_data = req.json()
object_info = object_data["info"]
object_records = object_data["records"]
record_count = object_info["record_count"]
print(f"There are {record_count} objects that have the word 'Shakespeare' (with a fuzzy distance of 2 character changes allowed) somewhere in the record")

There are 5805 objects that have the word 'Shakespeare' (with a fuzzy distance of 2 character changes allowed) somewhere in the record


As you can see from the increasing numbers, more records are returned as you increase the fuzzy distance. For example, the second search would also match 'Shakespere' (one character removed), and the third search would also match 'Shakspere' (two characters removed).

The danger, of course, is that the more you increase the fuzzy distance, the more likely it is that irrelevant results are returned. Equally valid matches for the third search would be 'Wakespeare', 'Zakespeare', 'Rhakezpeare' and so on. The shorter the word, the less useful fuzzy searching is, as too many irrelevant results are returned.
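
To get a feel for what a given fuzzy distance lets through, you can measure the distance between two spellings yourself. The sketch below assumes the fuzzy distance is a plain edit distance (the number of single-character insertions, deletions or substitutions). The API's actual matching may differ in its details, for example in how it treats two swapped adjacent characters or upper and lower case. `edit_distance` is a hypothetical helper, not something the API provides.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: fewest single-character edits turning a into b."""
    # previous[j] holds the distance between the processed part of a
    # and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


for variant in ["Shakespere", "Shakspere", "Zakespeare", "Rhakezpeare"]:
    print(variant, edit_distance("Shakespeare", variant))
```

Under this assumption 'Shakespere' is at distance 1 and the other three are at distance 2, which matches the searches they are described as appearing in.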

## Boolean Operators Combination

By combining operators together, boolean search can return exact results, but these results can sometimes be hard to interpret.

For example, you might want to find:

 * All records mentioning 'Yorkshire' (or minor variations on the spelling to within a 2 character distance inclusive)
 * But not if they mention 'Scarborough' 
 * But they must also mention 'chair'
 
Let's build this up step-by-step.

In [13]:
req = requests.get('https://api.vam.ac.uk/v2/objects/search?q=Yorkshire~2')
object_data = req.json()
object_info = object_data["info"]
object_records = object_data["records"]
record_count = object_info["record_count"]
print(f"There are {record_count} object records that mention Yorkshire (within a 2 character distance) somewhere in the record")

There are 7600 objects that have the word Yorkshire (within 2 character distance) somewhere in the record


In [14]:
req = requests.get('https://api.vam.ac.uk/v2/objects/search?q=Yorkshire~2 -Scarborough')
object_data = req.json()
object_info = object_data["info"]
object_records = object_data["records"]
record_count = object_info["record_count"]
print(f"There are {record_count} object records that mention Yorkshire (within a 2 character distance) without mentioning Scarborough somewhere in the record")

There are 7546 objects that have the word Yorkshire but not Scarborough somewhere in the record


In [17]:
import requests
req = requests.get('https://api.vam.ac.uk/v2/objects/search?q=Yorkshire~2 -Scarborough chair')
object_data = req.json()
object_info = object_data["info"]
object_records = object_data["records"]
record_count = object_info["record_count"]
print(f"There are {record_count} object records that mention Yorkshire (within a 2 character distance) without mentioning Scarborough and also mention a chair")

There are 4153 object records that mention Yorkshire (within a 2 character distance) without mentioning Scarborough and also mention a ring and/or a chair
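
When queries get this long, it can help to assemble them from small pieces instead of typing the operators by hand. The following is only a sketch: the helper names (`fuzzy`, `negate`, `any_of`, `all_of`) are hypothetical, and they simply produce the operator syntax shown in this section without checking that the terms themselves are valid.

```python
def fuzzy(term: str, distance: int) -> str:
    """Allow spelling variations, e.g. Yorkshire~2."""
    return f"{term}~{distance}"


def negate(term: str) -> str:
    """Exclude a term; no space between the hyphen and the term."""
    return f"-{term}"


def any_of(*terms: str) -> str:
    """OR the terms together, bracketed so they stay grouped."""
    return "(" + "|".join(terms) + ")"


def all_of(*parts: str) -> str:
    """AND is the default, so the parts are just separated by spaces."""
    return " ".join(parts)


query = all_of(fuzzy("Yorkshire", 2), negate("Scarborough"), "chair")
print(query)  # Yorkshire~2 -Scarborough chair
print(all_of(any_of("travel", "landscape"), "photography"))
```

The resulting string can be passed as the `q` parameter in the same way as the hand-written queries above.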


{note}`
If you are using negation and boolean OR operators together, the order they are written may affect your results. For a very detailed investigation of this see <https://github.com/elastic/elasticsearch/issues/4707>`