# Boolean Searching

Boolean search is one of the most powerful ways to find "the needle in the haystack": it lets you narrow down your search and remove irrelevant results to (hopefully!) reveal the set of records you want. It does require some thought, especially when you combine boolean operators, as the interaction between them can be hard to visualise. The operators available are:

  * Phrase Searching
  * Negated Searching
  * AND/OR searching
  * Fuzzy searching (not actually a boolean operator)

In [None]:
import requests

## Phrase searching

Perhaps the simplest boolean operator to understand, phrase searching lets you specify that an exact phrase (a word or a sequence of words) must be matched exactly as it is written. Normally, if you search for multiple words, any occurrence of the words anywhere in the record will be considered a match. For example, if you search for objects that contain the words *William Morris*, you probably do not want to see objects created by the partnership of "William Smith" and "Jane Morris", but these would be returned because they contain both of the words you searched for. If you instead put double quotes around the search words (e.g. *"William Morris"*), you will see a smaller number of results, all of which contain that exact phrase.

(N.B. of course, if you really only want to find objects connected to the Victorian artist and designer William Morris, you would be better off using a filter on person, or on maker if you want objects he was involved in creating; see Filters XXX)


In [None]:
req = requests.get('http://vam-etc-test-api.azureedge.net/api/v2/objects/search?q=William Morris')

object_data = req.json()
object_info = object_data["info"]
object_records = object_data["records"]

print("There are %d objects that have the words William and Morris somewhere in the record" % object_info["record_count"])

In [None]:
req = requests.get('http://vam-etc-test-api.azureedge.net/api/v2/objects/search?q="William Morris"')

object_data = req.json()
object_info = object_data["info"]
object_records = object_data["records"]

print('There are %d objects that have the phrase "William Morris" somewhere in the record' % object_info["record_count"])
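The URLs in the two cells above contain spaces and double quotes, which have to be percent-encoded before they are sent. The sketch below makes that encoding explicit using only the standard library. `build_search_url` is a hypothetical helper introduced for illustration, not part of the API, and the sketch assumes the API decodes `%20` and `%22` back into a space and a double quote.

```python
from urllib.parse import quote, urlencode

SEARCH_URL = "http://vam-etc-test-api.azureedge.net/api/v2/objects/search"


def build_search_url(query):
    """Return the search URL with the query text percent-encoded (hypothetical helper)."""
    # quote_via=quote encodes a space as %20 rather than '+'
    return SEARCH_URL + "?" + urlencode({"q": query}, quote_via=quote)


words_url = build_search_url('William Morris')     # both words, anywhere in the record
phrase_url = build_search_url('"William Morris"')  # the exact phrase

print(words_url)
print(phrase_url)
```

The only difference between the two URLs is the pair of encoded quotes (`%22`) around the phrase.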

## Negated Searching

Also fairly simple to understand, negated searching lets you exclude results you do not want to see, even though they would otherwise match. It is most commonly used in tandem with another operator or with a query, as on its own it is likely to return a very large number of results.

You can apply negation to single words in a text search (e.g. find records that contain the word *blue* but not the word *azure*), to phrases (as described above) and to filters (e.g. find records that are not associated with the maker William Morris).

To apply negation, prefix the word, phrase or filter term with a hyphen '-'. There must not be a space between the hyphen and the word, phrase or term (see the example below).

In [None]:
req = requests.get('http://vam-etc-test-api.azureedge.net/api/v2/objects/search?q=blue')

object_data = req.json()
object_info = object_data["info"]
object_records = object_data["records"]

print("There are %d objects that have the word blue somewhere in the record" % object_info["record_count"])

In [6]:
req = requests.get('http://vam-etc-test-api.azureedge.net/api/v2/objects/search?q=blue azure')

object_data = req.json()
object_info = object_data["info"]
object_records = object_data["records"]

print("There are %d objects that have the words blue and azure somewhere in the record" % object_info["record_count"])

There are 120 objects that have the words blue and azure somewhere in the record


In [10]:
req = requests.get('http://vam-etc-test-api.azureedge.net/api/v2/objects/search?q=blue -azure')

object_data = req.json()
object_info = object_data["info"]
object_records = object_data["records"]

print("There are %d objects that have the word 'blue' and not the word 'azure' somewhere in the record" % object_info["record_count"])

There are 62001 objects that have the word 'blue' and not the word 'azure' somewhere in the record
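One way to picture the three searches above is as operations on sets of record ids. The sketch below uses a handful of invented toy records (not real API data) and assumes that searching for two words behaves like a set intersection and that negation behaves like a set difference.

```python
# Hypothetical toy records, invented for illustration only
records = {
    1: "blue glazed vase",
    2: "tile in azure and blue",
    3: "azure silk ribbon",
    4: "red lacquer chair",
}


def ids_containing(word):
    """Return the ids of the toy records whose text contains the given word."""
    return {rid for rid, text in records.items() if word in text.lower().split()}


blue = ids_containing("blue")
azure = ids_containing("azure")

blue_and_azure = blue & azure  # like q=blue azure
blue_not_azure = blue - azure  # like q=blue -azure

print(sorted(blue), sorted(blue_and_azure), sorted(blue_not_azure))
```

Under that assumption, every record matching *blue* falls into exactly one of the other two sets, so the count for *blue* should equal the counts for *blue azure* and *blue -azure* added together.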


## AND/OR Searching

Now things start getting more complicated! One of the key issues when searching large datasets is whether you want a broad set of results, some of which might be correct and some not, or a smaller set of results that are all or mostly correct (but might be missing some other correct results). A shorthand description of this problem is to think about getting a million results as opposed to no results: you are either narrowing down from a large set or broadening out from a small set.

AND and OR let you decide which of these approaches to take, by indicating, when you search for multiple words, whether you want to find records that contain _all_ the words or are happy to see records that contain only some of them. Up until now, without your knowledge, we have been defaulting to AND.

XXX need to bring in scoring somehow here

To continue using AND you do not need to change anything. To switch to using OR you need to use the '|' operator between words. However, you might need to consider operator precedence at this point, as
XXX. To force the correct interpretation you also need to use brackets '(' and ')' around the alternative terms.

For example, you might want to find mentions of 

### OR Searching 

One use of OR searching is to search for variations on a topic, finding mentions of either or both in one search. For example XXX

In [25]:
req = requests.get('http://vam-etc-test-api.azureedge.net/api/v2/objects/search?q=travel photography')

object_data = req.json()
object_info = object_data["info"]
object_records = object_data["records"]

print("There are %d objects that have the words travel and photography somewhere in the record" % object_info["record_count"])

There are 12371 objects that have the words travel and photography somewhere in the record


In [26]:
req = requests.get('http://vam-etc-test-api.azureedge.net/api/v2/objects/search?q=landscape photography')

object_data = req.json()
object_info = object_data["info"]
object_records = object_data["records"]

print("There are %d objects that have the words landscape and photography somewhere in the record" % object_info["record_count"])

There are 6269 objects that have the words landscape and photography somewhere in the record


In [22]:
req = requests.get('http://vam-etc-test-api.azureedge.net/api/v2/objects/search?q=(travel|landscape) photography')

object_data = req.json()
object_info = object_data["info"]
object_records = object_data["records"]

print("There are %d objects that have the words 'travel' and/or 'landscape' and 'photography' somewhere in the record" % object_info["record_count"])

There are 13683 objects that have the words 'travel' and/or 'landscape' and 'photography' somewhere in the record


In [24]:
req = requests.get('http://vam-etc-test-api.azureedge.net/api/v2/objects/search?q=travel landscape photography')

object_data = req.json()
object_info = object_data["info"]
object_records = object_data["records"]

print("There are %d objects that have the words travel and landscape and photography somewhere in the record" % object_info["record_count"])

There are 4957 objects that have the words travel and landscape and photography somewhere in the record


That might be worth a recap!

 * First we searched for records that mention *travel* and *photography*
 * Then we searched for records that mention *landscape* and *photography*
 * Then we searched for records that mention *photography* and either or both of *landscape* and *travel*
 * Finally we searched for records that mention *photography* and *landscape* and *travel*

The third result might seem surprising, as you may think it should just be the sum of the first two results, but don't forget that records containing both *travel* and *landscape* now only count as one record. If you subtract the last result from the sum of the first two, you will see how the third total was reached.
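The arithmetic can be checked directly using the record counts printed by the four searches above. The variable names are chosen here purely for illustration.

```python
# Record counts copied from the outputs of the four searches above
travel_photography = 12371     # q=travel photography
landscape_photography = 6269   # q=landscape photography
all_three_words = 4957         # q=travel landscape photography

# Records mentioning both travel and landscape appear in both of the first
# two counts, so subtract them once to avoid counting them twice
travel_or_landscape = travel_photography + landscape_photography - all_three_words

print(travel_or_landscape)  # 13683, the count returned for q=(travel|landscape) photography
```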

## Fuzzy Searching

As mentioned, this is not a boolean operator, but it is close enough to be considered in this XXX. Fuzzy searching is fairly self-explanatory: it lets you tell the search algorithm to allow a certain degree of variation in spelling in the results it returns, so you can find results with slightly different spellings across times and cultures, singular and plural terms, typos and so on. For example, you might want to find XXX Shakespeare variations


In [27]:
req = requests.get('http://vam-etc-test-api.azureedge.net/api/v2/objects/search?q=Shakespeare')

object_data = req.json()
object_info = object_data["info"]
object_records = object_data["records"]

print("There are %d objects that have the word 'Shakespeare' somewhere in the record" % object_info["record_count"])

There are 5533 objects that have the word 'Shakespeare' somewhere in the record


In [28]:
req = requests.get('http://vam-etc-test-api.azureedge.net/api/v2/objects/search?q=Shakespeare~1')

object_data = req.json()
object_info = object_data["info"]
object_records = object_data["records"]

print("There are %d objects that have the word 'Shakespeare' (with a fuzzy distance of 1 character change allowed) somewhere in the record" % object_info["record_count"])

There are 5567 objects that have the word 'Shakespeare' (with a fuzzy distance of 1 character change allowed) somewhere in the record


In [29]:
req = requests.get('http://vam-etc-test-api.azureedge.net/api/v2/objects/search?q=Shakespeare~2')

object_data = req.json()
object_info = object_data["info"]
object_records = object_data["records"]

print("There are %d objects that have the word Shakespeare (with a fuzzy distance of 2 character changes allowed) somewhere in the record" % object_info["record_count"])

There are 5803 objects that have the word Shakespeare (with a fuzzy distance of 2 character changes allowed) somewhere in the record


As you can see from the increasing numbers, more records are returned as you increase the fuzzy distance. For example, the 2nd search would also include 'Shakespear' (one character removed), and the 3rd search would also include 'Shakspere' (two characters removed).

Of course, the danger is that the more you increase the fuzzy distance, the more likely it is that irrelevant results are returned. Equally valid matches for the 3rd search would be 'Wakespeare', 'Zakespeare', 'Rhakezpeare' and so on. The shorter the word, the less useful fuzzy searching is, as too many irrelevant results would be returned.
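To get a feel for what a "fuzzy distance" of 1 or 2 means, here is a minimal sketch that counts the single-character insertions, deletions and substitutions needed to turn one word into another (the Levenshtein distance). This assumes the search measures distance in roughly this way; the API itself may use a slightly different measure, and the `levenshtein` function is written here for illustration only.

```python
def levenshtein(a, b):
    """Count the single-character insertions, deletions and substitutions needed to turn a into b."""
    # previous[j] holds the distance between the first i-1 characters of a
    # and the first j characters of b
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


for variant in ["Shakespear", "Shakspere", "Zakespeare"]:
    print(variant, levenshtein("Shakespeare", variant))
```

By this measure 'Shakespear' is at distance 1, while 'Shakspere' and 'Zakespeare' are both at distance 2, which shows how a plausible historical spelling and an irrelevant word can be equally "close".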

## Boolean Operators Ordering

As we've already seen above, combining operators is where boolean search can be very powerful, but also hard to grasp intuitively. For example, you might want to find:

 * All records mentioning Yorkshire (or minor variations on that spelling of 2 character distance)
 * But not mentioning Scarborough 
 * That also mention a ring or a chair
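Each of the three requirements maps onto one operator: `~2` for the fuzzy match, `-` for the exclusion, and `(…|…)` for the alternatives. The sketch below assembles them into one query string. `build_query` is a hypothetical helper written for this example, and the order in which it places the terms is just one possible choice.

```python
def build_query(required=(), excluded=(), any_of=()):
    """Combine search terms into one query string (hypothetical helper)."""
    parts = []
    if any_of:
        # Alternatives are joined with '|' and wrapped in brackets
        parts.append("(" + "|".join(any_of) + ")")
    # Required terms are added as they are (AND is the default)
    parts.extend(required)
    # Excluded terms get a '-' prefix with no space after it
    parts.extend("-" + term for term in excluded)
    return " ".join(parts)


query = build_query(
    required=["Yorkshire~2"],
    excluded=["Scarborough"],
    any_of=["ring", "chair"],
)
print(query)  # (ring|chair) Yorkshire~2 -Scarborough
```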


In [95]:
req = requests.get('http://vam-etc-test-api.azureedge.net/api/v2/objects/search?q=Yorkshire~2')

object_data = req.json()
object_info = object_data["info"]
object_records = object_data["records"]

print("There are %d objects that have the word Yorkshire (within 2 character distance) somewhere in the record" % object_info["record_count"])

There are 7581 objects that have the word Yorkshire (within 2 character distance) somewhere in the record


In [96]:
req = requests.get('http://vam-etc-test-api.azureedge.net/api/v2/objects/search?q=Yorkshire~2 -Scarborough')

object_data = req.json()
object_info = object_data["info"]
object_records = object_data["records"]

print("There are %d objects that have the word Yorkshire but not Scarborough somewhere in the record" % object_info["record_count"])

There are 7527 objects that have the word Yorkshire but not Scarborough somewhere in the record


In [97]:
req = requests.get('http://vam-etc-test-api.azureedge.net/api/v2/objects/search?q=(ring|chair) Yorkshire~2 -Scarborough')

object_data = req.json()
object_info = object_data["info"]
object_records = object_data["records"]

print("There are %d objects that have the word Yorkshire but not Scarborough and a ring and/or a chair somewhere in the record" % object_info["record_count"])

There are 4201 objects that have the word Yorkshire but not Scarborough and a ring and/or a chair somewhere in the record


In [98]:
req = requests.get('http://vam-etc-test-api.azureedge.net/api/v2/objects/search?q=-Scarborough (ring|chair) Yorkshire~2')

object_data = req.json()
object_info = object_data["info"]
object_records = object_data["records"]

print("There are %d objects that match the query '-Scarborough (ring|chair) Yorkshire~2'" % object_info["record_count"])

There are 4225 objects that match the query '-Scarborough (ring|chair) Yorkshire~2'


In [99]:
req = requests.get('http://vam-etc-test-api.azureedge.net/api/v2/objects/search?q=Yorkshire~2 -Scarborough (ring|chair)')

object_data = req.json()
object_info = object_data["info"]
object_records = object_data["records"]

print("There are %d objects that match the query 'Yorkshire~2 -Scarborough (ring|chair)'" % object_info["record_count"])

There are 11436 objects that match the query 'Yorkshire~2 -Scarborough (ring|chair)'


XXX Explain why operator order affects results

## Further Reading

  * Boolean Search
  * Levenshtein Distance (edit distance)