# Introduction to Document Similarity with Elasticsearch


## Document Distance
Document similarity is the problem of finding a way to represent the distance between documents.

Two things are needed:
 - The first is to encode the documents as vectors. There are several options for this (one-hot, frequency, distributed).
 - The second is to decide how to measure distance: Euclidean? Manhattan? Cosine? There are tradeoffs, but cosine is a good place to start with text. Why? Documents encoded as vectors are sparse; each vector could be as long as the number of unique words across the full corpus. That means two documents of very different lengths (e.g. a single recipe and a cookbook) are encoded as vectors of the same length but very different magnitudes, and a magnitude-sensitive measure might overemphasize the cookbook's vector at the expense of the recipe's. Cosine distance helps to correct for variations in vector magnitude that result from documents of uneven length, and enables us to measure the distance between the book and the recipe.
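
To make the recipe-versus-cookbook argument concrete, here is a minimal sketch that uses only the standard library. The toy texts, the whitespace tokenizer, and the helper names (`frequency_vector`, `euclidean_distance`, `cosine_distance`) are assumptions made for illustration. The "cookbook" is simply the recipe repeated 50 times, so its vector points in the same direction as the recipe's but has a much larger magnitude.

```python
import math
from collections import Counter


def frequency_vector(text, vocabulary):
    """Encode a text as a list of term counts, one slot per vocabulary word."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]


def euclidean_distance(a, b):
    """Straight-line distance; sensitive to vector magnitude."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """One minus the cosine of the angle between a and b; ignores magnitude."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Toy documents (hypothetical, not taken from the corpus)
recipe = "blend frozen mango with coconut milk"
cookbook = " ".join([recipe] * 50)  # same words, 50 times as long
other = "grill the steak with salt"

vocabulary = sorted(set(" ".join([recipe, other]).lower().split()))
recipe_vec = frequency_vector(recipe, vocabulary)
cookbook_vec = frequency_vector(cookbook, vocabulary)
other_vec = frequency_vector(other, vocabulary)

print("euclidean recipe-cookbook:", round(euclidean_distance(recipe_vec, cookbook_vec), 3))
print("euclidean recipe-other:   ", round(euclidean_distance(recipe_vec, other_vec), 3))
print("cosine recipe-cookbook:   ", round(cosine_distance(recipe_vec, cookbook_vec), 3))
print("cosine recipe-other:      ", round(cosine_distance(recipe_vec, other_vec), 3))
```

With Euclidean distance the recipe ends up closer to the unrelated text than to its own cookbook, while cosine distance places the recipe and cookbook essentially on top of each other.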
 
Document similarity is something we cover in the book - in chapter 6 and then in chapter 10, where we build a chatbot that, among other things, uses a nearest neighbor search to recommend recipes that are similar to the ingredients listed by the user. One of my observations during the prototyping phase for that chapter was how slow vanilla nearest neighbor search is. This led me to think about different ways to optimize the search, from using variations like k-ball, to using other Python libraries like Spotify's Annoy, and also to other kinds of tools altogether that attempt to deliver similar results as quickly as possible. Like Elasticsearch!
 
## What is Elasticsearch
Elasticsearch is a text search engine that leverages the information retrieval library Lucene together with a key-value store to ...


## Get the Sample Corpus
The Yellowbrick machine learning visualization library hosts several datasets, either wrangled from the UCI Machine Learning Repository or built by District Data Labs. One of these is a text corpus of news documents collected from RSS feeds in different domain areas. If you haven't downloaded the data, you can do so by running:

    $ python -m yellowbrick.download
    
This should create a folder named data in your current working directory that contains all of the datasets. You can load a specified dataset as follows:


In [1]:
import os

# Bunch lives in sklearn.utils; the old sklearn.datasets.base path is gone in newer releases
from sklearn.utils import Bunch
from yellowbrick.download import download_all

## The path to the test data sets
FIXTURES  = os.path.join(os.getcwd(), "data")

## Dataset loading mechanisms
datasets = {
    "hobbies": os.path.join(FIXTURES, "hobbies")
}


def load_data(name, download=True):
    """
    Loads and wrangles the passed in text corpus by name.
    If download is specified, this method will download any missing files. 
    """
    
    # Get the path from the datasets 
    path = datasets[name]
    
    # Check if the data exists, otherwise download or raise 
    if not os.path.exists(path):
        if download:
            download_all() 
        else:
            raise ValueError((
                "'{}' dataset has not been downloaded, "
                "use the download.py module to fetch datasets"
            ).format(name))
    
    # Read the directories in the directory as the categories. 
    categories = [
        cat for cat in os.listdir(path) 
        if os.path.isdir(os.path.join(path, cat))
    ]
    
    
    files  = [] # holds the file names relative to the root 
    data   = [] # holds the text read from the file 
    target = [] # holds the string of the category 
        
    # Load the data from the files in the corpus 
    for cat in categories:
        # Use `filename` so we don't shadow the `name` argument
        for filename in os.listdir(os.path.join(path, cat)):
            filepath = os.path.join(path, cat, filename)
            files.append(filepath)
            target.append(cat)

            with open(filepath, 'r') as f:
                data.append(f.read())
        
    
    # Return the data bunch for use similar to the newsgroups example
    return Bunch(
        categories=categories,
        files=files,
        data=data,
        target=target,
    )

In [97]:
corpus = load_data('hobbies')

hobby_types  = {}

for category in corpus.categories:
    texts = []
    for idx in range(len(corpus.data)):
        if corpus['target'][idx] == category:
            texts.append(' '.join(corpus.data[idx].split()))
    hobby_types[category] = texts
    
# The original documents preserve paragraph structure; the whitespace was
# already squashed in the loop above for readability, so we use the texts as-is
food_stories = list(hobby_types['cooking'])

In [98]:
print(food_stories[5])

Smoothies are one of our favorite breakfast options year-round, but especially in the winter because it's an easy way to brighten the mornings by adding extra fruit to our diets. And since frozen fruit works just as well (if not better — hello, icy, creamy texture), there's no worry about whether it's in season. If you're stuck in the doldrums of gray days, try these tropical smoothies. It won't be quite like you're relaxing on a white-sand beach, but it'll be closer to it than your normal bowl of oatmeal. Sometimes, you just have to go with a classic. This smoothie recipe is one of our standbys. We might dress it up or down with a touch of mint or swap in coconut water, but we can always count on it to be delicious. You'll definitely get bonus points if you use a fresh coconut to make this smoothie. But because we know mornings are hectic, you can also swap in a can of coconut milk, and it will be just as creamy and delicious. Adding mint to your smoothie is a total pro move. Not only

In [51]:
print(food_stories[23])

When Elise and Ben bought their Mount Washington house, the kitchen looked like this. They used this sad kitchen for five years until they finally updated it. The end result is modernized, but still captures the spirit and integrity of the storied older house. Using IKEA cabinets to keep the costs down, they were able to upgrade and personalize the look with Calcutta marble and Schoolhouse Electric hardware for all the cabinets. And what an amazing transformation after: More details to drool over: In Elise's words: I probably made about 12 different floor plans and elevations trying to fit everything in, yet achieve a light, open, airy feeling. The biggest issues were the appliances. How do you fit these huge things, plus a ton of storage space, into a 10x10 room? Really, the only solution was to downsize as much as possible and get a a little creative with the old "kitchen triangle." Breaking up the cooktop and oven, forgoing a good looking oven hood and for a sensible micro hood, and

## Start Elasticsearch
From the command line, start Elasticsearch:

```bash
$ cd elasticsearch-<version>        
$ ./bin/elasticsearch
```

## Create an Index
Now we will create an index. Think of an index as a database in PostgreSQL or MongoDB. An Elasticsearch cluster can contain multiple _indices_ (similar to relational or NoSQL databases), which in turn contain multiple _types_ (similar to MongoDB collections or PostgreSQL tables). These types hold multiple _documents_ (similar to MongoDB documents or PostgreSQL rows), and each document has _properties_ (like MongoDB document key-values or PostgreSQL columns).

```bash
$ curl -X PUT "localhost:9200/cooking" -H 'Content-Type: application/json' -d'
{
    "settings" : {
        "index" : {
            "number_of_shards" : 1,
            "number_of_replicas" : 1
        }
    }
}
'
```

And the response:
 
```bash
{"acknowledged":true,"shards_acknowledged":true,"index":"cooking"}
```
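
The same request can also be assembled from Python. The sketch below uses only the standard library and builds the request without sending it, so it runs even when no cluster is up. The helper name `build_create_index_request` and its default `host` value are assumptions for illustration; actually sending the request requires a running Elasticsearch node.

```python
import json
import urllib.request


def build_create_index_request(index_name, shards=1, replicas=1,
                               host="http://localhost:9200"):
    """Build (but do not send) the PUT request that creates an index."""
    body = {
        "settings": {
            "index": {
                "number_of_shards": shards,
                "number_of_replicas": replicas,
            }
        }
    }
    return urllib.request.Request(
        url="{}/{}".format(host, index_name),
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


request = build_create_index_request("cooking")
print(request.get_method(), request.full_url)

# To send it (assumes Elasticsearch is listening on localhost:9200):
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```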

### Get All Indices or Delete a Specific Index

```bash
$ curl -X GET "localhost:9200/_cat/indices?v"
```

```bash
$ curl -X DELETE "localhost:9200/cooking"
```

### Manually Add Some Documents to the Index

```bash
$ curl -X PUT "localhost:9200/cooking/_doc/1?pretty" -H 'Content-Type: application/json' -d'
{
      "description":  "Smoothies are one of our favorite breakfast options year-round."
}
'
```

```bash
$ curl -X PUT "localhost:9200/cooking/_doc/2?pretty" -H 'Content-Type: application/json' -d'
{
      "description":  "A smoothie is a thick, cold beverage made from pureed raw fruit."
}
'
```

```bash
$ curl -X PUT "localhost:9200/cooking/_doc/3?pretty" -H 'Content-Type: application/json' -d'
{
      "description":  "Eggs Benedict is a traditional American breakfast or brunch dish."
}
'
```    
    
### Search the Index for Similar Documents
    
Let's look for documents that mention "breakfast":
    
```bash
$ curl -XGET 'localhost:9200/cooking/_search?q=description:breakfast&pretty'
```

And the response:

```bash
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 0.48233607,
    "hits" : [
      {
        "_index" : "cooking",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.48233607,
        "_source" : {
          "description" : "Smoothies are one of our favorite breakfast options year-round."
        }
      },
      {
        "_index" : "cooking",
        "_type" : "_doc",
        "_id" : "3",
        "_score" : 0.48233607,
        "_source" : {
          "description" : "Eggs Benedict is a traditional American breakfast or brunch dish."
        }
      }
    ]
  }
}
```

But if we look for documents that mention "smoothie"...

```bash
$ curl -XGET 'localhost:9200/cooking/_search?q=description:smoothie&pretty'
```

We only get the second document back, since the word "smoothie" is pluralized in the first document:

```bash
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.9331132,
    "hits" : [
      {
        "_index" : "cooking",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 0.9331132,
        "_source" : {
          "description" : "A smoothie is a thick, cold beverage made from pureed raw fruit."
        }
      }
    ]
  }
}
```
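
One way to see why the plural form is missed: by default the text is lowercased and split into tokens but not stemmed, so "smoothies" and "smoothie" are stored as different terms. The sketch below imitates that with a toy inverted index over the three sample documents. The `simple_analyze` function is a rough, hypothetical stand-in for the real analyzer, not Elasticsearch's implementation.

```python
import re


def simple_analyze(text):
    """Rough stand-in for an analyzer: lowercase, then split on non-word characters."""
    return [token for token in re.split(r"\W+", text.lower()) if token]


documents = {
    "1": "Smoothies are one of our favorite breakfast options year-round.",
    "2": "A smoothie is a thick, cold beverage made from pureed raw fruit.",
    "3": "Eggs Benedict is a traditional American breakfast or brunch dish.",
}

# Build a tiny inverted index: term -> set of ids of documents containing it
inverted_index = {}
for doc_id, text in documents.items():
    for token in simple_analyze(text):
        inverted_index.setdefault(token, set()).add(doc_id)

print(sorted(inverted_index.get("breakfast", set())))  # ['1', '3']
print(sorted(inverted_index.get("smoothie", set())))   # ['2']
print(sorted(inverted_index.get("smoothies", set())))  # ['1']
```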

We can work around this by using a fuzzy search, which will return both the first and second documents:

```bash
$ curl -XGET "localhost:9200/cooking/_search" -H 'Content-Type: application/json' -d'
{
    "query": {
       "fuzzy" : { "description" : "smoothie" }
    }
}
'
```
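
Fuzzy matching works on edit distance: a term matches if it can be turned into the query term with only a small number of single-character changes. The sketch below is a plain Levenshtein distance in Python, offered only to show why "smoothies" is within reach of "smoothie". The function name `edit_distance` is an assumption, and the sketch ignores transpositions, which Elasticsearch can count as a single edit.

```python
def edit_distance(source, target):
    """Levenshtein distance: fewest single-character inserts, deletes, or substitutions."""
    previous = list(range(len(target) + 1))
    for i, source_char in enumerate(source, start=1):
        current = [i]
        for j, target_char in enumerate(target, start=1):
            cost = 0 if source_char == target_char else 1
            current.append(min(
                previous[j] + 1,         # delete a character from source
                current[j - 1] + 1,      # insert a character into source
                previous[j - 1] + cost,  # substitute (or keep a matching character)
            ))
        previous = current
    return previous[-1]


print(edit_distance("smoothie", "smoothies"))  # 1
```

A single edit separates the two forms, which is why the fuzzy query can bring back the first document as well as the second.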
    

 
## Adding Documents in Bulk

Now let's try to add all of the cooking documents from the corpus to the index.

In [4]:
from elasticsearch.helpers import bulk
from elasticsearch import Elasticsearch

class ElasticIndexer(object):
    """
    Create an ElasticSearch instance, and given a list of documents, 
    index the documents into ElasticSearch.
    """
    def __init__(self, index_name, doc_type):
        self.type = doc_type
        self.index_name = index_name
        self.elastic_search = Elasticsearch()
        
    def make_documents(self, texts):
        # Use the index name and type passed to the constructor
        # rather than hardcoded values
        for document in texts:
            yield {
                "_index": self.index_name,
                "_type": self.type,
                "description": document
            }

    def index(self, texts):
        bulk(self.elastic_search, self.make_documents(texts))

    def clear_elastic_search(self):
        self.elastic_search.indices.delete(
            index=self.index_name,
            ignore=[400, 404]
        )

In [5]:
indexer = ElasticIndexer("cooking", "_doc")
indexer.clear_elastic_search()
indexer.index(food_stories)

In [6]:
from pprint import pprint

query = {"match_all": {}}
result = indexer.elastic_search.search(index="cooking", body={"query":query})

print("%d hits \n" % result['hits']['total'])
print("First result:\n")
pprint(result['hits']['hits'][0])

30 hits 

First result:

{'_id': 'Fdn8J2QB7PzBcXb-DYLP',
 '_index': 'cooking',
 '_score': 1.0,
 '_source': {'description': 'Smoothies are one of our favorite breakfast '
                            'options year-round, but especially in the winter '
                            "because it's an easy way to brighten the mornings "
                            'by adding extra fruit to our diets. And since '
                            'frozen fruit works just as well (if not better — '
                            "hello, icy, creamy texture), there's no worry "
                            "about whether it's in season. If you're stuck in "
                            'the doldrums of gray days, try these tropical '
                            "smoothies. It won't be quite like you're relaxing "
                            "on a white-sand beach, but it'll be closer to it "
                            'than your normal bowl of oatmeal. Sometimes, you '
                            'just hav

In [7]:
query = {"fuzzy":{"description":"breakfast"}}
result = indexer.elastic_search.search(index="cooking", body={"query":query})

print("%d hits \n" % result['hits']['total'])
print("First result:\n")
pprint(result['hits']['hits'][0])

8 hits 

First result:

{'_id': 'Itn8J2QB7PzBcXb-DYLP',
 '_index': 'cooking',
 '_score': 2.5993373,
 '_source': {'description': 'Ever wonder how your favorite chefs make their '
                            'food taste and look so good? Tune in this weekend '
                            'as Food Network chefs reveal secrets from their '
                            'own kitchen, their time working at restaurants '
                            'and more. First up on Saturday, Ree Drummond is '
                            'answering her fans’ most-asked questions while '
                            'she makes Mexican chicken soup and white sandwich '
                            'bread from scratch. Then, the co-hosts of The '
                            'Kitchen are spilling the secrets restaurants use '
                            'to make your food taste more delicious, while '
                            'making potato chip-crusted salmon and a '
                            'pickle-brine

## More Like This
Elasticsearch exposes a way of doing more advanced querying based on document similarity, called "More Like This" (MLT). Given an input document or set of documents, MLT wraps all of the following behavior (as described in the [Elasticsearch documentation](https://www.elastic.co/guide/en/elasticsearch/reference/1.6/query-dsl-mlt-query.html)):

 - extraction of a set of representative terms from the input
 - selection of terms with the highest scores*
 - formation of a disjunctive query using these terms
 - query execution
 - results returned
 
*_Note: this is done using term frequency-inverse document frequency (TF-IDF), an encoding method that normalizes the frequency of a term in a document with respect to the rest of the corpus. TF-IDF measures the relevance of a term to a document by the scaled frequency of the term in the document, normalized by the inverse of the scaled frequency of the term in the entire corpus. This has the effect of selecting the terms that most distinguish the input document or documents from the rest of the corpus._
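
The steps above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the `tokenize` and `top_tfidf_terms` helpers, the particular TF-IDF formula, the toy input text, and the `bool`/`should` query built at the end are choices made here for clarity, and Lucene's actual scoring and query construction differ in their details.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on non-word characters (a simplification)."""
    return [token for token in re.split(r"\W+", text.lower()) if token]


def top_tfidf_terms(input_text, corpus, max_query_terms=3):
    """Rank the input's terms by a simple TF-IDF score and keep the best ones."""
    doc_token_sets = [set(tokenize(doc)) for doc in corpus]
    term_counts = Counter(tokenize(input_text))
    scores = {}
    for term, term_freq in term_counts.items():
        doc_freq = sum(1 for tokens in doc_token_sets if term in tokens)
        if doc_freq == 0:
            continue  # a term absent from the corpus cannot match anything
        idf = 1.0 + math.log(len(corpus) / doc_freq)
        scores[term] = term_freq * idf
    # Highest score first; break ties alphabetically so the result is stable
    ranked = sorted(scores, key=lambda term: (-scores[term], term))
    return ranked[:max_query_terms]


corpus = [
    "Smoothies are one of our favorite breakfast options year-round.",
    "A smoothie is a thick, cold beverage made from pureed raw fruit.",
    "Eggs Benedict is a traditional American breakfast or brunch dish.",
]
input_text = "Smoothie bowls: a thick smoothie topped with fruit for breakfast."

terms = top_tfidf_terms(input_text, corpus)
print(terms)  # ['smoothie', 'fruit', 'thick']

# A disjunctive (OR) query over the selected terms
query = {"bool": {"should": [{"term": {"description": term}} for term in terms]}}
```

Common words such as "a" and "breakfast" appear in more of the corpus, so they score lower and drop out of the query.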
 
We can build an MLT query in much the same way as we did the `"fuzzy"` search above. The Elasticsearch MLT query exposes many [search parameters](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-mlt-query.html#_document_input_parameters), but the only required one is `"like"`, for which we can supply a string, a document, or multiple documents.

Let's see if we can find any documents from our corpus that are similar to a New York Times [review](https://www.nytimes.com/2018/05/29/dining/don-angie-review.html) for the Italian restaurant Don Angie.

In [None]:
red_sauce_renaissance = """
    Ever since Rich Torrisi and Mario Carbone began rehabilitating chicken Parm and 
    Neapolitan cookies around 2010, I’ve been waiting for other restaurants to carry 
    the torch of Italian-American food boldly into the future. This is a major branch 
    of American cuisine, too important for its fate to be left to the Olive Garden. 
    For the most part, though, the torch has gone uncarried. I have been told that 
    Palizzi Social Club, in Philadelphia, may qualify, but because Palizzi is a 
    veritable club — members and guests only, no new applications accepted — I don’t 
    expect to eat there before the nation’s tricentennial. Then in October, a place 
    opened in the West Village that seemed to hit all the right tropes. It’s called 
    Don Angie. Two chefs share the kitchen — Angela Rito and her husband, Scott 
    Tacinelli — and they make versions of chicken scarpariello, antipasto salad and 
    braciole. The dining room brings back the high-glitz Italian restaurant décor of 
    the 1970s and ’80s, the period when Formica and oil paintings of the Bay of Naples 
    went out and mirrors with gold pinstripes came in. The floor is a black-and-white 
    checkerboard. The bar is made of polished marble the color of beef carpaccio. 
    There is a house Chianti, and it comes in a straw-covered bottle. There is hope 
    for a red-sauce renaissance, after all.
"""

In [77]:
query = {
    "more_like_this" : {
        "fields" : ["description"],
        "like" : red_sauce_renaissance,
        "min_term_freq" : 3,
        "max_query_terms" : 50,
        "min_doc_freq" : 4
    }
}

result = indexer.elastic_search.search(index="cooking", body={"query":query})

print("%d hits \n" % result['hits']['total'])
print("First result:\n")
pprint(result['hits']['hits'][0])

28 hits 

First result:

{'_id': 'LNn8J2QB7PzBcXb-DYLP',
 '_index': 'cooking',
 '_score': 4.373271,
 '_source': {'description': 'Shopping for light bulbs can be as simple or '
                            'complex as you make it. My roommate and I, who '
                            'spend an inordinate amount of money at the nearby '
                            "corner store (because it's right there!), tend to "
                            "suffer through whatever light bulbs they've got "
                            "in stock—more often than not they're soft white "
                            '60-watt incandescents. I realize I should be '
                            'embarrassed. But in our living room, where there '
                            'are four to five lamps lighting a space barely '
                            'larger than a dog house, these hazy bulbs create '
                            'a warm, layered blanket of light that we love in '
                            'the

In [81]:
query = {
    "more_like_this" : {
        "fields" : ["description"],
        "like" : red_sauce_renaissance,
        "unlike" : [food_stories[23], food_stories[28]],
        "min_term_freq" : 2,
        "max_query_terms" : 50,
        "min_doc_freq" : 4
    }
}

result = indexer.elastic_search.search(index="cooking", body={"query":query})

print("%d hits \n" % result['hits']['total'])
print("First result:\n")
pprint(result['hits']['hits'][0])

4 hits 

First result:

{'_id': 'GNn8J2QB7PzBcXb-DYLP',
 '_index': 'cooking',
 '_score': 1.130699,
 '_source': {'description': 'Skinnytaste Dinner Plan (Week 20) posted March '
                            '12, 2016 by Gina Hi everyone! I’m back from '
                            'another week in Charleston where I was shooting '
                            'the second half of my new cookbook, Skinnytaste '
                            'Fast and Slow (comes out this Fall). While there '
                            'I did a book signing at the Charleston Wine + '
                            'Food Festival, got to try some new restaurants '
                            'and did some sight seeing, if you haven’t been '
                            'there yet it’s a must! This is week 20 of sharing '
                            'my weekly dinner plans. I use The Skinnytaste '
                            'Meal Planner pictured below to plan my dinners '
                            'for the week

In [100]:
from elasticsearch.helpers import bulk
from elasticsearch import Elasticsearch

class ElasticIndexer(object):
    """
    Create an ElasticSearch instance, and given a list of documents, 
    index the documents into ElasticSearch.
    """
    def __init__(self):
        self.elastic_search = Elasticsearch()
        
    def make_documents(self, textdict):        
        for category, docs in textdict:
            for document in docs:
                yield {
                    "_index": category,
                    "_type": "_doc",
                    "description": document
                }

    def index(self, textdict):
        bulk(self.elastic_search, self.make_documents(textdict))

In [101]:
indexer = ElasticIndexer()
indexer.index(hobby_types.items())

In [102]:
query = {"match_all": {}}
result = indexer.elastic_search.search(index=["cooking","books","sports"], body={"query":query})

print("%d hits \n" % result['hits']['total'])
print("First result:\n")
pprint(result['hits']['hits'][0])

220 hits 

First result:

{'_id': 'M9nIKGQB7PzBcXb-RYkm',
 '_index': 'books',
 '_score': 1.0,
 '_source': {'description': 'After forty-five years of playing in Crosby, '
                            'Stills, Nash, and Young, Graham Nash has '
                            'announced that he is done with tolerating—and '
                            'mollifying—David Crosby’s notoriously big, '
                            'cantankerous mouth. Nash told Billboard, “In my '
                            'world, there will never, ever be a Crosby, '
                            'Stills, Nash, and Young record and there will '
                            'never be another Crosby, Stills, and Nash record '
                            'or show.” He went a bit further in an earlier '
                            'interview with the magazine Lust for Life, '
                            'saying: I don’t like David Crosby right now. He’s '
                            'been awful for me the last two years