# Introduction to Document Similarity with Elasticsearch


## Document Distance
Document similarity comes down to finding a way to represent the distance between documents.

Two things are needed:
 - The first is to encode the documents as vectors. There are several options for this (one-hot, frequency, distributed).
 - The second is to decide how to measure distance: Euclidean? Manhattan? Cosine? There are tradeoffs, but cosine is a good place to start with text. Why? Documents encoded as vectors are sparse; each vector could be as long as the number of unique words across the full corpus. That means two documents of very different lengths (e.g. a single recipe and a cookbook) are encoded as vectors of the same dimensionality but very different magnitudes, so a magnitude-sensitive measure like Euclidean distance can place them far apart even when they use the same words in similar proportions. Cosine distance depends only on the angle between the vectors, which corrects for the variation in magnitude that comes from documents of uneven length and lets us meaningfully measure the distance between the book and the recipe.
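To make the recipe-versus-cookbook point concrete, here is a minimal sketch using plain frequency encoding and only the standard library. The toy texts and the helper names (`vectorize`, `euclidean_distance`, `cosine_distance`) are made up for illustration; the "cookbook" is simply the recipe repeated 50 times, so both documents use the same words in the same proportions.

```python
import math
from collections import Counter


def vectorize(text, vocabulary):
    """Frequency-encode a text against a fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]


def euclidean_distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


# Hypothetical documents: same vocabulary, very different lengths
recipe = "blend banana and mango with ice"
cookbook = " ".join([recipe] * 50)

vocabulary = sorted(set(recipe.split()))
recipe_vec = vectorize(recipe, vocabulary)
cookbook_vec = vectorize(cookbook, vocabulary)

# Euclidean distance is dominated by the difference in magnitude
print(euclidean_distance(recipe_vec, cookbook_vec))

# Cosine distance is effectively 0 (up to floating-point error),
# because the two vectors point in the same direction
print(cosine_distance(recipe_vec, cookbook_vec))
```

The Euclidean distance grows with the length of the cookbook, while the cosine distance stays near zero no matter how many times the recipe is repeated.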
 
Document similarity is something we cover in the book, in chapter 6 and again in chapter 10, where we build a chatbot that, among other things, uses a nearest neighbor search to recommend recipes that are similar to the ingredients listed by the user. One of my observations during the prototyping phase for that chapter was how slow vanilla nearest neighbor search is. This led me to think about different ways to optimize the search, from using variations like ball trees, to using other Python libraries like Spotify's Annoy, to other kinds of tools altogether that attempt to deliver similar results as quickly as possible. Like Elasticsearch!
 
## What is Elasticsearch
Elasticsearch is a text search engine that leverages the information retrieval library Lucene together with a key-value store to ...


## Get the Sample Corpus
The Yellowbrick machine learning visualization library hosts several datasets, either wrangled from the UCI Machine Learning Repository or built by District Data Labs, for use in the examples throughout its documentation. One of these is a text corpus of news documents collected from RSS feeds across different domain areas. If you haven't downloaded the data, you can do so by running:

    $ python -m yellowbrick.download
    
This should create a folder named data in your current working directory that contains all of the datasets. You can load a specified dataset as follows:


In [1]:
import os

from sklearn.utils import Bunch  # sklearn.datasets.base was removed in newer scikit-learn releases
from yellowbrick.download import download_all

# The path to the test data sets
FIXTURES  = os.path.join(os.getcwd(), "data")

# Dataset loading mechanisms
datasets = {
    "hobbies": os.path.join(FIXTURES, "hobbies")
}


def load_data(name, download=True):
    """
    Loads and wrangles the passed in text corpus by name.
    If download is specified, this method will download any missing files. 
    """
    
    # Get the path from the datasets 
    path = datasets[name]
    
    # Check if the data exists, otherwise download or raise 
    if not os.path.exists(path):
        if download:
            download_all() 
        else:
            raise ValueError((
                "'{}' dataset has not been downloaded, "
                "use the download.py module to fetch datasets"
            ).format(name))
    
    # Read the directories in the directory as the categories. 
    categories = [
        cat for cat in os.listdir(path) 
        if os.path.isdir(os.path.join(path, cat))
    ]
    
    
    files  = [] # holds the full path to each file 
    data   = [] # holds the text read from the file 
    target = [] # holds the string of the category 
        
    # Load the data from the files in the corpus 
    for cat in categories:
        # Use fname so the dataset `name` argument is not shadowed
        for fname in os.listdir(os.path.join(path, cat)):
            fpath = os.path.join(path, cat, fname)
            files.append(fpath)
            target.append(cat)
            
            with open(fpath, 'r') as f:
                data.append(f.read())
        
    
    # Return the data bunch for use similar to the newsgroups example
    return Bunch(
        categories=categories,
        files=files,
        data=data,
        target=target,
    )

In [2]:
corpus = load_data('hobbies')

hobby_types = {category: [] for category in corpus.categories}

# Group the texts by category in a single pass over the corpus
for text, category in zip(corpus.data, corpus.target):
    hobby_types[category].append(text)

food_stories = list(hobby_types['cooking'])

In [3]:
print(food_stories[5])


	   Smoothies are one of our favorite breakfast options year-round, but especially in the winter because it's an easy way to brighten the mornings by adding extra fruit to our diets. And since frozen fruit works just as well (if not better — hello, icy, creamy texture), there's no worry about whether it's in season.



	   If you're stuck in the doldrums of gray days, try these tropical smoothies. It won't be quite like you're relaxing on a white-sand beach, but it'll be closer to it than your normal bowl of oatmeal.



	         Sometimes, you just have to go with a classic. This smoothie recipe is one of our standbys. We might dress it up or down with a touch of mint or swap in coconut water, but we can always count on it to be delicious.



	         You'll definitely get bonus points if you use a fresh coconut to make this smoothie. But because we know mornings are hectic, you can also swap in a can of coconut milk, and it will be just as creamy and delicious.



	        Adding m

## Start Elasticsearch
From the command line, change into the Elasticsearch directory and start the server:

```bash
$ cd elasticsearch-<version>        
$ ./bin/elasticsearch
```
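Once the node is up, it answers HTTP requests on port 9200, the same address used in the `curl` commands throughout this post. As a quick sanity check from Python, the sketch below asks the root endpoint for its version using only the standard library. The function name `elasticsearch_version` and its `opener` parameter are hypothetical; the parameter exists only so the function can be exercised without a live cluster.

```python
import json
import urllib.request


def elasticsearch_version(url="http://localhost:9200",
                          opener=urllib.request.urlopen):
    """Return the node's version string, or None if nothing answers at url."""
    try:
        with opener(url, timeout=2) as response:
            info = json.loads(response.read().decode("utf-8"))
    except OSError:
        # urllib.error.URLError (e.g. connection refused) is an OSError
        return None
    return info.get("version", {}).get("number")
```

Calling `elasticsearch_version()` with the server running should return a version string; `None` suggests the server has not finished starting or is listening somewhere else.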

## Create an Index
Now we will create an index. Think of an index as roughly analogous to a database in PostgreSQL or MongoDB. An Elasticsearch cluster can contain multiple _indices_ (analogous to relational or NoSQL databases), which in turn contain multiple _types_ (similar to MongoDB collections or PostgreSQL tables). These types hold multiple _documents_ (similar to MongoDB documents or PostgreSQL rows), and each document has _properties_ (like MongoDB document key-values or PostgreSQL columns).

```bash
$ curl -X PUT "localhost:9200/cooking" -H 'Content-Type: application/json' -d'
{
    "settings" : {
        "index" : {
            "number_of_shards" : 1,
            "number_of_replicas" : 1
        }
    }
}
'
```

And the response:
 
```json
{"acknowledged":true,"shards_acknowledged":true,"index":"cooking"}
```
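The same index-creation request can be built from Python. The following is a minimal sketch using only the standard library rather than an Elasticsearch client; the helper name `create_index_request` is hypothetical, and the host is assumed to be the local node used in the `curl` examples.

```python
import json
import urllib.request


def create_index_request(index, shards=1, replicas=1,
                         host="http://localhost:9200"):
    """Build (but do not send) the PUT request that creates an index."""
    settings = {
        "settings": {
            "index": {
                "number_of_shards": shards,
                "number_of_replicas": replicas,
            }
        }
    }
    return urllib.request.Request(
        url="{}/{}".format(host, index),
        data=json.dumps(settings).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


request = create_index_request("cooking")
print(request.get_method(), request.full_url)

# With a cluster running, the request could be sent like this:
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```

Keeping the request construction separate from the network call makes it easy to inspect exactly what would be sent before touching the cluster.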

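### Get All Indices or Delete a Specific Index

To list every index on the cluster, along with its health and document count:

```bash
$ curl -X GET "localhost:9200/_cat/indices?v"
```

To delete the `cooking` index (for example, to start over from a clean slate):

```bash
$ curl -X DELETE "localhost:9200/cooking"
```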

### Add Some Documents to the Index

Each `PUT` below stores one document, under an explicit ID, in the `cooking` index:

```bash
$ curl -X PUT "localhost:9200/cooking/_doc/1?pretty" -H 'Content-Type: application/json' -d'
{
      "description":  "Smoothies are one of our favorite breakfast options year-round."
}
'
```

```bash
$ curl -X PUT "localhost:9200/cooking/_doc/2?pretty" -H 'Content-Type: application/json' -d'
{
      "description":  "A smoothie is a thick, cold beverage made from pureed raw fruit."
}
'
```

```bash
$ curl -X PUT "localhost:9200/cooking/_doc/3?pretty" -H 'Content-Type: application/json' -d'
{
      "description":  "Eggs Benedict is a traditional American breakfast or brunch dish."
}
'
```    
    
### Search the Index for Similar Documents
    
Let's look for documents that mention "breakfast":
    
```bash
$ curl -XGET 'localhost:9200/cooking/_search?q=description:breakfast&pretty'
```

And the response:

```json
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 0.48233607,
    "hits" : [
      {
        "_index" : "cooking",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.48233607,
        "_source" : {
          "description" : "Smoothies are one of our favorite breakfast options year-round."
        }
      },
      {
        "_index" : "cooking",
        "_type" : "_doc",
        "_id" : "3",
        "_score" : 0.48233607,
        "_source" : {
          "description" : "Eggs Benedict is a traditional American breakfast or brunch dish."
        }
      }
    ]
  }
}
```
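Because the response is plain JSON, pulling out the matching IDs, scores, and text takes only a few lines of Python. This sketch parses an abbreviated copy of the response above; the helper name `summarize_hits` is made up for illustration.

```python
import json

# Abbreviated copy of the search response shown above
response_text = """
{
  "took": 1,
  "timed_out": false,
  "hits": {
    "total": 2,
    "max_score": 0.48233607,
    "hits": [
      {"_index": "cooking", "_type": "_doc", "_id": "1", "_score": 0.48233607,
       "_source": {"description": "Smoothies are one of our favorite breakfast options year-round."}},
      {"_index": "cooking", "_type": "_doc", "_id": "3", "_score": 0.48233607,
       "_source": {"description": "Eggs Benedict is a traditional American breakfast or brunch dish."}}
    ]
  }
}
"""


def summarize_hits(response):
    """Return (id, score, description) for each hit, in ranked order."""
    return [
        (hit["_id"], hit["_score"], hit["_source"]["description"])
        for hit in response["hits"]["hits"]
    ]


results = summarize_hits(json.loads(response_text))
for doc_id, score, description in results:
    print(doc_id, score, description)
```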

But if we look for documents that mention "smoothie"...

```bash
$ curl -XGET 'localhost:9200/cooking/_search?q=description:smoothie&pretty'
```

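We only get the second document back. The word "smoothie" is pluralized in the first document, and by default Elasticsearch does not stem terms, so "smoothies" and "smoothie" are indexed as two different terms: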

```json
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.9331132,
    "hits" : [
      {
        "_index" : "cooking",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 0.9331132,
        "_source" : {
          "description" : "A smoothie is a thick, cold beverage made from pureed raw fruit."
        }
      }
    ]
  }
}
```
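The miss on the first document comes down to exact term matching. The sketch below builds a toy inverted index over the three documents to show the effect. Its `tokenize` function is a deliberately simplified stand-in for Elasticsearch's analysis step (it lowercases and splits on non-letters, with no stemming), not a reimplementation of it.

```python
import re
from collections import defaultdict

documents = {
    1: "Smoothies are one of our favorite breakfast options year-round.",
    2: "A smoothie is a thick, cold beverage made from pureed raw fruit.",
    3: "Eggs Benedict is a traditional American breakfast or brunch dish.",
}


def tokenize(text):
    """Lowercase and split on non-letters; no stemming is applied."""
    return re.findall(r"[a-z]+", text.lower())


def build_inverted_index(docs):
    """Map each term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


inverted_index = build_inverted_index(documents)

print(sorted(inverted_index["breakfast"]))  # [1, 3]
print(sorted(inverted_index["smoothie"]))   # [2]
print(sorted(inverted_index["smoothies"]))  # [1]
```

Since "smoothie" and "smoothies" are separate keys in the index, a lookup for one never reaches the documents stored under the other.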

We can work around this by using a fuzzy search, which matches terms within a small edit distance of the query term and so will return both the first and second documents:

```bash
$ curl -XGET "localhost:9200/cooking/_search" -H 'Content-Type: application/json' -d'
{
    "query": {
       "fuzzy" : { "description" : "smoothie" }
    }
}
'
```
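Fuzzy matching is based on edit distance: the number of single-character changes needed to turn one term into another. As a rough illustration of why "smoothie" can reach "smoothies", here is a small Levenshtein distance function. It is a sketch of the general idea rather than Elasticsearch's actual implementation.

```python
def levenshtein(a, b):
    """Count the insertions, deletions, and substitutions that turn a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep a match
            ))
        previous = current
    return previous[-1]


print(levenshtein("smoothie", "smoothies"))  # 1
print(levenshtein("smoothie", "breakfast"))
```

One edit separates the query term from the plural in the first document, which is well within what a fuzzy query tolerates for a term of this length, while "breakfast" is many edits away and stays out of the results.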
    

 


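In [4]:
# Now let's do it in bulk
from elasticsearch.helpers import bulk


def make_documents(index_name, index_type, texts):
    """Yield one bulk action per text."""
    for text in texts:
        yield {
            "_index": index_name,
            "_type": index_type,  # matches the "_doc" type used above
            "description": text,
        }


def index(es_instance, index_name, index_type, texts):
    """Index all of the texts using the bulk helper."""
    return bulk(es_instance, make_documents(index_name, index_type, texts))


def clear_es_indices(es_instance, index_name):
    """Delete the index, ignoring errors if it does not exist."""
    es_instance.indices.delete(
        index=index_name,
        ignore=[400, 404]
    )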
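Under the hood, bulk indexing goes through the `_bulk` REST endpoint, which expects newline-delimited JSON: an action line followed by the document source, one pair per document, with a trailing newline. The sketch below builds such a body with the standard library. The helper name `bulk_body` and the sequential IDs are assumptions for illustration, and the exact endpoint path to post to depends on the Elasticsearch version in use.

```python
import json


def bulk_body(texts, field="description"):
    """Build a newline-delimited JSON body for the _bulk endpoint."""
    lines = []
    for doc_id, text in enumerate(texts, start=1):
        # Action line: index this document under the given ID
        lines.append(json.dumps({"index": {"_id": str(doc_id)}}))
        # Source line: the document itself
        lines.append(json.dumps({field: text}))
    # The bulk API requires the body to end with a newline
    return "\n".join(lines) + "\n"


sample_texts = [
    "Smoothies are one of our favorite breakfast options year-round.",
    "A smoothie is a thick, cold beverage made from pureed raw fruit.",
    "Eggs Benedict is a traditional American breakfast or brunch dish.",
]

body = bulk_body(sample_texts)
print(body)
```

The body would be sent with a `Content-Type` of `application/x-ndjson`, with the target index named either in the URL or in each action line.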