# Introduction to Document Similarity with Elasticsearch


## Document Distance
A short intro to document similarity: finding a way to represent the distance between documents.

Two things are needed:
 - The first is to encode the documents as vectors. There are several options for this (one-hot, frequency, distributed).
 - The second is to decide how to measure distance: Euclidean? Manhattan? Cosine? There are tradeoffs, but cosine is a good place to start with text. Why? Documents encoded as vectors are sparse; each vector could be as long as the number of unique words across the full corpus. That means two documents of very different lengths (e.g. a single recipe and a cookbook) are encoded as vectors of the same length, but the cookbook's vector has a much larger magnitude, which can dominate a measure like Euclidean distance at the recipe's expense. Cosine distance compares the angle between the vectors rather than their magnitudes, which helps to correct for uneven document lengths and enables us to measure the distance between the book and the recipe.
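
To make the recipe-versus-cookbook point concrete, here is a minimal sketch using only the standard library. It assumes a simple whitespace tokenizer and frequency encoding; the function names and the toy documents are hypothetical and are not taken from the book's code.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Encode a document as a sparse frequency vector (token -> count)."""
    return Counter(text.lower().split())


def cosine_distance(a, b):
    """1 - cosine similarity between two sparse frequency vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return 1.0 - dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Straight-line distance between two sparse frequency vectors."""
    tokens = set(a) | set(b)
    return math.sqrt(sum((a[t] - b[t]) ** 2 for t in tokens))


# Hypothetical toy documents: the "cookbook" repeats the recipe 100 times,
# so it uses the same words in the same proportions but is far longer.
recipe = "melt chocolate stir in peanut butter pour into pie crust"
cookbook = " ".join([recipe] * 100)

recipe_vec = bag_of_words(recipe)
cookbook_vec = bag_of_words(cookbook)

print(euclidean_distance(recipe_vec, cookbook_vec))  # large: driven by length
print(cosine_distance(recipe_vec, cookbook_vec))     # near zero: same direction
```

Because the two vectors point in the same direction, the cosine distance is effectively zero even though the Euclidean distance grows with the length of the longer document.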
 
Document similarity is something we cover in the book, first in chapter 6 and then in chapter 10, where we build a chatbot that, among other things, uses a nearest neighbor search to recommend recipes that are similar to the ingredients listed by the user. One of my observations during the prototyping phase for that chapter was how slow vanilla nearest neighbor search is. This led me to think about different ways to optimize the search, from using variations like k-ball, to using other Python libraries like Spotify's Annoy, and also to other kinds of tools altogether that attempt to deliver similar results as quickly as possible. Like Elasticsearch!
 
## What is Elasticsearch?
Elasticsearch is a text search engine that leverages the information retrieval library Lucene together with a key-value store to ...


In [1]:
corpus_path = "/Users/rebeccabilbro/Desktop/waves/stuff/atap/code/mini_food_corpus"

In [2]:
import os
import bs4
import nltk
import codecs

from nltk.corpus.reader.api import CorpusReader
from nltk.corpus.reader.api import CategorizedCorpusReader
from readability.readability import Document as Paper
from readability.readability import Unparseable

DOC_PATTERN = r'(?!\.)[\w\s\d\-]+\.html'
CAT_PATTERN = r'([a-z_\s]+)/.*'

# Tags to extract as paragraphs from the HTML text
TAGS = [
    'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'h7', 'p', 'li'
]


class HTMLCorpusReader(CategorizedCorpusReader, CorpusReader):
    """
    A corpus reader for raw HTML documents to enable preprocessing.
    """

    def __init__(self, root, fileids=DOC_PATTERN,
                 encoding='latin-1', **kwargs):
        """
        Initialize the corpus reader.  Categorization arguments
        (``cat_pattern``, ``cat_map``, and ``cat_file``) are passed to
        the ``CategorizedCorpusReader`` constructor.  The remaining
        arguments are passed to the ``CorpusReader`` constructor.
        """
        # Add the default category pattern if not passed into the class.
        if not any(key.startswith('cat_') for key in kwargs.keys()):
            kwargs['cat_pattern'] = CAT_PATTERN

        CategorizedCorpusReader.__init__(self, kwargs)
        CorpusReader.__init__(self, root, fileids, encoding)

        self._tags = TAGS

    def resolve(self, fileids, categories):
        """
        Returns a list of fileids or categories depending on what is passed
        to each internal corpus reader function. Implemented similarly to
        the NLTK ``CategorizedPlaintextCorpusReader``.
        """
        if fileids is not None and categories is not None:
            raise ValueError("Specify fileids or categories, not both")

        if categories is not None:
            return self.fileids(categories)
        return fileids

    def docs(self, fileids=None, categories=None):
        """
        Returns the complete text of an HTML document, closing the document
        after we are done reading it and yielding it in a memory safe fashion.
        """
        # Resolve the fileids and the categories
        fileids = self.resolve(fileids, categories)

        # Create a generator, loading one document into memory at a time.
        for path, encoding in self.abspaths(fileids, include_encoding=True):
            with codecs.open(path, 'r', encoding=encoding) as f:
                yield f.read()

    def titles(self, fileids=None, categories=None):
        """
        Uses BeautifulSoup to identify titles from the
        head tags within the HTML
        """
        for doc in self.docs(fileids, categories):
            soup = bs4.BeautifulSoup(doc, 'lxml')
            try:
                yield soup.title.text
                soup.decompose()
            except AttributeError as e:
                continue
                
    def html(self, fileids=None, categories=None):
        """
        Returns the HTML content of each document, cleaning it using
        the readability-lxml library.
        """
        for doc in self.docs(fileids, categories):
            try:
                yield Paper(doc).summary()
            except Unparseable as e:
                print("Could not parse HTML: {}".format(e))
                continue

In [3]:
# Use the corpus reader to read in the pre-processed documents from disk
corpus = HTMLCorpusReader(corpus_path)

# Use the corpus reader's titles method to retrieve the title of each recipe
titles = list(corpus.titles())

In [7]:
# Print out the title of the 501st recipe
print(titles[500])

Kid-Friendly Mac & Cheese Recipe


## Start Elasticsearch
In the command line, start Elasticsearch.

    $  cd elasticsearch-<version>
    $  ./bin/elasticsearch


## Create an Index
Now we will create an index. Think of an index as analogous to a database in PostgreSQL or MongoDB. An Elasticsearch cluster can contain multiple _indices_ (analogous to relational or NoSQL databases), which in turn contain multiple _types_ (similar to MongoDB collections or PostgreSQL tables). These types hold multiple _documents_ (similar to MongoDB documents or PostgreSQL rows), and each document has _properties_ (like MongoDB document key-values or PostgreSQL columns).

    $  curl -X PUT "localhost:9200/recipes" -H 'Content-Type: application/json' -d'
        {
           "settings" : {
              "number_of_shards" : 1,
              "number_of_replicas" : 1
           }
        }
        '
And the response:
 
    {"acknowledged":true,"shards_acknowledged":true,"index":"recipes"}
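
The same request can also be assembled from Python. The sketch below uses only the standard library and builds the request without sending it; the function name `build_create_index_request` and the default host are assumptions made for illustration, not part of any Elasticsearch client.

```python
import json
import urllib.request


def build_create_index_request(index_name, shards=1, replicas=1,
                               host="http://localhost:9200"):
    """Build (but do not send) the PUT request that creates an index."""
    body = {
        "settings": {
            "number_of_shards": shards,
            "number_of_replicas": replicas,
        }
    }
    return urllib.request.Request(
        url="{}/{}".format(host, index_name),
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


request = build_create_index_request("recipes")
print(request.get_method(), request.full_url)

# Sending requires a running Elasticsearch node, so it is left commented out:
# with urllib.request.urlopen(request) as response:
#     print(json.loads(response.read().decode("utf-8")))
```

Keeping the request construction separate from the network call makes it easy to inspect the method, URL, and JSON body before anything is sent to the cluster.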



    
## Index the Recipes

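Now let's begin by indexing our recipe for Chocolate Peanut Butter Pie. We use `POST` rather than `PUT` because we are not supplying a document ID, so Elasticsearch generates one for us. Note that this request targets the `website` index (with a `recipes` type) rather than the `recipes` index created above; by default, Elasticsearch creates a missing index automatically on the first write.


    $  curl -X POST "localhost:9200/website/recipes" -H 'Content-Type: application/json' -d'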
        {
          "title": "Chocolate Peanut Butter Pie",
          "description":  "test test test"
        }
        '

## Search for Recipes

    $  curl -XGET 'localhost:9200/website/recipes/_search?q=title:Pie&pretty'
    
And the response:

    {
      "took" : 72,
      "timed_out" : false,
      "_shards" : {
        "total" : 5,
        "successful" : 5,
        "skipped" : 0,
        "failed" : 0
      },
      "hits" : {
        "total" : 1,
        "max_score" : 0.2876821,
        "hits" : [
          {
            "_index" : "website",
            "_type" : "recipes",
            "_id" : "DfDRFGQBpu4qb7-sCXfn",
            "_score" : 0.2876821,
            "_source" : {
              "title" : "Chocolate Peanut Butter Pie",
              "description" : "test test test"
            }
          }
        ]
      }
    }
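
As a rough illustration of how a response like this might be consumed, the sketch below parses a trimmed copy of the JSON above and pulls out each hit's title and relevance score. The helper name `summarize_hits` is hypothetical; only the field names come from the response itself.

```python
import json

# Trimmed copy of the search response shown above
raw_response = """
{
  "took": 72,
  "timed_out": false,
  "hits": {
    "total": 1,
    "max_score": 0.2876821,
    "hits": [
      {
        "_index": "website",
        "_type": "recipes",
        "_id": "DfDRFGQBpu4qb7-sCXfn",
        "_score": 0.2876821,
        "_source": {
          "title": "Chocolate Peanut Butter Pie",
          "description": "test test test"
        }
      }
    ]
  }
}
"""


def summarize_hits(response):
    """Return (title, score) pairs from a search response, best match first."""
    hits = response["hits"]["hits"]
    ranked = sorted(hits, key=lambda hit: hit["_score"], reverse=True)
    return [(hit["_source"]["title"], hit["_score"]) for hit in ranked]


results = summarize_hits(json.loads(raw_response))
print(results)
```

The original document lives under `_source`, while `_score` carries the relevance that Elasticsearch computed for the query.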
    
    
Now let's do it in bulk from Python.

In [5]:
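from elasticsearch.helpers import bulk


def make_documents(index_name, index_type, texts):
    # Yield one bulk action per recipe; assumes ``texts`` is an iterable
    # of dicts with "title" and "document" keys.
    for text in texts:
        yield {
            "_index": index_name,
            "_type": index_type,
            "title": text["title"],
            "description": text["document"]
        }

def index(es_instance, index_name, index_type, texts):
    # Stream the generated actions to Elasticsearch in bulk
    bulk(es_instance, make_documents(index_name, index_type, texts))

def clear_es_indices(es_instance, index_name):
    # Delete the index, ignoring errors if it does not exist
    es_instance.indices.delete(
        index=index_name,
        ignore=[400, 404]
    )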
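
For reference, Elasticsearch's `_bulk` endpoint expects newline-delimited JSON in which each document is preceded by an action line. The sketch below builds such a payload by hand with the standard library, purely to show the shape of the data; the function name `to_bulk_payload` and the sample recipes are hypothetical, and in practice the client's bulk helper takes care of this serialization.

```python
import json


def to_bulk_payload(index_name, index_type, texts):
    """Serialize recipes as newline-delimited JSON for the _bulk endpoint."""
    lines = []
    for text in texts:
        # Action line: tells Elasticsearch where to index the next document
        lines.append(json.dumps(
            {"index": {"_index": index_name, "_type": index_type}}
        ))
        # Source line: the document itself
        lines.append(json.dumps(
            {"title": text["title"], "description": text["document"]}
        ))
    # The bulk API requires the body to end with a newline
    return "\n".join(lines) + "\n"


# Hypothetical sample data: dicts with the "title" and "document" keys used above
sample = [
    {"title": "Chocolate Peanut Butter Pie", "document": "test test test"},
    {"title": "Kid-Friendly Mac & Cheese Recipe", "document": "test test test"},
]
payload = to_bulk_payload("website", "recipes", sample)
print(payload)
```

Each recipe contributes two lines, so a corpus of N recipes produces a payload of 2N lines sent in a single request instead of N separate ones.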