Elasticsearch
# Document indexing

<p style="font-size: large; margin-top: 100px;">César de Pablo Sánchez</p>
<p style="font-size: large">@zdepablo</p>

In [6]:
## Code from: https://www.reddit.com/r/IPython/comments/34t4m7/lpt_print_json_in_collapsible_format_in_ipython/

import uuid
from IPython.display import display_javascript, display_html, display
import json

class RenderJSON(object):
    def __init__(self, json_data):
        if isinstance(json_data, dict):
            self.json_str = json.dumps(json_data)
        else:
            self.json_str = json_data
        self.uuid = str(uuid.uuid4())

    def _ipython_display_(self):
        display_html('<div id="{}" style="height: 600px; width:100%;"></div>'.format(self.uuid),
            raw=True
        )
        display_javascript("""
        require(["https://rawgit.com/caldwell/renderjson/master/renderjson.js"], function() {
          document.getElementById('%s').appendChild(renderjson(%s))
        });
        """ % (self.uuid, self.json_str), raw=True)

## Document indexing
 - Index management
 - Schema (mapping) management
 - Data types
 - Text analysis

## Index management
 - ES automatically creates indices and document types when a document is indexed
     - However, this is not always the desired behaviour
 - Indices can be managed explicitly with the indices API
    - CRUD operations on an index: GET, POST, PUT, DELETE
    

## Indexing process
 - Indexing is generally a batch process
     - many options cannot be changed once the index has been created
     - any such change generally requires reindexing
     
 - ES can also index incrementally, even in (Near) Real Time 
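
Batch indexing is usually done through the `_bulk` endpoint, which takes newline-delimited JSON: one action line followed by one source line per document. The sketch below only builds such a body and does not send it. It assumes the `tvseries` index and `serie` type used later in this notebook; the helper name `build_bulk_body` and the two documents are made up for illustration.

```python
import json

def build_bulk_body(index, doc_type, docs):
    """Build a newline-delimited JSON body for the _bulk endpoint.

    `docs` maps a document id to its source dict. Each document becomes an
    action line followed by a source line, and the body ends with a newline.
    """
    lines = []
    for doc_id, source in sorted(docs.items()):
        action = {"index": {"_index": index, "_type": doc_type, "_id": doc_id}}
        lines.append(json.dumps(action, sort_keys=True))
        lines.append(json.dumps(source, sort_keys=True))
    return "\n".join(lines) + "\n"

# Illustrative documents (not real API responses)
docs = {
    "1": {"name": "Show A", "status": "Ended"},
    "2": {"name": "Show B", "status": "Running"},
}
bulk_body = build_bulk_body("tvseries", "serie", docs)
print(bulk_body)
```

The resulting string could then be sent in a single request, for example with `requests.post('http://localhost:9200/_bulk', data=bulk_body)`.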
 

## Getting information about an index

In [1]:
import requests

# The index does not exist yet
r = requests.get('http://localhost:9200/myindex?pretty')
print(r.text)

{
  "error" : {
    "root_cause" : [ {
      "type" : "index_not_found_exception",
      "reason" : "no such index",
      "resource.type" : "index_or_alias",
      "resource.id" : "myindex",
      "index" : "myindex"
    } ],
    "type" : "index_not_found_exception",
    "reason" : "no such index",
    "resource.type" : "index_or_alias",
    "resource.id" : "myindex",
    "index" : "myindex"
  },
  "status" : 404
}



## Creating an index
 - Configuration parameters can be set at creation time:
    - Replication and distribution (sharding) of the index
    - Default analyzers
    - Relevance (similarity) measure
    - etc.

## Creating an index

In [2]:
index_options = """
{
    "settings" : {
        "index" : {
            "number_of_shards" : 3,
            "number_of_replicas" : 2
        }
    }
}
"""

# Create an index
r = requests.put('http://localhost:9200/tvseries', data = index_options)
r.json()

{u'acknowledged': True}

In [3]:
r = requests.get('http://localhost:9200/tvseries?pretty')
print(r.text)

{
  "tvseries" : {
    "aliases" : { },
    "mappings" : { },
    "settings" : {
      "index" : {
        "creation_date" : "1456516494362",
        "number_of_shards" : "3",
        "number_of_replicas" : "2",
        "uuid" : "f0hoEKnKRC-GAV80Hsn97A",
        "version" : {
          "created" : "2010099"
        }
      }
    },
    "warmers" : { }
  }
}



In [13]:
r = requests.get('http://localhost:9200/tvseries/_aliases?pretty')
print(r.text)

{
  "tvseries" : {
    "aliases" : { }
  }
}



## Deleting the index

In [23]:
# Delete an index
r = requests.delete('http://localhost:9200/tvseries')
r.json()

{u'acknowledged': True}

## Closing and opening indices
  - A closed index
    - cannot be queried
    - takes no memory
    - still uses disk space

In [20]:
r = requests.post('http://localhost:9200/tvseries/_close')
print(r.text)

{"acknowledged":true}


In [25]:
r = requests.post('http://localhost:9200/tvseries/_open')
print(r.text)

{"acknowledged":true}


## Other index administration operations

These operations run automatically, but they can also be forced manually

  - **Optimize** - Reduces the number of Lucene index segments. The number of segments Lucene uses grows as documents are indexed; optimizing forces them to be merged into segments that are as large as possible.
  - **Refresh** - Makes the most recent operations on the index visible to search
  - **Flush**  - Frees memory by writing the index data to disk and clears the transaction log
  - **Clear cache** - Empties the index cache. For performance reasons each node keeps certain parts of the index cached.

## Other index administration operations

They are invoked like the close and open operations:
  <pre> Example: POST &lt;index&gt;/_flush </pre>
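
Following the same pattern, a small helper can build the URL for each maintenance operation. This is only a sketch: the function name and the operation table are illustrative, and the endpoint suffixes (`_refresh`, `_flush`, `_optimize`, `_cache/clear`) are assumed to match the Elasticsearch 2.x release used in this notebook (newer releases replace `_optimize` with `_forcemerge`).

```python
BASE_URL = 'http://localhost:9200'

# Operation name -> endpoint suffix (assumed for Elasticsearch 2.x)
ADMIN_ENDPOINTS = {
    'refresh': '_refresh',
    'flush': '_flush',
    'optimize': '_optimize',
    'clear_cache': '_cache/clear',
}

def admin_url(index, operation, base_url=BASE_URL):
    """Return the URL to POST to in order to force `operation` on `index`."""
    if operation not in ADMIN_ENDPOINTS:
        raise ValueError('unknown operation: %s' % operation)
    return '%s/%s/%s' % (base_url, index, ADMIN_ENDPOINTS[operation])

# Print the URL of every supported operation for one index
for op in sorted(ADMIN_ENDPOINTS):
    print(admin_url('tvseries', op))
```

Each URL would then be called like the close and open examples, e.g. `requests.post(admin_url('tvseries', 'flush'))`.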

## Document types

In a relational database, entities and their data types are defined in the physical schema => CREATE TABLE

In a search system, document types are defined through a *mapping*
  - ES does not require a schema to be defined, it can infer one => *dynamic mapping*
  - *dynamic mapping* is convenient during development
  - However, more control is generally needed
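
To give an intuition of what dynamic mapping does, the following toy function guesses a field type from a JSON value. It is a simplified approximation, not Elasticsearch's actual implementation: the function names are hypothetical, only one date format is recognised, and nested objects are not expanded.

```python
import re

# Simplified date check: only ISO-style dates such as 2008-01-20
DATE_RE = re.compile(r'^\d{4}-\d{2}-\d{2}$')

def infer_type(value):
    """Guess a field type for a JSON value, roughly as dynamic mapping would."""
    if isinstance(value, bool):  # tested before int: bool is a subclass of int
        return 'boolean'
    if isinstance(value, int):
        return 'long'
    if isinstance(value, float):
        return 'double'
    if isinstance(value, str):
        return 'date' if DATE_RE.match(value) else 'string'
    if isinstance(value, list):
        # An array takes the type of its first non-null element
        for item in value:
            if item is not None:
                return infer_type(item)
        return None
    if isinstance(value, dict):
        return 'object'
    return None  # null values do not add a field

def infer_mapping(doc):
    """Return {field: type} for the top-level fields of a document."""
    mapping = {}
    for field, value in doc.items():
        field_type = infer_type(value)
        if field_type is not None:
            mapping[field] = field_type
    return mapping

doc = {
    'name': 'Breaking Bad',
    'runtime': 60,
    'premiered': '2008-01-20',
    'genres': ['Drama', 'Crime', 'Thriller'],
    'rating': {'average': 9.3},
    'webChannel': None,
}
inferred = infer_mapping(doc)
for field in sorted(inferred):
    print(field, '->', inferred[field])
```

Note that every text value becomes a plain `string`, which is one reason an explicit mapping is usually preferred once the data is understood.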
  

##  Mappings
  - A mapping defines the type of each field of a document and how it is stored
  - Each mapping is generally associated with an index.
  - The mappings API allows:
     - Managing mappings - CRUD operations
     - Adding fields to an existing mapping (existing fields generally cannot be changed or removed without reindexing)

## Getting the mapping of a type
  - The *megacorp* index and the *employee* type were created by default when a document was added

In [5]:
import requests

employee = """
{
    "first_name" : "John",
    "last_name" :  "Smith",
    "age" :        25,
    "about" :      "I love to go rock climbing",
    "interests": [ "sports", "music" ]
}
"""

r = requests.put('http://localhost:9200/megacorp/employee/1?pretty', 
                 data = employee)

In [8]:
r = requests.get('http://localhost:9200/megacorp/_mappings')

In [10]:
RenderJSON(r.json())

In [29]:
r = requests.get('http://localhost:9200/tvseries/_mappings?pretty')
print(r.text)

{
  "tvseries" : {
    "mappings" : { }
  }
}



In [7]:
breaking_bad = requests.get('http://api.tvmaze.com/singlesearch/shows?q=breaking-bad')

In [8]:
RenderJSON(breaking_bad.json())

JSON formatter: https://jsonformatter.curiousconcept.com/

In [53]:
r = requests.delete('http://localhost:9200/tvseries/')
r.text

u'{"acknowledged":true}'

In [9]:
r = requests.post('http://localhost:9200/tvseries/serie/169', data = breaking_bad.text)
RenderJSON(r.json())

In [10]:
r = requests.get('http://localhost:9200/tvseries/serie/169')
RenderJSON(r.json())

In [18]:
r = requests.get('http://localhost:9200/tvseries/_search?q=genres:drama&pretty')
RenderJSON(r.json())

In [11]:
series = ['blindspot','the knick','house of cards', 'orange is the new black',
          'true detective', 'game of thrones',
          'the tudors','isabel', 'versailles', 'los serrano']

for s in series:
    # Look up each show on TVmaze and index its JSON document under the show's id
    show = requests.get('http://api.tvmaze.com/singlesearch/shows', params={'q': s})
    show_id = show.json()['id']
    response = requests.post('http://localhost:9200/tvseries/serie/' + str(show_id), data=show.text)
    print(s + " indexed: " + response.text)

blindspot indexed: {"_index":"tvseries","_type":"serie","_id":"1855","_version":1,"_shards":{"total":1,"successful":1,"failed":0},"created":true}
the knick indexed: {"_index":"tvseries","_type":"serie","_id":"51","_version":1,"_shards":{"total":1,"successful":1,"failed":0},"created":true}
house of cards indexed: {"_index":"tvseries","_type":"serie","_id":"175","_version":1,"_shards":{"total":1,"successful":1,"failed":0},"created":true}
orange is the new black indexed: {"_index":"tvseries","_type":"serie","_id":"170","_version":1,"_shards":{"total":1,"successful":1,"failed":0},"created":true}
true detective indexed: {"_index":"tvseries","_type":"serie","_id":"5","_version":1,"_shards":{"total":1,"successful":1,"failed":0},"created":true}
game of thrones indexed: {"_index":"tvseries","_type":"serie","_id":"82","_version":1,"_shards":{"total":1,"successful":1,"failed":0},"created":true}
the tudors indexed: {"_index":"tvseries","_type":"serie","_id":"712","_version":1,"_shards":{"total":1,

In [20]:
r = requests.get('http://localhost:9200/tvseries/_search?q=status:ended&pretty')
RenderJSON(r.json())

## Data types (fields): Basic types

  * Text: string
  * Numeric:
      - byte, short, integer, long
      - float, double
  * Boolean: boolean
  * Dates: date / format
 

## Data types: Text
The string type can actually be handled in several ways:

* Keywords:
   - for exact values: email addresses, codes, tags
   - typically used for filtering, sorting or aggregating

* Full Text:
   - for text fields we want to search in: the body of a news article, an email
   - typically used for searching

## Text data types: Important decisions

  - How is it indexed?
      - index: analyzed => full text
      - index: not_analyzed => keyword
  - How is it analyzed?  - analyzer
  - How is it stored?
      - term_vector: a term vector is stored for each document
      - store: the full content is stored
              
              
Reference: https://www.elastic.co/guide/en/elasticsearch/reference/current/string.html
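
The difference between `analyzed` and `not_analyzed` can be illustrated without a cluster. The toy functions below (hypothetical names, with a regular expression standing in for a real analyzer) show which terms would end up in the index for each setting, and why an exact term lookup behaves differently.

```python
import re

def index_terms(value, analyzed):
    """Return the terms stored for a string field.

    analyzed=True  -> lowercase word tokens (rough stand-in for full-text analysis)
    analyzed=False -> the whole value as a single exact term (keyword behaviour)
    """
    if analyzed:
        return re.findall(r'\w+', value.lower())
    return [value]

def matches(term, value, analyzed):
    """True if an exact term lookup would find the value."""
    return term in index_terms(value, analyzed)

title = 'Breaking Bad'
print(index_terms(title, analyzed=True))
print(index_terms(title, analyzed=False))
print(matches('bad', title, analyzed=True))
print(matches('bad', title, analyzed=False))
print(matches('Breaking Bad', title, analyzed=False))
```

With `analyzed` the single word `bad` finds the document; with `not_analyzed` only the complete, case-sensitive value does, which is what filtering, sorting and aggregating need.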

## Data types: Complex types
   - Null values
   - Arrays
   - Objects
   - Nested
   - ES-specific types
        - Geo coordinates - geo_point, geo_shape
        - IPs
        - Completion
  

## Analyzers

 - A sequence of transformations applied to a text field before it is indexed.
 - Goals:
    - Clean and normalize the text
    - Improve relevance - by removing common words
    - Add synonyms
    - etc.

### Standard Analyzer

In [12]:
summary = '''Breaking Bad follows protagonist Walter White, a chemistry teacher who lives in New Mexico with his wife 
          and teenage son who has cerebral palsy. White is diagnosed with Stage III cancer and given a prognosis of 
          two years left to live. With a new sense of fearlessness based on his medical prognosis, and a desire to secure
          his family's financial security, White chooses to enter a dangerous world of drugs and crime and ascends to power 
          in this world. The series explores how a fatal diagnosis such as White's releases a typical man from the daily 
          concerns and constraints of normal society and follows his transformation from mild family man to a kingpin 
          of the drug trade.'''


In [13]:
r = requests.post('http://localhost:9200/_analyze?analyzer=standard' , 
                  data = summary)
RenderJSON(r.json())

- removes most punctuation
- splits the text into tokens at word boundaries (mostly whitespace and punctuation)
- lowercases all terms (tokens)
 
- what does it do with the Saxon genitive ('s)?

### Keyword Analyzer

In [14]:
r = requests.post('http://localhost:9200/_analyze?analyzer=keyword' , data = summary)
RenderJSON(r.json())


## Types of analyzers

 * Standard analyzer
 * Simple analyzer
 * Whitespace analyzer
 * Stop analyzer
 * Keyword analyzer
 * Pattern analyzer - splits words using regular expressions

## Types of analyzers (II)

 * Language analyzer
    - 33 languages: english, spanish, catalan, basque, galician, portuguese, german, french, arabic ...  
 * [Snowball analyzer](http://snowballstem.org/) 
    - 14 languages: important for historical reasons, still frequently used
 * Custom analyzer 

### English analyzer

In [15]:
r = requests.post('http://localhost:9200/_analyze?analyzer=english&pretty' , data = summary)
RenderJSON(r.json())


- removes most punctuation
- splits mainly on whitespace, but takes the Saxon genitive into account
- lowercases all terms (*tokens*)
- removes certain very common terms ([*stopwords*](http://members.unine.ch/jacques.savoy/clef/index.html)): a, the, in ...
- applies *stemming* - reduces words to a heuristic root form

## Structure of an analyzer

  - [Character filters](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-charfilters.html) - zero or more
    - Strip HTML markup
    - Map or replace characters
    - etc.
  - [Tokenizer](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-tokenizers.html) - exactly one (splits the text into tokens, usually discarding punctuation)
  - [Token filters](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-tokenfilters.html) - zero or more
    - Lowercasing
    - Removing accents and diacritics
    - Stopwords
    - Stemming
    - Synonyms
    - etc.
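
The three stages can be sketched as plain Python functions chained together. This is an illustration of the structure only, not how Elasticsearch implements it: all function names and the tiny stopword list are made up, and lowercasing is applied as a token filter, as the built-in analyzers do.

```python
import re

def strip_html(text):
    """Character filter: remove HTML tags before tokenizing."""
    return re.sub(r'<[^>]+>', ' ', text)

def word_tokenizer(text):
    """Tokenizer: split the text into word tokens (apostrophes kept inside words)."""
    return re.findall(r"[A-Za-z0-9]+(?:'[A-Za-z]+)?", text)

def lowercase(tokens):
    """Token filter: lowercase every token."""
    return [t.lower() for t in tokens]

STOPWORDS = {'a', 'an', 'and', 'the', 'of', 'in', 'to', 'is'}  # tiny illustrative list

def remove_stopwords(tokens):
    """Token filter: drop very common words."""
    return [t for t in tokens if t not in STOPWORDS]

def analyze(text, char_filters, tokenizer, token_filters):
    """Run zero or more character filters, one tokenizer, zero or more token filters."""
    for char_filter in char_filters:
        text = char_filter(text)
    tokens = tokenizer(text)
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens

tokens = analyze("<p>White's family lives in New Mexico</p>",
                 char_filters=[strip_html],
                 tokenizer=word_tokenizer,
                 token_filters=[lowercase, remove_stopwords])
print(tokens)
```

The order of the token filters matters: the stopword list is lowercase, so it only works after the `lowercase` filter has run.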

## Structure of an analyzer: Example

In [None]:
<pre>
{
  "settings": {
    "analysis": {
      "filter": {
        "english_stop": {
          "type":       "stop",
          "stopwords":  "_english_" 
        },
        "english_keywords": {
          "type":       "keyword_marker",
          "keywords":   [] 
        },
        "english_stemmer": {
          "type":       "stemmer",
          "language":   "english"
        },
        "english_possessive_stemmer": {
          "type":       "stemmer",
          "language":   "possessive_english"
        }
      },
      "analyzer": {
        "english": {
          "tokenizer":  "standard",
          "filter": [
            "english_possessive_stemmer",
            "lowercase",
            "english_stop",
            "english_keywords",
            "english_stemmer"
          ]
        }
      }
    }
  }
}
</pre>

## Why use *stemming*?

Most languages have a morphological component that generally does not change the meaning

* Number: fox, foxes
* Tense: pay, paid, paying
* Gender: waiter, waitress
* Person: hear, hears
* Case: I, me, my
* Aspect: ate, eaten

### Stemming

However, stemming that is 100% correct is hard to achieve, because to be practical it has to be heuristic.

The alternative is *lemmatization*, which relies on morphological analysis and the context of the word; it is much more precise but computationally more expensive. Even so, its results are not significantly better for the search task.
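
A deliberately naive suffix stripper shows both why a heuristic stemmer is cheap and where it goes wrong. This is a toy written for this example (the suffix list and `toy_stem` are hypothetical), not the Porter or Snowball algorithm used by the real analyzers.

```python
SUFFIXES = ['ing', 'es', 'ed', 's']  # tried in this order

def toy_stem(word, min_stem=3):
    """Strip the first matching suffix if at least `min_stem` characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[:-len(suffix)]
    return word

for word in ['foxes', 'paying', 'hears', 'paid', 'series', 'eaten']:
    print(word, '->', toy_stem(word))
```

Regular forms such as `foxes`, `paying` and `hears` are reduced to `fox`, `pay` and `hear`, but the irregular `paid` and `eaten` are left untouched and `series` is over-stemmed to `seri`. A lemmatizer would handle these cases, at the cost of a dictionary and morphological analysis.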

### Which analysis is used on my field?

In [16]:

summary = '''Breaking Bad follows protagonist Walter White, a chemistry teacher who lives in New Mexico with his wife 
          and teenage son who has cerebral palsy. White is diagnosed with Stage III cancer and given a prognosis of 
          two years left to live. With a new sense of fearlessness based on his medical prognosis, and a desire to secure
          his family's financial security, White chooses to enter a dangerous world of drugs and crime and ascends to power 
          in this world. The series explores how a fatal diagnosis such as White's releases a typical man from the daily 
          concerns and constraints of normal society and follows his transformation from mild family man to a kingpin 
          of the drug trade.'''

r = requests.post('http://localhost:9200/tvseries/_analyze?field=summary&pretty' , data = summary)
RenderJSON(r.json())

### Which analysis is used on my field?  (II)

In [17]:
name = '''Breaking Bad'''

r = requests.get('http://localhost:9200/tvseries/_analyze?field=name&pretty' , data = name)
RenderJSON(r.json())


## Defining our own custom mapping

In [18]:
r = requests.get('http://localhost:9200/tvseries/_mappings')
RenderJSON(r.json())

### Updating a mapping...

In [19]:
mapping = '''{
    "serie" : {
        "properties" : {
            "name" : {"type" : "string", "store" : true }
        }
    }
}'''

r = requests.put('http://localhost:9200/tvseries/_mappings/serie?pretty' , data = mapping)
print(r.text)

{
  "error" : {
    "root_cause" : [ {
      "type" : "merge_mapping_exception",
      "reason" : "Merge failed with failures {[mapper [name] has different [store] values]}"
    } ],
    "type" : "merge_mapping_exception",
    "reason" : "Merge failed with failures {[mapper [name] has different [store] values]}"
  },
  "status" : 400
}



### Defining the mapping when creating the index

In [20]:
index_options = '''
{ 
  "mappings" : { 
      "my_serie" : {
        "properties" : {
          "_links" : {
            "properties" : {
              "nextepisode" : {
                "properties" : {
                  "href" : {
                    "type" : "string"
                  }
                }
              },
              "previousepisode" : {
                "properties" : {
                  "href" : {
                    "type" : "string"
                  }
                }
              },
              "self" : {
                "properties" : {
                  "href" : {
                    "type" : "string"
                  }
                }
              }
            }
          },
          "externals" : {
            "properties" : {
              "imdb" : {
                "type" : "string"
              },
              "thetvdb" : {
                "type" : "long"
              },
              "tvrage" : {
                "type" : "long"
              }
            }
          },
          "genres" : {
            "type" : "string"
          },
          "id" : {
            "type" : "long"
          },
          "image" : {
            "properties" : {
              "medium" : {
                "type" : "string"
              },
              "original" : {
                "type" : "string"
              }
            }
          },
          "language" : {
            "type" : "string"
          },
          "name" : {
            "type" : "string",
            "index": "not_analyzed"
          },
          "network" : {
            "properties" : {
              "country" : {
                "properties" : {
                  "code" : {
                    "type" : "string"
                  },
                  "name" : {
                    "type" : "string"
                  },
                  "timezone" : {
                    "type" : "string"
                  }
                }
              },
              "id" : {
                "type" : "long"
              },
              "name" : {
                "type" : "string"
              }
            }
          },
          "premiered" : {
            "type" : "date",
            "format" : "strict_date_optional_time||epoch_millis"
          },
          "rating" : {
            "properties" : {
              "average" : {
                "type" : "double"
              }
            }
          },
          "runtime" : {
            "type" : "long"
          },
          "schedule" : {
            "properties" : {
              "days" : {
                "type" : "string"
              },
              "time" : {
                "type" : "string"
              }
            }
          },
          "status" : {
            "type" : "string"
          },
          "summary" : {
            "type" : "string"
          },
          "type" : {
            "type" : "string"
          },
          "updated" : {
            "type" : "long"
          },
          "url" : {
            "type" : "string"
          },
          "webChannel" : {
            "properties" : {
              "country" : {
                "properties" : {
                  "code" : {
                    "type" : "string"
                  },
                  "name" : {
                    "type" : "string"
                  },
                  "timezone" : {
                    "type" : "string"
                  }
                }
              },
              "id" : {
                "type" : "long"
              },
              "name" : {
                "type" : "string"
              }
            }
          },
          "weight" : {
            "type" : "long"
          }
        }
      }
  } 
}
'''

In [34]:
RenderJSON(index_options)

In [21]:

requests.delete('http://localhost:9200/tvseries')
requests.delete('http://localhost:9200/my_tvseries')

r = requests.post('http://localhost:9200/my_tvseries', data = index_options)
print(r.text)

{"acknowledged":true}


In [22]:
name = '''Breaking Bad'''

r = requests.get('http://localhost:9200/my_tvseries/_analyze?field=name&pretty' , data = name)
print(r.text)


{
  "tokens" : [ {
    "token" : "Breaking Bad",
    "start_offset" : 0,
    "end_offset" : 12,
    "type" : "word",
    "position" : 0
  } ]
}



## Defining the mapping (II)

In [25]:
index_options = '''
{ 
  "mappings" : { 
      "my_serie" : {
        "properties" : {
          "_links" : {
            "properties" : {
              "nextepisode" : {
                "properties" : {
                  "href" : {
                    "type" : "string"
                  }
                }
              },
              "previousepisode" : {
                "properties" : {
                  "href" : {
                    "type" : "string"
                  }
                }
              },
              "self" : {
                "properties" : {
                  "href" : {
                    "type" : "string"
                  }
                }
              }
            }
          },
          "externals" : {
            "properties" : {
              "imdb" : {
                "type" : "string"
              },
              "thetvdb" : {
                "type" : "long"
              },
              "tvrage" : {
                "type" : "long"
              }
            }
          },
          "genres" : {
            "type" : "string"
          },
          "id" : {
            "type" : "long"
          },
          "image" : {
            "properties" : {
              "medium" : {
                "type" : "string"
              },
              "original" : {
                "type" : "string"
              }
            }
          },
          "language" : {
            "type" : "string"
          },
          "name" : {
            "type" : "string",
            "index": "not_analyzed"
          },
          "network" : {
            "properties" : {
              "country" : {
                "properties" : {
                  "code" : {
                    "type" : "string"
                  },
                  "name" : {
                    "type" : "string"
                  },
                  "timezone" : {
                    "type" : "string"
                  }
                }
              },
              "id" : {
                "type" : "long"
              },
              "name" : {
                "type" : "string"
              }
            }
          },
          "premiered" : {
            "type" : "date",
            "format" : "strict_date_optional_time||epoch_millis"
          },
          "rating" : {
            "properties" : {
              "average" : {
                "type" : "double"
              }
            }
          },
          "runtime" : {
            "type" : "long"
          },
          "schedule" : {
            "properties" : {
              "days" : {
                "type" : "string"
              },
              "time" : {
                "type" : "string"
              }
            }
          },
          "status" : {
            "type" : "string"
          },
          "summary" : {
            "type" : "string",
            "index": "analyzed",
            "analyzer": "english"
          },
          "type" : {
            "type" : "string"
          },
          "updated" : {
            "type" : "long"
          },
          "url" : {
            "type" : "string"
          },
          "webChannel" : {
            "properties" : {
              "country" : {
                "properties" : {
                  "code" : {
                    "type" : "string"
                  },
                  "name" : {
                    "type" : "string"
                  },
                  "timezone" : {
                    "type" : "string"
                  }
                }
              },
              "id" : {
                "type" : "long"
              },
              "name" : {
                "type" : "string"
              }
            }
          },
          "weight" : {
            "type" : "long"
          }
        }
      }
  } 
}
'''

In [26]:
requests.delete('http://localhost:9200/my_tvseries')

r = requests.post('http://localhost:9200/my_tvseries', data = index_options)
print(r.text)

{"acknowledged":true}


In [27]:
r = requests.post('http://localhost:9200/my_tvseries/my_serie/169', data = breaking_bad.text)
r.text

u'{"_index":"my_tvseries","_type":"my_serie","_id":"169","_version":1,"_shards":{"total":1,"successful":1,"failed":0},"created":true}'

In [43]:
summary = '''Breaking Bad follows protagonist Walter White, a chemistry teacher who lives in New Mexico with his wife 
          and teenage son who has cerebral palsy. White is diagnosed with Stage III cancer and given a prognosis of 
          two years left to live. With a new sense of fearlessness based on his medical prognosis, and a desire to secure
          his family's financial security, White chooses to enter a dangerous world of drugs and crime and ascends to power 
          in this world. The series explores how a fatal diagnosis such as White's releases a typical man from the daily 
          concerns and constraints of normal society and follows his transformation from mild family man to a kingpin 
          of the drug trade.'''

r = requests.post('http://localhost:9200/my_tvseries/_analyze?field=summary' , data = summary)
RenderJSON(r.json())

### Defining the mapping (III)

In [23]:
index_options = '''
{ 
  "mappings" : { 
      "my_serie" : {
        "properties" : {
          "_links" : {
            "properties" : {
              "nextepisode" : {
                "properties" : {
                  "href" : {
                    "type" : "string",
                    "index" : "no"
                  }
                }
              },
              "previousepisode" : {
                "properties" : {
                  "href" : {
                    "type" : "string",
                    "index" : "no"
                  }
                }
              },
              "self" : {
                "properties" : {
                  "href" : {
                    "type" : "string",
                    "index" : "no"
                   }
                }
              }
            }
          },
          "externals" : {
            "properties" : {
              "imdb" : {
                "type" : "string",
                "index" : "no"
              },
              "thetvdb" : {
                "type" : "long",
                "index": "no"
              },
              "tvrage" : {
                "type" : "long",
                "index": "no"
              }
            }
          },
          "genres" : {
            "type" : "string",
            "index": "not_analyzed"
          },
          "id" : {
            "type" : "long"
          },
          "image" : {
            "properties" : {
              "medium" : {
                "type" : "string",
                "index": "no"
              },
              "original" : {
                "type" : "string",
                "index": "no"
              }
            }
          },
          "language" : {
            "type" : "string",
            "index": "not_analyzed"
          },
          "name" : {
            "type" : "string"
          },
          "network" : {
            "properties" : {
              "country" : {
                "properties" : {
                  "code" : {
                    "type" : "string",
                    "index": "not_analyzed"
                  },
                  "name" : {
                    "type" : "string"
                  },
                  "timezone" : {
                    "type" : "string",
                    "index": "not_analyzed"
                  }
                }
              },
              "id" : {
                "type" : "long"
              },
              "name" : {
                "type" : "string"
              }
            }
          },
          "premiered" : {
            "type" : "date",
            "format" : "strict_date_optional_time||epoch_millis"
          },
          "rating" : {
            "properties" : {
              "average" : {
                "type" : "double"
              }
            }
          },
          "runtime" : {
            "type" : "long"
          },
          "schedule" : {
            "properties" : {
              "days" : {
                "type" : "string",
                "index": "not_analyzed"
              },
              "time" : {
                "type" : "date",
                "format" : "hour_minute",
                "ignore_malformed": true
              }
            }
          },
          "status" : {
            "type" : "string",
            "index": "not_analyzed"            
          },
          "summary" : {
            "type" : "string",
            "index": "analyzed",
            "analyzer": "english"
          },
          "type" : {
            "type" : "string",
            "index": "not_analyzed"            
          },
          "updated" : {
            "type" : "long"
          },
          "url" : {
            "type" : "string",
            "index": "not_analyzed"            
          },
          "webChannel" : {
            "properties" : {
              "country" : {
                "properties" : {
                  "code" : {
                    "type" : "string",
                    "index": "not_analyzed"
                  },
                  "name" : {
                    "type" : "string"
                  },
                  "timezone" : {
                    "type" : "string",
                    "index": "not_analyzed"
                  }
                }
              },
              "id" : {
                "type" : "long"
              },
              "name" : {
                "type" : "string"
              }
            }
          },
          "weight" : {
            "type" : "long"
          }
        }
      }
    }
  } 
'''

In [24]:
requests.delete('http://localhost:9200/my_tvseries')

r = requests.post('http://localhost:9200/my_tvseries', data = index_options)
print(r.text)

{"acknowledged":true}


In [25]:
r = requests.post('http://localhost:9200/my_tvseries/my_serie/169', data = breaking_bad.text)
r.text

u'{"_index":"my_tvseries","_type":"my_serie","_id":"169","_version":1,"_shards":{"total":1,"successful":1,"failed":0},"created":true}'

In [26]:
series = ['blindspot','the knick','house of cards', 'orange is the new black',
          'true detective', 'game of thrones',
          'the tudors','isabel', 'versailles', 'los serrano']

for s in series:
    # Look up each show on TVmaze and index its JSON document under the show's id
    show = requests.get('http://api.tvmaze.com/singlesearch/shows', params={'q': s})
    show_id = show.json()['id']
    response = requests.post('http://localhost:9200/my_tvseries/my_serie/' + str(show_id), data=show.text)
    print(s + " indexed: " + response.text)

blindspot indexed: {"_index":"my_tvseries","_type":"my_serie","_id":"1855","_version":1,"_shards":{"total":1,"successful":1,"failed":0},"created":true}
the knick indexed: {"_index":"my_tvseries","_type":"my_serie","_id":"51","_version":1,"_shards":{"total":1,"successful":1,"failed":0},"created":true}
house of cards indexed: {"_index":"my_tvseries","_type":"my_serie","_id":"175","_version":1,"_shards":{"total":1,"successful":1,"failed":0},"created":true}
orange is the new black indexed: {"_index":"my_tvseries","_type":"my_serie","_id":"170","_version":1,"_shards":{"total":1,"successful":1,"failed":0},"created":true}
true detective indexed: {"_index":"my_tvseries","_type":"my_serie","_id":"5","_version":1,"_shards":{"total":1,"successful":1,"failed":0},"created":true}
game of thrones indexed: {"_index":"my_tvseries","_type":"my_serie","_id":"82","_version":1,"_shards":{"total":1,"successful":1,"failed":0},"created":true}
the tudors indexed: {"_index":"my_tvseries","_type":"my_serie","_id
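
The loop above sends one HTTP request per document. As a rough sketch of an alternative, the `_bulk` endpoint accepts many index actions in a single newline-delimited body. The helper name `build_bulk_body` and the abbreviated sample documents are assumptions made for illustration; the index and type names are the ones used above.

```python
import json

def build_bulk_body(docs, index, doc_type):
    """Build a newline-delimited _bulk body from (doc_id, source) pairs."""
    lines = []
    for doc_id, source in docs:
        # Each document takes two lines: the action metadata, then the source
        action = {"index": {"_index": index, "_type": doc_type, "_id": str(doc_id)}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(source))
    # The bulk body has to end with a newline
    return "\n".join(lines) + "\n"

# Hypothetical, heavily abbreviated documents
sample_docs = [(169, {"name": "Breaking Bad"}), (51, {"name": "The Knick"})]
bulk_body = build_bulk_body(sample_docs, "my_tvseries", "my_serie")

# Sending it needs a running node, so it is left commented out:
# import requests
# r = requests.post('http://localhost:9200/_bulk', data=bulk_body)
```

For a dozen documents the difference is negligible; bulk requests start to matter when indexing thousands of documents in a batch.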

#### Some checks

In [28]:
r = requests.get('http://localhost:9200/my_tvseries/_search?q=genres:Drama&pretty')
print r.text

{
  "took" : 14,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "hits" : {
    "total" : 9,
    "max_score" : 1.0953102,
    "hits" : [ {
      "_index" : "my_tvseries",
      "_type" : "my_serie",
      "_id" : "51",
      "_score" : 1.0953102,
      "_source":{"id":51,"url":"http://www.tvmaze.com/shows/51/the-knick","name":"The Knick","type":"Scripted","language":"English","genres":["Drama","Medical"],"status":"To Be Determined","runtime":60,"premiered":"2014-08-08","schedule":{"time":"22:00","days":["Friday"]},"rating":{"average":8.7},"weight":2,"network":{"id":19,"name":"Cinemax","country":{"name":"United States","code":"US","timezone":"America/New_York"}},"webChannel":null,"externals":{"tvrage":36033,"thetvdb":279977,"imdb":"tt2937900"},"image":{"medium":"http://tvmazecdn.com/uploads/images/medium_portrait/0/417.jpg","original":"http://tvmazecdn.com/uploads/images/original_untouched/0/417.jpg"},"summary":"<p>New York City, 190

In [49]:
r = requests.get('http://localhost:9200/my_tvseries/_search?q=genres:Drama&pretty')
print r.text

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "hits" : {
    "total" : 9,
    "max_score" : 1.0953102,
    "hits" : [ {
      "_index" : "my_tvseries",
      "_type" : "my_serie",
      "_id" : "169",
      "_score" : 1.0953102,
      "_source":{"id":169,"url":"http://www.tvmaze.com/shows/169/breaking-bad","name":"Breaking Bad","type":"Scripted","language":"English","genres":["Drama","Crime","Thriller"],"status":"Ended","runtime":60,"premiered":"2008-01-20","schedule":{"time":"22:00","days":["Sunday"]},"rating":{"average":9.3},"weight":2,"network":{"id":20,"name":"AMC","country":{"name":"United States","code":"US","timezone":"America/New_York"}},"webChannel":null,"externals":{"tvrage":18164,"thetvdb":81189,"imdb":"tt0903747"},"image":{"medium":"http://tvmazecdn.com/uploads/images/medium_portrait/0/2400.jpg","original":"http://tvmazecdn.com/uploads/images/original_untouched/0/2400.jpg"},"summary":"<p><em><strong>\"B

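Reading the raw response is fine for a quick check, but the same information can be pulled out of the parsed JSON. The sketch below assumes the response shape printed above (a numeric `hits.total` and a `name` field in `_source`); `summarize_hits` is a hypothetical helper and `sample` is an abbreviated stand-in for `r.json()`.

```python
def summarize_hits(result):
    """Return (total, [(id, score, name), ...]) from a search response dict."""
    hits = result["hits"]
    rows = [(h["_id"], h["_score"], h["_source"].get("name")) for h in hits["hits"]]
    return hits["total"], rows

# Abbreviated response in the shape shown above
sample = {
    "hits": {
        "total": 9,
        "max_score": 1.0953102,
        "hits": [
            {"_index": "my_tvseries", "_type": "my_serie", "_id": "169",
             "_score": 1.0953102, "_source": {"id": 169, "name": "Breaking Bad"}}
        ]
    }
}
total, rows = summarize_hits(sample)
```

With a live node, `summarize_hits(r.json())` would give the same summary for the query above.
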
## Default search field: *_all*

- A field of type string that Elasticsearch adds to every document by default
- Holds the values of all the other fields concatenated into a single string, separated by spaces
- Its content is **analyzed** (with the default analyzer unless another one is configured) and **indexed**, but not **stored**
- It can therefore be searched but not retrieved
- It lets a query match text in any field without naming the field

It can be disabled:
  - entirely for a type
  - for an individual field
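
As a minimal sketch of those two options, assuming the string-typed mappings used elsewhere in this notebook: `"_all": {"enabled": false}` switches the field off for a type, and `"include_in_all": false` leaves a single field out of it. The choice of `url` as the excluded field is only an example.

```python
import json

# Option 1: disable _all entirely for the type
mapping_no_all = {
    "mappings": {
        "my_serie": {
            "_all": {"enabled": False}
        }
    }
}

# Option 2: keep _all, but leave one field (here `url`, an arbitrary example) out of it
mapping_exclude_field = {
    "mappings": {
        "my_serie": {
            "properties": {
                "url": {"type": "string", "include_in_all": False}
            }
        }
    }
}

# json.dumps turns Python's False into JSON's false
no_all_body = json.dumps(mapping_no_all)
exclude_field_body = json.dumps(mapping_exclude_field)

# Either body could be sent when creating the index, for example:
# requests.post('http://localhost:9200/my_tvseries', data=no_all_body)
```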


In [29]:
r = requests.get('http://localhost:9200/my_tvseries/_search?q=madrid')
RenderJSON(r.json())

In [73]:
r = requests.get('http://localhost:9200/my_tvseries/_search?q=network.country.timezone:madrid&pretty')
RenderJSON(r.json())

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "hits" : {
    "total" : 0,
    "max_score" : null,
    "hits" : [ ]
  }
}



In [72]:
r = requests.get('http://localhost:9200/my_tvseries/_search?q=network.country.timezone:Europe\\/Madrid&pretty')
RenderJSON(r.json())

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 2.2992828,
    "hits" : [ {
      "_index" : "my_tvseries",
      "_type" : "my_serie",
      "_id" : "9274",
      "_score" : 2.2992828,
      "_source":{"id":9274,"url":"http://www.tvmaze.com/shows/9274/isabel","name":"Isabel","type":"Scripted","language":"Spanish","genres":["Drama"],"status":"Ended","runtime":60,"premiered":"2012-09-10","schedule":{"time":"","days":[]},"rating":{"average":null},"weight":0,"network":{"id":147,"name":"RTVE","country":{"name":"Spain","code":"ES","timezone":"Europe/Madrid"}},"webChannel":null,"externals":{"tvrage":32792,"thetvdb":262381,"imdb":"tt2011533"},"image":{"medium":"http://tvmazecdn.com/uploads/images/medium_portrait/32/81594.jpg","original":"http://tvmazecdn.com/uploads/images/original_untouched/32/81594.jpg"},"summary":"<p>Life of Isabella I of Castile, also known as Isabella the C

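The backslash in the last query is needed because `/` is a reserved character in the query string syntax. The following is a small sketch of a helper that escapes such characters before building the `q` parameter. The name `escape_query_term` is an assumption, and the character list follows the reserved characters of the query string syntax as far as they can be escaped (`<` and `>` cannot).

```python
import re

# Characters (and the two-character operators && and ||) that carry
# special meaning in the query string syntax
_RESERVED = re.compile(r'([+\-=!(){}\[\]^"~*?:\\/]|&&|\|\|)')

def escape_query_term(term):
    """Prefix every reserved character in `term` with a backslash."""
    return _RESERVED.sub(r'\\\1', term)

query = "network.country.timezone:" + escape_query_term("Europe/Madrid")

# Passing the query through `params` lets requests handle the URL encoding:
# r = requests.get('http://localhost:9200/my_tvseries/_search', params={'q': query})
```

Only the value is escaped here; the `:` between field name and value has to stay unescaped so it is still read as the field separator.
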
## Multiple mappings for a field: *copy_to*

  - Copies the content of a field into another, new field
  - The new field can use a different analysis.
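
A minimal sketch of such a mapping, assuming the string-typed mapping syntax used in this notebook: `name` keeps its default analysis and is copied into `name_exact`, which is indexed as a single unanalyzed term. The field name `name_exact` and the choice of `name` as source are hypothetical, chosen only to illustrate the parameter.

```python
import json

copy_to_options = {
    "mappings": {
        "my_serie": {
            "properties": {
                # Analyzed as usual, and copied into name_exact at index time
                "name": {"type": "string", "copy_to": "name_exact"},
                # Hypothetical target field: same content, but not analyzed
                "name_exact": {"type": "string", "index": "not_analyzed"}
            }
        }
    }
}
copy_to_body = json.dumps(copy_to_options)

# The body could be sent when creating the index, for example:
# requests.post('http://localhost:9200/my_tvseries', data=copy_to_body)
```

The copied value does not appear in `_source`; it only exists in the index, so `name_exact` can be queried but the document returned is unchanged.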
  