# Interactive Tech Talk
## Elasticsearch + Python 
By Sean Kozer

The purpose of this interactive tech talk is to familiarise you with Elasticsearch concepts and how to apply them in Python projects.

You will be searching, filtering and performing statistical analysis on movie data from IMDB.

### Connecting to Elasticsearch

The first step is to create a connection. We will be creating a persistent connection so we won't have to explicitly pass it into each method call.

The `"elasticsearch"` hostname is the name of the Elasticsearch service in `docker-compose.yml`.

In [1]:
from elasticsearch_dsl import connections

# Create the default connection to Elasticsearch
client = connections.create_connection(hosts=["elasticsearch"], timeout=20)

### Defining the schema

Elasticsearch is technically a schemaless datastore, but to avoid issues down the track, it's best to explicitly define a mapping.

Elasticsearch-DSL provides a convenient wrapper which bears some similarities with Django models.

We'll be using several different datatypes for our movie data.

- For string fields which require full-text search, we're going to use the `Text` type.
- For string fields which are only used for filtering and sorting, we're going to use the `Keyword` type.
- The other datatypes we'll use are `Integer` and `Date`, which are self-explanatory in this demo.
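To make the difference between `Text` and `Keyword` concrete, here is a rough pure-Python sketch. It does not talk to Elasticsearch: `analyze_text` and `analyze_keyword` are hypothetical helpers that only approximate what happens to a value at index time (the real `standard` analyzer follows Unicode segmentation rules and is more careful than a regular expression).

```python
import re

def analyze_text(value):
    """Approximate a Text field: lowercase, then split into word tokens."""
    return re.findall(r"\w+", value.lower())

def analyze_keyword(value):
    """Approximate a Keyword field: the whole value is kept as one term."""
    return [value]

title = "Three Colors: Blue"
text_terms = analyze_text(title)
keyword_terms = analyze_keyword(title)

print(text_terms)     # ['three', 'colors', 'blue']
print(keyword_terms)  # ['Three Colors: Blue']

# A single lowercase word matches one of the Text terms,
# but not the untouched Keyword term
print("blue" in text_terms)     # True
print("blue" in keyword_terms)  # False
```

This is why `Text` suits free-form fields such as `title` and `overview`, while `Keyword` suits fixed labels such as `genre`.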

In [2]:
from elasticsearch_dsl import Date, Document, Integer, Keyword, Text

INDEX_NAME = "movies"

class Movie(Document):
    title = Text()
    overview = Text()
    genre = Keyword()
    release_date = Date()
    revenue = Integer()
    production_companies = Keyword()

    class Index:
        name = INDEX_NAME
    
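    def get_display_name(self):
        # Some movies have no release date, so fall back to the title alone
        if not self.release_date:
            return self.title
        year = self.release_date.strftime('%Y')
        return '{title} ({year})'.format(title=self.title, year=year)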
    
    def __repr__(self):
        return '<Movie: {}>'.format(self.get_display_name())

### Creating the index and mapping

Now that we have our mapping, it's time to push it to Elasticsearch.

In [66]:
# Create the mapping
Movie.init()

# Confirm the mapping exists
client.indices.get_mapping(index=INDEX_NAME)

{'movies': {'mappings': {'doc': {'properties': {'genre': {'type': 'keyword'},
     'overview': {'type': 'text'},
     'production_companies': {'type': 'keyword'},
     'release_date': {'type': 'date'},
     'revenue': {'type': 'integer'},
     'title': {'type': 'text'}}}}}}

### Indexing our first movie

Let's create a movie and push it to Elasticsearch. You'll notice the syntax here is very similar to Django models.

In [4]:
# Create an example movie
movie = Movie(
    meta={'id': 1},
    title="Example Movie",
    overview="This movie is about cats and dogs",
    genre=["Comedy", "Family"],
    release_date="2018-11-02",
    revenue=1000000,
    production_companies="Pixar"
)
movie.save()

True

### Finding our first movie

**Success!** Let's search the index to try to find the movie we created.

Elasticsearch returns all results when no filters or queries are specified.

Because this is just a test movie, we'll delete it right after we print it.

**Note:** Searches return 10 results by default. You can request more by slicing the search object, e.g. `movies[:50]`.

In [5]:
# Search for all movies (10 at a time) and delete them one by one
movies = Movie.search()
for movie in movies:
    print(movie)
    movie.delete()


<Movie: Example Movie (2018)>
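Slicing a search maps onto Elasticsearch's `from` and `size` request parameters. The sketch below shows only that arithmetic in plain Python; `slice_to_pagination` is a hypothetical helper written for illustration and is not part of Elasticsearch-DSL.

```python
def slice_to_pagination(page_slice, default_size=10):
    """Translate a Python slice into `from`/`size` values (illustrative only)."""
    start = page_slice.start or 0
    if page_slice.stop is None:
        # No upper bound given: fall back to the default page size
        size = default_size
    else:
        size = max(page_slice.stop - start, 0)
    return {"from": start, "size": size}

first_fifty = slice_to_pagination(slice(0, 50))
third_page = slice_to_pagination(slice(20, 30))

print(first_fifty)  # {'from': 0, 'size': 50}
print(third_page)   # {'from': 20, 'size': 10}
```

With Elasticsearch-DSL itself, the equivalent of the first call should be `Movie.search()[0:50]`.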


### Use real data

Now that we're confident everything is working, let's index real movies!

In [68]:
import csv
import ast

MAX_ROWS = 1000

with open("input/movies.csv", newline="") as csvfile:
    reader = csv.DictReader(csvfile)
    for idx, row in enumerate(reader):
        if idx >= MAX_ROWS:
            break

        movie = Movie(
            meta={"id": row["imdb_id"]},
            title=row["title"],
            overview=row["overview"],
            genre=[g["name"] for g in ast.literal_eval(row["genres"])],
            release_date=row["release_date"] or None,
            revenue=row["revenue"],
            production_companies=[p["name"] for p in ast.literal_eval(row["production_companies"])],
        )
        movie.save()
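Saving documents one at a time means one request per movie. For larger imports, a common alternative is to build a list of actions and send them in batches with `elasticsearch.helpers.bulk`. The sketch below only builds the actions, so it runs without a cluster. `SAMPLE_CSV` and `generate_actions` are made up for illustration, and the sample rows are not real IMDB data.

```python
import ast
import csv
import io

# Two made-up rows standing in for input/movies.csv (illustrative data only)
SAMPLE_CSV = '''imdb_id,title,genres,revenue
tt0000001,Example One,"[{'id': 1, 'name': 'Comedy'}]",100
tt0000002,Example Two,"[{'id': 2, 'name': 'Drama'}, {'id': 1, 'name': 'Comedy'}]",250
'''

def generate_actions(csvfile, index_name="movies"):
    """Yield one bulk action (index name, id and document body) per CSV row."""
    for row in csv.DictReader(csvfile):
        yield {
            "_index": index_name,
            "_id": row["imdb_id"],
            "_source": {
                "title": row["title"],
                "genre": [g["name"] for g in ast.literal_eval(row["genres"])],
                "revenue": int(row["revenue"]),
            },
        }

actions = list(generate_actions(io.StringIO(SAMPLE_CSV)))
print(len(actions))                    # 2
print(actions[1]["_source"]["genre"])  # ['Drama', 'Comedy']

# With a live cluster, the actions could then be sent in batches, e.g.:
#   from elasticsearch.helpers import bulk
#   bulk(client, actions)
```

On older clusters that still use mapping types, each action may also need a `_type` key; check this against the Elasticsearch version you are running.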

### Full text search

In our mapping we defined the title as a `Text` datatype which allows tokenisation and analysis at index time and search time.

By default, it uses the `standard` analyzer, which has sensible but basic defaults.<br>
You can create custom analyzers in index settings and assign them to fields in the mapping.<br>

For this demo, we'll keep it simple.

Let's see this in action by searching for movie titles containing the word "Blue".



In [81]:
from elasticsearch_dsl.query import Match

# Find all movies with the word "blue" in the title
search = Movie.search().query(
    Match(title="blue")
)
list(search)

[<Movie: Blue Sky (1994)>,
 <Movie: Blue Chips (1994)>,
 <Movie: Three Colors: Blue (1993)>,
 <Movie: Devil in a Blue Dress (1995)>,
 <Movie: Blue in the Face (1995)>]
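To see why a lowercase query finds titles containing "Blue", it can help to picture the inverted index that full-text search relies on. The following is a toy model in plain Python, not how Elasticsearch is implemented: `analyze` is a rough stand-in for the `standard` analyzer, and the same function is applied at index time and at search time.

```python
import re
from collections import defaultdict

titles = [
    "Blue Sky",
    "Three Colors: Blue",
    "Devil in a Blue Dress",
    "Sabrina",
    "Friday",
]

def analyze(text):
    """Rough stand-in for the standard analyzer: lowercase word tokens."""
    return re.findall(r"\w+", text.lower())

# Index time: map each token to the set of titles that contain it
inverted_index = defaultdict(set)
for title in titles:
    for token in analyze(title):
        inverted_index[token].add(title)

def match(query):
    """Search time: analyze the query the same way, then look up each token."""
    hits = set()
    for token in analyze(query):
        hits |= inverted_index.get(token, set())
    return sorted(hits)

print(match("BLUE"))
# ['Blue Sky', 'Devil in a Blue Dress', 'Three Colors: Blue']
```

Because both sides are lowercased and tokenised, the case of the query does not matter and a title matches if it contains the word anywhere.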

### Filtering

Elasticsearch comes with many types of filters.

The `Term` filter will return results with an exact match.

A good example is filtering movies by genre. Because we've defined genre as a `Keyword` datatype in the mapping, we can filter movies containing an exact genre.

**Note:** Movies with multiple genres (e.g. Action, Comedy) will be included if at least one of the genres matches.

As an example, let's find all comedy movies.

In [82]:
from elasticsearch_dsl.query import Term

# Find all comedy movies
search = Movie.search().filter(
    Term(genre="Comedy")
)
list(search)

[<Movie: Waiting to Exhale (1995)>,
 <Movie: Father of the Bride Part II (1995)>,
 <Movie: Sabrina (1995)>,
 <Movie: The American President (1995)>,
 <Movie: Dracula: Dead and Loving It (1995)>,
 <Movie: Four Rooms (1995)>,
 <Movie: Don't Be a Menace to South Central While Drinking Your Juice in the Hood (1996)>,
 <Movie: Friday (1995)>,
 <Movie: Muppet Treasure Island (1996)>,
 <Movie: Rumble in the Bronx (1995)>]
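The exact-match behaviour of `Term` on a multi-valued `Keyword` field can be sketched in plain Python. `term_filter` is a hypothetical helper and the documents are made up; the point is that a document passes if any one of its values equals the term exactly, including case.

```python
# Made-up documents for illustration only
movies = [
    {"title": "Movie A", "genre": ["Comedy"]},
    {"title": "Movie B", "genre": ["Action", "Comedy"]},
    {"title": "Movie C", "genre": ["Drama"]},
]

def term_filter(docs, field, value):
    """Keep documents where at least one stored value equals `value` exactly."""
    return [doc for doc in docs if value in doc[field]]

comedies = [doc["title"] for doc in term_filter(movies, "genre", "Comedy")]
lowercase = [doc["title"] for doc in term_filter(movies, "genre", "comedy")]

print(comedies)   # ['Movie A', 'Movie B']
print(lowercase)  # []
```

The empty second result mirrors a common surprise: keyword values are not analyzed, so `Term(genre="comedy")` would not be expected to match documents indexed with `"Comedy"`.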

### What about other filters?

Sometimes we might want to filter non-string fields in more useful ways. For example, filtering for a range of values.

In this example, we'll use the `Range` filter to find all movies released in the year 1994.

**Note:** If this were a real application, it would be more performant to index `year` as its own field with an `Integer` datatype.

In [86]:
from elasticsearch_dsl.query import Range

# Find all movies released in 1994
search = Movie.search().filter(
    Range(release_date={"gte": "1994-01-01", "lt": "1995-01-01"})
)
list(search)

[<Movie: Crumb (1994)>,
 <Movie: Disclosure (1994)>,
 <Movie: Drop Zone (1994)>,
 <Movie: Eat Drink Man Woman (1994)>,
 <Movie: Ed Wood (1994)>,
 <Movie: I.Q. (1994)>,
 <Movie: Junior (1994)>,
 <Movie: L'Enfer (1994)>,
 <Movie: Little Odessa (1994)>,
 <Movie: Mixed Nuts (1994)>]
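The `gte` and `lt` bounds describe a half-open interval: the lower bound is included and the upper bound is excluded. Here is a minimal plain-Python sketch of that comparison; `in_range` is a hypothetical helper, and the dates are chosen to sit right on the boundaries.

```python
from datetime import date

def in_range(value, gte=None, lt=None):
    """Return True when gte <= value < lt (a missing bound is ignored)."""
    if value is None:
        return False
    if gte is not None and value < gte:
        return False
    if lt is not None and value >= lt:
        return False
    return True

release_dates = [
    date(1993, 12, 31),
    date(1994, 1, 1),
    date(1994, 12, 31),
    date(1995, 1, 1),
]

released_in_1994 = [
    d for d in release_dates
    if in_range(d, gte=date(1994, 1, 1), lt=date(1995, 1, 1))
]
print(released_in_1994)
# [datetime.date(1994, 1, 1), datetime.date(1994, 12, 31)]
```

Using an exclusive upper bound on the first day of the next year avoids having to spell out the last possible moment of 1994.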

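### Aggregations

Aggregations summarise the matching documents instead of returning them. The `Terms` aggregation groups documents into buckets by field value and counts the documents in each bucket.

In [87]: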
from elasticsearch_dsl.aggs import Terms

# Return all movies
search = Movie.search()

# Find the most common genres
search.aggs.bucket("genres", Terms(field="genre"))
results = search.execute()

for genre in results.aggregations.genres.buckets:
    print("{} has {} movies".format(genre.key, genre.doc_count))


Drama has 570 movies
Comedy has 376 movies
Romance has 259 movies
Thriller has 212 movies
Action has 169 movies
Family has 131 movies
Crime has 130 movies
Adventure has 110 movies
Fantasy has 78 movies
Mystery has 76 movies
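The counts above add up to more than the 1000 movies indexed because a movie with several genres lands in several buckets. A rough plain-Python equivalent of a terms aggregation shows this; `terms_aggregation` is a hypothetical helper, the documents are made up, and the default of 10 buckets mirrors the default size of the real aggregation.

```python
from collections import Counter

# Made-up documents for illustration only
movies = [
    {"title": "Movie A", "genre": ["Drama", "Romance"]},
    {"title": "Movie B", "genre": ["Comedy"]},
    {"title": "Movie C", "genre": ["Drama", "Comedy"]},
    {"title": "Movie D", "genre": ["Drama"]},
]

def terms_aggregation(docs, field, size=10):
    """Count documents per distinct field value, most common first."""
    counts = Counter()
    for doc in docs:
        # set() so a document is counted once per distinct value
        counts.update(set(doc[field]))
    return counts.most_common(size)

buckets = terms_aggregation(movies, "genre")
for key, doc_count in buckets:
    print("{} has {} movies".format(key, doc_count))
# Drama has 3 movies
# Comedy has 2 movies
# Romance has 1 movies
```

Four documents produce bucket counts that sum to six, for the same reason the genre totals in the real output exceed the number of indexed movies.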


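### Metrics inside buckets

Aggregations can be nested. Here, a `Stats` metric calculates revenue statistics for each production company bucket.

In [101]: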
from elasticsearch_dsl.aggs import Stats

# Return all movies
search = Movie.search()

# Find the most common production companies and their revenue statistics
search.aggs.bucket(
    "production_companies",
    Terms(field="production_companies").metric(
        "revenue_stats",
        Stats(field="revenue")
    )
)
results = search.execute()

for production_company in results.aggregations.production_companies.buckets:
    revenue_stats = production_company.revenue_stats
    print("{} has {} movies with top gross of ${:,} and total gross of ${:,}".format(
        production_company.key,
        production_company.doc_count,
        int(revenue_stats.max),
        int(revenue_stats.sum),
    ))


Warner Bros. has 76 movies with top gross of $494,471,524 and total gross of $5,003,717,411
Paramount Pictures has 60 movies with top gross of $677,945,399 and total gross of $3,515,498,387
Universal Pictures has 58 movies with top gross of $920,100,000 and total gross of $4,897,919,596
Metro-Goldwyn-Mayer (MGM) has 41 movies with top gross of $400,176,459 and total gross of $844,235,705
Twentieth Century Fox Film Corporation has 40 movies with top gross of $816,969,268 and total gross of $4,742,379,769
Miramax Films has 38 movies with top gross of $213,928,762 and total gross of $479,548,959
New Line Cinema has 37 movies with top gross of $351,583,407 and total gross of $1,457,081,732
Hollywood Pictures has 35 movies with top gross of $335,062,621 and total gross of $1,660,633,116
Columbia Pictures has 31 movies with top gross of $141,407,024 and total gross of $833,418,073
TriStar Pictures has 29 movies with top gross of $262,797,249 and total gross of $1,452,386,294
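Nesting a `Stats` metric inside a `Terms` bucket amounts to "group by company, then summarise revenue per group". A minimal plain-Python sketch of that idea follows. `stats` and `terms_with_stats` are hypothetical helpers and the documents are made up; the five values returned match the fields a stats aggregation reports (`count`, `min`, `max`, `avg`, `sum`).

```python
# Made-up documents for illustration only
movies = [
    {"production_companies": ["Studio A"], "revenue": 100},
    {"production_companies": ["Studio A", "Studio B"], "revenue": 300},
    {"production_companies": ["Studio B"], "revenue": 50},
]

def stats(values):
    """Summarise a non-empty list of numbers."""
    total = sum(values)
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "avg": total / len(values),
        "sum": total,
    }

def terms_with_stats(docs, bucket_field, metric_field):
    """Group documents by each value of bucket_field, then summarise metric_field."""
    grouped = {}
    for doc in docs:
        for key in set(doc[bucket_field]):
            grouped.setdefault(key, []).append(doc[metric_field])
    return {key: stats(values) for key, values in grouped.items()}

results = terms_with_stats(movies, "production_companies", "revenue")
print(results["Studio A"])
# {'count': 2, 'min': 100, 'max': 300, 'avg': 200.0, 'sum': 400}
```

Note that the co-produced movie contributes its full revenue to both studios, so per-company totals should not be added together to get an overall figure.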
