# What is a database index

A database index is a data structure used to quickly locate data in a table without having to scan every row. It trades extra disk space (used to store the index) for faster lookups. A single table can have multiple indexes, and each one can cover one or more columns of that table.
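The tradeoff can be sketched in a few lines of plain Python. This is only an illustration with made-up rows: the "index" here is a dictionary, whereas PostgreSQL's default index type is a B-tree, but the idea of spending extra memory to avoid visiting every row is the same.

```python
from collections import defaultdict

# Hypothetical in-memory "table": each row is a (name, city) tuple.
rows = [
    ("Manchester United", "Manchester"),
    ("Real Madrid", "Madrid"),
    ("Roma", "Rome"),
    ("Atletico Madrid", "Madrid"),
]


def full_scan(city):
    """Without an index: inspect every row and keep the matching positions."""
    return [position for position, row in enumerate(rows) if row[1] == city]


# The "index": extra memory mapping each city to the positions of its rows.
city_index = defaultdict(list)
for position, row in enumerate(rows):
    city_index[row[1]].append(position)


def index_lookup(city):
    """With an index: jump straight to the matching row positions."""
    return city_index.get(city, [])


print(full_scan("Madrid"))     # [1, 3]
print(index_lookup("Madrid"))  # [1, 3]
```

Both functions return the same positions, but `full_scan` touches every row while `index_lookup` does a single dictionary access, at the cost of keeping `city_index` around and up to date.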

# How to create an index in Django

In Django, indexes are declared when defining the model class. Although there are multiple ways to create an index, the simplest one is to pass the `db_index=True` kwarg in the definition of a model field.

You can also set a list of `indexes` inside a model `Meta` class and define which field(s) each index will use, as in the example below. More on the different types of indexes later.

```python
from django.db import models

class MyModel(models.Model):
    foo = models.CharField(max_length=255)
    bar = models.CharField(max_length=255)
    class Meta:
        indexes = [
            models.Index(fields=['foo']),
            models.Index(fields=['bar']),
        ]
```

However, Django may automatically create some indexes, even if not explicitly asked to, to optimize for the most common cases. Let's consider the following scenario:

```python
from django.db import models

class Team(models.Model):
    name = models.CharField(max_length=255)
    tv_name = models.CharField(max_length=4, unique=True)
    city = models.CharField(max_length=255)

class Player(models.Model):
    name = models.CharField(max_length=255)
    position = models.CharField(max_length=2)
    team = models.ForeignKey(Team, on_delete=models.CASCADE)

class Stadium(models.Model):
    name = models.CharField(max_length=255)
    nickname = models.CharField(max_length=255)
    teams = models.ManyToManyField(Team)
```

In [1]:
%%sh
django-admin sqlmigrate teams 0001

BEGIN;
--
-- Create model Team
--
CREATE TABLE "teams_team" ("id" bigserial NOT NULL PRIMARY KEY, "name" varchar(255) NOT NULL, "tv_name" varchar(4) NOT NULL UNIQUE, "city" varchar(255) NOT NULL);
--
-- Create model Stadium
--
CREATE TABLE "teams_stadium" ("id" bigserial NOT NULL PRIMARY KEY, "name" varchar(255) NOT NULL, "nickname" varchar(255) NOT NULL);
CREATE TABLE "teams_stadium_teams" ("id" bigserial NOT NULL PRIMARY KEY, "stadium_id" bigint NOT NULL, "team_id" bigint NOT NULL);
--
-- Create model Player
--
CREATE TABLE "teams_player" ("id" bigserial NOT NULL PRIMARY KEY, "name" varchar(255) NOT NULL, "position" varchar(2) NOT NULL, "team_id" bigint NOT NULL);
CREATE INDEX "teams_team_tv_name_7d091fee_like" ON "teams_team" ("tv_name" varchar_pattern_ops);
ALTER TABLE "teams_stadium_teams" ADD CONSTRAINT "teams_stadium_teams_stadium_id_team_id_0174a71c_uniq" UNIQUE ("stadium_id", "team_id");
ALTER TABLE "teams_stadium_teams" ADD CONSTRAINT "teams_stadium_teams_stadium_id_1521a159_

By running `sqlmigrate`, we can inspect the SQL that Django will execute to create the tables.

Notice how it calls `CREATE INDEX` multiple times, even though we never explicitly defined a single index for any of our fields.

```sql
CREATE INDEX "teams_stadium_teams_stadium_id_1521a159" ON "teams_stadium_teams" ("stadium_id");
CREATE INDEX "teams_stadium_teams_team_id_57c6bc0d" ON "teams_stadium_teams" ("team_id");
CREATE INDEX "teams_player_team_id_4ee5cf70" ON "teams_player" ("team_id");
```

The indexes above are created to optimize the queries that traverse our one-to-many and many-to-many relations, but there's one more index, created for the `tv_name` field.
```sql
CREATE INDEX "teams_team_tv_name_7d091fee_like" ON "teams_team" ("tv_name" varchar_pattern_ops);
```

This last index is created because we declared `tv_name` with `unique=True` directly on the field. It uses the PostgreSQL `varchar_pattern_ops` operator class, which compares values character by character and lets the index serve pattern-matching queries such as `LIKE 'MU%'`. Maybe you don't need that extra optimization. In our case, `tv_name` is a short string (only 4 characters) holding the abbreviation shown for each team in the TV broadcast of a football match (e.g. MUFC for Manchester United, RMAD for Real Madrid, ROMA for Roma). We'll talk more about this index when discussing the tradeoffs.
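To see why a character-by-character ordering helps with prefix patterns, here is a minimal sketch in plain Python. It is not how PostgreSQL is implemented: it only assumes a sorted list of made-up team names standing in for the ordered keys of a B-tree, and relies on Python comparing strings code point by code point.

```python
from bisect import bisect_left

# Hypothetical team names, kept sorted the way a B-tree keeps its keys.
names = sorted(
    ["Real Betis", "Real Madrid", "Real Sociedad", "Roma", "Manchester United"]
)


def prefix_search(sorted_values, prefix):
    """Return every value starting with `prefix` using two binary searches."""
    if not prefix:
        return list(sorted_values)
    start = bisect_left(sorted_values, prefix)
    # Smallest string that sorts after every string with this prefix:
    # bump the last character of the prefix by one code point.
    upper = prefix[:-1] + chr(ord(prefix[-1]) + 1)
    end = bisect_left(sorted_values, upper)
    return sorted_values[start:end]


print(prefix_search(names, "Real"))
# ['Real Betis', 'Real Madrid', 'Real Sociedad']
```

Because all values sharing a prefix sit next to each other in the ordering, the search becomes a range lookup instead of a scan over every name. That only works when the ordering is strictly character by character, which is what the operator class guarantees regardless of the database locale.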

For names this short, some of them nearly identical, a plain equality comparison is probably fast enough, and skipping the extra index saves some valuable database space. Since we still want `tv_name` to be unique, we can remove `unique=True` from the field and declare a unique constraint in the model's `Meta` class instead.

```python
class Team(models.Model):
    name = models.CharField(max_length=255)
    tv_name = models.CharField(max_length=4)
    city = models.CharField(max_length=255)

    class Meta:
        constraints = [
            models.UniqueConstraint(
                fields=['tv_name'],
                name='%(app_label)s_%(class)s_tv_name_unique',
            ),
        ]
```

In [2]:
%%sh
django-admin sqlmigrate teams 0002

BEGIN;
--
-- Alter field tv_name on team
--
ALTER TABLE "teams_team" DROP CONSTRAINT "teams_team_tv_name_unique";
DROP INDEX IF EXISTS "teams_team_tv_name_7d091fee_like";
--
-- Create constraint teams_team_tv_name_unique on model team
--
ALTER TABLE "teams_team" ADD CONSTRAINT "teams_team_tv_name_unique" UNIQUE ("tv_name");
COMMIT;
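Dropping the `_like` index does not leave `tv_name` unindexed: a unique constraint is itself enforced through an index. The sketch below shows that idea with SQLite from the Python standard library, used here only as a stand-in so the snippet runs anywhere. The table, rows and index name are illustrative, and the plan text differs from what PostgreSQL prints.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE team ("
    "id INTEGER PRIMARY KEY, name TEXT NOT NULL, "
    "tv_name TEXT NOT NULL UNIQUE, city TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO team (name, tv_name, city) VALUES (?, ?, ?)",
    [
        ("Manchester United", "MUFC", "Manchester"),
        ("Real Madrid", "RMAD", "Madrid"),
        ("Roma", "ROMA", "Rome"),
    ],
)

# No CREATE INDEX was issued, yet the UNIQUE constraint is backed by an index
# that the planner can use for equality lookups.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM team WHERE tv_name = 'MUFC'"
).fetchall()
plan_detail = " ".join(row[-1] for row in plan)
print(plan_detail)
```

The printed plan should mention an automatically created index (SQLite names it `sqlite_autoindex_team_1`) rather than a full table scan.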


Now let's populate our database using a pre-made script. For simplicity, we've duplicated the models in two different apps: one keeps the original version (populated by `run_legacy`) and the other uses the modified version (populated by `run`).

In [3]:
from common import scripts

scripts.clear()

In [4]:
%%time
scripts.run()

Created 10000 teams...
Created 20000 teams...
Created 30000 teams...
Created 40000 teams...
Created 50000 teams...
Created 60000 teams...
Created 70000 teams...
Created 80000 teams...
Created 90000 teams...
Created 100000 teams...
Created 110000 teams...
Created 120000 teams...
Created 130000 teams...
Created 140000 teams...
Created 150000 teams...
Created 160000 teams...
Created 170000 teams...
Created 180000 teams...
Created 190000 teams...
Created 200000 teams...
Created 210000 teams...
Created 220000 teams...
Created 230000 teams...
Created 240000 teams...
Created 250000 teams...
Created 260000 teams...
Created 270000 teams...
Created 280000 teams...
Created 290000 teams...
Created 300000 teams...
Created 310000 teams...
Created 320000 teams...
Created 330000 teams...
Created 340000 teams...
Created 350000 teams...
Created 360000 teams...
Created 370000 teams...
Created 380000 teams...
Created 390000 teams...
Created 400000 teams...
Created 410000 teams...
Created 420000 teams...
C

In [5]:
from teams.models import Team

Team.objects.count()

456976

In [6]:
%%time
scripts.run_legacy()

Created 10000 teams...
Created 20000 teams...
Created 30000 teams...
Created 40000 teams...
Created 50000 teams...
Created 60000 teams...
Created 70000 teams...
Created 80000 teams...
Created 90000 teams...
Created 100000 teams...
Created 110000 teams...
Created 120000 teams...
Created 130000 teams...
Created 140000 teams...
Created 150000 teams...
Created 160000 teams...
Created 170000 teams...
Created 180000 teams...
Created 190000 teams...
Created 200000 teams...
Created 210000 teams...
Created 220000 teams...
Created 230000 teams...
Created 240000 teams...
Created 250000 teams...
Created 260000 teams...
Created 270000 teams...
Created 280000 teams...
Created 290000 teams...
Created 300000 teams...
Created 310000 teams...
Created 320000 teams...
Created 330000 teams...
Created 340000 teams...
Created 350000 teams...
Created 360000 teams...
Created 370000 teams...
Created 380000 teams...
Created 390000 teams...
Created 400000 teams...
Created 410000 teams...
Created 420000 teams...
C

In [7]:
from legacy_teams.models import Team as LegacyTeam

LegacyTeam.objects.count()

456976

# Tradeoffs

Let's check how much space the index that Django created automatically for the `tv_name` column takes up. We can use the database functions `pg_size_pretty` and `pg_relation_size` to help us.

In [8]:
from django.db import connection

with connection.cursor() as cursor:
    cursor.execute(
        """
        select
            pg_size_pretty(pg_relation_size('legacy_teams_team_tv_name_ec392eeb_like'))
        """
    )
    row = cursor.fetchone()

row

('10088 kB',)

For a table with 456,976 rows, PostgreSQL allocated around 10 MB for that index alone. This number will keep growing as the table gains rows, which is a safe assumption in real-world scenarios.
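As a rough back-of-the-envelope check, the reported size can be turned into a per-row cost. The projection at the end assumes the index grows linearly with the row count and uses a hypothetical 10-million-row table; both are assumptions for illustration, not measurements.

```python
index_size_kb = 10088   # reported by pg_size_pretty above (1 kB = 1024 bytes)
row_count = 456_976     # rows in the table

bytes_per_row = index_size_kb * 1024 / row_count
print(f"{bytes_per_row:.1f} bytes per row")  # 22.6 bytes per row

# Assumption: a similar column on a hypothetical, much larger table,
# with the index growing linearly with the number of rows.
projected_rows = 10_000_000
projected_mb = bytes_per_row * projected_rows / (1024 * 1024)
print(f"about {projected_mb:.0f} MB for {projected_rows:,} rows")
```

Under those assumptions, an index of this kind would take a couple of hundred megabytes on a 10-million-row table, which is why it is worth asking whether it is actually needed.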

Now let's compare how much time the same query takes on both `Team` tables. We'll use `.explain(analyze=True)` to inspect the strategy PostgreSQL uses to search the table. The difference is negligible, since PostgreSQL uses the index backing the unique constraint that we added to the `Team` model.

In [9]:
from legacy_teams.models import Team as LegacyTeam
LegacyTeam.objects.filter(tv_name="MUFC").explain(analyze=True)

"Index Scan using legacy_teams_team_tv_name_ec392eeb_like on legacy_teams_team  (cost=0.42..8.44 rows=1 width=40) (actual time=0.012..0.013 rows=1 loops=1)\n  Index Cond: ((tv_name)::text = 'MUFC'::text)\nPlanning Time: 0.107 ms\nExecution Time: 0.026 ms"

In [10]:
from teams.models import Team
Team.objects.filter(tv_name="MUFC").explain(analyze=True)

"Index Scan using teams_team_tv_name_unique on teams_team  (cost=0.42..8.44 rows=1 width=40) (actual time=0.047..0.048 rows=1 loops=1)\n  Index Cond: ((tv_name)::text = 'MUFC'::text)\nPlanning Time: 0.099 ms\nExecution Time: 0.066 ms"

# Further optimizations

Let's now try some other optimizations in our models. First of all, we'll add an index to the `name` field on our `Team` model, which will look like this:

```python
class Team(models.Model):
    name = models.CharField(max_length=255, db_index=True)
    tv_name = models.CharField(max_length=4)
    city = models.CharField(max_length=255)

    class Meta:
        constraints = [
            models.UniqueConstraint(
                fields=['tv_name'],
                name='%(app_label)s_%(class)s_tv_name_unique',
            ),
        ]
```

In [11]:
%%sh
django-admin sqlmigrate teams 0003

BEGIN;
--
-- Alter field name on team
--
CREATE INDEX "teams_team_name_c519f9ad" ON "teams_team" ("name");
CREATE INDEX "teams_team_name_c519f9ad_like" ON "teams_team" ("name" varchar_pattern_ops);
COMMIT;


By adding `db_index=True` to the `name` column, Django created a regular index plus the same kind of `varchar_pattern_ops` index as before. This time it may prove useful: pattern matching on club names can come in handy, especially as they can be fairly long. Let's see the impact. The first query runs on the table with the index (`Team`), while the second runs on the `LegacyTeam` table, whose `name` column has no index.

In [12]:
from teams.models import Team

Team.objects.filter(name="abcdeFGHijkl").explain(analyze=True)

"Index Scan using teams_team_name_c519f9ad_like on teams_team  (cost=0.42..8.44 rows=1 width=40) (actual time=0.062..0.062 rows=0 loops=1)\n  Index Cond: ((name)::text = 'abcdeFGHijkl'::text)\nPlanning Time: 0.077 ms\nExecution Time: 0.077 ms"

In [13]:
from legacy_teams.models import Team as LegacyTeam

LegacyTeam.objects.filter(name="abcdeFGHijkl").explain(analyze=True)

"Gather  (cost=1000.00..7379.18 rows=1 width=40) (actual time=17.252..18.462 rows=0 loops=1)\n  Workers Planned: 2\n  Workers Launched: 2\n  ->  Parallel Seq Scan on legacy_teams_team  (cost=0.00..6379.08 rows=1 width=40) (actual time=13.668..13.669 rows=0 loops=3)\n        Filter: ((name)::text = 'abcdeFGHijkl'::text)\n        Rows Removed by Filter: 152325\nPlanning Time: 0.071 ms\nExecution Time: 18.514 ms"

The performance difference is clear when querying an indexed column against the "same" column without an index in the other table: judging by the reported execution times (0.077 ms versus 18.514 ms), the query on the indexed column is roughly 240 times faster.
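To compare plans numerically rather than by eye, a small helper can pull the execution time out of the string returned by `.explain(analyze=True)`. The function name `execution_time_ms` is made up for this sketch, and the two plans are trimmed copies of the outputs above. Keep in mind that single-run timings are noisy, so the ratio is only indicative.

```python
import re


def execution_time_ms(plan: str) -> float:
    """Extract the 'Execution Time' value (in ms) from an EXPLAIN ANALYZE plan."""
    match = re.search(r"Execution Time: ([0-9.]+) ms", plan)
    if match is None:
        raise ValueError("no 'Execution Time' line found; was analyze=True used?")
    return float(match.group(1))


# Trimmed copies of the two plans shown above.
indexed_plan = (
    "Index Scan using teams_team_name_c519f9ad_like on teams_team\n"
    "Planning Time: 0.077 ms\nExecution Time: 0.077 ms"
)
sequential_plan = (
    "Gather\n  ->  Parallel Seq Scan on legacy_teams_team\n"
    "Planning Time: 0.071 ms\nExecution Time: 18.514 ms"
)

speedup = execution_time_ms(sequential_plan) / execution_time_ms(indexed_plan)
print(f"indexed lookup is about {speedup:.0f}x faster")
```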

## Partial indexes

Now let's imagine we work for Globe, a company that just bought the broadcasting rights for the Zora's Domain Football Cup. We want to query all the teams from that city so we can build a special page on our website just for that competition.

In [22]:
from legacy_teams.models import Team as LegacyTeam

LegacyTeam.objects.filter(city="Zora's Domain").explain(analyze=True)

"Seq Scan on legacy_teams_team  (cost=0.00..9711.20 rows=44464 width=40) (actual time=0.014..34.544 rows=45900 loops=1)\n  Filter: ((city)::text = 'Zora''s Domain'::text)\n  Rows Removed by Filter: 411076\nPlanning Time: 0.135 ms\nExecution Time: 35.735 ms"

36ms is certainly "fine", but can we do better? The answer is YES.

There's a special type of index called a partial index. It lets us index only a subset of a table's rows that we know our queries will hit often. For example, you can index only the teams from the "Zora's Domain" city.

```python
class Team(models.Model):
    name = models.CharField(max_length=255, db_index=True)
    tv_name = models.CharField(max_length=4)
    city = models.CharField(max_length=255)

    class Meta:
        constraints = [
            models.UniqueConstraint(
                fields=['tv_name'],
                name='%(app_label)s_%(class)s_tv_name_unique',
            ),
        ]
        indexes = [
            models.Index(
                fields=['city'],
                condition=models.Q(city="Zora's Domain"),
                name='%(app_label)s_%(class)s_city_zora',
            ),
        ]
```

We do so by adding a new index on the `city` field that covers only the rows matching a given `condition` (in our case, `city == "Zora's Domain"`).

Inspecting the migration, we see that Django generates a `CREATE INDEX` statement with a `WHERE` clause that does exactly that.

In [24]:
%%sh
django-admin sqlmigrate teams 0004

BEGIN;
--
-- Create index teams_team_city_zora on field(s) city of model team
--
CREATE INDEX "teams_team_city_zora" ON "teams_team" ("city") WHERE "city" = 'Zora''s Domain';
COMMIT;


Now let's compare the results:

In [23]:
from teams.models import Team

Team.objects.filter(city="Zora's Domain").explain(analyze=True)

"Bitmap Heap Scan on teams_team  (cost=389.99..4948.48 rows=44759 width=40) (actual time=3.196..16.572 rows=45900 loops=1)\n  Recheck Cond: ((city)::text = 'Zora''s Domain'::text)\n  Heap Blocks: exact=3999\n  ->  Bitmap Index Scan on teams_team_city_zora  (cost=0.00..378.80 rows=44759 width=0) (actual time=2.490..2.491 rows=45900 loops=1)\nPlanning Time: 0.115 ms\nExecution Time: 18.136 ms"

Just by creating this index, we cut the query time roughly in half (from about 36 ms to about 18 ms). Great news!

And if we compare this result with another query on the same table and column, but looking for a different city (e.g. `Tarrey Town`), we'll see that the query has similar performance (and planning) to the non-indexed column from the legacy table.

In [25]:
from teams.models import Team

Team.objects.filter(city="Tarrey Town").explain(analyze=True)

"Seq Scan on teams_team  (cost=0.00..9712.00 rows=44790 width=40) (actual time=0.011..37.783 rows=45615 loops=1)\n  Filter: ((city)::text = 'Tarrey Town'::text)\n  Rows Removed by Filter: 411361\nPlanning Time: 0.093 ms\nExecution Time: 38.942 ms"

Now let's check how much disk space PostgreSQL used for that index. Only 328 kB, which is also great news.

In [26]:
from django.db import connection

with connection.cursor() as cursor:
    cursor.execute(
        """
        select
            pg_size_pretty(pg_relation_size('teams_team_city_zora'))
        """
    )
    row = cursor.fetchone()

row

('328 kB',)
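The behaviour seen in this section (the partial index serves queries for the indexed city, while any other city falls back to a full scan) can be reproduced in a small, self-contained sketch. It uses SQLite from the Python standard library as a stand-in for PostgreSQL, with a made-up table and rows, so the plan text will not match the outputs above. The `query_plan` helper is hypothetical and inlines the city as a SQL literal so that it textually matches the index's `WHERE` clause. That is fine for a demo, but never do it with user input.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE team (id INTEGER PRIMARY KEY, name TEXT NOT NULL, city TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO team (name, city) VALUES (?, ?)",
    [
        (f"Team {i}", "Zora's Domain" if i % 10 == 0 else "Tarrey Town")
        for i in range(1000)
    ],
)

# Partial index: only rows whose city is "Zora's Domain" are indexed.
conn.execute(
    "CREATE INDEX team_city_zora ON team (city) WHERE city = 'Zora''s Domain'"
)


def query_plan(city_literal):
    """Return the planner's description for a city lookup (demo only)."""
    sql = f"EXPLAIN QUERY PLAN SELECT * FROM team WHERE city = {city_literal}"
    return " ".join(row[-1] for row in conn.execute(sql))


zora_plan = query_plan("'Zora''s Domain'")
other_plan = query_plan("'Tarrey Town'")
print(zora_plan)   # mentions team_city_zora
print(other_plan)  # plain scan of the table
```

Only the first plan should reference `team_city_zora`. The second query cannot use the index, because rows from other cities were never added to it.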