# query optimization

query optimization is very similar to refactoring.

the goal is to make the query more efficient without changing its behavior.

so ideally you should write TDD-style tests for your queries before you start optimizing them, so that any unintended change in behavior shows up immediately.

in this notebook we will go through some common query optimization techniques, using examples.
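
a minimal sketch of what such a test could compare, assuming the query results are already available as lists of row tuples (the helper name `same_rows` and the sample rows are hypothetical, not part of this notebook). it treats a result as a multiset of rows, so row order is ignored but duplicates still count:

```python
from collections import Counter

def same_rows(rows_a, rows_b) -> bool:
    """Return True if both results hold the same rows, ignoring row order."""
    # Counter compares multisets, so a duplicated row must appear equally often on both sides
    return Counter(map(tuple, rows_a)) == Counter(map(tuple, rows_b))

# hypothetical result sets, e.g. from the original and the optimized query
before = [(1, "a"), (2, "b"), (2, "b")]
after = [(2, "b"), (1, "a"), (2, "b")]

assert same_rows(before, after)                      # same rows, different order
assert not same_rows(before, [(1, "a"), (2, "b")])   # a duplicate row went missing
```

comparing multisets rather than sets matters for queries without `DISTINCT`, where an optimization that silently drops duplicates would otherwise go unnoticed.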

In [71]:
!chmod +x ./reset.sh && ./reset.sh > /dev/null
!psql postgres -c "reset all;"

Did not find any relations.
psql: error: connection to server on socket "/tmp/.s.PGSQL.5432" failed: FATAL:  database "sueszli" does not exist
ERROR:  cannot drop the currently open database
ERROR:  current user cannot be dropped
RESET


In [70]:
import os
import sys
import json

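def equals_query(query1: str, query2: str, prefix: str="") -> bool:
    """Return True if both queries print the same psql output, ignoring row order."""
    # note: the query is embedded in a double-quoted shell string, so it must not contain double quotes
    output1 = os.popen(f'psql postgres -c "{prefix} {query1};"').read()
    output2 = os.popen(f'psql postgres -c "{prefix} {query2};"').read()
    # cheap first check: the raw outputs must have the same length
    length_match = len(output1) == len(output2)

    # sort the output lines so that a different row order does not count as a difference
    lines1 = sorted(output1.splitlines())
    lines2 = sorted(output2.splitlines())
    non_order_match = lines1 == lines2

    return length_match and non_order_match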

def print_simple_execution_plan(query: str, prefix: str="") -> None:
    output = os.popen(f'psql postgres -c "{prefix} explain (analyze, verbose off, costs off, settings off, generic_plan off, buffers off, wal off, timing off, summary off) {query};"').read()
    print(output)

def print_verbose_execution_plan(query: str, prefix: str="") -> None:
    output = os.popen(f'psql postgres -c "{prefix} explain (analyze, verbose, costs, settings, buffers, wal, timing, summary) {query};"').read()
    print(output)

def print_avg_time(query: str, prefix: str="", iters:int=10) -> None:
    def get_exec_time(query: str, prefix: str="") -> float:
        output = os.popen(f'psql postgres -c "{prefix} explain (analyze, verbose, costs, settings, buffers, wal, timing, summary, format json) {query};"').read()
        # drop psql's two header lines and the trailing "(1 row)" and blank line
        output = output.splitlines()[2:-2]
        # drop the last character of each longer line (the "+" marker psql appends to continued lines)
        for i in range(len(output)):
            if len(output[i]) > 2:
                output[i] = output[i][:-1]
        output = "\n".join(output)
        output = json.loads(output)[0]
        return float(output["Execution Time"])

    vals = []
    for i in range(iters):
        vals.append(get_exec_time(query, prefix))
        # progress indicator, overwritten in place
        sys.stdout.write(f"\r{iters - i} iterations left")
        sys.stdout.write("\r")
        sys.stdout.flush()
    print(f"average execution time in {iters} interations: {sum(vals) / iters:.2f} ms")

# demo
demo_q1 = "SELECT a,b,c FROM r NATURAL JOIN s NATURAL JOIN t"
demo_q2 = "SELECT a,b,c FROM r NATURAL JOIN s NATURAL JOIN t ORDER BY a,b,c"
assert equals_query(demo_q1, demo_q2)
# print_simple_execution_plan(demo_q1)
# print_verbose_execution_plan(demo_q1)
print_avg_time(demo_q1)


average execution time in 10 interations: 325.58 ms
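
the JSON trimming in `get_exec_time` depends on how psql lays out its table output. one alternative, sketched here under the assumption that the plan is requested in the default text format with `analyze` enabled, is to read the `Execution Time:` line directly. the helper name `parse_execution_time` is hypothetical; the sample text is copied from a plan shown later in this notebook:

```python
import re

def parse_execution_time(plan_text: str) -> float:
    """Extract the execution time in ms from text-format EXPLAIN ANALYZE output."""
    match = re.search(r"Execution Time:\s*([0-9.]+)\s*ms", plan_text)
    if match is None:
        # without ANALYZE (or with summary off) the line is not printed
        raise ValueError("no 'Execution Time' line found")
    return float(match.group(1))

sample = """\
 Planning Time: 0.274 ms
 Execution Time: 76.858 ms
(6 rows)
"""
print(parse_execution_time(sample))  # 76.858
```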


# example a) text search query

original query: `76.858 ms`

- uses `substring(... from ... for ...)` with an SQL regular expression pattern to search the text

```sql
SELECT COUNT(*) FROM comments
WHERE substring(text from '%#" hello #"%' for '#') IS NOT NULL
OR substring(text from '%#" hi #"%' for '#') IS NOT NULL;
```

simplification: `48.933 ms`

- using the `LIKE` pattern-matching operator for the text search rather than `substring`
- to learn more about postgres string matching:
    - https://www.postgresql.org/docs/9.1/functions-string.html
    - https://www.postgresql.org/docs/9.1/functions-matching.html

```sql
SELECT COUNT(*) FROM comments WHERE text LIKE '% hello %' OR text LIKE '% hi %';
```
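
to make the semantics of these patterns concrete, here is a small python sketch that translates a `LIKE` pattern into a regular expression. it is only an illustration of how `%` and `_` behave, not how postgres implements `LIKE`, and it deliberately ignores escape characters. the names `like_to_regex` and `like` are hypothetical:

```python
import re

def like_to_regex(pattern: str) -> re.Pattern:
    """Translate a SQL LIKE pattern (no escape handling) into a compiled regex."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")           # % matches any sequence of characters, including none
        elif ch == "_":
            parts.append(".")            # _ matches exactly one character
        else:
            parts.append(re.escape(ch))  # everything else is matched literally
    # DOTALL lets the wildcards span newlines inside the text
    return re.compile("".join(parts), re.DOTALL)

def like(text: str, pattern: str) -> bool:
    # LIKE has to match the whole string, hence fullmatch
    return like_to_regex(pattern).fullmatch(text) is not None

assert like("this is swaggy", "% is %")
assert like("well hi there", "% hi %")
assert not like("hi there", "% hi %")  # no space before "hi", so no match
```

the last assertion shows a property shared by the original and the simplified query: because the search term is wrapped in spaces, a comment that starts or ends with the word is not counted.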

adding index on the text column: `30.10 ms`

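- adding an index on the text column to improve lookup time
- caveat: a plain B-tree index generally cannot be used for `LIKE` patterns that start with `%`, so confirm with `EXPLAIN` that the planner actually uses the index before attributing the speedup to it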

```sql
CREATE INDEX text_index ON comments (text);
SELECT COUNT(*) FROM comments WHERE text LIKE '% hello %' OR text LIKE '% hi %';
```
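
whether an index is actually used can be read off the execution plan. the following sketch scans text-format `EXPLAIN` output for a scan node that names the index. the helper `plan_uses_index` is hypothetical; the first sample is copied from a plan recorded in this notebook, the second is a made-up line for illustration only:

```python
import re

def plan_uses_index(plan_text: str, index_name: str) -> bool:
    """Return True if the plan text contains an index scan node naming index_name."""
    # covers "Index Scan using x", "Index Only Scan using x" and "Bitmap Index Scan on x"
    pattern = rf"Index (?:Only )?Scan (?:Backward )?(?:using|on) {re.escape(index_name)}\b"
    return re.search(pattern, plan_text) is not None

# plan recorded in this notebook: a sequential scan, no index involved
recorded_plan = """\
 Aggregate  (cost=1475.85..1475.86 rows=1 width=8) (actual time=76.807..76.807 rows=1 loops=1)
   ->  Seq Scan on comments  (cost=0.00..1412.44 rows=25362 width=0) (actual time=61.066..76.797 rows=1 loops=1)
"""
# hypothetical plan line, only to show what a positive match looks like
hypothetical_plan = "   ->  Bitmap Index Scan on text_index  (cost=0.00..4.30 rows=10 width=0)"

assert not plan_uses_index(recorded_plan, "text_index")
assert plan_uses_index(hypothetical_plan, "text_index")
```

if the plan for the indexed query still shows `Seq Scan on comments`, the lower timing has to come from something other than the index, for example caching or the different way the time was measured.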



### simplifying the string matching

In [326]:
# equivalent
!psql postgres -c "select substring('this is swaggy' from '%#\" is #\"%' for '#');"
!psql postgres -c "select substring('this is swaggy' from '%@\" is @\"%' for '@');"
!psql postgres -c "select substring('this is swaggy' from '%$\" is $\"%' for '$');"
!psql postgres -c "select substring('this is swaggy', ' is ');"

# simplified parts of original query
!psql postgres -c "SELECT * FROM comments WHERE substring(text from '%#\" hi #\"%' for '#') IS NOT NULL;"
!psql postgres -c "SELECT * FROM comments WHERE substring(text, ' hi ') IS NOT NULL;"

 substring 
-----------
  is 
(1 row)

 substring 
-----------
  is 
(1 row)

 substring 
-----------
  is 
(1 row)

 substring 
-----------
  is 
(1 row)

  id   | postid | score |                                                                                                                                           text                                                                                                                                            |      creationdate       | userid | userdisplayname | contentlicense 
-------+--------+-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------+--------+-----------------+----------------
 37171 |  17355 |     1 | the infrared radiation from the sun is shortwave infrared an

In [339]:
# original
!psql postgres -c "SELECT * FROM comments WHERE substring(text from '%#\" hello #\"%' for '#') IS NOT NULL OR substring(text from '%#\" hi #\"%' for '#') IS NOT NULL;"

# simplified
!psql postgres -c "SELECT * FROM comments WHERE substring(text, ' hello ') IS NOT NULL OR substring(text, ' hi ') IS NOT NULL;"
!psql postgres -c "SELECT * FROM comments WHERE text LIKE '% hello %' OR text LIKE '% hi %';"

  id   | postid | score |                                                                                                                                           text                                                                                                                                            |      creationdate       | userid | userdisplayname | contentlicense 
-------+--------+-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------+--------+-----------------+----------------
 37171 |  17355 |     1 | the infrared radiation from the sun is shortwave infrared and is not stopped by the atmosphere,the earth emits infrared radiation in longwave radiation and a lot of this gets absorbed and radiated back to the suface 

In [343]:
# checking performance (we can't use my library because it breaks on nested double quotes)

!psql postgres -c "explain (analyze) SELECT count(*) FROM comments WHERE substring(text, ' hello ') IS NOT NULL OR substring(text, ' hi ') IS NOT NULL;"
!psql postgres -c "explain (analyze) SELECT count(*) FROM comments WHERE text LIKE '% hello %' OR text LIKE '% hi %';"

                                                     QUERY PLAN                                                      
---------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=1475.85..1475.86 rows=1 width=8) (actual time=76.807..76.807 rows=1 loops=1)
   ->  Seq Scan on comments  (cost=0.00..1412.44 rows=25362 width=0) (actual time=61.066..76.797 rows=1 loops=1)
         Filter: (("substring"(text, ' hello '::text) IS NOT NULL) OR ("substring"(text, ' hi '::text) IS NOT NULL))
         Rows Removed by Filter: 25362
 Planning Time: 0.274 ms
 Execution Time: 76.858 ms
(6 rows)

                                                 QUERY PLAN                                                  
-------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=1412.46..1412.47 rows=1 width=8) (actual time=48.866..48.867 rows=1 loops=1)
   ->  Seq Scan on comm
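
the helper functions break on nested double quotes because they paste the query into a double-quoted shell string. one way around this, sketched here as an assumption rather than as the notebook's actual library, is to pass the arguments to psql as a list so that no shell parses the query. the names `psql_command` and `run_psql` are hypothetical:

```python
import subprocess

def psql_command(query: str, prefix: str = "", database: str = "postgres") -> list:
    """Build the psql argument list; the SQL is a single argument, so quotes need no escaping."""
    sql = f"{prefix} {query}".strip()
    return ["psql", database, "-c", sql]

def run_psql(query: str, prefix: str = "") -> str:
    """Run the query through psql without a shell and return its standard output."""
    result = subprocess.run(psql_command(query, prefix), capture_output=True, text=True)
    return result.stdout

# a query containing both single and double quotes is passed through unchanged
tricky = """SELECT substring(text from '%#" hi #"%' for '#') FROM comments"""
assert psql_command(tricky)[-1] == tricky
```

with this, the original `substring(... from ... for ...)` query could be timed through the same code path as the other queries instead of calling psql by hand.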

### other optimizations

In [354]:
# create index on text
!psql postgres -c "CREATE INDEX text_index ON comments (text);"
print_avg_time("SELECT count(*) FROM comments WHERE text LIKE '% hello %' OR text LIKE '% hi %';")

ERROR:  relation "text_index" already exists
average execution time in 10 interations: 31.76 ms


In [355]:
!psql postgres -c "SELECT count(score) FROM comments WHERE text LIKE '% hello %' OR text LIKE '% hi %';"
print_avg_time("SELECT count(score) FROM comments WHERE text LIKE '% hello %' OR text LIKE '% hi %';")


 count 
-------
     1
(1 row)

average execution time in 10 interations: 32.43 ms
