### Why ibis?

For me, it is mainly about achieving higher performance when handling large data that already resides in a database.  ibis uses the database's own compute resources.  In contrast, pandas' `read_sql` pulls the rows to your machine and uses just your PC's resources.  Python ETL developers often find pandas slow for retrieving records from SQL databases.

> High performance execution: Execute at the speed of your backend, not your local computer
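
To make the idea behind that quote concrete, here is a minimal sketch that contrasts aggregating on the client with aggregating in the database.  It uses Python's built-in `sqlite3` with an in-memory database as a stand-in backend, not ibis or PostgreSQL, and the table `demo_associates` and its `dept` column are made up for illustration.

```python
import sqlite3

# In-memory SQLite database standing in for a real backend (assumption for illustration).
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE demo_associates (id INTEGER PRIMARY KEY, dept TEXT)")
demo_conn.executemany(
    "INSERT INTO demo_associates (dept) VALUES (?)",
    [("sales",), ("it",), ("sales",), ("hr",), ("it",), ("sales",)],
)

# Client-side: pull every row across the connection, then aggregate in Python.
rows = demo_conn.execute("SELECT dept FROM demo_associates").fetchall()
client_counts = {}
for (dept,) in rows:
    client_counts[dept] = client_counts.get(dept, 0) + 1

# Backend-side: let the database aggregate and return only the summary rows.
backend_counts = dict(
    demo_conn.execute(
        "SELECT dept, COUNT(*) FROM demo_associates GROUP BY dept"
    ).fetchall()
)

print(client_counts == backend_counts)
print(len(rows), "rows transferred client-side vs", len(backend_counts), "summary rows")
```

Both approaches give the same counts, but the second one transfers one row per department instead of one row per associate.  That is the kind of saving ibis aims for by compiling expressions to SQL that runs on the backend.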

However, ibis is not without limitations.  The biggest drawback for wider ibis adoption is that it does not support some typical enterprise database platforms, namely IBM DB2 and Microsoft SQL Server (SQL Server support is being worked on).  These two database platforms are widely used at my company.  It does support PostgreSQL, though, which IT has recently begun to support.  The other, minor drawback is that it is yet another API that a user has to learn, although ibis has adopted some of the syntax of pandas dataframes and SQL.  In return, it offers a much more feature-rich [API](http://ibis-project.org/docs/api.html).  So if you are a seasoned pandas and SQL developer, it is relatively easy to get up to speed with ibis.

See the ibis [home page](http://ibis-project.org/) for other reasons why you may consider ibis.

In [None]:
import ibis
import os
import pandas as pd
import psycopg2
ibis.options.interactive = True

#### Server details

In [2]:
host = 'some_host'
port = '5432'
db = 'some_db'
user = os.environ['some_user']
pwd = os.environ['some_pwd']

#### Define ibis connection object

In [3]:
conn = ibis.postgres.connect(
    url=f'postgresql://{user}:{pwd}@{host}:{port}/{db}'
)
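
One practical caveat with building the URL by string formatting: if the password contains characters such as `@` or `/`, the URL can be parsed incorrectly.  Below is a minimal sketch of percent-encoding the credentials first, using the standard library's `urllib.parse.quote`.  The helper name `build_pg_url` and the sample password are made up for illustration.

```python
from urllib.parse import quote


def build_pg_url(user, pwd, host, port, db):
    """Hypothetical helper: percent-encode the credentials before building the URL."""
    # safe="" makes quote() encode "/" and "@" as well, which would otherwise
    # be read as URL delimiters.
    return f"postgresql://{quote(user, safe='')}:{quote(pwd, safe='')}@{host}:{port}/{db}"


example_url = build_pg_url("some_user", "p@ss/word", "some_host", "5432", "some_db")
print(example_url)  # postgresql://some_user:p%40ss%2Fword@some_host:5432/some_db
```

With a helper like this, the result could be passed as the `url=` argument of `ibis.postgres.connect` in place of the f-string.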

#### `conn` object has useful methods

In [4]:
conn.list_tables()

['associate_assets',
 'associate_desks',
 'associate_devices',
 'associate_locker',
 'associate_master',
 'window_master',
 'window_master_excel',
 'wm_carriers',
 'wm_changepoints',
 'wm_destinations',
 'wm_return_routes',
 'wm_shuttles']

#### Let's time how long it takes to create a table expression for a 30K+ row table

In [5]:
%%timeit
associates = conn.table('associate_master')

475 µs ± 187 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [9]:
associates = conn.table('associate_master')

#### Number of rows in the associate master table

In [12]:
associates.count()

31405

#### `associates` is an ibis table expression

In [10]:
type(associates)

ibis.expr.types.TableExpr

#### We can save it as a pandas dataframe using the `execute()` method

In [11]:
# limit is set above the table's 31,405 rows so that every row is returned
df = associates.execute(limit=40000)

In [13]:
type(df)

pandas.core.frame.DataFrame

In [14]:
df.shape

(31405, 71)

Since ibis objects don't have a built-in mechanism to plot your data and pandas does, it is nice to be able to convert an ibis table to a pandas dataframe.

#### Let's see how long it takes with plain SQL, using the psycopg2 library and pandas' `read_sql`

In [15]:
%%timeit
# use a separate name so the ibis `conn` object is not shadowed
with psycopg2.connect(host=host, port=port, database=db, user=user, password=pwd) as pg_conn:
    sql = "select * from public.associate_master"
    df = pd.read_sql(sql, pg_conn)

19.7 s ± 3.79 s per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [16]:
df.shape

(31405, 71)

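On average, pandas' `read_sql` took about 20 seconds to retrieve roughly 31K rows of data.  The ibis cell earlier took only microseconds!  Keep in mind, though, that `conn.table()` only builds a lazy table expression; the rows are not retrieved until `execute()` is called, so these two timings are not a like-for-like comparison.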
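
When comparing timings like these, it helps to check that each measurement covers the same work, in particular whether the rows are actually fetched.  In this notebook, the ibis call that materializes the rows is `associates.execute(limit=40000)`, so that is the call to time against `read_sql`.  The sketch below shows the difference between issuing a query and fetching its full result, using Python's built-in `sqlite3` as a stand-in backend.  The helper `time_call` and the table `demo_rows` are hypothetical, and the numbers it prints will vary by machine.

```python
import sqlite3
import time


def time_call(fn, repeats=3):
    """Return the best wall-clock time in seconds over `repeats` calls of fn()."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


# In-memory SQLite table standing in for a 30K-row database table (assumption).
timing_demo = sqlite3.connect(":memory:")
timing_demo.execute("CREATE TABLE demo_rows (id INTEGER, payload TEXT)")
timing_demo.executemany(
    "INSERT INTO demo_rows VALUES (?, ?)",
    ((i, "x" * 50) for i in range(30000)),
)

# Issues the query but does not pull the full result set into Python.
issue_only = time_call(lambda: timing_demo.execute("SELECT * FROM demo_rows"))

# Issues the query and fetches every row, which is the work read_sql also does.
fetch_all = time_call(lambda: timing_demo.execute("SELECT * FROM demo_rows").fetchall())

row_count = len(timing_demo.execute("SELECT * FROM demo_rows").fetchall())
print(f"issue only: {issue_only:.6f} s, fetch all {row_count} rows: {fetch_all:.6f} s")
```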

#### What about ORMs like sqlalchemy?  ORMs are generally not known for speed either, but if you want to be convinced...

In [17]:
from sqlalchemy import create_engine

In [18]:
engine = create_engine(f'postgresql://{user}:{pwd}@{host}:{port}/{db}')

**With chunking:**

In [19]:
%%timeit
df_orm = pd.read_sql_table('associate_master', con=engine, schema='public', chunksize=10000)

19.8 s ± 1.43 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
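
Note that with `chunksize` set, pandas returns an iterator of dataframes rather than a single dataframe, so the chunks still have to be consumed one by one.  The same batching idea is sketched below with Python's built-in `sqlite3` and `cursor.fetchmany` instead of pandas and PostgreSQL.  The helper `iter_chunks` and the table `demo_chunks` are made up for illustration.

```python
import sqlite3


def iter_chunks(cursor, chunksize):
    """Yield lists of at most `chunksize` rows until the cursor is exhausted."""
    while True:
        rows = cursor.fetchmany(chunksize)
        if not rows:
            break
        yield rows


# Small in-memory table standing in for a large database table (assumption).
chunk_demo = sqlite3.connect(":memory:")
chunk_demo.execute("CREATE TABLE demo_chunks (id INTEGER)")
chunk_demo.executemany("INSERT INTO demo_chunks VALUES (?)", ((i,) for i in range(25)))

cursor = chunk_demo.execute("SELECT id FROM demo_chunks ORDER BY id")
chunk_sizes = [len(chunk) for chunk in iter_chunks(cursor, 10)]
print(chunk_sizes)  # [10, 10, 5]
```

Processing each chunk as it arrives keeps only one batch in memory at a time, which is the main reason to chunk at all.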


**Without chunking:**

In [20]:
%%timeit
df_orm = pd.read_sql_table('associate_master', con=engine, schema='public')

26.7 s ± 3.58 s per loop (mean ± std. dev. of 7 runs, 1 loop each)


Reading through the sqlalchemy engine takes about 20 to 27 seconds as well.