# Tutorial 3a: Extreme Averaging

In this notebook we will set up a huge single-column database and take the average of the numbers in it. The goal is to stress-test `pandas`' ability to work with large datasets.

## Requirements

You will need to have installed `sqlalchemy`, `pandas` and `numpy`, plus `guppy`, which is used further down to measure memory usage.

In [1]:
import numpy as np
import pandas as pd
import os

import sqlalchemy as sq
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker
from sqlalchemy.sql import func

## Building the Database

In [2]:
dbfile = 'huge.db'

try:
    os.remove(dbfile)
except OSError:  # no leftover database from a previous run
    pass

In [3]:
Base = declarative_base()

In [4]:
class Number(Base):
    __tablename__ = 'number'
    
    id = sq.Column(sq.Integer, primary_key=True)
    value = sq.Column(sq.Float, nullable=False)

In [5]:
engine = sq.create_engine('sqlite:///'+dbfile)
Base.metadata.create_all(engine)

Base.metadata.bind = engine
DBSession = sessionmaker(bind=engine)
session = DBSession()

We are now ready to insert data into our database. We'll use a low-level insertion command to fill the table quickly; you can read more about how this works [here](http://docs.sqlalchemy.org/en/latest/faq/performance.html).

In [6]:
N = int(1e3)

In [7]:
%%time
engine.execute(
    Number.__table__.insert(),
    [{"value": x} for x in xrange(N)]
)

CPU times: user 9.92 ms, sys: 1.57 ms, total: 11.5 ms
Wall time: 15.2 ms


<sqlalchemy.engine.result.ResultProxy at 0x1067ac3d0>

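For comparison, the loop below adds and commits one row at a time through the session, and is several hundred times slower (4.86 s versus 15.2 ms of wall time for 1000 rows):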

In [8]:
%%time
# Insert the next N values one at a time, committing after every row.
for j in range(N, 2*N):
    session.add(Number(value=float(j)))
    session.commit()

CPU times: user 1.57 s, sys: 726 ms, total: 2.29 s
Wall time: 4.86 s
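
Part of the gap between the two cells above is likely the commit after every row, with the rest coming from ORM overhead. The sketch below isolates the commit pattern using the standard-library `sqlite3` module in place of SQLAlchemy; `timed_insert` is a hypothetical helper, not part of this notebook, and because it writes to an in-memory database the difference it shows will be smaller than with a file on disk.

```python
import sqlite3
import time

def timed_insert(values, commit_each_row):
    """Insert values into a fresh in-memory table; return (seconds, row_count)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE number (id INTEGER PRIMARY KEY, value FLOAT NOT NULL)")
    start = time.time()
    if commit_each_row:
        # One statement and one commit per row, like the session loop above.
        for v in values:
            conn.execute("INSERT INTO number (value) VALUES (?)", (v,))
            conn.commit()
    else:
        # One executemany call and a single commit, like the bulk insert.
        conn.executemany("INSERT INTO number (value) VALUES (?)",
                         ((v,) for v in values))
        conn.commit()
    elapsed = time.time() - start
    count = conn.execute("SELECT COUNT(*) FROM number").fetchone()[0]
    conn.close()
    return elapsed, count

values = [float(x) for x in range(1000)]
slow_seconds, slow_count = timed_insert(values, commit_each_row=True)
fast_seconds, fast_count = timed_insert(values, commit_each_row=False)
print("row by row: %.4f s, bulk: %.4f s" % (slow_seconds, fast_seconds))
```

Both calls end up with the same 1000 rows; only the number of statements and commits differs.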


At about 20 ms per 1000 numbers, we should be able to make a 10-billion-row database in:

In [9]:
N = int(1e10)
print np.round((N * ((20*1e-3) / 1e3) / 3600.0), 1),"hours"

55.6 hours
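
The same arithmetic can be wrapped in a small function so that it is easy to repeat for other table sizes. This is only a sketch: `estimated_build_time` is a hypothetical helper, and it assumes insertion time scales linearly with the number of rows at the rate of 20 ms per 1000 rows quoted above.

```python
def estimated_build_time(n_rows, seconds_per_batch=20e-3, batch_size=1000):
    """Return the estimated insertion time in seconds, assuming linear scaling."""
    return n_rows / float(batch_size) * seconds_per_batch

for n_rows in (int(1e7), int(1e10)):
    seconds = estimated_build_time(n_rows)
    print("%d rows: %.1f minutes = %.1f hours"
          % (n_rows, seconds / 60.0, seconds / 3600.0))
```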


Let's try something smaller: at that rate, 10 million rows should take a few minutes. We already inserted 2000 rows above, so there's no need to insert them again. The resulting database should be about 100 MB in size. Then let's see how `pandas` copes with that.

In [10]:
N = int(1e7)

In [11]:
%%time
engine.execute(
    Number.__table__.insert(),
    [{"value": x} for x in xrange(2000, N)]
)

CPU times: user 49.5 s, sys: 3.94 s, total: 53.4 s
Wall time: 54.5 s


<sqlalchemy.engine.result.ResultProxy at 0x2b778c650>
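
The cell above builds a list of almost 10 million dictionaries before anything is inserted. One possible alternative, sketched here with the standard-library `sqlite3` module rather than SQLAlchemy, is to insert in fixed-size batches so that only one batch is held in memory at a time. `insert_in_batches` and its `batch_size` default are assumptions made for illustration, and the example uses a small in-memory table rather than `huge.db`.

```python
import itertools
import sqlite3

def insert_in_batches(conn, values, batch_size=100000):
    """Insert an iterable of numbers, holding at most batch_size rows in memory."""
    iterator = iter(values)
    total = 0
    while True:
        # Pull the next batch_size values off the iterator.
        batch = [(float(v),) for v in itertools.islice(iterator, batch_size)]
        if not batch:
            break
        conn.executemany("INSERT INTO number (value) VALUES (?)", batch)
        conn.commit()
        total += len(batch)
    return total

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE number (id INTEGER PRIMARY KEY, value FLOAT NOT NULL)")
inserted = insert_in_batches(conn, range(2000, 50000), batch_size=10000)
print("inserted %d rows" % inserted)
```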

In [12]:
!du -h $dbfile

124M	huge.db


## Querying Data and Estimating the Mean

Let's do this in two different ways: using the built-in SQL `avg` function, and using `pandas` plus `numpy`.

In the function below we use the `time` module to measure wall-clock time, and `guppy` to measure memory usage.

The table holds the values `0, 1, ..., N-1`, whose sum is `N*(N-1)/2`, so we expect the mean to be equal to `0.5*(N-1)`:

In [13]:
0.5*(N-1)

4999999.5
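
The formula can be checked by brute force on small inputs before trusting it for 10 million rows. In this sketch `expected_mean` is a hypothetical helper, not something defined elsewhere in the notebook.

```python
def expected_mean(n):
    """Mean of the integers 0, 1, ..., n-1."""
    return 0.5 * (n - 1)

# Compare against a direct sum for a few small sizes.
for n in (1, 10, 1000):
    brute_force = sum(range(n)) / float(n)
    assert brute_force == expected_mean(n)
```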

In [14]:
def estimate_mean(method='sql_function'):
    
    import time as wallclock
    import guppy
    
    measure = guppy.hpy()
    
    start, end = {}, {}
    start['memory'] = measure.heap().size
    start['time'] = wallclock.time()
    
    df, mean = None, None

    if method == 'pandas_query':
        df = pd.read_sql(session.query(Number.value).statement, session.bind) 
        mean = np.mean(df.value)
        
    elif method == 'sql_function':
        mean = session.query(func.avg(Number.value)).one()[0]
        
    
    end['time'] = wallclock.time()
    end['memory'] = measure.heap().size

    time = (end['time']-start['time'])
    memory = (end['memory']-start['memory']) / (1024.0*1024.0)

    print "Estimated mean distance = ", mean

    print "Wallclock time spent = ", np.round(time,1), "seconds"
    print "Memory used = ",np.round(memory,1), "Mb"
    
    del df, mean

    return time, memory

In [15]:
t, m = estimate_mean(method='pandas_query')

Estimated mean distance =  4999999.5
Wallclock time spent =  34.0 seconds
Memory used =  0.0 Mb


In [16]:
t, m = estimate_mean(method='sql_function')

Estimated mean distance =  4999999.5
Wallclock time spent =  1.9 seconds
Memory used =  0.1 Mb


## Conclusions

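It's possible to work with 10-million-row databases using `SQLAlchemy` and `pandas`, but I was unable to go 10 times bigger without the kernel dying. Taking the average of 10 million numbers was about 18 times faster with the built-in SQL `avg` function (1.9 seconds) than reading the column into a `pandas` dataframe and calling `np.mean()` on it (34.0 seconds). Neither approach registered much memory use in `guppy`, although that figure may undercount: `guppy` tracks the Python heap and may not see the array buffers behind the dataframe.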
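
One possible way to keep memory bounded on larger tables is to accumulate a running sum over fixed-size chunks of rows instead of loading the whole column at once. The sketch below does this with the standard-library `sqlite3` module on a small in-memory table; `streaming_mean` and `chunk_rows` are hypothetical names, and the approach has not been tried here on the 100-million-row case. `pandas.read_sql` also accepts a `chunksize` argument that returns an iterator of dataframes, which could be used in a similar way.

```python
import sqlite3

def streaming_mean(conn, chunk_rows=100000):
    """Average the value column while holding at most chunk_rows rows at once."""
    cursor = conn.execute("SELECT value FROM number")
    total, count = 0.0, 0
    while True:
        rows = cursor.fetchmany(chunk_rows)
        if not rows:
            break
        total += sum(row[0] for row in rows)
        count += len(rows)
    return total / count if count else None

# Small stand-in table holding the values 0 .. 9999.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE number (id INTEGER PRIMARY KEY, value FLOAT NOT NULL)")
conn.executemany("INSERT INTO number (value) VALUES (?)",
                 ((float(x),) for x in range(10000)))
conn.commit()

chunked = streaming_mean(conn, chunk_rows=1000)
builtin = conn.execute("SELECT AVG(value) FROM number").fetchone()[0]
print("chunked mean: %s, SQL avg: %s" % (chunked, builtin))
```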

In [19]:
try:
    os.remove(dbfile)
except OSError:  # database file already gone
    pass