## Using Pandas to process data that's too large to fit into memory

I used [Mockaroo](https://www.mockaroo.com/) to generate some sample data and loaded it into a SQLite database. The data includes bg values for different users throughout the day.

Although the data is only 600 rows, I'll use this opportunity to process it chunkwise in Python, as if it were too large to fit in memory. Let's take a quick look at the data first.

In [1]:
import pandas as pd
import sqlite3

# Connect to our database
con = sqlite3.connect('bgTable.db')

# Read the first few rows from our table
table0 = pd.read_sql_query('SELECT * FROM bgTable LIMIT 3', con)
table0.head()

Unnamed: 0,id,bgVal,time
0,1,230,15:52
1,2,80,19:22
2,3,347,8:13


Now that we can see our table, we need to define the query we'd like to apply to it. We want to add a new column that labels each row based on its bgVal value. Each value falls into one of three groups: 1) less than or equal to 80, 2) between 81 and 250, and 3) greater than 250.

In [2]:
# The binItems query uses a CASE expression to create a new
# column called 'bin' that holds a label for each row depending
# on the value of the bgVal column
binItems = '''
SELECT *,
CASE WHEN bgVal <= 80 THEN 'below'
     WHEN bgVal <= 250 THEN 'in_range'
     WHEN bgVal > 250 THEN 'above'
END AS 'bin'
FROM bgTable
'''
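
For comparison, the same three bins can also be expressed directly in pandas with `pd.cut`. This is only a minimal sketch and not part of the workflow in this post: it assumes the thresholds described above (80 and 250), and the sample values are made up to hit the bin edges.

```python
import pandas as pd

# Hypothetical sample values, chosen to land on and around the bin edges
sample = pd.DataFrame({"bgVal": [80, 81, 230, 250, 251, 347]})

# Bin edges mirror the SQL CASE expression. pd.cut uses right-closed
# intervals by default: (-inf, 80], (80, 250], (250, inf)
sample["bin"] = pd.cut(
    sample["bgVal"],
    bins=[float("-inf"), 80, 250, float("inf")],
    labels=["below", "in_range", "above"],
)
print(sample)
```

Doing the binning in SQL, as this post does, keeps the labeling inside the database; the pandas version would be applied to each chunk after it has been read.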

Now that we have the query, we'll apply it to bgTable. This will add an extra column ('bin') holding the bin each row belongs to ('below', 'in_range', or 'above'). Again, we'll perform this operation in chunks, since we're assuming that the table is too large to load into memory.

The code in the next cell does two things: 1) runs the 'binItems' query and reads the result from our database 3 rows at a time, and 2) appends each resulting chunk to a new table called 'bgTableBinned', which has the new 'bin' column.

In [4]:
# Read the query result 3 rows at a time and append each chunk to bgTableBinned
for query_chunk in pd.read_sql_query(binItems, con, chunksize=3):
    query_chunk.to_sql('bgTableBinned', con, index=False, if_exists='append')
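
To make the effect of `chunksize` visible, here is a small self-contained sketch. The in-memory database and the 7-row table named `demo` are hypothetical and exist only for illustration. The point is that, with `chunksize` set, `pd.read_sql_query` returns an iterator of DataFrames rather than one large DataFrame.

```python
import sqlite3
import pandas as pd

# In-memory database with a hypothetical 7-row table, for illustration only
demo_con = sqlite3.connect(":memory:")
demo_con.execute("CREATE TABLE demo (id INTEGER, bgVal INTEGER)")
demo_con.executemany(
    "INSERT INTO demo VALUES (?, ?)",
    [(i, 60 + 40 * i) for i in range(1, 8)],
)
demo_con.commit()

# With chunksize set, each iteration yields a DataFrame of at most 3 rows
chunk_sizes = []
for chunk in pd.read_sql_query("SELECT * FROM demo", demo_con, chunksize=3):
    chunk_sizes.append(len(chunk))

print(chunk_sizes)
demo_con.close()
```

Seven rows read with `chunksize=3` come back as chunks of 3, 3, and 1 rows, so only one small DataFrame is held at a time.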

In [5]:
# Commit any pending changes, then close the connection to the database
con.commit()
con.close()

In [6]:
# Reopen the connection to our database, which now holds the new table
conNew = sqlite3.connect('bgTable.db')

# Read the first few rows from our new table
newTable = pd.read_sql_query('SELECT * FROM bgTableBinned LIMIT 3', conNew)
newTable.head()

Unnamed: 0,id,bgVal,time,bin
0,1,230,15:52,in_range
1,2,80,19:22,below
2,3,347,8:13,above


In [7]:
conNew.close()

Note: For some reason, the first time I run cell #4 I get a logic error saying the database doesn't exist. When I run the same cell again, it works, as do the remaining cells. I'll need to explore this a little further tomorrow (maybe I need to create the new table outside the loop first?), but as a first pass, my workflow is doing what I need it to do.
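
One possible workaround, sketched under an unconfirmed assumption: the error may come from writing to the database over the same connection while the chunked read is still in progress. The version below sidesteps that by fetching each chunk with its own `LIMIT`/`OFFSET` query, so every read has finished before the write begins. It also drops the output table first, so re-running the cell does not append duplicate rows. The in-memory database, the generated rows, and the names `paged_query`, `page_size`, and `offset` are illustrative and not from the original notebook.

```python
import sqlite3
import pandas as pd

# Stand-in for bgTable.db: an in-memory database with made-up rows
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE bgTable (id INTEGER PRIMARY KEY, bgVal INTEGER, time TEXT)")
con.executemany(
    "INSERT INTO bgTable VALUES (?, ?, ?)",
    [(i, 50 + 30 * i, "12:00") for i in range(1, 11)],
)
con.commit()

# Same binning rule as binItems, plus placeholders for paging
paged_query = """
SELECT *,
CASE WHEN bgVal <= 80 THEN 'below'
     WHEN bgVal <= 250 THEN 'in_range'
     WHEN bgVal > 250 THEN 'above'
END AS bin
FROM bgTable
ORDER BY id
LIMIT ? OFFSET ?
"""

# Start from a clean output table so re-runs do not duplicate rows
con.execute("DROP TABLE IF EXISTS bgTableBinned")

page_size = 3
offset = 0
while True:
    # Each page is fully read into a DataFrame before anything is written
    page = pd.read_sql_query(paged_query, con, params=(page_size, offset))
    if page.empty:
        break
    page.to_sql("bgTableBinned", con, index=False, if_exists="append")
    offset += page_size

# Sanity check: the new table should have as many rows as the source
n_source = con.execute("SELECT COUNT(*) FROM bgTable").fetchone()[0]
n_binned = con.execute("SELECT COUNT(*) FROM bgTableBinned").fetchone()[0]
print(n_source, n_binned)
con.close()
```

The row-count check at the end is also a quick way to catch the duplicate rows that `if_exists='append'` produces when the cell is run more than once. For very large tables, `OFFSET` gets slower as it grows; filtering on `id` greater than the last id seen is a common alternative.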