<center><h1>Creating a sqlite table from a large CSV file</h1></center>

This example will show you how to create a sqlite table consisting of customer complaints filed with NHTSA (U.S. National Highway Traffic Safety Administration).  The metadata for NHTSA's csv file (information about the data) can be viewed [here](http://www-odi.nhtsa.dot.gov/downloads/folders/Complaints/CMPL.txt).

This example assumes that a ```nhtsa.db``` sqlite database was already created.  With sqlite installed, at the terminal, just enter:<br>
***prompt>*** ```sqlite3 nhtsa.db```

**Why [sqlite](https://www.sqlite.org)?**
- data is relatively small
- database will be used by a small number of users
- Python already has a built-in library to interact with a sqlite database
- it is free and used in production at several companies and web sites; it is probably the most widely deployed database of all time, found in most if not all smartphones, browsers, embedded devices, etc. ([source](http://www.sqlite.org/mostdeployed.html))

There are a few different ways I could have obtained the data and managed the rest of my workflow.  I could manually go to NHTSA's site, download the zip file onto my computer, extract it, clean the data, and then import it into a database.  Since I would have to repeat this process every month, I wanted to automate it with a single script that Windows Task Scheduler or Linux cron could run on a schedule.

## Download the zip file in memory, then extract

Since I have enough RAM on my computer, I will be downloading the zip file and holding it in memory instead of physically writing/creating a file on disk.  However, the contents of the zip file will be saved on the computer.

In [1]:
from io import BytesIO
from zipfile import ZipFile
from urllib import request
import datetime as dt

url = request.urlopen('http://www-odi.nhtsa.dot.gov/downloads/folders/Complaints/FLAT_CMPL.zip')
zipfile_in_memory = ZipFile(BytesIO(url.read()))
zipfile_in_memory.extractall(r'D:\temp')
zipfile_in_memory.close()
print("zip download and extraction complete")

zip download and extraction complete
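
Before extracting, it can be handy to check what the archive contains and peek at its first line, all without touching the disk.  The snippet below is a minimal sketch of that idea; `peek_zip_member` is a hypothetical helper, and the archive it reads is a tiny stand-in built in memory with made-up rows rather than the real NHTSA download.

```python
from io import BytesIO, TextIOWrapper
from zipfile import ZipFile


def peek_zip_member(zip_bytes, member, n_lines=1, encoding='ISO-8859-1'):
    """Return the first n_lines of a zip member without extracting it to disk."""
    with ZipFile(BytesIO(zip_bytes)) as zf:
        with zf.open(member) as raw:
            text = TextIOWrapper(raw, encoding=encoding, newline='')
            lines = []
            for _ in range(n_lines):
                line = text.readline()
                if not line:
                    break
                lines.append(line.rstrip('\r\n'))
            return lines


# Build a tiny stand-in archive in memory (made-up content, not NHTSA data)
buffer = BytesIO()
with ZipFile(buffer, 'w') as zf:
    zf.writestr('FLAT_CMPL.txt', '1\t100\tFord Motor Company\n2\t101\tToyota\n')

# List the members, then read the first line of one of them
with ZipFile(BytesIO(buffer.getvalue())) as zf:
    member_names = zf.namelist()

first_lines = peek_zip_member(buffer.getvalue(), 'FLAT_CMPL.txt')
print(member_names, first_lines)
```

With the real download, the same function could be called on the bytes returned by `url.read()`.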


### Not sure why NHTSA omitted the column names from the csv file since they've already defined them in the data description file, so I had to create them myself

Below I've created a ```columns``` list containing the column names that will be used for the ```complaints``` table in my sqlite database:

In [3]:
columns = [
    'CMPLID',
    'ODINO',
    'MFR_NAME',
    'MAKETXT',
    'MODELTXT',
    'YEARTXT',
    'CRASH',
    'FAILDATE',
    'FIRE',
    'INJURED',
    'DEATHS',
    'COMPDESC',
    'CITY',
    'STATE',
    'VIN',
    'DATEA',
    'LDATE',
    'MILES',
    'OCCURENCES',
    'CDESCR',
    'CMPL_TYPE',
    'POLICE_RPT_YN',
    'PURCH_DT',
    'ORIG_OWNER_YN',
    'ANTI_BRAKES_YN',
    'CRUISE_CONT_YN',
    'NUM_CYLS',
    'DRIVE_TRAIN',
    'FUEL_SYS',
    'FUEL_TYPE',
    'TRANS_TYPE',
    'VEH_SPEED',
    'DOT',
    'TIRE_SIZE',
    'LOC_OF_TIRE',
    'TIRE_FAIL_TYPE',
    'ORIG_EQUIP_YN',
    'MANUF_DT',
    'SEAT_TYPE',
    'RESTRAINT_TYPE',
    'DEALER_NAME',
    'DEALER_TEL',
    'DEALER_CITY',
    'DEALER_STATE',
    'DEALER_ZIP',
    'PROD_TYPE',
    'REPAIRED_YN',
    'MEDICAL_ATTN',
    'VEHICLES_TOWED_YN'
]
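
Because the names are typed by hand, a quick sanity check that the list has as many entries as the file has fields can catch a missing or extra name before loading.  The following is a rough sketch under that assumption; `count_fields` is a hypothetical helper, and the demo file is a three-column stand-in with made-up values, not the NHTSA file.

```python
import csv
import os
import tempfile


def count_fields(path, delimiter='\t', encoding='ISO-8859-1'):
    """Return the number of delimited fields on the first line of the file."""
    with open(path, newline='', encoding=encoding) as f:
        reader = csv.reader(f, delimiter=delimiter, quoting=csv.QUOTE_NONE)
        first_row = next(reader)
    return len(first_row)


# Stand-in data: a three-column, tab-delimited file with made-up values
demo_columns = ['CMPLID', 'ODINO', 'MFR_NAME']
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False,
                                 encoding='ISO-8859-1', newline='') as tmp:
    tmp.write('1\t100\tFord Motor Company\n')
    demo_path = tmp.name

n_fields = count_fields(demo_path)
os.remove(demo_path)

columns_match = (n_fields == len(demo_columns))
print('fields in file: {}, names in list: {}'.format(n_fields, len(demo_columns)))
```

For the real data, the comparison would be between `count_fields(r'D:\temp\FLAT_CMPL.txt')` and `len(columns)`.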

## Connect to the sqlite3 database and read in the csv file in chunks

This will create and populate the ```complaints``` table in the nhtsa.db sqlite database.  I used the ```chunksize``` parameter of Pandas' ```read_csv``` because of the size of the csv file.  With chunking, I read 20K rows at a time and append them to the complaints table instead of attempting to load all rows at once.  Since only 20K rows are held in memory at any time, this technique also works with out-of-core or larger-than-memory csv files (data that is larger than RAM, but fits on the hard drive).

In [None]:
import sqlite3
import pandas as pd

conn = sqlite3.connect(r'D:\NHTSA\nhtsa.db')
cursor = conn.cursor()

# Since we are going to load/re-create the complaints table in its entirety, DROP it
cursor.execute('DROP TABLE IF EXISTS complaints')

start = dt.datetime.now()
chunksize = 20000
j = 0

# use the columns list to define the column names of the complaints table
# on_bad_lines='skip' drops malformed rows instead of raising (requires pandas >= 1.3)
for df in pd.read_csv(r'D:\temp\FLAT_CMPL.txt', names=columns, dtype=object, chunksize=chunksize,
                      delimiter='\t', encoding='ISO-8859-1', on_bad_lines='skip'):
    j+=1
    # To print on same line, use '\r' and end='' option with the print function
    print('\r'+'{} seconds: completed {} rows'.format((dt.datetime.now() - start).seconds, j*chunksize),end='')

    df.to_sql('complaints', conn, if_exists='append', index=False)
cursor.close()
conn.close()
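
To see the chunking pattern in isolation, here is a small self-contained sketch that loads a five-row, tab-delimited stand-in into an in-memory sqlite database two rows at a time.  The `load_in_chunks` helper, the `complaints_demo` table, and the sample rows are all made up for illustration; the helper also counts the rows actually written, which avoids overstating progress when the last chunk is short.

```python
import io
import sqlite3

import pandas as pd


def load_in_chunks(source, conn, table, names, chunksize):
    """Append a tab-delimited source to `table` chunk by chunk; return rows written."""
    rows_written = 0
    reader = pd.read_csv(source, names=names, dtype=object,
                         delimiter='\t', chunksize=chunksize)
    for chunk in reader:
        chunk.to_sql(table, conn, if_exists='append', index=False)
        rows_written += len(chunk)  # count real rows, not chunks * chunksize
    return rows_written


# Five made-up rows, loaded two at a time (chunks of 2, 2 and 1)
demo_csv = io.StringIO(''.join('{}\tFORD\t2016010{}\n'.format(i, i) for i in range(1, 6)))
demo_conn = sqlite3.connect(':memory:')
total_rows = load_in_chunks(demo_csv, demo_conn, 'complaints_demo',
                            ['CMPLID', 'MAKETXT', 'FAILDATE'], chunksize=2)
rows_in_table = demo_conn.execute('SELECT COUNT(*) FROM complaints_demo').fetchone()[0]
demo_conn.close()
print(total_rows, rows_in_table)
```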

**NOTE:** Alternatively, instead of loading the entire csv file into the sqlite database every time, I could use the ```DATEA``` column, which represents the date the record was added to the file, to load only newly added data.  However, this would add complexity and extra steps, since I would still have to filter the original csv file down to just the new rows and work around memory or performance problems in doing so.  Therefore, I would probably still end up loading the entire data set into the database anyway.
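
For the curious, a ```DATEA```-based incremental load could look roughly like the sketch below.  It is only an illustration, not what this notebook does: `append_new_rows` is a hypothetical helper, the demo table and rows are made up, and it assumes ```DATEA``` is always a ```YYYYMMDD``` string and that rows are never added later with an older or equal date.

```python
import io
import sqlite3

import pandas as pd


def append_new_rows(source, conn, table, names, chunksize):
    """Append only rows whose DATEA is later than the newest DATEA already stored."""
    # `table` is assumed to be a trusted name; it is not user input
    latest = conn.execute('SELECT MAX(DATEA) FROM {}'.format(table)).fetchone()[0] or ''
    added = 0
    for chunk in pd.read_csv(source, names=names, dtype=object,
                             delimiter='\t', chunksize=chunksize):
        # YYYYMMDD strings sort chronologically, so a text comparison is enough;
        # rows with a missing DATEA become '' and are skipped
        new_rows = chunk[chunk['DATEA'].fillna('') > latest]
        if not new_rows.empty:
            new_rows.to_sql(table, conn, if_exists='append', index=False)
            added += len(new_rows)
    return added


# Made-up example: two rows already stored, the "new" file has one extra row
demo_names = ['CMPLID', 'DATEA']
demo_conn = sqlite3.connect(':memory:')
pd.DataFrame({'CMPLID': ['1', '2'], 'DATEA': ['20160101', '20160201']}).to_sql(
    'complaints_demo', demo_conn, index=False)
demo_file = io.StringIO('1\t20160101\n2\t20160201\n3\t20160301\n')
added = append_new_rows(demo_file, demo_conn, 'complaints_demo', demo_names, chunksize=2)
stored_ids = [r[0] for r in demo_conn.execute(
    'SELECT CMPLID FROM complaints_demo ORDER BY CMPLID')]
demo_conn.close()
print(added, stored_ids)
```

Note that the whole file is still read chunk by chunk, so this saves database writes rather than read time.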

### To improve query performance, create indices on the columns most frequently used for filtering

Based on my past querying history, I've been filtering a lot on combinations of make, model, model year, and failure date.

In [None]:
conn = sqlite3.connect(r'D:\NHTSA\nhtsa.db')
cursor = conn.cursor()

cursor.execute('CREATE INDEX make ON complaints (MAKETXT)')
cursor.execute('CREATE INDEX "make-faildate" ON complaints (MAKETXT, FAILDATE)')
cursor.execute('CREATE INDEX "year-make-model" ON complaints (MAKETXT, MODELTXT, YEARTXT)')

cursor.close()
conn.close()
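
One way to check whether a query would actually use an index is sqlite's ```EXPLAIN QUERY PLAN```.  The sketch below runs it against an empty in-memory stand-in table before and after creating an index; `query_plan`, `complaints_demo`, and `make_demo` are names invented for this illustration, and the exact wording of the plan text varies between sqlite versions.

```python
import sqlite3


def query_plan(conn, sql, params=()):
    """Return the 'detail' strings sqlite reports for EXPLAIN QUERY PLAN."""
    rows = conn.execute('EXPLAIN QUERY PLAN ' + sql, params).fetchall()
    # The human-readable detail is the last column of each plan row
    return [row[-1] for row in rows]


demo_conn = sqlite3.connect(':memory:')
demo_conn.execute(
    'CREATE TABLE complaints_demo (MAKETXT TEXT, MODELTXT TEXT, FAILDATE TEXT)')

demo_sql = 'SELECT * FROM complaints_demo WHERE MAKETXT = ?'
plan_before = query_plan(demo_conn, demo_sql, ('FORD',))

demo_conn.execute('CREATE INDEX make_demo ON complaints_demo (MAKETXT)')
plan_after = query_plan(demo_conn, demo_sql, ('FORD',))
demo_conn.close()

print(plan_before)  # a full table scan
print(plan_after)   # a search that names the make_demo index
```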

## Basic querying/filtering example using SQL

Here, I am looking for "serious" complaints from Toyota and Ford vehicles for failures that occurred in 2016.

In [39]:
conn = sqlite3.connect(r'D:\NHTSA\nhtsa.db')

sql = """
SELECT
MFR_NAME,
MAKETXT,
MODELTXT, 
YEARTXT, 
FAILDATE, 
CRASH, 
FIRE, 
INJURED, 
DEATHS,
VEHICLES_TOWED_YN,
COMPDESC, 
MILES, 
LDATE, 
OCCURENCES, 
CDESCR,
DATEA

FROM complaints

WHERE
MAKETXT IN('TOYOTA','FORD')
AND FAILDATE like '2016%'
AND (CRASH = 'Y'
    OR FIRE = 'Y'
    OR CAST(INJURED AS INTEGER) > 0   -- INJURED and DEATHS hold counts, not Y/N flags
    OR CAST(DEATHS AS INTEGER) > 0
    OR MEDICAL_ATTN = 'Y'
    OR VEHICLES_TOWED_YN = 'Y'
)
limit 10
"""

sample = pd.read_sql_query(sql, conn)
conn.close()

In [40]:
sample

Unnamed: 0,MFR_NAME,MAKETXT,MODELTXT,YEARTXT,FAILDATE,CRASH,FIRE,INJURED,DEATHS,VEHICLES_TOWED_YN,COMPDESC,MILES,LDATE,OCCURENCES,CDESCR,DATEA
0,Ford Motor Company,FORD,C-MAX,2015,20160813,Y,N,,,N,POWER TRAIN,5000,20160815,,REVERSE GEAR SETTING ACTUALLY PUT THE CAR IN DRIVE AFTER STARTING THE PARKED CAR ND ATTEMPTING TO BACK OUT OFPARKING...,20160815
1,Ford Motor Company,FORD,C-MAX ENERGI,2013,20160801,N,Y,,,N,ELECTRICAL SYSTEM,23000,20160908,,"THE 120 VOLT, 12 AMP CHARGE CORD THAT CAME WITH MY CAR (FORD PART NO. FM58-10B706AC) IS OVERHEATING CREATING A FIRE ...",20160908
2,Ford Motor Company,FORD,C-MAX ENERGI,2013,20160801,N,Y,,,N,FUEL/PROPULSION SYSTEM,23000,20160908,,"THE 120 VOLT, 12 AMP CHARGE CORD THAT CAME WITH MY CAR (FORD PART NO. FM58-10B706AC) IS OVERHEATING CREATING A FIRE ...",20160908
3,Ford Motor Company,FORD,C-MAX ENERGI,2013,20160801,N,Y,,,N,HYBRID PROPULSION SYSTEM: INVERTER,20000,20160908,,THE FORD EV CHARGE CORD (FORD PART NO. FM58-10B706AC) THAT CAME WITH MY 2013 CMAX ENERGI DRAWS TOO MUCH CURRENT AND ...,20160908
4,Ford Motor Company,FORD,C-MAX ENERGI,2014,20160726,N,N,0.0,0.0,Y,ENGINE,52000,20160913,,"TL* THE CONTACT OWNS A 2014 FORD C-MAX ENERGI HYBRID. WHILE DRIVING 35 MPH, THE VEHICLE STALLED WITHOUT WARNING. THE...",20160913
5,Ford Motor Company,FORD,C-MAX ENERGI,2015,20160129,Y,N,,,Y,ELECTRICAL SYSTEM,6000,20160218,,1. REAR WINDOWS OPEN 6 TO 8 INCHES UPON RETURNING TO CAR FROM STORE- HAPPENED 4 TIMES. (SITTING IN PARKING LOTS) ...,20160218
6,Ford Motor Company,FORD,C-MAX ENERGI,2015,20160129,Y,N,,,Y,UNKNOWN OR OTHER,6000,20160218,,1. REAR WINDOWS OPEN 6 TO 8 INCHES UPON RETURNING TO CAR FROM STORE- HAPPENED 4 TIMES. (SITTING IN PARKING LOTS) ...,20160218
7,Ford Motor Company,FORD,C-MAX ENERGI,2015,20160129,Y,N,,,Y,STEERING,6000,20160218,,1. REAR WINDOWS OPEN 6 TO 8 INCHES UPON RETURNING TO CAR FROM STORE- HAPPENED 4 TIMES. (SITTING IN PARKING LOTS) ...,20160218
8,Ford Motor Company,FORD,C-MAX ENERGI,2015,20160517,Y,N,,,Y,STRUCTURE:BODY,8200,20160519,,"ABOUT A MONTH AGO IN APRIL 2016, I PULLED INTO MY DRIVEWAY AND UPON PRESSING THE BRAKE PEDAL THE CAR SURGED AHEAD AN...",20160519
9,Ford Motor Company,FORD,C-MAX ENERGI,2015,20160517,Y,N,,,Y,ENGINE,8200,20160519,,"ABOUT A MONTH AGO IN APRIL 2016, I PULLED INTO MY DRIVEWAY AND UPON PRESSING THE BRAKE PEDAL THE CAR SURGED AHEAD AN...",20160519
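
If the makes or the year change from run to run, the values can be passed as query parameters instead of being edited into the SQL string.  Below is a minimal sketch using the ```params``` argument of ```pd.read_sql_query``` with sqlite's ```?``` placeholders; `serious_complaints` is a hypothetical helper, it filters on fewer columns than the query above, and the rows in the in-memory table are made up.

```python
import sqlite3

import pandas as pd


def serious_complaints(conn, makes, year, limit=10):
    """Return complaints for the given makes and failure year that report a crash or fire."""
    placeholders = ', '.join('?' for _ in makes)  # one ? per make
    sql = """
    SELECT MAKETXT, MODELTXT, FAILDATE, CRASH, FIRE
    FROM complaints
    WHERE MAKETXT IN ({})
      AND FAILDATE LIKE ?
      AND (CRASH = 'Y' OR FIRE = 'Y')
    LIMIT ?
    """.format(placeholders)
    params = list(makes) + [str(year) + '%', limit]
    return pd.read_sql_query(sql, conn, params=params)


# Made-up rows in an in-memory stand-in for the complaints table
demo_conn = sqlite3.connect(':memory:')
demo_conn.execute(
    'CREATE TABLE complaints (MAKETXT TEXT, MODELTXT TEXT, FAILDATE TEXT, CRASH TEXT, FIRE TEXT)')
demo_conn.executemany('INSERT INTO complaints VALUES (?, ?, ?, ?, ?)', [
    ('FORD', 'C-MAX', '20160813', 'Y', 'N'),    # matches
    ('TOYOTA', 'CAMRY', '20150101', 'Y', 'N'),  # wrong year
    ('HONDA', 'CIVIC', '20160202', 'N', 'Y'),   # wrong make
    ('TOYOTA', 'PRIUS', '20160305', 'N', 'N'),  # no crash or fire
])
demo_result = serious_complaints(demo_conn, ['TOYOTA', 'FORD'], 2016)
demo_conn.close()
print(demo_result)
```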


**Instead of viewing the data in the browser, you can always export the dataframe to Excel via:**

```sample.to_excel('path to where I want to save.xlsx')```

# [db.py](https://github.com/yhat/db.py) - An easier way to interact with databases

If you like working with Pandas dataframes, db.py is a library that conveniently returns the results of our queries as Pandas dataframes.

In [10]:
from db import DB
import pandas as pd

db = DB(filename=r'D:\NHTSA\nhtsa.db', dbtype="sqlite")

Indexing schema. This will take a second...finished!


### db.py comes with a few useful "helper" functions to help us inspect our database

Let's see what tables are in our database:

In [11]:
db.tables

Refreshing schema. Please wait...done!


Table,Columns
complaints,"CMPLID, ODINO, MFR_NAME, MAKETXT, MODELTXT, YEARTXT, CRASH, FAILDATE, FIRE, INJURED, DEATHS, COMPDESC, CITY, STATE, VIN, DATEA, LDATE, MILES, OCCURENCES, CDESCR, CMPL_TYPE, POLICE_RPT_YN, PURCH_DT, ORIG_OWNER_YN, ANTI_BRAKES_YN, CRUISE_CONT_YN, NUM_CYLS, DRIVE_TRAIN, FUEL_SYS, FUEL_TYPE, TRANS_TYPE, VEH_SPEED, DOT, TIRE_SIZE, LOC_OF_TIRE, TIRE_FAIL_TYPE, ORIG_EQUIP_YN, MANUF_DT, SEAT_TYPE, RESTRAINT_TYPE, DEALER_NAME, DEALER_TEL, DEALER_CITY, DEALER_STATE, DEALER_ZIP, PROD_TYPE, REPAIRED_YN, MEDICAL_ATTN, VEHICLES_TOWED_YN"
recalls,"RECORD_ID, CAMPNO, MAKETXT, MODELTXT, YEARTXT, MFGCAMPNO, COMPNAME, MFGNAME, BGMAN, ENDMAN, RCLTYPECD, POTAFF, ODATE, INFLUENCED_BY, MFGTXT, RCDATE, DATEA, RPNO, FMVSS, DESC_DEFECT, CONSEQUENCE_DEFECT, CORRECTIVE_ACTION, NOTES, RCL_CMPT_ID"


In [30]:
db.tables.complaints

Column,Type,Foreign Keys,Reference Keys
CMPLID,TEXT,,
ODINO,TEXT,,
MFR_NAME,TEXT,,
MAKETXT,TEXT,,
MODELTXT,TEXT,,
YEARTXT,TEXT,,
CRASH,TEXT,,
FAILDATE,TEXT,,
FIRE,TEXT,,
INJURED,TEXT,,


#### As usual, we can send a multi-line SQL statement:

In [48]:
result = db.query(
"""
SELECT
MFR_NAME,
MAKETXT,
MODELTXT, 
YEARTXT, 
FAILDATE,
LDATE,
CRASH, 
FIRE, 
INJURED, 
DEATHS,
VEHICLES_TOWED_YN,
COMPDESC, 
MILES, 
LDATE, 
OCCURENCES, 
CDESCR,
DATEA

FROM complaints

WHERE
MAKETXT IN('TOYOTA','FORD')
AND FAILDATE > '20160930'
AND (CRASH = 'Y'
    OR FIRE = 'Y'
    OR CAST(INJURED AS INTEGER) > 0   -- INJURED and DEATHS hold counts, not Y/N flags
    OR CAST(DEATHS AS INTEGER) > 0
    OR MEDICAL_ATTN = 'Y'
    OR VEHICLES_TOWED_YN = 'Y'
)
limit 4
"""
)

### Let's view our query result

In [49]:
result

Unnamed: 0,MFR_NAME,MAKETXT,MODELTXT,YEARTXT,FAILDATE,LDATE,CRASH,FIRE,INJURED,DEATHS,VEHICLES_TOWED_YN,COMPDESC,MILES,LDATE.1,OCCURENCES,CDESCR,DATEA
0,Ford Motor Company,FORD,TAURUS,2014,20161001,20161004,Y,N,4.0,0.0,N,STRUCTURE:BODY,78000,20161004,,"WAS RIDING IN BACK SEAT OF THIS CAR WHEN IT WAS REAR ENDED BY A 2014 JEEP CHEROKEE. TAURUS WAS STOPPED ON HIGHWAY, ...",20161004
1,Ford Motor Company,FORD,TAURUS,2014,20161001,20161004,Y,N,4.0,0.0,N,SEAT BELTS,78000,20161004,,"WAS RIDING IN BACK SEAT OF THIS CAR WHEN IT WAS REAR ENDED BY A 2014 JEEP CHEROKEE. TAURUS WAS STOPPED ON HIGHWAY, ...",20161004
2,Ford Motor Company,FORD,FUSION,2012,20161003,20161004,Y,N,,,Y,STRUCTURE:BODY,101000,20161004,,POWER STEERING WENT OUT,20161004
3,Ford Motor Company,FORD,FUSION,2012,20161003,20161004,Y,N,,,Y,STEERING,101000,20161004,,POWER STEERING WENT OUT,20161004


In [51]:
result = db.query(
"""
SELECT *

FROM complaints

where
FAILDATE >= '20160101'
"""
)

In [53]:
result.to_excel(r'D:\temp\sample.xlsx', index=False)

### [SqliteStudio](http://sqlitestudio.pl/)

There are of course GUI tools to interact with your sqlite database if you don't want to muck around issuing SQL queries using Python.  I currently use [SqliteStudio](http://sqlitestudio.pl/).