<center><h1>Updating NHTSA Complaints sqlite Database</h1></center>

This example will show you how to create a sqlite table consisting of customer complaints filed with NHTSA (National Highway Traffic Safety Administration). The metadata for NHTSA's csv file (a description of each field) can be viewed [here](http://www-odi.nhtsa.dot.gov/downloads/folders/Complaints/CMPL.txt).

This example assumes that a ```nhtsa.db``` sqlite database was already created. With sqlite installed, just enter the following at the terminal:<br>
***prompt>***```sqlite3 nhtsa.db```
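
If the sqlite command-line shell is not at hand, the database file can also be created from Python, since `sqlite3.connect` creates the file when it does not exist yet. A minimal sketch, using a temporary directory as a stand-in location (an assumption for illustration, not the path used in the rest of this example):

```python
import os
import sqlite3
import tempfile

# Hypothetical location for illustration; the cells below use D:\NHTSA\nhtsa.db
db_path = os.path.join(tempfile.mkdtemp(), 'nhtsa.db')

# connect() creates the database file if it does not already exist
conn = sqlite3.connect(db_path)
conn.close()

print(os.path.exists(db_path))
```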

**Why [sqlite](http://www.sqlite.org)?**
- data is relatively small
- database will be used by a small number of users
- Python already has a built-in library to interact with a sqlite database
- it is free and used in production by several companies and web sites; it is probably the most widely deployed database of all time, used in most if not all smart phones, browsers, embedded devices, etc. ([source](http://www.sqlite.org/mostdeployed.html))

There are a few different ways I could have obtained the data and managed the rest of my workflow. I could manually go to NHTSA's site, download the zip file onto my computer, extract it, clean the data, and then import it into a database. Since I would have to repeat this process every month, I wanted to automate it with a single script that Windows Task Scheduler or Linux cron could run on a schedule.

## Download the zip file in memory, then extract

Since I have enough RAM on my computer, I will download the zip file and hold it in memory instead of writing the archive to disk. The contents of the zip file, however, will be extracted to disk.

In [1]:
from io import BytesIO
from zipfile import ZipFile
from urllib import request
import datetime as dt

start = dt.datetime.now()

url = request.urlopen('http://www-odi.nhtsa.dot.gov/downloads/folders/Complaints/FLAT_CMPL.zip')
zipfile_in_memory = ZipFile(BytesIO(url.read()))
zipfile_in_memory.extractall(r'D:\temp')
zipfile_in_memory.close()
print("Download and extraction completed")

Download and extraction completed
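
The same in-memory pattern can be tried offline. The sketch below builds a tiny zip archive in memory as a stand-in for NHTSA's file (the member name `sample.txt` and its two rows are made up), then lists and reads it without touching the disk. The `with` blocks close the archive even if an error occurs along the way.

```python
from io import BytesIO
from zipfile import ZipFile

# Build a small zip archive entirely in memory to stand in for the download
# (the file name and contents are made up for illustration)
buffer = BytesIO()
with ZipFile(buffer, 'w') as archive:
    archive.writestr('sample.txt', '1\tFORD\tFOCUS\n2\tHONDA\tCIVIC\n')

# Re-open the same bytes for reading, just as with BytesIO(url.read())
buffer.seek(0)
with ZipFile(buffer) as zipfile_in_memory:
    member_names = zipfile_in_memory.namelist()
    # read() returns the member's bytes without writing anything to disk
    content = zipfile_in_memory.read('sample.txt').decode('ISO-8859-1')
    first_line = content.splitlines()[0]

print(member_names, first_line)
```

Calling `namelist()` before `extractall()` is also a cheap way to check what an archive contains before anything lands on disk.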


### NHTSA omitted the column names from the csv file, even though they are defined in the data description file, so I had to create them myself

Below I've created a ```columns``` list to hold the column names that will be used for the ```complaints``` table in my sqlite database:

In [2]:
columns = [
    'CMPLID',
    'ODINO',
    'MFR_NAME',
    'MAKETXT',
    'MODELTXT',
    'YEARTXT',
    'CRASH',
    'FAILDATE',
    'FIRE',
    'INJURED',
    'DEATHS',
    'COMPDESC',
    'CITY',
    'STATE',
    'VIN',
    'DATEA',
    'LDATE',
    'MILES',
    'OCCURENCES',
    'CDESCR',
    'CMPL_TYPE',
    'POLICE_RPT_YN',
    'PURCH_DT',
    'ORIG_OWNER_YN',
    'ANTI_BRAKES_YN',
    'CRUISE_CONT_YN',
    'NUM_CYLS',
    'DRIVE_TRAIN',
    'FUEL_SYS',
    'FUEL_TYPE',
    'TRANS_TYPE',
    'VEH_SPEED',
    'DOT',
    'TIRE_SIZE',
    'LOC_OF_TIRE',
    'TIRE_FAIL_TYPE',
    'ORIG_EQUIP_YN',
    'MANUF_DT',
    'SEAT_TYPE',
    'RESTRAINT_TYPE',
    'DEALER_NAME',
    'DEALER_TEL',
    'DEALER_CITY',
    'DEALER_STATE',
    'DEALER_ZIP',
    'PROD_TYPE',
    'REPAIRED_YN',
    'MEDICAL_ATTN',
    'VEHICLES_TOWED_YN'
]
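
To see how a separate list of names gets attached to a headerless, tab-delimited file, here is a small sketch that uses the standard library's `csv.DictReader` rather than pandas. The three-column list and the two rows are made up; `fieldnames` plays the same role that the `names` argument of pandas `read_csv` plays in the next section.

```python
import csv
from io import StringIO

# Shortened, hypothetical column list and made-up rows for illustration
sample_columns = ['CMPLID', 'MAKETXT', 'MODELTXT']
raw = StringIO('1\tFORD\tFOCUS\n2\tHONDA\tCIVIC\n')

# fieldnames supplies the header that the file itself lacks
reader = csv.DictReader(raw, fieldnames=sample_columns, delimiter='\t')
records = [dict(row) for row in reader]

print(records[0])
```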

## Connect to the sqlite3 database and read in the csv file in chunks

This will create and populate the ```complaints``` table in the nhtsa.db sqlite database. I used the ```chunksize``` parameter of Pandas' read_csv because of the size of the csv file. With chunking, I read 20K rows at a time and append them to the complaints table instead of attempting to add all rows at once. Since only 20K rows are in memory at any time, this technique also works with out-of-core or larger-than-memory csv files (data that is larger than RAM but fits on the hard drive).

In [3]:
import sqlite3
import pandas as pd
import datetime as dt

conn = sqlite3.connect(r'D:\NHTSA\nhtsa.db')
cursor = conn.cursor()

# Since we are going to re-create the complaints table in its entirety, DROP it first
cursor.execute('DROP TABLE IF EXISTS complaints')

chunksize = 20000
j = 0

begin = dt.datetime.now()

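# use the columns list to define the column names of the complaints table
# on_bad_lines='warn' skips malformed lines with a warning (pandas >= 1.3;
# older versions used error_bad_lines=False for this)
for df in pd.read_csv(r'D:\temp\FLAT_CMPL.txt', names=columns, dtype=object, chunksize=chunksize,
                      delimiter='\t', iterator=True, encoding='ISO-8859-1', on_bad_lines='warn'):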
    j+=1
    # To print on same line, use '\r' and end='' option with the print function
    print('\r'+'{} seconds: completed {} rows'.format((dt.datetime.now() - begin).seconds, j*chunksize),end='')

    df.to_sql('complaints', conn, if_exists='append', index=False)
cursor.close()
conn.close()

227 seconds: completed 1340000 rows
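
For readers without pandas, the same read-a-chunk, append-a-chunk loop can be sketched with only the standard library. This is a swapped-in illustration, not the method used above: the `sample` table, the helper `load_in_chunks`, and the five made-up rows are all assumptions. It also counts the rows actually inserted, whereas `j*chunksize` in the cell above rounds the final figure up to a multiple of the chunk size.

```python
import csv
import sqlite3
from io import StringIO
from itertools import islice

def load_in_chunks(lines, conn, chunksize):
    """Append tab-delimited rows to a 3-column table, chunksize rows at a time."""
    reader = csv.reader(lines, delimiter='\t')
    total = 0
    while True:
        # islice pulls at most chunksize rows, so only one chunk is held in memory
        chunk = list(islice(reader, chunksize))
        if not chunk:
            break
        conn.executemany('INSERT INTO sample VALUES (?, ?, ?)', chunk)
        conn.commit()
        total += len(chunk)  # actual rows, so a short last chunk is not overcounted
    return total

# Made-up data: 5 rows loaded 2 at a time into an in-memory database
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE sample (CMPLID TEXT, MAKETXT TEXT, MODELTXT TEXT)')
data = StringIO(''.join('{}\tFORD\tFOCUS\n'.format(i) for i in range(5)))
rows_loaded = load_in_chunks(data, conn, chunksize=2)
rows_in_table = conn.execute('SELECT COUNT(*) FROM sample').fetchone()[0]
conn.close()

print(rows_loaded, rows_in_table)
```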

### To improve query performance, create indices on the columns most frequently used for filtering

Judging from my past queries, I filter most often on combinations of make, model, model year, and failure date.

In [4]:
conn = sqlite3.connect(r'D:\NHTSA\nhtsa.db')
cursor = conn.cursor()

cursor.execute('CREATE INDEX make ON complaints (MAKETXT)')
cursor.execute('CREATE INDEX addeddate ON complaints (DATEA)')
cursor.execute('CREATE INDEX faildate ON complaints (FAILDATE)')
cursor.execute('CREATE INDEX compdesc ON complaints (COMPDESC)')
cursor.execute('CREATE INDEX "make-faildate" ON complaints (MAKETXT, FAILDATE)')
cursor.execute('CREATE INDEX "year-make-model" ON complaints (MAKETXT, MODELTXT, YEARTXT)')

cursor.close()
conn.close()
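
One way to check that a query actually benefits from an index is to ask sqlite for its query plan. The sketch below uses a tiny in-memory stand-in for the complaints table (two made-up rows and only two of the real columns) with a single `make` index. The exact wording of the plan varies between sqlite versions, but the description should mention the index when it is used.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
# Tiny stand-in table with made-up rows; only two of the real columns are used
conn.execute('CREATE TABLE complaints (MAKETXT TEXT, FAILDATE TEXT)')
conn.executemany('INSERT INTO complaints VALUES (?, ?)',
                 [('FORD', '20150101'), ('HONDA', '20150202')])
conn.execute('CREATE INDEX make ON complaints (MAKETXT)')

# The last column of each EXPLAIN QUERY PLAN row describes the chosen strategy
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM complaints WHERE MAKETXT = 'FORD'"
).fetchall()
plan_details = [row[-1] for row in plan]
conn.close()

print(plan_details)
```

A plan that reports a scan of the whole table instead of a search using an index suggests that the filter columns are not covered by any index.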

In [5]:
print("Total elapsed time (hr:min:sec):", dt.datetime.now() - start)

Total elapsed time (hr:min:sec): 0:13:08.021469
