<center><h1>Creating a sqlite table from a large CSV file</h1></center>

This example shows how to create a sqlite table of customer complaints filed with NHTSA (the U.S. National Highway Traffic Safety Administration).  The metadata for NHTSA's CSV file (data about the data) can be viewed [here](http://www-odi.nhtsa.dot.gov/downloads/folders/Complaints/CMPL.txt).

This example assumes that the nhtsa.db sqlite database has already been created.  At a terminal, enter: sqlite3 nhtsa.db
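
The database file can also be opened from Python. This is a minimal sketch, assuming the standard-library `sqlite3` module, whose `connect` function opens the file at the given path and creates it if it does not exist yet. The helper name `open_database` and the temporary path are illustrative and not part of the original notebook.

```python
import os
import sqlite3
import tempfile


def open_database(path):
    """Open (or create) a SQLite database file and return the connection."""
    # sqlite3.connect creates the file when it does not exist yet.
    return sqlite3.connect(path)


# Illustrative path in a temporary directory; the notebook uses nhtsa.db.
db_path = os.path.join(tempfile.mkdtemp(), "nhtsa.db")
conn = open_database(db_path)
# A trivial query confirms that the connection is usable.
sqlite_version = conn.execute("SELECT sqlite_version()").fetchone()[0]
conn.close()
```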

## Download the zip file in memory, then extract

In [1]:
from io import BytesIO
from zipfile import ZipFile
from urllib import request

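# Download the archive into memory instead of saving the zip file to disk
response = request.urlopen('http://www-odi.nhtsa.dot.gov/downloads/folders/Complaints/FLAT_CMPL.zip')
# The with-block closes the archive once extraction finishes
with ZipFile(BytesIO(response.read())) as zipfile_in_memory:
    zipfile_in_memory.extractall('/home/pybokeh/temp/')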
print("zip download and extraction complete")

zip download and extraction complete
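
The extraction half of this step can be tried without network access. The sketch below builds a small zip archive in memory and extracts it with `ZipFile` used as a context manager, which closes the archive even if extraction fails. The function name `extract_zip_bytes`, the member name `sample.txt`, and the temporary directory are hypothetical stand-ins.

```python
import os
import tempfile
from io import BytesIO
from zipfile import ZipFile


def extract_zip_bytes(data, dest_dir):
    """Extract a zip archive held in memory (bytes) into dest_dir."""
    with ZipFile(BytesIO(data)) as archive:
        archive.extractall(dest_dir)
        return archive.namelist()


# Build a tiny archive in memory to stand in for the downloaded bytes.
buffer = BytesIO()
with ZipFile(buffer, "w") as archive:
    archive.writestr("sample.txt", "1\tFORD\n")

dest = tempfile.mkdtemp()
names = extract_zip_bytes(buffer.getvalue(), dest)
```

With the real download, the bytes returned by `read()` on the HTTP response would take the place of `buffer.getvalue()`.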


## Normal imports

In [59]:
import sqlite3
import pandas as pd
import datetime as dt
pd.set_option('display.max_rows', 1000)
pd.set_option('display.max_columns', 50)
pd.set_option('display.max_colwidth', 40)

## Since the flat file doesn't contain column headers, we have to supply the column names

In [11]:
columns = [
    'CMPLID',
    'ODINO',
    'MFR_NAME',
    'MAKETXT',
    'MODELTXT',
    'YEARTXT',
    'CRASH',
    'FAILDATE',
    'FIRE',
    'INJURED',
    'DEATHS',
    'COMPDESC',
    'CITY',
    'STATE',
    'VIN',
    'DATEA',
    'LDATE',
    'MILES',
    'OCCURENCES',
    'CDESCR',
    'CMPL_TYPE',
    'POLICE_RPT_YN',
    'PURCH_DT',
    'ORIG_OWNER_YN',
    'ANTI_BRAKES_YN',
    'CRUISE_CONT_YN',
    'NUM_CYLS',
    'DRIVE_TRAIN',
    'FUEL_SYS',
    'FUEL_TYPE',
    'TRANS_TYPE',
    'VEH_SPEED',
    'DOT',
    'TIRE_SIZE',
    'LOC_OF_TIRE',
    'TIRE_FAIL_TYPE',
    'ORIG_EQUIP_YN',
    'MANUF_DT',
    'SEAT_TYPE',
    'RESTRAINT_TYPE',
    'DEALER_NAME',
    'DEALER_TEL',
    'DEALER_CITY',
    'DEALER_STATE',
    'DEALER_ZIP',
    'PROD_TYPE',
    'REPAIRED_YN',
    'MEDICAL_ATTN',
    'VEHICLES_TOWED_YN'
]
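
Because the names are supplied by hand, it can help to confirm that their count matches the number of tab-separated fields in the file before loading. This is a rough sketch: `field_count_matches` is a hypothetical helper, the sample lines are made up, and the plain `split` assumes the file does not use quoted fields containing tabs.

```python
def field_count_matches(first_line, names, delimiter="\t"):
    """Return True if a raw line has as many fields as there are names."""
    fields = first_line.rstrip("\r\n").split(delimiter)
    return len(fields) == len(names)


# Made-up three-field line and names, standing in for the real file.
sample_names = ["CMPLID", "ODINO", "MFR_NAME"]
sample_line = "1\t958173\tFord Motor Company\n"
ok = field_count_matches(sample_line, sample_names)
too_short = field_count_matches("1\t958173\n", sample_names)
```

On the real file, the first line could be read with `open(path, encoding='ISO-8859-1').readline()` and checked against the full list of names.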

## Connect to the sqlite3 database and read in the flat file in chunks

This creates and populates the 'complaints' table in the nhtsa.db sqlite database.  The loop reads 20,000 rows at a time and appends each chunk to the complaints table, so only one chunk is held in memory at once.  As a result, the technique also works with CSV files that are larger than RAM, as long as they fit on the hard drive.  Adjust the chunk size to suit your computer's memory.

In [62]:
conn = sqlite3.connect('/home/pybokeh/temp/nhtsa.db')

start = dt.datetime.now()
chunksize = 20000
j = 0

# Skip malformed lines instead of raising an error (on_bad_lines requires pandas 1.3 or later)
for df in pd.read_csv('/home/pybokeh/temp/FLAT_CMPL.txt', names=columns, dtype=object, chunksize=chunksize,
                      delimiter='\t', encoding='ISO-8859-1', on_bad_lines='skip'):
    j += 1
    # To print on same line, use '\r' and end='' option with the print function
    print('\r'+'{} seconds: completed {} rows'.format((dt.datetime.now() - start).seconds, j*chunksize), end='')

    # Append this chunk to the complaints table, creating the table on the first pass
    df.to_sql('complaints', conn, if_exists='append', index=False)

395 seconds: completed 1280000 rows
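
The progress counter above multiplies the chunk count by the chunk size, so it overstates the total whenever the last chunk is short. The sketch below shows the same chunk-and-append pattern using only the standard library (`csv`, `itertools.islice`, `sqlite3`) in place of pandas, counts the rows actually inserted, and checks the total with `SELECT COUNT(*)`. The table name `demo`, its two columns, and the inline data are hypothetical.

```python
import csv
import sqlite3
from io import StringIO
from itertools import islice


def load_in_chunks(lines, conn, chunksize):
    """Append tab-delimited rows to the demo table, chunksize rows at a time."""
    reader = csv.reader(lines, delimiter="\t")
    inserted = 0
    while True:
        # Pull at most chunksize rows; only this chunk is held in memory.
        chunk = list(islice(reader, chunksize))
        if not chunk:
            break
        conn.executemany("INSERT INTO demo (id, make) VALUES (?, ?)", chunk)
        conn.commit()
        inserted += len(chunk)  # count real rows, not chunks * chunksize
    return inserted


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo (id TEXT, make TEXT)")
data = StringIO("1\tFORD\n2\tHONDA\n3\tFORD\n4\tTOYOTA\n5\tFORD\n")
inserted = load_in_chunks(data, conn, chunksize=2)
total = conn.execute("SELECT COUNT(*) FROM demo").fetchone()[0]
```

The same `SELECT COUNT(*)` check can be run against the complaints table after the pandas load to see how many rows were really written.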

## Some basic querying against the sqlite3 database

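Here, I am looking for "serious" complaints about Ford vehicles that were filed in 2016, meaning complaints that involve a crash, fire, injury, death, medical attention, or a towed vehicle.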

In [66]:
sample = pd.read_sql_query('SELECT MFR_NAME, MAKETXT, MODELTXT, YEARTXT, CRASH, FIRE, INJURED, DEATHS, '
                           'COMPDESC, MILES, LDATE, OCCURENCES, CDESCR '
                           
                           'FROM complaints '
                           
                           'WHERE '
                           "LDATE like '2016%' "
                           "AND MAKETXT IN('FORD') "
                           "AND (CRASH = 'Y' "
                           "OR FIRE = 'Y' "
                           "OR INJURED = 'Y' "
                           "OR DEATHS = 'Y' "
                           "OR MEDICAL_ATTN = 'Y' "
                           "OR VEHICLES_TOWED_YN = 'Y') "
                           'limit 5', conn)

In [67]:
sample

| | MFR_NAME | MAKETXT | MODELTXT | YEARTXT | CRASH | FIRE | INJURED | DEATHS | COMPDESC | MILES | LDATE | OCCURENCES | CDESCR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Ford Motor Company | FORD | FIVE HUNDRED | 2005 | Y | N | 2.0 | 0.0 | AIR BAGS | 55200 | 20160104 | | THE VEHICLE WAS IN A FRONT END COLLI... |
| 1 | Ford Motor Company | FORD | FUSION | 2015 | Y | N | 1.0 | | SERVICE BRAKES | 30000 | 20160104 | | TL* THE CONTACT OWNS A 2015 FORD FUS... |
| 2 | Ford Motor Company | FORD | FUSION | 2015 | Y | N | 1.0 | | AIR BAGS | 30000 | 20160104 | | TL* THE CONTACT OWNS A 2015 FORD FUS... |
| 3 | Ford Motor Company | FORD | FOCUS | 2010 | Y | N | | | SUSPENSION | 87500 | 20160104 | | I HAVE OWNED THE CAR SINCE IT WAS NE... |
| 4 | Ford Motor Company | FORD | WINDSTAR | 2004 | N | Y | | | ENGINE | 118000 | 20160105 | | TL* THE CONTACT OWNED A 2004 FORD WI... |
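
In the sample output, INJURED and DEATHS hold counts rather than Y/N flags, so the comparisons `INJURED = 'Y'` and `DEATHS = 'Y'` probably never match. A sketch of a count-based filter follows, run against a hypothetical in-memory table with made-up rows. It assumes the counts were stored as text, as they are when the file is loaded with `dtype=object`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE complaints (MAKETXT TEXT, LDATE TEXT, CRASH TEXT, "
    "FIRE TEXT, INJURED TEXT, DEATHS TEXT)"
)
rows = [
    ("FORD", "20160104", "N", "N", "2", "0"),    # injuries only
    ("FORD", "20160105", "N", "N", None, None),  # nothing serious
    ("FORD", "20150101", "Y", "N", "0", "0"),    # wrong year
    ("HONDA", "20160106", "Y", "N", "1", "0"),   # wrong make
]
conn.executemany("INSERT INTO complaints VALUES (?, ?, ?, ?, ?, ?)", rows)

query = (
    "SELECT MAKETXT, LDATE, INJURED FROM complaints "
    "WHERE LDATE LIKE ? AND MAKETXT = ? "
    "AND (CRASH = 'Y' OR FIRE = 'Y' "
    "OR CAST(INJURED AS INTEGER) > 0 "
    "OR CAST(DEATHS AS INTEGER) > 0)"
)
# Values are passed as parameters rather than pasted into the SQL string.
serious = conn.execute(query, ("2016%", "FORD")).fetchall()
```

Only the first made-up row passes the filter: it is a 2016 Ford complaint with a non-zero injury count, even though CRASH and FIRE are both 'N'.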


#### Since pandas truncates long text in its data frame output, this simple routine prints each customer's complaint in full

In [72]:
for year, make, model, cdescr in sample[['YEARTXT','MAKETXT','MODELTXT','CDESCR']].values:
    print(year + ' ' + make + ' ' + model + '==> ' + cdescr)
    print("*****************************************************************************************")

2005 FORD FIVE HUNDRED==> THE VEHICLE WAS IN A FRONT END COLLISION WITH THE FRONT DRIVER SIDE PRIMARILY DAMAGED. THE VEHICLE WAS TRAVELING DOWN A CITY STREET AT ROUGHLY 35 TO 40 MPH UPHILL. THE VEHICLE REAR ENDED A STATIONARY VEHICLE THAT WAS ATTEMPTING TO TURN. THE AIRBAG SENSOR WAS LOCATED ON THE VEHICLE AND WAS HIT, HOWEVER, THE AIRBAGS DID NOT DEPLOY. CAUSING DAMAGE TO THE DRIVER IN THE ACCIDENT.  THEY CLEARLY SHOULD HAVE DEPLOYED BY THE APPEARANCE OF THE SENSOR.
*****************************************************************************************
2015 FORD FUSION==> TL* THE CONTACT OWNS A 2015 FORD FUSION. THE CONTACT STATED THAT WHILE MAKING A TURN, THE BRAKES PULSATED WHEN THE VEHICLE WAS STOPPED. IN ADDITION, THE CONTACT STATED THAT WHILE ATTEMPTING TO PARK, THE BRAKES VIOLENTLY PULSATED AND RESULTED IN AN UNINTENDED ACCELERATION. AS A RESULT, THE CONTACT CRASHED INTO A BUILDING. THE AIR BAGS FAILED TO DEPLOY AND THE SEAT BELT FAILED TO RETRACT. A POLICE REPORT WAS FILED. T