<center><h1>Using Dask to Query a Large CSV File</h1></center>

With [Dask](http://dask.pydata.org/), you can analyze or query a large CSV file without a database engine. Dask is still relatively young and does not yet implement every pandas method, but its coverage keeps growing.

In [None]:
from io import BytesIO
from zipfile import ZipFile
from urllib import request

# NOTE: the whole archive is read into memory before it is extracted
archive_url = 'http://www-odi.nhtsa.dot.gov/downloads/folders/Complaints/FLAT_CMPL.zip'
with request.urlopen(archive_url) as response:
    zipfile_in_memory = ZipFile(BytesIO(response.read()))
with zipfile_in_memory:
    zipfile_in_memory.extractall('/home/pybokeh/temp/')
print("zip download and extraction complete")
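
Because the cell above holds the entire archive in memory, a very large download may be easier to handle by streaming it to disk first. The following is only a sketch under that assumption: `download_and_extract` is a hypothetical helper, not part of the original notebook, and it uses only the standard library.

```python
import shutil
import tempfile
from pathlib import Path
from urllib import request
from zipfile import ZipFile


def download_and_extract(url, dest_dir):
    """Stream a zip archive to a temporary file, then extract it into dest_dir.

    Hypothetical helper for illustration. Returns the archive's member names.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tempfile.TemporaryFile() as tmp:
        # Copy the response in chunks so the archive never sits fully in memory.
        with request.urlopen(url) as response:
            shutil.copyfileobj(response, tmp)
        tmp.seek(0)
        with ZipFile(tmp) as archive:
            archive.extractall(dest)
            return archive.namelist()
```

Calling it with the NHTSA URL and the target directory used above should leave the same extracted file on disk, at the cost of a temporary file instead of a large in-memory buffer.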

In [34]:
import dask.dataframe as dd

In [22]:
# CSV file does not have a header row, so defining my own columns
columns = [
    'CMPLID',
    'ODINO',
    'MFR_NAME',
    'MAKETXT',
    'MODELTXT',
    'YEARTXT',
    'CRASH',
    'FAILDATE',
    'FIRE',
    'INJURED',
    'DEATHS',
    'COMPDESC',
    'CITY',
    'STATE',
    'VIN',
    'DATEA',
    'LDATE',
    'MILES',
    'OCCURENCES',
    'CDESCR',
    'CMPL_TYPE',
    'POLICE_RPT_YN',
    'PURCH_DT',
    'ORIG_OWNER_YN',
    'ANTI_BRAKES_YN',
    'CRUISE_CONT_YN',
    'NUM_CYLS',
    'DRIVE_TRAIN',
    'FUEL_SYS',
    'FUEL_TYPE',
    'TRANS_TYPE',
    'VEH_SPEED',
    'DOT',
    'TIRE_SIZE',
    'LOC_OF_TIRE',
    'TIRE_FAIL_TYPE',
    'ORIG_EQUIP_YN',
    'MANUF_DT',
    'SEAT_TYPE',
    'RESTRAINT_TYPE',
    'DEALER_NAME',
    'DEALER_TEL',
    'DEALER_CITY',
    'DEALER_STATE',
    'DEALER_ZIP',
    'PROD_TYPE',
    'REPAIRED_YN',
    'MEDICAL_ATTN',
    'VEHICLES_TOWED_YN'
]


# When this was written, dask's read_csv() did not accept a single dtype=object for the whole file,
# so a dtype has to be given for EVERY column.
# This CSV file is very dirty and type inference raised errors on import, so every column is read as object.
data_type = {column: object for column in columns}

In [23]:
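# on_bad_lines='skip' replaces error_bad_lines=False, which pandas deprecated in 1.3 and removed in 2.0
df = dd.read_csv('/home/pybokeh/temp/FLAT_CMPL.txt', delimiter='\t', names=columns, dtype=data_type,
                 encoding='ISO-8859-1', on_bad_lines='skip')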

In [25]:
df.head()

Unnamed: 0,CMPLID,ODINO,MFR_NAME,MAKETXT,MODELTXT,YEARTXT,CRASH,FAILDATE,FIRE,INJURED,...,RESTRAINT_TYPE,DEALER_NAME,DEALER_TEL,DEALER_CITY,DEALER_STATE,DEALER_ZIP,PROD_TYPE,REPAIRED_YN,MEDICAL_ATTN,VEHICLES_TOWED_YN
0,1,958146,General Motors LLC,GMC,SONOMA,1995,,19941215.0,N,0,...,,,,,,,V,,,
1,2,958146,General Motors LLC,GMC,SONOMA,1995,,19941215.0,N,0,...,,,,,,,V,,,
2,3,958146,General Motors LLC,GMC,SONOMA,1995,,19941215.0,N,0,...,,,,,,,V,,,
3,4,958173,Ford Motor Company,LINCOLN,TOWN CAR,1994,Y,19941222.0,N,0,...,,,,,,,V,,,
4,5,958127,Ford Motor Company,FORD,RANGER,1994,,,N,0,...,,,,,,,V,,,


In [45]:
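# query() has no equivalent of SQL's LDATE LIKE '2016%', so filter LDATE with a boolean mask first.
# Masking before query() keeps the mask and the frame on the same index; na=False treats missing dates as non-matches.
ldate_2016 = df[df.LDATE.str.startswith("2016", na=False)]

# NOTE: INJURED appears to hold counts (0, 1, 2, ...) in the output rather than Y/N flags,
# so the INJURED == 'Y' test below probably never matches; DEATHS may behave the same way.
ldate_2016.query("MAKETXT == 'FORD' "
                 "& (CRASH == 'Y' "
                 "| FIRE == 'Y' "
                 "| INJURED == 'Y' "
                 "| DEATHS == 'Y' "
                 "| MEDICAL_ATTN == 'Y' "
                 "| VEHICLES_TOWED_YN == 'Y')").compute()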



Unnamed: 0,CMPLID,ODINO,MFR_NAME,MAKETXT,MODELTXT,YEARTXT,CRASH,FAILDATE,FIRE,INJURED,...,RESTRAINT_TYPE,DEALER_NAME,DEALER_TEL,DEALER_CITY,DEALER_STATE,DEALER_ZIP,PROD_TYPE,REPAIRED_YN,MEDICAL_ATTN,VEHICLES_TOWED_YN
9202,1249657,10817541,Ford Motor Company,FORD,FIVE HUNDRED,2005,Y,20151201,N,2,...,,,,,,,V,,Y,N
9209,1249664,10817547,Ford Motor Company,FORD,FUSION,2015,Y,20160103,N,1,...,,,,,,,V,,N,N
9210,1249665,10817547,Ford Motor Company,FORD,FUSION,2015,Y,20160103,N,1,...,,,,,,,V,,N,N
9242,1249697,10817570,Ford Motor Company,FORD,FOCUS,2010,Y,20160101,N,,...,,,,,,,V,,N,Y
9642,1250097,10818097,Ford Motor Company,FORD,WINDSTAR,2004,N,20160104,Y,,...,,,,,,,V,,N,Y
9801,1250256,10818215,Ford Motor Company,FORD,F-250,1995,N,20151231,Y,1,...,,,,,,,V,,N,N
9802,1250257,10818216,Ford Motor Company,FORD,EXPLORER,2012,Y,20151223,N,1,...,,,,,,,V,,N,Y
9879,1250334,10818281,Ford Motor Company,FORD,F-150,2015,Y,20151117,N,1,...,,,,,,,V,,Y,Y
9943,1250398,10818325,Ford Motor Company,FORD,EXPLORER,2004,N,20151226,Y,0,...,,,,,,,V,,N,N
10585,1251040,10818778,Ford Motor Company,FORD,RANGER,2003,Y,20151230,N,,...,,,,,,,V,,N,N
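
The combination of `query()` and a `str.startswith()` mask used above can be checked on a small frame. This sketch uses plain pandas, on the assumption that Dask mirrors pandas for these two calls; the sample rows are made up and are not taken from the NHTSA file.

```python
import pandas as pd

# Tiny illustrative sample; every column is kept as object, as in the notebook.
sample = pd.DataFrame({
    "MAKETXT": ["FORD", "FORD", "GMC", "FORD"],
    "CRASH":   ["Y", "N", "Y", "Y"],
    "FIRE":    ["N", "N", "N", "N"],
    "LDATE":   ["20160105", "20160107", "20160110", None],
}, dtype=object)

# Stand-in for SQL's  LDATE LIKE '2016%'.
# na=False makes rows with a missing LDATE count as non-matches.
in_2016 = sample["LDATE"].str.startswith("2016", na=False)

# Apply the mask first, then the query, so both operate on the same index.
result = sample[in_2016].query("MAKETXT == 'FORD' & (CRASH == 'Y' | FIRE == 'Y')")
print(result)
```

Only the first row survives: the second has neither flag set, the third is a different make, and the fourth has no `LDATE`.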
