<center><h1>Using Dask to Query a Large CSV File</h1></center>

With [Dask](http://dask.pydata.org/), you don't need a database engine to analyze or query a large CSV file.  Dask is still relatively young and hasn't implemented all of pandas' methods.  But it is getting there!

### Obtain the NHTSA complaints data from their [website](http://www-odi.nhtsa.dot.gov/downloads/). 

The meta data on their complaints file can be viewed [here](http://www-odi.nhtsa.dot.gov/downloads/folders/Complaints/CMPL.txt).

In [None]:
from io import BytesIO
from zipfile import ZipFile
from urllib import request

url = request.urlopen('http://www-odi.nhtsa.dot.gov/downloads/folders/Complaints/FLAT_CMPL.zip')
zipfile_in_memory = ZipFile(BytesIO(url.read()))
zipfile_in_memory.extractall('/home/pybokeh/temp/')
zipfile_in_memory.close()
print("zip download and extraction complete")

### Import dask dataframe and numpy

In [55]:
import dask.dataframe as dd
import numpy as np

### The complaints file does not include the header/column names, so I created them myself.

In [None]:
# CSV file does not have a header row, so defining my own columns
columns = [
    'CMPLID',
    'ODINO',
    'MFR_NAME',
    'MAKETXT',
    'MODELTXT',
    'YEARTXT',
    'CRASH',
    'FAILDATE',
    'FIRE',
    'INJURED',
    'DEATHS',
    'COMPDESC',
    'CITY',
    'STATE',
    'VIN',
    'DATEA',
    'LDATE',
    'MILES',
    'OCCURENCES',
    'CDESCR',
    'CMPL_TYPE',
    'POLICE_RPT_YN',
    'PURCH_DT',
    'ORIG_OWNER_YN',
    'ANTI_BRAKES_YN',
    'CRUISE_CONT_YN',
    'NUM_CYLS',
    'DRIVE_TRAIN',
    'FUEL_SYS',
    'FUEL_TYPE',
    'TRANS_TYPE',
    'VEH_SPEED',
    'DOT',
    'TIRE_SIZE',
    'LOC_OF_TIRE',
    'TIRE_FAIL_TYPE',
    'ORIG_EQUIP_YN',
    'MANUF_DT',
    'SEAT_TYPE',
    'RESTRAINT_TYPE',
    'DEALER_NAME',
    'DEALER_TEL',
    'DEALER_CITY',
    'DEALER_STATE',
    'DEALER_ZIP',
    'PROD_TYPE',
    'REPAIRED_YN',
    'MEDICAL_ATTN',
    'VEHICLES_TOWED_YN'
]

## I defined the data type of each column

In [101]:
data_type = {
    'CMPLID':object,
    'ODINO':object,
    'MFR_NAME':object,
    'MAKETXT':object,
    'MODELTXT':object,
    'YEARTXT':object,
    'CRASH':object,
    'FAILDATE':object,
    'FIRE':object,
    'INJURED':np.int32,
    'DEATHS':np.int32,
    'COMPDESC':object,
    'CITY':object,
    'STATE':object,
    'VIN':object,
    'DATEA':object,
    'LDATE':object,
    'MILES':object,
    'OCCURENCES':object,
    'CDESCR':object,
    'CMPL_TYPE':object,
    'POLICE_RPT_YN':object,
    'PURCH_DT':object,
    'ORIG_OWNER_YN':object,
    'ANTI_BRAKES_YN':object,
    'CRUISE_CONT_YN':object,
    'NUM_CYLS':object,
    'DRIVE_TRAIN':object,
    'FUEL_SYS':object,
    'FUEL_TYPE':object,
    'TRANS_TYPE':object,
    'VEH_SPEED':object,
    'DOT':object,
    'TIRE_SIZE':object,
    'LOC_OF_TIRE':object,
    'TIRE_FAIL_TYPE':object,
    'ORIG_EQUIP_YN':object,
    'MANUF_DT':object,
    'SEAT_TYPE':object,
    'RESTRAINT_TYPE':object,
    'DEALER_NAME':object,
    'DEALER_TEL':object,
    'DEALER_CITY':object,
    'DEALER_STATE':object,
    'DEALER_ZIP':object,
    'PROD_TYPE':object,
    'REPAIRED_YN':object,
    'MEDICAL_ATTN':object,
    'VEHICLES_TOWED_YN':object
}

## Now, read in the flat file

Due to empty/null values, I was getting errors when trying to create a dask data frame.  The converters= argument allows us to replace those empty/null values with whatever we want.  It is a life-saver for dealing with dirty data.  It is so true that data people spend a lot of their time cleaning/tranforming data.

In [102]:
df = dd.read_csv('/home/pybokeh/temp/FLAT_CMPL.txt', delimiter='\t', names=columns, dtype=data_type,
                 encoding='ISO-8859-1', error_bad_lines=False,
                 converters={'INJURED':lambda x:int(str(x.replace('','0'))), 
                             'DEATHS':lambda x:int(str(x.replace('','0'))),
                             'FAILDATE':lambda x:str(x)
                            }
                )

## Let's take a peek at our data

In [103]:
df.head()

Unnamed: 0,CMPLID,ODINO,MFR_NAME,MAKETXT,MODELTXT,YEARTXT,CRASH,FAILDATE,FIRE,INJURED,...,RESTRAINT_TYPE,DEALER_NAME,DEALER_TEL,DEALER_CITY,DEALER_STATE,DEALER_ZIP,PROD_TYPE,REPAIRED_YN,MEDICAL_ATTN,VEHICLES_TOWED_YN
0,1,958146,General Motors LLC,GMC,SONOMA,1995,,19941215.0,N,0,...,,,,,,,V,,,
1,2,958146,General Motors LLC,GMC,SONOMA,1995,,19941215.0,N,0,...,,,,,,,V,,,
2,3,958146,General Motors LLC,GMC,SONOMA,1995,,19941215.0,N,0,...,,,,,,,V,,,
3,4,958173,Ford Motor Company,LINCOLN,TOWN CAR,1994,Y,19941222.0,N,0,...,,,,,,,V,,,
4,5,958127,Ford Motor Company,FORD,RANGER,1994,,,N,0,...,,,,,,,V,,,


## Now query the data using SQL-like filtering

The query below will obtain "serious" incidents filed by Ford owners that occurred in 2016.

In SQL, it would look something like this:  
    
    SELECT *
    
    FROM complaints
    
    WHERE
    MAKETXT = 'FORD'
    and (CRASH = 'Y'
        or FIRE = 'Y'
        or INJURED > 0
        or DEATHS > 0
        or MEDICAL_ATTN = 'Y'
        or VEHICLES_TOWED_YN = 'Y'
    )
    and FAILDATE like '2016%'

In dask/pandas syntax, the query looks like this:

In [104]:
df.query("MAKETXT == 'FORD' "
         "& (CRASH == 'Y' " 
         "| FIRE == 'Y' "
         "| INJURED > 0 "
         "| DEATHS > 0 "
         "| MEDICAL_ATTN == 'Y' "
         # query() method does not support FAILDATE like '2016%'
         "| VEHICLES_TOWED_YN == 'Y')")[df.FAILDATE.str.startswith("2016")].compute()

  ordm = np.log10(abs(reindexer_size - term_axis_size))
  result = _execute_task(task, data)


Unnamed: 0,CMPLID,ODINO,MFR_NAME,MAKETXT,MODELTXT,YEARTXT,CRASH,FAILDATE,FIRE,INJURED,...,RESTRAINT_TYPE,DEALER_NAME,DEALER_TEL,DEALER_CITY,DEALER_STATE,DEALER_ZIP,PROD_TYPE,REPAIRED_YN,MEDICAL_ATTN,VEHICLES_TOWED_YN
9209,1249664,10817547,Ford Motor Company,FORD,FUSION,2015,Y,20160103,N,10,...,,,,,,,V,,N,N
9210,1249665,10817547,Ford Motor Company,FORD,FUSION,2015,Y,20160103,N,10,...,,,,,,,V,,N,N
9242,1249697,10817570,Ford Motor Company,FORD,FOCUS,2010,Y,20160101,N,0,...,,,,,,,V,,N,Y
9642,1250097,10818097,Ford Motor Company,FORD,WINDSTAR,2004,N,20160104,Y,0,...,,,,,,,V,,N,Y
11017,1251472,10819080,Ford Motor Company,FORD,TAURUS,2003,Y,20160108,N,0,...,,,,,,,V,,N,Y
11018,1251473,10819080,Ford Motor Company,FORD,TAURUS,2003,Y,20160108,N,0,...,,,,,,,V,,N,Y
11019,1251474,10819080,Ford Motor Company,FORD,TAURUS,2003,Y,20160108,N,0,...,,,,,,,V,,N,Y
11553,1252008,10819477,Ford Motor Company,FORD,F-150,2013,N,20160110,N,10,...,,,,,,,V,,N,N
11616,1252071,10819526,Ford Motor Company,FORD,MUSTANG,2010,N,20160108,N,0,...,,,,,,,V,,N,Y
11617,1252072,10819526,Ford Motor Company,FORD,MUSTANG,2010,N,20160108,N,0,...,,,,,,,V,,N,Y


### That's it!  We queried a large flat file and we didn't have to use a database.