<center><h1>Using Dask to Query a Large CSV File</h1></center>

With [Dask](http://dask.pydata.org/), you don't need a database engine to analyze or query a large CSV file.  Dask is still relatively young and hasn't implemented all of pandas' methods yet.  However, once you have reduced your result set to a more manageable size, you can convert the dask data frame to a pandas data frame.

In a nutshell, dask is like pandas on steroids. It lets you work with larger-than-memory data sets while still using the pandas-like API that we are all familiar with.  This notebook is a practical example: someone responsible for monitoring customer complaints filed with NHTSA can do so with dask/Python without first loading the data into a database. Some initial prep work was required, but the process can now be run repeatedly and reliably. Because the process is documented in this Jupyter notebook, the workflow is transparent and easily understood if someone questions or audits it.

### Obtain the NHTSA complaints data from their [website](http://www-odi.nhtsa.dot.gov/downloads/)

The metadata for the complaints file can be viewed [here](http://www-odi.nhtsa.dot.gov/downloads/folders/Complaints/CMPL.txt).

In [None]:
from io import BytesIO
from zipfile import ZipFile
from urllib import request

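# Download the archive into memory, then extract it; the context managers
# close the HTTP response and the zip file even if a step fails
with request.urlopen('http://www-odi.nhtsa.dot.gov/downloads/folders/Complaints/FLAT_CMPL.zip') as response:
    archive_bytes = BytesIO(response.read())

with ZipFile(archive_bytes) as zipfile_in_memory:
    zipfile_in_memory.extractall('/home/pybokeh/temp/')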
print("zip download and extraction complete")
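The download cell holds the whole archive in memory before extracting it. As a minimal offline sketch of that same `BytesIO` plus `ZipFile` pattern, the example below builds a tiny archive in memory instead of fetching the real file. The sample row and the temporary directory are stand-ins for illustration only, not the real NHTSA data or paths.

```python
import tempfile
from io import BytesIO
from pathlib import Path
from zipfile import ZipFile

# Build a small zip archive in memory (a stand-in for the downloaded bytes).
buffer = BytesIO()
with ZipFile(buffer, "w") as archive:
    archive.writestr("FLAT_CMPL.txt", "1\t958146\tGeneral Motors LLC\n")

# Re-open the same bytes for reading, as one would with the HTTP response body.
with tempfile.TemporaryDirectory() as target_dir:
    with ZipFile(BytesIO(buffer.getvalue())) as archive:
        extracted_names = archive.namelist()
        archive.extractall(target_dir)
    extracted_text = (Path(target_dir) / "FLAT_CMPL.txt").read_text()

print(extracted_names)
```

Nothing touches the disk until `extractall` runs, which is convenient for a one-off download but means the full zip must fit in memory.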

### Import dask dataframe and numpy

In [1]:
import dask.dataframe as dd
import numpy as np

### The complaints file does not include the header/column names, so I created them myself.

In [2]:
# CSV file does not have a header row, so defining my own columns
columns = [
    'CMPLID',
    'ODINO',
    'MFR_NAME',
    'MAKETXT',
    'MODELTXT',
    'YEARTXT',
    'CRASH',
    'FAILDATE',
    'FIRE',
    'INJURED',
    'DEATHS',
    'COMPDESC',
    'CITY',
    'STATE',
    'VIN',
    'DATEA',
    'LDATE',
    'MILES',
    'OCCURENCES',
    'CDESCR',
    'CMPL_TYPE',
    'POLICE_RPT_YN',
    'PURCH_DT',
    'ORIG_OWNER_YN',
    'ANTI_BRAKES_YN',
    'CRUISE_CONT_YN',
    'NUM_CYLS',
    'DRIVE_TRAIN',
    'FUEL_SYS',
    'FUEL_TYPE',
    'TRANS_TYPE',
    'VEH_SPEED',
    'DOT',
    'TIRE_SIZE',
    'LOC_OF_TIRE',
    'TIRE_FAIL_TYPE',
    'ORIG_EQUIP_YN',
    'MANUF_DT',
    'SEAT_TYPE',
    'RESTRAINT_TYPE',
    'DEALER_NAME',
    'DEALER_TEL',
    'DEALER_CITY',
    'DEALER_STATE',
    'DEALER_ZIP',
    'PROD_TYPE',
    'REPAIRED_YN',
    'MEDICAL_ATTN',
    'VEHICLES_TOWED_YN'
]

## I defined the data type of each column

In [3]:
data_type = {
    'CMPLID':object,
    'ODINO':object,
    'MFR_NAME':object,
    'MAKETXT':object,
    'MODELTXT':object,
    'YEARTXT':object,
    'CRASH':object,
    'FAILDATE':object,
    'FIRE':object,
    'INJURED':np.int32,
    'DEATHS':np.int32,
    'COMPDESC':object,
    'CITY':object,
    'STATE':object,
    'VIN':object,
    'DATEA':object,
    'LDATE':object,
    'MILES':object,
    'OCCURENCES':object,
    'CDESCR':object,
    'CMPL_TYPE':object,
    'POLICE_RPT_YN':object,
    'PURCH_DT':object,
    'ORIG_OWNER_YN':object,
    'ANTI_BRAKES_YN':object,
    'CRUISE_CONT_YN':object,
    'NUM_CYLS':object,
    'DRIVE_TRAIN':object,
    'FUEL_SYS':object,
    'FUEL_TYPE':object,
    'TRANS_TYPE':object,
    'VEH_SPEED':object,
    'DOT':object,
    'TIRE_SIZE':object,
    'LOC_OF_TIRE':object,
    'TIRE_FAIL_TYPE':object,
    'ORIG_EQUIP_YN':object,
    'MANUF_DT':object,
    'SEAT_TYPE':object,
    'RESTRAINT_TYPE':object,
    'DEALER_NAME':object,
    'DEALER_TEL':object,
    'DEALER_CITY':object,
    'DEALER_STATE':object,
    'DEALER_ZIP':object,
    'PROD_TYPE':object,
    'REPAIRED_YN':object,
    'MEDICAL_ATTN':object,
    'VEHICLES_TOWED_YN':object
}
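Since almost every column is read as `object`, the mapping could also be derived from the column list rather than typed out by hand. The sketch below is one possible way to do that, not the approach the notebook uses. It assumes a shortened, hypothetical column list and uses the dtype name `'int32'` as a string (which pandas accepts) so the example needs no imports.

```python
# Hypothetical short column list; the notebook's full list has 49 names.
columns = ["CMPLID", "MAKETXT", "INJURED", "DEATHS", "CDESCR"]

# Columns that should be numeric; everything else is read as text.
integer_columns = {"INJURED", "DEATHS"}

# Build the dtype mapping in one pass over the column names.
data_type = {
    name: ("int32" if name in integer_columns else object)
    for name in columns
}

print(data_type)
```

Keeping the numeric columns in one small set makes it harder for the column list and the dtype mapping to drift apart.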

## Now, read in the flat file

Because of empty/null values, I was getting errors when trying to create a dask data frame.  The `converters=` argument lets us replace those empty/null values with whatever we want, which is a life-saver when dealing with dirty data.  It is so true that data people spend a lot of their time cleaning and transforming data.

In [4]:
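# Note: error_bad_lines was deprecated in pandas 1.3 and removed in 2.0; newer versions use on_bad_lines='skip'
df = dd.read_csv('/home/pybokeh/temp/FLAT_CMPL.txt', delimiter='\t', names=columns, dtype=data_type,
                 encoding='ISO-8859-1', error_bad_lines=False,
                 # Blank fields become 0; x.replace('', '0') would also pad non-blank values with zeros
                 converters={'INJURED':lambda x:int(x) if x.strip() else 0,
                             'DEATHS':lambda x:int(x) if x.strip() else 0,
                             'FAILDATE':lambda x:str(x)
                            }
                )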
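A converter that maps blank fields to zero should leave non-blank values untouched, and that is easy to get subtly wrong. In particular, calling `str.replace` with an empty search string inserts the replacement around every character, so it alters values that were not blank. The sketch below shows the pitfall next to a safer helper; the name `to_int_or_zero` is hypothetical and not part of pandas or dask.

```python
def to_int_or_zero(raw_value: str) -> int:
    """Convert a raw CSV field to int, treating blank fields as 0."""
    stripped = raw_value.strip()
    return int(stripped) if stripped else 0


# str.replace with an empty search string inserts the replacement
# before every character and at the end, corrupting non-empty values.
corrupted = "2".replace("", "0")

print(repr(corrupted))       # '020'
print(to_int_or_zero(""))    # 0
print(to_int_or_zero("2"))   # 2
```

A function like this can be passed in the `converters` mapping in place of a lambda.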

## Let's take a peek at our data

In [5]:
df.head()

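|  | CMPLID | ODINO | MFR_NAME | MAKETXT | MODELTXT | YEARTXT | CRASH | FAILDATE | FIRE | INJURED | ... | RESTRAINT_TYPE | DEALER_NAME | DEALER_TEL | DEALER_CITY | DEALER_STATE | DEALER_ZIP | PROD_TYPE | REPAIRED_YN | MEDICAL_ATTN | VEHICLES_TOWED_YN |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 958146 | General Motors LLC | GMC | SONOMA | 1995 |  | 19941215.0 | N | 0 | ... |  |  |  |  |  |  | V |  |  |  |
| 1 | 2 | 958146 | General Motors LLC | GMC | SONOMA | 1995 |  | 19941215.0 | N | 0 | ... |  |  |  |  |  |  | V |  |  |  |
| 2 | 3 | 958146 | General Motors LLC | GMC | SONOMA | 1995 |  | 19941215.0 | N | 0 | ... |  |  |  |  |  |  | V |  |  |  |
| 3 | 4 | 958173 | Ford Motor Company | LINCOLN | TOWN CAR | 1994 | Y | 19941222.0 | N | 0 | ... |  |  |  |  |  |  | V |  |  |  |
| 4 | 5 | 958127 | Ford Motor Company | FORD | RANGER | 1994 |  |  | N | 0 | ... |  |  |  |  |  |  | V |  |  |  |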


## Now query the data using SQL-like filtering

The query below retrieves "serious" incidents filed by Ford owners with a failure date in March 2016 or later.

In SQL, it would look something like this:

    SELECT *
    FROM complaints
    WHERE
        MAKETXT = 'FORD'
        AND FAILDATE >= '201603'
        AND (CRASH = 'Y'
            OR FIRE = 'Y'
            OR INJURED > 0
            OR DEATHS > 0
            OR MEDICAL_ATTN = 'Y'
            OR POLICE_RPT_YN = 'Y'
            OR VEHICLES_TOWED_YN = 'Y'
        )

In dask/pandas syntax, the query looks like this:

In [5]:
sql = (
    "MAKETXT == 'FORD' "
    "and FAILDATE >= '201603' "
    "and (CRASH == 'Y' " 
    "    or FIRE == 'Y' "
    "    or INJURED > 0 "
    "    or DEATHS > 0 "
    "    or MEDICAL_ATTN == 'Y' "
    "    or POLICE_RPT_YN == 'Y' "
    "    or VEHICLES_TOWED_YN == 'Y' "
    ")"
)
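Note that the `FAILDATE >= '201603'` condition compares strings, not dates. It works because the values are fixed-width `YYYYMMDD` text, so lexicographic order matches chronological order. The small sketch below illustrates this with made-up values; blank entries sort before any cutoff and are therefore excluded.

```python
# Made-up FAILDATE values in YYYYMMDD form, plus one blank entry.
fail_dates = ["20151231", "20160229", "20160301", "20161015", ""]

cutoff = "201603"  # prefix for March 2016

# String comparison is lexicographic: characters are compared left to right,
# and a string that starts with the cutoff sorts at or after the cutoff.
on_or_after_cutoff = [d for d in fail_dates if d >= cutoff]

print(on_or_after_cutoff)  # ['20160301', '20161015']
```

This shortcut would break if the dates were stored in another layout such as `MM/DD/YYYY`, in which case parsing them into real dates first would be the safer route.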

In [11]:
df.query(sql).compute()

|  | CMPLID | ODINO | MFR_NAME | MAKETXT | MODELTXT | YEARTXT | CRASH | FAILDATE | FIRE | INJURED | ... | RESTRAINT_TYPE | DEALER_NAME | DEALER_TEL | DEALER_CITY | DEALER_STATE | DEALER_ZIP | PROD_TYPE | REPAIRED_YN | MEDICAL_ATTN | VEHICLES_TOWED_YN |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 24618 | 1265091 | 10839516 | Ford Motor Company | FORD | WINDSTAR | 2002 | N | 20160301 | N | 0 | ... |  |  |  |  |  |  | V |  | N | Y |
| 24685 | 1265158 | 10839568 | Ford Motor Company | FORD | EDGE | 2008 | Y | 20160302 | N | 20 | ... |  |  |  |  |  |  | V |  | Y | N |
| 24686 | 1265159 | 10839568 | Ford Motor Company | FORD | EDGE | 2008 | Y | 20160302 | N | 20 | ... |  |  |  |  |  |  | V |  | Y | N |
| 24785 | 1265258 | 10839632 | Ford Motor Company | FORD | FREESTAR | 2006 | N | 20160301 | N | 0 | ... |  |  |  |  |  |  | V |  | N | Y |
| 25816 | 1266290 | 10840417 | Ford Motor Company | FORD | F-150 | 2009 | Y | 20160301 | N | 0 | ... |  |  |  |  |  |  | V |  | N | Y |
| 25845 | 1266319 | 10840439 | Ford Motor Company | FORD | TAURUS X | 2008 | N | 20160305 | Y | 0 | ... |  |  |  |  |  |  | V |  | N | N |
| 26051 | 1266525 | 10845616 | Ford Motor Company | FORD | ESCAPE | 2001 | N | 20160307 | N | 0 | ... |  |  |  |  |  |  | V |  | N | Y |
| 26052 | 1266526 | 10845616 | Ford Motor Company | FORD | ESCAPE | 2001 | N | 20160307 | N | 0 | ... |  |  |  |  |  |  | V |  | N | Y |
| 26137 | 1266611 | 10845681 | Ford Motor Company | FORD | EXPEDITION | 2011 | Y | 20160305 | N | 0 | ... |  |  |  |  |  |  | V |  | N | Y |
| 26138 | 1266612 | 10845681 | Ford Motor Company | FORD | EXPEDITION | 2011 | Y | 20160305 | N | 0 | ... |  |  |  |  |  |  | V |  | N | Y |
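The string passed to `query` is a pandas expression rather than real SQL, and the same filter can be written as an explicit boolean mask. The sketch below compares the two forms in plain pandas on a few made-up rows, using a shortened version of the condition. The data frame and its values are invented for illustration, and the example assumes pandas is installed.

```python
import pandas as pd

# Made-up rows using a few of the complaint columns.
complaints = pd.DataFrame({
    "MAKETXT": ["FORD", "FORD", "GMC", "FORD"],
    "FAILDATE": ["20160301", "20150101", "20160401", "20160305"],
    "CRASH": ["Y", "Y", "Y", "N"],
    "INJURED": [0, 0, 1, 0],
})

# Shortened version of the notebook's condition, as a query string.
condition = (
    "MAKETXT == 'FORD' "
    "and FAILDATE >= '201603' "
    "and (CRASH == 'Y' or INJURED > 0)"
)
by_query = complaints.query(condition)

# The same filter written as an explicit boolean mask.
mask = (
    (complaints["MAKETXT"] == "FORD")
    & (complaints["FAILDATE"] >= "201603")
    & ((complaints["CRASH"] == "Y") | (complaints["INJURED"] > 0))
)
by_mask = complaints[mask]

print(by_query.equals(by_mask))
```

The query string tends to read closer to the SQL version, while the mask form makes operator precedence explicit through its parentheses.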


### Now that the query results look like what I want, I will convert the results into a pandas data frame

In [6]:
resultset = df.query(sql).compute()

In [7]:
type(resultset)

pandas.core.frame.DataFrame

**Resultset has 60 rows and 49 columns**

In [8]:
resultset.shape

(60, 49)

**Let's view some basic info on our data set**

In [11]:
resultset.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 183 entries, 9209 to 34490
Data columns (total 49 columns):
CMPLID               183 non-null object
ODINO                183 non-null object
MFR_NAME             183 non-null object
MAKETXT              183 non-null object
MODELTXT             183 non-null object
YEARTXT              183 non-null object
CRASH                183 non-null object
FAILDATE             183 non-null object
FIRE                 183 non-null object
INJURED              183 non-null int64
DEATHS               183 non-null int64
COMPDESC             183 non-null object
CITY                 183 non-null object
STATE                183 non-null object
VIN                  168 non-null object
DATEA                183 non-null object
LDATE                183 non-null object
MILES                163 non-null object
OCCURENCES           4 non-null object
CDESCR               183 non-null object
CMPL_TYPE            183 non-null object
POLICE_RPT_YN        183 non-null

### Now that I have a pandas data frame, I can take advantage of all of its awesome functions. We can now export the data frame to an Excel or CSV file, or even to the clipboard.

In [None]:
resultset.to_excel(r'D:\temp\serious_ford_complaints.xlsx')

In [None]:
resultset.to_csv(r'D:\temp\serious_ford_complaints.csv', sep=',')

In [None]:
resultset.to_clipboard()

### Or we can output certain columns

In [12]:
resultset[['YEARTXT','MAKETXT','MODELTXT','CDESCR']].head(10)

|  | YEARTXT | MAKETXT | MODELTXT | CDESCR |
| --- | --- | --- | --- | --- |
| 9209 | 2015 | FORD | FUSION | TL* THE CONTACT OWNS A 2015 FORD FUSION. THE C... |
| 9210 | 2015 | FORD | FUSION | TL* THE CONTACT OWNS A 2015 FORD FUSION. THE C... |
| 9242 | 2010 | FORD | FOCUS | I HAVE OWNED THE CAR SINCE IT WAS NEW NOW HAS ... |
| 9642 | 2004 | FORD | WINDSTAR | TL* THE CONTACT OWNED A 2004 FORD WINDSTAR. TH... |
| 11017 | 2003 | FORD | TAURUS | WE HAD SEVERAL CLOSE CALLS WHERE THE CAR WOULD... |
| 11018 | 2003 | FORD | TAURUS | WE HAD SEVERAL CLOSE CALLS WHERE THE CAR WOULD... |
| 11019 | 2003 | FORD | TAURUS | WE HAD SEVERAL CLOSE CALLS WHERE THE CAR WOULD... |
| 11553 | 2013 | FORD | F-150 | I¿M REQUESTING THE NHTSA INVESTIGATE F150 ACCE... |
| 11616 | 2010 | FORD | MUSTANG | TL* THE CONTACT OWNS A 2010 FORD MUSTANG. THE ... |
| 11617 | 2010 | FORD | MUSTANG | TL* THE CONTACT OWNS A 2010 FORD MUSTANG. THE ... |


**Since viewing a data frame in a web browser is limiting, here is a simple routine to print and format certain columns. This routine outputs the first 10 rows of the data frame.**

In [13]:
for year, make, model, complaint in resultset[['YEARTXT','MAKETXT','MODELTXT','CDESCR']][:10].values:
    print(year + ' ' + make + ' ' + model + ' ==> ' + complaint)
    print('*********************************************************************************************')

2015 FORD FUSION ==> TL* THE CONTACT OWNS A 2015 FORD FUSION. THE CONTACT STATED THAT WHILE MAKING A TURN, THE BRAKES PULSATED WHEN THE VEHICLE WAS STOPPED. IN ADDITION, THE CONTACT STATED THAT WHILE ATTEMPTING TO PARK, THE BRAKES VIOLENTLY PULSATED AND RESULTED IN AN UNINTENDED ACCELERATION. AS A RESULT, THE CONTACT CRASHED INTO A BUILDING. THE AIR BAGS FAILED TO DEPLOY AND THE SEAT BELT FAILED TO RETRACT. A POLICE REPORT WAS FILED. THERE WAS ONE UNKNOWN INJURY REPORTED THAT DID NOT REQUIRE MEDICAL ATTENTION. THE VEHICLE WAS NOT DIAGNOSED OR REPAIRED. THE MANUFACTURER WAS NOT MADE AWARE OF THE FAILURE. THE FAILURE MILEAGE WAS 30,000. 
*********************************************************************************************
2015 FORD FUSION ==> TL* THE CONTACT OWNS A 2015 FORD FUSION. THE CONTACT STATED THAT WHILE MAKING A TURN, THE BRAKES PULSATED WHEN THE VEHICLE WAS STOPPED. IN ADDITION, THE CONTACT STATED THAT WHILE ATTEMPTING TO PARK, THE BRAKES VIOLENTLY PULSATED AND RESULTED

In [2]:
from ipywidgets import interact, interactive, fixed, interact_manual
import ipywidgets as widgets

In [3]:
def f(x):
    return x

In [4]:
interact(f, x=10);