# Lab 2 - basic data analysis

This is the first lab in which we use real-world data.  We've already downloaded it for you and put it in this directory; the file is called "SQF_2012.csv".  Any time you use real-world data, it's important to explain:

1. Where to find the data publicly (if possible).
2. How the data was collected and what the attributes mean.
3. Any additional processing you did to the data.

So here are the answers to those questions:
* The data came from <http://www.nyclu.org/files/stopandfrisk/Stop-and-Frisk-2012.zip>.
* A full description of the data can be found at: <http://www.nyclu.org/files/SQF_Codebook.pdf>.
* The data in SQF_2012.csv appears exactly as downloaded from the NYCLU.

**As in the first lab, your job will be to do the tasks indicated in bold below.**

These tasks will lead you through a basic analysis of this data.  We'll continue examining it in the next lab as well.

First, here's a function that we'll use to get data out of csv files.  Read it and try to understand what it's doing.  Pay particular attention to the commented line, which uses some built-in functions that are new since the last lab.

In [15]:
"""
Takes a filename and returns the header of the file as a list.
For example, with a file that looks like this, the first line will be returned as a list:
header1,header2,header3
1,2,3
4,5,6

"""
def get_header(filename):
    filepointer = open(filename, "r")
    line = filepointer.readline()
    # The header line includes the "\n" at the end of the line.  This is removed using strip().
    # The resulting line is then split at the ","s and stored in a list by split(",").
    header = line.strip().split(",")
    filepointer.close()
    return header
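
To see what `strip()` and `split(",")` do on their own, here is a small sketch that runs them on a made-up header line instead of a file.  The string below is invented for illustration; it is not taken from SQF_2012.csv.

```python
# A made-up line, as readline() would return it (note the trailing "\n").
line = "header1,header2,header3\n"

# strip() removes whitespace, including the newline, from both ends.
stripped = line.strip()

# split(",") cuts the string at every comma and returns a list of strings.
fields = stripped.split(",")

print(repr(line))
print(repr(stripped))
print(fields)

# split() does not trim spaces around the commas; they stay in the pieces.
spaced = "a, b".split(",")
print(spaced)
```

The last two lines show why a header written with spaces after the commas would produce column names that start with a space.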


**1) Get the column names (header list) and print them out.**

As usual, add a code cell below this and do your work there.  Remember that you need to run the cell above in order to have access to that function.

**2) Programmatically print the 10th column name below.**

Hint: double check your work by counting by hand above to make sure you *actually* found the *10th* column name.

Below you'll find code to read a file into a data structure (in this case, a list of dictionaries) for accessing the data.  You'll notice some similarity with the code for reading the header.

In [67]:
"""
Takes a filename and returns your data.
For example, with a file that looks like this:
header1,header2,header3
1,2,3
4,5,6

You could get the second header item of the first data row like this:
rows = get_data("temp.csv")
print(rows[0]["header2"])

For the interested, the returned data is a list of dictionaries.  We'll see this more in future weeks.
"""
def get_data(filename):
    filepointer = open(filename, "r")
    data = []
    
    # Read the header inline rather than calling get_header above, so that the
    # for loop below continues reading from the line right after the header.
    line = filepointer.readline()
    header = line.strip().split(",")

    for line in filepointer:
        fields = line.strip().split(",")

        # Unfortunately, split will split at some commas that we don't mean to split on (e.g., if they've
        # been written into addresses) so we check below to make sure we have the expected number of fields
        # and throw out any other data.  We shouldn't really be throwing out data, we should be fixing the
        # actual problem, but for the purposes of this lab, this will do.
        if (len(fields) == len(header)):
            row = {}
            for fieldNumber in range(len(fields)):
                row[header[fieldNumber]] = fields[fieldNumber]
            data.append(row)
            
    filepointer.close()
    return data
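
The comment inside `get_data` mentions that `split(",")` also splits at commas that are part of a value.  As a rough sketch of one way to handle that, Python's standard `csv` module keeps a quoted field together.  The two lines below are invented for illustration, and this only helps if the file wraps such values in double quotes, which is an assumption here and not something we have checked for SQF_2012.csv.

```python
import csv

# Made-up lines: the address contains a comma and is wrapped in quotes.
lines = [
    'address,age',
    '"123 Main St, Apt 4",25',
]

# Plain split cuts the quoted address into two pieces: 3 fields, not 2.
naive_fields = lines[1].split(",")
print(len(naive_fields))

# csv.reader accepts any iterable of lines and respects the quotes.
rows = list(csv.reader(lines))
header = rows[0]

# Pair each header name with its value, as the inner loop in get_data does.
record = dict(zip(header, rows[1]))
print(record["address"])
```

With this approach no rows would need to be thrown away for having an unexpected number of fields, as long as the quoting assumption holds.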

**3) Now use this function to read in our file.**

Note that the file is big!  So far, everything you've programmed in this class has happened seemingly instantaneously.  This time, you may have to wait a bit (count to 100 before getting nervous).  You'll know that the data is still being read in because the left of the code cell will have a * instead of a number like usual.

In [79]:
data = # TODO

Each row of this list is an individual record.  For example, the sixth row:

In [81]:
data[5]

We can access specific attribute values using the names in the header:

In [82]:
data[5]["age"]
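
One thing to keep in mind when reading values this way: `get_data` builds every record from the pieces returned by `split(",")`, so every value is a string, even when it looks like a number.  The sketch below uses two made-up records in the same shape (the `borough` key and all values are invented for illustration) to show the difference.

```python
# Made-up records shaped like the list of dictionaries get_data returns.
sample = [
    {"age": "19", "borough": "A"},
    {"age": "32", "borough": "B"},
]

# Index into the list first, then look up the attribute by name.
age_text = sample[1]["age"]
print(type(age_text))

# Convert to a number before doing arithmetic with it.
age_number = int(age_text)
print(age_number + 1)

# Comparing strings is not the same as comparing numbers.
print("9" > "10")
print(int("9") > int("10"))
```

The last two lines print different answers because strings are compared character by character, not by numeric value.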

**4) Was anyone stopped on 4/20/2012?  Write a function to find out the answer.**

**5) How many people in the data were frisked?**

Note that the next questions, which ask you to go through all the records in the data set and count something, will also take a while.  Be patient!

**6) What percent of the people stopped (i.e., people in the data) were also frisked?**

**7) What percent of the people stopped were arrested?**

**8) What percent of people stopped did NOT have a weapon (0 for columns: pistol, riflshot, asltweap, knifcuti, machgun, othrweap)?**

**9) What percent of people stopped had a weapon, but NOT a gun of any kind?**

**10) What percent of people stopped had multiple guns?**

**11) How many people were frisked and arrested on 4/20/2012?**

**12) Write a get_column_statistics function that takes the data and a column name, finds the min, max, and mean of that column over all the data, and returns them.**

You can return all three at once like this:
> return min, max, mean

which can then be called like this:

> min, max, mean = get_column_statistics(data, column_name)
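
Returning several values at once works because Python packs them into a single tuple, which the caller can unpack into separate names.  Here is a minimal sketch using a hypothetical helper, `first_and_last`, that is unrelated to the lab's data.

```python
def first_and_last(items):
    # The two values separated by a comma are returned as one tuple.
    return items[0], items[-1]

# Unpack the returned tuple into two separate variables.
first, last = first_and_last([3, 1, 4, 1, 5])
print(first)
print(last)

# Without unpacking, the result is a single tuple holding both values.
both = first_and_last([3, 1, 4, 1, 5])
print(both)
```

Note that `min` and `max` are also the names of Python built-in functions, so using them as variable names hides the built-ins for the rest of that scope; names such as `lowest` and `highest` avoid this.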

Note that for some columns, like date, it wouldn't make sense to use this function.  You may assume that no one would be silly enough to try.  A better programming practice would be to check first to make sure that the data is of the right type (a number) before trying to take the min, max, or mean.  For **extra credit** you may make your function more robust so that it appropriately checks the type before doing the calculation.

In [83]:
get_column_statistics(data,"age")

**13) What do you notice that's unusual / questionable about the above results for age?**

We'll pick up on exploring what's going on here in the next lab!