# Project 3: Data and Maps!

Thanks to John P. Dickerson for the project idea!

**Posted:** Nov 7th, 2019.

**Due:** Nov. 26th, 2019.

In this project we will work with a fairly clean set of Baltimore crime data covering 2011 and 2012.  This is a fairly open-ended project: you will need to explore the data and decide for yourself what to show.

In [8]:
# Includes, standard magic, and startup initializers.

# Load Numpy
import numpy as np
# Load MatPlotLib
import matplotlib
import matplotlib.pyplot as plt
# Load Pandas
import pandas as pd
# Load Stats
from scipy import stats
# Load Folium for the interactive maps in Part 3 (the folium package must be installed)
import folium
# Load the regular expression module
import re

# This lets us show plots inline and also save PDF plots if we want them
%matplotlib inline
matplotlib.style.use('fivethirtyeight')

## Part 1: Data Wrangling.

The data is a bit messy to start out with.  Perform the following tasks to make it clean and tidy.

1. Split the `Location 1` column into `lat` and `long` columns.  Ensure that both columns are of float type and drop any record that is missing a location.
2. You can drop the `arrest`, `post`, `charge`, and `Location 1` columns.
3. Merge the date and time columns and make sure the result has the proper type.  Drop any row that does not have a date and time.
4. Set the index so that we can sort and slice based on the date/time.
5. Drop any records that have NA values.
6. Go through the remaining columns and ensure you have set the dtype properly. 
7. Display the head of the table and the dtypes in your notebook.
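
The steps above can be sketched on a small made-up frame. Only `Location 1`, `arrest`, `post`, `charge`, `lat`, and `long` are named in this assignment; the `age`, `date`, and `time` column names, the string formats, and every value below are assumptions for illustration, so adapt them to what the real CSV contains.

```python
import pandas as pd

# Made-up rows; the real file's column names and formats may differ.
toy = pd.DataFrame({
    "arrest": [1.0, 2.0, 3.0, 4.0],
    "post": [111, 222, 333, 444],
    "charge": ["a", "b", "c", "d"],
    "age": [23, 35, 41, 19],
    "date": ["01/02/2011", "03/04/2011", "05/06/2012", None],
    "time": ["13:05", "08:30", "23:15", "10:00"],
    "Location 1": ["(39.29, -76.61)", None, "(39.31, -76.60)", "(39.30, -76.62)"],
})

# 1. Drop rows with no location, then split the string into float columns.
toy = toy.dropna(subset=["Location 1"])
coords = toy["Location 1"].str.extract(r"\(\s*(-?[\d.]+)\s*,\s*(-?[\d.]+)\s*\)")
toy["lat"] = coords[0].astype(float)
toy["long"] = coords[1].astype(float)

# 2. Drop the columns we no longer need.
toy = toy.drop(columns=["arrest", "post", "charge", "Location 1"])

# 3. Merge date and time into one datetime column; unparseable values become NaT.
merged = toy["date"].str.cat(toy["time"], sep=" ")
toy["datetime"] = pd.to_datetime(merged, format="%m/%d/%Y %H:%M", errors="coerce")
toy = toy.dropna(subset=["datetime"]).drop(columns=["date", "time"])

# 4-5. Index by date/time, sort, and drop any remaining NA values.
toy = toy.set_index("datetime").sort_index().dropna()

# 7. Inspect the result.
print(toy.head())
print(toy.dtypes)
```

Comparing `len()` of the frame before and after each `dropna` call is one simple way to count how many records each step removes.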





In [73]:
# Read in the raw data
raw_df = pd.read_csv("./BPD_Arrests.csv")

# Drop rows with no location. np.isfinite() raises a TypeError here because
# `Location 1` holds strings rather than numbers, so use notna() instead.
raw_df = raw_df[raw_df["Location 1"].notna()]

# The parsing below assumes strings shaped like "(37.3, 23.8)".
# Return the latitude (first value)
def get_lat(lat_long):
    temp_array = re.split(" ", str(lat_long))
    return float(re.sub(r"\(|\,", "", temp_array[0]))

# Return the longitude (second value)
def get_long(lat_long):
    temp_array = re.split(" ", str(lat_long))
    return float(re.sub(r"\)", "", temp_array[1]))

raw_df["lat"] = raw_df["Location 1"].apply(get_lat)
raw_df["long"] = raw_df["Location 1"].apply(get_long)
raw_df.head()

### Question 1:
How many records did we drop using our processing above?  Do you think this will affect our data later?  What type of missingness do you think these values have? 

### Question 2:
Think about the kinds of missingness in our data.  What is one imputation method that we could have used to fill in some gaps?  Implement one such method that is not just `dropna`.
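
One possible approach, shown here only as a sketch, is group-wise median imputation: fill each missing numeric value with the median of the rows that share a category with it. The column names `age` and `sex` and all values are assumptions for illustration; whether this method suits the real data depends on your answer to Question 1.

```python
import numpy as np
import pandas as pd

# Made-up values; `age` and `sex` are assumed column names.
toy = pd.DataFrame({
    "sex": ["M", "M", "F", "F", "M", "F"],
    "age": [20.0, np.nan, 30.0, 40.0, 40.0, np.nan],
})

# Flag the rows that were missing before filling them in.
toy["age_was_missing"] = toy["age"].isna()

# Fill each missing age with the median age of that row's group.
group_median = toy.groupby("sex")["age"].transform("median")
toy["age"] = toy["age"].fillna(group_median)

print(toy)
```

Keeping the `age_was_missing` flag lets later analysis separate observed values from imputed ones. Note that median imputation shrinks the variance of the filled column.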

## Part 2: Exploratory Data Analysis

We can use the Pandas time and date slicing functions to group our data by day, quarter, or time of day.  Have a look at [DataFrame.between_time()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.between_time.html).  I want you to explore this data in some interesting ways.
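
As a rough illustration of these slicing tools, the sketch below uses made-up timestamps and an assumed `age` column. It relies on the frame having a sorted `DatetimeIndex`, as set up in Part 1.

```python
import pandas as pd

# Made-up timestamps; between_time needs a DatetimeIndex.
idx = pd.to_datetime([
    "2011-01-15 02:30", "2011-04-20 14:00", "2011-09-01 23:45",
    "2012-02-10 09:15", "2012-07-04 22:10", "2012-12-31 01:05",
])
toy = pd.DataFrame({"age": [21, 34, 27, 45, 19, 30]}, index=idx).sort_index()

# Rows whose time of day falls between 10 pm and 4 am (the range wraps past midnight).
late_night = toy.between_time("22:00", "04:00")

# Partial-string slicing on the index: everything from 2011.
year_2011 = toy.loc["2011"]

# Count records per calendar quarter.
per_quarter = toy.groupby(toy.index.to_period("Q")).size()

print(late_night)
print(per_quarter)
```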

### Problem 1.
Use `pd.cut` and other Pandas functions to display the joint distribution of Age and Race.  This table should not have every age in it but should break the ages down into a reasonable number of subgroups.

Pick another pair of variables.  Display a joint or conditional distribution and explain **why** you chose it and what the takeaway message is.
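
A minimal sketch of binning plus a joint and a conditional table might look like the following. The column names `age` and `race`, the category codes, the bin edges, and all values are assumptions for illustration.

```python
import pandas as pd

# Made-up records; `age` and `race` are assumed column names and codes.
toy = pd.DataFrame({
    "age": [17, 22, 25, 31, 38, 44, 52, 67],
    "race": ["A", "B", "A", "B", "B", "A", "B", "A"],
})

# Bin ages into a handful of groups; bins are right-inclusive by default.
bins = [0, 18, 30, 45, 120]
labels = ["0-18", "19-30", "31-45", "46+"]
toy["age_group"] = pd.cut(toy["age"], bins=bins, labels=labels)

# Joint distribution: every cell divided by the grand total.
joint = pd.crosstab(toy["age_group"], toy["race"], normalize="all")

# Conditional distribution of age group given race: each column sums to 1.
conditional = pd.crosstab(toy["age_group"], toy["race"], normalize="columns")

print(joint)
print(conditional)
```

The `normalize` argument is what separates the two tables: `"all"` gives the joint distribution, while `"columns"` or `"index"` gives a distribution conditioned on one of the variables.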

### Problem 2.

Pick (at least) three neighborhoods from the data and show the crime in 2011 versus 2012 for each of these neighborhoods on one plot.  Make sure that you use visual features to distinguish the two years.

**Hint:** You may want to look back at the lab where we worked with baby names... and maybe the [unstack](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.unstack.html) function.
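
The hint can be sketched as a group-count-unstack pipeline followed by a grouped bar chart. The `neighborhood` column name, the neighborhood labels, and the dates below are made up for illustration.

```python
import pandas as pd

# Made-up records; `neighborhood` is an assumed column name.
idx = pd.to_datetime([
    "2011-02-01", "2011-06-15", "2011-08-20", "2011-11-05",
    "2012-01-10", "2012-03-22", "2012-09-30",
])
toy = pd.DataFrame(
    {"neighborhood": ["North", "North", "South", "East", "North", "South", "South"]},
    index=idx,
)

# Count records per (neighborhood, year), then pivot the years into columns.
counts = (
    toy.groupby(["neighborhood", toy.index.year])
    .size()
    .unstack(fill_value=0)
)
counts.columns.name = "year"

# Grouped bars: one color per year, one cluster per neighborhood.
ax = counts.plot(kind="bar", rot=0)
ax.set_xlabel("Neighborhood")
ax.set_ylabel("Number of records")
ax.set_title("Records per neighborhood, 2011 vs. 2012 (made-up data)")

print(counts)
```

After `unstack`, each year is its own column, so the default plotting colors and legend already distinguish 2011 from 2012.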

### Problem 3.

Show me one other interesting thing about the data.  It can be anything you find interesting, but I'd encourage you to use an advanced method from class (regression, classification, hypothesis testing, etc.).  If you can, maybe look at something like [the demographics of Baltimore](https://en.wikipedia.org/wiki/Baltimore) and compare those to what is in our data.
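
If you go the hypothesis-testing route, one option (among many) is a chi-square goodness-of-fit test comparing group counts in the data with population shares. The numbers below are placeholders chosen only to make the sketch run; they are not real counts or real demographics.

```python
import numpy as np
from scipy import stats

# Placeholder numbers only: substitute counts from the data and
# population shares from a demographic source.
observed_counts = np.array([60, 30, 10])
population_share = np.array([0.5, 0.3, 0.2])

# Expected counts if records were proportional to population share.
expected_counts = population_share * observed_counts.sum()

# Chi-square goodness-of-fit test of observed against expected counts.
chi2, p_value = stats.chisquare(f_obs=observed_counts, f_exp=expected_counts)
print(f"chi2 = {chi2:.2f}, p = {p_value:.4f}")
```

A small p-value would suggest the group proportions in the data differ from the population shares; explaining why they differ is a separate question that the test does not answer.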



## Part 3: Interactive Maps.

Use the following code stub to start up an interactive map. You can find more information about folium here: https://github.com/python-visualization/folium/ and https://folium.readthedocs.org//


### Problem 5.

Add graphical elements to display the data. For instance, add circles, with colors indicating sex. Or circles with colors indicating race. Or anything else that strikes your fancy.  Plot some colors over the map to illustrate some joint or conditional distribution of the data.

**Explain using Markdown Cells** *what* you have shown in your map, *why* you have shown it in your map, and *how* a user should interpret this information.

In [3]:
map_osm = folium.Map(location=[39.29, -76.61], zoom_start=11)
map_osm

## Submission

Prepare a Jupyter notebook that includes for each Problem: (a) code to carry out the step discussed, (b) output showing the result of your code, and (c) a short prose description of how your code works. Remember, the writeup you are preparing is intended to communicate your data analysis effectively. Thoughtlessly showing large amounts of output in your writeup defeats that purpose.

All axes in plots should be labeled in an informative manner. Your answers to any question that refers to a plot should include both (a) a text description of your plot, and (b) a sentence or two of interpretation as it relates to the question asked.

Submit the completed notebook, with your answers written in markdown cells, to [Canvas](https://tulane.instructure.com/).

## Grading Rubric

Note that code that does not work will not be graded and you will receive a 0 for that section.  We reserve the right to deduct points for things like general sloppiness of the notebook, poor labels, unlabeled axes, etc.  You should include markdown cells to break up your notebook and **clearly label** the problems and questions below.

* Part 0 Professionalism (10 Points).
  * You have used both code comments and markdown cells to professionally and clearly document your work including having a clear and clean notebook; linking to resources and documents; and doing so with code that is reasonable and efficient.

* Part 1 Wrangling (20 Points).
  * (10 Points) Data is loaded correctly and directions are followed for munging the data appropriately.
  * (10 Points) Questions are answered in a reasonable manner.  A suggested way to impute data is present along with code.
* Part 2 Exploratory Data Analysis (40 Points).
  * (20 Points) Problem 1: Distributions are computed correctly, tables are shown, explanation is coherent and clear.
  * (10 Points) Problem 2: Graph is present, visual features are present to distinguish the required elements.
  * (10 Points) Problem 3: Code is present to compute an interesting feature of the data.  The feature is interpreted in a written markdown cell.
* Part 3 Interactive Maps (30 Points).
  * (20 Points) Map is displayed of Baltimore, one or more interactive elements are present.  Displayed information is non-trivial and reveals something interesting about the data.
  * (10 Points) Explanation of the above map is reasonable and clear.  Addresses all points.


* Total Score:

### Credits

Thanks to [John P. Dickerson](http://jpdickerson.com/) for the project idea!