# Week 4

Yay! It's week 4. Today we'll keep things light.

I've noticed that many of you are struggling a bit to keep up and are still working on exercises from the previous weeks. So this week we only have two components, with no lectures and very little reading.


## Overview

* An exercise on visualizing geodata using a different set of tools from the ones we played with during Lecture 2.
* Thinking about visualization, data quality, and binning. Why ***looking at the details of the data before applying fancy methods*** is often important.

## Part 1: Visualizing geo-data

It turns out that `plotly` (which we used during Week 2) is not the only way of working with geo-data. There are many different ways to go about it. (The hard-core PhD and PostDoc researchers in my group simply use matplotlib, since that provides more control. For an example of that kind of thing, check out [this one](https://towardsdatascience.com/visualizing-geospatial-data-in-python-e070374fe621).)

Today, we'll try another library for geodata called "[Folium](https://github.com/python-visualization/folium)". It's good for you all to try out a few different libraries - remember that data visualization and analysis in Python is all about the ability to use many different tools. 

The exercise below is based on the code illustrated in this nice [tutorial](https://www.kaggle.com/daveianhickey/how-to-folium-for-maps-heatmaps-time-data), so let us start by taking a look at that one.

*Reading*. Read through the following tutorial:
 * "How to: Folium for maps, heatmaps & time data". Get it here: https://www.kaggle.com/daveianhickey/how-to-folium-for-maps-heatmaps-time-data
 * (Optional) There are also some nice tricks in "Spatial Visualizations and Analysis in Python with Folium". Read it here: https://towardsdatascience.com/data-101s-spatial-visualizations-and-analysis-in-python-with-folium-39730da2adf

> *Exercise 1.1*: A new take on geospatial data. 
>
>A couple of weeks ago (Part 3 of Week 2), we worked with spatial data by using color-intensity of shapefiles to show the counts of certain crimes within those individual areas. Today, we look at studying geospatial data by plotting raw data points as well as heatmaps on top of actual maps.
> 
> * First start by plotting a map of San Francisco with a nice tight zoom. Simply use the command `folium.Map([lat, lon], zoom_start=13)`, where you'll have to look up San Francisco's longitude and latitude.
> * Next, use the coordinates for SF City Hall `37.77919, -122.41914` to indicate its location on the map with a nice, pop-up enabled marker. (In the screenshot below, I used the black & white Stamen tiles, because they look cool).
> <img src="https://raw.githubusercontent.com/suneman/socialdata2022/main/files/city_hall_2022.png" alt="drawing" width="600"/>
>
> * Now, let's plot some more data (no need for pop-ups this time). Select a couple of months of data for `'DRUG/NARCOTIC'` and draw a little dot for each arrest for those two months. You could, for example, choose June-July 2016, but you can choose anything you like - the main concern is to not have too many points as this uses a lot of memory and makes Folium behave non-optimally. 
> We can call this kind of visualization a *point scatter plot*.

Ok. Time for a little break. Note that a nice thing about Folium is that you can zoom in and out of the maps.

In [6]:
import folium

In [7]:
lat, lon = 37.7749, -122.4194
map_sf = folium.Map([lat, lon], zoom_start=13)

map_sf

In [10]:
point = 37.77919, -122.41914  # SF City Hall (lat, lon), as given in the exercise

map_sf = folium.Map([lat, lon],
                    tiles='Stamen Toner',
                    zoom_start=13)

folium.Marker(point, popup='SF City Hall').add_to(map_sf)

map_sf

In [11]:
import pandas as pd

In [13]:
df = pd.read_csv('Police_Department_Incident_Reports__Historical_2003_to_May_2018.csv')
df.Date = pd.to_datetime(df['Date']) + pd.to_timedelta(df['Time'] + ':00')

In [30]:
# Filter data for June only (easier to view)
df_res = df[(df.Date >= '2016-06') & (df.Date < '2016-07') & (df.Category == 'DRUG/NARCOTIC')]

In [31]:
# Map object
map_sf = folium.Map([lat, lon],
                    tiles='Stamen Toner',
                    zoom_start=13)

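# Draw one small dot per incident (Y is latitude, X is longitude)
for idx, row in df_res.iterrows():
    folium.CircleMarker((row.Y, row.X), radius=2, color='red', fill=True).add_to(map_sf)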

In [32]:
# Show map
map_sf
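
Because Folium slows down when it has to draw many markers, it can help to cap the number of points before plotting. The sketch below shows one way to do that in plain Python. The `downsample` helper, the fixed seed, and the synthetic coordinates are illustrative assumptions and not part of the exercise.

```python
import random


def downsample(points, max_points, seed=0):
    """Return at most `max_points` randomly chosen (lat, lon) pairs."""
    if len(points) <= max_points:
        return list(points)
    rng = random.Random(seed)  # fixed seed so the same subset is drawn each run
    return rng.sample(points, max_points)


# Synthetic coordinates near downtown San Francisco (made up for illustration)
points = [(37.77 + i * 1e-5, -122.42 + i * 1e-5) for i in range(5000)]
subset = downsample(points, max_points=1000)
print(len(points), "->", len(subset))
```

With the real data, the same idea could be applied to the `(row.Y, row.X)` pairs before the plotting loop.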


> *Exercise 1.2*: Heatmaps.
> * Now, let's play with **heatmaps**. You can figure out the appropriate commands by grabbing code from the main [tutorial](https://www.kaggle.com/daveianhickey/how-to-folium-for-maps-heatmaps-time-data) and modifying it to suit your needs.
>    * To create your first heatmap, grab all arrests for the category `'SEX OFFENSES, NON FORCIBLE'` across all time. Play with parameters to get plots you like.
>    * Now, comment on the differences between scatter plots and heatmaps.
>      - What can you see using the scatter-plots that you can't see using the heatmaps?
>      - And *vice versa*: what do the heatmaps help you see that's difficult to distinguish in the scatter-plots?
>    * Play around with the various parameters for heatmaps. You can find a list here: https://python-visualization.github.io/folium/plugins.html
>    * Comment on the effect of the various parameters for the heatmaps. How do they change the picture? (at least talk about the `radius` and `blur`).
> For one combination of settings, my heatmap plot looks like this.
> <img src="https://raw.githubusercontent.com/suneman/socialdata2022/main/files/crime_hot_spot.png" alt="drawing" width="600"/>
>
>    * In that screenshot, I've (manually) highlighted a specific hotspot for this type of crime. Use your detective skills to find out what's going on in that building on the 800 block of Bryant Street ... and explain in your own words.

(*Fun fact*: I remembered the concentration of crime-counts discussed at the end of this exercise from when I did the course back in 2016. It popped up when I used a completely different framework for visualizing geodata called [`geoplotlib`](https://github.com/andrea-cuttone/geoplotlib). You can spot it if you go to that year's [lecture 2](https://nbviewer.jupyter.org/github/suneman/socialdataanalysis2016/blob/master/lectures/Week3.ipynb), exercise 4.)

In [33]:
from folium.plugins import HeatMap

In [36]:
df_res = df[df.Category == 'SEX OFFENSES, NON FORCIBLE']

In [38]:
# Heatmap
map_sf = folium.Map([lat, lon],
                    tiles='Stamen Toner',
                    zoom_start=13)
# One [lat, lon] pair per incident (Y is latitude, X is longitude)
HeatMap(df_res[['Y', 'X']].values.tolist()).add_to(map_sf)

map_sf

- What can you see using the scatter-plots that you can’t see using the heatmaps?
- > Scatter-plots show the individual data points, and therefore their exact locations, which a heatmap smooths away. However, with lots of data the points overlap, which makes it hard to judge their density.
- And vice versa: what do the heatmaps help you see that’s difficult to distinguish in the scatter-plots?
- > Heatmaps show the density of data points, which is useful for large amounts of data. (In the heatmap we used all arrests, whilst the scatter plot used only one month of data.)
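
To make the "density" idea concrete, here is a minimal sketch of what a heatmap conceptually starts from: counting points per grid cell. The `bin_points` helper, the cell size, and the coordinates are made up for illustration, and this is not how Folium's `HeatMap` is implemented internally.

```python
import math
from collections import Counter


def bin_points(points, cell_size=0.005):
    """Count (lat, lon) points per square grid cell of `cell_size` degrees."""
    counts = Counter()
    for lat, lon in points:
        cell = (math.floor(lat / cell_size), math.floor(lon / cell_size))
        counts[cell] += 1
    return counts


# Three made-up points close together and one further away
points = [
    (37.7792, -122.4191),
    (37.7793, -122.4192),
    (37.7791, -122.4193),
    (37.7601, -122.4472),
]
counts = bin_points(points)
busiest_cell, n = counts.most_common(1)[0]
print(len(counts), "cells; busiest cell holds", n, "points")
```

A scatter plot would draw all four points individually, while the binned view only reports that one cell is three times as busy as the other.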

In [108]:
# Playing around with params
map_sf = folium.Map([lat, lon],
                    tiles='Stamen Toner',
                    zoom_start=13)

HeatMap([[row.Y, row.X] for index, row in df_res.iterrows()],
        min_opacity=.5,
        radius=25,
        blur=20,
        gradient={0.3: 'blue', 0.6: 'lime', 1: 'red'}).add_to(map_sf)

map_sf

- Playing around with parameters:
- > `min_opacity` determines how transparent low-density areas can be. A value of 0.001 produces very transparent groups, whilst a value of 1 produces almost all-red groups, which makes it hard to distinguish between them.
- > `radius` determines the _radius_ of the blob drawn around each data point. Increasing the value from the default of 25 seems to give more connected, less circular groups, a bit like increasing the $\varepsilon$ parameter in DBSCAN.
- > `blur` determines how much each blob is _blurred_. The default is 15; higher values give softer, less intense groups, whilst lower values give sharper, more intense groups in terms of the color gradient.
- > `gradient` maps relative density levels to colors.
- > Most of the remaining parameters (such as `name`, `overlay`, `control` and `show`) concern how the library handles the layer rather than how the heatmap itself looks.
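
The observation that a larger `radius` merges nearby groups can be illustrated with a toy one-dimensional model. This is only a sketch under simplifying assumptions: the triangular kernel, the threshold, and the helper names `density` and `count_blobs` are invented here and do not describe how Folium actually renders heatmaps.

```python
def density(x, points, radius):
    """Sum of triangular kernels: 1 at a point, falling linearly to 0 at `radius`."""
    return sum(max(0.0, 1.0 - abs(x - p) / radius) for p in points)


def count_blobs(points, radius, threshold=0.25, start=0.0, stop=10.0, step=0.05):
    """Count connected stretches of the axis where the density exceeds `threshold`."""
    blobs, inside = 0, False
    n_steps = int(round((stop - start) / step))
    for i in range(n_steps + 1):
        x = start + i * step
        above = density(x, points, radius) > threshold
        if above and not inside:
            blobs += 1  # we just entered a new blob
        inside = above
    return blobs


points = [3.0, 3.4, 6.0, 6.3]  # two made-up clusters on a line
print(count_blobs(points, radius=0.5))  # 2: the clusters stay separate
print(count_blobs(points, radius=3.0))  # 1: the clusters merge into one blob
```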

In [111]:
from folium.plugins import HeatMapWithTime

In [134]:
# Time resolution
time_res = range(0, 24)

# Crime
CRIME = 'DRIVING UNDER THE INFLUENCE'

# Data
df_heat = df[df.Category == CRIME]

intensity = 0.5
data = [
    [ [row.Y, row.X, intensity] for idx, row in df_heat[df_heat.Date.dt.hour == hour].iterrows()]
    for hour in time_res
]

In [135]:
map_sf = folium.Map([lat, lon],
                    tiles='Stamen Toner',
                    zoom_start=13)

HeatMapWithTime(data,
                auto_play=True,
                max_opacity=0.8,
                ).add_to(map_sf)

map_sf

What patterns does your movie reveal?
- We see that a lot of people drive drunk in the evening and at night (from about 21:00 onwards). They seem to be caught mainly downtown, on the west highway, or on the south-west highway. There are more arrests on the west highway after midnight.

Motivate/explain the reasoning behind your choice of crimetype and time-resolution.
- We chose this crimetype because Week 2 showed that it depends strongly on the time of day, which is why we use a time-resolution of 24 hourly frames. There are significantly more arrests at night than during the day, and the heatmap now shows us where they happen.
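
The data passed to the time-based heatmap above is a list of frames, one per hour, where each frame is a list of `[lat, lon, weight]` entries. The sketch below builds that shape in a single pass over the records using only the standard library. The `frames_by_hour` helper and the three records are made up for illustration.

```python
from datetime import datetime


def frames_by_hour(records, weight=0.5):
    """Group (timestamp, lat, lon) records into 24 frames, one per hour of day."""
    frames = [[] for _ in range(24)]
    for ts, lat, lon in records:
        frames[ts.hour].append([lat, lon, weight])
    return frames


# Made-up incidents: two around 23:00 on different days, one at 01:40
records = [
    (datetime(2016, 6, 4, 23, 15), 37.7749, -122.4194),
    (datetime(2016, 6, 5, 1, 40), 37.7793, -122.4192),
    (datetime(2016, 6, 11, 23, 55), 37.7601, -122.4472),
]
frames = frames_by_hour(records)
print(len(frames), "frames;", len(frames[23]), "points in the 23:00 frame")
```

Hours without incidents simply yield empty frames, so the movie always has 24 steps.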

## Part 2: Errors in the data. The importance of looking at raw (or close to raw) data.

We started the course by plotting simple histograms and bar plots that showed a lot of cool patterns. But sometimes the binning can hide imprecision, irregularity, and simple errors in the data that could be misleading. In the work we've done so far, we've already come across at least three examples of this in the SF data.

1. In the temporal activity for `PROSTITUTION` something surprising is going on on Thursday. Remind yourself [**here**](https://raw.githubusercontent.com/suneman/socialdata2022/main/files/prostitution.png), where I've highlighted the phenomenon I'm talking about.
2. When we investigated the details of how the timestamps are recorded using jitter-plots, we saw that many more crimes were recorded e.g. on the hour, 15 minutes past the hour, and to a lesser extent in whole increments of 10 minutes. Crimes didn't appear to be recorded as frequently in between those round numbers. Remind yourself [**here**](https://raw.githubusercontent.com/suneman/socialdata2022/main/files/jitter.png), where I've highlighted the phenomenon I'm talking about.
3. And, today we saw that the Hall of Justice seemed to be an unlikely hotspot for sex offences. Remind yourself [**here**](https://raw.githubusercontent.com/suneman/socialdata2022/main/files/crime_hot_spot.png).

> *Exercise 2*: Data errors. The data errors we discovered above become difficult to notice when we aggregate data (and when we calculate mean values, as well as statistics more generally). Thus, when we visualize, errors become difficult to notice when binning the data. We explore this process in the exercise below.
>
>This last exercise for today has two parts:
> * In each of the examples above, describe in your own words how the data-errors I call attention to above can bias the binned versions of the data. Also, briefly mention how not noticing these errors can result in misconceptions about the underlying patterns of what's going on in San Francisco (and our modeling).
> * Find your own example of human noise in the data and visualize it.

**Part 1**
1. The spike in prostitution makes the Thursday bin in a DayOfTheWeek plot larger than the other bins. This could lead the reader to think that prostitution happens more on Thursdays than on other days, since the error spike can't be seen in the low-resolution binning.
2. The bias mostly affects minute-level bins, since the 'real' timestamps get lumped onto the 10, 15 or 60-minute marks. This gives weird-looking distributions if one looks at a histogram with 1-minute bins, for example of all assaults between 00:00 and 01:00. The exact time of each crime is hard to recover, but with coarser bins the effect is generally small, as the general trend is still visible. With respect to modelling, it could bias a model to predict that crimes only happen on these strict 15-minute marks or, even worse, once per hour.
3. If you bin the sex offence data, for example in a heatmap, you'll see a high density of crimes at the police station. This is of course not representative of real life, and it would lead any model or person to assume that a lot of sex offences happen around the police station.
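
The second point can be illustrated with a small sketch: hourly bins look perfectly ordinary, while minute-level counts expose the rounding. The timestamps below are invented for illustration; with the real data they would come from the `Time` column.

```python
from collections import Counter

# Made-up "HH:MM" timestamps with heavy rounding to :00, :15, :30 and :45
times = ["22:00", "22:00", "22:15", "22:30", "22:07",
         "23:00", "23:00", "23:45", "23:30", "23:52"]

hours = [int(t[:2]) for t in times]
minutes = [int(t[3:]) for t in times]

per_hour = Counter(hours)      # coarse bins: the rounding is invisible
per_minute = Counter(minutes)  # fine bins: spikes at the round minutes

# Share of records on a quarter-hour mark; unrounded times would give about 4/60
round_share = sum(1 for m in minutes if m % 15 == 0) / len(minutes)

print(dict(per_hour))
print(per_minute.most_common(3))
print(round_share)
```

Both hours hold five records each, so nothing looks wrong at that resolution, yet most of the records sit on a quarter-hour mark.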

**Part 2**

In [154]:
df.describe()

| | PdId | IncidntNum | Incident Code | X | Y | SF Find Neighborhoods 2 2 | Current Police Districts 2 2 | Current Supervisor Districts 2 2 | Analysis Neighborhoods 2 2 | DELETE - Fire Prevention Districts 2 2 | ... | Fix It Zones as of 2017-11-06 2 2 | DELETE - HSOC Zones 2 2 | Fix It Zones as of 2018-02-07 2 2 | CBD, BID and GBD Boundaries as of 2017 2 2 | Areas of Vulnerability, 2016 2 2 | Central Market/Tenderloin Boundary 2 2 | Central Market/Tenderloin Boundary Polygon - Updated 2 2 | HSOC Zones as of 2018-06-05 2 2 | OWED Public Spaces 2 2 | Neighborhoods 2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 2129525.0 | 2129525.0 | 2129525.0 | 2129525.0 | 2129525.0 | 2123558.0 | 2128444.0 | 2128878.0 | 2128500.0 | 2124316.0 | ... | 546652.0 | 459947.0 | 552162.0 | 553462.0 | 2128500.0 | 333720.0 | 333936.0 | 488596.0 | 115980.0 | 2123558.0 |
| mean | 10424100000000.0 | 104241000.0 | 27532.79 | -122.4228 | 37.77135 | 51.91933 | 4.853053 | 6.841214 | 21.93723 | 8.610686 | ... | 14.112302 | 2.149717 | 13.919509 | 6.29761 | 1.595479 | 1.0 | 1.0 | 2.316869 | 36.68066 | 51.91933 |
| std | 4617922000000.0 | 46179220.0 | 25985.39 | 0.0297963 | 0.4256712 | 31.6277 | 2.765017 | 3.389863 | 12.68341 | 4.279847 | ... | 6.115738 | 1.18563 | 6.300728 | 2.518953 | 0.4907991 | 0.0 | 0.0 | 1.331082 | 10.22428 | 31.6277 |
| min | 397963000.0 | 3979.0 | 0.0 | -122.5136 | 37.70788 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | ... | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 2.0 | 1.0 |
| 25% | 6125908000000.0 | 61259080.0 | 6244.0 | -122.433 | 37.75388 | 24.0 | 3.0 | 3.0 | 9.0 | 5.0 | ... | 10.0 | 1.0 | 9.0 | 5.0 | 1.0 | 1.0 | 1.0 | 1.0 | 35.0 | 24.0 |
| 50% | 10119330000000.0 | 101193300.0 | 15201.0 | -122.4166 | 37.77542 | 43.0 | 5.0 | 8.0 | 21.0 | 9.0 | ... | 15.0 | 3.0 | 15.0 | 6.0 | 2.0 | 1.0 | 1.0 | 3.0 | 35.0 | 43.0 |
| 75% | 14095960000000.0 | 140959600.0 | 63010.0 | -122.407 | 37.78466 | 83.0 | 7.0 | 10.0 | 34.0 | 13.0 | ... | 18.0 | 3.0 | 18.0 | 7.0 | 2.0 | 1.0 | 1.0 | 3.0 | 35.0 | 83.0 |
| max | 99158240000000.0 | 991582400.0 | 75030.0 | -120.5 | 90.0 | 117.0 | 10.0 | 11.0 | 41.0 | 15.0 | ... | 25.0 | 5.0 | 25.0 | 15.0 | 2.0 | 1.0 | 1.0 | 5.0 | 80.0 | 117.0 |


Hmm... The max values for X and Y seem strange. Let's inspect them.

In [143]:
idx = df.X.idxmax()

In [150]:
# idxmax() returns an index label, so look the row up by label with .loc
weird_row = df.loc[idx]

In [153]:

map_sf = folium.Map([weird_row.Y, weird_row.X],
                    tiles='Stamen Toner',
                    zoom_start=1)

folium.Marker([weird_row.Y, weird_row.X], popup=f"{weird_row.Category}").add_to(map_sf)

map_sf

We see that the first row in the dataset has a 'LARCENY/THEFT' datapoint located at the North Pole.
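
One simple way to catch such points is a bounding-box check on the coordinates. The sketch below is a minimal version in plain Python. The bounds are rough assumptions about San Francisco's extent (loosely guided by the minimum and quartile values in the summary above), and `in_sf` is a hypothetical helper name.

```python
# Approximate bounding box for San Francisco (assumed values, adjust as needed)
SF_BOUNDS = {"lat": (37.70, 37.84), "lon": (-122.52, -122.35)}


def in_sf(lat, lon, bounds=SF_BOUNDS):
    """Return True if the point lies inside the assumed bounding box."""
    lat_min, lat_max = bounds["lat"]
    lon_min, lon_max = bounds["lon"]
    return lat_min <= lat <= lat_max and lon_min <= lon <= lon_max


rows = [
    (37.77919, -122.41914),  # SF City Hall
    (90.0, -120.5),          # the suspicious maximum values from the summary
]
flags = [in_sf(lat, lon) for lat, lon in rows]
print(flags)  # [True, False]
```

Rows that fail the check can then be inspected or left out before plotting.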