# Week 3

## Overview

This week we'll get into the principles of machine learning while starting to get to know the dataset. The lecture features an overview of the fundamentals of machine learning (videos), and we'll (_finally_) start working with the dataset we'll be crunching over the next few weeks.

* Part 1 of today's exercises will be about getting data.
* Part 2 is the "main course", where we'll be playing around with real data, plotting and counting.
* Part 3 is about machine learning, featuring a couple of videos with Ole Winter from DTU Compute and some simple questions from the book.
* And finally, in Part 4, we plot a little bit of the detailed GPS data.

## Part 1: Getting the dataset (DSFS Chapter 9)

There's no data science without data, so we'll start thinking about ways to acquire data. You've already played around with reading/writing text from files and the exercises today focus on CSV files. For other types of data access, I recommend that you skim this chapter so you have a place to come back to when you need to scrape the web or use various APIs.

> _Reading_: Skim DSFS Chapter 9 and pay extra attention to the part about CSV files.  
  
---  

> _Exercise_: Reading in crime-data from San Francisco
>  
> * Head over to the [SF OpenData page](https://data.sfgov.org), click on the link to _Public Safety_ data and then choose _SFPD Incidents_. Figure out how to download the list of incidents containing _every crime registered since January 1st, 2003_ as a CSV file by clicking around the page.
> * Once the file has been downloaded, you should read it into Python. The file has many columns, and today we'll be working with
>   - The various categories of crimes
>   - Crime over the years
>   - What time of day do crimes occur?
>   - What day of the week do crimes tend to occur?
>   - And do certain crime-types tend to happen in specific neighborhoods?
>   - Finally, we'll play around with geo-data.  
> 
>   All of this can be easily based on the columns in the incident file. Which columns will you need?

> * What is the total number of incidents in the CSV file?

**An example of how to read a CSV file**

There are many ways to read in CSV files. Here's how I usually do it:

```python
from csv import reader                          # get the csv reader

# open the file for reading; the `with` block closes it again when we're done
with open("/path-to-file/file.csv", "r") as infile:
    for line in reader(infile):                 # read through the CSV one line at a time
        # `line` is a list with one string per column (the first line holds the column names)
        var1, var2 = line[0], line[1]           # assign the fields you need to variables
```
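
To answer the question about the total number of incidents, one option is to count the data rows while skipping the header. Here's a minimal sketch of that idea. The helper name `count_incidents` is my own, and `SAMPLE_CSV` is a tiny made-up stand-in for the real download (its column names are an assumption), so the snippet runs without the file.

```python
import csv
import io

# Tiny made-up stand-in for the downloaded file (column names are an assumption).
SAMPLE_CSV = """Category,Date,Time,PdDistrict
ASSAULT,01/05/2003,12:00,MISSION
ROBBERY,03/17/2004,23:15,TENDERLOIN
ASSAULT,07/04/2004,08:30,RICHMOND
"""


def count_incidents(csv_file):
    """Count the data rows in an open CSV file, skipping the header row."""
    rows = csv.reader(csv_file)
    next(rows, None)              # skip the header; no error if the file is empty
    return sum(1 for _ in rows)   # one incident per remaining row


n_incidents = count_incidents(io.StringIO(SAMPLE_CSV))
print(n_incidents)
```

With the real file you would pass an open file handle instead of the `io.StringIO` object, for example from `open("/path-to-file/file.csv", "r")`.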

## Part 2: Working with data (DSFS Chapter 10)

Once again, the book should be thought of as a useful reference, but today we'll be doing our own data science rather than following the book slavishly.

In order to do awesome _predictive policing_ later on in the class, we're going to dissect the SF crime-data quite thoroughly to figure out what has been going on over the last 10 years on the San Francisco crime scene. It's going to be a lot of work ... but after mastering these exercises you guys will have some real data science skills! 

> _Reading_: Skim DSFS Chapter 10 so you have a sense of what's in there.

---
> _Exercise_: The types of crime and their popularity over time. The first field we'll dig into is the column "Category".
>
> * Create a `set()` of all the categories of crime in the dataset. How many are there? 
> * Now count the number of occurrences of each category in the dataset. What is the most commonly occurring category of crime? What is the least frequently occurring?
> * Create a histogram over crime occurrences. Mine looks like this
>   ![Histogram](https://raw.githubusercontent.com/suneman/socialdataanalysis2016/master/files/categoryhist.png)
> * Now it's time to explore how the crime statistics change over time. To start off easily, let's count the number of crimes per year for the years 2003-2015 (the full years in the database). For that one, I used the `date` column in the CSV file and used `year = int( date.split("/")[-1] )` to extract the year of each crime. What's the average number of crimes per year? 
> * Police chief Suneman is interested in the temporal development of only a subset of categories, the so-called focus crimes. Those categories are listed below (for convenient copy-paste action). Now create bar-charts displaying the year-by-year development of each of these categories across the years 2003-2015.
> 
>    _Little note_: As with anything in Python, there are many ways of doing everything. A nice choice for counting things like these is `pandas`, which gives you a neat way of handling tables in Python. But I've created everything I needed for today simply using standard dictionaries for keeping track of my counts.
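
As a rough illustration of the dictionary approach mentioned in the note, here's a minimal sketch of building the set of categories and counting occurrences. The `categories` list is made up for the example; in practice it would hold the values from the "Category" column.

```python
# Made-up values standing in for the "Category" column of the CSV file.
categories = ["ASSAULT", "ROBBERY", "ASSAULT", "VANDALISM", "ASSAULT", "ROBBERY"]

# All distinct categories.
unique_categories = set(categories)

# Count occurrences with a plain dictionary.
counts = {}
for category in categories:
    counts[category] = counts.get(category, 0) + 1

# Most and least frequent categories (ties are broken arbitrarily).
most_common = max(counts, key=counts.get)
least_common = min(counts, key=counts.get)

print(len(unique_categories), most_common, least_common)
```

The `counts` dictionary is also all you need for the histogram: its keys are the bar labels and its values are the bar heights.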

```python
focuscrimes = set(['WEAPON LAWS', 'PROSTITUTION', 'DRIVING UNDER THE INFLUENCE', 'ROBBERY', 'BURGLARY', 'ASSAULT', 'DRUNKENNESS', 'DRUG/NARCOTIC', 'TRESPASS', 'LARCENY/THEFT', 'VANDALISM', 'VEHICLE THEFT', 'STOLEN PROPERTY', 'DISORDERLY CONDUCT'])
```

> * My plot looks like this for the 14 focus crimes:
    ![Histograms](https://raw.githubusercontent.com/suneman/socialdataanalysis2016/master/files/temporal_categories.png)
    Comment on at least three interesting trends in your plot. Take a look at [this article](http://www.nytimes.com/2015/05/23/us/high-rents-elbow-latinos-from-san-franciscos-mission-district.html) - are the trends you observe consistent with the message in the article?
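
The yearly counting described in this exercise can be sketched as follows. This is just one way to do it, assuming dates are strings whose last `/`-separated part is the year (as the `split("/")` trick above implies). The `incidents` list and the shortened `focus` set are made up so the snippet runs on its own.

```python
from collections import defaultdict

# Shortened, illustrative focus set; use the full `focuscrimes` set in practice.
focus = {"ASSAULT", "ROBBERY"}

# Made-up (category, date) pairs standing in for rows of the CSV file.
incidents = [
    ("ASSAULT", "01/05/2003"),
    ("ASSAULT", "06/21/2003"),
    ("ROBBERY", "03/17/2004"),
    ("FRAUD", "03/18/2004"),
    ("ASSAULT", "07/04/2004"),
]

crimes_per_year = defaultdict(int)                      # year -> count, all categories
yearly_counts = defaultdict(lambda: defaultdict(int))   # focus category -> year -> count

for category, date in incidents:
    year = int(date.split("/")[-1])     # the year is the last part of the date
    crimes_per_year[year] += 1
    if category in focus:
        yearly_counts[category][year] += 1

# Average number of crimes per year across the years seen in the data.
average_per_year = sum(crimes_per_year.values()) / float(len(crimes_per_year))
print(average_per_year)
```

Each inner dictionary in `yearly_counts` holds the bar heights for one category, with the years as the x-axis.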

Ok, now we're getting warmed up. The next exercise looks at the focus crimes and their temporal patterns across the day. It's interesting to know if some crimes tend to happen during the day ... and if others take place during the night. By knowing _when_ certain types of crimes happen, we can fight them more efficiently!

> _Exercises_: The types of crime and their popularity across the 24 hours of the day.
> 
> * First, plot a simple histogram of the number of crimes per hour in the dataset. This piece of information is in the _Time_ column of the CSV file, and I simply used `hr = int( time.split(":")[0] )` to get this info out. There's a strange little peak at `12:00` - what do you think is going on there? (No need for a serious exploration, just some thoughts on what might be going on.)
> * Next create a plot of the count of each crime category across the 24 hours of the day. Again, comment on at least three trends in the data. (My plot looks like this)
    ![Histograms](https://raw.githubusercontent.com/suneman/socialdataanalysis2016/master/files/daily_patterns_categories.png)
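
Here's a minimal sketch of the hourly binning behind both plots, assuming the times are `HH:MM` strings as the `split(":")` trick above implies. The `incidents` list is made up for the example.

```python
# Made-up (category, time) pairs standing in for rows of the CSV file.
incidents = [
    ("ASSAULT", "12:00"),
    ("ROBBERY", "23:15"),
    ("ASSAULT", "08:30"),
    ("DRUNKENNESS", "23:50"),
    ("ASSAULT", "12:45"),
]

hourly_counts = [0] * 24     # one bin per hour of the day, all crimes
hourly_by_category = {}      # category -> list of 24 hourly bins

for category, time in incidents:
    hr = int(time.split(":")[0])    # the hour is the part before the colon
    hourly_counts[hr] += 1
    bins = hourly_by_category.setdefault(category, [0] * 24)
    bins[hr] += 1

print(hourly_counts)
```

Using fixed lists of 24 bins (rather than dictionaries keyed by hour) means hours without any crimes still show up as zeros in the plot.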

There's much more to work on in this dataset (for example, I also explored if certain crimes tend to happen on certain week-days ... also some interesting patterns there!), but we have lots to do, so we must rush onwards. 

The next thing we'll be looking into is how crimes break down across the 10 districts in San Francisco.

_Exercises_: The types of crime and how they take place across San Francisco's police districts.

* So now we'll be combining information about _PdDistrict_ and _Category_ to explore differences between SF's neighborhoods. First, simply list the names of SF's 10 police districts.
* Which has the most crimes? Which has the most focus crimes?
* Next, we want to generate a slightly more complicated graphic. I'm interested to know if there are certain crimes that happen much more often in certain neighborhoods than what's typical. Below I describe how to get that plot going:
  - First, we need to calculate the relative probabilities of seeing each type of crime in the dataset as a whole. That's simply a normalized version of [this plot](https://raw.githubusercontent.com/suneman/socialdataanalysis2016/master/files/categoryhist.png). Let's call it `P(crime)`.
  - Next, we calculate that same probability distribution _but for each PD district_, let's call that `P(crime|district)`.
  - Now we look at the ratio `P(crime|district)/P(crime)`. That ratio is equal to 1 if the crime occurs at the same level within a district as in the city as a whole. If it's greater than one, it means that the crime occurs _more frequently_ within that district. If it's smaller than one, it means that the crime is _rarer within the district in question_ than in the city as a whole.
  - For each district plot these ratios for the 14 focus crimes. My plot looks like this
    ![Histograms](https://raw.githubusercontent.com/suneman/socialdataanalysis2016/master/files/neighborhood_ratios.png)
  - Comment on the top crimes in _Tenderloin_, _Mission_, and _Richmond_. Does this fit with the impression you get of these neighborhoods on Wikipedia?
  - Even though we only plotted the ratios for our 14 focus crimes, I asked you to calculate the ratios based on all crime categories. Why do you think I wanted to include all crime types in the calculation?
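
The ratio calculation described in the steps above can be sketched like this. It's a minimal example on made-up `(district, category)` pairs, and the helper name `normalize` is my own. Note that the probabilities are normalized over all categories present, not just the focus crimes.

```python
from collections import Counter, defaultdict

# Made-up (district, category) pairs standing in for rows of the CSV file.
incidents = [
    ("TENDERLOIN", "DRUG/NARCOTIC"),
    ("TENDERLOIN", "DRUG/NARCOTIC"),
    ("TENDERLOIN", "DRUG/NARCOTIC"),
    ("TENDERLOIN", "ASSAULT"),
    ("RICHMOND", "VEHICLE THEFT"),
    ("RICHMOND", "VEHICLE THEFT"),
    ("RICHMOND", "ASSAULT"),
    ("RICHMOND", "ASSAULT"),
]


def normalize(counter):
    """Turn raw counts into probabilities that sum to 1."""
    total = float(sum(counter.values()))
    return {key: value / total for key, value in counter.items()}


# P(crime): distribution over categories in the whole dataset.
p_crime = normalize(Counter(category for _, category in incidents))

# Raw counts per district, then P(crime|district) / P(crime).
district_counts = defaultdict(Counter)
for district, category in incidents:
    district_counts[district][category] += 1

ratios = {}
for district, counter in district_counts.items():
    p_crime_given_district = normalize(counter)
    ratios[district] = {
        category: p_crime_given_district.get(category, 0.0) / p_crime[category]
        for category in p_crime
    }

print(ratios["TENDERLOIN"]["DRUG/NARCOTIC"])
```

In this toy data, drug crimes make up 3 of 4 incidents in the Tenderloin but only 3 of 8 overall, so the ratio is 2: twice as common there as in the city as a whole.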

## Part 3: Machine Learning (DSFS Chapter 11)

We won't go too deep into machine learning today, but it's finally time to watch some relaxing video lectures on the fundamentals of machine learning.

* Video 1: Ole Winter on Machine Learning
* Video 2: More of Ole on Machine Learning

> _Reading_: DSFS Chapter 11

---

> _Exercises_: A few simple questions about machine learning.

## Part 4: A little bit of plotting

There's a lot of rich geo-data in the crime dataset that we haven't explored yet. Recall that every incident in the database also has an associated GPS location. We're not going to dig deep regarding the geodata quite yet, but it's interesting to just start plotting a little bit.
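
As a starting point, here's a hedged sketch of pulling coordinates out of the CSV and scatter-plotting them. I'm assuming the longitude and latitude live in columns named `X` and `Y`, so check the header of your own file. The sample rows and coordinates are made up, and the helper names `read_coordinates` and `plot_coordinates` are my own.

```python
import csv
import io

# Made-up rows; the column names X (longitude) and Y (latitude) are an assumption.
SAMPLE_CSV = """Category,X,Y
ASSAULT,-122.4194,37.7749
ROBBERY,-122.4098,37.7837
VANDALISM,not-a-number,37.7600
"""


def read_coordinates(csv_file, lon_col="X", lat_col="Y"):
    """Return (longitudes, latitudes), skipping rows with missing or bad values."""
    lons, lats = [], []
    for row in csv.DictReader(csv_file):
        try:
            lon, lat = float(row[lon_col]), float(row[lat_col])
        except (KeyError, TypeError, ValueError):
            continue              # skip rows we cannot parse
        lons.append(lon)
        lats.append(lat)
    return lons, lats


def plot_coordinates(lons, lats):
    """Scatter-plot the points (requires matplotlib)."""
    import matplotlib.pyplot as plt   # imported here so parsing works without it
    plt.scatter(lons, lats, s=1, alpha=0.3)
    plt.xlabel("Longitude")
    plt.ylabel("Latitude")
    plt.show()


lons, lats = read_coordinates(io.StringIO(SAMPLE_CSV))
print(len(lons))
```

Calling `plot_coordinates(lons, lats)` on the full dataset should trace out a rough outline of the city, since the points follow the street grid.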

## Part 5 (Optional): K-nearest neighbors