# Crime data visualization in San Francisco

San Francisco has one of the most extensive "open data" policies of any large city. In this lab, we are going to download about 254 MB of data (726,914 rows) describing all police incidents since 2018 (I grabbed the data in May 2023).

## Getting started

Download [Police Department Incident Reports 2018 to present](https://data.sfgov.org/Public-Safety/Police-Department-Incident-Reports-2018-to-Present/wg3w-h783) or, if you want, all [San Francisco police department incidents since 1 January 2003](https://data.sfgov.org/Public-Safety/SFPD-Incidents-from-1-January-2003/tmnf-yvry).


In [1]:
! curl 'https://data.sfgov.org/api/views/wg3w-h783/rows.csv?accessType=DOWNLOAD&bom=true&format=true' > /tmp/SFPD.csv

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  254M    0  254M    0     0  4874k      0 --:--:--  0:00:53 --:--:-- 5000k


We can easily figure out how many records there are:

In [2]:
! wc -l /tmp/SFPD.csv

  726915 /tmp/SFPD.csv


So that's currently 726,914 records; the one extra line is the CSV header row.
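Note that `wc -l` counts newline characters, so it includes the header row and would overcount if any quoted field contained an embedded line break. As a minimal sketch using only the standard library, records can be counted with a CSV parser instead. The helper name `count_records` is hypothetical and the tiny in-memory sample is made up for illustration:

```python
import csv
import io

def count_records(f):
    """Count data records in an open CSV file, excluding the header row."""
    reader = csv.reader(f)
    next(reader, None)  # skip the header row, if there is one
    return sum(1 for _ in reader)

# Made-up sample: 2 records, one with a line break inside a quoted field
sample = 'Incident ID,Incident Description\n1,"Theft,\nOther Property"\n2,Battery\n'
n_lines = sample.count('\n')                    # what `wc -l` would report
n_records = count_records(io.StringIO(sample))  # what the CSV parser sees
print(n_lines, n_records)
```

For the real file, you would pass an open file handle instead, for example `open('/tmp/SFPD.csv', newline='')`.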

## Sniffing the data

Let's assume the file you downloaded is in `/tmp`:

In [3]:
import pandas as pd

df_sfpd = pd.read_csv('/tmp/SFPD.csv')
df_sfpd.head(10).T

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| Incident Datetime | 2023/03/13 11:41:00 PM | 2023/03/01 05:02:00 AM | 2023/03/13 01:16:00 PM | 2023/03/13 10:59:00 AM | 2023/03/14 06:44:00 PM | 2023/02/15 03:00:00 AM | 2023/03/11 12:30:00 PM | 2023/03/13 11:26:00 AM | 2023/03/11 03:00:00 PM | 2023/03/11 02:00:00 PM |
| Incident Date | 2023/03/13 | 2023/03/01 | 2023/03/13 | 2023/03/13 | 2023/03/14 | 2023/02/15 | 2023/03/11 | 2023/03/13 | 2023/03/11 | 2023/03/11 |
| Incident Time | 23:41 | 05:02 | 13:16 | 10:59 | 18:44 | 03:00 | 12:30 | 11:26 | 15:00 | 14:00 |
| Incident Year | 2023 | 2023 | 2023 | 2023 | 2023 | 2023 | 2023 | 2023 | 2023 | 2023 |
| Incident Day of Week | Monday | Wednesday | Monday | Monday | Tuesday | Wednesday | Saturday | Monday | Saturday | Saturday |
| Report Datetime | 2023/03/13 11:41:00 PM | 2023/03/11 03:40:00 PM | 2023/03/13 01:17:00 PM | 2023/03/13 11:00:00 AM | 2023/03/14 06:45:00 PM | 2023/03/11 04:55:00 PM | 2023/03/12 04:15:00 PM | 2023/03/13 01:37:00 PM | 2023/03/13 08:29:00 AM | 2023/03/15 11:21:00 AM |
| Row ID | 125373607041 | 125379506374 | 125357107041 | 125355107041 | 125402407041 | 125378606372 | 125381606244 | 125419506244 | 125420606244 | 125431804134 |
| Incident ID | 1253736 | 1253795 | 1253571 | 1253551 | 1254024 | 1253786 | 1253816 | 1254195 | 1254206 | 1254318 |
| Incident Number | 230167874 | 236046151 | 220343896 | 230174885 | 230176728 | 236046123 | 236046004 | 236046850 | 236045937 | 230182844 |
| CAD Number | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 230741133.0 |


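To get a better idea of what the data looks like, let's list the distinct categories and then build simple histograms (frequency counts) of the categories and crime descriptions. Here are the distinct categories, followed by the category and description histograms: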

In [4]:
df_sfpd['Incident Category'].unique()

array(['Recovered Vehicle', 'Larceny Theft', 'Assault', 'Lost Property',
       'Drug Violation', 'Malicious Mischief', 'Drug Offense',
       'Non-Criminal', 'Fraud', 'Warrant', 'Other Offenses', 'Robbery',
       'Case Closure', 'Other Miscellaneous', 'Stolen Property',
       'Offences Against The Family And Children', 'Other',
       'Motor Vehicle Theft', 'Traffic Collision', 'Suspicious Occ',
       'Missing Person', 'Disorderly Conduct', 'Weapons Carrying Etc',
       'Rape', 'Burglary', 'Fire Report', 'Arson', 'Vandalism', 'Suicide',
       'Traffic Violation Arrest', 'Courtesy Report',
       'Forgery And Counterfeiting', 'Weapons Offense', 'Embezzlement',
       'Vehicle Misplaced', 'Miscellaneous Investigation', 'Suspicious',
       nan, 'Prostitution', 'Vehicle Impounded', 'Sex Offense',
       'Liquor Laws', 'Human Trafficking, Commercial Sex Acts',
       'Gambling', 'Homicide', 'Civil Sidewalks', 'Motor Vehicle Theft?',
       'Human Trafficking (A), Commercial Sex Acts'], dtype=object)

In [5]:
from collections import Counter
counter = Counter(df_sfpd['Incident Category'])
counter.most_common(10)

[('Larceny Theft', 221692),
 ('Other Miscellaneous', 50492),
 ('Malicious Mischief', 49177),
 ('Assault', 44216),
 ('Non-Criminal', 43158),
 ('Burglary', 40704),
 ('Motor Vehicle Theft', 37398),
 ('Recovered Vehicle', 28380),
 ('Fraud', 23436),
 ('Lost Property', 21111)]

In [6]:
from collections import Counter
counter = Counter(df_sfpd['Incident Description'])
counter.most_common(10)

[('Theft, From Locked Vehicle, >$950', 91953),
 ('Malicious Mischief, Vandalism to Property', 24142),
 ('Battery', 21492),
 ('Lost Property', 21111),
 ('Vehicle, Recovered, Auto', 20696),
 ('Theft, Other Property, $50-$200', 20492),
 ('Vehicle, Stolen, Auto', 20040),
 ('Theft, Other Property, >$950', 17341),
 ('Mental Health Detention', 15953),
 ('Theft, From Unlocked Vehicle, >$950', 13987)]
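pandas can produce the same kind of frequency table directly. Here is a minimal sketch using `Series.value_counts()` on a small made-up series (not the real data). By default it sorts by count in descending order and leaves out missing values, whereas `Counter` also counts the `nan` entries:

```python
import pandas as pd

# Made-up stand-in for df_sfpd['Incident Category']
categories = pd.Series(['Larceny Theft', 'Assault', 'Larceny Theft',
                        None, 'Burglary', 'Larceny Theft', 'Assault'])

counts = categories.value_counts()   # missing values dropped by default (dropna=True)
top2 = list(counts.head(2).items())  # analogous to Counter.most_common(2)
print(top2)
```

Whether you prefer `Counter` or `value_counts()` is mostly a matter of taste; the latter returns a pandas `Series`, which is handy if you want to keep working in pandas.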

## Word clouds

A more interesting way to visualize differences in term frequency is using a so-called word cloud.  

Python has a nice library you can use:

```bash
$ pip install wordcloud
```

**Exercise**: In a file called `catcloud.py`, once again get the categories and then create a word cloud object and display it:

```python
from wordcloud import WordCloud
import matplotlib.pyplot as plt
from collections import Counter
import pandas as pd
import sys

df_sfpd = pd.read_csv(sys.argv[1])

... delete Incident Categories with nan ...
categories = ... create Counter object on column 'Incident Category' ...

wordcloud = WordCloud(width=1800,
                      height=1400,
                      max_words=10000,
                      random_state=1,
                      relative_scaling=0.25)

wordcloud.fit_words(categories)

plt.imshow(wordcloud)
plt.axis("off")
plt.show()
```

### Which neighborhood is the "worst"?

**Exercise**: Now, pull out the neighborhood and do a word cloud on that in `hoodcloud.py` (it's ok to cut/paste).

### Crimes per neighborhood


**Exercise**: Use pandas to filter the data for a particular precinct or neighborhood, such as Mission or South of Market.  Modify `catcloud.py` to use a pandas query to filter for those records.  Pass the hood as an argument (`sys.argv[2]`):

```bash
$ python catcloud.py /tmp/SFPD.csv Mission
```
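One way to approach the filtering step, as a sketch rather than the reference solution: build a boolean mask on the neighborhood column and count categories in the matching rows. The column name `Analysis Neighborhood` is an assumption (it is not among the columns shown in the excerpt above, so check `df_sfpd.columns`), the helper `category_counts` is hypothetical, and the data frame here is made up:

```python
from collections import Counter
import pandas as pd

HOOD_COL = 'Analysis Neighborhood'  # assumed column name; verify against df.columns

def category_counts(df, hood):
    """Count incident categories for one neighborhood, ignoring missing categories."""
    in_hood = df[df[HOOD_COL] == hood]  # boolean-mask filter
    return Counter(in_hood['Incident Category'].dropna())

# Made-up rows standing in for the real data
df = pd.DataFrame({
    HOOD_COL: ['Mission', 'Mission', 'South of Market', 'Mission'],
    'Incident Category': ['Assault', 'Larceny Theft', 'Burglary', 'Assault'],
})
mission = category_counts(df, 'Mission')
print(mission.most_common())
```

In `catcloud.py`, `hood` would come from `sys.argv[2]` and the resulting counter would be handed to `fit_words`. The same filter can also be expressed with `DataFrame.query`.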

Run the `catcloud.py` script to get an idea of the types of crimes in those two neighborhoods. Here are the crime category clouds for the Mission and SOMA districts:

<table>
    <tr>
        <td><b>Mission</b></td><td><b>South of Market</b></td>
    </tr>
    <tr>
        <td><img src="figures/SFPD-mission-wordcloud.png" width="300"></td><td><img src="figures/SFPD-soma-wordcloud.png" width="300"></td>
    </tr>
 </table>

### Which neighborhood has most car thefts?

**Exercise**: Modify `hoodcloud.py` to filter for `Motor Vehicle Theft`. Pass the category as an argument (`sys.argv[2]`):

```bash
$ python hoodcloud.py /tmp/SFPD.csv 'Motor Vehicle Theft'
```

<img src="figures/SFPD-car-theft-hood-wordcloud.png" width="300">

Hmm... OK, so parking in Presidio Heights is fine, but SOMA and Bayview/Hunters Point are bad news.

If you get stuck on any of these exercises, you can look at the [code associated with these notes](https://github.com/parrt/msds692/tree/master/notes/code/sfpd).