# A data science case study: Crime hot-spots in LA over time


## Data acquisition

Many governments use <a href="https://www.tylertech.com/products/socrata/data-platform" target="_blank">Socrata</a> as their platform to serve data to the public.
<img src="images/socrata.png" width=600>

<table>
    <tr>
        <td><a href="https://opendata.cityofnewyork.us/" target="_blank"><img src="images/ny.png" width=400></a></td>
        <td><a href="https://datasf.org/opendata/" target="_blank"><img src="images/sf.png" width=400></a></td>
    </tr>
    <tr>
        <td><a href="https://data.cityofchicago.org/" target="_blank"><img src="images/ch.png" width=400></a></td>
        <td><a href="https://data.lacity.org/" target="_blank"><img src="images/la.png" width=400></a></td>
    </tr>
</table>

For this tutorial, we will look at LAPD's arrest data:

https://data.lacity.org/A-Safe-City/Arrest-Data-from-2020-to-Present/amvf-fr72

The <a href="https://dev.socrata.com/docs/endpoints.html" target="_blank">Socrata API</a> allows direct and real-time access to open data.

To access the data, we will use the `sodapy` library: https://github.com/xmunoz/sodapy

Instructions on how to use `sodapy` to access data for this dataset:

<img src="images/ladata.png">

https://dev.socrata.com/foundry/data.lacity.org/amvf-fr72
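
To make the API less of a black box, here is a minimal sketch of the kind of URL that sits behind a Socrata dataset. It assumes the usual SODA resource pattern (`/resource/<dataset id>.json`, with query keywords prefixed by `$`). The helper name `build_resource_url` is made up for this illustration and is not part of `sodapy`.

```python
from urllib.parse import urlencode

# Values taken from this tutorial: the LA open data domain and the arrest dataset id.
DOMAIN = "data.lacity.org"
DATASET_ID = "amvf-fr72"


def build_resource_url(domain, dataset_id, **soql):
    """Return a SODA resource URL; query keywords are sent as $-prefixed parameters."""
    base = f"https://{domain}/resource/{dataset_id}.json"
    if not soql:
        return base
    # safe="$" keeps the dollar sign readable instead of encoding it as %24.
    query = urlencode({f"${key}": value for key, value in soql.items()}, safe="$")
    return f"{base}?{query}"


url = build_resource_url(DOMAIN, DATASET_ID, limit=5)
print(url)
```

Pasting the printed URL into a browser should return a handful of records as JSON, which is a quick way to peek at the field names before writing any more code.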

### Question:
- What is the difference between exporting the data and using the API?

### It's time to start coding: importing libraries

Let's begin our Python journey. First, we identify the libraries we will use and import them into our project:
- `pandas`
- `plotly express`
- `sodapy`

In [None]:
import pandas as pd
import plotly.express as px
from sodapy import Socrata

### Creating a Socrata client
Next, we acquire the data using the Socrata API. Use the Socrata documentation to grab the code syntax for our arrest data.
- https://dev.socrata.com/foundry/data.lacity.org/amvf-fr72

In [None]:
client = Socrata("data.lacity.org", None)

# First 2000 results, returned as JSON from API / converted to Python list of
# dictionaries by sodapy.
results = client.get("amvf-fr72", limit=2000)

# Convert to pandas DataFrame
df = pd.DataFrame.from_records(results)

# print it with .sample, which gives you random rows
df.sample(2)
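
A `limit` only returns the first chunk of rows. If you need more than one request's worth, one common approach is to page through the data with `limit` and `offset`. The sketch below shows only the paging logic: `fetch_all` and `fake_fetch` are hypothetical names, and a list of fake rows stands in for the API so the cell runs offline.

```python
def fetch_all(fetch_page, page_size=1000, max_rows=None):
    """Collect rows page by page until a short or empty page signals the end.

    fetch_page is any callable taking (limit, offset) and returning a list of rows.
    """
    rows = []
    offset = 0
    while max_rows is None or len(rows) < max_rows:
        page = fetch_page(limit=page_size, offset=offset)
        rows.extend(page)
        if len(page) < page_size:
            break  # a short page means there is nothing left to fetch
        offset += page_size
    return rows if max_rows is None else rows[:max_rows]


# Stand-in for the API so the sketch runs offline: 2,500 fake rows.
fake_rows = [{"rpt_id": str(i)} for i in range(2500)]


def fake_fetch(limit, offset):
    return fake_rows[offset:offset + limit]


all_rows = fetch_all(fake_fetch, page_size=1000)
print(len(all_rows))
```

With the real client, the callable could be something like `lambda limit, offset: client.get("amvf-fr72", limit=limit, offset=offset, order="rpt_id")`. Ordering by a stable column keeps pages from overlapping, and using `rpt_id` for that is an assumption. Recent versions of `sodapy` also offer `client.get_all`, which paginates for you.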

That's great! But what if you wanted something specific, like "all arrests in July 2020"?

In [None]:
# add a "where" clause
results = client.get("amvf-fr72",
                     limit=10000,  # an arbitrarily high number (otherwise defaults to 1000)
                     where="arst_date between '2020-07-01T00:00:00' and '2020-07-31T00:00:00'"
                     )

# Convert to pandas DataFrame
df = pd.DataFrame.from_records(results)
df.sample(5)
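
Typing date ranges by hand gets error-prone once you want several months. Here is a small sketch that builds the filter string for any calendar month. The function name `month_where` is hypothetical, and it uses a half-open range (`>=` the first day, `<` the first day of the next month) instead of `between`, so the result does not depend on how the last day's timestamps are stored.

```python
from datetime import date


def month_where(column, year, month):
    """Build a SoQL filter covering one calendar month as a half-open range."""
    start = date(year, month, 1)
    # First day of the following month (rolls over to January after December).
    end = date(year + 1, 1, 1) if month == 12 else date(year, month + 1, 1)
    return (
        f"{column} >= '{start.isoformat()}T00:00:00' "
        f"and {column} < '{end.isoformat()}T00:00:00'"
    )


where_july = month_where("arst_date", 2020, 7)
print(where_july)
```

The resulting string can be passed to the client as `where=where_july`.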

## Data Exploration and Analysis

In [None]:
# how many rows and columns?
df.shape

In [None]:
# what fields?
df.info()
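
If `df.info()` reports every column as `object`, that is because the JSON values arrive as strings. Below is a minimal sketch of converting a date column and a numeric column. The two sample records are hypothetical and only mimic the shape of the API response.

```python
import pandas as pd

# Hypothetical records shaped like the API response: every value is a string.
sample = pd.DataFrame.from_records([
    {"rpt_id": "1", "arst_date": "2020-07-01T00:00:00.000", "age": "34"},
    {"rpt_id": "2", "arst_date": "2020-07-02T00:00:00.000", "age": "27"},
])

# Parse dates into datetime64 and ages into numbers; bad values become NaT/NaN.
sample["arst_date"] = pd.to_datetime(sample["arst_date"], errors="coerce")
sample["age"] = pd.to_numeric(sample["age"], errors="coerce")

print(sample.dtypes)
```

The same two calls can be applied to `df` itself. Real datetime values let plotting libraries treat the axis as a timeline rather than as text labels.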

Now, use plotly express to create a bar chart.
- https://plotly.com/python/bar-charts/

In [None]:
# a simple bar chart, putting date on the x-axis
px.bar(df,
       x='arst_date'
      )

Let's dig in further... what if we want to see the distribution of charge types by day?

In [None]:
# show me the distinct dates
df.arst_date.unique()

In [None]:
# show me the distinct charge groups
df.grp_description.unique()

In [None]:
# show me how many arrests per day
df.groupby(['arst_date']).count()

In [None]:
# show me how many arrests per charge
df.groupby(['grp_description']).count()

In [None]:
# ok, group by date and charge, and let's get a count for each
df_grouped=df.groupby(['arst_date','grp_description']).count()[['rpt_id']]
df_grouped.head(50)

In [None]:
# flatten the multi-indexed dataframe
df_flat = df_grouped.reset_index()
df_flat
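
An alternative worth knowing: `count()` tallies the non-null values in each column, so counting via `rpt_id` works only as long as that column is never missing. The sketch below uses `size()`, which counts rows per group, and names the count column in the same step. The toy table and the column name `arrests` are made up for the illustration.

```python
import pandas as pd

# Hypothetical miniature of the arrest table.
toy = pd.DataFrame({
    "arst_date": ["2020-07-01", "2020-07-01", "2020-07-01", "2020-07-02"],
    "grp_description": ["Narcotic Drug Laws", "Narcotic Drug Laws", "Robbery", "Robbery"],
    "rpt_id": ["1", "2", "3", "4"],
})

# size() counts rows per group, and reset_index(name=...) flattens the result
# while giving the count column a descriptive name.
toy_flat = (
    toy.groupby(["arst_date", "grp_description"])
       .size()
       .reset_index(name="arrests")
)
print(toy_flat)
```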

In [None]:
# make a bar chart
px.bar(df_flat,
       x='arst_date',
       y='rpt_id'
      )

In [None]:
# make a stacked bar chart
px.bar(df_flat,
       x='arst_date',
       y='rpt_id',
       color='grp_description' # this creates the "stack"
      )
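
To check the numbers behind a stacked bar chart, it can help to pivot the flattened counts into a wide table: one row per date, one column per charge group. This is a sketch on hypothetical counts shaped like `df_flat`, and the values are invented.

```python
import pandas as pd

# Hypothetical flattened counts, shaped like df_flat.
flat = pd.DataFrame({
    "arst_date": ["2020-07-01", "2020-07-01", "2020-07-02"],
    "grp_description": ["Narcotic Drug Laws", "Robbery", "Robbery"],
    "rpt_id": [2, 1, 1],
})

# One row per date, one column per charge group; missing combinations become 0.
wide = flat.pivot_table(
    index="arst_date",
    columns="grp_description",
    values="rpt_id",
    aggfunc="sum",
    fill_value=0,
)
print(wide)
```

Each row of `wide` corresponds to one bar, and the row total (`wide.sum(axis=1)`) is the full height of that bar.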

## Data prep: subsetting your data

Let's go back to the original dataset.

In [None]:
df.info()

That's a lot of fields. Let's create a subset of the data with just the following fields:

- `arst_date`
- `age`
- `descent_cd`
- `grp_description`
- `lat`
- `lon`


In [None]:
# subset the data
df_mini = df[['arst_date','age','descent_cd','grp_description','lat','lon']].copy()
df_mini

Our `lat` and `lon` columns need to be of data type float. Let's convert them.

In [None]:
# convert lat/lon's to floats
df_mini['lat'] = df_mini['lat'].astype(float)
df_mini['lon'] = df_mini['lon'].astype(float)
df_mini.info()

What happens if we create a scatter plot, placing `lon` on the x-axis and `lat` on the y-axis?

In [None]:
px.scatter(df_mini,
           x='lon',
           y='lat'
          )
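
If the scatter plot shows a few points far away from the main cluster (open datasets sometimes store missing coordinates as 0, 0), one possible cleanup is to keep only the rows inside a bounding box. The sketch below uses hypothetical points, and the box limits are a rough assumption about the extent of Los Angeles, not official boundaries.

```python
import pandas as pd

# Hypothetical coordinates: two plausible LA points and one (0, 0) placeholder.
pts = pd.DataFrame({
    "lat": ["34.0522", "0.0", "33.9416"],
    "lon": ["-118.2437", "0.0", "-118.4085"],
})
pts["lat"] = pd.to_numeric(pts["lat"], errors="coerce")
pts["lon"] = pd.to_numeric(pts["lon"], errors="coerce")

# Rough bounding box around Los Angeles; these limits are an assumption.
LAT_MIN, LAT_MAX = 33.5, 34.5
LON_MIN, LON_MAX = -119.0, -117.5

in_la = pts["lat"].between(LAT_MIN, LAT_MAX) & pts["lon"].between(LON_MIN, LON_MAX)
pts_clean = pts[in_la].copy()
print(len(pts), "->", len(pts_clean))
```

The same mask can be built from `df_mini['lat']` and `df_mini['lon']` before mapping.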

## Data visualization: Mapping with plotly
Plotly supports Mapbox slippy maps. Have fun with this, and change the `mapbox_style` attribute to any of the following:

* `open-street-map`
* `white-bg`
* `carto-positron`
* `carto-darkmatter`
* `stamen-terrain`
* `stamen-toner`
* `stamen-watercolor`


In [None]:
fig = px.scatter_mapbox(df_mini,
                        lat='lat',
                        lon='lon',
                        mapbox_style="stamen-terrain")
fig.show()

In [None]:
# before you run this cell, what do you think it will produce?
fig = px.scatter_mapbox(df_mini, 
                        lat="lat", 
                        lon="lon", 
                        color="descent_cd"
                       )
fig.update_layout(mapbox_style="carto-darkmatter")

fig.show()

## Advanced visualizations: 3D mapping
- https://kepler.gl/

<img src="images/kepler.png" width=800>

### A note on kepler and other geo-library installations

Installing kepler and geopandas can be challenging. If you are using a JupyterHub that already has these libraries installed, ignore the following instructions, which are for Anaconda users.

First, geopandas. I have had a lot of trouble installing geopandas in existing environments, so I recommend creating a brand new environment in Anaconda and installing geopandas first.

`conda install geopandas`

Then, install jupyter.

Kepler is not part of the conda forge channel, so we are forced to use pip:

`pip install keplergl`

If after installing kepler, the map does not show, try the following three commands in your environment's terminal:

`pip install --upgrade jupyterthemes`

`jupyter nbextension install --py --sys-prefix keplergl`

`jupyter nbextension enable --py --sys-prefix keplergl`

Source: https://github.com/keplergl/kepler.gl/issues/583

You may need to restart your jupyter notebook and Anaconda if that is what you are using.
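
Before moving on, it can save time to confirm that the environment can actually find each library. This is a small sketch using only the standard library. The function name `missing_packages` is hypothetical, and the list holds the top-level import names used in this tutorial.

```python
from importlib.util import find_spec


def missing_packages(names):
    """Return the subset of top-level package names that cannot be imported."""
    return [name for name in names if find_spec(name) is None]


# Top-level import names used in this tutorial.
needed = ["pandas", "plotly", "sodapy", "geopandas", "keplergl"]
print("Missing:", missing_packages(needed) or "none")
```

Anything listed as missing needs to be installed in the same environment that runs the notebook kernel.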

Import the keplergl library.

In [None]:
from keplergl import KeplerGl

Create a default kepler map.

In [None]:
map = KeplerGl(height=600,width=800)
map

Add our `df_mini` as a data layer on the map. Within the kepler widget, manipulate the map:
- change points to grid cells or hexbins
- change the color palette so that hot spots are red
- change the color scale from `quantile` to `quantize`
- add height to your data
- switch to 3D map view
- adjust the height of the data cells
- add `arst_date` as a filter

In [None]:
map.add_data(data=df_mini,name='arrests')

### Saving your kepler map as an html page

In [None]:
map.save_to_html(file_name='la_arrests.html',read_only=True)