# Lec 06 and 07: Data Cleaning and EDA

## Introduction
In this lecture we examine the process of data cleaning and Exploratory Data Analysis (EDA). Often you will acquire or even be given a collection of data in order to conduct some analysis or answer some questions. The first step in using that data is to ensure that it is in the correct form (cleaned) and that you understand its properties and limitations (EDA). Often as you explore data through EDA you will identify additional transformations that may be required before the data is ready for analysis.

In this notebook we obtain crime data from the city of Berkeley's public records. Ultimately, our goal might be to understand policing patterns but before we get there we must first clean and understand the data.  The original author of this lecture material used Berkeley data sets. 

## Getting the Data

To begin this analysis we want to get data about crimes in Berkeley.  The city of Berkeley maintains an [Open Data Portal](https://data.cityofberkeley.info/) for citizens to access data about the city.  We will be examining the following two Berkeley data sets:

1. [Call Data](https://data.cityofberkeley.info/Public-Safety/Berkeley-PD-Calls-for-Service/k2nh-s5h5)
1. [Stop Data (NEW)](https://data.cityofberkeley.info/Public-Safety/Berkeley-PD-Stop-Data-NEW-/4tbf-3yt8)

This data is also relatively well documented with detailed descriptions of what it contains.

Similar data sets exist for San Diego: https://data.sandiego.gov/datasets/police-calls-for-service/

This notebook also uses two libraries that may be new to you: cufflinks and plotly.

In [8]:
# !pip install cufflinks
# The installable package is named matplotlib; matplotlib.pyplot is a module inside it.
%pip install matplotlib


In [7]:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np

# plotly graphing library makes interactive, publication-quality graphs
#"offline" mode does not include any functionality for uploading figures or data to cloud services.
import plotly.offline as py

#Plotly Express is a new high-level Python visualization library: 
#it’s a wrapper for Plotly.py that exposes a simple syntax for complex charts.
import plotly.express as px

import plotly.graph_objs as go

#figure_factory module contains dedicated functions for creating very specific types of plots
import plotly.figure_factory as ff

#cufflink connects plotly with pandas to create graphs and charts of dataframes directly.
import cufflinks as cf
cf.set_config_file(offline=True, world_readable=False)

# Some helpful info about cuff links and plotly.
# https://www.analyticsvidhya.com/blog/2021/06/advanced-python-data-visualization-libraries-plotly/
# https://medium.com/analytics-vidhya/plotly-and-cufflinks-an-interactive-python-visualization-tool-for-eda-and-presentations-4490b11cfbcd
# https://www.youtube.com/watch?v=pkk5U-8Vl7A

In [77]:
calls_file=open("Calls_for_Service.csv","r")
stops_file=open("Stops_Data.json","r")

## Exploring the data

Now that we have obtained the data we want to understand its:

* **Structure** -- the "shape" of a data file
* **Granularity** -- how fine/coarse is each datum
* **Scope** -- how (in)complete is the data
* **Temporality** -- how is the data situated in time
* **Faithfulness** -- how well does the data capture "reality"

## Structure

Before we even begin to load the data it often helps to understand a little about the high-level structure:

1. How much data do I have?
1. How is it formatted?

### How big is the data?

I often like to start my analysis by getting a rough estimate of the size of the data.  This will help inform the tools I use and how I view the data.  If it is relatively small I might use a text editor or a spreadsheet to look at the data.  If it is larger, I might jump to more programmatic exploration or even use distributed computing tools.

However, here we will use Python tools to probe the file.

If the files are too large, they won't be usable in Jupyter or in some analysis tools.  In that case we would need to reduce the size of the files using data partitioning techniques, which are not covered in this class.

In [78]:
import os
print("calls_file is",  os.path.getsize("Calls_for_Service.csv") / 1e6, "MB")
print("stops_file is", os.path.getsize("Stops_Data.json") / 1e6, "MB")

calls_file is 0.50254 MB
stops_file is 17.558028 MB


Both files are relatively small, and we could comfortably examine them in a text editor.

In [79]:
# this measures the number of lines in the files
print("calls_file is", sum(1 for l in calls_file), "lines.")
print("stops_file is", sum(1 for l in stops_file), "lines.")

calls_file is 7877 lines.
stops_file is 59818 lines.


### What is the file format?  (Can we trust extensions? NO)

We already noticed that the files end in `csv` and `json`, which suggests that these are comma-separated values and JavaScript Object Notation files, respectively.  However, we can't always rely on the naming, as this is only a convention.  For example, here we picked the name of the file when downloading based on some hints in the URL.



**Often files will have incorrect extensions or no extension at all.**

Let's assume that these are text files (and do not contain binary encoded data) so we can print a "few lines" to get a better understanding of the file.

Looking at the line counts above, you might think that the number of lines matches the number of records (rows of data) in `Calls_for_Service.csv`.

### 7877 lines does not mean 7877 records!

In [80]:
with open("Calls_for_Service.csv", "r") as f:
   for i in range(20):
        print(i, "\t", repr(f.readline()))

0 	 'CASENO,OFFENSE,EVENTDT,EVENTTM,CVLEGEND,CVDOW,InDbDate,Block_Location,BLKADDR,City,State\n'
1 	 '21014296,THEFT MISD. (UNDER $950),04/01/2021 12:00:00 AM,10:58,LARCENY,4,06/15/2021 12:00:00 AM,"Berkeley, CA\n'
2 	 '(37.869058, -122.270455)",,Berkeley,CA\n'
3 	 '21014391,THEFT MISD. (UNDER $950),04/01/2021 12:00:00 AM,10:38,LARCENY,4,06/15/2021 12:00:00 AM,"Berkeley, CA\n'
4 	 '(37.869058, -122.270455)",,Berkeley,CA\n'
5 	 '21090494,THEFT MISD. (UNDER $950),04/19/2021 12:00:00 AM,12:15,LARCENY,1,06/15/2021 12:00:00 AM,"2100 BLOCK HASTE ST\n'
6 	 'Berkeley, CA\n'
7 	 '(37.864908, -122.267289)",2100 BLOCK HASTE ST,Berkeley,CA\n'
8 	 '21090204,THEFT FELONY (OVER $950),02/13/2021 12:00:00 AM,17:00,LARCENY,6,06/15/2021 12:00:00 AM,"2600 BLOCK WARRING ST\n'
9 	 'Berkeley, CA\n'
10 	 '(37.863934, -122.250262)",2600 BLOCK WARRING ST,Berkeley,CA\n'
11 	 '21090179,BURGLARY AUTO,02/08/2021 12:00:00 AM,6:20,BURGLARY - VEHICLE,1,06/15/2021 12:00:00 AM,"2700 BLOCK GARBER ST\n'
12 	 'Berkeley, CA

In [81]:
# Look above.  The first line holds the column names for the CSV.
# The first record spans two lines because the quoted Block_Location value contains a newline.

In [82]:
with open("Stops_Data.json", "r") as f:
   for i in range(20):
        print(i, "\t", repr(f.readline()))

0 	 '{\n'
1 	 '  "meta" : {\n'
2 	 '    "view" : {\n'
3 	 '      "id" : "4tbf-3yt8",\n'
4 	 '      "name" : "Berkeley PD - Stop Data (Jan 26, 2015 to Sep 30, 2020)",\n'
5 	 '      "assetType" : "dataset",\n'
6 	 '      "attribution" : "City of Berkeley Police Department",\n'
7 	 '      "averageRating" : 0,\n'
8 	 '      "category" : "Public Safety",\n'
9 	 '      "createdAt" : 1588602591,\n'
10 	 '      "description" : "This data was extracted from the Department’s Public Safety Server and covers data beginning January 26, 2015.  On January 26, 2015 the department began collecting data pursuant to General Order B-4 (issued December 31, 2014). Under that Order, officers were required to provide certain data after making any detention (vehicle, bicycle, pedestrian, suspicious auto).  This dataset provides information about detentions, including the race, sex, age range, of the person detained; the reason for the stop; the type of enforcement taken (if any), and whether or not a search wa

Notice that I used the `repr` function to return the raw string with special characters. 

### What are some observations about `Calls` data?

1. It appears to be in comma separated value (CSV) format.
1. First line contains the column headings.
1. There are lots of **new-line** `\n` characters:
    * at the ends of lines (delimiting records?)
    * *within records* as part of addresses.
1. There are **"quoted"** strings in the `Block_Location` column:
```
"2500 LE CONTE AVE
Berkeley, CA
(37.876965, -122.260544)"
```
These multi-line quoted strings are going to be difficult to work with.

### What are the implications on our earlier line count calculations?
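
One way to approach this question is to compare physical lines with parsed records. The sketch below uses Python's built-in `csv` module on a small in-memory sample; the `sample` text is made up to mimic the layout seen above and is not taken from the file. The same idea should carry over to `Calls_for_Service.csv` when it is opened with `newline=""`.

```python
import csv
import io

# Hypothetical two-record sample mimicking the quoted, multi-line Block_Location field.
sample = (
    'CASENO,OFFENSE,Block_Location,City\n'
    '1,THEFT,"2100 BLOCK HASTE ST\nBerkeley, CA\n(37.864908, -122.267289)",Berkeley\n'
    '2,BURGLARY,"Berkeley, CA\n(37.869058, -122.270455)",Berkeley\n'
)

# Physical lines: every "\n" ends a line, even inside a quoted field.
n_lines = len(sample.splitlines())

# Logical records: csv.reader keeps a quoted field together across newlines.
rows = list(csv.reader(io.StringIO(sample, newline="")))
n_records = len(rows) - 1  # subtract the header row

print(n_lines, "lines,", n_records, "records")
```

Because each quoted location spans two or three physical lines, the line count overstates the number of records.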

### What are some observations about `Stops` data?

This appears to be a fairly standard JSON file.  We notice that the file appears to contain a description of itself in a field called "meta" (which is presumably short for meta-data). 



## Loading the Data

We will now attempt to load the data into python.  We will be using the Pandas dataframe library for basic tabular data analysis.  Fortunately, the Pandas library has some relatively sophisticated functions for loading data. 

### Loading the Calls Data

Because the file appears to be a relatively well formatted CSV we will attempt to load it directly and allow the Pandas Library to deduce column headers.  (Always check that first row and column look correct after loading.)

In [83]:
calls = pd.read_csv("Calls_for_Service.csv")
calls.head()

# Below you can see that the Block_Location column, which we expected read_csv to have trouble with, is read in correctly,
# but there is missing data in other columns, e.g. BLKADDR.



Unnamed: 0,CASENO,OFFENSE,EVENTDT,EVENTTM,CVLEGEND,CVDOW,InDbDate,Block_Location,BLKADDR,City,State
0,21014296,THEFT MISD. (UNDER $950),04/01/2021 12:00:00 AM,10:58,LARCENY,4,06/15/2021 12:00:00 AM,"Berkeley, CA\n(37.869058, -122.270455)",,Berkeley,CA
1,21014391,THEFT MISD. (UNDER $950),04/01/2021 12:00:00 AM,10:38,LARCENY,4,06/15/2021 12:00:00 AM,"Berkeley, CA\n(37.869058, -122.270455)",,Berkeley,CA
2,21090494,THEFT MISD. (UNDER $950),04/19/2021 12:00:00 AM,12:15,LARCENY,1,06/15/2021 12:00:00 AM,"2100 BLOCK HASTE ST\nBerkeley, CA\n(37.864908,...",2100 BLOCK HASTE ST,Berkeley,CA
3,21090204,THEFT FELONY (OVER $950),02/13/2021 12:00:00 AM,17:00,LARCENY,6,06/15/2021 12:00:00 AM,"2600 BLOCK WARRING ST\nBerkeley, CA\n(37.86393...",2600 BLOCK WARRING ST,Berkeley,CA
4,21090179,BURGLARY AUTO,02/08/2021 12:00:00 AM,6:20,BURGLARY - VEHICLE,1,06/15/2021 12:00:00 AM,"2700 BLOCK GARBER ST\nBerkeley, CA\n(37.86066,...",2700 BLOCK GARBER ST,Berkeley,CA


In [84]:
calls.shape[0]
# this shows the number of logical row or records

2632

In [85]:
calls.columns

Index(['CASENO', 'OFFENSE', 'EVENTDT', 'EVENTTM', 'CVLEGEND', 'CVDOW',
       'InDbDate', 'Block_Location', 'BLKADDR', 'City', 'State'],
      dtype='object')

In [86]:
len(calls.columns)

11

### Preliminary observations on the data?

1. `EVENTDT` -- Contains a date; the time portion appears to always be `12:00:00 AM`.
1. `EVENTTM` -- Contains the time in 24 hour format (What timezone?)
1. `CVDOW` -- Appears to be some encoding of the day of the week (see data documentation).  What do we do about this?
1. `InDbDate` -- Appears to be correctly formatted and appears pretty consistent in time.
1. **`Block_Location` -- Errr, what a mess!  Street address, newline characters, and geocoordinates all merged!!  Fortunately, this field was "quoted"; otherwise we would have had trouble parsing the file. (why?)**
1. `BLKADDR` -- This appears to be the address in Block Location.
1. `City` and `State` seem redundant given this is supposed to be the city of Berkeley dataset.  If every row holds the same value, these columns waste space.
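
To see why the quoting matters, here is a minimal sketch using the built-in `csv` module. Both input strings are hypothetical one-line records invented for illustration. The comma inside `Berkeley, CA` is treated as data when the field is quoted and as a field separator when it is not.

```python
import csv
import io

# Hypothetical records: the same values, with and without quotes around the location.
quoted = '21014296,"Berkeley, CA",CA\n'
unquoted = '21014296,Berkeley, CA,CA\n'

quoted_fields = next(csv.reader(io.StringIO(quoted)))
unquoted_fields = next(csv.reader(io.StringIO(unquoted)))

print(len(quoted_fields), quoted_fields)
print(len(unquoted_fields), unquoted_fields)
```

The unquoted version yields one field too many, so every later column would shift to the right of its header.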


### Checking that the City and State fields are all Berkeley CA

We notice that there are city and state columns.  Since this is supposed to be data for the city of Berkeley these columns appear to be redundant.  Let's quickly observe the unique values for these two columns.

In [87]:
calls["State"].unique()

array(['CA'], dtype=object)

In [88]:
calls["City"].unique()

array(['Berkeley'], dtype=object)

### Decoding day of the week

According to the documentation, `CVDOW=0` is Sunday, `CVDOW=1` is Monday, and so on.  Therefore we can make a series to decode the day of the week for each record and join that series with the calls data.

In [89]:
dow = pd.Series(["Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"], name="Day")
dow

0       Sunday
1       Monday
2      Tuesday
3    Wednesday
4     Thursday
5       Friday
6     Saturday
Name: Day, dtype: object

In [90]:
# we first make this above series into a DataFrame
df_dow = pd.DataFrame(dow)
# Notice that I am dropping the column if it already exists to
# make it so I can run this cell more than once
#calls = pd.merge(calls, 
#         df_dow, left_on='CVDOW', right_index=True).sort_index()
calls = pd.merge(calls.drop(columns="Day", errors="ignore"), 
         df_dow, left_on='CVDOW', right_index=True).sort_index()
calls

Unnamed: 0,CASENO,OFFENSE,EVENTDT,EVENTTM,CVLEGEND,CVDOW,InDbDate,Block_Location,BLKADDR,City,State,Day
0,21014296,THEFT MISD. (UNDER $950),04/01/2021 12:00:00 AM,10:58,LARCENY,4,06/15/2021 12:00:00 AM,"Berkeley, CA\n(37.869058, -122.270455)",,Berkeley,CA,Thursday
1,21014391,THEFT MISD. (UNDER $950),04/01/2021 12:00:00 AM,10:38,LARCENY,4,06/15/2021 12:00:00 AM,"Berkeley, CA\n(37.869058, -122.270455)",,Berkeley,CA,Thursday
2,21090494,THEFT MISD. (UNDER $950),04/19/2021 12:00:00 AM,12:15,LARCENY,1,06/15/2021 12:00:00 AM,"2100 BLOCK HASTE ST\nBerkeley, CA\n(37.864908,...",2100 BLOCK HASTE ST,Berkeley,CA,Monday
3,21090204,THEFT FELONY (OVER $950),02/13/2021 12:00:00 AM,17:00,LARCENY,6,06/15/2021 12:00:00 AM,"2600 BLOCK WARRING ST\nBerkeley, CA\n(37.86393...",2600 BLOCK WARRING ST,Berkeley,CA,Saturday
4,21090179,BURGLARY AUTO,02/08/2021 12:00:00 AM,6:20,BURGLARY - VEHICLE,1,06/15/2021 12:00:00 AM,"2700 BLOCK GARBER ST\nBerkeley, CA\n(37.86066,...",2700 BLOCK GARBER ST,Berkeley,CA,Monday
...,...,...,...,...,...,...,...,...,...,...,...,...
2627,20058742,BURGLARY RESIDENTIAL,12/21/2020 12:00:00 AM,12:45,BURGLARY - RESIDENTIAL,1,06/15/2021 12:00:00 AM,"1300 BLOCK UNIVERSITY AVE\nBerkeley, CA\n(37.8...",1300 BLOCK UNIVERSITY AVE,Berkeley,CA,Monday
2628,21008017,BRANDISHING,02/24/2021 12:00:00 AM,15:06,WEAPONS OFFENSE,3,06/15/2021 12:00:00 AM,"100 BLOCK SEAWALL DR\nBerkeley, CA\n(37.863611...",100 BLOCK SEAWALL DR,Berkeley,CA,Wednesday
2629,21013239,THEFT FELONY (OVER $950),03/24/2021 12:00:00 AM,0:00,LARCENY,3,06/15/2021 12:00:00 AM,"2800 BLOCK HILLEGASS AVE\nBerkeley, CA\n(37.85...",2800 BLOCK HILLEGASS AVE,Berkeley,CA,Wednesday
2630,21018143,THEFT MISD. (UNDER $950),04/24/2021 12:00:00 AM,18:35,LARCENY,6,06/15/2021 12:00:00 AM,"2500 BLOCK TELEGRAPH AVE\nBerkeley, CA\n(37.86...",2500 BLOCK TELEGRAPH AVE,Berkeley,CA,Saturday
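
As a cross-check on the decoding, the weekday can also be derived from the event date itself and compared with the decoded `CVDOW`. The sketch below is one possible way to do that; it uses a hypothetical mini-table (`sample`) built from three rows visible in the output above, and a dictionary lookup with `Series.map` in place of the merge.

```python
import pandas as pd

# Hypothetical mini-table using values visible in the output above.
sample = pd.DataFrame({
    "EVENTDT": ["04/01/2021 12:00:00 AM", "04/19/2021 12:00:00 AM", "02/13/2021 12:00:00 AM"],
    "CVDOW": [4, 1, 6],
})

day_names = {0: "Sunday", 1: "Monday", 2: "Tuesday", 3: "Wednesday",
             4: "Thursday", 5: "Friday", 6: "Saturday"}

# Decode the integer code with a plain dictionary lookup.
sample["Day"] = sample["CVDOW"].map(day_names)

# Independently derive the weekday from the date itself.
parsed = pd.to_datetime(sample["EVENTDT"], format="%m/%d/%Y %I:%M:%S %p")
sample["DayFromDate"] = parsed.dt.day_name()

# Rows where the two disagree would deserve a closer look.
mismatches = sample[sample["Day"] != sample["DayFromDate"]]
print(len(mismatches), "mismatching rows")
```

If the two columns ever disagree on the full table, that would point to either a different encoding than the documentation states or a data entry problem.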


### Cleaning Block Location

The block location contains the lat/lon coordinates and I might want to use these to analyze the location of each request.  Let's try to extract the GPS coordinates using regular expressions (we will cover regular expressions in future lectures):

In [91]:
calls['Block_Location'].head(10)

0               Berkeley, CA\n(37.869058, -122.270455)
1               Berkeley, CA\n(37.869058, -122.270455)
2    2100 BLOCK HASTE ST\nBerkeley, CA\n(37.864908,...
3    2600 BLOCK WARRING ST\nBerkeley, CA\n(37.86393...
4    2700 BLOCK GARBER ST\nBerkeley, CA\n(37.86066,...
5    1400 BLOCK SHATTUCK AVE\nBerkeley, CA\n(37.881...
6    2020 BANCROFT WAY\nBerkeley, CA\n(37.867426, -...
7    2800 BLOCK ADELINE ST\nBerkeley, CA\n(37.85811...
8    2200 BLOCK GRANT ST\nBerkeley, CA\n(37.868355,...
9    1215 CARRISON ST\nBerkeley, CA\n(37.851491, -1...
Name: Block_Location, dtype: object

In [92]:
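calls_lat_lon = (
    # Replace newlines with tabs so that `.` in the regex can match across the whole field
    calls['Block_Location'].str.replace("\n", "\t")
    # Extract Lat and Lon using a regular expression (raw string avoids invalid-escape warnings)
    .str.extract(r".*\((?P<Lat>\d*\.\d*), (?P<Lon>-?\d*\.\d*)\)", expand=True)
)
calls_lat_lon.head(20)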

Unnamed: 0,Lat,Lon
0,37.869058,-122.270455
1,37.869058,-122.270455
2,37.864908,-122.267289
3,37.863934,-122.250262
4,37.86066,-122.253407
5,37.881957,-122.269551
6,37.867426,-122.269138
7,37.858116,-122.268002
8,37.868355,-122.274953
9,37.851491,-122.28563
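
One detail worth keeping in mind is that `str.extract` returns the captured text as strings, not numbers. The sketch below, run on a hypothetical `locations` series written in the same layout as `Block_Location`, shows one way to convert the result to floats with `pd.to_numeric`. The slightly stricter pattern here is an illustrative variant, not the one used in the cell above.

```python
import pandas as pd

# Hypothetical sample strings in the same layout as Block_Location.
locations = pd.Series([
    "Berkeley, CA\n(37.869058, -122.270455)",
    "2100 BLOCK HASTE ST\nBerkeley, CA\n(37.864908, -122.267289)",
    "NO COORDINATES HERE",
])

# Named groups become column names; rows that do not match become NaN.
pattern = r"\((?P<Lat>-?\d+\.\d+), (?P<Lon>-?\d+\.\d+)\)"
lat_lon = locations.str.extract(pattern, expand=True)

# str.extract yields strings; convert each column to floats for numeric work.
lat_lon = lat_lon.apply(pd.to_numeric)
print(lat_lon.dtypes)
print(lat_lon)
```

With numeric columns, operations such as plotting or computing distances work without further conversion.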


The following block of code joins the extracted Latitude and Longitude fields with the calls data.  Notice that we actually drop these fields before joining.  This is to enable repeated invocation of this cell even after the join has been completed. 

In [93]:
# Remove Lat and Lon if they already exist so that this cell can be re-run safely (reproducible).
# This is good coding practice when you submit a notebook.

calls.drop(["Lat", "Lon"], axis=1, inplace=True, errors="ignore")
# Join in the latitude and longitude data
calls = calls.merge(calls_lat_lon, left_index=True, right_index=True)
# calls[["Lat", "Lon"]] = calls_lat_lon
# calls.join(calls_lat_lon)
calls.head()

Unnamed: 0,CASENO,OFFENSE,EVENTDT,EVENTTM,CVLEGEND,CVDOW,InDbDate,Block_Location,BLKADDR,City,State,Day,Lat,Lon
0,21014296,THEFT MISD. (UNDER $950),04/01/2021 12:00:00 AM,10:58,LARCENY,4,06/15/2021 12:00:00 AM,"Berkeley, CA\n(37.869058, -122.270455)",,Berkeley,CA,Thursday,37.869058,-122.270455
1,21014391,THEFT MISD. (UNDER $950),04/01/2021 12:00:00 AM,10:38,LARCENY,4,06/15/2021 12:00:00 AM,"Berkeley, CA\n(37.869058, -122.270455)",,Berkeley,CA,Thursday,37.869058,-122.270455
2,21090494,THEFT MISD. (UNDER $950),04/19/2021 12:00:00 AM,12:15,LARCENY,1,06/15/2021 12:00:00 AM,"2100 BLOCK HASTE ST\nBerkeley, CA\n(37.864908,...",2100 BLOCK HASTE ST,Berkeley,CA,Monday,37.864908,-122.267289
3,21090204,THEFT FELONY (OVER $950),02/13/2021 12:00:00 AM,17:00,LARCENY,6,06/15/2021 12:00:00 AM,"2600 BLOCK WARRING ST\nBerkeley, CA\n(37.86393...",2600 BLOCK WARRING ST,Berkeley,CA,Saturday,37.863934,-122.250262
4,21090179,BURGLARY AUTO,02/08/2021 12:00:00 AM,6:20,BURGLARY - VEHICLE,1,06/15/2021 12:00:00 AM,"2700 BLOCK GARBER ST\nBerkeley, CA\n(37.86066,...",2700 BLOCK GARBER ST,Berkeley,CA,Monday,37.86066,-122.253407


We can now check whether any records are missing latitude and longitude entries:

In [94]:
calls['Lon'].isnull()

0       False
1       False
2       False
3       False
4       False
        ...  
2627    False
2628    False
2629    False
2630    False
2631    False
Name: Lon, Length: 2632, dtype: bool

In [95]:
# we check if there are any missing values in the Lon column
calls[calls['Lon'].isnull()].head(10)

Unnamed: 0,CASENO,OFFENSE,EVENTDT,EVENTTM,CVLEGEND,CVDOW,InDbDate,Block_Location,BLKADDR,City,State,Day,Lat,Lon


<br/><br/><br/>


---
## Loading the `Stops_Data.json` Data
Python has relatively good support for JSON data since it closely matches the internal python object model.  In the following cell we import the entire JSON datafile into a python dictionary.

In [96]:
import json

with open("Stops_Data.json", "rb") as f:
    stops_json = json.load(f)

In [97]:
# what kind of data structure returned ??

In [98]:
#stops_json

---
<br/><br/><br/>

### We can now examine what keys are in the top level json object

We can list the keys to determine what data is stored in the object.

In [99]:
stops_json.keys()

dict_keys(['meta', 'data'])

#### Observation

The JSON dictionary contains a `meta` key which likely refers to metadata (data about the data).  Metadata is often maintained with the data and can be a good source of additional information.


## Digging into Meta Data

We can investigate the meta data further by examining the keys associated with the metadata.

In [100]:
stops_json['meta'].keys()

dict_keys(['view'])

The `meta` key contains another dictionary called `view`.  This likely refers to meta-data about a particular "view" of some underlying database.  We will learn more about views as we study SQL later in the class.    

In [101]:
stops_json['meta']['view'].keys()

dict_keys(['id', 'name', 'assetType', 'attribution', 'averageRating', 'category', 'createdAt', 'description', 'displayType', 'downloadCount', 'hideFromCatalog', 'hideFromDataJson', 'licenseId', 'newBackend', 'numberOfComments', 'oid', 'provenance', 'publicationAppendEnabled', 'publicationDate', 'publicationGroup', 'publicationStage', 'rowsUpdatedAt', 'rowsUpdatedBy', 'tableId', 'totalTimesRated', 'viewCount', 'viewLastModified', 'viewType', 'approvals', 'columns', 'grants', 'license', 'metadata', 'owner', 'query', 'rights', 'tableAuthor', 'tags', 'flags'])

Notice that this is a nested/recursive data structure.  As we dig deeper we reveal more and more keys and the corresponding data:

```
stops_json
|-> data
    | ... (haven't explored yet)
|-> meta
    |-> view
        | -> id
        | -> name
        | -> attribution 
        ...
```
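
Rather than calling `.keys()` one level at a time, the nesting can be walked programmatically. The following is a minimal sketch; `outline` is a hypothetical helper written for this illustration, and `toy` is a made-up stand-in with the same top-level shape as `stops_json`.

```python
def outline(obj, depth=0, max_depth=2):
    """Return indented key names of nested dictionaries down to max_depth."""
    lines = []
    if isinstance(obj, dict) and depth <= max_depth:
        for key, value in obj.items():
            lines.append("    " * depth + "|-> " + str(key))
            # Recurse into the value; non-dict values contribute nothing.
            lines.extend(outline(value, depth + 1, max_depth))
    return lines

# Hypothetical stand-in with the same top-level shape as stops_json.
toy = {"meta": {"view": {"id": "4tbf-3yt8", "name": "Stop Data"}}, "data": [[1, 2], [3, 4]]}
print("\n".join(outline(toy)))
```

Limiting the depth keeps the printout short, which matters for a file whose `view` dictionary has dozens of keys.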

There is a key called description in the view sub dictionary.  This likely contains a description of the data:

In [102]:
print(stops_json['meta']['view']['description'])

This data was extracted from the Department’s Public Safety Server and covers data beginning January 26, 2015.  On January 26, 2015 the department began collecting data pursuant to General Order B-4 (issued December 31, 2014). Under that Order, officers were required to provide certain data after making any detention (vehicle, bicycle, pedestrian, suspicious auto).  This dataset provides information about detentions, including the race, sex, age range, of the person detained; the reason for the stop; the type of enforcement taken (if any), and whether or not a search was conducted.  Also provided are the date, time, location of the detention, as well as the incident number and call for service type.



### Observations?

1. JSON makes it easier (than CSV) to create "self-documented data". 
1. Self-documenting data can be helpful since it maintains its own description, and these descriptions are more likely to be updated as data changes.  This is an advantage over the CSV file format.

### Examining the Data Field

We can look at a few entries in the data field:

In [103]:
for i in range(3):
    print(i, "\t", stops_json['data'][i])

1 	 ['row-7zd2.fzni_26x7', '00000000-0000-0000-5F81-4C8F8669527C', 0, 1622797226, None, 1622797226, None, '{ }', '2018-03-14T16:25:55', '2018-00015116', 'ANTHONY ST / 7TH ST', 'BERKELEY', '37.8522263089015', '-122.291495435525', 'T', 'White', 'Female', '30-39', 'Traffic', 'Citation', 'No Search']
2 	 ['row-ni5e.3gps_2h9x', '00000000-0000-0000-52BB-9DF61D6907B1', 0, 1622192428, None, 1622192428, None, '{ }', '2016-07-01T16:36:28', '2016-00038890', '2008 SHATTUCK AVE', 'BERKELEY', '37.8717333701234', '-122.268656966847', '1194B', 'Black', 'Male', '18-29', 'Reas. Susp.', 'Citation', 'No Search']


## Building a Dataframe from JSON
In the following block of code we:
1. Translate the JSON records into a dataframe
1. Remove columns that have no metadata description.  This would be a bad idea in general but here we remove these columns since the above analysis suggests that they are unlikely to contain useful information.
1. Examine the top of the table

In [104]:
# notice that we can directly see the column names from the json
type(stops_json['meta']['view']['columns'])

list

In [105]:


#[c['name'] for c in stops_json['meta']['view']['columns']]

In [106]:
# Load the data from JSON and assign column titles
# creating a dataframe from a list and specifying the column names.
# https://www.geeksforgeeks.org/create-a-pandas-dataframe-from-lists/
stops = pd.DataFrame(
    stops_json['data'],
    columns=[c['name'] for c in stops_json['meta']['view']['columns']])

stops.head()

Unnamed: 0,sid,id,position,created_at,created_meta,updated_at,updated_meta,meta,CreateDatetime,IncidentNumber,Address,City,Lat,Lon,CallType,Race,Gender,Age,Reason,Enforcement,Car Search
0,row-qdhm~vqfy.3zxd,00000000-0000-0000-B8F1-995A55B55DE5,0,1628845237,,1628845237,,{ },2017-01-29T23:01:39,2017-00005533,80 BOLIVAR DR,BERKELEY,37.864546219,-122.301738812,1196,Hispanic,Female,>40,Investigation,Warning,Search
1,row-7zd2.fzni_26x7,00000000-0000-0000-5F81-4C8F8669527C,0,1622797226,,1622797226,,{ },2018-03-14T16:25:55,2018-00015116,ANTHONY ST / 7TH ST,BERKELEY,37.8522263089015,-122.291495435525,T,White,Female,30-39,Traffic,Citation,No Search
2,row-ni5e.3gps_2h9x,00000000-0000-0000-52BB-9DF61D6907B1,0,1622192428,,1622192428,,{ },2016-07-01T16:36:28,2016-00038890,2008 SHATTUCK AVE,BERKELEY,37.8717333701234,-122.268656966847,1194B,Black,Male,18-29,Reas. Susp.,Citation,No Search
3,row-m7he.wthe.w7r4,00000000-0000-0000-6CC6-513F3C600645,0,1623402030,,1623402030,,{ },2016-11-06T07:15:42,2016-00065741,RUSSELL ST / MABEL ST,BERKELEY,37.8541134318992,-122.284193275337,1196,White,Male,>40,Investigation,Warning,No Search
4,row-ehgt.t33k-3t6f,00000000-0000-0000-DA9A-12258FA96C5E,0,1622192428,,1622192428,,{ },2016-12-18T19:18:06,2016-00074362,UNIVERSITY AVE / ACTON ST,BERKELEY,37.8701126448154,-122.284248929276,T,Black,Male,>40,Traffic,Warning,No Search


In [107]:
stops.columns

Index(['sid', 'id', 'position', 'created_at', 'created_meta', 'updated_at',
       'updated_meta', 'meta', 'CreateDatetime', 'IncidentNumber', 'Address',
       'City', 'Lat', 'Lon', 'CallType', 'Race', 'Gender', 'Age', 'Reason',
       'Enforcement', 'Car Search'],
      dtype='object')
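
The column-removal step mentioned in the list above could look something like the following sketch. It rests on an assumption that is not verified here: that documented columns carry a `description` key in their metadata entry while system columns do not. The `columns_meta` list, its descriptions, and the `rows` are all hypothetical.

```python
import pandas as pd

# Hypothetical metadata: only documented columns carry a "description" key.
columns_meta = [
    {"name": "sid"},
    {"name": "id"},
    {"name": "Address", "description": "Location of the stop"},
    {"name": "Race", "description": "Race of the person detained"},
]
rows = [
    ["row-1", "0000-A", "80 BOLIVAR DR", "Hispanic"],
    ["row-2", "0000-B", "2008 SHATTUCK AVE", "Black"],
]

toy_stops = pd.DataFrame(rows, columns=[c["name"] for c in columns_meta])

# Keep only columns whose metadata entry includes a description.
documented = [c["name"] for c in columns_meta if "description" in c]
toy_stops = toy_stops[documented]
print(toy_stops.columns.tolist())
```

Before applying this to the real data, it would be worth inspecting a few entries of `stops_json['meta']['view']['columns']` to confirm which keys are actually present.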

Sometimes the data might contain too many columns.  Let's ask pandas to show us more.  Be careful, showing too much could break your notebook.

In [108]:
pd.set_option('display.max_columns', 100) 
pd.set_option('display.max_rows', 100) 

In [109]:
stops.head(10)

Unnamed: 0,sid,id,position,created_at,created_meta,updated_at,updated_meta,meta,CreateDatetime,IncidentNumber,Address,City,Lat,Lon,CallType,Race,Gender,Age,Reason,Enforcement,Car Search
0,row-qdhm~vqfy.3zxd,00000000-0000-0000-B8F1-995A55B55DE5,0,1628845237,,1628845237,,{ },2017-01-29T23:01:39,2017-00005533,80 BOLIVAR DR,BERKELEY,37.864546219,-122.301738812,1196,Hispanic,Female,>40,Investigation,Warning,Search
1,row-7zd2.fzni_26x7,00000000-0000-0000-5F81-4C8F8669527C,0,1622797226,,1622797226,,{ },2018-03-14T16:25:55,2018-00015116,ANTHONY ST / 7TH ST,BERKELEY,37.8522263089015,-122.291495435525,T,White,Female,30-39,Traffic,Citation,No Search
2,row-ni5e.3gps_2h9x,00000000-0000-0000-52BB-9DF61D6907B1,0,1622192428,,1622192428,,{ },2016-07-01T16:36:28,2016-00038890,2008 SHATTUCK AVE,BERKELEY,37.8717333701234,-122.268656966847,1194B,Black,Male,18-29,Reas. Susp.,Citation,No Search
3,row-m7he.wthe.w7r4,00000000-0000-0000-6CC6-513F3C600645,0,1623402030,,1623402030,,{ },2016-11-06T07:15:42,2016-00065741,RUSSELL ST / MABEL ST,BERKELEY,37.8541134318992,-122.284193275337,1196,White,Male,>40,Investigation,Warning,No Search
4,row-ehgt.t33k-3t6f,00000000-0000-0000-DA9A-12258FA96C5E,0,1622192428,,1622192428,,{ },2016-12-18T19:18:06,2016-00074362,UNIVERSITY AVE / ACTON ST,BERKELEY,37.8701126448154,-122.284248929276,T,Black,Male,>40,Traffic,Warning,No Search
5,row-273p-3a4d.8k4g,00000000-0000-0000-557E-6C26E24F271F,0,1625216435,,1625216435,,{ },2016-09-04T22:52:10,2016-00052424,VINE ST / MARTIN LUTHER KING JR WAY,BERKELEY,37.8796298549105,-122.273885711146,T,White,Male,18-29,Traffic,Arrest,Search
6,row-hnpw.gknh-mv54,00000000-0000-0000-98A9-3EE4DFAF13A6,0,1630659635,,1630659635,,{ },2017-09-12T14:04:50,2017-00054950,ASHBY AVE / CLAREMONT AVE,BERKELEY,37.858046062,-122.245300785,T,White,Male,>40,Traffic,Citation,No Search
7,row-bafv-uht5-nsjz,00000000-0000-0000-E698-3B20EEA8791D,0,1628845237,,1628845237,,{ },2016-12-31T23:17:42,2016-00076853,80 BOLIVAR DR,BERKELEY,37.864546219,-122.301738812,1194,Other,Male,>40,Investigation,Warning,No Search
8,row-4yyt~kgb5.u8mh,00000000-0000-0000-CF33-B11AD2115B40,0,1622192428,,1622192428,,{ },2017-09-08T09:21:20,2017-00053833,9TH ST / ASHBY AVE,BERKELEY,37.8516158126794,-122.289369289088,T,Black,Female,30-39,Traffic,Citation,No Search
9,row-n3e4.9c76_2xw4,00000000-0000-0000-96D3-DA7758808808,0,1625821236,,1625821236,,{ },2017-06-09T06:48:12,2017-00033003,SACRAMENTO ST / VIRGINIA ST,BERKELEY,37.874868512,-122.282443588,T,Asian,Male,>40,Traffic,Citation,No Search


## Preliminary Observations

What do we observe so far?

We observe:
1. The `IncidentNumber` appears to have the year encoded in it - we could potentially use this as a validation check.  
1. The `created_at` and `updated_at` fields look like they are in seconds since January 1, 1970 (Unix time); values around 1.6 billion correspond to dates in 2021.
1. The `CreateDatetime` field looks to be formatted as YYYY-MM-DDTHH:MM:SS (the ISO 8601 layout), where `T` separates the date from the time.
1. The `Age` field has variable size brackets: 18-29, 30-39, >40.
1. The definition of `CallType` can be found in [Berkeley Police Department Call-Incident Types](https://www.cityofberkeley.info/uploadedFiles/Police/Level_3_-_General/Call-Incident%20Types.pdf): 1194-Pedestrian Stop, 1196-Suspicious Vehicle Stop, T-Traffic Stop
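
Two of these observations can be checked with a few lines of pandas. The sketch below builds a hypothetical two-row table (`toy`) from values shown in the output above. Interpreting `created_at` as seconds since the Unix epoch gives dates in 2021, whereas interpreting it as milliseconds would place every value in January 1970. The year comparison is the validation check suggested for `IncidentNumber`.

```python
import pandas as pd

# Values copied from the first rows shown above.
toy = pd.DataFrame({
    "created_at": [1628845237, 1622797226],
    "CreateDatetime": ["2017-01-29T23:01:39", "2018-03-14T16:25:55"],
    "IncidentNumber": ["2017-00005533", "2018-00015116"],
})

# Interpret created_at as seconds since the Unix epoch (UTC).
toy["created_at_ts"] = pd.to_datetime(toy["created_at"], unit="s")

# ISO 8601 strings parse directly.
toy["CreateDatetime"] = pd.to_datetime(toy["CreateDatetime"])

# Validation check: leading four digits of IncidentNumber vs. year of the stop.
incident_year = toy["IncidentNumber"].str[:4].astype(int)
toy["year_matches"] = incident_year == toy["CreateDatetime"].dt.year
print(toy[["created_at_ts", "CreateDatetime", "year_matches"]])
```

On the full table, any row where `year_matches` is false would be a candidate for closer inspection.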
     
Recall the description:

### Stop Data
<img src="stops_desc.png" width=800px />


--- 

<br/><br/><br/>

# EDA 

Now that we have loaded our various data files, let's try to understand a bit more about the data by examining properties of individual fields.

---
<br/><br/><br/>

### Are Case Numbers unique?

Case numbers are probably used internally to track individual cases and may reference other data we don't have access to.  However, it is possible that multiple calls could be associated with the same case.  Let's see if the case numbers are all unique.

In [110]:
print("There are", calls['CASENO'].unique().shape[0], "unique case numbers.")
print("There are", calls.shape[0], "calls in the table.")

There are 2632 unique case numbers.
There are 2632 calls in the table.


In [111]:
# if the numbers did not match, how do we find the duplicate records with the case numbers?

In [112]:
def countRows(data):
    return len(data)
calls.groupby("CASENO").agg({"EVENTDT": countRows}).sort_values("EVENTDT", ascending=False)

Unnamed: 0_level_0,EVENTDT
CASENO,Unnamed: 1_level_1
20057207,1
21019962,1
21019978,1
21019988,1
21020010,1
...,...
21008872,1
21008874,1
21008877,1
21008879,1
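
If the counts had not matched, the repeated case numbers could be pulled out directly. Here is a minimal sketch on a hypothetical table (`toy_calls`) in which one case number appears twice; it uses `value_counts` as a shorter alternative to the group-by above and `duplicated(keep=False)` to select every row in a repeated group.

```python
import pandas as pd

# Hypothetical table in which case number 21000002 appears twice.
toy_calls = pd.DataFrame({
    "CASENO": [21000001, 21000002, 21000002, 21000003],
    "OFFENSE": ["THEFT", "BURGLARY", "VANDALISM", "THEFT"],
})

# Number of rows per case number, largest first.
counts = toy_calls["CASENO"].value_counts()

# keep=False marks every member of a duplicated group, not just the repeats.
dupes = toy_calls[toy_calls["CASENO"].duplicated(keep=False)]
print(counts.head())
print(dupes)
```

Looking at the full duplicated rows side by side helps decide whether they are true duplicates or distinct calls tied to one case.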


---
<br/><br/><br/>

## Examining the Date

Let's dig into the date in which events were recorded.  Notice in this data we have several pieces of date/time information (this is not uncommon):
1. **`EVENTDT`**: This contains the date the event took place.  While it has time information the time appears to be `00:00:00`.  
1. **`EVENTTM`**: This contains the time at which the event took place.
1. **`InDbDate`**: This appears to be the date at which the data was entered in the database. 

In [113]:
calls.head(3)

Unnamed: 0,CASENO,OFFENSE,EVENTDT,EVENTTM,CVLEGEND,CVDOW,InDbDate,Block_Location,BLKADDR,City,State,Day,Lat,Lon
0,21014296,THEFT MISD. (UNDER $950),04/01/2021 12:00:00 AM,10:58,LARCENY,4,06/15/2021 12:00:00 AM,"Berkeley, CA\n(37.869058, -122.270455)",,Berkeley,CA,Thursday,37.869058,-122.270455
1,21014391,THEFT MISD. (UNDER $950),04/01/2021 12:00:00 AM,10:38,LARCENY,4,06/15/2021 12:00:00 AM,"Berkeley, CA\n(37.869058, -122.270455)",,Berkeley,CA,Thursday,37.869058,-122.270455
2,21090494,THEFT MISD. (UNDER $950),04/19/2021 12:00:00 AM,12:15,LARCENY,1,06/15/2021 12:00:00 AM,"2100 BLOCK HASTE ST\nBerkeley, CA\n(37.864908,...",2100 BLOCK HASTE ST,Berkeley,CA,Monday,37.864908,-122.267289


When Pandas loads more complex fields like dates it will often load them as strings:

In [114]:
calls["EVENTDT"][0]

'04/01/2021 12:00:00 AM'

We will want to convert these to dates.  Pandas has a function `pd.to_datetime` which is capable of guessing reasonable conversions of dates to date objects.  Here we pass an explicit format instead.  Note that the 12-hour directive `%I` (not `%H`) must be paired with `%p` for the AM/PM marker to take effect; with `%H`, `12:00:00 AM` would be read as noon rather than midnight.

In [115]:
dates = pd.to_datetime(calls["EVENTDT"], format='%m/%d/%Y %I:%M:%S %p')
dates

0      2021-04-01
1      2021-04-01
2      2021-04-19
3      2021-02-13
4      2021-02-08
          ...    
2627   2020-12-21
2628   2021-02-24
2629   2021-03-24
2630   2021-04-24
2631   2021-02-26
Name: EVENTDT, Length: 2632, dtype: datetime64[ns]

We can also extract the time field:

In [116]:
times = pd.to_datetime(calls["EVENTTM"], format='%H:%M').dt.time
times.head()

0    10:58:00
1    10:38:00
2    12:15:00
3    17:00:00
4    06:20:00
Name: EVENTTM, dtype: object

To combine each date with its corresponding time we use Python's built-in `datetime.combine` function, applied row by row.

In [117]:
from datetime import datetime
timestamps = pd.concat([dates, times], axis=1).apply(
    lambda r: datetime.combine(r['EVENTDT'], r['EVENTTM']), axis=1)
timestamps.head()

0   2021-04-01 10:58:00
1   2021-04-01 10:38:00
2   2021-04-19 12:15:00
3   2021-02-13 17:00:00
4   2021-02-08 06:20:00
dtype: datetime64[ns]
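
To illustrate what `datetime.combine` does for a single row, here is a minimal standalone sketch using the values from row 0. It relies on the documented behavior that when the first argument is a full `datetime`, only its date part is kept; the variable names are illustrative.

```python
from datetime import datetime, time

event_date = datetime(2021, 4, 1, 12, 0, 0)  # parsed EVENTDT for row 0
event_time = time(10, 58)                    # parsed EVENTTM for row 0

# combine() takes the date from the first argument and the time from the second
stamp = datetime.combine(event_date, event_time)
print(stamp)
```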

In [118]:
type(timestamps)

pandas.core.series.Series

We now update `calls` to contain this additional information:

In [119]:
calls['timestamp'] = timestamps
calls.head()

Unnamed: 0,CASENO,OFFENSE,EVENTDT,EVENTTM,CVLEGEND,CVDOW,InDbDate,Block_Location,BLKADDR,City,State,Day,Lat,Lon,timestamp
0,21014296,THEFT MISD. (UNDER $950),04/01/2021 12:00:00 AM,10:58,LARCENY,4,06/15/2021 12:00:00 AM,"Berkeley, CA\n(37.869058, -122.270455)",,Berkeley,CA,Thursday,37.869058,-122.270455,2021-04-01 10:58:00
1,21014391,THEFT MISD. (UNDER $950),04/01/2021 12:00:00 AM,10:38,LARCENY,4,06/15/2021 12:00:00 AM,"Berkeley, CA\n(37.869058, -122.270455)",,Berkeley,CA,Thursday,37.869058,-122.270455,2021-04-01 10:38:00
2,21090494,THEFT MISD. (UNDER $950),04/19/2021 12:00:00 AM,12:15,LARCENY,1,06/15/2021 12:00:00 AM,"2100 BLOCK HASTE ST\nBerkeley, CA\n(37.864908,...",2100 BLOCK HASTE ST,Berkeley,CA,Monday,37.864908,-122.267289,2021-04-19 12:15:00
3,21090204,THEFT FELONY (OVER $950),02/13/2021 12:00:00 AM,17:00,LARCENY,6,06/15/2021 12:00:00 AM,"2600 BLOCK WARRING ST\nBerkeley, CA\n(37.86393...",2600 BLOCK WARRING ST,Berkeley,CA,Saturday,37.863934,-122.250262,2021-02-13 17:00:00
4,21090179,BURGLARY AUTO,02/08/2021 12:00:00 AM,6:20,BURGLARY - VEHICLE,1,06/15/2021 12:00:00 AM,"2700 BLOCK GARBER ST\nBerkeley, CA\n(37.86066,...",2700 BLOCK GARBER ST,Berkeley,CA,Monday,37.86066,-122.253407,2021-02-08 06:20:00


### What time range does the data represent?

In [120]:
calls['timestamp'].min()

Timestamp('2020-12-17 07:33:00')

In [121]:
calls['timestamp'].max()

Timestamp('2021-06-10 21:21:00')

---
<br/><br/><br/>


### Are there any other interesting temporal patterns?

Do more calls occur on a particular day of the week?

In [122]:
calls.groupby('Day')['CASENO'].count()

Day
Friday       412
Monday       388
Saturday     375
Sunday       337
Thursday     394
Tuesday      369
Wednesday    357
Name: CASENO, dtype: int64

In [123]:
dow = ["Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"]

calls.groupby('Day')['CASENO'].count()[dow].iplot(kind='bar', yTitle="Count")
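
The `groupby` result is sorted alphabetically by day name, which is why the plot indexes it with `dow`. A rough pure-Python sketch of the same reordering, using the counts copied from the output above (the dictionary and variable names are illustrative):

```python
# Counts copied from the groupby output, in the alphabetical order pandas returns
counts = {
    "Friday": 412, "Monday": 388, "Saturday": 375, "Sunday": 337,
    "Thursday": 394, "Tuesday": 369, "Wednesday": 357,
}

dow = ["Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"]

# Look the counts up in calendar order so the bars read left to right through the week
ordered = [(day, counts[day]) for day in dow]
total = sum(n for _, n in ordered)

for day, n in ordered:
    print(f"{day:<10}{n}")
```

The counts sum to the number of rows in `calls`, a quick check that no call is missing a `Day` value.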

### How about temporal patterns within a day?

In [124]:
calls['timestamp'].dt.hour

0       10
1       10
2       12
3       17
4        6
        ..
2627    12
2628    15
2629     0
2630    18
2631     2
Name: timestamp, Length: 2632, dtype: int32

In [125]:
calls['timestamp'].dt.minute

0       58
1       38
2       15
3        0
4       20
        ..
2627    45
2628     6
2629     0
2630    35
2631     0
Name: timestamp, Length: 2632, dtype: int32

In [126]:
(calls['timestamp'].dt.hour * 60 + calls['timestamp'].dt.minute ) / 60

0       10.966667
1       10.633333
2       12.250000
3       17.000000
4        6.333333
          ...    
2627    12.750000
2628    15.100000
2629     0.000000
2630    18.583333
2631     2.000000
Name: timestamp, Length: 2632, dtype: float64

In [127]:
calls['hour_of_day'] = (
    calls['timestamp'].dt.hour * 60 + calls['timestamp'].dt.minute ) / 60.
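
The same conversion for a single timestamp can be sketched with the standard library; `fractional_hour` is a hypothetical helper name used only for this illustration.

```python
from datetime import datetime

def fractional_hour(ts):
    """Return the time of day as a fractional hour (minutes become a fraction)."""
    return (ts.hour * 60 + ts.minute) / 60

# Row 0 of the data: 10:58 is 10 hours plus 58/60 of an hour
row0 = fractional_hour(datetime(2021, 4, 1, 10, 58))
print(round(row0, 6))
```

Seconds are ignored here, which matches the data: `EVENTTM` only records hours and minutes.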

In [128]:


py.iplot(ff.create_distplot([calls['hour_of_day']],group_labels=["Hour"],bin_size=1, show_rug=False))

In the plot above we see the standard pattern of limited activity early in the morning, around 6:00 AM.

### Examining the Event

We also have data about the different kinds of crimes being reported:

In [129]:
calls.head()

Unnamed: 0,CASENO,OFFENSE,EVENTDT,EVENTTM,CVLEGEND,CVDOW,InDbDate,Block_Location,BLKADDR,City,State,Day,Lat,Lon,timestamp,hour_of_day
0,21014296,THEFT MISD. (UNDER $950),04/01/2021 12:00:00 AM,10:58,LARCENY,4,06/15/2021 12:00:00 AM,"Berkeley, CA\n(37.869058, -122.270455)",,Berkeley,CA,Thursday,37.869058,-122.270455,2021-04-01 10:58:00,10.966667
1,21014391,THEFT MISD. (UNDER $950),04/01/2021 12:00:00 AM,10:38,LARCENY,4,06/15/2021 12:00:00 AM,"Berkeley, CA\n(37.869058, -122.270455)",,Berkeley,CA,Thursday,37.869058,-122.270455,2021-04-01 10:38:00,10.633333
2,21090494,THEFT MISD. (UNDER $950),04/19/2021 12:00:00 AM,12:15,LARCENY,1,06/15/2021 12:00:00 AM,"2100 BLOCK HASTE ST\nBerkeley, CA\n(37.864908,...",2100 BLOCK HASTE ST,Berkeley,CA,Monday,37.864908,-122.267289,2021-04-19 12:15:00,12.25
3,21090204,THEFT FELONY (OVER $950),02/13/2021 12:00:00 AM,17:00,LARCENY,6,06/15/2021 12:00:00 AM,"2600 BLOCK WARRING ST\nBerkeley, CA\n(37.86393...",2600 BLOCK WARRING ST,Berkeley,CA,Saturday,37.863934,-122.250262,2021-02-13 17:00:00,17.0
4,21090179,BURGLARY AUTO,02/08/2021 12:00:00 AM,6:20,BURGLARY - VEHICLE,1,06/15/2021 12:00:00 AM,"2700 BLOCK GARBER ST\nBerkeley, CA\n(37.86066,...",2700 BLOCK GARBER ST,Berkeley,CA,Monday,37.86066,-122.253407,2021-02-08 06:20:00,6.333333


### The Offense Field

The `OFFENSE` field appears to contain the specific crime being reported.  Because this is nominal data, a natural summary is the count of each offense type:

In [130]:
calls['OFFENSE'].value_counts()

OFFENSE
THEFT MISD. (UNDER $950)    559
VEHICLE STOLEN              277
BURGLARY AUTO               218
THEFT FELONY (OVER $950)    215
DISTURBANCE                 204
BURGLARY RESIDENTIAL        178
VANDALISM                   166
THEFT FROM AUTO             163
ASSAULT/BATTERY MISD.       116
ROBBERY                      90
BURGLARY COMMERCIAL          86
DOMESTIC VIOLENCE            66
IDENTITY THEFT               52
FRAUD/FORGERY                41
ASSAULT/BATTERY FEL.         34
NARCOTICS                    33
BRANDISHING                  25
MISSING ADULT                25
ALCOHOL OFFENSE              20
ARSON                        18
GUN/WEAPON                   12
THEFT FROM PERSON             8
SEXUAL ASSAULT FEL.           7
SEXUAL ASSAULT MISD.          7
VEHICLE RECOVERED             6
2ND RESPONSE                  2
DISTURBANCE - NOISE           1
MISSING JUVENILE              1
KIDNAPPING                    1
MUNICIPAL CODE                1
Name: count, dtype: int64

In [131]:
calls['OFFENSE'].value_counts().iplot(kind="bar")

`THEFT MISD. (UNDER $950)` and `VEHICLE STOLEN` are the two most common offenses (559 and 277 calls), while many other offense types occur rarely.

---
<br/><br/><br/>

### CVLEGEND

The CVLEGEND field provides the broad category of crime and is a good mechanism to group potentially similar crimes. 

In [132]:
calls['CVLEGEND'].value_counts()

CVLEGEND
LARCENY                   782
MOTOR VEHICLE THEFT       277
BURGLARY - VEHICLE        218
DISORDERLY CONDUCT        204
BURGLARY - RESIDENTIAL    178
VANDALISM                 166
LARCENY - FROM VEHICLE    163
ASSAULT                   150
FRAUD                      93
ROBBERY                    90
BURGLARY - COMMERCIAL      86
FAMILY OFFENSE             66
WEAPONS OFFENSE            37
DRUG VIOLATION             33
MISSING PERSON             26
LIQUOR LAW VIOLATION       20
ARSON                      18
SEX CRIME                  14
RECOVERED VEHICLE           6
NOISE VIOLATION             3
KIDNAPPING                  1
ALL OTHER OFFENSES          1
Name: count, dtype: int64

In [133]:
calls['CVLEGEND'].value_counts().iplot(kind="bar")

**Larceny** is by far the most common category, with 782 calls.  Larceny is essentially stealing: taking someone else's property without force.
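
The `CVLEGEND` totals appear to be sums of related `OFFENSE` counts. The sketch below checks this for two categories using counts copied from the outputs above. The offense-to-category mapping is an assumption inferred from the totals, not something taken from the dataset documentation.

```python
# OFFENSE counts copied from the value_counts() output above
offense_counts = {
    "THEFT MISD. (UNDER $950)": 559,
    "THEFT FELONY (OVER $950)": 215,
    "THEFT FROM PERSON": 8,
    "ASSAULT/BATTERY MISD.": 116,
    "ASSAULT/BATTERY FEL.": 34,
}

# ASSUMPTION: mapping inferred from the totals, not from the data's documentation
assumed_category = {
    "THEFT MISD. (UNDER $950)": "LARCENY",
    "THEFT FELONY (OVER $950)": "LARCENY",
    "THEFT FROM PERSON": "LARCENY",
    "ASSAULT/BATTERY MISD.": "ASSAULT",
    "ASSAULT/BATTERY FEL.": "ASSAULT",
}

# Sum the offense counts within each assumed category
totals = {}
for offense, n in offense_counts.items():
    category = assumed_category[offense]
    totals[category] = totals.get(category, 0) + n

print(totals)
```

The sums match the `LARCENY` (782) and `ASSAULT` (150) rows of the `CVLEGEND` counts. In the notebook itself, something like `calls.groupby('CVLEGEND')['OFFENSE'].unique()` should show the actual mapping directly.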

### Stratified Analysis of Time of Day by CVLEGEND

Next we view the distribution of time of day broken down by crime category:

In [134]:
boxes = [(len(df), go.Violin(y=df["hour_of_day"], name=i)) for (i, df) in calls.groupby("CVLEGEND")]

In [135]:
#boxes

In [136]:
#A violin plot is a statistical representation of numerical data.
#It is similar to a box plot, 
#with the addition of a rotated kernel density plot on each side.


py.iplot([r[1] for r in sorted(boxes, key=lambda x:x[0], reverse=True)])

In [137]:
py.iplot(ff.create_distplot([
    calls[calls['CVLEGEND'] == "NOISE VIOLATION"]['hour_of_day'],
    calls[calls['CVLEGEND'] == "DRUG VIOLATION"]['hour_of_day'],
    calls[calls['CVLEGEND'] == "LIQUOR LAW VIOLATION"]['hour_of_day'],
    
],
    group_labels=["Noise Violation", "Drug Violation", "Liquor Violation"], 
    show_rug=False))

## Examining Location Information

Let's examine the geographic data (latitude and longitude). 

In [138]:
#!pip install folium

In [139]:
import folium
import folium.plugins # The Folium Javascript Map Library

# NOTE: despite the name, these coordinates are for Berkeley, not San Francisco
SF_COORDINATES = (37.87, -122.28)
sf_map = folium.Map(location=SF_COORDINATES, zoom_start=13)
locs = calls[['Lat', 'Lon']].astype('float').dropna().to_numpy()
heatmap = folium.plugins.HeatMap(locs.tolist(), radius = 10)
sf_map.add_child(heatmap)

### Questions

1. Why are all the calls located on streets, and often at intersections?


In [140]:
cluster = folium.plugins.MarkerCluster()
for _, r in calls[['Lat', 'Lon', 'CVLEGEND']].tail(1000).dropna().iterrows():
    cluster.add_child(
        folium.Marker([float(r["Lat"]), float(r["Lon"])], popup=r['CVLEGEND']))
    
sf_map = folium.Map(location=SF_COORDINATES, zoom_start=13)
sf_map.add_child(cluster)
sf_map

In [141]:
import math

def distancePt1Pt2(lat1, lon1, lat2, lon2):
  """Calculates the distance between two points in kilometers.

  Args:
    lat1: The latitude of the first point.
    lon1: The longitude of the first point.
    lat2: The latitude of the second point.
    lon2: The longitude of the second point.

  Returns:
    The distance between the two points in kilometers.
  """

  radius = 6371  # Earth radius in kilometers

  # Convert latitude and longitude to radians
  lat1 = math.radians(lat1)
  lon1 = math.radians(lon1)
  lat2 = math.radians(lat2)
  lon2 = math.radians(lon2)

  # Calculate the difference in latitude and longitude
  delta_lat = lat2 - lat1
  delta_lon = lon2 - lon1

  # Calculate the Haversine formula
  a = math.sin(delta_lat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(delta_lon / 2) ** 2
  c = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))

  # Calculate the distance in kilometers
  distance = radius * c

  return distance
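
For reference, the function above follows the standard haversine formula. Here $\varphi$ denotes latitude and $\lambda$ longitude, both in radians, and $R = 6371$ km is the Earth radius used in the code (the symbol names are chosen for this note and do not appear in the code):

```latex
a = \sin^2\!\left(\frac{\varphi_2 - \varphi_1}{2}\right)
    + \cos\varphi_1 \,\cos\varphi_2 \,\sin^2\!\left(\frac{\lambda_2 - \lambda_1}{2}\right),
\qquad
c = 2\,\operatorname{atan2}\!\left(\sqrt{a},\ \sqrt{1 - a}\right),
\qquad
d = R\,c
```

The formula treats the Earth as a perfect sphere, so distances are approximate, which is fine at the scale of a single city.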

In [142]:
def convert_kilometers_to_miles(kilometers):
  """Converts kilometers to miles.

  Args:
    kilometers: The number of kilometers to convert.

  Returns:
    The number of miles.
  """

  miles = kilometers * 0.621371
  return miles

In [143]:
cityCollegeLat = 37.869937383844466
cityCollegeLon = -122.26977547024649
distanceArr = []
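coords = calls[['Lat', 'Lon']].dropna()  # rows that actually have coordinates
for _, r in coords.iterrows():
    pointLat = float(r["Lat"])
    pointLon = float(r["Lon"])
    dist = distancePt1Pt2(pointLat, pointLon, cityCollegeLat, cityCollegeLon)
    mileDist = convert_kilometers_to_miles(dist)
    distanceArr.append(mileDist)
#distanceArr

In [144]:
# Reuse the index of the rows with coordinates so each distance lines up with its call
calls["dist"] = pd.Series(distanceArr, index=coords.index)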

In [145]:
calls

Unnamed: 0,CASENO,OFFENSE,EVENTDT,EVENTTM,CVLEGEND,CVDOW,InDbDate,Block_Location,BLKADDR,City,State,Day,Lat,Lon,timestamp,hour_of_day,dist
0,21014296,THEFT MISD. (UNDER $950),04/01/2021 12:00:00 AM,10:58,LARCENY,4,06/15/2021 12:00:00 AM,"Berkeley, CA\n(37.869058, -122.270455)",,Berkeley,CA,Thursday,37.869058,-122.270455,2021-04-01 10:58:00,10.966667,0.071172
1,21014391,THEFT MISD. (UNDER $950),04/01/2021 12:00:00 AM,10:38,LARCENY,4,06/15/2021 12:00:00 AM,"Berkeley, CA\n(37.869058, -122.270455)",,Berkeley,CA,Thursday,37.869058,-122.270455,2021-04-01 10:38:00,10.633333,0.071172
2,21090494,THEFT MISD. (UNDER $950),04/19/2021 12:00:00 AM,12:15,LARCENY,1,06/15/2021 12:00:00 AM,"2100 BLOCK HASTE ST\nBerkeley, CA\n(37.864908,...",2100 BLOCK HASTE ST,Berkeley,CA,Monday,37.864908,-122.267289,2021-04-19 12:15:00,12.250000,0.373025
3,21090204,THEFT FELONY (OVER $950),02/13/2021 12:00:00 AM,17:00,LARCENY,6,06/15/2021 12:00:00 AM,"2600 BLOCK WARRING ST\nBerkeley, CA\n(37.86393...",2600 BLOCK WARRING ST,Berkeley,CA,Saturday,37.863934,-122.250262,2021-02-13 17:00:00,17.000000,1.142330
4,21090179,BURGLARY AUTO,02/08/2021 12:00:00 AM,6:20,BURGLARY - VEHICLE,1,06/15/2021 12:00:00 AM,"2700 BLOCK GARBER ST\nBerkeley, CA\n(37.86066,...",2700 BLOCK GARBER ST,Berkeley,CA,Monday,37.86066,-122.253407,2021-02-08 06:20:00,6.333333,1.099111
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2627,20058742,BURGLARY RESIDENTIAL,12/21/2020 12:00:00 AM,12:45,BURGLARY - RESIDENTIAL,1,06/15/2021 12:00:00 AM,"1300 BLOCK UNIVERSITY AVE\nBerkeley, CA\n(37.8...",1300 BLOCK UNIVERSITY AVE,Berkeley,CA,Monday,37.869764,-122.28655,2020-12-21 12:45:00,12.750000,0.915007
2628,21008017,BRANDISHING,02/24/2021 12:00:00 AM,15:06,WEAPONS OFFENSE,3,06/15/2021 12:00:00 AM,"100 BLOCK SEAWALL DR\nBerkeley, CA\n(37.863611...",100 BLOCK SEAWALL DR,Berkeley,CA,Wednesday,37.863611,-122.317566,2021-02-24 15:06:00,15.100000,2.643130
2629,21013239,THEFT FELONY (OVER $950),03/24/2021 12:00:00 AM,0:00,LARCENY,3,06/15/2021 12:00:00 AM,"2800 BLOCK HILLEGASS AVE\nBerkeley, CA\n(37.85...",2800 BLOCK HILLEGASS AVE,Berkeley,CA,Wednesday,37.85968,-122.255796,2021-03-24 00:00:00,0.000000,1.041025
2630,21018143,THEFT MISD. (UNDER $950),04/24/2021 12:00:00 AM,18:35,LARCENY,6,06/15/2021 12:00:00 AM,"2500 BLOCK TELEGRAPH AVE\nBerkeley, CA\n(37.86...",2500 BLOCK TELEGRAPH AVE,Berkeley,CA,Saturday,37.864827,-122.258577,2021-04-24 18:35:00,18.583333,0.705529


In [146]:
uclawLat = 37.86948264624683
uclawLon = -122.25402049161627

In [147]:
cityCollegeToLaw = distancePt1Pt2(cityCollegeLat, cityCollegeLon, uclawLat, uclawLon)

In [148]:
convert_kilometers_to_miles(cityCollegeToLaw)

0.8598957272406743
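
One way to sanity-check a distance function like this is to try it on inputs with a known answer. The sketch below re-implements the same formula under the hypothetical name `haversine_km` so that it runs on its own, then checks two simple cases: identical points, and two points one degree of latitude apart.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2, radius=6371):
    """Great-circle distance in kilometers, assuming a spherical Earth."""
    lat1, lon1, lat2, lon2 = map(math.radians, (lat1, lon1, lat2, lon2))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return radius * 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))

# Identical points should be zero distance apart
same_point = haversine_km(37.87, -122.27, 37.87, -122.27)

# One degree of latitude should be radius * pi / 180, roughly 111.19 km
one_degree = haversine_km(0.0, 0.0, 1.0, 0.0)

print(same_point, round(one_degree, 2))
```

With these checks passing, a result of under a mile between two nearby Berkeley locations is at least the right order of magnitude.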