In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
%matplotlib inline

## Text Processing and Dates

In [None]:
!cat log.txt

In [None]:
# Use a context manager so the file handle is closed after reading.
with open('log.txt') as f:
    lines = f.readlines()
first = lines[0]
first

String manipulation by splitting on delimiter characters.

In [None]:
time_str = first.split('[', 1)[1].split(' ', 1)[0]
day, month, rest = time_str.split('/')
year, hour, minute, second = rest.split(':')
year, month, day, hour, minute, second

In [None]:
# Pass the split limit as the keyword n=1; recent pandas versions reject it positionally.
time_strs = (pd.Series(lines).str.split('[', n=1, expand=True)[1]
             .str.split(' ', n=1, expand=True)[0])
day_month_rest = time_strs.str.split('/', expand=True)
pd.concat([day_month_rest.loc[:, 0:1],
           day_month_rest[2].str.split(':', expand=True)], axis=1)

String manipulation based on regular expressions.

In [None]:
import re
pattern = r'(\d+)/(\w+)/(\d+):(\d+):(\d+):(\d+)'
day, month, year, hour, minute, second = re.search(pattern, first).groups()
year, month, day, hour, minute, second

In [None]:
pd.Series(lines).str.extract(pattern)
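
As a sketch of the same idea, named groups let `str.extract` label the output columns directly. The sample line below is made up for illustration; it only assumes the timestamp layout that the pattern above targets, and `named_pattern` is a name introduced here.

```python
import pandas as pd

# Hypothetical log line; only the bracketed timestamp format matters here.
sample = ['169.237.46.168 - - [26/Jan/2014:10:47:58 -0800] '
          '"GET /index.html HTTP/1.1" 200 2585']

# Named groups (?P<name>...) become the column names of the result.
named_pattern = (r'(?P<day>\d+)/(?P<month>\w+)/(?P<year>\d+)'
                 r':(?P<hour>\d+):(?P<minute>\d+):(?P<second>\d+)')
parts = pd.Series(sample).str.extract(named_pattern)
print(parts)
```

All extracted values are still strings at this point, so numeric work would need an explicit conversion.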

Date parsing using the `datetime` module.

In [None]:
from datetime import datetime
datetime.strptime(time_str, '%d/%b/%Y:%H:%M:%S')

In [None]:
pd.Series(lines).str.extract(r'\[(.*) -0800\]')[0].apply(
    lambda s: datetime.strptime(s, '%d/%b/%Y:%H:%M:%S'))
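
One possible alternative, sketched on invented timestamps, is to let `pd.to_datetime` parse the whole column at once. Including `%z` in the format consumes the UTC offset instead of hard-coding `-0800` in a regular expression. The sample values are assumptions, not lines from `log.txt`.

```python
import pandas as pd

# Hypothetical timestamps in the same layout as the bracketed log field.
stamps = pd.Series(['26/Jan/2014:10:47:58 -0800',
                    '27/Jan/2014:09:00:00 -0800'])

# %z parses the UTC offset, so the result is timezone-aware.
parsed = pd.to_datetime(stamps, format='%d/%b/%Y:%H:%M:%S %z')
print(parsed)
print(parsed.dt.hour.tolist())
```

The `.dt` accessor then gives vectorized access to fields such as the hour or the day of the week.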

## Text Processing Case Study

In this example, we will use string processing for data cleaning and exploratory data analysis.

### Getting the Data

The city of Berkeley maintains an [Open Data Portal](https://data.cityofberkeley.info/) for citizens to access data about the city.  We will be examining [Call Data](https://data.cityofberkeley.info/Public-Safety/Berkeley-PD-Calls-for-Service/k2nh-s5h5).

<img src="calls_desc.png" width=800px />



In [None]:
import ds100_utils

calls_url = 'https://data.cityofberkeley.info/api/views/k2nh-s5h5/rows.csv?accessType=DOWNLOAD'
calls_file = ds100_utils.fetch_and_cache(calls_url, 'calls.csv')
# on_bad_lines='warn' replaces the removed warn_bad_lines flag (requires pandas >= 1.3).
calls = pd.read_csv(calls_file, on_bad_lines='warn')
calls.head()

How many records did we get?

In [None]:
len(calls)

What does an example `Block_Location` value look like?

In [None]:
print(calls['Block_Location'].iloc[0])

### Preliminary observations on the data?

1. `EVENTDT` -- Contains the incorrect time
1. `EVENTTM` -- Contains the time in 24 hour format (What timezone?)
1. `CVDOW` -- Encodes the day of the week (see data documentation).
1. `InDbDate` -- Appears to be correctly formatted and appears pretty consistent in time.
1. **`Block_Location` -- a multi-line string that contains coordinates.**
1. `BLKADDR` -- Appears to be the address in `Block_Location`.
1. `City` and `State` seem redundant given this is supposed to be the city of Berkeley dataset.

### Extracting locations

The block location contains geographic coordinates. Let's extract them.

In [None]:
calls['Block_Location'][0]

In [None]:
calls_lat_lon = (
    calls['Block_Location']
    # Raw string so the backslashes reach the regex engine unchanged.
    .str.extract(r"\((\d+\.\d+), (-\d+\.\d+)\)")
)
calls_lat_lon.columns = ['Lat', 'Lon']
calls_lat_lon.head(10)
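
To see how the extraction behaves when coordinates are absent, here is a minimal sketch on made-up strings that imitate the multi-line `Block_Location` layout. The addresses and coordinates are invented, and the named groups are a convenience rather than part of the original code.

```python
import pandas as pd

# Made-up values imitating the multi-line Block_Location layout.
blocks = pd.Series([
    "2500 LE CONTE AVE\nBerkeley, CA\n(37.876965, -122.260544)",
    "OREGON STREET & MCGEE AVE\nBerkeley, CA",  # no coordinates
])

# Named groups set the column names; rows without a match become NaN.
coords = blocks.str.extract(r"\((?P<Lat>\d+\.\d+), (?P<Lon>-\d+\.\d+)\)")

# Extracted groups are strings, so convert them before doing arithmetic.
coords = coords.astype(float)
print(coords)
```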

How many records have missing values?

In [None]:
calls_lat_lon.isnull().sum()

Examine the missing values.

In [None]:
calls[calls_lat_lon.isnull().any(axis=1)]['Block_Location'].head(10)

Join in the extracted values.

In [None]:
if 'Lat' not in calls.columns:
    calls = calls.merge(calls_lat_lon, left_index=True, right_index=True)
calls.head()
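
Because both tables share the same index, `DataFrame.join` is an equivalent and slightly shorter way to attach the extracted columns. The sketch below uses small stand-in frames with invented values rather than the real `calls` data.

```python
import pandas as pd

# Stand-ins for `calls` and `calls_lat_lon`; values are invented.
left = pd.DataFrame({'CASENO': [1, 2, 3]})
right = pd.DataFrame({'Lat': ['37.87', None, '37.86'],
                      'Lon': ['-122.27', None, '-122.26']})

# join aligns on the index and performs a left join by default,
# so rows with missing coordinates are kept.
joined = left.join(right)
print(joined)
```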

## Examining Location information

Let's examine the geographic data (latitude and longitude).  Recall that we had some missing values.  Let's look at the behavior of these missing values according to crime type.

In [None]:
missing_lat_lon = calls[calls[['Lat', 'Lon']].isnull().any(axis=1)]
missing_lat_lon['CVLEGEND'].value_counts().plot(kind='barh');

In [None]:
calls['CVLEGEND'].value_counts().plot(kind='barh');

### Observations?

There is a clear bias towards drug violations that is not present in the original data.  Therefore we should be careful when dropping missing values!

We might further normalize the analysis by the frequency to find which type of crime has the highest proportion of missing values.

In [None]:
(missing_lat_lon['CVLEGEND'].value_counts() 
 / calls['CVLEGEND'].value_counts()
).sort_values(ascending=False).plot(kind="barh");
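
The same proportion can also be computed by grouping a boolean "is missing" flag by crime type. This is a sketch on a toy table with invented values; `toy` and `missing_rate` are names introduced here.

```python
import pandas as pd

# Toy stand-in for `calls`; the values are invented for illustration.
toy = pd.DataFrame({
    'CVLEGEND': ['DRUG VIOLATION', 'DRUG VIOLATION',
                 'ASSAULT', 'ASSAULT', 'ASSAULT', 'ASSAULT'],
    'Lat': [None, '37.87', '37.86', None, '37.88', '37.85'],
})

# The mean of a boolean flag is the fraction of rows that are missing.
missing_rate = (toy['Lat'].isnull()
                .groupby(toy['CVLEGEND'])
                .mean()
                .sort_values(ascending=False))
print(missing_rate)
```

Unlike dividing two `value_counts` results, this approach reports 0 rather than `NaN` for categories that have no missing values.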

Now, let's make a crime map.

In [None]:
import folium
import folium.plugins

# Despite the name, these coordinates are in Berkeley, not San Francisco.
SF_COORDINATES = (37.87, -122.28)
sf_map = folium.Map(location=SF_COORDINATES, zoom_start=13)
locs = calls[['Lat', 'Lon']].astype('float').dropna().values
heatmap = folium.plugins.HeatMap(locs.tolist(), radius=10)
sf_map.add_child(heatmap)

### Questions

1. Is campus really the safest place to be?
1. Why are all the calls located on the street and often at intersections?


In [None]:
locations = calls[calls['CVLEGEND'] == 'ASSAULT'][['Lat', 'Lon']]

# MarkerCluster lives in folium.plugins, which was imported above.
cluster = folium.plugins.MarkerCluster()
for _, r in locations.dropna().iterrows():
    cluster.add_child(
        folium.Marker([float(r["Lat"]), float(r["Lon"])]))
    
sf_map = folium.Map(location=SF_COORDINATES, zoom_start=13)
sf_map.add_child(cluster)
sf_map