# DATA CLEANING + EDA (Pandas cont.)

Created By: Angelica Rojas

In [1]:
import pandas as pd
import re

## Upload Data

The data for this notebook can be found at this link: https://data.cityofberkeley.info/Public-Safety/Berkeley-PD-Calls-for-Service/k2nh-s5h5

For the purpose of this lesson we will use the CSV file of the data.

In [2]:
df = pd.read_csv("BerkeleyPD_Calls_for_Service.csv")

#what does this do?
df.head()

Unnamed: 0,CASENO,OFFENSE,EVENTDT,EVENTTM,CVLEGEND,CVDOW,InDbDate,Block_Location,BLKADDR,City,State
0,17034187,BURGLARY AUTO,06/14/2017 12:00:00 AM,15:15,BURGLARY - VEHICLE,3,09/25/2017 03:30:15 AM,"ALLSTON WAY &amp; SHATTUCK AVE\nBerkeley, CA\n...",ALLSTON WAY & SHATTUCK AVE,Berkeley,CA
1,17052235,GUN/WEAPON,09/01/2017 12:00:00 AM,22:56,WEAPONS OFFENSE,5,09/25/2017 03:30:18 AM,"UNIVERSITY AVENUE &amp; MILVIA ST\nBerkeley, C...",UNIVERSITY AVENUE & MILVIA ST,Berkeley,CA
2,17091126,THEFT MISD. (UNDER $950),06/10/2017 12:00:00 AM,10:45,LARCENY,6,09/25/2017 03:30:15 AM,"2500 SHATTUCK AVE\nBerkeley, CA\n(37.863811, -...",2500 SHATTUCK AVE,Berkeley,CA
3,17018444,BURGLARY AUTO,04/02/2017 12:00:00 AM,19:30,BURGLARY - VEHICLE,0,09/25/2017 03:30:11 AM,"DURANT AVENUE &amp; ELLSWORTH ST\nBerkeley, CA...",DURANT AVENUE & ELLSWORTH ST,Berkeley,CA
4,17033328,NARCOTICS,06/10/2017 12:00:00 AM,14:30,DRUG VIOLATION,6,09/25/2017 03:30:14 AM,"MILVIA STREET &amp; UNIVERSITY AVE\nBerkeley, ...",MILVIA STREET & UNIVERSITY AVE,Berkeley,CA


Why did we only want to display the first 5 rows of the dataframe?

What if we wanted to see the size of this dataframe?

In [3]:
# number of rows
len(df.index)

5617

In [4]:
# shape of df (rows, columns)

df.shape

(5617, 11)

# Part 1: DATA CLEANING

## Column Names

What do all these column names even mean? 

On that same website, BPD offers a narrative PDF file that describes the data they provide: https://data.cityofberkeley.info/Public-Safety/Berkeley-PD-Calls-for-Service/k2nh-s5h5


<img src = "DF_col_desc.png">

## Change Column Name(s)

Why would we want to change the column names?

In [5]:
#df = df.rename(columns={'ORIG_COL_NAME': 'NEW_COL_NAME'})
df = df.rename(columns={'CVLEGEND': 'EVENTDESC', 'CVDOW':"D.O.W."})
df.head()

Unnamed: 0,CASENO,OFFENSE,EVENTDT,EVENTTM,EVENTDESC,D.O.W.,InDbDate,Block_Location,BLKADDR,City,State
0,17034187,BURGLARY AUTO,06/14/2017 12:00:00 AM,15:15,BURGLARY - VEHICLE,3,09/25/2017 03:30:15 AM,"ALLSTON WAY &amp; SHATTUCK AVE\nBerkeley, CA\n...",ALLSTON WAY & SHATTUCK AVE,Berkeley,CA
1,17052235,GUN/WEAPON,09/01/2017 12:00:00 AM,22:56,WEAPONS OFFENSE,5,09/25/2017 03:30:18 AM,"UNIVERSITY AVENUE &amp; MILVIA ST\nBerkeley, C...",UNIVERSITY AVENUE & MILVIA ST,Berkeley,CA
2,17091126,THEFT MISD. (UNDER $950),06/10/2017 12:00:00 AM,10:45,LARCENY,6,09/25/2017 03:30:15 AM,"2500 SHATTUCK AVE\nBerkeley, CA\n(37.863811, -...",2500 SHATTUCK AVE,Berkeley,CA
3,17018444,BURGLARY AUTO,04/02/2017 12:00:00 AM,19:30,BURGLARY - VEHICLE,0,09/25/2017 03:30:11 AM,"DURANT AVENUE &amp; ELLSWORTH ST\nBerkeley, CA...",DURANT AVENUE & ELLSWORTH ST,Berkeley,CA
4,17033328,NARCOTICS,06/10/2017 12:00:00 AM,14:30,DRUG VIOLATION,6,09/25/2017 03:30:14 AM,"MILVIA STREET &amp; UNIVERSITY AVE\nBerkeley, ...",MILVIA STREET & UNIVERSITY AVE,Berkeley,CA


## Investigating Columns

What is the difference between the "Block_Location" and "BLKADDR" columns in the dataframe? In the DF displayed above they look almost the same.

In [6]:
#Let's look at the first value in "Block_Location"
df["Block_Location"][0]

'ALLSTON WAY &amp; SHATTUCK AVE\nBerkeley, CA\n(37.869363, -122.268028)'

In [7]:
#Let's look at the first value in "BLKADDR"
df["BLKADDR"][0]

'ALLSTON WAY & SHATTUCK AVE'

## Create New Columns

What new information does "Block_Location" contain that we can actually use and save?

Let's create new columns for the information we extracted from those values.

In [8]:
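#split each value on parentheses and keep the last non-empty piece with whitespace removed,
#e.g. "37.869363,-122.268028"; a value with no "(lat, long)" part keeps its whole text instead
coordinates = [["".join(x.split()) for x in re.split(r'[()]',i) if x.strip()][-1] for i in df["Block_Location"]]

#new values: split on the comma; latitude is the first piece, longitude is the last
longitude =[["".join(x.split()) for x in re.split(r'[,]',i) if x.strip()][-1] for i in coordinates]
latitude = [["".join(x.split()) for x in re.split(r'[,]',i) if x.strip()][0] for i in coordinates]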

#create new columns for latitude and longitude
df["LATITUDE"] = latitude
df["LONGITUDE"] = longitude

#Check if it worked
df.head()



Unnamed: 0,CASENO,OFFENSE,EVENTDT,EVENTTM,EVENTDESC,D.O.W.,InDbDate,Block_Location,BLKADDR,City,State,LATITUDE,LONGITUDE
0,17034187,BURGLARY AUTO,06/14/2017 12:00:00 AM,15:15,BURGLARY - VEHICLE,3,09/25/2017 03:30:15 AM,"ALLSTON WAY &amp; SHATTUCK AVE\nBerkeley, CA\n...",ALLSTON WAY & SHATTUCK AVE,Berkeley,CA,37.869363,-122.268028
1,17052235,GUN/WEAPON,09/01/2017 12:00:00 AM,22:56,WEAPONS OFFENSE,5,09/25/2017 03:30:18 AM,"UNIVERSITY AVENUE &amp; MILVIA ST\nBerkeley, C...",UNIVERSITY AVENUE & MILVIA ST,Berkeley,CA,37.871884,-122.270752
2,17091126,THEFT MISD. (UNDER $950),06/10/2017 12:00:00 AM,10:45,LARCENY,6,09/25/2017 03:30:15 AM,"2500 SHATTUCK AVE\nBerkeley, CA\n(37.863811, -...",2500 SHATTUCK AVE,Berkeley,CA,37.863811,-122.267412
3,17018444,BURGLARY AUTO,04/02/2017 12:00:00 AM,19:30,BURGLARY - VEHICLE,0,09/25/2017 03:30:11 AM,"DURANT AVENUE &amp; ELLSWORTH ST\nBerkeley, CA...",DURANT AVENUE & ELLSWORTH ST,Berkeley,CA,37.867221,-122.263531
4,17033328,NARCOTICS,06/10/2017 12:00:00 AM,14:30,DRUG VIOLATION,6,09/25/2017 03:30:14 AM,"MILVIA STREET &amp; UNIVERSITY AVE\nBerkeley, ...",MILVIA STREET & UNIVERSITY AVE,Berkeley,CA,37.871884,-122.270752
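
The nested list comprehensions above can be hard to read. As a rough alternative sketch (not the notebook's own method), `Series.str.extract` can pull both numbers out with one regular expression. The two-row `sample` frame and the `pattern` are assumptions made for illustration; rows with no coordinate pair come back as NaN instead of leftover text.

```python
import pandas as pd

# Hypothetical two-row sample mimicking the "Block_Location" format
sample = pd.DataFrame({
    "Block_Location": [
        "ALLSTON WAY &amp; SHATTUCK AVE\nBerkeley, CA\n(37.869363, -122.268028)",
        "Berkeley, CA",  # assumed case: no coordinates present
    ]
})

# Capture "(lat, long)" as two named groups; rows that do not match become NaN
pattern = r"\((?P<LATITUDE>-?\d+\.?\d*),\s*(?P<LONGITUDE>-?\d+\.?\d*)\)"
coords = sample["Block_Location"].str.extract(pattern).astype(float)

print(coords)
```

Because the result is already numeric, a later `pd.to_numeric` step would not be needed with this approach.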


In [9]:
df.LATITUDE.unique()

array(['37.869363', '37.871884', '37.863811', ..., '37.877247',
       '37.874537', '37.861541'], dtype=object)

In [10]:
df.LATITUDE
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5617 entries, 0 to 5616
Data columns (total 13 columns):
CASENO            5617 non-null int64
OFFENSE           5617 non-null object
EVENTDT           5617 non-null object
EVENTTM           5617 non-null object
EVENTDESC         5617 non-null object
D.O.W.            5617 non-null int64
InDbDate          5617 non-null object
Block_Location    5617 non-null object
BLKADDR           5590 non-null object
City              5617 non-null object
State             5617 non-null object
LATITUDE          5617 non-null object
LONGITUDE         5617 non-null object
dtypes: int64(2), object(11)
memory usage: 570.5+ KB


In [12]:
df2 = df[df['LATITUDE'].str.startswith('37')]

In [13]:
df2.LATITUDE

0       37.869363
1       37.871884
2       37.863811
3       37.867221
4       37.871884
5       37.868706
6       37.865849
7       37.880266
8       37.882457
9       37.856195
10      37.879708
11      37.859557
12      37.868352
13      37.821533
14      37.892152
15      37.878644
16       37.87091
17      37.887498
18      37.861672
19      37.858214
20      37.853275
22       37.86939
23      37.858165
24      37.859557
25       37.88015
26      37.880262
27      37.858628
28      37.880227
29      37.867852
30      37.861604
          ...    
5585    37.855435
5586    37.881141
5587    37.874489
5588     37.87247
5589    37.857409
5590    37.863072
5591    37.866426
5592    37.875531
5593     37.84827
5594      37.8704
5595    37.867212
5596    37.863839
5597    37.870417
5598    37.869921
5599    37.857099
5600    37.865711
5601    37.870289
5602    37.888784
5603    37.864826
5604    37.864535
5605    37.857674
5607    37.868512
5608    37.859309
5610    37.865816
5611    37

In [14]:
df.LONGITUDE.unique()

array(['-122.268028', '-122.270752', '-122.267412', ..., '-122.27708',
       '-122.263862', '-122.251156'], dtype=object)

In [15]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5617 entries, 0 to 5616
Data columns (total 13 columns):
CASENO            5617 non-null int64
OFFENSE           5617 non-null object
EVENTDT           5617 non-null object
EVENTTM           5617 non-null object
EVENTDESC         5617 non-null object
D.O.W.            5617 non-null int64
InDbDate          5617 non-null object
Block_Location    5617 non-null object
BLKADDR           5590 non-null object
City              5617 non-null object
State             5617 non-null object
LATITUDE          5617 non-null object
LONGITUDE         5617 non-null object
dtypes: int64(2), object(11)
memory usage: 570.5+ KB


In [16]:
df2 = df[df['LONGITUDE'].str.startswith('-122')]
df2.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5337 entries, 0 to 5616
Data columns (total 13 columns):
CASENO            5337 non-null int64
OFFENSE           5337 non-null object
EVENTDT           5337 non-null object
EVENTTM           5337 non-null object
EVENTDESC         5337 non-null object
D.O.W.            5337 non-null int64
InDbDate          5337 non-null object
Block_Location    5337 non-null object
BLKADDR           5310 non-null object
City              5337 non-null object
State             5337 non-null object
LATITUDE          5337 non-null object
LONGITUDE         5337 non-null object
dtypes: int64(2), object(11)
memory usage: 583.7+ KB


In [17]:
df3 = df[df['LONGITUDE'].str.startswith('-122') & df['LATITUDE'].str.startswith('37')]


In [18]:
df3['LATITUDE'].nunique()

1383

In [19]:
len(df3[['LONGITUDE', 'LATITUDE']].index)

5337
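
Filtering with `str.startswith` works here because both columns still hold strings. A hedged alternative sketch: convert with `pd.to_numeric(..., errors="coerce")`, which turns anything unparseable into NaN, and then keep only rows inside a bounding box. The `toy` values and the box limits are assumptions for illustration, not taken from the dataset documentation.

```python
import pandas as pd

# Made-up rows: the middle one stands in for a value with no coordinates
toy = pd.DataFrame({
    "LATITUDE": ["37.869363", "Berkeley", "37.871884"],
    "LONGITUDE": ["-122.268028", "CA", "-122.270752"],
})

lat = pd.to_numeric(toy["LATITUDE"], errors="coerce")
lon = pd.to_numeric(toy["LONGITUDE"], errors="coerce")

# Assumed rough bounding box; NaN never falls inside a range, so bad rows drop out
valid = lat.between(37.0, 38.0) & lon.between(-123.0, -122.0)
clean = toy[valid]

print(len(clean))
```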

## Drop Columns

We got all the information we needed from "Block_Location", so keeping it would only take up extra room in our dataframe.

Let's drop the "Block_Location" from the dataframe.

In [20]:
#df = df.drop("COL_NAME", axis = 1)
df = df.drop("Block_Location", axis = 1)
#Check if it dropped
df.head()

Unnamed: 0,CASENO,OFFENSE,EVENTDT,EVENTTM,EVENTDESC,D.O.W.,InDbDate,BLKADDR,City,State,LATITUDE,LONGITUDE
0,17034187,BURGLARY AUTO,06/14/2017 12:00:00 AM,15:15,BURGLARY - VEHICLE,3,09/25/2017 03:30:15 AM,ALLSTON WAY & SHATTUCK AVE,Berkeley,CA,37.869363,-122.268028
1,17052235,GUN/WEAPON,09/01/2017 12:00:00 AM,22:56,WEAPONS OFFENSE,5,09/25/2017 03:30:18 AM,UNIVERSITY AVENUE & MILVIA ST,Berkeley,CA,37.871884,-122.270752
2,17091126,THEFT MISD. (UNDER $950),06/10/2017 12:00:00 AM,10:45,LARCENY,6,09/25/2017 03:30:15 AM,2500 SHATTUCK AVE,Berkeley,CA,37.863811,-122.267412
3,17018444,BURGLARY AUTO,04/02/2017 12:00:00 AM,19:30,BURGLARY - VEHICLE,0,09/25/2017 03:30:11 AM,DURANT AVENUE & ELLSWORTH ST,Berkeley,CA,37.867221,-122.263531
4,17033328,NARCOTICS,06/10/2017 12:00:00 AM,14:30,DRUG VIOLATION,6,09/25/2017 03:30:14 AM,MILVIA STREET & UNIVERSITY AVE,Berkeley,CA,37.871884,-122.270752


We can drop other columns that we do not think would add useful information to our analysis. 

Although we got this data from the Berkeley PD, let's make sure all values in "City" are "Berkeley". Also, let's make sure the "State" is "CA" for all values.

In [21]:
df.City.unique()

array(['Berkeley'], dtype=object)

In [22]:
df['State'].unique()

array(['CA'], dtype=object)

We checked the unique values in the "State" and "City" columns and they are what we expected. Since every row holds the same value, we do not need those columns anymore.

Drop the columns listed above.

In [23]:
#drop City and State columns
#df = ...
df = df.drop(["City", "State"], axis = 1)
#Check if they dropped
df.head()

Unnamed: 0,CASENO,OFFENSE,EVENTDT,EVENTTM,EVENTDESC,D.O.W.,InDbDate,BLKADDR,LATITUDE,LONGITUDE
0,17034187,BURGLARY AUTO,06/14/2017 12:00:00 AM,15:15,BURGLARY - VEHICLE,3,09/25/2017 03:30:15 AM,ALLSTON WAY & SHATTUCK AVE,37.869363,-122.268028
1,17052235,GUN/WEAPON,09/01/2017 12:00:00 AM,22:56,WEAPONS OFFENSE,5,09/25/2017 03:30:18 AM,UNIVERSITY AVENUE & MILVIA ST,37.871884,-122.270752
2,17091126,THEFT MISD. (UNDER $950),06/10/2017 12:00:00 AM,10:45,LARCENY,6,09/25/2017 03:30:15 AM,2500 SHATTUCK AVE,37.863811,-122.267412
3,17018444,BURGLARY AUTO,04/02/2017 12:00:00 AM,19:30,BURGLARY - VEHICLE,0,09/25/2017 03:30:11 AM,DURANT AVENUE & ELLSWORTH ST,37.867221,-122.263531
4,17033328,NARCOTICS,06/10/2017 12:00:00 AM,14:30,DRUG VIOLATION,6,09/25/2017 03:30:14 AM,MILVIA STREET & UNIVERSITY AVE,37.871884,-122.270752


## Dealing With Null Values (NaN)

<img src = "null_def.png">

REFERENCE: https://pandas.pydata.org/pandas-docs/stable/missing_data.html

This is a big data set and we can't look through each value one at a time. How can we make sure that there is a value for each category?

In [24]:
df.isnull().sum()

CASENO        0
OFFENSE       0
EVENTDT       0
EVENTTM       0
EVENTDESC     0
D.O.W.        0
InDbDate      0
BLKADDR      27
LATITUDE      0
LONGITUDE     0
dtype: int64

Let's look at the rows where "BLKADDR" is a null value. Let's make a temporary sub-dataframe.

In [25]:
null_temp = df[pd.isnull(df['BLKADDR'])]
null_temp

Unnamed: 0,CASENO,OFFENSE,EVENTDT,EVENTTM,EVENTDESC,D.O.W.,InDbDate,BLKADDR,LATITUDE,LONGITUDE
53,17036936,DISTURBANCE,06/26/2017 12:00:00 AM,18:24,DISORDERLY CONDUCT,1,09/25/2017 03:30:16 AM,,37.869058,-122.270455
104,17090713,THEFT FELONY (OVER $950),04/09/2017 12:00:00 AM,04:15,LARCENY,0,09/25/2017 03:30:12 AM,,37.869058,-122.270455
224,17024641,BURGLARY AUTO,05/01/2017 12:00:00 AM,21:00,BURGLARY - VEHICLE,1,09/25/2017 03:30:12 AM,,37.869058,-122.270455
235,17046547,VEHICLE STOLEN,08/08/2017 12:00:00 AM,17:00,MOTOR VEHICLE THEFT,2,09/25/2017 03:30:18 AM,,37.869058,-122.270455
291,17053694,THEFT MISD. (UNDER $950),09/07/2017 12:00:00 AM,17:43,LARCENY,4,09/25/2017 03:30:19 AM,,37.869058,-122.270455
475,17022572,VEHICLE STOLEN,04/22/2017 12:00:00 AM,21:00,MOTOR VEHICLE THEFT,6,09/25/2017 03:30:12 AM,,37.869058,-122.270455
534,17026854,BURGLARY RESIDENTIAL,05/12/2017 12:00:00 AM,09:00,BURGLARY - RESIDENTIAL,5,09/25/2017 03:30:12 AM,,37.869058,-122.270455
1228,17091147,BURGLARY AUTO,06/14/2017 12:00:00 AM,03:00,BURGLARY - VEHICLE,3,09/25/2017 03:30:15 AM,,37.869058,-122.270455
1306,17020446,VEHICLE STOLEN,04/12/2017 12:00:00 AM,18:00,MOTOR VEHICLE THEFT,3,09/25/2017 03:30:12 AM,,37.869058,-122.270455
1311,17025351,THEFT FROM AUTO,05/04/2017 12:00:00 AM,22:30,LARCENY - FROM VEHICLE,4,09/25/2017 03:30:12 AM,,37.869058,-122.270455


Does the number of rows in the dataframe match the values above?

In [26]:
#get number of rows of new df
len(null_temp)


27

Investigate the dataframe. Do you see something interesting that all these rows share?

Are the Latitude/Longitude values all the same for the "NaN" values?

In [27]:
#get unique values of latitude

null_temp.LATITUDE.unique()

array(['37.869058'], dtype=object)

In [28]:
#get unique values of longitude

null_temp.LONGITUDE.unique()

array(['-122.270455'], dtype=object)

## Boolean Slicing

Let's look at the whole dataset to see if there are any rows with that Latitude and Longitude combination that might have a "BLKADDR" associated with it.  

In [29]:
df[(df["LATITUDE"] == '37.869058') & (df["LONGITUDE"] == '-122.270455')]

Unnamed: 0,CASENO,OFFENSE,EVENTDT,EVENTTM,EVENTDESC,D.O.W.,InDbDate,BLKADDR,LATITUDE,LONGITUDE
53,17036936,DISTURBANCE,06/26/2017 12:00:00 AM,18:24,DISORDERLY CONDUCT,1,09/25/2017 03:30:16 AM,,37.869058,-122.270455
104,17090713,THEFT FELONY (OVER $950),04/09/2017 12:00:00 AM,04:15,LARCENY,0,09/25/2017 03:30:12 AM,,37.869058,-122.270455
224,17024641,BURGLARY AUTO,05/01/2017 12:00:00 AM,21:00,BURGLARY - VEHICLE,1,09/25/2017 03:30:12 AM,,37.869058,-122.270455
235,17046547,VEHICLE STOLEN,08/08/2017 12:00:00 AM,17:00,MOTOR VEHICLE THEFT,2,09/25/2017 03:30:18 AM,,37.869058,-122.270455
291,17053694,THEFT MISD. (UNDER $950),09/07/2017 12:00:00 AM,17:43,LARCENY,4,09/25/2017 03:30:19 AM,,37.869058,-122.270455
475,17022572,VEHICLE STOLEN,04/22/2017 12:00:00 AM,21:00,MOTOR VEHICLE THEFT,6,09/25/2017 03:30:12 AM,,37.869058,-122.270455
534,17026854,BURGLARY RESIDENTIAL,05/12/2017 12:00:00 AM,09:00,BURGLARY - RESIDENTIAL,5,09/25/2017 03:30:12 AM,,37.869058,-122.270455
1228,17091147,BURGLARY AUTO,06/14/2017 12:00:00 AM,03:00,BURGLARY - VEHICLE,3,09/25/2017 03:30:15 AM,,37.869058,-122.270455
1306,17020446,VEHICLE STOLEN,04/12/2017 12:00:00 AM,18:00,MOTOR VEHICLE THEFT,3,09/25/2017 03:30:12 AM,,37.869058,-122.270455
1311,17025351,THEFT FROM AUTO,05/04/2017 12:00:00 AM,22:30,LARCENY - FROM VEHICLE,4,09/25/2017 03:30:12 AM,,37.869058,-122.270455


In [None]:
#get unique values of BLKADDR for the Lat/Long combo

...

## Drop Null Values (NaN)

We could go to Google and try to figure out the BLKADDR ourselves, but to avoid introducing errors while searching, let's just drop all the rows that include null values.

In [30]:
#drop rows that have null values
df = df.dropna(axis = 0, how = "any")
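
`dropna(how = "any")` removes a row if any of its columns is null. When only one column matters, the `subset` parameter limits the check to that column. A small sketch on made-up data (the `toy` frame and its `NOTE` column are hypothetical):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "BLKADDR": ["2500 SHATTUCK AVE", np.nan, "2100 OXFORD ST"],
    "NOTE": [np.nan, "a", "b"],
})

dropped_any = toy.dropna(axis=0, how="any")    # drops rows 0 and 1
dropped_addr = toy.dropna(subset=["BLKADDR"])  # drops only row 1

print(len(dropped_any), len(dropped_addr))
```

In this notebook "BLKADDR" is the only column with nulls, so both forms would remove the same rows.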

In [None]:
#now find out how many null values within the df
#What should you see when you run this?
...

# <font color = "red"> YOUR TURN! </font>

### What is the difference between "EVENTDT" and "EVENTTM"? How can we clean our columns to reflect the data that is useful?

HINT: Focus on EVENTDT

In [31]:
#Slice the string to get the information you want and set to the variable
date = [i[:10] for i in df["EVENTDT"]]

#Replace "EVENTDT" with new variable
df["EVENTDT"] = date
#check if it worked
df.head()

Unnamed: 0,CASENO,OFFENSE,EVENTDT,EVENTTM,EVENTDESC,D.O.W.,InDbDate,BLKADDR,LATITUDE,LONGITUDE
0,17034187,BURGLARY AUTO,06/14/2017,15:15,BURGLARY - VEHICLE,3,09/25/2017 03:30:15 AM,ALLSTON WAY & SHATTUCK AVE,37.869363,-122.268028
1,17052235,GUN/WEAPON,09/01/2017,22:56,WEAPONS OFFENSE,5,09/25/2017 03:30:18 AM,UNIVERSITY AVENUE & MILVIA ST,37.871884,-122.270752
2,17091126,THEFT MISD. (UNDER $950),06/10/2017,10:45,LARCENY,6,09/25/2017 03:30:15 AM,2500 SHATTUCK AVE,37.863811,-122.267412
3,17018444,BURGLARY AUTO,04/02/2017,19:30,BURGLARY - VEHICLE,0,09/25/2017 03:30:11 AM,DURANT AVENUE & ELLSWORTH ST,37.867221,-122.263531
4,17033328,NARCOTICS,06/10/2017,14:30,DRUG VIOLATION,6,09/25/2017 03:30:14 AM,MILVIA STREET & UNIVERSITY AVE,37.871884,-122.270752
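
String slicing keeps the date as text. One possible next step, sketched here under the assumption that every value follows the `MM/DD/YYYY` and `HH:MM` patterns seen above, is to parse the date and time into a single datetime column with `pd.to_datetime`. The column name `EVENT_TS` is made up for this example.

```python
import pandas as pd

toy = pd.DataFrame({
    "EVENTDT": ["06/14/2017 12:00:00 AM", "09/01/2017 12:00:00 AM"],
    "EVENTTM": ["15:15", "22:56"],
})

# Keep the date part, append the time, and parse with an explicit format
date_part = toy["EVENTDT"].str[:10]
toy["EVENT_TS"] = pd.to_datetime(date_part + " " + toy["EVENTTM"],
                                 format="%m/%d/%Y %H:%M")

print(toy["EVENT_TS"].dt.day_name().tolist())
```

A real datetime column allows sorting by time and gives access to the `.dt` accessors, such as `.dt.hour` or `.dt.day_name()`.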


### Drop the "InDbDate" column

In [32]:
df = df.drop("InDbDate", axis = 1)

#Check that it actually dropped
df.head()

Unnamed: 0,CASENO,OFFENSE,EVENTDT,EVENTTM,EVENTDESC,D.O.W.,BLKADDR,LATITUDE,LONGITUDE
0,17034187,BURGLARY AUTO,06/14/2017,15:15,BURGLARY - VEHICLE,3,ALLSTON WAY & SHATTUCK AVE,37.869363,-122.268028
1,17052235,GUN/WEAPON,09/01/2017,22:56,WEAPONS OFFENSE,5,UNIVERSITY AVENUE & MILVIA ST,37.871884,-122.270752
2,17091126,THEFT MISD. (UNDER $950),06/10/2017,10:45,LARCENY,6,2500 SHATTUCK AVE,37.863811,-122.267412
3,17018444,BURGLARY AUTO,04/02/2017,19:30,BURGLARY - VEHICLE,0,DURANT AVENUE & ELLSWORTH ST,37.867221,-122.263531
4,17033328,NARCOTICS,06/10/2017,14:30,DRUG VIOLATION,6,MILVIA STREET & UNIVERSITY AVE,37.871884,-122.270752


### TRICKY QUESTION

### Column "D.O.W." can be a bit confusing with the numbers. Replace the numbers with the appropriate day it corresponds to. 

You can find the days it corresponds to in the beginning of the notebook. 

##### HINT: You may need to use a dictionary, the map function, or the zip function

DICT:
https://www.programiz.com/python-programming/methods/built-in/dict

MAP:
https://www.programiz.com/python-programming/methods/built-in/map

ZIP:
https://www.programiz.com/python-programming/methods/built-in/zip

In [33]:
dow = {0:"Sunday",1:"Monday",2:"Tuesday",3:"Wednesday", 4:"Thursday", 5:"Friday", 6:"Saturday"}
df["D.O.W."] = df["D.O.W."].map(dow)

#Check if it worked
df.head()

Unnamed: 0,CASENO,OFFENSE,EVENTDT,EVENTTM,EVENTDESC,D.O.W.,BLKADDR,LATITUDE,LONGITUDE
0,17034187,BURGLARY AUTO,06/14/2017,15:15,BURGLARY - VEHICLE,Wednesday,ALLSTON WAY & SHATTUCK AVE,37.869363,-122.268028
1,17052235,GUN/WEAPON,09/01/2017,22:56,WEAPONS OFFENSE,Friday,UNIVERSITY AVENUE & MILVIA ST,37.871884,-122.270752
2,17091126,THEFT MISD. (UNDER $950),06/10/2017,10:45,LARCENY,Saturday,2500 SHATTUCK AVE,37.863811,-122.267412
3,17018444,BURGLARY AUTO,04/02/2017,19:30,BURGLARY - VEHICLE,Sunday,DURANT AVENUE & ELLSWORTH ST,37.867221,-122.263531
4,17033328,NARCOTICS,06/10/2017,14:30,DRUG VIOLATION,Saturday,MILVIA STREET & UNIVERSITY AVE,37.871884,-122.270752
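
The hint also mentions `zip`. As a minimal sketch, the same lookup dictionary can be built by zipping the codes 0 to 6 with a list of day names (`days`, `dow_lookup`, and `codes` are names invented for this example). Any code missing from the dictionary becomes NaN after `map`, which is worth checking for.

```python
import pandas as pd

days = ["Sunday", "Monday", "Tuesday", "Wednesday",
        "Thursday", "Friday", "Saturday"]
dow_lookup = dict(zip(range(7), days))  # 0 -> "Sunday", ..., 6 -> "Saturday"

codes = pd.Series([3, 5, 6, 0, 9])      # 9 is a deliberately invalid code
names = codes.map(dow_lookup)

print(names.tolist())
```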


# Part 2: EXPLORATORY DATA ANALYSIS

<h3>"Exploratory data analysis or 'EDA' is a <b>critical</b> beginning step in analyzing the data from an experiment.</h3>

<b>Here are the main reasons we use EDA:</b>
<ul>
• detection of mistakes<br><br>
• checking of assumptions<br><br>
• preliminary selection of appropriate models<br><br>
• determining relationships among the explanatory variables, and<br><br>
• assessing the direction and rough size of relationships between explanatory and outcome variables."</ul>
REFERENCE: http://www.stat.cmu.edu/~hseltman/309/Book/chapter4.pdf


## Now what?

We have cleaned our data to the best of our ability based on the initial look. Now let's try to look at the <b>relationships</b> between different values. 

In [34]:
df.head()

Unnamed: 0,CASENO,OFFENSE,EVENTDT,EVENTTM,EVENTDESC,D.O.W.,BLKADDR,LATITUDE,LONGITUDE
0,17034187,BURGLARY AUTO,06/14/2017,15:15,BURGLARY - VEHICLE,Wednesday,ALLSTON WAY & SHATTUCK AVE,37.869363,-122.268028
1,17052235,GUN/WEAPON,09/01/2017,22:56,WEAPONS OFFENSE,Friday,UNIVERSITY AVENUE & MILVIA ST,37.871884,-122.270752
2,17091126,THEFT MISD. (UNDER $950),06/10/2017,10:45,LARCENY,Saturday,2500 SHATTUCK AVE,37.863811,-122.267412
3,17018444,BURGLARY AUTO,04/02/2017,19:30,BURGLARY - VEHICLE,Sunday,DURANT AVENUE & ELLSWORTH ST,37.867221,-122.263531
4,17033328,NARCOTICS,06/10/2017,14:30,DRUG VIOLATION,Saturday,MILVIA STREET & UNIVERSITY AVE,37.871884,-122.270752


Let's look at the different types of offenses that were called in. We know that using the .unique() function will return all the unique values in the column, but what if we wanted to also <b>count</b> the different times each unique value appeared?

In [35]:
df.OFFENSE.value_counts()

BURGLARY AUTO               1069
THEFT MISD. (UNDER $950)     864
VANDALISM                    447
DISTURBANCE                  423
NARCOTICS                    339
VEHICLE STOLEN               306
BURGLARY RESIDENTIAL         271
ASSAULT/BATTERY MISD.        263
THEFT FELONY (OVER $950)     253
ROBBERY                      183
IDENTITY THEFT               150
ALCOHOL OFFENSE              141
THEFT FROM AUTO              137
DOMESTIC VIOLENCE            119
BURGLARY COMMERCIAL          111
ASSAULT/BATTERY FEL.         102
FRAUD/FORGERY                101
MISSING ADULT                 66
2ND RESPONSE                  42
GUN/WEAPON                    42
SEXUAL ASSAULT FEL.           32
BRANDISHING                   25
THEFT FROM PERSON             23
MISSING JUVENILE              22
SEXUAL ASSAULT MISD.          21
ARSON                         16
MUNICIPAL CODE                12
VEHICLE RECOVERED              8
KIDNAPPING                     1
VICE                           1
Name: OFFE

In [36]:
df.EVENTDESC.value_counts()

LARCENY                   1140
BURGLARY - VEHICLE        1069
VANDALISM                  447
DISORDERLY CONDUCT         424
ASSAULT                    365
DRUG VIOLATION             339
MOTOR VEHICLE THEFT        306
BURGLARY - RESIDENTIAL     271
FRAUD                      251
ROBBERY                    183
LIQUOR LAW VIOLATION       141
LARCENY - FROM VEHICLE     137
FAMILY OFFENSE             119
BURGLARY - COMMERCIAL      111
MISSING PERSON              88
WEAPONS OFFENSE             67
SEX CRIME                   53
NOISE VIOLATION             42
ARSON                       16
ALL OTHER OFFENSES          12
RECOVERED VEHICLE            8
KIDNAPPING                   1
Name: EVENTDESC, dtype: int64

Why does "LARCENY" have the highest count in the "EVENTDESC" column when "BURGLARY AUTO" came first in the "OFFENSE" column? Let's look into this a little more.


## GroupBy 

In [37]:
df.groupby("EVENTDESC").OFFENSE.value_counts()

#turn the series into a DF 
df.groupby("EVENTDESC").OFFENSE.value_counts().to_frame()

EVENTDESC,OFFENSE,OFFENSE
ALL OTHER OFFENSES,MUNICIPAL CODE,12
ARSON,ARSON,16
ASSAULT,ASSAULT/BATTERY MISD.,263
ASSAULT,ASSAULT/BATTERY FEL.,102
BURGLARY - COMMERCIAL,BURGLARY COMMERCIAL,111
BURGLARY - RESIDENTIAL,BURGLARY RESIDENTIAL,271
BURGLARY - VEHICLE,BURGLARY AUTO,1069
DISORDERLY CONDUCT,DISTURBANCE,423
DISORDERLY CONDUCT,VICE,1
DRUG VIOLATION,NARCOTICS,339


From this DF we can see that "LARCENY" has the most OFFENSE types within its category. When you add the totals from "THEFT MISD. (UNDER $950)", "THEFT FELONY (OVER $950)", and "THEFT FROM PERSON" (864 + 253 + 23 = 1140), they come to more than "BURGLARY - VEHICLE" (1069), but "BURGLARY AUTO" as an offense alone is the highest in number.

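A small check of that comparison, offered as a sketch: the counts below are copied from the `value_counts()` outputs above into a hypothetical `counts` frame, then summed per EVENTDESC with `groupby`.

```python
import pandas as pd

# Counts copied from the OFFENSE value_counts output; the frame itself is illustrative
counts = pd.DataFrame({
    "EVENTDESC": ["LARCENY", "LARCENY", "LARCENY", "BURGLARY - VEHICLE"],
    "OFFENSE": ["THEFT MISD. (UNDER $950)", "THEFT FELONY (OVER $950)",
                "THEFT FROM PERSON", "BURGLARY AUTO"],
    "CALLS": [864, 253, 23, 1069],
})

per_category = counts.groupby("EVENTDESC")["CALLS"].sum()
top_offense = counts.loc[counts["CALLS"].idxmax(), "OFFENSE"]

print(per_category)
print(top_offense)
```
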
# <font color = "red"> YOUR TURN! </font>

Could there be any relationship between the day of the week and the calls? Try out different functions to see if there is any significance.

In [38]:
#count the number of calls per day
df["D.O.W."].value_counts()

Tuesday      853
Friday       818
Wednesday    818
Saturday     809
Thursday     797
Monday       793
Sunday       702
Name: D.O.W., dtype: int64

With the day that has the most calls, check the type of offense that appears the most.

In [None]:
#only display rows with the D.O.W that appears the most
#create a temp df
...

In [78]:
#only display rows with the D.O.W of Friday
#create a temp df

fri =df[df["D.O.W."] == "Friday"]
fri.head()

Unnamed: 0,CASENO,OFFENSE,EVENTDT,EVENTTM,EVENTDESC,D.O.W.,BLKADDR,LATITUDE,LONGITUDE
1,17052235,GUN/WEAPON,09/01/2017,22:56,WEAPONS OFFENSE,Friday,UNIVERSITY AVENUE & MILVIA ST,37.871884,-122.270752
5,17043981,ALCOHOL OFFENSE,07/28/2017,21:31,LIQUOR LAW VIOLATION,Friday,KITTREDGE STREET & FULTON ST,37.868706,-122.266279
6,17091423,THEFT FELONY (OVER $950),07/21/2017,20:35,LARCENY,Friday,2400 HASTE ST,37.865849,-122.259977
14,17090740,IDENTITY THEFT,04/14/2017,12:55,FRAUD,Friday,900 SAN BENITO RD,37.892152,-122.269211
16,17048522,ASSAULT/BATTERY MISD.,08/18/2017,07:45,ASSAULT,Friday,2100 OXFORD ST,37.87091,-122.265993


Do these numbers match the results of the overall DF?


Let's try something else. Friday and Saturday nights are typically associated with being the "party" time. If this is true, should there be more Liquor/Drug/Disorderly Conduct/etc. occurrences on those nights?

Let's try it with Fridays!

This will not give us the information we want. Instead let us look at each EVENTDESC and group by the D.O.W. that appears the most per EVENTDESC.

Also, I realized that the periods in "D.O.W." are becoming a problem when I try to call my series. I want to change the name of the column again. How can I do that?

In [79]:
#change column name
df = ...
df.head()

SyntaxError: invalid syntax (<ipython-input-79-5fed38e00bbc>, line 2)
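
Attribute access such as `df.OFFENSE` only works when the column name is a valid Python identifier, so a name containing periods has to be read with brackets, as in `df["D.O.W."]`. One possible approach, sketched on invented rows: rename the column to something identifier-friendly (`DOW` is an arbitrary choice) and use `pd.crosstab` to count each EVENTDESC per day, then pick the day with the most calls for each category.

```python
import pandas as pd

# Invented rows for illustration only
toy = pd.DataFrame({
    "EVENTDESC": ["LIQUOR LAW VIOLATION", "LIQUOR LAW VIOLATION",
                  "LIQUOR LAW VIOLATION", "LARCENY", "LARCENY"],
    "D.O.W.": ["Friday", "Friday", "Monday", "Tuesday", "Tuesday"],
})

toy = toy.rename(columns={"D.O.W.": "DOW"})

# Rows are event descriptions, columns are days, cells are call counts
table = pd.crosstab(toy["EVENTDESC"], toy["DOW"])
busiest_day = table.idxmax(axis=1)

print(busiest_day)
```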

# <font color = "red"> GROUP WORK</font> 
## What do YOU want to find out? YOUR DATA INVESTIGATION

In this notebook you have been learning techniques to manipulate your dataframe to your preference. We know how to clean and explore our data, but what questions do you actually want to answer with the data?

<b> * In groups of 2-4 people, investigate the dataframe in this notebook and pick a question/topic to answer. Using the techniques you learned today, show relationships and results that would support that question/topic. 
</b><br><br>
<i>If we have time</i> <b>each</b> group will present their investigations and why they are significant to the class.


In [80]:
df3 = df[df['LONGITUDE'].str.startswith('-122') & df['LATITUDE'].str.startswith('37')]
df3.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5310 entries, 0 to 5616
Data columns (total 9 columns):
CASENO       5310 non-null int64
OFFENSE      5310 non-null object
EVENTDT      5310 non-null object
EVENTTM      5310 non-null object
EVENTDESC    5310 non-null object
D.O.W.       5310 non-null object
BLKADDR      5310 non-null object
LATITUDE     5310 non-null object
LONGITUDE    5310 non-null object
dtypes: int64(1), object(8)
memory usage: 414.8+ KB


In [81]:
df3['LONGITUDE'] = pd.to_numeric(df3['LONGITUDE'])
df3['LATITUDE'] = pd.to_numeric(df3['LATITUDE'])


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
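
This message is pandas' `SettingWithCopyWarning`. It appears because `df3` was created by filtering `df`, and pandas cannot tell whether the assignment was meant to reach the original frame. A common remedy, sketched here on a toy frame (the names `base` and `subset` are hypothetical), is to take an explicit `.copy()` of the filtered result before assigning to it.

```python
import pandas as pd

base = pd.DataFrame({"LATITUDE": ["37.869363", "Berkeley", "37.871884"]})

# .copy() makes subset an independent frame, so assigning to it is unambiguous
subset = base[base["LATITUDE"].str.startswith("37")].copy()
subset["LATITUDE"] = pd.to_numeric(subset["LATITUDE"])

print(subset.dtypes)
```

The original `base` frame keeps its string values; only `subset` is converted.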
  


In [82]:
import numpy
df3['LATITUDE'] = ((df3['LATITUDE'] + 0.005) * 100).astype(numpy.int64)/100

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  


In [83]:
import numpy
df3['LONGITUDE'] = ((df3['LONGITUDE'] + 0.005) * 100).astype(numpy.int64)/100


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
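
A note on the `+ 0.005` and cast-to-integer step used in the two cells above: casting to an integer truncates toward zero. For positive latitudes this rounds to the nearest hundredth, but for negative longitudes it shifts the value toward zero instead. The sketch below contrasts it with `Series.round` on two longitudes taken from the data (the variable names are illustrative).

```python
import numpy as np
import pandas as pd

lon = pd.Series([-122.268028, -122.263531])

truncated = ((lon + 0.005) * 100).astype(np.int64) / 100  # the notebook's approach
rounded = lon.round(2)                                    # nearest hundredth

print(truncated.tolist())
print(rounded.tolist())
```

Either result can serve as a grid-cell label; the point is only that the two methods assign some negative values to different cells.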
  


In [84]:
df3.head()

Unnamed: 0,CASENO,OFFENSE,EVENTDT,EVENTTM,EVENTDESC,D.O.W.,BLKADDR,LATITUDE,LONGITUDE
0,17034187,BURGLARY AUTO,06/14/2017,15:15,BURGLARY - VEHICLE,Wednesday,ALLSTON WAY & SHATTUCK AVE,37.87,-122.26
1,17052235,GUN/WEAPON,09/01/2017,22:56,WEAPONS OFFENSE,Friday,UNIVERSITY AVENUE & MILVIA ST,37.87,-122.26
2,17091126,THEFT MISD. (UNDER $950),06/10/2017,10:45,LARCENY,Saturday,2500 SHATTUCK AVE,37.86,-122.26
3,17018444,BURGLARY AUTO,04/02/2017,19:30,BURGLARY - VEHICLE,Sunday,DURANT AVENUE & ELLSWORTH ST,37.87,-122.25
4,17033328,NARCOTICS,06/10/2017,14:30,DRUG VIOLATION,Saturday,MILVIA STREET & UNIVERSITY AVE,37.87,-122.26


In [72]:
df3.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5310 entries, 0 to 5616
Data columns (total 10 columns):
CASENO       5310 non-null int64
OFFENSE      5310 non-null object
EVENTDT      5310 non-null object
EVENTTM      5310 non-null object
EVENTDESC    5310 non-null object
D.O.W.       5310 non-null object
BLKADDR      5310 non-null object
LATITUDE     5310 non-null float64
LONGITUDE    5310 non-null float64
LOCATION     5310 non-null object
dtypes: float64(2), int64(1), object(7)
memory usage: 616.3+ KB


In [85]:
def locate(f):
    return (str(f['LATITUDE']) + " " + str(f['LONGITUDE']))
df3['LOCATION'] = df3.apply(locate,axis=1)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
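
Row-wise `apply` with `axis=1` calls the Python function once per row. As a hedged alternative sketch, the same label can be built with vectorized string concatenation, which should yield the same strings; grouping directly on the two numeric columns avoids the string column altogether. The `toy` frame and `cell_counts` are assumptions for illustration.

```python
import pandas as pd

toy = pd.DataFrame({"LATITUDE": [37.87, 37.9], "LONGITUDE": [-122.26, -122.3]})

# Vectorized equivalent of str(lat) + " " + str(long) for every row
toy["LOCATION"] = toy["LATITUDE"].astype(str) + " " + toy["LONGITUDE"].astype(str)

# Alternative: count rows per grid cell without creating a label at all
cell_counts = toy.groupby(["LATITUDE", "LONGITUDE"]).size()

print(toy["LOCATION"].tolist())
```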


In [86]:
df3.head()

Unnamed: 0,CASENO,OFFENSE,EVENTDT,EVENTTM,EVENTDESC,D.O.W.,BLKADDR,LATITUDE,LONGITUDE,LOCATION
0,17034187,BURGLARY AUTO,06/14/2017,15:15,BURGLARY - VEHICLE,Wednesday,ALLSTON WAY & SHATTUCK AVE,37.87,-122.26,37.87 -122.26
1,17052235,GUN/WEAPON,09/01/2017,22:56,WEAPONS OFFENSE,Friday,UNIVERSITY AVENUE & MILVIA ST,37.87,-122.26,37.87 -122.26
2,17091126,THEFT MISD. (UNDER $950),06/10/2017,10:45,LARCENY,Saturday,2500 SHATTUCK AVE,37.86,-122.26,37.86 -122.26
3,17018444,BURGLARY AUTO,04/02/2017,19:30,BURGLARY - VEHICLE,Sunday,DURANT AVENUE & ELLSWORTH ST,37.87,-122.25,37.87 -122.25
4,17033328,NARCOTICS,06/10/2017,14:30,DRUG VIOLATION,Saturday,MILVIA STREET & UNIVERSITY AVE,37.87,-122.26,37.87 -122.26


In [87]:
df3.LOCATION.unique()

array(['37.87 -122.26', '37.86 -122.26', '37.87 -122.25', '37.88 -122.26',
       '37.88 -122.29', '37.86 -122.28', '37.86 -122.29', '37.87 -122.24',
       '37.82 -122.27', '37.89 -122.26', '37.89 -122.3', '37.86 -122.25',
       '37.85 -122.27', '37.86 -122.24', '37.87 -122.27', '37.86 -122.27',
       '37.87 -122.29', '37.88 -122.3', '37.88 -122.28', '37.87 -122.28',
       '37.85 -122.28', '37.88 -122.27', '37.85 -122.26', '37.88 -122.25',
       '37.85 -122.24', '37.89 -122.25', '37.86 -122.31', '37.89 -122.27',
       '37.85 -122.23', '37.86 -122.23', '37.87 -122.3', '37.9 -122.25',
       '37.85 -122.25', '37.87 -122.31', '37.9 -122.26', '37.89 -122.24',
       '37.9 -122.27', '37.9 -122.28', '37.88 -122.24', '37.89 -122.28',
       '37.85 -122.29', '37.87 -122.23', '37.91 -122.27', '37.86 -122.22'], dtype=object)
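
Because the coordinates are converted with `str`, a longitude of -122.30 shows up as `-122.3` in the labels above. If uniform labels are preferred, each column can be formatted to a fixed number of decimals before joining. A small sketch on made-up values (the `coords` frame is hypothetical):

```python
import pandas as pd

coords = pd.DataFrame({"LATITUDE": [37.89, 37.9], "LONGITUDE": [-122.3, -122.25]})

# Format each coordinate with exactly two decimals before joining
lat = coords["LATITUDE"].map("{:.2f}".format)
lon = coords["LONGITUDE"].map("{:.2f}".format)
coords["LOCATION"] = lat + " " + lon
print(coords["LOCATION"].tolist())
```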

In [133]:
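# Count how many calls fall on each rounded "lat lon" location
df4 = df3  # note: this is an alias, not a copy; use df3.copy() for an independent frame
df5 = df4['LOCATION'].value_counts().rename_axis('LOCATION').reset_index(name='OCCURRENCE')
df5.head()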

Unnamed: 0,LOCATION,OCCURRENCE
0,37.87 -122.26,891
1,37.87 -122.25,462
2,37.86 -122.26,339
3,37.87 -122.29,307
4,37.86 -122.25,291


In [170]:
# Keep only the rows at the location with the most calls
top_locations = df5.loc[df5['OCCURRENCE'] == df5['OCCURRENCE'].max(), 'LOCATION']
df6 = df3[df3['LOCATION'].isin(top_locations)]
df6.head(10)

Unnamed: 0,CASENO,OFFENSE,EVENTDT,EVENTTM,EVENTDESC,D.O.W.,BLKADDR,LATITUDE,LONGITUDE,LOCATION
0,17034187,BURGLARY AUTO,06/14/2017,15:15,BURGLARY - VEHICLE,Wednesday,ALLSTON WAY & SHATTUCK AVE,37.87,-122.26,37.87 -122.26
1,17052235,GUN/WEAPON,09/01/2017,22:56,WEAPONS OFFENSE,Friday,UNIVERSITY AVENUE & MILVIA ST,37.87,-122.26,37.87 -122.26
4,17033328,NARCOTICS,06/10/2017,14:30,DRUG VIOLATION,Saturday,MILVIA STREET & UNIVERSITY AVE,37.87,-122.26,37.87 -122.26
5,17043981,ALCOHOL OFFENSE,07/28/2017,21:31,LIQUOR LAW VIOLATION,Friday,KITTREDGE STREET & FULTON ST,37.87,-122.26,37.87 -122.26
16,17048522,ASSAULT/BATTERY MISD.,08/18/2017,07:45,ASSAULT,Friday,2100 OXFORD ST,37.87,-122.26,37.87 -122.26
22,17033309,NARCOTICS,06/10/2017,13:18,DRUG VIOLATION,Saturday,2100 ALLSTON WAY,37.87,-122.26,37.87 -122.26
32,17021781,VEHICLE STOLEN,04/16/2017,08:00,MOTOR VEHICLE THEFT,Sunday,1900 HENRY ST,37.87,-122.26,37.87 -122.26
36,17020946,ASSAULT/BATTERY MISD.,04/15/2017,11:00,ASSAULT,Saturday,2100 MILVIA ST,37.87,-122.26,37.87 -122.26
49,17019604,DISTURBANCE,04/08/2017,08:00,DISORDERLY CONDUCT,Saturday,1900 SHATTUCK AVE,37.87,-122.26,37.87 -122.26
55,17056764,VEHICLE STOLEN,09/19/2017,22:00,MOTOR VEHICLE THEFT,Tuesday,2100 CHANNING WAY,37.87,-122.26,37.87 -122.26


In [171]:
# Count calls per day of the week at the busiest location
df7 = df6['D.O.W.'].value_counts().rename_axis('DOW').reset_index(name='OCCURRENCE')
df7.head()

Unnamed: 0,DOW,OCCURRENCE
0,Tuesday,143
1,Saturday,130
2,Friday,128
3,Monday,128
4,Thursday,125


In [179]:
# Keep only the calls that happened on the busiest day of the week
top_days = df7.loc[df7['OCCURRENCE'] == df7['OCCURRENCE'].max(), 'DOW']
df8 = df6[df6['D.O.W.'].isin(top_days)]
len(df8.index)

143

In [180]:
df8

Unnamed: 0,CASENO,OFFENSE,EVENTDT,EVENTTM,EVENTDESC,D.O.W.,BLKADDR,LATITUDE,LONGITUDE,LOCATION
55,17056764,VEHICLE STOLEN,09/19/2017,22:00,MOTOR VEHICLE THEFT,Tuesday,2100 CHANNING WAY,37.87,-122.26,37.87 -122.26
160,17053192,NARCOTICS,09/05/2017,22:47,DRUG VIOLATION,Tuesday,1900 SHATTUCK AVE,37.87,-122.26,37.87 -122.26
162,17091722,THEFT MISD. (UNDER $950),08/29/2017,18:45,LARCENY,Tuesday,2100 SHATTUCK AVE,37.87,-122.26,37.87 -122.26
210,17029326,THEFT MISD. (UNDER $950),05/23/2017,23:00,LARCENY,Tuesday,2100 SHATTUCK AVE,37.87,-122.26,37.87 -122.26
221,17017299,DISTURBANCE,03/28/2017,13:12,DISORDERLY CONDUCT,Tuesday,1900 ALLSTON WAY,37.87,-122.26,37.87 -122.26
258,17091285,BURGLARY AUTO,07/04/2017,14:00,BURGLARY - VEHICLE,Tuesday,2300 MILVIA ST,37.87,-122.26,37.87 -122.26
308,17038535,BURGLARY COMMERCIAL,07/04/2017,02:06,BURGLARY - COMMERCIAL,Tuesday,1800 UNIVERSITY AVE,37.87,-122.26,37.87 -122.26
311,17040151,THEFT FELONY (OVER $950),07/11/2017,13:09,LARCENY,Tuesday,2200 SHATTUCK AVE,37.87,-122.26,37.87 -122.26
315,17091775,THEFT MISD. (UNDER $950),08/29/2017,21:00,LARCENY,Tuesday,2000 CENTER ST,37.87,-122.26,37.87 -122.26
324,17038593,DISTURBANCE,07/04/2017,14:57,DISORDERLY CONDUCT,Tuesday,1900 SHATTUCK AVE,37.87,-122.26,37.87 -122.26


In [181]:
def group_by_4_hour(x):
    # Map a row's EVENTTM ("HH:MM") to the 4-hour window it falls in
    hour = int(x['EVENTTM'].split(':')[0])
    if hour < 4: return '0:00 - 4:00'
    if hour < 8: return '4:00 - 8:00'
    if hour < 12: return '8:00 - 12:00'
    if hour < 16: return '12:00 - 16:00'
    if hour < 20: return '16:00 - 20:00'
    return '20:00 - 24:00'

# Work on an explicit copy so adding a column does not trigger SettingWithCopyWarning
df8 = df8.copy()
df8['PERIOD'] = df8.apply(group_by_4_hour, axis=1)


In [184]:
df8.head()

Unnamed: 0,CASENO,OFFENSE,EVENTDT,EVENTTM,EVENTDESC,D.O.W.,BLKADDR,LATITUDE,LONGITUDE,LOCATION,PERIOD
55,17056764,VEHICLE STOLEN,09/19/2017,22:00,MOTOR VEHICLE THEFT,Tuesday,2100 CHANNING WAY,37.87,-122.26,37.87 -122.26,20:00 - 24:00
160,17053192,NARCOTICS,09/05/2017,22:47,DRUG VIOLATION,Tuesday,1900 SHATTUCK AVE,37.87,-122.26,37.87 -122.26,20:00 - 24:00
162,17091722,THEFT MISD. (UNDER $950),08/29/2017,18:45,LARCENY,Tuesday,2100 SHATTUCK AVE,37.87,-122.26,37.87 -122.26,16:00 - 20:00
210,17029326,THEFT MISD. (UNDER $950),05/23/2017,23:00,LARCENY,Tuesday,2100 SHATTUCK AVE,37.87,-122.26,37.87 -122.26,20:00 - 24:00
221,17017299,DISTURBANCE,03/28/2017,13:12,DISORDERLY CONDUCT,Tuesday,1900 ALLSTON WAY,37.87,-122.26,37.87 -122.26,12:00 - 16:00
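
The chain of `if` statements in `group_by_4_hour` can also be expressed as a binning step. As a sketch, assuming the times are always in `HH:MM` format, `pd.cut` assigns each hour to a left-closed 4-hour window in a single vectorized call. The `calls` frame below is a made-up sample, not the real data.

```python
import pandas as pd

# Hypothetical sample of EVENTTM values in "HH:MM" format
calls = pd.DataFrame({"EVENTTM": ["00:05", "04:00", "15:15", "22:56"]})

labels = ["0:00 - 4:00", "4:00 - 8:00", "8:00 - 12:00",
          "12:00 - 16:00", "16:00 - 20:00", "20:00 - 24:00"]

# Extract the hour as an integer, then bin it into windows [0, 4), [4, 8), ... [20, 24)
hour = calls["EVENTTM"].str.split(":").str[0].astype(int)
calls["PERIOD"] = pd.cut(hour, bins=[0, 4, 8, 12, 16, 20, 24], right=False, labels=labels)
print(calls["PERIOD"].astype(str).tolist())
```

With `right=False`, an hour of exactly 4 lands in `4:00 - 8:00`, which matches the `<` comparisons in the original function.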


In [188]:
# Count calls per 4-hour period; labels sort as strings and head() shows only 5 of the 6 periods
df9 = df8.groupby('PERIOD').size().reset_index(name='OCCURRENCE')
df9.head()

Unnamed: 0,PERIOD,OCCURRENCE
0,0:00 - 4:00,11
1,12:00 - 16:00,37
2,16:00 - 20:00,38
3,20:00 - 24:00,33
4,4:00 - 8:00,5
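
The periods above are listed in string order (`12:00 - 16:00` sorts before `4:00 - 8:00`), and `head()` cuts off the sixth window. One way to get a chronological table is to reindex the counts with an explicit label order. The sketch below uses made-up `PERIOD` values rather than the real counts.

```python
import pandas as pd

order = ["0:00 - 4:00", "4:00 - 8:00", "8:00 - 12:00",
         "12:00 - 16:00", "16:00 - 20:00", "20:00 - 24:00"]

# Hypothetical PERIOD values; no call falls in "4:00 - 8:00"
period = pd.Series(["16:00 - 20:00", "0:00 - 4:00", "16:00 - 20:00",
                    "12:00 - 16:00", "8:00 - 12:00", "20:00 - 24:00"], name="PERIOD")

# Count each label, then put the rows in chronological order;
# windows with no calls appear with a count of 0 instead of disappearing
counts = (period.value_counts()
                .reindex(order, fill_value=0)
                .rename_axis("PERIOD")
                .reset_index(name="OCCURRENCE"))
print(counts)
```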


In [193]:
# Keep only the calls in the busiest 4-hour period
top_periods = df9.loc[df9['OCCURRENCE'] == df9['OCCURRENCE'].max(), 'PERIOD']
df10 = df8[df8['PERIOD'].isin(top_periods)]
len(df10.index)

38

In [192]:
df10.head()

Unnamed: 0,CASENO,OFFENSE,EVENTDT,EVENTTM,EVENTDESC,D.O.W.,BLKADDR,LATITUDE,LONGITUDE,LOCATION,PERIOD
162,17091722,THEFT MISD. (UNDER $950),08/29/2017,18:45,LARCENY,Tuesday,2100 SHATTUCK AVE,37.87,-122.26,37.87 -122.26,16:00 - 20:00
614,17044793,THEFT MISD. (UNDER $950),08/01/2017,19:13,LARCENY,Tuesday,2300 SHATTUCK AVE,37.87,-122.26,37.87 -122.26,16:00 - 20:00
633,17018776,BURGLARY AUTO,04/04/2017,17:15,BURGLARY - VEHICLE,Tuesday,HAROLD WAY & ALLSTON WAY,37.87,-122.26,37.87 -122.26,16:00 - 20:00
722,17023058,THEFT FROM AUTO,04/25/2017,19:31,LARCENY - FROM VEHICLE,Tuesday,2000 BANCROFT WAY,37.87,-122.26,37.87 -122.26,16:00 - 20:00
889,17034017,BURGLARY AUTO,06/13/2017,17:15,BURGLARY - VEHICLE,Tuesday,2000 ALLSTON WAY,37.87,-122.26,37.87 -122.26,16:00 - 20:00
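
The last few cells repeat one pattern three times: count the values in a column, find the most frequent one, and keep only the matching rows (first for `LOCATION`, then `D.O.W.`, then `PERIOD`). As a sketch, that pattern could be wrapped in a small helper. The name `keep_most_common` and the toy frame are illustrative and not part of the original notebook. Like the cells above, the helper keeps every value tied for the maximum.

```python
import pandas as pd

def keep_most_common(frame, column):
    """Return the rows whose value in `column` is the most frequent one (ties kept)."""
    counts = frame[column].value_counts()
    top_values = counts[counts == counts.max()].index
    return frame[frame[column].isin(top_values)].copy()

# Toy data standing in for the cleaned calls-for-service frame
toy = pd.DataFrame({
    "LOCATION": ["A", "A", "A", "B", "B"],
    "D.O.W.":   ["Tuesday", "Tuesday", "Friday", "Tuesday", "Friday"],
})

busiest = keep_most_common(toy, "LOCATION")        # rows at the most common location
busiest_day = keep_most_common(busiest, "D.O.W.")  # of those, rows on the most common day
print(busiest_day)
```

Returning a copy from the helper means a new column can be added to the result without a SettingWithCopyWarning.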


In [195]:
# Break the remaining calls down by event category and specific offense
df10.groupby("EVENTDESC").OFFENSE.value_counts().to_frame("COUNT")

EVENTDESC,OFFENSE,COUNT
ASSAULT,ASSAULT/BATTERY FEL.,1
ASSAULT,ASSAULT/BATTERY MISD.,1
BURGLARY - COMMERCIAL,BURGLARY COMMERCIAL,2
BURGLARY - RESIDENTIAL,BURGLARY RESIDENTIAL,1
BURGLARY - VEHICLE,BURGLARY AUTO,7
DISORDERLY CONDUCT,DISTURBANCE,3
FRAUD,FRAUD/FORGERY,2
LARCENY,THEFT MISD. (UNDER $950),8
LARCENY,THEFT FELONY (OVER $950),4
LARCENY,THEFT FROM PERSON,1
