# AN ANALYTICAL DETECTIVE

Crime is an international concern, but it is documented and handled in very different ways in different countries. In the United States, violent crimes and property crimes are recorded by the Federal Bureau of Investigation (FBI). Additionally, each city documents crime, and some cities release data on crime rates. The city of Chicago, Illinois, publishes its crime data from 2001 onward online.

In [1]:
import pandas as pd

##  LOADING THE DATA 

In [2]:
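datafile = "../data/mvtWeek1.csv"
# dtype='unicode' reads every column as text, so numbers are not parsed as numbers
mvt = pd.read_csv(datafile, dtype='unicode')
# read_csv already returns a DataFrame, so no extra pd.DataFrame() wrapper is needed
df = mvt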
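
Loading with a string dtype affects every later comparison, so it helps to see what it does. The following is a minimal sketch on a made-up two-row CSV, not the real file: the inline sample and the names `sample_csv`, `as_text`, and `inferred` are assumptions for illustration, and `dtype=str` is used here on the assumption that it behaves like the `'unicode'` setting above.

```python
import io
import pandas as pd

# Hypothetical two-row sample mimicking a few columns of mvtWeek1.csv
sample_csv = "ID,Beat,Arrest\n8951354,623,FALSE\n8951141,1213,TRUE\n"

# dtype=str keeps every column as text
as_text = pd.read_csv(io.StringIO(sample_csv), dtype=str)

# Without a dtype argument, pandas infers column types from the values
inferred = pd.read_csv(io.StringIO(sample_csv))

print(as_text.dtypes)
print(inferred.dtypes)
```

With the text version, `Beat` holds strings such as `"623"`; with inference, it holds integers.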

### Analysis

In [3]:
print(df.columns)
print(df.head())
print(df.describe())

Index(['ID', 'Date', 'LocationDescription', 'Arrest', 'Domestic', 'Beat',
       'District', 'CommunityArea', 'Year', 'Latitude', 'Longitude'],
      dtype='object')
        ID            Date            LocationDescription Arrest Domestic  \
0  8951354  12/31/12 23:15                         STREET  FALSE    FALSE   
1  8951141  12/31/12 22:00                         STREET  FALSE    FALSE   
2  8952745  12/31/12 22:00  RESIDENTIAL YARD (FRONT/BACK)  FALSE    FALSE   
3  8952223  12/31/12 22:00                         STREET  FALSE    FALSE   
4  8951608  12/31/12 21:30                         STREET  FALSE    FALSE   

   Beat District CommunityArea  Year     Latitude     Longitude  
0   623        6            69  2012  41.75628399  -87.62164472  
1  1213       12            24  2012  41.89878849  -87.66130317  
2  1622       16            11  2012  41.96918578  -87.76766974  
3   724        7            67  2012  41.76932868  -87.65772562  
4   211        2            35  2012  41.

How many rows of data (observations) are in this dataset? 
n = 191641

In [10]:
print("\nHow many rows of data (observations) are in this dataset?:\n",len(df))


How many rows of data (observations) are in this dataset?:
 191641


How many variables are in this dataset?
c = 11 
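
Both counts can be read from the DataFrame's `shape` attribute in one step. A minimal sketch on a hypothetical three-row stand-in (`toy` is an illustrative name, not the crime data):

```python
import pandas as pd

# Hypothetical stand-in for the crime DataFrame: 3 rows, 2 columns
toy = pd.DataFrame({"ID": ["1", "2", "3"], "Beat": ["623", "1213", "1622"]})

# shape is a (rows, columns) tuple
n_rows, n_cols = toy.shape
print("rows:", n_rows, "columns:", n_cols)
```

Applied to `df`, the same attribute should give the row and variable counts reported above.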

What is the maximum value of the variable "ID"?

In [26]:
print("\nmaximum value of the variable ID is: ",df.ID.max())


maximum value of the variable ID is:  9181151


What is the minimum value of the variable "Beat"?

In [27]:
print("\nminimum value of the variable Beat:",df.Beat.min())


minimum value of the variable Beat: 1011
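
Because every column was loaded as text, `max()` and `min()` compare strings character by character, so the values reported above may differ from the numeric extremes (for example, "1011" sorts before "111"). A minimal sketch of the difference, using made-up beat codes and assuming conversion with `pd.to_numeric` is acceptable for this column:

```python
import pandas as pd

# Hypothetical beat codes stored as text, as they are after a string-typed load
beats = pd.Series(["623", "1213", "1622", "724", "211"])

# String comparison: "1213" sorts first because "1" < "2", "6", "7"
text_min = beats.min()

# Numeric comparison after converting the strings to integers
numeric_min = pd.to_numeric(beats).min()

print("text min:", text_min, "numeric min:", numeric_min)
```

For the real data, the analogous call would be `pd.to_numeric(df['Beat']).min()`.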


How many observations have value TRUE in the Arrest variable (this is the number of crimes for which an arrest was made)?

In [24]:
df['Arrest'].value_counts()

FALSE    176105
TRUE      15536
Name: Arrest, dtype: int64
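
If only the TRUE count is needed, a boolean mask can be summed directly. A minimal sketch on a hypothetical text-valued column (the values are stored as the strings "TRUE" and "FALSE", as in the string-typed DataFrame above):

```python
import pandas as pd

# Hypothetical Arrest column stored as text
arrest = pd.Series(["FALSE", "TRUE", "FALSE", "FALSE", "TRUE"])

# Comparing with "TRUE" gives a boolean mask; summing it counts the matches
n_arrests = int((arrest == "TRUE").sum())
print(n_arrests)
```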

How many observations have a LocationDescription value of ALLEY? 2308

In [28]:
df['LocationDescription'].value_counts()

STREET                                             156564
PARKING LOT/GARAGE(NON.RESID.)                      14852
OTHER                                                4573
ALLEY                                                2308
GAS STATION                                          2111
DRIVEWAY - RESIDENTIAL                               1675
RESIDENTIAL YARD (FRONT/BACK)                        1536
RESIDENCE                                            1302
RESIDENCE-GARAGE                                     1176
VACANT LOT/LAND                                       985
VEHICLE NON-COMMERCIAL                                817
SIDEWALK                                              462
CHA PARKING LOT/GROUNDS                               405
AIRPORT/AIRCRAFT                                      363
POLICE FACILITY/VEH PARKING LOT                       266
PARK PROPERTY                                         255
SCHOOL, PUBLIC, GROUNDS                               206
APARTMENT     
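
Rather than scanning the full listing, a single category can be looked up in the result of `value_counts()`. A minimal sketch with made-up location values (`locations` and `n_alley` are illustrative names):

```python
import pandas as pd

# Hypothetical LocationDescription values
locations = pd.Series(
    ["STREET", "ALLEY", "STREET", "GAS STATION", "ALLEY", "STREET"]
)

# value_counts() returns a Series indexed by category, sorted by frequency
counts = locations.value_counts()

# .get() returns the count for one label, or the default if it is absent
n_alley = int(counts.get("ALLEY", 0))
print(n_alley)
```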

## UNDERSTANDING DATES

In many datasets, like this one, you have a date field. Unfortunately, pandas does not automatically recognize entries that look like dates; here they were read in as plain strings. We need to use a pandas function to extract the date and time. Take a look at the first few entries of Date.

In what format are the entries in the variable Date? Month/Day/Year Hour:Minute 

In [29]:
df.Date.head()

0    12/31/12 23:15
1    12/31/12 22:00
2    12/31/12 22:00
3    12/31/12 22:00
4    12/31/12 21:30
Name: Date, dtype: object

Now, let's convert these strings into datetime values in pandas.

In [30]:
from datetime import datetime

In [52]:
df['NewDate']=pd.to_datetime(df['Date'], format="%m/%d/%y %H:%M")
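
The format string maps each piece of the text to a date component: `%m` month, `%d` day, `%y` two-digit year, `%H` hour (24-hour clock), `%M` minute. A minimal sketch on two sample strings in the same layout; the malformed entry and the use of `errors="coerce"` are illustrative assumptions, not something the dataset is known to need:

```python
import pandas as pd

fmt = "%m/%d/%y %H:%M"

# Two entries in the Month/Day/Year Hour:Minute layout
raw = pd.Series(["12/31/12 23:15", "12/31/12 21:30"])
parsed = pd.to_datetime(raw, format=fmt)
print(parsed)

# errors="coerce" turns entries that do not match the format into NaT
messy = pd.Series(["12/31/12 23:15", "not a date"])
coerced = pd.to_datetime(messy, format=fmt, errors="coerce")
print(coerced)
```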

In [50]:
df.head()

| | ID | Date | LocationDescription | Arrest | Domestic | Beat | District | CommunityArea | Year | Latitude | Longitude | NewDate |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 8951354 | 12/31/12 23:15 | STREET | False | False | 623 | 6 | 69 | 2012 | 41.75628399 | -87.62164472 | 2012-12-31 23:15:00 |
| 1 | 8951141 | 12/31/12 22:00 | STREET | False | False | 1213 | 12 | 24 | 2012 | 41.89878849 | -87.66130317 | 2012-12-31 22:00:00 |
| 2 | 8952745 | 12/31/12 22:00 | RESIDENTIAL YARD (FRONT/BACK) | False | False | 1622 | 16 | 11 | 2012 | 41.96918578 | -87.76766974 | 2012-12-31 22:00:00 |
| 3 | 8952223 | 12/31/12 22:00 | STREET | False | False | 724 | 7 | 67 | 2012 | 41.76932868 | -87.65772562 | 2012-12-31 22:00:00 |
| 4 | 8951608 | 12/31/12 21:30 | STREET | False | False | 211 | 2 | 35 | 2012 | 41.83756759 | -87.62176133 | 2012-12-31 21:30:00 |


What is the month and year of the median date in our dataset? Enter your answer as "Month Year", without the quotes. (Ex: if the answer was 2008-03-28, you would give the answer "March 2008", without the quotes.)
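
One way to approach this is to take the median of the parsed datetime column and format it. A minimal sketch on three made-up dates (not the real data, so it does not answer the question itself); `%B` prints the full month name and assumes an English locale:

```python
import pandas as pd

# Hypothetical dates in the same layout as the Date column
dates = pd.to_datetime(
    pd.Series(["01/15/05 10:00", "03/28/08 12:00", "12/31/12 23:15"]),
    format="%m/%d/%y %H:%M",
)

# With an odd number of values, the median is the middle one once sorted
median_date = dates.median()

# Format as "Month Year"
answer = median_date.strftime("%B %Y")
print(median_date, "->", answer)
```

For the notebook's data, the analogous call would be `df['NewDate'].median()`.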