# Chicago Crimes
This example shows an exploratory data analysis (EDA) of crimes in Chicago.

The original examples can be found [here](https://medium.com/@ahsanzafar222/chicago-crime-data-cleaning-and-eda-a744c687a291) and [here](https://www.kaggle.com/fahd09/eda-of-crime-in-chicago-2005-2016).


### Notes on running these queries:

Bodo is used by default; it distributes data chunks across cores automatically.

The dataset, found [here](https://www.kaggle.com/currie32/crimes-in-chicago), is ~1.5GB.

To run the code:
1. Make sure you [add your AWS account credentials to Saturn Cloud](https://saturncloud.io/docs/examples/python/load-data/qs-load-data-s3/#create-aws-credentials) to access the data.
2. If you want to run a query in regular pandas:
    1. Comment out the lines with the Jupyter parallel magic (%%px) and the Bodo decorator (@bodo.jit) in all the code cells.
    2. Then, re-run cells from the beginning.
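
As an alternative to commenting the decorator out by hand, the decorator can be placed behind a switch. The sketch below is an assumption, not part of the original notebook: `USE_BODO` and `maybe_jit` are hypothetical names, the decorated function is a toy, and the `%%px` magic still has to be removed from each cell separately.

```python
USE_BODO = False  # hypothetical switch: set to True to compile with Bodo


def maybe_jit(**jit_kwargs):
    """Return bodo.jit(**jit_kwargs) when USE_BODO is set, else a no-op decorator."""
    if USE_BODO:
        import bodo  # imported lazily so plain-pandas runs do not need Bodo

        return bodo.jit(**jit_kwargs)
    # With the switch off, hand the function back unchanged.
    return lambda func: func


@maybe_jit(cache=True)
def double_all(values):
    return [v * 2 for v in values]


print(double_all([1, 2, 3]))
```

With the switch off, `double_all` is an ordinary Python function, so the same cell body can be run with or without Bodo.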


### Start an IPyParallel cluster
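Run the following code in a cell to start an IPyParallel cluster. The number of engines is the number of physical cores, capped at 8; the output shown here comes from a machine where this resolved to 4.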
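
The engine count in the cell below is computed as `min(physical cores, 8)`. `psutil.cpu_count(logical=False)` is documented to return `None` when the count cannot be determined, in which case `min` would raise a `TypeError`. A minimal sketch of a guard, assuming a hypothetical helper named `pick_engine_count` that is not part of the notebook:

```python
import os


def pick_engine_count(physical_cores, cap=8):
    """Mirror n = min(physical_cores, cap), with a fallback when the count is unknown."""
    if physical_cores is None:
        # Fall back to the logical CPU count from the standard library,
        # or to 1 if that is unknown as well.
        physical_cores = os.cpu_count() or 1
    return max(1, min(physical_cores, cap))


print(pick_engine_count(4))
print(pick_engine_count(16))
```

In the notebook this would be called as `pick_engine_count(psutil.cpu_count(logical=False))`.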

In [1]:
import ipyparallel as ipp

import psutil

n = min(psutil.cpu_count(logical=False), 8)
rc = ipp.Cluster(engines="mpi", n=n).start_and_connect_sync(activate=True)

Starting 4 engines with <class 'ipyparallel.cluster.launcher.MPIEngineSetLauncher'>



### Verifying your setup
Run the following code to verify that your IPyParallel cluster is set up correctly:

In [2]:
%%px
import bodo

print(f"Hello World from rank {bodo.get_rank()}. Total ranks={bodo.get_size()}")

[stdout:3] Hello World from rank 3. Total ranks=4


[stdout:0] Hello World from rank 0. Total ranks=4


[stdout:1] Hello World from rank 1. Total ranks=4


[stdout:2] Hello World from rank 2. Total ranks=4


## Importing the Packages

These are the main packages we are going to work with:
 - Bodo to parallelize Python code automatically
 - Pandas to work with data

In [3]:
%%px
import warnings

warnings.filterwarnings("ignore")

import time

import bodo
import pandas as pd

## Load Crimes Data in Chicago 2005 - 2017

In [4]:
%%px
@bodo.jit(distributed=["crimes"], cache=True)
def load_chicago_crimes():
    t1 = time.time()
    crimes1 = pd.read_csv(
        "s3://bodo-examples-data/chicago-crimes/Chicago_Crimes_2005_to_2007.csv"
    )
    crimes2 = pd.read_csv(
        "s3://bodo-examples-data/chicago-crimes/Chicago_Crimes_2008_to_2011.csv"
    )
    crimes3 = pd.read_csv(
        "s3://bodo-examples-data/chicago-crimes/Chicago_Crimes_2012_to_2017.csv"
    )
    crimes = pd.concat([crimes1, crimes2, crimes3], ignore_index=False, axis=0)
    crimes = crimes.sort_values(by="ID")
    print("Reading time: ", ((time.time() - t1) * 1000), " (ms)")
    print(crimes.head())
    return crimes


crimes = load_chicago_crimes()


[stdout:1] Empty DataFrame
Columns: [Unnamed: 0, ID, Case Number, Date, Block, IUCR, Primary Type, Description, Location Description, Arrest, Domestic, Beat, District, Ward, Community Area, FBI Code, X Coordinate, Y Coordinate, Year, Updated On, Latitude, Longitude, Location]
Index: []

[0 rows x 23 columns]


[stdout:3] Empty DataFrame
Columns: [Unnamed: 0, ID, Case Number, Date, Block, IUCR, Primary Type, Description, Location Description, Arrest, Domestic, Beat, District, Ward, Community Area, FBI Code, X Coordinate, Y Coordinate, Year, Updated On, Latitude, Longitude, Location]
Index: []

[0 rows x 23 columns]


[stdout:2] Empty DataFrame
Columns: [Unnamed: 0, ID, Case Number, Date, Block, IUCR, Primary Type, Description, Location Description, Arrest, Domestic, Beat, District, Ward, Community Area, FBI Code, X Coordinate, Y Coordinate, Year, Updated On, Latitude, Longitude, Location]
Index: []

[0 rows x 23 columns]


[stdout:0] Reading time:  29582.451105117798  (ms)
         Unnamed: 0    ID Case Number                    Date  \
1324003     4897380  3012    HL101040  01/01/2005 01:15:00 PM   
1324004     4898204  3013    HK826899  01/02/2005 09:45:00 PM   
1324005     4898986  3014    HL106602  01/04/2005 04:39:00 PM   
1324006     4899770  3015    HL107444  01/05/2005 04:07:00 AM   
1324007     4900593  3016    HL112637  01/08/2005 03:15:00 AM   

                         Block  IUCR Primary Type          Description  \
1324003  076XX S GREENWOOD AVE  0110     HOMICIDE  FIRST DEGREE MURDER   
1324004        029XX E 82ND ST  0110     HOMICIDE  FIRST DEGREE MURDER   
1324005  070XX S CONSTANCE AVE  0110     HOMICIDE  FIRST DEGREE MURDER   
1324006     095XX S COLFAX AVE  0110     HOMICIDE  FIRST DEGREE MURDER   
1324007      015XX N DAYTON ST  0110     HOMICIDE  FIRST DEGREE MURDER   

        Location Description  Arrest  ...  Ward  Community Area  FBI Code  \
1324003           VACANT LOT    True
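
The `Unnamed: 0` column in the output above is what pandas produces when a CSV file has an empty first header field, which is typical of a file saved together with its index. The sketch below reproduces the load, concatenate, and sort steps in plain pandas on two tiny in-memory CSVs; the rows are made up for illustration and the variable names are assumptions.

```python
import io

import pandas as pd

# Two tiny in-memory CSVs standing in for the yearly files (made-up rows).
# The empty first header field is what a saved DataFrame index looks like.
csv_a = ",ID,Primary Type\n0,3014,HOMICIDE\n1,3012,HOMICIDE\n"
csv_b = ",ID,Primary Type\n0,3013,HOMICIDE\n"

part_a = pd.read_csv(io.StringIO(csv_a))
part_b = pd.read_csv(io.StringIO(csv_b))

# ignore_index=False keeps each file's own row labels, so labels can repeat.
crimes = pd.concat([part_a, part_b], ignore_index=False, axis=0)
crimes = crimes.sort_values(by="ID")

print(crimes.columns.tolist())
print(crimes["ID"].tolist())
print(crimes.index.tolist())
```

In plain pandas the row labels repeat after the concatenation, which is one reason the analysis relies on the `ID` column rather than on the index.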

## Preprocessing and Cleaning
 1. Drop duplicate records, remove unused columns, and add the day of week and the date of each crime.
 2. Keep only the most frequent crime type categories.
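
The first step can be sketched in plain pandas on a toy frame. The rows below are made up, with dates in the same format as the dataset, and only one unused column is dropped to keep the example short.

```python
import pandas as pd

# Made-up rows in the dataset's date format; the second row is an exact duplicate.
crimes = pd.DataFrame(
    {
        "ID": [3012, 3012, 3014],
        "Date": [
            "01/01/2005 01:15:00 PM",
            "01/01/2005 01:15:00 PM",
            "01/04/2005 04:39:00 PM",
        ],
        "Year": [2005, 2005, 2005],
    }
)

crimes = crimes.drop_duplicates()
crimes = crimes.drop(["Year"], axis=1)  # drop a column the analysis does not use
crimes["Date"] = pd.to_datetime(crimes["Date"], format="%m/%d/%Y %I:%M:%S %p")
crimes["dow"] = crimes["Date"].dt.dayofweek  # Monday=0 ... Sunday=6
crimes["date only"] = crimes["Date"].dt.floor("D")  # midnight of the same day

print(crimes)
```

Passing an explicit `format` avoids ambiguity in how the month/day order and the 12-hour clock are parsed.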


In [5]:
%%px
@bodo.jit(distributed=["crimes"], cache=True)
def data_cleanup(crimes):
    t1 = time.time()
    crimes = crimes.drop_duplicates()
    crimes.drop(
        [
            "Unnamed: 0",
            "Case Number",
            "IUCR",
            "Updated On",
            "Year",
            "FBI Code",
            "Beat",
            "Ward",
            "Community Area",
            "Location",
        ],
        inplace=True,
        axis=1,
    )
    crimes.Date = pd.to_datetime(crimes.Date, format="%m/%d/%Y %I:%M:%S %p")
    crimes["dow"] = crimes["Date"].dt.dayofweek
    crimes["date only"] = crimes["Date"].dt.floor("D")
    crimes = crimes.sort_values(by="ID")
    print("Data cleanup time: ", ((time.time() - t1) * 1000), " (ms)")
    print(crimes.head())
    return crimes


crimes = data_cleanup(crimes)


[stdout:0] Data cleanup time:  7731.170177459717  (ms)
           ID                Date                  Block Primary Type  \
1324003  3012 2005-01-01 13:15:00  076XX S GREENWOOD AVE     HOMICIDE   
1324004  3013 2005-01-02 21:45:00        029XX E 82ND ST     HOMICIDE   
1324005  3014 2005-01-04 16:39:00  070XX S CONSTANCE AVE     HOMICIDE   
1324006  3015 2005-01-05 04:07:00     095XX S COLFAX AVE     HOMICIDE   
1324007  3016 2005-01-08 03:15:00      015XX N DAYTON ST     HOMICIDE   

                 Description Location Description  Arrest  Domestic  District  \
1324003  FIRST DEGREE MURDER           VACANT LOT    True     False       6.0   
1324004  FIRST DEGREE MURDER               STREET    True     False       4.0   
1324005  FIRST DEGREE MURDER               STREET   False     False       3.0   
1324006  FIRST DEGREE MURDER                 AUTO   False     False       4.0   
1324007  FIRST DEGREE MURDER                 CLUB    True     False      18.0   

         X Coordina

[stdout:2] Empty DataFrame
Columns: [ID, Date, Block, Primary Type, Description, Location Description, Arrest, Domestic, District, X Coordinate, Y Coordinate, Latitude, Longitude, dow, date only]
Index: []


[stdout:3] Empty DataFrame
Columns: [ID, Date, Block, Primary Type, Description, Location Description, Arrest, Domestic, District, X Coordinate, Y Coordinate, Latitude, Longitude, dow, date only]
Index: []


[stdout:1] Empty DataFrame
Columns: [ID, Date, Block, Primary Type, Description, Location Description, Arrest, Domestic, District, X Coordinate, Y Coordinate, Latitude, Longitude, dow, date only]
Index: []


In [6]:
%%px
@bodo.jit(distributed=["crimes"], cache=True)
def get_top_crime_types(crimes):
    t1 = time.time()
    top_crime_types = crimes["Primary Type"].value_counts().index[0:10]
    print("Getting top crimes Time: ", ((time.time() - t1) * 1000), " (ms)")
    print(top_crime_types)
    return top_crime_types

top_crime_types = get_top_crime_types(crimes)
top_crime_types = top_crime_types.tolist()


[3:execute]
---------------------------------------------------------------------------
BodoError                                 Traceback (most recent call last)
Input In [5], in <module>
      6     print(top_crime_types)
      7     return top_crime_types
----> 9 top_crime_types = get_top_crime_types(crimes)
     10 top_crime_types = top_crime_types.tolist()

File /srv/conda/envs/saturn/lib/python3.9/site-packages/bodo/numba_compat.py:781, in _compile_for_args(***failed resolving arguments***)
    779     del args
    780     if error:
--> 781         raise error
    782 return tmit__osi

BodoError: [

AlreadyDisplayedError: 4 errors

In [None]:
%%px


@bodo.jit(distributed=["crimes", "top_crimes"], cache=True)
def filter_crimes(crimes, top_crime_types):
    t1 = time.time()
    top_crimes = crimes[crimes["Primary Type"].isin(top_crime_types)]
    print("Filtering crimes Time: ", ((time.time() - t1) * 1000), " (ms)")
    print(top_crimes.head())
    return top_crimes


crimes = filter_crimes(crimes, top_crime_types)
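
The second preprocessing step, selecting the most frequent crime types and filtering on them, can be sketched in plain pandas as follows. The toy data is made up, and the top 2 types are kept instead of the top 10.

```python
import pandas as pd

crimes = pd.DataFrame(
    {
        "ID": [1, 2, 3, 4, 5, 6],
        "Primary Type": ["THEFT", "THEFT", "THEFT", "BATTERY", "BATTERY", "ARSON"],
    }
)

# value_counts() sorts by frequency, so the first N index labels are the top N types.
top_crime_types = crimes["Primary Type"].value_counts().index[0:2].tolist()

# Keep only rows whose type is in the top-N list.
top_crimes = crimes[crimes["Primary Type"].isin(top_crime_types)]

print(top_crime_types)
print(len(top_crimes))
```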

## Crime Analysis

### Finding the pattern of each crime type over the years



In [None]:
%%px
def get_crimes_type_date(crimes):
    t1 = time.time()
    crimes_count_date = crimes.pivot_table(
        index="date only", columns="Primary Type", values="ID", aggfunc="count"
    )
    crimes_count_date.index = pd.DatetimeIndex(crimes_count_date.index)
    result = crimes_count_date.fillna(0).rolling(365).sum()
    result = result.sort_index(ascending=False)
    print("Computing Crime Pattern Time: ", ((time.time() - t1) * 1000), " (ms)")
    print(result.head())


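# The pivots option lists the expected output columns of the pivot table
# assigned to crimes_count_date inside the function.
pivot_values = {"crimes_count_date": top_crime_types}
get_crimes_type_date_jit = bodo.jit(distributed=["crimes"], pivots=pivot_values)(
    get_crimes_type_date
)
get_crimes_type_date_jit(crimes)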
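
To make the pivot and rolling-sum logic easier to follow, here is a plain-pandas sketch on made-up data. A 2-row window stands in for the 365-row window used above; everything else follows the same calls.

```python
import pandas as pd

crimes = pd.DataFrame(
    {
        "ID": [1, 2, 3, 4, 5],
        "date only": pd.to_datetime(
            ["2005-01-01", "2005-01-01", "2005-01-02", "2005-01-03", "2005-01-03"]
        ),
        "Primary Type": ["THEFT", "BATTERY", "THEFT", "THEFT", "THEFT"],
    }
)

# One row per date, one column per crime type, each cell a record count.
counts = crimes.pivot_table(
    index="date only", columns="Primary Type", values="ID", aggfunc="count"
)
# Dates with no record of a given type come out as NaN; treat them as zero.
result = counts.fillna(0).rolling(2).sum()

print(result)
```

Two properties of `rolling` are worth keeping in mind here. With an integer window, the first `window - 1` rows are `NaN` because the default `min_periods` equals the window size. The window also counts rows rather than calendar days, so `rolling(365)` spans exactly one year only if every date appears in the index.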

## A general view of crime records by time, type and location

### Determining the pattern on a daily basis

In [None]:
%%px
@bodo.jit(distributed=["crimes", "crimes_days"], cache=True)
def get_crimes_by_days(crimes):
    t1 = time.time()
    crimes_days = (
        crimes.groupby("dow", as_index=False)["ID"].count().sort_values(by="dow")
    )
    print("Group by days Time: ", ((time.time() - t1) * 1000), " (ms)")
    print(crimes_days.head())
    return crimes_days


crimes_days = get_crimes_by_days(crimes)
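
The `dow` values are integers, which can be hard to read in a table or plot. The sketch below, in plain pandas on made-up data, runs the same group-by and then attaches day names; `DAY_NAMES` and the `day name` column are hypothetical additions, not part of the notebook.

```python
import pandas as pd

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

crimes = pd.DataFrame({"ID": [1, 2, 3, 4], "dow": [5, 1, 5, 0]})

crimes_days = crimes.groupby("dow", as_index=False)["ID"].count().sort_values(by="dow")
# pandas numbers weekdays from Monday=0, so the list index matches "dow".
crimes_days["day name"] = crimes_days["dow"].map(lambda d: DAY_NAMES[d])

print(crimes_days)
```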

### Determining the pattern on a monthly basis

In [None]:
%%px
@bodo.jit(distributed=["crimes", "crimes_months"], cache=True)
def get_crimes_by_months(crimes):
    t1 = time.time()
    crimes["month"] = crimes["Date"].dt.month
    crimes_months = (
        crimes.groupby("month", as_index=False)["ID"].count().sort_values(by="month")
    )
    print("Group by months Time: ", ((time.time() - t1) * 1000), " (ms)")
    print(crimes_months.head())
    return crimes_months


crimes_months = get_crimes_by_months(crimes)

### Determining the pattern by crime type

In [None]:
%%px
@bodo.jit(distributed=["crimes", "crimes_type"], cache=True)
def get_crimes_by_type(crimes):
    t1 = time.time()
    crimes_type = (
        crimes.groupby("Primary Type", as_index=False)["ID"]
        .count()
        .sort_values(by="ID", ascending=False)
    )
    print("Group by type Time: ", ((time.time() - t1) * 1000), " (ms)")
    print(crimes_type.head())
    return crimes_type


crimes_type = get_crimes_by_type(crimes)

### Determining the pattern by location

In [None]:
%%px
@bodo.jit(distributed=["crimes", "crimes_location"], cache=True)
def get_crimes_by_location(crimes):
    t1 = time.time()
    crimes_location = (
        crimes.groupby("Location Description", as_index=False)["ID"]
        .count()
        .sort_values(by="ID", ascending=False)
    )
    print("Group by location Time: ", ((time.time() - t1) * 1000), " (ms)")
    print(crimes_location.head())
    return crimes_location


crimes_location = get_crimes_by_location(crimes)

In [None]:
# To stop the cluster run the following command.
rc.cluster.stop_cluster_sync()

Stopping controller
Controller stopped: {'exit_code': 0, 'pid': 10848, 'identifier': 'ipcontroller-1646173045-246u-10824'}
Stopping engine(s): 1646173046


Stopping cluster <Cluster(cluster_id='1646173045-246u', profile='default', controller=<running>, engine_sets=['1646173046'])>
