# Chicago Crimes - Single-Node Bodo - Jupyter Notebook

This example shows an exploratory data analysis (EDA) of crimes in Chicago using Bodo, an HPC-like compute platform, from a notebook on a single node. The Chicago crime data is read from Bodo's public repository, then cleaned and processed, and several analyses are run to extract insights. All of these steps are **parallelized across multiple cores using Bodo**. This can be a straightforward way to make Python code run faster without requiring much change to the code. The original examples can be found [here](https://medium.com/@ahsanzafar222/chicago-crime-data-cleaning-and-eda-a744c687a291) and [here](https://www.kaggle.com/fahd09/eda-of-crime-in-chicago-2005-2016).

Two markers control parallel execution: the `%%px` magic at the start of a cell runs that cell on every IPyParallel engine, and the `@bodo.jit` decorator marks the functions that Bodo compiles and parallelizes. Removing both and restarting the kernel runs the code without Bodo.

**The Bodo parallel cluster in this example runs within the same Saturn Cloud resource as the notebook.** Thus, to increase the performance of the Bodo cluster you only need to increase the instance size of the Jupyter Server resource it's running on.

**To scale and run your application on multiple nodes, you can use the [Bodo platform](https://platform.bodo.ai/account/login).**

## Start an IPyParallel cluster

Run the following code in a cell to start an IPyParallel cluster. IPyParallel is used to interactively control a cluster of IPython processes. The variable `n` sets the number of engines (processes) in the cluster: one per physical CPU core, capped at 8 (the limit in the free Bodo Community Edition).

In [1]:
import ipyparallel as ipp
import psutil

n = min(psutil.cpu_count(logical=False), 8)

# command to create and start the local cluster
rc = ipp.Cluster(engines="mpi", n=n).start_and_connect_sync(activate=True)
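
One detail of the cell above: `psutil.cpu_count(logical=False)` can return `None` when the physical core count cannot be determined, and `min(None, 8)` raises a `TypeError`. A minimal, more defensive sketch of the same sizing rule is shown below. The helper name `pick_engine_count` and its `cap` parameter are hypothetical, and the standard-library `os.cpu_count()` (which reports logical cores) stands in for `psutil` only to keep the sketch dependency-free.

```python
import os


def pick_engine_count(physical_cores, logical_cores, cap=8):
    """Prefer the physical core count, fall back to logical cores, never exceed cap."""
    # Either count may be None when the platform cannot report it.
    detected = physical_cores or logical_cores or 1
    return max(1, min(detected, cap))


# os.cpu_count() may also return None, which the helper handles.
n = pick_engine_count(None, os.cpu_count())
print(f"Would start {n} engines")
```

In the notebook itself, the first argument would be the value returned by `psutil.cpu_count(logical=False)`.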

In [2]:
%%px
import numpy as np
import pandas as pd
import time
import bodo

print(f"Hello World from rank {bodo.get_rank()}. Total ranks={bodo.get_size()}")

Starting 4 engines with <class 'ipyparallel.cluster.launcher.MPIEngineSetLauncher'>


[stdout:1] Hello World from rank 1. Total ranks=4


[stdout:2] Hello World from rank 2. Total ranks=4


[stdout:0] Hello World from rank 0. Total ranks=4


[stdout:3] Hello World from rank 3. Total ranks=4


## Load Crimes Data in Chicago 2012 - 2017

In [2]:
%%px
@bodo.jit(cache=True)
def load_chicago_crimes():
    t1 = time.time()
    crimes = pd.read_parquet('s3://bodo-example-data/chicago-crimes/Chicago_Crimes_2012_to_2017.pq')
    crimes = crimes.sort_values(by="ID")    
    print("Reading time: ", ((time.time() - t1) * 1000), " (ms)")    
    return crimes

crimes1 = load_chicago_crimes()
if bodo.get_rank()==0:
    display(crimes1.head())

[stdout:0] Reading time:  3257.5977829956173  (ms)


[output:0]

|  | Unnamed: 0 | ID | Case Number | Date | Block | IUCR | Primary Type | Description | Location Description | Arrest | ... | Ward | Community Area | FBI Code | X Coordinate | Y Coordinate | Year | Updated On | Latitude | Longitude | Location |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1267592 | 4105311 | 20224 | HV101396 | 01/02/2012 02:22:00 AM | 030XX W LAWRENCE AVE | 110 | HOMICIDE | FIRST DEGREE MURDER | AUTO | False | ... | 33.0 | 14.0 | 01A | 1155053.0 | 1931730.0 | 2012 | 08/17/2015 03:03:40 PM | 41.96848 | -87.70526 | (41.968479866, -87.705259739) |
| 1267593 | 4105388 | 20225 | HV102221 | 01/02/2012 05:58:00 PM | 024XX E 78TH ST | 110 | HOMICIDE | FIRST DEGREE MURDER | STREET | False | ... | 7.0 | 43.0 | 01A | 1194033.0 | 1853729.0 | 2012 | 08/17/2015 03:03:40 PM | 41.753569 | -87.564503 | (41.75356945, -87.56450286) |
| 1267594 | 4105463 | 20226 | HV102145 | 01/02/2012 05:53:00 PM | 066XX S WOLCOTT AVE | 110 | HOMICIDE | FIRST DEGREE MURDER | HOUSE | False | ... | 15.0 | 67.0 | 01A | 1164829.0 | 1860636.0 | 2012 | 08/17/2015 03:03:40 PM | 41.77319 | -87.67133 | (41.773189519, -87.671329907) |
| 1267595 | 4105549 | 20227 | HV101433 | 01/02/2012 05:15:00 AM | 107XX S COTTAGE GROVE AVE | 110 | HOMICIDE | FIRST DEGREE MURDER | STREET | True | ... | 9.0 | 50.0 | 01A | 1182247.0 | 1833951.0 | 2012 | 08/17/2015 03:03:40 PM | 41.699577 | -87.608304 | (41.699577165, -87.608304224) |
| 1267596 | 4105635 | 20228 | HV102986 | 01/03/2012 12:07:00 PM | 010XX N PULASKI RD | 110 | HOMICIDE | FIRST DEGREE MURDER | STREET | False | ... | 37.0 | 23.0 | 01A | 1149528.0 | 1906741.0 | 2012 | 08/17/2015 03:03:40 PM | 41.900017 | -87.726226 | (41.900017263, -87.726225708) |


## Preprocessing and Cleaning
 1. Drop duplicate cases, remove unused columns, and add the day of week and the date of each crime.
 2. Keep only the most frequent crime type categories.


In [3]:
%%px
@bodo.jit(distributed=["crimes"], cache=True)
def data_cleanup(crimes):
    t1 = time.time()    
    crimes = crimes.drop_duplicates()    
    crimes.drop(['Unnamed: 0', 'Case Number', 'IUCR','Updated On','Year', 'FBI Code', 'Beat','Ward','Community Area', 'Location'], inplace=True, axis=1)
    crimes.Date = pd.to_datetime(crimes.Date, format='%m/%d/%Y %I:%M:%S %p')
    crimes["dow"] = crimes["Date"].dt.dayofweek
    crimes["date only"] = crimes["Date"].dt.floor('D')
    crimes = crimes.sort_values(by="ID")    
    print("Data cleanup time: ", ((time.time() - t1) * 1000), " (ms)")
    return crimes

crimes = data_cleanup(crimes1)
if bodo.get_rank()==0:
    display(crimes.head())

[stdout:0] Data cleanup time:  4282.493192246875  (ms)


[output:0]

|  | ID | Date | Block | Primary Type | Description | Location Description | Arrest | Domestic | District | X Coordinate | Y Coordinate | Latitude | Longitude | dow | date only |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1267592 | 20224 | 2012-01-02 02:22:00 | 030XX W LAWRENCE AVE | HOMICIDE | FIRST DEGREE MURDER | AUTO | False | False | 17.0 | 1155053.0 | 1931730.0 | 41.96848 | -87.70526 | 0 | 2012-01-02 |
| 1267593 | 20225 | 2012-01-02 17:58:00 | 024XX E 78TH ST | HOMICIDE | FIRST DEGREE MURDER | STREET | False | False | 4.0 | 1194033.0 | 1853729.0 | 41.753569 | -87.564503 | 0 | 2012-01-02 |
| 1267594 | 20226 | 2012-01-02 17:53:00 | 066XX S WOLCOTT AVE | HOMICIDE | FIRST DEGREE MURDER | HOUSE | False | False | 7.0 | 1164829.0 | 1860636.0 | 41.77319 | -87.67133 | 0 | 2012-01-02 |
| 1267595 | 20227 | 2012-01-02 05:15:00 | 107XX S COTTAGE GROVE AVE | HOMICIDE | FIRST DEGREE MURDER | STREET | True | False | 5.0 | 1182247.0 | 1833951.0 | 41.699577 | -87.608304 | 0 | 2012-01-02 |
| 1267596 | 20228 | 2012-01-03 12:07:00 | 010XX N PULASKI RD | HOMICIDE | FIRST DEGREE MURDER | STREET | False | False | 11.0 | 1149528.0 | 1906741.0 | 41.900017 | -87.726226 | 1 | 2012-01-03 |
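
The date handling in `data_cleanup` can be checked in isolation with plain pandas. The sketch below, which assumes only that pandas is installed, parses three raw `Date` strings taken from the loaded data shown earlier and derives the same two columns: `%I` with `%p` reads the 12-hour clock, `dt.dayofweek` numbers Monday as 0, and `dt.floor('D')` truncates the timestamp to midnight.

```python
import pandas as pd

# Raw strings in the same format as the "Date" column of the source data.
raw = pd.Series([
    "01/02/2012 02:22:00 AM",
    "01/02/2012 05:58:00 PM",
    "01/03/2012 12:07:00 PM",
])

# An explicit format avoids ambiguity between month-first and day-first dates.
parsed = pd.to_datetime(raw, format="%m/%d/%Y %I:%M:%S %p")
dow = parsed.dt.dayofweek        # Monday = 0 ... Sunday = 6
date_only = parsed.dt.floor("D")  # drop the time of day

print(pd.DataFrame({"Date": parsed, "dow": dow, "date only": date_only}))
```

This runs without Bodo or IPyParallel, which makes it a convenient way to validate a format string before applying it to the full dataset.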


In [4]:
%%px
@bodo.jit(cache=True)
def get_top_crime_types(crimes):
    t1 = time.time()
    top_crime_types = crimes['Primary Type'].value_counts().index[0:10]
    print("Getting top crimes Time: ", ((time.time() - t1) * 1000), " (ms)")
    return top_crime_types

top_crime_types = get_top_crime_types(crimes)
top_crime_types = bodo.allgatherv(top_crime_types)
if bodo.get_rank()==0:
    print(top_crime_types)

[stdout:0] Getting top crimes Time:  148.8639335211701  (ms)
Index(['THEFT', 'BATTERY', 'CRIMINAL DAMAGE', 'NARCOTICS', 'ASSAULT',
       'OTHER OFFENSE', 'BURGLARY', 'DECEPTIVE PRACTICE',
       'MOTOR VEHICLE THEFT', 'ROBBERY'],
      dtype='object')


In [5]:
%%px

@bodo.jit(cache=True)
def filter_crimes(crimes, top_crime_types):
    t1 = time.time()
    top_crimes = crimes[crimes['Primary Type'].isin(top_crime_types)]
    print("Filtering crimes Time: ", ((time.time() - t1) * 1000), " (ms)")
    return top_crimes

crimes = filter_crimes(crimes, top_crime_types)
if bodo.get_rank()==0:
    display(crimes.head())

[stdout:0] Filtering crimes Time:  174.47385585728625  (ms)


[output:0]

|  | ID | Date | Block | Primary Type | Description | Location Description | Arrest | Domestic | District | X Coordinate | Y Coordinate | Latitude | Longitude | dow | date only |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 77270 | 8421394 | 2012-01-01 00:15:00 | 004XX E ILLINOIS ST | BATTERY | SIMPLE | OTHER | False | False | 18.0 | 1179396.0 | 1903711.0 | 41.89107 | -87.616614 | 6 | 2012-01-01 |
| 77272 | 8421398 | 2012-01-01 00:23:00 | 033XX N HALSTED ST | ASSAULT | AGGRAVATED:KNIFE/CUTTING INSTR | BAR OR TAVERN | True | False | 19.0 | 1170335.0 | 1922325.0 | 41.942351 | -87.649345 | 6 | 2012-01-01 |
| 77273 | 8421402 | 2012-01-01 00:30:00 | 092XX S DR MARTIN LUTHER KING JR DR | BATTERY | AGGRAVATED: OTHER DANG WEAPON | SIDEWALK | False | False | 6.0 | 1180537.0 | 1843779.0 | 41.726586 | -87.614265 | 6 | 2012-01-01 |
| 77274 | 8421404 | 2012-01-01 00:23:00 | 002XX W 118TH ST | CRIMINAL DAMAGE | TO CITY OF CHICAGO PROPERTY | STREET | False | False | 5.0 | 1176631.0 | 1826688.0 | 41.679774 | -87.629085 | 6 | 2012-01-01 |
| 77276 | 8421408 | 2012-01-01 00:40:00 | 008XX E 79TH ST | NARCOTICS | POSS: CANNABIS 30GMS OR LESS | PARKING LOT/GARAGE(NON.RESID.) | True | False | 6.0 | 1183364.0 | 1852811.0 | 41.751305 | -87.603629 | 6 | 2012-01-01 |
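
Step 2 of the preprocessing, keeping only the most frequent crime types, comes down to two pandas operations: `value_counts()` returns counts sorted in descending order, so slicing its index gives the top categories, and `isin()` builds the boolean mask used to filter rows. The following is a minimal plain-pandas sketch on made-up rows (the toy data and the cutoff of 2 instead of 10 are assumptions for illustration), without the Bodo decorators or the `bodo.allgatherv` step.

```python
import pandas as pd

# Toy data: 4 thefts, 3 batteries, 1 arson (hypothetical rows).
crimes = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5, 6, 7, 8],
    "Primary Type": ["THEFT", "THEFT", "THEFT", "THEFT",
                     "BATTERY", "BATTERY", "BATTERY", "ARSON"],
})

# value_counts() sorts by frequency, so the first k index labels are the top k types.
top_types = crimes["Primary Type"].value_counts().index[0:2]

# Keep only rows whose type is among the top k.
top_crimes = crimes[crimes["Primary Type"].isin(top_types)]

print(list(top_types))
print(len(top_crimes), "of", len(crimes), "rows kept")
```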


## Crime Analysis

### Find the pattern of each crime over the years



In [6]:
%%px
@bodo.jit(cache=True)
def get_crimes_count_date(crimes):
    t1 = time.time()
    crimes_count_date = crimes.pivot_table(index='date only', columns='Primary Type', values='ID', aggfunc="count")
    print("Computing Crime Pattern Time: ", ((time.time() - t1) * 1000), " (ms)")
    return crimes_count_date

crimes_count_date = get_crimes_count_date(crimes)

[stdout:0] Computing Crime Pattern Time:  234.7587114199996  (ms)


In [7]:
%%px

@bodo.jit
def get_crimes_type_date(crimes_count_date):
    t1 = time.time()
    crimes_count_date.index = pd.DatetimeIndex(crimes_count_date.index)
    result = crimes_count_date.fillna(0).rolling(365).sum()
    result = result.sort_index(ascending=False)
    print("Computing Crime Pattern Time: ", ((time.time() - t1) * 1000), " (ms)")
    return result

# Store the result under its own name so the function is not overwritten
# and the cell can be re-run.
crimes_type_date = get_crimes_type_date(crimes_count_date)
if bodo.get_rank()==0:
    display(crimes_type_date.head())

[stdout:0] Computing Crime Pattern Time:  244.78003393505787  (ms)


[output:0]

|  | ROBBERY | BATTERY | THEFT | ASSAULT | OTHER OFFENSE | DECEPTIVE PRACTICE | NARCOTICS | MOTOR VEHICLE THEFT | BURGLARY | CRIMINAL DAMAGE |
|---|---|---|---|---|---|---|---|---|---|---|
| 2017-01-18 | 11285.0 | 51795.0 | 65040.0 | 18258.0 | 17780.0 | 15159.0 | 26614.0 | 12271.0 | 16294.0 | 30518.0 |
| 2017-01-17 | 11033.0 | 49992.0 | 63479.0 | 17709.0 | 17223.0 | 15775.0 | 23804.0 | 10927.0 | 15240.0 | 29348.0 |
| 2017-01-16 | 10773.0 | 51417.0 | 62199.0 | 17806.0 | 16866.0 | 15163.0 | 23896.0 | 10931.0 | 14937.0 | 29827.0 |
| 2017-01-15 | 11028.0 | 50080.0 | 63491.0 | 17711.0 | 17234.0 | 15762.0 | 23864.0 | 10923.0 | 15231.0 | 29375.0 |
| 2017-01-14 | 11239.0 | 53092.0 | 64371.0 | 18019.0 | 17021.0 | 14610.0 | 26102.0 | 11837.0 | 16150.0 | 31212.0 |
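
The two cells above first pivot the data into one row per date and one column per crime type, then take a rolling sum over it. A small plain-pandas sketch on made-up rows makes the intermediate shapes easier to see. It uses a window of 2 instead of 365 so that the result fits on screen. Note that `rolling(365)` counts 365 rows, which matches one calendar year only if every date has a row, and that the first `window - 1` rows are `NaN` because the window is not yet full.

```python
import pandas as pd

# Toy data (hypothetical rows): no BATTERY record on 2012-01-02.
crimes = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5, 6],
    "date only": pd.to_datetime([
        "2012-01-01", "2012-01-01", "2012-01-01",
        "2012-01-02", "2012-01-03", "2012-01-03",
    ]),
    "Primary Type": ["THEFT", "THEFT", "BATTERY", "THEFT", "BATTERY", "THEFT"],
})

# One row per date, one column per crime type; missing combinations become NaN.
counts = crimes.pivot_table(index="date only", columns="Primary Type",
                            values="ID", aggfunc="count")

# Treat missing combinations as zero, sum over a 2-row window, newest date first.
rolled = counts.fillna(0).rolling(2).sum().sort_index(ascending=False)

print(counts)
print(rolled)
```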


## A general view of crime records by time, type and location

### Determining the pattern on a daily basis

In [8]:
%%px
@bodo.jit(distributed=['crimes', 'crimes_days'], cache=True)
def get_crimes_by_days(crimes):
    t1 = time.time()
    crimes_days = crimes.groupby('dow', as_index=False)['ID'].count().sort_values(by='dow')
    print("Group by days Time: ", ((time.time() - t1) * 1000), " (ms)")
    return crimes_days
    
crimes_days = get_crimes_by_days(crimes)
if bodo.get_rank()==0:
    display(crimes_days.head())

[stdout:0] Group by days Time:  36.926575777215476  (ms)


[output:0]

|  | dow | ID |
|---|---|---|
| 3 | 0 | 190485 |
| 0 | 1 | 189223 |
| 1 | 2 | 191247 |
| 2 | 3 | 189308 |
| 4 | 4 | 200886 |
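
The grouping used here, and reused for months, crime types, and locations, follows one pattern: group by a key, count the `ID` values in each group, and sort. A minimal plain-pandas sketch on made-up rows, assuming the `dow` column has already been computed:

```python
import pandas as pd

# Toy data (hypothetical rows): dow uses pandas numbering, Monday = 0 ... Sunday = 6.
crimes = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5],
    "dow": [6, 0, 0, 1, 6],
})

# as_index=False keeps "dow" as a regular column instead of moving it to the index.
crimes_days = (
    crimes.groupby("dow", as_index=False)["ID"]
    .count()
    .sort_values(by="dow")
)

print(crimes_days)
```

After the aggregation, the `ID` column holds the number of records per group, not an identifier.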


### Determining the pattern on a monthly basis

In [9]:
%%px
@bodo.jit(distributed=['crimes', 'crimes_months'], cache=True)
def get_crimes_by_months(crimes):
    t1 = time.time()
    crimes['month'] = crimes["Date"].dt.month
    crimes_months = crimes.groupby('month', as_index=False)['ID'].count().sort_values(by='month')
    print("Group by months Time: ", ((time.time() - t1) * 1000), " (ms)")
    return crimes_months
    
crimes_months = get_crimes_by_months(crimes)
if bodo.get_rank()==0:
    display(crimes_months.head())

[stdout:0] Group by months Time:  47.583010580638074  (ms)


[output:0]

|  | month | ID |
|---|---|---|
| 2 | 1 | 113675 |
| 3 | 2 | 90123 |
| 4 | 3 | 109104 |
| 6 | 4 | 108457 |
| 9 | 5 | 119081 |


### Determining the pattern by crime type

In [10]:
%%px
@bodo.jit(distributed=['crimes', 'crimes_type'], cache=True)
def get_crimes_by_type(crimes):
    t1 = time.time()
    crimes_type = crimes.groupby('Primary Type', as_index=False)['ID'].count().sort_values(by='ID', ascending=False)
    print("Group by type Time: ", ((time.time() - t1) * 1000), " (ms)")
    return crimes_type
    
crimes_type = get_crimes_by_type(crimes)
if bodo.get_rank()==0:
    display(crimes_type.head())

[stdout:0] Group by type Time:  68.05387312579114  (ms)


[output:0]

|  | Primary Type | ID |
|---|---|---|
| 2 | THEFT | 329460 |
| 1 | BATTERY | 263700 |
| 7 | CRIMINAL DAMAGE | 155455 |
| 4 | NARCOTICS | 135240 |
| 3 | ASSAULT | 91289 |
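
Sorting the grouped counts in descending order gives the same ranking that `value_counts()` produced during preprocessing, provided there are no missing `ID` or `Primary Type` values. The sketch below checks this on made-up rows in plain pandas; it illustrates the equivalence and is not a claim about which form Bodo handles better.

```python
import pandas as pd

# Toy data (hypothetical rows) with no missing values.
crimes = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5, 6, 7, 8],
    "Primary Type": ["THEFT", "BATTERY", "THEFT", "ARSON",
                     "THEFT", "BATTERY", "THEFT", "BATTERY"],
})

# Explicit form used in the notebook: group, count, sort descending.
by_type = (
    crimes.groupby("Primary Type", as_index=False)["ID"]
    .count()
    .sort_values(by="ID", ascending=False)
)

# Shorthand: value_counts() already returns counts in descending order.
shortcut = crimes["Primary Type"].value_counts()

print(by_type)
print(shortcut)
```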


### Determining the pattern by location

In [11]:
%%px
@bodo.jit(distributed=['crimes', 'crimes_location'], cache=True)
def get_crimes_by_location(crimes):
    t1 = time.time()
    crimes_location = crimes.groupby('Location Description', as_index=False)['ID'].count().sort_values(by='ID', ascending=False)
    print("Group by location Time: ", ((time.time() - t1) * 1000), " (ms)")
    return crimes_location
    
crimes_location = get_crimes_by_location(crimes)
if bodo.get_rank()==0:
    display(crimes_location.head())

[stdout:0] Group by location Time:  70.30599407698901  (ms)


[output:0]

|  | Location Description | ID |
|---|---|---|
| 49 | STREET | 306860 |
| 78 | RESIDENCE | 216611 |
| 50 | APARTMENT | 173373 |
| 48 | SIDEWALK | 147414 |
| 77 | OTHER | 51854 |
