### How to load the crime data

1. To download a small sample, go to https:... and download the CSV file to your own machine.
2. To download the whole dataset, go to http:, select download by API, and then copy the command and paste it on the server.

Next steps:
1. Transform the CSV files into a Parquet dataset on S3.

### Content of the dataset 

The dataset can be seen as one large table (8.4 million rows). The content of these rows is described below.

Students should familiarize themselves with the meaning of each column.

In [None]:
import dask.dataframe as dd

# https://data.cityofchicago.org/Public-Safety/Crimes-2023/xguy-4ndq/about_data
file_path = "/home/akash2016/dask-CSE255/chicago_crimes/Crimes_2023.csv"

# lazy data load
df = dd.read_csv(
    file_path,
    assume_missing=True,      # would read integer columns as floats; redundant while dtype=str
    dtype=str,                # start with string to inspect columns safely
    blocksize="64MB"          # each partition ~64MB
)

# peek at the first rows (head() computes eagerly, on the first partition only)
df.head()

| | ID | Case Number | Date | Block | IUCR | Primary Type | Description | Location Description | Arrest | Domestic | ... | Ward | Community Area | FBI Code | X Coordinate | Y Coordinate | Year | Updated On | Latitude | Longitude | Location |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 13327763 | JH103488 | 12/31/2023 11:59:00 PM | 010XX N ORLEANS ST | 1320 | CRIMINAL DAMAGE | TO VEHICLE | STREET | False | False | ... | 27 | 8 | 14 | 1173727 | 1907173 | 2023 | 2024 Dec 21 03:40:46 PM | 41.900698378 | -87.637329754 | POINT (-87.637329754 41.900698378) |
| 1 | 13325009 | JH100002 | 12/31/2023 11:51:00 PM | 051XX S PRINCETON AVE | 550 | ASSAULT | AGGRAVATED POLICE OFFICER - HANDGUN | STREET | True | False | ... | 20 | 37 | 04A | 1175152 | 1871065 | 2023 | 2024 Dec 21 03:40:46 PM | 41.801583507 | -87.633177068 | POINT (-87.633177068 41.801583507) |
| 2 | 13324997 | JH100010 | 12/31/2023 11:51:00 PM | 009XX E 77TH ST | 530 | ASSAULT | AGGRAVATED - OTHER DANGEROUS WEAPON | APARTMENT | False | True | ... | 8 | 69 | 04A | 1183685 | 1854148 | 2023 | 2024 Dec 21 03:40:46 PM | 41.754966726 | -87.602410989 | POINT (-87.602410989 41.754966726) |
| 3 | 13324881 | JH100006 | 12/31/2023 11:50:00 PM | 051XX S WASHTENAW AVE | 486 | BATTERY | DOMESTIC BATTERY SIMPLE | RESIDENCE | False | True | ... | 14 | 63 | 08B | 1159244 | 1870437 | 2023 | 2024 Dec 21 03:40:46 PM | 41.800200965 | -87.691535096 | POINT (-87.691535096 41.800200965) |
| 4 | 13324829 | JG561343 | 12/31/2023 11:50:00 PM | 014XX N LOCKWOOD AVE | 454 | BATTERY | AGGRAVATED P.O. - HANDS, FISTS, FEET, NO / MIN... | STREET | False | False | ... | 37 | 25 | 08B | 1140764 | 1909050 | 2023 | 2024 Dec 21 03:40:46 PM | 41.906519104 | -87.758359629 | POINT (-87.758359629 41.906519104) |


In [None]:
# in Dask, df.shape is (lazy row count, number of columns); the row count needs .compute()
df.shape

In [2]:
print(df.dtypes)

ID                      string[pyarrow]
Case Number             string[pyarrow]
Date                    string[pyarrow]
Block                   string[pyarrow]
IUCR                    string[pyarrow]
Primary Type            string[pyarrow]
Description             string[pyarrow]
Location Description    string[pyarrow]
Arrest                  string[pyarrow]
Domestic                string[pyarrow]
Beat                    string[pyarrow]
District                string[pyarrow]
Ward                    string[pyarrow]
Community Area          string[pyarrow]
FBI Code                string[pyarrow]
X Coordinate            string[pyarrow]
Y Coordinate            string[pyarrow]
Year                    string[pyarrow]
Updated On              string[pyarrow]
Latitude                string[pyarrow]
Longitude               string[pyarrow]
Location                string[pyarrow]
dtype: object


In [3]:
# number of rows; .compute() triggers a full pass over the CSV
df.shape[0].compute()

263137

In [4]:
# lazy: this only adds a task to the Dask graph; the conversion runs when compute() is called
# an explicit format matching e.g. "12/31/2023 11:59:00 PM" avoids per-value format inference
df["Date"] = dd.to_datetime(df["Date"], format="%m/%d/%Y %I:%M:%S %p", errors="coerce")
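
The `Date` strings in the preview use a month/day/year layout with a 12-hour clock. As a quick check that does not need Dask, the standard library can confirm which format string matches them. The format used here is inferred from the sample rows shown earlier, not taken from the dataset documentation, and parsing `AM`/`PM` assumes an English or C locale.

```python
from datetime import datetime

# Format inferred from the sample rows, e.g. "12/31/2023 11:59:00 PM".
DATE_FORMAT = "%m/%d/%Y %I:%M:%S %p"

sample = "12/31/2023 11:59:00 PM"
parsed = datetime.strptime(sample, DATE_FORMAT)
print(parsed.isoformat())  # 2023-12-31T23:59:00
```

If the format holds for every row, the same string can be passed as the `format=` argument of `to_datetime` so that pandas does not have to guess it.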

In [5]:
# number of records per crime type (first 10 rows of the result)
df["Primary Type"].value_counts().compute().head(10)

Primary Type
ARSON                                  513
ASSAULT                              22629
BATTERY                              44250
BURGLARY                              7487
CONCEALED CARRY LICENSE VIOLATION      205
CRIMINAL DAMAGE                      30093
CRIMINAL SEXUAL ASSAULT               1670
CRIMINAL TRESPASS                     4720
DECEPTIVE PRACTICE                   17419
GAMBLING                                15
Name: count, dtype: int64[pyarrow]


In [6]:
# simple aggregations
# extract year and filter first
df["year"] = df["Date"].dt.year
df_2023 = df[df["year"] == 2023]  # or whatever year you want to analyze

df_2023 = df_2023.assign(month=df_2023["Date"].dt.month)
crimes_by_month = df_2023.groupby("month")["ID"].count().compute()
print(crimes_by_month)

month
1     21304
2     18478
3     20796
4     20796
5     22271
6     22747
7     24053
8     24216
9     22646
10    23091
11    21418
12    21321
Name: ID, dtype: int64


In [7]:
# Arrest was read as text; convert "true"/"false" strings to real booleans
df["Arrest"] = df["Arrest"].map(lambda x: str(x).lower() == "true", meta=("Arrest", "bool"))
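
The explicit string comparison matters because `Arrest` was loaded as text: Python's `bool()` returns `True` for any non-empty string, including `"False"`. A small standard-library sketch of the pitfall (the helper name `to_bool` is hypothetical):

```python
def to_bool(value) -> bool:
    """Map the text 'true' (any letter case) to True, everything else to False."""
    return str(value).lower() == "true"


# bool() only checks for emptiness, so it gets "False" wrong.
print(bool("False"))     # True
print(to_bool("False"))  # False
print(to_bool("True"))   # True
print(to_bool(None))     # False: anything that is not "true" maps to False
```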

In [8]:
# arrest rate by crime type
arrest_rate = df.groupby("Primary Type")["Arrest"].mean().compute().sort_values(ascending=False)
print(arrest_rate.head(10))

Primary Type
GAMBLING                             1.000000
LIQUOR LAW VIOLATION                 0.978495
NARCOTICS                            0.975798
CONCEALED CARRY LICENSE VIOLATION    0.956098
PROSTITUTION                         0.947368
INTERFERENCE WITH PUBLIC OFFICER     0.873720
PUBLIC INDECENCY                     0.833333
WEAPONS VIOLATION                    0.580390
OBSCENITY                            0.547619
PUBLIC PEACE VIOLATION               0.451276
Name: Arrest, dtype: float64
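
The arrest rate above is the mean of a boolean column within each group: `True` counts as 1 and `False` as 0, so the mean equals the fraction of records with an arrest. A plain-Python sketch of the same calculation, using the five rows from the `df.head()` preview as toy input (the variable names are illustrative only):

```python
from collections import defaultdict

# (Primary Type, Arrest) pairs taken from the five preview rows.
records = [
    ("CRIMINAL DAMAGE", False),
    ("ASSAULT", True),
    ("ASSAULT", False),
    ("BATTERY", False),
    ("BATTERY", False),
]

totals = defaultdict(int)   # records per crime type
arrests = defaultdict(int)  # records with an arrest per crime type
for crime_type, arrested in records:
    totals[crime_type] += 1
    arrests[crime_type] += int(arrested)  # True -> 1, False -> 0

# Fraction of records with an arrest, per crime type.
rate_by_type = {t: arrests[t] / totals[t] for t in totals}
for crime_type, rate in sorted(rate_by_type.items(), key=lambda kv: -kv[1]):
    print(f"{crime_type}: {rate:.2f}")
```

With only five rows the numbers are not meaningful. The point is that `groupby(...).mean()` on a boolean column computes exactly this ratio.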
