## Extraction and Preprocessing of Data for Visualization 

#### Requirements
This notebook was developed and tested on WSL2 Ubuntu 18.04 within the following environment.  
<sup>Because it uses RAPIDS, this notebook must be run on a system with at least one CUDA-capable GPU [*](https://www.nvidia.com/en-us/geforce/technologies/cuda/supported-gpus/).</sup>

```
conda create -n rapids -c rapidsai -c nvidia -c conda-forge \
    cudf=0.18.1 cudnn=8.0.0 python=3.8 pandas=1.1.5 geopandas=0.8.1 cudatoolkit=11.0
```

In [1]:
import pandas as pd
import geopandas as gpd
import numpy as np
import cudf  # GPU Pandas

#### Data
Historical crime data for the City of Chicago, covering all crimes recorded from 2001 to present, must be downloaded from [Chicago Data Portal](https://data.cityofchicago.org/Public-Safety/Crimes-2001-to-Present/ijzp-q8t2) as CSV, saved in the root directory, and renamed to `crime-big.csv` (the file name the next cell reads). Please note the file is ~1.7 GB.  
The data is loaded as a GPU DataFrame for fast parallel processing with CUDA.

In [2]:
### This cell may take up to 1 minute to run.
crime_cdf = cudf.read_csv('crime-big.csv')
crime_cdf.info()

<class 'cudf.core.dataframe.DataFrame'>
RangeIndex: 7298624 entries, 0 to 7298623
Data columns (total 22 columns):
 #   Column                Dtype
---  ------                -----
 0   ID                    int64
 1   Case Number           object
 2   Date                  object
 3   Block                 object
 4   IUCR                  object
 5   Primary Type          object
 6   Description           object
 7   Location Description  object
 8   Arrest                bool
 9   Domestic              bool
 10  Beat                  int64
 11  District              float64
 12  Ward                  float64
 13  Community Area        float64
 14  FBI Code              object
 15  X Coordinate          float64
 16  Y Coordinate          float64
 17  Year                  int64
 18  Updated On            object
 19  Latitude              float64
 20  Longitude             float64
 21  Location              object
dtypes: bool(2), float64(7), int64(3), object(10)
memory usage: 1.8+ GB

----
The DataFrame is trimmed to keep only the features needed downstream. Due to crime recording restrictions, the kept features should not contain any NA/NULL values, but `dropna()` is applied as a safeguard.

In [3]:
crime_clean = (crime_cdf[['Date', 'Primary Type', 'Latitude', 'Longitude']])\
              .dropna()\
              .rename(columns={'Date':'date',
                               'Primary Type':'prmtype',
                               'Latitude':'latitude',
                               'Longitude':'longitude'})

Crime data is filtered to all crimes recorded from 01/01/2010 to present.

In [4]:
crime_clean['date'] = cudf.to_datetime(crime_clean['date'], format='%m/%d/%Y %I:%M:%S %p')
crime_clean = crime_clean[crime_clean['date'] >= np.datetime64('2010-01-01')]
crime_clean.info()

<class 'cudf.core.dataframe.DataFrame'>
Int64Index: 3195890 entries, 0 to 7298623
Data columns (total 4 columns):
 #   Column     Dtype
---  ------     -----
 0   date       datetime64[ns]
 1   prmtype    object
 2   latitude   float64
 3   longitude  float64
dtypes: datetime64[ns](1), float64(2), object(1)
memory usage: 140.3+ MB
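
The format string passed to `cudf.to_datetime` describes a 12-hour clock with an AM/PM marker. As a minimal sketch of how it is read, the same format can be tried on a few strings with the standard library. The sample strings below are hypothetical and only mimic the layout of the `Date` column.

```python
from datetime import datetime

# Format string copied from the cell above: month/day/year, 12-hour clock, AM/PM marker.
DATE_FORMAT = '%m/%d/%Y %I:%M:%S %p'

# Hypothetical sample values laid out like the `Date` column.
samples = [
    '01/01/2010 12:00:00 AM',
    '12/31/2009 11:59:00 PM',
    '07/04/2015 01:30:00 PM',
]

parsed = [datetime.strptime(s, DATE_FORMAT) for s in samples]
cutoff = datetime(2010, 1, 1)

# Keep values on or after the cutoff, mirroring the `>=` filter above.
kept = [d for d in parsed if d >= cutoff]
print(kept)
```

Because the comparison is `>=`, a crime recorded at exactly midnight on 01/01/2010 is kept.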


----
#### Time Decay Factor
The time factor is $1$ for all crimes committed from 01/01/2020 to present. For earlier crimes it decays linearly by day, reaching $0$ for crimes committed on or before 01/01/2015 (1826 days earlier).

In [5]:
crime_clean['deltad'] = np.minimum((crime_clean['date'] - np.datetime64('2020-01-01')).dt.days, 0)
crime_clean['timedecay'] = np.maximum(1 + (crime_clean['deltad']/(1826)), 0)
crime_clean = crime_clean.drop(columns=['deltad'])
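
The decay rule can be checked on individual dates with a small pure-Python sketch. The function name `time_decay` and the constants are illustrative and not part of the notebook. The arithmetic follows the cell above: clip the day difference at zero, divide by 1826, and clip the result at zero.

```python
from datetime import date, timedelta

REFERENCE = date(2020, 1, 1)  # crimes on or after this date get factor 1
DECAY_DAYS = 1826             # days from 2015-01-01 to 2020-01-01 (includes the 2016 leap day)

def time_decay(crime_date: date) -> float:
    """Linear decay from 1 (at REFERENCE or later) to 0 (DECAY_DAYS earlier or older)."""
    delta_days = min((crime_date - REFERENCE).days, 0)  # 0 for recent crimes, negative otherwise
    return max(1 + delta_days / DECAY_DAYS, 0.0)

print(time_decay(date(2021, 6, 1)))                  # after the reference date
print(time_decay(REFERENCE - timedelta(days=913)))   # halfway through the decay window
print(time_decay(date(2012, 3, 15)))                 # older than the decay window
```

Since the earlier filter keeps crimes from 2010 onward, every record from 2010 through 2014 receives a factor of 0 under this rule.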

#### Spatial Join of Zip-Code and Neighborhood
GeoPandas was chosen over cuSpatial for its stability (execution time vs. debugging time). The next cells should be reworked with cuSpatial in future iterations to significantly improve performance.

In [6]:
### This cell may take up to 1 minute to run.
df = crime_clean.to_pandas()
df_geom = gpd.points_from_xy(df.longitude, df.latitude, crs="EPSG:4326")
crime_gpd = gpd.GeoDataFrame(df, geometry = df_geom)
crime_gpd.info()

<class 'geopandas.geodataframe.GeoDataFrame'>
Int64Index: 3195890 entries, 0 to 7298623
Data columns (total 6 columns):
 #   Column     Dtype         
---  ------     -----         
 0   date       datetime64[ns]
 1   prmtype    object        
 2   latitude   float64       
 3   longitude  float64       
 4   timedecay  float64       
 5   geometry   geometry      
dtypes: datetime64[ns](1), float64(3), geometry(1), object(1)
memory usage: 170.7+ MB


----
Chicago [Neighborhood boundaries](https://data.cityofchicago.org/Facilities-Geographic-Boundaries/Boundaries-Neighborhoods/9wp7-iasj) and [Zip-Code boundaries](https://data.cityofchicago.org/Facilities-Geographic-Boundaries/Boundaries-ZIP-Codes/gdcf-axmw) data were also exported as `geojson` from [Chicago Data Portal](https://data.cityofchicago.org/) and saved into the `geo` folder in the root directory.

In [7]:
neigh_gpd = gpd.read_file('geo/Neighborhoods.geojson')
zip_gpd = gpd.read_file('geo/ZIP.geojson')

In [8]:
### This cell may take up to 5 minutes to run.
crime_gpd = gpd.sjoin(crime_gpd, neigh_gpd, op='within')\
               .drop(['index_right'], axis=1)  # necessary for the next spatial join
crime_gpd = gpd.sjoin(crime_gpd, zip_gpd, op='within')
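
To illustrate what a `within` join does conceptually, here is a minimal pure-Python sketch using a ray-casting point-in-polygon test. It is an illustration only and not how GeoPandas implements `sjoin`. The square "areas", the sample points, and the helper names are hypothetical.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: True if (x, y) lies inside `polygon`.

    `polygon` is a list of (x, y) vertices. Points exactly on an edge
    are not handled specially in this sketch.
    """
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through y?
        if (y1 > y) != (y2 > y):
            # x coordinate where the edge crosses that line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def join_within(points, areas):
    """Attach the name of the containing area to each point; unmatched points are dropped."""
    rows = []
    for x, y in points:
        for name, polygon in areas.items():
            if point_in_polygon(x, y, polygon):
                rows.append((x, y, name))
    return rows

# Hypothetical unit-square areas and points in (longitude, latitude) order.
areas = {
    'A': [(0, 0), (1, 0), (1, 1), (0, 1)],
    'B': [(1, 0), (2, 0), (2, 1), (1, 1)],
}
points = [(0.5, 0.5), (1.5, 0.25), (3.0, 3.0)]

joined = join_within(points, areas)
print(joined)
```

The third point falls outside every area and is dropped. `gpd.sjoin` behaves the same way with its default `how='inner'`, which would be consistent with the row count falling from 3195890 to 3187346 after the two joins.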

In [9]:
crime_clean = crime_gpd[['date', 'prmtype', 'latitude', 'longitude',
                         'timedecay', 'pri_neigh', 'zip']]\
                       .rename(columns={'pri_neigh':'neigh'})

crime_clean['zip'] = crime_clean.zip.astype('int32')
crime_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3187346 entries, 0 to 7297926
Data columns (total 7 columns):
 #   Column     Dtype         
---  ------     -----         
 0   date       datetime64[ns]
 1   prmtype    object        
 2   latitude   float64       
 3   longitude  float64       
 4   timedecay  float64       
 5   neigh      object        
 6   zip        int32         
dtypes: datetime64[ns](1), float64(3), int32(1), object(2)
memory usage: 182.4+ MB


----
#### Population by Zip-Code
The most recent data on Chicago population is the 2010 census block data, which was also downloaded from [Chicago Data Portal](https://data.cityofchicago.org/Facilities-Geographic-Boundaries/Population-by-2010-Census-Block/5yjb-v3mj). The `population.csv` file provided here is a trimmed version of this data that includes only the `ZipCode` and `Adult Population` features.

In [10]:
pop_zip = pd.read_csv('population.csv').rename(columns={'ZipCode': 'zip'})

crime_clean = crime_clean.merge(pop_zip, how='left', on='zip')
crime_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3187346 entries, 0 to 3187345
Data columns (total 8 columns):
 #   Column            Dtype         
---  ------            -----         
 0   date              datetime64[ns]
 1   prmtype           object        
 2   latitude          float64       
 3   longitude         float64       
 4   timedecay         float64       
 5   neigh             object        
 6   zip               int32         
 7   Adult Population  int64         
dtypes: datetime64[ns](1), float64(3), int32(1), int64(1), object(2)
memory usage: 206.7+ MB
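
The `int64` dtype of `Adult Population` in the output above indicates that every row found a matching zip code, since unmatched rows in a left merge would introduce NaN and force a float dtype. As a hedged sketch of how such a merge can be checked explicitly, the toy example below uses the `validate` and `indicator` options of `DataFrame.merge`. The frames and values are hypothetical.

```python
import pandas as pd

# Hypothetical miniature versions of the crime and population tables.
crimes = pd.DataFrame({
    'prmtype': ['THEFT', 'BATTERY', 'THEFT'],
    'zip': [60601, 60601, 60699],
})
population = pd.DataFrame({
    'zip': [60601, 60602],
    'Adult Population': [1000, 2000],
})

# validate='many_to_one' raises if a zip code appears more than once in `population`,
# which would otherwise silently duplicate crime rows.
# indicator=True adds a `_merge` column marking rows without a match.
merged = crimes.merge(population, how='left', on='zip',
                      validate='many_to_one', indicator=True)

unmatched = merged.loc[merged['_merge'] == 'left_only', 'zip'].unique().tolist()
print(unmatched)
```

An empty `unmatched` list on the real data would confirm that every zip code from the spatial join has a population entry.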


#### Crime Score Based on Crime Severity Rank and Time Decay Factor

In [11]:
crime_rank = pd.read_csv('crime_rank.csv')
crime_clean = crime_clean.merge(crime_rank, how= 'left', on='prmtype')

In [12]:
crime_clean['cscore'] = np.around(crime_clean['Rank'] * crime_clean['timedecay'], decimals=4)
crime_clean = crime_clean.drop(columns=['Rank', 'timedecay'])
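
As a small worked example of the score, the sketch below applies the same merge and multiplication to three toy rows. The rank values are hypothetical. The column names `prmtype` and `Rank` are the ones the cells above rely on.

```python
import numpy as np
import pandas as pd

# Hypothetical crimes with already-computed time decay factors.
crimes = pd.DataFrame({
    'prmtype': ['HOMICIDE', 'THEFT', 'THEFT'],
    'timedecay': [1.0, 0.5, 0.0],
})
# Hypothetical severity ranks; the real values come from crime_rank.csv.
crime_rank = pd.DataFrame({
    'prmtype': ['HOMICIDE', 'THEFT'],
    'Rank': [10, 3],
})

scored = crimes.merge(crime_rank, how='left', on='prmtype')
scored['cscore'] = np.around(scored['Rank'] * scored['timedecay'], decimals=4)
print(scored[['prmtype', 'cscore']])
```

A crime whose time decay factor is 0 contributes nothing to the score, regardless of its severity rank.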

In [13]:
crime_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3187346 entries, 0 to 3187345
Data columns (total 8 columns):
 #   Column            Dtype         
---  ------            -----         
 0   date              datetime64[ns]
 1   prmtype           object        
 2   latitude          float64       
 3   longitude         float64       
 4   neigh             object        
 5   zip               int32         
 6   Adult Population  int64         
 7   cscore            float64       
dtypes: datetime64[ns](1), float64(3), int32(1), int64(1), object(2)
memory usage: 206.7+ MB


----
#### Crime Score Adjusted By Population
The crime score `cscore` is summed by (Neighborhood, Zip-Code) key and then divided by population to give `CSperCapita`.

In [14]:
crime_clean = crime_clean.merge(crime_clean.groupby(['neigh','zip'], as_index=False)['cscore']\
                                           .agg('sum'), how='left', on = ['neigh','zip'])
                         

In [15]:
crime_clean['CSperCapita'] = np.divide(crime_clean['cscore_y'],
                                       crime_clean['Adult Population'])\
                               .replace(np.inf, 0)
crime_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3187346 entries, 0 to 3187345
Data columns (total 10 columns):
 #   Column            Dtype         
---  ------            -----         
 0   date              datetime64[ns]
 1   prmtype           object        
 2   latitude          float64       
 3   longitude         float64       
 4   neigh             object        
 5   zip               int32         
 6   Adult Population  int64         
 7   cscore_x          float64       
 8   cscore_y          float64       
 9   CSperCapita       float64       
dtypes: datetime64[ns](1), float64(5), int32(1), int64(1), object(2)
memory usage: 255.3+ MB
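
The group sum can also be attached to each row without a self-merge. The sketch below shows one alternative using `groupby(...).transform('sum')` on hypothetical toy data. It is not the notebook's method, and the column name `neigh_score` is illustrative (the notebook's merge produces `cscore_x` and `cscore_y` instead).

```python
import numpy as np
import pandas as pd

# Hypothetical rows; the last group has zero population to show the division edge case.
df = pd.DataFrame({
    'neigh': ['Loop', 'Loop', 'Uptown'],
    'zip': [60601, 60601, 60640],
    'Adult Population': [1000, 1000, 0],
    'cscore': [2.0, 3.0, 4.0],
})

# transform('sum') returns one value per original row, aligned on the index.
df['neigh_score'] = df.groupby(['neigh', 'zip'])['cscore'].transform('sum')

# Dividing a positive score by a zero population yields inf, which is replaced by 0 as above.
df['CSperCapita'] = (df['neigh_score'] / df['Adult Population']).replace(np.inf, 0)
print(df)
```

Note that `replace(np.inf, 0)` only covers a positive score over a zero population. A zero score over a zero population would produce NaN rather than `inf` and would pass through unchanged.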


----
#### Final Touches and Write File

In [16]:
crime_clean['Year'] = crime_clean.date.dt.strftime('%Y')
crime_clean['Month'] = crime_clean.date.dt.strftime('%m')
crime_clean = crime_clean.drop(columns=['date', 'Adult Population'])\
                         .rename(columns={'cscore_x':'Crime Score',
                                          'cscore_y':'Neigh Score',
                                          'prmtype':'Crime Type',
                                          'neigh':'Neighborhood',
                                          'zip':'Zip Code'})

In [17]:
crime_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3187346 entries, 0 to 3187345
Data columns (total 10 columns):
 #   Column        Dtype  
---  ------        -----  
 0   Crime Type    object 
 1   latitude      float64
 2   longitude     float64
 3   Neighborhood  object 
 4   Zip Code      int32  
 5   Crime Score   float64
 6   Neigh Score   float64
 7   CSperCapita   float64
 8   Year          object 
 9   Month         object 
dtypes: float64(5), int32(1), object(4)
memory usage: 335.3+ MB


In [18]:
crime_clean.to_csv('crime-clean.csv', index=False)