# Prepare Chicago Crime Data for a GitHub Repository

- Original Notebook Source: https://github.com/coding-dojo-data-science/preparing-chicago-crime-data
- Updated 11/17/22

>- This notebook processes a "Crimes_-_2001_to_2023_processed.csv" crime file from your Downloads folder and saves it as smaller CSV files (one per year) in a new "Data/Chicago/" folder inside this notebook's folder/repo.

# INSTRUCTIONS

- 1) Export your Chicago Crime data to a CSV file. **The full file is too big to commit to a repository**, which is why this notebook splits it into smaller files.

- 2) Once exported, change the `RAW_FILE` variable below to match the filepath of the downloaded file.
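
A quick size check shows why the raw export cannot be committed as-is. The sketch below is an illustration, not part of the original notebook: it assumes GitHub's 100 MB per-file limit for regular (non-LFS) files, and the helper names `file_size_mb` and `exceeds_github_limit` are hypothetical.

```python
import os

# Assumption: GitHub blocks regular (non-LFS) files larger than 100 MB.
GITHUB_LIMIT_MB = 100


def file_size_mb(path):
    """Return the size of the file at `path` in megabytes (1 MB = 1024**2 bytes)."""
    return os.path.getsize(path) / 1024**2


def exceeds_github_limit(path, limit_mb=GITHUB_LIMIT_MB):
    """Return True if the file is larger than `limit_mb` megabytes."""
    return file_size_mb(path) > limit_mb
```

Once `RAW_FILE` is set, calling `exceeds_github_limit(RAW_FILE)` should return `True` for the full export, and the same check can be run on each yearly file after the split.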

## 🚨 Set the correct `RAW_FILE` path

- The cell below will attempt to check your Downloads folder for any file with a name that contains "Crimes_-_2001_to_2023_processed".
    - If you know the file path already, you can skip the next cell and just manually set the RAW_FILE variable in the following code cell.

In [47]:
## Run this cell to attempt to programmatically find your crime file
import os, glob

## Getting the home folder in a cross-platform way (the HOME variable is typically not set on Windows)
home_folder = os.path.expanduser('~')
# print("- Your Home Folder is: " + home_folder)

## Check for downloads folder
if 'Downloads' in os.listdir(home_folder):
    
    
    # Print the Downloads folder path
    dl_folder = os.path.abspath(os.path.join(home_folder,'/Users/whitefreeze/Documents/GitHub/Chicago-Crime-Data/Data/Chicago'))
    print(f"- Your Downloads folder is '{dl_folder}/'\n")
    
    ## checking for crime files using glob
    crime_files = sorted(glob.glob(dl_folder+'/**/Crimes_-_2001_to_2023_processed*',recursive=True))
    
    # If exactly one file matches, use it; if more than one matches, list them for the user
    if len(crime_files)==1:
        RAW_FILE = crime_files[0]
        
    elif len(crime_files)>1:
        print('[i] The following files were found:')
        
        for i, fname in enumerate(crime_files):
            print(f"\tcrime_files[{i}] = '{fname}'")
        print(f'\n- Please fill in the RAW_FILE variable in the code cell below with the correct filepath.')

else:
    print(f'[!] Could not programmatically find your downloads folder.')
    print('- Try using Finder (on Mac) or File Explorer (Windows) to navigate to your Downloads folder.')


- Your Downloads folder is '/Users/whitefreeze/Documents/GitHub/Chicago-Crime-Data/Data/Chicago/'
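
The same search can be written with `pathlib`, which resolves the home folder on macOS, Linux, and Windows. This is a minimal sketch under stated assumptions: the function name `find_crime_files` and its `stem` parameter are hypothetical and do not appear in the original notebook.

```python
from pathlib import Path


def find_crime_files(folder, stem="Crimes_-_2001_to_2023_processed"):
    """Recursively list files under `folder` whose names start with `stem`."""
    folder = Path(folder)
    if not folder.is_dir():
        # Mirror the notebook's behavior: no folder means no matches.
        return []
    return sorted(str(p) for p in folder.rglob(f"{stem}*") if p.is_file())
```

For example, `find_crime_files(Path.home() / "Downloads")` returns a sorted list of candidate paths; if it contains exactly one entry, that entry can be assigned to `RAW_FILE`.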



<span style="color:red"> **IF THE CODE ABOVE DID NOT FIND YOUR DOWNLOADED FILE, UNCOMMENT THE `RAW_FILE` LINE IN THE CELL BELOW AND REPLACE `"YOUR FILEPATH HERE"` WITH YOUR FILE PATH. CHANGE NOTHING ELSE IN THAT CELL.** </span>

In [48]:
## (Required) MAKE SURE TO CHANGE THIS VARIABLE TO MATCH YOUR LOCAL FILE NAME
##RAW_FILE = r"YOUR FILEPATH HERE"

<span style="color:red"> **DO NOT CHANGE ANYTHING IN THE CELL BELOW** </span>

In [49]:
## DO NOT CHANGE THIS CELL
if RAW_FILE == r"YOUR FILEPATH HERE":
	raise Exception("You must update the RAW_FILE variable in the previous cell to match your local filepath.")
	
RAW_FILE

'/Users/whitefreeze/Documents/GitHub/Chicago-Crime-Data/Data/Chicago/temp/Crimes_-_2001_to_2023_processed.csv'
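
The check above only catches an unchanged placeholder. A slightly stricter sketch could also confirm that the path points to a real file before the long-running load starts. The helper name `validate_raw_file` is hypothetical and not part of the original notebook.

```python
import os


def validate_raw_file(path):
    """Return the absolute path if `path` is an existing file, otherwise raise."""
    if not path or path == "YOUR FILEPATH HERE":
        raise ValueError("Set RAW_FILE to the path of your downloaded crime file.")
    if not os.path.isfile(path):
        raise FileNotFoundError(f"No file found at: {path}")
    return os.path.abspath(path)
```

Calling `validate_raw_file(RAW_FILE)` right after setting the variable would surface a mistyped path immediately rather than inside `pd.read_csv`.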

In [50]:
## (Optional) SET THE FOLDER FOR FINAL FILES
OUTPUT_FOLDER = '/Users/whitefreeze/Documents/GitHub/Chicago-Crime-Data/Data/Chicago/'
os.makedirs(OUTPUT_FOLDER, exist_ok=True)

# 🔄 Full Workflow

- Now that your `RAW_FILE` variable is set, either:
    - On the toolbar, click on the Kernel menu > "Restart and Run All".
    - OR click on this cell first, then on the toolbar click on the "Cell" menu > "Run All Below"

In [51]:
import pandas as pd

chicago_full = pd.read_csv(RAW_FILE)
chicago_full

Unnamed: 0,CrimeDateTime,ID,Primary Type,Description,Location Description,Arrest,Domestic,Beat,District,Latitude,Longitude
0,2001-01-01 00:00:00,13246664,SEX OFFENSE,NON-AGGRAVATED,RESIDENCE,False,False,1723,17,,
1,2001-01-01 00:00:00,1369398,DECEPTIVE PRACTICE,FORGERY,OTHER,False,False,1924,19,41.941360,-87.672218
2,2001-01-01 00:00:00,1322043,CRIMINAL DAMAGE,TO PROPERTY,RESIDENCE,False,False,825,8,41.782292,-87.684799
3,2001-01-01 00:00:00,1310724,CRIMINAL DAMAGE,TO PROPERTY,RESIDENCE,False,False,825,8,41.782030,-87.690891
4,2001-01-01 00:00:00,1318802,THEFT,$500 AND UNDER,RESIDENCE,False,True,532,5,41.667824,-87.622155
...,...,...,...,...,...,...,...,...,...,...,...
7974650,2023-12-31 23:50:00,13324881,BATTERY,DOMESTIC BATTERY SIMPLE,RESIDENCE,False,True,923,9,41.800201,-87.691535
7974651,2023-12-31 23:50:00,13327752,THEFT,FROM BUILDING,BAR OR TAVERN,False,False,122,1,41.886816,-87.631524
7974652,2023-12-31 23:51:00,13324997,ASSAULT,AGGRAVATED - OTHER DANGEROUS WEAPON,APARTMENT,False,True,624,6,41.754967,-87.602411
7974653,2023-12-31 23:51:00,13325009,ASSAULT,AGGRAVATED POLICE OFFICER - HANDGUN,STREET,True,False,935,9,41.801584,-87.633177
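
Loading nearly 8 million rows at once can strain machines with limited memory. As an alternative, the sketch below streams the file in chunks with the `chunksize` parameter of `pd.read_csv` and tallies rows per year. The function name `count_rows_by_year` is hypothetical, and the code assumes the `CrimeDateTime` column uses the "YYYY-MM-DD HH:MM:SS" layout shown above.

```python
import io

import pandas as pd


def count_rows_by_year(csv_source, chunksize=100_000):
    """Stream a crime CSV in chunks and count rows per year without loading it all."""
    counts = {}
    reader = pd.read_csv(csv_source, usecols=["CrimeDateTime"], chunksize=chunksize)
    for chunk in reader:
        # The first four characters of "YYYY-MM-DD HH:MM:SS" are the year.
        for year, n in chunk["CrimeDateTime"].str[:4].value_counts().items():
            counts[year] = counts.get(year, 0) + int(n)
    return dict(sorted(counts.items()))


# Tiny in-memory stand-in for the raw file, for demonstration only.
demo_csv = io.StringIO(
    "CrimeDateTime,ID\n"
    "2001-01-01 00:00:00,1369398\n"
    "2001-01-01 00:00:00,1322043\n"
    "2023-12-31 23:51:00,13325009\n"
)
print(count_rows_by_year(demo_csv, chunksize=2))
```

With the real file, `count_rows_by_year(RAW_FILE)` gives a quick preview of how many rows each yearly file will contain.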


In [52]:
# this cell can take up to 1 min to run
date_format = "%Y-%m-%d %H:%M:%S"

chicago_full['Datetime'] = pd.to_datetime(chicago_full['CrimeDateTime'], format=date_format)
chicago_full = chicago_full.sort_values('Datetime')
chicago_full


Unnamed: 0,CrimeDateTime,ID,Primary Type,Description,Location Description,Arrest,Domestic,Beat,District,Latitude,Longitude,Datetime
0,2001-01-01 00:00:00,13246664,SEX OFFENSE,NON-AGGRAVATED,RESIDENCE,False,False,1723,17,,,2001-01-01 00:00:00
124,2001-01-01 00:00:00,6742268,OFFENSE INVOLVING CHILDREN,AGG SEX ASSLT OF CHILD FAM MBR,RESIDENCE,False,True,612,6,41.755091,-87.652502,2001-01-01 00:00:00
125,2001-01-01 00:00:00,6483603,SEX OFFENSE,PREDATORY,APARTMENT,False,True,2013,20,41.986230,-87.664563,2001-01-01 00:00:00
126,2001-01-01 00:00:00,6040504,OFFENSE INVOLVING CHILDREN,CRIM SEX ABUSE BY FAM MEMBER,RESIDENCE,False,True,1021,10,41.856707,-87.712855,2001-01-01 00:00:00
127,2001-01-01 00:00:00,5926227,THEFT,FINANCIAL ID THEFT: OVER $300,RESIDENCE,False,False,2233,22,41.687651,-87.632719,2001-01-01 00:00:00
...,...,...,...,...,...,...,...,...,...,...,...,...
7974649,2023-12-31 23:50:00,13324829,BATTERY,"AGGRAVATED P.O. - HANDS, FISTS, FEET, NO / MIN...",STREET,False,False,2532,25,41.906519,-87.758360,2023-12-31 23:50:00
7974651,2023-12-31 23:50:00,13327752,THEFT,FROM BUILDING,BAR OR TAVERN,False,False,122,1,41.886816,-87.631524,2023-12-31 23:50:00
7974653,2023-12-31 23:51:00,13325009,ASSAULT,AGGRAVATED POLICE OFFICER - HANDGUN,STREET,True,False,935,9,41.801584,-87.633177,2023-12-31 23:51:00
7974652,2023-12-31 23:51:00,13324997,ASSAULT,AGGRAVATED - OTHER DANGEROUS WEAPON,APARTMENT,False,True,624,6,41.754967,-87.602411,2023-12-31 23:51:00
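
Passing an explicit `format` makes parsing strict, so a single malformed timestamp raises an error. If that happens, one option is `errors="coerce"`, which converts unparseable values to `NaT` so they can be counted and inspected. The sample values below are made up for illustration.

```python
import pandas as pd

date_format = "%Y-%m-%d %H:%M:%S"

# Two valid timestamps and one deliberately malformed value.
raw = pd.Series(["2001-01-01 00:00:00", "2023-12-31 23:51:00", "not a date"])

# errors="coerce" turns values that do not match the format into NaT instead of raising.
parsed = pd.to_datetime(raw, format=date_format, errors="coerce")
n_bad = int(parsed.isna().sum())
print(f"{n_bad} value(s) could not be parsed")
```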


## Separate the Full Dataset by Years

In [53]:
chicago_full['Datetime'].dt.year

0          2001
124        2001
125        2001
126        2001
127        2001
           ... 
7974649    2023
7974651    2023
7974653    2023
7974652    2023
7974654    2023
Name: Datetime, Length: 7974655, dtype: int64

In [54]:
# save the years for every crime
chicago_full["Year"] = chicago_full['Datetime'].dt.year.astype(str)
chicago_full["Year"].value_counts()

2002    486813
2001    485903
2003    475987
2004    469428
2005    453776
2006    448183
2007    437090
2008    427193
2009    392835
2010    370525
2011    352006
2012    336333
2013    307552
2014    275811
2016    269865
2017    269138
2018    268950
2015    264823
2019    261411
2023    260541
2022    239152
2020    212306
2021    209034
Name: Year, dtype: int64
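
`value_counts()` orders its result by frequency, which is why the years above are not chronological. Chaining `.sort_index()` lists them in year order instead. A small sketch with made-up values:

```python
import pandas as pd

# Made-up sample of year labels, stored as strings like the "Year" column.
years = pd.Series(["2002", "2001", "2002", "2003", "2001", "2002"], name="Year")

# Sort by the index (the year label) rather than by count.
counts = years.value_counts().sort_index()
print(counts)
```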

In [55]:
chicago_full.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7974655 entries, 0 to 7974654
Data columns (total 13 columns):
 #   Column                Dtype         
---  ------                -----         
 0   CrimeDateTime         object        
 1   ID                    int64         
 2   Primary Type          object        
 3   Description           object        
 4   Location Description  object        
 5   Arrest                bool          
 6   Domestic              bool          
 7   Beat                  int64         
 8   District              int64         
 9   Latitude              float64       
 10  Longitude             float64       
 11  Datetime              datetime64[ns]
 12  Year                  object        
dtypes: bool(2), datetime64[ns](1), float64(2), int64(3), object(5)
memory usage: 745.3+ MB
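
Several of the `object` columns, such as "Primary Type", hold a small set of repeated labels. Converting such columns to the `category` dtype usually reduces memory use. The sketch below compares both representations on made-up data. The notebook itself does not perform this conversion, and it has no effect on the size of the saved CSV files.

```python
import pandas as pd

# Made-up column with a few labels repeated many times.
demo = pd.DataFrame({"Primary Type": ["THEFT", "BATTERY", "ASSAULT", "THEFT"] * 25_000})

# deep=True counts the memory used by the Python strings themselves.
as_object = demo["Primary Type"].astype(object).memory_usage(deep=True)
as_category = demo["Primary Type"].astype("category").memory_usage(deep=True)

print(f"object:   {as_object / 1024**2:.2f} MB")
print(f"category: {as_category / 1024**2:.2f} MB")
```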


In [56]:
## Dropping unneeded columns to reduce file size
drop_cols = ["CrimeDateTime"]

In [57]:
# save final df
chicago_final = chicago_full.drop(columns=drop_cols)
chicago_final = chicago_final.set_index('Datetime')
chicago_final

Datetime,ID,Primary Type,Description,Location Description,Arrest,Domestic,Beat,District,Latitude,Longitude,Year
2001-01-01 00:00:00,13246664,SEX OFFENSE,NON-AGGRAVATED,RESIDENCE,False,False,1723,17,,,2001
2001-01-01 00:00:00,6742268,OFFENSE INVOLVING CHILDREN,AGG SEX ASSLT OF CHILD FAM MBR,RESIDENCE,False,True,612,6,41.755091,-87.652502,2001
2001-01-01 00:00:00,6483603,SEX OFFENSE,PREDATORY,APARTMENT,False,True,2013,20,41.986230,-87.664563,2001
2001-01-01 00:00:00,6040504,OFFENSE INVOLVING CHILDREN,CRIM SEX ABUSE BY FAM MEMBER,RESIDENCE,False,True,1021,10,41.856707,-87.712855,2001
2001-01-01 00:00:00,5926227,THEFT,FINANCIAL ID THEFT: OVER $300,RESIDENCE,False,False,2233,22,41.687651,-87.632719,2001
...,...,...,...,...,...,...,...,...,...,...,...
2023-12-31 23:50:00,13324829,BATTERY,"AGGRAVATED P.O. - HANDS, FISTS, FEET, NO / MIN...",STREET,False,False,2532,25,41.906519,-87.758360,2023
2023-12-31 23:50:00,13327752,THEFT,FROM BUILDING,BAR OR TAVERN,False,False,122,1,41.886816,-87.631524,2023
2023-12-31 23:51:00,13325009,ASSAULT,AGGRAVATED POLICE OFFICER - HANDGUN,STREET,True,False,935,9,41.801584,-87.633177,2023
2023-12-31 23:51:00,13324997,ASSAULT,AGGRAVATED - OTHER DANGEROUS WEAPON,APARTMENT,False,True,624,6,41.754967,-87.602411,2023


In [58]:
# unique years, used as bins for splitting the data
year_bins = chicago_final['Year'].astype(str).unique()
year_bins

array(['2001', '2002', '2003', '2004', '2005', '2006', '2007', '2008',
       '2009', '2010', '2011', '2012', '2013', '2014', '2015', '2016',
       '2017', '2018', '2019', '2020', '2021', '2022', '2023'],
      dtype=object)

In [59]:
FINAL_DROP = ['Year']

In [60]:
## set save location 

os.makedirs(OUTPUT_FOLDER, exist_ok=True)
print(f"[i] Saving .csv's to {OUTPUT_FOLDER}")
## loop through years
for year in year_bins:
    
    ## select this year's rows (partial string indexing on the DatetimeIndex)
    temp_df = chicago_final.loc[year]
    temp_df = temp_df.sort_index()
    #temp_df = temp_df.reset_index(drop=True)
    temp_df = temp_df.drop(columns=FINAL_DROP)

    # save as csv to output folder
    fname_temp = f"{OUTPUT_FOLDER}Chicago-Crime_{year}.csv"#.gz
    temp_df.to_csv(fname_temp,index=True)

    print(f"- Successfully saved {fname_temp}")

[i] Saving .csv's to /Users/whitefreeze/Documents/GitHub/Chicago-Crime-Data/Data/Chicago/
- Successfully saved /Users/whitefreeze/Documents/GitHub/Chicago-Crime-Data/Data/Chicago/Chicago-Crime_2001.csv
- Successfully saved /Users/whitefreeze/Documents/GitHub/Chicago-Crime-Data/Data/Chicago/Chicago-Crime_2002.csv
- Successfully saved /Users/whitefreeze/Documents/GitHub/Chicago-Crime-Data/Data/Chicago/Chicago-Crime_2003.csv
- Successfully saved /Users/whitefreeze/Documents/GitHub/Chicago-Crime-Data/Data/Chicago/Chicago-Crime_2004.csv
- Successfully saved /Users/whitefreeze/Documents/GitHub/Chicago-Crime-Data/Data/Chicago/Chicago-Crime_2005.csv
- Successfully saved /Users/whitefreeze/Documents/GitHub/Chicago-Crime-Data/Data/Chicago/Chicago-Crime_2006.csv
- Successfully saved /Users/whitefreeze/Documents/GitHub/Chicago-Crime-Data/Data/Chicago/Chicago-Crime_2007.csv
- Successfully saved /Users/whitefreeze/Documents/GitHub/Chicago-Crime-Data/Data/Chicago/Chicago-Crime_2008.csv
- Successfully
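
The loop above selects each year with `chicago_final.loc[year]`, which relies on partial string indexing of the DatetimeIndex. An alternative sketch groups on the index's year directly, which avoids the helper "Year" column. The function name `save_by_year` and its `prefix` parameter are hypothetical and not part of the original notebook.

```python
import os

import pandas as pd


def save_by_year(df, output_folder, prefix="Chicago-Crime"):
    """Write one CSV per calendar year of a DataFrame that has a DatetimeIndex."""
    os.makedirs(output_folder, exist_ok=True)
    saved = []
    # Group rows by the year of each index timestamp.
    for year, year_df in df.groupby(df.index.year):
        fname = os.path.join(output_folder, f"{prefix}_{year}.csv")
        year_df.sort_index().to_csv(fname, index=True)
        saved.append(fname)
    return saved
```

Under these assumptions, `save_by_year(chicago_final.drop(columns=FINAL_DROP), OUTPUT_FOLDER)` should produce the same file names as the loop above.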

In [61]:
saved_files = sorted(glob.glob(OUTPUT_FOLDER+'*.*csv'))
saved_files

['/Users/whitefreeze/Documents/GitHub/Chicago-Crime-Data/Data/Chicago/Chicago-Crime_2001.csv',
 '/Users/whitefreeze/Documents/GitHub/Chicago-Crime-Data/Data/Chicago/Chicago-Crime_2002.csv',
 '/Users/whitefreeze/Documents/GitHub/Chicago-Crime-Data/Data/Chicago/Chicago-Crime_2003.csv',
 '/Users/whitefreeze/Documents/GitHub/Chicago-Crime-Data/Data/Chicago/Chicago-Crime_2004.csv',
 '/Users/whitefreeze/Documents/GitHub/Chicago-Crime-Data/Data/Chicago/Chicago-Crime_2005.csv',
 '/Users/whitefreeze/Documents/GitHub/Chicago-Crime-Data/Data/Chicago/Chicago-Crime_2006.csv',
 '/Users/whitefreeze/Documents/GitHub/Chicago-Crime-Data/Data/Chicago/Chicago-Crime_2007.csv',
 '/Users/whitefreeze/Documents/GitHub/Chicago-Crime-Data/Data/Chicago/Chicago-Crime_2008.csv',
 '/Users/whitefreeze/Documents/GitHub/Chicago-Crime-Data/Data/Chicago/Chicago-Crime_2009.csv',
 '/Users/whitefreeze/Documents/GitHub/Chicago-Crime-Data/Data/Chicago/Chicago-Crime_2010.csv',
 '/Users/whitefreeze/Documents/GitHub/Chicago-Crim

In [62]:
## create a README.txt describing the saved csv files
readme = """Source URL: 
- https://data.cityofchicago.org/Public-Safety/Crimes-2001-to-Present/ijzp-q8t2
- Filtered for years 2000-Present.

Downloaded 01/01/2024
- Files are split into 1 year per file.

EXAMPLE USAGE:
>> import glob
>> import pandas as pd
>> folder = "Data/Chicago/"
>> crime_files = sorted(glob.glob(folder+"*.csv"))
>> df = pd.concat([pd.read_csv(f) for f in crime_files])
"""
print(readme)


with open(f"{OUTPUT_FOLDER}README.txt",'w') as f:
    f.write(readme)

Source URL: 
- https://data.cityofchicago.org/Public-Safety/Crimes-2001-to-Present/ijzp-q8t2
- Filtered for years 2000-Present.

Downloaded 01/01/2024
- Files are split into 1 year per file.

EXAMPLE USAGE:
>> import glob
>> import pandas as pd
>> folder = "Data/Chicago/"
>> crime_files = sorted(glob.glob(folder+"*.csv"))
>> df = pd.concat([pd.read_csv(f) for f in crime_files])



## Confirmation

- Follow the example usage above to test if your files were created successfully.

In [63]:
# get list of files from folder
crime_files = sorted(glob.glob(OUTPUT_FOLDER+"*.csv"))
df = pd.concat([pd.read_csv(f, nrows=5) for f in crime_files])
df

Unnamed: 0,Datetime,ID,Primary Type,Description,Location Description,Arrest,Domestic,Beat,District,Latitude,Longitude
0,2001-01-01 00:00:00,13246664,SEX OFFENSE,NON-AGGRAVATED,RESIDENCE,False,False,1723,17,,
1,2001-01-01 00:00:00,6742268,OFFENSE INVOLVING CHILDREN,AGG SEX ASSLT OF CHILD FAM MBR,RESIDENCE,False,True,612,6,41.755091,-87.652502
2,2001-01-01 00:00:00,6483603,SEX OFFENSE,PREDATORY,APARTMENT,False,True,2013,20,41.986230,-87.664563
3,2001-01-01 00:00:00,6040504,OFFENSE INVOLVING CHILDREN,CRIM SEX ABUSE BY FAM MEMBER,RESIDENCE,False,True,1021,10,41.856707,-87.712855
4,2001-01-01 00:00:00,5926227,THEFT,FINANCIAL ID THEFT: OVER $300,RESIDENCE,False,False,2233,22,41.687651,-87.632719
...,...,...,...,...,...,...,...,...,...,...,...
0,2023-01-01 00:00:00,12989036,BURGLARY,UNLAWFUL ENTRY,APARTMENT,False,False,1832,18,41.900547,-87.633632
1,2023-01-01 00:00:00,12938946,CRIMINAL DAMAGE,TO PROPERTY,RESIDENCE,False,False,2233,22,41.686490,-87.638563
2,2023-01-01 00:00:00,12938519,CRIMINAL DAMAGE,TO VEHICLE,GOVERNMENT BUILDING / PROPERTY,False,False,621,6,41.751928,-87.644077
3,2023-01-01 00:00:00,12938498,BATTERY,DOMESTIC BATTERY SIMPLE,APARTMENT,False,True,624,6,41.750371,-87.599099


In [64]:
years = df['Datetime'].map(lambda x: x.split()[0].split('-')[0])
years.value_counts().sort_index()

2001    5
2002    5
2003    5
2004    5
2005    5
2006    5
2007    5
2008    5
2009    5
2010    5
2011    5
2012    5
2013    5
2014    5
2015    5
2016    5
2017    5
2018    5
2019    5
2020    5
2021    5
2022    5
2023    5
Name: Datetime, dtype: int64
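
When the yearly files are read back for analysis, the "Datetime" column arrives as plain text. A possible loader, sketched below, parses it and restores it as the index while concatenating the files. The function name `load_crime_files` is hypothetical. It assumes every file in the folder has a "Datetime" column, as the files saved by this notebook do.

```python
import glob
import os

import pandas as pd


def load_crime_files(folder):
    """Concatenate every yearly CSV in `folder` into one DataFrame indexed by Datetime."""
    files = sorted(glob.glob(os.path.join(folder, "*.csv")))
    if not files:
        raise FileNotFoundError(f"No .csv files found in {folder}")
    frames = [
        pd.read_csv(f, parse_dates=["Datetime"], index_col="Datetime") for f in files
    ]
    return pd.concat(frames).sort_index()
```

A call such as `load_crime_files("Data/Chicago/")` returns a single DataFrame with a DatetimeIndex, so time-based selection like `df.loc["2023"]` works directly.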

## Summary

- The Chicago crime dataset has now been saved to your repository as CSV files, one per year.
- Save your notebook, commit your work, and push to GitHub using GitHub Desktop.