# Crime Patterns and Arrest Trends Due to Socioeconomic Disparity: A Data Science Analysis

## Project Introduction
Our project investigates disparities in crime and arrest trends across NYC neighborhoods, focusing on how these trends correlate with income levels, racial demographics, and geographic location.

### Research Questions
- Do crime and arrest rates vary by location?
- Do areas with lower income experience higher police activity and crime rates?
- Do areas with a higher proportion of residents of color show similar trends, which could point to systemic discrimination?

### Data Sources
We use several public datasets from data.gov, including:
- **NYPD Shooting Incident Data**
- **NYC Crimes 2001–Present**
- **NYC Arrests Data**

## Project Status
Since our check-in proposal slides, the project scope has changed significantly. We initially planned to analyze crime and arrest trends using datasets from Chicago but, because of data limitations, switched to datasets from the NYPD. The objective is unchanged: analyzing crime and arrest trends by location, income, and racial demographics, now for a different city. Apart from this switch, we have not added or removed any major components.


In [44]:
import pandas as pd
import numpy as np
import re                        
from datetime import datetime
import zipfile

import matplotlib.pyplot as plt   
import seaborn as sns
import plotly.express as px 

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder, OneHotEncoder
from sklearn.impute import SimpleImputer          

from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.metrics import mean_squared_error, r2_score


In [55]:
# Read the CSV directly from the zip archive without extracting it.
with zipfile.ZipFile("NYPD_Hate_Crimes.zip", "r") as z:
    with z.open("NYPD_Hate_Crimes.csv") as f:
        df_hate_crimes = pd.read_csv(f)

# Columns not needed for the analysis; errors="ignore" skips any that are absent.
columns_to_drop = [
    'Complaint Precinct Code', 'Law Code Category Description', 'PD Code Description',
    'Bias Motive Description', 'Month Number', 'Patrol Borough Name', 'Full Complaint ID',
]
df_hate_crimes = df_hate_crimes.drop(columns=columns_to_drop, errors="ignore")
df_hate_crimes = df_hate_crimes.drop_duplicates()
# dropna() removes every row that has a missing value in any remaining column.
df_hate_crimes = df_hate_crimes.dropna()
print(df_hate_crimes.head())


   Complaint Year Number Record Create Date County  \
0                   2021         05/01/2021  BRONX   
1                   2021         12/28/2021  BRONX   
2                   2022         10/11/2022  BRONX   
3                   2019         01/15/2019  KINGS   
4                   2019         02/08/2019  KINGS   

              Offense Description             Offense Category Arrest Date  \
0                        BURGLARY  Religion/Religious Practice  05/01/2021   
1         MISCELLANEOUS PENAL LAW  Religion/Religious Practice  09/28/2022   
2                  FELONY ASSAULT           Sexual Orientation  10/11/2022   
3  MURDER & NON-NEGL. MANSLAUGHTE                   Race/Color  01/16/2019   
4   OFF. AGNST PUB ORD SENSBLTY &  Religion/Religious Practice  02/08/2019   

   Arrest Id  
0  B33683676  
1  B34705870  
2  B34707656  
3  K31675023  
4  K31679592  
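
The cell above calls `dropna()` on the whole frame. If incidents without an arrest have empty `Arrest Date` and `Arrest Id` fields in the raw file (an assumption, since the raw file is not shown here), those incidents are removed and only hate crimes that led to an arrest remain. A minimal sketch of checking this before dropping, using a small made-up frame (`toy` and its values are illustrative, not real data):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the hate-crime extract; the values are illustrative only.
toy = pd.DataFrame({
    "County": ["BRONX", "KINGS", "QUEENS", "KINGS"],
    "Offense Category": ["Race/Color", "Religion/Religious Practice",
                         "Race/Color", "Sexual Orientation"],
    "Arrest Date": ["05/01/2021", np.nan, np.nan, "02/08/2019"],
    "Arrest Id": ["B1", np.nan, np.nan, "K2"],
})

# Count missing values per column before dropping anything.
missing_per_column = toy.isna().sum()

# Compare row counts before and after a blanket dropna().
rows_before = len(toy)
rows_after = len(toy.dropna())
rows_lost = rows_before - rows_after

print(missing_per_column)
print(f"dropna() would remove {rows_lost} of {rows_before} rows")
```

Running the same two checks on `df_hate_crimes` just before its `dropna()` call would show how many incidents are lost and which columns cause the loss.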


In [63]:
with zipfile.ZipFile("NYPD_Arrest_Data__Year_to_Date_.zip", "r") as z:
    with z.open("NYPD_Arrest_Data__Year_to_Date_.csv") as f:
        df_arrest = pd.read_csv(f)
# Treat the placeholder string "(null)" as a missing value.
df_arrest.replace("(null)", np.nan, inplace=True)
# Drop code and coordinate columns that are not needed, skipping any that are absent.
columns_to_drop = [
    'PD_CD', 'PD_DESC', 'KY_CD', 'LAW_CODE', 'LAW_CAT_CD', 'JURISDICTION_CODE',
    'X_COORD_CD', 'Y_COORD_CD', 'Latitude', 'Longitude', 'New Georeferenced Column',
]
columns_to_drop = [col for col in columns_to_drop if col in df_arrest.columns]
df_arrest.drop(columns=columns_to_drop, inplace=True)
df_arrest.dropna(inplace=True)
print("Cleaned Arrest DataFrame:")
print(df_arrest.head())

Cleaned Arrest DataFrame:
   ARREST_KEY ARREST_DATE       OFNS_DESC ARREST_BORO  ARREST_PRECINCT  \
0   281369711  01/30/2024      SEX CRIMES           M               25   
1   284561406  03/30/2024  FELONY ASSAULT           B               44   
2   284896016  04/06/2024  FELONY ASSAULT           M               19   
3   285569016  04/18/2024  FELONY ASSAULT           K               69   
4   287308954  05/22/2024        JOSTLING           M               18   

  AGE_GROUP PERP_SEX PERP_RACE  
0     25-44        M     BLACK  
1     25-44        M     BLACK  
2     25-44        M     BLACK  
3     25-44        M     BLACK  
4     18-24        M     WHITE  
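
The arrest output above codes the borough as a single letter (`M`, `B`, `K`), while the hate-crime output uses county names (`BRONX`, `KINGS`) and the shooting data loaded below uses full borough names. The sketch below shows one way to bring these onto a common label before joining on borough. The two dictionaries and the helper `normalize_boro` are hypothetical; the mappings follow the usual NYC borough/county correspondence and should be checked against each dataset's data dictionary.

```python
import pandas as pd

# Assumed mappings (not taken from the notebook): single-letter codes in the
# arrest extract and county names in the hate-crime extract.
ARREST_CODE_TO_BORO = {
    "B": "BRONX", "K": "BROOKLYN", "M": "MANHATTAN",
    "Q": "QUEENS", "S": "STATEN ISLAND",
}
COUNTY_TO_BORO = {
    "BRONX": "BRONX", "KINGS": "BROOKLYN", "NEW YORK": "MANHATTAN",
    "QUEENS": "QUEENS", "RICHMOND": "STATEN ISLAND",
}

def normalize_boro(series, mapping):
    """Strip and upper-case the labels, then map them; unmapped values become NaN."""
    return series.str.strip().str.upper().map(mapping)

# Toy inputs shaped like the BORO columns of the two extracts.
arrest_boro = pd.Series(["M", "B", "K", " q "])
hate_boro = pd.Series(["KINGS", "BRONX", "RICHMOND"])

print(normalize_boro(arrest_boro, ARREST_CODE_TO_BORO).tolist())
print(normalize_boro(hate_boro, COUNTY_TO_BORO).tolist())
```

Because unmapped labels become `NaN`, an `isna().sum()` on the result is a quick way to spot codes the mapping does not cover.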


In [61]:
df_shooting = pd.read_csv("NYPD_shooting_incident_data__Historic__.csv")
# Treat the placeholder string "(null)" as a missing value.
df_shooting = df_shooting.replace("(null)", np.nan)
columns_to_keep = ['OCCUR_DATE', 'BORO', 'VIC_RACE', 'VIC_AGE_GROUP', 'VIC_SEX', 'PERP_SEX', 'PERP_RACE', 'PRECINCT']
# Keep only the analysis columns, then drop rows with any missing value.
df_shooting = df_shooting[columns_to_keep].dropna()
print(df_shooting.head())

   OCCUR_DATE      BORO VIC_RACE VIC_AGE_GROUP VIC_SEX PERP_SEX  \
1  04/07/2018  BROOKLYN    BLACK         25-44       M        M   
3  11/19/2006  BROOKLYN    BLACK         18-24       M        U   
4  05/09/2010     BRONX    BLACK           <18       F        M   
5  07/22/2012     BRONX    BLACK         18-24       M        M   
8  04/21/2015  BROOKLYN    BLACK         25-44       M        M   

        PERP_RACE  PRECINCT  
1  WHITE HISPANIC        79  
3         UNKNOWN        66  
4           BLACK        46  
5           BLACK        42  
8           BLACK        75  
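
When the frames are combined on date and borough, a row-level outer merge pairs every arrest with every shooting that shares the same day and borough, which multiplies rows. One alternative, sketched here under the assumption that day-by-borough counts are enough for the analysis, is to aggregate each source first and then merge the counts. The helper `daily_counts`, the column names `n_arrests` and `n_shootings`, and the toy frames are hypothetical.

```python
import pandas as pd

def daily_counts(df, name):
    """Count rows per (DATE, BORO) and store the count in a column called `name`."""
    return df.groupby(["DATE", "BORO"]).size().rename(name).reset_index()

# Toy frames with the shared DATE and BORO keys; the values are illustrative only.
arrests = pd.DataFrame({
    "DATE": pd.to_datetime(["2024-01-30", "2024-01-30", "2024-01-31"]),
    "BORO": ["MANHATTAN", "MANHATTAN", "BRONX"],
})
shootings = pd.DataFrame({
    "DATE": pd.to_datetime(["2024-01-30", "2024-02-01"]),
    "BORO": ["MANHATTAN", "BROOKLYN"],
})

# One row per day and borough; a missing count means zero events in that source.
counts = (
    daily_counts(arrests, "n_arrests")
    .merge(daily_counts(shootings, "n_shootings"), on=["DATE", "BORO"], how="outer")
    .fillna({"n_arrests": 0, "n_shootings": 0})
    .astype({"n_arrests": int, "n_shootings": int})
    .sort_values(["DATE", "BORO"])
    .reset_index(drop=True)
)
print(counts)
```

This keeps one row per day and borough, so the size of the result is bounded by the number of distinct day-borough pairs rather than by the product of the row counts.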


In [69]:
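# Give all three frames shared DATE and BORO keys.
# Dates are parsed as MM/DD/YYYY, the format shown in the printed outputs.
df_hate_crimes.rename(columns={'Arrest Date': 'DATE', 'County': 'BORO'}, inplace=True)
df_hate_crimes['DATE'] = pd.to_datetime(df_hate_crimes['DATE'], format='%m/%d/%Y')

df_arrest.rename(columns={'ARREST_DATE': 'DATE', 'ARREST_BORO': 'BORO'}, inplace=True)
df_arrest['DATE'] = pd.to_datetime(df_arrest['DATE'], format='%m/%d/%Y')

df_shooting.rename(columns={'OCCUR_DATE': 'DATE'}, inplace=True)
df_shooting['DATE'] = pd.to_datetime(df_shooting['DATE'], format='%m/%d/%Y')

# Caveat: BORO is coded differently across the sources ('K' in arrests, 'KINGS' in
# hate crimes, 'BROOKLYN' in shootings), so rows match only where labels coincide.
merged_df = pd.merge(df_arrest, df_shooting, on=['DATE', 'BORO'], how='outer')
merged_df = pd.merge(merged_df, df_hate_crimes, on=['DATE', 'BORO'], how='outer')

print(merged_df.head())
merged_df.to_csv("merged_nypd_data.csv", index=False)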


   ARREST_KEY       DATE OFNS_DESC       BORO  ARREST_PRECINCT AGE_GROUP  \
0         NaN 2006-01-01       NaN      BRONX              NaN       NaN   
1         NaN 2006-01-01       NaN      BRONX              NaN       NaN   
2         NaN 2006-01-01       NaN   BROOKLYN              NaN       NaN   
3         NaN 2006-01-01       NaN  MANHATTAN              NaN       NaN   
4         NaN 2006-01-01       NaN     QUEENS              NaN       NaN   

  PERP_SEX_x PERP_RACE_x        VIC_RACE VIC_AGE_GROUP VIC_SEX PERP_SEX_y  \
0        NaN         NaN           BLACK           <18       M          M   
1        NaN         NaN  WHITE HISPANIC         18-24       M          M   
2        NaN         NaN           BLACK         18-24       M          U   
3        NaN         NaN           BLACK         25-44       M          M   
4        NaN         NaN           BLACK         25-44       M          M   

      PERP_RACE_y  PRECINCT  Complaint Year Number Record Create Date  \
0      