# Crime Patterns and Arrest Trends Due to Socioeconomic Disparity: A Data Science Analysis

## Project Introduction
Our project investigates disparities in crime and arrest trends across New York City (NYC) neighborhoods, focusing on how these trends correlate with income levels, racial demographics, and geographic location.

### Research Questions
- Do crime and arrest rates vary by location?
- Do areas with lower income experience higher police activity and crime rates?
- Do we see similar trends of systemic discrimination in areas with a higher population of people of color?
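
The first question can be explored with a simple count by borough. The sketch below is a minimal illustration, not the project's final analysis: it builds a toy frame that mirrors the `ARREST_BORO` and `PERP_RACE` columns of the NYPD arrest data loaded later in this notebook, and it assumes the single-letter borough codes follow the usual NYPD convention (B, K, M, Q, S). The names `BORO_NAMES`, `arrests_by_borough`, and `race_by_borough` are hypothetical.

```python
import pandas as pd

# Toy frame (assumption): five rows shaped like the NYPD arrest data,
# used only to illustrate the grouping. It is not the full dataset.
arrests = pd.DataFrame({
    "ARREST_BORO": ["M", "B", "M", "K", "M"],
    "PERP_RACE": ["BLACK", "BLACK", "BLACK", "BLACK", "WHITE"],
})

# Assumed mapping from single-letter borough codes to borough names.
BORO_NAMES = {
    "B": "Bronx",
    "K": "Brooklyn",
    "M": "Manhattan",
    "Q": "Queens",
    "S": "Staten Island",
}
arrests["borough"] = arrests["ARREST_BORO"].map(BORO_NAMES)

# Number of arrests per borough, largest first.
arrests_by_borough = arrests.groupby("borough").size().sort_values(ascending=False)

# Arrest counts broken down by borough and recorded race.
race_by_borough = pd.crosstab(arrests["borough"], arrests["PERP_RACE"])

print(arrests_by_borough)
print(race_by_borough)
```

Raw counts are not rates. Answering the income and demographic questions would also require population and income figures per area, which this sketch does not include.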

### Data Sources
We use several public datasets from data.gov, including:
- **NYPD Shooting Incident Data**
- **NYC Crimes 2001–Present**
- **NYC Arrests Data**

## Project Status
Since our check-in proposal slides, the project scope has changed significantly. We initially planned to analyze crime and arrest trends using datasets from Chicago, but data limitations led us to switch to datasets from the NYPD. Our original objective is unchanged: analyzing crime and arrest trends by location, income, and racial demographics, now with a different geographic focus. Aside from this switch, we have not added or removed any major components so far.


In [7]:
import pandas as pd
import numpy as np
import re                        
from datetime import datetime
import zipfile

import matplotlib.pyplot as plt   
import seaborn as sns
import plotly.express as px 

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder, OneHotEncoder
from sklearn.impute import SimpleImputer          

from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.metrics import mean_squared_error, r2_score


In [10]:
import pandas as pd

# Load the hate-crimes CSV from inside its zip archive
with zipfile.ZipFile("NYPD_Hate_Crimes.zip", "r") as z:
    with z.open("NYPD_Hate_Crimes.csv") as f:
        df = pd.read_csv(f)

# Display the first 5 rows
#print(df.head())

# Check for missing values in each column
#print("\nMissing Values per Column:")
#print(df.isnull().sum())

# Convert date columns to datetime format
df['Record Create Date'] = pd.to_datetime(df['Record Create Date'], format='%m/%d/%Y', errors='coerce')
df['Arrest Date'] = pd.to_datetime(df['Arrest Date'], format='%m/%d/%Y', errors='coerce')

# Standardize column names by stripping spaces, replacing spaces with underscores, and making all lowercase
df.columns = [col.strip().replace(" ", "_").lower() for col in df.columns]

# Remove duplicate rows if any
df.drop_duplicates(inplace=True)

# Display the cleaned data
print("\nCleaned Data:")
#print(df.head())

# Set display options to show all columns in one line
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 1000)
pd.set_option('display.max_colwidth', None)

# Columns not needed for this analysis
columns_to_drop = [
    'complaint_precinct_code',
    'law_code_category_description',
    'pd_code_description',
    'bias_motive_description',
    'month_number',
    'patrol_borough_name',
    'full_complaint_id'
]

# Drop the columns from the DataFrame
df.drop(columns=columns_to_drop, inplace=True)

# Display the resulting DataFrame
display(df.head())



Cleaned Data:


| | complaint_year_number | record_create_date | county | offense_description | offense_category | arrest_date | arrest_id |
|---|---|---|---|---|---|---|---|
| 0 | 2021 | 2021-05-01 | BRONX | BURGLARY | Religion/Religious Practice | 2021-05-01 | B33683676 |
| 1 | 2021 | 2021-12-28 | BRONX | MISCELLANEOUS PENAL LAW | Religion/Religious Practice | 2022-09-28 | B34705870 |
| 2 | 2022 | 2022-10-11 | BRONX | FELONY ASSAULT | Sexual Orientation | 2022-10-11 | B34707656 |
| 3 | 2019 | 2019-01-15 | KINGS | MURDER & NON-NEGL. MANSLAUGHTE | Race/Color | 2019-01-16 | K31675023 |
| 4 | 2019 | 2019-02-08 | KINGS | OFF. AGNST PUB ORD SENSBLTY & | Religion/Religious Practice | 2019-02-08 | K31679592 |
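
Because the dates are parsed with `errors='coerce'`, any value that does not match the expected format silently becomes `NaT`. A minimal sketch of how one might check for that and then compare the two date columns is shown below. It uses only the five rows displayed above rather than the full dataset, and the `sample` frame, the `nat_counts` variable, and the `days_to_arrest` column are hypothetical names introduced for illustration.

```python
import pandas as pd

# Toy frame (assumption): the five rows shown in the output above.
sample = pd.DataFrame({
    "county": ["BRONX", "BRONX", "BRONX", "KINGS", "KINGS"],
    "record_create_date": ["2021-05-01", "2021-12-28", "2022-10-11",
                           "2019-01-15", "2019-02-08"],
    "arrest_date": ["2021-05-01", "2022-09-28", "2022-10-11",
                    "2019-01-16", "2019-02-08"],
})

date_cols = ["record_create_date", "arrest_date"]
for col in date_cols:
    # Unparseable values become NaT instead of raising an error.
    sample[col] = pd.to_datetime(sample[col], format="%Y-%m-%d", errors="coerce")

# Count how many values were coerced to NaT in each date column.
nat_counts = sample[date_cols].isna().sum()

# Days between record creation and arrest (hypothetical derived column).
sample["days_to_arrest"] = (
    sample["arrest_date"] - sample["record_create_date"]
).dt.days

print(nat_counts)
print(sample[["county", "days_to_arrest"]])
```

On the full dataset, a nonzero `NaT` count would suggest that some rows use a different date format or have no arrest date recorded, which is worth checking before any rows are dropped.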


In [11]:
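# Load the year-to-date arrest data from inside its zip archive.
# Note: this reuses the name df, so the hate-crimes DataFrame above is replaced.
with zipfile.ZipFile("NYPD_Arrest_Data__Year_to_Date_.zip", "r") as z:
    with z.open("NYPD_Arrest_Data__Year_to_Date_.csv") as f:
        df = pd.read_csv(f)

# Drop classification-code, jurisdiction, and coordinate columns not needed here
columns_to_drop = [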
    'PD_CD', 'PD_DESC', 'KY_CD', 'LAW_CODE', 'LAW_CAT_CD',
    'JURISDICTION_CODE', 'X_COORD_CD', 'Y_COORD_CD',
    'Latitude', 'Longitude', 'New Georeferenced Column'
]

df.drop(columns=columns_to_drop, inplace=True)

# Drop rows with any missing values
df.dropna(inplace=True)

print(df)


        ARREST_KEY ARREST_DATE                       OFNS_DESC ARREST_BORO  ARREST_PRECINCT AGE_GROUP PERP_SEX       PERP_RACE
0        281369711  01/30/2024                      SEX CRIMES           M               25     25-44        M           BLACK
1        284561406  03/30/2024                  FELONY ASSAULT           B               44     25-44        M           BLACK
2        284896016  04/06/2024                  FELONY ASSAULT           M               19     25-44        M           BLACK
3        285569016  04/18/2024                  FELONY ASSAULT           K               69     25-44        M           BLACK
4        287308954  05/22/2024                        JOSTLING           M               18     18-24        M           WHITE
...            ...         ...                             ...         ...              ...       ...      ...             ...
260498   298287970  12/20/2024                   PETIT LARCENY           K               90     25-44        M 