# **Data Analysis of the [Chicago Arrest Records](https://www.kaggle.com/datasets/davinascimento/chicago-arrest-records)**
### **Objective:**
The objective of this notebook is to answer several key questions derived from the dataset that shed light on important aspects of the arrest records. Through this analysis, we aim to:
- Understand the distribution of arrests over time and across different areas of Chicago.
- Explore which types of crimes are most prevalent in different regions.
- Analyze the demographic breakdown of individuals arrested, focusing on age, gender, and race.
- Identify any patterns in the data that could suggest socioeconomic or policy-related implications.

### **Structure of the Analysis**
The analysis is structured as follows:
1. **Data Preprocessing**: We'll clean the dataset, handle missing values, and ensure that the data is in the correct format for analysis.
2. **Key Questions and Answers**: A set of specific questions will be presented, and the answers will be shown using analysis and visualizations. This section includes:
    - Trends in arrests over time (yearly, monthly, and by hour of day).
    - The most common types of crimes and their frequency.
    - Demographic analysis of the arrested individuals.
3. **Conclusions**: A summary of the key findings and any significant patterns identified during the analysis.

### **Questions Addressed**
Some of the key questions we aim to answer through this analysis include:
1. **How have arrest numbers changed over time in Chicago? Are there any seasonal or time-based trends in criminal activity?**
2. **Which types of crimes are most common in the city?**
3. **What are the most affected demographic groups in terms of race?**

Each question will be addressed using code, visualizations, and detailed analysis.

In [None]:
#Libraries
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

In [None]:
# Functions
def get_variable_types(dataframe):
    continuous_vars = []
    categorical_vars = []

    for column in dataframe.columns:
        if dataframe[column].dtype == 'object':
            categorical_vars.append(column)
        else:
            continuous_vars.append(column)

    return continuous_vars, categorical_vars

# Plot the distribution of a column as a bar chart and a pie chart
def plot_distribution(df, column):
    # Calculate value counts
    value_counts = df[column].value_counts()

    # Create a figure with two subplots
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))

    # Bar plot on the first subplot
    sns.barplot(x=value_counts.index, y=value_counts.values, palette="viridis", ax=ax1)
    ax1.set_xlabel(column, fontsize=12)
    ax1.set_ylabel('Count', fontsize=12)
    ax1.set_xticklabels(ax1.get_xticklabels(), rotation=45, ha='right', fontsize=10)

    # Add data labels above each bar
    for index, value in enumerate(value_counts):
        ax1.text(index, value, str(value), ha='center', va='bottom', fontsize=10)

    # Pie plot on the second subplot
    ax2.pie(value_counts, labels=value_counts.index, autopct='%1.1f%%', colors=sns.color_palette("viridis", len(value_counts)))
    ax2.axis('equal')

    # Main title for the figure
    fig.suptitle(f'Distribution of {column}', fontsize=18)
    
    # Adjust layout and display the figure
    plt.tight_layout()
    plt.show()

## **Data Preprocessing**

In [None]:
# Load dataset
df = pd.read_csv('arrests.csv')

In [None]:
df.head()

In [None]:
# Continuous and Categorical Variables
continuous_vars, categorical_vars = get_variable_types(df)

print("Continuous Variables:", continuous_vars)
print("Categorical Variables:", categorical_vars)

In [None]:
# Values of each categorical column
for column in categorical_vars:
    print("column : ",column,"\nunique values : ",df[column].unique(),"\n")

In [None]:
# Eliminate some features that are not relevant for the analysis
df = df.drop(columns=['CB_NO', 'CASE NUMBER','CHARGE 1 STATUTE','CHARGE 1 DESCRIPTION','CHARGE 1 TYPE','CHARGE 1 CLASS','CHARGE 2 STATUTE','CHARGE 2 DESCRIPTION','CHARGE 2 TYPE','CHARGE 2 CLASS','CHARGE 3 STATUTE','CHARGE 3 DESCRIPTION','CHARGE 3 TYPE','CHARGE 3 CLASS','CHARGE 4 STATUTE','CHARGE 4 DESCRIPTION','CHARGE 4 TYPE','CHARGE 4 CLASS'])

In [None]:
# Convert the Arrest Date feature to Date type
df['ARREST DATE'] = pd.to_datetime(df['ARREST DATE'])

In [None]:
# Extract year, month, hour, and month-year period from the Date feature
df['Year'] = df['ARREST DATE'].dt.year
df['Month'] = df['ARREST DATE'].dt.month
df['Hour'] = df['ARREST DATE'].dt.hour
df['Month-Year'] = df['ARREST DATE'].dt.to_period('M') # Useful for month-year analysis

In [None]:
# Split the pipe-delimited CHARGES DESCRIPTION feature into one column per charge
df[['Charge_1', 'Charge_2', 'Charge_3', 'Charge_4']] = df['CHARGES DESCRIPTION'].str.split('|', expand=True)
# Strip any leading/trailing whitespace from each charge
df[['Charge_1', 'Charge_2', 'Charge_3', 'Charge_4']] = df[['Charge_1', 'Charge_2', 'Charge_3', 'Charge_4']].apply(lambda x: x.str.strip())
# Treat empty strings as missing values
df = df.replace("", pd.NA)
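
The split above only works if it yields exactly four columns. The sketch below, on made-up rows (the toy values and the `toy`, `parts`, and `charge_cols` names are illustrative assumptions, not part of the dataset), shows one way to guarantee four charge columns regardless of how many charges appear in the data.

```python
import pandas as pd

# Toy stand-in for the 'CHARGES DESCRIPTION' column (values are made up)
toy = pd.DataFrame({
    'CHARGES DESCRIPTION': [
        'RETAIL THEFT | BATTERY',
        'ISSUANCE OF WARRANT',
        None,
    ]
})

charge_cols = ['Charge_1', 'Charge_2', 'Charge_3', 'Charge_4']

parts = (
    toy['CHARGES DESCRIPTION']
    .str.split('|', n=3, expand=True)      # at most 4 pieces per row
    .apply(lambda s: s.str.strip())        # strip while columns still hold strings
    .reindex(columns=range(4))             # pad with NaN columns up to 4
)
parts.columns = charge_cols

toy = toy.join(parts)
print(toy[charge_cols])
```

Stripping before `reindex` matters here: the padded columns contain only NaN and no longer support the `.str` accessor.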

## **Questions and Answers**

### **How have arrest numbers changed over time in Chicago? Are there any seasonal or time-based trends in criminal activity?**

#### For these two questions we will analyze the following aspects of the data:
- **Number of arrests per year**: This will help us identify any significant year-over-year changes.
- **Monthly breakdown of arrests for each year**: By analyzing the arrest count month-by-month, we can detect potential seasonal patterns or anomalies in specific periods.
- **Arrest patterns by time of day**: Understanding at which times of day arrests are most frequent can help in identifying high-crime periods, which may influence policing strategies.
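
One way to make seasonal patterns easier to compare across years is a year-by-month table of counts, normalized within each year. The following is a minimal sketch on synthetic timestamps; the dates and the `year_month` and `monthly_share` names are assumptions for illustration only.

```python
import pandas as pd

# Synthetic arrest timestamps, for illustration only
dates = pd.to_datetime([
    '2019-01-05', '2019-01-20', '2019-07-04', '2019-07-15', '2019-07-30',
    '2020-01-10', '2020-07-01', '2020-07-02',
])
toy = pd.DataFrame({'ARREST DATE': dates})
toy['Year'] = toy['ARREST DATE'].dt.year
toy['Month'] = toy['ARREST DATE'].dt.month

# Rows are years, columns are months, cells are arrest counts
year_month = toy.groupby(['Year', 'Month']).size().unstack(fill_value=0)

# Share of each year's arrests that falls in each month
monthly_share = year_month.div(year_month.sum(axis=1), axis=0)
print(monthly_share.round(2))
```

Normalizing per year keeps a low-volume year from hiding a seasonal shape that is otherwise similar to the other years.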


In [None]:
# Annual Arrests
arrests_per_year = df.groupby('Year').size() # Group by year and count arrests

plt.figure(figsize=(10, 6))
arrests_per_year.plot(kind='bar', color='skyblue')
plt.title('Number of Arrests Per Year')
plt.xlabel('Year')
plt.ylabel('Number of Arrests')
plt.xticks(rotation=45)
plt.show()

#### While the number of arrests remained relatively stable between 2014 and 2019, the significant drop in 2020 and the consistently lower figures in subsequent years may suggest a lasting impact from both the COVID-19 public health crisis and criminal justice reforms (bail reforms, decriminalization of minor offenses).
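
To put a number on a year-over-year drop, the percentage change between consecutive years can be computed with `pct_change`. The totals below are made up and are not the Chicago figures; `arrests_per_year_toy` is a hypothetical stand-in for `arrests_per_year`.

```python
import pandas as pd

# Made-up yearly totals, indexed by year (not the real Chicago figures)
arrests_per_year_toy = pd.Series({2018: 1000, 2019: 950, 2020: 570, 2021: 513})

# Percentage change relative to the previous year (first year has no baseline)
yoy_change = arrests_per_year_toy.pct_change() * 100
print(yoy_change.round(1))

# Year with the steepest decline
steepest_drop_year = yoy_change.idxmin()
print(steepest_drop_year)
```

If the first or last year in the data is only partially covered, its change will look artificially large and is better excluded from the comparison.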

In [None]:
# Monthly Arrests by Year
arrests_per_month_year = df.groupby('Month-Year').size() # Group by month-year and count arrests

plt.figure(figsize=(12, 6))
arrests_per_month_year.plot(kind='line', color='purple')
plt.title('Number of Arrests Per Month-Year')
plt.xlabel('Month-Year')
plt.ylabel('Number of Arrests')
plt.xticks(rotation=45)
plt.show()

#### This more detailed view of arrests per month-year shows a sharp drop in arrest numbers in early 2020, coinciding with the onset of the COVID-19 pandemic.

In [None]:
# Arrests by Hour of the Day
arrests_per_hour = df.groupby('Hour').size() # Group by hour and count arrests

plt.figure(figsize=(10, 6))
arrests_per_hour.plot(kind='bar', color='orange')
plt.title('Number of Arrests by Hour of the Day')
plt.xlabel('Hour of Day')
plt.ylabel('Number of Arrests')
plt.xticks(rotation=0)
plt.show()

#### The visualization clearly shows that arrests peak during the late afternoon and early evening hours, with the highest activity occurring between 15:00 and 20:00, likely due to increased public movement during these times. Moreover, the early morning hours, particularly between 2:00 and 7:00, see the lowest number of arrests, reflecting reduced public activity and fewer law enforcement interactions during these hours.
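
The peak and trough windows can also be expressed as shares of all arrests. The sketch below uses synthetic hour values; the window boundaries follow the text above, and the variable names are illustrative assumptions.

```python
import pandas as pd

# Synthetic hour-of-day values (0-23), for illustration only
hours = pd.Series([3, 9, 15, 16, 16, 18, 19, 19, 20, 23], name='Hour')

# Count per hour, including hours with zero arrests
counts = hours.value_counts().reindex(range(24), fill_value=0)
share = counts / counts.sum()

# .loc slicing on the integer index is inclusive on both ends
afternoon_evening = share.loc[15:20].sum()
early_morning = share.loc[2:7].sum()
print(f'15:00-20:00 share: {afternoon_evening:.0%}')
print(f'02:00-07:00 share: {early_morning:.0%}')
```

Reindexing over all 24 hours keeps empty hours visible, which a plain `groupby('Hour').size()` would silently drop.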

### **Which types of crimes are most common in the city?**
#### To answer this question we'll consolidate all the charges into a single list and we'll count the frequency of each charge to determine the most common crimes.

In [None]:
# Combine all charges into one column 
charges_combined = pd.concat([df['Charge_1'], df['Charge_2'], df['Charge_3'], df['Charge_4']], ignore_index=True)
charges_combined = charges_combined.str.strip() # Remove any leading/trailing whitespace in charge descriptions
charges_combined = charges_combined.dropna() # Drop any missing values (empty charges)
charge_counts = charges_combined.value_counts() # Count the frequency of each type of charge

plt.figure(figsize=(10, 6))
charge_counts.head(10).plot(kind='bar', color='orange')
plt.title('Top 10 Most Common Crimes in the City')
plt.xlabel('Crime Type')
plt.ylabel('Number of Occurrences')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

#### The most common charge is 'Issuance of Warrant', which far exceeds all others even though it is a procedural entry and not a direct offense. It is followed by drug possession and driving violations. Domestic violence and retail theft are also prevalent in the city.
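
Because the warrant entry is procedural, it can be useful to look at each charge's share of the total and at a ranking with warrants excluded. This is a minimal sketch on made-up rows; the exact label string `'ISSUANCE OF WARRANT'` is an assumption and should be checked against the values in the dataset.

```python
import pandas as pd

# Made-up charge columns, for illustration only
toy = pd.DataFrame({
    'Charge_1': ['ISSUANCE OF WARRANT', 'RETAIL THEFT', 'ISSUANCE OF WARRANT', 'BATTERY'],
    'Charge_2': ['RETAIL THEFT', None, None, 'ISSUANCE OF WARRANT'],
})

# Stack all charge columns into one Series and drop missing entries
all_charges = pd.concat([toy['Charge_1'], toy['Charge_2']], ignore_index=True).dropna()

# Share of all recorded charges held by each charge type
charge_share = all_charges.value_counts(normalize=True)
print(charge_share)

# Ranking with the procedural warrant entry removed (label is an assumption)
offense_only = all_charges[all_charges != 'ISSUANCE OF WARRANT'].value_counts()
print(offense_only)
```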

### **What are the most affected demographic groups in terms of race?**
#### Analyzing the relationship between race and arrest records can be complex due to multiple socio-economic and systemic factors. For example, previous studies suggest that neighborhoods with higher minority populations, such as Black and Latino communities, often experience more intensive police surveillance than predominantly white neighborhoods. This difference in policing patterns could influence arrest rates and the types of crimes recorded.

#### For this analysis, we will focus on data-driven insights to explore:
1. **Number of arrests per race**: Identifying which racial groups have the highest number of arrests, measured here by the total number of recorded charges.
2. **The most common causes of arrest per race**: Analyzing the types of charges most frequently associated with each racial group.

In [None]:
# Melt the charge columns into a single column
charges_melted = df.melt(id_vars=['RACE'], value_vars=['Charge_1', 'Charge_2', 'Charge_3', 'Charge_4'], var_name='Charge_Number', value_name='Charge')

# Group by race and charge 
race_charge_counts = charges_melted.groupby(['RACE', 'Charge']).size().reset_index(name='Count') 

In [None]:
# Number of charges per race (one arrest can carry up to four charges)
total_charges_by_race = race_charge_counts.groupby('RACE')['Count'].sum().reset_index()

plt.figure(figsize=(10, 6))
plt.bar(total_charges_by_race['RACE'], total_charges_by_race['Count'], color='skyblue')
plt.xlabel('Race')
plt.ylabel('Number of Charges')
plt.title('Total Number of Charges by Race')
plt.xticks(rotation=80)
plt.show()

#### The graph reveals that Black individuals face the highest number of charges by a significant margin compared to other racial groups.
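
Since one arrest can carry several charges, charge totals and arrest totals are not the same measure. The sketch below separates the two on made-up rows, assuming each row of the table is one arrest; the race labels `A` and `B` and all variable names are placeholders.

```python
import pandas as pd

# Made-up rows, one per arrest (an assumption about the table layout)
toy = pd.DataFrame({
    'RACE': ['A', 'A', 'B'],
    'Charge_1': ['X', 'Y', 'X'],
    'Charge_2': ['Z', None, 'Y'],
})
charge_cols = ['Charge_1', 'Charge_2']

# Arrests: number of rows per group
arrests_by_race = toy.groupby('RACE').size()

# Charges: number of non-missing charge entries per group
charges_by_race = toy.groupby('RACE')[charge_cols].count().sum(axis=1)

# Average number of charges attached to each arrest
charges_per_arrest = charges_by_race / arrests_by_race
print(charges_per_arrest)
```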

In [None]:
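# The most common causes of arrest per race
# Sort by count first: groupby().head(5) keeps the first five rows of each group, not the five largest
top_charges_by_race = race_charge_counts.sort_values('Count', ascending=False).groupby('RACE').head(5) # Get the top 5 charges for each race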

races = top_charges_by_race['RACE'].unique() # Get the races

for race in races:
    race_data = top_charges_by_race[top_charges_by_race['RACE'] == race] # Get data of the current race

    plt.figure(figsize=(10, 6))
    sns.barplot(x='Charge', y='Count', data=race_data)
    
    plt.title(f'Most Common Charges for {race}')
    plt.xticks(rotation=80)
    plt.ylabel('Count')
    plt.xlabel('Charge')
    
    plt.tight_layout()
    plt.show()

#### In the charts above, assault charges, particularly against public officials, are more prevalent among Amer Indian, Asian/Pacific Islander, and Black Hispanic groups, while drug-related charges, specifically involving LSD analogs, dominate among White and White Hispanic individuals. This suggests that the mix of charges differs across racial groups. These are raw counts, however, and do not account for differences in group size or policing intensity, so they indicate an association between race and the nature of the charges, not a cause.
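
Comparing charge mixes across groups of very different sizes is easier with within-group shares than with raw counts. The following sketch ranks charges by count inside each group and adds each charge's share of that group's total; the table is made up and `race_charge_counts_toy` is a hypothetical stand-in for `race_charge_counts`.

```python
import pandas as pd

# Made-up (race, charge, count) table, for illustration only
race_charge_counts_toy = pd.DataFrame({
    'RACE':   ['A', 'A', 'A', 'B', 'B'],
    'Charge': ['ASSAULT', 'THEFT', 'WARRANT', 'ASSAULT', 'THEFT'],
    'Count':  [5, 20, 75, 30, 10],
})

ranked = race_charge_counts_toy.copy()

# Share of each charge within its own group
ranked['Share'] = ranked['Count'] / ranked.groupby('RACE')['Count'].transform('sum')

# Top 2 charges per group, ordered by count (largest first)
top_per_race = (
    ranked.sort_values(['RACE', 'Count'], ascending=[True, False])
          .groupby('RACE')
          .head(2)
)
print(top_per_race)
```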