# Analyzing Austin PD's Crime Reports Dataset

The Austin Police Department's Crime Reports dataset provides valuable information about reported crimes in Austin, Texas. In this analysis, we will explore the dataset and gain insights into various aspects of crime in Austin.

## Table of Contents 

- I. Introduction
- II. Data Preparation
- III. Exploratory Analysis

## I. Introduction

The Austin Police Department's Crime Reports dataset offers a comprehensive view of reported crimes in Austin, Texas. By analyzing this dataset, we can extract valuable insights into the nature and patterns of crime in the city.

## II. Data Preparation

Before conducting any analysis, it is crucial to clean and preprocess the dataset. The data scrubbing process involves addressing missing data, removing irrelevant columns, and ensuring data types are appropriate.

## III. Exploratory Analysis

Once the dataset is scrubbed and prepared, we can perform an exploratory analysis to gain a deeper understanding of the crime data. The exploratory analysis may include the following:

1. Overall crime trends over time: We can examine the total number of reported crimes each year to identify any noticeable patterns or changes.

2. Crime distribution by category: Analyzing the distribution of crimes by category can provide insights into the most prevalent types of crimes in Austin.

3. Crime distribution by location: Exploring the locations where crimes occur most frequently can help identify high-crime areas or hotspots.

4. Crime distribution by time of day: Comparing the frequency of crimes across hours of the day or days of the week can reveal temporal patterns.

5. Crime correlations: We can explore potential correlations between different types of crimes, or examine the relationship between crime and other factors such as location, time, or demographic variables.

6. Crime mapping: Visualizing the spatial distribution of crimes on a map can help identify clusters or spatial patterns.

These are just a few examples of the exploratory analysis that can be performed using the Austin PD's Crime Reports dataset. The specific analysis and insights will depend on the available data and the research questions of interest.
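As a rough illustration of items 1, 2, and 4 above, the sketch below computes yearly, per-category, and per-hour counts. The `toy` frame and its `occurred` and `category` columns are hypothetical stand-ins, not the real dataset or its column names.

```python
import pandas as pd

# Hypothetical toy records standing in for the real crime reports
toy = pd.DataFrame(
    {
        "occurred": pd.to_datetime(
            [
                "2019-01-05 23:10",
                "2019-06-11 02:45",
                "2020-03-02 14:00",
                "2020-03-09 23:30",
                "2020-11-21 23:05",
            ]
        ),
        "category": ["Theft", "Burglary", "Theft", "Theft", "Robbery"],
    }
)

# 1. Reports per year
per_year = toy.groupby(toy["occurred"].dt.year).size()

# 2. Reports per category, most common first
per_category = toy["category"].value_counts()

# 4. Reports per hour of day
per_hour = toy.groupby(toy["occurred"].dt.hour).size()

print(per_year, per_category, per_hour, sep="\n\n")
```

The same three groupings apply to the real data once its timestamp column has been parsed as a datetime.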

We begin with some background on the project, then load the data and prepare it for analysis.

**I. Introduction**
I started reviewing the Crime Reports dataset provided by the Austin PD around the beginning of 2020, along with the Hate Crimes datasets. The dataset is quite extensive, containing over 2 million records from 2003 to the present, and it is updated on a weekly basis.

This project is self-paced and not affiliated with any work or educational institution. My goal is to uncover valuable insights that can benefit the Austin law enforcement community, news outlets, and anyone interested in understanding and addressing crime-related issues in the Austin area.

Initially, I attempted to import the data into this notebook using Sodapy's Socrata API method. However, I found it to be insufficient as it didn't import the complete dataset and added redundant columns. As a result, I decided to manually download the entire dataset and repeat the process each week after the updates.

In [None]:
# importing necessary libraries and configurations
import folium
from folium import plugins

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sb
import itertools
import warnings

pd.set_option("display.max_columns", None)
warnings.filterwarnings("ignore")
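# "seaborn-white" was renamed "seaborn-v0_8-white" in Matplotlib 3.6; use whichever exists
plt.style.use(
    "seaborn-v0_8-white"
    if "seaborn-v0_8-white" in plt.style.available
    else "seaborn-white"
)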
# sb.set_style("whitegrid")
%matplotlib inline

In [None]:
# loading the data
df = pd.read_csv("/kaggle/input/crime-reports-06152023/crime_reports.csv")

In [None]:
# examining the dataframe
df.info()  # prints its summary directly and returns None, so display() is not needed
display(df.isnull().sum())
display(df.head())
display(df.tail())

## II. Data Preparation

In this section, we will perform data preparation steps to clean and organize the crime reports data.

### 1. Cleaning the Data

The `clean_data` function cleans the data by performing the following operations:

- Drops unnecessary columns: We remove the columns that are not needed for our analysis.
- Renames columns: We standardize the column names by removing leading/trailing spaces and replacing spaces with underscores.
- Fills missing values: We replace missing values in specific columns with "Unknown".
- Converts data types: We convert certain columns to the appropriate data types, such as categorical and datetime.
- Creates additional time-based columns: We extract year, month, week, day, and hour information from the "occurred_date_time" column.
- Sets and sorts the index: We set the "occurred_date_time" column as the index and sort the data by the index.
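Two of the steps above, renaming columns and filling missing values, can be sketched on a toy frame. The column names and values here are hypothetical. The point of the example is that selecting several columns returns a new object, so the filled result has to be assigned back to take effect.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with untidy column names and missing values
toy = pd.DataFrame(
    {
        " Location Type": [np.nan, "STREET"],
        "Clearance Status ": ["C", np.nan],
    }
)

# Standardize names: strip whitespace, lowercase, spaces -> underscores
toy = toy.rename(columns=lambda name: name.strip().lower().replace(" ", "_"))

cols = ["location_type", "clearance_status"]

# Filling a selection without assigning it back leaves the original untouched
filled_copy = toy[cols].fillna("Unknown")
missing_before = int(toy[cols].isna().sum().sum())

# Assigning the result back is what actually updates the frame
toy[cols] = toy[cols].fillna("Unknown")
missing_after = int(toy[cols].isna().sum().sum())

print(list(toy.columns), missing_before, missing_after)
```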

### 2. Removing Duplicates

We check for and remove any duplicate rows in the dataset.
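One detail worth keeping in mind: `DataFrame.duplicated` and `drop_duplicates` compare column values only and ignore the index. The sketch below uses a hypothetical toy frame indexed by timestamp to show the difference this can make.

```python
import pandas as pd

# Hypothetical toy frame indexed by occurrence time
toy = pd.DataFrame(
    {"offense": ["THEFT", "THEFT", "DWI"], "zip_code": [78701, 78701, 78705]},
    index=pd.to_datetime(
        ["2020-01-01 10:00", "2020-01-02 18:30", "2020-01-03 09:15"]
    ),
)
toy.index.name = "occurred_date_time"

# Index labels are ignored, so the second THEFT row counts as a duplicate
ignoring_index = int(toy.duplicated().sum())

# Moving the index into a column makes the timestamp part of the comparison
including_index = int(toy.reset_index().duplicated().sum())

print(ignoring_index, including_index)
```

Whether this matters in practice depends on how many other distinguishing columns remain after cleaning.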

### 3. Analyzing Crime Rates

We display dataframes showing the crime rates by zip code, both in counts and percentages.

### 4. Visualizing Crime Rates

We visualize the 25 zip codes with the most reported crimes in Austin: a bar chart shows the total number of crimes for each zip code, and a time series plot shows the monthly crime totals together with a 12-month rolling average.
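The 12-month rolling average can be sketched on synthetic data as follows. The `events` frame is hypothetical (one record per day over two years), and the monthly counts are built by grouping on the calendar month of the datetime index, which is one of several equivalent ways to do it.

```python
import pandas as pd

# Hypothetical events: one record per day across 2020 and 2021
events = pd.DataFrame(
    {"offense": "THEFT"},
    index=pd.date_range("2020-01-01", "2021-12-31", freq="D"),
)

# Count events per calendar month
monthly = events.groupby(events.index.to_period("M")).size()

# 12-month rolling mean; the first 11 months have no full window and are NaN
rolling = monthly.rolling(window=12).mean()

print(monthly.head(3))
print(rolling.iloc[10:13])
```

The rolling mean smooths out seasonal swings, which makes the long-term trend easier to read than the raw monthly totals.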

The following code provides a reproducible script for performing the data preparation steps and generating the necessary visualizations.

In [None]:
def clean_data(df):
    """
    Clean the crime data by removing unnecessary columns, renaming columns, filling missing values,
    converting data types, and creating additional time-based features.

    Args:
        df (pandas.DataFrame): Crime data.

    Returns:
        pandas.DataFrame: Cleaned crime data.
    """
    drop_col = [
        "Highest Offense Code",
        "Incident Number",
        "Occurred Time",
        "Occurred Date",
        "Report Date",
        "Report Time",
        "UCR Category",
        "X-coordinate",
        "Y-coordinate",
        "Location",
    ]
    df.drop(drop_col, axis=1, inplace=True)
    df.rename(columns=lambda x: x.strip().lower().replace(" ", "_"), inplace=True)
    
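    # Columns whose missing values are replaced with "Unknown"
    fillna_cols = [
        "zip_code",
        "location_type",
        "council_district",
        "pra",
        "census_tract",
        "apd_district",
        "apd_sector",
        "clearance_status",
        "category_description",
    ]
    # Assign the result back: fillna(inplace=True) on df[fillna_cols] would only
    # modify a temporary copy and leave df unchanged
    df[fillna_cols] = df[fillna_cols].fillna("Unknown")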

    date_cols = ["occurred_date_time", "report_date_time", "clearance_date"]
    cat_cols = [
        "highest_offense_description",
        "zip_code",
        "location_type",
        "council_district",
        "apd_district",
        "apd_sector",
        "pra",
        "census_tract",
        "category_description",
    ]

    # Assign results back rather than calling replace(inplace=True) through attribute access
    df["family_violence"] = df["family_violence"].replace({"Y": "True", "N": "False"})
    df["clearance_status"] = df["clearance_status"].replace(
        {"C": "True", "O": "True", "N": "False"}
    )

    df[cat_cols] = df[cat_cols].astype("category")
    # pd.to_datetime replaces astype("datetime64"), which newer pandas versions reject
    for col in date_cols:
        df[col] = pd.to_datetime(df[col])

    # The column is already datetime, so the features come straight from the .dt accessor
    occurred = df["occurred_date_time"].dt
    df["year"] = occurred.year
    df["month"] = occurred.month
    df["week"] = occurred.isocalendar().week  # .dt.week was removed in pandas 2.0
    df["day"] = occurred.day
    df["hour"] = occurred.hour

    df.set_index(["occurred_date_time"], inplace=True)
    df.sort_index(inplace=True)
    
    return df

df = clean_data(df)



In [None]:
df.drop_duplicates(inplace=True)
df.duplicated().sum()

In [None]:
# Re-examining the dataframe
df.info()  # prints its summary directly and returns None, so display() is not needed
display(df.head())
display(df.tail())

## III. Exploratory Analysis

In this section, we will perform exploratory analysis to determine the areas of Austin with the highest crime rates.

### A. Crime Rates by Area

To identify the areas of Austin with the highest crime rates, we will analyze the distribution of crimes across different areas. We will focus on the zip codes as the primary area indicator in our dataset.

#### 1. Crime Rates by Zip Code

We start by examining the crime rates by zip code. The following code calculates the total number of crimes for each zip code and displays the top 10 zip codes with the highest crime rates.

```python
crime_by_zipcode = df["zip_code"].value_counts().head(10).to_frame()
display(crime_by_zipcode)

```

#### 2. Crime Rates as Percentages

In addition to the total crime counts, it is also informative to analyze the crime rates as percentages relative to the total. The following code calculates the crime rates as percentages for each zip code and displays the top 10 zip codes with the highest crime rates.

```python
crime_rates_by_zipcode = df["zip_code"].value_counts(normalize=True).head(10).to_frame()
display(crime_rates_by_zipcode)

```
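Counts and percentages can also be shown side by side in a single table. The sketch below uses a small hypothetical series of zip codes; the `crimes` and `percent` column names are illustrative choices.

```python
import pandas as pd

# Hypothetical zip codes for six reports
zips = pd.Series([78701, 78701, 78741, 78753, 78701, 78741], name="zip_code")

counts = zips.value_counts()
shares = zips.value_counts(normalize=True)

# Combine raw counts and percentage shares into one summary table
summary = pd.DataFrame({"crimes": counts, "percent": (shares * 100).round(1)})
print(summary)
```

Note that these percentages are shares of all reports, not per-capita rates, so they do not account for differences in population between zip codes.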

#### 3. Visualizing Crime Rates by Zip Code

To visually represent the crime rates by zip code, we can create a bar chart. The following code generates a bar chart showing the top 10 zip codes with the highest crime rates.

```python
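# DataFrame.plot creates its own figure, so the size is passed here instead of
# calling plt.figure() first (which would leave an empty figure behind)
crime_by_zipcode.plot(kind="bar", rot=45, figsize=(10, 6), legend=False)
plt.title("Top 10 Zip Codes with Highest Crime Rates")
plt.xlabel("Zip Code")
plt.ylabel("Number of Crimes")
plt.show()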
```

By analyzing the crime rates by area, specifically the zip codes, we can gain insights into the areas of Austin with the highest crime rates. This information can be valuable for understanding the distribution of crime and potentially identifying areas that require additional attention in terms of crime prevention and law enforcement efforts.

In [None]:
# Crime Rates by Zip Code

# Displaying dataframes for crime rates by zip code
crime_counts = df["zip_code"].value_counts().head(25).to_frame()
crime_percentages = df["zip_code"].value_counts(normalize=True).head(25).to_frame()

display(crime_counts)
display(crime_percentages)

# Visualizing the top 25 crime-ridden zip codes in Austin
plt.figure(figsize=(8, 4), dpi=100)
df["zip_code"].value_counts().head(25).plot.bar(rot=60)
plt.title("Overall crime in the top 25 zip codes (2003-present)")
plt.xlabel("Zip Code")
plt.ylabel("Number of Crimes")
plt.show()

# Creating a time series plot with a 12-month rolling average
plt.figure(figsize=(8, 4), dpi=100)
monthly_crimes = df.resample("M").size()
monthly_crimes.plot(color="red", linewidth=1.5, label="Total per month")
monthly_crimes.rolling(window=12).mean().plot(
    color="orange", linewidth=5, label="12-month Moving Average"
)
plt.title("Crimes per Month")
plt.xlabel("Month")
plt.ylabel("Number of Crimes")
plt.legend()
plt.show()

# Displaying top crime descriptions
top_crime_descriptions = (
    df["highest_offense_description"].value_counts().head(10).to_frame()
)
display(top_crime_descriptions)

# Analyzing monthly crime trends for selected offense descriptions
selected_crimes = [
    "BURGLARY OF VEHICLE",
    "THEFT",
    "FAMILY DISTURBANCE",
    "CRIMINAL MISCHIEF",
    "ASSAULT W/INJURY-FAM/DATE VIOL",
    "BURGLARY OF RESIDENCE",
    "DWI",
    "HARASSMENT",
    "DISTURBANCE - OTHER",
    "AUTO THEFT",
]
df_selected = df[df["highest_offense_description"].isin(selected_crimes)]

monthly = df_selected.resample("M").size().to_frame(name="TOTAL")
for crime in selected_crimes:
    monthly[crime] = (
        df_selected[df_selected["highest_offense_description"] == crime]
        .resample("M")
        .size()
    )
# Plotting monthly crime trends for selected offenses
plt.figure(figsize=(15, 20), dpi=100)
for i, crime in enumerate(selected_crimes, 1):
    ax = plt.subplot(5, 2, i)
    monthly[crime].plot(color="red", linewidth=1.5, label="Total per month")
    monthly[crime].rolling(window=12).mean().plot(
        color="orange", linewidth=5, label="12-month Moving Average"
    )
    plt.title(crime, fontsize=12)
    plt.xlabel("")
    plt.legend(prop={"size": 12})
    plt.tick_params(labelsize=12)
plt.tight_layout()
plt.show()



### How is crime distributed in 78701 (downtown Austin)? 

In [None]:
# Filtering data for the 78701 area
df_01 = df[df['zip_code'] == 78701]

# Create a dataframe for the top crime categories in the zipcode
df_01_off = df_01['highest_offense_description'].value_counts().head(24)

# Display the count of different crime categories
display(df_01_off.to_frame())

# Display the crime categories as percentages
display(df_01['highest_offense_description'].value_counts(normalize=True).head(24).to_frame())

# Plotting a pie chart for crime distribution in 78701
plt.figure(figsize=(8, 8), dpi=100)
df_01_off.plot.pie(title="Crime Distribution (78701)")
plt.ylabel('')  # Remove y-label
plt.show()


To analyze another zip code, change the value compared against `zip_code` in the filter above.
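One way to make the per-zip-code breakdown reusable is to wrap the filter in a small helper. The function name `top_offenses` and the toy data are hypothetical. Only the `zip_code` and `highest_offense_description` column names come from the notebook.

```python
import pandas as pd


def top_offenses(data, zip_code, n=5):
    """Return the n most frequent offense descriptions for one zip code."""
    in_zip = data[data["zip_code"] == zip_code]
    return in_zip["highest_offense_description"].value_counts().head(n)


# Hypothetical toy records
toy = pd.DataFrame(
    {
        "zip_code": [78701, 78701, 78701, 78705, 78705],
        "highest_offense_description": [
            "THEFT",
            "THEFT",
            "DWI",
            "BURGLARY OF VEHICLE",
            "THEFT",
        ],
    }
)

print(top_offenses(toy, 78701))
print(top_offenses(toy, 78705, n=1))
```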


### How are violent crimes distributed? 

In [None]:
# Creating separate dataframes for violent crime & murder rates
df_viol = df.query(
    'category_description == ["Aggravated Assault", "Robbery", "Rape", "Murder"]'
)
df_mur = df[df.category_description == "Murder"]
df_agg_asslt = df[df.category_description == "Aggravated Assault"]
df_robbery = df[df.category_description == "Robbery"]
df_rape = df[df.category_description == "Rape"]

# Creating yearly dataframes
# Annual overall crime
df_17 = df[df.year == 2017]
df_18 = df[df.year == 2018]
df_19 = df[df.year == 2019]
df_20 = df[df.year == 2020]
df_21 = df[df.year == 2021]

# Annual violent crime
df_viol_17 = df_viol[df_viol.year == 2017]
df_viol_18 = df_viol[df_viol.year == 2018]
df_viol_19 = df_viol[df_viol.year == 2019]
df_viol_20 = df_viol[df_viol.year == 2020]
df_viol_21 = df_viol[df_viol.year == 2021]

# Annual murders
df_mur_17 = df_mur[df_mur.year == 2017]
df_mur_18 = df_mur[df_mur.year == 2018]
df_mur_19 = df_mur[df_mur.year == 2019]
df_mur_20 = df_mur[df_mur.year == 2020]
df_mur_21 = df_mur[df_mur.year == 2021]

In [None]:
# df_viol, df_mur, df_agg_asslt, df_robbery and df_rape from the previous cell are reused here

# The same yearly splits as lists, ordered 2017 through 2021
years = range(2017, 2022)

# Creating yearly dataframes for overall crime
dfs_overall = [df[df.year == year] for year in years]

# Creating yearly dataframes for violent crime
dfs_violent = [df_viol[df_viol.year == year] for year in years]

# Creating yearly dataframes for murders
dfs_murders = [df_mur[df_mur.year == year] for year in years]

# Plotting total property and violent crimes
plt.figure(figsize=(12, 6), dpi=100)
sb.countplot(x="category_description", data=df).set_title(
    "Total Property & Violent Crimes (2003-Present)"
)
plt.xlabel("Crime Type")
plt.ylabel("Total Incidents")
plt.xticks(rotation=60)
plt.show()

# Displaying count of each crime category
display(df.category_description.value_counts())

# Plotting total violent crimes and murders by zip code
fig, axs = plt.subplots(figsize=(16, 6), ncols=2, dpi=100)
df_viol.zip_code.value_counts().head(25).plot.bar(
    ax=axs[0], title="Total Violent Crimes in Top 25 Zip Codes (2003-Present)", rot=60
)
df_mur.zip_code.value_counts().head(25).plot.bar(
    ax=axs[1], title="Total Murders in Top 25 Zip Codes (2003-Present)", rot=60
)
plt.show()

# Creating frequency tables for violent crimes and murders by zip code
viol_freq = pd.crosstab(df_viol.zip_code, df_viol.category_description)
mur_freq = pd.crosstab(df_mur.zip_code, df_mur.category_description)
display(viol_freq)

# Creating monthly counts for each type of violent crime, one column per type
monthly_viol = pd.DataFrame(
    {
        crime_type: df_viol[df_viol["category_description"] == crime_type]
        .resample("M")
        .size()
        for crime_type in ["Aggravated Assault", "Robbery", "Rape", "Murder"]
    }
)
monthly_viol["Total"] = monthly_viol.sum(axis=1)

# Plotting monthly trends for each type of violent crime
plt.figure(figsize=(16, 25), dpi=100)
for i, crime_type in enumerate(monthly_viol.columns, 1):
    plt.subplot(5, 2, i)
    monthly_viol[crime_type].plot(color="red", linewidth=1.5, label="Total per month")
    monthly_viol[crime_type].rolling(window=12).mean().plot(
        color="orange", linewidth=5, label="12-month Moving Average"
    )
    plt.title(crime_type, fontsize=12)
    plt.xlabel("")
    plt.legend(prop={"size": 12})
    plt.tick_params(labelsize=12)
plt.tight_layout()
plt.show()

# Saving the violent crime frequency table by zip code to a CSV file
viol_freq.to_csv("viol_freq.csv")


### J. Distribution of violent crime and murders across council districts, APD Districts, and APD sectors 

In [None]:
# Plotting violent crime and murder distributions for each area grouping.
# figsize is passed to plot.bar directly: DataFrame.plot creates its own figure,
# so a preceding plt.figure() call would only leave an empty figure behind.
area_columns = {
    "council_district": "Council District",
    "apd_sector": "APD Sector",
    "apd_district": "APD District",
}
for column, label in area_columns.items():
    # Violent crime distribution for this grouping
    pd.crosstab(df_viol[column], df_viol.category_description).plot.bar(
        rot=60,
        figsize=(12, 6),
        xlabel=label,
        ylabel="Crime Count",
        title=f"Violent Crime Distribution by {label} (2003-Present)",
    )
    plt.show()

    # Murder distribution for this grouping
    pd.crosstab(df_mur[column], df_mur.category_description).plot.bar(
        rot=60,
        figsize=(12, 6),
        xlabel=label,
        ylabel="Crime Count",
        title=f"Murder Distribution by {label} (2003-Present)",
        legend=False,
    )
    plt.show()


### K. Violent crime and murder distribution by location type

In [None]:
# Calculating violent crime distribution by location type
viol_loc = pd.crosstab(df_viol.location_type, df_viol.category_description)
display(viol_loc)

# Calculating murder distribution by location type
mur_loc = pd.crosstab(df_mur.location_type, df_mur.category_description)

# Plotting violent crime distribution by location type
fig, axs = plt.subplots(figsize=(20, 14), dpi=100, ncols=2)
viol_loc.plot.barh(
    title="Violent Crime Distribution by Location Type (2003-Present)", ax=axs[0]
)
mur_loc.plot.barh(
    title="Murder Distribution by Location Type (2003-Present)", legend=False, ax=axs[1]
)
plt.show()

# Saving the violent crime distribution by location type to a CSV file
viol_loc.to_csv("viol_loc.csv")


<a id='q9'></a>
### L. How does violent crime appear on the map?

**Note: Rape incidents include no location coordinates and therefore cannot be shown on a map.**
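A quick way to check which categories can be mapped is to compute, per category, the share of rows that have both coordinates. The sketch below uses hypothetical toy values. Only the `category_description`, `latitude`, and `longitude` column names come from the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy records; the coordinates are illustrative only
toy = pd.DataFrame(
    {
        "category_description": ["Murder", "Rape", "Robbery", "Rape"],
        "latitude": [30.27, np.nan, 30.30, np.nan],
        "longitude": [-97.74, np.nan, -97.70, np.nan],
    }
)

# True where a row has both coordinates and can be placed on a map
has_coords = toy["latitude"].notna() & toy["longitude"].notna()

# Share of mappable rows per category (0.0 means none can be shown)
mappable_share = has_coords.groupby(toy["category_description"]).mean()
print(mappable_share)
```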

In [None]:
def create_heatmap(df, outfile):
    coords_heat = df[(df["latitude"].notnull()) & (df["longitude"].notnull())]

    map_austin = folium.Map(
        location=[30.2672, -97.7431], tiles="OpenStreetMap", zoom_start=12
    )
    map_austin.add_child(
        plugins.HeatMap(coords_heat[["latitude", "longitude"]].values, radius=15)
    )
    map_austin.save(outfile)


# Create heat map for Aggravated Assault
create_heatmap(df_agg_asslt, "agg_asslt_heatmap.html")

# Create heat map for Robbery
create_heatmap(df_robbery, "robbery_heatmap.html")

# Create heat map for Murder
create_heatmap(df_mur, "mur_heatmap.html")


<a id='q10'></a>
### M. Are there any addresses where violent crime and murder occur frequently?

In [None]:
# Show addresses with 50 or more reported violent crimes
df_viol_address_counts = df_viol.address.value_counts()
addresses_with_50_or_more_violent_crimes = df_viol_address_counts[df_viol_address_counts >= 50].to_frame()
addresses_with_50_or_more_violent_crimes


In [None]:
# Show addresses with 2 or more reported murders
df_mur_address_counts = df_mur.address.value_counts()
addresses_with_2_or_more_murders = df_mur_address_counts[df_mur_address_counts >= 2].to_frame()
addresses_with_2_or_more_murders


In [None]:
df_clean = df.copy()
df_clean.to_csv("df_clean.csv")

df_17.to_csv("df_17.csv")
df_18.to_csv("df_18.csv")
df_19.to_csv("df_19.csv")
df_20.to_csv("df_20.csv")
df_21.to_csv("df_21.csv")

df_viol_17.to_csv("df_viol_17.csv")
df_viol_18.to_csv("df_viol_18.csv")
df_viol_19.to_csv("df_viol_19.csv")
df_viol_20.to_csv("df_viol_20.csv")
df_viol_21.to_csv("df_viol_21.csv")

df_mur_17.to_csv("df_mur_17.csv")
df_mur_18.to_csv("df_mur_18.csv")
df_mur_19.to_csv("df_mur_19.csv")
df_mur_20.to_csv("df_mur_20.csv")
df_mur_21.to_csv("df_mur_21.csv")

df_viol.to_csv("df_viol.csv")
df_mur.to_csv("df_mur.csv")
df_agg_asslt.to_csv("df_agg_asslt.csv")
df_rape.to_csv("df_rape.csv")

In [None]:
df_53 = df[df.zip_code == 78753]
df_05 = df[df.zip_code == 78705]
df_41 = df[df.zip_code == 78741]

df_01.to_csv("df_01.csv")
df_53.to_csv("df_53.csv")
df_41.to_csv("df_41.csv")
df_05.to_csv("df_05.csv")
