# Part 1: Predictive policing. A case to learn from

# Exercise 1

Answer the following questions in your own words

*   According to the article, is predictive policing better than best practice techniques for law enforcement? The article is from 2016. Take a look around the web: does this still seem to be the case in 2024? (Hint: when you evaluate the evidence, consider the source.)<br>

The article suggests that predictive policing might be a better technique. It also mentions some points of critique, but only touches on them superficially. By 2024 the attitude towards predictive policing has changed, as a Guardian article from November 2021 illustrates (https://www.theguardian.com/us-news/2021/nov/07/lapd-predictive-policing-surveillance-reform). One of its points of critique is: "Documents show how data-driven policing programs reinforced harmful patterns, fueling the over-policing of Black and brown communities".

*   List and explain some of the possible issues with predictive policing according to the article.<br>

The article names two main points of critique:<br>

*   The first point is that predictive policing builds on an inherent bias in the criminal justice system, which favors white people over Black people.

*   The second point is that the privacy of citizens is not prioritized, since a list of the individuals most likely to be involved in a violent crime is generated by a single scientist.

# Exercise 2: The types of crimes. 
The first field we'll dig into is the column "Category".

*   We have already counted the number of crimes in each category. What is the most commonly occurring category of crime? What is the least frequently occurring?
*   Create a bar-plot over crime occurrences. This is a data visualization class, so here is the first essential lesson: for a plot to be informative you need to label the axes (the police chief will be furious if you forget). It can also be nice to add other relevant pieces of info (title, labels, etc.). Mine looks like this (but yours doesn't have to look exactly like mine; the important thing is that you clearly communicate the information in the dataset).

In [1]:
import pandas as pd
import numpy as np
import plotly.express as px

In [2]:
# Load data
df = pd.read_csv("Police_Department_Incident_Reports.csv")

In [3]:
# Count missing values (NaN) in each column
count_nan = df.isnull().sum()
print('Count of NaN: ' + str(count_nan))

Count of NaN: PdId                                                              0
IncidntNum                                                        0
Incident Code                                                     0
Category                                                          0
Descript                                                          0
DayOfWeek                                                         0
Date                                                              0
Time                                                              0
PdDistrict                                                        1
Resolution                                                        0
Address                                                           0
X                                                                 0
Y                                                                 0
location                                                          0
SF Find Neighborhoods 2 2         

In [4]:
# Get number of rows and columns
num_rows, num_cols = df.shape

# Print column names
for column in df.columns:
    print(column)

PdId
IncidntNum
Incident Code
Category
Descript
DayOfWeek
Date
Time
PdDistrict
Resolution
Address
X
Y
location
SF Find Neighborhoods 2 2
Current Police Districts 2 2
Current Supervisor Districts 2 2
Analysis Neighborhoods 2 2
DELETE - Fire Prevention Districts 2 2
DELETE - Police Districts 2 2
DELETE - Supervisor Districts 2 2
DELETE - Zip Codes 2 2
DELETE - Neighborhoods 2 2
DELETE - 2017 Fix It Zones 2 2
Civic Center Harm Reduction Project Boundary 2 2
Fix It Zones as of 2017-11-06  2 2
DELETE - HSOC Zones 2 2
Fix It Zones as of 2018-02-07 2 2
CBD, BID and GBD Boundaries as of 2017 2 2
Areas of Vulnerability, 2016 2 2
Central Market/Tenderloin Boundary 2 2
Central Market/Tenderloin Boundary Polygon - Updated 2 2
HSOC Zones as of 2018-06-05 2 2
OWED Public Spaces 2 2
Neighborhoods 2


In [5]:
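# Total number of cells (rows x columns).
# Note: df.size is not the number of crimes; len(df) gives the number of rows (incidents).
print(df.size)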

74533375
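
`DataFrame.size` counts cells (rows times columns), so it is not the same as the number of incidents. A minimal sketch on a made-up three-row frame (the values are illustrative only, not taken from the dataset) shows the difference:

```python
import pandas as pd

# Toy frame (made-up values): 3 incidents, 2 columns
toy = pd.DataFrame({
    "Category": ["ASSAULT", "ROBBERY", "ASSAULT"],
    "Year": [2003, 2004, 2004],
})

n_cells = toy.size            # rows * columns
n_rows, n_cols = toy.shape    # (rows, columns)
n_incidents = len(toy)        # one row per incident

print(n_cells, n_rows, n_cols, n_incidents)
```

Applied to the data above, `len(df)` or `df.shape[0]` would give the number of recorded incidents.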


In [6]:
# Count the number of distinct crime categories
print(df["Category"].nunique())

37


In [7]:
# Print the number of occurrences of each crime category, in descending order
print(df["Category"].value_counts())

LARCENY/THEFT                  477975
OTHER OFFENSES                 301874
NON-CRIMINAL                   236928
ASSAULT                        167042
VEHICLE THEFT                  126228
DRUG/NARCOTIC                  117821
VANDALISM                      114718
WARRANTS                        99821
BURGLARY                        91067
SUSPICIOUS OCC                  79087
ROBBERY                         54467
MISSING PERSON                  44268
FRAUD                           41348
FORGERY/COUNTERFEITING          22995
SECONDARY CODES                 22378
WEAPON LAWS                     21004
TRESPASS                        19194
PROSTITUTION                    16501
STOLEN PROPERTY                 11450
DISORDERLY CONDUCT               9932
DRUNKENNESS                      9760
SEX OFFENSES, FORCIBLE           8747
RECOVERED VEHICLE                8688
DRIVING UNDER THE INFLUENCE      5652
KIDNAPPING                       4282
ARSON                            3875
EMBEZZLEMENT

This list shows that the most common crime category is "LARCENY/THEFT", with 477,975 occurrences.
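
The most and least frequent categories can also be read off programmatically rather than by eye. This is a minimal sketch, assuming a `value_counts()` result like the one above. The three counts are copied from that printed output, and the variable names are illustrative:

```python
import pandas as pd

# A few counts copied from the printed output above (not the full table)
counts = pd.Series({
    "LARCENY/THEFT": 477975,
    "ASSAULT": 167042,
    "ARSON": 3875,
})

most_common = counts.idxmax()    # label with the highest count
least_common = counts.idxmin()   # label with the lowest count, within this subset only

print(most_common, counts.max())
print(least_common, counts.min())
```

On the full dataset, `df["Category"].value_counts().idxmin()` would answer the second half of the question. The subset here is only for illustration.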

Creating the plots:

In [8]:
y_values = df["Category"].value_counts()

# Bar chart of occurrences per category, with both axes labelled
fig = px.bar(x=y_values.index, y=y_values, labels={'x': 'Category', 'y': 'Occurrences'}, title='Number of crimes per category')
fig.update_layout(height=1000)


# Exercise 3: Temporal patterns.

To start off easily, let's count the number of crimes per year:
*   What is the year with most crimes?
*   What is the year with the fewest crimes? (hint if your result is 2018, go back and see what I wrote about the date range up in exercise 1).

*   Create a barplot of crimes-per-year (years on the $x$-axis, crime-counts on the $y$-axis).
*   Finally, Police chief Suneman is interested in the temporal development of only a subset of categories, the so-called *focus crimes*. Those categories are listed below (for convenient copy-paste action). Create bar-charts displaying the year-by-year development of each of these categories across the years 2003-2017.

In [9]:
# Extracting the year
df['Date'] = pd.to_datetime(df['Date'], format='%m/%d/%Y')
df['Year'] = df['Date'].dt.year
# Remove 2018, as this year is not complete
df = df[df['Year'] != 2018]
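
Whether a year is complete can be checked from the data itself by looking at the first and last recorded date in each year. This is a minimal sketch on made-up dates (the toy values are assumptions, not taken from the dataset). It treats a year as complete when its last record falls in December, which is a simplifying heuristic:

```python
import pandas as pd

# Toy dates (made up): 2017 runs to December, 2018 stops in May
toy = pd.DataFrame({"Date": ["01/01/2017", "12/31/2017", "01/01/2018", "05/15/2018"]})
toy["Date"] = pd.to_datetime(toy["Date"], format="%m/%d/%Y")
toy["Year"] = toy["Date"].dt.year

# First and last recorded date in each year
coverage = toy.groupby("Year")["Date"].agg(["min", "max"])
print(coverage)

# Heuristic: keep only years whose last record falls in December
complete_years = coverage.index[coverage["max"].dt.month == 12]
filtered = toy[toy["Year"].isin(complete_years)]
```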

In [10]:
# Print the number of crimes per year
print(df["Year"].value_counts())

2015    151459
2017    149487
2013    147664
2016    145994
2014    144844
2003    142803
2004    142054
2005    137048
2012    135464
2008    135242
2009    134309
2006    131856
2007    131771
2010    127758
2011    126713
Name: Year, dtype: int64


Here we can see that the year with the most crimes is 2015 (151,459) and the year with the fewest crimes is 2011 (126,713).

Bar plots of crimes per year, one for each category:

In [11]:
# One bar chart per category, showing the number of incidents per year
categories = df["Category"].unique()
for category in categories:
    category_df = df[df['Category'] == category]
    # Count incidents per year; groupby keeps the years in chronological order
    counts = category_df.groupby('Year').size().reset_index(name='Occurrences')
    fig = px.bar(counts, x='Year', y='Occurrences', title=category)
    fig.show()
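
The loop above draws one plot for every category in the data. To restrict it to the focus crimes, one option is to filter with `isin` and count per category and year in a single `groupby`. This is a minimal sketch on a made-up frame. `focus_crimes` is a placeholder set, since the actual list is not reproduced here:

```python
import pandas as pd

# Toy data (made-up rows) standing in for the incident table
toy = pd.DataFrame({
    "Category": ["ASSAULT", "ASSAULT", "BURGLARY", "FRAUD", "BURGLARY", "ASSAULT"],
    "Year": [2003, 2004, 2003, 2003, 2003, 2004],
})

# Placeholder subset; substitute the actual list of focus crimes
focus_crimes = {"ASSAULT", "BURGLARY"}

# Keep only rows whose category is one of the focus crimes
focus_df = toy[toy["Category"].isin(focus_crimes)]

# One row per (category, year) pair with its incident count
per_year = (
    focus_df.groupby(["Category", "Year"])
    .size()
    .reset_index(name="Occurrences")
)
print(per_year)
```

Each category's slice of `per_year` can then be plotted with `px.bar`, one chart per focus crime.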

# Trends in plots
*   We can see a decreasing trend in the number of bad checks. This might be due to the declining use of checks as technology improves.
*   Acts of treason have only been recorded since 2010.
*   Forgery/counterfeiting is decreasing over the years. This might also be due to advances in technology.