### Dataset

The dataset I chose is "BPD Part 1 Victim Based Crime Data". It contains information about reported crime incidents in the City of Baltimore, Maryland, USA from 2011 to 2016. The data is collected by the Baltimore Police Department (BPD) and is based on the National Incident-Based Reporting System (NIBRS).

The dataset can be downloaded from the link: https://query.data.world/s/ta65lvu2ttt55dbkghtzjxeczlx2ky?dws=00000

In [1]:
#Importing required libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [2]:
# Loading the data into a dataframe
df = pd.read_csv('https://query.data.world/s/ta65lvu2ttt55dbkghtzjxeczlx2ky?dws=00000')

### Data Overview

In [3]:
#Checking the number of rows and columns
df.shape

(285807, 12)

In [4]:
#Checking the column titles
df.columns

Index(['CrimeDate', 'CrimeTime', 'CrimeCode', 'Location', 'Description',
       'Inside/Outside', 'Weapon', 'Post', 'District', 'Neighborhood',
       'Location 1', 'Total Incidents'],
      dtype='object')

In [5]:
#Creating new columns Latitude and Longitude from the Location 1 column
df[['Latitude','Longitude']]=df['Location 1'].str.split(',',expand=True)

In [6]:
#Checking the columns
df.columns

Index(['CrimeDate', 'CrimeTime', 'CrimeCode', 'Location', 'Description',
       'Inside/Outside', 'Weapon', 'Post', 'District', 'Neighborhood',
       'Location 1', 'Total Incidents', 'Latitude', 'Longitude'],
      dtype='object')

In [7]:
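#Converting the new columns into float datatype
#Note: the pattern keeps only digits and dots, so the minus sign of Longitude is removed here
df['Latitude'] = df['Latitude'].str.replace(r'[^\d.]+', '', regex=True).astype('float64')
df['Longitude'] = df['Longitude'].str.replace(r'[^\d.]+', '', regex=True).astype('float64')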
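The split-and-replace steps above drop the minus sign of the longitude, which is why the values show up as positive later on. As an alternative, both coordinates could be pulled out in one step with a regular expression that keeps the sign. This is only a minimal sketch on a hypothetical three-row sample in the same "(lat, lon)" text format; the `sample` frame and the pattern are my own assumptions, not part of the original notebook.

```python
import pandas as pd

# Hypothetical sample in the same "(lat, lon)" text format as 'Location 1'
sample = pd.DataFrame({'Location 1': ['(39.2924100000, -76.6140800000)',
                                      '(39.2824200000, -76.5928800000)',
                                      None]})

# Named groups become the column names; the optional '-' keeps the sign
pattern = (r'\(\s*(?P<Latitude>-?\d+(?:\.\d+)?)\s*,'
           r'\s*(?P<Longitude>-?\d+(?:\.\d+)?)\s*\)')

# Rows that are missing or do not match the pattern become NaN
coords = sample['Location 1'].str.extract(pattern).astype('float64')
sample = sample.join(coords)
print(sample[['Latitude', 'Longitude']])
```

With this approach the longitude stays negative, so no sign correction would be needed before plotting on a map.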


In [8]:
#Summary of the dataframe
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 285807 entries, 0 to 285806
Data columns (total 14 columns):
 #   Column           Non-Null Count   Dtype  
---  ------           --------------   -----  
 0   CrimeDate        285807 non-null  object 
 1   CrimeTime        285807 non-null  object 
 2   CrimeCode        285807 non-null  object 
 3   Location         284184 non-null  object 
 4   Description      285807 non-null  object 
 5   Inside/Outside   281611 non-null  object 
 6   Weapon           97396 non-null   object 
 7   Post             285616 non-null  float64
 8   District         285749 non-null  object 
 9   Neighborhood     284106 non-null  object 
 10  Location 1       284188 non-null  object 
 11  Total Incidents  285807 non-null  int64  
 12  Latitude         284188 non-null  float64
 13  Longitude        284188 non-null  float64
dtypes: float64(3), int64(1), object(10)
memory usage: 30.5+ MB


In [9]:
#Statistical summary
df.describe()

| | Post | Total Incidents | Latitude | Longitude |
| --- | --- | --- | --- | --- |
| count | 285616.0 | 285807.0 | 284188.0 | 284188.0 |
| mean | 504.234184 | 1.0 | 39.308704 | 76.617269 |
| std | 261.354783 | 0.0 | 0.061588 | 0.042107 |
| min | 0.0 | 1.0 | 39.20041 | 76.51784 |
| 25% | 242.0 | 1.0 | 39.28838 | 76.5876 |
| 50% | 445.0 | 1.0 | 39.30369 | 76.61396 |
| 75% | 723.0 | 1.0 | 39.32757 | 76.64802 |
| max | 945.0 | 1.0 | 41.62973 | 76.71144 |


In [10]:
#Checking the first 5 rows
df.head()

| | CrimeDate | CrimeTime | CrimeCode | Location | Description | Inside/Outside | Weapon | Post | District | Neighborhood | Location 1 | Total Incidents | Latitude | Longitude |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 11/12/2016 | 02:35:00 | 3B | 300 SAINT PAUL PL | ROBBERY - STREET | O | | 111.0 | CENTRAL | Downtown | (39.2924100000, -76.6140800000) | 1 | 39.29241 | 76.61408 |
| 1 | 11/12/2016 | 02:56:00 | 3CF | 800 S BROADWAY | ROBBERY - COMMERCIAL | I | FIREARM | 213.0 | SOUTHEASTERN | Fells Point | (39.2824200000, -76.5928800000) | 1 | 39.28242 | 76.59288 |
| 2 | 11/12/2016 | 03:00:00 | 6D | 1500 PENTWOOD RD | LARCENY FROM AUTO | O | | 413.0 | NORTHEASTERN | Stonewood-Pentwood-Winston | (39.3480500000, -76.5883400000) | 1 | 39.34805 | 76.58834 |
| 3 | 11/12/2016 | 03:00:00 | 6D | 6600 MILTON LN | LARCENY FROM AUTO | O | | 424.0 | NORTHEASTERN | Westfield | (39.3626300000, -76.5516100000) | 1 | 39.36263 | 76.55161 |
| 4 | 11/12/2016 | 03:00:00 | 6E | 300 W BALTIMORE ST | LARCENY | O | | 111.0 | CENTRAL | Downtown | (39.2893800000, -76.6197100000) | 1 | 39.28938 | 76.61971 |


In [11]:
#Checking the last 5 rows
df.tail()

| | CrimeDate | CrimeTime | CrimeCode | Location | Description | Inside/Outside | Weapon | Post | District | Neighborhood | Location 1 | Total Incidents | Latitude | Longitude |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 285802 | 01/01/2011 | 22:15:00 | 4D | 6800 MCCLEAN BD | AGG. ASSAULT | I | HANDS | 423.0 | NORTHEASTERN | Hamilton Hills | (39.3704700000, -76.5670500000) | 1 | 39.37047 | 76.56705 |
| 285803 | 01/01/2011 | 22:30:00 | 6J | 3000 ODONNELL ST | LARCENY | I | | 232.0 | SOUTHEASTERN | Canton | (39.2804600000, -76.5727300000) | 1 | 39.28046 | 76.57273 |
| 285804 | 01/01/2011 | 23:00:00 | 7A | 2500 ARUNAH AV | AUTO THEFT | O | | 721.0 | WESTERN | Evergreen Lawn | (39.2954200000, -76.6592800000) | 1 | 39.29542 | 76.65928 |
| 285805 | 01/01/2011 | 23:25:00 | 4E | 100 N MONROE ST | COMMON ASSAULT | I | HANDS | 714.0 | WESTERN | Penrose/Fayette Street Outreach | (39.2899900000, -76.6470700000) | 1 | 39.28999 | 76.64707 |
| 285806 | 01/01/2011 | 23:38:00 | 4D | 800 N FREMONT AV | AGG. ASSAULT | I | HANDS | 123.0 | WESTERN | Upton | (39.2981200000, -76.6339100000) | 1 | 39.29812 | 76.63391 |


### Data Cleaning

In [12]:
#Calculating the count and proportion of missing values
null_count = df.isnull().sum()
null_prop = null_count / len(df)
pd.DataFrame({
    'Count': null_count,
    'Proportion': null_prop})

| | Count | Proportion |
| --- | --- | --- |
| CrimeDate | 0 | 0.0 |
| CrimeTime | 0 | 0.0 |
| CrimeCode | 0 | 0.0 |
| Location | 1623 | 0.005679 |
| Description | 0 | 0.0 |
| Inside/Outside | 4196 | 0.014681 |
| Weapon | 188411 | 0.659225 |
| Post | 191 | 0.000668 |
| District | 58 | 0.000203 |
| Neighborhood | 1701 | 0.005952 |


In [13]:
#Filling all the missing values in Weapon with NO WEAPON, assuming a missing value means no weapon was recorded
df['Weapon'] = df['Weapon'].fillna('NO WEAPON')

In [14]:
#Checking the Total Incidents column  
df['Total Incidents'].value_counts()

1    285807
Name: Total Incidents, dtype: int64

In [15]:
#The Total Incidents column is the same throughout the dataset
df = df.drop('Total Incidents',axis=1)

In [16]:
#Cleaning CrimeTime and CrimeDate
df['CrimeTime'] = df['CrimeTime'].str.replace('24:00:00', '00:00:00')

In [17]:
df['Date'] = df['CrimeDate'] + ' ' + df['CrimeTime']

In [None]:
#Dropping the rows that have an inconsistent date and time format
dates = df['Date'].to_list()

converted_dates = []
errors = []  #positions of the rows whose date could not be parsed
for i, date in enumerate(dates):
    try:
        converted_dates.append(pd.to_datetime(date))
    except Exception:
        #Placeholder only; these rows are dropped below
        converted_dates.append(pd.NaT)
        errors.append(i)

df['Date'] = converted_dates
df = df.drop(df.index[errors])
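The loop above parses one value at a time. A vectorized alternative is to let `pd.to_datetime` turn unparseable values into `NaT` with `errors='coerce'` and then drop them. This is a minimal sketch on a hypothetical sample; it assumes the valid rows follow the `MM/DD/YYYY HH:MM:SS` format seen in the first rows of the data, so anything else is treated as an error, which may not match exactly what the loop accepts.

```python
import pandas as pd

# Hypothetical sample: two well-formed values and one malformed time
sample = pd.DataFrame({'Date': ['11/12/2016 02:35:00',
                                '01/01/2011 23:38:00',
                                '01/01/2011 2338']})

# Values that do not match the assumed format become NaT instead of raising
parsed = pd.to_datetime(sample['Date'], format='%m/%d/%Y %H:%M:%S', errors='coerce')

n_bad = int(parsed.isna().sum())  # number of rows that failed to parse
clean = sample.assign(Date=parsed).dropna(subset=['Date'])
print(n_bad, len(clean))
```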

In [None]:
#Checking the number of rows dropped
len(errors)

In [None]:
df['Date'] = pd.to_datetime(df['Date'])

In [None]:
#Creating separate columns for Day, Month, Year, Weekday and Hour
df['Day'] = df['Date'].dt.day
df['Month'] = df['Date'].dt.month
df['Year'] = df['Date'].dt.year
df['Weekday'] = df['Date'].dt.weekday + 1  #1 = Monday, ..., 7 = Sunday
df['Hour'] = df['Date'].dt.hour

In [None]:
#Dropping the original CrimeDate and CrimeTime columns
df = df.drop(['CrimeDate', 'CrimeTime'], axis = 1)

In [None]:
#Changing the default index to Date column
df = df.set_index('Date')
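With Date as the index, incidents can be counted per calendar month straight from the index. A minimal sketch on a hypothetical three-row sample (the rows are made up for illustration and only mimic the structure of the dataframe):

```python
import pandas as pd

# Hypothetical sample with a DatetimeIndex named 'Date'
idx = pd.to_datetime(['2016-10-30 22:00', '2016-11-01 02:35', '2016-11-12 03:00'])
sample = pd.DataFrame({'Description': ['LARCENY', 'BURGLARY', 'LARCENY']}, index=idx)
sample.index.name = 'Date'

# Convert each timestamp to its calendar month and count the rows per month
monthly = sample.groupby(sample.index.to_period('M')).size()
print(monthly)
```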

In [None]:
#Cleaning Inside/Outside: expanding the abbreviations I and O
df['Inside/Outside'].value_counts()
df['Inside/Outside'] = df['Inside/Outside'].replace({'I': 'Inside', 'O': 'Outside'})

In [None]:
#Cleaning the District column
df['District'] = df['District'].str.lower()
df['District'] = df['District'].str.replace('northestern', 'northeastern')
df['District'] = df['District'].str.replace('southestern', 'southeastern')
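The same clean-up can also be written with a single mapping of known misspellings. A minimal sketch on hypothetical values, with one assumption to keep in mind: `Series.replace` with a dictionary only replaces whole values, whereas `str.replace` above substitutes matching substrings.

```python
import pandas as pd

# Hypothetical sample of raw district names, including a misspelling and a missing value
districts = pd.Series(['NORTHEASTERN', 'NORTHESTERN', ' Central', None])

# Known misspellings mapped to the corrected spelling
fixes = {'northestern': 'northeastern', 'southestern': 'southeastern'}

# Trim whitespace, lower-case, then replace whole values found in the mapping
cleaned = districts.str.strip().str.lower().replace(fixes)
print(cleaned.tolist())
```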

### Data Visualization

In [None]:
#Hours
plt.figure(figsize=(11,4))

plt.title('Frequency of Crime by Hour of Day', fontsize=13)
ax = sns.countplot(x = 'Hour', data = df,color='lightblue')
plt.ylabel("Crime Frequency", fontsize=13)
plt.xlabel('Hour', fontsize=13)
plt.show()

From the graph, we can see that the number of crimes tends to increase during the late afternoon and evening hours. This trend is consistent with general crime patterns in many urban areas, where criminal activity tends to increase during the evening and nighttime hours when there are fewer people on the streets and less visibility.

On the other hand, the early morning hours, between midnight and 6 am, have relatively few crime incidents, with the lowest point around 5 am. Most people are indoors at that time, which may reduce the likelihood of criminal activity.
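The pattern read off the chart can also be checked numerically by counting incidents per hour. A minimal sketch, using a hypothetical list of hours rather than the real Hour column; the 16:00 to 23:00 window for "evening" is my own assumption:

```python
import pandas as pd

# Hypothetical hours of day for seven incidents
hours = pd.Series([17, 18, 18, 5, 23, 18, 17], name='Hour')

# Count incidents per hour, keeping hours with no incidents as 0
by_hour = hours.value_counts().reindex(range(24), fill_value=0)

peak_hour = by_hour.idxmax()                             # hour with the most incidents
evening_share = by_hour.loc[16:23].sum() / by_hour.sum() # share between 16:00 and 23:59
print(peak_hour, round(evening_share, 3))
```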

In [None]:
#Day
plt.figure(figsize=(11,4))

plt.title('Frequency of Crime by Day of Month', fontsize=13)
ax = sns.countplot(x = 'Day', data = df, color='lightblue')
plt.ylabel("Crime Frequency", fontsize=13)
plt.xlabel('Day of Month', fontsize=13)
plt.show()

From the graph, we can see that there is a slight variation in the frequency of crimes over the days of the month, with some days having slightly higher or lower crime incidents than others. However, there is no clear trend or pattern that stands out from this graph.

In [None]:
#Weekday
plt.figure(figsize=(11,4))

plt.title('Frequency of Crime by Day of Week', fontsize=13)
ax = sns.countplot(x = 'Weekday', data = df, color='lightblue')
plt.ylabel("Crime Frequency", fontsize=13)
plt.xlabel('Day of Week', fontsize=13)
labels = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
ax.set_xticklabels(labels)
plt.show()

The graph shows that the number of crimes tends to be higher on weekdays (Monday to Friday) than on weekends (Saturday and Sunday), with a peak on Friday.

This trend is consistent with general crime patterns in many urban areas, where criminal activity tends to be higher on weekdays, when more people are out and businesses are open, and drops on weekends.
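Instead of numbering the weekdays and relabelling the ticks by hand, the day names can be taken from the dates themselves and counted in calendar order. A minimal sketch on four hypothetical dates:

```python
import pandas as pd

# Hypothetical incident dates: one Monday, two Fridays, one Saturday
dates = pd.to_datetime(pd.Series(['2016-11-07', '2016-11-11', '2016-11-11', '2016-11-12']))

order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']

# Count incidents per day name and arrange them Monday to Sunday
counts = dates.dt.day_name().value_counts().reindex(order, fill_value=0)
print(counts)
```

The resulting series keeps labels and counts together, so a bar chart drawn from it cannot end up with mismatched tick labels.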

In [None]:
#Month
plt.figure(figsize=(11,4))

plt.title('Frequency of Crime by Month', fontsize=13)
ax = sns.countplot(x = 'Month', data = df, color='lightblue')
plt.ylabel("Crime Frequency", fontsize=13)
plt.xlabel('Month', fontsize=13)
plt.xticks(rotation = 90)
labels = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December'] 
ax.set_xticklabels(labels)
plt.show()

The graph shows that the number of crimes tends to be higher during the summer months (May, June, July, and August) than during the winter months (December, January, and February).

Criminal activity tends to increase during warmer months when more people are out and about and when there is more daylight. On the other hand, the colder months may see a decrease in criminal activity due to fewer people being out and about and fewer daylight hours.

In [None]:
#Year
plt.figure(figsize=(11,4))

plt.title('Frequency of Crime by Year', fontsize=13)
ax = sns.countplot(x = 'Year', data = df, color='lightblue')
plt.ylabel("Crime Frequency", fontsize=13)
plt.xlabel('Year', fontsize=13)
plt.show()

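We can see a decreasing trend in the frequency of crimes in Baltimore over the years. This trend is a positive sign and may be due to various factors, such as law enforcement efforts, community outreach, and crime prevention strategies. Note that the first rows returned by df.head() are dated 11/12/2016, which suggests that 2016 is not a complete year in this dataset, so its count may not be directly comparable with earlier years.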
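Yearly counts are only comparable when every year is fully covered, so it can help to look at the first and last record date per year next to the counts. A minimal sketch on four hypothetical dates:

```python
import pandas as pd

# Hypothetical incident dates spanning a full year and a partial year
dates = pd.to_datetime(pd.Series(['2015-01-03', '2015-12-30', '2016-01-02', '2016-11-12']))

# First date, last date and number of records for each year
coverage = dates.groupby(dates.dt.year).agg(['min', 'max', 'count'])
print(coverage)
```

A year whose last date falls well before December 31 would then stand out as incomplete.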

In [None]:
#Crime Description
ordered = list(df['Description'].value_counts().sort_values(ascending=False).index)
plt.figure(figsize=(20,6))
plt.title('Frequency of Crime by Police Description', fontsize=13)
ax = sns.countplot(x = 'Description', data = df,color='lightblue',order=ordered)
plt.ylabel("Crime Frequency", fontsize=13)
plt.xlabel('Description', fontsize=13)
plt.xticks(rotation = 90)
plt.show()

From the graph, we can see that the most common police descriptions of crimes are "LARCENY" and "COMMON ASSAULT", followed by "BURGLARY" and "LARCENY FROM AUTO". 

This graph provides useful insights into the types of crimes that are most prevalent in Baltimore and can be used to inform crime prevention strategies.

In [None]:
#Clubbing similar crime types to make it easier to understand
crime_groups = {
    'COMMON ASSAULT': 'ASSAULT', 'ASSAULT BY THREAT': 'ASSAULT', 'AGG. ASSAULT': 'ASSAULT',
    'LARCENY FROM AUTO': 'LARCENY', 'LARCENY': 'LARCENY', 'RAPE': 'RAPE',
    'ROBBERY - COMMERCIAL': 'ROBBERY', 'ROBBERY - RESIDENCE': 'ROBBERY',
    'ROBBERY - STREET': 'ROBBERY', 'ROBBERY - CARJACKING': 'ROBBERY',
    'SHOOTING': 'SHOOTING', 'BURGLARY': 'BURGLARY', 'AUTO THEFT': 'AUTO THEFT',
    'HOMICIDE': 'HOMICIDE', 'ARSON': 'ARSON'}

#Descriptions that are missing from the mapping would become NaN
df['updated_description'] = df['Description'].map(crime_groups)

In [None]:
#Number of incidents per grouped crime type
df['updated_description'].value_counts()

In [None]:
plt.figure(figsize=(20,15), dpi=80)
plt.title('Frequency of Crime by Year and Crime Type', fontsize=15)
ax = sns.countplot(x = 'Year', data = df, hue='updated_description')
plt.ylabel("Crime Frequency", fontsize=15)
plt.xlabel('Year', fontsize=15)
plt.show()

The plot shows that LARCENY is the most common crime in Baltimore, followed by ASSAULT and BURGLARY. These crimes show a slight decrease in frequency over the years.

The plot also shows a significant increase in the frequency of SHOOTING in 2015, which continued to increase until 2016. There is also a noticeable increase in HOMICIDE in 2015.

Overall, the plot suggests that the crime rate in Baltimore has been decreasing, but certain crimes like SHOOTING and HOMICIDE still require special attention.
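Small categories such as SHOOTING and HOMICIDE are hard to read in a grouped bar chart next to LARCENY, so a year by crime type table with year-over-year percentage changes can complement the plot. A minimal sketch on a hypothetical five-row sample; the counts are made up and only the column names follow the notebook:

```python
import pandas as pd

# Hypothetical sample with the same column names as the dataframe
sample = pd.DataFrame({
    'Year': [2014, 2014, 2015, 2015, 2015],
    'updated_description': ['LARCENY', 'SHOOTING', 'SHOOTING', 'SHOOTING', 'LARCENY']})

# Rows are years, columns are crime types, cells are incident counts
table = pd.crosstab(sample['Year'], sample['updated_description'])

# Percentage change of each crime type relative to the previous year
change = table.pct_change() * 100
print(table)
print(change)
```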

In [None]:
#Weapons per year
plt.figure(figsize = (11, 4))

ax = sns.countplot(x = "Year", hue = "Weapon", data = df)
plt.title('Weapons per year', fontsize=15)
plt.ylabel("Crime Frequency", fontsize = 13)
plt.xlabel("Year", fontsize = 13)
plt.show()

From the graph, we can see that the use of FIREARM and OTHER types of weapons increased in 2015.

In [None]:
#Crime per district
plt.figure(figsize=(10,4))

plt.title('Frequency of Crime by District', fontsize=13)
ax = sns.countplot(x = 'District', data = df,color='lightblue', order=['northeastern','southeastern','central','southern','northern','northwestern','southwestern','eastern','western','gay street'])
plt.ylabel("Crime Frequency", fontsize=13)
plt.xlabel('District', fontsize=13)
plt.xticks(rotation = 90)
plt.show()

The graph shows that the eastern and western districts have a significantly lower frequency of crime compared to other districts.

In [None]:
#Crime per district (Inside or Outside)
plt.figure(figsize=(11,4))
plt.title('Frequency of Crime by District (Inside vs. Outside)', fontsize=13)
ax = sns.countplot(x = 'District', hue = 'Inside/Outside', data = df, order=['northeastern','southeastern','central','southern','northern','northwestern','southwestern','eastern','western','gay street'])
plt.ylabel("Crime Frequency", fontsize=13)
plt.xlabel('District', fontsize=13)
plt.xticks(rotation = 90)
plt.show()

The northeastern district has the highest number of crime incidents, followed by the southeastern and central districts. In most districts, crimes are about equally likely to occur inside and outside.
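Whether inside and outside incidents are balanced within a district is easier to judge from shares than from bar heights. A minimal sketch using a row-normalized crosstab on a hypothetical six-row sample:

```python
import pandas as pd

# Hypothetical sample with the same column names as the dataframe
sample = pd.DataFrame({
    'District': ['central', 'central', 'western', 'western', 'western', 'western'],
    'Inside/Outside': ['Inside', 'Outside', 'Inside', 'Inside', 'Inside', 'Outside']})

# normalize='index' turns the counts in each district row into shares that sum to 1
shares = pd.crosstab(sample['District'], sample['Inside/Outside'], normalize='index')
print(shares)
```

Shares close to 0.5 in both columns would support the "about equal" reading for that district.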

In [None]:
#Crime per Neighbourhood 
def get_top_categories(data, column_name, n):
    top_categories = data[column_name].value_counts().head(n).index.tolist()
    filtered_data = data[data[column_name].isin(top_categories)]
    return filtered_data
neigh_df = get_top_categories(df, 'Neighborhood',10)

In [None]:
#Plotting Graphs with respect to neighbourhood
plt.figure(figsize=(12,6))
plt.title('Crime per Neighborhood', fontsize=13)
category_order = df['Neighborhood'].value_counts().head(10).index.tolist()
sns.countplot(x=neigh_df['Neighborhood'], order=category_order, color='lightblue')
plt.xticks(rotation=90)
plt.show()

The graph indicates that the Downtown and Frankford neighborhoods have the highest crime frequencies.

As the Downtown, Inner Harbor and Mondawmin neighborhoods are popular tourist destinations, it is advisable to visit them with caution.

### Visualizing crime on map using folium

This gives an overview of how crime is spread across the city. The heatmap below is restricted to incidents involving a firearm, and the more intense zones have the highest concentration of incidents.

In [None]:
pip install folium

In [None]:
#Importing folium, a package that helps to visualize data on maps, including heatmaps
import folium
from folium.plugins import HeatMap

In [None]:
df = df[pd.notnull(df['Latitude'])]
df = df[pd.notnull(df['Longitude'])]

In [None]:
print(df['Latitude'].median())
print(df['Longitude'].median())

In [None]:
#Function to generate a base map centered on the given latitude and longitude
def generateBaseMap(default_location = [39.30364, -76.6139599], default_zoom_start = 12):
    base_map = folium.Map(location = default_location, control_scale = True, zoom_start = default_zoom_start)
    return base_map

In [None]:
base_map = generateBaseMap()

In [None]:
#Plotting heatmap for crimes involving a firearm
df['count'] = 1
df_murder = df[df['Weapon'] == 'FIREARM'].copy()  #copy so that later column assignments do not act on a view of df

In [None]:
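#Restoring the negative sign of Longitude, which was removed when the column was converted to float
df_murder['Longitude'] = df_murder["Longitude"]*-1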

In [None]:
#Generating heatmap
HeatMap(
        data = df_murder[['Latitude', 'Longitude', 'count']].groupby(
            ['Latitude', 'Longitude']).sum().reset_index().values.tolist(),
        radius = 12,
        max_zoom = 13).add_to(base_map)
base_map
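The list passed to HeatMap above holds one [latitude, longitude, weight] triple per distinct location. The same list could also be built without the helper count column by using `groupby(...).size()`. A minimal sketch on hypothetical coordinates (pandas only, so it runs without folium):

```python
import pandas as pd

# Hypothetical incident coordinates; the first location appears twice
points = pd.DataFrame({'Latitude': [39.29241, 39.29241, 39.28242],
                       'Longitude': [-76.61408, -76.61408, -76.59288]})

# One row per distinct location, with the number of incidents as the weight
weighted = points.groupby(['Latitude', 'Longitude']).size().reset_index(name='count')

# Nested list of [latitude, longitude, weight], the shape passed to HeatMap above
heat_data = weighted.values.tolist()
print(heat_data)
```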

Key Takeaways:
1. Crime frequency varies slightly over the days of the month, but no clear trend or pattern stands out.
2. Crime incidents tend to be higher on weekdays than on weekends, with a peak on Friday.
3. Crime frequency is higher during the summer months than during the winter months.
4. There is a decreasing trend in the frequency of crimes in Baltimore over the years.
5. LARCENY and COMMON ASSAULT are the most common police descriptions of crimes in Baltimore, followed by BURGLARY and LARCENY FROM AUTO.
6. LARCENY is the most common crime in Baltimore, followed by ASSAULT and BURGLARY. There is a slight decrease in the frequency of these crimes, but SHOOTING and HOMICIDE still require special attention.
7. The use of FIREARM and OTHER types of Weapons has increased during the year 2015.
8. The eastern and western districts have a significantly lower frequency of crime compared to other districts.
9. The northeastern district has the highest number of crime incidents, followed by the southeastern and central districts.
10. Downtown and Frankford neighborhoods have the highest crime frequencies in Baltimore, and caution should be exercised while visiting these areas.