# **Introduction**

It is possible to create a safer environment by analyzing crime records. Finding connections between crimes and the places and dates on which they occur helps us understand where and when the next crime is likely to happen.

The aim of this project is to investigate whether such a connection exists in the Boston crime data and to answer the following questions:

* How has crime changed over the years?
* Is it possible to predict where or when a crime will be committed?
* What can you say about the distribution of different offenses over the city?

The project includes seven sections:

* Introduction
* Set Up
* Data
* General Analysis
* Analysis Based on Crime Types
* Analysis Based on Location
* Conclusions

Each section explains what has been done and what was obtained by doing it. The last section, Conclusions, summarizes the whole analysis and answers the questions above.

# Set Up

Python and its libraries pandas and numpy are used to organize the data set and for calculations. Several kinds of plots are used to support the analysis and for visualization; for this purpose plotly, seaborn and folium are used.

In [None]:
!pip install --upgrade plotly
!pip install chart_studio
!pip install cufflinks
!pip install folium

In [None]:
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objects as go
import cufflinks as cf
import plotly.express as px
import folium

In [None]:
from plotly import __version__
import plotly.graph_objs as go 
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot

In [None]:
from plotly.subplots import make_subplots

# Data

The data is provided in CSV format. It is read with pandas.

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


In [None]:
cr = pd.read_csv("/kaggle/input/crimes-in-boston/crime.csv",encoding='latin1')

In [None]:
cr.head()

The records begin on June 22, 2015 and continue to October 3, 2018, so some months are missing for 2015 and 2018.

Column names and missing data counts were checked.

In [None]:
cr.info()

As can be seen, the data has 17 columns and 327820 entries. To take a closer look at the columns with null values, the null values are counted per column.

In [None]:
cr.isnull().sum()

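The data has 6 columns with missing values. In the SHOOTING column, missing values are replaced with N to indicate that there was no shooting. In the Lat and Long columns, the placeholder value -1 marks an unknown location; it is converted to a missing value, and these records are neglected in the location analysis.

In [None]:
# Assign the result back instead of calling fillna(inplace=True) on a column selection
cr["SHOOTING"] = cr["SHOOTING"].fillna("N")

In [None]:
# np.nan is passed explicitly because older pandas versions may treat replace(-1, None) as a forward fill
cr["Lat"] = cr["Lat"].replace(-1, np.nan)
cr["Long"] = cr["Long"].replace(-1, np.nan)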
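
The two clean-up steps can be tried on a small frame first. The following is a minimal sketch on made-up values (the `toy` frame is hypothetical and only mimics the three columns involved); it assumes that -1 is a placeholder for an unknown coordinate.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the crime table (hypothetical values, not real records).
toy = pd.DataFrame({
    "SHOOTING": ["Y", np.nan, np.nan],
    "Lat": [42.35, -1.0, np.nan],
    "Long": [-71.06, -1.0, np.nan],
})

# A blank SHOOTING entry is read as "no shooting".
toy["SHOOTING"] = toy["SHOOTING"].fillna("N")

# Assumption: -1 is a placeholder for an unknown coordinate, so it becomes NaN.
toy[["Lat", "Long"]] = toy[["Lat", "Long"]].replace(-1, np.nan)

print(toy)
print(toy.isnull().sum())
```

Once the placeholders are NaN, pandas groupby and most plotting calls skip them by default, so no extra filter on -1 is needed afterwards.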

The number of unique entries in each column was found.

In [None]:
cr.apply(pd.Series.nunique)

The data frame has 280156 different incidents under 4 main UCR categories, for 12 districts and 4 years. The incident numbers indicate that there are duplicated records in the data. The duplicated records are deleted.

In [None]:
cr.drop_duplicates(subset="INCIDENT_NUMBER", inplace=True)
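
To make the effect of this step visible, here is a small sketch on invented incident numbers (the `toy` frame and its values are hypothetical). It counts the repeated rows before dropping them and shows which row of each incident survives.

```python
import pandas as pd

# Toy records (hypothetical): incident I1 has two rows, I3 has three.
toy = pd.DataFrame({
    "INCIDENT_NUMBER": ["I1", "I1", "I2", "I3", "I3", "I3"],
    "OFFENSE_CODE_GROUP": ["Larceny", "Vandalism", "Towed",
                           "Robbery", "Larceny", "Other"],
})

# Rows that repeat an incident number already seen further up.
n_repeats = toy.duplicated(subset="INCIDENT_NUMBER").sum()

# keep="first" is the default: only the first row of each incident is kept.
deduped = toy.drop_duplicates(subset="INCIDENT_NUMBER").reset_index(drop=True)

print(n_repeats)
print(deduped)
```

Because only the first row is kept, an incident that lists several offenses is counted under its first-listed offense group in everything that follows.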

In [None]:
cr.info()

For convenience in visualization and comparison, the main data frame (cr = crime.csv) is organized in different forms.

In [None]:
#For sorting weekdays in correct order a key is created
m = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

In [None]:
#The data of total crime values
crimes_per_year = pd.DataFrame(data=cr['YEAR'].value_counts().reset_index().values, columns=["YEAR","CRIME COUNT"]).sort_values('YEAR').reset_index(drop=True)
crimes_per_month = pd.DataFrame(data=cr['MONTH'].value_counts().reset_index().values, columns=["MONTH","CRIME COUNT"]).sort_values('MONTH').reset_index(drop=True)
crimes_per_day = pd.DataFrame(data=cr['DAY_OF_WEEK'].value_counts().reset_index().values, columns=["DAY","CRIME COUNT"])
crimes_per_day["DAY"] = pd.Categorical(crimes_per_day['DAY'], categories=m, ordered=True)
crimes_per_day = crimes_per_day.sort_values('DAY').reset_index(drop=True)
crimes_per_hour = pd.DataFrame(data=cr['HOUR'].value_counts().reset_index().values, columns=["HOUR","CRIME COUNT"]).sort_values('HOUR').reset_index(drop=True)
crimes_per_district = pd.DataFrame(data=cr['DISTRICT'].value_counts().reset_index().values, columns=["DISTRICT","CRIME COUNT"])
crimes_per_street = pd.DataFrame(data=cr['STREET'].value_counts().reset_index().values, columns=["STREET","CRIME COUNT"]).sort_values('CRIME COUNT',ascending=False).reset_index(drop=True).head(50)
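
The count tables above all follow the same pattern, so they could also be produced by one small helper. This is only a sketch of an alternative: `count_table` is a hypothetical name, and the toy data is invented. It uses `groupby(...).size()` and an ordered categorical for the weekdays.

```python
import pandas as pd

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def count_table(frame, column, label=None):
    """Hypothetical helper: one row per value of `column` with its record count."""
    table = frame.groupby(column, observed=True).size().reset_index(name="CRIME COUNT")
    return table.rename(columns={column: label or column})


# Toy records (hypothetical values).
toy = pd.DataFrame({
    "YEAR": [2016, 2015, 2016, 2017, 2016],
    "DAY_OF_WEEK": ["Sunday", "Monday", "Friday", "Monday", "Friday"],
})

per_year = count_table(toy, "YEAR")

# An ordered categorical makes the result follow the calendar, not the alphabet.
toy["DAY_OF_WEEK"] = pd.Categorical(toy["DAY_OF_WEEK"], categories=WEEKDAYS, ordered=True)
per_day = count_table(toy, "DAY_OF_WEEK", label="DAY")

print(per_year)
print(per_day)
```

One design point in favor of this form is that the count column stays an integer column even when the key column is text, which is convenient for later sums and sorting.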

In [None]:
#Data for total crime counts for each UCR Parts
ucr_year = pd.DataFrame(data =(cr.groupby(["YEAR","UCR_PART"]).count()[['INCIDENT_NUMBER']]).reset_index().values, columns=["YEAR","UCR_PART","CRIME COUNT"]).sort_values('YEAR').reset_index(drop=True)
ucr_month = pd.DataFrame(data =(cr.groupby(["MONTH","UCR_PART"]).count()[['INCIDENT_NUMBER']]).reset_index().values, columns=["MONTH","UCR_PART","CRIME COUNT"]).sort_values('MONTH').reset_index(drop=True)
ucr_day = pd.DataFrame(data =(cr.groupby(["DAY_OF_WEEK","UCR_PART"]).count()[['INCIDENT_NUMBER']]).reset_index().values, columns=["DAY","UCR_PART","CRIME COUNT"]).sort_values('DAY').reset_index(drop=True)
ucr_day["DAY"] = pd.Categorical(ucr_day['DAY'], categories=m, ordered=True)
ucr_day = ucr_day.sort_values('DAY').reset_index(drop=True)
ucr_hour = pd.DataFrame(data =(cr.groupby(["HOUR","UCR_PART"]).count()[['INCIDENT_NUMBER']]).reset_index().values, columns=["HOUR","UCR_PART","CRIME COUNT"]).sort_values('HOUR').reset_index(drop=True)
ucr_district = pd.DataFrame(data =(cr.groupby(["DISTRICT","UCR_PART"]).count()[['INCIDENT_NUMBER']]).reset_index().values, columns=["DISTRICT","UCR_PART","CRIME COUNT"])
ucr_street = pd.DataFrame(data =(cr.groupby(["STREET","UCR_PART"]).count()[['INCIDENT_NUMBER']]).reset_index().values, columns=["STREET","UCR_PART","CRIME COUNT"]).sort_values('CRIME COUNT',ascending=False).reset_index(drop=True).head(50)

In [None]:
# Column name fixed: the original label contained a stray apostrophe
offense_code_count = pd.DataFrame(cr['OFFENSE_CODE_GROUP'].value_counts().reset_index().values, columns=["OFFENSE_CODE_GROUP","CRIME COUNT"])

# General Analysis

In [None]:
fig = make_subplots(
    rows=2, cols=3,specs=[[{"type": "scatter"}, {"type": "scatter"},{"type": "scatter"}],[{"type": "scatter"},{"type": "bar"}, {"type": "bar"}]],
    subplot_titles=("Number of crimes per year", "Number of crimes per month", "Number of crimes per day", "Number of crimes per hour","Number of crimes per district","Number of crimes per street (top 50 streets)"))
# Add traces
fig.add_trace(go.Scatter(x=crimes_per_year["YEAR"], y=crimes_per_year["CRIME COUNT"]), row=1, col=1)
fig.add_trace(go.Scatter(x=crimes_per_month["MONTH"], y=crimes_per_month["CRIME COUNT"]), row=1, col=2)
fig.add_trace(go.Scatter(x=crimes_per_day["DAY"], y=crimes_per_day["CRIME COUNT"]), row=1, col=3)
fig.add_trace(go.Scatter(x=crimes_per_hour["HOUR"], y=crimes_per_hour["CRIME COUNT"]), row=2, col=1)
fig.add_trace(go.Bar(x=crimes_per_district["DISTRICT"], y=crimes_per_district["CRIME COUNT"]), row=2, col=2)
fig.add_trace(go.Bar(x=crimes_per_street["STREET"], y=crimes_per_street["CRIME COUNT"]), row=2, col=3)
# Update xaxis properties
fig.update_xaxes(title_text="Year", row=1, col=1)
fig.update_xaxes(title_text="Month", range=[0, 13], row=1, col=2)
fig.update_xaxes(title_text="Day", row=1, col=3)
fig.update_xaxes(title_text="Hour",row=2, col=1)
fig.update_xaxes(title_text="District", row=2, col=2)
fig.update_xaxes(title_text="Street", row=2, col=3)
# Update yaxis properties
fig.update_yaxes(title_text="Crime Count", row=1, col=1)
fig.update_yaxes(title_text="Crime Count",row=1, col=2)
fig.update_yaxes(title_text="Crime Count", row=1, col=3)
fig.update_yaxes(title_text="Crime Count", row=2, col=1)
fig.update_yaxes(title_text="Crime Count", row=2, col=2)
fig.update_yaxes(title_text="Crime Count", row=2, col=3)
# Update title
fig.update_layout(showlegend=False,title_text="Distributions of Total Number of Crimes Between 2015-2018", height=900)
fig.show()

From the figure 'Distributions of Total Number of Crimes Between 2015-2018', the following inferences can be made:

* Yearly:

There is a drastic increase in the crime count between 2015 and 2016, and the count in 2017 is slightly higher than in 2016. The low totals for 2015 and 2018 are most likely caused by the missing months in those years.

* Monthly: 

The crime count varies through the year. In the spring and summer periods the crime count is relatively high compared to the winter season. The 7th, 8th and 9th months (July, August and September) are the top three months of the year for crime counts.

* Daily:

The crime numbers also vary during the week. They are lowest at the weekend and highest on Friday. Except for Friday, there is no significant difference in crime numbers between the weekdays.

* Hourly:

The 'Number of crimes per hour' graph shows the hourly changes in the crime count. The crime number changes a lot throughout the day. It has its highest values between 10 am and 8 pm and its lowest values between 1 am and 7 am.

* District:

Crime is not distributed homogeneously between districts. The first five districts have almost twice as many crimes as the rest of the districts.

* Street:

The crime number varies much more between streets than between districts. For most of the streets the crime count is under 1000. Washington Street has the highest crime number.
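
The yearly comparison above is affected by years that are only partly covered. One way to make the years more comparable is to divide each year's count by the number of months that actually appear in the data. This is a minimal sketch on invented records; the column names `crimes`, `months_covered` and `crimes_per_month` are hypothetical.

```python
import pandas as pd

# Toy records (hypothetical): 2015 covers 2 months, 2016 covers 3 months.
toy = pd.DataFrame({
    "YEAR":  [2015, 2015, 2015, 2016, 2016, 2016, 2016, 2016, 2016],
    "MONTH": [11, 12, 12, 1, 1, 2, 2, 3, 3],
})

per_year = toy.groupby("YEAR").agg(
    crimes=("MONTH", "size"),             # number of records in the year
    months_covered=("MONTH", "nunique"),  # distinct months present in the year
)
per_year["crimes_per_month"] = per_year["crimes"] / per_year["months_covered"]

print(per_year)
```

A month that is only partly covered still counts as a full month here, so the first and last month of the records remain slightly underestimated.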



After seeing the general trend in crimes, different crime types are now considered. To categorize crimes, different offense codes and groups are used in the data frame. The most general of these categories is the UCR part. The crime numbers per UCR part are examined next to see whether they follow the same trend as the overall counts.


It can be seen from the figures below that all the UCR parts follow the general trend, with minor differences in Part One crimes. Crimes categorized under "Other" do not change much between 2015 and 2018.
Compared to the other UCR parts, Part One crime counts vary less by month and by weekday.
No matter when or where, Part Three is the most committed crime category of all.
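
Because the UCR parts differ a lot in size, it can be hard to judge from raw counts which part varies less. A possible check, sketched here on invented numbers, is to divide each part by its own mean so that all curves are centered on 1.0 and then compare their spread. The names `wide`, `relative` and `spread` are hypothetical.

```python
import pandas as pd

# Toy monthly counts (hypothetical): Part One is flat, Part Three rises.
toy = pd.DataFrame({
    "MONTH": [1, 2, 3, 1, 2, 3],
    "UCR_PART": ["Part One"] * 3 + ["Part Three"] * 3,
    "CRIME COUNT": [10, 10, 10, 100, 200, 300],
})

# Rows are months, columns are UCR parts.
wide = toy.pivot(index="MONTH", columns="UCR_PART", values="CRIME COUNT")

# Divide each part by its own mean so every curve is centered on 1.0.
relative = wide / wide.mean()

# Range of the relative curve: a smaller value means a flatter profile.
spread = relative.max() - relative.min()

print(spread)
```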

In [None]:
fig1 = px.line(ucr_year, x='YEAR', y="CRIME COUNT", color="UCR_PART",title='Yearly changes in number of crimes in UCR categories')
fig1.show()

In [None]:
fig2= px.line(ucr_month, x='MONTH', y="CRIME COUNT", color="UCR_PART",title='Monthly changes in number of crimes in UCR categories')
fig2.show()

In [None]:
fig3 = px.line(ucr_day, x='DAY', y="CRIME COUNT", color="UCR_PART",title='Daily changes in number of crimes in UCR categories')
fig3.show()

In [None]:
fig4 = px.line(ucr_hour, x='HOUR', y="CRIME COUNT", color="UCR_PART",title='Hourly changes in number of crimes in UCR categories')
fig4.show()

In general, 31.6% of the crimes were committed in 2017, the largest share of any year. The second pie chart shows the share of each UCR part.

In [None]:
# Total number of incidents in each UCR part, largest first
crimes_per_ucr = cr['UCR_PART'].value_counts()

fig = make_subplots(
    rows=1, cols=2,specs=[[{"type": "pie"}, {"type": "pie"}]],
    subplot_titles=("Number of crimes per year","Number of crimes for UCR Parts"))

fig.add_trace(go.Pie(labels=crimes_per_year['YEAR'], values=crimes_per_year['CRIME COUNT'], textinfo='label+percent',pull=[0, 0, 0.1, 0]), row=1, col=1)
# Labels and values must come from the same table; the largest slice is pulled out
fig.add_trace(go.Pie(labels=crimes_per_ucr.index, values=crimes_per_ucr.values, textinfo='label+percent',pull=[0.1, 0, 0, 0]), row=1, col=2)

fig.update_layout(showlegend=False,title_text="Distribution percentages of Total Number of Crimes Between 2015-2018", height=500)
fig.show()

In [None]:
df = pd.DataFrame(data =(cr.groupby(["YEAR","UCR_PART",'OFFENSE_CODE_GROUP']).count()[['INCIDENT_NUMBER']]).reset_index().values, columns=["YEAR","UCR_PART",'OFFENSE_CODE_GROUP',"CRIME COUNT"])
fig = px.sunburst(df, path=['YEAR', 'UCR_PART', 'OFFENSE_CODE_GROUP'], values='CRIME COUNT')

fig.show()

# Analysis Based on Crime Types

The most and least committed crimes were found and categorized according to UCR parts.

In [None]:
ucr_offense = pd.DataFrame(data =(cr.groupby(["UCR_PART","OFFENSE_CODE_GROUP"]).count()[['INCIDENT_NUMBER']]).reset_index().values, columns=["UCR_PART","OFFENSE_CODE_GROUP","CRIME COUNT"]).sort_values('UCR_PART').reset_index(drop=True)
ucr_offense['all'] = 'all'
fig = px.treemap(ucr_offense, path=['all', 'UCR_PART', 'OFFENSE_CODE_GROUP'], values='CRIME COUNT',color='CRIME COUNT',color_continuous_scale='RdBu')
fig.show()

In [None]:
df = ucr_offense.sort_values('CRIME COUNT',ascending=False).reset_index(drop=True)
fig = px.bar(df.head(25),x = "OFFENSE_CODE_GROUP", y ="CRIME COUNT",color="UCR_PART",)
fig.update_layout(
    title_text='Top 25 Crimes in Boston', # title of plot
    xaxis_title_text='Offense Code Group', # xaxis label
    yaxis_title_text='Count', # yaxis label
    barmode='stack', xaxis={'categoryorder':'total descending'}
)
fig.show()

In [None]:
df = ucr_offense.sort_values('CRIME COUNT',ascending=False).reset_index(drop=True)
fig = px.bar(df.tail(25),x = "OFFENSE_CODE_GROUP", y ="CRIME COUNT",color="UCR_PART",)
fig.update_layout(
    title_text='Least Committed 25 Crimes in Boston', # title of plot
    xaxis_title_text='Offense Code Group', # xaxis label
    yaxis_title_text='Count', # yaxis label
    barmode='stack', xaxis={'categoryorder':'total ascending'}
)
fig.show()

There is a significant difference between the top 25 crimes and the 25 least committed crimes. From the graph 'Least Committed 25 Crimes in Boston' we can see that some crime categories have counts lower than 100. To see the overall situation, the crimes that occurred fewer than 1000 times are listed.

In [None]:
ucr_offense[ucr_offense['CRIME COUNT']< 1000].sort_values('CRIME COUNT').reset_index(drop=True)

36 different crime types occurred fewer than 1000 times in 4 years. So, it is possible that a certain group of crimes makes up the majority.

The count of the top 25 crimes is compared with the total crime count.

In [None]:
# Sum only the count column; summing the whole frame would also concatenate the text columns
top25 = df['CRIME COUNT'][:25].sum()
other = df['CRIME COUNT'][25:].sum()
print(f'''Count of all other crimes {other}
Count of most committed 25 crimes {top25}''')

In [None]:
labels = ['Top 25','Other crimes']
# Values computed in the previous cell (258035 and 24428 in the author's run)
values = [top25, other]
fig = go.Figure(data=[go.Pie(labels=labels, values=values, textinfo='label+percent',insidetextorientation='radial')])
fig.show()

The top 25 crimes constitute 91.4% of the total crimes, so the next crime is most likely to fall into one of these categories.
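
The share quoted above is a simple ratio, and the same calculation works for any cut-off. The sketch below uses a hypothetical helper `top_n_share` on invented counts, and then repeats the arithmetic with the two totals reported in this section.

```python
import pandas as pd

# Toy category counts (hypothetical values).
counts = pd.Series({"A": 50, "B": 30, "C": 15, "D": 4, "E": 1}, name="CRIME COUNT")


def top_n_share(counts, n):
    """Hypothetical helper: fraction of all records in the n largest categories."""
    ordered = counts.sort_values(ascending=False)
    return ordered.head(n).sum() / ordered.sum()


share = top_n_share(counts, 2)

# The same ratio with the two totals reported in this section.
reported = round(100 * 258035 / (258035 + 24428), 1)

print(share, reported)
```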

# Analysis Based on Location

Before showing the crime distribution on the map, the districts are represented with different colors to indicate their area and location in Boston. As can be seen, the areas where crime is higher are mostly in the center of the city. The districts with lower crime rates are the outermost areas.

In [None]:
plt.subplots(figsize=(11,6))
sns.scatterplot(x='Lat',
                y='Long',
                hue='DISTRICT',
                alpha=0.1,
                data=cr)
plt.legend(loc=2)

In [None]:
crime_loc = pd.DataFrame(data =(cr.groupby(['DISTRICT',"Lat","Long","OFFENSE_CODE_GROUP"]).count()[['INCIDENT_NUMBER']]).reset_index().values,columns=['DISTRICT',"Lat","Long","OFFENSE_CODE_GROUP","CRIME COUNT"])

The figures below show the distribution of different crimes over Boston. However, it is also possible to see that some locations are mislabeled: points from different districts overlap with each other.

In [None]:
fig = px.scatter(crime_loc[crime_loc['Lat']!=-1], x="Lat", y="Long", animation_frame="OFFENSE_CODE_GROUP", animation_group="CRIME COUNT",
            color="DISTRICT",width=800, height=500)

fig["layout"].pop("updatemenus") 
fig.show()

In [None]:
crime_loc[(crime_loc['Long'] != -1) & (crime_loc['Lat'] != -1)]

In [None]:
fig = px.scatter(crime_loc[(crime_loc['Long'] != -1) & (crime_loc['Lat'] != -1)], x="Lat", y="Long",color ='DISTRICT', facet_col="OFFENSE_CODE_GROUP", facet_col_wrap=10,
              facet_row_spacing=0.05, 
              facet_col_spacing=0.05, 
              height=950, width=1800,
              title="Distributions of offences over city")
fig.for_each_annotation(lambda a: a.update(text=a.text.split("=")[-1]))
#fig.update_yaxes(showticklabels=True)
fig.update_xaxes(range=[42.25, 42.4])
fig.update_yaxes(range = [-71.2, -71])
fig.show()

In [None]:
location = pd.DataFrame(data =(cr.groupby(["Lat","Long"]).count()[['INCIDENT_NUMBER']]).reset_index().values, columns=["Lat","Long","CRIME COUNT"])
# "open-street-map" needs no access token; the "stamen-terrain" style used originally may require a Stadia Maps key in recent plotly versions
fig = px.density_mapbox(location,lat="Lat",lon="Long",z="CRIME COUNT",radius=10,center=dict(lat=42.357791, lon=-71),zoom = 10,mapbox_style="open-street-map",height=500,width=800)
fig.show()
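
Placeholder or mislabeled coordinates can distort a map, so it may help to keep only points inside a coordinate window before plotting. The sketch below reuses the axis ranges of the facet plot above as that window. This is an assumption made for illustration, not an official city boundary, and the toy values are invented.

```python
import numpy as np
import pandas as pd

# Assumed window: the axis ranges used in the facet plot above.
LAT_RANGE = (42.25, 42.4)
LONG_RANGE = (-71.2, -71.0)

# Toy locations (hypothetical): two valid points, a -1 placeholder,
# a missing point and a point east of the window.
toy = pd.DataFrame({
    "Lat": [42.36, 42.30, -1.0, np.nan, 42.39],
    "Long": [-71.06, -71.10, -1.0, np.nan, -70.50],
    "CRIME COUNT": [12, 3, 40, 7, 5],
})

# between() is inclusive and evaluates to False for NaN.
inside = toy["Lat"].between(*LAT_RANGE) & toy["Long"].between(*LONG_RANGE)
mappable = toy[inside].reset_index(drop=True)

print(mappable)
```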

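To see exactly how many crimes were committed at each location, clustered markers were added to the Boston map. Only incidents with more than two records in the raw data (called serious crimes here) are shown.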

In [None]:
cr1 = pd.read_csv("/kaggle/input/crimes-in-boston/crime.csv",encoding='latin1')

tmp = cr1.groupby('INCIDENT_NUMBER')['YEAR'].count().sort_values(ascending = False)
tmp = pd.DataFrame({'INCIDENT_NUMBER': tmp.index, 'NUM_RECORDS': tmp.values})
seriousCrimes = cr.merge(tmp[tmp['NUM_RECORDS'] > 2 ], on = 'INCIDENT_NUMBER', how = 'inner')
seriousCrimes = seriousCrimes[['Lat','Long','OFFENSE_CODE_GROUP']].dropna()
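
The cell above reads the CSV a second time because the record counts per incident are lost after the duplicates are dropped. A possible alternative, sketched here on invented rows, is to attach the count to every row before de-duplicating. The names `raw`, `NUM_RECORDS` and `multi` are hypothetical.

```python
import pandas as pd

# Toy raw records (hypothetical): I1 has three rows, I2 one, I3 two.
raw = pd.DataFrame({
    "INCIDENT_NUMBER": ["I1", "I1", "I1", "I2", "I3", "I3"],
    "Lat": [42.35, 42.35, 42.35, 42.30, 42.33, 42.33],
    "Long": [-71.06, -71.06, -71.06, -71.10, -71.08, -71.08],
    "OFFENSE_CODE_GROUP": ["Robbery", "Larceny", "Other",
                           "Towed", "Vandalism", "Larceny"],
})

# Attach the per-incident record count to every row before de-duplicating.
raw["NUM_RECORDS"] = raw.groupby("INCIDENT_NUMBER")["INCIDENT_NUMBER"].transform("count")

# Keep one row per incident with more than two records, as in the cell above.
multi = (
    raw[raw["NUM_RECORDS"] > 2]
    .drop_duplicates(subset="INCIDENT_NUMBER")
    .loc[:, ["Lat", "Long", "OFFENSE_CODE_GROUP"]]
    .dropna()
    .reset_index(drop=True)
)

print(multi)
```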

In [None]:
from folium.plugins import MarkerCluster

f = folium.Figure(width=800, height=500)
boston_map = folium.Map(location = [seriousCrimes['Lat'].mean(), 
                                  seriousCrimes['Long'].mean()], 
                      zoom_start = 11).add_to(f)

# One clustered marker per incident; the popup shows the offense group
cluster = MarkerCluster().add_to(boston_map)
for lat,lon,label in zip(seriousCrimes.Lat,seriousCrimes.Long,seriousCrimes.OFFENSE_CODE_GROUP):
    folium.Marker(location=[lat,lon],popup=label).add_to(cluster)

boston_map

# Conclusions

* How has crime changed over the years?

  The crime number is the highest in 2017 and the lowest in 2015. There is a significant decrease in the crime numbers in 2018.
  However, the records begin on June 22, 2015 and continue to October 3, 2018, so about 6 months are missing for 2015 and 3 months for 2018. The low counts in these years are most probably caused by these missing months.





* Is it possible to predict where or when a crime will be committed?

The next crime is likely to occur near the city center, where the districts with higher crime rates are located. Also, the next crime is likely to occur in the summer months and during the daytime, because most of the crimes happened at those times. So, if we assume that the next crimes follow the same pattern as the ones from 2015-2018, we can prepare a list of where and when the next crime may be committed.


In [None]:
print(f"Possible places and times for next crime(Top 5 places and times):\n\nDistricts:\n{crimes_per_district.head()['DISTRICT']}\n\nStreets:\n{crimes_per_street.head()['STREET']}\n\nMonth:\n{crimes_per_month.sort_values('CRIME COUNT',ascending=False).head()['MONTH'].reset_index(drop=True)}\n\nDay:\n{crimes_per_day.sort_values('CRIME COUNT',ascending=False).head()['DAY'].reset_index(drop=True)}\n\nHour:\n{crimes_per_hour.sort_values('CRIME COUNT',ascending=False).head()['HOUR'].reset_index(drop=True)}")


* What can you say about the distribution of different offenses over the city?


Crime is not distributed homogeneously over the city. Crimes are most likely to be committed in the central areas.
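
The list of likely places and times in the second answer ranks values by raw counts. The same ranking can be expressed as relative frequencies, which read as probabilities only under the stated assumption that future crimes follow the 2015-2018 pattern. The sketch below uses invented hours; `hour_share` and `top2` are hypothetical names.

```python
import pandas as pd

# Toy hours of occurrence (hypothetical values).
hours = pd.Series([17, 17, 17, 12, 12, 3], name="HOUR")

# Relative frequency of each hour, largest first.
hour_share = hours.value_counts(normalize=True)

# The two most frequent hours and their shares.
top2 = hour_share.head(2)

print(top2)
```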