# INTRODUCTION

This kernel examines crime in the city of Boston, MA. The analysis aims to answer the following questions:

1. How has crime changed over the years?
2. Is it possible to predict where or when a crime will be committed? 
3. What can you say about the distribution of different offenses over the city?

### **Loading Data and Explanation of Features**

In [None]:
#import libraries
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

crime= pd.read_csv('../input/crimes-in-boston/crime.csv', encoding='unicode_escape')
crime.info()

In [None]:
crime.describe()

In [None]:
crime.head()

### **First impressions about the data:** 

*   There are 17 columns and 319073 rows. The output of `info()` shows the data type of every column.

*   Some columns, such as 'INCIDENT_NUMBER' and 'OFFENSE_CODE', are less important than the others, so it does not make sense to keep all of them in the final dataframe.

*   There are many missing values in the SHOOTING column.

*   Some rows have a latitude and longitude of -1.

*   The column names can be changed to make them easier to work with.
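
These first impressions can be checked directly. The following is a minimal sketch, assuming a small made-up frame (`toy`) that only borrows the `SHOOTING`, `Lat` and `Long` column names from the dataset; the values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the crime dataframe; values are invented for illustration.
toy = pd.DataFrame({
    'SHOOTING': [np.nan, 'Y', np.nan, np.nan],
    'Lat': [42.35, -1.0, 42.30, np.nan],
    'Long': [-71.06, -1.0, -71.10, np.nan],
})

# Number of missing values in each column.
missing_per_column = toy.isna().sum()

# Number of rows that use -1 as a placeholder coordinate.
placeholder_coords = (toy[['Lat', 'Long']] == -1).sum()

print(missing_per_column)
print(placeholder_coords)
```

Running the same two expressions on the real dataframe would show how many values need to be filled or replaced during cleaning.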


First, let's clean up and simplify this data set.

In [None]:
crime.SHOOTING.unique()  # Rows with a shooting are marked 'Y'; all other rows are missing, so I fill the missing values with 'N'.

In [None]:
crime = crime.drop(['INCIDENT_NUMBER','OFFENSE_CODE'], axis=1)

#replace NaN values with 'N' : means No
crime['SHOOTING'] = crime['SHOOTING'].fillna('N')

#Replace -1 values in Lat/Long with NaN
#np.nan is used explicitly because some older pandas versions treat replace(-1, None) as a forward fill
crime['Lat'] = crime['Lat'].replace(-1, np.nan)
crime['Long'] = crime['Long'].replace(-1, np.nan)


#change the column names
rename = {'OFFENSE_CODE_GROUP':'Offense_group',
         'OFFENSE_DESCRIPTION':'Description',
         'DISTRICT':'District',
         'REPORTING_AREA':'Area',
         'SHOOTING':'Shooting',
         'OCCURRED_ON_DATE':'Date',
         'YEAR':'Year',
         'MONTH':'Month',
         'DAY_OF_WEEK':'Day',
         'HOUR':'Hour',
         'UCR_PART':'UCR',
         'STREET':'Street'}
crime.rename(columns=rename, inplace=True)

#index the rows by date so the data can be resampled by month later
crime.index = pd.DatetimeIndex(crime.Date)

I keep a copy of the dataframe at this point, while it still holds the district codes. I will use it later to plot a map of the police districts.

In [None]:
crime_dist = crime.copy()

People usually know a district by its name rather than its code, so I will recode the district codes as district names.

In [None]:
crime.District.unique()

In [None]:
district_name = {
'D14':'Brighton',
'C11':'Dorchester',
'D4':'South End',
'B3':'Mattapan',
'B2':'Roxbury',
'C6':'South Boston',
'A1':'Downtown',
'E5':'West Roxbury',
'A7':'East Boston',
'E13':'Jamaica Plain',
'E18':'Hyde Park',
'A15':'Charlestown'
}
crime['District'] = crime['District'].map(district_name)

New version of the dataframe:

In [None]:
crime.head()

# How has crime changed over the years?

Comparing raw yearly totals could be misleading, because **the 2015 data starts in June and the 2018 data ends in September.**
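
One way to guard against partial years is to count how many distinct months each year covers. This is a minimal sketch, assuming a toy frame with `Year` and `Month` columns generated from a date range that mirrors the coverage described above; it is not the real data.

```python
import pandas as pd

# Toy calendar mirroring the described coverage (June 2015 to September 2018).
days = pd.date_range('2015-06-01', '2018-09-30', freq='D')
toy = pd.DataFrame({'Year': days.year, 'Month': days.month})

# Distinct calendar months present in each year.
months_per_year = toy.groupby('Year')['Month'].nunique()

# Years with all twelve months are safe to compare by yearly totals.
complete_years = months_per_year[months_per_year == 12].index.tolist()

print(months_per_year)
print(complete_years)
```

Only the years returned in `complete_years` should be compared on yearly totals.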

In [None]:
crime.resample('M').size()

In [None]:
monthly = crime.resample('M').size().to_frame(name='Count')
monthly = monthly.reset_index(level='Date')
import plotly.express as px

fig = px.line(monthly, x=monthly.Date, y=monthly.Count)
fig.update_layout(title='Number of Crimes per Month(2015-2018)',
                   xaxis_title='Month',
                   yaxis_title='Number of Crimes')
fig.show()

results = px.get_trendline_results(fig)
print(results)

* Looking at the plot above, crimes in Boston **increased slightly in 2016 and 2017**. However, since the data for 2015 and 2018 are incomplete, we can't say anything definite.
* In general, there is an **increase towards August, followed by a fall**. This may be correlated with the average temperature in Boston; checking it would require weather data.
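
A seasonal peak can hide the longer-term level. One common way to separate the two is a 12-month rolling mean, which averages over one full seasonal cycle. The sketch below uses synthetic monthly counts (a flat level with a July peak), not the Boston data, and all names are illustrative.

```python
import numpy as np
import pandas as pd

# Synthetic monthly counts: a flat level of 8000 plus a seasonal wave peaking in July.
months = pd.date_range('2016-01-01', periods=24, freq='MS')
seasonal = 500 * np.sin(2 * np.pi * (months.month.to_numpy() - 4) / 12)
monthly = pd.Series(8000 + seasonal, index=months)

# Each window spans one full year, so the seasonal wave averages out
# and only the underlying level remains.
trend = monthly.rolling(window=12).mean()

print(trend.dropna().round(1))
```

The first 11 values of `trend` are missing because a full 12-month window is not yet available.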

### **CATEGORIZATION**

While reviewing the data, I noticed that there are many offense groups that can be put in the same category. In order to make the crime categories clearer for visualization purposes, I have recategorized them into **11 crime type groups**. 

* Assembly or Gathering Violations, Restraining Order Violations, Violations, Drug Violation, Firearm Violations, Liquor Violation, License Violation as **Violation**
* Simple Assault, Aggravated Assault as **Assault**
* Residential Burglary, Other Burglary, Commercial Burglary, Burglary – No Property Taken as **Burglary**
* Harassment, Criminal Harassment as **Harassment**
* Larceny, Larceny From Motor Vehicle, Auto Theft as **Larceny**
* Homicide, Manslaughter as **Killing**
* Fraud, Counterfeiting, Confidence Games as **Fraud**
* Motor Vehicle Accident Response as **Motor vehicle accident**
* Robbery as **Robbery**
* Verbal Disputes as **Verbal Disputes**
* Vandalism as **Vandalism**

Not all crimes were included in the categorization. I chose to focus on the most common crimes, the ones that damage Boston's reputation.

On the other hand, although homicide and manslaughter are much less frequent, I decided to include and group them because they are very serious crimes.

Also, since Burglary, Robbery and Larceny each have a large number of incidents, I did not combine these crimes.
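
The grouping rules above can also be written as a single pattern-to-category table. This is a minimal sketch of that alternative, not the approach used in the cell below: the `rules` dictionary holds only a subset of the rules, the toy offense groups are picked from the list above, and 'Towed' stands in for any group left outside the categorization.

```python
import numpy as np
import pandas as pd

# Toy offense groups; 'Towed' represents a group that is not categorized.
toy = pd.DataFrame({'Offense_group': [
    'Drug Violation', 'Simple Assault', 'Auto Theft',
    'Larceny From Motor Vehicle', 'Homicide', 'Towed',
]})

# Regex pattern -> category (illustrative subset of the rules).
rules = {
    'Violation': 'Violation',
    'Assault': 'Assault',
    'Larceny|Theft': 'Larceny',
    'Manslaughter|Homicide': 'Killing',
}

# One boolean mask per pattern; np.select picks the first matching category.
conditions = [toy['Offense_group'].str.contains(pattern) for pattern in rules]
toy['Category'] = np.select(conditions, list(rules.values()), default='Other')

print(toy)
```

Because `np.select` takes the first match, the order of the rules matters when a name could match more than one pattern.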

In [None]:
#Categorization 
#violation.Offense_group.unique()
#violation.Offense_group.value_counts()

import re
violation = crime[crime.Offense_group.str.contains('Violation')]
assault = crime[crime.Offense_group.str.contains('Assault')]
burglary = crime[crime.Offense_group.str.contains('Burglary')]
harassment = crime[crime.Offense_group.str.contains('Harassment')]
larceny = crime[crime.Offense_group.str.contains('Larceny|Theft')]
larceny = larceny[~larceny.Offense_group.str.contains('Recovery')]
#Investigate = crime[crime.Offense_group.str.contains('Investigate|Search', flags=re.IGNORECASE, regex=True)]
killing = crime[crime.Offense_group.str.contains('Manslaughter|Homicide')] #few rows, but among the most serious crimes, so they share one category
fraud = crime[crime.Offense_group.str.contains('Confidence Games|Fraud|Counterfeiting')]
#x = crime[crime.Offense_group.str.contains('Missing Person Located', flags=re.IGNORECASE, regex=True)] it was not crime
mv_accident = crime[crime.Offense_group.str.contains('Accident')]
#medicaid = crime[crime.Offense_group.str.contains('Medical Assistance')]
robbery = crime[crime.Offense_group.str.contains('Robbery')]
disputes = crime[crime.Offense_group.str.contains('Verbal Disputes')]
vandalism = crime[crime.Offense_group.str.contains('Vandalism')]

#A column named Category has been added for each crime category.
violation.insert(0, 'Category', 'Violation')
assault.insert(0, 'Category', 'Assault')
burglary.insert(0, 'Category', 'Burglary')
harassment.insert(0, 'Category', 'Harrassment')
larceny.insert(0, 'Category', 'Larceny')
killing.insert(0, 'Category', 'Killing') #few rows, but among the most serious crimes, so they share one category
fraud.insert(0, 'Category', 'Fraud')
mv_accident.insert(0, 'Category', 'Motor vehicle accident')
robbery.insert(0, 'Category', 'Robbery')
disputes.insert(0, 'Category', 'Verbal disputes')
vandalism.insert(0, 'Category', 'Vandalism')

#Build the categorized_crimes dataframe by concatenating the categorized frames.
frames = [violation, assault, burglary, harassment, larceny, killing, fraud, mv_accident, robbery, disputes, vandalism]
categorized_crimes = pd.concat(frames)
categorized_crimes

I'll also use the categorized_crimes dataframe where needed.

**Changes in crime over the years, by category**

In [None]:
categorized_crimes['Date'] = pd.to_datetime(categorized_crimes['Date'])
monthly_crimes_count = categorized_crimes.pivot_table(index=pd.Grouper(freq = 'M', key ='Date'), columns='Category', aggfunc=np.size, values= 'Offense_group')
monthly_crimes_count


In [None]:
monthly_crimes_count = monthly_crimes_count.reset_index(level='Date')
monthly_crimes_count = pd.melt(monthly_crimes_count, id_vars=['Date'])
monthly_crimes_count.tail()

By selecting crime types in the plot, the number of crimes and the trendline can be inspected for each type of crime:

In [None]:

fig = px.scatter(monthly_crimes_count, x='Date', y=monthly_crimes_count.value, color = 'Category', trendline="ols")

fig.update_layout(title='Number of Crimes for each Crime Category',
                   yaxis_title='Number of Crimes')
fig.show()

results = px.get_trendline_results(fig)
print(results)


1. Across Boston, Larceny, Robbery and Burglary decrease in number over time. This suggests that theft-like crimes are decreasing in Boston.

2. The individual points on the scatter plot show finer temporal patterns. For example, Larceny reveals a seasonal pattern, with more crime in summer and less in winter.

3. Assault, Motor Vehicle Accident, Harassment and Verbal Disputes show an increase in the number of incidents over time for the entire city of Boston.

4. Furthermore, Violation, Fraud and Vandalism seem to have decreased in number over time.
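
The direction of each trend can be summarized as a single slope per category. The sketch below fits a straight line with `np.polyfit` to invented monthly counts; the `month_index` column and the numbers are assumptions for illustration and do not come from the dataset.

```python
import numpy as np
import pandas as pd

# Invented monthly counts for two categories over six months.
toy = pd.DataFrame({
    'Category': ['Larceny'] * 6 + ['Assault'] * 6,
    'month_index': list(range(6)) * 2,
    'value': [600, 590, 580, 570, 560, 550,
              300, 305, 310, 315, 320, 325],
})

def slope(group):
    # Degree-1 least-squares fit; the first coefficient is the change per month.
    return np.polyfit(group['month_index'], group['value'], 1)[0]

slopes = {name: slope(group) for name, group in toy.groupby('Category')}
print(slopes)
```

A negative slope means the category is decreasing over the period, a positive slope that it is increasing.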


Let's look at the most common crimes when we don't categorize.

In [None]:
plt.figure(figsize=(10,8))
x = crime.Offense_group.value_counts().head(30).plot(kind='bar', color = 'salmon')

As can be seen, Motor Vehicle Accident Response is the most common offense group, so driving in Boston can be dangerous.

# Is it possible to predict where or when a crime will be committed? 

In [None]:
crime.District.value_counts()

In [None]:
crime['Date'] = pd.to_datetime(crime['Date'])
district_crimes_count = crime.pivot_table(index=pd.Grouper(freq = 'M', key ='Date'), columns='District', aggfunc=np.size, values= 'Offense_group')

district_crimes_count = district_crimes_count.reset_index(level='Date')
district_crimes_count = pd.melt(district_crimes_count, id_vars=['Date'])
district_crimes_count

In [None]:

fig = px.scatter(district_crimes_count, x='Date', y=district_crimes_count.value, color = 'District', trendline="ols")

fig.update_layout(title='Number of Crimes for each District',
                   yaxis_title='Number of Crimes')
fig.show()

results = px.get_trendline_results(fig)
print(results)

* In this data, Charlestown has the fewest recorded crimes. By this measure, Charlestown is the safest district in Boston to live in.
* There has been a marked decrease in the number of crimes over the years in Dorchester and Roxbury. However, the number of crimes there is still high compared to many districts. According to the data, Roxbury is the most dangerous district to live in.
* The districts where crime is most common are Dorchester, South End and Roxbury.
* The number of crimes in districts such as Brighton, Downtown, Hyde Park, Jamaica Plain, Mattapan, Roxbury, South Boston, South End and West Roxbury did not change much over the years.
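
District comparisons like these rest on raw incident counts. A minimal sketch of how counts and shares per district can be tabulated, using an invented toy `District` column rather than the real data:

```python
import pandas as pd

# Toy district labels; the frequencies are invented for illustration.
toy = pd.Series(['Roxbury', 'Roxbury', 'Dorchester', 'Charlestown',
                 'Roxbury', 'Dorchester', 'South End', 'Roxbury'],
                name='District')

counts = toy.value_counts()                 # incidents per district
shares = toy.value_counts(normalize=True)   # fraction of all incidents

summary = pd.DataFrame({'Count': counts, 'Share': shares.round(3)})
print(summary)
```

These are counts, not rates per resident, so a district with more inhabitants can rank higher simply because of its size.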


Let's also look at the distribution of each crime in the districts.

In [None]:
plt.figure(figsize=(20,10))
sns.countplot(x = 'Category', data = categorized_crimes, order = categorized_crimes.Category.value_counts().index, hue='District')

* According to the plot above, larceny incidents are mostly seen in the South End district.
* The most common incident in the Mattapan district is motor vehicle accidents.
* Killings are very rare all across Boston.

**Distribution of Crime over the Months (2015-2018)**

In [None]:
plt.figure(figsize=(10,6))
sns.countplot(x='Month',data=crime,palette="Reds")

It seems that crimes are mostly committed in summer.

**Distribution of Crime over the Months (for each Year)**

In [None]:
crime_2015=crime[crime['Year']==2015]
crime_2016=crime[crime['Year']==2016]
crime_2017=crime[crime['Year']==2017]
crime_2018=crime[crime['Year']==2018]

fig, axes = plt.subplots(1,4, figsize = (32,6))

sns.countplot(x='Month',data=crime_2015,palette = 'Blues',ax = axes[0]).set_title('2015')
sns.countplot(x='Month',data=crime_2016,palette = 'Greens',ax = axes[1]).set_title('2016')
sns.countplot(x='Month',data=crime_2017,palette = 'Reds',ax = axes[2]).set_title('2017')
sns.countplot(x='Month',data=crime_2018,palette = 'Blues',ax = axes[3]).set_title('2018')


Looking at the monthly crime distribution by year, we see again that the first 5 months of 2015 and the last 3 months of 2018 are missing. This gap can mislead us about how crime is distributed over the months.

However, as the available data shows, the number of crimes increases in the summer season.

For 2016 and 2017, where the data are complete, there is no big change in the number of crimes.
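
The summer effect can be quantified by grouping months into seasons and counting only the complete years. This is a minimal sketch with an invented toy frame; the `season_of_month` mapping is an assumption (meteorological seasons), not something defined in the dataset.

```python
import pandas as pd

# Toy rows with Year and Month columns (invented values).
toy = pd.DataFrame({
    'Year':  [2016, 2016, 2016, 2016, 2017, 2017, 2017, 2015],
    'Month': [1, 7, 7, 10, 8, 2, 6, 7],
})

# Assumed mapping from month number to meteorological season.
season_of_month = {12: 'Winter', 1: 'Winter', 2: 'Winter',
                   3: 'Spring', 4: 'Spring', 5: 'Spring',
                   6: 'Summer', 7: 'Summer', 8: 'Summer',
                   9: 'Autumn', 10: 'Autumn', 11: 'Autumn'}

# Keep only the years with twelve months of data, then count per season.
complete = toy[toy['Year'].isin([2016, 2017])]
season_counts = complete['Month'].map(season_of_month).value_counts()

print(season_counts)
```

Restricting the count to 2016 and 2017 keeps the partial years from inflating the summer and autumn totals.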

In [None]:
print(crime_2016.Offense_group.count())
print(crime_2017.Offense_group.count())

**What if we want to see the distribution of crime categories by months?**

In [None]:
all_catcrime = ['Larceny','Motor vehicle accident', 'Violation', 'Assault', 'Vandalism', 'Verbal disputes', 'Fraud','Killing', 'Burglary', 'Robbery','Harrassment']
catcrime_month = categorized_crimes.copy()
catcrime_month = catcrime_month[catcrime_month['Category'].isin(all_catcrime)]
catcrime_month= catcrime_month.groupby(['Month','Category']).size().reset_index(name = 'Number of Crimes')
catcrime_month['Months'] = catcrime_month['Month']

#pivot table
catcrime_month_reshape = pd.pivot_table(catcrime_month, index=['Month'], columns=['Category'], values='Number of Crimes', aggfunc=np.sum)


catcrime_month_reshape.plot(kind= 'bar', stacked = True, figsize=(15,8),color=['#c8553d','#370617','#6a040f','#9d0208','#f9844a','#450920','#dc2f02','#e85d04','#f48c06','#90be6d','#4d908e'])
plt.title('Number of Crimes by Type')
plt.show()

The number of crimes increases in the summer season.

As can be seen, Violation, Larceny and Motor vehicle accident form a significant portion of the crimes committed in every month.

Killing is hardly visible in the bar plot: the number of incidents that resulted in a killing in the city of Boston is very small.

#### Distribution of Crime over the Day

Let's look at the times of day when crime is committed the most. I used the categorized data.

In [None]:
plt.figure(figsize=(10,8))
categorized_crimes.Category.value_counts().head(10).plot(kind='bar', color = 'salmon')

In [None]:
catcrime_time = categorized_crimes.copy()
top10_catcrime = ['Larceny','Motor vehicle accident', 'Violation', 'Assault', 'Vandalism', 'Verbal disputes', 'Fraud', 'Burglary', 'Robbery','Harrassment']
catcrime_time = catcrime_time[catcrime_time['Category'].isin(all_catcrime)]
catcrime_time = catcrime_time.groupby('Hour').size().reset_index(name = 'Number of Crimes')
catcrime_time['Hour'] = catcrime_time['Hour'].apply(lambda i: str(i)+':00')
catcrime_time

In [None]:
plt.figure(figsize=(15,8))
sns.pointplot(data = catcrime_time, x = 'Hour', y = 'Number of Crimes')
plt.show()


As we can see, crime is mostly committed during the day and falls in the early morning. It reaches its highest levels during **rush hours**. During these hours, when people are tired and more irritable, crime may increase. However, to say anything definite we would need more information, such as the employment status of those who committed the crimes.

The number of crimes is highest between **5 and 7 o'clock** in the evening. We also know that the most reported incident is motor vehicle accident. Since traffic is very busy during these hours, this may explain the high number of crimes at this time.
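
The busiest and quietest hours can also be read off numerically instead of from the plot. A minimal sketch, assuming a toy `Hour` column with invented values:

```python
import pandas as pd

# Toy hours of occurrence (invented values).
toy = pd.DataFrame({'Hour': [17, 17, 18, 18, 18, 18, 3, 12, 12, 17, 5, 5]})

# Incidents per hour of the day.
hourly = toy.groupby('Hour').size()

busiest = hourly.nlargest(2)      # the two hours with the most incidents
quietest_hour = hourly.idxmin()   # the hour with the fewest incidents

print(busiest)
print(quietest_hour)
```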

We can narrow down based on each crime category.



In [None]:
#group crimes 
catcrime_type = categorized_crimes.copy()
catcrime_type = catcrime_type[catcrime_type['Category'].isin(top10_catcrime)]
catcrime_type= catcrime_type.groupby(['Hour','Category']).size().reset_index(name = 'Number of Crimes')
catcrime_type['Hours'] = catcrime_type['Hour'].apply(lambda i: str(i)+':00')

#pivot table
catcrime_type_reshape = pd.pivot_table(catcrime_type, index=['Hour'], columns=['Category'], values='Number of Crimes', aggfunc=np.sum)


catcrime_type_reshape.plot(kind= 'bar', stacked = True, figsize=(15,8),color=['#450920','#370617','#6a040f','#9d0208','#f9844a','#dc2f02','#e85d04','#f48c06','#90be6d','#4d908e'])
plt.title('Number of Crimes by Type')
plt.show()

Looking at the time periods, each crime category seems to have a similar proportion within each hour. Violation, Larceny and Motor vehicle accident form a significant portion of the crimes committed.

#### Distribution of Crime over the Week

In [None]:
crime_days = crime.groupby('Day').agg('count')
day_counts = crime_days.Offense_group
day_counts

As can be seen, there is no major change across the week. The days when crime is committed the most are **Monday** and **Friday**, the first and last days of the working week.
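
Grouping by day name sorts the result alphabetically, which makes the weekly pattern harder to read. A minimal sketch of putting the counts in calendar order, using an invented toy `Day` column:

```python
import pandas as pd

# Toy day-of-week labels (invented values).
toy = pd.DataFrame({'Day': ['Friday', 'Monday', 'Friday', 'Sunday',
                            'Wednesday', 'Monday', 'Friday']})

weekday_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday',
                 'Friday', 'Saturday', 'Sunday']

# Count per day, then reorder Monday..Sunday; days with no rows get 0.
day_counts = toy['Day'].value_counts().reindex(weekday_order, fill_value=0)

print(day_counts)
```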

# What can you say about the distribution of different offenses over the city?

 **Map of Boston Police Districts and Scatter Plot of all crime geo-locations**

> I want to know how offenses are distributed across the whole city, so I created a scatter plot mapping all offense locations in the dataset.




In [None]:
# x and y are passed by keyword, as recent seaborn versions require
sns.lmplot(x='Lat',
           y='Long',
           data=crime,
           fit_reg=False,
           hue='District',
           palette='Dark2',
           height=12,
           scatter_kws={"marker": "D",
                        "s": 10})
ax = plt.gca()
ax.set_title("All Crime Distribution per District")

The visualizations above show a map of the police districts (taken from the web) and a scatter plot of every crime geo-location in the dataset, which reproduces the shape of the Boston map.

Next, I drew a geographical scatter plot for each crime category to understand how the categorized crimes are distributed across the city. The countplot above already showed the distribution of the crime categories over the districts.

In [None]:
sns.lmplot(x="Lat",
           y="Long",
           col="Category",
           hue = 'District',
           data=categorized_crimes.dropna(), 
           col_wrap=2, height=6, fit_reg=False, 
           sharey=False,
           scatter_kws={"marker": "D",
                            "s": 10})

* Larceny and Motor vehicle accident crimes appear to be common all over the city.
* Harassment, Robbery and Verbal Disputes, however, start to show a specific geographic pattern.
* We can see where the crime clusters appear. While this is a good observation, it is too early to draw confident conclusions.
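
Clusters can be made more concrete by binning coordinates into grid cells and counting incidents per cell. The sketch below rounds invented coordinates to two decimals (cells of roughly one kilometre at this latitude); the cell size is an assumption chosen for illustration.

```python
import pandas as pd

# Toy coordinates (invented values near Boston).
toy = pd.DataFrame({
    'Lat':  [42.3512, 42.3518, 42.3509, 42.3301, 42.3305, 42.2899],
    'Long': [-71.0605, -71.0611, -71.0598, -71.0838, -71.0842, -71.1203],
})

# Snap each point to a grid cell by rounding its coordinates.
cells = toy.round({'Lat': 2, 'Long': 2})

# Count incidents per cell; the largest cells are candidate hotspots.
hotspots = cells.groupby(['Lat', 'Long']).size().sort_values(ascending=False)

print(hotspots)
```

A smaller rounding precision gives coarser cells, a larger one gives finer cells with fewer incidents each.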



I used folium to create some maps illustrating some of my findings, with notes below each one. The maps are interactive: you can zoom in and out and move them around.

I'll show crime distribution per district.

In [None]:
crime_dist

I load the police district boundaries from a GeoJSON file and then build a dataframe of crime counts per district to match it.

In [None]:
from urllib.request import urlopen
import json
with open('../input/police-district/Police_Districts.geojson') as f:
    boston_geojson1 = json.load(f)

boston_geojson1['features'][0]

In [None]:
dist = pd.DataFrame(data= crime_dist.District.value_counts().values, index= crime_dist.District.value_counts().index, columns=['Count'])
dist = dist.reset_index()
dist.rename({'index': 'District'}, axis='columns', inplace=True)
dist

In [None]:
import folium
from folium import Choropleth, Circle, Marker, plugins
from folium.plugins import HeatMap, MarkerCluster, FastMarkerCluster, HeatMapWithTime

crime_map = folium.Map(location=[42.361145,-71.057083], tiles='cartodbpositron', zoom_start=11.2)

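# Map.choropleth() was deprecated and later removed from folium; the Choropleth class takes the same arguments
folium.Choropleth(
    geo_data= boston_geojson1,
    data= dist,
    columns= ['District', 'Count'],
    key_on='feature.properties.DISTRICT',
    fill_color='GnBu', 
    fill_opacity=0.7, 
    line_opacity=0.2,
    legend_name='Choropleth of Crimes per Police District'
).add_to(crime_map)
crime_map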

* This map shows how the number of crimes varies across the districts.
* We can see that crime is concentrated around Roxbury (B2), Dorchester (C11), South End (D4) and Downtown (A1).
* The district where crime is most intense seems to be Roxbury.
* Furthermore, the districts with more crime are geographically connected.
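
A choropleth only colors districts whose key in the data matches the key in the GeoJSON file, here the `DISTRICT` property used in `key_on`. The following is a minimal sketch of checking that match before drawing; the toy GeoJSON and the toy `dist` frame are invented, and a real file also carries a geometry for each feature.

```python
import pandas as pd

# Minimal GeoJSON-like structure with only the property used for matching.
toy_geojson = {'type': 'FeatureCollection', 'features': [
    {'type': 'Feature', 'properties': {'DISTRICT': 'A1'}, 'geometry': None},
    {'type': 'Feature', 'properties': {'DISTRICT': 'B2'}, 'geometry': None},
    {'type': 'Feature', 'properties': {'DISTRICT': 'C11'}, 'geometry': None},
]}

# Toy counts; 'Roxbury' is a name, not a code, so it cannot match.
dist = pd.DataFrame({'District': ['A1', 'B2', 'Roxbury'],
                     'Count': [10, 20, 5]})

geo_codes = {f['properties']['DISTRICT'] for f in toy_geojson['features']}
data_codes = set(dist['District'])

unmatched_in_data = sorted(data_codes - geo_codes)  # rows that would not be drawn
unmatched_in_geo = sorted(geo_codes - data_codes)   # districts left without a value

print(unmatched_in_data, unmatched_in_geo)
```

This is why the map is built from the copy of the data that still holds district codes rather than district names.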

**Geo Locations of Crimes:**

In [None]:
crimes = crime.copy()
crimes.dropna( axis = 0, subset = [ 'Lat', 'Long' ], inplace = True )

lats = list(crimes.Lat)
longs = list(crimes.Long)
locations = [lats,longs]

m = folium.Map(location=[42.361145,-71.057083], tiles='cartodbpositron', zoom_start=11.2)

FastMarkerCluster(data=list(zip(lats, longs))).add_to(m)

# Map.choropleth() was deprecated and later removed from folium; the Choropleth class takes the same arguments
folium.Choropleth(
    geo_data= boston_geojson1,
    name='choropleth',
    data= dist,
    columns= ['District', 'Count'],
    key_on='feature.properties.DISTRICT',
    fill_color='YlOrRd', 
    fill_opacity=0.4, 
    line_opacity=0.2,
    legend_name='Distribution of Crimes over the City',
    highlight=False
    ).add_to(m)
m


As you zoom in on the map above, the clusters split up and show the number of crimes recorded around each street of Boston.

With this map, the streets where more crimes are committed can be identified, and more resources can be sent to make them safer. The less dangerous streets can also be identified and used as routes.
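
The same question can be answered in tabular form by counting incidents per street. A minimal sketch, assuming a toy `Street` column; the street names and frequencies are invented for illustration.

```python
import pandas as pd

# Toy street values, including one missing entry (invented data).
toy = pd.DataFrame({'Street': ['WASHINGTON ST', 'BLUE HILL AVE', 'WASHINGTON ST',
                               None, 'BOYLSTON ST', 'WASHINGTON ST',
                               'BLUE HILL AVE']})

# value_counts() ignores missing streets and sorts from most to least frequent.
top_streets = toy['Street'].value_counts().head(2)

print(top_streets)
```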

**How are different offenses distributed over the city?**

The GeoJSON file identifies districts by code, so the district names need to be mapped back to district codes.
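
The reverse mapping does not have to be typed out again; it can be derived from the code-to-name dictionary. A minimal sketch, assuming a shortened copy of the `district_name` dictionary defined earlier (`district_code` is a name introduced here for illustration):

```python
import pandas as pd

# Shortened copy of the code -> name mapping used earlier.
district_name = {'D14': 'Brighton', 'C11': 'Dorchester', 'B2': 'Roxbury'}

# Invert it to name -> code.
district_code = {name: code for code, name in district_name.items()}

# The inversion is only safe if no two codes share the same name.
assert len(district_code) == len(district_name)

names = pd.Series(['Roxbury', 'Brighton', 'Unknown'])
codes = names.map(district_code)  # names without a code become NaN

print(codes)
```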

In [None]:
categorized_crime = categorized_crimes.copy()
districts_name = {
'Brighton':'D14',
'Dorchester':'C11',
'South End':'D4',
'Mattapan':'B3',
'Roxbury':'B2',
'South Boston':'C6',
'Downtown':'A1',
'West Roxbury':'E5',
'East Boston':'A7',
'Jamaica Plain':'E13',
'Hyde Park':'E18',
'Charlestown':'A15'
}
categorized_crime['District'] = categorized_crime['District'].map(districts_name)
categorized_crime

In [None]:
crimes_overcity = categorized_crime.pivot_table(index=pd.Grouper(key ='District'), columns='Category', aggfunc=np.size, values= 'Offense_group')
crimes_overcity = crimes_overcity.reindex(['A15', 'A7', 'A1', 'C6','D4', 'D14', 'E13', 'E5','B3', 'C11', 'E18', 'B2'])
crimes_overcity = crimes_overcity.reset_index(level='District')
crimes_overcity = pd.melt(crimes_overcity, id_vars=['District'])
crimes_overcity

In [None]:
import plotly.express as px

fig = px.choropleth_mapbox(crimes_overcity, geojson=boston_geojson1,  featureidkey ='properties.DISTRICT', locations='District', color='value',
                           color_continuous_scale=[[0.0, "rgb(165,0,38)"],
                [0.1111111111111111, "rgb(215,48,39)"],
                [0.2222222222222222, "rgb(244,109,67)"],
                [0.3333333333333333, "rgb(253,174,97)"],
                [0.4444444444444444, "rgb(254,224,144)"],
                [0.5555555555555556, "rgb(224,243,248)"],
                [0.6666666666666666, "rgb(171,217,233)"],
                [0.7777777777777778, "rgb(116,173,209)"],
                [0.8888888888888888, "rgb(69,117,180)"],
                [1.0, "rgb(49,54,149)"]],
                           animation_frame=crimes_overcity["Category"],
                           range_color=(900, 10000),
                           mapbox_style="carto-positron",
                           zoom=10, center={"lat": 42.361145, "lon": -71.057083},
                           opacity=0.5,
                           labels={'value':'Number of Crimes'}
                          )
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()

# CONCLUSION

After analysing the Crimes in Boston dataset, we can see the trends and relations between the type, location and time of occurrence of crimes. Some of the takeaways from the analysis are listed below:

* Most crime occurred in the summer months of July and August.

* Motor vehicle accident response is the most reported incident in the Boston dataset.

* The districts where crime is most common are Roxbury, Dorchester, South End and Downtown.

* The district where crime is most intense seems to be Roxbury.