# Preface

The city of Boston has created an initiative to make a range of datasets available to the public. These datasets cover different aspects of the city, such as City Hall electricity usage, approved building permits, and 311 service requests. For this project, I will focus on the crimes reported to the Boston Police Department (BPD). The end goal is to provide some insight into crime within Boston and help the BPD better prepare for it.

There are two reasons why I chose this dataset:
-  Continuous updates: The dataset covers crimes from 2015 until April of 2019 (as of the writing of this notebook) and is constantly updated as new crimes occur.
-  Dataset features: The dataset contains enough features to support several kinds of analysis. The data was also carefully sanitized before being made public, which eases the task of analyzing it.


# Tools used

In [None]:
'''
These libraries must be present prior to running the notebook.
For more information, please visit the developer site for each package.
'''
import pandas as pd
import seaborn as sns
import gmaps
import plotly as py
from plotly import tools
import plotly.graph_objs as go 
import requests
import matplotlib as plt
import sklearn

print(f"Pandas version: {pd.__version__}")
print(f"Seaborn version: {sns.__version__}")
print(f"gmaps version: {gmaps.__version__}")
print(f"Plotly version: {py.__version__}")
print(f"requests version: {requests.__version__}")
print(f"matplotlib version: {plt.__version__}")
print(f"sklearn version: {sklearn.__version__}") 


# Importing the crime data

In [None]:
datacrime_api = "https://data.boston.gov/datastore/dump/12cb3883-56f5-47de-afa5-3b1cf61b257b"
api_response = requests.get(datacrime_api)
if api_response.status_code == 200:
    crime_data = pd.read_csv(datacrime_api, parse_dates=['OCCURRED_ON_DATE'])
    print("Retrieval of dataset from Analyze Boston's API was successful")
else:
    print(f"The notebook could not connect to the API. The following error has occurred: {api_response}.")
    print("Instead, the notebook will use a local dataset retrieved on April 7, 2019.")
    crime_data = pd.read_csv("tmpj29on4xs.csv", parse_dates=['OCCURRED_ON_DATE'])
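
As a sketch of an alternative (not what the cell above does), the CSV text that `requests` has already downloaded can be parsed directly, so the file is fetched only once. `parse_crime_csv` and the two-row sample text are hypothetical, used only for illustration.

```python
import io

import pandas as pd


def parse_crime_csv(csv_text):
    """Parse CSV text that is already in memory (hypothetical helper)."""
    return pd.read_csv(io.StringIO(csv_text), parse_dates=["OCCURRED_ON_DATE"])


# Made-up sample standing in for api_response.text
sample_text = (
    "INCIDENT_NUMBER,OCCURRED_ON_DATE\n"
    "X000000001,2019-04-01 12:30:00\n"
    "X000000002,2018-12-31 23:59:00\n"
)
sample = parse_crime_csv(sample_text)
print(sample["OCCURRED_ON_DATE"].dt.year.tolist())
```

In the notebook, this would be called as `parse_crime_csv(api_response.text)` inside the branch where the status code is 200.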


# The Data

In [None]:
crime_data.head(10)

Throughout this notebook, I will be concentrating on different features to find patterns and answer some questions about this data. 


# Predicting crimes in districts
Sometimes the BPD is not able to accurately determine the location of crimes in the city of Boston. By using scikit-learn's API, we may be able to predict the district in which a crime will occur based on certain features.

In [None]:
ValidCrimeData = crime_data.loc[crime_data["DISTRICT"].notnull() & crime_data["STREET"].notnull() & crime_data["UCR_PART"].notnull()]

x = ValidCrimeData[["OFFENSE_CODE", "MONTH", "HOUR"]].copy() # Copy so that new columns can be added safely

y = ValidCrimeData["DISTRICT"]


In [None]:
from sklearn.preprocessing import LabelEncoder
pd.set_option('mode.chained_assignment', None) # Hides warning about chained assignment

# Categorical columns to convert into integer labels
categorical_columns = ["STREET", "REPORTING_AREA", "INCIDENT_NUMBER", "OFFENSE_DESCRIPTION",
                       "OFFENSE_CODE_GROUP", "DAY_OF_WEEK", "UCR_PART", "OCCURRED_ON_DATE"]

# Use a fresh encoder per column and avoid rebinding the LabelEncoder class name
for column in categorical_columns:
    x[column] = LabelEncoder().fit_transform(ValidCrimeData[column])
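
To make the encoding step concrete, here is a small sketch of what `LabelEncoder` does to a single column. The district values are toy inputs chosen for illustration, not rows from the dataset.

```python
from sklearn.preprocessing import LabelEncoder

# Toy values for illustration only
toy_districts = ["B2", "A1", "B2", "C11"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(toy_districts)

# Classes are sorted, then each value is replaced by its position in that order
print(encoder.classes_.tolist())  # ['A1', 'B2', 'C11']
print(encoded.tolist())           # [1, 0, 1, 2]

# The mapping can be reversed to recover the original labels
decoded = encoder.inverse_transform(encoded)
print(decoded.tolist())           # ['B2', 'A1', 'B2', 'C11']
```

Note that the resulting integers carry an arbitrary (alphabetical) order, so a model that treats its inputs as numeric magnitudes will read meaning into that order.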

In [None]:
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# KNeighborsClassifier = 0.2575589894932437
# SVC
model = GaussianNB()
Xtrain, Xtest, Ytrain, Ytest = train_test_split(x, y, test_size=.20, random_state=1)

In [None]:
model.fit(Xtrain, Ytrain) # Fit model to the training data
y1_model = model.predict(Xtrain) # Predict on the training data


In [None]:
accuracy_score(Ytrain, y1_model) # View score of train data

In [None]:
y2_model = model.predict(Xtest) # Predict on the held-out test data, then view its score
accuracy_score(Ytest, y2_model)
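
One way to put the two accuracy scores in context, which is not part of the original analysis, is to compare them with a baseline that always predicts the most frequent district. The sketch below uses scikit-learn's `DummyClassifier` on made-up labels; in the notebook the same calls would take `Xtrain`, `Ytrain`, `Xtest`, and `Ytest` instead.

```python
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Made-up labels: 6 of the 10 toy rows belong to district "B2"
toy_labels = ["B2"] * 6 + ["C11"] * 3 + ["D4"]
toy_features = [[0]] * len(toy_labels)  # Features are ignored by this baseline

baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(toy_features, toy_labels)

baseline_accuracy = accuracy_score(toy_labels, baseline.predict(toy_features))
print(baseline_accuracy)  # 0.6, the share of the most frequent class
```

A classifier is only adding information if its test accuracy is clearly above this kind of baseline.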

# Analysis

How can we provide a quick overview of crimes to the BPD? One of the easiest ways is to find the most common crimes in the available dataset. In this case, I will focus on the top 20 crimes from 2015 to the most recent data, which is April of 2019.

In [None]:
top_20_crimes = crime_data["OFFENSE_DESCRIPTION"].value_counts().head(20) # Select top 20 crimes 2015-2019
sns.set(style="darkgrid", rc={'figure.figsize':(10,10)})
# Counts on the x axis, crime names on the y axis (does not depend on how pandas names the count column)
ax = sns.barplot(x=top_20_crimes.values, y=top_20_crimes.index)
ax.set_xlabel('Total crimes',x=0.4, fontsize=20)


We begin by looking at the trend of crimes from 2015 until April of 2019. Primarily, we focus on how the number of crimes fluctuates from January through December in each of those years. To start, we collect the crimes for each year available in the dataset:

In [None]:
# Select year column
crime2015 = crime_data.loc[crime_data["YEAR"] == 2015]
crime2016 = crime_data.loc[crime_data["YEAR"] == 2016]
crime2017 = crime_data.loc[crime_data["YEAR"] == 2017]
crime2018 = crime_data.loc[crime_data["YEAR"] == 2018]
crime2019 = crime_data.loc[crime_data["YEAR"] == 2019]

However, the monthly counts will be out of order, causing the graph to display the data improperly. To solve this, we first map each month number in the crime data to its name, using a pandas Series that contains the names of the months. Then we count the crimes for each individual month of each available year. Lastly, we re-index the resulting counts by the month names, which puts the months in calendar order. For any month with no data, the re-indexing simply fills in NaN.

In [None]:
# Maps month numbers to names, counts crimes per month for each year, then re-indexes into calendar order
# Index 0 holds an empty placeholder so that index = month number; months[1:] skips it when re-indexing
months = pd.Series(["", "January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December"])

crime2015_count = (crime2015["MONTH"]).map(months).value_counts().reindex(months[1:])
crime2016_count = (crime2016["MONTH"]).map(months).value_counts().reindex(months[1:])
crime2017_count = (crime2017["MONTH"]).map(months).value_counts().reindex(months[1:])
crime2018_count = (crime2018["MONTH"]).map(months).value_counts().reindex(months[1:])
crime2019_count = (crime2019["MONTH"]).map(months).value_counts().reindex(months[1:])
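
The map, count, and re-index steps can be seen in isolation on a toy series. The month numbers below are made up for illustration; only three month names are used to keep the output short.

```python
import pandas as pd

# Index 0 is an empty placeholder so that position == month number
month_names = pd.Series(["", "January", "February", "March"])

# Toy month numbers: two crimes in January, three in February, none in March
toy_months = pd.Series([2, 1, 1, 2, 2])

counts = (
    toy_months.map(month_names)   # 1 -> "January", 2 -> "February"
    .value_counts()               # Sorted by frequency, not by calendar order
    .reindex(month_names[1:])     # Calendar order; missing months become NaN
)
print(counts)
```

Here February comes first after `value_counts()` because it is the most frequent, and `reindex` is what restores the January, February, March order and adds the NaN entry for March.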


To visualize our data, we can use many different tools such as matplotlib and seaborn. However, none of them offers an easy way for someone to interact with the data, such as changing the year to view its crimes. For this task, I will implement an interactive chart with a year dropdown in Plotly:

In [None]:
'''
pip install plotly 
pip install dash==0.39.0  # The core dash backend

####### This gets installed when installing dash #######
pip install dash-html-components==0.14.0  # HTML components
pip install dash-core-components==0.44.0  # Supercharged components
pip install dash-table==3.6.0  # Interactive DataTable component (new!)

####### This does not get installed when installing dash, will need to run pip #######
pip install dash-daq==0.1.0  # DAQ components (newly open-sourced!)
'''

tools.set_credentials_file(username="thenr@wit.edu",api_key="aEjbCknbrip24nS2D45f")
# Defines x/y axis as well as style
crime2019_plot = go.Scatter(x=(crime2019_count.index),
    y=crime2019_count.values,
    name="2019",
    line=dict(color='#ce1c1c')
)

crime2018_plot = go.Scatter(x=(crime2018_count.index),
    y=crime2018_count.values,
    name="2018",
    line=dict(color='#2ae011')
)

crime2017_plot = go.Scatter(x=(crime2017_count.index),
    y=crime2017_count.values,
    name="2017",
    line=dict(color='#1025e0')
)

crime2016_plot = go.Scatter(x=(crime2016_count.index),
    y=crime2016_count.values,
    name="2016",
    line=dict(color='#10e0c7')
)

crime2015_plot = go.Scatter(x=(crime2015_count.index),
    y=crime2015_count.values,
    name="2015",
    line=dict(color='#fffb0f')
)


data = [crime2019_plot, crime2018_plot, crime2017_plot, crime2016_plot, crime2015_plot]

# One dropdown button per view; each 'visible' list follows the order of the traces in `data`.
# Layout annotations expect text labels rather than traces, so the buttons only change visibility and title.
updatemenus = [
    dict(active=-1,
         buttons=[
             dict(label='2015 to 2019',
                  method='update',
                  args=[{'visible': [True, True, True, True, True]},
                        {'title': 'Crime reports from 2015 to 2019'}]),
             dict(label='2019',
                  method='update',
                  args=[{'visible': [True, False, False, False, False]},
                        {'title': 'Crimes of 2019'}]),
             dict(label='2018',
                  method='update',
                  args=[{'visible': [False, True, False, False, False]},
                        {'title': 'Crimes of 2018'}]),
             dict(label='2017',
                  method='update',
                  args=[{'visible': [False, False, True, False, False]},
                        {'title': 'Crimes of 2017'}]),
             dict(label='2016',
                  method='update',
                  args=[{'visible': [False, False, False, True, False]},
                        {'title': 'Crimes of 2016'}]),
             dict(label='2015',
                  method='update',
                  args=[{'visible': [False, False, False, False, True]},
                        {'title': 'Crimes of 2015'}]),
         ])
]

layout = dict(title='Crime reports from 2015 to 2019', showlegend=True, updatemenus=updatemenus)

fig = dict(data=data, layout=layout)
py.plotly.iplot(fig, filename='update_dropdown')
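
The dropdown buttons above follow a regular pattern: each one shows exactly one trace, matched by position. As a sketch, the same list of button dictionaries could be generated rather than written by hand. `build_year_buttons` is a hypothetical helper, not part of Plotly, and it assumes the names are passed in the same order as the traces in `data`.

```python
def build_year_buttons(trace_names):
    """Build dropdown buttons: one for all traces, then one per trace.

    Hypothetical helper. `trace_names` must be in the same order as the
    traces in the figure, because 'visible' is matched to traces by position.
    """
    all_label = f"{min(trace_names)} to {max(trace_names)}"
    buttons = [dict(label=all_label,
                    method="update",
                    args=[{"visible": [True] * len(trace_names)},
                          {"title": f"Crime reports from {all_label}"}])]
    for selected in trace_names:
        # True only at the position of the selected trace
        mask = [name == selected for name in trace_names]
        buttons.append(dict(label=selected,
                            method="update",
                            args=[{"visible": mask},
                                  {"title": f"Crimes of {selected}"}]))
    return buttons


buttons = build_year_buttons(["2019", "2018", "2017", "2016", "2015"])
print(buttons[2]["args"][0]["visible"])  # [False, True, False, False, False]
```

The returned list could then be used as the `buttons` value inside `updatemenus`, which keeps the visibility masks consistent if a year is added later.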


# Monthly crime

In [None]:
months = crime_data["OCCURRED_ON_DATE"].dt.month # Month number (1-12) of each crime
MonthsName = pd.Series(["", "January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December"])
monthsanalysis = months.map(MonthsName).value_counts().reindex(MonthsName[1:]) # Skip the empty placeholder at index 0

bar_chart = plt.pyplot.bar(x=monthsanalysis.index, height=monthsanalysis.values, color="orange")
plt.pyplot.xticks(rotation=45)


plt.pyplot.plot((monthsanalysis.values), 'r--o') # Line over bar chart
plt.pyplot.show()


# Day of the week crime

In [None]:
dayofweek = crime_data["OCCURRED_ON_DATE"].dt.dayofweek # 0 = Monday through 6 = Sunday
WeekdaysName = pd.Series(["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"])
dayofweekname = dayofweek.map(WeekdaysName).value_counts().reindex(WeekdaysName)

bar_chart = plt.pyplot.bar(x=dayofweekname.index, height=dayofweekname.values)

plt.pyplot.plot((dayofweekname.values), 'r--o') # Line over bar chart
plt.pyplot.show()


# Location of crimes

Since our dataset contains the coordinates of where reported crimes occurred, it would be ideal to visualize them. To do so, we will use gmaps, an open-source module that integrates well with Jupyter. For more information, please visit: https://github.com/pbugnion/gmaps.

Before using gmaps, you will need to obtain an API key from Google Maps: https://developers.google.com/maps/documentation/javascript/get-api-key. Once you have obtained the key, pass it to gmaps.configure(api_key=""). One advantage of using your own key is that you can view the analytics of the API calls:

![title](https://drive.google.com/file/d/1gEnFeiTBxLSonw-FU86rzMNm66ec2TPN/view?usp=sharing)

However, for simplicity, the key is already provided in the notebook. 

In [None]:
'''
1-Install gmaps by running conda install -c conda-forge gmaps

2-Enable ipywidgets extensions by running jupyter nbextension enable --py --sys-prefix widgetsnbextension

3-Load the extension into Jupyter by running jupyter nbextension enable --py --sys-prefix gmaps

4-Next, let's install the jupyter widgets extension for JupyterLab by running jupyter labextension install @jupyter-widgets/jupyterlab-manager
If you receive an error that node js is not installed, run "conda install nodejs". Then re-run the labextension installation

5-Next, run "jupyter lab build" to rebuild JupyterLab so that it includes the frontend code in the JupyterLab installation.
Then restart the kernel and try again.

'''

gmaps.configure(api_key="AIzaSyCHGv8iQ0fuxUnIccKCwcMnHepRMeBo85Y")

#####################################################################################################
# Certain crimes reported to BPD do not contain a location. Instead, BPD sets latitude and longitude
# equal to 0.000000. Since these coordinates are not useful, let's drop them along with missing values
#####################################################################################################
crime_lat_long = crime_data[["Lat", "Long"]].dropna()
crime_lat_long = crime_lat_long[(crime_lat_long["Lat"] != 0) & (crime_lat_long["Long"] != 0)]

# Create a heatmap using coordinates on Google maps
fig = gmaps.figure()
heatmap_layer = gmaps.heatmap_layer(crime_lat_long)

fig.add_layer(heatmap_layer)
fig
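
The comment in the cell above mentions placeholder coordinates of 0.000000. A minimal sketch of removing both missing and zero coordinates is shown below on made-up points; the column names match the dataset, but the values are invented for illustration.

```python
import pandas as pd

# Made-up coordinates: two valid points, one missing value, one zero placeholder
toy_coords = pd.DataFrame({
    "Lat": [42.3601, None, 0.0, 42.3297],
    "Long": [-71.0589, -71.06, 0.0, -71.0850],
})

cleaned = toy_coords.dropna()  # Removes the row with the missing latitude
cleaned = cleaned[(cleaned["Lat"] != 0) & (cleaned["Long"] != 0)]  # Removes the zero placeholder
print(len(cleaned))  # 2
```

Calling `dropna()` on its own would keep the zero row, which would then be drawn as a point far from Boston.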


While zooming in on the Boston area in Google Maps, you may have noticed that the heatmap slowly disappears as you zoom into the city. This is due to Google Maps trimming off the maximum peak intensity of the points. To get around this, we need to modify the max_intensity and the point_radius of the heatmap. Unfortunately, there is no perfect value for these settings; the best way to find a decent value is to experiment with them. For the dataset we are using, I found that max_intensity = 100 and point_radius = 12 work best. Rather than re-drawing the heatmap, we can modify these settings on the existing layer, which changes only the coverage of each individual point:

In [None]:
heatmap_layer.max_intensity = 100
heatmap_layer.point_radius = 12

If you zoom out, you will see that the heatmap covers a large portion of the city of Boston. However, when zooming in, you will see the heatmap properly display the dataset's coordinates. Again, this is just the way Google Maps handles the drawing of the heatmap and not a bug in gmaps.

# Challenges

I faced many challenges when working on this project. The first was picking a dataset. Today there are hundreds of datasets available, and I struggled to choose one because I wanted to work on something interesting while using a dataset that I could trust. Deciding what type of analysis to carry out was another challenge. With this dataset, I could do either regression or classification. After doing research and working out exactly what I wanted to do, I decided to classify which district within the city of Boston a crime occurs in.

# Future work

- Implement graphs and heatmap on a public site
- Increase the score of predicting the likelihood of a crime occurring in a district
- Do some regression to predict how many crimes are likely to occur in a given month

# Conclusion
With this dataset, we were able to answer some questions and find some patterns about crimes within the city of Boston. We discovered the most common crimes committed, where crimes occurred, and more. Existing tools make this kind of data analysis very approachable. What makes these tools so powerful is the opportunity to apply much of the same logic to different datasets. This allows us to answer questions whose answers are not yet known or to gain a better understanding of existing data.
