# Once upon a time, Gotham was overrun by demonic fireflies that caused mayhem during an intense 24-hour battle. Our one and only caped crusader, *Batman*, decided to set up a crime analysis unit in his cave. Naturally, Alfred took charge of predictive analytics for upcoming incidents to better equip Batman with information. He was smart; he understood the value of data, its capabilities and its loyalty.

![batman](https://3.bp.blogspot.com/-XarXIDJXjxg/VY_mrEZfgnI/AAAAAAAAVz0/MQaHmALgGI0/w1200-h630-p-k-no-nu/Batman-Arkham-Knight_Firefly.jpg)

## Following in his footsteps, and with the crime rate rising in Boston, Massachusetts, we have been tasked with setting up a similar unit to help the BPD tackle this rampant issue. Today we are going to work with the 2018 crime incident data provided to us by the BPD. We have a big responsibility to provide key insights and predictive analysis to the stakeholders. Let's find a good spot, turn on the machine and dive deeper into it.

![batman with alfred](https://c4.wallpaperflare.com/wallpaper/380/923/812/batman-batman-a-telltale-game-series-alfred-pennyworth-bruce-wayne-wallpaper-preview.jpg)

# Importing libraries and packages along with dataset

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import scipy.stats as st
import sklearn
!pip install bs4
!pip install openpyxl
from bs4 import BeautifulSoup
import requests
from geopy.geocoders import Nominatim
%matplotlib inline

In [None]:
df = pd.read_csv('../input/boston-crime-incident-report-2018/crime-incident-reports-2018.csv')

In [None]:
df_raw = df.copy() #Just in case

In [None]:
df.head()

# Initial inspection of the dataset

In [None]:
df.columns

In [None]:
len(df.columns) 

# Total 17 columns

In [None]:
df.info()

# Our dataframe has 11 categorical features and 6 quantitative features, and around 100,000 (1 lakh) entries.

In [None]:
from sklearn.compose import make_column_selector

In [None]:
selector = make_column_selector(dtype_include='object')

In [None]:
categorical_data = selector(df)

In [None]:
categorical_data

In [None]:
len(categorical_data)

In [None]:
df[categorical_data].head()

In [None]:
df.describe()

In [None]:
df.head()

In [None]:
df.isnull().sum()

#### We have some columns with missing values. Compared to the size of the dataset, these are very few, so we will probably drop them. Let's check the rows where at least two of these values are simultaneously null.
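One way to make "at least two missing" literal is to count the nulls per row across the location columns. This is only a sketch on a tiny made-up frame: the helper name `rows_missing_at_least` is hypothetical, and only the column names `DISTRICT`, `STREET` and `Lat` come from the dataset.

```python
import numpy as np
import pandas as pd

def rows_missing_at_least(frame, columns, k=2):
    """Return the rows where at least k of the given columns are null."""
    null_count = frame[columns].isnull().sum(axis=1)
    return frame[null_count >= k]

# Toy data for illustration only, not taken from the BPD file
toy = pd.DataFrame({
    'DISTRICT': ['B2', None, None, 'A1'],
    'STREET': ['MAIN ST', None, 'ELM ST', None],
    'Lat': [42.33, np.nan, np.nan, 42.36],
})
sparse_rows = rows_missing_at_least(toy, ['DISTRICT', 'STREET', 'Lat'], k=2)
print(sparse_rows.index.tolist())
```

Note that this counts any two missing columns, so it is slightly broader than a filter that always requires `Lat` to be null.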

In [None]:
df[(df.DISTRICT.isnull() | df.STREET.isnull()) & df.Lat.isnull()]

In [None]:
df.drop(index= df[(df.DISTRICT.isnull() | df.STREET.isnull()) & df.Lat.isnull()].index,inplace=True)

In [None]:
df.drop(index= df[df.DISTRICT.isnull() & df.STREET.isnull() & df.Lat.isnull()].index,inplace=True)

In [None]:
df[(df.DISTRICT.isnull() | df.STREET.isnull()) & df.Lat.notnull()]

# Here either the district or the street is missing, but we have the coordinates of the place, from which we could
# recover those values. However, some coordinates are -1,-1, which is not a real location. Let's go deeper!

In [None]:
df[(df.DISTRICT.isnull() | df.STREET.isnull()) & df.Lat.notnull()].Lat.value_counts().head(10)

# Here we see -1 values for 277 records, but we don't know yet whether the corresponding district and street values
# are present or not

In [None]:
df[(df.DISTRICT.isnull() | df.STREET.isnull()) & df.Lat.notnull() & (df.Lat<0)].shape

# Dropping all these rows as we don't have any way to fetch the details of location based on coordinate points

In [None]:
df.drop(df[(df.DISTRICT.isnull() | df.STREET.isnull()) & df.Lat.notnull() & (df.Lat<0)].index,inplace=True)

In [None]:
# Now we should not have any rows with all the info lost
df[df.Lat < 0].shape

# Here we have both district and street, so no reason to drop it.
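Since these rows stay in the frame, the -1 placeholders could be turned into proper missing values so they do not end up being treated as real positions later on. A minimal sketch, assuming the placeholder is exactly -1 in both `Lat` and `Long`; the helper name `mask_sentinel_coords` and the toy rows are made up for illustration.

```python
import numpy as np
import pandas as pd

def mask_sentinel_coords(frame, lat_col='Lat', long_col='Long', sentinel=-1.0):
    """Return a copy where coordinate pairs equal to the sentinel become NaN."""
    out = frame.copy()
    is_sentinel = (out[lat_col] == sentinel) & (out[long_col] == sentinel)
    out.loc[is_sentinel, [lat_col, long_col]] = np.nan
    return out

# Toy data for illustration only
toy = pd.DataFrame({
    'STREET': ['MAIN ST', 'ELM ST', 'OAK ST'],
    'Lat': [42.33, -1.0, 42.36],
    'Long': [-71.08, -1.0, -71.06],
})
cleaned = mask_sentinel_coords(toy)
print(cleaned)
```

Working on a copy keeps the original frame untouched, which fits the "just in case" habit used earlier with `df_raw`.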

In [None]:
df.isnull().sum()

##### Let's check what all we can get from coordinates

In [None]:
# Rows with a missing street but usable (non-placeholder) coordinates
df[df.STREET.isnull() & df.Lat.notnull() & (df.Lat > 0)]

#### So, all 634 street values above can be fetched from coordinate values

In [None]:
# Rows with a missing district but usable (non-placeholder) coordinates
df[df.DISTRICT.isnull() & df.Lat.notnull() & (df.Lat > 0)]

####  All 331 district values can be fetched from coordinate values 

In [None]:
df[df.Lat.isnull()].shape

# These are all the values where we have both district and street values but not coordinates, dealing with it later.

In [None]:
df[df.UCR_PART.isnull()]

# Since these are all home invasions, we will assign their UCR part to Part Two

In [None]:
df['UCR_PART'] = df['UCR_PART'].fillna('Part Two')

In [None]:
df.UCR_PART.isnull().sum()

In [None]:
df.isnull().sum()

In [None]:
# Percentage of rows where SHOOTING is actually filled in
round(df.SHOOTING.notnull().mean()*100,2)

#### Time to drop the Shooting column as it is disastrously empty.

In [None]:
df.drop('SHOOTING',axis=1,inplace=True)
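
The same decision can be generalised into a fill-rate check across all columns. This is only a sketch: the helper `drop_sparse_columns`, the 50% threshold and the toy frame are assumptions for illustration, not something the dataset prescribes.

```python
import numpy as np
import pandas as pd

def drop_sparse_columns(frame, min_fill=0.5):
    """Drop columns whose share of non-null values is below min_fill."""
    fill_rate = frame.notnull().mean()
    keep = fill_rate[fill_rate >= min_fill].index
    return frame[keep]

# Toy data for illustration only
toy = pd.DataFrame({
    'SHOOTING': [np.nan, np.nan, np.nan, 'Y'],
    'DISTRICT': ['B2', 'C11', None, 'A1'],
    'HOUR': [1, 13, 22, 7],
})
trimmed = drop_sparse_columns(toy, min_fill=0.5)
print(list(trimmed.columns))
```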

In [None]:
# We know that this data is dated to 2018, so there is no point in keeping 'YEAR' column

df.drop('YEAR',axis=1,inplace=True)

In [None]:
# dropping incident number as it does not provide any useful information and is only used for internal categorization.

df.drop('INCIDENT_NUMBER',axis=1,inplace=True)

In [None]:
# The offense code does not provide any additional information either, so it will be dropped.

df.drop('OFFENSE_CODE',axis=1,inplace=True)

In [None]:
df.Location.head()

# This column is formed by joining lat and long columns and converted to strings for usability. Dropping it!

In [None]:
df.drop('Location',axis=1,inplace=True)

In [None]:
df.isnull().sum()

In [None]:
df.info()

In [None]:
# A nice data dictionary would help us better understand our dataset
dictionary = pd.read_excel(io = 'https://data.boston.gov/dataset/6220d948-eae2-4e4b-8723-2dc8e67722a3/resource/9c30453a-fefa-4fe0-b51a-5fc09b0f4655/download/rmscrimeincidentfieldexplanation.xlsx')

In [None]:
dictionary

#### Let's prettify district values

In [None]:
dist_map = {'B2': 'Roxbury', 'C11':'Dorchester','D4':'South End','B3':'Mattapan','A1':'Downtown','C6':'South Boston',
           'D14':'Brighton','E18':'Hyde Park','E13':'Jamaica Plain','E5':'West Roxbury','A7':'East Boston',
           'A15':'Charlestown'}
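
Because `Series.map` with a dictionary turns every value that is not a key into NaN, it can be worth checking for unmapped district codes before mapping. A small sketch with a shortened, illustrative mapping; the helper `unmapped_codes` and the code `'ZZ'` are made up.

```python
import pandas as pd

def unmapped_codes(series, mapping):
    """Return the distinct non-null values of series that have no entry in mapping."""
    present = series.dropna().unique()
    return sorted(code for code in present if code not in mapping)

# Shortened, illustrative mapping; 'ZZ' is a made-up code that is not in it
toy_map = {'B2': 'Roxbury', 'A1': 'Downtown'}
toy_codes = pd.Series(['B2', 'A1', 'ZZ', None, 'B2'])
missing = unmapped_codes(toy_codes, toy_map)
print(missing)
```

On the real data this would be called as `unmapped_codes(df.DISTRICT, dist_map)`; an empty list means the mapping will not introduce new nulls.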

In [None]:
df.DISTRICT = df.DISTRICT.map(dist_map)

In [None]:
df.head()

#### What is a reporting area? I searched on Google but couldn't find much information on it.

In [None]:
df.REPORTING_AREA.value_counts().head()

# Also, some information is lost here, better to drop it 

In [None]:
df.drop('REPORTING_AREA',axis=1, inplace=True)

In [None]:
df.head()

#### Now is the time to recover district and street values from coordinates

In [None]:
def latlong(lat,long):
    return str(lat) + ', ' + str(long)

In [None]:
# Joining and converting coordinates in usable string format

# The lambda argument is named `row` so it does not shadow the dataframe `df`
df['LatLong'] = df[['Lat','Long']].apply(lambda row : latlong(round(row.Lat,6),round(row.Long,6)),axis=1)

In [None]:
df.LatLong[0]

In [None]:
# Let's check how the null values appear
df[df.LatLong.str.startswith('n')].head()
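
As the check above suggests, missing coordinates end up as the literal string `'nan, nan'`. An alternative sketch builds the strings in a vectorised way and keeps real NaN where a coordinate is missing; the helper name `make_latlong` and the toy values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def make_latlong(frame, lat_col='Lat', long_col='Long', ndigits=6):
    """Build 'lat, long' strings, leaving NaN where either coordinate is missing."""
    lat = frame[lat_col].round(ndigits)
    lon = frame[long_col].round(ndigits)
    combined = lat.astype(str) + ', ' + lon.astype(str)
    # Keep the string only where both coordinates are present
    return combined.where(lat.notnull() & lon.notnull())

# Toy data for illustration only
toy = pd.DataFrame({'Lat': [42.3321, np.nan], 'Long': [-71.0812, np.nan]})
latlong_strings = make_latlong(toy)
print(latlong_strings)
```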

In [None]:
df.drop(columns=['Lat','Long'],inplace=True)

#### Let's check the 'OCCURRED_ON_DATE' column to see if we can clean it up a bit

In [None]:
df.OCCURRED_ON_DATE.isna().sum()

In [None]:
df.OCCURRED_ON_DATE[0] 

# We have hour of the day, we have month of the year, we have day of the week but we do not have date of the month!
# Let's extract it out and drop this column

In [None]:
df['dAY_OF_MONTH'] = df.OCCURRED_ON_DATE.map(lambda x : x.split(' ')[0].split('-')[2])

In [None]:
df.insert(3,'DAY_OF_MONTH',df.dAY_OF_MONTH)
df.drop('dAY_OF_MONTH',axis=1,inplace=True)
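
An alternative to splitting the string by hand is to parse the column as a datetime and read the day from it, which yields integers rather than strings. This is a sketch on toy values, assuming the timestamps follow a `YYYY-MM-DD HH:MM:SS` layout.

```python
import pandas as pd

# Toy timestamps in the 'YYYY-MM-DD HH:MM:SS' layout assumed for OCCURRED_ON_DATE
toy = pd.DataFrame({'OCCURRED_ON_DATE': ['2018-09-02 13:00:00', '2018-12-31 23:59:00']})

# Parse once, then pull out whichever date part is needed
parsed = pd.to_datetime(toy['OCCURRED_ON_DATE'], format='%Y-%m-%d %H:%M:%S')
toy['DAY_OF_MONTH'] = parsed.dt.day
print(toy)
```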

In [None]:
df.head()

In [None]:
locator = Nominatim(user_agent='mygeocoder')

In [None]:
list(locator.reverse(df.LatLong[0]).raw['address'].values())

In [None]:
loc1 = locator.reverse(df.LatLong[0]).raw
loc1['address'].get('county')  # look the county up by key instead of by position

In [None]:
def county_mapper(coor):
    # Look the county up by key: the position of each field in the address varies between results
    return locator.reverse(coor).raw['address'].get('county')

In [None]:
for i in list(df.LatLong)[:5]:  # a handful of values is enough to check the format
    print(i)

In [None]:
county_mapper(coor = df.LatLong[0])

In [None]:
df['COUNTY'] = df['LatLong'].map(lambda x : county_mapper(str(x)))
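
Calling the geocoder once per row means roughly one request per incident, and the public Nominatim service allows at most about one request per second. A hedged sketch of one way around this: look up each distinct coordinate string only once, skip the `'nan, nan'` and `-1` placeholders, and pause between calls. The helper `county_lookup` and the fake geocoder are made up so the example runs without network access; it only assumes that the reverse function returns an object with a `.raw` dictionary (or `None`), as geopy does.

```python
import time

def county_lookup(coords, reverse_fn, delay_seconds=1.0, sleep=time.sleep):
    """Map each distinct 'lat, long' string to a county, calling reverse_fn once per value."""
    cache = {}
    for coord in coords:
        if coord in cache:
            continue
        if coord.startswith('nan') or coord.startswith('-1'):
            cache[coord] = None  # no usable coordinates, skip the lookup
            continue
        location = reverse_fn(coord)
        address = location.raw.get('address', {}) if location is not None else {}
        cache[coord] = address.get('county')
        sleep(delay_seconds)  # pause between real requests
    return cache

class FakeLocation:
    """Stand-in for a geopy Location so the sketch runs offline."""
    def __init__(self, county):
        self.raw = {'address': {'county': county}}

calls = []
def fake_reverse(coord):
    calls.append(coord)
    return FakeLocation('Suffolk County')

toy_coords = ['42.33, -71.08', '42.33, -71.08', 'nan, nan', '-1.0, -1.0']
counties = county_lookup(toy_coords, fake_reverse, delay_seconds=0, sleep=lambda s: None)
print(counties)
```

With the real geocoder this could be used as `df['COUNTY'] = df['LatLong'].map(county_lookup(df['LatLong'].unique(), locator.reverse))`. geopy also ships a `RateLimiter` helper in `geopy.extra.rate_limiter` that can take care of the pausing.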

In [None]:
# Fetching the county where each incident happened helps us better understand the
# jurisdiction it comes under

locator.reverse(df.LatLong[0]).raw['address'].get('county')

In [None]:
locator.reverse(df.LatLong[1]).raw

In [None]:
locator.reverse(df.LatLong[2]).raw['address']

In [None]:
locator.reverse(df.LatLong[3]).raw['address']

In [None]:
locator.reverse(df.LatLong[4]).raw

In [None]:
df.DISTRICT

In [None]:
locator.reverse(df.LatLong[3]).raw

### We see here that Motor Vehicle Accidents account for the highest share of incidents, followed by Medical Assistance and Larceny! The 'Other' category also shows a high count, so we need to check it out further.

In [None]:
top_groups = df.OFFENSE_CODE_GROUP.value_counts().head(50)  # value_counts already sorts in descending order
positions = np.arange(len(top_groups))
fig = plt.figure(figsize=(9,9),dpi=200,facecolor= '#ffe6b8')
ax = fig.add_axes([0,0,1,1])
ax.barh(positions, top_groups.values, align='center', color= '#ffc291',edgecolor= 'black',linewidth=1.2)
ax.set_yticks(positions)
ax.set_yticklabels(top_groups.index)
ax.invert_yaxis()
ax.set_title('Number of crimes in different offense groups',fontdict={'fontsize':20,'fontweight':14},loc='left',pad=20)
ax.set_xlabel('Total Incidents',fontdict={'fontsize':12,'fontweight':20},labelpad=20)
ax.set_ylabel('Category of Incidents',fontdict={'fontsize':12,'fontweight':20},labelpad=20)
plt.show()
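
The bar chart shows raw counts; to talk about shares, the counts can be normalised. A minimal sketch, where the helper `top_shares` is hypothetical and the toy counts are made up and do not reflect the real distribution.

```python
import pandas as pd

def top_shares(series, n=3):
    """Return the n most frequent values with their share of all non-null rows, in percent."""
    return (series.value_counts(normalize=True).head(n) * 100).round(2)

# Toy data for illustration only
toy_groups = pd.Series(['Larceny'] * 2
                       + ['Motor Vehicle Accident Response'] * 5
                       + ['Medical Assistance'] * 3)
shares = top_shares(toy_groups, n=2)
print(shares)
```

On the real data this would be called as `top_shares(df.OFFENSE_CODE_GROUP)`.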

In [None]:
df[df.OFFENSE_CODE_GROUP == 'Motor Vehicle Accident Response'].head()

In [None]:
counts_mv = df[df.OFFENSE_CODE_GROUP == 'Motor Vehicle Accident Response'].groupby(df.OFFENSE_DESCRIPTION).count()

In [None]:
counts_mv

In [None]:
# numeric_only=True keeps recent pandas versions from failing on the text columns
mean_mv = df[df.OFFENSE_CODE_GROUP == 'Motor Vehicle Accident Response'].groupby(df.OFFENSE_DESCRIPTION).mean(numeric_only=True)

In [None]:
mean_mv

In [None]:
plt.figure(figsize=(10,8))
plt.title('Inside "Other" Incidents')
df[df.OFFENSE_CODE_GROUP == 'Other'].OFFENSE_DESCRIPTION.value_counts().sort_values().plot(kind='barh')

In [None]:
plt.figure(figsize=(10,8))
plt.title('Inside "Motor Vehicle Accident Response" Incidents')
df[df.OFFENSE_CODE_GROUP == 'Motor Vehicle Accident Response'].OFFENSE_DESCRIPTION.value_counts().sort_values().plot(kind='barh')

In [None]:
plt.figure(figsize=(10,8))
plt.title('Inside "Medical Assistance" Incidents')
df[df.OFFENSE_CODE_GROUP == 'Medical Assistance'].OFFENSE_DESCRIPTION.value_counts().sort_values().plot(kind='barh')

In [None]:
plt.figure(figsize=(10,8))
plt.title('Inside "Larceny" Incidents')
df[df.OFFENSE_CODE_GROUP == 'Larceny'].OFFENSE_DESCRIPTION.value_counts().sort_values().plot(kind='barh')

### We see that there are various other crimes inside the 'Other' tag, and threat to bodily harm is the one registered most often.
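
The four nearly identical breakdown cells above could be folded into one small helper. This is a sketch only: the function name `description_counts` is hypothetical and the description strings in the toy frame are illustrative, not copied from the dataset.

```python
import pandas as pd

def description_counts(frame, group):
    """Count OFFENSE_DESCRIPTION values inside one OFFENSE_CODE_GROUP, ascending for barh plots."""
    subset = frame[frame['OFFENSE_CODE_GROUP'] == group]
    return subset['OFFENSE_DESCRIPTION'].value_counts().sort_values()

# Toy data for illustration only
toy = pd.DataFrame({
    'OFFENSE_CODE_GROUP': ['Other', 'Other', 'Other', 'Larceny'],
    'OFFENSE_DESCRIPTION': ['THREATS TO DO BODILY HARM', 'THREATS TO DO BODILY HARM',
                            'TRESPASSING', 'LARCENY SHOPLIFTING'],
})
other_counts = description_counts(toy, 'Other')
print(other_counts)
```

Each plot then becomes `description_counts(df, 'Other').plot(kind='barh')`, with only the group name changing.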