# Table of Contents

1. [Intro](#section-one)
 
   1.1.  [Imports](#section-1two)
   
   1.2.  [Renaming and Changing Dtype](#section-1three)

   1.3.  [Missing Value Analyze](#section-1four)
   

2.  [Exploratory Data Analysis 101](#section-two)

      2.1.[Create new vars (Season, Day& Night)](#section-2one)

      2.2. [Q1: How has crime changed over the years?](#section-2two)
       * [Q1: Analyze](#section-2two1)
       * [Q1: Answer](#section-2two2)
    
     2.3.[Q2: Is it possible to predict where or when a crime will be committed?](#section-2three)
       * [Q2: When?](#section-2threewhen)
       * [Q2: Answer](#section-2threewhenanswer)
       * [Q2: Where?](#section-2threewhere)
       * [Q2: Answer](#section-2threewhereanswer)
    
     2.4. [Q3: What can you say about the distribution of different offenses over the city?](#section-2four)
       * [Let's handle crime types](#section-2four1)
       * [Offense Code Group by districts](#section-2four2)
       * [Let's look at how the 3 most committed crimes spread to the city](#section-2four3)
       * [Let's handle shooting column](#section-2four4)
       * [General distribution of crimes with heat map](#section-2four5)
    
     2.5. [Feature Encoding](#section-2five)

     2.6. [Conclusions for Q1,Q2& Q3](#section-2six)
     

3. [Models](#section-3)

     3.1. [A. Predict crime numbers](#section-3A)
      * [A.1. Prepare for model](#section-3A1)
      * [A.2.Model 1: Predict crime numbers for each day (Base-model)](#section-3A2)
      * [A.3.Model 2: Predict crime numbers for each day for Ucr Part 3](#section-3A3)
      
   3.2. [B. Predict district](#section-3B)
      * [B.1. Prepare for model](#section-3B1)
     
         
 


![Boston1](https://images.unsplash.com/photo-1506551907304-60bb62ffc9b0?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=1350&q=80)

<a id="section-one"></a>

# Intro

This study examines how crime has changed over the years, whether it is possible to predict where and when a crime will be committed, and how crimes are distributed across the city. It uses the Crimes in Boston dataset, which records details of each crime such as date, location, crime group, and crime code.

**Features:**

**INCIDENT_NUMBER:** The ID of the crime committed. It is a unique value for each crime.

**OFFENSE_CODE:** The code of the crime type.

**OFFENSE_CODE_GROUP:** General crime types.

**OFFENSE_DESCRIPTION:** Detailed explanation of the crime.

**DISTRICT:** Name of the district where the crime occurred.

**REPORTING_AREA:** Number of the area where the crime was reported.

**SHOOTING:** Marked with 'Y' if the crime involved a shooting.

**OCCURRED_ON_DATE:** The date and time at which the crime occurred.

**YEAR:** The year in which the crime occurred (2015, 2016, 2017, 2018).

**MONTH:** The month in which the crime occurred.

**DAY_OF_WEEK:** The day of the week on which the crime occurred.

**HOUR:** The hour at which the crime occurred.

**UCR_PART:** Uniform Crime Reporting offense types. Part 1 contains the most dangerous and important crimes.

**STREET:** The street where the crime occurred.

**LAT:** The latitude where the crime occurred.

**LONG:** The longitude where the crime occurred.

**LOCATION:** The location where the crime occurred (includes latitude and longitude).

<a id="section-1two"></a>

# Imports

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import folium
from folium.plugins import HeatMap
import pandas_profiling
import plotly.express as px

from sklearn import preprocessing
import datetime as dt    # for linear reg.

from sklearn.metrics import r2_score
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, confusion_matrix, classification_report
from sklearn.preprocessing import StandardScaler
from xgboost import XGBClassifier

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn import metrics
#from sklearn.metrics import r2_score,mean_squared_error
from sklearn.ensemble import RandomForestRegressor


In [None]:
data = pd.read_csv("../input/crimes-in-boston/crime.csv", encoding = "latin1")

In [None]:
# data.profile_report()  
#It takes a lot of time, don't run it each time

In [None]:
data.info()

In [None]:
print(data.isnull().sum(), end = '\n\n')
print(data[(data['Lat'].isnull()) | (data['Long'].isnull())]['Location'].unique())

The Location column is built from the Lat and Long columns, yet, like the others, it has no null values. Printing the unique Location values for rows where Lat or Long is null shows that those rows were given a value of 0, which is effectively null for Location. So we can treat these values as null.

In [None]:
data.head()

In [None]:
data.info()

In [None]:
data.describe().T

<a id="section-1three"></a>

# Renaming and Changing Dtype

First, we rename the columns for clarity.


In [None]:
# rename columns
data.rename(columns = {"INCIDENT_NUMBER": "Incident_Number", 
                     "OFFENSE_CODE":"Offense_Code","OFFENSE_CODE_GROUP":"Offense_Code_Group","OFFENSE_DESCRIPTION":"Offense_Description",
                     "DISTRICT": "District","REPORTING_AREA": "Reporting_Area","SHOOTING": "Shooting",
                     "OCCURRED_ON_DATE": "Occurred_On_Date","YEAR": "Year","MONTH": "Month",
                     "DAY_OF_WEEK": "Day_Of_Week","HOUR": "Hour","UCR_PART": "Ucr_Part",
                     "STREET": "Street"
                     }, 
                                 inplace = True) 

We convert the Occurred_On_Date feature from datetime to date, because we will work with dates in this study.

In [None]:
data["Occurred_On_Date"] = data["Occurred_On_Date"].apply(pd.to_datetime, errors='coerce')
data["Occurred_On_Date"] = data["Occurred_On_Date"].dt.date

In [None]:
data["Occurred_On_Date"] = data["Occurred_On_Date"].apply(pd.to_datetime, errors='coerce')
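
The two conversion cells above parse the column element by element. As a minimal sketch, assuming a toy Series in place of the real column, the same result can be reached with one vectorized call; `dt.normalize()` keeps the datetime dtype while setting the time of day to midnight.

```python
import pandas as pd

# Toy stand-in for data["Occurred_On_Date"]; the values are made up.
raw = pd.Series(["2018-09-02 13:00:00", "2016-01-15 23:59:00", "not a date"])

# Parse once (unparseable strings become NaT), then drop the time of day.
dates = pd.to_datetime(raw, errors="coerce").dt.normalize()
print(dates)
```

On a large column this is usually much faster than `apply`, though the speed-up on this dataset has not been measured here.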

<a id="section-1four"></a>

# Missing Value Analyze

* Let's see how much of the data is null. 99% of the Shooting column consists of null values. Assuming these mean "no", we can replace them with N. (Y = there is shooting, N = no shooting)
* Also, about 6% of the Lat and Long values are missing (a fraction of roughly 0.06 in the Percent column). Since this is a small share of our data, we can remove these rows.
* Let's remove the remaining missing values from our data.

In [None]:
sns.heatmap(data.isnull(), cbar=False)

In [None]:
#missing data
total = data.isnull().sum().sort_values(ascending=False)
percent = (data.isnull().sum()/data.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_data.head(8)

We filled the null values in the Shooting column with N, on the assumption that a null means no shooting.

In [None]:
data["Shooting"] = data["Shooting"].fillna('N')

We removed the remaining null/NaN values, because they were a small part of the crime dataset.

In [None]:
# Dropped from 319073 entries to 296573.
data=data.dropna()

<a id="section-two"></a>
# Exploratory Data Analysis 101

In [None]:
data.columns

Let's group the data to see how many unique Offense_Code values there are in each Offense_Code_Group. For example, the Drug Violation Offense_Code_Group has 26 Offense_Code values under it.

In [None]:
off_code = data.groupby('Offense_Code_Group')['Offense_Code'].nunique().sort_values(ascending = False)
off_code.to_frame().reset_index()

Ucr_Part is the most general category that defines crimes. UCR stands for Uniform Crime Reports, which are published by the FBI. UCR Part 1 represents the most important/dangerous crimes.

In [None]:
print(data["Ucr_Part"].unique())

<a id="section-2one"></a>
# Create new vars (Season, Day& Night)

**Seasons**

Our data already has year and month columns; we add a Season column to look for seasonality.

In [None]:
def getSeason(month):
    if (month == 12 or month == 1 or month == 2):
       return "Winter"
    elif(month == 3 or month == 4 or month == 5):
       return "Spring"
    elif(month ==6 or month==7 or month == 8):
       return "Summer"
    else:
       return "Fall"

In [None]:
data['Season'] = data.Month.apply(getSeason)
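
The same rule can also be written as a lookup table. This is a minimal sketch on toy month numbers; `SEASON_BY_MONTH` is a hypothetical name, and the mapping simply restates the branches of `getSeason`.

```python
import pandas as pd

# Month number -> season, following the same rule as getSeason above.
SEASON_BY_MONTH = {
    12: "Winter", 1: "Winter", 2: "Winter",
    3: "Spring", 4: "Spring", 5: "Spring",
    6: "Summer", 7: "Summer", 8: "Summer",
    9: "Fall", 10: "Fall", 11: "Fall",
}

months = pd.Series([1, 4, 7, 10, 12])  # toy data, not the real Month column
seasons = months.map(SEASON_BY_MONTH)
print(seasons.tolist())
```

A dictionary keeps the month-to-season rule in one place, which makes it easy to check that every month is covered.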

**Day & Night**

In [None]:
data['Day'] = 0
data['Night'] = 0
# Day = 1 for hours between 06:00 and 18:00 (inclusive)
data.loc[(data['Hour'] >= 6) & (data['Hour'] <= 18), 'Day'] = 1

In [None]:
data.loc[data['Day'] == 0, 'Night'] = 1
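
As a minimal sketch of the same day/night rule on a toy frame (the hours are made up), a single boolean mask can produce both flags, which guarantees that every row is exactly one of the two.

```python
import pandas as pd

hours = pd.DataFrame({"Hour": [0, 5, 6, 12, 18, 19, 23]})  # toy data

# between() is inclusive on both ends by default: 6 <= Hour <= 18
is_day = hours["Hour"].between(6, 18)
hours["Day"] = is_day.astype(int)
hours["Night"] = (~is_day).astype(int)
print(hours)
```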

In [None]:
plt.figure(figsize=(16,8))
data['Night'].value_counts().plot.bar()
plt.show()

<a id="section-2two"></a>
# Q1: How has crime changed over the years?

<a id="section-2two1"></a>
## Q1: Analyze

* **Crime counts per year**

In the chart below, we see the sums of the number of crimes committed each year. Could crime have increased this much in 2016 and 2017? It looks a little odd.

In [None]:
year_count = []

for i in data.Year.unique():
    year_count.append(len(data[data['Year']==i]))

plt.figure(figsize=(12,5))
sns.pointplot(x=data.Year.unique(),y=year_count,color='blue',alpha=0.8)
plt.xlabel('Year',fontsize = 15)
plt.xticks(rotation=45)
plt.ylabel('Crime Count',fontsize = 15)
plt.title('Crime Counts Per Year',fontsize = 15)
plt.grid()
plt.show()

When we look at the distribution of the crime over the years, we see that there were more crimes in 2016 and 2017 than in other years. Let's look in more detail to understand the reason for this. We may have missing data.

In [None]:
sns.countplot(data=data, x="Year",palette='YlGnBu')
plt.title('Number Of Crimes Each Year')

* **Number Of Crimes Each Season**

According to the graph below, crime increases during the summer season. Is there seasonality in crimes?
The answer is in the Crimes by Month of Year graph. The summer months have data for every year, so it is normal to see more crime in these months. This does not indicate seasonality.


In [None]:
season_counts = data.groupby('Season').count()['Incident_Number'].to_frame().reset_index()
ax = sns.barplot(x = 'Season' , y="Incident_Number", data = season_counts, palette='YlGnBu')
plt.title('Number Of Crimes Each Season')

* **Number Of Crimes Each Month**

As we could guess from the seasons, there appears to be more crime in June, July, and August.

In [None]:
month_counts = data.groupby('Month').count()['Incident_Number'].to_frame().reset_index()
ax = sns.barplot(x = 'Month' , y="Incident_Number", data = month_counts, palette='YlGnBu')
plt.title('Number Of Crimes Each Month')

* **Number Of Crimes Each Day_of_Week**

More crimes were committed on Fridays, but there is not much difference between the numbers.

In [None]:
day_counts = data.groupby('Day_Of_Week').count()['Incident_Number'].to_frame().reset_index()
ax = sns.barplot(x = 'Day_Of_Week' , y="Incident_Number", data = day_counts, palette='YlGnBu')
plt.title('Number Of Crimes Each Day_of_Week')
print(day_counts)

* **Crimes by Month of Year**

In [None]:
plt.figure(figsize=(15,7))
data.groupby(['Year','Month']).count()['Incident_Number'].plot.bar()
plt.title('Crime counts per year and month')

According to the chart below, we do not have data for the first 5 months of 2015 and the last 3 months of 2018. This is why the yearly totals show fewer crimes in 2015 and 2018.

In [None]:
fig, ax = plt.subplots(figsize=(17,8))
# with sns.color_palette("RdGy", 10):
montyearAggregated = pd.DataFrame(data.groupby(["Month","Year"])["Incident_Number"].count()).reset_index()
a=sns.barplot(data=montyearAggregated,x="Month", y="Incident_Number",hue = 'Year', palette='YlGnBu')
a.set_title("Crimes by Month of Year",fontsize=12)
plt.legend(loc='upper right')
plt.show()

In [None]:
hour_nums = data.groupby(['Hour']).count()['Incident_Number'].to_frame().reset_index()
ax = sns.barplot(x = 'Hour' , y="Incident_Number", data = hour_nums,  palette='YlGnBu')

<a id="section-2two2"></a>

## Q1: Answer

Let's look at the years by keeping only the months for which data were entered in all years. In the "Crime Amount by Year" graph (in question 2), we can see that the averages of the crime amounts are closer. We must pay attention to the missing months to avoid making wrong assumptions.

**Incident Number**

In [None]:
df_year_new=data.groupby(["Year","Month"])["Incident_Number"].count().reset_index()
# Month is stored as an integer, so filter with integers rather than strings
df_year_filter=df_year_new[~df_year_new['Month'].isin([1,2,3,4,5,11,12])]
fig = plt.figure(figsize=(8,8))
with sns.color_palette("BrBG",4):
    ctplt2=sns.catplot(x="Year", y="Incident_Number",kind="box", data=df_year_new,height=5, aspect=2)
    plt.ylabel('Count')
    plt.show();

In [None]:
plt.title('Incident Number vs Year')
year_counts = data.groupby('Year').count()['Incident_Number'].to_frame().reset_index()
print(data.groupby('Year').count()['Incident_Number'])
ax = sns.barplot(x = 'Year' , y="Incident_Number", data = year_counts, palette='YlGnBu')

In [None]:
print('Count of Months Per Year:\n',data.groupby('Year')['Month'].nunique())
av_month = (data.groupby('Year').count()['Incident_Number'] / data.groupby('Year')['Month'].nunique()).to_frame().reset_index()

print('\nAverage monthly incident per year:\n',av_month)
av_month.rename(columns = {0:'Incident_Number'}, inplace = True)


We see that the average daily case counts are quite close to each other. The daily average rises by a couple of incidents from 2015 to 2016 and again from 2016 to 2017, while it falls by around 7 incidents from 2017 to 2018.

Let's consider our data on a daily basis to make the most accurate assumptions.

In [None]:
print("Min date:  ", data['Occurred_On_Date'].min())
print("Max date:  ", data['Occurred_On_Date'].max())
print("--------------------------------------------")
data['Occurred_On_Date'] = pd.to_datetime(data['Occurred_On_Date'])
yearly_counts = data.groupby('Year').count()['Incident_Number'].to_numpy()
days = [(data[data['Year'] == year].Occurred_On_Date.max() - data[data['Year'] == year].Occurred_On_Date.min()).days for year in data.Year.sort_values().unique()]
average_daily_incidents = yearly_counts/days
print("Average Daily Crimes:  ")
print([str(year)+": "+str(avg)[:4] for year, avg in enumerate(average_daily_incidents, 2015)])
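
The daily average can also be computed with a single groupby. This is a minimal sketch on made-up dates; note that it divides by the number of distinct days that have at least one record, which is an assumption and differs slightly from the max-minus-min span used above.

```python
import pandas as pd

# Toy incidents: 3 records over two days in 2015, 4 over two days in 2016.
toy = pd.DataFrame({"Occurred_On_Date": pd.to_datetime([
    "2015-06-15", "2015-06-15", "2015-06-16",
    "2016-01-01", "2016-01-01", "2016-01-02", "2016-01-02",
])})
toy["Year"] = toy["Occurred_On_Date"].dt.year

# Named aggregation: incident count and number of distinct days per year.
per_year = toy.groupby("Year")["Occurred_On_Date"].agg(
    incidents="count", days="nunique")
per_year["avg_daily"] = per_year["incidents"] / per_year["days"]
print(per_year)
```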

In [None]:
d_avg = pd.DataFrame(data = average_daily_incidents, index = av_month.index)
d_avg.rename(columns = {0:'counts'}, inplace = True)
plt.title('Avg. Daily Incidents')

ax = sns.barplot(x = d_avg.index , y="counts", data = d_avg,palette='YlGnBu' )
ax.set_xticklabels(data.Year.sort_values().unique())

**Weekend vs. Weekday**

As seen in the graphs below, there is no significant difference in the number of cases on weekdays and weekends.

In [None]:
weekend = data[(data['Day_Of_Week'] == 'Saturday') | (data['Day_Of_Week'] == 'Sunday')]
weekday = data[(data['Day_Of_Week']!= 'Saturday') & (data['Day_Of_Week'] != 'Sunday')]

weekday_year = weekday.groupby('Year').count()['Incident_Number'].to_frame()
weekend_year = weekend.groupby('Year').count()['Incident_Number'].to_frame()
plt.title('Incidents for weekend')

ax = sns.barplot(x = weekend_year.index , y="Incident_Number", data = weekend_year,palette='YlGnBu')

In [None]:
plt.title('Incidents for weekday')
ax = sns.barplot(x = weekday_year.index , y="Incident_Number", data = weekday_year,palette='YlGnBu')

In [None]:
plt.title('Filtered by non-null months')
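# Month is an integer column, so filter with integers rather than strings
monthCrimes = df_year_new[~df_year_new['Month'].isin([1,2,3,4,5,11,12])]
# sum the monthly counts to get crimes per year (count() would only count the month rows)
yearly_counts = monthCrimes.groupby('Year')['Incident_Number'].sum().to_numpy()
ax = sns.barplot(x = [2015, 2016, 2017, 2018] , y=yearly_counts,palette='YlGnBu')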
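
Rather than listing the missing months by hand, the months that appear in every year can be derived from the data. A minimal sketch, assuming a toy monthly-count frame shaped like `df_year_new` (the numbers are invented):

```python
import pandas as pd

toy = pd.DataFrame({
    "Year":  [2015, 2015, 2016, 2016, 2016, 2017, 2017],
    "Month": [6,    7,    1,    6,    7,    6,    7],
    "Incident_Number": [10, 12, 9, 11, 13, 12, 14],
})

# A month is "common" if it has data in every year of the frame.
n_years = toy["Year"].nunique()
years_per_month = toy.groupby("Month")["Year"].nunique()
common_months = years_per_month[years_per_month == n_years].index

# Compare years on the common months only.
filtered = toy[toy["Month"].isin(common_months)]
yearly_totals = filtered.groupby("Year")["Incident_Number"].sum()
print(list(common_months), yearly_totals.to_dict())
```

This avoids hard-coding a month list that may not match the actual coverage of the data.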

In summary, some months are missing from our data. However, when we look more closely on a daily basis, we see no significant difference between years. We have to account for missing data in our analysis.

<a id="section-2three"></a>

# Q2: Is it possible to predict where or when a crime will be committed?

<a id="section-2threewhen"></a>
## **Q2: When?**

In [None]:
sns.catplot(x='Hour',
           kind='count',
            height=4, 
            aspect=3,
            palette='BrBG',
            #color='BrBG',
           data=data)
plt.title('Number Of Crimes Each Hour')
plt.xticks(size=10)
plt.yticks(size=10)
plt.xlabel('Hour', fontsize=15)
plt.ylabel('Count', fontsize=15)

In the chart above, we see that most crimes occur around 17:00, 18:00, and 12:00. This may be because these hours coincide with the times when students and employees leave work or take breaks. The number of crimes is likely to rise as more people are on the street.

* **Incident counts per week day per year**

In [None]:
plt.figure(figsize=(20,8))
data.groupby(['Year','Day_Of_Week']).count()['Incident_Number'].plot.bar(color = ['sienna', 'darkolivegreen', 'chocolate', 'seagreen'])

In [None]:
plt.title("Crime Amount By Year(all data)")
years = data.groupby('Year').count()['Incident_Number'].to_frame().reset_index()
ax = sns.barplot(x = 'Year' , y="Incident_Number", data = years, palette='YlGnBu')

In [None]:
fig,axes= plt.subplots(2,2)
fig.set_size_inches(16,12)
with sns.color_palette('BrBG',4):
  a=sns.countplot(x="Day_Of_Week",order=['Sunday','Monday','Tuesday','Wednesday','Thursday','Friday','Saturday'],data=data,ax=axes[0, 0],palette='YlGnBu')
  a.set(xlabel='Dayofweek', ylabel='Total Crime')
  a.set_title("Crime Amount By Weekday",fontsize=10)

  b=sns.countplot(x="Month",data=data,ax=axes[0, 1],palette='YlGnBu')
  b.set(xlabel='Month', ylabel='Total Crime')
  b.set_title("Crime Amount By Month",fontsize=10)

  c=sns.countplot(x="Season",data=data,ax=axes[1, 0])
  c.set(xlabel='Season', ylabel='Total Crime')
  c.set_title("Crime Amount By Season",fontsize=10)


   
  #df_year_new=data.groupby(["Year","Month"])["Incident_Number"].count().reset_index()
  #df_year_filter=df_year_new[~df_year_new['Month'].isin(['1','2','3','4','5','11','12'])]
  d=sns.countplot(x="Year",data=df_year_filter,ax=axes[1, 1])
  d.set(xlabel='Year', ylabel='Total Crime')
  d.set_title("Crime Amount By Year(filtered by missing months)",fontsize=10);


* Most crimes seem to occur on Friday. The months with the most crimes appear to be July and August (Crime Amount By Month chart), which coincides with the summer season in the Crime Amount By Season chart.

* When we plot again using only the months for which data were entered, the Crime Amount By Year chart shows that the yearly counts are close, with 2018 lower than the others.

Crime types can behave differently over the years. For example, Medical Assistance incidents rose over the years, and the Investigate Person crime type shows a similar trend.


In [None]:
ten_freq_crimes = data["Offense_Code_Group"].value_counts()[:12]
df_top_crimes = data[data["Offense_Code_Group"].isin(ten_freq_crimes.index)]
df_tp = df_top_crimes.pivot_table(index=df_top_crimes["Occurred_On_Date"],
                                                      columns=["Offense_Code_Group"],aggfunc="size", fill_value=0).resample("M").sum()

#palette = plt.get_cmap('Set2')
fig, axes = plt.subplots(3, 4, figsize=(15,7))
for ax, column in zip(axes.flat, df_tp):
    # draw every top crime type in grey as background context
    for v in df_tp:
        ax.plot(df_tp.index, df_tp[v], marker='', color='grey', linewidth=0.9, alpha=0.3)
    # highlight the crime type this panel is about
    ax.plot(df_tp.index, df_tp[column], color="green", linewidth=2.4, alpha=0.75, label=column)
    ax.tick_params(labelbottom=False)
    ax.set_title(column, loc='left', fontsize=12, fontweight=0, color="black", alpha=0.75)
fig.text(x=0.05,y=0.95,s="Timeline of the most frequent crimes(2015-2018).",alpha=0.75, fontsize=22)

<a id="section-2threewhenanswer"></a>
## Q2: Answer 

We can predict possible crimes based on historical data. We can use different time periods, such as day and night, to strengthen our forecast. I think we can try answering questions like the ones below. Methods such as classification, ARIMA, and regression can be used.
* What is the time of the crime?
* How many crimes happen in an hour?
* How many crimes will happen next week?
* How many crimes occur in a particular area?
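
As a starting point for a question such as "how many crimes will happen next week?", the records first need to be turned into a weekly count series. A minimal sketch on made-up dates; the mean of past weeks is only a naive baseline that a real model would have to beat, not a method used in this study.

```python
import pandas as pd

# Toy incident dates (2018-01-01 is a Monday).
toy = pd.DataFrame({"Occurred_On_Date": pd.to_datetime(
    ["2018-01-01", "2018-01-02", "2018-01-02", "2018-01-09", "2018-01-10"])})

daily = toy.groupby("Occurred_On_Date").size()  # incidents per day
weekly = daily.resample("W").sum()              # weeks ending on Sunday
naive_forecast = weekly.mean()                  # baseline guess for next week
print(weekly)
```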

<a id="section-2threewhere"></a>
## **Q2: Where?**

* **Number Of Crimes By District**

In [None]:
plt.subplots(figsize=(15,6))
sns.countplot(x='District',palette='BrBG',data=data,edgecolor=sns.color_palette('YlGnBu',20),order=data['District'].value_counts().index)
plt.xticks(rotation=90)
plt.title('Number Of Crimes By District')
plt.show()

From the graph above, we can say that the most crimes occur in the B2 district.

* **Crime Numbers of Districts by Years**

More crime appears in all districts in 2016 and 2017. This is no surprise, since 2015 and 2018 have missing months.

In [None]:
plt.figure(figsize=(20,8))
data.groupby(['District','Year']).count()['Incident_Number'].plot.bar(color = ['sienna', 'darkolivegreen', 'chocolate', 'seagreen'])

* **Street**

In [None]:
fig = plt.figure(figsize=(12,5))
crime_street = data.groupby('Street')['Incident_Number'].count().nlargest(10)
crime_street.plot(kind='bar', color ="saddlebrown")
plt.xlabel("Street")
plt.ylabel("Offense Amount")
plt.show()

In the plot above, we can see the streets with the most crime. According to the chart, Washington St has the most crime, followed by Blue Hill Ave and Boylston St.

Here we see how many streets a district has.

In [None]:
(data.groupby('District')['Street'].nunique().sort_values(ascending = False).to_frame()).head()

In this graph, we can see the distribution of crimes by district. We can infer that theft was most common in the D4 district, and that crimes in the motor vehicle accident group were most common in the B2 district. By looking at which crime was most often committed at which time, on which street, and in which district, we can speculate on where and when future crimes might be committed.

* **Location**

In [None]:
((data.groupby(["Lat","Long"]).count()[['Incident_Number']]).reset_index()).head()

<a id="section-2threewhereanswer"></a>
## Q2: Answer

We can predict districts and streets, and classification will be useful here. However, the large number of streets could be a problem.
* We can say that most crimes occur in the B2 district.
* We can see the streets with the most crime.
* We can see how the offense code groups are distributed across districts.
* We can find crime centers with a clustering method. These could serve as locations for police stations.
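
The clustering idea in the last bullet could be sketched as follows. This is an illustration only: the coordinates are synthetic points around two made-up hot spots, k-means is one possible clustering method among many, and the number of clusters is an assumption. Treating latitude and longitude as plain Euclidean coordinates is a rough approximation that is reasonable only over a small area such as a single city.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Synthetic (lat, long) points around two invented hot spots, not real data.
spot_a = rng.normal(loc=[42.33, -71.08], scale=0.002, size=(50, 2))
spot_b = rng.normal(loc=[42.36, -71.06], scale=0.002, size=(50, 2))
coords = np.vstack([spot_a, spot_b])

# Fit k-means with an assumed number of centers.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(coords)

# Sort the centers by latitude so the output order is stable.
centers = km.cluster_centers_[np.argsort(km.cluster_centers_[:, 0])]
print(centers)
```

With the real data, `coords` would be the Lat and Long columns after removing missing values, and the number of clusters would need to be chosen deliberately.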

<a id="section-2four"></a>

# Q3: What can you say about the distribution of different offenses over the city?

In this section, we will examine how crimes are distributed across the city. Since there are many streets and reporting areas, we will look at the districts. We have 12 different districts.

In [None]:
plt.figure(figsize=(7,7))
sp = data[(data['Lat'] != -1) & (data['Long'] != -1)]
sns.scatterplot(x="Lat", y="Long",hue='District',data=sp)

<a id="section-2four1"></a>
* **Let's handle crime types**

Q: *What are the most common crimes?*


According to the graph below, the most common crimes are motor vehicle accidents and larceny.


In [None]:
plt.subplots(figsize=(15,6))
sns.countplot(x='Offense_Code_Group',palette='BrBG',data=data,edgecolor=sns.color_palette('YlGnBu',20),order=data['Offense_Code_Group'].value_counts().index)
plt.xticks(rotation=90)
plt.title('Types of serious crimes')
plt.show()

<a id="section-2four2"></a>
* **Offense Code Group by districts**

Let's look at the relationship between Offense_Code_Group and District.
According to the graph below, most larceny crimes occurred in the D4 district.

In [None]:
fig = plt.figure(figsize=(20,10))
order2 = data['Offense_Code_Group'].value_counts().head(6).index
sns.countplot(data = data, x='Offense_Code_Group',hue='District', order = order2,palette='BrBG' );
plt.ylabel("Offense Amount");

<a id="section-2four3"></a>

* **Let's look at how the 3 most committed crimes spread to the city.**

In [None]:
# import plotly.express as px
ds = data.dropna(subset = ['Lat','Long','District'])
ds = ds[ds['Offense_Code_Group'] == 'Motor Vehicle Accident Response']
location = pd.DataFrame(data =(ds.groupby(["Lat","Long"]).count()[['Incident_Number']]).reset_index().values, columns=["Lat","Long","Incident_Number"])
x,y = location['Long'], location['Lat']
fig = px.density_mapbox(location,lat="Lat",lon="Long",z="Incident_Number",radius=10,center=dict(lat=42.32475, lon=-71.076),zoom=10,mapbox_style="stamen-terrain",height=500,width=1450)
fig.show()

In [None]:
ds = data.dropna(subset = ['Lat','Long','District'])
ds = ds[ds['Offense_Code_Group'] == 'Larceny']
location = pd.DataFrame(data =(ds.groupby(["Lat","Long"]).count()[['Incident_Number']]).reset_index().values, columns=["Lat","Long","Incident_Number"])
x,y = location['Long'], location['Lat']
fig = px.density_mapbox(location,lat="Lat",lon="Long",z="Incident_Number",radius=10,center=dict(lat=42.32475, lon=-71.076),zoom=10,mapbox_style="stamen-terrain",height=500,width=1450)
fig.show()

In [None]:
ds = data.dropna(subset = ['Lat','Long','District'])
ds = ds[ds['Offense_Code_Group'] == 'Medical Assistance']
location = pd.DataFrame(data =(ds.groupby(["Lat","Long"]).count()[['Incident_Number']]).reset_index().values, columns=["Lat","Long","Incident_Number"])
x,y = location['Long'], location['Lat']
fig = px.density_mapbox(location,lat="Lat",lon="Long",z="Incident_Number",radius=10,center=dict(lat=42.32475, lon=-71.076),zoom=10,mapbox_style="stamen-terrain",height=500,width=1450)
fig.show()
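
The three cells above differ only in the offense group, so the shared preparation step could be factored into a helper. A minimal sketch on a toy frame; `location_counts` is a hypothetical name, not a function from the original notebook, and the rows are invented.

```python
import pandas as pd

def location_counts(df, offense_group):
    """Count incidents per (Lat, Long) pair for one offense group."""
    subset = df.dropna(subset=["Lat", "Long", "District"])
    subset = subset[subset["Offense_Code_Group"] == offense_group]
    return (subset.groupby(["Lat", "Long"])["Incident_Number"]
                  .count()
                  .reset_index())

# Toy rows standing in for the real data.
toy = pd.DataFrame({
    "Incident_Number": ["I1", "I2", "I3", "I4"],
    "Offense_Code_Group": ["Larceny", "Larceny", "Larceny", "Medical Assistance"],
    "District": ["D4", "D4", "B2", "B2"],
    "Lat": [42.35, 42.35, 42.33, 42.33],
    "Long": [-71.07, -71.07, -71.08, -71.08],
})
larceny = location_counts(toy, "Larceny")
print(larceny)
```

The returned frame has the `Lat`, `Long`, and `Incident_Number` columns that the `px.density_mapbox` calls above expect.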

<a id="section-2four4"></a>

* **Let's handle shooting column**


Q: *What is the relationship between districts and shooting?*



The map shows the relationship between districts and shooting: most shootings occur in the B2 district. The number of shootings is high near Dorchester Ave and Roxbury. Other cases can be seen on the map.

In [None]:
shtng = data[(data.Shooting == 'Y') & (data.District.notnull())]

import folium
import folium.plugins as plugins

latitude = list(shtng.Lat)[1] # This is to initiate the latitude start point for the map
longitude = list(shtng.Long)[1] # This is to initiate the longitude start point for the map

latitudes = list(shtng.Lat) #create the list of all reported latitudes
longitudes = list(shtng.Long) #create the list of all reported longitudes

shooting_map = folium.Map(location = [latitude, longitude], zoom_start = 12) # instantiate a folium.map object

shooting = plugins.MarkerCluster().add_to(shooting_map) # instantiate a mark cluster object for the incidents in the dataframe

# loop through the dataframe and add each data point to the mark cluster
for lat, lng, label, in zip(shtng.Lat, shtng.Long, shtng.District):
    if (not np.isnan(lat)) & (not np.isnan(lng)): # also, we check a non-nullness of the coordinates
        folium.Marker(
            location=[lat, lng],
#             icon=None,
            popup=label,
            icon=folium.Icon(icon='exclamation-sign')
        ).add_to(shooting)

# display the map
shooting_map

A heat map is useful for looking at crimes committed in a specific area, and we can build one from the latitude and longitude values. By zooming in on the map, we can see which streets and intersections have the most crime. By looking at the heat map, we can speculate where crimes are concentrated in the city.

<a id="section-2four5"></a>

* **General distribution of crimes with heat map**

In [None]:
# Folium crime map
crime_map = folium.Map(location=[42.3125,-71.0875], 
                      zoom_start = 13)

# to filter by year:
# data_heatmap = data[data.Year ==  2016]   

# keep only rows with both coordinates, as [lat, long] pairs
data_heatmap = data[['Lat','Long']].dropna(axis=0, subset=['Lat','Long'])
data_heatmap = data_heatmap.values.tolist()
HeatMap(data_heatmap, radius=10).add_to(crime_map)

crime_map

<a id="section-2five"></a>
# Feature Encoding

**Shooting** 

We encoded the shooting column to use it in models. We changed it to 0 if there is no shooting, and 1 if there is.

In [None]:
# if Shooting is No =  0, Yes = 1 .
def isShooting(dataFrame):
    dataFrame["Shooting"] = dataFrame["Shooting"].apply(lambda x: 1 if x == "Y" else 0)
    return dataFrame
data = isShooting(data)

**Day Of Week**

We encode days of week with mapping.

In [None]:
data["DayOfWeek"] = data["Day_Of_Week"].map({
    "Monday":1,
    "Tuesday":2,
    "Wednesday":3,
    "Thursday":4,
    "Friday":5,
    "Saturday":6,
    "Sunday":7
})

**Season**

Season names stored as text can be hard for models to use. Instead, we will use the numbers below.
* Fall:0 
* Spring: 1 
* Summer: 2 
* Winter:3

In [None]:
# Fall:0 , Spring: 1 , Summer: 2 , Winter:3
le = preprocessing.LabelEncoder()
data['Seasons'] = le.fit_transform(data['Season'])

In [None]:
data = data.drop("Season", axis = 1)
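
`LabelEncoder` assigns codes in sorted order of the labels, which is where the numbers listed above come from. As a minimal sketch on toy values, the same encoding can be written as an explicit mapping, so the codes do not depend on which seasons happen to be present in the data; `SEASON_CODES` is a hypothetical name.

```python
import pandas as pd

# Same codes as listed above: alphabetical order of the season names.
SEASON_CODES = {"Fall": 0, "Spring": 1, "Summer": 2, "Winter": 3}

seasons = pd.Series(["Winter", "Summer", "Fall", "Spring", "Summer"])  # toy data
codes = seasons.map(SEASON_CODES)
print(codes.tolist())
```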

**Ucr Parts**

In [None]:
# I wanted 'Part One', 'Part Three', 'Part Two' and 'Other' to get codes in a meaningful order, so I used an explicit mapping instead of label encoding.
data["Ucr_Parts"] = data["Ucr_Part"].map({
    "Part One":1,
    "Part Two":2,
    "Part Three":3,
    "Other":0
})

In [None]:
data = data.drop("Ucr_Part", axis = 1)

In [None]:
data.columns

In [None]:
# sort
data = data.sort_values(by =['Occurred_On_Date'], ascending=False)
data[[  'District', 'Shooting','Occurred_On_Date', 'Year', 'Month', 'Day_Of_Week', 'Hour',  'Night', 'DayOfWeek', 'Seasons','Ucr_Parts']]

In [None]:
data

<a id="section-2six"></a>
#  **Conclusions for Q1,Q2& Q3**




**Q1.   How has crime changed over the years?**

* When we look at the number of crimes committed per year, we see fewer crimes in 2015 and 2018. Data are missing for the first 5 months of 2015 and the last 3 months of 2018, so the apparent drop in these years is most probably caused by the missing months. If we filter to the months for which we have data, the distribution of crimes across years is close.



**Q2. Is it possible to predict where or when a crime will be committed?**

* Most crimes seem to occur on Friday. The months with the most crimes appear to be July and August (Crime Amount By Month chart), which coincides with the summer season in the Crime Amount By Season chart.

* Most crimes occur around 17:00, 18:00, and 12:00. This may be because these hours coincide with the times when students and staff leave or take breaks. The number of crimes is likely to rise as more people are on the street.


* Most crimes occur in the B2 district.
* Washington St has the most crime, followed by Blue Hill Ave and Boylston St.

* So, assuming that future crimes follow the same pattern as the crimes in 2015-2018, we can use our analysis to speculate which crime could be committed in which region in the future.







**Q3. What can you say about the distribution of different offenses over the city?**


* The most common crimes are motor vehicle accidents and larceny.

* Larceny was most common in the D4 district, and crimes in the motor vehicle accident group were most common in the B2 district.


* By looking at the heat map, we can speculate where crimes are concentrated in the city.

* Crime is not distributed homogeneously across Boston. Crimes are most likely to be committed in the central areas.














<a id="section-3"></a>

# Models

The EDA gave us a picture of the data. In line with our main goal, we will now make predictions about when and where crimes may occur.

<a id="section-3A"></a>

# **A. Predict crime numbers**

Our first goal is to predict the number of crimes that may occur in the future. 
Why is this important? 
Because if we know the number of crimes that can be committed, we can take action.
* If more crimes are to be committed, a more intense shift may be applied that day.
* The Police Department can focus on the day.
* Equipment can be supplied according to the number of crimes.

*Why didn't I want to apply time series?*


Because my goal is to predict multiple variables together, time series models can be a poor fit. In a time series model, the forecast is determined only by the past behavior of the variable. ARIMA is a univariate model (it works with one variable only) and hence cannot exploit leading indicators or explanatory variables. ARIMA also requires a lot of time series observations. Still, our data has a time dimension, so if we wanted to model a single variable, we could use ARIMA.

<a id="section-3A1"></a>

#  **A.1. Prepare for model**

We will try to predict how many cases will occur per day.

For modelling, we create a dataframe with the district, the date of occurrence, the incident count (per day and district) and the day of the week.

 Adding the Ucr_Parts and Hour features does not improve the score of our model, so we leave them out to reduce complexity.
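
As a minimal sketch of this aggregation, the toy frame below (made-up incident IDs and dates, not the Boston data) shows how grouping by date and district and counting incidents gives one row per day and district. Only the column names follow the notebook.

```python
import pandas as pd

# Toy incidents: hypothetical values, only the column names follow the notebook.
toy = pd.DataFrame({
    "Incident_Number": ["I1", "I2", "I3", "I4", "I5"],
    "Occurred_On_Date": pd.to_datetime(
        ["2018-01-01", "2018-01-01", "2018-01-01", "2018-01-02", "2018-01-02"]
    ),
    "District": ["B2", "B2", "D4", "B2", "D4"],
})

# One row per (date, district) pair, with the number of incidents as the value.
toy_counts = (
    toy.groupby(["Occurred_On_Date", "District"])["Incident_Number"]
    .count()
    .reset_index()
)
print(toy_counts)
```

The `Incident_Number` column of the result holds the count, which is why it is renamed to `CaseCount` later on.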

In [None]:
dataR = pd.DataFrame(data.groupby(["Occurred_On_Date","District"])["Incident_Number"].count()).reset_index()
dataR.tail()

In [None]:
dataR.Occurred_On_Date.nunique()   # every day is unique

In [None]:
"""
dataR["Hour"] = data["Hour"]
dataR["Hour"] = dataR["Hour"].fillna((dataR["Hour"].mean()))
dataR["Ucr_Parts"] = data["Ucr_Parts"]
dataR["Ucr_Parts"] = dataR["Ucr_Parts"].fillna((dataR["Ucr_Parts"].mean()))
"""
# These two features did not improve my score, so I left them out of the model to avoid adding complexity.

In [None]:
dataR

In [None]:
dataR.isnull().sum()   # taking the day column from the original dataset would give 1113 null values, so we create it from Occurred_On_Date instead

We created the days of the week column starting with Monday.
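
For reference, a small sketch of the same numbering on three hand-picked dates (not taken from the dataset): `dt.dayofweek` returns 0 for Monday through 6 for Sunday, so adding 1 gives the 1 to 7 scheme used here.

```python
import pandas as pd

# 2018-01-01 was a Monday, 2018-01-06 a Saturday, 2018-01-07 a Sunday.
dates = pd.Series(pd.to_datetime(["2018-01-01", "2018-01-06", "2018-01-07"]))

# dt.dayofweek is 0 for Monday ... 6 for Sunday; adding 1 gives 1..7.
day_number = dates.dt.dayofweek + 1
print(day_number.tolist())
```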

In [None]:
dataR['Day'] = dataR['Occurred_On_Date'].dt.dayofweek
days = ( 1,2,3,4,5,6,7) # starts w monday
dataR['Day'] = dataR['Day'].apply(lambda x: days[x])

We rename the columns again to give them clear names for this model. CaseCount is the total number of cases in that district on that day.

In [None]:
# rename for clarity
dataR.rename(columns={'Occurred_On_Date': 'OccuredDate', 'Incident_Number': 'CaseCount', 'Day': 'DayOfWeek'}, inplace=True)

In [None]:
dataR

Our data set showing how many cases occur in each district for each day is ready. We're going to **encode** the District column.
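
A minimal sketch of one-hot encoding on a toy frame (hypothetical values). Here we assume the `dtype=int` argument of `pd.get_dummies` is acceptable, which yields integer 0/1 columns directly and would make a separate type conversion unnecessary.

```python
import pandas as pd

# Toy frame with hypothetical counts; column names follow the notebook.
toy = pd.DataFrame({"District": ["A1", "B2", "A1"], "CaseCount": [3, 5, 2]})

# dtype=int asks get_dummies for integer 0/1 columns instead of uint8/bool.
dummies = pd.get_dummies(toy["District"], prefix="D", dtype=int)

# Replace the original column with its one-hot encoded version.
encoded = pd.concat([toy.drop(columns="District"), dummies], axis=1)
print(encoded)
```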

In [None]:
dataR = pd.concat([dataR,pd.get_dummies(dataR['District'], prefix='D')],axis=1)

# now drop the original 'District' column (it is no longer needed)
dataR.drop(['District'],axis=1, inplace=True)

In [None]:
dataR

In [None]:
# dataR.loc[:,"D_A1":"D_E5"][1150:1200]

We changed the data types for the required columns.

In [None]:
# convert the one-hot district columns (uint8) to int64
district_cols = [col for col in dataR.columns if col.startswith("D_")]
dataR[district_cols] = dataR[district_cols].astype("int64")

In [None]:
dataR.info()

Since linear regression cannot take a date variable, we convert OccuredDate from a date to its ordinal value.
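
To make the conversion concrete, here is a small sketch on three hand-picked dates. An ordinal is simply a day count, so consecutive days differ by exactly 1 and the model sees the date as an ordinary integer.

```python
import datetime as dt
import pandas as pd

dates = pd.Series(pd.to_datetime(["2018-01-01", "2018-01-02", "2018-02-01"]))

# pandas Timestamps are datetime subclasses, so toordinal applies to each value.
ordinals = dates.map(dt.datetime.toordinal)
print(ordinals.tolist())
```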

In [None]:
# convert date to ordinal for linear regression
# import datetime as dt
dataR['OccuredDate'] = pd.to_datetime(dataR['OccuredDate'])
dataR['OccuredDate'] = dataR['OccuredDate'].map(dt.datetime.toordinal)
# "TypeError: invalid type promotion"

In [None]:
dataR['OccuredDate'].head()           

In [None]:
dataR["OccuredDate"].nunique()  
# Same number of unique values as in the date format: one ordinal per day.

In [None]:
dataR

 <a id="section-3A2"></a>
 # A.2. Model 1: Predict crime numbers for each day (Base-model)

In [None]:
# imports moved up to the Imports section
"""from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn import metrics
#from sklearn.metrics import r2_score,mean_squared_error
from sklearn.ensemble import RandomForestRegressor"""

After preparing our data, we will try to estimate the number of crimes that will occur per day. We start with linear regression because it is simple, easy to understand and easy to implement.

In [None]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(dataR.drop(columns=["CaseCount"]), dataR["CaseCount"], random_state = 42)  

In [None]:
lr = LinearRegression().fit(x_train,y_train)

y_train_pred = lr.predict(x_train)
y_test_pred = lr.predict(x_test)

print(lr.score(x_test,y_test))

In [None]:
ax = sns.regplot(x=y_test, y=y_test_pred, color="g")

In [None]:
print("R2 Score: ",r2_score(y_test, y_test_pred))
print("MAE:", metrics.mean_absolute_error(y_test, y_test_pred))
print('MSE:', metrics.mean_squared_error(y_test, y_test_pred))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, y_test_pred)))                    

Why did I choose R-squared as the metric?


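Because R-squared tells us how much of the variance in the target is explained by the model.
For a reasonable model it lies between 0 and 1, which makes it easy to read. It is not an accuracy score, though, and on test data it can be negative when the model does worse than always predicting the mean.
The better the model, the higher the R-squared value.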
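
As a worked illustration, R-squared is $1 - SS_{res} / SS_{tot}$. The helper below is a hypothetical plain-Python sketch of that formula (not the scikit-learn implementation), applied to made-up daily counts to show the three typical cases: perfect predictions, always predicting the mean, and predictions that are worse than the mean.

```python
def r_squared(y_true, y_pred):
    """Return 1 - SS_res / SS_tot for two equal-length sequences."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


y_true = [10, 20, 30, 40]  # hypothetical daily case counts

perfect = r_squared(y_true, y_true)                   # predictions equal the truth
mean_only = r_squared(y_true, [25, 25, 25, 25])       # always predict the mean
worse = r_squared(y_true, [40, 30, 20, 10])           # worse than the mean
print(perfect, mean_only, worse)
```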
 

Each date appears in the data up to 12 times, once per district. With a random train/test split, rows from the same day can end up in both the training and the test set, which can cause data leakage. Therefore, we will do a more detailed study.
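
One possible way to avoid sharing days between the two sets is to split by date instead of at random. The sketch below is only an illustration under that assumption: `split_by_date` is a hypothetical helper, and the toy frame uses made-up counts with the notebook's column names.

```python
import pandas as pd


def split_by_date(frame, date_col, test_fraction=0.25):
    """Put the most recent days in the test set so no day is in both sets."""
    unique_days = sorted(frame[date_col].unique())
    n_test_days = max(1, int(len(unique_days) * test_fraction))
    cutoff = unique_days[-n_test_days]
    train = frame[frame[date_col] < cutoff]
    test = frame[frame[date_col] >= cutoff]
    return train, test


# Toy data: 4 days x 2 districts (hypothetical counts).
toy = pd.DataFrame({
    "OccuredDate": [1, 1, 2, 2, 3, 3, 4, 4],
    "D_B2": [1, 0] * 4,
    "CaseCount": [12, 7, 15, 9, 11, 8, 14, 6],
})
train, test = split_by_date(toy, "OccuredDate")
print(sorted(train["OccuredDate"].unique()), sorted(test["OccuredDate"].unique()))
```

With this kind of split the model is always evaluated on days it has never seen, which is closer to how it would be used for future predictions.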

 <a id="section-3A3"></a>

# A.3. Model 2: Predict crime numbers for each day for Ucr Part 3

In the previous model, we made an estimate for all Ucr_Parts. We are now customizing the model. Ucr_Part 3 is the Ucr part type with the highest number of crimes. Therefore, we will try to estimate how many crimes can be committed from Ucr_Part 3 in the future.

part1 : 0.42
part2: 0.57

Let's remember the Ucr Parts.

In [None]:
ucr_counts = data.groupby('Ucr_Parts').count()['Incident_Number'].to_frame().reset_index()
ax = sns.barplot(x = 'Ucr_Parts' , y="Incident_Number", data = ucr_counts,palette='BrBG')
print(ucr_counts)

In [None]:
data["District"].unique()

We will feed our model with the number of crimes happening in the districts.

In [None]:
dataD4 = data.loc[data['District'] == "D4"]
dataD4 = dataD4.loc[:, ["Incident_Number",'Offense_Code_Group', 'District', 'Occurred_On_Date']]
dataD4 = pd.DataFrame(dataD4.groupby(["Occurred_On_Date","District"])["Incident_Number"].count()).reset_index()
dataD4.rename(columns={ 'Incident_Number': 'countD4','Occurred_On_Date': "DateD4"}, inplace=True)
dataD4 = dataD4.drop("District",axis = 1)

In [None]:
dataD14 = data.loc[data['District'] == "D14"]
dataD14 = dataD14.loc[:, ["Incident_Number",'Offense_Code_Group', 'District', 'Occurred_On_Date']]
dataD14 = pd.DataFrame(dataD14.groupby(["Occurred_On_Date","District"])["Incident_Number"].count()).reset_index()
dataD14.rename(columns={ 'Incident_Number': 'countD14','Occurred_On_Date': "DateD14"}, inplace=True)
dataD14 = dataD14.drop("District",axis = 1)

In [None]:
dataC11 = data.loc[data['District'] == "C11"]
dataC11 = dataC11.loc[:, ["Incident_Number",'Offense_Code_Group', 'District', 'Occurred_On_Date']]
dataC11 = pd.DataFrame(dataC11.groupby(["Occurred_On_Date","District"])["Incident_Number"].count()).reset_index()
dataC11.rename(columns={ 'Incident_Number': 'countC11','Occurred_On_Date': "DateC11"}, inplace=True)
dataC11 = dataC11.drop("District",axis = 1)

In [None]:
dataB3 = data.loc[data['District'] == "B3"]
dataB3 = dataB3.loc[:, ["Incident_Number",'Offense_Code_Group', 'District', 'Occurred_On_Date']]
dataB3 = pd.DataFrame(dataB3.groupby(["Occurred_On_Date","District"])["Incident_Number"].count()).reset_index()
dataB3.rename(columns={ 'Incident_Number': 'countB3', 'Occurred_On_Date': "DateB3"}, inplace=True)
dataB3 = dataB3.drop("District",axis = 1)

In [None]:
dataB2 = data.loc[data['District'] == "B2"]
dataB2 = dataB2.loc[:, ["Incident_Number",'Offense_Code_Group', 'District', 'Occurred_On_Date']]
dataB2 = pd.DataFrame(dataB2.groupby(["Occurred_On_Date","District"])["Incident_Number"].count()).reset_index()
dataB2.rename(columns={ 'Incident_Number': 'countB2','Occurred_On_Date': "DateB2"}, inplace=True)
dataB2 = dataB2.drop("District",axis = 1)

In [None]:
dataC6 = data.loc[data['District'] == "C6"]
dataC6 = dataC6.loc[:, ["Incident_Number",'Offense_Code_Group', 'District', 'Occurred_On_Date']]
dataC6 = pd.DataFrame(dataC6.groupby(["Occurred_On_Date","District"])["Incident_Number"].count()).reset_index()
dataC6.rename(columns={ 'Incident_Number': 'countC6','Occurred_On_Date': "DateC6"}, inplace=True)
dataC6 = dataC6.drop("District",axis = 1)

In [None]:
dataA1 = data.loc[data['District'] == "A1"]
dataA1 = dataA1.loc[:, ["Incident_Number",'Offense_Code_Group', 'District', 'Occurred_On_Date']]
dataA1 = pd.DataFrame(dataA1.groupby(["Occurred_On_Date","District"])["Incident_Number"].count()).reset_index()
dataA1.rename(columns={ 'Incident_Number': 'countA1','Occurred_On_Date': "DateA1"}, inplace=True)
dataA1 = dataA1.drop("District",axis = 1)

In [None]:
dataE5 = data.loc[data['District'] == "E5"]
dataE5 = dataE5.loc[:, ["Incident_Number",'Offense_Code_Group', 'District', 'Occurred_On_Date']]
dataE5 = pd.DataFrame(dataE5.groupby(["Occurred_On_Date","District"])["Incident_Number"].count()).reset_index()
dataE5.rename(columns={ 'Incident_Number': 'countE5','Occurred_On_Date': "DateE5"}, inplace=True)
dataE5 = dataE5.drop("District",axis = 1)

In [None]:
dataA7 = data.loc[data['District'] == "A7"]
dataA7 = dataA7.loc[:, ["Incident_Number",'Offense_Code_Group', 'District', 'Occurred_On_Date']]
dataA7 = pd.DataFrame(dataA7.groupby(["Occurred_On_Date","District"])["Incident_Number"].count()).reset_index()
dataA7.rename(columns={ 'Incident_Number': 'countA7','Occurred_On_Date': "DateA7"}, inplace=True)
dataA7 = dataA7.drop("District",axis = 1)

In [None]:
dataE13 = data.loc[data['District'] == "E13"]
dataE13 = dataE13.loc[:, ["Incident_Number",'Offense_Code_Group', 'District', 'Occurred_On_Date']]
dataE13 = pd.DataFrame(dataE13.groupby(["Occurred_On_Date","District"])["Incident_Number"].count()).reset_index()
dataE13.rename(columns={ 'Incident_Number': 'countE13','Occurred_On_Date': "DateE13"}, inplace=True)
dataE13 = dataE13.drop("District",axis = 1)

In [None]:
dataE18 = data.loc[data['District'] == "E18"]
dataE18 = dataE18.loc[:, ["Incident_Number",'Offense_Code_Group', 'District', 'Occurred_On_Date']]
dataE18 = pd.DataFrame(dataE18.groupby(["Occurred_On_Date","District"])["Incident_Number"].count()).reset_index()
dataE18.rename(columns={ 'Incident_Number': 'countE18', 'Occurred_On_Date': "DateE18"}, inplace=True)
dataE18 = dataE18.drop("District",axis = 1)

In [None]:
dataA15 = data.loc[data['District'] == "A15"]
dataA15 = dataA15.loc[:, ["Incident_Number",'Offense_Code_Group', 'District', 'Occurred_On_Date']]
dataA15 = pd.DataFrame(dataA15.groupby(["Occurred_On_Date","District"])["Incident_Number"].count()).reset_index()
dataA15.rename(columns={ 'Incident_Number': 'countA15','Occurred_On_Date': "DateA15"}, inplace=True)
dataA15 = dataA15.drop("District",axis = 1)

In [None]:
dataE18

In [None]:
dataUCR = data.loc[data['Ucr_Parts'] == 3]
dataUCR = dataUCR.loc[:, ["Incident_Number", 'Ucr_Parts', 'Occurred_On_Date']]
dataUCR = pd.DataFrame(dataUCR.groupby(["Occurred_On_Date","Ucr_Parts"])["Incident_Number"].count()).reset_index()
dataUCR.rename(columns={ 'Incident_Number': 'countUCR','Occurred_On_Date': "DateUCR"}, inplace=True)
dataUCR = dataUCR.drop("Ucr_Parts",axis = 1)


In [None]:
dataUCR

We get the daily crime numbers in each district and the crime numbers in Ucr_Part 3.
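
A note on combining per-district frames: `pd.concat(..., axis=1)` aligns rows by index label, not by date, so frames covering different numbers of days can end up side by side on mismatched dates. As a hedged alternative sketch (toy data, and assuming that a day with no recorded incident in a district means zero cases), the wide table can be built in one step with `groupby` and `unstack`, which keeps one row per date.

```python
import pandas as pd

# Toy incidents (hypothetical): A15 has no incident on the second day.
toy = pd.DataFrame({
    "Incident_Number": ["I1", "I2", "I3", "I4"],
    "Occurred_On_Date": pd.to_datetime(
        ["2018-01-01", "2018-01-01", "2018-01-02", "2018-01-02"]
    ),
    "District": ["B2", "A15", "B2", "B2"],
})

# Count incidents per (date, district), then spread districts into columns.
# Districts with no incident on a day get 0 instead of NaN.
wide = (
    toy.groupby(["Occurred_On_Date", "District"])["Incident_Number"]
    .count()
    .unstack(fill_value=0)
    .add_prefix("count")
)
print(wide)
```

The result is indexed by date, so calendar features can be derived from the index without keeping a separate date column per district.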

In [None]:
result = pd.concat([dataUCR,dataD14, dataC11, dataD4, dataB3,dataB2,dataC6, dataA1, dataE5, dataA7, dataE13,
       dataE18, dataA15], axis=1, sort=False)

In [None]:
result = result.drop(["DateUCR",'DateC11','DateD4', 'DateB3',  'DateB2','DateC6',  'DateA1', 'DateE5',  'DateA7',  'DateE13','DateE18',  'DateA15'] ,axis = 1)

In [None]:
result.isnull().sum()

We have missing data in the A15 district. Let's fill it with the mean.

In [None]:
result["countA15"] = result["countA15"].fillna(result["countA15"].mean())

Let's add  DayofMonth, Month and Weekday features to our data. Maybe it will help.

In [None]:
result['DayofMonth'] = result['DateD14'].dt.day
result['Month'] = result['DateD14'].dt.month
result['Weekday'] = result['DateD14'].dt.weekday

In [None]:
result

Check for null values.

In [None]:
result.isnull().sum()

And our final data for the model.

In [None]:
result

In [None]:
result['DateD14'] = pd.to_datetime(result['DateD14'])
result['DateD14'] = result['DateD14'].map(dt.datetime.toordinal)

In [None]:
x_trainUCR, x_testUCR, y_trainUCR, y_testUCR = train_test_split(result.drop(columns=["countUCR"]), result["countUCR"], random_state = 42)  

In [None]:
lr = LinearRegression().fit(x_trainUCR,y_trainUCR)

y_train_predUCR = lr.predict(x_trainUCR)
y_test_predUCR = lr.predict(x_testUCR)

print(lr.score(x_testUCR,y_testUCR))

In [None]:
print("R2 Score: ",r2_score(y_testUCR, y_test_predUCR))
print("MAE:", metrics.mean_absolute_error(y_testUCR, y_test_predUCR))
print('MSE:', metrics.mean_squared_error(y_testUCR, y_test_predUCR))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_testUCR, y_test_predUCR)))   

In [None]:
ax = sns.regplot(x=y_testUCR, y=y_test_predUCR, color="g")

<a id="section-3B"></a>
# Model 3: Predict district & location

Now we want to predict the region. If we can predict in which region a crime will occur, the police can focus their work on those regions.
This can help plan and guide patrol cars, which can be routed to busy places during peak hours.

In other words, it is important to predict in which district a crime may occur in order to prevent crime.

 <a id="section-3B1"></a>
# Prep for Model 3

It is important to take action where the crime may occur.
In this model we will try to predict the district in which a crime occurs. We will use XGBClassifier for its ease of use and predictive power.

In the first version of the model we used the Offense_Code_Group, Year, Seasons, Hour, Shooting, DayOfWeek, Ucr_Parts, Lat and Long features to predict the district. However, when we got 99% accuracy, we noticed that the Lat and Long columns were leaking the answer, since a district is essentially determined by its coordinates. Let's examine.

In [None]:
# .copy() so the column assignment below does not act on a view of data
dataD2 = data[['Offense_Code_Group', 'District',   'Year', 'Seasons', 'Hour', 'Shooting', 'DayOfWeek',"Ucr_Parts","Lat","Long"]].copy()
dataD2['Offense_Code_Group'] = le.fit_transform(dataD2['Offense_Code_Group'])

In [None]:
X_train_xgb2, X_test_xgb2, y_train_xgb2, y_test_xgb2 = train_test_split(dataD2.drop(["District"], axis = 1), dataD2["District"], test_size=0.20, random_state=42)

In [None]:
scaler = StandardScaler()
scaler.fit(X_train_xgb2)
X_train = scaler.transform(X_train_xgb2)
X_test = scaler.transform(X_test_xgb2)

# Model 3

Let's build the model.

In [None]:
xgB2 = XGBClassifier(learning_rate=0.1, n_estimators=140, max_depth=5, random_state=0)
xgB2.fit(X_train_xgb2, y_train_xgb2)

In [None]:
y_predXG2 = xgB2.predict(X_test_xgb2)
cmXG =confusion_matrix(y_test_xgb2, y_predXG2)
print(classification_report(y_test_xgb2, y_predXG2))

We gave the model the location values that determine the value we are predicting, so our accuracy is 0.99.

**Feature Importance**

When we look at feature importance, we can see how much latitude and longitude values affect our estimation.
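
When reading an importance chart, keep in mind that the bars are drawn sorted by score, so labels have to travel with their values rather than follow the original column order. A small sketch with hypothetical scores (not real model output) shows the idea.

```python
# Hypothetical importance scores, keyed by feature name (not real model output).
importances = {"Hour": 120.0, "Lat": 900.0, "Year": 60.0, "Long": 850.0}

# Sort name and score together so every label stays attached to its own value.
ranked = sorted(importances.items(), key=lambda item: item[1], reverse=True)
for name, score in ranked:
    print(f"{name:>5}: {score:.0f}")
```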

In [None]:
from xgboost import plot_importance
# xgB2 was fit on a DataFrame, so plot_importance already labels each (sorted) bar with its column name.
plot_importance(xgB2)


* Accuracy is 0.99 because we gave the model the location while predicting the districts. Now we need to remove the Lat and Long columns from the model and try again.

* We'll create the day of month column to try to increase the accuracy.

# Model 3 without Lat and Long

In [None]:
data['DayofMonth'] = data['Occurred_On_Date'].dt.day

In [None]:
data.columns

In [None]:
# .copy() so the column assignment below does not act on a view of data
dataD = data[['Offense_Code_Group', 'District',   'Year', 'Seasons', 'Hour', 'Shooting', 'DayOfWeek',"Ucr_Parts","DayofMonth","Night"]].copy()

In [None]:
dataD["DayofMonth"].nunique()

In [None]:
dataD['Offense_Code_Group'] = le.fit_transform(dataD['Offense_Code_Group'])
# dataD['Reporting_Area'] = le.fit_transform(dataD['Reporting_Area'])     # Including Reporting_Area in the model gives 0.97 accuracy.
# dataD = data[['Offense_Code_Group', 'District',   'Year', 'Seasons', 'Hour', 'Shooting', 'DayOfWeek',"Ucr_Parts",'Year', 'Month', 'Night']]
# If we add the 'Year', 'Month', 'Night' columns, the model's accuracy still stays at 0.20. It does not change.

In [None]:
data.columns

In [None]:
dataD

In [None]:
X_train, X_test, y_train, y_test = train_test_split(dataD.drop(["District"], axis = 1), dataD["District"], test_size=0.20, random_state=42)

In [None]:
# from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

Let's try our model without latitude and longitude values.

In [None]:
# from xgboost import XGBClassifier

xgB = XGBClassifier(learning_rate=0.1, n_estimators=140, max_depth=5, random_state=0)
xgB.fit(X_train, y_train)

In [None]:
# from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, confusion_matrix, classification_report

y_predXG = xgB.predict(X_test)
cmXG =confusion_matrix(y_test, y_predXG)
print(classification_report(y_test, y_predXG))

**Feature Importance**

In [None]:
# Feature importance shows which features contribute most to the result; future models can build on it.
# Bars are sorted by score, so name the features on the booster (fit on a NumPy array) instead of relabelling ticks.
xgB.get_booster().feature_names = ["Offense_Code_Group","Year","Seasons","Hour","Shooting","DayOfWeek","Ucr_Parts","DayofMonth","Night"]
plot_importance(xgB)

Our model was not successful enough at predicting the districts. Possible reasons:

* Districts are close to each other and have no clear boundaries.
* We have a large number of districts.
* We don't have enough data.

For this reason, we will group neighboring districts together and model the groups, aiming for better predictions.

# Model 3 with Grouped Districts

In [None]:
dataD = data[['Offense_Code_Group', 'District',   'Year', 'Seasons', 'Hour', 'Shooting', 'DayOfWeek',"Ucr_Parts","DayofMonth","Night"]]

In [None]:
dataD

In [None]:
# .copy() so the column assignments below do not act on a view of data
dataD3 = data[['Offense_Code_Group','District', 'Year', 'Seasons', 'Hour', 'Shooting', 'DayOfWeek',"Ucr_Parts","DayofMonth","Night"]].copy()

In [None]:
dataD3['Offense_Code_Group'] = le.fit_transform(dataD3['Offense_Code_Group'])

![Districts](https://i.pinimg.com/originals/73/00/d7/7300d79ca2fed818119719fba67d9a50.jpg)

Let's remember how the districts are distributed within the city.

In [None]:
plt.figure(figsize=(7,7))
sp = data[(data['Lat'] != -1) & (data['Long'] != -1)]
sns.scatterplot(x="Lat", y="Long",hue='District',data=sp)

# For the district names, see https://mikethemadbiologist.com/2013/06/29/why-are-bostons-police-districts-named-so-bizarrely/

We divided the districts into 3 groups according to their proximity.

In [None]:
dataD3["District"] = dataD3["District"].map({
    
    "E18":1,
    "E5":1,
    "E13":1,
    "B3":1,
    
    "D4":2,
    "B2":2,
    "C11":2,
    "C6":2,
    
    "A15":3,
    "A1":3,
    "A7":3,
    "D14":3,
    
})                         # Splitting into 3 groups gives 49% accuracy.

In [None]:
"""dataD3["District"] = dataD3["District"].map({
    "B2":3,
    "E18":2,
    "C11":3,
    "A1":1,
    "D4":4,
    "C6":4,
    "B3":2,
    "A15":1,
    "D14":4,
    "E13":3,
    "A7":1,
    "E5":2,
    
})"""      
# accuracy 0.38

This time our dataset has the Hour, Ucr_Parts, Offense_Code_Group, Shooting, ... columns.

We will try to classify the district groups.
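
To put the accuracy of a grouped classifier in context, it can help to compare it with the score obtained by always predicting the largest group. The helper below is a hypothetical sketch of that baseline, and the labels are made up rather than taken from the real group distribution.

```python
from collections import Counter


def majority_baseline(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)


# Hypothetical group labels, not the real distribution.
toy_groups = [1, 1, 2, 2, 2, 2, 3, 3, 3, 3]
baseline = majority_baseline(toy_groups)
print(baseline)
```

A model is only adding information if its accuracy is clearly above this value for the actual labels.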

In [None]:
dataD3

In [None]:
X_train_xgb3, X_test_xgb3, y_train_xgb3, y_test_xgb3 = train_test_split(dataD3.drop(["District"], axis = 1), dataD3["District"], test_size=0.20, random_state=42)

In [None]:
scaler = StandardScaler()
scaler.fit(X_train_xgb3)
X_train = scaler.transform(X_train_xgb3)
X_test = scaler.transform(X_test_xgb3)

In [None]:
xgB3 = XGBClassifier(learning_rate=0.1, n_estimators=50, max_depth=5, random_state=0)
xgB3.fit(X_train_xgb3, y_train_xgb3)

In [None]:
y_predXG3 = xgB3.predict(X_test_xgb3)
cmXG3 =confusion_matrix(y_test_xgb3, y_predXG3)
print(classification_report(y_test_xgb3, y_predXG3))

In [None]:
# xgB3 was fit on a DataFrame, so plot_importance already labels each (sorted) bar with its column name.
plot_importance(xgB3)

![Boston](https://images.unsplash.com/photo-1501979376754-2ff867a4f659?ixlib=rb-1.2.1&auto=format&fit=crop&w=1350&q=80)