# Capstone Project - The Battle of the Neighborhoods (Week 2)
### Applied Data Science Capstone by IBM/Coursera

## Table of contents
* [Introduction: Business Problem](#introduction)
* [Data](#data)
* [Methodology](#methodology)
* [Analysis](#analysis)
* [Results and Discussion](#results)
* [Conclusion](#conclusion)

## Introduction: Business Problem <a name="introduction"></a>

### City crime is one of the main concerns for the public and decision makers.
Life in today's large, densely populated cities raises major safety concerns, and community safety poses challenges for authorities and police departments.
Crimes in big cities vary in type and rate and can happen at any time, so planning and resources are needed to understand crime trends and to decide where and when to allocate resources in a given area.
Proper planning requires answers to questions such as:
-	What are the major crimes in the city?
-	How are the crime types distributed across the city?
-	Where are such crimes concentrated within the city's districts?
-	With a limited budget, which districts should get more focus and financial support to improve their capabilities?
-	What are the main geographic characteristics that recur for a specific crime and could serve as indicators for a proactive approach to improving police service quality in other areas or cities?

Data science techniques can help meet these challenges and answer these questions.


## Data <a name="data"></a>

###### This project relies on the following information sources:
-	The Chicago crime incidents dataset, covering 2001 to the present, available from the Chicago Data Portal. It has detailed information on each case, such as where and when it happened, the case coordinates (latitude and longitude), and the crime type. (A sample of the data is shown below.)

-	Foursquare, a geolocation platform that powers business solutions and consumer products through its location data. The Foursquare Explore API will be used to get nearby venues based on case coordinates.
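
The venue lookup described above can be sketched as follows. This is a minimal illustration, assuming the legacy Foursquare v2 `venues/explore` endpoint and its `client_id`, `client_secret`, `v`, `ll`, `radius`, and `limit` query parameters. The helper name `build_explore_url`, the credentials, the version date, and the default radius and limit are placeholders, not values taken from this project.

```python
from urllib.parse import urlencode

# Assumed legacy Foursquare v2 endpoint; newer API versions use different URLs and auth.
EXPLORE_ENDPOINT = "https://api.foursquare.com/v2/venues/explore"

def build_explore_url(lat, lng, client_id, client_secret,
                      version="20180605", radius=500, limit=100):
    """Build an explore request URL for venues near one crime location (hypothetical helper)."""
    params = {
        "client_id": client_id,          # placeholder credential
        "client_secret": client_secret,  # placeholder credential
        "v": version,                    # API version date, assumed value
        "ll": f"{lat},{lng}",            # "latitude,longitude" of the case
        "radius": radius,                # search radius in meters, assumed value
        "limit": limit,                  # maximum number of venues, assumed value
    }
    return f"{EXPLORE_ENDPOINT}?{urlencode(params)}"

# Coordinates of the first sample row in the dataset
url = build_explore_url(41.8967, -87.655246, "YOUR_CLIENT_ID", "YOUR_CLIENT_SECRET")
print(url)
```

The resulting URL could then be passed to `requests.get(url).json()` once real credentials are available.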



In [1]:
# Import the libraries used in this notebook
import numpy as np # library to handle data in a vectorized manner
import pandas as pd
pd.set_option('display.max_columns', 500)
from pandas import json_normalize # transform JSON data into a pandas dataframe (pandas >= 1.0)
import requests
from bs4 import BeautifulSoup
import json # library to handle JSON files

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors
import matplotlib.pyplot as plt
%matplotlib inline

# import k-means for the clustering stage
from sklearn.cluster import KMeans
from geopy.geocoders import Nominatim

#!conda install -c conda-forge folium=0.5.0 --yes
import folium # map rendering library

In [2]:
# The code was removed by Watson Studio for sharing.

Unnamed: 0,ID,Case Number,Date,Block,IUCR,Primary Type,Description,Location Description,Arrest,Domestic,Beat,District,Ward,Community Area,FBI Code,X Coordinate,Y Coordinate,Year,Updated On,Latitude,Longitude,Location,Historical Wards 2003-2015,Zip Codes,Community Areas,Census Tracts,Wards,Boundaries - ZIP Codes,Police Districts,Police Beats
0,11668274,JC240043,04/26/2019 11:58:00 PM,008XX N MAY ST,0620,BURGLARY,UNLAWFUL ENTRY,APARTMENT,False,False,1213,12.0,27.0,24.0,05,1168861.0,1905677.0,2019,05/03/2019 04:14:46 PM,41.8967,-87.655246,"(41.896700196, -87.655246179)",41.0,22620.0,25.0,109.0,46.0,49.0,15.0,60.0
1,11668131,JC240018,04/26/2019 11:58:00 PM,017XX N CENTRAL AVE,1150,DECEPTIVE PRACTICE,CREDIT CARD FRAUD,GAS STATION,False,False,2531,25.0,29.0,25.0,11,1138768.0,1911350.0,2019,05/03/2019 04:14:46 PM,41.912867,-87.765636,"(41.912867052, -87.765635915)",52.0,22615.0,26.0,597.0,7.0,2.0,6.0,154.0
2,11668155,JC240031,04/26/2019 11:56:00 PM,046XX N MELVINA AVE,2093,NARCOTICS,FOUND SUSPECT NARCOTICS,PARK PROPERTY,True,False,1622,16.0,38.0,15.0,18,1134198.0,1930660.0,2019,05/03/2019 04:14:46 PM,41.965938,-87.781969,"(41.965937596, -87.781969004)",25.0,21869.0,15.0,95.0,19.0,48.0,12.0,43.0
3,11668197,JC240026,04/26/2019 11:51:00 PM,004XX W 83RD ST,143A,WEAPONS VIOLATION,UNLAWFUL POSS OF HANDGUN,STREET,True,False,622,6.0,21.0,71.0,15,1174909.0,1849960.0,2019,05/03/2019 04:14:46 PM,41.743674,-87.634697,"(41.743674436, -87.634696986)",18.0,21554.0,40.0,1.0,13.0,59.0,20.0,236.0
4,11668158,JC239985,04/26/2019 11:49:00 PM,049XX W JACKSON BLVD,0486,BATTERY,DOMESTIC BATTERY SIMPLE,APARTMENT,False,True,1533,15.0,28.0,25.0,08B,1143691.0,1898221.0,2019,05/03/2019 04:14:46 PM,41.876749,-87.747879,"(41.876748723, -87.747878888)",36.0,22216.0,26.0,69.0,7.0,32.0,25.0,137.0


Let's find out how many entries there are in our dataset.

In [None]:
# Number of rows and columns in the raw dataset
print(df_data_1.shape)
# Number of distinct case numbers
print(df_data_1['Case Number'].nunique())

# Select duplicate rows except first occurrence based on all columns
duplicateRowsDF = df_data_1[df_data_1.duplicated()]

print("Duplicate rows (excluding first occurrence, based on all columns):")
print(duplicateRowsDF)
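
Because `duplicated()` with no arguments only flags rows that match in every column, a case that was re-entered with a different `Updated On` value would not be caught. A minimal sketch that checks the `Case Number` key alone is shown below; the toy rows are made up for illustration and stand in for `df_data_1`.

```python
import pandas as pd

# Toy data standing in for df_data_1; values are illustrative only.
toy = pd.DataFrame({
    "Case Number": ["JC1", "JC2", "JC2", "JC3"],
    "Primary Type": ["THEFT", "BATTERY", "BATTERY", "ARSON"],
    "Updated On": ["05/01/2019", "05/01/2019", "05/03/2019", "05/01/2019"],
})

# Rows identical in every column (none here, because "Updated On" differs)
full_row_dupes = toy[toy.duplicated()]

# Rows whose case number already appeared earlier in the frame
key_dupes = toy[toy.duplicated(subset=["Case Number"], keep="first")]

# Keep only the first record for each case number
deduped = toy.drop_duplicates(subset=["Case Number"], keep="first")

print(len(full_row_dupes), len(key_dupes), len(deduped))
```

Whether the first or the most recently updated record should be kept depends on the analysis; `keep="first"` is only one option.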

### Data Cleaning and preprocessing

In [None]:
# Start by keeping only the columns needed for the analysis
df_data = df_data_1 [['Case Number', 'Date', 'Block', 'IUCR', 'Primary Type',
       'Description', 'Arrest', 'Community Area', 'Year',  'Latitude', 'Longitude',
       'Zip Codes','Community Areas', 'Census Tracts', 'Wards', 'Boundaries - ZIP Codes',
       'Police Districts']]

# Drop rows with a missing (NaN) Latitude
df_data = df_data[np.isfinite(df_data['Latitude'])]

# Data for 2019 is not complete, so it is dropped for the time being
df_data = df_data[df_data.Year != 2019]

print(df_data.shape)
#df_data
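
The cell above filters on `Latitude`. As a hedged variant, the sketch below drops rows missing either coordinate and then excludes the incomplete year, using a small invented frame rather than the real dataset.

```python
import numpy as np
import pandas as pd

# Toy data; coordinates and years are illustrative only.
toy = pd.DataFrame({
    "Latitude": [41.90, np.nan, 41.74, 41.88],
    "Longitude": [-87.66, -87.77, np.nan, -87.75],
    "Year": [2018, 2018, 2017, 2019],
})

# Drop rows where either coordinate is missing
cleaned = toy.dropna(subset=["Latitude", "Longitude"])

# Exclude the year whose data is incomplete
cleaned = cleaned[cleaned["Year"] != 2019]

print(cleaned.shape)
```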



In [None]:
#df_year_crime = df_data[['Year', 'Case Number']].groupby(['Year']).count()
#df_year_crime = df_year_crime.reset_index()
#df_year_crime

In [None]:
#df_year_crime.plot(kind='line',x='Year',y='Case Number', color='red')
#plt.show()

In [3]:
# Count incidents per year and crime type
df_year_type = df_data[['Year', 'Primary Type', 'Case Number']]
df_year_type = df_year_type.groupby(['Year', 'Primary Type']).size().rename('count').reset_index()

# Number of crime types recorded in 2001
ss = df_year_type[df_year_type.Year == 2001]
print(ss.shape)
df_year_type.shape


In [42]:
# one hot encoding of the crime type
df__onehot = pd.get_dummies(df_year_type[['Primary Type']], prefix="", prefix_sep="")

# add the Year column back to the dataframe
df__onehot['Year'] = df_year_type['Year']

# move the Year column to the first position
fixed_columns = [df__onehot.columns[-1]] + list(df__onehot.columns[:-1])
df__onehot = df__onehot[fixed_columns]
# count() returns the number of rows per year, i.e. how many crime types were recorded that year
df__onehot = df__onehot.groupby(['Year']).count()
df__onehot.head()


Year,ARSON,ASSAULT,BATTERY,BURGLARY,CONCEALED CARRY LICENSE VIOLATION,CRIM SEXUAL ASSAULT,CRIMINAL DAMAGE,CRIMINAL TRESPASS,DECEPTIVE PRACTICE,DOMESTIC VIOLENCE,GAMBLING,HOMICIDE,HUMAN TRAFFICKING,INTERFERENCE WITH PUBLIC OFFICER,INTIMIDATION,KIDNAPPING,LIQUOR LAW VIOLATION,MOTOR VEHICLE THEFT,NARCOTICS,NON - CRIMINAL,NON-CRIMINAL,NON-CRIMINAL (SUBJECT SPECIFIED),OBSCENITY,OFFENSE INVOLVING CHILDREN,OTHER NARCOTIC VIOLATION,OTHER OFFENSE,PROSTITUTION,PUBLIC INDECENCY,PUBLIC PEACE VIOLATION,RITUALISM,ROBBERY,SEX OFFENSE,STALKING,THEFT,WEAPONS VIOLATION
2001,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30,30
2002,29,29,29,29,29,29,29,29,29,29,29,29,29,29,29,29,29,29,29,29,29,29,29,29,29,29,29,29,29,29,29,29,29,29,29
2003,29,29,29,29,29,29,29,29,29,29,29,29,29,29,29,29,29,29,29,29,29,29,29,29,29,29,29,29,29,29,29,29,29,29,29
2004,29,29,29,29,29,29,29,29,29,29,29,29,29,29,29,29,29,29,29,29,29,29,29,29,29,29,29,29,29,29,29,29,29,29,29
2005,29,29,29,29,29,29,29,29,29,29,29,29,29,29,29,29,29,29,29,29,29,29,29,29,29,29,29,29,29,29,29,29,29,29,29
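
Every column in the table above holds the same number because `count()` tallies rows per year rather than crimes per type. Assuming the goal is a year-by-type table of incident counts, one possible sketch pivots the `count` column instead. The toy frame below mimics the shape of `df_year_type`; its numbers are invented.

```python
import pandas as pd

# Toy frame with the same columns as df_year_type; counts are illustrative only.
toy_year_type = pd.DataFrame({
    "Year": [2001, 2001, 2002, 2002],
    "Primary Type": ["ARSON", "THEFT", "ARSON", "BATTERY"],
    "count": [10, 200, 8, 150],
})

# One row per year, one column per crime type, cells hold incident counts.
# Types with no incidents in a given year are filled with 0.
counts_by_year = toy_year_type.pivot_table(
    index="Year",
    columns="Primary Type",
    values="count",
    aggfunc="sum",
    fill_value=0,
)

print(counts_by_year)
```

A table in this shape can be plotted directly, for example with `counts_by_year.plot(kind='line')`, to compare trends across crime types.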


So the dataframe consists of 6,795,046 crimes, which took place from 2001 up to 2019. To reduce computational cost during the data review, let's work with just the first 1000 incidents in this dataset.

In [9]:
# keep the first 1000 crimes from df_data_1 in the df_incidents dataframe
limit = 1000
df_incidents = df_data_1.iloc[0:limit, :]

In [10]:
df_incidents.shape

(1000, 30)
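
Taking the first 1000 rows keeps whatever ordering the file has, so the subset may cover only a narrow time window. If a more representative subset is wanted, a random sample is one alternative. The sketch below uses synthetic data sorted by year, and the seed value is arbitrary.

```python
import pandas as pd

# Synthetic frame sorted by year, standing in for the full crime dataset.
toy = pd.DataFrame({
    "Case Number": [f"C{i}" for i in range(5000)],
    "Year": [2001 + i // 278 for i in range(5000)],
})

# First 1000 rows: only the earliest years are represented
head_subset = toy.iloc[:1000]

# Random 1000 rows: drawn from across the whole frame, reproducible via the seed
random_subset = toy.sample(n=1000, random_state=42)

print(head_subset["Year"].nunique(), random_subset["Year"].nunique())
```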

Now that we have reduced the data, let's visualize where these crimes took place in the city of Chicago. We will use the default map style and initialize the zoom level to 12.

In [11]:
# Chicago latitude and longitude values
# Use the geopy library to get the latitude and longitude values of Chicago.
address = 'Chicago, IL, USA'
geolocator = Nominatim(user_agent="ON_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Chicago City are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Chicago City are 41.8755616, -87.6244212.


In [None]:
# create map and display it
Chicago_map = folium.Map(location=[latitude, longitude], zoom_start=12)

# display the map of Chicago
Chicago_map

Now let's superimpose the locations of the crimes onto the map. The way to do that in Folium is to create a feature group with its own features and style and then add it to `Chicago_map`.

In [None]:
# instantiate a feature group for the incidents in the dataframe
incidents = folium.FeatureGroup()

# loop through the first 1000 crimes (skipping rows without coordinates)
# and add each to the incidents feature group
located = df_incidents.dropna(subset=['Latitude', 'Longitude'])
for lat, lng in zip(located.Latitude, located.Longitude):
    incidents.add_child(
        folium.CircleMarker(
            [lat, lng],
            radius=5, # define how big you want the circle markers to be
            color='yellow',
            fill=True,
            fill_color='blue',
            fill_opacity=0.6
        )
    )

# add incidents to map
Chicago_map.add_child(incidents)