## Background
San Francisco, CA has one of the highest crime rates among US cities. Studies have suggested that the city's daily influx of tourists and commuters may contribute to this. While San Francisco ranks high in crime, it ranks low in arrests [1]. I therefore assumed that San Francisco might need a larger police force to combat crime. This project identifies the neighborhoods where a new police station could be built, based on how many incidents have been reported in each area from 2018 to present and how many venues that could act as crime magnets are located there [2]. Government budget and police deployment routes are not considered.

References:
[1] https://www.sfchronicle.com/bayarea/philmatier/article/SF-ranks-high-in-property-crime-while-it-ranks-14439369.php
[2] https://archive.attn.com/stories/3382/which-businesses-attract-crime

## Data I used in this project
1. Police Department Incident Reports: 2018 to Present (from https://data.sfgov.org/Public-Safety/Police-Department-Incident-Reports-2018-to-Present/wg3w-h783), used to see how many incidents have been reported in each neighborhood of San Francisco from 2018 to present.
2. Analysis Neighborhoods (from https://data.sfgov.org/Geographic-Locations-and-Boundaries/Analysis-Neighborhoods/p5b7-5n3h), used to map each analysis neighborhood index (key) to its neighborhood name (value).
3. Foursquare, used to count the venues in each neighborhood, especially those that might induce criminal activity; the filtering criteria are based on this article: https://archive.attn.com/stories/3382/which-businesses-attract-crime

## Methods

In [1]:
!pip install pandas
import pandas as pd
import numpy as np



#### Data from https://data.sfgov.org/Public-Safety/Police-Department-Incident-Reports-2018-to-Present/wg3w-h783

In [3]:
df = pd.read_csv('Police_Department_Incident_Reports__2018_to_Present.csv')
df.head()

Unnamed: 0,Incident Datetime,Incident Date,Incident Time,Incident Year,Incident Day of Week,Report Datetime,Row ID,Incident ID,Incident Number,CAD Number,...,SF Find Neighborhoods,Current Police Districts,Current Supervisor Districts,Analysis Neighborhoods,HSOC Zones as of 2018-06-05,OWED Public Spaces,Central Market/Tenderloin Boundary Polygon - Updated,Parks Alliance CPSI (27+TL sites),ESNCAG - Boundary File,"Areas of Vulnerability, 2016"
0,2018/01/01 09:26:00 AM,2018/01/01,09:26,2018,Monday,2018/01/01 09:27:00 AM,61893007041,618930,171052174,173641140.0,...,88.0,2.0,9.0,1.0,,,,,,2.0
1,2018/01/01 02:30:00 AM,2018/01/01,02:30,2018,Monday,2018/01/01 08:21:00 AM,61893105041,618931,180000768,180010668.0,...,90.0,9.0,1.0,7.0,,,,,,2.0
2,2018/01/01 10:00:00 AM,2018/01/01,10:00,2018,Monday,2018/01/01 10:20:00 AM,61893275000,618932,180000605,180010893.0,...,20.0,4.0,10.0,36.0,,,1.0,,,2.0
3,2018/01/01 10:03:00 AM,2018/01/01,10:03,2018,Monday,2018/01/01 10:04:00 AM,61893565015,618935,180000887,180011579.0,...,,9.0,1.0,28.0,,,,,,1.0
4,2018/01/01 09:01:00 AM,2018/01/01,09:01,2018,Monday,2018/01/01 09:39:00 AM,61893607041,618936,171052958,180011403.0,...,106.0,6.0,3.0,6.0,,,,,,2.0


In [4]:
df = df[['Incident Datetime', 'Incident ID','Analysis Neighborhoods']]
df.head()

Unnamed: 0,Incident Datetime,Incident ID,Analysis Neighborhoods
0,2018/01/01 09:26:00 AM,618930,1.0
1,2018/01/01 02:30:00 AM,618931,7.0
2,2018/01/01 10:00:00 AM,618932,36.0
3,2018/01/01 10:03:00 AM,618935,28.0
4,2018/01/01 09:01:00 AM,618936,6.0


In [5]:
df = df.groupby('Analysis Neighborhoods').count()
df = df.reset_index()
df.head()

Unnamed: 0,Analysis Neighborhoods,Incident Datetime,Incident ID
0,1.0,25571,25571
1,2.0,7539,7539
2,3.0,7046,7046
3,4.0,7292,7292
4,5.0,12651,12651
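
Because `count()` is applied to every remaining column, the output above repeats the same number twice. A minimal sketch of an equivalent count with `groupby(...).size()` follows, shown on a small made-up frame (the toy values and the `incident_counts` name are illustrative assumptions, not part of the original notebook):

```python
import pandas as pd

# Toy stand-in for the incident table; the values are invented for illustration.
toy = pd.DataFrame({
    'Incident ID': [1, 2, 3, 4, 5],
    'Analysis Neighborhoods': [1.0, 1.0, 2.0, None, 2.0],
})

# size() counts rows per group and yields a single, clearly named column.
# Rows with a missing neighborhood code are dropped by groupby by default.
incident_counts = (
    toy.groupby('Analysis Neighborhoods')
       .size()
       .reset_index(name='# of Incidents')
)
print(incident_counts)
```

This produces one count column per neighborhood code, which would also make the later rename of `Incident Datetime` unnecessary.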


#### Data from https://data.sfgov.org/Geographic-Locations-and-Boundaries/Analysis-Neighborhoods/p5b7-5n3h; to get the values (neighborhood name) of the keys (analysis neighborhood index)

In [6]:
data = pd.read_csv('Neighbor name.csv')
df = pd.merge(df, data, on="Analysis Neighborhoods")
df.head()

Unnamed: 0,Analysis Neighborhoods,Incident Datetime,Incident ID,Neighborhood Name
0,1.0,25571,25571,Bayview Hunters Point
1,2.0,7539,7539,Bernal Heights
2,3.0,7046,7046,Castro/Upper Market
3,4.0,7292,7292,Chinatown
4,5.0,12651,12651,Excelsior
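
The default inner merge silently drops any neighborhood code that has no matching name. One way to check for that, sketched here on toy data (the frames and names below are hypothetical), is a left merge with `indicator=True`:

```python
import pandas as pd

# Toy frames; code 99.0 deliberately has no matching name.
counts = pd.DataFrame({
    'Analysis Neighborhoods': [1.0, 2.0, 99.0],
    'Incident ID': [10, 20, 5],
})
names = pd.DataFrame({
    'Analysis Neighborhoods': [1.0, 2.0],
    'Neighborhood Name': ['Neighborhood A', 'Neighborhood B'],
})

# how='left' keeps every row of counts; the _merge column records
# whether a matching name was found for each row.
checked = pd.merge(counts, names, on='Analysis Neighborhoods',
                   how='left', indicator=True)
unmatched = checked[checked['_merge'] == 'left_only']
print(unmatched[['Analysis Neighborhoods', 'Incident ID']])
```

If `unmatched` is empty, the inner merge used in the notebook loses nothing.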


In [7]:
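import json
! pip install geopy 
from geopy.geocoders import Nominatim  # converts an address into latitude and longitude

import requests  # handles the HTTP requests to the Foursquare API
from pandas import json_normalize  # flattens JSON into a DataFrame

! pip install matplotlib
import matplotlib.cm as cm
import matplotlib.colors as colors

# the PyPI package is named scikit-learn; it is imported as sklearn
! pip install scikit-learn
from sklearn.cluster import KMeans

! pip install folium
import folium  # map rendering

print('Libraries imported.')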


#### Get the location (latitude and longitude) of each neighborhood

In [8]:
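# reload the raw incident data, since df now holds only the per-neighborhood counts
df_for_lalo = pd.read_csv('Police_Department_Incident_Reports__2018_to_Present.csv')
df_for_lalo = df_for_lalo[['Analysis Neighborhoods', 'Latitude', 'Longitude']]
# approximate each neighborhood's center as the mean of its incident coordinates
df_for_lalo = df_for_lalo.groupby('Analysis Neighborhoods').mean()
df_for_lalo = df_for_lalo.reset_index()
df_for_lalo.head()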

Unnamed: 0,Analysis Neighborhoods,Latitude,Longitude
0,1.0,37.732638,-122.390891
1,2.0,37.741431,-122.416346
2,3.0,37.769543,-122.444316
3,4.0,37.772379,-122.394286
4,5.0,37.76313,-122.43291


In [9]:
df_merged = pd.merge(df, df_for_lalo, on="Analysis Neighborhoods")
df_merged.shape

(41, 6)

In [10]:
address = 'San Francisco, CA'

geolocator = Nominatim(user_agent="sf_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of San Francisco are {}, {}.'.format(latitude, longitude))

The geographical coordinates of San Francisco are 37.7790262, -122.4199061.


In [11]:
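import os

# Foursquare credentials are read from environment variables instead of being hard-coded;
# the variable names FOURSQUARE_CLIENT_ID and FOURSQUARE_CLIENT_SECRET are placeholders
CLIENT_ID = os.environ['FOURSQUARE_CLIENT_ID']
CLIENT_SECRET = os.environ['FOURSQUARE_CLIENT_SECRET']
VERSION = '20180605'  # Foursquare API version date
LIMIT = 100  # maximum number of venues returned per request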

#### The function for getting the surrounding venues

In [14]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    """Return a DataFrame of the venues found within `radius` meters of each neighborhood center."""
    venues_list = []
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)

        # build the request URL for the Foursquare "explore" endpoint
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)

        # send the request and keep only the list of recommended venues
        results = requests.get(url).json()["response"]['groups'][0]['items']

        # one tuple per venue: neighborhood info, venue name, location, and first category
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    # flatten the per-neighborhood lists into a single table
    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    return nearby_venues
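
The function above indexes `v['venue']['categories'][0]` directly, which raises an error for any venue returned without a category. A minimal sketch of a more defensive row extraction is given below. It assumes the response layout used in `getNearbyVenues`; the helper name `venue_rows` and the sample items are hypothetical and no request is sent:

```python
def venue_rows(name, lat, lng, items):
    """Flatten 'explore' items into tuples, skipping venues without a category."""
    rows = []
    for item in items:
        venue = item.get('venue', {})
        categories = venue.get('categories') or []
        if not categories:
            # no category to classify this venue by, so leave it out
            continue
        location = venue.get('location', {})
        rows.append((
            name, lat, lng,
            venue.get('name'),
            location.get('lat'),
            location.get('lng'),
            categories[0]['name'],
        ))
    return rows


# Hand-written sample items that mimic the structure read by getNearbyVenues.
sample_items = [
    {'venue': {'name': 'Venue A',
               'location': {'lat': 37.78, 'lng': -122.41},
               'categories': [{'name': 'Dive Bar'}]}},
    {'venue': {'name': 'Venue B',
               'location': {'lat': 37.79, 'lng': -122.42},
               'categories': []}},
]
rows = venue_rows('Neighborhood A', 37.78, -122.41, sample_items)
print(rows)
```

Inside the loop of `getNearbyVenues`, such a helper could replace the list comprehension, for example `venues_list.append(venue_rows(name, lat, lng, results))`.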

In [15]:
sf_venues = getNearbyVenues(names=df_merged['Neighborhood Name'], latitudes=df_merged['Latitude'], longitudes=df_merged['Longitude'])

Bayview Hunters Point
Bernal Heights
Castro/Upper Market
Chinatown
Excelsior
Financial District/South Beach
Glen Park
Inner Richmond
Golden Gate Park
Haight Ashbury
Hayes Valley
Inner Sunset
Japantown
McLaren Park
Tenderloin
Lakeshore
Lincoln Park
Lone Mountain/USF
Marina
Russian Hill
Mission
Mission Bay
Nob Hill
Seacliff
Noe Valley
North Beach
Oceanview/Merced/Ingleside
South of Market
Sunset/Parkside
Outer Mission
Outer Richmond
Pacific Heights
Portola
Potrero Hill
Presidio
Presidio Heights
Treasure Island
Twin Peaks
Visitacion Valley
West of Twin Peaks
Western Addition


#### List the venue categories in San Francisco, then trim the dataset down to the places that might attract crime

In [16]:
sf_venues['Venue Category'].unique()

array(['African Restaurant', 'Mexican Restaurant',
       'Southern / Soul Food Restaurant', 'Theater', 'Gym',
       'Latin American Restaurant', 'Fried Chicken Joint', 'Bakery',
       'Pharmacy', 'Café', 'Bus Station', 'Light Rail Station',
       'Chinese Restaurant', 'Thrift / Vintage Store', 'Garden',
       'Health & Beauty Service', 'Park', 'Liquor Store', 'Restaurant',
       'Market', 'Gourmet Shop', 'Flower Shop', 'Playground',
       'Cocktail Bar', 'Coffee Shop', 'Butcher', 'Pet Store',
       'Asian Restaurant', 'Trail', 'Burger Joint', 'Italian Restaurant',
       'Gay Bar', 'Yoga Studio', 'Grocery Store', 'Scenic Lookout',
       'Caribbean Restaurant', 'Peruvian Restaurant', 'Pizza Place',
       'Gift Shop', 'New American Restaurant', 'Cosmetics Shop',
       'Dive Bar', 'Indian Restaurant', 'Art Gallery',
       'Japanese Curry Restaurant', 'Ramen Restaurant', 'Locksmith',
       'Bus Stop', 'Bus Line', 'Boutique', 'Shoe Store',
       'Vegetarian / Vegan Restaurant'

In [17]:
sf_onehot = pd.get_dummies(sf_venues[['Venue Category']], prefix="", prefix_sep="")
sf_onehot['Neighborhood'] = sf_venues['Neighborhood'] 
sf_grouped = sf_onehot.groupby('Neighborhood').sum()
sf_grouped

Neighborhood,Accessories Store,Adult Boutique,African Restaurant,Alternative Healer,American Restaurant,Antique Shop,Aquarium,Arcade,Art Gallery,Art Museum,...,Vietnamese Restaurant,Volleyball Court,Wagashi Place,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio,Zoo
Bayview Hunters Point,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Bernal Heights,0,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,1,0
Castro/Upper Market,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Chinatown,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
Excelsior,0,1,0,0,1,0,0,1,0,0,...,0,0,0,0,2,0,0,0,1,0
Financial District/South Beach,1,0,0,0,0,0,0,0,0,0,...,1,0,0,0,3,0,0,0,1,0
Glen Park,0,0,0,0,0,0,0,0,0,0,...,2,0,0,0,0,0,0,0,0,0
Golden Gate Park,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,4,2,1,0,1,0
Haight Ashbury,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Hayes Valley,0,0,0,0,1,0,0,0,1,0,...,3,0,0,0,1,0,0,0,0,0


In [18]:
print(sf_grouped.filter(like='Bar').columns)
print(sf_grouped.filter(like='Club').columns)
print(sf_grouped.filter(like='Liquor').columns)

Index(['Bar', 'Beer Bar', 'Cocktail Bar', 'Dive Bar', 'Gay Bar', 'Hotel Bar',
       'Juice Bar', 'Karaoke Bar', 'Salon / Barbershop', 'Sports Bar',
       'Tiki Bar', 'Whisky Bar', 'Wine Bar'],
      dtype='object')
Index(['Comedy Club', 'Jazz Club', 'Rock Club'], dtype='object')
Index(['Liquor Store'], dtype='object')


In [19]:
sf_refined = sf_grouped[['Bar', 'Beer Bar', 'Cocktail Bar', 'Dive Bar', 'Gay Bar', 'Hotel Bar',
       'Juice Bar', 'Karaoke Bar', 'Salon / Barbershop', 'Sports Bar',
       'Tiki Bar', 'Whisky Bar', 'Wine Bar', 'Comedy Club', 'Jazz Club', 'Rock Club', 'Liquor Store', 'Smoke Shop','Monument / Landmark',
'General Entertainment']]
sf_refined

Neighborhood,Bar,Beer Bar,Cocktail Bar,Dive Bar,Gay Bar,Hotel Bar,Juice Bar,Karaoke Bar,Salon / Barbershop,Sports Bar,Tiki Bar,Whisky Bar,Wine Bar,Comedy Club,Jazz Club,Rock Club,Liquor Store,Smoke Shop,Monument / Landmark,General Entertainment
Bayview Hunters Point,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0
Bernal Heights,0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0
Castro/Upper Market,1,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0
Chinatown,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
Excelsior,0,0,0,0,6,0,2,0,0,0,0,0,2,0,0,0,1,0,1,0
Financial District/South Beach,0,1,3,1,0,1,0,0,0,0,0,0,3,0,0,0,0,0,0,1
Glen Park,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
Golden Gate Park,1,0,2,1,0,0,2,0,0,0,0,0,4,0,0,0,3,0,0,0
Haight Ashbury,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
Hayes Valley,3,1,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
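
Substring matching with `filter(like='Bar')` also picks up `Juice Bar` and `Salon / Barbershop`, which are included in the selection above. If those categories are not meant to count as potential crime magnets, they could be excluded explicitly. A sketch on a toy frame follows; the exclusion set is an assumption for illustration, not the notebook's criterion:

```python
import pandas as pd

# Toy version of sf_grouped with a few invented counts.
toy_grouped = pd.DataFrame(
    {'Bar': [1, 0], 'Juice Bar': [2, 0], 'Salon / Barbershop': [0, 3],
     'Wine Bar': [0, 1], 'Park': [4, 4]},
    index=pd.Index(['Neighborhood A', 'Neighborhood B'], name='Neighborhood'),
)

# Columns whose name contains the substring 'Bar'.
bar_like = toy_grouped.filter(like='Bar').columns

# Assumed exclusions: names that match the substring but are not drinking venues.
excluded = {'Juice Bar', 'Salon / Barbershop'}
selected = [c for c in bar_like if c not in excluded]

# copy() gives an independent frame, so adding a column is safe.
toy_refined = toy_grouped[selected].copy()
toy_refined['sum'] = toy_refined.sum(axis=1)
print(toy_refined)
```

Applying such a filter to the real data would change the venue totals and possibly the clustering, so the results shown in this notebook correspond to the unfiltered selection.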


#### The number of 'crime magnet'-like venues in each neighborhood

In [20]:
# work on an explicit copy so that adding a column does not trigger pandas' SettingWithCopyWarning
sf_refined = sf_refined.copy()
sf_refined["sum"] = sf_refined.sum(axis=1)
sf_refined = sf_refined.reset_index()
sf_refined


Unnamed: 0,Neighborhood,Bar,Beer Bar,Cocktail Bar,Dive Bar,Gay Bar,Hotel Bar,Juice Bar,Karaoke Bar,Salon / Barbershop,...,Whisky Bar,Wine Bar,Comedy Club,Jazz Club,Rock Club,Liquor Store,Smoke Shop,Monument / Landmark,General Entertainment,sum
0,Bayview Hunters Point,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,1
1,Bernal Heights,0,0,1,1,1,0,0,0,0,...,0,0,0,0,0,1,0,0,0,4
2,Castro/Upper Market,1,0,0,1,0,0,0,0,0,...,0,0,0,1,0,0,1,0,0,4
3,Chinatown,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
4,Excelsior,0,0,0,0,6,0,2,0,0,...,0,2,0,0,0,1,0,1,0,12
5,Financial District/South Beach,0,1,3,1,0,1,0,0,0,...,0,3,0,0,0,0,0,0,1,10
6,Glen Park,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
7,Golden Gate Park,1,0,2,1,0,0,2,0,0,...,0,4,0,0,0,3,0,0,0,13
8,Haight Ashbury,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
9,Hayes Valley,3,1,0,1,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,6


In [21]:
df_merged = df_merged.rename(columns={"Neighborhood Name": "Neighborhood"})
sf_final = pd.merge(df_merged, sf_refined, on="Neighborhood")
sf_final = sf_final.rename(columns={"Incident Datetime": "# of Incidents", 'sum':'# of places attract crime'})
sf_final = sf_final[['Neighborhood', '# of Incidents', '# of places attract crime', 'Latitude', 'Longitude']]
sf_final.sort_values(by=['# of Incidents', '# of places attract crime'], ascending = False)

Unnamed: 0,Neighborhood,# of Incidents,# of places attract crime,Latitude,Longitude
19,Russian Hill,45605,9,37.76146,-122.416831
35,Presidio Heights,42418,15,37.783273,-122.414568
7,Inner Richmond,35045,7,37.789237,-122.40095
33,Potrero Hill,34762,7,37.778202,-122.407309
0,Bayview Hunters Point,25571,1,37.732638,-122.390891
38,Visitacion Valley,13422,3,37.782314,-122.42859
4,Excelsior,12651,12,37.76313,-122.43291
22,Nob Hill,12231,7,37.804907,-122.411072
34,Presidio,12222,2,37.749732,-122.491422
20,Mission,11918,15,37.790052,-122.416139


## Results
#### Cluster the neighborhoods based on the numbers of venues and the numbers of crime incidents

In [22]:
kclusters = 6
# cluster on the two count columns only; drop the name and the coordinates
sf_grouped_clustering = sf_final.drop(columns=['Neighborhood', 'Latitude', 'Longitude'])
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(sf_grouped_clustering)
# attach each neighborhood's cluster label as the first column
sf_final.insert(0, 'Cluster Labels', kmeans.labels_)
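
The incident counts are in the thousands while the venue counts stay below 20, so the Euclidean distances used by k-means are dominated by the incident column. One common remedy is to standardize both features before clustering. The sketch below uses `StandardScaler` on invented toy values; it is an alternative to consider, not the method used for the results in this notebook:

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Toy values on very different scales, invented for illustration.
toy = pd.DataFrame({
    '# of Incidents': [45000, 42000, 5000, 4500, 4000, 4800],
    '# of places attract crime': [9, 15, 1, 14, 0, 12],
})

# Rescale each column to zero mean and unit variance so that
# both features contribute comparably to the distance.
scaled = StandardScaler().fit_transform(toy)

# n_init is set explicitly because its default differs between scikit-learn versions.
toy_kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(scaled)
toy['Cluster Labels'] = toy_kmeans.labels_
print(toy)
```

With standardized features, neighborhoods with similar incident counts but very different venue counts can end up in different clusters.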

In [23]:
sf_final.sort_values(by=['Cluster Labels','# of Incidents', '# of places attract crime'])

Unnamed: 0,Cluster Labels,Neighborhood,# of Incidents,# of places attract crime,Latitude,Longitude
0,0,Bayview Hunters Point,25571,1,37.732638,-122.390891
14,1,Tenderloin,4041,5,37.785445,-122.432533
21,1,Mission Bay,4211,4,37.749006,-122.432284
23,1,Seacliff,4316,1,37.71753,-122.460352
11,1,Inner Sunset,4326,1,37.769687,-122.467917
24,1,Noe Valley,4718,0,37.727454,-122.40711
39,1,West of Twin Peaks,4730,0,37.712426,-122.412112
15,1,Lakeshore,4854,2,37.722032,-122.479855
13,1,McLaren Park,5089,2,37.761356,-122.465304
17,1,Lone Mountain/USF,5155,2,37.777948,-122.448764


#### Visualizing the clustering

In [24]:
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters: one rainbow color per cluster label
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add one circle marker per neighborhood, colored by its cluster
for lat, lon, poi, cluster in zip(sf_final['Latitude'], sf_final['Longitude'], sf_final['Neighborhood'], sf_final['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

## Discussion

Cluster 0 contains a single neighborhood (Bayview Hunters Point), where the number of incidents is high but the number of venues is low. Cluster 1 (19 neighborhoods) has a medium number of incidents and few venues. Clusters 2 and 5 both have high numbers of incidents and of venues. Cluster 3 has few incidents and few venues, while Cluster 4 has a medium-high number of incidents and a medium number of venues. Interestingly, the neighborhoods in Clusters 2 and 5, which combine extremely high incident counts with many venues, lie around downtown San Francisco. This is understandable, since downtown is the core financial area and attracts many tourists.
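
Descriptions such as "high", "medium", and "low" can be checked against a per-cluster summary. A minimal sketch on a toy frame with the same column names as `sf_final` (the values and neighborhood names are invented):

```python
import pandas as pd

# Toy stand-in for sf_final after the cluster labels were added.
toy_final = pd.DataFrame({
    'Cluster Labels': [0, 1, 1],
    'Neighborhood': ['Neighborhood A', 'Neighborhood B', 'Neighborhood C'],
    '# of Incidents': [25000, 4000, 5000],
    '# of places attract crime': [1, 5, 3],
})

# Named aggregation: cluster size plus the mean of each count column.
summary = toy_final.groupby('Cluster Labels').agg(
    neighborhoods=('Neighborhood', 'count'),
    mean_incidents=('# of Incidents', 'mean'),
    mean_venues=('# of places attract crime', 'mean'),
)
print(summary)
```

Running the same aggregation on `sf_final` would give one row per cluster to compare against the descriptions above.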

## Conclusion

If the city government of San Francisco would like to build a new police station as a deterrent to lower the crime rate, neighborhoods such as Presidio Heights, Russian Hill, Potrero Hill, or Inner Richmond would be good candidates. These neighborhoods belong to Clusters 2 and 5, which have many crime incidents and many places that might attract crime.