In [1]:
from bs4 import BeautifulSoup
from geopy.geocoders import Nominatim # module to convert an address into latitude and longitude values
import requests # library to handle requests
import pandas as pd # library for data analysis
import numpy as np # library to handle data in a vectorized manner
from datetime import datetime

# libraries for displaying images
from IPython.display import Image 
from IPython.core.display import HTML 
import matplotlib.cm as cm
import matplotlib.colors as colors
# transform a JSON file into a pandas dataframe
from pandas.io.json import json_normalize
import matplotlib.pyplot as plt
import seaborn as sns
import folium # plotting library
%matplotlib inline

# Capstone Project - The Battle of Neighborhoods (Week 2)

## Introduction/Business Problem

A friend of mine is extremely worried about violence. He is moving to another city, which does not provide statistics about the crime rate in its neighborhoods, and is wondering how he will be sure to choose the least violent neighborhood.

I had an idea. Based on the crime rate of the neighborhoods in the city of Chicago, I will develop a crime index for each neighborhood, use Foursquare data to get the number and type of venues in each neighborhood, train a model that relates the crime index to the number and type of venues, and apply this model to predict a crime index for each neighborhood in the city my friend is moving to.

## Data

To solve the problem above, I will be using the Crimes - Map dataset from the Chicago Data Portal (https://data.cityofchicago.org/Public-Safety/Crimes-Map/dfnk-7re6) in combination with Foursquare data. The variables of the Crimes - Map dataset that I will use are CASE#, PRIMARY DESCRIPTION, LATITUDE and LONGITUDE. CASE# is an identifier of the crime, PRIMARY DESCRIPTION gives the type of crime (for example, THEFT), and LATITUDE and LONGITUDE give the coordinates of the crime.

Based on the coordinates of the crime, I will assign it to a neighborhood of Chicago. A crime index will be created based on the number and type of crimes in each neighborhood. Then, Foursquare data will be used to describe the venues in each neighborhood. Foursquare data will also be used to get the number and type of venues in each neighborhood in the city my friend is moving to.

A crime index will be assigned to the neighborhoods of this new city based on the similarities between the number and type of venues in them and the number and type of venues in the neighborhoods in Chicago. My friend would probably choose the neighborhood with the smallest crime index.




## Methodology

Let's import the Crimes - Map dataset:

In [2]:
df_crimes = pd.read_csv('Crimes_-_Map.csv')
df_crimes.head(5)

Unnamed: 0,CASE#,DATE OF OCCURRENCE,BLOCK,IUCR,PRIMARY DESCRIPTION,SECONDARY DESCRIPTION,LOCATION DESCRIPTION,ARREST,DOMESTIC,BEAT,WARD,FBI CD,X COORDINATE,Y COORDINATE,LATITUDE,LONGITUDE,LOCATION
0,JD141525,02/05/2020 02:54:00 PM,030XX N HALSTED ST,860,THEFT,RETAIL THEFT,DRUG STORE,N,N,1933,44.0,6,,,,,
1,JD177980,03/08/2020 02:15:00 AM,064XX S DR MARTIN LUTHER KING JR DR,1330,CRIMINAL TRESPASS,TO LAND,APARTMENT,Y,N,312,20.0,26,1180028.0,1862391.0,41.777671,-87.615561,"(41.777670858, -87.61556066)"
2,JC497784,11/03/2019 11:40:00 AM,032XX N CLARK ST,860,THEFT,RETAIL THEFT,DEPARTMENT STORE,N,N,1924,44.0,6,,,,,
3,JD195928,03/21/2020 10:05:00 PM,019XX E 73RD PL,1153,DECEPTIVE PRACTICE,FINANCIAL IDENTITY THEFT OVER $ 300,,N,N,333,7.0,11,,,,,
4,JC497415,11/03/2019 04:30:00 AM,107XX S PEORIA ST,1320,CRIMINAL DAMAGE,TO VEHICLE,RESIDENTIAL YARD (FRONT/BACK),N,N,2233,34.0,14,,,,,


Next, we keep only the columns of interest, restricted to crimes from 2020.

The values from 'DATE OF OCCURRENCE' column are of type string. We just need the year.

In [3]:
df_crimes['DATE  OF OCCURRENCE'][0]

'02/05/2020 02:54:00 PM'

In [4]:
df_crimes['DATE  OF OCCURRENCE'][0][6:10]

'2020'

In [5]:
df_crimes = df_crimes[df_crimes['DATE  OF OCCURRENCE'].apply(lambda x: x[6:10]) == '2020']
df_crimes = df_crimes[['CASE#',' PRIMARY DESCRIPTION','LATITUDE','LONGITUDE']]
df_crimes = df_crimes.rename(columns={" PRIMARY DESCRIPTION": "PRIMARY DESCRIPTION"})
df_crimes.head(5)

Unnamed: 0,CASE#,PRIMARY DESCRIPTION,LATITUDE,LONGITUDE
0,JD141525,THEFT,,
1,JD177980,CRIMINAL TRESPASS,41.777671,-87.615561
3,JD195928,DECEPTIVE PRACTICE,,
5,JD160107,THEFT,,
7,JD178312,THEFT,,
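The slice `x[6:10]` relies on every date string having exactly the same layout. A slightly more defensive option is to parse the string and read the year from the result. The sketch below does this with the standard library; the helper name `year_of` is hypothetical, and the format string is inferred from the single sample value shown above.

```python
from datetime import datetime

# Format inferred from the sample value '02/05/2020 02:54:00 PM'
# (an assumption about the rest of the column).
DATE_FORMAT = '%m/%d/%Y %I:%M:%S %p'

def year_of(date_string):
    """Return the year of a 'MM/DD/YYYY HH:MM:SS AM/PM' string as an int."""
    return datetime.strptime(date_string, DATE_FORMAT).year

sample_year = year_of('02/05/2020 02:54:00 PM')
print(sample_year)
```

A string that does not follow the format raises a `ValueError` instead of silently producing a wrong year. With pandas, `pd.to_datetime(column, format=DATE_FORMAT).dt.year` should give the same result for a whole column.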


In [6]:
print('The number of observations is ', df_crimes.shape[0])

The number of observations is  52305


We have 52,305 observations, but many of them may have empty values for latitude and longitude.

In [7]:
df_crimes.isnull().sum()

CASE#                    0
PRIMARY DESCRIPTION      0
LATITUDE               329
LONGITUDE              329
dtype: int64

There are 329 observations with empty values for the coordinates. We need to remove them.

In [8]:
# drop rows without coordinates (passing both how= and thresh= fails on recent pandas versions)
df_crimes = df_crimes.dropna(subset=['LATITUDE', 'LONGITUDE']).reset_index(drop=True)
df_crimes.shape[0]

51976

Now we need to assign each crime observation to a neighborhood or community area of Chicago.

Let's get the coordinates of the first observation.

In [9]:
i = 0
latitude = df_crimes.at[i,'LATITUDE']
longitude = df_crimes.at[i,'LONGITUDE']
print('For observation number ', i, '. Latitude is ', latitude, 'and longitude is ', longitude, '.')

For observation number  0 . Latitude is  41.777670858 and longitude is  -87.61556066 .


We will use the Boundary Service API (for more information, see http://boundaries.tribapps.com/api/).

In [10]:
url = f'http://boundaries.tribapps.com/1.0/boundary/?contains={latitude},{longitude}&sets=community-areas'
url

'http://boundaries.tribapps.com/1.0/boundary/?contains=41.777670858,-87.61556066&sets=community-areas'
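Since the same URL is built again later for every observation, it can help to wrap the construction in a small function. This is only a sketch: the helper name `boundary_url` is hypothetical, and the endpoint and parameter names are copied from the URL above.

```python
from urllib.parse import urlencode

BASE_URL = 'http://boundaries.tribapps.com/1.0/boundary/'

def boundary_url(latitude, longitude, boundary_set='community-areas'):
    """Build the Boundary Service query URL for one coordinate pair."""
    params = {'contains': f'{latitude},{longitude}', 'sets': boundary_set}
    # safe=',' keeps the comma between the two coordinates unescaped
    return BASE_URL + '?' + urlencode(params, safe=',')

example_url = boundary_url(41.777670858, -87.61556066)
print(example_url)
```

Letting `urlencode` assemble the query string avoids formatting mistakes if more parameters are added later.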

The query above will provide the following result:

In [11]:
results = requests.get(url).json()
results

{'meta': {'limit': 20,
  'next': None,
  'offset': 0,
  'previous': None,
  'total_count': 1},
 'objects': [{'centroid': {'coordinates': [-87.594925, 41.778876],
    'type': 'Point'},
   'external_id': '42',
   'kind': 'Community Area',
   'metadata': {'AREA': 0.0,
    'AREA_NUMBE': '42',
    'AREA_NUM_1': '42',
    'COMAREA_': 0,
    'COMAREA_ID': 0,
    'COMMUNITY': 'WOODLAWN',
    'PERIMETER': 0.0,
    'SHAPE_AREA': 57815179.512,
    'SHAPE_LEN': 46936.9592443},
   'name': 'Woodlawn',
   'resource_uri': '/1.0/boundary/woodlawn-community-area/',
   'set': '/1.0/boundary-set/community-areas/',
   'simple_shape': {'coordinates': [[[[-87.577145, 41.786146],
       [-87.576046, 41.783598],
       [-87.573101, 41.782134],
       [-87.570497, 41.781371],
       [-87.569683, 41.781297],
       [-87.568397, 41.781608],
       [-87.567644, 41.784622],
       [-87.567532, 41.784597],
       [-87.568306, 41.781539],
       [-87.57308, 41.780251],
       [-87.572954, 41.779824],
       [-87.5741

The information that matters for this exercise is the name of the community:

In [12]:
results['objects'][0]['name']

'Woodlawn'
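Indexing with `['objects'][0]` fails when the service returns no matching boundary. A minimal sketch of a more tolerant lookup is shown below; the function name `community_name` is hypothetical, and the payload is a shortened copy of the response printed above.

```python
def community_name(payload, default=None):
    """Return the name of the first boundary in a response, or `default` if there is none."""
    objects = payload.get('objects') or []
    if not objects:
        return default
    return objects[0].get('name', default)

# Shortened version of the response shown above
toy_payload = {'meta': {'total_count': 1},
               'objects': [{'kind': 'Community Area', 'name': 'Woodlawn'}]}

print(community_name(toy_payload))
print(community_name({'meta': {'total_count': 0}, 'objects': []}, default='ERROR'))
```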

The dataframe has 51,976 observations. I tried to get the community name for each of those, but it was taking too long to complete, so I decided to take a sample of 2,000 crimes.

In [13]:
sample_cases = list(df_crimes['CASE#'].sample(n=2000, random_state=4))
df_crimes_sample = df_crimes[df_crimes['CASE#'].isin(sample_cases)].copy()
df_crimes_sample.head(5)

Unnamed: 0,CASE#,PRIMARY DESCRIPTION,LATITUDE,LONGITUDE
56,JD192044,BATTERY,41.790569,-87.623986
131,JD121589,THEFT,41.885557,-87.653649
145,JD121019,WEAPONS VIOLATION,41.855959,-87.721125
146,JD192370,BATTERY,41.88321,-87.634336
230,JD191833,ASSAULT,41.947298,-87.651034
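Sampling is one way around the slow per-observation requests. Another option, not used in this notebook, would be to download each community boundary once and test the points locally. The sketch below shows the classic ray-casting test for a single ring of `[longitude, latitude]` vertices, the same ordering as the `simple_shape` coordinates in the response above. The function name and the rectangle are made up for illustration, and holes or multi-part shapes are not handled.

```python
def point_in_polygon(lng, lat, ring):
    """Ray-casting test: is (lng, lat) inside the ring of [lng, lat] vertices?"""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Does a horizontal ray starting at the point cross this edge?
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lng < x_cross:
                inside = not inside
    return inside

# Made-up rectangle around the first observation, for illustration only
toy_ring = [[-87.63, 41.77], [-87.60, 41.77], [-87.60, 41.79], [-87.63, 41.79]]

print(point_in_polygon(-87.61556066, 41.777670858, toy_ring))
print(point_in_polygon(-87.70, 41.777670858, toy_ring))
```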


We need to get the name of the communities for all observations in this sample.

In [14]:
progress_marks = {1, 10, 50, 100, 500, 750, 1000, 1250, 1500, 1750}
for count, observation in enumerate(sample_cases, start=1):
    # rows of the sample that belong to this case
    mask = df_crimes_sample['CASE#'] == observation
    latitude = df_crimes_sample.loc[mask, 'LATITUDE'].item()
    longitude = df_crimes_sample.loc[mask, 'LONGITUDE'].item()
    url = f'http://boundaries.tribapps.com/1.0/boundary/?contains={latitude},{longitude}&sets=community-areas'
    try:
        results = requests.get(url, timeout=10).json()
        community = results['objects'][0]['name']
    except (requests.RequestException, ValueError, KeyError, IndexError):
        # network failure, invalid JSON, or no community area found for this point
        community = 'ERROR'
    df_crimes_sample.loc[mask, 'Community'] = community
    if count in progress_marks:
        print('Observation ', count, ' done.')
df_crimes_sample.head(10)
    

Observation  1  done.
Observation  10  done.
Observation  50  done.
Observation  100  done.
Observation  500  done.
Observation  750  done.
Observation  1000  done.
Observation  1250  done.
Observation  1500  done.
Observation  1750  done.


Unnamed: 0,CASE#,PRIMARY DESCRIPTION,LATITUDE,LONGITUDE,Community
56,JD192044,BATTERY,41.790569,-87.623986,Washington Park
131,JD121589,THEFT,41.885557,-87.653649,Near West Side
145,JD121019,WEAPONS VIOLATION,41.855959,-87.721125,North Lawndale
146,JD192370,BATTERY,41.88321,-87.634336,Loop
230,JD191833,ASSAULT,41.947298,-87.651034,Lake View
232,JD121500,OTHER OFFENSE,41.751719,-87.566914,South Shore
259,JD194613,DECEPTIVE PRACTICE,41.904982,-87.678445,West Town
309,JD191733,ASSAULT,41.83877,-87.653161,Bridgeport
421,JD105335,THEFT,41.732214,-87.625329,Chatham
426,JD104187,OFFENSE INVOLVING CHILDREN,41.739424,-87.663766,Auburn Gresham


In [15]:
df_crimes_sample['Community'].value_counts()

Austin                    104
Near North Side            79
North Lawndale             74
Humboldt Park              72
Near West Side             69
South Shore                68
West Town                  67
Greater Grand Crossing     66
Loop                       66
Auburn Gresham             59
Englewood                  54
Chicago Lawn               51
Roseland                   51
West Englewood             49
West Garfield Park         49
Chatham                    41
Lake View                  41
Belmont Cragin             39
Logan Square               39
South Chicago              35
South Lawndale             34
East Garfield Park         34
Grand Boulevard            33
West Pullman               33
Lincoln Park               32
New City                   31
Rogers Park                29
West Ridge                 27
Uptown                     27
Gage Park                  25
                         ... 
Calumet Heights            12
Clearing                   11
Kenwood   

There are 77 communities in our sample, and some of them have fewer than 10 crimes. We will drop those communities, along with any observation whose lookup failed ('ERROR').

In [16]:
# keep communities with at least 10 crimes and discard failed lookups
counts = df_crimes_sample['Community'].value_counts()
keep = counts[counts >= 10].index.difference(['ERROR'])
# The dataframe will be called just df from this point forward
df = df_crimes_sample[df_crimes_sample['Community'].isin(keep)].reset_index(drop = True)
df['Community'].value_counts()

Austin                    104
Near North Side            79
North Lawndale             74
Humboldt Park              72
Near West Side             69
South Shore                68
West Town                  67
Greater Grand Crossing     66
Loop                       66
Auburn Gresham             59
Englewood                  54
Chicago Lawn               51
Roseland                   51
West Garfield Park         49
West Englewood             49
Chatham                    41
Lake View                  41
Logan Square               39
Belmont Cragin             39
South Chicago              35
East Garfield Park         34
South Lawndale             34
West Pullman               33
Grand Boulevard            33
Lincoln Park               32
New City                   31
Rogers Park                29
Uptown                     27
West Ridge                 27
Gage Park                  25
Woodlawn                   25
Douglas                    23
Edgewater                  23
Irving Par

Now we have to construct our crime index. First, let's see the type of crimes that occurred in our sample.

In [17]:
df['PRIMARY DESCRIPTION'].value_counts()

THEFT                                423
BATTERY                              391
CRIMINAL DAMAGE                      175
ASSAULT                              148
OTHER OFFENSE                        120
NARCOTICS                            112
DECEPTIVE PRACTICE                   108
ROBBERY                               70
BURGLARY                              66
MOTOR VEHICLE THEFT                   64
CRIMINAL TRESPASS                     51
WEAPONS VIOLATION                     46
OFFENSE INVOLVING CHILDREN            22
INTERFERENCE WITH PUBLIC OFFICER      11
CRIM SEXUAL ASSAULT                   10
SEX OFFENSE                            9
CRIMINAL SEXUAL ASSAULT                8
ARSON                                  6
CONCEALED CARRY LICENSE VIOLATION      6
PUBLIC PEACE VIOLATION                 4
LIQUOR LAW VIOLATION                   3
HOMICIDE                               2
PROSTITUTION                           2
KIDNAPPING                             1
INTIMIDATION    

I am building this model for my friend Horace, so the crime index will reflect his preferences. Suppose I showed him the previous list of crimes and asked him to score each one, with the worst crimes receiving the highest scores. He gave me the following list:

In [18]:
scores = {
    'THEFT' : 5,
    'BATTERY': 7,
    'CRIMINAL DAMAGE': 6,
    'ASSAULT': 6,
    'OTHER OFFENSE': 1,
    'NARCOTICS': 4,
    'DECEPTIVE PRACTICE': 2,
    'ROBBERY': 6,
    'BURGLARY': 7,
    'MOTOR VEHICLE THEFT': 5,
    'CRIMINAL TRESPASS': 3,
    'WEAPONS VIOLATION': 3,
    'OFFENSE INVOLVING CHILDREN': 8,
    'INTERFERENCE WITH PUBLIC OFFICER': 4,
    'CRIM SEXUAL ASSAULT': 8,
    'SEX OFFENSE': 7,
    'CRIMINAL SEXUAL ASSAULT': 8,
    'ARSON': 6,
    'CONCEALED CARRY LICENSE VIOLATION': 3,
    'PUBLIC PEACE VIOLATION': 3,
    'LIQUOR LAW VIOLATION': 1,
    'PROSTITUTION': 4,
    'HOMICIDE': 10,
    'KIDNAPPING': 9.5,
    'STALKING': 7,
    'INTIMIDATION': 4    
}
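Because the score is looked up by crime type, a type that is missing from the dictionary would end up with an error or without a score. A quick check such as the sketch below can catch that before the scores are assigned. The helper name and the toy values are hypothetical.

```python
def unscored_types(crime_types, scores):
    """Return the crime types that have no entry in the scores dictionary."""
    return sorted(set(crime_types) - set(scores))

# Toy data for illustration only
toy_scores = {'THEFT': 5, 'BATTERY': 7}
toy_types = ['THEFT', 'BATTERY', 'THEFT', 'GAMBLING']

missing = unscored_types(toy_types, toy_scores)
print(missing)
```

In the notebook, the same check would be `unscored_types(df['PRIMARY DESCRIPTION'], scores)`, which should return an empty list.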

Let's add a column with this score to the dataframe.

In [19]:
# look up the score of each crime type (a type missing from `scores` would become NaN)
df['Score'] = df['PRIMARY DESCRIPTION'].map(scores).astype(float)
df.head(5)

Unnamed: 0,CASE#,PRIMARY DESCRIPTION,LATITUDE,LONGITUDE,Community,Score
0,JD192044,BATTERY,41.790569,-87.623986,Washington Park,7.0
1,JD121589,THEFT,41.885557,-87.653649,Near West Side,5.0
2,JD121019,WEAPONS VIOLATION,41.855959,-87.721125,North Lawndale,3.0
3,JD192370,BATTERY,41.88321,-87.634336,Loop,7.0
4,JD191833,ASSAULT,41.947298,-87.651034,Lake View,6.0


Now we keep just the columns of interest, count the crimes in each community, and group by community name to get the average score of each community.

In [20]:
df = df[['Community','LATITUDE','LONGITUDE','Score']].copy()
# number of crimes in each community, repeated on every row of that community
df['Number of crimes'] = df['Community'].map(df['Community'].value_counts()).astype(float)
df = df.groupby(['Community']).mean().sort_values(by='Score', ascending=False).reset_index()
df.head(10)

Unnamed: 0,Community,LATITUDE,LONGITUDE,Score,Number of crimes
0,South Deering,41.715227,-87.573128,6.4,10.0
1,Clearing,41.777899,-87.765665,6.363636,11.0
2,Irving Park,41.952855,-87.716304,6.090909,22.0
3,Calumet Heights,41.730932,-87.574564,6.0,12.0
4,Lower West Side,41.852999,-87.667109,5.631579,19.0
5,South Chicago,41.741429,-87.554962,5.628571,35.0
6,Woodlawn,41.778553,-87.602723,5.6,25.0
7,Chicago Lawn,41.774654,-87.693067,5.490196,51.0
8,West Ridge,41.999191,-87.693149,5.481481,27.0
9,South Lawndale,41.846049,-87.709297,5.470588,34.0


Communities with a higher number of crimes should have a higher crime index. Therefore, let's create a weighting factor between 0 and 1, where 1 is given to the community with the highest number of cases. The weighted score is then the product of the score and the weighting factor.

In [21]:
df['Weighted Score'] = df['Score']*df['Number of crimes']/df['Number of crimes'].max()
df.head(3)

Unnamed: 0,Community,LATITUDE,LONGITUDE,Score,Number of crimes,Weighted Score
0,South Deering,41.715227,-87.573128,6.4,10.0,0.615385
1,Clearing,41.777899,-87.765665,6.363636,11.0,0.673077
2,Irving Park,41.952855,-87.716304,6.090909,22.0,1.288462
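Written out, with $\bar{S}_c$ the average score of community $c$ and $N_c$ its number of crimes in the sample (symbols introduced here only for illustration), the calculation above is:

$$
W_c = \bar{S}_c \cdot \frac{N_c}{\max_k N_k}
$$

For South Deering this gives $W = 6.4 \times 10 / 104 \approx 0.615$, which matches the first row above; 104 is the crime count of Austin, the largest in the sample. Since $\bar{S}_c \, N_c$ is the sum of the scores in community $c$, the weighted score is simply the community's total score divided by 104.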


The weighted score is the crime index we will use, so let's sort the communities by it.

In [22]:
df = df.sort_values('Weighted Score', ascending = False).reset_index(drop = True)
df.head(5)

Unnamed: 0,Community,LATITUDE,LONGITUDE,Score,Number of crimes,Weighted Score
0,Austin,41.888931,-87.759193,5.221154,104.0,5.221154
1,Near North Side,41.896855,-87.631069,5.113924,79.0,3.884615
2,North Lawndale,41.861835,-87.717354,5.027027,74.0,3.576923
3,Humboldt Park,41.89964,-87.718668,5.138889,72.0,3.557692
4,South Shore,41.762242,-87.573253,5.235294,68.0,3.423077


Now we need to get the venues of each community. To accomplish this, we will be using the Foursquare API.

Define Foursquare Credentials and Version:

In [23]:
import os

# Credentials are read from environment variables so that they are not stored
# in the notebook (the variable names below are an assumption).
CLIENT_ID = os.environ['FOURSQUARE_CLIENT_ID'] # your Foursquare ID
CLIENT_SECRET = os.environ['FOURSQUARE_CLIENT_SECRET'] # your Foursquare Secret
VERSION = '20191231' # Foursquare API version

print('Foursquare credentials loaded.')

Foursquare credentials loaded.


Let's see how to query the Foursquare database and what the result looks like.

In [24]:
latitude = 41.790569
longitude = -87.623986
LIMIT = 100 # limit of number of venues returned by Foursquare API

url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            latitude, 
            longitude,
            500, #RADIUS 
            LIMIT)
results = requests.get(url).json()
results
    

{'meta': {'code': 200, 'requestId': '5e9b7b73c8cff200267e090d'},
 'response': {'headerLocation': 'Washington Park',
  'headerFullLocation': 'Washington Park, Chicago',
  'headerLocationGranularity': 'neighborhood',
  'totalResults': 4,
  'suggestedBounds': {'ne': {'lat': 41.7950690045, 'lng': -87.61796173952443},
   'sw': {'lat': 41.786068995499996, 'lng': -87.63001026047557}},
  'groups': [{'type': 'Recommended Places',
    'name': 'recommended',
    'items': [{'reasons': {'count': 0,
       'items': [{'summary': 'This spot is popular',
         'type': 'general',
         'reasonName': 'globalInteractionReason'}]},
      'venue': {'id': '4c781b252d3ba143bccb89d0',
       'name': "Church's Chicken",
       'location': {'address': '6 W 59th St',
        'lat': 41.787655383849675,
        'lng': -87.62595021050572,
        'labeledLatLngs': [{'label': 'display',
          'lat': 41.787655383849675,
          'lng': -87.62595021050572},
         {'label': '?', 'lat': 41.787586, 'lng': -8

The function below gets the name and category of up to 100 venues within the given radius of each neighborhood's coordinates.

In [25]:
LIMIT = 100 # limit of number of venues returned by Foursquare API

def getNearbyVenues(names, latitudes, longitudes, radius=1000):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Category']
    
    return(nearby_venues)
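The expression `v['venue']['categories'][0]['name']` assumes that every venue has at least one category. A minimal sketch of the row-building step that tolerates venues without categories is shown below. The helper name `venue_rows`, the `'Unknown'` label and the toy items are assumptions made for illustration, not part of the Foursquare response.

```python
def venue_rows(name, lat, lng, items):
    """Turn the 'items' of an explore response into (neighborhood, lat, lng, venue, category) rows."""
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories') or []
        # fall back to a placeholder label when a venue has no category
        category = categories[0]['name'] if categories else 'Unknown'
        rows.append((name, lat, lng, venue['name'], category))
    return rows

# Toy items with made-up categories, for illustration only
toy_items = [
    {'venue': {'name': "Church's Chicken", 'categories': [{'name': 'Fried Chicken Joint'}]}},
    {'venue': {'name': 'Unnamed Spot', 'categories': []}},
]

rows = venue_rows('Washington Park', 41.790569, -87.623986, toy_items)
print(rows)
```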

In [26]:
venues = getNearbyVenues(names=df['Community'],
                                   latitudes=df['LATITUDE'],
                                   longitudes=df['LONGITUDE']
                                  )

Austin
Near North Side
North Lawndale
Humboldt Park
South Shore
West Town
Loop
Near West Side
Greater Grand Crossing
Auburn Gresham
Englewood
Chicago Lawn
Roseland
West Englewood
West Garfield Park
Logan Square
Chatham
Lake View
South Chicago
South Lawndale
Belmont Cragin
Lincoln Park
East Garfield Park
West Pullman
Grand Boulevard
New City
West Ridge
Uptown
Rogers Park
Woodlawn
Irving Park
Gage Park
Douglas
Edgewater
Washington Park
Lower West Side
Albany Park
Washington Heights
Brighton Park
Avondale
Lincoln Square
Portage Park
Garfield Ridge
Calumet Heights
Near South Side
Hyde Park
Clearing
Armour Square
South Deering
Kenwood
Morgan Park
West Lawn
Bridgeport


Check how many venues were returned for each neighborhood.

In [27]:
venues.groupby('Neighborhood').count()['Venue Category']

Neighborhood
Albany Park                95
Armour Square              76
Auburn Gresham             23
Austin                     25
Avondale                  100
Belmont Cragin             48
Bridgeport                 78
Brighton Park              41
Calumet Heights            44
Chatham                    48
Chicago Lawn               44
Clearing                   28
Douglas                    47
East Garfield Park         30
Edgewater                 100
Englewood                  22
Gage Park                  46
Garfield Ridge             49
Grand Boulevard            46
Greater Grand Crossing     49
Humboldt Park              31
Hyde Park                 100
Irving Park                94
Kenwood                    63
Lake View                 100
Lincoln Park               94
Lincoln Square            100
Logan Square              100
Loop                      100
Lower West Side            95
Morgan Park                12
Near North Side           100
Near South Side           1

In [28]:
print('There are {} unique categories.'.format(venues['Venue Category'].nunique()))

There are 307 unique categories.


We will need to count the frequency of each venue category.

In [29]:
# one hot encoding
venues_onehot = pd.get_dummies(venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
venues_onehot['Neighborhood'] = venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [venues_onehot.columns[-1]] + list(venues_onehot.columns[:-1])
venues_onehot = venues_onehot[fixed_columns]

venues_grouped = venues_onehot.groupby('Neighborhood').sum().reset_index()
venues_grouped.head()

Unnamed: 0,Neighborhood,ATM,Accessories Store,Afghan Restaurant,African Restaurant,Airport,Airport Food Court,Airport Terminal,American Restaurant,Amphitheater,...,Vietnamese Restaurant,Vineyard,Warehouse Store,Weight Loss Center,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,Albany Park,0,1,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
1,Armour Square,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Auburn Gresham,0,0,0,0,0,0,0,3,0,...,0,0,0,0,0,0,0,0,0,0
3,Austin,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Avondale,1,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
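The one-hot encoding followed by a grouped sum produces a table of counts per neighborhood and category. As a cross-check on a toy frame (made-up data, not the Foursquare results), `pd.crosstab` builds the same kind of count table in one step:

```python
import pandas as pd

# Toy data for illustration only
toy_venues = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'A', 'B'],
    'Venue Category': ['Bar', 'Bar', 'Bakery', 'Bar'],
})

# rows: neighborhoods, columns: venue categories, values: number of venues
toy_counts = pd.crosstab(toy_venues['Neighborhood'], toy_venues['Venue Category'])
print(toy_counts)
```

Unlike `venues_grouped`, the result keeps the neighborhood as the index, so a `reset_index()` would be needed to get the same layout.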


The venue categories will be classified into a few generic groups. Let's first see all the columns.

In [30]:
print(list(venues_grouped.columns))

['Neighborhood', 'ATM', 'Accessories Store', 'Afghan Restaurant', 'African Restaurant', 'Airport', 'Airport Food Court', 'Airport Terminal', 'American Restaurant', 'Amphitheater', 'Animal Shelter', 'Antique Shop', 'Arcade', 'Arepa Restaurant', 'Argentinian Restaurant', 'Art Gallery', 'Art Museum', 'Arts & Crafts Store', 'Asian Restaurant', 'Athletics & Sports', 'Automotive Shop', 'BBQ Joint', 'Bagel Shop', 'Bakery', 'Bank', 'Bar', 'Baseball Field', 'Baseball Stadium', 'Basketball Court', 'Beach', 'Bed & Breakfast', 'Beer Bar', 'Beer Garden', 'Big Box Store', 'Bike Shop', 'Bistro', 'Board Shop', 'Boat or Ferry', 'Bookstore', 'Boutique', 'Bowling Alley', 'Brazilian Restaurant', 'Breakfast Spot', 'Brewery', 'Bubble Tea Shop', 'Building', 'Burger Joint', 'Burmese Restaurant', 'Burrito Place', 'Bus Line', 'Bus Station', 'Business Service', 'Butcher', 'Cafeteria', 'Café', 'Cajun / Creole Restaurant', 'Camera Store', 'Candy Store', 'Caribbean Restaurant', 'Check Cashing Service', 'Cheese Shop

The groups will be the following:

In [31]:
category_list_0 = list(venues_grouped.columns)[1:]

# Group names, in the order in which they will later appear as columns
group_names = ['ATMs', 'Restaurants', 'Athletics & Sports', 'Train or Subway Stations', 'Stores',
               'Entertainment', 'Arts & Culture', 'Bakery', 'Banking', 'Bars',
               'Gymnasiums and Courts', 'Hotels and Hospitality', 'Services',
               'Bus Lines and Stations', 'Gas Stations', 'Stadiums and Concert Halls',
               'Hospitals and Clinics', 'Laundry', 'Personal Care', 'Parks & Public Places',
               'Shops', 'Markets', 'Vacation', 'College', 'Water', 'Industry', 'Other']
groups = {name: [] for name in group_names}

# Keyword rules, checked in this order: the first rule with a keyword contained
# in the lower-cased category name wins; anything unmatched goes to 'Other'.
rules = [
    ('Restaurants', ['restaurant', 'food truck', 'steakhouse', 'breakfast', 'bistro', 'diner', 'place', 'joint']),
    ('Stores', ['store', 'boutique']),
    ('ATMs', ['atm']),
    ('College', ['college']),
    ('Arts & Culture', ['art', 'historic', 'exhibit', 'music', 'planetarium', 'cultural', 'museum']),
    ('Athletics & Sports', ['gym', 'skat', 'yoga', 'athletics', 'run', 'cycle', 'bike']),
    ('Stadiums and Concert Halls', ['stadium', 'opera', 'hockey', 'concert', 'theater']),
    ('Gas Stations', ['gas station']),
    ('Parks & Public Places', ['park', 'preserve', 'memorial', 'sculpture', 'field', 'scenic', 'trail', 'track',
                               'garden', 'playground', 'recreation', 'pool', 'fountain', 'library']),
    ('Bakery', ['bakery']),
    ('Banking', ['bank', 'exchange']),
    ('Personal Care', ['salon', 'tattoo', 'spa', 'massage']),
    # 'music' is already captured by 'Arts & Culture' above
    ('Entertainment', ['club', 'circus', 'arcade', 'lounge', 'bowling', 'entertainment', 'dance']),
    ('Bars', ['bar', 'pub']),
    ('Hospitals and Clinics', ['hospital', 'doctor', 'pharmacy']),
    ('Hotels and Hospitality', ['hotel', 'hostel']),
    ('Laundry', ['laundromat', 'dry cleaner']),
    ('Gymnasiums and Courts', ['court']),
    ('Markets', ['market', 'deli', 'butcher', 'grocery']),
    ('Train or Subway Stations', ['subway', 'train', 'ferry', 'metro', 'rail']),
    ('Bus Lines and Stations', ['bus line', 'bus station']),
    ('Shops', ['shop', 'brewery', 'dealership', 'plaza', 'cafe', 'café', 'food']),
    ('Vacation', ['travel', 'rental', 'tour', 'vacation']),
    ('Services', ['service', 'insurance', 'veterinarian', 'newsstand', 'locksmith']),
    ('Water', ['river', 'waterfront', 'harbor', 'marina', 'lake']),
    ('Industry', ['factory', 'industry', 'farm']),
]

for category in category_list_0:
    name = category.lower()
    group = next((g for g, keywords in rules if any(k in name for k in keywords)), 'Other')
    groups[group].append(category)
groups

{'ATMs': ['ATM'],
 'Restaurants': ['Afghan Restaurant',
  'African Restaurant',
  'American Restaurant',
  'Arepa Restaurant',
  'Argentinian Restaurant',
  'Asian Restaurant',
  'BBQ Joint',
  'Bed & Breakfast',
  'Bistro',
  'Brazilian Restaurant',
  'Breakfast Spot',
  'Burger Joint',
  'Burmese Restaurant',
  'Burrito Place',
  'Cajun / Creole Restaurant',
  'Caribbean Restaurant',
  'Chinese Restaurant',
  'Comfort Food Restaurant',
  'Cuban Restaurant',
  'Dim Sum Restaurant',
  'Diner',
  'Eastern European Restaurant',
  'Ethiopian Restaurant',
  'Falafel Restaurant',
  'Fast Food Restaurant',
  'Filipino Restaurant',
  'Food Truck',
  'French Restaurant',
  'Fried Chicken Joint',
  'German Restaurant',
  'Greek Restaurant',
  'Hawaiian Restaurant',
  'Hot Dog Joint',
  'Hotpot Restaurant',
  'Indian Restaurant',
  'Israeli Restaurant',
  'Italian Restaurant',
  'Japanese Restaurant',
  'Jewish Restaurant',
  'Korean Restaurant',
  'Latin American Restaurant',
  'Malay Restauran

Now it's necessary to create a dataframe with these new venue groups.

First, create a new dataframe with one zero-initialized column per group.

In [32]:
df_categories = venues_grouped['Neighborhood'].to_frame()
for key in list(groups.keys()):
    df_categories[key] = 0
df_categories.head()

Unnamed: 0,Neighborhood,ATMs,Restaurants,Athletics & Sports,Train or Subway Stations,Stores,Entertainment,Arts & Culture,Bakery,Banking,...,Laundry,Personal Care,Parks & Public Places,Shops,Markets,Vacation,College,Water,Industry,Other
0,Albany Park,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Armour Square,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Auburn Gresham,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Austin,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Avondale,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Now fill in the number of venues in each group for each neighborhood.

In [33]:
for neighborhood in list(df_categories['Neighborhood']):
    for column in list(venues_grouped.columns)[1:]:
        for key, value in groups.items():
            for item in value:
                if (item == column) and venues_grouped.loc[venues_grouped['Neighborhood'] == neighborhood, column].item() > 0:
                    df_categories.loc[df_categories['Neighborhood'] == neighborhood, key] = df_categories.loc[df_categories['Neighborhood'] == neighborhood, key] + venues_grouped.loc[venues_grouped['Neighborhood'] == neighborhood, column].item()
df_categories

Unnamed: 0,Neighborhood,ATMs,Restaurants,Athletics & Sports,Train or Subway Stations,Stores,Entertainment,Arts & Culture,Bakery,Banking,...,Laundry,Personal Care,Parks & Public Places,Shops,Markets,Vacation,College,Water,Industry,Other
0,Albany Park,0,44,0,2,19,0,0,3,2,...,0,0,3,13,1,0,0,0,0,0
1,Armour Square,0,44,3,0,2,1,0,2,0,...,0,0,3,9,1,1,0,0,0,2
2,Auburn Gresham,0,8,0,0,6,2,0,0,1,...,0,0,1,0,0,0,0,0,0,1
3,Austin,0,12,0,0,4,0,0,0,0,...,0,0,2,4,0,0,0,0,0,1
4,Avondale,1,38,3,0,10,2,3,0,0,...,0,1,2,17,3,1,0,0,0,1
5,Belmont Cragin,0,22,2,0,8,1,0,0,2,...,1,0,1,6,1,0,0,0,0,0
6,Bridgeport,0,34,0,0,11,0,6,2,1,...,0,2,3,11,1,0,0,0,0,0
7,Brighton Park,0,14,0,0,9,1,0,0,1,...,0,0,0,9,1,1,0,0,1,1
8,Calumet Heights,0,15,1,0,5,3,0,0,2,...,0,0,1,9,1,0,0,1,0,2
9,Chatham,0,20,0,0,9,2,1,0,0,...,0,2,1,8,0,0,0,0,0,0
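
The nested loops above add up, for every neighborhood, the counts of all venue categories that belong to the same group. The same aggregation can be sketched with `melt` and `groupby`. This is only an illustration: the small `venues_grouped` and `groups` objects below are made-up stand-ins for the notebook's real ones.

```python
import pandas as pd

# Toy stand-ins for the notebook's `venues_grouped` and `groups` (values are made up).
venues_grouped = pd.DataFrame({
    "Neighborhood": ["A", "B"],
    "Chinese Restaurant": [2, 0],
    "Diner": [1, 3],
    "ATM": [0, 1],
})
groups = {"ATMs": ["ATM"], "Restaurants": ["Chinese Restaurant", "Diner"], "Other": []}

# Invert the dictionary: venue category -> group name.
category_to_group = {cat: key for key, cats in groups.items() for cat in cats}

# Long format: one row per (neighborhood, venue category) pair.
long = venues_grouped.melt(id_vars="Neighborhood", var_name="Category", value_name="Count")
long["Group"] = long["Category"].map(category_to_group)

# Sum the counts per neighborhood and group, then go back to wide format.
df_categories = (
    long.groupby(["Neighborhood", "Group"])["Count"].sum()
    .unstack(fill_value=0)
    .reindex(columns=list(groups.keys()), fill_value=0)  # keep empty groups as zero columns
    .reset_index()
)
print(df_categories)
```

The `reindex` step keeps groups with no venues (here `Other`) as zero columns, which matches the zero-initialized dataframe built in the previous cell.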


And now we just need to add the crime index back.

In [34]:
df_crime_index = df[['Community','Weighted Score']].rename(columns={"Community": "Neighborhood", "Weighted Score" : "Crime Index"})
df = df_categories.join(df_crime_index.set_index('Neighborhood'), on='Neighborhood')
df

Unnamed: 0,Neighborhood,ATMs,Restaurants,Athletics & Sports,Train or Subway Stations,Stores,Entertainment,Arts & Culture,Bakery,Banking,...,Personal Care,Parks & Public Places,Shops,Markets,Vacation,College,Water,Industry,Other,Crime Index
0,Albany Park,0,44,0,2,19,0,0,3,2,...,0,3,13,1,0,0,0,0,0,1.009615
1,Armour Square,0,44,3,0,2,1,0,2,0,...,0,3,9,1,1,0,0,0,2,0.653846
2,Auburn Gresham,0,8,0,0,6,2,0,0,1,...,0,1,0,0,0,0,0,0,1,2.923077
3,Austin,0,12,0,0,4,0,0,0,0,...,0,2,4,0,0,0,0,0,1,5.221154
4,Avondale,1,38,3,0,10,2,3,0,0,...,1,2,17,3,1,0,0,0,1,0.836538
5,Belmont Cragin,0,22,2,0,8,1,0,0,2,...,0,1,6,1,0,0,0,0,0,1.730769
6,Bridgeport,0,34,0,0,11,0,6,2,1,...,2,3,11,1,0,0,0,0,0,0.471154
7,Brighton Park,0,14,0,0,9,1,0,0,1,...,0,0,9,1,1,0,0,1,1,0.875
8,Calumet Heights,0,15,1,0,5,3,0,0,2,...,0,1,9,1,0,0,1,0,2,0.692308
9,Chatham,0,20,0,0,9,2,1,0,0,...,2,1,8,0,0,0,0,0,0,2.009615
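
The join above is a left join, so a neighborhood whose name has no match in the crime table ends up with a missing crime index. A minimal sketch of how this could be checked, using `merge` on made-up toy frames (the names `merged` and `unmatched` are only for illustration):

```python
import pandas as pd

# Toy stand-ins (values are made up); "C" has no entry in the crime table.
df_categories = pd.DataFrame({"Neighborhood": ["A", "B", "C"], "Restaurants": [5, 2, 7]})
df_crime_index = pd.DataFrame({"Neighborhood": ["A", "B"], "Crime Index": [1.2, 0.4]})

# Left join on the neighborhood name; validate raises if a name is duplicated on either side.
merged = df_categories.merge(df_crime_index, on="Neighborhood", how="left", validate="one_to_one")

# Neighborhoods that did not receive a crime index.
unmatched = merged.loc[merged["Crime Index"].isna(), "Neighborhood"].tolist()
print(unmatched)
```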


In [35]:
df_crime_index = df[['Neighborhood','Crime Index']].copy()

### Adding the neighborhoods of the city my friend is moving to

My friend is moving to the city of Santo André, State of São Paulo, Brazil. He has to choose between the following neighborhoods.

In [36]:
data = {'Bairro Campestre': [-23.637909, -46.542631],
        'Centro': [-23.661471, -46.529092],
        'Vila Assuncao': [-23.672044, -46.526948],
        'Vila Humaita': [-23.673853, -46.503658],
        'Bairro Jardim': [-23.652431, -46.536477],
        'Vila Pires': [-23.676149, -46.511200],
        'Vila Luzita': [-23.701109, -46.507170],
        'Jd Alvorada': [-23.692945, -46.522974],
        'Jardim Irene': [-23.709475, -46.510108],
        'Vila Metalurgica': [-23.627966, -46.534318],
        'Utinga': [-23.617370, -46.538667],
        'Santa Terezinha': [-23.635531, -46.532421]}
df_santo_andre = pd.DataFrame.from_dict(data, orient='index')
df_santo_andre.reset_index(inplace =True)
df_santo_andre = df_santo_andre.rename(columns={0: "LATITUDE", 1: "LONGITUDE", "index": "Neighborhood"})
df_santo_andre

Unnamed: 0,Neighborhood,LATITUDE,LONGITUDE
0,Bairro Campestre,-23.637909,-46.542631
1,Centro,-23.661471,-46.529092
2,Vila Assuncao,-23.672044,-46.526948
3,Vila Humaita,-23.673853,-46.503658
4,Bairro Jardim,-23.652431,-46.536477
5,Vila Pires,-23.676149,-46.5112
6,Vila Luzita,-23.701109,-46.50717
7,Jd Alvorada,-23.692945,-46.522974
8,Jardim Irene,-23.709475,-46.510108
9,Vila Metalurgica,-23.627966,-46.534318


In [37]:
SA_venues = getNearbyVenues(names=df_santo_andre['Neighborhood'],
                                   latitudes=df_santo_andre['LATITUDE'],
                                   longitudes=df_santo_andre['LONGITUDE']
                                  )

Bairro Campestre
Centro
Vila Assuncao
Vila Humaita
Bairro Jardim
Vila Pires
Vila Luzita
Jd Alvorada
Jardim Irene
Vila Metalurgica
Utinga
Santa Terezinha


Check how many venues were returned for each neighborhood.

In [38]:
SA_venues.groupby('Neighborhood').count()['Venue Category']

Neighborhood
Bairro Campestre     49
Bairro Jardim       100
Centro              100
Jardim Irene         16
Jd Alvorada          21
Santa Terezinha      54
Utinga               16
Vila Assuncao        47
Vila Humaita         48
Vila Luzita          26
Vila Metalurgica     43
Vila Pires           61
Name: Venue Category, dtype: int64

In [39]:
print('There are {} unique categories.'.format(len(SA_venues['Venue Category'].unique())))

There are 122 unique categories.


The following steps repeat the manipulations that were done for the neighborhoods of Chicago.

In [40]:
# one hot encoding
SA_venues_onehot = pd.get_dummies(SA_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
SA_venues_onehot['Neighborhood'] = SA_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [SA_venues_onehot.columns[-1]] + list(SA_venues_onehot.columns[:-1])
SA_venues_onehot = SA_venues_onehot[fixed_columns]

SA_venues_grouped = SA_venues_onehot.groupby('Neighborhood').sum().reset_index()

category_list_0 = list(SA_venues_grouped.columns)[1:].copy()
remaining_list = list(SA_venues_grouped.columns)[1:].copy()

groups = {}
groups.setdefault('ATMs', [])
groups.setdefault('Restaurants', [])
groups.setdefault('Athletics & Sports', [])
groups.setdefault('Train or Subway Stations', [])
groups.setdefault('Stores', [])
groups.setdefault('Entertainment', [])
groups.setdefault('Arts & Culture', [])
groups.setdefault('Bakery', [])
groups.setdefault('Banking', [])
groups.setdefault('Bars', [])
groups.setdefault('Gymnasiums and Courts', [])
groups.setdefault('Hotels and Hospitality', [])
groups.setdefault('Services', [])
groups.setdefault('Bus Lines and Stations', [])
groups.setdefault('Gas Stations', [])
groups.setdefault('Stadiums and Concert Halls', [])
groups.setdefault('Hospitals and Clinics', [])
groups.setdefault('Laundry', [])
groups.setdefault('Personal Care', [])
groups.setdefault('Parks & Public Places', [])
groups.setdefault('Shops', [])
groups.setdefault('Markets', [])
groups.setdefault('Vacation', [])
groups.setdefault('College', [])
groups.setdefault('Water', [])
groups.setdefault('Industry', [])
groups.setdefault('Other', [])

for category in category_list_0:
    remaining_list.remove(category)
    if ('restaurant' in category.lower()) or ('food truck' in category.lower()) or ('steakhouse' in category.lower()) or ('breakfast' in category.lower()) or ('bistro' in category.lower()) or ('diner' in category.lower()) or ('place' in category.lower()) or ('joint' in category.lower()):
        groups['Restaurants'].append(category)
    elif ('store' in category.lower()) or ('boutique' in category.lower()):
        groups['Stores'].append(category)
 
    elif 'atm' in category.lower():
        groups['ATMs'].append(category)
    
    elif 'college' in category.lower():
        groups['College'].append(category)

    elif ('art' in category.lower()) or ('historic' in category.lower()) or ('exhibit' in category.lower()) or ('music' in category.lower()) or ('planetarium' in category.lower()) or ('cultural' in category.lower()) or ('museum' in category.lower()):
        groups['Arts & Culture'].append(category)

    elif ('gym' in category.lower()) or ('skat' in category.lower()) or ('yoga' in category.lower()) or ('athletics' in category.lower()) or ('run' in category.lower()) or ('cycle' in category.lower()) or ('bike' in category.lower()):
        groups['Athletics & Sports'].append(category)

    elif ('stadium' in category.lower()) or ('opera' in category.lower()) or ('hockey' in category.lower()) or ('concert' in category.lower()) or ('theater' in category.lower()):
        groups['Stadiums and Concert Halls'].append(category)

    elif 'gas station' in category.lower():
        groups['Gas Stations'].append(category)

    elif ('park' in category.lower()) or ('preserve' in category.lower()) or ('memorial' in category.lower()) or ('sculpture' in category.lower()) or ('field' in category.lower()) or ('scenic' in category.lower()) or ('trail' in category.lower()) or ('track' in category.lower()) or ('garden' in category.lower()) or ('playground' in category.lower()) or ('recreation' in category.lower()) or ('pool' in category.lower()) or ('fountain' in category.lower()) or ('library' in category.lower()) :
        groups['Parks & Public Places'].append(category)
 
    elif 'bakery' in category.lower():
        groups['Bakery'].append(category)

    elif ('bank' in category.lower()) or ('exchange' in category.lower()):
        groups['Banking'].append(category)

    elif ('salon' in category.lower()) or ('tattoo' in category.lower()) or ('spa' in category.lower()) or ('massage' in category.lower()):
        groups['Personal Care'].append(category)

    elif ('club' in category.lower()) or ('circus' in category.lower()) or ('arcade' in category.lower()) or ('lounge' in category.lower()) or ('bowling' in category.lower()) or  ('entertainment' in category.lower()) or ('dance' in category.lower()) or ('music' in category.lower()):
        groups['Entertainment'].append(category)
    
    elif ('bar' in category.lower()) or ('pub' in category.lower()):
        groups['Bars'].append(category)

    elif ('hospital' in category.lower()) or ('doctor' in category.lower()) or ('pharmacy' in category.lower()) :
        groups['Hospitals and Clinics'].append(category)

    elif ('hotel' in category.lower()) or ('hostel' in category.lower()) :
        groups['Hotels and Hospitality'].append(category)

    elif ('laundromat' in category.lower()) or ('dry cleaner' in category.lower()) :
        groups['Laundry'].append(category)

    elif ('court' in category.lower()) :
        groups['Gymnasiums and Courts'].append(category)
        
    elif ('market' in category.lower()) or ('deli' in category.lower()) or ('butcher' in category.lower()) or ('grocery' in category.lower()) :
        groups['Markets'].append(category)

    elif ('subway' in category.lower()) or ('train' in category.lower()) or ('ferry' in category.lower()) or ('metro' in category.lower()) or ('rail' in category.lower()):
        groups['Train or Subway Stations'].append(category)

    elif ('bus line' in category.lower())  or ('bus station' in category.lower()) :
        groups['Bus Lines and Stations'].append(category)
        
    elif ('shop' in category.lower()) or ('brewery' in category.lower()) or ('dealership' in category.lower()) or ('plaza' in category.lower()) or ('cafe' in category.lower()) or ('café' in category.lower()) or ('food' in category.lower()) :
        groups['Shops'].append(category)
        
    elif ('travel' in category.lower())  or ('rental' in category.lower()) or ('tour' in category.lower()) or ('vacation' in category.lower()) :
        groups['Vacation'].append(category)
        
    elif ('service' in category.lower()) or ('insurance' in category.lower()) or ('veterinarian' in category.lower()) or ('newsstand' in category.lower()) or ('locksmith' in category.lower()) :
        groups['Services'].append(category)
        
    elif ('river' in category.lower()) or ('waterfront' in category.lower()) or ('harbor' in category.lower())  or ('marina' in category.lower()) or ('lake' in category.lower()):
        groups['Water'].append(category)
    elif ('factory' in category.lower()) or ('industry' in category.lower()) or ('farm' in category.lower()):
        groups['Industry'].append(category)
    else:
        groups['Other'].append(category)

df_categories = SA_venues_grouped['Neighborhood'].to_frame()
for key in list(groups.keys()):
    df_categories[key] = 0

for neighborhood in list(df_categories['Neighborhood']):
    for column in list(SA_venues_grouped.columns)[1:]:
        for key, value in groups.items():
            for item in value:
                if (item == column) and SA_venues_grouped.loc[SA_venues_grouped['Neighborhood'] == neighborhood, column].item() > 0:
                    df_categories.loc[df_categories['Neighborhood'] == neighborhood, key] = df_categories.loc[df_categories['Neighborhood'] == neighborhood, key] + SA_venues_grouped.loc[SA_venues_grouped['Neighborhood'] == neighborhood, column].item()
#df_categories['Crime Index'] = " "
df_categories


Unnamed: 0,Neighborhood,ATMs,Restaurants,Athletics & Sports,Train or Subway Stations,Stores,Entertainment,Arts & Culture,Bakery,Banking,...,Laundry,Personal Care,Parks & Public Places,Shops,Markets,Vacation,College,Water,Industry,Other
0,Bairro Campestre,0,14,3,0,6,2,0,2,0,...,0,1,3,6,2,0,0,0,0,1
1,Bairro Jardim,0,37,3,0,5,1,3,2,0,...,0,5,2,22,3,0,0,0,0,4
2,Centro,0,34,7,0,8,3,2,4,0,...,0,4,1,26,1,0,0,0,0,0
3,Jardim Irene,0,4,0,0,3,0,0,1,0,...,0,0,1,2,3,0,0,0,0,0
4,Jd Alvorada,0,3,4,0,3,0,0,3,0,...,0,0,0,2,4,0,0,0,0,1
5,Santa Terezinha,0,17,5,0,8,1,0,3,0,...,0,1,1,4,6,0,0,0,0,0
6,Utinga,0,1,2,0,4,0,1,4,0,...,0,0,0,0,2,0,0,0,0,2
7,Vila Assuncao,0,14,2,0,6,0,1,2,0,...,0,0,2,10,2,0,0,0,0,0
8,Vila Humaita,0,12,5,0,5,2,0,8,0,...,0,0,2,7,0,0,0,0,0,1
9,Vila Luzita,0,4,5,0,6,0,0,2,0,...,0,0,1,3,4,0,0,0,0,0
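
The same chain of keyword tests is written out twice, once for Chicago and once for Santo André. One way to avoid the repetition is to keep the rules in an ordered table and apply them with a small function. The sketch below is an assumption-laden illustration: `RULES`, `assign_group` and `build_groups` are hypothetical names, and only a few of the notebook's rules are reproduced.

```python
# Ordered (group, keywords) pairs. Order matters because the first match wins,
# mirroring the if/elif chain. Only a subset of the notebook's rules is listed here.
RULES = [
    ("Restaurants", ["restaurant", "food truck", "steakhouse", "breakfast",
                     "bistro", "diner", "place", "joint"]),
    ("Stores", ["store", "boutique"]),
    ("ATMs", ["atm"]),
    ("Bakery", ["bakery"]),
    ("Bars", ["bar", "pub"]),
]

def assign_group(category, rules=RULES, default="Other"):
    """Return the first group whose keywords appear in the category name."""
    name = category.lower()
    for group, keywords in rules:
        if any(keyword in name for keyword in keywords):
            return group
    return default

def build_groups(categories, rules=RULES, default="Other"):
    """Build a dict of group -> list of categories, like the notebook's `groups`."""
    groups = {group: [] for group, _ in rules}
    groups[default] = []
    for category in categories:
        groups[assign_group(category, rules, default)].append(category)
    return groups

groups = build_groups(["Sushi Restaurant", "Shoe Store", "ATM", "Wine Bar", "Zoo"])
print(groups)
```

Because the test is a plain substring match, some names can land in unexpected groups: with these rules `assign_group("Barbershop")` returns `"Bars"`, since `"bar"` is a substring of the name.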


Now we need to add these neighborhoods to the dataframe containing the neighborhoods of Chicago.

In [41]:
df_final = pd.concat([df, df_categories], sort=False)
df_final

Unnamed: 0,Neighborhood,ATMs,Restaurants,Athletics & Sports,Train or Subway Stations,Stores,Entertainment,Arts & Culture,Bakery,Banking,...,Personal Care,Parks & Public Places,Shops,Markets,Vacation,College,Water,Industry,Other,Crime Index
0,Albany Park,0,44,0,2,19,0,0,3,2,...,0,3,13,1,0,0,0,0,0,1.009615
1,Armour Square,0,44,3,0,2,1,0,2,0,...,0,3,9,1,1,0,0,0,2,0.653846
2,Auburn Gresham,0,8,0,0,6,2,0,0,1,...,0,1,0,0,0,0,0,0,1,2.923077
3,Austin,0,12,0,0,4,0,0,0,0,...,0,2,4,0,0,0,0,0,1,5.221154
4,Avondale,1,38,3,0,10,2,3,0,0,...,1,2,17,3,1,0,0,0,1,0.836538
5,Belmont Cragin,0,22,2,0,8,1,0,0,2,...,0,1,6,1,0,0,0,0,0,1.730769
6,Bridgeport,0,34,0,0,11,0,6,2,1,...,2,3,11,1,0,0,0,0,0,0.471154
7,Brighton Park,0,14,0,0,9,1,0,0,1,...,0,0,9,1,1,0,0,1,1,0.875000
8,Calumet Heights,0,15,1,0,5,3,0,0,2,...,0,1,9,1,0,0,1,0,2,0.692308
9,Chatham,0,20,0,0,9,2,1,0,0,...,2,1,8,0,0,0,0,0,0,2.009615


# Results

### Clustering the Neighborhoods

In [42]:
from sklearn.cluster import KMeans

In [43]:
kclusters = 5

df_clustering = df_final.drop(columns=['Neighborhood','Crime Index'])

kmeans = KMeans(n_clusters=kclusters, random_state=7).fit(df_clustering)

kmeans.labels_[0:10] 

array([4, 4, 2, 0, 1, 3, 1, 0, 0, 3])

In [44]:
df_clustering.insert(0,'Cluster Labels',kmeans.labels_)
df_clustering.insert(0,'Neighborhood',df_final['Neighborhood'])
df_clustering.insert(len(list(df_clustering.columns)),'Crime Index',df_final['Crime Index'])
df_clustering

Unnamed: 0,Neighborhood,Cluster Labels,ATMs,Restaurants,Athletics & Sports,Train or Subway Stations,Stores,Entertainment,Arts & Culture,Bakery,...,Personal Care,Parks & Public Places,Shops,Markets,Vacation,College,Water,Industry,Other,Crime Index
0,Albany Park,4,0,44,0,2,19,0,0,3,...,0,3,13,1,0,0,0,0,0,1.009615
1,Armour Square,4,0,44,3,0,2,1,0,2,...,0,3,9,1,1,0,0,0,2,0.653846
2,Auburn Gresham,2,0,8,0,0,6,2,0,0,...,0,1,0,0,0,0,0,0,1,2.923077
3,Austin,0,0,12,0,0,4,0,0,0,...,0,2,4,0,0,0,0,0,1,5.221154
4,Avondale,1,1,38,3,0,10,2,3,0,...,1,2,17,3,1,0,0,0,1,0.836538
5,Belmont Cragin,3,0,22,2,0,8,1,0,0,...,0,1,6,1,0,0,0,0,0,1.730769
6,Bridgeport,1,0,34,0,0,11,0,6,2,...,2,3,11,1,0,0,0,0,0,0.471154
7,Brighton Park,0,0,14,0,0,9,1,0,0,...,0,0,9,1,1,0,0,1,1,0.875000
8,Calumet Heights,0,0,15,1,0,5,3,0,0,...,0,1,9,1,0,0,1,0,2,0.692308
9,Chatham,3,0,20,0,0,9,2,1,0,...,2,1,8,0,0,0,0,0,0,2.009615
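
To make the clustering step easier to follow, here is a minimal sketch of `KMeans` on a tiny, made-up table of venue counts. It only illustrates how rows with similar counts receive the same label; the numbers and names are not from the notebook's data.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Toy venue counts (made up): two "busy" and two "quiet" neighborhoods.
toy = pd.DataFrame(
    {"Restaurants": [40, 38, 5, 4], "Shops": [15, 17, 2, 1]},
    index=["busy_1", "busy_2", "quiet_1", "quiet_2"],
)

# Fit two clusters on the raw counts, as the notebook does with five.
kmeans = KMeans(n_clusters=2, random_state=7, n_init=10).fit(toy)
labels = pd.Series(kmeans.labels_, index=toy.index, name="Cluster Labels")

print(labels)
print(labels.value_counts())  # number of neighborhoods per cluster
```

The label numbers themselves are arbitrary; only which rows share a label is meaningful. Since k-means relies on Euclidean distance, columns with large counts (such as Restaurants) weigh more than sparse ones when raw counts are used.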


### Assigning a crime index to the neighborhoods of Santo André.

Each of the selected neighborhoods of Santo André received a cluster label from 0 to 4. We will assign to each of them the average crime index of the Chicago neighborhoods that fall in the same cluster.

In [45]:
index_0 = df_clustering[df_clustering['Cluster Labels'] == 0]['Crime Index'].dropna().mean()
index_1 = df_clustering[df_clustering['Cluster Labels'] == 1]['Crime Index'].dropna().mean()
index_2 = df_clustering[df_clustering['Cluster Labels'] == 2]['Crime Index'].dropna().mean()
index_3 = df_clustering[df_clustering['Cluster Labels'] == 3]['Crime Index'].dropna().mean()
index_4 = df_clustering[df_clustering['Cluster Labels'] == 4]['Crime Index'].dropna().mean()
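# Note: index_2 is not included in this list, so only four of the five cluster means are displayed.
list_index = [index_0,index_1,index_3,index_4]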
list_index

[2.3453525641025643, 1.6145104895104894, 1.558566433566434, 1.6014957264957266]

In [46]:
for neighborhood in list(df_santo_andre['Neighborhood'].values):
    if  df_clustering.loc[df_clustering['Neighborhood'] == neighborhood, 'Cluster Labels'].item() == 0:
        df_clustering.loc[df_clustering['Neighborhood'] == neighborhood, 'Crime Index'] = index_0
    elif df_clustering.loc[df_clustering['Neighborhood'] == neighborhood, 'Cluster Labels'].item() == 1:
        df_clustering.loc[df_clustering['Neighborhood'] == neighborhood, 'Crime Index'] = index_1
    elif df_clustering.loc[df_clustering['Neighborhood'] == neighborhood, 'Cluster Labels'].item() == 2:
        df_clustering.loc[df_clustering['Neighborhood'] == neighborhood, 'Crime Index'] = index_2
    elif df_clustering.loc[df_clustering['Neighborhood'] == neighborhood, 'Cluster Labels'].item() == 3:
        df_clustering.loc[df_clustering['Neighborhood'] == neighborhood, 'Crime Index'] = index_3
    elif df_clustering.loc[df_clustering['Neighborhood'] == neighborhood, 'Cluster Labels'].item() == 4:
        df_clustering.loc[df_clustering['Neighborhood'] == neighborhood, 'Crime Index'] = index_4
        
df_SA = df_clustering.loc[df_clustering['Neighborhood'].isin(list(df_santo_andre['Neighborhood'].values))][['Neighborhood','Crime Index']]
df_SA = df_SA.sort_values(by='Crime Index')
df_SA

Unnamed: 0,Neighborhood,Crime Index
3,Jardim Irene,1.495673
4,Jd Alvorada,1.495673
6,Utinga,1.495673
9,Vila Luzita,1.495673
5,Santa Terezinha,1.558566
11,Vila Pires,1.558566
1,Bairro Jardim,1.61451
2,Centro,1.61451
0,Bairro Campestre,2.345353
7,Vila Assuncao,2.345353


My friend would probably choose one of the neighborhoods at the top of the list, which share the lowest estimated crime index.
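
The per-cluster averages and the if/elif assignment used above can also be sketched with `groupby` and `map`. The frame below is a made-up stand-in for `df_clustering`, where the rows of the target city have no crime index yet.

```python
import numpy as np
import pandas as pd

# Toy stand-in for `df_clustering` (values are made up).
df_clustering = pd.DataFrame({
    "Neighborhood": ["Chi A", "Chi B", "Chi C", "SA X", "SA Y"],
    "Cluster Labels": [0, 0, 1, 0, 1],
    "Crime Index": [1.0, 3.0, 0.5, np.nan, np.nan],
})

# Mean crime index per cluster; missing values are skipped by mean().
cluster_means = df_clustering.groupby("Cluster Labels")["Crime Index"].mean()

# Fill only the missing values with the mean of the row's own cluster.
missing = df_clustering["Crime Index"].isna()
df_clustering.loc[missing, "Crime Index"] = (
    df_clustering.loc[missing, "Cluster Labels"].map(cluster_means)
)
print(df_clustering)
```

This form does not depend on the number of clusters, so changing `kclusters` would not require editing the assignment code.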

# Discussion

This project was just an exercise, and the crime index developed here does not reflect reality. In fact, the correlation between the number of venues of each category in a neighborhood and the crime index of that neighborhood is very small. The construction of the crime index itself was very subjective, and the weights and the averaging process could be improved.
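
One way to look at the correlation mentioned above is `DataFrame.corrwith`, which correlates every feature column with the crime index. The sketch below uses a made-up frame with the same layout as the Chicago dataframe; the values are for illustration only and say nothing about the real data.

```python
import pandas as pd

# Toy frame with the same layout as the Chicago dataframe (values are made up).
df = pd.DataFrame({
    "Neighborhood": ["A", "B", "C", "D"],
    "Restaurants": [44, 8, 12, 38],
    "Bars": [1, 2, 3, 4],
    "Crime Index": [1.0, 2.0, 3.0, 4.0],
})

# Pearson correlation of each venue group with the crime index.
features = df.drop(columns=["Neighborhood", "Crime Index"])
correlations = features.corrwith(df["Crime Index"]).sort_values()
print(correlations)
```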

# Conclusion


The results are the following:

In [47]:
df_SA

Unnamed: 0,Neighborhood,Crime Index
3,Jardim Irene,1.495673
4,Jd Alvorada,1.495673
6,Utinga,1.495673
9,Vila Luzita,1.495673
5,Santa Terezinha,1.558566
11,Vila Pires,1.558566
1,Bairro Jardim,1.61451
2,Centro,1.61451
0,Bairro Campestre,2.345353
7,Vila Assuncao,2.345353


Looking just at this list, my friend would choose one of the first four neighborhoods.