# **Final Assignment - "Battle of the Neighborhoods"**

## *Introduction/Business Problem*

In previous labs and assignments, we worked with location data for New York and Toronto to cluster neighborhoods by venue category. In earlier sections of the course, we also analyzed crime data for Chicago and San Francisco. Combining these two ideas and applying them to my home city of Atlanta, Georgia, United States, I will try to determine whether there is any correlation between the top venue categories in a neighborhood and the number or types of crimes there, and whether clusters of neighborhoods with similar venue types coincide with clusters of neighborhoods with a similar quantity and/or type of crime. I will also perform exploratory data analysis as I work through the project and ideas come to mind.

This project should interest anyone curious about the relationship between the venue types of a location and the crime at that location. The police department, venue owners, or social workers may also find the results useful, as they may provide insights relevant to their fields. It will be interesting to take an "unbiased" look: when we are in a particular location, we automatically make assumptions based on the appearance of the buildings, infrastructure, people, etc. This study, however, looks less at the perceived quality of the venues than at the pattern of venue occurrence. Once clusters are made, I may also look into venue ratings to explore whether a lower average venue rating has any correlation with crime.

## *Data Description/Examples*

***Atlanta Data*** - I will use Wikipedia to get data on Atlanta neighborhoods and NPUs (Neighborhood Planning Units, which are geographically based). I will use the open data CSV files from the Atlanta Police Department to acquire crime data; for simplicity, we will only use the crime data for 2018. Conveniently, the crime data includes the latitude and longitude of each crime in addition to the neighborhood name and NPU, so I will take the mean lat/long of the crimes in a given neighborhood as its coordinates (to be used with the Foursquare API).

https://en.wikipedia.org/wiki/Neighborhood_planning_unit

http://opendata.atlantapd.org/Crimedata/Default.aspx


#### Import packages to be used

*Note: I will create a CSV for some of the data manually, and I will web-parse for some of the data as well.*

In [44]:
from bs4 import BeautifulSoup
import requests
import urllib
import urllib.request

import pandas as pd
import numpy as np

#### Read csv file (manually constructed using Wikipedia)

The NPU CSV file contains each neighborhood name and its respective NPU.

In [102]:
df_npu = pd.read_csv('NPU.csv')
print(df_npu.shape)
df_npu.head()

(242, 2)


Unnamed: 0,NPU,Neighborhood
0,A,Chastain Park
1,A,Kingswood
2,A,Margaret Mitchell
3,A,Mt. Paran Parkway
4,A,Mt. Paran/Northside


#### Import crime data and clean dataframe

Because the crime data is stored by month, we first need to concatenate the monthly files into a single dataframe before we merge the dataframes together.

In [103]:
month_list = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
# read every monthly file, then concatenate once instead of growing the dataframe in a loop
frames = [pd.read_csv('{}18.csv'.format(month)) for month in month_list]
df_crime = pd.concat(frames, sort=False)
print(df_crime.shape)


(25622, 10)


In [104]:
df_crime.head()

Unnamed: 0,UC2_Literal,Report Number,Report Date,Location,Beat,Neigborhood,NPU,Lat,Long,Unnamed: 9
0,LARCENY-FROM VEHICLE,180310061,01/31/2018,691 PENN AVE NE,505.0,Midtown,E,33.77338,-84.37856,
1,LARCENY-FROM VEHICLE,180310146,01/31/2018,543 STOKESWOOD AVE SE,612.0,East Atlanta,W,33.73949,-84.34501,
2,LARCENY-FROM VEHICLE,180310161,01/31/2018,437 MORELAND AVE NE,608.0,Candler Park,N,33.76632,-84.34892,
3,LARCENY-FROM VEHICLE,180310488,01/31/2018,976 GILBERT ST SE,607.0,Ormewood Park,W,33.72787,-84.35055,
4,LARCENY-FROM VEHICLE,180310584,01/31/2018,2470 CHESHIRE BRIDGE RD NE,212.0,Lindridge/Martin Manor,F,33.82236,-84.35173,


We do not need the Report Number, Report Date, Location, Beat, or Unnamed: 9 columns for our study, so we can drop them.

In [105]:
df_crime.drop(['Report Number', 'Report Date', 'Location', 'Beat', 'Unnamed: 9'], axis=1, inplace=True)
print(df_crime.shape)
df_crime.head()

(25622, 5)


Unnamed: 0,UC2_Literal,Neigborhood,NPU,Lat,Long
0,LARCENY-FROM VEHICLE,Midtown,E,33.77338,-84.37856
1,LARCENY-FROM VEHICLE,East Atlanta,W,33.73949,-84.34501
2,LARCENY-FROM VEHICLE,Candler Park,N,33.76632,-84.34892
3,LARCENY-FROM VEHICLE,Ormewood Park,W,33.72787,-84.35055
4,LARCENY-FROM VEHICLE,Lindridge/Martin Manor,F,33.82236,-84.35173


Drop rows with NaN values and rename the crime-type column

In [106]:
atl_crime = df_crime
print('Shape with NaN values:', atl_crime.shape)
atl_crime = atl_crime.dropna()
print('Shape without NaN values:', atl_crime.shape)
atl_crime = atl_crime.rename(columns={'UC2_Literal':'Type'})
atl_crime.head()

Shape with NaN values: (25622, 5)
Shape without NaN values: (24708, 5)


Unnamed: 0,Type,Neigborhood,NPU,Lat,Long
0,LARCENY-FROM VEHICLE,Midtown,E,33.77338,-84.37856
1,LARCENY-FROM VEHICLE,East Atlanta,W,33.73949,-84.34501
2,LARCENY-FROM VEHICLE,Candler Park,N,33.76632,-84.34892
3,LARCENY-FROM VEHICLE,Ormewood Park,W,33.72787,-84.35055
4,LARCENY-FROM VEHICLE,Lindridge/Martin Manor,F,33.82236,-84.35173
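
Before dropping rows, it can help to see which columns actually hold the missing values. The sketch below is only an illustration on a three-row toy frame (the `None`/`NaN` entries are made up for the example, not taken from the Atlanta data); `isna().sum()` counts missing values per column and `dropna()` then removes any row containing one.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the crime dataframe; the missing entries are invented.
toy = pd.DataFrame({
    "Type": ["LARCENY-FROM VEHICLE"] * 3,
    "Neigborhood": ["Midtown", None, "Candler Park"],
    "NPU": ["E", None, "N"],
    "Lat": [33.77338, 33.73949, np.nan],
})

# Number of missing values in each column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Keep only the rows that have no missing values at all.
cleaned = toy.dropna()
print(cleaned.shape)
```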


In [107]:
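# average only the numeric coordinate columns; the text column Type cannot be averaged
df_coord = atl_crime.groupby(['Neigborhood','NPU'])[['Lat', 'Long']].mean().reset_index()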
print(df_coord.shape)
print(df_coord.columns)
df_coord.head()

(240, 4)
Index(['Neigborhood', 'NPU', 'Lat', 'Long'], dtype='object')


Unnamed: 0,Neigborhood,NPU,Lat,Long
0,Adair Park,V,33.730508,-84.410038
1,Adams Park,I,33.71764,-84.46923
2,Adams Park,R,33.714968,-84.461356
3,Adamsville,H,33.75922,-84.503682
4,Almond Park,G,33.783619,-84.460499
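
As a small illustration of the centroid step, the sketch below averages the coordinates of three crime rows copied from the tables above (two in Midtown, one in Candler Park). It selects `Lat` and `Long` explicitly before calling `mean()`; this is my own choice for the example, made because newer pandas releases may raise an error when asked to average a text column such as `Type`.

```python
import pandas as pd

# Three rows taken from the crime table shown earlier.
toy = pd.DataFrame({
    "Type": ["LARCENY-FROM VEHICLE"] * 3,
    "Neigborhood": ["Midtown", "Midtown", "Candler Park"],
    "NPU": ["E", "E", "N"],
    "Lat": [33.77338, 33.78346, 33.76632],
    "Long": [-84.37856, -84.37931, -84.34892],
})

# One row per (neighborhood, NPU) pair holding the mean crime coordinates.
centers = (
    toy.groupby(["Neigborhood", "NPU"])[["Lat", "Long"]]
    .mean()
    .reset_index()
)
print(centers)
```

The Midtown row is the midpoint of its two crime locations, which is what "crime center" means in this notebook.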


#### Search for Duplicates

In [108]:
duplicates = pd.concat(g for _, g in df_coord.groupby("Neigborhood") if len(g) > 1)
duplicates

Unnamed: 0,Neigborhood,NPU,Lat,Long
1,Adams Park,I,33.71764,-84.46923
2,Adams Park,R,33.714968,-84.461356
15,Atlanta University Center,M,33.74889,-84.40453
16,Atlanta University Center,T,33.752156,-84.413525
103,Grant Park,N,33.74362,-84.3582
104,Grant Park,W,33.739281,-84.369418
186,Reynoldstown,N,33.751498,-84.35308
187,Reynoldstown,O,33.75242,-84.34922
211,Tuxedo Park,A,33.853535,-84.391393
212,Tuxedo Park,B,33.84805,-84.39124
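
An alternative way to list the repeated neighborhoods, sketched here on a four-row toy frame built from rows shown above, is `DataFrame.duplicated` with `keep=False`, which flags every member of a duplicated group. Unlike `pd.concat` over a generator, it returns an empty dataframe rather than raising an error when no duplicates exist.

```python
import pandas as pd

# Toy version of df_coord with one neighborhood listed under two NPUs.
toy = pd.DataFrame({
    "Neigborhood": ["Adair Park", "Adams Park", "Adams Park", "Adamsville"],
    "NPU": ["V", "I", "R", "H"],
})

# keep=False marks all rows of a repeated neighborhood, not just the later ones.
dupes = toy[toy.duplicated("Neigborhood", keep=False)]
print(dupes)
```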


In [109]:
dup_list = duplicates['Neigborhood'].tolist()
dup_list = set(dup_list)
for hood in dup_list:
    print(df_npu.loc[df_npu['Neighborhood'] == hood])

  NPU Neighborhood
8   A  Tuxedo Park
    NPU  Neighborhood
142   N  Reynoldstown
    NPU Neighborhood
210   W   Grant Park
    NPU               Neighborhood
194   T  Atlanta University Center
    NPU Neighborhood
181   R   Adams Park
    NPU    Neighborhood
192   S  Venetian Hills


I cross-referenced the NPU dataframe to verify the correct NPU for each neighborhood that appears under more than one NPU.

|Neighborhood|Correct NPU|
| --- | --- |
|Adams Park|R|
|Tuxedo Park|A|
|Reynoldstown|N|
|Grant Park|W|
|Atlanta University Center|T|
|Venetian Hills|S|
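
The same cross-reference could also be done programmatically. The following is a minimal sketch, not the method used in this notebook: it merges a toy coordinate table against a toy NPU reference table and keeps only the rows whose NPU agrees with the reference. The frame names `coord_toy` and `npu_toy` and the `_ref` suffix are assumptions made for the example. Neighborhoods missing from the reference table would end up with an empty `NPU_ref` and would need separate handling.

```python
import pandas as pd

# Reference table: the NPU each neighborhood should belong to.
npu_toy = pd.DataFrame({
    "NPU": ["R", "W"],
    "Neighborhood": ["Adams Park", "Grant Park"],
})

# Coordinate table with conflicting NPUs (values copied from the duplicates above).
coord_toy = pd.DataFrame({
    "Neigborhood": ["Adams Park", "Adams Park", "Grant Park", "Grant Park"],
    "NPU": ["I", "R", "N", "W"],
    "Lat": [33.71764, 33.714968, 33.74362, 33.739281],
    "Long": [-84.46923, -84.461356, -84.3582, -84.369418],
})

# Attach the reference NPU to every row, then keep rows that agree with it.
checked = coord_toy.merge(
    npu_toy,
    left_on="Neigborhood",
    right_on="Neighborhood",
    how="left",
    suffixes=("", "_ref"),
)
kept = checked[checked["NPU"] == checked["NPU_ref"]]
print(kept[["Neigborhood", "NPU", "Lat", "Long"]])
```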


#### Remove duplicates from df_coord

In [110]:
hoods = ['Adams Park', 'Tuxedo Park', 'Reynoldstown', 'Grant Park', 'Atlanta University Center', 'Venetian Hills']
npu_correct = ['R', 'A', 'N', 'W', 'T', 'S']
for hood, npu in zip(hoods, npu_correct):
    # drop the rows that list this neighborhood under the wrong NPU
    i = df_coord[(df_coord.Neigborhood == hood) & (df_coord.NPU != npu)].index
    df_coord = df_coord.drop(i)
# fix the misspelled column name inherited from the crime data
df_coord = df_coord.rename(columns={'Neigborhood': 'Neighborhood'})

# renumber the rows from 0 after the drops
df_coord = df_coord.reset_index(drop=True)
print(df_coord.shape)
df_coord.head(10)

(234, 4)


Unnamed: 0,Neighborhood,NPU,Lat,Long
0,Adair Park,V,33.730508,-84.410038
1,Adams Park,R,33.714968,-84.461356
2,Adamsville,H,33.75922,-84.503682
3,Almond Park,G,33.783619,-84.460499
4,Amal Heights,Y,33.708469,-84.398998
5,Ansley Park,E,33.79317,-84.3784
6,Arden/Habersham,C,33.838003,-84.401703
7,Ardmore,E,33.804271,-84.394102
8,Argonne Forest,C,33.841271,-84.40349
9,Arlington Estates,P,33.691722,-84.539251


#### **We now have our neighborhood-information dataset**

The dataset covers 234 of the 242 Atlanta neighborhoods (including all neighborhoods where a crime was recorded), so I am confident it is adequate for searching for correlations. Although we will not be using neighborhoods where there was no crime, that is not too relevant, as our aim is to see whether there is a correlation between venue types and crime quantity/type.

#### Remove duplicates from atl_crime

Now that I know there are incorrect values in our atl_crime dataframe, I'll fix it by the same process already used.

In [111]:
hoods = ['Adams Park', 'Tuxedo Park', 'Reynoldstown', 'Grant Park', 'Atlanta University Center', 'Venetian Hills']
npu_correct = ['R', 'A', 'N', 'W', 'T', 'S']
for hood, npu in zip(hoods, npu_correct):
    # note: drop() matches by index label, and labels repeat across the monthly files,
    # so this can remove more rows than the mismatched ones
    i = atl_crime[(atl_crime.Neigborhood == hood) & (atl_crime.NPU != npu)].index
    atl_crime = atl_crime.drop(i)
# fix the misspelled column name inherited from the crime data
atl_crime = atl_crime.rename(columns={'Neigborhood': 'Neighborhood'})

# renumber the rows from 0 after the drops
atl_crime = atl_crime.reset_index(drop=True)
print(atl_crime.shape)
atl_crime.head(10)

(24631, 5)


Unnamed: 0,Type,Neighborhood,NPU,Lat,Long
0,LARCENY-FROM VEHICLE,Midtown,E,33.77338,-84.37856
1,LARCENY-FROM VEHICLE,East Atlanta,W,33.73949,-84.34501
2,LARCENY-FROM VEHICLE,Candler Park,N,33.76632,-84.34892
3,LARCENY-FROM VEHICLE,Ormewood Park,W,33.72787,-84.35055
4,LARCENY-FROM VEHICLE,Lindridge/Martin Manor,F,33.82236,-84.35173
5,LARCENY-FROM VEHICLE,Sweet Auburn,M,33.75912,-84.3726
6,LARCENY-FROM VEHICLE,Midtown,E,33.78346,-84.37931
7,LARCENY-FROM VEHICLE,Browns Mill Park,Z,33.68156,-84.39458
8,LARCENY-FROM VEHICLE,Campbellton Road,R,33.7053,-84.46047
9,LARCENY-FROM VEHICLE,Lenox,B,33.84676,-84.36212
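
One caveat about this cleanup: the monthly files were concatenated without renumbering, so the row labels in `atl_crime` repeat (each month starts again at 0), and `drop()` removes every row that shares a label with a mismatched row. Whether this affected the row count printed above cannot be told from the output shown. The toy sketch below (two invented two-row "months") contrasts dropping by label with filtering by a boolean mask, which only removes the rows that actually match the condition.

```python
import pandas as pd

# Two toy "monthly" frames; each has its own labels 0 and 1.
jan = pd.DataFrame({"Neigborhood": ["Grant Park", "Midtown"], "NPU": ["N", "E"]})
feb = pd.DataFrame({"Neigborhood": ["Grant Park", "Midtown"], "NPU": ["W", "E"]})
crimes = pd.concat([jan, feb])  # index labels are now 0, 1, 0, 1

# Rows that list Grant Park under an NPU other than W.
bad = (crimes.Neigborhood == "Grant Park") & (crimes.NPU != "W")

# Dropping by label removes both rows labelled 0, including a correct one.
by_label = crimes.drop(crimes[bad].index)

# Filtering by mask removes only the single mismatched row.
by_mask = crimes[~bad]

print(len(by_label), len(by_mask))
```

Passing `ignore_index=True` to `pd.concat` when the monthly files are combined would also avoid the problem, since every row would then have a unique label.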


#### **We now have our crime dataset**

The crime dataset shrank only slightly during cleaning (from 25,622 to 24,631 rows), so I think it is a very good dataset for our study of 2018 crime in Atlanta.

***Foursquare API*** - I will make use of the Foursquare API to obtain location data for the neighborhood latitudes and longitudes. We will use this in combination with the other dataframes we have formed to search for correlations between venue types and crime.

#### Define Foursquare Credentials

In [116]:
CLIENT_ID = 'BQXM0NHEJLDDRQE5VX1U10E1VOXRVXRMH1I2N30LZJDVSY5W' # your Foursquare ID
CLIENT_SECRET = 'R22G1UF3WEP2F0YYYJ4MY2MSN4ODZRLDRKJKAU10WGFMYIJW' # your Foursquare Secret
VERSION = '20190308' # Foursquare API version
LIMIT = 100

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: BQXM0NHEJLDDRQE5VX1U10E1VOXRVXRMH1I2N30LZJDVSY5W
CLIENT_SECRET:R22G1UF3WEP2F0YYYJ4MY2MSN4ODZRLDRKJKAU10WGFMYIJW
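
Hard-coding and printing API credentials in a shared notebook exposes them to every reader. A possible alternative, sketched below, is to read them from environment variables. The helper name `load_foursquare_credentials` and the variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` are assumptions chosen for this example, not something Foursquare prescribes.

```python
import os


def load_foursquare_credentials(env=os.environ):
    """Return (client_id, client_secret) read from a mapping of environment variables."""
    # The variable names below are hypothetical; pick any names you like.
    client_id = env.get("FOURSQUARE_CLIENT_ID")
    client_secret = env.get("FOURSQUARE_CLIENT_SECRET")
    if not client_id or not client_secret:
        raise RuntimeError(
            "Set FOURSQUARE_CLIENT_ID and FOURSQUARE_CLIENT_SECRET before running"
        )
    return client_id, client_secret


# Demonstration with a plain dict standing in for the real environment.
demo_env = {"FOURSQUARE_CLIENT_ID": "my-id", "FOURSQUARE_CLIENT_SECRET": "my-secret"}
demo_id, demo_secret = load_foursquare_credentials(demo_env)
print("Credentials loaded:", bool(demo_id and demo_secret))
```

In the notebook itself, `CLIENT_ID, CLIENT_SECRET = load_foursquare_credentials()` would then replace the two hard-coded assignments.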


#### Import Libraries

In [113]:
from geopy.geocoders import Nominatim
import folium
import json
import requests
from pandas import json_normalize  # pandas >= 1.0; replaces the deprecated pandas.io.json import

#### Define function to get location data for every neighborhood

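We'll search for the top 100 venues within 800 meters (about half a mile) of each neighborhood's center. This is a difficult parameter to set. The city of Atlanta covers 134 square miles and has 242 neighborhoods. Assuming the neighborhoods are all the same size (I know they are not, but for the sake of standardization), each one covers approximately 0.55 square miles. A circle of that area has a radius of roughly 0.42 miles, so a half-mile radius covers a typical neighborhood with some margin. We'll therefore search for venues around the [crime] center of each neighborhood, that is, the mean latitude and longitude of its recorded crimes.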
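
The arithmetic behind the radius can be checked in a few lines. This is just a back-of-the-envelope sketch under the equal-size assumption stated above; the constant names are my own.

```python
import math

CITY_AREA_SQ_MI = 134       # area of Atlanta, from the text above
N_NEIGHBORHOODS = 242       # number of neighborhoods, from the text above
METERS_PER_MILE = 1609.344

# Average area per neighborhood if all were the same size.
avg_area_sq_mi = CITY_AREA_SQ_MI / N_NEIGHBORHOODS

# Radius of a circle with that area, in miles and in meters.
equal_area_radius_mi = math.sqrt(avg_area_sq_mi / math.pi)
equal_area_radius_m = equal_area_radius_mi * METERS_PER_MILE

# Area actually covered by the 800 m search radius.
search_radius_mi = 800 / METERS_PER_MILE
search_area_sq_mi = math.pi * search_radius_mi ** 2

print(round(avg_area_sq_mi, 2))        # 0.55
print(round(equal_area_radius_mi, 2))  # 0.42
print(round(equal_area_radius_m))      # 676
print(round(search_area_sq_mi, 2))     # 0.78
```

So the 800 m search circle covers about 0.78 square miles, somewhat more than the 0.55 square miles of an average neighborhood, which means searches for adjacent neighborhoods can overlap.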

In [121]:
def getNearbyVenues(names, latitudes, longitudes, radius=800):
    # collect one list of venue tuples per neighborhood
    venues_list = []
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)  # progress indicator

        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID,
            CLIENT_SECRET,
            VERSION,
            lat,
            lng,
            radius,
            LIMIT)

        # make the GET request and keep the list of recommended venues
        results = requests.get(url).json()['response']['groups'][0]['items']

        # keep only the relevant information for each nearby venue
        venues_list.append([(
            name,
            lat,
            lng,
            v['venue']['name'],
            v['venue']['location']['lat'],
            v['venue']['location']['lng'],
            v['venue']['categories'][0]['name']) for v in results])

    # flatten the per-neighborhood lists into a single dataframe
    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood',
                  'Neighborhood Latitude',
                  'Neighborhood Longitude',
                  'Venue',
                  'Venue Latitude',
                  'Venue Longitude',
                  'Venue Category']

    return nearby_venues
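
The function above assumes every response contains at least one group and that every venue has at least one category; if either is missing, the indexing fails and the whole run stops. Below is a hedged sketch of a more tolerant parsing step. The helper `extract_venues` is hypothetical, and the payload shape is inferred only from the access path used in `getNearbyVenues`, not from the Foursquare documentation. The second venue in the sample is invented to show the missing-category case.

```python
def extract_venues(payload, name, lat, lng):
    """Flatten one explore-style response into a list of venue tuples."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item["venue"]
        # Fall back to None when a venue has no category listed.
        categories = venue.get("categories") or []
        category = categories[0]["name"] if categories else None
        rows.append((
            name,
            lat,
            lng,
            venue["name"],
            venue["location"]["lat"],
            venue["location"]["lng"],
            category,
        ))
    return rows


# Hand-built sample shaped like the response that getNearbyVenues indexes into.
sample = {"response": {"groups": [{"items": [
    {"venue": {"name": "Adair Park One",
               "location": {"lat": 33.730525, "lng": -84.412837},
               "categories": [{"name": "Park"}]}},
    {"venue": {"name": "Uncategorized place",
               "location": {"lat": 33.73, "lng": -84.41},
               "categories": []}},
]}]}}

rows = extract_venues(sample, "Adair Park", 33.730508, -84.410038)
empty_rows = extract_venues({}, "Adair Park", 33.730508, -84.410038)
print(len(rows), len(empty_rows))
```

A neighborhood with no venues, or a failed request that returns no `groups`, then simply contributes no rows instead of raising an exception.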

In [122]:
atlanta_venues = getNearbyVenues(names=df_coord['Neighborhood'],
                                   latitudes=df_coord['Lat'],
                                   longitudes=df_coord['Long']
                                  )

Adair Park
Adams Park
Adamsville
Almond Park
Amal Heights
Ansley Park
Arden/Habersham
Ardmore
Argonne Forest
Arlington Estates
Ashley Courts
Ashview Heights
Atkins Park
Atlanta Industrial Park
Atlanta University Center
Atlantic Station
Audobon Forest
Audobon Forest West
Baker Hills
Bakers Ferry
Bankhead
Bankhead/Bolton
Beecher Hills
Ben Hill
Ben Hill Acres
Ben Hill Forest
Ben Hill Pines
Ben Hill Terrace
Benteen Park
Berkeley Park
Betmar LaVilla
Blair Villa/Poole Creek
Blandtown
Bolton
Bolton Hills
Boulder Park
Boulevard Heights
Brandon
Brentwood
Briar Glen
Brookhaven
Brookview Heights
Brookwood
Brookwood Hills
Browns Mill Park
Buckhead Forest
Buckhead Heights
Buckhead Village
Bush Mountain
Butner/Tell
Cabbagetown
Campbellton Road
Candler Park
Capitol Gateway
Capitol View
Capitol View Manor
Carey Park
Carroll Heights
Carver Hills
Cascade Avenue/Road
Cascade Green
Cascade Heights
Castleberry Hill
Castlewood
Center Hill
Chalet Woods
Channing Valley
Chastain Park
Chosewood Park
Collier Hei

In [124]:
print(atlanta_venues.shape)
atlanta_venues.head(10)

(4927, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Adair Park,33.730508,-84.410038,Adair Park One,33.730525,-84.412837,Park
1,Adair Park,33.730508,-84.410038,Monday Night Garage,33.729407,-84.418303,Brewery
2,Adair Park,33.730508,-84.410038,Atlanta BeltLine Corridor under Lee/Murphy,33.727205,-84.417238,Trail
3,Adair Park,33.730508,-84.410038,Atlanta Beltline Westside Trail,33.726212,-84.413658,Trail
4,Adair Park,33.730508,-84.410038,The hood Zone 3,33.731417,-84.403257,Historic Site
5,Adair Park,33.730508,-84.410038,Studioplexx47,33.736314,-84.409719,Event Space
6,Adair Park,33.730508,-84.410038,Boxcar Atl,33.729955,-84.417383,Gastropub
7,Adair Park,33.730508,-84.410038,The Bakery,33.724926,-84.41439,Art Gallery
8,Adair Park,33.730508,-84.410038,Ok Yaki,33.729375,-84.418087,Pop-Up Shop
9,Adams Park,33.714968,-84.461356,Alfred Holmes Golf Course,33.711642,-84.467037,Golf Course


#### **We now have our venue dataframe**