<h1> Segmenting and Classifying Atlanta Neighborhoods <h1>

In [1]:
import pandas as pd
import numpy as np
import requests
import folium
import json
from sklearn import preprocessing
from sklearn.cluster import DBSCAN
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, log_loss, confusion_matrix, classification_report, f1_score, jaccard_score, accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn import svm
from sklearn.tree import DecisionTreeClassifier
import matplotlib.cm as cm
import matplotlib.colors as colors
import pandas as pd
import numpy as np
import requests
from pandas import json_normalize 

<h2>Introduction-The Problem<h2>

In Atlanta, crime is a major issue. With a crime rate of 58 per 1000 residents, Atlanta is in the 2nd percentile for safety in the United States, meaning it is only safer than 2% of cities in the United States. It's violent crime rate is double the national average. I want to see if there's a meaningful correlation between the venues in a neighborhood and the crime that occurs within that neighborhood.(statistics from https://www.neighborhoodscout.com/ga/atlanta/crime.)

<h2>The Data<h2>

I will be using data from two sources: Foursquare and the Atlanta Police Department

<h3>Foursquare Location Data<h3>

I will be making explore calls using the foursquare API to get location data about each neighborhood. Explore calls return information about a specified number of venues around a specified Latitude and Longitude within a specified radius. Explore calls return lots of information about each venue, but we will only be using the venue's category. Below is an example dataframe with 20 venues within 500m of Atlanta itself.

In [2]:
CLIENT_ID = 'hidden'
CLIENT_SECRET = 'hidden'
VERSION = '20201127'
LATITUDE = 33.7490
LONGITUDE = -84.3880
RADIUS = 500
LIMIT = 20
url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, VERSION, LATITUDE, LONGITUDE, RADIUS, LIMIT)
results = requests.get(url).json()
def get_category_type(row):
    try:
        categories_list = row['categories']
    except:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']
    
venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues =nearby_venues.loc[:, filtered_columns]
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]
nearby_venues.head()

Unnamed: 0,name,categories,lat,lng
0,The Masquerade,Music Venue,33.75172,-84.389739
1,GSU Sports Arena,College Basketball Court,33.751735,-84.386328
2,Georgia Railroad Freight Depot,Event Space,33.751479,-84.388224
3,Willy's Mexicana Grill #22,Mexican Restaurant,33.751293,-84.385337
4,Jamrock Restaurant,Caribbean Restaurant,33.751554,-84.391356


<h3>Atlanta Police Department Data<h3>

I will be using a csv file from the Atlanta Police Department that contains data about crimes in Atlanta from 2009-2019. Each row is a crime and the dataset's columns include a lot of information about each crime, but we will only be using the type of crime and the neighborhood the crime happened in. The columns are called UCR Literal and Neighborhood. The dataset is called COBRA-2009-2019 and can be found at https://www.atlantapd.org/i-want-to/crime-data-downloads. Below is the first 5 rows of the dataset.

In [3]:
crime_df = pd.read_csv('Atlanta_crime.csv')
crime_df.head()

  has_raised = await self.run_ast_nodes(code_ast.body, cell_name,


Unnamed: 0,Report Number,Report Date,Occur Date,Occur Time,Possible Date,Possible Time,Beat,Apartment Office Prefix,Apartment Number,Location,Shift Occurence,Location Type,UCR Literal,UCR #,IBR Code,Neighborhood,NPU,Latitude,Longitude
0,90010930,2009-01-01,2009-01-01,1145,2009-01-01,1148.0,411.0,,,2841 GREENBRIAR PKWY,Day Watch,8,LARCENY-NON VEHICLE,630,2303,Greenbriar,R,33.68845,-84.49328
1,90011083,2009-01-01,2009-01-01,1330,2009-01-01,1330.0,511.0,,,12 BROAD ST SW,Day Watch,9,LARCENY-NON VEHICLE,630,2303,Downtown,M,33.7532,-84.39201
2,90011208,2009-01-01,2009-01-01,1500,2009-01-01,1520.0,407.0,,,3500 MARTIN L KING JR DR SW,Unknown,8,LARCENY-NON VEHICLE,630,2303,Adamsville,H,33.75735,-84.50282
3,90011218,2009-01-01,2009-01-01,1450,2009-01-01,1510.0,210.0,,,3393 PEACHTREE RD NE,Evening Watch,8,LARCENY-NON VEHICLE,630,2303,Lenox,B,33.84676,-84.36212
4,90011289,2009-01-01,2009-01-01,1600,2009-01-01,1700.0,411.0,,,2841 GREENBRIAR PKWY SW,Unknown,8,LARCENY-NON VEHICLE,630,2303,Greenbriar,R,33.68677,-84.49773


<h3>Data Collection and Preparation<h3>

First I have to load in the Atlanta Crime csv file.

In [4]:
atlanta_crime = pd.read_csv('Atlanta_crime.csv')
atlanta_crime['Count'] = 1
atlanta_crime.head()

  has_raised = await self.run_ast_nodes(code_ast.body, cell_name,


Unnamed: 0,Report Number,Report Date,Occur Date,Occur Time,Possible Date,Possible Time,Beat,Apartment Office Prefix,Apartment Number,Location,Shift Occurence,Location Type,UCR Literal,UCR #,IBR Code,Neighborhood,NPU,Latitude,Longitude,Count
0,90010930,2009-01-01,2009-01-01,1145,2009-01-01,1148.0,411.0,,,2841 GREENBRIAR PKWY,Day Watch,8,LARCENY-NON VEHICLE,630,2303,Greenbriar,R,33.68845,-84.49328,1
1,90011083,2009-01-01,2009-01-01,1330,2009-01-01,1330.0,511.0,,,12 BROAD ST SW,Day Watch,9,LARCENY-NON VEHICLE,630,2303,Downtown,M,33.7532,-84.39201,1
2,90011208,2009-01-01,2009-01-01,1500,2009-01-01,1520.0,407.0,,,3500 MARTIN L KING JR DR SW,Unknown,8,LARCENY-NON VEHICLE,630,2303,Adamsville,H,33.75735,-84.50282,1
3,90011218,2009-01-01,2009-01-01,1450,2009-01-01,1510.0,210.0,,,3393 PEACHTREE RD NE,Evening Watch,8,LARCENY-NON VEHICLE,630,2303,Lenox,B,33.84676,-84.36212,1
4,90011289,2009-01-01,2009-01-01,1600,2009-01-01,1700.0,411.0,,,2841 GREENBRIAR PKWY SW,Unknown,8,LARCENY-NON VEHICLE,630,2303,Greenbriar,R,33.68677,-84.49773,1


Here I make a list of the neighborhoods. This will be helpful for clustering them and for Data Visualization

In [5]:
crime_neighborhoods = pd.DataFrame(atlanta_crime.groupby('Neighborhood').count())
crime_neighborhoods.reset_index(inplace = True) 
neighborhoods = crime_neighborhoods['Neighborhood'].tolist()
neighborhoods

['Adair Park',
 'Adams Park',
 'Adamsville',
 'Almond Park',
 'Amal Heights',
 'Ansley Park',
 'Arden/Habersham',
 'Ardmore',
 'Argonne Forest',
 'Arlington Estates',
 'Ashley Courts',
 'Ashview Heights',
 'Atkins Park',
 'Atlanta Industrial Park',
 'Atlanta University Center',
 'Atlantic Station',
 'Audobon Forest',
 'Audobon Forest West',
 'Baker Hills',
 'Bakers Ferry',
 'Bankhead',
 'Bankhead Courts',
 'Bankhead/Bolton',
 'Beecher Hills',
 'Ben Hill',
 'Ben Hill Acres',
 'Ben Hill Forest',
 'Ben Hill Pines',
 'Ben Hill Terrace',
 'Benteen Park',
 'Berkeley Park',
 'Betmar LaVilla',
 'Blair Villa/Poole Creek',
 'Blandtown',
 'Bolton',
 'Bolton Hills',
 'Boulder Park',
 'Boulevard Heights',
 'Brandon',
 'Brentwood',
 'Briar Glen',
 'Brookhaven',
 'Brookview Heights',
 'Brookwood',
 'Brookwood Hills',
 'Browns Mill Park',
 'Buckhead Forest',
 'Buckhead Heights',
 'Buckhead Village',
 'Bush Mountain',
 'Butner/Tell',
 'Cabbagetown',
 'Campbellton Road',
 'Candler Park',
 'Capitol Gatew

Here I can see that there are 243 neighborhoods.

In [6]:
len(neighborhoods)

243

The UCR Literal column contains the type of crime that occured. Here I rename the different values so they will be acceptable as variable names.

In [7]:
atlanta_crime['UCR Literal'].replace(['LARCENY-NON VEHICLE', 'LARCENY-FROM VEHICLE',
       'ROBBERY-PEDESTRIAN', 'ROBBERY-RESIDENCE', 'AUTO THEFT',
       'AGG ASSAULT', 'BURGLARY-RESIDENCE', 'BURGLARY-NONRES',
       'ROBBERY-COMMERCIAL', 'HOMICIDE', 'MANSLAUGHTER'], ['LARCENY_NON_VEHICLE', 'LARCENY_FROM_VEHICLE', 'ROBBERY_PEDESTRIAN', 'ROBBERY_RESIDENCE', 'AUTO_THEFT', 'AGG_ASSAULT', 'BURGLARY_RESIDENCE', 'BURGLARY_NONRES',
       'ROBBERY_COMMERCIAL', 'HOMICIDE', 'MANSLAUGHTER'], inplace = True)

Now I am turning the different types of crime into a list that will be helpful when making loops later.

In [8]:
crime_types = atlanta_crime['UCR Literal'].unique().tolist()
crime_types

['LARCENY_NON_VEHICLE',
 'LARCENY_FROM_VEHICLE',
 'ROBBERY_PEDESTRIAN',
 'ROBBERY_RESIDENCE',
 'AUTO_THEFT',
 'AGG_ASSAULT',
 'BURGLARY_RESIDENCE',
 'BURGLARY_NONRES',
 'ROBBERY_COMMERCIAL',
 'HOMICIDE',
 'MANSLAUGHTER']

Here I am making a list for each crime type so I can make the crime type a column in a dataframe with each row being a neighborhood.

In [9]:
for crime_type in crime_types:
    globals()[crime_type] = []
for neighborhood in neighborhoods:
    print(neighborhood)
    for crime_type in crime_types:
        type_df = atlanta_crime[atlanta_crime['UCR Literal']==crime_type]
        globals()[crime_type].append(len(type_df[type_df['Neighborhood']==neighborhood].index))    

Adair Park
Adams Park
Adamsville
Almond Park
Amal Heights
Ansley Park
Arden/Habersham
Ardmore
Argonne Forest
Arlington Estates
Ashley Courts
Ashview Heights
Atkins Park
Atlanta Industrial Park
Atlanta University Center
Atlantic Station
Audobon Forest
Audobon Forest West
Baker Hills
Bakers Ferry
Bankhead
Bankhead Courts
Bankhead/Bolton
Beecher Hills
Ben Hill
Ben Hill Acres
Ben Hill Forest
Ben Hill Pines
Ben Hill Terrace
Benteen Park
Berkeley Park
Betmar LaVilla
Blair Villa/Poole Creek
Blandtown
Bolton
Bolton Hills
Boulder Park
Boulevard Heights
Brandon
Brentwood
Briar Glen
Brookhaven
Brookview Heights
Brookwood
Brookwood Hills
Browns Mill Park
Buckhead Forest
Buckhead Heights
Buckhead Village
Bush Mountain
Butner/Tell
Cabbagetown
Campbellton Road
Candler Park
Capitol Gateway
Capitol View
Capitol View Manor
Carey Park
Carroll Heights
Carver Hills
Cascade Avenue/Road
Cascade Green
Cascade Heights
Castleberry Hill
Castlewood
Center Hill
Chalet Woods
Channing Valley
Chastain Park
Chattahooc

Here I am making a dataframe that contains the average longitude and latitude of the crimes that occured in each neighborhood. These coordinates are what I will use as the coordinates for each dataframe when graphing.

In [10]:
coords_df = atlanta_crime[['Neighborhood', 'Latitude', 'Longitude']].groupby(['Neighborhood']).mean()
coords_df.head()

Unnamed: 0_level_0,Latitude,Longitude
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1
Adair Park,33.729698,-84.410426
Adams Park,33.713987,-84.460214
Adamsville,33.758748,-84.503608
Almond Park,33.784186,-84.46047
Amal Heights,33.708719,-84.398984


Here I am making a dataframe that contains the number of total crimes that occurred in each neighborhood.

In [11]:
counts_df = atlanta_crime[['Neighborhood', 'Count']].groupby(['Neighborhood']).sum()
counts_df.head()

Unnamed: 0_level_0,Count
Neighborhood,Unnamed: 1_level_1
Adair Park,2012
Adams Park,1504
Adamsville,2798
Almond Park,850
Amal Heights,372


Here I am adding the number for each type of crime.

In [12]:
for crime_type in crime_types:
    counts_df[crime_type] = globals()[crime_type]
counts_df.head()

Unnamed: 0_level_0,Count,LARCENY_NON_VEHICLE,LARCENY_FROM_VEHICLE,ROBBERY_PEDESTRIAN,ROBBERY_RESIDENCE,AUTO_THEFT,AGG_ASSAULT,BURGLARY_RESIDENCE,BURGLARY_NONRES,ROBBERY_COMMERCIAL,HOMICIDE,MANSLAUGHTER
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
Adair Park,2012,399,440,170,14,343,198,339,75,21,13,0
Adams Park,1504,310,344,92,15,202,114,281,117,25,4,0
Adamsville,2798,603,699,206,22,419,304,316,176,42,11,0
Almond Park,850,123,92,55,16,152,124,254,18,4,12,0
Amal Heights,372,43,54,14,1,75,45,139,1,0,0,0


Now I am joining the counts and coords dataframes to get a full dataframe that contains all the basic information we want on each neighborhood.

In [13]:
full_df = counts_df.join(coords_df, how = 'inner', on = 'Neighborhood').reset_index()
full_df.head()

Unnamed: 0,Neighborhood,Count,LARCENY_NON_VEHICLE,LARCENY_FROM_VEHICLE,ROBBERY_PEDESTRIAN,ROBBERY_RESIDENCE,AUTO_THEFT,AGG_ASSAULT,BURGLARY_RESIDENCE,BURGLARY_NONRES,ROBBERY_COMMERCIAL,HOMICIDE,MANSLAUGHTER,Latitude,Longitude
0,Adair Park,2012,399,440,170,14,343,198,339,75,21,13,0,33.729698,-84.410426
1,Adams Park,1504,310,344,92,15,202,114,281,117,25,4,0,33.713987,-84.460214
2,Adamsville,2798,603,699,206,22,419,304,316,176,42,11,0,33.758748,-84.503608
3,Almond Park,850,123,92,55,16,152,124,254,18,4,12,0,33.784186,-84.46047
4,Amal Heights,372,43,54,14,1,75,45,139,1,0,0,0,33.708719,-84.398984


Now I am making a dataframe that I will use to cluster the neighborhoods. I am normalizing the values for each type of crime so that they will be big enough to get counted in clustering but not so large that they eliminate the effect of the total amount of crimes.

In [14]:
#copying dataframe
clustering_df = counts_df.copy()
#normalizing crime types columns
for crime_type in crime_types:
    clustering_df[crime_type] = counts_df[crime_type]*counts_df['Count'].mean()/counts_df['Count']
clustering_df = clustering_df.reset_index().drop(['Neighborhood'], axis = 1)
clustering_df.head()

Unnamed: 0,Count,LARCENY_NON_VEHICLE,LARCENY_FROM_VEHICLE,ROBBERY_PEDESTRIAN,ROBBERY_RESIDENCE,AUTO_THEFT,AGG_ASSAULT,BURGLARY_RESIDENCE,BURGLARY_NONRES,ROBBERY_COMMERCIAL,HOMICIDE,MANSLAUGHTER
0,2012,269.759732,297.479403,114.935224,9.465254,231.898717,133.865732,229.194359,50.706716,14.197881,8.789164,0.0
1,1504,280.379372,311.130658,83.209362,13.566744,182.698817,103.107253,254.150334,105.820602,22.61124,3.617798,0.0
2,2798,293.158036,339.829962,100.150175,10.69565,203.703511,147.794433,153.628424,85.565198,20.418968,5.347825,0.0
3,850,196.84228,147.231624,88.018906,25.6055,243.252249,198.442624,406.487311,28.806187,6.401375,19.204125,0.0
4,372,157.238075,197.461768,51.193792,3.656699,274.252456,164.551474,508.281218,3.656699,0.0,0.0,0.0


<h3>Data Visualization<h3>

<h4>Choropleth Map<h4>

First I need to open the geojson file that contains the neighborhood's borders. This file is from David Blakman's neighborhoods repository on github:https://github.com/blackmad/neighborhoods/blob/master/atlanta.geojson

In [15]:
with open('atlanta_geojson.json') as f:
    atlanta_geojson = json.load(f)
print(atlanta_geojson)

{'type': 'FeatureCollection', 'features': [{'type': 'Feature', 'properties': {'name': 'Tacotown', 'created_at': '2013-02-13T23:00:00.000Z', 'updated_at': '2013-02-15T23:00:00.000Z', 'cartodb_id': 2}, 'geometry': {'type': 'MultiPolygon', 'coordinates': [[[[-84.368195, 33.746681], [-84.36306, 33.746593], [-84.363083, 33.743561], [-84.368149, 33.743553], [-84.368195, 33.746681]]]]}}, {'type': 'Feature', 'properties': {'name': 'Oakland Cemetery', 'created_at': '2013-02-13T23:00:00.000Z', 'updated_at': '2013-02-15T23:00:00.000Z', 'cartodb_id': 21}, 'geometry': {'type': 'MultiPolygon', 'coordinates': [[[[-84.376656, 33.749542], [-84.37516, 33.749947], [-84.37442, 33.75024], [-84.373894, 33.750462], [-84.372559, 33.750866], [-84.371094, 33.751301], [-84.370399, 33.750027], [-84.369911, 33.749149], [-84.368217, 33.747925], [-84.368202, 33.746761], [-84.372253, 33.746807], [-84.374878, 33.746857], [-84.375877, 33.746876], [-84.375847, 33.747448], [-84.376633, 33.74744], [-84.376656, 33.749542]]

Now I will extract the neighborhoods from the json files and put them into a list

In [16]:
json_neighborhoods = []
for neighborhood in range(len(atlanta_geojson['features'])):
    json_neighborhoods.append(atlanta_geojson['features'][neighborhood]['properties']['name'])
json_neighborhoods

['Tacotown',
 'Oakland Cemetery',
 'High Point',
 'Atkins Park',
 'Pittsburgh',
 'Oakland',
 'Loring Heights',
 'Custer-Mcdonough',
 'Amal Heights',
 'Semmes Park',
 'Swallow Circle / Baywood',
 'West End',
 'Harris Chiles',
 'Historic Westside Village',
 'Meadow Lark Estates',
 'Just Us',
 'Joyland',
 'Westchester Hills / Chelsea Heights',
 'Berkeley Park',
 'East Atlanta',
 'Harvel Homes Community',
 'Mechanicsville',
 'Downtown',
 'Midtown',
 'Sherwood Forest',
 'Cabbagetown',
 'Ansley Park',
 'Piedmont Heights',
 'Virginia-Highland',
 'Druid Hills',
 'Candler Park',
 'Sweet Auburn',
 'Morningside / Lenox Park',
 'Edgewood',
 'Piedmont Park',
 'Lake Claire',
 'Poncey-Highland',
 'Reynoldstown',
 'Old Fourth Ward',
 'The Bluff',
 'Inman Park',
 'Little Five Points',
 'Lindridge - Martin Manor',
 'Armour',
 'Sycamore Ridge',
 'Hunter Hills',
 'South River Gardens',
 'Kirkwood',
 'East Lake',
 'Chosewood Park',
 'English Avenue',
 'Vine City',
 'Georgia Tech',
 'Home Park',
 'Atlantic 

Now I will make a list that contains the intersection of the neighborhoods and json_neighborhoods lists

In [17]:
def intersection(lst1, lst2): 
    lst3 = [value for value in lst1 if value in lst2] 
    return lst3 
combined = intersection(neighborhoods, json_neighborhoods)

Finally, I will put the correct neighborhoods and their number of crimes into a dataframe

In [18]:
#making list of boolean values that contains whether or not each row of the datframe should be included
combined_bool = []
for element in atlanta_crime['Neighborhood'].tolist():
    if element in combined:
        combined_bool.append(True)
    else:
        combined_bool.append(False)
#only including rows for which the combined_bool list is true
choro_df = atlanta_crime[combined_bool]
choro_df = choro_df.groupby('Neighborhood').count()[['Count']].reset_index()
choro_df.head()

Unnamed: 0,Neighborhood,Count
0,Adair Park,2012
1,Amal Heights,372
2,Ansley Park,748
3,Ashview Heights,1989
4,Atkins Park,128


Now I can show the map

In [19]:
world_map = folium.Map(location = (33.7490, -84.3880), zoom_start = 12) 
world_map.choropleth(
    geo_data=atlanta_geojson,
    data=crime_neighborhoods,
    columns=["Neighborhood", "Report Number"],
    key_on="feature.properties.name",
    legend_name="Atlanta Crime"
)
world_map



<h4>Scatter Plot<h4>

I can also use the coords_df to plot each neighborhood as a point on the folium map.

In [20]:
neighborhoods_map = folium.Map(location = (33.7490, -84.3880), zoom_start = 12) 
for lat, lng, name in zip(coords_df['Latitude'], coords_df['Longitude'], coords_df.index):
    label = folium.Popup(name)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        fill=True,
        fill_opacity=0.7).add_to(neighborhoods_map)
       
neighborhoods_map

<h2>Modelling<h2>

<h3>Clustering<h3>

Now we can cluster the neighborhoods and add the results to a dataframe. Here we can see the different values we get for each cluster. Kmeans clusters can slightly vary each time you run the model, but I found that the clusters that averaged the highest number of crimes contained the lowest number of neighborhoods, so I could sort the values to get consistent labels every time, although one or two neighborhoods may move back and forth between clusters over multiple iterations.

In [21]:
#making and fitting model
model = KMeans(init = "k-means++", n_clusters = 4, n_init = 12)
model.fit(clustering_df)
#putting the 4 labels into a list
model_labels=np.unique(model.labels_).tolist()
#putting all the labels into a list to make them easier to use
label_list = model.labels_.tolist()
#making empty lists that I will fill
label_counts = []
labels_list = np.zeros(len(label_list)).tolist()
#filling counts list with the number of occurences for each label
for label in model_labels:
    label_counts.append(model.labels_.tolist().count(label))
#sorting counts list in descending order
sorted_counts = sorted(label_counts, reverse = True)
#replacing the value given by the model with the sorted value
for label in range(len(sorted_counts)):
    current_value = label_counts.index(sorted_counts[label])
    for n, i in enumerate(label_list):
        if i == current_value:
            labels_list[n] = label
#printing results
for label in model_labels:
    print('{}: {}'.format(label, sorted_counts[label]))
#adding results to the main dataframe
full_df['Cluster'] = labels_list
#making dataframe to show the results
results_df = clustering_df.copy()
results_df['Cluster'] = labels_list
results_df = results_df.groupby('Cluster').mean()
show_df = results_df.copy()
for crime_type in crime_types:
    show_df[crime_type] = results_df[crime_type]/counts_df['Count'].mean()
show_df.head()

0: 169
1: 57
2: 15
3: 2


Unnamed: 0_level_0,Count,LARCENY_NON_VEHICLE,LARCENY_FROM_VEHICLE,ROBBERY_PEDESTRIAN,ROBBERY_RESIDENCE,AUTO_THEFT,AGG_ASSAULT,BURGLARY_RESIDENCE,BURGLARY_NONRES,ROBBERY_COMMERCIAL,HOMICIDE,MANSLAUGHTER
Cluster,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
0,350.538462,0.168171,0.28238,0.03931,0.007728,0.155338,0.070449,0.244037,0.026708,0.003385,0.002483,1.1e-05
1,2477.561404,0.19655,0.288474,0.054649,0.007473,0.148055,0.08514,0.170926,0.037869,0.007451,0.003398,1.6e-05
2,5764.2,0.333574,0.276758,0.045498,0.004912,0.127522,0.061288,0.114209,0.026859,0.007203,0.002177,0.0
3,21813.0,0.279035,0.475829,0.047161,0.002466,0.095489,0.04537,0.025582,0.022452,0.005369,0.001246,0.0


It looks like we have four clusters, with cluster zero neighborhoods having a low number of crimes, cluster one neighborhoods having a medium number of crimes, cluster two neighborhoods having a high number of crimes, and cluster three neighborhoods having a very high number of crimes. This can be confirmed by looking at the individual clusters

In [22]:
full_df[full_df['Cluster']==0].head(50)

Unnamed: 0,Neighborhood,Count,LARCENY_NON_VEHICLE,LARCENY_FROM_VEHICLE,ROBBERY_PEDESTRIAN,ROBBERY_RESIDENCE,AUTO_THEFT,AGG_ASSAULT,BURGLARY_RESIDENCE,BURGLARY_NONRES,ROBBERY_COMMERCIAL,HOMICIDE,MANSLAUGHTER,Latitude,Longitude,Cluster
3,Almond Park,850,123,92,55,16,152,124,254,18,4,12,0,33.784186,-84.46047,0
4,Amal Heights,372,43,54,14,1,75,45,139,1,0,0,0,33.708719,-84.398984,0
5,Ansley Park,748,83,453,28,1,92,15,70,6,0,0,0,33.79217,-84.378796,0
6,Arden/Habersham,57,12,24,1,0,7,2,9,2,0,0,0,33.838463,-84.400728,0
7,Ardmore,616,52,481,7,0,43,7,14,8,4,0,0,33.804977,-84.394423,0
8,Argonne Forest,70,10,23,3,0,11,5,17,1,0,0,0,33.840779,-84.403913,0
9,Arlington Estates,330,59,60,15,1,44,33,112,5,0,1,0,33.691885,-84.538608,0
10,Ashley Courts,538,51,73,16,13,128,39,206,12,0,0,0,33.717852,-84.522899,0
12,Atkins Park,128,21,69,5,0,13,2,16,2,0,0,0,33.775669,-84.350905,0
13,Atlanta Industrial Park,197,32,101,4,1,26,8,5,19,0,1,0,33.79684,-84.494977,0


In [23]:
full_df[full_df['Cluster']==1].head(50)

Unnamed: 0,Neighborhood,Count,LARCENY_NON_VEHICLE,LARCENY_FROM_VEHICLE,ROBBERY_PEDESTRIAN,ROBBERY_RESIDENCE,AUTO_THEFT,AGG_ASSAULT,BURGLARY_RESIDENCE,BURGLARY_NONRES,ROBBERY_COMMERCIAL,HOMICIDE,MANSLAUGHTER,Latitude,Longitude,Cluster
0,Adair Park,2012,399,440,170,14,343,198,339,75,21,13,0,33.729698,-84.410426,1
1,Adams Park,1504,310,344,92,15,202,114,281,117,25,4,0,33.713987,-84.460214,1
2,Adamsville,2798,603,699,206,22,419,304,316,176,42,11,0,33.758748,-84.503608,1
11,Ashview Heights,1989,360,419,133,19,276,213,514,38,7,10,0,33.749932,-84.420883,1
14,Atlanta University Center,1834,269,680,198,16,272,189,155,38,7,10,0,33.7516,-84.412137,1
15,Atlantic Station,2962,1603,812,81,11,231,61,105,42,15,1,0,33.79178,-84.398449,1
20,Bankhead,2242,494,268,158,25,221,510,399,120,20,25,2,33.767357,-84.424644,1
33,Blandtown,1974,239,1168,50,11,252,35,118,89,12,0,0,33.792163,-84.423322,1
45,Browns Mill Park,1807,324,282,146,26,305,238,404,62,15,5,0,33.684128,-84.386043,1
46,Buckhead Forest,2081,556,1104,45,4,161,49,87,61,11,3,0,33.845456,-84.376014,1


In [24]:
full_df[full_df['Cluster']==2].head(50)

Unnamed: 0,Neighborhood,Count,LARCENY_NON_VEHICLE,LARCENY_FROM_VEHICLE,ROBBERY_PEDESTRIAN,ROBBERY_RESIDENCE,AUTO_THEFT,AGG_ASSAULT,BURGLARY_RESIDENCE,BURGLARY_NONRES,ROBBERY_COMMERCIAL,HOMICIDE,MANSLAUGHTER,Latitude,Longitude,Cluster
30,Berkeley Park,4583,2795,1155,57,2,249,45,132,115,33,0,0,33.801969,-84.412627,2
85,Edgewood,5176,2763,835,179,14,479,231,535,94,43,3,0,33.756125,-84.344091,2
104,Grant Park,5293,870,2294,183,25,890,213,606,178,30,4,0,33.739602,-84.370679,2
107,Greenbriar,5487,1766,1504,237,26,931,249,541,176,48,9,0,33.684489,-84.492735,2
109,Grove Park,5222,1024,869,365,44,746,667,1246,190,36,35,0,33.77218,-84.447372,2
118,Home Park,4458,646,2674,159,10,456,109,227,160,15,2,0,33.784002,-84.405876,2
136,Lenox,6203,4594,1167,88,2,204,40,3,72,33,0,0,33.847558,-84.361913,2
138,Lindbergh/Morosgo,5000,1976,1361,171,22,527,159,585,141,56,2,0,33.82349,-84.365757,2
146,Mechanicsville,5051,705,1497,358,47,1054,623,632,97,16,22,0,33.738579,-84.399261,2
159,North Buckhead,5651,2389,2235,79,6,375,85,316,135,30,1,0,33.853948,-84.368943,2


In [25]:
full_df[full_df['Cluster']==3].head(50)

Unnamed: 0,Neighborhood,Count,LARCENY_NON_VEHICLE,LARCENY_FROM_VEHICLE,ROBBERY_PEDESTRIAN,ROBBERY_RESIDENCE,AUTO_THEFT,AGG_ASSAULT,BURGLARY_RESIDENCE,BURGLARY_NONRES,ROBBERY_COMMERCIAL,HOMICIDE,MANSLAUGHTER,Latitude,Longitude,Cluster
79,Downtown,25386,7971,11058,1391,57,2336,1481,429,483,139,41,0,33.758493,-84.389368,3
149,Midtown,18240,4452,9413,721,49,1805,591,625,472,96,16,0,33.78017,-84.382027,3


<h4>Clustering Visualization<h4>

Now I can plot each neighborhood with a different color for each cluster

In [26]:
rainbow = ['#2adddd', '#8000ff', '#d4dd80', '#ff0000']

In [27]:
clustering_map = folium.Map(location = (33.7490, -84.3880), zoom_start = 12) 
for lat, lon, name, cluster in zip(full_df['Latitude'], full_df['Longitude'], full_df['Neighborhood'], full_df['Cluster']):
    label = folium.Popup(str(name) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(clustering_map)

clustering_map

<h3>Classification<h3>

<h4>Preparation of foursquare data<h4>

Now I am going to classify each neighborhood into it's cluster based off of the venues around it. First I have to define the constant variables I will use in my foursquare urls. I will use a radius of 800m because that is about the average radius of each neighborhood. I will use a limit of 100 so I have enough data.

In [28]:
CLIENT_ID = 'hidden'
CLIENT_SECRET = 'hidden'
VERSION = '20201127'
radius = 800
LIMIT = 100

Now I will run a foursquare API call to the center of Atlanta and see what a sample response looks like.

In [29]:
test_url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, VERSION, 33.7490, -84.3880, radius, LIMIT)
test_results = requests.get(test_url).json()['response']['groups'][0]['items'][0]['venue']
print(test_results)

{'id': '40e0b100f964a520e9061fe3', 'name': 'The Masquerade', 'location': {'address': '75 Martin Luther King Jr Dr SW', 'crossStreet': 'Central Avenue', 'lat': 33.75171952169695, 'lng': -84.38973873781782, 'labeledLatLngs': [{'label': 'display', 'lat': 33.75171952169695, 'lng': -84.38973873781782}, {'label': 'entrance', 'lat': 33.7513, 'lng': -84.390486}], 'distance': 342, 'postalCode': '30303', 'cc': 'US', 'neighborhood': 'Downtown', 'city': 'Atlanta', 'state': 'GA', 'country': 'United States', 'formattedAddress': ['75 Martin Luther King Jr Dr SW (Central Avenue)', 'Atlanta, GA 30303', 'United States']}, 'categories': [{'id': '4bf58dd8d48988d1e5931735', 'name': 'Music Venue', 'pluralName': 'Music Venues', 'shortName': 'Music Venue', 'icon': {'prefix': 'https://ss3.4sqi.net/img/categories_v2/arts_entertainment/musicvenue_', 'suffix': '.png'}, 'primary': True}], 'photos': {'count': 0, 'groups': []}}


Now that I know the structure of the json file, I will define a function to find nearby venues. It will also drop any neighborhoods that don't have a venue within 800m of their latitude and longitude.

In [30]:
def findNearbyVenues(names, latitudes, longitudes, radius=800):
    venues_list=[]
    to_be_dropped = []
    for name, lat, lng in zip(names, latitudes, longitudes):
        #creating a url with the neighborhoods latitude and longitude
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, VERSION, lat, lng, radius, LIMIT)
        results = requests.get(url).json()['response']
        #indication to make sure the loop is still running
        print(name)
        #accounting for neighborhoods that don't have a venue within 800m of their coordinates
        if list(results.keys())[0] == 'warning':
            to_be_dropped.append(int(full_df[full_df['Neighborhood']==name].index[0]))
        else:
            #adding venue information to list
            results = requests.get(url).json()['response']['groups'][0]['items']
            #accounting for venues that don't have a category
            for v in range(len(results)):
                if len(results[v]['venue']['categories'])>0:
                    category = results[v]['venue']['categories'][0]['name']
                else:
                    category = 'No Category'
                venues_list.append([
                name, 
                lat, 
                lng, 
                results[v]['venue']['name'], 
                results[v]['venue']['location']['lat'], 
                results[v]['venue']['location']['lng'],
                category])
    #making empty dataframes with the correct column names
    nearby_venues = pd.DataFrame(columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category'])
    #filling dataframe with lists created above
    for row in venues_list:
        to_append = row
        nearby_venues.loc[len(nearby_venues)] = to_append

    print(to_be_dropped)
    return nearby_venues

Now I run the function on our dataframe and store the values in a new dataframe and list

In [31]:
atlanta_venues = findNearbyVenues(full_df['Neighborhood'], full_df['Latitude'], full_df['Longitude'])

Adair Park
Adams Park
Adamsville
Almond Park
Amal Heights
Ansley Park
Arden/Habersham
Ardmore
Argonne Forest
Arlington Estates
Ashley Courts
Ashview Heights
Atkins Park
Atlanta Industrial Park
Atlanta University Center
Atlantic Station
Audobon Forest
Audobon Forest West
Baker Hills
Bakers Ferry
Bankhead
Bankhead Courts
Bankhead/Bolton
Beecher Hills
Ben Hill
Ben Hill Acres
Ben Hill Forest
Ben Hill Pines
Ben Hill Terrace
Benteen Park
Berkeley Park
Betmar LaVilla
Blair Villa/Poole Creek
Blandtown
Bolton
Bolton Hills
Boulder Park
Boulevard Heights
Brandon
Brentwood
Briar Glen
Brookhaven
Brookview Heights
Brookwood
Brookwood Hills
Browns Mill Park
Buckhead Forest
Buckhead Heights
Buckhead Village
Bush Mountain
Butner/Tell
Cabbagetown
Campbellton Road
Candler Park
Capitol Gateway
Capitol View
Capitol View Manor
Carey Park
Carroll Heights
Carver Hills
Cascade Avenue/Road
Cascade Green
Cascade Heights
Castleberry Hill
Castlewood
Center Hill
Chalet Woods
Channing Valley
Chastain Park
Chattahooc

In [32]:
#rows copied over from above result, this list will be used later
dropped_rows = [6, 16, 17, 18, 19, 23, 24, 26, 28, 32, 35, 36, 41, 57, 61, 77, 78, 91, 94, 95, 96, 103, 108, 115, 131, 134, 135, 137, 142, 151, 154, 156, 157, 158, 160, 164, 167, 169, 178, 189, 191, 196, 198, 211, 232, 237, 238]

Let's see the new Dataframe now

In [33]:
atlanta_venues.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Adair Park,33.729698,-84.410426,Adair Park One,33.730525,-84.412837,Park
1,Adair Park,33.729698,-84.410426,Monday Night Garage,33.729407,-84.418303,Brewery
2,Adair Park,33.729698,-84.410426,Atlanta BeltLine Corridor under Lee/Murphy,33.727205,-84.417238,Trail
3,Adair Park,33.729698,-84.410426,Boxcar,33.730106,-84.418582,Gastropub
4,Adair Park,33.729698,-84.410426,Wild Heaven West End Brewery & Gardens,33.729979,-84.419031,Brewery


In [34]:
atlanta_venues.shape

(5182, 7)

How many venues does each Neighborhood have?

In [35]:
atlanta_venues.groupby('Neighborhood').count()

Unnamed: 0_level_0,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Adair Park,13,13,13,13,13,13
Adams Park,4,4,4,4,4,4
Adamsville,18,18,18,18,18,18
Almond Park,5,5,5,5,5,5
Amal Heights,8,8,8,8,8,8
...,...,...,...,...,...,...
Wildwood (NPU-H),4,4,4,4,4,4
Wisteria Gardens,14,14,14,14,14,14
Woodfield,6,6,6,6,6,6
Woodland Hills,33,33,33,33,33,33


How many venue categories are there?

In [36]:
print('There are {} unique categories.'.format(len(atlanta_venues['Venue Category'].unique())))

There are 367 unique categories.


Now I will turn the categorical variables into numerical variables so we can classify the neighborhoods

In [37]:
#turning categorical variables into 0 or 1 values
atlanta_onehot = pd.get_dummies(atlanta_venues[['Venue Category']], prefix="", prefix_sep="")
#organizing columns
atlanta_onehot['Neighborhood'] = atlanta_venues['Neighborhood'] 
atlanta_onehot = atlanta_onehot.set_index('Neighborhood').reset_index()
atlanta_onehot.head()

Unnamed: 0,Neighborhood,ATM,Accessories Store,Adult Boutique,African Restaurant,Airport Terminal,American Restaurant,Animal Shelter,Antique Shop,Aquarium,...,Waste Facility,Weight Loss Center,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio,Zoo,Zoo Exhibit
0,Adair Park,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Adair Park,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Adair Park,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Adair Park,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Adair Park,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Now I will group the dataset by Neighborhood and average the amount for each category

In [38]:
atlanta_grouped = atlanta_onehot.groupby('Neighborhood').mean().reset_index()
atlanta_grouped

Unnamed: 0,Neighborhood,ATM,Accessories Store,Adult Boutique,African Restaurant,Airport Terminal,American Restaurant,Animal Shelter,Antique Shop,Aquarium,...,Waste Facility,Weight Loss Center,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio,Zoo,Zoo Exhibit
0,Adair Park,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Adams Park,0.0,0.0,0.0,0.0,0.0,0.250000,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Adamsville,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Almond Park,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Amal Heights,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
191,Wildwood (NPU-H),0.0,0.0,0.0,0.0,0.0,0.250000,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.5,0.0,0.0,0.0,0.0
192,Wisteria Gardens,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
193,Woodfield,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
194,Woodland Hills,0.0,0.0,0.0,0.0,0.0,0.060606,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Now  I will classify the neighborhoods using a variety of models and use 5 fold cross validation with a few different scoring methods to evaluate the models.

In [39]:
feature_set = atlanta_grouped.columns[1:].tolist()
feature_set

['ATM',
 'Accessories Store',
 'Adult Boutique',
 'African Restaurant',
 'Airport Terminal',
 'American Restaurant',
 'Animal Shelter',
 'Antique Shop',
 'Aquarium',
 'Arcade',
 'Arepa Restaurant',
 'Art Gallery',
 'Art Museum',
 'Arts & Crafts Store',
 'Arts & Entertainment',
 'Asian Restaurant',
 'Athletics & Sports',
 'Auto Garage',
 'Auto Workshop',
 'Automotive Shop',
 'BBQ Joint',
 'Bagel Shop',
 'Bakery',
 'Bank',
 'Bar',
 'Baseball Field',
 'Basketball Court',
 'Basketball Stadium',
 'Bath House',
 'Bed & Breakfast',
 'Beer Bar',
 'Beer Garden',
 'Beer Store',
 'Big Box Store',
 'Bike Shop',
 'Bistro',
 'Board Shop',
 'Boat or Ferry',
 'Bookstore',
 'Border Crossing',
 'Botanical Garden',
 'Boutique',
 'Bowling Alley',
 'Boxing Gym',
 'Brazilian Restaurant',
 'Breakfast Spot',
 'Brewery',
 'Bridal Shop',
 'Bridge',
 'Bubble Tea Shop',
 'Buffet',
 'Building',
 'Burger Joint',
 'Burrito Place',
 'Bus Line',
 'Bus Station',
 'Bus Stop',
 'Business Service',
 'Cafeteria',
 'Café',


Now I am going to define the independent and dependent variables

In [40]:
X = np.array(atlanta_grouped[feature_set])
X

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [41]:
dropped_full_df = full_df.drop(index = dropped_rows)
y = np.array(dropped_full_df['Cluster'])

<h4>K Nearest Neighbors<h4>

The first method I will use is K nearest neighbors. 

In [42]:
#list of scorers I will use
scoring = ['accuracy', 'f1_weighted', 'jaccard_weighted']
for scorer in scoring:
    #making variables for each scorer
    globals()['neighbors_'+scorer] = []
    for k in range(1, 10):
        #making a model for each k
        neigh = KNeighborsClassifier(n_neighbors = k)
        #evaluating the model with each scorer and storing the results in a list for each scorer
        globals()['neighbors_'+scorer].append(cross_validate(neigh, X, y, scoring = scorer)['test_score'].mean())
#averaging the score from each scorer for each k value and finding the best k
average_neighbor_score = [(x + y + z)/3 for x, y, z in zip(neighbors_accuracy, neighbors_f1_weighted, neighbors_jaccard_weighted)]
best_k = average_neighbor_score.index(max(average_neighbor_score))
#getting each score for the best k
for scorer in scoring:
    globals()['best_'+scorer] = globals()['neighbors_'+scorer][best_k]
#printing results
print('The best k was {}, with a f1_score of {}, a jaccard_score of {}, and an accuracy score of {}'.format(best_k, best_f1_weighted, best_jaccard_weighted, best_accuracy))



The best k was 8, with a f1_score of 0.5301209853228079, a jaccard_score of 0.397844074398254, and an accuracy score of 0.5662820512820513


Those scores aren't very good. K nearest neighbors may not be the best model for this. Now I am going to try a support vector machine.

<h4>Support Vector Machine<h4>

In [43]:
#making list of c values
c_list = np.arange(.01, 0.1, 0.01).tolist()
for scorer in scoring:
    #making variable for each scorer
    globals()['svm_'+scorer] = []
    for c in c_list:
        #creating model for each c value
        SVM1 = svm.SVC(C=c, kernel='linear')
        #evaluating the model with each scorer and storing the results in a list for each scorer
        globals()['svm_'+scorer].append(cross_validate(SVM1, X, y, scoring = scorer)['test_score'].mean())
#averaging the score from each scorer for each c value and finding the best c
average_svm_score = [(x + y + z)/3 for x, y, z in zip(svm_accuracy, svm_f1_weighted, svm_jaccard_weighted)]
best_c = c_list[average_svm_score.index(max(average_svm_score))]
for scorer in scoring:
    #getting each score for the best c
    globals()['best_'+scorer] = globals()['svm_'+scorer][c_list.index(best_c)]
#printing the results
print('The best c was {}, with a f1_score of {}, a jaccard_score of {}, and an accuracy score of {}'.format(best_c, best_f1_weighted, best_jaccard_weighted, best_accuracy))



The best c was 0.01, with a f1_score of 0.4904075091575092, a jaccard_score of 0.4004129684418146, and an accuracy score of 0.6326923076923077




That's a bit better, but still not very good. The last model I will try is a Decision Tree

<h4>Decision Tree<h4>

In [44]:
#making list of depths
depths = np.arange(1, 10, 1)
for scorer in scoring:
    #making variable for each scorer
    globals()['tree_'+scorer] = []
    for depth in depths:
        #creating model for each depth
        tree = DecisionTreeClassifier(max_depth = depth, criterion='entropy')
        #evaluating the model with each scorer and storing the results in a list for each scorer
        globals()['tree_'+scorer].append(cross_validate(tree, X, y, scoring = scorer)['test_score'].mean())
#averaging the score from each scorer for each depth and finding the best depth
average_tree_score = [(x + y + z)/3 for x, y, z in zip(tree_accuracy, tree_f1_weighted, tree_jaccard_weighted)]
best_depth = depths[average_tree_score.index(max(average_tree_score))]
for scorer in scoring:
    #getting each score for the best depth
    globals()['best_'+scorer] = globals()['tree_'+scorer][best_k]
#printing the results
print('The best depth was {}, with a f1_score of {}, a jaccard_score of {}, and an accuracy score of {}'.format(best_depth, best_f1_weighted, best_jaccard_weighted, best_accuracy))



The best depth was 7, with a f1_score of 0.5490000640961761, a jaccard_score of 0.3916241796803941, and an accuracy score of 0.5202564102564103


That is also not very good. It doesn't seem like the types of venues in a neighborhood are a very good indicator of the amount of crime the neighborhood has.

<h2>Results and Discussion<h2>

In the analysis, I found that there were four main clusters, with each varying mostly by level of crime while the distribution between different types of crime was relatively consistent. The most common type of neighborhood had a low level of crime, while there were only two negihborhoods with a very high level of crime. After I made the clusters, I used fourquare location data to see if I could classify a neighborhood based off of the types of venues in that neighborhood using SVM, k-nearest neighbors, and a Decision tree. However, using cross validation, I found that none of the models were very accurate, even with tuning of hyperparameters, which led me to the conclusion that the categories of venues in and around a neighborhood do not have a large effect on the crime that occurs in that neighborhood.

<h3>Conclusion<h3>

The goal of this analysis was to make a model for anyone living in or considering living in Atlanta to help understand how different types of venues may impact the amount of crime in a certain neighborhood. Although I was able to cluster the neighborhoods, which could be helpful for residents and potential residents, the classification model turned out to be inconclusive. A factor I could consider for future analysis is the effect of specific types of venues as opposed to the types of all venues as a whole. 