# Los Angeles Neighborhood Analysis


My article on [Medium](https://chaitanya-kasaraneni.medium.com/los-angeles-neighborhood-analysis-c43457441869)

***Note:*** *GitHub doesn't show folium maps. To view maps please go to https://nbviewer.jupyter.org*

## Introduction

Los Angeles is a vibrant city with many neighborhoods, each with its own character. Some neighborhoods are quiet and cozy and have conveniently located stores, while others offer plenty of entertainment and nightlife. Choosing a neighborhood to live in or to open a business in can be complicated, but location data from Foursquare combined with crime data can make the choice a little easier.

### Business Problem
The objective of this capstone project is to analyze neighborhoods in the city of Los Angeles, California and identify the best ones to live in or to open a new business in. Using data science methodology and machine learning techniques such as clustering, the project aims to answer the business question: in Los Angeles, which neighborhoods are better places to live or to start a business?

### Target Audience
- People interested in moving to Los Angeles and looking for a perfect neighborhood for their needs
- Business owners looking to expand their business to a new location
- A beginner data scientist who may use this research as an example

## Data

For this project, the following data is needed:
- List of neighborhoods in Los Angeles
- Latitude and longitude coordinates of the neighborhoods, needed to get the venue data
- Crime data for Los Angeles
- Venue details

### Data Sources and Preparation:

1.	**Location Data**
    - First, we need a full list of all LA neighborhoods. The Wikipedia article [List of districts and neighborhoods in Los Angeles](https://en.wikipedia.org/wiki/List_of_districts_and_neighborhoods_in_Los_Angeles) is a great place to start.
    - [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) is a Python library for pulling data out of HTML. We will use it to parse the Wikipedia page.
    - For geolocation data, we will use Google's Geocoding API. For more information, see the [Geocoding Developer Guide](https://developers.google.com/maps/documentation/geocoding/intro).
 
 
2.	**Venues Data (Foursquare API)**
    - [Foursquare API](https://foursquare.com/) provides information about venues and geolocation. We will use Foursquare API to get the venue data for LA neighborhoods. Foursquare has one of the largest databases of 105+ million places and is used by over 125,000 developers. Foursquare API will provide many categories of the venue data such as name, location, hours, rating, prices, etc.
    
    
3.	**Crime Data**
    - To analyze criminal activity in each neighborhood, we use the LAPD dataset [Crime Data from 2020 to Present](https://data.lacity.org/A-Safe-City/Crime-Data-from-2020-to-Present/2nrs-mtv8). It contains the location, time, category and other miscellaneous details of incidents recorded by the LA Police Department.

#### Import Required Libraries

In [None]:
import pandas as pd
import numpy as np

import requests
from bs4 import BeautifulSoup
import json  
from pandas import json_normalize  # pandas.io.json.json_normalize is the older, deprecated import path
from geopy.geocoders import Nominatim

import matplotlib.pyplot as plt
import matplotlib.cm as cm
import matplotlib.colors as colors

import seaborn as sns

import folium #maps library

from sklearn.cluster import KMeans

import warnings
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', None)

### Neighbourhoods Data

To begin with, we need a list of neighborhoods in LA. We scrape it from the Wikipedia page [List of districts and neighborhoods in Los Angeles](https://en.wikipedia.org/wiki/List_of_districts_and_neighborhoods_in_Los_Angeles) using BeautifulSoup.

Can you guess how many neighborhoods are in LA? 200!

In [None]:
link = requests.get("https://en.wikipedia.org/wiki/List_of_districts_and_neighborhoods_in_Los_Angeles")
soup = BeautifulSoup(link.text, "lxml")

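# Neighborhood names are <li> items inside the multi-column list blocks
sections = soup.find_all(class_="div-col columns column-width")
places = BeautifulSoup(str(sections), "lxml").find_all('li')

neighborhoods_list = []

for item in places:
    name = item.find('a').contents[0]
    # For one entry the first link is the footnote marker '[40]', not the name
    if name == '[40]':
        neighborhoods_list.append('Pico Robertson')
    else:
        neighborhoods_list.append(name)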

In [None]:
len(neighborhoods_list)

Using Google's Geocoding API, we will get geolocation information for each neighborhood. To learn more about the API, see the [Geocoding Developer Guide](https://developers.google.com/maps/documentation/geocoding/intro).

In [None]:
geoKey = 'your-geocodingAPI-key' #key for Geocoding API

Function to get the Neighborhood details using Geocoding API

In [None]:
def getNeighborhoodData(neighborhoods_list):
    '''
    DESCRIPTION:
        Using Geocoding API, this method gets the location data for each neighborhood 
    PARAMETERS:
        INPUT:
            List of Neighborhoods
        OUTPUT:
            JSON containing location data for each neighborhood 
    '''
    try:
        with open("../input/la-neighborhoods/LA_Neighborhoods.json") as data:
            jsonList = json.load(data)
    except IOError:
        jsonList = []
        for neighborhood in neighborhoods_list:
            parameters = {
                "address": "%s, Los Angeles, CA" % neighborhood,
                "key": geoKey 
            }
            results = requests.get(
                'https://maps.googleapis.com/maps/api/geocode/json', 
                params=parameters
            ).json()
            jsonList.append(results)
        with open("../input/la-neighborhoods/LA_Neighborhoods.json", 'w') as outputFile:
            json.dump(jsonList, outputFile)
        
    return jsonList

In [None]:
jsonList = getNeighborhoodData(neighborhoods_list)
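
The function above follows a cache-or-fetch pattern: reuse the saved JSON when the file exists, otherwise call the API and save the responses. The sketch below illustrates that pattern offline. The names `load_or_fetch` and `fake_geocode` are hypothetical, and `fake_geocode` only stands in for the real HTTP request.

```python
import json
import os
import tempfile


def load_or_fetch(cache_path, names, fetch):
    """Return cached results if cache_path exists; otherwise fetch and cache them."""
    try:
        with open(cache_path) as cache_file:
            return json.load(cache_file)
    except FileNotFoundError:
        results = [fetch(name) for name in names]
        with open(cache_path, "w") as cache_file:
            json.dump(results, cache_file)
        return results


calls = []  # records every time the (fake) API is hit


def fake_geocode(name):
    # Stand-in for the real request; returns a small response-shaped dict
    calls.append(name)
    return {"results": [{"query": name}], "status": "OK"}


cache_path = os.path.join(tempfile.mkdtemp(), "neighborhoods.json")
first = load_or_fetch(cache_path, ["Venice", "Sawtelle"], fake_geocode)
second = load_or_fetch(cache_path, ["Venice", "Sawtelle"], fake_geocode)
print(len(calls))  # the second call is served from the cache file
```

Caching this way keeps repeated notebook runs from spending API quota on lookups that were already made.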

In [None]:
neighborData = []
for element in jsonList:
    if element['results']:
        neighborData.append([
            element['results'][0]['address_components'][0]['long_name'],
            element['results'][0]['geometry']['location']['lat'],
            element['results'][0]['geometry']['location']['lng']
        ])

**Convert to Pandas DataFrame**

In [None]:
laDF = pd.DataFrame(
    data=neighborData,
    columns=["Neighborhood", "Latitude", "Longitude"],
)

laDF.head(10)

We can see that for some neighborhoods, only a portion of the address was saved in the Neighborhood column. We need to clean these up; I chose to do it manually.

In [None]:
#correct anomalies
laDF.loc[1, "Neighborhood"] = "Angeles Mesa"
laDF.loc[8, "Neighborhood"] = "Baldwin Hills Crenshaw"
laDF.loc[11, "Neighborhood"] = "Beachwood Canyon"
laDF.loc[16, "Neighborhood"] = "Beverly Grove"
laDF.loc[33, "Neighborhood"] = "Chesterfield Square"
laDF.loc[43, "Neighborhood"] = "East Gate Bel Air"
laDF.loc[59, "Neighborhood"] = "Flower District"
laDF.loc[61, "Neighborhood"] = "Gallery Row"
laDF.loc[83, "Neighborhood"] = "Jewelry District"
laDF.loc[96, "Neighborhood"] = "Little Italy"
laDF.loc[118, "Neighborhood"] = "Old Bank District"
laDF.loc[124, "Neighborhood"] = "Park La Brea"
laDF.loc[144, "Neighborhood"] = "Sonoratown"
laDF.loc[184, "Neighborhood"] = "Westside Village"
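
As a possible alternative to manual correction, the address component typed `neighborhood` could be picked out instead of always taking the first component. This is only a sketch: the helper `neighborhood_name` is hypothetical, and `sample_result` is a hand-written dict shaped like one entry of a Geocoding API `results` list, with illustrative values.

```python
def neighborhood_name(result, fallback):
    """Pick the address component typed 'neighborhood', else return fallback."""
    for component in result.get("address_components", []):
        if "neighborhood" in component.get("types", []):
            return component["long_name"]
    return fallback


# Hand-written sample; names and coordinates are illustrative only
sample_result = {
    "address_components": [
        {"long_name": "South Fairfax Avenue", "types": ["route"]},
        {"long_name": "Park La Brea", "types": ["neighborhood", "political"]},
        {"long_name": "Los Angeles", "types": ["locality", "political"]},
    ],
    "geometry": {"location": {"lat": 34.06, "lng": -118.35}},
}

name = neighborhood_name(sample_result, fallback="unused")
no_match = neighborhood_name({"address_components": []}, fallback="Sonoratown")
print(name, no_match)
```

Passing the originally queried name as `fallback` would keep a sensible label when the response has no neighborhood component.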

In [None]:
laDF.dtypes

In [None]:
laDF.head(10)

In [None]:
address = 'Los Angeles'

geolocator = Nominatim(user_agent = "ExploreLA")
LA_location = geolocator.geocode(address)
LA_latitude = LA_location.latitude
LA_longitude = LA_location.longitude

print('The geographical coordinates of Los Angeles are {}, {}.'.format(LA_latitude, LA_longitude))

**Plotting neighbourhoods in LA on the map**

In [None]:
mapLA = folium.Map(
    location=[LA_latitude, LA_longitude], 
    tiles='Stamen Toner', 
    zoom_start=10, 
)

# add markers to map
for lat, lng, neighborhood in zip(laDF['Latitude'], laDF['Longitude'], laDF['Neighborhood']):
    label = '{}'.format(neighborhood)
    label = folium.Popup(label, parse_html = True)
    folium.Marker(
        [lat, lng],
        popup = label,
    ).add_to(mapLA)

In [None]:
mapLA

### Crime Data

To analyze criminal activity, we use LA Police Department crime data from the beginning of 2020 to the present. It contains information about location, time, category and other miscellaneous details from the LAPD.

In [None]:
crimeData = pd.read_csv('../input/crime-data-from-2020-to-present/Crime_Data_from_2020_to_Present.csv')

In [None]:
crimeData.head()

In [None]:
# rename() returns a new DataFrame, which avoids modifying a slice of crimeData in place
crimeDF = crimeData[['DR_NO','AREA NAME']].rename(
    columns={"DR_NO": "IncidentID", 'AREA NAME':'Area'}
)

In [None]:
crimeDF.dtypes

The data has 21 areas, one for each of the LAPD's 21 community police stations.

**Counting the number of crimes for each community Police station**

In [None]:
crimeDFCounts = crimeDF.groupby('Area').agg(['count'])
crimeDFCounts.reset_index(inplace=True)
crimeDFCounts.columns = crimeDFCounts.columns.droplevel(level=1)
crimeDFCounts.rename(columns={"IncidentID": "NumberofCrimes"}, inplace=True)
# a single .loc call avoids chained assignment, which may silently fail to update
crimeDFCounts.loc[crimeDFCounts['Area']=='N Hollywood', 'Area'] = 'North Hollywood'
crimeDFCounts.sort_values(by="NumberofCrimes", ascending=False).head(10)
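
The same per-area count can also be written with `value_counts`. The sketch below runs on a toy DataFrame that stands in for the two columns kept from the crime dataset; the rows are invented for illustration.

```python
import pandas as pd

# Toy stand-in for the IncidentID / Area columns (invented rows)
sample = pd.DataFrame({
    "IncidentID": [1, 2, 3, 4, 5],
    "Area": ["Pacific", "N Hollywood", "Pacific", "Central", "N Hollywood"],
})

counts = (
    sample["Area"]
    .replace({"N Hollywood": "North Hollywood"})  # same renaming as above
    .value_counts()                               # incidents per area, largest first
    .rename_axis("Area")
    .reset_index(name="NumberofCrimes")
)
print(counts)
```

This gives the same two columns, `Area` and `NumberofCrimes`, without the multi-level column handling.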

In [None]:
plt.figure(figsize=(20, 10))

sns.set(style="white", context="talk", palette="rocket")

sns.barplot(
    data=crimeDFCounts,
    x=crimeDFCounts["Area"],
    y=crimeDFCounts["NumberofCrimes"],
)

plt.xticks(rotation=45, ha='right')
sns.despine(offset=10, trim=True, bottom=True)
plt.tight_layout(h_pad=2)

Let's plot a Choropleth map for areas under community Police Station based on number of crimes

In [None]:
LAgeo = r'../input/lapd-divisions/LAPD_Divisions.json'

mapLACrimes = folium.Map(
    location=[LA_latitude, LA_longitude], 
    zoom_start=10, 
    tiles='Stamen Toner', 
)

# Map.choropleth() is deprecated in favour of the folium.Choropleth class
folium.Choropleth(
    geo_data=LAgeo,
    name='choropleth',
    data=crimeDFCounts,
    columns=['Area', 'NumberofCrimes'],
    key_on='feature.properties.name',
    fill_color='YlOrRd',
    fill_opacity=0.7,
    line_opacity=0.2,
    legend_name='Crimes in LA'
).add_to(mapLACrimes)

In [None]:
mapLACrimes

In [None]:
# add markers to map
for lat, lng, neighborhood in zip(laDF['Latitude'], laDF['Longitude'], laDF['Neighborhood']):
    label = '{}'.format(neighborhood)
    label = folium.Popup(label, parse_html = True)
    folium.Marker(
        [lat, lng],
        popup = label,
    ).add_to(mapLACrimes)

In [None]:
mapLACrimes

The neighborhoods under the Pacific, 77th Street and Southwest LAPD community divisions have a higher number of recorded crimes.
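
The map shows visually which division each neighborhood marker falls in. To make that assignment in code, a point-in-polygon test is one option. The sketch below uses a plain ray-casting test on two made-up square "divisions". It is an illustration under simplifying assumptions, not what this notebook does: real GeoJSON stores coordinates as `[lng, lat]` and may contain multi-part polygons, and a geometry library would normally handle this.

```python
def point_in_polygon(lat, lng, polygon):
    """Ray-casting test; polygon is a list of (lat, lng) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        lat1, lng1 = polygon[i]
        lat2, lng2 = polygon[(i + 1) % n]
        # Does this edge straddle the point's latitude?
        if (lat1 > lat) != (lat2 > lat):
            # Longitude at which the edge crosses that latitude
            crossing = lng1 + (lat - lat1) * (lng2 - lng1) / (lat2 - lat1)
            if lng < crossing:
                inside = not inside
    return inside


def find_division(lat, lng, divisions):
    """Return the name of the first polygon containing the point, else None."""
    for name, polygon in divisions.items():
        if point_in_polygon(lat, lng, polygon):
            return name
    return None


# Two made-up square divisions on a toy grid (not real LAPD boundaries)
toy_divisions = {
    "Division A": [(0, 0), (0, 2), (2, 2), (2, 0)],
    "Division B": [(0, 2), (0, 4), (2, 4), (2, 2)],
}

print(find_division(1, 1, toy_divisions))
print(find_division(1, 3, toy_divisions))
print(find_division(5, 5, toy_divisions))
```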

### Venues Data (Foursquare API)

Foursquare API provides information about venues and geolocation.

In [None]:
# Define Foursquare credentials and version (private values removed)
CLIENT_ID = '' # your Foursquare ID
CLIENT_SECRET = '' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

# defining radius and limit of venues to get
radius=1000
LIMIT=200

In [None]:
def getVeneus(neighborhood, latitude, longitude, category=None, radius=1000):
    venues_list = []
    params = {
        "client_id": CLIENT_ID,
        "client_secret": CLIENT_SECRET,
        "v": VERSION,
        "ll": "{},{}".format(latitude, longitude),
        "radius": radius,
        "limit": LIMIT,
    }
    url = 'https://api.foursquare.com/v2/venues/search'    
    results = requests.get(url, params=params).json()

    if not results["response"]:
        return []

    for v in results["response"]['venues']:
        if not v['categories']:
            continue
        venues_list.append([
            neighborhood,
            latitude, 
            longitude, 
            v['name'], 
            v['location']['lat'], 
            v['location']['lng'],
            v['categories'][0]["name"]
        ])
    return venues_list
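
The parsing step in the function above can be tried without network access. The sketch below applies the same extraction rules to a hand-written payload shaped like the search response used above. The helper `parse_venues` and the sample values are hypothetical, and using `.get` with defaults is a defensive choice added here for responses that lack the expected keys.

```python
def parse_venues(payload):
    """Extract (name, lat, lng, category) tuples from a search response dict."""
    venues = payload.get("response", {}).get("venues", [])
    rows = []
    for venue in venues:
        categories = venue.get("categories") or []
        if not categories:
            continue  # same rule as above: skip venues without a category
        location = venue["location"]
        rows.append(
            (venue["name"], location["lat"], location["lng"], categories[0]["name"])
        )
    return rows


# Hand-written sample payload; the venues are invented
sample_payload = {
    "response": {
        "venues": [
            {
                "name": "Cafe A",
                "location": {"lat": 34.0, "lng": -118.2},
                "categories": [{"name": "Coffee Shop"}],
            },
            {
                "name": "Unlabelled Place",
                "location": {"lat": 34.1, "lng": -118.3},
                "categories": [],
            },
        ]
    }
}

rows = parse_venues(sample_payload)
empty = parse_venues({"meta": {"code": 400}})
print(rows, empty)
```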

In [None]:
def getNearbyVenues(neighborhoods, latitudes, longitudes, category=None, radius=1000):
    
    venues_list=[]
    for neig, lat, lng in zip(neighborhoods, latitudes, longitudes):
        results = getVeneus(neig, lat, lng, category=category, radius=radius)
        venues_list += results
        
    if not venues_list:
        print("venue list is empty")
        return []
    
    venues_data = pd.DataFrame(venue for venue in venues_list)
    venues_data.columns = [
        'Neighborhood', 
        'Neighborhood Latitude', 
        'Neighborhood Longitude', 
        'Venue', 
        'Venue Latitude', 
        'Venue Longitude',
        'Venue Category',
    ]
    
    return venues_data

In [None]:
venues_df = getNearbyVenues(
    neighborhoods=laDF['Neighborhood'],
    latitudes=laDF['Latitude'],
    longitudes=laDF['Longitude'],
)

In [None]:
venues_df.drop_duplicates(keep="first", inplace=True)
venues_df.head()

For this project we need a general category for each venue, i.e. the top-level Foursquare category that its "Venue Category" belongs to. So next we look up the general category of each venue.

In [None]:
def get_categories():
    try:
        with open("categories.json") as data:
            categories = json.load(data)
    except IOError:
        url = 'https://api.foursquare.com/v2/venues/categories'
        params = {
            "client_id": CLIENT_ID,
            "client_secret": CLIENT_SECRET,
            "v": VERSION,
        }
        categories = requests.get(url, params=params).json()["response"]["categories"]
    return categories

In [None]:
# recursively collect the name of a category node and of all its sub-categories into the list `categories`

def collect_categories(node, categories):
    categories.append(node["name"])
    if not node["categories"]:
        return
    for sub_node in node['categories']:
        collect_categories(sub_node, categories)

In [None]:
# build one dictionary: top-level category name -> list of that name plus all its sub-category names
categories_list = {}
for i in get_categories():
    categories = []
    collect_categories(i, categories)
    categories_list[i["name"]] = categories

In [None]:
venueCat = []

for venue_category in venues_df["Venue Category"]:
    for key in categories_list.keys():
        if venue_category in categories_list[key]:
            venueCat.append(key)
            break  # stop at the first match so each venue adds at most one entry

venues_df["General Venue Category"] = venueCat

venues_df.head(10)
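
The lookup above scans every top-level list for each venue. An equivalent idea is to build a reverse lookup once, mapping each category name to its top-level parent. The sketch below does this on a small invented category tree with the same `name` / `categories` shape. The names `flatten` and `parent_of` are hypothetical.

```python
def flatten(node, names):
    """Append the node's name and all descendant names to names."""
    names.append(node["name"])
    for child in node.get("categories", []):
        flatten(child, names)


# Small invented tree shaped like the categories response
sample_tree = [
    {"name": "Food", "categories": [
        {"name": "Asian Restaurant", "categories": [
            {"name": "Sushi Restaurant", "categories": []},
        ]},
        {"name": "Bakery", "categories": []},
    ]},
    {"name": "Nightlife Spot", "categories": [
        {"name": "Bar", "categories": []},
    ]},
]

parent_of = {}
for top in sample_tree:
    names = []
    flatten(top, names)
    for name in names:
        # setdefault keeps the first parent if a name appears under two parents
        parent_of.setdefault(name, top["name"])

print(parent_of["Sushi Restaurant"])
print(parent_of.get("Zoo"))  # None for names missing from the tree
```

With such a dictionary, a pandas `Series.map(parent_of)` call would assign the general category in one step and leave unmatched names as missing values.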

For this project, we need only Shop & Service, Outdoors & Recreation, Travel & Transport, Food, Nightlife Spot and Arts & Entertainment, so let's drop the others.

In [None]:
venues_df.drop(
    venues_df[
        (venues_df["General Venue Category"] == 'Professional & Other Places') |
        (venues_df["General Venue Category"] == 'Residence') |
        (venues_df["General Venue Category"] == 'College & University')
    ].index,
    axis=0,
    inplace=True,
)
 
venues_df.head(10)

In [None]:
venueList = venues_df["General Venue Category"].unique()
venueList

In [None]:
colorDict = {
    'Shop & Service': 'red',
    'Outdoors & Recreation': 'cadetblue',
    'Travel & Transport': 'darkgreen',
    'Food': 'orange',
    'Nightlife Spot': 'purple',
    'Arts & Entertainment': 'beige',
}

In [None]:
from folium.plugins import MarkerCluster

venueMap = folium.Map(
    location=[LA_latitude, LA_longitude], 
    tiles='Stamen Toner', 
    zoom_start=10
)

markCluster = MarkerCluster().add_to(venueMap)

for lat, lng, cat in zip(venues_df['Venue Latitude'],
                         venues_df['Venue Longitude'],
                         venues_df['General Venue Category']):  
    if cat in colorDict:
        folium.Marker(
            location=[lat, lng],
            icon=folium.Icon(color=colorDict[cat]),
        ).add_to(markCluster)

In [None]:
venueMap

Now let us see which category of venues is popular in LA

In [None]:
venCountdf = venues_df.groupby(["General Venue Category"]).count()
venCountdf.reset_index(inplace=True)  # already yields a default 0..n-1 index

venCountdf.drop(
    columns=[
        "Neighborhood",
        "Neighborhood Latitude",
        "Neighborhood Longitude",
        "Venue",
        "Venue Latitude",
        "Venue Longitude"
    ],
    inplace=True
)

venCountdf.rename(
    columns={"Venue Category": "Number of Categories"},
    inplace=True
)

venCountdf.sort_values(
    by="Number of Categories",
    ascending=False
)

In [None]:
plt.figure(figsize=(20, 10))

sns.set(style="white", context="talk", palette="rocket")

sns.barplot(
    data=venCountdf,  # plot from the aggregated counts, not the raw venue rows
    x="General Venue Category",
    y="Number of Categories",
)

sns.despine(offset=10, trim=True, bottom=True)
plt.tight_layout(h_pad=2)

Shopping is the most popular category of venues, followed by food.

### Clustering using k-Means

In [None]:
venCatDF = venues_df.copy()

# Count venues per neighborhood and general category, then attach the
# counts to every venue row of that neighborhood (one column per category)
categoryCounts = pd.crosstab(
    venues_df["Neighborhood"], venues_df["General Venue Category"]
)[list(venueList)]
venCatDF = venCatDF.join(categoryCounts, on="Neighborhood")

venCatDF

In [None]:
kclusters = 5
clusterDF = venCatDF.drop(
    columns=["Neighborhood", "Neighborhood Latitude", "Neighborhood Longitude",'Venue',"Venue Latitude",
             'Venue Longitude', 'Venue Category','General Venue Category']
)
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(clusterDF)
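
The clustering above is fitted on one row per venue, so neighborhoods with more venues contribute more rows. A variant worth comparing, shown here only as a sketch on invented data, fits k-means on one row per neighborhood built with `pd.crosstab`. The toy DataFrame, `n_clusters=2` and `n_init=10` are assumptions for the example, not settings from this analysis.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Invented venue rows for four toy neighborhoods
toy = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "B", "B", "C", "C", "C", "D", "D"],
    "General Venue Category": [
        "Food", "Food", "Shop & Service",
        "Nightlife Spot", "Nightlife Spot",
        "Food", "Food", "Food",
        "Nightlife Spot", "Shop & Service",
    ],
})

# One row per neighborhood, one column per category, values are venue counts
counts = pd.crosstab(toy["Neighborhood"], toy["General Venue Category"])

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(counts)
labels = pd.Series(model.labels_, index=counts.index, name="Cluster_Labels")
print(labels)
```

Each neighborhood then receives exactly one label, which can be joined back to the coordinates for mapping.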

In [None]:
mergedLADF = venCatDF.copy()  # copy so that re-running this cell does not insert the column twice
mergedLADF.insert(0, 'Cluster_Labels', kmeans.labels_)
mergedLADF.head() # check the last columns!

In [None]:
mergedLADF['Cluster_Labels'].unique()

**Display Venues Clusters on a Map**

In [None]:
# create map
mapClusters = folium.Map(location = [LA_latitude, LA_longitude], tiles='Stamen Toner', zoom_start=10)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(mergedLADF['Neighborhood Latitude'], 
                                  mergedLADF['Neighborhood Longitude'], 
                                  mergedLADF['Neighborhood'], 
                                  mergedLADF['Cluster_Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    # labels run from 0 to kclusters-1, so they index the palette directly
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.9
    ).add_to(mapClusters)

In [None]:
mapClusters

**Map Venue Clusters on to Crime Data**

In [None]:
# Map.choropleth() is deprecated in favour of the folium.Choropleth class
folium.Choropleth(
    geo_data=LAgeo,
    name='choropleth',
    data=crimeDFCounts,
    columns=['Area', 'NumberofCrimes'],
    key_on='feature.properties.name',
    fill_color='YlOrRd',
    fill_opacity=0.4,
    line_opacity=0.7,
    legend_name='Crimes in LA'
).add_to(mapClusters)
    
mapClusters

### Cluster Analysis

In [None]:
def mostCommonVenue(df):
    
    num_top_venues = 6

    indicators = ['st', 'nd', 'rd']

    # create columns according to number of top venues
    areaColumns = ['Neighborhood','Cluster Label']
    freqColumns = []
    for ind in np.arange(num_top_venues):
        if ind < len(indicators):
            freqColumns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
        else:
            freqColumns.append('{}th Most Common Venue'.format(ind+1))
    columns = areaColumns+freqColumns

    # create a new dataframe
    neighborhoods_venues_sorted = pd.DataFrame(columns=columns)

    neighborhoods_venues_sorted['Neighborhood'] = df['Neighborhood']
    neighborhoods_venues_sorted['Cluster Label'] = df['Cluster_Labels']

    for ind in np.arange(df.shape[0]):
        row_categories = df.iloc[ind, :].iloc[9:]
    #     print(row_categories)
        row_categories_sorted = row_categories.sort_values(ascending=False)
        neighborhoods_venues_sorted.iloc[ind, 2:] = row_categories_sorted.index.values[0:num_top_venues]

    neighborhoods_venues_sorted.drop_duplicates(keep="first", inplace=True)
    neighborhoods_venues_sorted.reset_index(inplace=True,drop=True)
    return neighborhoods_venues_sorted

In [None]:
mostCommonV = mostCommonVenue(mergedLADF)
mostCommonV
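
The two ideas inside `mostCommonVenue`, ordinal column labels and ranking categories by count, can be looked at in isolation. The sketch below uses plain Python on an invented count dictionary. The helpers `ordinal` and `top_categories` are hypothetical names.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st'."""
    if 10 <= n % 100 <= 20:
        suffix = "th"  # 11th, 12th, 13th are irregular
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return "{}{}".format(n, suffix)


def top_categories(counts, n):
    """Return the n category names with the highest counts, highest first."""
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:n]]


# Invented venue counts for one neighborhood
sample_counts = {"Food": 12, "Shop & Service": 30, "Nightlife Spot": 4}

labels = ["{} Most Common Venue".format(ordinal(i + 1)) for i in range(3)]
top = top_categories(sample_counts, 3)
print(dict(zip(labels, top)))
```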

**Cluster 1**

In [None]:
cluster1DF = mergedLADF.loc[
    mergedLADF['Cluster_Labels'] == 0,
    mergedLADF.columns[list(range(0, mergedLADF.shape[1]))]
]
cluster1DFMost = mostCommonVenue(cluster1DF)
cluster1DFMost.drop('Cluster Label', inplace=True,axis=1)
cluster1DFMost

**Cluster 2**

In [None]:
cluster2DF = mergedLADF.loc[
    mergedLADF['Cluster_Labels'] == 1,
    mergedLADF.columns[list(range(0, mergedLADF.shape[1]))]
]
cluster2DFMost = mostCommonVenue(cluster2DF)
cluster2DFMost.drop('Cluster Label', inplace=True,axis=1)
cluster2DFMost

**Cluster 3**

In [None]:
cluster3DF = mergedLADF.loc[
    mergedLADF['Cluster_Labels'] == 2,
    mergedLADF.columns[list(range(0, mergedLADF.shape[1]))]
]
cluster3DFMost = mostCommonVenue(cluster3DF)
cluster3DFMost.drop('Cluster Label', inplace=True,axis=1)
cluster3DFMost

**Cluster 4**

In [None]:
cluster4DF = mergedLADF.loc[
    mergedLADF['Cluster_Labels'] == 3,
    mergedLADF.columns[list(range(0, mergedLADF.shape[1]))]
]
cluster4DFMost = mostCommonVenue(cluster4DF)
cluster4DFMost.drop('Cluster Label', inplace=True,axis=1)
cluster4DFMost

**Cluster 5**

In [None]:
cluster5DF = mergedLADF.loc[
    mergedLADF['Cluster_Labels'] == 4,
    mergedLADF.columns[list(range(0, mergedLADF.shape[1]))]
]
cluster5DFMost = mostCommonVenue(cluster5DF)
cluster5DFMost.drop('Cluster Label', inplace=True,axis=1)
cluster5DFMost

#### Observations
- All the venues can be grouped into 5 clusters
- Of all the clusters, Cluster 1 has the fewest neighborhoods (23), and "Outdoors & Recreation" is the most popular venue category among the neighborhoods in Cluster 1
- "Shop & Service" is the most popular venue category among neighborhoods in Clusters 2, 3 and 4
- Across all venue categories, "Shop & Service" is the most popular
- "Food" is the second most popular venue category, followed by "Arts & Entertainment"
- The neighborhoods under the Pacific, 77th Street and Southwest LAPD community divisions have a higher number of recorded crimes
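
Because the merged table holds one row per venue, the number of neighborhoods in a cluster is a count of distinct neighborhood names rather than a row count. The sketch below shows one way to compute it. The toy DataFrame is invented and only reuses the column names from the analysis.

```python
import pandas as pd

# Invented rows: one per venue, labelled with its neighborhood's cluster
toy = pd.DataFrame({
    "Cluster_Labels": [0, 0, 0, 1, 1, 2],
    "Neighborhood": ["A", "A", "B", "C", "C", "D"],
})

# Distinct neighborhoods per cluster label
sizes = toy.groupby("Cluster_Labels")["Neighborhood"].nunique()
print(sizes)
```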
