# Introduction
In this capstone project, I analyze districts in the city of Los Angeles by grouping districts with similar venue distributions. The goal is to help a prospective entrepreneur identify favorable locations for launching a business. For example, someone who wants to open a café would likely benefit more from an urban district than from a rural area.

Problem: An entrepreneur wants to launch a business in LA but does not know where to open it, since each district has a different mix of venues.
# Data
For the analysis, I 1) extract coordinates for the zip codes of LA from "opendatasoft.com"; 2) based on those coordinates, retrieve the necessary data (district name, latitude, longitude, and zip code) using the Foursquare API; 3) retrieve each district's neighborhood information (e.g. the number of restaurants and parks); and 4) lastly, conduct K-Means clustering using venue frequency as input data. In essence, the purpose of the analysis is to examine how venues are distributed across the districts of LA and to suggest where a new business could be opened.
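
As a rough illustration of step 4, the sketch below clusters a small venue-frequency table with a plain NumPy implementation of Lloyd's algorithm. The helper `kmeans`, the district names, and the frequency values are all assumptions made for this example; in practice a library implementation such as scikit-learn's `KMeans` would normally be used instead.

```python
import numpy as np
import pandas as pd

def kmeans(X, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm; returns one cluster label per row of X."""
    rng = np.random.default_rng(seed)
    # Start from k distinct rows chosen at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared Euclidean distance from every row to every centroid.
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its rows (keep it if it has none).
        updated = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(updated, centroids):
            break
        centroids = updated
    return labels

# Toy venue-frequency table; district names and values are invented.
freq = pd.DataFrame(
    {"Coffee Shop": [0.8, 0.7, 0.0, 0.1],
     "Park":        [0.2, 0.3, 0.1, 0.0],
     "Pizza Place": [0.0, 0.0, 0.9, 0.9]},
    index=["District A", "District B", "District C", "District D"],
)

labels = kmeans(freq.to_numpy(dtype=float), k=2)
clusters = pd.Series(labels, index=freq.index, name="Cluster")
print(clusters)
```

Districts with a similar venue mix receive the same label; the label numbers themselves carry no meaning.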


#### Import packages and data

In [3]:
# import necessary packages
import pandas as pd
import numpy as np
from geopy.geocoders import Nominatim 
import folium
import time
import json 

import requests 
# json_normalize is called as pd.json_normalize below, so no separate import is needed

import matplotlib.cm as cm
import matplotlib.colors as colors

In [4]:
# read in LA district data file
df = pd.read_csv("Cities in LA.csv")

In [5]:
# see first 5 rows of the data
df.head()

Unnamed: 0,Zip,City,State,Latitude,Longitude,DistrictName,Unnamed: 6,Unnamed: 7,Unnamed: 8
0,90001,Los Angeles,CA,33.972914,-118.24878,Acton,,,
1,90002,Los Angeles,CA,33.948315,-118.24845,Agoura Hills,,,
2,90003,Los Angeles,CA,33.962714,-118.276,Alhambra,,,
3,90091,Los Angeles,CA,33.786594,-118.298662,Alhambra,,,
4,90004,Los Angeles,CA,34.07711,-118.30755,Alondra Park,,,


#### Drop unnamed columns and rows with missing values

In [6]:
df.columns
# drop columns
df = df.drop(['Unnamed: 6', 'Unnamed: 7', 'Unnamed: 8'], axis =1)
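
The column names above are hard-coded. As a small sketch, placeholder columns can also be dropped by pattern, which keeps working if the CSV gains or loses an empty trailing column. The toy frame below is invented for illustration and only mimics the shape of the real file.

```python
import pandas as pd

# Toy frame standing in for the CSV; the values are invented for illustration.
toy = pd.DataFrame({
    "Zip": [90001, 90002],
    "DistrictName": ["Acton", None],
    "Unnamed: 6": [None, None],
    "Unnamed: 7": [None, None],
})

# Keep every column whose name does not start with "Unnamed".
cleaned = toy.loc[:, ~toy.columns.str.startswith("Unnamed")]
print(list(cleaned.columns))
```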

In [7]:
df.head()

Unnamed: 0,Zip,City,State,Latitude,Longitude,DistrictName
0,90001,Los Angeles,CA,33.972914,-118.24878,Acton
1,90002,Los Angeles,CA,33.948315,-118.24845,Agoura Hills
2,90003,Los Angeles,CA,33.962714,-118.276,Alhambra
3,90091,Los Angeles,CA,33.786594,-118.298662,Alhambra
4,90004,Los Angeles,CA,34.07711,-118.30755,Alondra Park


In [8]:
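# dimensions
print(df.shape)
# unique zip codes
# (without call parentheses this displays the bound method, as in the output below;
#  df['Zip'].unique() would return the array of unique values)
df['Zip'].unique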

(101, 6)


<bound method Series.unique of 0      90001
1      90002
2      90003
3      90091
4      90004
       ...  
96     90103
97     90174
98     90185
99     90774
100    91671
Name: Zip, Length: 101, dtype: int64>

In [9]:
# check for missing values
df.isna().sum()

Zip             0
City            0
State           0
Latitude        0
Longitude       0
DistrictName    8
dtype: int64

In [10]:
# drop rows with any missing values
df = df.dropna()
df.head()

Unnamed: 0,Zip,City,State,Latitude,Longitude,DistrictName
0,90001,Los Angeles,CA,33.972914,-118.24878,Acton
1,90002,Los Angeles,CA,33.948315,-118.24845,Agoura Hills
2,90003,Los Angeles,CA,33.962714,-118.276,Alhambra
3,90091,Los Angeles,CA,33.786594,-118.298662,Alhambra
4,90004,Los Angeles,CA,34.07711,-118.30755,Alondra Park


In [11]:
df.shape

(93, 6)

#### 93 zip-code-based districts in LA will be analyzed
* Note that the coordinates differ by zip code, even when several zip codes share the same district name
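
A quick way to see which district names cover more than one zip code is to count distinct zip codes per name. This is only a sketch: the three rows are copied from the `df.head()` output above, and the variable names are chosen for this example.

```python
import pandas as pd

# Rows copied from the df.head() output above.
toy = pd.DataFrame({
    "Zip": [90003, 90091, 90004],
    "Latitude": [33.962714, 33.786594, 34.07711],
    "Longitude": [-118.276, -118.298662, -118.30755],
    "DistrictName": ["Alhambra", "Alhambra", "Alondra Park"],
})

# Number of distinct zip codes behind each district name.
zips_per_name = toy.groupby("DistrictName")["Zip"].nunique()
shared = zips_per_name[zips_per_name > 1]
print(shared)
```

This matters later on: the venue tables are grouped by district name, so venues found around different zip codes with the same name are pooled into one row.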

## Methodology
1. Retrieve the necessary data from Foursquare
  * I retrieved up to 100 venues within a radius of 500 meters of each district's coordinates and extracted the category of each venue.
  * I then applied one-hot encoding and computed the frequency of each venue category.
2. Check the frequency distribution by processing the data to return the top K venues in each neighborhood.
3. Find the most favorable location to open a business by filtering on each neighborhood's top venues
  * This step matters because a business is unlikely to survive in an excessively competitive environment, surrounded by similar shops.

### Extract data from Foursquare
* Settings for the Foursquare API

In [12]:
CLIENT_ID = 'YOUR_CLIENT_ID' # Foursquare client ID (redacted; credentials should not be published)
CLIENT_SECRET = 'YOUR_CLIENT_SECRET' # Foursquare client secret (redacted)
VERSION = '20210212' # Foursquare API version

In [13]:
LIMIT = 100 # maximum number of venues returned by Foursquare API

radius = 500 # search radius in meters

#### Run a sample request on the second row of the dataset

In [14]:
df.iloc[1]

Zip                    90002
City             Los Angeles
State                     CA
Latitude             33.9483
Longitude           -118.248
DistrictName    Agoura Hills
Name: 1, dtype: object

In [15]:
# select by column label rather than by position within the row
latitude = df.iloc[1]['Latitude']
longitude = df.iloc[1]['Longitude']
print("latitude: {} , longitude: {}".format(latitude, longitude))

latitude: 33.948315 , longitude: -118.24845


In [16]:
# create URL
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    latitude, 
    longitude, 
    radius, 
    LIMIT)
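
The same URL can be assembled from a parameter dictionary, so that each value is percent-encoded and the query string stays readable. The following is a minimal sketch using only the standard library; `build_explore_url` is a hypothetical helper, and the credentials passed to it are placeholders rather than real keys.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

BASE_URL = "https://api.foursquare.com/v2/venues/explore"

def build_explore_url(client_id, client_secret, version, lat, lng, radius, limit):
    """Return the explore URL with every query parameter percent-encoded."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": f"{lat},{lng}",   # latitude,longitude pair
        "radius": radius,       # meters
        "limit": limit,
    }
    return f"{BASE_URL}?{urlencode(params)}"

# Placeholder credentials, not real keys.
example_url = build_explore_url("ID", "SECRET", "20210212",
                                33.948315, -118.24845, 500, 100)
print(example_url)
```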

In [17]:
results = requests.get(url).json()
results

{'meta': {'code': 200, 'requestId': '60702c4b7405654b97fb399b'},
  'headerLocation': 'Watts',
  'headerFullLocation': 'Watts, Los Angeles',
  'headerLocationGranularity': 'neighborhood',
  'totalResults': 1,
  'suggestedBounds': {'ne': {'lat': 33.9528150045, 'lng': -118.24303544079379},
   'sw': {'lat': 33.9438149955, 'lng': -118.25386455920622}},
  'groups': [{'type': 'Recommended Places',
    'name': 'recommended',
    'items': [{'reasons': {'count': 0,
       'items': [{'summary': 'This spot is popular',
         'type': 'general',
         'reasonName': 'globalInteractionReason'}]},
      'venue': {'id': '4d76e9667204a1cdf2788143',
       'name': 'Watts Senior Center & Rose Garden',
       'location': {'address': '1657 E Century Blvd',
        'lat': 33.945899963378906,
        'lng': -118.24408721923828,
        'labeledLatLngs': [{'label': 'display',
          'lat': 33.945899963378906,
          'lng': -118.24408721923828}],
        'distance': 484,
        'postalCode': '90002'

In [18]:
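# define a function that returns the name of a venue's first category,
# or None when the venue has no categories
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        # flattened rows use the prefixed column name instead
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']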

In [19]:
venues = results['response']['groups'][0]['items']
venues 

[{'reasons': {'count': 0,
   'items': [{'summary': 'This spot is popular',
     'type': 'general',
     'reasonName': 'globalInteractionReason'}]},
  'venue': {'id': '4d76e9667204a1cdf2788143',
   'name': 'Watts Senior Center & Rose Garden',
   'location': {'address': '1657 E Century Blvd',
    'lat': 33.945899963378906,
    'lng': -118.24408721923828,
    'labeledLatLngs': [{'label': 'display',
      'lat': 33.945899963378906,
      'lng': -118.24408721923828}],
    'distance': 484,
    'postalCode': '90002',
    'cc': 'US',
    'city': 'Los Angeles',
    'state': 'CA',
    'country': 'United States',
    'formattedAddress': ['1657 E Century Blvd',
     'Los Angeles, CA 90002',
     'United States']},
   'categories': [{'id': '4bf58dd8d48988d163941735',
     'name': 'Park',
     'pluralName': 'Parks',
     'shortName': 'Park',
     'icon': {'prefix': 'https://ss3.4sqi.net/img/categories_v2/parks_outdoors/park_',
      'suffix': '.png'},
     'primary': True}],
   'photos': {'count': 0

In [20]:
# flatten JSON
nearby_venues = pd.json_normalize(venues) 
nearby_venues

Unnamed: 0,referralId,reasons.count,reasons.items,venue.id,venue.name,venue.location.address,venue.location.lat,venue.location.lng,venue.location.labeledLatLngs,venue.location.distance,venue.location.postalCode,venue.location.cc,venue.location.city,venue.location.state,venue.location.country,venue.location.formattedAddress,venue.categories,venue.photos.count,venue.photos.groups
0,e-0-4d76e9667204a1cdf2788143-0,0,"[{'summary': 'This spot is popular', 'type': '...",4d76e9667204a1cdf2788143,Watts Senior Center & Rose Garden,1657 E Century Blvd,33.9459,-118.244087,"[{'label': 'display', 'lat': 33.94589996337890...",484,90002,US,Los Angeles,CA,United States,"[1657 E Century Blvd, Los Angeles, CA 90002, U...","[{'id': '4bf58dd8d48988d163941735', 'name': 'P...",0,[]


In [21]:
columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']

nearby_venues =nearby_venues.loc[:, columns]
nearby_venues

Unnamed: 0,venue.name,venue.categories,venue.location.lat,venue.location.lng
0,Watts Senior Center & Rose Garden,"[{'id': '4bf58dd8d48988d163941735', 'name': 'P...",33.9459,-118.244087
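
At this point the `venue.categories` column still holds a list of dictionaries. One possible next step, sketched below on a toy frame, is to reduce each list to the name of its first category and shorten the column names. The helper `first_category_name` and the second row are assumptions added for illustration (the second row shows the empty-list case).

```python
import pandas as pd

def first_category_name(categories):
    """Return the name of the first category, or None when the list is empty."""
    return categories[0]["name"] if categories else None

# Toy stand-in for `nearby_venues`; the second row is invented.
toy = pd.DataFrame({
    "venue.name": ["Watts Senior Center & Rose Garden", "Unlabelled Place"],
    "venue.categories": [[{"name": "Park"}], []],
    "venue.location.lat": [33.9459, 33.9460],
    "venue.location.lng": [-118.244087, -118.2450],
})

toy["venue.categories"] = toy["venue.categories"].apply(first_category_name)
# Keep only the text after the last dot: "venue.location.lat" -> "lat".
toy.columns = [col.split(".")[-1] for col in toy.columns]
print(toy)
```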


In [22]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues
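
The function above assumes that every response contains a `groups` entry and that every venue has at least one category. A more defensive parsing step might look like the sketch below. `parse_venues` is a hypothetical helper, and the payloads are hand-written to imitate the response shape shown earlier, so no network call is made.

```python
def parse_venues(payload, name, lat, lng):
    """Turn one explore response into rows, tolerating missing fields."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        rows.append((
            name, lat, lng,
            venue["name"],
            venue["location"]["lat"],
            venue["location"]["lng"],
            # None instead of an IndexError when a venue has no category
            categories[0]["name"] if categories else None,
        ))
    return rows

# Hand-written payload; the second venue is invented to show the empty case.
fake = {"response": {"groups": [{"items": [
    {"venue": {"name": "Watts Senior Center & Rose Garden",
               "location": {"lat": 33.9459, "lng": -118.2441},
               "categories": [{"name": "Park"}]}},
    {"venue": {"name": "Uncategorised Venue",
               "location": {"lat": 33.9460, "lng": -118.2450},
               "categories": []}},
]}]}}

rows = parse_venues(fake, "Agoura Hills", 33.948315, -118.24845)
# A payload without a "response" key yields no rows instead of a KeyError.
empty = parse_venues({"meta": {"code": 429}}, "Agoura Hills", 33.948315, -118.24845)
print(rows, empty)
```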

In [23]:
df.columns

Index(['Zip', 'City', 'State', 'Latitude', 'Longitude', 'DistrictName'], dtype='object')

#### Get nearby venues for each neighborhood using the custom function defined above

In [24]:
LA_venues = getNearbyVenues(names=df['DistrictName'],
                                   latitudes=df['Latitude'],
                                   longitudes=df['Longitude']
                                  )

Acton
Agoura Hills
Alhambra
Alhambra
Alondra Park
Alondra Park
Altadena
Altadena
Arcadia
Arcadia
Artesia
Artesia
Avalon
Avalon
Avocado Heights
Azusa
Baldwin Park
Bell
Bell Canyon
Bell Gardens
Bellflower
Claremont
Commerce
Compton
Covina
Cudahy
Culver City
Diamond Bar
Downey
Duarte
El Monte
El Segundo
Gardena
Glendale
Glendora
Hawaiian Gardens
Hawthorne
Hermosa Beach
Hidden Hills
Huntington Park
Industry
Inglewood
Irwindale
La Cañada Flintridge
La Habra Heights
La Mirada
La Puente
La Verne
Lakewood
Lancaster
Lawndale
Lomita
Long Beach
Los Angeles
Lynwood
Malibu
Manhattan Beach
Maywood
Monrovia
Montebello
Monterey Park
Norwalk
Palmdale
Palos Verdes Estates
Paramount
Pasadena
Pico Rivera
Pomona
Rancho Palos Verdes
Redondo Beach
Rolling Hills
Rolling Hills Estates
Rosemead
San Dimas
San Fernando
San Gabriel
San Marino
Santa Clarita
Santa Fe Springs
Santa Monica
Sierra Madre
Signal Hill
South El Monte
South Gate
South Pasadena
Temple City
Torrance
Vernon
Walnut
West Covina
West Hollywood
We

In [25]:
# check for shape
print(LA_venues.shape)
LA_venues.head()

(2201, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Acton,33.972914,-118.24878,Superior Grocers,33.973227,-118.247133,Grocery Store
1,Acton,33.972914,-118.24878,Bill's Drive In,33.974392,-118.243828,Burger Joint
2,Acton,33.972914,-118.24878,Rite Aid,33.974383,-118.246351,Pharmacy
3,Acton,33.972914,-118.24878,SUBWAY,33.975386,-118.248062,Sandwich Place
4,Acton,33.972914,-118.24878,Jack in the Box,33.975167,-118.250313,Fast Food Restaurant


#### Check how many venues were retrieved for each neighborhood

In [26]:
LA_venues_g = LA_venues.groupby('Neighborhood').count()
LA_venues_g 

Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Acton,11,11,11,11,11,11
Agoura Hills,1,1,1,1,1,1
Alhambra,31,31,31,31,31,31
Alondra Park,59,59,59,59,59,59
Altadena,61,61,61,61,61,61
...,...,...,...,...,...,...
Walnut,20,20,20,20,20,20
West Covina,20,20,20,20,20,20
West Hollywood,20,20,20,20,20,20
Westlake Village,20,20,20,20,20,20


#### Group data by neighborhood (district name) and venue category

In [27]:
# one hot encoding
LA_onehot = pd.get_dummies(LA_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
LA_onehot['Neighborhood'] = LA_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [LA_onehot.columns[-1]] + list(LA_onehot.columns[:-1])
LA_onehot = LA_onehot[fixed_columns]

LA_onehot.head()
LA_grouped = LA_onehot.groupby('Neighborhood').mean().reset_index()
LA_grouped

Unnamed: 0,Neighborhood,Yoga Studio,ATM,Accessories Store,American Restaurant,Arcade,Art Gallery,Art Museum,Arts & Crafts Store,Asian Restaurant,...,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Water Park,Weight Loss Center,Wine Bar,Wine Shop,Wings Joint,Women's Store
0,Acton,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0
1,Agoura Hills,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0
2,Alhambra,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0
3,Alondra Park,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.016949,0.0,0.0,0.0,0.0,0.0,0.0
4,Altadena,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
82,Walnut,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0
83,West Covina,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0
84,West Hollywood,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0
85,Westlake Village,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0
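
Averaging one-hot rows per neighborhood gives the share of each category among that neighborhood's venues. As a sketch of an equivalent route, `pd.crosstab` with `normalize="index"` produces the same kind of table in one call, and it keeps the neighborhood in the index, so it cannot clash with a venue category that happens to be called "Neighborhood". The toy rows below are invented for illustration.

```python
import pandas as pd

# Toy stand-in for LA_venues; rows are invented for illustration.
toy_venues = pd.DataFrame({
    "Neighborhood": ["Acton", "Acton", "Acton", "Acton", "Agoura Hills"],
    "Venue Category": ["Pharmacy", "Burger Joint", "Burger Joint",
                       "Grocery Store", "Park"],
})

# Row-normalised counts: each row sums to 1.
freq = pd.crosstab(
    toy_venues["Neighborhood"],
    toy_venues["Venue Category"],
    normalize="index",
)
print(freq)
```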


#### List the venue categories for later filtering

In [31]:
for col in LA_grouped.columns :
    print(col)

Neighborhood
Yoga Studio
ATM
Accessories Store
American Restaurant
Arcade
Art Gallery
Art Museum
Arts & Crafts Store
Asian Restaurant
Athletics & Sports
Auditorium
Auto Garage
Automotive Shop
BBQ Joint
Bagel Shop
Bakery
Bank
Bar
Basketball Stadium
Beer Bar
Beer Garden
Beer Store
Big Box Store
Board Shop
Bookstore
Boutique
Bowling Alley
Boxing Gym
Brazilian Restaurant
Breakfast Spot
Brewery
Bubble Tea Shop
Building
Burger Joint
Burrito Place
Bus Line
Bus Station
Business Service
Butcher
Cafeteria
Café
Cajun / Creole Restaurant
Campground
Cantonese Restaurant
Casino
Check Cashing Service
Cheese Shop
Chinese Restaurant
Clothing Store
Cocktail Bar
Coffee Shop
College Residence Hall
Comedy Club
Comic Shop
Concert Hall
Convenience Store
Cosmetics Shop
Coworking Space
Creperie
Cuban Restaurant
Cupcake Shop
Cycle Studio
Dance Studio
Deli / Bodega
Department Store
Dessert Shop
Dim Sum Restaurant
Diner
Discount Store
Doner Restaurant
Donut Shop
Dry Cleaner
Dumpling Restaurant
Electronics Store
E

#### Check the top K venues, K = 5

In [29]:
# define function to find most common venue categories
def return_most_common_venues(row, Top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:Top_venues]

In [30]:
Top_venues = 5

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(Top_venues):
    try:
        columns.append('{}{} Popular Venues'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Popular Venues'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = LA_grouped['Neighborhood']

for ind in np.arange(LA_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(LA_grouped.iloc[ind, :], Top_venues)

neighborhoods_venues_sorted

Unnamed: 0,Neighborhood,1st Popular Venues,2nd Popular Venues,3rd Popular Venues,4th Popular Venues,5th Popular Venues
0,Acton,Donut Shop,Mexican Restaurant,Pizza Place,Burger Joint,Pharmacy
1,Agoura Hills,Park,Women's Store,Event Space,Food Stand,Food Service
2,Alhambra,Fast Food Restaurant,Sandwich Place,Burger Joint,Mexican Restaurant,Pizza Place
3,Alondra Park,Korean Restaurant,Coffee Shop,Bar,Sandwich Place,Mexican Restaurant
4,Altadena,Korean Restaurant,Coffee Shop,Grocery Store,Pizza Place,Café
...,...,...,...,...,...,...
82,Walnut,Mexican Restaurant,Sandwich Place,Fast Food Restaurant,Deli / Bodega,Thai Restaurant
83,West Covina,Mexican Restaurant,Sandwich Place,Fast Food Restaurant,Deli / Bodega,Thai Restaurant
84,West Hollywood,Mexican Restaurant,Sandwich Place,Fast Food Restaurant,Deli / Bodega,Thai Restaurant
85,Westlake Village,Mexican Restaurant,Sandwich Place,Fast Food Restaurant,Deli / Bodega,Thai Restaurant
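
Agoura Hills has a single retrieved venue (a park), so the other four entries in its row are categories with zero frequency that only appear because five names are always returned. A possible variant, sketched below with a hypothetical helper `top_venues_nonzero` and invented frequencies, keeps only categories that actually occur.

```python
import pandas as pd

def top_venues_nonzero(freq_row, k):
    """Return up to k category names with the highest non-zero frequency."""
    nonzero = freq_row[freq_row > 0]
    ranked = nonzero.sort_values(ascending=False)
    return ranked.index[:k].tolist()

# Toy frequency rows, invented for illustration.
toy_freq = pd.DataFrame(
    {"Park": [1.0, 0.0],
     "Pizza Place": [0.0, 0.25],
     "Donut Shop": [0.0, 0.5],
     "Pharmacy": [0.0, 0.25]},
    index=["Agoura Hills", "Acton"],
)

top = {name: top_venues_nonzero(row, k=3) for name, row in toy_freq.iterrows()}
print(top)
```

Rows with fewer than K entries then signal that too few venues were retrieved to rank the neighborhood reliably.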


#### Find the most favorable location to open a business by filtering the neighborhood's top venues

  * Filter out neighborhoods that have a similar type of business among their top 5 venues, to avoid likely competition.
  * Similar business types include 'Coffee Shop', 'Ice Cream Shop', 'Juice Bar', 'Café' and 'Tea Room'; the filter below also excludes 'Deli / Bodega'.
  * Afterwards, the most appropriate location (neighborhood) is chosen: one with favorable surroundings among its top popular venues.

In [46]:
to_drop = [ 'Tea Room','Coffee Shop', 'Ice Cream Shop','Juice Bar','Café','Deli / Bodega']
a = neighborhoods_venues_sorted.copy()
for col in neighborhoods_venues_sorted.columns:
    a = a[~a[col].isin(to_drop)]
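
The loop above can also be written as a single vectorised mask. The sketch below applies it to three rows copied from the top-venues table; the names `venue_cols`, `has_competitor`, and `candidates` are chosen for this example.

```python
import pandas as pd

to_drop = ["Tea Room", "Coffee Shop", "Ice Cream Shop",
           "Juice Bar", "Café", "Deli / Bodega"]

# Three rows copied from the top-venues table above.
toy = pd.DataFrame({
    "Neighborhood": ["Acton", "Alondra Park", "Altadena"],
    "1st Popular Venues": ["Donut Shop", "Korean Restaurant", "Korean Restaurant"],
    "2nd Popular Venues": ["Mexican Restaurant", "Coffee Shop", "Coffee Shop"],
    "3rd Popular Venues": ["Pizza Place", "Bar", "Grocery Store"],
    "4th Popular Venues": ["Burger Joint", "Sandwich Place", "Pizza Place"],
    "5th Popular Venues": ["Pharmacy", "Mexican Restaurant", "Café"],
})

# Only the venue columns are checked, not the neighborhood name.
venue_cols = [c for c in toy.columns if c != "Neighborhood"]
# True for rows where any top venue is a competing business type.
has_competitor = toy[venue_cols].isin(to_drop).any(axis=1)
candidates = toy[~has_competitor]
print(candidates["Neighborhood"].tolist())
```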

### Result
 * After filtering, the number of candidate locations is reduced to 35
 * From here, we can decide which location is the best place to open a "Tea Room" business

In [49]:
# see the shape of the filtered dataframe
# reduced from 87 to 35 rows
a.shape

(35, 6)

In [51]:
a

Unnamed: 0,Neighborhood,1st Popular Venues,2nd Popular Venues,3rd Popular Venues,4th Popular Venues,5th Popular Venues
0,Acton,Donut Shop,Mexican Restaurant,Pizza Place,Burger Joint,Pharmacy
1,Agoura Hills,Park,Women's Store,Event Space,Food Stand,Food Service
2,Alhambra,Fast Food Restaurant,Sandwich Place,Burger Joint,Mexican Restaurant,Pizza Place
5,Arcadia,Sandwich Place,Mexican Restaurant,Fast Food Restaurant,Food Truck,Pizza Place
7,Avalon,Sandwich Place,Fast Food Restaurant,Mexican Restaurant,Southern / Soul Food Restaurant,Department Store
10,Baldwin Park,Fast Food Restaurant,Pizza Place,Fried Chicken Joint,Restaurant,Diner
11,Bell,Chinese Restaurant,Mexican Restaurant,Vietnamese Restaurant,Bakery,Sandwich Place
12,Bell Canyon,Japanese Restaurant,Sushi Restaurant,Bar,Bubble Tea Shop,Pizza Place
15,Claremont,Dance Studio,Fried Chicken Joint,Sandwich Place,Latin American Restaurant,Liquor Store
21,Diamond Bar,Mexican Restaurant,Fast Food Restaurant,Pizza Place,Pharmacy,Donut Shop


## Discussion 
  * Final decision: La Habra Heights
  * Why? 
    - No similar shops among its top venues, suggesting little direct competition in the area
    - Has venues such as a Music Venue and a Scenic Lookout, which tend to draw visitors
    - Grocery Store as the top venue suggests a sizable residential population nearby, which we hope to attract to the shop

In [68]:
b = a.reset_index()
b.iloc[21]

index                                 38
Neighborhood            La Habra Heights
1st Popular Venues         Grocery Store
2nd Popular Venues    Mexican Restaurant
3rd Popular Venues           Music Venue
4th Popular Venues                  Park
5th Popular Venues        Scenic Lookout
Name: 21, dtype: object

## Conclusion
In conclusion, since none of the top five venues in La Habra Heights is related to a "Tea Room," and the neighborhood counts a music venue among its popular venues, opening the shop there appears to be the most favorable choice among the candidates.