### Where is the recommended place to open a restaurant, a coffee shop, or to start a new business?

## Introduction

In this project, I provide a general guide to where to open a restaurant or a coffee shop, or where to set up an office for a new business, based on the kinds of venues already present in each neighborhood. The Foursquare API is used to explore the neighborhoods of a particular city, and its explore function is used to get the most common venue categories in each neighborhood. By the end of the project, you should have a general idea of how to choose a location for your business.

### Data 

The data for this project covers the cities of California. There are a total of 55 counties and 481 cities in the dataset, and 500 venues within 1000 meters of each city are investigated. Because the total number of venues for all 55 counties (481 cities) is too large and would exceed the personal call limit of the Foursquare API, I narrow the scope to Los Angeles County only, which consists of 88 cities and is famous for its diversity.
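The venue query described above can be sketched as a small URL builder. This is only an illustration under stated assumptions: the endpoint is the legacy Foursquare v2 `venues/explore` URL, which is not shown in this section, and `build_explore_url`, the credentials, and the version date are hypothetical placeholders. The radius and limit are the 1000 meters and 500 venues mentioned in the text; the API may cap the limit at a lower value.

```python
from urllib.parse import urlencode

# Assumption: legacy Foursquare v2 "explore" endpoint (not shown in this notebook section).
EXPLORE_URL = "https://api.foursquare.com/v2/venues/explore"


def build_explore_url(lat, lng, client_id, client_secret,
                      version="20180605", radius=1000, limit=500):
    """Build the request URL for venues around a latitude/longitude pair.

    `version` is a placeholder date string; `radius` (meters) and `limit`
    follow the values stated in the text.
    """
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": f"{lat},{lng}",
        "radius": radius,
        "limit": limit,
    }
    return f"{EXPLORE_URL}?{urlencode(params)}"


# Approximate coordinates of downtown Los Angeles, with placeholder credentials.
url = build_explore_url(34.0522, -118.2437, "YOUR_ID", "YOUR_SECRET")
print(url)
```

Building the URL separately from sending it makes it easy to inspect the parameters before spending any of the limited API calls.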

Import the libraries used in this project.

In [3]:
import pandas as pd # library for data analysis
import numpy as np # library to handle data in a vectorized manner

from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas import json_normalize # transform JSON into a pandas dataframe (pandas >= 1.0; older versions used pandas.io.json)

from sklearn.cluster import KMeans  # k-means, used in the clustering stage

import matplotlib.cm as cm   # Matplotlib and associated plotting modules
import matplotlib.colors as colors
from matplotlib import pyplot as plt

import folium # map rendering library

## 1. Import Dataset: Get all the neighborhood names of California from Wikipedia

In [4]:
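# source page for the table of cities and towns
url_cal = 'https://en.wikipedia.org/wiki/List_of_cities_and_towns_in_California'
# read_clipboard reads whatever is on the clipboard, so the table on that page
# must be copied to the clipboard before this cell is run
California_raw = pd.read_clipboard()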

In [5]:
California_raw.head()

Unnamed: 0,Adelanto,City,San Bernardino,"31,765",56.01,145.1,"December 22, 1970"
0,Agoura Hills,City,Los Angeles,20330,7.79,20.2,"December 8, 1982"
1,Alameda,City,Alameda,73812,10.61,27.5,"April 19, 1854"
2,Albany,City,Alameda,18539,1.79,4.6,"September 22, 1908"
3,Alhambra,City,Los Angeles,83089,7.63,19.8,"July 11, 1903"
4,Aliso Viejo,City,Orange,47823,7.47,19.3,"July 1, 2001"


In [7]:
len(California_raw)

481

In [8]:
# the first city (Adelanto) was read as the column header; stack it back on top of the data
California_update = np.vstack([California_raw.columns, California_raw])
California_update = pd.DataFrame(California_update)
California_update.head()

Unnamed: 0,0,1,2,3,4,5,6
0,Adelanto,City,San Bernardino,31765,56.01,145.1,"December 22, 1970"
1,Agoura Hills,City,Los Angeles,20330,7.79,20.2,"December 8, 1982"
2,Alameda,City,Alameda,73812,10.61,27.5,"April 19, 1854"
3,Albany,City,Alameda,18539,1.79,4.6,"September 22, 1908"
4,Alhambra,City,Los Angeles,83089,7.63,19.8,"July 11, 1903"
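The header problem fixed above can also be avoided at read time. The sketch below is an alternative, not what this notebook does: it uses `pd.read_csv` on an in-memory string so that it runs without a clipboard, with two rows taken from the outputs above and assumed to be tab-separated. `pd.read_clipboard` forwards its keyword arguments to `pd.read_csv`, so the same `header=None` and `names=` arguments should apply there as well.

```python
import io

import pandas as pd

# Two rows from the outputs above, tab-separated as a copied table usually is (assumption).
raw_text = (
    "Adelanto\tCity\tSan Bernardino\t31,765\t56.01\t145.1\tDecember 22, 1970\n"
    "Agoura Hills\tCity\tLos Angeles\t20330\t7.79\t20.2\tDecember 8, 1982\n"
)

column_names = ["Name", "Type", "County", "Population(2010)",
                "Land area (sq mi)", "Land area (km^2)", "Incorporated"]

# header=None keeps the first row as data instead of using it as column names;
# thousands="," lets "31,765" be parsed as the integer 31765.
cities = pd.read_csv(io.StringIO(raw_text), sep="\t", header=None,
                     names=column_names, thousands=",")
print(cities)
```

Reading the rows this way also keeps the numeric columns numeric, whereas stacking the old header on top with `np.vstack` turns every column into generic objects.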


In [9]:
# give the columns descriptive names, matching the headers of the Wikipedia table
California_update.columns = ['Name', 
                             'Type', 
                             'County', 
                             'Population(2010)', 
                             'Land area (sq mi)', 
                             'Land area (km^2)', 
                             'Incorporated']

In [10]:
California_update.head()

Unnamed: 0,Name,Type,County,Population(2010),Land area (sq mi),Land area (km^2),Incorporated
0,Adelanto,City,San Bernardino,31765,56.01,145.1,"December 22, 1970"
1,Agoura Hills,City,Los Angeles,20330,7.79,20.2,"December 8, 1982"
2,Alameda,City,Alameda,73812,10.61,27.5,"April 19, 1854"
3,Albany,City,Alameda,18539,1.79,4.6,"September 22, 1908"
4,Alhambra,City,Los Angeles,83089,7.63,19.8,"July 11, 1903"


Get the total number of counties and the corresponding number of cities (or towns) within each of them.

In [11]:
total_counties = California_update['County'].value_counts()
len(total_counties)

55
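As a small illustration of what `value_counts` returns here, the sketch below applies it to six rows taken from the outputs above. The `sample` frame and the `counts_df` name are only for illustration; the real data has 481 rows.

```python
import pandas as pd

# Six cities and their counties, copied from the outputs above.
sample = pd.DataFrame({
    "Name": ["Adelanto", "Agoura Hills", "Alameda",
             "Albany", "Alhambra", "Aliso Viejo"],
    "County": ["San Bernardino", "Los Angeles", "Alameda",
               "Alameda", "Los Angeles", "Orange"],
})

# One entry per county, holding the number of cities in that county.
cities_per_county = sample["County"].value_counts()

# The number of distinct counties, equivalent to len(cities_per_county).
n_counties = sample["County"].nunique()

# The same counts as a two-column dataframe, with the county as a regular column.
counts_df = (cities_per_county
             .rename_axis("County")
             .reset_index(name="Number of Cities"))
print(counts_df)
```

The `counts_df` form is convenient when the county should be an ordinary column rather than the index, for example before merging or plotting.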

There are a total of 55 counties and 481 cities in California. Instead of covering all 481 cities in all 55 counties, I will focus only on Los Angeles County, which is famous for its diversity. I am interested in exploring how many venue categories there are, what these categories are, and how they are distributed.

Get the cities in Los Angeles County.

In [190]:
# Notes: create a dataframe for the top 5 counties, where the cities are not index

#df1 = pd.DataFrame(data = top_5_counties.index, columns = ['County'])
#df2 = pd.DataFrame(data = top_5_counties.values, columns = ['Number of Cities'])
#top_5_counties = pd.concat([df1, df2], axis = 1)

# Notes: get the cities (or towns) when county is in ['Los Angeles', 'Orange']
# California_update[California_update['County'].isin(['Los Angeles', 'Orange'])]

# California_df = California_update[California_update['County'].isin(top_5_counties.index)]
# California_df.head()

In [12]:
# get the cities (or towns) where county == 'Los Angeles'
Los_Angeles_df = California_update[California_update['County'] == 'Los Angeles']

In [13]:
Los_Angeles_df.head()

Unnamed: 0,Name,Type,County,Population(2010),Land area (sq mi),Land area (km^2),Incorporated
1,Agoura Hills,City,Los Angeles,20330,7.79,20.2,"December 8, 1982"
4,Alhambra,City,Los Angeles,83089,7.63,19.8,"July 11, 1903"
14,Arcadia,City,Los Angeles,56364,10.93,28.3,"August 5, 1903"
17,Artesia,City,Los Angeles,16522,1.62,4.2,"May 29, 1959"
23,Avalon,City,Los Angeles,3728,2.94,7.6,"June 26, 1913"


In [15]:
len(Los_Angeles_df)

88

There are 88 cities in Los Angeles County.
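The county filter keeps the original row labels (1, 4, 14, ...), as the output above shows. A minimal sketch of the same boolean-mask filter on a few rows from this notebook, with an optional `reset_index(drop=True)` to renumber the rows from 0, is given below; the `sample`, `la`, and `la_and_orange` names are only for illustration.

```python
import pandas as pd

# Six cities and their counties, copied from the outputs above.
sample = pd.DataFrame({
    "Name": ["Adelanto", "Agoura Hills", "Alameda",
             "Albany", "Alhambra", "Aliso Viejo"],
    "County": ["San Bernardino", "Los Angeles", "Alameda",
               "Alameda", "Los Angeles", "Orange"],
})

# Keep the rows of a single county, then renumber the rows from 0.
la = sample[sample["County"] == "Los Angeles"].reset_index(drop=True)

# Keep the rows of several counties at once with isin.
la_and_orange = sample[sample["County"].isin(["Los Angeles", "Orange"])]

print(la)
print(len(la_and_orange))
```

Renumbering is optional, but it avoids surprises later if rows are accessed by position-like labels or the frame is combined with other per-city data.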