## **IBM Applied Data Science Capstone Course by Coursera**

 
![alt text](https://www.culturetrav.co/wp-content/uploads/2016/08/san_diego.png)

#### **Week 5 Final Project for Capstone** 

##### "***Feasibility Study for Building a New Shopping Mall in San Diego, California***"

* Scrape the list of cities in San Diego County, California from the Wikipedia category page and load it into a dataframe.
* Get the geographical coordinates of the cities.
* Obtain venue data for the cities from the Foursquare API.
* Explore and cluster the cities.
* Select the best cluster in which to open a new shopping mall.
---

#### **1. Import Libraries**

In [97]:
# Import libraries to handle data
import requests 
import numpy as np
import pandas as pd
import json
#!conda install -c conda-forge geocoder --yes
import geocoder
#!conda install -c conda-forge/label/gcc7 geopy --yes 
from geopy.geocoders import Nominatim
from pandas import json_normalize  # pandas >= 1.0; older versions: from pandas.io.json import json_normalize

print("Work Done!")

Work Done!


In [79]:
# Import Matplotlib and associated plotting modules 
import matplotlib.cm as cm 
import matplotlib.colors as colors 

print("Work Done!")

Work Done!


In [80]:
# Import K-Means for clustering 
from sklearn.cluster import KMeans

print("Work Done!")

Work Done!


In [81]:
# Import webscraping tool from Beautiful Soup 
#!conda install -c anaconda beautifulsoup4 --yes
from bs4 import BeautifulSoup as bts
import xml 
import folium 

print("Work Done!")

Work Done!


---

#### **2. Scrape Data from the San Diego Cities Category (Wikipedia) and Load into a Dataframe**

In [82]:
# send the GET request
data = requests.get("https://en.wikipedia.org/wiki/Category:Cities_in_San_Diego_County,_California").text

In [83]:
# parse data from the html into a beautifulsoup object
soup = bts(data, 'html.parser')

In [84]:
# create a list to store cities data
citiesList = []

In [85]:
# append the data into the list
for row in soup.find_all("div", id="mw-pages")[0].findAll("li"):
    citiesList.append(row.text)

In [86]:
# create a new DataFrame from the list
df_sd_raw = pd.DataFrame({"Cities": citiesList})

df_sd_raw.shape

(18, 1)

In [87]:
# Remove the ", California" suffix from the city names (literal match, not a regex)
df_sd = pd.DataFrame({"Cities": df_sd_raw['Cities'].str.replace(", California", "", regex=False)})

df_sd

Unnamed: 0,Cities
0,Carlsbad
1,Chula Vista
2,Coronado
3,Del Mar
4,El Cajon
5,Encinitas
6,Escondido
7,Imperial Beach
8,La Mesa
9,Lemon Grove
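
The scraping steps above depend on the live Wikipedia markup, so they are hard to check offline. The following is a minimal sketch of the same idea on a made-up HTML snippet, assuming the category members sit in `<li>` items inside `<div id="mw-pages">`, as the loop above expects. It swaps Beautiful Soup for the standard-library `html.parser` so it runs without extra installs; `SAMPLE_HTML` and `CategoryMemberParser` are illustrative names, not part of the notebook, and `str.removesuffix` needs Python 3.9 or later.

```python
from html.parser import HTMLParser

# Hypothetical, trimmed-down stand-in for the category page markup.
SAMPLE_HTML = """
<div id="mw-pages">
  <ul>
    <li><a href="/wiki/Carlsbad,_California">Carlsbad, California</a></li>
    <li><a href="/wiki/Chula_Vista,_California">Chula Vista, California</a></li>
  </ul>
</div>
<div id="footer"><ul><li>Privacy policy</li></ul></div>
"""


class CategoryMemberParser(HTMLParser):
    """Collect the text of every <li> nested inside <div id="mw-pages">."""

    def __init__(self):
        super().__init__()
        self.div_depth = 0    # > 0 while inside the mw-pages div
        self.in_item = False  # True while inside an <li> of that div
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            # Start counting at the target div, then track nested divs.
            if self.div_depth or dict(attrs).get("id") == "mw-pages":
                self.div_depth += 1
        elif tag == "li" and self.div_depth:
            self.in_item = True
            self.items.append("")

    def handle_endtag(self, tag):
        if tag == "div" and self.div_depth:
            self.div_depth -= 1
        elif tag == "li":
            self.in_item = False

    def handle_data(self, data):
        if self.in_item:
            self.items[-1] += data


parser = CategoryMemberParser()
parser.feed(SAMPLE_HTML)

# Strip the state suffix only when it appears at the end of the name.
cities = [name.strip().removesuffix(", California") for name in parser.items]
print(cities)
```

The footer list item is ignored because it sits outside the `mw-pages` div, which is the same reason the notebook selects that div before collecting `li` elements.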


In [88]:
# print the number of rows of the dataframe
df_sd.shape

(18, 1)

----

#### **3. Get the Geographical Coordinates** 

In [98]:
import geocoder
# define a function to get coordinates
def get_latlng(cities):
    # initialize your variable to None
    lat_lng_coords = None
    # loop until you get the coordinates
    while(lat_lng_coords is None):
        g = geocoder.arcgis('{}, San Diego, California'.format(cities))
        lat_lng_coords = g.latlng
    return lat_lng_coords
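
The `while` loop in `get_latlng` never exits if the geocoder keeps returning `None` for a city. A hedged alternative is to cap the number of attempts, as sketched below. The helper name `get_latlng_with_retry`, its parameters, and the `fake_geocode` stand-in are assumptions for illustration; in the notebook the `geocode` argument could be something like `lambda q: geocoder.arcgis(q).latlng`.

```python
import time


def get_latlng_with_retry(city, geocode, max_attempts=5, delay_seconds=1.0):
    """Return [lat, lng] for a city, or None after max_attempts failures.

    `geocode` is any callable that takes a query string and returns
    [lat, lng] or None.
    """
    query = "{}, San Diego, California".format(city)
    for attempt in range(1, max_attempts + 1):
        coords = geocode(query)
        if coords is not None:
            return coords
        if attempt < max_attempts:
            time.sleep(delay_seconds)  # brief pause before the next try
    return None


# Stand-in geocoder for demonstration: fails twice, then succeeds.
calls = {"count": 0}


def fake_geocode(query):
    calls["count"] += 1
    return None if calls["count"] < 3 else [32.71568, -117.16171]


result = get_latlng_with_retry("San Diego", fake_geocode, delay_seconds=0)
print(result, calls["count"])
```

Returning `None` after the last attempt lets the caller decide whether to drop the city or stop, instead of hanging the notebook.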

In [100]:
# call the function to get the coordinates, store in a new list using list comprehension
coords = [ get_latlng(cities) for cities in df_sd["Cities"].tolist() ]
coords

[[33.16588000000007, -117.33821999999998],
 [32.640900000000045, -117.08427999999998],
 [32.67619000000008, -117.17088999999999],
 [32.95526000000007, -117.26361999999995],
 [32.79495000000003, -116.95902999999998],
 [33.045810000000074, -117.29256999999996],
 [33.12316000000004, -117.08216999999996],
 [32.57657000000006, -117.11639999999994],
 [32.76604000000003, -117.02442999999994],
 [32.740800000000036, -117.03107999999997],
 [32.672180000000026, -117.10550999999998],
 [33.19715000000008, -117.38057999999995],
 [32.95461000000006, -117.04288999999994],
 [32.71568000000008, -117.16170999999997],
 [33.14083000000005, -117.16015999999996],
 [32.870490000000075, -116.97101999999995],
 [32.98735000000005, -117.27068999999995],
 [33.20239000000004, -117.23504999999994]]

In [101]:
# create temporary dataframe to populate the coordinates into Latitude and Longitude
df_coords = pd.DataFrame(coords, columns=['Latitude', 'Longitude'])

In [102]:
# merge the coordinates into the original dataframe
df_sd['Latitude'] = df_coords['Latitude']
df_sd['Longitude'] = df_coords['Longitude']

In [104]:
# check the cities and the coordinates
print(df_sd.shape)
df_sd

(18, 3)


Unnamed: 0,Cities,Latitude,Longitude
0,Carlsbad,33.16588,-117.33822
1,Chula Vista,32.6409,-117.08428
2,Coronado,32.67619,-117.17089
3,Del Mar,32.95526,-117.26362
4,El Cajon,32.79495,-116.95903
5,Encinitas,33.04581,-117.29257
6,Escondido,33.12316,-117.08217
7,Imperial Beach,32.57657,-117.1164
8,La Mesa,32.76604,-117.02443
9,Lemon Grove,32.7408,-117.03108


---

#### **4. Create Map of San Diego with cities superimposed on top**

In [107]:
# get the coordinates of San Diego
address = 'San Diego, California'

geolocator = Nominatim(user_agent="foursquare")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of San Diego, California are {}, {}.'.format(latitude, longitude))

The geographical coordinates of San Diego, California are 32.7174209, -117.1627714.


In [110]:
# create map of San Diego using latitude and longitude values
map_sd = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, cities in zip(df_sd['Latitude'], df_sd['Longitude'], df_sd['Cities']):
    label = '{}'.format(cities)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_sd)  
    
map_sd

![alt text](/labs/Captone/map_sd-1.png)

In [245]:
# save the map as an HTML file
map_sd.save('map_sd.html')

---

#### **5. Use the Foursquare API to Explore the Cities**

In [112]:
# define Foursquare Credentials and Version
CLIENT_ID = 'DAXA4GASHZEBHDRATG0IQJE4FFW0M3HU1K5N20RYMNGHEE5S'
CLIENT_SECRET = '4BZQ5RJUZC4S2Y3PU15DGIYYOIFH5OLMM3N21KBMRAIPF0FL' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: DAXA4GASHZEBHDRATG0IQJE4FFW0M3HU1K5N20RYMNGHEE5S
CLIENT_SECRET:4BZQ5RJUZC4S2Y3PU15DGIYYOIFH5OLMM3N21KBMRAIPF0FL


In [242]:
# Request up to LIMIT venues per city within a 20 km radius (the output below has 100 rows per city).
radius = 20000
LIMIT = 200

venues = []

for lat, long, neighborhood in zip(df_sd['Latitude'], df_sd['Longitude'], df_sd['Cities']):
    
    # create the API request URL
    url = "https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}".format(
        CLIENT_ID,
        CLIENT_SECRET,
        VERSION,
        lat,
        long,
        radius, 
        LIMIT)
    
    # make the GET request
    results = requests.get(url).json()["response"]['groups'][0]['items']
    
    # return only relevant information for each nearby venue
    for venue in results:
        venues.append((
            neighborhood,
            lat, 
            long, 
            venue['venue']['name'], 
            venue['venue']['location']['lat'], 
            venue['venue']['location']['lng'],  
            venue['venue']['categories'][0]['name']))
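
The request loop above formats the credentials straight into the URL and assumes every venue has at least one category. Below is a minimal sketch of a more defensive variant, assuming the response has the same shape the loop relies on (`response` → `groups` → `items` → `venue`). The helper names `build_explore_url` and `extract_venues`, the environment variable names, and the second sample venue are hypothetical; only the first sample venue is taken from the results shown in this notebook.

```python
import os
from urllib.parse import urlencode

EXPLORE_URL = "https://api.foursquare.com/v2/venues/explore"


def build_explore_url(lat, lng, radius, limit, version="20180605"):
    """Build the explore URL with credentials read from the environment."""
    params = {
        # Hypothetical variable names; set them in the shell beforehand.
        "client_id": os.environ.get("FOURSQUARE_CLIENT_ID", ""),
        "client_secret": os.environ.get("FOURSQUARE_CLIENT_SECRET", ""),
        "v": version,
        "ll": "{},{}".format(lat, lng),
        "radius": radius,
        "limit": limit,
    }
    return EXPLORE_URL + "?" + urlencode(params)


def extract_venues(city, lat, lng, payload):
    """Flatten one explore response into 7-field rows like the loop above."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        # Fall back to None instead of raising IndexError on an empty list.
        category = categories[0]["name"] if categories else None
        rows.append((city, lat, lng, venue["name"],
                     venue["location"]["lat"], venue["location"]["lng"],
                     category))
    return rows


# Sample payload: first venue from the results above, second one made up.
sample_payload = {"response": {"groups": [{"items": [
    {"venue": {"name": "Campfire",
               "location": {"lat": 33.162005, "lng": -117.350937},
               "categories": [{"name": "American Restaurant"}]}},
    {"venue": {"name": "Uncategorized Place",
               "location": {"lat": 33.16, "lng": -117.34},
               "categories": []}},
]}]}}

url = build_explore_url(33.16588, -117.33822, radius=20000, limit=100)
rows = extract_venues("Carlsbad", 33.16588, -117.33822, sample_payload)
print(url)
print(rows)
```

Reading the keys from the environment keeps them out of the notebook and its saved output, and `urlencode` takes care of escaping the comma in the `ll` parameter.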

In [243]:
# convert the venues list into a new DataFrame
df_venues = pd.DataFrame(venues)

# define the column names
df_venues.columns = ['City', 'Latitude', 'Longitude', 'VenueName', 'VenueLatitude', 'VenueLongitude', 'VenueCategory']

print(df_venues.shape)
df_venues

(1800, 7)


Unnamed: 0,City,Latitude,Longitude,VenueName,VenueLatitude,VenueLongitude,VenueCategory
0,Carlsbad,33.16588,-117.33822,Campfire,33.162005,-117.350937,American Restaurant
1,Carlsbad,33.16588,-117.33822,Pollos Maria,33.161308,-117.344295,Mexican Restaurant
2,Carlsbad,33.16588,-117.33822,Choice Juicery,33.159605,-117.348978,Juice Bar
3,Carlsbad,33.16588,-117.33822,Prontos Gourmet Market,33.162126,-117.348904,Gourmet Shop
4,Carlsbad,33.16588,-117.33822,Trader Joe's,33.184009,-117.331985,Grocery Store
...,...,...,...,...,...,...,...
1795,Vista,33.20239,-117.23505,Costco,33.120162,-117.316361,Warehouse Store
1796,Vista,33.20239,-117.23505,Four Seasons Residence Club Aviara,33.096896,-117.283665,Resort
1797,Vista,33.20239,-117.23505,Legoland California,33.126343,-117.311619,Theme Park
1798,Vista,33.20239,-117.23505,Double Peak Park,33.109423,-117.177601,Park


In [221]:
# Check out total categories of venues
print('There are {} unique categories.'.format(len(df_venues['VenueCategory'].unique())))

There are 157 unique categories.


In [222]:
# Print out the list of categories
df_venues['VenueCategory'].unique()[:]

array(['American Restaurant', 'Mexican Restaurant', 'Juice Bar',
       'Gourmet Shop', 'Grocery Store', 'Seafood Restaurant',
       'Ice Cream Shop', 'Italian Restaurant', 'Gastropub',
       'Movie Theater', 'Beach', 'Cosmetics Shop', 'Discount Store',
       'Pizza Place', 'Bar', 'Coffee Shop', 'Café', 'Bakery', 'Bookstore',
       'Asian Restaurant', 'Athletics & Sports', 'Breakfast Spot',
       'Sports Bar', 'Sushi Restaurant', 'Steakhouse', 'Snack Place',
       'Furniture / Home Store', 'Surf Spot', 'Museum', 'Burger Joint',
       'Fast Food Restaurant', 'Board Shop', 'Japanese Restaurant',
       'Pier', 'Brewery', 'Theme Park', "Women's Store", 'Beer Garden',
       'Warehouse Store', 'BBQ Joint', 'Water Park', 'German Restaurant',
       'Garden', 'Vegetarian / Vegan Restaurant',
       'Theme Park Ride / Attraction', 'Frozen Yogurt Shop', 'Hotel',
       'Farmers Market', 'Golf Course', 'Supermarket', 'Church',
       'Thai Restaurant', 'Taco Place', 'Harbor / Marina', 'P

In [187]:
# Check if the results contain "Shopping Mall"
"Shopping Mall" in df_venues['VenueCategory'].unique()

True

---

#### **6. Analyze the Venue Data**

In [224]:
# one hot encoding
sd_onehot = pd.get_dummies(df_venues[['VenueCategory']], prefix="", prefix_sep="")

# add the city column back to the dataframe
sd_onehot['City'] = df_venues['City'] 

# move the city column to the first position
fixed_columns = [sd_onehot.columns[-1]] + list(sd_onehot.columns[:-1])
sd_onehot = sd_onehot[fixed_columns]

print(sd_onehot.shape)
sd_onehot

(1800, 158)


Unnamed: 0,City,Accessories Store,American Restaurant,Amphitheater,Aquarium,Argentinian Restaurant,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,BBQ Joint,...,Warehouse Store,Water Park,Whisky Bar,Wine Bar,Wine Shop,Winery,Women's Store,Yoga Studio,Zoo,Zoo Exhibit
0,Carlsbad,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Carlsbad,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Carlsbad,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Carlsbad,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Carlsbad,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1795,Vista,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
1796,Vista,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1797,Vista,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1798,Vista,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [189]:
# Group rows by city and take the mean of each one-hot column, i.e. the frequency of each category per city
sd_grouped = sd_onehot.groupby(["City"]).mean().reset_index()

print(sd_grouped.shape)
sd_grouped

(18, 131)


Unnamed: 0,City,Accessories Store,American Restaurant,Amphitheater,Art Gallery,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,BBQ Joint,Bakery,...,Warehouse Store,Water Park,Whisky Bar,Wine Bar,Wine Shop,Winery,Women's Store,Yoga Studio,Zoo,Zoo Exhibit
0,Carlsbad,0.0,0.07,0.0,0.0,0.0,0.0,0.01,0.02,0.02,...,0.01,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0
1,Chula Vista,0.02,0.01,0.01,0.0,0.0,0.0,0.0,0.0,0.01,...,0.0,0.0,0.01,0.02,0.0,0.0,0.0,0.0,0.02,0.06
2,Coronado,0.01,0.01,0.01,0.0,0.01,0.0,0.0,0.0,0.0,...,0.0,0.0,0.01,0.01,0.0,0.0,0.0,0.0,0.02,0.06
3,Del Mar,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.01,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0
4,El Cajon,0.01,0.02,0.01,0.0,0.01,0.0,0.0,0.01,0.01,...,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.02,0.06
5,Encinitas,0.0,0.05,0.0,0.0,0.0,0.0,0.0,0.0,0.03,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,Escondido,0.0,0.03,0.0,0.0,0.0,0.0,0.0,0.02,0.01,...,0.0,0.0,0.0,0.0,0.01,0.02,0.0,0.0,0.02,0.0
7,Imperial Beach,0.01,0.01,0.0,0.0,0.0,0.0,0.01,0.0,0.01,...,0.0,0.0,0.01,0.01,0.0,0.01,0.0,0.0,0.01,0.0
8,La Mesa,0.01,0.02,0.01,0.0,0.01,0.0,0.0,0.01,0.01,...,0.0,0.0,0.01,0.02,0.0,0.0,0.0,0.0,0.02,0.06
9,Lemon Grove,0.01,0.02,0.01,0.0,0.01,0.0,0.0,0.01,0.01,...,0.0,0.0,0.01,0.02,0.0,0.0,0.0,0.0,0.02,0.06
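
Each value in the grouped table is the mean of a 0/1 column, which equals the share of a city's venues that fall into that category. If a city contributed 100 venues, as the 1800-row total for 18 cities suggests, a value of 0.01 corresponds to a single venue. The sketch below reproduces that calculation without pandas on a toy list; the city names and venue pairs are invented for illustration and are not Foursquare results.

```python
from collections import Counter, defaultdict

# Toy (city, category) pairs; not taken from the Foursquare results.
toy_venues = [
    ("City A", "Coffee Shop"),
    ("City A", "Shopping Mall"),
    ("City A", "Coffee Shop"),
    ("City A", "Park"),
    ("City B", "Park"),
    ("City B", "Beach"),
]

# Count venues per category within each city.
counts = defaultdict(Counter)
for city, category in toy_venues:
    counts[city][category] += 1

categories = sorted({category for _, category in toy_venues})

# Mean of a 0/1 column per city == share of that city's venues in the category.
frequencies = {
    city: {cat: counter[cat] / sum(counter.values()) for cat in categories}
    for city, counter in counts.items()
}

print(frequencies["City A"]["Shopping Mall"])  # 1 of 4 venues
print(frequencies["City B"]["Shopping Mall"])  # 0 of 2 venues
```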


In [228]:
len(sd_grouped[sd_grouped["Shopping Mall"] > 0])

7

In [192]:
# Create a new DataFrame for Shopping Mall data only
sd_mall = sd_grouped[["City","Shopping Mall"]]

In [229]:
sd_mall

Unnamed: 0,City,Shopping Mall
0,Carlsbad,0.0
1,Chula Vista,0.0
2,Coronado,0.0
3,Del Mar,0.0
4,El Cajon,0.01
5,Encinitas,0.0
6,Escondido,0.0
7,Imperial Beach,0.01
8,La Mesa,0.01
9,Lemon Grove,0.01


____

#### **7. Clustering Process**

In [233]:
# Run k-means to group the cities of San Diego County into 2 clusters.

# set number of clusters
kclusters = 2

sd_clustering = sd_mall.drop(columns=["City"])

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(sd_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:20]

array([1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1], dtype=int32)
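
Because the clustering input is a single column, k-means here amounts to splitting the cities at a threshold on the shopping mall frequency. The following plain-Python sketch of Lloyd's algorithm on scalars illustrates this. It is not scikit-learn's implementation; `kmeans_1d` is a hypothetical helper, the values are the rounded frequencies displayed in the tables of this notebook, and the initial centers are chosen by hand so that label 0 lands on the cities with a mall. In general the numbering of k-means labels is arbitrary.

```python
def kmeans_1d(values, centers, max_iter=100):
    """Lloyd's algorithm on scalars; returns (labels, centers)."""
    centers = list(centers)
    labels = [0] * len(values)
    for _ in range(max_iter):
        # Assignment step: index of the nearest center for every value.
        labels = [
            min(range(len(centers)), key=lambda k: abs(v - centers[k]))
            for v in values
        ]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for k, old in enumerate(centers):
            members = [v for v, lab in zip(values, labels) if lab == k]
            new_centers.append(sum(members) / len(members) if members else old)
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers


# Rounded "Shopping Mall" frequencies for the 18 cities, in dataframe order.
mall_share = [0.0, 0.0, 0.0, 0.0, 0.01, 0.0, 0.0, 0.01, 0.01,
              0.01, 0.0, 0.0, 0.01, 0.01, 0.0, 0.01, 0.0, 0.0]

labels, centers = kmeans_1d(mall_share, centers=[0.01, 0.0])
print(labels)
```

With only two distinct input values, the two clusters are simply "frequency above zero" and "frequency equal to zero".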

In [234]:
# create a new dataframe that holds the shopping mall frequency and the cluster label for each city
sd_merged = sd_mall.copy()

# add clustering labels
sd_merged["Cluster Labels"] = kmeans.labels_

In [235]:
sd_merged.head()

Unnamed: 0,City,Shopping Mall,Cluster Labels
0,Carlsbad,0.0,1
1,Chula Vista,0.0,1
2,Coronado,0.0,1
3,Del Mar,0.0,1
4,El Cajon,0.01,0


In [236]:
# join df_sd onto sd_merged to add latitude/longitude for each city
sd_merged = sd_merged.join(df_sd.set_index("Cities"), on="City")


In [237]:
print(sd_merged.shape)
sd_merged.head() # check the last columns!

(18, 5)


Unnamed: 0,City,Shopping Mall,Cluster Labels,Latitude,Longitude
0,Carlsbad,0.0,1,33.16588,-117.33822
1,Chula Vista,0.0,1,32.6409,-117.08428
2,Coronado,0.0,1,32.67619,-117.17089
3,Del Mar,0.0,1,32.95526,-117.26362
4,El Cajon,0.01,0,32.79495,-116.95903


In [238]:
# Sort the results by Cluster Labels
print(sd_merged.shape)
sd_merged.sort_values(["Cluster Labels"], inplace=True)
sd_merged

(18, 5)


Unnamed: 0,City,Shopping Mall,Cluster Labels,Latitude,Longitude
8,La Mesa,0.01,0,32.76604,-117.02443
15,Santee,0.01,0,32.87049,-116.97102
13,San Diego,0.01,0,32.71568,-117.16171
4,El Cajon,0.01,0,32.79495,-116.95903
12,Poway,0.01,0,32.95461,-117.04289
7,Imperial Beach,0.01,0,32.57657,-117.1164
9,Lemon Grove,0.01,0,32.7408,-117.03108
14,San Marcos,0.0,1,33.14083,-117.16016
11,Oceanside,0.0,1,33.19715,-117.38058
10,National City,0.0,1,32.67218,-117.10551


----

#### **8. Visualize the Clustering Results**

In [239]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i+x+(i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(sd_merged['Latitude'], sd_merged['Longitude'], sd_merged['City'], sd_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' - Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

In [211]:
# Save the map as HTML file
map_clusters.save('map_clusters.html')

___

#### **9. Examine the Clusters**

In [240]:
# Examine cluster 0 (cities with at least one shopping mall in the results)
sd_merged.loc[sd_merged['Cluster Labels'] == 0]

Unnamed: 0,City,Shopping Mall,Cluster Labels,Latitude,Longitude
8,La Mesa,0.01,0,32.76604,-117.02443
15,Santee,0.01,0,32.87049,-116.97102
13,San Diego,0.01,0,32.71568,-117.16171
4,El Cajon,0.01,0,32.79495,-116.95903
12,Poway,0.01,0,32.95461,-117.04289
7,Imperial Beach,0.01,0,32.57657,-117.1164
9,Lemon Grove,0.01,0,32.7408,-117.03108


In [241]:
# Examine cluster 1 
sd_merged.loc[sd_merged['Cluster Labels'] == 1]

Unnamed: 0,City,Shopping Mall,Cluster Labels,Latitude,Longitude
14,San Marcos,0.0,1,33.14083,-117.16016
11,Oceanside,0.0,1,33.19715,-117.38058
10,National City,0.0,1,32.67218,-117.10551
0,Carlsbad,0.0,1,33.16588,-117.33822
6,Escondido,0.0,1,33.12316,-117.08217
5,Encinitas,0.0,1,33.04581,-117.29257
3,Del Mar,0.0,1,32.95526,-117.26362
2,Coronado,0.0,1,32.67619,-117.17089
1,Chula Vista,0.0,1,32.6409,-117.08428
16,Solana Beach,0.0,1,32.98735,-117.27069


**Conclusion**

The results show that Foursquare returned at least one shopping mall for seven of the 18 cities. Because the clustering uses a single feature, the share of a city's venues that are shopping malls, k-means simply splits the cities by whether a mall appears in the results: cluster 0 contains the cities with a shopping mall, and cluster 1 the cities without one. On the map, the cluster 0 cities lie in the southern part of San Diego County, while none of the northern cities has a shopping mall in the results. This gap suggests that the northern cities are the most promising place to open a new shopping mall.
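
The north/south pattern can also be checked numerically rather than by eye. The sketch below compares the latitudes of the two clusters, using the values shown in the cluster tables above (Vista's latitude comes from the coordinates list, since its row is not displayed). The variable names are illustrative, and the comparison is a rough check under those assumptions, not part of the original analysis.

```python
from statistics import mean

# Cluster 0: cities with a shopping mall in the results.
mall_cities = {
    "La Mesa": 32.76604, "Santee": 32.87049, "San Diego": 32.71568,
    "El Cajon": 32.79495, "Poway": 32.95461, "Imperial Beach": 32.57657,
    "Lemon Grove": 32.7408,
}

# Cluster 1: cities without a shopping mall in the results.
no_mall_cities = {
    "San Marcos": 33.14083, "Oceanside": 33.19715, "National City": 32.67218,
    "Carlsbad": 33.16588, "Escondido": 33.12316, "Encinitas": 33.04581,
    "Del Mar": 32.95526, "Coronado": 32.67619, "Chula Vista": 32.6409,
    "Solana Beach": 32.98735, "Vista": 33.20239,
}

mean_lat_mall = mean(mall_cities.values())
mean_lat_no_mall = mean(no_mall_cities.values())

# Cities without a mall that still lie south of the northernmost mall city.
northernmost_mall = max(mall_cities.values())
southern_without_mall = sorted(
    city for city, lat in no_mall_cities.items() if lat < northernmost_mall
)

print(round(mean_lat_mall, 3), round(mean_lat_no_mall, 3))
print(southern_without_mall)
```

The cities without a mall sit farther north on average, and every city north of Poway falls in cluster 1. A few southern cities also appear in cluster 1, so the gap is clearest in the north of the county.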