# Coursera IBM Data Science Capstone Project

#### Opening a new Supermarket in Los Angeles, California

•	Build a dataframe of neighborhoods in Los Angeles, California by web scraping the data from a Wikipedia page

•	Get the geographical coordinates of the neighborhoods

•	Obtain the venue data for the neighborhoods from the Foursquare API

•	Explore and cluster the neighborhoods

•	Select the best cluster to open a new Supermarket


### 1. Importing Libraries

In [4]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)

import json # library to handle JSON files

from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
import geocoder # to get coordinates

import requests # library to handle requests
from bs4 import BeautifulSoup # library to parse HTML and XML documents

from pandas import json_normalize # transform JSON file into a pandas dataframe (pandas >= 1.0)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors
import matplotlib.pyplot as plt

# import k-means from clustering stage
from sklearn.cluster import KMeans

import folium # map rendering library
#!conda install -c conda-forge wordcloud==1.4.1 --yes
from wordcloud import WordCloud, get_single_color_func

print("Done.")

ModuleNotFoundError: No module named 'geopy'

### 2. Get data from Excel file containing Neighborhood Details

In [None]:
# Read files
# raw string so the backslashes in the Windows path are not treated as escape sequences
los_angeles_data = pd.read_excel(r"E:\Saurav\LAPPY\study\Coursera\IBM_Data_Science\project\IBM_DATA_SCIENCE\Final_project_week_4_5\los_angeles_neighborhood_data_new.xlsx")
los_angeles_data

### 3. Get neighborhood coordinates

In [None]:

# define a function to get coordinates
def get_latlng(neighborhood):
    # initialize your variable to None
    lat_lng_coords = None
    # loop until you get the coordinates
    while(lat_lng_coords is None):
        g = geocoder.arcgis('{}, Los Angeles, California'.format(neighborhood))
        lat_lng_coords = g.latlng
    return lat_lng_coords

In [None]:
# call the function to get the coordinates, store in a new list using list comprehension
coords = [ get_latlng(neighborhood) for neighborhood in los_angeles_data["Neighborhood"].tolist() ]

In [None]:
coords

In [None]:
# create temporary dataframe to populate the coordinates into Latitude and Longitude
df_coords = pd.DataFrame(coords, columns=['Latitude', 'Longitude'])

In [None]:
# merge the coordinates into the original dataframe
los_angeles_data['Latitude'] = df_coords['Latitude']
los_angeles_data['Longitude'] = df_coords['Longitude']

In [None]:
# check the neighborhoods and the coordinates
print(los_angeles_data.shape)
los_angeles_data

### 4. Create a map of Los Angeles with neighborhoods superimposed on top

In [None]:
# get the coordinates of Los Angeles
address = 'Los Angeles, California'

geolocator = Nominatim(user_agent="my-application")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Los Angeles, California are {}, {}.'.format(latitude, longitude))

In [None]:
# create map of Los Angeles using latitude and longitude values
map_LA = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, neighborhood in zip(los_angeles_data['Latitude'], los_angeles_data['Longitude'], los_angeles_data['Neighborhood']):
    label = '{}'.format(neighborhood)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_LA)  
    
map_LA

In [None]:
# save the map as HTML file
map_LA.save('map_LA.html')

### 5. Use the Foursquare API to explore the neighborhoods

In [None]:
# define Foursquare Credentials and Version
CLIENT_ID = '3X23YXNCVQTROXF2LA3OOLQQ1ZUAFJVJZJY3XVZEUAHRUMAI' # your Foursquare ID
CLIENT_SECRET = 'T1U0CPYBO4DPH4I1AHUFMMWZ33HF43QKMVFRRBXKAD0NBLR1' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET: ' + CLIENT_SECRET)
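
Keeping the ID and secret in the notebook source exposes them to anyone who sees the file. One possible alternative, sketched here under assumptions, is to read them from environment variables and mask them when printing. The variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and both helper functions are hypothetical and not part of the original notebook.

```python
import os

def load_foursquare_credentials(env=os.environ):
    """Read the Foursquare ID and secret from a mapping of environment variables.

    The variable names used here are assumptions chosen for this sketch.
    """
    client_id = env.get("FOURSQUARE_CLIENT_ID")
    client_secret = env.get("FOURSQUARE_CLIENT_SECRET")
    if not client_id or not client_secret:
        raise RuntimeError("Foursquare credentials are not set in the environment")
    return client_id, client_secret

def mask(value, visible=4):
    """Keep the first few characters of a secret and replace the rest with '*'."""
    return value[:visible] + "*" * max(len(value) - visible, 0)

# Demonstration with a plain dict standing in for os.environ (placeholder values).
demo_env = {"FOURSQUARE_CLIENT_ID": "abcd1234", "FOURSQUARE_CLIENT_SECRET": "s3cr3tvalue"}
demo_id, demo_secret = load_foursquare_credentials(demo_env)
print('CLIENT_ID: ' + mask(demo_id))
print('CLIENT_SECRET: ' + mask(demo_secret))
```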

##### Now, let's get the top 100 venues that are within a radius of 2000 meters.

In [None]:
radius = 2000
LIMIT = 100

venues = []

for lat, long, neighborhood in zip(los_angeles_data['Latitude'], los_angeles_data['Longitude'], los_angeles_data['Neighborhood']):
    
    # create the API request URL
    url = "https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}".format(
        CLIENT_ID,
        CLIENT_SECRET,
        VERSION,
        lat,
        long,
        radius, 
        LIMIT)
    
    # make the GET request
    results = requests.get(url).json()["response"]['groups'][0]['items']
    
    # return only relevant information for each nearby venue
    for venue in results:
        venues.append((
            neighborhood,
            lat, 
            long, 
            venue['venue']['name'], 
            venue['venue']['location']['lat'], 
            venue['venue']['location']['lng'],  
            venue['venue']['categories'][0]['name']))
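
The loop above indexes `categories[0]` directly, which raises an `IndexError` if a venue comes back with an empty category list. A small sketch of a more tolerant extraction step follows. The helper `venue_row` is hypothetical, and the sample items are hand-written to mimic only the fields the loop reads; they are not real API output.

```python
def venue_row(neighborhood, lat, lng, item):
    """Flatten one item of an explore response into a tuple.

    The category is None when the venue has no categories.
    """
    venue = item['venue']
    categories = venue.get('categories') or []
    category = categories[0]['name'] if categories else None
    return (neighborhood, lat, lng, venue['name'],
            venue['location']['lat'], venue['location']['lng'], category)

# Hand-written stand-ins for response items (illustrative values only).
sample_item = {"venue": {"name": "Example Market",
                         "location": {"lat": 34.0, "lng": -118.0},
                         "categories": [{"name": "Supermarket"}]}}
no_category_item = {"venue": {"name": "Unnamed Spot",
                              "location": {"lat": 34.1, "lng": -118.1},
                              "categories": []}}

rows = [venue_row("A", 1.0, 2.0, item) for item in (sample_item, no_category_item)]
print(rows)
```

Rows with a `None` category can be filtered out before the one-hot encoding step.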

In [None]:
# convert the venues list into a new DataFrame
venues_df = pd.DataFrame(venues)

# define the column names
venues_df.columns = ['Neighborhood', 'Latitude', 'Longitude', 'VenueName', 'VenueLatitude', 'VenueLongitude', 'VenueCategory']

print(venues_df.shape)
venues_df.head()

In [None]:
#to check for venues per neighborhood
venues_df.groupby(["Neighborhood"]).count()

In [None]:
print('There are {} unique categories.'.format(len(venues_df['VenueCategory'].unique())))

In [None]:
# print out the list of categories
venues_df['VenueCategory'].unique()

In [None]:
# check if the results contain "Supermarket"
"Supermarket" in venues_df['VenueCategory'].unique()

In [None]:
# count the venues categorized as "Supermarket" (a Series has no .len() method)
(venues_df['VenueCategory'] == "Supermarket").sum()

### 6. Analyze Each Neighborhood

In [None]:
# one hot encoding
la_onehot = pd.get_dummies(venues_df[['VenueCategory']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
la_onehot['Neighborhood'] = venues_df['Neighborhood']
la_onehot['Latitude'] = venues_df['Latitude'] 
la_onehot['Longitude'] = venues_df['Longitude'] 

# move the neighborhood, latitude and longitude columns to the front
id_columns = ['Neighborhood', 'Latitude', 'Longitude']
fixed_columns = id_columns + [col for col in la_onehot.columns if col not in id_columns]
la_onehot = la_onehot[fixed_columns]

print(la_onehot.shape)
la_onehot.head()

In [None]:
# group rows by neighborhood and take the mean of the frequency of occurrence of each category
la_grouped = la_onehot.groupby(["Neighborhood",'Latitude','Longitude']).mean().reset_index()

print(la_grouped.shape)
la_grouped
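
To see why the mean of the one-hot columns is a frequency, here is a toy sketch with made-up neighborhoods "A" and "B" (not data from this project). Each one-hot column holds 1 for venues of that category and 0 otherwise, so its mean per neighborhood is the share of that neighborhood's venues in the category.

```python
import pandas as pd

# Made-up venues: neighborhood "A" has 4 venues, "B" has 2.
toy = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "A", "B", "B"],
    "VenueCategory": ["Supermarket", "Cafe", "Cafe", "Park", "Cafe", "Park"],
})

# One column per category, 1.0 where the venue belongs to it.
onehot = pd.get_dummies(toy["VenueCategory"], dtype=float)
onehot.insert(0, "Neighborhood", toy["Neighborhood"])

# Mean of each indicator column = share of venues in that category.
freq = onehot.groupby("Neighborhood").mean()
print(freq)
```

In this toy table one of the four venues in "A" is a supermarket, so its `Supermarket` value is 0.25, while "B" has none and gets 0.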

In [None]:
len(la_grouped[la_grouped["Supermarket"] > 0])

##### Create a new DataFrame for Supermarket data

In [None]:
la_market = la_grouped[["Neighborhood","Supermarket"]]

In [None]:
la_market.head()

### 7. Cluster Neighborhoods

In [None]:
# Finding best k
plt.style.use("seaborn")
Ks = 11
mse = np.zeros((Ks-1))
la_grouped_clustering = la_grouped.drop(['Neighborhood','Latitude','Longitude'], axis=1)
la_grouped_clustering


In [None]:
for n in range(1,Ks):
    
    # set number of clusters
    kclusters = n
    # run k-means clustering
    kmeans = KMeans(n_clusters=kclusters, random_state=0, init = 'random', n_init = 15).fit(la_grouped_clustering)
    mse[n-1] = kmeans.inertia_

plt.plot(range(1,Ks),mse)
plt.xlabel("Number of clusters")
plt.ylabel("Inertia (within-cluster sum of squares)")
plt.title("K selection")
plt.show()
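
The curve above is usually read by eye, looking for the "elbow" where adding clusters stops reducing inertia much. As a rough aid, the bend can also be located numerically. The sketch below uses the largest second difference of the inertia values, which is only one common heuristic and not necessarily how k was chosen in this notebook. The function name `elbow_k` and the sample values are made up for illustration.

```python
def elbow_k(inertias, first_k=1):
    """Return the k at which the inertia curve bends most sharply.

    `inertias[i]` is the inertia for k = first_k + i. The bend at an interior
    point is measured by the second difference of the curve.
    """
    if len(inertias) < 3:
        raise ValueError("need inertia values for at least three values of k")
    best_k, best_bend = None, float("-inf")
    for i in range(1, len(inertias) - 1):
        bend = inertias[i - 1] - 2 * inertias[i] + inertias[i + 1]
        if bend > best_bend:
            best_k, best_bend = first_k + i, bend
    return best_k

# Made-up inertia values for k = 1..6 with a visible bend at k = 3.
print(elbow_k([100, 60, 20, 18, 17, 16]))
```

The result should still be checked against the plot, since noisy curves can have several similar bends.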

In [None]:
# set number of clusters
kclusters = 4

la_clustering = la_market.drop(["Neighborhood"], axis=1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(la_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]

In [None]:
# create a new dataframe that includes the cluster label as well as the supermarket frequency for each neighborhood
la_merged = la_market.copy()

# add clustering labels
la_merged["Cluster Labels"] = kmeans.labels_

In [None]:
la_merged.rename(columns={"Neighborhoods": "Neighborhood"}, inplace=True)
la_merged.head()


In [None]:
# merge la_merged with los_angeles_data to add latitude/longitude for each neighborhood
la_merged = la_merged.join(los_angeles_data.set_index("Neighborhood"), on="Neighborhood")

print(la_merged.shape)
la_merged.head()

In [None]:
# sort the results by Cluster Labels
print(la_merged.shape)
la_merged.sort_values(["Cluster Labels"], inplace=True)
la_merged

##### visualize the clusters

In [None]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i+x+(i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(la_merged['Latitude'], la_merged['Longitude'], la_merged['Neighborhood'], la_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' - Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

In [None]:
# save the map as HTML file
map_clusters.save('map_clusters.html')

### 8. Examine Clusters

##### Cluster 0

In [None]:
cl0 = la_merged.loc[la_merged['Cluster Labels'] == 0]
cl0

##### Cluster 1

In [None]:
cl1 = la_merged.loc[la_merged['Cluster Labels'] == 1]
cl1

##### Cluster 2

In [None]:
cl2 = la_merged.loc[la_merged['Cluster Labels'] == 2]
cl2

##### Cluster 3

In [None]:
cl3 = la_merged.loc[la_merged['Cluster Labels'] == 3]
cl3

### Wordcloud

In [None]:
word_string = ""
for neighborhood in la_merged["Neighborhood"]:
    elements = ""
    for element in neighborhood.split(","):
        elements += element.strip().replace(" ", "") + " "
    word_string += elements+" "
word_string = word_string.replace(".","")

def neighborhood_words(cluster_df):
    """Split comma-separated neighborhood names into single tokens."""
    words = []
    for name in cluster_df["Neighborhood"].tolist():
        words.extend(name.split(", "))
    # strip spaces and periods so the tokens match those in word_string
    return [word.replace(" ", "").replace(".", "") for word in words]

cl0_list = neighborhood_words(cl0)
cl1_list = neighborhood_words(cl1)
cl2_list = neighborhood_words(cl2)
cl3_list = neighborhood_words(cl3)



cl1_list

In [None]:
class GroupedColorFunc(object):
    """Create a color function object which assigns DIFFERENT SHADES of
       specified colors to certain words based on the color to words mapping.

       Uses wordcloud.get_single_color_func

       Parameters
       ----------
       color_to_words : dict(str -> list(str))
         A dictionary that maps a color to the list of words.

       default_color : str
         Color that will be assigned to a word that's not a member
         of any value from color_to_words.
    """

    def __init__(self, color_to_words, default_color):
        self.color_func_to_words = [
            (get_single_color_func(color), set(words))
            for (color, words) in color_to_words.items()]

        self.default_color_func = get_single_color_func(default_color)

    def get_color_func(self, word):
        """Returns a single_color_func associated with the word"""
        try:
            color_func = next(
                color_func for (color_func, words) in self.color_func_to_words
                if word in words)
        except StopIteration:
            color_func = self.default_color_func

        return color_func

    def __call__(self, word, **kwargs):
        return self.get_color_func(word)(word, **kwargs)

wordcloud = WordCloud(width=1000, height=600, background_color='white', max_words = 500).generate(word_string)

color_to_words = {
    # words from cluster 0 will be colored with a yellow single color function
    'yellow': cl0_list,
    # words from cluster 1 will be colored with a black single color function
    'black': cl1_list,
    # clusters 2 and 3 use blue and bright green respectively
    'blue': cl2_list,
    '#00ff00': cl3_list
}

default_color = 'grey'

print('Word cloud created!')

fig = plt.figure()
fig.set_figwidth(500)
fig.set_figheight(10)

grouped_color_func = GroupedColorFunc(color_to_words, default_color)

# Apply our color function
wordcloud.recolor(color_func=grouped_color_func)

plt.imshow(wordcloud, interpolation="bilinear")
plt.axis('off')
plt.show()

## Conclusion


###### Based on the clustering of neighborhoods, Cluster 1 has no supermarkets, followed, in order, by Cluster 2, Cluster 0 and Cluster 3, which contain some supermarkets. It would be a good option to open a supermarket in any of the neighborhoods that fall under Cluster 0. To choose a neighborhood within Cluster 0, this analysis can be extended with two more criteria. The first is population: a higher population brings more customers, which is essential for a new business. The second is the cost of land, which strongly affects the supermarket's return on investment; we can compare the cost of land across the neighborhoods in Cluster 0 and look for the cheaper options. Under these conditions, the best case is a neighborhood with a high population and a low cost of land, as this increases return on investment and supports a stable income.
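
The two extra criteria could be combined into a simple score. The sketch below is one possible way to do it, not part of the original analysis: it scales population and land cost to the range 0 to 1 and weights them equally, which is an assumption. The function `rank_candidates`, the neighborhood names and all numbers are placeholders, since no population or land cost data was collected in this notebook.

```python
def rank_candidates(candidates):
    """Rank neighborhoods so that high population and low land cost score best.

    Both criteria are min-max scaled to [0, 1] and weighted equally,
    which is an assumption of this sketch.
    """
    def scaled(values):
        lo, hi = min(values), max(values)
        return [0.5 if hi == lo else (v - lo) / (hi - lo) for v in values]

    population = scaled([c["population"] for c in candidates])
    land_cost = scaled([c["land_cost"] for c in candidates])
    # cheaper land should raise the score, so the scaled cost is inverted
    scores = [(p + (1 - k)) / 2 for p, k in zip(population, land_cost)]
    order = sorted(zip(scores, candidates), key=lambda pair: pair[0], reverse=True)
    return [(c["name"], round(s, 3)) for s, c in order]

# Placeholder values only; real figures would have to come from an external source.
placeholder = [
    {"name": "Neighborhood X", "population": 50000, "land_cost": 900},
    {"name": "Neighborhood Y", "population": 80000, "land_cost": 600},
    {"name": "Neighborhood Z", "population": 30000, "land_cost": 400},
]
for name, score in rank_candidates(placeholder):
    print(name, score)
```

Different weights for the two criteria would change the ranking, so the weighting should reflect how much each factor matters to the business.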