<a href="https://colab.research.google.com/github/ruamaz/coursera_project/blob/main/CapstoneProject_looking_for_new_bar_to_share.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction / Problem description

In this project, we try to find an optimal location for a new bar. Specifically, this report is targeted at stakeholders interested in opening a bar in Moscow city centre, Russia.

Since there are already many bars in the city centre, mostly placed around crowded locations, the project focuses on the areas least occupied by bars.
However, areas that are not crowded should also be avoided.

In simple terms, the new bar should be far enough from other bars but still in a crowded area.

The locations of bars and other venues are obtained from the Foursquare API.

# Methodology

The methodology consists of the following steps:


*   Get the locations of bars in the city centre
*   Cluster the locations to identify *crowded areas* (the idea is that existing bars are already located around profitable areas)
*   Find the least dense cluster (area / number of bars)
*   Within the selected cluster, find the spots that are furthest away from other bars
*   Check whether there are offices around the proposed locations (i.e. a proposed bar location should be close to offices, which attract people)





# Import libraries for use

In [1]:
import pandas as pd
import numpy as np

! pip install folium==0.5.0
import folium # plotting library

# import k-means for the clustering stage
from sklearn.cluster import KMeans

# Matplotlib colour utilities
import matplotlib.cm as cm
import matplotlib.colors as colors

import requests
import json

from geopy.geocoders import Nominatim # convert an address into latitude and longitude
from pandas import json_normalize # transform a JSON response into a pandas dataframe



In [2]:
#Use geojson file to write out the features
! pip install geojson
from geojson import FeatureCollection, Feature, Polygon



In [3]:
import os

In [4]:
from scipy.spatial import Voronoi, voronoi_plot_2d

In [5]:
from scipy.spatial import ConvexHull, convex_hull_plot_2d

In [6]:
from sklearn.neighbors import NearestNeighbors
from scipy.spatial import Delaunay


In [37]:
!pip install https://github.com/barseghyanartur/transliterate/archive/stable.tar.gz
from transliterate import translit



# Project solution steps

## Collect existing bars location in Moscow 

In [7]:
# find the latitude/longitude of the city
geolocator = Nominatim(user_agent="coursera")
address = 'Moscow'
try:
    location = geolocator.geocode(address)
    latitude = location.latitude
    longitude = location.longitude
    print('The geographical coordinates of {} are {}, {}.'.format(address, latitude, longitude))
except AttributeError:
    # geocode() returns None when the address cannot be resolved
    print('Cannot find: {}'.format(address))


The geographical coordinates of Moscow are 55.7504461, 37.6174943.


In [9]:
#search for certain category around the city 
LIMIT = 500 # limit of number of venues returned by Foursquare API
radius = 3000 # define radius
search_query = 'bar'
# create URL
url = 'https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&ll={},{}&v={}&query={}&radius={}&limit={}'.format(
    CLIENT_ID,
    CLIENT_SECRET,
    latitude,
    longitude,
    VERSION,
    search_query, 
    radius,
    LIMIT)
url # display URL

'https://api.foursquare.com/v2/venues/search?client_id=DO2PODWSDB5IELF5XVSEWDAFK2UXIY0LUHSNZ5JPU3HI3UAH&client_secret=RAI0055P3A5SXZYTKYYIWLEHQOT4D452BZ0KKNMTLRKUXGFQ&ll=55.7504461,37.6174943&v=20180605&query=bar&radius=3000&limit=500'
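The request URL above is assembled by hand with `str.format`. A minimal sketch of an alternative, assuming the same `venues/search` endpoint and parameters, is to let the standard library escape the query string and to read the credentials from environment variables so they never appear in cell output. The helper name `build_search_url` and the variable names `FOURSQUARE_CLIENT_ID` / `FOURSQUARE_CLIENT_SECRET` are hypothetical, chosen only for this illustration.

```python
import os
from urllib.parse import urlencode

FOURSQUARE_SEARCH = "https://api.foursquare.com/v2/venues/search"


def build_search_url(lat, lng, query, radius, limit,
                     client_id, client_secret, version="20180605"):
    """Return a venues/search URL with escaped query parameters."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "ll": f"{lat},{lng}",   # latitude,longitude of the search centre
        "v": version,           # API version date, as used in the notebook
        "query": query,
        "radius": radius,       # search radius in metres
        "limit": limit,
    }
    return f"{FOURSQUARE_SEARCH}?{urlencode(params)}"


# Hypothetical environment variable names; set them outside the notebook.
CLIENT_ID = os.environ.get("FOURSQUARE_CLIENT_ID", "YOUR_CLIENT_ID")
CLIENT_SECRET = os.environ.get("FOURSQUARE_CLIENT_SECRET", "YOUR_CLIENT_SECRET")

url = build_search_url(55.7504461, 37.6174943, "bar", 3000, 500,
                       CLIENT_ID, CLIENT_SECRET)
```

Note that `urlencode` percent-encodes the comma in `ll` as `%2C`, which is an equivalent encoding of the same value.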

In [10]:
#Send the GET Request
results = requests.get(url).json()

# assign relevant part of JSON to venues
venues = results['response']['venues']

# transform venues into a dataframe
dataframe = json_normalize(venues)
#dataframe.head()
#results

  


In [11]:
# keep only columns that include venue name, and anything that is associated with location
filtered_columns = ['name', 'categories'] + [col for col in dataframe.columns if col.startswith('location.')] + ['id']
dataframe_filtered = dataframe.loc[:, filtered_columns]

# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

# filter the category for each row
dataframe_filtered['categories'] = dataframe_filtered.apply(get_category_type, axis=1)

# clean column names by keeping only last term
dataframe_filtered.columns = [column.split('.')[-1] for column in dataframe_filtered.columns]

dataframe_filtered.head(2)

Unnamed: 0,name,categories,address,crossStreet,lat,lng,labeledLatLngs,distance,postalCode,cc,city,state,country,formattedAddress,neighborhood,id
0,Baga Bar,Indian Restaurant,"Пятницкая ул., 25, стр. 1",Новокузнецкая ул.,55.742151,37.63016,"[{'label': 'display', 'lat': 55.74215129898238...",1217,115035,RU,Москва,Москва,Россия,"[Пятницкая ул., 25, стр. 1 (Новокузнецкая ул.)...",,51406dc9e4b09ccb3c1de43e
1,Papa's Bar & Grill,Bar,"Никольская ул., 10",Большой Черкасский пер.,55.758177,37.624839,"[{'label': 'display', 'lat': 55.75817694127271...",975,119019,RU,Москва,Москва,Россия,"[Никольская ул., 10 (Большой Черкасский пер.),...",,4bfb5775bbb7c92810120843


## Map found bars 

In [12]:
print(dataframe_filtered.shape[0], "bars found in the city center")

50 bars found in the city center


In [15]:
city_map = folium.Map(location=[latitude, longitude], zoom_start=14) # generate map centred around the city

# add a red circle marker to represent the city
folium.CircleMarker(
    [latitude, longitude],
    radius=10,
    color='red',
    popup=address,
    fill = True,
    fill_color = 'red',
    fill_opacity = 0.6
).add_to(city_map)

# add the venues as blue circle markers
for lat, lng in zip(dataframe_filtered.lat, dataframe_filtered.lng):
    folium.CircleMarker(
        [lat, lng],
        radius=2,
        color='blue',
        #popup=label,
        fill = True,
        fill_color='blue',
        fill_opacity=0.6
    ).add_to(city_map)


city_map

## Cluster analysis and the most promising cluster

### K-means clustering

In [16]:
# set number of clusters
kclusters = 5

df_for_clustering = dataframe_filtered[['lat','lng']]  

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(df_for_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([2, 1, 2, 1, 1, 0, 3, 1, 1, 1], dtype=int32)

In [17]:
# add clustering labels
df_clusters=pd.DataFrame({'Cluster':kmeans.labels_ })

dataframe_filtered = dataframe_filtered.join(df_clusters)

dataframe_filtered.head(2)

Unnamed: 0,name,categories,address,crossStreet,lat,lng,labeledLatLngs,distance,postalCode,cc,city,state,country,formattedAddress,neighborhood,id,Cluster
0,Baga Bar,Indian Restaurant,"Пятницкая ул., 25, стр. 1",Новокузнецкая ул.,55.742151,37.63016,"[{'label': 'display', 'lat': 55.74215129898238...",1217,115035,RU,Москва,Москва,Россия,"[Пятницкая ул., 25, стр. 1 (Новокузнецкая ул.)...",,51406dc9e4b09ccb3c1de43e,2
1,Papa's Bar & Grill,Bar,"Никольская ул., 10",Большой Черкасский пер.,55.758177,37.624839,"[{'label': 'display', 'lat': 55.75817694127271...",975,119019,RU,Москва,Москва,Россия,"[Никольская ул., 10 (Большой Черкасский пер.),...",,4bfb5775bbb7c92810120843,1


In [18]:
#get clusters centers
centers=kmeans.cluster_centers_
df_centers=pd.DataFrame ( {'lat' : centers[:,0], 'lng':centers[:,1]} )


### Look for the least dense cluster (using the ConvexHull function)

In [19]:
# for each cluster, compute the convex hull area and the area per bar
df_cluster_densities = pd.DataFrame(columns=['Cluster','N_point','Area', 'Density'])
for i in range(kclusters):
  df_cluster=dataframe_filtered.loc[dataframe_filtered['Cluster']==i,['lat','lng']]
  n_points=df_cluster.shape[0]
  if n_points > 2 :
    # for 2D input, ConvexHull.volume is the enclosed area (in squared degrees here)
    area = ConvexHull(df_cluster).volume
  else:
    area=0
  # 'Density' is the area per bar: the larger the value, the less occupied the cluster
  density=area/n_points
  df_cluster_densities=df_cluster_densities.append({'Cluster':i,'N_point': n_points, 'Area': area, 'Density': density}, ignore_index=True)

the_promissing_cluster= df_cluster_densities.loc[df_cluster_densities['Density']==df_cluster_densities['Density'].max(),'Cluster'].max() 
print('the least occupied cluster is ' , the_promissing_cluster)
df_cluster_densities

the least occupied cluster is  0.0


Unnamed: 0,Cluster,N_point,Area,Density
0,0.0,7.0,0.000271,3.9e-05
1,1.0,20.0,0.000182,9e-06
2,2.0,8.0,0.00025,3.1e-05
3,3.0,10.0,0.000122,1.2e-05
4,4.0,5.0,0.000109,2.2e-05
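The `Area` values in the table are in squared degrees, because the convex hull is computed directly on latitude/longitude pairs, and at Moscow's latitude one degree of longitude is much shorter than one degree of latitude. As a rough sketch of how the same comparison could be expressed in bars per square kilometre, the code below projects points to local metres with an equirectangular approximation and measures the hull with a plain-Python monotone chain and the shoelace formula. This is an illustration under stated assumptions (spherical Earth, small area), not the notebook's method; all function names are hypothetical.

```python
import math

EARTH_RADIUS_M = 6371000.0  # mean Earth radius (spherical approximation)


def to_local_metres(lat, lng, lat0, lng0):
    """Equirectangular projection of (lat, lng) around a reference point."""
    x = EARTH_RADIUS_M * math.radians(lng - lng0) * math.cos(math.radians(lat0))
    y = EARTH_RADIUS_M * math.radians(lat - lat0)
    return x, y


def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices counterclockwise."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def polygon_area(vertices):
    """Shoelace formula for the area of a simple polygon."""
    n = len(vertices)
    total = 0.0
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


def bars_per_km2(latlngs):
    """Number of points divided by their convex hull area in km^2."""
    lat0 = sum(lat for lat, _ in latlngs) / len(latlngs)
    lng0 = sum(lng for _, lng in latlngs) / len(latlngs)
    xy = [to_local_metres(lat, lng, lat0, lng0) for lat, lng in latlngs]
    area_km2 = polygon_area(convex_hull(xy)) / 1e6
    return len(latlngs) / area_km2 if area_km2 > 0 else float("inf")
```

In the notebook this could be called with the rows of one cluster, for example `bars_per_km2(list(zip(df_cluster.lat, df_cluster.lng)))`; a lower value then means a less occupied cluster.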


### Make a Voronoi grid to visualize the boundaries of the clusters

In [20]:
df_voronoi=df_centers
vor=Voronoi(df_voronoi)
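A Voronoi diagram of the cluster centres works as a boundary map because k-means assigns every point to its nearest centre, and a Voronoi cell is exactly the set of locations closer to one centre than to any other. The sketch below illustrates that rule with plain Euclidean distance on latitude/longitude, the same metric k-means uses here. The centre coordinates are made up for the example and the helper name is hypothetical.

```python
import math


def nearest_center(point, centers):
    """Index of the center closest to `point` (Euclidean distance)."""
    return min(
        range(len(centers)),
        key=lambda i: math.hypot(point[0] - centers[i][0],
                                 point[1] - centers[i][1]),
    )


# Made-up cluster centres as (lat, lng); not taken from the notebook.
example_centers = [(55.74, 37.58), (55.76, 37.62), (55.75, 37.65)]

# A location falls in the Voronoi cell of the centre returned here.
cell_a = nearest_center((55.745, 37.59), example_centers)
cell_b = nearest_center((55.76, 37.63), example_centers)
```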

In [21]:
def voronoi_finite_polygons_2d(vor, radius=None):
    """Reconstruct infinite Voronoi regions in a
    2D diagram to finite regions.
    Source:
    https://stackoverflow.com/a/20678647/1595060
    """
    if vor.points.shape[1] != 2:
        raise ValueError("Requires 2D input")
    new_regions = []
    new_vertices = vor.vertices.tolist()
    center = vor.points.mean(axis=0)
    if radius is None:
        radius = vor.points.ptp().max()
    # Construct a map containing all ridges for a
    # given point
    all_ridges = {}
    for (p1, p2), (v1, v2) in zip(vor.ridge_points,
                                  vor.ridge_vertices):
        all_ridges.setdefault(
            p1, []).append((p2, v1, v2))
        all_ridges.setdefault(
            p2, []).append((p1, v1, v2))
    # Reconstruct infinite regions
    for p1, region in enumerate(vor.point_region):
        vertices = vor.regions[region]
        if all(v >= 0 for v in vertices):
            # finite region
            new_regions.append(vertices)
            continue
        # reconstruct a non-finite region
        ridges = all_ridges[p1]
        new_region = [v for v in vertices if v >= 0]
        for p2, v1, v2 in ridges:
            if v2 < 0:
                v1, v2 = v2, v1
            if v1 >= 0:
                # finite ridge: already in the region
                continue
            # Compute the missing endpoint of an
            # infinite ridge
            t = vor.points[p2] - \
                vor.points[p1]  # tangent
            t /= np.linalg.norm(t)
            n = np.array([-t[1], t[0]])  # normal
            midpoint = vor.points[[p1, p2]]. \
                mean(axis=0)
            direction = np.sign(
                np.dot(midpoint - center, n)) * n
            far_point = vor.vertices[v2] + \
                direction * radius
            new_region.append(len(new_vertices))
            new_vertices.append(far_point.tolist())
        # Sort region counterclockwise.
        vs = np.asarray([new_vertices[v]
                         for v in new_region])
        c = vs.mean(axis=0)
        angles = np.arctan2(
            vs[:, 1] - c[1], vs[:, 0] - c[0])
        new_region = np.array(new_region)[
            np.argsort(angles)]
        new_regions.append(new_region.tolist())
    return new_regions, np.asarray(new_vertices)

In [22]:
regions, vertices = voronoi_finite_polygons_2d(vor,radius=10)

In [23]:
file_name='voronoi.geojson'
data = file_name  # path passed to folium.GeoJson when the map is built
feature_list = []
for region in regions:
    # vertices are stored as (lat, lng); GeoJSON expects (lng, lat), so flip the order
    vertex_list = [(float(vertices[x][1]), float(vertices[x][0])) for x in region]
    # close the ring: a GeoJSON polygon must end on its first vertex
    vertex_list.append(vertex_list[0])
    # save the vertex list as a polygon and add it to the feature list
    polygon = Polygon([vertex_list])
    feature = Feature(geometry=polygon, properties={})
    feature_list.append(feature)
# write the computed Voronoi diagram to the output file
feature_collection = FeatureCollection(feature_list)
with open(file_name, 'w') as vorJSON:
    print(feature_collection, file=vorJSON)
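For reference, the structure that ends up in `voronoi.geojson` can be sketched with the standard library alone. The example below assumes the usual GeoJSON conventions: coordinates are written as `[lng, lat]`, and a polygon ring repeats its first vertex at the end. The triangle vertices are made up and the helper name is hypothetical.

```python
import json


def latlng_ring_to_polygon(latlng_vertices):
    """Build a GeoJSON Polygon geometry from (lat, lng) vertices."""
    # GeoJSON stores positions as [lng, lat], so the order is flipped.
    ring = [[lng, lat] for lat, lng in latlng_vertices]
    # Close the ring by repeating the first position if needed.
    if ring and ring[0] != ring[-1]:
        ring.append(list(ring[0]))
    return {"type": "Polygon", "coordinates": [ring]}


# Made-up vertices as (lat, lng); not taken from the notebook.
triangle = [(55.74, 37.58), (55.75, 37.58), (55.75, 37.60)]

feature_collection_example = {
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            "properties": {},
            "geometry": latlng_ring_to_polygon(triangle),
        }
    ],
}

geojson_text = json.dumps(feature_collection_example)
```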

## Update the map with clusters, boundaries and the most promising cluster

In [25]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=13) # generate map centred around the city

# set color scheme for the clusters: one rainbow colour per cluster
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

folium.GeoJson(data, name='voronoi').add_to(map_clusters)

#folium.Circle(location=[latitude, longitude], popup='searching radius', radius=radius, weight=2, color="#000").add_to(map_clusters)

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(dataframe_filtered.lat, dataframe_filtered.lng, dataframe_filtered.name, dataframe_filtered.Cluster):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)

# add a black circle marker to represent the city centre
folium.CircleMarker(
    [latitude, longitude],
    radius=2,
    color='black',
    popup="city center",
    fill = True,
    fill_color = 'black',
    fill_opacity = 0.6
).add_to(map_clusters)

# mark the centre of the least dense cluster
folium.Marker(
    location=[df_centers.lat[the_promissing_cluster], df_centers.lng[the_promissing_cluster]],
    popup="least dense cluster",
    icon=folium.Icon(color='red', icon='info-sign')
).add_to(map_clusters)

map_clusters


## Searching for areas where a new bar could be considered

Create a triangulated surface over the existing bars (Delaunay triangulation)


In [27]:
df_best_cluster=dataframe_filtered.loc[dataframe_filtered['Cluster']==the_promissing_cluster,['lat','lng']]
tri = Delaunay(df_best_cluster)
indices =tri.simplices

### Find the three largest triangles and locate their centroids (i.e. areas where bars are far away from each other)

In [28]:
#Select top 3 places
top_ranking=3
df_new_space_search= pd.DataFrame(columns=['Area','lat','lng','indx','N_spots'])

for i in range(indices.shape[0]):
  # the three bars that form triangle i
  df_filter=df_best_cluster.iloc[indices[i]]
  # for 2D input, ConvexHull.volume is the triangle area (in squared degrees)
  area = ConvexHull(df_filter).volume
  # centroid of the triangle
  cntr_lat= (df_filter['lat'].iloc[0] + df_filter['lat'].iloc[1] + df_filter['lat'].iloc[2]) / 3
  cntr_lng= (df_filter['lng'].iloc[0] + df_filter['lng'].iloc[1] + df_filter['lng'].iloc[2]) / 3
  df_new_space_search=df_new_space_search.append({'Area':area,'lat': cntr_lat, 'lng': cntr_lng,'indx':i}, ignore_index=True)

df_new_space_search= df_new_space_search.sort_values('Area',ascending = False) [:top_ranking]
df_new_space_search

Unnamed: 0,Area,lat,lng,indx,N_spots
2,0.000109,55.744973,37.577755,2.0,
4,5e-05,55.743649,37.593417,4.0,
1,3.5e-05,55.741162,37.585842,1.0,
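The triangle areas above rank the candidates, but they do not say how far a centroid is from the nearest bar in metres. A minimal sketch of that check, assuming a spherical Earth and the haversine formula, is given below. The candidate is the first centroid from the table; the two bar coordinates are the sample rows shown earlier (Baga Bar and Papa's Bar & Grill), which belong to other clusters and are used only to make the example runnable. In the notebook the rows of `df_best_cluster` would be passed instead. Function names are hypothetical.

```python
import math


def haversine_m(lat1, lng1, lat2, lng2, radius_m=6371000.0):
    """Great-circle distance in metres between two (lat, lng) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lng2 - lng1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * radius_m * math.asin(math.sqrt(a))


def nearest_bar_distance_m(candidate, bars):
    """Distance in metres from a candidate (lat, lng) to the closest bar."""
    return min(haversine_m(candidate[0], candidate[1], lat, lng)
               for lat, lng in bars)


candidate = (55.744973, 37.577755)        # first centroid in the table above
sample_bars = [(55.742151, 37.63016),     # Baga Bar
               (55.758177, 37.624839)]    # Papa's Bar & Grill

distance_m = nearest_bar_distance_m(candidate, sample_bars)
```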


### Check how many offices are around these locations

(more offices -> more people -> more preferred bar location)

In [52]:
# search for offices around the candidate locations
LIMIT = 100 # limit of number of venues returned by Foursquare API
radius = 300 # search radius in metres
search_query="office"
spots_frames = []
n_spots_col = df_new_space_search.columns.get_loc('N_spots')

for k in range (df_new_space_search.shape[0]):
  # create URL
  url = 'https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&ll={},{}&v={}&query={}&radius={}&limit={}'.format(
    CLIENT_ID,
    CLIENT_SECRET,
    df_new_space_search['lat'].iloc[k], 
    df_new_space_search['lng'].iloc[k],
    VERSION,
    search_query, 
    radius,
    LIMIT)
 
  # send the GET request
  results = requests.get(url).json()

  # transform the relevant part of the JSON response into a dataframe
  dataframe = pd.json_normalize(results['response']['venues'])

  # store the number of offices found; indexing the frame directly avoids chained assignment
  df_new_space_search.iloc[k, n_spots_col] = dataframe.shape[0]
  spots_frames.append(dataframe)

# combine the venues found around all candidate locations
spots_dataframe = pd.concat(spots_frames)

df_new_space_search



Unnamed: 0,Area,lat,lng,indx,N_spots
2,0.000109,55.744973,37.577755,2.0,6.0
4,5e-05,55.743649,37.593417,4.0,8.0
1,3.5e-05,55.741162,37.585842,1.0,2.0
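With the office counts filled in, the final choice reduces to ranking the three candidates. The sketch below copies the values from the table above and sorts by number of offices; breaking ties by triangle area is an assumption made for this illustration, not a rule stated in the analysis.

```python
# Candidate locations copied from the table above.
candidates = [
    {"indx": 2, "area": 0.000109, "lat": 55.744973, "lng": 37.577755, "n_offices": 6},
    {"indx": 4, "area": 5e-05, "lat": 55.743649, "lng": 37.593417, "n_offices": 8},
    {"indx": 1, "area": 3.5e-05, "lat": 55.741162, "lng": 37.585842, "n_offices": 2},
]

# More offices first; ties broken by larger triangle area (an assumption).
ranked = sorted(candidates,
                key=lambda c: (c["n_offices"], c["area"]),
                reverse=True)
best = ranked[0]

print("best candidate: triangle {} at ({}, {}) with {} offices".format(
    best["indx"], best["lat"], best["lng"], best["n_offices"]))
```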


### Clean *spots* dataframe

In [53]:
# keep only columns that include venue name, and anything that is associated with location
filtered_columns = ['name', 'categories'] + [col for col in spots_dataframe.columns if col.startswith('location.')] + ['id']
spots_dataframe_filtered = spots_dataframe.loc[:, filtered_columns]

# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

# filter the category for each row
spots_dataframe_filtered['categories'] = spots_dataframe_filtered.apply(get_category_type, axis=1)

# clean column names by keeping only last term
spots_dataframe_filtered.columns = [column.split('.')[-1] for column in spots_dataframe_filtered.columns]

spots_dataframe_filtered.head(2)

Unnamed: 0,name,categories,lat,lng,labeledLatLngs,distance,cc,country,formattedAddress,address,city,state,postalCode,crossStreet,id
0,WRF Office,Office,55.747401,37.580803,"[{'label': 'display', 'lat': 55.747401, 'lng':...",330,RU,Россия,[Россия],,,,,,55759393498eedcbafffd84a
1,Daddy's office,Office,55.746669,37.579491,"[{'label': 'display', 'lat': 55.746669, 'lng':...",217,RU,Россия,[Россия],,,,,,519f5ff1498eeddbeb3b8bd8


## Make map with proposed locations for a new bar 

In [74]:
# create map
map_suggestions = folium.Map(location=[df_centers.lat[the_promissing_cluster], df_centers.lng[the_promissing_cluster]], zoom_start=15) # generate map centred on the most promising cluster

# set color scheme for the clusters: one rainbow colour per cluster
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(dataframe_filtered.lat, dataframe_filtered.lng, dataframe_filtered.name, dataframe_filtered.Cluster):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_suggestions)

# add the spots as black circle markers
for lat, lng in zip(spots_dataframe_filtered.lat, spots_dataframe_filtered.lng):
    folium.CircleMarker(
        [lat, lng],
        radius=3,
        color='black',
       # popup=label, #translit(label ,reversed=True),
        fill = True,
        fill_color='red',
        fill_opacity=0.6
    ).add_to(map_suggestions)


for i in range(df_new_space_search.shape[0]):
  folium.Marker(
      location=[df_new_space_search['lat'].iloc[i], df_new_space_search['lng'].iloc[i]],
      popup="proposed loc " + str(df_new_space_search['N_spots'].iloc[i])+" offices around",
      icon=folium.Icon(color='green', icon='cloud')
  ).add_to(map_suggestions)

  idx=int(df_new_space_search['indx'].iloc[i])
  
  df_filter=df_best_cluster.iloc[indices[idx]]

  points = []
  for k in range(df_filter.shape[0]):
      points.append([df_filter['lat'].iloc[k], df_filter['lng'].iloc[k]])
  points.append([df_filter['lat'].iloc[0], df_filter['lng'].iloc[0]])

  map_suggestions.add_child(folium.PolyLine(locations=points,weight=2,color='black'),index=i)

#points
map_suggestions

# Results and Discussion

The analysis shows that there are several areas (clusters) in the city center where people can find a bar. The area around Arbat appears to have the lowest density of bars and has potential for a new venue.

Among the least occupied areas, "Perichenskiy pereulok" seems to have the highest potential, with 8 offices within walking distance (the 300 m search radius).

# Conclusions

The purpose of the project was to recommend a location for a new bar in Moscow city centre that balances distance from existing bars with proximity to busy areas.
The selection logic was to find the most suitable location as an infill in the existing network of bars.
To achieve this, clustering analysis was used to define "bar-busy" areas, and the least dense cluster was investigated further to find the most "bar-empty" region.
More advanced analysis of pedestrian traffic and points of attraction (offices, clubs, shops) could further improve the recommendations.