# This is a Jupyter Notebook for the IBM Applied Data Science Capstone project

# Initial Prague Neighbourhoods Data Acquisition

First of all, we have to acquire the data about the neighbourhoods and boroughs of Prague.

Install all the required libraries if they are not already present in the Python environment. We use:

* pandas: for dataframes
* numpy: for mathematical operations
* requests: for web scraping
* bs4: for web scraping, parsing of the html to the pandas dataframe
* folium: for map visualisation
* sklearn: for clustering
* geocoder: for access to the TomTom API for location retrieval
* matplotlib: for plotting, colours 

In [2]:
!pip3 install pandas numpy requests beautifulsoup4 folium scikit-learn geocoder matplotlib
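
To see which of the libraries above are actually missing before installing anything, a small check along these lines may help. It is only a sketch: `REQUIRED_MODULES` and `missing_modules` are illustrative names, and the list uses import names (`sklearn`, `bs4`), which differ from the pip package names.

```python
import importlib.util

# Import names, not pip names: scikit-learn is imported as `sklearn`
# and beautifulsoup4 as `bs4`.
REQUIRED_MODULES = ['pandas', 'numpy', 'requests', 'bs4',
                    'folium', 'sklearn', 'geocoder', 'matplotlib']


def missing_modules(module_names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in module_names
            if importlib.util.find_spec(name) is None]


print(missing_modules(REQUIRED_MODULES))
```

An empty list means every library can be imported and the install cell can be skipped.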




Now let's import all the libraries we are going to use.

In [3]:
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup
import folium
from sklearn.cluster import KMeans
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors
import geocoder
import time


Scrape the HTML from the URL below, which provides statistical information about the Prague neighbourhoods.

In [4]:
html_data = requests.get(
    url='https://vdb.czso.cz/vdbvo2/faces/cs/index.jsf?page=vystup-objekt&z=T&f=TABULKA&skupId=1372&katalog=30845&pvo=DEM01D-PHA&pvo=DEM01D-PHA&str=v33&c=v3~2__RP2019MP12DP31#fx=0')


Load the scraped HTML into a BeautifulSoup object for further analysis and conversion to a pandas dataframe.

In [5]:
soup = BeautifulSoup(html_data.text, 'html')


Now create the dataframe from the scraped data. We'll differentiate boroughs and neighbourhoods based on the 'td' class.

In [6]:
rows = []
borough = None

for row in soup.find('table', id='tabData').find('tbody').find_all('tr'):
    col = row.find_all('td')
    if len(col) > 0:
        classes = col[0].get('class') or []
        # Borough row
        if 'genCls5000' in classes or 'genCls7000' in classes:
            borough = col[0].span.text
        # Neighbourhood row
        if 'genCls1000' in classes:
            rows.append({
                'Borough': borough,
                'Neighbourhood': col[0].span.text,
                # strip the non-breaking space used as thousands separator
                'TotalInhabitants': col[1].span.text.replace('\xa0', ''),
                # decimal comma -> decimal point
                'AverageAge': col[5].span.text.replace(',', '.'),
            })

# DataFrame.append was removed in pandas 2.0, so build the frame once from a list of dicts
prague_districts = pd.DataFrame(
    rows, columns=['Borough', 'Neighbourhood', 'TotalInhabitants', 'AverageAge'])

prague_districts.head()


Unnamed: 0,Borough,Neighbourhood,TotalInhabitants,AverageAge
0,SO Praha 1,Praha 1,29563,43.7
1,SO Praha 2,Praha 2,50363,41.3
2,SO Praha 3,Praha 3,76041,41.6
3,SO Praha 4,Praha 4,132068,44.2
4,SO Praha 4,Praha-Kunratice,10023,38.2
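
The two string replacements in the cell above handle Czech number formatting: a non-breaking space as thousands separator and a comma as decimal mark. A minimal sketch of the same cleaning as a reusable helper (`parse_czech_number` is an illustrative name, not part of the notebook):

```python
def parse_czech_number(text):
    """Convert a Czech-formatted number string to float.

    Removes non-breaking and regular spaces used as thousands separators
    and turns the decimal comma into a decimal point.
    """
    cleaned = text.replace('\xa0', '').replace(' ', '').replace(',', '.')
    return float(cleaned)


print(parse_czech_number('132\xa0068'))  # 132068.0
print(parse_czech_number('43,7'))        # 43.7
```

Converting at scrape time like this would make the later `astype(float)` step unnecessary.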


Get the latitude and longitude of each Neighbourhood from the TomTom API. As the API has request quotas, we also add a sleep between calls to slow down the rate of requests going to the API.

In [7]:
tom_tom_api_key = 'ajKyRUiFdkGnACvUJPCyFOiAXsFAxHAs'


In [8]:
location_rows = []

for neighbourhood in prague_districts['Neighbourhood']:
    location = geocoder.tomtom(neighbourhood, key=tom_tom_api_key)
    print(neighbourhood)
    location_rows.append(
        {'Neighbourhood': neighbourhood, 'Latitude': location.json['lat'], 'Longitude': location.json['lng']})
    # stay below the API request quota
    time.sleep(0.5)

# DataFrame.append was removed in pandas 2.0, so build the frame once from a list of dicts
neighbourhood_locations = pd.DataFrame(
    location_rows, columns=['Neighbourhood', 'Latitude', 'Longitude'])
neighbourhood_locations.head()


Praha 1
Praha 2
Praha 3
Praha 4
Praha-Kunratice
Praha 5
Praha-Slivenec
Praha 6
Praha-Lysolaje
Praha-Nebušice
Praha-Přední Kopanina
Praha-Suchdol
Praha 7
Praha-Troja
Praha 8
Praha-Březiněves
Praha-Ďáblice
Praha-Dolní Chabry
Praha 9
Praha 10
Praha 11
Praha-Křeslice
Praha-Šeberov
Praha-Újezd
Praha 12
Praha-Libuš
Praha 13
Praha-Řeporyje
Praha 14
Praha-Dolní Počernice
Praha 15
Praha-Dolní Měcholupy
Praha-Dubeč
Praha-Petrovice
Praha-Štěrboholy
Praha 16
Praha-Lipence
Praha-Lochkov
Praha-Velká Chuchle
Praha-Zbraslav
Praha 17
Praha-Zličín
Praha 18
Praha-Čakovice
Praha 19
Praha-Satalice
Praha-Vinoř
Praha 20
Praha 21
Praha-Běchovice
Praha-Klánovice
Praha-Koloděje
Praha 22
Praha-Benice
Praha-Kolovraty
Praha-Královice
Praha-Nedvězí


Unnamed: 0,Neighbourhood,Latitude,Longitude
0,Praha 1,50.08796,14.42122
1,Praha 2,50.0753,14.44693
2,Praha 3,50.08813,14.46823
3,Praha 4,50.0628,14.44091
4,Praha-Kunratice,50.01341,14.4831
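
The fixed half-second sleep can be combined with a simple retry so that one failed lookup does not abort the whole loop. The sketch below is an assumption about how that could look, not the notebook's method: `throttled_map` and `fake_geocode` are hypothetical names, and the fake function stands in for the real geocoding call so the example runs offline.

```python
import time


def throttled_map(func, items, delay=0.5, retries=2):
    """Call func(item) for each item, sleeping `delay` seconds after each item.

    A call that raises is retried up to `retries` times, doubling the wait
    after every failure; items that keep failing are mapped to None.
    """
    results = {}
    for item in items:
        wait = delay
        for attempt in range(retries + 1):
            try:
                results[item] = func(item)
                break
            except Exception:
                if attempt == retries:
                    results[item] = None
                else:
                    time.sleep(wait)
                    wait *= 2
        time.sleep(delay)
    return results


def fake_geocode(name):
    """Hypothetical stand-in for the real geocoding request."""
    if name == 'Nowhere':
        raise ValueError('no result')
    return (50.0, 14.4)


print(throttled_map(fake_geocode, ['Praha 1', 'Nowhere'], delay=0))
```

In the notebook, `func` would wrap the `geocoder.tomtom` call and return the latitude and longitude pair.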


Join our two pandas data frames and see the top 5 rows.

In [9]:
prague_districts_geo = neighbourhood_locations.join(
    prague_districts.set_index('Neighbourhood'), on='Neighbourhood')
prague_districts_geo.head(5)


Unnamed: 0,Neighbourhood,Latitude,Longitude,Borough,TotalInhabitants,AverageAge
0,Praha 1,50.08796,14.42122,SO Praha 1,29563,43.7
1,Praha 2,50.0753,14.44693,SO Praha 2,50363,41.3
2,Praha 3,50.08813,14.46823,SO Praha 3,76041,41.6
3,Praha 4,50.0628,14.44091,SO Praha 4,132068,44.2
4,Praha-Kunratice,50.01341,14.4831,SO Praha 4,10023,38.2


Run describe to see how many Neighbourhoods and Boroughs we have in the data.

In [10]:
prague_districts_geo.describe(include=[object])



Unnamed: 0,Neighbourhood,Borough,TotalInhabitants,AverageAge
count,57,57,57,57.0
unique,57,22,57,39.0
top,Praha-Křeslice,SO Praha 15,12559,40.5
freq,1,5,1,4.0


What data types do we have?

In [11]:
prague_districts_geo.dtypes

Neighbourhood        object
Latitude            float64
Longitude           float64
Borough              object
TotalInhabitants     object
AverageAge           object
dtype: object

We see that TotalInhabitants and AverageAge are both 'object' types; we need to convert them to float if we want to get any insights. Then we run describe to see the statistics for those columns.

In [12]:
prague_districts_geo['TotalInhabitants'] = prague_districts_geo['TotalInhabitants'].astype(float, copy=True)
prague_districts_geo['AverageAge'] = prague_districts_geo['AverageAge'].astype(
    float, copy=True)
prague_districts_geo.describe()


Unnamed: 0,Latitude,Longitude,TotalInhabitants,AverageAge
count,57.0,57.0,57.0,57.0
mean,50.068642,14.483619,23232.929825,40.284211
std,0.050685,0.105877,33740.836491,1.90533
min,49.96248,14.29442,333.0,36.4
25%,50.02347,14.39543,2733.0,39.2
50%,50.06532,14.48396,6035.0,40.3
75%,50.11139,14.56721,29563.0,41.3
max,50.16647,14.66915,132068.0,44.6
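
`astype(float)` raises an error as soon as one scraped value is not a clean number. A more forgiving variant, sketched here on made-up sample values (the `'n/a'` entry is invented for illustration), is `pd.to_numeric` with `errors='coerce'`, which turns unparseable strings into NaN so they can be inspected afterwards:

```python
import pandas as pd

# Toy data in the same shape as the scraped columns; 'n/a' is a made-up bad value.
raw = pd.DataFrame({
    'TotalInhabitants': ['29563', '50363', 'n/a'],
    'AverageAge': ['43.7', '41.3', '41.6'],
})

# Convert every column; strings that are not numbers become NaN instead of raising.
numeric = raw.apply(pd.to_numeric, errors='coerce')

print(numeric.dtypes)
print(numeric.isna().sum())
```

Counting the NaN values per column afterwards shows whether the scrape produced anything unexpected.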


## Map Visualisation of Neighbourhoods

Now let's take a look at the Neighbourhoods on the map.

First get the latitude and longitude of Prague as a basis for charting with Folium.

In [13]:
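address = 'Prague'

# Geocode the city itself. Passing the leftover loop variable `neighbourhood`
# instead returns the last neighbourhood processed (Praha-Nedvězí), which is
# what the output recorded below shows.
location = geocoder.tomtom(address, key=tom_tom_api_key)
latitude = location.json['lat']
longitude = location.json['lng']
print('The geographical coordinates of Prague are {}, {}.'.format(
    latitude, longitude))


The geographical coordinates of Prague are 50.01834, 14.65285.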
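
An alternative basis for the map that needs no extra API call is the mean of the neighbourhood coordinates we already have. This is a sketch of that idea, not the notebook's approach; `map_centre` is an illustrative name and the sample values are the five rows shown in the head of the location table.

```python
from statistics import mean


def map_centre(latitudes, longitudes):
    """Return the (latitude, longitude) mean of the given points."""
    return mean(latitudes), mean(longitudes)


# First five neighbourhoods from the location table
lats = [50.08796, 50.0753, 50.08813, 50.0628, 50.01341]
lngs = [14.42122, 14.44693, 14.46823, 14.44091, 14.4831]

centre = map_centre(lats, lngs)
print(centre)
```

For the full dataset the describe output already gives these means: 50.068642 for latitude and 14.483619 for longitude.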


Now create the Folium map and draw circles on it with our neighbourhoods

In [14]:
# create map of Prague using latitude and longitude values
map_prague = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighbourhood in zip(prague_districts_geo['Latitude'], prague_districts_geo['Longitude'],
                                            prague_districts_geo['Borough'], prague_districts_geo['Neighbourhood']):
    label = '{}, {}'.format(neighbourhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_prague)

map_prague


## Venues Clustering for Neighbourhoods

The following function, taken from the Applied Data Science Capstone course, fetches all the venues for our neighbourhoods.

In [15]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):

    venues_list = []
    for name, lat, lng in zip(names, latitudes, longitudes):

        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID,
            CLIENT_SECRET,
            VERSION,
            lat,
            lng,
            radius,
            LIMIT)

        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']

        # return only relevant information for each nearby venue
        venues_list.append([(
            name,
            lat,
            lng,
            v['venue']['name'],
            v['venue']['location']['lat'],
            v['venue']['location']['lng'],
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame(
        [item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighbourhood',
                             'Neighbourhood Latitude',
                             'Neighbourhood Longitude',
                             'Venue',
                             'Venue Latitude',
                             'Venue Longitude',
                             'Venue Category']

    return(nearby_venues)
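
The function above indexes straight into the response, so a neighbourhood with no results, or a venue without a category, raises an exception. A more defensive extraction could look like the sketch below. It assumes only the nesting already used in `getNearbyVenues`; `extract_venues` is an illustrative name and the sample payload is made up.

```python
def extract_venues(payload):
    """Return (name, lat, lng, category) tuples from an explore-style payload.

    Assumes the nesting used in getNearbyVenues:
    payload['response']['groups'][0]['items'][i]['venue'].
    Missing pieces give an empty list or a None field instead of raising.
    """
    groups = payload.get('response', {}).get('groups') or []
    items = groups[0].get('items', []) if groups else []
    venues = []
    for item in items:
        venue = item.get('venue', {})
        location = venue.get('location', {})
        categories = venue.get('categories') or []
        category = categories[0].get('name') if categories else None
        venues.append((venue.get('name'), location.get('lat'),
                       location.get('lng'), category))
    return venues


# Made-up payload: one complete venue and one without categories
sample = {'response': {'groups': [{'items': [
    {'venue': {'name': 'Dior',
               'location': {'lat': 50.088309, 'lng': 14.420388},
               'categories': [{'name': 'Boutique'}]}},
    {'venue': {'name': 'Unnamed spot',
               'location': {'lat': 50.0, 'lng': 14.4},
               'categories': []}},
]}]}}

print(extract_venues(sample))
print(extract_venues({}))
```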


In [16]:
CLIENT_ID = 'T1GPNN0F3DDVR5HUEMG3AVOGD3GPKQ0QAJMHUYLF4520ZAUE'  # your Foursquare ID
# your Foursquare Secret
CLIENT_SECRET = '2SM12XT5EDJ5QXQAEDOJVGDYCIP40JCWXBPUSTT0LAJYBVUP'
VERSION = '20180605'  # Foursquare API version
LIMIT = 100  # A default Foursquare API limit value


Call our function to get the nearby venues for Neighbourhoods and print the first 5.

In [17]:
prague_venues = getNearbyVenues(names=prague_districts_geo['Neighbourhood'],
                                latitudes=prague_districts_geo['Latitude'],
                                longitudes=prague_districts_geo['Longitude']
                                 )
prague_venues.head()


Unnamed: 0,Neighbourhood,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Praha 1,50.08796,14.42122,Staroměstské náměstí | Old Town Square (Starom...,50.087371,14.421187,Plaza
1,Praha 1,50.08796,14.42122,Dior,50.088309,14.420388,Boutique
2,Praha 1,50.08796,14.42122,Bugsy's Bar,50.088948,14.419832,Cocktail Bar
3,Praha 1,50.08796,14.42122,AghaRTA Jazz Centrum,50.086388,14.422175,Jazz Club
4,Praha 1,50.08796,14.42122,The Emblem Hotel,50.087541,14.418491,Hotel


Let's see how many unique categories we have in the data.

In [18]:
print('There are {} unique categories.'.format(
    len(prague_venues['Venue Category'].unique())))


There are 231 unique categories.


Use one hot encoding and create groups of our neighbourhood venues.

In [19]:
# one hot encoding
prague_venues_onehot = pd.get_dummies(
    prague_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
prague_venues_onehot['Neighbourhood'] = prague_venues['Neighbourhood']

# move neighborhood column to the first column
fixed_columns = [prague_venues_onehot.columns[-1]] + \
    list(prague_venues_onehot.columns[:-1])
prague_venues_onehot = prague_venues_onehot[fixed_columns]

prague_venues_grouped = prague_venues_onehot.groupby(
    'Neighbourhood').mean().reset_index()
prague_venues_grouped.head()


Unnamed: 0,Neighbourhood,ATM,Art Gallery,Art Museum,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,Auto Dealership,Auto Garage,Auto Workshop,...,Vacation Rental,Vegetarian / Vegan Restaurant,Vehicle Inspection Station,Venezuelan Restaurant,Vietnamese Restaurant,Volleyball Court,Wine Bar,Wine Shop,Yoga Studio,Zoo
0,Praha 1,0.0,0.021277,0.010638,0.0,0.010638,0.0,0.0,0.0,0.0,...,0.0,0.010638,0.0,0.0,0.010638,0.0,0.021277,0.0,0.0,0.0
1,Praha 10,0.0,0.0,0.0,0.0,0.0,0.0,0.03125,0.0,0.03125,...,0.0,0.03125,0.0,0.0,0.03125,0.0,0.0625,0.0,0.0,0.0
2,Praha 11,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.090909,0.0,0.0,0.0
3,Praha 12,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0625,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Praha 13,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
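
Taking the mean of one-hot columns per neighbourhood gives the share of that neighbourhood's venues falling into each category. For example, the value 0.021277 for Art Gallery in Praha 1 is consistent with 2 venues out of 94. The sketch below computes the same shares with plain Python on made-up pairs; `category_shares` is an illustrative name.

```python
from collections import Counter, defaultdict


def category_shares(venues):
    """Map each neighbourhood to {category: share of its venues}.

    `venues` is an iterable of (neighbourhood, category) pairs.
    """
    counts = defaultdict(Counter)
    for neighbourhood, category in venues:
        counts[neighbourhood][category] += 1
    return {
        neighbourhood: {category: n / sum(counter.values())
                        for category, n in counter.items()}
        for neighbourhood, counter in counts.items()
    }


# Made-up sample: four venues in Praha 1, one in Praha 13
sample = [('Praha 1', 'Hotel'), ('Praha 1', 'Hotel'),
          ('Praha 1', 'Café'), ('Praha 1', 'Plaza'),
          ('Praha 13', 'Bus Stop')]

shares = category_shares(sample)
print(shares['Praha 1']['Hotel'])      # 0.5
print(shares['Praha 13']['Bus Stop'])  # 1.0
```

Note that a neighbourhood with very few venues gets large shares from a single venue, which affects the clustering that follows.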


Let's run the KMeans clustering with 5 clusters. This should show us which kinds of venues dominate in which locations.

In [20]:
# set number of clusters
kclusters = 5

prague_venues_grouped_clustering = prague_venues_grouped.drop(
    columns='Neighbourhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(
    prague_venues_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]


array([0, 0, 2, 0, 0, 0, 0, 0, 0, 0], dtype=int32)

Get the top 10 venues for each Neighbourhood

In [21]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)

    return row_categories_sorted.index.values[0:num_top_venues]


num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighbourhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
prague_venues_sorted = pd.DataFrame(columns=columns)
prague_venues_sorted['Neighbourhood'] = prague_venues_grouped['Neighbourhood']

for ind in np.arange(prague_venues_grouped.shape[0]):
    prague_venues_sorted.iloc[ind, 1:] = return_most_common_venues(
        prague_venues_grouped.iloc[ind, :], num_top_venues)

prague_venues_sorted.head()


Unnamed: 0,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Praha 1,Hotel,Café,Italian Restaurant,Restaurant,Pub,Czech Restaurant,Cocktail Bar,Boutique,Coffee Shop,Plaza
1,Praha 10,Wine Bar,Pizza Place,Café,Restaurant,Nightclub,Grocery Store,Bus Station,Bus Stop,Gift Shop,Soccer Field
2,Praha 11,Bus Stop,Soccer Field,Czech Restaurant,Market,Wine Bar,Food & Drink Shop,Furniture / Home Store,Plaza,Restaurant,Cafeteria
3,Praha 12,Dessert Shop,River,Tram Station,Food Stand,Pub,Movie Theater,Bus Stop,Flower Shop,Auto Workshop,Snack Place
4,Praha 13,Bus Stop,Gym,Fast Food Restaurant,Casino,Pub,Italian Restaurant,Reservoir,Mini Golf,Restaurant,Grocery Store
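
The column names above rely on a three-element suffix list with a fallback to 'th', which is correct up to 10 but would produce '21th' if `num_top_venues` were raised beyond 20. A general ordinal helper, sketched here under the illustrative name `ordinal`, avoids that:

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1st, 2nd, 3rd, 11th, 21st."""
    # 11, 12 and 13 (and 111, 112, ...) take 'th' despite ending in 1, 2, 3
    if 10 <= n % 100 <= 20:
        suffix = 'th'
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return f'{n}{suffix}'


columns = ['Neighbourhood'] + [
    f'{ordinal(i)} Most Common Venue' for i in range(1, 11)]
print(columns[1], '|', columns[-1])  # 1st Most Common Venue | 10th Most Common Venue
```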


Add Clustering labels so we can visualise it on the map

In [22]:
# add clustering labels
prague_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

# join prague_venues_sorted onto prague_districts_geo to add latitude/longitude for each neighbourhood
prague_merged = prague_districts_geo.join(
    prague_venues_sorted.set_index('Neighbourhood'), on='Neighbourhood')

prague_merged.head()  # check the last columns!


Unnamed: 0,Neighbourhood,Latitude,Longitude,Borough,TotalInhabitants,AverageAge,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Praha 1,50.08796,14.42122,SO Praha 1,29563.0,43.7,0,Hotel,Café,Italian Restaurant,Restaurant,Pub,Czech Restaurant,Cocktail Bar,Boutique,Coffee Shop,Plaza
1,Praha 2,50.0753,14.44693,SO Praha 2,50363.0,41.3,0,Café,Wine Bar,Pub,Beer Bar,Clothing Store,Cocktail Bar,Vietnamese Restaurant,Coffee Shop,Vegetarian / Vegan Restaurant,Bakery
2,Praha 3,50.08813,14.46823,SO Praha 3,76041.0,41.6,0,Bar,Vietnamese Restaurant,Café,Burger Joint,Czech Restaurant,Plaza,Restaurant,Steakhouse,Gym,Bowling Alley
3,Praha 4,50.0628,14.44091,SO Praha 4,132068.0,44.2,0,Café,Bar,Pizza Place,Restaurant,Pub,Kebab Restaurant,Vietnamese Restaurant,Gastropub,Theater,Beer Store
4,Praha-Kunratice,50.01341,14.4831,SO Praha 4,10023.0,38.2,0,Bus Stop,Czech Restaurant,Athletics & Sports,Pizza Place,Farmers Market,Soccer Field,Fruit & Vegetable Store,Café,Food & Drink Shop,Golf Course


Validate that we have 5 clusters.

In [23]:
# Ensure there are no floats in the cluster labels
print('There are {} unique clusters.'.format(
    len(prague_merged['Cluster Labels'].unique())))
prague_merged['Cluster Labels'].unique()


There are 5 unique clusters.


array([0, 1, 2, 3, 4], dtype=int32)
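
Besides the number of distinct labels, it is worth looking at how many neighbourhoods fall into each cluster, since very small clusters carry little information. A minimal sketch (`cluster_sizes` is an illustrative name), applied to the first ten labels printed after the k-means cell:

```python
from collections import Counter


def cluster_sizes(labels):
    """Return {cluster label: number of members}, sorted by label."""
    return dict(sorted(Counter(labels).items()))


# First ten labels from the kmeans.labels_[0:10] output
first_labels = [0, 0, 2, 0, 0, 0, 0, 0, 0, 0]
print(cluster_sizes(first_labels))  # {0: 9, 2: 1}
```

Run on the full `Cluster Labels` column, this makes it visible that clusters 3 and 4 each contain a single neighbourhood, as the per-cluster tables further down show.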

Visualise the clusters on the map

In [24]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(prague_merged['Latitude'], prague_merged['Longitude'], prague_merged['Neighbourhood'], prague_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' +
                         str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)

map_clusters
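
The colour setup above builds an intermediate `ys` list only to get its length, and indexes the palette with `cluster-1`, so cluster 0 receives the last colour. A simpler palette that is indexed directly by cluster label could be sketched as follows. It uses the standard-library `colorsys` module rather than matplotlib, and `cluster_palette` is an illustrative name.

```python
import colorsys


def cluster_palette(k):
    """Return k hex colours with evenly spaced hues, one per cluster label."""
    colours = []
    for i in range(k):
        r, g, b = colorsys.hsv_to_rgb(i / k, 0.8, 0.9)
        colours.append('#{:02x}{:02x}{:02x}'.format(
            round(r * 255), round(g * 255), round(b * 255)))
    return colours


palette = cluster_palette(5)
print(palette)
```

With this palette the marker colour for a row is simply `palette[cluster]`.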


In [25]:
prague_merged.head()

Unnamed: 0,Neighbourhood,Latitude,Longitude,Borough,TotalInhabitants,AverageAge,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Praha 1,50.08796,14.42122,SO Praha 1,29563.0,43.7,0,Hotel,Café,Italian Restaurant,Restaurant,Pub,Czech Restaurant,Cocktail Bar,Boutique,Coffee Shop,Plaza
1,Praha 2,50.0753,14.44693,SO Praha 2,50363.0,41.3,0,Café,Wine Bar,Pub,Beer Bar,Clothing Store,Cocktail Bar,Vietnamese Restaurant,Coffee Shop,Vegetarian / Vegan Restaurant,Bakery
2,Praha 3,50.08813,14.46823,SO Praha 3,76041.0,41.6,0,Bar,Vietnamese Restaurant,Café,Burger Joint,Czech Restaurant,Plaza,Restaurant,Steakhouse,Gym,Bowling Alley
3,Praha 4,50.0628,14.44091,SO Praha 4,132068.0,44.2,0,Café,Bar,Pizza Place,Restaurant,Pub,Kebab Restaurant,Vietnamese Restaurant,Gastropub,Theater,Beer Store
4,Praha-Kunratice,50.01341,14.4831,SO Praha 4,10023.0,38.2,0,Bus Stop,Czech Restaurant,Athletics & Sports,Pizza Place,Farmers Market,Soccer Field,Fruit & Vegetable Store,Café,Food & Drink Shop,Golf Course


## Analysis: which Neighbourhood is the most interesting for opening a new Café

First of all, let's go through all our clusters and see what their top 10 venues are.

In [26]:
prague_merged.loc[prague_merged['Cluster Labels'] == 0,
                  prague_merged.columns[[0] + list(range(5, prague_merged.shape[1]))]]


Unnamed: 0,Neighbourhood,AverageAge,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Praha 1,43.7,0,Hotel,Café,Italian Restaurant,Restaurant,Pub,Czech Restaurant,Cocktail Bar,Boutique,Coffee Shop,Plaza
1,Praha 2,41.3,0,Café,Wine Bar,Pub,Beer Bar,Clothing Store,Cocktail Bar,Vietnamese Restaurant,Coffee Shop,Vegetarian / Vegan Restaurant,Bakery
2,Praha 3,41.6,0,Bar,Vietnamese Restaurant,Café,Burger Joint,Czech Restaurant,Plaza,Restaurant,Steakhouse,Gym,Bowling Alley
3,Praha 4,44.2,0,Café,Bar,Pizza Place,Restaurant,Pub,Kebab Restaurant,Vietnamese Restaurant,Gastropub,Theater,Beer Store
4,Praha-Kunratice,38.2,0,Bus Stop,Czech Restaurant,Athletics & Sports,Pizza Place,Farmers Market,Soccer Field,Fruit & Vegetable Store,Café,Food & Drink Shop,Golf Course
5,Praha 5,41.0,0,Cosmetics Shop,Hotel,Pub,Clothing Store,Café,Bistro,Multiplex,Coffee Shop,Park,Chocolate Shop
6,Praha-Slivenec,39.4,0,Grocery Store,Czech Restaurant,Italian Restaurant,Reservoir,Soccer Stadium,Bus Stop,Restaurant,Beer Store,Cafeteria,Gastropub
7,Praha 6,42.3,0,Coffee Shop,Café,ATM,Vietnamese Restaurant,Pizza Place,Japanese Restaurant,Italian Restaurant,Hotel,Dessert Shop,Clothing Store
8,Praha-Lysolaje,39.2,0,Music Venue,Bus Stop,Trail,Hot Spring,Gastropub,College Cafeteria,Bathing Area,Café,Organic Grocery,Park
9,Praha-Nebušice,40.5,0,Hotel,Bus Stop,Athletics & Sports,Dog Run,Supermarket,Gastropub,Soccer Field,Grocery Store,Bar,Burger Joint


In [27]:
prague_merged.loc[prague_merged['Cluster Labels'] == 1,
                  prague_merged.columns[[0] + list(range(5, prague_merged.shape[1]))]]


Unnamed: 0,Neighbourhood,AverageAge,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
10,Praha-Přední Kopanina,41.6,1,Restaurant,Plaza,Soccer Field,Bus Stop,ATM,Outdoor Supply Store,Multiplex,Music Store,Music Venue,Nature Preserve
16,Praha-Ďáblice,40.3,1,Bus Stop,Restaurant,Food,Soccer Field,Gym Pool,ATM,Park,Music Store,Music Venue,Nature Preserve
45,Praha-Satalice,40.2,1,Bus Stop,Restaurant,Athletics & Sports,Italian Restaurant,Train Station,Nature Preserve,Music Venue,Music Store,Pedestrian Plaza,New American Restaurant
56,Praha-Nedvězí,42.8,1,Trail,Restaurant,Bus Stop,Cocktail Bar,ATM,Music Store,Music Venue,Nature Preserve,New American Restaurant,Nightclub


In [28]:
prague_merged.loc[prague_merged['Cluster Labels'] == 2,
                  prague_merged.columns[[0] + list(range(5, prague_merged.shape[1]))]]


Unnamed: 0,Neighbourhood,AverageAge,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
17,Praha-Dolní Chabry,38.2,2,Bus Stop,Restaurant,Czech Restaurant,Pharmacy,Supermarket,Bowling Alley,Soccer Field,Café,Flower Shop,Cupcake Shop
20,Praha 11,44.6,2,Bus Stop,Soccer Field,Czech Restaurant,Market,Wine Bar,Food & Drink Shop,Furniture / Home Store,Plaza,Restaurant,Cafeteria
21,Praha-Křeslice,39.4,2,Bus Stop,Pub,Trail,Scenic Lookout,Movie Theater,Multiplex,Music Store,Music Venue,Nature Preserve,New American Restaurant
22,Praha-Šeberov,40.9,2,Bus Stop,Athletics & Sports,Farm,Auto Garage,Convenience Store,Pharmacy,Asian Restaurant,Nature Preserve,New American Restaurant,Music Venue
23,Praha-Újezd,37.7,2,BBQ Joint,Sporting Goods Shop,Gastropub,Bus Stop,Soccer Field,ATM,Outdoor Supply Store,Multiplex,Music Store,Music Venue
32,Praha-Dubeč,39.2,2,Bus Stop,Zoo,Historic Site,Pub,Soccer Field,Tennis Stadium,Grocery Store,Diner,Eastern European Restaurant,History Museum
33,Praha-Petrovice,41.3,2,Bus Stop,Pharmacy,Convenience Store,Juice Bar,Pet Store,Music Store,Music Venue,Nature Preserve,New American Restaurant,Nightclub
36,Praha-Lipence,40.5,2,Grocery Store,Food & Drink Shop,Bus Stop,Gym,ATM,Movie Theater,Music Store,Music Venue,Nature Preserve,New American Restaurant
37,Praha-Lochkov,40.3,2,Czech Restaurant,Soccer Field,Bus Stop,Tunnel,ATM,Park,Multiplex,Music Store,Music Venue,Nature Preserve
38,Praha-Velká Chuchle,40.5,2,Bus Stop,Plaza,Racecourse,Grocery Store,Restaurant,Betting Shop,Multiplex,Music Store,Music Venue,Nature Preserve


In [29]:
prague_merged.loc[prague_merged['Cluster Labels'] == 3,
                  prague_merged.columns[[0] + list(range(5, prague_merged.shape[1]))]]


Unnamed: 0,Neighbourhood,AverageAge,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
41,Praha-Zličín,37.9,3,Furniture / Home Store,Hobby Shop,Go Kart Track,Gymnastics Gym,ATM,Outdoor Supply Store,Movie Theater,Multiplex,Music Store,Music Venue


In [30]:
prague_merged.loc[prague_merged['Cluster Labels'] == 4,
                  prague_merged.columns[[0] + list(range(5, prague_merged.shape[1]))]]


Unnamed: 0,Neighbourhood,AverageAge,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
55,Praha-Královice,40.0,4,Field,Auto Workshop,ATM,Pedestrian Plaza,Multiplex,Music Store,Music Venue,Nature Preserve,New American Restaurant,Nightclub


Cluster 0 is the most interesting one for us, because Café appears most often among its top 10 venues. We should therefore pick a Neighbourhood from cluster 0, ideally one where Café is not yet in the top 10: that suggests there are no well-established Cafés there, so we should be able to enter the market if we do things well.

In [31]:
df_without_cafe = prague_merged[prague_merged.apply(lambda row: not row.astype(
    str).str.contains('Café', case=False).any(), axis=1)]
df_without_cafe.head()


Unnamed: 0,Neighbourhood,Latitude,Longitude,Borough,TotalInhabitants,AverageAge,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
6,Praha-Slivenec,50.02014,14.35525,SO Praha 5,3696.0,39.4,0,Grocery Store,Czech Restaurant,Italian Restaurant,Reservoir,Soccer Stadium,Bus Stop,Restaurant,Beer Store,Cafeteria,Gastropub
9,Praha-Nebušice,50.11139,14.32557,SO Praha 6,3372.0,40.5,0,Hotel,Bus Stop,Athletics & Sports,Dog Run,Supermarket,Gastropub,Soccer Field,Grocery Store,Bar,Burger Joint
10,Praha-Přední Kopanina,50.1175,14.2967,SO Praha 6,693.0,41.6,1,Restaurant,Plaza,Soccer Field,Bus Stop,ATM,Outdoor Supply Store,Multiplex,Music Store,Music Venue,Nature Preserve
11,Praha-Suchdol,50.13316,14.37681,SO Praha 6,7225.0,40.4,0,Hotel,Music Venue,Grocery Store,Dairy Store,Pool,Restaurant,Plaza,Snack Place,Creperie,Gastropub
15,Praha-Březiněves,50.16647,14.48396,SO Praha 8,1754.0,37.2,0,Hotel,Restaurant,Gym / Fitness Center,Pizza Place,Bus Stop,Soccer Field,Pool,Music Venue,Nature Preserve,Outdoor Supply Store
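
The substring filter above only matches the literal category 'Café'. A neighbourhood whose top venues include 'Coffee Shop' still passes it: Praha 9 and Praha 17 do, as the top-10 table further down shows. If such categories should count as competition too, an exact match against a set of café-like categories is one option. The sketch below runs on made-up rows; `CAFE_LIKE` and `without_cafe_like` are illustrative names, and which categories belong in the set is an assumption.

```python
import pandas as pd

# Assumption: categories treated as direct competition for a new Café
CAFE_LIKE = ['Café', 'Coffee Shop']


def without_cafe_like(df, venue_columns, categories=CAFE_LIKE):
    """Keep rows where none of the venue columns holds a café-like category."""
    has_cafe = df[venue_columns].isin(list(categories)).any(axis=1)
    return df[~has_cafe]


# Made-up rows shaped like the merged dataframe
toy = pd.DataFrame({
    'Neighbourhood': ['Praha 9', 'Praha 13', 'Praha 2'],
    '1st Most Common Venue': ['Coffee Shop', 'Bus Stop', 'Café'],
    '2nd Most Common Venue': ['Electronics Store', 'Gym', 'Wine Bar'],
})
venue_cols = [c for c in toy.columns if c.endswith('Most Common Venue')]

result = without_cafe_like(toy, venue_cols)
print(result['Neighbourhood'].tolist())  # ['Praha 13']
```

Restricting the check to the venue columns also avoids accidental matches in other columns such as the neighbourhood name.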


Let's first chart all the neighbourhoods without a Café in any of the top 10 venues.

In [33]:
# create map
map_without_cafe = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for total_inhab, lat, lon, poi, cluster in zip(df_without_cafe['TotalInhabitants'], df_without_cafe['Latitude'], df_without_cafe['Longitude'], df_without_cafe['Neighbourhood'], df_without_cafe['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' +
                         str(cluster) + ', ' + str(total_inhab), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_without_cafe)

map_without_cafe


We mostly see Neighbourhoods outside the City Center, which makes sense because the City Center is the most crowded. We do not want to focus on the City Center anyway, as that would mean very strong competition; let's focus instead on the Neighbourhoods where there is no established Café yet.

Below we can see the areas with the most inhabitants from our non-Café dataframe.

In [34]:
df_top_nb_without_cafe = df_without_cafe.sort_values(by='TotalInhabitants', ascending=False).head(10)
df_top_nb_without_cafe


Unnamed: 0,Neighbourhood,Latitude,Longitude,Borough,TotalInhabitants,AverageAge,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
20,Praha 11,50.03242,14.50757,SO Praha 11,77324.0,44.6,2,Bus Stop,Soccer Field,Czech Restaurant,Market,Wine Bar,Food & Drink Shop,Furniture / Home Store,Plaza,Restaurant,Cafeteria
26,Praha 13,50.04519,14.32161,SO Praha 13,63554.0,40.5,0,Bus Stop,Gym,Fast Food Restaurant,Casino,Pub,Italian Restaurant,Reservoir,Mini Golf,Restaurant,Grocery Store
18,Praha 9,50.1134,14.50146,SO Praha 9,60601.0,41.1,0,Coffee Shop,Electronics Store,Gastropub,Outdoor Supply Store,Indian Restaurant,Clothing Store,Restaurant,Eastern European Restaurant,Drugstore,Dog Run
24,Praha 12,50.00575,14.40525,SO Praha 12,57821.0,43.0,0,Dessert Shop,River,Tram Station,Food Stand,Pub,Movie Theater,Bus Stop,Flower Shop,Auto Workshop,Snack Place
40,Praha 17,50.06432,14.30799,SO Praha 17,24075.0,42.1,0,Supermarket,Gym / Fitness Center,Pharmacy,Bus Stop,Czech Restaurant,Coffee Shop,Ski Shop,Bowling Alley,Gym,Pizza Place
47,Praha 20,50.11261,14.59627,SO Praha 20,15652.0,41.5,0,Restaurant,Bus Stop,Hotel,Supermarket,Art Gallery,Italian Restaurant,Pharmacy,Platform,Czech Restaurant,Indian Restaurant
43,Praha-Čakovice,50.15175,14.52337,SO Praha 18,11868.0,37.7,0,Bus Stop,Vietnamese Restaurant,Pizza Place,Restaurant,Chinese Restaurant,Park,Train Station,Food & Drink Shop,Supermarket,Historic Site
48,Praha 21,50.0764,14.65936,SO Praha 21,10860.0,40.2,2,Bus Stop,Bar,Supermarket,Bakery,Dessert Shop,Soccer Field,Tea Room,Pub,Jewelry Store,Music Store
25,Praha-Libuš,50.00907,14.46199,SO Praha 12,10623.0,39.5,0,Bus Stop,Hotel,Grocery Store,Rental Service,Dessert Shop,Restaurant,Dog Run,Park,Soccer Field,Music Store
39,Praha-Zbraslav,49.97606,14.39343,SO Praha 16,10049.0,41.7,0,Czech Restaurant,Restaurant,Stadium,Castle,Market,Chinese Restaurant,Beer Garden,Church,Pier,Turkish Restaurant


From this list we can pick a Neighbourhood whose other venues look complementary to a coffee shop. The best option seems to be Praha 13: it belongs to cluster 0 and has the most inhabitants among the cluster 0 candidates. Let's see how it looks on the map, and also add the cycling paths (cycling paths often bring a lot of customers to a Café).

In [45]:
# create map
map_without_cafe = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for total_inhab, lat, lon, poi, cluster in zip(df_top_nb_without_cafe['TotalInhabitants'], df_top_nb_without_cafe['Latitude'], 
                                               df_top_nb_without_cafe['Longitude'], df_top_nb_without_cafe['Neighbourhood'], df_top_nb_without_cafe['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' +
                         str(cluster) + ', ' + str(total_inhab), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=10,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_without_cafe)

geojson_cycling_paths_url = 'http://opendata.iprpraha.cz/CUR/DOP/DOP_Cyklogenerel_l/WGS_84/DOP_Cyklogenerel_l.json'
folium.Choropleth(geojson_cycling_paths_url).add_to(map_without_cafe)

map_without_cafe


## Conclusion

If we want to open a Café in a Neighbourhood where none is yet among the top 10 venues according to the Foursquare data, and which also has a high number of inhabitants and some cycling paths, we could select Praha 13. On the map we can also see that Praha 13 has quite a big park next to it (Prokopske Udoli), which should make it an attractive area for a lot of people. Praha 13 has 63,554 inhabitants with an average age of about 40. Its top 10 venues include restaurants as well as public transport venues, which suggests a lot of people pass through the area, and that means a lot of potential customers.

We would therefore recommend looking into opening a new Café in Praha 13, ideally either close to 'Centralni Park', which looks like an interesting place on the map, or close to the public transport venues. A combination of both would be best, as there are Underground stations close to 'Centralni Park' as well. We would focus our efforts there.