Project Milan

A. Introduction

In this project, I analyze the city of Milan in the hope of gaining knowledge that may come in handy when visiting Italy. I am traveling to the city in a few weeks, so I could make good use of a map that shows the city's landmarks and souvenir shops divided into separate clusters, each of which can be visited on foot. If I am not able to draw useful conclusions, then I will try to explore other venues, such as supermarkets or clothing stores.

B. Gathering Data

I will use the data collected by Foursquare for the analysis. I will get the location of Milan, then explore the city's landmarks and souvenir shops. I will also have to clean the data, determine the clusters using the k-means algorithm, and finally visualize the results.

C. Methodology

First, we have to import the packages that we are going to use.

In [5]:
# Install third-party packages (notebook shell commands)
!conda install -c conda-forge geopy --yes
!conda install -c conda-forge folium=0.5.0 --yes
!pip install lxml

import requests  # HTTP requests to the Foursquare API
import numpy as np
import pandas as pd

import html5lib  # HTML parsers that pd.read_html can use
import lxml

from geopy.geocoders import Nominatim  # geocoding of borough names
from sklearn.cluster import KMeans  # clustering

import folium  # map rendering library

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

Solving environment: done


  current version: 4.5.11
  latest version: 4.7.12

Please update conda by running

    $ conda update -n base -c defaults conda



# All requested packages already installed.


After importing the packages, we need a source for the neighborhoods of Milan. We could simply query the Foursquare API for the city as a whole, but because of the limits imposed by its free tier, we can retrieve more venues if we divide the city into smaller parts. I take the table from Wikipedia; the read_html() function did not work on the English version of the page, so I used the French one. I also promote the second row to the header and slice off the first two rows.

In [2]:
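url = "https://fr.wikipedia.org/wiki/Zones_de_Milan"
df = pd.read_html(url)[0]  # first table on the page
df.columns = df.iloc[1]  # the second row holds the column names
df = df.iloc[2:].reset_index()  # drop the two header rows; the old index is kept as a column
df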

1,index,NaN,Dénomination,Superficie (km²),Habitants (31 dicembre 2010),Densité (ab/km²),Quartiers
0,2,Zone 1,Centre historique,967,97.231,10.055,"Centro storico, Brera, Porta Tenaglia, Porta S..."
1,3,Zone 2,"Gare de Milan-Centrale, Gorla, Turro, Greco, C...",1258,144.301,11.471,"Adriano, Crescenzago, Gorla, Greco, Loreto, Ma..."
2,4,Zone 3,"Città Studi, Lambrate, Porta Venezia",1423,139.897,9.831,"Porta Venezia, Porta Monforte, Casoretto, Rott..."
3,5,Zone 4,"Porta Vittoria, Forlanini",2095,152.259,7.268,"Porta Vittoria, Porta Romana, Cavriano, Forlan..."
4,6,Zone 5,"Vigentino, Chiaravalle Milanese, Gratosoglio",2987,119.900,4.014,"Porta Vigentina, Porta Lodovica, San Gottardo,..."
5,7,Zone 6,"Barona, Lorenteggio",1828,146.606,8.02,"Porta Ticinese, Porta Genova, Conchetta, Moncu..."
6,8,Zone 7,"Baggio, De Angeli, San Siro",3134,168.899,5.389,"Porta Magenta, Quartiere De Angeli - Frua, San..."
7,9,Zone 8,"Fiera, Gallaratese, Quarto Oggiaro",2372,179.453,7.565,"Porta Volta, Bullona, Ghisolfa, Portello, Cagn..."
8,10,Zone 9,"Gare de Milan-Porta Garibaldi, Niguarda",2112,174.204,8.248,"Porta Garibaldi, Porta Nuova, Centro Direziona..."
9,11,,Total commune,18176,1.322.750,7.277,


I loop through the table and collect the boroughs in a single list. Each cell of the Dénomination column holds several comma-separated names, so we need to split each cell on the comma.

In [3]:
boroughs = []
for index, row in df.iterrows():
    # column 2 is "Dénomination"; each cell lists several names separated by commas
    boroughs.extend(row.iloc[2].split(","))
boroughs

['Centre historique',
 'Gare de Milan-Centrale',
 ' Gorla',
 ' Turro',
 ' Greco',
 ' Crescenzago',
 'Città Studi',
 ' Lambrate',
 ' Porta Venezia',
 'Porta Vittoria',
 ' Forlanini',
 'Vigentino',
 ' Chiaravalle Milanese',
 ' Gratosoglio',
 'Barona',
 ' Lorenteggio',
 'Baggio',
 ' De Angeli',
 ' San Siro',
 'Fiera',
 ' Gallaratese',
 ' Quarto Oggiaro',
 'Gare de Milan-Porta Garibaldi',
 ' Niguarda',
 'Total commune']
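
The list above still carries the space that follows each comma, as well as the aggregate "Total commune" entry. A minimal sketch of a cleaner split, assuming the cells are plain strings; the helper name `split_boroughs` and its `skip` parameter are hypothetical and not part of the original notebook:

```python
def split_boroughs(cells, skip=("Total commune",)):
    """Split comma-separated cells into clean borough names.

    `cells` is any iterable of cell values; non-string cells (e.g. NaN)
    are ignored. `skip` lists aggregate labels that are not boroughs.
    """
    names = []
    for cell in cells:
        if not isinstance(cell, str):
            continue
        for part in cell.split(","):
            name = part.strip()  # remove the space left after each comma
            if name and name not in skip:
                names.append(name)
    return names


sample = ["Centre historique", "Barona, Lorenteggio", "Total commune"]
print(split_boroughs(sample))  # ['Centre historique', 'Barona', 'Lorenteggio']
```

With a variant like this, the aggregate row is filtered out by name, so it would not have to be dropped by position afterwards.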

I create a new DataFrame from the extracted list of boroughs and add two empty columns to hold the coordinates. I also remove the last row, which is the aggregate "Total commune" entry rather than a borough.

In [4]:
df = pd.DataFrame({"Borough": boroughs})
df["Latitude"] = ""
df["Longitude"] = ""
df = df.drop(24)
df

Unnamed: 0,Borough,Latitude,Longitude
0,Centre historique,,
1,Gare de Milan-Centrale,,
2,Gorla,,
3,Turro,,
4,Greco,,
5,Crescenzago,,
6,Città Studi,,
7,Lambrate,,
8,Porta Venezia,,
9,Porta Vittoria,,


We create a geolocator object to look up each borough's coordinates. Geocoding requests occasionally fail, so we retry any query that returns no result.

In [6]:
geolocator = Nominatim(user_agent="capstone")
for index, row in df.iterrows():
    location = None
    # geocode() returns None when nothing is found, so retry until it succeeds
    while location is None:
        location = geolocator.geocode("{}, Milan".format(row["Borough"]))
    # assign through .loc to avoid chained-assignment problems
    df.loc[index, "Latitude"] = location.latitude
    df.loc[index, "Longitude"] = location.longitude
df

Unnamed: 0,Borough,Latitude,Longitude
0,Centre historique,43.6118,3.87338
1,Gare de Milan-Centrale,45.4842,9.19881
2,Gorla,45.5049,9.22454
3,Turro,45.4975,9.2259
4,Greco,45.5022,9.21123
5,Crescenzago,45.5092,9.24748
6,Città Studi,45.4771,9.22657
7,Lambrate,45.4831,9.242
8,Porta Venezia,45.4745,9.2048
9,Porta Vittoria,45.4623,9.20958
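
The retry loop above never gives up, and the table shows that a successful lookup is not necessarily a correct one: the first row lies far from the others. A minimal sketch of a bounded retry with a plausibility check follows. The function names, the backoff schedule, and the bounding-box limits are all assumptions made for illustration; `geocode` stands for any callable such as `geolocator.geocode`.

```python
import time


def geocode_with_retry(geocode, query, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call `geocode(query)` until it returns a result or attempts run out.

    `geocode` is any callable returning an object or None. Exceptions are
    treated like failed attempts. Returns None if every attempt fails.
    """
    for attempt in range(max_attempts):
        try:
            location = geocode(query)
        except Exception:
            location = None
        if location is not None:
            return location
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)  # wait 1 s, 2 s, 4 s, ...
    return None


# Rough bounding box around Milan; the limits are an assumption.
MILAN_BOX = {"lat": (45.35, 45.60), "lon": (9.00, 9.35)}


def in_milan(lat, lon, box=MILAN_BOX):
    """Return True if the point falls inside the rough bounding box."""
    return box["lat"][0] <= lat <= box["lat"][1] and box["lon"][0] <= lon <= box["lon"][1]


print(in_milan(45.4842, 9.19881))  # Gare de Milan-Centrale row: True
print(in_milan(43.6118, 3.87338))  # Centre historique row: False
```

Rows that fail the check could be re-queried with a different spelling or left out before the venue search.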


We prepare to use the Foursquare API by defining the input parameters.

In [7]:
CLIENT_ID = 'BR55JVHMKUKWRDFHIFZVKNWASTPB1TVW1SV1OLC2GA3DWEG5'
CLIENT_SECRET = 'AD1ATAUTPXGQQRAUC4IMMBBQB4EOZM3Q5HWVWQZBKR3AFP2H'
VERSION = '20191030'
LIMIT = 200
radius = 1200
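
The parameters above embed the API credentials directly in the notebook. One possible alternative, sketched here, is to read them from environment variables so that they do not end up in a shared copy. The variable names FOURSQUARE_CLIENT_ID and FOURSQUARE_CLIENT_SECRET and the helper `load_credentials` are hypothetical.

```python
import os


def load_credentials(env=os.environ):
    """Read Foursquare credentials from environment variables.

    The variable names used here are hypothetical; pick whatever names
    suit your setup. Raises RuntimeError if one of them is missing.
    """
    try:
        return env["FOURSQUARE_CLIENT_ID"], env["FOURSQUARE_CLIENT_SECRET"]
    except KeyError as missing:
        raise RuntimeError("Missing environment variable: {}".format(missing)) from None
```

It would be used as `CLIENT_ID, CLIENT_SECRET = load_credentials()` once both variables are set in the shell that starts the notebook.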

We use the explore endpoint of the API and iterate over the rows of df. The resulting table lists each venue with its related information, including its location.

In [8]:
venues_list = []
for index, row in df.iterrows():
    lat = row["Latitude"]
    lng = row["Longitude"]
    url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, VERSION, lat, lng, radius, LIMIT)

    results = requests.get(url).json()["response"]["groups"][0]['items']
    # one tuple per venue, tagged with the coordinates of the queried borough
    venues_list.extend([(lat, lng, v['venue']['name'], v['venue']['location']['lat'], v['venue']['location']['lng'], v['venue']['categories'][0]['name']) for v in results])

# build the DataFrame once, after all boroughs have been queried
nearby_venues = pd.DataFrame(venues_list, columns=[
              'Neighborhood Latitude',
              'Neighborhood Longitude',
              'Venue',
              'Venue Latitude',
              'Venue Longitude',
              'Venue Category'])
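
The loop above indexes straight into the response, so a venue without a category or a response without groups would raise an exception. A minimal sketch of a more defensive parser, assuming only the response shape that the loop itself relies on; the helper name `extract_venues` is hypothetical and the sample payload is hand-made, not real API output:

```python
def extract_venues(payload, lat, lng):
    """Flatten one explore response into rows of venue data.

    `payload` is the decoded JSON dictionary. Missing keys yield an
    empty list, and a venue without categories gets category None.
    """
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        category = categories[0]["name"] if categories else None
        rows.append((lat, lng, venue["name"],
                     venue["location"]["lat"], venue["location"]["lng"],
                     category))
    return rows


# Hand-made payload in the shape the loop above indexes into.
sample = {"response": {"groups": [{"items": [
    {"venue": {"name": "Example Café",
               "location": {"lat": 45.48, "lng": 9.20},
               "categories": [{"name": "Café"}]}},
    {"venue": {"name": "Unlabelled Place",
               "location": {"lat": 45.49, "lng": 9.21},
               "categories": []}},
]}]}}
print(extract_venues(sample, 45.4842, 9.19881))
```

Rows with a category of None simply drop out when the table is later filtered by category.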

We filter the dataset by category. I selected the categories that interest me the most. (Most of the results were Italian restaurants :))

In [18]:
# categories of interest
selected_categories = [
    "Historic Site", "Monument / Landmark", "Public Art", "Museum",
    "Italian Restaurant", "Castle", "Palace", "Plaza", "Coffee Shop", "Café",
]
# .copy() gives an independent DataFrame, so adding columns later is safe
venues = nearby_venues[nearby_venues["Venue Category"].isin(selected_categories)].copy()
venues

Unnamed: 0,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
1,43.611762,3.873377,Arc de Triomphe,43.611129,3.872363,Historic Site
2,43.611762,3.873377,Coffee Club,43.610002,3.876010,Coffee Shop
3,43.611762,3.873377,Place de la Canourgue,43.611623,3.874397,Plaza
8,43.611762,3.873377,Latitude Café,43.611740,3.874492,Café
9,43.611762,3.873377,La Panacée,43.612741,3.878277,Public Art
...,...,...,...,...,...,...
1577,45.516974,9.192488,Premiata Trattoria Arlati dal 1936,45.516197,9.207754,Italian Restaurant
1581,45.516974,9.192488,MIC - Museo Interattivo del Cinema,45.513470,9.203987,Museum
1599,45.516974,9.192488,Caffé Gelateria Delrosso,45.515787,9.194142,Café
1600,45.516974,9.192488,Pasticceria Ornato,45.517070,9.191052,Café


I now run the k-means algorithm. I wanted to break the city down into clusters of venues that are reachable on foot, so I set k to 20.

In [19]:
k = 20
venues_clusters = venues[["Venue Latitude", "Venue Longitude"]]
kmeans = KMeans(n_clusters=k, random_state=0).fit(venues_clusters)
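
Whether k = 20 really yields walkable clusters can be checked by measuring how far each venue lies from the middle of its cluster. A minimal sketch using the haversine formula follows. The function names are hypothetical, and the cluster center is approximated by the plain mean of the coordinates, which is reasonable only for clusters that span a small area.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points in kilometres."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def cluster_radii_km(points, labels):
    """Return {label: largest distance from the cluster's mean point, in km}."""
    groups = {}
    for (lat, lon), label in zip(points, labels):
        groups.setdefault(label, []).append((lat, lon))
    radii = {}
    for label, members in groups.items():
        c_lat = sum(p[0] for p in members) / len(members)
        c_lon = sum(p[1] for p in members) / len(members)
        radii[label] = max(haversine_km(c_lat, c_lon, lat, lon) for lat, lon in members)
    return radii
```

Called as `cluster_radii_km(zip(venues["Venue Latitude"], venues["Venue Longitude"]), kmeans.labels_)`, it gives one radius per cluster. Any cluster whose radius exceeds a comfortable walking distance (the threshold is a personal choice) suggests that k should be raised or that the cluster contains an outlier.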

I insert the predicted labels into the original DataFrame.

In [20]:
venues.insert(0, 'Cluster Labels', kmeans.labels_)

D. Results

I use the visualization method learnt earlier in the course to display the cluster members. The coordinates of the city center had to be looked up on the web.

The map shows the result of the analysis: the clusters formed from the locations of the relevant venues.

In [21]:
map_clusters = folium.Map(location=[45.464203, 9.189982], zoom_start=12)

# set color scheme for the clusters: one rainbow color per cluster
colors_array = cm.rainbow(np.linspace(0, 1, k))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
for lat, lon, cluster in zip(venues['Venue Latitude'], venues['Venue Longitude'], venues['Cluster Labels']):
    label = folium.Popup('Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)

map_clusters

E. Discussion

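Because I used the free tier of the API, the results were heavily limited, which may explain the "dead spaces" in the downtown area. I took only the spatial location of each venue into account and did not cluster on any other features. Note also that the query "Centre historique, Milan" resolved to coordinates near 43.61, 3.87, which lie in southern France (around Montpellier) rather than in Milan. The venues returned for that row, such as Arc de Triomphe and Place de la Canourgue, therefore do not belong to the city, and the missing historic center likely contributes to the gap downtown.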

The result could now be put to use by, for example, filtering the table by the predicted labels and then choosing which venues to visit based on feedback from other users.
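
Filtering by label can be sketched as a simple grouping step. The helper name `venues_by_cluster` is hypothetical; the venue names and categories are taken from the table in the Methodology section, while the cluster labels in the sample are made up for illustration.

```python
from collections import defaultdict


def venues_by_cluster(rows):
    """Group (cluster_label, venue_name, category) tuples by cluster label."""
    grouped = defaultdict(list)
    for label, name, category in rows:
        grouped[label].append((name, category))
    return dict(grouped)


# Real venue names from the results table; the labels are invented.
rows = [
    (3, "MIC - Museo Interattivo del Cinema", "Museum"),
    (3, "Pasticceria Ornato", "Café"),
    (7, "Premiata Trattoria Arlati dal 1936", "Italian Restaurant"),
]
for label, members in sorted(venues_by_cluster(rows).items()):
    print(label, [name for name, _ in members])
```

Each group then serves as a candidate walking tour, to be narrowed down by whatever rating information is available.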

F. Conclusion

I used many parts of what I have learnt throughout the Data Science courses. I managed to reach my goal: I can now identify the landmarks, museums, cafés, and restaurants that are close to each other.