# Battle of Neighborhoods
## has started

#### Episode II

####  

## Problem statement

Problem statement
Consider you are living in a nice Toronto city (for example in Old City) next to Ontario lake, but you have to relocate to Berlin (where I actually live), because you want to enlarge a footprint of your fast growing startup, that has just raised some millions through series “C” funding round. You might also have a family and they surely will go with you, so you are sort of picky regarding new neighborhood. 
Of course, you want that in Berlin you settle down in the same nice and pretty area, have all you need to feel happy and everything you have got used too. You want to get some Berlin insights upfront, but you cannot waste your time surfing the web in search of right option, because you are obviously very busy with marketing and growth strategy. So finally you would like to know which neighborhoods (called Ortsteile) will full fill your needs the most. 

### Generalization

Relocation outside your home city is one of the most pressing questions, that you need to answer once you decided to leave your domicile for whatever reason. The majority of people would like to have comparable life quality, surroundings and environment in the new place. But it is not really straightforward to figure out which area is the best match, if you are not familiar with a new city. 
The objective of this project is to utilize Foursquare location data, leveraged by clustering of venues to determine what might be the ‘best’ neighborhood in Berlin to relocate. 


In [10]:
#!conda install -c conda-forge geopy --yes 
#print('Geopy installed')

#!pip install geocoder
#print('geocoder installed')

#!conda install -c conda-forge folium=0.5.0 --yes
#print('Folium installed')

#!conda install beautifulsoup4
#print('beautifulsoup4 installed')

#!pip install pandas

#!pip install matplotlib

#!pip install geopy
#!pip install sklearn
#!pip install folium
#!pip install BeautifulSoup4
#!pip install foursquare
#!pip install simplejson

#!pip install --upgrade pandas

print('Libraries imported.')

Libraries imported.


In [11]:
import requests # library to handle requests
import pandas as pd # library for data analsysis
import numpy as np # library to handle data in a vectorized manner
import random # library for random number generation
import json
import matplotlib.cm as cm
import matplotlib.colors as colors
from geopy.geocoders import Nominatim # module to convert an address into latitude and longitude values
import math

# libraries for displaying images
from IPython.display import Image 
from IPython.core.display import HTML 
    
# tranforming json file into a pandas dataframe library
from pandas.io.json import json_normalize

import geocoder # import geocoder

import folium # plotting library

#scraping webpages
from bs4 import BeautifulSoup

#Clustering 
from sklearn.cluster import KMeans

from sklearn.metrics import pairwise

#from sklearn.decomposition import PCA

import foursquare

from scipy.stats import norm

import matplotlib.pyplot as plt

%matplotlib inline

# Data Section

To begin with, let’s retrieve the data about neighborhoods and boroughs in both capitals from Wiki: 

* List of Toronto neighborhoods vs Boroughs is pulled out from here: https://en.wikipedia.org/wiki/List_of_city-designated_neighbourhoods_in_Toronto

For administrative purposes, the City of Toronto is divided into 140 neighborhoods. These divisions are used for internal planning purposes. The boundaries and names often do not conform to the usage of the general population or designated business improvement areas. 

* List of Berlin neighborhoods vs Boroughs is pulled out form here: https://de.wikipedia.org/wiki/Verwaltungsgliederung_Berlins
In January 1, 2001 an administrative reform divided Berlin into 12 boroughs (Bezirke), which function as administrative districts according to the principles of self-government. The boroughs are divided into 96 neighborhoods (Ortsteile).

## Scraping Webpages
To convert results of the web request into human format, I used BeautifulSoup library, that allows to parse HTML webpages, so that finally data can be packed into a data frame. 
Parsed data frames were also uploaded under names Toronto_data.csv and Berlin_data.csv. 

## Adding Location Data
Before we start exploring neighborhoods, lets locate the coordinates of neighborhoods, using Geopy Nominatim geocoder for OpenStreetMap data. 


### Helper functions

Lets create few helper funcitons, so that our code looks nicer and is more readable in general.

#### Function that maps given address to coordinates using Nominatim of Geopy

In [12]:
def add_coors(address):
    geolocator = Nominatim(user_agent="ny_explorer")
    location = geolocator.geocode(address)
    if not location:
        return 1
    latitude = location.latitude
    longitude = location.longitude
    return latitude, longitude

#### Function that maps all neighborhoods in the data frame to coordinates

In [13]:
def add_coords_to_df(df_in, prefix):

    df=df_in[:]
    nei_len=len(df['Neighborhood'])
    for i in range(nei_len):
        #print(df.loc[i,'Neighborhood'])
        try:
            #print(i)
            adr=(df.loc[i,'Neighborhood']).strip()+', ' +prefix
        #except:
           # adr=(df.loc[i,'Borough']).strip()+', ' +prefix
           # print(adr)
            coords=add_coors(adr)
            if coords==1:
                   # str=(df.loc[i,'Neighborhood']).split(',')
                    adr=(df.loc[i,'Neighborhood']).split(',')[0].strip()+', ' +prefix
               # print(adr)
                    coords=add_coors(adr)
                    if coords==1:
                       # str=adr.split('and')
                        adr=(df.loc[i,'Neighborhood']).split(' and ')[0].strip()+', ' +prefix
                    #print(adr)
                        coords=add_coors(adr)
                        if coords==1:
                            coords=add_coors((df.loc[i,'Borough']).strip()+', ' +prefix)

            df.loc[i,'Latitude']=coords[0]
            df.loc[i,'Longitude']=coords[1]
        except:
            adr=(df.loc[i,'Borough']).strip()+', ' +prefix
            
            coords=add_coors(adr)
           # print(adr, coords)
            df.loc[i,'Latitude']=coords[0]
            df.loc[i,'Longitude']=coords[1]
    
    return df

#### Loader of Toronto data
Function that loads neighborhoods and boroughs data from Toronto wiki page, parses it and then wraps into the dataframe

In [14]:
#Retreive Toronto Neighborhoods and Boroughs from wiki page
def load_Toronto_data():
    toronto_data = requests.get('https://en.wikipedia.org/wiki/List_of_city-designated_neighbourhoods_in_Toronto').text   
    soup = BeautifulSoup(toronto_data, 'html.parser')
    data = []
    #print(soup)
    table=soup.find(attrs={"class": "wikitable sortable"})
    #print(table)
    table_body = table.find('tbody')
    rows = table_body.find_all('tr')

    for row in rows:
       # print(row)
        cols = row.find_all('td')
        cols = [c.text.strip() for c in cols]
        if cols:
                data.append([c for c in cols if c])
    tz=pd.DataFrame(data)

    tz.columns = ['Latitude','Longitude','Borough','Neighborhood']
    tz=tz[['Borough','Neighborhood','Latitude','Longitude']]
    tz.loc[:,'Latitude']=0.0
    tz.loc[:,'Longitude']=0.0
    return tz

#### Loader of Belrin data
Function that loads neighborhoods and boroughs data from Berlin wiki page, parses it and then wraps into the dataframe

In [15]:
# Retreive Berlin Neighborhoods and Boroughs from wiki page
def load_Berlin_data():
    berlin_data = requests.get('https://de.wikipedia.org/wiki/Verwaltungsgliederung_Berlins').text   
    soup = BeautifulSoup(berlin_data, 'html.parser')

    data = []
    #table = soup.find('table')
    tables = soup.findAll("table") 
    table_body = tables[2].find('tbody')
    rows = table_body.find_all('tr')

    for row in rows:
        cols = row.find_all('td')
        cols = [c.text.strip() for c in cols]
        if cols:
            data.append([c for c in cols if c])

    br=pd.DataFrame(data)

    br=br.drop([br.columns.values[0],br.columns.values[5]], axis=1)
    br.columns = ['Neighborhood','Borough','Latitude','Longitude']
    br.loc[:,'Latitude']=0.0
    br.loc[:,'Longitude']=0.0

    return br

#### Function that plots data points on the map 

In [16]:
# function that plots data points on the map for a given zoom and in the desired color
# using latitude and longitude values of a given adress
def plot_markers(neighborhoods, adress,map_n,c):

# add markers to map
    for lat, lng, borough, neighborhood in zip(neighborhoods['Latitude'], neighborhoods['Longitude'], neighborhoods['Borough'], neighborhoods['Neighborhood']):
        label = '{}, {}'.format(neighborhood, borough)
        label = folium.Popup(label, parse_html=True)
        folium.CircleMarker(
            [lat, lng],
            radius=5,
            popup=label,
            color=c,
            fill=True,
            fill_color=c,
            fill_opacity=0.7,
            parse_html=False).add_to(map_n)    
    return map_n

Load Toronto hoods and boroughs, clean them up and save down the results into csv format

In [17]:
# load Toronto hoods 
try:
    tz=pd.read_csv("Toronto_hoods.csv")
    print("Hoods will loaded from file")
except:
    print("Hoods will be loaded from wiki and saved in csv format")
    tz=load_Toronto_data()

    # remove None values
    print(tz.shape)
    tz.dropna(subset=['Borough'], inplace=True)
    print(tz.shape)
    tz.dropna(subset=['Neighborhood'], inplace=True, how='all')
    print(tz.shape)
    tz.to_csv("Toronto_hoods.csv",index=False)

# add coordinates 
tz=add_coords_to_df(tz[:].reset_index(drop=True),'Toronto')

print(tz.head(3))

Hoods are loaded from file
       Borough            Neighborhood   Latitude  Longitude
0  Scarborough  Agincourt and Brimwood  43.785353 -79.278549
1  Scarborough   Agincourt and Malvern  43.781969 -79.257689
2    Etobicoke               Alderwood  43.601717 -79.545232


Load Berlin hoods and boroughs, clean them up and save down the results into csv format

In [18]:
# load Belrin hoods 

try:
    br=pd.read_csv("Berlin_hoods.csv")
    print("Hoods will be loaded from file")
except:
    print("Hoods will be loaded from wiki and saved in csv format")
    br=load_Berlin_data()
    
    # remove None values
    print(br.shape)
    br.dropna(subset=['Borough'], inplace=True)
    print(br.shape)
    br.dropna(subset=['Neighborhood'], inplace=True, how='all')
    print(br.shape)
    br.to_csv("Berlin_hoods.csv",index=False)

# add coordinates 
br=add_coords_to_df(br[:],'Berlin')

print(br.head(3))

Hoods are be loaded from file
   Neighborhood Borough   Latitude  Longitude
0         Mitte   Mitte  52.517690  13.402376
1        Moabit   Mitte  52.530102  13.342542
2  Hansaviertel   Mitte  52.519123  13.341872


### Lets finally plot neighborhoods on the map

Let's plot neibourhoods of Totonto

In [19]:
adress='Toronto'
zoom=10
c='blue'
latitude=add_coors(adress)[0]
longitude=add_coors(adress)[1]
map_n = folium.Map(location=[latitude, longitude], zoom_start=zoom)
map_n=plot_markers(tz[:], adress,map_n,c)
map_n

Let's plot neibourhoods of Berlin

In [20]:
#plot neibourhoods in Berlin
adress='Berlin'
c='red'
latitude=add_coors(adress)[0]
longitude=add_coors(adress)[1]
map_n = folium.Map(location=[latitude, longitude], zoom_start=zoom)
map_n=plot_markers(br[:], adress,map_n,c)
map_n

## K-Means Clustering

Now lwe are ready to cluster all hoods in both capitals.

### Let's find the best number of clusters
We will try out different number of clusters, calculate total SSE to conclude, which split is the best.

In [21]:
#funciton that calculates SSE
def calc_SEE(neighborhoods,kclusters):

    SSE=np.arange(kclusters)

    for k in np.arange(kclusters):

        # run k-means clustering
        
        neighborhoods_cl=neighborhoods[:]
        
        #initialize with k-means++ in a smart way to speed up convergence
        kmeans = KMeans(init="k-means++", n_clusters=k+1, n_init=20).fit(neighborhoods_cl)

        #lets add cluster labels to Toronto neighborhoods
        neighborhoods_cl.insert(0, 'Cluster Labels', kmeans.labels_)

        #lets calculate SSE = sum of sqares of disrances between data points in a cluster and cluster centroid
        d = {'Cluster Labels':np.arange(k+1),'c_la': kmeans.cluster_centers_[:,0], 'c_lo': kmeans.cluster_centers_[:,0]}
        cl_centroids = pd.DataFrame(data=d)
        neighborhoods_cl=neighborhoods_cl.merge(cl_centroids, on=['Cluster Labels'], how='left')
        
        #calculate SSE for each point 
        neighborhoods_cl['dist']=(neighborhoods_cl['Latitude']-neighborhoods_cl['c_la'])**2 + (neighborhoods_cl['Longitude']-neighborhoods_cl['c_lo'])**2
        
        #calculate overall SSE
        sse=neighborhoods_cl['dist'].sum()

        SSE[k]=sse

    #return SSE vectore 
    return [SSE]

#### This function clusters data for a given number of clusters 

In [22]:
def cluster_data(neighborhoods,k):
    
   #clustering = neighborhoods[['Latitude', 'Longitude']]
    neighborhoods_cl = neighborhoods[:]

    # run k-means clustering

    #initialize with k-means++ in a smart way to speed up convergence
    kmeans = KMeans(init="k-means++", n_clusters=k, n_init=20).fit(neighborhoods_cl[['Latitude', 'Longitude']])
    #neighborhoods_cl=neighborhoods[:]

    #lets add cluster labels to Toronto neighborhoods
    neighborhoods_cl.insert(0, 'Cluster Labels', kmeans.labels_)

    #lets calculate SSE = sum of sqares of disrances between data points in a cluster and cluster centroid
    d = {'Cluster Labels':np.arange(k),'c_la': kmeans.cluster_centers_[:,0], 'c_lo': kmeans.cluster_centers_[:,0]}
    cl_centroids = pd.DataFrame(data=d)
    neighborhoods_cl=neighborhoods_cl.merge(cl_centroids, on=['Cluster Labels'], how='left')

    #return clustered data
    return neighborhoods_cl

In [23]:
kclusters_t = len(tz['Borough'].unique())
SSE_t=calc_SEE(tz[['Latitude', 'Longitude']],kclusters_t)

kclusters_b = len(br['Borough'].unique())
SSE_b=calc_SEE(br[['Latitude', 'Longitude']],kclusters_b)

Looks like that our data set is not sensitive to changes of 'k' at all. Let us pick the number of clusters equal to the number of boroughs

In [24]:
if len(min(SSE_t))>1:
    print ('SSE for Toronto data is the same for all k, we will take k equal to the number of Boroughs, that is ', kclusters_t)
    k_t=kclusters_t
else:
    print('Min SSE for Toronto data is ', min(SSE_t), ' the best k is', pd.DataFrame(SSE_t).idxmin())
    k_t=pd.DataFrame(SSE_t).idxmin()+1
if len(min(SSE_b))>1:
    print ('SSE for Berlin data is the same for all k, we will take k equal to the number of Boroughs, that is ', kclusters_b+1)
    k_b=kclusters_b
else:
    print('Min SSE for Belrin data is ', min(SSE_b), ' the best k is', pd.DataFrame(SSE_b).idxmin()+1)
    k_b=pd.DataFrame(SSE_b).idxmin()+1

SSE for Toronto data is the same for all k, we will take k equal to the number of Boroughs, that is  6
SSE for Berlin data is the same for all k, we will take k equal to the number of Boroughs, that is  13


### Lets create final pretty map with clusterized hoods

In [25]:
# function that plots clustered data point on the map for a given zoom
# using latitude and longitude values of a given adress
def plot_clusters(adress,neighborhoods_cl, kclusters):
    
    # set color scheme for the clusters
    x = np.arange(kclusters)
    ys = [i + x + (i*x)**2 for i in range(kclusters)]
    
    #generate ramdon filling
    colors_array =cm.rainbow(np.random.rand(kclusters))
    rainbow = [colors.rgb2hex(i) for i in colors_array]
    
    #generate ramdon contor
    colors_array_f =cm.rainbow(np.random.rand(kclusters))
    rainbow_f = [colors.rgb2hex(i) for i in colors_array_f]
    
    # add markers to the map
    markers_colors = []
    for lat, lon, poi, cluster in zip(neighborhoods_cl['Latitude'], neighborhoods_cl['Longitude'], neighborhoods_cl['Neighborhood'], neighborhoods_cl['Cluster Labels']):
        label = folium.Popup(str(poi).format("UTF-8") + ' Cluster ' + str(cluster+1), parse_html=True)
        folium.CircleMarker(
            [lat, lon],
            radius=5,
            popup=label,
            color=rainbow[cluster-1],
            fill=True,
            fill_color=rainbow_f[cluster-1],
            fill_opacity=0.7).add_to(map_clusters)
    return map_clusters

### Clusered hoods of Toronto

In [26]:
#cluster anf plot data for the the best k for Toronto 
tz_cl=cluster_data(tz,k_t)
adress='Toronto'
zoom=10
latitude=add_coors(adress)[0]
longitude=add_coors(adress)[1]
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=zoom)

map_clusters=plot_clusters(adress,tz_cl, k_t)
map_clusters

### Clusered hoods of Berlin

In [27]:
#cluster anf plot data for the the best k for Berlin
br_cl=cluster_data(br,k_b)
adress='Berlin'
latitude=add_coors(adress)[0]
longitude=add_coors(adress)[1]
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=zoom)
map_clusters=plot_clusters(adress,br_cl, k_b)
map_clusters