# Battle of Neighborhoods (Week 1)

## 1. Introduction 

Munich is one of the most important centers of commerce and industry in Germany and ranks among the top in Europe. Partly for that reason, the population of Munich is expected to grow by more than 10% within the next 20 years [(source)](https://www.br.de/nachrichten/bayern/bevoelkerungsprognose-region-muenchen-wird-voller-und-juenger,RKadnKj).

![Munich](https://upload.wikimedia.org/wikipedia/commons/c/cd/Frauenkirche_in_M%C3%BCnchen.jpg "City of Munich")
(By Reinald Kirchner - originally posted to Flickr as Frauenkirche in München, [CC BY-SA 2.0](https://commons.wikimedia.org/w/index.php?curid=7882323))


This project looks into the prospect of launching an Italian restaurant in the major business districts of Munich. The primary audience is business people seeking to open a restaurant. The results of the analysis also provide hints on where to find interesting restaurants in Munich.


The outline of the project is as follows:
1. Apply web scraping to get the names and basic information about Munich's districts
2. Select the districts to focus on
3. Use Foursquare data to obtain information about restaurants
    * Perform basic visualization and statistical analysis on the data
    * Apply k-means clustering on the data
4. Visualize the results using a choropleth map

## 2. Data

### Install and import necessary packages

In [1]:
!conda install --insecure -c conda-forge folium=0.5.0 --yes

Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done

# All requested packages already installed.



In [2]:
import pandas as pd
import numpy as np
import requests
import geopy
from bs4 import BeautifulSoup

### Scraping Wikipedia for names and information on districts

In [3]:
# source page
wiki_page = requests.get("https://de.wikipedia.org/w/index.php?title=Stadtbezirke_M%C3%BCnchens").text
soup = BeautifulSoup(wiki_page,'lxml')

# grab desired table type
table = soup.find('table',{'class':'wikitable sortable'})

Read data into dataframe

In [14]:
df = pd.read_html(str(table))

df = df[0]
df

Unnamed: 0,Nr.,Stadtbezirk,Fläche(km²),Einwohner,Dichte(Einw./km²),Ausländer(%)
0,1.0,Altstadt-Lehel,315,21.100,6.708,261
1,2.0,Ludwigsvorstadt-Isarvorstadt,440,51.644,11.734,284
2,3.0,Maxvorstadt,430,51.402,11.96,254
3,4.0,Schwabing-West,436,68.527,15.706,227
4,5.0,Au-Haidhausen,422,61.356,14.541,235
5,6.0,Sendling,394,40.983,10.405,269
6,7.0,Sendling-Westpark,781,59.643,7.632,289
7,8.0,Schwanthalerhöhe,207,29.743,14.367,335
8,9.0,Neuhausen-Nymphenburg,1291,98.814,7.651,243
9,10.0,Moosach,1109,54.223,4.888,315
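
A note on the numbers above: they look as if German-formatted values were parsed with English conventions. In German notation `,` is the decimal separator and `.` the thousands separator, so an area of 315 and a population of 21.1 for Altstadt-Lehel are more plausibly 3.15 km² and 21,100 inhabitants (21,100 / 3.15 ≈ 6,698 per km², which is close to the density column if 6.708 is read as 6,708). `pd.read_html` has `decimal` and `thousands` parameters that could be set accordingly. The conversion itself can be sketched in plain Python as below; the helper name `parse_german_number` and the raw cell strings `"3,15"` and `"21.100"` are assumptions for illustration, not values taken from the scraped page.

```python
def parse_german_number(text):
    """Convert a German-formatted number string to a float.

    Assumes '.' is the thousands separator and ',' the decimal separator.
    """
    cleaned = text.strip().replace(".", "").replace(",", ".")
    return float(cleaned)


# Hypothetical raw cell contents for Altstadt-Lehel
area_km2 = parse_german_number("3,15")
population = parse_german_number("21.100")
print(area_km2, population)
```

The tables further down keep the values exactly as the notebook produced them.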


Drop the summation (last) row and the last two columns

In [15]:
df.drop(df.tail(1).index,inplace=True)
df.drop(['Dichte(Einw./km²)', 'Ausländer(%)'], axis=1,inplace=True)
df

Unnamed: 0,Nr.,Stadtbezirk,Fläche(km²),Einwohner
0,1.0,Altstadt-Lehel,315,21.1
1,2.0,Ludwigsvorstadt-Isarvorstadt,440,51.644
2,3.0,Maxvorstadt,430,51.402
3,4.0,Schwabing-West,436,68.527
4,5.0,Au-Haidhausen,422,61.356
5,6.0,Sendling,394,40.983
6,7.0,Sendling-Westpark,781,59.643
7,8.0,Schwanthalerhöhe,207,29.743
8,9.0,Neuhausen-Nymphenburg,1291,98.814
9,10.0,Moosach,1109,54.223


Recast district number as integer and translate column names

In [16]:
df["Nr."]= df["Nr."].astype(int)

# translation
df.columns = ['No.', 'district', 'area', 'population']
df

Unnamed: 0,No.,district,area,population
0,1,Altstadt-Lehel,315,21.1
1,2,Ludwigsvorstadt-Isarvorstadt,440,51.644
2,3,Maxvorstadt,430,51.402
3,4,Schwabing-West,436,68.527
4,5,Au-Haidhausen,422,61.356
5,6,Sendling,394,40.983
6,7,Sendling-Westpark,781,59.643
7,8,Schwanthalerhöhe,207,29.743
8,9,Neuhausen-Nymphenburg,1291,98.814
9,10,Moosach,1109,54.223


### Get coordinates

Simply passing the district name to the geocoder may return ambiguous results. For an unambiguous match, the query needs a more specific form, e.g. 'Stadtbezirk 01 Altstadt-Lehel', where 'Stadtbezirk' is German for district, '01' is the zero-padded district number, and 'Altstadt-Lehel' is the corresponding district name.

In [17]:
df['Search term'] = 'Stadtbezirk ' + df['No.'].astype(str).str.zfill(2) + ' ' + df['district'].astype(str)
df

Unnamed: 0,No.,district,area,population,Search term
0,1,Altstadt-Lehel,315,21.1,Stadtbezirk 01 Altstadt-Lehel
1,2,Ludwigsvorstadt-Isarvorstadt,440,51.644,Stadtbezirk 02 Ludwigsvorstadt-Isarvorstadt
2,3,Maxvorstadt,430,51.402,Stadtbezirk 03 Maxvorstadt
3,4,Schwabing-West,436,68.527,Stadtbezirk 04 Schwabing-West
4,5,Au-Haidhausen,422,61.356,Stadtbezirk 05 Au-Haidhausen
5,6,Sendling,394,40.983,Stadtbezirk 06 Sendling
6,7,Sendling-Westpark,781,59.643,Stadtbezirk 07 Sendling-Westpark
7,8,Schwanthalerhöhe,207,29.743,Stadtbezirk 08 Schwanthalerhöhe
8,9,Neuhausen-Nymphenburg,1291,98.814,Stadtbezirk 09 Neuhausen-Nymphenburg
9,10,Moosach,1109,54.223,Stadtbezirk 10 Moosach


Running the geocoder on the whole dataframe at once may lead to time-out errors. Looping over the rows of the dataframe one by one, with a pause between requests, turns out to be a more robust approach.

In [22]:
import time
from geopy.geocoders import Nominatim

geolocator = Nominatim(user_agent="Munich_explorer")

# Vectorized alternative (may run into time-out errors):
#df['district_coord'] = df['Search term'].apply(geolocator.geocode).apply(lambda x: (x.latitude, x.longitude))
#df

coords = []
for _, row in df.iterrows():
    location = geolocator.geocode(row['Search term'])
    coords.append(location[1])  # (latitude, longitude) tuple
    print(location[0], location[1])  # address, coordinates
    time.sleep(2)  # pause between search requests
#coords

Stadtbezirk 01 Altstadt-Lehel, München, Bayern, Deutschland (48.143648049999996, 11.589578888860586)
Stadtbezirk 02 Ludwigsvorstadt-Isarvorstadt, München, Bayern, Deutschland (48.130722250000005, 11.566525978684702)
Stadtbezirk 03 Maxvorstadt, München, Bayern, Deutschland (48.1465704, 11.5714445)
Stadtbezirk 04 Schwabing-West, München, Bayern, Deutschland (48.166354049999995, 11.566190844059609)
Stadtbezirk 05 Au-Haidhausen, München, Bayern, Deutschland (48.130273849999995, 11.59833361534854)
Stadtbezirk 06 Sendling, München, Bayern, 81371, Deutschland (48.1159214, 11.54838668124458)
Stadtbezirk 07 Sendling-Westpark, München, Bayern, Deutschland (48.11803085, 11.519332770284128)
Stadtbezirk 08 Schwanthalerhöhe, München, Bayern, 80339, Deutschland (48.13458915, 11.538012925336673)
Stadtbezirk 09 Neuhausen-Nymphenburg, München, Bayern, Deutschland (48.1566258, 11.518015570976043)
Stadtbezirk 10 Moosach, München, Bayern, Deutschland (48.1854255, 11.515593609710727)
Stadtbezirk 11 Milberts
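
The loop above stops at the first request that fails or returns no match. A slightly more defensive variant is sketched below. It is an illustration, not the code used in this notebook: `geocode_all`, its parameters, and the stand-in `fake_geocode` are hypothetical names, and the geocoder is passed in as a plain function so the sketch runs without network access. With geopy, one could pass a small wrapper around `geolocator.geocode` that returns `(latitude, longitude)`. geopy also ships a `RateLimiter` helper in `geopy.extra.rate_limiter` that serves a similar purpose.

```python
import time


def geocode_all(terms, geocode, pause=2.0, retries=2, sleep=time.sleep):
    """Geocode each term in turn.

    Returns one entry per term: whatever `geocode` returned, or None if
    every attempt raised an exception.
    """
    coords = []
    for term in terms:
        result = None
        for attempt in range(retries + 1):
            try:
                result = geocode(term)
                break
            except Exception:  # e.g. a time-out raised by the geocoder
                sleep(pause * (attempt + 1))  # wait longer after each failure
        coords.append(result)
        sleep(pause)  # pause between search requests
    return coords


# Stand-in geocoder for illustration only (hypothetical)
def fake_geocode(term):
    known = {"Stadtbezirk 03 Maxvorstadt": (48.1465704, 11.5714445)}
    return known.get(term)


coords = geocode_all(["Stadtbezirk 03 Maxvorstadt", "unknown place"], fake_geocode, pause=0)
print(coords)
```

Terms that cannot be resolved end up as `None` in the result, so they can be inspected afterwards instead of aborting the whole run.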

Create the final data frame by adding the coordinates

In [25]:
df['district_coord'] = coords
df[['Latitude', 'Longitude']] = df['district_coord'].apply(pd.Series)
df.drop(['district_coord'], axis=1, inplace=True)
df

Unnamed: 0,No.,district,area,population,Search term,Latitude,Longitude
0,1,Altstadt-Lehel,315,21.1,Stadtbezirk 01 Altstadt-Lehel,48.143648,11.589579
1,2,Ludwigsvorstadt-Isarvorstadt,440,51.644,Stadtbezirk 02 Ludwigsvorstadt-Isarvorstadt,48.130722,11.566526
2,3,Maxvorstadt,430,51.402,Stadtbezirk 03 Maxvorstadt,48.14657,11.571445
3,4,Schwabing-West,436,68.527,Stadtbezirk 04 Schwabing-West,48.166354,11.566191
4,5,Au-Haidhausen,422,61.356,Stadtbezirk 05 Au-Haidhausen,48.130274,11.598334
5,6,Sendling,394,40.983,Stadtbezirk 06 Sendling,48.115921,11.548387
6,7,Sendling-Westpark,781,59.643,Stadtbezirk 07 Sendling-Westpark,48.118031,11.519333
7,8,Schwanthalerhöhe,207,29.743,Stadtbezirk 08 Schwanthalerhöhe,48.134589,11.538013
8,9,Neuhausen-Nymphenburg,1291,98.814,Stadtbezirk 09 Neuhausen-Nymphenburg,48.156626,11.518016
9,10,Moosach,1109,54.223,Stadtbezirk 10 Moosach,48.185426,11.515594


### Conclusions
The initial data frame was created with the district names and corresponding coordinates. In the following, we will concentrate on the major business districts:
* Altstadt-Lehel
* Schwabing-West
* Au-Haidhausen
* Ludwigsvorstadt-Isarvorstadt
* Maxvorstadt

### Pitfalls
* The search term for the geocoder needs a specific format. Initially, with only the district name in the query, the geocoder returned coordinates of some place within the district. This was revealed by plotting the district coordinates on a map and cross-checking manually on https://nominatim.openstreetmap.org/search.php
* The geocoder query may lead to time-out errors if the full list is passed at once. These errors were avoided by looping through the dataframe row by row, with a pause between requests.
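
Part of the manual cross-check can be automated. The sketch below flags coordinates that lie implausibly far from the city center, using the haversine great-circle distance. It only catches gross errors, such as a match in another city. It does not detect a point that is merely off-center within the right district. The reference point `MUNICH_CENTER`, the threshold `MAX_KM`, and the function names are assumptions chosen for illustration.

```python
from math import asin, cos, radians, sin, sqrt

MUNICH_CENTER = (48.137, 11.575)  # assumed approximate city center (lat, lon)
MAX_KM = 15.0                     # assumed plausibility threshold


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points in kilometers."""
    earth_radius_km = 6371.0
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlambda = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * earth_radius_km * asin(sqrt(a))


def suspicious(rows, center=MUNICH_CENTER, max_km=MAX_KM):
    """Return the names of rows whose coordinates are farther than max_km from center."""
    return [name for name, lat, lon in rows
            if haversine_km(lat, lon, center[0], center[1]) > max_km]


rows = [
    ("Altstadt-Lehel", 48.143648, 11.589579),  # from the table above
    ("Wrong match", 52.52, 13.405),            # hypothetical bad result (Berlin)
]
print(suspicious(rows))
```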