# Capstone Project - The Battle of the Neighborhoods

## Introduction: Business Problem

Working at an office desk or from home is now one of the most common work setups. In most cases such work requires sitting for many hours a day, which puts unhealthy static strain on the body.  
Many sports and activities can counteract this; one of the most popular is yoga. Yoga combines physical exercise with mental relaxation, which also helps reduce overall stress. Moreover, most yoga exercises do not demand extreme physical effort and carry a low risk of injury.  
All that said, yoga is a good choice for office workers who live in big urban areas.

Thus, building a recommendation system that finds the most suitable yoga class for office workers based on certain criteria is a valuable analytical problem. It fits the _Clustering_ type of Data Science problems, which can be solved by unsupervised learning algorithms.

## Data

#### Python libraries import

In [None]:
import pandas as pd
import folium
import requests
import math

### Collection

Geo-data about the Munich boroughs from [Wikipedia](https://en.wikipedia.org/wiki/Boroughs_of_Munich), together with [the area and population of each borough](http://www.total-munich.com/20160623888/blog/moving-to-munich/moving-to-munich-introduction-to-munich-s-boroughs.html), was compiled manually and stored in the [munich_boroughs.csv](munich_boroughs.csv) CSV file:

In [None]:
boroughs_df = pd.read_csv('munich_boroughs.csv')
boroughs_df.head()

The Foursquare API is used to obtain information about _Yoga Studios_ in each borough. The following API endpoints are especially useful for this:
 - https://developer.foursquare.com/docs/api/venues/search
 - https://developer.foursquare.com/docs/api/venues/details

To narrow the search results to yoga venues only, we use the _Yoga Studio_ `categoryId` = `4bf58dd8d48988d102941735` from the [available API category values](https://developer.foursquare.com/docs/resources/categories).  
After collection, the total number of venues across all boroughs is truncated to no more than 100 rows.

### Cleanup and feature extraction

Raw JSON data about _Venues_ retrieved from Foursquare API should be filtered to the following structure:
 - Foursquare ID
 - Name
 - Geo-location:
   - Latitude
   - Longitude
 - Contacts:
   - Phone
   - Website
   - Facebook
   - Twitter
   - Instagram
 - Opening hours
 - Rating
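
As a rough illustration of this filtering step, the sketch below reduces one raw venue record to the structure listed above. The key names it reads (`contact`, `url`, `hours`, `rating`) and the helper name `extract_venue` are assumptions about the shape of the venue-details response, not something confirmed by this notebook, so they should be checked against a real API reply. Missing fields simply become `None`.

```python
def extract_venue(raw):
    """Reduce a raw venue dict to the fields used in this project.

    The key names read here are assumptions about the response shape.
    """
    location = raw.get('location', {})
    contact = raw.get('contact', {})
    return {
        'id': raw.get('id'),
        'name': raw.get('name'),
        'location': {
            'latitude': location.get('lat'),
            'longitude': location.get('lng'),
        },
        'contacts': {
            'phone': contact.get('phone'),
            'website': raw.get('url'),
            'facebook': contact.get('facebook'),
            'twitter': contact.get('twitter'),
            'instagram': contact.get('instagram'),
        },
        'opening_hours': raw.get('hours', {}).get('status'),
        'rating': raw.get('rating'),
    }


# Made-up record for illustration only; not real Foursquare data
sample = {
    'id': 'abc123',
    'name': 'Example Yoga',
    'location': {'lat': 48.14, 'lng': 11.57},
    'contact': {'phone': '+49000000'},
    'rating': 8.1,
}
venue = extract_venue(sample)
print(venue['contacts'])
```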
 
This structure is then manually populated with price information to the best of the researcher's ability.  
The populated data is then flattened and one-hot encoded to generate a _feature file_ for the _K-Means Clustering_ algorithm, which is used to determine the main types of yoga classes offered in Munich (e.g. far from the city center but cheap, popular and in the city center, etc.).
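
A minimal sketch of the flattening and one-hot encoding step, assuming venue records shaped like the structure above. The two sample records, the `price_level` field, and its values are invented purely for illustration. `pd.json_normalize` flattens nested dicts into dotted column names, and `pd.get_dummies` one-hot encodes the categorical column.

```python
import pandas as pd

# Invented sample records; 'price_level' stands in for the manually collected prices
records = [
    {'id': 'a1', 'location': {'latitude': 48.14, 'longitude': 11.57}, 'rating': 8.1, 'price_level': 'cheap'},
    {'id': 'b2', 'location': {'latitude': 48.17, 'longitude': 11.52}, 'rating': 7.4, 'price_level': 'expensive'},
]

# Flatten nested dicts: 'location' becomes 'location.latitude' and 'location.longitude'
flat_df = pd.json_normalize(records)

# One-hot encode the categorical column into one indicator column per value
features_df = pd.get_dummies(flat_df, columns=['price_level'])
print(features_df.columns.tolist())
```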

## Methodology

First, let's visualize the borough centers on the map:

In [None]:
munich_latitude = 48.153333
munich_longitude = 11.566667
map_munich = folium.Map(location=[munich_latitude, munich_longitude], zoom_start=12)

# Draw one circle marker per borough center, labelled with the borough name
for _, borough in boroughs_df.iterrows():
    label = folium.Popup(borough['name'], parse_html=True)
    folium.CircleMarker(
        [borough['latitude'], borough['longitude']],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_munich)

map_munich

Next, we use the Foursquare API to obtain information about Yoga Studios in each borough.  
Define the Foursquare API call parameters that have constant values:

In [None]:
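CLIENT_ID = ''  # your Foursquare client ID
CLIENT_SECRET = ''  # your Foursquare client secret
VERSION = '20190922'  # API version date (YYYYMMDD)
RADIUS = 5000  # meters around location
LIMIT = 100  # maximum number of venues per request
# see https://developer.foursquare.com/docs/resources/categories for all possible values
YOGA_STUDIO_CATEGORY_ID = '4bf58dd8d48988d102941735'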
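
Hard-coding credentials in a notebook makes them easy to leak when the notebook is shared. One possible alternative, sketched here, is to read them from environment variables. The variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and the helper `load_credentials` are hypothetical and can be replaced with whatever you set in your shell.

```python
import os


def load_credentials(env=os.environ):
    """Read Foursquare credentials from a mapping, failing loudly if missing.

    The variable names are hypothetical, chosen for this sketch.
    """
    client_id = env.get('FOURSQUARE_CLIENT_ID', '')
    client_secret = env.get('FOURSQUARE_CLIENT_SECRET', '')
    if not client_id or not client_secret:
        raise RuntimeError('Foursquare credentials are not set')
    return client_id, client_secret


# Demonstration with a plain dict instead of the real environment
demo_env = {'FOURSQUARE_CLIENT_ID': 'my-id', 'FOURSQUARE_CLIENT_SECRET': 'my-secret'}
CLIENT_ID, CLIENT_SECRET = load_credentials(demo_env)
```

In the notebook itself, calling `load_credentials()` without arguments would read from the real environment.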

Retrieve information about Yoga Studios via the Foursquare API:

In [None]:
venues = []
for _, borough in boroughs_df.iterrows():
    # Let requests build and URL-encode the query string
    params = {
        'client_id': CLIENT_ID,
        'client_secret': CLIENT_SECRET,
        'v': VERSION,
        'll': '{lat},{lng}'.format(lat=borough['latitude'], lng=borough['longitude']),
        'radius': RADIUS,
        'limit': LIMIT,
        'categoryId': YOGA_STUDIO_CATEGORY_ID,
    }
    result = requests.get('https://api.foursquare.com/v2/venues/search', params=params)
    result.raise_for_status()  # fail early on HTTP errors (e.g. bad credentials)
    response = result.json()['response']['venues']
    if response:
        print("Found {number} venues for '{name}' borough.".format(number=len(response), name=borough['name']))
        for venue in response:
            location = venue['location']
            # 'address' is not always present in the response
            venues.append([venue['id'], venue['name'], location['lat'], location['lng'], location.get('address', '')])
    else:
        print("WARNING: No venues found for '{name}' borough.".format(name=borough['name']))
yoga_classes_raw_df = pd.DataFrame(venues, columns=['id', 'name', 'latitude', 'longitude', 'address'])

Let's get a basic understanding of the collected data:

In [None]:
print('Raw df size:', yoga_classes_raw_df.shape)
yoga_classes_raw_df.head()

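Remove duplicates (the 5 km search radii of neighboring boroughs overlap, so the same venue can be returned more than once):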

In [None]:
yoga_classes_df = yoga_classes_raw_df.drop_duplicates(subset='id').reset_index(drop=True)
print('Deduplicated df size:', yoga_classes_df.shape)

Utility function that converts two pairs of latitude and longitude coordinates into the distance between them, based on the [Haversine formula](https://en.wikipedia.org/wiki/Haversine_formula):

In [None]:
def haversine_formula(lat1, lon1, lat2, lon2):
    """Return the great-circle distance between two points in meters, rounded up."""
    R = 6378.137  # equatorial radius of Earth in km
    # Convert degrees to radians
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dLat = phi2 - phi1
    dLon = math.radians(lon2) - math.radians(lon1)
    a = math.sin(dLat / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dLon / 2) ** 2
    c = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))
    d = R * c

    return math.ceil(d * 1000)  # meters
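
For reference, the function above corresponds to the following form of the haversine formula, where $\varphi_1, \varphi_2$ are the latitudes and $\lambda_1, \lambda_2$ the longitudes in radians, and $R$ is the Earth radius used in the code:

$$
\begin{aligned}
a &= \sin^2\!\left(\frac{\varphi_2 - \varphi_1}{2}\right) + \cos\varphi_1 \cos\varphi_2 \, \sin^2\!\left(\frac{\lambda_2 - \lambda_1}{2}\right) \\
c &= 2 \operatorname{atan2}\!\left(\sqrt{a}, \sqrt{1 - a}\right) \\
d &= R \, c
\end{aligned}
$$

As a quick sanity check, two points on the same meridian one degree of latitude apart give $c = \pi / 180$, so $d = 6378.137 \cdot \pi / 180 \approx 111.32$ km.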

Calculate the distance of each venue from the Munich city center:

In [None]:
munich_lat = 48.1451181
munich_lon = 11.5430782
yoga_classes_df['distance to center, m'] = yoga_classes_df.apply(lambda row: haversine_formula(row['latitude'], row['longitude'], munich_lat, munich_lon), axis=1)
yoga_classes_df.head()

To determine which borough each yoga class belongs to:
1. Calculate the distance between the venue and each borough center
1. Select the borough with the minimum distance

In [None]:
borough_distances = []
for _, borough in boroughs_df.iterrows():
    # One temporary column per borough, holding the distance to its center
    column_name = borough['name']
    borough_distances.append(column_name)
    borough_lat = borough['latitude']
    borough_lon = borough['longitude']
    yoga_classes_df[column_name] = yoga_classes_df.apply(lambda row: haversine_formula(row['latitude'], row['longitude'], borough_lat, borough_lon), axis=1)
# The column with the smallest distance in each row names the nearest borough
yoga_classes_df['borough'] = yoga_classes_df[borough_distances].idxmin(axis=1)
yoga_classes_df.drop(columns=borough_distances, inplace=True)
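
The row-wise `apply` above is fine for a few hundred venues. For larger datasets, the same nearest-center assignment can be sketched with NumPy broadcasting, which builds the full venue-by-borough distance matrix at once. The function name `pairwise_haversine_m`, the sample coordinates, and the borough labels below are illustrative assumptions, not part of the notebook's data.

```python
import numpy as np

EARTH_RADIUS_KM = 6378.137  # same radius as haversine_formula above


def pairwise_haversine_m(lat_a, lon_a, lat_b, lon_b):
    """Distance matrix in meters between points A (rows) and points B (columns)."""
    phi_a = np.radians(np.asarray(lat_a, dtype=float))[:, None]
    phi_b = np.radians(np.asarray(lat_b, dtype=float))[None, :]
    lam_a = np.radians(np.asarray(lon_a, dtype=float))[:, None]
    lam_b = np.radians(np.asarray(lon_b, dtype=float))[None, :]
    a = (np.sin((phi_b - phi_a) / 2) ** 2
         + np.cos(phi_a) * np.cos(phi_b) * np.sin((lam_b - lam_a) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * 1000 * np.arctan2(np.sqrt(a), np.sqrt(1 - a))


# Illustrative coordinates only
borough_names = np.array(['A', 'B'])
borough_lat, borough_lon = [48.14, 48.20], [11.57, 11.60]
venue_lat, venue_lon = [48.15, 48.19, 48.14], [11.57, 11.60, 11.57]

# Rows are venues, columns are boroughs; argmin picks the closest center per venue
distances = pairwise_haversine_m(venue_lat, venue_lon, borough_lat, borough_lon)
nearest = borough_names[distances.argmin(axis=1)]
print(nearest)
```

Unlike `haversine_formula`, this sketch does not round the distances up to whole meters, which does not affect which center is nearest.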


Create a dataframe with features for the K-means algorithm:

In [None]:
yoga_classes_features_df = yoga_classes_df[['id', 'name', 'borough', 'distance to center, m']]
yoga_classes_features_df.head()
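
The K-Means step itself is not implemented in this notebook yet. As a hedged sketch of how it might look with scikit-learn (an assumption: this notebook does not import it), the example below one-hot encodes `borough`, scales the distance column, and fits `KMeans`. The sample rows and the choice of `n_clusters=2` are invented for illustration.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Invented rows mimicking yoga_classes_features_df
sample_df = pd.DataFrame({
    'id': ['a', 'b', 'c', 'd'],
    'borough': ['Altstadt', 'Altstadt', 'Pasing', 'Pasing'],
    'distance to center, m': [500, 700, 8000, 8500],
})

# Keep only model inputs; identifiers such as id and name are not features
X = pd.get_dummies(sample_df[['borough', 'distance to center, m']],
                   columns=['borough'], dtype=float)
# Scale distance so that meters do not dominate the 0/1 indicator columns
X['distance to center, m'] = StandardScaler().fit_transform(
    X[['distance to center, m']]).ravel()

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
sample_df['cluster'] = kmeans.labels_
print(sample_df[['id', 'cluster']])
```

With real data, the number of clusters would need to be chosen separately, for example by comparing several values of `n_clusters`.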

## Results

TBD

## Discussion

TBD

## Conclusion

TBD