# Battle of the Neighborhoods (Week 1)
#### Part of the IBM Data Science Certification: Applied Data Science Capstone, Final Project
    
## Abstract
    
In this work, we examine geographical locations in three neighboring areas of northern Virginia. Each area is divided into equally sized bounding boxes, and the types of venues (shops, restaurants, amenities) in each box are examined to find the best fit for a predetermined set of preferences. This exercises a content-based recommender algorithm to determine the top `N` (in this case `N = 3`) sections across the neighborhoods.


## Table of Contents

1. <a href="#introduction">Introduction (Background)</a>
1. <a href="#data_description">Data Description</a>

In [6]:
# The code was removed by Watson Studio for sharing.

In [8]:
%%capture
# Get stuff installed
!pip install geocoder
!pip install foursquare
!pip install folium
!pip install wordcloud

import pandas as pd
import numpy as np
# import k-means from clustering stage
from sklearn.cluster import KMeans

# Geo-data
import geocoder
import foursquare
import folium # mapping

# Viz stuff
# Matplotlib and associated plotting modules
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import matplotlib.colors as colors
%matplotlib inline

# import package and its set of stopwords
from wordcloud import WordCloud, STOPWORDS
stopwords = set(STOPWORDS)


# Watson Studio stuff
from project_lib import Project

# Others
import re
import html
import math
from IPython.display import display, HTML

KM_PER_DEGREE = 111.32  # approximate kilometers per degree of latitude
R_EARTH = 6378.1  # Earth's equatorial radius in kilometers
CHANTILLY = 0
TYSONS = 1
ARLINGTON = 2
ALEXANDRIA = 3
PLACES = pd.DataFrame({
    'id': [CHANTILLY, TYSONS, ARLINGTON, ALEXANDRIA],
    'place': ['Chantilly, VA','Tysons Corner, VA', 'Arlington, VA', 'Alexandria, VA'], 
    'radius-m': [2780, 1660, 3500, 2000]
    })

# Geocode each place and append its latitude/longitude as new columns
latlngs = [geocoder.google(place, key=api_key).latlng for place in PLACES['place']]
PLACES = pd.concat([PLACES, pd.DataFrame(latlngs, columns=['lat', 'lng'])], axis=1)
CENTER = PLACES[['lat', 'lng']].mean()
BOUNDS = [list(PLACES[['lat','lng']].min().values), list(PLACES[['lat', 'lng']].max().values)]

FSQ_CLIENT = foursquare.Foursquare(client_id=fsq['id'], client_secret=fsq['sec'], version=fsq['ver'])
PROJECT = Project(None, proj['id'], proj['token'])

SURVEY = pd.read_csv(PROJECT.get_file('Survey.csv'))

In [9]:
# Functions

def newLatLng(lat, lng, distance, bearing):
    '''
    Generate new lat, lng based on input lat and lng, distance and bearing.
    
    lat, lng: current lat, lng in degrees
    distance: distance in meters from current lat, lng
    bearing: direction of travel in degrees, clockwise from north (0 is north, 90 is east, -90 or 270 is west)
    
    returns: (lat, lng)
    
    '''
    d = distance/1000
    lat1 = math.radians(lat)
    lng1 = math.radians(lng)
    
    brng = math.radians(bearing)
    lat2 = math.asin( math.sin(lat1)*math.cos(d/R_EARTH) + math.cos(lat1)*math.sin(d/R_EARTH)*math.cos(brng))
    lng2 = lng1 + math.atan2(math.sin(brng)*math.sin(d/R_EARTH)*math.cos(lat1), math.cos(d/R_EARTH)-math.sin(lat1)*math.sin(lat2))

    lat2 = math.degrees(lat2)
    lng2 = math.degrees(lng2)
    
    return (lat2, lng2)

def generateGrids(lat, lng, radius, grid_size, label):
    '''
    Generate a square lattice of bounding boxes covering a region, with grid IDs
    label_<N>, where N is a one-up counter from 0.
    
    lat, lng: center of the region in degrees
    radius: half-width of the entire region in meters
    grid_size: side length in meters of one grid cell (i.e. 500 would be a 500x500 square)
    label: some text, will result in IDs of label_<N>
    
    returns: DataFrame with columns grid (ID), boundN (northwest corner) and
    boundE (southeast corner); each corner is a (lat, lng) tuple
    '''
    
    # number of grids in one direction
    n_grids = math.ceil(2*radius/grid_size)
    
    # corner-to-corner distance of a square cell with side grid_size
    diagonal = math.sqrt(2)*grid_size
    data = {'grid': [], 'boundN': [], 'boundE': []}
    
    # we're just assuming a big square; its northwest corner lies
    # sqrt(2)*radius from the center at a bearing of -45 degrees
    latTop, lngTop = newLatLng(lat, lng, math.sqrt(2)*radius, -45)
    rowLat = latTop
    for x in range(n_grids):
        curLat = rowLat
        curLng = lngTop
        for y in range(n_grids):
            gid = y + n_grids*x
            # southeast corner of this cell
            gLatE, gLngE = newLatLng(curLat, curLng, diagonal, 135)
            data['grid'].append(f'{label}_{gid}')
            data['boundN'].append((curLat, curLng))
            data['boundE'].append((gLatE, gLngE))
            curLng = gLngE
        # the next row starts at the southern edge of this one
        rowLat = gLatE
            
    return pd.DataFrame(data)
    
    

<a id="introduction"></a>

## Introduction (Background)

ACME, Inc. is a growing company currently located in Chantilly, Virginia. Due to its growth, the company needs a larger office and has decided to relocate closer to Washington, D.C., in response to its employees' preference. The owners of ACME, Inc. have determined that *Tysons Corner*, *Arlington*, or *Alexandria* would all be viable locations for the new office. However, determining a specific location has become an issue.

The following map shows the current location of ACME, Inc. (in red), and the other (blue) locations are the potential new areas.


In [10]:
area_map = folium.Map(location=list(CENTER), zoom_control=False)
area_map.fit_bounds(BOUNDS)
# Current location (Chantilly) in red, candidate areas in blue
for idx, row in PLACES[['lat', 'lng', 'radius-m']].iterrows():
    folium.Circle(location=(row['lat'], row['lng']), radius=row['radius-m'], fill=True, color='red' if idx == CHANTILLY else 'blue').add_to(area_map)


display(HTML('<h3 style="text-align: center">Potential Areas for ACME, Inc. Relocation</h3>'))
area_map


To improve their employees' work experience, the owners issued an employee-wide survey to determine which types of venues, shops, restaurants, or other amenities need to be near the new location (within approximately 500 meters, or about a 5-minute walk). The following shows the survey results on a 10-point scale (1 = not interested, 10 = must have).

In [11]:
SURVEY

Unnamed: 0,Type,Rating
0,Metro Station,7.5
1,Gym,8.0
2,Coffee Shop,6.0
3,Post Office,3.5
4,Cleaners,4.5
5,Sandwich Shop,7.0
6,Convenience Store/Drug Store,2.5
7,Bar,8.5


<a id="data_description"></a>

## Data Description

Along with the employee survey data, we will need venues for the various areas. For this effort, we will retrieve venue data for the various grids [(see below)](#methodology) from the [Foursquare data set](https://foursquare.com/), using the [foursquare Python library](https://pypi.org/project/foursquare/). The venue data will be matched to each grid ID (or `gid`) by the venue's location and category (venues without categories will be removed).
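
As a rough illustration of the filtering step described above, the sketch below flattens a venue search response into one row per venue and drops venues that have no category. The function name `venues_to_frame` and the response layout (a `venues` list whose items carry `id`, `name`, `location` and `categories`) are assumptions made for this example, not something the notebook confirms; the second sample venue is made up.

```python
import pandas as pd

def venues_to_frame(response):
    """Flatten a venue search response into one row per categorized venue.

    Assumes `response` is a dict with a 'venues' list whose items carry
    'id', 'name', 'location' ('lat', 'lng') and a 'categories' list.
    """
    rows = []
    for venue in response.get('venues', []):
        categories = venue.get('categories') or []
        if not categories:
            continue  # venues without categories are removed
        rows.append({
            'id': venue['id'],
            'name': venue['name'],
            'lat': venue['location']['lat'],
            'lng': venue['location']['lng'],
            'category': categories[0]['name'],  # first listed category
        })
    return pd.DataFrame(rows, columns=['id', 'name', 'lat', 'lng', 'category'])

# Hypothetical response shaped like the assumption above
sample = {'venues': [
    {'id': '51891fea498ee05ee808e258', 'name': 'REI',
     'location': {'lat': 38.91835, 'lng': -77.228827},
     'categories': [{'name': 'Sporting Goods Shop'}]},
    {'id': 'hypothetical-id', 'name': 'Unlabeled Spot',
     'location': {'lat': 38.92, 'lng': -77.23},
     'categories': []},
]}
venues = venues_to_frame(sample)
```

Only the categorized venue survives, so `venues` holds a single row here.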

The following shows the first 10 venues and their categories from the Foursquare data set. Notice that the category column will need some cleaning, as it contains more specific categories than just `Restaurant`, e.g. `Sushi Restaurant`. Furthermore, some venues are labeled `Pizza Place`, which should also be considered a `Restaurant`.

In [12]:
tysons = pd.read_csv(PROJECT.get_file('tysons_10_venues.csv'))
tysons

Unnamed: 0,id,name,lat,lng,category
0,51891fea498ee05ee808e258,REI,38.91835,-77.228827,Sporting Goods Shop
1,4cd1b5b606b546881d3ce294,Super Chicken,38.920575,-77.235075,South American Restaurant
2,568441df498eed21c59b25de,Roll Play,38.916136,-77.227337,Vietnamese Restaurant
3,4a63e6acf964a520fbc51fe3,"Sakura Japanese Steak, Seafood House & Sushi Bar",38.921471,-77.235775,Sushi Restaurant
4,4b8323eff964a520f1f930e3,Fleming's Prime Steakhouse & Wine Bar,38.920557,-77.227068,Steakhouse
5,547bf26b498ea3fb9947b0cb,Esthetic Institute,38.91434,-77.23416,School
6,4f626ea7e4b0ea77cba053b8,CAVA,38.917194,-77.223629,Mediterranean Restaurant
7,5dd831fed892900007dfa839,Shotted Specialty Coffee,38.917435,-77.223452,Coffee Shop
8,59a5873fd3cce87c7c6cab7a,DoubleTree by Hilton,38.920667,-77.227136,Hotel
9,5a1375b246e1b62527257ae3,&pizza,38.917024,-77.223835,Pizza Place
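
One possible way to do the category cleaning mentioned above is sketched here. The helper `simplify_category` and its override table are hypothetical: only the `Pizza Place` rule comes from the text, and other labels (for example `Steakhouse`) would need their own decision.

```python
import pandas as pd

def simplify_category(category):
    """Collapse specific venue categories into broader ones (illustrative rules)."""
    # Any '<Something> Restaurant' becomes plain 'Restaurant'
    if category.endswith('Restaurant'):
        return 'Restaurant'
    # Hypothetical lookup for categories that do not mention 'Restaurant'
    overrides = {'Pizza Place': 'Restaurant'}
    return overrides.get(category, category)

# A few categories taken from the venue listing above
categories = pd.Series([
    'Sporting Goods Shop',
    'South American Restaurant',
    'Sushi Restaurant',
    'Coffee Shop',
    'Pizza Place',
])
cleaned = categories.map(simplify_category)
```

Categories with no matching rule, such as `Coffee Shop`, pass through unchanged.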


### Assign Venues to Grids

Each location (Tysons Corner, Arlington, and Alexandria) will be segmented into grids of approximately 500 by 500 meters. This represents about a 5-minute walk from one point within the grid to any other point within the grid. As venues are collected from the Foursquare data set, each will be assigned a grid ID. Below is a map that shows the Tysons Corner area segmented into a 7x7 grid.

Then the `one-hot encoding` technique will be applied, giving each grid a unit score for every category it contains. The survey ratings will be applied to the resulting data frame as weights, and the top-scoring grid locations will be explored as potential locations for ACME, Inc. This technique is similar to a movie recommendation engine: the grid locations are equivalent to the movies, and the venue categories are equivalent to the movie genres.
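
A minimal sketch of this scoring, under stated assumptions: the venue-to-grid assignments below are made up for illustration, the ratings are a subset of the survey table shown earlier, and "unit score" is read as 1 when a grid contains at least one venue of a category and 0 otherwise. The actual notebook may weight or aggregate differently.

```python
import pandas as pd

# Hypothetical venue-to-grid assignments (grid IDs and venues are illustrative)
venues = pd.DataFrame({
    'grid': ['tysons_0', 'tysons_0', 'tysons_1', 'tysons_1', 'tysons_2'],
    'category': ['Gym', 'Coffee Shop', 'Bar', 'Gym', 'Post Office'],
})

# Subset of the survey ratings shown earlier
ratings = pd.Series({'Gym': 8.0, 'Coffee Shop': 6.0, 'Bar': 8.5, 'Post Office': 3.5})

# One-hot encode the categories, then flag each grid with 1 per category it contains
one_hot = pd.get_dummies(venues['category']).astype(int)
grid_profile = one_hot.groupby(venues['grid']).max()

# Weight each category by its survey rating and sum the result per grid
weights = ratings.reindex(grid_profile.columns).fillna(0)
scores = grid_profile.mul(weights, axis=1).sum(axis=1)

# Top N grids (N = 3), best first
top_n = scores.sort_values(ascending=False).head(3)
```

Taking the per-grid maximum means a grid with five gyms scores the same as a grid with one; summing the one-hot rows instead would reward density.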

In [13]:
lat, lng, rad = PLACES[['lat', 'lng', 'radius-m']].iloc[TYSONS]

grids = generateGrids(lat, lng, rad, 500, 'tysons')

tysons_map = folium.Map(location=[lat, lng], zoom_control=False, zoom_start=13)
folium.Marker([lat, lng], popup='<i>Tysons</i>').add_to(tysons_map)
# Draw each grid cell as a rectangle spanning its two stored corners
for _, row in grids.iterrows():
    folium.Rectangle([row['boundN'], row['boundE']], fill=True, popup=f"<i>{row['grid']}</i>").add_to(tysons_map)
tysons_map
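
With the grid in place, assigning a venue to a grid ID could look like the sketch below. It relies on `generateGrids` storing the northwest corner of each cell in `boundN` and the southeast corner in `boundE`. The helper `assign_grid` and the two demo cells are hypothetical and only illustrate the containment test; they are not the notebook's own implementation.

```python
import pandas as pd

def assign_grid(lat, lng, grids):
    """Return the ID of the first grid whose bounding box contains (lat, lng), else None.

    Assumes `grids` has the columns 'grid', 'boundN' (northwest corner) and
    'boundE' (southeast corner), with each corner stored as a (lat, lng) tuple.
    """
    for row in grids.itertuples(index=False):
        north, west = row.boundN
        south, east = row.boundE
        if south <= lat <= north and west <= lng <= east:
            return row.grid
    return None

# Two hypothetical side-by-side cells; the coordinates are illustrative only
demo_grids = pd.DataFrame({
    'grid': ['demo_0', 'demo_1'],
    'boundN': [(38.925, -77.240), (38.925, -77.234)],
    'boundE': [(38.920, -77.234), (38.920, -77.228)],
})

first = assign_grid(38.921471, -77.235775, demo_grids)   # inside demo_0
second = assign_grid(38.922, -77.230, demo_grids)        # inside demo_1
outside = assign_grid(38.910, -77.230, demo_grids)       # south of both cells
```

Because both comparisons are inclusive, a venue lying exactly on a shared edge goes to whichever grid appears first.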