## Capstone Project: City recommender (for visiting or living)
### Data
*Auther: Xinyue Luo*

### Overview of data
#### Part 1: Data collection (in this notebook)
1. Largest ~20 cities by population on the US east coast (list obtained from Wikipedia page https://en.wikipedia.org/wiki/Eastern_United_States) will be involved in clustering; area information will also be obtained from each city's main page on Wikipedia to determine the search radius for venue search.
2. Information of nearby venues **(latitute, longitute, category)** will be obtained from **FOURSQUARE** for each city involved.

#### Part 2: Data understanding and modeling (next week's objective)
3. A **normalized frequency count of each venue category** for each city will be used as independent variables when performing clustering of all cities involved in this study.
4. Recommended cities will be returned as final results, once a user inputs a known favorate city, or preferences for ideal city (represented as venue categories and score/weight - higher score/weight is converted to preference for higher frequency of such venture category in the city).
5. For each recommended city on the list from step 4, local venues will be clustered (using k-means or DBSCAN). The venue number and popular venue categories of each cluster will be used to match with user's preferences to further narrow-down recommended regions in each city.

***
- *If not too difficult to implement, more features could be added to **Step 3** as independent variables to improve clustering accuracy. For example, the distrubution pattern (e.g. number and coordinates of centroids/hubs, number of members in a cluster) of local restaurants/shops could indicate whether/how a shopping/entertainment hub exists, which could result in very different clustering result comparing to when not considered*

### Part 1: Data collection
#### 1. Get a list of largest cities on the east coast of US from Wikipedia page

In [1]:
# import libraries
from bs4 import BeautifulSoup
import requests
import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
import numpy as np

In [2]:
# get raw list of top cites from the Wikipedia page "Eastern United States"
webpage = requests.get("https://en.wikipedia.org/wiki/Eastern_United_States").text
soup = BeautifulSoup(webpage, 'lxml')
cities_gallery = soup.find('ul', class_='gallery mw-gallery-traditional')
cities = cities_gallery.find_all('div', class_='gallerytext')
cities[0:2]

[<div class="gallerytext">
 <p><a href="/wiki/New_York_City" title="New York City">New York City</a><br/>population: 8,622,698
 </p>
 </div>, <div class="gallerytext">
 <p><a href="/wiki/Chicago" title="Chicago">Chicago</a><br/>population: 2,695,598
 </p>
 </div>]

In [3]:
# define a function to scrape information from city's main Wikipedia page
# 2 information obtained: [state the city is located in, area of city]
def get_info(url_extention):

    city_page = requests.get("https://en.wikipedia.org" + url_extention).text
    soup = BeautifulSoup(city_page, 'lxml')
    geoinfo = soup.find('table', class_='infobox geography vcard')
    trs = geoinfo.find_all('tr')

    result = ['', '']
    i = 0
    for tr in trs:
        i += 1
        try:
            header = tr.find('th').text
            if header[-1] == ']':
                header = header[:header.find('[')]
        except:
            continue
            
        if header == 'State':
            result[0] = tr.find('td').text
        if header == 'Area':
            result[1] = trs[i].find('td').text
        if result[0] and result[1]:
            break
    return result

In [4]:
# extract city name and population information from raw html and convert into pandas dataframe object
city_list = []
for city in cities:
    a = city.find('a', href=True)
    info = get_info(a['href'])
    
    # some entries contain an extra '\n' in the end; remove 
    if len(info[0])>0 and info[0][-1] == '\n':
        info[0] = info[0][:-1]
        
    temp = [a.text, info[0], info[1], city.find('p').text.split(': ', 1)[-1][:-1]]
    city_list.append(temp)
df_city = pd.DataFrame(city_list, columns=['City', 'State', 'Area', 'Population'])
print('There are a total of', df_city.shape[0], 'cities on the list.')
df_city

There are a total of 24 cities on the list.


Unnamed: 0,City,State,Area,Population
0,New York City,New York,"468.484 sq mi (1,213.37 km2)",8622698
1,Chicago,Illinois,234.14 sq mi (606 km2),2695598
2,Philadelphia,Pennsylvania,142.71 sq mi (369.62 km2),1567827
3,Jacksonville,Florida,"874.64 sq mi (2,265.30 km2)",821784
4,Indianapolis,Indiana,368.02 sq mi (953.18 km2),820445
5,Columbus,Ohio,223.11 sq mi (577.85 km2),787033
6,Charlotte,North Carolina,305.4 sq mi (771 km2),731424
7,Detroit,Michigan,142.89 sq mi (370.08 km2),713777
8,"Washington, D.C.",,68.34 sq mi (177.0 km2),703608
9,Boston,Massachusetts,89.63 sq mi (232.14 km2),667137


In [5]:
# cleanup the 'City' column so that there's no state information, except for 'Washington, D.C.'
df_city['City'] = df_city['City'].str.split(',').str[0]
df_city.loc[8, 'City'] = 'Washingston, D.C.'

In [6]:
# cleanup the 'Area' column so that area is only represented in km2
km = df_city['Area'].str.find('km2')
p_left = df_city['Area'].str.find('(')

df_city.loc[km>p_left, 'Area'] = df_city.loc[km>p_left, 'Area'].apply(lambda x: x[(x.find('(')+1):-5])
df_city.loc[km<p_left, 'Area'] = df_city.loc[km<p_left, 'Area'].apply(lambda x: x[:(x.find('km2')-1)])

# convert string to float and change column name to specify unit
df_city['Area'] = df_city['Area'].str.replace(',','').astype(float)
df_city.rename(columns={"Area":"Area (km2)"}, inplace=True)

# add raius column: assume each city is a circle, thus radius=sqrt(area/pi), convert km to m
df_city['Radius (m)'] = 1000*np.sqrt(df_city['Area (km2)']/np.pi)
df_city.head()

Unnamed: 0,City,State,Area (km2),Population,Radius (m)
0,New York City,New York,1213.37,8622698,19652.675813
1,Chicago,Illinois,606.0,2695598,13888.69292
2,Philadelphia,Pennsylvania,369.62,1567827,10846.829036
3,Jacksonville,Florida,2265.3,821784,26852.697912
4,Indianapolis,Indiana,953.18,820445,17418.571047


#### 2. Get nearby venues (latitute, longitute, category) from FOURSQUARE API for each city

In [7]:
# import additional libraries
from geopy.geocoders import Nominatim
import geocoder
import folium

In [15]:
# @hidden_cell
CLIENT_ID = 'LI3A1RJPXEDEZPARE1JT2CWUWFFAYKXXHB3LENOA0T1KF5HJ' # your Foursquare ID
CLIENT_SECRET = 'EEK4FRAQRPKXCYLJRQHGRYOA2YFH5WZNTWEQWJDT34YICH01' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Foursquare credentails set!')

Foursquare credentails set!


In [9]:
# define a function to get geographic information [latitude, longitude] for each city
def get_geo(address):
    geolocator = Nominatim(user_agent="EastCoast_explorer")
    loc_custom = geolocator.geocode(address)
    return [loc_custom.latitude, loc_custom.longitude]

In [10]:
# loop through each row in the city dataframe and append geographical information
df_city['Latitude'] = np.nan
df_city['Longitude'] = np.nan

for index, row in df_city.iterrows():
    address = row['City'] + ', ' + row['State']
    temp_geo = get_geo(address)
    df_city.at[index, 'Latitude'] = temp_geo[0]
    df_city.at[index, 'Longitude'] = temp_geo[1]
    
df_city.to_csv('dream_city_candidates.csv')
df_city

Unnamed: 0,City,State,Area (km2),Population,Radius (m),Latitude,Longitude
0,New York City,New York,1213.37,8622698,19652.675813,40.730862,-73.987156
1,Chicago,Illinois,606.0,2695598,13888.69292,41.875562,-87.624421
2,Philadelphia,Pennsylvania,369.62,1567827,10846.829036,39.952415,-75.163575
3,Jacksonville,Florida,2265.3,821784,26852.697912,30.332184,-81.655651
4,Indianapolis,Indiana,953.18,820445,17418.571047,39.768333,-86.15835
5,Columbus,Ohio,577.85,787033,13562.27738,39.96226,-83.000707
6,Charlotte,North Carolina,771.0,731424,15665.788274,35.227087,-80.843127
7,Detroit,Michigan,370.08,713777,10853.576493,42.331551,-83.04664
8,"Washingston, D.C.",,177.0,703608,7506.054213,35.936191,-97.069192
9,Boston,Massachusetts,232.14,667137,8596.072183,42.360253,-71.058291


In [19]:
# defind a function to get up to 200 venues nearby each city's geographic coordinates
def getNearbyVenues(cities, latitudes, longitudes, radius, limi):
    
    venues_list=[]
    for city, lat, lng, rad in zip(cities, latitudes, longitudes, radius):

        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            rad, 
            limi)
            
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return relevant information for each nearby venue
        venues_list.append([(
            city, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['City', 
                  'City Latitude', 
                  'City Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

In [22]:
# get all venues for cities on list
EastCoastCities_venues = getNearbyVenues(cities=df_city['City'], 
                                         latitudes=df_city['Latitude'], 
                                         longitudes=df_city['Longitude'], 
                                         radius=df_city['Radius (m)'], limi=200)
EastCoastCities_venues.head()

Unnamed: 0,City,City Latitude,City Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,New York City,40.730862,-73.987156,Sake Bar Decibel,40.729418,-73.987769,Sake Bar
1,New York City,40.730862,-73.987156,Peridance Capezio Center,40.732987,-73.988522,Dance Studio
2,New York City,40.730862,-73.987156,Trader Joe's Wine Shop,40.73375,-73.988128,Wine Shop
3,New York City,40.730862,-73.987156,Strand Bookstore,40.73314,-73.990912,Bookstore
4,New York City,40.730862,-73.987156,The Public Theater,40.729169,-73.99207,Theater


In [23]:
# take a look at the venue data and save to .csv
print(EastCoastCities_venues.shape)
EastCoastCities_venues.to_csv('dream_city_venues.csv')

(2308, 7)
