# Creating a World Cities dataset

The Wikipedia page [List of largest cities](https://en.wikipedia.org/wiki/List_of_largest_cities) has a list of the largest cities in the world.  
We are interested in the table that contains the actual cities, and want each city's name, its nation, and its city-proper population. 

In [Section 2](#2.-Scrape-list-of-largest-cities-from-Wikipedia), we use urllib to fetch the page, then BeautifulSoup to parse it and find the first (and only) sortable table.  
We also scrape the URLs of the individual cities' pages.

In [Section 3](#3.-Adding-geopositioning-data) we scrape the Wikipedia page of every individual city in order to get each city's geographic coordinates.  
In theory we could use this step to grab more information, like population density. However, since we want to demonstrate the use of the Foursquare API, we will not use such data.

In [Section 4](#4.-Getting-more-information-from-Foursquare) we use the Foursquare API to learn more about these cities. 
Specifically we ask for top recommendations in those cities, to get an idea of what is popular.

In [Section 5](#5.-Joining-the-dataframe) we join the venues with the city data, and in [Section 6](#6.-Saving-the-dataframes) we export the created dataframes so we can import them in other notebooks.

## 1. Imports and such things

We use the following libraries to fetch and parse HTML pages, and to interact with the Foursquare API.

In [1]:
import urllib.request
from bs4 import BeautifulSoup
import requests

We use regular expressions, e.g. to extract the population count and the city coordinates from scraped web pages.

In [2]:
import re

We use Pandas for building the dataframe.

In [3]:
import pandas as pd

## 2. Scrape list of largest cities from Wikipedia

The following code fetches a Wikipedia page.

In [4]:
# location of the wikipedia article
url = "https://en.wikipedia.org/wiki/List_of_largest_cities"

# fetch the article
req = urllib.request.urlopen(url)
article = req.read().decode()
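
A slightly more defensive variant of this fetch sets an explicit User-Agent and a timeout, and decodes with the charset the server reports. The sketch below is not part of the original notebook: the helper names `build_request` and `fetch_html`, the User-Agent string, and the 30-second timeout are assumptions chosen for illustration.

```python
import urllib.request

# Hypothetical identifier for this scraper; replace with your own contact details.
USER_AGENT = "world-cities-notebook/0.1 (example contact: you@example.org)"

def build_request(url):
    """Wrap the URL in a Request that carries an explicit User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})

def fetch_html(url, timeout=30):
    """Fetch a page and decode it with the charset the server reports (UTF-8 if none)."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset)

# Usage (performs a network request):
# article = fetch_html("https://en.wikipedia.org/wiki/List_of_largest_cities")
```

Splitting the request construction from the actual fetch keeps the header logic checkable without touching the network.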

We use BeautifulSoup to parse the obtained HTML, and find the first sortable table. This is the table of Largest cities that dominates the Wikipedia article.

In [5]:
# parse with BeautifulSoup and find the first sortable table
soup = BeautifulSoup(article, 'html.parser')
table = soup.find('table', class_='sortable')
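
To see how the `class_='sortable'` lookup behaves, it can help to run it on a tiny hand-written page first. The HTML below is a hypothetical, heavily simplified stand-in for the article (the real markup is much larger), and the names `SAMPLE_HTML`, `sample_soup` and `sample_table` exist only for this sketch.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the article's HTML: one plain table, one sortable table.
SAMPLE_HTML = """
<table class="wikitable"><tr><td>not the table we want</td></tr></table>
<table class="wikitable sortable">
  <tr><th>City</th><th>Nation</th><th>Image</th><th>Population</th></tr>
  <tr>
    <th><a href="/wiki/Chongqing">Chongqing</a></th>
    <td><a href="/wiki/China">China</a></td>
    <td></td>
    <td>30,751,600[12]</td>
  </tr>
</table>
"""

sample_soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
# class_ matches any table whose class list includes "sortable"
sample_table = sample_soup.find("table", class_="sortable")

rows = []
for tr in sample_table.find_all("tr"):
    tds = tr.find_all("td")
    if not tds:
        continue                      # the header row has only <th> cells
    city_a = tr.find("th").find("a")  # the city link lives in the row's <th>
    rows.append((str(city_a.string), str(tds[0].find("a").string),
                 city_a["href"], tds[2].text))

print(rows)
```

The first table is skipped because its class list does not contain `sortable`, even though both tables share the `wikitable` class.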

We create a new Pandas dataframe to hold our cities. For each city we also store its nation, its population count, and the URL of its Wikipedia page.

In [6]:
# create an empty DataFrame
cols=["City", "Nation", "Population", "URL"]
df_cities = pd.DataFrame(columns=cols)
df_cities['Population'] = df_cities['Population'].astype(int)  # astype returns a new Series, so assign it back

We can now traverse the entire table and add the required data to our dataframe.

In [7]:
# iterate through all the rows in the table, collecting one dict per city
ref_pattern = re.compile(r'\[.*\]')         # matches footnote markers such as [12]
rows = []
for tr in table.find_all('tr'):
    tds = tr.find_all('td')
    if not tds:
        continue                            # skips first row with headings
    nation = tds[0].find('a').string        # first td column contains nation
    try:
        # rough but working way to parse the population count:
        # strip footnote markers and thousands separators, then convert
        pop = int(ref_pattern.sub("", tds[2].text).replace(',', ''))
    except ValueError:
        pop = 0                             # but not every city has a population count
    city_a = tr.find('th').find('a')        # the first column is a th tag that holds the <a> link to the city
    city = city_a.string
    url = "https://en.wikipedia.org" + city_a['href']
    rows.append({
        'City': str(city),
        'Nation': str(nation),
        'Population': pop,
        'URL': str(url)
    })

# DataFrame.append was removed in pandas 2.0, so build the frame from the collected rows
df_cities = pd.DataFrame(rows, columns=cols)
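
The population parsing is easiest to reason about in isolation. The helper below is a minimal sketch, not the notebook's own code: the name `parse_population` is hypothetical, and the footnote pattern is deliberately narrower than the `\[.*\]` used in the loop, so that it removes each bracketed marker separately instead of everything between the first `[` and the last `]`.

```python
import re

# One bracketed footnote marker, e.g. "[12]" (assumed format of the reference marks).
FOOTNOTE = re.compile(r'\[[^\]]*\]')

def parse_population(cell_text):
    """Convert a table cell such as '30,751,600[12]' to an int; return 0 if that fails."""
    cleaned = FOOTNOTE.sub('', cell_text).replace(',', '').strip()
    try:
        return int(cleaned)
    except ValueError:
        return 0        # same fallback the loop uses for cities without a count

print(parse_population("30,751,600[12]"))
print(parse_population("N/A"))
```

Returning 0 mirrors the notebook's fallback; returning `None` instead would make missing counts easier to tell apart from real values, at the cost of a non-integer column.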

In [8]:
print ("Shape: {}".format(df_cities.shape))
print ("There are {} unique city names".format(df_cities['City'].nunique()))

Shape: (247, 4)
There are 246 unique city names


This is strange: apparently one city name occurs twice in the list.
It is *Hyderabad*, which exists in both India and Pakistan.

Conclusion: we cannot use the city name alone as a unique identifier for this list.
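
One way to find such a name clash without eyeballing the list is `DataFrame.duplicated`. This is a small sketch on a three-row stand-in for `df_cities` (the name `sample` is made up for the example), not a cell from the original notebook.

```python
import pandas as pd

# Tiny stand-in for df_cities; only the two name columns matter here.
sample = pd.DataFrame({
    "City":   ["Hyderabad", "Hyderabad", "Delhi"],
    "Nation": ["India", "Pakistan", "India"],
})

# keep=False marks every row of a duplicated name, not just the repeats
shared_names = sample[sample.duplicated("City", keep=False)]
print(shared_names)

# no (City, Nation) pair repeats, so the pair can serve as a key
pair_is_unique = not sample.duplicated(["City", "Nation"]).any()
print(pair_is_unique)
```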

In [9]:
print ("There are {} unique city/nation combinations".format(len(df_cities.groupby(['City', 'Nation']))))
df_cities.head()

There are 247 unique city/nation combinations


| | City | Nation | Population | URL |
|---|---|---|---|---|
| 0 | Chongqing | China | 30751600 | https://en.wikipedia.org/wiki/Chongqing |
| 1 | Shanghai | China | 24256800 | https://en.wikipedia.org/wiki/Shanghai |
| 2 | Delhi | India | 11034555 | https://en.wikipedia.org/wiki/Delhi |
| 3 | Beijing | China | 21516000 | https://en.wikipedia.org/wiki/Beijing |
| 4 | Dhaka | Bangladesh | 14399000 | https://en.wikipedia.org/wiki/Dhaka |


## 3. Adding geopositioning data
Every city's Wikipedia page contains the geographic coordinates, which we can also scrape.
The following function scrapes the city page and uses a simple regular expression to capture the coordinates.

Here we don't need BeautifulSoup, since the regular expression can be run directly on the fetched HTML.

In [10]:
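# Scrape an individual city's page for its coordinates
def scrape_city_coords(url):
    req = urllib.request.urlopen(url)
    article = req.read().decode()
    # look for the pattern "lat":<number>,"lon":<number>} in the raw page source
    reg = re.search(r'"lat":(.*?),"lon":(.*?)}', article)
    if reg is None:
        raise ValueError("no coordinates found on " + url)
    lat = float(reg.group(1))
    lon = float(reg.group(2))
    return lat, lon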
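
The regular expression can be tried out offline on a short string before pointing it at real pages. The snippet below is a sketch under assumptions: `SAMPLE` is a made-up fragment shaped to fit the pattern (it is not copied from Wikipedia), `extract_coords` is a hypothetical helper, and the pattern is a stricter variant that only accepts plain decimal numbers.

```python
import re

# Stricter variant of the notebook's pattern: optional sign, digits, optional fraction.
COORD = re.compile(r'"lat":(-?\d+(?:\.\d+)?),"lon":(-?\d+(?:\.\d+)?)')

def extract_coords(html):
    """Return (lat, lon) as floats, or None when the pattern is absent."""
    match = COORD.search(html)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))

# Made-up fragment, shaped like the text the pattern expects.
SAMPLE = 'var config = {"coordinates":{"lat":29.558333,"lon":106.566667}};'

print(extract_coords(SAMPLE))
print(extract_coords("<html>no coordinates here</html>"))
```

Returning `None` rather than raising lets the caller decide whether a city without coordinates should be skipped or should stop the run.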

Next we run this function on the URLs to create two new columns.

In [11]:
df_cities["Latitude"], df_cities["Longitude"] = zip(*df_cities["URL"].map(scrape_city_coords))
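
The `zip(*...)` idiom in this cell turns a sequence of `(lat, lon)` pairs into one sequence of latitudes and one of longitudes. The sketch below shows the same step on plain Python data; `fake_scrape` and `FAKE_COORDS` are hypothetical stand-ins for `scrape_city_coords`, so that no network access is needed.

```python
# Hypothetical stand-in for scrape_city_coords, backed by a lookup table.
FAKE_COORDS = {
    "https://en.wikipedia.org/wiki/Chongqing": (29.558333, 106.566667),
    "https://en.wikipedia.org/wiki/Shanghai": (31.228611, 121.474722),
}

def fake_scrape(url):
    return FAKE_COORDS[url]

pairs = [fake_scrape(u) for u in FAKE_COORDS]   # one (lat, lon) tuple per URL
latitudes, longitudes = zip(*pairs)             # "unzip" into one tuple per column

print(latitudes)
print(longitudes)
```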

In [12]:
df_cities.head()

| | City | Nation | Population | URL | Latitude | Longitude |
|---|---|---|---|---|---|---|
| 0 | Chongqing | China | 30751600 | https://en.wikipedia.org/wiki/Chongqing | 29.558333 | 106.566667 |
| 1 | Shanghai | China | 24256800 | https://en.wikipedia.org/wiki/Shanghai | 31.228611 | 121.474722 |
| 2 | Delhi | India | 11034555 | https://en.wikipedia.org/wiki/Delhi | 28.61 | 77.23 |
| 3 | Beijing | China | 21516000 | https://en.wikipedia.org/wiki/Beijing | 39.916667 | 116.383333 |
| 4 | Dhaka | Bangladesh | 14399000 | https://en.wikipedia.org/wiki/Dhaka | 23.716111 | 90.396111 |


## 4. Getting more information from Foursquare

This information is needed to connect to the Foursquare API.

In [13]:
CLIENT_ID = 'YPBVFDUZOP1M24BKCWGXIYZ3RFACOE3V35WSFY4DSCMRU44L' # your Foursquare ID
CLIENT_SECRET = 'VYHYTBSRIZBPYAOCP5ZEFV3YM4C40YEQCQWCUO4NC1JTPNJM' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version
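
When a notebook is shared, credentials are often read from the environment instead of being written into a cell. The following is a minimal sketch of that approach; the variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and the helper `load_foursquare_credentials` are assumptions made for this example, not something Foursquare or the original notebook prescribes.

```python
import os

def load_foursquare_credentials(env=os.environ):
    """Return (client_id, client_secret) from the environment, or raise if one is missing."""
    try:
        # hypothetical variable names; pick whatever fits your setup
        return env["FOURSQUARE_CLIENT_ID"], env["FOURSQUARE_CLIENT_SECRET"]
    except KeyError as missing:
        raise RuntimeError("missing environment variable {}".format(missing)) from None

# Usage: CLIENT_ID, CLIENT_SECRET = load_foursquare_credentials()
```

Passing the mapping in as a parameter keeps the helper easy to try out with a plain dict.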

We will ask Foursquare for the top picks in every city, then store each venue's name, location and main category.

In [14]:
RADIUS=20000 # search within 20km from city centre (max=100km)
LIMIT=100    # maximum of 100 venues per city (max=100)

def getRecommendedVenues(cities, nations, latitudes, longitudes):
    
    venues_list=[]
    for city, nation, lat, lon in zip(cities, nations, latitudes, longitudes):
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?section=topPicks&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lon,
            RADIUS,
            LIMIT)
            
        # make the GET request
        response = requests.get(url)
        if response.status_code == requests.codes.ok:
            results = response.json()["response"]['groups'][0]['items']
        else:
            print ("status was:" + str(response.status_code))
            print ("could not scrape " + url)
            break
        
        print("City: {}, Nation: {}, results: {}".format(city, nation, len(results)))
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            city,
            nation,
            v['venue']['name'], 
            v['venue']['id'],
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])


    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['City',
                  'Nation',
                  'Venue',
                  'Venue ID',
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues
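
Two parts of this function can be pulled out and exercised without calling the API: assembling the query parameters, and flattening one result item. The sketch below assumes the same parameter names and response fields that the function above already uses; the helpers `build_explore_params` and `flatten_item`, and the sample item, are hypothetical. Unlike the list comprehension above, `flatten_item` tolerates a venue with an empty category list.

```python
from urllib.parse import urlencode

def build_explore_params(client_id, client_secret, version, lat, lon, radius, limit):
    """Collect the query parameters that the notebook formats into the URL by hand."""
    return {
        "section": "topPicks",
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lon),
        "radius": radius,
        "limit": limit,
    }

def flatten_item(city, nation, item):
    """Turn one entry of response['groups'][0]['items'] into a flat tuple."""
    venue = item["venue"]
    categories = venue.get("categories") or []
    category = categories[0]["name"] if categories else None   # no category: store None
    return (city, nation, venue["name"], venue["id"],
            venue["location"]["lat"], venue["location"]["lng"], category)

params = build_explore_params("ID", "SECRET", "20180605", 29.558333, 106.566667, 20000, 100)
query = urlencode(params)   # the dict could also be handed to requests.get(url, params=params)
print(query)

# Hand-written item, shaped like the fields the function above reads.
item = {"venue": {"name": "Example Venue", "id": "example-id",
                  "location": {"lat": 29.5, "lng": 106.5},
                  "categories": []}}
row = flatten_item("Chongqing", "China", item)
print(row)
```

Letting a library encode the parameters avoids hand-placed `&` separators and takes care of escaping the comma in `ll`.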

Now, actually run the above function for every city.

In [15]:
df_venues = getRecommendedVenues(df_cities['City'], df_cities['Nation'], df_cities['Latitude'], df_cities['Longitude'])

City: Chongqing, Nation: China, results: 16
City: Shanghai, Nation: China, results: 96
City: Delhi, Nation: India, results: 100
City: Beijing, Nation: China, results: 87
City: Dhaka, Nation: Bangladesh, results: 83
City: Mumbai, Nation: India, results: 100
City: Lagos, Nation: Nigeria, results: 100
City: Chengdu, Nation: China, results: 15
City: Karachi, Nation: Pakistan, results: 100
City: Guangzhou, Nation: China, results: 90
City: Istanbul, Nation: Turkey, results: 100
City: Tokyo, Nation: Japan, results: 28
City: Tianjin, Nation: China, results: 34
City: Moscow, Nation: Russia, results: 100
City: São Paulo, Nation: Brazil, results: 100
City: Kinshasa, Nation: DR Congo, results: 24
City: Baoding, Nation: China, results: 3
City: Lahore, Nation: Pakistan, results: 100
City: Cairo, Nation: Egypt, results: 100
City: Seoul, Nation: Korea, South, results: 62
City: Jakarta, Nation: Indonesia, results: 100
City: Wenzhou, Nation: China, results: 5
City: Mexico City, Nation: Mexico, results: 

In [16]:
print ("Found {} venues for {} different cities".format(df_venues.shape[0], len(df_venues.groupby(['City', 'Nation']))))
df_venues.head()

Found 16421 venues for 247 different cities


| | City | Nation | Venue | Venue ID | Venue Latitude | Venue Longitude | Venue Category |
|---|---|---|---|---|---|---|---|
| 0 | Chongqing | China | Hongyadong (洪崖洞) | 4bc830fa15a7ef3b9ca87ada | 29.564942 | 106.574742 | Shopping Mall |
| 1 | Chongqing | China | Paulaner Brauhaus | 504e9876e4b0225693e3ddcb | 29.538571 | 106.557791 | German Restaurant |
| 2 | Chongqing | China | The Cactus | 4df77dac45dd222116c98b42 | 29.565164 | 106.575347 | Mexican Restaurant |
| 3 | Chongqing | China | 观音桥步行街 | 4d4f81433626a093174511bd | 29.577044 | 106.528655 | Plaza |
| 4 | Chongqing | China | Blue Frog (蓝蛙) | 565ab146498e3f29232c2f27 | 29.580209 | 106.528782 | American Restaurant |


That looks good. Next we join the venues with the city data, and then save everything to file so we don't have to scrape again.

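## 5. Joining the dataframe

We merge the venues with the city data on the combination of city and nation, since the city name alone is not unique.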

In [17]:
df_joined = pd.merge(df_venues, df_cities, on=['City','Nation'])
print ("The joined dataframe contains {} different cities".format(len(df_joined.groupby(['City', 'Nation']))))
df_joined.head()

The joined dataframe contains 247 different cities


| | City | Nation | Venue | Venue ID | Venue Latitude | Venue Longitude | Venue Category | Population | URL | Latitude | Longitude |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Chongqing | China | Hongyadong (洪崖洞) | 4bc830fa15a7ef3b9ca87ada | 29.564942 | 106.574742 | Shopping Mall | 30751600 | https://en.wikipedia.org/wiki/Chongqing | 29.558333 | 106.566667 |
| 1 | Chongqing | China | Paulaner Brauhaus | 504e9876e4b0225693e3ddcb | 29.538571 | 106.557791 | German Restaurant | 30751600 | https://en.wikipedia.org/wiki/Chongqing | 29.558333 | 106.566667 |
| 2 | Chongqing | China | The Cactus | 4df77dac45dd222116c98b42 | 29.565164 | 106.575347 | Mexican Restaurant | 30751600 | https://en.wikipedia.org/wiki/Chongqing | 29.558333 | 106.566667 |
| 3 | Chongqing | China | 观音桥步行街 | 4d4f81433626a093174511bd | 29.577044 | 106.528655 | Plaza | 30751600 | https://en.wikipedia.org/wiki/Chongqing | 29.558333 | 106.566667 |
| 4 | Chongqing | China | Blue Frog (蓝蛙) | 565ab146498e3f29232c2f27 | 29.580209 | 106.528782 | American Restaurant | 30751600 | https://en.wikipedia.org/wiki/Chongqing | 29.558333 | 106.566667 |
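
Why the join uses both `City` and `Nation` can be shown on a toy example. The frames below are made-up stand-ins (the venue names and population figures are placeholders, not scraped data), and the `validate='many_to_one'` argument is an optional safeguard that the original cell does not use.

```python
import pandas as pd

# Placeholder data: two venues in two different cities that share a name.
venues = pd.DataFrame({
    "City":   ["Hyderabad", "Hyderabad"],
    "Nation": ["India", "Pakistan"],
    "Venue":  ["Venue A", "Venue B"],
})
cities = pd.DataFrame({
    "City":       ["Hyderabad", "Hyderabad"],
    "Nation":     ["India", "Pakistan"],
    "Population": [1, 2],          # placeholder values, not real counts
})

# Joining on the name alone pairs every venue with both cities: 4 rows.
by_name_only = pd.merge(venues, cities[["City", "Population"]], on="City")

# Joining on the pair keeps one row per venue; validate raises
# pandas.errors.MergeError if the right-hand keys are not unique.
by_pair = pd.merge(venues, cities, on=["City", "Nation"], validate="many_to_one")

print(len(by_name_only), len(by_pair))
```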


## 6. Saving the dataframes

We export the dataframes so we don't have to scrape again.  
First as CSV, as this is a very generic format.  
Secondly as pickle, since this is a quick way to import the dataframe again.

In [18]:
# export all three dataframes as CSV
df_cities.to_csv('cities.csv')
df_venues.to_csv('venues.csv')
df_joined.to_csv('joined.csv')

In [19]:
# export all three dataframes as pickles
df_cities.to_pickle('cities.pickle')
df_venues.to_pickle('venues.pickle')
df_joined.to_pickle('joined.pickle')
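
When the CSV files are read back in another notebook, the row index written by `to_csv` comes back as an extra column named `Unnamed: 0`. The sketch below shows this on a two-row stand-in frame, using an in-memory buffer instead of a file; passing `index=False` is one option to avoid the extra column, not something the export cell above does.

```python
import io
import pandas as pd

sample = pd.DataFrame({"City": ["Delhi", "Dhaka"], "Nation": ["India", "Bangladesh"]})

# default: the index is written as an unnamed first column
with_index = pd.read_csv(io.StringIO(sample.to_csv()))
print(list(with_index.columns))     # ['Unnamed: 0', 'City', 'Nation']

# index=False: only the data columns are written
without_index = pd.read_csv(io.StringIO(sample.to_csv(index=False)))
print(list(without_index.columns))  # ['City', 'Nation']
```

An alternative that keeps the files unchanged is to read them with `pd.read_csv(path, index_col=0)`, which restores the first column as the index.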