# Food around the Universities in Canada

### Applied Data Science Capstone
### Coursera Capstone Project - The Battle of Neighborhoods

___by Ripple Shi, 20 May 2020___

----

## PART 1 INTRODUCTION

I believe food is an important part of life. Food provides you with nutrition, energy and, ideally, satisfaction. As someone who will soon enroll in a university in Canada, I am really curious about what types of food I can get there. I was born and raised in Asia, where the diet is quite different from that in North America, so it would be a great relief to know whether I can easily find familiar types of meals nearby.

I am sure this is also a concern for others who plan to study, work or live in a different country. However, given the limited scale of the project, we will only explore the food venues around Canadian universities. They could be restaurants, cafés or any other venues that satisfy people’s need for food.

I hope this project provides readers with some insights on this subject. Although we focus here on food, universities and Canada, I think the idea behind this project can also be applied to similar questions.


To carry out the project, we rely on the recommendations provided by Foursquare’s API. Given the coordinates of a university, Foursquare returns recommended venues in the food section, subject to the result limit and search radius we set. The category of each venue is also returned, so we can use it to guess the main cuisine of the venue.

We will also use K-Means, a clustering algorithm, to group the universities and explore the features of each group. Hopefully, we can get some interesting findings out of that.

## PART 2 DATA

The data used in this project come from three sources.

We start by getting a list of universities in Canada from Wikipedia. The link is https://en.wikipedia.org/wiki/List_of_universities_in_Canada.

Although we are not sure whether the list is complete, it should be enough to represent the population we are interested in.

Next, we will use the name and the province of the universities to get their coordinates. We obtain the latitude and longitude of each university using the dataset at http://py4e-data.dr-chuck.net/. This is a subset of data from the Google Geocoding API, established by Dr. Charles R. Severance of the University of Michigan to support the Python courses he teaches. Please note that this dataset was not my first choice for getting the coordinates; the Data Preprocessing section explains why. The coordinates retrieved from this dataset allow us to specify the location in the Foursquare search queries.
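As a rough sketch of how one such lookup can be addressed, the snippet below builds the request URL for a single university. The helper name `build_geocode_url` is hypothetical; the endpoint and the key value `42` are the ones used in Section 3.1.2 of this notebook. No request is sent here.

```python
import urllib.parse

# Endpoint and demo key as used later in this notebook
SERVICE_URL = "http://py4e-data.dr-chuck.net/json?"
API_KEY = 42

def build_geocode_url(name, province):
    """Return the request URL for one university (hypothetical helper)."""
    address = "{}, {}, Canada".format(name, province)
    params = {"address": address, "key": API_KEY}
    # urlencode percent-encodes spaces, commas and apostrophes in the address
    return SERVICE_URL + urllib.parse.urlencode(params)

print(build_geocode_url("The King's University", "Alberta"))
```

Because `urlencode` takes care of special characters, names containing apostrophes or accented letters do not need to be escaped by hand.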

Finally, we use Foursquare’s API to get the recommendations in the food section and do the analysis. According to the documentation of Foursquare’s API, the “explore” endpoint returns a list of recommended venues near a given location. The list includes a lot of information, but we only need the venue name and the venue category. This is enough to summarize what types of venues can be found around the universities.
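To illustrate which parts of the response we keep, here is a minimal sketch that extracts the venue name and first category from a response-shaped dictionary. The payload is made up for illustration and only mimics the nesting that the notebook code relies on later (`response` → `groups` → `items` → `venue`); `extract_venues` is a hypothetical helper, not part of the Foursquare API.

```python
# Made-up payload that mimics the nesting used in this notebook;
# the venue names and categories are illustrative only.
sample_response = {
    "response": {
        "groups": [
            {
                "items": [
                    {"venue": {"name": "Example Cafe",
                               "categories": [{"name": "Café"}]}},
                    {"venue": {"name": "Example Noodles",
                               "categories": [{"name": "Asian Restaurant"}]}},
                    {"venue": {"name": "No Category Place", "categories": []}},
                ]
            }
        ]
    }
}

def extract_venues(payload):
    """Return (venue name, first category name or None) pairs."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    venues = []
    for item in items:
        venue = item.get("venue", {})
        categories = venue.get("categories") or []
        category = categories[0].get("name") if categories else None
        venues.append((venue.get("name"), category))
    return venues

venues = extract_venues(sample_response)
for name, category in venues:
    print(name, "|", category)
```

Returning `None` for a venue without a category keeps the row, so the missing value can be handled explicitly later on.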

Based on that, we build one variable per venue category: the share of that category among all the recommended venues returned for a university under the conditions we set. These derived variables are then used to fit a K-Means model that clusters the universities.
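A small sketch of this idea, on invented data, is shown below. It uses `pd.crosstab` with `normalize="index"`, which is one possible way to get row-wise shares and not necessarily the route taken later in this notebook. The university and venue names are placeholders.

```python
import pandas as pd

# Toy recommendations; the names and categories are invented for illustration
toy = pd.DataFrame({
    "University": ["U1", "U1", "U1", "U1", "U2", "U2"],
    "Venue categories": ["Café", "Café", "Sushi Restaurant", "Diner",
                         "Café", "Diner"],
})

# Count venues per (university, category) and divide each row by its total
shares = pd.crosstab(toy["University"], toy["Venue categories"], normalize="index")
print(shares)
```

Each row of `shares` sums to 1, so a university with four recommendations and one with two can be compared on the same scale.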

## PART 3 METHODOLOGY

### 3.1. Data Preprocessing

#### 3.1.1. List of the Universities

In [1]:
import numpy as np
import pandas as pd

In [2]:
!conda install -c conda-forge lxml --yes
url_wiki = "https://en.wikipedia.org/wiki/List_of_universities_in_Canada"
tables = pd.read_html(url_wiki)


The Wikipedia page contains 10 tables of public universities, one per province, and 1 table of private universities.

The formats of the tables differ slightly: the lists of public universities lack the _Province_ column that the list of private universities has. We take this difference into account when reading the tables.
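One way to spot such differences before combining tables is to inspect the column index of each table. The sketch below does this on two small stand-in dataframes with invented values; it only assumes that the public tables come with a two-row header, as the output further down shows.

```python
import pandas as pd

# Toy stand-ins for the two Wikipedia table layouts (values are invented)
public_like = pd.DataFrame(
    [["Example University", "Calgary", 100, 20]],
    columns=pd.MultiIndex.from_tuples([
        ("Name", "Name"), ("City", "City"),
        ("Students", "Undergrad."), ("Students", "Postgrad."),
    ]),
)
private_like = pd.DataFrame(
    [["Example College", "Vancouver", "British Columbia"]],
    columns=["Name", "City", "Province"],
)

for label, table in [("public", public_like), ("private", private_like)]:
    # Level 0 holds the top header row for both layouts
    top_level = table.columns.get_level_values(0)
    print(label, "| header rows:", table.columns.nlevels,
          "| has Province:", "Province" in top_level)
```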

In [3]:
province_ls = ["Alberta", "British Columbia", "Manitoba", "New Brunswick", "Newfoundland and Labrador", "Nova Scotia", "Ontario", "Prince Edward Island", "Quebec", "Saskatchewan"]

# Add the province to each public table and stack them into one dataframe
df_public = pd.DataFrame()

for table, province in zip(tables, province_ls):
    table.insert(2, "Province", province)
    df_public = pd.concat([df_public, table], axis=0)

df_public

Unnamed: 0_level_0,Name,City,Province,Language,Est.,Students,Students,Students,Notes
Unnamed: 0_level_1,Name,City,Unnamed: 3_level_1,Language,Est.,Undergrad.,Postgrad.,Total,Notes
0,Alberta University of the Arts,Calgary,Alberta,English,1926,,,1323,
1,Athabasca University,"Athabasca, Calgary, Edmonton",Alberta,English,1970,36240.0,3460.0,39700,[38]
2,MacEwan University,Edmonton,Alberta,English,1971,18897.0,0.0,18897,[39]
3,Mount Royal University,Calgary,Alberta,English,1910,24768.0,0.0,24768,[40]
4,University of Alberta,"Edmonton, Camrose, Calgary",Alberta,Bilingual,1906,31904.0,7598.0,39502,[41]
...,...,...,...,...,...,...,...,...,...
12,Université du Québec à Rimouski[note 3],Rimouski and Lévis,Quebec,French,1969,4620.0,810.0,5430,[75]
13,Université du Québec à Trois-Rivières[note 3],Trois-Rivières,Quebec,French,1969,9160.0,1450.0,10610,[76]
14,Université Laval,Quebec City,Quebec,French,1663,27530.0,10270.0,37800,[77]
0,University of Regina,"Regina, Saskatoon, Swift Current",Saskatchewan,Bilingual,1911,10690.0,1480.0,12170,[78]


In [4]:
df_private = tables[-4]
df_private

Unnamed: 0,Name,City,Province,Language,Established,Undergraduates,Post-graduates,Total students,Notes
0,Fairleigh Dickinson University (branch),Vancouver,British Columbia,English,2007,78[failed verification],50.0,78[failed verification],[80]
1,New York Institute of Technology (branch),Vancouver,British Columbia,English,2007,70[failed verification],40.0,70[failed verification],[81]
2,Quest University,Squamish,British Columbia,English,2007,700,0.0,700,[82]
3,Niagara University (branch),Vaughan,Ontario,English,2019,,,,[83]
4,Trinity Western University,Langley,British Columbia,English,1962,2130,730.0,2860,[84]
5,University Canada West,Victoria,British Columbia,English,2005,350[needs update],0.0,350[needs update],[85]
6,Booth University College,Winnipeg,Manitoba,English,1982,250,0.0,250,[86]
7,Canadian Mennonite University,Winnipeg,Manitoba,English,1944,600,0.0,600,[56]
8,Kingswood University,Sussex,New Brunswick,English,1945,300,0.0,300,[87][needs update]
9,Crandall University,Moncton,New Brunswick,English,1949,685,0.0,685,[88][needs update]


To combine the two dataframes, we first adjust the format of df_public so that it is consistent with that of df_private.

In [5]:
# Drop the second header row of the column MultiIndex
df_public.columns = df_public.columns.droplevel(1)

# Use the same column names as df_private
df_public.columns = df_private.columns

df_public.head()

Unnamed: 0,Name,City,Province,Language,Established,Undergraduates,Post-graduates,Total students,Notes
0,Alberta University of the Arts,Calgary,Alberta,English,1926,,,1323,
1,Athabasca University,"Athabasca, Calgary, Edmonton",Alberta,English,1970,36240.0,3460.0,39700,[38]
2,MacEwan University,Edmonton,Alberta,English,1971,18897.0,0.0,18897,[39]
3,Mount Royal University,Calgary,Alberta,English,1910,24768.0,0.0,24768,[40]
4,University of Alberta,"Edmonton, Camrose, Calgary",Alberta,Bilingual,1906,31904.0,7598.0,39502,[41]


Now we can concatenate the two lists and clean them together.

In [6]:
df_univ = pd.concat([df_public, df_private], axis=0)
df_univ

Unnamed: 0,Name,City,Province,Language,Established,Undergraduates,Post-graduates,Total students,Notes
0,Alberta University of the Arts,Calgary,Alberta,English,1926,,,1323,
1,Athabasca University,"Athabasca, Calgary, Edmonton",Alberta,English,1970,36240,3460.0,39700,[38]
2,MacEwan University,Edmonton,Alberta,English,1971,18897,0.0,18897,[39]
3,Mount Royal University,Calgary,Alberta,English,1910,24768,0.0,24768,[40]
4,University of Alberta,"Edmonton, Camrose, Calgary",Alberta,Bilingual,1906,31904,7598.0,39502,[41]
...,...,...,...,...,...,...,...,...,...
11,University of Fredericton,Fredericton,New Brunswick,English,2005,,,,[59][needs update]
12,Atlantic School of Theology,Halifax,Nova Scotia,English,1971,0,124.0,124,[59]
13,Tyndale University,Toronto,Ontario,English,1894,850,0.0,850,[90]
14,Redeemer University College,Ancaster,Ontario,English,1982,955,0.0,955,


Some columns are not relevant to the location, so we drop them and clean the names of the universities for later steps.

In [7]:
# Drop columns
df_univ.drop(df_univ.columns[3:], axis=1, inplace=True)

# Reset the index
df_univ.reset_index(drop=True, inplace=True)

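# Remove the footnote markers from the names
df_univ["Name"] = [name.replace("[note 3]","") for name in df_univ["Name"]]

# Apostrophes (e.g. "The King's University") need no manual escaping:
# "\'s" is the same string as "'s" in Python, and urllib.parse.urlencode
# percent-encodes the address when the request URL is built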

df_univ.head()

Unnamed: 0,Name,City,Province
0,Alberta University of the Arts,Calgary,Alberta
1,Athabasca University,"Athabasca, Calgary, Edmonton",Alberta
2,MacEwan University,Edmonton,Alberta
3,Mount Royal University,Calgary,Alberta
4,University of Alberta,"Edmonton, Camrose, Calgary",Alberta


#### 3.1.2. Coordinates of the Universities

At this stage, we are ready to retrieve the coordinates of the universities.

In [8]:
import requests
import urllib.parse

In [9]:
lat_ls = []
lng_ls = []
address_ls = []
api_key = 42
serviceurl = "http://py4e-data.dr-chuck.net/json?"

for name, province in zip(df_univ["Name"], df_univ["Province"]):
    
    # Define the location of the university
    lc = "{}, {}, Canada".format(name, province)
    print("Searching for {} ......".format(name))
    
    # Set the parameters to be sent together with the url
    parms = dict()
    parms["address"] = lc
    parms['key'] = api_key
    url = serviceurl + urllib.parse.urlencode(parms)

    # Named `data` so that it cannot be confused with the json module
    data = requests.get(url).json()
    
    # Treat anything other than an OK status with results as a failure
    if data.get('status') != 'OK' or not data.get('results'):
        print('==== Failure To Retrieve ====')
        address_ls.append(np.nan)
        lat_ls.append(np.nan)
        lng_ls.append(np.nan)
        continue
    
    latlng = data['results'][0]['geometry']['location']
    address = data['results'][0]['formatted_address']
    print("Result: ", latlng, address)
    address_ls.append(address)
    lat_ls.append(latlng['lat'])
    lng_ls.append(latlng['lng'])

print("Done!")

Searching for Alberta University of the Arts ......
Result:  {'lat': 51.0615707, 'lng': -114.0920983} 1407 14 Ave NW, Calgary, AB T2N 4R3, Canada
Searching for Athabasca University ......
Result:  {'lat': 54.714955, 'lng': -113.3085451} 1 University Dr, Athabasca, AB T9S 3A3, Canada
Searching for MacEwan University ......
Result:  {'lat': 53.5470544, 'lng': -113.506372} 10700 104 Ave NW, Edmonton, AB T5J 4S2, Canada
Searching for Mount Royal University ......
Result:  {'lat': 51.01101, 'lng': -114.129778} 30 Mt Royal Cir SW, Calgary, AB T3E 7C9, Canada
Searching for University of Alberta ......
Result:  {'lat': 53.5232189, 'lng': -113.5263186} 116 St & 85 Ave, Edmonton, AB T6G 2R3, Canada
Searching for University of Calgary ......
Result:  {'lat': 51.159473, 'lng': -114.214827} 11877 85 St NW, Calgary, AB T3R 1J3, Canada
Searching for University of Lethbridge ......
Result:  {'lat': 49.6786156, 'lng': -112.8601177} 4401 University Dr W, Lethbridge, AB T1K 3M4, Canada
Searching for Capi

In [10]:
df_univ['Address'] = address_ls
df_univ['Latitude'] = lat_ls
df_univ['Longitude'] = lng_ls
df_univ

Unnamed: 0,Name,City,Province,Address,Latitude,Longitude
0,Alberta University of the Arts,Calgary,Alberta,"1407 14 Ave NW, Calgary, AB T2N 4R3, Canada",51.061571,-114.092098
1,Athabasca University,"Athabasca, Calgary, Edmonton",Alberta,"1 University Dr, Athabasca, AB T9S 3A3, Canada",54.714955,-113.308545
2,MacEwan University,Edmonton,Alberta,"10700 104 Ave NW, Edmonton, AB T5J 4S2, Canada",53.547054,-113.506372
3,Mount Royal University,Calgary,Alberta,"30 Mt Royal Cir SW, Calgary, AB T3E 7C9, Canada",51.011010,-114.129778
4,University of Alberta,"Edmonton, Camrose, Calgary",Alberta,"116 St & 85 Ave, Edmonton, AB T6G 2R3, Canada",53.523219,-113.526319
...,...,...,...,...,...,...
86,University of Fredericton,Fredericton,New Brunswick,"3 Bailey Dr, Fredericton, NB E3B 5A3, Canada",45.945570,-66.640826
87,Atlantic School of Theology,Halifax,Nova Scotia,"660 Francklyn St, Halifax, NS B3H 3B6, Canada",44.626826,-63.580503
88,Tyndale University,Toronto,Ontario,"3377 Bayview Ave, North York, ON M2M 3S4, Canada",43.796851,-79.392186
89,Redeemer University College,Ancaster,Ontario,"777 Garner Rd E, Ancaster, ON L9K 1J4, Canada",43.208677,-79.949140


We can see that we failed to retrieve the coordinates of Trent University, so we remove that record.

In [11]:
df_univ.dropna(axis=0, how='any', inplace=True)

df_univ

Unnamed: 0,Name,City,Province,Address,Latitude,Longitude
0,Alberta University of the Arts,Calgary,Alberta,"1407 14 Ave NW, Calgary, AB T2N 4R3, Canada",51.061571,-114.092098
1,Athabasca University,"Athabasca, Calgary, Edmonton",Alberta,"1 University Dr, Athabasca, AB T9S 3A3, Canada",54.714955,-113.308545
2,MacEwan University,Edmonton,Alberta,"10700 104 Ave NW, Edmonton, AB T5J 4S2, Canada",53.547054,-113.506372
3,Mount Royal University,Calgary,Alberta,"30 Mt Royal Cir SW, Calgary, AB T3E 7C9, Canada",51.011010,-114.129778
4,University of Alberta,"Edmonton, Camrose, Calgary",Alberta,"116 St & 85 Ave, Edmonton, AB T6G 2R3, Canada",53.523219,-113.526319
...,...,...,...,...,...,...
86,University of Fredericton,Fredericton,New Brunswick,"3 Bailey Dr, Fredericton, NB E3B 5A3, Canada",45.945570,-66.640826
87,Atlantic School of Theology,Halifax,Nova Scotia,"660 Francklyn St, Halifax, NS B3H 3B6, Canada",44.626826,-63.580503
88,Tyndale University,Toronto,Ontario,"3377 Bayview Ave, North York, ON M2M 3S4, Canada",43.796851,-79.392186
89,Redeemer University College,Ancaster,Ontario,"777 Garner Rd E, Ancaster, ON L9K 1J4, Canada",43.208677,-79.949140


In [12]:
df_univ.reset_index(drop=True, inplace=True)
df_univ

Unnamed: 0,Name,City,Province,Address,Latitude,Longitude
0,Alberta University of the Arts,Calgary,Alberta,"1407 14 Ave NW, Calgary, AB T2N 4R3, Canada",51.061571,-114.092098
1,Athabasca University,"Athabasca, Calgary, Edmonton",Alberta,"1 University Dr, Athabasca, AB T9S 3A3, Canada",54.714955,-113.308545
2,MacEwan University,Edmonton,Alberta,"10700 104 Ave NW, Edmonton, AB T5J 4S2, Canada",53.547054,-113.506372
3,Mount Royal University,Calgary,Alberta,"30 Mt Royal Cir SW, Calgary, AB T3E 7C9, Canada",51.011010,-114.129778
4,University of Alberta,"Edmonton, Camrose, Calgary",Alberta,"116 St & 85 Ave, Edmonton, AB T6G 2R3, Canada",53.523219,-113.526319
...,...,...,...,...,...,...
85,University of Fredericton,Fredericton,New Brunswick,"3 Bailey Dr, Fredericton, NB E3B 5A3, Canada",45.945570,-66.640826
86,Atlantic School of Theology,Halifax,Nova Scotia,"660 Francklyn St, Halifax, NS B3H 3B6, Canada",44.626826,-63.580503
87,Tyndale University,Toronto,Ontario,"3377 Bayview Ave, North York, ON M2M 3S4, Canada",43.796851,-79.392186
88,Redeemer University College,Ancaster,Ontario,"777 Garner Rd E, Ancaster, ON L9K 1J4, Canada",43.208677,-79.949140


Here we briefly explain why we gave up on using the `geocoder` module to retrieve the latitude and longitude of a university.

In [13]:
!conda install -c conda-forge geocoder --yes
import geocoder


In [14]:
# Get some coordinates of the universities as an example for clarification
uni_dc = {}
count = 0
for name, province in zip(df_univ["Name"], df_univ["Province"]):
    lc = "{}, {}, Canada".format(name, province)
    print("Searching for {} ......".format(name))
    g = geocoder.arcgis(lc)
    latlng = g.latlng
    print("Result: ", latlng)
    uni_dc[name] = latlng
    count += 1
    if count > 5:
        break
print("Done!")
pd.DataFrame(uni_dc).T

Searching for Alberta University of the Arts ......
Result:  [51.06438000000003, -114.09211999999997]
Searching for Athabasca University ......
Result:  [53.522890000000075, -113.52626999999995]
Searching for MacEwan University ......
Result:  [53.53948006665105, -113.49235997778297]
Searching for Mount Royal University ......
Result:  [51.01228000000003, -114.13238999999999]
Searching for University of Alberta ......
Result:  [53.522890000000075, -113.52626999999995]
Searching for University of Calgary ......
Result:  [51.07663000000008, -114.13208999999995]
Done!


Unnamed: 0,0,1
Alberta University of the Arts,51.06438,-114.09212
Athabasca University,53.52289,-113.52627
MacEwan University,53.53948,-113.49236
Mount Royal University,51.01228,-114.13239
University of Alberta,53.52289,-113.52627
University of Calgary,51.07663,-114.13209


We can see from the result that Athabasca University and the University of Alberta share the same coordinates, which cannot be right. Checking this location on the map makes it clear that [53.52289, -113.52627] is the location of the University of Alberta.

In [15]:
import folium
# Named map_ua to avoid shadowing the built-in map()
map_ua = folium.Map(location=[53.52289, -113.52627], zoom_start=14)
folium.Marker(location=[53.52289, -113.52627]).add_to(map_ua)
map_ua

Next, we add the coordinates obtained for Athabasca University in Section 3.1.2 to the map. In fact, the two universities are quite far from each other.

In [16]:
# Show the University of Alberta and the Athabasca University coordinates from Section 3.1.2
map_both = folium.Map(location=[53.52289, -113.52627], zoom_start=6)
folium.Marker(location=[53.52289, -113.52627], popup='University of Alberta').add_to(map_both)
folium.Marker(location=[54.714955, -113.308545], popup='Athabasca University').add_to(map_both)
map_both
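To put a number on "quite far", the great-circle distance between the two markers can be estimated with the haversine formula. This is a supplementary sketch rather than part of the original workflow; `haversine_km` is a hypothetical helper and the Earth radius of 6371 km is the usual mean-radius approximation.

```python
from math import asin, cos, radians, sin, sqrt

def haversine_km(lat1, lng1, lat2, lng2, radius_km=6371.0):
    """Great-circle distance between two points given in decimal degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlmb = radians(lng2 - lng1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * radius_km * asin(sqrt(a))

# Coordinates taken from the two markers above
u_alberta = (53.52289, -113.52627)
athabasca = (54.714955, -113.308545)
distance = haversine_km(*u_alberta, *athabasca)
print("Distance: {:.0f} km".format(distance))
```

The two points are well over 100 km apart, so they cannot both be the location of the same campus.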

The result tells us that looking up coordinates this way is not accurate. In fact, I checked the complete set of coordinates obtained with `geocoder` for all the universities and noticed many duplicate values.
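A check of this kind can be sketched with `DataFrame.duplicated`. The example below uses only the six `geocoder` results printed above, not the full list, so it is an illustration of the check rather than a reproduction of it.

```python
import pandas as pd

# The six geocoder results printed above, rounded as in the table
coords = pd.DataFrame(
    {
        "Latitude": [51.06438, 53.52289, 53.53948, 51.01228, 53.52289, 51.07663],
        "Longitude": [-114.09212, -113.52627, -113.49236, -114.13239, -113.52627, -114.13209],
    },
    index=[
        "Alberta University of the Arts", "Athabasca University",
        "MacEwan University", "Mount Royal University",
        "University of Alberta", "University of Calgary",
    ],
)

# keep=False marks every member of a duplicated group, not just the repeats
shared = coords[coords.duplicated(subset=["Latitude", "Longitude"], keep=False)]
print(shared)
```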

This is not the fault of the module. The reason is that I provided only the name of a university to the API rather than a formatted address. According to the documentation of `geocoder`, the address is necessary to get the correct result.

#### 3.1.3. Recommendations of Food around the Universities

After obtaining the coordinates of the universities, we use Foursquare’s API to get the recommendations in the food section.

In [17]:
import os
# Read the credentials from environment variables (placeholder names) instead of hard-coding them
CLIENT_ID = os.environ['FOURSQUARE_CLIENT_ID']
CLIENT_SECRET = os.environ['FOURSQUARE_CLIENT_SECRET']
VERSION = '20200516'

In [18]:
# Define a function to help us get the recommendations

def university_food(university, lat, lng, section = 'food', radius = 500, limit = 10):
    url = 'https://api.foursquare.com/v2/venues/explore'
    params = {'client_id': CLIENT_ID, 'client_secret': CLIENT_SECRET, 'v': VERSION,
              'll': '{},{}'.format(lat, lng), 'radius': radius, 'section': section, 'limit': limit}
    data = requests.get(url, params=params).json()
    
    try:
        df = pd.json_normalize(data['response']['groups'][0]['items'])
        # Keep only the name of the first category of each venue
        df['venue.categories'] = [info[0]['name'] for info in df['venue.categories']]
        df = df[['venue.name', 'venue.location.lat', 'venue.location.lng', 'venue.location.distance', 'venue.categories']]
        df.columns = [col.replace("."," ").capitalize() for col in df.columns]
        df.insert(0, "University", university)
    except (KeyError, IndexError, TypeError):
        # No usable recommendations: keep a row with only the university name
        df = pd.DataFrame([{"University":university}])
            
    return df

In [19]:
df_recomm = pd.DataFrame()
for university, lat, lng in zip(df_univ['Name'], df_univ['Latitude'], df_univ['Longitude']):
    print("Searching for {} ....".format(university))
    df = university_food(university, lat, lng, radius = 3000)
    df_recomm = pd.concat([df_recomm, df],axis=0)
df_recomm

Searching for Alberta University of the Arts ....
Searching for Athabasca University ....
Searching for MacEwan University ....
Searching for Mount Royal University ....
Searching for University of Alberta ....
Searching for University of Calgary ....
Searching for University of Lethbridge ....
Searching for Capilano University ....
Searching for Emily Carr University of Art and Design ....
Searching for Kwantlen Polytechnic University ....
Searching for Royal Roads University ....
Searching for Simon Fraser University ....
Searching for Thompson Rivers University ....
Searching for University of British Columbia ....
Searching for University of Victoria ....
Searching for University of the Fraser Valley ....
Searching for University of Northern British Columbia ....
Searching for Vancouver Island University ....
Searching for Brandon University ....
Searching for University College of the North ....
Searching for University of Manitoba ....
Searching for University of Winnipeg ....
Se

Unnamed: 0,University,Venue name,Venue location lat,Venue location lng,Venue location distance,Venue categories
0,Alberta University of the Arts,Jimmy's A&A Deli,51.070299,-114.092472,971.0,Mediterranean Restaurant
1,Alberta University of the Arts,Vendome Cafe,51.055138,-114.083323,943.0,Café
2,Alberta University of the Arts,Hayden Block,51.052595,-114.088226,1035.0,BBQ Joint
3,Alberta University of the Arts,Wow Chicken,51.054881,-114.085833,864.0,Korean Restaurant
4,Alberta University of the Arts,Peppino,51.052509,-114.090946,1011.0,Diner
...,...,...,...,...,...,...
5,The King's University,A&W,53.541615,-113.417087,1803.0,Fast Food Restaurant
6,The King's University,Sabu Sushi Bar,53.518032,-113.442067,1866.0,Sushi Restaurant
7,The King's University,Subway,53.525481,-113.444089,1809.0,Sandwich Place
8,The King's University,Fargo's,53.540761,-113.424622,1785.0,Steakhouse


In [20]:
df_recomm.reset_index(drop=True, inplace=True)
df_recomm

Unnamed: 0,University,Venue name,Venue location lat,Venue location lng,Venue location distance,Venue categories
0,Alberta University of the Arts,Jimmy's A&A Deli,51.070299,-114.092472,971.0,Mediterranean Restaurant
1,Alberta University of the Arts,Vendome Cafe,51.055138,-114.083323,943.0,Café
2,Alberta University of the Arts,Hayden Block,51.052595,-114.088226,1035.0,BBQ Joint
3,Alberta University of the Arts,Wow Chicken,51.054881,-114.085833,864.0,Korean Restaurant
4,Alberta University of the Arts,Peppino,51.052509,-114.090946,1011.0,Diner
...,...,...,...,...,...,...
832,The King's University,A&W,53.541615,-113.417087,1803.0,Fast Food Restaurant
833,The King's University,Sabu Sushi Bar,53.518032,-113.442067,1866.0,Sushi Restaurant
834,The King's University,Subway,53.525481,-113.444089,1809.0,Sandwich Place
835,The King's University,Fargo's,53.540761,-113.424622,1785.0,Steakhouse


In [86]:
# Save a snapshot, because the results returned by the API can change over time

seriID = np.random.randint(100,999,size=1)[0]
df_recomm.to_csv("Recommended Places_{}.csv".format(seriID), encoding="utf_8_sig")

Currently, each row is one recommendation for a specific university. To facilitate our analysis, we reshape the table so that each row is a unique university and each column holds the count of one venue type.

In [59]:
# To work on the dataset exported on May 20
df_recomm = pd.read_csv("Recommended Places@0520.csv", index_col=0)
df_recomm

Unnamed: 0,University,Venue name,Venue location lat,Venue location lng,Venue location distance,Venue categories
0,Alberta University of the Arts,Jimmy's A&A Deli,51.070299,-114.092472,971.0,Mediterranean Restaurant
1,Alberta University of the Arts,Vendome Cafe,51.055138,-114.083323,943.0,Café
2,Alberta University of the Arts,Hayden Block,51.052595,-114.088226,1035.0,BBQ Joint
3,Alberta University of the Arts,Wow Chicken,51.054881,-114.085833,864.0,Korean Restaurant
4,Alberta University of the Arts,Peppino,51.052509,-114.090946,1011.0,Diner
...,...,...,...,...,...,...
851,The King's University,A&W,53.541615,-113.417087,1803.0,Fast Food Restaurant
852,The King's University,Sabu Sushi Bar,53.518032,-113.442067,1866.0,Sushi Restaurant
853,The King's University,Subway,53.525481,-113.444089,1809.0,Sandwich Place
854,The King's University,Fargo's,53.540761,-113.424622,1785.0,Steakhouse


In [60]:
# Get dummies by venue categories
df_recomm_dummies = pd.get_dummies(data=df_recomm, columns=['Venue categories'], dummy_na=True, prefix="", prefix_sep="")

# Keep only the dummy columns, i.e. those that were not already in df_recomm
category_cols = [col for col in df_recomm_dummies.columns if col not in df_recomm.columns]

# Calculate the number of each venue type of each university
df_recomm_count = df_recomm_dummies.groupby('University', sort=False)[category_cols].sum()

df_recomm_count.head()

Unnamed: 0_level_0,Afghan Restaurant,American Restaurant,Asian Restaurant,BBQ Joint,Bagel Shop,Bakery,Belgian Restaurant,Bistro,Brazilian Restaurant,Breakfast Spot,...,Sushi Restaurant,Taco Place,Tapas Restaurant,Thai Restaurant,Theme Restaurant,Turkish Restaurant,Vegetarian / Vegan Restaurant,Vietnamese Restaurant,Wings Joint,nan
University,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Alberta University of the Arts,0,0,0,1,0,1,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
Athabasca University,0,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
MacEwan University,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,1,0,0,0
Mount Royal University,0,0,0,0,0,1,0,0,0,1,...,0,0,0,0,0,0,0,1,0,0
University of Alberta,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [61]:
# Calculate the percentage of each venue type of each university
df_recomm_perc = df_recomm_count.div(df_recomm_count.sum(axis=1), axis='index', fill_value=None)
df_recomm_perc.head()

Unnamed: 0_level_0,Afghan Restaurant,American Restaurant,Asian Restaurant,BBQ Joint,Bagel Shop,Bakery,Belgian Restaurant,Bistro,Brazilian Restaurant,Breakfast Spot,...,Sushi Restaurant,Taco Place,Tapas Restaurant,Thai Restaurant,Theme Restaurant,Turkish Restaurant,Vegetarian / Vegan Restaurant,Vietnamese Restaurant,Wings Joint,nan
University,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Alberta University of the Arts,0.0,0.0,0.0,0.1,0.0,0.1,0.0,0.0,0.0,0.0,...,0.1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Athabasca University,0.0,0.166667,0.166667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
MacEwan University,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.1,0.0,0.0,0.1,0.0,0.0,0.0
Mount Royal University,0.0,0.0,0.0,0.0,0.0,0.1,0.0,0.0,0.0,0.1,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.1,0.0,0.0
University of Alberta,0.0,0.0,0.0,0.1,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


With the two tables df_recomm_count and df_recomm_perc, we can proceed with our analysis.

### 3.2. Exploratory Data Analysis

Before carrying out the clustering, we take two universities and obtain the recommendations nearby to get some insights.

In [62]:
for university in df_recomm_count.iloc[0:2,:].iterrows():
    print("------- {} ---------".format(university[0]))
    venues = university[1].sort_values(ascending=False)
    venues = venues[venues>0]
    print(venues)

------- Alberta University of the Arts ---------
Gastropub                   1
Diner                       1
BBQ Joint                   1
Italian Restaurant          1
Bakery                      1
Japanese Restaurant         1
Korean Restaurant           1
Sushi Restaurant            1
Café                        1
Mediterranean Restaurant    1
Name: Alberta University of the Arts, dtype: uint8
------- Athabasca University ---------
Sandwich Place         2
American Restaurant    1
Asian Restaurant       1
Gastropub              1
Burger Joint           1
Name: Athabasca University, dtype: uint8


In [63]:
for university in df_recomm_perc.iloc[0:2,:].iterrows():
    print("------- {} ---------".format(university[0]))
    venues = university[1].sort_values(ascending=False)
    venues = venues[venues>0]
    print(venues)

------- Alberta University of the Arts ---------
Gastropub                   0.1
Diner                       0.1
BBQ Joint                   0.1
Italian Restaurant          0.1
Bakery                      0.1
Japanese Restaurant         0.1
Korean Restaurant           0.1
Sushi Restaurant            0.1
Café                        0.1
Mediterranean Restaurant    0.1
Name: Alberta University of the Arts, dtype: float64
------- Athabasca University ---------
Sandwich Place         0.333333
American Restaurant    0.166667
Asian Restaurant       0.166667
Gastropub              0.166667
Burger Joint           0.166667
Name: Athabasca University, dtype: float64


From these two outcomes, we can see that the counts and the percentages each carry useful information. Each venue type near Alberta University of the Arts has a lower percentage than those near Athabasca University, but that is because more distinct food options are available near Alberta University of the Arts.

Still, we will use the percentages to train our model. The counts are discrete, and the number of venues returned by the API varies from one university to another. Using percentages puts all universities on the same scale, so we compare the mix of food options rather than the number of venues returned, which should make the clusters more meaningful.
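The effect can be seen in a small numerical sketch. The three count vectors below are hypothetical; they compare Euclidean distances, which K-Means is based on, before and after converting counts to shares.

```python
import numpy as np

# Hypothetical counts for three categories, e.g. [Café, Diner, Sushi Restaurant]
small_campus = np.array([1, 1, 2])    # 4 venues returned
large_campus = np.array([2, 2, 4])    # 8 venues, same mix
other_campus = np.array([4, 0, 0])    # 4 venues, different mix

def shares(counts):
    """Convert raw counts to proportions that sum to 1."""
    return counts / counts.sum()

def dist(a, b):
    """Euclidean distance, the measure K-Means relies on."""
    return float(np.linalg.norm(a - b))

print("counts :", dist(small_campus, large_campus), dist(small_campus, other_campus))
print("shares :", dist(shares(small_campus), shares(large_campus)),
      dist(shares(small_campus), shares(other_campus)))
```

With raw counts, the two campuses with the same mix are still some distance apart simply because one returned more venues. With shares, their distance is zero, and only the campus with a different mix stands apart.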

In addition, recall that we meant to use the category name to guess the main cuisine of a venue. However, it turns out that some venues only have vague category names such as Diner or Restaurant. Even so, I think it is still worthwhile to proceed.

### 3.3. Model training

In the previous steps we computed the percentage of each venue category near each university. At this stage, we use the K-Means clustering algorithm to assign a cluster label to each university.

In [64]:
from sklearn.cluster import KMeans

In [65]:
# Use 6 clusters, and fix the random seed at 0 so that the result does not change every time we rerun the program
n_clusters = 6
km = KMeans(n_clusters = n_clusters, init='k-means++', random_state = 0)

In [66]:
km.fit(df_recomm_perc)
labels = km.labels_
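The effect of fixing `random_state` can be illustrated with a small sketch. The data here are random share vectors standing in for `df_recomm_perc`, and the helper `fit_labels` is hypothetical; `n_init` is set explicitly only so that the behaviour does not depend on the installed scikit-learn version's default.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical stand-in for df_recomm_perc: 30 rows of category shares,
# each row summing to 1.
rng = np.random.default_rng(0)
X = rng.dirichlet(alpha=np.ones(5), size=30)

def fit_labels(seed):
    """Fit K-Means with a given seed and return the cluster labels."""
    km = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=seed)
    return km.fit(X).labels_

labels_a = fit_labels(0)
labels_b = fit_labels(0)

# Same data and same seed: the labels should match on every rerun.
print(np.array_equal(labels_a, labels_b))
```

With a different seed the partition may or may not change, and even an identical partition can come back with the label numbers permuted, so cluster numbers should not be compared across seeds.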

In [67]:
df_univ['Label'] = labels
df_univ.head()

Unnamed: 0,Name,City,Province,Address,Latitude,Longitude,Label
0,Alberta University of the Arts,Calgary,Alberta,"1407 14 Ave NW, Calgary, AB T2N 4R3, Canada",51.061571,-114.092098,5
1,Athabasca University,"Athabasca, Calgary, Edmonton",Alberta,"1 University Dr, Athabasca, AB T9S 3A3, Canada",54.714955,-113.308545,4
2,MacEwan University,Edmonton,Alberta,"10700 104 Ave NW, Edmonton, AB T5J 4S2, Canada",53.547054,-113.506372,5
3,Mount Royal University,Calgary,Alberta,"30 Mt Royal Cir SW, Calgary, AB T3E 7C9, Canada",51.01101,-114.129778,1
4,University of Alberta,"Edmonton, Camrose, Calgary",Alberta,"116 St & 85 Ave, Edmonton, AB T6G 2R3, Canada",53.523219,-113.526319,4


## PART 4 RESULTS

Now we can display the names of the universities in each group.

In [68]:
# Group by a single key so that each group name is a plain label, not a tuple
for label, table in df_univ.groupby('Label'):
    print("-------- Label {} --------".format(label))
    print(table[['Name','Province']])

-------- Label 0 --------
                                              Name          Province
13                  University of British Columbia  British Columbia
14                          University of Victoria  British Columbia
21                          University of Winnipeg          Manitoba
23                           St. Thomas University     New Brunswick
24                     University of New Brunswick     New Brunswick
25                           Université de Moncton     New Brunswick
27                               Acadia University       Nova Scotia
37                             Carleton University           Ontario
43                  Queen's University at Kingston           Ontario
44                Royal Military College of Canada           Ontario
45                              Ryerson University           Ontario
46                Université de l'Ontario français           Ontario
47                            University of Guelph           Ontario
50      

We can also count how many universities fall in each group.

In [69]:
df_univ.groupby(by=['Label']).count()

Label,Name,City,Province,Address,Latitude,Longitude
0,24,24,24,24,24,24
1,21,21,21,21,21,21
2,1,1,1,1,1,1
3,16,16,16,16,16,16
4,12,12,12,12,12,12
5,16,16,16,16,16,16


Only one university was assigned to group 2; the other universities are spread fairly evenly across the remaining five groups, with between 12 and 24 universities each.
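To put a number on how evenly the universities are spread, the group sizes can be turned into shares of the total. This is a small sketch using the counts from the table above; the names `sizes` and `summary` are illustrative.

```python
import pandas as pd

# Cluster sizes copied from the count table above.
sizes = pd.Series({0: 24, 1: 21, 2: 1, 3: 16, 4: 12, 5: 16}, name="universities")
sizes.index.name = "Label"

total = sizes.sum()
share = (sizes / total).round(3)

summary = pd.DataFrame({"universities": sizes, "share": share})
print(summary)
print("total:", total)
```

In the notebook itself, `df_univ['Label'].value_counts().sort_index()` would give the same sizes directly from the labelled data frame.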

Now we summarize the clustering result according to the provinces.

In [70]:
df_label_prov = df_univ.groupby(by=['Province', 'Label']).count()
df_label_prov = df_label_prov[['Name']]
df_label_prov = df_label_prov.pivot_table(index='Label',columns='Province', fill_value=0)
df_label_prov.columns = df_label_prov.columns.droplevel(level=0)
df_label_prov

Label,Alberta,British Columbia,Manitoba,New Brunswick,Newfoundland and Labrador,Nova Scotia,Ontario,Prince Edward Island,Quebec,Saskatchewan
0,0,2,1,4,0,1,8,0,8,0
1,1,5,2,0,1,1,8,0,2,1
2,0,0,1,0,0,0,0,0,0,0
3,1,0,1,3,0,2,4,1,3,1
4,4,3,1,1,0,1,1,0,1,0
5,2,6,0,0,0,4,3,0,1,0


Allowing for the total number of universities in each province, seven of the ten provinces contain universities from three or more clusters; the exceptions are the provinces with only one or two universities. This suggests that a smaller area unit, such as the neighbourhood or block, matters more than the province.
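The number of clusters present in each province can be read off the table programmatically. The sketch below re-enters the table above by hand (the names `table` and `summary` are illustrative) and counts the non-zero cells per province.

```python
import pandas as pd

provinces = ["Alberta", "British Columbia", "Manitoba", "New Brunswick",
             "Newfoundland and Labrador", "Nova Scotia", "Ontario",
             "Prince Edward Island", "Quebec", "Saskatchewan"]

# Rows are cluster labels 0-5, copied from the table above.
table = pd.DataFrame(
    [
        [0, 2, 1, 4, 0, 1, 8, 0, 8, 0],
        [1, 5, 2, 0, 1, 1, 8, 0, 2, 1],
        [0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
        [1, 0, 1, 3, 0, 2, 4, 1, 3, 1],
        [4, 3, 1, 1, 0, 1, 1, 0, 1, 0],
        [2, 6, 0, 0, 0, 4, 3, 0, 1, 0],
    ],
    columns=provinces,
)
table.index.name = "Label"

# For each province: how many universities, and how many distinct clusters.
summary = pd.DataFrame({
    "universities": table.sum(axis=0),
    "clusters_present": (table > 0).sum(axis=0),
})
print(summary)
```

Provinces with many universities (Ontario, Quebec, British Columbia) span four or five clusters, which is what we would expect if the province itself has little influence on the label.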

With 90 universities in total, a map is the clearest way to present the findings. On the map generated with `folium`, each of the six colours represents a different cluster.

In [71]:
# Define the colors to be used for different labels

from matplotlib import cm
from matplotlib import colors

colors_array = cm.autumn(np.linspace(0, 1, n_clusters))
color_ls = [colors.rgb2hex(i) for i in colors_array]

In [82]:
# Determine the central point of the map
lat_sta = df_univ['Latitude'].mean()
lng_sta = df_univ['Longitude'].mean()

# Initialize the map
univ_map = folium.Map(location=[lat_sta, lng_sta], zoom_start=4)

# Add circle marks to indicate the location of the universities
for university, lat, lng, label in zip(df_univ['Name'], df_univ['Latitude'], df_univ['Longitude'], df_univ['Label']):
    folium.CircleMarker(
        location=[lat, lng],
        # parse_html is an option of folium.Popup, not of CircleMarker
        popup=folium.Popup("{}, Cluster {}".format(university.replace("'", " "), label), parse_html=True),
        radius=12,
        stroke=True,
        color=color_ls[label],
        weight=1,
        opacity=0.8,
        fill=True,
    ).add_to(univ_map)
    # print(university, lat, lng, label)

univ_map

In [88]:
univ_map.save("University Food Map_{}.html".format(seriID))

We cannot observe any clear geographic pattern in the clustering. Most provinces have universities from several clusters, which is consistent with what we concluded from the province distribution table. Therefore, our next step is to go back to our data frames and see which venue categories are most common in each group.

In [74]:
# Add labels to the recommendations in counts
df_recomm_count.insert(0,"Label",labels)

In [75]:
# Group by labels
df_recomm_countsum = df_recomm_count.groupby(by=['Label']).sum()

df_recomm_countsum

Label,Afghan Restaurant,American Restaurant,Asian Restaurant,BBQ Joint,Bagel Shop,Bakery,Belgian Restaurant,Bistro,Brazilian Restaurant,Breakfast Spot,...,Sushi Restaurant,Taco Place,Tapas Restaurant,Thai Restaurant,Theme Restaurant,Turkish Restaurant,Vegetarian / Vegan Restaurant,Vietnamese Restaurant,Wings Joint,nan
0,0,4,2,1,1,13,0,2,0,5,...,2,0,2,1,1,1,11,2,0,0
1,1,6,5,0,0,8,1,1,1,10,...,8,4,2,2,0,0,5,6,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
3,0,3,1,0,0,1,0,0,0,7,...,2,0,0,1,0,0,0,3,5,0
4,0,5,2,2,0,0,0,0,0,1,...,6,0,0,0,0,0,0,0,1,0
5,0,4,2,4,2,6,0,0,0,9,...,13,0,0,5,0,6,2,2,1,0


In [76]:
# Sort by the count of each category and print out

for label in df_recomm_countsum.index:
    # Select the row by its label with .loc rather than by position
    df_recomm_sort = df_recomm_countsum.loc[label,:].sort_values(ascending=False)
    print("-------- Label {} --------".format(label))
    print(df_recomm_sort.head(10))

-------- Label 0 --------
Café                             64
Restaurant                       36
Bakery                           13
Gastropub                        12
Vegetarian / Vegan Restaurant    11
Pizza Place                      11
French Restaurant                10
Italian Restaurant                7
Japanese Restaurant               6
Breakfast Spot                    5
Name: 0, dtype: uint8
-------- Label 1 --------
Restaurant                 29
Sandwich Place             15
Café                       13
Pizza Place                10
Breakfast Spot             10
Sushi Restaurant            8
Bakery                      8
New American Restaurant     7
Indian Restaurant           7
Italian Restaurant          6
Name: 1, dtype: uint8
-------- Label 2 --------
nan                            1
Fast Food Restaurant           0
Chinese Restaurant             0
Creperie                       0
Deli / Bodega                  0
Diner                          0
Donut Shop          

In [77]:
df_label_count = df_univ.groupby(by=['Label']).count()
count_ls = df_label_count['Name'].to_list()

for label, count in zip(df_recomm_countsum.index, count_ls):
    # Select the row by its label with .loc rather than by position
    df_recomm_sort = df_recomm_countsum.loc[label,:].sort_values(ascending=False)
    df_recomm_avgcount = df_recomm_sort/count
    print("-------- Label {} --------".format(label))
    print(df_recomm_avgcount.head(10))

-------- Label 0 --------
Café                             2.666667
Restaurant                       1.500000
Bakery                           0.541667
Gastropub                        0.500000
Vegetarian / Vegan Restaurant    0.458333
Pizza Place                      0.458333
French Restaurant                0.416667
Italian Restaurant               0.291667
Japanese Restaurant              0.250000
Breakfast Spot                   0.208333
Name: 0, dtype: float64
-------- Label 1 --------
Restaurant                 1.380952
Sandwich Place             0.714286
Café                       0.619048
Pizza Place                0.476190
Breakfast Spot             0.476190
Sushi Restaurant           0.380952
Bakery                     0.380952
New American Restaurant    0.333333
Indian Restaurant          0.333333
Italian Restaurant         0.285714
Name: 1, dtype: float64
-------- Label 2 --------
nan                            1.0
Fast Food Restaurant           0.0
Chinese Restaurant      

## PART 5 DISCUSSION

We summarize the features of each group as follows.

* Label 0

Café dominates the top 10 venue categories of this group. There are 64 cafés recommended in total, an average of 2.67 cafés near each university in this group. Both numbers are much higher than in the other groups, so the share of cafés was probably an important factor when K-Means assigned a university to cluster 0.

This group is also the only one with Vegetarian / Vegan Restaurant on its list, which is good news for vegetarians. In addition, no fast food was recommended, so overall this looks like a relatively healthy group.

The foreign cuisines on the list are French Restaurant, Italian Restaurant and Japanese Restaurant.

The universities in this group include University of British Columbia, McGill University, Université de Montréal and 21 other universities.

* Label 1

Restaurant tops the list with 29 venues in total, well ahead of every other category. Since this generic category tells us nothing about cuisine, we skip it. Next come Sandwich Place and Café. Beyond those, the counts of the remaining categories on the list are quite even.

The foreign cuisines on the list are Sushi Restaurant, New American Restaurant, Indian Restaurant and Italian Restaurant.

The universities in this group include University of Manitoba, University of Waterloo, York University and 18 other universities.

* Label 2

Only Canadian Mennonite University was assigned to this group; no recommendations were available for it.

* Label 3

The top category in this group is again Restaurant, which tells us nothing. Other than that, this group has plenty of fast-food venues and almost no foreign restaurants. On average there are 1.4 fast-food restaurants, 0.8 sandwich places and 0.3 wings joints per university. In total, about 40 venues of these kinds were recommended.

Greek Restaurant is the only foreign cuisine on the list.

The universities in this group include University of Calgary, Université de Sherbrooke, and 14 other universities.

* Label 4

This group, group 2 and group 5 are the only ones where Restaurant is not among the top 10 categories.

Similar to group 3, the universities in this group received many fast-food recommendations. The difference is that the most common category is Fast Food Restaurant in group 3 but Sandwich Place in group 4. On average there are 1.4 sandwich places, 1.2 fast-food restaurants and 0.5 burger joints per university. In total, about 37 venues of these kinds were recommended.

The foreign cuisines on the list are Sushi Restaurant, American Restaurant and Italian Restaurant.

The universities in this group include University of Alberta, Simon Fraser University and 10 other universities.

* Label 5

What is special about this group is that it has the most diverse recommendations. The foreign cuisines on the list are Sushi Restaurant, Italian Restaurant, Mexican Restaurant, Turkish Restaurant, Japanese Restaurant and Indian Restaurant, six in total.

Another feature of this group is that no category averages more than one venue per university, so no single venue category dominates the list.

The universities in this group include Dalhousie University, McMaster University and 14 other universities.
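Statements such as "the share of cafés drives cluster 0" can be checked by comparing the mean share of each category across clusters. The following is a minimal sketch on made-up data: `shares` stands in for `df_recomm_perc` and `labels` for `km.labels_`, and the four universities are hypothetical. For K-Means at convergence, these per-cluster means coincide with the cluster centroids.

```python
import pandas as pd

# Hypothetical share table standing in for df_recomm_perc, with labels
# standing in for km.labels_.
shares = pd.DataFrame(
    {
        "Café": [0.6, 0.5, 0.1, 0.0],
        "Fast Food Restaurant": [0.0, 0.1, 0.5, 0.6],
        "Sushi Restaurant": [0.4, 0.4, 0.4, 0.4],
    },
    index=["Univ A", "Univ B", "Univ C", "Univ D"],
)
labels = pd.Series([0, 0, 1, 1], index=shares.index, name="Label")

# Mean share of every category within each cluster.
cluster_means = shares.groupby(labels).mean()

# Categories whose mean share differs most between clusters are the ones
# that separate them; a category equally common everywhere does not.
spread = cluster_means.max() - cluster_means.min()

print(cluster_means)
print(spread.sort_values(ascending=False))
```

In this toy example the sushi share is identical in both clusters, so it contributes nothing to the separation, while the café and fast-food shares account for all of it.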


## PART 6 CONCLUSION

Overall, it seems that finding good food around the universities is not a serious problem for students. The surroundings of almost every university offer enough options to choose from.

Based on the clustering result, we can also see that the most common venue categories differ considerably between groups. In general, restaurants, fast-food restaurants, pizza places and cafés are the easiest to find nearby.

In addition, there are many foreign restaurants in the neighbourhood. Therefore, international students and domestic students who would like to have something different can all find good places to go.

Some categories we did not mention in the last section, such as gastropubs, bakeries and breakfast spots, are also great options.

In summary, we hope this project helps you understand what the food map around the universities looks like, and perhaps inspires similar or other interesting ideas.

One caveat: the features of each group are summarized in aggregate. Depending on how well the clustering algorithm performs, an individual university may not share all the features of its group.
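One way to see which universities fit their group poorly is to measure how far each one lies from its cluster's centroid. This is a sketch under stated assumptions rather than part of the original analysis: the share vectors, labels and university names below are made up, and in the notebook they would come from `df_recomm_perc.values` and `km.labels_`.

```python
import numpy as np

# Hypothetical share vectors (one row per university) and cluster labels.
X = np.array([
    [0.6, 0.0, 0.4],
    [0.5, 0.1, 0.4],
    [0.1, 0.5, 0.4],
    [0.0, 0.6, 0.4],
    [0.3, 0.3, 0.4],   # sits between the two groups
])
labels = np.array([0, 0, 1, 1, 1])
names = ["Univ A", "Univ B", "Univ C", "Univ D", "Univ E"]

# Centroid of each cluster = mean of its member rows
# (labels are assumed to be 0..K-1, as K-Means produces).
centroids = np.vstack([X[labels == k].mean(axis=0) for k in np.unique(labels)])

# Euclidean distance from every university to its own cluster's centroid.
dist = np.linalg.norm(X - centroids[labels], axis=1)

# List universities from least to most typical of their group.
for name, d in sorted(zip(names, dist), key=lambda pair: -pair[1]):
    print(f"{name}: {d:.3f}")
```

Universities at the top of this list are the ones for which the group summary in the discussion is least reliable.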
