###### Note: To view this project and its maps properly, open https://nbviewer.jupyter.org/, paste in the GitHub address of this project, and click Go.

<a href="https://www.bigdatauniversity.com"><img src="https://ibm.box.com/shared/static/cw2c7r3o20w9zn8gkecaeyjhgw3xdgbj.png" width="400" align="center"></a>

<h1 align="center"><font size="5">ML Capstone Project</font></h1>


## Coursera Applied Data Science Capstone

### Project Objective: Be creative and come up with an idea or problem to solve using location data.

#### 1. The problem and a discussion of the background
In the course of preparing and submitting responses to Weeks 1 - 3 of this Capstone Project, an intriguing anomaly in the Foursquare location data was discovered. In short, across the whole City of Toronto, with its 103 neighborhoods, only one church, in one neighborhood, appeared among all of the unique venue categories obtained from Foursquare. Canada, in general, has a declining Christian influence (see https://www.pewresearch.org/fact-tank/2019/07/01/5-facts-about-religion-in-canada/), but it would still be remarkable if a city the size of Toronto had only one church. In other words, does the Foursquare data fail to track church venues, or has the secularization of the city overwhelmed the Christian Church? A quick Google search confirmed that there are many churches in Toronto; they are simply not captured in the Foursquare data. With this finding in hand, the question arises: 'How can the decline of Christian influence in Toronto be reversed or slowed?' If the answer were simply 'move to Toronto and find a church in a neighborhood that facilitates sharing the Gospel', then the analysis below would be necessary, but not sufficient.

More to the point, the problem to solve in this project is to use Toronto location data to show any Christian individual or organization which churches have relatively more venues nearby, such as coffee shops and cafes, where relationships can develop, in an attempt to reverse or slow the decline of Christian influence in Toronto. In other words, this project seeks the answer to two questions:

A. Given the existing churches, which church is best located to take advantage of venues (cafes and coffee houses) where people can meet and discuss spiritual topics in order to share the gospel of Jesus Christ?

B. If a group of Christians were interested in starting a new church, which neighborhood would be an optimal choice in terms of nearby venues (cafes and coffee houses) that facilitate conversation and relationship building?

As there are two questions to answer, there are two Parts (A & B) to this project, with two Jupyter notebooks submitted.

#### 2. The data and how it will be used to solve the problem
The Foursquare data needs to be enhanced with an outside source of church venue data.

1. Download and save the HTML page 'Toronto Canada Church Directory Churches in Toronto Elevate Christian Network.html'.
2. Use BeautifulSoup to scrape the locally saved HTML file for church address data.
3. Use the list of addresses as input to obtain each venue's latitude and longitude.
4. Manually add the Venue Category 'Church', which is the same category used in the Foursquare data.
5. Once the Foursquare and church directory website data are combined, produce a map of Toronto which identifies church vs. cafe/coffee house locations.

#### Part A
6. Define and apply a function that returns the closest centroid (church) for each point (venue, i.e. cafe/coffee house). We will use NumPy broadcasting to do this. Refer to the blog https://flothesof.github.io/k-means-numpy.html for an explanation of how NumPy is used to identify which venue is assigned to which centroid.
7. Create a table of churches sorted in descending order, so that the church with the most cafes/coffee houses nearby is at the top.
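As a minimal sketch of step 6, the function below assigns each point to its nearest centroid with NumPy broadcasting, in the spirit of the blog post cited above. The coordinates are hypothetical and only illustrate the shapes involved. Treating latitude/longitude degrees as plane coordinates is an approximation assumed here for short distances within one city.

```python
import numpy as np

def closest_centroid(points, centroids):
    """Return, for each point, the index of the nearest centroid."""
    # centroids[:, np.newaxis] has shape (n_centroids, 1, 2); subtracting it
    # from points of shape (n_points, 2) broadcasts to (n_centroids, n_points, 2).
    diff = points - centroids[:, np.newaxis]
    distances = np.sqrt((diff ** 2).sum(axis=2))  # shape (n_centroids, n_points)
    return np.argmin(distances, axis=0)

# Hypothetical (latitude, longitude) pairs, for illustration only
churches = np.array([[43.65, -79.37], [43.77, -79.55]])
cafes = np.array([[43.651, -79.371], [43.652, -79.380], [43.768, -79.540]])

labels = closest_centroid(cafes, churches)
print(labels)  # one church index per cafe
```

Each entry of `labels` is the row index of the church closest to the corresponding cafe, so counting how often each index occurs gives the number of cafes assigned to each church.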

#### Part B
8. This part of the project, which answers the question of where to locate a new church, reuses steps 1 - 4 above and applies K-Means cluster analysis to the data in order to find the neighborhoods that have many cafes / coffee houses. This is the same analysis as was done in the assignment for Week 3 of this Capstone Project.
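The idea behind Part B can be sketched as follows. This is a simplified illustration, not the notebook's actual Part B code: it clusters neighborhoods on a single assumed feature (the number of cafes and coffee shops) using hypothetical venue records, and the names `venues`, `cafe_counts`, and `candidates` are introduced here only for the example.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical venue records, for illustration only
venues = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'A', 'B', 'B', 'C', 'C', 'C', 'D'],
    'Venue Category': ['Café', 'Coffee Shop', 'Café', 'Gym', 'Café',
                       'Coffee Shop', 'Café', 'Coffee Shop', 'Park'],
})

cafe_categories = ['Café', 'Coffee Shop']
# Count cafes / coffee shops per neighborhood (0 where there are none)
cafe_counts = (
    venues.assign(is_cafe=venues['Venue Category'].isin(cafe_categories))
          .groupby('Neighborhood')['is_cafe'].sum()
          .astype(int)
          .rename('Cafe Count')
          .to_frame()
)

# Cluster the neighborhoods on the single feature 'Cafe Count'
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(cafe_counts)
cafe_counts['Cluster'] = kmeans.labels_

# The cluster whose centre has the highest cafe count holds the candidates
best_cluster = kmeans.cluster_centers_[:, 0].argmax()
candidates = cafe_counts[cafe_counts['Cluster'] == best_cluster]
print(candidates.sort_values('Cafe Count', ascending=False))
```

With real data, the number of clusters and the choice of features would need to be tuned, for example with the silhouette score.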


In [1]:
import numpy as np # library to handle data in a vectorized manner
import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

!conda install -c conda-forge geopy --yes
from geopy.geocoders import Nominatim # module to convert an address into latitude and longitude values

import requests # library to handle requests

from pandas import json_normalize  # transform a JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

#! pip install folium==0.5.0
import folium # map rendering library
from folium.plugins import MarkerCluster

from bs4 import BeautifulSoup


# libraries for displaying images
from IPython.display import Image 
from IPython.core.display import HTML 
    



import geocoder
import matplotlib
import itertools
import matplotlib.pyplot as plt
from matplotlib.ticker import NullFormatter
import matplotlib.ticker as ticker
from sklearn import preprocessing
from mpl_toolkits.mplot3d import Axes3D
%matplotlib inline

#print('Folium installed')
print('Libraries imported.')


Collecting package metadata (repodata.json): ...working... done
Solving environment: ...working... done

# All requested packages already installed.

Libraries imported.


In [2]:
import bs4
print(bs4.__version__)


4.9.3


In [None]:
#!conda install -c conda-forge geopy --yes
#!conda uninstall -c conda-forge geopy --yes

## 1. Scrape and Create the Dataframe
Use the Notebook to build the code to scrape the following Wikipedia page, https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M, in order to obtain the data that is in the table of postal codes and to transform the data into a pandas dataframe.
- The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood.
- Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.
- More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row with the neighborhoods separated with a comma as shown in row 11 in the above table.
- If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough.
- Clean your Notebook and add Markdown cells to explain your work and any assumptions you are making.
- In the last cell of your notebook, use the .shape method to print the number of rows of your dataframe.
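The cleaning rules listed above can be sketched on a few hypothetical rows. The frame `raw` below only mimics the shape of the scraped table; its values are made up for illustration.

```python
import pandas as pd

# Hypothetical rows mimicking the scraped table, for illustration only
raw = pd.DataFrame({
    'Postal_Code': ['M1A', 'M5A', 'M5A', 'M7A'],
    'Borough': ['Not assigned', 'Downtown Toronto', 'Downtown Toronto', "Queen's Park"],
    'Neighborhood': ['Not assigned', 'Harbourfront', 'Regent Park', 'Not assigned'],
})

# Ignore cells whose borough is 'Not assigned'
clean = raw[raw['Borough'] != 'Not assigned'].copy()

# A 'Not assigned' neighborhood takes the name of its borough
unassigned = clean['Neighborhood'] == 'Not assigned'
clean.loc[unassigned, 'Neighborhood'] = clean.loc[unassigned, 'Borough']

# Combine neighborhoods that share a postal code into one comma-separated row
clean = (clean.groupby(['Postal_Code', 'Borough'])['Neighborhood']
              .apply(', '.join)
              .reset_index())

print(clean)
print(clean.shape)
```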


### To scrape the data from the website, use the 'requests' library to fetch the HTML, then parse it with the BeautifulSoup library. Put the result in a pandas dataframe, clean it, and check its shape.


In [3]:
# get the information about Post Code of the neighborhoods of Toronto from wikipage and clean it
#Use function “prettify” to look at nested structure of HTML page
website_text = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
soup = BeautifulSoup(website_text, 'html.parser')  # the page is HTML, so use an HTML parser rather than 'xml'
#print(soup.prettify())


In [4]:
#Extract the First Table using the python 'find' command; display the table
table = soup.find('table',{'class':'wikitable sortable'})
#table

In [5]:
#Extract the Postal Codes from the HTML table
#First find all the Table Rows i.e. all the rows with data
table_rows = table.find_all('tr')
data = []  #Initialize the list to contain the rows
for row in table_rows:   #Loop through each row in the table rows list
    td=[]  #Initialize the list to contain each individual row
    for t in row.find_all('td'):  #Loop through each element in the list of data in a given row
        td.append(t.text.strip())  #append each element in the list: td.
    data.append(td) #gather each list 'td' into the 'data' list
    
#the pd.set_option simply allows us to see all the data in the dataframe
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_rows', None)
neighborhoods = pd.DataFrame(data, columns=['Postal_Code', 'Borough', 'Neighborhood'])  #Convert the list data into a Panda Dataframe

In [6]:
neighborhoods.drop(neighborhoods[neighborhoods.Borough == 'Not assigned'].index, inplace=True)
neighborhoods.reset_index(drop=True, inplace=True)

#Below, the rows are grouped by Postal_Code and Borough.  Within each group, the Neighborhood values
#are joined / concatenated with ',' so that each postal code ends up on a single row.
neighborhoods = neighborhoods.groupby(['Postal_Code','Borough'])['Neighborhood'].apply(lambda x: ','.join(x)).reset_index()
neighborhoods





Unnamed: 0,Postal_Code,Borough,Neighborhood
0,M1B,Scarborough,"Malvern, Rouge"
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"Kennedy Park, Ionview, East Birchmount Park"
7,M1L,Scarborough,"Golden Mile, Clairlea, Oakridge"
8,M1M,Scarborough,"Cliffside, Cliffcrest, Scarborough Village West"
9,M1N,Scarborough,"Birch Cliff, Cliffside West"


In [7]:
neighborhoods.shape

(103, 3)

### Get the Latitude and Longitude of the Coordinates for each Postal Code - Merge them to the Pandas DF created above

### The geocoder library failed to return any data, so the coordinates are read from a .csv file instead

In [8]:
latlong_neighborhoods = pd.read_csv("C:\\Users\\StephenVoorhees\\Documents\\_Training\\Python\\Projects\\ML-Capstone2\\Battle_of_Neighborhoods\\Geospatial_Coordinates.csv") 
neighborhoods= pd.merge(neighborhoods,latlong_neighborhoods, how= 'inner', on='Postal_Code')
neighborhoods.head()

Unnamed: 0,Postal_Code,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
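An inner merge silently drops any postal code that has no row in the coordinates file. One way to check for that, sketched here on hypothetical data (`neighborhoods_demo` and `coords_demo` are names introduced only for this example), is a left merge with pandas' `indicator` option:

```python
import pandas as pd

# Hypothetical data, for illustration only
neighborhoods_demo = pd.DataFrame({
    'Postal_Code': ['M1B', 'M1C', 'M1E'],
    'Borough': ['Scarborough'] * 3,
})
coords_demo = pd.DataFrame({
    'Postal_Code': ['M1B', 'M1C'],
    'Latitude': [43.806686, 43.784535],
    'Longitude': [-79.194353, -79.160497],
})

# A left merge keeps every postal code; indicator=True adds a '_merge'
# column that records whether a matching coordinate row was found.
checked = pd.merge(neighborhoods_demo, coords_demo,
                   how='left', on='Postal_Code', indicator=True)
missing = checked.loc[checked['_merge'] == 'left_only', 'Postal_Code'].tolist()
print(missing)  # postal codes without coordinates
```

If `missing` is empty, the inner merge used above loses no rows.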


In [9]:
print('The dataframe has {} boroughs and {} neighborhoods.'.format(
        len(neighborhoods['Borough'].unique()),
        neighborhoods.shape[0]))

The dataframe has 10 boroughs and 103 neighborhoods.


In [10]:
address = '33 Wanless Crescent, Toronto, ON'

geolocator = Nominatim(user_agent="foursquare_agent")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinate of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinate of Toronto are 43.72681267221106, -79.3912508694319.


### Create a Map of Toronto (as in the NYC example) with neighborhoods superimposed on top.

### Let's look at Neighborhoods across Boroughs i.e. for all of Toronto

In [12]:
neighborhoods_data = neighborhoods
neighborhoods_data

Unnamed: 0,Postal_Code,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
5,M1J,Scarborough,Scarborough Village,43.744734,-79.239476
6,M1K,Scarborough,"Kennedy Park, Ionview, East Birchmount Park",43.727929,-79.262029
7,M1L,Scarborough,"Golden Mile, Clairlea, Oakridge",43.711112,-79.284577
8,M1M,Scarborough,"Cliffside, Cliffcrest, Scarborough Village West",43.716316,-79.239476
9,M1N,Scarborough,"Birch Cliff, Cliffside West",43.692657,-79.264848


### Let's get the geographical coordinates of Toronto.


In [13]:
address = 'Malvern, Rouge, Toronto, ON'

geolocator = Nominatim(user_agent="foursquare_agent")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinate of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinate of Toronto are 43.8091955, -79.2217008.


In [14]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, label in zip(neighborhoods_data['Latitude'], neighborhoods_data['Longitude'], neighborhoods_data['Neighborhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

#### Define Foursquare Credentials and Version

In [15]:
CLIENT_ID = '4NHYUOY1R4VMOQ04MJW2FC2AUOVCYTVLKU53CHBLMXBI5RTU' # your Foursquare ID
CLIENT_SECRET = 'CLPUAU4CNGQ45HH01RM24CN2YXVI4PYSDBSPNEZ0R3E5XIZJ' # your Foursquare Secret
VERSION = '20180604'
LIMIT = 30
print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: 4NHYUOY1R4VMOQ04MJW2FC2AUOVCYTVLKU53CHBLMXBI5RTU
CLIENT_SECRET:CLPUAU4CNGQ45HH01RM24CN2YXVI4PYSDBSPNEZ0R3E5XIZJ


#### Explore the first neighborhood in our dataframe by getting the neighborhood's name, latitude, and longitude

In [16]:
neighborhoods_data.loc[0, 'Neighborhood']

'Malvern, Rouge'

In [17]:
neighborhood_latitude = neighborhoods_data.loc[0, 'Latitude'] # neighborhood latitude value
neighborhood_longitude = neighborhoods_data.loc[0, 'Longitude'] # neighborhood longitude value

neighborhood_name = neighborhoods_data.loc[0, 'Neighborhood'] # neighborhood name

print('Latitude and longitude values of {} are {}, {}.'.format(neighborhood_name, 
                                                               neighborhood_latitude, 
                                                               neighborhood_longitude))

Latitude and longitude values of Malvern, Rouge are 43.806686299999996, -79.19435340000001.


#### Now, let's get the top 100 venues that are in Malvern, Rouge within a radius of 500 meters.  First, let's create the GET request URL. Name your URL: url.

In [18]:
LIMIT = 100 # limit of number of venues returned by Foursquare API
radius = 500 # search radius in meters (for reference, a mile is about 1609 meters)
# create URL
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    neighborhood_latitude, 
    neighborhood_longitude, 
    radius, 
    LIMIT)
url # display URL


'https://api.foursquare.com/v2/venues/explore?&client_id=4NHYUOY1R4VMOQ04MJW2FC2AUOVCYTVLKU53CHBLMXBI5RTU&client_secret=CLPUAU4CNGQ45HH01RM24CN2YXVI4PYSDBSPNEZ0R3E5XIZJ&v=20180604&ll=43.806686299999996,-79.19435340000001&radius=500&limit=100'
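As an alternative sketch, the same query string could be assembled from named parameters with the standard library's `urlencode`, which makes it easier to see which value goes where. The helper name `build_explore_url` and the placeholder credentials are assumptions made for this example.

```python
from urllib.parse import urlencode

BASE_URL = 'https://api.foursquare.com/v2/venues/explore'

def build_explore_url(client_id, client_secret, version, lat, lng, radius, limit):
    """Assemble the explore URL from named query parameters."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': '{},{}'.format(lat, lng),
        'radius': radius,
        'limit': limit,
    }
    # safe=',' keeps the comma in the 'll' value unescaped
    return BASE_URL + '?' + urlencode(params, safe=',')

# Placeholder credentials, for illustration only
demo_url = build_explore_url('YOUR_CLIENT_ID', 'YOUR_CLIENT_SECRET', '20180604',
                             43.8066863, -79.1943534, 500, 100)
print(demo_url)
```

Keeping the real credentials out of printed output and out of shared notebooks avoids exposing them.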

In [19]:
results = requests.get(url).json()
results

{'meta': {'code': 200, 'requestId': '5fc51e5e9fd1ad40105bf83c'},
  'headerLocation': 'Malvern',
  'headerFullLocation': 'Malvern, Toronto',
  'headerLocationGranularity': 'neighborhood',
  'totalResults': 1,
  'suggestedBounds': {'ne': {'lat': 43.8111863045, 'lng': -79.18812958073042},
   'sw': {'lat': 43.80218629549999, 'lng': -79.2005772192696}},
  'groups': [{'type': 'Recommended Places',
    'name': 'recommended',
    'items': [{'reasons': {'count': 0,
       'items': [{'summary': 'This spot is popular',
         'type': 'general',
         'reasonName': 'globalInteractionReason'}]},
      'venue': {'id': '4bb6b9446edc76b0d771311c',
       'name': 'Wendy’s',
       'location': {'crossStreet': 'Morningside & Sheppard',
        'lat': 43.80744841934756,
        'lng': -79.19905558052072,
        'labeledLatLngs': [{'label': 'display',
          'lat': 43.80744841934756,
          'lng': -79.19905558052072}],
        'distance': 387,
        'cc': 'CA',
        'city': 'Toronto',
    

In [20]:
# function that extracts the category of the venue
def get_category_type(row):
    # the column is named 'categories' or 'venue.categories' depending on the input
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

In [21]:
venues = results['response']['groups'][0]['items']
    
nearby_venues = pd.json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues =nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues

Unnamed: 0,name,categories,lat,lng
0,Wendy’s,Fast Food Restaurant,43.807448,-79.199056


In [22]:
print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

1 venues were returned by Foursquare.


## 2. Explore Neighborhoods in Toronto
#### Create a function to repeat the same process for all the neighborhoods in Toronto

In [23]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET,VERSION,lat,lng,radius,LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues
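The list comprehension in `getNearbyVenues` indexes `categories[0]`, which would raise an `IndexError` for a venue that has no category. A more defensive variant of that extraction step might look like the sketch below. The helper `extract_venue_rows` and the sample items are hypothetical, shaped like the `items` list in the response shown earlier.

```python
def extract_venue_rows(name, lat, lng, items):
    """Flatten Foursquare 'items' into rows, tolerating missing categories."""
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories') or []
        # Use None when a venue carries no category instead of raising IndexError
        category = categories[0]['name'] if categories else None
        rows.append((name, lat, lng, venue['name'],
                     venue['location']['lat'], venue['location']['lng'], category))
    return rows

# Hypothetical response items, for illustration only
sample_items = [
    {'venue': {'name': 'Example Burger',
               'location': {'lat': 43.8074, 'lng': -79.1991},
               'categories': [{'name': 'Fast Food Restaurant'}]}},
    {'venue': {'name': 'Unnamed Spot',
               'location': {'lat': 43.8070, 'lng': -79.1990},
               'categories': []}},
]

rows = extract_venue_rows('Malvern, Rouge', 43.806686, -79.194353, sample_items)
print(rows)
```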

In [24]:
neighborhoods_venues = getNearbyVenues(names=neighborhoods_data['Neighborhood'],
                                   latitudes=neighborhoods_data['Latitude'],
                                   longitudes=neighborhoods_data['Longitude']
                                  )

Malvern, Rouge
Rouge Hill, Port Union, Highland Creek
Guildwood, Morningside, West Hill
Woburn
Cedarbrae
Scarborough Village
Kennedy Park, Ionview, East Birchmount Park
Golden Mile, Clairlea, Oakridge
Cliffside, Cliffcrest, Scarborough Village West
Birch Cliff, Cliffside West
Dorset Park, Wexford Heights, Scarborough Town Centre
Wexford, Maryvale
Agincourt
Clarks Corners, Tam O'Shanter, Sullivan
Milliken, Agincourt North, Steeles East, L'Amoreaux East
Steeles West, L'Amoreaux West
Upper Rouge
Hillcrest Village
Fairview, Henry Farm, Oriole
Bayview Village
York Mills, Silver Hills
Willowdale, Newtonbrook
Willowdale, Willowdale East
York Mills West
Willowdale, Willowdale West
Parkwoods
Don Mills
Don Mills
Bathurst Manor, Wilson Heights, Downsview North
Northwood Park, York University
Downsview
Downsview
Downsview
Downsview
Victoria Village
Parkview Hill, Woodbine Gardens
Woodbine Heights
The Beaches
Leaside
Thorncliffe Park
East Toronto, Broadview North (Old East York)
The Danforth West, 

In [25]:
print(neighborhoods_venues.shape)
neighborhoods_venues.sort_values('Neighborhood').head(10)

(2141, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
57,Agincourt,43.7942,-79.262029,Panagio's Breakfast & Lunch,43.79237,-79.260203,Breakfast Spot
58,Agincourt,43.7942,-79.262029,El Pulgarcito,43.792648,-79.259208,Latin American Restaurant
59,Agincourt,43.7942,-79.262029,Twilight,43.791999,-79.258584,Lounge
60,Agincourt,43.7942,-79.262029,Mark's,43.791179,-79.259714,Clothing Store
61,Agincourt,43.7942,-79.262029,Commander Arena,43.794867,-79.267989,Skating Rink
2076,"Alderwood, Long Branch",43.602414,-79.543484,Il Paesano Pizzeria & Restaurant,43.60128,-79.545028,Pizza Place
2077,"Alderwood, Long Branch",43.602414,-79.543484,Toronto Gymnastics International,43.599832,-79.542924,Gym
2078,"Alderwood, Long Branch",43.602414,-79.543484,Timothy's Pub,43.600165,-79.544699,Pub
2079,"Alderwood, Long Branch",43.602414,-79.543484,Pizza Pizza,43.60534,-79.547252,Pizza Place
2080,"Alderwood, Long Branch",43.602414,-79.543484,Tim Hortons,43.602396,-79.545048,Coffee Shop


In [26]:
neighborhoods_venues.dtypes

Neighborhood               object
Neighborhood Latitude     float64
Neighborhood Longitude    float64
Venue                      object
Venue Latitude            float64
Venue Longitude           float64
Venue Category             object
dtype: object

#### The venue data captured from Foursquare is short on church venues.  The cell below scrapes the data from a web page saved locally, i.e. 'Toronto Canada Church Directory Churches in Toronto Elevate Christian Network.html'.  BeautifulSoup extracts the address data, which is later used to get the latitude and longitude of each church.

In [27]:
from bs4 import BeautifulSoup

# Parse the locally saved church directory page; the 'with' block closes the file afterwards
with open("C:\\Users\\StephenVoorhees\\Documents\\_Training\\Python\\Projects\\ML-Capstone2\\Battle_of_Neighborhoods\\Toronto Canada Church Directory Churches in Toronto Elevate Christian Network.html", encoding="utf8") as html:
    soup = BeautifulSoup(html, 'html.parser')

# Each church address sits in a <small class="ic-yelp-address"> tag
address = soup.find_all('small', class_='ic-yelp-address')

addr_lst = []
for tag in address:
    # Keep the text between the end of the opening tag and the first closing tag
    tag_str = str(tag)
    start_addr = tag_str.find('>')
    end_addr = tag_str.find('</')
    addr_lst.append(tag_str[start_addr + 1:end_addr])
print(addr_lst)


['427 Bloor Street W. Toronto, ON M5S 1X7Canada', '65 Church St. Corktown, Toronto, ON M5C 2E9 Canada', '66 Bond St. Downtown Core Toronto, ON M5B 1X2 Canada', '10 Trinity Sq. Downtown Core, Toronto, ON M5G 1B1 Canada', '425 King Street E. Corktown, Toronto, ON M5A 1L3 Canada', '605 Parliament Street, Cabbagetown, Toronto, ON M4X 1P9 Canada', '293 S Kingsway, Swansea, Toronto, ON M5A 1L7 Canada', '56 Queen Street E. Corktown, Toronto, ON M5C 2Z3 Canada', '131 McCaul Street, Downtown Core, Toronto, ON M5T 1W3 Canada', '794 Kingston Road, Upper Beach, Toronto, ON M4E 1R7 Canada', '1810 Queen Street East, The Beach, Toronto, ON M4L 1G8 Canada', '188 Lowther Avenue, The Annex, Toronto, ON M5R 1E8 Canada', '84 Old Burnhamthorpe Road, Etobicoke, Toronto, ON M9C 3S3 Canada', '1678 Dundas Street W. Brockton Village, Toronto, ON M6K 1V3 Canada', '589 Adelaide St W. Niagara, Toronto, ON M6J 1A8 Canada', '71 Gough Avenue, Greektown, Toronto, ON M4K 3N9 Canada', '1372 King Street W. Parkdale, Toro
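As an alternative sketch, BeautifulSoup's `get_text` can return the text of each tag directly, without slicing the tag's string form. The HTML snippet below is made up for illustration; it assumes only the `small` tag and `ic-yelp-address` class used in the cell above.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet shaped like the saved directory page
sample_html = """
<div><small class="ic-yelp-address"> 1 Example St. Toronto, ON </small></div>
<div><small class="ic-yelp-address">2 Sample Ave. Toronto, ON</small></div>
<div><small class="other">not an address</small></div>
"""

soup_demo = BeautifulSoup(sample_html, 'html.parser')
# get_text(strip=True) returns the tag's text with surrounding whitespace removed
addresses = [tag.get_text(strip=True)
             for tag in soup_demo.find_all('small', class_='ic-yelp-address')]
print(addresses)
```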

Manually clean the address data so that the .csv file contains an Address field with only the address information required by the Nominatim library.  Also add to the .csv file the Venue Category 'Church' and the name of each venue (i.e. the church's name), so that the data is compatible with the 'neighborhoods_venues' dataframe.

The time library was thought to have been needed to put a 'sleep' into the loop using the geolocator library.  It turned out not to be necessary.

The code in the cell immediately below does the following:
1. reads the Church Address data from the csv, 
2. gets the lat/long, 
3. constructs a list of the Church data that is compatible with the neighborhood venue data, 
4. converts the list of church_data into a pandas dataframe, 
5. appends the church data to the neighborhoods venue data, 
6. drops the neighborhood data, i.e. the 'Neighborhood', 'Neighborhood Latitude', and 'Neighborhood Longitude' columns, since this data is not needed to determine how close a cafe is to any given church location, 
7. drops all venue data except the venues where people are most likely to meet, talk, and socialize, i.e. cafes and coffee shops.
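The steps above can be sketched end to end on hypothetical data. In this sketch a small dictionary, `FAKE_GEOCODES`, stands in for the geocoder so the example runs offline; a real run would call the geocoding service instead. All names and coordinates below are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for a geocoder: maps an address to (latitude, longitude)
FAKE_GEOCODES = {'1 Example St., Toronto, ON': (43.6501, -79.3743)}

def geocode_or_none(address):
    """Return (latitude, longitude), or None if the address is not found."""
    return FAKE_GEOCODES.get(address)

churches = pd.DataFrame({
    'Venue': ['Example Church A', 'Example Church B'],
    'Address': ['1 Example St., Toronto, ON', '2 Nowhere Rd., Toronto, ON'],
})

church_rows = []
for venue, address in zip(churches['Venue'], churches['Address']):
    coords = geocode_or_none(address)
    if coords is None:
        continue  # skip addresses that could not be geocoded
    church_rows.append({'Venue': venue, 'Venue Latitude': coords[0],
                        'Venue Longitude': coords[1], 'Venue Category': 'Church'})
church_df = pd.DataFrame(church_rows)

# Hypothetical Foursquare-style venues; the first two rows are duplicates
foursquare_df = pd.DataFrame({
    'Venue': ['Example Coffee', 'Example Coffee', 'Example Gym'],
    'Venue Latitude': [43.7700, 43.7700, 43.7800],
    'Venue Longitude': [-79.2212, -79.2212, -79.3000],
    'Venue Category': ['Coffee Shop', 'Coffee Shop', 'Gym'],
})

# Combine both sources, keep only meeting places and churches, drop duplicates
combined = pd.concat([foursquare_df, church_df], ignore_index=True)
keep = ['Café', 'Church', 'Coffee Shop', 'College Cafeteria']
meeting_places = combined[combined['Venue Category'].isin(keep)].drop_duplicates()
print(meeting_places)
```

Skipping addresses that cannot be geocoded is a design choice made for this sketch; logging them for manual correction would be another option.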

In [28]:
df = pd.read_csv("C:\\Users\\StephenVoorhees\\Documents\\_Training\\Python\\Projects\\ML-Capstone2\\Battle_of_Neighborhoods\\Church_Addresses_v1.csv") 
df.head(20)
df0=[]
df1=[]
df2=[]
df3=[]
df4=[]
df5=[]
df6=[]

# Create the geocoder once rather than on every pass through the loop
geolocator = Nominatim(user_agent="foursquare_agent")
for i in range(len(df)):
    address = str(df.loc[i,"Address"]) + str(', ') + str(df.loc[i,"City"]) + str(', ') + str(df.loc[i,"Province"])
    location = geolocator.geocode(address)
    venue_latitude = location.latitude
    venue_longitude = location.longitude
    venue_category = 'Church'
    venue = df.loc[i,"Venue"]
    neighborhood = df.loc[i,"Neighborhood"]
    neighborhood_latitude = df.loc[i,"Neighborhood Latitude"]
    neighborhood_longitude = df.loc[i,"Neighborhood Longitude"]
#    print(address,neighborhood, neighborhood_latitude,neighborhood_longitude,venue_latitude,venue_longitude,
#    venue_category,venue,venue_category)
    df0.append(neighborhood)
    df1.append(neighborhood_latitude)
    df2.append(neighborhood_longitude)
    df3.append(venue)
    df4.append(venue_latitude)
    df5.append(venue_longitude)
    df6.append(venue_category)

church_data = (zip(df0,df1,df2,df3,df4,df5,df6))

#Convert the list data into a Panda Dataframe
church_lat_long = pd.DataFrame(church_data, columns=['Neighborhood', 'Neighborhood Latitude', 
                                          'Neighborhood Longitude', 'Venue','Venue Latitude',
                                          'Venue Longitude','Venue Category'])  
#church_lat_long.dtypes
#print(church_lat_long)

#church_lat_long.to_csv('C:\\Users\\StephenVoorhees\\Documents\\_Training\\Python\\Projects\\ML-Capstone2\\Battle_of_Neighborhoods\\church_lat_long_v1.csv', encoding='utf-8', index=False)
#merged_neighborhoods_venues = neighborhoods_venues.append(church_lat_long, verify_integrity=True, ignore_index = True)
#merged_neighborhoods_venues

df7 = pd.DataFrame(neighborhoods_venues)
df8 = pd.DataFrame(church_lat_long)

# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
df9 = pd.concat([df7, df8], ignore_index=True)
venue_data = df9.drop(columns=['Neighborhood','Neighborhood Latitude','Neighborhood Longitude'])
#print(venue_data)

df10 = venue_data.loc[venue_data['Venue Category'].isin(['Café','Church','Coffee Shop','College Cafeteria'])]


In [29]:
#df10 has the correct values for the four Venue data fields, but they are not unique
print(df10)

                                     Venue  Venue Latitude  Venue Longitude  \
11                               Starbucks       43.770037       -79.221156   
12                             Tim Hortons       43.770827       -79.223078   
27                             Tim Hortons       43.726895       -79.266157   
42                          The Birchcliff       43.691666       -79.264532   
84                             Tim Hortons       43.799102       -79.318715   
104                              Starbucks       43.777990       -79.344091   
110                     Aroma Espresso Bar       43.777700       -79.344652   
123                            Tim Hortons       43.774993       -79.346303   
124                            Tim Hortons       43.777964       -79.344715   
127                            Tim Hortons       43.775249       -79.347740   
167            Maxim's Cafe and Patisserie       43.787863       -79.380751   
174                              Starbucks       43.

In [32]:
# Remove Duplicate Venue Records from the 'Café','Church','Coffee Shop','College Cafeteria' dataframe.
df11 = df10.drop_duplicates()
print(df11)


                                     Venue  Venue Latitude  Venue Longitude  \
11                               Starbucks       43.770037       -79.221156   
12                             Tim Hortons       43.770827       -79.223078   
27                             Tim Hortons       43.726895       -79.266157   
42                          The Birchcliff       43.691666       -79.264532   
84                             Tim Hortons       43.799102       -79.318715   
104                              Starbucks       43.777990       -79.344091   
110                     Aroma Espresso Bar       43.777700       -79.344652   
123                            Tim Hortons       43.774993       -79.346303   
124                            Tim Hortons       43.777964       -79.344715   
127                            Tim Hortons       43.775249       -79.347740   
167            Maxim's Cafe and Patisserie       43.787863       -79.380751   
174                              Starbucks       43.

In [33]:
df11.shape

(235, 4)

In [34]:
# create map of Toronto's 'Café','Church','Coffee Shop','College Cafeteria' using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=11)


def color(ven_cat):
    if ven_cat == 'Church':
        col='yellow'
    else:
        col='green'
    return col

# add markers to map
for lat, lng, label, ven_cat in zip(df11['Venue Latitude'], df11['Venue Longitude'], df11['Venue'], df11['Venue Category']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color=color(ven_cat),  # pass the color name itself, not a one-element list
        fill=True,
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
#    print(color(ven_cat))
    
map_toronto

#### The first question to ask for this project is: 'Assuming all churches are equal, which church is closest to the most places where it is easiest to meet, talk, and socialize with people, i.e. cafes and coffee houses?'

To answer this question, the following steps are taken (using the blog https://flothesof.github.io/k-means-numpy.html as a template).
1. Create a NumPy array of the location data for all of the cafes (points) identified previously in the Foursquare data.
2. Create a NumPy array of the location data for all of the churches (centroids) identified above.
3. Use a procedure that calculates the distance between all the points (cafes) and centroids (churches).
4. Identify the top centroids (churches) that have the most points (cafes) clustered around them.
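Once each cafe has been assigned to its nearest church, the last step reduces to counting and sorting. A minimal sketch, assuming a hypothetical array `nearest_church` that holds one church index per cafe:

```python
import numpy as np
import pandas as pd

# Hypothetical inputs: church names and, for each cafe, the index of its nearest church
church_names = ['Church A', 'Church B', 'Church C']
nearest_church = np.array([0, 2, 2, 0, 2])

# Count the cafes assigned to each church; minlength keeps churches with zero cafes
counts = np.bincount(nearest_church, minlength=len(church_names))

ranking = (pd.DataFrame({'Church': church_names, 'Nearby Cafes': counts})
             .sort_values('Nearby Cafes', ascending=False)
             .reset_index(drop=True))
print(ranking)
```

Note that this counts every cafe for exactly one church, however far away it is; adding a maximum distance would be a possible refinement.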

In [36]:
df11.dtypes

Venue               object
Venue Latitude     float64
Venue Longitude    float64
Venue Category      object
dtype: object

In [81]:
# Build each mask from df11 itself so that the mask and the dataframe share the same index
df12 = df11.loc[df11['Venue Category'].isin(['Café','Coffee Shop','College Cafeteria'])]
df13 = df11.loc[df11['Venue Category'].isin(['Church'])]
df13 = df13.reset_index(drop=True)
print(df13)

                                 Venue  Venue Latitude  Venue Longitude  \
0          St James Anglican Cathedral       43.650110       -79.374292   
1                St Helens Church Band       43.666059       -79.405700   
2                   Holy Family Church       43.650497       -79.374002   
3             St Peter Catholic Church       43.655597       -79.377687   
4                  Holy Trinity Church       43.654410       -79.381956   
5      United Church of God Toronto,ON       43.701923       -79.520177   
6               St. Michaels Cathedral       43.667963       -79.369276   
7       Corpus Christi Catholic Church       43.647878       -79.484258   
8                   St Patricks Church       43.770257       -79.546425   
9                     St. Marys Church       43.654842       -79.391068   
10  St. Barnabas Roman Catholic Church       43.680040       -79.293660   
11               Little Trinity Church       43.668074       -79.309413   
12      Farmer Memorial B

In [43]:
df12.shape

(215, 4)

In [45]:
# Collect the cafe latitudes and longitudes into a two-column DataFrame
cafe_coord_lat = df12['Venue Latitude']
cafe_coord_long = df12['Venue Longitude']
cafe_coord = pd.DataFrame(zip(cafe_coord_lat,cafe_coord_long))
# Print the coordinates as a numpy array
print(cafe_coord.to_numpy())

[[ 43.7700372  -79.22115587]
 [ 43.77082707 -79.2230781 ]
 [ 43.72689509 -79.26615693]
 [ 43.69166644 -79.26453158]
 [ 43.7991018  -79.3187148 ]
 [ 43.77799    -79.344091  ]
 [ 43.7777002  -79.34465167]
 [ 43.7749925  -79.3463027 ]
 [ 43.777964   -79.344715  ]
 [ 43.77524918 -79.34773967]
 [ 43.78786338 -79.38075094]
 [ 43.768353   -79.413046  ]
 [ 43.76944882 -79.41308108]
 [ 43.76697902 -79.41220492]
 [ 43.768464   -79.414017  ]
 [ 43.7809395  -79.4442308 ]
 [ 43.74445646 -79.34645978]
 [ 43.72289725 -79.33911704]
 [ 43.7275361  -79.33954737]
 [ 43.755797   -79.440471  ]
 [ 43.75476687 -79.44325045]
 [ 43.76428928 -79.48879033]
 [ 43.72551663 -79.31310251]
 [ 43.70561066 -79.36077469]
 [ 43.7056293  -79.36102832]
 [ 43.706564   -79.359591  ]
 [ 43.7050898  -79.3505453 ]
 [ 43.67862953 -79.34746014]
 [ 43.67812618 -79.34843379]
 [ 43.67811808 -79.34948504]
 [ 43.677232   -79.3528982 ]
 [ 43.678879   -79.346357  ]
 [ 43.667662   -79.312006  ]
 [ 43.66137282 -79.33857736]
 [ 43.66080615

In [48]:
cafe_coord.shape

(215, 2)

In [71]:
# Collect the church latitudes and longitudes into a two-column DataFrame
church_coord_lat = df13['Venue Latitude']
church_coord_long = df13['Venue Longitude']
church_coord = pd.DataFrame(zip(church_coord_lat,church_coord_long))
print(church_coord)

            0          1
0   43.650110 -79.374292
1   43.666059 -79.405700
2   43.650497 -79.374002
3   43.655597 -79.377687
4   43.654410 -79.381956
5   43.701923 -79.520177
6   43.667963 -79.369276
7   43.647878 -79.484258
8   43.770257 -79.546425
9   43.654842 -79.391068
10  43.680040 -79.293660
11  43.668074 -79.309413
12  43.668320 -79.407317
13  43.642243 -79.582005
14  43.656221 -79.380768
15  43.650856 -79.376448
16  43.679289 -79.346200
17  43.701923 -79.520177
18  43.665631 -79.412573
19  43.796279 -79.227757


In [49]:
church_coord.shape

(20, 2)

In [53]:
# Store the centroids (the existing churches) as a numpy array
centroids = np.array(church_coord, copy=True ,order='K')
print(centroids)

[[ 43.65010974 -79.37429172]
 [ 43.666059   -79.40570002]
 [ 43.6504971  -79.37400189]
 [ 43.65559725 -79.37768689]
 [ 43.65441005 -79.38195639]
 [ 43.701923   -79.5201766 ]
 [ 43.66796266 -79.36927576]
 [ 43.64787805 -79.48425803]
 [ 43.7702571  -79.5464247 ]
 [ 43.6548424  -79.3910682 ]
 [ 43.68004028 -79.29366039]
 [ 43.6680738  -79.3094128 ]
 [ 43.6683203  -79.4073169 ]
 [ 43.64224265 -79.5820048 ]
 [ 43.6562214  -79.3807678 ]
 [ 43.6508556  -79.3764484 ]
 [ 43.6792891  -79.3462001 ]
 [ 43.701923   -79.5201766 ]
 [ 43.66563111 -79.4125726 ]
 [ 43.7962787  -79.2277572 ]]


In [61]:
def initialize_centroids(k):
    """Return the first k churches as fixed centroids (k = 20 uses all of them).

    The centroids are never updated, so this is the assignment step of
    K-means without any iterations.
    """
    # Reads the module-level `centroids` array built in the previous cell.
    return centroids[:k]

#### Build the points as a numpy dataset, i.e. the cafe venues whose distance to the existing churches (centroids) will be measured

In [54]:
points = np.array(cafe_coord, copy=True, order='K')
print(points)

[[ 43.7700372  -79.22115587]
 [ 43.77082707 -79.2230781 ]
 [ 43.72689509 -79.26615693]
 [ 43.69166644 -79.26453158]
 [ 43.7991018  -79.3187148 ]
 [ 43.77799    -79.344091  ]
 [ 43.7777002  -79.34465167]
 [ 43.7749925  -79.3463027 ]
 [ 43.777964   -79.344715  ]
 [ 43.77524918 -79.34773967]
 [ 43.78786338 -79.38075094]
 [ 43.768353   -79.413046  ]
 [ 43.76944882 -79.41308108]
 [ 43.76697902 -79.41220492]
 [ 43.768464   -79.414017  ]
 [ 43.7809395  -79.4442308 ]
 [ 43.74445646 -79.34645978]
 [ 43.72289725 -79.33911704]
 [ 43.7275361  -79.33954737]
 [ 43.755797   -79.440471  ]
 [ 43.75476687 -79.44325045]
 [ 43.76428928 -79.48879033]
 [ 43.72551663 -79.31310251]
 [ 43.70561066 -79.36077469]
 [ 43.7056293  -79.36102832]
 [ 43.706564   -79.359591  ]
 [ 43.7050898  -79.3505453 ]
 [ 43.67862953 -79.34746014]
 [ 43.67812618 -79.34843379]
 [ 43.67811808 -79.34948504]
 [ 43.677232   -79.3528982 ]
 [ 43.678879   -79.346357  ]
 [ 43.667662   -79.312006  ]
 [ 43.66137282 -79.33857736]
 [ 43.66080615

#### Now let's define a function that returns the closest centroid for each point, using numpy broadcasting. Refer to the blog post https://flothesof.github.io/k-means-numpy.html for an explanation of how numpy is used to determine which centroid each venue is assigned to.

In [62]:
def closest_centroid(points, centroids):
    """returns an array containing the index to the nearest centroid for each point"""
    distances = np.sqrt(((points - centroids[:, np.newaxis])**2).sum(axis=2))
    return np.argmin(distances, axis=0)
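
To make the broadcasting in `closest_centroid` easier to follow, the sketch below traces the array shapes through the same expression. The zero-filled arrays are placeholders that only borrow the sizes used in this notebook (215 cafes, 20 churches); they are not the real coordinates.

```python
import numpy as np

points = np.zeros((215, 2))     # placeholder cafes: one (lat, long) row each
centroids = np.zeros((20, 2))   # placeholder churches

expanded = centroids[:, np.newaxis]            # shape (20, 1, 2)
diff = points - expanded                       # broadcasts to (20, 215, 2)
distances = np.sqrt((diff ** 2).sum(axis=2))   # shape (20, 215)
nearest = np.argmin(distances, axis=0)         # shape (215,): one church index per cafe

print(expanded.shape, diff.shape, distances.shape, nearest.shape)
```

The extra axis added by `np.newaxis` is what lets every church be compared with every cafe without an explicit loop; taking `argmin` over axis 0 then picks, for each cafe, the row (church) with the smallest distance.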

In [63]:
c = initialize_centroids(20)
closest_centroid(points, c)

array([19, 19, 10, 10, 19, 16, 16, 16, 16, 16, 16, 12, 12, 12, 12,  8, 16,
       16, 16, 12,  5,  8, 10, 16, 16, 16, 16, 16, 16, 16, 16, 16, 11, 16,
       16, 16, 16, 16, 12, 12, 12, 12, 12, 12, 12, 12, 12,  6,  6,  6,  6,
        6,  6,  6, 14, 14, 14,  6, 14, 14,  9, 14,  6,  2,  2,  2,  6,  2,
        2,  2,  6,  6,  2,  3, 14,  3, 14, 14, 14, 14,  3, 14,  4,  4, 14,
        3,  2, 15, 15,  2, 15, 15,  2, 15,  0,  0, 15,  0,  0,  2,  0, 14,
       14, 14,  9,  9,  9,  9, 14,  9,  9,  9, 14,  9,  4,  4, 15,  4,  4,
        9,  9,  4, 15,  4,  4, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15,
       15, 15, 15,  0, 15, 15,  0, 15, 15,  4, 15,  4, 15, 12, 12, 12, 12,
       12, 12, 12, 12,  1,  1,  1,  1,  1,  1,  9,  9,  9,  9,  1,  9,  1,
        1,  9, 15,  5, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18,
       18, 18,  7,  7,  7,  7,  7,  7,  7,  7,  7,  9,  9,  9,  9,  9,  9,
        9, 13, 13,  7,  7,  7,  7, 13, 13, 13,  5], dtype=int64)
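
One caveat on the assignment above: `closest_centroid` measures distance in raw degrees. At Toronto's latitude (roughly 43.7° N) a degree of longitude covers only about cos(43.7°) ≈ 0.72 of the ground distance of a degree of latitude, so east-west gaps are somewhat overweighted. The sketch below shows one possible correction, scaling longitude differences by the cosine of the mean latitude. The function name `closest_centroid_scaled` and the sample coordinates are hypothetical and not part of the original analysis.

```python
import numpy as np

def closest_centroid_scaled(points, centroids):
    """Like closest_centroid, but shrinks longitude differences by cos(latitude)."""
    # Assumption: the area is small enough that one mean latitude is adequate.
    scale = np.array([1.0, np.cos(np.radians(points[:, 0].mean()))])
    diff = (points - centroids[:, np.newaxis]) * scale
    distances = np.sqrt((diff ** 2).sum(axis=2))
    return np.argmin(distances, axis=0)

# Made-up coordinates: one cafe, one church to the north, one to the east.
cafe = np.array([[43.65, -79.38]])
churches = np.array([[43.66, -79.38],     # 0.010 degrees north of the cafe
                     [43.65, -79.368]])   # 0.012 degrees east of the cafe
print(closest_centroid_scaled(cafe, churches))
```

With these made-up numbers the unscaled distance would pick the northern church (0.010 degrees versus 0.012 degrees), while the scaled version picks the eastern one, since 0.012 degrees of longitude is the shorter distance on the ground. For densely clustered downtown venues the two approaches will often agree.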

#### The array of nearest-church indices, while it gives the solution, is not reader friendly. The following cells create a pandas dataframe so that the dataframe of centroid names and locations can easily be merged with the dataframe that identifies which church each cafe / coffee house is closest to.

In [70]:
closest_centroid_pd = pd.DataFrame(data=closest_centroid(points, c),columns=['nearest_church'])
print(closest_centroid_pd)

     nearest_church
0                19
1                19
2                10
3                10
4                19
5                16
6                16
7                16
8                16
9                16
10               16
11               12
12               12
13               12
14               12
15                8
16               16
17               16
18               16
19               12
20                5
21                8
22               10
23               16
24               16
25               16
26               16
27               16
28               16
29               16
30               16
31               16
32               11
33               16
34               16
35               16
36               16
37               16
38               12
39               12
40               12
41               12
42               12
43               12
44               12
45               12
46               12
47                6
48                6


In [82]:
df2 = closest_centroid_pd.set_index('nearest_church')
result = pd.concat([df2, df13], axis=1, join='inner')
result

| | Venue | Venue Latitude | Venue Longitude | Venue Category |
|---|---|---|---|---|
| 19 | Holy Name Church | 43.796279 | -79.227757 | Church |
| 19 | Holy Name Church | 43.796279 | -79.227757 | Church |
| 10 | St. Barnabas Roman Catholic Church | 43.68004 | -79.29366 | Church |
| 10 | St. Barnabas Roman Catholic Church | 43.68004 | -79.29366 | Church |
| 19 | Holy Name Church | 43.796279 | -79.227757 | Church |
| 16 | Trinity-St Pauls United Church | 43.679289 | -79.3462 | Church |
| 16 | Trinity-St Pauls United Church | 43.679289 | -79.3462 | Church |
| 16 | Trinity-St Pauls United Church | 43.679289 | -79.3462 | Church |
| 16 | Trinity-St Pauls United Church | 43.679289 | -79.3462 | Church |
| 16 | Trinity-St Pauls United Church | 43.679289 | -79.3462 | Church |
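
The join above relies on `pd.concat` aligning on an index that contains repeated labels, which may not behave the same way across pandas versions. A possibly more robust alternative is to look up each cafe's church with `DataFrame.merge` and tally the result with `value_counts`. The sketch below uses small made-up frames standing in for `df13` and `closest_centroid_pd`; the names `churches`, `assignments`, `joined`, and `cafe_counts` are assumptions for illustration.

```python
import pandas as pd

# Made-up stand-ins for df13 (churches) and closest_centroid_pd (assignments).
churches = pd.DataFrame({
    "Venue": ["Church X", "Church Y", "Church Z"],
    "Venue Latitude": [43.65, 43.70, 43.68],
    "Venue Longitude": [-79.38, -79.50, -79.35],
})
assignments = pd.DataFrame({"nearest_church": [0, 0, 2, 0, 2]})

# Look up each cafe's church row by its positional index in `churches`.
joined = assignments.merge(churches, left_on="nearest_church", right_index=True)

# Number of cafes assigned to each church, largest first.
cafe_counts = joined["Venue"].value_counts()
print(cafe_counts)
```

Each row of `joined` is one cafe together with the details of its nearest church, so counting rows per `Venue` gives the number of cafes closest to each church. Churches with no assigned cafes (Church Y here) simply do not appear in the tally.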


#### The final result below shows that the church with the most nearby cafes / coffee houses is St Jamestown Community Church. Not surprisingly, this church is located at lat/long (43.650856, -79.376448), which is in the borough "Downtown Toronto" and the neighborhood "St James Town" (https://www.stjamestownchurch.com/). Presumably, this church and neighborhood were named after James, the brother of Jesus and the author of the Book of James in the New Testament. This church uses the King James Version of the Bible in its preaching. Also of interest on this website is the use of Artificial Intelligence, in the form of a chatbot, to facilitate meetings and discussions with those inside or outside of the church membership. Perhaps users of this chatbot would be invited to meet and discuss the Gospel of Jesus Christ (see the church's website for a simple explanation of the Gospel) at one of the cafes in St. James Town :).

In [88]:
result.groupby('Venue').count().sort_values(by='Venue Category',ascending=False)

| Venue | Venue Latitude | Venue Longitude | Venue Category |
|---|---|---|---|
| St Jamestown Community Church | 29 | 29 | 29 |
| St. Marys Church | 24 | 24 | 24 |
| Trinity-St Pauls United Church | 23 | 23 | 23 |
| Farmer Memorial Baptist Church | 22 | 22 | 22 |
| Metropolitan United Church | 18 | 18 | 18 |
| St Johns Parish Church | 15 | 15 | 15 |
| Corpus Christi Catholic Church | 13 | 13 | 13 |
| St. Michaels Cathedral | 12 | 12 | 12 |
| Holy Family Church | 11 | 11 | 11 |
| Holy Trinity Church | 11 | 11 | 11 |
