# The battle of the neighborhoods
## Where to place a new pharmacy in the Covid era?

## Introduction
The client for this project wants to find the best place to open a pharmacy in the Toronto area. The spread of COVID-19 increased sales of hand sanitizers, disposable gloves and masks worldwide. Sales of those items are likely to stay high after the pandemic is over, due to greater public awareness of how viruses spread, increased fear, and new habits. While most of those items can be found in convenience stores or other retail shops, pharmacies are the most reliable places to get them. Pharmacies also offer prescription and over-the-counter medicines, which are needed by people who are infected with COVID-19 but do not require hospitalization. Pharmacies therefore have a clear role in both preventing the disease and helping infected individuals recover.

The main assumption of this work is that the larger the number of Covid cases in a given neighborhood, the higher the need for disease prevention and palliation items. In other words, the hypothesis is that market demand in a neighborhood increases with its number of Covid cases. The number of existing pharmacies in each neighborhood is used to estimate market supply. A neighborhood with many Covid infections and few pharmacies might need new pharmacies. The number of Covid cases, the number of existing pharmacies and the distance to downtown are analyzed for each neighborhood in Toronto to select the best neighborhoods for opening a new pharmacy.

### Data

Postal codes and a list of Toronto neighborhoods were sourced from Wikipedia. Coordinates of Toronto neighborhoods were loaded from a CSV file with the geographical coordinates of each postal code, available at http://cocl.us/Geospatial_data. A database of Covid cases in Toronto was retrieved from https://open.toronto.ca/dataset/covid-19-cases-in-toronto/; this database is updated weekly on Wednesdays. Foursquare was used to find the number of existing pharmacies in each neighborhood. The distance from each neighborhood to downtown was calculated with a Geopy function, described at https://geopy.readthedocs.io/en/stable/#module-geopy.distance.

From the database of Covid cases in the Toronto area, only confirmed cases were selected and then grouped by postal code. Foursquare was used to find pharmacies within a radius of 1000 m of each neighborhood's coordinates. The downtown Toronto coordinates, required to calculate the distance to downtown, correspond to the coordinates of Toronto City Hall.
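
The selection and grouping step described above can be sketched on a few made-up records. This is only an illustration in plain Python: the records are invented, and the field names `FSA` and `Classification` are the ones used in the notebook cells further down, where the same step is done with pandas.

```python
from collections import Counter

# Invented sample records; only the field names follow the Toronto dataset.
records = [
    {"FSA": "M1B", "Classification": "CONFIRMED"},
    {"FSA": "M1B", "Classification": "PROBABLE"},
    {"FSA": "M1B", "Classification": "CONFIRMED"},
    {"FSA": "M5A", "Classification": "CONFIRMED"},
]


def confirmed_cases_by_fsa(rows):
    """Count confirmed cases for each postal code prefix (FSA)."""
    return Counter(r["FSA"] for r in rows if r["Classification"] == "CONFIRMED")


cases = confirmed_cases_by_fsa(records)
# A postal code with no confirmed cases simply counts as zero.
print(cases["M1B"], cases["M5A"], cases["M9V"])
```

Probable cases are dropped before counting, so a postal code that only has probable cases ends up with zero, the same value the notebook assigns to postal codes missing from the grouped data.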

In [None]:
# Import libraries

import pandas as pd
import numpy as np # library to handle data in a vectorized manner
import requests # library to handle requests
!pip3 install bs4
!pip3 install lxml html5lib beautifulsoup4

import json # library to handle JSON files
!conda install -c conda-forge geopy --yes # comment out this line if geopy is already installed
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
from pandas import json_normalize # transform a JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors
import matplotlib.pyplot as plt
from matplotlib.pyplot import plot
# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if folium is not installed
import folium # map rendering library

print('Libraries imported.')

### Part 1, create the first dataframe with Toronto neighborhoods

In [None]:
#use pandas to scrape tables from the web page
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
dfs = pd.read_html(url)


In [None]:
#dfs is a list; the table we are interested in is its first element
dfs=dfs[0]
dfs

In [None]:
#each cell of the table is a string with mixed data: the first 3 characters are the postal code,
#the text up to "(" is the borough, and the text inside the parentheses is the neighborhood
column_names = ['PostalCode', 'Borough', 'Neighborhood']
rows = []  #collect rows in a list; DataFrame.append is not available in recent pandas versions
#the loop below visits each cell of dfs to build the required dataframe
for i in range(dfs.shape[0]):
    for j in range(dfs.shape[1]):
        cell = dfs.iloc[i, j]
        postalcode = cell[0:3]   #the first three characters of the string are the postal code
        borough = cell[3:].split("(")[0]  #the rest of the string holds borough and neighborhood, split on "("
        if borough != "Not assigned": #ignore cells without an assigned borough
            neighborhood = cell[3:].split("(")[1][:-1]   #the neighborhood; [:-1] drops the closing ")"
            rows.append({'PostalCode': postalcode, 'Borough': borough, 'Neighborhood': neighborhood})

toronto = pd.DataFrame(rows, columns=column_names)
toronto  #display dataframe

In [None]:
toronto.Neighborhood=toronto.Neighborhood.str.replace(" /", ",", regex=False)  # clean neighborhood column, replace " /" with ","
toronto.head()   # check if dataframe is in the required format


In [None]:
print(toronto.shape)    # shape of dataframe
toronto[toronto["PostalCode"]=="M5A"]   # check results for M5A postal code

### Part 2, get latitude and longitude for each postal code

In [None]:
#load the csv data

df_coor = pd.read_csv('https://cocl.us/Geospatial_data')
print(df_coor.shape)



In [None]:
df_coor

In [None]:
#to get the final dataframe, toronto and df_coor must be merged


torontoneighcoor= toronto.join(df_coor.set_index('Postal Code'), on='PostalCode')
torontoneighcoor

In [None]:
#toronto=torontoneighcoor[torontoneighcoor["Borough"].str.contains("Toronto")] #create a new dataframe, work with only boroughs that contain the word Toronto
toronto=torontoneighcoor

### Part 3, analyze Covid cases in Toronto

In [None]:
#link used to import the Covid-19 cases for Toronto, Canada
#according to https://docs.ckan.org/en/latest/maintaining/datastore.html#downloading-resources,
#a DataStore resource can be downloaded in the CSV file format from {CKAN-URL}/datastore/dump/{RESOURCE-ID}


covid=pd.read_csv('https://ckan0.cf.opendata.inter.prod-toronto.ca/datastore/dump/e5bf35bc-e681-43da-b2ce-0242d00922ad')

In [None]:
#shape of the dataframe and first lines
print(covid.shape)
covid.head()



In [None]:
#FSA (forward sortation area, the first three characters of the postal code) in the dataframe
#refers to the home of the infected person. Let's keep just the postal codes that are included in the toronto dataframe.

covidtoronto=covid[covid.FSA.isin(toronto.PostalCode)]


In [None]:
#shape and first lines of the dataframe
print(covidtoronto.shape)
covidtoronto.head()


In [None]:
#let us analyze this dataframe
covidtoronto.describe(include="all")

In [None]:
#most are confirmed cases, but not all of them
covidtoronto.Classification.value_counts()

In [None]:
# let's drop the rows with probable cases
covid_toronto_confirmed=covidtoronto[covidtoronto.Classification=="CONFIRMED"]
covid_toronto_confirmed

In [None]:
#let us find how many confirmed covid cases in each neighborhood

counts=covid_toronto_confirmed.groupby(by=["FSA"]).count()
counts

In [None]:
# now let us merge one column of this dataframe (the first one, for example) with the toronto neighborhood data
torontoneighcovid= toronto.join(counts["_id"], on='PostalCode')
# let us rename the column
torontoneighcovid.rename(columns={"_id": "Covid_Cases"}, inplace=True)
#torontoneighcovid.drop(columns=["level_0","index"],inplace=True)
#let us replace NaN with zero Covid cases (assign the result back; fillna with inplace=True on a single column is unreliable in recent pandas)
torontoneighcovid["Covid_Cases"] = torontoneighcovid["Covid_Cases"].fillna(0)
torontoneighcovid.head()


In [None]:
#I drop the rows with the M7Y and M7R postal codes, because they are mail processing centers.
#I assume nobody has a home address with those postal codes; remember that Covid cases
#are tracked based on home address
torontoneighcovid = torontoneighcovid[(torontoneighcovid.PostalCode != "M7R")]
torontoneighcovid = torontoneighcovid[(torontoneighcovid.PostalCode != "M7Y")]
torontoneighcovid

In [None]:
#let us get the coordinates of toronto to plot a map of Covid Cases
address = 'Toronto, CA'
geolocator = Nominatim(user_agent="toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

#let us take a look at Covid cases in Toronto; the size of each bubble grows with the number of cases in the neighborhood
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, label, cases in zip(torontoneighcovid['Latitude'], torontoneighcovid['Longitude'], torontoneighcovid['Neighborhood'],torontoneighcovid['Covid_Cases']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=2+((cases)/100),   #use an adequate scale(by trial and error) , add a number so that all neighborhoods are visible, even those without covid cases 
        popup=label,
        color='',
        fill=True,
        fill_color='#FF0000',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

### Part 5, count the number of pharmacies in each neighborhood

In [None]:
#use Foursquare to explore venues, pharmacies in this case
#credentials are read from environment variables so that they are not stored in the notebook;
#the variable names below are placeholders, use whatever names you set in your environment
import os
CLIENT_ID = os.environ['FOURSQUARE_CLIENT_ID'] # your Foursquare ID
CLIENT_SECRET = os.environ['FOURSQUARE_CLIENT_SECRET'] # your Foursquare Secret
ACCESS_TOKEN = os.environ['FOURSQUARE_ACCESS_TOKEN'] # your Foursquare Access Token
VERSION = '20180605' # Foursquare API version
LIMIT = 100 # A default Foursquare API limit value

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)

In [None]:
#let's take a look at how many pharmacies there are in downtown Toronto (centered at City Hall)

search_query = 'pharmacy'
#after some trial and error with Foursquare, 'pharmacy' gives much more reliable
#results than 'drugstore', which includes irrelevant results
radius = 1000
url = 'https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&ll={},{}&oauth_token={}&v={}&query={}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, latitude, longitude,ACCESS_TOKEN, VERSION, search_query, radius, LIMIT)
results = requests.get(url).json()["response"]["venues"]
print (len(results)) # how many pharmacies in a given neighborhood
#take a look at results, to build the next function
results


In [None]:
#let us create a function to find a given type of shop near each neighborhood, adapted from functions
#described in other labs
#names holds the neighborhood names, shop is the type of shop we are looking for
def getNearbyShop(names, shop, latitudes, longitudes, radius=1000):
    shop_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):

        # create the API request URL for the requested type of shop
        url = 'https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&ll={},{}&oauth_token={}&v={}&query={}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, lat, lng, ACCESS_TOKEN, VERSION, shop, radius, LIMIT)

        # make the GET request
        results = requests.get(url).json()["response"]["venues"]
        print(len(results), shop, "in", name)   # show how many shops were found for each neighborhood

        # keep only the relevant information for each nearby shop
        shop_list.extend(
            (name, lat, lng, v["name"], v["location"]["lat"], v["location"]["lng"], v["location"]["distance"])
            for v in results)

    # build the dataframe once, after all neighborhoods have been queried
    nearby_venues = pd.DataFrame(shop_list, columns=['Neighborhood',
                  'Neighborhood Latitude',
                  'Neighborhood Longitude',
                  'Shop Name',
                  'Shop Latitude',
                  'Shop Longitude', "distance"])

    return nearby_venues

In [None]:
#run the above function on each neighborhood and create a new dataframe called toronto_pharmacies
toronto_pharmacies=getNearbyShop(toronto.Neighborhood, "Pharmacy", toronto.Latitude,toronto.Longitude, radius=1000)

In [None]:
#let us take a look at the dataframe
toronto_pharmacies

In [None]:
#let's plot the pharmacies on a map, superimposed on the Covid cases
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=11)

#add markers to map, covid cases for each neighborhood
for lat, lng, label, cases in zip(torontoneighcovid['Latitude'], torontoneighcovid['Longitude'], torontoneighcovid['Neighborhood'],torontoneighcovid['Covid_Cases']): 
    label = folium.Popup(label, parse_html=True) 
    folium.CircleMarker( [lat, lng], radius=2+((cases)/100), #use an adequate scale(by trial and error) , add a number so that all neighborhoods are visible, even those without covid cases
                        popup=label, 
                        color='', 
                        fill=True, 
                        fill_color='#FF0000', 
                        fill_opacity=0.7, 
                        parse_html=False).add_to(map_toronto)

#add markers for each pharmacy to map
for lat, lng, label in zip(toronto_pharmacies['Shop Latitude'], toronto_pharmacies['Shop Longitude'], toronto_pharmacies['Shop Name']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
    [lat, lng],
    radius=5,   
    popup=label,
    color='Blue',
    fill=True,
    fill_color='Blue',
    fill_opacity=0.7,
    parse_html=False).add_to(map_toronto) 

map_toronto

In [None]:
#let us count the number of pharmacies in each neighborhood and append that to the dataframe with Covid cases

covid_phar=torontoneighcovid.join(toronto_pharmacies.groupby("Neighborhood").count()["Shop Name"], on='Neighborhood')
covid_phar.rename(columns={"Shop Name": "Pharmacies"}, inplace=True)
#let's set 0 pharmacies for neighborhoods where Foursquare did not find any pharmacy
covid_phar["Pharmacies"] = covid_phar["Pharmacies"].fillna(0)
#let's reset the index and take a look at the dataframe
covid_phar.reset_index(drop=True,inplace=True)
covid_phar


In [None]:
max(np.sqrt(covid_phar.Covid_Cases))

In [None]:
#Let us make a map showing Covid cases and the number of pharmacies for each neighborhood.
#The number of pharmacies is shown with a choropleth map: the color of each neighborhood indicates
#the number of existing pharmacies
#First we need a file with the boundaries of Toronto neighborhoods; this is available courtesy of a GitHub user

!wget -q -O 'toronto.json' https://raw.githubusercontent.com/ag2816/Visualizations/master/data/Toronto2.geojson
with open('toronto.json') as json_data:
        toronto_data = json.load(json_data)


In [None]:
#I examined the geojson file and determined that the postal code for each neighborhood
#is under 'feature.properties.CFSAUID'; this is passed as the key to make the choropleth map
toronto_map =folium.Map(location=[latitude, longitude], zoom_start=11)
#make the choropleth, inspired by previous labs
#folium.Choropleth needs a recent folium release; older ones such as 0.5.0 use toronto_map.choropleth(...) with the same arguments
folium.Choropleth(geo_data=toronto_data, data=covid_phar,
    columns=['PostalCode', 'Pharmacies'],key_on='feature.properties.CFSAUID',
    line_color='black',fill_color='YlGnBu',
    fill_opacity=0.7,
    line_opacity=0.5 ,line_weight=3, legend_name="Number of Pharmacies"
            ).add_to(toronto_map)
# add markers to map, to indicate covid cases
for lat, lng, label, cases in zip(torontoneighcovid['Latitude'], torontoneighcovid['Longitude'], torontoneighcovid['Neighborhood'],torontoneighcovid['Covid_Cases']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=2+((cases)/200),   #use an adequate scale(by trial and error) , add a number so that all neighborhoods are visible, even those without covid cases 
        popup=label,
        color='',
        fill=True,
        fill_color='#FF0000',
        fill_opacity=0.7,
        parse_html=False).add_to(toronto_map) 

toronto_map


We can see that many neighborhoods with a high number of Covid cases have fewer than five pharmacies. Later we will use K-means to sort neighborhoods into groups with similar features.
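
The visual observation above can also be checked with a simple filter. The sketch below is an illustration only: the figures and postal codes are invented, and treating "high" as "above the average number of cases" is an assumption made here, not a rule used elsewhere in the notebook. The column names `Covid_Cases` and `Pharmacies` mirror those of the `covid_phar` dataframe.

```python
from statistics import mean

# Invented figures for illustration; they are not taken from the Toronto data.
neighborhoods = [
    {"PostalCode": "X1A", "Covid_Cases": 900, "Pharmacies": 2},
    {"PostalCode": "X1B", "Covid_Cases": 150, "Pharmacies": 12},
    {"PostalCode": "X1C", "Covid_Cases": 700, "Pharmacies": 4},
    {"PostalCode": "X1D", "Covid_Cases": 100, "Pharmacies": 1},
]


def underserved(rows, max_pharmacies=5):
    """Rows with above-average cases and fewer than max_pharmacies pharmacies."""
    avg_cases = mean(r["Covid_Cases"] for r in rows)
    return [
        r for r in rows
        if r["Covid_Cases"] > avg_cases and r["Pharmacies"] < max_pharmacies
    ]


flagged = [r["PostalCode"] for r in underserved(neighborhoods)]
print(flagged)
```

A fixed threshold like this is easy to read but sensitive to where the cut-off is placed, which is one reason to group neighborhoods with a clustering algorithm instead.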

### Part 6, measure the distance of each neighborhood to downtown Toronto (City Hall)
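
Geopy's `distance.distance`, used in the cells of this part, computes a geodesic distance on an ellipsoidal model of the Earth. As a rough cross-check, the haversine formula on a sphere gives a close approximation over city-scale distances. The sketch below is plain Python and independent of Geopy; the function name `haversine_km` and both sets of coordinates are assumptions for illustration (the City Hall coordinates are approximate).

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; the sphere is an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in decimal degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlambda = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


# Approximate coordinates of Toronto City Hall and an arbitrary example point.
city_hall = (43.6534, -79.3841)
example_point = (43.8067, -79.1944)

print(round(haversine_km(*city_hall, *example_point), 1), "km")
```

The two methods differ slightly because the Earth is not a perfect sphere, so small disagreements with the Geopy values are expected.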

In [None]:
#import a function to get the distance between two points of known latitude and longitude
from geopy import distance


# add a zeros column to the database, that will be updated later with the 
# distance of the neighborhood to downtown
covid_phar.insert(5, "distance", 0)


In [None]:
#calculate distance in km for each neighborhood using geopy function, store that in distance column
covid_phar["distance"]=[distance.distance((latitude,longitude),(covid_phar.Latitude[i],covid_phar.Longitude[i])).km for i in range(len(covid_phar.distance))]
covid_phar.head(5)

In [None]:
#let's get some statistics for relevant features of each neighborhood 
covid_phar.iloc[:,5:8].describe()

### Part 7, use Kmeans to cluster the neighborhoods and analyze each cluster
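
The clustering cell in this part scales the features with `StandardScaler` before running K-means and later applies the inverse transformation to the cluster centers. The sketch below illustrates why, in plain Python with invented numbers: distance, case counts and pharmacy counts live on very different scales, and without scaling the feature with the largest values would dominate the Euclidean distances used by K-means. The helper names are hypothetical; the population standard deviation is used because that is what `StandardScaler` uses.

```python
from statistics import mean, pstdev


def standardize(values):
    """Return z-scores plus the mean and population std used to compute them."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values], mu, sigma


def inverse(z_values, mu, sigma):
    """Map z-scores back to the original units, as done for the cluster centers."""
    return [z * sigma + mu for z in z_values]


# Invented values: the second feature is the first one divided by 100.
covid_cases = [100, 300, 500, 900]
pharmacies = [1, 3, 5, 9]

z_cases, mu_c, sd_c = standardize(covid_cases)
z_pharm, mu_p, sd_p = standardize(pharmacies)

# After scaling, both features have the same spread despite the factor of 100.
print([round(z, 3) for z in z_cases])
print([round(z, 3) for z in z_pharm])
print([round(v) for v in inverse(z_cases, mu_c, sd_c)])
```

Because the transformation is linear, cluster centers found in scaled space can be mapped back to kilometres, cases and pharmacies for interpretation.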

In [None]:
#let's arrange the neighborhoods in clusters using K-means, to classify them according to relevant features
# set number of clusters
kclusters = 5

# run k-means clustering, after storing in X the scaled data

from sklearn.preprocessing import StandardScaler
scal=StandardScaler()
X = scal.fit_transform(covid_phar.iloc[:,5:8])
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(X)

#add the label of the Kmeans cluster to the dataframe
covid_phar["labels"]=kmeans.labels_
#To analyze the features of each cluster centers,
##let us apply the inverse standard scaler transformation to kmeans cluster centers
ClusterCenters=scal.inverse_transform(kmeans.cluster_centers_)
# check cluster labels generated for each row in the dataframe
kmeans.labels_

In [None]:
#let us take a look at the dataframe
covid_phar

In [None]:
# create a map showing each neighborhood with a color that indicates the cluster it belongs to
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster, covid in zip(covid_phar['Latitude'], covid_phar['Longitude'], covid_phar['Neighborhood'], covid_phar['labels'], covid_phar['Covid_Cases']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster) + " CovidCases " + str(covid), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=1).add_to(map_clusters)
       
map_clusters


In [None]:
#let us make a dataframe of normalized feature, to make some additional plots
a=pd.DataFrame(X)


In [None]:
#let us make some boxplots to understand how neighborhoods were grouped by Kmeans unsupervised algorithm
#we use the same color code as in the previous graph

fig = plt.figure(figsize=(10, 8))
ax1 = fig.add_subplot(1, 1, 1)

ax1.set_title('Clustered Normalized data')
ax1.boxplot(a, showfliers=False)
ax1.set_xticklabels(labels=["Distance to downtown", "Covid Cases","Pharmacies"])

#plot all values, with a color that indicates which cluster they belong to

for i in [1,2,3]:
    y = a.iloc[:,i-1]
    x = np.random.normal(i, 0.04, size=len(y))   #add some noise to the value of the points, to aid the visualization
    for l, m ,cluster in zip (x, y, covid_phar.labels):
        ax1.plot(l,m, 'o', markerfacecolor=rainbow[cluster],markeredgecolor='k', markersize=10,alpha=0.5)     




In [None]:
#let us assign a color to each label, so that graphs look good and consistent
labels=covid_phar.labels.replace({0:rainbow[0],1:rainbow[1],2:rainbow[2],3:rainbow[3],4:rainbow[4]})

#let us make a bubble plot: the area of each bubble is proportional to Covid cases, and the x and y coordinates show
#the number of pharmacies and the distance to downtown, respectively
ax1=covid_phar.plot(kind='scatter',
                    x='Pharmacies',
                    y='distance',
                    alpha=0.5,
                    figsize=(10, 5),
                     c=labels,
                    s=covid_phar.Covid_Cases      
                   )

#add a crossline marker for the center of each cluster
for i in range(kclusters):
     ax1.plot(ClusterCenters[i,2], ClusterCenters[i,0], 'x', markerfacecolor=rainbow[i], 
              markeredgecolor=rainbow[i], markersize=ClusterCenters[i,1]/100+10)
ax1.set_xlabel('Pharmacies')
ax1.set_ylabel('Distance to downtown')
ax1.set_title('Pharmacies, distance and Covid cases \n (Bubble size proportional to Covid Cases)')

### Part 8, analysis of each cluster

In [None]:
centers=pd.DataFrame(ClusterCenters, columns=["distance to downtown","Covid Cases","Existing Pharmacies"])
centers["color code"]=["Purple","Light blue","Light Green","Yellow","Red"]
centers

In [None]:
#let us analyze each cluster
print ("This cluster is centered at {:.2f} km from downtown, {:.2f} Covid Cases and {:.2f} pharmacies"
       .format(ClusterCenters[0,0],ClusterCenters[0,1],ClusterCenters[0,2]))
covid_phar[covid_phar["labels"]==0]


Covid cases in the neighborhoods of this cluster are higher than the average for Toronto neighborhoods. Each neighborhood has 8 pharmacies or fewer, and most have fewer than the average. The distance from downtown ranges from 8 to 25 km. Those neighborhoods probably need some more pharmacies.

In [None]:
print ("This cluster is centered at {:.2f} km from downtown, {:.2f} Covid Cases and {:.2f} pharmacies"
       .format(ClusterCenters[1,0],ClusterCenters[1,1],ClusterCenters[1,2]))
covid_phar[covid_phar["labels"]==1]

Here we have neighborhoods close to downtown, with fewer than 10 pharmacies each. The number of Covid cases is mixed. Those are probably the best neighborhoods to open pharmacies if the customer decides to prioritize closeness to downtown.








In [None]:
print ("This cluster is centered at {:.2f} km from downtown, {:.2f} Covid Cases and {:.2f} pharmacies"
       .format(ClusterCenters[2,0],ClusterCenters[2,1],ClusterCenters[2,2]))
covid_phar[covid_phar["labels"]==2]

Here we have neighborhoods with fewer than 5 pharmacies each. The number of Covid cases is mixed. Those are probably good neighborhoods to open pharmacies, but the distance to downtown is intermediate.


In [None]:
print ("This cluster is centered at {:.2f} km from downtown, {:.2f} Covid Cases and {:.2f} pharmacies"
       .format(ClusterCenters[3,0],ClusterCenters[3,1],ClusterCenters[3,2]))
covid_phar[covid_phar["labels"]==3]


Those are the neighborhoods with the largest number of pharmacies, and they are also the closest to downtown. Curiously, the number of Covid cases was below average for all of them. The neighborhoods in this cluster do not need more pharmacies.

In [None]:
print ("This cluster is centered at {:.2f} km from downtown, {:.2f} Covid Cases and {:.2f} pharmacies"
       .format(ClusterCenters[4,0],ClusterCenters[4,1],ClusterCenters[4,2]))
covid_phar[covid_phar["labels"]==4]

Those are the neighborhoods with the largest number of Covid cases, and with fewer pharmacies than the average. Although they are far from downtown, they are in need of more pharmacies.

## Conclusions

In this project, Toronto neighborhoods were analyzed according to the number of Covid cases, the number of existing pharmacies and the distance to downtown Toronto. K-means clustering was used to sort the neighborhoods into 5 groups, and each group was analyzed to determine which one contains the best options for opening a new pharmacy. The neighborhoods with postal codes M1B, M3N and M9V (cluster 4) had the largest number of Covid cases and a number of pharmacies well below average, so they are the best option if the investor is willing to open a pharmacy more than 16 km from downtown. The neighborhoods closest to downtown (cluster 3) had the highest number of pharmacies but, curiously, a lower than average number of Covid cases. The market is likely saturated in those neighborhoods, so they would not be a good option. The remaining clusters (0, 1 and 2) contain neighborhoods that are good options for opening a pharmacy; the choice will depend on whether the investor prioritizes distance to downtown or market demand.