# Analysis Project: Opening a New Business

# Business Problem


ABC Corp., a business group, wants to open a new venture in Toronto and needs to identify the best type of business to enter and the neighborhood in which to open it.

They want to go big and establish a brand of their own rather than compete with the business groups already established in an area.

Hence they want an analysis that answers two questions: which type of business is the most popular and fast-moving, and in which areas it is absent or not yet popular. This approach runs contrary to that of most businesses, and they want the analysis done before finalizing the venue.
 

# Data to Be Used and Approach

Looking into the business requirements, we can see that there might be multiple ways of approaching the problem, right from capital required to best areas.

The approach followed will be as follows:

1. Get the postal codes, neighborhoods, and borough data from Wikipedia.

2. Use the Foursquare API to get the venues in each neighborhood and identify the most common (popular) venue categories across Toronto.

3. Cluster the neighborhoods to see which areas already have the most common business. This indicates which business ventures can be suggested to the ABC Corp. group.

4. Identify the neighborhoods where such businesses are not among the top venues. If a good option is offered to customers there, the business has the potential to work out in the long run.

Other approaches could follow from this analysis, but for the sake of this project we restrict ourselves to the steps above.
 

# Methodology and Analysis

In [1]:
import requests
import numpy as np 
import lxml.html as lh
import pandas as pd
import json # library to handle JSON files
from pandas import json_normalize # transform JSON into a pandas dataframe (pandas >= 1.0; older versions expose it as pandas.io.json.json_normalize)
from sklearn.cluster import KMeans
import matplotlib.cm as cm
import matplotlib.colors as colors

# Read from the url and get all the table elements

In [2]:
url='https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
#Create a handle, page, to handle the contents of the website
page = requests.get(url)
#Store the contents of the website under doc
doc = lh.fromstring(page.content)
#Parse data that are stored between <tr>..</tr> of HTML
tr_elements = doc.xpath('//tr')

## Check the columns scraped to be sure 

In [3]:
#Create empty list
col=[]
i=0
#For each cell of the header row, print and record the column name
for t in tr_elements[0]:
    i+=1
    name=t.text
    name=name.replace("\n","")
    name= name.replace("\r","")
    print (i,name)
    col.append((name,[]))

1 Postcode
2 Borough
3 Neighbourhood


# Get all the data and check if all the columns have data

In [4]:
#Since our first row is the header, data is stored on the second row onwards
for j in range(1,len(tr_elements)):
    #T is our j'th row
    T=tr_elements[j]
    
    #If row is not of size 3, the //tr data is not from our table 
    if len(T)!=3:
        break
    
    #i is the index of our column
    i=0
    
    #Iterate through each element of the row
    for t in T.iterchildren():
        data=t.text_content()
        #Newline characters are left in place here and stripped after the DataFrame is built
        #For every column after the first, convert purely numeric values to integers
        if i>0:
            try:
                data=int(data)
            except ValueError:
                pass
        #Append the data to the empty list of the i'th column
        col[i][1].append(data)
        #Increment i for the next column
        i+=1
[len(C) for (title,C) in col]

[289, 289, 289]
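
The row-by-row scraping above can be tried offline on a small inline table. The following is a minimal sketch that uses only the standard library's `html.parser` instead of lxml; the `TableParser` class and the `SAMPLE_HTML` string are illustrative assumptions, not part of the notebook.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the text of every <th>/<td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # strip() drops the trailing newline that the last cell carries
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Hypothetical sample shaped like the postal code table
SAMPLE_HTML = """
<table>
<tr><th>Postcode</th><th>Borough</th><th>Neighbourhood
</th></tr>
<tr><td>M1A</td><td>Not assigned</td><td>Not assigned
</td></tr>
<tr><td>M3A</td><td>North York</td><td>Parkwoods
</td></tr>
</table>
"""

parser = TableParser()
parser.feed(SAMPLE_HTML)
header, *records = parser.rows
# One list per column, mirroring the (title, values) pairs built above
columns = {title: [rec[i] for rec in records] for i, title in enumerate(header)}
print(columns)
```

Stripping each cell while parsing avoids a separate clean-up pass for newline characters later on.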

# Convert to a dataframe

In [5]:
Dict={title:column for (title,column) in col}
df=pd.DataFrame(Dict)
df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned\n
1,M2A,Not assigned,Not assigned\n
2,M3A,North York,Parkwoods\n
3,M4A,North York,Victoria Village\n
4,M5A,Downtown Toronto,Harbourfront\n


# Keep only the rows that have an assigned Borough; where Neighbourhood = 'Not assigned', use the Borough name instead

In [6]:

filtered_df=df.loc[df['Borough'] != 'Not assigned']
# Clean data for new line characters
filtered_df = filtered_df.replace('\n','', regex=True)

# Where Neighbourhood is 'Not assigned', use the Borough name instead
filtered_df['Neighbourhood'] = np.where(filtered_df['Neighbourhood'] == 'Not assigned',filtered_df['Borough'],filtered_df['Neighbourhood'])
filtered_df.head()


Unnamed: 0,Postcode,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights


# Group by Postcode and join multiple Neighbourhood values into one comma-separated string

In [7]:
filtered_df2=filtered_df.groupby(['Postcode', 'Borough'])['Neighbourhood'].agg(', '.join).reset_index()
filtered_df2.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
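
The cleaning and grouping rules of the last two cells can be written out in plain Python to make the logic explicit. This is a sketch under assumptions: the function name `clean_and_group` is hypothetical, and the sample rows are illustrative (the M5A rows follow the output above, the others are made up to exercise each rule).

```python
def clean_and_group(rows):
    """Apply the notebook's rules to (postcode, borough, neighbourhood) tuples."""
    grouped = {}
    for postcode, borough, neighbourhood in rows:
        if borough == "Not assigned":
            continue  # rule 1: drop postcodes without a borough
        if neighbourhood == "Not assigned":
            neighbourhood = borough  # rule 2: fall back to the borough name
        grouped.setdefault((postcode, borough), []).append(neighbourhood)
    # rule 3: one row per postcode, neighbourhoods joined by ", "
    return [
        (postcode, borough, ", ".join(names))
        for (postcode, borough), names in sorted(grouped.items())
    ]


sample_rows = [
    ("M1A", "Not assigned", "Not assigned"),        # dropped
    ("M5A", "Downtown Toronto", "Harbourfront"),
    ("M5A", "Downtown Toronto", "Regent Park"),
    ("M9X", "Example Borough", "Not assigned"),     # hypothetical fallback case
]

result = clean_and_group(sample_rows)
for row in result:
    print(row)
```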


# show shape as requested

In [8]:
filtered_df2.shape

(103, 3)

## Read the Geospatial_Coordinates file for the data 

In [9]:
geo_df = pd.read_csv("Geospatial_Coordinates.csv")
geo_df.head()

Unnamed: 0,Postcode,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


## Merge in the latitude and longitude from the data provided

In [10]:
# left join keeps every postcode, even if no coordinates are found for it
filtered_df3=pd.merge(filtered_df2, geo_df, how='left', on='Postcode')
filtered_df3.head()

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
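
A left join keeps every row of the left table and fills in missing values where the right table has no match. The sketch below illustrates this with plain dictionaries; `left_join` is a hypothetical helper, and the postcode `M9Z` is invented to show the unmatched case.

```python
def left_join(left_rows, coords):
    """Attach (Latitude, Longitude) to each row; unmatched postcodes get None."""
    merged = []
    for row in left_rows:
        lat, lon = coords.get(row["Postcode"], (None, None))
        merged.append({**row, "Latitude": lat, "Longitude": lon})
    return merged


left_rows = [
    {"Postcode": "M1B", "Neighbourhood": "Rouge, Malvern"},
    {"Postcode": "M1G", "Neighbourhood": "Woburn"},
    {"Postcode": "M9Z", "Neighbourhood": "Hypothetical"},  # not in coords
]
coords = {
    "M1B": (43.806686, -79.194353),
    "M1G": (43.770992, -79.216917),
}

merged = left_join(left_rows, coords)
# Rows without coordinates are worth checking before mapping them
missing = [row["Postcode"] for row in merged if row["Latitude"] is None]
print(missing)
```

Checking for unmatched postcodes right after the merge guards against markers silently disappearing from the map.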


In [11]:
filtered_df3.to_csv("filtered_df3.csv")

In [12]:
# keep only the columns needed for mapping and clustering
neighborhoods = filtered_df3[['Borough', 'Neighbourhood', 'Latitude', 'Longitude']]

In [13]:
# keep only the boroughs whose name contains "Toronto"
Toronto_neighborhoods=neighborhoods[neighborhoods['Borough'].str.contains("Toronto")]
Toronto_neighborhoods.head()

Unnamed: 0,Borough,Neighbourhood,Latitude,Longitude
37,East Toronto,The Beaches,43.676357,-79.293031
41,East Toronto,"The Danforth West, Riverdale",43.679557,-79.352188
42,East Toronto,"The Beaches West, India Bazaar",43.668999,-79.315572
43,East Toronto,Studio District,43.659526,-79.340923
44,Central Toronto,Lawrence Park,43.72802,-79.38879


# Check the data for Toronto

In [14]:
print('The Toronto dataframe has {} boroughs and {} neighborhoods.'.format(
        len(Toronto_neighborhoods['Borough'].unique()),
        Toronto_neighborhoods.shape[0]
    )
)

The Toronto dataframe has 4 boroughs and 38 neighborhoods.


# Get the coordinates for Toronto

In [15]:
from geopy.geocoders import Nominatim 
address = 'Toronto'

# Nominatim expects a user_agent identifying the application; the name used here is arbitrary
geolocator = Nominatim(user_agent="toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.653963, -79.387207.


# Create the map with the details

In [16]:
# create map of Toronto using latitude and longitude values
import folium # map rendering library
map_Toronto = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, borough, neighborhood in zip(neighborhoods['Latitude'], neighborhoods['Longitude'], neighborhoods['Borough'], neighborhoods['Neighbourhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='pink',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_Toronto)  
    
map_Toronto

#### Define Foursquare Credentials and Version

In [17]:
# Foursquare credentials have been removed; substitute your own values
CLIENT_ID = 'your-client-ID' # your Foursquare ID
CLIENT_SECRET = 'your-client-secret' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version


In [18]:
# Coordinates of the first Toronto neighbourhood (index label 37, The Beaches)
neighborhood_latitude = Toronto_neighborhoods.loc[37, 'Latitude'] 
neighborhood_longitude = Toronto_neighborhoods.loc[37, 'Longitude'] 

neighborhood_latitude , neighborhood_longitude

(43.67635739999999, -79.2930312)

In [19]:
LIMIT = 100 # limit of number of venues returned by Foursquare API
radius = 500 # search radius in metres

url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    neighborhood_latitude, 
    neighborhood_longitude, 
    radius, 
    LIMIT)
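
Building the query string by hand works, but encoding the parameters with the standard library is less error-prone. The sketch below assumes the same endpoint and parameter names as the request above; `build_explore_url` is a hypothetical helper and the credentials are placeholders.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_explore_url(client_id, client_secret, version, lat, lng,
                      radius=500, limit=100):
    """Assemble the explore request URL with properly encoded parameters."""
    base = "https://api.foursquare.com/v2/venues/explore"
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lng),
        "radius": radius,
        "limit": limit,
    }
    return base + "?" + urlencode(params)


explore_url = build_explore_url(
    "your-client-ID", "your-client-secret", "20180605",
    43.676357, -79.293031,
)

# Decode the query string again to confirm the parameters round-trip
query = parse_qs(urlsplit(explore_url).query)
print(query["ll"], query["radius"], query["limit"])
```

With `requests`, the same dictionary can be passed as `requests.get(base, params=params)`, which performs the encoding itself.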

In [20]:
results = requests.get(url).json()
results

{'meta': {'code': 200, 'requestId': '5c1216cb351e3d2f8812659b'},
 'response': {'headerLocation': 'The Beaches',
  'headerFullLocation': 'The Beaches, Toronto',
  'headerLocationGranularity': 'neighborhood',
  'totalResults': 4,
  'suggestedBounds': {'ne': {'lat': 43.680857404499996,
    'lng': -79.28682091449052},
   'sw': {'lat': 43.67185739549999, 'lng': -79.29924148550948}},
  'groups': [{'type': 'Recommended Places',
    'name': 'recommended',
    'items': [{'reasons': {'count': 0,
       'items': [{'summary': 'This spot is popular',
         'type': 'general',
         'reasonName': 'globalInteractionReason'}]},
      'venue': {'id': '4b8daea1f964a520480833e3',
       'name': 'Grover Pub and Grub',
       'location': {'address': '676 Kingston Rd.',
        'crossStreet': 'at Main St.',
        'lat': 43.679181434941015,
        'lng': -79.29721535878515,
        'labeledLatLngs': [{'label': 'display',
          'lat': 43.679181434941015,
          'lng': -79.29721535878515}],
    

In [21]:
# function that extracts the category of the venue
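def get_category_type(row):
    # the column is named 'categories' or 'venue.categories' depending on how the JSON was flattened
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']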

In [22]:
venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues =nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()

Unnamed: 0,name,categories,lat,lng
0,Grover Pub and Grub,Pub,43.679181,-79.297215
1,Starbucks,Coffee Shop,43.678798,-79.298045
2,Upper Beaches,Neighborhood,43.680563,-79.292869
3,Fearless Meat,Burger Joint,43.680337,-79.290289


#### Check number of venues returned

In [23]:
print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

4 venues were returned by Foursquare.
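
As a sanity check on the 500 m search radius, the great-circle distance from the neighbourhood coordinates to each returned venue can be computed with the haversine formula. This is a sketch: `haversine_m` is a hypothetical helper, a spherical Earth of radius 6,371 km is assumed, and how Foursquare itself measures the radius is not something this check can confirm.

```python
import math


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    earth_radius_m = 6371000.0  # mean Earth radius, spherical approximation
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * earth_radius_m * math.asin(math.sqrt(a))


# Coordinates taken from the outputs above (The Beaches and its four venues)
centre = (43.676357, -79.293031)
venue_coords = {
    "Grover Pub and Grub": (43.679181, -79.297215),
    "Starbucks": (43.678798, -79.298045),
    "Upper Beaches": (43.680563, -79.292869),
    "Fearless Meat": (43.680337, -79.290289),
}

distances = {
    name: haversine_m(centre[0], centre[1], lat, lng)
    for name, (lat, lng) in venue_coords.items()
}
for name, metres in distances.items():
    print("{}: {:.0f} m".format(name, metres))
```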


In [24]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighbourhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
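
The function above indexes directly into the response, so a venue without categories or a response without groups would raise an exception. The sketch below shows one defensive way to pull the same fields out of a payload; `extract_venues` is a hypothetical helper, and the second venue in the sample is invented to exercise the empty-category case.

```python
def extract_venues(payload, neighbourhood):
    """Return (neighbourhood, name, lat, lng, category) tuples from an explore-style payload."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item.get("venue", {})
        location = venue.get("location", {})
        categories = venue.get("categories", [])
        # Fall back to None when a venue carries no category
        category = categories[0]["name"] if categories else None
        rows.append((neighbourhood, venue.get("name"),
                     location.get("lat"), location.get("lng"), category))
    return rows


sample_payload = {
    "response": {
        "groups": [{
            "items": [
                {"venue": {"name": "Grover Pub and Grub",
                           "location": {"lat": 43.679181434941015,
                                        "lng": -79.29721535878515},
                           "categories": [{"name": "Pub"}]}},
                # Hypothetical venue with no category information
                {"venue": {"name": "Example Venue",
                           "location": {"lat": 43.68, "lng": -79.29},
                           "categories": []}},
            ]
        }]
    }
}

rows = extract_venues(sample_payload, "The Beaches")
empty = extract_venues({"response": {}}, "The Beaches")
print(rows)
print(empty)
```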

In [25]:
Toronto_venues = getNearbyVenues(names=Toronto_neighborhoods['Neighbourhood'],
                                   latitudes=Toronto_neighborhoods['Latitude'],
                                   longitudes=Toronto_neighborhoods['Longitude']
                                  )

The Beaches
The Danforth West, Riverdale
The Beaches West, India Bazaar
Studio District
Lawrence Park
Davisville North
North Toronto West
Davisville
Moore Park, Summerhill East
Deer Park, Forest Hill SE, Rathnelly, South Hill, Summerhill West
Rosedale
Cabbagetown, St. James Town
Church and Wellesley
Harbourfront, Regent Park
Ryerson, Garden District
St. James Town
Berczy Park
Central Bay Street
Adelaide, King, Richmond
Harbourfront East, Toronto Islands, Union Station
Design Exchange, Toronto Dominion Centre
Commerce Court, Victoria Hotel
Roselawn
Forest Hill North, Forest Hill West
The Annex, North Midtown, Yorkville
Harbord, University of Toronto
Chinatown, Grange Park, Kensington Market
CN Tower, Bathurst Quay, Island airport, Harbourfront West, King and Spadina, Railway Lands, South Niagara
Stn A PO Boxes 25 The Esplanade
First Canadian Place, Underground city
Christie
Dovercourt Village, Dufferin
Little Portugal, Trinity
Brockton, Exhibition Place, Parkdale Village
High Park, The 

In [26]:
print(Toronto_venues.shape)
Toronto_venues.head()

(1695, 7)


Unnamed: 0,Neighbourhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,The Beaches,43.676357,-79.293031,Grover Pub and Grub,43.679181,-79.297215,Pub
1,The Beaches,43.676357,-79.293031,Starbucks,43.678798,-79.298045,Coffee Shop
2,The Beaches,43.676357,-79.293031,Upper Beaches,43.680563,-79.292869,Neighborhood
3,The Beaches,43.676357,-79.293031,Fearless Meat,43.680337,-79.290289,Burger Joint
4,"The Danforth West, Riverdale",43.679557,-79.352188,Pantheon,43.677621,-79.351434,Greek Restaurant


In [27]:
# check no of neighbourhoods
Toronto_venues.groupby('Neighbourhood').count()


Unnamed: 0_level_0,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighbourhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
"Adelaide, King, Richmond",100,100,100,100,100,100
Berczy Park,55,55,55,55,55,55
"Brockton, Exhibition Place, Parkdale Village",21,21,21,21,21,21
Business reply mail Processing Centre969 Eastern,15,15,15,15,15,15
"CN Tower, Bathurst Quay, Island airport, Harbourfront West, King and Spadina, Railway Lands, South Niagara",14,14,14,14,14,14
"Cabbagetown, St. James Town",49,49,49,49,49,49
Central Bay Street,81,81,81,81,81,81
"Chinatown, Grange Park, Kensington Market",98,98,98,98,98,98
Christie,15,15,15,15,15,15
Church and Wellesley,82,82,82,82,82,82


In [28]:
# Find unique Categories
print('There are {} unique categories.'.format(len(Toronto_venues['Venue Category'].unique())))

There are 231 unique categories.


In [29]:
# one hot encoding
Toronto_onehot = pd.get_dummies(Toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
Toronto_onehot['Neighbourhood'] = Toronto_venues['Neighbourhood'] 

# move neighborhood column to the first column
fixed_columns = [Toronto_onehot.columns[-1]] + list(Toronto_onehot.columns[:-1])
Toronto_onehot = Toronto_onehot[fixed_columns]

Toronto_onehot.head()

Unnamed: 0,Neighbourhood,Adult Boutique,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,...,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,The Beaches,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,The Beaches,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,The Beaches,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,The Beaches,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,"The Danforth West, Riverdale",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [30]:
Toronto_onehot.shape

(1695, 232)

In [31]:
Toronto_grouped = Toronto_onehot.groupby('Neighbourhood').mean().reset_index()
Toronto_grouped

Unnamed: 0,Neighbourhood,Adult Boutique,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,...,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,"Adelaide, King, Richmond",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.04,...,0.0,0.0,0.01,0.0,0.0,0.01,0.0,0.0,0.01,0.0
1,Berczy Park,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"Brockton, Exhibition Place, Parkdale Village",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Business reply mail Processing Centre969 Eastern,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,"CN Tower, Bathurst Quay, Island airport, Harbo...",0.0,0.0,0.071429,0.071429,0.071429,0.142857,0.142857,0.142857,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,"Cabbagetown, St. James Town",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.020408,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,Central Bay Street,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.012346,...,0.0,0.0,0.012346,0.0,0.0,0.012346,0.0,0.0,0.0,0.012346
7,"Chinatown, Grange Park, Kensington Market",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.05102,0.0,0.05102,0.010204,0.0,0.0,0.0,0.0
8,Christie,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,Church and Wellesley,0.012195,0.012195,0.0,0.0,0.0,0.0,0.0,0.0,0.012195,...,0.0,0.0,0.0,0.012195,0.012195,0.0,0.012195,0.012195,0.0,0.012195


In [32]:
Toronto_grouped.shape

(38, 232)
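
Taking the mean of one-hot columns per neighbourhood is the same as computing each category's share of that neighbourhood's venues. The following sketch reproduces that with `collections.Counter`; `category_frequencies` is a hypothetical helper and the sample uses the first five venues listed above.

```python
from collections import Counter, defaultdict


def category_frequencies(venues):
    """Map neighbourhood -> {category: share of that neighbourhood's venues}."""
    counts = defaultdict(Counter)
    for neighbourhood, category in venues:
        counts[neighbourhood][category] += 1
    frequencies = {}
    for neighbourhood, counter in counts.items():
        total = sum(counter.values())
        frequencies[neighbourhood] = {
            category: n / total for category, n in counter.items()
        }
    return frequencies


sample_venues = [
    ("The Beaches", "Pub"),
    ("The Beaches", "Coffee Shop"),
    ("The Beaches", "Neighborhood"),
    ("The Beaches", "Burger Joint"),
    ("The Danforth West, Riverdale", "Greek Restaurant"),
]

frequencies = category_frequencies(sample_venues)
print(frequencies["The Beaches"])
```

Because the values are shares, each neighbourhood's frequencies sum to 1, which puts neighbourhoods with many venues and those with few on the same scale for clustering.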

#### The top 5 most common venues

In [33]:
num_top_venues = 5

for hood in Toronto_grouped['Neighbourhood']:
    print("----"+hood+"----")
    temp = Toronto_grouped[Toronto_grouped['Neighbourhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Adelaide, King, Richmond----
                 venue  freq
0          Coffee Shop  0.06
1                 Café  0.05
2           Steakhouse  0.04
3      Thai Restaurant  0.04
4  American Restaurant  0.04


----Berczy Park----
                venue  freq
0         Coffee Shop  0.07
1        Cocktail Bar  0.05
2          Restaurant  0.05
3  Seafood Restaurant  0.04
4  Italian Restaurant  0.04


----Brockton, Exhibition Place, Parkdale Village----
            venue  freq
0     Coffee Shop  0.14
1  Breakfast Spot  0.10
2            Café  0.10
3             Gym  0.05
4   Grocery Store  0.05


----Business reply mail Processing Centre969 Eastern----
           venue  freq
0  Garden Center  0.07
1  Auto Workshop  0.07
2            Spa  0.07
3    Pizza Place  0.07
4        Brewery  0.07


----CN Tower, Bathurst Quay, Island airport, Harbourfront West, King and Spadina, Railway Lands, South Niagara----
              venue  freq
0  Airport Terminal  0.14
1    Airport Lounge  0.14
2   Airport 

In [34]:
# Function for sorting venues in desc order
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [35]:
num_top_venues = 3

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighbourhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        # ranks beyond the 3rd use the generic 'th' suffix
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighbourhood'] = Toronto_grouped['Neighbourhood']

for ind in np.arange(Toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(Toronto_grouped.iloc[ind, :], num_top_venues)


neighborhoods_venues_sorted.sort_values(['1st Most Common Venue', '2nd Most Common Venue'], ascending=[True, True])


Unnamed: 0,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue
4,"CN Tower, Bathurst Quay, Island airport, Harbo...",Airport Service,Airport Terminal,Airport Lounge
15,"Dovercourt Village, Dufferin",Bakery,Supermarket,Pharmacy
23,"Little Portugal, Trinity",Bar,Coffee Shop,Restaurant
26,"Parkdale, Roncesvalles",Breakfast Spot,Gift Shop,Dessert Shop
7,"Chinatown, Grange Park, Kensington Market",Café,Bar,Vietnamese Restaurant
18,"Harbord, University of Toronto",Café,Coffee Shop,Bar
33,Studio District,Café,Coffee Shop,Bakery
8,Christie,Café,Grocery Store,Park
30,"Ryerson, Garden District",Clothing Store,Coffee Shop,Cosmetics Shop
19,"Harbourfront East, Toronto Islands, Union Station",Coffee Shop,Aquarium,Hotel


### Cluster the data for Neighbourhood

In [36]:
# set number of clusters
kclusters = 5

Toronto_grouped_clustering = Toronto_grouped.drop(columns='Neighbourhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(Toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
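
To show what k-means does with the frequency vectors, here is a minimal pure-Python sketch of the basic iteration (Lloyd's algorithm): assign each point to its nearest centroid, then move each centroid to the mean of its points. It is not scikit-learn's implementation; the initialisation (the first `k` points), the fixed iteration count, and the two-dimensional sample vectors are all simplifying assumptions.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iterations=10):
    """Return (labels, centroids) after a fixed number of Lloyd iterations."""
    centroids = [list(p) for p in points[:k]]  # naive init: first k points
    labels = []
    for _ in range(iterations):
        # Assignment step: index of the nearest centroid for every point
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # leave an empty cluster's centroid where it is
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids


# Hypothetical frequency vectors: (share of Coffee Shops, share of Parks)
points = [(0.14, 0.0), (0.0, 0.5), (0.12, 0.0), (0.0, 0.4)]
labels, centroids = kmeans(points, k=2)
print(labels)
print(centroids)
```

The label numbers themselves carry no meaning; only which points share a label matters, so label 2 in one run is not guaranteed to describe the same group in another.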

In [37]:
# work on a copy so that adding a column does not modify a slice of `neighborhoods`
Toronto_merged = Toronto_neighborhoods.copy()

# add clustering labels
# caution: kmeans.labels_ follows the row order of Toronto_grouped (sorted by Neighbourhood),
# while Toronto_neighborhoods is in postcode order, so this positional assignment assumes the two orders agree
Toronto_merged['Cluster Labels'] = kmeans.labels_

# add the most common venues for each neighbourhood
Toronto_merged = Toronto_merged.join(neighborhoods_venues_sorted.set_index('Neighbourhood'), on='Neighbourhood')

Toronto_merged.head() # check the last columns!
Toronto_merged.to_csv("merged.csv")
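
Cluster labels come back in the row order of the frame that was clustered, which is sorted by neighbourhood name after the `groupby`, while the coordinates table is in postcode order. Attaching labels by position can therefore pair a label with the wrong neighbourhood. The sketch below contrasts positional assignment with a lookup by name; the three neighbourhoods are taken from the notebook, but the label values are hypothetical.

```python
# Order of the clustered frame (alphabetical) and hypothetical labels in that order
grouped_names = ["Berczy Park", "Christie", "The Beaches"]
cluster_labels = [2, 0, 1]

# Order of the coordinates table (postcode order), which differs
coordinate_rows = [
    {"Neighbourhood": "The Beaches"},
    {"Neighbourhood": "Berczy Park"},
    {"Neighbourhood": "Christie"},
]

# Positional assignment: the label at position i goes to row i, whatever its name
by_position = [
    {**row, "Cluster Labels": label}
    for row, label in zip(coordinate_rows, cluster_labels)
]

# Name-based assignment: look each neighbourhood up in a name -> label mapping
label_by_name = dict(zip(grouped_names, cluster_labels))
by_name = [
    {**row, "Cluster Labels": label_by_name.get(row["Neighbourhood"])}
    for row in coordinate_rows
]

print(by_position[0])  # The Beaches receives Berczy Park's label
print(by_name[0])      # The Beaches receives its own label
```

In pandas, one way to get the name-based behaviour would be to add the labels as a column of the frame that was clustered and join on `Neighbourhood`.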


#### Visualize the Clustered Data

In [38]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters: one evenly spaced rainbow color per cluster
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
for lat, lon, poi, cluster in zip(Toronto_merged['Latitude'], Toronto_merged['Longitude'], Toronto_merged['Neighbourhood'], Toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

## Analysis to see the most popular business ventures 

In [39]:

Analysis_df=Toronto_merged.groupby(['1st Most Common Venue','2nd Most Common Venue'])['Cluster Labels'].count().sort_values(ascending=False)
Analysis_df

1st Most Common Venue    2nd Most Common Venue
Coffee Shop              Café                     6
                         Restaurant               3
Café                     Coffee Shop              2
Sushi Restaurant         Coffee Shop              1
Bakery                   Supermarket              1
Bar                      Coffee Shop              1
Breakfast Spot           Gift Shop                1
Café                     Bar                      1
                         Grocery Store            1
Clothing Store           Coffee Shop              1
Coffee Shop              Aquarium                 1
                         Breakfast Spot           1
                         Hotel                    1
                         Italian Restaurant       1
                         Pub                      1
Trail                    Gym                      1
Coffee Shop              Sandwich Place           1
Greek Restaurant         Coffee Shop              1
Grocery Store    

# Results

Next, we look for the Neighbourhoods where the most common businesses, Coffee Shop and Café, are not the 1st or 2nd most common venue. In the clustering, cluster 3 (label 2) contains the largest number of Coffee Shops and Cafés.

In [40]:
Toronto_merged.loc[Toronto_merged['Cluster Labels'] == 2, Toronto_merged.columns[[1] + list(range(5, Toronto_merged.shape[1]))]]

Unnamed: 0,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue
37,The Beaches,Neighborhood,Coffee Shop,Pub
41,"The Danforth West, Riverdale",Greek Restaurant,Coffee Shop,Ice Cream Shop
42,"The Beaches West, India Bazaar",Park,Fast Food Restaurant,Fish & Chips Shop
43,Studio District,Café,Coffee Shop,Bakery
44,Lawrence Park,Park,Lake,Swim School
45,Davisville North,Hotel,Dance Studio,Dog Run
46,North Toronto West,Sporting Goods Shop,Coffee Shop,Yoga Studio
47,Davisville,Pizza Place,Sandwich Place,Dessert Shop
48,"Moore Park, Summerhill East",Trail,Gym,Tennis Court
49,"Deer Park, Forest Hill SE, Rathnelly, South Hi...",Coffee Shop,Pub,Pizza Place


# Now filter for the Neighbourhoods where neither the 1st nor the 2nd Most Common Venue is a Coffee Shop or Café

In [41]:
# keep neighbourhoods where neither of the two most common venues is a Coffee Shop or Café
top_two = Toronto_merged[['1st Most Common Venue', '2nd Most Common Venue']]
filtered_Neighbourhoods4 = Toronto_merged.loc[~top_two.isin(['Coffee Shop', 'Café']).any(axis=1)]

filtered_Neighbourhoods4

Unnamed: 0,Borough,Neighbourhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue
42,East Toronto,"The Beaches West, India Bazaar",43.668999,-79.315572,2,Park,Fast Food Restaurant,Fish & Chips Shop
44,Central Toronto,Lawrence Park,43.72802,-79.38879,2,Park,Lake,Swim School
45,Central Toronto,Davisville North,43.712751,-79.390197,2,Hotel,Dance Studio,Dog Run
47,Central Toronto,Davisville,43.704324,-79.38879,2,Pizza Place,Sandwich Place,Dessert Shop
48,Central Toronto,"Moore Park, Summerhill East",43.689574,-79.38316,2,Trail,Gym,Tennis Court
50,Downtown Toronto,Rosedale,43.679563,-79.377529,2,Park,Playground,Trail
63,Central Toronto,Roselawn,43.711695,-79.416936,2,Health & Beauty Service,Garden,Discount Store
64,Central Toronto,"Forest Hill North, Forest Hill West",43.696948,-79.411307,2,Park,Trail,Jewelry Store
68,Downtown Toronto,"CN Tower, Bathurst Quay, Island airport, Harbo...",43.628947,-79.39442,1,Airport Service,Airport Terminal,Airport Lounge
76,West Toronto,"Dovercourt Village, Dufferin",43.669005,-79.442259,2,Bakery,Supermarket,Pharmacy


# Conclusion

One approach is to open a Café or Coffee Shop in the Neighbourhoods listed above, where no such venue is among the two most common; most of the existing Coffee Shops and Cafés are in cluster 3.
The clustering shows that the most common business is the Coffee Shop / Café, so this is the venture that ABC Corp. can take up.
The clustering also identifies the areas where Coffee Shops / Cafés are not prominent, and a new one opened there has the potential to become a successful business.
