# Final Capstone Project: Week 2

### Best Tutoring Service Locations Search

1. Introduction/Business Problem:
Our goal is to ﬁnd the best location (areas/neighborhood) to start a tutoring service business in Ontario (Brampton), Canada. We need to ﬁnd areas with the schools with the largest enrollment as well as the lowest number of existing tutoring services (oﬀered nearby). The presence of schools determines existence of clients (students), the lack of existing tutoring services means a lack of competition for our prospective business owner. We also consider schools with a lower percentage of low-income families or a higher percentage of parents with university education which indicates that parents can aﬀord the tutorial services offered by the business owner. The process is to specialize, ﬁltering for Elementary or Secondary School, cities, and such.

2. Data:
Source data is publicly available from the government of Ontario at the following link: https://www.ontario.ca/data/schoolinformation-and-student-demographics
This data-set provides info about almost 5,000 schools all over Ontario, with the location coordinates and the number of students enrolled. Using the location data along with foursquare, a venue search engine, we may search for tutoring services near each of the schools. These will be categorized by the number of services nearby. 

## These are the libraries and functions we will use

Searching Foursquare, extracting the relevant information from the resulting JSON, and plotting points onto a Leaflet map.

In [25]:
import matplotlib.cm as cm
import matplotlib.colors as colors
import folium as fol
import pandas as pd
pd.options.display.max_columns= None
pd.set_option('display.max_colwidth', -1)
import numpy as np
from matplotlib import pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
import requests

def foursq_search(lats, lngs, query, limit=50, radius=3000 ):
    '''Search Foursquare at the interable coordinates [lats/lngs] given for the given [query]. 
    Return a list of jsons containing the results'''
    res=[]
    CLIENT_ID = 'UEWJN1IMFLFWQ2VDYFEAZX00HEPRZQCEK14TB5YWKZWGUDMA' # your Foursquare ID
    CLIENT_SECRET = 'F3BLNCIFGVYQSXLFURGE1LAPXEQX53BAEYKGWMILZAYXEU2C' # your Foursquare Secret
    VERSION = '20180605' # Foursquare API version

    base_url= 'https://api.foursquare.com/v2/venues/search?'
    
    for lat, lng in zip( lats, lngs):
        url= base_url + '&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&query={}&limit={}'.format(
                CLIENT_ID,
                CLIENT_SECRET,
                VERSION,
                lat,
                lng,
                radius,
                query,
                limit)
        try:
            result= requests.get(url).json()
        except:
            print('Error searching: {},{}. Assigning 0 venues.'.format(lat, lng))
            result= { 'response':{} }
        res.append(result)
    return res

def extract_results( results, amt=[], venues=[], unique_venues= [], specs= [] , excls= []):
    '''Extract/update from each json in the list of [results] the [amt] of venues and the [unique_venues]. 
    Optionally return only venues containing a string in the [specs] list and not one in the [exclude] list.
    Return a list with the number of venues in each result and a dataframe with the lat/lng/name of each unique venue'''

    for i, result in enumerate( results ):
        
        if len( amt ) < len( results ): 
            amt.append( 0 )
        
        # non-empty result
        if result['response'] != {}:
                
            # iterate through the venues in the response
            for venue in result['response']['venues']:

                # make a tuple of the lat/lng/name of each venue
                temp_venue= ( venue['location']['lat'], venue['location']['lng'], venue['name'] )

                # if the specifics list is nonempty check that at least one of the strings is in the venue name
                # if there are specifics and they aren't in the venue name move on to the next venue
                # similar process if a string in the inclusion list is present
                if ( specs != [] ) & ( np.array([spec.lower() in temp_venue[2].lower() for spec in specs] ).sum() == 0): 
                    continue
                if ( excls != [] ) & ( np.array([excl.lower() in temp_venue[2].lower() for excl in excls] ).sum() != 0): 
                    continue

                amt[i]+= 1 
                
                # if we haven't encountered this venue, add it to our unique venue list
                if not temp_venue in unique_venues: 
                    unique_venues+= [ temp_venue ]

                    
    return amt, unique_venues



def plot_points( lats, lngs , radii=[], colors=[], labels=[], opacities=[], toner=False,zoom=12, prev_map= None):
    '''Plot/add coordinates [lats/lngs] with optional [radii],[colors],[labels]. Optinally update a [prev_map].
    Return a map.'''
    
    pt_amt= len( lats )
    
    # check initial conditions    
    make_popups= lambda labels: [ fol.Popup( l, parse_html=True) for l in labels ] \
                                if len( labels ) == pt_amt \
                                else [None] * pt_amt
    check_radii= lambda radii: radii if len( radii ) == pt_amt else [1] * pt_amt
    check_colors= lambda colors: colors if len( colors ) == pt_amt else ['black'] * pt_amt
    check_opacities= lambda opacities: opacities if len( opacities ) == pt_amt else [1] * pt_amt
    
    popups= make_popups( labels )
    radii= check_radii( radii )
    colors= check_colors( colors )
    opacities= check_opacities( opacities )
    
    # if there was no previous map make a new one
    if prev_map == None:
        center= [ lats.mean(), lngs.mean() ]
        tiles= 'Stamen Toner' if toner else 'OpenStreetMap'
        prev_map= fol.Map( location=center, zoom_start=zoom, control_scale=True, tiles=tiles)
        
    for lat, lng, r, color, op, popup in zip(lats, lngs, radii, colors, opacities, popups):
        fol.Circle(
            location=[lat,lng],
            radius=r,
            color=color,
            popup= popup,
            fill=True,
            fill_color=color,
            fill_opacity=op
        ).add_to(prev_map)

    return prev_map

### Import raw data & Data Preparation
Source data is publicly available from the government of Ontario at the following link: https://www.ontario.ca/data/schoolinformation-and-student-demographics
This data-set provides info about almost 5,000 schools all over Ontario, with the location coordinates and the number of students enrolled. Using the location data along with foursquare, a venue search engine, we may search for tutoring services near each of the schools. These will be categorized by the number of services nearby. 

Our Data Source: https://files.ontario.ca/opendata/sif_data_table_2015_2016_en.xlsx


In [26]:
# ontario has free data
data= pd.read_excel('https://files.ontario.ca/opendata/sif_data_table_2015_2016_en.xlsx')

# drop what seem to be keys as well as irrelevant/redundant columns
data.drop(['Board Number','Board Type','School Number', 'Province', 'Municipality','School Website','Board Website','Building Suite','P.O. Box'],axis=1,inplace=True)

# title case the city column for ease
data['City'] = data['City'].apply(lambda x: x.title())

# take only english speaking elementary and secondary schools into account
data= data[ data['School Language'] == 'English' ]
data.drop('School Language', axis=1, inplace=True)

data= data[ (data['School Level'] == 'Elementary') | (data['School Level'] == 'Secondary') ]
data.drop('School Level', axis=1, inplace=True)

print('The data has {} rows & {} cols.'.format(data.shape[0],data.shape[1]))

data.head()

The data has 4449 rows & 42 cols.


Unnamed: 0,Board Name,School Name,School Type,School Special Condition Code,Grade Range,Street,City,Postal Code,Phone Number,Fax Number,Enrolment,Latitude,Longitude,Percentage of Students Whose First Language Is Not English,Percentage of Students Whose First Language Is Not French,Percentage of Students Who Are New to Canada from a Non-English Speaking Country,Percentage of Students Who Are New to Canada from a Non-French Speaking Country,Percentage of Students Receiving Special Education Services,Percentage of Students Identified as Gifted,Percentage of Grade 3 Students Achieving the Provincial Standard in Reading,Change in Grade 3 Reading Achievement Over Three Years,Percentage of Grade 3 Students Achieving the Provincial Standard in Writing,Change in Grade 3 Writing Acheivement Over Three Years,Percentage of Grade 3 Students Achieving the Provincial Standard in Mathematics,Change in Grade 3 Mathematics Achievement Over Three Years,Percentage of Grade 6 Students Achieving the Provincial Standard in Reading,Change in Grade 6 Reading Achievement Over Three Years,Percentage of Grade 6 Students Achieving the Provincial Standard in Writing,Change in Grade 6 Writing Acheivement Over Three Years,Percentage of Grade 6 Students Achieving the Provincial Standard in Mathematics,Change in Grade 6 Mathematics Achievement Over Three Years,Percentage of Grade 9 Students Achieving the Provincial Standard in Academic Mathematics,Change in Grade 9 Academic Mathematics Acheivement Over Three Years,Percentage of Grade 9 Students Achieving the Provincial Standard in Applied Mathematics,Change in Grade 9 Applied Mathematics Achievement Over Three Years,Percentage of Students That Passed the Grade 10 OSSLT on Their First Attempt,Change in Grade 10 OSSLT Literacy Achievement Over Three Years,Percentage of Children Who Live in Low-Income Households,Percentage of Students Whose Parents Have Some Unviersity Education,Percentage of JK-Grade 3 Classes With 20 Students or Fewer,Percentage of JK-Grade 3 Classes With 23 Students or Fewer,Extract Date
0,Algoma DSB,Algoma Education Connection Secondary School,Public,Alternative,9-12,550 NORTHERN AVENUE,Sault Ste. Marie,P6B4J4,,,236.0,46.53477,-84.30772,,100,,,18.6,,,,,,,,,,,,,,N/D,,N/R,,N/R,,33.88,SP,,,Dec-04-17
1,Algoma DSB,Anna McCrea Public School,Public,Not applicable,JK-8,250 Mark,Sault Ste Marie,P6A3M7,705-945-7106,705-945-7221,168.0,46.50593,-84.28732,SP,100,SP,SP,15.5,,0.77,,0.58,,0.81,,0.8,,0.67,,0.53,,,,,,,,8.1,20.97,1.0,1.0,Dec-04-17
2,Algoma DSB,Arthur Henderson Public School,Public,Not applicable,JK-8,2 Henderson,Bruce Mines,P0R1C0,705-785-3483,705-785-3220,101.0,46.30183,-83.7802,SP,100,,,11.9,,0.38,,0.31,,0.46,,N/D,,N/D,,N/D,,,,,,,,13.42,SP,0.0,1.0,Dec-04-17
3,Algoma DSB,Ben R McMullin Public School,Public,Not applicable,JK-8,24 Paradise,Sault Ste Marie,P6B5K2,705-945-7108,705-945-7205,189.0,46.52455,-84.29804,SP,100,SP,SP,13.8,SP,0.44,,0.38,,0.44,,0.74,,0.65,,0.22,,,,,,,,27.9,14.95,1.0,1.0,Dec-04-17
4,Algoma DSB,Blind River Public School,Public,Not applicable,JK-8,19 Hanes,Blind River,P0R1B0,705-356-7752,705-356-0271,187.0,46.18454,-82.9576,SP,100,SP,SP,23.0,,0.5,,0.36,,0.5,,0.52,,0.65,,0.3,,,,,,,,22.36,10.7,1.0,1.0,Dec-04-17


### Clean up our data into the dataframe we will use
Create a dataframe including the most useful columns from the original data, we also get rid of null values and replace them with the average for the column.

In [27]:
# extract only the columns we want
cols= ['School Name','Enrolment','Latitude','Longitude','City']
school_df= data[cols].copy()

#these are the numerical portions of the original data
pct_df= data.iloc[:,-5:-3]

school_df= pd.concat( [school_df, pct_df], axis=1 , sort=True )

# change the column names to make them easier to work with
school_df.columns= ['school','enrol','lat','lng','city','pct_low_income', 'pct_uni_parents']

# drop all entries with null in any of the specified columns
school_df.dropna(subset= ['school','enrol','lat','lng','city'], inplace=True)

# make null entries the average for the numerical data
for col in school_df:
    if not col in ['school', 'enrol', 'lat', 'lng' ,'city']:
        avg= 0
        num_entries= 0
        for val in school_df[col].values:
            if (not val in ['SP','N/R','N/D']) & (val == val):
                avg+= val
                num_entries+= 1
        avg= avg / num_entries
        school_df[col].replace( ['SP','N/R','N/D', np.nan], avg, inplace=True )

print('The schools dataframe has {} rows & {} cols.'.format(school_df.shape[0],school_df.shape[1]))
school_df.head()

The schools dataframe has 4357 rows & 7 cols.


Unnamed: 0,school,enrol,lat,lng,city,pct_low_income,pct_uni_parents
0,Algoma Education Connection Secondary School,236.0,46.53477,-84.30772,Sault Ste. Marie,33.88,24.372899
1,Anna McCrea Public School,168.0,46.50593,-84.28732,Sault Ste Marie,8.1,20.97
2,Arthur Henderson Public School,101.0,46.30183,-83.7802,Bruce Mines,13.42,24.372899
3,Ben R McMullin Public School,189.0,46.52455,-84.29804,Sault Ste Marie,27.9,14.95
4,Blind River Public School,187.0,46.18454,-82.9576,Blind River,22.36,10.7


### Narrow our scope to Brampton

In [28]:
brampton_df= school_df[ school_df.city == 'Brampton' ].copy()

# drop columns we dont need
brampton_df.drop('city', axis=1,inplace=True)
brampton_df.reset_index(drop=True, inplace=True)

print('There are {} schools in Brampton'.format(brampton_df.shape[0]))
brampton_df.head()

There are 169 schools in Brampton


Unnamed: 0,school,enrol,lat,lng,pct_low_income,pct_uni_parents
0,Bishop Francis Allen Catholic School,358.0,43.66624,-79.74619,23.25,29.09
1,Cardinal Ambrozic Catholic Secondary School,1332.0,43.78772,-79.68312,16.58,31.78
2,Cardinal Leger Secondary School,1118.0,43.68409,-79.7505,20.87,11.33
3,Cardinal Newman Catholic School,504.0,43.72155,-79.69919,22.76,15.02
4,Father C W Sullivan Catholic School,301.0,43.70595,-79.74661,17.87,11.46


## Search FourSquare for tutoring services near schools
We make three searches: tutors, math and learn.

In [29]:
# get the foursquare results for searches 'tutor', 'math' and 'learn'

amt= []
unique_tutors= []

print('Working.. 1/3')
results1= foursq_search(brampton_df.lat, brampton_df.lng, query='tutor')
amt, unique_tutors= extract_results(results1)

print('Working.. 2/3')
results2= foursq_search(brampton_df.lat, brampton_df.lng, query='math')
amt, unique_tutors= extract_results(results2, amt=amt, unique_venues=unique_tutors, specs=['math ', 'mathematics', 'mathnasium'])

print('Working.. 3/3')
results3= foursq_search(brampton_df.lat, brampton_df.lng, query='learn')
amt, unique_tutors= extract_results(results3, amt=amt, unique_venues=unique_tutors , specs=[ 'oxford', 'sylvan'])
print('Done!')

# make a column for the number of services near each school
brampton_df['tutor_services']= amt

# this is a measure of how good the school is based on how many students are in it and the number of services near it
brampton_df['enrol_tutors_ratio']= brampton_df.enrol / (brampton_df.tutor_services + 1 )

print('Results collected.')
brampton_df.head()

Working.. 1/3
Working.. 2/3
Working.. 3/3
Done!
Results collected.


Unnamed: 0,school,enrol,lat,lng,pct_low_income,pct_uni_parents,tutor_services,enrol_tutors_ratio
0,Bishop Francis Allen Catholic School,358.0,43.66624,-79.74619,23.25,29.09,3,89.5
1,Cardinal Ambrozic Catholic Secondary School,1332.0,43.78772,-79.68312,16.58,31.78,1,666.0
2,Cardinal Leger Secondary School,1118.0,43.68409,-79.7505,20.87,11.33,2,372.666667
3,Cardinal Newman Catholic School,504.0,43.72155,-79.69919,22.76,15.02,1,252.0
4,Father C W Sullivan Catholic School,301.0,43.70595,-79.74661,17.87,11.46,2,100.333333


### Create a dataframe for the unique tutoring services

In [30]:
#make a dataframe with the information for each unique service found
unique_tutors_df= pd.DataFrame.from_records(unique_tutors, columns=['lat','lng','name'])
unique_tutors_df

Unnamed: 0,lat,lng,name
0,43.682367,-79.767512,Impel Tutors
1,43.772775,-79.660586,Learna Tutoring
2,43.666087,-79.737461,Academy for Mathematics & Science
3,43.715216,-79.723803,Kumon Math & Reading Centre
4,43.676515,-79.823418,UCMAS Mental Math School
5,43.63905,-79.716064,academy for mathematics & english
6,43.658689,-79.726444,Oxford Learning Centre
7,43.760366,-79.728131,Oxford Learning Centre
8,43.681636,-79.816006,Oxford Learning


## Mapping tutors and schools

In [31]:
tut_amt= unique_tutors_df.shape[0]

# make yellow circles signifying the effective radius of each tutoring services
area_map= plot_points( unique_tutors_df.lat, 
                         unique_tutors_df.lng,
                         [3000] * tut_amt,
                         ['yellow'] * tut_amt,
                         opacities= [0.1] * tut_amt )

# add the tutoring services to the map
tut_map= plot_points( unique_tutors_df.lat, 
                         unique_tutors_df.lng,
                         [100] * tut_amt,
                         ['red'] * tut_amt, 
                         unique_tutors_df.name, 
                         prev_map=area_map )


# add the schools to the map
sch_amt= brampton_df.shape[0]
labels= [ name + ' : {} tutoring services'.format(tut) for name, tut in zip( brampton_df.school, brampton_df.tutor_services ) ]

full_map= plot_points( brampton_df.lat, 
                         brampton_df.lng,
                         [80] * sch_amt,
                         ['blue'] * sch_amt, 
                         labels,
                         brampton_df.tutor_services / brampton_df.tutor_services.max(),
                         prev_map=tut_map )

full_map

## Using K-Means Algorithm to Cluster Schools

In [32]:
# make a temporary dataframe to extract the data to feed the K-Means algorithm
cols= ['school', 'lat', 'lng']
kmeans_tempdf= brampton_df.drop(cols, axis=1)
# ensure there are no null values
kmeans_tempdf.head()

Unnamed: 0,enrol,pct_low_income,pct_uni_parents,tutor_services,enrol_tutors_ratio
0,358.0,23.25,29.09,3,89.5
1,1332.0,16.58,31.78,1,666.0
2,1118.0,20.87,11.33,2,372.666667
3,504.0,22.76,15.02,1,252.0
4,301.0,17.87,11.46,2,100.333333


## Normalize our data

In [33]:
# fit our data to emulate a standard normal distribution to make sure all factors are equal
X= np.nan_to_num( kmeans_tempdf.values )
X= StandardScaler().fit_transform(X)
print(X[:5])
print('Data Standardized.')

[[-0.7695138   1.75701226  0.00509834  1.4828575  -0.86579394]
 [ 1.87295179 -0.02631759  0.22574879 -0.49819323  1.27867857]
 [ 1.292369    1.12068167 -1.45168676  0.49233213  0.18753326]
 [-0.37341526  1.62600302 -1.14901013 -0.49819323 -0.26132424]
 [-0.92415501  0.31858429 -1.44102336  0.49233213 -0.82549596]]
Data Standardized.


## Create the ML model and fit it with our data

In [34]:
clusters= 9

# run k-means on the data separated
kmeans= KMeans(init='k-means++', n_clusters=clusters, n_init= 12)
kmeans.fit(X)
print('Model fit with data.')

Model fit with data.


## We've got the clusters!

In [35]:
# make a column for the clusters given to each school
brampton_df['cluster']= kmeans.labels_
brampton_df[['school','cluster']].head()

Unnamed: 0,school,cluster
0,Bishop Francis Allen Catholic School,7
1,Cardinal Ambrozic Catholic Secondary School,3
2,Cardinal Leger Secondary School,3
3,Cardinal Newman Catholic School,4
4,Father C W Sullivan Catholic School,8


## Analyzing the Clusters

In [36]:
color_map= [ 'red','blue','orange','black','lime','green','pink','purple','brown' ]

# show the number of schools in each cluster as well as the mean ratio for each 
view= brampton_df.groupby('cluster').mean().reset_index()
view['color']= view.cluster.apply( lambda c: color_map[c].title() )
view['count'] = brampton_df.cluster.value_counts(sort=False)

cols= view.columns.tolist()
cols= cols[-2:] + [cols[-3]] + [cols[1]] + cols[4:-3]
view= view[cols]

view.columns= [ s.replace('_', ' ').title() for s in view.columns ]
view.set_index('Color', inplace=True)
view.index.name= None
view.sort_values('Enrol Tutors Ratio', ascending=False ).apply( lambda x: round(x, 2), axis=1)

Unnamed: 0,Count,Enrol Tutors Ratio,Enrol,Pct Low Income,Pct Uni Parents,Tutor Services
Pink,6.0,1311.08,1468.0,17.11,19.63,0.17
Orange,14.0,694.43,694.43,13.27,37.31,0.0
Black,15.0,496.68,1413.0,19.26,17.88,2.0
Green,27.0,352.7,677.22,14.23,42.01,0.93
Red,20.0,278.62,387.55,15.21,23.36,0.5
Lime,17.0,210.23,427.12,23.15,22.09,1.24
Blue,28.0,202.47,648.96,13.75,40.95,2.29
Brown,22.0,133.28,392.68,17.15,16.84,1.95
Purple,20.0,107.56,430.25,19.85,25.18,3.0


## Map the Clustered Schools

In [37]:
avgs= brampton_df.enrol_tutors_ratio
sch_amt= brampton_df.shape[0]
labels= [ name + ' : {} naive-expected students'.format( ratio ) for name, ratio in zip( brampton_df.school, avgs.apply(int) ) ]
color_map= [ 'red','blue','orange','black','lime','green','deeppink','purple','brown' ]

full_map= plot_points( brampton_df.lat, 
                         brampton_df.lng,
                         50 + 200*(( avgs - avgs.min() ) / (avgs.max() - avgs.min() )),
                         [ color_map[ cluster ] for cluster in brampton_df.cluster ], 
                         labels,
                         [0.5] * sch_amt )#, toner=True)

full_map

# Process Generalization
Here are functions generalizing the steps we took earlier, now we can repeat the process for any Ontario city we desire.

In [38]:
def get_city_df(city):
    if isinstance(city, str):
        city_df= school_df[ school_df.city == city ].copy()
    elif isinstance(city, list):
        city_df= school_df[ school_df.city == city[0] ].copy()
        for c in city[1:]:
            city_df= pd.concat( [ city_df, school_df[ school_df.city == c ] ], axis=0 )
    else:
        return None
    # drop columns we dont need
    city_df.drop('city', axis=1,inplace=True)
    city_df.reset_index(drop=True, inplace=True)
    
    return city_df

def find_tutors(city_df, queries= ['tutor'] ):
    # get the foursquare results for searches 'tutor', 'math' and 'learn'
    ttl= len(queries)
    results= []
    for i, query in enumerate(queries):
        print('Working.. {}/{}'.format(i + 1, ttl))
        results.append(foursq_search(city_df.lat, city_df.lng, query=query ))
        
    print('Done!')
    
    return results
    
def parse_results(city_df, results, specs=[], excls=[] ):
    ttl= len( results )
    amt=[]
    unique_tutors=[]
    
    make_empties= lambda lst: lst if len(lst) == ttl else [[]] * ttl
        
    specs= make_empties(specs)
    excls= make_empties(excls)
    
    for i, result in enumerate(results):
        amt, unique_tutors= extract_results(result, amt=amt, unique_venues=unique_tutors, specs=specs[i], excls= excls[i] )
    # make a column for the number of services near each school
    city_df['tutor_services']= amt

    # this is a measure of how good the school is based on how many students are in it and the number of services near it
    city_df['enrol_tutors_ratio']= city_df.enrol / (city_df.tutor_services + 1 )
    
    unique_tutors_df= pd.DataFrame.from_records(unique_tutors, columns=['lat','lng','name'])

    return city_df, unique_tutors_df

def cluster_schools(city_df, clus= 9):
    cols= ['school', 'lat', 'lng']
    
    X= np.nan_to_num( city_df.drop(cols, axis=1).values )
    X= StandardScaler().fit_transform(X)

    # run k-means on the data separated
    kmeans= KMeans(init='k-means++', n_clusters=clus, n_init= 12)
    kmeans.fit(X)

    # make a column for the clusters given to each school
    city_df['cluster']= kmeans.labels_
    
    return city_df

def map_sch_tut(city_df, unique_tutors_df, clustered=False, zoom=12, prev_map=None):
    tut_amt= unique_tutors_df.shape[0]

    area_map= plot_points( unique_tutors_df.lat, 
                             unique_tutors_df.lng,
                             [3000] * tut_amt,
                             ['yellow'] * tut_amt,
                             opacities= [0.09] * tut_amt,
                             zoom=zoom,
                             prev_map=prev_map )

    #add the tutoring services to the map
    tut_map= plot_points( unique_tutors_df.lat, 
                             unique_tutors_df.lng,
                             [100] * tut_amt,
                             ['red'] * tut_amt, 
                             unique_tutors_df.name, 
                             prev_map=area_map )



    sch_amt= city_df.shape[0]
    labels= [ name + ' : {} Nearby Services'.format(tut) for name, tut in zip( city_df.school, city_df.tutor_services ) ]

    if not clustered:
        full_map= plot_points( city_df.lat, 
                                 city_df.lng,
                                 [80] * sch_amt,
                                 ['blue'] * sch_amt, 
                                 labels,
                                 city_df.tutor_services / city_df.tutor_services.max(),
                                 prev_map=tut_map )

        return full_map
    else:
        return map_clusters(city_df, prev_map= tut_map)

def map_clusters(city_df, prev_map=None):
    avgs= city_df.enrol_tutors_ratio
    sch_amt= city_df.shape[0]
    labels= [ name + ' : {} Naive-expected Students : {} Nearby Services'.format( ratio, tut ) for name, ratio, tut in zip( city_df.school, avgs.apply(int), city_df.tutor_services ) ]
    color_map= [ 'brown','blue','pink','red','purple','black','yellow','orange','green' ]

    full_map= plot_points( city_df.lat, 
                             city_df.lng,
                             50 + 200*(( avgs - avgs.min() ) / (avgs.max() - avgs.min() )),
                             [ color_map[ cluster ] for cluster in city_df.cluster ], 
                             labels,
                             [0.5] * sch_amt,
                             prev_map=prev_map )

    return full_map

## Mississauga, Oakville, Brampton, Etobicoke and North York
We will now do the quick version of our process to the combination of these cities.  
Get the schools:

In [39]:
cities= ['Brampton','Oakville','Etobicoke','Mississauga','North York']
city_df= get_city_df(cities)
print('There are {} schools in '.format(city_df.shape[0]), end='')
for i, cit in enumerate(cities):
    print(cit, end= ' ') if cit != cities[-1] else print('& {}.'.format(cit))
city_df.head()

There are 610 schools in Brampton Oakville Etobicoke Mississauga & North York.


Unnamed: 0,school,enrol,lat,lng,pct_low_income,pct_uni_parents
0,Bishop Francis Allen Catholic School,358.0,43.66624,-79.74619,23.25,29.09
1,Cardinal Ambrozic Catholic Secondary School,1332.0,43.78772,-79.68312,16.58,31.78
2,Cardinal Leger Secondary School,1118.0,43.68409,-79.7505,20.87,11.33
3,Cardinal Newman Catholic School,504.0,43.72155,-79.69919,22.76,15.02
4,Father C W Sullivan Catholic School,301.0,43.70595,-79.74661,17.87,11.46


## Search for tutoring services:

In [40]:
queries= ['tutors', 'math', 'learning' ]
results= find_tutors(city_df, queries)
results

Working.. 1/3
Working.. 2/3
Working.. 3/3
Done!


[[{'meta': {'code': 200, 'requestId': '5c8c5da3dd57973af0179f30'},
   'response': {'venues': [{'id': '57ad6981498e0d78c3f6cd84',
      'name': 'Impel Tutors',
      'location': {'address': '43 McMurchy Ave N',
       'crossStreet': 'brompton',
       'lat': 43.68236694270353,
       'lng': -79.76751208305359,
       'labeledLatLngs': [{'label': 'display',
         'lat': 43.68236694270353,
         'lng': -79.76751208305359}],
       'distance': 2483,
       'postalCode': 'L6X 1X4',
       'cc': 'CA',
       'city': 'Ontario',
       'state': 'ON',
       'country': 'Canada',
       'formattedAddress': ['43 McMurchy Ave N (brompton)',
        'Ontario ON L6X 1X4',
        'Canada']},
      'categories': [{'id': '4bf58dd8d48988d1a8941735',
        'name': 'General College & University',
        'pluralName': 'General Colleges & Universities',
        'shortName': 'Education',
        'icon': {'prefix': 'https://ss3.4sqi.net/img/categories_v2/education/other_',
         'suffix': '.png'}

## Extract the venues:

In [62]:
city_sch_tut, services= parse_results( city_df, results )
print('There are {} unique relevant results.'.format( len( services ) ))
print(services.name[:5])
city_sch_tut.head()

There are 1 unique relevant results.
0    Impel Tutors
Name: name, dtype: object


Unnamed: 0,school,enrol,lat,lng,pct_low_income,pct_uni_parents,tutor_services,enrol_tutors_ratio,cluster
0,Bishop Francis Allen Catholic School,358.0,43.66624,-79.74619,23.25,29.09,1,179.0,2
1,Cardinal Ambrozic Catholic Secondary School,1332.0,43.78772,-79.68312,16.58,31.78,0,1332.0,7
2,Cardinal Leger Secondary School,1118.0,43.68409,-79.7505,20.87,11.33,1,559.0,2
3,Cardinal Newman Catholic School,504.0,43.72155,-79.69919,22.76,15.02,0,504.0,0
4,Father C W Sullivan Catholic School,301.0,43.70595,-79.74661,17.87,11.46,1,150.5,2


## Filter our results:
We look through the results and create specifications and exclusions to re-parse the results. These are done manually and could be automated.

In [63]:
specs= [[], ['math ', 'mathstat', 'mathematics'], []]
excls= [    [],
           [ 'copy room', 'humber', 'library', 'class', 'department' ],
           [ 'acend', 'elearning', 'e-learning' ,'playground','pavilion','disabilities',
               'early', 'teksource', 'build','rider','network','adult',
               'enabled', 'york','library','solutions','scotiabank',
               'tykes','child','bmo','international','agincourt','code','engage'
               'e-learning','music','ocadu','rbc','research','smw','ryerson',
               'reiki','employee', 'path' ,'otf','thornhill', 'day care', 'golf', 
                'humber', 'finance','gems'] ]

city_sch_tut, services= parse_results( city_df, results , specs=specs, excls=excls)
print('There are {} unique relevant results.'.format( len( services ) ))
print(services.name[:5])
city_sch_tut.head()

There are 1 unique relevant results.
0    Impel Tutors
Name: name, dtype: object


Unnamed: 0,school,enrol,lat,lng,pct_low_income,pct_uni_parents,tutor_services,enrol_tutors_ratio,cluster
0,Bishop Francis Allen Catholic School,358.0,43.66624,-79.74619,23.25,29.09,1,179.0,2
1,Cardinal Ambrozic Catholic Secondary School,1332.0,43.78772,-79.68312,16.58,31.78,0,1332.0,7
2,Cardinal Leger Secondary School,1118.0,43.68409,-79.7505,20.87,11.33,1,559.0,2
3,Cardinal Newman Catholic School,504.0,43.72155,-79.69919,22.76,15.02,0,504.0,0
4,Father C W Sullivan Catholic School,301.0,43.70595,-79.74661,17.87,11.46,1,150.5,2


## Here are the schools and tutoring services in Toronto:
  
* Each blue marker represents a school.
* Each red marker represents a tutoring service.
* The yellow circles denote a 3km radius from each service. A 3km radius was used in our searches.

In [64]:
map_sch_tut(city_sch_tut, services, zoom=11 )

## Now we use KMeans to cluster the schools

In [67]:
city_sch_tut= cluster_schools(city_sch_tut)
color_map= [ 'red','blue','orange','black','lime','green','pink','purple','brown' ]

# show the number of schools in each cluster as well as the mean ratio for each 
view= city_sch_tut.groupby('cluster').mean().reset_index()
view['color']= view.cluster.apply( lambda c: color_map[c].title() )
view['count'] = brampton_df.cluster.value_counts(sort=False)

cols= view.columns.tolist()
cols= cols[-2:] + [cols[-3]] + [cols[1]] + cols[4:-3]
view= view[cols]

view.columns= [ s.replace('_', ' ').title() for s in view.columns ]
view.set_index('Color', inplace=True)
view.index.name= None
view.sort_values('Enrol Tutors Ratio', ascending=False ).apply( lambda x: round(x, 2), axis=1)

Unnamed: 0,Count,Enrol Tutors Ratio,Enrol,Pct Low Income,Pct Uni Parents,Tutor Services
Green,27.0,1754.06,1754.06,19.61,24.29,0.0
Orange,14.0,1089.96,1089.96,20.94,22.68,0.0
Brown,22.0,731.83,731.83,14.85,40.33,0.0
Red,20.0,590.04,590.04,12.08,61.43,0.0
Lime,17.0,418.96,418.96,28.12,38.87,0.0
Purple,20.0,364.92,364.92,15.34,41.34,0.0
Pink,6.0,349.53,349.53,32.42,15.25,0.0
Blue,28.0,317.44,317.44,18.41,21.38,0.0
Black,15.0,274.39,548.78,18.39,24.17,1.0


## Finally here are the clustered schools along with the nearby services.

Here the colours to look out for are GREEN and ORANGE. These schools have the least nearby services as well as the most expected students.

In [72]:
map_sch_tut(city_sch_tut, services, clustered=True, zoom=11)