# Capstone Project - Week 4

# Table of Contents
[<li>Introduction: Problem Statement](#introduction)</li>
[<li>Data](#data)</li>

## Introduction: Problem Statement<a name="introduction"></a>

Schaumburg IL, is a suburb in the Northwest of Chicago. It is a multi-cultural suburb, famous for <b/>Woodfield Mall</b> and has offices of many <b/>multi-national corportions</b>. In this project, I will try to find suitable locations to open a new ** Indian restraurant**.

Schaumburg has many zip codes, which have a large number of restaurants of different cusines. Apart from Schaumburg, I will also explore few other zip codes which are in close vicinity. Overall, I will explore ten (10) zip codes in and around Schaumburg and try to find locations that could be candidates for opening a new Indian restaurant.

## Data<a name="data"></a>

For the problem statement, I will first find the coordinates of the zip codes that I want to explore. Then, I will find the restaurants around those zipcodes. I will also find the Indian restaurants around these zip codes.

I will use following sources to get the needed data.
1. Scrape Zipmat.net to get the zip codes for areas in vicinity of Schaumburg, IL
2. Geopy "geocode.Geolake" to get the coordinates for suburb and each zip code.
3. Use FourSquare API's to get the list of all restaurents and identify whether a restaurant or not.

### Suburbs to explore

I will first list the suburbs in vicinity of Schaumburg, IL. Next, I will get the coordinates of these suburbs. Then I will get the zip codes in those suburbs and finally I will get list of restaurants in those zip codes and identify which ones are Indian restaurants.

In [1]:
#import the libraries.
import pandas as pd
from geopy.geocoders import Geolake
import geopy.distance
from bs4 import BeautifulSoup
import requests
import folium
from geopy.extra.rate_limiter import RateLimiter

Create an list of areas around Schaumburg that we want to explore

In [2]:
#Suburbs to explore in vicinity of Schaumburg
suburbs = ['Schaumburg, IL', 'Hoffman Estates, IL', 'Streamwood, IL',
           'Roselle, IL', 'Elk Grove Village, IL', 'Rolling Meadows, IL', 'Palatine, IL']

#Geolake API Key
api_key = 'hc3bNjfFhLC3oBi2cyLB'

#Create Geolocator instance
geolocator = Geolake(api_key,user_agent="city_explorer")
geocode = RateLimiter(geolocator.geocode, min_delay_seconds=1)

#Primary weblink to zipmat.net for cook county, IL
weblink = "https://www.zipmap.net/Illinois/Cook_County/"

#lists to hold Suburb details and ZipCodes around those suburbs
suburbAddress = []
zipCodes = []

Let's create few functions that we will use to get the suburb and zip code data

In [3]:
#Function to load suburbs and zipcodes to dataframes
def loadZipDF(suburbAddress,zip_Codes):
    suburbAddressDF = pd.DataFrame(suburbAddress)
    suburbAddressDF.columns = ['Address', 'Latitude', 'Longitude']

    zip_CodesDF = pd.DataFrame(zip_Codes)
    zip_CodesDF.columns = ['Zip Code', 'Suburb', 'Suburb Latitude', 'Suburb Longitude',
                           'Latitude', 'Longitude', 'Address', 'Distance from Suburb', 'Distance from Center']

    zip_CodesDF = zip_CodesDF[['Suburb', 'Suburb Latitude', 'Suburb Longitude', 'Zip Code', 
                           'Latitude', 'Longitude', 'Address', 'Distance from Suburb', 'Distance from Center']]    
    return suburbAddressDF, zip_CodesDF

#Function to get the cooridates.
def getCoordinates(item):
    location = geocode(item, timeout=10)
    return location

#Function to get the zip codes from the zipmat.net
def getZipCodes(suburb,latitude,longitude):
    suburb = suburb.split(',', 1)[0] #remove the state name from the suburb
    suburb= suburb.replace(' ', '_') #replace whitespaces with '_' to create proper url
    url = weblink + suburb + '.htm'
    point1 = (latitude, longitude)
    #read the url
    req  = requests.get(url) # get the data from url

    data = req.text
    soup = BeautifulSoup(data,"lxml") #parse the data
        
    lst = []
    for link in soup.find_all('a'): 
        lst.append(link.get('href'))

    # get the zip codes into a list
    for lst1 in lst:
        if'/zips/' in lst1:
            string = lst1.split('/zips/')[1].split(".htm")[0]
            location = getCoordinates(string)
            point2 = (location.latitude, location.longitude)
            distanceFromSuburb = geopy.distance.distance(point1, point2).m
            distanceFromCenter = geopy.distance.distance(latlng, point2).m
            zipCodes.append([string,suburb,latitude, longitude,location.latitude,location.longitude,
                             location.address,distanceFromSuburb,distanceFromCenter])
    return zipCodes

Get the coordinates of city center

In [4]:
location = getCoordinates('60173') #Hoffman Estates zip code as city center
lat = location.latitude
lng = location.longitude
latlng = (lat,lng)
latlng

(42.0581, -88.0482)

Load the coordinates for Suburbs as well as the zip codes around them.

In [5]:
zipLoaded = False
#Try loading the Suburb and Zip Code details if we already have them from the previous run of this program
try:
    suburbAddressDF = pd.read_csv('c:\\rajesh dhar\\SuburbAddress.csv', index_col = None)
    suburbAddress = suburbAddressDF.values.tolist()

    zip_CodesDF = pd.read_csv('c:\\rajesh dhar\\ZipCodes.csv', index_col = None)
    zip_Codes = zip_CodesDF.values.tolist()
    print('Suburbs and Zip Codes Loaded')
    zipLoaded = True    

except:
    pass

#If this is first time run of the program and we already don't have suburb and zipcodes csv files
if zipLoaded == False:
    for suburb in suburbs:
        print("Getting details for Suburban Area : ", suburb) # Print name of Suburb that is being processed

        # Get the coordinates of the suburb
        location = getCoordinates(suburb)
        latitude = location.latitude
        longitude = location.longitude
        address = location.address

        suburbAddress.append([address,latitude,longitude]) # save the address, latitude and longitude
        zipCodes = getZipCodes(suburb,latitude,longitude)  # get zip code details around the suburb  

    zipCodes.sort(key = lambda x: x[7], reverse=True) # sort the zip codes based on their distance

    #We will store zip codes in a dictionary to get rid of overlapping zipcodes and just keep the ones which are closet to suburb.
    zipCode = {}    
    for l1 in zipCodes:
        zipCode[l1[0]] = l1[1:]

    #Store truncated zipcodes to a list
    zip_Codes = [] 
    for key, value in zipCode.items():
        value.insert(0,key)            
        zip_Codes.append(value)

    #load suburbs and zipcodes to dataframes
    suburbAddressDF, zip_CodesDF = loadZipDF(suburbAddress,zip_Codes)

    zip_CodesDF.sort_values(by=['Suburb','Zip Code'], inplace= True) # sort zip codes within suburb

    #Save the suburbs and zipcodes to csv files for next excution of this progrm
    suburbAddressDF.to_csv('c:\\rajesh dhar\\SuburbAddress.csv', index=False)
    zip_CodesDF.to_csv('c:\\rajesh dhar\\ZipCodes.csv', index=False)

    print('Suburbs and Zip Codes Loaded')

Getting details for Suburban Area :  Schaumburg, IL
Getting details for Suburban Area :  Hoffman Estates, IL
Getting details for Suburban Area :  Streamwood, IL
Getting details for Suburban Area :  Roselle, IL
Getting details for Suburban Area :  Elk Grove Village, IL
Getting details for Suburban Area :  Rolling Meadows, IL
Getting details for Suburban Area :  Palatine, IL
Suburbs and Zip Codes Loaded


In [6]:
print('Coordinates for {} Suburbs and {} Zip Codes Loaded'.format(len(suburbAddressDF), len(zip_CodesDF)))

Coordinates for 7 Suburbs and 34 Zip Codes Loaded


In [7]:
#List of suburb with latitude and Longitude
suburbAddressDF

Unnamed: 0,Address,Latitude,Longitude
0,"Schaumburg, US",42.029075,-88.090953
1,"Hoffman Estates, US",42.06589,-88.121643
2,"Streamwood, US",42.021546,-88.175984
3,"Roselle, US",41.980013,-88.089189
4,"Elk Grove Village, US",42.006304,-88.005855
5,"Rolling Meadows, US",42.076532,-88.025688
6,"Palatine, US",42.119485,-88.041359


In [8]:
zip_CodesDF[5:15]

Unnamed: 0,Suburb,Suburb Latitude,Suburb Longitude,Zip Code,Latitude,Longitude,Address,Distance from Suburb,Distance from Center
32,Elk_Grove_Village,42.006304,-88.005855,60191,41.9602,-87.981,"Wood Dale, US",5519.642413,12216.170165
12,Hoffman_Estates,42.06589,-88.121643,60169,42.0493,-88.1065,"Hoffman Estates, US",2228.640339,4924.124896
18,Hoffman_Estates,42.06589,-88.121643,60192,42.0428,-88.0798,"Hoffman Estates, US",4309.923768,3119.554152
19,Hoffman_Estates,42.06589,-88.121643,60195,42.0764,-88.1093,"Schaumburg, US",1551.207998,5450.085925
33,Palatine,42.119485,-88.041359,60004,42.112,-87.9792,"Arlington Heights, US",5207.37576,8272.669875
8,Palatine,42.119485,-88.041359,60010,42.1614,-88.1383,"Barrington, US",9268.19345,13681.636887
7,Palatine,42.119485,-88.041359,60047,42.2165,-88.0769,"Lake Zurich, US",11169.116825,17753.694608
9,Palatine,42.119485,-88.041359,60067,42.1139,-88.0429,"Palatine, US",633.311587,6213.475959
26,Palatine,42.119485,-88.041359,60074,42.1458,-88.023,"Palatine, US",3293.585446,9961.831594
24,Palatine,42.119485,-88.041359,60089,42.1598,-87.9644,"Buffalo Grove, US",7780.084556,13253.19485


Let's count how many zip codes are in each suburb

In [9]:
zip_CodesDF.groupby(['Suburb'])['Zip Code'].count()

Suburb
Elk_Grove_Village    6
Hoffman_Estates      3
Palatine             7
Rolling_Meadows      5
Schaumburg           6
Streamwood           7
Name: Zip Code, dtype: int64

Okay each suburb has fairly similar number of zip codes around it

Let's plot the zip codes on the Map

In [10]:
map_schaumburg = folium.Map(location=[lat, lng], zoom_start=11)
folium.Marker(latlng, popup='Schaumburg').add_to(map_schaumburg)
for lat, lon, addr, zipcde in zip(zip_CodesDF['Latitude'], zip_CodesDF['Longitude'], zip_CodesDF['Address'], zip_CodesDF['Zip Code']):
    label = '{}, {}'.format(zipcde, addr)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker([lat, lon], radius=3, color='blue', fill=True, popup=label,
                        fill_color='blue', fill_opacity=1).add_to(map_schaumburg) 
map_schaumburg

Okay, we got Zip Codes for the Suburbs around Schaumburg. Zip Codes seem to be concentrated towards west, east and south of Schuamburng.

Now Lets get the deatils of restaurants around these zip codes and identify the Indian restaurants among them. We will do this using Foursquare API.

Define Foursquare Credentials and Version

In [11]:
client_id = '24ALNEKCXXGKZP1LR5I4RVXC5PLEBUSVI2H3N5NXHGSDKMFC' # your Foursquare ID
client_secret = '3YYXMPCUOHNXKQGM13QRWBHICR1Z1FEOG41RKXRGNUDU0N5M' # your Foursquare Secret
version = '20180605'

Let's use Foursquare category for 'food' to restrict the type of venues we get.

In [12]:
#Four Squarre id for Food
food_category = '4d4b7105d754a06374d81259'

Let's create few functions that will get us restaurant data. We will also store this data in csv files.

In [13]:
#Function to replace country name from the address
def format_address(location):
    address = ', '.join(location['formattedAddress'])
    address = address.replace(', United States', '')
    return address

In [14]:
#Function to get list of restaurants using Foursquare API
def getRestaurants(lat, lon, radius=500, limit=100):
   
    # create the API request URL
    url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}\
        &v={}&ll={},{}&categoryId={}&radius={}&limit={}'.format(
        client_id, client_secret, version, lat, lon, food_category, radius, limit)
            
    # make the GET request
    results = requests.get(url).json()["response"]['groups'][0]['items']
    venues = [(item['venue']['id'],
                   item['venue']['name'],
                   item['venue']['categories'][0]['name'],
                   item['venue']['location']['lat'], 
                   item['venue']['location']['lng'],
                   format_address(item['venue']['location']),
                   item['venue']['location']['distance']) for item in results]        

    return venues

In [15]:
#Function to load restaurants to dataframes
def loadRestaurantDF(all_restaurants, indian_restaurants):
    all_restaurantsDF = pd.DataFrame(all_restaurants)
    all_restaurantsDF.columns = ['Venue Id', 'Venue Name', 'Venue Category', 'Venue Latitude',
                               'Venue Longitude', 'Venue Location', 'Venue Distance',
                               'Zip Code', 'Is Indian', 'Distance From Center']

    indian_restaurantsDF = pd.DataFrame(indian_restaurants)
    indian_restaurantsDF.columns = ['Venue Id', 'Venue Name', 'Venue Category', 'Venue Latitude',
                               'Venue Longitude', 'Venue Location', 'Venue Distance',
                               'Zip Code', 'Is Indian', 'Distance From Center']
    
    return all_restaurantsDF, indian_restaurantsDF

Let's get the restaurants aroud the Zip Codes within the radius of 3KM. We will load these restaurants in two csv files. 1st file for all restaurants and second file only for Indian restaurants. We will use these files in subsquent program executions to avoid using Foursquare API.

In [16]:
loaded = False
#Try loading the restaurants if we already have them from the previous run of this program
try:
    all_restaurantsDF = pd.read_csv('C:\\rajesh dhar\\all_restaurants.csv', index_col = None)
    allRestaurants = all_restaurantsDF.values.tolist()    
    
    indian_restaurantsDF = pd.read_csv('C:\\rajesh dhar\\indian_restaurants.csv', index_col = None)
    indianRestaurants = indian_restaurantsDF.values.tolist()    
    
    loaded = True
    print('Restaurant data Loaded')

except:
    pass

if not loaded:

    restaurants ={}
    indian_restaurants = {} 
    indian_restaurant = []

    all_restaurants = {}
    all_restaurant = []

    venues = []    
    radius = 3000 #3KM Define the radius for Foursquare API
    limit = 100
    for lati, longi, zipCode in zip(zip_CodesDF['Latitude'], zip_CodesDF['Longitude'],zip_CodesDF['Zip Code']):
        print('Fetching restaurants for Zip Code ', zipCode)
        venues = getRestaurants(lati, longi, radius,limit)

        for venue in venues:
            venueId = venue[0]
            venueName = venue[1]
            venueCat = venue[2]
            venueLat = venue[3]
            venueLon = venue[4]
            venueLoc = venue[5]
            venueDist = venue[6]

            point1 = (lat, lng)
            point2 = (venueLat, venueLon)
            distanceFromCenter = geopy.distance.distance(point1, point2).m #distance of venue from city center

            if 'Indian' in venueCat: #Check if restaurant category is Indian
                isIndian = True
            else:
                isIndian = False

            all_restaurant.append([venueId, venueName, venueCat, venueLat, venueLon, venueLoc, venueDist, 
                          zipCode, isIndian, distanceFromCenter])     
            if isIndian:
                indian_restaurant.append([venueId, venueName, venueCat, venueLat, venueLon, venueLoc, venueDist, 
                          zipCode, isIndian, distanceFromCenter])     

        all_restaurant.sort(key = lambda x: x[6], reverse=True)
        indian_restaurant.sort(key = lambda x: x[6], reverse=True)

        #store restaurants in a dictionary to get rid of overlapping restaurants and just keep the ones which are closet to zip code.        
        for l1 in all_restaurant:
            all_restaurants[l1[0]] = l1[1:]             
        for l2 in indian_restaurant:
            indian_restaurants[l2[0]] = l2[1:]

        allRestaurants = [] 
        indianRestaurants = [] 
        
        #Store truncated zipcodes to a list
        for key, value in all_restaurants.items():
            value.insert(0,key)            
            allRestaurants.append(value)
        for key, value in indian_restaurants.items():
            value.insert(0,key)            
            indianRestaurants.append(value)
          
    all_restaurantsDF, indian_restaurantsDF = loadRestaurantDF(allRestaurants, indianRestaurants)

    #sort the dataframes for restaurants name within Zip Codes
    all_restaurantsDF.sort_values(by=['Zip Code','Venue Name'], inplace=True)
    indian_restaurantsDF.sort_values(by=['Zip Code','Venue Name'], inplace=True)
    
    #Save the restaurants to csv files for next excution of this progrm
    all_restaurantsDF.to_csv('c:\\rajesh dhar\\all_restaurants.csv', index=False)
    indian_restaurantsDF.to_csv('c:\\rajesh dhar\\Indian_restaurants.csv', index=False)
    loaded = True
    print('Restaurant data Loaded')

Fetching restaurants for Zip Code  60007
Fetching restaurants for Zip Code  60101
Fetching restaurants for Zip Code  60106
Fetching restaurants for Zip Code  60143
Fetching restaurants for Zip Code  60157
Fetching restaurants for Zip Code  60191
Fetching restaurants for Zip Code  60169
Fetching restaurants for Zip Code  60192
Fetching restaurants for Zip Code  60195
Fetching restaurants for Zip Code  60004
Fetching restaurants for Zip Code  60010
Fetching restaurants for Zip Code  60047
Fetching restaurants for Zip Code  60067
Fetching restaurants for Zip Code  60074
Fetching restaurants for Zip Code  60089
Fetching restaurants for Zip Code  60095
Fetching restaurants for Zip Code  60005
Fetching restaurants for Zip Code  60008
Fetching restaurants for Zip Code  60056
Fetching restaurants for Zip Code  60070
Fetching restaurants for Zip Code  60173
Fetching restaurants for Zip Code  60108
Fetching restaurants for Zip Code  60117
Fetching restaurants for Zip Code  60172
Fetching restaur

In [17]:
print('Total number of restaurants:', len(allRestaurants))
print('Total number of Indian restaurants:', len(indianRestaurants))
print('Percentage of Indian restaurants: {:.2f}%'.format(len(indianRestaurants) / len(allRestaurants) * 100))
print('Average number of restaurants in Zip Code: {:.2f}'.format(len(allRestaurants)/len(all_restaurantsDF['Zip Code'].unique())))

Total number of restaurants: 1499
Total number of Indian restaurants: 32
Percentage of Indian restaurants: 2.13%
Average number of restaurants in Zip Code: 44.09


Let's plot restaurants and see where they fit on the Map. We will try to see where are the Indian restaurants in these zip codes.

In [20]:
map_schaumburg = folium.Map(location=[lat, lng], zoom_start=11)
folium.Marker(latlng, popup='Schaumburg').add_to(map_schaumburg)
for lat, lon, name, isIndian in zip(all_restaurantsDF['Venue Latitude'], all_restaurantsDF['Venue Longitude'], all_restaurantsDF['Venue Name'],all_restaurantsDF['Is Indian']):
    color = 'red' if isIndian else 'blue'
    folium.CircleMarker([lat, lon], radius=3, color=color, fill=True,
                        fill_color=color, fill_opacity=1).add_to(map_schaumburg)
map_schaumburg

Alright, We don't see lot of Indian restaurants (red circles). But as we zoom, we do see there are areas where there is significant concentration of Indian restaurents.

In [21]:
#Print list of all restaurants
all_restaurantsDF[5:15]

Unnamed: 0,Venue Id,Venue Name,Venue Category,Venue Latitude,Venue Longitude,Venue Location,Venue Distance,Zip Code,Is Indian,Distance From Center
478,4b58a8e8f964a520336428e3,Chipotle Mexican Grill,Mexican Restaurant,42.112781,-87.97832,338 E Rand Rd (Off Rand Rd. in the Northpoint ...,113,60004,False,22417.408379
471,4b5345dff964a520369527e3,Corner Bakery Cafe,Bakery,42.110841,-87.975805,"470 E Rand Rd,, Arlington Heights, IL 60004",308,60004,False,22264.506003
479,4be9e1346295c9b6b6c98508,Domino's Pizza,Pizza Place,42.111412,-87.979336,325 E Rand Rd (Rand Rd and Arlingtron Heights ...,66,60004,False,22248.864161
430,4c10dbd5ce57c928977382d2,Dunkin' Donuts,Donut Shop,42.138642,-87.984169,"105 W Dundee Rd, Arlington Heights, IL 60004",2994,60004,False,25096.395451
461,4a677de8f964a52088c91fe3,Dunkin' Donuts,Donut Shop,42.106194,-87.970405,"1010 E Rand Rd, Arlington Heights, IL 60004",972,60004,False,21895.395774
476,5830948a24ca6a7d84491b62,Gail's Carriage Place,Breakfast Spot,42.113484,-87.979423,"306 E Rand Rd, Arlington Heights, IL 60004",166,60004,False,22469.468343
477,4f74eb2ce4b07d85bfba0b52,Garibaldi's,Italian Restaurant,42.112069,-87.981041,"1960 N Arlington Heights Rd, Arlington Heights...",152,60004,False,22283.696973
469,5c2becf092e7a9002c478095,Golden Corral,Buffet,42.108451,-87.976266,"445 E Palatine Rd, Arlington Heights, IL 60004",463,60004,False,21998.618805
462,4f934f71e4b03828fdb9c76a,JD's Q & Brew,BBQ Joint,42.116777,-87.986122,"286 W Rand Rd, Arlington Heights, IL 60004",780,60004,False,22691.346008
434,4bec3e6175b2c9b66b35438d,Jimmy John's,Sandwich Place,42.137691,-87.986504,"1-3 Villa Verde Dr, Wheeling, IL 60004",2922,60004,False,24952.828513


In [22]:
indian_restaurantsDF[5:15]

Unnamed: 0,Venue Id,Venue Name,Venue Category,Venue Latitude,Venue Longitude,Venue Location,Venue Distance,Zip Code,Is Indian,Distance From Center
22,4d4f2d3020d3236ab73ea328,Ashoka,Indian Restaurant,41.937627,-88.079854,"252 E Army Trail Rd, Glendale Heights, IL 60139",1196,60108,True,3426.812364
0,4bd4cd02cfa7b7135ce724da,Cool Mirchi,Indian Restaurant,41.999869,-88.059911,"814 E Nerge Rd, Roselle, IL 60172",2178,60157,True,9167.126867
3,52b4efd7498e25b1f6f44607,Atithi,Indian Restaurant,42.048326,-88.084375,"167 W Golf Rd, Schaumburg, IL 60195",1832,60169,True,14804.62317
4,4e14fbada809f31f66a2d93c,Atithi,Indian Restaurant,42.04839,-88.084402,"167 W Golf Rd, Schaumburg, IL 60195",1829,60169,True,14812.048198
8,4bc9e55b68f976b01e115e83,Dakshin Indian Cuisine,Indian Restaurant,42.047734,-88.098834,"1135 N Salem Dr (at Golf Rd.), Schaumburg, IL ...",657,60169,True,15029.839803
5,4c040f57f56c2d7ffbbb1d66,Hakka Wok,Indian Restaurant,42.046031,-88.125197,"1851 W Golf Rd, Schaumburg, IL 60194",1587,60169,True,15607.062327
10,56f8784a498e69128da0003b,Honest,Indian Restaurant,42.049291,-88.101547,"835 W Higgins Rd, Schaumburg, IL 60195",409,60169,True,15259.567037
12,4c4c8be05609c9b6e945bc92,Hot Breads,Indian Restaurant,42.047799,-88.105356,"1065 W Golf Rd, Hoffman Estates, IL 60169",192,60169,True,15196.380611
6,55691b98498e033029e8b641,Inchin's Bamboo Garden,Indian Chinese Restaurant,42.046619,-88.124826,"Schaumburg, IL",1544,60169,True,15654.204578
7,4d8518c1d5fab60c1579f09b,India House Restaurant,Indian Restaurant,42.047927,-88.097757,"721 W Golf Rd (at Higgins Rd.), Hoffman Estate...",738,60169,True,15025.74955


We are done with the data collection for the project. We may need some additional data later, but for now we are good. We will use this data in Week 5 Lab to perform our analysis.