## Objective

As part of the Capstone project, we will perform an exploratory analysis of location data, using the Foursquare API to gather venue information for our objective.

Our objective is to find an ideal location for us to open a restaurant.

### Stage 1: Data

For the above objective, we will be using open data acquired from the Dubai Statistics Center. The data is available in the form of an Excel sheet, which will require a considerable amount of refinement. The data source is available at the following location:
    
Report URL <a href="https://www.dsc.gov.ae/Report/DSC_SYB_2019_01%20_%2002.xlsx">https://www.dsc.gov.ae/Report/DSC_SYB_2019_01%20_%2002.xlsx</a>

We chose this data source because it lists the communities and their corresponding populations, updated through 2019.

#### Step 1.1: Extract data

In [1]:
# importing libraries

import requests
import pandas as pd

In [2]:
# reading excel report from the source.

data_url = 'https://www.dsc.gov.ae/Report/DSC_SYB_2019_01%20_%2002.xlsx'
df_raw_report = pd.read_excel(data_url)

# determining structure
df_raw_report.shape

(247, 5)

#### Step 1.2: Data Wrangling

Because the report has a considerable amount of header and footer data, we will be removing it.

In [3]:
# removing header information

df_raw_report = df_raw_report.iloc[7:]
df_raw_report.head()

Unnamed: 0.1,Unnamed: 0,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4
7,101,نخلة ديرة,2,NAKHLAT DEIRA,101
8,111,الكورنيش,1735,AL CORNICHE,111
9,112,الرأس,7460,AL RASS,112
10,113,الضغاية,15899,AL DHAGAYA,113
11,114,البطين,2841,AL BUTEEN,114


In [4]:
# removing footer from the report

df_raw_report = df_raw_report[:-6]
df_raw_report.tail()

Unnamed: 0.1,Unnamed: 0,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4
236,978,سيح شعيله,3,SAIH SHUA'ALAH,978
237,981,مقطره,804,MUGATRAH,981
238,987,الليان 1,10,AL LAYAN 1,987
239,988,الليان 2,0,AL LAYAN 2,988
240,991,حفير,0,HEFAIR,991


The report also contains additional columns which we do not require as they represent the same information in Arabic.

In [5]:
df_raw_report = df_raw_report[['Unnamed: 2', 'Unnamed: 3']]

Renaming column headers

In [6]:
df_raw_report.rename(columns = {'Unnamed: 2':'population', 'Unnamed: 3':'community'}, inplace = True)

#### Step 1.3: Data Wrangling Continues

If you look at the report, the communities are split into sectors. These sectors appear in the report as splitter rows, which we need to remove.

In [7]:
# get the indexes of the rows whose community column holds a sector label

sector_index = df_raw_report[df_raw_report['community'].isin(['Sector 1', 'Sector 2', 'Sector 3', 'Sector 4', 'Sector 5', 'Sector 6', 'Sector 7', 'Sector 8'])].index

# dropping the rows based on the indexes found
df_raw_report.drop(sector_index, inplace = True)

df_raw_report.shape

(226, 2)
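
Listing every sector label by hand works for this report, but a prefix match may be less brittle if the number of sectors changes. The following is a minimal sketch, assuming every splitter row starts with the word 'Sector'; the small `df_demo` frame is illustrative and only mimics the report's layout.

```python
import pandas as pd

# Illustrative frame: one splitter row plus two community rows from the report
df_demo = pd.DataFrame({
    'population': [None, 2, 1735],
    'community': ['Sector 1', 'NAKHLAT DEIRA', 'AL CORNICHE'],
})

# Flag rows whose community name starts with 'Sector' (missing names count as False)
is_sector = df_demo['community'].str.startswith('Sector', na=False)

# Keep everything that is not a splitter row
df_demo = df_demo[~is_sector]

print(df_demo['community'].tolist())
```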

Let's change the order of the columns in our dataframe

In [8]:
df_raw_report = df_raw_report[['community', 'population']]
df_raw_report.head()

Unnamed: 0,community,population
7,NAKHLAT DEIRA,2
8,AL CORNICHE,1735
9,AL RASS,7460
10,AL DHAGAYA,15899
11,AL BUTEEN,2841


Let's sort the dataframe by population (descending)

In [9]:
df_raw_report.sort_values(by = ['population'], inplace = True, ascending = False)
df_raw_report.head(10)

Unnamed: 0,community,population
56,MUHAISANAH SECOND,196316
107,AL GOZE IND. SECOND,159978
153,JABAL ALI INDUSTRIAL FIRST,128975
163,WARSAN FIRST,106072
23,HOR AL ANZ,83187
147,JABAL ALI FIRST,75287
77,AL KARAMA,75066
152,DUBAI INVESTMENT PARK1,69956
20,AL MURQABAT,69771
51,MURDAF,64355


We will be extracting coordinates with GeoPy, which can query Google Maps or another geocoding provider. While scouting the data, I noticed that the area names in our report carry a suffix such as FIRST, SECOND, THIRD, etc., while the same areas are marked with the numbers 1, 2, 3 in Google Maps.

This means that if I pass WARSAN FIRST to GeoPy, it won't find the coordinates. To solve this problem, we will replace the suffixes with numerical values.

In [10]:
df_raw_report.replace('FIRST', '1', regex = True, inplace = True)
df_raw_report.replace('SECOND', '2', regex = True, inplace = True)
df_raw_report.replace('THIRD', '3', regex = True, inplace = True)
df_raw_report.replace('FOURTH', '4', regex = True, inplace = True)
df_raw_report.replace('FIFTH', '5', regex = True, inplace = True)
df_raw_report.replace('SIXTH', '6', regex = True, inplace = True)
df_raw_report.head(5)

Unnamed: 0,community,population
56,MUHAISANAH 2,196316
107,AL GOZE IND. 2,159978
153,JABAL ALI INDUSTRIAL 1,128975
163,WARSAN 1,106072
23,HOR AL ANZ,83187
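
The six `replace` calls above could also be driven by a single mapping. The sketch below is one possible variant, not the notebook's own code: it assumes the ordinal should only be replaced when it appears as a whole word (hence the `\b` anchors), and `ORDINALS` and `names` are names introduced here for illustration.

```python
import pandas as pd

# Ordinal words used in the report and the digits used by map providers
ORDINALS = {'FIRST': '1', 'SECOND': '2', 'THIRD': '3',
            'FOURTH': '4', 'FIFTH': '5', 'SIXTH': '6'}

names = pd.Series(['WARSAN FIRST', 'MUHAISANAH SECOND', 'HOR AL ANZ'])

# \b anchors each pattern to a whole word
patterns = {r'\b{}\b'.format(word): digit for word, digit in ORDINALS.items()}

# A dict of regex -> replacement applies all substitutions in one call
converted = names.replace(patterns, regex=True)

print(converted.tolist())
```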


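We remove industrial areas from our list of communities, as we are only interested in commercial and residential areas for our restaurant.

In [11]:
# 'IND.' is treated as a regular expression, so '.' matches any character;
# this drops names containing 'IND.' as well as 'INDUSTRIAL'
df_raw_report = df_raw_report[~df_raw_report.community.str.contains('IND.')]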
df_raw_report.head()

Unnamed: 0,community,population
56,MUHAISANAH 2,196316
163,WARSAN 1,106072
23,HOR AL ANZ,83187
147,JABAL ALI 1,75287
77,AL KARAMA,75066
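
Because `str.contains` interprets its pattern as a regular expression by default, `'IND.'` matches 'IND' followed by any character. The sketch below contrasts that with a more explicit pattern; 'HIND CITY' is a made-up name used only to show an unintended match, and the explicit pattern is one possible choice rather than the notebook's method.

```python
import pandas as pd

names = pd.Series(['AL QUOZ IND. 2', 'JABAL ALI INDUSTRIAL 1', 'HIND CITY', 'AL KARAMA'])

# Default behaviour: '.' is a wildcard, so 'IND ' inside 'HIND CITY' also matches
as_regex = names.str.contains('IND.')

# Explicit alternative: a literal 'IND.' or the whole word 'INDUSTRIAL'
explicit = names.str.contains(r'\bIND\.|\bINDUSTRIAL\b')

print(as_regex.tolist())
print(explicit.tolist())
```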


Some of the locality names in this dataset differ from how they are represented by map providers. For example, 'Al Quoz' is named 'Al Goze'. This can cause inconsistencies and may lead us to exclude populated areas from our analysis. The following are the naming corrections we had to make.

In [12]:
df_raw_report.replace('GOZE', 'QUOZ', regex = True, inplace = True)
df_raw_report.replace('JABAL ALI 1', 'JEBEL ALI', regex = True, inplace = True)
df_raw_report.replace('MURDAF', 'MIRDIF', regex = True, inplace = True)
df_raw_report.replace('PARK1', 'PARK 1', regex = True, inplace = True)
df_raw_report.replace('PARK2', 'PARK 2', regex = True, inplace = True)
df_raw_report.replace('MURQABAT', 'MURAQABAT', regex = True, inplace = True)
df_raw_report.replace('MARSA DUBAI (AL MINA AL SEYAHI) ', 'MARSA DUBAI', inplace = True)
df_raw_report.replace('AL BADA', 'AL BADA\'A', regex = True, inplace = True)
df_raw_report.replace('SUQ', 'SOUQ', regex = True, inplace = True)
df_raw_report.replace('AL THANYAH 5 (EMIRATE HILLS 1) ', 'EMIRATES HILLS 1', inplace = True)
df_raw_report.replace('AL THANYAH 4 (EMIRATE HILLS 3) ', 'EMIRATES HILLS 3', inplace = True)
df_raw_report.replace('AL THANYAH 3 (EMIRATE HILLS 2)', 'EMIRATES HILLS 2', inplace = True)
df_raw_report.replace('NADD HESSA', 'DUBAI SILICON OASIS', inplace = True)
df_raw_report.replace('AL THANYAH 1 (V. RABIE SAHRA\'A)', 'TECOM', inplace = True)
df_raw_report.replace('MENA JABAL ALI', 'JEBEL ALI NORTH FREE ZONE', inplace = True)
df_raw_report.replace('MUHAISANAH 4', 'MUHAISNAH 4', inplace = True)
df_raw_report.replace('OUD AL MUTEEN 1', 'OUD AL MUTEENA 1', inplace = True)
df_raw_report.replace('WADI AL SAFA 6 (ARABIAN RANCHES)', 'ARABIAN RANCHES', inplace = True)
df_raw_report.replace('NAD AL HAMAR', 'NADD AL HAMAR', inplace = True)
df_raw_report.replace('AL SOUQ AL KABEER', 'BUR DUBAI', inplace = True)
df_raw_report.replace('AL KALIJ AL TEJARI', 'BUSINESS BAY', inplace = True)
df_raw_report.replace('AL WAHEDA', 'AL WUHEIDA', inplace = True)
df_raw_report.replace('AL HEBIAH 4', 'DUBAI SPORTS CITY', inplace = True)
df_raw_report.replace('UM SOUQAIM 2', 'UMM SUQEIM 2', inplace = True)
df_raw_report.replace('UM SOUQAIM 1', 'UMM SUQEIM 1', inplace = True)
df_raw_report.replace('AL HEBIAH 1', 'MOTOR CITY', inplace = True)
df_raw_report.replace('AL BAESHAA 2', 'AL BARSHA 2', inplace = True)
df_raw_report.replace('MADINAT DUBAI AL MELAHEYAH (AL MINA)', 'DUBAI MARITIME CITY', inplace = True)
df_raw_report.replace('AL DHAGAYA', 'AL RAS', inplace = True)
df_raw_report.replace('AL REGA', 'AL RIGGA', inplace = True)
df_raw_report.replace('WADI AL SAFA 3', 'LIVING LEGENDS', inplace = True)
df_raw_report.replace('AL HEBIAH 5', 'REMRAAM', inplace = True)
df_raw_report.replace('AL SAFFA 1', 'AL SAFA 1', inplace = True)
df_raw_report.replace('UM SOUQAIM 3', 'UMM SUQEIM 3', inplace = True)
df_raw_report.replace('REGA AL BUTEEN', 'RIGGAT AL BUTEEN', inplace = True)
pd.set_option('display.max_rows', None)
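
Several of the exact-match corrections above include a trailing space in the search string, which suggests the report's names carry stray whitespace. One way to make exact matches less fragile is to strip the names first and keep the corrections in a dictionary. This is a sketch under that assumption; `CORRECTIONS` holds only a few of the pairs from the cell above.

```python
import pandas as pd

# A few of the exact-match corrections from the cell above
CORRECTIONS = {
    'NADD HESSA': 'DUBAI SILICON OASIS',
    'AL KALIJ AL TEJARI': 'BUSINESS BAY',
    'MARSA DUBAI (AL MINA AL SEYAHI)': 'MARSA DUBAI',
}

names = pd.Series(['NADD HESSA ', 'MARSA DUBAI (AL MINA AL SEYAHI) ', 'AL KARAMA'])

# Strip stray whitespace first so exact matches do not depend on it,
# then map whole values through the dictionary
cleaned = names.str.strip().replace(CORRECTIONS)

print(cleaned.tolist())
```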

Now that we have our desired dataframe, we will proceed to Stage 2 of our work.

### Stage 2: Coordinates

In stage 2, we will extract each community's coordinates and append them to our dataframe.

To minimize the time required to extract such information, we will be obtaining the coordinates of the top 100 communities with the highest population.

#### Step 2.1: Top 100 

In [13]:
# getting the top 100 communities based on population

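# .copy() gives an independent dataframe, so later assignments do not trigger SettingWithCopyWarning
df_communities = df_raw_report.head(100).copy()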

#### Step 2.2: GeoPy

In [14]:
# importing library

from geopy.geocoders import Nominatim

In [15]:
# defining function to get coordinates based on community name

def get_latitude_longitude(community_name):
    # Nominatim needs a user agent that identifies the application
    geolocator = Nominatim(user_agent="waqa5_ahm3d_capstone")
    location = geolocator.geocode('{}, Dubai, United Arab Emirates'.format(community_name))

    # geocode() returns None when the community cannot be found
    if location is None:
        raise ValueError('No coordinates found for {}'.format(community_name))

    return location.latitude, location.longitude
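
The public Nominatim service asks clients to stay at or below one request per second, so it may be worth pausing between lookups. The helper below is a hypothetical sketch (the name `geocode_all` and its parameters are not part of the notebook): it takes any geocoding callable, such as `geolocator.geocode`, and records `None` for names that cannot be resolved.

```python
import time

def geocode_all(names, geocode, delay_seconds=1.0):
    """Geocode each name in turn, pausing between requests.

    `geocode` is any callable that returns an object with `latitude` and
    `longitude` attributes, or None when nothing is found.
    """
    coords = {}
    for name in names:
        location = geocode('{}, Dubai, United Arab Emirates'.format(name))
        # Store None for unresolved names instead of raising
        coords[name] = None if location is None else (location.latitude, location.longitude)
        # Wait before the next request to respect the service's rate limit
        time.sleep(delay_seconds)
    return coords
```

With the notebook's objects, a call might look like `geocode_all(df_communities['community'], Nominatim(user_agent="...").geocode)`.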


Now it's time to loop through the top 100 communities and add their coordinates to the dataframe.

In [16]:
for i, row in df_communities.iterrows():
    community_name = row['community']

    try:
        lat, long = get_latitude_longitude(community_name)

        # store the coordinates in the dataframe
        df_communities.loc[i, 'latitude'] = lat
        df_communities.loc[i, 'longitude'] = long
    except Exception:
        # leave the coordinates as NaN when geocoding fails; these rows are dropped later
        pass


In [17]:
#Dropping NaN entries from our dataset

df_communities.dropna(inplace = True)
df_communities.head(10)


Unnamed: 0,community,population,latitude,longitude
56,MUHAISANAH 2,196316,25.280555,55.410502
163,WARSAN 1,106072,25.160381,55.425285
23,HOR AL ANZ,83187,25.277042,55.3373
147,JEBEL ALI,75287,25.028782,55.123823
77,AL KARAMA,75066,25.244403,55.304755
152,DUBAI INVESTMENT PARK 1,69956,25.010873,55.165855
20,AL MURAQABAT,69771,25.265104,55.329721
51,MIRDIF,64355,25.221335,55.423499
43,AL NAHDA 2,61936,25.290592,55.376731
121,MARSA DUBAI,61047,25.087754,55.146172


As you can see from the above, it takes a lot of effort to make data usable for your requirements.

I will save this dataset and publish it on Kaggle for anyone looking for the top 100 communities in Dubai along with their populations.

In [18]:
df_communities.to_csv(r'F:\Study\Data Science\Course 9\top_100_dubai_communities_by_population.csv', index = False, header = True)

In [19]:
print('The dataframe has {} communities.'.format(
        len(df_communities['community'].unique())
    )
)

#Resetting index

df_communities.reset_index(drop=True, inplace=True)

The dataframe has 99 communities.


### Stage 3: Mapping

Let's take a look at Dubai and, based on our dataset, see where all these communities are.

For mapping, we will be using Folium.

In [20]:
import folium # map rendering library

#### Step 3.1: Get Dubai city coordinates 

In [21]:
#Using Nominatim, we will get latitude and longitude for Dubai city

dxb_address = 'Dubai, United Arab Emirates'

geolocator = Nominatim(user_agent="dxb_explorer")
location = geolocator.geocode(dxb_address)
latitude = location.latitude
longitude = location.longitude

print('The geographical coordinates of Dubai are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Dubai are 25.0750095, 55.18876088183319.


#### Step 3.2: Mapping Dubai via Folium

With Folium, we will map out Dubai and then place markers for each community we have in our dataframe 

In [22]:
# create map of Dubai using latitude and longitude values
map_dubai = folium.Map(location = [latitude, longitude], zoom_start = 11)

# add markers to map
for lat, lng, label in zip(df_communities['latitude'], df_communities['longitude'], df_communities['community']):
    label = folium.Popup(label, parse_html = True)
    folium.CircleMarker(
        [lat, lng],
        radius = 7,
        popup = label,
        color = 'blue',
        fill = True,
        fill_color = '#3186cc',
        fill_opacity = 0.7,
        parse_html = False).add_to(map_dubai)  
    
map_dubai

Because we sorted our dataframe by population and picked the top 100 communities out of the complete dataset, we are able to cover most of the residential and commercial communities, although we did miss a few of them.

For our purposes, I think we are good to go :)

### Stage 4: Foursquare

Now that we have everything we need, let's proceed to the next step: Foursquare.

#### Step 4.1: Credentials

Let's set our credentials for utilizing and making API calls

In [23]:
CLIENT_ID = 'UVIZFMZHGH02PUGI2F1DLYPLIBSTU2CR1MI4RR15YZKHNCPK' # your Foursquare ID
CLIENT_SECRET = 'FWMB0BKSIOY50XXZMGUMJFLSY4F3QZ3EWKJIOD2PYL4NMKPI' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: UVIZFMZHGH02PUGI2F1DLYPLIBSTU2CR1MI4RR15YZKHNCPK
CLIENT_SECRET:FWMB0BKSIOY50XXZMGUMJFLSY4F3QZ3EWKJIOD2PYL4NMKPI


It's fine. It's a free account :)
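
If the notebook is going to be shared publicly, one option is to read the credentials from environment variables instead of writing them into a cell. The sketch below is only an illustration: the variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and the helper `load_foursquare_credentials` are assumptions, not something Foursquare prescribes.

```python
import os

def load_foursquare_credentials(env=None):
    """Read the Foursquare credentials from environment variables."""
    # Fall back to the process environment when no mapping is passed in
    env = os.environ if env is None else env
    client_id = env.get('FOURSQUARE_CLIENT_ID')
    client_secret = env.get('FOURSQUARE_CLIENT_SECRET')
    # Fail early with a clear message when either value is missing
    if not client_id or not client_secret:
        raise RuntimeError('Set FOURSQUARE_CLIENT_ID and FOURSQUARE_CLIENT_SECRET first')
    return client_id, client_secret
```

The cell above would then become `CLIENT_ID, CLIENT_SECRET = load_foursquare_credentials()`.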

#### Step 4.2: Start Small

Let's start with a single community and see what we get from Foursquare

In [24]:
df_communities.loc[0, 'community']

'MUHAISANAH 2 '

Getting coordinates

In [25]:
community_latitude = df_communities.loc[0, 'latitude'] # neighborhood latitude value
community_longitude = df_communities.loc[0, 'longitude'] # neighborhood longitude value

community_name = df_communities.loc[0, 'community'] # neighborhood name

print('Latitude and longitude values of {} are {}, {}.'.format(community_name, community_latitude, community_longitude))

Latitude and longitude values of MUHAISANAH 2  are 25.2805548, 55.4105021.


Let's generate the GET URL for the Foursquare API call. We will request up to 100 venues within a 500-meter radius of the community's coordinates.

In [26]:
url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&ll={},{}&v={}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, community_latitude, community_longitude, VERSION, 500, 100)
url

'https://api.foursquare.com/v2/venues/explore?client_id=UVIZFMZHGH02PUGI2F1DLYPLIBSTU2CR1MI4RR15YZKHNCPK&client_secret=FWMB0BKSIOY50XXZMGUMJFLSY4F3QZ3EWKJIOD2PYL4NMKPI&ll=25.2805548,55.4105021&v=20180605&radius=500&limit=100'
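
The long format string can be hard to read because the placeholders and values are far apart. A possible alternative is to build the query string from a dictionary. The helper `build_explore_url` below is a hypothetical name introduced for this sketch; it uses only the standard library and the same endpoint and parameters as the cell above.

```python
from urllib.parse import urlencode

def build_explore_url(client_id, client_secret, lat, lng, version, radius=500, limit=100):
    """Build the venues/explore URL used above from named parameters."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'll': '{},{}'.format(lat, lng),
        'v': version,
        'radius': radius,
        'limit': limit,
    }
    # safe=',' keeps the comma in the 'll' value unescaped, as in the original URL
    return 'https://api.foursquare.com/v2/venues/explore?' + urlencode(params, safe=',')

print(build_explore_url('ID', 'SECRET', 25.2805548, 55.4105021, '20180605'))
```

`requests.get` also accepts a `params` dictionary directly, which avoids building the URL by hand.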

Because the API returns JSON, let's import the json library

In [27]:
import json

In [28]:
small_results = requests.get(url).json()

Let's decode

In [29]:
# function that extracts the category of the venue

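def get_category_type(row):
    # the column is called 'categories' or 'venue.categories' depending on the dataframe
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']

    # a venue may have no category at all
    if len(categories_list) == 0:
        return None
    return categories_list[0]['name']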

In [30]:
# Cleaning the results

venues = small_results['response']['groups'][0]['items']
    
nearby_venues = pd.json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head(10)

Unnamed: 0,name,categories,lat,lng
0,McDonald's (ماكدونالدز),Fast Food Restaurant,25.281423,55.411649
1,LuLu Center - LuLu Village,Grocery Store,25.280531,55.410506
2,McDonald's LuLu Village,Fast Food Restaurant,25.281949,55.411271
3,Sandesh,Indian Restaurant,25.280876,55.41072
4,UAE Exchange,Currency Exchange,25.281937,55.409644
5,Al Ansari Exchange,Currency Exchange,25.281536,55.411369
6,Amer Quick Plus,Business Service,25.280714,55.41544
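
To see what the flattening step does, it can help to run it on a single item. The item below is hand-written for illustration (it is not a real API response); its field names follow what the cell above reads, and its values are taken from the 'Sandesh' row of the output.

```python
import pandas as pd

# One hand-written item mimicking the fields the notebook reads
items = [{
    'venue': {
        'name': 'Sandesh',
        'categories': [{'name': 'Indian Restaurant'}],
        'location': {'lat': 25.280876, 'lng': 55.41072},
    }
}]

# Nested dictionaries become dotted column names; lists are kept as they are
flat = pd.json_normalize(items)

print(sorted(flat.columns))
print(flat.loc[0, 'venue.categories'][0]['name'])
```

This is why the category still has to be pulled out of a list afterwards, which is the job of `get_category_type`.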


Let's see the total number of venues returned by Foursquare

In [31]:
print('{} venues were returned by Foursquare for {}.'.format(nearby_venues.shape[0], community_name))

7 venues were returned by Foursquare for MUHAISANAH 2 .


#### Step 4.3: Explore Dubai

As our initial test for a single community turned out well, let's get the list of venues across all communities.

For this purpose, let's create a function that loops through all communities and compiles the list of venues.

In [34]:
def getDubaiVenues(names, latitudes, longitudes, radius = 500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        # create the API request URL (LIMIT is a global that must be set before the function is called)
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng,
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Community',
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude',
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Category']
    
    return nearby_venues
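
The function above indexes `categories[0]` directly, which would raise an `IndexError` for a venue that has no category. The helper below sketches one way to guard against that; `venue_rows` is a hypothetical name, and the item structure is assumed to be the one the function above already relies on.

```python
def venue_rows(community, lat, lng, items):
    """Turn Foursquare `items` into flat tuples, tolerating missing categories."""
    rows = []
    for item in items:
        venue = item['venue']
        # Treat a missing or empty category list the same way
        categories = venue.get('categories') or []
        category = categories[0]['name'] if categories else None
        rows.append((community, lat, lng, venue['name'],
                     venue['location']['lat'], venue['location']['lng'], category))
    return rows
```

Inside `getDubaiVenues`, the list comprehension could then be replaced by `venues_list.append(venue_rows(name, lat, lng, results))`.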

Let's Loop

In [35]:
# maximum number of venues to request per community
LIMIT = 100

dubai_venues = getDubaiVenues(names = df_communities['community'], latitudes = df_communities['latitude'], longitudes = df_communities['longitude'])

Let's take a peek inside the venues

In [36]:
print('{} venues were returned by Foursquare.'.format(dubai_venues.shape[0]))

dubai_venues.head()

1416 venues were returned by Foursquare.


Unnamed: 0,Community,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Category
0,MUHAISANAH 2,25.280555,55.410502,McDonald's (ماكدونالدز),25.281423,55.411649,Fast Food Restaurant
1,MUHAISANAH 2,25.280555,55.410502,LuLu Center - LuLu Village,25.280531,55.410506,Grocery Store
2,MUHAISANAH 2,25.280555,55.410502,McDonald's LuLu Village,25.281949,55.411271,Fast Food Restaurant
3,MUHAISANAH 2,25.280555,55.410502,Sandesh,25.280876,55.41072,Indian Restaurant
4,MUHAISANAH 2,25.280555,55.410502,UAE Exchange,25.281937,55.409644,Currency Exchange


Let's check how many venues per community were returned by Foursquare

In [37]:
dubai_venues.groupby('Community').count()

Unnamed: 0_level_0,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Category
Community,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
ABU HAIL,4,4,4,4,4,4
AL BADA'A,5,5,5,5,5,5
AL BARAHA,11,11,11,11,11,11
AL BARSHA 2,6,6,6,6,6,6
AL BARSHA SOUTH 2,4,4,4,4,4,4
AL BARSHA SOUTH 4,46,46,46,46,46,46
AL BARSHA SOUTH 5,46,46,46,46,46,46
AL BARSHAA 1,54,54,54,54,54,54
AL BARSHAA 3,54,54,54,54,54,54
AL GARHOUD,7,7,7,7,7,7


Because we are interested in the restaurant category, let's see how many categories of venues were returned by Foursquare.

In [38]:
print('There are {} unique categories.'.format(len(dubai_venues['Category'].unique())))

There are 199 unique categories.


### Stage 5: Prepare & Analyze

Let's start analyzing our data for each community and transform it so we can use it efficiently during the machine learning step.

#### Step 5.1: Prepare

Let's prepare our data so it conforms to what machine learning algorithms expect.

Categories provided by Foursquare are in label form, and machine learning algorithms cannot operate on label data directly. They require all input and output variables to be numeric.

To transform our data into numerical form, we will perform one-hot encoding. This turns our Category labels into features (columns) in our dataframe, with values of 0 or 1.

In [39]:
# one hot encoding
dubai_onehot = pd.get_dummies(dubai_venues[['Category']], prefix="", prefix_sep="")

# add community column back to dataframe
dubai_onehot['Community'] = dubai_venues['Community']

# move community column to the first column
fixed_columns = [dubai_onehot.columns[-1]] + list(dubai_onehot.columns[:-1])
dubai_onehot = dubai_onehot[fixed_columns]

print(dubai_onehot.shape)
dubai_onehot.head()

(1416, 200)


Unnamed: 0,Community,Accessories Store,Afghan Restaurant,African Restaurant,American Restaurant,Aquarium,Arcade,Art Gallery,Arts & Crafts Store,Asian Restaurant,...,Tram Station,Tunnel,Turkish Restaurant,Vegetarian / Vegan Restaurant,Video Game Store,Water Park,Waterfront,Wings Joint,Women's Store,Yemeni Restaurant
0,MUHAISANAH 2,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,MUHAISANAH 2,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,MUHAISANAH 2,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,MUHAISANAH 2,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,MUHAISANAH 2,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Let's find the mean of each category for each community

In [40]:
dubai_onehot_grouped = dubai_onehot.groupby(["Community"]).mean().reset_index()

dubai_onehot_grouped.head(10)

Unnamed: 0,Community,Accessories Store,Afghan Restaurant,African Restaurant,American Restaurant,Aquarium,Arcade,Art Gallery,Arts & Crafts Store,Asian Restaurant,...,Tram Station,Tunnel,Turkish Restaurant,Vegetarian / Vegan Restaurant,Video Game Store,Water Park,Waterfront,Wings Joint,Women's Store,Yemeni Restaurant
0,ABU HAIL,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,AL BADA'A,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,AL BARAHA,0.0,0.0,0.0,0.090909,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,AL BARSHA 2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,AL BARSHA SOUTH 2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,AL BARSHA SOUTH 4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.021739,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,AL BARSHA SOUTH 5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.021739,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,AL BARSHAA 1,0.0,0.0,0.0,0.018519,0.0,0.0,0.0,0.0,0.037037,...,0.0,0.0,0.018519,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,AL BARSHAA 3,0.0,0.0,0.0,0.018519,0.0,0.0,0.0,0.0,0.037037,...,0.0,0.0,0.018519,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,AL GARHOUD,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
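
The grouped means are easier to interpret on a tiny example: averaging 0/1 flags per community gives the share of that community's venues falling in each category, not a count. The data below is made up for illustration (communities 'A' and 'B' are placeholders), and `dtype=int` is passed so the flags are 0/1 integers regardless of the pandas version.

```python
import pandas as pd

# Toy data: community A has four venues, community B has two
venues = pd.DataFrame({
    'Community': ['A', 'A', 'A', 'A', 'B', 'B'],
    'Category': ['Café', 'Café', 'Park', 'Pakistani Restaurant', 'Hotel', 'Café'],
})

# One column per category, with 1 where the venue belongs to it
onehot = pd.get_dummies(venues[['Category']], prefix='', prefix_sep='', dtype=int)
onehot['Community'] = venues['Community']

# Mean of the 0/1 flags per community = share of venues in each category
shares = onehot.groupby('Community').mean()

print(shares['Pakistani Restaurant'].tolist())
```

Community A has one Pakistani restaurant among four venues, so its share is 0.25; community B has none, so its share is 0.0.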


#### Step 5.2: Filter

Because our data set includes 199 categories and we are only interested in analyzing Pakistani restaurants, let's filter our dataframe.

Let's see all the categories available to us

In [41]:
dubai_venues['Category'].unique()

array(['Fast Food Restaurant', 'Grocery Store', 'Indian Restaurant',
       'Currency Exchange', 'Business Service', 'Garden Center',
       'Kitchen Supply Store', 'Gym Pool', 'Market', 'Convenience Store',
       'History Museum', 'Park', 'Campground', 'Coffee Shop',
       'Middle Eastern Restaurant', 'Korean Restaurant', 'Bakery',
       'Asian Restaurant', 'Vegetarian / Vegan Restaurant',
       'Ice Cream Shop', 'Pakistani Restaurant',
       'South Indian Restaurant', 'Restaurant', 'Japanese Restaurant',
       'Café', 'Jewelry Store', 'Chocolate Shop', 'Hotel',
       'Kurdish Restaurant', 'Filipino Restaurant', 'Lounge',
       'Gym / Fitness Center', 'Spa', 'Snack Place', 'Department Store',
       'Pizza Place', 'Supermarket', 'BBQ Joint', 'Hotel Bar',
       'Accessories Store', 'Nail Salon', 'Gym', 'Hookah Bar',
       'Seafood Restaurant', 'Cosmetics Shop', 'Pet Store', 'Tea Room',
       'Furniture / Home Store', 'Wings Joint', 'Chinese Restaurant',
       'Sporting Good

As we can see above, there is a category called 'Pakistani Restaurant'. Let's see how many communities have at least one venue in this category.

In [42]:
len(dubai_onehot_grouped[dubai_onehot_grouped['Pakistani Restaurant'] > 0])

8

Trust me, there are many more Pakistani restaurants in Dubai; it's just that only a few of them are marked as Pakistani in Foursquare :)

Let's filter and get the list of Pakistani restaurants.

In [43]:
pakistani_restaurant_venues = dubai_onehot_grouped[['Community', 'Pakistani Restaurant']]
pakistani_restaurant_venues.sort_values(by = 'Pakistani Restaurant', ascending=False).head()

Unnamed: 0,Community,Pakistani Restaurant
7,AL BARSHAA 1,0.055556
8,AL BARSHAA 3,0.055556
65,JUMEIRA 3,0.052632
64,JUMEIRA 2,0.052632
63,JUMEIRA 1,0.052632


### Stage 6: Clustering & Analysis

Now let's create clusters of communities based on where the Pakistani Restaurants are situated. Once we have a visual of the cluster, we can start breaking those clusters and see how many are in each cluster.

#### Step 6.1: Clustering

Let's cluster our communities into 5 clusters

In [61]:
# importing library

from sklearn.cluster import KMeans

In [69]:
# set number of clusters
k = 5

dxb_clustering = pakistani_restaurant_venues.drop(columns = ["Community"])

# run k-means clustering
kmeans = KMeans(n_clusters = k, random_state = 0).fit(dxb_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]

array([0, 0, 0, 0, 0, 2, 2, 4, 4, 0])
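
The label numbers returned by k-means are arbitrary: a higher label does not mean a higher share of Pakistani restaurants. If ordered labels are preferred, the clusters can be ranked by their centres. The sketch below uses made-up share values and three clusters purely for illustration; `n_init=10` is set explicitly so the result does not depend on the scikit-learn version's default.

```python
import numpy as np
from sklearn.cluster import KMeans

# Illustrative shares only: one feature per community, as in the notebook
shares = np.array([0.0, 0.0, 0.0, 0.02, 0.022, 0.05, 0.055]).reshape(-1, 1)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(shares)

# Rank the clusters by their centre so that label 0 is the lowest share
order = np.argsort(kmeans.cluster_centers_.ravel())
rank_of = np.empty_like(order)
rank_of[order] = np.arange(len(order))

# Translate the arbitrary labels into ordered ones
ordered_labels = rank_of[kmeans.labels_]

print(ordered_labels.tolist())
```

Since only one feature is used, the clusters amount to ranges of the Pakistani restaurant share, which is worth keeping in mind when reading the cluster tables.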

Create a new dataframe that contains each community's Pakistani restaurant share together with its cluster label.

In [70]:
dubai_merged = pakistani_restaurant_venues.copy()

# add clustering labels
dubai_merged["Cluster"] = kmeans.labels_

In [71]:
dubai_merged.head()

Unnamed: 0,Community,Pakistani Restaurant,Cluster
0,ABU HAIL,0.0,0
1,AL BADA'A,0.0,0
2,AL BARAHA,0.0,0
3,AL BARSHA 2,0.0,0
4,AL BARSHA SOUTH 2,0.0,0


Now let's join dubai_merged with dubai_venues to add the latitude and longitude of each community. Note that the join produces one row per venue, so each community appears multiple times.

In [72]:
dubai_merged = dubai_merged.join(dubai_venues.set_index("Community"), on="Community")

dubai_merged.sort_values(by = 'Pakistani Restaurant', inplace = True)
dubai_merged.head()

Unnamed: 0,Community,Pakistani Restaurant,Cluster,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Category
0,ABU HAIL,0.0,0,25.285942,55.329444,Hamriya Park,25.28571,55.333,Park
67,MARSA DUBAI,0.0,0,25.087754,55.146172,Dubai Marriott Harbour Hotel & Suites,25.087784,55.146433,Hotel
67,MARSA DUBAI,0.0,0,25.087754,55.146172,Toro Toro,25.086723,55.143713,Latin American Restaurant
67,MARSA DUBAI,0.0,0,25.087754,55.146172,Observatory Bar & Grill,25.087903,55.146222,Hotel Bar
67,MARSA DUBAI,0.0,0,25.087754,55.146172,Deniz Restaurant & Cafe,25.087833,55.14693,Restaurant
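
Because the joined dataframe has one row per venue, plotting it directly places many identical markers on top of each other. If one marker per community is enough, the rows can be reduced first. The sketch below uses a three-row excerpt shaped like the table above; `community_points` is a name introduced for illustration.

```python
import pandas as pd

# Excerpt shaped like dubai_merged: two venues in one community, one in another
merged = pd.DataFrame({
    'Community': ['MARSA DUBAI', 'MARSA DUBAI', 'ABU HAIL'],
    'Cluster': [0, 0, 0],
    'Neighborhood Latitude': [25.087754, 25.087754, 25.285942],
    'Neighborhood Longitude': [55.146172, 55.146172, 55.329444],
    'Venue': ['Toro Toro', 'Buddha Bar', 'Hamriya Park'],
})

# Keep the first row of each community and only the columns needed for a marker
community_points = merged.drop_duplicates(subset='Community')[
    ['Community', 'Cluster', 'Neighborhood Latitude', 'Neighborhood Longitude']
]

print(len(community_points))
```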


Let's see what the clusters look like

In [73]:
# Importing libraries

import matplotlib.cm as cm
import matplotlib.colors as colors
import numpy as np

In [74]:
# create map
map_clusters = folium.Map(location = [latitude, longitude], zoom_start = 11)

# set color scheme for the clusters
x = np.arange(k)

ys = [ i + x + ( i * x ) ** 2 for i in range(k)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(dubai_merged['Neighborhood Latitude'], dubai_merged['Neighborhood Longitude'], dubai_merged['Community'], dubai_merged['Cluster']):
    label = folium.Popup(str(poi) + ' - Cluster ' + str(cluster))
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],  # label 0 wraps around to rainbow[-1], the last (red) colour
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

#### Step 6.2: Examining Clusters

Now, we can examine each cluster and determine the ideal location for our restaurant venue. 

In [86]:
# Cluster: 1

dubai_merged.loc[dubai_merged['Cluster'] == 0].head(25)

Unnamed: 0,Community,Pakistani Restaurant,Cluster,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Category
0,ABU HAIL,0.0,0,25.285942,55.329444,Hamriya Park,25.28571,55.333,Park
67,MARSA DUBAI,0.0,0,25.087754,55.146172,Dubai Marriott Harbour Hotel & Suites,25.087784,55.146433,Hotel
67,MARSA DUBAI,0.0,0,25.087754,55.146172,Toro Toro,25.086723,55.143713,Latin American Restaurant
67,MARSA DUBAI,0.0,0,25.087754,55.146172,Observatory Bar & Grill,25.087903,55.146222,Hotel Bar
67,MARSA DUBAI,0.0,0,25.087754,55.146172,Deniz Restaurant & Cafe,25.087833,55.14693,Restaurant
67,MARSA DUBAI,0.0,0,25.087754,55.146172,The Açaí Spot - Marina,25.086301,55.147448,Coffee Shop
67,MARSA DUBAI,0.0,0,25.087754,55.146172,Buddha Bar,25.08656,55.144828,Cocktail Bar
66,LIVING LEGENDS,0.0,0,25.085079,55.298122,Delta International Real Estate (دلتا العالمية...,25.088754,55.298085,Real Estate Office
62,JEBEL ALI NORTH FREE ZONE,0.0,0,24.986114,55.081811,Philipp Plein,24.984995,55.079386,Women's Store
61,JEBEL ALI,0.0,0,25.028782,55.123823,Caffè Nero,25.028296,55.125027,Coffee Shop


In [92]:
# Cluster: 2

dubai_merged.loc[dubai_merged['Cluster'] == 1]

Unnamed: 0,Community,Pakistani Restaurant,Cluster,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Category
65,JUMEIRA 3,0.052632,1,25.233175,55.277371,Noodle Bowl,25.237252,55.275581,Chinese Restaurant
65,JUMEIRA 3,0.052632,1,25.233175,55.277371,Tipanan Filipino Restaurant,25.235842,55.277682,Asian Restaurant
65,JUMEIRA 3,0.052632,1,25.233175,55.277371,Westzone Supermarket,25.233702,55.278282,Supermarket
65,JUMEIRA 3,0.052632,1,25.233175,55.277371,KFC (كنتاكي),25.235876,55.277485,Fried Chicken Joint
64,JUMEIRA 2,0.052632,1,25.233175,55.277371,Sultan Palace,25.236897,55.27693,Café
65,JUMEIRA 3,0.052632,1,25.233175,55.277371,Golden Fork,25.237202,55.275681,Asian Restaurant
65,JUMEIRA 3,0.052632,1,25.233175,55.277371,Haretna Restaurant,25.236602,55.277046,Middle Eastern Restaurant
65,JUMEIRA 3,0.052632,1,25.233175,55.277371,Carrefour Express,25.236968,55.279525,Convenience Store
65,JUMEIRA 3,0.052632,1,25.233175,55.277371,Mini Chinese Restaurant,25.23652,55.276698,Chinese Restaurant
65,JUMEIRA 3,0.052632,1,25.233175,55.277371,Ambot Tik,25.235203,55.279757,Indian Restaurant


In [93]:
# Cluster: 3

dubai_merged.loc[dubai_merged['Cluster'] == 2]

Unnamed: 0,Community,Pakistani Restaurant,Cluster,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Category
5,AL BARSHA SOUTH 4,0.021739,2,25.111724,55.206427,KFC,25.114629,55.20587,Fried Chicken Joint
5,AL BARSHA SOUTH 4,0.021739,2,25.111724,55.206427,City Stay Hotel Apartments,25.114969,55.205621,Bed & Breakfast
5,AL BARSHA SOUTH 4,0.021739,2,25.111724,55.206427,City Stay Inn Hotel Apartment,25.114349,55.203629,Bed & Breakfast
5,AL BARSHA SOUTH 4,0.021739,2,25.111724,55.206427,Travellers,25.110041,55.203043,Hotel
5,AL BARSHA SOUTH 4,0.021739,2,25.111724,55.206427,ClubSilk,25.111847,55.203013,Nightclub
5,AL BARSHA SOUTH 4,0.021739,2,25.111724,55.206427,D' Fusion desi cuisine,25.109495,55.20286,Indian Restaurant
5,AL BARSHA SOUTH 4,0.021739,2,25.111724,55.206427,"Centre Circle Sports Bar, Ramada Chelsea Hotel.",25.108663,55.204018,Sports Bar
5,AL BARSHA SOUTH 4,0.021739,2,25.111724,55.206427,Tashkent Restaurant,25.11478,55.205255,Restaurant
5,AL BARSHA SOUTH 4,0.021739,2,25.111724,55.206427,Grand Excelsior Hotel,25.111694,55.203082,Hotel
5,AL BARSHA SOUTH 4,0.021739,2,25.111724,55.206427,Avari Hotel Apartments,25.113124,55.205852,Bed & Breakfast


In [91]:
# Cluster: 4

dubai_merged.loc[dubai_merged['Cluster'] == 3]

Unnamed: 0,Community,Pakistani Restaurant,Cluster,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Category
13,AL KARAMA,0.027778,3,25.244403,55.304755,Little Hut Indian Restaurant,25.244952,55.305362,Indian Restaurant
13,AL KARAMA,0.027778,3,25.244403,55.304755,Malabar Paris Restaurant,25.246005,55.305037,Indian Restaurant
13,AL KARAMA,0.027778,3,25.244403,55.304755,Sri Krishna Sweets,25.246373,55.307956,Indian Restaurant
13,AL KARAMA,0.027778,3,25.244403,55.304755,China Garden Karama,25.24626,55.30068,Asian Restaurant
13,AL KARAMA,0.027778,3,25.244403,55.304755,Bangalore Empire Karama,25.242486,55.30685,Asian Restaurant
13,AL KARAMA,0.027778,3,25.244403,55.304755,Panoor Restaurant,25.240625,55.306101,South Indian Restaurant
13,AL KARAMA,0.027778,3,25.244403,55.304755,Manisha's Kitchen,25.248705,55.304931,Indian Restaurant
13,AL KARAMA,0.027778,3,25.244403,55.304755,Golden Fork,25.241388,55.305041,Restaurant
13,AL KARAMA,0.027778,3,25.244403,55.304755,Agemono,25.245605,55.302806,Japanese Restaurant
13,AL KARAMA,0.027778,3,25.244403,55.304755,Cafe Xhale,25.247994,55.303914,Café


In [94]:
# Cluster: 5

dubai_merged.loc[dubai_merged['Cluster'] == 4]

Unnamed: 0,Community,Pakistani Restaurant,Cluster,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Category
7,AL BARSHAA 1,0.055556,4,25.111031,55.191054,Adagio Premium,25.112225,55.188747,Hotel
7,AL BARSHAA 1,0.055556,4,25.111031,55.191054,UZB Avenue,25.110966,55.190809,Russian Restaurant
8,AL BARSHAA 3,0.055556,4,25.111031,55.191054,Dr. Khalifa Building,25.112036,55.189507,Residential Building (Apartment / Condo)
8,AL BARSHAA 3,0.055556,4,25.111031,55.191054,brands for less,25.112042,55.187154,Discount Store
8,AL BARSHAA 3,0.055556,4,25.111031,55.191054,Daily Restaurant,25.113801,55.190895,Pakistani Restaurant
8,AL BARSHAA 3,0.055556,4,25.111031,55.191054,Nom Nom Asia Restaurant,25.107986,55.191212,Asian Restaurant
8,AL BARSHAA 3,0.055556,4,25.111031,55.191054,Centro Barsha by Rotana,25.113314,55.194876,Hotel
8,AL BARSHAA 3,0.055556,4,25.111031,55.191054,Sarhad Darbar,25.112642,55.193563,Pakistani Restaurant
8,AL BARSHAA 3,0.055556,4,25.111031,55.191054,Breakfast @ Centro Barsha,25.112957,55.194887,Breakfast Spot
8,AL BARSHAA 3,0.055556,4,25.111031,55.191054,The Pet Spa,25.110346,55.195651,Pet Store


### Conclusion

As you can see from the Folium map above, communities in cluster 1, marked in red, do not have any Pakistani restaurants, which gives us a lot of choice in where to start our business.

Communities in clusters 2, 3, 4, and 5 already have a few Pakistani restaurants, but there is still potential for a successful business there.
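
One way to narrow down the choice within cluster 1 is to rank the communities without a Pakistani restaurant by population. The sketch below is an illustration rather than part of the original analysis; it uses four communities whose populations and shares appear earlier in this notebook.

```python
import pandas as pd

# Populations taken from the top-100 table above
population = pd.DataFrame({
    'community': ['MUHAISANAH 2', 'JEBEL ALI', 'AL KARAMA', 'MARSA DUBAI'],
    'population': [196316, 75287, 75066, 61047],
})

# Share of venues marked as 'Pakistani Restaurant' for the same communities
shares = pd.DataFrame({
    'Community': ['MUHAISANAH 2', 'JEBEL ALI', 'AL KARAMA', 'MARSA DUBAI'],
    'Pakistani Restaurant': [0.0, 0.0, 0.027778, 0.0],
})

# Keep communities with no Pakistani restaurant and sort them by population
candidates = (
    shares[shares['Pakistani Restaurant'] == 0]
    .merge(population, left_on='Community', right_on='community')
    .sort_values('population', ascending=False)
)

print(candidates['Community'].tolist())
```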

### Shortcomings

Certain shortcomings were identified during data extraction and analysis, which affected the results and decision making.

The only information available to us was a list of community names and their populations. However, when deciding where to open a restaurant serving a specific cuisine, we also need to look at those areas' demographics. In our case, information about which communities have a large Pakistani population would have had a significant impact on the decision.

Another weakness in the data was identified when we retrieved the list of venues from Foursquare. Only 8 of our top 100 communities (by total population) returned any venue marked as 'Pakistani Restaurant'. This low number could be due to miscategorization or simply to the fact that Pakistani restaurants are not extensively listed on Foursquare.