# The Battle of Neighborhoods in Dhaka City
### A Data Science project

## Table of contents
* [Introduction: Business Problem](#introduction)
* [Data](#data)
* [Methodology](#methodology)
* [Analysis](#analysis)
* [Results and Discussion](#results)
* [Conclusion](#conclusion)

## Introduction: Business Problem <a name="introduction"></a>

<!-- The business problem that is addressed in this notebook is that, if a person wants to open a new **coffee shop** in a city in **Canada**, then what are the things that he/she has to look into before opening the shop. Here, by analyzing and exploring all of the Neighborhoods in the **Boroughs(North York, East York and York)** in the city **Vaughan**, he can get useful insights about the venues present in the neighborhoods. If he/she can find a neighborhood where no coffee shop is present currently he/she could try to establish one in that neighborhood. Also, he/she has to explore the neighboring neighborhoods to get better insight for his/her business. In this case, the **stakeholders** are **himself/herself** and the people in the neighborhoods. As he/she will be the **owner** of the coffee shop, and he/she wants to make profit off of it, he/she needs to analyze all the neighborhoods near the city. So, he/she will be the **internal stakeholder**. And **the customer** will be the **consumers**. The popularity and prosperity of his/her business will very much depend of the customers' mood, whether they like the coffee shop or not, whether they like the services given by the employees or not. So, the customers will be the **external stakeholder** of the business. -->

## Data <a name="data"></a>

<!-- The dataset that I am working on is the Neighborhood data of Canada according to their postal Codes. It has been downloaded from the wikipedia page: [Canada Postal codes](https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M.). To scrape the webpage, I have used the "beautifulsoup4" library. The dataset consists of three columns, namely, PostalCode ==> refers to the postal code of each of the Neighborhood, Borough ==> the Borough in which the Neighborhood is situated, and Neighborhood ==> the name of the Neighborhood.
To explore each of the Neighborhoods, where all of the coffee shops, parks, restaurants and other venues, the Foursquare API has been used. To use the Foursquare API I needed the latitude and the longitude values of each of the Neighborhoods. The latitude and the longitude values are collected from this [website](http://cocl.us/Geospatial_data).  -->

### Part 00: Importing Libraries

In [1]:
#importing important libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from geopy.geocoders import Nominatim

import matplotlib.cm as cm
import matplotlib.colors as colors

from sklearn.cluster import KMeans
!conda install -c conda-forge folium=0.5.0 --yes 
import folium

print('Libraries imported.')


In [2]:
!pip install beautifulsoup4



In [3]:
!pip install lxml



In [4]:
!pip install requests



In [5]:
from bs4 import BeautifulSoup
import requests

### Part 01: Generating the data

In [6]:
#Getting the source data from wikipedia page
source = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_in_Bangladesh').text

In [7]:
#Using BeautifulSoup4 to read the data
soup = BeautifulSoup(source, 'lxml')

In [8]:
#print(soup.prettify())

In [9]:
#Capturing the data table
table = soup.find("table", attrs={"class":"wikitable"})

In [10]:
table

<table class="wikitable">
<tbody><tr>
<th style="text-align: center; font-weight: bold;">District
</th>
<th style="text-align: center; font-weight: bold;">Thana
</th>
<th style="text-align: center; font-weight: bold;">SubOffice
</th>
<th style="text-align: center; font-weight: bold;">Post Code<sup class="reference" id="cite_ref-dhaka_1-0"><a href="#cite_note-dhaka-1">[1]</a></sup>
</th></tr>
<tr>
<td>Dhaka
</td>
<td>Dhaka
</td>
<td>Dhaka Cantonment--TSO
</td>
<td style="text-align: center;">1206
</td></tr>
<tr>
<td>Dhaka
</td>
<td>Dhamrai
</td>
<td>Dhamrai
</td>
<td style="text-align: center;">1350
</td>
<td>Dhaka
</td>
<td>Dhamrai
</td>
<td>Kalampur
</td>
<td style="text-align: center;">1351
</td></tr>
<tr>
<td>Dhaka
</td>
<td>Dhanmondi
</td>
<td>Jigatala TSO
</td>
<td style="text-align: center;">1209
</td></tr>
<tr>
<td>Dhaka
</td>
<td>Gulshan
</td>
<td>Banani TSO
</td>
<td style="text-align: center;">1213
</td></tr>
<tr>
<td>Dhaka
</td>
<td>Gulshan
</td>
<td>Badda
</td>
<td style="t

In [11]:
District = []
for i in table.find_all('tr'):
    for j in i.find_all('td'):
        District.append(j.text)
    District.append("***")  

In [12]:
new_dist = []
for i in District:
    new_dist.append(i.split('\n'))

In [13]:
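# Drop the row-separator entries. Building a new list avoids the skipped
# items that can occur when removing from a list while iterating over it.
new_dist = [i for i in new_dist if i != ['***']]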

In [14]:
new_dist

[['Dhaka', ''],
 ['Dhaka', ''],
 ['Dhaka Cantonment--TSO', ''],
 ['1206', ''],
 ['Dhaka', ''],
 ['Dhamrai', ''],
 ['Dhamrai', ''],
 ['1350', ''],
 ['Dhaka', ''],
 ['Dhamrai', ''],
 ['Kalampur', ''],
 ['1351', ''],
 ['Dhaka', ''],
 ['Dhanmondi', ''],
 ['Jigatala TSO', ''],
 ['1209', ''],
 ['Dhaka', ''],
 ['Gulshan', ''],
 ['Banani TSO', ''],
 ['1213', ''],
 ['Dhaka', ''],
 ['Gulshan', ''],
 ['Badda', ''],
 ['1212', ''],
 ['Dhaka', ''],
 ['Gulshan', ''],
 ['Gulshan Model Town', ''],
 ['1212', ''],
 ['Dhaka', ''],
 ['Jatrabari', ''],
 ['Dhania TSO', ''],
 ['1236', ''],
 ['Dhaka', ''],
 ['Joypara', ''],
 ['Joypara', ''],
 ['1331', ''],
 ['Dhaka', ''],
 ['Joypara', ''],
 ['Narisha', ''],
 ['1332', ''],
 ['Dhaka', ''],
 ['Joypara', ''],
 ['Palamganj', ''],
 ['1331', ''],
 ['Dhaka', ''],
 ['Keraniganj', ''],
 ['Ati', ''],
 ['1312', ''],
 ['Dhaka', ''],
 ['Keraniganj', ''],
 ['Dhaka Jute Mills', ''],
 ['1311', ''],
 ['Dhaka', ''],
 ['Keraniganj', ''],
 ['Kalatia', ''],
 ['1313', ''],
 ['Dhaka'

In [15]:
value = []
for i in new_dist:
    for j in i:
        if j=="":
            continue
        value.append(j)

In [16]:
# Every record spans four consecutive values:
# District, Thana, Suboffice, PostCode
District = value[0::4]
Thana = value[1::4]
Suboffice = value[2::4]
PostCode = value[3::4]

In [17]:
List = list(zip(District, Thana, Suboffice, PostCode))

In [18]:
column_names = ['District', 'Thana', 'Suboffice', 'PostCode']
df = pd.DataFrame(List, columns=column_names)
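
The cells above flatten every `<td>` into one list and then regroup the values four at a time, which also copes with the table row that holds eight cells. The following is a minimal, dependency-free sketch of that idea using only the standard library. `CellCollector`, `chunk_rows`, and `sample_html` are illustrative names that do not appear in the notebook, and unlike cell 15 the sketch keeps empty cells, on the assumption that this prevents columns from shifting.

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collect the stripped text of every <td> cell in document order."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._in_td = False
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._flush()  # a new <td> closes any cell left open
            self._in_td = True

    def handle_endtag(self, tag):
        if tag == "td":
            self._flush()

    def handle_data(self, data):
        if self._in_td:
            self._buf.append(data)

    def _flush(self):
        if self._in_td:
            self.cells.append("".join(self._buf).strip())
        self._in_td = False
        self._buf = []


def chunk_rows(cells, width=4):
    """Group a flat list of cells into tuples of `width`, dropping a partial tail."""
    usable = len(cells) - len(cells) % width
    return [tuple(cells[i:i + width]) for i in range(0, usable, width)]


# A single <tr> holding two records, like the Dhamrai rows in the page source.
sample_html = """
<table class="wikitable"><tbody>
<tr><td>Dhaka
</td><td>Dhamrai
</td><td>Dhamrai
</td><td style="text-align: center;">1350
</td>
<td>Dhaka
</td><td>Dhamrai
</td><td>Kalampur
</td><td style="text-align: center;">1351
</td></tr>
</tbody></table>
"""

collector = CellCollector()
collector.feed(sample_html)
records = chunk_rows(collector.cells)
print(records)
```

Because the grouping depends only on cell order, it does not need the hard-coded list length used in cell 16.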

In [19]:
df.head()

| | District | Thana | Suboffice | PostCode |
| --- | --- | --- | --- | --- |
| 0 | Dhaka | Dhaka | Dhaka Cantonment--TSO | 1206 |
| 1 | Dhaka | Dhamrai | Dhamrai | 1350 |
| 2 | Dhaka | Dhamrai | Kalampur | 1351 |
| 3 | Dhaka | Dhanmondi | Jigatala TSO | 1209 |
| 4 | Dhaka | Gulshan | Banani TSO | 1213 |


In [20]:
df.tail()

| | District | Thana | Suboffice | PostCode |
| --- | --- | --- | --- | --- |
| 294 | Tangail | Sakhipur | Sakhipur | 1950 |
| 295 | Tangail | Tangail Sadar | Kagmari | 1901 |
| 296 | Tangail | Tangail Sadar | Korotia | 1903 |
| 297 | Tangail | Tangail Sadar | Purabari | 1904 |
| 298 | Tangail | Tangail Sadar | Santosh | 1902 |


In [21]:
!pip install geocoder
from geopy.geocoders import Nominatim 
import geocoder # import geocoder

Collecting geocoder
  Downloading https://files.pythonhosted.org/packages/4f/6b/13166c909ad2f2d76b929a4227c952630ebaf0d729f6317eb09cbceccbab/geocoder-1.38.1-py2.py3-none-any.whl (98kB)
Collecting ratelim (from geocoder)
  Downloading https://files.pythonhosted.org/packages/f2/98/7e6d147fd16a10a5f821db6e25f192265d6ecca3d82957a4fdd592cad49c/ratelim-0.1.6-py2.py3-none-any.whl
Installing collected packages: ratelim, geocoder
Successfully installed geocoder-1.38.1 ratelim-0.1.6


### Part 02: Adding the Latitude and Longitude of the Neighborhoods to the dataframe

In [22]:
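from geopy.geocoders import Nominatim
# Nominatim expects an identifying user_agent; omitting it triggers a warning
geolocator = Nominatim(user_agent="dhk_explorer")
city = "Dhaka"
country = "Bangladesh"
lat = []
long = []
for thana in df['Thana']:
    try:
        loc = geolocator.geocode(thana + ',' + city + ',' + country)
        lat.append(loc.latitude)
        long.append(loc.longitude)
    except Exception:
        # geocode() returns None when nothing is found, so loc.latitude raises;
        # lookup errors end up here as well. Record the thana as missing.
        lat.append(None)
        long.append(None)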
# print("****")
# print(lat)
# print("")
# print(long)
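
Many rows share the same thana, so the loop above sends the same query repeatedly. The sketch below caches results per thana so that each distinct name is looked up once. `geocode_unique` and `fake_geocode` are hypothetical helpers, not part of the notebook; the fake function stands in for `geolocator.geocode` so the example runs offline, and its single entry reuses the Gulshan coordinates from this notebook's output.

```python
from types import SimpleNamespace


def geocode_unique(names, geocode_fn, suffix=",Dhaka,Bangladesh"):
    """Return parallel latitude/longitude lists, querying each distinct name once."""
    cache = {}
    lats, lngs = [], []
    for name in names:
        if name not in cache:
            try:
                loc = geocode_fn(name + suffix)
            except Exception:
                loc = None  # treat lookup errors like "not found"
            if loc is not None:
                cache[name] = (loc.latitude, loc.longitude)
            else:
                cache[name] = (None, None)
        lat, lng = cache[name]
        lats.append(lat)
        lngs.append(lng)
    return lats, lngs


# Offline stand-in for geolocator.geocode (illustrative data only).
calls = []


def fake_geocode(query):
    calls.append(query)
    known = {
        "Gulshan,Dhaka,Bangladesh": SimpleNamespace(latitude=23.789987,
                                                    longitude=90.411627),
    }
    return known.get(query)


lats, lngs = geocode_unique(["Gulshan", "Gulshan", "Nowhere"], fake_geocode)
print(lats, lngs, len(calls))
```

With geopy, `geolocator.geocode` could be passed as `geocode_fn`. geopy also ships `geopy.extra.rate_limiter.RateLimiter`, which can wrap the geocode function to space out requests.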



In [23]:
df['Latitude'] = lat
df['Longitude'] = long

In [24]:
df.dropna(inplace=True)

In [25]:
df_not_dhaka = df[df['District']!='Dhaka']
df.drop(df_not_dhaka.index, inplace=True)
df

| | District | Thana | Suboffice | PostCode | Latitude | Longitude |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | Dhaka | Dhaka | Dhaka Cantonment--TSO | 1206 | 23.759357 | 90.378814 |
| 1 | Dhaka | Dhamrai | Dhamrai | 1350 | 23.920162 | 90.21087 |
| 2 | Dhaka | Dhamrai | Kalampur | 1351 | 23.920162 | 90.21087 |
| 3 | Dhaka | Dhanmondi | Jigatala TSO | 1209 | 23.759357 | 90.378814 |
| 4 | Dhaka | Gulshan | Banani TSO | 1213 | 23.789987 | 90.411627 |
| 5 | Dhaka | Gulshan | Badda | 1212 | 23.789987 | 90.411627 |
| 6 | Dhaka | Gulshan | Gulshan Model Town | 1212 | 23.789987 | 90.411627 |
| 7 | Dhaka | Jatrabari | Dhania TSO | 1236 | 23.710423 | 90.434467 |
| 8 | Dhaka | Joypara | Joypara | 1331 | 23.607599 | 90.124962 |
| 9 | Dhaka | Joypara | Narisha | 1332 | 23.607599 | 90.124962 |


### Part 03: Using Geopy and Folium library to generate and explore the areas (Thanas) of Dhaka

In [26]:
#Using Geopy library to get Latitude and Longitude of Dhaka
address = 'Dhaka'

geolocator = Nominatim(user_agent="dhk_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Dhaka are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Dhaka are 23.7593572, 90.3788136.


In [27]:
# Creating a map of Neighborhoods (Thanas) using latitude and longitude values in Dhaka
map_dhaka = folium.Map(location=[latitude, longitude], zoom_start=10)

# Adding markers to the map
for lat, lng, thana, suboffice in zip(df['Latitude'], df['Longitude'], df['Thana'], df['Suboffice']):
    label = '{}, {}'.format(suboffice, thana)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_dhaka)  
    
map_dhaka

In [28]:
df = df.groupby(['District','Thana','Latitude','Longitude'])['Suboffice'].apply(', '.join).reset_index()
df

| | District | Thana | Latitude | Longitude | Suboffice |
| --- | --- | --- | --- | --- | --- |
| 0 | Dhaka | Dhaka | 23.759357 | 90.378814 | Dhaka Cantonment--TSO |
| 1 | Dhaka | Dhamrai | 23.920162 | 90.21087 | Dhamrai, Kalampur |
| 2 | Dhaka | Dhanmondi | 23.759357 | 90.378814 | Jigatala TSO |
| 3 | Dhaka | Gulshan | 23.789987 | 90.411627 | Banani TSO, Badda, Gulshan Model Town |
| 4 | Dhaka | Jatrabari | 23.710423 | 90.434467 | Dhania TSO |
| 5 | Dhaka | Joypara | 23.607599 | 90.124962 | Joypara, Narisha, Palamganj |
| 6 | Dhaka | Keraniganj | 23.698189 | 90.350526 | Ati, Dhaka Jute Mills, Kalatia, Keraniganj |
| 7 | Dhaka | Khilgaon | 23.749702 | 90.417566 | KhilgaonTSO |
| 8 | Dhaka | Khilkhet | 23.830698 | 90.423599 | KhilkhetTSO |
| 9 | Dhaka | Lalbag | 23.718856 | 90.38878 | Posta TSO |
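
In the output above, the Dhaka and Dhanmondi thanas carry exactly the coordinates that cell 26 printed for the city itself (23.7593572, 90.3788136). This suggests, though it does not prove, that the geocoder fell back to the city centre for Dhanmondi. The sketch below shows one way such rows might be flagged before venues are queried; `flag_center_fallbacks` and the tolerance of 1e-4 degrees are assumptions made for illustration.

```python
def flag_center_fallbacks(rows, center, tol=1e-4):
    """Return thana names whose coordinates lie within `tol` degrees of `center`."""
    center_lat, center_lng = center
    return [
        name
        for name, lat, lng in rows
        if abs(lat - center_lat) <= tol and abs(lng - center_lng) <= tol
    ]


# (thana, latitude, longitude) taken from the grouped dataframe output.
rows = [
    ("Dhaka", 23.759357, 90.378814),
    ("Dhamrai", 23.920162, 90.21087),
    ("Dhanmondi", 23.759357, 90.378814),
    ("Gulshan", 23.789987, 90.411627),
]
dhaka_center = (23.7593572, 90.3788136)

suspects = flag_center_fallbacks(rows, dhaka_center)
print(suspects)
```

Thanas flagged this way would return the same venues as the city centre, so they may be worth geocoding again with a more specific query.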


## Methodology <a name="methodology"></a>

<!-- As the business problem revolves around opening a coffee shop in a neighborhood in city of Vaughan in Canada, at first step the relevant **boroughs** are selected. The boroughs are: **North York, East York and York**. 

In the second step, **all the neighborhoods** that resides in the boroughs selected have been figured out. After that, using the **foursquare API**, the **venues** that are residing in those neighborhoods are found out.

In the next step, **filtering** of the neighborhoods have been done based on the criteria on the absence of coffee shops. This results in the neighborhoods in those boroughs that does not have any coffee shops in them.

Finally, a **clustering technique (k-means clustering)**  was used to find the clusters of similar neighborhoods. The clustering gives the necessary insight that is needed to find a place where if the coffee shop is established would result in **higher profit and customer satisfaction** for the owner.  -->

### Part 04: Using Foursquare API

In [29]:
#Defining Foursquare API Client ID, secret key, and version
# Replace the placeholders with your own credentials; never publish or print real keys
CLIENT_ID = 'YOUR_FOURSQUARE_CLIENT_ID'
CLIENT_SECRET = 'YOUR_FOURSQUARE_CLIENT_SECRET'
VERSION = '20180605'


In [30]:
import json
from pandas.io.json import json_normalize 

import matplotlib.cm as cm
import matplotlib.colors as colors

In [31]:
# Get data of first neighborhood and use Foursquare API to get some insight of the venues of the neighborhood
thana_latitude = df['Latitude'][0] 
thana_longitude = df['Longitude'][0] 

thana_name = df['Thana'][0] 

print('Latitude and longitude values of {} are {}, {}.'.format(thana_name, 
                                                               thana_latitude, 
                                                               thana_longitude))

Latitude and longitude values of Dhaka are 23.7593572, 90.3788136.


In [33]:
# Set up the API URL to explore venues near the first thana
LIMIT = 150
radius = 500
url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&ll={},{}&v={}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, thana_latitude, thana_longitude, VERSION, radius, LIMIT)
neighborhood_json = requests.get(url).json()

# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']
    
venues = neighborhood_json['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues =nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()

| | name | categories | lat | lng |
| --- | --- | --- | --- | --- |
| 0 | Aarong | Arts & Crafts Store | 23.758283 | 90.374102 |
| 1 | Seventh Heaven | Café | 23.758298 | 90.374111 |
| 2 | Khamar Bari Mor | Plaza | 23.759292 | 90.383501 |
| 3 | Labanga | Restaurant | 23.757643 | 90.374806 |
| 4 | Fuwang Shwarma | Food | 23.758093 | 90.374557 |
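
The request URL in cell 33 is assembled with one long format string. As a hedged alternative, the query string can be built from a dictionary so that each parameter is named and URL-encoded automatically. `build_explore_url` is a hypothetical helper; the endpoint and parameter names are the ones used in the cell above, and the credentials here are dummy strings.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

EXPLORE_ENDPOINT = "https://api.foursquare.com/v2/venues/explore"


def build_explore_url(client_id, client_secret, lat, lng,
                      version="20180605", radius=500, limit=150):
    """Build the explore URL with the same parameters as the notebook cell."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lng),
        "radius": radius,
        "limit": limit,
    }
    return EXPLORE_ENDPOINT + "?" + urlencode(params)


url = build_explore_url("ID", "SECRET", 23.7593572, 90.3788136)

# Decode the query string again to confirm the parameters survive encoding.
query = parse_qs(urlsplit(url).query)
print(query["ll"], query["radius"], query["limit"])
```

When using `requests`, passing the dictionary through the `params` argument of `requests.get` performs the same encoding.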


#### Getting nearby venues of the neighborhoods using Foursquare API

In [40]:
#Function to get nearby venues
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    LIMIT = 150  # maximum number of venues requested per thana
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
          
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Thana', 
                  'Thana Latitude', 
                  'Thana Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
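
`getNearbyVenues` indexes straight into the response, so a reply without `groups`, or a venue with an empty `categories` list, would raise an exception part-way through the loop. The sketch below separates the parsing step and makes it tolerant of both cases. `extract_venue_rows` is a hypothetical helper, the payload shape is inferred from the keys accessed in the function above, and the second venue in the sample is invented to exercise the empty-category branch.

```python
def extract_venue_rows(thana, lat, lng, payload):
    """Turn one explore response (already parsed from JSON) into row tuples."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        # Use None when a venue has no category instead of raising IndexError.
        category = categories[0]["name"] if categories else None
        rows.append((
            thana, lat, lng,
            venue["name"],
            venue["location"]["lat"],
            venue["location"]["lng"],
            category,
        ))
    return rows


sample_payload = {
    "response": {
        "groups": [{
            "items": [
                {"venue": {"name": "Aarong",
                           "location": {"lat": 23.758283, "lng": 90.374102},
                           "categories": [{"name": "Arts & Crafts Store"}]}},
                {"venue": {"name": "Uncategorised place",  # illustrative only
                           "location": {"lat": 23.7590, "lng": 90.3790},
                           "categories": []}},
            ]
        }]
    }
}

venue_rows = extract_venue_rows("Dhaka", 23.759357, 90.378814, sample_payload)
empty_rows = extract_venue_rows("Joypara", 23.607599, 90.124962, {"response": {}})
print(len(venue_rows), len(empty_rows))
```

Keeping the parsing separate from the HTTP call also lets it be checked against saved responses without using up API quota.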

In [41]:
#Generating the venues of Dhaka and printing the thanas
print("Thanas in Dhaka:")
dhaka_venues = getNearbyVenues(names=df['Thana'],
                                   latitudes=df['Latitude'],
                                   longitudes=df['Longitude']
                                  )

Thanas in Dhaka:
Dhaka
Dhamrai
Dhanmondi
Gulshan
Jatrabari
Joypara
Keraniganj
Khilgaon
Khilkhet
Lalbag
Mirpur
Mohammadpur
Motijheel
Nawabganj
Ramna
Sabujbag
Savar
Sutrapur
Tejgaon
Tejgaon Industrial Area
Uttara


In [42]:
dhaka_venues.head(10)

| | Thana | Thana Latitude | Thana Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Dhaka | 23.759357 | 90.378814 | Aarong | 23.758283 | 90.374102 | Arts & Crafts Store |
| 1 | Dhaka | 23.759357 | 90.378814 | Seventh Heaven | 23.758298 | 90.374111 | Café |
| 2 | Dhaka | 23.759357 | 90.378814 | Khamar Bari Mor | 23.759292 | 90.383501 | Plaza |
| 3 | Dhaka | 23.759357 | 90.378814 | Labanga | 23.757643 | 90.374806 | Restaurant |
| 4 | Dhaka | 23.759357 | 90.378814 | Fuwang Shwarma | 23.758093 | 90.374557 | Food |
| 5 | Dhaka | 23.759357 | 90.378814 | Khowab | 23.758457 | 90.374153 | Women's Store |
| 6 | Dhamrai | 23.920162 | 90.21087 | Dhamrai Bazar | 23.919938 | 90.211445 | Market |
| 7 | Dhanmondi | 23.759357 | 90.378814 | Aarong | 23.758283 | 90.374102 | Arts & Crafts Store |
| 8 | Dhanmondi | 23.759357 | 90.378814 | Seventh Heaven | 23.758298 | 90.374111 | Café |
| 9 | Dhanmondi | 23.759357 | 90.378814 | Khamar Bari Mor | 23.759292 | 90.383501 | Plaza |


### Part 05: Exploring all of the neighborhoods of Dhaka City

#### Finding out how many unique categories can be curated from all the returned venues

In [43]:
print('There are {} unique categories.'.format(len(dhaka_venues['Venue Category'].unique())))

There are 50 unique categories.


#### Checking the size of the resulting dataframe

In [44]:
dhaka_venues.shape

(102, 7)

#### Checking how many venues were returned for each neighborhood

In [46]:
dhaka_venues.groupby('Thana').count()

| Thana | Thana Latitude | Thana Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category |
| --- | --- | --- | --- | --- | --- | --- |
| Dhaka | 6 | 6 | 6 | 6 | 6 | 6 |
| Dhamrai | 1 | 1 | 1 | 1 | 1 | 1 |
| Dhanmondi | 6 | 6 | 6 | 6 | 6 | 6 |
| Gulshan | 38 | 38 | 38 | 38 | 38 | 38 |
| Jatrabari | 1 | 1 | 1 | 1 | 1 | 1 |
| Khilgaon | 6 | 6 | 6 | 6 | 6 | 6 |
| Khilkhet | 5 | 5 | 5 | 5 | 5 | 5 |
| Lalbag | 3 | 3 | 3 | 3 | 3 | 3 |
| Mirpur | 1 | 1 | 1 | 1 | 1 | 1 |
| Mohammadpur | 5 | 5 | 5 | 5 | 5 | 5 |

In [49]:
#checking out the list of different venue categories
venue_category = list(dhaka_venues['Venue Category'].unique())
difrnt_venue_categories = pd.DataFrame(venue_category, columns=['Venue Category'])
difrnt_venue_categories

| | Venue Category |
| --- | --- |
| 0 | Arts & Crafts Store |
| 1 | Café |
| 2 | Plaza |
| 3 | Restaurant |
| 4 | Food |
| 5 | Women's Store |
| 6 | Market |
| 7 | Clothing Store |
| 8 | Hotel |
| 9 | Coffee Shop |


#### Insights: 
We can see that there are 50 different venue categories across the thanas of Dhaka city. Now let's find out how many thanas do not have any Japanese restaurant.
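
One way to answer that question is a set difference between all queried thanas and the thanas that have at least one venue in the category. The sketch below is illustrative only: `thanas_without_category` is a hypothetical helper, the sample rows are a small subset of the venue output shown earlier, and the example filters on "Café" because the exact Foursquare label for Japanese restaurants does not appear in the output shown here.

```python
def thanas_without_category(all_thanas, venue_rows, category):
    """List thanas, in their original order, that have no venue in `category`."""
    having = {thana for thana, venue_category in venue_rows
              if venue_category == category}
    return [thana for thana in all_thanas if thana not in having]


# Thanas that were queried, including one that returned no venues at all.
all_thanas = ["Dhaka", "Dhamrai", "Dhanmondi", "Joypara"]

# (thana, venue category) pairs from the venue dataframe.
venue_rows = [
    ("Dhaka", "Café"),
    ("Dhaka", "Restaurant"),
    ("Dhamrai", "Market"),
    ("Dhanmondi", "Café"),
]

missing = thanas_without_category(all_thanas, venue_rows, "Café")
print(missing)
```

Starting from the full list of thanas matters here: thanas that returned no venues are absent from the venue dataframe and would otherwise be silently left out of the answer.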

#### Finding out the thanas t

## Analysis <a name="analysis"></a>

#### Grouping rows by neighborhood and taking the mean of the frequency of occurrence of each category
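
The cell for this step is not included here, so the following is only a sketch of what the heading describes: for each thana, the share of its venues that falls into each category. It uses plain Python so that it runs without the notebook's data; `category_frequencies` is a hypothetical name and the sample rows are illustrative. In pandas, the same result is commonly obtained by one-hot encoding the category column with `pd.get_dummies` and then calling `groupby('Thana').mean()`.

```python
from collections import Counter, defaultdict


def category_frequencies(venue_rows):
    """Map each thana to {category: share of that thana's venues}."""
    counts = defaultdict(Counter)
    for thana, category in venue_rows:
        counts[thana][category] += 1

    categories = sorted({category for _, category in venue_rows})
    table = {}
    for thana, counter in counts.items():
        total = sum(counter.values())
        # Categories absent from a thana get an explicit share of 0.0.
        table[thana] = {c: counter[c] / total for c in categories}
    return table


venue_rows = [
    ("Dhaka", "Café"),
    ("Dhaka", "Restaurant"),
    ("Dhaka", "Café"),
    ("Dhaka", "Plaza"),
    ("Dhamrai", "Market"),
]

frequencies = category_frequencies(venue_rows)
print(frequencies["Dhaka"])
print(frequencies["Dhamrai"])
```

Each row of the resulting table sums to 1, which makes thanas with very different venue counts comparable.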

#### Finding each neighborhood along with the top 5 most common venues

#### Sorting the venues in descending order
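
As a rough illustration of these two steps, the sketch below ranks one thana's category shares in descending order and keeps the top entries. `top_categories` is a hypothetical helper and the frequency row is made-up sample data; ties are broken alphabetically and categories with a share of zero are dropped, both of which are assumptions rather than choices taken from the notebook.

```python
def top_categories(frequency_row, n=5):
    """Return up to `n` category names, highest share first."""
    ranked = sorted(frequency_row.items(), key=lambda kv: (-kv[1], kv[0]))
    return [name for name, share in ranked[:n] if share > 0]


dhaka_row = {"Café": 0.5, "Plaza": 0.25, "Restaurant": 0.25, "Market": 0.0}

top_three = top_categories(dhaka_row, n=3)
top_five = top_categories(dhaka_row)
print(top_three)
print(top_five)
```

A thana with fewer than five categories simply yields a shorter list, which is the case for most thanas in the venue counts shown earlier.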

### Part 07: Clustering the Neighborhoods

#### Running K-means clustering algorithm to cluster the neighborhoods
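
The notebook imports `KMeans` from scikit-learn for this step, but the cell itself is not shown. To illustrate what the algorithm does with the per-thana frequency vectors, here is a dependency-free sketch of Lloyd's algorithm. It is not the scikit-learn implementation: it initialises the centroids with the first `k` points and runs a fixed number of iterations, both simplifying assumptions, and the four sample vectors are invented.

```python
def kmeans(points, k, iterations=20):
    """Cluster `points` (lists of floats) into `k` groups with Lloyd's algorithm."""
    centroids = [list(p) for p in points[:k]]  # simple deterministic start
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        for i, point in enumerate(points):
            distances = [
                sum((a - b) ** 2 for a, b in zip(point, centroid))
                for centroid in centroids
            ]
            labels[i] = distances.index(min(distances))
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Two-category frequency vectors for four imaginary thanas.
points = [
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
    [0.1, 0.9],
]

labels, centroids = kmeans(points, k=2)
print(labels)
```

With scikit-learn, the equivalent call would be along the lines of `KMeans(n_clusters=k, random_state=0).fit(X).labels_`, which uses a more careful initialisation than this sketch.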

#### Creating a map for the clusters

### Part 08: Examining the clusters

## Results and Discussion <a name="results"></a>

<!-- So the cluster analysis results in 5 clusters of neighborhoods present in the boroughs of: North York, East York and York. To select the neighborhoods that would be perfect for opening a coffee shop two neighborhoods clusters have been selected, namely **cluster 0** and **cluster 3**. 

In cluster 0, the neighborhoods present are: 
'Parkview Hill / Woodbine Gardens', 'Glencairn', 'Woodbine Heights', 'Hillcrest Village', 'Bayview Village', 'Downsview', 'North Park / Maple Leaf Park / Upwood Park', 'Runnymede / The Junction North'.

In cluster 3, the neighborhoods present are:
'Parkwoods', 'Caledonia-Fairbanks', 'Weston', 'York Mills West'.

Although they fall in the same cluster, the distance between neighborhoods in cluster 3 is much greater than the neighborhoods in cluster 0.
So neighborhoods in cluster 0 would be a good choice for a potential neighborhood to open a coffee shop based on business perspective. Remember, the data that have been worked on, consists only of the neighborhoods that does not have any coffee shops in them. From the map analysis of the clusters it is found that the **Downsview** neighborhood might be the best choice in cluster 0.    -->

## Conclusion <a name="conclusion"></a>

<!-- Although the dataset consists of neighborhood data of every city in Canada and the foursquare API has been used to find out all the venues residing in those neighborhoods, but lack of population data, population density data in the neighborhoods certainly limit the capability to get a proper analysis of the business potential of each neighborhood. But, based on the current data, it can be said that, **Downsview** is a good choice to open a coffee shop in the city of Vaughan.  -->