# Capstone Project - The Battle of the Neighborhoods
### Applied Data Science Capstone by IBM/Coursera

### Table of Contents

1. [Introduction - Business Problem](#introduction)
2. [Importing Libraries and Initial Setup](#setup)
3. [Data Collection, Exploration and Preprocessing](#data)
4. [Methodology](#method)
5. [Modelling and Analysis](#analysis)
6. [Results and Discussion](#result)
7. [Conclusion](#conclusion)


## 1. Introduction - Business Problem <a name="introduction"></a>

In this project, we will undertake the task of **identifying the best location to open a new Pizzeria in Bangalore, India**.

Bangalore, officially named Bengaluru, is the capital and largest city of the Indian state of Karnataka. It has a population of more than 8 million, making it the third most populous city in India. Spread over an area of ~8000 square kilometers, the city has the highest elevation of all major Indian cities. At over 900 meters above sea level, Bangalore is known for its pleasant climate throughout the year. Often referred to as the Silicon Valley of India, it is also the second-fastest-growing major metropolis in the country.

The rapid growth of the city provides a great number of lucrative business opportunities. While the city already has a vast number of restaurants spread across different localities and neighbourhoods, many new neighbourhoods are coming up as the city limits expand. At the same time, new residential and commercial development projects are under way across various locations.


## 2. Importing Libraries and Initial Setup <a name="setup"></a>

In [72]:
import pandas as pd
import numpy as np
import requests
import re
from bs4 import BeautifulSoup
import pgeocode
from geopy.geocoders import Nominatim
import matplotlib.cm as cm
import matplotlib.colors as colors
import folium
from pandas import json_normalize
from sklearn.cluster import KMeans

import os
# Read the API key from an environment variable instead of hard-coding it in the notebook
GOOGLE_API_KEY = os.environ.get("GOOGLE_API_KEY", "")
BANGALORE_NEIGHBOURHOODS = "bangalore_neighbourhoods.csv"

## 3. Data Collection, Exploration and Preprocessing <a name="data"></a>

In [73]:
def readDataframeFromHTML(htmlTable):
    """Convert a BeautifulSoup table element into a dataframe, using the first row as the header."""
    dataRows = []
    for tr in htmlTable.find_all("tr"):
        # Collect header (th) and data (td) cells, dropping embedded newlines
        drow = [cell.text.replace("\n", "") for cell in tr.find_all(["th", "td"])]
        if drow:
            dataRows.append(drow)

    # The first non-empty row holds the column names
    df = pd.DataFrame(dataRows[1:], columns=dataRows[0])
    return df
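
To make the parsing step easier to check without hitting the live page, here is a minimal sketch that runs the same row-by-row logic on an inline table. The sample markup and the helper name `table_to_dataframe` are hypothetical, and the helper uses `get_text(strip=True)` rather than the newline replacement used above, so treat it as an illustration rather than the notebook's exact function.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical sample markup; the layout of the real page may differ.
SAMPLE_HTML = """
<table class="plist">
  <tr><th>Post Office</th><th>District</th><th>Pincode</th></tr>
  <tr><td>Adugodi S.O</td><td>Bangalore</td><td>560030</td></tr>
  <tr><td>Agram S.O</td><td>Bangalore</td><td>560007</td></tr>
</table>
"""


def table_to_dataframe(html_table):
    """Hypothetical compact variant of readDataframeFromHTML for this demo."""
    rows = []
    for tr in html_table.find_all("tr"):
        # Header and data cells are treated alike; the first row becomes the header
        cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        if cells:
            rows.append(cells)
    return pd.DataFrame(rows[1:], columns=rows[0])


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
sample_df = table_to_dataframe(soup.find("table", attrs={"class": "plist"}))
print(sample_df.shape)
print(sample_df)
```

Note that every cell comes back as a string, which is why the pin codes need an explicit conversion to `int64` later on.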


In [74]:
# Read the html table into a dataframe from the given url using BeautifulSoup
pincodesURL = "https://finkode.com/ka/bangalore.html"
htmlPage = requests.get(pincodesURL)
soup = BeautifulSoup(htmlPage.text, "html.parser")
htmlTable = soup.find("table", attrs={"class":"plist"})
df = readDataframeFromHTML(htmlTable)
# Print the shape and first 5 rows of the raw dataframe
print(df.shape)
df.head()

(270, 3)


| | Post Office | District | Pincode |
|---|---|---|---|
| 0 | A F Station Yelahanka S.O | Bangalore | 560063 |
| 1 | Adugodi S.O | Bangalore | 560030 |
| 2 | Agara B.O | Bangalore | 560034 |
| 3 | Agram S.O | Bangalore | 560007 |
| 4 | Amruthahalli B.O | Bangalore | 560092 |


In [75]:
# Convert Pincodes to int64
df["Pincode"] = df["Pincode"].astype("int64")
# Remove S.O and B.O from post office names
df["Post Office"] = df["Post Office"].str.replace("S.O", "", regex=False).str.replace("B.O", "", regex=False).str.strip()
# Drop column District as it doesn't contain any relevant information
df.drop(columns=["District"], errors="ignore", inplace=True)
# Combine rows that share a pin code into a single row with comma-separated post office names
df["Post Office"] = df.groupby(by="Pincode")["Post Office"].transform(lambda names: ",".join(names))
df.drop_duplicates(inplace=True)
# Print the shape and first 5 rows of the clean dataframe
df.reset_index(drop=True, inplace=True)
print(df.shape)
df.head()

(104, 2)


| | Post Office | Pincode |
|---|---|---|
| 0 | A F Station Yelahanka,BSF Campus Yelahanka | 560063 |
| 1 | Adugodi | 560030 |
| 2 | Agara,Koramangala I Block,Koramangala,St. John... | 560034 |
| 3 | Agram | 560007 |
| 4 | Amruthahalli,Byatarayanapura,Kodigehalli,Sahak... | 560092 |
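
The merge of post offices that share a pin code can be illustrated on a toy dataframe. The sketch below uses `groupby(...).agg(",".join)`, which is an alternative to the transform-and-drop-duplicates approach in the cell above, not the notebook's own code. The three rows are a made-up subset chosen for illustration.

```python
import pandas as pd

# Toy data: two post offices share pin code 560034 (illustrative subset only)
toy = pd.DataFrame({
    "Post Office": ["Agara", "Koramangala", "Adugodi"],
    "Pincode": [560034, 560034, 560030],
})

# One row per pin code, with post office names joined by commas.
# sort=False keeps pin codes in order of first appearance.
merged = (
    toy.groupby("Pincode", sort=False)["Post Office"]
    .agg(",".join)
    .reset_index()[["Post Office", "Pincode"]]
)
print(merged)
```

Either approach should yield one row per pin code; the aggregation form simply avoids the intermediate duplicate rows.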


In [83]:
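def getLatLngForPincode(pinCode):
    """Look up the latitude and longitude of an Indian pin code with the Google Geocoding API."""
    coords = {"Pincode": pinCode, "Latitude": None, "Longitude": None}
    base_url = "https://maps.googleapis.com/maps/api/geocode/json"
    # Let requests build and URL-encode the query string
    params = {"address": f"{pinCode},IN", "key": GOOGLE_API_KEY}
    # The timeout avoids hanging indefinitely on a stalled connection
    r = requests.get(base_url, params=params, timeout=10)
    # Treat anything outside 2xx as a failed lookup (the range end is exclusive, hence 300)
    if r.status_code not in range(200, 300):
        return coords
    try:
        location = r.json()["results"][0]["geometry"]["location"]
        coords["Latitude"] = location["lat"]
        coords["Longitude"] = location["lng"]
    except (ValueError, KeyError, IndexError):
        # No usable result in the response: leave the coordinates as None
        pass
    return coords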
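
Because the lookup mixes a network call with response parsing, it can help to test the parsing separately. The sketch below is a hypothetical helper, `extract_lat_lng`, that reads the same `results[0].geometry.location` fields from an already-decoded payload. The sample payloads are made up, and the `status` check assumes the response carries a `status` field with values such as `OK` and `ZERO_RESULTS`.

```python
def extract_lat_lng(payload):
    """Return (lat, lng) from a geocoding-style payload, or (None, None) if absent.

    Hypothetical helper for illustration; it is not part of the notebook.
    """
    results = payload.get("results") or []
    # Assumption: a successful response reports status "OK"
    if payload.get("status") != "OK" or not results:
        return (None, None)
    location = results[0].get("geometry", {}).get("location", {})
    return (location.get("lat"), location.get("lng"))


# Made-up payloads shaped like the fields the lookup function reads
found = {
    "status": "OK",
    "results": [{"geometry": {"location": {"lat": 12.94415, "lng": 77.607623}}}],
}
empty = {"status": "ZERO_RESULTS", "results": []}

print(extract_lat_lng(found))
print(extract_lat_lng(empty))
```

Keeping the parsing in a pure function like this means it can be exercised without an API key or network access.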

In [86]:
# Reuse the saved csv file if it exists so that we don't call the Google Maps API again
try:
    newDF = pd.read_csv(filepath_or_buffer=BANGALORE_NEIGHBOURHOODS)
except FileNotFoundError:
    # Get coordinates for each pincode
    newDF = df.copy(deep=True)
    allCoords = newDF["Pincode"].map(getLatLngForPincode)
    coordsDF = pd.DataFrame(allCoords.to_list())
    coordsDF["Pincode"] = coordsDF["Pincode"].astype("int64")
    # Combine the post office and coordinates into a new dataframe
    newDF = newDF.join(coordsDF.set_index("Pincode"), how="left", on="Pincode")
    newDF.to_csv(path_or_buf=BANGALORE_NEIGHBOURHOODS, index=False)

print(newDF.shape)
newDF.head()


(104, 4)


| | Post Office | Pincode | Latitude | Longitude |
|---|---|---|---|---|
| 0 | A F Station Yelahanka,BSF Campus Yelahanka | 560063 | 13.129087 | 77.614226 |
| 1 | Adugodi | 560030 | 12.94415 | 77.607623 |
| 2 | Agara,Koramangala I Block,Koramangala,St. John... | 560034 | 12.926138 | 77.622109 |
| 3 | Agram | 560007 | 12.957917 | 77.630912 |
| 4 | Amruthahalli,Byatarayanapura,Kodigehalli,Sahak... | 560092 | 13.064104 | 77.593121 |
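
The read-from-disk-or-compute pattern used above can be factored into a small helper. This is a minimal sketch under assumptions: `load_or_build` and `fake_geocode` are hypothetical names, and the builder returns a one-row stand-in instead of calling any geocoding service.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_or_build(csv_path, build):
    """Read csv_path if it exists; otherwise call build() and save its result."""
    csv_path = Path(csv_path)
    if csv_path.exists():
        return pd.read_csv(csv_path)
    frame = build()
    frame.to_csv(csv_path, index=False)
    return frame


calls = []


def fake_geocode():
    # Stand-in for the expensive geocoding step; records each invocation
    calls.append(1)
    return pd.DataFrame(
        {"Pincode": [560030], "Latitude": [12.94415], "Longitude": [77.607623]}
    )


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "neighbourhoods.csv"
    first = load_or_build(cache, fake_geocode)   # builds and writes the file
    second = load_or_build(cache, fake_geocode)  # reads the saved file instead

print(len(calls))
```

Checking for the file explicitly, rather than catching every exception, means a malformed cache file surfaces as an error instead of silently triggering a fresh round of API calls.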


In [88]:
# Check if the dataframe has any missing coordinates
invalidCoords = newDF[newDF["Latitude"].isna() | newDF["Longitude"].isna()]
print("Coordinates not found for {} neighbourhoods.".format(invalidCoords.shape[0]))


Coordinates not found for 0 neighbourhoods.


## 4. Methodology <a name="method"></a>