<a href="https://colab.research.google.com/github/sreeduttsasikumar/IBM_Coursera_Capstone_Project/blob/main/Capstone_Project_Battle_of_Neighbourhoods.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# <center>Capstone Project - The Battle of the Neighborhoods</center>
# <center> Analyzing major cities in Kerala, India for choosing the site for a Residential Project </center>

## Table of contents
* [Introduction: Business Problem](#introduction)
* [Data](#data)
* [Methodology](#methodology)
* [Analysis](#analysis)
* [Results and Discussion](#results)
* [Conclusion](#conclusion)

## Introduction: Business Problem <a name="introduction"></a>

A major real estate investor in India, based out of Mumbai, **wants to construct a 5-star Residential Complex in a major city in Kerala**. To that end, he wants to analyze the neighbourhoods of the cities and choose a neighbourhood/city that is suitable for the project.

He is planning a **water-facing villa theme**; the water body can be a beach, lake, river, backwater, etc. Also, the **locality should be suitable for a 5-star residential villa, with all necessary facilities and amenities nearby**.
The requirement also specifies that the location should not be around a city center or commercial area, as that would not be in line with the project's theme.

We will use data science to map these requirements against the available data and find one or more cities that are well suited for the project.

## Data <a name="data"></a>

Based on the definition of our problem, the factors that will influence our decision are:
* presence of a water body in the locality
* presence of household amenities such as grocery stores, schools, hospitals, etc.
* the locality should not be a trending/happening one, as that would not suit a residential area

We decided to use the Wikipedia page that lists the major Municipalities and Municipal Corporations in Kerala. Along with that, we will use another website that lists the population of many cities in Kerala. We can pull only the relevant information from there for our analysis.

We will extract the data from these tables and use those locations to find the best location as described in the requirements.

Using the downloaded city details, the corresponding latitudes and longitudes will be captured using a geocoder API. Then, using the Foursquare API, we will find the venues near those locations, which will form the basis of the analysis in this project.

This will be our dataset for the analysis.

Let's start the journey of data exploration, enrichment, cleaning, transformation, and so on.

The cell below consolidates all the library imports.

In [1]:
# pandas for dataframe functionality
import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
# convert an address into latitude and longitude values
from geopy.geocoders import Nominatim
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors
# map rendering library
import folium
# library to handle JSON files
import json
# library to handle HTTP requests
import requests
# library for numerical arrays
import numpy as np
# k-means for the clustering stage
from sklearn.cluster import KMeans
# to visualize the ideal K value for the K-means algorithm
from yellowbrick.cluster import KElbowVisualizer



### Downloading Data

In [2]:
# assign the Wikipedia URL to the variable url
url = 'https://en.wikipedia.org/wiki/List_of_cities_and_towns_in_Kerala'
# read every HTML table on the page into a list of dataframes
data_from_url = pd.read_html(url)
# get the second table, which is the relevant table for our project
kerala_city_df = data_from_url[1]
# drop columns not needed for analysis
kerala_city_df = kerala_city_df.drop(kerala_city_df.columns[[0, 2, 3, 4]], axis=1)
# rename columns to meaningful names
kerala_city_df.columns = ['City', 'Muncipality']
kerala_city_df.head()

Unnamed: 0,City,Muncipality
0,Thiruvananthapuram,Thiruvananthapuram
1,Kozhikode,Kozhikode
2,Kochi,Ernakulam
3,Kollam,Kollam
4,Thrissur,Thrissur


We identified another website that can provide additional cities, so we download data from there as well.

In [3]:
# assign the www.citypopulation.de URL to the variable url
url = 'https://www.citypopulation.de/en/india/kerala/'
# read every HTML table on the page into a list of dataframes
data_from_url = pd.read_html(url)
# keep only Name and District; copy so the assignments below do not act on a slice
addtn_data = data_from_url[1][['Name', 'District']].copy()
# strip parenthesised and bracketed annotations from the names
addtn_data['Name'] = addtn_data['Name'].str.replace(r" \(.*\)", "", regex=True)
addtn_data['Name'] = addtn_data['Name'].str.replace(r" \[.*\]", "", regex=True)
# rename the columns to City and Muncipality to match the initial dataset
addtn_data.columns = ['City', 'Muncipality']
# append the new dataset to the first one (DataFrame.append was removed in pandas 2.0)
kerala_city_df = pd.concat([kerala_city_df, addtn_data])
kerala_city_df.head()


Unnamed: 0,City,Muncipality
0,Thiruvananthapuram,Thiruvananthapuram
1,Kozhikode,Kozhikode
2,Kochi,Ernakulam
3,Kollam,Kollam
4,Thrissur,Thrissur
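
To make the name clean-up step concrete, here is a minimal sketch of what the two regular expressions in the cell above do, using Python's standard `re` module. The sample strings are made up for illustration (they are not rows from the scraped table), and the helper name `strip_annotations` is hypothetical.

```python
import re

def strip_annotations(name):
    """Remove a ' (...)' or ' [...]' annotation from a place name."""
    name = re.sub(r" \(.*\)", "", name)   # drop " (...)" annotations
    name = re.sub(r" \[.*\]", "", name)   # drop " [...]" annotations
    return name

# Made-up sample names, used only to show the effect of the patterns.
samples = ["Kochi (Cochin)", "Kollam [Quilon]", "Adoor"]
cleaned = [strip_annotations(s) for s in samples]
print(cleaned)  # ['Kochi', 'Kollam', 'Adoor']
```

Note that `.*` is greedy, so everything from the first opening bracket to the last closing bracket is removed.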


### Data Cleaning

Format the complete dataset to make it usable for analysis

In [4]:
# adjust a few names so that the geocoder resolves the address and fetches the coordinates properly
kerala_city_df.loc[kerala_city_df['City'] == 'Thiruvananthapuram', 'City'] = 'Trivandrum, TVM'
kerala_city_df.loc[kerala_city_df['Muncipality'] == 'Thiruvananthapuram', 'Muncipality'] = 'Trivandrum'
# drop the stray 'Municipalities' row
kerala_city_df.drop(index = kerala_city_df[kerala_city_df['City']=='Municipalities'].index[0], inplace= True)
# remove duplicate cities from the full dataset
kerala_city_df = kerala_city_df[~kerala_city_df.duplicated('City')].sort_values(ascending=True, by='City').reset_index(drop=True)
print("Total number of cities available for analysis as of now are {}.".format(kerala_city_df.shape[0]))
kerala_city_df.head()

Total number of cities available for analysis as of now are 540.


Unnamed: 0,City,Muncipality
0,Abdu Rahiman Nagar,Malappuram
1,Adat,Thrissur
2,Adichanalloor,Kollam
3,Adinad,Kollam
4,Adoor,Pathanamthitta


### Data Enrichment

#### Let's use a geocoder to fetch the latitude and longitude of each city and add them to the dataframe

Let's define the function below to convert a city name to its latitude and longitude

In [5]:
def lat_lon_from_address(cities):
  # create the geocoder once and reuse it for every city
  geolocator = Nominatim(user_agent="city_explorer")
  city_lat_long_temp = []
  for city in cities:
    location = geolocator.geocode(city)
    # keep the coordinates only when the address resolves inside Kerala
    if location is not None and 'Kerala' in location.address:
      city_lat_long_temp.append([city, location.latitude, location.longitude])
    else:
      # unresolved or non-Kerala results are marked 'NA' and filtered out later
      city_lat_long_temp.append([city, 'NA', 'NA'])
  return city_lat_long_temp
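
The filtering logic of this function can be exercised without any network access. The sketch below is an illustration under stated assumptions: `resolve_cities`, `fake_geocode`, the `Location` tuple and the coordinates are all hypothetical stand-ins, and the `geocode` argument is assumed to behave like the geocoder's `geocode` method (returning `None` or an object with `latitude`, `longitude` and `address`).

```python
import time
from collections import namedtuple

# Stand-in for the object returned by a geocoder; only the fields used here.
Location = namedtuple("Location", ["latitude", "longitude", "address"])

def resolve_cities(cities, geocode, region="Kerala", delay_seconds=0.0):
    """Return [city, lat, lon] rows; unresolved or out-of-region cities get None."""
    rows = []
    for city in cities:
        location = geocode(city)
        if location is not None and region in location.address:
            rows.append([city, location.latitude, location.longitude])
        else:
            rows.append([city, None, None])
        time.sleep(delay_seconds)  # pause between consecutive lookups
    return rows

# Fake geocoder with made-up coordinates, used only to exercise the logic offline.
FAKE_RESULTS = {
    "Town A": Location(9.0, 76.0, "Town A, Kerala, India"),
    "Town B": Location(12.0, 77.0, "Town B, Karnataka, India"),
}

def fake_geocode(query):
    return FAKE_RESULTS.get(query)

rows = resolve_cities(["Town A", "Town B", "Town C"], fake_geocode)
print(rows)
# [['Town A', 9.0, 76.0], ['Town B', None, None], ['Town C', None, None]]
```

With the real service, one would pass the geocoder's `geocode` method and a `delay_seconds` of about 1, since the public Nominatim usage policy asks clients to stay at or below one request per second.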

Call the function defined above, passing the city names as parameters, and store the resulting latitudes and longitudes in a new dataframe. This will then be merged with the main dataframe kerala_city_df.

In [6]:
# call the function
city_lat_long = pd.DataFrame(lat_lon_from_address(kerala_city_df['City']))
# set columns for the new dataframe
city_lat_long.columns = ['City', 'City Latitude', 'City Longitude']
# merge the initial dataset and the coordinate dataset on the common column City
kerala_city_df = pd.merge(kerala_city_df, city_lat_long, on='City')
kerala_city_df.head()

Unnamed: 0,City,Muncipality,City Latitude,City Longitude
0,Abdu Rahiman Nagar,Malappuram,11.0701,75.9345
1,Adat,Thrissur,,
2,Adichanalloor,Kollam,8.87892,76.7174
3,Adinad,Kollam,,
4,Adoor,Pathanamthitta,9.15679,76.7553


While using the geocoder to fetch the addresses, certain cities were not resolved by the geocoder API, and some were wrongly resolved to non-Kerala addresses.
Those were marked as NA in Latitude and Longitude. We need to remove them from the kerala_city_df dataset.

In [7]:
kerala_city_df = kerala_city_df[kerala_city_df['City Latitude'] != 'NA'].reset_index(drop=True)
print("Total number of cities available for analysis as of now are {}.".format(kerala_city_df.shape[0]))

Total number of cities available for analysis as of now are 468.
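
As a side note on the `'NA'` string sentinel: mixing strings with floats turns the coordinate columns into generic object columns. A possible alternative, sketched here on made-up rows and not taken from the notebook, is to store missing coordinates as `None`/`NaN` and drop them with `dropna`, which keeps the columns numeric.

```python
import pandas as pd

# Made-up rows: one resolved city and one that the geocoder could not resolve.
toy = pd.DataFrame({
    "City": ["Town A", "Town B"],
    "City Latitude": [9.0, None],
    "City Longitude": [76.0, None],
})

# Rows with missing coordinates are dropped; the remaining columns stay numeric.
resolved = toy.dropna(subset=["City Latitude", "City Longitude"]).reset_index(drop=True)
print(resolved)
print(resolved["City Latitude"].dtype)  # float64
```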


#### Create a map of Kerala with neighborhoods superimposed on top.

In [8]:
# get the geolocation of Kerala, on which the cities can be superimposed
geolocator = Nominatim(user_agent="city_explorer")
location = geolocator.geocode('Kerala, KL')
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Kerala are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Kerala are 9.5797046, 76.5691745.


In [9]:
map_kerala = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, muncipality, city in zip(kerala_city_df['City Latitude'], kerala_city_df['City Longitude'], kerala_city_df['Muncipality'], kerala_city_df['City']):
    label = '{}, {}'.format(city, muncipality)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_kerala)  
    
map_kerala

#### Define Foursquare Credentials and Version

In [10]:
CLIENT_ID = 'YOUR_FOURSQUARE_CLIENT_ID' # your Foursquare ID (placeholder; do not publish real credentials)
CLIENT_SECRET = 'YOUR_FOURSQUARE_CLIENT_SECRET' # your Foursquare Secret (placeholder)
VERSION = '20180605' # Foursquare API version
LIMIT = 100 # A default Foursquare API limit value

#### Let's create a function to fetch nearby venues around the cities in Kerala

In [11]:
def getNearbyVenues(names, muncipalities, latitudes, longitudes, radius=1000):
    
    venues_list=[]
    for name, muncipality, lat, lng in zip(names, muncipalities, latitudes, longitudes):
        #print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            muncipality,
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['City', 
                  'City Latitude', 
                  'City Longitude', 
                  'Muncipality', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
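
The function above indexes straight into `["response"]['groups'][0]['items']` and `['categories'][0]`, so a response that lacks any of these keys, or a venue without a category, would raise an exception. The following is a minimal defensive sketch of that parsing step, run on hand-written payloads that only mimic the shape used above; `extract_venues` is a hypothetical helper, not part of the notebook or of the Foursquare API.

```python
def extract_venues(payload):
    """Return (name, lat, lng, category) tuples from an explore-style response dict."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    venues = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories", [])
        # A venue without categories gets None instead of raising IndexError.
        category = categories[0]["name"] if categories else None
        venues.append((venue["name"],
                       venue["location"]["lat"],
                       venue["location"]["lng"],
                       category))
    return venues

# Hand-written payloads (made-up values) in the shape the function above expects.
sample = {"response": {"groups": [{"items": [
    {"venue": {"name": "Sample Cafe",
               "location": {"lat": 9.0, "lng": 76.0},
               "categories": [{"name": "Café"}]}},
    {"venue": {"name": "Uncategorised Spot",
               "location": {"lat": 9.1, "lng": 76.1},
               "categories": []}},
]}]}}
empty_payload = {"response": {}}

print(extract_venues(sample))
print(extract_venues(empty_payload))  # []
```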

#### Now write the code to run the above function on each neighborhood and create a new dataframe called _kerala_venues_.

In [12]:
kerala_venues = getNearbyVenues(names=kerala_city_df['City'],
                                   muncipalities=kerala_city_df['Muncipality'],
                                   latitudes=kerala_city_df['City Latitude'],
                                   longitudes=kerala_city_df['City Longitude']
                                  )
kerala_venues.head()

Unnamed: 0,City,City Latitude,City Longitude,Muncipality,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Abdu Rahiman Nagar,11.070081,75.934498,Malappuram,Calicut International Airport,11.064209,75.939217,Airport
1,Abdu Rahiman Nagar,11.070081,75.934498,Malappuram,babul yemen,11.062729,75.930557,Halal Restaurant
2,Adichanalloor,8.878922,76.717369,Kollam,KSRTC Bus Stand Chathannoor,8.879553,76.708734,Bus Station
3,Adoor,9.156793,76.755262,Pathanamthitta,"hills park, parakode",9.154142,76.759157,Chinese Restaurant
4,Adoor,9.156793,76.755262,Pathanamthitta,Green Valley,9.148232,76.754976,Concert Hall


#### Let's check the size of the resulting dataframe

In [13]:
print("Out of {} cities in the initial transformed dataset, {} cities were populated in the resulting dataset after nearby venue search.".format(kerala_city_df.shape[0], len(kerala_venues['City'].unique())))

Out of 468 cities in the initial transformed dataset, 270 cities were populated in the resulting dataset after nearby venue search.


In [14]:
print("So, in effect, {} cities were filtered out because the Foursquare API couldn't fetch any nearby venues for them.".format(kerala_city_df.shape[0] - len(kerala_venues['City'].unique())))

So, in effect, 198 cities were filtered out because the Foursquare API couldn't fetch any nearby venues for them.


Below are the cities that got filtered out. At a later stage, we will try to correct the initial data so that these can also be included in the final dataset.

In [15]:
#create new df without the cities having no venues. Need this for clustering
kerala_city_filtered_df = kerala_city_df[kerala_city_df.City.isin(kerala_venues['City'])].reset_index(drop=True)
#displaying the cities with no venues
kerala_city_df[~kerala_city_df.City.isin(kerala_venues['City'])].reset_index(drop=True).head()

Unnamed: 0,City,Muncipality,City Latitude,City Longitude
0,Alangad,Ernakulam,10.8541,76.4389
1,Alappuzha,Alappuzha,9.48871,76.4152
2,Ariyallur,Malappuram,11.0817,75.8523
3,Arookutty,Alappuzha,9.86816,76.3247
4,Avanur,Thrissur,10.6016,76.1863


In [16]:
print("Total {} venues were identified amongst {} available cities from Kerala".format(kerala_venues.shape[0], len(kerala_venues['City'].unique())))

Total 1047 venues were identified amongst 270 available cities from Kerala


Let's check how many venues were returned for each City

In [17]:
kerala_venues.groupby('City').count().sort_values(ascending=False, by="Venue Category").head()

City,City Latitude,City Longitude,Muncipality,Venue,Venue Latitude,Venue Longitude,Venue Category
"Trivandrum, TVM",60,60,60,60,60,60,60
Kozhikode,47,47,47,47,47,47,47
Pallikkunnu,23,23,23,23,23,23,23
Thrissur,22,22,22,22,22,22,22
Kollam,22,22,22,22,22,22,22


#### Let's find out how many unique categories can be curated from all the returned venues

In [18]:
print('There are {} unique categories.'.format(len(kerala_venues['Venue Category'].unique())))

There are 172 unique categories.


In [19]:
#get the unique Venue Categories from the list we generated above
kerala_unique_venues_df = pd.DataFrame({'Venue Category': kerala_venues['Venue Category'].unique()})
#sort in ascending value
kerala_unique_venues_df = kerala_unique_venues_df.sort_values('Venue Category',ascending=True).reset_index(drop=True)
kerala_unique_venues_df.head()

Unnamed: 0,Venue Category
0,ATM
1,Accessories Store
2,Airport
3,Airport Food Court
4,Airport Lounge


## Methodology <a name="methodology"></a>

In this project, we will try to find the best location for the residential project, one that satisfies the main requirements below:


*   Presence of a water body in the neighbourhood
*   Amenities that benefit a residential area should be nearby
*   Should not be near or around a commercial center

The dataset at hand will be reshaped so that we get one consolidated row per city, holding the mean frequency of each venue category near that city.
To categorize the cities based on their nearby venues, we can use the <b>unsupervised machine learning algorithm 'K-Means'</b>. We can use 5 clusters to segment the dataset.
To help the algorithm group the cities that satisfy our requirements, we have to increase the value (here, the mean value) of the venues we are looking for. For that purpose, we will create a <b>Venue-Weightage matrix</b>, with each venue category weighted by its importance to the requirements. Multiplying the dataset by this matrix gives the weighted dataset, which has higher values for the venues we want grouped together. The algorithm then takes care of the rest, clustering similar areas together.
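
As a rough illustration of the clustering idea, the sketch below runs scikit-learn's `KMeans` on a made-up matrix of weighted venue profiles. The numbers are invented, and two clusters are used only because the toy data has two obvious groups; the methodology above uses five.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up weighted venue profiles: rows are cities, columns are venue categories.
# The first two rows are dominated by one category, the last two by another.
profiles = np.array([
    [5.0, 0.0, 0.1],
    [4.8, 0.2, 0.0],
    [0.0, 2.5, 0.1],
    [0.1, 2.4, 0.0],
])

# n_clusters=2 suits this toy data; random_state makes the run repeatable.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(profiles)
labels = kmeans.labels_
print(labels)
```

Cities with similar weighted profiles receive the same label; the label numbers themselves carry no meaning, only the grouping does.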



#### Form a weightage matrix using the logic below, applied to venue categories


1.   Venue related to a water body - 10
2.   Venue is a must-have for a residential area - 5
3.   Venue is good to have for a residential area - 2
4.   Any other venue - 1

This weightage matrix is formed so that we can multiply it with each city's venue details and get a weighted dataset based on the venue preferences for this project.

The weightage matrix below was formed from an initial understanding and the unique list of categories, so please take some time at this point to review the list and add new entries.
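
To show what the multiplication does, here is a small sketch with made-up per-city frequencies for three categories, weighted according to the scheme above. The city names and frequencies are invented, and the variable names are illustrative only.

```python
import pandas as pd

# Made-up per-city mean frequencies for three venue categories.
grouped = pd.DataFrame(
    {"Beach": [0.5, 0.0], "Grocery Store": [0.25, 0.5], "Bar": [0.25, 0.5]},
    index=["Town A", "Town B"],
)

# Weights following the scheme above: water body 10, must-have 5, any other 1.
weights = pd.Series({"Beach": 10, "Grocery Store": 5, "Bar": 1})

# Multiply each column by its weight; mul aligns the Series index with the column labels.
weighted = grouped.mul(weights, axis="columns")
print(weighted)
```

In this toy case, the beach frequency of 0.5 for Town A becomes 5.0, so the water-body signal dominates that city's profile, while Town B is driven by its grocery store value of 2.5.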





In [20]:
weightage_array = [ ('ATM',5),('Accessories Store',1),('Afghan Restaurant',1),('Airport Food Court',1),('Airport Lounge',1),('Airport Service',1),
('Airport Terminal',1),('Airport',1),('American Restaurant',1),('Arcade',1),('Argentinian Restaurant',1),('Art Gallery',1),('Art Museum',1),
('Arts & Crafts Store',1),('Asian Restaurant',1),('Astrologer',1),('Athletics & Sports',2),('Australian Restaurant',1),('Auto Dealership',1),
('Baby Store',2),('Badminton Court',2),('Bagel Shop',1),('Bakery',2),('Bank',5),('Bar',1),('Basketball Court',2),('Bathing Area',1),('BBQ Joint',1),
('Beach',10),('Bed & Breakfast',1),('Bike Rental / Bike Share',1),('Boat or Ferry',2),('Boutique',5),('Bowling Alley',1),('Breakfast Spot',1),
('Bridal Shop',1),('Bridge',1),('Burger Joint',1),('Bus Line',2),('Bus Station',2),('Bus Stop',2),('Business Service',1),('Café',2),('Campground',1),
('Car Wash',5),('Casino',1),('Chinese Restaurant',1),('Climbing Gym',1),('Clothing Store',1),('Coffee Shop',1),('Comfort Food Restaurant',1),
('Concert Hall',1),('Convenience Store',5),('Cosmetics Shop',2),('Cricket Ground',2),('Currency Exchange',1),('Dance Studio',2),('Department Store',5),
('Diner',1),('Dog Run',2),('Electronics Store',2),('Falafel Restaurant',1),('Farm',2),('Farmers Market',2),('Fast Food Restaurant',1),('Fish & Chips Shop',1),
('Fish Market',5),('Flea Market',2),('Flower Shop',1),('Food & Drink Shop',1),('Food Court',1),('Food Truck',1),('Food',1),('Football Stadium',1),
('Forest',1),('Fried Chicken Joint',1),('Furniture / Home Store',2),('Gastropub',1),('General Travel',1),('Gift Shop',1),('Grocery Store',5),
('Gym / Fitness Center',5),('Gym',2),('Halal Restaurant',1),('Harbor / Marina',1),('Health & Beauty Service',2),('Historic Site',1),('History Museum',1),
('Hotel Bar',1),('Hotel Pool',1),('Hotel',1),('Ice Cream Shop',1),('Indian Restaurant',1),('Indie Movie Theater',2),('Intersection',1),
('IT Services',1),('Italian Restaurant',1),('Jewelry Store',1),('Juice Bar',1),('Kerala Restaurant',1),('Lake',10),('Light Rail Station',2),
('Lighthouse',1),('Liquor Store',1),('Lounge',1),('Market',5),('Mattress Store',1),('Men\'s Store',1),('Middle Eastern Restaurant',1),
('Mobile Phone Shop',2),('Motel',1),('Motorcycle Shop',1),('Mountain',1),('Movie Theater',5),('Moving Target',1),('Multicuisine Indian Restaurant',1),
('Multiplex',5),('Music Store',1),('Music Venue',1),('Neighborhood',1),('Nightclub',1),('Office',1),('Optical Shop',1),('Other Great Outdoors',1),
('Outlet Mall',2),('Outlet Store',2),('Park',5),('Performing Arts Venue',1),('Persian Restaurant',1),('Pharmacy',5),('Pizza Place',1),('Platform',1),
('Playground',5),('Plaza',1),('Pool',5),('Portuguese Restaurant',1),('Pub',1),('Recording Studio',1),('Recreation Center',1),('Resort',1),('Rest Area',1),
('Restaurant',1),('River',10),('Sandwich Place',1),('Scenic Lookout',1),('Seafood Restaurant',1),('Shopping Mall',2),('Shopping Plaza',1),('Ski Area',1),
('Smoke Shop',1),('Snack Place',1),('Soccer Field',1),('South Indian Restaurant',1),('Southern / Soul Food Restaurant',1),('Spa',2),
('Sporting Goods Shop',2),('Sports Club',1),('Stadium',1),('Student Center',1),('Supermarket',5),('Surf Spot',1),('Tea Room',1),('Temple',5),
('Tennis Court',2),('Tour Provider',1),('Tourist Information Center',1),('Track Stadium',1),('Track',1),('Trail',1),('Train Station',2),
('Travel & Transport',2),('Vegetarian / Vegan Restaurant',1),('Volleyball Court',2),('Women\'s Store',2),('Auditorium',1),('Auto Garage',2),
('Bookstore',2),('Cafeteria',1),('Donut Shop',1),('Metro Station',5),('Pier',2),('Shoe Store',1),('Stationery Store',5),('Toll Plaza',1),
('Theme Restaurant',1) 
]
weightage_matrix = pd.DataFrame(weightage_array, columns = ['Venue Category', 'Weight'])
weightage_matrix = weightage_matrix.sort_values(ascending = True, by = 'Venue Category').reset_index(drop=True)
weightage_matrix.head()

Unnamed: 0,Venue Category,Weight
0,ATM,5
1,Accessories Store,1
2,Afghan Restaurant,1
3,Airport,1
4,Airport Food Court,1


#### Verify the weightage matrix against the unique venues list to find data missing from the weightage matrix

In the result set, look for any rows with a weightage of NaN. Those are the venues not available in the weightage matrix. In that case, update the weightage array in the cell above with the new Venue Category and its weight, then rerun from that cell onwards.

In [21]:
#keep only the Venue Category column, so that re-executing this cell does not add duplicate columns
kerala_unique_venues_df = kerala_unique_venues_df[['Venue Category']]
#left-merge the unique Kerala venues with the weightage matrix, so that venues without a weight keep NaN
kerala_unique_venues_df = kerala_unique_venues_df.merge(weightage_matrix, on='Venue Category', how='left')
print("If any venue is displayed below, it has no weight in the weightage matrix. Please update the matrix and rerun from that point")
unweighted_venues = kerala_unique_venues_df[kerala_unique_venues_df['Weight'].isna()]
if not unweighted_venues.empty:
    print(unweighted_venues)
#remove from the weightage matrix any venues not in the final venue list, to avoid computation issues at a later stage
weightage_matrix = weightage_matrix[weightage_matrix['Venue Category'].isin(kerala_unique_venues_df['Venue Category'].values)].reset_index(drop=True)
#verify that the final counts match
print("Count of venues in Final Venue List: ", kerala_unique_venues_df.shape[0])
print("Count of venues in Final Weightage Matrix: ", weightage_matrix.shape[0])

If any venue is displayed below, it has no weight in the weightage matrix. Please update the matrix and rerun from that point
Count of venues in Final Venue List:  172
Count of venues in Final Weightage Matrix:  172
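
The check described above relies on unmatched categories surviving the merge with a NaN weight. That is what a left merge provides; pandas' default inner merge would silently drop them. A toy illustration, where the frames are made up and "Zoo" is an invented category with no weight:

```python
import pandas as pd

# Made-up frames: three venue categories, only two of which have a weight.
venues = pd.DataFrame({"Venue Category": ["ATM", "Beach", "Zoo"]})
weights = pd.DataFrame({"Venue Category": ["ATM", "Beach"], "Weight": [5, 10]})

# A left merge keeps every venue category; categories without a weight get NaN.
checked = venues.merge(weights, on="Venue Category", how="left")
missing = checked.loc[checked["Weight"].isna(), "Venue Category"].tolist()
print(missing)  # ['Zoo']
```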


Analyze each neighbourhood

In [22]:
# one hot encoding
kerala_onehot = pd.get_dummies(kerala_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
kerala_onehot['City'] = kerala_venues['City'] 

# move neighborhood column to the first column
fixed_columns = [kerala_onehot.columns[-1]] + list(kerala_onehot.columns[:-1])
kerala_onehot = kerala_onehot[fixed_columns]

kerala_onehot.head()

Unnamed: 0,City,ATM,Accessories Store,Airport,Airport Food Court,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Arcade,Argentinian Restaurant,Art Gallery,Art Museum,Arts & Crafts Store,Asian Restaurant,Astrologer,Athletics & Sports,Auditorium,Australian Restaurant,Auto Dealership,Auto Garage,BBQ Joint,Baby Store,Badminton Court,Bagel Shop,Bakery,Bank,Bar,Bathing Area,Beach,Bed & Breakfast,Bike Rental / Bike Share,Boat or Ferry,Bookstore,Boutique,Bowling Alley,Breakfast Spot,Bridal Shop,Bridge,Burger Joint,Bus Line,Bus Station,Bus Stop,Cafeteria,Café,Campground,Chinese Restaurant,Climbing Gym,Clothing Store,Coffee Shop,Comfort Food Restaurant,Concert Hall,Convenience Store,Cricket Ground,Currency Exchange,Department Store,Diner,Dog Run,Donut Shop,Electronics Store,Falafel Restaurant,Farm,Fast Food Restaurant,Fish & Chips Shop,Fish Market,Flea Market,Flower Shop,Food,Food & Drink Shop,Food Court,Food Truck,Football Stadium,Forest,Fried Chicken Joint,Furniture / Home Store,Gastropub,General Travel,Gift Shop,Grocery Store,Gym,Gym / Fitness Center,Halal Restaurant,Harbor / Marina,Health & Beauty Service,Historic Site,Hotel,Hotel Bar,Hotel Pool,IT Services,Ice Cream Shop,Indian Restaurant,Indie Movie Theater,Intersection,Italian Restaurant,Jewelry Store,Juice Bar,Kerala Restaurant,Lake,Light Rail Station,Lighthouse,Liquor Store,Lounge,Market,Mattress Store,Men's Store,Metro Station,Middle Eastern Restaurant,Mobile Phone Shop,Motel,Motorcycle Shop,Mountain,Movie Theater,Moving Target,Multicuisine Indian Restaurant,Multiplex,Music Venue,Neighborhood,Nightclub,Office,Optical Shop,Other Great Outdoors,Outlet Mall,Park,Performing Arts Venue,Persian Restaurant,Pharmacy,Pier,Pizza Place,Platform,Playground,Plaza,Pool,Portuguese Restaurant,Pub,Recording Studio,Recreation Center,Resort,Rest Area,Restaurant,River,Sandwich Place,Scenic Lookout,Seafood Restaurant,Shoe Store,Shopping Mall,Shopping Plaza,Ski Area,Smoke Shop,Snack Place,Soccer Field,South Indian Restaurant,Southern / Soul Food Restaurant,Spa,Sporting Goods Shop,Stadium,Student Center,Supermarket,Surf Spot,Tea Room,Temple,Tennis Court,Theme Restaurant,Toll Plaza,Tour Provider,Tourist Information Center,Track,Track Stadium,Trail,Train Station,Travel & Transport,Vegetarian / Vegan Restaurant,Volleyball Court,Women's Store
0,Abdu Rahiman Nagar,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,Abdu Rahiman Nagar,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,Adichanalloor,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,Adoor,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,Adoor,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


#### Next, let's group rows by city and take the mean of the frequency of occurrence of each category

In [23]:
kerala_grouped = kerala_onehot.groupby('City').mean().reset_index()
kerala_grouped.head()

Unnamed: 0,City,ATM,Accessories Store,Airport,Airport Food Court,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Arcade,Argentinian Restaurant,Art Gallery,Art Museum,Arts & Crafts Store,Asian Restaurant,Astrologer,Athletics & Sports,Auditorium,Australian Restaurant,Auto Dealership,Auto Garage,BBQ Joint,Baby Store,Badminton Court,Bagel Shop,Bakery,Bank,Bar,Bathing Area,Beach,Bed & Breakfast,Bike Rental / Bike Share,Boat or Ferry,Bookstore,Boutique,Bowling Alley,Breakfast Spot,Bridal Shop,Bridge,Burger Joint,Bus Line,Bus Station,Bus Stop,Cafeteria,Café,Campground,Chinese Restaurant,Climbing Gym,Clothing Store,Coffee Shop,Comfort Food Restaurant,Concert Hall,Convenience Store,Cricket Ground,Currency Exchange,Department Store,Diner,Dog Run,Donut Shop,Electronics Store,Falafel Restaurant,Farm,Fast Food Restaurant,Fish & Chips Shop,Fish Market,Flea Market,Flower Shop,Food,Food & Drink Shop,Food Court,Food Truck,Football Stadium,Forest,Fried Chicken Joint,Furniture / Home Store,Gastropub,General Travel,Gift Shop,Grocery Store,Gym,Gym / Fitness Center,Halal Restaurant,Harbor / Marina,Health & Beauty Service,Historic Site,Hotel,Hotel Bar,Hotel Pool,IT Services,Ice Cream Shop,Indian Restaurant,Indie Movie Theater,Intersection,Italian Restaurant,Jewelry Store,Juice Bar,Kerala Restaurant,Lake,Light Rail Station,Lighthouse,Liquor Store,Lounge,Market,Mattress Store,Men's Store,Metro Station,Middle Eastern Restaurant,Mobile Phone Shop,Motel,Motorcycle Shop,Mountain,Movie Theater,Moving Target,Multicuisine Indian Restaurant,Multiplex,Music Venue,Neighborhood,Nightclub,Office,Optical Shop,Other Great Outdoors,Outlet Mall,Park,Performing Arts Venue,Persian Restaurant,Pharmacy,Pier,Pizza Place,Platform,Playground,Plaza,Pool,Portuguese Restaurant,Pub,Recording Studio,Recreation Center,Resort,Rest Area,Restaurant,River,Sandwich Place,Scenic Lookout,Seafood Restaurant,Shoe Store,Shopping Mall,Shopping Plaza,Ski Area,Smoke Shop,Snack Place,Soccer Field,South Indian Restaurant,Southern / Soul Food Restaurant,Spa,Sporting Goods Shop,Stadium,Student Center,Supermarket,Surf Spot,Tea Room,Temple,Tennis Court,Theme Restaurant,Toll Plaza,Tour Provider,Tourist Information Center,Track,Track Stadium,Trail,Train Station,Travel & Transport,Vegetarian / Vegan Restaurant,Volleyball Court,Women's Store
0,Abdu Rahiman Nagar,0.0,0.0,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Adichanalloor,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Adoor,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,0.0,0.0,0.0,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Akathiyoor,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Alamcode,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.25,0.25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


At this stage we apply the weightage matrix by multiplying the mean dummy values of each venue category by that category's weight. The result is a new dataframe, *kerala_grouped_weighted*, which we will use for clustering.

In [24]:
kerala_grouped_weighted = pd.concat([kerala_grouped.iloc[: , [0]], kerala_grouped.iloc[:, 1:(kerala_unique_venues_df.shape[0]+1)] * weightage_matrix['Weight'].values], axis=1)
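
The multiplication above relies on the venue columns of `kerala_grouped` being in the same order as the rows of `weightage_matrix`. As a minimal sketch on made-up data (the cities, frequencies and weights below are illustrative only, not taken from the real dataset), the weights can instead be aligned by venue name before multiplying:

```python
import pandas as pd

# Toy stand-in for kerala_grouped: mean venue frequency per city (illustrative values)
grouped = pd.DataFrame({
    'City': ['A', 'B'],
    'Beach': [0.5, 0.0],
    'Movie Theater': [0.25, 1.0],
})

# Toy stand-in for weightage_matrix, deliberately listed in a different order
weights = pd.DataFrame({
    'Venue Category': ['Movie Theater', 'Beach'],
    'Weight': [5, 10],
})

# Index the weights by venue name, then reorder them to match the venue columns
venue_columns = grouped.columns.drop('City')
aligned_weights = weights.set_index('Venue Category')['Weight'].reindex(venue_columns)

# Multiply every venue column by its own weight and put the City column back in front
weighted = pd.concat([grouped[['City']], grouped[venue_columns] * aligned_weights], axis=1)
print(weighted)
```

With this variant, a venue category missing from the weight table would show up as `NaN` rather than silently receiving another category's weight.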

To verify that the multiplication worked, we can divide the weighted "Movie Theater" values by the initial values. Every result should be 5, which is the weight we allocated to Movie Theater.

In [25]:
kerala_grouped_weighted['Movie Theater'][kerala_grouped_weighted['Movie Theater'] > 0] / kerala_grouped['Movie Theater'][kerala_grouped['Movie Theater'] > 0]

4      5.0
6      5.0
18     5.0
22     5.0
37     5.0
55     5.0
56     5.0
60     5.0
109    5.0
121    5.0
123    5.0
146    5.0
158    5.0
167    5.0
178    5.0
179    5.0
182    5.0
183    5.0
185    5.0
192    5.0
207    5.0
210    5.0
221    5.0
240    5.0
244    5.0
Name: Movie Theater, dtype: float64

### Let's create the new dataframe and display the top 10 venues for each neighborhood.

Function for returning the top n venues

In [26]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
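
As a quick illustration on a made-up row (the city name and frequencies are hypothetical), the function skips the first entry, which holds the city name, and returns the venue names ordered by descending value:

```python
import pandas as pd

def return_most_common_venues(row, num_top_venues):
    # skip the first entry (the city name) and sort the remaining venue values
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]

# Hypothetical row: one city with three venue frequencies
row = pd.Series({'City': 'Example', 'Bakery': 0.2, 'Lake': 0.5, 'Park': 0.3})
top_two = return_most_common_venues(row, 2)
print(list(top_two))  # ['Lake', 'Park']
```

When a city has fewer non-zero venue categories than `num_top_venues`, the remaining positions are filled with zero-frequency categories in whatever order the sort leaves them, so only the leading entries are meaningful for such cities.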

Now we can extract the top ten venues of each city and create a new dataframe, which can then be used for analysis

In [27]:
#top n number of venues
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['City']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
city_venues_sorted = pd.DataFrame(columns=columns)
city_venues_sorted['City'] = kerala_grouped_weighted['City']

for ind in np.arange(kerala_grouped_weighted.shape[0]):
    city_venues_sorted.iloc[ind, 1:] = return_most_common_venues(kerala_grouped.iloc[ind, :], num_top_venues)

city_venues_sorted.head()

Unnamed: 0,City,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Abdu Rahiman Nagar,Airport,Halal Restaurant,Women's Store,Currency Exchange,Fast Food Restaurant,Farm,Falafel Restaurant,Electronics Store,Donut Shop,Dog Run
1,Adichanalloor,Bus Station,Women's Store,Currency Exchange,Fast Food Restaurant,Farm,Falafel Restaurant,Electronics Store,Donut Shop,Dog Run,Diner
2,Adoor,Chinese Restaurant,Concert Hall,Women's Store,Department Store,Fast Food Restaurant,Farm,Falafel Restaurant,Electronics Store,Donut Shop,Dog Run
3,Akathiyoor,Indian Restaurant,Women's Store,Currency Exchange,Fast Food Restaurant,Farm,Falafel Restaurant,Electronics Store,Donut Shop,Dog Run,Diner
4,Alamcode,Movie Theater,Indian Restaurant,Indie Movie Theater,Bridge,Women's Store,Currency Exchange,Farm,Falafel Restaurant,Electronics Store,Donut Shop


## Cluster Neighborhoods

Let's run k-means with 5 clusters for now to validate the dataset we prepared.

In [28]:
# set number of clusters
kclusters = 5

kerala_grouped_clustering = kerala_grouped_weighted.drop(columns='City')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(kerala_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=int32)
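
The first ten labels are all 0, so it is worth checking how the cities are spread across the clusters before interpreting them. A minimal sketch, using a made-up label array in place of `kmeans.labels_`:

```python
import numpy as np
import pandas as pd

# Hypothetical labels standing in for kmeans.labels_
labels = np.array([0, 0, 0, 1, 0, 2, 0, 4, 3, 0])

# Number of cities assigned to each cluster, ordered by cluster label
cluster_sizes = pd.Series(labels).value_counts().sort_index()
print(cluster_sizes)
```

A heavily unbalanced result, with one large cluster and a few small ones, would be consistent with the weighting pulling the water-side and amenity-rich cities away from the rest.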

Let's create a new dataframe that includes the cluster as well as the top 10 venues for each neighborhood.

In [29]:
# add clustering labels
city_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

kerala_merged = kerala_city_filtered_df

# join the cluster labels and top venues onto the city dataframe, which holds the latitude/longitude of each city having at least one venue
kerala_merged = kerala_merged.join(city_venues_sorted.set_index('City'), on='City')

kerala_merged.head()

Unnamed: 0,City,Muncipality,City Latitude,City Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Abdu Rahiman Nagar,Malappuram,11.0701,75.9345,0,Airport,Halal Restaurant,Women's Store,Currency Exchange,Fast Food Restaurant,Farm,Falafel Restaurant,Electronics Store,Donut Shop,Dog Run
1,Adichanalloor,Kollam,8.87892,76.7174,0,Bus Station,Women's Store,Currency Exchange,Fast Food Restaurant,Farm,Falafel Restaurant,Electronics Store,Donut Shop,Dog Run,Diner
2,Adoor,Pathanamthitta,9.15679,76.7553,0,Chinese Restaurant,Concert Hall,Women's Store,Department Store,Fast Food Restaurant,Farm,Falafel Restaurant,Electronics Store,Donut Shop,Dog Run
3,Akathiyoor,Thrissur,10.6763,76.082,0,Indian Restaurant,Women's Store,Currency Exchange,Fast Food Restaurant,Farm,Falafel Restaurant,Electronics Store,Donut Shop,Dog Run,Diner
4,Alamcode,Malappuram,8.71992,76.8134,0,Movie Theater,Indian Restaurant,Indie Movie Theater,Bridge,Women's Store,Currency Exchange,Farm,Falafel Restaurant,Electronics Store,Donut Shop
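
`DataFrame.join` performs a left join by default, so every row of the city dataframe is kept, and a city missing from `city_venues_sorted` would receive `NaN` in the added columns. A small sketch with made-up rows (all names and values are illustrative):

```python
import pandas as pd

# Toy stand-in for kerala_city_filtered_df
cities = pd.DataFrame({
    'City': ['A', 'B', 'C'],
    'City Latitude': [10.0, 11.0, 12.0],
})

# Toy stand-in for city_venues_sorted; city 'C' has no venues
venues = pd.DataFrame({
    'Cluster Labels': [0, 1],
    'City': ['A', 'B'],
    '1st Most Common Venue': ['Lake', 'Beach'],
})

# Left join on the City column: 'C' is kept, with NaN in the venue columns
merged = cities.join(venues.set_index('City'), on='City')
print(merged)

# Rows without a cluster label can be dropped before plotting
merged_with_cluster = merged.dropna(subset=['Cluster Labels'])
```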


Finally, let's visualize the resulting clusters

In [30]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(kerala_merged['City Latitude'], kerala_merged['City Longitude'], kerala_merged['City'], kerala_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
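
Note that the markers are colored with `rainbow[cluster-1]`, so label 0 picks the last color through Python's negative indexing. Each cluster still gets its own color, as this short sketch with a hypothetical five-color palette shows:

```python
# Hypothetical five-color palette standing in for the `rainbow` list
palette = ['#8000ff', '#00b5eb', '#80ffb4', '#ffb360', '#ff0000']

# Color chosen for each cluster label when indexing with cluster - 1
color_by_cluster = {cluster: palette[cluster - 1] for cluster in range(len(palette))}
print(color_by_cluster)
```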

## Analysis <a name="analysis"></a>

### Examine Clusters

Initialize a dataframe which will hold the city candidates selected in the cluster analysis below.

In [31]:
suitable_city_candidates = pd.DataFrame(columns=['City', 'Beach', 'Lake', 'River', 'Weighted_Score_Amenities', 'Muncipality', 'City Latitude', 'City Longitude'])

#### Cluster 1 - Analysis

In [32]:
cluster1_df = kerala_merged.loc[kerala_merged['Cluster Labels'] == 0, kerala_merged.columns[[0,1] + list(range(5, kerala_merged.shape[1]))]]
cluster1_df.head()

Unnamed: 0,City,Muncipality,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Abdu Rahiman Nagar,Malappuram,Airport,Halal Restaurant,Women's Store,Currency Exchange,Fast Food Restaurant,Farm,Falafel Restaurant,Electronics Store,Donut Shop,Dog Run
1,Adichanalloor,Kollam,Bus Station,Women's Store,Currency Exchange,Fast Food Restaurant,Farm,Falafel Restaurant,Electronics Store,Donut Shop,Dog Run,Diner
2,Adoor,Pathanamthitta,Chinese Restaurant,Concert Hall,Women's Store,Department Store,Fast Food Restaurant,Farm,Falafel Restaurant,Electronics Store,Donut Shop,Dog Run
3,Akathiyoor,Thrissur,Indian Restaurant,Women's Store,Currency Exchange,Fast Food Restaurant,Farm,Falafel Restaurant,Electronics Store,Donut Shop,Dog Run,Diner
4,Alamcode,Malappuram,Movie Theater,Indian Restaurant,Indie Movie Theater,Bridge,Women's Store,Currency Exchange,Farm,Falafel Restaurant,Electronics Store,Donut Shop


On analyzing the types of venues in the neighbourhood of the Cluster 1 cities, we can classify Cluster 1 as not suitable for a residential project, because the most popular venues in those cities are **eateries, theaters and other commercial establishments**.

Hence we do not consider this cluster further when choosing a site for the Residential Project.

#### Cluster 2 - Analysis

In [33]:
cluster2_df = kerala_merged.loc[kerala_merged['Cluster Labels'] == 1, kerala_merged.columns[[0,1] + list(range(5, kerala_merged.shape[1]))]]
cluster2_df.head()

Unnamed: 0,City,Muncipality,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
25,Chekkiad,Kozhikode,River,Restaurant,Cricket Ground,Farm,Falafel Restaurant,Electronics Store,Donut Shop,Dog Run,Diner,Department Store
69,Kadungalloor,Ernakulam,River,Cricket Ground,Farm,Falafel Restaurant,Electronics Store,Donut Shop,Dog Run,Diner,Department Store,Currency Exchange
163,Muzhappilangad,Kannur,River,Beach,Cafeteria,Farm,Falafel Restaurant,Electronics Store,Donut Shop,Dog Run,Diner,Department Store


On analyzing the types of venues in the neighbourhood of the Cluster 2 cities, we can classify Cluster 2 as a <b>suitable candidate for the Residential project</b>, because there are water bodies and some basic amenities nearby.

We will investigate the cluster further and find the top 3 cities that have a water body as well as a good amount of basic amenities nearby.

In [34]:
print("Total number of cities in this Cluster 2: ", cluster2_df.shape[0])
# select the venue columns from weightage_matrix with weight > 1, i.e. the venues relevant to a residential project
venues_with_weight = ['City', 'Weighted_Score_Amenities']
for venue in weightage_matrix[weightage_matrix['Weight'] > 1]['Venue Category']:
    venues_with_weight.append(venue)
# from the list of all cities with all venue details, select only the cities in Cluster 2
Cluster_2_cities = kerala_merged.loc[kerala_merged['Cluster Labels'] == 1][['City','Muncipality','City Latitude', 'City Longitude']]
# select the weighted venue details of the cities selected above
Cluster_2_full_venues = kerala_grouped_weighted[kerala_grouped_weighted.City.isin(Cluster_2_cities['City'])].reset_index(drop=True)
# we are interested only in cities with at least one water body nearby, so keep only those
has_water_body = (Cluster_2_full_venues['Beach'] > 0) | (Cluster_2_full_venues['Lake'] > 0) | (Cluster_2_full_venues['River'] > 0)
Cluster_2_full_venues = Cluster_2_full_venues.loc[has_water_body].copy()
# compute a score from the other weighted venues near each city, to prioritize the selected cities
Cluster_2_full_venues["Weighted_Score_Amenities"] = Cluster_2_full_venues[Cluster_2_full_venues.columns.difference(['Beach', 'Lake', 'River'])].sum(axis=1, numeric_only=True)
# select the top 3 cities with a water body, ranked by amenities score
Cluster_2_full_venues_top3 = Cluster_2_full_venues[['City', 'Beach', 'Lake', 'River', 'Weighted_Score_Amenities']].sort_values(ascending=False, by='Weighted_Score_Amenities').reset_index(drop=True).head(3)
# eliminate any city whose amenities score is 0, i.e. one without amenities beneficial to a residential area nearby
Cluster_2_full_venues_top3 = Cluster_2_full_venues_top3[Cluster_2_full_venues_top3['Weighted_Score_Amenities'] > 0]
# merge with the previously fetched dataframe to get the districts, latitudes and longitudes
Cluster_2_full_venues_top3 = Cluster_2_full_venues_top3.merge(Cluster_2_cities, left_on='City', right_on='City')
# add the top cities to suitable_city_candidates for the final analysis (DataFrame.append was removed in pandas 2.0)
suitable_city_candidates = pd.concat([suitable_city_candidates, Cluster_2_full_venues_top3], ignore_index=True)
print("Find below the top 3 cities in this cluster, sorted in order of amenities score")
Cluster_2_full_venues_top3

Total number of cities in this Cluster 2:  3
Find below the top 3 cities in this cluster, sorted in order of amenities score


Unnamed: 0,City,Beach,Lake,River,Weighted_Score_Amenities,Muncipality,City Latitude,City Longitude
0,Chekkiad,0.0,0.0,5.0,0.5,Kozhikode,11.7194,75.6427
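
The same filter, score and top-3 steps are repeated for every cluster, so they could be wrapped in one helper. The sketch below is one possible way to do that, not the notebook's own code: `shortlist_cluster`, `WATER_BODIES` and the toy data are hypothetical, and the helper leaves the `City` column out of the score explicitly.

```python
import pandas as pd

WATER_BODIES = ['Beach', 'Lake', 'River']

def shortlist_cluster(weighted, merged, cluster_label, top_n=3):
    """Return up to top_n water-side cities of one cluster, ranked by amenities score."""
    # cities (with district and coordinates) that belong to the requested cluster
    cluster_cities = merged.loc[merged['Cluster Labels'] == cluster_label,
                                ['City', 'Muncipality', 'City Latitude', 'City Longitude']]
    # keep only cities of this cluster that have at least one water body nearby
    in_cluster = weighted['City'].isin(cluster_cities['City'])
    has_water = (weighted[WATER_BODIES] > 0).any(axis=1)
    venues = weighted[in_cluster & has_water].copy()
    # amenities score: sum of all weighted venue columns except the water bodies
    amenity_columns = venues.columns.difference(['City'] + WATER_BODIES)
    venues['Weighted_Score_Amenities'] = venues[amenity_columns].sum(axis=1)
    top = (venues[['City'] + WATER_BODIES + ['Weighted_Score_Amenities']]
           .sort_values(by='Weighted_Score_Amenities', ascending=False)
           .head(top_n))
    # drop cities without any scored amenities, then add district and coordinates
    top = top[top['Weighted_Score_Amenities'] > 0]
    return top.merge(cluster_cities, on='City').reset_index(drop=True)

# Toy data (illustrative values only)
weighted = pd.DataFrame({
    'City': ['A', 'B', 'C'],
    'Beach': [5.0, 0.0, 0.0],
    'Lake': [0.0, 0.0, 2.5],
    'River': [0.0, 0.0, 0.0],
    'Grocery Store': [1.0, 3.0, 2.0],
})
merged = pd.DataFrame({
    'City': ['A', 'B', 'C'],
    'Muncipality': ['X', 'Y', 'Z'],
    'City Latitude': [9.0, 10.0, 11.0],
    'City Longitude': [76.0, 76.5, 77.0],
    'Cluster Labels': [1, 1, 1],
})

shortlist = shortlist_cluster(weighted, merged, cluster_label=1)
print(shortlist)
```

In the toy data, city `B` has the highest amenities score but no water body, so it is filtered out before ranking.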


#### Cluster 3 - Analysis

In [35]:
cluster3_df = kerala_merged.loc[kerala_merged['Cluster Labels'] == 2, kerala_merged.columns[[0,1] + list(range(5, kerala_merged.shape[1]))]]
cluster3_df.head()

Unnamed: 0,City,Muncipality,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
14,Azhikode North,Kannur,Beach,Women's Store,Department Store,Fast Food Restaurant,Farm,Falafel Restaurant,Electronics Store,Donut Shop,Dog Run,Diner
15,Azhikode South,Kannur,Beach,Women's Store,Department Store,Fast Food Restaurant,Farm,Falafel Restaurant,Electronics Store,Donut Shop,Dog Run,Diner
16,Azhiyur,Kozhikode,Beach,Women's Store,Department Store,Fast Food Restaurant,Farm,Falafel Restaurant,Electronics Store,Donut Shop,Dog Run,Diner
80,Kandalloor,Alappuzha,Harbor / Marina,Beach,Women's Store,Department Store,Fast Food Restaurant,Farm,Falafel Restaurant,Electronics Store,Donut Shop,Dog Run
180,Palissery,Thrissur,Beach,Historic Site,Train Station,Indian Restaurant,Department Store,Fast Food Restaurant,Farm,Falafel Restaurant,Electronics Store,Donut Shop


On analyzing the types of venues in the neighbourhood of the Cluster 3 cities, we can classify Cluster 3 as a <b>candidate for the Residential project</b>: there are water bodies, but very few basic amenities.

We will investigate the cluster further and find the top 3 cities that have a water body as well as a good amount of basic amenities nearby.

In [36]:
print("Total number of cities in this Cluster 3: ", cluster3_df.shape[0])
# select the venue columns from weightage_matrix with weight > 1, i.e. the venues relevant to a residential project
venues_with_weight = ['City', 'Weighted_Score_Amenities']
for venue in weightage_matrix[weightage_matrix['Weight'] > 1]['Venue Category']:
    venues_with_weight.append(venue)
# from the list of all cities with all venue details, select only the cities in Cluster 3
Cluster_3_cities = kerala_merged.loc[kerala_merged['Cluster Labels'] == 2][['City','Muncipality','City Latitude', 'City Longitude']]
# select the weighted venue details of the cities selected above
Cluster_3_full_venues = kerala_grouped_weighted[kerala_grouped_weighted.City.isin(Cluster_3_cities['City'])].reset_index(drop=True)
# we are interested only in cities with at least one water body nearby, so keep only those
has_water_body = (Cluster_3_full_venues['Beach'] > 0) | (Cluster_3_full_venues['Lake'] > 0) | (Cluster_3_full_venues['River'] > 0)
Cluster_3_full_venues = Cluster_3_full_venues.loc[has_water_body].copy()
# compute a score from the other weighted venues near each city, to prioritize the selected cities
Cluster_3_full_venues["Weighted_Score_Amenities"] = Cluster_3_full_venues[Cluster_3_full_venues.columns.difference(['Beach', 'Lake', 'River'])].sum(axis=1, numeric_only=True)
# select the top 3 cities with a water body, ranked by amenities score
Cluster_3_full_venues_top3 = Cluster_3_full_venues[['City', 'Beach', 'Lake', 'River', 'Weighted_Score_Amenities']].sort_values(ascending=False, by='Weighted_Score_Amenities').reset_index(drop=True).head(3)
# eliminate any city whose amenities score is 0, i.e. one without amenities beneficial to a residential area nearby
Cluster_3_full_venues_top3 = Cluster_3_full_venues_top3[Cluster_3_full_venues_top3['Weighted_Score_Amenities'] > 0]
# merge with the previously fetched dataframe to get the districts, latitudes and longitudes
Cluster_3_full_venues_top3 = Cluster_3_full_venues_top3.merge(Cluster_3_cities, left_on='City', right_on='City')
# add the top cities to suitable_city_candidates for the final analysis (DataFrame.append was removed in pandas 2.0)
suitable_city_candidates = pd.concat([suitable_city_candidates, Cluster_3_full_venues_top3], ignore_index=True)
print("Find below the top 3 cities in this cluster, sorted in order of amenities score")
Cluster_3_full_venues_top3

Total number of cities in this Cluster 3:  8
Find below the top 3 cities in this cluster, sorted in order of amenities score


Unnamed: 0,City,Beach,Lake,River,Weighted_Score_Amenities,Muncipality,City Latitude,City Longitude
0,Pallikkara,6.666667,0.0,0.0,1.666667,Kasaragod,12.3851,75.044
1,Palissery,5.714286,0.0,0.0,0.571429,Thrissur,11.7521,75.4859
2,Kandalloor,5.0,0.0,0.0,0.5,Alappuzha,9.16963,76.4677


#### Cluster 4 - Analysis

In [37]:
cluster4_df = kerala_merged.loc[kerala_merged['Cluster Labels'] == 3, kerala_merged.columns[[0,1] + list(range(5, kerala_merged.shape[1]))]]
cluster4_df.head()

Unnamed: 0,City,Muncipality,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
11,Athiyannur,Trivandrum,ATM,Currency Exchange,Fast Food Restaurant,Farm,Falafel Restaurant,Electronics Store,Donut Shop,Dog Run,Diner,Department Store
27,Chelamattom,Ernakulam,ATM,Currency Exchange,Fast Food Restaurant,Farm,Falafel Restaurant,Electronics Store,Donut Shop,Dog Run,Diner,Department Store
54,Eramala,Kozhikode,ATM,Currency Exchange,Fast Food Restaurant,Farm,Falafel Restaurant,Electronics Store,Donut Shop,Dog Run,Diner,Department Store
85,Kannadiparamba,Kannur,ATM,Currency Exchange,Fast Food Restaurant,Farm,Falafel Restaurant,Electronics Store,Donut Shop,Dog Run,Diner,Department Store
130,Kunnathunad,Ernakulam,ATM,Currency Exchange,Fast Food Restaurant,Farm,Falafel Restaurant,Electronics Store,Donut Shop,Dog Run,Diner,Department Store


On analyzing the types of venues in the neighbourhood of the Cluster 4 cities, we can classify Cluster 4 as a <b>candidate for the Residential project</b>, because there is a good amount of residential-type amenities nearby. At first look, however, we could not find any water bodies, which we need to investigate further.

We will investigate the cluster further and find the top 3 cities that have a water body as well as a good amount of basic amenities nearby.

In [38]:
print("Total number of cities in this Cluster 4: ", cluster4_df.shape[0])
# select the venue columns from weightage_matrix with weight > 1, i.e. the venues relevant to a residential project
venues_with_weight = ['City', 'Weighted_Score_Amenities']
for venue in weightage_matrix[weightage_matrix['Weight'] > 1]['Venue Category']:
    venues_with_weight.append(venue)
# from the list of all cities with all venue details, select only the cities in Cluster 4
Cluster_4_cities = kerala_merged.loc[kerala_merged['Cluster Labels'] == 3][['City','Muncipality','City Latitude', 'City Longitude']]
# select the weighted venue details of the cities selected above
Cluster_4_full_venues = kerala_grouped_weighted[kerala_grouped_weighted.City.isin(Cluster_4_cities['City'])].reset_index(drop=True)
# we are interested only in cities with at least one water body nearby, so keep only those
has_water_body = (Cluster_4_full_venues['Beach'] > 0) | (Cluster_4_full_venues['Lake'] > 0) | (Cluster_4_full_venues['River'] > 0)
Cluster_4_full_venues = Cluster_4_full_venues.loc[has_water_body].copy()
# compute a score from the other weighted venues near each city, to prioritize the selected cities
Cluster_4_full_venues["Weighted_Score_Amenities"] = Cluster_4_full_venues[Cluster_4_full_venues.columns.difference(['Beach', 'Lake', 'River'])].sum(axis=1, numeric_only=True)
# select the top 3 cities with a water body, ranked by amenities score
Cluster_4_full_venues_top3 = Cluster_4_full_venues[['City', 'Beach', 'Lake', 'River', 'Weighted_Score_Amenities']].sort_values(ascending=False, by='Weighted_Score_Amenities').reset_index(drop=True).head(3)
# eliminate any city whose amenities score is 0, i.e. one without amenities beneficial to a residential area nearby
Cluster_4_full_venues_top3 = Cluster_4_full_venues_top3[Cluster_4_full_venues_top3['Weighted_Score_Amenities'] > 0]
# merge with the previously fetched dataframe to get the districts, latitudes and longitudes
Cluster_4_full_venues_top3 = Cluster_4_full_venues_top3.merge(Cluster_4_cities, left_on='City', right_on='City')
# add the top cities to suitable_city_candidates for the final analysis (DataFrame.append was removed in pandas 2.0)
suitable_city_candidates = pd.concat([suitable_city_candidates, Cluster_4_full_venues_top3], ignore_index=True)
print("Find below the top 3 cities in this cluster, sorted in order of amenities score")
Cluster_4_full_venues_top3

Total number of cities in this Cluster 4:  14
Find below the top 3 cities in this cluster, sorted in order of amenities score


Unnamed: 0,Beach,Lake,River,Weighted_Score_Amenities,City,Muncipality,City Latitude,City Longitude


#### Cluster 5 - Analysis

In [39]:
cluster5_df = kerala_merged.loc[kerala_merged['Cluster Labels'] == 4, kerala_merged.columns[[0,1] + list(range(5, kerala_merged.shape[1]))]]
cluster5_df.head()

Unnamed: 0,City,Muncipality,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
48,Elamkunnapuzha,Ernakulam,Lake,Fast Food Restaurant,Beach,Fish Market,Farm,Falafel Restaurant,Electronics Store,Donut Shop,Dog Run,Diner
98,Kattappana,Idukki,Hotel,Bus Station,Shopping Plaza,Lake,Women's Store,Currency Exchange,Farm,Falafel Restaurant,Electronics Store,Donut Shop
148,Marampilly,Ernakulam,Lake,Badminton Court,Chinese Restaurant,Climbing Gym,Fish & Chips Shop,Fast Food Restaurant,Farm,Falafel Restaurant,Electronics Store,Donut Shop
160,Mulavukad,Ernakulam,Lake,Indian Restaurant,Women's Store,Fast Food Restaurant,Farm,Falafel Restaurant,Electronics Store,Donut Shop,Dog Run,Diner
169,New Mahe,Kannur,Lake,Fish Market,Fast Food Restaurant,Farm,Falafel Restaurant,Electronics Store,Donut Shop,Dog Run,Diner,Department Store


On analyzing the types of venues in the neighbourhood of the Cluster 5 cities, we can classify Cluster 5 as the <b>most suitable candidate for the Residential project</b>, because there is a good amount of water bodies and residential-type amenities nearby. This cluster also lacks the commercial centers that are not wanted around a residential area.

We will investigate the cluster further and find the top 3 cities that have a water body as well as a good amount of basic amenities nearby.

In [40]:
print("Total number of cities in this Cluster 5: ", cluster5_df.shape[0])
# select the venue columns from weightage_matrix with weight > 1, i.e. the venues relevant to a residential project
venues_with_weight = ['City', 'Weighted_Score_Amenities']
for venue in weightage_matrix[weightage_matrix['Weight'] > 1]['Venue Category']:
    venues_with_weight.append(venue)
# from the list of all cities with all venue details, select only the cities in Cluster 5
Cluster_5_cities = kerala_merged.loc[kerala_merged['Cluster Labels'] == 4][['City','Muncipality','City Latitude', 'City Longitude']]
# select the weighted venue details of the cities selected above
Cluster_5_full_venues = kerala_grouped_weighted[kerala_grouped_weighted.City.isin(Cluster_5_cities['City'])].reset_index(drop=True)
# we are interested only in cities with at least one water body nearby, so keep only those
has_water_body = (Cluster_5_full_venues['Beach'] > 0) | (Cluster_5_full_venues['Lake'] > 0) | (Cluster_5_full_venues['River'] > 0)
Cluster_5_full_venues = Cluster_5_full_venues.loc[has_water_body].copy()
# compute a score from the other weighted venues near each city, to prioritize the selected cities
Cluster_5_full_venues["Weighted_Score_Amenities"] = Cluster_5_full_venues[Cluster_5_full_venues.columns.difference(['Beach', 'Lake', 'River'])].sum(axis=1, numeric_only=True)
# select the top 3 cities with a water body, ranked by amenities score
Cluster_5_full_venues_top3 = Cluster_5_full_venues[['City', 'Beach', 'Lake', 'River', 'Weighted_Score_Amenities']].sort_values(ascending=False, by='Weighted_Score_Amenities').reset_index(drop=True).head(3)
# eliminate any city whose amenities score is 0, i.e. one without amenities beneficial to a residential area nearby
Cluster_5_full_venues_top3 = Cluster_5_full_venues_top3[Cluster_5_full_venues_top3['Weighted_Score_Amenities'] > 0]
# merge with the previously fetched dataframe to get the districts, latitudes and longitudes
Cluster_5_full_venues_top3 = Cluster_5_full_venues_top3.merge(Cluster_5_cities, left_on='City', right_on='City')
# add the top cities to suitable_city_candidates for the final analysis (DataFrame.append was removed in pandas 2.0)
suitable_city_candidates = pd.concat([suitable_city_candidates, Cluster_5_full_venues_top3], ignore_index=True)
print("Find below the top 3 cities in this cluster, sorted in order of amenities score")
Cluster_5_full_venues_top3

Total number of cities in this Cluster 5:  8
Find below the top 3 cities in this cluster, sorted in order of amenities score


Unnamed: 0,City,Beach,Lake,River,Weighted_Score_Amenities,Muncipality,City Latitude,City Longitude
0,Panayam,0.0,2.5,0.0,2.0,Kollam,8.96276,76.6189
1,Veiloor,0.0,3.333333,0.0,1.333333,Trivandrum,8.61564,76.8289
2,Kattappana,0.0,2.5,0.0,1.0,Idukki,9.75617,77.1141


The dataframe below contains the filtered list of final contenders for the Residential Project location.

In [41]:
# for ease of visualization and analysis, sort the data by weighted score
suitable_city_candidates = suitable_city_candidates.sort_values(ascending=False, by = 'Weighted_Score_Amenities').reset_index(drop=True)
suitable_city_candidates

Unnamed: 0,City,Beach,Lake,River,Weighted_Score_Amenities,Muncipality,City Latitude,City Longitude
0,Panayam,0.0,2.5,0.0,2.0,Kollam,8.96276,76.6189
1,Pallikkara,6.666667,0.0,0.0,1.666667,Kasaragod,12.3851,75.044
2,Veiloor,0.0,3.333333,0.0,1.333333,Trivandrum,8.61564,76.8289
3,Kattappana,0.0,2.5,0.0,1.0,Idukki,9.75617,77.1141
4,Palissery,5.714286,0.0,0.0,0.571429,Thrissur,11.7521,75.4859
5,Chekkiad,0.0,0.0,5.0,0.5,Kozhikode,11.7194,75.6427
6,Kandalloor,5.0,0.0,0.0,0.5,Alappuzha,9.16963,76.4677


Let's plot these locations on the map of Kerala to visualize them.

In [42]:
map_kerala = folium.Map(location=[latitude, longitude], zoom_start=8)

# add markers to map
for lat, lng, muncipality, city in zip(suitable_city_candidates['City Latitude'], suitable_city_candidates['City Longitude'], suitable_city_candidates['Muncipality'], suitable_city_candidates['City']):
    label = '{}, {}'.format(city, muncipality)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_kerala)  
    
map_kerala

## Results and Discussion <a name="results"></a>

We prepared a dataset of the major cities in Kerala, enriched it with nearby venues using the Geocoder API and the Foursquare API, and applied the K-Means clustering method.
To help the K-Means algorithm segment the cities that match our requirements, we applied weightages to the city-wise venue mean values.

Finally, we analyzed the 5 clusters. We first identified the type of each cluster from the top nearby venues of its cities and judged whether the cluster could be a residential area. For the clusters that qualified, we then looked for cities that have a water body as well as amenities beneficial to a residential area nearby. To reduce the list for the conclusion, we kept the top 3 cities of each cluster, sorted by the sum of the amenities scores. This gives the cities from each cluster that have the qualities below:
*   Presence of water body
*   Basic amenities nearby
*   Not having large influx of commercial venues

Cluster 1 was a purely commercial area, with lots of shops, eateries, theaters, etc. nearby. Hence we excluded it from the final list.
Clusters 2, 3 and 5 are really good contenders for the Residential project location. From Cluster 2, only one city met the above requirements, but from Clusters 3 and 5 we shortlisted the top 3 cities each. Cluster 4, even though it has a very good presence of residential amenities nearby, unfortunately has no water body nearby, so no cities qualified from that cluster either.

<b>At the end, we have shortlisted 7 cities as the final contenders for the Residential Project location</b>


Next, we find the best city for each type of water body, based on the highest weighted amenities score.



1.   Suggested city with presence of <b>Beach</b> in neighbourhood



In [43]:
suitable_city_candidates[suitable_city_candidates['Beach'] > 0].sort_values(by='Weighted_Score_Amenities', ascending=False).head(1).reset_index(drop=True)[['City', 'Muncipality']]

Unnamed: 0,City,Muncipality
0,Pallikkara,Kasaragod


2.   Suggested city with presence of <b>Lake</b> in neighbourhood

In [44]:
suitable_city_candidates[suitable_city_candidates['Lake'] > 0].sort_values(by='Weighted_Score_Amenities', ascending=False).head(1).reset_index(drop=True)[['City', 'Muncipality']]

Unnamed: 0,City,Muncipality
0,Panayam,Kollam


3.   Suggested city with presence of <b>River</b> in neighbourhood

In [45]:
suitable_city_candidates[suitable_city_candidates['River'] > 0].sort_values(by='Weighted_Score_Amenities', ascending=False).head(1).reset_index(drop=True)[['City', 'Muncipality']]

Unnamed: 0,City,Muncipality
0,Chekkiad,Kozhikode
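
The three lookups above can also be written as a single loop. The sketch below rebuilds the shortlist from the values shown in the candidates table earlier in this notebook; `best_by_water_body` is a name introduced here for illustration.

```python
import pandas as pd

# Shortlisted cities and scores copied from the candidates table in this notebook
suitable_city_candidates = pd.DataFrame({
    'City': ['Panayam', 'Pallikkara', 'Veiloor', 'Kattappana', 'Palissery', 'Chekkiad', 'Kandalloor'],
    'Beach': [0.0, 6.666667, 0.0, 0.0, 5.714286, 0.0, 5.0],
    'Lake': [2.5, 0.0, 3.333333, 2.5, 0.0, 0.0, 0.0],
    'River': [0.0, 0.0, 0.0, 0.0, 0.0, 5.0, 0.0],
    'Weighted_Score_Amenities': [2.0, 1.666667, 1.333333, 1.0, 0.571429, 0.5, 0.5],
    'Muncipality': ['Kollam', 'Kasaragod', 'Trivandrum', 'Idukki', 'Thrissur', 'Kozhikode', 'Alappuzha'],
})

best_by_water_body = {}
for water_body in ['Beach', 'Lake', 'River']:
    # cities that have this type of water body nearby
    candidates = suitable_city_candidates[suitable_city_candidates[water_body] > 0]
    # idxmax returns the index label of the highest amenities score
    best = candidates.loc[candidates['Weighted_Score_Amenities'].idxmax()]
    best_by_water_body[water_body] = (best['City'], best['Muncipality'])

print(best_by_water_body)
```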


## Conclusion <a name="conclusion"></a>

As per the project requirement, we have to find a location in Kerala that meets the criteria below:
* presence of a water body in the locality
* presence of household amenities such as Grocery, Schools, Hospitals, etc.
* should not be a trending/happening locality, as that won't be suitable for a residential area

We could simply provide the top-rated city from the list of 7 shortlisted cities, based on the highest nearby-amenities score. But since the shortlist contains multiple types of water bodies, it would be more beneficial for the Residential Project stakeholders if we provide one option for each type of water body (Beach, Lake, River).

The suggested cities are listed below:

<dl>
  <dt><b><u>Beach Location</u></b></dt>
  <dd>Pallikkara City in Kasaragod District</dd>
  <dt><b><u>Lake Location</u></b></dt>
  <dd>Panayam City in Kollam District</dd>
  <dt><b><u>River Location</u></b></dt>
  <dd>Chekkiad City in Kozhikode District</dd>
</dl>

The final decision on which city to choose rests with the Residential Project team, as they have to make sure the area matches the theme they are planning for the Residential Complex.