London was founded by the Romans around 43 AD and is a city rich in history and geographical diversity. It has seen wars, plagues, fires and, most recently, a viral pandemic. As the capital of the United Kingdom, it is home to about 9 million residents and is a global hub for commerce and culture. Each year the city attracts tens of millions of visitors who come to marvel at historic sites such as the Tower of London and Buckingham Palace. Its population continues to grow each year as it attracts migrants from around the world in search of employment and a cosmopolitan lifestyle.
However, for all its charm, the city has a fundamental problem. There is a severe shortage of housing, and current supply is unable to keep up with the growth in demand.

This analysis aims to identify the more affordable neighbourhoods that offer the same local venues and services as more expensive ones. This will help anyone who lives in the city, or is looking to move there, find an affordable neighbourhood. Given the rapid rate of change in London, it will also help residents keep up with those changes and find affordable accommodation without compromising on their standard of living.

In this analysis I will cluster London neighbourhoods based on their local features. The feature dataset will be obtained using the Foursquare API to extract a list of nearby venues. This dataset will be used to group areas by the similarity of their venues. I will also bring in data on house price transactions to understand local house prices and compare them across neighbourhoods.

In [None]:
import pandas as pd
import folium
import numpy as np
from sklearn.cluster import KMeans
import matplotlib.cm as cm
import matplotlib.colors as colors
import matplotlib.pyplot as plt
import os
import requests, io
from scipy.spatial.distance import cdist

In [None]:
%matplotlib inline

In [None]:
CLIENT_ID = os.environ.get('FOURSQUARE_ID')
CLIENT_SECRET = os.environ.get('FOURSQUARE_SECRET')

VERSION = '20180605' # Foursquare API version
LIMIT = 100

In [None]:
postcode_url = 'https://www.doogal.co.uk/UKPostcodesCSV.ashx?area=London'
london_codes_all = pd.read_csv(postcode_url)

In [None]:
london_codes_all.head(5)

In [None]:
london_codes_all.columns

The data in this table contains a large amount of information that will be useful for further analysis. For now we will keep only the columns relevant to the geographical grouping of neighbourhoods and their coordinates, as these are what we will use to extract the venue feature dataset from Foursquare.

In [None]:
london_relevant_columns = london_codes_all[['District','Ward', 'Constituency', 'Postcode district', 'Postcode', 'Latitude', 'Longitude']]

In [None]:
london_relevant_columns.head()

In [None]:
london_relevant_columns.shape

From the table we can see that there are several ways to group neighbourhoods in London: district, ward, constituency and postcode district. Individual postcodes offer a further level of granularity. We will visualise each of these grouping layers using the folium maps library to understand the merits of each method.

In [None]:
# Central coordinates for London obtained from google maps
latitude = 51.5074
longitude = -0.1278

In [None]:
def create_map(frame, layer):
    # create map using latitude and longitude values
    map_folium = folium.Map(location=[latitude, longitude], zoom_start=10)

    # add one marker per row, labelled with the value in the `layer` column
    # (the loop variable is named `name` so it does not shadow the `layer` argument)
    for lat, lng, name in zip(frame['Latitude'], frame['Longitude'], frame[layer]):
        label = folium.Popup(str(name), parse_html=True)
        folium.CircleMarker(
            [lat, lng],
            radius=5,
            popup=label,
            color='blue',
            fill=True,
            fill_color='#3186cc',
            fill_opacity=0.7).add_to(map_folium)

    return map_folium

To obtain a centroid for each geographical grouping, I group by the layer and take the mean of the coordinates of the postcodes it contains.

In [None]:
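# average only the coordinate columns; the remaining columns are text and cannot be averaged
consituencies = london_relevant_columns.groupby('Constituency')[['Latitude', 'Longitude']].mean().reset_index()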
consituencies.head()

In [None]:
create_map(consituencies,'Constituency')

In [None]:
Borough = london_relevant_columns.groupby('District')[['Latitude', 'Longitude']].mean().reset_index()
create_map(Borough,'District')

In [None]:
districts = london_relevant_columns.groupby('Postcode district')[['Latitude', 'Longitude']].mean().reset_index()
create_map(districts,'Postcode district')

In [None]:
districts.shape

The map above shows that using postcode districts results in a dense cluster of points in central London. These points are much closer to each other than the points in the outer areas of the city. As a result, any analysis of local features may be skewed for these points, as they will be very similar geographically.

To correct this, we can either use a higher level of geographical aggregation or clean the data to thin out these clustered points. The code below cleans the data at postcode district level. However, I will revisit this later.

In [None]:
# Residential london only
districts = districts[
    (districts['Postcode district'].str[0]=='E') 
    | (districts['Postcode district'].str[0]=='N')
    | (districts['Postcode district'].str[0]=='S')
    | (districts['Postcode district'].str[0]=='W')
]
districts = districts[
    (districts['Postcode district'].str[0:2]!='EN')
    & (districts['Postcode district'].str[0:2]!='SM')
    & (districts['Postcode district'].str[0:2]!='WD')
    & (districts['Postcode district'].str[0:2]!='WC')
    & (districts['Postcode district'].str[0:2]!='W1')
    & (districts['Postcode district'].str[0:2]!='EC')
    & (districts['Postcode district'].str[0:3]!='SW1')
]
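One thing to keep in mind with the prefix tests above is that a two-character slice such as `str[0:2] != 'W1'` also drops W10 to W14, and `str[0:3] != 'SW1'` also drops SW10 to SW19. The sketch below compares the prefix test with an exact test on a small, made-up sample of district codes; the sample, the regular expression and the variable names are illustrative assumptions, not part of the original analysis.

```python
import pandas as pd

# Hypothetical sample of postcode districts, for illustration only
demo = pd.DataFrame({'Postcode district': ['W1', 'W1A', 'W10', 'SW1', 'SW11', 'E2', 'EC1']})
codes = demo['Postcode district']

# Prefix test as used above: drops every code that starts with 'W1' or 'SW1'
prefix_keep = codes[(codes.str[0:2] != 'W1') & (codes.str[0:3] != 'SW1')]

# Split each code into area letters, district number and optional suffix letter
parts = codes.str.extract(r'^(?P<area>[A-Z]{1,2})(?P<number>\d{1,2})(?P<suffix>[A-Z]?)$')
parts['number'] = parts['number'].astype(int)

# Exact test: drop only district number 1 in the W and SW areas
is_w1_or_sw1 = parts['area'].isin(['W', 'SW']) & (parts['number'] == 1)
exact_keep = codes[~is_w1_or_sw1]

print(prefix_keep.tolist())
print(exact_keep.tolist())
```

Whether W10 or SW11 should count as prime central London is a judgment call for the analysis; the sketch only makes the difference between the two tests visible.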

## Analyse data on UK house transactions

The URL below provides data on house purchases in 2020. It is detailed at transaction level for each property and includes the price paid and the date of the transaction. The data is published by HM Land Registry.

In [None]:
prices_2020_url = 'http://prod.publicdata.landregistry.gov.uk.s3-website-eu-west-1.amazonaws.com/pp-2020.csv'
prices_2020 = pd.read_csv(prices_2020_url, header=None)
prices_2020.head(2)

In [None]:
# prices_2020.rename(columns={
#     0:'id',
#     1:'price',
#     2:'transaction_date',
#     3:'postcode',
    
# })

We can see from the preview of the dataframe above that this dataset does not include headers and covers transactions from all parts of the UK. The column meanings can be inferred from knowledge of the data, and those familiar with UK addresses will recognise the expected fields such as postcode, city and address.

Given that this analysis is concerned with London neighbourhoods, I will filter the dataset below to London transactions only.

In [None]:
london_prices = prices_2020[prices_2020[13]=='GREATER LONDON']
london_prices.head(3)

The transaction data is merged here with the postcode dataset, which holds the coordinates and geographical layers for each postcode.

In [None]:
transactions_geo = london_prices.merge(london_relevant_columns, how='left', left_on=3, right_on='Postcode')
transactions_geo.shape

In [None]:
transactions_geo[transactions_geo['Postcode'].notnull()].shape
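The two shape checks above compare the total number of rows with the number that found a matching postcode. A minimal sketch of the same check using the `indicator` option of `merge` is shown below; the three transactions and the two-row lookup table are made-up values used only to illustrate the idea.

```python
import pandas as pd

# Hypothetical transactions: column 1 is the price and column 3 the postcode,
# mirroring the headerless layout of the price-paid file
sales = pd.DataFrame({1: [250000, 410000, 330000],
                      3: ['E1 6AN', 'N1 9GU', 'ZZ9 9ZZ']})

# Hypothetical postcode lookup table
lookup = pd.DataFrame({'Postcode': ['E1 6AN', 'N1 9GU'],
                       'Constituency': ['A', 'B']})

# indicator=True adds a '_merge' column saying where each row was found
merged = sales.merge(lookup, how='left', left_on=3, right_on='Postcode', indicator=True)

unmatched = merged[merged['_merge'] == 'left_only']   # transactions with no postcode match
match_rate = (merged['_merge'] == 'both').mean()      # share of transactions that matched

print(unmatched[[1, 3]])
print(round(match_rate, 3))
```

Listing the unmatched rows makes it easier to see whether the gaps are new postcodes, formatting differences or missing values.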

In [None]:
transactions_geo.head(3)

Now that we have transaction data alongside the geographical layers it belongs to, we can analyse the distribution of house prices across the different ways of grouping London addresses.

In [None]:
def group_transactions(frame, layer):
    avg_prices = frame[[layer,1]].groupby(layer).mean().reset_index()
    avg_prices[1] = avg_prices[1].astype(int)
    avg_prices.rename(columns={1:'avg_price'}, inplace=True)
    return avg_prices

In [None]:
def plot_hit_price(frame):
    bins = np.linspace(frame['avg_price'].min(), frame['avg_price'].max(), 25)
    plt.figure(figsize=(10,5))
    plt.title(frame.columns[0])
    return plt.hist(frame['avg_price'], bins=bins)

In [None]:
# Price distribution by postcode
postcode_prices = group_transactions(transactions_geo, 'Postcode district')
plot_hit_price(postcode_prices)

In [None]:
constituency_prices = group_transactions(transactions_geo, 'Constituency')
# bins = np.linspace(y.min(), y.max(), 25)
# plt.figure(figsize=(10,5))
# plt.hist(constituency_prices['avg_price'], bins=bins)

In [None]:
constituency_prices.head()

In [None]:
plot_hit_price(constituency_prices)

In [None]:
borough_prices = group_transactions(transactions_geo, 'District')
plot_hit_price(borough_prices)

We can see from the charts above that average house prices are roughly bell-shaped but have a long right tail that skews the data towards higher values. This is no surprise given the high premium placed on prime central London real estate. To help group this data, I will classify the prices into four bands by quartile below.

In [None]:
def price_classificaton(frame):
    lower = np.percentile(frame['avg_price'],25)
    median = np.percentile(frame['avg_price'],50)
    upper = np.percentile(frame['avg_price'],75)
    frame.loc[frame['avg_price'] > upper, 'price_band'] = 'expensive' 
    frame.loc[frame['avg_price'] <= upper, 'price_band'] = 'premium' 
    frame.loc[frame['avg_price'] <= median, 'price_band'] = 'mid range' 
    frame.loc[frame['avg_price'] <= lower, 'price_band'] = 'cheap' 
    return frame
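As a cross-check on the quartile banding above, the same four bands can be sketched with `pd.qcut`, which cuts a column at its quantiles and labels the resulting intervals. The prices below are made-up numbers, and `demo_prices` is a hypothetical name used only for this illustration.

```python
import pandas as pd

# Hypothetical average prices, for illustration only
demo_prices = pd.DataFrame({'avg_price': [100, 200, 300, 400, 500, 600, 700, 800]})

# q=4 splits at the 25th, 50th and 75th percentiles; each interval includes its upper edge,
# which matches the '<=' comparisons used in price_classificaton
demo_prices['price_band'] = pd.qcut(
    demo_prices['avg_price'],
    q=4,
    labels=['cheap', 'mid range', 'premium', 'expensive'],
)

bands = demo_prices['price_band'].astype(str).tolist()
print(bands)
```

One difference worth noting: `pd.qcut` returns an ordered categorical column, which keeps the bands in price order when they are sorted or plotted.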

In [None]:
postcode_prices = price_classificaton(group_transactions(transactions_geo, 'Postcode district'))

In [None]:
constituency_prices = price_classificaton(group_transactions(transactions_geo, 'Constituency'))

### Use Foursquare to get local venues

I will extract neighbourhood features at two geographical layers, postcode district and constituency. I will repeat the analysis for both to understand what effect, if any, the geographical proximity of central London postcodes may have.

In [None]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
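The list comprehension in `getNearbyVenues` indexes `v['venue']['categories'][0]`, which would raise an `IndexError` if a venue came back with an empty category list. The sketch below shows one way the flattening step could tolerate that case. The helper name `parse_explore_items` and the sample payload are hypothetical; the payload is only shaped after the fields the function above reads and is not a real API response.

```python
def parse_explore_items(name, lat, lng, items):
    """Flatten venue items into rows, tolerating a missing or empty category list."""
    rows = []
    for item in items:
        venue = item.get('venue', {})
        location = venue.get('location', {})
        categories = venue.get('categories') or []
        # fall back to a placeholder instead of failing on categories[0]
        category = categories[0]['name'] if categories else 'Unknown'
        rows.append((name, lat, lng,
                     venue.get('name'),
                     location.get('lat'),
                     location.get('lng'),
                     category))
    return rows


# Hypothetical payload with the same nesting that getNearbyVenues reads
sample_items = [
    {'venue': {'name': 'Demo Cafe',
               'location': {'lat': 51.5, 'lng': -0.1},
               'categories': [{'name': 'Coffee Shop'}]}},
    {'venue': {'name': 'No Category Ltd',
               'location': {'lat': 51.6, 'lng': -0.2},
               'categories': []}},
]

rows = parse_explore_items('E1', 51.51, -0.07, sample_items)
for row in rows:
    print(row)
```

Keeping the parsing separate from the HTTP request also makes it possible to test the flattening logic without calling the API.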

In [None]:
# london_venues_raw = getNearbyVenues(names=districts['Postcode district'],
#                                    latitudes=districts['Latitude'],
#                                    longitudes=districts['Longitude']
#                                   )

In [None]:
# london_venues_raw.to_csv('london_venues.csv')

In [None]:
london_venues_raw = pd.read_csv('london_venues.csv')

In [None]:
# # consituencies
# consitutency_venues_raw = getNearbyVenues(names=consituencies['Constituency'],
#                                    latitudes=consituencies['Latitude'],
#                                    longitudes=consituencies['Longitude']
#                                   )

In [None]:
# consitutency_venues_raw.to_csv('constituency_venues.csv')

In [None]:
london_venues = london_venues_raw.copy()
london_venues.shape

In [None]:
london_venues.head()

In [None]:
london_venues['Venue Category'].value_counts().head()

In [None]:
len(london_venues['Venue Category'].unique())

As we can see above, this dataset has a lot of features (385 venue categories). This may affect the performance of the clustering algorithm, and we may need to carry out some feature engineering to improve it.

### one hot encoding postcode data

In [None]:
def pre_processing(frame):
    london_one_hot = pd.get_dummies(frame[['Venue Category']], prefix="", prefix_sep="")
    london_one_hot['layer'] = frame['Neighborhood']
    london_grouped_category = london_one_hot.groupby('layer').mean().reset_index()
    return london_grouped_category
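To make the effect of `pre_processing` concrete, here is a small sketch on made-up data: each venue category becomes a 0/1 column, and averaging those columns per neighbourhood gives the share of venues in each category. The four venues below are illustrative only.

```python
import pandas as pd

# Hypothetical venues, for illustration only
demo_venues = pd.DataFrame({
    'Neighborhood': ['E1', 'E1', 'E1', 'N1'],
    'Venue Category': ['Pub', 'Pub', 'Cafe', 'Park'],
})

# One 0/1 column per category (dtype=float so the mean is a plain proportion)
one_hot = pd.get_dummies(demo_venues[['Venue Category']], prefix='', prefix_sep='', dtype=float)
one_hot['layer'] = demo_venues['Neighborhood']

# Mean of the 0/1 columns = fraction of the neighbourhood's venues in each category
profile = one_hot.groupby('layer').mean()
print(profile)
```

Each row of `profile` sums to 1, so neighbourhoods with very different venue counts are still comparable.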

In [None]:
london_grouped = pre_processing(london_venues)
london_grouped.shape

In [None]:
london_grouped.head()

### summary of top venues

In [None]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [None]:
def get_top_venues(frame):
    num_top_venues = 10

    indicators = ['st', 'nd', 'rd']

    # create columns according to number of top venues
    columns = ['layer']
    for ind in np.arange(num_top_venues):
        try:
            columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
        except IndexError:
            # ranks beyond 3rd fall outside `indicators` and take the 'th' suffix
            columns.append('{}th Most Common Venue'.format(ind+1))

    # create a new dataframe
    top_venues = pd.DataFrame(columns=columns)
    top_venues['layer'] = frame['layer']

    for ind in np.arange(frame.shape[0]):
        top_venues.iloc[ind, 1:] = return_most_common_venues(frame.iloc[ind, :], num_top_venues)
        
    return top_venues
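The row-by-row loop above can also be expressed with `np.argsort`, which ranks all rows at once. The sketch below is an alternative formulation on made-up frequencies, not the method used in this notebook; the frame `freq` and the column labels are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical category frequencies per neighbourhood, for illustration only
freq = pd.DataFrame({
    'layer': ['E1', 'N1'],
    'Cafe': [0.2, 0.5],
    'Park': [0.1, 0.4],
    'Pub': [0.7, 0.1],
})

num_top = 2
values = freq.drop(columns='layer').to_numpy()
categories = freq.columns[1:].to_numpy()

# argsort on the negated values gives column positions from most to least frequent
order = np.argsort(-values, axis=1)[:, :num_top]

top = pd.DataFrame(categories[order], columns=['1st Most Common Venue', '2nd Most Common Venue'])
top.insert(0, 'layer', freq['layer'])
print(top)
```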

In [None]:
neighborhoods_venues_sorted = get_top_venues(london_grouped)
neighborhoods_venues_sorted.shape

In [None]:
neighborhoods_venues_sorted.head()

### clustering postcode data

In [None]:
def elbow_method(frame):
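    # drop the label column if present; errors='ignore' lets this accept frames where it was already removed
    elbow_df = frame.drop(columns='layer', errors='ignore')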
#     elbow_df = frame
    distortions = []
    K = range(1,20)
    for k in K:
        kmeanModel = KMeans(n_clusters=k, random_state=0).fit(elbow_df)
        distortions.append(sum(np.min(cdist(elbow_df, kmeanModel.cluster_centers_, 'euclidean'), axis=1)) / elbow_df.shape[0])

    # Plot the elbow
    plt.plot(K, distortions, 'bx-')
    plt.xlabel('k')
    plt.ylabel('Distortion')
    plt.title('optimal k')
    plt.show()
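The distortion plotted by `elbow_method` is the average distance from each point to its nearest cluster centre. A minimal NumPy sketch of that quantity, on four made-up points with hand-picked centres, is shown below; the function name `distortion` and the data are illustrative assumptions.

```python
import numpy as np

def distortion(points, centres):
    """Mean Euclidean distance from each point to its nearest centre."""
    # pairwise distances, shape (n_points, n_centres)
    dists = np.linalg.norm(points[:, None, :] - centres[None, :, :], axis=2)
    return dists.min(axis=1).mean()


# Hypothetical points forming two tight pairs, for illustration only
points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])

one_centre = np.array([[5.0, 1.0]])
two_centres = np.array([[0.0, 1.0], [10.0, 1.0]])

print(distortion(points, one_centre))   # every point is sqrt(26) from the single centre
print(distortion(points, two_centres))  # every point is 1.0 from its nearest centre
```

Distortion can only fall or stay level as centres are added, which is why the elbow method looks for the point where the improvement levels off rather than for a minimum.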

In [None]:
def kmeans_plot(frame,k):
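    # drop the label column if present; errors='ignore' lets this accept frames where it was already removed
    hist_plot_df = frame.drop(columns='layer', errors='ignore')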
#     hist_plot_df = frame
    # run k-means clustering
    kmeans = KMeans(n_clusters=k, random_state=0).fit(hist_plot_df)
    plt.title(f'k={k}')
    # plot
    return plt.hist(kmeans.labels_, bins=k)

In [None]:
elbow_method(london_grouped)

The chart above shows that there is no obvious optimal value for k, and additional clusters continue to reduce the distortion. This may be due to the very high number of features in this dataset. I will revisit this once the initial analysis is complete.

To help further in identifying an optimal k, I will plot the frequency of each cluster below. This will give an insight into how successful the clustering approach has been in finding similarities between neighbourhoods. 

In [None]:
kmeans_plot(london_grouped, 10)

In [None]:
kmeans_plot(london_grouped, 7)

In [None]:
kmeans_plot(london_grouped, 5)

Since adding clusters continues to reduce distortion, I will use k = 7 for my analysis. The histogram plots above indicate that this gives us an additional cluster with a substantial number of data points, suggesting the algorithm has found additional similarities between neighbourhoods.
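When the elbow plot gives no clear answer, a complementary check is the silhouette score, which rewards clusters that are compact and well separated. This is not used in the original analysis; the sketch below only illustrates the idea on synthetic, well-separated blobs, and all names in it (`X_demo`, `scores`, `best_k`) are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic data: three well-separated blobs of 20 points each
rng = np.random.default_rng(0)
blob_centres = np.array([[0.0, 0.0], [10.0, 10.0], [20.0, 0.0]])
X_demo = np.vstack([c + rng.normal(scale=0.5, size=(20, 2)) for c in blob_centres])

# Fit k-means for several cluster counts and record the mean silhouette score
scores = {}
for n_clusters in range(2, 7):
    demo_labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(X_demo)
    scores[n_clusters] = silhouette_score(X_demo, demo_labels)

best_k = max(scores, key=scores.get)
print(scores)
print(best_k)
```

Unlike distortion, the silhouette score does not automatically improve as k grows, so it can point to a specific value. On sparse one-hot venue data the scores may still be low and close together.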

In [None]:
# tableau_postcodes = london_grouped.insert(0, 'Cluster Labels', labels)
# london_grouped.to_csv('postcodes_tableau.csv')

In [None]:
k = 7
grouped_clustering = london_grouped.drop(columns='layer')
kmeans = KMeans(n_clusters=k, random_state=0).fit(grouped_clustering)
labels = kmeans.labels_
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', labels)
len(kmeans.labels_)

In [None]:
postcode_map = districts.merge(neighborhoods_venues_sorted, left_on='Postcode district', right_on='layer')

In [None]:
postcode_map.head()

### Map of london clusters

In [None]:
def cluster_map(frame, k):
    # create map
    map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

    # set color scheme for the clusters: one evenly spaced rainbow colour per cluster
    colors_array = cm.rainbow(np.linspace(0, 1, k))
    rainbow = [colors.rgb2hex(i) for i in colors_array]

    # add markers to the map (cluster labels run from 0 to k-1)
    for lat, lon, poi, cluster in zip(frame['Latitude'], frame['Longitude'], frame['layer'], frame['Cluster Labels']):
        label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
        folium.CircleMarker(
            [lat, lon],
            radius=5,
            popup=label,
            color=rainbow[cluster],
            fill=True,
            fill_color=rainbow[cluster],
            fill_opacity=0.7).add_to(map_clusters)

    return map_clusters

In [None]:
cluster_map(postcode_map,k)

We can see from the map above that the k-means algorithm has been able to group certain areas of London based on their local venues. We can look at the most common types of venues in those neighbourhoods to understand their common features. I will also join the average house price data back onto this data frame so we can get an idea of price bands.

In [None]:
def view_df(frame, cluster):
    return frame.loc[frame['Cluster Labels'] == cluster, frame.columns[[0] + list(range(4, frame.shape[1]))]]

In [None]:
postcode_cluster_prices = postcode_map.merge(postcode_prices, how='left', on='Postcode district')
postcode_cluster_prices.shape
# postcode_cluster_prices.to_csv('postcodes_tableau.csv')

In [None]:
view_df(postcode_cluster_prices,0)

In [None]:
view_df(postcode_cluster_prices,1)

In [None]:
view_df(postcode_cluster_prices,2)

In [None]:
view_df(postcode_cluster_prices,3)

In [None]:
view_df(postcode_cluster_prices,4)

In [None]:
view_df(postcode_cluster_prices,5)

In [None]:
view_df(postcode_cluster_prices,6)

In [None]:
i = 0  # cluster label to inspect
postcode_cluster_prices.loc[postcode_cluster_prices['Cluster Labels'] == i, postcode_cluster_prices.columns[list(range(4, postcode_cluster_prices.shape[1]))]].stack().value_counts().head().index

In [None]:
# Scratch template (dataframe and column are placeholders, not defined in this notebook):
# dataframe[column].value_counts().index.tolist()

In [None]:
# Top five most frequent values per cluster, with how often each appears
value_cols = postcode_cluster_prices.columns[4:]
rows = []
for i in range(7):
    in_cluster = postcode_cluster_prices['Cluster Labels'] == i
    top5 = postcode_cluster_prices.loc[in_cluster, value_cols].stack().value_counts().head()
    # one row per (cluster, venue) pair, each with its own count
    for venue, count in top5.items():
        rows.append({'cluster': 'cluster {}'.format(i), 'venue': venue, 'count': count})
df_top5_venue_counts = pd.DataFrame(rows, columns=['cluster', 'venue', 'count'])

In [None]:
# df_top5_venue_counts

In [None]:
df_top5_venue_counts

In [None]:
postcode_cluster_prices.loc[postcode_cluster_prices['Cluster Labels'] == 5, postcode_cluster_prices.columns[list(range(4, postcode_cluster_prices.shape[1]))]].stack().value_counts().head()

In [None]:
postcode_cluster_prices.loc[postcode_cluster_prices['Cluster Labels'] == 6, postcode_cluster_prices.columns[list(range(4, postcode_cluster_prices.shape[1]))]].stack().value_counts().head()
# postcode_cluster_prices.stack().value_counts()

### repeating the analysis excluding prime central london real estate

In [None]:
london_venues = london_venues[
    (london_venues['Neighborhood'].str[0]=='E') 
    | (london_venues['Neighborhood'].str[0]=='N')
    | (london_venues['Neighborhood'].str[0]=='S')
    | (london_venues['Neighborhood'].str[0]=='W')
]
london_venues = london_venues[
    (london_venues['Neighborhood'].str[0:2]!='EN')
    & (london_venues['Neighborhood'].str[0:2]!='SM')
    & (london_venues['Neighborhood'].str[0:2]!='WD')
    & (london_venues['Neighborhood'].str[0:2]!='WC')
    & (london_venues['Neighborhood'].str[0:2]!='EC')
]
london_venues.shape

In [None]:
london_grouped = pre_processing(london_venues)
london_grouped.shape

### summary of top venues

In [None]:
neighborhoods_venues_sorted = get_top_venues(london_grouped)
neighborhoods_venues_sorted.shape

In [None]:
elbow_method(london_grouped)

In [None]:
kmeans_plot(london_grouped, 10)

In [None]:
kmeans_plot(london_grouped, 7)

In [None]:
kmeans_plot(london_grouped, 5)

In [None]:
k = 5
grouped_clustering = london_grouped.drop(columns='layer')
kmeans = KMeans(n_clusters=k, random_state=0).fit(grouped_clustering)
labels = kmeans.labels_
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', labels)
len(kmeans.labels_)

In [None]:
postcode_map = districts.merge(neighborhoods_venues_sorted, left_on='Postcode district', right_on='layer')

In [None]:
cluster_map(postcode_map,k)

In [None]:
postcode_cluster_prices = postcode_map.merge(postcode_prices, how='left', on='Postcode district')
postcode_cluster_prices.shape

In [None]:
view_df(postcode_cluster_prices,0)

In [None]:
view_df(postcode_cluster_prices,1)

In [None]:
view_df(postcode_cluster_prices,2)

In [None]:
view_df(postcode_cluster_prices,3)

In [None]:
view_df(postcode_cluster_prices,4)

### Feature engineering before preprocessing to improve clustering

The elbow method charts above showed that we cannot find an optimal value for k, as there was no clear elbow point marking the best value. This is likely because the dataset currently has a large number of features (385 including outer London and central London postcodes). As we can see from the features below, many of them are very similar and differ only through slight name changes or subcategories. For example, the various restaurants by world cuisine could be grouped together as restaurants. In this section I will carry out some feature engineering to reduce the number of features by combining these categories, to improve the clustering algorithm.

In [None]:
london_venues['Venue Category'].unique()

In [None]:
len(london_venues['Venue Category'].unique())

In [None]:
Restaurant_msk = london_venues['Venue Category'].str.contains('Restaurant')
Bar_msk = london_venues['Venue Category'].str.contains('Bar')
Shop_msk = london_venues['Venue Category'].str.contains('Shop')
Store_msk = london_venues['Venue Category'].str.contains('Store')
Gym_msk = london_venues['Venue Category'].str.contains('Gym')
food_place_msk = london_venues['Venue Category'].str.contains('Place')
museum_msk = london_venues['Venue Category'].str.contains('Museum')

In [None]:
len(london_venues.loc[Shop_msk,'Venue Category'].unique())

In [None]:
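def feature_engineering(frame):
    # work on a copy so the caller's frame (possibly a filtered slice) is not modified in place
    frame = frame.copy()
    # str.contains matches substrings, so any category whose name contains a keyword is relabelled
    Restaurant_msk = frame['Venue Category'].str.contains('Restaurant')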
    Bar_msk = frame['Venue Category'].str.contains('Bar')
    Shop_msk = frame['Venue Category'].str.contains('Shop')
    Store_msk = frame['Venue Category'].str.contains('Store')
    Gym_msk = frame['Venue Category'].str.contains('Gym')
    food_place_msk = frame['Venue Category'].str.contains('Place')
    museum_msk = frame['Venue Category'].str.contains('Museum')
    frame.loc[Restaurant_msk, 'Venue Category'] = 'Restaurant'
    frame.loc[Bar_msk, 'Venue Category'] = 'Bar'
    frame.loc[Shop_msk, 'Venue Category'] = 'Shop'
    frame.loc[Store_msk, 'Venue Category'] = 'Store'
    frame.loc[Gym_msk, 'Venue Category'] = 'Gym'
    frame.loc[food_place_msk, 'Venue Category'] = 'Place'
    frame.loc[museum_msk, 'Venue Category'] = 'Museum'
    return frame
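Because the masks above match substrings, a keyword such as 'Bar' also matches longer words that merely start with it. The sketch below shows a rule-list variant with an optional whole-word match; the rule list, the function name `consolidate` and the sample categories are hypothetical and only illustrate the trade-off.

```python
import re
import pandas as pd

# Hypothetical keyword -> label rules, checked in order (first match wins)
RULES = [('Restaurant', 'Restaurant'), ('Bar', 'Bar'), ('Shop', 'Shop'), ('Store', 'Store')]

def consolidate(category, whole_word=False):
    """Return the label of the first rule whose keyword appears in the category."""
    for keyword, label in RULES:
        pattern = re.escape(keyword)
        if whole_word:
            pattern = r'\b' + pattern + r'\b'  # require the keyword to stand alone as a word
        if re.search(pattern, category):
            return label
    return category  # no rule matched: keep the original category


# Made-up category names, for illustration only
demo = pd.Series(['Sushi Restaurant', 'Wine Bar', 'Coffee Shop', 'Park', 'Salon / Barbershop'])

substring_result = demo.map(consolidate).tolist()
whole_word_result = demo.map(lambda c: consolidate(c, whole_word=True)).tolist()

print(substring_result)
print(whole_word_result)
```

Note one design difference from `feature_engineering`: here the first matching rule wins, whereas the sequential `.loc` assignments above let the last matching mask overwrite earlier ones.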

In [None]:
london_venues = feature_engineering(london_venues)

In [None]:
len(london_venues['Venue Category'].unique())

As we can see, the feature engineering has halved the number of features in the dataset without losing valuable information. This should improve the clustering. The approach will now be repeated with this dataset.

### Repeating clustering with feature engineered dataset

In [None]:
london_grouped = pre_processing(london_venues)
london_grouped.shape

In [None]:
neighborhoods_venues_sorted = get_top_venues(london_grouped)
neighborhoods_venues_sorted.shape

In [None]:
elbow_method(london_grouped)

In [None]:
kmeans_plot(london_grouped, 7)

In [None]:
kmeans_plot(london_grouped, 10)

In [None]:
k = 10
grouped_clustering = london_grouped.drop(columns='layer')
kmeans = KMeans(n_clusters=k, random_state=0).fit(grouped_clustering)
labels = kmeans.labels_
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', labels)
len(kmeans.labels_)

In [None]:
postcode_map = districts.merge(neighborhoods_venues_sorted, left_on='Postcode district', right_on='layer')

In [None]:
cluster_map(postcode_map,k)

In [None]:
postcode_cluster_prices = postcode_map.merge(postcode_prices, how='left', on='Postcode district')
postcode_cluster_prices.shape

In [None]:
view_df(postcode_cluster_prices,0)

In [None]:
view_df(postcode_cluster_prices,1)

In [None]:
view_df(postcode_cluster_prices,2)

In [None]:
view_df(postcode_cluster_prices,3)

In [None]:
view_df(postcode_cluster_prices,4)

In [None]:
view_df(postcode_cluster_prices,5)

In [None]:
view_df(postcode_cluster_prices,6)

In [None]:
view_df(postcode_cluster_prices,7)

In [None]:
view_df(postcode_cluster_prices,8)

In [None]:
postcode_cluster_prices.loc[postcode_cluster_prices['Cluster Labels'] == 1, postcode_cluster_prices.columns[list(range(4, postcode_cluster_prices.shape[1]))]].stack().value_counts().head()

The feature engineering has greatly improved the clustering: we can identify clear characteristics for each cluster that are distinct from the others. The addition of the price data also allows us to compare the relative prices of similar neighbourhoods side by side.

### Kmeans at constituency level

In [None]:
# add information on london zone

In [None]:
def pre_processing(frame):
    london_one_hot = pd.get_dummies(frame[['Venue Category']], prefix="", prefix_sep="")
    london_one_hot['layer'] = frame['Neighborhood']
    london_grouped_category = london_one_hot.groupby('layer').mean().reset_index()
    return london_grouped_category

In [None]:
# constits = consitutency_venues_raw.copy()
constits = pd.read_csv('constituency_venues.csv')
constits = feature_engineering(constits)
constits_one_hot = pre_processing(constits)
elbow_method(constits_one_hot)

In [None]:
kmeans_plot(constits_one_hot,5)

In [None]:
kmeans_plot(constits_one_hot,15)

In [None]:
constit_venues_sorted = get_top_venues(constits_one_hot)

In [None]:
k = 15
constits_one_hot = constits_one_hot.drop(columns='layer')
kmeans = KMeans(n_clusters=k, random_state=0).fit(constits_one_hot)
labels = kmeans.labels_
constit_venues_sorted.insert(0, 'Cluster Labels', labels)

In [None]:
boroughs_map_df = consituencies.merge(constit_venues_sorted, left_on='Constituency', right_on='layer')

In [None]:
cluster_map(boroughs_map_df,k)

In [None]:
borough_cluster_prices = boroughs_map_df.merge(constituency_prices, how='left', on='Constituency')
borough_cluster_prices.shape

In [None]:
view_df(borough_cluster_prices,0)

In [None]:
view_df(borough_cluster_prices,3)

In [None]:
view_df(borough_cluster_prices,4)

In [None]:
view_df(borough_cluster_prices,6)

In [None]:
view_df(borough_cluster_prices,8)

In [None]:
view_df(boroughs_map_df,12)

In [None]:
view_df(boroughs_map_df,14)

The analysis above shows that using a higher-level geography does not give good separation of clusters. There are two likely reasons for this: 1) the Foursquare API returns less feature data for the broader geographies, and 2) the higher-level geography covers such a diverse range of features that the algorithm is unable to distinguish between them.

### Kmeans at constituency level with zonal data

In [None]:
def pre_processing(frame):
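    # the prefix keeps the zone column names as strings (e.g. 'zone_2.0') alongside the venue category names
    x1 = pd.get_dummies(frame['London zone'], prefix='zone')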
    x2 = pd.get_dummies(frame['Venue Category'])
    london_one_hot = pd.concat([x1,x2], axis=1)
    london_one_hot['layer'] = frame['Neighborhood']
    london_grouped_category = london_one_hot.groupby('layer').mean().reset_index()
    return london_grouped_category

In [None]:
london_relevant_columns = london_codes_all[['District','Ward', 'Constituency', 'Postcode district', 'Postcode', 'London zone', 'Latitude', 'Longitude']]

In [None]:
consituencies = london_relevant_columns.groupby('Constituency').agg({'Latitude':'mean','Longitude':'mean', 'London zone':'median'}).reset_index()
consituencies.head()

In [None]:
# constits = consitutency_venues_raw.copy()
constits = pd.read_csv('constituency_venues.csv')
constits = constits.merge(consituencies[['Constituency','London zone']], how='left', left_on='Neighborhood', right_on='Constituency')
constits = constits.drop(columns='Constituency')
constits_one_hot = pre_processing(constits)
elbow_method(constits_one_hot)

In [None]:
constit_venues_sorted = get_top_venues(constits_one_hot)

In [None]:
kmeans_plot(constits_one_hot,6)

In [None]:
k = 6
constits_one_hot = constits_one_hot.drop(columns='layer')
kmeans = KMeans(n_clusters=k, random_state=0).fit(constits_one_hot)
labels = kmeans.labels_
constit_venues_sorted.insert(0, 'Cluster Labels', labels)

In [None]:
boroughs_map_df = consituencies.merge(constit_venues_sorted, left_on='Constituency', right_on='layer')

In [None]:
cluster_map(boroughs_map_df,k)

In [None]:
view_df(boroughs_map_df,0)

In [None]:
view_df(boroughs_map_df,1)

In [None]:
view_df(boroughs_map_df,2)

In [None]:
view_df(boroughs_map_df,3)

In [None]:
view_df(boroughs_map_df,4)

In [None]:
view_df(boroughs_map_df,5)

### Kmeans at constituency level with price data

In [None]:
def pre_processing(frame):
    london_one_hot = pd.get_dummies(frame[['Venue Category']], prefix="", prefix_sep="")
    london_one_hot['layer'] = frame['Neighborhood']
    london_grouped_category = london_one_hot.groupby('layer').sum().reset_index()
    return london_grouped_category

In [None]:
def pre_processing_price(frame):
    london_one_hot = pd.get_dummies(frame[['price_band']], prefix="", prefix_sep="")
    london_one_hot['layer'] = frame['Constituency']
    return london_one_hot

In [None]:
constituency_prices = price_classificaton(group_transactions(transactions_geo, 'Constituency'))

In [None]:
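# consitutency_venues_raw is only created in the commented-out API call above, so load the saved extract
constits = pd.read_csv('constituency_venues.csv')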
constits_one_hot = pre_processing(constits)
price_one_hot = pre_processing_price(constituency_prices)

factor = 1
columns = price_one_hot['layer']
price_one_hot = price_one_hot[['cheap', 'expensive', 'mid range', 'premium']]/factor
price_one_hot['layer'] = columns
price_one_hot.shape

In [None]:
price_one_hot.head()

In [None]:
price_cluster = price_one_hot.merge(constits_one_hot, how='inner', on='layer')
constit_venues_sorted = get_top_venues(constits_one_hot)

In [None]:
price_cluster = price_cluster.drop(columns='layer')
elbow_method(price_cluster)

In [None]:
kmeans_plot(price_cluster,6)

In [None]:
k = 5
kmeans = KMeans(n_clusters=k, random_state=0).fit(price_cluster)

In [None]:
labels = kmeans.labels_
len(labels)
constit_venues_sorted.insert(0, 'Cluster Labels', labels)

In [None]:
constit_venues_sorted.head()

In [None]:
x = rejoin_clusters(consituencies,constit_venues_sorted)
constituency_prices.rename(columns={'Constituency':'layer'}, inplace=True)
final_df = x.merge(constituency_prices, on='layer')

In [None]:
final_df.head()

In [None]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=10)

# set color scheme for the clusters
x = np.arange(kcluster_price)
ys = [i + x + (i*x)**2 for i in range(kcluster_price)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, band, cluster in zip(final_df['Latitude'], final_df['Longitude'], final_df['layer'], final_df['price_band'], final_df['Cluster Labels']):
    label = folium.Popup(str(poi) +'-' + str(band) + ' ' + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

In [None]:
final_df.loc[final_df['Cluster Labels'] == 0, final_df.columns[[0] + list(range(4, final_df.shape[1]))]]

In [None]:
final_df.loc[final_df['Cluster Labels'] == 1, final_df.columns[[0] + list(range(4, final_df.shape[1]))]]

In [None]:
final_df.loc[final_df['Cluster Labels'] == 2, final_df.columns[[0] + list(range(4, final_df.shape[1]))]]

In [None]:
final_df.loc[final_df['Cluster Labels'] == 3, final_df.columns[[0] + list(range(4, final_df.shape[1]))]]

In [None]:
final_df.loc[final_df['Cluster Labels'] == 4, final_df.columns[[0] + list(range(4, final_df.shape[1]))]]

In [None]:
final_df.loc[final_df['Cluster Labels'] == 5, final_df.columns[[0] + list(range(4, final_df.shape[1]))]]

To evaluate how local venues may affect house prices, I need to merge the price data (numerical) with the venue data (categorical) and evaluate them together in a classification model. The data must be pre-processed and normalised. The categorical data has already been normalised in part, because I grouped it by the mean frequency of occurrence of each venue category.
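
As a minimal sketch of the merge-and-normalise step described above, the snippet below joins a toy table of average prices to a toy table of venue counts and standardises every column. The district codes, venue columns and prices are made up for illustration, and the use of `StandardScaler` is an assumption rather than the method this notebook settles on.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical venue counts per postcode district (illustrative values only)
venues = pd.DataFrame({
    'Postcode district': ['E1', 'N1', 'SW1', 'W2'],
    'Cafe': [5, 3, 8, 2],
    'Pub': [7, 4, 6, 1],
})

# Hypothetical average prices for the same districts
prices = pd.DataFrame({
    'Postcode district': ['E1', 'N1', 'SW1', 'W2'],
    'avg_price': [550000, 700000, 1200000, 900000],
})

# Merge the numerical prices with the venue counts on the shared key
merged = venues.merge(prices, on='Postcode district', how='inner')

# Standardise so that prices (hundreds of thousands) and venue counts
# (single digits) end up on a comparable scale: mean 0, unit variance
features = merged.drop(columns='Postcode district')
scaled = pd.DataFrame(
    StandardScaler().fit_transform(features),
    columns=features.columns,
    index=merged['Postcode district'],
)
print(scaled.round(2))
```

Without a step like this, any distance-based model would be dominated by the price column simply because its values are so much larger than the venue counts.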

In [None]:
from sklearn import preprocessing

In [None]:
london_grouped.shape

In [None]:
# london_grouped_for_price = london_grouped.copy()

Aggregating the feature data set by summing the one-hot columns gives the frequency of occurrence of each venue category, so that it can be used with the preprocessing libraries.
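
To make the aggregation concrete, here is a small sketch on a hypothetical venue list (the districts and categories are invented for illustration). Each venue is one-hot encoded, and summing per district turns the indicator columns into venue counts.

```python
import pandas as pd

# Hypothetical venue list: one row per venue found near a postcode district
venues_demo = pd.DataFrame({
    'Postcode district': ['E1', 'E1', 'E1', 'N1', 'N1'],
    'Venue Category': ['Cafe', 'Pub', 'Cafe', 'Pub', 'Park'],
})

# One-hot encode the category: one 0/1 column per venue type
one_hot = pd.get_dummies(venues_demo[['Venue Category']],
                         prefix="", prefix_sep="", dtype=int)
one_hot['Postcode district'] = venues_demo['Postcode district']

# Summing the indicators per district gives the count of each venue type
counts = one_hot.groupby('Postcode district').sum().reset_index()
print(counts)
```

Replacing `.sum()` with `.mean()` would instead give the share of each venue type within a district, which is the form used for the earlier clustering.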

In [None]:
london_grouped_for_price = london_one_hot.groupby('Postcode district').sum().reset_index()

Obtaining the target prediction variable for the feature set

In [None]:
from sklearn import metrics
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
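from sklearn.preprocessing import scale
# needed for the train/test split further down
from sklearn.model_selection import train_test_split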

In [None]:
prices_venue = avg_prices.merge(london_grouped_for_price[['Postcode district']], how='right', on='Postcode district')
y = prices_venue['avg_price']
y.shape

In [None]:
X = london_grouped_for_price.copy().drop('Postcode district', axis=1)

In [None]:
# X = X.replace([np.inf, -np.inf], np.nan)
# X[X.isna().any(axis=1)]
# X.shape

Using PCA to reduce the number of feature dimensions
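
A minimal sketch of what this reduction looks like with scikit-learn, using a hypothetical venue-count matrix in place of the real feature set. The column names and values are invented, and standardising before PCA is an assumption about how the step would be applied here.

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical venue-count matrix: 6 districts x 4 venue categories
X_demo = pd.DataFrame(
    [[5, 7, 1, 0],
     [3, 4, 2, 1],
     [8, 6, 0, 2],
     [2, 1, 4, 3],
     [6, 5, 1, 1],
     [1, 2, 5, 4]],
    columns=['Cafe', 'Pub', 'Park', 'Gym'],
)

# PCA is sensitive to scale, so standardise the columns first
X_scaled = StandardScaler().fit_transform(X_demo)

# Project the 4 venue columns onto 2 principal components
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)

print(X_reduced.shape)
# Share of the total variance retained by each component
print(pca.explained_variance_ratio_)
```

Checking `explained_variance_ratio_` is a quick way to judge whether two components keep enough of the information in the venue columns to be worth using.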

In [None]:
# X = PCA(n_components=2).fit_transform(X)

In [None]:
# y = np.array(y)

In [None]:
# X = preprocessing.StandardScaler().fit(X).transform(X)
# y = y.reshape(-1, 1)
# y = preprocessing.StandardScaler().fit(y).transform(y)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=4)

In [None]:
# plt.hist(X, bins='auto')

In [None]:
# k = 4
# neigh = KNeighborsClassifier(n_neighbors = k).fit(X_train,y_train)
# yhat = neigh.predict(X_test)
# print("Train set Accuracy: ", metrics.accuracy_score(y_train, neigh.predict(X_train)))
# print("Test set Accuracy: ", metrics.accuracy_score(y_test, yhat))

In [None]:
# k = 4
# neigh = KMeans(k).fit_transform(X_train)
# yhat = neigh.predict(X_test)
# print("Train set Accuracy: ", metrics.accuracy_score(y_train, neigh.predict(X_train)))
# print("Test set Accuracy: ", metrics.accuracy_score(y_test, yhat))

In [None]:
# Ks = 20
# mean_acc = np.zeros((Ks-1))
# std_acc = np.zeros((Ks-1))
# ConfustionMx = [];
# for n in range(1,Ks):
    
#     #Train Model and Predict  
#     neigh = KNeighborsClassifier(n_neighbors = n).fit(X_train,y_train)
#     yhat=neigh.predict(X_test)
#     mean_acc[n-1] = metrics.accuracy_score(y_test, yhat)

    
#     std_acc[n-1]=np.std(yhat==y_test)/np.sqrt(yhat.shape[0])

# plt.plot(range(1,Ks),mean_acc,'g')
# plt.fill_between(range(1,Ks),mean_acc - 1 * std_acc,mean_acc + 1 * std_acc, alpha=0.10)
# plt.legend(('Accuracy ', '+/- 1xstd'))
# plt.ylabel('Accuracy ')
# plt.xlabel('(K)')
# plt.tight_layout()
# plt.show()
# print( "The best accuracy was with", mean_acc.max(), "with k=", mean_acc.argmax()+1) 

## Regression analysis

In [None]:
X = london_grouped_for_price.copy().drop('Postcode district', axis=1)
prices_venue = avg_prices.merge(london_grouped_for_price[['Postcode district']], how='right', on='Postcode district')
y = prices_venue['avg_price']
y.shape

In [None]:
# from sklearn import linear_model
# regr = linear_model.LinearRegression()
# x = np.asanyarray(X)
# y = np.asanyarray(y)
# regr.fit (x, y)
# # The coefficients
# print ('Coefficients: ', regr.coef_)

Rejoining the clusters with the price data to see whether cluster membership gives any indication of value
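
One way to check this, sketched on hypothetical data: join the cluster labels to the average prices, summarise price per cluster, and pick out the cheapest district in each cluster. The districts, labels, prices and the `avg_prices_demo` name are invented for illustration.

```python
import pandas as pd

# Hypothetical cluster assignments per postcode district
clusters = pd.DataFrame({
    'Postcode district': ['E1', 'N1', 'SW1', 'W2', 'SE1'],
    'Cluster Labels': [0, 0, 1, 1, 2],
})

# Hypothetical average prices for the same districts
avg_prices_demo = pd.DataFrame({
    'Postcode district': ['E1', 'N1', 'SW1', 'W2', 'SE1'],
    'avg_price': [500000, 700000, 1200000, 1000000, 650000],
})

joined = clusters.merge(avg_prices_demo, on='Postcode district', how='inner')

# Price summary per cluster: large gaps between cluster means would suggest
# that the venue-based clusters carry some price signal
summary = joined.groupby('Cluster Labels')['avg_price'].agg(['mean', 'min', 'max', 'count'])
print(summary)

# Cheapest district within each cluster, i.e. similar venues at a lower price
cheapest = joined.loc[joined.groupby('Cluster Labels')['avg_price'].idxmin()]
print(cheapest)
```

The second table speaks directly to the aim of the analysis: within a group of areas with similar venues, it shows where prices are lowest.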

In [None]:
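# Find the correlation of each numeric venue feature with the average price

# nlargest() expects a column name, not a Series
target = 'avg_price'

# Assumption: `train` and `num_feat` are not defined in the cells shown here,
# so the frame is built from X and y above (both follow the row order of
# london_grouped_for_price)
train = X.copy()
train[target] = np.asarray(y)

corr = train.corr()
corr_abs = corr.abs()

nr_num_cols = train.shape[1]

# Features ranked by absolute correlation with the target
ser_corr = corr_abs.nlargest(nr_num_cols, target)[target]
print(ser_corr)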

In [None]:
# Share of rows taken by the most frequent value in each categorical variable
for i in list(cat_feat):
    # use the actual row count rather than a hard-coded total
    pct = df[i].value_counts().iloc[0] / len(df)
    print('Highest value Percentage of {}: {:.3f}'.format(i, pct))
