<h1>Housing Based on Venue Preferences and Price per Square Meter</h1>

Buying a house is one of the biggest financial decisions a person makes in their life. It is very difficult to be impartial in this decision because of the emotional factors involved in choosing a neighborhood and a house or apartment. To bring a more logical approach to the decision of where to buy a house, in this study I propose an initial approximation that takes into account a person's venue preferences and the average price per square meter of each neighborhood in a city.
According to the study by House Repay (https://www.fastrepayhomeloan.com.au/7-factors-to-consider-when-buying-a-house/), the most important characteristics people consider when choosing a house are:

-Neighborhood

-Schools and Colleges

-Infrastructure (transportation, connectivity with other neighborhoods)

-Crime (crime index)

-House inspection

-Green open space

In this evaluation, the objective is to produce personalized results based on the preferences of a given person. The factors described above are grouped to simplify the process of finding the locations that best match those preferences, and the price per square meter of each neighborhood is included. Since the information is at the neighborhood level, house-specific data such as exact location and house inspection will not be included. The venue information is based on personal preferences, which means schools/colleges and green spaces matter only when the person considers them an important preference. Finally, crime data and infrastructure data are not included because they are out of the scope of this project, but future versions of the model could include them to refine the results.

<h1>Data Description</h1>


The case study used to create the model is the city of Madrid, Spain. The model starts from four different datasets.

-The first dataset is the price per square meter of each neighborhood in the city of Madrid (https://www.idealista.com/sala-de-prensa/informes-precio-vivienda/venta/madrid-comunidad/madrid-provincia/madrid/).

-The second dataset is the location data of each neighborhood, for which geopy was used.

-The third dataset is the venue information of each neighborhood, for which the Foursquare API was used.

-Finally, the last dataset is the user preferences, which are used as the input of a recommendation system to get a recommendation for the given user.

<h1>Discussion and Background</h1>

Finding the right place to live is a difficult task and a momentous decision in a person's life. Cities have also become bigger and offer a wide diversity of activities and places, which makes the decision even harder.
As an initial approach to the problem, I decided to analyze Madrid as a case study: it is a major city, it has a multicultural population that gives the city many different kinds of venues, and data is available for its different neighborhoods.
The following sections describe the data acquisition and data preparation of the first data sets.

<h1>Required Libraries</h1>

For this analysis the required libraries are:

geopy - this library will be used to obtain geographical coordinates of each of the neighborhoods.

folium - this is used to draw maps to show the results of each step of the analysis.

geopandas - used to create dataframes with polygons and points in order to represent geographical structures into dataframes.

In [4]:
##!pip install geocoder
!pip install geopy
!pip install folium
!pip install geopandas
!pip install bs4

In [51]:
#import geocoder
import pandas as pd
import numpy as np
import requests
import urllib.request
import matplotlib.pyplot as plt
import sklearn.utils
import folium
import geopandas as gpd
import matplotlib.cm as cm
import matplotlib.colors as colors
from sklearn.cluster import KMeans
from geopy.geocoders import Nominatim
from bs4 import BeautifulSoup
from sklearn.cluster import DBSCAN 
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler 
from pylab import rcParams
from pandas import json_normalize
from shapely.geometry import Point
from shapely.geometry.polygon import Polygon
from sklearn import preprocessing
from folium import FeatureGroup
%matplotlib inline

<h1>Data Acquisition</h1>

This section describes how each of the data sets was obtained. The first data set is the price per square meter of each neighborhood. For this I use the data of idealista.com, which is the most popular site to rent and buy properties in Spain. To get this information I scrape the data from (https://www.idealista.com/sala-de-prensa/informes-precio-vivienda/venta/madrid-comunidad/madrid-provincia/madrid/). Since idealista protects its data from automated software with a captcha, a few tricks are needed to get the data. First, when you try to scrape the data with a request, the web page may respond that a captcha must be solved before the data can be consulted. To get past this, open the web page in a browser and solve the captcha before sending the request again. If you cannot see the captcha, use CTRL + F5 to force your browser to bypass its cache, which makes the captcha load correctly.

In [7]:
url ='https://www.idealista.com/sala-de-prensa/informes-precio-vivienda/venta/madrid-comunidad/madrid-provincia/madrid/'
headers = {
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
    "accept-language": "es,es-CO;q=0.9,en;q=0.8,en-US;q=0.7",
    "cache-control": "max-age=0",
    "sec-fetch-dest": "document",
    "sec-fetch-mode": "navigate",
    "sec-fetch-site": "none",
    "sec-fetch-user": "?1",
    "upgrade-insecure-requests": "1",
    "cookie": "_pxhd=99a120b3d70ad9c8e49eab82dacac59a109da6afadb4f4540aaff5eccbc74086:bda91760-b002-11ea-be18-9b17275421ed; cookieDirectiveClosed=true; _pxvid=bda91760-b002-11ea-be18-9b17275421ed; _hjid=159fba1f-7f13-4703-9933-dfb89fd6ce98; atuserid=%7B%22name%22%3A%22atuserid%22%2C%22val%22%3A%2240a799d6-963a-4a97-843b-9db4be350f11%22%2C%22options%22%3A%7B%22end%22%3A%222021-07-18T18%3A54%3A02.179Z%22%2C%22path%22%3A%22%2F%22%7D%7D; atidvisitor=%7B%22name%22%3A%22atidvisitor%22%2C%22val%22%3A%7B%22vrn%22%3A%22-582065-%22%7D%2C%22options%22%3A%7B%22path%22%3A%22%2F%22%2C%22session%22%3A15724800%2C%22end%22%3A15724800%7D%7D; utag_main=v_id:0172be7b2612006c35334e93fdf803079007107100942$_sn:2$_se:2$_ss:0$_st:1592338985494$_prevVtSource:directTraffic%3Bexp-1592337241247$_prevVtCampaignCode:%3Bexp-1592337241247$_prevVtDomainReferrer:%3Bexp-1592337241247$_prevVtSubdomaninReferrer:%3Bexp-1592337241247$_prevVtUrlReferrer:%3Bexp-1592337241247$_prevVtprevPageName:255%3A%3A%3A%3A%3A%3A%3A%3A%3Bexp-1592340785499$ses_id:1592337184671%3Bexp-session$_pn:2%3Bexp-session; _px2=eyJ1IjoiZmVhOTIxODAtYjAwYS0xMWVhLWE2ODctYzMxZjBiNzljOGQzIiwidiI6ImJkYTkxNzYwLWIwMDItMTFlYS1iZTE4LTliMTcyNzU0MjFlZCIsInQiOjE1OTIzMzc0ODY2MDMsImgiOiJhMTA0NmQ2ZWZiZjAwNTUwMGMzZTU2ZTUwMGRhYWMyNTI5MmQ1YjUwOWJmMjMzNTM3NTE5ZmNlYmZjYjZmNDMxIn0="
}
response = requests.get(url,headers=headers )
response.content

b'<!DOCTYPE html><html lang="es" dir="ltr" class="project-idn_price_indicator idcms-web" prefix="content: http://purl.org/rss/1.0/modules/content/ dc: http://purl.org/dc/terms/ foaf: http://xmlns.com/foaf/0.1/ og: http://ogp.me/ns# rdfs: http://www.w3.org/2000/01/rdf-schema# schema: http://schema.org/ sioc: http://rdfs.org/sioc/ns# sioct: http://rdfs.org/sioc/types# skos: http://www.w3.org/2004/02/skos/core# xsd: http://www.w3.org/2001/XMLSchema# "><head><meta charset="utf-8" /><meta name="author" content="idealista.com" /><meta property="fb:admins" content="1577646318" /><link rel="canonical" href="https://www.idealista.com/sala-de-prensa/informes-precio-vivienda/venta/madrid-comunidad/madrid-provincia/madrid/" /><meta name="app-download-url" content="/download" /><meta property="og:site_name" content="idealista" /><meta name="twitter:card" content="summary_large_image" /><meta name="google-play-app" content="app-id=com.idealista.android" /><meta property="fb:app_id" content="33762455
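
Before parsing, it may be worth checking that the response is the report page and not the captcha page. The sketch below rests on an assumption about how to tell them apart: it only checks the HTTP status code and whether the HTML contains a `<table>` element, which is what the parsing step needs. The helper name `looks_like_report_page` is hypothetical and not part of the original notebook.

```python
def looks_like_report_page(status_code, html):
    """Heuristic check (an assumption, not an idealista API):
    the report page answers with HTTP 200 and contains a <table>."""
    return status_code == 200 and "<table" in html.lower()


# With the response obtained above, the call would be:
# looks_like_report_page(response.status_code, response.text)
report_html = "<html><table><tr><td>Madrid</td></tr></table></html>"
captcha_html = "<html><body>Please solve the captcha</body></html>"
print(looks_like_report_page(200, report_html))
print(looks_like_report_page(200, captcha_html))
```

If the check fails, the captcha has to be solved in the browser as described above before the request is sent again.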

Once the request returns the correct information, I use BeautifulSoup to extract the table from the response and store it in a data frame.

In [8]:
soup = BeautifulSoup(response.text, 'html.parser')
table_data = soup.find_all('table')
# table_data
df = pd.read_html(table_data[0].prettify(), flavor='bs4')[0]
df.head()

Unnamed: 0,Localización,Precio m2 mayo 2020,Variación mensual,Variación trimestral,Variación anual,Máximo histórico,Variación máximo
0,Madrid,3.782 €/m2,"+0,5 %","+1,5 %","-1,0 %",3.822 €/m2 jul 2019,"-1,1 %"
1,Arganzuela,3.962 €/m2,"+1,5 %","+1,8 %","-2,9 %",4.096 €/m2 jul 2019,"-3,3 %"
2,Barajas,3.144 €/m2,"-2,5 %","-2,1 %","-0,8 %",3.663 €/m2 mar 2009,"-14,2 %"
3,Carabanchel,2.146 €/m2,"-0,6 %","-2,8 %","-2,1 %",3.173 €/m2 jun 2007,"-32,4 %"
4,Centro,5.075 €/m2,"+0,3 %","-0,2 %","+1,7 %",5.096 €/m2 ene 2020,"-0,4 %"


The second data set needed is the geographical location of each neighborhood in Madrid. To get this data I take the neighborhood names from the first data set and, with the geopy library, obtain the geographical coordinates of each neighborhood.

In [9]:
location_list=[]
# create the geocoder once and reuse it for every neighborhood
geolocator = Nominatim(user_agent="madrid_explorer")
for neighborhood in df["Localización"]:
    location = geolocator.geocode('Madrid, '+neighborhood)
    latitude = location.latitude
    longitude = location.longitude
    location_list.append([neighborhood, latitude, longitude])
    print('The geograpical coordinate of '+neighborhood+' are {}, {}.'.format(latitude, longitude))

location_list

The geograpical coordinate of Madrid are 40.4167047, -3.7035825.
The geograpical coordinate of Arganzuela are 40.39806845, -3.6937339526567428.
The geograpical coordinate of Barajas are 40.4733176, -3.5798446.
The geograpical coordinate of Carabanchel are 40.3742112, -3.744676.
The geograpical coordinate of Centro are 40.417652700000005, -3.7079137662915533.
The geograpical coordinate of Chamartín are 40.4589872, -3.6761288.
The geograpical coordinate of Chamberí are 40.43624735, -3.7038303534513837.
The geograpical coordinate of Ciudad Lineal are 40.4484305, -3.650495.
The geograpical coordinate of Fuencarral are 40.4262741, -3.7009067.
The geograpical coordinate of Hortaleza are 40.4725491, -3.6425515.
The geograpical coordinate of Latina are 40.4035317, -3.736152.
The geograpical coordinate of Moncloa are 40.4350196, -3.719236.
The geograpical coordinate of Moratalaz are 40.4059332, -3.6448737.
The geograpical coordinate of Puente de Vallecas are 40.3835532, -3.65453548036571.
The g

[['Madrid', 40.4167047, -3.7035825],
 ['Arganzuela', 40.39806845, -3.6937339526567428],
 ['Barajas', 40.4733176, -3.5798446],
 ['Carabanchel', 40.3742112, -3.744676],
 ['Centro', 40.417652700000005, -3.7079137662915533],
 ['Chamartín', 40.4589872, -3.6761288],
 ['Chamberí', 40.43624735, -3.7038303534513837],
 ['Ciudad Lineal', 40.4484305, -3.650495],
 ['Fuencarral', 40.4262741, -3.7009067],
 ['Hortaleza', 40.4725491, -3.6425515],
 ['Latina', 40.4035317, -3.736152],
 ['Moncloa', 40.4350196, -3.719236],
 ['Moratalaz', 40.4059332, -3.6448737],
 ['Puente de Vallecas', 40.3835532, -3.65453548036571],
 ['Retiro', 40.4111495, -3.6760566],
 ['Salamanca', 40.4270451, -3.6806024],
 ['San Blas', 40.4275001, -3.615954],
 ['Tetuán', 40.4605781, -3.6982806],
 ['Usera', 40.383894, -3.7064459],
 ['Vicálvaro', 40.3965841, -3.5766216],
 ['Villa de Vallecas', 40.3739576, -3.6121632],
 ['Villaverde', 40.3456104, -3.6959556]]
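
The geocoding loop above assumes that every name is found. geopy's `geocode` returns `None` when there is no match, so a more defensive variant could collect the names that fail instead of raising an error. The following is a minimal sketch under that assumption. The function `geocode_all` and the stand-in geocoder are hypothetical; in the notebook the callable would be `geolocator.geocode`.

```python
from types import SimpleNamespace


def geocode_all(names, geocode, prefix="Madrid, "):
    """Geocode each name; return (found rows, names without a match)."""
    found, missing = [], []
    for name in names:
        location = geocode(prefix + name)
        if location is None:  # no result for this query
            missing.append(name)
            continue
        found.append([name, location.latitude, location.longitude])
    return found, missing


# Stand-in geocoder for illustration only; it knows a single place.
def fake_geocode(query):
    known = {"Madrid, Centro": SimpleNamespace(latitude=40.4176527, longitude=-3.7079138)}
    return known.get(query)


found, missing = geocode_all(["Centro", "Nowhere"], fake_geocode)
print(found)
print(missing)
```

Passing the geocoder in as a parameter keeps the loop easy to test without network access.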

Once we have both data sets, I clean and normalize the data by renaming columns, removing data that is not required, and finally merging both data sets into a single data frame.

In [11]:
neighborhood_df = pd.DataFrame(data=location_list)
neighborhood_df.columns = ['neighborhood','latitude','longitude']
neighborhood_df.head()

Unnamed: 0,neighborhood,latitude,longitude
0,Madrid,40.416705,-3.703582
1,Arganzuela,40.398068,-3.693734
2,Barajas,40.473318,-3.579845
3,Carabanchel,40.374211,-3.744676
4,Centro,40.417653,-3.707914


In [12]:
df.rename(columns={'Localización':'neighborhood', 'Precio m2 mayo 2020':'price_m2', 'Variación mensual':'monthly_variation', 'Variación trimestral':'quarterly_variation','Variación anual':'anual_variation','Máximo histórico':'historical_max','Variación máximo':'max_variation'},inplace=True)
df.head()

Unnamed: 0,neighborhood,price_m2,monthly_variation,quarterly_variation,anual_variation,historical_max,max_variation
0,Madrid,3.782 €/m2,"+0,5 %","+1,5 %","-1,0 %",3.822 €/m2 jul 2019,"-1,1 %"
1,Arganzuela,3.962 €/m2,"+1,5 %","+1,8 %","-2,9 %",4.096 €/m2 jul 2019,"-3,3 %"
2,Barajas,3.144 €/m2,"-2,5 %","-2,1 %","-0,8 %",3.663 €/m2 mar 2009,"-14,2 %"
3,Carabanchel,2.146 €/m2,"-0,6 %","-2,8 %","-2,1 %",3.173 €/m2 jun 2007,"-32,4 %"
4,Centro,5.075 €/m2,"+0,3 %","-0,2 %","+1,7 %",5.096 €/m2 ene 2020,"-0,4 %"


In [13]:
# Merge
df = pd.merge(df, neighborhood_df, on='neighborhood')

In [14]:
# Remove Madrid data
madrid_data = df.iloc[0]
df = df[df.neighborhood != 'Madrid']
df.reset_index()
df.head()

Unnamed: 0,neighborhood,price_m2,monthly_variation,quarterly_variation,anual_variation,historical_max,max_variation,latitude,longitude
1,Arganzuela,3.962 €/m2,"+1,5 %","+1,8 %","-2,9 %",4.096 €/m2 jul 2019,"-3,3 %",40.398068,-3.693734
2,Barajas,3.144 €/m2,"-2,5 %","-2,1 %","-0,8 %",3.663 €/m2 mar 2009,"-14,2 %",40.473318,-3.579845
3,Carabanchel,2.146 €/m2,"-0,6 %","-2,8 %","-2,1 %",3.173 €/m2 jun 2007,"-32,4 %",40.374211,-3.744676
4,Centro,5.075 €/m2,"+0,3 %","-0,2 %","+1,7 %",5.096 €/m2 ene 2020,"-0,4 %",40.417653,-3.707914
5,Chamartín,5.179 €/m2,"+0,7 %","+1,4 %","+2,5 %",5.216 €/m2 nov 2018,"-0,7 %",40.458987,-3.676129


In [15]:
print(madrid_data['latitude'])
print(madrid_data['longitude'])

40.4167047
-3.7035825


In [16]:
df['neighborhood'] = df['neighborhood'].str.normalize('NFKD').str.encode('ascii', errors='ignore').str.decode('utf-8')
df.head()

Unnamed: 0,neighborhood,price_m2,monthly_variation,quarterly_variation,anual_variation,historical_max,max_variation,latitude,longitude
1,Arganzuela,3.962 €/m2,"+1,5 %","+1,8 %","-2,9 %",4.096 €/m2 jul 2019,"-3,3 %",40.398068,-3.693734
2,Barajas,3.144 €/m2,"-2,5 %","-2,1 %","-0,8 %",3.663 €/m2 mar 2009,"-14,2 %",40.473318,-3.579845
3,Carabanchel,2.146 €/m2,"-0,6 %","-2,8 %","-2,1 %",3.173 €/m2 jun 2007,"-32,4 %",40.374211,-3.744676
4,Centro,5.075 €/m2,"+0,3 %","-0,2 %","+1,7 %",5.096 €/m2 ene 2020,"-0,4 %",40.417653,-3.707914
5,Chamartin,5.179 €/m2,"+0,7 %","+1,4 %","+2,5 %",5.216 €/m2 nov 2018,"-0,7 %",40.458987,-3.676129
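
The last cell removes accents from the neighborhood names, presumably so that names coming from different sources compare equal when they are later used as merge keys. As an illustration of what that chain of string operations does, here is the same transformation on plain Python strings. The helper name `strip_accents` is hypothetical.

```python
import unicodedata


def strip_accents(text):
    """Decompose accented characters (NFKD) and drop the non-ASCII marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", errors="ignore").decode("utf-8")


for name in ["Chamartín", "Chamberí", "Tetuán", "Vicálvaro"]:
    print(name, "->", strip_accents(name))
```

NFKD splits a character such as "í" into "i" plus a combining accent, and the ASCII encoding step with `errors="ignore"` discards the accent.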


Finally, to verify that all the data is correct, I draw a map using folium and check that each marker falls in its neighborhood. In this case one of the markers (Fuencarral) was offset from its neighborhood, so its coordinates are overridden with the correct ones.

In [17]:
# Create map using latitude and longitude values

map_madrid = folium.Map(width=1000, height=500,location=[madrid_data['latitude'], madrid_data['longitude']], zoom_start=10)

# add markers to map
for lat, lng, neighborhood in zip(df['latitude'], df['longitude'], df['neighborhood']):
    label = neighborhood
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=3,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_madrid)  
    
map_madrid

In [18]:
# Fix Fuencarral location
# .loc sets the values on df itself and avoids chained assignment
is_fuencarral = df['neighborhood'] == 'Fuencarral'
df.loc[is_fuencarral, ['latitude', 'longitude']] = [40.519031, -3.775905]


The third data set used is the polygons of the neighborhoods of Madrid. This data was obtained from fantasmagoria.carto.com:
https://fantasmagoria.carto.com/api/v2/sql?filename=distrito_geojson&q=select+*+from+public.distrito_geojson&format=geojson&bounds=&api_key= 

The GeoJSON is loaded as a data frame, the unnecessary columns are dropped, and the data is merged with the main data frame used in the two previous steps.

In [19]:
url_geo_madrid='https://fantasmagoria.carto.com/api/v2/sql?filename=distrito_geojson&q=select+*+from+public.distrito_geojson&format=geojson&bounds=&api_key='
response_geo = requests.get(url_geo_madrid)
df_geo = gpd.GeoDataFrame(response_geo.json())
df_geo.head()

Unnamed: 0,type,features
0,FeatureCollection,"{'type': 'Feature', 'geometry': {'type': 'Mult..."
1,FeatureCollection,"{'type': 'Feature', 'geometry': {'type': 'Mult..."
2,FeatureCollection,"{'type': 'Feature', 'geometry': {'type': 'Mult..."
3,FeatureCollection,"{'type': 'Feature', 'geometry': {'type': 'Mult..."
4,FeatureCollection,"{'type': 'Feature', 'geometry': {'type': 'Mult..."


In [20]:
df_geo = json_normalize(df_geo["features"])
df_geo.head()


Unnamed: 0,type,geometry.type,geometry.coordinates,properties.codigo,properties.label,properties.codigoalternativo,properties._about,properties.cartodb_id,properties.created_at,properties.updated_at
0,Feature,MultiPolygon,"[[[[-3.561544, 40.510729], [-3.56154, 40.51071...",28079621,Barajas,21,http://datos.localidata.com/recurso/territorio...,2,2014-09-25T20:40:54Z,2014-09-25T20:40:54Z
1,Feature,MultiPolygon,"[[[[-3.724663, 40.404549], [-3.724586, 40.4045...",28079611,Carabanchel,11,http://datos.localidata.com/recurso/territorio...,3,2014-09-25T20:40:54Z,2014-09-25T20:40:54Z
2,Feature,MultiPolygon,"[[[[-3.712148, 40.430235], [-3.71205, 40.43022...",28079601,Centro,1,http://datos.localidata.com/recurso/territorio...,4,2014-09-25T20:40:54Z,2014-09-25T20:40:54Z
3,Feature,MultiPolygon,"[[[[-3.673517, 40.482855], [-3.673633, 40.4822...",28079605,Chamartín,5,http://datos.localidata.com/recurso/territorio...,5,2014-09-25T20:40:54Z,2014-09-25T20:40:54Z
4,Feature,MultiPolygon,"[[[[-3.698789, 40.446603], [-3.698725, 40.4465...",28079607,Chamberí,7,http://datos.localidata.com/recurso/territorio...,6,2014-09-25T20:40:54Z,2014-09-25T20:40:54Z


In [21]:
df_geo.replace('Fuencarral-El Pardo', 'Fuencarral', inplace=True)
df_geo.replace('Moncloa-Aravaca', 'Moncloa',inplace=True)
df_geo.head()

Unnamed: 0,type,geometry.type,geometry.coordinates,properties.codigo,properties.label,properties.codigoalternativo,properties._about,properties.cartodb_id,properties.created_at,properties.updated_at
0,Feature,MultiPolygon,"[[[[-3.561544, 40.510729], [-3.56154, 40.51071...",28079621,Barajas,21,http://datos.localidata.com/recurso/territorio...,2,2014-09-25T20:40:54Z,2014-09-25T20:40:54Z
1,Feature,MultiPolygon,"[[[[-3.724663, 40.404549], [-3.724586, 40.4045...",28079611,Carabanchel,11,http://datos.localidata.com/recurso/territorio...,3,2014-09-25T20:40:54Z,2014-09-25T20:40:54Z
2,Feature,MultiPolygon,"[[[[-3.712148, 40.430235], [-3.71205, 40.43022...",28079601,Centro,1,http://datos.localidata.com/recurso/territorio...,4,2014-09-25T20:40:54Z,2014-09-25T20:40:54Z
3,Feature,MultiPolygon,"[[[[-3.673517, 40.482855], [-3.673633, 40.4822...",28079605,Chamartín,5,http://datos.localidata.com/recurso/territorio...,5,2014-09-25T20:40:54Z,2014-09-25T20:40:54Z
4,Feature,MultiPolygon,"[[[[-3.698789, 40.446603], [-3.698725, 40.4465...",28079607,Chamberí,7,http://datos.localidata.com/recurso/territorio...,6,2014-09-25T20:40:54Z,2014-09-25T20:40:54Z


In [22]:
df_geo.rename(columns={'properties.label':'neighborhood'},inplace=True)
df_geo['neighborhood']=df_geo['neighborhood'].str.normalize('NFKD').str.encode('ascii', errors='ignore').str.decode('utf-8')
df_geo.head()

Unnamed: 0,type,geometry.type,geometry.coordinates,properties.codigo,neighborhood,properties.codigoalternativo,properties._about,properties.cartodb_id,properties.created_at,properties.updated_at
0,Feature,MultiPolygon,"[[[[-3.561544, 40.510729], [-3.56154, 40.51071...",28079621,Barajas,21,http://datos.localidata.com/recurso/territorio...,2,2014-09-25T20:40:54Z,2014-09-25T20:40:54Z
1,Feature,MultiPolygon,"[[[[-3.724663, 40.404549], [-3.724586, 40.4045...",28079611,Carabanchel,11,http://datos.localidata.com/recurso/territorio...,3,2014-09-25T20:40:54Z,2014-09-25T20:40:54Z
2,Feature,MultiPolygon,"[[[[-3.712148, 40.430235], [-3.71205, 40.43022...",28079601,Centro,1,http://datos.localidata.com/recurso/territorio...,4,2014-09-25T20:40:54Z,2014-09-25T20:40:54Z
3,Feature,MultiPolygon,"[[[[-3.673517, 40.482855], [-3.673633, 40.4822...",28079605,Chamartin,5,http://datos.localidata.com/recurso/territorio...,5,2014-09-25T20:40:54Z,2014-09-25T20:40:54Z
4,Feature,MultiPolygon,"[[[[-3.698789, 40.446603], [-3.698725, 40.4465...",28079607,Chamberi,7,http://datos.localidata.com/recurso/territorio...,6,2014-09-25T20:40:54Z,2014-09-25T20:40:54Z


In [23]:
df = pd.merge(df, df_geo, on='neighborhood')
df.head()

Unnamed: 0,neighborhood,price_m2,monthly_variation,quarterly_variation,anual_variation,historical_max,max_variation,latitude,longitude,type,geometry.type,geometry.coordinates,properties.codigo,properties.codigoalternativo,properties._about,properties.cartodb_id,properties.created_at,properties.updated_at
0,Arganzuela,3.962 €/m2,"+1,5 %","+1,8 %","-2,9 %",4.096 €/m2 jul 2019,"-3,3 %",40.398068,-3.693734,Feature,MultiPolygon,"[[[[-3.703413, 40.405096], [-3.703165, 40.4050...",28079602,2,http://datos.localidata.com/recurso/territorio...,1,2014-09-25T20:40:54Z,2014-09-25T20:40:54Z
1,Barajas,3.144 €/m2,"-2,5 %","-2,1 %","-0,8 %",3.663 €/m2 mar 2009,"-14,2 %",40.473318,-3.579845,Feature,MultiPolygon,"[[[[-3.561544, 40.510729], [-3.56154, 40.51071...",28079621,21,http://datos.localidata.com/recurso/territorio...,2,2014-09-25T20:40:54Z,2014-09-25T20:40:54Z
2,Carabanchel,2.146 €/m2,"-0,6 %","-2,8 %","-2,1 %",3.173 €/m2 jun 2007,"-32,4 %",40.374211,-3.744676,Feature,MultiPolygon,"[[[[-3.724663, 40.404549], [-3.724586, 40.4045...",28079611,11,http://datos.localidata.com/recurso/territorio...,3,2014-09-25T20:40:54Z,2014-09-25T20:40:54Z
3,Centro,5.075 €/m2,"+0,3 %","-0,2 %","+1,7 %",5.096 €/m2 ene 2020,"-0,4 %",40.417653,-3.707914,Feature,MultiPolygon,"[[[[-3.712148, 40.430235], [-3.71205, 40.43022...",28079601,1,http://datos.localidata.com/recurso/territorio...,4,2014-09-25T20:40:54Z,2014-09-25T20:40:54Z
4,Chamartin,5.179 €/m2,"+0,7 %","+1,4 %","+2,5 %",5.216 €/m2 nov 2018,"-0,7 %",40.458987,-3.676129,Feature,MultiPolygon,"[[[[-3.673517, 40.482855], [-3.673633, 40.4822...",28079605,5,http://datos.localidata.com/recurso/territorio...,5,2014-09-25T20:40:54Z,2014-09-25T20:40:54Z
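
`pd.merge` performs an inner join by default, so any neighborhood whose name differs between the two data frames is silently dropped. That is why 'Fuencarral-El Pardo' and 'Moncloa-Aravaca' were renamed before merging. A quick way to spot such mismatches is to compare the two sets of keys first. The sketch below uses plain Python sets, and `unmatched_keys` is a hypothetical helper; in the notebook the inputs would be `df['neighborhood']` and `df_geo['neighborhood']`.

```python
def unmatched_keys(left_names, right_names):
    """Return the names present on only one side of the planned merge."""
    left, right = set(left_names), set(right_names)
    return sorted(left - right), sorted(right - left)


# Illustrative names taken from the renaming step above.
price_names = ["Centro", "Fuencarral", "Moncloa"]
geo_names = ["Centro", "Fuencarral-El Pardo", "Moncloa-Aravaca"]

only_in_prices, only_in_geo = unmatched_keys(price_names, geo_names)
print(only_in_prices)
print(only_in_geo)
```

Both lists should be empty before an inner merge if no neighborhood is meant to be lost.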


In [24]:
df.drop(['geometry.type', 'properties._about','properties.cartodb_id','properties.created_at','properties.updated_at','type', 'properties.codigoalternativo'], axis=1, inplace=True)
df.head()

Unnamed: 0,neighborhood,price_m2,monthly_variation,quarterly_variation,anual_variation,historical_max,max_variation,latitude,longitude,geometry.coordinates,properties.codigo
0,Arganzuela,3.962 €/m2,"+1,5 %","+1,8 %","-2,9 %",4.096 €/m2 jul 2019,"-3,3 %",40.398068,-3.693734,"[[[[-3.703413, 40.405096], [-3.703165, 40.4050...",28079602
1,Barajas,3.144 €/m2,"-2,5 %","-2,1 %","-0,8 %",3.663 €/m2 mar 2009,"-14,2 %",40.473318,-3.579845,"[[[[-3.561544, 40.510729], [-3.56154, 40.51071...",28079621
2,Carabanchel,2.146 €/m2,"-0,6 %","-2,8 %","-2,1 %",3.173 €/m2 jun 2007,"-32,4 %",40.374211,-3.744676,"[[[[-3.724663, 40.404549], [-3.724586, 40.4045...",28079611
3,Centro,5.075 €/m2,"+0,3 %","-0,2 %","+1,7 %",5.096 €/m2 ene 2020,"-0,4 %",40.417653,-3.707914,"[[[[-3.712148, 40.430235], [-3.71205, 40.43022...",28079601
4,Chamartin,5.179 €/m2,"+0,7 %","+1,4 %","+2,5 %",5.216 €/m2 nov 2018,"-0,7 %",40.458987,-3.676129,"[[[[-3.673517, 40.482855], [-3.673633, 40.4822...",28079605


In [25]:
df['price_m2']=df['price_m2'].str.split(" ", n = 1, expand = True)[0]
df.head()

Unnamed: 0,neighborhood,price_m2,monthly_variation,quarterly_variation,anual_variation,historical_max,max_variation,latitude,longitude,geometry.coordinates,properties.codigo
0,Arganzuela,3.962,"+1,5 %","+1,8 %","-2,9 %",4.096 €/m2 jul 2019,"-3,3 %",40.398068,-3.693734,"[[[[-3.703413, 40.405096], [-3.703165, 40.4050...",28079602
1,Barajas,3.144,"-2,5 %","-2,1 %","-0,8 %",3.663 €/m2 mar 2009,"-14,2 %",40.473318,-3.579845,"[[[[-3.561544, 40.510729], [-3.56154, 40.51071...",28079621
2,Carabanchel,2.146,"-0,6 %","-2,8 %","-2,1 %",3.173 €/m2 jun 2007,"-32,4 %",40.374211,-3.744676,"[[[[-3.724663, 40.404549], [-3.724586, 40.4045...",28079611
3,Centro,5.075,"+0,3 %","-0,2 %","+1,7 %",5.096 €/m2 ene 2020,"-0,4 %",40.417653,-3.707914,"[[[[-3.712148, 40.430235], [-3.71205, 40.43022...",28079601
4,Chamartin,5.179,"+0,7 %","+1,4 %","+2,5 %",5.216 €/m2 nov 2018,"-0,7 %",40.458987,-3.676129,"[[[[-3.673517, 40.482855], [-3.673633, 40.4822...",28079605


In [26]:
df['price_m2'] =  pd.to_numeric(df['price_m2'].str.replace('.','', regex=False))
df.head()

Unnamed: 0,neighborhood,price_m2,monthly_variation,quarterly_variation,anual_variation,historical_max,max_variation,latitude,longitude,geometry.coordinates,properties.codigo
0,Arganzuela,3962,"+1,5 %","+1,8 %","-2,9 %",4.096 €/m2 jul 2019,"-3,3 %",40.398068,-3.693734,"[[[[-3.703413, 40.405096], [-3.703165, 40.4050...",28079602
1,Barajas,3144,"-2,5 %","-2,1 %","-0,8 %",3.663 €/m2 mar 2009,"-14,2 %",40.473318,-3.579845,"[[[[-3.561544, 40.510729], [-3.56154, 40.51071...",28079621
2,Carabanchel,2146,"-0,6 %","-2,8 %","-2,1 %",3.173 €/m2 jun 2007,"-32,4 %",40.374211,-3.744676,"[[[[-3.724663, 40.404549], [-3.724586, 40.4045...",28079611
3,Centro,5075,"+0,3 %","-0,2 %","+1,7 %",5.096 €/m2 ene 2020,"-0,4 %",40.417653,-3.707914,"[[[[-3.712148, 40.430235], [-3.71205, 40.43022...",28079601
4,Chamartin,5179,"+0,7 %","+1,4 %","+2,5 %",5.216 €/m2 nov 2018,"-0,7 %",40.458987,-3.676129,"[[[[-3.673517, 40.482855], [-3.673633, 40.4822...",28079605
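
The price column uses the Spanish number format, where "." is the thousands separator and "," is the decimal separator. The two cells above handle the price; the variation columns are left as text. As a sketch of how both kinds of value could be parsed, the helpers below work on plain strings. The names `parse_price` and `parse_percent` are hypothetical and not part of the original notebook.

```python
def parse_price(text):
    """'3.962 €/m2' -> 3962 (euros per square meter)."""
    number = text.split(" ", 1)[0]          # keep the part before the unit
    return int(number.replace(".", ""))     # drop the thousands separator


def parse_percent(text):
    """'-14,2 %' -> -14.2"""
    number = text.replace("%", "").strip()
    return float(number.replace(",", "."))  # decimal comma to decimal point


print(parse_price("3.962 €/m2"))
print(parse_percent("+1,5 %"))
print(parse_percent("-14,2 %"))
```

With pandas, these functions could be applied to a column with `Series.map`.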


Now that the information is clean and merged into the main data frame, we can draw the map again, this time with the neighborhood boundaries and with popups showing the price per square meter of each neighborhood.

In [27]:
# Initialize the map:
choroplet_map = folium.Map(width=1500, height=500,location=[madrid_data['latitude'], madrid_data['longitude']], zoom_start=10)
 
# Add the color for the choropleth:
# folium.Choropleth replaces the older Map.choropleth method
folium.Choropleth(
    geo_data=response_geo.json(),
    name='choropleth',
    data=df,
    columns=['properties.codigo', 'price_m2'],
    key_on='feature.properties.codigo',
    fill_color='BuPu',
    fill_opacity=0.7,
    line_opacity=0.9,
    legend_name='Neighborhood price per m2 (euros)'
).add_to(choroplet_map)

folium.LayerControl().add_to(choroplet_map)

for lat, lng, neighborhood, price in zip(df['latitude'], df['longitude'], df['neighborhood'],df['price_m2']):
    label = neighborhood+" price m2: "+str(round(price, 3))+" euros" 
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=3,
        popup=label,
        color='blue',
        fill=False,
        fill_color='#3186cc',
        fill_opacity=0.00001,
        parse_html=False).add_to(choroplet_map)


choroplet_map

The next data set we need is the venue information. For this I use the Foursquare API with a limit of 200 venues per query (one query per neighborhood) and no radius limit. With this configuration, Foursquare should return up to the 200 most relevant venues around a given location (the limit can be changed if you want a different number of venues per call). Since the area returned for one neighborhood can overlap with other neighborhoods, each venue has to be checked against the boundary polygon defined for its neighborhood; a venue that falls outside the polygon is dropped.

In [28]:
# Function to get nearby venues using Foursquare

def getNearbyVenues(names, latitudes, longitudes, LIMIT=200):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        #print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng,
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()
        #print(results)
        results = results["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
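
The request URL in the function above is assembled with string formatting. An alternative sketch, using only the standard library, lets `urlencode` build and escape the query string. The endpoint and parameter names are the ones used in the function above; the helper name `build_explore_url` and the placeholder credentials are hypothetical.

```python
from urllib.parse import urlencode

EXPLORE_ENDPOINT = "https://api.foursquare.com/v2/venues/explore"


def build_explore_url(client_id, client_secret, version, lat, lng, limit=200):
    """Build the explore URL with the same parameters as getNearbyVenues."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lng),  # latitude,longitude of the query point
        "limit": limit,
    }
    return EXPLORE_ENDPOINT + "?" + urlencode(params)


# Placeholder credentials, for illustration only.
print(build_explore_url("MY_ID", "MY_SECRET", "20180605", 40.398068, -3.693734))
```

The same dictionary could also be passed to `requests.get` through its `params` argument, which performs the encoding itself.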

In [29]:
CLIENT_ID = 'FFRJN0UR5SSHZTTQIBYXKNEDJXKKPPBVTMK3OJIBVD02JDG3' # your Foursquare ID
CLIENT_SECRET = 'I2SZ4OU2ZFUWPRPXOIAVXUROGNJDXCRE5WTPZDVJRL42E1E1' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

madrid_venues = getNearbyVenues(names=df['neighborhood'],
                                   latitudes=df['latitude'],
                                   longitudes=df['longitude']
                                  )
print(madrid_venues.shape)
madrid_venues.head()

(2054, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Arganzuela,40.398068,-3.693734,Tres Cerditos,40.397316,-3.694184,Chinese Restaurant
1,Arganzuela,40.398068,-3.693734,Mercado de Motores,40.399149,-3.691978,Flea Market
2,Arganzuela,40.398068,-3.693734,Magasand Deli,40.396811,-3.691293,Restaurant
3,Arganzuela,40.398068,-3.693734,Museo del Ferrocarril (Antigua Estación de Del...,40.399395,-3.692286,Museum
4,Arganzuela,40.398068,-3.693734,PanArte,40.399279,-3.694182,Bakery


In [30]:
# Since Foursquare returns venues based on a radius, we have to filter out venues that do not belong to each neighborhood.
# The number of venues can change depending on when the Foursquare API is called.
outside = []
for idx, neighborhood, ven_lat, ven_lon in zip(madrid_venues.index, madrid_venues['Neighborhood'], madrid_venues['Venue Latitude'], madrid_venues['Venue Longitude']):
    row = df[df['neighborhood'] == neighborhood]
    poly_index = row.index[0]
    polygon_array = row['geometry.coordinates'][poly_index][0][0]
    # GeoJSON coordinates are ordered (longitude, latitude), so the point must use the same order
    point = Point(ven_lon, ven_lat)
    polygon = Polygon(polygon_array)
    if not polygon.contains(point):
        # remember the venues that fall outside their neighborhood polygon
        outside.append(idx)
# drop the outside venues and rebuild a clean 0..n-1 index
madrid_venues = madrid_venues.drop(outside).reset_index(drop=True)
madrid_venues.shape

(1344, 7)
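
The filtering step depends on testing whether a point lies inside a polygon, with both expressed in (longitude, latitude) order. The notebook uses shapely's `Polygon.contains` for this. The following is only a minimal pure-Python sketch of the same idea using ray casting, a swapped-in technique for illustration. The function `point_in_polygon` and the square boundary are hypothetical and are not the real Arganzuela polygon.

```python
def point_in_polygon(lon, lat, ring):
    """Ray-casting test: True if (lon, lat) lies inside the ring.

    `ring` is a list of [lon, lat] pairs, the order GeoJSON uses.
    """
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Does a horizontal ray starting at the point cross this edge?
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


# Hypothetical square around the Arganzuela centroid (not the real boundary)
square = [[-3.70, 40.39], [-3.69, 40.39], [-3.69, 40.40], [-3.70, 40.40]]

# Venue coordinates taken from the first row of the table above (Tres Cerditos)
venue_inside = point_in_polygon(-3.694184, 40.397316, square)
# Same venue with latitude and longitude swapped by mistake
venue_swapped = point_in_polygon(40.397316, -3.694184, square)
print(venue_inside, venue_swapped)
```

The second call illustrates why coordinate order matters: with the two values swapped, the venue is reported as outside the polygon and would be dropped.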

In [31]:
madrid_venues.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Arganzuela,40.398068,-3.693734,Tres Cerditos,40.397316,-3.694184,Chinese Restaurant
1,Arganzuela,40.398068,-3.693734,Mercado de Motores,40.399149,-3.691978,Flea Market
2,Arganzuela,40.398068,-3.693734,Magasand Deli,40.396811,-3.691293,Restaurant
3,Arganzuela,40.398068,-3.693734,Museo del Ferrocarril (Antigua Estación de Del...,40.399395,-3.692286,Museum
4,Arganzuela,40.398068,-3.693734,PanArte,40.399279,-3.694182,Bakery


Now that we have the venues inside each neighborhood, the next step is to determine which venues are the most common. To do this, we use the function get_dummies to create a data frame that shows the distribution of venues per neighborhood.

In [32]:
# Now that we have the relevant venues for each neighborhood, we have to determine which are the most common.
# For this we transform the venue category into one-hot (categorical) columns.
madrid_onehot = pd.get_dummies(madrid_venues[['Venue Category']], prefix="", prefix_sep="")

# The venue categories can include one that is itself called "Neighborhood";
# drop it if present so it does not clash with the neighborhood name column
madrid_onehot = madrid_onehot.drop(columns=['Neighborhood'], errors='ignore')

# insert the neighborhood name as the first column of the dataframe
madrid_onehot.insert(0, 'Neighborhood', madrid_venues['Neighborhood'])

madrid_onehot.head()

Unnamed: 0,Neighborhood,Accessories Store,Airport,Airport Lounge,Airport Service,American Restaurant,Aquarium,Arcade,Arepa Restaurant,Argentinian Restaurant,...,Trade School,Train Station,Turkish Restaurant,Vegetarian / Vegan Restaurant,Venezuelan Restaurant,Video Game Store,Wine Bar,Wine Shop,Women's Store,Zoo
0,Arganzuela,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Arganzuela,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Arganzuela,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Arganzuela,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Arganzuela,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Now that each venue is encoded by category, we group by neighborhood and calculate the mean, which gives the average (relative frequency) of each type of venue per neighborhood.

In [33]:
madrid_grouped = madrid_onehot.groupby('Neighborhood').mean().reset_index()
madrid_grouped.head()

Unnamed: 0,Neighborhood,Accessories Store,Airport,Airport Lounge,Airport Service,American Restaurant,Aquarium,Arcade,Arepa Restaurant,Argentinian Restaurant,...,Trade School,Train Station,Turkish Restaurant,Vegetarian / Vegan Restaurant,Venezuelan Restaurant,Video Game Store,Wine Bar,Wine Shop,Women's Store,Zoo
0,Arganzuela,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.02,...,0.01,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0
1,Barajas,0.0125,0.0125,0.0625,0.075,0.0,0.0,0.0,0.0,0.025,...,0.0,0.0125,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Carabanchel,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Centro,0.01,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.01,0.0,0.0
4,Chamartin,0.0,0.0,0.0,0.0,0.010309,0.0,0.010309,0.0,0.020619,...,0.0,0.0,0.010309,0.0,0.0,0.0,0.0,0.0,0.0,0.0
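
To make the encoding and averaging step concrete, here is a small sketch on made-up data. The neighborhoods "A" and "B" and their venues are hypothetical. It shows that the mean of a 0/1 column is the share of venues of that category in the neighborhood.

```python
import pandas as pd

# Hypothetical toy data: three venues in "A", one venue in "B"
toy_venues = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'A', 'B'],
    'Venue Category': ['Bar', 'Bar', 'Park', 'Gym'],
})

# One column per category, one row per venue
toy_onehot = pd.get_dummies(toy_venues['Venue Category']).astype(int)
toy_onehot.insert(0, 'Neighborhood', toy_venues['Neighborhood'])

# The mean of a 0/1 column is the share of venues in that category
toy_grouped = toy_onehot.groupby('Neighborhood').mean().reset_index()
print(toy_grouped)
```

In this toy case two of the three venues in "A" are bars, so the Bar column for "A" is 2/3, and each row sums to 1.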


In [34]:
# function to get the most common venues by neighborhood

def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [35]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = madrid_grouped['Neighborhood']

for ind in np.arange(madrid_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(madrid_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Arganzuela,Spanish Restaurant,Restaurant,Tapas Restaurant,Grocery Store,Indie Theater,Gym,Market,Farmers Market,Chinese Restaurant,Gym / Fitness Center
1,Barajas,Spanish Restaurant,Airport Service,Hotel,Airport Lounge,Duty-free Shop,Restaurant,Coffee Shop,Japanese Restaurant,Rental Car Location,Soccer Field
2,Carabanchel,Bar,Gym,Supermarket,Spanish Restaurant,Restaurant,Bakery,Tapas Restaurant,Concert Hall,Park,Pizza Place
3,Centro,Tapas Restaurant,Plaza,Spanish Restaurant,Hotel,Hostel,Ice Cream Shop,Restaurant,Bookstore,Clothing Store,Gourmet Shop
4,Chamartin,Restaurant,Spanish Restaurant,Mediterranean Restaurant,Seafood Restaurant,Japanese Restaurant,Pizza Place,Bar,Tapas Restaurant,Asian Restaurant,Steakhouse
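
The ranking above can be reproduced on a single row of made-up frequencies, as in the sketch below. The `ordinal` helper is a hypothetical alternative to the `indicators` list with try/except; it is not part of the notebook, and it also handles numbers such as 11 or 21 should `num_top_venues` ever grow that large.

```python
import pandas as pd


def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st'."""
    if 10 <= n % 100 <= 20:
        suffix = 'th'
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return f'{n}{suffix}'


# Hypothetical venue frequencies for a single neighborhood
freq = pd.Series({'Park': 0.25, 'Bar': 0.40, 'Bakery': 0.15, 'Gym': 0.20})

# Sort from most to least frequent and keep the top three category names
top_3 = freq.sort_values(ascending=False).index[:3].tolist()

# Column labels in the same style as the table above
labels = ['{} Most Common Venue'.format(ordinal(i + 1)) for i in range(3)]
print(dict(zip(labels, top_3)))
```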


In [37]:
# Now that we have the most common venues per neighborhood, we define a set of user preferences
# in order to use a recommender system, specifically a content-based approach
# user_preferences = {}

features=list(madrid_grouped.columns)
del features[0]
features

['Accessories Store',
 'Airport',
 'Airport Lounge',
 'Airport Service',
 'American Restaurant',
 'Aquarium',
 'Arcade',
 'Arepa Restaurant',
 'Argentinian Restaurant',
 'Art Gallery',
 'Art Museum',
 'Art Studio',
 'Asian Restaurant',
 'Athletics & Sports',
 'Auto Garage',
 'BBQ Joint',
 'Bakery',
 'Bar',
 'Bed & Breakfast',
 'Beer Bar',
 'Beer Garden',
 'Beer Store',
 'Big Box Store',
 'Bistro',
 'Boarding House',
 'Bookstore',
 'Boutique',
 'Brazilian Restaurant',
 'Breakfast Spot',
 'Brewery',
 'Bubble Tea Shop',
 'Building',
 'Burger Joint',
 'Burrito Place',
 'Café',
 'Cajun / Creole Restaurant',
 'Candy Store',
 'Chinese Restaurant',
 'Chocolate Shop',
 'Church',
 'Circus',
 'Clothing Store',
 'Cocktail Bar',
 'Coffee Shop',
 'Comedy Club',
 'Comfort Food Restaurant',
 'Comic Shop',
 'Concert Hall',
 'Convenience Store',
 'Cosmetics Shop',
 'Cuban Restaurant',
 'Cupcake Shop',
 'Deli / Bodega',
 'Department Store',
 'Dessert Shop',
 'Diner',
 'Dog Run',
 'Duty-free Shop',
 'East

With this information for each neighborhood, we can create a recommendation system that, based on the preferences of a user, shows which neighborhoods have the most venues in common with those preferences and what the cost per square meter is in each neighborhood.

To create the recommendation system, the first step is to determine which type of recommendation system we can use with the information we have. In this case, the recommendation is based on the information we have defined for each neighborhood (venues), so it is a content-based recommendation system. In this type of system, we first define features/preferences for each user. Here, we create a fake user and assign a set of features as user preferences.

In [38]:
user_preferences = ['Bar','Garden','Park','Gym']
user_preferences

['Bar', 'Garden', 'Park', 'Gym']

In [39]:
user_profile = pd.DataFrame(features) 
user_profile.rename(columns={0:'features'}, inplace=True)
user_profile.head()

Unnamed: 0,features
0,Accessories Store
1,Airport
2,Airport Lounge
3,Airport Service
4,American Restaurant


Now that the preferences are defined as a list, we have to convert them into numerical values and then multiply them by the matrix of neighborhood information.
At this point, we could assign weights to the user preferences if we wanted one preference to be more important than the others. For simplicity, in this case all the preferences are valued as 1.

In [40]:
feature_values=[]
for feature in user_profile['features']:
    #print(str(feature)+" vs "+str(user_preferences))
    if(feature in user_preferences):
        feature_values.append(1)
        print(feature)     
    else:
        feature_values.append(0)
feature_values

Bar
Garden
Gym
Park


[0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 1,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0]
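
The notebook values every preference as 1. If weights were wanted, one possible sketch is to store the preferences in a dictionary and look each feature up in it. The weights and the short `toy_features` list below are hypothetical stand-ins, not values from the study.

```python
# Hypothetical weights: a higher number means the venue type matters more
weighted_preferences = {'Bar': 1, 'Garden': 1, 'Park': 3, 'Gym': 2}

# Short stand-in for the full `features` list built from madrid_grouped
toy_features = ['Bakery', 'Bar', 'Garden', 'Gym', 'Park', 'Zoo']

# Features the user did not mention get weight 0
toy_values = [weighted_preferences.get(feature, 0) for feature in toy_features]
print(toy_values)  # [0, 1, 1, 2, 3, 0]
```

With all weights set to 1, this produces the same kind of 0/1 vector as the loop above.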

In [41]:
user_profile['feature_values']=feature_values
user_profile.head()

Unnamed: 0,features,feature_values
0,Accessories Store,0
1,Airport,0
2,Airport Lounge,0
3,Airport Service,0
4,American Restaurant,0


In [42]:
features_neighborhood = madrid_grouped.T
features_neighborhood.columns = features_neighborhood.iloc[0]
features_neighborhood.drop(features_neighborhood.index[0], inplace=True)
features_neighborhood.head(20)

Neighborhood,Arganzuela,Barajas,Carabanchel,Centro,Chamartin,Chamberi,Ciudad Lineal,Fuencarral,Hortaleza,Latina,...,Moratalaz,Puente de Vallecas,Retiro,Salamanca,San Blas,Tetuan,Usera,Vicalvaro,Villa de Vallecas,Villaverde
Accessories Store,0.0,0.0125,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0
Airport,0.0,0.0125,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Airport Lounge,0.0,0.0625,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Airport Service,0.0,0.075,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
American Restaurant,0.0,0.0,0.0,0.01,0.0103093,0.0106383,0.0,0.0,0.0232558,0.0,...,0.0,0.0,0.0,0.01,0.0,0.0111111,0.0,0.0,0.0,0.0
Aquarium,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0357143,0.0,0.0
Arcade,0.0,0.0,0.0,0.0,0.0103093,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Arepa Restaurant,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Argentinian Restaurant,0.02,0.025,0.0,0.0,0.0206186,0.0,0.0576923,0.0178571,0.0,0.0,...,0.0,0.0,0.0102041,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Art Gallery,0.02,0.0125,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0102041,0.01,0.0,0.0111111,0.0,0.0,0.0,0.0


Multiplying the user profile by the neighborhood data gives a matrix that shows, for each neighborhood, how frequent each of the user's preferred venue types is; all other categories become zero.

In [43]:
df_features=pd.DataFrame(feature_values, columns=['features_val'])
madrid_recomendation_matrix = features_neighborhood.mul(feature_values, axis=0)
madrid_recomendation_matrix.head()

Neighborhood,Arganzuela,Barajas,Carabanchel,Centro,Chamartin,Chamberi,Ciudad Lineal,Fuencarral,Hortaleza,Latina,...,Moratalaz,Puente de Vallecas,Retiro,Salamanca,San Blas,Tetuan,Usera,Vicalvaro,Villa de Vallecas,Villaverde
Accessories Store,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Airport,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Airport Lounge,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Airport Service,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
American Restaurant,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [44]:
recomentdation_totals = madrid_recomendation_matrix.sum(axis = 0, skipna = True) 
recomentdation_totals.head()

Neighborhood
Arganzuela     0.070000
Barajas        0.037500
Carabanchel    0.159420
Centro         0.030000
Chamartin      0.051546
dtype: float64
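
The multiply-then-sum step can be checked by hand on a small example. The sketch below uses made-up frequencies for two hypothetical neighborhoods, "A" and "B", and also shows that the result equals a dot product between the neighborhood matrix and the preference vector.

```python
import pandas as pd

# Hypothetical frequencies: rows are venue categories, columns are neighborhoods
toy_matrix = pd.DataFrame(
    {'A': [0.50, 0.25, 0.25], 'B': [0.10, 0.60, 0.30]},
    index=['Bar', 'Bakery', 'Park'],
)

# The user cares about Bar and Park only
toy_weights = pd.Series([1, 0, 1], index=['Bar', 'Bakery', 'Park'])

# Element-wise product per category, then the sum of each column
scores = toy_matrix.mul(toy_weights, axis=0).sum(axis=0)

# The same result expressed as a dot product
scores_dot = toy_matrix.T.dot(toy_weights)
print(scores)
```

Here "A" scores 0.50 + 0.25 = 0.75 and "B" scores 0.10 + 0.30 = 0.40, so "A" is the better match for this user.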

In [45]:
# Now we have a value per neighborhood based on the user preferences.
# Finally, we need to relate this value to the price so that the result also takes the price into account.
df_total = pd.DataFrame(recomentdation_totals, columns = ['values'])
df_total.head()

Unnamed: 0_level_0,values
Neighborhood,Unnamed: 1_level_1
Arganzuela,0.07
Barajas,0.0375
Carabanchel,0.15942
Centro,0.03
Chamartin,0.051546


The final result of the recommendation is the sum per neighborhood: the higher the number, the more closely the neighborhood matches the user preferences.

In [46]:
df_total.index.names = ['neighborhood']
df = pd.merge(df, df_total, on='neighborhood')
df.head()

Unnamed: 0,neighborhood,price_m2,monthly_variation,quarterly_variation,anual_variation,historical_max,max_variation,latitude,longitude,geometry.coordinates,properties.codigo,values
0,Arganzuela,3962,"+1,5 %","+1,8 %","-2,9 %",4.096 €/m2 jul 2019,"-3,3 %",40.398068,-3.693734,"[[[[-3.703413, 40.405096], [-3.703165, 40.4050...",28079602,0.07
1,Barajas,3144,"-2,5 %","-2,1 %","-0,8 %",3.663 €/m2 mar 2009,"-14,2 %",40.473318,-3.579845,"[[[[-3.561544, 40.510729], [-3.56154, 40.51071...",28079621,0.0375
2,Carabanchel,2146,"-0,6 %","-2,8 %","-2,1 %",3.173 €/m2 jun 2007,"-32,4 %",40.374211,-3.744676,"[[[[-3.724663, 40.404549], [-3.724586, 40.4045...",28079611,0.15942
3,Centro,5075,"+0,3 %","-0,2 %","+1,7 %",5.096 €/m2 ene 2020,"-0,4 %",40.417653,-3.707914,"[[[[-3.712148, 40.430235], [-3.71205, 40.43022...",28079601,0.03
4,Chamartin,5179,"+0,7 %","+1,4 %","+2,5 %",5.216 €/m2 nov 2018,"-0,7 %",40.458987,-3.676129,"[[[[-3.673517, 40.482855], [-3.673633, 40.4822...",28079605,0.051546


Finally, to relate the price to the result of the recommendation system, we can use layers in the choropleth map, so that we can visualize both results or filter by either of them. In this case the two are not combined directly, since it is unknown how much a user is willing to pay per square meter in a location that covers their preferences, so we leave this decision to each user, at least in this initial part of the analysis.

In [59]:
# Initialize the map:
choroplet_map = folium.Map(width=1000, height=500,location=[madrid_data['latitude'], madrid_data['longitude']], zoom_start=10)

feature_group = FeatureGroup(name='Neighborhoods')

# Add the color for the choropleth:
choroplet_map.choropleth(
    geo_data=response_geo.json(),
    name='user preferences',
    data=df,
    columns=['properties.codigo', 'values'],
    key_on='properties.codigo',
    fill_color='YlGn',
    fill_opacity=0.7,
    line_opacity=0.9,
    legend_name='neighborhoods relationship with user preferences (higher means better fit to user preferences)'
)

choroplet_map.choropleth(
    geo_data=response_geo.json(),
    name='prices by m2',
    data=df,
    columns=['properties.codigo', 'price_m2'],
    key_on='properties.codigo',
    fill_color='PuRd',
    fill_opacity=0.7,
    line_opacity=0.9,
    legend_name='neighborhoods price per m2 in euros'
)
#.add_to(choroplet_map)

for lat, lng, neighborhood, price in zip(df['latitude'], df['longitude'], df['neighborhood'],df['price_m2']):
    label = neighborhood
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=False,
        fill_color='#3186cc',
        fill_opacity=0.00001,
        parse_html=False).add_to(feature_group)

feature_group.add_to(choroplet_map)
folium.LayerControl().add_to(choroplet_map)

choroplet_map
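
The study leaves the trade-off between price and preferences to the user. As one possible illustration of how a user could apply it, the sketch below filters the first five neighborhoods of the merged table by a budget and ranks what remains by preference score. The `max_price_m2` threshold is a hypothetical value, and this is not a step of the original analysis.

```python
import pandas as pd

# First five rows of the merged table (price_m2 and values columns)
sample = pd.DataFrame({
    'neighborhood': ['Arganzuela', 'Barajas', 'Carabanchel', 'Centro', 'Chamartin'],
    'price_m2': [3962, 3144, 2146, 5075, 5179],
    'values': [0.07, 0.0375, 0.15942, 0.03, 0.051546],
})

max_price_m2 = 4000  # hypothetical budget in euros per square meter

# Keep only the neighborhoods within budget, then rank by preference score
affordable = sample[sample['price_m2'] <= max_price_m2]
ranking = affordable.sort_values('values', ascending=False)['neighborhood'].tolist()
print(ranking)
```

With this budget, Centro and Chamartin are excluded and Carabanchel ranks first among the remaining three.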