# Capstone Project - The Battle of Neighborhoods!

Install and import required packages

In [None]:
# install the Google Trends API
!pip install pytrends

# install the Daft Listings API
!pip install daftlistings

In [143]:
# python packages
import pprint
import requests
import pandas as pd  # used throughout the notebook as `pd`

# Google Trends API packages
from pytrends.request import TrendReq

# Daft listings API packages
from daftlistings import Daft, RentType, SortOrder, SortType
from joblib import Parallel, delayed
import time

# convert an address into latitude and longitude values
from geopy.geocoders import Nominatim

## 1. Introduction
This section gives the background to the business problem that I'll be trying to solve in this capstone project.

The project focuses on the city of Dublin and its 22 postal districts.  

This project attempts the following analyses, each with a target audience in mind:  
1) **House Renting**: Finding an apartment to rent in Dublin city is very challenging given the housing crisis. The target audience here is people looking for rental apartments in the city. The aim is to filter properties by the user's preferences for apartment characteristics, neighborhood, price and the crime rate in the neighborhood where the property is located.  
2) **Neighborhood Clustering**: The approach here is to cluster the districts of Dublin city based on the venues and venue categories found in each district, and to visualize the result. This should give a sense of how districts differ in terms of places, amenities and transport routes and, most importantly, whether distance from the city centre plays a role in driving these differences.  
3) **Google Trends**: This data would act as one of the features in a regression analysis that predicts the rent price of each apartment. The hypothesis is that Google search interest in renting an apartment in a particular neighborhood affects rental prices there. The analysis in the subsequent report would test this hypothesis.  
4) **Crimes**: This data would act as an additional filter for users looking to rent an apartment, and would also feed into the district clustering planned in point 2 above. It would be interesting to use visualization techniques again to find out whether crime is related to the geographical attributes of a particular neighborhood.  

Overall, the aim is to help people looking for rentals in Dublin city filter neighborhoods and properties based on their preferences as well as other local factors that drive their decision.  
Beyond that, the visualization techniques used to analyse the different datasets could help stakeholders with government planning and business marketing decisions, and could give general readers some insights into their own city!

## 2. Data
This section describes the data sources used for this assignment and shows a sample of each.

### 1) Google Trends API
Below is a sample of how we plan to use this API (https://pypi.org/project/pytrends/).  
As the example shows, we retrieve weekly search-interest scores for apartment searches that name different districts within Dublin city.  
The plan is to aggregate this data to engineer features for the pricing predictive model that we build in the sections to follow.  

In [151]:
pytrends = TrendReq(hl='en-US', tz=0)

kw_list = ["Dublin 1 rent", "Dublin 2 rent"]
pytrends.build_payload(kw_list, geo='IE')

pytrends.interest_over_time()

Unnamed: 0_level_0,Dublin 1 rent,Dublin 2 rent,isPartial
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2015-06-07,54,41,False
2015-06-14,40,0,False
2015-06-21,27,27,False
2015-06-28,53,79,False
2015-07-05,50,25,False
...,...,...,...
2020-05-03,16,16,False
2020-05-10,16,27,False
2020-05-17,24,27,False
2020-05-24,0,17,False
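
One way the weekly scores above could be rolled up into model features is a monthly mean per keyword. The sketch below is only an illustration, assuming a frame shaped like the `interest_over_time()` output; the sample values are copied from the first rows shown above, and the helper name `monthly_interest` is hypothetical rather than part of pytrends.

```python
import pandas as pd

# Sample rows copied from the interest_over_time() output above.
weekly = pd.DataFrame(
    {
        "Dublin 1 rent": [54, 40, 27, 53, 50],
        "Dublin 2 rent": [41, 0, 27, 79, 25],
        "isPartial": [False] * 5,
    },
    index=pd.to_datetime(
        ["2015-06-07", "2015-06-14", "2015-06-21", "2015-06-28", "2015-07-05"]
    ),
)
weekly.index.name = "date"


def monthly_interest(trends: pd.DataFrame) -> pd.DataFrame:
    """Average the weekly scores per calendar month (hypothetical helper)."""
    # isPartial is a flag column, not a score, so drop it before averaging
    scores = trends.drop(columns=["isPartial"], errors="ignore")
    # group rows by the year-month period of their date
    return scores.groupby(scores.index.to_period("M")).mean()


monthly = monthly_interest(weekly)
print(monthly)
```

Google Trends scores are scaled relative to the peak within a single request, so values from separate `build_payload` calls are probably not directly comparable without a shared reference keyword.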


### 2) Daft Listings API
As seen below, this API (https://github.com/AnthonyBloomer/daftlistings/) is very useful, yet simple to use and quick to get up to speed with.  
The sample below uses the API to search for furnished 3-bed rental listings in "Dublin City" with a maximum price of 2800 EUR.  
We fetch all such listings and build a dataframe containing the useful features of each property: `price`, `facilities`, `formalised_address`, `num_bedrooms`, `num_bathrooms`, `latitude` and `longitude`.  
This data would help us recommend properties to the target end user, and the geographical coordinates would let us analyse the listings visually.  

In [55]:
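def translate_listing_to_json(listing):
    """Return a rental listing as a dict, or None if it cannot be used."""
    try:
        # skip anything that is not a rental listing
        if listing.search_type != 'rental':
            return None
        return listing.as_dict()
    except Exception:
        # any failure while reading or converting the listing: drop it
        return None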

In [56]:
daft = Daft()
daft.set_county("Dublin City")
daft.set_listing_type(RentType.ANY)
daft.set_max_price(2800)
daft.set_min_beds(3)
daft.set_max_beds(3)
daft.set_furnished(True)
daft.set_sort_order(SortOrder.ASCENDING)
daft.set_sort_by(SortType.PRICE)

In [None]:
listings = daft.search()
properties = []
print("Translating {} listing objects into JSON; this will take a few minutes".format(len(listings)))
print("Ignore the error message")

# time the translation
start = time.time()
properties = Parallel(n_jobs=6, prefer="threads")(delayed(translate_listing_to_json)(listing) for listing in listings)
properties = [p for p in properties if p is not None] # remove the None
end = time.time()
print("Time for json translations {}s".format(end-start))

In [58]:
listings

[Listing (Vantage, Central Park, Leopardstown, Dublin 18),
 Listing (Abbot Court Apartments, Cualanor, Upper Glenageary Road, Dun Laoghaire, Co. Dublin),
 Listing (Charlotte Apartments, Charlotte Apartments, Honeypark, Dun Laoghaire, Co. Dublin),
 Listing (Gandon View, Gandon Park, Lucan, Co. Dublin),
 Listing (Clancy Quay, By Kennedy Wilson, Dublin 8),
 Listing (St. Raphaels, Stillorgan, Stillorgan, Co. Dublin),
 Listing (Bridgefield, Northwood, Bridgefield, Northwood, Santry, Dublin 9),
 Listing (Neptune Apartments, Honeypark, Dun Laoghaire, Co. Dublin),
 Listing (Elmfield Apartments, Ballyogan Road, Leopardstown, Dublin 18),
 Listing (St. Edmunds, St Edmunds, Lucan, Co. Dublin),
 Listing (The Grange, Brewery Road, Stillorgan, Co. Dublin),
 Listing (Rockview, Blackglen Road, Sandyford, Dublin 18),
 Listing (From Here),
 Listing (Windsor Road, Rathmines, Dublin 6, Rathmines, Dublin 6, South Dublin City),
 Listing (South Circular Road, Dublin 8, South Dublin City),
 Listing (Gerardstow

In [145]:
df = pd.DataFrame(properties)
df = df[['price', 'facilities', 'formalised_address', 'num_bedrooms', 'num_bathrooms', 'latitude', 'longitude']]
df.head()

Unnamed: 0,price,facilities,formalised_address,num_bedrooms,num_bathrooms,latitude,longitude
0,165,"[Parking, Central Heating, Cable Television, W...","Windsor Road, Rathmines, Dublin 6, Rathmines, ...",3,1,53.3175647939334,-6.25748129083226
1,1200,"[Central Heating, Cable Television, Washing Ma...","South Circular Road, Dublin 8, South Dublin City",3,1,53.33127698319413,-6.28267682702122
2,1350,"[Parking, Central Heating, Washing Machine, Mi...","Gerardstown Mews, Ballyboughal, North Co. Dublin",3,2,53.5432261,-6.2700563
3,1455,"[Parking, Central Heating, Cable Television, W...","Chapel Farm Drive, Lusk, North Co. Dublin",3,3,53.522223221700216,-6.165513989163145
4,1485,"[Parking, Central Heating, Cable Television, W...","Marlfield Lawn, Tallaght, Dublin 24, South Co....",3,3,53.270907,-6.369826
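
The filtering by user preference described in the introduction could be sketched on this dataframe as follows. This is a minimal illustration, not the final recommender: the column names follow the dataframe above, the addresses and prices are taken from its sample rows, the facility lists are shortened for illustration, and `filter_listings` is a hypothetical helper.

```python
import pandas as pd

# Illustrative rows; facility lists are shortened, not the full API values.
listings_df = pd.DataFrame(
    {
        "price": [1200, 1350, 1455],
        "facilities": [
            ["Central Heating", "Cable Television"],
            ["Parking", "Central Heating", "Washing Machine"],
            ["Parking", "Central Heating", "Cable Television"],
        ],
        "formalised_address": [
            "South Circular Road, Dublin 8, South Dublin City",
            "Gerardstown Mews, Ballyboughal, North Co. Dublin",
            "Chapel Farm Drive, Lusk, North Co. Dublin",
        ],
    }
)


def filter_listings(df, max_price, required_facilities):
    """Keep rows within budget that offer every required facility."""
    required = set(required_facilities)
    # True for rows whose facility list contains all required items
    has_all = df["facilities"].apply(lambda items: required.issubset(items))
    return df[(df["price"] <= max_price) & has_all]


matches = filter_listings(listings_df, max_price=1400, required_facilities=["Parking"])
print(matches["formalised_address"].tolist())
```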


### 3) Ireland Crimes CSV
This dataset was obtained from https://data.gov.ie/dataset/crimes-at-garda-stations-level-2010-2016 and gives a statistical breakdown of the different types of crime recorded in the area covered by each police (Garda) station in Ireland.  
The sample below shows the layout: one column per offence type and year (2004 to 2016 in this file), plus `x` and `y` station coordinates that should support some high-level visual analysis. The `x` and `y` values appear to be projected grid coordinates rather than latitude and longitude, so they would likely need converting before being plotted alongside the other datasets.  
As mentioned earlier, this dataset would be aggregated for each of the districts in Dublin and could help us filter out districts with higher crime rates.  

In [146]:
crimes_df = pd.read_csv('crimes_garda_stations.csv', encoding = "ISO-8859-1")
crimes_df.head()

Unnamed: 0,id,Station,Divisions,x,y,"Attempts or threats to murder, assaults, harassments and related offences 2004","Attempts or threats to murder, assaults, harassments and related offences 2005","Attempts or threats to murder, assaults, harassments and related offences 2006","Attempts or threats to murder, assaults, harassments and related offences 2007","Attempts or threats to murder, assaults, harassments and related offences 2008",...,"Offences against government, justice procedures and organisation of crime 2007","Offences against government, justice procedures and organisation of crime 2008","Offences against government, justice procedures and organisation of crime 2009","Offences against government, justice procedures and organisation of crime 2010","Offences against government, justice procedures and organisation of crime 2011","Offences against government, justice procedures and organisation of crime 2012","Offences against government, justice procedures and organisation of crime 2013","Offences against government, justice procedures and organisation of crime 2014","Offences against government, justice procedures and organisation of crime 2015","Offences against government, justice procedures and organisation of crime 2016*"
0,20441,Abbeyfeale,Limerick Division,112219.0,126928.0,25,38,25,40,45,...,1,1,11,15,5,7,7,3,0,0
1,20117,Abbeyleix,Laois/Offaly Division,244196.0,184819.0,9,12,14,12,34,...,3,2,1,0,0,5,3,2,5,0
2,20424,Adare,Limerick Division,146337.0,146092.0,1,3,0,6,3,...,0,0,0,0,0,0,0,5,2,0
3,20217,Aglish,Waterford Division,212252.0,91029.0,4,4,3,1,2,...,0,0,0,1,0,0,0,0,0,0
4,20522,Ahascragh,Galway Division,178054.0,238416.0,1,0,1,1,4,...,0,0,2,3,1,0,0,0,0,0
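
Because every offence type is spread across one column per year, aggregation is probably easier after reshaping the table to long format. The sketch below assumes the column names follow the `<offence> <year>` pattern seen above (with an optional trailing `*`); the values are copied from the first two sample rows, and `crimes_to_long` is a hypothetical helper name.

```python
import re

import pandas as pd

# Two stations and three of the wide columns from the sample above.
sample = pd.DataFrame(
    {
        "Station": ["Abbeyfeale", "Abbeyleix"],
        "Divisions": ["Limerick Division", "Laois/Offaly Division"],
        "Attempts or threats to murder, assaults, harassments and related offences 2004": [25, 9],
        "Attempts or threats to murder, assaults, harassments and related offences 2005": [38, 12],
        "Offences against government, justice procedures and organisation of crime 2016*": [0, 0],
    }
)

# Offence name, whitespace, four-digit year, optional asterisk.
YEAR_SUFFIX = re.compile(r"^(?P<offence>.+?)\s+(?P<year>\d{4})\*?$")


def crimes_to_long(df, id_cols=("Station", "Divisions")):
    """Reshape '<offence> <year>' columns into one row per station, offence and year."""
    value_cols = [c for c in df.columns if YEAR_SUFFIX.match(c)]
    long_df = df.melt(
        id_vars=list(id_cols),
        value_vars=value_cols,
        var_name="column",
        value_name="incidents",
    )
    # split the original column name into its offence and year parts
    parts = long_df["column"].str.extract(YEAR_SUFFIX.pattern)
    long_df["offence"] = parts["offence"]
    long_df["year"] = parts["year"].astype(int)
    return long_df.drop(columns="column")


long_df = crimes_to_long(sample)
# total incidents per division and offence type, across all years
totals = long_df.groupby(["Divisions", "offence"])["incidents"].sum()
print(totals)
```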


### 4) Foursquare Places API
Finally, the last part follows the approach taken in earlier weeks of this course, where we analysed neighborhoods in Toronto, Canada.  
The challenge here is to list the districts that make up Dublin City and obtain their respective geographical coordinates using the Nominatim geolocator.  
The sample code below shows how we plan to construct the final dataframe, where each row is an individual venue along with its attributes, including its geolocation coordinates.  
One-hot encoding of the venue categories can then be used to build a feature describing the distribution of venue types, as well as the most common venue type, in each district of Dublin city.  

In [72]:
CLIENT_ID = 'XXXX' # your Foursquare ID
CLIENT_SECRET = 'YYYY' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version
limit = 100  # maximum number of venues returned per request

In [139]:
def getNearbyVenues(names, latitudes, longitudes, radius=2000):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            limit)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['District', 
                  'District Latitude', 
                  'District Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues
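
The function above indexes `categories[0]` directly, which would raise an `IndexError` for any venue returned without a category. A more defensive way to flatten the response items might look like the sketch below. The helper name `extract_venue_rows` is hypothetical, the payload structure simply mirrors the keys accessed in the function above, and the second sample item is invented to show the empty-category case.

```python
def extract_venue_rows(items, name, lat, lng):
    """Flatten 'explore' response items into row tuples (hypothetical helper)."""
    rows = []
    for item in items:
        venue = item.get("venue", {})
        location = venue.get("location", {})
        # a venue may come back with an empty category list
        categories = venue.get("categories") or []
        category = categories[0]["name"] if categories else None
        rows.append(
            (
                name,
                lat,
                lng,
                venue.get("name"),
                location.get("lat"),
                location.get("lng"),
                category,
            )
        )
    return rows


# First item mirrors a row from the venues output; second is invented.
sample_items = [
    {
        "venue": {
            "name": "The Celt",
            "location": {"lat": 53.350442, "lng": -6.255071},
            "categories": [{"name": "Pub"}],
        }
    },
    {
        "venue": {
            "name": "Uncategorised place",
            "location": {"lat": 53.35, "lng": -6.26},
            "categories": [],
        }
    },
]

rows = extract_venue_rows(sample_items, "Dublin 1", 53.352488, -6.256646)
print(rows)
```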

In [126]:
dublin_df = pd.read_csv('dublin_districts.csv', encoding = "utf-8")
dublin_df

Unnamed: 0,District,Latitude,Longitude
0,Dublin 1,,
1,Dublin 2,,
2,Dublin 3,,
3,Dublin 4,,
4,Dublin 5,,
5,Dublin 6,,
6,Dublin 6W,,
7,Dublin 7,,
8,Dublin 8,,
9,Dublin 9,,


In [128]:
geolocator = Nominatim(user_agent="dublin_districts")

for district in dublin_df['District']:
    location = geolocator.geocode(district)

    # geocode() returns None when no match is found; skip instead of crashing
    if location is None:
        print('No geocoding result for {}.'.format(district))
        continue

    latitude = location.latitude
    longitude = location.longitude
    print('The geographical coordinates of {} are {}, {}.'.format(district, latitude, longitude))

    dublin_df.loc[dublin_df['District']==district, ['Latitude']] = latitude
    dublin_df.loc[dublin_df['District']==district, ['Longitude']] = longitude

dublin_df.head()

The geographical coordinates of Dublin 1 are 53.3524881, -6.256645689721826.
The geographical coordinates of Dublin 2 are 53.33894015, -6.252712821759609.
The geographical coordinates of Dublin 3 are 53.361223100000004, -6.1854668060000355.
The geographical coordinates of Dublin 4 are 53.32750729999999, -6.227485885927834.
The geographical coordinates of Dublin 5 are 53.3834538, -6.181923245473566.
The geographical coordinates of Dublin 6 are 53.3176976, -6.259525132569765.
The geographical coordinates of Dublin 6W are 53.30928205, -6.299434891747282.
The geographical coordinates of Dublin 7 are 53.3605505, -6.284470454564643.
The geographical coordinates of Dublin 8 are 53.350262900000004, -6.320212883866121.
The geographical coordinates of Dublin 9 are 53.3860497, -6.245577085317763.
The geographical coordinates of Dublin 10 are 53.34321655, -6.360963597131269.
The geographical coordinates of Dublin 11 are 53.386613600000004, -6.292626932293775.
The geographical coordinates of Dublin 12 are 53.32052905, -6.32

Unnamed: 0,District,Latitude,Longitude
0,Dublin 1,53.352488,-6.256646
1,Dublin 2,53.33894,-6.252713
2,Dublin 3,53.361223,-6.185467
3,Dublin 4,53.327507,-6.227486
4,Dublin 5,53.383454,-6.181923
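
Since the introduction asks whether distance from the city centre plays a role, these coordinates could be turned into a distance feature. The sketch below uses the haversine great-circle formula on the five coordinates shown above. Treating the Dublin 1 point as the city centre is an assumption made purely for illustration, and `haversine_km` is a hypothetical helper.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlambda = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


# Coordinates copied from the dataframe output above.
districts = {
    "Dublin 1": (53.352488, -6.256646),
    "Dublin 2": (53.33894, -6.252713),
    "Dublin 3": (53.361223, -6.185467),
    "Dublin 4": (53.327507, -6.227486),
    "Dublin 5": (53.383454, -6.181923),
}

# Assumption: use the Dublin 1 point as a stand-in for the city centre.
centre_lat, centre_lon = districts["Dublin 1"]

distances_km = {
    name: haversine_km(centre_lat, centre_lon, lat, lon)
    for name, (lat, lon) in districts.items()
}

for name, dist in distances_km.items():
    print("{}: {:.2f} km".format(name, dist))
```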


In [141]:
dublin_venues = getNearbyVenues(names=dublin_df['District'],
                                   latitudes=dublin_df['Latitude'],
                                   longitudes=dublin_df['Longitude']
                                  )

Dublin 1
Dublin 2
Dublin 3
Dublin 4
Dublin 5
Dublin 6
Dublin 6W
Dublin 7
Dublin 8
Dublin 9
Dublin 10
Dublin 11
Dublin 12
Dublin 13
Dublin 14
Dublin 15
Dublin 16
Dublin 17
Dublin 18
Dublin 20
Dublin 22
Dublin 24


In [152]:
dublin_venues

Unnamed: 0,District,District Latitude,District Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Dublin 1,53.352488,-6.256646,147 Deli,53.353410,-6.259807,Deli / Bodega
1,Dublin 1,53.352488,-6.256646,The Celt,53.350442,-6.255071,Pub
2,Dublin 1,53.352488,-6.256646,Gate Theatre,53.353113,-6.261997,Theater
3,Dublin 1,53.352488,-6.256646,Murray's Bar,53.352419,-6.261256,Pub
4,Dublin 1,53.352488,-6.256646,Dealz,53.350623,-6.263183,Discount Store
...,...,...,...,...,...,...,...
1260,Dublin 22,53.317018,-6.438519,Google Ireland Data Centre,53.313348,-6.447727,IT Services
1261,Dublin 22,53.317018,-6.438519,The Swallows,53.323342,-6.423912,Bar
1262,Dublin 22,53.317018,-6.438519,Cuisine de France HQ,53.322246,-6.457265,Bakery
1263,Dublin 22,53.317018,-6.438519,SPAR,53.320557,-6.413150,Convenience Store


In [142]:
dublin_venues.groupby('District')['Venue'].count()

District
Dublin 1     100
Dublin 10     28
Dublin 11     33
Dublin 12     38
Dublin 13     60
Dublin 14    100
Dublin 15     80
Dublin 16      5
Dublin 17     39
Dublin 18     31
Dublin 2     100
Dublin 20     60
Dublin 22      6
Dublin 24      1
Dublin 3      43
Dublin 4     100
Dublin 5      43
Dublin 6     100
Dublin 6W     69
Dublin 7     100
Dublin 8      71
Dublin 9      58
Name: Venue, dtype: int64
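
The one-hot encoding step mentioned at the start of this section could be sketched as follows. This is a minimal illustration on the nine venue rows visible in the `dublin_venues` output above, not the full pipeline; the variable names are hypothetical.

```python
import pandas as pd

# District and category pairs copied from the dublin_venues rows shown above.
venues = pd.DataFrame(
    {
        "District": ["Dublin 1"] * 5 + ["Dublin 22"] * 4,
        "Venue Category": [
            "Deli / Bodega",
            "Pub",
            "Theater",
            "Pub",
            "Discount Store",
            "IT Services",
            "Bar",
            "Bakery",
            "Convenience Store",
        ],
    }
)

# One indicator column per venue category.
onehot = pd.get_dummies(venues["Venue Category"], dtype=float)
onehot.insert(0, "District", venues["District"])

# Mean of the indicators = share of each category within a district.
freq = onehot.groupby("District").mean()

# Most frequent category per district (ties resolve to the first column).
top_category = freq.idxmax(axis=1)

print(freq.round(2))
print(top_category)
```

Districts with very few venues in the counts above (for example Dublin 24 with 1 and Dublin 16 with 5) would give noisy category shares, so they may need a larger search radius or separate handling before clustering.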