## Dublin Intro & Motivation

More and more people come to work and study in Ireland, attracted by its beautiful scenery, its wild and romantic culture and its broad prospects for development. The large number of newcomers has caused rents to rise in Ireland, especially in the Dublin area.

Many large companies are established in Dublin (Google, Facebook, Twitter, LinkedIn, Huawei, TikTok, etc.), so Dublin may be the best place to live in Ireland, whether for work or study. Dublin has Trinity College Dublin (TCD), University College Dublin (UCD), Dublin City University (DCU), Technological University Dublin (TU Dublin) and other universities that attract a large number of students. In this context, it becomes very difficult for someone newly arrived in Dublin to find a suitable house. As an international student studying in Ireland, my requirement for a house is to keep the rent as low as possible while staying in a safe area. Therefore, in this notebook, I will mainly use rental data collected from Daft.ie to explore where to rent a house, and share my own rental experience in Dublin where appropriate.

**If, like me two years ago, you know little about Dublin and are desperate to find a suitable house in the Dublin property market, I hope this notebook can help you.**

**The map and the final conclusions are at the bottom; you can scroll straight down to them.**

In [None]:
# import the necessary libraries
import pandas as pd
import numpy as np

import re
import matplotlib.pyplot as plt

# folium, json and requests will be used for creating maps
import folium 
import json
import requests

plt.style.use("ggplot")

## 1. Import Dataset & Basic Info

* 1. Import dataset
* 2. Basic information of dataset
* 3. data preprocessing 


### 1.1 Import dataset

In [None]:
# import dataset
data = pd.read_csv("../input/predicting-dublin-rental-daftie/daft_v_2.csv")

In [None]:
data.head(5)

There is a problem in the "price" column: its values are strings. To analyse them or pass them to machine learning models, we need to convert them to an integer or float data type. Moreover, rents come in two different units, "Per week" and "Per month", so I will convert weekly rents to monthly ones.

First, we need to check the basic information of this dataset.

### 1.2 Basic information of dataset

In [None]:
# basic info of dataset
data.shape

From the output above, we can easily see that there are 2718 rows and 10 columns in this dataset.

Each row represents a property in Dublin city. The 10 columns respectively represent the "price", "address", "number of bathrooms", "number of bedrooms", "furnished or not", "description", "property type", "property ID", "longitude" and "latitude".


In [None]:
data.describe()

### 1.3 data preprocessing

In [None]:
# convert the string in "price" col to integer.

p_list = []

for price in data["price"]:
    num = re.findall(r"\d+\.?\d*", price)  # raw string avoids invalid-escape warnings
    num = "".join(num)
    num = int(num)
        
    if "Per week" in price:
        num = (num/7) * 30 # convert the weekly rental to monthly

    p_list.append(num)
    
data_copy = data.copy()
data_copy["price"] = p_list

In [None]:
data_copy.head(5)

Now the values in the "price" column are numeric (weekly rents converted to monthly ones become floats).
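
The parsing step above can also be wrapped in a small helper, which makes the weekly-to-monthly rule easy to check in isolation. This is only a minimal sketch, not the cell used in this notebook: the function name `to_monthly` and the sample strings are hypothetical, and it assumes each price string holds a single amount (possibly with thousands separators) followed by "Per week" or "Per month". It keeps the 30/7 factor used above; 52/12 would be a slightly different convention.

```python
import re

def to_monthly(price, days_per_month=30):
    """Parse a price string such as '€2,500 Per month' into a monthly amount."""
    # Drop thousands separators, then take the first number in the string
    match = re.search(r"\d+(?:\.\d+)?", price.replace(",", ""))
    if match is None:
        raise ValueError(f"no amount found in {price!r}")
    amount = float(match.group())
    if "per week" in price.lower():
        amount = amount / 7 * days_per_month  # weekly rent -> monthly rent
    return amount

print(to_monthly("€2,500 Per month"))  # 2500.0
print(to_monthly("€350 Per week"))     # 1500.0
```

Raising an error on strings without any amount makes malformed rows visible instead of silently turning them into a wrong number.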

## 2. EDA

* 2.1 Monthly rental distribution
* 2.2 "rental_per_room" distribution
* 2.3 Remove outliers in "rental_per_room" col

### 2.1 Monthly rental distribution

In [None]:
fig1 = plt.figure(figsize=(15,8))
ax1 = plt.subplot()

ax1.hist(x = data_copy["price"],bins = 100)

plt.xlim((0,15000)) # set the x limited value.
plt.xlabel("monthly rental")
plt.ylabel("count")
plt.title("Monthly Rental Histogram")
plt.show()

From the histogram above, most properties in Dublin rent for between 1000 and 4000 euro per month.


However, many properties have several bedrooms. Most of the time, we just want to rent one bedroom (sharing the kitchen, and maybe the bathroom too). So I will divide the monthly rent by the number of bedrooms to get a new feature: **"rental_per_room"**.


In [None]:
data_copy["rental_per_room"] = data_copy["price"]/data_copy["bedroom"]

In [None]:
data_copy.head(5)

The last column, "rental_per_room", is the new feature.

There are some infinite values in this new feature ("rental_per_room"). The reason is that some properties list 0 bedrooms, which is impossible, so the division produces infinity. I will simply replace these infinite values in the "rental_per_room" column with -1.
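
An alternative to the -1 placeholder is to store the undefined values as NaN, which pandas skips by default in `mean()` and `median()`, so no extra filter is needed later. A minimal sketch on made-up numbers (the `toy` frame is an assumption for illustration, not rows from the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical listings: the last one reports 0 bedrooms
toy = pd.DataFrame({"price": [1500.0, 3000.0, 1200.0], "bedroom": [1, 2, 0]})

# Division by zero bedrooms yields inf; mark it as missing instead of -1
per_room = (toy["price"] / toy["bedroom"]).replace([np.inf, -np.inf], np.nan)

print(per_room.isna().sum())  # number of listings without a usable value
print(per_room.mean())        # NaN rows are skipped automatically
```

This notebook keeps the -1 convention, so the later cells filter on `!= -1` instead.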

In [None]:
# use -1 to replace the infinite value in "rental_per_room".

data_copy.loc[np.isinf(data_copy["rental_per_room"]), "rental_per_room"] = -1  # .loc avoids chained assignment
data_copy

In [None]:
# check how many real properties with 0 bedroom

na_num = (data_copy["bedroom"] == 0).sum()

print(na_num)

There are 158 properties with 0 bedrooms.

In [None]:
sum(data_copy["description"][data_copy["rental_per_room"] == -1].isnull())

Of these 158 "0 bedroom" properties, 51 also have no description.

I really doubt whether these "0 bedroom" properties with no description actually exist.

### 2.2 "rental_per_room" distribution

In [None]:
fig1 = plt.figure(figsize=(15,8))
ax1 = plt.subplot()

ax1.hist(x = data_copy["rental_per_room"],
         bins = "auto",color="blue",
         alpha = 0.6,width = 100,

        )

plt.xticks(range(0,8000,500))
plt.xlabel("The rental per room")
plt.ylabel("count")
plt.title("The rental per room in Dublin")
plt.show()

Now we have the histogram of rent per room in Dublin. First, many properties sit at around 0 euro per month. These are the properties with 0 bedrooms, whose value was set to -1, so they can be treated as NaN values. If you ignore them, the rent for one room ranges from about 500 euro to 4000 euro. That is a huge range, so I will check some statistics of "rental_per_room".

### 2.3 Average monthly rental in Dublin
I will use the mean and median of the "rental_per_room" column to show how many euro a single room costs per month.

When computing these statistics, I exclude the NaN values in the "rental_per_room" column, that is, the -1 values.

In [None]:
# remove the -1 value in "the rental per room" column and calculate the mean value.
data_copy["rental_per_room"][data_copy["rental_per_room"] != -1].mean()

The **mean** rent for one room in Dublin is 1538 euro per month.


In [None]:
data_copy["rental_per_room"][data_copy["rental_per_room"] != -1].median()

The **median** rent for one room in Dublin is 1500 euro per month.

That is a really high rent.


### 2.3 Remove Outliers in "rental_per_room" col

However, since this dataset contains only just over 2,700 listings, its statistics can easily be distorted by outliers. So the outliers should be removed before the statistics are recalculated.

**Two ways to determine outliers:**
* 3 Sigma rule(if data in normal distribution)
* boxplot

#### 2.3.1 Three Sigma rule(if data in normal distribution)

Three sigma rule: provided the data follows a normal distribution, about 95% of the observations lie within two standard deviations of the mean, and about 99.7% lie within three standard deviations. Therefore, a data point more than three standard deviations from the mean is very unlikely under that distribution and can be treated as an outlier.

But to use the three sigma rule to judge outliers, the data should conform to a normal distribution.
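
As an illustration of the rule itself, here is a minimal sketch on made-up rents (the numbers and the function name `three_sigma_outliers` are assumptions, not part of this notebook). It uses the population standard deviation; using the sample standard deviation would shift the threshold slightly.

```python
import statistics

def three_sigma_outliers(values):
    """Return the values lying more than three standard deviations from the mean."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation
    return [v for v in values if abs(v - mean) > 3 * std]

# Twenty ordinary rents plus one extreme listing (made-up numbers)
rents = [1400, 1450, 1500, 1550, 1600] * 4 + [9000]
print(three_sigma_outliers(rents))  # [9000]
```

Note that an extreme value inflates the standard deviation it is compared against, which is one more reason the rule is only reliable on roughly normal data.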


**Check whether "rental_per_room" conforms to a normal distribution:**

In [None]:
# Calculate Skewness
print("Skewness: ",data_copy["rental_per_room"][data_copy["rental_per_room"] != -1.0].skew()) 

In [None]:
# Calculate Kurtosis
print("Kurtosis: ",data_copy["rental_per_room"][data_copy["rental_per_room"] != -1.0].kurt()) 

In [None]:
from scipy import stats
p = stats.shapiro(data_copy["rental_per_room"][data_copy["rental_per_room"] != -1.0])[1]

p < 0.05

**Conclusion: the "rental_per_room" column does not conform to a normal distribution.**

The absolute value of the kurtosis is greater than 2, so the data cannot be regarded as approximately normal. In addition, I used the Shapiro-Wilk test to check whether the data conforms to a normal distribution. The p-value of the test is less than 0.05, so the null hypothesis is rejected; that is, the data does not conform to a normal distribution.

Because the data does not conform to a normal distribution, there is no guarantee that the three sigma rule will find outliers correctly. Therefore, I will use a box plot to detect outliers instead.
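
The box-plot rule flags any value below Q1 - 1.5·IQR or above Q3 + 1.5·IQR, where IQR = Q3 - Q1. A minimal sketch of that rule on made-up rents, using only the standard library (the helper name `iqr_bounds` and the numbers are assumptions; `method="inclusive"` interpolates linearly between data points, which should agree with the default of pandas `quantile`):

```python
import statistics

def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) fences used by the box-plot rule."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

rents = [1000, 1200, 1300, 1400, 1500, 1600, 1700, 1800, 5000]  # made-up numbers
low, high = iqr_bounds(rents)
kept = [r for r in rents if low < r < high]

print(low, high)  # 700.0 2300.0
print(kept)       # every rent except 5000
```

Unlike the three sigma rule, the fences depend only on the quartiles, so a single extreme value barely moves them.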

#### 2.3.2 Boxplot of "rental_per_room"

In [None]:
# Use a boxplot to inspect the distribution of the data
rental_per_month = list(data_copy["rental_per_room"])

# keep only valid values (-1 marks properties with 0 bedrooms)
l = [rent for rent in rental_per_month if rent != -1.0]
fig = plt.figure(figsize=(10,7))

ax1 = plt.subplot()
ax1 = plt.boxplot(x = l)

In [None]:
fig1 = plt.figure(figsize=(15,8))
ax1 = plt.subplot()

ax1.hist(x = l,
         bins = "auto",color="blue",
         alpha = 0.6,width = 100,

        )

plt.xticks(range(0,8000,500))
plt.xlabel("The rental per room")
plt.ylabel("count")
plt.title("The rental per room in Dublin")
plt.show()

From the boxplot, a single room with a monthly rent of more than about 3,000 euro is considered an outlier.

I will delete these outliers and recalculate the average and median of "rental_per_room".

In [None]:
# remove outliers in boxplot

Q1 = data_copy["rental_per_room"][data_copy["rental_per_room"] != -1.0 ].quantile(0.25)
Q3 = data_copy["rental_per_room"][data_copy["rental_per_room"] != -1.0 ].quantile(0.75)
IQR = Q3 - Q1


In [None]:
l1 = []
for index in range(len(l)):
    if Q1 - 1.5*IQR < l[index] < Q3 + 1.5*IQR:
        l1.append(l[index])
        
len(l1) 

### 2.4 mean value & median value

In [None]:
fig = plt.figure(figsize=(10,7))

ax1 = plt.subplot()
ax1 = plt.boxplot(x = l1)

In [None]:
np.mean(l1)

In [None]:
np.median(l1)

After removing the outliers, the average monthly rent for a room falls from 1538 euro to 1496 euro. Even so, the average cost of renting a room is still high. Next, I will use a map to find out which locations in Dublin are good places to rent.

## 3. Maps

3.1 clean "longitude" and "latitude" columns

3.2 **Parish areas** in Dublin city

3.3 Demonstrate mean rental(per room) for each parish area(**Map**)

In [None]:
# import library
import folium
import json
import requests

### 3.1 Clean "longitude" and "latitude" columns

In [None]:
# clean "longitude" and "latitude" columns

# Some coordinate data contains special characters, delete these special characters
import re
longi = []
lati = []
for index in range(len(data_copy)):
    longitude = data_copy.iloc[index]["longitude"]
    latitude = data_copy.iloc[index]["latitude"]
    
    # optional sign, digits, optional fraction, optional exponent (e.g. 1.5e-05)
    longitude = re.findall(r'-?\d+\.?\d*(?:e-?\d+)?', longitude)[0]
    latitude = re.findall(r'-?\d+\.?\d*(?:e-?\d+)?', latitude)[0]
    
    longi.append(longitude)
    lati.append(latitude)
    
data_copy["longitude"] = longi
data_copy["latitude"] = lati

data_copy.head(5)

### 3.2 Parish areas in Dublin

Parish areas are like small communities within Dublin. I have lived in Grangegorman, Warrenmount and Ranelagh, and in my experience, living in different areas gives you completely different experiences. Therefore, the parish area is also an important factor when renting a house. As Ian said, ["In Ireland, most primary schools are run by the Catholic Church and the rules for enrolling often include complex lists of rules with those in the local parish often being preferred. This means when you are looking for accommodation to rent or buy it can be very important to know in advance which parish the property is located in."](https://www.ianhuston.net/2017/04/mapping-dublin-parish-boundaries/)

In [None]:

# Put the geojson data on Dublin map
dublin_map = folium.Map(location=[53.3302,  -6.3106],zoom_start= 10) # Dublin's map

url = ("https://raw.githubusercontent.com/ihuston/dublin_parishes/master/data/cleaned_dublin_parishes.geojson")
dublin_parishes_edge = f"{url}"

folium.GeoJson(dublin_parishes_edge, name= "geojson").add_to(dublin_map)
dublin_map

Parish areas, large and small, divide Dublin into many small communities.
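
The parish boundaries drawn above are stored as GeoJSON, and any point-in-parish lookup first needs each boundary as a plain list of coordinates. In GeoJSON, a `Polygon` holds a list of rings (the first is the exterior boundary), while a `MultiPolygon` holds a list of such polygons, and every position is ordered `[longitude, latitude]`. A minimal sketch of pulling out the exterior ring, assuming each geometry carries the standard `"type"` field (the helper name `outer_ring` and the toy geometries are hypothetical):

```python
def outer_ring(geometry):
    """Return the exterior ring (a list of [lon, lat] pairs) of a GeoJSON geometry."""
    coords = geometry["coordinates"]
    if geometry["type"] == "Polygon":
        return coords[0]        # first ring is the exterior boundary
    if geometry["type"] == "MultiPolygon":
        return coords[0][0]     # exterior ring of the first polygon only
    raise ValueError(f"unsupported geometry type: {geometry['type']}")

# Toy geometries: the same unit square stored both ways
square = {"type": "Polygon",
          "coordinates": [[[0, 0], [1, 0], [1, 1], [0, 1], [0, 0]]]}
multi = {"type": "MultiPolygon", "coordinates": [square["coordinates"]]}

print(outer_ring(square) == outer_ring(multi))  # True
```

Taking only the first polygon of a `MultiPolygon` is a simplification: a parish made of several separate pieces would need all of them checked.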

### 3.3 Demonstrate mean rental(per room) for each parish area(**Map**)

In [None]:
import json
import requests
import re 

# Parse the geojson file of Dublin parish areas (provided by Ian)
dub_parishes_url = "https://raw.githubusercontent.com/ihuston/dublin_parishes/master/data/cleaned_dublin_parishes.geojson"
dub_text = requests.get(dub_parishes_url).text

dub_parishes = json.loads(dub_text)
dub_parishes["features"][1]

# Create a map from parish name to polygon
parishes_poly = {}
for index in range(len(dub_parishes["features"])):
    parishes = dub_parishes["features"][index]["properties"]["Parish Name"]
    poly = dub_parishes["features"][index]["geometry"]["coordinates"][0]
    if len(poly) == 1:
        poly = poly[0]
    parishes_poly[parishes] = poly
    
print(parishes_poly)

In [None]:
# A function that calculates whether a certain coordinate point is in the polygon,
# and returns a boolean value to indicate whether the point is in the polygon.

def is_in_poly(p, poly):
    """
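    Ray-casting test: is point p inside the polygon poly?

    :param p: [x, y], here [longitude, latitude]
    :param poly: [[x, y], [x, y], ...], the polygon's vertices in order
    :return: True if p is inside poly (or on a vertex or non-horizontal edge), else False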
    """
    px, py = p
    is_in = False
    for i, corner in enumerate(poly):
        next_i = i + 1 if i + 1 < len(poly) else 0
        x1, y1 = corner
        x2, y2 = poly[next_i]
        if (x1 == px and y1 == py) or (x2 == px and y2 == py):  # if point is on vertex
            is_in = True
            break
        if min(y1, y2) < py <= max(y1, y2):  # edge crosses the horizontal line through the point
            x = x1 + (py - y1) * (x2 - x1) / (y2 - y1)
            if x == px:  # if point is on edge
                is_in = True
                break
            elif x > px:  # if point is on left-side of line
                is_in = not is_in
    return is_in
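
The idea behind the function above is the even-odd (ray casting) rule: shoot a horizontal ray from the point to the right and count how many polygon edges it crosses; an odd count means the point is inside. The following is a simplified, self-contained sketch of that rule for a quick sanity check on a square. It is not the function used in this notebook: the name `point_in_polygon` is hypothetical and, unlike `is_in_poly`, it does not treat points lying exactly on the boundary specially.

```python
def point_in_polygon(point, polygon):
    """Even-odd ray casting: True if point lies strictly inside polygon."""
    px, py = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]  # wrap around to close the polygon
        # Does this edge cross the horizontal line y = py?
        if (y1 > py) != (y2 > py):
            # x coordinate where the edge meets that line
            x_cross = x1 + (py - y1) * (x2 - x1) / (y2 - y1)
            if x_cross > px:  # crossing is to the right of the point
                inside = not inside
    return inside

square = [(0, 0), (4, 0), (4, 4), (0, 4)]
print(point_in_polygon((2, 2), square))  # True
print(point_in_polygon((5, 2), square))  # False
```

Horizontal edges are skipped by the crossing test, so the division by `y2 - y1` never sees a zero.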
 

In [None]:

# Add a column called "parishes" to data_copy; "na" marks listings that fall in no parish polygon
data_copy["parishes"] = "na"
for index in range(len(data_copy)):
    # get the longitude and latitude
    longitude = float(data_copy.iloc[index]["longitude"])
    latitude = float(data_copy.iloc[index]["latitude"])
    position = [longitude,latitude]

    for key,value in parishes_poly.items():
        if is_in_poly(position,value):
            data_copy.iloc[index,-1] = key
            break

In [None]:
data_copy.head(5)

In [None]:

# Calculate the mean value of "rental_per_room" in different parish area.
parishes_mean_rental_per_room = data_copy["rental_per_room"][data_copy["rental_per_room"] != -1.0].groupby(data_copy["parishes"]).mean()
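
The grouping step above can be sketched on toy data as follows. The frame and parish labels are made up for illustration; the sketch also drops rows labelled "na" (listings that matched no parish), whereas the cell above keeps them as a group of their own.

```python
import pandas as pd

toy = pd.DataFrame({
    "parishes": ["A", "A", "B", "na", "B"],
    "rental_per_room": [1000.0, 1400.0, 900.0, 2000.0, -1.0],
})

# Drop the -1 placeholder and unmatched listings, then average per parish
valid = toy[(toy["rental_per_room"] != -1) & (toy["parishes"] != "na")]
mean_by_parish = valid.groupby("parishes")["rental_per_room"].mean()

print(mean_by_parish.to_dict())  # {'A': 1200.0, 'B': 900.0}
```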

In [None]:
parish = pd.DataFrame(parishes_mean_rental_per_room)
parish

In [None]:
parish["parishes_name"] = parish.index
parish = parish.reset_index(drop=True)  # reset_index returns a new frame, so assign it back
parish

In [None]:
parish.columns

In [None]:
dict(parishes_mean_rental_per_room)

In [None]:
dublin_map = folium.Map(location=[53.3302,  -6.3106],zoom_start= 11) # Dublin's map

url = ("https://raw.githubusercontent.com/ihuston/dublin_parishes/master/data/cleaned_dublin_parishes.geojson")
dublin_parishes_edge = f"{url}"

# folium.GeoJson(dublin_parishes_edge, name= "geojson").add_to(dublin_map)

f = folium.Choropleth(
        geo_data= dublin_parishes_edge,
        data = parish,
        columns=["parishes_name","rental_per_room"],
        key_on="feature.properties.Parish Name",
        name="choropleth",
        bins = 8,
        fill_color = "BuPu",
        fill_opacity=0.7,
        line_opacity=0.2,
        highlight=True
    ).add_to(dublin_map)

folium.LayerControl().add_to(dublin_map)


f.geojson.add_child(
    folium.features.GeoJsonTooltip(['Parish Name'],labels=False)
)


dublin_map

### Insights from map

The average monthly rent per room in each parish area ranges from 755 to 2481 euro. This dataset was collected in September 2020; since most schools start in September, rents may be higher than usual. First, I will describe what the map above shows. After that, I will describe the places where I used to live.

**Insight from map:**

There are some black areas on the map; these are parish areas with no listings in this dataset.

The darker the color, the higher the rent. The most expensive areas are clearly Dalkey and Marley Grange (2265-2481 euro), and rents in most of the parish areas adjacent to them are also very high. Interestingly, though, rents in Sallynoggin and Johnstown, next to Dalkey, are relatively low (755-971 euro). If you study at UCD, I think Sallynoggin and Johnstown may be very good choices. In addition, Milltown has lower rents than its surrounding areas while also being closer to the city center. If your company or school is in the city center, Milltown may be a good choice.

**My experience:**

As I mentioned before, I used to live in Grangegorman, Warrenmount and Ranelagh. Rents in Grangegorman and Warrenmount are relatively low. Life in Ranelagh left a deep impression on me, and I think Ranelagh is one of the most livable areas in Dublin. There are many supermarkets, coffee shops and pubs, which makes it very convenient for both daily life and entertainment. But the rent is relatively high.

