# Machine Learning Project Checklist

1. Frame the problem and look at the big picture.

2. **Get the data.**

3. Explore the data to gain insights.

4. Explore many different models and short-list the best ones.

5. Fine-tune your models and combine them into a great solution.

6. Present your solution.

## 2. Get the data

In [1]:
import os

In [2]:
REPO_DIR = os.path.join(os.path.expanduser('~'), 'repos')  # home directory; works on Windows, Linux and macOS
PROJ_DIR = os.path.join(REPO_DIR, 'real_estate_machine_learning')
QUERIES_DIR = os.path.join(PROJ_DIR, 'src', 'osm')
os.chdir(PROJ_DIR)

In [3]:
SRC_DIR = './src'
DATA_DIR = './data'
RAW_DIR = os.path.join(DATA_DIR, 'raw')
EXT_DIR = os.path.join(DATA_DIR, 'ext')
IMG_DIR = './img'
INPUT_DIR = '../real_estate_hungary/output'
FILENAME = 'ForSaleRent_20181101.csv'
INPUT_FILEPATH = os.path.join(INPUT_DIR, FILENAME)

In [4]:
from urllib.error import HTTPError, URLError
from http.client import RemoteDisconnected

In [5]:
import pandas as pd
import numpy as np
import src.preparation as prep
import src.processing as proc
from src.utils import *

## Scraped data
I have written a Python script, based on the [real_estate_hungary](https://github.com/tszereny/real_estate_hungary "tszereny's GitHub page") module, which extracts information from one of the most popular Hungarian [real estate websites](https://ingatlan.com/ "https://ingatlan.com"). In short, it turns the data on the website into tabular form.  
The scraped dataset contains more than 50,000 records of real estate properties in Budapest, the capital city of Hungary.

In [6]:
na_hun_equivalent = 'nincs megadva'  # Hungarian for "not specified"

In [7]:
raw = pd.read_csv(INPUT_FILEPATH, encoding='utf8', na_values=na_hun_equivalent)

Translate column names from Hungarian to English:

In [8]:
raw = proc.transform_naming(raw)

In [9]:
raw.head()

| | address | accessibility | batch_num | ceiling_height | buses | buses_count | furnished | cluster_id | property_id | desc | ... | trolley_buses_count | listing_type | orientation | trams | trams_count | is_ad_active | all_night_services | all_night_services_count | year_built | building_floors |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Budai Bolero II | igen | 0 | | 103\|133E\|33 | 3.0 | | c_3362683 | 26313868 | \| Exkluzív otthon az Ön igényeire szabva! A Bu... | ... | | for-sale | nyugat | 1 | 1.0 | Y | | | 2019.0 | 10 |
| 1 | Csata utca 30. | | 0 | 3 m-nél alacsonyabb | 14\|32\|105 | 3.0 | | c_2374563 | 24714938 | \| XIII kerület közkedvelt részén az Árpád-hí... | ... | | for-sale | északnyugat | 1\|14 | 2.0 | Y | 914\|914A | 2.0 | | 5 |
| 2 | Csata utca 30. | | 0 | 3 m-nél alacsonyabb | 14\|32\|105 | 3.0 | | c_2959407 | 25561892 | \| XIII kerület közkedvelt részén az Árpád-hí... | ... | | for-sale | délnyugat | 1\|14 | 2.0 | Y | 914\|914A | 2.0 | 2018.0 | 5 |
| 3 | Csata utca 30. | | 0 | 3 m-nél alacsonyabb | 14\|32\|105 | 3.0 | | c_4236203 | 27741740 | \| XIII kerület közkedvelt részén az Árpád-hí... | ... | | for-sale | északnyugat | 1\|14 | 2.0 | Y | 914\|914A | 2.0 | 2018.0 | 5 |
| 4 | Csata utca 30. | | 0 | 3 m-nél alacsonyabb | 14\|32\|105 | 3.0 | | c_3801583 | 26996343 | \| XIII kerület közkedvelt részén az Árpád-hí... | ... | | for-sale | északkelet | 1\|14 | 2.0 | Y | 914\|914A | 2.0 | 2018.0 | 5 |


### Available columns

In [9]:
raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 55146 entries, 0 to 55145
Data columns (total 60 columns):
address                     55120 non-null object
accessibility               14456 non-null object
batch_num                   55146 non-null int64
ceiling_height              37722 non-null object
buses                       36823 non-null object
buses_count                 36823 non-null float64
furnished                   8593 non-null object
cluster_id                  55146 non-null object
property_id                 55146 non-null int64
desc                        54531 non-null object
city_district               55146 non-null int64
smoking                     4822 non-null object
floors                      50258 non-null object
energy_perf_cert            2809 non-null object
balcony                     23573 non-null object
bath_and_wc                 29760 non-null object
type_of_heating             49418 non-null object
equipped                    7926 non-null obje

## OpenStreetMap
Locating different attributes of Budapest, such as:
- Boundaries of Budapest and its sub-districts
- Uninhabited areas
- Agglomeration of Budapest

The data is queried from [OpenStreetMap](https://www.openstreetmap.org "OpenStreetMap's homepage") using [overpy](https://github.com/DinoTools/python-overpy "overpy's GitHub page"), a Python wrapper for the Overpass API.

Boundaries of Budapest and its sub-districts:

In [10]:
bound_q = read_txt(os.path.join(QUERIES_DIR, 'boundaries_osm_query.txt'))
bound_osm = prep.OSM(query=bound_q)
bound = bound_osm.to_df(node_attrs=['id', 'lat', 'lon'], add_tags=['name', 'postal_code'])
bound.head()

| | id | lat | lon | name | postal_code |
|---|---|---|---|---|---|
| 0 | 2366619156 | 47.6006244 | 19.1140788 | 1044 Budapest IV. kerület | 1044 |
| 1 | 1768916389 | 47.6005581 | 19.1140093 | 1044 Budapest IV. kerület | 1044 |
| 2 | 1768916389 | 47.6005581 | 19.1140093 | 1044 Budapest IV. kerület | 1044 |
| 3 | 5777156960 | 47.6004505 | 19.1138815 | 1044 Budapest IV. kerület | 1044 |
| 4 | 3841617455 | 47.599922 | 19.1131844 | 1044 Budapest IV. kerület | 1044 |
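
`prep.OSM` is project code that is not shown in this notebook. As a rough illustration of what flattening a query result into rows could look like, here is a minimal sketch. It assumes each node exposes `id`, `lat`, `lon` and a `tags` dictionary, as overpy node objects do; the helper `nodes_to_rows` and the stand-in sample nodes are hypothetical and not part of the project.

```python
from types import SimpleNamespace


def nodes_to_rows(nodes, node_attrs, add_tags=()):
    """Flatten node-like objects into a list of dicts, one dict per node."""
    rows = []
    for node in nodes:
        row = {attr: getattr(node, attr) for attr in node_attrs}
        for tag in add_tags:
            # Missing tags become None instead of raising KeyError.
            row[tag] = node.tags.get(tag)
        rows.append(row)
    return rows


# Hypothetical stand-ins for query results; the first one reuses values from the output above.
sample_nodes = [
    SimpleNamespace(id=2366619156, lat=47.6006244, lon=19.1140788,
                    tags={'name': '1044 Budapest IV. kerület', 'postal_code': '1044'}),
    SimpleNamespace(id=1768916389, lat=47.6005581, lon=19.1140093, tags={}),
]
rows = nodes_to_rows(sample_nodes, ['id', 'lat', 'lon'], add_tags=['name', 'postal_code'])
```

A list of dictionaries like `rows` can be passed directly to `pd.DataFrame(rows)` to obtain a table with one column per attribute or tag.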


Save the result as a comma-separated flat file:

In [36]:
bound.to_csv(os.path.join(EXT_DIR, 'boundaries.csv'), encoding='utf8', index=False)

Uninhabited areas, for example:
- Danube
- Islands

In [46]:
uninhab_q = read_txt(os.path.join(QUERIES_DIR, 'uninhabited_osm_query.txt'))
uninhab_osm = prep.OSM(query=uninhab_q)
uninhab = uninhab_osm.to_df(node_attrs=['id', 'lat', 'lon'])
uninhab.head()

| | id | lat | lon |
|---|---|---|---|
| 0 | 35966076 | 47.3903119 | 19.0089999 |
| 1 | 35966077 | 47.3904866 | 19.008688 |
| 2 | 35966078 | 47.3907746 | 19.0082894 |
| 3 | 1540019894 | 47.3909018 | 19.0080858 |
| 4 | 35966079 | 47.3910512 | 19.0075134 |


Save the result as a comma-separated flat file:

In [47]:
uninhab.to_csv(os.path.join(EXT_DIR, 'uninhabited.csv'), encoding="utf8", index=False)

Boundaries of the agglomeration of Budapest:

In [11]:
agglom_q = read_txt(os.path.join(QUERIES_DIR, 'agglomeration_osm_query.txt'))
agglom_osm = prep.OSM(query=agglom_q)
agglom = agglom_osm.to_df(node_attrs=['id', 'lat', 'lon'], add_tags=['name'])
agglom.head()

| | id | lat | lon | name |
|---|---|---|---|---|
| 0 | 303616978 | 47.3054043 | 18.8583169 | Százhalombatta |
| 1 | 373873292 | 47.3046396 | 18.8604005 | Százhalombatta |
| 2 | 373873296 | 47.3060417 | 18.8646632 | Százhalombatta |
| 3 | 303616988 | 47.3054631 | 18.8680175 | Százhalombatta |
| 4 | 335719239 | 47.3042648 | 18.8719475 | Százhalombatta |


Save the result as a comma-separated flat file:

In [12]:
agglom.to_csv(os.path.join(EXT_DIR, 'agglomeration.csv'), encoding="utf8", index=False)

## Elevation
The GPS coordinates of the properties are available in the scraped data, but the elevation at a given coordinate is not published on the real estate website. Fortunately, the [open-elevation API](https://github.com/Jorl17/open-elevation) provides elevation data.  
Usage is simple: send a POST request with latitude-longitude pairs and receive the data as JSON.
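
The request itself is wrapped by `prep.Elevation`, which is not shown here. As a minimal sketch of the exchange, the snippet below builds the JSON body and parses a response using only the standard library. The endpoint URL and the `locations`/`results` field names follow my reading of the open-elevation README and should be treated as assumptions; `build_payload`, `parse_response` and `lookup` are hypothetical helpers, and `lookup` is defined but not called because it needs network access.

```python
import json
from urllib.request import Request, urlopen

# Assumption: public endpoint as described in the open-elevation README.
API_URL = 'https://api.open-elevation.com/api/v1/lookup'


def build_payload(coords):
    """Encode (latitude, longitude) pairs as the JSON body of a lookup request."""
    locations = [{'latitude': lat, 'longitude': lng} for lat, lng in coords]
    return json.dumps({'locations': locations}).encode('utf8')


def parse_response(body):
    """Turn a JSON response body into (latitude, longitude, elevation) tuples."""
    results = json.loads(body)['results']
    return [(r['latitude'], r['longitude'], r['elevation']) for r in results]


def lookup(coords, url=API_URL, timeout=30):
    """POST the coordinates and return the parsed elevations (needs network access)."""
    headers = {'Content-Type': 'application/json', 'Accept': 'application/json'}
    # Supplying `data` makes urllib send a POST request.
    request = Request(url, data=build_payload(coords), headers=headers)
    with urlopen(request, timeout=timeout) as response:
        return parse_response(response.read())


# Offline round trip with a hand-written response in the assumed format.
payload = build_payload([(47.46068, 19.04869)])
sample_body = '{"results": [{"latitude": 47.46068, "longitude": 19.04869, "elevation": 102}]}'
parsed = parse_response(sample_body)
```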

To get all the unique GPS coordinates, I remove duplicated records from the raw scraped data:

In [10]:
u_gps = raw[['lat', 'lng']].drop_duplicates().reset_index(drop=True)
print('Total number of unique GPS coordinates: {0:,}'.format(len(u_gps)))

Total number of unique GPS coordinates: 7,540


Retrieving the elevation data from the API in smaller batches, sending 100 coordinates at once:

In [None]:
n_batch = int(np.ceil(len(u_gps) / 100))
intervals = calc_intervals(n_batch, len(u_gps))
batches = []
for i, interval in enumerate(intervals):
    batch_gps_elevation = None
    # Retry the same batch until the API answers; network errors are ignored.
    while batch_gps_elevation is None:
        try:
            elev = prep.Elevation(df=u_gps[interval.start:interval.stop], latitude='lat', longitude='lng')
            batch_gps_elevation = elev.retrieve_to_df()
            print('{}. success - batch retrieved!'.format(i))
        except (HTTPError, URLError, RemoteDisconnected):
            pass
    batches.append(batch_gps_elevation)
# Concatenate once at the end instead of growing the DataFrame inside the loop.
gps_elevation = pd.concat(batches, axis=0) if batches else pd.DataFrame()
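
The helper `calc_intervals` comes from `src.utils` and is not shown here, and the loop above retries a failed batch indefinitely. The following is a minimal sketch of how the batching could be done and how the retry could be bounded with exponential backoff. Both `batch_ranges` and `retry` are hypothetical helpers written for illustration; they are not the project's implementation.

```python
import time


def batch_ranges(total, batch_size):
    """Split range(total) into consecutive ranges of at most batch_size items."""
    return [range(start, min(start + batch_size, total))
            for start in range(0, total, batch_size)]


def retry(func, exceptions, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call func until it succeeds, doubling the wait after every failure."""
    for attempt in range(max_attempts):
        try:
            return func()
        except exceptions:
            if attempt == max_attempts - 1:
                raise  # give up and surface the last error
            sleep(base_delay * 2 ** attempt)


# 7,540 unique coordinates in batches of 100.
ranges = batch_ranges(7540, 100)
```

Each range exposes `.start` and `.stop`, so it can slice the coordinates the same way as the intervals above. A bounded retry makes a permanently unreachable API fail loudly instead of looping forever.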

In [14]:
gps_elevation.head()

| | elevation | latitude | longitude |
|---|---|---|---|
| 0 | 102 | 47.46068 | 19.04869 |
| 1 | 106 | 47.529858 | 19.07906 |
| 2 | 106 | 47.52973 | 19.078869 |
| 3 | 102 | 47.538403 | 19.064398 |
| 4 | 102 | 47.54674 | 19.06614 |


Save the result as a comma-separated flat file:

In [20]:
gps_elevation.to_csv(os.path.join(EXT_DIR, 'elevation.csv'), encoding="utf8", index=False)

## Public domain names
Downloading the official public domain names used in Hungary (street, road, square, etc.) from the [government portal](https://ceginformaciosszolgalat.kormany.hu/download/b/46/11000/kozterulet_jelleg_2015_09_07.txt).  
They will be used for text analysis of addresses.

In [21]:
public_domains_raw = prep.get_public_domain_names()

In [22]:
public_domains = [line for line in public_domains_raw.split('\r\n') if line]

In [23]:
public_domains[50:55]

['hegyhát dűlő', 'hegyhát', 'köz', 'hrsz', 'hrsz.']

Save the text file:

In [9]:
save_txt(os.path.join(EXT_DIR, 'public_domains_2015_09_07.txt'), public_domains_raw)
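
To illustrate how such a list could support text analysis of addresses, here is a minimal sketch that splits an address into street name, public domain name and remainder. The function `split_address` is hypothetical, and the short `street_types` list is a hand-picked stand-in: `'utca'` (street) is assumed to be in the downloaded file, although it does not appear in the slice printed above.

```python
def split_address(address, street_types):
    """Return (street_name, street_type, remainder), or None if no type matches."""
    words = address.split()
    # Try longer types first so that 'hegyhát dűlő' wins over 'hegyhát'.
    for street_type in sorted(street_types, key=len, reverse=True):
        type_words = street_type.split()
        n = len(type_words)
        # Start at 1 so that the street name is never empty.
        for i in range(1, len(words) - n + 1):
            if [w.lower() for w in words[i:i + n]] == type_words:
                name = ' '.join(words[:i])
                rest = ' '.join(words[i + n:])
                return name, street_type, rest
    return None


# Stand-in list; the real one would be the `public_domains` list loaded above.
street_types = ['hegyhát dűlő', 'hegyhát', 'köz', 'utca']
result = split_address('Csata utca 30.', street_types)
no_match = split_address('Budai Bolero II', street_types)
```

Here `result` separates the street name from the house number, while an address without a recognizable public domain name, such as a project name, yields `None`.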