<a href="https://colab.research.google.com/github/sadikinisaac/sadikinisaac/blob/master/Capstone_Assignment_4_Isaacsadikin.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Capstone Assignment #03 - Machine Learning
---

---
⚠️**Notebook Instruction:**
- Feel free to insert additional cell(s) as you need.
- You may import any packages that you need.
- You may create additional variables, columns, or object in this notebook to derive at your solution.
- If not specifed, you may use any name for your variable or column.
- If any expected output is provided, you are required to remove any additional columns that is not included in the expected output.


- ⬇️ Enter your name as a string in cell below.

---

In [1]:
Name = "Isaac Sadikin"

## 📥 Download Required Data Files

In [2]:
import requests

filename = 'data-week-16.zip'
url = f'https://d17lzt44idt8rf.cloudfront.net/{filename}'
response = requests.get(url)

# Make sure the request was successful
if response.status_code == 200:

  # Write the content to a file
  with open(filename, 'wb') as f:
      f.write(response.content)

In [3]:
!unzip $filename

Archive:  data-week-16.zip
   creating: 00_raw_data/
  inflating: 00_raw_data/hdb_facilities_distance.csv  
  inflating: 00_raw_data/hdb_resale_with_info.csv  
  inflating: 00_raw_data/hdb_type_sold.csv  
   creating: 01_processed_data/
  inflating: 01_processed_data/df_q3_start.csv  
  inflating: 01_processed_data/df_q4_start.csv  


---

<!--TABLE OF CONTENTS-->
# Table of Contents:
- [0.0 Import Packages, Configure Notebook](#0.0-Import-Packages,-Configure-Notebook)
- [1.0 Loading Data](#1.0-Loading-Data)
  - [1.1 Loading HDB_Resales records](#1.1-Loading-HDB_Resales-records)
  - [1.2 Loading "Facilities_Distance" data](#1.2-Loading-"Facilities_Distance"-data)
  - [1.3 Loading "HDB Type Sold Count" data](#1.3-Loading-"HDB-Type-Sold-Count"-data)
- [2.0 Data Processing](#2.0-Data-Processing)
  - [2.1 Calculate Percentage of the Unit's Flat Type](#2.1-Calculate-Percentage-of-the-Unit's-Flat-Type)
  - [2.2 Comparing Floor Size to Average of the same Flat Type and Flat Model](#2.2-Comparing-Floor-Size-to-Average-of-the-same-Flat-Type-and-Flat-Model)
  - [2.2 Process and Merge "Facilities_Distance" data](#2.2-Process-and-Merge-"Facilities_Distance"-data)
- [3 Feature Engineering](#3-Feature-Engineering)
  - [3.1 Creating Dummy Variables](#3.1-Creating-Dummy-Variables)
- [4.0 Modeling](#4.0-Modeling)
  - [4.1 Specifying the Features and Splitting the Data](#4.1-Specifying-the-Features-and-Splitting-the-Data)
  - [4.2 Modeling Training and Testing](#4.2-Modeling-Training-and-Testing)
- [Bonus Question](#Bonus-Question)

---

# 0.0 Import Packages, Configure Notebook

In [4]:
import numpy as np
import pandas as pd

In [5]:
# Settings for Matplotlib (& Seaborn)
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

# Import libraries for charting
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns

# Set the size of charts
plt.rc('figure', figsize=(16,9))
plt.style.use('fivethirtyeight')
sns.set_context(context={'figure.figsize': (16,9)})

# 1.0 Loading Data

## 1.1 Loading HDB_Resales records

⚠️ All raw data are in the **00_raw_data** folder

---

> 🔷 **[ Question 1a ]** <br>
> Read the csv file **hdb_resale_with_info.csv** into a variable **df**.

In [7]:
df = pd.read_csv('00_raw_data/hdb_resale_with_info.csv')

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 133769 entries, 0 to 133768
Data columns (total 19 columns):
 #   Column                 Non-Null Count   Dtype  
---  ------                 --------------   -----  
 0   Tranc_Year             133769 non-null  int64  
 1   Tranc_Month            133769 non-null  int64  
 2   town                   133769 non-null  object 
 3   flat_type              133769 non-null  object 
 4   storey_range           133769 non-null  object 
 5   floor_area_sqm         133769 non-null  float64
 6   flat_model             133769 non-null  object 
 7   lease_commence_date    133769 non-null  int64  
 8   resale_price           133769 non-null  float64
 9   storey_range_midpoint  133769 non-null  int64  
 10  floor_area_sqft        133769 non-null  float64
 11  age_approx             133769 non-null  int64  
 12  address                133769 non-null  object 
 13  max_floor_lvl          133769 non-null  int64  
 14  year_completed         133769 non-nu

In [9]:
# Display 10 random rows
df.sample(10)

Unnamed: 0,Tranc_Year,Tranc_Month,town,flat_type,storey_range,floor_area_sqm,flat_model,lease_commence_date,resale_price,storey_range_midpoint,floor_area_sqft,age_approx,address,max_floor_lvl,year_completed,residential,commercial,storey_relative,flat_type_numerized
101583,2020,1,BUKIT PANJANG,4 ROOM,25 TO 27,93.0,Model A,2015,480000.0,26,1001.0427,7,"635B, SENJA RD",28,2014,Y,N,0.928571,4
41838,2017,4,HOUGANG,3 ROOM,01 TO 03,74.0,Model A,1987,248000.0,2,796.5286,35,"660, HOUGANG AVE 8",12,1986,Y,N,0.166667,3
47342,2017,7,GEYLANG,3 ROOM,07 TO 09,56.0,Standard,1969,245000.0,8,602.7784,57,"47, CIRCUIT RD",10,1964,Y,N,0.8,3
124387,2021,1,PASIR RIS,5 ROOM,04 TO 06,126.0,Improved,1995,478000.0,5,1356.2514,28,"146, PASIR RIS ST 11",11,1993,Y,N,0.454545,5
51611,2017,9,SEMBAWANG,5 ROOM,07 TO 09,110.0,Premium Apartment,2006,405000.0,8,1184.029,16,"466A, SEMBAWANG DR",21,2005,Y,Y,0.380952,5
99868,2019,12,CLEMENTI,5 ROOM,07 TO 09,121.0,Improved,1979,552000.0,8,1302.4319,43,"342, CLEMENTI AVE 5",12,1978,Y,N,0.666667,5
30390,2016,8,SERANGOON,4 ROOM,04 TO 06,107.0,Model A,1989,560000.0,5,1151.7373,33,"425, SERANGOON AVE 1",13,1988,Y,N,0.384615,4
78228,2018,12,JURONG WEST,5 ROOM,01 TO 03,112.0,Improved,2015,450000.0,2,1205.5568,8,"183D, BOON LAY AVE",16,2013,Y,N,0.125,5
33388,2016,10,PASIR RIS,4 ROOM,04 TO 06,103.0,Model A,1989,410000.0,5,1108.6817,32,"416, PASIR RIS DR 6",13,1989,Y,Y,0.384615,4
82370,2019,3,CHOA CHU KANG,5 ROOM,04 TO 06,122.0,Improved,1989,383000.0,5,1313.1958,33,"236, CHOA CHU KANG CTRL",12,1988,Y,N,0.416667,5


## 1.2 Loading "Facilities_Distance" data

> 🔷 **[ Question 1b ]** <br>
> Read the csv file **hdb_facilities_distance.csv** into a variable **df_facility**

In [10]:
df_facility = pd.read_csv('00_raw_data/hdb_facilities_distance.csv')

In [11]:
df_facility.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9175 entries, 0 to 9174
Data columns (total 15 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   address                  9175 non-null   object 
 1   mall_nearest_distance    9030 non-null   float64
 2   mall_within_500m         3592 non-null   float64
 3   mall_within_1km          7483 non-null   float64
 4   mall_within_2km          8947 non-null   float64
 5   hawker_nearest_distance  9175 non-null   float64
 6   hawker_within_500m       2996 non-null   float64
 7   hawker_within_1km        5353 non-null   float64
 8   hawker_within_2km        7418 non-null   float64
 9   hawker_food_stalls       9165 non-null   float64
 10  hawker_market_stalls     9165 non-null   float64
 11  mrt_nearest_distance     9175 non-null   float64
 12  mrt_within_500m          2883 non-null   float64
 13  mrt_within_1km           7046 non-null   float64
 14  mrt_within_2km          

In [12]:
df_facility.sample(10)

Unnamed: 0,address,mall_nearest_distance,mall_within_500m,mall_within_1km,mall_within_2km,hawker_nearest_distance,hawker_within_500m,hawker_within_1km,hawker_within_2km,hawker_food_stalls,hawker_market_stalls,mrt_nearest_distance,mrt_within_500m,mrt_within_1km,mrt_within_2km
1386,"161, SIMEI RD",841.89743,,1.0,5.0,1728.040588,,,1.0,45.0,99.0,426.721459,1.0,2.0,7.0
3545,"307A, ANCHORVALE RD",827.611062,,2.0,8.0,1696.478133,,,1.0,40.0,0.0,866.271034,,2.0,2.0
5845,"504A, MONTREAL DR",619.635776,,1.0,2.0,2215.605157,,,,56.0,123.0,516.933119,,1.0,1.0
9051,"952, HOUGANG AVE 9",195.545398,1.0,1.0,8.0,581.735587,,1.0,3.0,40.0,0.0,1689.264552,,,3.0
4909,"422, TAMPINES ST 41",390.406346,1.0,5.0,7.0,893.148032,,1.0,3.0,42.0,0.0,489.640028,2.0,3.0,6.0
755,"126, BEDOK NTH ST 2",816.940043,,2.0,5.0,359.445208,2.0,4.0,7.0,48.0,147.0,833.98353,,1.0,4.0
3184,"28, NEW UPP CHANGI RD",315.276237,1.0,4.0,5.0,352.003402,2.0,4.0,8.0,64.0,154.0,558.499914,,1.0,3.0
1587,"175, BT BATOK WEST AVE 8",972.507036,,1.0,5.0,494.289802,1.0,1.0,2.0,60.0,87.0,966.888603,,1.0,5.0
2117,"210A, COMPASSVALE LANE",204.097139,1.0,3.0,7.0,2199.008116,,,,40.0,0.0,805.779595,,2.0,3.0
5132,"44, CIRCUIT RD",1136.093766,,,6.0,111.55858,3.0,4.0,8.0,106.0,0.0,320.51136,3.0,4.0,11.0


## 1.3 Loading "HDB Type Sold Count" data

> 🔷 **[ Question 1c ]** <br>
> Read the csv file **hdb_type_sold.csv** into a variable **df_hdb_sold**

In [13]:
df_type_sold = pd.read_csv('00_raw_data/hdb_type_sold.csv')

In [14]:
df_type_sold.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9175 entries, 0 to 9174
Data columns (total 10 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   address               9175 non-null   object
 1   total_dwelling_units  9175 non-null   int64 
 2   1 ROOM                9175 non-null   int64 
 3   2 ROOM                9175 non-null   int64 
 4   3 ROOM                9175 non-null   int64 
 5   4 ROOM                9175 non-null   int64 
 6   5 ROOM                9175 non-null   int64 
 7   EXECUTIVE             9175 non-null   int64 
 8   MULTI-GENERATION      9175 non-null   int64 
 9   STUDIO                9175 non-null   int64 
dtypes: int64(9), object(1)
memory usage: 716.9+ KB


In [15]:
df_type_sold.sample(10)

Unnamed: 0,address,total_dwelling_units,1 ROOM,2 ROOM,3 ROOM,4 ROOM,5 ROOM,EXECUTIVE,MULTI-GENERATION,STUDIO
7977,"225, LOR 8 TOA PAYOH",173,0,0,170,0,3,0,0,0
2198,"692B, CHOA CHU KANG CRES",188,0,0,0,96,92,0,0,0
3344,"313, SEMBAWANG DR",105,0,0,0,30,75,0,0,0
6790,"234A, SERANGOON AVE 2",90,0,0,0,0,0,90,0,0
5194,"11, LOR 8 TOA PAYOH",124,0,0,0,94,30,0,0,0
5897,"424, WOODLANDS ST 41",99,0,0,0,77,22,0,0,0
5991,"289E, BT BATOK ST 25",185,0,0,0,161,24,0,0,0
415,"44, SIMS DR",240,0,0,0,240,0,0,0,0
3189,"531, JURONG WEST ST 52",131,0,0,75,55,1,0,0,0
249,"69, REDHILL CL",106,0,0,0,77,29,0,0,0


# 2.0 Data Processing

## 2.1 Calculate Percentage of the Unit's Flat Type

> 🔷 **[ Question 2a ]** <br>
> Create a new column **flat_type_percent** in **df**, to store the percentage of the unit's flat_type over the **total_dwelling_units**.
>
> For example: If a row's **flat_type** is *3 ROOM*, we'll use number of **3 ROOM** sold (available in  **df_type_sold**|) to be divided by the **total_dwelling_units** of the block.
>
> The top 10 rows of your **df** should look like this at the end of this question (screenshots shows columns starting from the right-most column).
> ![](https://i.imgur.com/uhA1BkA.png)

In [20]:
df = pd.merge(df, df_type_sold, how='left', on='address')

In [21]:
df.head(10)

Unnamed: 0,Tranc_Year,Tranc_Month,town,flat_type,storey_range,floor_area_sqm,flat_model,lease_commence_date,resale_price,storey_range_midpoint,...,flat_type_numerized,total_dwelling_units,1 ROOM,2 ROOM,3 ROOM,4 ROOM,5 ROOM,EXECUTIVE,MULTI-GENERATION,STUDIO
0,2015,1,ANG MO KIO,3 ROOM,07 TO 09,60.0,Improved,1986,255000.0,8,...,3,198,0,57,137,1,1,0,0,0
1,2015,1,ANG MO KIO,3 ROOM,01 TO 03,68.0,New Generation,1981,275000.0,2,...,3,191,0,0,165,21,3,2,0,0
2,2015,1,ANG MO KIO,3 ROOM,01 TO 03,69.0,New Generation,1980,285000.0,2,...,3,84,0,0,84,0,0,0,0,0
3,2015,1,ANG MO KIO,3 ROOM,01 TO 03,68.0,New Generation,1979,290000.0,2,...,3,23,0,0,23,0,0,0,0,0
4,2015,1,ANG MO KIO,3 ROOM,07 TO 09,68.0,New Generation,1980,290000.0,8,...,3,187,0,0,158,24,2,3,0,0
5,2015,1,ANG MO KIO,3 ROOM,07 TO 09,67.0,New Generation,1980,290000.0,8,...,3,140,0,0,116,20,1,3,0,0
6,2015,1,ANG MO KIO,3 ROOM,01 TO 03,68.0,New Generation,1980,290000.0,2,...,3,28,0,0,28,0,0,0,0,0
7,2015,1,ANG MO KIO,3 ROOM,01 TO 03,68.0,New Generation,1981,293000.0,2,...,3,179,0,0,158,20,0,1,0,0
8,2015,1,ANG MO KIO,3 ROOM,01 TO 03,67.0,New Generation,1978,300000.0,2,...,3,154,0,0,121,33,0,0,0,0
9,2015,1,ANG MO KIO,3 ROOM,13 TO 15,68.0,New Generation,1985,307500.0,14,...,3,214,0,0,187,24,1,1,0,0


In [22]:
def create_flat_type_percent(row):
    flat_type = row['flat_type_numerized']
    if (flat_type == 1):
        return row['1 ROOM']/row['total_dwelling_units']
    elif (flat_type == 2):
        return row['2 ROOM']/row['total_dwelling_units']
    elif (flat_type == 3):
        return row['3 ROOM']/row['total_dwelling_units']
    elif (flat_type == 4):
        return row['4 ROOM']/row['total_dwelling_units']
    elif (flat_type == 5):
        return row['5 ROOM']/row['total_dwelling_units']
    elif (flat_type == 6):
        return row['EXECUTIVE']/row['total_dwelling_units']
    elif (flat_type == 7):
        return row['MULTI-GENERATION']/row['total_dwelling_units']
    else:
        return row['STUDIO']/row['total_dwelling_units']

In [23]:
df['flat_type_percent'] = df.apply(create_flat_type_percent, axis=1)

In [24]:
df.head(10)

Unnamed: 0,Tranc_Year,Tranc_Month,town,flat_type,storey_range,floor_area_sqm,flat_model,lease_commence_date,resale_price,storey_range_midpoint,...,total_dwelling_units,1 ROOM,2 ROOM,3 ROOM,4 ROOM,5 ROOM,EXECUTIVE,MULTI-GENERATION,STUDIO,flat_type_percent
0,2015,1,ANG MO KIO,3 ROOM,07 TO 09,60.0,Improved,1986,255000.0,8,...,198,0,57,137,1,1,0,0,0,0.691919
1,2015,1,ANG MO KIO,3 ROOM,01 TO 03,68.0,New Generation,1981,275000.0,2,...,191,0,0,165,21,3,2,0,0,0.863874
2,2015,1,ANG MO KIO,3 ROOM,01 TO 03,69.0,New Generation,1980,285000.0,2,...,84,0,0,84,0,0,0,0,0,1.0
3,2015,1,ANG MO KIO,3 ROOM,01 TO 03,68.0,New Generation,1979,290000.0,2,...,23,0,0,23,0,0,0,0,0,1.0
4,2015,1,ANG MO KIO,3 ROOM,07 TO 09,68.0,New Generation,1980,290000.0,8,...,187,0,0,158,24,2,3,0,0,0.84492
5,2015,1,ANG MO KIO,3 ROOM,07 TO 09,67.0,New Generation,1980,290000.0,8,...,140,0,0,116,20,1,3,0,0,0.828571
6,2015,1,ANG MO KIO,3 ROOM,01 TO 03,68.0,New Generation,1980,290000.0,2,...,28,0,0,28,0,0,0,0,0,1.0
7,2015,1,ANG MO KIO,3 ROOM,01 TO 03,68.0,New Generation,1981,293000.0,2,...,179,0,0,158,20,0,1,0,0,0.882682
8,2015,1,ANG MO KIO,3 ROOM,01 TO 03,67.0,New Generation,1978,300000.0,2,...,154,0,0,121,33,0,0,0,0,0.785714
9,2015,1,ANG MO KIO,3 ROOM,13 TO 15,68.0,New Generation,1985,307500.0,14,...,214,0,0,187,24,1,1,0,0,0.873832


> 🔷 **[ Question 2b ]** <br>
> Remove the columns below from the dataframe **df**. The column names have been stored in **cols_to_remove** for you.
>
> The **df** used in the rest of notebook should not contain these columns.

In [25]:
cols_to_remove = [
 '1 ROOM',
 '2 ROOM',
 '3 ROOM',
 '4 ROOM',
 '5 ROOM',
 'EXECUTIVE',
 'MULTI-GENERATION',
 'STUDIO'
]


In [26]:
df.drop(cols_to_remove, axis=1)

Unnamed: 0,Tranc_Year,Tranc_Month,town,flat_type,storey_range,floor_area_sqm,flat_model,lease_commence_date,resale_price,storey_range_midpoint,...,age_approx,address,max_floor_lvl,year_completed,residential,commercial,storey_relative,flat_type_numerized,total_dwelling_units,flat_type_percent
0,2015,1,ANG MO KIO,3 ROOM,07 TO 09,60.0,Improved,1986,255000.0,8,...,41,"174, ANG MO KIO AVE 4",11,1980,Y,N,0.727273,3,198,0.691919
1,2015,1,ANG MO KIO,3 ROOM,01 TO 03,68.0,New Generation,1981,275000.0,2,...,42,"541, ANG MO KIO AVE 10",8,1979,Y,N,0.250000,3,191,0.863874
2,2015,1,ANG MO KIO,3 ROOM,01 TO 03,69.0,New Generation,1980,285000.0,2,...,40,"163, ANG MO KIO AVE 4",4,1981,Y,Y,0.500000,3,84,1.000000
3,2015,1,ANG MO KIO,3 ROOM,01 TO 03,68.0,New Generation,1979,290000.0,2,...,42,"446, ANG MO KIO AVE 10",4,1979,Y,Y,0.500000,3,23,1.000000
4,2015,1,ANG MO KIO,3 ROOM,07 TO 09,68.0,New Generation,1980,290000.0,8,...,42,"557, ANG MO KIO AVE 10",13,1979,Y,N,0.615385,3,187,0.844920
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
133764,2021,4,WOODLANDS,5 ROOM,01 TO 03,113.0,Improved,2017,505000.0,2,...,5,"890A, WOODLANDS DR 50",16,2016,Y,N,0.125000,5,118,0.491525
133765,2021,4,WOODLANDS,5 ROOM,04 TO 06,113.0,Improved,2017,565000.0,5,...,6,"889B, WOODLANDS DR 50",16,2015,Y,N,0.312500,5,90,0.333333
133766,2021,4,YISHUN,4 ROOM,07 TO 09,93.0,Model A,2018,460888.0,8,...,5,"506C, YISHUN AVE 4",13,2016,Y,N,0.615385,4,106,0.566038
133767,2021,4,YISHUN,5 ROOM,13 TO 15,113.0,Improved,2017,600000.0,14,...,5,"511B, YISHUN ST 51",13,2016,Y,N,1.076923,5,120,0.600000


## 2.2 Comparing Floor Size to Average of the same Flat Type and Flat Model

> 🔷 **[ Question 2c ]** <br>
> Create a new column **diff_from_avg_sqft** in **df**, to store the difference between the each unit's **floor_size_sqft** and the average floor size of the same **flat_type** and **flat_model** in the dataset **df**.
>
> At the end of the question, the top 10 rows of your **df** should look like this:<br>
>(screenshots shows columns starting from the right-most column).

> ![](https://i.imgur.com/maHcAi6.png)

In [27]:
groupby_flattype_flatmodel = df.groupby(['flat_type','flat_model'])
groupby_flattype_flatmodel = groupby_flattype_flatmodel['floor_area_sqft'].mean()
groupby_flattype_flatmodel = groupby_flattype_flatmodel.reset_index()
groupby_flattype_flatmodel.columns = ['flat_type', 'flat_model', 'avg_sqft']
groupby_flattype_flatmodel

Unnamed: 0,flat_type,flat_model,avg_sqft
0,1 ROOM,Improved,333.6809
1,2 ROOM,2-room,531.019067
2,2 ROOM,DBSS,538.195
3,2 ROOM,Improved,486.226785
4,2 ROOM,Model A,500.010481
5,2 ROOM,Premium Apartment,560.440393
6,2 ROOM,Standard,481.129986
7,3 ROOM,DBSS,707.820845
8,3 ROOM,Improved,702.687321
9,3 ROOM,Model A,765.773809


In [28]:
df = pd.merge(df, groupby_flattype_flatmodel, how='left', on=['flat_type','flat_model'])
df['diff_from_avg_sqft'] = (df['floor_area_sqft']) - df['avg_sqft']

In [29]:
df.head(10)

Unnamed: 0,Tranc_Year,Tranc_Month,town,flat_type,storey_range,floor_area_sqm,flat_model,lease_commence_date,resale_price,storey_range_midpoint,...,2 ROOM,3 ROOM,4 ROOM,5 ROOM,EXECUTIVE,MULTI-GENERATION,STUDIO,flat_type_percent,avg_sqft,diff_from_avg_sqft
0,2015,1,ANG MO KIO,3 ROOM,07 TO 09,60.0,Improved,1986,255000.0,8,...,57,137,1,1,0,0,0,0.691919,702.687321,-56.853321
1,2015,1,ANG MO KIO,3 ROOM,01 TO 03,68.0,New Generation,1981,275000.0,2,...,0,165,21,3,2,0,0,0.863874,752.99472,-21.04952
2,2015,1,ANG MO KIO,3 ROOM,01 TO 03,69.0,New Generation,1980,285000.0,2,...,0,84,0,0,0,0,0,1.0,752.99472,-10.28562
3,2015,1,ANG MO KIO,3 ROOM,01 TO 03,68.0,New Generation,1979,290000.0,2,...,0,23,0,0,0,0,0,1.0,752.99472,-21.04952
4,2015,1,ANG MO KIO,3 ROOM,07 TO 09,68.0,New Generation,1980,290000.0,8,...,0,158,24,2,3,0,0,0.84492,752.99472,-21.04952
5,2015,1,ANG MO KIO,3 ROOM,07 TO 09,67.0,New Generation,1980,290000.0,8,...,0,116,20,1,3,0,0,0.828571,752.99472,-31.81342
6,2015,1,ANG MO KIO,3 ROOM,01 TO 03,68.0,New Generation,1980,290000.0,2,...,0,28,0,0,0,0,0,1.0,752.99472,-21.04952
7,2015,1,ANG MO KIO,3 ROOM,01 TO 03,68.0,New Generation,1981,293000.0,2,...,0,158,20,0,1,0,0,0.882682,752.99472,-21.04952
8,2015,1,ANG MO KIO,3 ROOM,01 TO 03,67.0,New Generation,1978,300000.0,2,...,0,121,33,0,0,0,0,0.785714,752.99472,-31.81342
9,2015,1,ANG MO KIO,3 ROOM,13 TO 15,68.0,New Generation,1985,307500.0,14,...,0,187,24,1,1,0,0,0.873832,752.99472,-21.04952


## 2.2 Process and Merge "Facilities_Distance" data

> 🔷 **[ Question 2d ]** <br>
> **df_facility** contains missing values when there is no mall/hawker/mrt within the specified distance (e.g. mall_within_1km).
>
> Fill all these missing values with 0.

In [30]:
df_facility.fillna(0, inplace=True)

> 🔷 **[ Question 2e ]** <br>
> Merge all the columns in **df_facility** into **df**, so that for each record in **df**, it will have the respective information from **df_facility**.

In [31]:
df = pd.merge(df, df_facility, how='left', on='address')

In [32]:
df.head(10)

Unnamed: 0,Tranc_Year,Tranc_Month,town,flat_type,storey_range,floor_area_sqm,flat_model,lease_commence_date,resale_price,storey_range_midpoint,...,hawker_nearest_distance,hawker_within_500m,hawker_within_1km,hawker_within_2km,hawker_food_stalls,hawker_market_stalls,mrt_nearest_distance,mrt_within_500m,mrt_within_1km,mrt_within_2km
0,2015,1,ANG MO KIO,3 ROOM,07 TO 09,60.0,Improved,1986,255000.0,8,...,188.702345,1.0,4.0,8.0,40.0,84.0,1098.721414,0.0,0.0,2.0
1,2015,1,ANG MO KIO,3 ROOM,01 TO 03,68.0,New Generation,1981,275000.0,2,...,187.2736,1.0,2.0,11.0,50.0,100.0,806.15876,0.0,1.0,2.0
2,2015,1,ANG MO KIO,3 ROOM,01 TO 03,69.0,New Generation,1980,285000.0,2,...,165.99102,1.0,5.0,8.0,40.0,84.0,1179.565865,0.0,0.0,2.0
3,2015,1,ANG MO KIO,3 ROOM,01 TO 03,68.0,New Generation,1979,290000.0,2,...,134.216983,1.0,4.0,10.0,39.0,113.0,688.607876,0.0,1.0,2.0
4,2015,1,ANG MO KIO,3 ROOM,07 TO 09,68.0,New Generation,1980,290000.0,8,...,385.235689,2.0,2.0,7.0,50.0,100.0,929.155194,0.0,1.0,2.0
5,2015,1,ANG MO KIO,3 ROOM,07 TO 09,67.0,New Generation,1980,290000.0,8,...,558.321643,0.0,2.0,6.0,52.0,166.0,1037.261445,0.0,0.0,2.0
6,2015,1,ANG MO KIO,3 ROOM,01 TO 03,68.0,New Generation,1980,290000.0,2,...,163.692784,1.0,4.0,9.0,45.0,78.0,249.474469,1.0,1.0,2.0
7,2015,1,ANG MO KIO,3 ROOM,01 TO 03,68.0,New Generation,1981,293000.0,2,...,437.188375,2.0,3.0,10.0,40.0,148.0,979.320238,0.0,1.0,5.0
8,2015,1,ANG MO KIO,3 ROOM,01 TO 03,67.0,New Generation,1978,300000.0,2,...,385.888274,2.0,4.0,8.0,10.0,101.0,1321.144042,0.0,0.0,2.0
9,2015,1,ANG MO KIO,3 ROOM,13 TO 15,68.0,New Generation,1985,307500.0,14,...,371.883169,1.0,3.0,7.0,39.0,113.0,1095.78066,0.0,0.0,1.0


# 3 Feature Engineering

⚠️ Optional: If you are unable to complete previous questions, read **df_q3_start.csv** from folder **"01_processed_data"** into the variable df before continue.

In [None]:
#Optional: Load in data processed up to this point


## 3.1 Creating Dummy Variables

> 🔷 **[ Question 3 ]** <br>
> Create dummy variables for **town** and **flat_model**. <br>
> Create a new dataframe **df_dataset** that combines all the columns in **df** with these dummy variables.

> If you're unable to complete previous questions, read **df_q3_start** from folder "01_processed_data" into the variable **df** before continue.

In [33]:
features_category = [
    'town','flat_model'
]

In [34]:
df_dummies = pd.get_dummies(df[features_category], drop_first=True)

In [35]:
df_dataset = pd.concat([df, df_dummies], axis=1, sort=False)

# 4.0 Modeling

⚠️ Optional: If you're unable to complete previous questions, read **df_q4_start** from folder **"01_processed_data"** into the variable df before continue.

In [None]:
#optional: Load in data processed up to this point


## 4.1 Specifying the Features and Splitting the Data

In [36]:
from sklearn import model_selection
from sklearn import metrics

In [37]:
target = 'resale_price'

In [38]:
# The df_dummies below is a variable created in Section 3.1.
# Make sure you are using the same variable if you have used different variable name in Section 3.1
# Otherwise, you do not have to change the code below
features_dummies = list(df_dummies.columns)

In [39]:
features_numeric = [
    'Tranc_Year',
    'flat_type_numerized',
    'floor_area_sqft',
    'diff_from_avg_sqft',
    'storey_relative',
    'total_dwelling_units',
    'flat_type_percent',
    'age_approx',
    'mrt_nearest_distance',
    'mall_nearest_distance'
]

features = features_numeric + features_dummies

> 🔷 **[ Question 4a ]** <br>
> We will train our model based on the data before 2021 and test the model on data from 2021.<br>
> Split the **df_dataset** into **df_test** and **df_train**.

In [40]:
df_train = df_dataset[df_dataset['Tranc_Year'] < 2021]
df_test = df_dataset[df_dataset['Tranc_Year'] == 2021]

In [41]:
# Execute this line, DO NOT Modify
df_train = df_train.sample(len(df_train), random_state=2023)

> 🔷 **[ Question 4b ]** <br>
> For each of the training set and testing set, split it into:
> - dataframe contains only the **features** for training
> - dataframe contains only the **target** for training
> - dataframe contains only the **features** for testing
> - dataframe contains only the **target** for testing

In [42]:
x_train=df_train[features]
y_train=df_train[target]
x_test=df_test[features]
y_test=df_test[target]

## 4.2 Modeling Training and Testing

In [43]:
# You're required to use LinearRegression for this section
from sklearn.linear_model import LinearRegression

> 🔷 **[ Question 4c ]** <br>
> Complete the 3 Key Steps: Instantiate, Train, and Predict
>
> You must use LinearRegression for this question.

In [44]:

# Instantiate the model
model = LinearRegression()

# Train the model
model.fit(x_train,y_train)

# Generate predictions based on the test data set
predictions = model.predict(x_test)




> 🔷 **[ Question 4d ]** <br>
> Complete the following cells for calculate the Root Mean Square Error (RMSE).

In [45]:
# Validate the model
mse = metrics.mean_squared_error(predictions, y_test)
rmse = np.sqrt(mse)

print("About 95% of these predictions are between -" + str(np.round(rmse*2, 2)) + " and " + str(np.round(rmse*2, 2))
      + " of actual resale values")

print("About 67% of these predictions are between -" + str(np.round(rmse, 2)) + " and " + str(np.round(rmse, 2))
      + " of actual resale values")

About 95% of these predictions are between -130620.05 and 130620.05 of actual resale values
About 67% of these predictions are between -65310.03 and 65310.03 of actual resale values


In general, the RMSE tells us, on average, how far off our predictions are from the actual values.

For example, if the RMSE of the HDB resale price predictions is $10,000, that means our predictions are typically off by about \$10,000."

# Bonus Question

> 🔷 **[ Bonus Question ]** <br>
> Create a model with more or different features, to predict **resale_price**, wtih the objective to **improve the model's RMSE** <br>
> - Use the same training and testing set from question 4. <br>
> - **MUST** use **Linear Regression** algorithm. <br>
> - You can create more cells wherever you need to. <br>

In [46]:
features_additional = [
    'lease_commence_date',
    'hawker_nearest_distance',
    'hawker_market_stalls',
    'mall_within_500m',
    'hawker_within_500m',
    'mrt_within_500m'
]

features = features + features_additional

In [47]:
x_train=df_train[features]
y_train=df_train[target]
x_test=df_test[features]
y_test = df_test[target]

In [48]:
# Instantiate the model
model = LinearRegression()

# Train the model
model.fit(x_train,y_train)

# Generate predictions based on the test data set
predictions = model.predict(x_test)

In [49]:
# Validate the model
mse = metrics.mean_squared_error(predictions, y_test)
rmse = np.sqrt(mse)

print("About 95% of these predictions are between -" + str(np.round(rmse*2, 2)) + " and " + str(np.round(rmse*2, 2))
      + " of actual resale values")

print("About 67% of these predictions are between -" + str(np.round(rmse, 2)) + " and " + str(np.round(rmse, 2))
      + " of actual resale values")

About 95% of these predictions are between -129122.15 and 129122.15 of actual resale values
About 67% of these predictions are between -64561.08 and 64561.08 of actual resale values


---
## END OF CAPSTONE ASSIGNMENT
- Well done everyone!