# Data Processing

## Goals

This program aims to:
- Get a brief look at the missing data
- Deal with the missing data
- Create an initial data visualization

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from conversion import conversion
from extract_postal import extract
from get_room import get_room

In [2]:
# The raw data file seems to contain some non-ASCII characters.
# To make pandas able to read the file, I added encoding="unicode_escape".

df = pd.read_csv("data_raw.csv", encoding="unicode_escape")

df.head(5)

Unnamed: 0,Quarter,Postal code,Building type,Price per square meter (EUR/m2)
0,2010Q1,00100 Helsinki Keskusta - Etu-Töölö (Helsinki ),"Blocks of flats, one-room flat",5458
1,2010Q1,00100 Helsinki Keskusta - Etu-Töölö (Helsinki ),"Blocks of flats, two-room flat",5164
2,2010Q1,00100 Helsinki Keskusta - Etu-Töölö (Helsinki ),"Blocks of flats, three-room flat+",4944
3,2010Q1,00120 Punavuori (Helsinki ),"Blocks of flats, one-room flat",5515
4,2010Q1,00120 Punavuori (Helsinki ),"Blocks of flats, two-room flat",5349
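
A note on the encoding choice: Python's `unicode_escape` codec decodes bytes as Latin-1 and additionally interprets backslash escape sequences. The sketch below uses a small in-memory sample (not the real `data_raw.csv`) to show the difference. It assumes the raw file is Latin-1 encoded, which I have not verified; if that holds, `encoding="latin-1"` would read the same characters without touching backslashes.

```python
import io

import pandas as pd

# Made-up sample standing in for the raw file, encoded as Latin-1 (assumption).
text = "Postal code,Price\n00100 Etu-Töölö,5458\n"
raw = text.encode("latin-1")

# Without backslashes in the data, both codecs decode to the same string.
same = raw.decode("unicode_escape") == raw.decode("latin-1")

# With a backslash, unicode_escape rewrites the sequence (here into a tab).
escaped = b"a\\tb".decode("unicode_escape")
plain = b"a\\tb".decode("latin-1")

# Reading the sample with an explicit Latin-1 encoding.
df_latin = pd.read_csv(io.BytesIO(raw), encoding="latin-1")
print(df_latin)
```

If the file turns out to use another encoding, only the `encoding` argument needs to change.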


In [3]:
# change the column names
df.columns = ["quarter", "address", "type", "price"]

# mark the rows whose price is missing (recorded as "..") as NaN
df.loc[df.price == "..", "price"] = np.nan

"""
Here, I extract the information in the three features [quarter, address, type] into lists of
values (mostly numerical) so that we can use them in a machine learning model.

The helper functions conversion(), extract(), and get_room() are defined in the external files
conversion.py, extract_postal.py, and get_room.py.

In the end, we insert the results as new columns into our dataframe.
"""

date_count = conversion()
postal_code, city = extract()
room = get_room()

df["date_count"] = date_count
df["postal_code"] = postal_code
df["city"] = city
df["room"] = room

df.head(5)

Unnamed: 0,quarter,address,type,price,date_count,postal_code,city,room
0,2010Q1,00100 Helsinki Keskusta - Etu-Töölö (Helsinki ),"Blocks of flats, one-room flat",5458,1,100,Helsinki,1
1,2010Q1,00100 Helsinki Keskusta - Etu-Töölö (Helsinki ),"Blocks of flats, two-room flat",5164,1,100,Helsinki,2
2,2010Q1,00100 Helsinki Keskusta - Etu-Töölö (Helsinki ),"Blocks of flats, three-room flat+",4944,1,100,Helsinki,3
3,2010Q1,00120 Punavuori (Helsinki ),"Blocks of flats, one-room flat",5515,1,120,Helsinki,1
4,2010Q1,00120 Punavuori (Helsinki ),"Blocks of flats, two-room flat",5349,1,120,Helsinki,2
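
The external helper files are not shown in this notebook, so here is a minimal sketch of what such conversions could look like, judging only from the output above. The function names `quarter_to_count`, `split_address`, and `room_count` are hypothetical and are not the functions in `conversion.py`, `extract_postal.py`, or `get_room.py`. Unlike those helpers, they take a single string as an argument.

```python
import re

# Hypothetical mapping from the room word in "type" to a number.
ROOM_WORDS = {"one": 1, "two": 2, "three": 3}


def quarter_to_count(quarter, base_year=2010):
    """Count quarters from base_year: '2010Q1' -> 1, '2010Q2' -> 2, ..."""
    year, q = quarter.split("Q")
    return (int(year) - base_year) * 4 + int(q)


def split_address(address):
    """Split '00120 Punavuori (Helsinki )' into ('00120', 'Helsinki')."""
    postal_code = address.split()[0]
    # The city is assumed to be the text inside the trailing parentheses.
    match = re.search(r"\(([^)]*)\)\s*$", address)
    city = match.group(1).strip() if match else None
    return postal_code, city


def room_count(building_type):
    """Map '..., two-room flat' -> 2; return None if no room word is found."""
    match = re.search(r"(one|two|three)-room", building_type)
    return ROOM_WORDS[match.group(1)] if match else None


print(quarter_to_count("2010Q1"))
print(split_address("00120 Punavuori (Helsinki )"))
print(room_count("Blocks of flats, three-room flat+"))
```

With functions of this shape, the new columns could also be built with `df["quarter"].map(quarter_to_count)` and similar calls, instead of reading the raw file again inside each helper.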


In [4]:
print(df.shape)
print(df["price"].isna().sum())

(23607, 8)
16127
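
To get a slightly more detailed look at the missing data than a single count, one option is to compute the share of missing prices and break it down by group. The sketch below uses a small made-up dataframe, not the real data. It also uses `pd.to_numeric(..., errors="coerce")` as an alternative way to turn the `".."` marker into NaN, which gives a numeric column at the same time.

```python
import pandas as pd

# Made-up rows; ".." marks a missing price as in the raw file.
toy = pd.DataFrame({
    "city": ["Helsinki", "Helsinki", "Espoo", "Espoo"],
    "price": ["5458", "..", "..", "2898"],
})

# Anything that cannot be parsed as a number (such as "..") becomes NaN.
toy["price"] = pd.to_numeric(toy["price"], errors="coerce")

# Fraction of rows with a missing price.
missing_share = toy["price"].isna().mean()

# Number of missing prices per city.
missing_by_city = toy["price"].isna().groupby(toy["city"]).sum()

print(missing_share)
print(missing_by_city)
```

On the real dataframe, the same two lines could be applied to `df["price"]`, grouped by `city`, `room`, or `date_count`.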


In [5]:
"""
From the cell above, we can see that 16127 out of 23607 rows have a missing price.
That is indeed a large portion of the data (about 68%). However, if we simply drop these rows,
we still have 7480 data points, which should be enough for our machine learning model.
"""

# remove rows that have missing values
df = df.dropna(axis=0)

# reset the row indices, discarding the old index
df = df.reset_index(drop=True)

# remove the columns "quarter", "address", and "type", which are no longer used
df = df.drop(["quarter", "address", "type"], axis=1)

# change the type of the postal_code column into integer
df["postal_code"] = df["postal_code"].astype(int)

# switch the column order
df = df[["city", "date_count", "postal_code", "room", "price"]]

df.tail(5)

Unnamed: 0,city,date_count,postal_code,room,price
7475,Espoo,48,2650,2,4564
7476,Espoo,48,2650,3,3929
7477,Kauniainen,48,2700,3,4824
7478,Espoo,48,2710,3,2782
7479,Espoo,48,2760,2,2898
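
The cleaning steps in this cell can be sketched on a small made-up dataframe as follows. One addition that is not in the cell above: the sketch also converts `price` to an integer, on the assumption that the column was read as text because of the `".."` markers. I have not checked the dtype on the real data.

```python
import numpy as np
import pandas as pd

# Made-up rows with the same columns as the real dataframe.
toy = pd.DataFrame({
    "quarter": ["2010Q1", "2010Q1", "2010Q2"],
    "price": ["5458", np.nan, "4944"],
    "postal_code": ["00100", "00120", "00100"],
    "city": ["Helsinki", "Helsinki", "Helsinki"],
    "room": [1, 2, 3],
    "date_count": [1, 1, 2],
})

# Drop rows without a price and renumber the remaining rows from 0.
cleaned = toy.dropna(subset=["price"]).reset_index(drop=True)

# Convert the text columns to integers ("00100" becomes 100).
cleaned = cleaned.astype({"postal_code": int, "price": int})

# Keep only the columns used later, in the final order.
cleaned = cleaned[["city", "date_count", "postal_code", "room", "price"]]

print(cleaned)
```

Note that converting `postal_code` to an integer drops the leading zeros, which is fine for a numerical feature but loses the original five-digit form.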


In [6]:
df.head(5)

Unnamed: 0,city,date_count,postal_code,room,price
0,Helsinki,1,100,1,5458
1,Helsinki,1,100,2,5164
2,Helsinki,1,100,3,4944
3,Helsinki,1,120,1,5515
4,Helsinki,1,120,2,5349


In [7]:
df.to_csv("data_cleaned.csv")
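
By default, `to_csv` also writes the row index, which comes back as an `Unnamed: 0` column when the file is read again. The sketch below shows the difference on a made-up dataframe, using in-memory buffers instead of real files. Whether to pass `index=False` depends on what the later notebooks expect, so treat it as an option rather than a required change.

```python
import io

import pandas as pd

cleaned = pd.DataFrame({"city": ["Helsinki"], "price": [5458]})

# Default: the index is written as an unnamed first column.
buf_with = io.StringIO()
cleaned.to_csv(buf_with)

# index=False: only the data columns are written.
buf_without = io.StringIO()
cleaned.to_csv(buf_without, index=False)

with_index = pd.read_csv(io.StringIO(buf_with.getvalue()))
without_index = pd.read_csv(io.StringIO(buf_without.getvalue()))

print(list(with_index.columns))
print(list(without_index.columns))
```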