## Setup

### Create sample data

In [None]:
!mkdir -p data

In [1]:
%%writefile data/01_01_2023.csv
prop_id;house_type;has_furniture;surface;price;
1;Apartment;True;15;490;
2;Apartment;False;18;450;
3;House;False;50;650;
4;Apartment;True;20;540;
5;House;True;55;750;

Writing 01_01_2023.csv


In [2]:
%%writefile data/02_01_2023.csv
prop_id;house_type;has_furniture;surface;price;
1;Apartment;True;15;490;
4;Apartment;True;20;550;
5;House;True;55;750;
6;Apartment;True;22;500;
7;Apartment;True;24;580;
8;House;True;52;650;

Writing 02_01_2023.csv


In [3]:
%%writefile data/today.csv
prop_id;house_type;has_furniture;surface;price;
4;Apartment;True;20;540;
6;Apartment;True;22;500;
8;House;True;52;640;
9;Apartment;True;30;600;
10;Apartment;False;35;620;

Writing today.csv


### Install libraries

In [4]:
%%writefile requirements.txt
SQLAlchemy==1.4.39
pandas==1.3.5
psycopg2-binary==2.9.5

Writing requirements.txt


In [None]:
!pip install -r requirements.txt

### Import libraries

In [24]:
from typing import Optional
from datetime import datetime, timedelta
import uuid
from sqlalchemy import create_engine, text
from sqlalchemy.orm import sessionmaker
import pandas as pd
import boto3
import json

### Database connection

In [6]:
def get_secret(secret_name, region_name="us-east-1"):
    session = boto3.session.Session()
    client = session.client(
        service_name='secretsmanager',
        region_name=region_name)
    get_secret_value_response = client.get_secret_value(SecretId=secret_name)
    get_secret_value_response = json.loads(get_secret_value_response['SecretString'])
    return get_secret_value_response

In [7]:
creds = get_secret("wysde")
USERNAME = creds["RDS_POSTGRES_USERNAME"]
PASSWORD = creds["RDS_POSTGRES_PASSWORD"]
HOST = creds["RDS_POSTGRES_HOST"]
DATABASE = 'sparsh'

conn_str = 'postgresql://{0}:{1}@{2}/{3}'.format(USERNAME, PASSWORD, HOST, DATABASE)
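
If the password stored in the secret contains characters such as `@` or `/`, the plain `format` call above can yield a URL that does not parse as intended. The sketch below percent-encodes the credentials before building the string. `build_conn_str` and the sample credentials are hypothetical names used only for illustration; they are not part of the original lab.

```python
from urllib.parse import quote


def build_conn_str(username: str, password: str, host: str, database: str) -> str:
    """Build a PostgreSQL URL with percent-encoded credentials."""
    return "postgresql://{0}:{1}@{2}/{3}".format(
        quote(username, safe=""),  # encode every reserved character
        quote(password, safe=""),
        host,
        database,
    )


# Hypothetical credentials, for illustration only.
example = build_conn_str("admin", "p@ss/word", "db.example.com:5432", "sparsh")
print(example)
```

Credentials without reserved characters pass through unchanged, so this behaves like the original one-liner in the common case.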

In [19]:
%config SqlMagic.autopandas=True
%config SqlMagic.displaycon=False
%config SqlMagic.feedback=False
%config SqlMagic.displaylimit=5
%reload_ext sql
%sql {conn_str}

In [8]:
engine = create_engine(conn_str)
Session = sessionmaker(engine)
session = Session()

## Data ingestion

After collecting data from the website, we load it into the data warehouse. If the warehouse already contains some of that data, we must not load it a second time; otherwise the statistics will be wrong. Here is what we need to implement:

-   Add new offers to the warehouse.
-   For offers that have changed on the website (for example, the price drops), mark the old version as expired and add the new version to the warehouse.
-   For offers that are no longer available on the website, mark them as expired in the warehouse.
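
The three rules above can be sketched in plain Python before moving to SQL. This is only an illustration: `classify_offers` is a hypothetical helper, and the two dictionaries map `prop_id` to `price` using the values from 02_01_2023.csv and today.csv.

```python
from typing import Dict, List

# prop_id -> price, taken from the sample CSV files created in the Setup section.
previous = {1: 490, 4: 550, 5: 750, 6: 500, 7: 580, 8: 650}  # 02_01_2023.csv
current = {4: 540, 6: 500, 8: 640, 9: 600, 10: 620}          # today.csv


def classify_offers(previous: Dict[int, int], current: Dict[int, int]) -> Dict[str, List[int]]:
    """Split offers into new, changed and removed, comparing two snapshots."""
    new = sorted(current.keys() - previous.keys())
    removed = sorted(previous.keys() - current.keys())
    changed = sorted(
        prop_id
        for prop_id in current.keys() & previous.keys()
        if current[prop_id] != previous[prop_id]
    )
    return {"new": new, "changed": changed, "removed": removed}


result = classify_offers(previous, current)
print(result)
```

New and changed offers get a fresh row, while changed and removed offers have their old row marked as expired. Property 6 appears in both snapshots at the same price, so it falls into none of the three groups.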

To determine whether a property is "new", we can use the technique described [here](https://www.sspaeti.com/blog/data-engineering-project-in-twenty-minutes). We add a new field called `fingerprint` that is used to tell versions of a property apart. The data source already provides the `prop_id` field, which is a unique identifier. Since we also want to track the price, we add that information to the `fingerprint`. As a result, the fingerprint is created by concatenating `prop_id` and `price`. The database can hold many rows for the same property, as long as the price differs.
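
As a small illustration of the fingerprint, the sketch below reproduces the plain concatenation used in this lab and shows one caveat: without a separator, two different (`prop_id`, `price`) pairs can produce the same string. The lab only compares fingerprints of rows with the same `prop_id`, so this does not affect it. Both function names and the `|` separator are assumptions made for this example.

```python
def naive_fingerprint(prop_id: int, price: int) -> str:
    """Concatenate prop_id and price, as the lab does."""
    return str(prop_id) + str(price)


def separated_fingerprint(prop_id: int, price: int, sep: str = "|") -> str:
    """Hypothetical variant: a separator keeps the two parts unambiguous."""
    return f"{prop_id}{sep}{price}"


sample = naive_fingerprint(4, 540)

# Two different offers that collide under plain concatenation.
collides = naive_fingerprint(1, 1490) == naive_fingerprint(11, 490)
still_collides = separated_fingerprint(1, 1490) == separated_fingerprint(11, 490)

print(sample, collides, still_collides)
```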

We create a table called `housing_staging` for pre-aggregation data and a table called `housing_model` for aggregated data. In the staging table, the `valid_from` and `expired` fields track the lifetime of each offer. We will see how these fields are used in the Backfill and Recompute Data section.

In [15]:
def setup_db():
    create_table_housing_staging_sql = """
    CREATE TABLE IF NOT EXISTS housing_staging (
        id varchar PRIMARY KEY,
        prop_id int8,
        house_type varchar,
        has_furniture boolean,
        price float,
        surface float,
        fingerprint varchar,
        valid_from timestamp DEFAULT CURRENT_TIMESTAMP,
        expired timestamp DEFAULT NULL
        );
    """
    session.execute(text(create_table_housing_staging_sql))

    create_table_housing_model_sql = """
    CREATE TABLE IF NOT EXISTS housing_model(
        date timestamp,
        house_type varchar,
        has_furniture boolean,
        nb_props int,
        avg_price float,
        avg_price_per_m2 float
    );
    """
    session.execute(text(create_table_housing_model_sql))

    session.commit()

In [16]:
setup_db()

Now suppose the collected data has been loaded into a table called tmp_housing. Thanks to the fingerprint field, we can identify offers that have been changed or removed from the website.

In [17]:
def load_data(csv_file: str) -> None:
    extracted_data = pd.read_csv(csv_file, sep=";")

    extracted_data["fingerprint"] = extracted_data["prop_id"].astype(
        str
    ) + extracted_data["price"].astype(str)
    extracted_data["id"] = [uuid.uuid4() for _ in range(len(extracted_data.index))]

    session.execute(text("DROP TABLE IF EXISTS tmp_housing"))
    session.commit()

    extracted_data.to_sql("tmp_housing", engine)

    expired: datetime = datetime.today()
    expired = expired - timedelta(days=1)
    expired = expired.replace(hour=23, minute=59, second=59)
    expired = expired.strftime("%Y-%m-%d %H:%M:%S")
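    # Expire active rows whose offer was removed or changed. The active-row
    # filter belongs in WHERE: inside the ON clause of a LEFT JOIN it would
    # also select rows that are already expired and overwrite their date.
    update_updated_n_deleted_props_sql = f"""UPDATE housing_staging
            SET expired = '{expired}'
            WHERE id IN (SELECT h.id
                FROM housing_staging h
                LEFT JOIN tmp_housing t
                ON h.prop_id = t.prop_id
                WHERE h.expired IS NULL
                AND (t.prop_id IS NULL
                OR h.fingerprint != t.fingerprint)
            )
            """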
    session.execute(text(update_updated_n_deleted_props_sql))

    valid_from: datetime = datetime.today()
    valid_from = valid_from.replace(hour=0, minute=0, second=0)
    valid_from = valid_from.strftime("%Y-%m-%d %H:%M:%S")
    insert_new_or_updated_props_sql = f"""
        INSERT INTO housing_staging (
            id, prop_id,
            house_type, has_furniture,
            price, surface,
            fingerprint, valid_from
        ) (SELECT t.id, t.prop_id,
        t.house_type, t.has_furniture,
        t.price, t.surface,
        t.fingerprint, '{valid_from}'
        FROM tmp_housing t
        LEFT JOIN housing_staging h
        ON t.prop_id = h.prop_id
        AND h.expired IS NULL
        WHERE h.prop_id IS NULL
        OR h.fingerprint != t.fingerprint)
        """
    session.execute(text(insert_new_or_updated_props_sql))

    session.execute(text("DROP TABLE tmp_housing"))
    session.commit()

Run this function for the today.csv file. The housing_staging table will then contain the following rows (the values of the id and valid_from columns will differ):

In [18]:
load_data('data/today.csv')

In [20]:
%%sql
SELECT * FROM housing_staging

| id | prop_id | house_type | has_furniture | price | surface | fingerprint | valid_from | expired |
|----|---------|------------|---------------|-------|---------|-------------|------------|---------|
| 97952e76-5127-4398-b792-b56e6ed7f518 | 4 | Apartment | True | 540.0 | 20.0 | 4540 | 2023-02-22 | |
| 3c006ad9-8d0b-44cd-a70d-707dd5a8d690 | 6 | Apartment | True | 500.0 | 22.0 | 6500 | 2023-02-22 | |
| 0620d450-6935-4bf1-a23f-e7cbd4c5fd65 | 8 | House | True | 640.0 | 52.0 | 8640 | 2023-02-22 | |
| 37d404aa-7c70-4dce-9575-954473aca082 | 9 | Apartment | True | 600.0 | 30.0 | 9600 | 2023-02-22 | |
| a7b5c9f9-0ee8-4b2a-87aa-7168b38c02a3 | 10 | Apartment | False | 620.0 | 35.0 | 10620 | 2023-02-22 | |


## Data Modeling

The model answers the following three questions:

- How many offers are currently available?
- What is the average price?
- What is the average price per m2?

We can compute these three KPIs from the housing_staging table. The idea is to filter out expired rows and aggregate the rest by house type and furnishing.
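
To make the three KPIs concrete, here is a minimal pure-Python sketch that computes them for the rows of today.csv, grouped the same way as the SQL in this section. `compute_kpis` is a hypothetical helper introduced only for this example.

```python
from collections import defaultdict

# (prop_id, house_type, has_furniture, surface, price) from today.csv
rows = [
    (4, "Apartment", True, 20, 540),
    (6, "Apartment", True, 22, 500),
    (8, "House", True, 52, 640),
    (9, "Apartment", True, 30, 600),
    (10, "Apartment", False, 35, 620),
]


def compute_kpis(rows):
    """Group offers by (house_type, has_furniture) and compute the three KPIs."""
    groups = defaultdict(list)
    for _prop_id, house_type, has_furniture, surface, price in rows:
        groups[(house_type, has_furniture)].append((price, surface))

    kpis = {}
    for key, offers in groups.items():
        count = len(offers)
        kpis[key] = {
            "nb_props": count,
            "avg_price": sum(price for price, _ in offers) / count,
            # Average of per-offer ratios, like AVG(price/surface) in SQL.
            "avg_price_per_m2": sum(price / surface for price, surface in offers) / count,
        }
    return kpis


kpis = compute_kpis(rows)
for key, values in sorted(kpis.items()):
    print(key, values)
```

Note that `avg_price_per_m2` is the mean of the individual price/surface ratios, not the total price divided by the total surface; the two generally differ.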

In [21]:
def transform_data() -> None:
    date: datetime = datetime.today()
    date = date.replace(hour=0, minute=0, second=0)
    date = date.strftime("%Y-%m-%d %H:%M:%S")

    transform_data_sql = f"""
            INSERT INTO housing_model (
                date, house_type, has_furniture,
                nb_props, avg_price, avg_price_per_m2
            ) (SELECT '{date}', house_type, has_furniture,
                COUNT(prop_id) nb_props, AVG(price) avg_price,
                AVG(price/surface) avg_price_per_m2
                FROM housing_staging
                WHERE expired IS NULL
                GROUP BY house_type, has_furniture
            )
    """

    session.execute(text(transform_data_sql))
    session.commit()

In [22]:
transform_data()

Here is what we get from the housing_model table:

In [23]:
%%sql

SELECT * FROM housing_model

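| date | house_type | has_furniture | nb_props | avg_price | avg_price_per_m2 |
|------|------------|---------------|----------|-----------|------------------|
| 2023-02-22 | Apartment | False | 1 | 620.0 | 17.714286 |
| 2023-02-22 | Apartment | True | 3 | 546.666667 | 23.242424 |
| 2023-02-22 | House | True | 1 | 640.0 | 12.307692 |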


## Backfill and Recompute Data

Sometimes we need to backfill and recompute data for a day in the past. For instance, we may discover corrupted data in the warehouse and decide to reload the staging data and recalculate the KPIs. Another example is a pipeline run that failed to load data into the warehouse because of an error.

The tricky part is when we reload data for a day, the data of later dates will be affected. To illustrate that, consider this scenario:

We have the following rows in the housing_staging table:

```
| prop_id | valid_from          | expired |
|---------|---------------------|---------|
| 4       | 2023-02-18 00:00:00 | NULL    |
| 6       | 2023-02-18 00:00:00 | NULL    |
| 8       | 2023-02-18 00:00:00 | NULL    |
| 9       | 2023-02-18 00:00:00 | NULL    |
| 10      | 2023-02-18 00:00:00 | NULL    |
```

Now, if we load the data from the file 02_01_2023.csv, we should end up with this result:

```
| prop_id | valid_from          | expired             |
|---------|---------------------|---------------------|
|       1 | 2023-01-02 00:00:00 | 2023-02-17 23:59:59 |
|       4 | 2023-01-02 00:00:00 | 2023-02-17 23:59:59 |
|       4 | 2023-02-18 00:00:00 |                     |
|       5 | 2023-01-02 00:00:00 | 2023-02-17 23:59:59 |
|       6 | 2023-01-02 00:00:00 |                     |
|       7 | 2023-01-02 00:00:00 | 2023-02-17 23:59:59 |
|       8 | 2023-01-02 00:00:00 | 2023-02-17 23:59:59 |
|       8 | 2023-02-18 00:00:00 |                     |
|       9 | 2023-02-18 00:00:00 |                     |
|      10 | 2023-02-18 00:00:00 |                     |
```

The following modifications are made:

- Properties 1, 5 and 7 are added. They are valid from 2 January 2023 to the end of 17 February 2023, because they appear in 02_01_2023.csv but not in today.csv.
- A new row is added for each of properties 4 and 8, because their prices changed between 02_01_2023.csv and today.csv.
- For property 6, the value of valid_from changes from 2023-02-18 00:00:00 to 2023-01-02 00:00:00. Because this property appears in both files with an unchanged price, we keep the earlier timestamp.
- The rows for properties 9 and 10 remain unchanged.

Given this complexity, we have two options:

- Remove all data from the date we want to reload onwards, then reload every day from that date. This is easier to implement but can take longer if there is a lot of data.
- Load only the data for that date and update the rows of later days accordingly.

In this lab, we take the first option. To load the data in 02_01_2023.csv, we first remove the data from that date onwards, then rerun the load function for 02_01_2023.csv and for today.csv, in that order.
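
The lab does not show the removal step itself. The sketch below is one possible way to do it, demonstrated on an in-memory SQLite table so that it runs on its own; `reset_from`, the reduced three-column table and the sample rows are all assumptions made for illustration. It deletes the versions created on or after the reload date and reopens the versions that a load on or after that date had closed.

```python
import sqlite3
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M:%S"


def reset_from(conn: sqlite3.Connection, for_date: datetime) -> None:
    """Undo every change made by loads on or after for_date (hypothetical helper)."""
    day_start = for_date.replace(hour=0, minute=0, second=0, microsecond=0)
    prev_day_end = day_start - timedelta(seconds=1)
    # Drop versions that first appeared on or after the reload date.
    conn.execute(
        "DELETE FROM housing_staging WHERE valid_from >= ?",
        (day_start.strftime(FMT),),
    )
    # Reopen versions that were closed by a load on or after the reload date.
    conn.execute(
        "UPDATE housing_staging SET expired = NULL WHERE expired >= ?",
        (prev_day_end.strftime(FMT),),
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE housing_staging (prop_id INTEGER, valid_from TEXT, expired TEXT)"
)
# Hypothetical state: one load on 2023-01-01 and one on 2023-02-18.
conn.executemany(
    "INSERT INTO housing_staging VALUES (?, ?, ?)",
    [
        (1, "2023-01-01 00:00:00", "2023-02-17 23:59:59"),
        (4, "2023-01-01 00:00:00", None),
        (9, "2023-02-18 00:00:00", None),
    ],
)

reset_from(conn, datetime(2023, 1, 2))
remaining = conn.execute(
    "SELECT prop_id, valid_from, expired FROM housing_staging ORDER BY prop_id"
).fetchall()
print(remaining)
```

After the reset, the staging table looks as it did at the end of 1 January 2023, and the files for the following days can be loaded again in chronological order. The timestamp comparison relies on the ISO-like text format sorting chronologically; in PostgreSQL the columns are real timestamps, so the same conditions would apply directly.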

In the previous sections, we only ran the ingestion and transformation steps for the current date. We now modify those functions to accept a new parameter, for_date, and set the values of the valid_from and expired fields according to it.
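
As a small sketch of how the two timestamps follow from for_date: a new version becomes valid at the start of that day, and the version it replaces expires one second earlier, at the end of the previous day. `day_bounds` is a hypothetical helper, not a function from the lab.

```python
from datetime import datetime, timedelta
from typing import Optional, Tuple

FMT = "%Y-%m-%d %H:%M:%S"


def day_bounds(for_date: Optional[datetime] = None) -> Tuple[str, str]:
    """Return (valid_from, expired) for a load that runs for the given day."""
    day = for_date if for_date is not None else datetime.today()
    valid_from = day.replace(hour=0, minute=0, second=0, microsecond=0)
    expired = valid_from - timedelta(seconds=1)  # 23:59:59 of the previous day
    return valid_from.strftime(FMT), expired.strftime(FMT)


valid_from, expired = day_bounds(datetime(2023, 2, 18, 15, 30))
print(valid_from, expired)
```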

In [25]:
def load_data(csv_file: str, for_date: Optional[datetime] = None) -> None:
    extracted_data = pd.read_csv(csv_file, sep=";")

    extracted_data["fingerprint"] = extracted_data["prop_id"].astype(
        str
    ) + extracted_data["price"].astype(str)
    extracted_data["id"] = [uuid.uuid4()
                            for _ in range(len(extracted_data.index))]

    session.execute(text("DROP TABLE IF EXISTS tmp_housing"))
    session.commit()

    extracted_data.to_sql("tmp_housing", engine)

    expired: datetime = datetime.today()
    if for_date is not None:
        expired = for_date
    expired = expired - timedelta(days=1)
    expired = expired.replace(hour=23, minute=59, second=59)
    expired = expired.strftime("%Y-%m-%d %H:%M:%S")
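    # Expire active rows whose offer was removed or changed. The active-row
    # filter belongs in WHERE: inside the ON clause of a LEFT JOIN it would
    # also select rows that are already expired and overwrite their date.
    update_updated_n_deleted_props_sql = f"""UPDATE housing_staging
            SET expired = '{expired}'
            WHERE id IN (SELECT h.id
                FROM housing_staging h
                LEFT JOIN tmp_housing t
                ON h.prop_id = t.prop_id
                WHERE h.expired IS NULL
                AND (t.prop_id IS NULL
                OR h.fingerprint != t.fingerprint)
            )
            """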
    session.execute(text(update_updated_n_deleted_props_sql))

    valid_from: datetime = datetime.today()
    if for_date is not None:
        valid_from = for_date
    valid_from = valid_from.replace(hour=0, minute=0, second=0)
    valid_from = valid_from.strftime("%Y-%m-%d %H:%M:%S")
    insert_new_or_updated_props_sql = f"""
        INSERT INTO housing_staging (
            id, prop_id,
            house_type, has_furniture,
            price, surface,
            fingerprint, valid_from
        ) (SELECT t.id, t.prop_id,
            t.house_type, t.has_furniture,
            t.price, t.surface,
            t.fingerprint, '{valid_from}'
            FROM tmp_housing t
            LEFT JOIN housing_staging h
            ON t.prop_id = h.prop_id
            AND h.expired IS NULL
            WHERE h.prop_id IS NULL
            OR t.fingerprint != h.fingerprint
        )
        """
    session.execute(text(insert_new_or_updated_props_sql))

    session.execute(text("DROP TABLE tmp_housing"))
    session.commit()

For the transformation step, we need to adjust the filter condition. The previous version took the rows whose expired field is NULL, because it always ran for the current day.

If we recompute the housing_model table for a day D after the data of later days has already been loaded into housing_staging, we only take the rows that became valid on or before day D and that are either still active or expired after the start of day D.
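
The condition can be checked in plain Python against the expected staging rows listed in the scenario above. This is only an illustrative sketch; `active_on` is a hypothetical helper that mirrors the SQL filter `valid_from <= D AND (expired IS NULL OR expired > D)`.

```python
from datetime import datetime
from typing import List, Optional, Tuple

JAN_02 = datetime(2023, 1, 2)
FEB_18 = datetime(2023, 2, 18)
FEB_17_END = datetime(2023, 2, 17, 23, 59, 59)

# (prop_id, valid_from, expired) as in the expected result of the scenario.
staging: List[Tuple[int, datetime, Optional[datetime]]] = [
    (1, JAN_02, FEB_17_END),
    (4, JAN_02, FEB_17_END),
    (4, FEB_18, None),
    (5, JAN_02, FEB_17_END),
    (6, JAN_02, None),
    (7, JAN_02, FEB_17_END),
    (8, JAN_02, FEB_17_END),
    (8, FEB_18, None),
    (9, FEB_18, None),
    (10, FEB_18, None),
]


def active_on(rows, day: datetime) -> List[int]:
    """Return the prop_ids of the versions that are valid on the given day."""
    return [
        prop_id
        for prop_id, valid_from, expired in rows
        if valid_from <= day and (expired is None or expired > day)
    ]


active_jan_02 = active_on(staging, JAN_02)
active_feb_18 = active_on(staging, FEB_18)
print(active_jan_02, active_feb_18)
```

Each property appears at most once per day, which is what keeps the counts and averages in housing_model correct when a past day is recomputed.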

In [None]:
def transform_data(for_date: Optional[datetime] = None) -> None:
    # Default to today; passing a past date recomputes the KPIs for that day.
    date: datetime = datetime.today()
    if for_date is not None:
        date = for_date
    date = date.replace(hour=0, minute=0, second=0)
    date = date.strftime("%Y-%m-%d %H:%M:%S")

    transform_data_sql = f"""
            INSERT INTO housing_model(
                date, house_type, has_furniture, nb_props,
                avg_price, avg_price_per_m2
            ) (SELECT '{date}', house_type, has_furniture,
                COUNT(prop_id) nb_props, AVG(price) avg_price,
                AVG(price/surface) avg_price_per_m2
                FROM housing_staging
                WHERE valid_from <= '{date}'
                AND (expired IS NULL OR expired > '{date}')
                GROUP BY house_type, has_furniture
            )
            """

    session.execute(text(transform_data_sql))
    # Commit so the recomputed rows are persisted, as in the first version.
    session.commit()

And that’s it: you now have the data ingestion and transformation steps needed to analyze housing price data over time.

## Conclusion

In this lab, we applied change data capture and slowly changing dimension (type 2) modeling to build a model for housing price analysis.