# Build a test dataset of house prices

In this tutorial, we'll build a test dataset: a simple real estate dataset with attributes such as price, number of bedrooms, and parking. The data is entirely fabricated. Once it's created, we'll also show how to save the dataset as a CSV file to AWS S3.

## Import setup packages

In [4]:
import pandas as pd
import tomli
import pprint

# We'll use the standard library's random module to generate the random house price data
import random

## Import config using a TOML file

We'll store our config in a TOML file. The config contains basic instructions for creating the test data. There are four tiers, each with different constraints. For example, in tier 1, we create houses with prices in the range 100k to 200k.


_Contents of config/housing.toml_

```toml
# Config for creating a housing dataset

[tier-1]
price = "100000-200000"
bedrooms = "1-3"
bathrooms = "1-3"
estate_agent_code = "1-5" 
transport_link_code = "1-8"
council_tax = ["A", "B", "C"]
freehold = [true, false]
garage = [false]
parking = [true, false]

[tier-2]
...

```

In [5]:
with open('config/housing.toml', 'rb') as file_obj:
    housing_config = tomli.load(file_obj)
    
pprint.pprint(housing_config["tier-1"])

{'bathrooms': '1-3',
 'bedrooms': '1-3',
 'council_tax': ['A', 'B', 'C'],
 'estate_agent_code': '1-5',
 'freehold': [True, False],
 'garage': [False],
 'parking': [True, False],
 'price': '100000-200000',
 'transport_link_code': '1-8'}


## Build random data using the config file

Next we build our random DataFrame using the TOML file. We use two main functions (plus some helper functions):

1. The first function `generate_one_row_random_house_data` creates a random row of data per the config specification
2. The second main function `generate_housing_df` uses the first function to create a large sample of data

As always, we try to keep our functions small:

> <i>"The first rule of functions is that they should be small. The second rule of functions is that they should be smaller than that."</i>  -Robert C. Martin, Clean Code: A Handbook of Agile Software Craftsmanship

In [6]:
def get_randint_from_range_str(range_str):
    
    """
    Randomly returns an int in range
    The range is parsed from a string in the form '2-10'
    Values greater than 100000 are rounded to the nearest thousand
    """
    
    range_list = [int(i) for i in range_str.split("-")]
    
    min_value, max_value = min(range_list), max(range_list)
    
    rand_int = random.randint(min_value, max_value)
    
    # For bigger numbers, such as price, we round to closest thousand
    # House prices are more likely to be quoted rounded to the nearest thousand
    
    return rand_int if rand_int <= 100000 else round(rand_int, -3)
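
To make the rounding rule concrete, here is a small deterministic sketch of the return expression on its own. The helper name `round_if_large` and its `threshold` parameter are hypothetical, introduced only for illustration; the logic mirrors the last line of `get_randint_from_range_str`.

```python
def round_if_large(value, threshold=100_000):
    """Mirror the return expression: values above the threshold are
    rounded to the nearest thousand, smaller values pass through."""
    return value if value <= threshold else round(value, -3)


print(round_if_large(7))        # 7
print(round_if_large(100_000))  # 100000
print(round_if_large(123_456))  # 123000
print(round_if_large(199_900))  # 200000
```

Note that a rounded value is only guaranteed to stay inside the configured range when the range endpoints are themselves multiples of 1000, as they are for the prices in the config above.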

In [7]:
def get_random_tier_config(housing_config):
    
    """
    Returns tier config for one tier.
    The config set is selected randomly
    """

    tiers = list(housing_config.keys())

    random_tier = random.choice(tiers)

    tier_config = housing_config[random_tier]

    return tier_config

In [17]:
def generate_one_row_random_house_data(config):
    
    """
    Returns one row of data in dict format
    The data is randomly produced using constraints in the config file
    """

    new_row = {}

    for housing_attribute in config:
        
        housing_attribute_value = config[housing_attribute]
        
        if isinstance(housing_attribute_value, str):
            
            # Note:
            # In production, we'd employ some defensive programming here to ensure that strings
            # loaded from the config are always in the form "{int}-{int}". For example: 2-8
                        
            new_row[housing_attribute] = get_randint_from_range_str(housing_attribute_value)

        elif isinstance(housing_attribute_value, list):

            new_row[housing_attribute] = random.choice(housing_attribute_value)
            
    return new_row
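
The note in the function above mentions defensive programming for range strings. One possible shape for that check is sketched below, assuming a regular expression is an acceptable way to validate the `"{int}-{int}"` form. The name `parse_range_str` and the pattern are hypothetical and not part of the original notebook.

```python
import re

# Accepts two non-negative integers separated by a hyphen, e.g. "2-8".
_RANGE_PATTERN = re.compile(r"^\s*(\d+)\s*-\s*(\d+)\s*$")


def parse_range_str(range_str):
    """Return (min_value, max_value) parsed from a string like '2-8'.

    Raises ValueError if the string is not in the form '{int}-{int}'.
    """
    match = _RANGE_PATTERN.match(range_str)
    if match is None:
        raise ValueError(f"Expected a range like '2-8', got {range_str!r}")
    first, second = int(match.group(1)), int(match.group(2))
    return min(first, second), max(first, second)


print(parse_range_str("2-8"))            # (2, 8)
print(parse_range_str("200000-100000"))  # (100000, 200000)
```

A malformed value such as `"cheap"` then fails with a clear `ValueError` when the config is read, rather than with a less obvious error inside `int()`.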

In [18]:
def generate_housing_df(housing_config, row_size=1000000):
    
    """
    Returns a dataframe of random housing data per 
    constraints outlined in TOML file.
    The number of rows is specified in the parameters.
    """

    df_rows = []

    for _ in range(row_size):

        tier_config = get_random_tier_config(housing_config)

        new_row = generate_one_row_random_house_data(tier_config)

        df_rows.append(new_row)

    return pd.DataFrame(df_rows)
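
Because the helpers draw from the module-level `random` generator, each run produces a different dataset. If reproducible test data is wanted, one option is to seed the generator before generating. The sketch below illustrates the effect with a stand-in function; `draw_sample` and the seed value are hypothetical and only imitate a generation run.

```python
import random


def draw_sample(seed, n=5):
    """Stand-in for a generation run: seed the module-level generator,
    then draw n random integers from it."""
    random.seed(seed)
    return [random.randint(1, 8) for _ in range(n)]


first = draw_sample(seed=42)
second = draw_sample(seed=42)

print(first == second)  # True
```

Applied to this tutorial, that would mean calling `random.seed(...)` once before `generate_housing_df(housing_config)`.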

In [19]:
df = generate_housing_df(housing_config)
df.head(3)

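| | price | bedrooms | bathrooms | estate_agent_code | transport_link_code | council_tax | freehold | garage | parking |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 304000 | 4 | 3 | 4 | 7 | F | True | True | False |
| 1 | 218000 | 2 | 3 | 3 | 7 | B | False | True | False |
| 2 | 100000 | 1 | 2 | 5 | 8 | C | True | False | True |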


In [21]:
# Check the statistical dispersion of the sample data

df.describe()

| | price | bedrooms | bathrooms | estate_agent_code | transport_link_code |
|---|---|---|---|---|---|
| count | 1000000.0 | 1000000.0 | 1000000.0 | 1000000.0 | 1000000.0 |
| mean | 362416.987 | 3.623684 | 3.623902 | 3.623851 | 5.124652 |
| std | 225967.881825 | 1.705249 | 1.703965 | 1.148088 | 1.985941 |
| min | 100000.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| 25% | 200000.0 | 2.0 | 2.0 | 3.0 | 4.0 |
| 50% | 300000.0 | 3.0 | 3.0 | 4.0 | 5.0 |
| 75% | 400000.0 | 4.0 | 4.0 | 5.0 | 7.0 |
| max | 1000000.0 | 8.0 | 8.0 | 5.0 | 8.0 |
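
A quick way to sanity-check these statistics is to compare each column's minimum and maximum against the widest range the config allows across all tiers. The sketch below computes those bounds. The helper `overall_bounds` is hypothetical, and the second tier in `example_config` is a made-up placeholder, since only the tier 1 values are shown in this tutorial.

```python
def overall_bounds(housing_config, attribute):
    """Return the (lowest, highest) value any tier allows for a
    range-style attribute such as 'price' or 'bedrooms'."""
    lows, highs = [], []
    for tier_config in housing_config.values():
        low, high = sorted(int(part) for part in tier_config[attribute].split("-"))
        lows.append(low)
        highs.append(high)
    return min(lows), max(highs)


# Tier 1 values come from the config shown earlier; the second tier is a
# hypothetical placeholder used only to make the example runnable.
example_config = {
    "tier-1": {"price": "100000-200000", "bedrooms": "1-3"},
    "tier-2-hypothetical": {"price": "200000-300000", "bedrooms": "2-4"},
}

price_low, price_high = overall_bounds(example_config, "price")
print(price_low, price_high)  # 100000 300000
```

With the real config, the result for each attribute could be compared against `df[attribute].min()` and `df[attribute].max()`.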


## Save CSV file to local data folder

In [26]:
df.to_csv("data/housing-data.csv", index=False)

## Save CSV file to S3

Finally, we save the CSV file to S3. Fortunately, pandas handles the connection to AWS for us: `to_csv` accepts an `s3://` URL, provided the optional `s3fs` package is installed and AWS credentials are configured.

In [25]:
%%script echo skipping

S3_BUCKET_LOCATION = "<BUCKET_LOCATION>"

df.to_csv(S3_BUCKET_LOCATION, index=False)

skipping
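
As a rough sketch of what the bucket location might look like in practice, the helpers below build an `s3://` URI and pass it to `to_csv`. The function names, bucket name, and object key are all hypothetical. The sketch assumes `s3fs` is installed and AWS credentials are available in the environment, and it was not run against a real bucket.

```python
def build_s3_uri(bucket, key):
    """Join a bucket name and an object key into an s3:// URI."""
    return f"s3://{bucket.strip('/')}/{key.lstrip('/')}"


def save_df_to_s3(df, bucket, key):
    """Write a DataFrame to S3 as CSV, without the index column.

    Assumes s3fs is installed and AWS credentials are configured.
    Returns the URI that was written to.
    """
    uri = build_s3_uri(bucket, key)
    df.to_csv(uri, index=False)
    return uri


print(build_s3_uri("my-example-bucket", "housing/housing-data.csv"))
# s3://my-example-bucket/housing/housing-data.csv
```

A call such as `save_df_to_s3(df, "my-example-bucket", "housing/housing-data.csv")` would then mirror the local save above, with the index omitted in the same way.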
