# Build a test CSV dataset and upload it to S3

In this tutorial, we'll build a test CSV dataset and upload it to AWS S3. The dataset is a simple housing dataset with attributes such as price, area, and number of bedrooms. The data is entirely fabricated.

## Import setup packages

In [1]:
import pandas as pd
import tomli
import pprint

# We'll use these random-number functions to generate the fabricated housing data
from random import choice, randint
from numpy.random import normal

## Import config using a TOML file

We'll store our config in a TOML file. The config will contain basic instructions for creating the test data. For example, we specify a min and a max when generating a random int for "number of bedrooms".


_Contents of config/housing.toml_

```toml
# Config for creating a housing dataset

[price]
type = "dist"
mean = 500000
sd = 200000

[area]
type = "min_max"
max = 300
min = 50

[bathrooms]
type = "min_max"
max = 5
min = 1

[bedrooms]
type = "min_max"
max = 5
min = 1

[garage]
type = "bool"

[parking]
type = "bool"

[council_tax_band]
type = "list"
list = ["A", "B", "C", "D", "E"]

```

In [2]:
with open('config/housing.toml', 'rb') as file_obj:
    housing_config = tomli.load(file_obj)
    
pprint.pprint(housing_config)

{'area': {'max': 300, 'min': 50, 'type': 'min_max'},
 'bathrooms': {'max': 5, 'min': 1, 'type': 'min_max'},
 'bedrooms': {'max': 5, 'min': 1, 'type': 'min_max'},
 'council_tax_band': {'list': ['A', 'B', 'C', 'D', 'E'], 'type': 'list'},
 'garage': {'type': 'bool'},
 'parking': {'type': 'bool'},
 'price': {'mean': 500000, 'sd': 200000, 'type': 'dist'}}
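
Because each attribute type expects different keys, it can help to check the loaded config before generating any data. The following is a minimal sketch, not part of the original notebook: `REQUIRED_KEYS` and `validate_housing_config` are hypothetical names, and the key requirements are inferred from the config shown above.

```python
# Keys each attribute type must provide (inferred from the config shown above).
REQUIRED_KEYS = {
    "min_max": {"min", "max"},
    "dist": {"mean", "sd"},
    "bool": set(),
    "list": {"list"},
}


def validate_housing_config(config):
    """Return a list of readable problems; an empty list means the config looks usable."""
    problems = []
    for name, attribute in config.items():
        attr_type = attribute.get("type")
        if attr_type not in REQUIRED_KEYS:
            problems.append(f"{name}: unknown type {attr_type!r}")
            continue
        missing = REQUIRED_KEYS[attr_type] - set(attribute)
        if missing:
            problems.append(f"{name}: missing keys {sorted(missing)}")
        elif attr_type == "min_max" and attribute["min"] > attribute["max"]:
            problems.append(f"{name}: min is greater than max")
        elif attr_type == "list" and not attribute["list"]:
            problems.append(f"{name}: list is empty")
    return problems


# A small hand-written config with two deliberate mistakes.
sample_config = {
    "bedrooms": {"type": "min_max", "min": 1, "max": 5},
    "garage": {"type": "bool"},
    "area": {"type": "min_max", "min": 300, "max": 50},  # min and max swapped
    "price": {"type": "dist", "mean": 500000},  # "sd" is missing
}

for problem in validate_housing_config(sample_config):
    print(problem)
```

Calling `validate_housing_config(housing_config)` on the config loaded above should return an empty list.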


## Build random data using the config file

Next, we build the random data from the config. Two functions do the work:

1. The first returns a single random value for one attribute, based on that attribute's config.
2. The second builds a DataFrame by calling the first function once per row for every column.

In [3]:
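def build_random_housing_column_value(attribute):
    """Return one random value for a column, based on its config entry."""
    attribute_type = attribute["type"]

    if attribute_type == "min_max":
        # randint includes both endpoints
        return randint(attribute["min"], attribute["max"])

    elif attribute_type == "bool":
        return choice([True, False])

    elif attribute_type == "dist":
        # Draw one sample from a normal distribution, then round to the nearest thousand
        sample_price = normal(loc=attribute["mean"], scale=attribute["sd"])
        return int(round(sample_price, -3))

    elif attribute_type == "list":
        return choice(attribute["list"])

    # Fail loudly instead of silently returning None for an unknown type
    raise ValueError(f"Unsupported attribute type: {attribute_type!r}")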

In [4]:
def generate_housing_data_df(sample_size=1000000):
    """Build a DataFrame with one randomly generated column per config entry."""
    df_cols = {}

    for col, column_config in housing_config.items():
        # Generate sample_size independent values for this column
        df_cols[col] = [build_random_housing_column_value(column_config) for _ in range(sample_size)]

    return pd.DataFrame(df_cols)
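
The list comprehension above calls the generator function once per cell, which is seven million Python-level calls at the default sample size. A vectorized variant is sketched below as one possible alternative, assuming NumPy's `Generator` API (`numpy.random.default_rng`). The names `generate_column` and `generate_housing_data_df_vectorized`, the `seed` parameter, and the small `demo_config` are illustrative and not part of the original notebook.

```python
import numpy as np
import pandas as pd


def generate_column(attribute, sample_size, rng):
    """Draw sample_size values for one attribute in a single vectorized call."""
    attr_type = attribute["type"]
    if attr_type == "min_max":
        # endpoint=True makes the upper bound inclusive, like random.randint
        return rng.integers(attribute["min"], attribute["max"], size=sample_size, endpoint=True)
    if attr_type == "bool":
        return rng.integers(0, 2, size=sample_size).astype(bool)
    if attr_type == "dist":
        prices = rng.normal(loc=attribute["mean"], scale=attribute["sd"], size=sample_size)
        # Round to the nearest thousand, as the per-value function does
        return np.round(prices, -3).astype(int)
    if attr_type == "list":
        return rng.choice(attribute["list"], size=sample_size)
    raise ValueError(f"Unsupported attribute type: {attr_type!r}")


def generate_housing_data_df_vectorized(config, sample_size=1000, seed=None):
    """Build the DataFrame column by column; pass a seed for reproducible output."""
    rng = np.random.default_rng(seed)
    return pd.DataFrame(
        {name: generate_column(attribute, sample_size, rng) for name, attribute in config.items()}
    )


# A cut-down config for demonstration; the real one comes from the TOML file.
demo_config = {
    "price": {"type": "dist", "mean": 500000, "sd": 200000},
    "bedrooms": {"type": "min_max", "min": 1, "max": 5},
    "garage": {"type": "bool"},
    "council_tax_band": {"type": "list", "list": ["A", "B", "C", "D", "E"]},
}

demo_df = generate_housing_data_df_vectorized(demo_config, sample_size=1000, seed=42)
print(demo_df.head(3))
```

Passing the config in as an argument, rather than reading the global `housing_config`, also makes the function easier to test with small configs like the one here.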

In [5]:
df = generate_housing_data_df()
df.head(3)

|   | price | area | bathrooms | bedrooms | garage | parking | council_tax_band |
|---|---|---|---|---|---|---|---|
| 0 | 695000 | 202 | 2 | 4 | False | True | B |
| 1 | 574000 | 218 | 5 | 4 | True | True | E |
| 2 | 408000 | 228 | 4 | 1 | True | True | D |


In [6]:
# Check the statistical dispersion of the sample data

df.describe()

|   | price | area | bathrooms | bedrooms |
|---|---|---|---|---|
| count | 1000000.0 | 1000000.0 | 1000000.0 | 1000000.0 |
| mean | 499727.8 | 175.034369 | 3.000174 | 2.999759 |
| std | 200018.1 | 72.434211 | 1.41324 | 1.412847 |
| min | -470000.0 | 50.0 | 1.0 | 1.0 |
| 25% | 365000.0 | 112.0 | 2.0 | 2.0 |
| 50% | 500000.0 | 175.0 | 3.0 | 3.0 |
| 75% | 635000.0 | 238.0 | 4.0 | 4.0 |
| max | 1502000.0 | 300.0 | 5.0 | 5.0 |
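
Note that the `min` price above is negative. With a mean of 500000 and a standard deviation of 200000, zero is only 2.5 standard deviations below the mean, so a normal distribution will occasionally produce negative prices. If that matters for your tests, one option is to clip the prices to a floor. The sketch below assumes a hypothetical floor, `MIN_PRICE`, which is not part of the original config.

```python
import numpy as np

MIN_PRICE = 50_000  # hypothetical floor; pick whatever suits your test data

rng = np.random.default_rng(0)

# Same distribution parameters as the [price] section of the config
raw_prices = rng.normal(loc=500_000, scale=200_000, size=100_000)
rounded_prices = np.round(raw_prices, -3).astype(int)

# Replace every price below the floor with the floor itself
clipped_prices = np.clip(rounded_prices, MIN_PRICE, None)

print("share below floor before clipping:", (rounded_prices < MIN_PRICE).mean())
print("minimum after clipping:", clipped_prices.min())
```

On the DataFrame built earlier, the equivalent would be `df["price"] = df["price"].clip(lower=MIN_PRICE)`. Clipping piles all the out-of-range samples onto the floor value, so the result is no longer strictly normal; for fabricated test data that is usually acceptable.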


## Save CSV file to S3

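Finally, we save the DataFrame as a CSV file on S3. pandas does the heavy lifting here: given an `s3://` path, `to_csv` writes directly to the bucket, provided the `s3fs` package is installed and AWS credentials are configured.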

In [7]:
# Replace the placeholder with the S3 URI of the target CSV file
S3_BUCKET_LOCATION = "<BUCKET_LOCATION>"

df.to_csv(S3_BUCKET_LOCATION)
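
By default, `to_csv` also writes the DataFrame index as an unnamed first column, which shows up as `Unnamed: 0` when the file is read back with `read_csv`. The sketch below shows one way to avoid that with `index=False`. The helper name `save_df_as_csv` is hypothetical, and the example writes to a temporary local file so that it runs without AWS access; passing an `s3://` URI instead should work the same way, assuming `s3fs` and credentials are set up.

```python
import os
import tempfile

import pandas as pd


def save_df_as_csv(df, location):
    """Write df as CSV to a local path or an s3:// URI, leaving out the index column."""
    # For s3:// URIs, pandas relies on s3fs and on AWS credentials being available
    df.to_csv(location, index=False)


demo_df = pd.DataFrame({"price": [695000, 574000], "bedrooms": [4, 4]})

# A temporary local file stands in for the S3 location in this sketch
with tempfile.TemporaryDirectory() as tmp_dir:
    local_path = os.path.join(tmp_dir, "housing.csv")
    save_df_as_csv(demo_df, local_path)
    round_trip_df = pd.read_csv(local_path)

print(list(round_trip_df.columns))
```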