# Creating a database

Here you will create the data that will go into your database. The data is generated in Python and then loaded into an SQLite database.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

We need to create data of various formats: nominal, ordinal, interval and ratio.
Let's pick the topic of house sales.

In [8]:
# Number of samples
n = 1000

# Nominal data: postcodes. Note that these are made less specific for data privacy reasons.
postcodes = [f'AL{str(i).zfill(2)}' for i in range(1, 11)]
postcode_data = np.random.choice(postcodes, n)

# Ordinal data: Age groups
age_groups = ['18-25', '26-35', '36-45', '46-55', '56-65', '66+']
age_group_data = np.random.choice(age_groups, n, p=[0.1, 0.2, 0.3, 0.2, 0.1, 0.1])

# Interval data: date of construction (days are capped at 28 so every date is valid)
construction_year = np.random.randint(1700, 2000, n)
construction_month = np.random.randint(1, 13, n)
construction_day = np.random.randint(1, 29, n)
construction_date = [f'{construction_year[i]}-{str(construction_month[i]).zfill(2)}-'
                     f'{str(construction_day[i]).zfill(2)}' for i in range(n)]

# Ratio data: Price of the house
price_data = np.random.lognormal(mean=13, sigma=0.5, size=n).astype(int)

# Create DataFrame
df = pd.DataFrame({
    'Postcode': postcode_data,
    'Age_Group': age_group_data,
    'Construction_Date': construction_date,
    'House_Price': price_data
})

print(df.head())

  Postcode Age_Group Construction_Date  House_Price
0     AL07     46-55        1893-07-22       231165
1     AL01     26-35        1962-04-27       618135
2     AL01     26-35        1738-06-15       726213
3     AL03     46-55        1945-08-13       576944
4     AL04     36-45        1980-10-18       485213
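
The age groups above are stored as plain strings, so pandas does not know that '18-25' comes before '66+'. The following is a minimal sketch of one way to record that order, using an ordered `pd.Categorical`; the seeded generator `rng` and the variable names `ordered` and `younger` are assumptions for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=3)  # seeded only so the sketch is reproducible

age_groups = ['18-25', '26-35', '36-45', '46-55', '56-65', '66+']
age_group_data = rng.choice(age_groups, 1000, p=[0.1, 0.2, 0.3, 0.2, 0.1, 0.1])

# Tell pandas the categories are ordered (ordinal), following the list order
ordered = pd.Categorical(age_group_data, categories=age_groups, ordered=True)

# Ordered categoricals support comparisons against one of their categories
younger = ordered < '36-45'
print(younger.sum(), 'samples are in a group below 36-45')
```

Without `ordered=True` the same comparison raises an error, which is the practical difference between nominal and ordinal data here.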


That's a minimal example. Now create at least four more columns based on any of the following concepts:
- Number of Bedrooms
- Number of Bathrooms
- Total Square Footage
- Property Type (Detached, Semi-Detached, etc.)
- Number of Floors
- Furnishing (Finished, Unfinished)
- Garden Size
- Time on Market
- Mortgage Rate
- Number of Previous Owners
- Pet-Friendly (Yes/No)
- Appliances Included
- Internet Connectivity (Fiber, DSL, etc.)
- Local Crime Rate
- Distance to Nearby Schools
- Distance to Motorway
- Name of Seller


Randomly generated values are best, but make sure the numbers **make sense** and that every column has the same length N (`n` above).
Read the numpy random documentation for ideas on different distributions.
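
As a rough illustration of picking distributions that make sense, the sketch below generates four of the suggested columns. The choice of distributions, the parameter values, the property types and the column names are all assumptions made for this example; adjust them to suit your own data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)  # seeded only so the sketch is reproducible
n = 1000  # same N as the rest of the data

# Count data: Poisson shifted by 1 so every house has at least one bedroom
bedrooms = rng.poisson(lam=2, size=n) + 1

# 1 to 3 bathrooms, but never more bathrooms than bedrooms (toy assumption)
bathrooms = np.minimum(bedrooms, rng.integers(1, 4, size=n))

# Nominal data drawn with unequal probabilities (hypothetical categories)
property_types = ['Detached', 'Semi-Detached', 'Terraced', 'Flat']
property_type = rng.choice(property_types, size=n, p=[0.25, 0.30, 0.25, 0.20])

# Waiting times: exponential with a mean of 60 days (assumed value)
time_on_market = rng.exponential(scale=60, size=n).round(1)

extra = pd.DataFrame({
    'bedrooms': bedrooms,
    'bathrooms': bathrooms,
    'property_type': property_type,
    'time_on_market': time_on_market,
})
print(extra.describe())
```

Every array is created with `size=n`, which is what keeps N consistent across columns.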

Most real databases contain missing data or other undesirable "filler" values. You can simulate this with masking. Note that NumPy integer arrays cannot hold NaN, so any column you want to mask must be a float array (not the direct output of np.random.randint).

In [21]:
square_footage = np.random.uniform(50, 1000, n)

# Randomly select 50 indices to set to NaN
n_points = 50
random_indices = np.random.choice(square_footage.size, n_points, replace=False).astype(int)
square_footage[random_indices] = np.nan
print(len(square_footage[np.isnan(square_footage)]))  # check how many values are NaN

50
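
If the column you want to mask holds integers, two possible workarounds are sketched below: cast to float before inserting NaN, or convert to the pandas nullable integer dtype `'Int64'`. The column (number of previous owners) and the variable names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=1)  # seeded only so the sketch is reproducible
n = 1000

# Integer counts between 0 and 5, e.g. number of previous owners
previous_owners = rng.integers(0, 6, size=n)

# Option 1: cast to float so the array can store NaN
owners_float = previous_owners.astype(float)
missing = rng.choice(n, size=50, replace=False)
owners_float[missing] = np.nan

# Option 2: pandas nullable integers keep whole numbers and mark gaps as <NA>
owners_nullable = pd.Series(owners_float).astype('Int64')

print(int(np.isnan(owners_float).sum()))  # 50
print(owners_nullable.dtype)              # Int64
```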


You should apply something similar to your own data. If you want to mask out values that meet a condition (rather than randomly selected indices), use np.where(condition, value if true, value if false).
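
A minimal sketch of conditional masking with `np.where` follows. The threshold of 100 and the filler value of -1 are arbitrary assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(seed=2)  # seeded only so the sketch is reproducible
square_footage = rng.uniform(50, 1000, 1000)

# Treat anything under 100 as an implausible entry and replace it with NaN
too_small = square_footage < 100
cleaned = np.where(too_small, np.nan, square_footage)

# Alternatively, simulate a "filler" value such as -1 instead of NaN
filled = np.where(too_small, -1.0, square_footage)

print(int(np.isnan(cleaned).sum()), 'values masked')
```

`np.where` returns a new array, so `square_footage` itself is left unchanged.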

It is helpful to set the primary key of the database as your dataframe index, because the index is saved to the output csv by default. Note that this can be a compound key.

In [None]:
# Example index
df.set_index(['Postcode', 'seller_name'], inplace=True)
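
A primary key has to be unique, and randomly generated data will not guarantee that. The sketch below, on a small made-up dataframe, shows one way to check a compound index for duplicates before exporting; the seller names and prices are invented for illustration.

```python
import pandas as pd

# Toy data with a deliberate duplicate of ('AL02', 'A. Smith')
toy = pd.DataFrame({
    'Postcode': ['AL01', 'AL01', 'AL02', 'AL02'],
    'seller_name': ['A. Smith', 'B. Jones', 'A. Smith', 'A. Smith'],
    'House_Price': [250000, 310000, 420000, 180000],
})
toy = toy.set_index(['Postcode', 'seller_name'])

print(toy.index.is_unique)  # False, so this index cannot serve as a primary key

# Inspect the offending rows, then keep only the first of each duplicate key
dupes = toy[toy.index.duplicated(keep=False)]
deduped = toy[~toy.index.duplicated(keep='first')]
print(deduped.index.is_unique)  # True
```

Dropping rows is only one option; adding another column to the key is another.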

To split a pandas dataframe into multiple csvs, we need to select several columns at once.
To do so, index the dataframe with a list of column names.

In [None]:
# Information on Seller relation
# 'seller_name' is already part of the index set above, so select only the remaining columns
df_seller_information = df[['number_previous_owners', 'time_on_market']]
df_seller_information.to_csv('house_seller.csv')  # by default, the index is saved to the csv

The tables to be saved depend upon what columns you have created.
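
As one possible way of splitting, the sketch below separates a toy dataframe into a house table and an area table linked by `Postcode`. The `house_id` key, the `local_crime_rate` values, the table names and the temporary output directory are all assumptions for this example.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy data; 'house_id' is a hypothetical single-column key
df = pd.DataFrame({
    'house_id': [1, 2, 3, 4],
    'Postcode': ['AL01', 'AL02', 'AL02', 'AL07'],
    'House_Price': [250000, 310000, 420000, 180000],
    'local_crime_rate': [4.1, 7.3, 7.3, 2.8],
})

# One row per house; Postcode stays as the link to the area table
houses = df[['house_id', 'Postcode', 'House_Price']].set_index('house_id')

# One row per postcode, so duplicates are dropped before setting the key
areas = (df[['Postcode', 'local_crime_rate']]
         .drop_duplicates()
         .set_index('Postcode'))

# Written to a temporary directory here; use your own paths in practice
out_dir = Path(tempfile.mkdtemp())
houses.to_csv(out_dir / 'house.csv')
areas.to_csv(out_dir / 'area.csv')

# Sanity check: merging the two tables recovers the original rows
rejoined = houses.reset_index().merge(areas.reset_index(), on='Postcode')
print(rejoined)
```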

## Outputting csvs to SQLite
- Open DB Browser for SQLite (often called SQLite Browser)
- New database, save as house_database.db
- Ensure you are on the database structure tab
- File > Import > Table from CSV file
- Check the preview looks as expected then confirm the import
- Write changes (if you skip this, it will prompt you on the next step)
- Right click on the new table, then choose Modify Table
- Check the boxes as appropriate: Primary Key, Not Null, Auto Increment, Unique
- On some tables, scroll along to the right, click on the foreign key section, and assign the related table and attribute
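
The steps above use the graphical browser. For comparison only, here is a sketch of a roughly equivalent route in Python, using the standard `sqlite3` module and `DataFrame.to_sql`. The table names, columns and values are assumptions, and an in-memory database is used so that nothing is written to disk.

```python
import sqlite3

import pandas as pd

# Toy tables standing in for the csvs you exported
areas = pd.DataFrame({
    'Postcode': ['AL01', 'AL02', 'AL07'],
    'local_crime_rate': [4.1, 7.3, 2.8],
})
houses = pd.DataFrame({
    'house_id': [1, 2, 3, 4],
    'Postcode': ['AL01', 'AL02', 'AL02', 'AL07'],
    'House_Price': [250000, 310000, 420000, 180000],
})

con = sqlite3.connect(':memory:')  # pass 'house_database.db' to write a file instead
con.execute('PRAGMA foreign_keys = ON')  # SQLite does not enforce foreign keys by default

# Declare the keys first; to_sql alone would create tables without them
con.execute('''
    CREATE TABLE area (
        Postcode TEXT PRIMARY KEY,
        local_crime_rate REAL
    )''')
con.execute('''
    CREATE TABLE house (
        house_id INTEGER PRIMARY KEY,
        Postcode TEXT NOT NULL REFERENCES area(Postcode),
        House_Price INTEGER NOT NULL
    )''')

# Append the rows into the existing tables (parent table first)
areas.to_sql('area', con, if_exists='append', index=False)
houses.to_sql('house', con, if_exists='append', index=False)
con.commit()

# Each row of table_info is (cid, name, type, notnull, dflt_value, pk)
columns = con.execute('PRAGMA table_info(house)').fetchall()
print([(col[1], col[5]) for col in columns])
```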

Once all of this is complete, run any JOIN query under the Execute SQL tab.
From the result of that join, use the plotting functionality of the SQLite browser to make a basic scatter plot.
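
If you are unsure what a JOIN query looks like, the sketch below runs one from Python against a tiny in-memory database. The `house` and `area` tables and their contents are made up for this example; the SQL string itself is the kind of query you could paste into the Execute SQL tab, with your own table and column names substituted.

```python
import sqlite3

con = sqlite3.connect(':memory:')
con.executescript('''
    CREATE TABLE area (Postcode TEXT PRIMARY KEY, local_crime_rate REAL);
    CREATE TABLE house (house_id INTEGER PRIMARY KEY, Postcode TEXT, House_Price INTEGER);
    INSERT INTO area VALUES ('AL01', 4.1), ('AL02', 7.3), ('AL07', 2.8);
    INSERT INTO house VALUES (1, 'AL01', 250000), (2, 'AL02', 310000),
                             (3, 'AL02', 420000), (4, 'AL07', 180000);
''')

# Attach each house to the crime rate of its postcode
query = '''
    SELECT house.house_id, house.House_Price, area.local_crime_rate
    FROM house
    JOIN area ON house.Postcode = area.Postcode
    ORDER BY house.house_id
'''
rows = con.execute(query).fetchall()
for row in rows:
    print(row)
```

The two numeric columns in the result (`House_Price` and `local_crime_rate`) are the sort of pair that suits a scatter plot.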

Upload the image from your scatter plot (there is a button to save the image) to this notebook below:


![title](yourimage.png)