# Assemble

This notebook loads and assembles all your basic data sources, including tabular and geospatial data.

The final output is two dataframes:

- UNIVERSE
- SALES

The SALES dataframe represents transactions of parcels; these are ownership transfers with prices, dates, and metadata.  
The UNIVERSE dataframe represents the parcels themselves (land and buildings, and their associated characteristics).

These are packaged together in a data structure called a `SalesUniversePair`, or `sup` for short. `openavmkit` provides many functions that operate on `sup`s without mixing up their fields.

The key thing to understand is that the **Assemble** notebook outputs a `sup` that represents *factual assertions* about the world. In later notebooks, we will have to add assumptions, opinions, and educated guesses, but in this notebook we first establish the firmest facts we can.

You can think of the two dataframes in the `sup` as answering the following questions:

- UNIVERSE:
  - Where is each parcel located in space, and what is its shape?
  - What are the *current* characteristics of each parcel?
    - Which parcels have buildings and which are vacant lots?
    - How big is each parcel?
    - What is the age, size, quality, condition, etc. of each building?
- SALES:
  - Which parcels have sold?
  - What prices did they sell for?
  - What dates did they sell on?
  - Which sales were valid?
  - What characteristics were different *at the time of sale* from how the parcel is now?
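
To make the split concrete, here is a minimal sketch using plain pandas rather than `openavmkit` itself. The column names (`key`, `bldg_area_sqft`, and so on) and the `hydrate_sales` helper are hypothetical, chosen only for illustration; the real `SalesUniversePair` API may differ. It shows sales rows inheriting current characteristics from the universe, except where a sale records a different value at the time of sale.

```python
import pandas as pd

# Hypothetical universe: one row per parcel, describing its *current* state.
universe = pd.DataFrame({
    "key": ["A", "B", "C"],
    "land_area_sqft": [5000.0, 7200.0, 4000.0],
    "bldg_area_sqft": [1800.0, 0.0, 2400.0],  # 0 means a vacant lot
})

# Hypothetical sales: one row per transaction. A non-null bldg_area_sqft
# records a characteristic that differed at the time of sale.
sales = pd.DataFrame({
    "key": ["A", "C", "C"],
    "sale_price": [250000.0, 90000.0, 410000.0],
    "sale_date": pd.to_datetime(["2021-03-01", "2018-06-15", "2023-09-30"]),
    "valid_sale": [True, True, True],
    "bldg_area_sqft": [None, 0.0, None],  # parcel C sold vacant in 2018
})


def hydrate_sales(sales: pd.DataFrame, universe: pd.DataFrame) -> pd.DataFrame:
    """Fill in parcel characteristics for each sale.

    Values recorded on the sale win; anything missing falls back to the
    parcel's current value in the universe.
    """
    merged = sales.merge(universe, on="key", how="left", suffixes=("", "_univ"))
    for col in universe.columns:
        univ_col = f"{col}_univ"
        if univ_col in merged.columns:
            merged[col] = merged[col].fillna(merged[univ_col])
            merged = merged.drop(columns=univ_col)
    return merged


hydrated = hydrate_sales(sales, universe)
print(hydrated[["key", "sale_date", "bldg_area_sqft", "land_area_sqft"]])
```

Keeping the two tables separate until a step like this is what lets the universe stay a record of the present while each sale keeps its own snapshot of the past.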



In [None]:
# Change these as desired

# The slug of the locality you are currently working on
locality = "us-pa-philadelphia"

# Which cloud service to look for data in (only used for a new locality, ignored otherwise)
bootstrap_cloud = "azure"

# Whether to print out a lot of stuff (can help with debugging) or stay mostly quiet
verbose = True

# Clear previous state for this notebook and start fresh
clear_checkpoints = True

# 1. Basic setup

In [None]:
import init_notebooks
init_notebooks.setup_environment()

In [None]:
# import OpenAVMkit:
from openavmkit.pipeline import ( 
    init_notebook,
    from_checkpoint,
    delete_checkpoints,
    examine_df,
    examine_df_in_ridiculous_detail,
    examine_sup,
    examine_sup_in_ridiculous_detail,
    cloud_sync,
    load_settings,
    load_dataframes,
    process_data,
    process_sales,
    enrich_sup_streets,
    tag_model_groups_sup,
    write_notebook_output_sup
)

In [None]:
init_notebook(locality)

In [None]:
if clear_checkpoints:
    delete_checkpoints("1-assemble")

# 2. Sync with Cloud
- If you have configured cloud storage, syncs with your remote storage
- Reconciles your local input files with the versions on the remote server
- Pulls down whatever is newer from the remote server
- Uploads whatever is newer from your local machine
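
The reconciliation rule described above can be sketched as follows. This is not how `cloud_sync` is implemented; `plan_sync` and its inputs are hypothetical. The sketch assumes that "newer" is decided by comparing modification timestamps and that a file present on only one side is copied to the other, neither of which is specified here.

```python
def plan_sync(local: dict[str, float], remote: dict[str, float]) -> tuple[list[str], list[str]]:
    """Decide which files to download and which to upload.

    `local` and `remote` map file names to modification timestamps
    (seconds since the epoch). Returns (to_download, to_upload).
    """
    to_download, to_upload = [], []
    for name in sorted(set(local) | set(remote)):
        local_time = local.get(name)
        remote_time = remote.get(name)
        if local_time is None or (remote_time is not None and remote_time > local_time):
            to_download.append(name)  # missing locally, or remote is newer
        elif remote_time is None or local_time > remote_time:
            to_upload.append(name)    # missing remotely, or local is newer
        # equal timestamps: already in sync, nothing to do
    return to_download, to_upload


# Hypothetical file listings for illustration only.
local_files = {"parcels.parquet": 100.0, "sales.csv": 250.0, "notes.txt": 50.0}
remote_files = {"parcels.parquet": 200.0, "sales.csv": 150.0, "zoning.parquet": 75.0}
to_download, to_upload = plan_sync(local_files, remote_files)
print("download:", to_download)
print("upload:", to_upload)
```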

In [None]:
cloud_sync(locality, verbose=verbose, bootstrap=bootstrap_cloud)

In [None]:
settings = load_settings()

# 3. Load & process data
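
Each step in this section is wrapped in `from_checkpoint`, so rerunning the notebook can reuse saved results instead of recomputing them. Below is a minimal sketch of that caching pattern, assuming one pickle file per checkpoint. The `cached_call` helper and the file layout are hypothetical and are not `openavmkit`'s actual implementation.

```python
import os
import pickle
import tempfile


def cached_call(path: str, func, params: dict):
    """Return the pickled result at `path` if it exists; otherwise call
    `func(**params)`, save the result to `path`, and return it."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    result = func(**params)
    with open(path, "wb") as f:
        pickle.dump(result, f)
    return result


calls = []


def slow_square(x: int) -> int:
    calls.append(x)  # record each real computation
    return x * x


checkpoint_dir = tempfile.mkdtemp()
checkpoint_path = os.path.join(checkpoint_dir, "01-square.pickle")

first = cached_call(checkpoint_path, slow_square, {"x": 12})   # computes and saves
second = cached_call(checkpoint_path, slow_square, {"x": 12})  # loads from disk
print(first, second, calls)
```

In a scheme like this, deleting the saved file forces the step to run again, which is why clearing checkpoints at the top of the notebook gives you a fresh start.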

In [None]:
# load all of our initial dataframes, but don't do anything with them just yet
dataframes = from_checkpoint("1-assemble-01-load_dataframes", load_dataframes,
    {
        "settings":load_settings(),
        "verbose":verbose
    }
)

In [None]:
# assemble our data
sales_univ_pair = from_checkpoint("1-assemble-02-process_data", process_data,
    {
        "dataframes":dataframes, 
        "settings":load_settings(), 
        "verbose":verbose
    }
)

In [None]:
# calculate street frontages
sales_univ_pair = from_checkpoint("1-assemble-03-enrich_streets", enrich_sup_streets,
    {
        "sup": sales_univ_pair,
        "settings":load_settings(), 
        "verbose":verbose
    }
)

# 4. Inspect results

## 4.1 Examine

- Run the next cell and look at the printed results.
- Note the "Non-zero" and "Non-null" columns in particular, and make sure they match what you expect.
- This view is a quick glance that gives you an overview of all your data.
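
If you want to cross-check those two columns by hand, the counts are straightforward to compute in pandas. The dataframe and the `count_summary` helper below are made up for illustration, and the exact definitions `examine_sup` uses may differ.

```python
import pandas as pd

# Made-up data: missing values and zeros are deliberately mixed in.
df = pd.DataFrame({
    "bldg_area_sqft": [1800.0, 0.0, None, 2400.0],
    "bldg_year_built": [1950.0, None, None, 2004.0],
    "land_area_sqft": [5000.0, 7200.0, 4000.0, 4000.0],
})


def count_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column counts of non-null values and of non-null, non-zero values."""
    non_null = df.notna().sum()
    # NaN != 0 evaluates to True, so mask out nulls explicitly.
    non_zero = (df.notna() & (df != 0)).sum()
    return pd.DataFrame({"non_null": non_null, "non_zero": non_zero})


summary = count_summary(df)
print(summary)
```

A field whose non-null count is far below the row count, or whose non-zero count is far below its non-null count, is worth a second look before you rely on it.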

In [None]:
examine_sup(sales_univ_pair, load_settings())

## 4.2 Examine in ridiculous detail

- You've looked, now LOOK AGAIN. This cell will run `describe()` for each numeric field and `value_counts()` for each categorical field.
- Use this info to decide which variables are useful/useless
- Consult this readout when you build your modeling group filters
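
The same kind of readout can be reproduced on any single dataframe with plain pandas. This is a sketch with made-up data and a hypothetical `detail_report` helper, not the function's actual implementation.

```python
import pandas as pd

df = pd.DataFrame({
    "sale_price": [250000.0, 90000.0, 410000.0, 325000.0],
    "bldg_quality": ["average", "good", "average", None],
})


def detail_report(df: pd.DataFrame) -> dict:
    """Map each column to describe() if numeric, else value_counts()."""
    report = {}
    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            report[col] = df[col].describe()
        else:
            # dropna=False keeps missing values visible in the tally
            report[col] = df[col].value_counts(dropna=False)
    return report


report = detail_report(df)
for name, stats in report.items():
    print(f"--- {name} ---")
    print(stats)
```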

In [None]:
examine_sup_in_ridiculous_detail(sales_univ_pair, load_settings())

## 4.3 Look at it on a map

- Go to your `out/look/` folder
- You should find Parquet files there
- Drop them into ArcGIS, QGIS, or Felt
- Look at your location fields and make sure they make sense

# 5. Tag modeling groups
- Separates rows into groups like "single family", "townhomes" and "commercial" as specified by the user
- These groups will guide all further processing
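
Here is a minimal sketch of this kind of tagging, assuming each group is defined by a boolean filter and the first matching group wins. The `land_use` and `units` columns, the filter rules, and the `tag_groups` helper are hypothetical; `openavmkit` reads its own group definitions from your settings.

```python
import pandas as pd

universe = pd.DataFrame({
    "key": ["A", "B", "C", "D"],
    "land_use": ["residential", "residential", "commercial", "industrial"],
    "units": [1, 4, 0, 0],
})

# Ordered, hypothetical group definitions written as pandas query strings.
group_filters = {
    "single_family": "land_use == 'residential' and units == 1",
    "townhomes": "land_use == 'residential' and units > 1",
    "commercial": "land_use == 'commercial'",
}


def tag_groups(df: pd.DataFrame, filters: dict[str, str]) -> pd.DataFrame:
    """Add a `model_group` column; rows matching no filter stay untagged."""
    tagged = df.copy()
    tagged["model_group"] = None
    for group, expr in filters.items():
        matches = tagged.query(expr).index
        untagged = tagged.index[tagged["model_group"].isna()]
        # Only tag rows that no earlier group has already claimed.
        tagged.loc[matches.intersection(untagged), "model_group"] = group
    return tagged


tagged = tag_groups(universe, group_filters)
print(tagged)
```

Rows left untagged (parcel `D` here) are a useful signal that a group definition is missing or too narrow.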

In [None]:
sales_univ_pair = from_checkpoint("1-assemble-04-tag_modeling_groups", tag_model_groups_sup,
    {
        "sup": sales_univ_pair, 
        "settings": load_settings(), 
        "verbose": verbose
    }
)

# 6. Write out results

In [None]:
write_notebook_output_sup(
    sales_univ_pair, 
    "1-assemble", 
    parquet=True, 
    gpkg=False, 
    shp=False
)

# 7. Look at it on a map!
- Take the files output in the previous step and put them in a map viewer like QGIS, ArcGIS, or Felt
- Look at them with your eyeballs
- Make sure the data looks correct
- If not, go back and fix it!
- Don't proceed to the next step until everything looks right