# Clean

This notebook loads your assembled data and prepares it for modeling.

The final output is a `sup` (a sales-universe pair) that is ready for modeling and analysis.

Here's what happens in this notebook:
- We fill gaps in the data using reasonable assumptions
- We prepare the data for further analysis by marking clusters
- We process sales validity information
- We run our own sales scrutiny heuristic to make sure we only use trustworthy sales that reflect market value

These operations are necessary for modeling, but nevertheless inject a certain amount of subjectivity into the model, which is why we keep the results of the **Clean** notebook separate from those produced in the **Assemble** notebook.

In [None]:
# Change these as desired

# The slug of the locality you are currently working on
locality = "us-pa-philadelphia"

# Whether to print out a lot of stuff (can help with debugging) or stay mostly quiet
verbose = True

# Clear previous state for this notebook and start fresh
clear_checkpoints = True

# Set to True to have sales scrutiny drop sales rather than just flagging them
sales_scrutiny_drop_outliers = False    # Drop outlier sales in sales clusters
sales_scrutiny_drop_heuristics = True  # Drop sales that match suspicious metadata patterns

# 1. Basic setup

In [None]:
import init_notebooks
init_notebooks.setup_environment()

In [None]:
# Import the pipeline functions used in this notebook
from openavmkit.pipeline import (
    init_notebook,
    from_checkpoint,
    delete_checkpoints,
    write_checkpoint,
    read_pickle,
    load_settings,
    examine_sup,
    fill_unknown_values_sup,
    process_sales,
    mark_ss_ids_per_model_group_sup,
    mark_horizontal_equity_clusters_per_model_group_sup,
    run_sales_scrutiny,
    write_notebook_output_sup
)

In [None]:
init_notebook(locality)

In [None]:
if clear_checkpoints:
    delete_checkpoints("2-clean")

# 2. Load data

In [None]:
settings = load_settings()

In [None]:
# load the data
sales_univ_pair = read_pickle("out/1-assemble-sup")

In [None]:
examine_sup(sales_univ_pair, settings)

# 3. Fill unknowns

Modeling functions cannot process null values, so you need to fill them in.  
The goal is to **eliminate all gaps in your data,** at least for fields you intend to turn into modeling variables.

Consult the documentation for more details and best practices on filling unknown values.
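
As a rough illustration of what "filling gaps" can mean, here is a minimal sketch in plain Python. It is not how `fill_unknown_values_sup` works internally (that function is driven by your settings); the helper `fill_unknowns`, the field names, and the fill rules (median for numeric fields, a placeholder for categorical fields) are all assumptions made for this example.

```python
from statistics import median


def fill_unknowns(rows, numeric_fields, categorical_fields, placeholder="UNKNOWN"):
    """Hypothetical helper: return copies of rows with None values filled in.

    Numeric fields get the median of the known values; categorical fields
    get a placeholder label. The input rows are left untouched.
    """
    medians = {}
    for field in numeric_fields:
        known = [r[field] for r in rows if r.get(field) is not None]
        # Fall back to 0 only if no value is known at all (an arbitrary choice)
        medians[field] = median(known) if known else 0

    filled = []
    for r in rows:
        new = dict(r)
        for field in numeric_fields:
            if new.get(field) is None:
                new[field] = medians[field]
        for field in categorical_fields:
            if new.get(field) is None:
                new[field] = placeholder
        filled.append(new)
    return filled


# Made-up parcels; the field names are illustrative only
parcels = [
    {"key": "a", "bldg_area_sqft": 1200, "bldg_type": "rowhouse"},
    {"key": "b", "bldg_area_sqft": None, "bldg_type": None},
    {"key": "c", "bldg_area_sqft": 1800, "bldg_type": "rowhouse"},
]
clean = fill_unknowns(parcels, ["bldg_area_sqft"], ["bldg_type"])
```

Whatever rule you choose, the point is the same: after this step, no field you plan to model on should contain a null.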

In [None]:
# Fill holes in the data with sensible defaults
sales_univ_pair = fill_unknown_values_sup(sales_univ_pair, settings)

# 4. Clustering

We cluster all similar properties and give each cluster a unique ID.  
Later, we'll use these IDs whenever we want to run a horizontal equity study.
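
To make the idea concrete, here is a small sketch of one simple way to cluster: treat properties as "similar" when they share the same categorical attributes and fall into the same size bin, then give each distinct combination its own ID. This is an assumption for illustration, not the algorithm used by `mark_horizontal_equity_clusters_per_model_group_sup`; the helper `mark_clusters` and the field names are hypothetical.

```python
def mark_clusters(rows, category_fields, numeric_field, bin_width):
    """Hypothetical helper: assign a cluster ID per (categories, size bin) combination."""
    ids = {}  # maps each signature to its cluster ID, in order of first appearance
    marked = []
    for r in rows:
        size_bin = int(r[numeric_field] // bin_width)
        signature = tuple(r[f] for f in category_fields) + (size_bin,)
        cluster_id = ids.setdefault(signature, len(ids))
        marked.append({**r, "cluster_id": cluster_id})
    return marked


# Made-up parcels; the field names are illustrative only
parcels = [
    {"key": "a", "neighborhood": "A", "bldg_area_sqft": 1100},
    {"key": "b", "neighborhood": "A", "bldg_area_sqft": 1400},
    {"key": "c", "neighborhood": "A", "bldg_area_sqft": 2100},
    {"key": "d", "neighborhood": "B", "bldg_area_sqft": 1200},
]
clustered = mark_clusters(parcels, ["neighborhood"], "bldg_area_sqft", bin_width=500)
```

Here parcels `a` and `b` land in the same cluster, while `c` (larger) and `d` (different neighborhood) each get their own.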


In [None]:
settings = load_settings()
sales_univ_pair = from_checkpoint("2-clean-00-horizontal-equity", mark_horizontal_equity_clusters_per_model_group_sup,
    {
        "sup": sales_univ_pair,
        "settings": settings,
        "verbose": verbose,
        "do_land_clusters": True,
        "do_impr_clusters": True
    }
)

# 5. Process sales

We process sales validity information to set all the right codes for later use.  
We also calculate time trends in sale prices to generate time-adjusted sale prices.
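
The sketch below shows one simple way a time adjustment can work: build a price index from the median sale price in each period, then scale every sale to a chosen target period. This is only an assumed approach for illustration and may differ from what `process_sales` actually does; the helper `time_adjust` and the field names are hypothetical.

```python
from collections import defaultdict
from statistics import median


def time_adjust(sales, target_period):
    """Hypothetical helper: scale each price by (target median / own-period median)."""
    by_period = defaultdict(list)
    for s in sales:
        by_period[s["period"]].append(s["price"])

    # One index value per period: the median sale price in that period
    index = {period: median(prices) for period, prices in by_period.items()}
    base = index[target_period]

    return [
        {**s, "price_time_adj": s["price"] * base / index[s["period"]]}
        for s in sales
    ]


# Made-up sales; the field names are illustrative only
sales = [
    {"key": "a", "period": "2023", "price": 100_000},
    {"key": "b", "period": "2023", "price": 120_000},
    {"key": "c", "period": "2024", "price": 121_000},
]
adjusted = time_adjust(sales, target_period="2024")
```

In this toy data the median price rose 10% between periods, so the 2023 sales are scaled up by 10% and the 2024 sale is unchanged.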

In [None]:
sales_univ_pair = from_checkpoint("2-clean-01-process_sales", process_sales,
    {
        "sup": sales_univ_pair,
        "settings": load_settings(),
        "verbose": verbose
    }
)

# 6. Scrutinize sales

We cluster all sales of similar properties in similar locations.  
We flag individual sales that are anomalously high or low for their local cluster.  
This helps us catch potentially invalid sales that escaped the assessor's notice.
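
As an illustration of the flagging idea, the sketch below compares each sale's price per square foot against the median for its cluster and flags sales that fall outside a tolerance band. The thresholds, the helper `scrutinize`, and the field names are assumptions for this example and do not describe the heuristic in `run_sales_scrutiny`. The `drop` parameter mimics the flag-versus-drop choice made by the settings at the top of this notebook.

```python
from collections import defaultdict
from statistics import median


def scrutinize(sales, low=0.5, high=2.0, drop=False):
    """Hypothetical helper: flag (or drop) sales far from their cluster's median $/sqft."""
    by_cluster = defaultdict(list)
    for s in sales:
        by_cluster[s["cluster_id"]].append(s["price"] / s["sqft"])
    medians = {c: median(values) for c, values in by_cluster.items()}

    result = []
    for s in sales:
        ratio = (s["price"] / s["sqft"]) / medians[s["cluster_id"]]
        is_outlier = ratio < low or ratio > high
        if drop and is_outlier:
            continue  # remove the sale entirely instead of flagging it
        result.append({**s, "flagged": is_outlier})
    return result


# Made-up sales in a single cluster; the field names are illustrative only
sales = [
    {"key": "a", "cluster_id": 0, "price": 100_000, "sqft": 1000},
    {"key": "b", "cluster_id": 0, "price": 110_000, "sqft": 1000},
    {"key": "c", "cluster_id": 0, "price": 105_000, "sqft": 1000},
    {"key": "d", "cluster_id": 0, "price": 20_000, "sqft": 1000},
]
flagged = scrutinize(sales)
kept = scrutinize(sales, drop=True)
```

Sale `d` sells for roughly a fifth of the cluster's typical price per square foot, so it is flagged in the first call and removed in the second.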

In [None]:
sales_univ_pair = from_checkpoint("2-clean-02-sales-scrutiny", run_sales_scrutiny,
    {
        "sup": sales_univ_pair,
        "settings": load_settings(),
        "drop_cluster_outliers": sales_scrutiny_drop_outliers,
        "drop_heuristic_outliers": sales_scrutiny_drop_heuristics,
        "verbose": verbose
    }
)

# 7. Write out results

In [None]:
write_notebook_output_sup(sales_univ_pair, "2-clean")

# 8. Look at it on a map!
- Take the files output in the previous step and put them in a map viewer like QGIS, ArcGIS, or Felt
- Look at them with your eyeballs
- Make sure the data looks correct
- If not, go back and fix it!
- Don't proceed to the next step until everything looks right