# ClarityX — Report

**Title:** Dataset Merging, Location Resolution, Nearest-Region Mapping, and Asset Valuation  

**Notebook:** `ClarityX - Dataset merging.ipynb`   

---

# Executive summary

This report documents the data preparation, merging, geospatial matching, and valuation pipeline implemented in the notebook. The objective was to consolidate multiple RWAP datasets, resolve location-level conflicts (ZIP / city / state), derive region-level price metadata, map assets to nearest regions using geospatial nearest-neighbor search, and produce an imputed asset valuation table. The pipeline produces audit outputs that make the resolution decisions traceable and a final per-asset valuation suitable for downstream analysis.

**Key final outputs (files written by the notebook):**
- `final_asset_valuation.csv` — per-asset estimated valuation  
- `assets_nearest_regions_raw.csv` — nearest-region matches with rank and distance  
- `dataset1_merged_step.csv` — intermediate merged dataset  
- `city_resolution_audit.csv`, `state_resolution_audit.csv`, `conflict_resolved.csv` — audit and conflict resolution records  

---

# Data sources

Explicitly referenced input files (loaded via `pd.read_*` in the notebook):
- `rwap25_gis_dataset1.csv`  
- `rwap25_gis_dataset2.csv`  
- `dataset1_merged_step.csv`  
- `conflict_resolved.csv`  
- `state_resolution_audit.csv`  

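The two `rwap25_gis_*` files are the raw inputs: asset records with geographic coordinates, and ZIP-level price time series. The remaining files are intermediate artifacts (the merged checkpoint and the conflict-resolution outputs) that later cells read back in.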

---

# Methodology

1. **Environment and imports**  
   - The notebook imports standard data science and geospatial libraries: `pandas`, `numpy`, scikit-learn neighbor search (`BallTree`, `NearestNeighbors`), `pgeocode`, `fuzzywuzzy`, `folium`, `pathlib`, and math utilities for haversine calculations.  

2. **Canonicalization of location fields**  
   - ZIP fields were standardized (removal of extraneous characters, normalization of numeric representations, preservation of leading zeros).  
   - City and state fields were normalized for text comparisons (case standardization, punctuation removal) and prepared for conflict detection.  

3. **Conflict detection and resolution**  
   - Differing city/state values across sources were flagged.  
   - Conflict decisions produced `Unified_City` / `Final_City` fields, with provenance recorded in columns such as `city_source`, `state_source`.  
   - Audit outputs (`city_resolution_audit.csv`, etc.) captured these decisions for traceability.  

4. **Region-level aggregation**  
   - Price statistics and centroids (`region_lat`, `region_lon`) were aggregated at the ZIP/region level.  
   - Features included `latest_price` and short-window averages.  

5. **Dataset merging**  
   - Enrichments were performed through left joins that attach region metadata to assets.  
   - Some joins used `validate='m:1'` to prevent unintended many-to-many expansions.  

6. **Geospatial nearest-neighbor matching**  
   - A BallTree indexed region centroids (converted to radians) using the haversine metric.  
   - Assets with valid coordinates were queried for K nearest regions.  
   - Results were stored in `assets_nearest_regions_raw.csv` with neighbor ranks and distances.  

7. **Valuation derivation**  
   - Asset valuations (`estimated_price`) were derived using region-level price statistics from nearest regions.  
   - Results were saved in `final_asset_valuation.csv`.  
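
The nearest-neighbor matching in step 6 can be sketched as follows. This is a minimal illustration, not the notebook's exact cell: the coordinates are hypothetical city locations chosen for the example, and `EARTH_RADIUS_KM` is a name introduced here.

```python
import numpy as np
from sklearn.neighbors import BallTree

EARTH_RADIUS_KM = 6371.0  # same radius as the notebook's haversine_km helper

# Hypothetical region centroids as (lat, lon) in degrees; not taken from the data.
region_deg = np.array([
    [40.7128, -74.0060],   # New York
    [39.9526, -75.1652],   # Philadelphia
    [42.3601, -71.0589],   # Boston
])
# Hypothetical asset location as (lat, lon) in degrees.
asset_deg = np.array([
    [40.7357, -74.1724],   # Newark
])

# The haversine metric expects [lat, lon] pairs in radians.
tree = BallTree(np.radians(region_deg), metric="haversine")

# Distances come back in radians on the unit sphere, sorted nearest first.
dist_rad, idx = tree.query(np.radians(asset_deg), k=2)
dist_km = dist_rad * EARTH_RADIUS_KM

for rank, (region_index, d) in enumerate(zip(idx[0], dist_km[0]), start=1):
    print(f"neighbor_rank={rank} region_index={region_index} distance_km={d:.1f}")
```

Each row of `idx` lists region positions in rank order, so flattening `idx` and `dist_km` gives the long table of (asset, neighbor rank, distance) described above.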

---

# Features produced and their roles

- `zip_std` / `Zip_raw`: standardized ZIP for joins  
- `Unified_City`, `Final_City`, `city_source`: consolidated city values and provenance  
- `State_unified_raw`, `state_source`: unified state values and provenance  
- `region_zip`, `RegionName_raw`: identifiers for region grouping  
- `region_lat`, `region_lon`: region centroid coordinates  
- `asset_lat`, `asset_lon`: asset coordinates  
- `neighbor_rank`, `distance`: rank and great-circle distance from asset to region centroid  
- `latest_price`, `latest_navg`: region-level price summaries  
- `estimated_price`: final imputed value for each asset  

---

# Observations

- **Traceability:** Audit CSVs record conflict resolution decisions, supporting reproducibility.  
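- **Geospatial approach:** Valuations rely on nearest-neighbor matching and local price statistics rather than a supervised ML model, which keeps each estimate interpretable.  
- **Cardinality safeguards:** Joins that use `validate='m:1'` raise a pandas `MergeError` if the right-hand key is not unique, so a join cannot silently duplicate asset rows.  
- **Distance units:** `BallTree` with the haversine metric returns distances in radians; multiplying by the Earth's radius (6371 km in the notebook's `haversine_km` helper) converts them to kilometers.  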
- **Missing coordinates:** Assets without latitude/longitude were excluded from matching. Handling missingness is necessary for complete coverage.  
- **Source harmonization:** Deterministic rules selected between conflicting values, with audit logs to support review.  
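
The cardinality safeguard can be illustrated with a small sketch. The frames below are toy data invented for the example; only the `validate='m:1'` argument mirrors the notebook.

```python
import pandas as pd

# Toy frames: several assets can share a ZIP, but each ZIP should appear once on the right.
assets = pd.DataFrame({"asset": ["a1", "a2", "a3"],
                       "zip": ["01001", "01001", "01002"]})
regions_ok = pd.DataFrame({"zip": ["01001", "01002"],
                           "price_latest": [340000.0, 538000.0]})
# Same table with the first ZIP row repeated, simulating a non-unique key.
regions_dup = pd.concat([regions_ok, regions_ok.iloc[[0]]], ignore_index=True)

# Unique right-hand keys: the merge succeeds and keeps one row per asset.
merged_ok = assets.merge(regions_ok, on="zip", how="left", validate="m:1")

# Duplicated right-hand keys: pandas raises instead of silently expanding rows.
try:
    assets.merge(regions_dup, on="zip", how="left", validate="m:1")
    duplicate_caught = False
except pd.errors.MergeError:
    duplicate_caught = True

print(len(merged_ok), duplicate_caught)
```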

---

# Findings

- Multi-source location data was successfully consolidated and unified.  
- Per-asset valuations were derived transparently via geospatial nearest-neighbor propagation of regional price signals.  
- Audit outputs provide a detailed trail for quality assurance and conflict verification.  
- The pipeline outputs both detailed neighbor-match tables and clean final valuations for analysis.  

---

# Conclusion

The notebook implements a robust, auditable pipeline for merging heterogeneous location datasets, resolving inconsistencies, and producing asset valuations via geospatial nearest-neighbor matching to region price statistics.  

Outputs are well suited for both diagnostic review (audit and neighbor tables) and operational use (per-asset valuation file). The pipeline emphasizes interpretability, reproducibility, and transparency, making each valuation traceable to its geographic and statistical origins.  

---

*End of report.*


In [None]:
# imports

In [None]:
import pandas as pd
import numpy as np
import re

In [None]:
# Step 1 — load the two CSV files

In [None]:
df1 = pd.read_csv('rwap25_gis_dataset1.csv')   # assets
df2 = pd.read_csv('rwap25_gis_dataset2.csv')   # zip time-series

print("DF1 rows,cols:", df1.shape)
print("DF2 rows,cols:", df2.shape)
print("DF1 columns:", df1.columns.tolist())
print("DF2 columns (sample):", df2.columns[:12].tolist())

DF1 rows,cols: (8652, 18)
DF2 rows,cols: (26314, 316)
DF1 columns: ['Location Code', 'Real Property Asset Name', 'Installation Name', 'Owned or Leased', 'GSA Region', 'Street Address', 'City', 'State', 'Zip Code', 'Latitude', 'Longitude', 'Building Rentable Square Feet', 'Available Square Feet', 'Construction Date', 'Congressional District', 'Congressional District Representative Name', 'Building Status', 'Real Property Asset Type']
DF2 columns (sample): ['RegionID', 'SizeRank', 'RegionName', 'RegionType', 'StateName', 'State', 'City', 'Metro', 'CountyName', '31-01-2000', '29-02-2000', '31-03-2000']


In [None]:
# Step 2 — keep only zip-level rows from Dataset 2

In [None]:
# If RegionType exists, filter rows where it's 'zip' (case-insensitive).
if 'RegionType' in df2.columns:
    df2_zip = df2[df2['RegionType'].astype(str).str.lower() == 'zip'].copy()
else:
    df2_zip = df2.copy()

print("Filtered df2 to zip rows:", df2_zip.shape)

Filtered df2 to zip rows: (26314, 316)


In [None]:
# Step 3 — standardize ZIP strings in both dataframes

In [None]:
# For df1: create a standardized 'zip' column from 'Zip Code'
if 'Zip Code' in df1.columns:
    df1['Zip_raw'] = df1['Zip Code'].astype(str).str.strip().fillna('')
else:
    # fallback: take first column name that contains 'zip'
    zip_col = [c for c in df1.columns if 'zip' in c.lower()]
    df1['Zip_raw'] = df1[zip_col[0]].astype(str).str.strip() if zip_col else ''

df1['zip'] = df1['Zip_raw'].str.extract(r'(\d+)', expand=False).fillna('').str.zfill(5)

# For df2_zip: extract numeric from RegionName -> zip
if 'RegionName' in df2_zip.columns:
    df2_zip['RegionName_raw'] = df2_zip['RegionName'].astype(str).str.strip().fillna('')
    df2_zip['zip'] = df2_zip['RegionName_raw'].str.extract(r'(\d+)', expand=False).fillna('').str.zfill(5)
else:
    # fallback: find any column that looks like RegionName
    cand = [c for c in df2_zip.columns if 'region' in c.lower() or 'zip' in c.lower()]
    if cand:
        df2_zip['RegionName_raw'] = df2_zip[cand[0]].astype(str).str.strip().fillna('')
        df2_zip['zip'] = df2_zip['RegionName_raw'].str.extract(r'(\d+)', expand=False).fillna('').str.zfill(5)
    else:
        df2_zip['RegionName_raw'] = ''
        df2_zip['zip'] = ''

print("Unique zips in df1:", df1['zip'].nunique())
print("Unique zips in df2_zip:", df2_zip['zip'].nunique())

Unique zips in df1: 3581
Unique zips in df2_zip: 26314
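
One detail of the cell above is worth noting: `astype(str)` runs before `fillna('')`, and an empty string passed through `zfill(5)` becomes `'00000'`, so records without any digits in the ZIP field would all share that placeholder key. A minimal sketch of a variant that leaves such values missing is shown below. `standardize_zip` is a hypothetical helper, not a function from the notebook, and the sample values are invented.

```python
import pandas as pd

def standardize_zip(series: pd.Series) -> pd.Series:
    """Return 5-digit ZIP strings; values without digits stay missing."""
    text = series.astype(str).str.strip()
    # First run of digits, so ZIP+4 values like '12553-1234' keep the 5-digit part.
    digits = text.str.extract(r"(\d+)", expand=False)
    # Missing values pass through the .str methods unchanged.
    return digits.str.zfill(5).str[:5]

raw = pd.Series(["1001", "12553-1234", " 00968 ", None, "n/a", 2138])
zips = standardize_zip(raw)
print(zips.tolist())
```

Keeping missing ZIPs as missing makes them easy to count with `isna()` and keeps them out of accidental matches on a shared placeholder value.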


In [None]:
# Step 4 — find the date columns in Dataset 2 and melt it to long format

In [None]:
# find columns matching dd-mm-yyyy (e.g. '31-07-2025')
date_pattern = re.compile(r'^\d{2}-\d{2}-\d{4}$')
date_cols = [c for c in df2_zip.columns if date_pattern.match(c)]

print("Found date columns count:", len(date_cols))
print("Sample date columns:", date_cols[:6])

# melt wide->long: id_vars are everything except date columns
id_vars = [c for c in df2_zip.columns if c not in date_cols]
df2_long = df2_zip.melt(id_vars=id_vars, value_vars=date_cols,
                        var_name='date_str', value_name='price_raw')

# parse date and numeric price
df2_long['date'] = pd.to_datetime(df2_long['date_str'], format='%d-%m-%Y', errors='coerce')
df2_long['price'] = pd.to_numeric(df2_long['price_raw'], errors='coerce')

# drop invalid date rows (if any)
df2_long = df2_long.dropna(subset=['date']).copy()

print("df2_long rows:", df2_long.shape)

Found date columns count: 307
Sample date columns: ['31-01-2000', '29-02-2000', '31-03-2000', '30-04-2000', '31-05-2000', '30-06-2000']
df2_long rows: (8078398, 15)


In [None]:
# Step 5 — compute the latest non-null price and its date per ZIP

In [None]:
# keep only rows where price is not null, then take the last (latest date) per zip
df2_long_nonnull = df2_long.dropna(subset=['price']).copy()

# if there are no non-null prices we'll get an empty df
if df2_long_nonnull.empty:
    print("Warning: dataset2 has no non-null price values.")
    latest_per_zip = pd.DataFrame(columns=['zip','price_latest','price_latest_date'])
else:
    latest_per_zip = (df2_long_nonnull.sort_values(['zip','date'])
                                  .groupby('zip', as_index=False)
                                  .last()[['zip','price','date']])
    latest_per_zip = latest_per_zip.rename(columns={'price':'price_latest','date':'price_latest_date'})

print("Zips with latest price available:", len(latest_per_zip))
latest_per_zip.head()

Zips with latest price available: 26314


    zip  price_latest price_latest_date
0  1001   340459.1165        2025-07-31
1  1002   538321.8056        2025-07-31
2  1005   409267.9918        2025-07-31
3  1007   467503.3545        2025-07-31
4  1008   372725.8783        2025-07-31
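
Steps 4 and 5 together (wide to long, then the latest non-null price per ZIP) can be reproduced on a toy table. The prices below are invented, and the sketch uses `drop_duplicates(keep="last")` after sorting as one possible alternative to the `groupby(...).last()` call in the notebook.

```python
import numpy as np
import pandas as pd

# Toy wide table: one row per ZIP, one column per month-end date (dd-mm-yyyy).
wide = pd.DataFrame({
    "zip": ["01001", "01002"],
    "31-05-2025": [100.0, 200.0],
    "30-06-2025": [110.0, np.nan],
    "31-07-2025": [np.nan, np.nan],
})
date_cols = [c for c in wide.columns if c != "zip"]

# Wide -> long: one row per (zip, date).
long_df = wide.melt(id_vars="zip", value_vars=date_cols,
                    var_name="date_str", value_name="price")
long_df["date"] = pd.to_datetime(long_df["date_str"], format="%d-%m-%Y")

# Latest non-null observation per ZIP: drop nulls, sort, keep the last row per key.
latest = (
    long_df.dropna(subset=["price"])
           .sort_values(["zip", "date"])
           .drop_duplicates(subset="zip", keep="last")
           .rename(columns={"price": "price_latest", "date": "price_latest_date"})
           .loc[:, ["zip", "price_latest", "price_latest_date"]]
           .reset_index(drop=True)
)
print(latest)
```

Because null prices are dropped first, a ZIP whose most recent months are empty still gets its last observed price together with the date of that observation.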


In [None]:
# Step 6 — merge latest-per-zip info back with meta (City/State) and attach to Dataset1

In [None]:
# pick some metadata from df2_zip (if present)
meta_cols = []
for c in ['RegionID','City','State','CountyName']:
    if c in df2_zip.columns:
        meta_cols.append(c)
meta_cols = list(dict.fromkeys(meta_cols))  # keep order & unique

zip_meta = df2_zip[meta_cols + ['zip']].drop_duplicates(subset=['zip'], keep='first')

# create zip_info: meta + latest price
zip_info = zip_meta.merge(latest_per_zip, on='zip', how='left')

# left-merge into df1 (many assets per zip -> one zip_info row)
merged = df1.merge(zip_info, on='zip', how='left', validate='m:1')

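# initial source flag: exact_zip if we have a price_latest, otherwise missing
# (None instead of np.nan: mixing a string with np.nan in np.where can coerce the missing value to the string 'nan')
merged['source'] = np.where(merged['price_latest'].notna(), 'exact_zip', None)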

print("Merged rows:", merged.shape)
print("Fraction with price_latest:", merged['price_latest'].notna().mean())

Merged rows: (8652, 27)
Fraction with price_latest: 0.9288025889967637


In [None]:
# Step 7 — inspect assets that don’t have price_latest

In [None]:
missing = merged[merged['price_latest'].isna()].copy()
print("Number of assets missing price_latest:", len(missing))
# show unique missing zips and their df2_long summary
missing_zips = missing['zip'].unique().tolist()[:50]
print("Sample missing zips (up to 50):", missing_zips)

# For diagnostics: show df2_long info for those zips
diag = (df2_long[df2_long['zip'].isin(missing_zips)]
        .groupby('zip').agg(n_obs=('price','size'),
                            n_non_null=('price', lambda s: s.notna().sum()),
                            first_date=('date','min'),
                            last_date=('date','max')).reset_index())
print(diag.head(30))

Number of assets missing price_latest: 616
Sample missing zips (up to 50): ['20993', '47907', '85620', '80225', '00968', '70803', '59482', '78567', '20192', '59256', '88029', '83853', '24011', '96799', '99752', '59542', '92283', '05460', '76155', '08608', '97204', '59411', '04491', '10278', '52801', '20373', '99780', '58329', '00716', '00708', '20585', '00820', '04936', '00641', '20503', '00784', '85633', '10038', '79711', '14604', '77010', '63145', '87026', '20201', '79839', '47405', '96950', '00918', '78235', '00824']
Empty DataFrame
Columns: [zip, n_obs, n_non_null, first_date, last_date]
Index: []
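
The empty diagnostic frame above means the sampled ZIPs have no rows at all in `df2_long`, which suggests they are absent from Dataset 2 rather than present with all-null prices. A minimal sketch of a direct check using a merge indicator follows; the frames are toy data (two of the ZIPs are borrowed from the sample above, the location codes are invented).

```python
import pandas as pd

# Toy stand-ins for the asset table and the per-ZIP lookup table.
assets = pd.DataFrame({"Location Code": ["A1", "A2", "A3"],
                       "zip": ["01001", "20993", "00968"]})
zip_info = pd.DataFrame({"zip": ["01001", "01002"],
                         "price_latest": [340459.1, 538321.8]})

# indicator=True adds a _merge column telling where each key was found.
check = assets.merge(zip_info[["zip"]].drop_duplicates(),
                     on="zip", how="left", indicator=True)

# 'left_only' rows are asset ZIPs with no counterpart in the lookup table.
absent = check.loc[check["_merge"] == "left_only", "zip"].tolist()
print(absent)
```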


In [None]:
# Step 8 — save the merged file (with source) so you have a checkpoint

In [None]:
merged.to_csv('dataset1_merged_step.csv', index=False)
print("Saved merged file to dataset1_merged_step.csv")

Saved merged file to dataset1_merged_step.csv


In [None]:
# STEP A — quick read/setup (if not done already)

In [None]:
import pandas as pd, numpy as np, re
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler
from math import radians, cos, sin, asin, sqrt

merged = pd.read_csv('dataset1_merged_step.csv')  # file from earlier step

In [None]:
# Check the State columns:

In [None]:
# Quick inspect (see what's filled)

In [None]:
# show counts and a sample
print("Non-null counts:")
print("State_x:", merged['State_x'].notna().sum(), "/", len(merged))
print("State_y:", merged['State_y'].notna().sum(), "/", len(merged))

print("\nSample pairs (first 20 rows):")
print(merged[['State_x','State_y']].head(20))

Non-null counts:
State_x: 8652 / 8652
State_y: 8036 / 8652

Sample pairs (first 20 rows):
   State_x State_y
0       GA      GA
1       WI      WI
2       MN      MN
3       MD      MD
4       CO      CO
5       CO      CO
6       FL      FL
7       AZ      AZ
8       FL      FL
9       MD     NaN
10      CA      CA
11      CA      CA
12      IN     NaN
13      AZ      AZ
14      PA      PA
15      UT      UT
16      FL      FL
17      MD      MD
18      PA      PA
19      AZ      AZ


In [None]:
# Block 1 — Create unified State and state_source

In [None]:
# run this first (assumes merged DataFrame exists)
import pandas as pd, numpy as np

# prefer State_x, else State_y
merged['State_unified_raw'] = merged['State_x'].where(
    merged['State_x'].notna() & (merged['State_x'].astype(str).str.strip()!=''),
    merged['State_y']
)

# normalize whitespace and uppercase, convert empty/'nan' strings to None
merged['State_unified_raw'] = merged['State_unified_raw'].astype(str).str.strip().replace({'': None, 'nan': None, 'None': None})
merged['State_unified_raw'] = merged['State_unified_raw'].where(merged['State_unified_raw'].notna(), None)
merged['State_unified_raw'] = merged['State_unified_raw'].apply(lambda x: x.upper() if pd.notna(x) else x)

# simple full-name -> abbrev mapping (extend if needed)
state_map = {
    'ALABAMA':'AL','ALASKA':'AK','ARIZONA':'AZ','ARKANSAS':'AR','CALIFORNIA':'CA',
    'COLORADO':'CO','CONNECTICUT':'CT','DELAWARE':'DE','FLORIDA':'FL','GEORGIA':'GA',
    'HAWAII':'HI','IDAHO':'ID','ILLINOIS':'IL','INDIANA':'IN','IOWA':'IA','KANSAS':'KS',
    'KENTUCKY':'KY','LOUISIANA':'LA','MAINE':'ME','MARYLAND':'MD','MASSACHUSETTS':'MA',
    'MICHIGAN':'MI','MINNESOTA':'MN','MISSISSIPPI':'MS','MISSOURI':'MO','MONTANA':'MT',
    'NEBRASKA':'NE','NEVADA':'NV','NEW HAMPSHIRE':'NH','NEW JERSEY':'NJ','NEW MEXICO':'NM',
    'NEW YORK':'NY','NORTH CAROLINA':'NC','NORTH DAKOTA':'ND','OHIO':'OH','OKLAHOMA':'OK',
    'OREGON':'OR','PENNSYLVANIA':'PA','RHODE ISLAND':'RI','SOUTH CAROLINA':'SC',
    'SOUTH DAKOTA':'SD','TENNESSEE':'TN','TEXAS':'TX','UTAH':'UT','VERMONT':'VT',
    'VIRGINIA':'VA','WASHINGTON':'WA','WEST VIRGINIA':'WV','WISCONSIN':'WI','WYOMING':'WY',
    'DISTRICT OF COLUMBIA':'DC', 'DC':'DC'
}

def to_state_abbrev(s):
    if s is None: return None
    s_up = str(s).strip().upper()
    if len(s_up) == 2 and s_up.isalpha(): 
        return s_up
    return state_map.get(s_up, s_up)  # fallback keep as-is for manual inspection

merged['State'] = merged['State_unified_raw'].apply(to_state_abbrev)

# mark source
def state_source(row):
    sx = row.get('State_x')
    sy = row.get('State_y')
    if pd.notna(sx) and str(sx).strip()!='' and pd.notna(sy) and str(sy).strip()!='':
        if str(sx).strip().upper() == str(sy).strip().upper():
            return 'both_same'
        else:
            return 'both_conflict'
    if pd.notna(sx) and str(sx).strip()!='':
        return 'state_x'
    if pd.notna(sy) and str(sy).strip()!='':
        return 'state_y'
    return 'none'

merged['state_source'] = merged.apply(state_source, axis=1)

# Quick check
print("Unified State non-null:", merged['State'].notna().sum(), "/", len(merged))
print("state_source counts:\n", merged['state_source'].value_counts())
print("Sample unified states:", merged['State'].unique()[:20])

Unified State non-null: 8652 / 8652
state_source counts:
 state_source
both_same        8035
state_x           616
both_conflict       1
Name: count, dtype: int64
Sample unified states: ['GA' 'WI' 'MN' 'MD' 'CO' 'FL' 'AZ' 'CA' 'IN' 'PA' 'UT' 'TX' 'MO' 'NM'
 'OR' 'VA' 'WY' 'LA' 'KY' 'MI']
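
The row-wise `state_source` function above can also be expressed with vectorized comparisons. The sketch below is an alternative formulation on toy data, assuming the same labels and the same "prefer `State_x`" rule; `clean` is a hypothetical helper introduced for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "State_x": ["GA", "MD", "NJ", None, " "],
    "State_y": ["ga", None, "NY", "TX", None],
})

def clean(col: pd.Series) -> pd.Series:
    """Upper-case and strip; missing values become empty strings."""
    return col.fillna("").astype(str).str.strip().str.upper()

sx, sy = clean(df["State_x"]), clean(df["State_y"])
has_x, has_y = sx != "", sy != ""

# Conditions are evaluated in order; the first match wins.
conditions = [has_x & has_y & (sx == sy), has_x & has_y, has_x, has_y]
labels = ["both_same", "both_conflict", "state_x", "state_y"]
df["state_source"] = np.select(conditions, labels, default="none")

# Prefer State_x, fall back to State_y, leave missing when both are blank.
unified = sx.where(has_x, sy)
df["State"] = unified.where(unified != "")
print(df)
```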


In [None]:
# Inspect the one conflict and resolve it (quick)

In [None]:
# show conflicting rows where both exist but are different
conflicts = merged[(merged['State_x'].notna()) & (merged['State_y'].notna()) &
                   (merged['State_x'].astype(str).str.strip().str.upper() != merged['State_y'].astype(str).str.strip().str.upper())]

print("Number of conflicts:", len(conflicts))
conflicts[['Location Code','State_x','State_y','City_x','City_y','zip','Latitude','Longitude']].head(20)

Number of conflicts: 1


     Location Code State_x State_y  City_x       City_y    zip   Latitude  Longitude
1497        NJ0148      NJ      NY  NEWARK  New Windsor  12553  40.728366 -74.174679


In [None]:
# Solving the conflict

In [None]:
# 1) Install & import (only if needed)

In [None]:
import pgeocode
import pandas as pd

In [None]:
# Use ZIP-to-state lookup to get authoritative state code

In [None]:
# create pgeocode nominatim object for US
nomi = pgeocode.Nominatim('us')

# function to lookup 2-letter state by zip (returns None if not found)
def zip_to_state_abbrev(zipcode):
    try:
        info = nomi.query_postal_code(str(zipcode).zfill(5))
        # pgeocode returns NaN for missing fields; state_code property is abbreviation if present
        st = info['state_code']
        if pd.isna(st):
            return None
        return str(st).strip().upper()
    except Exception:  # any lookup failure is treated as "not found"
        return None

# test on the conflict zip (12553)
print("ZIP 12553 ->", zip_to_state_abbrev('12553'))  # expected 'NY'

ZIP 12553 -> NY


In [None]:
# Resolve all conflicts using the ZIP lookup and record provenance

In [None]:
# Apply zip lookup for all conflicted rows and update the merged DataFrame
for idx, row in conflicts.iterrows():
    z = row['zip']
    authoritative_state = zip_to_state_abbrev(z)
    if authoritative_state:
        # keep original values for audit
        merged.at[idx, 'State_before_resolution'] = merged.at[idx, 'State']  # current unified state
        merged.at[idx, 'State_resolved_by_zip'] = authoritative_state
        # update unified State to authoritative zip result
        merged.at[idx, 'State'] = authoritative_state
        merged.at[idx, 'state_source'] = 'zip_lookup_override'
    else:
        # if the lookup failed, leave current State as-is but flag for manual review
        merged.at[idx, 'State_resolve_note'] = 'zip_lookup_failed_manual_review'
        merged.at[idx, 'state_source'] = 'conflict_needs_manual'

In [None]:
# Audit the change (show rows that were modified)

In [None]:
# show those rows you just modified
resolved = merged[merged['state_source']=='zip_lookup_override']
print("Resolved rows via ZIP lookup:", len(resolved))
print(resolved[['Location Code','zip','State_x','State_y','State_before_resolution','State_resolved_by_zip','State','state_source']])

Resolved rows via ZIP lookup: 1
     Location Code    zip State_x State_y State_before_resolution  \
1497        NJ0148  12553      NJ      NY                      NJ   

     State_resolved_by_zip State         state_source  
1497                    NY    NY  zip_lookup_override  


In [None]:
# Save an audit log

In [None]:
# Save the conflict-resolution audit for traceability
audit = merged.loc[conflicts.index, ['Location Code','zip','State_x','State_y','State_before_resolution','State_resolved_by_zip','State','state_source']]
audit.to_csv('state_resolution_audit.csv', index=False)
print("Saved state resolution audit to state_resolution_audit.csv")

Saved state resolution audit to state_resolution_audit.csv


In [None]:
# Second pass: rebuild the unified state from the saved checkpoint

In [None]:
import pandas as pd

# Load the merged dataset (replace with your merged CSV file name)
df = pd.read_csv("dataset1_merged_step.csv")

print("Initial shape:", df.shape)
print("Columns available:", df.columns)


Initial shape: (8652, 27)
Columns available: Index(['Location Code', 'Real Property Asset Name', 'Installation Name',
       'Owned or Leased', 'GSA Region', 'Street Address', 'City_x', 'State_x',
       'Zip Code', 'Latitude', 'Longitude', 'Building Rentable Square Feet',
       'Available Square Feet', 'Construction Date', 'Congressional District',
       'Congressional District Representative Name', 'Building Status',
       'Real Property Asset Type', 'Zip_raw', 'zip', 'RegionID', 'City_y',
       'State_y', 'CountyName', 'price_latest', 'price_latest_date', 'source'],
      dtype='object')


In [None]:
def resolve_state(row):
    # If both match → use either
    if pd.notna(row['State_x']) and pd.notna(row['State_y']):
        if row['State_x'] == row['State_y']:
            return row['State_x'], 'both_same'
        else:
            return row['State_x'], 'both_conflict'
    # If only State_x exists
    elif pd.notna(row['State_x']):
        return row['State_x'], 'state_x'
    # If only State_y exists
    elif pd.notna(row['State_y']):
        return row['State_y'], 'state_y'
    # If both are null
    else:
        return None, 'missing'

df[['Unified_State', 'state_source']] = df.apply(resolve_state, axis=1, result_type="expand")


In [None]:
conflicts = df[df['state_source'] == 'both_conflict']
print(f"Number of conflicts: {len(conflicts)}")

if len(conflicts) > 0:
    print(conflicts[['Location Code','State_x','State_y','City_x','City_y','zip']])


Number of conflicts: 1
     Location Code State_x State_y  City_x       City_y    zip
1497        NJ0148      NJ      NY  NEWARK  New Windsor  12553


In [None]:
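# For NJ0148, set Unified_State to NJ: State_x, City_x (NEWARK), and the asset coordinates
# (40.728366, -74.174679) agree on Newark, NJ, while only the ZIP-derived State_y says NY.
# This differs from the ZIP-lookup pass earlier in the notebook, which chose NY.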
df.loc[df['Location Code'] == 'NJ0148', 'Unified_State'] = 'NJ'
df.loc[df['Location Code'] == 'NJ0148', 'state_source'] = 'resolved_conflict'


In [None]:
df.drop(columns=['State_x', 'State_y'], inplace=True)


In [None]:
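# Note: this writes the full dataset to the same file name used earlier for the one-row state audit, replacing it.
df.to_csv("state_resolution_audit.csv", index=False)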
print("✅ Clean dataset saved as: state_resolution_audit.csv")


✅ Clean dataset saved as: state_resolution_audit.csv


In [None]:
# unifying the cities

In [None]:
# Step 1 — Load the cleaned dataset

In [None]:
# Load the cleaned dataset from the previous step
df = pd.read_csv("state_resolution_audit.csv")

print("Shape:", df.shape)
print("Columns:", df.columns)

Shape: (8652, 27)
Columns: Index(['Location Code', 'Real Property Asset Name', 'Installation Name',
       'Owned or Leased', 'GSA Region', 'Street Address', 'City_x', 'Zip Code',
       'Latitude', 'Longitude', 'Building Rentable Square Feet',
       'Available Square Feet', 'Construction Date', 'Congressional District',
       'Congressional District Representative Name', 'Building Status',
       'Real Property Asset Type', 'Zip_raw', 'zip', 'RegionID', 'City_y',
       'CountyName', 'price_latest', 'price_latest_date', 'source',
       'Unified_State', 'state_source'],
      dtype='object')


In [None]:
# Step 2 — Create Unified_City column

In [None]:
# We’ll define rules for merging City_x and City_y.

In [None]:
def resolve_city(row):
    city_x = str(row['City_x']).strip() if pd.notna(row['City_x']) else None
    city_y = str(row['City_y']).strip() if pd.notna(row['City_y']) else None

    # If both are present and same → use either
    if city_x and city_y:
        if city_x.lower() == city_y.lower():  # case-insensitive match
            return city_x, 'both_same'
        else:
            return city_x, 'both_conflict'  # temporarily mark conflict

    # If only city_x exists
    elif city_x:
        return city_x, 'city_x'

    # If only city_y exists
    elif city_y:
        return city_y, 'city_y'

    # If both are missing
    else:
        return None, 'missing'

# Apply function to dataframe
df[['Unified_City', 'city_source']] = df.apply(resolve_city, axis=1, result_type="expand")

In [None]:
# Step 3 — Check conflicts

In [None]:
conflicts = df[df['city_source'] == 'both_conflict']
print(f"Number of city conflicts: {len(conflicts)}")

# Show a sample of mismatched cities
print(conflicts[['Location Code', 'City_x', 'City_y', 'zip']].head(20))

Number of city conflicts: 660
    Location Code                 City_x              City_y    zip
26         NM0576               PLACITAS        Santa Teresa  88008
48         MI2137               MUSKEGON      Roosevelt Park  49441
63         NY7402               BROOKLYN            New York  11201
75         AR0062     HOT SPGS NATL PARK         Hot Springs  71901
87         IL2520             BELLEVILLE              Shiloh  62221
91         NC2314        WASHINGTON PARK          Washington  27889
98         IL2491            SPRINGFIELD        Leland Grove  62704
101        MO0617              ST. LOUIS         Saint Louis  63120
128        MO0610              ST. LOUIS         Saint Louis  63120
157        GA1158                ATLANTA            Chamblee  30341
168        MI2187  CHESTERFIELD TOWNSHIP        Chesterfield  48051
175        FL3183                  MIAMI       The Crossings  33186
184        OH2418              BEACHWOOD      Shaker Heights  44122
186        NJ5116 

In [None]:
# Most of the city conflicts are not true conflicts but spelling, style, or abbreviation differences, or a neighborhood name versus a city name.

# We'll handle these with the following stepwise approach:

In [None]:
# Step 1 — Use ZIP Codes to Auto-Resolve Cities

In [None]:
# Load your dataset
df = pd.read_csv("state_resolution_audit.csv")

# If city_y is missing, first fill it with city_x
df["City_y"] = df["City_y"].fillna(df["City_x"])

# Create a ZIP → most common city mapping
zip_city_map = (
    df.groupby("zip")["City_y"]
    .agg(lambda x: x.mode().iloc[0] if not x.mode().empty else x.iloc[0])
    .to_dict()
)

# Apply this mapping to create a unified city column
df["Unified_City"] = df["zip"].map(zip_city_map)

# Check how many are still conflicting
df["city_conflict"] = (df["City_x"].str.upper() != df["Unified_City"].str.upper())
print("Remaining conflicts after ZIP-based resolution:", df["city_conflict"].sum())

Remaining conflicts after ZIP-based resolution: 688


In [None]:
# Step 2 — Fuzzy Matching for Close Spellings

In [None]:
from fuzzywuzzy import fuzz

def pick_best_city(row):
    city_x = str(row["City_x"]).upper()
    unified = str(row["Unified_City"]).upper()
    score = fuzz.ratio(city_x, unified)
    return unified if score >= 85 else city_x

df["Final_City"] = df.apply(pick_best_city, axis=1)



In [None]:
# Step 3 — Save Final Cleaned Dataset

In [None]:
df.drop(columns=["city_conflict"], inplace=True, errors="ignore")
df.to_csv("city_resolution_audit.csv", index=False)
print("✅ City conflicts resolved and saved to city_resolution_audit.csv")

✅ City conflicts resolved and saved to city_resolution_audit.csv


In [None]:
# Step 4 — Recheck Conflicts

In [None]:
conflicts = df[df["City_x"].str.upper() != df["Final_City"].str.upper()]
print("Final remaining conflicts:", len(conflicts))
print(conflicts[["Location Code", "City_x", "City_y", "Final_City", "zip"]].head(38))

Final remaining conflicts: 38
     Location Code              City_x                City_y  \
348         NC2615       WINSTON SALEM         Winston-Salem   
384         NC0113       WINSTON SALEM         Winston-Salem   
462         OH2481        BEAVER CREEK           Beavercreek   
1017        AR0066  HELENA-WEST HELENA  Helena - West Helena   
1716        HI8562              LIHU'E                 Lihue   
1886        MI3087    ST. CLAIR SHORES    Saint Clair Shores   
2066        HI7553              LIHU'E                 Lihue   
2160        MI3033     SAULT STE MARIE    Sault Sainte Marie   
2216        SC1374         MT PLEASANT        Mount Pleasant   
2244        NC1383       WINSTON SALEM         Winston-Salem   
2304        HI7793              KEA'AU                 Keaau   
2535        PA0627        WILKES BARRE          Wilkes-Barre   
2588        PA0921        WILKES BARRE          Wilkes-Barre   
2716        OH1816     ST. CLAIRSVILLE     Saint Clairsville   
2995      
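
Most of the remaining pairs above differ only in punctuation, spacing, or an abbreviation. A minimal sketch of a canonicalization step that would treat such pairs as equal is shown below. It uses the standard library's `difflib` as a stand-in for `fuzzywuzzy`; the abbreviation map, the helper names, and the 0.9 threshold are all assumptions for illustration.

```python
import re
from difflib import SequenceMatcher

# Illustrative abbreviation map; extend it after reviewing the audit file.
ABBREVIATIONS = {"ST": "SAINT", "STE": "SAINTE", "MT": "MOUNT", "FT": "FORT"}

def canonical_city(name: str) -> str:
    """Upper-case, drop punctuation, and expand a few common abbreviations."""
    text = name.upper().replace("'", "")
    text = re.sub(r"[^A-Z0-9 ]", " ", text)  # hyphens, periods, etc. become spaces
    tokens = [ABBREVIATIONS.get(tok, tok) for tok in text.split()]
    return " ".join(tokens)

def same_city(a: str, b: str, threshold: float = 0.9) -> bool:
    """True when two names match after canonicalization or are near-identical."""
    ca, cb = canonical_city(a), canonical_city(b)
    if ca.replace(" ", "") == cb.replace(" ", ""):  # ignores spacing differences
        return True
    return SequenceMatcher(None, ca, cb).ratio() >= threshold

pairs = [
    ("WINSTON SALEM", "Winston-Salem"),
    ("BEAVER CREEK", "Beavercreek"),
    ("ST. CLAIR SHORES", "Saint Clair Shores"),
    ("MT PLEASANT", "Mount Pleasant"),
    ("NEWARK", "New Windsor"),
]
for a, b in pairs:
    print(a, "|", b, "|", same_city(a, b))
```

Pairs that still fail after this step, such as a neighborhood name versus its parent city, are genuine differences and need a rule or manual review rather than string matching.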

In [None]:
# Resolved

In [None]:
# Step 1 — Load the cleaned merged dataset

In [None]:
# Load the dataset after state conflict resolution
df = pd.read_csv("conflict_resolved.csv")

# Check missing price_latest values
missing_price = df['price_latest'].isna().sum()
print(f"Missing price_latest values: {missing_price}")

Missing price_latest values: 616


In [None]:
# nearest region

In [None]:
# 0) Configuration + imports

In [None]:
# config & imports
import pandas as pd, numpy as np, math, re
from pathlib import Path

# for nearest-neighbor (fast, uses haversine)
from sklearn.neighbors import BallTree

# zip -> lat/lon lookup
import pgeocode

# optional for geocoding by address (if needed)
# from geopy.geocoders import Nominatim

# constants / parameters you can tune
ASSETS_PATH = Path("conflict_resolved.csv")
DS2_PATH    = Path("rwap25_gis_dataset2.csv")
OUTPUT_PATH = Path("assets_nearest_regions.csv")

K_NEIGHBORS = 5      # how many nearby regions to fetch per asset
MAX_DISTANCE_KM = 200  # optional cap for acceptable nearest region distance
N_MONTHS_FOR_AVG = 3   # if you want to average last N months for smoothing
DISTANCE_DECAY_K = 0.05 # example decay constant for exp(-k*dist) weighting (optional)

# create pgeocode nomi
nomi = pgeocode.Nominatim('us')
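
The comment on `DISTANCE_DECAY_K` describes an `exp(-k*dist)` weighting. How the notebook applies it is not shown in this section, so the following is only one plausible reading: a weighted average of neighbor prices that ignores neighbors beyond `MAX_DISTANCE_KM`. The function name `decay_weighted_price` is hypothetical.

```python
import numpy as np

MAX_DISTANCE_KM = 200     # same values as the configuration above
DISTANCE_DECAY_K = 0.05

def decay_weighted_price(prices, distances_km,
                         k=DISTANCE_DECAY_K, max_km=MAX_DISTANCE_KM):
    """Average neighbor prices with weights exp(-k * distance).

    Neighbors with a missing price or beyond max_km are ignored;
    returns NaN when no neighbor is usable.
    """
    prices = np.asarray(prices, dtype=float)
    distances_km = np.asarray(distances_km, dtype=float)
    usable = np.isfinite(prices) & (distances_km <= max_km)
    if not usable.any():
        return float("nan")
    weights = np.exp(-k * distances_km[usable])
    return float(np.sum(weights * prices[usable]) / np.sum(weights))

# Two neighbors at 5 km and 50 km: the closer one dominates the estimate.
estimate = decay_weighted_price([100_000.0, 300_000.0], [5.0, 50.0])
print(round(estimate, 2))
```

With `k = 0.05`, a neighbor 50 km away carries roughly a tenth of the weight of one 5 km away, so the estimate stays close to the nearest region's price.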

In [None]:
# 1) Utility helpers: robust column finder, zip normalizer, distance helpers

In [None]:
def guess_col(df, candidates):
    """Return first column name in df that case-insensitively matches any candidate string."""
    cols_low = {c.lower(): c for c in df.columns}
    for cand in candidates:
        if cand.lower() in cols_low:
            return cols_low[cand.lower()]
    # fallback: try normalized match (remove non-alnum)
    norm_to_col = {re.sub(r'[^a-z0-9]','',c.lower()): c for c in df.columns}
    for cand in candidates:
        key = re.sub(r'[^a-z0-9]','',cand.lower())
        if key in norm_to_col:
            return norm_to_col[key]
    return None

def to_zip5(val):
    if pd.isna(val): return np.nan
    s = str(val).strip()
    s = "".join(ch for ch in s if ch.isdigit())
    if s == "": return np.nan
    return s.zfill(5)[:5]

def haversine_km(lat1, lon1, lat2, lon2):
    # all args arrays or scalars in degrees
    R = 6371.0
    lat1r = np.radians(lat1); lon1r = np.radians(lon1)
    lat2r = np.radians(lat2); lon2r = np.radians(lon2)
    dlat = lat2r - lat1r
    dlon = lon2r - lon1r
    a = np.sin(dlat/2.0)**2 + np.cos(lat1r)*np.cos(lat2r)*np.sin(dlon/2.0)**2
    c = 2 * np.arcsin(np.sqrt(a))
    return R * c
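
A quick sanity check for the distance helper, as a minimal sketch using only the standard library: one degree of latitude along a meridian should come out at R * pi / 180, about 111.19 km for R = 6371 km. `haversine_km_scalar` is a hypothetical scalar restatement of the formula above, not a function from the notebook.

```python
import math

EARTH_RADIUS_KM = 6371.0  # same mean radius the notebook uses


def haversine_km_scalar(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    lat1r, lon1r, lat2r, lon2r = map(math.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2r - lat1r
    dlon = lon2r - lon1r
    a = (math.sin(dlat / 2.0) ** 2
         + math.cos(lat1r) * math.cos(lat2r) * math.sin(dlon / 2.0) ** 2)
    return 2.0 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# one degree of latitude should equal R * pi / 180
one_degree_km = haversine_km_scalar(0.0, 0.0, 1.0, 0.0)
expected_km = EARTH_RADIUS_KM * math.pi / 180.0
print(round(one_degree_km, 2), round(expected_km, 2))
```

The same relation explains the later `dist_rad * 6371.0` step: a haversine distance in radians is an arc length on the unit sphere, so multiplying by the radius gives kilometres.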

In [None]:
# 2) Load datasets and detect relevant columns

In [None]:
# load
assets = pd.read_csv(ASSETS_PATH)
ds2    = pd.read_csv(DS2_PATH)

# find important columns in assets (flexible names)
col_loccode = guess_col(assets, ["Location Code", "location code", "locationcode", "loccode"])
col_zip     = guess_col(assets, ["zip","Zip Code","Zip","zip_code"])
col_lat     = guess_col(assets, ["Latitude","lat","latitude"])
col_lon     = guess_col(assets, ["Longitude","lon","longitude"])
col_city    = guess_col(assets, ["Final_City","final_city","City","city","City_x","City_y"])
col_state   = guess_col(assets, ["Unified_State","unified_state","State","state"])
col_area    = guess_col(assets, ["Building Rentable Square Feet","Building Rentable Sq Ft","bldg_sf"])

print("Found columns in assets:", dict(loccode=col_loccode, zip=col_zip, lat=col_lat, lon=col_lon, city=col_city, state=col_state))

# find columns in ds2
col_regionname = guess_col(ds2, ["RegionName","Region Name","regionname"])
col_regionid   = guess_col(ds2, ["RegionID","Region Id","regionid"])
col_regiontype = guess_col(ds2, ["RegionType","regiontype"])
col_ds2_city   = guess_col(ds2, ["City","city"])
col_ds2_metro  = guess_col(ds2, ["Metro","metro"])
col_ds2_county = guess_col(ds2, ["CountyName","County Name","countyname"])
col_ds2_state  = guess_col(ds2, ["State","StateName","state","statename"])

print("Found columns in ds2:", dict(regionname=col_regionname, regionid=col_regionid, regiontype=col_regiontype, city=col_ds2_city, metro=col_ds2_metro, county=col_ds2_county, state=col_ds2_state))

Found columns in assets: {'loccode': 'location code', 'zip': 'zip', 'lat': 'latitude', 'lon': 'longitude', 'city': 'city', 'state': 'unified_state'}
Found columns in ds2: {'regionname': 'RegionName', 'regionid': 'RegionID', 'regiontype': 'RegionType', 'city': 'City', 'metro': 'Metro', 'county': 'CountyName', 'state': 'State'}


In [None]:
# 3) Normalize ZIPs and get lat/lon for ds2 regions (use pgeocode)

In [None]:
# standardize zip fields
ds2['region_zip'] = ds2[col_regionname].apply(to_zip5)
assets['zip_std'] = assets[col_zip].apply(to_zip5) if col_zip else np.nan

# Try to use any lat/lon already present in ds2 (if columns exist)
col_ds2_lat = guess_col(ds2, ["Latitude","lat","latitude"])
col_ds2_lon = guess_col(ds2, ["Longitude","lon","longitude"])

if col_ds2_lat and col_ds2_lon:
    ds2['region_lat'] = pd.to_numeric(ds2[col_ds2_lat], errors='coerce')
    ds2['region_lon'] = pd.to_numeric(ds2[col_ds2_lon], errors='coerce')
else:
    # use pgeocode to get lat/lon for each unique region_zip
    unique_zips = ds2['region_zip'].dropna().unique().tolist()
    rez = []
    for z in unique_zips:
        # pgeocode returns a pandas Series-like object; handle missing gracefully
        r = nomi.query_postal_code(str(z))
        lat = r.latitude if r is not None and not pd.isna(r.latitude) else np.nan
        lon = r.longitude if r is not None and not pd.isna(r.longitude) else np.nan
        rez.append((z, lat, lon))
    zip_coord_df = pd.DataFrame(rez, columns=['region_zip','region_lat','region_lon'])
    ds2 = ds2.merge(zip_coord_df, on='region_zip', how='left')

# For assets: prefer the asset's own lat/lon; where either value is missing,
# fall back to the ZIP centroid (pgeocode)
assets['asset_lat'] = pd.to_numeric(assets[col_lat], errors='coerce') if col_lat else np.nan
assets['asset_lon'] = pd.to_numeric(assets[col_lon], errors='coerce') if col_lon else np.nan

missing_coords = assets['asset_lat'].isna() | assets['asset_lon'].isna()
if missing_coords.any():
    # look up each distinct ZIP once
    unique_asset_zips = assets.loc[missing_coords, 'zip_std'].dropna().unique().tolist()
    zip_lat, zip_lon = {}, {}
    for z in unique_asset_zips:
        r = nomi.query_postal_code(str(z))
        zip_lat[z] = r.latitude if r is not None and not pd.isna(r.latitude) else np.nan
        zip_lon[z] = r.longitude if r is not None and not pd.isna(r.longitude) else np.nan
    # overwrite both coordinates together so a pair never mixes sources
    assets.loc[missing_coords, 'asset_lat'] = assets.loc[missing_coords, 'zip_std'].map(zip_lat)
    assets.loc[missing_coords, 'asset_lon'] = assets.loc[missing_coords, 'zip_std'].map(zip_lon)

# final checks
print("Regions with coords:", ds2['region_lat'].notna().sum(), "/", len(ds2))
print("Assets with coords:", assets['asset_lat'].notna().sum(), "/", len(assets))

Regions with coords: 26314 / 26314
Assets with coords: 8652 / 8652


In [None]:
# 4) Identify monthly price columns in ds2 and select latest / N-month average

In [None]:
# detect date-like columns (dayfirst format dd-mm-YYYY or similar)
date_cols = []
for c in ds2.columns:
    # try parse, dayfirst True
    dt = pd.to_datetime(c, errors='coerce', dayfirst=True)
    if pd.notna(dt):
        date_cols.append((c, dt))
date_cols = sorted(date_cols, key=lambda x: x[1])
if not date_cols:
    raise RuntimeError("No monthly price columns detected in dataset2. Check headers.")

latest_month_col = date_cols[-1][0]
latest_month_date = date_cols[-1][1].date()
print("Latest month column:", latest_month_col, latest_month_date)

# optional: compute last N months average for smoothing
def compute_last_n_avg(df, n=N_MONTHS_FOR_AVG):
    cols = [c for c,d in date_cols[-n:]]  # last n columns
    return df[cols].mean(axis=1, skipna=True)

ds2['latest_price'] = pd.to_numeric(ds2[latest_month_col], errors='coerce')
# also add N-month avg
ds2['latest_navg'] = compute_last_n_avg(ds2, n=N_MONTHS_FOR_AVG)

Latest month column: 31-07-2025 2025-07-31
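
`pd.to_datetime(..., dayfirst=True)` is permissive and will accept many header shapes. If the monthly columns are known to follow a single layout, a stricter filter is possible. The sketch below assumes the `dd-mm-YYYY` layout seen in the output above; `parse_month_header` and the sample header list are hypothetical and only illustrate the idea.

```python
from datetime import datetime


def parse_month_header(name, fmt="%d-%m-%Y"):
    """Return a date if `name` matches `fmt` exactly, otherwise None."""
    try:
        return datetime.strptime(str(name).strip(), fmt).date()
    except ValueError:
        return None


# illustrative headers only; the real ds2 columns may differ
headers = ["RegionID", "RegionName", "31-05-2025", "31-07-2025", "30-06-2025", "State"]

# keep only headers that parse, then order them chronologically
dated = []
for h in headers:
    d = parse_month_header(h)
    if d is not None:
        dated.append((h, d))
dated.sort(key=lambda pair: pair[1])

latest_col = dated[-1][0]
last_two_cols = [h for h, _ in dated[-2:]]
print(latest_col, last_two_cols)
```

Sorting by the parsed date rather than by header text matters here, because `dd-mm-YYYY` strings do not sort chronologically as plain text.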


In [None]:
# 5) Build region-level (unique) table with coords + price

In [None]:
# build region table (one row per region_zip)
region_cols = ['region_zip', 'region_lat', 'region_lon', 'latest_price', 'latest_navg',
               col_ds2_metro, col_ds2_county, col_ds2_city, col_ds2_state]
# skip optional columns that guess_col did not find (None)
region_cols = [c for c in region_cols if c is not None and c in ds2.columns]
region_df = ds2.drop_duplicates(subset=['region_zip']).copy()
region_df = region_df.loc[:, region_cols]
region_df = region_df.rename(columns={
    col_ds2_metro: "metro",
    col_ds2_county: "county",
    col_ds2_city: "city_ds2",
    col_ds2_state: "state_ds2"
})
# keep only rows that have coordinates (latest_price may still be NaN for some regions)
region_df = region_df[ region_df['region_lat'].notna() & region_df['region_lon'].notna() ]
print("Region candidates:", len(region_df))

Region candidates: 26314


In [None]:
# 6) Build BallTree (haversine) on region centroids and query K neighbors for assets

In [None]:
# prepare arrays in radians
region_coords_rad = np.radians(region_df[['region_lat','region_lon']].values)
tree = BallTree(region_coords_rad, metric='haversine')  # distances in radians

# assets query points (filter for those that have coords)
assets_with_coords = assets[assets['asset_lat'].notna() & assets['asset_lon'].notna()].copy()
assets_coords_rad = np.radians(assets_with_coords[['asset_lat','asset_lon']].values)

# query
dist_rad, idx = tree.query(assets_coords_rad, k=K_NEIGHBORS, return_distance=True)  # distances in radians
dist_km = dist_rad * 6371.0  # convert to km

# Build neighbor result table
neigh_list = []
for i_asset, asset_idx in enumerate(assets_with_coords.index):
    asset_row = assets_with_coords.loc[asset_idx]
    for j in range(idx.shape[1]):
        reg_pos = idx[i_asset, j]
        reg_row = region_df.iloc[reg_pos]
        neigh_list.append({
            'asset_index': asset_idx,
            'location_code': asset_row.get(col_loccode),
            'asset_lat': asset_row['asset_lat'],
            'asset_lon': asset_row['asset_lon'],
            'neighbor_rank': j+1,
            'region_zip': reg_row['region_zip'],
            'region_lat': reg_row['region_lat'],
            'region_lon': reg_row['region_lon'],
            'region_latest_price': reg_row['latest_price'],
            'region_latest_navg': reg_row['latest_navg'],
            'metro': reg_row.get('metro'),
            'county': reg_row.get('county'),
            'city_ds2': reg_row.get('city_ds2'),
            'state_ds2': reg_row.get('state_ds2'),
            'dist_km': float(dist_km[i_asset, j])
        })

neighbors_df = pd.DataFrame(neigh_list)
# save neighbors table
neighbors_df.to_csv("assets_nearest_regions_raw.csv", index=False)
# count by asset_index: location_code may be missing or repeated
print("Saved neighbor table for", neighbors_df['asset_index'].nunique(), "assets.")

Saved neighbor table for 8652 assets.
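
`BallTree` with `metric='haversine'` expects `[lat, lon]` in radians and returns distances in radians, so a swapped column order or a missed conversion fails silently. One way to guard against that is to spot-check a few assets against a brute-force search. The following is a sketch using only the standard library; the coordinates, region labels, and the `brute_force_nearest` helper are made up for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    lat1r, lon1r, lat2r, lon2r = map(math.radians, (lat1, lon1, lat2, lon2))
    a = (math.sin((lat2r - lat1r) / 2.0) ** 2
         + math.cos(lat1r) * math.cos(lat2r) * math.sin((lon2r - lon1r) / 2.0) ** 2)
    return 2.0 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def brute_force_nearest(asset, regions, k=2):
    """Return the k closest (region_key, dist_km) pairs, nearest first."""
    scored = [
        (key, haversine_km(asset[0], asset[1], lat, lon))
        for key, (lat, lon) in regions.items()
    ]
    return sorted(scored, key=lambda pair: pair[1])[:k]


# made-up centroids keyed by a placeholder region label
regions = {"A": (40.0, -75.0), "B": (41.0, -75.0), "C": (40.0, -70.0)}
asset = (40.2, -75.0)  # (lat, lon) in degrees

nearest = brute_force_nearest(asset, regions, k=2)
print([(key, round(d, 1)) for key, d in nearest])
```

For the same asset, the neighbor order and `dist_km` values in `neighbors_df` should agree with a check like this to within floating-point error.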


In [None]:
# 7) Example functions to compute distance-weighted valuation (you can choose one)

In [None]:
def idw_prediction(neighs, price_col='region_latest_price', eps=1e-6):
    """Inverse-distance weighting: weights = 1 / (dist + eps)."""
    vals = []
    for asset, group in neighs.groupby('asset_index'):
        group = group.dropna(subset=[price_col])
        if group.empty:
            vals.append((asset, np.nan))
            continue
        w = 1.0 / (group['dist_km'] + eps)
        pred = (w * group[price_col]).sum() / w.sum()
        vals.append((asset, pred))
    return pd.DataFrame(vals, columns=['asset_index','pred_idw'])

def exp_decay_prediction(neighs, price_col='region_latest_price', k=DISTANCE_DECAY_K):
    """Exponential decay weights: w = exp(-k * dist_km)"""
    vals = []
    for asset, group in neighs.groupby('asset_index'):
        group = group.dropna(subset=[price_col])
        if group.empty:
            vals.append((asset, np.nan))
            continue
        w = np.exp(-k * group['dist_km'])
        pred = (w * group[price_col]).sum() / w.sum()
        vals.append((asset, pred))
    return pd.DataFrame(vals, columns=['asset_index','pred_expdecay'])

# Example usage (commented out; run when ready):
# pred_idw = idw_prediction(neighbors_df, price_col='region_latest_price')
# pred_exp = exp_decay_prediction(neighbors_df, price_col='region_latest_navg')
# then merge preds back into assets_with_coords by index to get predicted prices
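
To make the difference between the two weighting schemes concrete, here is a small worked example on two hypothetical neighbors, written as a sketch with the standard library only. The helper names `idw` and `exp_decay` and the toy prices and distances are assumptions for illustration; the formulas mirror `idw_prediction` and `exp_decay_prediction` above.

```python
import math


def idw(prices, dists_km, eps=1e-6):
    """Inverse-distance weighted mean: w = 1 / (dist + eps)."""
    weights = [1.0 / (d + eps) for d in dists_km]
    return sum(w * p for w, p in zip(weights, prices)) / sum(weights)


def exp_decay(prices, dists_km, k=0.05):
    """Exponentially weighted mean: w = exp(-k * dist)."""
    weights = [math.exp(-k * d) for d in dists_km]
    return sum(w * p for w, p in zip(weights, prices)) / sum(weights)


# toy neighbors: a cheaper region 1 km away and a dearer one 4 km away
prices = [100.0, 200.0]
dists = [1.0, 4.0]

pred_idw = idw(prices, dists)        # weights are roughly 1 and 0.25
pred_exp = exp_decay(prices, dists)  # weights are roughly 0.95 and 0.82
print(round(pred_idw, 2), round(pred_exp, 2))
```

With these numbers IDW lands near 120 while exponential decay with `k=0.05` lands near 146: IDW pulls much harder toward the closest region, whereas a small `k` keeps the weights nearly equal over short distances.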

In [None]:
# 8) Merge neighbor info back to assets and save final mapping (no valuation)

In [None]:
# Step 1: Get only the nearest neighbor (rank=1)
nearest_one = neighbors_df[neighbors_df['neighbor_rank'] == 1].copy()

# Rename region-related columns for clarity
nearest_one = nearest_one.rename(columns={
    'region_zip': 'nearest_region_zip',
    'region_lat': 'nearest_region_lat',
    'region_lon': 'nearest_region_lon',
    'region_latest_price': 'nearest_region_price',
    'region_latest_navg': 'nearest_region_price_navg',
    'metro': 'nearest_region_metro',
    'county': 'nearest_region_county',
    'city_ds2': 'nearest_region_city',
    'state_ds2': 'nearest_region_state',
    'dist_km': 'nearest_dist_km'
})

# Step 2: Ensure we have asset_index in assets_with_coords
assets_with_coords = assets_with_coords.reset_index().rename(columns={'index': 'asset_index'})

# Step 3: Check if nearest_one has asset_index column
if 'asset_index' not in nearest_one.columns:
    # If not, try renaming from the column name in neighbors_df
    if 'asset_id' in nearest_one.columns:
        nearest_one = nearest_one.rename(columns={'asset_id': 'asset_index'})
    elif 'index' in nearest_one.columns:
        nearest_one = nearest_one.rename(columns={'index': 'asset_index'})
    else:
        print("⚠️ 'asset_index' missing in nearest_one, check neighbors_df columns!")
        print("Available columns:", nearest_one.columns)

# Step 4: Merge assets with nearest region info
assets_mapped = assets_with_coords.merge(
    nearest_one[[
        'asset_index', 'nearest_region_zip', 'nearest_region_lat', 'nearest_region_lon',
        'nearest_region_price', 'nearest_region_price_navg', 'nearest_region_metro',
        'nearest_region_county', 'nearest_region_city', 'nearest_region_state',
        'nearest_dist_km'
    ]],
    on='asset_index',  # Use a single 'on' since both have the same column name now
    how='left'
)

# Step 5: Save final mapping (assets with nearest region info)
assets_mapped.to_csv(OUTPUT_PATH, index=False)
print("✅ Saved assets → nearest region mapping to:", OUTPUT_PATH)

✅ Saved assets → nearest region mapping to: assets_nearest_regions.csv
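
`MAX_DISTANCE_KM` is defined in the configuration cell but is not applied in the cells shown here. If a cap is wanted, one option is to flag matches beyond it rather than drop the rows. A minimal sketch with pandas on a toy frame follows; the `within_cap` and `nearest_region_price_capped` column names and the sample values are hypothetical.

```python
import pandas as pd

MAX_DISTANCE_KM = 200  # same value as the configuration cell

# toy stand-in for assets_mapped with the two columns the cap needs
toy = pd.DataFrame({
    "nearest_region_price": [250000.0, 310000.0, 180000.0],
    "nearest_dist_km": [3.2, 198.0, 412.5],
})

# flag matches that fall inside the cap
toy["within_cap"] = toy["nearest_dist_km"] <= MAX_DISTANCE_KM
# blank out the price where the nearest region is beyond the cap
toy["nearest_region_price_capped"] = toy["nearest_region_price"].where(toy["within_cap"])
print(toy)
```

Keeping the flag as a separate column leaves the raw match in place for auditing while letting downstream steps ignore distant matches.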


In [None]:
print("Assets columns:", assets_with_coords.columns)
print("Nearest one columns:", nearest_one.columns)

Assets columns: Index(['asset_index', 'asset_index_original', 'location code',
       'real property asset name', 'installation name', 'owned or leased',
       'gsa region', 'street address', 'zip code', 'latitude', 'longitude',
       'building rentable square feet', 'available square feet',
       'construction date', 'congressional district',
       'congressional district representative name', 'building status',
       'real property asset type', 'zip_raw', 'zip', 'regionid', 'countyname',
       'price_latest', 'price_latest_date', 'source', 'unified_state',
       'state_source', 'city', 'zip_std', 'asset_lat', 'asset_lon'],
      dtype='object')
Nearest one columns: Index(['asset_index', 'location_code', 'asset_lat', 'asset_lon',
       'neighbor_rank', 'nearest_region_zip', 'nearest_region_lat',
       'nearest_region_lon', 'nearest_region_price',
       'nearest_region_price_navg', 'nearest_region_metro',
       'nearest_region_county', 'nearest_region_city', 'nearest_region_

In [None]:
# 9) Optional: Visualization (Folium) — show assets colored by distance or nearest metro

In [None]:
# OPTIONAL: visualize subset to sanity-check mapping (requires folium)
import folium
m = folium.Map(location=[assets_mapped['asset_lat'].mean(), assets_mapped['asset_lon'].mean()], zoom_start=6)
# add nearest region markers
for _, r in assets_mapped[['asset_lat','asset_lon','nearest_region_zip','nearest_region_price','nearest_dist_km']].dropna().head(500).iterrows():
    folium.CircleMarker(location=[r['asset_lat'], r['asset_lon']], radius=3,
                        popup=f"zip:{r['nearest_region_zip']} dist_km:{r['nearest_dist_km']:.1f} price:{r['nearest_region_price']}",
                        fill=True).add_to(m)
m.save("assets_nearest_map.html")
print("Saved interactive map: assets_nearest_map.html")

Saved interactive map: assets_nearest_map.html


In [None]:
# price valuation

In [None]:
import numpy as np

# Let's define the function to calculate price based on distance
def estimate_price(nearest_price, distance_km, decay='inverse', k=0.1):
    """
    Estimate asset price based on nearest region price and distance.
    
    nearest_price: float
    distance_km: float
    decay: 'inverse' | 'exponential' | 'linear'
    k: decay constant (only for exponential or linear)
    """
    # pd.isna also handles None, which np.isnan would reject with a TypeError
    if pd.isna(nearest_price) or pd.isna(distance_km):
        return np.nan
    
    if decay == 'inverse':
        weight = 1 / (1 + distance_km)
    elif decay == 'exponential':
        weight = np.exp(-k * distance_km)
    elif decay == 'linear':
        weight = max(0, 1 - k * distance_km)
    else:
        raise ValueError("Invalid decay type. Choose from 'inverse', 'exponential', or 'linear'")
        
    return nearest_price * weight

# Apply valuation
assets_mapped['estimated_price'] = assets_mapped.apply(
    lambda row: estimate_price(row['nearest_region_price'], row['nearest_dist_km'], decay='inverse'),
    axis=1
)

In [None]:
assets_mapped.to_csv("final_asset_valuation.csv", index=False)
print("✅ Valuation completed and saved.")

✅ Valuation completed and saved.
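
To see what `estimate_price` does to a regional price, it helps to tabulate the weight each decay mode assigns at a few distances. The sketch below restates only the weight formulas from the valuation cell using the standard library; `decay_weight` is a hypothetical name and the distances are arbitrary.

```python
import math


def decay_weight(distance_km, decay="inverse", k=0.1):
    """Multiplier applied to the nearest-region price at a given distance."""
    if decay == "inverse":
        return 1.0 / (1.0 + distance_km)
    if decay == "exponential":
        return math.exp(-k * distance_km)
    if decay == "linear":
        return max(0.0, 1.0 - k * distance_km)
    raise ValueError("decay must be 'inverse', 'exponential', or 'linear'")


# weight per mode at 0, 1 and 9 km from the region centroid
table = {}
for d in (0.0, 1.0, 9.0):
    table[d] = {
        mode: round(decay_weight(d, mode), 3)
        for mode in ("inverse", "exponential", "linear")
    }
    print(d, table[d])
```

With the `decay='inverse'` setting used above, an asset 1 km from its nearest region centroid receives half the region price and one 9 km away receives a tenth, so `estimated_price` is best read as a distance-discounted figure rather than the regional price itself.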
