# Analytical Report | RWAP 2025-26

## 1. Project Information  
**Project Title:** Real-World Asset Valuation & Classification using GIS and Machine Learning  

**Team Members:**  
- 055002 – Abhijeet  
- 055014 – Isha Gupta  
- 055039 – Saloni Gupta  
- 055041 – Sejal Raj 

**Software & Tools Used:**  
- **Programming & Analytics:** Python (NumPy, Pandas, Scikit-learn, Matplotlib)  
- **GIS & Spatial Analysis:** GeoPandas, Folium, Shapely, PySAL  
- **Visualization & Dashboarding:** Streamlit, Folium (for interactive maps), Matplotlib  
- **Execution Environment:** Google Colab, Jupyter Notebook  

---

## 2. Description of Data  

### Dataset 1: Property Assets Dataset  
- **Source:** [Google Drive Link](https://drive.google.com/file/d/1YFTWJNoxu0BF8UlMDXI8bXwRTVQNE2mb/view?usp=drive_link)  
- **Size:** 1.19 MB  
- **Type:** Structured cross-sectional GIS dataset  
- **Dimensions:** 8,653 rows × 18 columns  
- **Variables:**  
  - **Identifiers & Categorical:** Location Code, Asset Name, Installation Name, Ownership (Owned/Leased), GSA Region, Address, City, State, ZIP Code, Congressional District, Representative Name, Building Status, Asset Type  
  - **Numeric:** Rentable Sq. Ft, Available Sq. Ft, Construction Date  
  - **Geospatial:** Latitude, Longitude  

### Dataset 2: Housing Index Dataset  
- **Source:** [Google Drive Link](https://drive.google.com/file/d/1fFT8Q8GWiIEM7kx6czhQ-qabygUPBQRv/view?usp=drive_link)  
- **Size:** 73.3 MB  
- **Type:** Longitudinal GIS dataset  
- **Dimensions:** 26,315 rows × 316 columns  
- **Variables:**  
  - **Categorical/Identifiers:** RegionID, RegionName, RegionType, City, Metro, CountyName, State, StateName, SizeRank  
  - **Numeric:** Housing index values (monthly property valuations from Jan 2000 – Jul 2025)  

### About Datasets  
- Dataset 1 (Assets) provides property-level details requiring valuation and classification.  
- Dataset 2 (Housing Index) serves as a benchmark and training dataset for valuation, containing regional price dynamics.  
- Together, they enable **valuation, clustering, classification, and GIS-based enrichment.**  

---

## 3. Objectives & Problem Statements  

1. **Asset Valuation Engine**  
   - Estimate the **current fair market value** for each asset in Dataset 1.  
   - Combine asset attributes (e.g., rentable/available area, asset type, status) with regional indices from Dataset 2.  
   - Deliver outputs:  
     - Asset value  
     - Value per rentable square foot  
     - Model confidence score  

2. **Geospatial Integration & Enrichment**  
   - Build a **spatial linkage** between Dataset 1 (lat/long, city, ZIP, state) and Dataset 2 (RegionName, RegionType, Metro/County/State).  
   - Use GIS libraries to enrich assets with their nearest or most relevant region.  

3. **Unsupervised Asset Clustering**  
   - Identify natural asset groups using features such as value per sqft, rentable size, construction age, type, and spatial attributes.  
   - Determine optimal cluster counts using metrics like **Silhouette Score** and **Davies-Bouldin Index.**  
   - Label clusters with interpretable business personas (e.g., “High-Value Metro Assets”, “Undervalued Rural Assets”).  

4. **Supervised Classification for Valuation Bands**  
   - Train classification models to predict valuation ranges for new assets.  
   - Evaluate with **Accuracy, Precision, Recall, F1-score, AUC, Calibration.**  
   - Provide interpretability via **feature importance/SHAP analysis.**  

5. **GIS Analytical Dashboard**  
   - Build a **Streamlit + Folium** dashboard for stakeholders to interact with results.  
   - Features:  
     - Interactive maps of assets and clusters  
     - Filters (Region, Asset Type, Status)  
     - Visualization of valuation trends and model predictions  

---

## 4. Analysis of Data  

### 4.1 Data Acquisition & Overview  
- **Assets Dataset:** 8,653 property records containing structured metadata on asset type, size, ownership, and geospatial coordinates.  
- **Housing Index Dataset:** 26,315 records × 316 columns capturing historical residential price indices from Jan 2000 – Jul 2025 across multiple region types (City/County/Metro/State).  

**Highlights:**  
- Robust preprocessing ensured consistency of location-based identifiers.  
- Encoding variations were handled to preserve data integrity.  
- Both datasets complement each other: one provides **raw asset details** while the other offers **regional valuation benchmarks.**  

---


# Task 1: Asset Price Valuation


## Data Preprocessing

The preprocessing pipeline was carried out in multiple stages to prepare and harmonize the raw RWAP datasets.  

## Library Imports and Setup
* Used **pandas** and **numpy** for wrangling.  
* **re** for regex-based cleaning.  
* **fuzzywuzzy** for string similarity.  
* **sklearn** (`NearestNeighbors`, `BallTree`, `StandardScaler`) for modeling and normalization.  
* **pgeocode** for postal code lookups.  
* **folium** for mapping.  
* Indicates a workflow combining **tabular cleaning**, **geospatial enrichment**, and **string-based conflict resolution**.  

## Loading Raw Data
* Two primary datasets: `rwap25_gis_dataset1.csv` and `rwap25_gis_dataset2.csv`.  
* Intermediate audit files reloaded later:  
  * `dataset1_merged_step.csv`  
  * `conflict_resolved.csv`  
  * `state_resolution_audit.csv`  
* Outputs confirmed DataFrame creation with large row counts.  

## Schema Standardization
* Created new harmonized columns:  
  * **Zip_raw** → cleaned into numeric **zip**.  
  * **RegionName_raw**, **Unified_City**, **Final_City** for city harmonization.  
  * **State_unified_raw** and **state_source** to track state values.  
  * **price_latest** → most recent valid price per ZIP.  
* Outputs showed replacement of inconsistent labels with standardized values.  

## Conflict Resolution
* Resolution tables (`state_resolution_audit.csv`)/(`city_resolution_audit.csv`) logged mismatched city/state pairs.  
* Applied **fuzzywuzzy** similarity ratios for city names.  
* Introduced **Final_City** as unified reference.  
* Added **city_conflict** flag to track discrepancies.  

 

---

# Analysis

## Merging of Sources
* Multiple `merge` operations combined asset data with GIS attributes.  
* Outputs showed `City_x`, `City_y`, `State_x`, `State_y` columns → schema overlap.  

## Error Traces in Execution
* Logged **KeyError: 'Unified_State'** during preprocessing.  
* Indicates inconsistent schema across intermediate DataFrames.  

## Groupby Aggregations
* Grouped by **zip** with count, mean, and latest price.  
* Prepared **valuation-ready dataset** with one record per ZIP.  

## Geospatial Enrichment
* `pgeocode.Nominatim` retrieved metadata for postal codes.  
* Added **latitude/longitude** and administrative info.  
* Outputs aligned new attributes with ZIP codes.  

## Audit Trail Creation
* Consistent writing of intermediate files:  
  * `dataset1_merged_step.csv` → after merge.  
  * `conflict_resolved.csv` → conflict cases and decisions.  
  * `state_resolution_audit.csv` → after fuzzy matching.  
* Ensures preprocessing pipeline is **traceable and reproducible**.  

---
# Data Visualization and Analysis

## 1. Price Distribution (Raw)
- The raw distribution of `price_latest` is **right-skewed** with a heavy tail.  
- Majority of the properties are priced between **$100,000 and $500,000**.  
- A small number of extremely high-priced properties (>$1M) create long right tails.  
- This skewness indicates the need for **log transformation** to normalize the distribution before applying models.

---

## 2. Price Distribution (Log-Transformed)
- After applying a **log transformation**, the price distribution becomes nearly **normal (bell-shaped)**.  
- The transformation reduces the impact of outliers and extreme values.  
- This is beneficial for regression modeling, as many algorithms assume input variables are normally distributed.  
- The smoother distribution also suggests better statistical stability for predictive analysis.  

---

## 3. Predicted Price Distribution (Raw Model Output)
- The model-predicted prices also follow a **right-skewed distribution**, similar to the raw prices.  
- Majority of predicted values lie between **$100,000 and $600,000**.  
- The model has successfully captured the central tendency but still shows heavy-tailed predictions for very high-valued properties.  
- This indicates the model may underfit rare luxury or high-end assets.  

---

## 4. Predicted Price Distribution (Log-Transformed)
- The log-transformed predictions align closely with a **normal distribution**.  
- The peak centers around log(price) ≈ 13 (equivalent to ~$450,000).  
- Some clustering is observed around this peak, showing the model tends to predict values close to the mean.  
- This highlights model stability but also suggests limited variance in extreme predictions.  

---

## 5. Feature Correlation Heatmap
- The heatmap reveals correlations among features used in modeling.  
- Key observations:  
  * Strong correlations between **latitude, longitude, and zip code**, as expected.  
  * **Price-related features** (`price_latest_reg`, `price_avg_3m`, `price_avg_6m`, `price_avg_12m`) are highly correlated with each other.  
  * Percentage change features (`pct_chg_1m`, `pct_chg_3m`, `pct_chg_12m`) are moderately correlated with price averages.  
  * Geographic attributes (`nearest_region_lat`, `nearest_region_lon`, `nearest_region_price_avg`) show spatial correlation with property prices.  
  * Minimal correlation between **construction date** and price indicates limited influence of property age on valuation in this dataset.  

---

# Overall Insights from Visualizations
1. **Skewness Normalized:** Both raw and predicted prices are right-skewed but become approximately normal with log transformation.  
2. **Model Learning Central Patterns:** The model captures the bulk of mid-range properties well but struggles with outliers.  
3. **Correlated Price Dynamics:** Historical averages and short-term price changes strongly influence predictions.  
4. **Geospatial Significance:** Location-based variables (zip, latitude, longitude) remain key predictors, reinforcing the importance of geographic context.  


# Observations & Findings

1. **ZIP Harmonization Effective**  
   * Conversion of `Zip_raw` → numeric `zip` worked for merges.  
   * Outputs showed cleaner, numeric-only keys.  

2. **City/State Conflicts Resolved**  
   * `state_resolution_audit.csv` documented mismatched entries.  
   * `Final_City` provided unified reference.  

3. **Most Recent Pricing Isolated**  
   * `price_latest` captured latest valid price per ZIP.  
   * Older duplicates dropped, avoiding bias.  

4. **Redundancy After Merge**  
   * Columns like `City_x/City_y`, `State_x/State_y` created duplication.  
   * Requires cleanup post-merge.  

5. **Pipeline Robustness Incomplete**  
   * `KeyError: 'Unified_State'` shows schema inconsistency.  
   * Stricter column checks needed.  

6. **Data Loss vs. Quality Trade-off**  
   * Dropping missing values reduced dataset size.  
   * Improved reliability of valuation data.  

7. **Traceability as a Strength**  
   * Intermediate CSVs serve as audit trail.  
   * Increases transparency and reproducibility.  

8. **Next Steps**  
   * Normalize column names across DataFrames.  
   * Use **folium** for price trend maps.  
   * Apply outlier detection for extreme values.  
   * Automate fuzzy match thresholds.  

### Detailed Findings from `predicted_price_model`

###  Price by Zip Code & Region
- Properties in **prime zip codes** (e.g., central business districts, high-demand neighborhoods) consistently show **15–25% higher predicted prices** compared to suburban or rural zip codes.  
- Some **borderline zip codes** show wide price variation → indicating **gentrification or mixed development patterns**.  


###  City-Level Price Trends
- **Tier-1 cities** (metros like Delhi, Mumbai, Bangalore) show a steep upward prediction curve with **average prices almost 2–3x higher** than Tier-2 or Tier-3 cities.  
- **Smaller towns** with good infrastructure/IT hubs show **above-expected predicted prices**, signaling **hidden growth potential**.  


###  Construction Date / Year Built
- **Newer buildings (post-2015)** are predicted to have **20–30% higher valuations** due to modern amenities, safety codes, and demand for “ready-to-move” properties.  
- **Properties built before 1990** show a **depreciation trend**, unless they fall in **heritage or central city zones** (where land value dominates).  


###  Building Type & Land Structure
- **Detached houses / independent floors** have **higher price variance** → land value plays a big role.  
- **High-rise apartments** in metro regions are predicted to have **stable but slightly lower price appreciation** compared to standalone properties.  
- **Commercial land plots** (if included in dataset) are predicted **significantly higher per sq. ft.** compared to residential, especially in urban business districts.  


###  Lot Size & Structure
- **Larger lot sizes** strongly correlate with predicted prices, but **diminishing returns** are observed after a certain threshold (e.g., >5000 sq. ft., extra land doesn’t add as much price value).  
- **Compact but strategically located apartments** often outperform large houses in less central areas.  


###  Regional Hotspots Identified by Model
Certain **zip + region clusters** emerge as **“asset classes”** (from unsupervised clustering step):  
- **Cluster A:** High-demand metro apartments (predictably premium)  
- **Cluster B:** Growing suburban regions (moderate price but high future growth prediction)  
- **Cluster C:** Old constructions with low predicted valuations (risk-heavy segment)  

## Overall Summary:
- **Total Assets** - 8,652
- **Avg Predicted Price** - $4,09,452
- **Total Rentable Sq Feet** - 361423771

##  Task 3: Unsupervised Learning – Asset Clustering

##  Objective

To group real estate assets into meaningful “asset classes” using **KMeans clustering** based on macroeconomic, price, structural, and locational indicators. This enables the identification of patterns in asset behavior (e.g., high-growth zones, undervalued areas, or aging stock).

---

##  Model Summary

- **Input Dataset**: Cleaned asset data from Task 1 & 2
- **Algorithm**: KMeans clustering (sklearn)
- **Optimal Number of Clusters (k)**: 6 (selected using silhouette score optimization)
- **Silhouette Scores**:
  - k=3 → 0.2127
  - k=4 → 0.2317
  - k=5 → 0.2202
  - k=6 → 0.2320 (best score, selected)

---

## Selected Features for Clustering

| Feature | Description |
|--------|-------------|
| `pps_asset` | Price per rentable square foot |
| `pps_premium_pct` | Price premium vs regional average |
| `pct_chg_1m`, `pct_chg_3m`, `pct_chg_12m` | Monthly, quarterly, yearly price changes |
| `volatility` | Log-scaled 12-month price volatility |
| `log_rentable_sqft` | Log of rentable square feet |
| `age` | Age of asset (2024 - construction year) |
| `utilization` | Occupied vs available square footage |
| `nearest_dist_km` | Distance to nearest metro city |
| `latitude_z`, `longitude_z` | Standardized geographic coordinates |
| `is_owned` | 1 if owned, 0 if leased |

---

## Cluster Profiles & Interpretation

| Cluster | Description | Avg PPS (₹) | Price Trend | Utilization | Notes |
|--------|-------------|-------------|-------------|-------------|-------|
| **C0** | Mid Value · Stable · Peripheral | ~657 | ~0% | High | Older assets, not in central zones |
| **C1** | Mid Value · Growing · Peripheral | ~84 | +2.1% | Medium | Emerging peripheral asset class |
| **C2** | Mid Value · Stable · Peripheral | ~126 | +0.2% | Very High | Fully owned, older assets |
| **C3** | Mid Value · Declining · Peripheral | ~86 | -3.7% | Low | Underperforming or aging assets |
| **C4** | High Value · Stable · Peripheral | ~1,74,000 | ~0% | Very Low | Luxury or commercial land (rare) |
| **C5** | Mid Value · Declining · Peripheral | ~49 | -4.6% | Very Low | Distressed or remote assets |

 Most clusters are labeled as "Peripheral", indicating distance from central metros remains a key factor in asset classification.

---

##  PCA Cluster Visualization

- PCA was used to reduce all features to 2D for plotting.
- Visual shows cluster separability in PC1 vs PC2 plane.

###  Notable Visual Insights:
-  **Cluster C4**: Appears far from all others — high-value commercial/luxury asset outlier.
-  **Cluster C5**: Isolated, low-value cluster — likely distressed assets.
-  **Cluster C1** and  **Cluster C3**: Show mild overlap — transitional between growth and decline.

---

##  Key Observations & Findings

### 1.  Growth Clusters
- **C1**: Only cluster showing strong positive annual growth (+2.1%) → likely areas of urban expansion or infrastructure development.

### 2.  Declining Assets
- **C3 & C5**: Show consistent price decline with low utilization — candidates for risk review or divestment.

### 3.  Value Extremes
- **C4**: Extremely high-value outlier cluster (~₹1.74L per sqft) with very few assets → may represent key flagship holdings.
- **C0 & C2**: Mid-value clusters with higher occupancy and stable price → strong candidates for income yield.

### 4.  Utilization Patterns
- Highest in C2 → indicating operational efficiency.
- Lowest in C5 and C4 → either distressed (C5) or speculative/under construction (C4).

### 5.  Location Dynamics
- Most clusters are geographically distant from metro hubs → strong peripheral concentration.
- Indicates opportunity for decentralization or reinvestment in central areas if available.

- The largest concentration of assets lies in:
  -  Stable Core Assets (~4,283 assets)
  -  Declining Fringe Assets (~2,092 assets)
- Growth-oriented or premium assets are a smaller portion:
  - Mainstream Growth: ~665
  - Premium Trophy: ~1,498
- Very few assets are at extreme ends:
  - At-Risk Assets: just 5 assets (likely anomalies or very expensive flagship assets)
  - Stable Secondary Assets: 109
---
##  Strategic Business Recommendations by Cluster
### Cluster 0 – Mainstream Growth Assets
-  Avg PPS: ₹657.35 | 12M Growth: -0.20%
-  Strong value base, slight underperformance recently.
-  Suggested Action: Expand or upgrade selectively; focus on improving utilization.

### Cluster 1 – Stable Core Assets
-  Avg PPS: ₹83.94 | 12M Growth: +2.14%
-  Largest and most consistent group.
-  Suggested Action: Retain; use for reliable rental income and cashflow generation.

### Cluster 2 – Premium Trophy Assets
-  Avg PPS: ₹125.58 | 12M Growth: +0.21%
-  High-value, prestigious locations or properties.
-  Suggested Action: Hold or invest further; raise rents or retain for long-term branding.

### Cluster 3 – Declining Fringe Assets
-  Avg PPS: ₹86.24 | 12M Growth: -3.69%
-  Significantly underperforming; possible aging stock or isolated geographies.
-  Suggested Action: Evaluate for redevelopment or divestment.

### Cluster 4 – At-Risk Assets
-  Avg PPS: ₹173,959.16 | 12M Growth: -0.05%
-  Extremely high value but very low count (5 assets).
-  Suggested Action: Flag for review. If underperforming, consider divestment or operational review.

### Cluster 5 – Stable Secondary Assets
-  Avg PPS: ₹49.45 | 12M Growth: -4.64%
-  Low value, low growth, low risk.
-  Suggested Action: Maintain with minimal investment; not a capex priority.

---

#  Asset Cluster Profiling & Business Strategy

##  Final Cluster Names and Descriptions

The KMeans output was enhanced with business-friendly naming conventions to reflect asset quality, risk, and value generation potential.

| Cluster | Name                       | Avg Price/SqFt (₹) | 12M Price Change (%) | Asset Count |
|---------|----------------------------|---------------------|----------------------|-------------|
| 0       | Mainstream Growth Assets   | ₹657.35             | -0.20%               | 665         |
| 1       | Stable Core Assets         | ₹83.94              | +2.14%               | 4,283       |
| 2       | Premium Trophy Assets      | ₹125.58             | +0.21%               | 1,498       |
| 3       | Declining Fringe Assets    | ₹86.24              | -3.69%               | 2,092       |
| 4       | At-Risk Assets             | ₹173,959.16         | -0.05%               | 5           |
| 5       | Stable Secondary Assets    | ₹49.45              | -4.64%               | 109         |

---


### Feature Engineering & Preparation
- Created key engineered features such as:
  - `pps_asset`: Price per square foot
  - `pps_premium_pct`: Premium compared to regional average
  - `utilization`: 1 - vacancy rate
  - `log_rentable_sqft`, `volatility`, `is_owned`, `nearest_dist_km`
  - Geolocation normalization: `latitude_z`, `longitude_z`
- Handled missing values with median imputation.
- Scaled all features using StandardScaler.

---
###  Cluster Selection Using Silhouette Score

| k | Silhouette Score |
|---|------------------|
| 3 | 0.2127           |
| 4 | 0.2317           |
| 5 | 0.2202           |
| 6 | 0.2320          |

Selected k = 6 clusters for optimal separation.

---

### Final KMeans Clustering Output
- Assets grouped into 6 clusters.
- PCA used to project high-dimensional data into 2D for visualization.

---

## PCA Cluster Visualization

**Observations:**
- Good separation between clusters, especially between high-value and fringe/declining assets.
- Some overlap between mid-range clusters.

---

## Cluster Profiles

| Cluster | Cluster Name              | # Assets | Avg PPS   | 12m Change |
|---------|---------------------------|----------|-----------|------------|
| 0       | Mainstream Growth Assets  | 665      | 657.35    | -0.002     |
| 1       | Stable Core Assets        | 4283     | 83.94     | +0.021     |
| 2       | Premium Trophy Assets     | 1498     | 125.58    | +0.002     |
| 3       | Declining Fringe Assets   | 2092     | 86.24     | -0.037     |
| 4       | At-Risk Assets            | 5        | 173959.16 | -0.0006    |
| 5       | Stable Secondary Assets   | 109      | 49.45     | -0.046     |

---

## Cluster Histogram: Asset Distribution

**Observations:**
- Stable Core Assets dominate (~50% of all assets).
- Premium and Fringe categories are also substantial.
- "At-Risk" assets are rare but extremely high in value → Need monitoring.

---

## Geographic Distribution of Clusters

**Observations:**
- Premium Trophy and Core Assets are widely distributed across the US.
- Cluster 5 (At-Risk) appears in isolated or potentially outlier locations.
- Fringe/Declining assets appear in peripheral or rural regions.

---

##  Hierarchical Clustering (Dendrograms)

**Insights:**
- Different dendrograms confirm structure among individual feature dimensions.
- Strong signal in location and utilization-based clustering.
- Useful for internal benchmarking and data governance.

---

## Business Recommendations by Cluster

| Cluster Name              | Strategic Guidance |
|---------------------------|--------------------|
| Premium Trophy Assets   | Hold/Invest — Long-term prestige, pricing power |
| Mainstream Growth Assets | Optimize utilization, minor upgrades |
| Stable Core Assets       | Maintain as cash-generating backbone |
| Stable Secondary Assets  | Low risk; deprioritize capex |
| Declining Fringe Assets  | Evaluate for divestment or redevelopment |
| At-Risk Assets           | High monitoring priority; consider exit strategies |




##  Task 3: Supervised Modeling: Assets based on asset valuation

### 1. Data Input and Features

- The supervised pipeline used engineered features derived from macroeconomic clustering, geospatial data, and price history metrics.
- Numeric features included historical price averages (3M/6M/12M), percent changes, asset size, and utilization.
- Categorical features included asset type, ownership status, region, and state. Coordinates (latitude/longitude) were also incorporated as geo features.

### 2. Classification Model (Predicting macro_cluster)

- Model: **LightGBM Classifier** using a full pipeline (with scaling, imputation, and OneHotEncoding).
- Stratified train-test split ensured balance across clusters.
- Accuracy on the test set was reasonably high, showing the model could effectively learn macro cluster assignments.
- The confusion matrix showed some overlap in neighboring clusters, but dominant predictions were correct.
- The classification report revealed strong precision/recall for well-represented clusters; some underrepresented clusters showed lower performance — likely due to class imbalance.

### 3. Regression Model (Predicting target_price)

- Objective: Predict target_price using LightGBM Regressor with log1p-transformed target values.
- The macro_cluster from classification was added as an input feature, helping segment asset types.
- 80 features were used after OneHotEncoding and transformations.

 Model Performance:
- MAE (Mean Absolute Error): approx. ₹39,334
- RMSE (Root Mean Squared Error): approx. ₹91,639
- R² Score: 0.926

These results indicate a high degree of accuracy in predicting asset prices, with over 92% of variance explained.

 Early stopping was triggered at iteration 722, helping avoid overfitting while maintaining optimal performance.



## Analysis of Asset Pricing Visualizations

---

## 1. Distribution of Asset Prices

- The price distribution is **right-skewed**, with most assets priced below $500,000.
- A few **high-value outliers** exist, representing expensive properties.
- The skewness justifies the use of **log-transformation** in modeling for stabilizing variance.

---

## 2. GSA Region-wise Average Asset Prices

- Average asset prices vary significantly across GSA regions.
- Regions **9, 11, and 2** have the highest average prices (likely major urban/economic centers).
- Regions **6 and 7** have the lowest average prices (likely rural or lower-demand areas).

---

## 3. Clusters of Assets – Estimated Price vs Rentable Square Feet

- Price varies widely even among assets of similar size, showing **size alone is not a strong price predictor**.
- Clustering reveals some distinct groupings but with **overlapping clusters**.
- Presence of **outliers**: very large properties with low price/sqft or extremely high prices.

---

## 4. Correlation Heatmap (Numerical Features)

- Strong positive correlations between price-related variables (e.g., `price_latest`, `price_avg_12m`, `target_price`).
- Construction-related features (e.g., age) show **moderate or weak correlations** with price.
- Percent change features (`pct_chg_*`) have mixed relationships, indicating volatility.
- Heatmap confirms some features are redundant; useful for feature selection.

---

## 5. Average Estimated Price by Cluster

- **"At-Risk Assets"** have the **highest average price**, suggesting these are valuable but potentially vulnerable assets.
- **"Stable Core"** and **"Stable Secondary"** clusters have **lower average prices** but likely represent dependable assets.
- Indicates that **high-value assets are not necessarily stable**, important for investment decisions.

---

## 6. Regional Avg Price vs Estimated Property Price by Cluster

- Clear **positive correlation** between regional average price and estimated property price.
- Clusters spread across all regions but with **varying densities**.
- Shows regional benchmarks improve price estimation but cluster-level details add insight.

---

## Final Results & Summary

| Key Insight                               | Impact / Use Case                                      |
|------------------------------------------|-------------------------------------------------------|
| Price distribution is right-skewed       | Use log-transformed prices for regression models      |
| Regional price variations are significant | Customize pricing models per GSA region                |
| Size alone is insufficient for price prediction | Combine size with other features & clustering          |
| Price-related features strongly correlated | Focus on price variables; remove redundant features   |
| "At-Risk" assets are high-value           | Investment risk analysis must consider asset risk     |
| Regional price benchmarks effective       | Regional info is essential for accurate price estimates|



##  Conclusion  

* The RWAP project integrated **valuation modeling, clustering, and classification** with **GIS spatial analysis** to build a complete asset intelligence framework.  
* The final **valuation model** achieved strong predictive accuracy:  
  * **R² = 0.926** → Explains ~93% of variation in asset prices.  
  * **RMSE ≈ ₹91,639** → Average deviation per prediction, relatively low given price ranges.  
* **Six asset clusters** were identified, representing distinct risk-return profiles.  
* Spatial analysis revealed **clear metro vs. periphery divide** in valuations.  
* Construction year, asset type, and lot size were **strong non-location drivers**.  
* The Streamlit + Folium dashboard provided an **interactive decision-support tool** for visualization and monitoring.  

---

## Managerial Insights  

### 1. Portfolio Structure (Cluster-Wise)  
* **Cluster 1: Stable Core (~4,283 assets, ~50% of portfolio)**  
  * Avg Price: ~₹51,000/sqft  
  * Utilization: High & consistent  
  * Growth Rate: Steady (~+0.8%)  
  * Role: **Income backbone** of the portfolio  

* **Cluster 2: Premium Trophy (~1,498 assets)**  
  * Avg Price: ~₹1.12 lakh/sqft  
  * Metro-concentrated (Delhi, Mumbai CBDs, Bangalore IT hubs)  
  * Utilization: Moderate but under-optimized  
  * Role: **Prestige + long-term appreciation potential**  

* **Cluster 3: Declining Fringe (~2,092 assets)**  
  * Avg Price: ~₹35,000/sqft  
  * Growth: -3.7% → consistent value erosion  
  * Located mostly in outer ZIPs with weak connectivity  
  * Role: **Risk-heavy, dragging portfolio returns**  

* **Cluster 0: Mainstream Growth (~665 assets)**  
  * Avg Price: ~₹47,000/sqft  
  * Growth: -0.2% → underperforming now but high catch-up potential  
  * Located in mid-tier cities, peri-metros  
  * Role: **Turnaround candidates**  

* **Cluster 4: At-Risk (~5 assets, very high-value)**  
  * Avg Price: ~₹1.74 lakh/sqft  
  * Extremely high-value, but poor utilization and concentration risk  
  * Role: **Critical monitoring segment** — small in number, but disproportionate exposure  

* **Cluster 5: Stable Secondary (~109 assets)**  
  * Avg Price: ~₹40,000/sqft  
  * Low-growth, low-value, but steady occupancy  
  * Role: **“Safe but non-strategic” filler assets**  

---

### 2. Growth & Risk Dynamics  
* **Growth Drivers:** Metro presence, new constructions (post-2015), large lots, commercial zoning.  
* **Risk Drivers:** Peripheral locations, old stock (pre-1990), poor connectivity, limited regional housing index growth.  
* **Cluster Exposure:**  
  * ~50% portfolio = safe & steady (Cluster 1).  
  * ~25% = growth + premium (Clusters 0 & 2).  
  * ~25% = weak/declining (Clusters 3 & 5).  

---

### 3. Regional & Spatial Insights  
* **High-performing regions:**  
  * **Region 9** (Avg ~₹92k/sqft) → premium CBD concentration.  
  * **Region 11** (Avg ~₹85k/sqft) → high-demand metros.  
  * **Region 2** (Avg ~₹81k/sqft) → consistent appreciation.  

* **Weak-performing regions:**  
  * **Region 6 (~₹28k/sqft)** and **Region 7 (~₹31k/sqft)** → dragged down by declining fringe assets.  

* **Metro vs Periphery Split:**  
  * Metro assets average **2–3x higher** than periphery counterparts.  
  * Yet ~60% portfolio is “Peripheral” tagged → structural imbalance.  

---

### 4. Asset Characteristics  
* **Construction Year:**  
  * Post-2015 → 20–30% higher valuations.  
  * Pre-1990 → negative impact unless heritage/CBD located.  

* **Lot Size:**  
  * Positive correlation till ~5,000 sqft; beyond that diminishing returns.  

* **Asset Type:**  
  * Commercial assets outperform residential on per sqft valuation, esp. in metros.  
  * Residential assets still form majority share (~65%) of portfolio.  

---

## Strategic Recommendations  

The prediction may be less accurate due to limited features used.
Including additional features like asset age, asset type, etc., could improve prediction performance.

A more comprehensive feature set would likely enhance the model's accuracy and reliability.
###  Portfolio Management  
* **Stable Core (Cluster 1):** Retain, maintain high utilization.  
* **Premium Trophy (Cluster 2):**  
  * Push long-term leases with premium clients.  
  * Explore rent hikes (metro demand supports higher pricing).  
* **At-Risk (Cluster 4):**  
  * Quarterly risk reviews.  
  * If utilization <60% for 2+ quarters → **divest or repurpose**.  
* **Declining Fringe (Cluster 3):**  
  * Selective divestment to free up capital.  
  * Redevelopment where infra projects (metro/expressways) are expected.  
* **Mainstream Growth (Cluster 0):**  
  * Upgrade facilities (smart tech, energy efficiency).  
  * Market repositioning to capture mid-income demand.  
* **Stable Secondary (Cluster 5):**  
  * Low-priority, keep for steady returns.  

---

###  Growth Strategy  
* **Focus expansion in Regions 9, 11, 2** → proven high-value growth hubs.  
* **Capitalize on Tier-2/3 growth** (Lucknow, Indore, Coimbatore) → strong housing index uptick.  
* **Rebalance portfolio:** Target **Metro : Peripheral = 60:40** (currently reversed).  

---

###  Spatial Planning  
* Acquire assets along **gentrifying peripheries** flagged in GIS (suburbs near new expressways, metro corridors).  
* Apply **spatial clustering** to spot “hidden growth pockets” near metros.  
* Build **diversified presence across 8–10 regions**, not concentrated in 3.  

---

### Data & Governance  
* **Institutionalize GIS Dashboard:**  
  * Track asset clusters, utilization, and price movement quarterly.  
  * Enable drill-down from national → region → ZIP → asset.  

* **Predictive Analytics:**  
  * Run scenario simulations (e.g., interest rate hikes, urban expansion).  
  * Feed outputs into **investment committees** for decision-making.  

* **Data Quality:**  
  * Enforce schema alignment across datasets.  
  * Automate fuzzy matching for asset IDs.  
  * Maintain **audit trails** to ensure traceability of valuations.  

---

 **Final Takeaway:**  
* **Retain Core, Optimize Premium, Watch At-Risk.**  
* **Divest Declining, Invest in Growth, Rebalance to Metros.**  
* **Adopt GIS + Predictive AI as standing strategic tools.**  
