## Zillow Analysis

We start by loading the cleaned dataset produced by src/process_zillow.py.

In [32]:
import pandas as pd

df = pd.read_parquet("../data/processed/zillow_rent_cleaned.parquet")

In [33]:
df.isna().sum()

zpid                    0
zestimate            2643
rentZestimate           0
latitude                0
longitude               0
bedrooms             2194
bathrooms            1808
livingArea           1805
yearBuilt            2789
lotSize              3532
homeType                0
homeStatus              0
miles_to_old_well       0
dtype: int64

In [34]:
df.shape

(25182, 13)

From the two cells above, we know that our "cleaned" data has quite a few missing pieces. 

Namely, out of the 25,182 total listings...
- 2643 have a missing zestimate
- 2194 have missing bedroom values
- 1808 have missing bathroom values
- 1805 have missing livingArea values
- 2789 have missing yearBuilt values
- 3532 have missing lotSize values
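
Raw counts are easier to compare as a share of all rows. Here is a minimal sketch of a summary helper; `missing_summary` is a hypothetical name, and the toy frame uses made-up values rather than the real Zillow data.

```python
import numpy as np
import pandas as pd


def missing_summary(frame):
    """Return per-column missing counts and percentages, worst column first."""
    counts = frame.isna().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "pct_missing": (counts / len(frame) * 100).round(2),
    })
    # Keep only columns that actually have gaps
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Toy stand-in for the Zillow frame (values are invented for illustration)
toy = pd.DataFrame({
    "rentZestimate": [1500, 1800, 2100, 950],
    "bedrooms": [2, np.nan, 3, 1],
    "lotSize": [np.nan, np.nan, 5000, np.nan],
})

summary = missing_summary(toy)
print(summary)
```

Calling `missing_summary(df)` on the real frame would show the same counts as `df.isna().sum()` alongside each column's percentage of the total rows.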

I think the best course of action as we continue to clean our data is to drop zestimate, yearBuilt, and lotSize as features, since each is missing for more than 2,600 listings. Dropping these columns, rather than the rows that lack them, lets us retain the majority of our current dataset. Let's also drop homeType and homeStatus since they won't play a big part in our analysis.

Next, we'll drop every row with a missing bedrooms, bathrooms, or livingArea value. These features are imperative to training our model, so we keep livingArea as a column and discard only the listings that lack it. We'll also filter out listings with 0 bedrooms.
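
The tradeoff between dropping a sparse column and dropping the rows that lack it can be checked by counting how many rows survive under each choice. The sketch below is an illustration only: `rows_retained` is a hypothetical helper and the toy values are made up.

```python
import numpy as np
import pandas as pd


def rows_retained(frame, required):
    """Count rows that have a value in every column listed in `required`."""
    return int(frame[required].notna().all(axis=1).sum())


# Toy frame: lotSize is sparse, the core features are mostly present
toy = pd.DataFrame({
    "bedrooms":   [2, np.nan, 3, 1, 4],
    "bathrooms":  [1, 1, np.nan, 1, 2],
    "livingArea": [900, 1100, 1500, np.nan, 2000],
    "lotSize":    [np.nan, np.nan, 6000, np.nan, 8000],
})

core = ["bedrooms", "bathrooms", "livingArea"]
kept_core = rows_retained(toy, core)                # lotSize dropped as a column
kept_all = rows_retained(toy, core + ["lotSize"])   # lotSize required in every row

print(f"Requiring core features only: {kept_core} rows")
print(f"Requiring lotSize as well:    {kept_all} rows")
```

Running the same comparison on the real frame before committing to a column list shows how many listings each extra required feature would cost.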

In [35]:
import pandas as pd

# Load the original cleaned Zillow dataset (includes rentZestimate)
df = pd.read_parquet("../data/processed/zillow_rent_cleaned.parquet")

# Drop columns no longer needed for rent modeling
df = df.drop(columns=[
    "zestimate",
    "yearBuilt",
    "lotSize",
    "homeType",
    "homeStatus"
])

# Drop rows missing key features for rent prediction
df = df.dropna(subset=["rentZestimate", "bedrooms", "bathrooms", "livingArea"])

# Optional: drop listings with 0 bedrooms (to avoid divide-by-zero)
df = df[df["bedrooms"] > 0]

# Save the final cleaned dataset for rent modeling
output_path = "../data/processed/zillow_minimal_rent_cleaned.parquet"
df.to_parquet(output_path)

print(f"✅ Rent-focused cleaned dataset saved to: {output_path}")
print(f"Remaining rows: {len(df)}")

# Show remaining missing values (should be 0 or very low)
print(df.isna().sum())

✅ Rent-focused cleaned dataset saved to: ../data/processed/zillow_minimal_rent_cleaned.parquet
Remaining rows: 22494
zpid                 0
rentZestimate        0
latitude             0
longitude            0
bedrooms             0
bathrooms            0
livingArea           0
miles_to_old_well    0
dtype: int64


Great! We now have 22,494 listings with no missing values. Time to get to work...
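
Before modeling, a quick sanity check can confirm the frame is in the state we expect. This is a rough sketch, not part of the original pipeline: `check_model_ready` is a hypothetical helper, and treating zpid as a unique listing ID is an assumption.

```python
import numpy as np
import pandas as pd


def check_model_ready(frame, required, id_col="zpid"):
    """Return a list of problems found; an empty list means the frame passed."""
    problems = []
    absent = [col for col in required if col not in frame.columns]
    if absent:
        problems.append(f"missing columns: {absent}")
        return problems
    n_null = int(frame[required].isna().sum().sum())
    if n_null:
        problems.append(f"{n_null} null values in required columns")
    # Assumption: each listing should appear once, keyed by id_col
    if id_col in frame.columns and frame[id_col].duplicated().any():
        problems.append(f"duplicate {id_col} values")
    if "bedrooms" in frame.columns and (frame["bedrooms"] <= 0).any():
        problems.append("listings with 0 or fewer bedrooms")
    return problems


required = ["bedrooms", "bathrooms"]

# Toy frames with invented values: one clean, one with three separate issues
toy_ok = pd.DataFrame({"zpid": [1, 2, 3], "bedrooms": [2, 1, 3], "bathrooms": [1, 1, 2]})
toy_bad = pd.DataFrame({"zpid": [1, 1, 3], "bedrooms": [2, 0, np.nan], "bathrooms": [1, 1, 1]})

ok_problems = check_model_ready(toy_ok, required)
bad_problems = check_model_ready(toy_bad, required)
print(ok_problems)
print(bad_problems)
```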