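import pandas as pd

# Load the cleaned dataset produced in the earlier project phase
df = pd.read_csv("zillow_cleaned.csv")

# Only the last expression in a notebook cell is displayed, so print each result
print(df.shape)   # (rows, columns)
print(df.dtypes)  # data type of each column

# Fraction of missing values per column, largest first
print(df.isnull().mean().sort_values(ascending=False))

# Summary statistics for the numeric columns
print(df.describe())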

# Data Overview

## Data Sources
The dataset used in this project is derived from Zillow residential property data and was originally obtained from Kaggle during an earlier phase of the Integrated Capstone Project. The data represents aggregated housing and property characteristics across multiple U.S. counties and ZIP codes. It contains no personally identifiable information and is intended for research and analytical purposes.

## Data Collection Method
The data was downloaded in CSV format from a public Kaggle repository and loaded into the analysis environment with Python and the pandas library. The dataset was cleaned and standardized in a previous semester to remove inconsistencies and ensure usability. The cleaned dataset (`zillow_cleaned.csv`) is reused in this phase of the project to keep the analyses consistent.

## Data Description
The cleaned dataset contains **64,894 observations and 19 variables**. All variables are numeric and include a mix of continuous and discrete features describing physical housing characteristics, geographic identifiers, and assessed property values. Key variables include square footage, number of bedrooms and bathrooms, year built, lot size, and total assessed property value. The target variable for modeling is `taxvaluedollarcnt`, which represents the assessed value of a property in U.S. dollars.

### Summary Statistics (Selected Variables)

| Variable | Mean | Median | Std | Min | Max |
|--------|------|--------|-----|-----|-----|
| taxvaluedollarcnt | — | — | — | — | — |
| calculatedfinishedsquarefeet | 1767 | 1555 | 826 | 128 | 5954 |
| bedroomcnt | 3.08 | 3.0 | 0.99 | 0 | 11 |
| bathroomcnt | 2.27 | 2.0 | 0.91 | 0 | 10 |
| yearbuilt | — | — | — | — | — |

*Summary statistics are selectively reported for variables most relevant to the analysis.*
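
The selected statistics above could be regenerated from the DataFrame with a sketch like the one below. The helper name `summarize` and the three-row `demo` frame are illustrative assumptions and are not part of the project. In the notebook, the `df` loaded from `zillow_cleaned.csv` would be passed in place of `demo`.

```python
import pandas as pd

# Variables reported in the summary table
SELECTED = [
    "taxvaluedollarcnt",
    "calculatedfinishedsquarefeet",
    "bedroomcnt",
    "bathroomcnt",
    "yearbuilt",
]


def summarize(frame, columns):
    """Return mean, median, std, min, and max, with one row per variable."""
    # Skip any requested column that is absent from the frame
    present = [c for c in columns if c in frame.columns]
    return frame[present].agg(["mean", "median", "std", "min", "max"]).T


# Hypothetical rows for demonstration only; not taken from the real dataset
demo = pd.DataFrame({
    "taxvaluedollarcnt": [250000.0, 525000.0, 410000.0],
    "calculatedfinishedsquarefeet": [1200.0, 1800.0, 1500.0],
    "bedroomcnt": [2.0, 3.0, 4.0],
    "bathroomcnt": [1.0, 2.5, 2.0],
    "yearbuilt": [1965.0, 1987.0, 2001.0],
})

summary = summarize(demo, SELECTED)
print(summary.round(2))
```

Calling `summarize(df, SELECTED)` on the cleaned dataset would produce one row per variable in the same column order as the table.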

## Data Dictionary

| Variable Name | Type | Description | Example Value |
|--------------|------|-------------|---------------|
| taxvaluedollarcnt | float | Assessed total property value in USD | 525000 |
| calculatedfinishedsquarefeet | float | Total finished living area in square feet | 1800 |
| bedroomcnt | float | Number of bedrooms in the property | 3 |
| bathroomcnt | float | Number of bathrooms in the property | 2.5 |
| lotsizesquarefeet | float | Lot size in square feet | 6000 |
| yearbuilt | float | Year the property was constructed | 1987 |
| regionidzip | float | ZIP code identifier for the property | 90210 |
| latitude | float | Geographic latitude coordinate | 34.12 |
| longitude | float | Geographic longitude coordinate | -118.41 |

## Data Quality Assessment
The dataset is fully cleaned, with **no missing values across all 19 variables**. All features are consistently typed as numeric values, which simplifies statistical analysis and modeling. No duplicate records or structural inconsistencies were detected, as expected for a dataset that was preprocessed before this phase of the project. Although data quality is high, assessed values may not match market transaction prices, which is a potential limitation for predictive accuracy.
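
The checks behind this assessment can be sketched as a single helper. The function name `quality_report` and the `demo` frame are assumptions made for illustration. The demo deliberately contains one missing value, one duplicated row, and one text column so that each check has something to report.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def quality_report(frame):
    """Count missing cells, duplicated rows, and non-numeric columns."""
    return {
        "rows": len(frame),
        "columns": frame.shape[1],
        "missing_cells": int(frame.isna().sum().sum()),
        # duplicated() flags every repeat of an earlier identical row
        "duplicate_rows": int(frame.duplicated().sum()),
        "non_numeric_columns": [
            c for c in frame.columns if not is_numeric_dtype(frame[c])
        ],
    }


# Hypothetical frame for demonstration only; not taken from the real dataset
demo = pd.DataFrame({
    "bedroomcnt": [3.0, 3.0, 2.0, None],
    "bathroomcnt": [2.0, 2.0, 1.0, 1.5],
    "note": ["a", "a", "b", "c"],
})

report = quality_report(demo)
print(report)
```

For the cleaned dataset, the statements in this section would correspond to `missing_cells` and `duplicate_rows` both equal to 0 and an empty `non_numeric_columns` list.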

## Ethical Considerations
The dataset is publicly available and contains no personally identifiable information, minimizing privacy concerns. However, ethical considerations remain regarding potential geographic and socioeconomic bias. Properties from higher-value regions may be overrepresented, which could influence model predictions and reduce generalizability to lower-income or underrepresented areas. Additionally, assessed property values may reflect historical or institutional valuation practices rather than true market conditions. These limitations are acknowledged and considered when interpreting analytical results.
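
One way to examine whether higher-value regions are overrepresented is to compare each region's share of rows with its median assessed value. The sketch below is an assumption about how such a check might look, not the project's method. The helper `region_profile` and the `demo` rows are hypothetical, while the column names come from the data dictionary.

```python
import pandas as pd


def region_profile(frame, region_col="regionidzip",
                   value_col="taxvaluedollarcnt"):
    """Summarize row count, median value, and share of rows per region."""
    profile = (
        frame.groupby(region_col)[value_col]
        .agg(n_properties="count", median_value="median")
        .sort_values("median_value", ascending=False)
    )
    # Share of all rows in the frame that fall in each region
    profile["share_of_rows"] = profile["n_properties"] / len(frame)
    return profile


# Hypothetical rows for demonstration only; not taken from the real dataset
demo = pd.DataFrame({
    "regionidzip": [90210.0, 90210.0, 90210.0, 90001.0],
    "taxvaluedollarcnt": [900000.0, 1100000.0, 1000000.0, 300000.0],
})

profile = region_profile(demo)
print(profile)
```

A profile in which the regions with the highest median values also hold the largest shares of rows would be consistent with the overrepresentation concern described above.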