# Data Exploration
**Objective:** Load the raw housing dataset and perform an initial inspection to identify data quality issues, statistical patterns, and required cleaning steps.

#### 1. Data Loading

We load the raw, immutable dataset from the `data/raw/` directory. This ensures we are analyzing the original source before any transformations are applied.

In [None]:
import pandas as pd

# Load raw data
df = pd.read_csv('../data/raw/housing.csv')

print("====== HOUSING.CSV ======\n")

# Display the first five and last five rows
print("----------- HEAD -----------\n")
print(df.head(),"\n")
print("----------- TAIL -----------\n")
print(df.tail(),"\n")
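
Because the raw file is meant to stay untouched, it can help to fail early with a clear message when the path is wrong, and to explore a copy rather than the loaded frame itself. The sketch below is one possible way to do that; `load_raw_csv` is a hypothetical helper name, not part of this notebook or of pandas.

```python
from pathlib import Path

import pandas as pd


def load_raw_csv(path):
    """Read a raw CSV file, raising a clear error if the file is missing.

    Hypothetical helper for illustration; the notebook itself calls
    pd.read_csv directly.
    """
    path = Path(path)
    if not path.is_file():
        # resolve() shows the absolute path, which makes a wrong working
        # directory easier to spot.
        raise FileNotFoundError(f"Raw dataset not found: {path.resolve()}")
    return pd.read_csv(path)


# Intended usage in this notebook (not executed here):
#   raw = load_raw_csv('../data/raw/housing.csv')
#   df = raw.copy()  # explore the copy so `raw` keeps the original values
```

Working on a copy keeps the loaded frame available as a reference if a later cell modifies `df` by accident.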

#### 2. Dataset Scale and Metadata

In this section, we examine the dimensions of the dataset and the data types of each column. This helps us identify which columns are numerical and which are categorical (requiring encoding later).

In [None]:
# Show shape
print("----------- SHAPE -----------\n")
print("Shape: ", df.shape, "\n")

# Display column names, data types, and memory usage.
# df.info() prints its report directly and returns None, so it is not wrapped in print().
print("----------- INFO -----------\n")
df.info()
print()
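
To make the numerical/categorical split explicit rather than reading it off the `info()` output, the columns can be grouped by dtype. This is a minimal sketch: `split_columns_by_dtype` is a hypothetical helper, and the small `toy` frame holds made-up values so the example runs on its own; in the notebook you would pass `df` instead.

```python
import pandas as pd


def split_columns_by_dtype(frame):
    """Return (numerical, categorical) column name lists based on dtype."""
    numerical = frame.select_dtypes(include="number").columns.tolist()
    # Everything that is not numeric is treated as categorical here,
    # which is a simplification that suits a first inspection.
    categorical = frame.select_dtypes(exclude="number").columns.tolist()
    return numerical, categorical


# Toy data for illustration only (values are invented).
toy = pd.DataFrame({
    "total_rooms": [880, 7099, 1467],
    "total_bedrooms": [129.0, 1106.0, 190.0],
    "ocean_proximity": ["A", "A", "B"],
})

numerical_cols, categorical_cols = split_columns_by_dtype(toy)
print("Numerical:  ", numerical_cols)
print("Categorical:", categorical_cols)
```

The categorical list is the set of columns that would need encoding later.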

#### 3. Statistical Analysis

We perform a descriptive statistical analysis to understand the distribution, mean, and spread of our features.

* **Numerical Analysis:** Focuses on the range and standard deviation of **house prices**, **income**, and **room counts**.
* **Categorical Analysis:** Examines the unique values in the `ocean_proximity` column.

In [None]:
# For numerical columns
print("----------- DESCRIPTION -----------\n")
print("---- NUMERICAL COLUMNS ----\n")
print(df.describe(), "\n")
# For categorical columns
print("---- CATEGORICAL COLUMNS ----\n")
print(df.describe(include="object"), "\n")
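
`describe(include="object")` reports only the number of unique values and the most frequent one. To see every category with its frequency, a per-value count is often more informative. The sketch below assumes a hypothetical helper `category_summary` and uses invented placeholder labels; in the notebook it would be applied to `df["ocean_proximity"]`.

```python
import pandas as pd


def category_summary(series):
    """Return the count and percentage share of each distinct value."""
    # dropna=False keeps missing values visible as their own row.
    counts = series.value_counts(dropna=False)
    shares = (counts / len(series) * 100).round(2)
    return pd.DataFrame({"count": counts, "percent": shares})


# Toy data for illustration only (labels are placeholders).
toy = pd.Series(["A", "B", "A", "A"], name="ocean_proximity")

summary = category_summary(toy)
print(summary)
```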

#### 4. Data Quality Assessment

To prepare for the cleaning phase, we check for:

   1. **Missing Values:** Identifying columns that require imputation.

   2. **Duplicate Rows:** Ensuring data integrity.

In [None]:
# Check for missing values
print("----------- MISSING VALUES -----------\n")
print(df.isnull().sum(), "\n")

# Check for duplicate rows
print("----------- DUPLICATE ROWS -----------\n")
print(df.duplicated().sum(), "\n")
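
Raw null counts are easier to judge alongside their share of the dataset, since that share is what usually drives the choice between imputing and dropping. The following is a sketch under the assumption that a compact report is wanted; `missing_report` is a hypothetical helper and the `toy` frame contains invented values.

```python
import pandas as pd


def missing_report(frame):
    """List columns that contain nulls, with counts and percentages."""
    n_missing = frame.isna().sum()
    report = pd.DataFrame({
        "missing": n_missing,
        "percent": (n_missing / len(frame) * 100).round(2),
    })
    # Keep only columns that actually have missing values.
    return report[report["missing"] > 0].sort_values("missing", ascending=False)


# Toy data for illustration only (values are invented).
toy = pd.DataFrame({
    "total_rooms": [880, 7099, 1467, 1274],
    "total_bedrooms": [129.0, None, 190.0, None],
})

report = missing_report(toy)
n_duplicates = int(toy.duplicated().sum())

print(report)
print("Duplicate rows:", n_duplicates)
```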

##### Data Quality Report

* **Initial state:** 20,640 records.
* **Missing values:** `total_bedrooms` has 207 nulls.
* **Data types:** `ocean_proximity` is object/string and needs encoding.
* **Outliers:** High maximum values in `total_rooms` suggest outliers.
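
The outlier note above comes from reading the `describe()` output. One way to quantify it is the 1.5 × IQR rule, a common heuristic chosen here for illustration; the notebook does not prescribe a specific outlier method. `iqr_outlier_mask` is a hypothetical helper and the `toy` values are invented; in the notebook it would be applied to `df["total_rooms"]`.

```python
import pandas as pd


def iqr_outlier_mask(series, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] as outliers."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    # Missing values compare as False, so they are never flagged.
    return (series < lower) | (series > upper)


# Toy data for illustration only (values are invented).
toy = pd.Series([800, 900, 1000, 1100, 1200, 30000], name="total_rooms")

mask = iqr_outlier_mask(toy)
print("Flagged values:", toy[mask].tolist())
print("Share flagged: ", round(mask.mean() * 100, 2), "%")
```

A flagged value is only a candidate for review; whether it is an error or a genuinely large observation is a decision for the cleaning phase.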