# Vietnam Housing Data Preprocessing

This notebook demonstrates the complete data preprocessing pipeline for the Vietnam Housing dataset.

## Steps:
1. Load and explore raw data
2. Remove unnecessary columns
3. Handle duplicates
4. Handle missing values
5. Handle outliers
6. Data visualization
7. Save processed data

## 1. Import Libraries

In [None]:
import sys
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path

# Add src to path
sys.path.append('../src')

from preprocessing import HousingDataPreprocessor
import utils

# Set visualization style
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)

print("✓ Libraries imported successfully")

## 2. Load Raw Data

**Note**: Download the dataset from Kaggle:
- Dataset: Vietnam Housing Dataset (Hanoi)
- Save to: `../data/vietnam_housing.csv`

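If the file is not found at that path, the notebook falls back to randomly generated sample data so the remaining cells can still run. Results on the sample data are for demonstration only.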

In [None]:
# Initialize preprocessor
preprocessor = HousingDataPreprocessor()

# Try to load data
data_path = '../data/vietnam_housing.csv'

try:
    df = preprocessor.load_data(data_path)
    print("✓ Real data loaded successfully")
except FileNotFoundError:
    print("⚠️ Dataset not found. Creating sample data for demonstration...")
    
    # Create sample data
    np.random.seed(42)
    n_samples = 1000
    
    sample_data = {
        'Unnamed: 0': range(n_samples),
        'Ngày': pd.date_range('2023-01-01', periods=n_samples, freq='D'),
        'Địa chỉ': [f'Address {i}' for i in range(n_samples)],
        'Quận': np.random.choice(['Ba Đình', 'Hoàn Kiếm', 'Đống Đa', 'Hai Bà Trưng', 
                                  'Cầu Giấy', 'Thanh Xuân', 'Tây Hồ'], n_samples),
        'Huyện': np.random.choice(['Ba Đình', 'Hoàn Kiếm', 'Đống Đa', 'Hai Bà Trưng'], n_samples),
        'Giá': np.random.uniform(1e9, 20e9, n_samples),
        'Diện tích': np.random.uniform(30, 200, n_samples),
        'Giá/m²': np.random.uniform(20e6, 150e6, n_samples),
        'Số tầng': np.random.randint(1, 6, n_samples).astype(float),  # float so NaN can be injected below
        'Số phòng ngủ': np.random.randint(1, 6, n_samples),
        'Dài': np.random.uniform(5, 20, n_samples),
        'Rộng': np.random.uniform(3, 15, n_samples),
        'Loại hình nhà ở': np.random.choice(['Nhà riêng', 'Nhà mặt phố', 'Biệt thự'], n_samples),
        'Giấy tờ pháp lý': np.random.choice(['Sổ đỏ/ Sổ hồng', 'Hợp đồng mua bán'], n_samples)
    }
    
    df = pd.DataFrame(sample_data)
    # Add some missing values
    df.loc[df.sample(50).index, 'Số tầng'] = np.nan
    df.loc[df.sample(30).index, 'Loại hình nhà ở'] = np.nan
    
    preprocessor.df = df
    print(f"✓ Sample data created: {df.shape}")

## 3. Exploratory Data Analysis (EDA)

In [None]:
# Display basic information
utils.print_data_info(df, "Raw Dataset Information")

In [None]:
# Display first few rows
print("First 5 rows:")
df.head()

In [None]:
# Statistical summary
print("Statistical Summary:")
df.describe()

In [None]:
# Check missing values
print("Missing Values:")
missing = df.isnull().sum()
missing[missing > 0].sort_values(ascending=False)
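
Raw counts are easier to act on next to percentages. The following is a minimal sketch on a small made-up frame; `missing_summary` and `demo` are hypothetical names for illustration and are not part of `utils` or the real dataset.

```python
import numpy as np
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return count and percentage of missing values per column, worst first."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(2),
    })
    # Keep only columns that actually have gaps
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Hypothetical toy data, for illustration only
demo = pd.DataFrame({
    "Số tầng": [1.0, np.nan, 3.0, np.nan],
    "Giá": [1e9, 2e9, np.nan, 4e9],
    "Quận": ["Ba Đình", "Tây Hồ", "Cầu Giấy", "Đống Đa"],
})
print(missing_summary(demo))
```

On the real data, the same function could be called as `missing_summary(df)`.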

## 4. Data Cleaning Pipeline

### 4.1 Remove Unnecessary Columns

In [None]:
preprocessor.remove_unnecessary_columns()
print(f"Shape after removing columns: {preprocessor.df.shape}")
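
Which columns count as unnecessary is decided inside `HousingDataPreprocessor`, and that list is not shown in this notebook. As a rough sketch of what such a step might look like, assuming an illustrative drop list (`COLUMNS_TO_DROP` and `drop_columns` are hypothetical):

```python
import pandas as pd

# Assumption: this drop list is purely illustrative; the real list lives
# inside HousingDataPreprocessor and may differ.
COLUMNS_TO_DROP = ["Unnamed: 0", "Địa chỉ"]


def drop_columns(frame: pd.DataFrame, columns: list[str]) -> pd.DataFrame:
    """Drop the listed columns, skipping any that are absent."""
    return frame.drop(columns=columns, errors="ignore")


# Hypothetical toy data, for illustration only
demo = pd.DataFrame({
    "Unnamed: 0": [0, 1],
    "Giá": [1.5e9, 3.2e9],
    "Diện tích": [45.0, 80.0],
})
cleaned = drop_columns(demo, COLUMNS_TO_DROP)
print(cleaned.columns.tolist())
```

Passing `errors="ignore"` keeps the step safe to re-run on a frame where the columns are already gone.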

### 4.2 Remove Duplicates

In [None]:
preprocessor.remove_duplicates()
print(f"Shape after removing duplicates: {preprocessor.df.shape}")
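
To see how many rows a deduplication step would remove before running it, plain pandas is enough. A small sketch on made-up rows, assuming exact full-row duplicates are the target (the preprocessor's own criteria may differ):

```python
import pandas as pd

# Hypothetical toy data: the first two rows are identical
demo = pd.DataFrame({
    "Quận": ["Ba Đình", "Ba Đình", "Tây Hồ"],
    "Giá": [2e9, 2e9, 5e9],
})

# duplicated() marks every repeat after the first occurrence
n_duplicates = int(demo.duplicated().sum())
deduplicated = demo.drop_duplicates().reset_index(drop=True)

print(f"Removed {n_duplicates} duplicate row(s); {len(deduplicated)} remain")
```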

### 4.3 Handle Missing Values

In [None]:
preprocessor.handle_missing_values()

print("\nMissing values after handling:")
missing_after = preprocessor.df.isnull().sum()
print(missing_after[missing_after > 0] if missing_after.any() else "No missing values!")
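
The rules applied by `handle_missing_values()` are defined in the `preprocessing` module and are not visible here. For readers who want a baseline to compare against, one common approach is median imputation for numeric columns and mode imputation for the rest. The sketch below is that swapped-in technique, not the preprocessor's method; `impute_simple` is a hypothetical helper.

```python
import numpy as np
import pandas as pd


def impute_simple(frame: pd.DataFrame) -> pd.DataFrame:
    """Fill numeric gaps with the column median and other gaps with the mode."""
    result = frame.copy()
    for col in result.columns:
        if not result[col].isnull().any():
            continue
        if pd.api.types.is_numeric_dtype(result[col]):
            result[col] = result[col].fillna(result[col].median())
        else:
            modes = result[col].mode()
            if not modes.empty:  # an all-missing column has no mode
                result[col] = result[col].fillna(modes.iloc[0])
    return result


# Hypothetical toy data, for illustration only
demo = pd.DataFrame({
    "Số tầng": [2.0, np.nan, 4.0, 3.0],
    "Loại hình nhà ở": ["Nhà riêng", None, "Nhà riêng", "Biệt thự"],
})
filled = impute_simple(demo)
print(filled)
```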

### 4.4 Handle Outliers

In [None]:
# Visualize outliers before removal
if 'Giá' in preprocessor.df.columns:
    fig, axes = plt.subplots(1, 3, figsize=(15, 4))
    
    for idx, col in enumerate(['Giá', 'Diện tích', 'Giá/m²']):
        if col in preprocessor.df.columns:
            axes[idx].boxplot(preprocessor.df[col].dropna())
            axes[idx].set_title(f'{col} - Before Outlier Removal')
            axes[idx].set_ylabel(col)
    
    plt.tight_layout()
    plt.show()

In [None]:
# Remove outliers
preprocessor.handle_outliers(method='iqr', threshold=1.5)
print(f"Shape after removing outliers: {preprocessor.df.shape}")
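
With `method='iqr'` and `threshold=1.5`, the usual rule keeps values inside `[Q1 - 1.5 * IQR, Q3 + 1.5 * IQR]`, where `IQR = Q3 - Q1`. The sketch below applies that rule to a toy column. `iqr_bounds` and `remove_outliers_iqr` are hypothetical helpers; `handle_outliers` may select columns or treat boundary values differently.

```python
import pandas as pd


def iqr_bounds(series: pd.Series, threshold: float = 1.5) -> tuple[float, float]:
    """Return the (lower, upper) fences of the IQR rule."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - threshold * iqr, q3 + threshold * iqr


def remove_outliers_iqr(frame: pd.DataFrame, column: str,
                        threshold: float = 1.5) -> pd.DataFrame:
    """Keep rows whose value in `column` lies within the IQR fences."""
    lower, upper = iqr_bounds(frame[column], threshold)
    # between() is inclusive and evaluates to False for NaN,
    # so rows with a missing value in `column` are dropped too
    return frame[frame[column].between(lower, upper)]


# Hypothetical toy data: 100.0 is an obvious outlier
demo = pd.DataFrame({"Giá": [1.0, 2.0, 3.0, 4.0, 5.0, 100.0]})
filtered = remove_outliers_iqr(demo, "Giá")
print(iqr_bounds(demo["Giá"]))
print(filtered["Giá"].tolist())
```

A larger `threshold` widens the fences and removes fewer rows, which matters for skewed variables such as price.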

## 5. Data Visualization

In [None]:
# Price distribution
if 'Giá' in preprocessor.df.columns:
    plt.figure(figsize=(12, 5))
    
    plt.subplot(1, 2, 1)
    plt.hist(preprocessor.df['Giá'] / 1e9, bins=50, edgecolor='black')
    plt.xlabel('Giá (tỷ VNĐ)')
    plt.ylabel('Frequency')
    plt.title('Price Distribution')
    
    plt.subplot(1, 2, 2)
    plt.boxplot(preprocessor.df['Giá'] / 1e9)
    plt.ylabel('Giá (tỷ VNĐ)')
    plt.title('Price Box Plot')
    
    plt.tight_layout()
    plt.show()

In [None]:
# Price by district
if 'Quận' in preprocessor.df.columns and 'Giá' in preprocessor.df.columns:
    plt.figure(figsize=(12, 6))
    
    district_price = preprocessor.df.groupby('Quận')['Giá'].mean().sort_values(ascending=False) / 1e9
    district_price.plot(kind='bar', color='skyblue', edgecolor='black')
    plt.xlabel('Quận')
    plt.ylabel('Giá trung bình (tỷ VNĐ)')
    plt.title('Average Price by District')
    plt.xticks(rotation=45)
    plt.tight_layout()
    plt.show()

In [None]:
# Correlation matrix for numerical features
numerical_cols = preprocessor.df.select_dtypes(include=[np.number]).columns
if len(numerical_cols) > 1:  # a correlation matrix needs at least two numeric columns
    plt.figure(figsize=(10, 8))
    correlation = preprocessor.df[numerical_cols].corr()
    sns.heatmap(correlation, annot=True, fmt='.2f', cmap='coolwarm', center=0, square=True)
    plt.title('Correlation Matrix')
    plt.tight_layout()
    plt.show()
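
A heatmap can be hard to read when there are many columns. One complementary view ranks features by the strength of their correlation with the target. The sketch below assumes `Giá` is the target and uses synthetic data in which price depends on area by construction; the numbers are not from the real dataset.

```python
import numpy as np
import pandas as pd

# Synthetic illustration: price is built from area plus noise
rng = np.random.default_rng(0)
area = rng.uniform(30, 200, 200)
demo = pd.DataFrame({
    "Diện tích": area,
    "Giá": area * 50e6 + rng.normal(0, 5e8, 200),
    "Số tầng": rng.integers(1, 6, 200),
})

# Correlation of each numeric feature with the target, strongest first
target_corr = (
    demo.corr(numeric_only=True)["Giá"]
    .drop("Giá")
    .sort_values(key=abs, ascending=False)
)
print(target_corr)
```

Note that the fallback sample data in this notebook draws every column independently, so its correlations will be close to zero.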

## 6. Save Processed Data

In [None]:
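# Save the cleaned data (before encoding)
output_path = '../data/processed_housing_data.csv'

# Make sure the output directory exists before writing
Path(output_path).parent.mkdir(parents=True, exist_ok=True)

preprocessor.df_processed = preprocessor.df.copy()
preprocessor.save_processed_data(output_path)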

print(f"\nFinal dataset shape: {preprocessor.df_processed.shape}")
print("\nData preprocessing completed successfully! ✓")
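
Because the column names and district values contain Vietnamese characters, it can be worth checking that the saved CSV reads back unchanged. How `save_processed_data` writes the file is not shown here, so the sketch below uses plain `to_csv` on a toy frame in a temporary directory. The `utf-8-sig` encoding and `index=False` are assumptions chosen for illustration; writing without the index avoids an extra `Unnamed: 0` column on reload.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical toy data, for illustration only
demo = pd.DataFrame({
    "Quận": ["Đống Đa", "Tây Hồ"],
    "Giá/m²": [95e6, 120e6],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_file = Path(tmp_dir) / "processed" / "demo.csv"
    out_file.parent.mkdir(parents=True, exist_ok=True)

    # index=False keeps the row index out of the file
    demo.to_csv(out_file, index=False, encoding="utf-8-sig")
    reloaded = pd.read_csv(out_file, encoding="utf-8-sig")

print(reloaded.columns.tolist())
print(reloaded["Quận"].tolist())
```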

## Summary

In this notebook, we:
1. ✓ Loaded raw housing data
2. ✓ Removed unnecessary columns
3. ✓ Handled duplicate records
4. ✓ Handled missing values using the rules defined in `HousingDataPreprocessor.handle_missing_values()`
5. ✓ Removed outliers using the IQR method (threshold 1.5)
6. ✓ Visualized data distributions and relationships
7. ✓ Saved processed data for model training

**Next Step**: Use `02_model_training.ipynb` to train and evaluate ML models.