# Nhatot Real Estate Data Preprocessing
## Data Preprocessing for Nhatot Housing Dataset

This notebook performs the following steps:
1. 📂 Load the data
2. 🧹 Clean the data
3. 🔍 Analyze missing values
4. 🔧 Handle missing values
5. ⚙️ Feature engineering
6. 🏷️ Encode categorical features
7. 📏 Scale features (optional)
8. 💾 Save the processed data

**Note:** The original file `nhatot_crawl4ai.csv` is NOT modified!

## 1. Import Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder, StandardScaler, MinMaxScaler
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings('ignore')

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
plt.style.use('seaborn-v0_8-darkgrid')

print("✓ Libraries imported successfully!")

## 2. Load the Data

In [None]:
# Load the data
file_path = 'nhatot_crawl4ai.csv'
df = pd.read_csv(file_path)

# Keep a copy so the original data is preserved
df_original = df.copy()

print("📂 Loaded data:")
print(f"   Rows: {len(df):,}")
print(f"   Columns: {len(df.columns)}")
print("   ✓ Created backup of original data")
print(f"\nColumns: {list(df.columns)}")

In [None]:
# Show the first 5 rows
df.head()

In [None]:
# Overview of dtypes and non-null counts
df.info()

## 3. Clean the Data

### 3.1. Remove Completely Empty Rows

In [None]:
print("🧹 Cleaning empty rows...")
initial_count = len(df)

# Drop rows in which every value is NaN
df = df.dropna(how='all')

removed = initial_count - len(df)
print(f"✓ Removed {removed:,} empty rows")
print(f"  Remaining: {len(df):,} rows")

### 3.2. Remove Duplicate Rows

In [None]:
print("🔍 Removing duplicates...")
initial_count = len(df)

df = df.drop_duplicates()

removed = initial_count - len(df)
print(f"✓ Removed {removed:,} duplicate rows")
print(f"  Remaining: {len(df):,} rows")

## 4. Analyze Missing Values

In [None]:
# Count and percentage of missing values per column
missing_stats = pd.DataFrame({
    'Column': df.columns,
    'Missing_Count': df.isnull().sum(),
    'Missing_Percentage': (df.isnull().sum() / len(df) * 100).round(2)
})

missing_stats = missing_stats[missing_stats['Missing_Count'] > 0].sort_values(
    'Missing_Percentage', ascending=False
)

print("📊 Missing Values Analysis:")
print("=" * 60)
print(missing_stats.to_string(index=False))
print("=" * 60)

In [None]:
# Visualize missing values (create the figure only when there is something to plot)
if len(missing_stats) > 0:
    plt.figure(figsize=(12, 6))
    sns.barplot(data=missing_stats, x='Column', y='Missing_Percentage', palette='viridis')
    plt.xticks(rotation=45, ha='right')
    plt.title('Missing Values by Column (%)', fontsize=14, fontweight='bold')
    plt.ylabel('Missing Percentage (%)')
    plt.xlabel('Column Name')
    plt.tight_layout()
    plt.show()
else:
    print("No missing values found!")

## 5. Process the Price Column

In [None]:
def parse_price(price_str):
    """
    Convert a Vietnamese price string to a numeric value in VND.

    Examples:
        "1,5 tỷ" -> 1500000000     (tỷ = billion)
        "500 triệu" -> 500000000   (triệu = million)
    """
    if pd.isna(price_str):
        return np.nan

    price_str = str(price_str).strip().replace('"', '')

    try:
        if 'tỷ' in price_str:
            value = price_str.replace('tỷ', '').strip().replace(',', '.')
            return float(value) * 1_000_000_000
        elif 'triệu' in price_str:
            value = price_str.replace('triệu', '').strip().replace(',', '.')
            return float(value) * 1_000_000
        else:
            value = price_str.replace(',', '.')
            return float(value)
    except ValueError:
        # Anything that does not parse as a number is treated as missing
        return np.nan

# Apply the conversion ('Giá bán' is the raw sale-price column)
print("💰 Processing price column...")
df['Giá bán (VND)'] = df['Giá bán'].apply(parse_price)

# Drop rows whose price could not be parsed
initial_count = len(df)
df = df.dropna(subset=['Giá bán (VND)'])
removed = initial_count - len(df)

print("✓ Converted prices to numeric")
print(f"  Removed {removed:,} rows with invalid prices")
print(f"  Price range: {df['Giá bán (VND)'].min():,.0f} - {df['Giá bán (VND)'].max():,.0f} VND")
print(f"  Mean price: {df['Giá bán (VND)'].mean():,.0f} VND")
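
The unit handling above can be checked in isolation. The following is a minimal sketch, not the notebook's own parser: `parse_price_strict`, `UNIT_MULTIPLIERS` and `PRICE_PATTERN` are hypothetical names, and the regular expression assumes a price is a single number followed by at most one unit. Compound strings such as "1 tỷ 200 triệu" are deliberately rejected rather than guessed at.

```python
import math
import re

# Multipliers for the two units the notebook handles ("tỷ" = billion, "triệu" = million).
UNIT_MULTIPLIERS = {"tỷ": 1_000_000_000, "triệu": 1_000_000}

# One number (comma or dot as the decimal mark) followed by an optional unit.
PRICE_PATTERN = re.compile(r"^\s*(\d+(?:[.,]\d+)?)\s*(tỷ|triệu)?\s*$")


def parse_price_strict(text):
    """Hypothetical strict variant: return NaN for anything outside the expected format."""
    if text is None:
        return math.nan
    match = PRICE_PATTERN.match(str(text).replace('"', ""))
    if match is None:
        return math.nan
    number, unit = match.groups()
    # A missing unit means the value is already in VND.
    return float(number.replace(",", ".")) * UNIT_MULTIPLIERS.get(unit, 1)


examples = ["1,5 tỷ", "500 triệu", "2.5", "1 tỷ 200 triệu", None]
parsed = [parse_price_strict(value) for value in examples]
```

Making the accepted format explicit this way means unexpected listings end up as NaN and are dropped by the `dropna` step, instead of being silently misread.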

In [None]:
# Visualize price distribution
fig, axes = plt.subplots(1, 2, figsize=(15, 5))

# Histogram
axes[0].hist(df['Giá bán (VND)'] / 1e9, bins=50, edgecolor='black', alpha=0.7)
axes[0].set_xlabel('Price (billion VND)')
axes[0].set_ylabel('Frequency')
axes[0].set_title('Price Distribution')

# Box plot
axes[1].boxplot(df['Giá bán (VND)'] / 1e9)
axes[1].set_ylabel('Price (billion VND)')
axes[1].set_title('Price Box Plot')

plt.tight_layout()
plt.show()

## 6. Process the Numeric Columns

In [None]:
import re

def parse_numeric_column(val):
    """Parse numeric columns that may contain special values."""
    if pd.isna(val):
        return np.nan

    val_str = str(val).strip().lower()

    # Handle open-ended values such as "nhiều hơn 6" ("more than 6"):
    # take the first number and add 1, so "nhiều hơn 6" becomes 7
    if 'hơn' in val_str:
        numbers = re.findall(r'\d+', val_str)
        if numbers:
            return float(numbers[0]) + 1
        return np.nan

    try:
        return float(val_str.replace(',', '.'))
    except ValueError:
        return np.nan

# Apply to the numeric columns
print("🔢 Processing numeric columns...")
numeric_cols = [
    'Diện tích (m2)',    # area
    'Chiều ngang (m)',   # width
    'Chiều dài (m)',     # length
    'Số phòng ngủ',      # bedrooms
    'Số phòng vệ sinh',  # bathrooms
    'Số tầng'            # floors
]

for col in numeric_cols:
    if col in df.columns:
        df[col] = df[col].apply(parse_numeric_column)
        print(f"  ✓ Cleaned {col}")

print("✓ All numeric columns processed")
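
For larger files, the same rules could be applied column-wise instead of row by row. This is only a sketch on a toy Series (`raw` and its values are made up for illustration); it assumes the same conventions as the function above: a comma as the decimal mark, and "hơn" ("more than") meaning the stated number plus one.

```python
import pandas as pd

# Toy input covering the cases the row-wise parser handles.
raw = pd.Series(["3", "4,5", "nhiều hơn 6", "abc", None])

# Rows that express an open-ended "more than N" value.
more_than = raw.str.contains("hơn", na=False)

# First run of digits in each string (NaN when there is none).
digits = raw.str.extract(r"(\d+)", expand=False)

# Plain numbers: swap the decimal comma, coerce anything unparsable to NaN.
plain = pd.to_numeric(raw.str.replace(",", ".", regex=False), errors="coerce")

# Keep the plain value, except for "more than N" rows, which become N + 1.
parsed = plain.where(~more_than, pd.to_numeric(digits, errors="coerce") + 1)
```

`pd.to_numeric(..., errors="coerce")` plays the role of the `try`/`except` in the row-wise version: values that cannot be converted become NaN.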

## 7. Handle Missing Values

### Strategy:
- **Numeric columns**: fill with the median
- **Categorical columns**: fill with the mode, or with "Không rõ" ("unknown") when no mode exists
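
The strategy can be illustrated on a toy frame before it is applied to the real data. This is a minimal sketch: the frame `toy` and its column names are invented for the example and do not come from the dataset.

```python
import numpy as np
import pandas as pd

# Toy data: one numeric and one categorical column, each with a gap.
toy = pd.DataFrame({
    "area": [40.0, np.nan, 60.0, 100.0],
    "direction": ["east", None, "east", "west"],
})

filled = toy.copy()

# Numeric column: the median of the observed values (40, 60, 100) is 60.
filled["area"] = filled["area"].fillna(filled["area"].median())

# Categorical column: most frequent value, with a fallback when no mode exists.
mode = filled["direction"].mode()
fallback = mode.iloc[0] if not mode.empty else "Không rõ"
filled["direction"] = filled["direction"].fillna(fallback)
```

The median is less sensitive than the mean to a few very large properties, which is presumably why it is used for the numeric columns here.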

In [None]:
print("🔧 Handling missing values...")

# 1. Numeric columns: fill with median
numeric_cols_in_df = df.select_dtypes(include=[np.number]).columns
for col in numeric_cols_in_df:
    if df[col].isnull().sum() > 0:
        median_val = df[col].median()
        # Assign back rather than calling fillna(inplace=True) on a column
        # selection, which may not update df under pandas copy-on-write
        df[col] = df[col].fillna(median_val)
        print(f"  ✓ Filled {col} with median: {median_val:.2f}")

# 2. Categorical columns: fill with mode or 'Không rõ' ("unknown")
categorical_cols = df.select_dtypes(include=['object']).columns
for col in categorical_cols:
    if col == 'Giá bán':  # Skip original price column
        continue
    if df[col].isnull().sum() > 0:
        mode_val = df[col].mode()
        if len(mode_val) > 0:
            df[col] = df[col].fillna(mode_val[0])
            print(f"  ✓ Filled {col} with mode: {mode_val[0]}")
        else:
            df[col] = df[col].fillna('Không rõ')
            print(f"  ✓ Filled {col} with 'Không rõ'")

print(f"\n✓ Missing values handled. Remaining rows: {len(df):,}")

In [None]:
# Re-check missing values
remaining_missing = df.isnull().sum().sum()
print(f"Remaining missing values: {remaining_missing}")

if remaining_missing > 0:
    print("\nColumns that still have missing values:")
    print(df.isnull().sum()[df.isnull().sum() > 0])

## 8. Feature Engineering

Create new features from the existing data.

In [None]:
print("⚙️  Feature Engineering...")

# 1. Price per square meter
if 'Diện tích (m2)' in df.columns and 'Giá bán (VND)' in df.columns:
    # Treat a zero area as missing so the division cannot produce inf
    df['Giá/m2'] = df['Giá bán (VND)'] / df['Diện tích (m2)'].replace(0, np.nan)
    print("  ✓ Created 'Giá/m2' (price per sqm)")

# 2. Total rooms (bedrooms + bathrooms)
if 'Số phòng ngủ' in df.columns and 'Số phòng vệ sinh' in df.columns:
    df['Tổng số phòng'] = df['Số phòng ngủ'] + df['Số phòng vệ sinh']
    print("  ✓ Created 'Tổng số phòng' (total rooms)")

# 3. Area from dimensions (width x length)
if 'Chiều ngang (m)' in df.columns and 'Chiều dài (m)' in df.columns:
    df['Diện tích ước tính'] = df['Chiều ngang (m)'] * df['Chiều dài (m)']
    print("  ✓ Created 'Diện tích ước tính' (estimated area)")

# 4. Property size category
if 'Diện tích (m2)' in df.columns:
    def categorize_size(area):
        if pd.isna(area):
            return 'Không rõ'    # unknown
        if area < 30:
            return 'Rất nhỏ'     # very small
        elif area < 50:
            return 'Nhỏ'         # small
        elif area < 80:
            return 'Trung bình'  # medium
        elif area < 150:
            return 'Lớn'         # large
        else:
            return 'Rất lớn'     # very large

    df['Kích thước'] = df['Diện tích (m2)'].apply(categorize_size)
    print("  ✓ Created 'Kích thước' (size category)")

print("\n✓ Feature engineering completed!")
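
The size thresholds used above (30, 50, 80 and 150 m²) can also be expressed with `pd.cut`, which bins a whole column at once. The sketch below is an alternative, not what the notebook runs; the English labels and the toy `areas` Series are placeholders chosen for the example.

```python
import numpy as np
import pandas as pd

# Toy areas, including boundary values and a missing entry.
areas = pd.Series([10, 30, 49.9, 80, 200, np.nan])

# Same thresholds as categorize_size; right=False makes each bin [low, high).
bins = [-np.inf, 30, 50, 80, 150, np.inf]
labels = ["very_small", "small", "medium", "large", "very_large"]
size = pd.cut(areas, bins=bins, labels=labels, right=False)

# Missing areas get their own category instead of staying NaN.
size = size.cat.add_categories("unknown").fillna("unknown")
```

With `right=False` an area of exactly 30 falls in the "small" bin, matching the `area < 30` test in `categorize_size`.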

In [None]:
# Visualize new features
if 'Kích thước' in df.columns:
    plt.figure(figsize=(10, 6))
    size_counts = df['Kích thước'].value_counts()
    plt.pie(size_counts, labels=size_counts.index, autopct='%1.1f%%', startangle=90)
    plt.title('Distribution by Property Size', fontsize=14, fontweight='bold')
    plt.axis('equal')
    plt.show()

## 9. Encoding Categorical Features

Convert the categorical variables to numeric codes.

In [None]:
print("🏷️  Encoding categorical features...")

categorical_cols = [
    'Thành phố',            # city
    'Phường/Xã',            # ward/commune
    'Loại hình',            # property type
    'Giấy tờ pháp lý',      # legal documents
    'Hướng',                # direction
    'Tình trạng nội thất'   # furnishing status
]

label_encoders = {}

for col in categorical_cols:
    if col in df.columns:
        le = LabelEncoder()
        # Handle NaN by treating it as a separate category ('Không rõ' = unknown)
        df[col] = df[col].fillna('Không rõ')
        df[f'{col}_encoded'] = le.fit_transform(df[col])
        label_encoders[col] = le
        n_categories = len(le.classes_)
        print(f"  ✓ Encoded {col} ({n_categories} categories)")

print("\n✓ All categorical features encoded!")

In [None]:
# Show the mapping for one of the categorical features
print("\nExample: Loại hình encoding")
if 'Loại hình' in df.columns:
    mapping_df = pd.DataFrame({
        'Original': label_encoders['Loại hình'].classes_,
        'Encoded': range(len(label_encoders['Loại hình'].classes_))
    })
    print(mapping_df)
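
`LabelEncoder` assigns integer codes in sorted order of the class names, so the codes carry an arbitrary ordering that some models, linear ones in particular, may interpret as meaningful. A one-hot encoding avoids that. The sketch below contrasts the two on a toy column using only pandas; the frame `toy` and the `property_type` values are invented for illustration.

```python
import pandas as pd

# Toy categorical column.
toy = pd.DataFrame({"property_type": ["house", "apartment", "house", "land"]})

# Integer codes in sorted class order, the same scheme LabelEncoder uses.
codes, uniques = pd.factorize(toy["property_type"], sort=True)

# One-hot encoding: one 0/1 column per category, no implied order.
one_hot = pd.get_dummies(toy["property_type"], prefix="type", dtype=int)
```

Integer codes are usually acceptable for tree-based models; one-hot columns are the safer default for linear models, at the cost of one column per category (which can be a lot for a column like the ward/commune).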

## 10. Scaling Features (Optional)

**Note:** You can skip this step if you do not need scaling.

In [None]:
# Uncomment to run scaling
# print("📏 Scaling features...")

# scaler = StandardScaler()
# cols_to_scale = ['Diện tích (m2)', 'Chiều ngang (m)', 'Chiều dài (m)', 'Số phòng ngủ', 'Số phòng vệ sinh', 'Số tầng']

# for col in cols_to_scale:
#     if col in df.columns:
#         df[f'{col}_scaled'] = scaler.fit_transform(df[[col]])
#         print(f"  ✓ Scaled {col}")

# print("\n✓ Features scaled!")
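
If scaling is used together with a train/test split, fitting the scaler on all rows lets test-set statistics leak into the training features. A common remedy is to compute the statistics on the training rows only and reuse them for the test rows. The sketch below shows the idea with plain pandas arithmetic on made-up numbers; it uses the population standard deviation (`ddof=0`), which is the formula `StandardScaler` applies.

```python
import pandas as pd

# Made-up training and test values for a single feature.
train = pd.DataFrame({"area": [40.0, 60.0, 80.0]})
test = pd.DataFrame({"area": [100.0]})

# Statistics come from the training rows only.
mean = train["area"].mean()
std = train["area"].std(ddof=0)

# The same shift and scale are applied to both sets.
train_scaled = (train["area"] - mean) / std
test_scaled = (test["area"] - mean) / std
```

With scikit-learn the equivalent pattern is `fit` (or `fit_transform`) on the training data followed by `transform` on the test data.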

## 11. Summary Statistics

In [None]:
print("📈 Summary Statistics:")
print("=" * 80)

# Numeric features
print("\nNumeric Features:")
numeric_cols_display = df.select_dtypes(include=[np.number]).columns
df[numeric_cols_display].describe()

In [None]:
# Categorical features
print("\nCategorical Features (Top 5 values):")
categorical_cols_display = ['Thành phố', 'Loại hình', 'Giấy tờ pháp lý', 'Tình trạng nội thất']

for col in categorical_cols_display:
    if col in df.columns:
        print(f"\n{col}:")
        print(df[col].value_counts().head())

## 12. Visualization: Correlation Matrix

In [None]:
# Correlation matrix
plt.figure(figsize=(14, 10))

# Select key numeric columns
key_cols = ['Giá bán (VND)', 'Diện tích (m2)', 'Chiều ngang (m)', 'Chiều dài (m)',
            'Số phòng ngủ', 'Số phòng vệ sinh', 'Số tầng', 'Giá/m2', 'Tổng số phòng']
key_cols = [col for col in key_cols if col in df.columns]

corr_matrix = df[key_cols].corr()

sns.heatmap(corr_matrix, annot=True, fmt='.2f', cmap='coolwarm', center=0,
            square=True, linewidths=1, cbar_kws={"shrink": 0.8})
plt.title('Correlation Matrix of Key Features', fontsize=16, fontweight='bold', pad=20)
plt.tight_layout()
plt.show()

## 13. Save the Processed Data

In [None]:
# Save the processed data
output_file = 'nhatot_crawl4ai_processed.csv'
backup_file = 'nhatot_crawl4ai_original_backup.csv'

print("💾 Saving files...")

# Save processed data
df.to_csv(output_file, index=False, encoding='utf-8-sig')
print(f"✓ Processed data: {output_file}")
print(f"  → {len(df):,} rows and {len(df.columns)} columns")

# Save original data backup
df_original.to_csv(backup_file, index=False, encoding='utf-8-sig')
print(f"\n✓ Original data backup: {backup_file}")
print(f"  → {len(df_original):,} rows and {len(df_original.columns)} columns")

print("\n📁 Files saved:")
print(f"  • Original file (unchanged): {file_path}")
print(f"  • Processed file (new): {output_file}")
print(f"  • Backup file (new): {backup_file}")

## 14. Prepare for Machine Learning (Optional)

In [None]:
# Prepare for modeling
print("🎯 Preparing data for modeling...")

# Select feature columns (encoded and numeric)
feature_cols = []
for col in df.columns:
    if '_encoded' in col:
        feature_cols.append(col)
    elif col in ['Diện tích (m2)', 'Chiều ngang (m)', 'Chiều dài (m)',
                 'Số phòng ngủ', 'Số phòng vệ sinh', 'Số tầng', 'Tổng số phòng']:
        feature_cols.append(col)

X = df[feature_cols].fillna(0)
y = df['Giá bán (VND)']

# Split train/test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(f"✓ Train set: {len(X_train):,} samples")
print(f"✓ Test set: {len(X_test):,} samples")
print(f"✓ Features: {len(feature_cols)}")
print(f"\nFeature columns: {feature_cols}")

## 15. Conclusion

### ✅ Completed:
1. Loaded and backed up the original data
2. Cleaned the data (empty rows, duplicates)
3. Handled missing values
4. Converted prices from text to numbers
5. Processed the numeric columns
6. Feature engineering (created new features)
7. Encoded categorical features
8. Saved the processed data

### 📊 Results:
- **Original file**: unchanged ✓
- **Processed file**: ready for modeling ✓
- **Backup file**: saved safely ✓

### 🎓 Next steps:
You can use `X_train`, `X_test`, `y_train`, `y_test` to train models such as:
- Linear Regression
- Random Forest
- XGBoost
- Neural Networks
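
As a starting point for the first option, the sketch below fits a linear-regression baseline with NumPy least squares and reports the mean absolute error on held-out rows. Everything in it is an assumption made for illustration: the data is synthetic and noiseless, and the `_demo` names are used so that running it does not overwrite the notebook's `X_train` and related variables.

```python
import numpy as np

# Synthetic stand-in for the feature matrix and target (not the real dataset).
rng = np.random.default_rng(0)
X_demo = rng.uniform(20, 200, size=(100, 2))
y_demo = 3.0 * X_demo[:, 0] + 2.0 * X_demo[:, 1] + 5.0

# Simple 80/20 split by position.
X_train_demo, X_test_demo = X_demo[:80], X_demo[80:]
y_train_demo, y_test_demo = y_demo[:80], y_demo[80:]

# Append a column of ones so the last coefficient acts as the intercept.
A_train = np.column_stack([X_train_demo, np.ones(len(X_train_demo))])
coef, *_ = np.linalg.lstsq(A_train, y_train_demo, rcond=None)

# Evaluate on the held-out rows.
A_test = np.column_stack([X_test_demo, np.ones(len(X_test_demo))])
pred_demo = A_test @ coef
mae_demo = np.mean(np.abs(pred_demo - y_test_demo))
```

On real listings the error will be far from zero; the value of a baseline like this is that it gives the more complex models a number to beat.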