# 🏠 House Price Prediction - Exploratory Data Analysis (EDA)

## VietAI - Foundations of Machine Learning Final Project

**Objectives:**
- Understand the structure and characteristics of the House Prices dataset
- Analyze descriptive statistics of the variables
- Detect missing values and outliers
- Visualize correlations between variables

**Dataset:** [Kaggle House Prices Competition](https://www.kaggle.com/c/house-prices-advanced-regression-techniques)


In [1]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import warnings
import os

# Settings
warnings.filterwarnings('ignore')
plt.style.use('seaborn-v0_8-whitegrid')
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)

# Create directories
os.makedirs('../data/raw', exist_ok=True)
os.makedirs('../data/processed', exist_ok=True)
os.makedirs('../reports', exist_ok=True)
os.makedirs('../models', exist_ok=True)

# Custom color palette
COLORS = {
    'primary': '#2E86AB',
    'secondary': '#A23B72',
    'accent': '#F18F01',
    'success': '#C73E1D',
    'dark': '#3B1F2B'
}

print("✅ Libraries imported successfully!")


✅ Libraries imported successfully!


## 1. Load Data

**Note:** Download the data from [Kaggle](https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data) and place it in the `../data/raw/` directory.


In [4]:
# Load training data
train_path = '../data/raw/train.csv'
test_path = '../data/raw/test.csv'

try:
    train_df = pd.read_csv(train_path)
    test_df = pd.read_csv(test_path)
    print(f"✅ Training data loaded: {train_df.shape[0]} rows, {train_df.shape[1]} columns")
    print(f"✅ Test data loaded: {test_df.shape[0]} rows, {test_df.shape[1]} columns")
except FileNotFoundError:
    print("⚠️ Data files not found!")
    print("Please download data from: https://www.kaggle.com/c/house-prices-advanced-regression-techniques")
    print("And place train.csv and test.csv in ../data/raw/ folder")


⚠️ Data files not found!
Please download data from: https://www.kaggle.com/c/house-prices-advanced-regression-techniques
And place train.csv and test.csv in ../data/raw/ folder


In [5]:
# First look at the data
train_df.head(10)


NameError: name 'train_df' is not defined
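
The `NameError` above follows from the loading cell: its `except` branch prints a hint but leaves `train_df` undefined, so every later cell fails. One way to surface the problem at the point of loading is sketched below. `load_csv_checked` is a hypothetical helper introduced here for illustration; it is not part of the original notebook.

```python
from pathlib import Path

import pandas as pd


def load_csv_checked(path):
    """Read a CSV file, or raise FileNotFoundError naming the missing path.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(
            f"{path} not found. Download train.csv and test.csv from the "
            "Kaggle competition page and place them in ../data/raw/."
        )
    return pd.read_csv(path)


# Possible usage in the notebook (paths as in the loading cell):
# train_df = load_csv_checked('../data/raw/train.csv')
# test_df = load_csv_checked('../data/raw/test.csv')
```

With this approach the run stops at the loading cell with an error that names the missing file, instead of at the first cell that touches `train_df`.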

In [None]:
# Data info
print("="*60)
print("DATA INFO")
print("="*60)
train_df.info()


## 2. Descriptive Statistics


In [None]:
# Numerical columns statistics
print("📊 NUMERICAL VARIABLE STATISTICS")
print("="*80)
train_df.describe().T


In [None]:
# Categorical columns
categorical_cols = train_df.select_dtypes(include=['object']).columns
numerical_cols = train_df.select_dtypes(include=[np.number]).columns

print(f"\n📌 Number of numerical variables: {len(numerical_cols)}")
print(f"📌 Number of categorical variables: {len(categorical_cols)}")
print(f"\n🔢 Numerical variables: {list(numerical_cols[:10])}...")
print(f"\n📝 Categorical variables: {list(categorical_cols[:10])}...")


In [None]:
# Target variable (SalePrice) statistics
print("\n🎯 TARGET VARIABLE: SalePrice")
print("="*50)
print(train_df['SalePrice'].describe())
print(f"\n📈 Skewness: {train_df['SalePrice'].skew():.4f}")
print(f"📊 Kurtosis: {train_df['SalePrice'].kurtosis():.4f}")


## 3. Missing Values Analysis


In [None]:
# Missing values analysis
def analyze_missing_values(df, name="Dataset"):
    """Analyze missing values in a DataFrame."""
    missing = df.isnull().sum()
    missing_pct = (missing / len(df)) * 100
    
    missing_df = pd.DataFrame({
        'Column': missing.index,
        'Missing Count': missing.values,
        'Missing %': missing_pct.values
    })
    
    missing_df = missing_df[missing_df['Missing Count'] > 0]
    missing_df = missing_df.sort_values('Missing %', ascending=False)
    
    print(f"\n🔍 {name} - MISSING VALUES ANALYSIS")
    print("="*60)
    print(f"Total columns with missing values: {len(missing_df)}")
    print(f"Total missing values: {missing_df['Missing Count'].sum()}")
    
    return missing_df

missing_train = analyze_missing_values(train_df, "Training Data")
missing_train.head(20)
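
To make the count and percentage logic concrete, here is a small sketch of the same computation on a toy frame. The values are made up for illustration and are not taken from the Kaggle data; only the column names mirror the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame with known gaps (made-up values, not the Kaggle data)
toy = pd.DataFrame({
    'PoolQC': [np.nan, np.nan, np.nan, 'Gd'],
    'LotFrontage': [65.0, np.nan, 80.0, 70.0],
    'LotArea': [8450, 9600, 11250, 9550],
})

# Per-column count of nulls and its share of all rows
missing_count = toy.isnull().sum()
missing_pct = missing_count / len(toy) * 100

summary = pd.DataFrame({'Missing Count': missing_count, 'Missing %': missing_pct})
summary = summary[summary['Missing Count'] > 0]          # keep affected columns only
summary = summary.sort_values('Missing %', ascending=False)
print(summary)
```

In this toy frame `PoolQC` is missing in 3 of 4 rows (75%) and `LotFrontage` in 1 of 4 (25%), while `LotArea` is complete and therefore dropped from the summary.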


In [None]:
# Visualize missing values
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Bar chart of missing values
if len(missing_train) > 0:
    top_missing = missing_train.head(15)
    ax1 = axes[0]
    bars = ax1.barh(top_missing['Column'], top_missing['Missing %'], 
                    color=COLORS['primary'], edgecolor=COLORS['dark'])
    ax1.set_xlabel('Missing Percentage (%)', fontsize=12)
    ax1.set_title('Top 15 Columns with Missing Values', fontsize=14, fontweight='bold')
    ax1.invert_yaxis()
    
    # Add value labels
    for bar, pct in zip(bars, top_missing['Missing %']):
        ax1.text(bar.get_width() + 0.5, bar.get_y() + bar.get_height()/2, 
                f'{pct:.1f}%', va='center', fontsize=10)

# Heatmap of missing values
ax2 = axes[1]
cols_with_missing = missing_train['Column'].tolist()[:20]
if cols_with_missing:
    msno_data = train_df[cols_with_missing].isnull()
    sns.heatmap(msno_data.T, cbar=True, ax=ax2, cmap='YlOrRd')
    ax2.set_title('Missing Values Pattern', fontsize=14, fontweight='bold')
    ax2.set_xlabel('Sample Index')

plt.tight_layout()
plt.savefig('../reports/missing_values.png', dpi=150, bbox_inches='tight')
plt.show()


## 4. Target Variable Analysis


In [None]:
# Target variable distribution
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# Histogram
ax1 = axes[0]
ax1.hist(train_df['SalePrice'], bins=50, color=COLORS['primary'], 
         edgecolor=COLORS['dark'], alpha=0.7)
ax1.axvline(train_df['SalePrice'].mean(), color=COLORS['secondary'], 
            linestyle='--', linewidth=2, label=f'Mean: ${train_df["SalePrice"].mean():,.0f}')
ax1.axvline(train_df['SalePrice'].median(), color=COLORS['accent'], 
            linestyle='--', linewidth=2, label=f'Median: ${train_df["SalePrice"].median():,.0f}')
ax1.set_xlabel('SalePrice ($)', fontsize=12)
ax1.set_ylabel('Frequency', fontsize=12)
ax1.set_title('House Price Distribution', fontsize=14, fontweight='bold')
ax1.legend()

# Log-transformed
ax2 = axes[1]
log_price = np.log1p(train_df['SalePrice'])
ax2.hist(log_price, bins=50, color=COLORS['secondary'], 
         edgecolor=COLORS['dark'], alpha=0.7)
ax2.set_xlabel('Log(SalePrice)', fontsize=12)
ax2.set_ylabel('Frequency', fontsize=12)
ax2.set_title('Log-Transformed Distribution', fontsize=14, fontweight='bold')

# Q-Q Plot
ax3 = axes[2]
stats.probplot(log_price, dist="norm", plot=ax3)
ax3.set_title('Q-Q Plot (Log SalePrice)', fontsize=14, fontweight='bold')

plt.tight_layout()
plt.savefig('../reports/target_distribution.png', dpi=150, bbox_inches='tight')
plt.show()

print(f"\n📊 Original skewness: {train_df['SalePrice'].skew():.4f}")
print(f"📊 Skewness after log transform: {log_price.skew():.4f}")
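
The effect of `np.log1p` on skewness can be checked without the Kaggle files. The sketch below uses a synthetic log-normal sample as a stand-in for right-skewed prices; the distribution parameters are arbitrary assumptions, not estimates from the real data.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed "prices" (log-normal draw; not the Kaggle data)
rng = np.random.default_rng(0)
prices = pd.Series(rng.lognormal(mean=12.0, sigma=0.4, size=5000))

# Same transform as in the cell above: log(1 + x)
log_prices = np.log1p(prices)

print(f"Skewness before: {prices.skew():.3f}")
print(f"Skewness after:  {log_prices.skew():.3f}")

# Values on the log scale map back to the original scale with expm1
restored = np.expm1(log_prices)
print("Round trip OK:", np.allclose(restored, prices))
```

If a model is later trained on the log-transformed target, its predictions would need `np.expm1` applied before being read as dollar amounts.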


## 5. Correlation Analysis


In [None]:
# Correlation with SalePrice
correlations = train_df.select_dtypes(include=[np.number]).corr()['SalePrice'].sort_values(ascending=False)

print("🔗 TOP 15 VARIABLES MOST CORRELATED WITH SALEPRICE")
print("="*50)
for col, corr in correlations.head(16).items():
    if col != 'SalePrice':
        print(f"{col:25s}: {corr:.4f}")
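
The cell above sorts signed correlations in descending order, so a strongly negative correlation would land at the bottom of the list. A possible variant, sketched on synthetic data with made-up column names, drops the target explicitly and ranks by absolute value instead.

```python
import numpy as np
import pandas as pd

# Synthetic data: price depends on quality and area, not on noise_col
rng = np.random.default_rng(42)
n = 500
quality = rng.integers(1, 11, size=n)     # stand-in for a 1-10 quality score
area = rng.normal(1500, 400, size=n)      # stand-in for living area
noise_col = rng.normal(0, 1, size=n)      # unrelated column

toy = pd.DataFrame({
    'quality': quality,
    'area': area,
    'noise_col': noise_col,
    'price': 20000 * quality + 50 * area + rng.normal(0, 20000, size=n),
})

# Pearson correlation with the target, target row removed,
# ranked by magnitude so negative relationships are not pushed to the end
ranking = (
    toy.corr()['price']
    .drop('price')
    .sort_values(key=abs, ascending=False)
)
print(ranking)
```

Pearson correlation only captures linear association, so a low value here does not rule out a nonlinear relationship with the target.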


In [None]:
# Correlation heatmap for top features
top_corr_features = correlations.head(12).index.tolist()

fig, ax = plt.subplots(figsize=(12, 10))
corr_matrix = train_df[top_corr_features].corr()
mask = np.triu(np.ones_like(corr_matrix, dtype=bool))

sns.heatmap(corr_matrix, mask=mask, annot=True, fmt='.2f', 
            cmap='RdYlBu_r', center=0, linewidths=0.5,
            square=True, ax=ax)

ax.set_title('Correlation Matrix - Top Features', fontsize=16, fontweight='bold', pad=20)
plt.tight_layout()
plt.savefig('../reports/correlation_heatmap.png', dpi=150, bbox_inches='tight')
plt.show()


In [None]:
# Scatter plots for top correlated features
top_features = ['OverallQual', 'GrLivArea', 'GarageCars', 'TotalBsmtSF', 
                'FullBath', 'YearBuilt']

fig, axes = plt.subplots(2, 3, figsize=(16, 10))
axes = axes.flatten()

for i, feature in enumerate(top_features):
    ax = axes[i]
    ax.scatter(train_df[feature], train_df['SalePrice'], 
               alpha=0.5, c=COLORS['primary'], edgecolors=COLORS['dark'], linewidth=0.5)
    
    # Add trend line
    valid_mask = train_df[feature].notna()
    z = np.polyfit(train_df.loc[valid_mask, feature], 
                   train_df.loc[valid_mask, 'SalePrice'], 1)
    p = np.poly1d(z)
    x_line = np.linspace(train_df[feature].min(), train_df[feature].max(), 100)
    ax.plot(x_line, p(x_line), color=COLORS['secondary'], linewidth=2, label='Trend')
    
    corr = train_df[feature].corr(train_df['SalePrice'])
    ax.set_xlabel(feature, fontsize=11)
    ax.set_ylabel('SalePrice ($)', fontsize=11)
    ax.set_title(f'{feature} vs SalePrice\n(r = {corr:.3f})', fontsize=12, fontweight='bold')

plt.suptitle('Relationship Between Top Features and House Price', fontsize=16, fontweight='bold', y=1.02)
plt.tight_layout()
plt.savefig('../reports/scatter_plots.png', dpi=150, bbox_inches='tight')
plt.show()


## 6. Categorical Feature Analysis


In [None]:
# Analyze categorical features vs SalePrice
important_cat_features = ['Neighborhood', 'ExterQual', 'KitchenQual', 
                          'BldgType', 'HouseStyle', 'SaleCondition']

fig, axes = plt.subplots(2, 3, figsize=(18, 12))
axes = axes.flatten()

for i, feature in enumerate(important_cat_features):
    ax = axes[i]
    
    # Order by median SalePrice
    order = train_df.groupby(feature)['SalePrice'].median().sort_values(ascending=False).index
    
    sns.boxplot(data=train_df, x=feature, y='SalePrice', order=order,
                palette='viridis', ax=ax)
    
    ax.set_xlabel(feature, fontsize=11)
    ax.set_ylabel('SalePrice ($)', fontsize=11)
    ax.set_title(f'SalePrice by {feature}', fontsize=12, fontweight='bold')
    ax.tick_params(axis='x', rotation=45)

plt.suptitle('House Price Analysis by Categorical Feature', fontsize=16, fontweight='bold', y=1.02)
plt.tight_layout()
plt.savefig('../reports/categorical_analysis.png', dpi=150, bbox_inches='tight')
plt.show()
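
The box plots above are ordered by median `SalePrice` per category. A short sketch on made-up numbers shows why the median is a reasonable ordering key: a single expensive sale shifts a group's mean but not its median.

```python
import pandas as pd

# Made-up prices; group A contains one unusually expensive sale
toy = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
    'SalePrice': [100, 110, 400, 200, 210, 220, 150, 160, 170],
})

# Per-category summary, sorted the same way as the box plots
summary = (
    toy.groupby('Neighborhood')['SalePrice']
    .agg(['median', 'mean', 'count'])
    .sort_values('median', ascending=False)
)
print(summary)

order = summary.index.tolist()
print(order)  # → ['B', 'C', 'A']
```

Sorting by mean would place A second, because the 400 sale pulls its mean above C's; by median A ranks last. The `count` column is also worth checking, since a median over very few sales is unstable.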


## 7. Outlier Detection


In [None]:
# Detect outliers using IQR method
def detect_outliers_iqr(df, column):
    """Detect outliers using IQR method."""
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower = Q1 - 1.5 * IQR
    upper = Q3 + 1.5 * IQR
    outliers = df[(df[column] < lower) | (df[column] > upper)]
    return outliers, lower, upper

# Analyze outliers for key numerical features
key_features = ['SalePrice', 'GrLivArea', 'LotArea', 'TotalBsmtSF', 'GarageArea']

print("🔍 OUTLIERS ANALYSIS (IQR Method)")
print("="*60)

for feature in key_features:
    if feature in train_df.columns:
        outliers, lower, upper = detect_outliers_iqr(train_df, feature)
        print(f"\n{feature}:")
        print(f"  - Bounds: [{lower:.2f}, {upper:.2f}]")
        print(f"  - Number of outliers: {len(outliers)} ({len(outliers)/len(train_df)*100:.2f}%)")
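
As a worked example of the IQR rule used in `detect_outliers_iqr`, the sketch below applies the same bounds to eight made-up values, one of which is clearly extreme.

```python
import pandas as pd

# Made-up values; 100 is the obvious extreme
values = pd.Series([10, 12, 13, 14, 15, 16, 18, 100])

q1 = values.quantile(0.25)   # 12.75 with pandas' default linear interpolation
q3 = values.quantile(0.75)   # 16.5
iqr = q3 - q1                # 3.75

# Tukey fences: 1.5 * IQR beyond each quartile
lower = q1 - 1.5 * iqr       # 7.125
upper = q3 + 1.5 * iqr       # 22.125

outliers = values[(values < lower) | (values > upper)]
print(f"Bounds: [{lower}, {upper}]")
print(outliers.tolist())     # → [100]
```

The 1.5 multiplier is a convention rather than a statistical test. On heavily right-skewed columns the rule tends to flag many legitimate large values, so flagged rows are candidates for inspection, not automatic removal.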


In [None]:
# Identify specific outliers in GrLivArea
fig, ax = plt.subplots(figsize=(10, 6))

ax.scatter(train_df['GrLivArea'], train_df['SalePrice'], 
           c=COLORS['primary'], alpha=0.6, edgecolors=COLORS['dark'])

# Highlight potential outliers
outlier_mask = (train_df['GrLivArea'] > 4000) & (train_df['SalePrice'] < 300000)
ax.scatter(train_df.loc[outlier_mask, 'GrLivArea'], 
           train_df.loc[outlier_mask, 'SalePrice'],
           c=COLORS['success'], s=100, marker='x', linewidths=3,
           label=f'Potential Outliers ({outlier_mask.sum()})')

ax.set_xlabel('GrLivArea (sq ft)', fontsize=12)
ax.set_ylabel('SalePrice ($)', fontsize=12)
ax.set_title('GrLivArea vs SalePrice - Outlier Detection', fontsize=14, fontweight='bold')
ax.legend()

plt.tight_layout()
plt.savefig('../reports/grlivarea_outliers.png', dpi=150, bbox_inches='tight')
plt.show()

print(f"\n⚠️ Identified {outlier_mask.sum()} potential outliers with large area but low price")


## 8. EDA Conclusions

### Key findings:
1. **Overall quality (OverallQual)** is the most important factor influencing house price
2. **Area** (GrLivArea, TotalBsmtSF) is strongly correlated with price
3. **Missing values** need careful handling, especially in columns tied to special amenities (Pool, Fence, Alley)
4. **Log transformation** should be applied to SalePrice to reduce skewness
5. **Feature engineering** can produce useful new variables such as TotalSF and HouseAge

### Next steps:
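
The last two findings can be sketched as follows. The notebook names `TotalSF` and `HouseAge` but does not define them, so the formulas below are assumptions: total area as basement plus first and second floor, and age as sale year minus build year. The rows are made up; only the column names come from the Kaggle data dictionary.

```python
import numpy as np
import pandas as pd

# Made-up rows using column names from the Kaggle data dictionary
toy = pd.DataFrame({
    'TotalBsmtSF': [800, 1200, 0],
    '1stFlrSF': [900, 1200, 1000],
    '2ndFlrSF': [700, 0, 500],
    'YearBuilt': [2000, 1975, 1990],
    'YrSold': [2008, 2007, 2010],
    'SalePrice': [200000, 180000, 150000],
})

# Assumed definitions (not confirmed by the notebook)
toy['TotalSF'] = toy['TotalBsmtSF'] + toy['1stFlrSF'] + toy['2ndFlrSF']
toy['HouseAge'] = toy['YrSold'] - toy['YearBuilt']

# Log-transformed target; np.expm1 inverts it at prediction time
toy['LogSalePrice'] = np.log1p(toy['SalePrice'])

print(toy[['TotalSF', 'HouseAge', 'LogSalePrice']])
```

On the real data, missing values in the source columns would need to be handled before the sum, since a single `NaN` makes the derived `TotalSF` missing for that row.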
- Data Preprocessing & Feature Engineering
- Model Training & Evaluation
