# House Prices Regression Techniques

## Project Overview
This notebook contains the complete analysis and modeling workflow for predicting house prices using the Ames Housing dataset.

**Goal:** Predict the sales price for each house in the test set.

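**Evaluation Metric:** Root mean squared error (RMSE) between the logarithm of the predicted sale price and the logarithm of the observed sale price. Working on the log scale means relative errors on cheap and expensive houses are penalized equally.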
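
As a minimal sketch of the metric described above, the function below computes the RMSE on log-transformed prices. The name `rmse_log` and the example prices are hypothetical, chosen only for illustration. It uses a plain `np.log`, whereas the notebook later transforms the target with `np.log1p`. For prices of this size the difference between the two is very small.

```python
import numpy as np

def rmse_log(y_true, y_pred):
    """RMSE between log(y_pred) and log(y_true); inputs must be strictly positive."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((np.log(y_pred) - np.log(y_true)) ** 2)))

# Hypothetical prices, for illustration only
observed = [200000.0, 150000.0, 320000.0]
predicted = [210000.0, 140000.0, 300000.0]

score = rmse_log(observed, predicted)
print(f"RMSE (log scale): {score:.4f}")
```

Because the error is taken on the log scale, a 10% overestimate contributes the same amount whether the house costs 100,000 or 1,000,000.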

## 1. Import Libraries

In [1]:
# Data manipulation
import pandas as pd
import numpy as np

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Modeling
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score

# Settings
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')

# Ignore warnings
import warnings
warnings.filterwarnings('ignore')

print("Libraries imported successfully!")

Libraries imported successfully!


## 2. Load Data

Load the training and test datasets.

In [2]:
# Load datasets
train_df = pd.read_csv('Data/train.csv')
test_df = pd.read_csv('Data/test.csv')

print(f"Training data shape: {train_df.shape}")
print(f"Test data shape: {test_df.shape}")

Training data shape: (1460, 81)
Test data shape: (1459, 80)


## 3. Exploratory Data Analysis (EDA)

### 3.1 First Look at the Data

In [None]:
# Display first few rows
train_df.head()

In [None]:
# Basic information about the dataset
train_df.info()

In [None]:
# Statistical summary
train_df.describe()

### 3.2 Target Variable Analysis

In [None]:
# Target variable statistics
print("SalePrice Statistics:")
print(train_df['SalePrice'].describe())
print(f"\nSkewness: {train_df['SalePrice'].skew()}")
print(f"Kurtosis: {train_df['SalePrice'].kurtosis()}")

In [None]:
# Visualize target variable distribution
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Original distribution
axes[0].hist(train_df['SalePrice'], bins=50, edgecolor='black', alpha=0.7)
axes[0].set_xlabel('Sale Price')
axes[0].set_ylabel('Frequency')
axes[0].set_title('Distribution of Sale Prices')

# Log-transformed distribution
axes[1].hist(np.log1p(train_df['SalePrice']), bins=50, edgecolor='black', alpha=0.7, color='orange')
axes[1].set_xlabel('Log(Sale Price)')
axes[1].set_ylabel('Frequency')
axes[1].set_title('Distribution of Log-Transformed Sale Prices')

plt.tight_layout()
plt.show()

### 3.3 Missing Values Analysis

In [3]:
# Calculate missing values
def missing_values_table(df):
    mis_val = df.isnull().sum()
    mis_val_percent = 100 * mis_val / len(df)
    mis_val_table = pd.concat([mis_val, mis_val_percent], axis=1)
    mis_val_table_ren_columns = mis_val_table.rename(
        columns={0: 'Missing Values', 1: '% of Total Values'})
    mis_val_table_ren_columns = mis_val_table_ren_columns[
        mis_val_table_ren_columns.iloc[:, 1] != 0].sort_values(
        '% of Total Values', ascending=False).round(1)
    print(f"Your selected dataframe has {df.shape[1]} columns.\n"
          f"There are {mis_val_table_ren_columns.shape[0]} columns that have missing values.")
    return mis_val_table_ren_columns

# Display missing values for training data
print("Training Data - Missing Values:")
missing_train = missing_values_table(train_df)
missing_train.head(20)

Training Data - Missing Values:
Your selected dataframe has 81 columns.
There are 19 columns that have missing values.


              Missing Values  % of Total Values
PoolQC                  1453               99.5
MiscFeature             1406               96.3
Alley                   1369               93.8
Fence                   1179               80.8
MasVnrType               872               59.7
FireplaceQu              690               47.3
LotFrontage              259               17.7
GarageType                81                5.5
GarageYrBlt               81                5.5
GarageFinish              81                5.5


### 3.4 Data Types

In [4]:
# Separate numerical and categorical features
numerical_features = train_df.select_dtypes(include=['int64', 'float64']).columns.tolist()
categorical_features = train_df.select_dtypes(include=['object']).columns.tolist()

# Remove Id and SalePrice from features
if 'Id' in numerical_features:
    numerical_features.remove('Id')
if 'SalePrice' in numerical_features:
    numerical_features.remove('SalePrice')

print(f"Number of numerical features: {len(numerical_features)}")
print(f"Number of categorical features: {len(categorical_features)}")
print(f"\nNumerical features: {numerical_features[:10]}...")
print(f"\nCategorical features: {categorical_features[:10]}...")

Number of numerical features: 36
Number of categorical features: 43

Numerical features: ['MSSubClass', 'LotFrontage', 'LotArea', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd', 'MasVnrArea', 'BsmtFinSF1', 'BsmtFinSF2']...

Categorical features: ['MSZoning', 'Street', 'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig', 'LandSlope', 'Neighborhood', 'Condition1']...


## 4. Data Preprocessing

In this section, we prepare the data for modeling by:
- Combining train and test data for consistent preprocessing
- Handling missing values based on the data description
- Creating new features
- Log-transforming skewed numerical features
- Encoding categorical variables
- Splitting the data back into train and test sets

### 4.1 Combine Train and Test Data

In [5]:
# Save the target variable
y_train = train_df['SalePrice'].copy()

# Save test IDs for submission
test_ids = test_df['Id'].copy()

# Drop Id and SalePrice from train
train_df = train_df.drop(['Id', 'SalePrice'], axis=1)
test_df = test_df.drop(['Id'], axis=1)

# Store the number of training samples
n_train = train_df.shape[0]

# Combine train and test for consistent preprocessing
all_data = pd.concat([train_df, test_df], axis=0, ignore_index=True)

print(f"Combined dataset shape: {all_data.shape}")
print(f"Training samples: {n_train}")
print(f"Test samples: {all_data.shape[0] - n_train}")

Combined dataset shape: (2919, 79)
Training samples: 1460
Test samples: 1459


### 4.2 Handle Missing Values

We'll handle missing values based on the data description and domain knowledge.

In [6]:
# Check missing values in combined dataset
missing_data = all_data.isnull().sum()
missing_data = missing_data[missing_data > 0].sort_values(ascending=False)
missing_percent = (missing_data / len(all_data)) * 100

missing_df = pd.DataFrame({'Missing Count': missing_data, 'Percentage': missing_percent})
print(f"Features with missing values: {len(missing_df)}")
print("\nTop 10 features with missing values:")
print(missing_df.head(10))

Features with missing values: 34

Top 10 features with missing values:
             Missing Count  Percentage
PoolQC                2909   99.657417
MiscFeature           2814   96.402878
Alley                 2721   93.216855
Fence                 2348   80.438506
MasVnrType            1766   60.500171
FireplaceQu           1420   48.646797
LotFrontage            486   16.649538
GarageQual             159    5.447071
GarageYrBlt            159    5.447071
GarageCond             159    5.447071


In [7]:
# For features where NA means "None" or "No feature"
none_cols = ['PoolQC', 'MiscFeature', 'Alley', 'Fence', 'FireplaceQu',
             'GarageType', 'GarageFinish', 'GarageQual', 'GarageCond',
             'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2',
             'MasVnrType']

for col in none_cols:
    if col in all_data.columns:
        all_data[col] = all_data[col].fillna('None')

# For garage year built, fill with 0 (no garage)
if 'GarageYrBlt' in all_data.columns:
    all_data['GarageYrBlt'] = all_data['GarageYrBlt'].fillna(0)

# For basement and garage related numerical features, fill with 0
zero_cols = ['BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF',
             'BsmtFullBath', 'BsmtHalfBath', 'GarageCars', 'GarageArea',
             'MasVnrArea']

for col in zero_cols:
    if col in all_data.columns:
        all_data[col] = all_data[col].fillna(0)

# For LotFrontage, fill with median by neighborhood
if 'LotFrontage' in all_data.columns:
    all_data['LotFrontage'] = all_data.groupby('Neighborhood')['LotFrontage'].transform(
        lambda x: x.fillna(x.median()))

# For other categorical features, fill with mode
categorical_cols = all_data.select_dtypes(include=['object']).columns
for col in categorical_cols:
    if all_data[col].isnull().sum() > 0:
        all_data[col] = all_data[col].fillna(all_data[col].mode()[0])

# For remaining numerical features, fill with median
numerical_cols = all_data.select_dtypes(include=['int64', 'float64']).columns
for col in numerical_cols:
    if all_data[col].isnull().sum() > 0:
        all_data[col] = all_data[col].fillna(all_data[col].median())

# Verify no missing values remain
print(f"\nRemaining missing values: {all_data.isnull().sum().sum()}")
print("Missing value handling complete!")


Remaining missing values: 0
Missing value handling complete!
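
The neighborhood-based `LotFrontage` fill above can be seen in isolation on a small frame. This is only a sketch: the toy values are invented for illustration and are not taken from the Ames dataset.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration (not from the Ames dataset)
toy = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'A', 'B', 'B', 'B'],
    'LotFrontage': [60.0, 80.0, np.nan, 40.0, np.nan, 50.0],
})

# Each missing value is replaced by the median of its own neighborhood
toy['LotFrontage'] = toy.groupby('Neighborhood')['LotFrontage'].transform(
    lambda x: x.fillna(x.median()))

print(toy)
```

If a neighborhood had no observed `LotFrontage` at all, its median would be NaN and the gap would remain. In the cell above, the later overall-median fill for numerical columns acts as the fallback for that case.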


### 4.3 Feature Engineering

Create aggregate size features, presence indicators, and age features from the existing columns.

In [8]:
# Total square footage
all_data['TotalSF'] = all_data['TotalBsmtSF'] + all_data['1stFlrSF'] + all_data['2ndFlrSF']

# Total bathrooms
all_data['TotalBath'] = (all_data['FullBath'] + (0.5 * all_data['HalfBath']) +
                         all_data['BsmtFullBath'] + (0.5 * all_data['BsmtHalfBath']))

# Total porch square footage
all_data['TotalPorchSF'] = (all_data['OpenPorchSF'] + all_data['3SsnPorch'] +
                            all_data['EnclosedPorch'] + all_data['ScreenPorch'] +
                            all_data['WoodDeckSF'])

# Has pool
all_data['HasPool'] = (all_data['PoolArea'] > 0).astype(int)

# Has 2nd floor
all_data['Has2ndFloor'] = (all_data['2ndFlrSF'] > 0).astype(int)

# Has garage
all_data['HasGarage'] = (all_data['GarageArea'] > 0).astype(int)

# Has basement
all_data['HasBsmt'] = (all_data['TotalBsmtSF'] > 0).astype(int)

# Has fireplace
all_data['HasFireplace'] = (all_data['Fireplaces'] > 0).astype(int)

# House age
all_data['HouseAge'] = all_data['YrSold'] - all_data['YearBuilt']

# Years since remodel
all_data['YearsSinceRemod'] = all_data['YrSold'] - all_data['YearRemodAdd']

# Is new house (sold in the year it was built)
all_data['IsNew'] = (all_data['YearBuilt'] == all_data['YrSold']).astype(int)

print("Feature engineering complete!")
print(f"New dataset shape: {all_data.shape}")

Feature engineering complete!
New dataset shape: (2919, 90)
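
`HouseAge` and `YearsSinceRemod` are differences of year columns, so they would be negative for any row whose recorded sale year precedes its build or remodel year. Whether such rows exist in this dataset is not checked above. The sketch below, on invented toy data, shows one way to count them and, as one possible remedy (an assumption, not part of the workflow above), clip them at zero.

```python
import pandas as pd

# Toy data, invented for illustration: the second row is "sold" before it was built
toy = pd.DataFrame({
    'YrSold': [2008, 2007, 2010],
    'YearBuilt': [2000, 2008, 2010],
})

toy['HouseAge'] = toy['YrSold'] - toy['YearBuilt']

# Count rows with a negative age
n_negative = int((toy['HouseAge'] < 0).sum())

# One possible remedy (an assumption): treat negative ages as zero
toy['HouseAgeClipped'] = toy['HouseAge'].clip(lower=0)

print(f"Rows with negative HouseAge: {n_negative}")
print(toy)
```

This matters mainly if such a column is later log-transformed, since `np.log1p` returns `-inf` at -1 and NaN below it.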


### 4.4 Handle Skewed Features

Log-transform highly skewed numerical features to make them more normally distributed.

In [9]:
from scipy.stats import skew

# Get numerical features
numeric_feats = all_data.dtypes[all_data.dtypes != "object"].index

# Calculate skewness
skewed_feats = all_data[numeric_feats].apply(lambda x: skew(x.dropna()))

# Select features with absolute skewness > 0.75
skewed_feats = skewed_feats[abs(skewed_feats) > 0.75]

print(f"Number of skewed features: {len(skewed_feats)}")
print("\nTop 10 most skewed features:")
print(skewed_feats.sort_values(ascending=False).head(10))

# Apply log1p transformation to reduce skewness
for feat in skewed_feats.index:
    all_data[feat] = np.log1p(all_data[feat])

print("\nSkewness correction applied!")

Number of skewed features: 29

Top 10 most skewed features:
MiscVal          21.947195
PoolArea         16.898328
HasPool          14.884318
LotArea          12.822431
LowQualFinSF     12.088761
3SsnPorch        11.376065
IsNew             4.712237
KitchenAbvGr      4.302254
BsmtFinSF2        4.146143
EnclosedPorch     4.003891
dtype: float64

Skewness correction applied!
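
To illustrate what the transformation does, the sketch below measures skewness before and after `np.log1p` on a small synthetic series. The values are invented for illustration. It uses pandas `Series.skew()` rather than `scipy.stats.skew`, so the numbers are not directly comparable to the output above (pandas applies a sample-size correction).

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed values, invented for illustration
values = pd.Series([0, 0, 0, 1, 2, 3, 5, 8, 20, 100, 1000], dtype=float)

skew_before = values.skew()
skew_after = np.log1p(values).skew()

print(f"Skewness before log1p: {skew_before:.2f}")
print(f"Skewness after log1p:  {skew_after:.2f}")
```

The list above also contains 0/1 indicators such as `HasPool` and `IsNew`. For those, `log1p` only maps 1 to log 2 (about 0.693), which rescales the column without changing the shape of its distribution, so excluding binary columns from the transform may be worth considering.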


### 4.5 Encode Categorical Variables

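Convert categorical features to numerical indicator columns using one-hot encoding. With `drop_first=True`, the first level of each feature is dropped, since it is implied when all the other indicators are zero.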

In [10]:
# Get dummy variables for categorical features
all_data = pd.get_dummies(all_data, drop_first=True)

print(f"Dataset shape after encoding: {all_data.shape}")
print(f"Number of features: {all_data.shape[1]}")

Dataset shape after encoding: (2919, 270)
Number of features: 270
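
Encoding the combined frame, rather than each split separately, is what keeps the train and test columns aligned. The following sketch shows the difference on two toy frames. The column name `Zone` and its values are invented for illustration.

```python
import pandas as pd

# Toy frames, invented for illustration: 'C' appears only in the test split
toy_train = pd.DataFrame({'Zone': ['A', 'B', 'A']})
toy_test = pd.DataFrame({'Zone': ['B', 'C']})

# Encoding each split separately yields different column sets
sep_train_cols = list(pd.get_dummies(toy_train, drop_first=True).columns)
sep_test_cols = list(pd.get_dummies(toy_test, drop_first=True).columns)

# Encoding the combined frame yields one shared column set
combined = pd.concat([toy_train, toy_test], axis=0, ignore_index=True)
encoded = pd.get_dummies(combined, drop_first=True)
enc_train = encoded[:len(toy_train)]
enc_test = encoded[len(toy_train):]

print("Separate train columns:", sep_train_cols)
print("Separate test columns: ", sep_test_cols)
print("Combined columns:      ", list(encoded.columns))
```

With separate encoding, a model fitted on the train columns could not be applied to the test frame without first reconciling the columns.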


### 4.6 Split Back to Train and Test Sets

In [11]:
# Split back into train and test sets
X_train = all_data[:n_train].copy()
X_test = all_data[n_train:].copy()

# Log-transform the target with log1p, in line with the log-based evaluation metric
y_train_log = np.log1p(y_train)

print(f"X_train shape: {X_train.shape}")
print(f"X_test shape: {X_test.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"y_train_log shape: {y_train_log.shape}")
print("\nData preprocessing complete!")

X_train shape: (1460, 270)
X_test shape: (1459, 270)
y_train shape: (1460,)
y_train_log shape: (1460,)

Data preprocessing complete!
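
Since the target is transformed with `np.log1p`, a model trained on `y_train_log` predicts on the log scale, and its predictions would need to be mapped back with `np.expm1` before being reported as prices. A minimal round-trip sketch, using hypothetical prices:

```python
import numpy as np

# Hypothetical sale prices, for illustration only
prices = np.array([120000.0, 185000.0, 310000.0])

log_prices = np.log1p(prices)     # scale the model would be trained on
recovered = np.expm1(log_prices)  # back to the original price scale

print(np.allclose(recovered, prices))
```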


### 4.7 Preprocessing Summary

In [12]:
print("=" * 60)
print("DATA PREPROCESSING SUMMARY")
print("=" * 60)
print(f"\nOriginal features: 79")
print(f"After feature engineering: 90")
print(f"After one-hot encoding: {X_train.shape[1]}")
print(f"\nTraining samples: {X_train.shape[0]}")
print(f"Test samples: {X_test.shape[0]}")
print(f"\nMissing values handled: ✓")
print(f"Skewed features corrected: ✓")
print(f"Categorical encoding: ✓")
print(f"Target transformation (log): ✓")
print("\n" + "=" * 60)
print("Ready for modeling!")
print("=" * 60)

DATA PREPROCESSING SUMMARY

Original features: 79
After feature engineering: 90
After one-hot encoding: 270

Training samples: 1460
Test samples: 1459

Missing values handled: ✓
Skewed features corrected: ✓
Categorical encoding: ✓
Target transformation (log): ✓

Ready for modeling!


## 5. Feature Engineering

*To be continued...*

## 6. Model Development

*To be continued...*

## 7. Model Evaluation

*To be continued...*

## 8. Generate Predictions

*To be continued...*