# House Prices Regression Techniques

## Project Overview
This notebook contains the complete analysis and modeling workflow for predicting house prices using the Ames Housing dataset.

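**Goal:** Predict the sale price of each house in the test set.

**Evaluation Metric:** Root mean squared error (RMSE) between the logarithm of the predicted sale price and the logarithm of the observed sale price. Taking logs means that errors on cheap and expensive houses affect the score equally.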
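
As a minimal sketch of the metric described above, the function below computes the RMSE on log-transformed prices. The name `rmse_log` and the toy prices are assumptions made for illustration; they are not part of the dataset or of any library.

```python
import numpy as np

def rmse_log(y_true, y_pred):
    """RMSE between log(y_pred) and log(y_true); both must be strictly positive."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((np.log(y_pred) - np.log(y_true)) ** 2)))

# Toy prices for illustration only (not taken from the dataset)
observed = [200_000, 150_000, 320_000]
predicted = [210_000, 140_000, 300_000]
score = rmse_log(observed, predicted)
print(f"RMSE (log scale): {score:.4f}")
```

Because the error is measured on the log scale, a 10% miss on a cheap house costs about as much as a 10% miss on an expensive one.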

## 1. Import Libraries

In [None]:
# Data manipulation
import pandas as pd
import numpy as np

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Modeling
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score

# Settings
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')

# Ignore warnings
import warnings
warnings.filterwarnings('ignore')

print("Libraries imported successfully!")

## 2. Load Data

Load the training and test datasets.

In [None]:
# Load datasets
train_df = pd.read_csv('Data/train.csv')
test_df = pd.read_csv('Data/test.csv')

print(f"Training data shape: {train_df.shape}")
print(f"Test data shape: {test_df.shape}")

## 3. Exploratory Data Analysis (EDA)

### 3.1 First Look at the Data

In [None]:
# Display first few rows
train_df.head()

In [None]:
# Basic information about the dataset
train_df.info()

In [None]:
# Statistical summary
train_df.describe()

### 3.2 Target Variable Analysis

In [None]:
# Target variable statistics
print("SalePrice Statistics:")
print(train_df['SalePrice'].describe())
print(f"\nSkewness: {train_df['SalePrice'].skew()}")
print(f"Kurtosis: {train_df['SalePrice'].kurtosis()}")

In [None]:
# Visualize target variable distribution
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Original distribution
axes[0].hist(train_df['SalePrice'], bins=50, edgecolor='black', alpha=0.7)
axes[0].set_xlabel('Sale Price')
axes[0].set_ylabel('Frequency')
axes[0].set_title('Distribution of Sale Prices')

# Log-transformed distribution
axes[1].hist(np.log1p(train_df['SalePrice']), bins=50, edgecolor='black', alpha=0.7, color='orange')
axes[1].set_xlabel('log(1 + Sale Price)')
axes[1].set_ylabel('Frequency')
axes[1].set_title('Distribution of Log-Transformed Sale Prices')

plt.tight_layout()
plt.show()
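
The effect of the log transform can also be checked numerically by comparing skewness before and after. The sketch below uses a synthetic right-skewed sample as a stand-in for `train_df['SalePrice']` so that it runs without the CSV files; the distribution parameters are assumptions chosen only to mimic house prices.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed "prices" standing in for train_df['SalePrice'] (assumption)
rng = np.random.default_rng(0)
prices = pd.Series(np.exp(rng.normal(loc=12.0, scale=0.4, size=5_000)), name="SalePrice")

# Skewness near 0 indicates a roughly symmetric distribution
skew_raw = prices.skew()
skew_log = np.log1p(prices).skew()
print(f"Skewness before log1p: {skew_raw:.3f}")
print(f"Skewness after  log1p: {skew_log:.3f}")
```

In the notebook itself, replacing `prices` with `train_df['SalePrice']` gives the same comparison on the real target.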

### 3.3 Missing Values Analysis

In [None]:
# Summarize missing values per column
def missing_values_table(df):
    """Return the count and percentage of missing values for each column that has any."""
    mis_val = df.isnull().sum()
    mis_val_percent = 100 * mis_val / len(df)
    mis_val_table = pd.concat(
        [mis_val, mis_val_percent], axis=1,
        keys=['Missing Values', '% of Total Values'])
    # Keep only columns with at least one missing value, worst first
    mis_val_table = mis_val_table[mis_val_table['Missing Values'] > 0].sort_values(
        '% of Total Values', ascending=False).round(1)
    print(f"The dataframe has {df.shape[1]} columns.\n"
          f"{mis_val_table.shape[0]} of them have missing values.")
    return mis_val_table

# Display missing values for training data
print("Training Data - Missing Values:")
missing_train = missing_values_table(train_df)
missing_train.head(20)

### 3.4 Data Types

In [None]:
# Separate numerical and categorical features
# 'number' matches every numeric dtype, not only int64 and float64
numerical_features = train_df.select_dtypes(include='number').columns.tolist()
categorical_features = train_df.select_dtypes(include=['object']).columns.tolist()

# Remove Id and SalePrice from features
if 'Id' in numerical_features:
    numerical_features.remove('Id')
if 'SalePrice' in numerical_features:
    numerical_features.remove('SalePrice')

print(f"Number of numerical features: {len(numerical_features)}")
print(f"Number of categorical features: {len(categorical_features)}")
print(f"\nNumerical features: {numerical_features[:10]}...")
print(f"\nCategorical features: {categorical_features[:10]}...")

## 4. Data Preprocessing

This section will contain:
- Missing value imputation
- Feature engineering
- Encoding categorical variables
- Feature scaling
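
One possible shape for the imputation, encoding, and scaling steps listed above is a scikit-learn `ColumnTransformer`. This is only a sketch under stated assumptions, not the notebook's final preprocessing: the toy frame, the choice of median and most-frequent imputation, and one-hot encoding are all illustrative choices, and the values are made up.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame for illustration; column names mimic the dataset, values are made up
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 80.0, 60.0],
    "GrLivArea": [1710, 1262, 1786, 1717],
    "MSZoning": ["RL", "RM", np.nan, "RL"],
})
num_cols = ["LotFrontage", "GrLivArea"]
cat_cols = ["MSZoning"]

# Numeric columns: fill gaps with the median, then standardize
numeric_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
# Categorical columns: fill gaps with the most frequent value, then one-hot encode
categorical_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])
preprocessor = ColumnTransformer([
    ("num", numeric_pipe, num_cols),
    ("cat", categorical_pipe, cat_cols),
])

X = preprocessor.fit_transform(toy)
print(X.shape)
```

Fitting the transformer on the training data only and then calling `transform` on the test data keeps test-set statistics out of the imputation and scaling steps.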

*To be continued...*

## 5. Feature Engineering

*To be continued...*

## 6. Model Development

*To be continued...*

## 7. Model Evaluation

*To be continued...*

## 8. Generate Predictions

*To be continued...*