# Machine Learning Implementation

**Objective:** Build and evaluate predictive models to estimate house prices.

**Approach:** `Regression`.

**Models:** `Linear Regression` and `Decision Tree Regressor`.

**Evaluation Metrics:** `R-squared` ($R^2$) and `Mean Squared Error` (MSE).
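
As a quick reference for how the two metrics are computed, here is a minimal sketch on made-up numbers (the arrays are illustrative and not taken from the housing data). It computes both metrics by hand and checks them against scikit-learn's `mean_squared_error` and `r2_score`.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Toy values chosen for illustration only; they are not from the housing data.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 8.0])

# MSE: average of the squared residuals.
mse_manual = np.mean((y_true - y_pred) ** 2)

# R^2: 1 - (residual sum of squares / total sum of squares around the mean).
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

# The manual formulas should agree with scikit-learn's implementations.
assert np.isclose(mse_manual, mean_squared_error(y_true, y_pred))
assert np.isclose(r2_manual, r2_score(y_true, y_pred))
print(f"MSE = {mse_manual:.4f}, R^2 = {r2_manual:.4f}")
```

An R² of 1 means perfect predictions, while 0 means the model does no better than always predicting the mean of the target.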

#### 1. Data Loading and Feature Selection

We load the processed dataset and select the features used for prediction. Based on our EDA, we include the geographical coordinates, housing age, economic indicators, and engineered features such as `Rooms_Per_Household`. In practice, every column except the target `Price` is kept, including the one-hot encoded `ocean_proximity` columns.

In [6]:
import sys; sys.path.append("..")
import pandas as pd
from sklearn.model_selection import train_test_split
from src.models import train_linear_regression, train_decision_tree, evaluate_model, save_results

# Load processed data
df = pd.read_csv('../data/processed/cleaned_housing.csv')

# Feature Selection: Choosing relevant columns for the model
# We drop the target 'Price' to create the feature matrix X
X = df.drop('Price', axis=1)
y = df['Price']

print(f"Features selected: {list(X.columns)}")

Features selected: ['longitude', 'latitude', 'housing_median_age', 'total_rooms', 'total_bedrooms', 'population', 'households', 'median_income', 'Rooms_Per_Household', 'Bedrooms_Per_Room', 'ocean_proximity_INLAND', 'ocean_proximity_ISLAND', 'ocean_proximity_NEAR BAY', 'ocean_proximity_NEAR OCEAN']


#### 2. Train-Test Split

To properly evaluate our models, we split the data into a Training Set (80%) and a Testing Set (20%). This lets us measure performance on data the model has never seen, which is how overfitting is detected (the split itself does not prevent it).

In [7]:
# Split data using train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(f"Training set size: {X_train.shape[0]} samples")
print(f"Testing set size: {X_test.shape[0]} samples")

Training set size: 13532 samples
Testing set size: 3384 samples
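
The reported sizes can be sanity-checked without the dataset. The sketch below assumes the processed file has 16,916 rows (inferred from 13,532 + 3,384) and uses placeholder arrays in place of the real features.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 16,916 rows is inferred from the reported split sizes (13,532 + 3,384);
# the zero-filled arrays are placeholders, not the housing data.
n_rows = 13532 + 3384
X_demo = np.zeros((n_rows, 1))
y_demo = np.zeros(n_rows)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# scikit-learn rounds the test share up: ceil(0.2 * 16916) = 3384.
print(X_tr.shape[0], X_te.shape[0])
```

Fixing `random_state` makes the shuffle reproducible, so both models are later scored on exactly the same held-out rows.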


#### 3. Model 1: Linear Regression

We implement a Multiple `Linear Regression` model. This model assumes a linear relationship between the features (like income and rooms) and the house price.

In [8]:
# Train Linear Regression model
lr_model = train_linear_regression(X_train, y_train)

# Conduct model evaluation (Requirement 2.3.3)
lr_metrics = evaluate_model(lr_model, X_test, y_test)

print("Linear Regression Performance:")
print(f"R-squared Score: {lr_metrics['R2']}")
print(f"Mean Squared Error: {lr_metrics['MSE']}")

Linear Regression Performance:
R-squared Score: 0.6116
Mean Squared Error: 3359271807.5834
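
The helpers imported from `src.models` are not shown in this notebook. As a rough guide only, the sketch below shows what such helpers might look like, assuming they wrap scikit-learn and round the metrics to four decimals (which matches the printed output, but is an assumption). The `_sketch` function names and the synthetic data are hypothetical and are not the project's actual code.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score


def train_linear_regression_sketch(X_train, y_train):
    """Hypothetical stand-in: fit an ordinary least squares model."""
    model = LinearRegression()
    model.fit(X_train, y_train)
    return model


def evaluate_model_sketch(model, X_test, y_test):
    """Hypothetical stand-in: return R2 and MSE rounded to 4 decimals."""
    predictions = model.predict(X_test)
    return {
        "R2": round(float(r2_score(y_test, predictions)), 4),
        "MSE": round(float(mean_squared_error(y_test, predictions)), 4),
    }


# Synthetic, exactly linear data (not the housing data): y = 2*x0 - x1 + 3.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
y_demo = 2 * X_demo[:, 0] - X_demo[:, 1] + 3

# Train on the first 150 rows and score on the remaining 50.
demo_model = train_linear_regression_sketch(X_demo[:150], y_demo[:150])
demo_metrics = evaluate_model_sketch(demo_model, X_demo[150:], y_demo[150:])
print(demo_metrics)
```

Because the synthetic target is an exact linear function of the inputs, the fit is essentially perfect here. The real housing data is far noisier, which is why the notebook reports an R² of about 0.61.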


#### 4. Model 2: Decision Tree Regressor

The `Decision Tree Regressor` can capture non-linear relationships and interactions between features (e.g., how the impact of 'House Age' might change depending on 'Location').

In [9]:
# Train Decision Tree Regressor
dt_model = train_decision_tree(X_train, y_train, max_depth=5)

# Conduct model evaluation (Requirement 2.3.3)
dt_metrics = evaluate_model(dt_model, X_test, y_test)

print("Decision Tree Performance:")
print(f"R-squared Score: {dt_metrics['R2']}")
print(f"Mean Squared Error: {dt_metrics['MSE']}")

Decision Tree Performance:
R-squared Score: 0.5669
Mean Squared Error: 3746393949.6182
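
MSE is expressed in squared price units, which makes values in the billions hard to interpret. Taking the square root gives the RMSE, which is in the same units as `Price`. The sketch below only reuses the two MSE values printed above; the dictionary names are illustrative.

```python
import math

# MSE values copied from the two notebook outputs above.
mse = {
    "Linear Regression": 3359271807.5834,
    "Decision Tree": 3746393949.6182,
}

# RMSE is the square root of MSE, so it is expressed in the units of `Price`.
rmse = {name: math.sqrt(value) for name, value in mse.items()}

for name, value in rmse.items():
    print(f"{name}: RMSE = {value:,.0f}")
```

This works out to a typical error of roughly 58,000 for Linear Regression and roughly 61,200 for the Decision Tree, in units of `Price`.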


#### 5. Final Comparison and Discussion

**Model Comparison:**

* **Linear Regression R²:** _0.6116_

* **Decision Tree R²:** _0.5669_

**Discussion:**
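The Linear Regression model performed better in this run, with a higher test R² (0.6116 versus 0.5669) and a lower MSE.

* Why? House prices are heavily influenced by complex geographical interactions (Longitude/Latitude) and non-linear thresholds (e.g., a specific neighborhood being much more expensive regardless of house age). A Decision Tree can in principle capture these by splitting the data into specific local regions, which Linear Regression cannot do. Here, however, the tree was limited to `max_depth=5`, which allows at most 32 distinct leaf predictions, and that is likely too coarse to represent these patterns. A deeper or tuned tree might close the gap, but that was not evaluated here.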
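
One way to probe how much tree depth matters is to sweep `max_depth` and compare training and held-out R². The sketch below does this on synthetic non-linear data, not the housing data, so the numbers it prints say nothing about this project's results. It only illustrates the procedure, and all variable names are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic non-linear target for illustration only (not the housing data).
rng = np.random.default_rng(42)
X_demo = rng.uniform(-3, 3, size=(2000, 2))
y_demo = np.sin(X_demo[:, 0]) * X_demo[:, 1] + rng.normal(scale=0.1, size=2000)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Fit one tree per depth; None lets the tree grow until it cannot split further.
scores = {}
for depth in [2, 5, 10, None]:
    tree = DecisionTreeRegressor(max_depth=depth, random_state=42)
    tree.fit(X_tr, y_tr)
    # Store (training R^2, held-out R^2) for each depth.
    scores[depth] = (tree.score(X_tr, y_tr), tree.score(X_te, y_te))

for depth, (train_r2, test_r2) in scores.items():
    print(f"max_depth={depth}: train R2={train_r2:.3f}, test R2={test_r2:.3f}")
```

A large gap between training and held-out R² at high depths signals overfitting, while low scores on both at shallow depths signal underfitting. In a real analysis the depth would be chosen with cross-validation on the training set rather than by looking at the test set.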

#### 6. Feature Importance: What drives house prices?

Using the Decision Tree model, we can identify which features most affect the predicted price.

In [10]:
# Get feature importance from the Decision Tree
importance = pd.DataFrame({
    'Feature': X.columns,
    'Importance': dt_model.feature_importances_
}).sort_values(by='Importance', ascending=False)

print("Top 5 Features Most Affecting House Prices:")
print(importance.head(5))

# Save the final results to the results folder
all_results = {
    "Linear_Regression": lr_metrics,
    "Decision_Tree": dt_metrics,
    "Top_Feature": importance.iloc[0]['Feature']
}
save_results(all_results, "model_comparison.json")

Top 5 Features Most Affecting House Prices:
                   Feature  Importance
7            median_income    0.449400
10  ocean_proximity_INLAND    0.442941
0                longitude    0.050943
1                 latitude    0.032174
2       housing_median_age    0.014267
Metrics saved successfully to ../reports/results\model_comparison.json
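
For context on reading the table above: scikit-learn's `feature_importances_` is impurity-based, meaning each feature's share of the total reduction in the split criterion, normalized so the values sum to 1. The sketch below illustrates this on synthetic data where only one column drives the target; the column names `signal` and `noise` are made up for the example.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Synthetic data for illustration: only 'signal' drives the target.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    "signal": rng.normal(size=500),
    "noise": rng.normal(size=500),
})
y_demo = 3 * X_demo["signal"]

tree = DecisionTreeRegressor(max_depth=5, random_state=0).fit(X_demo, y_demo)

# Same ranking construction as in the notebook cell above.
ranking = pd.DataFrame({
    "Feature": X_demo.columns,
    "Importance": tree.feature_importances_,
}).sort_values(by="Importance", ascending=False)

# Importances are normalized, so they sum to 1 across all features.
print(ranking)
print("Sum of importances:", ranking["Importance"].sum())
```

Because the values are shares of a fixed total, they rank features within one fitted tree but do not indicate the direction or size of a feature's effect on price.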
