# Task 4 – House Price Prediction (Regression Analysis)

This Jupyter Notebook implements **Task 4: Regression Analysis** from the _Data Analysis with Python_ lab.

We will:
- Load and explore the house price dataset
- Handle missing values
- Detect possible outliers
- Preprocess numeric and categorical features
- Train a **Linear Regression** model
- Evaluate the model using **RMSE** and **R²**
- Show feature importance (coefficients)


In [None]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score


## 1. Load and inspect the dataset

In [None]:
# Make sure 'house_prices.csv' is in the same folder as this notebook.

print('📂 Loading dataset: house_prices.csv')
data = pd.read_csv('house_prices.csv')

print('\n🔹 First 5 rows of dataset:')
display(data.head())

print('\nℹ️ Dataset info:')
data.info()  # info() prints its report directly and returns None

print('\n📊 Summary statistics:')
display(data.describe())


## 2. Check and handle missing values

In [None]:
print('❓ Missing values in each column:')
print(data.isnull().sum())

# Define columns
numeric_cols = ['Size', 'Number_of_Rooms']
categorical_cols = ['Location']
target_col = 'Price'

# Handle missing values
for col in numeric_cols:
    data[col] = data[col].fillna(data[col].mean())

for col in categorical_cols:
    data[col] = data[col].fillna(data[col].mode()[0])

print('\n✅ Missing values after handling:')
print(data.isnull().sum())
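
The cell above computes the mean and mode on the full dataset, before the train-test split, so test rows influence the fill values. One common alternative is to move imputation into the preprocessing pipeline so those statistics are learned from training rows only. The following is a minimal sketch under that assumption: `SimpleImputer`, the toy values, and the names `numeric_pipe`, `categorical_pipe`, and `imputing_preprocessor` are illustrative and do not appear in the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame (hypothetical values) that reuses the notebook's column names
toy = pd.DataFrame({
    'Size': [50.0, 80.0, np.nan, 120.0],
    'Number_of_Rooms': [2.0, np.nan, 3.0, 4.0],
    'Location': ['A', 'B', np.nan, 'A'],
})

# Numeric columns: fill with the mean, then standardize
numeric_pipe = Pipeline([
    ('impute', SimpleImputer(strategy='mean')),
    ('scale', StandardScaler()),
])

# Categorical columns: fill with the most frequent value, then one-hot encode
categorical_pipe = Pipeline([
    ('impute', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(drop='first')),
])

imputing_preprocessor = ColumnTransformer([
    ('num', numeric_pipe, ['Size', 'Number_of_Rooms']),
    ('cat', categorical_pipe, ['Location']),
])

transformed = imputing_preprocessor.fit_transform(toy)
# The result may be sparse or dense depending on its density; force dense for display
if hasattr(transformed, 'toarray'):
    transformed = transformed.toarray()
print(transformed.shape)
```

When such a preprocessor is placed inside a `Pipeline` and fitted on `X_train`, the fill values come from the training split alone.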


## 3. Outlier detection using IQR

In [None]:
print('📌 Outlier detection (IQR method):')
for col in numeric_cols + [target_col]:
    Q1 = data[col].quantile(0.25)
    Q3 = data[col].quantile(0.75)
    IQR = Q3 - Q1
    lower = Q1 - 1.5 * IQR
    upper = Q3 + 1.5 * IQR
    outliers = data[(data[col] < lower) | (data[col] > upper)]
    print(f' - {col}: {len(outliers)} possible outliers')

# NOTE: We are not removing outliers in this task.
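
To make the IQR rule concrete, here is a small sketch on made-up numbers. The helper name `iqr_bounds` and the sample prices are hypothetical; the fences follow the same 1.5 × IQR rule as the loop above.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return the (lower, upper) fences of the IQR rule for a numeric Series."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical prices with one obviously extreme value
prices = pd.Series([100, 110, 120, 130, 140, 1000])

lower, upper = iqr_bounds(prices)
mask = (prices < lower) | (prices > upper)

print(f'fences: [{lower}, {upper}]')
print('flagged:', prices[mask].tolist())
```

Here Q1 is 112.5 and Q3 is 137.5, so the fences are 75 and 175 and only the value 1000 is flagged.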


## 4. Feature selection and preprocessing

In [None]:
# Features (X) and target (y)
X = data[numeric_cols + categorical_cols]
y = data[target_col]

numeric_transformer = StandardScaler()
categorical_transformer = OneHotEncoder(drop='first')  # drop one level per feature to avoid the dummy variable trap

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_cols),
        ('cat', categorical_transformer, categorical_cols),
    ]
)


## 5. Train-test split and model training

In [None]:
model = LinearRegression()

clf = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', model),
])

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42
)

print('📐 Data shapes:')
print(' - X_train:', X_train.shape)
print(' - X_test :', X_test.shape)
print(' - y_train:', y_train.shape)
print(' - y_test :', y_test.shape)

print('\n🚀 Training Linear Regression model...')
clf.fit(X_train, y_train)


## 6. Predictions and model evaluation

In [None]:
y_pred = clf.predict(X_test)

results = pd.DataFrame({
    'Actual_Price': y_test.values,
    'Predicted_Price': y_pred
})

print('📋 Actual vs Predicted Prices (first 10 rows):')
display(results.head(10))

rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)

print('\n📊 Model Evaluation:')
print(f' - RMSE     : {rmse:.2f}')
print(f' - R² Score : {r2:.4f}')
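
For reference, both metrics can be computed by hand. The sketch below uses made-up actual and predicted values (not results from this notebook) and checks the manual formulas against scikit-learn.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical actual and predicted prices
y_true = np.array([200.0, 250.0, 300.0, 350.0])
y_hat = np.array([210.0, 240.0, 310.0, 340.0])

residuals = y_true - y_hat

# RMSE: square root of the mean squared residual, in the same units as the target
rmse_manual = np.sqrt(np.mean(residuals ** 2))

# R²: 1 minus (residual sum of squares / total sum of squares around the mean)
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(f'RMSE: {rmse_manual:.2f}')
print(f'R²  : {r2_manual:.4f}')

# Both agree with the scikit-learn functions used above
assert np.isclose(rmse_manual, np.sqrt(mean_squared_error(y_true, y_hat)))
assert np.isclose(r2_manual, r2_score(y_true, y_hat))
```

With these numbers every prediction is off by 10, so the RMSE is 10, and R² is 1 - 400/12500 = 0.968.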


## 7. Feature importance (coefficients)

In [None]:
# Reuse the preprocessor already fitted inside the pipeline (no refit needed)
fitted_preprocessor = clf.named_steps['preprocessor']
cat_feature_names = fitted_preprocessor.named_transformers_['cat'].get_feature_names_out(categorical_cols)
all_feature_names = numeric_cols + list(cat_feature_names)

coefficients = clf.named_steps['model'].coef_

feature_importance = pd.DataFrame({
    'Feature': all_feature_names,
    'Coefficient': coefficients
}).sort_values(by='Coefficient', ascending=False)

print('⭐ Feature Importance (Linear Regression Coefficients):')
display(feature_importance)
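
Because the numeric features pass through `StandardScaler`, their coefficients are expressed per one standard deviation of the feature, not per original unit. Dividing a coefficient by the scaler's `scale_` value converts it back. The sketch below illustrates this on synthetic, noise-free data; the data and the names `pipe`, `coef_per_std`, and `coef_per_unit` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: price rises by exactly 3 per unit of size
size = np.array([[50.0], [80.0], [100.0], [120.0], [150.0]])
price = 3.0 * size.ravel() + 20.0

pipe = Pipeline([
    ('scale', StandardScaler()),
    ('model', LinearRegression()),
])
pipe.fit(size, price)

# Coefficient as reported by the model: change in price per one standard deviation of size
coef_per_std = pipe.named_steps['model'].coef_[0]

# Convert back to original units by dividing by the feature's standard deviation
coef_per_unit = coef_per_std / pipe.named_steps['scale'].scale_[0]

print(f'per std : {coef_per_std:.2f}')
print(f'per unit: {coef_per_unit:.2f}')
```

The per-unit value recovers the slope of 3 used to generate the data. The one-hot coefficients need no conversion: each is the estimated price difference relative to the dropped baseline category.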
