# Housing Price Prediction

## 1. Introduction
This notebook focuses on predicting housing prices using regression techniques.
The goal is to explore housing data, perform exploratory data analysis, and build
a predictive model that estimates house prices based on property features.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

## 2. Dataset Overview
The dataset is assumed to be available locally. For reproducibility, the housing dataset can also be loaded directly from scikit-learn.
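
As a rough sketch of the scikit-learn route, the helper below wraps `fetch_california_housing`. This assumes the California housing data is the dataset meant here; the helper name `load_sklearn_housing` is hypothetical. The scikit-learn version uses different column names (for example `MedInc` and `MedHouseVal`), has no `ocean_proximity` column, and expresses the target in units of $100,000, so the column lists used later in this notebook would need adapting.

```python
from sklearn.datasets import fetch_california_housing


def load_sklearn_housing():
    """Return the scikit-learn California housing data as one DataFrame."""
    # as_frame=True returns pandas objects; the first call downloads and caches the data.
    bunch = fetch_california_housing(as_frame=True)
    # bunch.frame holds the eight numeric features plus the target column 'MedHouseVal'.
    return bunch.frame
```

Calling `load_sklearn_housing().head()` would then play the same role as `housing.head()` in the next section.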


## 3. Data Loading
This section loads the dataset and inspects its basic structure and the distribution of its features.

In [None]:
housing = pd.read_csv('housing.csv')
housing.head()

In [None]:
housing.describe()

In [None]:
# info is a method; without the parentheses it is not called
housing.info()
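
Missing values matter for the preprocessing that follows, and a per-column count is a quick way to find them (`info()` reports the same information as non-null counts). The sketch below uses a toy frame with made-up values in place of `housing`.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; one total_bedrooms entry is missing.
toy = pd.DataFrame({
    'total_rooms': [880.0, 7099.0, 1467.0],
    'total_bedrooms': [129.0, np.nan, 190.0],
})

# isna() flags missing cells; sum() counts them per column.
missing_per_column = toy.isna().sum()
print(missing_per_column)
```

On the real data the equivalent call would be `housing.isna().sum()`.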

## 4. Data Cleaning and Preprocessing
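This section defines how missing values are imputed and how the categorical feature is encoded, and combines both steps into a single preprocessing transformer.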

In [None]:
num_cols = ['longitude', 'latitude', 'housing_median_age', 'total_rooms',
            'total_bedrooms', 'population', 'households', 'median_income']
cat_cols = ['ocean_proximity']

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer

In [None]:
num_transformer = SimpleImputer(strategy='mean')
cat_transformer = Pipeline(steps=[
    ('impute', SimpleImputer(strategy='constant')),
    ('encoder', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(transformers=[
    ('numerical', num_transformer, num_cols),
    ('categorical', cat_transformer, cat_cols)
])
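
To see what a preprocessor of this shape produces, the sketch below applies the same kind of transformer to a three-row toy frame. The values are made up, only two columns are used, and `sparse_threshold=0` is added here purely so the result prints as a dense array.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy frame (made-up values) with one missing number and one missing category.
toy = pd.DataFrame({
    'total_bedrooms': [100.0, np.nan, 300.0],
    'ocean_proximity': ['INLAND', 'NEAR BAY', np.nan],
})

toy_preprocessor = ColumnTransformer(
    transformers=[
        # Numeric column: missing values are replaced by the column mean.
        ('numerical', SimpleImputer(strategy='mean'), ['total_bedrooms']),
        # Categorical column: missing values get a placeholder string, then one-hot encoding.
        ('categorical', Pipeline(steps=[
            ('impute', SimpleImputer(strategy='constant')),
            ('encoder', OneHotEncoder(handle_unknown='ignore')),
        ]), ['ocean_proximity']),
    ],
    sparse_threshold=0,  # always return a dense array (illustration only)
)

transformed = toy_preprocessor.fit_transform(toy)
print(transformed)
```

The missing `total_bedrooms` value becomes 200.0, the mean of 100 and 300. The missing category is replaced by the imputer's placeholder string, which then gets its own one-hot column next to `INLAND` and `NEAR BAY`.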

## 5. Exploratory Data Analysis (EDA)
In this section, we explore relationships between housing features and prices.


In [None]:
housing.hist(bins=50, figsize=(12,8))
plt.show()

Grouping the median income into categories shows how income is distributed across districts: where the majority of districts fall and, in turn, how income may influence housing prices.

In [None]:
housing['income_cat'] = pd.cut(housing['median_income'],
                         bins=[0, 1.5, 3.0, 4.5, 6, np.inf],
                         labels=[1, 2, 3, 4, 5])
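
The binning can be checked on a handful of values. The incomes below are made up; the point is how `pd.cut` treats values that sit exactly on a bin edge.

```python
import numpy as np
import pandas as pd

# Made-up median incomes, in whatever unit the dataset uses.
sample_income = pd.Series([1.0, 1.5, 2.0, 5.0, 10.0])

sample_cat = pd.cut(sample_income,
                    bins=[0, 1.5, 3.0, 4.5, 6, np.inf],
                    labels=[1, 2, 3, 4, 5])

# Intervals are closed on the right by default, so 1.5 lands in category 1, not 2.
print(sample_cat.tolist())  # [1, 1, 2, 4, 5]
```

Because the first interval is open on the left, an income of exactly 0 would not fall into any category and would come out as `NaN`.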

In [None]:
housing['income_cat'].value_counts().sort_index().plot.bar(rot=0, grid=True)
plt.xlabel('Income Category')
plt.ylabel('Number of Districts')
plt.show()

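This scatter plot of longitude against latitude shows the geographic distribution of the districts. With `alpha=0.2`, darker areas indicate a higher density of districts. Price is not encoded in this plot.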

In [None]:
housing.plot(kind='scatter', x='longitude', y='latitude', grid=True, alpha=0.2)
plt.show()
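
To make price visible on the map, one option is to colour each district by `median_house_value`. The sketch below uses a synthetic stand-in frame with random values, so it shows only the plotting call and reveals no real pattern; the colormap and `alpha` are arbitrary choices.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in for the housing frame; values are random, not real data.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    'longitude': rng.uniform(-124, -114, 200),
    'latitude': rng.uniform(32, 42, 200),
    'median_house_value': rng.uniform(50_000, 500_000, 200),
})

# c= picks the column that drives the colour; colorbar=True adds the legend for it.
ax = demo.plot(kind='scatter', x='longitude', y='latitude',
               c='median_house_value', cmap='viridis', colorbar=True,
               alpha=0.4, grid=True)
plt.show()
```

On the real data, the same call on `housing` would show whether expensive districts cluster in particular areas.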

## 6. Feature Selection and Preparation
Preparing X and y.

In [None]:
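# Features: every column except the target. Columns not listed in
# num_cols or cat_cols (such as income_cat) are dropped by the preprocessor.
X = housing.drop(['median_house_value'], axis=1)
# Target: the median house value of each district
y = housing['median_house_value']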

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
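
The split above is purely random. An alternative, not used in this notebook, is to stratify on an income category so that the train and test sets keep similar income distributions. The sketch below shows this on synthetic incomes (random values, not real data) binned the same way as `income_cat`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic incomes (random, not real data) binned the same way as income_cat.
rng = np.random.default_rng(42)
demo = pd.DataFrame({'median_income': rng.uniform(0.5, 10.0, 1000)})
demo['income_cat'] = pd.cut(demo['median_income'],
                            bins=[0, 1.5, 3.0, 4.5, 6, np.inf],
                            labels=[1, 2, 3, 4, 5])

# stratify= keeps each category's share nearly equal in both splits.
train, test = train_test_split(demo, test_size=0.2, random_state=42,
                               stratify=demo['income_cat'])

# Compare the category shares in the full data and in the test split.
full_share = demo['income_cat'].value_counts(normalize=True).sort_index()
test_share = test['income_cat'].value_counts(normalize=True).sort_index()
print(pd.DataFrame({'full': full_share, 'test': test_share}))
```

With the real data this would amount to passing `stratify=housing['income_cat']` to the existing `train_test_split` call.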

## 7. Model Training
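A `RandomForestRegressor` is used as the model. It is combined with the preprocessor in a single pipeline, so imputation and encoding are fitted on the training data only.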

In [None]:
model = RandomForestRegressor(random_state=42)

In [None]:
# Chain the preprocessor and the model into one pipeline
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', model)
])

pipeline.fit(X_train, y_train)

## 8. Model Evaluation

In [None]:
predictions = pipeline.predict(X_test)

from sklearn.metrics import mean_absolute_error
# scikit-learn metrics take the true values first, then the predictions
score = mean_absolute_error(y_test, predictions)
print("MAE:", score)
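
The mean absolute error is the average of the absolute differences between true and predicted values, so it is expressed in the same unit as the house prices. The sketch below works through it by hand on three made-up prices and compares the result with scikit-learn.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Made-up prices for illustration only.
y_true = np.array([100_000, 200_000, 300_000])
y_pred = np.array([110_000, 190_000, 330_000])

# MAE is the mean of the absolute errors: (10_000 + 10_000 + 30_000) / 3.
manual_mae = np.mean(np.abs(y_true - y_pred))
sklearn_mae = mean_absolute_error(y_true, y_pred)

print(manual_mae, sklearn_mae)
```

Both values are about 16,667, meaning the predictions in this toy case are off by roughly that amount on average.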

## 9. Conclusion
This project demonstrates a complete data science workflow, from data exploration
and preprocessing to regression modeling and evaluation. The results highlight the
importance of feature selection and exploratory analysis in predicting housing prices.