## 🏠 Domain Analysis: California Housing Dataset

### 📘 Domain Overview
The dataset belongs to the **Real Estate and Housing Price Prediction** domain.  
It contains information about **houses in California**, including their **geographic location**, **demographics**, and **median house values**.  

This dataset was originally derived from the **1990 U.S. Census data** and is widely used in data science for regression modeling.

---

### 🧩 Business Objective
The main goal is to:
> **Predict the median house value for a given California district** based on various features such as income level, location, population, and housing characteristics.

This helps:
- Real estate agencies estimate regional house prices  
- Government and planners understand housing affordability  
- Investors identify high-value areas for development

---

### ⚙️ Type of Problem
- **Category:** Supervised Machine Learning  
- **Type:** Regression Problem  
- **Target Variable:** `median_house_value` (continuous numeric value)

---

### 📊 Nature of Data
Each row represents **a block group (district)** in California.  
Each column provides aggregated information about that district.

| Feature | Description | Type |
|----------|--------------|------|
| **longitude** | Longitude coordinate of the district | Numerical |
| **latitude** | Latitude coordinate of the district | Numerical |
| **housing_median_age** | Median age of houses in the district | Numerical |
| **total_rooms** | Total number of rooms in all houses in the district | Numerical |
| **total_bedrooms** | Total number of bedrooms in all houses in the district | Numerical |
| **population** | Total population in the district | Numerical |
| **households** | Total number of households in the district | Numerical |
| **median_income** | Median income of households (in tens of thousands USD) | Numerical |
| **median_house_value** | Median house value (target variable, in USD) | Numerical |
| **ocean_proximity** | Location category of the district (e.g., NEAR BAY, INLAND, etc.) | Categorical |

---

### 🧠 Data Insight Possibilities
- Relationship between **income** and **house value**  
- Impact of **proximity to the ocean** on prices  
- Regional distribution of house values based on **latitude and longitude**  
- Correlation between **population** (for example, people per household as a rough density proxy) and **house values**
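
The questions above can be explored with a few pandas calls. The sketch below is a minimal illustration, not the notebook's own analysis: `insight_summary` is a hypothetical helper, the five rows are copied from the first records of the dataset, and people per household is used as an assumed stand-in for density because the dataset has no land-area column.

```python
import pandas as pd

# Five rows copied from the start of the dataset; in practice pass the
# full DataFrame loaded from the CSV instead.
sample = pd.DataFrame({
    "median_income": [8.3252, 8.3014, 7.2574, 5.6431, 3.8462],
    "median_house_value": [452600.0, 358500.0, 352100.0, 341300.0, 342200.0],
    "population": [322.0, 2401.0, 496.0, 558.0, 565.0],
    "households": [126.0, 1138.0, 177.0, 219.0, 259.0],
    "ocean_proximity": ["NEAR BAY"] * 5,
})


def insight_summary(df):
    """Hypothetical helper: return three quick indicators for the target."""
    # Linear (Pearson) correlation between income and house value
    income_corr = df["median_income"].corr(df["median_house_value"])
    # Typical house value within each ocean-proximity category
    value_by_proximity = df.groupby("ocean_proximity")["median_house_value"].median()
    # People per household as an assumed proxy for crowding/density
    people_per_household = df["population"] / df["households"]
    crowding_corr = people_per_household.corr(df["median_house_value"])
    return income_corr, value_by_proximity, crowding_corr


income_corr, value_by_proximity, crowding_corr = insight_summary(sample)
print(f"income vs. value correlation: {income_corr:.2f}")
print(value_by_proximity)
```

With only five rows from one category the numbers are not meaningful on their own; the point is the shape of the calls, which apply unchanged to the full 20,640-row DataFrame.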

---

### 🎯 Use Cases
1. **Price Prediction:** Estimate housing prices for new or missing areas.  
2. **Urban Planning:** Analyze which regions have high or low housing prices (the data is a single 1990 snapshot, so it shows no trend over time).  
3. **Socioeconomic Analysis:** Understand how income affects house values.  
4. **Location Intelligence:** Find prime property areas near coastal regions.

---

### ✅ Summary
| Aspect | Details |
|--------|----------|
| **Domain** | Real Estate / California Housing |
| **Objective** | Predict median house value |
| **Problem Type** | Regression |
| **Target Variable** | `median_house_value` |
| **Instances** | 20,640 district-level records from California (1990 Census) |
| **Applications** | Price prediction, city planning, real estate analysis |

---


## 📂 Dataset Overview

### 📘 About the Dataset
The **California Housing Dataset** provides information about various housing and demographic attributes for different block groups (districts) in California.  
Each record represents a **district**, and each feature describes some characteristic of that district — such as location, income level, or housing details.

The dataset is commonly used for **predicting housing prices** and **understanding the impact of socioeconomic and geographical factors** on housing values.

---

### 🧾 Structure of the Dataset
| Feature Name | Description | Data Type | Example |
|---------------|-------------|------------|----------|
| **longitude** | Longitude coordinate of the district (negative values indicate west of the prime meridian) | Numerical | -122.23 |
| **latitude** | Latitude coordinate of the district | Numerical | 37.88 |
| **housing_median_age** | Median age of houses in the district | Numerical | 41 |
| **total_rooms** | Total number of rooms within the district | Numerical | 880 |
| **total_bedrooms** | Total number of bedrooms within the district | Numerical | 129 |
| **population** | Total population of the district | Numerical | 322 |
| **households** | Total number of households in the district | Numerical | 126 |
| **median_income** | Median income of households (in tens of thousands of USD) | Numerical | 8.3252 |
| **median_house_value** | Median value of houses in the district (target variable, in USD) | Numerical | 452600 |
| **ocean_proximity** | Proximity to the ocean (categorical location indicator) | Categorical | NEAR BAY |

---

### 🧮 Data Characteristics
- **Rows:** Each row = one district in California  
- **Columns:** Each column = one feature describing the district  
- **Target Column:** `median_house_value`  
- **Feature Columns:** All others except the target

---

### ⚙️ Data Type Distribution
- **Numerical Columns:** `longitude`, `latitude`, `housing_median_age`, `total_rooms`, `total_bedrooms`, `population`, `households`, `median_income`, `median_house_value`  
- **Categorical Columns:** `ocean_proximity`

---

### 🎯 Purpose of Dataset Overview
- To understand what each feature represents  
- To identify which columns need **encoding or scaling**  
- To prepare for **Exploratory Data Analysis (EDA)** and **Feature Engineering**
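
As one illustration of the feature engineering this overview prepares for: the room, bedroom, and population columns are district totals, so per-household ratios are often easier to compare across districts. The sketch below is an assumption about a possible next step, not something the notebook does; `add_ratio_features` and the three new column names are hypothetical.

```python
import pandas as pd

# First two records of the dataset, used here only as sample input.
sample = pd.DataFrame({
    "total_rooms": [880.0, 7099.0],
    "total_bedrooms": [129.0, 1106.0],
    "population": [322.0, 2401.0],
    "households": [126.0, 1138.0],
})


def add_ratio_features(df):
    """Hypothetical helper: return a copy of df with per-household ratios."""
    out = df.copy()  # leave the caller's DataFrame untouched
    out["rooms_per_household"] = out["total_rooms"] / out["households"]
    out["bedrooms_per_room"] = out["total_bedrooms"] / out["total_rooms"]
    out["population_per_household"] = out["population"] / out["households"]
    return out


enriched = add_ratio_features(sample)
print(enriched[["rooms_per_household", "bedrooms_per_room", "population_per_household"]])
```

Rows with a missing `total_bedrooms` value would yield a missing `bedrooms_per_room`, so in practice the ratios are best computed after imputation.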

---

### ✅ Summary
| Category | Count | Examples |
|-----------|--------|-----------|
| **Numerical Features** | 9 (8 predictors + the target) | latitude, population, median_income |
| **Categorical Features** | 1 | ocean_proximity |
| **Target Feature** | 1 | median_house_value |

---


In [2]:
# Step 1: Import All Required Libraries

# Data manipulation and analysis
import pandas as pd
import numpy as np

# Data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Automated EDA (Exploratory Data Analysis)
from ydata_profiling import ProfileReport


#### Import the CSV Data as Pandas DataFrame

In [3]:
df = pd.read_csv('housing.csv')

In [4]:
df.head()

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
| 1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |
| 2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY |
| 3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY |
| 4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY |


In [6]:
df.columns

Index(['longitude', 'latitude', 'housing_median_age', 'total_rooms',
       'total_bedrooms', 'population', 'households', 'median_income',
       'median_house_value', 'ocean_proximity'],
      dtype='object')

In [7]:
df.shape

(20640, 10)

In [8]:
df.isna().sum()

longitude               0
latitude                0
housing_median_age      0
total_rooms             0
total_bedrooms        207
population              0
households              0
median_income           0
median_house_value      0
ocean_proximity         0
dtype: int64

In [9]:
df.duplicated().sum()

0

In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   longitude           20640 non-null  float64
 1   latitude            20640 non-null  float64
 2   housing_median_age  20640 non-null  float64
 3   total_rooms         20640 non-null  float64
 4   total_bedrooms      20433 non-null  float64
 5   population          20640 non-null  float64
 6   households          20640 non-null  float64
 7   median_income       20640 non-null  float64
 8   median_house_value  20640 non-null  float64
 9   ocean_proximity     20640 non-null  object 
dtypes: float64(9), object(1)
memory usage: 1.6+ MB


In [11]:
df.nunique()

longitude               844
latitude                862
housing_median_age       52
total_rooms            5926
total_bedrooms         1923
population             3888
households             1815
median_income         12928
median_house_value     3842
ocean_proximity           5
dtype: int64

In [12]:
df.describe()

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value |
|---|---|---|---|---|---|---|---|---|---|
| count | 20640.0 | 20640.0 | 20640.0 | 20640.0 | 20433.0 | 20640.0 | 20640.0 | 20640.0 | 20640.0 |
| mean | -119.569704 | 35.631861 | 28.639486 | 2635.763081 | 537.870553 | 1425.476744 | 499.53968 | 3.870671 | 206855.816909 |
| std | 2.003532 | 2.135952 | 12.585558 | 2181.615252 | 421.38507 | 1132.462122 | 382.329753 | 1.899822 | 115395.615874 |
| min | -124.35 | 32.54 | 1.0 | 2.0 | 1.0 | 3.0 | 1.0 | 0.4999 | 14999.0 |
| 25% | -121.8 | 33.93 | 18.0 | 1447.75 | 296.0 | 787.0 | 280.0 | 2.5634 | 119600.0 |
| 50% | -118.49 | 34.26 | 29.0 | 2127.0 | 435.0 | 1166.0 | 409.0 | 3.5348 | 179700.0 |
| 75% | -118.01 | 37.71 | 37.0 | 3148.0 | 647.0 | 1725.0 | 605.0 | 4.74325 | 264725.0 |
| max | -114.31 | 41.95 | 52.0 | 39320.0 | 6445.0 | 35682.0 | 6082.0 | 15.0001 | 500001.0 |


In [13]:
print("Categories in 'ocean_proximity' variable:     ",end="" )
print(df['ocean_proximity'].unique())

Categories in 'ocean_proximity' variable:     ['NEAR BAY' '<1H OCEAN' 'INLAND' 'NEAR OCEAN' 'ISLAND']


In [14]:
# define numerical & categorical columns
numeric_features = [feature for feature in df.columns if df[feature].dtype != 'O']
categorical_features = [feature for feature in df.columns if df[feature].dtype == 'O']

# print columns
print('We have {} numerical features : {}'.format(len(numeric_features), numeric_features))
print('\nWe have {} categorical features : {}'.format(len(categorical_features), categorical_features))

We have 9 numerical features : ['longitude', 'latitude', 'housing_median_age', 'total_rooms', 'total_bedrooms', 'population', 'households', 'median_income', 'median_house_value']

We have 1 categorical features : ['ocean_proximity']


In [15]:
profile = ProfileReport(
    df,
    title="California Housing Data Profiling Report",
    explorative=True,
    minimal=False,   # Set True for faster but lighter report
    correlations={"pearson": {"calculate": True}},
)

In [16]:
profile



In [17]:
# Alternatively, we can perform univariate, bivariate, and multivariate analysis manually.
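
A minimal sketch of what the three levels of manual analysis might look like with pandas alone. It assumes a small numeric sample (rows copied from the first records of the dataset) and is only one possible starting point; plots with matplotlib or seaborn would normally accompany these numbers.

```python
import pandas as pd

# Numeric columns from the first five records, used as sample input.
sample = pd.DataFrame({
    "housing_median_age": [41.0, 21.0, 52.0, 52.0, 52.0],
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0, 1627.0],
    "median_income": [8.3252, 8.3014, 7.2574, 5.6431, 3.8462],
    "median_house_value": [452600.0, 358500.0, 352100.0, 341300.0, 342200.0],
})

# Univariate: summarize the distribution of a single column
univariate = sample["median_income"].agg(["mean", "median", "std", "skew"])

# Bivariate: correlation of each predictor with the target
bivariate = sample.drop(columns="median_house_value").corrwith(
    sample["median_house_value"]
)

# Multivariate: pairwise correlation matrix across all numeric columns
multivariate = sample.corr()

print(univariate)
print(bivariate)
print(multivariate.round(2))
```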

In [18]:
df.head()

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
| 1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |
| 2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY |
| 3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY |
| 4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY |


In [20]:
# Fill the missing values in total_bedrooms with the column median.
# Assigning the result back avoids the chained inplace call, which pandas deprecates.
df['total_bedrooms'] = df['total_bedrooms'].fillna(df['total_bedrooms'].median())
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   longitude           20640 non-null  float64
 1   latitude            20640 non-null  float64
 2   housing_median_age  20640 non-null  float64
 3   total_rooms         20640 non-null  float64
 4   total_bedrooms      20640 non-null  float64
 5   population          20640 non-null  float64
 6   households          20640 non-null  float64
 7   median_income       20640 non-null  float64
 8   median_house_value  20640 non-null  float64
 9   ocean_proximity     20640 non-null  object 
dtypes: float64(9), object(1)
memory usage: 1.6+ MB


In [21]:
## With missing values filled, the data is ready for scaling and encoding
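
The scaling and encoding step is not shown in this notebook. As a rough sketch under stated assumptions, the predictors could be standardized to zero mean and unit variance and `ocean_proximity` one-hot encoded with pandas alone. `scale_and_encode` is a hypothetical helper, the sample rows are copied from the first records of the dataset, and the category list is the one printed earlier by `unique()`.

```python
import pandas as pd

# The five categories printed earlier for ocean_proximity
CATEGORIES = ["NEAR BAY", "<1H OCEAN", "INLAND", "NEAR OCEAN", "ISLAND"]
TARGET = "median_house_value"

# A few columns from the first five records, used as sample input.
sample = pd.DataFrame({
    "housing_median_age": [41.0, 21.0, 52.0, 52.0, 52.0],
    "median_income": [8.3252, 8.3014, 7.2574, 5.6431, 3.8462],
    "median_house_value": [452600.0, 358500.0, 352100.0, 341300.0, 342200.0],
    "ocean_proximity": ["NEAR BAY"] * 5,
})


def scale_and_encode(df):
    """Hypothetical helper: z-score the predictors and one-hot encode the category."""
    out = df.copy()
    # Scale every numeric column except the target
    numeric_cols = [c for c in out.select_dtypes("number").columns if c != TARGET]
    means = out[numeric_cols].mean()
    stds = out[numeric_cols].std(ddof=0)  # population standard deviation
    out[numeric_cols] = (out[numeric_cols] - means) / stds
    # Declare all known categories so each one gets a column even if absent here
    out["ocean_proximity"] = pd.Categorical(out["ocean_proximity"], categories=CATEGORIES)
    return pd.get_dummies(out, columns=["ocean_proximity"], dtype=float)


encoded = scale_and_encode(sample)
print(encoded.columns.tolist())
```

In a real modeling workflow the means and standard deviations would normally be computed on the training split only and then reused for the test split, to avoid leaking information; scikit-learn's `StandardScaler` and `OneHotEncoder` are a common alternative to this hand-rolled version.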