# 🏡 Min-Max Normalization Workshop
## Team Name: 2
## Team Members: Manu Mathew, Parth, Kumari Nikitha
---

## ❗ Why We Normalize: The Problem with Raw Feature Scales

In housing data, features like `Price` take values in the hundreds of thousands and `Lot_Size` in the thousands, while others like `Num_Bedrooms` range from 1 to 5. This creates problems for algorithms whose results depend on the numeric magnitude of each feature.

---

### ⚠️ What Goes Wrong Without Normalization

---

### 1. 🧭 K-Nearest Neighbors (KNN)

KNN uses the **Euclidean distance** formula:

$$
d = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2 + \cdots}
$$

**Example:**

- $ \text{Price}_1 = 650{,}000, \quad \text{Price}_2 = 250{,}000 $
- $ \text{Bedrooms}_1 = 3, \quad \text{Bedrooms}_2 = 2 $

Now compute squared differences:

$$
(\text{Price}_1 - \text{Price}_2)^2 = (650{,}000 - 250{,}000)^2 = (400{,}000)^2 = 1.6 \times 10^{11}
$$
$$
(\text{Bedrooms}_1 - \text{Bedrooms}_2)^2 = (3 - 2)^2 = 1
$$

➡️ **Price dominates the distance calculation**: its squared difference is about $10^{11}$ times larger, so `Bedrooms` has practically no influence on which neighbors are chosen.
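The arithmetic above can be checked with a short sketch. The helper name `euclidean_distance` and the variable names are illustrative choices, not part of the workshop material; the numbers are the two houses from the example.

```python
import math

# Two houses from the example above, as (Price, Bedrooms).
house_1 = (650_000, 3)
house_2 = (250_000, 2)


def euclidean_distance(a, b):
    """Euclidean distance between two equal-length numeric tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Per-feature squared differences, in the same order as the tuples.
squared_diffs = [(x - y) ** 2 for x, y in zip(house_1, house_2)]
distance = euclidean_distance(house_1, house_2)

# Fraction of the squared distance that comes from Price alone.
price_share = squared_diffs[0] / sum(squared_diffs)

print(squared_diffs)  # [160000000000, 1]
print(distance)
print(price_share)
```

The distance is essentially the price gap of 400,000; changing the bedroom count would barely move it.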

---

### 2. 📉 Linear Regression

Linear regression estimates:

$$
y = \beta_1 \cdot \text{Price} + \beta_2 \cdot \text{Bedrooms} + \beta_3 \cdot \text{Lot\_Size} + \epsilon
$$

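If `Price` has very large values and the coefficients are fitted with gradient descent:
- Gradient updates for $ \beta_1 $ (Price) will be **much larger**
- Gradient updates for $ \beta_2 $ (Bedrooms) will be **very small**

➡️ No single learning rate suits both coefficients, so training converges slowly or diverges, and the coefficient magnitudes cannot be compared directly.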
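To see why the gradients differ so much, here is a minimal sketch for a single training example under a squared-error loss $\tfrac{1}{2}(\hat{y} - y)^2$, whose gradient with respect to $\beta_j$ is $(\hat{y} - y)\,x_j$. The target value and the all-zero starting coefficients are hypothetical, chosen only to make the arithmetic easy to follow.

```python
# One training example with the raw feature values used in this notebook:
# (Price, Bedrooms, Lot_Size).
features = [650_000.0, 3.0, 8_000.0]

# Hypothetical target and starting coefficients (assumptions for illustration).
target = 1.0
betas = [0.0, 0.0, 0.0]

# Prediction and error for the linear model y_hat = sum(beta_j * x_j).
prediction = sum(b * x for b, x in zip(betas, features))
error = prediction - target

# Gradient of 0.5 * (y_hat - y)^2 with respect to each beta_j is error * x_j,
# so each gradient is proportional to the raw feature value.
gradients = [error * x for x in features]

print(gradients)  # [-650000.0, -3.0, -8000.0]
print(abs(gradients[0]) / abs(gradients[1]))  # ratio of Price to Bedrooms gradient
```

Because each gradient is the same error multiplied by the feature value, the ratio between the two gradients equals the ratio between the raw feature values.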

---

### 3. 🧠 Neural Networks

A single neuron computes:

$$
z = w_1 \cdot \text{Price} + w_2 \cdot \text{Bedrooms} + w_3 \cdot \text{Lot\_Size}
$$

If:

- $ \text{Price} = 650{,}000 $
- $ \text{Bedrooms} = 3 $
- $ \text{Lot\_Size} = 8{,}000 $

Then:

$$
z \approx w_1 \cdot 650{,}000 + w_2 \cdot 3 + w_3 \cdot 8{,}000
$$

➡️ Even with equal weights, `Price` contributes **almost all of the pre-activation $z$**, making it difficult for the network to learn from other features.
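A small sketch makes the imbalance concrete. It assumes equal weights of 1.0 for all three inputs (an illustrative choice; the notebook does not fix the weight values) and uses the feature values listed above.

```python
# Raw inputs from the example above.
inputs = {"Price": 650_000.0, "Bedrooms": 3.0, "Lot_Size": 8_000.0}

# Equal weights for every input (assumed value, for illustration only).
weights = {name: 1.0 for name in inputs}

# Each feature's term in z = w1*Price + w2*Bedrooms + w3*Lot_Size.
contributions = {name: weights[name] * value for name, value in inputs.items()}
z = sum(contributions.values())

# Fraction of z that comes from Price.
price_share = contributions["Price"] / z

print(z)            # 658003.0
print(price_share)
```

With these values, `Price` accounts for well over 98% of `z`, while `Bedrooms` contributes only 3 out of 658,003.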

---

### ✅ Solution: Min-Max Normalization

We apply the transformation:

$$
x_{\text{normalized}} = \frac{x - x_{\text{min}}}{x_{\text{max}} - x_{\text{min}}}
$$

This maps each feature's minimum to 0 and its maximum to 1, so all features share the common range $[0, 1]$.

| Feature      | Raw Value | Min     | Max     | Normalized Value |
|--------------|-----------|---------|---------|------------------|
| Price        | 650,000   | 250,000 | 800,000 | 0.727            |
| Bedrooms     | 3         | 1       | 5       | 0.50             |
| Lot_Size     | 8,000     | 3,000   | 10,000  | 0.714            |

➡️ Now all features lie on a comparable scale, so **no feature dominates** model training or distance comparisons because of its units alone.
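The table rows can be recomputed with a few lines of plain Python. The function name `min_max` is an illustrative choice; the raw, minimum, and maximum values are taken from the table.

```python
def min_max(x, x_min, x_max):
    """Apply (x - x_min) / (x_max - x_min) to a single value."""
    return (x - x_min) / (x_max - x_min)


# (raw value, min, max) for each feature, as listed in the table.
rows = {
    "Price": (650_000, 250_000, 800_000),
    "Bedrooms": (3, 1, 5),
    "Lot_Size": (8_000, 3_000, 10_000),
}

normalized_rows = {name: min_max(*values) for name, values in rows.items()}

for name, value in normalized_rows.items():
    print(name, round(value, 3))
# Price 0.727
# Bedrooms 0.5
# Lot_Size 0.714
```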

---

## 📌 Use Case: Housing Data
We are normalizing features from a real estate dataset to prepare it for machine learning analysis.

In [7]:
# 🔢 Load and display dataset
import pandas as pd
df = pd.read_csv('./data/housing_data.csv')
df.head()

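|   | House_ID | Price  | Area_sqft | Num_Bedrooms | Num_Bathrooms | Year_Built | Lot_Size |
|---|----------|--------|-----------|--------------|---------------|------------|----------|
| 0 | H100000  | 574507 | 1462      | 3            | 3             | 2002       | 4878     |
| 1 | H100001  | 479260 | 1727      | 2            | 2             | 1979       | 4943     |
| 2 | H100002  | 597153 | 1403      | 5            | 2             | 1952       | 5595     |
| 3 | H100003  | 728454 | 1646      | 5            | 2             | 1992       | 9305     |
| 4 | H100004  | 464876 | 853       | 1            | 1             | 1956       | 7407     |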


## Data Cleaning and Exploratory Data Analysis

In [8]:
# checking for missing values
print(df.isnull().sum())

# checking for duplicates
print(df.duplicated().sum())

House_ID         0
Price            0
Area_sqft        0
Num_Bedrooms     0
Num_Bathrooms    0
Year_Built       0
Lot_Size         0
dtype: int64
0


There are no missing values and no duplicate rows in the dataset.

In [11]:
# displaying basic information about the DataFrame
print("DataFrame Information:")
df.info()

# displaying descriptive statistics
print("\n Descriptive Statistics:")
df.describe()

DataFrame Information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 7 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   House_ID       2000 non-null   object
 1   Price          2000 non-null   int64 
 2   Area_sqft      2000 non-null   int64 
 3   Num_Bedrooms   2000 non-null   int64 
 4   Num_Bathrooms  2000 non-null   int64 
 5   Year_Built     2000 non-null   int64 
 6   Lot_Size       2000 non-null   int64 
dtypes: int64(6), object(1)
memory usage: 109.5+ KB

 Descriptive Statistics:


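|       | Price     | Area_sqft  | Num_Bedrooms | Num_Bathrooms | Year_Built | Lot_Size    |
|-------|-----------|------------|--------------|---------------|------------|-------------|
| count | 2000.0    | 2000.0     | 2000.0       | 2000.0        | 2000.0     | 2000.0      |
| mean  | 506896.1  | 1796.453   | 2.9835       | 1.966         | 1985.6895  | 6025.246    |
| std   | 147878.6  | 502.185109 | 1.409333     | 0.825945      | 21.159536  | 2008.527265 |
| min   | 100000.0  | 400.0      | 1.0          | 1.0           | 1950.0     | 1000.0      |
| 25%   | 406600.2  | 1445.0     | 2.0          | 1.0           | 1967.0     | 4664.0      |
| 50%   | 506703.0  | 1799.5     | 3.0          | 2.0           | 1986.0     | 6010.5      |
| 75%   | 602445.8  | 2132.0     | 4.0          | 3.0           | 2003.0     | 7414.0      |
| max   | 1077909.0 | 3763.0     | 5.0          | 3.0           | 2022.0     | 13088.0     |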


### 🔎 Step 1 — Implement Min-Max Normalization on the Housing Dataset

In [10]:
# ✍️ Implement Min-Max Normalization manually here (no sklearn/numpy)
# Normalize: Price, Area_sqft, Num_Bedrooms, Num_Bathrooms, Lot_Size
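
One possible way to approach this step, sketched in plain Python with no sklearn or numpy. The sample values are the five rows shown by `df.head()` above; the helper name `min_max_column` and the choice to map a constant column to 0.0 are assumptions for illustration, not requirements of the exercise.

```python
# The five columns named in the exercise, filled with the rows from df.head().
sample = {
    "Price": [574507, 479260, 597153, 728454, 464876],
    "Area_sqft": [1462, 1727, 1403, 1646, 853],
    "Num_Bedrooms": [3, 2, 5, 5, 1],
    "Num_Bathrooms": [3, 2, 2, 2, 1],
    "Lot_Size": [4878, 4943, 5595, 9305, 7407],
}


def min_max_column(values):
    """Scale a list of numbers to [0, 1] using its own min and max.

    A constant column has zero range; here it is mapped to all 0.0
    (an assumed convention) to avoid dividing by zero.
    """
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        return [0.0 for _ in values]
    return [(v - lo) / span for v in values]


# Normalize every column independently.
normalized = {name: min_max_column(col) for name, col in sample.items()}

print(normalized["Num_Bedrooms"])  # [0.5, 0.25, 1.0, 1.0, 0.0]
```

On the full dataset, each list could come from `df[column].tolist()`, in which case the minimum and maximum are taken over all 2000 rows (for example 100,000 and 1,077,909 for `Price`, per the descriptive statistics) rather than over these five.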

### 🔎 Talking Point 1 — [Insert your review comment here]

Reviewed by:
- Name
- Name
- Name