# Model to predict House Prices

## Steps to Build the Model
1. Define the Problem Statement
2. Prepare the Data
3. Split the data into features and target values
4. Normalize/Standardize the Features
5. Split the data into training and testing
6. Choose and Train the Model
7. Make Predictions
8. Evaluate the Model (Measure the error or loss)
9. Adjust parameters (weight and bias) to reduce the error
10. Repeat Steps 7 to 9 until the error is minimized (converged)
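
The last four steps in the list describe an iterative loop: predict, measure the error, adjust the weight and bias, repeat. That loop is gradient descent. Below is a minimal sketch on made-up data; the arrays `x` and `y`, the learning rate `lr`, and the step count are assumptions chosen for illustration and are not taken from the notebook. Note that scikit-learn's `LinearRegression`, used later, solves the least-squares problem directly instead of iterating like this.

```python
import numpy as np

# Illustrative data (assumed): one standardized feature, price in thousands.
x = np.array([-1.5, -0.5, 0.0, 0.5, 1.5])
y = np.array([200.0, 230.0, 250.0, 265.0, 305.0])

w, b = 0.0, 0.0   # weight and bias, both starting at zero
lr = 0.1          # learning rate (assumed value)

for step in range(500):
    y_pred = w * x + b               # make predictions
    error = y_pred - y               # residuals
    loss = np.mean(error ** 2)       # evaluate: mean squared error
    grad_w = 2 * np.mean(error * x)  # gradient of the loss w.r.t. w
    grad_b = 2 * np.mean(error)      # gradient of the loss w.r.t. b
    w -= lr * grad_w                 # adjust parameters against the gradient
    b -= lr * grad_b

print(round(w, 4), round(b, 4), round(loss, 4))
```

On this toy data the loop settles at roughly w = 35 and b = 250, the same line an ordinary least-squares fit gives.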

### Define the Problem Statement

The objective is to build a model that predicts the price of a house from the given input parameters.

Input parameters: Size (in sq ft), number of bedrooms, and City

### Prepare the Data

In [4]:
import pandas as pd
data = {
    'Size': [1000, 1500, 2000, 1200, 1800, 1400, 1600, 1300, 1700, 1100],
    'Bedrooms': [2, 3, 4, 2, 3, 3, 3, 2, 3, 2],
    'City': ['Mumbai', 'Bangalore', 'Mumbai', 'Pune', 'Bangalore', 'Mumbai', 'Pune', 'Mumbai', 'Bangalore', 'Pune'],
    'Price': [200000, 250000, 310000, 220000, 290000, 240000, 270000, 210000, 280000, 215000]
}

df = pd.DataFrame(data)
print(df)

   Size  Bedrooms       City   Price
0  1000         2     Mumbai  200000
1  1500         3  Bangalore  250000
2  2000         4     Mumbai  310000
3  1200         2       Pune  220000
4  1800         3  Bangalore  290000
5  1400         3     Mumbai  240000
6  1600         3       Pune  270000
7  1300         2     Mumbai  210000
8  1700         3  Bangalore  280000
9  1100         2       Pune  215000


#### Encode the categorical feature (City)
Why do we need to encode categorical columns like City?
Most machine learning models (linear regression, SVM, and scikit-learn's decision trees among them) work only with numerical values. They cannot directly interpret text labels like "Mumbai", "Pune", or "Bangalore".

Solution: Convert Categories to Numbers
We use **One-Hot Encoding**, which:
1. Creates a separate binary column for each city.
2. Puts 1 where that city is present, 0 otherwise.

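We drop one column using drop_first=True to avoid redundancy (the dummy variable trap). pandas drops the first category in sorted order, which here is City_Bangalore, so a row where both City_Mumbai and City_Pune are False represents Bangalore.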

In [6]:
df_encoded = pd.get_dummies(df, columns=['City'], drop_first=True)
print(df_encoded)

   Size  Bedrooms   Price  City_Mumbai  City_Pune
0  1000         2  200000         True      False
1  1500         3  250000        False      False
2  2000         4  310000         True      False
3  1200         2  220000        False       True
4  1800         3  290000        False      False
5  1400         3  240000         True      False
6  1600         3  270000        False       True
7  1300         2  210000         True      False
8  1700         3  280000        False      False
9  1100         2  215000        False       True
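
As a quick check of which category `drop_first=True` removes, the sketch below encodes only the City column and uses `dtype=int` so the dummy columns print as 0/1 instead of True/False. The names `cities`, `encoded`, and `baseline_rows` are illustrative and not part of the notebook.

```python
import pandas as pd

cities = pd.DataFrame({
    'City': ['Mumbai', 'Bangalore', 'Mumbai', 'Pune', 'Bangalore',
             'Mumbai', 'Pune', 'Mumbai', 'Bangalore', 'Pune']
})

# drop_first=True removes the first category in sorted order (Bangalore here).
# dtype=int yields 0/1 columns instead of booleans.
encoded = pd.get_dummies(cities, columns=['City'], drop_first=True, dtype=int)
print(list(encoded.columns))

# Rows where every dummy column is 0 belong to the dropped baseline category.
baseline_rows = cities.loc[encoded.sum(axis=1) == 0, 'City'].unique()
print(baseline_rows)
```

The remaining columns are City_Mumbai and City_Pune, and the all-zero rows are the Bangalore rows.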


### Split the data into features and target values

In [10]:
X = df_encoded.drop('Price', axis=1) # axis=1 means "drop column" (not row). If you used axis=0, it would try to drop a row.
Y = df_encoded['Price']
print(f"Features Data: \n {X}")
print("\n\n")
print(f"Target Data: \n {Y}")

Features Data: 
    Size  Bedrooms  City_Mumbai  City_Pune
0  1000         2         True      False
1  1500         3        False      False
2  2000         4         True      False
3  1200         2        False       True
4  1800         3        False      False
5  1400         3         True      False
6  1600         3        False       True
7  1300         2         True      False
8  1700         3        False      False
9  1100         2        False       True



Target Data: 
 0    200000
1    250000
2    310000
3    220000
4    290000
5    240000
6    270000
7    210000
8    280000
9    215000
Name: Price, dtype: int64


### Normalize/Standardize the Features
Normalization (here, standardization) rescales each feature to zero mean and unit variance so that all input features contribute to training on a comparable scale. This matters most when features are on very different scales.

**Problem Without Normalization:**

Let’s say you have:  
Size in 1000s (e.g., 1500, 2000, 2500)  
Bedrooms in single digits (e.g., 2, 3, 4)

If you don't normalize:  
Size can dominate gradient-based or regularized training simply because its values are much larger.  
The model may then underweight smaller-scale features like Bedrooms, even if they matter.
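
To make the rescaling concrete, here is a minimal sketch that standardizes Size and Bedrooms by hand using z = (x - mean) / std. It assumes the population standard deviation (`ddof=0`), which is what `StandardScaler` uses. The helper `standardize` is a hypothetical name introduced for this example.

```python
import numpy as np

size = np.array([1000, 1500, 2000, 1200, 1800, 1400, 1600, 1300, 1700, 1100], dtype=float)
bedrooms = np.array([2, 3, 4, 2, 3, 3, 3, 2, 3, 2], dtype=float)

def standardize(values):
    """Return z-scores: (value - mean) / population standard deviation."""
    return (values - values.mean()) / values.std()  # np.std defaults to ddof=0

size_z = standardize(size)
bedrooms_z = standardize(bedrooms)

# Before scaling, the two ranges differ by about three orders of magnitude.
print(size.min(), size.max(), bedrooms.min(), bedrooms.max())

# After scaling, both features have mean 0 and standard deviation 1.
print(size_z.round(4))
print(bedrooms_z.round(4))
```

These values should agree with the first two columns of the `StandardScaler` output shown in this section.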

In [17]:
from sklearn.preprocessing import StandardScaler
# from sklearn.preprocessing import OneHotEncoder
# from sklearn.compose import ColumnTransformer
# from sklearn.pipeline import Pipeline

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# # Define the column transformer (OneHotEncoder for 'City', StandardScaler for numerical columns)
# preprocessor = ColumnTransformer(
#     transformers=[
#         ('city', OneHotEncoder(), ['City']),  # OneHotEncoder for 'City'
#         ('num', StandardScaler(), ['Size', 'Bedrooms'])  # StandardScaler for 'Size' and 'Bedrooms'
#     ])

# # Create a pipeline with preprocessing steps
# pipeline = Pipeline(steps=[
#     ('preprocessor', preprocessor)
# ])

# # # Fit and transform the features using the pipeline
# X_scaled = pipeline.fit_transform(X)

print(X_scaled)

[[-1.5132889  -1.09321633  1.22474487 -0.65465367]
 [ 0.13159034  0.46852129 -0.81649658 -0.65465367]
 [ 1.77646958  2.0302589   1.22474487 -0.65465367]
 [-0.8553372  -1.09321633 -0.81649658  1.52752523]
 [ 1.11851788  0.46852129 -0.81649658 -0.65465367]
 [-0.19738551  0.46852129  1.22474487 -0.65465367]
 [ 0.46056619  0.46852129 -0.81649658  1.52752523]
 [-0.52636136 -1.09321633  1.22474487 -0.65465367]
 [ 0.78954203  0.46852129 -0.81649658 -0.65465367]
 [-1.18431305 -1.09321633 -0.81649658  1.52752523]]
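
The commented-out `ColumnTransformer` code in the cell above expects a raw `City` column, but `X` has already been one-hot encoded, so it would not run as written. A possible working variant, offered as a sketch and not as the notebook's method, starts from the raw columns. It differs from the cell above in one respect: the City dummies stay as 0/1 and only Size and Bedrooms are standardized. The names `raw`, `preprocessor`, and `features` are illustrative.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

raw = pd.DataFrame({
    'Size': [1000, 1500, 2000, 1200, 1800, 1400, 1600, 1300, 1700, 1100],
    'Bedrooms': [2, 3, 4, 2, 3, 3, 3, 2, 3, 2],
    'City': ['Mumbai', 'Bangalore', 'Mumbai', 'Pune', 'Bangalore',
             'Mumbai', 'Pune', 'Mumbai', 'Bangalore', 'Pune'],
})

preprocessor = ColumnTransformer(
    transformers=[
        # One-hot encode City, dropping the first category (Bangalore).
        ('city', OneHotEncoder(drop='first'), ['City']),
        # Standardize the numeric columns only.
        ('num', StandardScaler(), ['Size', 'Bedrooms']),
    ],
    sparse_threshold=0,  # always return a dense NumPy array
)

# Output column order: City_Mumbai, City_Pune, scaled Size, scaled Bedrooms.
features = preprocessor.fit_transform(raw)
print(features.shape)
print(features[:3].round(4))
```

Leaving the dummy columns unscaled keeps them directly interpretable as "is this city"; whether to scale them is a design choice.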


### Split the data into Train and Test

The split uses:
1. 80% for training
2. 20% for testing

random_state=42 ensures reproducibility.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

X_train, X_test, Y_train, Y_test = train_test_split(
    X_scaled, Y, test_size=0.2, random_state=42
)
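
The notebook scales all ten rows before splitting, so the test rows influence the scaler's mean and variance. An alternative sketch, assuming you want the test set kept out of preprocessing, splits first and fits the scaler on the training rows only. The names `X_raw`, `X_train_s`, and `X_test_s` are illustrative, chosen so they do not overwrite the notebook's variables.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    'Size': [1000, 1500, 2000, 1200, 1800, 1400, 1600, 1300, 1700, 1100],
    'Bedrooms': [2, 3, 4, 2, 3, 3, 3, 2, 3, 2],
    'City': ['Mumbai', 'Bangalore', 'Mumbai', 'Pune', 'Bangalore',
             'Mumbai', 'Pune', 'Mumbai', 'Bangalore', 'Pune'],
    'Price': [200000, 250000, 310000, 220000, 290000,
              240000, 270000, 210000, 280000, 215000],
})
encoded = pd.get_dummies(df, columns=['City'], drop_first=True, dtype=float)
X_raw = encoded.drop('Price', axis=1)
y = encoded['Price']

# Split the unscaled features first.
X_train_raw, X_test_raw, y_train, y_test = train_test_split(
    X_raw, y, test_size=0.2, random_state=42
)

# Fit the scaler on training rows only, then reuse its statistics on the test rows.
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train_raw)
X_test_s = scaler.transform(X_test_raw)

print(X_train_s.shape, X_test_s.shape)
```

With ten rows and `test_size=0.2`, this gives 8 training rows and 2 test rows.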

### Choose and Train the Model
model = LinearRegression()
model.fit(X_train, Y_train)

