
Training and Testing Data in Regression: House Price Prediction

![Pic](https://builtin.com/sites/www.builtin.com/files/styles/ckeditor_optimize/public/inline-images/4_train-test-split.jpg)

# 🏠 House Price Prediction Model with Multiple Features

## Why do we split the data into train and test?
---

When we build a machine learning model, our goal is not just  
to make it work perfectly on the current data, but to make sure  
it can predict prices well for **new, unseen houses** too.

If we train and test the model on the same data, its score will  
look **unrealistically good**, because the model is being graded on  
**answers it has already seen**.  
New houses, however, will not match the training rows exactly.

That's why we use **`train_test_split`**:  
✅ We train the model on one part of the data (**training set**).  
✅ We test the model on a separate part (**testing set**).  

This helps us check if the model is **genuinely learning patterns**  
or just **memorizing**.
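
To make the "memorizing" problem concrete, here is a minimal sketch on synthetic data (not the house dataset). It swaps in a `DecisionTreeRegressor` instead of linear regression, because an unconstrained tree can memorize its training rows; the feature, the noise level, and all variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption, not the house dataset):
# one feature and a noisy linear target.
rng = np.random.default_rng(0)
X = rng.uniform(500, 4000, size=(300, 1))              # e.g. living area
y = 150 * X[:, 0] + rng.normal(0, 100_000, size=300)   # noisy "price"

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# An unconstrained tree can memorize every training row.
memorizer = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)

train_r2 = memorizer.score(X_train, y_train)  # scored on rows it has seen
test_r2 = memorizer.score(X_test, y_test)     # scored on unseen rows

print(f"R^2 on training rows: {train_r2:.3f}")
print(f"R^2 on unseen rows:   {test_r2:.3f}")
```

The training score looks close to perfect, while the score on unseen rows is noticeably lower. Only the second number says anything about new data.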

---

## How does train_test_split work?
---

Our dataset has **4600 house records**.  
In this notebook we split the data with `test_size=0.1`:  
- **90% (4140 rows)** for training the model.  
- **10% (460 rows)** for testing the model.  

An 80/20 split (3680 and 920 rows) is another common choice.

The **training data** helps the model learn the relationship  
between features like **bedrooms, bathrooms, floors, etc.**, and **price**.

The **testing data** is kept aside to check how well the model predicts  
prices for houses it has **never seen before**.
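
The row counts for a given `test_size` can be checked without the real data. The sketch below uses placeholder arrays with 4600 rows; the array contents and the `split_sizes` dictionary are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 4600
X = np.arange(n_rows).reshape(-1, 1)  # placeholder feature column
y = np.arange(n_rows)                 # placeholder target

split_sizes = {}
for test_size in (0.2, 0.1):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=42
    )
    # Record (training rows, testing rows) for this test_size.
    split_sizes[test_size] = (len(X_train), len(X_test))
    print(f"test_size={test_size}: {len(X_train)} train rows, {len(X_test)} test rows")
```

Whatever fraction is chosen, the two parts always add up to the full 4600 rows.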

---

## Why is this important?
---

This method lets us detect **overfitting**, which means the model  
performs well on known data but **fails on new data**.

In simple words:  
- **Training data ➡️ Learn the pattern.**  
- **Testing data ➡️ Check the model's skills on fresh questions.**

---

## ✅ Summary:
---

🏡 **`train_test_split`** is a necessary step to:  
✅ Make our model **reliable**.  
✅ Measure **real-world performance**.  
✅ Avoid **misleadingly high scores**.  
✅ Ensure the model **generalizes well** to new houses.  

---


# Let's Code:

In [76]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

In [77]:
#  Let's Load the dataset
df = pd.read_csv('../Data/HousePriceData.csv')
df.head()

|  | date | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | condition | sqft_above | sqft_basement | yr_built | yr_renovated | street | city | statezip | country |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2014-05-02 00:00:00 | 313000.0 | 3.0 | 1.5 | 1340 | 7912 | 1.5 | 0 | 0 | 3 | 1340 | 0 | 1955 | 2005 | 18810 Densmore Ave N | Shoreline | WA 98133 | USA |
| 1 | 2014-05-02 00:00:00 | 2384000.0 | 5.0 | 2.5 | 3650 | 9050 | 2.0 | 0 | 4 | 5 | 3370 | 280 | 1921 | 0 | 709 W Blaine St | Seattle | WA 98119 | USA |
| 2 | 2014-05-02 00:00:00 | 342000.0 | 3.0 | 2.0 | 1930 | 11947 | 1.0 | 0 | 0 | 4 | 1930 | 0 | 1966 | 0 | 26206-26214 143rd Ave SE | Kent | WA 98042 | USA |
| 3 | 2014-05-02 00:00:00 | 420000.0 | 3.0 | 2.25 | 2000 | 8030 | 1.0 | 0 | 0 | 4 | 1000 | 1000 | 1963 | 0 | 857 170th Pl NE | Bellevue | WA 98008 | USA |
| 4 | 2014-05-02 00:00:00 | 550000.0 | 4.0 | 2.5 | 1940 | 10500 | 1.0 | 0 | 0 | 4 | 1140 | 800 | 1976 | 1992 | 9105 170th Ave NE | Redmond | WA 98052 | USA |


In [78]:
#  Information about The Whole Data
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4600 entries, 0 to 4599
Data columns (total 18 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   date           4600 non-null   object 
 1   price          4600 non-null   float64
 2   bedrooms       4600 non-null   float64
 3   bathrooms      4600 non-null   float64
 4   sqft_living    4600 non-null   int64  
 5   sqft_lot       4600 non-null   int64  
 6   floors         4600 non-null   float64
 7   waterfront     4600 non-null   int64  
 8   view           4600 non-null   int64  
 9   condition      4600 non-null   int64  
 10  sqft_above     4600 non-null   int64  
 11  sqft_basement  4600 non-null   int64  
 12  yr_built       4600 non-null   int64  
 13  yr_renovated   4600 non-null   int64  
 14  street         4600 non-null   object 
 15  city           4600 non-null   object 
 16  statezip       4600 non-null   object 
 17  country        4600 non-null   object 
dtypes: float64(4), int64(9), object(5)

In [79]:
# Let's choose more than one independent variable (feature) for better training.
# Here we pick four numeric columns as features and price as the target.
X = df[['bedrooms','floors','condition','sqft_living']]
y = df['price']
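
The double brackets matter here: `df[[...]]` returns a 2-D DataFrame of features, while `df['price']` returns a 1-D Series. A small sketch using the five rows printed by `df.head()` above (the names `sample`, `X_sample`, and `y_sample` are made up for this illustration):

```python
import pandas as pd

# The five rows shown by df.head(), restricted to the columns used here.
sample = pd.DataFrame({
    "price": [313000.0, 2384000.0, 342000.0, 420000.0, 550000.0],
    "bedrooms": [3.0, 5.0, 3.0, 3.0, 4.0],
    "floors": [1.5, 2.0, 1.0, 1.0, 1.0],
    "condition": [3, 5, 4, 4, 4],
    "sqft_living": [1340, 3650, 1930, 2000, 1940],
})

# A list of column names selects a 2-D DataFrame (the feature matrix).
X_sample = sample[["bedrooms", "floors", "condition", "sqft_living"]]
# A single column name selects a 1-D Series (the target).
y_sample = sample["price"]

print(type(X_sample).__name__, X_sample.shape)
print(type(y_sample).__name__, y_sample.shape)
```

scikit-learn estimators expect exactly this layout: a 2-D feature matrix and a 1-D target with the same number of rows.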

In [80]:
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(X,y,test_size=0.1,random_state=42)
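
`random_state=42` fixes the shuffle, so the same rows land in the test set on every run. The sketch below checks this on a stand-in array of row numbers; the second seed (`7`) and all variable names are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

row_ids = np.arange(4600)  # stand-in for the DataFrame's row index

# Same seed twice, then a different seed.
train_a, test_a = train_test_split(row_ids, test_size=0.1, random_state=42)
train_b, test_b = train_test_split(row_ids, test_size=0.1, random_state=42)
train_c, test_c = train_test_split(row_ids, test_size=0.1, random_state=7)

same_seed_matches = np.array_equal(test_a, test_b)
different_seed_matches = np.array_equal(test_a, test_c)
overlap = set(train_a) & set(test_a)  # rows present in both parts

print("same seed, same test rows:     ", same_seed_matches)
print("different seed, same test rows:", different_seed_matches)
print("rows in both train and test:   ", len(overlap))
```

No row appears in both parts, which is what keeps the test score honest.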

In [81]:
x_train.info()

<class 'pandas.core.frame.DataFrame'>
Index: 4140 entries, 1074 to 860
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   bedrooms     4140 non-null   float64
 1   floors       4140 non-null   float64
 2   condition    4140 non-null   int64  
 3   sqft_living  4140 non-null   int64  
dtypes: float64(2), int64(2)
memory usage: 161.7 KB


In [82]:
y_train.info()

<class 'pandas.core.series.Series'>
Index: 4140 entries, 1074 to 860
Series name: price
Non-Null Count  Dtype  
--------------  -----  
4140 non-null   float64
dtypes: float64(1)
memory usage: 64.7 KB


In [83]:
x_test.info()

<class 'pandas.core.frame.DataFrame'>
Index: 460 entries, 3683 to 203
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   bedrooms     460 non-null    float64
 1   floors       460 non-null    float64
 2   condition    460 non-null    int64  
 3   sqft_living  460 non-null    int64  
dtypes: float64(2), int64(2)
memory usage: 18.0 KB


In [84]:
y_test.info()

<class 'pandas.core.series.Series'>
Index: 460 entries, 3683 to 203
Series name: price
Non-Null Count  Dtype  
--------------  -----  
460 non-null    float64
dtypes: float64(1)
memory usage: 7.2 KB


In [85]:
# Let's create the model first
model = LinearRegression()

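Train the model so that it can find the best-fit linear relationship
(with several features this is a plane rather than a single line) between
X -> BEDROOMS, FLOORS, CONDITION, SQFT_LIVING and y -> PRICE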


In [86]:
model.fit(x_train, y_train) # The model is trained 
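
After fitting, a linear regression is just an intercept plus one coefficient per feature: price ≈ intercept + coef₁·bedrooms + coef₂·floors + coef₃·condition + coef₄·sqft_living. Here is a rough sketch on synthetic, noise-free data, where the "true" coefficients are made-up numbers, showing that `coef_` and `intercept_` recover them.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
# Hypothetical features in the same order as X: bedrooms, floors, condition, sqft_living.
features = np.column_stack([
    rng.integers(1, 6, size=200),
    rng.integers(1, 4, size=200),
    rng.integers(1, 6, size=200),
    rng.uniform(500, 4000, size=200),
])
true_coef = np.array([10_000.0, 5_000.0, 20_000.0, 250.0])  # made-up values
true_intercept = 50_000.0
prices = true_intercept + features @ true_coef  # noise-free on purpose

demo_model = LinearRegression().fit(features, prices)
print("coefficients:", demo_model.coef_)
print("intercept:   ", demo_model.intercept_)
```

On the real model, `model.coef_` and `model.intercept_` can be read the same way. Each coefficient is the predicted price change for a one-unit increase in that feature while the others stay fixed.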

In [87]:
# Let's find the R² score of the model on the test set
# (for a regressor, score() returns R², not classification accuracy)
model.score(x_test,y_test)

0.47109640602920755
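
For a regressor, `score` returns the coefficient of determination, R² = 1 − SS_res / SS_tot. The sketch below recomputes it by hand on synthetic data (an assumption, since the CSV is not reproduced here) and compares it with `score`; it also reports the mean absolute error, which is in the same units as the target.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: one feature and a noisy linear target.
rng = np.random.default_rng(1)
X = rng.uniform(500, 4000, size=(500, 1))
y = 200 * X[:, 0] + rng.normal(0, 150_000, size=500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=42
)
reg = LinearRegression().fit(X_train, y_train)

y_pred = reg.predict(X_test)
ss_res = np.sum((y_test - y_pred) ** 2)          # squared errors of the model
ss_tot = np.sum((y_test - y_test.mean()) ** 2)   # squared errors of "always predict the mean"
manual_r2 = 1 - ss_res / ss_tot
mae = np.mean(np.abs(y_test - y_pred))           # average absolute error

print("manual R^2:   ", manual_r2)
print("reg.score R^2:", reg.score(X_test, y_test))
print("mean absolute error:", mae)
```

Read this way, the score of about 0.47 above means the four features explain roughly 47% of the variation in the test-set prices.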