<a href="https://colab.research.google.com/github/salarMokhtariL/House-Prices/blob/main/house_price_prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Empowering House Price Prediction through Effective Data Preprocessing and Linear Regression Modeling
> By Salar Mokhtari Laleh

This notebook shows how data preprocessing supports a linear regression model for predicting house prices. It walks through removing outliers, filling missing values, and one-hot encoding categorical variables, then fits the model and evaluates it on a validation set. Linear regression is used here for its simplicity and interpretability.

# Importing Required Libraries
Import the libraries needed for data cleaning, preprocessing, and model building.

In [1]:
import pandas as pd
import numpy as np

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# Data Cleaning and Preprocessing
## Loading the Dataset
Load the train and test datasets using pandas.



In [2]:
# Train datasets

train_data = pd.read_csv("https://raw.githubusercontent.com/salarMokhtariL/House-Prices/main/Dataset/train.csv")

# Test datasets

test_data = pd.read_csv("https://raw.githubusercontent.com/salarMokhtariL/House-Prices/main/Dataset/test.csv")

In [3]:
train_data.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000


## Removing Outliers
Remove outliers from the training data so that they do not skew the model. Here, rows with a `GrLivArea` of 4500 or more are dropped.

In [4]:
# Remove outliers

train_data = train_data[train_data.GrLivArea < 4500]
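
To make the effect of this filter concrete, here is a minimal sketch on a made-up five-row frame. The values are invented for illustration and are not taken from the dataset; only the column names and the 4500 threshold come from the notebook.

```python
import pandas as pd

# Hypothetical toy data: the values are invented for illustration only.
toy = pd.DataFrame({
    "GrLivArea": [1710, 1262, 4676, 5642, 2198],
    "SalePrice": [208500, 181500, 184750, 160000, 250000],
})

# Same rule as in the cell above: keep rows with GrLivArea below 4500.
filtered = toy[toy["GrLivArea"] < 4500]

removed = len(toy) - len(filtered)
print(f"Removed {removed} of {len(toy)} rows")
```

Printing the number of removed rows is a quick way to check that the threshold drops only a handful of extreme points rather than a large share of the data.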

## Handling Missing Values
Handle the missing values in the dataset: fill the numerical columns with their mean and the categorical columns with their mode.

In [5]:
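# Handle missing values: numeric columns get the column mean,
# remaining (categorical) columns get the most frequent value.

train_data = train_data.fillna(train_data.mean(numeric_only=True))
train_data = train_data.fillna(train_data.mode().iloc[0])
test_data = test_data.fillna(test_data.mean(numeric_only=True))
test_data = test_data.fillna(test_data.mode().iloc[0])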



## Encoding Categorical Variables
Encode the categorical variables in the dataset using one-hot encoding.

In [6]:
# Encode categorical variables

train_data = pd.get_dummies(train_data)
test_data = pd.get_dummies(test_data)

In [7]:
# Align the columns of the train and test data

train_cols = set(train_data.columns)
test_cols = set(test_data.columns)
missing_cols = train_cols - test_cols
for col in missing_cols:
    test_data[col] = 0
test_data = test_data[train_data.columns]
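
Alignment is needed because a category that appears in only one of the two files produces a dummy column in only one frame. The following sketch, on hypothetical toy frames, shows the mismatch and one possible alternative to the loop above: `DataFrame.reindex` with `fill_value=0`. This is a variation for illustration, not the notebook's own method.

```python
import pandas as pd

# Hypothetical toy frames: "RM" appears only in train, "FV" only in test.
train = pd.DataFrame({"LotArea": [8450, 9600, 11250], "MSZoning": ["RL", "RM", "RL"]})
test = pd.DataFrame({"LotArea": [11622, 14267], "MSZoning": ["RL", "FV"]})

train_enc = pd.get_dummies(train)
test_enc = pd.get_dummies(test)

# reindex adds train-only columns filled with 0, drops test-only columns,
# and puts the result in the same column order as the training frame.
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(list(test_aligned.columns))
```

After alignment the test frame has exactly the training columns, so a model fitted on the training frame can be applied to it directly.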

## Splitting the Data
Next, we split the training data into training and validation sets using `train_test_split` from `sklearn`, holding out 20% of the rows for validation.

In [8]:
X_train, X_val, y_train, y_val = train_test_split(train_data.drop('SalePrice', axis=1), 
                                                  train_data['SalePrice'], test_size=0.2, random_state=42)
     

# Model Building
Now that we have preprocessed the data, we can move on to building our model.

## Training the Model
We will use `LinearRegression` from `sklearn` to build our model and fit it to the train data.

In [9]:
# Train the model

reg = LinearRegression()
reg.fit(X_train, y_train)

## Model Evaluation
Next, we will evaluate our model on the validation set using the mean squared error and R-squared metrics.

In [10]:
# Evaluate the model on the validation set

y_val_pred = reg.predict(X_val)
mse = mean_squared_error(y_val, y_val_pred)
r2 = r2_score(y_val, y_val_pred)
print("Mean Squared Error:", mse)
print("R-squared:", r2)

Mean Squared Error: 660957896.6883367
R-squared: 0.8803420491632767
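
The mean squared error is in squared price units, which makes it hard to read. As a small sketch, taking the square root of the value reported above gives the root mean squared error (RMSE), which is in the same units as `SalePrice`.

```python
import math

# MSE as reported in the output above.
mse = 660957896.6883367

# RMSE is the square root of the MSE, in the same units as SalePrice.
rmse = math.sqrt(mse)
print(f"RMSE: {rmse:,.0f}")
```

An RMSE of about 25,700 means the typical validation error is on the order of tens of thousands in price units, which can be compared directly with the sale prices shown in the data preview.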


# Making Predictions
Finally, we will use our trained model to make predictions on the test data.

In [11]:
# Make predictions on the test data

test_data = test_data.drop('SalePrice', axis=1)  # Drop the placeholder SalePrice column added during alignment
y_test_pred = reg.predict(test_data)

In [12]:
# Save predictions to a CSV file
submission_df = pd.DataFrame({'Id': test_data.Id, 'SalePrice': y_test_pred})
submission_df.to_csv('submission.csv', index=False)