# Implement a linear regression model to predict the prices of houses based on their square footage and the number of bedrooms and bathrooms.

* Load the data: We'll load the training data from train.csv.
* Explore the data: We'll take a quick look at the dataset to understand its structure.
* Preprocess the data: We'll clean and preprocess the data.
* Train the model: We'll use the relevant features to train a linear regression model.

In [1]:
import pandas as pd

# Load the training data
train_data_path = 'train.csv'
train_df = pd.read_csv(train_data_path)

# Display the first few rows of the dataset
train_df.head()


Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000


#### The dataset contains 81 columns, which includes a variety of features related to house attributes and the target variable SalePrice.

#### For our linear regression model, we'll focus on the following features:

* 'GrLivArea' (Above grade (ground) living area square feet)
* 'BedroomAbvGr' (Number of bedrooms above grade)
* 'FullBath' (Number of full bathrooms above grade)

#### Preprocessing the Data
We'll perform the following preprocessing steps:

* Select the relevant features and target variable.
* Handle any missing values.
* Split the data into training and testing sets.

* Let's proceed with these steps:

In [2]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Select relevant features and target variable
features = ['GrLivArea', 'BedroomAbvGr', 'FullBath']
target = 'SalePrice'
X = train_df[features]
y = train_df[target]

# Handle missing values by filling with the median value
X = X.fillna(X.median())

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the linear regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions on the testing set
y_pred = model.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

mse, r2


(2806426667.247853, 0.6341189942328371)

#### The model's performance metrics are as follows:

* Mean Squared Error (MSE): 2,806,426,667.25
* R-squared (R²): 0.6341

#### Summary
* MSE indicates the average squared difference between the observed and predicted values. A lower MSE indicates a better fit.
* R² indicates the proportion of variance in the dependent variable that is predictable from the independent variables. An R² value of 0.6341 means that approximately 63.41% of the variance in house prices is explained by the model.

In [3]:
# Select additional relevant features
additional_features = ['OverallQual', 'YearBuilt', 'TotalBsmtSF', 'GarageCars']
all_features = features + additional_features
X = train_df[all_features]

# Handle missing values by filling with the median value
X = X.fillna(X.median())

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


In [4]:
# Remove outliers based on GrLivArea and SalePrice
train_df = train_df[train_df['GrLivArea'] < 4500]
train_df = train_df[train_df['SalePrice'] < 700000]

# Re-select relevant features and target variable
X = train_df[all_features]
y = train_df[target]

# Handle missing values by filling with the median value
X = X.fillna(X.median())


In [5]:
from sklearn.preprocessing import StandardScaler

# Scale the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)


In [6]:
from sklearn.model_selection import cross_val_score

# Train the linear regression model
model = LinearRegression()

# Evaluate the model using cross-validation
cv_scores = cross_val_score(model, X_train_scaled, y_train, cv=5, scoring='r2')
cv_mean_score = cv_scores.mean()

# Train the model on the entire training set
model.fit(X_train_scaled, y_train)

# Make predictions on the testing set
y_pred = model.predict(X_test_scaled)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

cv_mean_score, mse, r2


(0.7471764894536925, 1520895741.6198645, 0.801716941295824)

In [9]:
import pandas as pd
import numpy as np

# Load the test data
test_data_path = 'test.csv'
test_df = pd.read_csv(test_data_path)

# Select the same features as used in the training
X_test_final = test_df[all_features]

# Handle missing values by filling with the median value
X_test_final = X_test_final.fillna(X_test_final.median())

# Scale the features using the same scaler fitted on the training data
X_test_final_scaled = scaler.transform(X_test_final)

# Make predictions on the test data
test_predictions = model.predict(X_test_final_scaled)

# Prepare the submission DataFrame
submission_df = pd.DataFrame({
    'Id': test_df['Id'],
    'SalePrice': test_predictions
})

# Save the submission DataFrame to a CSV file
submission_file_path = 'submission.csv'
submission_df.to_csv(submission_file_path, index=False)

submission_df.head()


Unnamed: 0,Id,SalePrice
0,1461,111545.45869
1,1462,159028.80814
2,1463,172247.567099
3,1464,190597.957866
4,1465,223492.168727
