# Predicting House Prices

You are working for a real estate company, and your goal is to build a predictive model to estimate house prices based on 
various features. 
You have a dataset containing information about houses, such as square footage, number of bedrooms, number of bathrooms, 
and other relevant attributes. 

You are tasked with the following:

Dataset: You can choose and download a dataset from Kaggle, the UCI Machine Learning Repository, or any other source.

# 1. Data Preparation:

a. Load the dataset using pandas.

b. Explore and clean the data. Handle missing values and outliers.

c. Split the dataset into training and testing sets.
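
Step (b) can be sketched as follows. This is a minimal illustration on a small made-up frame, not the contents of `Housing.csv`: the values, the median fill, and the 1.5 × IQR rule are all assumptions chosen for the example, and other cleaning strategies may suit the real data better.

```python
import numpy as np
import pandas as pd

# Small hypothetical frame standing in for Housing.csv (values are made up).
houses = pd.DataFrame({
    "price": [4_200_000, 3_900_000, 5_100_000, 4_600_000, 60_000_000, 4_400_000],
    "area": [3600, 3400, np.nan, 4800, 5000, 4100],
    "bedrooms": [3, 2, 3, 4, 3, 3],
})

# 1. Missing values: count them per column, then fill the numeric gap with the median.
missing_per_column = houses.isna().sum()
houses["area"] = houses["area"].fillna(houses["area"].median())

# 2. Outliers: keep only prices within 1.5 * IQR of the quartiles.
q1, q3 = houses["price"].quantile([0.25, 0.75])
iqr = q3 - q1
in_range = houses["price"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
cleaned = houses[in_range].reset_index(drop=True)

print(missing_per_column)
print(cleaned)
```

Whether to drop, cap, or keep an outlier is a judgment call; for prices, a very expensive house may be a legitimate observation rather than an error.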

In [6]:
import pandas as pd
df = pd.read_csv('Housing.csv')

In [7]:
df.head()

,price,area,bedrooms,bathrooms,stories,mainroad,guestroom,basement,hotwaterheating,airconditioning,parking,prefarea,furnishingstatus
0,13300000,7420,4,2,3,yes,no,no,no,yes,2,yes,furnished
1,12250000,8960,4,4,4,yes,no,no,no,yes,3,no,furnished
2,12250000,9960,3,2,2,yes,no,yes,no,no,2,yes,semi-furnished
3,12215000,7500,4,2,2,yes,no,yes,no,yes,3,yes,furnished
4,11410000,7420,4,1,2,yes,yes,yes,no,yes,2,no,furnished


In [8]:
df.describe()

,price,area,bedrooms,bathrooms,stories,parking
count,545.0,545.0,545.0,545.0,545.0,545.0
mean,4766729.0,5150.541284,2.965138,1.286239,1.805505,0.693578
std,1870440.0,2170.141023,0.738064,0.50247,0.867492,0.861586
min,1750000.0,1650.0,1.0,1.0,1.0,0.0
25%,3430000.0,3600.0,2.0,1.0,1.0,0.0
50%,4340000.0,4600.0,3.0,1.0,2.0,0.0
75%,5740000.0,6360.0,3.0,2.0,2.0,1.0
max,13300000.0,16200.0,6.0,4.0,4.0,3.0


In [9]:
import pandas as pd
data = pd.read_csv('Housing.csv')

In [27]:
from sklearn.model_selection import train_test_split
# Features are every column except the target; the target is the price
X = data.drop(columns=['price'])
y = data['price']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=17)

In [28]:
data.shape

(545, 13)
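
Several columns in the `df.head()` output (`mainroad`, `guestroom`, and so on) hold the strings `yes` and `no`, and `furnishingstatus` holds category labels. `LinearRegression` cannot fit on strings directly, so these columns would need encoding before they are used as features. The sketch below shows one possible approach on a few hypothetical rows; the sample values, including the `unfurnished` label, are assumptions for illustration.

```python
import pandas as pd

# Hypothetical rows mirroring part of the column layout shown by df.head().
sample = pd.DataFrame({
    "area": [7420, 8960, 9960],
    "mainroad": ["yes", "yes", "no"],
    "airconditioning": ["yes", "no", "no"],
    "furnishingstatus": ["furnished", "semi-furnished", "unfurnished"],
})

# Map the binary yes/no columns to 1/0.
binary_columns = ["mainroad", "airconditioning"]
encoded = sample.copy()
for column in binary_columns:
    encoded[column] = encoded[column].map({"yes": 1, "no": 0})

# One-hot encode the multi-valued column, dropping one level to avoid redundancy.
encoded = pd.get_dummies(encoded, columns=["furnishingstatus"], drop_first=True, dtype=int)
print(encoded)
```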

# 2. Implement Simple Linear Regression:

a. Choose a feature (e.g., square footage) as the independent variable (X) and house prices as the dependent variable (y).

b. Implement a simple linear regression model using sklearn to predict house prices based on the selected feature.

c. Visualize the data and the regression line.

In [7]:
# Import necessary libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import pandas as pd

# Load the dataset from Housing.csv
data = pd.read_csv('Housing.csv')

# Extract the independent variable (X: area) and dependent variable (y: price)
X = data[['area']].values
y = data['price'].values

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Create and train the linear regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions using the test set
y_pred = model.predict(X_test)

# Visualize the data and regression line
plt.scatter(X, y, label='Data Points')
plt.plot(X_test, y_pred, color='red', linewidth=2, label='Regression Line')
plt.xlabel('area')
plt.ylabel('price')
plt.legend()
plt.title('Simple Linear Regression')
plt.show()

# 3. Evaluate the Simple Linear Regression Model:

a. Use scikit-learn to calculate the R-squared value to assess the goodness of fit.

b. Interpret the R-squared value and discuss the model's performance.
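
To make the interpretation in (b) concrete: R-squared is one minus the ratio of the residual sum of squares to the total sum of squares, so a value of 0.30 means the model accounts for about 30% of the variance in the target. The sketch below checks this definition against `r2_score` on made-up numbers; the arrays are illustrative only.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up actual and predicted values, for illustration only.
y_true = np.array([3.0, 4.5, 5.0, 6.5, 8.0])
y_hat = np.array([3.5, 4.0, 5.5, 6.0, 7.5])

ss_res = np.sum((y_true - y_hat) ** 2)          # variation the model leaves unexplained
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
r2_manual = 1 - ss_res / ss_tot

print(f"manual R^2:  {r2_manual:.4f}")
print(f"sklearn R^2: {r2_score(y_true, y_hat):.4f}")
```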

In [5]:
import pandas as pd
data = pd.read_csv('Housing.csv')

In [6]:
from sklearn.metrics import r2_score

# y_test and y_pred come from the simple linear regression cell in section 2
r2 = r2_score(y_test, y_pred)
print(f'R-squared value: {r2:.4f}')

R-squared value: 0.3067


# 4. Implement Multiple Linear Regression:

a. Select multiple features (e.g., square footage, number of bedrooms, number of bathrooms) as independent variables (X) and house prices as the dependent variable (y).

b. Implement a multiple linear regression model using scikit-learn to predict house prices based on the selected features.

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

data = pd.read_csv('Housing.csv')

# Select several numeric features as X and the price as y
X = data[['area', 'bedrooms', 'bathrooms']].values
y = data['price'].values

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Create and train the multiple linear regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions using the test set
y_pred = model.predict(X_test)

# Evaluate the model (optional)
mse = mean_squared_error(y_test, y_pred)
r_squared = r2_score(y_test, y_pred)

# Print the evaluation metrics (MSE and R-squared)
print(f'Mean Squared Error (MSE): {mse:.2f}')
print(f'R-squared value: {r_squared:.4f}')

# 5. Evaluate the Multiple Linear Regression Model:

a. Calculate the Mean Absolute Error (MAE), Mean Squared Error (MSE), and Root Mean Squared Error (RMSE) to assess the model's accuracy.

b. Discuss the advantages of using multiple features in regression analysis.
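
The three metrics in (a) differ only in how they aggregate the prediction errors. As a rough sketch on made-up prices, they can be computed by hand and compared with scikit-learn's functions; the arrays below are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Made-up actual and predicted prices, for illustration only.
y_true = np.array([4_200_000.0, 3_500_000.0, 5_100_000.0, 6_000_000.0])
y_hat = np.array([4_000_000.0, 3_900_000.0, 4_600_000.0, 6_300_000.0])

errors = y_true - y_hat
mae_manual = np.mean(np.abs(errors))  # average size of an error
mse_manual = np.mean(errors ** 2)     # squaring penalizes large errors more
rmse_manual = np.sqrt(mse_manual)     # back in the same units as price

print(f"MAE:  {mae_manual:.2f}")
print(f"MSE:  {mse_manual:.2f}")
print(f"RMSE: {rmse_manual:.2f}")
```

RMSE is never smaller than MAE; a large gap between the two suggests that a few predictions are far off.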

In [63]:
from sklearn.metrics import mean_absolute_error, mean_squared_error
import numpy as np

# Calculate MAE
mae = mean_absolute_error(y_test, y_pred)

# Calculate MSE
mse = mean_squared_error(y_test, y_pred)

# Calculate RMSE (by taking the square root of MSE)
rmse = np.sqrt(mse)

# Print the evaluation metrics
print(f'Mean Absolute Error (MAE): {mae:.2f}')
print(f'Mean Squared Error (MSE): {mse:.2f}')
print(f'Root Mean Squared Error (RMSE): {rmse:.2f}')

Mean Absolute Error (MAE): 898431.90
Mean Squared Error (MSE): 1505858701048.00
Root Mean Squared Error (RMSE): 1227134.35


# 6. Model Comparison:

a. Compare the results of the simple linear regression and multiple linear regression models.

b. Discuss the advantages and limitations of each model.
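
A comparison based on a single train/test split can depend on which rows land in the test set. One possible complement is k-fold cross-validation, sketched below on synthetic data. The generated features, coefficients, and noise level are assumptions for the example and are not derived from `Housing.csv`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data: price depends on three features plus noise.
rng = np.random.default_rng(0)
n = 200
area = rng.uniform(1500, 10000, n)
bedrooms = rng.integers(1, 6, n)
bathrooms = rng.integers(1, 4, n)
price = 400 * area + 300_000 * bedrooms + 500_000 * bathrooms + rng.normal(0, 500_000, n)

X_one = area.reshape(-1, 1)                             # simple regression: area only
X_many = np.column_stack([area, bedrooms, bathrooms])   # multiple regression

# Five-fold cross-validated R-squared for each model.
scores_one = cross_val_score(LinearRegression(), X_one, price, cv=5, scoring="r2")
scores_many = cross_val_score(LinearRegression(), X_many, price, cv=5, scoring="r2")

print(f"SLR mean R^2: {scores_one.mean():.4f}")
print(f"MLR mean R^2: {scores_many.mean():.4f}")
```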

In [64]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Load your dataset from a CSV file (replace 'Housing.csv' with your dataset file)
data = pd.read_csv('Housing.csv')

# Simple Linear Regression (SLR) using 'area'
X_slr = data[['area']].values
y_slr = data['price'].values

# Split the data into training and testing sets for SLR
X_slr_train, X_slr_test, y_slr_train, y_slr_test = train_test_split(X_slr, y_slr, test_size=0.2, random_state=0)

# Multiple Linear Regression (MLR) using 'area', 'bedrooms', and 'bathrooms'
X_mlr = data[['area', 'bedrooms', 'bathrooms']].values
y_mlr = data['price'].values

# Split the data into training and testing sets for MLR
X_mlr_train, X_mlr_test, y_mlr_train, y_mlr_test = train_test_split(X_mlr, y_mlr, test_size=0.2, random_state=0)

# Create and train the SLR model
slr_model = LinearRegression()
slr_model.fit(X_slr_train, y_slr_train)
y_slr_pred = slr_model.predict(X_slr_test)

# Create and train the MLR model
mlr_model = LinearRegression()
mlr_model.fit(X_mlr_train, y_mlr_train)
y_mlr_pred = mlr_model.predict(X_mlr_test)

# Calculate evaluation metrics for SLR
slr_r_squared = r2_score(y_slr_test, y_slr_pred)
slr_mae = mean_absolute_error(y_slr_test, y_slr_pred)
slr_mse = mean_squared_error(y_slr_test, y_slr_pred)
slr_rmse = np.sqrt(slr_mse)

# Calculate evaluation metrics for MLR
mlr_r_squared = r2_score(y_mlr_test, y_mlr_pred)
mlr_mae = mean_absolute_error(y_mlr_test, y_mlr_pred)
mlr_mse = mean_squared_error(y_mlr_test, y_mlr_pred)
mlr_rmse = np.sqrt(mlr_mse)

# Print the evaluation metrics for both models
print("Simple Linear Regression (SLR) Results:")
print(f'R-squared value (SLR): {slr_r_squared:.4f}')
print(f'Mean Absolute Error (MAE) (SLR): {slr_mae:.2f}')
print(f'Mean Squared Error (MSE) (SLR): {slr_mse:.2f}')
print(f'Root Mean Squared Error (RMSE) (SLR): {slr_rmse:.2f}')
print("\nMultiple Linear Regression (MLR) Results:")
print(f'R-squared value (MLR): {mlr_r_squared:.4f}')
print(f'Mean Absolute Error (MAE) (MLR): {mlr_mae:.2f}')
print(f'Mean Squared Error (MSE) (MLR): {mlr_mse:.2f}')
print(f'Root Mean Squared Error (RMSE) (MLR): {mlr_rmse:.2f}')

# Discuss the advantages and limitations of each model
print("\nAdvantages and Limitations:")
print("Advantages of SLR:")
print("- Simplicity and ease of interpretation.")
print("- Suitable for clear linear relationships.")
print("- Less complex and computationally efficient.")
print("\nLimitations of SLR:")
print("- Limited to capturing single-variable relationships.")
print("- May not perform well in cases with complex, multi-variable dependencies.")

print("\nAdvantages of MLR:")
print("- Increased accuracy through the consideration of multiple features.")
print("- Ability to capture complex, multi-variable relationships.")
print("- Flexibility in modeling real-world scenarios.")
print("\nLimitations of MLR:")
print("- Risk of overfitting if irrelevant features are included.")
print("- Increased model complexity and reduced interpretability.")

Simple Linear Regression (SLR) Results:
R-squared value (SLR): 0.3067
Mean Absolute Error (MAE) (SLR): 1026553.77
Mean Squared Error (MSE) (SLR): 1997672371756.09
Root Mean Squared Error (RMSE) (SLR): 1413390.38

Multiple Linear Regression (MLR) Results:
R-squared value (MLR): 0.4774
Mean Absolute Error (MAE) (MLR): 898431.90
Mean Squared Error (MSE) (MLR): 1505858701048.00
Root Mean Squared Error (RMSE) (MLR): 1227134.35

Advantages and Limitations:
Advantages of SLR:
- Simplicity and ease of interpretation.
- Suitable for clear linear relationships.
- Less complex and computationally efficient.

Limitations of SLR:
- Limited to capturing single-variable relationships.
- May not perform well in cases with complex, multi-variable dependencies.

Advantages of MLR:
- Increased accuracy through the consideration of multiple features.
- Ability to capture complex, multi-variable relationships.
- Flexibility in modeling real-world scenarios.

Limitations of MLR:
- Risk of overfitting if irrelevant features are included.
- Increased model complexity and reduced interpretability.

# 8. Conclusion:

Summarize the findings and provide insights into how this predictive model can be used to assist the real estate company in estimating house prices.
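
In practice, using the model means fitting it once and then calling `predict` on the features of a new listing. The sketch below illustrates this on a handful of hypothetical rows; the training values and the new listing are made up, and a real deployment would train on the full dataset instead.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical training rows; in the notebook these would come from Housing.csv.
train = pd.DataFrame({
    "area": [3000, 4000, 5000, 6000, 7000, 8000],
    "bedrooms": [2, 3, 2, 4, 3, 5],
    "bathrooms": [1, 1, 2, 2, 3, 2],
    "price": [3_000_000, 3_900_000, 5_000_000, 5_900_000, 7_100_000, 8_000_000],
})
features = ["area", "bedrooms", "bathrooms"]
price_model = LinearRegression().fit(train[features], train["price"])

# Estimate the price of a new (hypothetical) listing.
new_listing = pd.DataFrame([{"area": 5500, "bedrooms": 3, "bathrooms": 2}])
estimate = price_model.predict(new_listing[features])[0]
print(f"Estimated price: {estimate:,.0f}")
```

Any estimate should be read together with the model's error metrics, since the typical prediction error gives a sense of how far off a single figure may be.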

In [65]:
print("Summary of Findings:")
print(f'R-squared value: {r_squared:.4f}')
print(f'Mean Absolute Error (MAE): {mae:.2f}')
print(f'Mean Squared Error (MSE): {mse:.2f}')
print(f'Root Mean Squared Error (RMSE): {rmse:.2f}')

# Provide insights
print("\nInsights:")
print(f"1. The model has an R-squared value of {r_squared:.4f}, so it explains roughly {r_squared:.0%} of the variance in house prices on the test set; about half of the variance remains unexplained.")
print(f"2. The RMSE of {rmse:.2f} is the typical size of a prediction error, in the same units as price, with large errors weighted more heavily.")
print(f"3. The MAE of {mae:.2f} is the average absolute difference between predicted and actual prices.")
print("4. The real estate company can use this model to produce a rough price estimate from 'area', 'bedrooms', and 'bathrooms'.")
print("5. It can support pricing decisions for buyers and sellers, provided the estimates are treated as approximate given the error sizes above.")

Summary of Findings:
R-squared value: 0.4774
Mean Absolute Error (MAE): 898431.90
Mean Squared Error (MSE): 1505858701048.00
Root Mean Squared Error (RMSE): 1227134.35

Insights:
1. The model has an R-squared value of 0.4774, so it explains roughly 48% of the variance in house prices on the test set; about half of the variance remains unexplained.
2. The RMSE of 1227134.35 is the typical size of a prediction error, in the same units as price, with large errors weighted more heavily.
3. The MAE of 898431.90 is the average absolute difference between predicted and actual prices.
4. The real estate company can use this model to produce a rough price estimate from 'area', 'bedrooms', and 'bathrooms'.
5. It can support pricing decisions for buyers and sellers, provided the estimates are treated as approximate given the error sizes above.
