# Case Study: Predicting House Prices Using Linear Regression
A step-by-step guide to predicting house prices using linear regression with one independent variable (square footage), including a Python demo and a visualization.

### Step 1: Define the Problem

We want to predict house prices based on square footage. The linear regression model will have the following form:

$y = \theta_0 + \theta_1 x_1$

Where:

- $y$ = Price of the house
- $x_1$ = Square footage of the house
- $\theta_0$ = Intercept (bias term)
- $\theta_1$ = Coefficient for square footage (the weight)

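As a minimal sketch of the model form above, the function below evaluates $y = \theta_0 + \theta_1 x_1$ for a given square footage. The function name `predict_price` and the parameter values are illustrative assumptions: 50,000 and 100 simply mirror the relationship used to simulate data later in this guide and are not fitted estimates.

```python
def predict_price(square_footage, theta_0, theta_1):
    """Evaluate the linear model y = theta_0 + theta_1 * x_1."""
    return theta_0 + theta_1 * square_footage


# Hypothetical parameters: a 50,000 base price plus 100 per square foot
example_price = predict_price(2000, theta_0=50_000, theta_1=100)
print(f"Price under the assumed parameters: ${example_price:,.2f}")
```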

### Step 2: Data Collection and Preprocessing

Here’s a small dataset for house prices:

| Square Footage ($x_1$) | Price ($y$)   |
|------------------------|---------------|
| 1500                   | 350,000       |
| 1800                   | 450,000       |
| 2400                   | 500,000       |
| 3000                   | 600,000       |
| 1200                   | 250,000       |

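This table shows the shape of the data. Five points are too few to train and test a model reliably, so the demo below also simulates a larger dataset of 1,000 houses and uses that for training and testing.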

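With a single feature, the least-squares line has a closed form: the slope is the sum of the products of the deviations of $x_1$ and $y$ from their means, divided by the sum of squared deviations of $x_1$, and the intercept follows from the two means. The sketch below applies this to the five rows in the table; the variable names are illustrative.

```python
import numpy as np

# The five rows from the table above
sqft = np.array([1500, 1800, 2400, 3000, 1200], dtype=float)
price = np.array([350000, 450000, 500000, 600000, 250000], dtype=float)

# Closed-form simple linear regression:
#   theta_1 = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x)^2)
#   theta_0 = mean_y - theta_1 * mean_x
x_dev = sqft - sqft.mean()
y_dev = price - price.mean()
slope = np.sum(x_dev * y_dev) / np.sum(x_dev ** 2)
intercept = price.mean() - slope * sqft.mean()

print(f"Slope (θ1): {slope:.2f}")
print(f"Intercept (θ0): {intercept:.2f}")
```

With only five points, each row has a large influence on these estimates.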

### Step 3: Model Building
Let’s import the necessary libraries and prepare the data for our model.

In [83]:
#import packages
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

#Generate Data

In [None]:
# Small dataset from Step 2 (for reference only; the next cell overwrites X and y)
X = np.array([1500, 1800, 2400, 3000, 1200]).reshape(-1, 1)  # Square footage
y = np.array([350000, 450000, 500000, 600000, 250000])  # Prices

In [None]:
# Simulate a large dataset (1000 data points)
np.random.seed(42)  # For reproducibility
X = np.random.uniform(1000, 4000, 1000).reshape(-1, 1)  # Square footage between 1000 and 4000

# Generate corresponding house prices (y) based on the relationship: y = 100 * X + 50000 + noise
y = 100 * X.flatten() + 50000 + np.random.normal(0, 50000, 1000)  # Adding some noise to the data

print(f"X shape: {X.shape}")  # Should print (1000, 1)
print(f"y shape: {y.shape}")  # Should print (1000,)


### Step 4: Splitting the Data
We will split the dataset into training and testing sets to evaluate the model's performance.

In [None]:
# Split data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Training data size: {len(X_train)}")
print(f"Testing data size: {len(X_test)}")

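To make the split concrete, here is a rough NumPy-only sketch of what an 80/20 shuffle-and-split does. It illustrates the idea and is not scikit-learn's actual implementation; the helper name `simple_split` and the toy arrays are hypothetical. In practice `train_test_split` remains the better choice.

```python
import numpy as np


def simple_split(X, y, test_size=0.2, seed=42):
    """Shuffle row indices, then hold out the first `test_size` fraction."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(X))
    n_test = int(round(len(X) * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    # Index X and y with the same indices so rows stay aligned
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]


# Toy data: 10 rows, one feature, y = 2 * x
X_demo = np.arange(10, dtype=float).reshape(-1, 1)
y_demo = 2.0 * X_demo.flatten()

X_tr, X_te, y_tr, y_te = simple_split(X_demo, y_demo)
print(f"Training rows: {len(X_tr)}, testing rows: {len(X_te)}")
```

Shuffling before splitting matters when rows are stored in some order (for example, sorted by size), since an unshuffled split would then train and test on different ranges.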
### Step 5: Model Training
Now we will train the linear regression model using the training data.

In [None]:
# Initialize and train the linear regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Get the learned parameters
theta_0 = model.intercept_  # Intercept
theta_1 = model.coef_[0]    # Coefficient for square footage

print(f"Intercept (θ0): {theta_0}")
print(f"Coefficient (θ1): {theta_1}")

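As a cross-check on the fitted parameters, ordinary least squares can also be solved directly with NumPy. The sketch below regenerates the simulated data with the same seed and fits all 1,000 points (no train/test split), so its estimates should be close to, but not identical to, `model.intercept_` and `model.coef_`. Both sets of estimates should land near the generating values of 50,000 and 100. The names `A`, `theta_0_ls`, and `theta_1_ls` are illustrative.

```python
import numpy as np

# Regenerate the simulated data exactly as in Step 3
np.random.seed(42)
X = np.random.uniform(1000, 4000, 1000).reshape(-1, 1)
y = 100 * X.flatten() + 50000 + np.random.normal(0, 50000, 1000)

# Design matrix with a leading column of ones for the intercept
A = np.hstack([np.ones((len(X), 1)), X])

# Least-squares solution of A @ theta ≈ y
theta, *_ = np.linalg.lstsq(A, y, rcond=None)
theta_0_ls, theta_1_ls = theta

print(f"Intercept (θ0): {theta_0_ls:.2f}")
print(f"Coefficient (θ1): {theta_1_ls:.2f}")
```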

### Step 6: Model Evaluation
After training the model, we evaluate its performance on the test data using metrics like Mean Squared Error (MSE) and R-squared (R²).

In [None]:
# Predict house prices on the test set
y_pred = model.predict(X_test)

# Compare the predictions with the actual test prices
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Mean Squared Error (MSE): {mse}")
print(f"R-squared (R²): {r2}")

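Both metrics are straightforward to compute by hand, which helps when interpreting them: MSE is the mean of the squared residuals, and R² is one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below uses a small made-up example; the arrays are illustrative and are not results from the model.

```python
import numpy as np

# Illustrative values only, not output from the model above
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.0, 5.0, 8.0, 9.0])

residuals = y_true - y_hat

# MSE: average squared residual
mse_manual = np.mean(residuals ** 2)

# R²: 1 - SS_res / SS_tot
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(f"MSE: {mse_manual:.2f}")  # 0.50
print(f"R²: {r2_manual:.2f}")    # 0.90
```

Because MSE is expressed in squared price units, its square root (RMSE) is often easier to read, since it is in the same units as the price.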

### Step 7: Making Predictions
Once the model is trained, we can use it to make predictions. For example, to predict the price of a house with 2200 square feet:

In [None]:
# Predict the price for a house with 2200 square feet
predicted_price = model.predict(np.array([[2200]]))
# Access the scalar value from the array and format it
print(f"Predicted price for a 2200 sqft house: ${predicted_price[0]:,.2f}")

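As a rough sanity check, the simulated data were generated from $y = 100 x_1 + 50000$ plus noise, so the noise-free price for a 2,200 square foot house is:

$$\hat{y} = 50000 + 100 \times 2200 = 270000$$

The fitted model's prediction should land close to this value, with small differences because $\theta_0$ and $\theta_1$ are estimated from noisy data.
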
### Step 8: Visualization
We will plot the test data points and the fitted regression line to visualize the model.

In [None]:
# Plot the test data and the fitted regression line
plt.scatter(X_test, y_test, color='blue', label='Test Data')  # Actual prices in the test set
plt.plot(X_test, model.predict(X_test), color='red', label='Regression Line')  # Model predictions on the same inputs
plt.xlabel('Square Footage')
plt.ylabel('Price')
plt.legend()
plt.show()
