<a href="https://colab.research.google.com/github/vkroz/neural-machines/blob/main/regression%20101/regression%20-%20home%20prices.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# End-to-End Linear Regression Example: Home Price Prediction



## 1. Setup and Dependencies

In [None]:
# Install required packages (-y skips conda's confirmation prompt, which a notebook cell may not be able to answer)
!conda install -y datasets scikit-learn pandas numpy matplotlib seaborn

Retrieving notices: done
Channels:
 - defaults
Platform: osx-arm64
Collecting package metadata (repodata.json): done
Solving environment: done

# All requested packages already installed.



In [None]:
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from datasets import load_dataset
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
from sklearn.pipeline import Pipeline

# Set plotting style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 8)



## 2. Loading Data from Hugging Face

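We'll load a house prices dataset from the Hugging Face Hub under the ID `maharshipandya/house-prices`. The cells that follow assume it provides a `train` split with a numeric `SalePrice` column. In the recorded run below, the Hub could not find or access this ID, so you may need to substitute a dataset that is available to you.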

In [None]:
# Load the dataset from Hugging Face
dataset = load_dataset("maharshipandya/house-prices")
print(f"Dataset structure: {dataset}")

DatasetNotFoundError: Dataset 'maharshipandya/house-prices' doesn't exist on the Hub or cannot be accessed.
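
If the Hub dataset can't be loaded, as in the run above, one way to still try the remaining cells is a small synthetic stand-in. The sketch below is not the real dataset: the helper `make_synthetic_house_prices`, the feature column names, and the price formula are all made up for illustration. Only the `SalePrice` column name comes from this notebook.

```python
import numpy as np
import pandas as pd


def make_synthetic_house_prices(n_rows=500, seed=0):
    """Return a made-up DataFrame with numeric features and a SalePrice target."""
    rng = np.random.default_rng(seed)

    # Hypothetical feature columns; the real dataset's names may differ
    data = pd.DataFrame({
        "LivingArea": rng.normal(1500, 400, n_rows).clip(400, None),
        "Quality": rng.integers(1, 11, n_rows),
        "GarageCars": rng.integers(0, 4, n_rows),
        "YearBuilt": rng.integers(1950, 2011, n_rows),
        "LotArea": rng.normal(9000, 3000, n_rows).clip(1500, None),
        "Bedrooms": rng.integers(1, 6, n_rows),
    })

    # Invented linear price formula plus noise, so a linear model has something to find
    noise = rng.normal(0, 20000, n_rows)
    data["SalePrice"] = (
        20000
        + 80 * data["LivingArea"]
        + 15000 * data["Quality"]
        + 8000 * data["GarageCars"]
        + 300 * (data["YearBuilt"] - 1950)
        + 1.5 * data["LotArea"]
        + noise
    )
    return data


df = make_synthetic_house_prices()
print(f"Synthetic dataset shape: {df.shape}")
```

With this stand-in, `df` takes the place of the DataFrame that the next cell would build from `dataset['train']`, so that conversion step would be skipped.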

In [None]:
# Convert to pandas DataFrame for easier manipulation
df = dataset['train'].to_pandas()

# Display basic information
print(f"Dataset shape: {df.shape}")
df.head()

## 3. Exploratory Data Analysis

In [None]:
# Check for missing values
missing_values = df.isnull().sum()
print("Missing values per column:")
print(missing_values[missing_values > 0])

In [None]:
# Basic statistics of numerical features
df.describe()

In [None]:
# Distribution of the target variable (SalePrice)
plt.figure(figsize=(10, 6))
sns.histplot(df['SalePrice'], kde=True)
plt.title('Distribution of House Prices')
plt.xlabel('Price ($)')
plt.ylabel('Frequency')
plt.show()

In [None]:
# Correlation between numerical features and the target
numerical_features = df.select_dtypes(include='number').columns  # any numeric dtype, not only int64/float64
correlation = df[numerical_features].corr()['SalePrice'].sort_values(ascending=False)
print("Top 10 features correlated with SalePrice:")
print(correlation[:11])  # 11 rows: SalePrice itself (correlation 1.0) plus the top 10
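
To make clear what `.corr()` reports, here is a minimal sketch of the Pearson correlation coefficient computed by hand and compared with pandas. The toy numbers and the helper name `pearson_r` are invented for illustration.

```python
import numpy as np
import pandas as pd


def pearson_r(x, y):
    """Pearson correlation: covariance of x and y divided by the product of their spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_centered = x - x.mean()
    y_centered = y - y.mean()
    return (x_centered @ y_centered) / np.sqrt(
        (x_centered @ x_centered) * (y_centered @ y_centered)
    )


# Made-up toy data: larger areas go with higher prices
toy = pd.DataFrame({
    "area": [50, 80, 100, 120, 150],
    "price": [110, 160, 210, 230, 300],
})

manual = pearson_r(toy["area"], toy["price"])
from_pandas = toy["area"].corr(toy["price"])  # Pearson is the pandas default

print(f"By hand: {manual:.6f}")
print(f"pandas:  {from_pandas:.6f}")
```

A value near 1 means a strong positive linear relationship, near -1 a strong negative one, and near 0 little linear relationship.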

In [None]:
# Visualize the top 5 correlated features with SalePrice
top_features = correlation[1:6].index  # Skip position 0, which is SalePrice itself

fig, axes = plt.subplots(2, 3, figsize=(18, 12))
axes = axes.flatten()

for i, feature in enumerate(top_features):
    sns.scatterplot(x=feature, y='SalePrice', data=df, ax=axes[i])
    axes[i].set_title(f'{feature} vs SalePrice')

# The 2x3 grid has six slots but only five features, so hide the unused axes
for ax in axes[len(top_features):]:
    ax.set_visible(False)

plt.tight_layout()
plt.show()

## 4. Data Preprocessing

In [None]:
# Select features based on correlation analysis
# For simplicity, use the five numerical features most positively correlated with SalePrice
selected_features = correlation[1:6].index.tolist()
print(f"Selected features: {selected_features}")

# Prepare the data
X = df[selected_features]
y = df['SalePrice']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Training set shape: {X_train.shape}")
print(f"Testing set shape: {X_test.shape}")
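
The cells above rank features by correlation on the full DataFrame before splitting, so the held-out rows also influence which features are chosen. A stricter variant, sketched here on made-up data, splits first and ranks on the training rows only. The helper `top_correlated_features` and the toy columns are hypothetical, and the sketch ranks by absolute correlation, which differs slightly from the signed ranking used above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data: price depends on area and rooms, not on the "noise" column
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "area": rng.normal(100, 20, n),
    "rooms": rng.integers(1, 6, n).astype(float),
    "noise": rng.normal(0, 1, n),
})
toy["SalePrice"] = 1000 * toy["area"] + 10000 * toy["rooms"] + rng.normal(0, 5000, n)

# Split first, so the test rows play no part in feature selection
train_df, test_df = train_test_split(toy, test_size=0.2, random_state=42)


def top_correlated_features(frame, target, k):
    """Return the k numeric columns with the largest absolute correlation to target."""
    corr = frame.select_dtypes(include="number").corr()[target].drop(target)
    return corr.abs().sort_values(ascending=False).head(k).index.tolist()


selected = top_correlated_features(train_df, "SalePrice", k=2)
print(f"Selected on training rows only: {selected}")

X_train_toy, y_train_toy = train_df[selected], train_df["SalePrice"]
X_test_toy, y_test_toy = test_df[selected], test_df["SalePrice"]
```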

## 5. Model Training

In [None]:
# Create a pipeline with preprocessing and model
pipeline = Pipeline([
    ('scaler', StandardScaler()),  # Standardize features
    ('regressor', LinearRegression())  # Linear regression model
])

# Train the model
pipeline.fit(X_train, y_train)

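# Get the coefficients. Because the features are standardized, each coefficient is the
# change in predicted price per one-standard-deviation increase in that feature.
coefficients = pipeline.named_steps['regressor'].coef_
intercept = pipeline.named_steps['regressor'].intercept_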

# Display the model coefficients
coef_df = pd.DataFrame({'Feature': selected_features, 'Coefficient': coefficients})
print("Model Coefficients:")
print(coef_df)
print(f"Intercept: {intercept:.2f}")
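
Because the regressor sees standardized inputs, its coefficients are not in the features' original units. A minimal sketch of converting them back, using the scaler's fitted `mean_` and `scale_` attributes, is shown below on made-up data. The toy arrays and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Made-up data: two features on very different scales
rng = np.random.default_rng(0)
X_toy = np.column_stack([rng.normal(1500, 400, 300), rng.normal(6, 1.5, 300)])
y_toy = 50000 + 100 * X_toy[:, 0] + 20000 * X_toy[:, 1] + rng.normal(0, 1000, 300)

toy_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("regressor", LinearRegression()),
])
toy_pipeline.fit(X_toy, y_toy)

scaler = toy_pipeline.named_steps["scaler"]
regressor = toy_pipeline.named_steps["regressor"]

# z = (x - mean) / scale, so a coefficient per unit of x is coef / scale,
# and the constant terms fold into the intercept
coef_original = regressor.coef_ / scaler.scale_
intercept_original = regressor.intercept_ - np.sum(
    regressor.coef_ * scaler.mean_ / scaler.scale_
)

# Sanity check: a plain regression on unscaled data should agree
plain = LinearRegression().fit(X_toy, y_toy)
print("Converted coefficients:", coef_original)
print("Unscaled-fit coefficients:", plain.coef_)
```

The standardized coefficients are handy for comparing feature influence, while the converted ones read as price change per unit of each feature.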

## 6. Model Evaluation

In [None]:
# Make predictions on the test set
y_pred = pipeline.predict(X_test)

# Calculate evaluation metrics
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

# Display the metrics
print(f"Mean Squared Error (MSE): {mse:.2f}")
print(f"Root Mean Squared Error (RMSE): {rmse:.2f}")
print(f"Mean Absolute Error (MAE): {mae:.2f}")
print(f"R² Score: {r2:.4f}")
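
To show what these metrics measure, here is a small sketch that computes each one by hand with NumPy and checks it against scikit-learn. The four prices are made up for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up actual and predicted prices
y_true = np.array([200000.0, 150000.0, 320000.0, 275000.0])
y_hat = np.array([210000.0, 140000.0, 300000.0, 280000.0])

errors = y_true - y_hat

mse_manual = np.mean(errors ** 2)        # average squared error
rmse_manual = np.sqrt(mse_manual)        # same units as the target
mae_manual = np.mean(np.abs(errors))     # average absolute error

ss_res = np.sum(errors ** 2)                     # variation the model leaves unexplained
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean
r2_manual = 1 - ss_res / ss_tot

print(f"MSE:  {mse_manual:.2f} (sklearn: {mean_squared_error(y_true, y_hat):.2f})")
print(f"RMSE: {rmse_manual:.2f}")
print(f"MAE:  {mae_manual:.2f} (sklearn: {mean_absolute_error(y_true, y_hat):.2f})")
print(f"R²:   {r2_manual:.4f} (sklearn: {r2_score(y_true, y_hat):.4f})")
```

RMSE penalizes large misses more heavily than MAE, which is why it comes out higher here: one prediction is off by 20,000 while the others are off by 10,000 or less.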

In [None]:
# Visualize actual vs predicted values
plt.figure(figsize=(10, 6))
plt.scatter(y_test, y_pred, alpha=0.5)
# Dashed diagonal marks perfect predictions (predicted == actual)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--')
plt.xlabel('Actual Prices')
plt.ylabel('Predicted Prices')
plt.title('Actual vs Predicted House Prices')
plt.show()

In [None]:
# Plot residuals
residuals = y_test - y_pred

plt.figure(figsize=(10, 6))
plt.scatter(y_pred, residuals, alpha=0.5)
plt.axhline(y=0, color='r', linestyle='--')
plt.xlabel('Predicted Prices')
plt.ylabel('Residuals')
plt.title('Residual Plot')
plt.show()

## 7. Model Inference

Now let's use our trained model to make predictions on new data.

In [None]:
# Create a function for making predictions
def predict_house_price(features_dict):
    """Predict the sale price for one house from a dict of feature values."""
    # Ensure all required features are present, and report every missing one at once
    missing = [feature for feature in selected_features if feature not in features_dict]
    if missing:
        raise ValueError(f"Missing required features: {missing}")

    # Convert the input dictionary to a one-row DataFrame in the training column order
    input_df = pd.DataFrame([features_dict])[selected_features]

    # Make prediction
    return pipeline.predict(input_df)[0]

In [None]:
# Example: Predict the price of a sample house
# We'll use the median of each selected feature to describe a typical house
sample_house = {}
for feature in selected_features:
    sample_house[feature] = df[feature].median()

print("Sample house features:")
for feature, value in sample_house.items():
    print(f"{feature}: {value}")

predicted_price = predict_house_price(sample_house)
print(f"\nPredicted house price: ${predicted_price:.2f}")

In [None]:
# Let's try with different values
# Create a more expensive house by increasing the values by 20%
expensive_house = {}
for feature in selected_features:
    expensive_house[feature] = df[feature].median() * 1.2

print("Expensive house features:")
for feature, value in expensive_house.items():
    print(f"{feature}: {value}")

predicted_price = predict_house_price(expensive_house)
print(f"\nPredicted house price: ${predicted_price:.2f}")

In [None]:
# Create a less expensive house by decreasing the values by 20%
cheaper_house = {}
for feature in selected_features:
    cheaper_house[feature] = df[feature].median() * 0.8

print("Cheaper house features:")
for feature, value in cheaper_house.items():
    print(f"{feature}: {value}")

predicted_price = predict_house_price(cheaper_house)
print(f"\nPredicted house price: ${predicted_price:.2f}")
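
The three predictions above follow a simple pattern for a linear model: scaling every feature up or down by 20% moves the prediction by the same amount in opposite directions around the median-house prediction. The prediction rises for the larger values only when the combined effect of the coefficients is positive. Here is a sketch on made-up data, where the toy features and coefficients are assumptions:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Made-up data with two positively weighted features
rng = np.random.default_rng(0)
X_toy = rng.normal([1500, 6], [400, 1.5], size=(200, 2))
y_toy = 50000 + 100 * X_toy[:, 0] + 20000 * X_toy[:, 1] + rng.normal(0, 5000, 200)

toy_model = Pipeline([
    ("scaler", StandardScaler()),
    ("regressor", LinearRegression()),
]).fit(X_toy, y_toy)

median_house = np.median(X_toy, axis=0)
base, up, down = toy_model.predict(
    np.vstack([median_house, 1.2 * median_house, 0.8 * median_house])
)

print(f"Median house:      {base:,.0f}")
print(f"+20% on features:  {up:,.0f} (change {up - base:+,.0f})")
print(f"-20% on features:  {down:,.0f} (change {down - base:+,.0f})")
```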

## 8. Conclusion

In this notebook, we've demonstrated a complete machine learning lifecycle for a linear regression model to predict house prices:

1. **Data Loading**: We loaded a house prices dataset from Hugging Face.
2. **Exploratory Data Analysis**: We analyzed the dataset to understand its structure and relationships.
3. **Feature Selection**: We selected the most relevant features based on correlation with the target variable.
4. **Data Preprocessing**: We split the data into training and test sets and standardized the features inside the model pipeline.
5. **Model Training**: We trained a linear regression model using scikit-learn.
6. **Model Evaluation**: We evaluated the model using various metrics (RMSE, MAE, R²).
7. **Model Inference**: We used the trained model to make predictions on new data.

This simple example demonstrates the fundamental steps in a machine learning project. For a real-world application, you might want to consider:
- More sophisticated feature engineering
- Handling categorical variables
- Addressing outliers and missing values more thoroughly
- Trying more complex models
- Implementing cross-validation
- Model deployment strategies
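
As a starting point for a few of these, here is a hedged sketch that combines missing-value imputation, one-hot encoding of a categorical column, and 5-fold cross-validation in a single scikit-learn pipeline. The data is made up, and the column names `area` and `neighborhood` are hypothetical. The real dataset would need its own column lists.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up data: one numeric and one categorical feature
rng = np.random.default_rng(0)
n = 300
area = rng.normal(1500, 400, n)
neighborhood = rng.choice(["north", "south", "east"], size=n)
bonus = pd.Series(neighborhood).map({"north": 30000, "south": 0, "east": 15000}).to_numpy()
price = 50000 + 100 * area + bonus + rng.normal(0, 10000, n)

X_toy = pd.DataFrame({"area": area, "neighborhood": neighborhood})
# Knock out 15 area values to simulate missing data
X_toy.loc[rng.choice(n, size=15, replace=False), "area"] = np.nan

# Numeric columns: fill gaps with the median, then standardize
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

preprocess = ColumnTransformer([
    ("num", numeric_steps, ["area"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["neighborhood"]),
])

model = Pipeline([
    ("preprocess", preprocess),
    ("regressor", LinearRegression()),
])

# Each fold refits the imputer, scaler, and encoder on its own training part
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X_toy, price, cv=cv, scoring="r2")

print("R² per fold:", np.round(scores, 3))
print(f"Mean R²: {scores.mean():.3f}")
```

Keeping preprocessing inside the pipeline means cross-validation never fits the imputer or scaler on rows it later scores, which gives a more honest estimate than a single train/test split.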