In [1]:
import pandas as pd
# Load the dataset
dataset_url = 'housing_data.csv'
data = pd.read_csv(dataset_url)

# Display the first few rows of the dataset
print(data.head())
# Summary statistics
print(data.describe())
# Information about data types and missing values
# (info() prints its report directly and returns None, so no print() is needed)
data.info()

           ID Date House was Sold  Sale Price  No of Bedrooms  \
0  7129300520     14 October 2017    221900.0               3   
1  6414100192    14 December 2017    538000.0               3   
2  5631500400    15 February 2016    180000.0               2   
3  2487200875    14 December 2017    604000.0               4   
4  1954400510    15 February 2016    510000.0               3   

   No of Bathrooms  Flat Area (in Sqft)  Lot Area (in Sqft)  No of Floors  \
0             1.00               1180.0              5650.0           1.0   
1             2.25               2570.0              7242.0           2.0   
2             1.00                770.0             10000.0           1.0   
3             3.00               1960.0              5000.0           1.0   
4             2.00               1680.0              8080.0           1.0   

  Waterfront View No of Times Visited  ... Overall Grade  \
0              No                 NaN  ...             7   
1              No         

In [2]:
data.head()

Unnamed: 0,ID,Date House was Sold,Sale Price,No of Bedrooms,No of Bathrooms,Flat Area (in Sqft),Lot Area (in Sqft),No of Floors,Waterfront View,No of Times Visited,...,Overall Grade,Area of the House from Basement (in Sqft),Basement Area (in Sqft),Age of House (in Years),Renovated Year,Zipcode,Latitude,Longitude,Living Area after Renovation (in Sqft),Lot Area after Renovation (in Sqft)
0,7129300520,14 October 2017,221900.0,3,1.0,1180.0,5650.0,1.0,No,,...,7,1180.0,0,63,0,98178.0,47.5112,-122.257,1340.0,5650
1,6414100192,14 December 2017,538000.0,3,2.25,2570.0,7242.0,2.0,No,,...,7,2170.0,400,67,1991,98125.0,47.721,-122.319,1690.0,7639
2,5631500400,15 February 2016,180000.0,2,1.0,770.0,10000.0,1.0,No,,...,6,770.0,0,85,0,98028.0,47.7379,-122.233,2720.0,8062
3,2487200875,14 December 2017,604000.0,4,3.0,1960.0,5000.0,1.0,No,,...,7,1050.0,910,53,0,98136.0,47.5208,-122.393,1360.0,5000
4,1954400510,15 February 2016,510000.0,3,2.0,1680.0,8080.0,1.0,No,,...,8,1680.0,0,31,0,98074.0,47.6168,-122.045,1800.0,7503


Step-by-Step Explanation
1. Data Loading and Exploration

Objective: Load the dataset and understand its structure.

Why: Before we can analyze or model the data, we need to understand what it looks like, including the types of variables, any missing values, and basic statistics.

How:

    Load the Data: We load the dataset into a pandas DataFrame to manipulate and explore it easily.
    Inspect the Data: We look at the first few rows, summary statistics, and information about data types and missing values to get a sense of the dataset's structure.
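
As a minimal sketch of the inspection step, the snippet below counts missing values per column and lists the column dtypes. The DataFrame `toy` and its values are made up for illustration and are not taken from housing_data.csv.

```python
import numpy as np
import pandas as pd

# Tiny illustrative frame (made-up values, not the real dataset)
toy = pd.DataFrame({
    "Sale Price": [221900.0, np.nan, 180000.0],
    "No of Bedrooms": [3, 3, 2],
    "Waterfront View": ["No", None, "No"],
})

# Count missing values per column, largest first
missing_counts = toy.isna().sum().sort_values(ascending=False)
print(missing_counts)

# Column dtypes show which features are numeric and which are text
print(toy.dtypes)
```

Running the same two calls on the real `data` frame shows which columns need imputation before modeling.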

2. Data Preprocessing

Objective: Clean and transform the data to prepare it for modeling.

Why: Real-world data is often messy, with missing values, categorical data that needs encoding, and numerical data that needs scaling. Preprocessing ensures the data is in a suitable format for machine learning algorithms.

How:

    Handle Missing Values: Impute or remove missing values to prevent errors during model training.
    Encode Categorical Variables: Convert categorical variables into numerical format using one-hot encoding.
    Scale Numerical Features: Standardize numerical features to ensure they are on a similar scale, which helps certain algorithms perform better.

Steps:

    Identify categorical and numerical features.
    Create pipelines for preprocessing numerical and categorical data.
    Combine these pipelines using ColumnTransformer.

Step 1: Define Features and Target

# Define the features (X) and the target (y)
X = data.drop(['ID', 'Sale Price', 'Date House was Sold'], axis=1)
y = data['Sale Price']

Explanation:

    Features (X): These are the input variables that we will use to predict the target. We drop columns ID, Sale Price, and Date House was Sold because:
        ID is just an identifier and doesn't help in predicting the price.
        Sale Price is the target variable, so it shouldn't be included in the features.
        Date House was Sold is likely not necessary for this basic model.
    Target (y): This is the variable we want to predict. In this case, it's the Sale Price.

Step 2: Identify Categorical and Numerical Features

# Define categorical and numerical columns
categorical_features = ['Waterfront View', 'Condition of the House', 'No of Times Visited']
numerical_features = [col for col in X.columns if col not in categorical_features]

Explanation:

    Categorical Features: These are variables that represent categories or groups (e.g., Waterfront View, Condition of the House, No of Times Visited).
    Numerical Features: These are variables that represent numeric values. Here, we include all columns in X that are not in the list of categorical features.

Step 3: Preprocessing Pipeline for Numerical Features

# Preprocessing pipeline for numerical features
numerical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

Explanation:

    Pipeline: A sequence of data processing steps. Each step in the pipeline is an operation that transforms the data.
    SimpleImputer(strategy='median'): This step fills in any missing values in the numerical features with the median value of that feature. Using the median is often a good choice for numerical data because it is less affected by outliers than the mean.
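    StandardScaler(): This step standardizes the numerical features by removing the mean and scaling to unit variance. This puts all features on a comparable scale, which matters for scale-sensitive models such as regularized or distance-based methods; tree-based models are largely insensitive to it.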
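
To make the two numerical steps concrete, here is a small worked example in plain Python that does by hand what the imputer and scaler do. The `lot_area` values are made up, and the sketch assumes the scaler divides by the population standard deviation, which matches StandardScaler's default behavior.

```python
import statistics

# Made-up lot areas with one missing value (None)
lot_area = [5000.0, 7000.0, None, 9000.0]

# Median imputation: replace missing entries with the median of observed values
observed = [v for v in lot_area if v is not None]
median = statistics.median(observed)
imputed = [median if v is None else v for v in lot_area]

# Standardization: subtract the mean and divide by the population
# standard deviation (ddof=0)
mean = statistics.fmean(imputed)
std = statistics.pstdev(imputed)
scaled = [(v - mean) / std for v in imputed]

print(imputed)
print([round(v, 3) for v in scaled])
```

After scaling, the values have mean 0 and unit variance, regardless of the original units.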

Step 4: Preprocessing Pipeline for Categorical Features

# Preprocessing pipeline for categorical features
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

Explanation:

    SimpleImputer(strategy='most_frequent'): This step fills in missing values in the categorical features with the most frequent (most common) value in each column. This is suitable for categorical data as it replaces missing values with the mode of the data.
    OneHotEncoder(handle_unknown='ignore'): This step converts categorical variables into a series of binary (0 or 1) columns, also known as one-hot encoding. For example, if a feature has categories 'A', 'B', and 'C', it will be transformed into three new columns, one per category. The handle_unknown='ignore' parameter means that a category not seen during fitting is encoded as all zeros at transform time instead of raising an error.
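
The behavior of handle_unknown='ignore' can be checked on a toy example. The categories 'A' to 'D' below are illustrative only.

```python
from sklearn.preprocessing import OneHotEncoder

# Fit on three known categories (illustrative values)
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit([["A"], ["B"], ["C"]])

# A known category gets a single 1; an unseen category gets all zeros
known = encoder.transform([["B"]]).toarray()
unseen = encoder.transform([["D"]]).toarray()

print(known)   # one column set to 1
print(unseen)  # every column is 0
```

The all-zero row means the model sees the unseen category as "none of the known ones", which is usually preferable to a crash at prediction time.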

Step 5: Combine Preprocessing Steps

# Combine preprocessing steps
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_features),
        ('cat', categorical_transformer, categorical_features)
    ])

Explanation:

    ColumnTransformer: This allows us to apply different preprocessing steps to different columns (numerical and categorical) within a single transformer. It applies the numerical transformer to the numerical features and the categorical transformer to the categorical features.
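
A small sketch of how the combined transformer lays out its output is shown below. The frame `toy` and the names `num_pipe`, `cat_pipe`, and `toy_preprocessor` are hypothetical; `sparse_threshold=0` is set here only so that the result is a dense array that prints cleanly.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up rows: one numeric column and one categorical column
toy = pd.DataFrame({
    "area": [1000.0, np.nan, 3000.0],
    "view": ["No", "Yes", "No"],
})

num_pipe = Pipeline([("imputer", SimpleImputer(strategy="median")),
                     ("scaler", StandardScaler())])
cat_pipe = Pipeline([("imputer", SimpleImputer(strategy="most_frequent")),
                     ("onehot", OneHotEncoder(handle_unknown="ignore"))])

# sparse_threshold=0 forces a dense array so the result is easy to print
toy_preprocessor = ColumnTransformer(
    transformers=[("num", num_pipe, ["area"]), ("cat", cat_pipe, ["view"])],
    sparse_threshold=0,
)

out = toy_preprocessor.fit_transform(toy)
print(out.shape)  # 1 scaled numeric column + 2 one-hot columns
print(out)
```

The output columns follow the order of the `transformers` list: numerical columns first, then the one-hot columns.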

Step 6: Split the Data into Training and Testing Sets

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Explanation:

    train_test_split: This function splits the dataset into two parts: training and testing sets.
        Training Set (X_train, y_train): This subset is used to train the model.
        Testing Set (X_test, y_test): This subset is used to evaluate the model's performance.
        test_size=0.2: 20% of the data is set aside for testing.
        random_state=42: This ensures reproducibility. Using the same random state every time ensures the same split is produced.
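
The effect of test_size and random_state can be seen on ten made-up samples. The names `X_toy` and `y_toy` are illustrative.

```python
from sklearn.model_selection import train_test_split

# Ten made-up samples; features and targets are just integers here
X_toy = [[i] for i in range(10)]
y_toy = list(range(10))

# Two calls with the same random_state give the same split
split_a = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)
split_b = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

X_tr, X_te, y_tr, y_te = split_a
print(len(X_tr), len(X_te))      # 8 training rows, 2 test rows
print(split_a[3] == split_b[3])  # True: identical test targets
```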

Step 7: Apply Preprocessing to the Data

# Apply preprocessing to the data
X_train = preprocessor.fit_transform(X_train)
X_test = preprocessor.transform(X_test)

Explanation:

    fit_transform on X_train: The fit_transform method fits the preprocessing steps on the training data and then transforms it. This means it calculates the required parameters (like mean and variance for scaling) from the training data and applies the transformations.
    transform on X_test: The transform method only applies the transformations using the parameters calculated from the training data. This ensures that the test data is transformed in the same way as the training data, without re-fitting the parameters.
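
The difference between fit_transform and transform is easiest to see with a single scaler on made-up numbers. The arrays `train` and `test` below are illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up single-feature data
train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[10.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean and variance from train
test_scaled = scaler.transform(test)        # reuses the training statistics

print(scaler.mean_)   # mean learned from the training rows only
print(test_scaled)    # test value expressed in training-set units
```

Because the test row never influences the learned mean and variance, the evaluation on the test set is not contaminated by information from it.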

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# Define the features (X) and the target (y)
X = data.drop(['ID', 'Sale Price', 'Date House was Sold'], axis=1)
y = data['Sale Price']

# Define categorical and numerical columns
categorical_features = ['Waterfront View', 'Condition of the House', 'No of Times Visited']
numerical_features = [col for col in X.columns if col not in categorical_features]

# Preprocessing pipeline for numerical features
numerical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

# Preprocessing pipeline for categorical features
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Combine preprocessing steps
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_features),
        ('cat', categorical_transformer, categorical_features)
    ])

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Apply preprocessing to the data
X_train = preprocessor.fit_transform(X_train)
X_test = preprocessor.transform(X_test)


3. Model Training and Evaluation

Objective: Train multiple regression models and evaluate their performance to select the best one.

Why: Different algorithms have different strengths and weaknesses. By training and evaluating multiple models, we can identify which one performs best on our data.

How:

    Select Models: Choose a few regression algorithms to test.
    Train Models: Fit each model to the training data.
    Evaluate Models: Use metrics like Mean Absolute Error (MAE), Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and R-squared to evaluate model performance.

Steps:

    Define a function to train and evaluate a model.
    Train multiple models and compare their performance.

Step 1: Import Libraries

from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


Explanation:

    Import Libraries: We import various regression models and evaluation metrics from the scikit-learn library.
        Models: These are the algorithms we will use to predict house prices.
            Linear Regression: A simple model that assumes a linear relationship between features and the target.
            Decision Tree: A model that splits data into branches to make predictions.
            Random Forest: An ensemble model that combines multiple decision trees to improve accuracy.
            Gradient Boosting: Another ensemble model that builds trees sequentially to correct errors of previous trees.
        Metrics: These are the methods to evaluate how well the models perform.
            mean_absolute_error (MAE): Average absolute difference between predicted and actual values.
            mean_squared_error (MSE): Average squared difference between predicted and actual values.
            r2_score (R2): Proportion of the variance in the target variable that is predictable from the features.

Step 2: Define the Models

models = {
    'Linear Regression': LinearRegression(),
    'Decision Tree': DecisionTreeRegressor(random_state=42),
    'Random Forest': RandomForestRegressor(random_state=42),
    'Gradient Boosting': GradientBoostingRegressor(random_state=42)
}

Explanation:

    Dictionary of Models: We create a dictionary where the keys are model names and the values are instances of the model. This helps us easily iterate over each model.
    random_state=42: This fixes the seed for the models that use randomness (the tree-based ones here), so repeated runs produce the same results. LinearRegression is deterministic and takes no random_state.

Step 3: Define the Evaluation Function

def evaluate_model(model, X_train, y_train, X_test, y_test):
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    mae = mean_absolute_error(y_test, y_pred)
    mse = mean_squared_error(y_test, y_pred)
    rmse = mse ** 0.5  # square root of MSE; avoids the squared=False argument, which is deprecated in newer scikit-learn releases
    r2 = r2_score(y_test, y_pred)
    return mae, mse, rmse, r2

Explanation:

    evaluate_model Function: This function trains a model, makes predictions, and calculates evaluation metrics.
        model.fit(X_train, y_train): Trains the model using the training data.
        model.predict(X_test): Uses the trained model to make predictions on the test data.
        Calculate Metrics:
            MAE: Measures the average magnitude of errors in the predictions.
            MSE: Measures the average of the squares of the errors.
            RMSE: The square root of MSE, which is in the same units as the target variable.
            R2: Indicates how well the model explains the variability of the target variable.
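
The four metrics can be worked out by hand on a tiny example, which shows how they relate to each other. The values in `y_true` and `y_hat` are made up.

```python
import math

# Made-up true and predicted values
y_true = [3.0, 5.0, 7.0]
y_hat = [2.0, 5.0, 9.0]

n = len(y_true)
errors = [t - p for t, p in zip(y_true, y_hat)]

mae = sum(abs(e) for e in errors) / n   # mean of |error|
mse = sum(e ** 2 for e in errors) / n   # mean of error squared
rmse = math.sqrt(mse)                   # back in the units of the target

# R2 = 1 - (residual sum of squares / total sum of squares)
mean_true = sum(y_true) / n
ss_res = sum(e ** 2 for e in errors)
ss_tot = sum((t - mean_true) ** 2 for t in y_true)
r2 = 1 - ss_res / ss_tot

print(mae, round(mse, 4), round(rmse, 4), r2)
```

Here the errors are 1, 0, and -2, so MAE is 1.0 and MSE is 5/3. The single larger error dominates MSE and RMSE more than it does MAE.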

Step 4: Evaluate and Compare Models

results = {}
for model_name, model in models.items():
    results[model_name] = evaluate_model(model, X_train, y_train, X_test, y_test)

Explanation:

    results Dictionary: We create an empty dictionary to store the results.
    Loop Through Models: For each model in our dictionary:
        Evaluate Model: Call the evaluate_model function.
        Store Results: Save the metrics (MAE, MSE, RMSE, R2) for each model in the results dictionary.

Step 5: Display the Results

for model_name, metrics in results.items():
    print(f"{model_name}: MAE={metrics[0]}, MSE={metrics[1]}, RMSE={metrics[2]}, R2={metrics[3]}")

Explanation:

    Loop Through Results: For each model name and its corresponding metrics in the results dictionary:
        Print Metrics: Display the name of the model and its performance metrics.
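
As an optional alternative to the print loop, the metrics can be put into a sorted table, which makes the comparison easier to read. The sketch below uses a dictionary named `reported` (a hypothetical name) holding the metrics from this notebook's run, rounded.

```python
import pandas as pd

# Metrics copied (rounded) from the run reported in this notebook
reported = {
    "Linear Regression": (131897.96, 45464898795.14, 213225.00, 0.6882),
    "Decision Tree": (104596.96, 42671438659.97, 206570.66, 0.7074),
    "Random Forest": (71773.73, 17992974851.12, 134137.89, 0.8766),
    "Gradient Boosting": (80912.54, 20244557024.90, 142283.37, 0.8612),
}

# One row per model, sorted so the lowest RMSE comes first
table = pd.DataFrame.from_dict(
    reported, orient="index", columns=["MAE", "MSE", "RMSE", "R2"]
).sort_values("RMSE")
print(table.round(2))
```

The same call works on the `results` dictionary directly, since its values are (MAE, MSE, RMSE, R2) tuples in that order.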

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Define models
models = {
    'Linear Regression': LinearRegression(),
    'Decision Tree': DecisionTreeRegressor(random_state=42),
    'Random Forest': RandomForestRegressor(random_state=42),
    'Gradient Boosting': GradientBoostingRegressor(random_state=42)
}

# Function to evaluate models
def evaluate_model(model, X_train, y_train, X_test, y_test):
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    mae = mean_absolute_error(y_test, y_pred)
    mse = mean_squared_error(y_test, y_pred)
    rmse = mse ** 0.5  # square root of MSE; avoids the squared=False argument, which is deprecated in newer scikit-learn releases
    r2 = r2_score(y_test, y_pred)
    return mae, mse, rmse, r2

# Rebuild X and y after removing rows with a missing target
# (repeated here so this cell runs on its own)
from sklearn.model_selection import train_test_split

# Check for missing values in the target
print("Missing values in y:", data['Sale Price'].isna().sum())

# Drop rows where the target is NaN
data = data.dropna(subset=['Sale Price'])

# Define X and y after dropping NaNs
X = data.drop(['ID', 'Sale Price', 'Date House was Sold'], axis=1)
y = data['Sale Price']

# Define categorical and numerical columns
categorical_features = ['Waterfront View', 'Condition of the House', 'No of Times Visited']
numerical_features = [col for col in X.columns if col not in categorical_features]

# Preprocessing pipeline for numerical features
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer

numerical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

# Preprocessing pipeline for categorical features
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Combine preprocessing steps
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_features),
        ('cat', categorical_transformer, categorical_features)
    ])

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Apply preprocessing to the data
X_train = preprocessor.fit_transform(X_train)
X_test = preprocessor.transform(X_test)

# Evaluate and compare models
results = {}
for model_name, model in models.items():
    results[model_name] = evaluate_model(model, X_train, y_train, X_test, y_test)

# Display the results
for model_name, metrics in results.items():
    print(f"{model_name}: MAE={metrics[0]}, MSE={metrics[1]}, RMSE={metrics[2]}, R2={metrics[3]}")


Missing values in y: 4




Linear Regression: MAE=131897.95504367576, MSE=45464898795.143036, RMSE=213224.99570909372, R2=0.6882405788344389
Decision Tree: MAE=104596.95546043498, MSE=42671438659.96558, RMSE=206570.66263137557, R2=0.7073957411216367
Random Forest: MAE=71773.73091898041, MSE=17992974851.118885, RMSE=134137.89491086733, R2=0.8766195554529514
Gradient Boosting: MAE=80912.53862978627, MSE=20244557024.90255, RMSE=142283.36875721824, R2=0.8611801291304962




4. Model Deployment

Objective: Save the best model for future predictions.

Why: Once we have identified the best-performing model, we want to save it so that we can use it to make predictions on new data without retraining.

How:

    Save the Model: Use joblib to save the trained model to disk.
    Load and Use the Model: Load the saved model and use it to make predictions.

Steps:

    Train the best model (the cell below fits it on the preprocessed training split; refitting on the entire dataset is an optional final step).
    Save the model using joblib.
    Load the model and use it for predictions.
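
One possible variation, not used in this notebook, is to chain the preprocessing and the model into a single Pipeline so that only one file has to be saved and raw rows can be passed straight to predict. The sketch below assumes made-up data (`toy_X`, `toy_y`), a hypothetical file name, and a reduced `n_estimators` to keep it fast.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up training rows (illustrative only)
toy_X = pd.DataFrame({
    "area": [800.0, 1200.0, 1500.0, 2000.0],
    "view": ["No", "No", "Yes", "Yes"],
})
toy_y = [150000.0, 200000.0, 320000.0, 400000.0]

# Preprocessing and model chained into one estimator
full_pipeline = Pipeline(steps=[
    ("preprocess", ColumnTransformer(transformers=[
        ("num", StandardScaler(), ["area"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["view"]),
    ])),
    ("model", RandomForestRegressor(n_estimators=10, random_state=42)),
])
full_pipeline.fit(toy_X, toy_y)

# Save and reload a single file; raw (unprocessed) rows go straight to predict
path = os.path.join(tempfile.mkdtemp(), "toy_pipeline.pkl")
joblib.dump(full_pipeline, path)
reloaded = joblib.load(path)
print(reloaded.predict(toy_X.iloc[[0]]))
```

With this layout the preprocessor and the model cannot get out of sync, because they are fitted, saved, and loaded together.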

In [None]:
import joblib

# Assuming Random Forest is the best model
best_model = RandomForestRegressor(random_state=42)
best_model.fit(X_train, y_train)
joblib.dump(best_model, 'house_price_predictor.pkl')

# To load and use the model for predictions
loaded_model = joblib.load('house_price_predictor.pkl')
# Slicing keeps the row two-dimensional and works for dense and sparse matrices
sample_data = X_test[0:1]
predicted_price = loaded_model.predict(sample_data)
print(f"Predicted price: {predicted_price}")


Predicted price: [519872.]


In [None]:
import joblib

# Save the preprocessor after fitting
joblib.dump(preprocessor, 'preprocessor.pkl')


['preprocessor.pkl']

In [None]:
import joblib
import numpy as np
import pandas as pd

# Load trained model and preprocessor
model = joblib.load('house_price_predictor.pkl')
preprocessor = joblib.load('preprocessor.pkl')

# Define feature names
feature_names = [
    'No of Bedrooms', 'No of Bathrooms', 'Flat Area (in Sqft)', 'Lot Area (in Sqft)',
    'No of Floors', 'Waterfront View', 'No of Times Visited', 'Condition of the House',
    'Overall Grade', 'Area of the House from Basement (in Sqft)', 'Basement Area (in Sqft)',
    'Age of House (in Years)', 'Renovated Year', 'Zipcode', 'Latitude', 'Longitude',
    'Living Area after Renovation (in Sqft)', 'Lot Area after Renovation (in Sqft)'
]

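# Columns treated as categorical during training; all others are numeric
categorical_features = ['Waterfront View', 'Condition of the House', 'No of Times Visited']

# Take user input for each feature (input() always returns strings)
user_input = {}
for feature in feature_names:
    user_input[feature] = input(f"Enter value for {feature}: ")

# Convert input into a one-row DataFrame and cast the numeric columns to numbers;
# entries that cannot be parsed become NaN and are filled by the median imputer
user_input_df = pd.DataFrame([user_input], columns=feature_names)
numeric_cols = [c for c in feature_names if c not in categorical_features]
user_input_df[numeric_cols] = user_input_df[numeric_cols].apply(pd.to_numeric, errors='coerce')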

# Preprocess the user input
user_input_transformed = preprocessor.transform(user_input_df)

# Predict the house price
predicted_price = model.predict(user_input_transformed)
print(f"Predicted House Price: {predicted_price[0]}")


Predicted House Price: 277597.8


Start

    1. Load the data.
    2. Handle missing values: drop rows where the target ('Sale Price') is NaN.
    3. Define features (X) and target (y).
    4. Split the data into training and testing sets.
    5. Preprocess the data:
        - Separate numerical and categorical features.
        - Define the preprocessing pipeline for numerical features: impute missing values with the median, then standardize.
        - Define the preprocessing pipeline for categorical features: impute missing values with the most frequent value, then apply one-hot encoding.
        - Combine the preprocessing steps.
    6. Apply preprocessing to the training and testing data.
    7. Define the models.
    8. Evaluate the models: fit each model on the training data, predict on the testing data, and calculate the evaluation metrics (MAE, MSE, RMSE, R2).
    9. Compare the results.

End
