# House Price Prediction Project Documentation

## Problem Statement

The problem addressed in this project is to predict house prices using machine learning techniques. The objective is to develop a model that accurately predicts the prices of houses based on a set of features such as location, square footage, number of bedrooms and bathrooms, and other relevant factors. The project involves several phases, including data preprocessing, feature engineering, model selection, training, and evaluation.

## Design Thinking Process

1. **Data Source:** We selected a dataset of area-level housing information, including average income, average house age, average numbers of rooms and bedrooms, area population, and price.

2. **Data Preprocessing:** The data was cleaned, missing values were handled, and categorical features were converted into numerical representations.

3. **Feature Selection:** The most relevant features for predicting house prices were identified to improve model accuracy.

4. **Model Selection:** We chose the XGBoost regression algorithm for its robustness and predictive power.

5. **Model Training:** The selected model was trained using the preprocessed data.

6. **Evaluation:** Model performance was assessed using metrics like Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and R-squared.

## Dataset Description

The dataset used for this project is `USA_Housing.csv`. It contains the following columns:

- Avg. Area Income: Average income of residents in the area.
- Avg. Area House Age: Average age of houses in the area.
- Avg. Area Number of Rooms: Average number of rooms in houses in the area.
- Avg. Area Number of Bedrooms: Average number of bedrooms in houses in the area.
- Area Population: Population of the area.
- Price: Price of the house.
- Address: Address of the house.

## Data Preprocessing

1. **Data Cleaning:** The dataset was checked for missing values, and any missing data was either removed or imputed.

2. **Feature Engineering:** New features, such as price per square foot, were created to enhance the dataset's information.

3. **Encoding Categorical Variables:** Categorical features, such as address, were converted into numerical representations.

4. **Scaling Numerical Features:** Numerical features were standardized using StandardScaler to ensure consistent scales.
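
To make the scaling step concrete, the following plain-Python sketch reproduces what standardization computes for a single column: subtract the mean and divide by the population standard deviation, with both statistics taken from the training data only. The helper names `fit_standardizer` and `standardize` and the sample values are hypothetical and used only for illustration; the project itself relies on scikit-learn's `StandardScaler`.

```python
import statistics


def fit_standardizer(train_values):
    """Return the (mean, std) learned from the training column."""
    mean = statistics.fmean(train_values)
    # Population standard deviation (ddof=0), which is what StandardScaler uses
    std = statistics.pstdev(train_values)
    return mean, std


def standardize(values, mean, std):
    """Apply z = (x - mean) / std using statistics learned on the training data."""
    if std == 0:
        # Constant column: only centre the values to avoid dividing by zero
        return [x - mean for x in values]
    return [(x - mean) / std for x in values]


# Made-up house ages, for illustration only
train_ages = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
test_ages = [5.0, 9.0, 1.0]

mean, std = fit_standardizer(train_ages)
scaled_test = standardize(test_ages, mean, std)  # test data reuses the training statistics
print(mean, std, scaled_test)
```

Reusing the training statistics on the test data mirrors the `fit_transform` / `transform` split used later in this document and keeps information from the test set out of the preprocessing step.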

## Model Training

The XGBoost regression algorithm was chosen for model training. XGBoost is a gradient-boosted decision tree ensemble method known for its speed and predictive power. It was trained on the preprocessed dataset, and hyperparameter tuning was performed to optimize its performance.
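
XGBoost builds its ensemble by gradient boosting: trees are added one at a time, each fitted to the errors of the ensemble so far. The sketch below illustrates that idea in plain Python with one-split "stumps" on a single feature. It is a deliberately simplified illustration, not XGBoost's actual algorithm (which adds regularization, second-order gradient information, and many optimizations), and the function names and toy data are made up for the example.

```python
def fit_stump(xs, residuals):
    """Find the threshold split on xs that minimises squared error of the residuals."""
    best = None
    # Exclude the largest value so both sides of the split are non-empty
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]  # (threshold, left_mean, right_mean)


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def fit_boosted(xs, ys, n_estimators=100, learning_rate=0.1):
    """Start from the mean, then repeatedly fit a stump to the current residuals."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_estimators):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        # Shrink each stump's contribution by the learning rate
        preds = [p + learning_rate * predict_stump(stump, x)
                 for p, x in zip(preds, xs)]
    return base, stumps, learning_rate


def predict_boosted(model, x):
    base, stumps, learning_rate = model
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)


# Toy data (made up): one feature, one target; needs at least two distinct x values
xs = [1, 2, 3, 4, 5, 6]
ys = [10, 12, 15, 30, 32, 35]

model = fit_boosted(xs, ys, n_estimators=100, learning_rate=0.1)
baseline = sum(ys) / len(ys)
baseline_mse = sum((y - baseline) ** 2 for y in ys) / len(ys)
boosted_mse = sum((y - predict_boosted(model, x)) ** 2 for x, y in zip(xs, ys)) / len(ys)
print(f"baseline MSE: {baseline_mse:.2f}, boosted MSE: {boosted_mse:.2f}")
```

The `n_estimators` and `learning_rate` arguments play the same role here as the like-named `XGBRegressor` parameters: more rounds and a larger learning rate fit the training data more closely, which is why they are typical targets for hyperparameter tuning.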

## Evaluation Metrics

The XGBoost model was evaluated with the following metrics:

- **Mean Absolute Error (MAE):** Measures the average absolute difference between the predicted and actual prices.
- **Root Mean Squared Error (RMSE):** Measures the square root of the average squared differences between predicted and actual prices.
- **R-squared (R²):** Measures the proportion of the variance in the dependent variable (house prices) that is predictable from the independent variables.

These metrics provide insights into how well the model predicts house prices and its ability to capture variance in the data.
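
As a worked illustration of the three definitions above, the following plain-Python sketch computes each metric by hand on a handful of made-up prices. The function names `mae`, `rmse`, and `r2` and the sample values are assumptions for the example only; the project code uses the equivalent functions from `sklearn.metrics`.

```python
import math


def mae(actual, predicted):
    """Mean of the absolute differences between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Square root of the mean squared difference."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def r2(actual, predicted):
    """1 minus the ratio of residual variation to total variation around the mean."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Made-up prices (in thousands of dollars), for illustration only
actual = [100, 200, 300, 400]
predicted = [110, 190, 310, 380]

print("MAE:", mae(actual, predicted))    # absolute errors 10, 10, 10, 20 average to 12.5
print("RMSE:", rmse(actual, predicted))  # sqrt(700 / 4), about 13.23
print("R²:", r2(actual, predicted))      # 1 - 700 / 50000 = 0.986
```

Because RMSE squares the errors before averaging, the single 20-unit miss pulls it above the MAE; a large gap between the two is a sign that a few predictions are far off.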


## Submission

The project is ready for submission. The code cells below run the entire project, from loading the data to making predictions based on user input.

**Data Loading and Initial Data Exploration**

In [None]:
import numpy as np  # used later for the RMSE calculation and the user-input array
import pandas as pd

# Load the dataset
data = pd.read_csv("/content/USA_Housing.csv")  # adjust the path to where the dataset is stored

# Display the first few rows of the dataset
print(data.head())

   Avg. Area Income  Avg. Area House Age  Avg. Area Number of Rooms  \
0      79545.458574             5.682861                   7.009188   
1      79248.642455             6.002900                   6.730821   
2      61287.067179             5.865890                   8.512727   
3      63345.240046             7.188236                   5.586729   
4      59982.197226             5.040555                   7.839388   

   Avg. Area Number of Bedrooms  Area Population         Price  \
0                          4.09     23086.800503  1.059034e+06   
1                          3.09     40173.072174  1.505891e+06   
2                          5.13     36882.159400  1.058988e+06   
3                          3.26     34310.242831  1.260617e+06   
4                          4.23     26354.109472  6.309435e+05   

                                             Address  
0  208 Michael Ferry Apt. 674\nLaurabury, NE 3701...  
1  188 Johnson Views Suite 079\nLake Kathleen, CA...  
2  9127 Eli

**Data Preprocessing**

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Extract features and target variable
X = data[["Avg. Area Income", "Avg. Area House Age", "Avg. Area Number of Rooms", "Avg. Area Number of Bedrooms", "Area Population"]]
y = data["Price"]

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize features using statistics from the training set only
# (optional: tree-based models such as XGBoost do not require scaling, but it helps other regressors)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

**Model Training (XGBoost)**

In [None]:
import xgboost as xgb

# XGBoost Regression
xgb_model = xgb.XGBRegressor(objective="reg:squarederror", n_estimators=100, random_state=42)
xgb_model.fit(X_train, y_train)

**Model Evaluation**

In [None]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


# Evaluate XGBoost model
xgb_predictions = xgb_model.predict(X_test)
print("XGBoost Model Evaluation:")
print("MAE:", mean_absolute_error(y_test, xgb_predictions))
print("RMSE:", np.sqrt(mean_squared_error(y_test, xgb_predictions)))
print("R-squared (R²):", r2_score(y_test, xgb_predictions))

XGBoost Model Evaluation:
MAE: 96076.29387955746
RMSE: 122117.77295258203
R-squared (R²): 0.8787901076275872


**User Input for Prediction**

In [None]:
# User Input for Prediction
print("\nPredict House Price for New Data:")
avg_area_income = float(input("Average Area Income: "))
avg_area_house_age = float(input("Average Area House Age: "))
avg_area_num_rooms = float(input("Average Area Number of Rooms: "))
avg_area_num_bedrooms = float(input("Average Area Number of Bedrooms: "))
area_population = float(input("Area Population: "))

# Prepare the user input data
user_input = np.array([[avg_area_income, avg_area_house_age, avg_area_num_rooms, avg_area_num_bedrooms, area_population]])
user_input = scaler.transform(user_input)  # standardize with the scaler fitted on the training data

# Make predictions for user input
predicted_price = xgb_model.predict(user_input)
print(f"Predicted House Price: ${predicted_price[0]:,.2f}")


Predict House Price for New Data:
Average Area Income: 79545.45857431678
Average Area House Age: 5.682861321615587
Average Area Number of Rooms: 7.009188142792237
Average Area Number of Bedrooms: 4.09
Area Population: 23086.800502686456
Predicted House Price: $1,163,972.75
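
The input cell above passes whatever the user types straight to `float()`, which raises a bare `ValueError` on non-numeric text and silently accepts negative or non-finite values. A minimal validation sketch, assuming that all five features must be finite, non-negative numbers (an assumption, not something the project states), could look like this; `parse_feature` is a hypothetical helper.

```python
import math


def parse_feature(text, name, minimum=0.0):
    """Convert raw user text to a float, rejecting non-numeric or out-of-range values."""
    try:
        value = float(text)
    except ValueError:
        raise ValueError(f"{name}: expected a number, got {text!r}") from None
    if not math.isfinite(value):
        # float() accepts "nan" and "inf", which are not meaningful feature values
        raise ValueError(f"{name}: value must be finite, got {text!r}")
    if value < minimum:
        raise ValueError(f"{name}: must be at least {minimum}, got {value}")
    return value


# Example with literal strings standing in for input() results
rooms = parse_feature("7.009", "Average Area Number of Rooms")
print(rooms)
```

In the notebook, each `float(input(...))` call could be wrapped as `parse_feature(input(...), ...)`, optionally inside a loop that asks again when a `ValueError` is raised.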




The code cells above correspond to the steps outlined in this documentation. Run them in order to reproduce the complete house price prediction project.

## Conclusion

This project addressed the problem of house price prediction, with a focus on data preprocessing, model selection, and evaluation. On the held-out 20% test set, the XGBoost regression model achieved an R² of about 0.88, with an MAE of roughly $96,000 and an RMSE of roughly $122,000. This documentation outlines the problem statement, design thinking process, dataset description, data preprocessing steps, model training process, and the rationale behind the choice of regression algorithm and evaluation metrics.