# House Price Prediction Model Development

## Part 2: Feature Selection, Model Training, and Evaluation

This phase of the project continues building the house price prediction model. It consists of three steps: Feature Selection, Model Training, and Evaluation. The procedure for each step is described below.

### 1. Feature Selection

Feature selection is the process of choosing the most relevant features or attributes that have the most significant impact on predicting house prices. Effective feature selection can improve model accuracy, reduce overfitting, and enhance model interpretability.

**Procedure for Feature Selection:**

- **Data Inspection:** Examine the dataset containing features and the target variable (house prices) to understand the available data.

- **Domain Knowledge:** Consider domain knowledge and industry expertise to identify features that are likely to influence house prices. For example, factors like location, square footage, number of bedrooms, and bathrooms can be critical.

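- **Statistical Analysis:** Conduct statistical analysis, such as correlation analysis, to measure the relationships between features and the target variable. Features that correlate strongly with the target are candidates for inclusion; keep in mind that a standard (Pearson) correlation only captures linear relationships.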

- **Feature Importance:** Utilize techniques like feature importance from tree-based models (e.g., Random Forest) to identify the features that contribute the most to prediction.

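- **Dimensionality Reduction:** Explore dimensionality reduction methods, like Principal Component Analysis (PCA), to reduce the number of features while preserving as much information as possible. PCA replaces the original features with combined components rather than selecting a subset of them, so it trades some interpretability for compactness.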
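
The statistical steps above can be sketched as follows. This is a minimal illustration, not the project's actual selection code: the data is synthetic, and the column names (`income`, `house_age`, `rooms`, `noise`) and coefficients are assumptions chosen so that one column is deliberately uninformative.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): three informative columns, one pure-noise column.
rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "income": rng.normal(70_000, 10_000, n),
    "house_age": rng.normal(6, 1, n),
    "rooms": rng.normal(7, 1, n),
    "noise": rng.normal(0, 1, n),
})
df["price"] = (
    20 * df["income"]
    + 150_000 * df["house_age"]
    + 120_000 * df["rooms"]
    + rng.normal(0, 50_000, n)
)

features = df.drop(columns="price")
target = df["price"]

# 1. Correlation analysis: Pearson correlation of each feature with the target,
#    ordered by absolute strength.
correlations = features.corrwith(target).sort_values(key=abs, ascending=False)
print("Correlation with price:")
print(correlations)

# 2. Feature importance from a tree-based model.
forest = RandomForestRegressor(n_estimators=100, random_state=0)
forest.fit(features, target)
importances = pd.Series(
    forest.feature_importances_, index=features.columns
).sort_values(ascending=False)
print("Random Forest importances:")
print(importances)

# 3. PCA: keep enough components to explain 95% of the variance.
#    Features are standardized first because PCA is sensitive to scale.
scaled = StandardScaler().fit_transform(features)
pca = PCA(n_components=0.95)
components = pca.fit_transform(scaled)
print("Components kept:", components.shape[1])
print("Explained variance ratio:", pca.explained_variance_ratio_)
```

On data like this, the noise column should rank last by both correlation and importance, which is the signal one would use to drop it. PCA is unlikely to remove much here because the synthetic columns are nearly independent; it tends to help more when features are strongly correlated with each other.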

### 2. Model Training

Model training involves selecting a regression algorithm and training it on the preprocessed dataset. The choice of algorithm depends on the dataset's characteristics and on how well each candidate model performs during validation.

**Procedure for Model Training:**

- **Regression Algorithm Selection:** Choose a suitable regression algorithm for the task. Common choices include Linear Regression, Gradient Boosting, Random Forest Regression, XGBoost Regression, and Support Vector Regression.

- **Data Splitting:** Divide the dataset into training and testing sets (e.g., 80% for training, 20% for testing) to evaluate the model's generalization performance.

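- **Hyperparameter Tuning:** Optimize hyperparameters of the selected regression model. Techniques like grid search or random search can help in finding the best hyperparameters. Tune on the training set only (for example, with cross-validation) so that the test set remains an unbiased check of generalization.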

- **Model Fitting:** Fit the selected regression model to the training data. Fitting adjusts the model's parameters to minimize prediction error on that data.
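
A minimal sketch of the split, tune, and fit sequence described above. It assumes a Random Forest regressor and a small, illustrative hyperparameter grid; the data is synthetic and the grid values are placeholders rather than recommended settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (assumption): three features with a linear signal plus noise.
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 3))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.5 * X[:, 2] + rng.normal(scale=0.1, size=300)

# Data splitting: hold out 20% of the rows for the final check.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Hyperparameter tuning: grid search with 5-fold cross-validation,
# run on the training set only so the test set stays untouched.
param_grid = {"n_estimators": [50, 100], "max_depth": [4, None]}
search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    cv=5,
    scoring="neg_root_mean_squared_error",
)

# Model fitting: by default GridSearchCV refits the best configuration
# on the whole training set once the search finishes.
search.fit(X_train, y_train)

print("Best hyperparameters:", search.best_params_)
print("Cross-validated RMSE:", -search.best_score_)
print("Held-out R^2:", search.best_estimator_.score(X_test, y_test))
```

The scorer is negated because scikit-learn maximizes scores, which is why the sign is flipped when printing the RMSE.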

### 3. Evaluation

Evaluation is the final step. It measures how well the model predicts house prices on data it has not seen and informs the decision of whether the model is fit for use.

**Procedure for Model Evaluation:**

- **Metrics:** Calculate relevant regression metrics, including Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and R-squared (R²). MAE and RMSE report the typical prediction error in the same units as the price, with RMSE penalizing large errors more heavily; R² reports the proportion of the target variable's variance explained by the model.

- **Visualization:** Visualize the model's predictions vs. actual prices through scatter plots, regression plots, or other graphical representations. Visualization can help understand the model's strengths and weaknesses.

- **Cross-Validation:** Consider using k-fold cross-validation to assess the model's stability and robustness. Averaging the score over k train/validation splits gives a more reliable estimate of performance on unseen data than a single split does.

- **Interpretability:** Evaluate how interpretable the model is. In some cases, model interpretability is crucial for making informed decisions.
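
To make the metrics and the cross-validation step concrete, here is a small sketch. The helper `regression_metrics` is a hypothetical name introduced for illustration, the four prices are made-up numbers, and the cross-validation part runs on synthetic data rather than the housing dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score


def regression_metrics(y_true, y_pred):
    """Compute MAE, RMSE, and R-squared from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred
    mae = np.mean(np.abs(errors))
    rmse = np.sqrt(np.mean(errors ** 2))
    # R-squared: 1 minus (residual sum of squares / total sum of squares).
    r2 = 1.0 - np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2)
    return {"mae": mae, "rmse": rmse, "r2": r2}


# Worked example with made-up prices (assumption, not from the dataset).
actual = [200_000, 300_000, 400_000, 500_000]
predicted = [210_000, 290_000, 420_000, 480_000]
metrics = regression_metrics(actual, predicted)
print(metrics)  # MAE = 15000.0, RMSE ≈ 15811.39, R² = 0.98

# k-fold cross-validation on synthetic data (assumption): 5 shuffled folds,
# each scored by R-squared.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 4.0 * X[:, 0] + 1.5 * X[:, 1] + rng.normal(scale=0.5, size=200)

folds = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), X, y, cv=folds, scoring="r2")
print("R^2 per fold:", scores)
print("Mean R^2:", scores.mean(), "Std:", scores.std())
```

In the worked example the RMSE is larger than the MAE because the two 20,000 errors are weighted more heavily after squaring. A small standard deviation across folds suggests the model's performance does not depend much on which rows land in the training set.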

By following this procedure, we aim to develop a robust house price prediction model that can be used to make informed decisions in the real estate market. It is essential to document and track the results at each stage to monitor the model's progress and performance.

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Load the dataset
data = pd.read_csv("/content/USA_Housing.csv")

# Display the first few rows of the dataset
print(data.head())

# Data Preprocessing
# Extract features and target variable
X = data[["Avg. Area Income", "Avg. Area House Age", "Avg. Area Number of Rooms", "Avg. Area Number of Bedrooms", "Area Population"]]
y = data["Price"]

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Feature Selection
# All five numeric columns are used as features here; the free-text Address
# column is left out. Selection methods can be customized further as needed.

# Model Training
lr_model = LinearRegression()
lr_model.fit(X_train, y_train)

# Predict using Linear Regression
lr_predictions = lr_model.predict(X_test)

# Evaluation
print("Model Evaluation:")
print("MAE (Mean Absolute Error):", mean_absolute_error(y_test, lr_predictions))
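# RMSE is the square root of the MSE. The squared=False argument is avoided
# because it was deprecated and later removed in newer scikit-learn releases.
print("RMSE (Root Mean Squared Error):", mean_squared_error(y_test, lr_predictions) ** 0.5)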
print("R-squared (R²):", r2_score(y_test, lr_predictions))


   Avg. Area Income  Avg. Area House Age  Avg. Area Number of Rooms  \
0      79545.458574             5.682861                   7.009188   
1      79248.642455             6.002900                   6.730821   
2      61287.067179             5.865890                   8.512727   
3      63345.240046             7.188236                   5.586729   
4      59982.197226             5.040555                   7.839388   

   Avg. Area Number of Bedrooms  Area Population         Price  \
0                          4.09     23086.800503  1.059034e+06   
1                          3.09     40173.072174  1.505891e+06   
2                          5.13     36882.159400  1.058988e+06   
3                          3.26     34310.242831  1.260617e+06   
4                          4.23     26354.109472  6.309435e+05   

                                             Address  
0  208 Michael Ferry Apt. 674\nLaurabury, NE 3701...  
1  188 Johnson Views Suite 079\nLake Kathleen, CA...  
2  9127 Eli

In [None]:
import pandas as pd
from sklearn.linear_model import LinearRegression

# Load the dataset
data = pd.read_csv("/content/USA_Housing.csv")

# Data Preprocessing
# Extract features and target variable
X = data[["Avg. Area Income", "Avg. Area House Age", "Avg. Area Number of Rooms", "Avg. Area Number of Bedrooms", "Area Population"]]
y = data["Price"]

# Train a Linear Regression model on the full dataset
# (no held-out split, since this cell only serves predictions)
lr_model = LinearRegression()
lr_model.fit(X, y)

# Input from the user
print("Please enter the following features to predict the house price:")
avg_income = float(input("Average Area Income: "))
house_age = float(input("Average Area House Age: "))
num_rooms = float(input("Average Area Number of Rooms: "))
num_bedrooms = float(input("Average Area Number of Bedrooms: "))
population = float(input("Area Population: "))

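# Create a one-row DataFrame from user input, reusing the training column names
# so scikit-learn does not warn about missing feature names
user_input = pd.DataFrame(
    [[avg_income, house_age, num_rooms, num_bedrooms, population]],
    columns=X.columns,
)

# Predict the house price
predicted_price = lr_model.predict(user_input)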

# Display the predicted price
print(f"Predicted House Price: ${predicted_price[0]:,.2f}")


Please enter the following features to predict the house price:
Average Area Income: 79545.45857431678
Average Area House Age: 5.682861321615587
Average Area Number of Rooms: 7.009188142792237
Average Area Number of Bedrooms: 4.09
Area Population: 23086.800502686456
Predicted House Price: $1,223,847.04


