<a href="https://colab.research.google.com/github/relhe/inf8245ae/blob/main/pratice/housing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# House price prediction

# Overview
This Jupyter Notebook provides a comprehensive approach to modeling and predicting house prices using machine learning techniques. This project aims to explore the relationships between housing features and their prices, leveraging data preprocessing, exploratory data analysis, feature engineering, and model evaluation.

# Introduction
The real estate market is a dynamic and complex landscape influenced by a myriad of factors, including location, property features, economic conditions, and buyer preferences. Accurately predicting house prices is essential for various stakeholders, including buyers, sellers, investors, and real estate agents, as it enables informed decision-making and strategic planning.

This project focuses on modeling house prices using a dataset that contains various attributes related to residential properties. The primary goal is to develop a predictive model that can estimate house prices based on these features. By employing machine learning techniques, we aim to uncover the underlying relationships between property characteristics and their market values.

In this notebook, we will:

* **Explore the Dataset**: Understand the structure and content of the data, identify trends, and visualize key relationships.
* **Preprocess the Data**: Clean and prepare the dataset for analysis, addressing issues such as missing values and categorical variables.
* **Engineer Features**: Create new variables that could enhance model performance based on insights gained during the exploratory analysis.
* **Build and Evaluate Models**: Train several regression models, compare their performance using standard evaluation metrics, and select the best model for predictions.
* **Make Predictions**: Apply the best-performing model to predict house prices for new or unseen data, demonstrating the model's practical applicability.

By the end of this analysis, we will not only gain insights into the factors that influence house prices but also develop a robust predictive model that can be used in real-world scenarios. This notebook serves as a practical resource for anyone interested in leveraging data science techniques in the real estate domain.

## Set up dependencies for the project

In [None]:
%pip install numpy
%pip install matplotlib
%pip install pandas
%pip install seaborn
%pip install scikit-learn


## Imports
* Import the necessary libraries.

In [2]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns


# Data loading
Load and inspect the dataset containing various attributes of houses.
* Load the dataset and display the first rows.

In [None]:
df = pd.read_csv('Housing.csv')
df.head()

# Exploratory data analysis (EDA)
Visualize and analyze the dataset to uncover patterns and insights that influence house prices.
* Visualizations: histograms, scatter plots, and correlation matrices.
* Summary statistics and insights from the data.

In [None]:
print(f"The dataset has {df.shape[0]} rows and {df.shape[1]} columns\n")
print(f"The columns are: {df.columns.to_list()}\n")
print("The data types of each column are:")
print(df.dtypes.to_string())

## Summary statistics of the dataset

In [None]:
# Generate descriptive statistics that summarize the central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values.
print(df.describe().to_string())

## Key insights from the statistics
* **Price Distribution**: The **mean** house price is **$4,766,729**, with a standard deviation of $1,870,440, indicating a wide spread of prices within the dataset. The minimum price is $1,750,000 and the maximum is $13,300,000.

* **Bedrooms** and **Bathrooms**: The average number of bedrooms is 2.9 and the average number of bathrooms is 1.2, which suggests that a typical house has about 3 bedrooms and 1 bathroom.

* **Area**: The average area is 5150 square feet, with a range from 1650 to 16200 square feet. The dataset therefore includes both small and large homes, a spread that is likely to affect price.

* **Stories**: The mean number of stories is 0.86, which suggests that the average house has about 1 story.


## Visualization

In [None]:
numerical_columns = df.select_dtypes(include=[np.number]).columns.tolist()
print(f"The numerical type columns are: {numerical_columns}\n")
numerical_columns.remove('price')
print(f"The numerical type columns without the target column are: {numerical_columns}\n")

In [None]:
# Plotting the distribution of the numerical columns
for i in numerical_columns:
    sns.histplot(df[i])
    plt.title(f'Distribution of {i}')
    plt.show()

### Scatterplot of the house price versus area

In [None]:
sns.scatterplot(x='area', y='price', data=df)
plt.title('Price vs Area')
plt.show()

### Boxplots of house price versus discrete features

In [None]:
# Plotting box plots of the target column for each discrete numerical column
discrete_columns = numerical_columns.copy()
discrete_columns.remove('area')
for i in discrete_columns:
    sns.boxplot(x=df[i], y=df['price'])
    plt.title(f'Price vs {i}')
    plt.show()

Potential correlations: Initial observations suggest that variables such as `area`, `bedrooms`, and `bathrooms` may correlate positively with house prices. Further analysis, such as a correlation matrix, can quantify these relationships.
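
The correlation analysis mentioned above could be sketched as follows. This is a minimal illustration, not the notebook's own analysis: `toy_df` and its values are hypothetical stand-ins for `df`, and only the column names `price` and `area` are taken from the cells above.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for `df`; in the notebook, use the DataFrame loaded from Housing.csv.
toy_df = pd.DataFrame({
    "price": [4_200_000, 5_600_000, 3_500_000, 9_100_000, 6_300_000, 2_800_000],
    "area": [3600, 6000, 3000, 8500, 6500, 2400],
    "bedrooms": [2, 3, 2, 4, 4, 2],
    "bathrooms": [1, 2, 1, 3, 2, 1],
    "mainroad": ["yes", "yes", "no", "yes", "yes", "no"],
})

# Pearson correlation between every pair of numerical columns.
corr_matrix = toy_df.select_dtypes(include=[np.number]).corr()

# Correlation of each feature with the target, strongest first.
price_corr = corr_matrix["price"].drop("price").sort_values(ascending=False)
print(price_corr)
```

Passing `corr_matrix` to `sns.heatmap(corr_matrix, annot=True)` would display the same information as a colored grid.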

# Data preprocessing
Clean and prepare the dataset for analysis, addressing issues such as missing values and categorical variables.

* Handling missing values.
* Encoding categorical features.
* Scaling and normalizing numerical features.

### Missing values
We first check whether any column contains missing values. If some exist, they must be handled (for example by dropping or imputing them) before modeling.

In [None]:
print("False if there are no missing values and True if there are missing values for each column:")
print(df.isnull().sum() > 0)

There are no missing values in the dataset.
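
For a dataset where the same check does report gaps, two common ways of handling them are dropping incomplete rows and imputing values. The sketch below assumes a hypothetical `toy_df` with made-up values; the imputation rules (median for numbers, mode for categories) are one reasonable choice among several.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with gaps, used only to illustrate the two approaches.
toy_df = pd.DataFrame({
    "area": [3600.0, np.nan, 3000.0, 8500.0],
    "bedrooms": [2.0, 3.0, np.nan, 4.0],
    "furnishingstatus": ["furnished", None, "unfurnished", "furnished"],
})

# Count missing entries per column.
missing_counts = toy_df.isnull().sum()

# Option 1: drop every row that contains at least one missing value.
dropped = toy_df.dropna()

# Option 2: impute numerical columns with the median and the categorical one with the mode.
imputed = toy_df.copy()
num_cols = imputed.select_dtypes(include=[np.number]).columns
imputed[num_cols] = imputed[num_cols].fillna(imputed[num_cols].median())
imputed["furnishingstatus"] = imputed["furnishingstatus"].fillna(
    imputed["furnishingstatus"].mode()[0]
)

print(missing_counts)
print(imputed)
```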

### Encoding categorical variables
Machine learning algorithms require numerical input, so we need to convert categorical variables into numerical formats. We can use techniques like one-hot encoding or label encoding:
* **One-Hot Encoding**: This method creates binary columns for each category in a categorical variable.
* **Label Encoding**: Use this method for ordinal categories where there is a natural order. For example, if we had a categorical variable "Quality" with values like "Low," "Medium," and "High," we could map them to numerical values.
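
Both encodings can be sketched with pandas alone. The frame below is hypothetical: the column names `mainroad` and `furnishingstatus` and their categories are assumptions for illustration, and `quality` is the example variable from the bullet above, not a column of the dataset.

```python
import pandas as pd

# Hypothetical sample; column names and categories are assumptions for illustration.
toy_df = pd.DataFrame({
    "mainroad": ["yes", "no", "yes", "yes"],
    "furnishingstatus": ["furnished", "unfurnished", "semi-furnished", "furnished"],
    "quality": ["Low", "High", "Medium", "Low"],
})

# One-hot encoding: one binary column per category of the nominal variables.
one_hot = pd.get_dummies(toy_df[["mainroad", "furnishingstatus"]], dtype=int)

# Label (ordinal) encoding: map ordered categories to increasing integers.
quality_order = {"Low": 0, "Medium": 1, "High": 2}
quality_encoded = toy_df["quality"].map(quality_order)

encoded_df = pd.concat([one_hot, quality_encoded], axis=1)
print(encoded_df)
```

For linear models, passing `drop_first=True` to `pd.get_dummies` removes one redundant column per variable.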

### Scaling numerical features
To ensure that all numerical features contribute equally to model training, we can scale them using standardization (z-score normalization) or min-max scaling:

* **Standardization**: This technique scales features to have a mean of 0 and a standard deviation of 1.
* **Min-Max Scaling**: This method scales features to a specified range, typically [0, 1].
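
A minimal sketch of both scalers using scikit-learn, which is installed in the setup cell. The array `X` holds hypothetical values standing in for two numerical columns.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical values standing in for two numerical columns (e.g. area and bedrooms).
X = np.array([[1650.0, 2.0], [5150.0, 3.0], [16200.0, 5.0], [4000.0, 3.0]])

# Standardization: subtract the column mean and divide by the column standard deviation.
X_standardized = StandardScaler().fit_transform(X)

# Min-max scaling: map each column linearly onto [0, 1].
X_minmax = MinMaxScaler(feature_range=(0, 1)).fit_transform(X)

print(X_standardized.mean(axis=0), X_standardized.std(axis=0))
print(X_minmax.min(axis=0), X_minmax.max(axis=0))
```

In a full pipeline, the scaler is usually fitted on the training set only and then applied to the test set, so that no information from the test data leaks into training.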

# Feature engineering and feature selection
After preprocessing, it’s essential to select relevant features that will be used in the model. We can drop unnecessary columns or use methods like correlation analysis to determine which features contribute most to the target variable (house prices).

Create new variables that could enhance model performance based on insights gained during the exploratory analysis.

* Creating new features based on existing data.
* Selecting relevant features for modeling.
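
One way these two steps might look in practice is sketched below. The derived features `total_rooms` and `area_per_bedroom`, the `threshold` value, and the data in `toy_df` are all hypothetical choices made for illustration, not decisions taken in this notebook.

```python
import pandas as pd

# Hypothetical stand-in for the preprocessed dataset.
toy_df = pd.DataFrame({
    "price": [4_200_000, 5_600_000, 3_500_000, 9_100_000, 6_300_000, 2_800_000],
    "area": [3600, 6000, 3000, 8500, 6500, 2400],
    "bedrooms": [2, 3, 2, 4, 4, 2],
    "bathrooms": [1, 2, 1, 3, 2, 1],
    "stories": [1, 2, 1, 3, 2, 1],
})

# Create new features from existing columns (the target is left out to avoid leakage).
features = toy_df.drop(columns="price").copy()
features["total_rooms"] = features["bedrooms"] + features["bathrooms"]
features["area_per_bedroom"] = features["area"] / features["bedrooms"]

# Rank features by absolute Pearson correlation with the target.
relevance = features.corrwith(toy_df["price"]).abs().sort_values(ascending=False)

# Keep features whose absolute correlation exceeds an arbitrary threshold.
threshold = 0.3
selected = relevance[relevance > threshold].index.tolist()

print(relevance)
print(selected)
```

Correlation only captures linear relationships, so a feature that falls below the threshold may still be useful to a non-linear model.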

# Model definition
Train several regression models, compare their performance using standard evaluation metrics, and select the best model for predictions.

* Splitting the dataset into training and test sets.
* Training various regression models.
* Evaluating models using appropriate metrics.
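
The three steps above can be sketched with scikit-learn. Everything in this example is an assumption made for illustration: the data is synthetic, the two models (linear regression and a random forest) are just common candidates, and selecting the best model by RMSE is one possible criterion.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared feature matrix X and price vector y.
rng = np.random.default_rng(42)
n_samples = 300
area = rng.uniform(1650, 16200, n_samples)
bedrooms = rng.integers(1, 7, n_samples)
X = np.column_stack([area, bedrooms])
y = 500 * area + 300_000 * bedrooms + rng.normal(0, 500_000, n_samples)

# Hold out 20% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "linear_regression": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=42),
}

# Fit each model on the training set and score it on the held-out test set.
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    results[name] = {
        "mae": mean_absolute_error(y_test, y_pred),
        "rmse": np.sqrt(mean_squared_error(y_test, y_pred)),
        "r2": r2_score(y_test, y_pred),
    }

# Pick the model with the lowest test RMSE.
best_model = min(results, key=lambda name: results[name]["rmse"])
print(results)
print(best_model)
```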

# Error or loss function to minimize
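
A common choice for regression, assumed here rather than taken from the notebook, is the mean squared error (MSE). The symbols are illustrative: $n$ is the number of houses, $y_i$ the observed price of house $i$, and $\hat{y}_i$ the price predicted by the model.

$$
\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2
$$

Its square root, the RMSE, is expressed in the same unit as the price, which makes it easier to interpret.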

# Model testing with test set