---

## **About Author:**

---

[<img src="https://media.licdn.com/dms/image/v2/D4D03AQEsEJ_gVNnU3w/profile-displayphoto-shrink_200_200/profile-displayphoto-shrink_200_200/0/1720678393429?e=2147483647&v=beta&t=RALfVAKvT6TmP2BBpil_CGzZK5L5ykNUou5yModXTVw" width="20%">](https://shaheer.kesug.com)

**Mr. Shaheer Ali**  
*Data Scientist / ML Engineer*\
BS Computer Science

<a href="https://www.youtube.com/channel/UCUTphw52izMNv9W6AOIFGJA">
  <img width="50" height="50" src="https://img.icons8.com/color/48/youtube-play.png" alt="youtube-play"/>
</a>
<a href="https://twitter.com/__shaheerali190">
  <img width="50" height="50" src="https://img.icons8.com/dotty/80/x.png" alt="x"/>
</a>
<a href="https://www.linkedin.com/in/shaheer-ali-2761aa303/">
  <img width="50" height="50" src="https://img.icons8.com/color/48/linkedin.png" alt="linkedin"/>
</a>
<a href="https://github.com/shaheeralics">
  <img width="50" height="50" src="https://img.icons8.com/ios-glyphs/50/github.png" alt="github"/>
</a>
<a href="https://www.kaggle.com/shaheerali197">
  <img width="50" height="50" src="https://img.icons8.com/clouds/100/kaggle.png" alt="kaggle"/>
</a>

[Portfolio Site](https://shaheer.kesug.com)

---

# UK House Price Prediction:

## **`Table of Contents`**

1. [**Introduction**](#1)
   1. [**1.1 Project Description**]
   2. [**1.2 Data Description**]
2. [**Acquiring and Loading Data**](#2)
   1. [**2.1 Importing Libraries:**]
   2. [**2.2 Loading Data**]
   3. [**2.3 Basic Data Exploration**]
   4. [**2.4 Areas to Fix**]
3. [**Data Preprocessing**](#3)
   1. [**3.1 Pre-processing Details:**]
   2. [**3.2 Rename Columns**]
   3. [**3.3 Drop Redundant Columns**]
   4. [**3.4 Changing Data Types**]
   5. [**3.5 Dropping Duplicates**]
   6. [**3.6 Handling Missing Values**]
   7. [**3.7 Handling Unreasonable Data Ranges**]
   8. [**3.8 Feature Engineering / Transformation**]
4. [**Data Analysis**](#4)
   1. [**4.1 Exploring `Column Name`**]
5. [**Conclusion**](#5)
   1. [**5.1 Insights**]
   2. [**5.2 Suggestions**]
   3. [**5.3 Possible Next Steps**]
6. [**Epilogue**](#6) 
   1. [**6.1 References**]
   2. [**6.2 Information about the Author:**]

---

# **`1. Introduction`**
 

![image](https://storage.googleapis.com/kaggle-datasets-images/5759473/9470878/ff2cff623e6d418dc5a6f4382b86f047/dataset-cover.png?t=2024-09-24-13-49-24)

##  **1.1 Project Description**

**Goal/Purpose:** 

The goal of this project is straightforward: train a machine learning model on a Kaggle dataset of UK property transactions, then use it to predict a house's sale price from a handful of input features.

<p>&nbsp;</p>

**Models To Be Used**

Models to be used will be decided at the end of the project.

<p>&nbsp;</p>

**Assumptions/Methodology/Scope:** 

To be discussed once the project is complete.

<p>&nbsp;</p>

## **1.2 Data Description**

**Content:** 

The dataset is saved as a CSV file with 90,000 records, each representing a property transaction in the UK from 2015 to 2024.

<p>&nbsp;</p>

**Description of Data & Attributes:** 

This dataset has been meticulously pre-processed from the official UK government’s Price Paid Data, available for research purposes. The original dataset contains millions of rows spanning from 1995 to 2024, which posed significant challenges for machine learning operations due to its large size. For this project, we focused on house price predictions and filtered the data to only include transactions from 2015 to 2024. The final dataset contains 90,000 randomly sampled records, which should be ideal for training machine learning models efficiently.

The goal of this dataset is to provide a well-structured, pre-processed dataset for students, researchers, and developers interested in creating house price prediction models using UK data. There are limited UK house price datasets available on Kaggle, so this contribution aims to fill that gap, offering a reliable dataset for dissertations, academic projects, or research purposes.

This dataset is tailored for use in supervised learning models and has been cleaned, ensuring the removal of missing values and encoding of categorical variables. We hope this serves as a valuable resource for anyone studying house price prediction or real estate trends in the UK.


| Feature Name  | Description  |
| :------------ | :----------- |
| Price         | Sale price of the property (target variable). |
| Date          | Date of the property transaction. Converted to datetime format for easier handling. |
| Postcode      | Postcode of the property, offering location-based information. |
| property_type | Type of property (Detached, Semi-detached, Terraced, Flat, etc.). |
| new_build     | Indicator whether the property was newly built at the time of sale (Yes or No). |
| freehold      | Indicator whether the property was sold as freehold or leasehold (Freehold, Leasehold). |
| Street        | Street name of the property location. |
| Locality      | Locality of the property. |
| Town          | Town or city where the property is located. |
| District      | Administrative district of the property. |
| County        | County where the property is located. |

<p>&nbsp;</p>

**Acknowledgements:** 

This dataset is provided by `Swarup Sudulaganti`, and the original source can be found on [Kaggle](https://www.kaggle.com/datasets/swarupsudulaganti/uk-house-price-prediction-dataset-2015-to-2024).

---

# **`2. Acquiring & Loading Data`**
 

## **2.1 Importing Libraries:**

In [1]:
# Data manipulation and visualization
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px

## **2.2 Loading Data**

In [2]:
# Load DataFrame
file = '../../datasets/Projects/P1-UK_House_Price_Prediction_dataset_2015_to_2024.csv'
df = pd.read_csv(file)

## **2.3 Basic Data Exploration**

In [3]:
# Show rows and columns count
print(f"Rows count: {df.shape[0]}\nColumns count: {df.shape[1]}")

Rows count: 90000
Columns count: 11


In [4]:
df.head()

Unnamed: 0,price,date,postcode,property_type,new_build,freehold,street,locality,town,district,county
0,735000,2017-08-07,LE17 5AP,D,N,F,CLAYBROOKE COURT,CLAYBROOKE PARVA,LUTTERWORTH,HARBOROUGH,LEICESTERSHIRE
1,160000,2023-02-03,SA11 4BD,T,N,F,GORED COTTAGES,MELINCOURT,NEATH,NEATH PORT TALBOT,NEATH PORT TALBOT
2,176500,2015-01-06,ME3 0DQ,S,N,F,GREEN LANE,ISLE OF GRAIN,ROCHESTER,MEDWAY,MEDWAY
3,625000,2021-10-13,RH20 3EU,D,N,F,LINFIELD COPSE,THAKEHAM,PULBOROUGH,HORSHAM,WEST SUSSEX
4,202000,2019-09-27,SN13 8EN,S,N,F,CLYDESDALE ROAD,BOX,CORSHAM,WILTSHIRE,WILTSHIRE


In [5]:
df.tail()

Unnamed: 0,price,date,postcode,property_type,new_build,freehold,street,locality,town,district,county
89995,295000,2021-08-13,GL11 5UW,D,N,F,PEVELANDS,CAM,DURSLEY,STROUD,GLOUCESTERSHIRE
89996,325000,2021-12-07,WS3 3UJ,D,N,F,GLENEAGLES ROAD,BLOXWICH,WALSALL,WALSALL,WEST MIDLANDS
89997,167000,2015-11-06,BD17 5LR,S,N,F,CORNWALL CRESCENT,BAILDON,SHIPLEY,BRADFORD,WEST YORKSHIRE
89998,80000,2023-09-26,DN4 7RZ,F,N,L,STOOPS LANE,BESSACARR,DONCASTER,DONCASTER,SOUTH YORKSHIRE
89999,1912500,2019-02-08,NE11 0TU,O,N,L,PRINCESWAY,TEAM VALLEY TRADING ESTATE,GATESHEAD,GATESHEAD,TYNE AND WEAR


### *2.3.1 Check Data Types*

In [6]:
# Show data types
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 90000 entries, 0 to 89999
Data columns (total 11 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   price          90000 non-null  int64 
 1   date           90000 non-null  object
 2   postcode       90000 non-null  object
 3   property_type  90000 non-null  object
 4   new_build      90000 non-null  object
 5   freehold       90000 non-null  object
 6   street         90000 non-null  object
 7   locality       90000 non-null  object
 8   town           90000 non-null  object
 9   district       90000 non-null  object
 10  county         90000 non-null  object
dtypes: int64(1), object(10)
memory usage: 7.6+ MB


- `price` is an **integer** (`int64`).
- `date`, `postcode`, `property_type`, `new_build`, `freehold`, `street`, `locality`, `town`, `district`, and `county` are **objects**.
- `date` should be converted to datetime.
### *2.3.2 Check Missing Data*

In [7]:
# Print percentage of missing values
missing_percent = df.isna().mean().sort_values(ascending=False)
print('---- Percentage of Missing Values (%) -----')
if missing_percent.sum():
    print(missing_percent[missing_percent > 0] * 100)
else:
    print('No Missing Values')

---- Percentage of Missing Values (%) -----
No Missing Values


### 2.3.3 Check for Duplicate Rows

In [8]:
# Show number of duplicated rows
print(f"No. of entirely duplicated rows: {df.duplicated().sum()}")

No. of entirely duplicated rows: 43
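
Before deciding whether to drop duplicates, it can help to look at the rows themselves. The following is a minimal sketch on a toy frame (the second row is repeated on purpose for illustration; `sample` and `dupes` are hypothetical names, not part of the notebook). Passing `keep=False` flags every member of a duplicated group rather than only the later repeats.

```python
import pandas as pd

# Toy frame: the second row deliberately repeats the first (illustration only)
sample = pd.DataFrame({
    "price": [160000, 160000, 202000],
    "date": ["2023-02-03", "2023-02-03", "2019-09-27"],
    "postcode": ["SA11 4BD", "SA11 4BD", "SN13 8EN"],
})

# keep=False marks all rows in a duplicated group, so both copies are shown
dupes = sample[sample.duplicated(keep=False)].sort_values(list(sample.columns))
print(dupes)
print(f"Rows involved in duplication: {len(dupes)}")
```

On the real DataFrame the same pattern would be `df[df.duplicated(keep=False)]`, which makes it easier to judge whether the repeats are data-entry copies or separate transactions.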


### 2.3.4 Check Uniqueness of Data

In [9]:
# Print the number of unique values per column, sorted ascending
num_unique = df.nunique().sort_values()
print('---- Number Of Unique Values in The Dataset -----')
print(num_unique)

---- Number Of Unique Values in The Dataset -----
new_build            2
freehold             2
property_type        5
county             117
district           359
town               958
date              2553
price             6111
locality         10145
street           43483
postcode         75323
dtype: int64
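
Raw counts are easier to compare when expressed relative to the number of rows. The sketch below, on a toy frame built from a few values shown in `df.head()`, computes the share of unique values per column; `sample` and `unique_pct` are illustrative names only.

```python
import pandas as pd

# Toy frame using a few values from the df.head() output
sample = pd.DataFrame({
    "property_type": ["D", "T", "S", "D", "S"],
    "new_build": ["N", "N", "N", "N", "N"],
    "postcode": ["LE17 5AP", "SA11 4BD", "ME3 0DQ", "RH20 3EU", "SN13 8EN"],
})

# Percentage of distinct values per column: near 100% suggests an identifier-like
# column, while a low percentage suggests a categorical column
unique_pct = (sample.nunique() * 100 / len(sample)).sort_values()
print(unique_pct)
```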


### *2.3.5 Check Data Range*

In [17]:
# Import library
from skimpy import skim
# Print summary statistics (printed explicitly, since a bare expression
# in the middle of a cell is not displayed)
print(df.describe(include='all'))
skim(df)

### *2.3.6 Checking Value Counts of Categorical Columns*

In [19]:
# checking value counts of categorical columns having less than 10 unique values
for col in df.select_dtypes(include='object').columns:
    if df[col].nunique() < 10:
        print(f'---- {col} ----')
        print(df[col].value_counts())
        print('\n')

---- property_type ----
property_type
D    30435
S    26228
T    21380
F     7568
O     4389
Name: count, dtype: int64


---- new_build ----
new_build
N    78862
Y    11138
Name: count, dtype: int64


---- freehold ----
freehold
F    76994
L    13006
Name: count, dtype: int64
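
The single-letter codes above are compact but hard to read in plots and tables. Here is a minimal sketch that maps them to readable labels. The `new_build` and `freehold` labels follow the data description table; the `property_type` meanings are assumed from the HM Land Registry Price Paid Data conventions and are not confirmed by this notebook. All dictionary and variable names are hypothetical.

```python
import pandas as pd

# Assumed code-to-label mappings (property_type meanings are an assumption
# based on the Price Paid Data documentation)
PROPERTY_TYPE_LABELS = {
    "D": "Detached",
    "S": "Semi-detached",
    "T": "Terraced",
    "F": "Flat",
    "O": "Other",
}
NEW_BUILD_LABELS = {"Y": "Yes", "N": "No"}
TENURE_LABELS = {"F": "Freehold", "L": "Leasehold"}

# Toy frame covering every property_type code seen in the value counts
sample = pd.DataFrame({
    "property_type": ["D", "T", "S", "F", "O"],
    "new_build": ["N", "N", "N", "N", "N"],
    "freehold": ["F", "F", "F", "L", "L"],
})

# Series.map leaves NaN for any code missing from the dictionary,
# which makes unexpected codes easy to spot
labelled = sample.assign(
    property_type=sample["property_type"].map(PROPERTY_TYPE_LABELS),
    new_build=sample["new_build"].map(NEW_BUILD_LABELS),
    freehold=sample["freehold"].map(TENURE_LABELS),
)
print(labelled)
```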




---

## **2.4 Areas to Fix**
**Data Types**
- [date column] To be converted to datetime

**Missing Data**
- [No Missing Values Found] 

**Duplicate Rows**
- [43 duplicated rows found, but they are not treated as true duplicates], so this step is skipped.

---

# **`3. Data Preprocessing`**

## **3.1 Pre-processing Details:**

- Changing Data Types
- Feature Engineering / Transformation

## **3.4 Changing Data Types**

In [34]:
# # Convert columns to the right data types
# import pandas.api.types as ptypes
# df[col] = df[col].astype('string')
# df[col] = df[col].astype('int')
# df[col] = pd.to_datetime(df[col])  # infer_datetime_format is deprecated since pandas 2.0

# # Convert to categorical datatype
# col_cat = ptypes.CategoricalDtype(categories=['A', 'B', 'C'], ordered=True)
# df['col_cat'] = df['col_cat'].astype(col_cat)

In [35]:
# # Verify conversion
# assert ptypes.is_string_dtype(df[col])
# assert ptypes.is_numeric_dtype(df[col])
# cols_to_check = []
# assert all(ptypes.is_datetime64_any_dtype(df[col]) for col in cols_to_check)
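
Applied to this dataset, the conversion flagged in section 2.4 could look like the sketch below. It runs on a toy frame built from the first rows of `df.head()`; the explicit `format="%Y-%m-%d"` is an assumption based on how the dates appear in that output, and casting `property_type` to `category` is an optional extra rather than something the notebook requires.

```python
import pandas as pd
import pandas.api.types as ptypes

# Toy frame using the first three rows shown in df.head()
sample = pd.DataFrame({
    "price": [735000, 160000, 176500],
    "date": ["2017-08-07", "2023-02-03", "2015-01-06"],
    "property_type": ["D", "T", "S"],
})

# Parse ISO-style date strings into datetime64 values
sample["date"] = pd.to_datetime(sample["date"], format="%Y-%m-%d")

# Optional: store the low-cardinality column as a categorical
sample["property_type"] = sample["property_type"].astype("category")

# Verify the conversions
assert ptypes.is_datetime64_any_dtype(sample["date"])
assert isinstance(sample["property_type"].dtype, pd.CategoricalDtype)
print(sample.dtypes)
```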

## **3.8 Feature Engineering / Transformation**

In [40]:
# # Get unique values of interested columns
# cols = []
# pd.unique(df[cols].values.ravel('k'))  # argument 'k' lists the values in the order of the cols 

In [41]:
# # Create custom function
# # Google style docstrings
# # https://sphinxcontrib-napoleon.readthedocs.io/en/latest/example_google.html
# def custom_function(param1: int, param2: str) -> bool:
#     """Example function with PEP 484 type annotations.

#     Args:
#         param1: The first parameter.
#         param2: The second parameter.

#     Returns:
#         The return value. True for success, False otherwise.

#     """

In [42]:
# # Apply function to multiple columns
# cols = []
# df_updated = df.copy()
# df_updated[cols] = df_updated[cols].map(custom_function)  # DataFrame.map needs pandas >= 2.1; use applymap on older versions

# # Create new aggregated boolean column
# df_updated['bool'] = df_updated[cols].any(axis=1, skipna=False)
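
As a concrete illustration of what feature engineering might look like for this dataset, the sketch below derives a few candidate features from `date`, `postcode`, and `price`. The feature names (`sale_year`, `sale_month`, `postcode_outward`, `postcode_area`, `log_price`) are hypothetical choices, not decisions made in this notebook, and the toy frame reuses rows from `df.head()`.

```python
import numpy as np
import pandas as pd

# Toy frame using the first three rows shown in df.head()
sample = pd.DataFrame({
    "price": [735000, 160000, 176500],
    "date": pd.to_datetime(["2017-08-07", "2023-02-03", "2015-01-06"]),
    "postcode": ["LE17 5AP", "SA11 4BD", "ME3 0DQ"],
})

features = sample.assign(
    # Calendar parts of the transaction date
    sale_year=sample["date"].dt.year,
    sale_month=sample["date"].dt.month,
    # Outward code: the part of the postcode before the space
    postcode_outward=sample["postcode"].str.split(" ").str[0],
    # Postcode area: the leading letters of the postcode
    postcode_area=sample["postcode"].str.extract(r"^([A-Z]+)", expand=False),
    # log(1 + price) compresses the long right tail of sale prices
    log_price=np.log1p(sample["price"]),
)
print(features)
```

Coarser location features such as the postcode area have far fewer distinct values than the full postcode, which may make them easier to encode for a model.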

---

# **`4. Data Analysis`**
 

Here is where your analysis begins. You can add different sections based on your project goals.

## **4.1 Exploring `Column Name`**

In [43]:
# Code and visualization

**Observations**
- Ob 1
- Ob 2
- Ob 3
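
One possible starting point for this section is a grouped summary of `price` by `property_type`. The sketch below uses only the ten rows shown in the `df.head()` and `df.tail()` outputs, so the numbers illustrate the technique and say nothing about the full dataset; `sample` and `summary` are hypothetical names.

```python
import pandas as pd

# The ten rows shown in df.head() and df.tail()
sample = pd.DataFrame({
    "price": [735000, 160000, 176500, 625000, 202000,
              295000, 325000, 167000, 80000, 1912500],
    "property_type": ["D", "T", "S", "D", "S", "D", "D", "S", "F", "O"],
})

# Count and median price per property type, most expensive first
summary = (
    sample.groupby("property_type")["price"]
    .agg(["count", "median"])
    .sort_values("median", ascending=False)
)
print(summary)
```

The median is used rather than the mean because a few very high sale prices can pull the mean upwards.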

---

# **`5. Conclusion`**
 



## **5.1 Insights**
State the insights/outcomes of your project or notebook.

## **5.2 Suggestions**

Make suggestions based on insights.

## **5.3 Possible Next Steps**
Areas to expand on:
- (if there is any)

---

# **`6. Epilogue`**

## **6.1 References**

This is how we use inline citation[<sup id="fn1-back">[1]</sup>](#fn1).

[<span id="fn1">1.</span>](#fn1-back) _Author (date)._ Title. Available at: https://website.com (Accessed: Date). 

> Use [https://www.citethisforme.com/](https://www.citethisforme.com/) to create citations.

---

## **6.2 Information about the Author:**

[<img src="https://media.licdn.com/dms/image/D4D03AQH8PR9DDb3VxQ/profile-displayphoto-shrink_200_200/0/1713280211622?e=2147483647&v=beta&t=5TpzxNZJRmU3_zjNLoRb-O2V9amv1-1rwM5OczG01ZY" width="20%">](https://www.facebook.com/groups/codanics/permalink/1872283496462303/ "Image")


**Mr. Shaheer Ali**

BS Computer Science\
[YouTube channel](https://www.youtube.com/channel/UCUTphw52izMNv9W6AOIFGJA)\
[Twitter](https://twitter.com/__shaheerali190)\
[LinkedIn](https://www.linkedin.com/in/shaheer-ali-2761aa303/)\
[GitHub](https://github.com/shaheeralics)\
[Kaggle](https://www.kaggle.com/shaheerali197)\
[Portfolio Website](https://shaheer.kesug.com)

## **6.3 Versioning**
Notebook and insights by Mr. Shaheer Ali.
- Version: 1.5.0
- Date: 2023-05-15