# Level 1, Task 1: Data Cleaning & Preprocessing

In this notebook, I worked with a raw housing dataset and applied some basic data cleaning steps to make it ready for analysis.

## What I Did:
- Checked for and handled missing values
- Removed duplicate rows
- Standardized data formats
- Saved the cleaned dataset for later tasks

## Tools I Used:
- Python
- Pandas
- NumPy

Let's get started!

In [23]:
# Importing libraries
import numpy as np  # needed later for select_dtypes(include=[np.number])
import pandas as pd

In [24]:
# Loading the dataset
df = pd.read_csv('house prediction.csv')

In [25]:
# Displaying the first few rows of the dataset to understand its structure
print("Preview of the dataset")
df.head()

Preview of the dataset


Unnamed: 0,0.00632 18.00 2.310 0 0.5380 6.5750 65.20 4.0900 1 296.0 15.30 396.90 4.98 24.00
0,0.02731 0.00 7.070 0 0.4690 6.4210 78...
1,0.02729 0.00 7.070 0 0.4690 7.1850 61...
2,0.03237 0.00 2.180 0 0.4580 6.9980 45...
3,0.06905 0.00 2.180 0 0.4580 7.1470 54...
4,0.02985 0.00 2.180 0 0.4580 6.4300 58...


In [26]:
# Checking basic information about the dataset
print("Dataset Info: ")
df.info()

Dataset Info: 
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 505 entries, 0 to 504
Data columns (total 1 columns):
 #   Column                                                                                            Non-Null Count  Dtype 
---  ------                                                                                            --------------  ----- 
 0    0.00632  18.00   2.310  0  0.5380  6.5750  65.20  4.0900   1  296.0  15.30 396.90   4.98  24.00  505 non-null    object
dtypes: object(1)
memory usage: 4.1+ KB
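
The `info()` output above reports a single `object` column whose name looks like a row of numbers. That suggests the file is whitespace-delimited and has no header row, so the default comma parsing put every value into one string column and used the first record as the header. The sketch below is a minimal illustration under that assumption. It uses an in-memory sample (the one full row visible in the output) rather than the real file, and the `sep` and `header` settings are my guess at a fix, not something confirmed against the original data.

```python
from io import StringIO

import pandas as pd

# One row copied from the column name shown in the info() output above.
sample = (
    "0.00632 18.00 2.310 0 0.5380 6.5750 65.20 "
    "4.0900 1 296.0 15.30 396.90 4.98 24.00\n"
)

# Default parsing: comma separator, first line treated as the header.
naive = pd.read_csv(StringIO(sample))

# Assumed fix: split on runs of whitespace and treat every line as data.
parsed = pd.read_csv(StringIO(sample), sep=r"\s+", header=None)

print(naive.shape)    # a single column, with the only row consumed as header
print(parsed.shape)   # one row split into separate columns
print(parsed.dtypes)  # each column is parsed as a number
```

If the real file behaves the same way, the later steps (summary statistics, median fill, duplicate checks) would operate on real numeric columns instead of one text column.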


In [27]:
# Summary statistics (describe() reports count/unique/top/freq when there are no numeric columns)
print("Summary Statistics:")
df.describe()

Summary Statistics:


Unnamed: 0,0.00632 18.00 2.310 0 0.5380 6.5750 65.20 4.0900 1 296.0 15.30 396.90 4.98 24.00
count,505
unique,505
top,0.04741 0.00 11.930 0 0.5730 6.0300 80...
freq,1


In [28]:
# Check for missing values
print("Missing Values in Each Column:")
missing = df.isnull().sum()
print(missing[missing > 0])  # only show columns with missing values


Missing Values in Each Column:
Series([], dtype: int64)


In [29]:
# Handle missing values

# Example: Fill numerical missing values with median
numeric_cols = df.select_dtypes(include=[np.number]).columns
for col in numeric_cols:
    df[col] = df[col].fillna(df[col].median())

# Example: Fill categorical missing values with mode
categorical_cols = df.select_dtypes(include='object').columns
for col in categorical_cols:
    df[col] = df[col].fillna(df[col].mode()[0])

print("Missing values handled.")


Missing values handled.
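
Since the check above found no missing values, the fill loops leave this dataset unchanged. To show what they would do when gaps exist, here is a small sketch on a hypothetical toy frame. The column names and values are invented for illustration and do not come from the housing data. Non-numeric columns are selected with `exclude=[np.number]` here, which is an assumption made to keep the example independent of how pandas labels string columns.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; names and values are invented for illustration.
toy = pd.DataFrame({
    "price": [100.0, np.nan, 300.0, 200.0],
    "city": ["austin", None, "austin", "dallas"],
})

# Numeric columns: fill gaps with the column median.
for col in toy.select_dtypes(include=[np.number]).columns:
    toy[col] = toy[col].fillna(toy[col].median())

# Non-numeric columns: fill gaps with the most frequent value (the mode).
for col in toy.select_dtypes(exclude=[np.number]).columns:
    toy[col] = toy[col].fillna(toy[col].mode()[0])

print(toy)
```

The median is used for numbers because it is less sensitive to outliers than the mean. The mode is a simple default for text columns.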


In [30]:
# Removing duplicate rows
print(f"Duplicates before removal: {df.duplicated().sum()}")
df = df.drop_duplicates()
print(f"Duplicates after removal: {df.duplicated().sum()}")


Duplicates before removal: 0
Duplicates after removal: 0
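
The dataset had no duplicate rows, so the counts above are both zero. The sketch below shows how the same calls behave on a hypothetical toy frame (invented values) that contains a repeated row. Note that `duplicated()` flags only the second and later occurrences by default, while `keep=False` flags every copy.

```python
import pandas as pd

# Hypothetical toy data with one repeated row.
toy = pd.DataFrame({"rooms": [3, 3, 4], "price": [200, 200, 250]})

# Default: the first occurrence is not counted as a duplicate.
repeats = int(toy.duplicated().sum())

# keep=False marks every row that has an identical twin.
all_copies = int(toy.duplicated(keep=False).sum())

# Drop the repeats and renumber the index from zero.
deduped = toy.drop_duplicates().reset_index(drop=True)

print(repeats, all_copies, len(deduped))
```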


In [31]:
# Standardize data formats

# If there’s a date column, convert it to datetime format
if 'date' in df.columns:
    df['date'] = pd.to_datetime(df['date'], errors='coerce')
    print("Date column standardized.")

# Clean categorical columns: strip surrounding whitespace and convert to lowercase
for col in categorical_cols:
    df[col] = df[col].str.strip().str.lower()

print("Categorical columns standardized.")


Categorical columns standardized.
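
This dataset has no `date` column, so only the text cleanup ran. As a minimal sketch of what both steps do, the example below applies them to a hypothetical toy frame (invented column names and values). With `errors="coerce"`, values that cannot be parsed as dates become `NaT` instead of raising an error.

```python
import pandas as pd

# Hypothetical toy data; names and values are invented for illustration.
toy = pd.DataFrame({
    "date": ["2021-01-15", "not a date"],
    "city": ["  Austin ", "DALLAS"],
})

# Unparseable dates become NaT rather than stopping the notebook.
toy["date"] = pd.to_datetime(toy["date"], errors="coerce")

# Strip surrounding whitespace, then lowercase.
toy["city"] = toy["city"].str.strip().str.lower()

print(toy)
```

Coercing silently hides bad values, so it is worth counting the resulting `NaT` entries before moving on.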


In [32]:
# Saving the cleaned dataset
df.to_csv('Cleaned_House_Prediction.csv', index=False)
print("Cleaned dataset saved as 'Cleaned_House_Prediction.csv'")


Cleaned dataset saved as 'Cleaned_House_Prediction.csv'
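
A quick way to confirm a save worked is to read the file back and compare it with the frame in memory. The sketch below does this with a hypothetical toy frame and a temporary directory, so it does not touch the real output file. The file name `cleaned_toy.csv` is made up for the example.

```python
import os
import tempfile

import pandas as pd

# Hypothetical toy data standing in for the cleaned dataset.
toy = pd.DataFrame({"rooms": [3, 4], "price": [200.5, 250.0]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "cleaned_toy.csv")
    # index=False keeps the row index out of the file, as in the cell above.
    toy.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# True when values, column names, and dtypes all match.
same = reloaded.equals(toy)
print(same)
```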
