# Day 1: Data Collection
## Nairobi House Prediction Project

**Date:** 17/02/2026  
**Author:** Shadrack Kimaau

## Objectives
1. Identify and document data sources
2. Collect/download raw data
3. Perform initial data inspection
4. Document data characteristics

## 1. Import Libraries

In [1]:
# Data manipulation
import pandas as pd
import numpy as np


# Visualization
import matplotlib.pyplot as plt
import seaborn as sns


# Utilities
import os

## 2. Data Sources

In [4]:
from pathlib import Path
import pandas as pd

# Notebook location
notebook_path = Path().resolve()  # current working directory
print("Notebook folder:", notebook_path)

# Build path to CSV
csv_path = notebook_path.parent / "data" / "raw" / "raw_listings.csv"
print("CSV path:", csv_path)

# Read CSV
data = pd.read_csv(csv_path)
data.head()

Notebook folder: /home/shaddy/Downloads/LT Data Fellowship/Nairobi House Prediction /notebooks
CSV path: /home/shaddy/Downloads/LT Data Fellowship/Nairobi House Prediction /data/raw/raw_listings.csv


Unnamed: 0,Location,Property Type,Bedrooms,Bathrooms,Size,Price (KES),Listing Date,Amenities
0,Nairobi - Lavington,House,5 bedrooms,7.0,,750000,19 February 2026,"Aircon, Alarm, Service Charge Included, Walk I..."
1,Nairobi - Karen Hardy,House,3 bedrooms,3.0,,380000,17 February 2026,"Backup Generator, Alarm, Serviced, Service Cha..."
2,Nairobi - Kitisuru,House,4 bedrooms,4.0,,500000,17 February 2026,"Aircon, Alarm, Backup Generator, En Suite, Fib..."
3,Nairobi - Lavington,House,4 bedrooms,4.0,,350000,20 February 2026,"Alarm, Service Charge Included, Backup Generat..."
4,Nairobi - Runda,House,5 bedrooms,5.0,450 m²,541000,20 February 2026,"Alarm, Backup Generator, En Suite, Fibre Inter..."


## 3. Load Raw Data

In [None]:
# Load data
print(f"Data shape: {data.shape}")
print(f"Number of rows: {data.shape[0]}")
print(f"Number of columns: {data.shape[1]}")
print("\nColumn names:")
print(data.columns.tolist())

Data shape: (502, 8)
Number of rows: 502
Number of columns: 8

Column names:
['Location', 'Property Type', 'Bedrooms', 'Bathrooms', 'Size', 'Price (KES)', 'Listing Date', 'Amenities']


## 4. Initial Data Inspection

In [6]:
# Display first few rows
print("First 10 rows of the dataset:")
data.head(10)

First 10 rows of the dataset:


Unnamed: 0,Location,Property Type,Bedrooms,Bathrooms,Size,Price (KES),Listing Date,Amenities
0,Nairobi - Lavington,House,5 bedrooms,7.0,,750000,19 February 2026,"Aircon, Alarm, Service Charge Included, Walk I..."
1,Nairobi - Karen Hardy,House,3 bedrooms,3.0,,380000,17 February 2026,"Backup Generator, Alarm, Serviced, Service Cha..."
2,Nairobi - Kitisuru,House,4 bedrooms,4.0,,500000,17 February 2026,"Aircon, Alarm, Backup Generator, En Suite, Fib..."
3,Nairobi - Lavington,House,4 bedrooms,4.0,,350000,20 February 2026,"Alarm, Service Charge Included, Backup Generat..."
4,Nairobi - Runda,House,5 bedrooms,5.0,450 m²,541000,20 February 2026,"Alarm, Backup Generator, En Suite, Fibre Inter..."
5,Nairobi - Langata,House,4 bedrooms,4.0,,70000,19 February 2026,"Garden, Gated Community, Parking, Hospital, Sc..."
6,Nairobi - Kyuna,House,4 bedrooms,4.0,,330000,20 February 2026,"Alarm, Backup Generator, En Suite, Fibre Inter..."
7,Nairobi - Kiambu Road,House,5 bedrooms,6.0,453 m²,250000,20 February 2026,"Aircon, Serviced, Alarm, Service Charge Includ..."
8,Nairobi - Lavington,House,5 bedrooms,5.0,,420000,20 February 2026,"Alarm, Backup Generator, En Suite, Fibre Inter..."
9,Nairobi - Runda,House,5 bedrooms,5.0,,710700,20 February 2026,"Alarm, Backup Generator, En Suite, Fibre Inter..."


In [7]:
# Data info
print("Dataset Information:")
print("="*50)
data.info()
print("\n" + "="*50)
print(f"\nMemory usage: {data.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
print(f"\nMissing values per column:")
print(data.isnull().sum())
print(f"\nTotal missing values: {data.isnull().sum().sum()}")
print(f"Percentage of missing values: {(data.isnull().sum().sum() / (data.shape[0] * data.shape[1]) * 100):.2f}%")

Dataset Information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 502 entries, 0 to 501
Data columns (total 8 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Location       502 non-null    object 
 1   Property Type  502 non-null    object 
 2   Bedrooms       502 non-null    object 
 3   Bathrooms      486 non-null    float64
 4   Size           262 non-null    object 
 5   Price (KES)    502 non-null    int64  
 6   Listing Date   502 non-null    object 
 7   Amenities      493 non-null    object 
dtypes: float64(1), int64(1), object(6)
memory usage: 31.5+ KB


Memory usage: 0.27 MB

Missing values per column:
Location           0
Property Type      0
Bedrooms           0
Bathrooms         16
Size             240
Price (KES)        0
Listing Date       0
Amenities          9
dtype: int64

Total missing values: 265
Percentage of missing values: 6.60%


In [8]:
# Basic statistics
print("Descriptive Statistics:")
print("="*50)
data.describe()
print("\n" + "="*50)
print("\nData types distribution:")
print(data.dtypes.value_counts())

Descriptive Statistics:


Data types distribution:
object     6
float64    1
int64      1
Name: count, dtype: int64


## 5. Save Data

In [9]:
# Save to raw data folder (already in raw folder, no need to save again)
# Data is already saved at: data/raw/raw_listings.csv
print(f"Data is already saved at: {csv_path}")
print(f"File exists: {csv_path.exists()}")
print(f"File size: {csv_path.stat().st_size / 1024:.2f} KB")

Data is already saved at: /home/shaddy/Downloads/LT Data Fellowship/Nairobi House Prediction /data/raw/raw_listings.csv
File exists: True
File size: 133.07 KB


## Summary

### Key Findings
- **Dataset Size**: 502 properties with 8 features
- **Features**: Location, Property Type, Bedrooms, Bathrooms, Size, Price (KES), Listing Date, Amenities
- **Data Source**: Successfully loaded from `data/raw/raw_listings.csv`
- **Target Variable**: Price (KES) - continuous variable for regression modeling
- **Key Features for Prediction**:
  - Location (categorical)
  - Property Type (categorical)
  - Bedrooms (numerical)
  - Bathrooms (numerical)
  - Size (numerical)
  - Amenities (text/categorical)

### Next Steps
- **Day 2**: Data cleaning and feature engineering
  - Handle missing values
  - Process categorical variables (Location, Property Type)
  - Parse and encode Amenities
  - Convert data types as needed
  - Feature extraction from Listing Date
- **Day 3**: Exploratory Data Analysis (EDA) and baseline model
- **Day 4**: Model improvement and optimization
- **Day 5**: Application preparation
- **Day 6**: Dashboard preparation