# Project 01: Data Cleaning Demo

## Overview
This notebook demonstrates basic data cleaning using Python and pandas.
It includes:
- loading a dataset
- inspecting structure and missing values
- converting data types
- dropping sparse columns and imputing missing values
- creating new variables

## Importing Libraries
Below I import the Python libraries needed for data cleaning and analysis.

In [8]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

print("Libraries imported.")


Libraries imported.


## Loading the Dataset

In [9]:
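# Adjust this path to point at your local copy of Listings.csv
df = pd.read_csv("C:/Users/steve/documents/portfolio-2025/python/project_01_data_cleaning/Data/Listings.csv", encoding="latin-1")

# Structure, first rows, and summary statistics.
# Only df.info() prints here: a notebook cell displays a bare expression
# only when it is the last line of the cell.
df.info()
df.head()
df.describe(include='all')

# Early type fixes; both columns are converted again in step 2 below
df['host_since'] = pd.to_datetime(df['host_since'], errors='coerce')
df['host_is_superhost'] = df['host_is_superhost'].map({'t': True, 'f': False})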



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 279712 entries, 0 to 279711
Data columns (total 33 columns):
 #   Column                       Non-Null Count   Dtype  
---  ------                       --------------   -----  
 0   listing_id                   279712 non-null  int64  
 1   name                         279537 non-null  object 
 2   host_id                      279712 non-null  int64  
 3   host_since                   279547 non-null  object 
 4   host_location                278872 non-null  object 
 5   host_response_time           150930 non-null  object 
 6   host_response_rate           150930 non-null  float64
 7   host_acceptance_rate         166625 non-null  float64
 8   host_is_superhost            279547 non-null  object 
 9   host_total_listings_count    279547 non-null  float64
 10  host_has_profile_pic         279547 non-null  object 
 11  host_identity_verified       279547 non-null  object 
 12  neighbourhood                279712 non-null  object 
 13 
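The hard-coded absolute path and the `latin-1` encoding are both worth a second look. `latin-1` can decode any byte sequence, so it never raises an error even when the file is actually UTF-8; accented characters then come out garbled, as some listing names in the previews further down do. Below is a minimal sketch of a more portable loader. The helper name `read_listings` and the relative path `Data/Listings.csv` are assumptions for illustration; the helper tries UTF-8 first and falls back to `latin-1` only if decoding fails.

```python
from pathlib import Path

import pandas as pd


def read_listings(path, encodings=("utf-8", "latin-1")):
    """Read a CSV, trying each encoding in order (hypothetical helper)."""
    path = Path(path)
    for encoding in encodings:
        try:
            # Return the frame together with the encoding that worked
            return pd.read_csv(path, encoding=encoding), encoding
        except UnicodeDecodeError:
            # This encoding cannot decode the file; try the next one
            continue
    raise ValueError(f"None of {encodings} could decode {path}")


# Hypothetical usage; the relative path is an assumption about the project layout:
# df, used_encoding = read_listings(Path("Data") / "Listings.csv")
```

Putting `latin-1` last matters: because it accepts every byte, any encoding listed after it would never be tried.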

## Missing Values Exploration

In [10]:
df.isna().sum().sort_values(ascending=False)
(df.isna().mean() * 100).sort_values(ascending=False)

district                       86.767818
host_response_time             46.040928
host_response_rate             46.040928
host_acceptance_rate           40.429799
review_scores_value            32.814109
review_scores_location         32.810534
review_scores_checkin          32.809104
review_scores_accuracy         32.788368
review_scores_communication    32.779073
review_scores_cleanliness      32.771208
review_scores_rating           32.678255
bedrooms                       10.523324
host_location                   0.300309
name                            0.062564
host_total_listings_count       0.058989
host_is_superhost               0.058989
host_since                      0.058989
host_identity_verified          0.058989
host_has_profile_pic            0.058989
listing_id                      0.000000
longitude                       0.000000
host_id                         0.000000
latitude                        0.000000
city                            0.000000
neighbourhood   
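Only the last expression in a cell is displayed, so the raw counts from the first line are not shown above. A small sketch that puts counts and percentages side by side in one table; `missing_summary` is a hypothetical helper name, not part of pandas, and the toy frame stands in for the listings data.

```python
import numpy as np
import pandas as pd


def missing_summary(df):
    """Count and percentage of missing values per column, worst first."""
    summary = pd.DataFrame({
        "n_missing": df.isna().sum(),
        "pct_missing": (df.isna().mean() * 100).round(2),
    })
    return summary.sort_values("n_missing", ascending=False)


# Toy frame standing in for the listings data
toy = pd.DataFrame({
    "district": [np.nan, np.nan, np.nan, "X"],
    "bedrooms": [1.0, np.nan, 2.0, 1.0],
    "city": ["A", "A", "B", "B"],
})
print(missing_summary(toy))
```

Filtering the result, for example `summary[summary["pct_missing"] > 80].index`, gives a list of candidate columns to drop. On the percentages shown above, an 80% cutoff would select only `district`.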

In [11]:
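# 1. DROP SPARSE COLUMNS
# -------------------------------------------------------

# 'district' is about 87% missing, so drop it
# (errors="ignore" keeps the cell safe to re-run)
df = df.drop(columns=["district"], errors="ignore")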

# 2. CONVERT DATA TYPES
# -------------------------------------------------------

# 2.1 Convert host_since from object to datetime
df['host_since'] = pd.to_datetime(df['host_since'], errors='coerce')

# 2.2 Treat host_response_time as a categorical variable
df['host_response_time'] = df['host_response_time'].astype('category')

# 2.3 host_response_rate and host_acceptance_rate are already floats
#     (per df.info()), so they are left unchanged.
#     Any unexpected values can be handled later if needed.

# 2.4 Convert boolean-like columns ("t"/"f", "yes"/"no") to True/False

bool_map = {
    't': True, 'f': False,
    'true': True, 'false': False,
    'yes': True, 'no': False,
    'y': True, 'n': False
}

bool_cols = [
    'host_is_superhost',
    'host_has_profile_pic',
    'host_identity_verified',
    'instant_bookable'
]

for col in bool_cols:
    df[col] = (
        df[col]
        .astype(str)        # convert to string
        .str.strip()        # remove whitespace
        .str.lower()        # normalise
        .map(bool_map)      # map to True/False
    )
    df[col] = df[col].astype('boolean')  # convert to nullable Boolean dtype
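The loop above does one thing silently: any value that is not a key in `bool_map` becomes missing. That is the intended outcome for real gaps (`astype(str)` turns `NaN` into the string `'nan'`, which has no mapping), but an unexpected label such as `'1'` would vanish the same way. A sketch of a helper that reports such labels before they are lost; `to_nullable_bool` is a hypothetical name and the sample values are made up.

```python
import numpy as np
import pandas as pd

BOOL_MAP = {
    't': True, 'f': False,
    'true': True, 'false': False,
    'yes': True, 'no': False,
    'y': True, 'n': False,
}


def to_nullable_bool(series, mapping=BOOL_MAP):
    """Map boolean-like labels to pandas' nullable boolean dtype.

    Returns the converted series and the set of non-missing labels
    that had no entry in the mapping.
    """
    was_missing = series.isna()
    normalised = series.astype(str).str.strip().str.lower()
    mapped = normalised.map(mapping)
    # Labels that were present in the data but could not be mapped
    unmapped = set(normalised[mapped.isna() & ~was_missing])
    return mapped.astype('boolean'), unmapped


raw = pd.Series(['t', ' F ', 'Yes', np.nan, '1'])
converted, unmapped = to_nullable_bool(raw)
print(converted.tolist())
print(unmapped)
```

An empty `unmapped` set confirms that every gap in the converted column was already a gap in the raw data.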


In [15]:
# 3. HANDLE MISSING VALUES
# -------------------------------------------------------

# 3.1 Impute 'bedrooms' with the median of each property_type
#     (a property type with no known bedroom counts stays missing)
if 'bedrooms' in df.columns:
    df['bedrooms'] = (
        df.groupby('property_type')['bedrooms']
          .transform(lambda x: x.fillna(x.median()))
    )

# 3.2 Review scores: create an indicator flag, then fill missing scores with 0
review_cols = [c for c in df.columns if c.startswith('review_scores_')]

# Flag: the listing has an overall rating. This must run before the fill below;
# re-running the cell after the fill sets the flag to 1 for every row.
df['has_reviews'] = df['review_scores_rating'].notna().astype(int)

# Fill missing review scores with 0 (interpreted as "no ratings recorded")
df[review_cols] = df[review_cols].fillna(0)


# 4. FEATURE ENGINEERING
# -------------------------------------------------------

# 4.1 Clean amenities text
df['amenities_clean'] = (
    df['amenities']
    .astype(str)
    .str.replace('{', '', regex=False)
    .str.replace('}', '', regex=False)
    .str.replace('"', '', regex=False)
    .str.lower()
)

# 4.2 Number of amenities
df['amenities_count'] = df['amenities_clean'].str.split(',').apply(len)

# 4.3 Example binary feature: has wifi
df['has_wifi'] = df['amenities_clean'].str.contains('wifi', na=False)


# 5. QUICK SANITY CHECKS AFTER CLEANING
# -------------------------------------------------------

print("Shape after cleaning:", df.shape)

print("\nTop 10 columns by remaining missing values:")
print(df.isna().sum().sort_values(ascending=False).head(10))

print("\nUpdated dtypes (first 15 columns):")
print(df.dtypes.head(15))



Shape after cleaning: (279712, 36)

Top 10 columns by remaining missing values:
host_response_rate           128782
host_response_time           128782
host_acceptance_rate         113087
host_location                   840
name                            175
host_is_superhost               165
host_since                      165
host_total_listings_count       165
host_has_profile_pic            165
host_identity_verified          165
dtype: int64

Updated dtypes (first 15 columns):
listing_id                            int64
name                                 object
host_id                               int64
host_since                   datetime64[ns]
host_location                        object
host_response_time                 category
host_response_rate                  float64
host_acceptance_rate                float64
host_is_superhost                   boolean
host_total_listings_count           float64
host_has_profile_pic                boolean
host_identity_verified     
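The per-group median in step 3.1 cannot fill a property type whose `bedrooms` values are all missing, because that group has no median. A minimal sketch of one way to handle this is to fall back to the overall median. The fallback is an assumption, not something the notebook does, and `impute_by_group_median` is a hypothetical helper.

```python
import numpy as np
import pandas as pd


def impute_by_group_median(df, value_col, group_col):
    """Fill gaps with the group median, then with the overall median."""
    # Median of each row's group; NaN where the whole group is missing
    group_median = df.groupby(group_col)[value_col].transform("median")
    filled = df[value_col].fillna(group_median)
    # Fallback for groups that had no known value at all
    return filled.fillna(df[value_col].median())


toy = pd.DataFrame({
    "property_type": ["a", "a", "a", "c", "c", "b"],
    "bedrooms": [1.0, 3.0, np.nan, 10.0, np.nan, np.nan],
})
toy["bedrooms"] = impute_by_group_median(toy, "bedrooms", "property_type")
print(toy)
```

In the toy frame, group `a` is filled with its own median (2.0), group `c` with 10.0, and group `b`, which has no known value, with the overall median of the observed values (3.0).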

In [16]:
df.head()

| | listing_id | name | host_id | host_since | host_location | host_response_time | host_response_rate | host_acceptance_rate | host_is_superhost | host_total_listings_count | ... | review_scores_cleanliness | review_scores_checkin | review_scores_communication | review_scores_location | review_scores_value | instant_bookable | has_reviews | amenities_clean | amenities_count | has_wifi |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 281420 | Beautiful Flat in le Village Montmartre, Paris | 1466919 | 2011-12-03 | Paris, Ile-de-France, France | NaN | NaN | NaN | False | 1.0 | ... | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | False | 1 | [heating, kitchen, washer, wifi, long term sta... | 5 | True |
| 1 | 3705183 | 39 mÃÂ² Paris (Sacre CÃâur) | 10328771 | 2013-11-29 | Paris, Ile-de-France, France | NaN | NaN | NaN | False | 1.0 | ... | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | False | 1 | [shampoo, heating, kitchen, essentials, washer... | 8 | True |
| 2 | 4082273 | Lovely apartment with Terrace, 60m2 | 19252768 | 2014-07-31 | Paris, Ile-de-France, France | NaN | NaN | NaN | False | 1.0 | ... | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | False | 1 | [heating, tv, kitchen, washer, wifi, long term... | 6 | True |
| 3 | 4797344 | Cosy studio (close to Eiffel tower) | 10668311 | 2013-12-17 | Paris, Ile-de-France, France | NaN | NaN | NaN | False | 1.0 | ... | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | False | 1 | [heating, tv, kitchen, wifi, long term stays a... | 5 | True |
| 4 | 4823489 | Close to Eiffel Tower - Beautiful flat : 2 rooms | 24837558 | 2014-12-14 | Paris, Ile-de-France, France | NaN | NaN | NaN | False | 1.0 | ... | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | False | 1 | [heating, tv, kitchen, essentials, hair dryer,... | 12 | True |
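The preview shows `amenities_clean` values that still start with `[`, which suggests the raw `amenities` strings use square brackets rather than the curly braces stripped in step 4.1. Splitting on commas also counts an empty list as one amenity, because `''.split(',')` returns `['']`. The sketch below handles both cases. It assumes the raw values look like `["Heating", "Wifi"]` (the raw column is not shown in this notebook), and `parse_amenities` is a hypothetical helper.

```python
import pandas as pd


def parse_amenities(raw):
    """Turn one raw amenities string into a list of lowercase names."""
    if pd.isna(raw):
        return []
    # Strip list/set delimiters and quotes, whichever style the file uses
    cleaned = str(raw).strip().strip('[]{}').replace('"', '')
    # Drop empty pieces so an empty list yields zero items
    return [item.strip().lower() for item in cleaned.split(',') if item.strip()]


# Made-up rows covering both bracket styles, an empty list, and a gap
toy = pd.DataFrame({
    'amenities': [
        '["Heating", "Wifi", "Long term stays allowed"]',
        '[]',
        '{TV,"Hair dryer"}',
        None,
    ]
})
toy['amenities_list'] = toy['amenities'].apply(parse_amenities)
toy['amenities_count'] = toy['amenities_list'].apply(len)
toy['has_wifi'] = toy['amenities_list'].apply(lambda items: 'wifi' in items)
print(toy[['amenities_count', 'has_wifi']])
```

The `has_wifi` test here is an exact match on the item name, which is stricter than the substring search in step 4.3: an item such as `pocket wifi` would match the substring version but not this one.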


In [18]:
df.to_csv("C:/Users/steve/documents/portfolio-2025/python/project_01_data_cleaning/Data/listings_cleaned.csv", index=False)

In [20]:
df.sample(5)

| | listing_id | name | host_id | host_since | host_location | host_response_time | host_response_rate | host_acceptance_rate | host_is_superhost | host_total_listings_count | ... | review_scores_cleanliness | review_scores_checkin | review_scores_communication | review_scores_location | review_scores_value | instant_bookable | has_reviews | amenities_clean | amenities_count | has_wifi |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 48457 | 32976289 | Studio Deluxe - Tour Eiffel - 5 | 113055646 | 2017-01-24 | Paris, Ile-de-France, France | within an hour | 0.98 | 0.99 | True | 32.0 | ... | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | True | 1 | [essentials, stove, hot water, hangers, smoke ... | 26 | True |
| 215224 | 24707026 | Wave Home Cat Sanctuary | 186731647 | 2018-04-28 | TH | a few days or more | 0.0 | NaN | False | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | True | 1 | [air conditioning, shampoo, wifi, free parking... | 14 | True |
| 228820 | 36716546 | Cozy and homely flat, 15mins from Montmartre | 12311572 | 2014-02-17 | Rouen, Upper Normandy, France | NaN | NaN | 1.0 | False | 0.0 | ... | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | False | 1 | [essentials, stove, hot water, hangers, smoke ... | 24 | True |
| 115304 | 34756762 | [Ã¥Â¥Â³Girls]3Ã¥Ëâ Ã©ÂËÃ¥ÅÂ°Ã©ÂÂµNorth P... | 262108591 | 2019-05-15 | HK | within an hour | 1.0 | 0.92 | False | 7.0 | ... | 8.0 | 10.0 | 10.0 | 10.0 | 10.0 | False | 1 | [shampoo, lock on bedroom door, long term stay... | 10 | True |
| 156858 | 15303381 | FIVE STAR HOSTEL | 97456844 | 2016-10-01 | Bangkok, Thailand | NaN | NaN | NaN | False | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | True | 1 | [cable tv, air conditioning, dryer, shampoo, e... | 13 | False |
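Two of the sampled rows show every visible review score as 0.0 while `has_reviews` is 1. One possible explanation (an assumption; the notebook does not confirm it) is that the cleaning cell was run again after the fill. At that point `review_scores_rating` has no missing values left, so the flag becomes 1 for every row. Below is a sketch of a version that is safe to re-run. `flag_and_fill_reviews` is a hypothetical helper that computes the flag only once.

```python
import numpy as np
import pandas as pd


def flag_and_fill_reviews(df, prefix='review_scores_', flag_col='has_reviews'):
    """Flag rows with any review score, then fill missing scores with 0."""
    out = df.copy()
    review_cols = [c for c in out.columns if c.startswith(prefix)]
    if flag_col not in out.columns:
        # Compute the flag only while the original gaps are still present
        out[flag_col] = out[review_cols].notna().any(axis=1).astype(int)
    out[review_cols] = out[review_cols].fillna(0)
    return out


toy = pd.DataFrame({
    'review_scores_rating': [95.0, np.nan, np.nan],
    'review_scores_value': [10.0, np.nan, 9.0],
})
once = flag_and_fill_reviews(toy)
twice = flag_and_fill_reviews(once)  # a second run leaves the flag untouched
print(twice)
```

Unlike the notebook, which checks only `review_scores_rating`, this sketch flags a row when any review score is present. That matches the comment in step 3.2 ("has at least one review score") but can give different results for rows that have sub-scores without an overall rating.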


In [22]:
output_path = "C:/Users/steve/documents/portfolio-2025/python/project_01_data_cleaning/Data/listings_cleaned.csv"
df.to_csv(output_path, index=False)
print("Saved cleaned data to:", output_path)

Saved cleaned data to: C:/Users/steve/documents/portfolio-2025/python/project_01_data_cleaning/Data/listings_cleaned.csv
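CSV stores plain text, so the `datetime64`, `category`, and `boolean` dtypes set during cleaning are not recorded in the file; reading it back with default settings makes pandas infer the types again. A sketch of restoring them explicitly on load, shown on a small in-memory stand-in so it runs without the real file (the sample values are illustrative).

```python
import io

import pandas as pd

# Small stand-in for the cleaned frame
cleaned = pd.DataFrame({
    'host_since': pd.to_datetime(['2011-12-03', None]),
    'host_response_time': pd.Series(['within an hour', None], dtype='category'),
    'host_is_superhost': pd.Series([True, None], dtype='boolean'),
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
cleaned.to_csv(buffer, index=False)
buffer.seek(0)

# Restore the dtypes explicitly when reading back
restored = pd.read_csv(
    buffer,
    parse_dates=['host_since'],
    dtype={'host_response_time': 'category', 'host_is_superhost': 'boolean'},
)
print(restored.dtypes)
```

The same `parse_dates` and `dtype` arguments apply when reading `listings_cleaned.csv` from disk. A binary format such as Parquet keeps dtypes without this step, at the cost of an extra dependency.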
